#topn in Golang

A simple topn implementation in Golang.

#Assumptions

  • File has consistent line endings and is formatted per the problem description
  • RAM is a constraint
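
Given those assumptions, the core of the program looks roughly like the sketch below. This is a hand-wave rather than the actual source: the -file and -top flags match the examples further down, but every other name (and both defaults) is made up.

package main

import (
	"bufio"
	"flag"
	"fmt"
	"log"
	"os"
	"sort"
	"strconv"
)

func main() {
	file := flag.String("file", "small", "file of newline-delimited integers")
	top := flag.Int("top", 3, "how many of the largest numbers to print")
	flag.Parse()

	f, err := os.Open(*file)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// largestNumbers holds the top candidates seen so far, sorted
	// ascending, so largestNumbers[0] is always the smallest of them.
	var largestNumbers []int

	// Read one line at a time so memory stays flat no matter how
	// large the file is.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		currentLine, err := strconv.Atoi(scanner.Text())
		if err != nil {
			log.Fatal(err)
		}
		if len(largestNumbers) < *top {
			largestNumbers = append(largestNumbers, currentLine)
			sort.Ints(largestNumbers)
		} else if currentLine > largestNumbers[0] {
			largestNumbers[0] = currentLine
			sort.Ints(largestNumbers)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	// Print the results largest-first.
	for i := len(largestNumbers) - 1; i >= 0; i-- {
		fmt.Println(largestNumbers[i])
	}
}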

#Benchmarks + Validations

#Test System Stats

          ::::.    ':::::     ::::'           jesse@nostromo
          ':::::    ':::::.  ::::'            --------------
            :::::     '::::.:::::             OS: NixOS 17.09.3222.21693048d72 (Hummingbird) x86_64
      .......:::::..... ::::::::              Host: 7470AA4 ThinkPad X200s
     ::::::::::::::::::. ::::::    ::::.      Kernel: 4.9.86
    ::::::::::::::::::::: :::::.  .::::'      Uptime: 1 day, 9 hours, 13 mins
           .....           ::::' :::::'       Packages: 2104
          :::::            '::' :::::'        Shell: zsh 5.4.2
 ........:::::               ' :::::::::::.   Resolution: 1280x800
:::::::::::::                 :::::::::::::   DE: none+i3
 ::::::::::: ..              :::::            WM: i3
     .::::: .:::            :::::             Theme: Arc-Darker [GTK2/3]
    .:::::  :::::          '''''    .....     Icons: Arc [GTK2/3]
    :::::   ':::::.  ......:::::::::::::'     Terminal: st
     :::     ::::::. ':::::::::::::::::'      CPU: Intel Core 2 Duo L9400 (2) @ 1.867GHz
            .:::::::: '::::::::::             Memory: 2480MiB / 7732MiB
           .::::''::::.     '::::.
          .::::'   ::::.     '::::.
         .::::      ::::      '::::.

#Small File - 145 bytes

First, we'll test this sorting algorithm against a tiny little text file.

λ time ./topn -file small -top 3
123451
111111
65643
./topn -file small -top 3  0.00s user 0.00s system 85% cpu 0.002 total


λ sort -nr files/small | head -n3
123451
111111
65643

Fast, seems to work as expected. Let's make the file only contain negative numbers.

λ time ./topn -file small -top 3
-1
-2
-3
./topn -file small -top 3  0.00s user 0.00s system 76% cpu 0.005 total

No problem with negative numbers. Let's try expanding our selection, and ensure that nothing is out of the ordinary.

λ for i in {1..20}; do diff <(./topn -file small -top $i) <(sort -nr files/small | head -n $i); done
λ

What happens if we set -top to a higher value than the number of lines in a file?

λ ./topn -top 90
12345
123
λ

Looks like it returns a sorted list of the numbers that do exist in our file, but doesn't complain that there aren't enough results to fill our request. A warning for this case could be a future feature.
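
If we wanted that warning, the check would be cheap. Something along these lines at the end of the scan would do; linesRead is a hypothetical counter bumped once per line, and top/file are the parsed flags:

// Warn when the file couldn't fill the request.
if linesRead < *top {
	fmt.Fprintf(os.Stderr, "warning: asked for the top %d numbers, but %s only has %d lines\n",
		*top, *file, linesRead)
}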

#Medium File - 1885145 bytes (1.8M)

λ time ./topn -file medium -top 3
123451
123451
123451
./topn -file medium -top 3  0.07s user 0.01s system 100% cpu 0.071 total

λ time sort -nr files/medium | head -n 3
123451
123451
123451
sort -nr files/medium  0.51s user 0.03s system 175% cpu 0.302 total
head -n 3  0.00s user 0.00s system 1% cpu 0.298 total

According to the above, we seem to be faster than sort! This is expected, since sort has to order the entire file while we only ever track the top few numbers, but it's good to know that we're on track. The last time I attempted this challenge (in Ruby) I just couldn't outrace sort.

#Large File - 1562251264 bytes (1.5 GB)

λ time ./topn -file large -top 3
123451
123451
123451
./topn -file large -top 3  49.58s user 1.39s system 100% cpu 50.806 total

This is where things can get interesting. I know that my code isn't fully optimized, so I'm interested in seeing how cranking up the returned results impacts our time to completion.

λ time ./topn -file large -top 56
./topn -file large -top 56  50.11s user 1.44s system 99% cpu 51.588 total

Roughly 800ms of extra wall time (50.8s → 51.6s) for 53 additional results. Let's take it just a little tiny bit further.

λ time ./topn -file large -top 56032
./topn -file large -top 56032  606.22s user 2.58s system 98% cpu 10:17.65 total

As expected, the number of requested results has a dramatic impact on the time the program takes to complete: every insertion re-sorts the whole slice of candidates, so the per-line cost grows with -top.

#Runtime/Space Complexity

I am not a computer scientist, and do not feel confident going into the details of time/space complexity, but I do have the following insights:

  • The program is CPU-bound on my laptop
  • The program uses a negligible amount of memory, since it processes one line/comparison at a time; this allows files of arbitrary size (200GB+) to be processed
  • Sorting could be done faster if I spent the time optimizing this code (see below)

I put together the following chart; perhaps it's insightful:

clicky

#Room for Improvement

#Sorting
// largestNumbers holds the current top-n candidates, sorted ascending,
// so largestNumbers[0] is the smallest number worth keeping.
if currentLine > largestNumbers[0] {
	// Evict the smallest candidate, then re-sort the entire slice.
	largestNumbers[0] = currentLine
	// TODO: rework with sort-on-insertion
	sort.Ints(largestNumbers)
}

The above sorting method is unoptimized: since we re-sort the entire slice of collected numbers on every insertion, we spend an unreasonable amount of CPU time doing so, and the waste grows with the size of the slice. A better method would be an insertion sort: leave the already-sorted data in place and seek from the right to find the spot where the new number belongs. I would expect this to improve performance, but not dramatically.
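
A sketch of that idea follows; insertSorted is a hypothetical replacement for the block above, and it assumes largestNumbers is already full and sorted ascending:

// insertSorted replaces the smallest candidate with currentLine and
// bubbles it rightward until the slice is ordered again. Worst case
// this is O(n), but typically far cheaper than a full sort.Ints.
func insertSorted(largestNumbers []int, currentLine int) {
	largestNumbers[0] = currentLine
	for i := 1; i < len(largestNumbers) && largestNumbers[i] < largestNumbers[i-1]; i++ {
		largestNumbers[i-1], largestNumbers[i] = largestNumbers[i], largestNumbers[i-1]
	}
}

Going further, a min-heap (Go's container/heap) would make each replacement O(log n), which is probably what the -top 56032 run above is crying out for.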

#Parallelization

It might be possible to introduce some level of parallel processing to improve performance. We could, for instance, read the file on one thread starting from the top, and another starting from the middle. I am not sure about the specifics, so I'm not positive this would help: first we'd have to figure out where the middle of the file is, and our primary bottleneck (the sorting mechanism) would still be single-threaded.

Each half would then build its own slice of candidates, and we'd merge the two together at the end. Again, I'm skeptical about the performance gains here, but a rough sketch follows.
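
Here is what that could look like, purely as a sketch: scanChunk and parallelTopN are hypothetical names, not part of topn, and the only assumption is a file of newline-delimited integers. The fiddly part is the chunk boundary, where a line can straddle the split.

import (
	"bufio"
	"io"
	"os"
	"sort"
	"strconv"
	"strings"
)

// scanChunk processes every line whose first byte falls in [start, end)
// and returns that chunk's top n numbers, sorted ascending.
func scanChunk(path string, start, end int64, n int) ([]int, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	pos := start
	if start > 0 {
		// Back up one byte and discard through the next newline: any
		// line straddling the boundary belongs to the previous chunk.
		pos = start - 1
		if _, err := f.Seek(pos, io.SeekStart); err != nil {
			return nil, err
		}
	}
	br := bufio.NewReader(f)
	if start > 0 {
		junk, err := br.ReadString('\n')
		pos += int64(len(junk))
		if err != nil {
			return nil, nil // no complete lines in this chunk
		}
	}

	var top []int
	for pos < end {
		line, readErr := br.ReadString('\n')
		pos += int64(len(line))
		if s := strings.TrimSpace(line); s != "" {
			v, err := strconv.Atoi(s)
			if err != nil {
				return nil, err
			}
			if len(top) < n {
				top = append(top, v)
				sort.Ints(top)
			} else if v > top[0] {
				top[0] = v
				sort.Ints(top)
			}
		}
		if readErr != nil {
			break // io.EOF
		}
	}
	return top, nil
}

// parallelTopN scans each half of the file in its own goroutine, then
// merges the two partial results and keeps the n largest overall.
func parallelTopN(path string, n int) ([]int, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return nil, err
	}
	mid := fi.Size() / 2

	type result struct {
		top []int
		err error
	}
	ch := make(chan result, 2)
	for _, b := range [][2]int64{{0, mid}, {mid, fi.Size()}} {
		go func(start, end int64) {
			top, err := scanChunk(path, start, end, n)
			ch <- result{top, err}
		}(b[0], b[1])
	}

	var merged []int
	for i := 0; i < 2; i++ {
		r := <-ch
		if r.err != nil {
			return nil, r.err
		}
		merged = append(merged, r.top...)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(merged)))
	if len(merged) > n {
		merged = merged[:n]
	}
	return merged, nil
}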

On a hard disk, this type of parallel read might actually slow things down: the disk head would have to hop back and forth between the two halves of the file, possibly the worst thing we could do for IO performance. On a solid state drive the seeking matters far less, so the performance gains may be more noticeable there.

Of course the only way to know for certain is to test this, and I'm not experienced enough with parallel processing to take on that task in under a day. :)