This post revisits the tests from my previous post, measuring how long various languages took to process a file.
After various people contributed optimisations and suggestions, I have some new results. As before, the tests themselves are available at http://github.com/TJC/PerfTesting
I also have a user-submitted C implementation of the test, which performed very well, but possibly doesn't parse CSV quite as rigorously as the others.
The results are:
Small file:
Perl: 0.744 seconds
Scala: 0.842 seconds
Go: 1.55 seconds
C: 0.083 seconds
Medium file:
Perl: 7.12 seconds
Scala: 3.28 seconds
Go: 15.1 seconds
C: 0.780 seconds
Big file:
Perl: 71.2 seconds
Scala: 23.9 seconds
Go: 153 seconds
C: 7.83 seconds
Note the memory sizes:
Scala: 114 MB
Perl: 6 MB
Go: 2 MB
C: 0.5 MB
I find it interesting that the Scala code was originally taking 115 seconds for the largest file. While I am not a Scala expert, the code was still reasonably straightforward and not actually *wrong*. However, after restructuring the code quite a bit and switching to a different numeric formatting and output engine, the performance more than quadrupled.
I will welcome any patches for the Go version - I'm sure it must be possible to make it go much faster!
Big file:
Go: 153 seconds
????????????
I know! I don't understand why it's the slowest of the lot -- if someone who knows more Go wants to have a shot at fixing it, please be my guest!
I thought Go would be the second-fastest, after C.
The problem is not Perl. Do it again using the same algorithm as the C version.
Go is a very slow language for tasks like this. I posted a suite of benchmarks to the Go mailing list - a small set of microbenchmarks I use in different languages to understand the strengths and weaknesses - and Go was 4x slower than Python for a variety of simple tasks. Go is not fast, it simply claims to be fast.
When I commented out the "print stuff out" part of the Perl program, I found that accounted for about half the runtime when I tested it on a medium-sized file.
This sort of comparison is how FUD gets circulated!
The Scala version is not a test of Scala but rather of the Java CSV library. If you really want to do Scala/Java justice, you should code whatever algorithm the C version uses in Scala, using the low-level IO streams. The same goes for the other languages as well.
The C program appears to take advantage of the fact that the CSV is ASCII-only. In the "real world" that might be the case, and then ASCII-only libraries are nice and speedy. But in most Internet applications, ASCII-only is a poor assumption. I'd be curious how your C benchmark behaves using UTF-8 libraries, as the Scala, Go, and Perl versions do.
@walterc > the scala version is not a test of scala but rather the java csv library
Time the code and you'll find out that for Scala it's more a test of formatting and printing doubles ;-)
@Toby > However by changing the code around quite a bit, and using a different numeric formatting and output engine, the performance more than quadrupled.
All the performance improvement came from changing the one line that consumed 80% of the time taken by your Scala program - the 10 million printf calls.
Wrapping System.out in a BufferedOutputStream was a completely ordinary thing that everybody would do.
And when there are 10 million calls, being specific is likely to be faster, so - DecimalFormat("0.00")
(Incidentally - you don't need that return statement.)
When are you going to show that the Scala "Small file" times are twice as fast using JVM -client?
env JAVA_OPTS=-client scala -classpath opencsv-2.2.jar:. PerfTest ../input.csv > /dev/null
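The change described in the comment above (buffer the output, and format with a reusable DecimalFormat instead of calling printf per row) can be sketched roughly as follows. This is only an illustration, not the code from the repository: the object name, the `writeRows` helper and the sample rows are made up, the rows stand in for what `CSVReader.readNext()` would return, and `Locale.ROOT` is my own addition so the decimal separator does not depend on the machine's locale.

```scala
import java.io.{BufferedOutputStream, OutputStream, PrintStream}
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

object BufferedFormatSketch {
  // Hypothetical helper: each row is (name, factor, factor), as in the CSV test.
  def writeRows(rows: Iterator[Array[String]], sink: OutputStream): Unit = {
    // Buffer the sink so that each row does not trigger its own small write.
    val out = new PrintStream(new BufferedOutputStream(sink, 1 << 16), false)
    // One formatter reused for every row; no format string is parsed per call.
    val fmt = new DecimalFormat("0.00", DecimalFormatSymbols.getInstance(Locale.ROOT))
    for (columns <- rows) {
      val result = columns(1).toDouble * columns(2).toDouble
      out.print(columns(0))
      out.print(" is ")
      out.print(fmt.format(result))
      out.print('\n')
    }
    out.flush() // buffered output is lost unless it is flushed before exit
  }

  def main(args: Array[String]): Unit = {
    val sample = Iterator(Array("apple", "1.5", "2"), Array("pear", "0.333", "3"))
    writeRows(sample, System.out)
  }
}
```

Whether the buffering or the formatter accounts for more of the gain would need to be timed separately; the sketch only shows the shape of the two changes together.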
Not only are you an arrogant, annoying troll, Isaac, but you're also wrong. Twice.
The -server version runs faster on my system than the client one, for small files.
A large part of the performance increase came from your patch that refactored the program to use recursion instead of a loop. (And also doubled the memory usage)
@Toby > The -server version runs faster on my system than the client one, for small files.
Do you think there even is a -client JVM for x64 Ubuntu or do you think you saw ordinary variation between successive elapsed time measurements?
@Toby > A large part of the performance increase came from your patch that refactored the program to use recursion instead of a loop.
You are mistaken.
We can all easily check the recursive version of your unbuffered printf code -
    def csvparser(filename: String) {
      val reader = new CSVReader(new FileReader(filename))
      reader.readNext() // skip header line
      convertDataRows(reader)
    }

    def convertDataRows(csv: CSVReader) {
      csv.readNext() match {
        case null =>
          return
        case columns =>
          val name = columns(0)
          val result = columns(1).toDouble * columns(2).toDouble
          printf("%s is %.02f\n", name, result)
          convertDataRows(csv)
      }
    }
while loop printf
Routine took: 91926.0 msecs
Routine took: 93780.0 msecs
Routine took: 93498.0 msecs
Routine took: 92328.0 msecs
Routine took: 91101.0 msecs
Routine took: 93374.0 msecs
recursive printf
Routine took: 94566.0 msecs
Routine took: 95907.0 msecs
Routine took: 90587.0 msecs
Routine took: 93347.0 msecs
Routine took: 94912.0 msecs
Routine took: 94919.0 msecs
while loop buffered decimal format
Routine took: 29886.0 msecs
Routine took: 29220.0 msecs
Routine took: 30553.0 msecs
Routine took: 31508.0 msecs
Routine took: 31225.0 msecs
Routine took: 29229.0 msecs
recursive buffered decimal format
Routine took: 32865.0 msecs
Routine took: 31867.0 msecs
Routine took: 29493.0 msecs
Routine took: 31422.0 msecs
Routine took: 29829.0 msecs
Routine took: 30002.0 msecs
So, lesson learned? Use C.
"So, lesson learned? Use C."
No. The C version does not count, because it does not do the same. It breaks on UTF-8 files.
For me, the C version worked with UTF-8 as well.
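One possible reason a byte-oriented parser can cope with UTF-8 input: every byte of a multi-byte UTF-8 sequence has its high bit set, so the ASCII comma (0x2C) never appears inside one. The sketch below shows that idea in Scala. It assumes a parser that only splits on comma bytes, which may not be what the C version actually does, and it ignores quoted fields entirely. The object and method names are made up for illustration.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object Utf8SplitSketch {
  // Split one line of raw bytes on the ASCII comma without decoding it first.
  // Quoted fields and embedded commas are deliberately not handled here.
  def splitOnCommaBytes(line: Array[Byte]): Vector[String] = {
    val fields = Vector.newBuilder[String]
    var start = 0
    for (i <- 0 to line.length) {
      // The end of the line closes the final field, just as a comma would.
      if (i == line.length || line(i) == ','.toByte) {
        fields += new String(line, start, i - start, UTF_8)
        start = i + 1
      }
    }
    fields.result()
  }

  def main(args: Array[String]): Unit = {
    // "Zo\u00eb" contains a two-byte UTF-8 character.
    val line = "Zo\u00eb,1.5,2".getBytes(UTF_8)
    println(splitOnCommaBytes(line).mkString(" | "))
  }
}
```

This does not settle whether the C test is a fair comparison. A parser that also had to validate the encoding or honour quoting rules would be doing more work per byte.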