[Helma-user] file read and processing pretty slow. ideas?
Joshua Paine
joshua at papercrown.org
Thu Jul 12 22:49:06 CEST 2007
I'm coming to Helma from PHP. I solved the kata at
<http://codekata.pragprog.com/2007/01/kata_six_anagra.html> in PHP and
in JavaScript. I tremendously prefer the JavaScript code (that's why I'm
coming to Helma), but it was pretty slow compared to PHP. One source of
slowness was certainly IO, especially reading.
Here's my anagrams.hac code:
var f, w, key, hash, start, a;
start = (new Date()).getTime();
(f = new helma.File('c:/temp/wordlist.txt')).open();
hash = {};
while(w = f.readln()) {
if(!hash[(key = w.toLowerCase().split('').filter(function(c){ return c>='0'; }).sort().join())]) hash[key] = [w];
else hash[key].push(w);
}
f.close();
res.contentType='text/plain';
for(a in hash) if(hash[a].length>1) res.writeln(hash[a].join(' '));
writeln('found anagrams in '+(((new Date()).getTime() - start)/1000).toFixed(3)+' seconds');
This takes about 3x as long to run as the equivalent (though longer and
uglier) PHP on my system. The two are not quite equivalent, though: in PHP
the quickest way to handle a text file line by line is to read
the whole file into a string in one go with file_get_contents and then
explode it on "\n". I tried a similar technique in Helma using
helma.File.readAll().split("\n"), but that proved to be slower than
using helma.File.readln.
Is there a faster way to read and go through a large file? A faster way
to do the rest?
Briefly, it works by stripping non-alphanumeric characters from each word,
lowercasing, and sorting the rest of the characters. The sorted,
stripped, lowercase word becomes a key in an object (being used merely
as a hash table) which stores an array of words that reduce to that key,
e.g.:
{
'bdu':['bud','dub'],
'agnt':['gnat','tang'],
...
}
It then loops through the keys and prints out the ones that have more than
one entry.
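
For anyone who wants to try the grouping step outside Helma, here is a
minimal sketch of the approach described above in plain JavaScript. The
function names (anagramKey, groupAnagrams, anagramLines) are made up for
illustration and are not part of Helma. It assumes the words are already in
an array rather than read from a file. It also uses a regular expression to
keep only ASCII letters and digits, which is stricter than the c>='0' test
in the code above, and it joins the key with an empty string.

```javascript
// Hypothetical helpers for illustration; plain JavaScript, no Helma APIs.

// Reduce a word to its key: lowercase it, keep only ASCII letters and
// digits, sort the remaining characters, and join them back together.
function anagramKey(word) {
  return word.toLowerCase().split('')
    .filter(function (c) { return /[a-z0-9]/.test(c); })
    .sort()
    .join('');
}

// Build the hash table: key -> array of words that reduce to that key.
function groupAnagrams(words) {
  var groups = {};
  words.forEach(function (w) {
    var key = anagramKey(w);
    if (!groups[key]) groups[key] = [w];
    else groups[key].push(w);
  });
  return groups;
}

// Return one output line per key that has more than one word.
function anagramLines(words) {
  var groups = groupAnagrams(words);
  var lines = [];
  for (var key in groups) {
    if (groups[key].length > 1) lines.push(groups[key].join(' '));
  }
  return lines;
}
```

Calling anagramLines(['bud', 'dub', 'gnat', 'tang', 'solo']) gives one line
for each of the two groups and leaves out 'solo', which has no anagram in
the list.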