sed / perl regex extremely slow
So, I've got a file called cracked.txt, which contains 80 million+ lines like this:
dafaa15bec90fba537638998a5fa5085:_BD:zzzzzz12
a8c2e774d406b319e33aca8b38540063:2JB:zzzzzz999
d6d24dfcef852729d10391f186da5b08:WNb:zzzzzzzss
2f1c72ccc940828b5daf4ab98e0f8731:@]9:zzzzzzzz
3b7633b6c19d79e5ab76bdb9cce4fd42:#A9:zzzzzzzz
a3dc9c03ff845776b485fa8337c9625a:yQ,:zzzzzzzz
ade1d43b29674814a16e96098365f956:FZ-:zzzzzzzz
ba93090dfa64d964889f521788aca889:/.g:zzzzzzzz
c3bd6861732affa3a437df46a6295810:m}Z:zzzzzzzz
b31d9f86c28bd1245819817e353ceeb1:>)L:zzzzzzzzzzzz
and my output.txt contains 80 million lines like this:
('chen123','45a36afe044ff58c09dc3cd2ee287164','','','','f+P',''),
('chen1234','45a36afe044ff58c09dc3cd2ee287164','','','','f+P',''),
('chen125','45a36afe044ff58c09dc3cd2ee287164','','','','f+P',''),
(45a36afe044ff58c09dc3cd2ee287164 and f+P change every line)
What I've done is create a simple bash script to match cracked.txt against output.txt and join them:
cat './cracked.txt' | while read LINE; do
pwd=$(echo "${LINE}" | awk -F ":" '{print $NF}' | sed -e 's/\x27/\\\\\\\x27/g' -e 's/\//\\\x2f/g' -e 's/\x22/\\\\\\\x22/g' )
hash=$(echo "${LINE}" | awk -F ":" '{print $1}')
lines=$((lines+1))
echo "${lines} ${pwd}"
perl -p -i -e "s/${hash}/${hash} ( ${pwd} ) /g" output.txt
#sed -u -i "s/${hash}/${hash} ( ${pwd} ) /g" output.txt
done
As you can see by the comment, I've tried sed, and perl. perl seems to be a tad faster than sed, but I'm still only getting something like one line per second.
I've never used perl, so I've got no idea how to use it to my advantage (multithreading, etc.).
What would be the best way to speed up this process?
Thanks
edit: I got a suggestion that it would be better to use something like this:
while IFS=: read hash seed pwd; do
...
done < cracked.txt
But because a : could also appear in between the first and last occurrence of : (the parts I extract with awk '{print $1}' and awk '{print $NF}'), it would corrupt the data. I could use it just for the "hash", but not for the "pwd". Edit again: the above wouldn't work anyway, because I would have to name all the other fields, which of course would be a problem.
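For comparison, the whole join can be done in a single pass over each file by keeping a lookup table keyed on the hash. A minimal Python sketch (illustrative function names, not from the thread; it assumes the hash is the only 32-character hex string on an output line, as in the samples above):

```python
import re

HEX32 = re.compile(r"[0-9a-f]{32}")  # an MD5-style hash, as in the samples above

def build_lookup(cracked_lines):
    """Map hash -> password from cracked.txt-style lines (hash:seed:password)."""
    lookup = {}
    for line in cracked_lines:
        # maxsplit=2 keeps any further ':' characters inside the password
        hash_, _seed, pwd = line.rstrip("\n").split(":", 2)
        lookup[hash_] = pwd
    return lookup

def annotate(output_lines, lookup):
    """Yield output.txt-style lines with known hashes rewritten to 'hash ( pwd )'."""
    for line in output_lines:
        m = HEX32.search(line)
        if m and m.group(0) in lookup:
            h = m.group(0)
            line = line.replace(h, "%s ( %s )" % (h, lookup[h]), 1)
        yield line
```

With this shape, each of the 80 million output lines is touched exactly once instead of once per cracked hash.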
Solution 1:[1]
The problem with bash scripting is that, while very flexible and powerful, it creates new processes for nearly anything, and forking is expensive. In each iteration of the loop, you spawn 3×echo, 2×awk, 1×sed and 1×perl. Restricting yourself to one process (and thus, one programming language) will boost performance.
Then, you are re-reading output.txt each time in the call to perl. IO is always slow, so buffering the file would be more efficient, if you have the memory.
Multithreading would work if there were no hash collisions, but is difficult to program. Simply translating to Perl will get you a greater performance increase than transforming Perl to multithreaded Perl.[citation needed]
You would probably write something like
#!/usr/bin/perl
use strict; use warnings;
open my $cracked, "<", "cracked.txt" or die "Can't open cracked";
my @data = do {
open my $output, "<", "output.txt" or die "Can't open output";
<$output>;
};
while (<$cracked>) {
chomp; # strip the newline so it doesn't end up inside $pwd and the substitution
my ($hash, $seed, $pwd) = split /:/, $_, 3;
# transform $hash here like "$hash =~ s/foo/bar/g" if really necessary
# say which line we are at
print "at line $. with pwd=$pwd\n";
# do substitutions in @data
s/\Q$hash\E/$hash ( $pwd )/ for @data;
# the \Q...\E makes any characters in between non-special,
# so they are matched literally.
# (`C++` would match many `C`s, but `\QC++\E` matches the character sequence)
}
# write @data back to the output file
open my $result, ">", "output.txt" or die "Can't write output";
print $result @data;
(not tested or anything, no guarantees)
While this would still be an O(n²) solution, it would perform better than the bash script. Do note that it can be reduced to O(n) by organizing @data into a hash (Perl's built-in hash table), indexed by hash codes:
my %data = map {do magic here to parse the lines, and return a key-value pair} @data;
...;
$data{$hash} =~ s/\Q$hash\E/$hash ( $pwd )/; # instead of evil for-loop
In reality, you would store a reference to an array containing all lines that contain a given hash code, so the previous lines would rather be
my %data;
for my $line (@data) {
my $key = parse_line($line);
push @{ $data{$key} }, $line;
}
...;
s/\Q$hash\E/$hash ( $pwd )/ for @{$data{$hash}}; # is still faster!
On the other hand, a hash with 8e7 elements might not perform all that well. The answer lies in benchmarking.
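To make the complexity difference concrete, here is a Python sketch (made-up helper names, not a drop-in replacement): the quadratic version rescans every line for every cracked hash, the way the per-hash perl/sed calls do, while the indexed version does one dictionary probe per line:

```python
def linear_annotate(lines, pairs):
    # O(lines * hashes): rescan every line for every (hash, pwd) pair
    out = list(lines)
    for h, p in pairs:
        out = [l.replace(h, "%s ( %s )" % (h, p)) for l in out]
    return out

def indexed_annotate(lines, lookup):
    # O(lines): extract the hash field once, probe the dict once
    out = []
    for l in lines:
        fields = l.split("','")          # hash is the 2nd quoted field
        h = fields[1] if len(fields) > 1 else ""
        if h in lookup:
            l = l.replace(h, "%s ( %s )" % (h, lookup[h]))
        out.append(l)
    return out
```

Both produce the same result; only the second stays usable at 8e7 lines.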
Solution 2:[2]
When parsing logs at work, I do this: split the file into N parts (N = number of processors), aligning the split points to \n boundaries, then start N threads, one per part. It works really fast, but the hard drive is the bottleneck.
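The split-at-newline step can be sketched like this (Python, hypothetical function name): each chunk starts at byte 0 or just after a \n, so no worker ever sees a torn line.

```python
import os

def newline_aligned_chunks(path, nparts):
    """Return (start, end) byte ranges tiling `path` into up to `nparts`
    chunks, each beginning at byte 0 or right after a newline."""
    size = os.path.getsize(path)
    points = [0]
    with open(path, "rb") as f:
        for i in range(1, nparts):
            f.seek(i * size // nparts)   # jump to the nominal split point
            f.readline()                 # then advance past the next newline
            points.append(f.tell())
    points.append(size)
    # dedupe/reorder handles degenerate cases (tiny files, very long lines)
    points = sorted(set(points))
    return list(zip(points, points[1:]))
```

Each worker then seeks to its start offset and processes bytes up to its end offset.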
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Galimov Albert |
