Quotation Detection |
July 20th, 2012 |
algorithms, tech |
for each comment A: for each later comment B: for each word X in A: for each word Y in B: do A and B match for N words starting at X and Y?This is psuedocode for an
O(n^2*m^2)
solution. Not so
good. I coded it up in javascript and while I think it's fast enough
in Chrome, Firefox, and even my phone, it took IE8 nearly a
minute. While I could probably run this on my server and do fine
with some combination of caching and C, it seems inefficient. Is
there a better algorithm? What is the general version of this problem
called?
Update 2012-07-21: Several commenters suggested a better
algorithm: to find all quotes of length N, build a dictionary from all
sequences of N words to a list of comments in which they appeared.
This is O(n*m)
and running it it's much faster. I tested it
in IE8, and it loaded
quickly instead of freezing the
browser. I thought briefly about doing something like this
earlier, but wrote it off as using insane amounts of memory. After
more people suggested it, I realized it only uses N times as much
memory as just storing the comments.
Comment via: google plus, facebook