Re: Plagiarism - document similarity
Em 2011-07-12, 08:10
Again, I have to emphasize: read what I gave you.
The algorithm in my link is explained for non-scientists, and if you
download Solr you will find the class and can look at how
they implemented that algorithm.
Anything easier would mean that someone else writes the code for you ;).
On 12.07.2011 09:58, Luca Natti wrote:
> Thanks to all,
> I need to start from the beginning of the theory.
> You are speaking Arabic :) to me; in other words, I need
> a less theoretical approach, some real code to put my
> hands on.
> Excuse this blunt approach, but I need something fast to implement and understand
> that I can use in a real-world scenario, possibly now ;).
> Alternatively, I need a basic text(book) to start reading so that I can come to
> understand what you are saying.
> Thanks again
> On Tue, Jul 12, 2011 at 12:33 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> It is easier to simply index all, say, three-word phrases and use a TF-IDF score.
>> This will give you a good proxy for sequence similarity. Documents should
>> either be chopped on paragraph boundaries to have a roughly constant length
>> or the score should not be normalized by document length.
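
A minimal Python sketch of the suggestion above, for anyone who wants code
to put their hands on. It indexes every three-word phrase, weights each
phrase by IDF, and scores candidates with an unnormalized sum so that
document length does not dilute the score. The function names (shingles,
build_index, score_documents), the tokenizer regex, and the smoothed IDF
formula are assumptions made for illustration; they are not taken from
Solr, Lucene, or Mahout.

```python
import math
import re
from collections import Counter


def shingles(text, n=3):
    """Split text into lowercase words and return all n-word phrases."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_index(docs, n=3):
    """docs maps doc id -> text. Returns (phrase counts per doc, IDF per phrase)."""
    counts = {doc_id: Counter(shingles(text, n)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for phrase_counts in counts.values():
        doc_freq.update(phrase_counts.keys())
    num_docs = len(docs)
    # Smoothed IDF (an assumption; other variants work as well).
    idf = {p: math.log((1 + num_docs) / (1 + df)) + 1.0 for p, df in doc_freq.items()}
    return counts, idf


def score_documents(query_text, counts, idf, n=3):
    """Score each indexed document by the summed TF-IDF of phrases it shares
    with the query. The sum is deliberately not divided by document length."""
    query_phrases = set(shingles(query_text, n))
    scores = {}
    for doc_id, phrase_counts in counts.items():
        scores[doc_id] = sum(
            phrase_counts[p] * idf[p] for p in query_phrases if p in phrase_counts
        )
    return scores


# Toy corpus (hypothetical data, only to show the call sequence).
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a completely different sentence about cooking pasta tonight",
}
counts, idf = build_index(docs)
scores = score_documents("he said the quick brown fox jumps high", counts, idf)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

On a real corpus the same idea is usually delegated to a search engine's
index rather than in-memory dictionaries; the sketch only shows what is
being counted and summed.
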
>> The log-likelihood ratio (LLR) test can be useful to extract good query terms
>> from the subject document. The TF-IDF score is a reasonable proxy for this,
>> although it does lead to some problems. The reason TF-IDF works as a query
>> term selection method, and why it fails, can be seen from the fact that it
>> is very close to one of the most important terms in the LLR score.
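
The LLR-based term selection described above can be sketched roughly as
follows. This is one common way to write the G^2 statistic for a 2x2
contingency table, not necessarily the exact form Ted has in mind. The
table layout (term vs. other tokens, subject document vs. background
corpus), the helper names, and the rule of keeping only terms that are
over-represented in the subject document are assumptions made for
illustration.

```python
import math
from collections import Counter


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a set of counts: N*ln(N) - sum(k*ln(k))."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 table of counts.

    k11: term in subject doc        k12: other tokens in subject doc
    k21: term in background corpus  k22: other tokens in background corpus
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def select_query_terms(subject_tokens, background_tokens, top_n=5):
    """Return the top_n subject-document terms ranked by LLR."""
    subj = Counter(subject_tokens)
    back = Counter(background_tokens)
    n_subj = sum(subj.values())
    n_back = sum(back.values())
    scored = []
    for term, k11 in subj.items():
        k21 = back.get(term, 0)
        # Skip terms that are not more frequent here than in the background.
        if k11 * n_back <= k21 * n_subj:
            continue
        scored.append((llr(k11, n_subj - k11, k21, n_back - k21), term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_n]]
```

The selected terms (or three-word phrases, if the token lists hold phrases
instead of single words) would then be used as the query against the index.
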
>> On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg <andrew.clegg+[EMAIL PROTECTED]> wrote:
>>> On 11 July 2011 08:19, Dawid Weiss <[EMAIL PROTECTED]> wrote:
>>>> - bioinformatics, in particular gene sequencing to detect long
>>>> near-matching sequences (a variation of the above, I'm not familiar
>>>> with any particular algorithms, but I imagine this is a well explored
>>> The classic is Smith & Waterman:
>>> This approach has been used in general text processing tasks too, e.g.:
>>>> given the funds they receive ;),
>>> Hah! Less so these days I'm afraid :-)
>>> http://tinyurl.com/andrew-clegg-linkedin