-
Re: Plagiarism - document similarity
Luca Natti 2011-07-11, 07:15
Yes, I'm interested in plagiarism detection applied to research papers, university notes, and theses. Any theory and the *best* snippets of code or examples would be much appreciated!

Thanks in advance for your help.

On Sat, Jul 9, 2011 at 5:14 PM, Andrew Clegg <[EMAIL PROTECTED]> wrote:
> If 'puzzling' means direct plagiarism, then some sort of
> longest-common-subsequence might be a better metric.
>
> If this isn't what the OP meant, then sorry! 'Puzzling' is a new term for
> me.
>
> On Friday, 8 July 2011, Sergey Bartunov <[EMAIL PROTECTED]> wrote:
> > You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis
> >
> > On 8 July 2011 12:47, Luca Natti <[EMAIL PROTECTED]> wrote:
> >> Is there a way to compute similarity between docs?
> >> And similarity by paragraphs?
> >>
> >> What we want to tell is if a research paper is original or made by
> >> "puzzling" other works.
> >>
> >> Thanks!
>
> --
> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
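For anyone who wants something concrete for the longest-common-subsequence idea Andrew mentions, here is a minimal Python sketch. It is only an illustration under stated assumptions: the tokenizer is deliberately crude, and the names `words`, `lcs_length` and `lcs_ratio` are made up for this example, not taken from any library. It compares documents word by word and reports what fraction of the suspect text appears, in order, in the source.

```python
import re


def words(text):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence that ends before both items.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one item from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(suspect, source):
    """Fraction of the suspect's words that occur, in order, in the source."""
    s, t = words(suspect), words(source)
    return lcs_length(s, t) / len(s) if s else 0.0
```

The dynamic program is quadratic in the number of words, so it is probably only practical per paragraph or per pair of short documents, not across a whole corpus.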
-
Re: Plagiarism - document similarity
Dawid Weiss 2011-07-11, 07:19
I've seen people do all kinds of things to detect this. A few directions to research:

- suffix trees / suffix arrays to detect longest common subsequences (exact, contiguous matches only, though),
- bioinformatics, in particular gene sequencing, to detect long near-matching sequences (a variation of the above; I'm not familiar with any particular algorithms, but I imagine this is a well-explored space given the funds they receive ;),
- techniques for fuzzy matching / near-duplicate detection, combined with arbitrary document chunking or method-specific data (for example shingles).

These should yield tons of reading material to start with (Google Scholar, CiteSeer). Sorry for not being more specific.

Dawid
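As a rough illustration of the shingling direction in the list above, the following Python sketch splits each paragraph into overlapping word shingles and compares paragraphs by Jaccard similarity. Everything here is an assumption made for the example: the shingle size `k=3`, the `threshold=0.5`, the blank-line paragraph split, and the function names are all hypothetical choices, not recommendations from the thread.

```python
import re


def shingles(text, k=3):
    """Set of k-word shingles (overlapping word windows) in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def paragraph_matches(suspect, source, k=3, threshold=0.5):
    """List (suspect index, source index, score) for similar paragraphs."""
    sus = [shingles(p, k) for p in suspect.split("\n\n")]
    src = [shingles(p, k) for p in source.split("\n\n")]
    matches = []
    for i, a in enumerate(sus):
        for j, b in enumerate(src):
            score = jaccard(a, b)
            if score >= threshold:
                matches.append((i, j, score))
    return matches
```

Comparing every paragraph pair is quadratic, so for a large collection the shingles would more likely be indexed or hashed first; this version only shows the scoring step.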
-
Re: Plagiarism - document similarity
Em 2011-07-11, 08:35
Hi Luca,

what about quotations of another researcher's work? Are you also interested in the proportion of quoted text relative to the whole document? I think it is not impossible for an algorithm to find out whether shared subsequences in both documents are correctly marked as quotes, but it might be hard. Depending on your business case, you may find that there are a lot of false positives when judging someone's work as plagiarism.

Another approach to measuring similarity between the content of two documents is implemented in Nutch. Fortunately, I found a piece of documentation in the Solr API docs where you can read about it:

http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html

You could do something like that for content blocks of a document (several sentences or a fixed window of words). This way you can find similarities between documents where the author has rewritten part of another researcher's work: passages where the longest common subsequence is short, but where a human would see the similarity between both documents and the possibility of plagiarism.

Regards,
Em
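To make the content-block idea a bit more tangible, here is a small Python sketch of a fuzzy signature computed per fixed window of words. It is only loosely inspired by the profile-signature approach linked above and is not the Solr or Nutch implementation: the parameters `min_token_len`, `quant_rate` and `window`, their default values, and the use of SHA-256 are all assumptions chosen for the example. The point it tries to show is that rounding term frequencies down to a coarse step makes rare words drop out, so two blocks that share their dominant vocabulary can end up with the same signature even after light rewording.

```python
import hashlib
import re
from collections import Counter


def profile_signature(text, min_token_len=2, quant_rate=0.5):
    """Hash of the quantized term-frequency profile of one content block."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.sha256(b"").hexdigest()
    # Quantization step: a fraction of the highest term frequency, at least 1.
    quant = max(1, int(max(counts.values()) * quant_rate))
    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized > 0:  # terms rarer than the step are dropped
            profile.append((quantized, token))
    # Sort by descending frequency, then alphabetically, so order is stable.
    profile.sort(key=lambda item: (-item[0], item[1]))
    joined = " ".join(f"{token}:{freq}" for freq, token in profile)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def block_signatures(text, window=40):
    """Signatures for consecutive, non-overlapping windows of `window` words."""
    words = text.split()
    return [profile_signature(" ".join(words[i:i + window]))
            for i in range(0, len(words), window)]
```

Two documents could then be compared by counting how many block signatures they have in common. Whether a step as coarse as `quant_rate=0.5` suits short blocks is something to tune by hand on real data.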
-
Re: Plagiarism - document similarity
Luca Natti 2011-07-11, 10:00
Something that also gives false positives is OK, because we can check by hand before making the final decision on a document.

I need more specific directions with some examples, because we have little time to implement this.
-
Re: Plagiarism - document similarity
Ken Krugler 2011-07-11, 14:19
On Jul 11, 2011, at 3:00am, Luca Natti wrote:

> Something that also gives false positives is OK,
> because we can check by hand for the final decision on the doc.
>
> I need more specific directions with some examples,
> because we have little time to implement this.

See "Winnowing: Local Algorithms for Document Fingerprinting" by Schleimer, Wilkerson, and Aiken, and any papers on MOSS, an implementation of the ideas in that paper for detecting plagiarism.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions
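The core selection step of the winnowing paper Ken cites can be sketched in a few lines of Python. This is a simplified illustration, not MOSS itself: the k-gram length `k=5`, the window size `w=4`, the use of `zlib.crc32` instead of a rolling hash, and the helper `shared_fraction` are all assumptions made for the example. The text is normalized, every k-gram of characters is hashed, and from each window of `w` consecutive hashes the minimum is kept (the rightmost one on ties).

```python
import re
import zlib


def winnow(text, k=5, w=4):
    """Return the set of (hash, position) fingerprints chosen by winnowing."""
    # Normalize: lowercase and keep only letters and digits.
    s = re.sub(r"[^a-z0-9]", "", text.lower())
    hashes = [zlib.crc32(s[i:i + k].encode("utf-8"))
              for i in range(len(s) - k + 1)]
    if not hashes:
        return set()
    w = min(w, len(hashes))  # very short texts get a single window
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Pick the rightmost occurrence of the minimum within the window.
        offset = len(window) - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def shared_fraction(suspect, source, k=5, w=4):
    """Fraction of the suspect's fingerprint hashes also found in the source."""
    a = {h for h, _ in winnow(suspect, k, w)}
    b = {h for h, _ in winnow(source, k, w)}
    return len(a & b) / len(a) if a else 0.0
```

With these settings, any passage of at least `w + k - 1` normalized characters that is copied verbatim contributes at least one shared fingerprint, which is the guarantee the paper is built around. The stored positions can be used afterwards to point a human reviewer at the matching regions.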
-
Re: Plagiarism - document similarity
Ted Dunning 2011-07-11, 22:33
It is easier to simply index all, say, three-word phrases and use a TF-IDF score. This will give you a good proxy for sequence similarity. Documents should either be chopped on paragraph boundaries so that they have roughly constant length, or the score should not be normalized by document length.

A log-likelihood ratio (LLR) test can be useful for extracting good query features from the subject document. The TF-IDF score is a reasonable proxy for this, although it does lead to some problems. Both the reason TF-IDF works as a query term selection method and the reason it fails can be seen from the fact that TF-IDF is very close to one of the most important terms in the LLR score.

On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg <andrew.clegg+[EMAIL PROTECTED]> wrote:
> On 11 July 2011 08:19, Dawid Weiss <[EMAIL PROTECTED]> wrote:
> > - bioinformatics, in particular gene sequencing to detect long
> > near-matching sequences (a variation of the above, I'm not familiar
> > with any particular algorithms, but I imagine this is a well explored
> > space
>
> The classic is Smith & Waterman:
>
> http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
>
> This approach has been used in general text processing tasks too, e.g.:
>
> http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf
>
> > given the funds they receive ;),
>
> Hah! Less so these days I'm afraid :-)
>
> Andrew.
>
> --
> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
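The three-word-phrase suggestion can be tried out without a search engine. The Python sketch below indexes paragraphs by their word trigrams in memory and scores each indexed paragraph against a query paragraph with an unnormalized TF-IDF sum. It is a toy under stated assumptions: the IDF variant `log(1 + N/df)`, the tokenizer, and the function names `trigrams`, `build_index` and `score` are choices made for this example, and a real setup would more likely index the phrases in Lucene or Solr.

```python
import math
import re
from collections import Counter


def trigrams(text):
    """All overlapping three-word phrases in the text, lowercased."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def build_index(paragraphs):
    """Per-paragraph trigram counts plus document frequency per trigram."""
    counts = [Counter(trigrams(p)) for p in paragraphs]
    df = Counter(t for c in counts for t in c)
    return counts, df


def score(query, counts, df):
    """TF-IDF score of every indexed paragraph, not length-normalized."""
    n = len(counts)
    query_terms = set(trigrams(query))
    scores = []
    for c in counts:
        # Sum tf * idf over the query trigrams present in this paragraph.
        total = sum(c[t] * math.log(1 + n / df[t])
                    for t in query_terms if t in c)
        scores.append(total)
    return scores
```

Paragraphs with the highest scores are the candidates to check by hand. Because the score is not divided by paragraph length, the paragraphs should be of roughly similar size, as noted above.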
-
Re: Plagiarism - document similarity
Luca Natti 2011-07-12, 07:58
Thanks to all,

I need to start from the basics of the theory. You are speaking Arabic :) to me; in other words, I need a less theoretical approach, some real code to get my hands on. Excuse this blunt approach, but I need an algorithm that is really fast to implement and understand, to use in a real-world scenario, possibly right now ;).

Alternatively, I need a basic textbook to start reading so that I can get to the point of understanding what you are saying.

Thanks again
-
Re: Plagiarism - document similarity
Em 2011-07-12, 08:10
Hi Luca,

again, I have to emphasize: read what I gave you. The algorithm in my link is explained for non-scientists, and if you download Solr you will find the class and can look at how they implemented the algorithm.

Anything easier would mean that someone else is writing the code for you ;).

Regards,
Em