-
OpenCalais alternatives for use with Nutch? Alex McLintock 2010-07-02, 15:53
I'm quite interested in OpenCalais, a Thomson Reuters initiative. It is a web service that takes your free text and identifies important terms in it, such as people, businesses, places, and so on. If you are the document owner, you can submit your document to their web site and get back tags saying what the document is about. I'd like to tag this sort of data and feed it into a Lucene-style index so that it can be used in searches AND in focussed/topical crawls.

Now, here comes the problem. When we crawl the web we don't own the documents we are crawling, so we don't really have permission to use Reuters' servers to do this analysis. (Maybe we could cut a deal, though, if we were a big enough company.)

So has anyone else looked at alternatives to OpenCalais that take free text and try to work out what it is about? I've been looking for software to do this, but nothing seems suitable.

Alex
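
A minimal sketch of the indexing idea described above, assuming the entity tags arrive as (type, name) pairs from whatever tagger ends up being used. The `EntityIndex` class and the `crawl_priority` function are hypothetical names invented for this illustration; the dictionary of sets only stands in for a Lucene-style index and is not the Lucene or Nutch API.

```python
from collections import defaultdict


class EntityIndex:
    """Toy inverted index keyed by (field, term); a stand-in for a Lucene-style index."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, url, text, entities):
        # Index plain words under the "text" field.
        for word in text.lower().split():
            self.postings[("text", word.strip(".,;:!?"))].add(url)
        # Index each extracted entity under a field named after its type.
        for entity_type, name in entities:
            self.postings[(entity_type.lower(), name.lower())].add(url)

    def search(self, field, term):
        # Return matching URLs in a stable (sorted) order.
        return sorted(self.postings.get((field.lower(), term.lower()), set()))


def crawl_priority(entities, topic_entities):
    """Fraction of a page's distinct entity names that are on-topic (hypothetical score)."""
    if not entities:
        return 0.0
    names = {name.lower() for _, name in entities}
    topic = {t.lower() for t in topic_entities}
    return len(names & topic) / len(names)


index = EntityIndex()
index.add("http://example.com/a", "Reuters opens a new office in London.",
          [("Organization", "Reuters"), ("Location", "London")])
index.add("http://example.com/b", "London weather stays mild.",
          [("Location", "London")])

print(index.search("location", "London"))
# ['http://example.com/a', 'http://example.com/b']
print(crawl_priority([("Organization", "Reuters"), ("Location", "London")], {"Reuters"}))
# 0.5
```

Keeping each entity type in its own field lets a query distinguish "London" the place from "London" as an ordinary word, and the same tags can feed a score that decides which outlinks a focused crawl follows first.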
-
Re: OpenCalais alternatives for use with Nutch? Mischa Tuffield 2010-07-02, 15:57
Hi Alex,

As far as I am aware, Zemanta [1] does something similar to OpenCalais, but it is mainly used on text in blogs, as opposed to news-related text. It might be worth checking out their stuff; I could be wrong though...

Mischa

[1] http://developer.zemanta.com/

___________________________________
Mischa Tuffield PhD
Homepage - http://mmt.me.uk/
Garlik Limited, http://www.garlik.com/
-
Re: OpenCalais alternatives for use with Nutch? Kevin Conor 2010-07-02, 15:58
That sounds like Named Entity Recognition. It's typically done with a Conditional Random Field. You could take a look at http://nlp.stanford.edu/software/CRF-NER.shtml.
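
To make the output of such a tagger concrete, here is a small sketch of the usual post-processing step, assuming the tagger emits one label per token and uses `O` for tokens outside any entity. This is not a CRF and not the Stanford API; `group_entities` and the sample labels are hypothetical and only show how token-level labels become entity phrases.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside="O"):
    """Merge runs of adjacent tokens that share a label into (label, phrase) pairs."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != outside:
            entities.append((label, " ".join(token for token, _ in run)))
    return entities


# Hand-written labels standing in for real tagger output.
tagged = [
    ("Alex", "PERSON"), ("McLintock", "PERSON"),
    ("asked", "O"), ("about", "O"),
    ("OpenCalais", "ORGANIZATION"),
    ("in", "O"),
    ("London", "LOCATION"),
]

print(group_entities(tagged))
# [('PERSON', 'Alex McLintock'), ('ORGANIZATION', 'OpenCalais'), ('LOCATION', 'London')]
```

One limitation of this simple grouping: two different entities of the same type that sit directly next to each other are merged into one phrase. Label schemes that mark the beginning of each entity avoid that.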
-
Re: OpenCalais alternatives for use with Nutch? Mischa Tuffield 2010-07-02, 16:00
Or you could have a look at the NLP stuff out of Sheffield University in the UK:

GATE: http://gate.ac.uk/

Mischa

*not an NLP expert
-
Re: OpenCalais alternatives for use with Nutch? Andrzej Bialecki 2010-07-02, 16:11
On 2010-07-02 18:00, Mischa Tuffield wrote:
> Or you could have a look at the NLP stuff out of Sheffield University in the UK:
>
> GATE: http://gate.ac.uk/

...or OpenNLP. See also Behemoth (http://code.google.com/p/behemoth-pebble). In short, no need to subject yourself to draconian Terms of Service :)

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-
Re: OpenCalais alternatives for use with Nutch? Claudio Martella 2010-07-02, 16:12
I can recommend GATE and ANNIE. GATE is a framework for text mining. ANNIE is a pipeline of GATE components for extracting named entities such as the names of people, locations, companies, etc. You can use GATE/ANNIE programmatically through their Java API.

http://gate.ac.uk/

--
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst
TIS innovation park
http://www.tis.bz.it
-
Re: OpenCalais alternatives for use with Nutch? Max Lynch 2010-07-02, 16:41
OpenNLP is a great Java package with working named entity recognition. I've had a lot of success with it. That said, the Stanford NLP stuff is always highly acclaimed, though I had trouble figuring it all out the last time I tried it.
-
Re: OpenCalais alternatives for use with Nutch? Julien Nioche 2010-07-02, 18:42
Thanks Andrzej for mentioning Behemoth! It now lives at http://github.com/jnioche/behemoth

There are indeed quite a few open source NLP frameworks / resources available that can be used for doing NER, including GATE and Apache UIMA. Apart from the obvious benefits related to their licenses, these resources also have the advantage of being customisable. Behemoth lets you deploy GATE or UIMA applications on Hadoop, but it would also be doable to embed them in a custom parsing / indexing plugin for Nutch. Some of Behemoth's components could be used to that effect.

Note that OpenNLP, Stanford and LingPipe components are already available as GATE plugins. GATE also gives you a lot of other things, e.g. GUIs on top of these resources, so it is good to use it regardless of the actual component that you want to get your annotations from. I was (and probably still am) a committer on GATE, so feel free to get in touch.

<sales_pitch> DigitalPebble do provide support and consultancy services on GATE / UIMA (as well as Nutch of course) </sales_pitch> :-)

I recently gave a presentation on Behemoth at Berlin Buzzwords [1] which might be of interest. In particular, there is a component that converts Nutch segments into the structure used by Behemoth.

HTH,

Julien

[1] http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/nioche_bbuzz2010.odp

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
-
Re: OpenCalais alternatives for use with Nutch? Thomas Tague 2010-07-04, 10:53
Alex:

Ah, we enter the sticky area of ownership, IP, fair use rights and all of that. The TOS themselves are the rule, but a few comments.

The OpenCalais TOS don't in themselves insist that you "own" the content you submit. You have to make a decision for content that you don't "own" on whether your usage of that content with Calais is covered by fair use. We can't make that decision for you, but a review of the OpenCalais gallery will show you many organizations that have made the decision that web-derived content can be utilized by OpenCalais.

Our hard and fast limitations on use are in the TOS and are pretty straightforward. They include no hate speech, no porn, no deep packet inspection uses and a few others. Basically, do no evil.

The issue itself will persist regardless of the tool you choose. Open source, commercial, whatever. In the end you'll still need to own the decision on whether you're allowed to "use" other content in your service.

Regards,
Tom
-
Re: OpenCalais alternatives for use with Nutch? AJ Chen 2010-07-04, 22:15
OpenNLP, the Stanford NLP tools, and GATE are good suggestions. But compared to OpenCalais, one big difference should not be overlooked: the OpenCalais web service gives many semantic analysis results right out of the box. With other, more general NLP tools, you will probably need to spend a lot of effort to build a tool that can deliver similar results.

-aj

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA