-
Extracting data from websites
David Rose 2012-07-30, 12:22
Hi all,
I apologize for how basic my question is, but I am very new to all of this, machine learning, writing code, all of it. I was finally able to get Mahout downloaded, installed, and running. I was assigned a project at my work to try to use Mahout to extract data from websites that we input. Is this possible? Can anyone help me with suggestions or instructions on how to do so? I appreciate any help on this, as I have only two more weeks to finish this project.
Thanks,
David Rose
-
Re: Extracting data from websites
Sean Owen 2012-07-30, 12:26
Extract as in web crawl? No, Mahout has nothing to do with that. Extract as in entity extraction? I don't think there are relevant implementations here either, though that begins to border on machine learning. Mahout is more about clustering and classification of documents than anything else.
-
Re: Extracting data from websites
Xavier Rampino 2012-07-30, 12:29
If you want to develop scrapers, I suggest you take a look at jsoup (http://jsoup.org/), which allows you to parse HTML easily. If you need subsequent classification of the websites, then maybe you'll need Mahout.
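To give a feel for what "parsing HTML" means for a scraper, here is a minimal sketch that pulls the visible text out of a page. It is not jsoup (a Java library); it uses Python's standard-library html.parser as a stand-in, and the names TextExtractor and extract_text are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content.

    Hypothetical helper for illustration; not part of jsoup or Mahout.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling extract_text on a downloaded page gives plain text that a later classification step could consume. A real scraper would also need to fetch pages and cope with malformed markup, which is where a dedicated library such as jsoup helps.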
-
Re: Extracting data from websites
David Rose 2012-07-30, 12:59
Clustering and classification are something we want to use. We would want to grab news about specific industries or companies from sites we input, and then have the articles classified by relevance.
-
Re: Extracting data from websites
Pat Ferrel 2012-07-30, 14:30
You may want to look at Bixo (openbixo.org), which is a web crawler built on Hadoop. There is a little extension to it that parses pages into plain text using boilerpipe (which removes boilerplate text from pages) and Tika. The crawler takes a list of URLs and filters them with regexes (in or out). It then puts the text into Hadoop sequence files, which can be fed directly to Mahout for vectorization, the first step for most of Mahout's analysis features, including clustering and classification. I forked the code to add the Hadoop output, boilerpipe, and filtering, all of which are command-line driven: https://github.com/pferrel/bixo
The project includes a couple of tools for independent tasks; just ignore them.
Read Mahout in Action. It will make your next two weeks more productive.
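The two ideas in this message, filtering a URL list with regexes and turning text into vectors, can be sketched in a few lines. This is only an illustration under stated assumptions: it is not Bixo's or Mahout's code, the function names (filter_urls, tokenize, tfidf_vectors) are invented here, and the weighting is a plain tf * log(N / df) formula chosen for simplicity.

```python
import math
import re
from collections import Counter


def filter_urls(urls, include=(), exclude=()):
    """Keep URLs that match any include pattern (if any are given)
    and match no exclude pattern. Hypothetical helper, not Bixo's API."""
    include_res = [re.compile(p) for p in include]
    exclude_res = [re.compile(p) for p in exclude]
    kept = []
    for url in urls:
        if include_res and not any(r.search(url) for r in include_res):
            continue  # filtered "out": matched no include pattern
        if any(r.search(url) for r in exclude_res):
            continue  # filtered "out": matched an exclude pattern
        kept.append(url)
    return kept


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document,
    with weight = term count * log(N / document frequency)."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: c * math.log(n / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors
```

Terms that occur in every document get weight zero, which is the point of the weighting: words shared by all articles say nothing about which cluster or class an article belongs to.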
-
Re: Extracting data from websites
David Rose 2012-07-30, 14:31
Is there a way to combine both Apache Nutch and Mahout in order to do what I am trying to do?
-
Re: Extracting data from websites
Lance Norskog 2012-07-30, 21:46
The easiest web crawler I know of is 'wget'.