Tika, mail # dev - % of different content types out there on the web


Re: % of different content types out there on the web
Markus Jelsma 2012-01-31, 14:54


On Tuesday 31 January 2012 15:55:06 Mattmann, Chris A (388J) wrote:
> Hi Markus,
>
> Thanks for the FYI. Any idea of the specific percentages for those unwanted
> suffixes compared to the size of the entire corpus?

Unfortunately no, we don't keep a record of those; we just filter them out as
soon as we can.

>
> Cheers,
> Chris
>
> On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote:
> > We only crawl HTML and PDF files for a lot of ccTLDs, so we only have
> > data on those two. However, we also explicitly filter out all/most
> > unwanted suffixes. We do have a lot of suffixes that we have encountered
> > so far.
> >
> > On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
> >> (sorry for the cross post)
> >>
> >> Hey Guys,
> >>
> >> I'm trying to find a good citation or estimate (if anyone has done one)
> >> of the breakdown (by % or some other metric) of non-HTML content types
> >> out there on the web (from a whole web crawl or a meaningful
> >> representative dataset).
> >>
> >> Anyone have any ideas about this?
> >>
> >> Thanks!
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: [EMAIL PROTECTED]
> >> WWW:   http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--
Markus Jelsma - CTO - Openindex