-
RE: Excel Parser - Blank Cell
Gangwal, Adish 2012-01-26, 21:21
Sorry, attaching the example Excel file which I am trying to parse.
_____________________________________________
From: Gangwal, Adish (IS Consultant)
Sent: Friday, January 13, 2012 6:26 PM
To: '[EMAIL PROTECTED]'
Subject: Excel Parser - Blank Cell
Hi,
When I parse an Excel file which has an empty cell, Tika doesn't create an extra tab character for it.
If there are three cells of which the middle one is empty, it skips the middle cell and outputs only the 1st and 3rd cells, separated by a single tab.
In the example below, the first column, 'FLAG', is empty, and we want a tab character as in rows 1 and 2. In row 3, the text 'ID COST - LONG TERM INVESTMENTS' should have a tab before it. I am attaching the example Excel sheet.
How can I tell Tika not to ignore the empty cells?
Note: if the cells contain white space, it correctly inserts tabs.
Example output:
Flag Description Starting Balance Debits Credits Net Activity Ending Balance
1 ASSETS EXCLUDING MARKET VALUE
2 ID COST - SWAPS 2,502,043.770 196,996,488.330 197,527,735.400 -531,247.070 1,970,796.700
3 ID COST - LONG TERM INVESTMENTS 814,320,658.100 210,385,704.520 235,299,892.650 -24,914,188.130
-
RE: Excel Parser - Blank Cell
Nick Burch 2012-01-27, 12:03
On Thu, 26 Jan 2012, Gangwal, Adish (IS Consultant) wrote:
> When I parse the excel which has an empty cell, it doesn't create a
> extra tab character.
>
> If there are three cells of which middle one is empty, it skips the
> middle cell and only outputs 1st and 3rd cell with a tab
Tika itself doesn't generate tab characters; it generates XHTML table elements. It's the text content handler that produces the tabs.
In general though, Tika will generate the text that is present.
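To illustrate the point above, here is a minimal sketch of a SAX handler that turns table markup into tab-separated text. `TabTextHandler` and `toText` are hypothetical names made up for this illustration; this is not Tika's own handler, only a simplified stand-in that uses the JDK's SAX classes. It assumes each `<td>` after the first one in a row gets a tab in front of it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a text-producing content handler (not a Tika class).
class TabTextHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private boolean firstCellInRow = true;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("tr")) {
            firstCellInRow = true;            // a new row starts without a leading tab
        } else if (qName.equals("td")) {
            if (!firstCellInRow) {
                out.append('\t');             // separate this cell from the previous one
            }
            firstCellInRow = false;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("tr")) {
            out.append('\n');                 // one output line per table row
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);        // copy cell text through unchanged
    }

    // Parses a small XHTML fragment and returns the tab-separated text.
    static String toText(String xhtml) {
        TabTextHandler handler = new TabTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
        return handler.out.toString();
    }
}
```

With this sketch, a row whose first cell is present only as an empty `<td></td>` still gets its tab, while a row in which that `<td>` is absent from the markup does not. A handler of this kind can only separate the cells it is actually given.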
If you're trying to generate a CSV or similar, and want full control over what shows up, missing cells etc, then I'd suggest you look at using Apache POI directly.
Nick
-
RE: Excel Parser - Blank Cell
Gangwal, Adish 2012-01-27, 22:14
Thanks Nick
We want to use Tika because it supports many document formats, not just xls or doc like POI. I think streamed parsing also makes Tika a lot faster and more efficient than POI, even for large documents of 15 MB or more.
I understand that Tika uses POI under the covers to parse Excel. So, is there some way to tell Tika (and in turn POI) to follow a Missing Cell Policy?
This would help produce a text document in a very readable format when cells are missing.
Any direction is really appreciated.
-Adish
-
RE: Excel Parser - Blank Cell
Nick Burch 2012-01-30, 12:40
On Fri, 27 Jan 2012, Gangwal, Adish (IS Consultant) wrote:
> We want to use Tika as it supports different doc formats and not just
> xls or doc like POI I think Streamed parsing also makes Tika a lot
> faster and efficient than POI to parse even large docs of 15 MB or
> greater.
The streamed parsing of Excel files in Tika is powered by POI!
> I understand that Tika uses POI under the cover to parse excel. So , is
> there some way, to tell Tika (and in turn POI) to follow some Missing
> Cell Policy.
A missing cell policy won't help here, because you're doing streaming event parsing.
It sounds like you have some very specific business requirements around the minimum number of cells per row, missing and blank cell handling, etc. Tika is never going to be able to do everything for everyone, so for your specific case you may be best off writing your own custom parser and dropping that into Tika.
XLS2CSVmra is a good basis for doing XLS -> CSV with full control over missing cells and missing rows (you can set a minimum number of columns to output, for example), and XLSX2CSV does something similar for XLSX -> CSV.
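The gap-filling idea behind such a custom parser can be sketched without any POI code. The sketch below assumes the parser has already collected one row's non-empty cells as a map from zero-based column index to text; `SparseRowWriter`, `toTabSeparated` and `minColumns` are hypothetical names for this illustration and are not taken from XLS2CSVmra or XLSX2CSV.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: pads missing cells so every column keeps its position.
class SparseRowWriter {

    // cells: zero-based column index -> cell text, holding only the cells
    //        that were actually reported for this row.
    // minColumns: minimum number of fields to emit, even for short rows.
    static String toTabSeparated(Map<Integer, String> cells, int minColumns) {
        TreeMap<Integer, String> sorted = new TreeMap<>(cells);
        int width = minColumns;
        if (!sorted.isEmpty()) {
            // Never drop a cell that lies beyond the requested minimum width.
            width = Math.max(width, sorted.lastKey() + 1);
        }
        StringBuilder row = new StringBuilder();
        for (int col = 0; col < width; col++) {
            if (col > 0) {
                row.append('\t');
            }
            // A column with no reported cell becomes an empty field.
            row.append(sorted.getOrDefault(col, ""));
        }
        return row.toString();
    }
}
```

For a row where only columns 1 and 2 were reported, a call with `minColumns` set to 4 yields a leading tab for the missing first column and a trailing empty field for the fourth, which is the layout asked for earlier in the thread.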
Nick