I want to read csv and xlsx files in my erlang project. Is there any plugin/library which can help me with this.
I’m using erlang otp 24.3.4
I want to read csv and xlsx files in my erlang project. Is there any plugin/library which can help me with this.
I’m using erlang otp 24.3.4
don’t know about xlsx, but for csv you can try my lib: https://github.com/hansph/erlangstat
Cannot handle all edge cases, simply try.
HP
This should work for your CSV reading/writing
https://github.com/SiftLogic/erlcsv
I have a simple xlsx parser that we use to parse some (internal) uploads of spreadsheets.
It needs some TLC to make it robust enough to use for all kinds of spreadsheets. Especially big ones, as Excel likes to save 1M+ rows now and then…
I have pasted it here:
If you improve it, please share the improvements.
@amey we use this CSV reader for many years with no issue:
We have a library that can edit XLSX files i.e. read and add info in a template XLSX file. Not open source yet because only used internally but if there is enough interest we can certainly open-source it
A library that can both read and write XLSX is interesting.
In Zotonic we have some data exporters here:
And a CSV reader and writer here:
is there any way to extract data from .xls?
Several people have pointed to CSV parsers. While CSV format is specified in RFC 4180, that’s a post facto specification, and you should never be surprised to encounter CSV files that don’t quite conform to it.
.XLSX files are zipped folders of (Microsoft Open)XML files.
Erlang can do XML.
Erlang can do ZIP.
The major problem with any kind of spreadsheet format (and if you can handle both CSV and XLSX you should have no problem with HTML files containing
s,right?) is that there are three levels of structure.Microformat for cells. How are timestamps and durations represented? Are decimal points and/or thousands separators allowed in numbers and if so what are they?
Container structure. This is what is covered by RFC 4180 and the Microsoft spec for XLSX.
Application structure. In CSV, is there a header row giving names? Is there a header row giving units? Are there any repeating fields? Is the layout a single rectangle or several rectangles? &c.
My particular interest in CSV files is weather data and sea state data, and my experience has been that parsing CSV is next to trivial (my smallest CSV parser is under 70 lines, pretty-printed) BUT dealing with levels 1 and 3 has been a pain in the posterior costing me hours of actually looking at the data. And then finding that the next CSv file from the same source has a different level 3 structure.
My point is that you mustn’t expect too much from any CSV or XLSX parser in any programming language.
Try this: Parsing CSV in erlang | ppolv's blog
The “Parsing CSV in erlang | ppolv’s blog” parser comes with this explicit goal: “Ensure that all records had the same number of fields.” But this is not a guaranteed property of real CSV files, and for at least two reasons. The first is that some spreadsheets are not regular tables to start with. (Level 3 structure.) The second is that some spreadsheets represent missing data as completely empty cells, and do not write trailing empty cells. For example, suppose you have data for three years, with a row for each year, and a column for each month. But recording began in April of the first year and ended in May of the third year.
You’ll see a file something like this:
Just to add another one: https://github.com/mkrusemark/ecsv. It’s a fork of https://github.com/pankajsoni19/ecsv with the capability to change the delimiter. It handles all cases. I use it to import >30 MBs of CSV daily. But it’s not the fastest :-(.
We had a home-grown CSV decoder/encoder that has served us well in the past. We decided to dust it off only to throw into this thread: GitHub - hnc-agency/hnc-csv