Here on the data team at Cloudbakers, we analyze, process, and transform data in a number of formats, including CSV. CSV is one of the most common file formats for storing data; however, it's also fraught with issues that can make working with it painful. Unlike a format such as JSON, CSV is not formalized or standardized. This means that something may be called a CSV without any guarantee that it will be read properly by the CSV parser you happen to be using. I hope this post gives you a better understanding of the challenges of working with CSV files and some ways to transform and analyze CSV-formatted data. I'll assume you have some degree of technical experience, but not necessarily that you write code.
What is CSV?
The first sentence of the Wikipedia entry provides a simple (but already problematic) definition of CSV:
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
By that definition, a CSV file would look something like this (the first line in this case represents column headers):
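```
name,age,city
Alice,34,Denver
Bob,29,Chicago
```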
This seems relatively straightforward, but problems arise very quickly. For example, what if there is a comma in a value? You could enclose the value in double quotes, but what if quote characters also appear inside that value? What if the character used to delineate records (i.e., a newline) is present within the data? What if your file uses a character other than a comma to delineate values? And so on. The answers to these questions are often ad hoc and system-dependent, so depending on where your CSV data came from, you need to apply a different set of rules to interpret it properly. If you're lucky, your source system provides thorough documentation. If not, you'll have to guess and/or reverse-engineer whatever ruleset your CSV file follows.
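To see why the quoting question matters, here is a short sketch using Python's built-in csv module (the names and values are made up for illustration):

```python
import csv
import io

# A record whose first field contains a comma, protected by double quotes.
line = '"Smith, John",42,Portland\n'

# Naive splitting breaks the quoted field apart.
naive = line.strip().split(",")
print(naive)  # ['"Smith', ' John"', '42', 'Portland']

# A real CSV parser respects the quoting.
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['Smith, John', '42', 'Portland']
```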
CSV has been in use since the 1970s, but it is not standardized. The closest thing to a standard for CSV is RFC 4180, released in 2005. While it can be a useful reference, there is no guarantee that a file a given system calls a "CSV" follows this RFC; Microsoft Excel, for example, explicitly does not. The RFC itself states:
It does not specify an Internet standard of any kind.
The Library of Congress provides a more "real-world" definition of CSV and describes many common deviations:
- In locales where the comma character is used in place of a decimal point in numbers, the separator between fields/columns is often a semicolon.
- The line break character may be CR or LF, not necessarily CRLF.
- Some Unix-based applications may use a different escape mechanism for indicating that one of the separator characters occurs within a text value: the individual character is preceded by a backslash character rather than the entire string being enclosed in double quotes. Single quotes may be treated as equivalent to double quotes for escaping (also known as "text-qualification").

Several other caveats are worth noting:
- The last record in a file may or may not end with a line break character.
- Non-printable characters may be included in text fields by using one of several C-style character escape sequences: \### or \o### (octal); \x## (hex); \d### (decimal); and \u#### (Unicode).
- The treatment of whitespace adjacent to field and record separators varies among applications. If whitespace at the beginning and end of a textual field value is significant, the text string should be text-qualified, i.e. enclosed in quotes.
- In some uses, there is an assumption of strong data typing, with unquoted fields considered to be numeric, and quoted fields considered to be text data.
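Python's standard library includes a csv.Sniffer class that makes a best-effort guess at some of these dialect variations. A small sketch, using a made-up semicolon-delimited sample of the kind described in the first bullet above:

```python
import csv
import io

# European-style CSV: semicolon-delimited, commas used as decimal points.
sample = "name;price\nwidget;1,25\ngadget;3,50\n"

# Sniffer guesses the dialect heuristically; it can be wrong on messy input,
# so treat its answer as a hint rather than a guarantee.
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ';'

rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[1])  # ['widget', '1,25']
```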
I've seen most or all of these deviations in practice. In some cases, tools will export "CSV" by just splicing commas between values, which breaks very easily (for example, SQL Server's "CSV" format).
The LOC page has a lot of additional details if you're interested in further reading, but I would not recommend assuming any CSV data is RFC 4180 compliant or that any parser you're interacting with implements RFC 4180. You may want to look into the documentation for the specific CSV writer or parser that you're working with, if it exists, to understand how that individual system handles CSV. Some systems will handle CSV relatively well, while others may fail in inscrutable ways.
If you're working with CSV-formatted data, you may want to transform your CSV data and load it into a different system which has different expectations for how a CSV is formatted. You also may want to load data from a CSV and perform transformations and analytics on it. Here are a number of tools that I find useful when working with CSV files. Most of these tools assume you have some experience with the command line and installing software from a binary or package repository.
- xsv

This command line tool is my go-to for most CSV-related tasks – it's reliable and extremely fast, and if you're working with large (>1GB) files, there isn't much competition. It can do a number of simple operations, including slicing data, sorting data, performing basic analytics, reformatting CSV data, fixing mismatched rows, etc. Check out the documentation for more information. If you're doing more complex analytics or transformations on your data, xsv may not be appropriate and you may want to look into some of the other tools below.
- csvkit

Another command line tool, csvkit is what I used before I discovered xsv. It serves a similar role but is not nearly as fast. csvkit does have quite a few additional features, such as converting between CSV and other formats. It's a good library to use if you need to do something xsv can't, or if you're working with smaller files and performance isn't as important.
- sqlite (or other relational database)
Most relational databases have the ability to import or export from CSVs. This is a great strategy for doing complex analytics on large datasets with relatively good performance. It's also easy if you're already familiar with SQL. I recommend SQLite for doing analytics like this, as it doesn't require a server and can write directly to disk or even run in memory, but you could also use another relational database that you're more familiar with.
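As a sketch of this workflow, here's how CSV rows might be loaded into an in-memory SQLite database using only Python's standard library (the table and data here are hypothetical):

```python
import csv
import io
import sqlite3

# A hypothetical orders file; in practice you'd open a real file instead.
data = "order_id,amount\n1,9.99\n2,24.50\n3,9.99\n"

conn = sqlite3.connect(":memory:")  # no server, no file on disk needed
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")

reader = csv.reader(io.StringIO(data))
next(reader)  # skip the header row
conn.executemany("INSERT INTO orders VALUES (?, ?)", reader)

# Now ordinary SQL works against the CSV data.
total, = conn.execute("SELECT SUM(amount) FROM orders").fetchone()
print(round(total, 2))  # 44.48
```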
A relatively new tool is DuckDB, an embedded columnar database that can also read CSVs. I haven't used this, but it could be a great option if you're looking for a higher-performance embedded database for analytics workflows!
- pandas

If you're familiar with Pandas, the Python data manipulation and analysis library, it can be a useful way of working with CSV data, though it can suffer performance issues on large (several GB) datasets, depending on your computer's specs. Data scientists often use Pandas in conjunction with Jupyter for data analysis, and both tools have an extensive community surrounding them. Pandas is great for doing more sophisticated transformation and analysis on your data after loading it from CSVs. You can also export data from Pandas into a number of different formats.
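A minimal Pandas sketch with made-up data, showing why an explicit dtype matters for values like ZIP codes (Pandas would otherwise infer the column as integers and drop the leading zero):

```python
import io

import pandas as pd

# Hypothetical data where zip_code must keep its leading zero.
data = "city,zip_code,population\nBoston,02134,654776\nAustin,78701,961855\n"

# dtype={"zip_code": str} prevents pandas from parsing ZIP codes as integers.
df = pd.read_csv(io.StringIO(data), dtype={"zip_code": str})
print(df["zip_code"].tolist())  # ['02134', '78701']
print(df["population"].sum())   # 1616631
```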
- Hosted data warehouses

If you are working with very large amounts of data (10s or 100s of GBs), your computer may not be able to handle running analytics locally via SQLite or Pandas. You'll probably want to move your CSV file(s) into a columnar database, and hosted tools let you avoid setting up a database server yourself.
- Writing code in the programming language of your choice
If you are comfortable writing code, your programming language probably has a CSV parsing library. For example, I often use the Python CSV library, part of the standard Python library. This is a useful strategy if you are working with a large amount of data and are very sensitive to performance, or if you want to perform a task that the above tools can't handle. There may be circumstances where using Unix tools such as awk or sed may make sense for working with CSVs, but I typically will use a command line tool specifically written to deal with CSVs. Writing your own code may be especially helpful for fixing malformed CSV so that it can be read by a different parser.
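As one example of fixing malformed CSV in code, here's a sketch that converts a hypothetical backslash-escaped file (one of the Unix-style deviations mentioned earlier) into standard quoted CSV using Python's csv module:

```python
import csv
import io
import re

# Hypothetical input that escapes embedded commas with backslashes
# instead of quoting the whole field.
bad = "name,notes\nSmith\\, John,likes csv\nDoe\\, Jane,prefers json\n"

def split_backslash_escaped(line):
    # Split on commas that are NOT preceded by a backslash, then unescape.
    parts = re.split(r"(?<!\\),", line)
    return [p.replace("\\,", ",") for p in parts]

out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL)  # quote every field on output
for line in bad.splitlines():
    writer.writerow(split_backslash_escaped(line))

print(out.getvalue())
```

The rewritten output is plain RFC-style CSV that a strict parser can read.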
Tips and tricks
I'm sure this is obvious at this point, but in almost all cases, you don't want to try to write your own CSV parser by simply splitting a file on commas or joining a list of values with a comma. If you're writing code, almost certainly there exists a CSV parsing library that handles as many of the issues we've discussed as possible. However, in certain circumstances, you may want to look into the details of the CSV parser for your language to understand more thoroughly what exactly it means by "CSV".
If your CSV file came from a machine export, look into the documentation for that system to see how it exports CSVs. For example, Excel, MySQL, PostgreSQL, etc., all make different decisions about how they format their CSV files. Another proprietary or legacy system you're working with may make different decisions still. See if whatever system you're working with documents its CSV format; in some cases, it may not.
Be aware of data types. CSV is not typed, and different systems may have different ways of distinguishing between an integer (1) and a string (e.g., a ZIP code, 01234). Make sure your system doesn't turn string IDs (which may start with zero) into integers. Be aware of how different systems treat NULL values, and of the difference between NULL and an empty string, which may or may not be the same value in a CSV file.
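A quick illustration of both pitfalls with Python's csv module (made-up data): every field comes back as a string, empty fields come back as `''` rather than `None`, and careless integer coercion destroys leading zeros.

```python
import csv
import io

# Hypothetical data: string IDs with leading zeros, plus an empty field.
data = "id,score\n01234,\n00042,7\n"
rows = list(csv.reader(io.StringIO(data)))[1:]

# Every field is a string; the missing score is '' rather than None.
print(rows)  # [['01234', ''], ['00042', '7']]

# Coercing IDs to int silently drops the leading zeros.
print(int(rows[0][0]))  # 1234
```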
If you have the option to use another open format instead of CSV, you may want to consider using that format.
Sometimes issues can be resolved by using a tool like xsv to quote every field in your CSV, or to change the delimiter so a system can parse it properly.
If you're still struggling to load your CSV into some system and that system can load another format, such as JSON, try converting your CSV to JSON before loading it. JSON is formally specified and may cause fewer issues.
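The conversion itself is only a few lines with Python's standard library; a sketch with made-up data:

```python
import csv
import io
import json

data = "name,role\nAda,engineer\nGrace,admiral\n"

# DictReader uses the header row as keys, yielding one dict per record.
records = list(csv.DictReader(io.StringIO(data)))
print(json.dumps(records, indent=2))
```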
Check out Data Science at the Command Line for a broader overview of working with data via the command line beyond just CSV files.
CSV may be frustrating, but you're not alone: others have probably encountered similar issues. Make liberal use of Google and Stack Overflow if you run into trouble. There are good reasons for CSV to exist – it's simple, lightweight, and human-readable – but its apparent simplicity hides a number of common issues. Be prepared to handle these issues when they occur.