Why you can’t parse CSV with a regular expression

Regular expressions are a very useful tool in a programmer’s toolbox. But they can’t do everything. And one of the things they can’t do is reliably parse CSV (comma separated value) files. This is because a regular expression doesn’t store state. You need a state machine (or something equivalent) to parse a CSV file.

For example, consider this (very short) CSV file (3 double quotes + 1 comma + 3 double quotes):

""","""

This is correctly interpreted as:

quote to start the data value + escaped quote + comma + escaped quote + quote to end the data value

I.e. a single value of:

","

How each character is interpreted depends on what characters come before and after it. E.g. the first quote puts you into an ‘inside data’ state. The second quote puts you into a ‘might be an escape for the following character or might be the end of the data’ state. The third quote puts you back into an ‘inside data’ state.
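The state transitions described above can be sketched in Python. This is only an illustrative sketch, not a complete CSV parser: the function name parse_csv and the state names are made up for this example, it assumes '\n' line endings, and it is lenient about malformed input.

```python
def parse_csv(text):
    """Parse CSV text into a list of rows using an explicit state machine."""
    # Hypothetical state names, chosen for this sketch only.
    START, UNQUOTED, IN_QUOTES, QUOTE_SEEN = range(4)
    rows, row, field = [], [], []
    state = START
    for ch in text:
        if state == IN_QUOTES:
            if ch == '"':
                state = QUOTE_SEEN   # escape or end of data? not known yet
            else:
                field.append(ch)     # commas and newlines are plain data here
        elif state == QUOTE_SEEN and ch == '"':
            field.append('"')        # doubled quote: it was an escaped quote
            state = IN_QUOTES
        elif ch == ',':
            row.append(''.join(field))   # end of one data value
            field = []
            state = START
        elif ch == '\n':
            row.append(''.join(field))   # end of value and end of row
            field = []
            rows.append(row)
            row = []
            state = START
        elif ch == '"' and state == START:
            state = IN_QUOTES        # quote that starts a data value
        else:
            field.append(ch)
            state = UNQUOTED
    # Flush the last row if the text did not end with a newline.
    if state != START or field or row:
        row.append(''.join(field))
        rows.append(row)
    return rows


# The 7 character file from the example: one row holding one 3 character value.
print(parse_csv('""","""'))
```

The same character (a double quote) is handled in three different ways depending on the current state, which is exactly the information a single pattern match has no place to keep.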

No matter how complicated a regex you come up with, it will always be possible to create a CSV file that your regex can’t correctly parse. And once the parsing goes wrong, everything after that point will be garbage.

You can write a regex that can handle CSV files where you are guaranteed there are no commas, quotes or carriage returns in the data values. But commas, quotes and carriage returns in the data values are perfectly valid in CSV files. So it is only ever going to handle a subset of all the possible well-formed CSV files.
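As a small illustration of that restricted subset, here is one naive regex-based approach in Python. The names FIELD_SEPARATOR and naive_parse are hypothetical, and this shows only one simple pattern failing; it is an example of the problem, not a proof that every regex fails.

```python
import re

# Hypothetical naive pattern: treat every comma as a separator.
FIELD_SEPARATOR = re.compile(r',')


def naive_parse(text):
    """Split each line on commas, ignoring quoting entirely."""
    return [FIELD_SEPARATOR.split(line) for line in text.splitlines()]


# Fine when the data values contain no commas, quotes or line breaks.
print(naive_parse('a,b,c\n1,2,3'))   # [['a', 'b', 'c'], ['1', '2', '3']]

# Wrong for the example file: the comma inside the quoted value is
# treated as a separator, so one value becomes two.
print(naive_parse('""","""'))        # [['"""', '"""']]
```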

Note that you can parse a TSV (tab separated value) file with a regex, as TSV files are (generally!) not allowed to contain tabs or carriage returns in data and therefore don’t need escaping.
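A minimal sketch of that in Python, assuming a TSV dialect with no escaping or quoting at all, so every tab is a separator and every line break ends a row. The function name parse_tsv is made up for this example.

```python
import re


def parse_tsv(text):
    """Split TSV text into rows and fields using only regular expressions."""
    lines = re.split(r'\r?\n', text)
    if lines and lines[-1] == '':
        lines.pop()   # ignore the empty string after a trailing line break
    return [re.split(r'\t', line) for line in lines]


# Quotes and commas are just ordinary data in this dialect.
print(parse_tsv('name\tquote\nAnn\t"hi, there"\n'))
```

No state is needed here because a tab or line break can only ever mean one thing.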

See also on Stack Overflow:

Using regular expressions to parse HTML: why not?