The Regex warrior: transform a text file in <2 mins

Recently I needed to transform a csv file with some simple processing. If you want to transform a file line by line, replacing a few things, deleting a few others, regular expressions are your friend.

Sed is a unix command line utility for modifying files line by line with regular expressions. It’s simpler than awk, and works very similarly to the vim substitution you may be vaguely familiar with.

We’re going to take a csv file and do the following with regex + sed:

  1. Delete some rows that meet a condition
  2. Delete a column (alright, I admit excel’s not bad for this one)
  3. Modify some cells in the table
  4. Move some data to its own column

We’ll be using this mock csv file data set:

Dan Smith;56543678;dan_profile.jpg;;;false;true
James Jones;56467654;james_profile.png;;;false;true
Clément Girard;56467632;clement_profile.png;;;false;false
Jack Mack;56485367;jack_profile.png;;;true;false
Chris Cross;98453;chris_profile.png;;;true;true

Removing some users

First let’s remove users who are not enabled (last column == false). Sed lets us delete any line that return a match for a regex.

The basic structure of a sed command is like this:

sed 's/{regex match}/{substitution}/' <input_file >output_file

There’s also a delete command, deleting any line that has a match:

sed '/{regex match}/d' <input_file >output_file

Let’s use it

sed '/false$/d' <test.csv >output.csv

regex 1 delete unenabled users

That command deletes these line in the csv:

Clément Girard;56467632;clement_profile.png;;;false;false
Jack Mack;56485367;jack_profile.png;;;true;false

Some of our users also have an old user Id (only 5 digits). Let’s delete those users from our csv too.

sed -r '/;[1-9]{5};/d' <test.csv >output.csv

regex 2 delete user ids

That command deletes this line in the csv:

Chris Cross;98453;chris_profile.png;;;true;true

Here we’re using the OR syntax: []. This means that the next character in the match will be one of the chars between the braces, in this case it’s a range of possible digits between 1 and 9.

We’re also using the quantifier {}, it repeats the previous character match rule 5 times, so will match a 5 digit number.

Note we added the -r flag to sed so that we could use regex quantifier. This flag enabled extended regular expressions, giving us extra syntax.

Removing a column

Next we want to remove the admin column. Removing the ‘admin’ column title is easy enough, but let’s use regex to remove the data. Our csv has 2 boolean columns, admin and enabled, we want to match both of these, and replace the match with just the ‘enabled column’, which we want to keep.

sed -r 's/(true|false);(true|false)$/\2/' <test.csv >output.csv

regex 3 delete column

In this example we’ve used capture groups. By surrounding a match in parentheses, we can save it to a variable – the first capture is saved to ‘\1’, the second to ‘\2’ etc.

In the substitution section of our sed string, we replaced the entire match with capture group ‘\2’. In other words we’ve replaced the final two columns in each row with just the final column, thus removing the second-to-last column from the csv.

We’ve also used the pipe ‘|’ as an OR operator, to match ‘true’ or ‘false’.

We’re left with a csv that looks like this:

Dan Smith;56543678;dan_profile.jpg;;;true
James Jones;56467654;james_profile.png;;;true

Modifying cells in the table

Next we’re going to modify the url column to store the relative url rather than the absolute path. We can use a regex like this:

sed 's_http:\/\/site[.]com\/\(.*\)_\1_' <test.csv >output.csv

This is very difficult to read because we have to escape each forward slash in the url with a backslash. Fortunately, we can change the sed delimiter from a forward slash to an underscore. This means we don’t have to escape forward slashes in the regex part of our sed command:

sed -r 's_http://site[.]com/(.*)_\1_' <test.csv >output.csv

That’s much more readable!

regex 4 modify cell

Here we match any characters after the base url using .* (this will match everything after the base url in the row). We save that in a capture group, so we now have a string starting with the relative url. By substituting the match with this, we’ve replaced the full url with the relative url.

We’re left with a csv that looks like this:

Dan Smith;56543678;dan_profile.jpg;;user/dan;true
James Jones;56467654;james_profile.png;;user/james;true

Moving data to its own column

Finally, let’s take a column and split it into 2, moving some of its data to the new column. Let’s replace the name column with ‘first name’ and ‘last name’. We can start by renaming the headers in the first row, then use sed + regex to split each row in our csv in 2 automatically!

We could start with this:

sed -r 's/^([a-zA-Z]* )/\1;/' <test.csv >output.csv

regex 5 make column

Here we use OR square bracket [] notation again. In this case we match a character in a range of upper or lower case alphabetical characters. On Ubuntu Linux, this matches international alphabet characters like é, but this depends on your environment.

We save everything up to a space character (which delimits the first name from the last name) into a capture group and substitute it with itself plus a ‘;’ – thus moving first name into its own column.

The problem with this is our first name is left with a trailing space character before the column delimiter (;). It would look like this:

first name;last name;id;thumbnail;email;url;enabled
Dan ;Smith;56543678;dan_profile.jpg;;user/dan;true
James ;Jones;56467654;james_profile.png;;user/james;true

Instead, we can do something like this:

sed -r 's/^([a-zA-Z]*)( )(.*)$/\1;\3/' <test.csv >output.csv

This matches the space character, but also saves it into capture group ‘\2’. We then substitute the whole match with \1;\3 – effectively putting everything back together without the space, and putting a ‘;’ character in its place. We now have our new columns, first name and last name.

There’s actually an even easier solution than this, we just replace the first empty space in each row with a ‘;’ like this:

sed -r 's/ /;/' <test.csv >output.csv

That was fast and easy!

We’re left with a csv that looks like this:

first name;last name;id;thumbnail;email;url;enabled

With only a few commands, we’ve managed to transform a csv from our terminal.

You liked this article? You'd probably be a good match for our ever-growing tech team at Theodo.

Join Us

  • Jason Lin

    Nice explanation! The images with highlighting and accompanying notes were very useful!

  • Tim Jinkerson

    Well written, informative and with a very clean layout. Sed is a really useful tool, and I would recommend this post to anyone wishing to learn it.

  • Ross Smith

    Nice introduction to sed and regexes. One caveat, though: for portability, you should be using sed -E instead of sed -r. They both do the same thing, but -E is Posix standard and will work anywhere, while -r is specific to GNU sed. (In particular, -r won’t work on the Mac command line.)

  • Wojtek Bażant

    Nice! You demonstrate many features and I think this will be a very helpful article to many people.

    If you find yourself having to do this kind of thing a lot, have you tried Perl? 😉 In fact your examples will mostly work if you replace `sed -r` with `perl -pe`.

    When I do this kind of thing I pipe the individual commands into each other – so I don’t have that intermediate file, but always start from the beginning – at the end I can examine the whole chain, maybe even reuse it later.

    Also, some of the moves you described are contrived and inconvenient with `sed`, but would typically be done with `awk`: any time you want to “switch two columns around”, or indeed have columns. Or `perl -lane` for Awk mode.

  • Ben Burkhart

    Terrible article. Regular expressions are the wrong tool for the job here. There’s so many intricacies with the “CSV format” with embedded quotes, newlines, embedded commas, etc that your extremely simplistic use cases would never apply to any real life usage.

    There is a command line suite of tools for working with CSV content called “csvkit” That’s the right solution, not this rubbish.

  • Hi, which tool did you have use to build the images?
    Thank you

  • Hubert Liberacki

    Article is written in very decent form, however just as @disqus_o03JFI0CBI:disqus said, regex is a bad choice here.
    In the situation above, they have extremely poor performance, and they shall be used only for simple jobs. Definitely never use regex for implementing a parser, or some parser substitution.

    As it was mentioned before, use previously defined parser to do the job, it will be much more efficient. First thing in programming is do not reinvent the wheel when it’s not needed, unless for pure academic porpoises.

    If you find already written parsers bad, by bad I consider many conditions too big, too slow, too complexed etc. Then create a smaller version for your own, with subset of functionality but use good text parsing algorithms which are defined there for ages right now.

    Regex is always a bad tool, for that kind of job.

  • trbox

    Came here to ask the same.

  • paolomainardi