How to extract columns from a delimited file?

· Read in about 3 min · (455 Words)

Have you ever needed to extract columns from a delimited file? I find myself in these situations fairly often. Here are a few more interesting ways to do it than opening in Excel and persing the columns.

If you need to process many files or you want to batch multiple such extractions, there are some utilities to help with that.

Although these generally are included on Linux, you can install them on Windows and they work just as well. If you are not on Windows 10 or don’t want to bother installing the beta Windows Subsystem for Linux, then you could download the free UnxUtils or similar distribution and add it to the path.

Many ways to do it, but I am going to show you how to do it with:

  • cut
  • awk

Option 1: cut

Cut can be used to process files or streams and extract columns from a delimited file or specific columns specified by character positions.

Here are the relevant options for the cut utility:

Usage: cut OPTION... [FILE]...
Print selected parts of lines from each FILE to standard output.

With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -b, --bytes=LIST        select only these bytes
  -c, --characters=LIST   select only these characters
  -d, --delimiter=DELIM   use DELIM instead of TAB for field delimiter
  -f, --fields=LIST       select only these fields;  also print any line
                            that contains no delimiter character, unless
                            the -s option is specified
....

Use one, and only one of -b, -c or -f.  Each LIST is made up of one
range, or many ranges separated by commas.  Selected input is written
in the same order that it is read, and is written exactly once.

For example:

cut -d "," -f3-5,22,33,35 input.csv > output.csv

takes as an input the input.csv and writes out to output.csv. It specifies the delimiter to be comma and the output to contain only columns 3,4,5,22,33, and 35.

Option 2: awk

AWK is a language for text processing that was developed in the 1970s (and still kicking). Awk is very powerful and there is much, much more to it than I have time and knowledge to describe here, so I will just go right to the example:

awk -F, '{OFS=",";print $3,$4,$5,$22,$33}' input.csv > output.csv

It does exactly the same as the cut example. Essentially, the -F specifies the input delimiter, and the text within the single quotes is the actual awk script. OFS variable holds the output delimiter and the print statement just prints the parsed fields.

Warning: Neither cut nor awk will work properly with escaped csv files! If you need to support escape charachters, you’d need to look for some other utility.

Happy extracting!