BMI Students

Sunday, March 06, 2005

Text files and Unix

I often have large text files to deal with. There are three essential Unix utilities for doing this (without resorting to awk).

1. less: just like more, except better. You can scroll backwards and use '/' to search (like vi). It also doesn't have to read the whole file before displaying it, so it opens large files quickly.
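2. sort: can sort a file according to one or more of its fields. For instance, "sort -k 10,10 -n file" sorts "file" by the 10th field, and does so numerically (as opposed to alphabetically). A plain "-k 10" uses everything from the 10th field to the end of the line as the key. To sort on several fields, give one -k per field: "sort -k 10,10 -k 11,11 -k 12,12 file".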
3. cut: extracts selected columns from each line of a file. For instance, "cut -c1-100 file" shows the first 100 characters of each line in "file" (positions are numbered from 1, and GNU cut rejects position 0). If you have a big DNA sequence, all on one line, then you can cut out your area of interest easily.
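
Here is a small sketch of sort and cut working together. The sample file, its gene names and its three columns are made up for illustration, and the field and character positions are chosen to fit that toy data rather than any real file.

```shell
# Build a tiny whitespace-delimited sample file (hypothetical data).
tmpdir=$(mktemp -d)
printf '%s\n' 'geneA chr1 300' 'geneB chr2 25' 'geneC chr1 1000' > "$tmpdir/sample.txt"

# Sort numerically on the third field only (-k 3,3 ends the key at field 3).
sorted=$(sort -k 3,3 -n "$tmpdir/sample.txt")
printf '%s\n' "$sorted"

# Keep only the first 5 characters of each line; positions start at 1.
clipped=$(printf '%s\n' "$sorted" | cut -c1-5)
printf '%s\n' "$clipped"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

Without -n, the third field would be compared as text, so 1000 would sort before 25.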

"sort -k 10 file | cut -c1-100 | less" = sweeeet....

5 Comments:

  • Yes you can actually do quite a bit within unix with just a few commands. Use them some and it becomes very natural. Other commands I use all the time:
    wc
    uniq (with sort)
    grep (very key)

    These less frequently: paste, head, tail

    Finally, these are probably good too, but I haven't gotten into the habit of using them: tr, expand, unexpand

    By Blogger Serge, at 10:56 AM  
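
As a rough illustration of the wc, uniq and grep commands listed in the comment above, the sketch below counts values in a made-up list of chromosome labels. The labels and variable names are assumptions for the example only.

```shell
# Hypothetical sample: one chromosome label per line.
labels=$(printf '%s\n' chr2 chr1 chr2 chrX chr2 chr1)

# uniq only collapses adjacent duplicates, so sort first.
# -c prefixes each distinct line with its count; sort -rn puts the largest first.
counts=$(printf '%s\n' "$labels" | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"

# grep -c counts matching lines; -x requires the whole line to match.
n_chr1=$(printf '%s\n' "$labels" | grep -c -x 'chr1')

# wc -l counts lines; tr strips the padding some versions of wc add.
n_total=$(printf '%s\n' "$labels" | wc -l | tr -d ' ')
echo "chr1: $n_chr1 of $n_total"
```

This is why uniq is usually paired with sort: on unsorted input it would report the same label several times.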

  • I have an unholy love for cut (and its cool older brother gcut). If cut -f2,3,1 would do what I think it should do, I would marry it.

    By Blogger jchang, at 7:28 PM  

  • tail is pretty cool. To expound on that, you can skip the first few lines of a text file with tail, for example:

    # skip the first line
    tail -n +2 FILENAME

    Older versions of tail also accept the shorter "tail +2 FILENAME", but that syntax is obsolete and some versions do not support it, so check the man page.

    Also, if your program is outputting to a file, and you want to follow its progress, you can use:

    # follows the output
    tail -f OUTPUTFILE

    I also would add join to the list of cool tools. I use it often to look for intersections between lists.

    # common elements
    join FILE1 FILE2
    # unique to FILE1
    join -v 1 FILE1 FILE2
    # unique to FILE2
    join -v 2 FILE1 FILE2

    join requires the input files to be sorted on the join field. There are options to specify which field is the key and what the output format looks like (man join).

    Finally, if you are using bash (the best shell ever ;) ), you can sort right on the command line:

    join <(sort FILE1) <(sort FILE2)

    This is bash's process substitution. The <(command) construct runs command and hands the calling command a filename from which its output can be read. Depending on the system, that filename is a /dev/fd entry or a temporary named pipe.

    By Blogger Mike, at 12:05 PM  
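
The three join calls from the comment above can be tried on toy data as sketched below. The gene IDs and file names are invented for the example. It sorts into temporary files instead of using <(sort ...), so that it also runs in shells without process substitution.

```shell
tmpdir=$(mktemp -d)

# Two hypothetical ID lists, deliberately unsorted.
printf '%s\n' gene3 gene1 gene2 > "$tmpdir/list1.txt"
printf '%s\n' gene4 gene2 gene3 > "$tmpdir/list2.txt"

# join expects both inputs sorted on the join field (field 1 by default).
sort "$tmpdir/list1.txt" > "$tmpdir/list1.sorted"
sort "$tmpdir/list2.txt" > "$tmpdir/list2.sorted"

# Lines whose key appears in both files.
common=$(join "$tmpdir/list1.sorted" "$tmpdir/list2.sorted")
# Unpairable lines from the first file only.
only1=$(join -v 1 "$tmpdir/list1.sorted" "$tmpdir/list2.sorted")
# Unpairable lines from the second file only.
only2=$(join -v 2 "$tmpdir/list1.sorted" "$tmpdir/list2.sorted")

printf 'common:\n%s\n' "$common"
printf 'only in list1: %s\n' "$only1"
printf 'only in list2: %s\n' "$only2"

# Clean up the temporary directory.
rm -r "$tmpdir"
```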

  • My friend Devin also loves cut with his entire body, including his pee-pee. It seems to inspire strong feelings and urges.

    By Blogger brian, at 1:25 PM  

  • Also a great unix utility: sdiff. See the diff of two files side by side - this is really useful.

    By Blogger brian, at 4:41 PM  
