Shell - crib sheet

Even in a windows, Windows, world useful to know.
HPC resources and cloud platform- and infrastructure-as-a-service.
More than just file and directory management.
Bolt together programs into powerful data processing pipelines.
Automation.
Bash, "Bourne again shell"

Help

$ man COMMAND

Up and down arrows to scroll.

/ followed by search term (e.g. /help) then ENTER.

q to exit.

$ COMMAND --help # Command syntax, usage and other information.

Top tip: if writing your own executables, be consistent, and provide --help.

Comments

# This is a comment. It is not executed.

Who

$ who # who is logged on
$ whoami # who am I logged on as

Directories

$ pwd       # Path to current directory (folder in Windows)
$ ls        # List directory
$ ls *.txt  # Wild-card
$ ls *_hai*
$ ls -R     # Recurse
$ ls -F     # Append / to directories
$ ls -l     # Permissions, date, size, owner, group...
$ cd /      # Root directory
$ cd ~      # Home directory. There's no place like ~
$ ls -a     # Hidden files
$ ls .      # Current directory
$ ls ..     # Parent directory
$ cd ..
$ cd        # Default to home directory
$ mkdir directory
$ mkdir ~/directory
$ mv directory another_directory
$ rmdir empty_directory

Files

$ cat file           # View file
$ less file          # Page through file
$ more file          # Page through file
$ head -2 file       # First N lines
$ tail -3 file       # Last N lines
$ cp file1 file2     # Copy
$ cp *.txt directory
$ rm file.txt        # Delete - no recycle bin.
$ rm -r directory    # Recurse
$ rm -rf directory   # Recurse and force - beware

History

Up arrow browses previous commands

$ history
$ !NNNN   # Rerun Nth command in history

Word count

$ wc file     # Filter
$ wc -l file  # Lines only
$ wc -w file  # Words only
$ wc -l *.txt # Total

Use to find out number of records in a data file if one record per line.

Regular expressions

$ ls *.txt        # Zero or more characters
$ ls ?o*          # Exactly one character
$ ls a[bcde]*.txt # Exactly one of the characters listed
$ ls a[cde]*.txt  
$ ls *.*
$  *.[!txt]*      # No sequence involving t, x or t

Searching within files

Global/regular expression/print

$ grep the haiku.txt
$ grep day haiku.txt
$ grep is haiku.txt
$ grep 'it is' haiku.txt
$ grep -w is haiku.txt   # Exact match
$ grep -n it haiku.txt   # lines with matches
$ grep -i the haiku.txt  # Ignore case
$ grep -wn is haiku.txt
$ grep -wnv is haiku.txt # Non-matching lines
$ grep  -wnr Today *     # Recurse

Input and output redirection

> redirects output (AKA standard output)

$ grep -r not * > found_nots.txt
$ cat found_nots.txt
$ ls *.txt > txt_files.txt
$ cat txt_files.txt
$ cat                            # Echoes standard input
Blah
CTRL-D
$ cat > myscript.txt
Blah
CTRL-D
$ cat myscript.txt

< redirects input (AKA standard input).

$ cat < haiku.txt
$ ls idontexist.txt > output.txt
$ cat output.txt

Error message is output on standard error.

$ ls idontexist.txt 2> output.txt               # 2 is standard error
$ ls haiku.txt 1> output.txt                    # 1 is standard output
$ ls idontexist.txt haiku.txt > output.txt 2>&1

Exercise - grep

pdb/ contains a set of protein database files

Each .pdb file lists atoms in a protein

Write a single command that

Uses grep to find all hydrogen (H) atoms in all .pdb files.
Stores these in hydrogen.txt.

You will need wild card, exact matches output redirection

Problems with the solution?

Chains could be labelled with identifiers H and L (for heavy and light).
AUTHOR contains an initial e.g. HARRY H CORBETT.

Important:

Understand data.
Review script.
Validate that actual results equal expected results.

Searching for files

$ find .                # Find all
$ find . -type d        # Directories only
$ find . -type f        # Files only
$ find . -maxdepth 2    # Maximum depth of tree
$ find . -mindepth 3    # Minimum depth of tree
$ find . -name *.txt    # Fails as wild-card is expanded
$ find . -name '*.txt'  # Name or pattern matching
$ find . -iname '*.TXT' # Ignore case
$ find . -empty         # Empty files only
$ touch emptyfile.txt   # Create empty file
$ find . -empty

`` back-ticks execute a command

$ wc -l `find . -name '*.txt'`

Exercise - find

Write a single command that

Uses find to find all .pdb files.
Uses cat to list their contents.
Stores contents in proteins.txt.

You will need back ticks, find file name option, output redirection.

Power of the pipe

Count text files

$ find . -name '*.txt' > files.tmp
$ wc -l files.tmp

find outputs a list of files, wc inputs a list of files, skip the temporary file.

$ find . -name '*.txt' | wc -l

| is a pipe.

$ echo "Number of .txt files:" ; find . -name '*.txt' | wc -l

; equivalent to running two commands on separate lines.

Question: what does this do?

$ ls | grep s | wc -l

Answer: counts the number of files with s in their name.

$ history | grep 'wget'

Power of well-defined modular components with well-defined interfaces,

Bolt together to create powerful computational and data processing workflows.
Good design principle applicable to programming - Python modules, C libraries, Java classes - modularity and reuse.
"little pieces loosely joined" - history + grep = function to search for a command.

Exercise - pipes

Write a single command that

Uses find to find all .pdb files.
Uses cat to list their contents.
Uses grep to find all the hydrogen (H) atoms in their contents.
Uses wc to count the number of hydrogen atoms found.
Stores the count in hydrogen_count.txt.

You will need commands from previous exercises, back ticks, pipe.

Variables

$ set                            # See all variables
$ MYFILE=data.txt
$ echo $MYFILE
$ echo "My file name is $MYFILE"
$ bash                           # Spawn new shell
$ echo $MYFILE
CTRL-D
$ export MYFILE                  # Export to new shells
$ bash
$ echo $MYFILE
CTRL-D
$ echo $PATH
$ export PATH=$PATH:/path/to/bin # Common requirement

$ let NUM=$NUM+1                 # Simple arithmetic
$ TEXT_FILES=`ls *.txt`          # Save output in variable
$ echo TEXT_FILES

.bashrc to define variables and other actions to do when logging in e.g.

export JAVA_HOME=/opt/local/java1.5
export PATH=$JAVA_HOME/bin:$PATH
export ANT_HOME=/home/michelj/Software/apache-ant-1.7.0
export PATH=$ANT_HOME/bin:$PATH

Conditionals

$ NUM=1
$ if [ "$NUM" -eq 1 ]; then echo "Equal"; fi
$ WORD="hello"
$ if [ "$WORD" = "hello" ];  then echo "The same"; fi

Loops

$ for i in `cat file`; do echo $i; done | sort | uniq

$ for PDB in `find . -name '*.pdb'`; do
    echo $PDB
 done

Shell scripts

Save retyping.

$ nano protein_filter.sh
#!/bin/bash
DATE=`date`
echo "Processing date: $DATE"
for PDB in `find . -name "*.pdb"`; do
    echo $PDB
done
echo "Processing completed!"

$ sh protein_filter.sh
$ chmod +x protein_filter.sh # Mark as executable
$ ./protein_filter.sh

Exercise - shell scripts

Edit protein_filter.sh so that it

Has variables ATOM with value 'H' and PDB_EXT with value 'pdb'.
For each .pdb file it prints the file name and a count of the number of hydrogen atoms in that file.
Double-quotes are used in the find expression as this means shell variables are expanded.

You will need parts of your command from the previous exercise.

Edit protein_filter.sh so that it takes the atom value from the command-line e.g.

$ ./protein_filter.sh H
$ ./protein_filter.sh C

$1 provides access to the first command-line argument.

Permissions

$ ls -l # permissions, dates, sizes, owner, group, size in byte, creation/modification date/time, name.

Users, groups, others.

Read, write, execute.

$ chmod a+r haiku.txt     # Add permission - all read
$ chmod a-r haiku.txt     # Remove permission - all not read
$ chmod u+r haiku.txt     # User read
$ chmod g+w haiku.txt     # Group write
$ chmod o+x haiku.txt     # Other execute
$ chmod g+rx haiku.txt    # Group read and execute
$ chmod go+rx haiku.txt   # Group and other read and execute
$ chmod ugo=rwx haiku.txt # Set permission

Jobs and processes

$ ./counter.sh > output.txt
CTRL^Z                        # Suspend
$ wc -l output.txt
$ jobs -l                     # Jobs, job number, process ID
$ fg JOBNUMBER                # Resume in foreground
CTRL^Z
$ bg JOBNUMBER                # Resume in background
$ wc -l output.txt            # Still working
$ ./counter.sh > output.txt & # Start in background, job number, process ID
$ kill PROCESSID                  
$ jobs
$ ps                          # Processes
$ top                         # Resource consumption
$ bash
$ nohup ./counter.sh > output.txt & # Continue after log out
$ nohup allows processes to continue even after the user logs out.
CTRL-D
$ wc -l output.txt

Secure shell

$ ssh username@boot-camp.software-carpentry.org
$ ssh username@boot-camp.software-carpentry.org ls # Run remote command
$ scp file.txt username@boot-camp.software-carpentry.org:
$ scp username@boot-camp.software-carpentry.org:directory/file.txt . # Relative path
$ scp -r username@boot-camp.software-carpentry.org:directory copy

Packaging

$ zip -r pdb.zip pdb  # Package and compress, recurse
$ mkdir tmp
$ cp pdb.zip tmp
$ cd tmp
$ unzip pdb.zip

$ tar -cvf pdb.tar pdb # Create, list files, file archive
$ mkdir tmp
$ cp pdb.tar tmp
$ tar -xf pdb.tar
$ zip pdb.tar

Top tip: If preparing bundles, put your content in a directory then zip or tar up that single directory. It can be annoying if someone unzips or untars a bundle and it spews its contents all over their directory, possibly overwriting their files.

Top tip: If preparing bundles of your software put the version number or a date in the name. If someone asks for advice, you'll know exactly what version they have.

$ ls -l
$ gzip pdb.tar
$ gunzip pdb.tar.gz

$ ls -l pdb.zip
$ md5sum pdb.zip

Top tip: when putting packages up for download also put up the file size and MD5 sum so people can check they've not been tampered with.

Download files via command-line

Primary Care Trust Prescribing Data - April 2011 onwards

$  wget http://www.ic.nhs.uk/catalogue/PUB02342/prim-care-trus-pres-data-apr-jun-2011-dat.csv

script

$ script
$ ls -l
$ CTRL-D
$ cat typescript

Record commands typed, commands with lots of outputs, trial-and-error when building software.

Send exact copy of command and error message to support.

Turn into blog or tutorial.

Shell power

(Bentley, Knuth, McIlroy 1986) Programming pearls: a literate program Communications of the ACM, 29(6), pp471-483, June 1986. DOI: [10.1145/5948.315654].

Dr. Drang, More shell, less egg, December 4th, 2011.

Common words problem: read a text file, identify the N most frequently-occurring words, print out a sorted list of the words with their frequences.

10 plus pages of Pascal ... or ... 1 line of shell

$ nano words.sh
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
$ chmod +x words.sh
$ nano words.sh < README.md
$ nano words.sh < README.md 10

"A wise engineering solution would produce, or better, exploit-reusable parts." - Doug McIlroy

Summary

Shell and scripts

Modular components with specific responsibilities and well-defined APIs.
Glue together into powerful computational and data processing pipelines.
Reduce errors by reusing tried-and-tested components.
Reduce errors by automation - history, shell scripts.
Don't reinvent the wheel.
Free up time to do research.

Links

Software Carpentry's online shell lectures.
G. Wilson, D. A. Aruliah, C. T. Brown, N. P. Chue Hong, M. Davis, R. T. Guy, S. H. D. Haddock, K. Huff, I. M. Mitchell, M. Plumbley, B. Waugh, E. P. White, P. Wilson (2012) "Best Practices for Scientific Computing", arXiv:1210.0530 [cs.MS].

Exercises

grep

grep -w 'H' *pdb > hydrogen.txt

Check attendees all have same result.

wc hydrogen.txt

find

cat `find . -name '*.pdb'` > proteins.txt

Check via comparing expected to actual result - lookahead to testing.

wc pdb/*.pdb
wc proteins.txt

Pipes

cat `find . -name '*.pdb'` | grep -w 'H' | wc -l > hydrogen_count.txt

Shell scripts

DATE=`date`
ATOM=$1
PDB_EXT="pdb"

echo "Processing date: $DATE"
for PDB in `find . -name "*.$PDB_EXT"`; do
    COUNT=`grep -w $ATOM $PDB | wc -l`
    echo "$PDB $COUNT"
done
echo "Processing completed!"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Shell - crib sheet

Help

Comments

Who

Directories

Files

History

Word count

Regular expressions

Searching within files

Input and output redirection

Exercise - grep

Searching for files

Exercise - find

Power of the pipe

Exercise - pipes

Variables

Conditionals

Loops

Shell scripts

Exercise - shell scripts

Permissions

Jobs and processes

Secure shell

Packaging

Download files via command-line

script

Shell power

Summary

Links

Exercises

grep

find

Pipes

Shell scripts

Files

README.md

Latest commit

History

README.md

File metadata and controls

Shell - crib sheet

Help

Comments

Who

Directories

Files

History

Word count

Regular expressions

Searching within files

Input and output redirection

Exercise - grep

Searching for files

Exercise - find

Power of the pipe

Exercise - pipes

Variables

Conditionals

Loops

Shell scripts

Exercise - shell scripts

Permissions

Jobs and processes

Secure shell

Packaging

Download files via command-line

script

Shell power

Summary

Links

Exercises

grep

find

Pipes

Shell scripts