In this section we are going to learn the basics of using the Linux/Unix shell. We need to learn how to use this Command Line Interface (CLI) as it is the fastest and simplest way to manage our environment and the processes needed for our analysis pipelines. Many Graphical User Interfaces (GUIs) claim to handle these tasks for you, but they are generally slower to use than the CLI, and the CLI allows far greater automation of complex tasks via scripts. Finally, you will need to master this if you plan to use any supercomputers, as they all run Linux and the CLI is often the only way to access the system. This guide is fairly brief but will give you enough to get up and running.
Before we begin it will be useful to clarify the language surrounding this section. You will very often hear the content of this lecture in courses called "Introduction to Linux" or something similar, which is not entirely accurate. Formally, Linux is a free, open source, Unix-like operating system kernel. A kernel is the part of an operating system that is responsible for the lowest level tasks such as memory management, process/task management, and disk management. The Linux kernel forms the basis of several operating systems, called Linux distributions, such as Ubuntu, Fedora and Red Hat Enterprise Linux (RHEL), as well as Android and ChromeOS. macOS and iOS use a very similar Unix-like kernel, so they are sometimes said to be based on Linux, but their kernel is separate. Windows has its own kernel, but as of Windows 10 it includes a Linux subsystem that you can enable. We will not be learning anything about how to install or manage a Linux distribution or kernel in this course.
What we are going to be learning is "BASH" which stands for Bourne Again SHell. BASH is typically the default login shell for Linux where a shell is a programme that allows the user to provide text commands directly to the operating system. It is accessed via a terminal which is a programme that creates a window for the shell to run in. If you are on a supercomputer it will almost certainly be running Linux and you will only access the system via a terminal running the BASH shell. This is where the confusion between shell, terminal and Linux comes from.
BASH is actually a programming language that contains many commands which directly connect to the operating system. If you are on Linux or Mac (note that ZSH is the default shell on macOS in recent versions) you can just open a terminal window. If you are on Windows you will need to activate the Windows Subsystem for Linux (WSL), see here: https://itsfoss.com/install-bash-on-windows/
Now let's begin.
When you open a terminal window you will get a command prompt, and you will usually be in your home directory. The first thing we need to do is find out where we are, so we use `pwd` (print working directory):
$ pwd
This tells us where we are. Next we need to move to the folder where we want to work, but first we need to know where we could go. For this we use `ls` (list), which tells us what is in the directory we are in:
$ ls
We can add options to this command to get more information or to alter its behaviour. Nearly all commands have them, and they are added with the "-" prefix. For example we could add `-a` (all) to show hidden files (ones beginning with a `.`; typically these are hidden for a reason, as you shouldn't need to mess with them too often. We will see some examples as we go through the course):
$ ls -a
To get a list of all possible options for a command you just need to add `man` before it to consult the manual pages:
$ man ls
The manual is exited by typing "q". The options (bit after the dash) can be combined together and input in any order. For example to get the list in long format, by time, in reverse order, including hidden:
$ ls -ltra
NOTE -- Permissions
This list leads us to an important aspect of Linux which you need to consider when working with bash: permissions. The permissions for each file are described by the first 10 characters in long format (`drwxrwxrwx` would mean directory, then read, write, execute at `user`, then `group`, then `other` level. If a letter is replaced by `-` then the corresponding permission is denied for that set of users). This is followed by the link count, owner, group, size, date last modified, and filename.
The permissions are important as they protect your work and workspace from other users on the system. Typically you would have `rwx` for user so you can read and edit your files as well as execute them (which just means run them or, for directories/folders, open them), `r--` for group so people in your group can see your code but not edit or run it (directories will need `r-x`, otherwise they will not be able to open them; omitting `w` means that they can't create new files in your directories) and `---` or `r--` for other depending on taste. You should avoid giving either group or other `w`, as it means that they can edit your stuff (or replace your text with viruses) or, in the case of directories, dump unlimited content into your folders. When you create a file it will probably default to `-rw-r--r--` as this is safe. However, if it is a BASH script you will need to change it to `-rwxr--r--` in order to run it directly.
Permissions can be changed with `chmod`, which takes octal numbers as arguments, so `chmod 644 filename` makes `filename` readable by anyone and writeable by the owner only. This is because `6` in octal is `110` in binary, which translates to `rw-` (`1` is on, `0` is off) for user, and `4` in octal is `100` in binary, which translates to `r--` for everyone else, so 644 is `-rw-r--r--`. It also works in symbolic mode, where the same command would be `chmod u=rw,go=r filename`; see `man chmod` for many other options. The `d` cannot be changed (it's either a directory or it's not and `chmod` can't do much about that).
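The octal and symbolic forms can be compared side by side. The following is a minimal sketch, assuming a POSIX shell; the file names and the scratch directory from `mktemp` are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1

touch notes.txt run_me.sh

# Octal mode: 6 = rw- for user, 4 = r-- for group, 4 = r-- for other.
chmod 644 notes.txt
octal_result=$(ls -l notes.txt | cut -c1-10)

# Symbolic mode: the same permissions written out by name.
chmod u=rw,go=r notes.txt
symbolic_result=$(ls -l notes.txt | cut -c1-10)

# A script additionally needs x for the user before it can be run directly.
chmod 744 run_me.sh
script_result=$(ls -l run_me.sh | cut -c1-10)

echo "$octal_result $symbolic_result $script_result"
```

The `cut -c1-10` keeps only the ten permission characters from the long listing so they are easy to compare.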
Returning to `ls`, we can get the same long listing as before but only for the file called `hello.py`:
$ ls -ltra hello.py
or for all files ending in `.py`:
$ ls -ltra *.py
NOTE -- Wildcards
Here `*` is a wildcard, in that it can match any sequence of characters (including none at all). By characters we mean any symbol at all, not just letters. There are other wildcards you can use, including:
- `?` matches exactly one character, so `ma?` would match "mat", "map" and "man" but not "mast".
- `[]` matches any one of its contents, so `m[uao]m` matches "mum", "mam" and "mom". You can also use a dash to indicate a range, so `[0-9]` matches any numeral and `[a-z]` matches any lowercase letter. `!` negates the match, so `[!9]` will match any character but "9", and `^` negates in the same way, so `[^1-4]` will match any character but "1", "2", "3" or "4". There are also some standard (POSIX) sets you can use: `[[:lower:]]`, `[[:upper:]]`, `[[:alpha:]]`, `[[:digit:]]`, `[[:alnum:]]`, `[[:punct:]]`, and `[[:space:]]`, which match lowercase letters, uppercase letters, upper or lowercase letters, numeric digits, alpha-numeric characters, punctuation characters, and white space characters respectively.
- `{}` is a list of things, comma separated without spaces, so `file.{txt,csv}` expands to `file.txt file.csv`.
- `\` is used to make special symbols literal, so if you wanted to match `?` you would use `\?`.
You can combine any and all of these together, eg:
$ ls m?m?s*_[!0][0-9][0-9].py
Now let's learn some more commands
To change which directory we are in we use the `cd` (change directory) command:
$ cd some_directory
To get back one level:
$ cd ..
Or many levels:
$ cd ../../../../..
Or return to the home directory:
$ cd
We can also do this with
$ cd ~
as `~` is a shortcut for the value of the `HOME` shell variable. It is useful as it can be used with any command as part of a path, e.g. `cd ~/documents`.
To get to the root directory we use
$ cd /
and to go to the directory we were in last we use
$ cd -
All commands allow tab completion for file/directory names, so typing `cd m` and then pressing the Tab key would match all directories matching `m*` and complete as much as is unique, which is pretty helpful. When changing directory to something like "My Documents" we need to treat the space as literal, otherwise the shell thinks you've asked to change into two directories, which doesn't make sense. For this we can use either of:
$ cd "My Documents"
$ cd My\ Documents
For this reason you should avoid creating any directories or filenames with spaces in them when working on a Linux system.
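The navigation commands above can be tried safely in a scratch directory. This is an illustrative sketch, assuming a POSIX shell; the directory names are hypothetical.

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/My Documents/reports"

cd "$sandbox" || exit 1
start=$(pwd)

# Quote the name (or escape the space) so it is read as one directory.
cd "My Documents/reports" || exit 1
deep=$(pwd)

# ".." climbs one level per use.
cd ../.. || exit 1
after_up=$(pwd)

# "cd -" returns to the previous directory (and prints its path).
cd - > /dev/null || exit 1
after_dash=$(pwd)

echo "$start -> $deep -> $after_up -> $after_dash"
```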
Next we might want to create or destroy files. To create a file you can use any command that alters a file, as generally they will create the file if it does not exist. `touch` is a common one to use for this, as all it normally does is update the file's timestamp.
$ touch christmas_list.txt
will create a blank file called `christmas_list.txt`. Interestingly, it also lets you edit the access and modification time stamps to be whatever you want, which is helpful if you need to prove you were busy coding when the diamonds went missing. The syntax is:
$ touch -d 1999-12-25T01:32:24 christmas_list.txt
Deleting files is more specific. Here we use `rm` (remove). For directories we use `mkdir` to create them and `rmdir` to remove them:
$ mkdir tmp
$ cd tmp
$ touch testfile.txt
$ cd ..
$ rm tmp/testfile.txt
$ rmdir tmp
These all accept wildcards, so `rm *.out` removes all files ending in `.out`. By default `rm` refuses to remove directories; `rm -d` will remove empty directories, and with the `-r` option (recursive) it will delete the directory and everything in it. You can also disable confirmation with `-f` (force). Be *VERY* careful with this. `rm -rf *` will remove all files and directories from this directory down without confirmation. There is no "Bin" on the command line where files go while you think about things; deletions cannot be reversed, and once they are gone they are gone forever. `rm *` will not delete hidden files, as `*` will only match non-hidden ones.
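Because deletions are permanent, it is worth practising in a scratch directory first. The sketch below assumes a POSIX shell and uses invented file names; it shows which names each form of `rm` picks up.

```shell
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1

mkdir results
touch a.out b.out keep.txt .hidden_config results/run1.log

# Only the names matching *.out are removed.
rm *.out

# -r removes the directory together with everything inside it.
rm -r results

# * does not match names starting with a dot, so .hidden_config survives.
rm *

# -A lists hidden entries too, but leaves out "." and "..".
remaining=$(ls -A)
echo "$remaining"
```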
We can make copies of files with `cp` (copy), where the syntax is `cp file_from file_to`:
$ touch a.txt
$ cp a.txt b.txt
$ ls -ltr *.txt
or just move them with `mv` (move):
$ mv b.txt c.txt
$ ls -ltr *.txt
`mv` is for changing a file's directory or renaming files. Within a single file system `mv` is much quicker than `cp`, as `cp` actually duplicates all the data in a new location whereas `mv` only changes the name and path (the directory it is listed in). As an aside, directories are not physical containers on the disk; they are just records that help people keep track of files and aid display, so in this case `mv` never actually moves any data. `mv` can move multiple files using wildcards in the first argument, provided the last argument is a destination directory. It cannot rename multiple files via syntax like `mv *.csv *.txt`, which you might think would change the extension of all csv files.
To rename multiple files there is sometimes the command `rename` (it's non-standard so won't be on all distributions, which is a shame as it's handy; check with `man rename`), which has the form below (note that as it is non-standard the syntax can also differ between distributions!):
rename 'old string' 'new string' 'pattern to match files'
So to change all our `*.txt` files to `*_old.txt`:
$ rename .txt _old.txt *.txt
$ ls -ltr *.txt
If `rename` is not present in your distribution you can duplicate it with either a script or a single-line command. The command `find . -name "*.txt" -exec sh -c 'mv "$1" "${1%.txt}.csv"' _ {} \;` uses the `-exec` (execute) option to specify a command to run for each file found; the above changes the extension from ".txt" to ".csv". We will understand more of this command when we look at scripting later.
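When all the files sit in the current directory, a plain loop is another way to get the same effect. This is a sketch rather than a replacement for `rename`, assuming a POSIX shell; the file names are hypothetical and the `${name%.txt}` expansion is the same one used in the `find` command above.

```shell
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
touch alpha.txt beta.txt "two words.txt"

# The shell expands *.txt once, then the loop renames one file at a time.
for old_name in *.txt; do
    # ${old_name%.txt} is the name with the trailing ".txt" removed.
    mv -- "$old_name" "${old_name%.txt}.csv"
done

ls
```

Quoting `"$old_name"` keeps names containing spaces in one piece, and `--` stops a name beginning with a dash from being read as an option.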
Finally, when we need to find things we can use `find` to locate files in a directory tree. The syntax is `find` 'where to look' 'options, of which `-name` is the one you will want most often' 'filename with optional wildcards':
$ find . -name "*.txt"
NOTE - Wildcard expansion
Here is a real "trap for young players". If you just typed `find *.txt` you would think that the command worked perfectly, but it isn't doing what you think. When you use wildcards on the command line they are expanded by the shell before the command is run. So in this case the shell expands `*.txt` to match all files in the current directory, then `find` looks at each of them in turn. The correct version above will search for any file that matches `"*.txt"` in this folder and all sub-directories. The quotes indicate that we do not want the wildcards expanded but passed to the command as is. However, there is a difference between single and double quotes: both stop wildcards from being expanded, but `""` still allows variables such as `$HOME` and command substitutions to be expanded, whereas `''` passes everything through exactly as typed. The difference can be important.
There is additional complexity for users on macOS or Windows, where the behaviour can be a little different as they are not strictly Linux systems. (On my mac `find . -name *.txt` produces the correct behaviour even though it shouldn't. One likely cause is that when nothing in the current directory matches, bash passes the pattern through unexpanded, so the unquoted form only appears to work.)
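The effect of quoting can be checked by counting what `find` returns in each case. This is a minimal sketch, assuming a POSIX shell and made-up file names, with one `.txt` file at the top level and one in a sub-directory.

```shell
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
mkdir sub
touch top.txt sub/nested.txt

# Unquoted: the shell turns *.txt into "top.txt" before find starts,
# so find is only ever asked about that one name.
unquoted_hits=$(find *.txt | wc -l | tr -d ' ')

# Quoted: find receives the pattern itself and applies it in every directory.
quoted_hits=$(find . -name "*.txt" | wc -l | tr -d ' ')

# Neither kind of quote expands the wildcard, but double quotes still
# expand variables while single quotes leave everything alone.
ext=txt
double_quoted=$(echo "*.$ext")
single_quoted=$(echo '*.$ext')

echo "$unquoted_hits $quoted_hits $double_quoted $single_quoted"
```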
To simply read a file, to see what is in it, we can use `cat`, `more` or `less`:
$ cat d.txt
$ more d.txt
$ less d.txt
`cat` is good for small files as it reads the whole file at once and displays the text. Its main purpose is actually to concatenate files (join them together). `less` and `more` both show a page at a time and have a lot of options, with line numbers (`-N` in `less`) being the most useful. `head` and `tail` let you read from the top or bottom of a file, and `tail -f` (follow) is useful for tracking output to files your code is writing to without locking them (which can cause code to crash).
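A short sketch of `head`, `tail` and `cat` on a generated file, assuming a POSIX shell; the file names are invented for the example.

```shell
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1

# Build a ten-line file, one entry per line.
for n in 1 2 3 4 5 6 7 8 9 10; do echo "line $n"; done > numbers.txt

first_two=$(head -n 2 numbers.txt)
last_one=$(tail -n 1 numbers.txt)

# cat joins files end to end: here the same file twice gives 20 lines.
cat numbers.txt numbers.txt > doubled.txt
doubled_lines=$(wc -l < doubled.txt | tr -d ' ')

printf '%s\n' "$first_two" "$last_one" "$doubled_lines"
```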
If the files are large and we only want to find some particular section we can use `grep` (global search for a regular expression and print) to find text in a file, e.g.:
$ grep hello file.txt
will return the lines in `file.txt` which contain the text `hello` anywhere on them. The search pattern is a regular expression (regex); adding the option `-E` (extended) enables the extended syntax, which is needed for operators such as `+`, `?`, `{}` and `|` to work unescaped.
NOTE: Regular Expressions
It is important to note that regular expression wildcards are different to the wildcards we met earlier!!! Now we have the following:
- `.` matches a single character, rather than `?`
- `[a-z]` and standard sets work the same
- `a?`, `a*`, `a+` mean match 0 or 1 `a`, 0 or more `a`, 1 or more `a` respectively.
- `a{N}`, `a{N,}` and `a{N,M}` mean match N times, N or more times and N to M times respectively.
- `^` and `$` mean the match must be at the beginning or end of a line respectively
- `\<` and `\>` match the empty string at the start and end of a word
- `.*` gets you the behaviour of `*` from before, as it will match any number of `.`, which is any character.
The difference is that the first set of wildcards are for globbing, which matches filenames, and are expanded by the shell. The second set above are for regular expressions, which define search patterns for text, and are interpreted by the command itself (such as `grep`). If you remember the filename vs text search distinction this should help you to avoid confusing them.
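A few of these regular expression operators can be tried on a small word list. This is a sketch assuming a POSIX shell and a `grep` that supports `-E`; the words are chosen purely for illustration.

```shell
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
printf '%s\n' color colour colouur "my colour" cat cart > words.txt

# u? means "zero or one u"; ^ and $ pin the match to the whole line.
optional_u=$(grep -E '^colou?r$' words.txt)

# u{2} means "exactly two u".
double_u=$(grep -E 'colou{2}r' words.txt)

# . is any single character, so c.t matches "cat" but not "cart".
any_char=$(grep -E '^c.t$' words.txt)

# .* is "any number of any character", like the * glob; -c counts the lines.
any_run=$(grep -c -E '^c.*r' words.txt)

printf '%s\n' "$optional_u" "$double_u" "$any_char" "$any_run"
```

The patterns are in single quotes so that the shell hands them to `grep` untouched rather than trying to glob them.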
Here we also meet the importance of quotes. Suppose we have a file called `greeting.txt` which contains the text `hello`. For the following commands we would see:
$ grep hello greeting.txt -> hello
$ grep hello *.txt -> hello
$ grep hello "*.txt" -> grep: *.txt: No such file or directory
$ grep hello '*.txt' -> grep: *.txt: No such file or directory
`grep` has several useful options. `-A`, `-B` and `-C` (note capitalisation) followed by `num` will return `num` lines after, before, and both before and after the match respectively. This can be very useful for finding uses of functions in your code. `-n` will also print the line numbers. For example:
$ grep -n -C 3 'func1' *.py
will give you all the uses of `func1` in your python files in the current directory, with the preceding and trailing 3 lines, and give you the line numbers where they occur. Note that `*.py` is left unquoted here so that the shell expands it to the list of files.
Grep can also search for multiple things at once using `|`, which here means "or" (this needs the `-E` option), e.g.:
$ grep -E -n -C 3 'func1|func2' *.py
It can search multiple files with a glob, as here, or just by listing the files.
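The options above can be checked against a tiny file. This is a sketch assuming a POSIX shell and a `grep` with the `-E`, `-n` and `-A` options (both GNU and BSD versions have them); `code.py` and its contents are invented for the example.

```shell
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
cat > code.py <<'EOF'
def func1():
    return 1

def func2():
    return 2

total = func1() + func2()
EOF

# -n prefixes each matching line with its line number;
# -E lets | mean "or" inside the pattern.
either=$(grep -n -E 'func1|func2' code.py)

# -A 1 adds one line After each match; -B and -C work the same way
# for lines Before, and for Context on both sides.
with_after=$(grep -n -A 1 'def func2' code.py)

printf '%s\n' "$either" "---" "$with_after"
```

In the output, matching lines are marked with `:` after the line number and context lines with `-`.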
The up and down arrows allow you to navigate through previously entered commands, which you can run again with Enter. (Note that if you want to "scroll up" in the terminal to see the output of previous commands, and the mouse wheel is not working or available, you usually need to hold the Shift key while pressing the up/down arrows; otherwise you will only see the history of the commands.)
To see your command history you can use the built-in shell command `history`, which will display a numbered list of the commands you have entered previously. You can then retrieve a command using `!num`, which will run command number `num` from the history. If the list of commands is too long you can also "reverse search" your command history by pressing `CTRL+R` and typing the part of the command you remember. You can also use the up and down arrows to navigate from the retrieved command to find others.
Now that we have seen some basic commands we can learn one of the more powerful aspects of bash, which is redirection. Redirection lets us send the output of a command to a file or to another command. This lets us combine commands to perform some quite sophisticated things.
We have a selection of tools that let us redirect the input, `STDIN (0)`, and the outputs, `STDOUT (1)` and `STDERR (2)`, of a command.
STDIN --> | COMMAND | --> STDOUT
          |         | --> STDERR
Firstly, we can redirect `STDOUT` from a command to a file, or the contents of a file to a command, using `>` and `<`:
$ ls -l > output.txt
will list our directory and write the listing to a file called `output.txt`, which it will create if it does not exist (so for a new file `> somefile.txt` works the same as `touch somefile.txt`). Be aware that `>` overwrites a file that already exists. We can instead append to the end of an existing file using `>>`.
$ grep 'func' code.py >> output
`>` redirects only `STDOUT` to the file and leaves `STDERR` to be displayed in the terminal. To capture the errors we need to use `2>` (as `2` means `STDERR`), e.g.:
$ ls test.txt 2> errors.txt
If we want to capture both `STDOUT` and `STDERR` we can either use `&>` or combine the output:
$ ls *.txt &> output.txt
$ ls *.txt >output.txt 2>&1
Where the first combines the two and writes them to `output.txt`, and the second directs `STDOUT` to `output.txt` then directs `STDERR` to wherever `STDOUT` points. The `&` in front of the `1` makes it mean `STDOUT`, rather than a file called "1". The redirections are set up before the command is run, so both forms give the same result. This can be confusing, so let's look at the second case step by step.
| | Starting location | After `> output.txt` | After `2>&1` |
|---|---|---|---|
| STDOUT | /dev/tty | ./output.txt | ./output.txt |
| STDERR | /dev/tty | /dev/tty | ./output.txt |
where `/dev/tty` is just a special location which means the terminal. The other special location is `/dev/null`, which discards anything written to it. If we did the reverse, `ls *.txt 2>&1 >output.txt`, we would have:
| | Starting location | After `2>&1` | After `>output.txt` |
|---|---|---|---|
| STDOUT | /dev/tty (1) | /dev/tty (1) | ./output.txt |
| STDERR | /dev/tty (2) | /dev/tty (1) | /dev/tty (1) |
Where we have used numbers in brackets to differentiate the identical `/dev/tty` locations.
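The effect of the ordering can be demonstrated with a stand-in command that writes to both streams. This is a sketch assuming a POSIX shell; `noisy` is a hypothetical helper defined only for this example, and the command substitution plays the role of the terminal so the result can be inspected.

```shell
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1

# A stand-in command that writes one line to each output stream.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# STDOUT goes to the file first, then STDERR follows it there.
noisy > both.txt 2>&1

# Reversed: STDERR is pointed at the current STDOUT (captured here by
# the command substitution, standing in for the terminal) and only then
# is STDOUT moved to the file.
escaped=$(noisy 2>&1 > stdout_only.txt)

cat both.txt
echo "escaped: $escaped"
```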
We can also do the reverse and pass the contents of a file to `STDIN` using `<`:
$ grep hello < some_file.txt
which will search for `hello` in the file `some_file.txt` (you can do this without the `<` and it will still work fine). Why would you do this, as it seems unnecessary? There is a subtle difference between the two, which is that `<` anonymises the input: the command receives only the contents and never learns the filename. This means that if we compare the two methods:
$ wc -l dog.txt produces: 1 dog.txt
$ wc -l < dog.txt produces: 1
so the second removes the default printing of the filename from `wc` (word count; with `-l` it counts lines). There are also situations where feeding the input to commands this way is simpler.
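One case where the anonymous form helps is when the count is wanted as a plain number in a script. A small sketch, assuming a POSIX shell; `dog.txt` is created here only for the example, and `tr -d ' '` strips the padding some versions of `wc` add.

```shell
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
echo "woof" > dog.txt

# Given a filename, wc reports the name alongside the count.
with_name=$(wc -l dog.txt)

# Given only STDIN, wc has no name to report, which makes the
# result easy to store and compare as a plain number.
line_count=$(wc -l < dog.txt | tr -d ' ')

echo "$with_name"
echo "$line_count"
```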
The next key action we can do is to redirect the output of one command to another. This is called piping and is done with `|`, for example:
$ ls -l | grep Jan > January_Files.txt
Will list all files in our directory in long format, then use grep to select all those which were last edited in January, then put the output in a file called January_Files.txt. This is super useful (particularly with grep to search output) and can be used to do a wide range of things.
$ history|grep "grep" (find all grep commands you have used)
$ head -n 100 file.txt | tail -n 20 > lines_81_to_100.txt
$ ls -la | more (allows us to look at the output of `ls` one page at a time)
You can chain together as many commands as you want:
$ cat file.txt | sort | uniq | head -n 3 > first_three.txt
which uses `cat` to read the contents of the text file, then sorts it alphabetically with `sort`, removes duplicates with `uniq`, and selects the first three lines to output to a file.
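The same chain can be run on a small made-up list to see each stage at work. This is a sketch assuming a POSIX shell; `fruit.txt` and its contents are invented for illustration.

```shell
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
printf '%s\n' pear apple pear banana apple cherry > fruit.txt

# sort reads the file itself, so the leading cat is optional.
# uniq only drops adjacent repeats, which is why sort has to come first.
sort fruit.txt | uniq | head -n 3 > first_three.txt

cat first_three.txt
```

Removing stages from the end of the pipe one at a time is a simple way to inspect what each command contributes.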
- Make a directory called tmp and change into it
- Create a file test.txt
- Copy it to test2.txt then rename test.txt to test1.txt
- Delete the files, then delete the directory
- Go into the directory `pycamb` (if you have cloned the repository) then write a single-line command that finds all `.py` files and counts how many lines contain `if`
- Guess the meaning of the following:
$ ls i_do_not_exist *.txt 2> /dev/null | grep [de]
$ ls i_do_not_exist *.txt 2>&1 | grep [de]
$ ls i_do_not_exist *.txt 2>&1 1>/dev/null | grep [de]
$ ls i_do_not_exist *.txt 1>/dev/null 2>&1 | grep [de]