Keeping Files Tidy¶
Files quickly proliferate and need to be kept tidy. It is important that the correct people can access the files, and that file systems are well structured for easy navigation.
Applications will produce different types of files, both small and large. Each brings its own problems, so it is important to know what you are doing when you need to move things around.
While this tutorial describes some general principles for handling files within a Linux environment, it mainly focuses on how to do things on the Apocrita cluster.
The file space on Apocrita is split into 3 areas, which are detailed on our documentation site. It is important to understand the distinction between the different locations we provide to store your data:
- Home directories
- Research group storage space
Manual tidy-up of scratch space
Due to a quirk of the auto-deleting process, it may take many months to
purge a file tree with a lot of nested directories, because a directory is
marked as modified whenever a file is deleted from it. Over time your
scratch space may accumulate a lot of empty folders. To tidy these up manually,
you can run
find /data/scratch/$USER -type d -empty to list these
directories, and they can be deleted with
find /data/scratch/$USER -type d -empty -delete.
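The behaviour of these two commands can be tried safely in a throwaway sandbox before running them on real data. The following is a minimal sketch: the `demo` tree and its contents are hypothetical stand-ins for a scratch directory, created under a temporary directory from `mktemp`.

```shell
# Build a throwaway sandbox; "demo" is a hypothetical stand-in for a scratch directory.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/demo/a/b/c" "$sandbox/demo/results"
touch "$sandbox/demo/results/output.txt"

# List empty directories: at this point only the deepest one (a/b/c) is empty.
find "$sandbox/demo" -type d -empty

# -delete implies a depth-first traversal, so a/b and a are removed as well
# once their only child has gone.
find "$sandbox/demo" -type d -empty -delete

# Count the directories that survive (demo itself and demo/results).
remaining=$(find "$sandbox/demo" -type d | wc -l | tr -d ' ')
echo "directories left: $remaining"
```

Because the traversal is depth-first, a chain of nested empty directories is cleared in a single pass, while any directory that still holds a file is left alone.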
Symbolic links (symlinks) are special files that actually point to other files. You might choose to create a symlink in the top level of your home directory that points to your scratch space:
ln -s /data/scratch/$USER scratch
After this, running
cd scratch from your home directory will change directory
to your scratch area. Because the symlink is only a pointer to a folder in
another location, files you store through it do not count against your home
directory quota.
Symlinks can point anywhere and when listed with
ls -l they will tell you
where they point to. Should the destination be deleted, the symlink will
continue to exist, but will no longer resolve to anything.
Symlinks that are in this state can be found with
find . -xtype l.
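As a rough illustration of how a working and a broken symlink differ, the sketch below creates one of each in a temporary directory. The names `target_dir`, `good_link`, `missing_dir` and `broken_link` are made up for the example, and `-xtype` assumes GNU find, as found on most Linux systems.

```shell
# Work in a temporary directory so nothing in your home directory is touched.
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir target_dir
ln -s target_dir good_link      # symlink to an existing directory
ln -s missing_dir broken_link   # destination does not exist

ls -l                           # each link is listed as "name -> destination"
destination=$(readlink good_link)   # the path stored inside the link

# -xtype l matches symlinks whose destination cannot be resolved.
broken=$(find . -xtype l)
echo "broken links: $broken"
```

Only `broken_link` is reported, since `good_link` still resolves to a real directory.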
Parent directory and current directory¶
Every directory below the root has two special entries. The
. folder points to
the current directory, and
.. refers to the parent directory in the file system. Running
cd .. moves you to the parent of your current directory. The
pwd (print working directory) command will provide the full path to your
current location, within the filesystem.
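A short sketch of these entries in use, assuming a hypothetical `project/data` tree created in a temporary directory:

```shell
# Create a small hypothetical tree to move around in.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/project/data"

cd "$sandbox/project/data"
pwd                              # full path, ending in project/data

cd ..                            # up one level, into project
parent=$(basename "$(pwd)")

cd ./data                        # "./data" means "data, relative to the current directory"
child=$(basename "$(pwd)")

ls -a                            # the listing includes the . and .. entries
echo "visited $parent, then $child"
```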
All files (and folders) that start with a
. in Linux are hidden from the
default output of the
ls command. Many configuration files and directories
use this method, so that your directories don't appear too cluttered. They work
just like any other files and take up space, so it's good to bear in mind when
trying to reduce your disk usage (for example, the
.conda directory is the
location of your conda environments and may become very large). Running
ls -la will additionally show hidden files.
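The difference between the default and full listings can be seen with a minimal sketch like the one below. The file names are hypothetical, and the `.[!.]*` pattern is one common way to match hidden entries without also matching `.` and `..`.

```shell
# Create one visible file and two hidden entries in a temporary directory.
sandbox=$(mktemp -d)
cd "$sandbox"
touch visible.txt .hidden_config
mkdir .cache_demo

visible_count=$(ls | wc -l | tr -d ' ')    # default listing skips hidden entries
all_count=$(ls -A | wc -l | tr -d ' ')     # -A includes hidden entries, minus . and ..
echo "visible: $visible_count, including hidden: $all_count"

ls -la                                     # long listing, hidden entries included
du -sh .[!.]*                              # disk usage of each hidden entry
```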
Because the du command (for displaying disk usage) is a bit fiddly, we have the
dua utility, which shows space consumed by visible and hidden
files, and is quicker than
du. For example:
$ module load dua
$ dua
512 B .gitignore
512 B LICENSE
512 B README.md
21.81 MiB Notebooks
29.13 MiB Crawler
44.63 MiB .git
45.19 MiB Evaluation
140.76 MiB total
If you are running this on directories containing a lot of files, it is
advisable to run this in a
qlogin session rather than on a frontend.
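On a system where dua is not available, a similar summary can be approximated with plain du. This is only a sketch: the directory names and file sizes are invented for the demonstration, and the glob has to name hidden entries explicitly, which is part of what makes du fiddly.

```shell
# Build a small hypothetical project with one visible and one hidden directory.
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir Notebooks .git
head -c 200000 /dev/zero > Notebooks/big.bin
head -c 1000 /dev/zero > .git/small.bin

# -s: one summary line per argument, -k: sizes in KiB.
# "*" matches visible entries; ".[!.]*" adds the hidden ones.
report=$(du -sk -- * .[!.]* | sort -n)
echo "$report"
```

Sorting numerically puts the largest entry last, similar to the ordering of the dua output above.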
Faster listing of large directories¶
Listing a large directory containing thousands of files with
ls can be drastically slow. To remedy this, add the following lines to your
.bashrc file (found in your home directory) to reduce the waiting
time to a fraction.
# make ls faster
We apply quotas on both disk space and the number of files (the file limit is generous, and can be increased with good reason). If you exceed your quota, you will get warning emails, and eventually you will be restricted from creating any more files until you move or remove files to get below your quota limit again.
The qmquota command provides a detailed report of the space and file quotas in
the filesets you have access to. More detail is here.
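If the file-count quota is the one being approached, it can help to see which directories hold the most files. The sketch below uses only standard tools rather than qmquota; the `many` and `few` directories are hypothetical and are created in a temporary location.

```shell
# Create two hypothetical directories with different numbers of files.
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir -p many few
for i in $(seq 1 25); do touch "many/file_$i.txt"; done
for i in $(seq 1 3); do touch "few/file_$i.txt"; done

# Count regular files beneath each top-level directory, largest count first.
counts=$(for d in */; do
    printf '%s\t%s\n' "$(find "$d" -type f | wc -l | tr -d ' ')" "$d"
done | sort -rn)
echo "$counts"
```

Run from the top of a real directory tree, the same loop gives a quick view of where the bulk of the files live, though on very large trees it is best run in a qlogin session rather than on a frontend.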
If you are moving a lot of files around, then we suggest you follow our guidelines on this page.
To move files around on the cluster, see our docs page on moving data. When a file is moved, the backup system needs to back up the file in the new location and then mark the old file to be purged from backups, so please bear in mind the potential impact on our backup system if you are moving very large amounts of data around on the cluster.
If you have any ideas for any storage topics you would like us to cover, please let us know.
If you also work with confidential data, you are encouraged to complete the new Data Classification module on QMPlus (QMUL users only).
Image Credit: shawnanggg on Unsplash