Keeping Files Tidy

Files quickly proliferate and need to be kept tidy. It is important that the correct people can access the files, and that file systems are well structured for easy navigation.

Applications produce files of many types and sizes. Small and large files each bring their own problems, so it is important to know what you are doing when you need to move things around.

While this tutorial describes some general principles of handling files within a Linux environment, it is mainly aimed at how to do things on the Apocrita cluster.

The file space on Apocrita is split into three areas, which are detailed on our documentation site. It is important to understand the distinction between the different locations we provide to store your data:

  • Home directories
  • Scratch
  • Research group storage space

Manual tidy-up of scratch space

Due to a quirk of the auto-deleting process, it may take many months to purge a file tree with a lot of nested directories, because a directory is marked as modified whenever a file is deleted from it. Over time, your scratch space may accumulate a lot of empty folders. To tidy these up manually, run find /data/scratch/$USER -type d -empty to list these directories, then delete them with find /data/scratch/$USER -type d -empty -delete.
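
The list-then-delete routine can be tried safely before touching real data. The sketch below is only an illustration: it works on a throwaway directory created with mktemp, and the folder names (old_run, results) are made up. The -mindepth 1 option is an addition not used in the commands above; it stops find from removing the top-level directory itself if everything beneath it turns out to be empty.

```shell
#!/bin/bash
# Demo on a temporary tree; for real use, set target=/data/scratch/$USER instead.
target=$(mktemp -d)
mkdir -p "$target/old_run/a/b" "$target/results"   # hypothetical folder names
touch "$target/results/output.txt"

# Step 1: list the directories that are empty right now (nothing is deleted).
find "$target" -mindepth 1 -type d -empty

# Step 2: delete them. -delete implies depth-first traversal, so a parent that
# becomes empty once its children are removed is deleted in the same pass.
find "$target" -mindepth 1 -type d -empty -delete

# Count what is left: only directories that still contain files survive.
remaining=$(find "$target" -mindepth 1 -type d | wc -l)
echo "directories left: $remaining"
```

Note that the listing in step 1 only shows directories that are already empty, so a chain of nested empty folders appears as a single entry even though step 2 removes the whole chain.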

Symbolic links

Symbolic links (symlinks) are special files that point to other files or directories. You might choose to create a symlink in the top level of your home directory that points to your scratch space:

cd $HOME
ln -s /data/scratch/$USER scratch

After this, running cd scratch from your home directory will take you to your scratch area. Because the symlink is only a pointer to a folder in another location, files you store through it do not count against your home directory quota.

Symlinks can point anywhere, and ls -l shows where each one points. If the destination is deleted, the symlink continues to exist but no longer works.

Broken symlinks like this can be found with find . -xtype l.
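
To see this behaviour end to end, the following sketch creates a symlink, removes its destination, and then finds the broken link. It runs in a temporary directory, and the names data and link are placeholders chosen for the example. The -xtype test is specific to GNU find, which is what Linux systems normally provide.

```shell
#!/bin/bash
# Work in a temporary directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo" || exit 1

mkdir data                 # hypothetical destination directory
ln -s "$demo/data" link    # symlink pointing at it

ls -l link                 # long listing shows "link -> <destination>"
readlink link              # prints only the destination path

rmdir data                 # remove the destination; the symlink is now broken

# -xtype l (GNU find) matches symlinks whose destination no longer exists.
broken=$(find . -xtype l)
echo "broken symlinks: $broken"
```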

Parent directory and current directory

Every directory below the root has two special entries. The . entry refers to the directory itself, and .. refers to its parent directory in the file tree. Executing cd .. moves you to the parent of your current directory. The pwd (print working directory) command prints the full path to your current location within the filesystem.
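
A short walk through a small tree shows how these entries behave. This is a minimal sketch in a temporary directory; the project and src folder names are invented for the example.

```shell
#!/bin/bash
base=$(mktemp -d)              # temporary tree for the demonstration
mkdir -p "$base/project/src"   # hypothetical folder names

cd "$base/project/src" || exit 1
pwd                            # full path, ending in /project/src

cd ..                          # move up to the parent directory
parent=$(pwd)
echo "$parent"                 # now ends in /project

cd ./src                       # "." is the current directory, so this equals: cd src
ls -a                          # -a lists the . and .. entries themselves
```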

Hidden files

All files (and folders) whose names start with a . are hidden from the default output of the ls command in Linux. Many configuration files and directories use this method so that your directories don't appear too cluttered. They work just like any other files and take up space, so bear them in mind when trying to reduce your disk usage (for example, the .conda directory holds your conda environments and may become very large). Running ls -la will additionally show hidden files.
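
As a rough illustration, the sketch below shows a hidden entry being skipped by plain ls and then measures hidden entries with the standard find and du commands. It uses a temporary directory, and .cache_demo is a made-up name; for real use you would point dir at your home directory instead. The -h options to du and sort assume the GNU versions found on Linux.

```shell
#!/bin/bash
# Demo directory with one visible and one hidden entry (names are hypothetical).
dir=$(mktemp -d)
mkdir "$dir/.cache_demo" "$dir/visible"
head -c 2048 /dev/zero > "$dir/.cache_demo/blob"

ls "$dir"       # default listing: .cache_demo is not shown
ls -la "$dir"   # -a includes hidden entries

# Size of each hidden entry directly inside $dir, smallest first.
find "$dir" -mindepth 1 -maxdepth 1 -name '.*' -exec du -sh {} + | sort -h

hidden=$(find "$dir" -mindepth 1 -maxdepth 1 -name '.*' | wc -l)
echo "hidden entries: $hidden"
```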

Since the du command (for displaying disk usage) is a bit fiddly, we have provided the dua utility, which shows space consumed by visible and hidden files, and is quicker than du. For example:

$ module load dua
$ dua
   512   B .gitignore
   512   B LICENSE
   512   B README.md
 21.81 MiB Notebooks
 29.13 MiB Crawler
 44.63 MiB .git
 45.19 MiB Evaluation
140.76 MiB total

If you are running this on directories containing a lot of files, it is advisable to run this in a qlogin session rather than on a frontend.

Faster listing of large directories

Running ls on a large directory containing thousands of files can be drastically slow. To remedy this, add the following lines to your .bashrc file (found in your home directory); this reduces the waiting time to a fraction of what it was (see here for why this works).

# make ls faster
export LS_COLORS='ex=00:su=00:sg=00:ca=00:'

Quotas

We apply quotas on both the number of files (the limit is generous, and can be increased with good reason) and disk space. If you exceed your quota, you will receive warning emails, and eventually you will be prevented from creating any more files until you move or remove files to get back below your quota limit.

qmquota will provide a detailed output of the space and file quotas in filesets you have access to. More detail is here.
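
Since the number of files counts towards quota as well as disk space, it can help to see which folders hold the most files. The sketch below is not part of qmquota; it is an approximate count using the standard find and wc commands, run here on a temporary tree with invented folder names (many, few). It counts regular files only, and the quota system may count entries differently.

```shell
#!/bin/bash
# Demo tree; for real use set top to a directory you own, e.g. "$HOME".
top=$(mktemp -d)
mkdir "$top/many" "$top/few"                 # hypothetical folder names
for i in 1 2 3 4 5; do touch "$top/many/file$i"; done
touch "$top/few/only"

# Print "<count> <directory>" for each visible immediate subdirectory,
# largest first. The */ pattern skips hidden directories.
for d in "$top"/*/; do
    n=$(find "$d" -type f | wc -l)
    echo "$n $d"
done | sort -rn

total=$(find "$top" -type f | wc -l)
echo "total files: $total"
```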

If you are moving a lot of files around, then we suggest you follow our guidelines on this page.

Finally

To move files around on the cluster, see our docs page on moving data. When a file is moved, the backup system needs to back up the file in the new location and then mark the old file to be purged from backups, so please bear in mind the potential impact on our backup system if you are moving very large amounts of data around on the cluster.

If you have any ideas for any storage topics you would like us to cover, please let us know.

If you also work with confidential data, you are encouraged to complete the new Data Classification module on QMPlus (QMUL users only).


Image Credit: shawnanggg on Unsplash