Miscellaneous stuff around HPC Scheduling and Batch Systems.
How to integrate an on-demand file system into SLURM can be found here.
A more advanced example of how to apply a submit filter to Moab is available here.
On Demand File System (BeeGFS) integration into SLURM
In modern HPC systems, parallel (distributed) file systems such as Lustre, BeeGFS and IBM Spectrum Scale are used as storage file systems. However, the I/O performance of HPC systems has failed to keep up with the increase in computational power. As a result, the I/O subsystem is often the bottleneck of the entire HPC system. At the same time, (NVMe) SSDs are appearing in the compute nodes. These SSDs are usually only visible within a single compute node. For parallel distributed jobs, these node-local SSDs can be used as storage for a BeeGFS On-demand (BeeOND) file system.
The figure illustrates the concept with two jobs that allocate three nodes each. In the upper part, the jobs write directly to the global parallel file system. In the lower part, a private parallel file system is created for each job. This private ODFS is only available on the nodes allocated to the job and uses the node-local storage (in this case the SSDs).
Before integrating BeeOND (BeeGFS On Demand) into the Slurm batch system, make sure BeeOND/BeeGFS works when started manually. Read the BeeOND docs for installation and testing. It is recommended to have pdsh installed.
#Manual testing
root@node01:~$ cat mynodefile
node01
node02
node03
node04
#create directories on all nodes
#/mnt/odfs is going to be the mountpoint for the file system
#BeeGFS will store the data in /mnt/beeond
root@node01:~$ pdsh -w node0[1-4] mkdir /mnt/odfs /mnt/beeond
#-P to test that everything works fine with pdsh; if pdsh is not available, don't use -P
root@node01:~$ beeond start -n mynodefile -d /mnt/beeond -c /mnt/odfs -P
root@node01:~$ df -h /mnt/odfs
beegfs_ondemand 1000G 7,2G 993G 1% /mnt/odfs
#You should be able to read and write in /mnt/odfs from all nodes (node[1-4])
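A quick way to verify the mount from all nodes and tear the manual test down again; the stop options (-L deletes the log files, -d deletes the stored data) follow the BeeOND documentation, so double-check them against your installed version:
#write a test file from one node and check that all nodes see it
root@node01:~$ dd if=/dev/zero of=/mnt/odfs/testfile bs=1M count=100
root@node01:~$ pdsh -w node0[1-4] ls -lh /mnt/odfs/testfile
#stop the manual test instance again
root@node01:~$ beeond stop -n mynodefile -L -d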
First, create /etc/odfs/odfs.conf on all nodes:
#Where to mount the loopback image
#!!!! create /mnt/loop42 on all nodes
ODFSLOOPBACKMOUNTPOINT=/mnt/loop42
#Location of image file
#Base names; ${SLURM_JOB_ID} is appended in the following settings
ODFSIMAGE=/tmp/odfs.img.${SLURM_JOB_ID}
ODFSNODEFILE=/tmp/nodefile.odfs.${SLURM_JOB_ID}
#!!!! create /mnt/odfs dir on all nodes
ODFSMOUNTPOINT=/mnt/odfs/${SLURM_JOB_ID}
ODFSSTATUSFILE=/tmp/odfs-status.${SLURM_JOB_ID}
#configure connection-based authentication file
ODFSCONNAUTHFILE=/tmp/odfs_caf
#################################
#Which filesystem to put on imagefile
ODFSMKFSTYPE=/sbin/mkfs.xfs
#Quiet
ODFSMKFSOPTION=-q
#mountoptions
ODFSMOUNTOPTION=-onoatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,loop
#Size of the image file; dd with seek creates a sparse file, so it only grows with usage
ODFSIMAGESIZE=250G
#Create the loopback image on node-local storage (this flag is checked in the prolog below)
ODFSLOOPBACK=1
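For illustration, with a hypothetical job ID of 4242 the job-specific paths in this configuration expand as follows:
#check how the per-job paths expand for job 4242
root@node01:~$ SLURM_JOB_ID=4242 bash -c 'source /etc/odfs/odfs.conf; echo ${ODFSIMAGE} ${ODFSMOUNTPOINT} ${ODFSSTATUSFILE}'
/tmp/odfs.img.4242 /mnt/odfs/4242 /tmp/odfs-status.4242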
Create a helper script /etc/odfs/functions.sh on all nodes:
function odfs_create_loopmount() {
dd if=/dev/zero of=${ODFSIMAGE} bs=1 count=0 seek=${ODFSIMAGESIZE} status=none
${ODFSMKFSTYPE} ${ODFSMKFSOPTION} ${ODFSIMAGE}
mount ${ODFSMOUNTOPTION} ${ODFSIMAGE} ${ODFSLOOPBACKMOUNTPOINT}
}
function odfs_remove_loopmount() {
i=0
while [[ $i -le 5 ]]
do
# Sometimes BeeOND just hangs
# Try a few times then give up
odfs_kill_beeond
sleep 3
mountpoint -q ${ODFSMOUNTPOINT}
if [[ "$?" -ne "0" ]]
then
break
fi
i=$((i+1))
done
#message "removing loopmount"
k=0
while [[ $k -le 5 ]]
do
logger "running umount on ${ODFSLOOPBACKMOUNTPOINT}"
/usr/bin/umount -d -f -v ${ODFSLOOPBACKMOUNTPOINT}
sleep 1
mountpoint -q ${ODFSLOOPBACKMOUNTPOINT}
if [[ "$?" -ne "0" ]]
then
break
fi
k=$((k+1))
done
rm ${ODFSIMAGE}
}
function odfs_kill_beeond() {
/usr/bin/beeond stoplocal -i /tmp/beeond.tmp -q -L -c -u
killall -q /opt/beegfs/sbin/beegfs-storage
killall -q /opt/beegfs/sbin/beegfs-helperd
killall -q /opt/beegfs/sbin/beegfs-mgmtd
killall -q /opt/beegfs/sbin/beegfs-meta
/usr/sbin/rmmod beegfs
rm /tmp/beeond.tmp
#logger "rmmod beegfs"
}
function odfs_stop_beeond() {
#user requested the odfs and now stop it
#try the nice way
#/usr/bin/beeond stop -q -n ${ODFSNODEFILE} -i /tmp/beeond.tmp -L -c
logger "odfs stopping beenod"
/usr/bin/beeond stoplocal -q -i /tmp/beeond.tmp -L -c -u
rm ${ODFSNODEFILE} ${ODFSSTATUSFILE}
}
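The loopback helpers can be exercised on a single node outside of Slurm, for example with a dummy job ID; this is only a sketch that tests the image and mount handling, not BeeOND itself:
#create, inspect and remove the loopback file system with a dummy job ID
root@node01:~$ SLURM_JOB_ID=test1 bash -c '
  source /etc/odfs/odfs.conf
  source /etc/odfs/functions.sh
  odfs_create_loopmount
  df -h ${ODFSLOOPBACKMOUNTPOINT}
  umount ${ODFSLOOPBACKMOUNTPOINT} && rm ${ODFSIMAGE}
'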
Next, add the following two functions to your prolog and epilog; some modifications will be needed depending on your setup. /etc/slurm/prolog:
function start_odfs() {
if [[ ${SLURM_JOB_CONSTRAINTS} =~ "BEEOND" ]] ; then
source /etc/odfs/odfs.conf
source /etc/odfs/functions.sh
#Enable Debug for some output
ODFSDEBUG=0
ODFSDEBUGLOG=/tmp/odfs-slurm.$SLURM_JOB_ID
if [[ $ODFSDEBUG == 1 ]];then
#create file initial
echo "prolog beeond AAAA" > $ODFSDEBUGLOG
echo "constraint found $SLURM_JOB_ID" >> $ODFSDEBUGLOG
fi
job_hosts=`scontrol show hostnames`
head_node=`echo $job_hosts | awk '{print $1;}'`
if [[ $ODFSDEBUG == 1 ]];then
echo $job_hosts >> $ODFSDEBUGLOG
echo $head_node >> $ODFSDEBUGLOG
export >> $ODFSDEBUGLOG
fi
#Create the dd image file and loopback mount on every node, if configured
if [[ ${ODFSLOOPBACK} == 1 ]]; then
#This prologue must run on all allocated nodes
#check your slurm config
odfs_create_loopmount
fi
#Create nodefile for BeeOND
#nodefile=/tmp/odfs-slurm-nodelist.$SLURM_JOB_ID
echo $job_hosts | tr " " "\n" > $ODFSNODEFILE 2>&1
#Start beeond only on the job master node
if [[ ${SLURMD_NODENAME} == $head_node ]] ; then
#create connauthfile, need pdcp to distribute
#define in odfs.conf -> ODFSCONNAUTHFILE=/tmp/odfs_caf
# only for advanced security, comment if not needed
RANDOMCRED=`/usr/bin/openssl rand -base64 12`
CRED="${SLURM_JOB_ID}-${RANDOMCRED}"
echo ${CRED} > ${ODFSCONNAUTHFILE}
chmod 600 ${ODFSCONNAUTHFILE}
pdcp -w ${SLURM_JOB_NODELIST} ${ODFSCONNAUTHFILE} ${ODFSCONNAUTHFILE}
##########################################################################
#Start beeond
/usr/bin/beeond start -n ${ODFSNODEFILE} -d ${ODFSLOOPBACKMOUNTPOINT} -c ${ODFSMOUNTPOINT} -i /tmp/beeond.tmp -P -F -L /tmp
fi
#remount nosuid nodev
mount -o remount,nosuid,nodev ${ODFSMOUNTPOINT}
if [[ ${SLURMD_NODENAME} == $head_node ]] ; then
### Create multiple dirs with different stripe counts for more convenience
mkdir ${ODFSMOUNTPOINT}/stripe_1
/usr/bin/beegfs-ctl --setpattern --numtargets=1 --mount=${ODFSMOUNTPOINT} ${ODFSMOUNTPOINT}/stripe_1 > /dev/null 2>&1
mkdir ${ODFSMOUNTPOINT}/stripe_8
/usr/bin/beegfs-ctl --setpattern --numtargets=8 --mount=${ODFSMOUNTPOINT} ${ODFSMOUNTPOINT}/stripe_8 > /dev/null 2>&1
mkdir ${ODFSMOUNTPOINT}/stripe_16
/usr/bin/beegfs-ctl --setpattern --numtargets=16 --mount=${ODFSMOUNTPOINT} ${ODFSMOUNTPOINT}/stripe_16 > /dev/null 2>&1
mkdir ${ODFSMOUNTPOINT}/stripe_32
/usr/bin/beegfs-ctl --setpattern --numtargets=32 --mount=${ODFSMOUNTPOINT} ${ODFSMOUNTPOINT}/stripe_32 > /dev/null 2>&1
mkdir ${ODFSMOUNTPOINT}/stripe_default
#/usr/bin/beegfs-ctl --setpattern --numtargets=4 --mount=${ODFSMOUNTPOINT} ${ODFSMOUNTPOINT}/stripe_default > /dev/null 2>&1
chown ${SLURM_JOB_USER} ${ODFSMOUNTPOINT}/stripe_1 ${ODFSMOUNTPOINT}/stripe_8 ${ODFSMOUNTPOINT}/stripe_16 ${ODFSMOUNTPOINT}/stripe_32 ${ODFSMOUNTPOINT}/stripe_default
fi
#Create Status file for cleanup in epilog
echo "USER: ${SLURM_JOB_USER}" > ${ODFSSTATUSFILE}
echo "JOBID: ${SLURM_JOB_ID}" >> ${ODFSSTATUSFILE}
echo "MOUNTPOINT: ${ODSMOUNTPOINT}" >> ${ODFSSTATUSFILE}
chmod 700 ${ODFSSTATUSFILE}
[[ -f ${ODFSDEBUGLOG} ]] && chmod 700 ${ODFSDEBUGLOG}
echo "export ODFS=${ODFSMOUNTPOINT}"
fi
}
#If this looks odd: simply add
# start_odfs
#at the end of your prolog file
start_odfs
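Note that the final echo "export ODFS=..." only reaches the job environment if it is emitted by a TaskProlog, whose standard output Slurm parses for export NAME=VALUE lines. A minimal sketch of such a task prolog (the path /etc/slurm/taskprolog is an assumption; set TaskProlog=/etc/slurm/taskprolog in slurm.conf):
#!/bin/bash
#hypothetical /etc/slurm/taskprolog: pass the ODFS mountpoint to the job environment
if [[ -d /mnt/odfs/${SLURM_JOB_ID} ]]; then
    echo "export ODFS=/mnt/odfs/${SLURM_JOB_ID}"
fi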
/etc/slurm/epilog
function stop_odfs() {
#A bit messy: SLURM_JOB_CONSTRAINTS is not available in the epilog,
#so we check for the odfs status file instead
source /etc/odfs/odfs.conf
ODFSDEBUG=1
if [[ -r ${ODFSSTATUSFILE} ]] ; then
source /etc/odfs/odfs.conf
source /etc/odfs/functions.sh
ODFSDEBUGLOG=/tmp/odfs-slurm.$SLURM_JOB_ID
if [[ $ODFSDEBUG == 1 ]];then
echo "EPILOGUE prolog beeond AAAA" >> $ODFSDEBUGLOG
export >> $ODFSDEBUGLOG
fi
job_hosts=`scontrol show hostnames`
head_node=`echo $job_hosts | awk '{print $1;}'`
if [[ $ODFSDEBUG == 1 ]];then
echo $job_hosts >> $ODFSDEBUGLOG
echo $head_node >> $ODFSDEBUGLOG
export >> $ODFSDEBUGLOG
fi
#First stop beeond; we do a quick stoplocal
#make sure the epilog is executed on all nodes
odfs_stop_beeond
odfs_kill_beeond
odfs_remove_loopmount
fi
#remove any leftover files
rm -r ${ODFSIMAGE} ${ODFSNODEFILE} ${ODFSMOUNTPOINT} ${ODFSSTATUSFILE} ${ODFSCONNAUTHFILE}
#rm ${ODFSIMAGE} ${ODFSNODEFILE} ${ODFSMOUNTPOINT} ${ODFSSTATUSFILE} ${ODFSSTATUSFILE} ${ODFSDEBUGLOG}
#only remove status files if ODFSDEBUG is not enabled
if [[ $ODFSDEBUG == 0 ]];then
rm ${ODFSSTATUSFILE} ${ODFSDEBUGLOG}
fi
}
#If this looks odd: simply add
# stop_odfs
#at the end of your epilog file
stop_odfs
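Both scripts must run on every node of the allocation so that the mounts are created and torn down everywhere. A minimal slurm.conf sketch (the script paths are the ones used above; adapt them to your installation):
#slurm.conf (excerpt)
Prolog=/etc/slurm/prolog
Epilog=/etc/slurm/epilog
#run the prolog on all allocated nodes at job allocation time
PrologFlags=Alloc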
Now a constraint has to be added to the Slurm configuration so that users can request BeeOND at job submission:
NodeName=node[001-100] Feature=BEEOND
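To verify that the feature is actually advertised by the nodes, list the node features with sinfo:
#list node names and their features
sinfo -o "%N %f" | grep BEEOND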
A user then simply adds --constraint=BEEOND on the sbatch command line or in the job script:
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=BEEOND
Slurm will then create an ODFS for the user, which can be accessed within the job. After the job has started, BeeOND is mounted under /mnt/odfs/$SLURM_JOB_ID. Within the mount point, several pre-configured directories with different stripe counts are available:
#for small files (stripe count = 1)
/mnt/odfs/$SLURM_JOB_ID/stripe_1
#stripe count = 4
/mnt/odfs/$SLURM_JOB_ID/stripe_default
#stripe count = 8, 16 or 32, use these directories for medium sized
#and large files or when using MPI-IO
/mnt/odfs/$SLURM_JOB_ID/stripe_8
/mnt/odfs/$SLURM_JOB_ID/stripe_16
/mnt/odfs/$SLURM_JOB_ID/stripe_32
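From within a running job, the stripe settings of these directories can be checked with beegfs-ctl; this is just a sanity check and assumes the mount layout from the prolog above:
#show the stripe pattern of the stripe_8 directory of the current job
beegfs-ctl --getentryinfo --mount=/mnt/odfs/${SLURM_JOB_ID} /mnt/odfs/${SLURM_JOB_ID}/stripe_8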
If you request fewer nodes than the stripe count, the effective stripe count is limited by the number of nodes; e.g., if you request only 8 nodes, the directory with stripe count 16 effectively uses a stripe count of 8.
The capacity of the private file system depends on the number of nodes: each node contributes 250 GByte.
!!! Be careful when creating large files: always use a directory with a sufficiently high stripe count. For example, if your largest file is 1.1 TByte, you have to use a stripe count larger than 4 (4 x 250 GByte = 1 TByte).
If you request 100 nodes for your job, the private file system has a total capacity of roughly 100 * 250 GByte = 25 TByte.
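Putting it together, a job script using the on-demand file system could look like the following sketch; the application name and the data paths are placeholders:
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --time=01:00:00
#SBATCH --constraint=BEEOND

#the private file system created by the prolog
ODFS=/mnt/odfs/${SLURM_JOB_ID}

#stage input data into the on-demand file system
cp -a ${HOME}/mydata ${ODFS}/stripe_8/

#run the application against the node-local storage
srun ./myapplication ${ODFS}/stripe_8/mydata ${ODFS}/stripe_8/results

#copy the results back to the global file system before the job ends
cp -a ${ODFS}/stripe_8/results ${HOME}/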
Recommendation:
The private file system uses its own metadata server, which is started on the first node of the job. Depending on your application, this metadata server can consume a considerable amount of CPU power. Adding an extra node to your job may therefore improve the usability of the on-demand file system. Start your application with the MPI option:
mpirun -nolocal myapplication
With the -nolocal option, the node where mpirun is started is not used for your application; it remains fully available for the metadata server of the requested on-demand file system.
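For example, a job could request one additional node that is then left to the metadata server; the node count and application name are placeholders:
#SBATCH --nodes=9
#SBATCH --constraint=BEEOND
#8 nodes run the application, the first node only hosts the BeeOND metadata server
mpirun -nolocal ./myapplication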
Moab Submitfilter
Moab offers a submit filter to modify the job XML before it is submitted to the Moab scheduler. More about the submit filter can be found in the Moab docs.
Below is an example of such a submit filter. Just add it to the Moab configuration with
SUBMITFILTER /etc/some/path/filter.pl
This example worked with older (7.x) versions of Moab; be careful when using it with newer versions. Consider setting $debug = 1 first and checking the submit filter logs it writes.
#!/usr/bin/perl
use strict;
use XML::Twig;
use Data::Dumper;
## Simple filter example: sets default output/error files, a login shell for interactive jobs, and requests the job environment (msub -E).
#my $xmlinput;
my $job_user;
my $machine = "clustername";
my $debug = 0;
#Read XML from STDIN
my $xmlinput;
while(<>){
$xmlinput .= $_ ;
}
my $XMLstruct = XML::Twig->new();
$XMLstruct->parse($xmlinput);
my $root= $XMLstruct->root;
#$XMLstruct->print;
if (check_jobxml_element($root,"UserId")){
$job_user = get_jobxml_element($root,"UserId");
}
#Output original XML to file if $debug is true
#File is written into directory where msub is called
if($debug){
my $origfile = "submitfilter_" . $job_user . "_original.log";
open FILE,">>$origfile" or die "Couldn't open $origfile: $!";
my $orig_jobtext = $XMLstruct->sprint;
print FILE $orig_jobtext;
print FILE "\n";
close FILE;
}
#Add Outputfile if none specified by user
#Only for non interactive jobs
if (!check_jobxml_element($root,"Interactive")){
if (!check_jobxml_element($root,"OutputFile")){
$root->insert_new_elt('first_child', 'OutputFile', '');
my $working_dir;
if (check_jobxml_element($root,"InitialWorkingDirectory")){
$working_dir = get_jobxml_element($root,"InitialWorkingDirectory")
}
my $job_name_prefix = "_STDIN";
if (check_jobxml_element($root,"JobName")){
$job_name_prefix = "_" . get_jobxml_element($root,"JobName");
}
$root = set_jobxml_element($root,"OutputFile","$working_dir" . "/job_" . $machine . "_" . '$(JOBID)' . ".out");
}
if (!check_jobxml_element($root,"ErrorFile")){
$root->insert_new_elt('first_child', 'ErrorFile', '');
my $working_dir;
if (check_jobxml_element($root,"InitialWorkingDirectory")){
$working_dir = get_jobxml_element($root,"InitialWorkingDirectory")
}
my $job_name_prefix = "_STDIN";
if (check_jobxml_element($root,"JobName")){
$job_name_prefix = "_" . get_jobxml_element($root,"JobName");
}
$root = set_jobxml_element($root,"ErrorFile","$working_dir" . "/job_" . $machine . "_" . '$(JOBID)' . ".out");
}
}
#Set /bin/bash for interactive jobs
if (check_jobxml_element($root,"Interactive")){
if (!check_jobxml_element($root,"ShellName")){
$root->insert_new_elt('first_child', 'ShellName', '');
set_jobxml_element($root,"ShellName","/bin/bash --login");
}
}
# add Moab variables to the job's environment (see msub -E)
my $extension;
if (check_jobxml_element($root,"Extension")){
$extension = get_jobxml_element($root,"Extension");
if ($extension !~ m/x=ENVREQUESTED:TRUE/){
add_jobxml_txt_to_element($root,"Extension",";ENVREQUESTED:TRUE;");
}
}else{
$root->insert_new_elt('first_child', 'Extension', '');
$root = set_jobxml_element($root,"Extension",";x=ENVREQUESTED:TRUE;");
}
#Rewrite memory from pmem -> --slurm --mem-per-cpu
# workaround if Slurm is used as RM with Moab as scheduler
# Rewrite memory requested by msub to slurm --mem-per-cpu
#if (check_jobxml_element($root,"Memory")){
# #insert RMFlags and use msub pmem value
# $root->insert_new_elt('first_child', 'RMFlags', '');
# my $memory = get_jobxml_element($root,"Memory");
# $root = set_jobxml_element($root,"RMFlags","--mem-per-cpu=$memory");
# $root->first_child('Memory')->delete;
#
#}
## enforce user
#if (!check_jobxml_element($root,"Memory")){
# die "Please set a value for memory \n";
#}
#Output modified XML to file if $debug is true
if($debug){
my $modfile = "submitfilter_" . $job_user . "_modified.log";
open FILE,">>$modfile" or die "Couldn't open $modfile: $!";
my $mod_jobtext = $XMLstruct->sprint;
print FILE "$mod_jobtext";
print FILE "\n";
close FILE;
}
#print XML to STDOUT, which will be sent to the scheduler
$XMLstruct->print;
#Some helper functions to modify XML
sub check_jobxml_element{
my $root = shift;
my $element = shift;
my $present;
if ($root->first_child($element)){
return 1;
}else{
return 0;
}
}
sub get_jobxml_element{
my $root = shift;
my $element = shift;
my $value;
$value = $root->first_child($element)->text;
return $value;
}
sub set_jobxml_element{
my $root = shift;
my $element = shift;
my $text = shift;
my $attribut = $root->first_child($element);
$attribut->set_text($text);
return $root;
}
sub add_jobxml_txt_to_element{
my $root = shift;
my $element = shift;
my $text = shift;
my $content = $root->first_child($element)->text;
$text = $content . "$text";
my $attribut = $root->first_child($element);
$attribut->set_text($text);
}
1;
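The filter can be tested outside of Moab by feeding it a captured job XML (for example one written by the filter itself with $debug = 1) and comparing input and output; job.xml here is a hypothetical capture:
#run the submit filter standalone on a captured job XML and inspect the changes
chmod +x /etc/some/path/filter.pl
/etc/some/path/filter.pl < job.xml > job_modified.xml
diff job.xml job_modified.xml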