
A new take on using Snakemake: pass around log-files instead of output-files

Project description

snakemake-jobmonitor package

snakemake-jobmonitor is an alternative take on the regular Snakemake workflow: instead of passing input and output files around, it passes log files around. Each log file contains a pointer to the result files. The advantage is much better progress monitoring, error handling and logging. The JobMonitor and JobResult classes ensure that this can be achieved with minimal code that is easy to read and maintain. snakemake-jobmonitor is a minimal library of about two pages of code, installed with `pip install snakemake-jobmonitor`. It does not modify Snakemake, only the way Snakemake is used.

Regular Snakemake

Snakemake is a powerful workflow-engine that compiles rules into a DAG (Directed Acyclic Graph) and automatically determines a parallel execution strategy. Rules invoke each other via filenames, which typically contain wildcards so that the same rule can be invoked for multiple cases.

Regular workflow example

Example of a Snakefile (examples/regular.snk, run with snakemake -s regular.snk --cores 1 --forceall) that splits a color image into the red, green and blue components for cases '1', '2' and '3'.

import os
from PIL import Image

inputFolder = 'path_to_cases'
outputFolder = '../scratch/regular/path_to_results'
os.makedirs(outputFolder,exist_ok=True)

allCases = ['1','2','3']

def somethingUseful(colorInfile, redOutfile,greenOutfile,blueOutfile):
    im = Image.open(colorInfile)
    r,g,b = im.split()
    r.save(redOutfile); g.save(greenOutfile); b.save(blueOutfile)
    # uncomment this to raise a Division By Zero error:
    #1/0

def createReport(allRed,allGreen,allBlue, reportFile):
    # argument order matches the call sites: R, G, B
    with open(reportFile,'wt') as fp:
        fp.write('Red files:\n' + ',\n'.join(allRed) + '\n\n')
        fp.write('Green files:\n' + ',\n'.join(allGreen) + '\n\n')
        fp.write('Blue files:\n' + ',\n'.join(allBlue) + '\n\n')

rule runSingleCase:
    input:
        color=inputFolder+'/case-{case}_RGB.jpeg'
    output:
        R=outputFolder+'/case-{case}_R.png',
        G=outputFolder+'/case-{case}_G.png',
        B=outputFolder+'/case-{case}_B.png'
    run:
        somethingUseful(input.color, output.R,output.G,output.B)

rule runAllCases:
    input:
        R=[outputFolder+f'/case-{c}_R.png' for c in allCases],
        G=[outputFolder+f'/case-{c}_G.png' for c in allCases],
        B=[outputFolder+f'/case-{c}_B.png' for c in allCases]
    output:
        report=f'{outputFolder}/report.txt'
    default_target:
        True
    run:
        createReport(input.R,input.G,input.B, output.report)

Practical issues

For larger workflows some issues arise:

  1. Snakemake does not come with a good progress monitor. It is possible to use the 'WMS monitoring protocol', but this is cumbersome to set up and is being phased out. Its replacement, 'logger plugins', is still experimental.

  2. If an error occurs, the pipeline stops and produces a very long error trace, most of which is irrelevant to the error. Or one can opt to ignore errors, but this will cause errors down the line that are even more difficult to trace.

  3. Snakemake produces a log-file that contains information about process execution, but does not contain the console-output of the processes called by each rule. This is because a global log file is not suitable to contain logs from different components that may run in parallel.

  4. If a rule has many outputs, and another rule needs these as inputs, the rules become cluttered.

An alternative approach

To solve these issues, snakemake-jobmonitor changes the way rules interact. Every rule produces a log file instead of output files. And instead of rule B requesting the output of rule A, it requests the log file of rule A. Inside that log file there is a pointer to where the rule's results are stored.

Snakemake-jobmonitor is implemented as a class that acts as a context-manager. A typical rule looks as follows:

rule decomposeSingle:
    input:
        color=inputFolder+'/case-{case}_RGB.png'
    log:
        logFolder+'/case-{case}_decompose.log'
    run:
        caseFolder = f'{outputFolder}/case-{wildcards.case}'
        with JobMonitor(log,'Decompose RGB into R,G,B',caseFolder) as job:
            doDecompose(input.color, 
                job.result('R.png'),job.result('G.png'),job.result('B.png'))

The rule has changed in a few places: instead of producing three output files, it produces a log file. In the statement that starts with with JobMonitor, JobMonitor creates the log file and stores in it the path where the rule output will be written, in this case the caseFolder folder. In the last line, job.result('R.png') creates the output folder and returns the full path to the file.

Although the code has become two lines longer, it offers huge advantages:

  1. JobMonitor automatically creates the .log file, but while the rule executes the extension is changed to '.running'. So, at any moment you can see what Snakemake is working on by listing all .running files in the log folder.

  2. If an error occurs within the JobMonitor context, the error is appended to the log file and written to a .error file (with otherwise the same name as the .log file). So, one can easily find all rules that gave errors by listing .error files in the log folder. After fixing the code that produced the error, delete the corresponding log-file before re-running Snakemake.

  3. Naturally, every rule produces its own log. In addition, JobMonitor provides a run method to invoke external software. This method is mostly the same as subprocess.run, but it captures all output to the .log file and sends errors to the .error file.

  4. Rules have inputs that are log files produced by other rules. And a single output: its own log file. The Snakefile is not cluttered by declaring all the output files that may be produced by each rule. Those are accessed indirectly via the result-pointer in its log-file.
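Advantages 1 and 2 can be exploited with any file-listing tool; as a minimal sketch in Python, assuming logs are collected in the single folder used by the example below:

```python
import glob
import os

# Assumed log folder; the example Snakefile below uses this path.
log_folder = '../scratch/jobmon/path_to_logs'

# While a rule executes its log has the extension '.running';
# a failing rule additionally leaves a '.error' file behind.
running = glob.glob(os.path.join(log_folder, '*.running'))
failed = glob.glob(os.path.join(log_folder, '*.error'))

print(f'{len(running)} rule(s) running, {len(failed)} rule(s) failed')
```

Running this in a second terminal while Snakemake executes gives a simple live progress view.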

Full snakemake-jobmonitor example

Here is the full version of the previous example in the snakemake-jobmonitor style (examples/jobmon.snk, run with snakemake -s jobmon.snk --cores 1).

import os
from PIL import Image
from snakemake_jobmonitor import JobMonitor, JobResult

inputFolder = 'path_to_cases'
outputFolder = '../scratch/jobmon/path_to_results'
logFolder = '../scratch/jobmon/path_to_logs'

allCases = ['1','2','3']

def doSomethingUseful(colorInfile, redOutfile,greenOutfile,blueOutfile):
    im = Image.open(colorInfile)
    r,g,b = im.split()
    r.save(redOutfile); g.save(greenOutfile); b.save(blueOutfile)
    # uncomment this to raise a Division By Zero error:
    #1/0

def createReport(allRed,allGreen,allBlue, reportFile):
    # argument order matches the call sites: R, G, B
    with open(reportFile,'wt') as fp:
        fp.write('Red files:\n' + ',\n'.join(allRed) + '\n\n')
        fp.write('Green files:\n' + ',\n'.join(allGreen) + '\n\n')
        fp.write('Blue files:\n' + ',\n'.join(allBlue) + '\n\n')

rule runSingleCase:
    input:
        color=inputFolder+'/case-{case}_RGB.jpeg'
    log:
        logFolder+'/case-{case}_decompose.log'
    run:
        caseFolder = f'{outputFolder}/case-{wildcards.case}'
        with JobMonitor(log,'Decompose RGB into R,G,B',caseFolder) as job:
            doSomethingUseful(
                input.color, 
                job.result('R.png'),job.result('G.png'),job.result('B.png')
            )

rule runAllCases:
    input:
        [logFolder+f'/case-{cs}_decompose.log' for cs in allCases]
    log:
        logFolder+'/decomposeAll.log'
    default_target:
        True
    run:
        with JobMonitor(log,'Decompose All',outputFolder) as job:
            R = [JobResult(f)('R.png') for f in input]
            G = [JobResult(f)('G.png') for f in input]
            B = [JobResult(f)('B.png') for f in input]
            createReport(R,G,B, job.result('report.txt') )

Usage of JobMonitor

Signature: JobMonitor(logFile,description,resultFolder)

The JobMonitor class takes three arguments:

  • logFile: path to the log file. If the file exists, it will be overwritten.

  • description: brief description of what the rule does.

  • resultFolder: path to the result folder. One can also pass a result prefix by adding an asterisk at the end. Examples:

    • /my/results/case-1 will cause results to be written in the case-1 folder.

    • /my/results/case-1_* will cause results to be written in the results folder, and every file therein will start with case-1_.

JobMonitor should be used as a context manager, like

with JobMonitor(logFile,description,resultFolder) as job:
    doSomething()

Inside the context, job can be used for the following tasks:

  1. Create/access the result of this rule via job.result(resultFile)

    This returns a filename that concatenates the previously specified resultFolder with resultFile, and will make sure the folder is created. One can also write results in subfolders, by just adding arguments, like job.result(subFolder,resultFile). Examples:

    • If the resultFolder is specified as /my/results/case-1, then job.result('test','R.png') will return /my/results/case-1/test/R.png.

    • If the resultFolder is specified as /my/results/case-1_*, then job.result('test','R.png') will return /my/results/case-1_test/R.png

    This is all job.result does: it returns filenames and creates folders. It does not create results; that is up to the code inside the rule.

  2. Run an external command via job.run(command,liveUpdates=False).

    Here command must NOT be a string, but a list of strings that follows the exact same rules as subprocess.run. The advantage of using job.run() is that it saves stdout and stderr to the log and error file respectively.

    If liveUpdates is False, the log and error files are updated once the command finishes; if True, they are updated more frequently while the command runs.
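The behavior described for job.run can be approximated with plain subprocess. The sketch below is not the library's implementation; run_and_log and its file arguments are made up for illustration:

```python
import subprocess

def run_and_log(command, log_file, error_file):
    # command is a list of strings, exactly as subprocess.run expects.
    proc = subprocess.run(command, capture_output=True, text=True)
    # Append captured console output to the log file.
    with open(log_file, 'a') as fp:
        fp.write(proc.stdout)
    # Send error output to a separate error file, as job.run is described to do.
    if proc.stderr:
        with open(error_file, 'a') as fp:
            fp.write(proc.stderr)
    return proc.returncode
```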

As a general rule, we recommend that log-files are all stored in the same folder, with hierarchy expressed in the file name. For result-files it can be more natural to use a hierarchical folder structure.
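The path rules of job.result described above, including the trailing-asterisk prefix form, can be sketched in a few lines of plain Python. result_path below is a hypothetical reimplementation for illustration, not the library's code:

```python
import os

def result_path(result_folder, *parts):
    # A trailing '*' marks the last path component as a filename prefix.
    if result_folder.endswith('*'):
        prefix = os.path.basename(result_folder)[:-1]   # e.g. 'case-1_'
        base = os.path.dirname(result_folder)
        *subfolders, filename = parts
        if subfolders:
            # The prefix attaches to the first subfolder name.
            subfolders[0] = prefix + subfolders[0]
            return os.path.join(base, *subfolders, filename)
        return os.path.join(base, prefix + filename)
    return os.path.join(result_folder, *parts)

print(result_path('/my/results/case-1', 'test', 'R.png'))
# /my/results/case-1/test/R.png
print(result_path('/my/results/case-1_*', 'test', 'R.png'))
# /my/results/case-1_test/R.png
```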

Usage of JobResult

Signature: JobResult(logFile)

We already used job.result in the previous chapter to access result files inside the JobMonitor context. The JobResult class is used to access the results of other rules via the log files they produced. It relies on the fact that every log file contains, on the second line, the resultFolder of the rule that created it.

If we start for example with:

result = JobResult('/my/logfolder/case-1_test.log')

then result can be used in the same way as job.result in the previous chapter. For example, result(subFolder,resultFile) will return the concatenation of resultFolder, subFolder and resultFile. It will not create any folders; that only happens in the JobMonitor context.

JobResult has some additional convenience methods:

  • result.file(*args) is the same as result(*args)

  • result.folder(*args) returns the result folder, internally using os.path.dirname(resultFile)

  • result.parseJson(*args) parses the json-formatted result file and returns its content.
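Since the result folder is said to sit on the second line of each log file, the lookup JobResult performs can be sketched as below. read_result_folder is hypothetical, and the real file layout may contain more than the two lines assumed here:

```python
def read_result_folder(log_file):
    # Assumed layout: line 1 holds the rule description,
    # line 2 holds the resultFolder (per the description above).
    with open(log_file) as fp:
        fp.readline()
        return fp.readline().strip()
```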

Project details


Download files

Download the file for your platform.

Source Distribution

snakemake_jobmonitor-0.1.2.tar.gz (49.9 kB)

Uploaded: Source

Built Distribution


snakemake_jobmonitor-0.1.2-py2.py3-none-any.whl (9.7 kB)

Uploaded: Python 2, Python 3

File details

Details for the file snakemake_jobmonitor-0.1.2.tar.gz.

File metadata

  • Download URL: snakemake_jobmonitor-0.1.2.tar.gz
  • Upload date:
  • Size: 49.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for snakemake_jobmonitor-0.1.2.tar.gz
Algorithm Hash digest
SHA256 476710a26aca9d22776ec7059481599093d6729eed7cf9f70d901c439aa2e795
MD5 d8e9c0f0bfffcf1627b3856f25aa6821
BLAKE2b-256 1ee45710ab5625b31de90e830a7da5bc650c5c1833a4eec4c81efacb95ddfb52


File details

Details for the file snakemake_jobmonitor-0.1.2-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for snakemake_jobmonitor-0.1.2-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 7e34d9875e3e335ce36b4d90a09e700883d5c487ac9ce01d42ca575fe185f17a
MD5 e148284c37178f0921e2c6dab5400d46
BLAKE2b-256 1a9ce346361100e5b2bb8863aa34d31187b0146bab31572e275c2110765517f4

