Files

PhantomBuster is configured for each individual experiment via several CSV files. They are passed as command-line arguments to the individual stages.

input_files.csv

CSV file that lists each input file (BAM or FASTQ) to be processed. Each row corresponds to a single input file.

The column file is mandatory and must contain the path to the input file. Paths can be absolute or relative to the directory in which phantombuster is executed.

The column group is optional and combines files into logical groups for cases where the sequences of a read are spread over several files, as is common for FASTQ files. In FASTQ format, read 1, read 2, index 1 and index 2 are commonly separated into files named according to the schema SampleName_S1_L001_X_001.fastq.gz, where X is R1 for read 1, R2 for read 2 (paired-end sequencing), I1 for index 1 and I2 for index 2. Each group must be unique, and files that do not share reads/sequences must be in separate groups.

The column prefix is optional and configures which regular expressions are used for the file (see also regexes.csv).

For the implementation that parses the file see phantombuster.config_files.read_input_files_file.

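input_files.csv Columns

======  =========  ==================================================
Column  Mandatory  Meaning
======  =========  ==================================================
file    Yes        Path to input file
group   No         Grouping of files which must be processed together
prefix  No         Regexes to use for extraction
======  =========  ==================================================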

input_files.csv Minimal Example

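=============================================
file
=============================================
/scratch/experiment20240702/data/20240702.bam
=============================================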

input_files.csv Complex Example

=================================  =====  ========
file                               group  prefix
=================================  =====  ========
SampleOne_S1_L001_R1_001.fastq.gz  S1     FASTQ_R1
SampleOne_S1_L001_R2_001.fastq.gz  S1     FASTQ_R2
SampleOne_S1_L001_I1_001.fastq.gz  S1     FASTQ_I1
SampleOne_S1_L001_I2_001.fastq.gz  S1     FASTQ_I2
SampleTwo_S2_L001_R1_001.fastq.gz  S2     FASTQ_R1
SampleTwo_S2_L001_R2_001.fastq.gz  S2     FASTQ_R2
SampleTwo_S2_L001_I1_001.fastq.gz  S2     FASTQ_I1
SampleTwo_S2_L001_I2_001.fastq.gz  S2     FASTQ_I2
20240702.bam                       S3     BAM
=================================  =====  ========
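To make the grouping concrete, here is a minimal sketch of how such a file could be read and its rows collected per group. It is not the implementation in phantombuster.config_files.read_input_files_file; the helper name read_input_files, the comma delimiter and the fallback group name for rows without a group are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def read_input_files(handle):
    """Collect (file, prefix) pairs per group from an input_files.csv-style table."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "file" not in reader.fieldnames:
        raise ValueError("the 'file' column is mandatory")
    groups = defaultdict(list)
    for row_number, row in enumerate(reader):
        # Assumption: a row without an explicit group forms a group of its own.
        group = row.get("group") or f"row{row_number}"
        groups[group].append((row["file"], row.get("prefix") or None))
    return dict(groups)


# A shortened version of the complex example, written as comma-separated text.
example = io.StringIO(
    "file,group,prefix\n"
    "SampleOne_S1_L001_R1_001.fastq.gz,S1,FASTQ_R1\n"
    "SampleOne_S1_L001_R2_001.fastq.gz,S1,FASTQ_R2\n"
    "20240702.bam,S3,BAM\n"
)
groups = read_input_files(example)
for group, files in groups.items():
    print(group, files)
```

The two FASTQ files end up in group S1 and would be read together, while the BAM file forms group S3 on its own.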

regexes.csv

CSV file that lists the regular expressions used to extract the barcodes from the sequences. Each row specifies one regular expression to apply to a read region.

The column regex is mandatory and specifies the regular expression to use when extracting barcodes. For the accepted regular expression syntax see the regex module. Named groups are extracted as barcodes (e.g. (?P<sample>\w+) is a regular expression that captures a barcode named sample). Named groups whose names end in a number are concatenated into a single barcode (e.g. (?P<sample0>\w+)ACGTACGT(?P<sample1>\w+) results in a single sample barcode).

The column tag is mandatory and specifies the read region the regex is applied to. Possible values are query, bc and b2 for BAM files, and name and seq for FASTQ files.

The column prefix is optional and is used to specify different regular expressions for different input files. Each prefix value in input_files.csv should have a corresponding entry in regexes.csv.

The column group is optional and can be used to specify regexes for a specific group of input files. The use of the group column is deprecated; use the prefix column instead.

For the implementation that parses the file see phantombuster.config_files.read_regex_file.

regexes.csv Columns

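======  =========  ===============================================================
Column  Mandatory  Meaning
======  =========  ===============================================================
regex   Yes        Regular expression to extract barcodes
tag     Yes        Target read region (query, bc, b2, name, seq)
prefix  No         Indicates which input files the regex applies to
group   No         Deprecated. Indicates the input file group the regex applies to
======  =========  ===============================================================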

regexes.csv Minimal Example

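=====  =======================================================
tag    regex
=====  =======================================================
b2     "^[ACGTN]{3}(?P<sample>[ACGTN]{5})"
query  "(?P<lid>[ACGTN]{5,6}(?P<lib>ACGT|GTAC){s<=1}[ACGTN]+)"
=====  =======================================================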

regexes.csv Complex Example

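========  =====  ================================================================
prefix    tag    regex
========  =====  ================================================================
FASTQ_R1  seq    "^(?P<cell>[ACGTN]{12})\w*(AGGACGAAACACC){s<=1}(?P<grna>\w{20})"
FASTQ_R2  seq    "(?P<lid>\w+)"
FASTQ_I1  seq    "(?P<sample0>\w{8})"
FASTQ_I2  seq    "(?P<sample1>\w{8})"
BAM       query  "^(?P<cell>[ACGTN]{12})\w*(AGGACGAAACACC){s<=1}(?P<grna>\w{20})"
BAM       bc     "(?P<sample0>\w{8})"
BAM       b2     "(?P<sample1>\w{8})"
========  =====  ================================================================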

The complex example continues the example in input_files.csv Complex Example. Each named group/barcode has a corresponding entry in barcodes.csv Complex Example.
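The extraction of named groups and the concatenation of numbered groups can be sketched as follows. This is an illustration only, not PhantomBuster's code: it uses the standard library re module so that it runs without extra dependencies, which means the fuzzy-matching syntax such as {s<=1} from the regex module is not available here. The helper name extract_barcodes, the use of a search rather than a full match, dropping a read when a regex does not match, and joining the numbered parts in ascending order are all assumptions.

```python
import re
from collections import defaultdict

# Patterns keyed by prefix, taken from the FASTQ rows of the complex example.
REGEXES = {
    "FASTQ_R2": r"(?P<lid>\w+)",
    "FASTQ_I1": r"(?P<sample0>\w{8})",
    "FASTQ_I2": r"(?P<sample1>\w{8})",
}


def extract_barcodes(sequences, regexes=REGEXES):
    """Apply each regex to its sequence and merge numbered groups into one barcode."""
    parts = defaultdict(dict)
    for prefix, pattern in regexes.items():
        match = re.search(pattern, sequences[prefix])
        if match is None:
            return None  # Assumption: the read is dropped if any regex fails.
        for name, value in match.groupdict().items():
            # Split a trailing number off the group name: sample0 -> ("sample", "0").
            base, digits = re.fullmatch(r"(.*?)(\d*)", name).groups()
            parts[base][int(digits or 0)] = value
    # Assumption: numbered parts are joined in ascending numeric order.
    return {
        base: "".join(value for _, value in sorted(numbered.items()))
        for base, numbered in parts.items()
    }


# Made-up sequences for one read, one entry per input file of the group.
read = {"FASTQ_R2": "ACGTTGCA", "FASTQ_I1": "AAAACCCC", "FASTQ_I2": "GGGGTTTT"}
barcodes = extract_barcodes(read)
print(barcodes)
```

Here sample0 and sample1 come from two different files but still end up as one sample barcode, which is why only sample needs to be declared in barcodes.csv.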

barcodes.csv

CSV file that lists all barcodes occurring in the experiment and their type. All columns are mandatory.

The column barcode is the name of the barcode; each barcode specified in regexes.csv must correspond to a barcode here.

The column type configures whether the valid sequences of the barcode are known (reference) or not (random).

The column referencefile is a path to a CSV file with the valid sequences for reference barcodes and is ignored for random barcodes. For the content of the reference file, see barcode sequences files.

The column threshold is used for the error correction of reference barcodes. A barcode sequence is corrected to a reference sequence if their Hamming distance is below the error threshold. A value of auto determines the largest possible error threshold that still allows for unique error correction and is recommended. For random barcodes the column is ignored.

The column min_length determines the minimal length of the barcode sequences; sequences of that barcode shorter than this value are discarded. The column max_length determines the maximal length of the barcode sequences; sequences of that barcode longer than this value are discarded.
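The error correction of reference barcodes can be illustrated with a small sketch. It is not PhantomBuster's implementation: the function names hamming and correct are hypothetical, the comparison is taken literally as "strictly below the threshold", and returning None for ambiguous or unmatched sequences is an assumption. The reference sequences are the ones from the barcode sequences file example.

```python
def hamming(a, b):
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def correct(sequence, references, threshold):
    """Return the name of the only reference closer than the threshold, else None."""
    hits = [
        name
        for name, reference in references.items()
        # Assumption: "below the threshold" is read as a strict comparison.
        if len(reference) == len(sequence) and hamming(sequence, reference) < threshold
    ]
    return hits[0] if len(hits) == 1 else None


REFERENCES = {"Sample1": "CGTACTAGATAGAGAG", "Sample2": "TCCTGAGCTCTACTCT"}

# One substitution at the last base of Sample1.
print(correct("CGTACTAGATAGAGAC", REFERENCES, threshold=2))
# Far away from both references.
print(correct("TTTTTTTTTTTTTTTT", REFERENCES, threshold=2))
```

The first call resolves to Sample1, the second yields None because no reference is close enough.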

The order of the barcodes is significant and is used for the error correction of random barcodes. When correcting the sequences of a random barcode, two sequences are only compared if the sequences of all barcodes listed above that barcode are equal. In practice this means that barcodes should be specified from general to specific. With the four barcodes sample, grna, lineage and cell, two sequences of the lineage barcode would only be compared if they have the same sample and grna values. Sequences with different sample or grna values cannot originate from the same lineage and are thus not compared. For two cell barcode sequences to be compared, their lineage sequence would also need to be the same.
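The effect of the barcode order can be sketched by bucketing sequences according to the barcodes listed above them. This only illustrates which sequences become candidates for comparison; the function name comparison_groups and the made-up records are hypothetical, and the actual correction algorithm is not shown.

```python
from collections import defaultdict

# Barcodes ordered from general to specific, as in the example above.
HIERARCHY = ["sample", "grna", "lineage", "cell"]


def comparison_groups(records, barcode, hierarchy=HIERARCHY):
    """Bucket the sequences of one barcode by the values of all barcodes above it."""
    parents = hierarchy[: hierarchy.index(barcode)]
    buckets = defaultdict(set)
    for record in records:
        key = tuple(record[name] for name in parents)
        buckets[key].add(record[barcode])
    return dict(buckets)


# Made-up barcode combinations for illustration.
records = [
    {"sample": "S1", "grna": "G1", "lineage": "AAAA", "cell": "CCCC"},
    {"sample": "S1", "grna": "G1", "lineage": "AAAT", "cell": "CCCG"},
    {"sample": "S2", "grna": "G1", "lineage": "AAAA", "cell": "CCCC"},
]
lineage_groups = comparison_groups(records, "lineage")
cell_groups = comparison_groups(records, "cell")
print(lineage_groups)
print(cell_groups)
```

The lineage sequences AAAA and AAAT share sample S1 and grna G1 and would be compared with each other, whereas the AAAA from sample S2 sits in a separate bucket. For the cell barcode every record lands in its own bucket, because the lineage sequences differ as well.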

The min_length and max_length columns overlap in purpose with length restrictions expressed directly in the regular expression. As the regular expression can express minimal and maximal lengths, and more, directly, min_length and max_length are deprecated and should be set to -. Instead, formulate any restrictions on the barcodes directly in the regular expression.
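As a small illustration of moving a length restriction into the regular expression, the following sketch uses a hypothetical pattern that accepts a lid barcode of exactly 50 bases, which corresponds to setting both min_length and max_length to 50. The pattern and its anchoring to the whole sequence are assumptions, not taken from a real configuration.

```python
import re

# Hypothetical pattern: the quantifier {50} replaces min_length=50 and max_length=50.
LID = re.compile(r"^(?P<lid>[ACGTN]{50})$")

fifty_bases = "ACGTN" * 10
forty_bases = "ACGT" * 10

print(LID.search(fifty_bases) is not None)
print(LID.search(forty_bases) is not None)
```

A range such as {45,50} would express different minimal and maximal lengths in the same way.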

For the implementation that parses the file see phantombuster.config_files.read_barcode_hierarchy_file.

barcodes.csv Columns

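=============  =========  ================================================================
Column         Mandatory  Meaning
=============  =========  ================================================================
barcode        Yes        Name of the barcode
type           Yes        Barcode type, either reference or random
referencefile  Yes        Path to reference file for reference barcodes, ignored otherwise
threshold      Yes        Threshold for error correction (auto or int)
min_length     Yes        DEPRECATED Minimal length of barcode, set to - to disable
max_length     Yes        DEPRECATED Maximal length of barcode, set to - to disable
=============  =========  ================================================================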

barcodes.csv Minimal Example

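=======  =========  ================================================  =========  ==========  ==========
barcode  type       referencefile                                     threshold  min_length  max_length
=======  =========  ================================================  =========  ==========  ==========
sample   reference  /scratch/experiment20240702/sample_barcodes.csv   auto       -           -
lib      reference  /scratch/experiment20240702/library_barcodes.csv  auto       -           -
lid      random     -                                                 -          50          50
=======  =========  ================================================  =========  ==========  ==========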

barcodes.csv Complex Example

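=======  =========  ===================  =========  ==========  ==========
barcode  type       referencefile        threshold  min_length  max_length
=======  =========  ===================  =========  ==========  ==========
sample   reference  sample_barcodes.csv  auto       -           -
grna     random     -                    -          -           -
lid      random     -                    -          50          50
cell     random     -                    -          -           -
=======  =========  ===================  =========  ==========  ==========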

The complex example continues the example in regexes.csv Complex Example. The regexes extract values for sample0 and sample1, which are concatenated into a single sample barcode. Thus, only a single sample barcode is listed here.

thresholds.csv

CSV file that specifies the read thresholds for the thresholding step. Barcode combinations with a read count below the read threshold are discarded. Only the column threshold, which specifies the read threshold, is mandatory. Every other column must be named after a reference barcode specified in barcodes.csv. Valid values of such a column are the names in the corresponding barcode sequences file (see barcode sequences files).

thresholds.csv Columns

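==============  =========  ==============================================================
Column          Mandatory  Meaning
==============  =========  ==============================================================
threshold       Yes        Read threshold to apply
[BARCODE NAME]  No         Specifies to which barcode combinations the threshold applies.
                           Multiple barcode names can be supplied.
==============  =========  ==============================================================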

thresholds.csv Minimal Example

=========
threshold
=========
100
=========

thresholds.csv Complex Example

=======  =========
sample   threshold
=======  =========
Sample1  80
Sample2  120
=======  =========
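How such per-sample thresholds could be applied is sketched below. This is an illustration under stated assumptions rather than PhantomBuster's thresholding code: the helper names read_thresholds and passes are hypothetical, the file is assumed to be comma-separated, the first matching row is assumed to win, and combinations that match no row are assumed to be kept.

```python
import csv
import io


def read_thresholds(handle):
    """Return (selector, threshold) pairs; the selector maps barcode name to value."""
    rows = []
    for row in csv.DictReader(handle):
        selector = {name: value for name, value in row.items() if name != "threshold"}
        rows.append((selector, int(row["threshold"])))
    return rows


def passes(combination, read_count, thresholds):
    """Keep a barcode combination whose read count is not below its threshold."""
    for selector, threshold in thresholds:
        # Assumption: the first row whose barcode values all match is used.
        if all(combination.get(name) == value for name, value in selector.items()):
            return read_count >= threshold
    return True  # Assumption: combinations without a matching row are kept.


# The complex example, written as comma-separated text.
thresholds = read_thresholds(io.StringIO("sample,threshold\nSample1,80\nSample2,120\n"))

print(passes({"sample": "Sample1", "lid": "ACGT"}, 100, thresholds))
print(passes({"sample": "Sample2", "lid": "ACGT"}, 100, thresholds))
```

With 100 reads, a combination from Sample1 is kept (threshold 80) while the same read count in Sample2 is discarded (threshold 120). With the minimal example, the selector is empty and the single threshold applies to every combination.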

barcode sequences files

For each reference barcode in barcodes.csv, a file with all valid barcode sequences must be provided. These files must contain two columns. The name column assigns each sequence a human-readable name, e.g. the sample name or ID in the case of sample barcodes. The barcode column must contain the sequence.

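barcode sequences files Columns

=======  =========  =============================
Column   Mandatory  Meaning
=======  =========  =============================
name     Yes        Human-readable name
barcode  Yes        Sequence (consisting of ACGT)
=======  =========  =============================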

barcode sequences files Example

=======  ================
name     barcode
=======  ================
Sample1  CGTACTAGATAGAGAG
Sample2  TCCTGAGCTCTACTCT
=======  ================
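A barcode sequences file like the example above could be loaded and checked as in the following sketch. The helper name read_reference_sequences, the comma delimiter and the rejection of duplicate sequences are assumptions for illustration; PhantomBuster's own parser may behave differently.

```python
import csv
import io


def read_reference_sequences(handle):
    """Map each valid sequence of a barcode sequences file to its name."""
    by_sequence = {}
    for row in csv.DictReader(handle):
        name, sequence = row["name"], row["barcode"]
        if not sequence or set(sequence) - set("ACGT"):
            raise ValueError(f"{name}: sequence must consist of A, C, G and T")
        if sequence in by_sequence:
            # Assumption: the same sequence must not be listed under two names.
            raise ValueError(f"{name}: duplicate sequence {sequence}")
        by_sequence[sequence] = name
    return by_sequence


example = io.StringIO(
    "name,barcode\n"
    "Sample1,CGTACTAGATAGAGAG\n"
    "Sample2,TCCTGAGCTCTACTCT\n"
)
references = read_reference_sequences(example)
print(references)
```

The names returned here (Sample1, Sample2) are the values that a barcode column in thresholds.csv would refer to.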