SegPE Usage Documentation

0X01 installation environment

Install Rust/Cargo:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Or upgrade an existing installation (upgrading to Rust 1.87.0 is recommended):

rustup update stable

0X02 compile

SegPE has been built for multi-platform (cross-platform) use.

Because SegPE uses SIMD acceleration, the build settings differ between CPU architectures:

On Intel/AMD x86 CPUs

Modify Cargo.toml:

[package.metadata.docs.rs]
features = ["simd_avx2"]

[target.'cfg(target_arch = "x86_64")'.dependencies]
block-aligner = { version = "0.5", features = ["simd_avx2"] }

Compile:

cargo build --features simd_avx2 --release

In theory, as long as available memory can hold either the input FASTQ data or number of threads * batch_reads reads, there is no need to worry about running out of memory, so you can use as many threads as the machine allows.
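
The memory reasoning above can be turned into a rough back-of-the-envelope estimate. The sketch below is only an illustration: the function name and the per-record size model (header plus a sequence and a quality string of equal length, doubled for paired-end input) are assumptions, not SegPE's actual accounting.

```rust
/// Rough estimate of the bytes buffered while every thread holds one batch.
/// Assumption (not taken from SegPE): one FASTQ record costs about
/// `header_len + 2 * read_len` bytes (sequence plus quality string).
fn estimate_batch_bytes(
    threads: usize,
    batch_reads: usize,
    read_len: usize,
    header_len: usize,
    paired: bool,
) -> usize {
    let per_record = header_len + 2 * read_len;
    // Paired-end input buffers two records (PE1 and PE2) per read pair.
    let mates = if paired { 2 } else { 1 };
    threads * batch_reads * per_record * mates
}
```

Under this model, the default settings (8 threads, batch size 10000) with 150 bp paired-end reads and 60-byte headers come to 57,600,000 bytes, roughly 58 MB of raw read data in flight. Real usage will be higher because of working buffers and output.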

0X03 CLI command parameter description

Algorithmic ideas of SegPE:
    - Exact match search: a direct method for finding exact matches of artificial sequences.

    - Regular expression matching and Hamming distance: suitable for detecting and locating index sequences, especially when a limited number of mismatches must be tolerated.

    - Process and classify PE/SE sequences: after removing the adapter and index sequences, classify the PE/SE sequences and write new PE/SE FASTQ files.

    - Remove low-quality reads: a common step in bioinformatics that ensures the data used for analysis is of high quality.

    - Use multi-threading, SIMD, and async I/O to handle large amounts of data.
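
The Hamming-distance idea in the list above can be sketched as follows. This is a minimal illustration, not SegPE's implementation: the function names `hamming` and `find_index` are hypothetical, and the scan is a plain sliding window without seeding or SIMD.

```rust
/// Hamming distance between two equal-length byte strings,
/// or `None` if the lengths differ.
fn hamming(a: &[u8], b: &[u8]) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).filter(|(x, y)| x != y).count())
}

/// Scan `read` for the first window that matches `index` with at most
/// `max_mismatch` mismatches. Returns (offset, mismatch count).
fn find_index(read: &[u8], index: &[u8], max_mismatch: usize) -> Option<(usize, usize)> {
    if index.is_empty() || index.len() > read.len() {
        return None;
    }
    read.windows(index.len()).enumerate().find_map(|(pos, win)| {
        let d = hamming(win, index)?;
        (d <= max_mismatch).then_some((pos, d))
    })
}
```

Here `max_mismatch` plays a role similar to --error-tolerance: with a value of 0 the search reduces to exact matching, and with 1 a single substituted base in the index is still located.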
    

Usage: segpe [OPTIONS] --five-art-fa <FIVE_ART_FA> --three-art-fa <THREE_ART_FA> --five-idx-fa <FIVE_IDX_FA> --pe1-fastq <PE1_FASTQ>

Options:
      --five-art-fa <FIVE_ART_FA>
          Path of 5' artificial fasta file

      --three-art-fa <THREE_ART_FA>
          Path of 3' artificial fasta file

      --five-idx-fa <FIVE_IDX_FA>
          Path of 5' index fasta file

      --three-idx-fa <THREE_IDX_FA>
          Path of 3' index fasta file
          
          [default: ]

      --idx-loc <IDX_LOC>
          Location of index, 1: PE1 2: PE2 3: both
          
          [default: 1]

      --pe1-fastq <PE1_FASTQ>
          Path of PE1 fastq file

      --pe2-fastq <PE2_FASTQ>
          Path of PE2 fastq file
          
          [default: ]

  -s, --seed-len <SEED_LEN>
          Seed length; must not be longer than the index length
          
          [default: 6]

      --merge-pe
          Merge overlapping PE reads

      --error-tolerance <ERROR_TOLERANCE>
          Maximum number of errors tolerated
          
          [default: 1]

  -m, --match-score <MATCH_SCORE>
          Alignment match score
          
          [default: 1]

      --error-score <ERROR_SCORE>
          Alignment mismatch score
          
          [default: -1]

      --gap-open-score <GAP_OPEN_SCORE>
          Number of alignment gap open score
          
          [default: -5]

      --gap-extend-score <GAP_EXTEND_SCORE>
          Number of alignment gap extend score
          
          [default: -1]

      --qual-trim <QUAL_TRIM>
          Low-quality trimming threshold (Phred score, automatic ASCII recognition); no trimming if not set
          
          [default: 0]

      --quality-ascii-offset <QUALITY_ASCII_OFFSET>
          Quality ASCII offset
          
          [default: 33]

      --n-trim
          Whether to trim N/non-ACGT at both ends

      --length-offset <LENGTH_OFFSET>
          The minimum retention length after removing low quality values. If it is shorter than this length, it will be classified as low quality
          
          [default: 50]

      --poly-trim <POLY_TRIM>
          Whether to trim poly-A/C/G/T at both ends
          
          [default: 0]

  -n, --num-threads <NUM_THREADS>
          Number of concurrent threads
          
          [default: 8]

  -b, --batch <BATCH>
          Batch size: number of reads each thread handles
          
          [default: 10000]

      --train <TRAIN>
          Number of reads used for pre-training
          
          [default: 0]

      --trim-name
          Add trimming information to read names

  -o, --outdir <OUTDIR>
          Path of output directory
          
          [default: ./]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
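
The interplay of --qual-trim, --quality-ascii-offset and --length-offset can be illustrated with a small sketch. It assumes one common trimming strategy (drop bases from each end until one reaches the threshold); the function name is hypothetical and SegPE's actual trimming rule may differ.

```rust
/// Trim low-quality bases from both ends of a read's quality string.
/// Returns the kept half-open range (start, end), or `None` when the
/// remainder is shorter than `min_len` (the read counts as low quality).
/// Assumption: end-trimming only; this is not SegPE's confirmed rule.
fn quality_trim(qual: &[u8], offset: u8, threshold: u8, min_len: usize) -> Option<(usize, usize)> {
    // Phred score = ASCII code minus the offset (33 by default).
    let is_good = |q: u8| q.saturating_sub(offset) >= threshold;
    let start = qual.iter().position(|&q| is_good(q))?;
    let end = qual.iter().rposition(|&q| is_good(q))? + 1;
    (end - start >= min_len).then_some((start, end))
}
```

With a threshold of 0 every base passes, so nothing is trimmed, which matches the documented default behaviour of --qual-trim.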

https://excalidraw.com/#room=735721d320c8e703d672,MonxP9T7XWpM2JvdqAiTdg

Parameter	Explanation
--five-art-fa <FIVE_ART_FA>	Artificial sequences at the 5' end of the library fragment, including sequencing adapter/primer, barcode, index, and key sequence. Concatenate them in the order in which they were attached and provide the result as FASTA sequences in the 5' -> 3' sequencing direction. Include every artificial sequence that will be sequenced at the 5' end of PE1/PE2.
--three-art-fa <THREE_ART_FA>	Artificial sequences at the 3' end of the library fragment, including sequencing adapter/primer, barcode, index, and key sequence. Concatenate them in the order in which they were attached and provide the result as FASTA sequences in the 5' -> 3' sequencing direction. Include every artificial sequence that will be sequenced at the 3' end of PE1/PE2 (including the 3' end after merging PE).
--five-idx-fa <FIVE_IDX_FA>	Index sequences located at the 5' end (of PE1 or PE2), used for classification. These sequences should also appear in five-art-fa.
--three-idx-fa <THREE_IDX_FA>	Index sequences located at the 3' end (of PE1 or PE2), used for classification. These sequences should also appear in three-art-fa.
--idx-loc <IDX_LOC>	Which read's index sequence is used for classification. 1: PE1, 2: PE2, 3: both. Default value: 1 (only the index sequence of PE1 is used).
--pe1-fastq <PE1_FASTQ>	Input FASTQ file for PE1.
--pe2-fastq <PE2_FASTQ>	Input FASTQ file for PE2.
-s, --seed-len <SEED_LEN>	Minimum seed length used for searching; must not be longer than the index length. Default value: 6.
--merge-pe	Merge overlapping PE sequences. Default value: false.
--error-tolerance <ERROR_TOLERANCE>	Maximum error tolerance for alignment. Default value: 1.
-m, --match-score <MATCH_SCORE>	Score for a matching base in the alignment scoring matrix. Default value: 1.
--error-score <ERROR_SCORE>	Score for a mismatched base in the alignment scoring matrix. Default value: -1.
--gap-open-score <GAP_OPEN_SCORE>	Score for opening a gap in the alignment scoring matrix. Default value: -5.
--gap-extend-score <GAP_EXTEND_SCORE>	Score for extending a gap in the alignment scoring matrix. Default value: -1.
--qual-trim <QUAL_TRIM>	Low-quality trimming threshold (Phred score, automatic ASCII recognition); no trimming if not set. Default value: 0 (low-quality bases are not removed).
--n-trim	Trim N/non-ACGT characters at both ends (Boolean flag). Not trimmed by default.
--poly-trim <POLY_TRIM>	Trim poly-A/C/G/T (homopolymers) at both ends. Default value: 0 (no trimming).
--length-offset <LENGTH_OFFSET>	Minimum length retained after low-quality trimming. Reads shorter than this length are classified as low quality. Default value: 50 (bp).
-n, --num-threads <NUM_THREADS>	Number of parallel threads. Default value: 8.
-b, --batch <BATCH>	Number of reads processed by each thread per batch. Default value: 10000.
--train <TRAIN>	Number of reads used for pre-training. The first reads of the input are used for pre-training, and the artificial sequences found are stored for comparison and sorting. During the actual run, reads that match a stored sequence do not need to be compared and sorted again. Default value: 0 (no pre-training).
--trim-name	Write the position information of the trimmed artificial sequences into the read name. By default the read name is not modified (version v0.1.13 or above).
-o, --outdir <OUTDIR>	Output directory. Default value: ./ ⚠️ Note: if the output directory already contains an output file with the same name, new output is appended to it.
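
As an illustration of what --n-trim describes, the sketch below removes N and other non-ACGT characters from both ends of a sequence while leaving interior characters untouched. The function names are hypothetical, and treating only uppercase A/C/G/T as valid bases is an assumption; SegPE's own handling may differ.

```rust
/// True for the four unambiguous uppercase bases.
/// Assumption: lowercase and IUPAC codes count as non-ACGT here.
fn is_acgt(b: u8) -> bool {
    matches!(b, b'A' | b'C' | b'G' | b'T')
}

/// Trim N and any other non-ACGT characters from both ends of `seq`.
/// Interior non-ACGT characters are kept.
fn trim_non_acgt(seq: &[u8]) -> &[u8] {
    match seq.iter().position(|&b| is_acgt(b)) {
        Some(start) => {
            // A first valid base exists, so a last one exists as well.
            let end = seq.iter().rposition(|&b| is_acgt(b)).unwrap() + 1;
            &seq[start..end]
        }
        // No valid base at all: return an empty slice.
        None => &seq[..0],
    }
}
```

For example, the input NNACGTNAN keeps ACGTNA: the leading and trailing N characters are removed, while the interior N stays because only the ends are trimmed.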