SegPE Usage Documentation
Installing Rust/Cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Or, if Rust is already installed, upgrade it (version 1.77.0 is recommended):
rustup update stable
SegPE has been built for cross-platform use.
Because it uses SIMD acceleration, the build settings differ between CPU architectures:
On Intel/AMD x86 CPUs:
Modify Cargo.toml
[package.metadata.docs.rs]
features = ["simd_avx2"]
[target.'cfg(target_arch = "x86_64")'.dependencies]
block-aligner = { version = "0.5", features = ["simd_avx2"] }
Compile:
cargo build --features simd_avx2 --release
In theory, as long as available memory can hold either the FASTQ input being read or (number of threads × batch size) reads, there is no need to worry about running out of memory, so you can use as many threads as your machine provides.
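As a rough illustration of that bound, the number of reads buffered at once is about the thread count times the batch size. The sketch below turns this into a byte estimate. The function names and the per-read byte figure are hypothetical and are not part of SegPE.

```rust
/// Upper bound on the number of reads buffered at once,
/// assuming every thread holds exactly one batch.
fn reads_in_flight(num_threads: usize, batch: usize) -> usize {
    num_threads * batch
}

/// Rough memory estimate in bytes.
/// `bytes_per_read` is an assumed average size of one FASTQ record
/// (name + sequence + quality); measure it on your own data.
fn estimated_bytes(num_threads: usize, batch: usize, bytes_per_read: usize) -> usize {
    reads_in_flight(num_threads, batch) * bytes_per_read
}
```

With the defaults (8 threads, batch size 10000) this gives 80,000 reads in flight. At an assumed 400 bytes per record that is about 32 MB per FASTQ file, before counting any working buffers.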
segpe --help
Algorithmic ideas of SegPE:
- Exact match search: A direct method for finding exact matches of artificial sequences.
- Regular expression matching and Hamming distance: Suitable for detecting and locating index sequences, especially when a certain number of mismatches must be tolerated.
- Process and classify PE sequences: After removing the adapter and index sequences, classify the PE sequences and create new PE FASTQ files.
- Use multithreading, SIMD, and async I/O to handle large amounts of data.
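The first two ideas in the list can be sketched in a few lines. This is a minimal, scalar illustration under stated assumptions, not SegPE's actual implementation (which uses SIMD and may work quite differently). The function names are hypothetical, and the regular-expression step is not shown.

```rust
/// Number of positions at which two equal-length byte strings differ.
/// Returns None when the lengths differ, because the Hamming distance
/// is only defined for sequences of equal length.
fn hamming(a: &[u8], b: &[u8]) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).filter(|(x, y)| x != y).count())
}

/// Leftmost position at which `pattern` occurs in `read` with at most
/// `max_mismatches` substitutions. With `max_mismatches == 0` this is
/// an exact match search.
fn find_with_mismatches(read: &[u8], pattern: &[u8], max_mismatches: usize) -> Option<usize> {
    if pattern.is_empty() || pattern.len() > read.len() {
        return None;
    }
    read.windows(pattern.len())
        .position(|w| hamming(w, pattern).map_or(false, |d| d <= max_mismatches))
}
```

For example, `find_with_mismatches(b"ACGTTAGGCA", b"TACG", 1)` locates the window `TAGG` at offset 4, while the same call with a tolerance of 0 finds nothing.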
Usage: segpe [OPTIONS] --five-art-fa <FIVE_ART_FA> --three-art-fa <THREE_ART_FA> --five-idx-fa <FIVE_IDX_FA> --pe1-fastq <PE1_FASTQ> --pe2-fastq <PE2_FASTQ>
Options:
--five-art-fa <FIVE_ART_FA>
Path of 5' artificial fasta file
--three-art-fa <THREE_ART_FA>
Path of 3' artificial fasta file
--five-idx-fa <FIVE_IDX_FA>
Path of 5' index fasta file
--three-idx-fa <THREE_IDX_FA>
Path of 3' index fasta file
[default: ]
--idx-loc <IDX_LOC>
Location of index, 1: PE1 2: PE2 3: both
[default: 1]
--pe1-fastq <PE1_FASTQ>
Path of PE1 fastq file
--pe2-fastq <PE2_FASTQ>
Path of PE2 fastq file
-s, --seed-len <SEED_LEN>
Seed length, must not be longer than the index length
[default: 6]
--merge-pe
Merge overlapping PE reads
--error-tolerance <ERROR_TOLERANCE>
Error tolerance
[default: 1]
-m, --match-score <MATCH_SCORE>
Alignment match score
[default: 1]
--error-score <ERROR_SCORE>
Alignment mismatch score
[default: -1]
--gap-open-score <GAP_OPEN_SCORE>
Alignment gap open score
[default: -5]
--gap-extend-score <GAP_EXTEND_SCORE>
Alignment gap extend score
[default: -1]
-n, --num-threads <NUM_THREADS>
Number of concurrent threads
[default: 8]
-b, --batch <BATCH>
Batch size: number of reads each thread handles
[default: 10000]
--train <TRAIN>
Number of reads used for pre-training
[default: 0]
--trim-name
Add trim info to read names
-o, --outdir <OUTDIR>
Path of output directory
[default: ./]
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
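To illustrate how -n/--num-threads and -b/--batch relate, here is a minimal sketch of batch-wise parallel processing using only the standard library. It assumes each thread takes one batch per round. SegPE's real scheduler is not described in this document and may differ, and `process_in_batches` and `work` are hypothetical names.

```rust
use std::thread;

/// Process `reads` in batches of `batch` reads, with at most `num_threads`
/// batches running at the same time. `work` stands in for the per-read
/// processing. Results are returned in input order.
fn process_in_batches<F>(reads: &[String], num_threads: usize, batch: usize, work: F) -> Vec<usize>
where
    F: Fn(&str) -> usize + Sync,
{
    // Guard against zero so that `chunks` does not panic.
    let batch = batch.max(1);
    let num_threads = num_threads.max(1);
    let work = &work;
    let mut results = Vec::with_capacity(reads.len());

    // One round hands one batch to each thread.
    for round in reads.chunks(batch * num_threads) {
        let round_results: Vec<Vec<usize>> = thread::scope(|s| {
            // Spawn all threads of this round first, then join them in order.
            let handles: Vec<_> = round
                .chunks(batch)
                .map(|chunk| {
                    s.spawn(move || {
                        chunk.iter().map(|r| work(r.as_str())).collect::<Vec<usize>>()
                    })
                })
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        });
        results.extend(round_results.into_iter().flatten());
    }
    results
}
```

In this sketch, memory use per round is bounded by `num_threads * batch` reads, which is why the two options together determine the memory footprint.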
https://excalidraw.com/#room=735721d320c8e703d672,MonxP9T7XWpM2JvdqAiTdg
Parameter | Explanation |
---|---|
--five-art-fa <FIVE_ART_FA> | FASTA file of the artificial sequences at the 5' end of the library fragment, including sequencing adapter/primer, barcode, index, and key sequence. Concatenate these elements in the order in which they are attached and write them 5' -> 3' in the sequencing direction. The file should cover the artificial sequences that will be sequenced at the 5' end of PE1/PE2 |
--three-art-fa <THREE_ART_FA> | FASTA file of the artificial sequences at the 3' end of the library fragment, including sequencing adapter/primer, barcode, index, and key sequence. Concatenate these elements in the order in which they are attached and write them 5' -> 3' in the sequencing direction. The file should cover the artificial sequences that will be sequenced at the 3' end of PE1/PE2 (including the 3' end after merging PE) |
--five-idx-fa <FIVE_IDX_FA> | Index sequences located at the 5' end, used for classifying reads, whether at the 5' end of PE1 or PE2. These sequences should also appear in five-art-fa |
--three-idx-fa <THREE_IDX_FA> | Index sequences located at the 3' end, used for classifying reads, whether at the 3' end of PE1 or PE2. These sequences should also appear in three-art-fa |
--idx-loc <IDX_LOC> | Specifies which read's index sequence is used for classification: 1: PE1, 2: PE2, 3: both. Default value: 1, i.e. only the index sequence of PE1 is used |
--pe1-fastq <PE1_FASTQ> | FASTQ file of the PE1 reads |
--pe2-fastq <PE2_FASTQ> | FASTQ file of the PE2 reads |
-s, --seed-len <SEED_LEN> | Minimum seed length used for searching; must not be longer than the index length. Default value: 6 |
--merge-pe | Whether to merge overlapping PE sequences. Default value: false |
--error-tolerance <ERROR_TOLERANCE> | Maximum error tolerance for alignment. Default value: 1 |
-m, --match-score <MATCH_SCORE> | Score for matching bases in the alignment scoring matrix. Default value: 1 |
--error-score <ERROR_SCORE> | Score for mismatched bases in the alignment scoring matrix. Default value: -1 |
--gap-open-score <GAP_OPEN_SCORE> | Score for opening a gap in the alignment. Default value: -5 |
--gap-extend-score <GAP_EXTEND_SCORE> | Score for extending a gap in the alignment. Default value: -1 |
-n, --num-threads <NUM_THREADS> | Number of parallel threads. Default value: 8 |
-b, --batch <BATCH> | Number of reads each thread processes per batch. Default value: 10000 |
--train <TRAIN> | Number of reads used for pre-training. The first reads of the input are used for pre-training, and the artificial sequences found in them are aligned, sorted, and stored. During the actual run, a read that matches a stored sequence does not need to be aligned and sorted again. Default value: 0 (i.e. no pre-training) |
--trim-name | Select this option to write the position information of the trimmed artificial sequences into the read name. By default, read names are not modified (version v0.1.13 or above) |
-o, --outdir <OUTDIR> | Output path. Default value: ./ ⚠️ Note: If an output file with the same name already exists in the output path, new output is appended to it. |
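
To make the four scoring parameters concrete, the sketch below scores a finished alignment under an affine gap model using the default values from the table. It is an illustration only. The `Op` and `Scores` types are hypothetical, and it assumes that the first base of a gap costs the gap open score and each further base costs the gap extend score. SegPE's alignment backend may define this differently.

```rust
/// One column of a pairwise alignment.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Op {
    Match,
    Mismatch,
    Gap, // one inserted or deleted base
}

/// Scoring parameters, named after the command-line options.
struct Scores {
    match_score: i32,
    error_score: i32,
    gap_open_score: i32,
    gap_extend_score: i32,
}

/// Score an alignment under an affine gap model.
/// Assumption: the first base of a gap run costs `gap_open_score`,
/// every additional base in the same run costs `gap_extend_score`.
/// Simplification: insertions and deletions are not distinguished.
fn score_alignment(ops: &[Op], s: &Scores) -> i32 {
    let mut total = 0;
    let mut in_gap = false;
    for &op in ops {
        match op {
            Op::Match => {
                total += s.match_score;
                in_gap = false;
            }
            Op::Mismatch => {
                total += s.error_score;
                in_gap = false;
            }
            Op::Gap => {
                total += if in_gap { s.gap_extend_score } else { s.gap_open_score };
                in_gap = true;
            }
        }
    }
    total
}
```

Under these assumptions and the defaults (1, -1, -5, -1), an alignment of 4 matches, 1 mismatch, a 2-base gap, and 3 more matches scores 4 - 1 - 5 - 1 + 3 = 0, which shows how strongly a single gap is penalized relative to a mismatch.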