Install Rust/Cargo:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Or upgrade an existing installation (upgrading to 1.87.0 is recommended):
rustup update stable
The compiled binary can be used across platforms.
Because SIMD acceleration is used, the build configuration differs between CPUs:
On Intel/AMD x86 CPUs
Edit Cargo.toml:
[package.metadata.docs.rs]
features = ["simd_avx2"]
[target.'cfg(target_arch = "x86_64")'.dependencies]
block-aligner = { version = "0.5", features = ["simd_avx2"] }
Build:
cargo build --features simd_avx2 --release
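In theory, as long as available memory can hold the FASTQ input, or the memory needed for (number of threads × batch_reads) reads, running out of memory is not a concern, so you can use as many threads as the machine allows.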
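As a rough illustration of the "threads × batch_reads" bound above, the following sketch estimates how many bytes of reads are in flight at once. The function name and the `avg_record_bytes` parameter are assumptions made for this example (an average FASTQ record size you would measure on your own data); they are not part of SegPE.

```rust
/// Rough upper bound on the bytes of reads held in flight by the workers.
///
/// `avg_record_bytes` is an assumed average size of one FASTQ record
/// (name + sequence + separator + quality); it is NOT a SegPE option.
/// `paired` doubles the estimate because each read has a PE1 and PE2 mate.
fn estimate_in_flight_bytes(
    num_threads: u64,
    batch_reads: u64,
    avg_record_bytes: u64,
    paired: bool,
) -> u64 {
    let mates = if paired { 2 } else { 1 };
    num_threads * batch_reads * avg_record_bytes * mates
}
```

With the default 8 threads and batch size of 10000, and an assumed 400 bytes per record, paired-end input comes to about 64 MB of reads in flight, so the thread count is unlikely to be limited by memory on typical machines.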
Algorithmic ideas of SegPE:
- Exact match search: a direct method for finding exact occurrences of the artificial sequences.
- Regular expression matching and Hamming distance: used to detect and locate index sequences while tolerating a limited number of mismatches.
- Process and classify PE/SE sequences: after removing the adapter and index sequences, classify the PE/SE sequences and write new PE/SE FASTQ files.
- Remove low-quality reads: a common step in bioinformatics that ensures the data used for analysis is of high quality.
- Use multithreading, SIMD, and async I/O to handle large amounts of data.
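The Hamming-distance idea in the list above can be sketched as a sliding-window search. This is a minimal illustration, not SegPE's actual implementation (which also uses seeds, SIMD, and alignment scoring); the function names `hamming` and `find_index` are hypothetical.

```rust
/// Number of mismatching positions between two equal-length sequences.
/// Returns `None` when the lengths differ, since Hamming distance is
/// only defined for sequences of equal length.
fn hamming(a: &[u8], b: &[u8]) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).filter(|(x, y)| x != y).count())
}

/// Slide `index` along `read` and return the first `(position, mismatches)`
/// where the window differs from `index` in at most `max_mismatches` bases.
fn find_index(read: &[u8], index: &[u8], max_mismatches: usize) -> Option<(usize, usize)> {
    if index.is_empty() || index.len() > read.len() {
        return None;
    }
    read.windows(index.len())
        .enumerate()
        .filter_map(|(pos, window)| hamming(window, index).map(|d| (pos, d)))
        .find(|&(_, d)| d <= max_mismatches)
}
```

For example, `find_index(b"ACGTTAGCCA", b"TAGG", 1)` locates the index at position 4 with one mismatch, while the same call with `max_mismatches` set to 0 finds nothing.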
Usage: segpe [OPTIONS] --five-art-fa <FIVE_ART_FA> --three-art-fa <THREE_ART_FA> --five-idx-fa <FIVE_IDX_FA> --pe1-fastq <PE1_FASTQ>
Options:
--five-art-fa <FIVE_ART_FA>
Path of 5' artificial fasta file
--three-art-fa <THREE_ART_FA>
Path of 3' artificial fasta file
--five-idx-fa <FIVE_IDX_FA>
Path of 5' index fasta file
--three-idx-fa <THREE_IDX_FA>
Path of 3' index fasta file
[default: ]
--idx-loc <IDX_LOC>
Location of index, 1: PE1 2: PE2 3: both
[default: 1]
--pe1-fastq <PE1_FASTQ>
Path of PE1 fastq file
--pe2-fastq <PE2_FASTQ>
Path of PE2 fastq file
[default: ]
-s, --seed-len <SEED_LEN>
Seed length; must not be longer than the index length
[default: 6]
--merge-pe
Merge overlapping PE reads
--error-tolerance <ERROR_TOLERANCE>
Maximum number of errors tolerated
[default: 1]
-m, --match-score <MATCH_SCORE>
Alignment match score
[default: 1]
--error-score <ERROR_SCORE>
Alignment mismatch score
[default: -1]
--gap-open-score <GAP_OPEN_SCORE>
Alignment gap open score
[default: -5]
--gap-extend-score <GAP_EXTEND_SCORE>
Alignment gap extend score
[default: -1]
--qual-trim <QUAL_TRIM>
Low-quality trimming threshold (Phred score, automatic ASCII recognition); no trimming if not set
[default: 0]
--quality-ascii-offset <QUALITY_ASCII_OFFSET>
Quality ASCII offset
[default: 33]
--n-trim
Whether to trim N/non-ACGT at both ends
--length-offset <LENGTH_OFFSET>
Minimum length to keep after low-quality trimming. Reads shorter than this are classified as low quality
[default: 50]
--poly-trim <POLY_TRIM>
Whether to trim poly-A/C/G/T at both ends
[default: 0]
-n, --num-threads <NUM_THREADS>
Number of concurrent threads
[default: 8]
-b, --batch <BATCH>
Batch size: number of reads each thread handles at a time
[default: 10000]
--train <TRAIN>
Number of reads used for pretraining
[default: 0]
--trim-name
Add trim info to read names
-o, --outdir <OUTDIR>
Path of output directory
[default: ./]
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
https://excalidraw.com/#room=735721d320c8e703d672,MonxP9T7XWpM2JvdqAiTdg
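To make the interplay of `--qual-trim`, `--quality-ascii-offset`, and `--length-offset` concrete, here is a minimal sketch of end trimming by Phred score. It assumes a simple "trim from both ends until a base meets the threshold" rule; SegPE's actual strategy may differ (for example, a sliding window), and the function names are hypothetical.

```rust
/// Decode one FASTQ quality character into a Phred score.
fn phred(q: u8, ascii_offset: u8) -> u8 {
    q.saturating_sub(ascii_offset)
}

/// Return the half-open range `[start, end)` of bases to keep after
/// trimming bases with Phred score below `threshold` from both ends.
/// An empty range (`start == end`) means no base passed the threshold.
fn quality_trim(qual: &[u8], ascii_offset: u8, threshold: u8) -> (usize, usize) {
    let start = qual
        .iter()
        .position(|&q| phred(q, ascii_offset) >= threshold)
        .unwrap_or(qual.len());
    let end = qual
        .iter()
        .rposition(|&q| phred(q, ascii_offset) >= threshold)
        .map_or(start, |i| i + 1);
    (start, end)
}

/// A trimmed read is kept only if at least `min_len` bases remain;
/// otherwise it would be classified as low quality.
fn passes_length(kept: (usize, usize), min_len: usize) -> bool {
    kept.1 - kept.0 >= min_len
}
```

With offset 33 and threshold 20, the quality string `##IIIII#` keeps bases 2 to 6, that is the range `(2, 7)`. A threshold of 0 keeps every base, which matches the documented default of no trimming.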
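Parameter | Description |
---|---|
--five-art-fa <FIVE_ART_FA> | Artificial sequences attached to the 5' end of the library fragment, including Sequencing Adapter/Primer, Barcode, Index, and Key Sequence, concatenated in the 5'->3' sequencing direction into a FASTA file. It should cover the artificial sequences that can be read at the 5' end of PE1/PE2. |
--three-art-fa <THREE_ART_FA> | Artificial sequences attached to the 3' end of the library fragment, including Sequencing Adapter/Primer, Barcode, Index, and Key Sequence, concatenated in the 5'->3' sequencing direction into a FASTA file. It should cover the artificial sequences that can be read at the 3' end of PE1/PE2 (including the 3' end after merging PE reads). |
--five-idx-fa <FIVE_IDX_FA> | Index sequences at the 5' end, used to split libraries, whether they sit at the 5' end of PE1 or PE2. Each sequence should also appear in five-art-fa. |
--three-idx-fa <THREE_IDX_FA> | Index sequences at the 3' end, used to split libraries, whether they sit at the 3' end of PE1 or PE2. Each sequence should also appear in three-art-fa. |
--idx-loc <IDX_LOC> | Whether assignment uses the index of PE1, the index of PE2, or both. 1: PE1, 2: PE2, 3: both. Default: 1 (only the PE1 index is used). |
--pe1-fastq <PE1_FASTQ> | Input FASTQ file of PE1 reads. |
--pe2-fastq <PE2_FASTQ> | Input FASTQ file of PE2 reads. |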
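-s, --seed-len <SEED_LEN> | Minimum seed length used for searching. Default: 6 |
--merge-pe | Flag: merge read-through PE reads that overlap. Default: off (false) |
--error-tolerance <ERROR_TOLERANCE> | Maximum number of errors tolerated in alignment. Default: 1 |
-m, --match-score <MATCH_SCORE> | Score for a matching base in the alignment scoring matrix. Default: 1 |
--error-score <ERROR_SCORE> | Score for a mismatching base in the alignment scoring matrix. Default: -1 |
--gap-open-score <GAP_OPEN_SCORE> | Score for opening a gap in the alignment scoring matrix. Default: -5 |
--gap-extend-score <GAP_EXTEND_SCORE> | Score for extending a gap in the alignment scoring matrix. Default: -1 |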
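--qual-trim <QUAL_TRIM> | Low-quality trimming threshold (Phred score; the ASCII encoding is recognized automatically). No trimming if unset. Default: 0 (no low-quality trimming) |
--n-trim | Flag: trim N/non-ACGT bases at both ends. Default: off (not trimmed) |
--poly-trim <POLY_TRIM> | Trim homopolymer (poly-A/C/G/T) runs at the ends. Default: 0 (no trimming) |
--length-offset <LENGTH_OFFSET> | Minimum length to keep after low-quality trimming. Shorter reads are classified as low quality. Default: 50 |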
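-n, --num-threads <NUM_THREADS> | Number of parallel threads. Default: 8 |
-b, --batch <BATCH> | Number of reads each thread processes per batch. Default: 10000 |
--train <TRAIN> | Pretraining size: the first N reads are used for pretraining, and the artificial sequences identified by alignment are cached. During the actual run, a read that matches a cached sequence needs no further alignment. Default: 0 (no pretraining) |
--trim-name | Select this option to write the positions of the trimmed artificial sequences into the read names. By default, read names are no longer modified (v0.1.13 and later). |
-o, --outdir <OUTDIR> | Output directory. Default: ./ |
⚠️ Note: if output files with the same names already exist in the output directory, new output is appended to them. |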
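
The relationship between `--num-threads` and `--batch` in the table above can be illustrated with a small batching sketch built on `std::thread::scope`. This is only an assumed model of "each thread handles one batch of reads"; SegPE's real pipeline (which also uses async I/O) is not shown here, and `process_in_batches` is a hypothetical name.

```rust
use std::thread;

/// Process `reads` in batches of `batch` items, running at most
/// `num_threads` batches concurrently, and return the per-read results
/// in input order. Roughly `num_threads * batch` reads are in flight
/// during each round.
fn process_in_batches<T, R, F>(reads: &[T], num_threads: usize, batch: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let batch = batch.max(1);
    let num_threads = num_threads.max(1);
    let work = &work; // shared reference, copied into every worker closure
    let mut out = Vec::with_capacity(reads.len());

    for round in reads.chunks(num_threads * batch) {
        // Scoped threads may borrow `reads` and `work` without `Arc`.
        let results: Vec<Vec<R>> = thread::scope(|s| {
            let handles: Vec<_> = round
                .chunks(batch)
                .map(|chunk| s.spawn(move || chunk.iter().map(work).collect::<Vec<R>>()))
                .collect();
            handles
                .into_iter()
                .map(|h| h.join().expect("worker thread panicked"))
                .collect()
        });
        out.extend(results.into_iter().flatten());
    }
    out
}
```

Joining the handles in spawn order keeps the output in the same order as the input, which matters when PE1 and PE2 records must stay paired.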