SegPE User Guide

0x01 Setting Up the Environment

Install Rust/Cargo:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Or update an existing installation; upgrading to 1.87.0 is recommended:

rustup update stable

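0x02 Building

Once compiled, SegPE can be used on multiple platforms (cross-platform).

Because SegPE uses SIMD acceleration, the build settings differ from one CPU architecture to another:

On Intel/AMD x86 CPUs:

Edit Cargo.toml: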

[package.metadata.docs.rs]
features = ["simd_avx2"]

[target.'cfg(target_arch = "x86_64")'.dependencies]
block-aligner = { version = "0.5", features = ["simd_avx2"] }

Build:

cargo build --features simd_avx2 --release

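In theory, as long as the machine has enough memory for the FASTQ data being read in (roughly number of threads × batch_reads reads at a time), there is no need to worry about running out of memory, so you can use as many threads as the machine allows.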
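As a rough illustration of the memory reasoning above, the sketch below estimates the raw bytes held by all in-flight batches as threads × batch size × bytes per read. The function name, the read length, the name length and the "sequence plus equally long quality string" cost model are assumptions made for illustration; they are not taken from SegPE itself, whose real per-read overhead may differ.

```python
def estimate_batch_memory_bytes(num_threads, batch_reads,
                                read_len=150, name_len=60, paired=True):
    """Rough estimate of the raw bytes held in flight by all worker batches.

    Assumption: each read costs its sequence, an equally long quality
    string, and its name. Allocator and bookkeeping overhead are ignored.
    """
    bytes_per_read = 2 * read_len + name_len
    mates = 2 if paired else 1
    return num_threads * batch_reads * mates * bytes_per_read


# Defaults from the help text: 8 threads, batches of 10000 reads.
total = estimate_batch_memory_bytes(num_threads=8, batch_reads=10_000)
print(f"{total / 1024**2:.1f} MiB")
```

Under these assumptions the buffers stay in the tens of megabytes, which is why raising the thread count is usually limited by CPU cores rather than by memory.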

0x03 CLI Options

Algorithmic ideas of SegPE:
    - Exact match search: A direct method for finding exact matches of artificial sequences.

    - Regular expression matching and Hamming distance: Suitable for detecting and locating index sequences, especially when a certain number of mismatches must be tolerated.

    - Process and classify PE/SE sequences: After removing the adapter and index sequences, classify the PE/SE sequences and create new PE/SE FASTQ files.

    - Remove low-quality reads: A common step in bioinformatics to ensure that the data used for analysis is of high quality.

    - Use multi-threading, SIMD and async I/O to handle large amounts of data.
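The idea of locating an index sequence while tolerating a few mismatches can be sketched as a sliding-window Hamming-distance search. This is a minimal illustration, not SegPE's implementation: the function names and the example sequences are hypothetical, and the sketch covers only substitutions (no regular expressions, seeds, or gapped alignment).

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def find_index(read, index, max_mismatch=1):
    """Return (position, mismatches) of the best window within tolerance.

    Every window of len(index) bases in the read is compared with the
    index; the window with the fewest mismatches wins (leftmost on ties).
    Returns None when no window is within max_mismatch.
    """
    best = None
    for pos in range(len(read) - len(index) + 1):
        d = hamming(read[pos:pos + len(index)], index)
        if d <= max_mismatch and (best is None or d < best[1]):
            best = (pos, d)
    return best


read = "TTACGTTCAGGATCC"
print(find_index(read, "ACGTACAG"))     # one substitution is tolerated
print(find_index(read, "ACGTACAG", 0))  # exact matching finds nothing
```

With a tolerance of 0 the search degenerates to exact matching, which is the first idea in the list above.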
    

Usage: segpe [OPTIONS] --five-art-fa <FIVE_ART_FA> --three-art-fa <THREE_ART_FA> --five-idx-fa <FIVE_IDX_FA> --pe1-fastq <PE1_FASTQ>

Options:
      --five-art-fa <FIVE_ART_FA>
          Path of 5' artificial fasta file

      --three-art-fa <THREE_ART_FA>
          Path of 3' artificial fasta file

      --five-idx-fa <FIVE_IDX_FA>
          Path of 5' index fasta file

      --three-idx-fa <THREE_IDX_FA>
          Path of 3' index fasta file
          
          [default: ]

      --idx-loc <IDX_LOC>
          Location of index, 1: PE1 2: PE2 3: both
          
          [default: 1]

      --pe1-fastq <PE1_FASTQ>
          Path of PE1 fastq file

      --pe2-fastq <PE2_FASTQ>
          Path of PE2 fastq file
          
          [default: ]

  -s, --seed-len <SEED_LEN>
          Seed length; must not be longer than the index length
          
          [default: 6]

      --merge-pe
          Merge overlapping PE reads

      --error-tolerance <ERROR_TOLERANCE>
          Maximum number of errors tolerated
          
          [default: 1]

  -m, --match-score <MATCH_SCORE>
          Alignment match score
          
          [default: 1]

      --error-score <ERROR_SCORE>
          Alignment mismatch score
          
          [default: -1]

      --gap-open-score <GAP_OPEN_SCORE>
          Alignment gap-open score
          
          [default: -5]

      --gap-extend-score <GAP_EXTEND_SCORE>
          Alignment gap-extend score
          
          [default: -1]

      --qual-trim <QUAL_TRIM>
          Low quality pruning threshold (Phred score, automatic ASCII recognition), no pruning if not set
          
          [default: 0]

      --quality-ascii-offset <QUALITY_ASCII_OFFSET>
          quality ASCII offset
          
          [default: 33]

      --n-trim
          Whether to trim N/non-ACGT at both ends

      --length-offset <LENGTH_OFFSET>
          The minimum retention length after removing low quality values. If it is shorter than this length, it will be classified as low quality
          
          [default: 50]

      --poly-trim <POLY_TRIM>
          Whether to trim poly-A/C/G/T at both ends
          
          [default: 0]

  -n, --num-threads <NUM_THREADS>
          Number of concurrent threads
          
          [default: 8]

  -b, --batch <BATCH>
          Batch size: number of reads each thread handles
          
          [default: 10000]

      --train <TRAIN>
          Number of reads used for pre-training
          
          [default: 0]

      --trim-name
          Add trim info to read names

  -o, --outdir <OUTDIR>
          Path of output directory
          
          [default: ./]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
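To make the usage line concrete, the sketch below assembles a paired-end invocation from the options listed in the help text above. The helper function and all file names are hypothetical placeholders; only the flags themselves come from the help output. The command is printed rather than executed, so it can be checked before running.

```python
import shlex


def build_segpe_command(five_art, three_art, five_idx, pe1, pe2=None,
                        threads=8, batch=10000, outdir="./"):
    """Build the argument list for a segpe run (hypothetical helper)."""
    cmd = ["segpe",
           "--five-art-fa", five_art,
           "--three-art-fa", three_art,
           "--five-idx-fa", five_idx,
           "--pe1-fastq", pe1]
    if pe2:  # --pe2-fastq is optional; leave it out for single-end data
        cmd += ["--pe2-fastq", pe2]
    cmd += ["-n", str(threads), "-b", str(batch), "-o", outdir]
    return cmd


# File names below are placeholders, not files shipped with SegPE.
cmd = build_segpe_command("five_art.fa", "three_art.fa", "five_idx.fa",
                          "sample_R1.fastq", pe2="sample_R2.fastq")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it once `segpe` is on the PATH.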

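Option	Description
--five-art-fa <FIVE_ART_FA>	Artificial sequences attached to the 5' end of the library fragment, including sequencing adapter/primer, barcode, index, and key sequence. Provide them as FASTA sequences concatenated in the 5'->3' sequencing direction, covering the artificial sequences that are read at the 5' end of PE1/PE2.
--three-art-fa <THREE_ART_FA>	Artificial sequences attached to the 3' end of the library fragment, including sequencing adapter/primer, barcode, index, and key sequence. Provide them as FASTA sequences concatenated in the 5'->3' sequencing direction, covering the artificial sequences that are read at the 3' end of PE1/PE2 (including the 3' end after merging PE reads).
--five-idx-fa <FIVE_IDX_FA>	Index sequence at the 5' end, used to split libraries, whether it sits at the 5' end of PE1 or of PE2. This sequence should also appear in five-art-fa.
--three-idx-fa <THREE_IDX_FA>	Index sequence at the 3' end, used to split libraries, whether it sits at the 3' end of PE1 or of PE2. This sequence should also appear in three-art-fa.
--idx-loc <IDX_LOC>	Whether the PE1 index, the PE2 index, or both are used for assignment. 1: PE1, 2: PE2, 3: both. Default: 1 (only the PE1 index is used).
--pe1-fastq <PE1_FASTQ>	Input FASTQ file of PE1 reads.
--pe2-fastq <PE2_FASTQ>	Input FASTQ file of PE2 reads.
-s, --seed-len <SEED_LEN>	Minimum seed length used for searching. Default: 6
--merge-pe	Merge PE reads that overlap because the insert was sequenced through. Default: false
--error-tolerance <ERROR_TOLERANCE>	Maximum error tolerance for alignment. Default: 1
-m, --match-score <MATCH_SCORE>	Score for a matching base in the alignment scoring matrix. Default: 1
--error-score <ERROR_SCORE>	Score for a mismatching base in the alignment scoring matrix. Default: -1
--gap-open-score <GAP_OPEN_SCORE>	Score for opening a gap in the alignment scoring matrix. Default: -5
--gap-extend-score <GAP_EXTEND_SCORE>	Score for extending a gap in the alignment scoring matrix. Default: -1
--qual-trim <QUAL_TRIM>	Low-quality trimming threshold (Phred score, ASCII encoding recognized automatically); no trimming if not set. Default: 0 (low-quality sequence is not removed)
--n-trim	Boolean flag: trim N/non-ACGT bases at both ends. Off by default.
--poly-trim <POLY_TRIM>	Remove polymer sequences at the read ends. Default: 0 (not removed)
--length-offset <LENGTH_OFFSET>	Minimum length to keep after low-quality trimming; shorter reads are classified as lowquality. Default: 50
-n, --num-threads <NUM_THREADS>	Number of parallel threads. Default: 8
-b, --batch <BATCH>	Number of reads handled by one thread. Default: 10000
--train <TRAIN>	Pre-training size: the first N reads are used for pre-training, and the art sequences identified by alignment are stored. During the actual run, a read that matches a stored sequence does not need to be aligned again. Default: 0 (no pre-training)
--trim-name	Write the positions of the trimmed artificial sequences into the read names. By default read names are no longer modified (v0.1.13 and later).
-o, --outdir <OUTDIR>	Output directory. Default: ./
⚠️ Note: if the output directory already contains output files with the same names, new output is appended to them.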
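How --qual-trim, --quality-ascii-offset and --length-offset interact can be illustrated with a small sketch. It decodes the quality string into Phred scores, trims bases below the threshold from both ends, and labels the read by its remaining length. The function names, the "pass" label and the both-ends trimming strategy are assumptions for illustration; the way SegPE actually scans for low-quality bases is not described in this document.

```python
def phred_scores(quality, ascii_offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(c) - ascii_offset for c in quality]


def quality_trim(seq, quality, qual_trim=20, length_offset=50,
                 ascii_offset=33):
    """Trim low-quality ends, then classify the read by remaining length.

    Assumption: bases scoring below qual_trim are removed from both ends
    until a base at or above the threshold is reached.
    Returns (trimmed_seq, trimmed_quality, label).
    """
    scores = phred_scores(quality, ascii_offset)
    start, end = 0, len(scores)
    if qual_trim > 0:  # 0 means "do not trim", as in the option table
        while start < end and scores[start] < qual_trim:
            start += 1
        while end > start and scores[end - 1] < qual_trim:
            end -= 1
    label = "pass" if end - start >= length_offset else "lowquality"
    return seq[start:end], quality[start:end], label


# '#' decodes to Phred 2, 'I' to Phred 40, '!' to Phred 0 at offset 33.
print(quality_trim("ACGTACGTAC", "##IIIIII#!", qual_trim=20, length_offset=5))
print(quality_trim("ACGTACGTAC", "##IIIIII#!", qual_trim=20, length_offset=50))
```

The same read passes with a minimum length of 5 but is labeled lowquality with the default of 50, which shows why --length-offset should be chosen with the expected insert length in mind.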

0x04 Preparing the Data