Sampling and sequencing
A C. erythropterus sample obtained from Lake Hulun (Inner Mongolia, China) was used for genome sequencing and assembly. Muscle tissue was stored at −80 °C and used for DNA extraction, genomic DNA sequencing, and Hi-C library construction. High-molecular-weight DNA was obtained using a standard SDS extraction method.
Sequencing libraries were generated using the TruSeq Nano DNA HT Sample Preparation Kit (Illumina, USA), and index codes were added to attribute sequences to each sample according to the manufacturer’s recommendations. The libraries were sequenced on the Illumina NovaSeq 6000 platform, yielding 150 bp paired-end reads with an insert size of approximately 350 bp. Illumina sequencing produced 41 Gb of raw genomic data for C. erythropterus.
Flow cells were sequenced on a PromethION sequencer according to the manufacturer’s instructions. Nanopore sequencing of the long-read library yielded 132 Gb of high-quality data, covering the genome assembly 117.86-fold.
A high-throughput chromatin conformation capture (Hi-C) library was constructed and sequenced to obtain a chromosome-level assembly of the genome [10]. The Hi-C library was prepared from the same sample used for genome sequencing. After the tissue was ground in liquid nitrogen, cross-linking was performed with 4% formaldehyde solution under vacuum for 30 min at room temperature. Glycine (2.5 M) was then added for 5 min to quench the cross-linking reaction. Nuclei were digested with 100 units of MboI, labeled with biotin-14-dCTP, and then ligated with T4 DNA ligase. Following an overnight incubation to reverse the cross-linking, the ligated DNA was sheared into 200–600 bp fragments. The DNA fragments were subjected to end repair and A-tailing, followed by biotin-streptavidin-mediated purification. Finally, the Hi-C libraries were quantified and sequenced on an Illumina platform in PE150 mode.
RNA was also extracted from seven tissues of C. erythropterus, including intestine, liver, muscle, spleen, heart, gallbladder, and kidney. Transcriptome sequencing was performed on the Illumina NovaSeq 6000 platform, and the resulting reads were used for gene prediction.
Genome size estimation and contig assembly
Illumina data were analyzed for k-mer depth frequency distribution to estimate the genome size, heterozygosity, and proportion of repetitive sequences of C. erythropterus. Genome size (G) was calculated according to the following formula: G = k-mer count/k-mer depth, where k-mer count and k-mer depth are the total number and average depth of 17-mers, respectively [11]. The k-mer depth frequency distribution analysis was performed using 41 Gb of clean Illumina data from C. erythropterus (Fig. 1). Based on a total of 30,891,679,507 17-mers and a peak 17-mer depth of 27, the estimated genome size was 1120.68 Mb, the heterozygosity was 0.31%, and the proportion of repetitive sequences and the GC content were 5-5-% and 37.95%, respectively (Table 1).
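The formula above can be illustrated with a minimal Python sketch. The function names and the toy histogram are hypothetical and are not part of the authors' pipeline. Note that the raw quotient of the two reported values is about 1144.14 Mb; the reported estimate of 1120.68 Mb presumably reflects a correction step (for example, excluding low-depth erroneous k-mers) that is not detailed in the text, which is an assumption here.

```python
from typing import Dict


def estimate_genome_size(kmer_count: int, kmer_depth: float) -> float:
    """Return the genome size in bp using G = k-mer count / k-mer depth."""
    if kmer_depth <= 0:
        raise ValueError("k-mer depth must be positive")
    return kmer_count / kmer_depth


def peak_depth(histogram: Dict[int, int], min_depth: int = 5) -> int:
    """Return the depth with the highest frequency.

    Depths below min_depth are skipped because they are usually dominated
    by sequencing-error k-mers (the threshold of 5 is an assumption).
    """
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no depth at or above min_depth")
    return max(candidates, key=candidates.get)


# Values reported in the text: total 17-mers and peak 17-mer depth.
raw_size_mb = estimate_genome_size(30_891_679_507, 27) / 1e6

# Hypothetical toy histogram (depth -> number of distinct k-mers).
toy_histogram = {1: 1000, 2: 400, 26: 90, 27: 100, 28: 95}
toy_peak = peak_depth(toy_histogram)

print(f"uncorrected estimate: {raw_size_mb:.2f} Mb, toy peak depth: {toy_peak}")
```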
The initial assembly of the C. erythropterus genome was performed with all Nanopore sequencing data using the NextDenovo assembler (v2.3.1) (https://github.com/Nextomics/NextDenovo) with the following parameters: "read_cutoff = 1k, pa_correction = 20, sort_options = -m 20g -t 10, correction_options = -s 10". The contig sequences were then corrected with NextPolish (v1.3.1) [12] using the Illumina raw data as well as the Nanopore sequencing data. This yielded a genome assembly of 1,085.49 Mb with a contig N50 of 23.28 Mb (Table 2). The length of this assembly is close to the genome size estimated by k-mer analysis.
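For readers unfamiliar with the N50 statistic reported above, the following minimal Python sketch shows how it is conventionally computed from a list of contig (or scaffold) lengths. The function name and the toy lengths are hypothetical; the actual contig lengths of the assembly are not reproduced here.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: cumulative sum always reaches half")


# Hypothetical toy contig lengths, in arbitrary units.
toy_contigs = [40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```

In the toy case the total is 100, and the two longest contigs (40 and 30) are the first to cover at least 50, so the N50 is 30.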
Chromosome-level genome assembly using Hi-C data
Using the Hi-C scaffolding method [13], contigs in the initial assembly were anchored and oriented to produce a chromosome-scale assembly. The Hi-C library generated 86 Gb of clean data. After the Hi-C-corrected contigs were processed with the ALLHiC pipeline [14] for partitioning, orientation, and ordering, 99.49% of the assembled sequences were anchored to 24 pseudochromosomes, with chromosome lengths ranging from 31.72 Mb to 73.07 Mb (Table 3). This result is consistent with the karyotype based on cytological observations [15]: like many cyprinid fishes, such as Ctenopharyngodon idellus [16], Ancherythroculter nigrocauda [17], Hypophthalmichthys molitrix, and Hypophthalmichthys nobilis [18], the species has a chromosome number of 2n = 48. We then manually curated the Hi-C scaffolds based on the chromatin contact matrix in Juicebox (Figure 2). The 24 pseudochromosomes are easily distinguished in the heat map, and the interaction signal along the diagonal is strong, indicating a high-quality genome assembly. After Hi-C correction, the final assembled genome was 1,085.51 Mb, and the scaffold N50 was 42.39 Mb (Table 2). The genome size of C. erythropterus is similar to those of some cyprinid fishes, such as Ctenopharyngodon idellus (1.07 Gb), Megalobrama amblycephala (1.09 Gb) [19], Culter alburnus (1.02 Gb) [19], and Ancherythroculter nigrocauda (1.04 Gb), but much smaller than that of Cyprinus carpio (1.69 Gb) [20].
Evaluation of genome assemblies
To assess the accuracy and completeness of the genome assembly, we first aligned the Illumina reads to the C. erythropterus assembly with BWA (v0.7.8) [21]; 98.71% of the reads could be mapped to the contigs. In addition, we assessed assembly completeness using Benchmarking Universal Single-Copy Orthologs (BUSCO v5.2.1) [22] with the vertebrata_odb10 database and CEGMA (v2.5) [23]. The results showed that the assembly contained 98.5% complete and 0.4% fragmented conserved single-copy orthologs (Table 4), as well as 97.98% of 248 core eukaryotic genes. Overall, these evaluations indicate that the C. erythropterus genome assembly is complete and of high quality.
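As an illustration of how BUSCO-style completeness percentages are derived from per-category counts, the following minimal Python sketch may help. The function name is hypothetical and the counts are toy values chosen only for illustration; they are not the values in Table 4. The sketch assumes the usual BUSCO convention that "complete" is the sum of single-copy and duplicated orthologs.

```python
from typing import Dict


def busco_percentages(single: int, duplicated: int,
                      fragmented: int, missing: int) -> Dict[str, float]:
    """Convert BUSCO category counts into percentages of all searched orthologs."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no orthologs counted")

    def pct(count: int) -> float:
        return 100.0 * count / total

    return {
        "complete": pct(single + duplicated),  # complete = single-copy + duplicated
        "single_copy": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }


# Hypothetical toy counts, not taken from the paper.
toy_summary = busco_percentages(single=90, duplicated=5, fragmented=2, missing=3)
for category, value in toy_summary.items():
    print(f"{category}: {value:.1f}%")
```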
Repeat annotation
To annotate repetitive elements in the C. erythropterus genome, a combination of homology-based comparison and ab initio prediction was used. For ab initio repeat annotation, a de novo repeat element database was constructed using LTR_FINDER (v1.0.7) [24], RepeatScout (v1.0.5) [25], and RepeatModeler (v1.0.8) [26], and RepeatMasker (v4.0.5) [26] was used to annotate repeats against this database. RepeatMasker and RepeatProteinMask (v4.0.5) were then used to identify known repeat element types by searching the Repbase database [27]. Additionally, TRF (v4.07b) [28] was used to annotate tandem repeats. In total, we identified 557 Mb of repetitive sequences, representing 51.34% of the assembled genome. This proportion is higher than those in the Ctenopharyngodon idellus genome (38.06%) and the Megalobrama amblycephala genome (38.68%), but slightly lower than that in the Danio rerio genome (52.2%). Among the repeats, long terminal repeats (LTRs) were predominant, accounting for 469 Mb (43.23%) of the assembled genome (Table 5).
Gene prediction and annotation
We predicted protein-coding genes in the C. erythropterus genome assembly by combining three methods: ab initio prediction, homology-based prediction, and RNA-Seq-based prediction. For ab initio prediction, Augustus (v3.2.3) [29], GlimmerHMM (v3.04) [30], SNAP (29-11-2013) [31], Geneid (v1.4) [32], and Genscan (v1.0) [33] were used in our automated gene prediction pipeline. For homology-based prediction, we downloaded the protein sequences of Ancherythroculter nigrocauda (GWHAAZV00000000), Cyprinus carpio (GCF_000951615.1), Danio rerio (GCF_000002035.6), Sinocyclocheilus anshuiensis (GCF_001515605.1), Sinocyclocheilus grahami (GCF_001515645.1), and Sinocyclocheilus rhinoceros (GCF_001515625.1) from the NCBI database and aligned them to the C. erythropterus genome using TblastN (v2.2.26) [34] with an e-value cutoff of 1E-5; the matched proteins were then precisely aligned to the homologous genomic sequences using GeneWise (v2.4.1) [35]. For RNA-Seq-based prediction, RNA-Seq data from seven tissues (intestine, liver, muscle, spleen, heart, gallbladder, and kidney) were aligned to the genome FASTA using TopHat (v2.0.11) [36], and gene structures were predicted using Cufflinks (v2.2.1) [37]. A non-redundant reference gene set was generated by merging the genes predicted by the three methods with EvidenceModeler (EVM, v1.1.1) [38], using PASA (Program to Assemble Spliced Alignments) terminal exon support and including masked transposable elements as input to the gene prediction. In total, 33,706 protein-coding genes were predicted and annotated, with an average exon number of 7.77 and an average CDS length of 1,363.50 bp per gene (Table 6). Finally, we compared the distributions of gene number, gene length, coding DNA sequence (CDS) length, exon length, and intron length with those of other bony fishes (Table 7 and Figure 3).
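The e-value cutoff used in the homology-based step can be illustrated with a minimal Python sketch that filters BLAST hits in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score). This is a post hoc filter written only for illustration; the `Hit` type, the `filter_hits` function, and the two example hits are hypothetical and do not come from the authors' pipeline, which may have applied the cutoff within the BLAST search itself.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float


def filter_hits(handle: TextIO, max_evalue: float = 1e-5) -> Iterator[Hit]:
    """Yield hits from 12-column tabular BLAST output with e-value <= max_evalue."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        evalue = float(fields[10])  # column 11: e-value
        if evalue <= max_evalue:
            yield Hit(fields[0], fields[1], evalue, float(fields[11]))


# Two hypothetical hits: one strong match and one that fails the cutoff.
toy_output = io.StringIO(
    "protA\tchr1\t85.0\t200\t30\t0\t1\t200\t1000\t1600\t1e-50\t310\n"
    "protB\tchr2\t40.0\t50\t30\t0\t1\t50\t500\t650\t0.5\t25.0\n"
)
kept = list(filter_hits(toy_output))
print([hit.query for hit in kept])
```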
The predicted genes of C. erythropterus were functionally annotated using BLAST [39] against the SwissProt [40], NCBI Nr, KEGG [41], InterPro [42], GO [43], and Pfam [44] databases with an e-value cutoff of 1E-5. The InterProScan (v4.8) [45] tool was used to predict protein function based on conserved protein domains using the InterPro database. As a result, 33,041 genes of C. erythropterus were successfully annotated, accounting for 98.0% of all predicted genes (Table 8 and Figure 4).
Finally, miRNAs and snRNAs were identified by searching the Rfam database using INFERNAL [46] with default settings. We selected human rRNA sequences as references and used BLAST [39] to predict rRNA sequences in C. erythropterus. tRNAs were predicted using the tRNAscan-SE program [47]. As a result, we identified 1609 miRNA, 8135 tRNA, 1251 rRNA, and 1060 snRNA genes (Table 9).