6. FAQ

What is the difference between T2T, Near-T2T, Gap free, and Gap less?

Genome type | Chromosome level | Find all telomeres (1) | Find all centromeres (2) | Fill all gaps | Highest BUSCO (3), QV (4), and LAI (5)
T2T
Near-T2T (6)
Gap free
Gap less (6)

✔: Yes, ❌: No, ⚪: Partial

1 The CCCTAAA motif should be repeated at least 20 times.

         | CCCTAAA repeat count
Telomere | ≥ 20

2 Clearly define the positions of centromeres, the length and coverage of tandem repeat (TR) sequences, and centromere unit sequences (CENs, e.g., CEN170). If conditions permit, provide more comprehensive information, such as the coverage of long terminal repeat-retrotransposons (LTR-RTs), GC content, and the abundance of the centromere-specific histone H3 variant, CENH3.

           | Positions | Coverage of TR | CEN | CENH3 | Coverage of LTR-RTs | % GC
Centromere

3 BUSCO is a widely used tool for assessing the completeness of genome assemblies based on a set of conserved single-copy orthologs. The BUSCO score should consistently use embryophyta_odb as the reference database. To avoid missing core genes, genome mode is recommended. Here, we divide the BUSCO score into four levels: Incomplete (≤ 90%), Acceptable (90 - 95%), Near-complete (95 - 98%), and Complete (≥ 98%).

        | Incomplete | Acceptable | Near-complete | Complete
% BUSCO | ≤ 90       | 90 - 95    | 95 - 98       | ≥ 98

4 QV evaluates the base-level accuracy of the genome assembly. It is calculated as QV = -10 × log10(e), where e is the per-base error rate of the assembly (so 1 - e is the accuracy). Here, we divide QV into three levels: Low (QV ≤ 20), High (20 < QV < 60), and Gold (QV ≥ 60). We recommend calculating QV from third-generation sequencing data (e.g., PacBio HiFi, Oxford Nanopore).

   | Low  | High    | Gold
QV | ≤ 20 | 20 - 60 | ≥ 60
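
As a quick illustration of the QV scale, the sketch below converts a per-base error rate into a QV and maps it onto the three levels. It assumes the standard Phred convention QV = -10 × log10(e), with e the error rate; the function names are illustrative and are not part of PlanT2T.

```python
import math


def phred_qv(error_rate: float) -> float:
    """Convert a per-base error rate into a Phred-scaled QV."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


def qv_level(qv: float) -> str:
    """Map a QV onto the Low / High / Gold levels described above."""
    if qv <= 20:
        return "Low"
    if qv < 60:
        return "High"
    return "Gold"


# One error per 10,000 bases corresponds to QV 40, which falls in "High".
print(round(phred_qv(1e-4), 3), qv_level(phred_qv(1e-4)))
```

Under this convention, each additional 10 QV points means a tenfold lower error rate, so QV 60 corresponds to roughly one error per million bases.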

5 LAI (LTR Assembly Index) evaluates the completeness and continuity of a genome assembly based on how well long terminal repeat (LTR) retrotransposons are assembled. The raw LAI is calculated as LAI = (Total length of intact LTR-RT sequences / Total length of LTR-RT sequences) × 100. Typically, a value less than 10 indicates a draft genome (LAI < 10), a value between 10 and 20 can be considered a reference-quality genome (10 ≤ LAI ≤ 20), and a value greater than 20 qualifies as a gold-standard genome (LAI > 20) [4].

    | Draft | Reference | Gold
LAI | < 10  | 10 - 20   | > 20

6 The number of gaps should be no more than twice the number of chromosomes (n) and no more than 50 (Gaps ≤ min(2n, 50)).

     | Number
Gaps | ≤ min(2n, 50)

How to analyze the uploaded genome?

1. Generate URL

Once the user uploads a file, the server returns a Status URL and a Result URL. The backend uses the Celery queue system for task management and allows only one task to run at a time. The user can check the task's execution status in real time via the Status URL. When the task finishes successfully, the user can view and share the results through the Result URL.

2. Genome Format Check

Verify that the genome file is in FASTA format and check if the GFF3 annotation file corresponds correctly with the genome sequences:

  1. Each seqid in GFF3 must exist in FASTA, and feature coordinates must fall within the sequence length.
  2. For each feature, start < end.
  3. Strand must be + or -.
  4. At least one gene feature is required.
  5. gene features must have an ID attribute; mRNA features must have ID and Parent attributes; all other features must have a Parent attribute.
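
The five rules above can be sketched as a small validator. This is a minimal illustration and not the PlanT2T implementation: `check_gff3` and `parse_attributes` are hypothetical names, and the sequence lengths are assumed to have been read from the FASTA file beforehand.

```python
def parse_attributes(column9):
    """Parse the GFF3 attribute column ("ID=g1;Name=x") into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def check_gff3(gff3_lines, seq_lengths):
    """Return a list of error messages; an empty list means every rule passed.

    gff3_lines  : iterable of GFF3 text lines
    seq_lengths : dict mapping FASTA sequence name -> length in bp
    """
    errors = []
    has_gene = False
    for number, line in enumerate(gff3_lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9:
            errors.append(f"line {number}: expected 9 tab-separated columns")
            continue
        seqid, _, ftype, start, end, _, strand, _, attr_text = fields
        if not (start.isdigit() and end.isdigit()):
            errors.append(f"line {number}: start/end are not integers")
            continue
        start, end = int(start), int(end)
        # Rule 1: seqid exists and coordinates lie within the sequence.
        if seqid not in seq_lengths:
            errors.append(f"line {number}: seqid {seqid!r} not in FASTA")
        elif start < 1 or end > seq_lengths[seqid]:
            errors.append(f"line {number}: coordinates outside {seqid}")
        # Rule 2: start < end.
        if not start < end:
            errors.append(f"line {number}: start is not smaller than end")
        # Rule 3: strand is + or -.
        if strand not in ("+", "-"):
            errors.append(f"line {number}: strand must be + or -")
        # Rule 5: required attributes depend on the feature type.
        if ftype == "gene":
            has_gene = True
            required = ("ID",)
        elif ftype == "mRNA":
            required = ("ID", "Parent")
        else:
            required = ("Parent",)
        attrs = parse_attributes(attr_text)
        for key in required:
            if key not in attrs:
                errors.append(f"line {number}: {ftype} lacks {key} attribute")
    # Rule 4: at least one gene feature.
    if not has_gene:
        errors.append("no gene feature found")
    return errors
```

Collecting all errors in a list, rather than stopping at the first one, lets a user fix an annotation file in a single pass.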

3. Genome Quality Check

Ensure the genome assembly meets PlanT2T requirements by checking the following conditions:

  1. The number of gaps must be less than 2 times the number of chromosomes and fewer than 50.
  2. Protein sequences must not have duplicate names.
  3. Less than 10% of the protein sequences should have internal stop codons or lack stop codons.
  4. If N90 > 10 Mb, contigs smaller than 10 Mb are filtered. If N90 ≤ 10 Mb, contigs smaller than 1 Mb are filtered.
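
Conditions 1 and 4 can be sketched as follows. The sketch rests on assumptions that the text does not state: each run of N characters is counted as one gap, sequences at or above the cutoff are kept, and the strict "less than" comparison follows the wording of condition 1. All function names are illustrative, and the `mb` parameter exists only so the rule can be tried on toy data.

```python
import re


def count_gaps(sequences):
    """Count gaps, treating every run of N (or n) as one gap."""
    return sum(len(re.findall(r"[Nn]+", seq)) for seq in sequences.values())


def gaps_ok(n_gaps, n_chromosomes):
    """Condition 1: fewer than 2 x chromosomes and fewer than 50 gaps."""
    return n_gaps < 2 * n_chromosomes and n_gaps < 50


def n_stat(lengths, percent):
    """Nx: length of the sequence at which the cumulative length, summed
    from longest to shortest, first reaches percent % of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= percent * total:
            return length
    return 0


def filter_short_sequences(sequences, mb=1_000_000):
    """Condition 4: drop sequences below a cutoff chosen from N90."""
    n90 = n_stat([len(seq) for seq in sequences.values()], 90)
    cutoff = 10 * mb if n90 > 10 * mb else 1 * mb
    return {name: seq for name, seq in sequences.items() if len(seq) >= cutoff}
```

For example, `gaps_ok(count_gaps(genome), n_chromosomes=12)` would combine the two gap helpers for a 12-chromosome assembly held in a `genome` dict of name to sequence.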

4. Change Chromosome and Gene Name

If chromosome names do not start with Chr (case insensitive), they are renamed to Chr1, Chr2, etc. Genes are renamed based on the Latin name (e.g., Oryza sativa), cultivar name (e.g., NIP, optional), and haploid type (e.g., Haploid1, optional), resulting in names like Osa_NIP_H1_01G000010.
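
The naming scheme can be sketched as below. Only the single example Osa_NIP_H1_01G000010 is documented, so every detail here is an inference from that example and should be read as an assumption: the prefix takes the first letter of the genus plus the first two letters of the species, the chromosome number is zero-padded to two digits, and gene numbers advance in steps of 10. The sketch also assumes that all chromosomes are renamed in input order as soon as one name lacks the Chr prefix.

```python
def rename_chromosomes(names):
    """Map old names to Chr1, Chr2, ... unless all already start with 'Chr'."""
    if all(name.lower().startswith("chr") for name in names):
        return {name: name for name in names}
    return {name: f"Chr{i}" for i, name in enumerate(names, start=1)}


def species_prefix(latin_name):
    """'Oryza sativa' -> 'Osa' (assumed abbreviation rule)."""
    genus, species = latin_name.split()[:2]
    return genus[0].upper() + species[:2].lower()


def gene_name(latin_name, chrom_number, gene_index, cultivar=None, haploid=None):
    """Build a gene name such as Osa_NIP_H1_01G000010.

    gene_index is the 1-based position of the gene on its chromosome;
    cultivar and haploid (e.g. 'H1') are optional, as in the text above.
    """
    parts = [species_prefix(latin_name)]
    if cultivar:
        parts.append(cultivar)
    if haploid:
        parts.append(haploid)
    parts.append(f"{chrom_number:02d}G{gene_index * 10:06d}")
    return "_".join(parts)


print(gene_name("Oryza sativa", 1, 1, cultivar="NIP", haploid="H1"))
```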

5. Telomere Identification

Perform whole-genome screening for the CCCTAAA repeat sequence and consider the presence of a telomere when the repeat count exceeds 20.
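
A minimal sketch of this screening step, using a regular expression to find uninterrupted tandem runs of the motif. Two assumptions go beyond the text: the reverse complement (TTTAGGG) is searched as well, since that is how the repeat reads at the other chromosome end, and `min_repeats` is inclusive, following the "Telomere ≥ 20" threshold given earlier in this FAQ. The function name is illustrative.

```python
import re


def find_telomere_repeats(sequence, motif="CCCTAAA", min_repeats=20):
    """Return (start, end, copies, unit) for each tandem run of the motif
    or its reverse complement with at least min_repeats copies."""
    complement = str.maketrans("ACGT", "TGCA")
    reverse_motif = motif.translate(complement)[::-1]  # CCCTAAA -> TTTAGGG
    hits = []
    for unit in (motif, reverse_motif):
        # e.g. (?:CCCTAAA){20,} matches 20 or more back-to-back copies
        pattern = re.compile(f"(?:{unit}){{{min_repeats},}}", re.IGNORECASE)
        for match in pattern.finditer(sequence):
            copies = (match.end() - match.start()) // len(unit)
            hits.append((match.start(), match.end(), copies, unit))
    return sorted(hits)
```

Coordinates are 0-based and half-open. Real telomeres often contain degenerate copies, so a production screen would probably need to tolerate mismatches, which this exact-match sketch does not.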

6. Centromere Identification

After excluding the 100 kbp sequences at the ends of the chromosomes, perform whole-genome screening for tandem repeat (TR) regions. The region with high TR coverage (> 10%) and the greatest length is considered a potential centromere region.
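
The selection rule can be sketched as a window scan. This is one possible reading of the rule, not the PlanT2T implementation: it assumes that TR intervals have already been produced by a separate tandem repeat finder, that they do not overlap, and that coverage is evaluated in fixed windows whose size (100 kbp here) is an arbitrary choice. Intervals are 0-based and half-open.

```python
def tr_coverage(tr_intervals, start, end):
    """Fraction of [start, end) covered by non-overlapping TR intervals."""
    covered = 0
    for tr_start, tr_end in tr_intervals:
        covered += max(0, min(tr_end, end) - max(tr_start, start))
    return covered / (end - start)


def candidate_centromere(tr_intervals, chrom_length, window=100_000,
                         end_margin=100_000, min_coverage=0.10):
    """Return (start, end) of the longest run of TR-rich windows, or None."""
    best = None
    run_start = None
    position = end_margin                  # skip the first 100 kbp
    limit = chrom_length - end_margin      # skip the last 100 kbp
    while position < limit:
        window_end = min(position + window, limit)
        rich = tr_coverage(tr_intervals, position, window_end) > min_coverage
        if rich and run_start is None:
            run_start = position           # a TR-rich run begins
        if not rich and run_start is not None:
            if best is None or position - run_start > best[1] - best[0]:
                best = (run_start, position)
            run_start = None               # the run has ended
        position = window_end
    if run_start is not None:              # run extends to the last window
        if best is None or limit - run_start > best[1] - best[0]:
            best = (run_start, limit)
    return best
```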

7. Transcription Factor Prediction

Perform whole-genome identification of plant transcription factors (TF) and transcriptional regulators (TR).

8. Protein Annotation

Use InterProScan to perform functional annotation of proteins, including GO terms, domain locations, functional information, and so on.

9. Protparam Analysis

Analyze the physicochemical properties of each protein, including:

  • Molecular Weight
  • Aromaticity
  • Instability Index
  • Isoelectric Point
  • Helix Fraction
  • Turn Fraction
  • Sheet Fraction
  • Reduced Cysteines Extinction Coefficient
  • Oxidized Cysteines Extinction Coefficient
  • GRAVY
  • Average Flexibility
  • Charge at pH 7.0
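
Two of the simpler properties, aromaticity and GRAVY, can be computed with the standard library alone, as sketched below. Aromaticity is taken as the fraction of F, W, and Y residues, and GRAVY as the mean Kyte-Doolittle hydropathy. The property list resembles what Biopython's `ProteinAnalysis` class reports, but the text does not say which library PlanT2T uses, so that link is only a guess.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def clean(protein):
    """Upper-case the sequence and drop a trailing stop symbol (*)."""
    return protein.upper().rstrip("*")


def aromaticity(protein):
    """Fraction of aromatic residues (F, W, Y)."""
    protein = clean(protein)
    return sum(protein.count(aa) for aa in "FWY") / len(protein)


def gravy(protein):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue.

    Raises KeyError for non-standard residues such as X.
    """
    protein = clean(protein)
    return sum(KYTE_DOOLITTLE[aa] for aa in protein) / len(protein)
```

A positive GRAVY suggests an overall hydrophobic protein and a negative one a hydrophilic protein.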

10. KEGG Annotation

Use KofamScan to perform KEGG Ortholog (KO) annotation on all protein files, retaining only the highest-scoring K number for each protein.
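
The "keep only the highest-scoring K number" step can be sketched as below. The input is assumed to be an iterable of (protein ID, K number, score) tuples already parsed from the KofamScan output. The parsing itself is omitted because the output format is not described here, and the K numbers in the example call are placeholders.

```python
def best_ko_per_protein(hits):
    """Keep the highest-scoring K number for each protein.

    hits: iterable of (protein_id, k_number, score) tuples.
    Returns a dict mapping protein_id -> k_number.
    """
    best = {}
    for protein_id, k_number, score in hits:
        if protein_id not in best or score > best[protein_id][1]:
            best[protein_id] = (k_number, score)
    return {protein_id: k_number for protein_id, (k_number, _) in best.items()}


example_hits = [
    ("gene1", "K00001", 120.5),
    ("gene1", "K00002", 300.0),
    ("gene2", "K00003", 50.0),
]
print(best_ko_per_protein(example_hits))
```

When two K numbers tie on score, this sketch keeps the one seen first.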

11. BUSCO Assessment

Use the embryophyta_odb10 database as the reference for Benchmarking Universal Single-Copy Orthologs (BUSCO) evaluation, with the percentage of Complete BUSCOs (C) serving as the final assessment metric.

12. Build OrgDB

Construct an OrgDB for each species based on GO and KEGG annotation results, retaining both the original gene names and the renamed ones. The KO annotations are manually curated to filter out KOs unrelated to plants.

13. Chromosome Visualization

Generate a configuration file for use with Ideogram.js. This file contains chromosome length, centromere position, telomere position, and gap position.

14. Latin Name to TaxID

Convert user-provided Latin names to NCBI taxonomy IDs and generate the corresponding taxonomic categories, including Phylum, Class, Order, Family, and Genus. It is important to provide a correct Latin name for which an NCBI taxonomy ID is available.

15. Genome Information Statistics

Calculate various genome statistics, including:

  • N50
  • L50
  • GC content
  • Number of gaps
  • Telomere count
  • Gene and transcript numbers
  • Different protein types
  • TF and TR counts
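
The first three statistics can be sketched as follows. N50 is the length of the sequence at which the cumulative length, summed from longest to shortest, first reaches half of the assembly, and L50 is the number of sequences needed to get there. Computing GC content over A, C, G, and T only (ignoring N) is an assumption, since the text does not say how gap bases are treated.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths."""
    total = sum(lengths)
    running = 0
    for rank, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, rank
    return 0, 0


def gc_content(sequences):
    """GC fraction over all A/C/G/T bases in an iterable of sequences."""
    gc = 0
    acgt = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        acgt += sum(seq.count(base) for base in "ACGT")
    return gc / acgt if acgt else 0.0
```

For a chromosome-level assembly, L50 is typically a small number close to half the chromosome count, so the pair is a quick check on contiguity.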

16. Save Information to MySQL

Store all functional annotation data, genome metadata, and Gene, PEP, and CDS sequences into the MySQL database for the specified species.

17. Build BLAST Index

Build BLAST indexes for nucleic acids and proteins, allowing users to search for similar sequences in the genome.

18. Build Genome Browser

Generate configuration files for JBrowse 2, allowing users to explore the omics data based on the genome assembly and annotation.

19. Construct Genome and Gene Page

Generate a genome page for the species, including genome assembly information, annotation details, and download links. Each gene has a detailed page displaying gene function. All pages are accessible to all users.

How to solve the error when analyzing the uploaded genome?

Check Input Files - Ensure that the genome and annotation files are in the correct format and that the genome is a high-quality T2T assembly. Use the genomeCheck script to verify the files in advance.

# Download the genomeCheck script
wget https://biobigdata.nju.edu.cn/plant2t/script/genomeCheck
chmod +x genomeCheck
# Run the script
genome=example.fa.gz
annotation=example.gff3.gz
./genomeCheck -g "${genome}" -a "${annotation}"

If the script runs successfully, you will see output like the following:

Pass GFF3 check: GFF3 annotations match the FASTA sequences.
Pass protein check: the rate of low-quality protein sequences is 1.34228%.
Pass gaps check: total 1 gaps were found in your genome.
Pass sequences filter: all sequences are longer than 1000000 bp.
Pass Chromosome names chack: Chromosome names start with 'Chr', no need to rename.

Two files were generated: 
/mnt/c/Users/haoyu/Desktop/example.fa.new.fa
/mnt/c/Users/haoyu/Desktop/example.gff3.new.gff3
You can upload them to PlanT2T for further analysis.

Pass all check!

If the output shows errors, please fix them and rerun the script. If you cannot fix them, please open an issue on GitHub.

How long does it take to analyze the uploaded genome?

The runtime depends on the genome size and the number of protein sequences. For example, Arabidopsis thaliana (125 Mb, 27,251 genes) takes approximately 10 hours, while Oryza sativa (400 Mb, 57,359 genes) takes approximately 20 hours. Large genomes (> 1 Gb) or large numbers of protein sequences (> 50k) may take longer to process.
