Whole-genome sequencing of three local rice varieties ( Oryza sativa L.) in Vietnam

Recently, a new technology, Next-generation sequencing (NGS) has been launched and providing whole-genome sequences that helps identify molecular markers across the genome. DNA markers such as single nucleotides and insertion – deletion (InDel) polymorphisms were widely used for plant breeding particularly to distinguish important traits in rice. These PCR-based markers can be used for the precision detection of polymorphisms. Moreover, PCR-based approaches are simple and effective methods for dealing with the issue of fraudulent labeling and adulteration in the global rice industry. In this study, three local varieties of Oryza sativa L. in Vietnam were sequenced with up to ten times genome depth and at least four times coverage (~83%) using the Illumina HiSeq2000™ system, with an average of 6.5 GB clean data per sample, generated after filtering low-quality data. The data was approximately mapped up to 95% to the reference genome IRGSP 1.0. The results obtained from this study will contribute to a wide range of valuable information for further investigation into this germplasm.


Introduction
Rice is the primary food of more than half of the world's population. To date, thousands of rice varieties have been recorded, with different commercial names available in the market, while some unique varieties are particularly popular as they suit the tastes of consumers in a particular region (1). Attempts to create superior varieties/products via conventional breeding have constantly resulted in inconsistent or variations of product quality, which have contributed to the discrepancy of market prices that impact the income of rice farmers (2). Moreover, this variation may subsequently encourage some dishonest stakeholders and traders to mix or mislabel rice varieties to achieve excess profit (3). Traditionally, rice variety identification relies mainly on the expert opinions of breeders, extension services and other expert farmers and morphological descriptions. These methods, however, have several inherent levels of uncertainty. For example, in the absence of formal seed systems, the naming system can vary from time to time and from place to place, leading to inconsistencies in the name of the rice variety. In addition, environmental and plant development conditions influence morphological descriptors at different stages (4). These difficulties can be overcome by molecular markers, not only without environmental effects and stages of development but also with the available reference genomes of many species. In Vietnam, rice is an important food product and represents the first export commodity of the country. Commercially, Nang Thom Cho Dao Thom (NTCD-T) is the most expensive rice variety because of its finer flavor and superior quality. Consequently, local use is not enough, and the cost of other rice varieties is two to three times as high. Thus, developing a suitable method to authenticate this economically important trait of NTCD-T is required.
To solve this authenticity problem for NTCD-T, the molecular marker approach is appropriate given the same genetic background of these rice varieties in the market. DNA markers such as single-nucleotide polymorphism (SNP) and insertion/deletion polymorphism (InDel) markers are useful to distinguish among closely related varieties. Recently, whole rice genome data has become available and helpful for developing SNP and InDel markers using next-generation sequencing (NGS) technology (5). Compared to the need for unique facilities for SNP detection (6), the codominant technology for InDels is user-friendly and beneficial in some genetic analyses, especially in marker-assisted selection (MAS) (7). With the advancement of NGS technology and cost reduction, InDels were widely detected and created via resequencing and became a precious resource for the research of many organisms, especially rice (8,9). This study aimed to detect and create stable and  practical InDel markers based on information  resequencing from three closely related rice lines,  NTCD-T (aroma), NTCD-TN (mild aroma), and Tai Nguyen, compared to a reference genome sequence, Nipponbare, so that the NTCD-T national rice cultivar can be authenticable and distinguished from other Vietnamese rice cultivars. Our findings thus set up the foundation for a long-term assessment of the purity of other rice varieties not only in Vietnam but also in authentic rice markets around the world.

Plant Material
NTCD-T belongs to Oryza sativa L. subsp. indica, is an aromatic local rice and is geographically restricted to the Cho Dao district in the Long An province of Vietnam (10°33′29″N, 106°36′23″E), and the Nang Thom Cho Dao mild aromatic line (NTCD-TN) and Tai Nguyen are local varieties sharing similar phenotypic characteristics with NTCD-T. These varieties have been difficult to distinguish in the market and therefore were chosen for comparison purposes regarding authentic NTCD-T.

DNA Isolation and Genome Sequencing
Genomic DNA was extracted from the leaf tissue of rice seedling (two weeks after germination) using the NEXprep™ Plant RNA Mini Kit (Genes Laboratories, Korea). The DNA was purified according to the manufacturer's protocol. DNA purity was checked on 1% agarose gel (w/v), and concentration and quality were determined using Nanodrop (Thermofisher, USA) and Bioanalyzer (Agilent, USA). The wholegenome resequencing of the three samples was performed on an Illumina HiSeq 2000™ from Novogen (Novogen, Malaysia). Library preparation and whole-genome sequencing were performed according to the standard Illumina protocol using Illumina's paired-end sequencing technology according to the Illumina pipeline. The 150-bp paired-end readings generated from the three genotypes were deposited in the NCBI sequence read archive (SRA) under study accession number PRJNA576771.

Variant Analysis
We used SnpEff build V4.3+T.galaxy3 on the Galaxy UI (https://usegalaxy.org) to build the reference database for functional annotation, including the Os-Nipponbare-Reference-IRGSP-1.0 reference genome (ftp://ftp.ensemblgenomes.org/pub/plants/release-45/fas ta/oryza_sativa/dna/) and a GFF3 as the annotation file (ftp://ftp.ensemblgenomes.org/pub/plants/release-45/gff 3/oryza_sativa). The genomic distribution of SNPs and InDels was eventually calculated and identified using mostly the awk and sort/uniq command lines, the former to determine the chromosome positions of the VCF file before grouping those positions by windows of 10 kb and the latter to determine the counts of the variants in each window of 10 kb. The data was then displayed using the Circa software (http://omgenomics.com/circa/). In addition, both the SNPs' and InDels' variants among the three samples were presented in Venn diagrams using VennyV2.1.0 (https://bioinfogp.cnb.csic.es/tools/venny/index.html).

PCR validation for InDel markers
A set of 2 InDel markers with a 3-10 bp major allele difference in genomes of three samples were randomly selected from the developed InDel markers for accuracy and polymorphism validation by the PCR technique. 100 out of 120 InDel markers were randomly selected to amplify the genomic DNA of three tested samples for accuracy validation. Then, we further detected their polymorphisms by PCR amplification of genomic DNA in a panel of 5 rice cultivars in which three tested sample and including 1 aroma commercial rice Jasmine 85 and local salt tolerant variety Doc Phung with 2 selected pairs of InDel primers (Table 1).
PCR was performed in a 15-μl reaction volume containing 50 ng of template DNA, 7.5 μl of 2X PCR master mix (NEXpro HS-Taq DNA polymerase, Korea), 10 nM of each primer and ddH2O. The DNA amplification protocol included an initial denaturation for 3 min at 95 °C, followed by 35 cycles of denaturation for 30 s at 95 °C, annealing for 90 s at 60 °C and an extension for 30 s at 72 °C, with a final extension for 10 min at 72 °C. The reactions were performed in a C1000 thermal cycler (Bio-rad, Inc., Hercules, CA). The PCR products were Table 1. List of InDels markers used for detection of NTCD-T lines

Genomic Library Sequencing and Mapping
In this study, high-throughput Illumina sequencing technology was used to sequence commercial Vietnamese rice variety NTCD and two other accessions sharing the similar phenotypic characteristics. The genomic DNA was isolated from rice leaves in each variety to enable a downstream assessment. Illumina HiSeq 2000 was used for highperformance sequencing, and the resulting sequence readings were mapped with BWA to IRGSP-1.0. For the current 373,245,519 bp reference genome, the mapping rate of each sample ranged from 95.76% to 95.87%. Referring to the reference genome (without Ns), the coverage and the average read depths obtained across all twelve chromosomes were 85.23% and 16.43 times for NTCD-T, 83.46% and 12.89 times for Tai Nguyen, and 86.07% and 19.39 times for NTCD-TN respectively ( Table 2). This result is in the qualified normal range and may serve in the subsequent variation detection and related analyses.
The SNPs and InDels were called based on unique alignments of the cleaned readings to Os-Nipponbare-Reference-IRGSP-1.0. The total numbers of SNPs (Supplement S1) for NTCD-T and NTCD-TN were 1,959,489 and 2,040,301 respectively, while that of SNPs for Tai Nguyen was 1,859,736. For each comparison, however, the numbers of InDels were considerably less than those of SNPs. The total numbers of InDels were 174,506 and 184,111 for NTCD-T and NTCD-TN respectively and 157,693 for Tai Nguyen. Furthermore, the NTCD-T and NTCD-TN SNPs and InDel densities were closer and higher than those of Tai Nguyen. The result showed that both NTCD-T and NTCD-TN shared a similar genetic background and therefore no significant variation was observed.

Nonrandom genomic organization of DNA Polymorphisms
In this study, all twelve chromosomes of the three genomes have been investigated for the genomic organization of the DNA polymorphisms. Considerable variations between the numbers of identified SNPs and InDels (Fig. 1A, 1B) were observed. The number of DNA polymorphisms (SNPs and InDels) at each chromosome for all three genomes was proportionate to the chromosome length ( Fig. 2A, 2B). Overall, the numbers of SNPs and InDels in chromosome 3 were the most abundant, while those of chromosome 9 were the least abundant. In particular, with regard to the DNA polymorphisms in the three genomes, SNP variation was the highest in chromosome 1 and the lowest in chromosome 9 (Fig. 1A). Meanwhile, InDel variation was distributed most abundantly in chromosome 9 (Fig. 1B). The distributions of SNPs and InDels within the chromosomes in the three genomes were not uniform ( Fig. 2A, 2B). More DNA polymorphisms were distributed in NTCD-TN compared with those in NTCD-T and Tai Nguyen. The highest-density (>870) SNP regions of 100 kb and the lowest-density (<2) SNP regions of 100 kb were detected in all three genomes. In contrast, the highest-density (>105) InDel regions of 100 kb and lowest-frequency InDel (<2) regions of 100 kb were also detected in all three genomes. However, the substantially differentiated spreads of DNA polymorphisms have also been recorded in many plants including rice.

Analysis of SNPs and InDels
Upon further investigation of the numbers of transitions (Ts) and transversions (Tv), transition-totransversion (Ts/Tv) ratios as measures for possible random sequence errors were determined. In this study, the Ts/Tv ratios (Fig. 3A) (NTCD-T: 2.619; Tai Nguyen: 2.605; NTCD-TN: 2.628) were approximately higher than the statistical human Ts/Tv ratio (>2.1), indicating the high quality of the SNPs found in an oblique manner. The higher Ts/Tv ratio (termed as  transition bias) had been reported in rice and maize. Thus, the higher frequency of Ts mutations over Tv mutations is due to mispairing and improved Ts tolerance because of fewer chances to change protein structure/functions in Ts compared with Tv. The total number of Ts (A/G and C/T) was significantly higher than that of Tv (A/C, A/T, C/G and G/T) for all three genomes. In all cases, the frequencies of A/G were at a similar level as those of C/T. The frequencies of Tv were not, however, at a similar level; the G/C frequency was lower than those of the three other Tv types (Fig. 3B). Our findings have been consistent with those of previous reports on rice and other plants.
Variant impacts were predicted in protein coding genes by SnpEff V4.3t. The variant impact statistics showed the highest numbers in the modifier group (Supplement S2) (NTCD-T: 7,363,697; Tai Nguyen: 7,021,002; and NTCD-TN: 7,607,886), while the variants predicted with high impact were 1,829; 1,758 and 1,869 in NTCD-T, Tai Nguyen and NTCD-TN respectively. The variants that affected noncoding genes were recognized, and when the information was available, the corresponding biotypes were identified. A biotype is a group of organisms that share the same genotype. Therefore, these variants with high impact will act as markers for identifying biotypes. Furthermore, missense SNPs in the three genomes ranged between 58,598 and 61,635 compared with those in the reference genome (Supplement S3). These SNPs have also been used to identify biotypes, resulting in their own characteristics.

Genomic annotation of DNA polymorphisms
The variants were annotated using Nipponbare as the RefSeq and Ensembl gen sets. In different genomic regions, we performed genome-wide annotation of the SNPs and InDels (Fig. 4A, 4B). In general, the patterns of the SNPs and the InDels were quite similar in different genomic regions for all the comparisons; while the numbers of variants in NTCD-TN and NTCD-T were similar; both variant numbers were higher than that of Tai Nguyen. Most of the detected SNP and InDel variants were detected most abundantly in nonfunctional genome regions such as upstream, downstream and intergenic (Fig. 4C, 4D).  The high frequency of genetic variants in the noncoding region could result from less pressure from natural selection. In the coding regions, however, the DNA polymorphisms were much lower than those in the noncoding regions, although these regions have been shown to play important roles in evolution. This can lead to phenotypical variation in these varieties. Since high-effect variants generate nonfunctional proteins that induce various phenotypic differences in evolution, high-effect SNPs and InDels among the three genomes were investigated in this study. These high-effect variants include the disruption of slicing sites, loss of translation in the start codon and introduction to the premature stop codon. Nonsynonymous SNPs and high-effect InDels were responsible only for 0.79%-0.81% and 0.24%-0.25% in the overall polymorphisms respectively (Supplement S4).

Variant comparisons among the three genomes
The following variants were detected-a total of 1,959,489 SNPs and 174,506 InDels from NTCD-T, 1,859,736 SNPs and 157,693 InDels from Tai Nguyen, and 2,040,301 SNPs and 184,111 InDels compared to the Nipponbare genome. Unique SNPs were recorded for 139,087 of the total NTCD-T SNPs, 337,387 of the total Tai Nguyen SNPs and 189,426 of the total NTCD-TN SNPs, representing 5.4%, 13.1%, and 7.4% of all the SNPs respectively. Similarly, specific InDels were detected for 17,013 of the total NTCD-T InDels, 33,459 of the total Tai Nguyen InDels and 22,878 of the total NTCD-TN InDels, constituting 7.1%, 13.9%, and 9.5% of the overall InDels respectively (Fig. 5A, 5B). The following DNA mutations in the genomes were widely recognized-52,981 SNPs and 5,795 InDels between NTCD-T and Tai Nguyen, 381,507 SNPs and 42,795 InDels between NTCD-T and NTCD-TN, and 83,454 SNPs and 9,543 InDels between Tai Nguyen and NTCD-TN. Approximately, the three genomes commonly shared about 5,091 SNPs and 287 InDels. In agreement with previous reports, genomic variations induced by natural means among rice lines and among cultivars were observed to have similar patterns of chromosome distribution and nucleotide substitution.

Distinguish NTCD-T Varieties from other Rice Varieties based on InDels Marker
To validate two selected InDel markers for distinguishing the NTCD-T rice cultivar from other rice varieties, we further analyzed them via PCR and fragment analysis. The unique size of the amplified fragments between NTCD-T and four other tested cultivars could be discriminated (Fig. 6A, 6B). The validation results show that the marker NTCD-8 could be detected in the NTCD-T cultivar with a deletion of 4-6 bp at the exon of chromosome 8. This suggests that the primer NTCD-8 can be used for an authentic assay of NTCD-T.

Discussion
Adulteration is popular in most agricultural food products, where it is difficult to distinguish adulterants from foodstuffs using the naked eye/visual observation. Adulterated food products are higher in value among similar types. Among all adulteration detection techniques, DNA-based approaches are the most appropriate because they are easier, more affordable and more replicable. The development and cheaper sequence methods of thousands of DNA markers give unprecedented applications of such DNA-based approaches to address the adulteration crisis in the foodstuffs and rice industry in particular (20,21). A wide array of DNA-based markers are available for cultivar identification, including RAPD (22), restriction fragment length polymorphism (RFLP) (23), amplified fragment length polymorphisms (AFLP), SNPs (24), microsatellites (25) and InDels (26). Each DNA-based marker has its own advantages and disadvantages in their applications. Among all the DNA markers used to identify varieties in terms of reproductiveness, biological informativeness, and applicability, InDel markers are the most instructive. Taking advantage of a wide database of resequencing from NGS technology, InDel markers have been selected to develop an instructive DNA marker to detect rice lines that share genetic backgrounds. However, the high precision of InDel markers is calculated by the accuracy of the reference genome using an NGS strategy for InDel identification. Moreover, the accuracy of InDel polymorphism identification is affected when resequencing with a short read (27). Nevertheless, the accuracy of the polymorphism was performed based on the sequencing depth, with an average depth of sixteen times (Table 1), similar to that of the most current sequencing project, the 3,000 Rice Genomes Project (28). The present study shows that this profile of the sequencing depth is sufficient to accurately detect the polymorphisms in nucleotides. Many systems have been developed to study InDel markers in rice with 4-40 bp InDels (29), 30-100 bp InDels (30), or even larger InDels (31), i.e., up to 2,000 bp. These InDel PCR amplicons could be separated using polyacrylamide or agarose gels. Furthermore, these studies could only be performed via electrophoresis since the InDel polymorphisms differ by more than 4 bp. Here, fragment analysis  using bioptic technology with three sets of InDel markers was used to distinguish allelic variations of less than 4 bp (Fig. 6).

Conclusion
In this study, the InDels markers selected from rice genome resequencing were proven to be selective and sensitive method to distinguish rice line particularly for the aroma traits based on the simple PCR and polyacrylamide techniques. This result has shown that the potential of using these markers for rice variety identification particularly for the high value NTCD-T rice in Vietnam. Table S1. Total numbers of SNP and InDel in this study