Chromosome-level genome assembly and transcriptome- based annotation of the oleaginous yeast Rhodotorula toruloides CBS 14

Rhodotorula toruloides is an oleaginous yeast with high biotechnological potential. In order to understand the molecular physiology of lipid synthesis in R. toruloides and to advance metabolic engineering, a high-resolution genome is required. We constructed a genome draft of R. toruloides CBS 14, using a hybrid assembly approach, consisting of short and long reads generated by Illumina and Nanopore sequencing, respectively. The genome draft consists of 23 contigs and 3 scaffolds, with a N50 length of 1,529,952 bp, thus largely representing chromosomal organization. The total size is 20,534,857 bp with a GC content of 61.83%. Transcriptomic data from different growth conditions was used to aid species-specific gene annotation. In total we annotated 9,464 genes and identified 11,691 transcripts. Furthermore, we demonstrated the presence of a potential plasmid, an extrachromosomal circular structure of about 11 kb with a copy number about three times as high as the other chromosomes. Significance This obtained high-quality draft genome provides the suitable framework needed for genetic manipulations, and future studies of lipid metabolism and evolution of oleaginous yeasts. The identified extrachromosomal circular DNA may be useful for developing efficient episomal vectors for the manipulation of Rhodotorula yeasts. Data deposition This project has been deposited at ENA under the accession PRJEB-40807.


Introduction
The basidiomycete yeast Rhodotorula toruloides is an oleaginous microorganism with high biotechnological potential for lipid and carotenoid production. This yeast can naturally accumulate lipids up to 70% of dry cell weight, and a number of carotenoids (González-García et al., 2017;Park et al., 2018). R. toruloides can be cultivated to high cell densities on a wide range of substrates, including lignocellulose hydrolysate and other residual products (Júnnior et al., 2020;Sànchez Nogué et al., 2018;Singh et al., 2018;Wiebe et al., 2012, Chmielarz et al. 2021). This makes R. toruloides a promising host for the production of single-cell oil, as a sustainable and less controversial alternative to plant-derived oils for the production of biofuels, food and feed additives (Park et al., 2018).
The molecular physiology behind lipid synthesis in R. toruloides has been relatively little explored, which hinders effective metabolic engineering for improved lipid production. Some draft genome sequences from R. toruloides strains have been determined using short-read sequencing technologies, however a complete picture of the R. toruloides genome on chromosomal level is still lacking (Hu & Ji, 2016;Kumar et al., 2012;Morin et al., 2014;Paul et al., 2014;Sambles et al., 2017;Tran et al., 2019;Zhang et al., 2016;Zhu et al., 2012). The combination of short-and long-read sequencing strategies has been shown to improve the accuracy of genome sequences in yeast (Olsen et al., 2015;Tiukova et al., 2019).
Thus, we combined Nanopore long-read sequencing and Illumina short-read sequencing to obtain a chromosome-level genome assembly. We further generated comprehensive transcriptomic data from different growth conditions to aid species-specific annotation. The results provide a valuable resource for pathway analysis and manipulation of R. toruloides and enable better understanding of genome biology and evolution of basidiomycetous yeasts.
Cultivations were performed in triplicates, at 25°C, pH 6 and oxygen tension of 21%.
Total RNA was extracted in triplicates from 5 mL samples withdrawn after 10, 40, 72 and 96 h, respectively from bioreactors with mixed carbon source, and 10, 30 and 60 h, respectively from bioreactors with glycerol as sole carbon source, using the Quick-RNA™ Fecal/Soil Microbe MicroPrep Kit (Zymo Research, USA) according to the manufacturer's instructions with some modifications (Manzoor et al., 2016). Briefly, the cells were harvested and resuspended in 1 mL Trizol and disrupted in a FastPrep -24 bead beater (M.P. Biomedicals, USA) at speed 6.0 m/s for 40s. The homogenate was separated into layers by adding 0.2 mL chloroform and centrifugation. 400 µl of the upper layer containing the RNA was further processed as described in the manufacturer´s manual. DNAse I treatment was performed as described in the manual applying 26 U/mL at 37° for 15 minutes. The technical replicates of the retrieved RNA were pooled prior rRNA depletion, which was performed using the human riboPOOL™ (siTOOLsBiotech, Germany) and Streptavidin-coated magnetic beads (Thermo Fisher Scientific, Norway) following the two-step protocol of the manufacturer. rRNAdepleted samples were purified by ethanol precipitation. RNAse inhibitor (1 U/µL, Thermo Fisher) was added before storage. Total RNA and rRNA-depleted samples were tested for integrity and quantity on RNA Nano Chips (Agilent 2100 Bioanalyzer System, Agilent Technologies, Germany).

Library preparation and sequencing
The extracted DNA was divided into two samples and sequenced using either MinION (Oxford Nanopore Technologies) or Illumina sequencing platform. Before Nanopore library preparation, 5 µg of the retrieved DNA were "pre-cleaned" using 31.5 µL of AMPure magnetic beads, for removing short DNA fragments (Brandt et al., 2019). The DNA library was prepared using the Ligation Sequencing Kit SQK-LSK109 (Oxford Nanopore Technologies, Oxford, UK) and a modified protocol (Brandt et al., 2019). In order to verify the circular structure of contig_63, Sanger sequencing (Macrogen Europe B.V, the Netherlands) was performed from PCR amplicons using the primers shown in supplementary table s1. The PCR mixture consisted of 0.3 µL DNA, 1.25 µL primer, 12.5 µL Dream Taq Green PCR Mix (Thermo Scientific, Lithuania) and 0.8 µL DMSO in a total volume of 25 µL. Amplification was conducted as follows: initial denaturation at 95°C for 5 min, 35 cycles of denaturation (95°C for 30 s), annealing (1 min) and elongation (72°C) followed by a final extension at 72°C for 10 min. Annealing temperatures and elongation times were adapted to the respective primer combination (table s2). Amplification was assessed on agarose gel electrophoresis at 8.2 V/cm electric field strength. The gel was prepared using GelRed® Nucleic Acid Gel Stain (Biotium), 1.0% agarose and TAE 1X buffer (VWR Life Science). The PCR products from the corresponding sizes were purified using GeneJet Gel Extraction kit (Thermo Scientific, Lithuania). Geneious prime version 2021.0.1 (Biomatters Ltd.) was used for assembly of the Sanger sequences obtained and to align FAS2 and FAS21 from R. toruloides CBS 14 and FAS2 from R. toruloides NP11.

Quality control and genome assembly
Short and long reads generated by Illumina and Nanopore sequencing, respectively, were combined for hybrid de-novo assembly. Before, the quality of the long reads was ensured using NanoPlot v1.25.0 (De Coster et al., 2018) and all short reads were removed until reaching a target base coverage with Filtlong v0.2.0 (https://github.com/rrwick/Filtlong) (Wick, n.d.) applying the parameters --target_bases 5000000000 and --length_weight 8. Short reads were quality-trimmed and adapter-clipped with fastp v0.20.0 (Chen et al., 2018) using parameters -5 -3 -W 4 -M 20 -l 25 -x -n 5 -z 6. To achieve high contiguity in the initial assembly step, a draft assembly was produced from the preprocessed long reads using Flye v2.8 (Kolmogorov et al., 2019), setting the suggested genome size to 20 Mbp and keeping the plasmid and meta options activated. The Flye draft assembly was further subjected to long-read-based polishing using two rounds of Racon v1.4.7 (Vaser et al., 2017) followed by one round of Medaka v0.10.0 (https://github.com/nanoporetech/medaka) (Technologies, n.d.) using the model r941_min_high. We used minimap2 (-x map-ont) v2.17 (Li, 2018) and samtools v.1.10 (Li et al., 2009) to prepare alignment files for the Racon polishing. The assembly graph was visualized and investigated via Bandage v0.8.1 (Wick et al., 2015). The long-read assembly was finally subjected to two Pilon v1.23 (Walker et al., 2014) rounds, in which the qualityfiltered short reads were used for final polishing. Alignment files for Pilon were prepared using BWA v0.7.17 (Li & Durbin, 2009) and samtools. We mapped the Illumina and Nanopore reads with BWA and minimap2, respectively, to the final assembly and used the pileup.sh script from BBMap v38.86 (Bushnell, 2014) to calculate coverage histograms for each contig with a bin size of 1000 nt and plotted them with ggplot2. Besides the visual inspection of coverage patterns, we used the samtools coverage function and the faidx command on the long-readmapped data to filter out contigs violating at least one of the following cutoffs: minimum read number = 100, minimum bases covered by reads = 5000, minimum read coverage = 15, and minimum read coverage depth (amount of bases covered by reads in comparison to contig length) = 10. The resulting quality-controlled and cleaned assembly file was used for annotation and chromosome analyses.
Taxonomic classification was performed using sourmash v2.0.1 (Pierce et al., 2019) and its "LCA" method. Both indexing the tree and querying genomes used a k-mer size of 31 and a sampling fraction of 10 -4 . The LCA index was derived from publicly available genomes (GenBank, https://osf.io/4f8n3/).
The repeat-masked assembly was further annotated using a combination of homology-based gene comparison and RNA-Seq-derived transcript information. First, we used MetaEuk v1.ea903e5 (Karin et al., 2020) with the "easy-predict" subcommand for draft gene annotation providing all proteins (filtered models; best) obtained from the JGI fungal genome portal MycoCosm (http://jgi.doe.gov/fungi; downloaded April 2020) as a database. Next, we combined and improved the MetaEuk annotation with the obtained RNA-Seq data. Prior to this, RNA-Seq datasets were quality-checked using FastQC v0.11.8 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) (Andrews et al., 2010) and trimmed and adapter-clipped using fastp with the aforementioned parameters. Potential residual rRNA was removed by SortMeRNA v2.1b (Kopylova et al., 2012). The processed RNA-Seq reads were per-sample-mapped to the assembly using HISAT v2.2.0 (Kim et al., 2015) and subsequently provided to StringTie v2.1.1 (Kovaka et al., 2019;Pertea et al., 2015) for transcript annotation, guided by the initial MetaEuk gene annotation. Finally, the MetaEukguided StringTie information from each individual RNA-Seq sample were merged into a final single annotation using the StringTie merge functionality. In this step, we again provided the full MycoCosm-based MetaEuk annotation next to the individual RNA-Seq annotations to also include genes that were not recovered in the RNA-Seq data. Lastly, we extracted all annotated gene sequences and used the dammit pipeline v1. We used a custom python script to identify potential telomeres by using known motifs at contig ends. This information was provided along with the Nanopore reads to Tapestry v1.0.0 (Davey et al., 2020).

Results and Discussion
The hybrid de-novo assembly for R. toruloides CBS 14 ( fig. 1) . 2). This number of contigs and scaffold is significantly lower than that obtained using short-read sequencing only: the lowest number achieved in previous studies is 186 contigs (table 1). This clearly shows the higher accuracy, contiguity and completeness of the genome presented here, achieved through the improved coverage using a hybrid assembly approach. 56.37% of the genome is represented by 4 contigs and 3 scaffolds, each larger than 1 Mb. The rest is allocated in 10 medium-size contigs (   NA, not available; a, b and c refer to protein-coding sequences, putative genes and predicted proteins, respectively. The mitochondrial genome was recovered in one contig with a length of 157 kb. However, the final step of circularizing sequences in an assembly remains challenging and thus can result in shorter or even longer contigs, since assembly programs cannot clearly define where a circular DNA ends, even despite longer reads (Hunt et al. 2015). Thus, tandem repeats at the end of a contig can increase the length of a linear representation of a circular sequence. Such repeats can be identified by overlapping sequences at the ends of a contig. To account for this, in an additional step, we aligned the contig against itself to identify and remove circular repeats, ultimately resulting in an actual mitochondrial genome size of 69 kb. The mitochondrial genome has a GC content of 40.9%, which agrees with the GC values previously reported for R. toruloides strains NP11 and NBRC 0880 (Zhou et al. 2020;Coradetti et al. 2018). In addition to the mitochondrial genome, three further circular contigs (contig_64, contig_49, contig_63) were predicted. Among these, Contig 63 was particularly noticeable because it showed a read depth approximately three times higher than the other chromosomes, which may indicate relaxed replication regulation ( fig. 2). We confirmed the circular structure of Contig 63 by overlapping amplification using sequence-specific primers and subsequent Sanger sequencing of the amplicons obtained (supplementary fig. S1, supplementary file 2). This showed that the circular structure is about 1,143 bp larger than the one predicted (supplementary fig. S1). As explained above, circular sequence structures possess a certain challenge for assembly tools and can particularly result in longer but also shorter contigs that miss parts of the actual sequence at the end of a contig (Hunt et al. 2015). Extrachromosomal endogenous DNA have been previously found in Saccharomyces cerevisiae (Strope et al. 2015;Møller et al. 2015).
The presence of DNA mitochondrial plasmids in filamentous fungi, including some Basidiomycota species, have also been widely acknowledged before (Griffiths 1995;Wang et al. 2008;Cahan & Kennell 2005). Within them, mitochondrial circular DNA plasmids are less frequently found than linear. They are typically within a length range of 2.5-5 kb, and encoding enzymes involved in their replication such as a DNA polymerase or a reverse transcriptase (Cahan & Kennell 2005;Wang et al. 2008 table s5). Similar to the situation in S. cerevisiae (Burkl et al., 1972), the genes encoding the -and β-subunits of FAS are located in different chromosomes (supplementary table s5).
The MetaEuk annotation identified a gene (FAS21) on the opposite strand of the FAS2-gene (supplementary fig. S5), which encodes the -subunit of the fatty acid synthase complex (FAS) in R. toruloides NP11 (Zhu et al., 2012). FAS21 contains 2 exons and its mRNA would be complementary to parts of the FAS2-sequence. There is a growing number of natural antisense transcripts identified in fungal transcriptome analyses (Chacko & Lin, 2013 S6).
In the case of the transcriptomic data, 84.1% of the genes were indicated as complete (69.3% single copy,14.8% duplicated), 14.1% as fragmented and 1.8% genes as missing.