Last updated: 06 October 2014 ( )

Transcription

Last updated: 06 October 2014 ( )
Manual/Tutorial
Last updated: 06 October 2014
Contact: Michael Hackenberg ([email protected])
Table of Contents
INTRODUCTION
6
MAIN FEATURES
6
2 GETTING STARTED
7
2.1 DEPENDENCIES
7
2.2 INSTALL THE DATABASE
7
2.3 POPULATE THE DATABASE
8
2.3.1 GENOME SEQUENCES
8
2.3.2 MICRORNAS LIBRARIES
8
2.3.3 OTHER SMALL RNA SPECIES
9
2.3.4 SRNABENCH HELPER TOOLS
9
2.4 QUICK START
9
2.4.1 LIBRARY MAPPING MODE
10
2.4.2 GENOME MAPPING MODE
11
2.4.3 PREDICTION OF NOVEL MICRORNAS
11
2.4.4 USING OTHER LIBRARIES
12
2.4.5 DETECTING ISOMIRS
12
2.4.6 VISUALIZING ALIGNMENTS
13
3 SRNABENCH PARAMETERS
13
3.1 MANDATORY PARAMETERS
13
3.2 MAPPING PARAMETERS
14
3.3 ANALYSIS TYPES AND LIBRARIES
14
3.4 ADAPTER TRIMMING AND PRE-PROCESSING
16
3.5 PROFILING PARAMETERS
17
3.6 OUTPUT OPTIONS
18
3.7 PROGRAM NAMES
19
3.8 PREDICTION OF NOVEL MICRORNAS
19
4 WORK-FLOW
20
4.1 ANALYSIS STEPS
20
4.1.1 GENOME MODE
20
4.1.2 LIBRARY MODE
21
4.2 ISOMIR SPECIFICATION
21
4.3 PREDICTION OF NOVEL MICRORNAS
22
5 OUTPUT FILES
24
5.1 EXPRESSION PROFILING
24
5.1.1 *.GROUPED FILES
24
5.1.2 SINGLE ASSIGNMENT FILES: *_SINGLEA.GROUPED
25
5.2 MICRORNA SPECIFIC OUTPUT
26
5.2.1 MIRBASE_MAIN.TXT
26
5.2.2 HAIRPIN_NOVELSTAR.TXT
26
5.2.3 HAIRPIN_SUSPICIOUS.TXT
27
5.3 ISOMIR OUTPUT
27
5.3.1 MIRBASE_ISO.TXT
27
5.3.2 ISOMIR ANNOTATION
28
5.3.3 PER MATURE MICRORNA ISOMIR SUMMARY
29
5.3.4 ISOMIR SUMMARY
29
5.4 GENERAL INFORMATION
30
5.4.1 THE “READS.ANNOTATION” FILE
30
5.5 THE ’STAT’ FOLDER
31
5.5.1 READ LENGTH
31
5.5.2 MAPPING SUMMARY
32
5.6 ALIGNMENT FILES - PROCESSING PATTERN
32
5.7 NOVEL MICRORNAS
33
6 TIPS AND TRICKS
34
6.1 OVERWRITE THE MAPPING PARAMETERS
34
6.2 USING BOWTIE INDEXES AS LIBRARIES IN GENOME MODE
34
6.3 MULTI-SPECIES ANALYSIS
35
6.4 CONSTRUCTION OF SHARED LIBRARIES
35
6.5 PREDICTION OF NOVEL MICRORNAS
36
6.6 PROFILE TRNAS
36
7 DIFFERENTIAL EXPRESSION
36
7.1 MANDATORY PARAMETERS
36
7.2 ANALYSIS TYPES
37
7.3 DIFFERENTIAL EXPRESSION
38
7.4 ISOMIR ANALYSIS
38
7.5 READ LEVEL ANALYSIS
39
8 DIFFERENTIAL EXPRESSION OUTPUT
40
8.1 DIFFERENTIAL EXPRESSION OUTPUT
40
8.1.1 NOMENCLATURE OF DIFFERENTIAL EXPRESSION OUTPUT FILES
40
8.1.2 *.EDGER FILES
40
8.1.3 *.EXTEDGER FILES
41
8.2 EXPRESSION MATRIX
41
8.3 DIFFERENTIAL ISOMIR PATTERN
41
8.3.1 *.TTEST AND *.SIG FILES
42
8.3.2 SEQUENCING STATISTIC
42
20 June 2014
1 Introduction
sRNAbench is a free web-server tool and standalone application for processing smallRNA data obtained from next generation sequencing platforms, such as Illumina or
SOLiD. The sRNAbench tool is the replacement for miRanalyzer.
Main features

The expression levels of small RNAs can be profiled in two different ways
(depending on whether a good genome sequence/annotation is available or not):
i) mapping the reads against the genome and obtaining the expression levels by
means of fasta, gtf/gff or bed format annotations or ii) mapping against sequence
libraries in fasta/Bowtie index format (as it is done by miRanalyzer).

An unlimited number of genomes can be used in the analysis at the same time
without having to pool all sequences into a single file/Bowtie index. This feature
is especially important when analysing the interaction between parasites and
hosts, symbiosis or virus infected cells.

Adapter and barcode trimming can be performed.

sRNAbench accepts fastq, fastq.gz, sra, read count and fasta input format.

Extensive profiling of all microRNA sequences and length variants (isomiRs).
Furthermore, NTAs (non-templated additions) can be detected for all sequenced
libraries and not only for microRNAs.

Several statistics and graphical summaries are available.

The prediction of novel microRNAs was improved compared to miRanalyzer,
being more specific now. The prediction is based on structural, sequence and
biogenesis features.

Detection of differentially expressed microRNAs (including novel microRNAs)
and differences in the isomiR pattern.
sRNAbench can be used through its web-server instance, in this case consult the web
manual. However, the web application has some limitations like the restricted number
of configurable parameters and the limited number of available genome assemblies.
Therefore, to cover all possible analysis and user requirements, sRNAbench can also be
6
20 June 2014
installed locally. This tutorial will help the user during the local installation and will try
to be a guide during the entire small-RNAs analysis.
2 Getting started
2.1
Dependencies
sRNAbench is implemented in JAVA and apart from a JRE, it needs the following third
party software that must be installed in the PATH.

Vienna RNA package for RNA Secondary Structure Prediction and Comparison
(Vienna package 2). sRNAbench will only work with Vienna 2.0 or higher.

Bowtie - An ultra-fast memory-efficient short read aligner (Bowtie). sRNAbench
will only work with Bowtie1.

Only for the differential expression module some other software must be
installed: R package, Bioconductor, edgeR package and for some specific
analysis The Apache Commons Mathematics Library.
2.2
Install the database
sRNAbench relies on a local database where most of the library files, genome sequences
and Bowtie indexes must be stored. The database can have any arbitrary name and the
easiest way to generate it, is by means of the “start-up” database:
1. Download the “start-up” database: sRNAbenchDB and extract it to a directory of
your choice: ‘tar -xvzf sRNAbenchDB.tgz’. To populate the database, please
check 2.3
Populate the database.
NOTE: “start-up” database only includes genome and microRNA sequences for
human, human herpes virus type 8 (HHV-8) and Epstein-Barr virus (EBV) microRNAs.
2. Download the most recent version and replace the sRNAbench.jar file from the
database.
3. Only for differential expression module: Download the most recent version of
7
20 June 2014
sRNAbenchDE.jar and place it into the database.
Once the “start-up” database has been installed, the user should see the following
folders within the database folder selected:

libs: default location for all sequence libraries except the genome sequences.

index: contains the Bowtie indexes of the genome sequences.

seqOBJ: contains the genome sequences objects generated by makeSeqObj
program.

out: default output folder.
2.3 Populate the database
2.3.1 Genome sequences
In order to add a genome sequence to the database, two steps are needed:
1. Generate the genome bowtie index, by means of the ’bowtie-build’ software, and
place it into the index folder of the sRNAbenchDB.
2. Process the genome sequence(s) with makeSeqObj and place the obtained zip
file into the seqOBJ folder.
EXAMPLE: java -jar makeSeqObj.jar hg19.fa.
NOTE: this step might take a while (hours) in case of big genomes. Furthermore, it is
quite memory demanding, so probably the heap space needs to be increased by means
of -Xmx on the command line.
2.3.2 microRNAs libraries
To our knowledge, the best source for known microRNA sequences is miRBase. In
order to include the miRBase libraries to sRNAbench, download the mature and hairpin
sequences and extract them into the libs folder of the sRNAbench database.
NOTE: in unix systems, to download the libraries, move to the libs folder and type: ‘wget -nd
8
20 June 2014
ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz’ and ‘wget -nd
ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.gz’.
NOTE: both, mature microRNA and pre-microRNA (hairpin) annotations need to be included.
Currently, only fasta input format with the miRBase nomenclature is supported (fasta ids
starting with the short species name ’hsa’, ’mmu’, ’ath’, etc..) .
2.3.3 Other small RNA species
Other small RNA annotations can be provided either in fasta format, Bowtie index
format or bed format. BED format will be only valid if the genome sequence is
specified with species (see below 3.3
Analysis types and libraries). If no extension is
provided, sRNAbench will assume the existence of a Bowtie index in the libs folder and
will try to align directly against it. Otherwise, for a library with fa extension either the
coordinates are obtained by mapping the sequences against the genome sequence (if
species is set) or a Bowtie index is generated first by sRNAbench.
NOTE: the bed format annotations must be from the same genome sequence assembly.
2.3.4 sRNAbench helper tools
Some helper tools have been developed in order to facilitate the usage of Ensembl and
NCBI annotations. The helper tools can be accessed here.
2.4 Quick start
The “start-up” database contains an example dataset obtained from Gottwein et al. This
dataset was obtained from primary effusion lymphoma cell line BC-1 (human) which is
infected by two viruses: human herpes virus type 8 (HHV-8) and Epstein-Barr virus
(EBV). This dataset shows one of the main strengths of sRNAbench, the possibility to
analyse multi-species assays. Next, we will illustrate the use of sRNAbench by means of
some examples:
9
20 June 2014
2.4.1
Library mapping mode
First of all, we will align the input dataset to miRBase libraries for the three species:
‘java -jar sRNAbench.jar dbPath=/path2DB/sRNAbenchDB microRNA=hsa:ebv:kshv
input=SRR343332.fa.gz’
NOTE: If the output folder is not provided, the program will write the results into
‘sRNAbenchDB/out/SRR343332/’.
Next, some output files that you might be interested in will be explained (for the rest
please see 5

Output files):
reads.fa and reads_orig.fa: reads.fa contains all unmapped reads (after the run
has finished), while reads_orig.fa contains the initial set of reads.

hairpin_sense.grouped: the expression profiling of known pre-microRNAs.

mature_sense.grouped: the expression profiling of known mature microRNAs.

mature_sense_singleA.grouped: a single read assignment expression profiling
of mature_sense.grouped. The single read assignment (each read is only
assigned to one locus) is explained below.

hairpin_sense_singleA.grouped: a single read assignment expression profiling
of hairpin_sense.grouped. Only those reads that are not derived from mature
microRNAs do count.

miRBase_main.txt: profiling of known microRNAs at a pre-microRNA level.
This file includes the read counts for the canonical version (by default, those
defined by miRBase).

hairpin_NovelStar.txt: mature microRNA sequences that correspond to the
’arms’ that have not been reported before (are not included in miRBase)

reads.annotation: annotation of all reads, i.e. to which libraries the reads have
been mapped.

hairpin_suspicious.txt: pre-microRNA sequences that failed the consistency
test (either they have no stem-loop or one of the mature sequences folds back
onto itself).

short_reads.txt: reads filtered out due to minimum length.
10
20 June 2014

in stat folder Read Length: the read count as a function of read length.

in stat folder mapping summary: a summary statistics of the small RNAs
mapped.

in stat folder microRNA_species.txt: the distribution of mature microRNAs
over the species.

in stat folder hairpin.len: the length distribution of the hairpin mapped reads
(only forward strand).
2.4.2 Genome mapping mode
The same analysis carried out in 2.4.1
Library mapping mode, can also be performed
by means of the genome mapping mode.
NOTE: the default database only includes human chromosome 22, and therefore only the
microRNAs located on this chromosome can be profiled.
‘java -jar sRNAbench.jar dbPath=/path2DB/sRNAbenchDB microRNA=hsa:ebv:kshv
species=chr22:NC_007605:NC_009333 input=SRR343332.fa.gz
output=/path2DB/sRNAbenchDB/out/genome’
Compared to the library mapping mode, there are only minor differences on the
number of output files:

readsNotAssigned.fa: contains the reads that have been mapped to the genome,
but not assigned to any reference sequence.

in stat folder Read length: the length distribution of genome mapped reads.

in stat folder: the genome mapping mode retrieves the length distribution also
for antisense mapped reads.
2.4.3 Prediction of novel microRNAs
The prediction of novel microRNAs can be activated setting up predict=true:
‘java -jar sRNAbench.jar dbPath=/path2DB/sRNAbenchDB microRNA=hsa:ebv:kshv
species=chr22:NC_007605:NC_009333 input=SRR343332.fa.gz
11
20 June 2014
output=/path2DB/sRNAbenchDB/out/genome predict=true’
Additional output files:

novel.txt: Summary of novel microRNAs.

mature_novel.fa and hairpin_novel.fa: mature and pre-microRNA sequences.

folder novel: contains the alignments to the novel pre-microRNA sequences.
2.4.4 Using other libraries
Other libraries can be analysed setting libs=’library name’. If the library is provided in
fasta format, sRNAbench will first generate a bowtie index of this file. Otherwise, the
library will be mapped to the genome in order to obtain the chromosome coordinates of
the reference sequences.
WARNING: In genome mode, only unspliced reference sequences should be provided in fasta
format. Spliced genes should be provided as bowtie indexes or in bed or gtf/gff format (see 2.3.3
Other small RNA species and 6.2 Using bowtie indexes as libraries in genome mode).
‘java -jar sRNAbench.jar dbPath=/path2DB/sRNAbenchDB microRNA=hsa:ebv:kshv
input=SRR343332.fa.gz libs=hg19-tRNAs.fa’
Additional output files:

hg19-tRNAs_sense.grouped and hg19-tRNAs_antisense.grouped: reads
mapped to the sense and antisense strands of the hg19-tRNA.fa library.
2.4.5 Detecting isomiRs
IsomiRs can be detected adding isoMiR=true:
‘java -jar sRNAbench.jar dbPath=/path2DB/sRNAbenchDB microRNA=hsa:ebv:kshv
input=SRR343332.fa.gz libs=hg19-tRNAs.fa isoMiR=true’
12
20 June 2014
Additional output files:

miRBase_iso.txt: the isomiR information for each mature microRNA

isomiR annotation: (miRBase_isoAnnotation.txt) the isomiR annotation at read
level.

in stat folder, per mature microRNA isomiR summary
(isomiR_summary.txt): the number of reads found for the different isomiR types
for each of the mature microRNAs.

in stat folder, isomiR summary: sample wide summary as a function of isomiR
type.

in stat folder, miRBaseNTA_lenNTA.txt: summary of non-templated additions
as function of ’addition length’ (for example 1, 2, 3, etc… As added)
2.4.6
Visualizing alignments
The alignments of the microRNAs can be generated setting plotMiR=true:
‘java -jar sRNAbench.jar dbPath=/path2DB/sRNAbenchDB microRNA=hsa:ebv:kshv
input=SRR343332.fa.gz plotMiR=true’
Additional output files:

Alignment files - processing pattern (hairpin folder): holds the alignment files.
3 sRNAbench parameters
The parameters are provided using the following form: parameter=value. For example
to specify the mandatory path to the local database: dbPath=/home/usr/...
3.1

Mandatory parameters
dbPath: the full path to the sRNAbench database (for example:
‘/home/user/sRNAbenchDB/’).
13
20 June 2014

input: the path to the input file (fastq, sra, read/count, fasta, or sRNAbench fasta
format).
3.2
Mapping parameters
The following parameters are passed to the Bowtie aligner. For more information, please
see the Bowtie manual page.

noMM <int>: number of mismatches (default: noMM=1).

seed <int>: the length of the seed, -l parameter in Bowtie (default: seed=19).

alignType [n,v]: the alignment type, can be either ’n’ (-n parameter in Bowtie,
i.e. alignType=n) or ’v’ (-v parameter in Bowtie, i.e. alignType=v). Note that
when setting ’v’, the seed parameter will have no effect and Bowtie will try to
align the entire read. Briefly, ’n’ will perform a seed alignment (only the first
nucleotides are used for the alignment, i.e. mismatches outside the seed region
do not count).
NOTE: To detect isomiRs, ’n’ must be used.

mBowtie <int>: maximum number of allowed multiple mappings, -m parameter
in Bowtie (default: mBowtie=40).

p <int>: “Launch <int> parallel search threads” -from Bowtie manual(default: p=4).

chunkmbs <int>: “The number of megabytes of memory a given thread is given
to store path descriptors in --best mode” -from Bowtie manual- (default:
chumkmbs=128).
3.3

Analysis types and libraries
solid <Boolean>: if it is set to true, SOLiD input data is expected (default:
solid=false).

microRNA <species list>: microRNA species that should be used for the
analysis. An arbitrary number of species can be used, the short species names
14
20 June 2014
must be provided separated by ’:’.
EXAMPLE: microRNA=hsa:ebv will map the input reads simultaneously to human
(hsa) and Epstein-Barr virus (ebv) microRNAs.

species <genome assembly list>: genome sequences that will be used. An
arbitrary number of different genome sequences can be used, the names must be
provided separated by ’:’.
EXAMPLE: species=hg19_5:NC_007605 will map the input reads simultaneously to
hg19_5 (in this case, hg19_5 is the Bowtie index abbreviation for the human genome
version hg19/NCBI37 path 5 genome sequence) and the Epstein-Barr virus genome
sequence.
NOTE: Bowtie indexes for the genome sequences must be located within the
sRNAbench database index folder.

mature <String>: name of the library that holds the mature microRNAs (for
example mature.fa for miRBase library) (default: mature=mature.fa).

hairpin <String>: name of the microRNA precursor sequences (for example
hairpin.fa for miRBase library) (default: hairpin=hairpin.fa).

libs <String>: name of the library file. Typically, these files would hold other
types of small RNAs like tRNA, snoRNA, snRNA, piRNA, rRNA, yRNA,
vaultRNA, etc…
NOTE: If a name is given, the program will search for the file in the default
sRNAbench database folder (libs), however if a full path is given the program will use
this file, which then does not need to be within the sRNAbench database. The files can
be provided in fasta format, bed format or directly as Bowtie indexes.

homolog <species list>: a string of short species names separated by ‘:’,
indicating those species that should be used for homologous based microRNA
detection.
EXAMPLE: homolog=mmu:rno would map the reads (after profiling known
microRNAs) to the hairpin sequences of mouse and rat, in order to detect putative novel
15
20 June 2014
microRNAs based on homology.
NOTE: homolog=all will use all species except those provided on microRNA species
list.

noGenome <boolean>: if a genome sequence is provided (species) and
noGenome is set to true, then the program will execute both types of analysis: 1)
genome sequence based and 2) sequence libraries based.
NOTE: if species is not provided, sRNAbench will try to align against sequence
libraries directly and not to the genome. In this case, bed file format is not supported.
3.4

Adapter trimming and pre-processing
adapter <String>: the adapter sequence. If the adapter is not provided, then the
input is assumed to be adapter trimmed.

adapterStart <int>: position (0-based coordinates) in the read where the
adapter search will start (default: adapterStart=0).

adapterMinLength <int>: minimum length of the adapter to be detected
(default: adapterMinLength=10).

adapterMM <int>: maximum number of mismatches allowed between the
adapter sequence and the read (default: adapterMM=1).

removeBarcode <int>: eliminates the first <int> bases from the 5’ end of the
read (default: removeBarcode=0).

recursiveAdapterTrimming <boolean>: recursively search for the adapter at
the
3’
end,
without
considering
adapterMinLength
(default:
recursiveAdapterTrimming=false).
NOTE: This function might be activated for read length 36, if sRNA populations of
length between 27 and 34 should be analysed.

holdNonAdapter <boolean>: include the reads where the adapter sequence was
not found (default: holdNonAdapter=false).

writeNonAdapter <boolean>: write out the reads where the adapter was not
16
20 June 2014
found (default: writeNonAdapter=false).

guessAdapter <boolean>: the program tries to guess the adapter. Briefly,
sRNAbench will align the first 250.000 reads to the genome using the Bowtie
seed function (the adapters will not count for the mismatches). Then, the adapter
sequence is defined as the most frequent 10-mer starting at the first mismatch
(default: guessAdapter=false).

maxReadLength <int>: maximum length of input reads (filters out all reads
that are longer than <int>, by default this filter is not applied).

minReadLength <int>: minimum read length of input reads (filters out shorter
reads than <int>, default: minReadLength=15).

minRC <int>: minimum read count (filters out reads with less count than <int>,
default: minRC=1).

sep <String>: if the input is a fasta file, this parameter allows to indicate which
character separates the ’ID’ and the ’Read Count’. For example: >1-45798
(ID=1, Read Count = 45798), then the separate character will be set as sep=-.
Default: sep=#

libsFilter=<String>: name of the library file that should be used to filter out
certain reads (for example those that map to ribosomal RNA fragments). Fasta
format and bowtie index is accepted.
3.5
Profiling parameters
There are several parameters that could influence the profiling of known elements.

winUpMiR <int>: the upstream flanking for the detection of microRNAs
(default: winUpMiR=3).

winDownMiR <int>: the downstream flanking for the detection of microRNAs
(default: winDownMiR=5).
EXAMPLE: hsa-miR122-3p maps to the ’+’ strand of chr18 at start position 56,118,356
and end position 56,118,377. Then, this region is extended adding the downstream and
upstream flanks: from 56,118,356-winUpMiR to 56,118,377+winDownMiR. Therefore, all
the reads that lie within this region are assigned to the reference element (hsa-miR122-3p
17
20 June 2014
in this case).

winUpTrans <int>: same as winUpMiR, but applied to libs reference sequences
(default: winUpTrans=0).

winDownTrans <int>: same as winDownMiR, but applied to libs reference
sequences (default winDownTrans=0).

hierarchical <boolean>: Apply a hierarchical classification. hierarchical=true:
mapped reads cannot map again, i.e. are removed after mapping (like it is done
in miRanalyzer). If hierarchical=false, then reads can map to multiple libraries
(default: hierarchical=true).

base <0,1>: coordinates format of bed input files, 0-based or 1-based (default:
base=0).

matureMM <int>: number of allowed mismatches between the genome and the
known mature microRNA. This parameter influences the genome coordinate
detection of known microRNAs (default: matureMM=0).

hairpinMM <int>: number of allowed mismatches between the genome and the
known pre-microRNA sequence. This parameter influences the genome
coordinate detection of known pre-microRNAs (default: hairpinMM=0).

matureHomologMM <int>: number of allowed mismatches between the
putatively
homologous
microRNA
and
the
genome
(default:
matureHomologMM=2).

isomiRseed <int>: number of 5’ nucleotides which are not used to detect nontemplated additions (NTAs) (default: isomiRseed=18).
3.6

Output options
plotMiR <boolean>: plot out the microRNA alignments to the hairpin folder
(default: plotMiR=false).

plotLibs <boolean>: plot out the alignments against the libraries provided by
libs (default: plotLibs=false).

minRCplotLibs <int>: minimum read-count in order to write out the libs
alignment file (default: minRCplotLibs=200).

minRCplotMiR <int>: minimum read-count in order to write out the
18
20 June 2014
microRNA alignment file (default: minRCplotMiR=20).

isoLibs <boolean>: detect non-templated additions for libs (default:
isoLibs=false).

isoMiR <boolean>: profile and classify isomiRs (default: isoMiR=false).

maxLenForSecStruc <int>: the secondary structure is calculated if the length
of the reference sequence <= <int>, it only works if plotLibs=true and the readcount of the sRNA >= minRCplotLibs (default: maxLenForSecStruc=200).

fullIsoStat <boolean>: writes out a full isoMiR statistic, as a function of the
different species, 3p and 5p (default: fullIsoStat=false).

tRNA <String>: if a genomic tRNA library is used, tRNA mappings will be
summarized by anticodons. The value must be the same library set with libs.

writeGenomeDist <boolean>: writes out a mapping statistic as a function of
chromosome (default: writeGenomeDist=false).

graphics <boolean>: sRNAbench generates graphic files, R/ggplot2 must be
installed –execute ‘install.packages("ggplot2")’ in your R workspace - (default:
graphics=false).
3.7

Program names
RNAfold <String>: name of the RNAfold program. For example, if both
Vienna 2.0 and Vienna 1.8.5 or before are installed on the computer (default:
RNAfold=RNAfold).
3.8

Prediction of novel microRNAs
predict <boolean>: prediction of novel microRNAs. The prediction is
deactivated by default predict=false.

kingdom [animal,plant]: the microRNAs dataset kingdom must be indicated, in
order to use the appropriate feature thresholds (default: kingdom=animal).

maxClusterDist <integer>: the maximal distance between two read clusters.
By default maxClusterDist= 60 (animal) and maxClusterDist=180 (plant).

ForceHomolog <boolean>: if homolog and microRNA are not set, all known
19
20 June 2014
microRNAs (miRBase) are used to assign a name to the novel microRNAs by
means of sequence similarity (default: ForceHomolog=true).

novelStrict <boolean>: discards non bona fide microRNAs (default:
novelStrict=true).

novelName <String>: species short name used for the novel microRNAs. For
example, hsa (human), mmu (mouse), rno (rat), etc... (default: novelName=new).

seedFamily <int>: length of the seed region used to determine members of a
family (default: seedFamily=8).

seedFamilyMM <int>: number of mismatches allowed within the seedFamily
(default: seedFamilyMM=0).
4 Work-flow
4.1
Analysis steps
sRNAbench can be used in two different modes: Genome mode and Library mode. Both
modes share common pre-processing steps which consist on i) adapter trimming, ii)
rudimentary quality control (remove reads with Ns), iii) collapse identical reads into one
unique entry assigning a read count (number of times a given read was obtained in the
experiment). Moreover, both modes can be carried out in a hierarchical (by default) or
non-hierarchical way (a read can be assigned to several libraries) way. Hierarchical
means that each read that maps to a given library is removed from the analysis and
therefore cannot map again to another library (each read can only map to one annotation
group). The group mapping order is: 1) MicroRNAs (microRNA), 2) putative
homologous (homolog), 3) other libraries (libs).
NOTE: an unlimited number of libs can be provided and they will be used in the order as they
appear on the command line.
4.1.1 Genome mode
When a genome sequence is provided at the command line (species), then all reads are
20
20 June 2014
first mapped to this genome. Afterwards, the genome coordinates of the reference small
RNA annotations (microRNAs and those given by libs) are determined. For fasta
annotations, the sequences are mapped to the genome and the chromosomal coordinates
are retrieved in BED format. If the annotation input is given in BED format, the
corresponding coordinates are adopted directly.
4.1.2
Library mode
If no genome sequence is available, sRNAbench proceeds nearly identical as
miRanalyzer does. Instead of mapping to the genome sequence, the reads are
successively mapped first to the microRNA annotation and after this to the libraries
provided by libs.
4.2
IsomiR specification
sRNAbench does not only detect the canonical (miRBase) microRNA sequence, but also
all isomiRs (sequence variants). Frequently, a microRNA can show different posttranscriptional modifications. For example, it can be 3’ trimmed and adenylated. In
order to keep a simple classification schema, sRNAbench uses a hierarchical
classification for these isomiRs, basically for each read is tested if it belongs to one of
the following five classes:
1) The read is identical to the canonical sequence (miRBase entry).
2) The read starts and ends at the same position as the canonical sequence in the premicroRNA, but it shows sequence variations (most likely due to sequencing errors,
but RNA editing events and SNP might exist as well).
3) The read has non-templated additions (A, T(U), C or G added to its 3’ or 5’ end),
i.e. nucleotides at the 3’ end that do not match with the reference (template).
4) The read starts or ends at the same position as the canonical version. For this case
we can distinguish 4 groups:
a) 3’ trimmed read: the read starts at the same position as the canonical sequence
(same 5’ end) but it is shorter.
b) 3’ extended read: the read starts at the same position as the canonical sequence
21
20 June 2014
(same 5’ end) but it is longer and maintains the template nucleotides.
c) 5’ trimmed read: the read ends at the same position as the canonical sequence
(same 3’ end) but it is shorter.
d) 5’ extended read: the read ends at the same position as the canonical sequence
(same 3’ end) but it is longer and maintains the template nucleotides.
5) The read does coincide neither in 5’ nor in 3’ with the canonical sequence (multiple
length variant).
Figure 1: The figure shows the hierarchical isomiR classification used in sRNAbench.
4.3
Prediction of novel microRNAs
The implemented microRNA prediction method has been used before to detect novel
microRNAs in plants (Hackenberg et al.). Briefly:

The reads are mapped to the genome sequence

Reads that map to nearly identical positions in the genome are clustered into
’read clusters’ in the following way:
i) The reads are sorted by read count (read frequency).
ii) The most frequent read is chosen to form the first read cluster. Note that, the
22
20 June 2014
cluster coordinates are adopted from this most frequent read.
iii) The rest of the reads are checked to lie within a window defined by
‘clusterStart – 3 nt’ and ‘clusterEnd + 5 nt’ on the same strand (flanks were
added in order to assign all possible isomiRs to their read cluster).
iv) if the read belongs to an existing cluster, the read information (sequence and
the read count) is added to the cluster.
v) if the read does not belong to an existing cluster, a new cluster is opened.

After clustering all reads, read clusters with distances of less than 180 nt for
plants and less than 60 nt for animals are chosen. The bona fide miRNAs should
have two read clusters corresponding to the two arms processed from the premicroRNA sequence.

Last, the genomic sequence spanned by the two read clusters is extracted and its
secondary structure and alignment pattern of the derived pre-miRNA is
analysed. Then, the retained read cluster must:
i) Map to the stem of the putative pre-miRNA.
ii) Not fold back onto itself (i.e. it is not spanning the loop region).
iii) Be above all calculated feature thresholds shown in ¡Error! No se
encuentra el origen de la referencia..
23
20 June 2014
Figure 2: Features used in the prediction of novel microRNAs. The thresholds have been determined
using the same training set as described before miRanalyzer. They represent the percentile 5 (P5) of the
distributions obtained for known microRNAs for the high confidence prediction (HC) and 0.5 for the low
confidence prediction (LC).
5 Output files
5.1
Expression profiling
5.1.1 *.grouped Files
These files contain the expression profiling of the annotation provided:
1) name: name of the element.
2) unique reads: number of unique reads mapped to this element.
24
20 June 2014
3) read count: total number of reads mapped to this element.
4) read count (mult. map. adj.): each read is divided by the number of times it mapped
either to different genome loci (genome mode) or sequences in the library (sequence
library mode).
5) RPM (lib): Reads Per Million normalized by the total number of reads mapped to
the library.
6) RPM (total): Reads Per Million normalized by the total number of reads mapped to
the genome (genome mode) or by the total number of reads in the analysis
(sequence library mode).
7) chromosome string (only for genome mappings): the chromosome string has the
following format: “chromosome#chromosome start#chromosome end#strand”, it
refers to the genome position of the annotation.
NOTE: these coordinates are generated by sRNAbench if the annotation input is a fasta
file, and they are taken from the annotation file if the input is a BED or GTF/GFF file.
5.1.2 Single Assignment Files: *_singleA.grouped
These files contain an expression profiling based on a single assignment of the reads,
i.e. each read is only assigned once (to a locus or reference element). It is generated
using *.grouped file and the read.annotation file in the following way:
1) *.grouped files are sorted by read count.
2) Going from the most expressed microRNA (or other ncRNA) to lower expressed
ones, the reads that map to the reference sequence are summed and removed before
moving to the next microRNA. In this way, each read is uniquely assigned to one
reference sequence (the one with the highest total expression value).
NOTE: *_ singleA files are first generated for sense mappings, this means that a read that
maps to both, the sense and the antisense orientation is only assigned to the sense orientation.
WARNING: this file should not be used for differential expression analysis.
25
20 June 2014
5.2
MicroRNA specific output
5.2.1 miRBase_main.txt
This file contains a summary at the pre-microRNA level. Only microRNAs that have a
suitable secondary structure (stem-loop) and a mature sequence that do not fold back
onto itself are listed.
1) name: microRNA name.
2) UR: number of unique reads mapped to the pre-microRNA.
3) RC: total number of reads mapped to the pre-microRNA.
4) RC (adj.): adjusted read count (see 5.1.1
*.grouped Files).
5) UR5p: number of unique reads mapped to the 5’ arm.
6) RC5p: total number of reads mapped to the 5’ arm.
7) RC5p (adj.): adjusted read count of the 5’ arm.
8) name-5p: name of the 5’ arm.
9) UR3p: number of unique reads mapped to the 3’ arm.
10) RC3p: total number of reads mapped to the 3’ arm.
11) RC3p (adj.): adjusted read count of the 3 arm.
12) name-3p: name of the 3’ arm.
13) coordinates:
coordinate
string
id,
like:
chr1:hsa#198828173#198828282#
(“chromosome#chromosome start#chromosome end#strand”).
14) 5pcanonical: read count of the 5p arm ’canonical’ sequence.
15) 3pcanonical: read count of the 3p arm ’canonical’ sequence.
5.2.2 hairpin_NovelStar.txt
This file holds those mature microRNA sequences that are detected in the dataset, but
are not present in the annotation (miRBase). The sequences included must form a
Drosha/Dicer (DCL) compatible dsRNA with the annotated mature microRNA (1-2 nt
3’ overhang).
1) name: name of the “novel” mature microRNA.
26
20 June 2014
2) sequence: sequence of the “novel” mature microRNA.
3) read count: read count.
5.2.3 hairpin_suspicious.txt
This file lists all microRNA sequences present in the annotation database (normally
miRBase), for which any kind of error is detected (if plotMiR=true, these sequences are
written out to the hairpin_error folder).
1) name: microRNA id.
2) chromosome: chromosome.
3) start: chromosome start.
4) end: chromosome end.
5) strand: orientation.
5.3
isomiR output
5.3.1 miRBase_iso.txt
This file holds the isomiR composition for each mature microRNA. The file has the
following columns:
1) name: name of the mature microRNA.
2) pre-microRNA: name of the precursor sequence.
3) RC: read count of the mature sequence (canonical sequence and all isomiRs).
4) UR: unique reads mapped to the mature sequence (canonical sequence and all
isomiRs).
5) RPM (lib): Read Per Million normalized by the total number of reads mapped to the
library.
6) RPM (total): Reads Per Million normalized by the total number of reads mapped to
the genome (genome mode) or the total number of reads in the analysis (sequence
library mode).
7) arm: microRNA arm (either 3p or 5p).
8) isoString: this string holds the information for all detected isomiRs. The information
27
20 June 2014
for each isomiR is separated by ’|’. For decoding the string please see these
examples:
o nta#BASE_RC: all non-templated nucleotide additions.
EXAMPLE: nta#A_9568, means that 9,568 reads in the sample present one or
more 3’ terminal As, which are not present in the reference sequence (either
genome or miRBase hairpin sequence).
o nta#BASE#ADDITIONS_RC: non-templated nucleotide additions of a
given length.
EXAMPLE: nta#T#1_125, means that 125 reads present mono-uridylation at its 3’
end (1 U added, which is not present in the reference sequence).
o lv5p_RC: 5’ length variants.
EXAMPLE: lv5p_862 means that 862 reads show any type of 3’ length variation..
o lv3p_RC: 3’ length variants.
EXAMPLE: lv3p_222835 means that 222835 reads do show any type of 3’ length
variation.
o mlv_RC: Multiple length variants.
5.3.2
isomiR annotation
The file miRBase_isoAnnotation.txt assigns to each read mapped to a known microRNA
an isomiR related label.
1) read: read sequence.
2) name: name of the mature microRNA.
3) preMicro: name of the pre-microRNA.
4) isoClass: assigned isomiR class.
5) NucVar: observed nucleotide variation (reference > sample).
28
20 June 2014
6) read count: read count.
5.3.3 per mature microRNA isomiR summary
An isomiR summary for each of the mature microRNAs detected:
1) name: name of the mature microRNA.
2) UR: number of unique reads.
3) RC: read count.
4) RPM(total): Reads Per Million normalized to all pre-processed input (library
mapping)/genome mapped (genome mode) reads.
5) RPM(lib): Reads Per Million normalized to all reads mapped to a known
microRNA.
6) Canonical_RC: read count of the canonical sequence.
7) NTA(A): number of reads with a non-templated A addition.
8) NTA(U): number of reads with a non-templated U addition.
9) NTA(C): number of reads with a non-templated C addition.
10) NTA(G): number of reads with a non-templated G addition.
11) lv3pE: number of reads with 3’ length extension (longer than the canonical
sequence).
12) lv3pT: number of reads with 3’ length trimming (shorter than the canonical
sequence).
13) lv5pE: number of reads with 5’ length extension (longer than the canonical
sequence).
14) lv5pT: number of reads with 5’ length extension (shorter than the canonical
sequence).
15) mv: number of reads classified as multiple length variants.
5.3.4
The
isomiR summary
summary
is
composed
of
two
files:
“miRBase_allNTA.txt”
and
“miRBase_otherVariants”, which summarize all the isomiR statistics for NTA (nontemplated additions) and several length variants.
29
20 June 2014
1) name: the name of the isomiR type:
a) A (adenine addition), C (cytosine addition), T (U/T addition), G (G addition).
b) lv5pT: 5’ trimmed.
c) lv5pE: 5’ extended.
d) lv5p: 5’ length variant (lv5pT + lv5pE).
e) lv3pT: 3’ trimmed.
f) lv3pE: 3’ extended.
g) lv3p: 3’ length variant (lv3pT + lv3pE).
h) mv: multiple length variants.
2) totalRC: number of reads mapped to microRNAs.
3) NTA_count: number of reads that belong to a given isomiR type.
4) wMean: weighted mean (NTA_count/totalRC).
5) mean: mean isomiR ratio of the sample. For each microRNA, an isomiR ratio is
calculated: (number of reads belonging to a certain isomiR type) / (total number of
reads mapped to the microRNA).
6) stdDev: standard deviation of the mean.
5.4
General information
5.4.1
The “reads.annotation” file
This file contains a summary at the read level, i.e. each read is listed individually
together with all the annotations to which it mapped. It contains the following columns:
1) read sequence.
2) the read count.
3) Reads Per Million normalized by the total number of reads mapped to the genome
(genome mode) or the total number of reads in the analysis (sequence library mode).
4) Classification. It consists on the name of the annotation file (for libs) or
hairpin/mature (microRNAs from miRBase), plus the mapping orientation (sense or
anti-sense). The format is “AnnotationGroup#Orientation”.
30
20 June 2014
NOTE: the name can be “mixed” if the read maps to several different annotations.
5) Mapped Annotations. This column contains all the annotations to which the read has
been
aligned.
The
format
“AnnotationGroup#AnnotationName#Orientation#Chromosome
is:
position”.
The
chromosome position is only applicable in the genome mode, for library mode the
chromosome is set to’s’. The several different annotations to which a read can map
are separated by ’$’.
EXAMPLE:
“hv_030312_v2_18#MLOC_55934.3#antisense#s$hv_030312_v2_18#MLOC_55933.2#sen
se#s” means that i) the corresponding read mapped twice (two annotation strings separated
by ’$'), ii) the read maps to one gene in sense direction and another one in antisense
direction, iii) “hv_030312_v2_18 is the name of the library (AnnotationGroup) provided at
the common line by libs, iv) The gene names (AnnotationName) are MLOC_55933.2 and
MLOC_55934.3.
5.5
The ’stat’ folder
The stat folder contains various files with summary statistics:
5.5.1 Read Length
Two
files
summarize
the
read
length
distribution:
readLengthFull.txt
and
readLengthAnalysis.txt. The first one gives the length distribution without setting any
thresholds, as minimum length or minimum read count. The other one holds the
distribution of the reads used for the analysis. Both files have the following columns:
1) Read Length: length of the sequenced RNA fragment/molecule.
2) UR: number of unique reads
3) Percentage_UR: the percentage of unique reads having a given read length
4) RC: sum of all reads with a given read length.
5) Percentage_RC: percentage of the total read count having a given read length.
6) RPM: Reads Per Million: for readLengthFull.txt all input reads are used during the
31
20 June 2014
normalization, for readLengthAnalysis.txt only those reads that are included in the
analysis are used.
Note: there might be slight differences between both files due to the quality criterion which is
not applied in the ’readLenghtFull.txt’.
5.5.2 Mapping summary
A summary of the mapping process can be found in ByLibsStat.txt file. For each library
the following information is shown:

category: name of the library and the mapping orientation (sense, anti-sense),
coded into a string: “library#orientation”.

UC: unique reads.

RC: total read count.

RPM: Read Per Million normalized by the total number of assigned reads, i.e.
those that could be mapped (library mode) or assigned (genome mode) to any of
the RNA elements in the libraries. This column should sum 1,000,000.

RPMin: Reads Per Million normalized by the number of input reads (library
mode) or number of reads mapped to the genome (genome mode).
NOTE: If a read maps to both: sense and antisense orientation of a given reference sequence
(and/or a given category like mature, hairpin, tRNA, etc.), it counts partially for each category.
For example, a read with read count 10 which maps to the sense and antisense orientation of
microRNA hairpins, would add 5 reads to both the sense and the antisense orientation.
5.6
Alignment files - processing pattern
The alignments to the pre-microRNA sequences can be found in the hairpin folder,
there is one file for each pre-microRNA in the library. The file includes all the reads
that map to the pre-microRNA sequence, their count and isomiR information using the
nomenclature explained on 5.3
isomiR output. Using these files the Drosha/Dicer
(DLC) processing pattern can be studied.
32
20 June 2014
5.7
Novel microRNAs
When the predict parameter is set on, sRNAbench will try to find novel microRNAs. 3
main files are written out: novel.txt, mature_novel.fa (fasta file with mature sequences)
and hairpin_novel.txt (fasta file with pre-microRNA sequences). The alignments to the
novel microRNAs are written into the novel folder.
The novel.txt file has the following format:
1) name: an internal, unique name of the novel microRNA.
2) name2: name of a homologous microRNA or an sRNbench assigned name.
3) chrom: the chromosome/scaffold/contig.
4) chromStart: start coordinate of the novel pre-microRNA.
5) chromEnd: end coordinate of the novel pre-microRNA.
6) strand: orientation.
7) RC_hairpin: total read count of the pre-microRNA.
8) 5pSeq: mature microRNA sequence of the 5’ arm.
9) 5pRC: read count of the mature 5p (only the “canonical” sequence, no isomiRs are
included).
10) 3pSeq: mature microRNA sequence of the 3’ arm.
11) 3pRC: read count of the mature 3p (only the “canonical” sequence, no isomiRs are
included).
12) type: can be duplex (Dicer/Drosha pattern is found between the two most expressed
read clusters of the pre-microRNAs), duplexOther (Dicer/Drosha is found, but not
between the two most expressed read clusters), noDuplex (no Dicer/Drosha pattern
is found, probably those with only one arm but some additional reads that do not
belong to either miR nor miR*), single (only a single read cluster is detected, only
one arm).
13) hairpin: can be true (the pre-microRNA has a strict hairpin structure) or false (the
pre-microRNA does not have a strict hairpin structure).
14) homolog: the novel microRNA has putative homologs in other species.
15) preMiR sequence: the sequence of the pre-microRNA.
16) Conf: the confidence label (HC=high confidence; LC = low confidence)
33
20 June 2014
17) featString: the string composed of the 3 main feature values separated by ‘:’;
"inClusterRatio":"5pFluctDominant":"dominant2AllRatio"
18) Code: 2: perfect 3’ 2nt overhangs observed; 1: either only 1nt or 3nt overhang
observed
19) name-5p: the name of the 5p mature sequence
20) name-3p: the name of the 3p mature sequence
The novel.txt file is based on predictions that can be divided into 2 levels of
confidence: 1) high confidence (highConf.txt), 2) low confidence (lowConf.txt). The
division is done only based on the features described in figure 2. Each prediction
receives a label which is either HC (high confidence) or LC (low confidence)
The novel.txt file contains all the novel microRNAs, but the user can change the
stringency by for example forcing code=2 (perfect 2 nt 3’ overhang) or duplex type =
‘duplex’ (instead of ‘duplex’ or ‘duplexOther’).
6 Tips and tricks
6.1
Overwrite the mapping parameters
It is possible to “overwrite” the global mapping parameters given at the command line
(noMM, seed, alignType and mBowtie) for some analysis steps. This is mainly useful for
the library mode, where the parameters can be changed by adding a parameter string to
the library name: for example snoRNA.fa#1#20#n#20 (sRNAbench will try to align the
reads to the library named snoRNA.fa with noMM=1, seed=20, alignType=n and
mBowtie=20).
NOTE: i) for alignType=v, the seed parameter is ignored, and ii) the global parameters are
only temporarily overwritten for this specific library (snoRNA.fa).
6.2
Using bowtie indexes as libraries in genome mode
This possibility makes sense only if the library contains spliced sequences. The reads
34
20 June 2014
previously mapped to the genome are mapped again to the provided Bowtie index. In
addition, the mapping parameters to the Bowtie index can be overwritten as it was
shown on 6.1
Overwrite the mapping parameters.
EXAMPLE:
‘java
-jar
sRNAbench.jar
input=SRR069835_part.fastq
species=dm3
microRNA=dme
dbPath=”path to database” adapter=CTGTAGGCAC noMM=0 seed=18 alignType=n
libs=refSeq#0#20#v#20’
First, all reads are mapped to the genome with global parameters (1 mismatch, seed length 18
and seed alignment mode). The genome mapped reads that did not map to microRNAs, are then
mapped to the Bowtie index refSeq with noMM=0 and alignType=v. In this way, for each
Bowtie index library, different alignment parameters can be chosen.
6.3
Multi-species analysis
A new feature in sRNAbench is the possibility to analyse microRNAs from several
species at the same time. This is especially interesting when infected cells or
host/parasite interactions are analysed. However, caution is needed if the different
genomes included in the analysis share their “chromosome names” (as chr1 in human
and mouse genome), in this cases the genome sequence ids must be manipulated first.
EXAMPLE: in order to avoid redundant chromosome names, in the web-server, we added the
short species name to each chromosome name. Then, chr1:hsa (human) could be distinguished
from chr1:mmu (mouse). A simple command can add this information to the genome sequences
before building the Bowtie indexes and the seqOBJ zip file: cat hg19.fa | awk '{if($1 ~
/>/){print $1":hsa";} else{print $1; }}' > hg19_mp.fa
6.4
Construction of shared libraries
Sometimes, different RNA types are annotated in a single file. In order to include the
RNA classification during the sRNAbench analysis, each class/type should be added to
the RNA ID (the RNA IDs must contain the RNA name and its class/type separated by
‘:’, for example: >NR_046235:ribosomal_RNA). The sRNAbench helper tools can be
35
20 June 2014
used to generate this prepared annotation files from primary Ensembl and NCBI
annotation files (see sRNAbench helper tools).
6.5
Prediction of novel microRNAs
Generally, we recommend predicting novel microRNAs during a separate run and with
stricter parameter settings than during the analysis of known microRNAs: noMM=0
alignType=v minRC=2.
6.6
profile tRNAs
sRNAbench has some additional features when is used with tRNA libraries from the
Genome tRNA database (see sRNAbench helper tools). If the tRNAs library is set using
the tRNA=’name of the tRNA library’ parameter, an additional file will be generated
with the anti-codon information.
NOTE: Before using it with sRNAbench, the description field should be removed from the fasta
file. In Linux, this can be done easily by means of this command: ‘cat eukaryotic-trnas.fa | awk
’{ print $1 }’ > eukaryotic-trnas_woDesc.fa’, being eukaryotic-trnas.fa the input file.
7 Differential expression
A differential expression analysis can be carried out with sRNAbenchDE.jar. First, all
individual samples must be processed with sRNAbench. The differential expression can
be launched:
‘java -jar sRNAbenchDE.jar input=path to input folders output=name of the output
folder grpString=names of the input samples’
NOTE: all sample output folders must be in the same directory.
7.1

Mandatory parameters
input <string>: path to the output folders of the individual sRNAbench runs.
36
20 June 2014
For example, if the default output was used: input = /sRNAbenchDB/out/.

output <string>: name of the output folder. The output folder will be placed in
the directory given with input.

grpString <string>: group string must contain the names of the different
sRNAbench output folders in the following way: f1_1:f2_1#f1_2:f2_2 being
f1_1 the first folder (sample) of the first group (controls in a case/control study),
f2_1 the second folder of the first group, f1_2 the first folder of the second
group (cases), etc…
7.2

Analysis types
diffExpr <boolean>: performs a differential expression analysis (default:
diffExpr=true). Currently, at least two samples must be provided per condition.
Otherwise, please use only makeDEmatrix=true (see next parameter).
Note: Uses always the second column from the *.grouped output files (read count); all
expression values must be over minRC threshold.

makeDEmatrix <boolean>: generates an expression matrix using the files
specified by diffExprFiles (see 7.3
Differential Expression).
NOTE: Uses the values in the column specified by columnExpr (default columnExpr=4,
the library normalized RPM) applying a threshold of minRCexpr (default
minRCexpr=0).

iso <boolean>: performs a differential frequency analysis for the different
isomiR classes (default: iso=false).
NOTE: this analysis applies a standard t-test to the isomiR ratios. Therefore, at least 3
replicates per group must exist.

isoSummary
<boolean>:
uses
miRBase_allNTA.txt
and
miRBase_otherVariants.txt files to explore if a general difference in isomiR
generation exists between two conditions (default: isoSummary=false).

readLevel <boolean>: generates an annotated expression matrix at the read
37
20 June 2014
level (default: readLevel=false).
NOTE: the readMatrixExprCol column from the reads.annotation file is used to extract
the expression values (1=read count, 2=RPM(total)), applying a minReadRC threshold.
(See 7.5 Read level analysis for additional information).

stat <boolean>: calculates differential expression on summary files from the
stat folder (default: stat=false).
NOTE:
byLibStat.txt
file
is
used
during
the
analysis,
but
another
file
(byLibStat_extend.txt for example) can be set with statFiles parameter.

seqStat <boolean>: generates a table with sequencing statistics of all used
samples (default: seqStat=false).
7.3

Differential Expression
diffExprFiles <string>: names of the *.grouped expression files for which the
differential
expression
analysis
should
be
carried
out
(default:
diffExprFiles=mature_sense.grouped).

minRC <int>: minimum read count that an entity must have in all samples of a
given condition to be included in the analysis (default: minRC=1).

unify <boolean>: if it is set to true, the edgeR output is merged with an
expression matrix containing the RPM (Read Per Million) expression values.

columnExpr <int>: input file column from which the expression value should
be taken (only if unify is specified).
NOTE: for standard sRNAbench output files it should be either 4 (within library RPM)
or 5 (total read RPM).

minRCexpr <int>: minimum expression threshold required for the values given
by columnExpr (default: minRCexpr=0).
7.4
IsomiR analysis
38
20 June 2014
There are different ways to analyse the isomiR generation among different conditions.
The comparison can be performed at a microRNA level (iso=true) or at sample level
(isoSummary=true).

isoFile <string>: name of the file that contains the isomiR data or posttranscriptional
modifications
for
other
small
RNAs
(default:
isoFile=miRBase_iso.txt).

detectIsoMiRs <string>: string that encodes the names of all isomiR types that
should be analysed. The names of the different isomiR types must be separated
by
’|’.
(default:
detectIsoMiRs=nta#A|nta#A#1|nta#T|nta#T#1|nta#C|nta#G|lv5p|lv3p).

minRCiso <int>: minimum read count of a mature microRNA to consider it
during the isoform analysis (default: minRCiso=10).

isoCanonical <boolean>: canonical isomiR ratios are defined by (read count of
a given isomiR type)/(read count of canonical sequence), when it is set to
isoCanonical=true, the canonical ratios are additionally used to detect those
microRNAs with significant differences (default: isoCanonical=false).

isoSummary <boolean>: calculates statistics for the isomiR summary files in
the stat folder.

isoSummaryFiles <string>: file name(s) of the summary files (default:
isoSummaryFiles=miRBase_allNTA.txt|miRBase_otherVariants.txt).
NOTE: different summary files must be separated by ‘|’.
7.5 Read level analysis
This analysis compares the read counts between two or more conditions. However,
instead of assigning the reads first to a RNA type, the comparison is done at a read
level, i.e without grouping the reads.

minReadRC <int>: minimum expression value for a read.

readMatrixExprCol <int>: column in the read.annotations file used for the
expression
value:
1
for
read
count
and
2
for
RPM.
(default:
39
20 June 2014
readMatrixExprCol= 2).

annotCol <int>: column that should be used from the read.annotation file to
annotate the reads: 3 (‘groups#orientation’) or 4 (‘group#name#orientation’).
8 Differential Expression output
8.1
Differential expression output
8.1.1 Nomenclature of differential expression output files
The differential expression analysis is based on the *.grouped files. The nomenclature
of the edgeR output files is composed of: “base name of corresponding
groupedfile”_minRC_rcColumn_Group1_Group2.edgeR.
1) The base name of mature_sense.grouped would be “mature_sense”.
2) minRC is the minimum read count (minRCexpr) set by the user.
3) rcColumn is the column of the *.grouped file used to set up the expression matrix
(always column 2 which holds the integer read count values).
4) Group1 and Group2 are the group numbers (between 1 and the number of groups).
EXAMPLE: mature_sense_1_2_3_4.edgeR is the edgeR output file for the comparison between
the third and fourth group (in this example: group1 = healthy, group2 = risk, group3 = cancer,
group4 = metastasis) and the analysis will be done for the mature_sense.grouped file using a
minimum read count of 1. (minRCexpr=1).
8.1.2 *.edgeR files
The output files generated by edgeR have the following format:
1) name: the name of the sequence.
2) logFC: log2 of the fold change.
3) logCPM: the average log2-counts-per-million.
4) PValue: the exact p-value.
40
20 June 2014
5) FDR: corrected p-value.
8.1.3 *.extEdgeR files
These files are generated by sRNAbenchDE when unify parameter is set to true. They
have the same format as edgeR files with an extra columns, holding the RPM expression
values (columnExpr).
8.2
Expression matrix
Expression matrices are based on the *.grouped files and have mat extension. The
nomenclature is similar to the edgeR output files. The matrix contains the name of the
entity (microRNA, tRNA, etc) plus the columns with the expression values for each
sample.
EXAMPLE: mature_sense_0_4.mat is based on the mature_sense.grouped file, using column 4
(within library RPM) and applying a minimum threshold of 0.
8.3
Differential isomiR pattern
The isomiR ratio can be defined in two different ways:

Default: (isomiR type read number) / (total read count of microRNA (canonical
read count plus all isomiRs)).

Activated by isoCanonical=true: (isomiR type read number) / (canonical
(miRBase) read count).
IsomiR patterns are analysed for one *.iso file (isoFile). The first step in the analysis
of the isomiR patterns consists in the generation of isomiR ratio matrix files, which
have mat extension and its name indicates the analysed isomiR type.
EXAMPLE: iso_nta#T.mat indicates that this file holds the isomiR ratios for U/T non-templated
additions. iso_nta#T_canonical.mat is the corresponding matrix for the isomiR ratio based on
the canonical read count.
41
20 June 2014
8.3.1 *.ttest and *.sig files
*.ttest files hold all statistical comparisons, while *.sig files only hold the statistically
significant isomiR ratio differences (after FDR correction). For each isomiR ratio
matrix, all possible comparisons are calculated (for two groups 1 comparison, for three
groups 3 comparisons, for four groups 6 comparisons, etc…). The file nomenclature is
based on the group numbers.
EXAMPLE: iso_nta#T_1_2.ttest holds the t-test outcome for the comparison of U/T nontemplated additions between the first and the second group.
The files have the following columns:

microRNA: name of the mature microRNA.
1) mean_1: first group mean isomiR ratio.
2) var_1: first group standard deviation of the isomiR ratios.
3) mean_2: second group mean isomiR ratio.
4) var_2: second group standard deviation of the isomiR ratios.
5) p: the exact p-value (t-test).
6) FDR: the FDR corrected p-value.
8.3.2 sequencing statistic
The file sequencingStat.txt summarizes the sequencing statistics:
1) Sample: name of the sample.
2) raw reads: number of raw input reads.
3) adapter cleaned: number of adapter cleaned reads.
4) reads in analysis: number of reads in analysis (applying length thresholds to adapter
cleaned reads).
5) unique reads in analysis: number of unique reads in analysis (applying length
thresholds to adapter cleaned reads).
6) genome mapped reads: number of reads mapped to the genome.
42
20 June 2014
7) unique reads mapped to genome: number of unique reads mapped to the genome.
43