HT Sequence Analysis with R and Bioconductor
By Tyler Backman, Rebecca Sun and Thomas Girke, IIGB, UC Riverside
Contents
-
HT Sequence Analysis with R and Bioconductor
- Introduction
- Scope of this Manual
- Workshops
- R Basics
- String Handling Utilities in R's Base Distribution
-
Bioconductor Packages
-
Biostrings
- Handling Single Sequences with XString
- Comparing and Subsetting XString Objects
- Subfeature Retrieval with StringViews
- Masking Sequences
- Handling Many Sequences with XStringSet
- Comparing and Subsetting of XStringSet Objects
- Handling of Sequence and Quality Data with QualityScaleXStringSet
- Sequence Import and Export
- Basic Sequence Manipulation Utilities
- Trimming of Flanking Sequences (e.g. Adaptors)
- Alignment Tools for Genome Mapping
- Computing Pairwise Sequence Alignments
- Multiple Sequence Alignments (MSAs)
- Analysis Routines with Biostrings
- Exercises
- ShortRead
- IRanges
- BSgenome
- rtracklayer
- biomaRt
- HT-Seq Data Visualization
-
Biostrings
- Command Line Alignment Utilities
- Examples
- Session Info
- Installing Packages
Introduction
Sequencing Technology Slide Show (see also Simon Anders' presentation)
- High throughput sequencing (HT-Seq or HTS), also known as next generation sequencing (NGS), presents a wide spectrum of opportunities for genome research. Unfortunately, many existing bioinformatic tools do not scale well to large datasets consisting of tens of millions of sequences generated by technologies like Illumina/Solexa, Roche/454, ABI/SOLiD and Helicos. The Bioconductor project fills this gap by providing a rapidly growing suite of well designed R packages for analyzing traditional and HT-Seq datasets. These 'BioC-Seq' packages allow to analyze these sequences with impressive speed performance. Their accelerations are achieved by using memory efficient string containers and performing the time consuming computations with calls to external programs that are implemented in compiled languages (e.g. C++). Together these packages form a novel framework that allows researchers to develop efficient pipelines by performing complex data analyses in a high level data analysis and programming environment.
Scope of this Manual
This manual is intended for users who have a basic knowledge of the R environment, and would like to use R/Bioconductor to perform general or HT sequencing analysis. For new users, it is recommended to first review the material covered in the "R Basics" section (see below). To obtain a more comprehensive overview of R's sequence bioinformatics utilities, Robert Gentleman's book "R Programming for Bioinformatics" is an excellent choice.
The packages introduced here are a 'personal selection' of the authors of this manual that does not reflect the full utility spectrum of the Bioconductor project for this field of application. The introduced packages were chosen, because the authors use them often for their own teaching and research. To obtain a broad overview of available Bioconductor packages, it is strongly recommended to consult its official project site (http://bioconductor.org/). Due to the rapid development of many packages, it is also important to be aware that this manual will often not be fully up-to-date. Because of this and many other reasons, it is absolutely critical to use the original documentation of each package (PDF manual or vignette) as primary source of documentation. Users are welcome to send suggestions for improving this manual directly to the authors.
Workshops
Workshops on HT sequence data analysis using BioC-Seq resources are announced on the Bioconductor workshop site. The Institute for Integrative Genome Biology, IIGB at UCR also offers workshops in this field. In addition, there are many more institutions that organize similar workshops. The best way to find out about upcoming events in a certain location is to run a Google search on this topic.
R Basics
The R & BioConductor manual provides an introduction on the usage of the R environment and its basic command syntax, and the Programming in R manual provides a more advanced overview of R's programming syntax.
String Handling Utilities in R's Base Distribution
- The R base distribution contains various functions for handling and manipulating strings. These utilities are very useful for basic sequence analysis tasks. However, often it is much more effective to use for these tasks dedicated Bioconductor sequence analysis packages, such as Biostrings. This manual primarily introduces functions and data objects from the Bioconductor project. Nevertheless, the following commands provide a brief summary of some very useful string handling functions available in R's base distribution.
##################### ## String Matching ## ##################### myseq <- c("ATGCAGACATAGTG", "ATGAACATAGATCC", "GTACAGATCAC") # Creates a sample sequence data set. myseq[grep("ATG", myseq)] # String searching with regular expression support. pos1 <- regexpr("AT", myseq) # Searches 'myseq' for first match of pattern "AT". as.numeric(pos1); attributes(pos1)$match.length # Returns position information of matches. pos2 <- gregexpr("AT", myseq) # Searches 'myseq' for all matches of pattern "AT". as.numeric(pos2[[1]]); attributes(pos2[[1]])$match.length # Returns position information of matches in first sequence. gsub("^ATG", "atg", myseq) # String substitution with regular expression support. ######################## ## Positional Parsing ## ######################## nchar(myseq) # Computes length of strings. substring(myseq, c(1,4,7), c(2,6,10)) # Positional parsing of strings. ############################ ## Reverse and Complement ## ############################ myseq_comp <- chartr("ATGC", "TACG", myseq) # Returns complement for given DNA sequences. substring(myseq[1], 1:nchar(myseq[1]), 1:nchar(myseq[1])) # Vectorizes single sequence. x <- strsplit(myseq_comp, "") # Vectorizes many sequences. x <- lapply(x, rev) # Reverses vectors. myseq_revcomp <- sapply(x, paste, collapse="") # Collapses vectors to strings. Final result: reverse and complement of myseq. ######################################### ## Translate DNA Sequence into Protein ## ######################################### y <- c("ATGCATTGGACGTTAG") # Creates sample DNA sequence. AAdf <- read.table(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/AA.txt", header=T, sep="\t") # Imports genetic code. AAv <- AAdf[,2]; names(AAv) <- AAdf[,1] # Creates named vector with genetic code y <- gsub("(...)", "\\1_", y) # Inserts "_" after each triplet. y <- unlist(strsplit(y, "_")) # Splits on "_" and returns vectorized triplets. y <- y[grep("^...$", y)] # Removes incomplete triplets. AAv[y] # Translation into protein by name-based subsetting. ############################# ## Create Random Sequences ## ############################# sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse="")) # Creates 100 random DNA sequences with 20 residues. sapply(1:100, function(x) paste(sample(1:40, 20, replace=T), collapse=" ")) # Creates 100 random score sets with 20 elements. ################################################### ## Sequence Printing and Writing in FASTA Format ## ################################################### myseq <- sapply(1:12, function(x) paste(sample(c("A","T","G","C"), 40, replace=T), collapse="")) writeLines(myseq) # Prints the sequences to the screen, one per line. writeLines(strtrim(myseq, 10)) # The strtrim() function allows to print only the first section of the sequences. writeLines(strwrap(myseq, indent = 20)) # The strwrap() function provides wrapping, indenting and prefixing utilities. myname <- paste(">", month.name, sep="") # Creates sample names for the sequences. writeLines(as.vector(t(cbind(myname, myseq))), "myseq.fasta") # Writes the sequences in FASTA format to a file.
Bioconductor Packages
Biostrings
- The Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed and memory effective string containers, string matching algorithms, and other utilities, for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages.
Handling Single Sequences with XString
- The XString class allows to represent the different types of biosequence strings in the subclasses: BString, RNAString, DNAString and AAString.
library(Biostrings) b <- BString("I store any set of characters. The other XString objects store only the IUPAC characters for each biosequence type.") d <- DNAString("GCATAT-TAC") # Creates DNAString object. r <- RNAString("GCAUAU-UAC") # Creates RNAString object. r <- RNAString(d) # Converts d into RNAString object. p <- AAString("HCWYHH")
Comparing and Subsetting XString Objects
- Several useful utilities are available to access, subset and manipulate XString objects.
replaceLetterAt(d, c(1), "N") # Replaces residues at specified positions. length(d) # Returns the length of the sequence that is stored in d. toString(d) # Converts DNAString to character vector of length 1. d[2:6] # Prints d from postions 2 to 6. This makes positional parsing of sequences very easy! d[length(d):1] # Prints the reverse string of d. subseq(d, start=3, end=6) # A more efficient way for subsetting large sequences is the subseq() function from the IRanges package. d[1:4]==d[1:4] # Checks whether two sequences are identical. In this case it returns TRUE. d[1:4]==d[1:5] # Returns FALSE. d == r # I interprets DNA and RNA sequences the same way. As a result, the given example returns TRUE. subseq(d, start=7, width=1) <- DNAString("G") # Replaces base 7 by "G". Requires Biostrings 2.11.44 and IRanges 1.1.55 or higher.
Subfeature Retrieval with StringViews
- With the XStringViews class provides utilities to manage and view subfeatures from sequences stored in XString objects.
dlong <- DNAString("GCATATTACAATCGATCCGCATATTAC") dsub <- Views(dlong, start = c(1,6,11, 20), end = c(5,8,18, 27)) # Generates XStringViews object with four views (subfeatures). dsub[1:2] # Returns first two views. dsub[[1]] # Returns first view as XString object. paste(dsub, collapse="") # Collapses views into a single sequence. attributes(dsub) # Returns attribute list of XStringViews object. start(dsub); end(dsub); width(dsub); subject(dsub) # Functions to returns the different XStringView components as vectors. sapply(as.list(dsub), toString) # Returns sequences in XStringView object as vector. dsub[dsub == DNAString("ATCGATCC")] # Subsetting by a sequence string.
Masking Sequences
- Biostrings allows to mask XString objects in order to exclude undesired regions from consideration by various analysis tools, e.g. repetitive sequences in the mapping of short reads. The advantage of this masking approach is that it does not result in any information loss, because masks can be turned on and off any time. In addition, the process is very memory efficient.
d <- DNAString("CGCTATCTGACTGCCGCGCC") # Creates sample sequence. maskMotif(d, "GC") # Function for masking a sequence by a pattern. m1 <- Mask(mask.width=nchar(d), start=c(1,17), end=c(3,20)) # Defines mask m1. m2 <- Mask(mask.width=nchar(d), start=c(4,12), end=c(7,14)) # Defines mask m2 m12 <- append(m1, m2) # Joins two masks in one container. names(m12) <- c("A", "B") # Assigns names to mask components. active(m12)["B"] <- FALSE # Turns one mask off. masks(d) <- m12 # Applies masks to DNAString object. No information is lost by this action! d; nchar(d); length(d) # Returns some information about masked object. active(m12)["B"] <- TRUE # Turns mask B back on. masks(d) <- m12 # Reapplies modified mask object to d. reduce(d) # Reduces a set of masks to a single mask. gaps(d) # Reverses all the masks. reverseComplement(d) # Returns the reverse and complement for a masked sequence object. dfrag <- as(d, "XStringViews") # Turns MaskedXString object into an XStringViews object sapply(as.list(dfrag), toString) # Retrieves all unmasked sections in d as strings in vector. injectHardMask(d, "N") # Injects/fills the masked regions of a sequence with a user-specified letter. masks(d) <- NULL # Removes all masks from DNAString object. Alternative command: unmasked().
Handling Many Sequences with XStringSet
- The XStringSet class allows to represent many sequences in a single container. The subclasses for the different types of biosequences are: BStringSet, RNAStringSet, DNAStringSet and AAStringSet.
dset <- DNAStringSet(c("GCATATTAC", "AATCGATCC", "GCATATTAC")) # The corresponding XStringSet containers allow to store many biosequences in one object. names(dset) <- c("seq1", "seq2", "seq3") # Assigns names to the elements in a XStringSet object. rset <- RNAStringSet(c("GCAUAUUAC", "UUUAAAA", "GCAUAUUAC")) rset <- RNAStringSet(dset) # Converts dset into RNAStringSet object. pset <- AAStringSet(c("HCWYHH", "SPPRHA"))
Comparing and Subsetting of XStringSet Objects
- Several useful utilities are available to access, subset and manipulate XStringSet objects.
dset[1:2] # Returns the first two sequence entries from dset. append(dset, dset) # Appends/concatenates two XStringSet objects. DNAStringSet(dset, start=c(1,2,3), end=c(4,8,5)) # Sequence subsetting by positions provided under start and end. d==dset[[1]] # The [[ subsetting operator returns a single entry as XString object. lapply(as.list(dset), toString) # Converts XStringSet to list. If the name field is populated then the list elements will be named accordingly. names(dset); width(dset) # Returns the values for remaining components as vectors. dsetv <- sapply(as.list(dset), toString); dsetv # Returns the sequences in XStringSet as named vector. table(dsetv) # Counts the number of occurrences of each sequence. dset[!duplicated(dsetv)] # Removes the duplicated sequence in dset.
Handling of Sequence and Quality Data with QualityScaleXStringSet
- The QualityScaleXStringSet class allows to represent many sequences plus their quality data in a single container. The subclasses for the different types of biosequences are: QualityScaledBStringSet, QualityScaledDNAStringSet, QualityScaledRNAStringSet and QualityScaledAAStringSet.
########################## ## About Quality Scores ## ########################## phred <- 1:9 phreda <- paste(sapply(as.raw((phred)+33), rawToChar), collapse=""); phreda as.integer(charToRaw(phreda))-33 # Phred quality scores are integers from 0-50 that are stored as ASCII characters after # adding 33. The basic R functions rawToChar and charToRaw can be used to interconvert # among their representations. Solexa scores are similar, but use an offset of 64. ########################################## ## Quality Score Handling in Biostrings ## ########################################## dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) # Create random sample sequence. myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create random Phred score list. myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) # Converts integer scores into ASCII characters. myqual <- PhredQuality(myqual) # Converts to a PhredQuality object. lapply(seq(along=myqual), function(x) as.integer(myqual[x])) # Converts Phred scores from ASCII decimal back to integer representation. library(ShortRead); setSR <- ShortReadQ(sread=dset, quality=FastqQuality(BStringSet(myqual)), id=BStringSet(myqual)) myMA <- as(quality(setSR), "matrix") # Fast conversion of all quality ASCII values into a numeric matrix by using the ShortReadQ # function from the ShortRead library. boxplot(as.data.frame((myMA))) # Creates box plot of quality values. dsetq1 <- QualityScaledDNAStringSet(dset, myqual) # Combines DNAStringSet and quality data in QualityScaledDNAStringSet object. dsetq1[rowSums(myMA < 20) <= 4] # Filters for sequences that contain not more than 4 bases with a quality score below 20. dset <- DNAStringSet(dsetq1) # Extract sequence component from QualityScaledDNAStringSet object. qset <- quality(dsetq1) # Extract quality component from QualityScaledDNAStringSet object. DNAStringSet(dset[1:2], start=c(1,5), end=c(4,9)) # Positional subsetting for sequence data. BStringSet(qset[1:2], start=c(1,5), end=c(4,9)) # Positional subsetting for corresponding quality data. ########################################################### ## Example for Injecting Ns at Positions with Low Scores ## ########################################################### XStringSetInject <- function(myseq=dset[[1]], myqual=qset[1], cutoff=20, mychar="N") { # Defines character inject function XStringSetInject. mypos <- which(as.integer(myqual)<cutoff) return(toString(replaceLetterAt(myseq, which(as.integer(myqual)<cutoff), rep(mychar, length(mypos))))) } dvec <- sapply(seq(along=dset), function(x) XStringSetInject(myseq=dset[[x]], myqual=qset[x], cutoff=20)) # Apply XStringSetInject function to all sequences. dset2 <- DNAStringSet(dvec) names(dset2) <- paste("d", seq(along=dvec), sep="") names(myqual) <- paste("d", seq(along=dvec), sep="") dsetq2 <- QualityScaledDNAStringSet(dset2, myqual) # Reassemble results in QualityScaledDNAStringSet object. #################################### ## Filter Sequences by Ns (Score) ## #################################### Ncount <- alphabetFrequency(DNAStringSet(dvec))[,15] # Generates N counts for each sequence in dvec. dsetq2[Ncount <= 2] # Returns only sequences with 0-2 Ns. length(dsetq2[Ncount <= 2]) # Returns the number of sequences that have 0-2 Ns.
Sequence Import and Export
- The set of read.XStringSet import functions allow to import one or many FASTA-formatted biosequences from an external file directly into XStingSet objects. Alternatively, one can use the readFASTA function which stores the sequences in a list object.
################################################ ## Import and Export for XStringSet Container ## ################################################ myseq1 <- read.DNAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.ffn", "fasta") # Imports all open reading frames of the Halobacterium genome from NCBI into DNAStringSet object. myseq1[names(myseq1) %in% c("gb|AE004437.1|:1450-2115", "gb|AE004437.1|:c1195793-1195419")] # Retrieves sequences by their IDs. myseq2 <- myseq1[grep("^ATGCTT", sapply(as.list(myseq1), toString))] # Retrieves all sequences starting with "ATGCTT". write.XStringSet(myseq2, file="myseq2.txt", format="fasta", width=80) # Writes sequences to file in FASTA format. write.XStringSet(myqual, file="myqual.txt", format="fasta", width=80) # Writes quality scores from above example to file. ######################################### ## Alternative Import/Export Utilities ## ######################################### myseq1 <- readFASTA("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.ffn", strip.descs=T) # Imports all open reading frames of the Halobacterium genome from NCBI. myseq2 <- lapply(myseq1, function(x) x$seq) # Returns only sequences. myids <- sapply(myseq1, function(x) x$desc) # Returns IDs/Descriptions. names(myseq2) <- myids # Stores the sequences in a list where the sequences are named by their IDs myseq2[c("gb|AE004437.1|:1450-2115", "gb|AE004437.1|:c1195793-1195419")] # Retrieves sequences by their IDs. myquery <- myseq1[myids %in% c("gb|AE004437.1|:1450-2115", "gb|AE004437.1|:c1195793-1195419")] # Same result but on myseq1 object. writeFASTA(myquery, file="myquery.txt", width=80) dset <- DNAStringSet(sapply(myseq1, function(x) x$seq)) # Conversion into XDNAStringSet object. names(dset) <- sapply(myseq1, function(x) x$desc)
Basic Sequence Manipulation Utilities
The Biostrings package contains a wide spectrum of functions for performing many of the basic sequence transformation and manipulation routines, such as computing frequency tables, reverse & complements, translating DNA into protein sequences, etc. The following examples introduce a subset of these basic sequence analysis functions.
##################### ## Getting Started ## ##################### library(help=Biostrings) # Lists all functions provided by Biostrings. myseq1 <- read.DNAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.ffn", "fasta") # Imports a sample sequence data set: all open reading frames of the Halobacterium genome from NCBI. ###################### ## Useful Data Sets ## ###################### GENETIC_CODE # Prints genetic code information. IUPAC_CODE_MAP # Returns IUPAC nucleotide ambiguity codes. data(BLOSUM80) # Loads BLOSUM80 substitution matrix. More on this: ?substitution.matrices ?substitution.matrices # Detailed information on provided substitution matrices. ##################################################### ## Sequence Variants (SNPs) and Consensus Analysis ## ##################################################### data(phiX174Phage) # Imports six versions of the complete genome for bacteriophage phi X174 as DNAStringSet object. mymm <- which(rowSums(t(consensusMatrix(phiX174Phage))[,1:4] == 6)!=1) # Creates a consensus matrix and returns the mismatch (SNP) positions. DNAStringSet(phiX174Phage, start=mymm[1], end=mymm[1]+10) # Prints the sequence sections for the first SNP region in the phiX174Phage sequences. consensusString(DNAStringSet(phiX174Phage, start=mymm[1], end=mymm[1]+10)) # Prints the consensus sequence for the corresponding SNP region. mysnp <- t(consensusMatrix(phiX174Phage))[mymm,1:4]; rownames(mysnp) <- paste("SNP", mymm, sep="_"); mysnp # Returns the consensus matrix data only for the SNP positions. mysnp2 <- lapply(seq(along=mymm), function(x) sapply(as.list(DNAStringSet(phiX174Phage, start=mymm[x], end=mymm[x])), toString)) mysnp2 <- as.data.frame(mysnp2) colnames(mysnp2) <- paste("SNP", mymm, sep="_"); mysnp2 # Returns all SNPs in a data frame and names them according to their positions. ############################################# ## Reverse and Complement of DNA Sequences ## ############################################# reverseComplement(myseq1) # Returns reverse and complement for may BioC object types including XString, XStringView and XStringSet. reverseComplement(dsub) # Does the same for XStringView objects. Adjust all start and end coordinates accordingly. ######################################### ## Residue Frequency (e.g. GC Content) ## ######################################### alphabetFrequency(myseq1)[1:4,] # Computes the residue frequency for all sequences in myseq1. data.frame(ID=names(myseq1), GC=rowSums(alphabetFrequency(myseq1)[,c(2,3)]/width(myseq1))*100)[1:4,] # Returns GC content as data frame. trinucleotideFrequency(myseq1[1:4,]) # Returns triplet frequency. ########################################## ## Translate DNA into Protein Sequences ## ########################################## translate(myseq1[[1]]) # Translates a single ORF. mypep <- AAStringSet(sapply(seq(along=myseq1), function(x) toString(translate(myseq1[[x]])))) names(mypep) <- names(myseq1) # Translates all ORFs in myseq1 and stores result in corresponding AAStringSet object.
Trimming of Flanking Sequences (e.g. Adaptors)
- Many cloning and barcoding approaches append flanking sequences (e.g. adaptors) to the actual sequences of a biosample. These artificial sequence fragments need to be removed to allow a reliable matching of the sequences against their corresponding genomes or transcriptomes. The trimLRPatterns() function is a very efficient and flexible utility for removing flanking patterns from sequences.
myseq1 <- read.DNAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.ffn", "fasta") # Imports a sample sequence data set: all open reading frames of the Halobacterium genome from NCBI. trimLRPatterns(Lpattern = "AGG", Rpattern = "TGA", subject = myseq1, max.Lmismatch = 0.33, max.Rmismatch = 1) # Removes the specified R/L-patterns. The number of maximum mismatches can be specified for each pattern individually. trimLRPatterns(Lpattern="AAAAAAAAAGG", Rpattern="TGAAAAAAAA", subject=myseq1, max.Lmismatch=c(0,0,0,1,1,1,1,1,1,1,1), max.Rmismatch = rep(0, 10)) # To remove partial adaptors, the trimming can be performed in a hierarchical, iterative manner, # where first the full adaptor sequence N is considered, N-1 in the next iteration and so on. The # minimum adaptor length to consider in the last iteration is specified by a vector of the corresponding # length. The values in the vector specify the number of mismatches to allow in each iteration. For # instance, c(0,0,0,1,1,1,1,1,1,1,1) means allow for an 11-residue adaptor on the left no mismatch for the # first three fragments and one mismatch for the eight remaining fragments. trimLRPatterns(Lpattern="AAAAAAAAAGG", Rpattern="TGAAAAAAAA", subject=myseq1, max.Lmismatch=c(0,0,0,1,1,1,1,1,1,1,1), max.Rmismatch = rep(0, 10), ranges=T)[1:4,] # With the setting 'ranges=TRUE' one can retrieve the corresponding trimming coordinates.
Alignment Tools for Genome Mapping
- Biostrings contains a wide spectrum of sequence search and pattern matching tools. This section provides only a brief introduction into a subset of these tools.
###################### ## Pattern Matching ## ###################### ## The matchPattern() and vmatchPattern() functions allow to search a query (usually short) against ## one or many subject sequences, respectively. The number of mismatches and gaps can be ## specified by the user. mypos <- matchPattern("ATGGTG", myseq1[[1]], max.mismatch=1) # Finds all matches of a pattern in a reference sequence. mypos # The results are stored as XStringViews object. myhits <- sapply(as.list(mypos), toString) # Retrieves matches in vector. DNAStringSet(c("ATGGTG", myhits)) # Uses DNAStringSet to show a 'pseudo' alignment of the query sequence on the top and the hits below. t(consensusMatrix(c("ATGGTG", myhits))) # Returns a consensus matrix for query and hits. mycount <- countPattern("ATGGCT", myseq1[[1]], max.mismatch=1) # Counts only the corresponding matches. myvpos <- vmatchPattern("ATGGCT", myseq1, max.mismatch=1) # Finds all matches of a pattern in a set of reference sequences. myvpos # The results are stored as MIndex object. myvcount <- countPattern("ATGGCT", myseq1[[1]], max.mismatch=1) # Counts only the corresponding matches. Views(myseq1[[1]], start(myvpos[[1]]), end(myvpos[[1]])) # Retrieves the result for single entry as XStringViews object. lapply(seq(along=myseq1), function(x) toString(Views(myseq1[[x]], start(myvpos[[x]]), end(myvpos[[x]])))) # Retrieves the sequences of all matches. ################################################################ ## Searching for Patterns with Position Weight Matrices (PWM) ## ################################################################ ?matchPWM # Read help file for matchPWM. ################################### ## Fast Matching with matchPDict ## ################################### ## The matchPDict function allows fast matching of many short sequences stored in a dictionary against ## a long reference sequence (e.g. chromosome). ## (1) Matching of Sequences to Reference mydict <- DNAStringSet(sapply(1:1000, function(x) paste(sample(c("A","T","G","C"), 8, replace=T), collapse=""))) # Creates random sample sequences. names(mydict)<- paste("d", 1:1000, sep="") # Names the sequences. mypdict <- PDict(mydict) # Creates a PDict dictionary. Allows only sequences of same length. mychr <- DNAString(sapply(1, function(x) paste(sample(c("A","T","G","C"), 20000, replace=T), collapse=""))) # Creates sample chromosome. mysearch <- matchPDict(mypdict, mychr, max.mismatch=0) # Searches all dictionary entries against chromosome. ## (2) Accessing the Results unlist(mysearch)[1:10] # Returns first 10 matches. start_index <- startIndex(mysearch) # Returns start positions of matches. end_index <- endIndex(mysearch) # Returns end positions of matches. count_index <- countIndex(mysearch) # Counts the number of matches for each sequence. with_match <- which(count_index >= 1) # Retrieves index of sequences with at least one match. Views(mychr, start=unlist(start_index[with_match]), end=unlist(end_index[with_match])) # Returns matches as XStringViews object. mydict[with_match] # Returns corresponding dictionary entries. ## (3) Coverage Plots mycov <- as.integer(coverage(mysearch, 1, length(mychr))) max(mycov); min(mycov) sum(mycov != 0) / length(mycov) plotCoverage <- function(coverage, start, end, mycol="red") { # Defines coverage plotting function. plot.new() plot.window(c(start, end), c(0, 10)) axis(1) axis(2) axis(4) lines(start:end, coverage[start:end], type="l", col=mycol) } plotCoverage(mycov, start=1, end=20000, mycol="red") # Plots coverage result. ## (4) Plot Fragments onto Chromosome source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt") # Imports plotting function. mapDF <- data.frame(as.data.frame(unlist(mysearch))[,1:2], strand=1) # Creates coordinate data frame. colnames(mapDF) <- c("startx", "endx", "strand") genelength <- length(mychr) # Defines chromosome length. mapDF[,2] <- mapDF$endx + 200 # To see the matches better in the plot, their length is here extended by an arbitrary value (200). mapDF[mapDF[,2]>genelength,2] <- genelength Mysub <- paste("Chromosome length", genelength, "bp") # Specifies subtitle of plot. Mymain <- "Chromosome Plot for Sense Alignments" # Specifies title of plot. plotDF <- transformDF(plotDF=mapDF, genelength, method="arrange") # Transforms coordinates for plotting. featurePlot(plotDF=plotDF, geneline=6, genecol="green", featureline=2, featurecol=sample(c(2,4,8), length(plotDF[,1]), replace=T), showxy=F) # Plots the feature map for the match coordinates that are stored in 'mysearch'. # The feature color argument allows to color each feature in a different color. ## (5) Allowing Mismatches with matchPDict mychr <- DNAString("CTAGGCTGTATCCTAAACACCGACAAGGTCTAGCCTGTATGCTAGCTAGTGTCCGCTCGGAAGCCGCT") mydict <- DNAStringSet(c("GTAGGCTGTATCCTA", "CGCTCGGAAGCCGCT")) names(mydict) <- paste("d", seq(along=mydict), sep="") mypdict <- PDict(mydict, tb.end=1) # Uses a trusted band of nucleotides that need to match exactly (here only first nucleotide). tb(mypdict); tail(mypdict); width(mypdict) # Methods for accessing the different trusted band components; '?PDict' provides more detailed information. mysearch <- matchPDict(mypdict, mychr, max.mismatch=1) # Allows one mismatch outside of trusted band. list(mysearch) # Returns matches as IRanges object. table(cut(countIndex(mysearch), c(0, 1:10), right=FALSE)) # Summary of matches.
Computing Pairwise Sequence Alignments
- The pairwise sequence alignment function from Biostrings provides a flexible environment for computing optimal pairwise alignments using different dynamic programming algorithms.
s1 <- DNAString("GGGCCC"); s2 <- DNAString("GGGTTCCC") # Creates sample sequences. myalign <- pairwiseAlignment(s1, s2, type="local") # Computes pairwise local alignment with Smith-Waterman dynamic programming algorithm. myalign <- pairwiseAlignment(s1, s2, type="global") # Computes global alignment with Needleman-Wunsch algorithm. pairwiseAlignment(c("GGGCCC", "GAGGCC", "CGAGA"), s2, type="local", scoreOnly=T) # The function can also be used for sequence searching by providing the 'sequence database' as vector in the first argument. attributes(myalign)[-5] # Returns the components of an alignment object. subject(myalign); pattern(myalign) # Returns aligned sequences with gaps. Many more helper functions for alignments can be found in Biostrings' Alignment Manual. sq1 <- SolexaQuality(as.integer(c(22,22,22,12,12,12))); sq2 <- SolexaQuality(as.integer(c(22,22,22,0,0,0,0,0))) pairwiseAlignment(s1, s2, patternQuality = sq1, subjectQuality=sq2) # Uses a quality-based method for generating the substitution matrix.
Multiple Sequence Alignments (MSAs)
- Multiple sequence alignments (MSAs) can be imported as FASTA formatted alignments into R using the read.XStringSet function. Almost all MSA programs support this format (e.g. ClustalW, T-Coffee, MUSCLE, Multalin). This import method will store the MSAs as XStringSet objects. The following examples introduce a variety of useful analysis routines on MSAs using this object class.
########################### ## Import of MSAs into R ## ########################### library(Biostrings) p450 <- read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul", "fasta") # Imports MSA of P450 sequences from a FASTA formatted file and stores the result as AAStringSet object. as.matrix(p450) # Converts MSA into a character matrix. consensusMatrix(p450) # Computes the residue frequencies for each MSA column and returns the results as matrix. #################################### ## Viewing of MSAs in HTML Format ## #################################### ## Viewing large alignments in a web browser can be very efficient. The following ## example outputs an HTML formatted alignment with consensus line. StringSet2html <- function(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=TRUE, ...) { msa <- AAStringSet(msa, start=start, end=end) msavec <- sapply(msa, toString) offset <- (counter-1)-nchar(nchar(msavec[1])) legend <- paste(paste(paste(paste(rep(" ", offset), collapse=""), format(seq(0, nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="") consensus <- consensusString(msavec, ambiguityMap=".", ...) msavec <- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ") msavec <- paste(format(c("", names(msa), "Consensus"), justify="left"), c(legend, msavec, consensus), sep=" ") msavec <- c("<html><pre>", msavec, "</pre></html>") writeLines(msavec, file) if(browser==TRUE) { browseURL(file) } } StringSet2html(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=T, threshold=1.0) # Writes MSA to file "p450.html", which can be viewed in a browser. StringSet2html(msa=p450, file="p450.html", start=450, end=470, counter=20, browser=T, threshold=1.0) # Same as above, but for alignment positions 450-470. ########################################################## ## Computation of Similarity/Distance Matrices for MSAs ## ########################################################## masim <- sapply(names(p450), function(x) sapply(names(p450), function(y) pid(PairwiseAlignedXStringSet(p450[[x]], p450[[y]]), type="PID2"))) masim # Returns a percent identity matrix. The identity method can be specified under the "type" argument, see help for pid() function. madist <- 1-(masim/100) madist # Returns corresponding distance matrix. disttree <- hclust(as.dist(madist)) plot(as.dendrogram(disttree), edgePar=list(col=3, lwd=4), horiz=T) # Computes and plots distance tree for MSA. #################################################### ## Mapping Sequence Postions to Gapped Alignments ## #################################################### ## Remove gaps from sequences, e.g. for pattern matching removeGaps <- function(align=p450[[1]], gap="-") { myseq <- maskMotif(align, gap) myseq <- paste(as(myseq, "XStringViews"), collapse="") return(myseq) } p450seq <- AAStringSet(sapply(p450, function(x) removeGaps(align=x))) mypos1 <- matchPattern("FSAGKRICAG", p450seq[[1]]); mypos1 # Removes gaps from sequences and provides a pattern matching example for P450 heme binding site. p450gapmask <- maskMotif(p450[[1]], "-") ## Mapping sequence postions to gapped alignments seqAlignmap <- function(seq=p450seq[1], align=p450[1], gap="-") { seqindex <- 1:length(seq[[1]]) gapindex <- which(as.matrix(align)==gap) alignindex <- (1:length(align[[1]]))[-gapindex] names(alignindex) <- seqindex return(alignindex) } alignindex <- seqAlignmap(seq=p450seq[1], align=p450[1], gap="-") # Returns a named vector with the sequence coordinates in the name # field and the corresponding alignment coordinates in the data field. alignindex <- alignindex[start(mypos1)]:alignindex[end(mypos1)]; alignindex # Returns only the alignment positions for the above pattern match result. AAStringSet(p450, start=min(alignindex), end=max(alignindex)) # Returns the alignment segment for the above pattern match result. ############################################################# ## Simple Example for Sliding Window Conservation Analysis ## ############################################################# ## (A) Compute conservation vector. This example uses an oversimplified conservation model solely for demo purposes! consCol <- function(myalign=p450) { myalign <- as.matrix(myalign) consV <- sapply(seq(along=myalign[1,]), function(x) max(table(myalign[,x][!myalign[,x] %in% "-"]))/length(myalign[,1])) return(consV) } consV <- consCol(myalign=p450) ## (B) Perform sliding window analysis on conservation vector. slidingWindow <- function(mydata=consV, win=c(1, 100), start=1, end=length(consV)) { mydata <- mydata[start:end] windex <- t(sapply(0:length(mydata), function(x) win+x)) mywind <- sapply(seq(along=mydata), function(x) mean(mydata[windex[x,1]:windex[x,2]], na.rm=TRUE)) mywind <- mywind[1:(length(mywind)-win[2])] return(mywind) } slideV <- slidingWindow(mydata=consV, win=c(1, 20)) # Returns conservation vector for window size 20. ## (C) Plot sliding window conservation data plot(slideV, type="l", lwd=2, col="red", main="Sliding Window Conservation Analysis", sub="Window Size: 20", xlab="Alignment Position", ylab="Conservation: max=1.0") heme <- AAStringSet(p450, start=454, end=463); heme text(454, 0.71, toString(heme[[1]]), cex=0.8) # The highly conserved cytochrome P450 heme binding signature maps to positions 454-463, # which is located in the last big peak of the plot.
Analysis Routines with Biostrings
- The following examples introduce a variety of useful analysis routines that can be accomplished with functions provided by the Biostrings package.
Mapping of Affy Probes to Transcriptome (similar to RNA-Seq)
- Example for aligning Affy probe sets back to their target mRNAs. The code retrieves the matching counts and mapping coordinates for all probes and probe sets. In addition, it identifies probe sets matching several genes and/or genes matching several probe sets. The given analysis routine can be easily adjusted to the needs of RNA-Seq data sets for obtaining read counts per mRNA.
########################################## ## Mapping Affy Probes to Transcriptome ## ########################################## ## (1) Import mRNA/cDNA sequences (here Arabidopsis) library(Biostrings) mygenes <- read.DNAStringSet("ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR8_blastsets/TAIR8_cdna_20080412", "fasta") names(mygenes) <- gsub(" \\|.*", "", names(mygenes)) # Extracts locus IDs from description line for sequence naming. ## (2) Create pattern dictionary from probe sets library(ath1121501probe) mydict <- DNAStringSet(ath1121501probe$sequence) # Obtains the sequences of the probes from ath1121501 chip. mynames <- ath1121501probe$Probe.Set.Name # Obtains the corresponding probe set names. mynames <- table(mynames)[mynames] # Next 4 lines create unique names for probes by appending a counter. mycount <- unlist(sapply(mynames[!duplicated(names(mynames))], function(x) seq(1, x))) mynames <- paste(names(mynames), mycount, sep="_") names(mydict) <- mynames mydict <- PDict(mydict) # Creates a PDict dictionary. ## (3) Count matches for each mRNA sequence sapply(names(mygenes)[1:40], function(x) sum(countPDict(mydict, mygenes[[x]]))) # Returns matches for first 40 mRNAs. ## (4) Search with pattern dictionary each mRNA sequence (step 5 provides a much faster solution!) hitCollect <- function(mydict=mydict, myref=mygenes[[1]]) { # Defines search function for many query sequences (here mRNAs) mymatch <- matchPDict(mydict, myref) hits <- sum(countIndex(mymatch)>=1) if(hits==0) { return(NULL) } else { return(unlist(mymatch)) } } myhitList <- lapply(names(mygenes)[1:10], function(x) hitCollect(mydict=mydict, mygenes[[x]])) names(myhitList) <- names(mygenes)[1:10]; myhitList # The last two lines perform the search for the first 10 mRNA sequences and organize the results # in a list object. Searching all 39,000 mRNAs will take about 14 hours. countList <- sapply(names(myhitList), function(x) table(gsub("_at_.*", "_at", names(myhitList[[x]])))); countList # Counts for each mRNA the number of matching probes that belong to the same probe set. data.frame(locusID=rep(names(countList), sapply(countList, length)), affyID=unlist(sapply(countList, names)), Nprobes=unlist(sapply(countList, as.vector))) # Converts countList into a data frame. With this data structure one can easily determine # with tapply() all probe sets matching several genes and/or genes matching several probe sets. ## (5) Same as step (4) but with much faster computation by using the concatenation of all mRNAs as reference myconc <- BString(toString(mygenes)); myconc <- maskMotif(myconc, ", "); myconc <- DNAString(injectHardMask(myconc, "+")) # Concatenates sequences to a single string where each component is separated by '++'. myhitList <- matchPDict(mydict, myconc) # Performs search against the single reference containing all mRNAs. # This way the mapping of all probe sets will take only a few seconds. # Mappings across mRNA boundaries are prevented by inserted "+" marks. mymask <- maskMotif(myconc, "+") # Creates a mask for mRNA boundaries labeled with "+". mRNAloc <- data.frame(start=start(as(mymask, "XStringViews")), end=end(as(mymask, "XStringViews")), IDs=names(mygenes)) # Returns mRNA start and end positions in reference. myhitDF <- as.data.frame(unlist(myhitList)); myhitDF <- myhitDF[order(myhitDF$start), ] mRNAends <- mRNAloc$end; myhitends <- myhitDF$end mylength <- sapply(mRNAends, function(x) { mylength <- sum(myhitends <= x); myhitends <<- myhitends[!myhitends <= x]; return(mylength) }) tmphitDF <- cbind(myhitDF, mRNAloc[rep(seq(along=mRNAloc[,1]), mylength),]) finalhitDF <- cbind(start=tmphitDF[,1]-tmphitDF[,5]+1, end=tmphitDF[,2]-tmphitDF[,5]+1, tmphitDF[,c(4,7)]); finalhitDF[1:28,] # The previous five lines map the hits back to the corresponding mRNAs. countList <- tapply(gsub("_at_.*", "_at", finalhitDF$names), as.vector(finalhitDF$IDs), table); countList[1:20] # Counts for each mRNA the number of matching probes that belong to the same probe set. data.frame(locusID=rep(names(countList), sapply(countList, length)), affyID=unlist(sapply(countList, names)), Nprobes=unlist(sapply(countList, as.vector)))[1:40,] # Converts countList into a data frame. With this data structure one can easily determine # with tapply() all probe sets matching several genes and/or genes matching several probe sets.
Simulate Number of Unique Alignments
- Example for estimating the percentage of short read sequences that align unambiguously to a reference genome when allowing a specific number of mismatches. Genomes with a high content of repetitive/duplicated sequences will have lower numbers of uniquely aligning short reads than genomes with a low degree of repetitive sequences. The provided functions allow the user to specify the length of the short reads and the number of allowable mismatches. The final result for increasing numbers of aligned reads is represented in a bar plot.
#################################################################################### ## Estimate the Percentage of Short Reads that Have Unique Alignments in a Genome ## #################################################################################### ## Comment: The countPattern function is used here to allow alignments with mismatches. If ## no mismatches are required, the much faster matchPDict function should be used instead. ## (1) Store your reference genome in a DNAString object. A random sequence is used ## here for test purposes. library(Biostrings) ref <- DNAString(paste(sample(c("A","T","G","C"), 100000, replace=T), collapse="")) ## (2) Define alignment function where ## ref: reference genome ## Nreads: number of pseudoreads to extract from reference genome ## width: length of pseudoreads ## max.mismatch: max number of mismatche to allow alignCount <- function(ref=ref, Nreads=1000, width=15, max.mismatch=2, ...) { randstart <- sample(1:(length(ref)-width), Nreads) randreads <- Views(ref, randstart, width=width) match_count <- sapply(seq(along=randreads), function(x) countPattern(randreads[[x]], ref, max.mismatch=2)) return(table(match_count == 1)/length(match_count)*100) } ## (3) Choose the number of pseudoreads that should be tested. To obtain accurate estimates, ## one should choose here large enough numbers (e.g. 10,000 or higher for the largest value). testsize <- c(500, 1000, 2000, 5000) ## (4) Run the alignCount function and plot the results (parameter settings here for 15bp reads and 2 mismatches). alignMA <- sapply(testsize, function(x) alignCount(ref=ref, Nreads=x, width=15, max.mismatch=2)) rownames(alignMA) <- c(">1 Alignment", "1 Alignment"); colnames(alignMA) <- testsize barplot(alignMA[c(2,1),], col=c("blue", "red"), legend.text=rev(rownames(alignMA)), ylim=c(0,120), main="Alignments with 0-2 Mismatches", xlab="N Simulated Reads", ylab="%")
Exercises
ShortRead
- The ShortRead package provides input, quality control, filtering, parsing, and manipulation functionality for short read sequences produced by high throughput sequencing technologies. While support is provided for many sequencing technologies, this package is primairly focused on Solexa/Illumina reads.
Function Overview
Function
Description
SolexaPath("path")
Parses a Solexa output folder, and identifies important files
analysisPath(SolexaPathObject)
Returns a list of GERALD output folders
compose(filter1, filter2, filter3, ...)
Returns a sequence filter based on a combination of various filter objects
readAligned(SolexaPathObject, "filename", filter= myfilter)
Reads in an Eland export file which contains sequences, alignment coordinates, and alignment-normalized quality scores based on a given filter object
sread(AlignedReadObject)
Returns a list of sequences contained in an AlignedReadObject (as returned by readAligned)
coverage(AlignedReadObject, start=startnt, stop=stopnt)
Calculates coverage of an alignment over each base in the alignment genome, and compresses the data to save memory
Illumina Pipeline
- The Illumina Pipeline is a set of software tools for the unix platform provided to users of the Illumina Genome Analyzer. The pipeline reads the images produced by the sequencer, analyzes them, performs base calling, and (optionally) aligns sequences to a reference genome. The user is provided with a set of sequences corresponding to each read, along with quality scores representing the accuracy of each base call.
Quality Scores
- A numeric Phred score represents the error probability of a given base call. When a nucleotide sequence is produced by sequencing, random error results in the possibility that any given base call may be incorrect. Thus, a quality score is provided for each base.
- The phred score can be calculated from the error probability of a given base call:
- phred score=-10*log(error probability)/log(10)
- illumina score=-10*log(error probability / (1 - error probability))/log(10)
Error Probability
Phred Score
Illumina Score
1
0
negative infinity
0.1
10
~9.542
0.01
20
~19.956
0.001
30
~29.996
0.0001
40
~40.000
- Illumina scores are asymptotically identical to Phred scores. Therefore, values of around 15 or greater are practically identical for both scoring systems. When quality scores are used to represent a long sequence, they are often represented using the ASCII alphabet, adding the number 33 to Phred scores, and 64 to Illumina scores. For example, a Phred score of 40 can be represented as the ASCII char "I" (40+33=Ascii #73), and an Illumina score of 40 as "h" (40+64=Ascii #104).
Here is a link to a reference on the ASCII alphabet: http://www.asciitable.com/
- Update: Complicating things further, the most recent Illumina pipeline version 1.3.2 now uses Phred scores, but still represents the values using the old Illumina ASCII character offset.
ShortRead Objects
SolexaPath
- This object stores output file locations from the Illumina Pipeline. The SolexaPath function which creates this object automatically finds the subfolders for each output filetype.
library(ShortRead) sp <- SolexaPath("/home/zzz11/sample_flowcell") # Creates a SolexaPath for a pipeline output folder # general format: sp <- SolexaPath("/path/to/folder") baseCallPath(sp) # returns base call path subfolder analysisPath(sp) # returns GERALD/Eland alignment subfolder imageAnalysisPath(sp) # Returns firecrest image analysis subfolder
ShortReadQ
- This object stores sequences, quality scores, and ids for a series of reads.
reads <- readFastq("/home/zzz11/test", pattern="filename.fastq") # Reads in a FASTQ file id(reads) # outputs read ids as a list as BStringSet sread(reads) # outputs read sequences as a list as DNAStringSet quality(reads) # outputs list of quality scores as BStringSet writeFastq(reads, "output.fastq") # writes ShortReadQ object to output file
AlignedRead
- This objects stores the genome alignment coordinates (chromsome, position, strand) for each of a series of reads.
# Import AlignedRead Object alignedReads <- readAligned("/home/zzz11/test", pattern="output.soap", type="SOAP") # Reads in a external alignment (change type depending on program) # general format: readAligned("/path/to/folder", pattern="filename", type="AlignmentTool") alignedReads[1] # choose a read and inspect it's alignment coordinates
ShortRead Sequence Filters
- ShortRead provides functions that return sequence filter objects, which allow you to filter your input sequences by various criteria. It is also possible to create new filter objects.
Filter Function
Description
srFilter
Creates a custom filter (see Vignettes)
chromosomeFilter("chromosome name regex")
Selects reads which align to specific chromosome(s) matching the given regular expression
strandFilter("+/-")
Selects reads which align to the + or - strand on the genome
nFilter(# of Ns)
Selects reads which contain fewer than a given number of N characters
polynFilter(# of given nt)
Selects only reads which contain less than a given number of the same nucleotide
srdistanceFilter(DNAStringSet, DNAString)
Performs Pairwise alignment of each DNAStringSet string against DNAString, and calculates edit distance
uniqueFilter()
Removes duplicate occurances of a given nucleotide sequence from a set of reads
- Depending on the nature of the filter, they may work with different object types. Consult the Reference Manual Vignette for instructions on building filters, and additional built-in filter functions.
# Create several filters: filter1 <- nFilter(threshold=1) # keep only reads without Ns filter2 <- polynFilter(threshold=20, nuc=c("A")) # remove reads with 20 or more As filter3 <- strandFilter("+") # choose reads which align to positive strand filter <- compose(filter1, filter2, filter3) # Combine filters into one alignedReads <- alignedReads[filter(alignedReads)] # filter alignedReads object with all 3 filters at once
IRanges
- IRanges provides infrastructure and containers for handling sets of integer ranges. Integer ranges are commonly used to represent coordinates of read alignment loci, or various annotation feature loci within a biological sequence.
IRanges Use
library(IRanges) # load IRanges library ranges <- IRanges(start=c(1,2,3), end=c(2,3,4)) # Constructs an IRanges object with 3 ranges: 1-2, 2-3, and 3-4. # These ranges are often used to represent genome alignment coordinates, or adapter coordinates # within a read. rangeList <- IRangesList(ranges1, ranges2, ...) # Constructs a list of Iranges objects, and can be used to associate a set of ranges # to filter,trim, or align each string in a BStringSet object coverage <- Rle(c(1,1,2,3)) # Constructs an Rle (run length encoded) object. This is an efficient compressed # way to store a long integer string with a lot of repeated elements, and is # often used to represent genomic coverage of short read alignments for each # residue in the genome.
BSgenome
- BSgenome provides an object oriented infrastructure for interacting with a Biostring based genome sequence. BSgenome packages exist for many common genomes, and can be created to represent custom genomes. See the "How to forge a BSgenome data package" Vignette for instructions to create a new BSgenome package if a prebuilt package does not exist for your organism.
BSgenome Use
library(BSgenome) available.genomes() # Get list of all available pre-built genomes to choose from # source("http://bioconductor.org/biocLite.R") # biocLite("BSgenome.Hsapiens.UCSC.hg18") # Run these commands to install a genome you haven't used before, or to update the version library(BSgenome.Hsapiens.UCSC.hg18) # Download, install, and load the Homo sapiens genome Hsapiens # Inspect overview of data provided with this package masknames(Hsapiens) # Returns a list of built in masks defined for sequences in the genome # AGAPS = assembly gaps # AMB = intra-contig ambiguities # RM = RepeatMasker # TRF = Repeat Regions found via Tandem Repeats Finder chr1 <- Hsapiens$chr1 # load chromosome 1 into memory with default masks length(chr1) # returns length of chromosome 1
rtracklayer
- rtracklayer provides an interface for exporting annotation feature data to various genome browsers and file formats (such as GFF). See the Small RNA Profiling exercise for an example of using rtracklayer to visualize alignment coverage.
biomaRt
- biomaRt provides an R interface to publicly accessible biological databases including Ensembl, Uniprot, Gramene, Wormbase and HapMap. This data is retrieved automatically via the internet, and is not versioned. If a pre-existing Bioconductor package exists with the data you need, it is recommended to use that package instead.
biomaRt Use
library("biomaRt") listMarts() # get list of all available data sources ensembl <- useMart("ensembl") # load the snp source listDatasets(ensembl) # list available datasets aaegypti_gene_ensembl <- useDataset("aaegypti_gene_ensembl", mart= ensembl) listAttributes(aaegypti_gene_ensembl) # show attributes in this dataset listFilters(aaegypti_gene_ensembl) # show available filters aaegypti_gene_ensembl <- useMart("ensembl","aaegypti_gene_ensembl") # select specific dataset getBM(attributes=c("chromosome_name"), mart= aaegypti_gene_ensembl) # retrieve data regarding specific attributes in table/dataframe # format # optional filters= option allows for specific filters to be applied
HT-Seq Data Visualization
HilbertVis: Hilbert genome plots
GenomeGraphs: Plotting genomic information from Ensembl
TileQC: Flow Cell Quality Visualization
rtracklayer: R interface to genome browsers
Command Line Alignment Utilities
SOAP
- SOAP stands for "Short Oligonucleotide Analysis Package." This is a particularly powerful and fast alignment tool which supports mismatches and indels. It also contains a new tool for SNP calling.
Running SOAP
- SOAP is run from the unix command line, and requires a reference genome in .fasta format, and a list of query sequences in .fasta or .fastq format. This example was tested with SOAP version 2.17.
soap # inspect command line options 2bwt-builder genome.fasta # create binary of reference genome soap -a query.fastq -D genome.fasta.index -o output.soap # align query to genome and store output
Parsing Output with R
library(ShortRead) alignedReads <- readAligned("./", pattern="output.soap", type="SOAP") # Reads in alignment coordinates as a ShortRead AlignedRead object
MAQ
- MAQ stands for "Mapping and Assembly with Quality." It is a suite of tools which allows you to build a consensus assembly by mapping short reads to a reference. This is particularly useful for identifying SNPs. MAQ can also be used as a general purpose short read alignment tool, and contains some useful file format converters.
Running MAQ
maq # inspect command line options maq fasta2bfa genome.fasta genome.bfa # create binary of reference genome maq fastq2bfq query.fastq readBinary.bfq # create a binary of dataset maq match out.map genome.bfa readBinary.bfq # align query to genome and store output
Parsing Output with R
library(ShortRead) alignedReads <- readAligned("./", pattern="out.map", type="MAQ") # Reads in alignment coordinates as a ShortRead AlignedRead object
Bowtie
- Bowtie is an ultra-fast alignment tool. It has somewhat less flexibility with regard to mismatches than SOAP or MAQ, but much greater performance. It does not currently support gapped alignment. Bowtie is generally an excellent choice if you don't need any of the extended features provided by lower performance tools.
Running Bowtie
bowtie # inspect command line options bowtie-build ~/AT_Columbia_TAIR8.fasta genome.binary # create binary genome (replace ~/AT_Columbia_TAIR8.fasta with your genome) # You can also download pre-built genomes from the Bowtie web page bowtie -q ~/genome.binary ~/query.fastq output.bowtie # align query to genome and store output (replace ~/query.fastq with your query sequences # and ~/genome.binary with the genome you built in the previous step )
Parsing Output with R
library(ShortRead) alignedReads <- readAligned("./", pattern="output.bowtie", type="Bowtie") # Reads in alignment coordinates as a ShortRead AlignedRead object
Examples
Plot Per-Cycle Quality of FASTQ File
library(ShortRead) reads <- readFastq("~/", pattern="query.fastq") # Read in datafile (replace "~/" with your path and query.fastq with your filename) if (length(reads) > 1000000) { reads <- sample(reads,1000000) # if more than a million reads, sample randomly } qual <- FastqQuality(quality(quality(reads))) # get quality scores readM <- as(qual, "matrix") # convert scores to matrix boxplot(as.data.frame((readM)), outline = FALSE, main="Per Cycle Read Quality", xlab="Cycle", ylab="Phred Quality") # build per cycle boxplot
Import Reads, Filter, and Export
Import Reads
library(ShortRead) reads <- readFastq("~/", pattern="query.fastq") # import reads from fastq file into ShortReadQ Object # replace "~/" with your path and query.fastq with your filename
Filter Reads
filter1 <- nFilter(threshold=3) # keep only reads with fewer than 3 Ns filter2 <- polynFilter(threshold=20, nuc=c("A", "C", "T", "G")) # remove reads with 20 or more of the same letter filter <- compose(filter1, filter2) # Combine filters into one sequences <- sread(reads) # extract sequences from read object filteredReads <- reads[filter(sequences)] # apply filter to sequences, and use this to remove "bad" reads
Export Reads
writeFastq(filteredReads, "output.fastq") # export filtered reads to fastq file
SmallRNA Profiling
Import Reads From Illumina Pipeline
- Set the path of your Illumina Pipeline output:
library(ShortRead) # sp <- SolexaPath("/path/to/my/flowcell/folder") sp <- SolexaPath("/home/zzz11/sample_flowcell")
- Import reads from a lane:
lane <- 7 # read in all tiles of data qualityInputFiles <- paste("s_", lane, "_[0-9]+_prb.txt", sep="") sequenceInputFiles <- paste("s_", lane, "_[0-9]+_seq.txt", sep="") reads <- readBaseQuality(baseCallPath(sp), seqPattern=sequenceInputFiles, prbPattern=qualityInputFiles,type="Solexa") id(reads) # inspect read ids sread(reads) # inspect read sequences quality(reads) # inspect quality scores
Filter Sequences
# Construct several filters filter1 <- nFilter(threshold=3) # keep only reads with fewer than 3 Ns filter2 <- polynFilter(threshold=20, nuc=c("A", "C", "T", "G")) # remove reads with 20 or more of the same letter filter <- compose(filter1, filter2) # Combine filters into one reads <- reads[filter(sread(reads))] # filter sequences with all filters at once # since these filters work on the sequences themselves, it is necessary to pass them the sequences # with the sread(reads) function, rather than the entire ShortReadQ object reads # inspect filtered reads
Trim Adapters
library(Biostrings) # Load Biostrings library seqs <- sread(reads) # get sequence list qual <- quality(reads) # get quality score list qual <- quality(qual) # strip quality score type seqs <- DNAStringSet(seqs, start=4) # trim first 3 bases from sequence qual <- BStringSet(qual, start=4) # trim first 3 bases from Qscore # This is only necessary if you need to shorten the read by a fixed # distance at the beginning. This is done here because the samples # are multiplexed, and we wish to omit the multiplexing tag. adapter <- "TGTAGGCACCATCACTCGGGCACCAAGGA" # This is the adapter sequence to be trimmed from the end of your smallRNA reads mismatchVector <- c(rep(2,length(DNAString(adapter)))) # This is a vector of equal length to the adapter with "2" for each value. # This will allow the adapter to match the end of the sequence with any offset # and up to 2 mismatches. trimCoords <- trimLRPatterns(Rpattern=adapter, subject=seqs, max.Rmismatch=mismatchVector, ranges=T) # Trim sequences looking for a right end pattern # Gets IRanges object with trimmed coordinates seqs <- DNAStringSet(seqs, start=start(trimCoords), end=end(trimCoords)) qual <- BStringSet(qual, start=start(trimCoords), end=end(trimCoords)) # Use IRanges coordinates to trim sequences and quality scores qual <- SFastqQuality(qual) # reapply quality score type trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads)) # Rebuild reads object with trimmed sequences and quality scores
Choose Size Range and Make Unique
maxSize <- 30 minSize <- 19 trimmed <- trimmed[width(sread(trimmed)) <= maxSize] # remove all reads longer than maxSize trimmed <- trimmed[minSize <= width(sread(trimmed))] # remove all reads shorter than minSize uniquef <- uniqueFilter() # create filter unique <- trimmed[uniquef(trimmed)] # run unique filter (eliminate duplicates) # keep working with the non-unique reads if analyzing expression level dats
Export Sequences
writeFastq(trimmed, "filename.fastq") # write reads to filename.fastq
Perform Alignment With Bowtie Using Unix Command Line
system("wget http://illumina.ucr.edu/misc/AT_Columbia_TAIR8.fasta") # Download your reference genome in fasta format with wget system("bowtie-build AT_Columbia_TAIR8.fasta genome.binary") # build binary reference genome system("bowtie -q ~/genome.binary filename.fastq output.bowtie") # perform bowtie alignment with default options
Import Alignment
alignedReads <- readAligned("./", pattern="output.bowtie", type="Bowtie") # read in alignment table(width(sread(alignedReads))) # view table of SmallRNA Size distribution
Find Coverage Peaks
coverage <- coverage(alignedReads, start=1) # calculate genomic coverage and returns an IRanges Rle object islands <- slice(coverage$Chr1, lower = 10) # look for regions on chromsome 1 of 10x coverage or higher islands # inspect regions
Visualize Alignment In Genome Browser
This requires some archaic object conversions, because exporting shortread data via rtracklayer is not yet stable or well documented.
library(rtracklayer) chr1ranges <- as(coverage$Chr1, "RangedData") # convert coverage for chromosome 1 into rtracklayer object irangescores <- IRanges(start=start(chr1ranges),end=end(chr1ranges)) # create iranges object from start/stop coordinates gd <- GenomicData(irangescores, chr1ranges$score, chrom="1") # put coverage data into exportable GenomicData object # and specify that this data is for chromosome 1 export(gd, con="chromsome1.wig") # create output track for viewing in genome browser
If you are working on a remote server, you will need to copy this file locally before you can upload it to a genome browser. If you have trouble with the copy, you can download a sample here: chromsome1.wig
Go to an genome browser, upload your .wig file, and view it. You can try the Arabidopsis UCSC browser (http://epigenomics.mcdb.ucla.edu/cgi-bin/hgGateway?org=A.+thaliana&db=araTha1), by uploading your .wig file via the "add your own custom tracks" button. After uploading this, browse to the region you would like to see. A high density region in the test dataset for the workshop can be found at chr1:30,088,863-30,089,042. Also, try inspecting regions of high coverage you found in the previous step.
ChIP-seq Analysis
See BioC ChIP-seq Case Study.
Read Density and SNP Analysis
- This exercise is done from the Unix command line rather than the R console,
and uses the MAQ alignment tool. First you must have a copy of your reference genome in .fasta format, and a copy of your reads in .fastq format (see Import Reads, Filter, and Export section). For further information, and advanced command line options see the documentation provided on the MAQ website.
Build Reference Genome
From the command line:
wget http://illumina.ucr.edu/misc/AT_Columbia_TAIR8.fasta # download reference genome maq fasta2bfa AT_Columbia_TAIR8.fasta genome.bfa # create binary of reference genome maq fastq2bfq ~/snpReads.fastq readBinary.bfq # create a binary of dataset (replace ~/snpReads.fastq with your query sequences)
Perform Alignment
maq match -n 2 out.map genome.bfa readBinary.bfq # perform alignment maq mapcheck genome.bfa out.map >mapcheck.txt # get QC statistics on alignment
Build Consensus Genome
maq assemble consensus.cns genome.bfa out.map 2>assemble.log # assemble consensus sequence maq cns2fq consensus.cns >cns.fq # extract consensus sequence from binary
Identify SNP Sites
maq cns2snp consensus.cns >cns.snp # extract SNP locations maq.pl SNPfilter cns.snp > filtered.snp # filter SNPs with default values
From R:
Import SNP Candidates Into R
- This section is performed from within R. By creating an IRanges object from SNP candidates in a given genomic region, these SNP sites can be used as genomic annotation data.
library(IRanges) snptable <- read.table("filtered.snp") # read in SNP table as a data frame snptable <- snptable[snptable[1] == "Chr2",] # filter out all SNP candidates except those on chromsome 2 snptable <- snptable[snptable[2] > 10000,] snptable <- snptable[snptable[2] < 20000,] # filter out all SNP candidates occuring outside the region 10000-20000 snpSites <- snptable[2][,1] # extract SNP coordinates as a numeric vector snpRanges <- IRanges(start=snpSites, end=snpSites) # create IRanges object for SNP candidate loci
Further Analysis
- Once imported to R, and filtered by target region it may be useful to further filter SNP candidates by their relationship to annotation features such as coding genomic regions, and potential for protein amino acid swaps. Check back, as this example may be added to the manual in the future.
Session Info
> sessionInfo() R version 2.9.0 (2009-04-17) x86_64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.2.0 lattice_0.17-22 BSgenome_1.12.0 Biostrings_2.12.1 [5] IRanges_1.2.0 loaded via a namespace (and not attached): [1] Biobase_2.4.0 grid_2.9.0 hwriter_1.1
Installing Packages
source("http://bioconductor.org/biocLite.R") biocLite() biocLite(c("ShortRead", "Biostrings", "IRanges", "BSgenome", "rtracklayer", "biomaRt"))