||<: 12%> ~-[[http://illumina.ucr.edu/ht/documentation/analysis|IIGB HT-Seq|class=white]] -~||<: 12%>~-[[http://faculty.ucr.edu/~tgirke/Workshops.htm|Workshops|class=white]] -~||<: 12%>~-[[http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html|R & BioC|class=white]] -~||<: 12%>~-[[http://htseq.ucr.edu|BioC-Seq|class=white]] -~||<: 12%>~-[[http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_Programming.html|R Programming|class=white]] -~||<: 12%>~-[[http://faculty.ucr.edu/~tgirke/Documents/EMBOSS/EMBOSS_MANUAL.html|EMBOSS|class=white]] -~||<: 12%>~-[[http://wiki.biocluster.ucr.edu/Linux-Essentials|Linux|class=white]] -~||<: 12%>~-[[http://wiki.biocluster.ucr.edu|Cluster|class=white]] -~|| = HT Sequence Analysis with R and Bioconductor = By [[http://faculty.ucr.edu/~tgirke/Group.htm|Tyler Backman, Rebecca Sun and Thomas Girke]], [[http://illumina.ucr.edu/ht/|IIGB, UC Riverside]] <> <> == Introduction == .[[http://biocluster.ucr.edu/~tgirke/HTML_Presentations/Manuals/HT-Seq/HT-Seq.pdf|Sequencing Technology Slide Show]] (see also [[http://www.google.com/url?sa=U&start=1&q=http://www.economia.unimi.it/projects/marray/2008/material/lectures/day5/HTS.pdf&ei=rJHOSd3oGpCgtgOm6ZWgAw&sig2=PxBI12-AajXzRCnPrTJIXg&usg=AFQjCNH9fMGfKNzSl1WJHHa-q721FqJIvQ|Simon Anders']] presentation) .High throughput sequencing (HT-Seq or HTS), also known as next generation sequencing (NGS), presents a wide spectrum of opportunities for genome research. Unfortunately, many existing bioinformatic tools do not scale well to large datasets consisting of tens of millions of sequences generated by technologies like Illumina/Solexa, Roche/454, ABI/SOLiD and Helicos. The Bioconductor project fills this gap by providing a rapidly growing suite of well designed R packages for analyzing traditional and HT-Seq datasets. These 'BioC-Seq' packages allow to analyze these sequences with impressive speed performance. Their accelerations are achieved by using memory efficient string containers and performing the time consuming computations with calls to external programs that are implemented in compiled languages (e.g. C++). Together these packages form a novel framework that allows researchers to develop efficient pipelines by performing complex data analyses in a high level data analysis and programming environment. {{{#!wiki bi-right [[#contents|Top]]}}} == Scope of this Manual == .This manual is intended for users who have a basic knowledge of the R environment, and would like to use R/Bioconductor to perform general or HT sequencing analysis. For new users, it is recommended to first review the material covered in the "R Basics" section (see below). To obtain a more comprehensive overview of R's sequence bioinformatics utilities, Robert Gentleman's book [[http://www.bioconductor.org/pub/RBioinf/|"R Programming for Bioinformatics"]] is an excellent choice. .The packages introduced here are a 'personal selection' of the authors of this manual that does not reflect the full utility spectrum of the Bioconductor project for this field of application. The introduced packages were chosen, because the authors use them often for their own teaching and research. To obtain a broad overview of available Bioconductor packages, it is strongly recommended to consult its official project site (http://bioconductor.org/). Due to the rapid development of many packages, it is also important to be aware that this manual will often not be fully up-to-date. Because of this and many other reasons, it is absolutely critical to use the original documentation of each package (PDF manual or vignette) as primary source of documentation. Users are welcome to send suggestions for improving this manual directly to the authors. {{{#!wiki bi-right [[#contents|Top]]}}} == Workshops == .Workshops on HT sequence data analysis using BioC-Seq resources are announced on the [[http://www.bioconductor.org/workshops|Bioconductor workshop]] site. The Institute for Integrative Genome Biology, [[http://faculty.ucr.edu/~tgirke/Workshops.htm|IIGB]] at UCR also offers workshops in this field. In addition, there are many more institutions that organize similar workshops. The best way to find out about upcoming events in a certain location is to run a Google search on this topic. {{{#!wiki bi-right [[#contents|Top]]}}} == R Basics == The [[http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html|R & BioConductor manual]] provides an introduction on the usage of the R environment and its basic command syntax, and the [[http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_Programming.html|Programming in R manual]] provides a more advanced overview of R's programming syntax. == String Handling Utilities in R's Base Distribution == The R base distribution contains various functions for handling and manipulating strings. These utilities are very useful for basic sequence analysis tasks. However, often it is much more effective to use for these tasks dedicated Bioconductor sequence analysis packages, such as Biostrings. This manual primarily introduces functions and data objects from the Bioconductor project. Nevertheless, the following commands provide a brief summary of some very useful string handling functions available in R's base distribution. .{{{ #!python #!wiki style border-top: 3px solid red ##################### ## String Matching ## ##################### myseq <- c("ATGCAGACATAGTG", "ATGAACATAGATCC", "GTACAGATCAC") # Creates a sample sequence data set. myseq[grep("ATG", myseq)] # String searching with regular expression support. pos1 <- regexpr("AT", myseq) # Searches 'myseq' for first match of pattern "AT". as.numeric(pos1); attributes(pos1)$match.length # Returns position information of matches. pos2 <- gregexpr("AT", myseq) # Searches 'myseq' for all matches of pattern "AT". as.numeric(pos2[[1]]); attributes(pos2[[1]])$match.length # Returns position information of matches in first sequence. gsub("^ATG", "atg", myseq) # String substitution with regular expression support. ######################## ## Positional Parsing ## ######################## nchar(myseq) # Computes length of strings. substring(myseq, c(1,4,7), c(2,6,10)) # Positional parsing of strings. ############################ ## Reverse and Complement ## ############################ myseq_comp <- chartr("ATGC", "TACG", myseq) # Returns complement for given DNA sequences. substring(myseq[1], 1:nchar(myseq[1]), 1:nchar(myseq[1])) # Vectorizes single sequence. x <- strsplit(myseq_comp, "") # Vectorizes many sequences. x <- lapply(x, rev) # Reverses vectors. myseq_revcomp <- sapply(x, paste, collapse="") # Collapses vectors to strings. Final result: reverse and complement of myseq. ######################################### ## Translate DNA Sequence into Protein ## ######################################### y <- c("ATGCATTGGACGTTAG") # Creates sample DNA sequence. AAdf <- read.table(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/AA.txt", header=T, sep="\t") # Imports genetic code. AAv <- AAdf[,2]; names(AAv) <- AAdf[,1] # Creates named vector with genetic code y <- gsub("(...)", "\\1_", y) # Inserts "_" after each triplet. y <- unlist(strsplit(y, "_")) # Splits on "_" and returns vectorized triplets. y <- y[grep("^...$", y)] # Removes incomplete triplets. AAv[y] # Translation into protein by name-based subsetting. ############################# ## Create Random Sequences ## ############################# sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse="")) # Creates 100 random DNA sequences with 20 residues. sapply(1:100, function(x) paste(sample(1:40, 20, replace=T), collapse=" ")) # Creates 100 random score sets with 20 elements. ################################################### ## Sequence Printing and Writing in FASTA Format ## ################################################### myseq <- sapply(1:12, function(x) paste(sample(c("A","T","G","C"), 40, replace=T), collapse="")) writeLines(myseq) # Prints the sequences to the screen, one per line. writeLines(strtrim(myseq, 10)) # The strtrim() function allows to print only the first section of the sequences. writeLines(strwrap(myseq, indent = 20)) # The strwrap() function provides wrapping, indenting and prefixing utilities. myname <- paste(">", month.name, sep="") # Creates sample names for the sequences. writeLines(as.vector(t(cbind(myname, myseq))), "myseq.fasta") # Writes the sequences in FASTA format to a file. }}} {{{#!wiki bi-right [[#contents|Top]]}}} == Bioconductor Packages == === Biostrings === [[http://bioconductor.org/packages/2.4/bioc/html/Biostrings.html| Documentation]] The Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed and memory effective string containers, string matching algorithms, and other utilities, for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages. {{{#!wiki bi-right [[#contents|Top]]}}} ==== Handling Single Sequences with XString ==== The XString class allows to represent the different types of biosequence strings in the subclasses: BString, RNAString, DNAString and AAString. .{{{ #!python #!wiki style border-top: 3px solid red library(Biostrings) b <- BString("I store any set of characters. The other XString objects store only the IUPAC characters for each biosequence type.") d <- DNAString("GCATAT-TAC") # Creates DNAString object. r <- RNAString("GCAUAU-UAC") # Creates RNAString object. r <- RNAString(d) # Converts d into RNAString object. p <- AAString("HCWYHH") }}} {{{#!wiki bi-right [[#contents|Top]]}}} ==== Comparing and Subsetting XString Objects ==== Several useful utilities are available to access, subset and manipulate XString objects. .{{{ #!python #!wiki style border-top: 3px solid red replaceLetterAt(d, c(1), "N") # Replaces residues at specified positions. length(d) # Returns the length of the sequence that is stored in d. toString(d) # Converts DNAString to character vector of length 1. d[2:6] # Prints d from postions 2 to 6. This makes positional parsing of sequences very easy! d[length(d):1] # Prints the reverse string of d. subseq(d, start=3, end=6) # A more efficient way for subsetting large sequences is the subseq() function from the IRanges package. d[1:4]==d[1:4] # Checks whether two sequences are identical. In this case it returns TRUE. d[1:4]==d[1:5] # Returns FALSE. d == r # I interprets DNA and RNA sequences the same way. As a result, the given example returns TRUE. subseq(d, start=7, width=1) <- DNAString("G") # Replaces base 7 by "G". Requires Biostrings 2.11.44 and IRanges 1.1.55 or higher. }}} {{{#!wiki bi-right [[#contents|Top]]}}} ==== Subfeature Retrieval with StringViews ==== With the XStringViews class provides utilities to manage and view subfeatures from sequences stored in XString objects. .{{{ #!python #!wiki style border-top: 3px solid red dlong <- DNAString("GCATATTACAATCGATCCGCATATTAC") dsub <- Views(dlong, start = c(1,6,11, 20), end = c(5,8,18, 27)) # Generates XStringViews object with four views (subfeatures). dsub[1:2] # Returns first two views. dsub[[1]] # Returns first view as XString object. paste(dsub, collapse="") # Collapses views into a single sequence. attributes(dsub) # Returns attribute list of XStringViews object. start(dsub); end(dsub); width(dsub); subject(dsub) # Functions to returns the different XStringView components as vectors. sapply(as.list(dsub), toString) # Returns sequences in XStringView object as vector. dsub[dsub == DNAString("ATCGATCC")] # Subsetting by a sequence string. }}} {{{#!wiki bi-right [[#contents|Top]]}}} ==== Masking Sequences ==== Biostrings allows to mask XString objects in order to exclude undesired regions from consideration by various analysis tools, e.g. repetitive sequences in the mapping of short reads. The advantage of this masking approach is that it does not result in any information loss, because masks can be turned on and off any time. In addition, the process is very memory efficient. .{{{ #!python #!wiki style border-top: 3px solid red d <- DNAString("CGCTATCTGACTGCCGCGCC") # Creates sample sequence. maskMotif(d, "GC") # Function for masking a sequence by a pattern. m1 <- Mask(mask.width=nchar(d), start=c(1,17), end=c(3,20)) # Defines mask m1. m2 <- Mask(mask.width=nchar(d), start=c(4,12), end=c(7,14)) # Defines mask m2 m12 <- append(m1, m2) # Joins two masks in one container. names(m12) <- c("A", "B") # Assigns names to mask components. active(m12)["B"] <- FALSE # Turns one mask off. masks(d) <- m12 # Applies masks to DNAString object. No information is lost by this action! d; nchar(d); length(d) # Returns some information about masked object. active(m12)["B"] <- TRUE # Turns mask B back on. masks(d) <- m12 # Reapplies modified mask object to d. reduce(d) # Reduces a set of masks to a single mask. gaps(d) # Reverses all the masks. reverseComplement(d) # Returns the reverse and complement for a masked sequence object. dfrag <- as(d, "XStringViews") # Turns MaskedXString object into an XStringViews object sapply(as.list(dfrag), toString) # Retrieves all unmasked sections in d as strings in vector. injectHardMask(d, "N") # Injects/fills the masked regions of a sequence with a user-specified letter. masks(d) <- NULL # Removes all masks from DNAString object. Alternative command: unmasked(). }}} {{{#!wiki bi-right [[#contents|Top]]}}} ==== Handling Many Sequences with XStringSet ==== The XStringSet class allows to represent many sequences in a single container. The subclasses for the different types of biosequences are: BStringSet, RNAStringSet, DNAStringSet and AAStringSet. .{{{ #!python #!wiki style border-top: 3px solid red dset <- DNAStringSet(c("GCATATTAC", "AATCGATCC", "GCATATTAC")) # The corresponding XStringSet containers allow to store many biosequences in one object. names(dset) <- c("seq1", "seq2", "seq3") # Assigns names to the elements in a XStringSet object. rset <- RNAStringSet(c("GCAUAUUAC", "UUUAAAA", "GCAUAUUAC")) rset <- RNAStringSet(dset) # Converts dset into RNAStringSet object. pset <- AAStringSet(c("HCWYHH", "SPPRHA")) }}} {{{#!wiki bi-right [[#contents|Top]]}}} ==== Comparing and Subsetting of XStringSet Objects ==== Several useful utilities are available to access, subset and manipulate XStringSet objects. .{{{ #!python #!wiki style border-top: 3px solid red dset[1:2] # Returns the first two sequence entries from dset. append(dset, dset) # Appends/concatenates two XStringSet objects. DNAStringSet(dset, start=c(1,2,3), end=c(4,8,5)) # Sequence subsetting by positions provided under start and end. d==dset[[1]] # The [[ subsetting operator returns a single entry as XString object. lapply(as.list(dset), toString) # Converts XStringSet to list. If the name field is populated then the list elements will be named accordingly. names(dset); width(dset) # Returns the values for remaining components as vectors. dsetv <- sapply(as.list(dset), toString); dsetv # Returns the sequences in XStringSet as named vector. table(dsetv) # Counts the number of occurrences of each sequence. dset[!duplicated(dsetv)] # Removes the duplicated sequence in dset. }}} {{{#!wiki bi-right [[#contents|Top]]}}} ==== Handling of Sequence and Quality Data with QualityScaleXStringSet ==== The QualityScaleXStringSet class allows to represent many sequences plus their quality data in a single container. The subclasses for the different types of biosequences are: QualityScaledBStringSet, QualityScaledDNAStringSet, QualityScaledRNAStringSet and QualityScaledAAStringSet. .{{{ #!python #!wiki style border-top: 3px solid red ########################## ## About Quality Scores ## ########################## phred <- 1:9 phreda <- paste(sapply(as.raw((phred)+33), rawToChar), collapse=""); phreda as.integer(charToRaw(phreda))-33 # Phred quality scores are integers from 0-50 that are stored as ASCII characters after # adding 33. The basic R functions rawToChar and charToRaw can be used to interconvert # among their representations. Solexa scores are similar, but use an offset of 64. ########################################## ## Quality Score Handling in Biostrings ## ########################################## dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) # Create random sample sequence. myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create random Phred score list. myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) # Converts integer scores into ASCII characters. myqual <- PhredQuality(myqual) # Converts to a PhredQuality object. lapply(seq(along=myqual), function(x) as.integer(myqual[x])) # Converts Phred scores from ASCII decimal back to integer representation. library(ShortRead); setSR <- ShortReadQ(sread=dset, quality=FastqQuality(BStringSet(myqual)), id=BStringSet(myqual)) myMA <- as(quality(setSR), "matrix") # Fast conversion of all quality ASCII values into a numeric matrix by using the ShortReadQ # function from the ShortRead library. boxplot(as.data.frame((myMA))) # Creates box plot of quality values. dsetq1 <- QualityScaledDNAStringSet(dset, myqual) # Combines DNAStringSet and quality data in QualityScaledDNAStringSet object. dsetq1[rowSums(myMA < 20) <= 4] # Filters for sequences that contain not more than 4 bases with a quality score below 20. dset <- DNAStringSet(dsetq1) # Extract sequence component from QualityScaledDNAStringSet object. qset <- quality(dsetq1) # Extract quality component from QualityScaledDNAStringSet object. DNAStringSet(dset[1:2], start=c(1,5), end=c(4,9)) # Positional subsetting for sequence data. BStringSet(qset[1:2], start=c(1,5), end=c(4,9)) # Positional subsetting for corresponding quality data. ########################################################### ## Example for Injecting Ns at Positions with Low Scores ## ########################################################### XStringSetInject <- function(myseq=dset[[1]], myqual=qset[1], cutoff=20, mychar="N") { # Defines character inject function XStringSetInject. mypos <- which(as.integer(myqual)= 1) # Retrieves index of sequences with at least one match. Views(mychr, start=unlist(start_index[with_match]), end=unlist(end_index[with_match])) # Returns matches as XStringViews object. mydict[with_match] # Returns corresponding dictionary entries. ## (3) Coverage Plots mycov <- as.integer(coverage(mysearch, 1, length(mychr))) max(mycov); min(mycov) sum(mycov != 0) / length(mycov) plotCoverage <- function(coverage, start, end, mycol="red") { # Defines coverage plotting function. plot.new() plot.window(c(start, end), c(0, 10)) axis(1) axis(2) axis(4) lines(start:end, coverage[start:end], type="l", col=mycol) } plotCoverage(mycov, start=1, end=20000, mycol="red") # Plots coverage result. ## (4) Plot Fragments onto Chromosome source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt") # Imports plotting function. mapDF <- data.frame(as.data.frame(unlist(mysearch))[,1:2], strand=1) # Creates coordinate data frame. colnames(mapDF) <- c("startx", "endx", "strand") genelength <- length(mychr) # Defines chromosome length. mapDF[,2] <- mapDF$endx + 200 # To see the matches better in the plot, their length is here extended by an arbitrary value (200). mapDF[mapDF[,2]>genelength,2] <- genelength Mysub <- paste("Chromosome length", genelength, "bp") # Specifies subtitle of plot. Mymain <- "Chromosome Plot for Sense Alignments" # Specifies title of plot. plotDF <- transformDF(plotDF=mapDF, genelength, method="arrange") # Transforms coordinates for plotting. featurePlot(plotDF=plotDF, geneline=6, genecol="green", featureline=2, featurecol=sample(c(2,4,8), length(plotDF[,1]), replace=T), showxy=F) # Plots the feature map for the match coordinates that are stored in 'mysearch'. # The feature color argument allows to color each feature in a different color. ## (5) Allowing Mismatches with matchPDict mychr <- DNAString("CTAGGCTGTATCCTAAACACCGACAAGGTCTAGCCTGTATGCTAGCTAGTGTCCGCTCGGAAGCCGCT") mydict <- DNAStringSet(c("GTAGGCTGTATCCTA", "CGCTCGGAAGCCGCT")) names(mydict) <- paste("d", seq(along=mydict), sep="") mypdict <- PDict(mydict, tb.end=1) # Uses a trusted band of nucleotides that need to match exactly (here only first nucleotide). tb(mypdict); tail(mypdict); width(mypdict) # Methods for accessing the different trusted band components; '?PDict' provides more detailed information. mysearch <- matchPDict(mypdict, mychr, max.mismatch=1) # Allows one mismatch outside of trusted band. list(mysearch) # Returns matches as IRanges object. table(cut(countIndex(mysearch), c(0, 1:10), right=FALSE)) # Summary of matches. }}} {{{#!wiki bi-right [[#contents|Top]]}}} ==== Computing Pairwise Sequence Alignments ==== The pairwise sequence alignment function from Biostrings provides a flexible environment for computing optimal pairwise alignments using different dynamic programming algorithms. .{{{ #!python #!wiki style border-top: 3px solid red s1 <- DNAString("GGGCCC"); s2 <- DNAString("GGGTTCCC") # Creates sample sequences. myalign <- pairwiseAlignment(s1, s2, type="local") # Computes pairwise local alignment with Smith-Waterman dynamic programming algorithm. myalign <- pairwiseAlignment(s1, s2, type="global") # Computes global alignment with Needleman-Wunsch algorithm. pairwiseAlignment(c("GGGCCC", "GAGGCC", "CGAGA"), s2, type="local", scoreOnly=T) # The function can also be used for sequence searching by providing the 'sequence database' as vector in the first argument. attributes(myalign)[-5] # Returns the components of an alignment object. subject(myalign); pattern(myalign) # Returns aligned sequences with gaps. Many more helper functions for alignments can be found in Biostrings' Alignment Manual. sq1 <- SolexaQuality(as.integer(c(22,22,22,12,12,12))); sq2 <- SolexaQuality(as.integer(c(22,22,22,0,0,0,0,0))) pairwiseAlignment(s1, s2, patternQuality = sq1, subjectQuality=sq2) # Uses a quality-based method for generating the substitution matrix. }}} {{{#!wiki bi-right [[#contents|Top]]}}} ==== Multiple Sequence Alignments (MSAs) ==== Multiple sequence alignments (MSAs) can be imported as FASTA formatted alignments into R using the read.XStringSet function. Almost all MSA programs support this format (e.g. ClustalW, T-Coffee, MUSCLE, Multalin). This import method will store the MSAs as XStringSet objects. The following examples introduce a variety of useful analysis routines on MSAs using this object class. .{{{ #!python #!wiki style border-top: 3px solid red ########################### ## Import of MSAs into R ## ########################### library(Biostrings) p450 <- read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul", "fasta") # Imports MSA of P450 sequences from a FASTA formatted file and stores the result as AAStringSet object. as.matrix(p450) # Converts MSA into a character matrix. consensusMatrix(p450) # Computes the residue frequencies for each MSA column and returns the results as matrix. #################################### ## Viewing of MSAs in HTML Format ## #################################### ## Viewing large alignments in a web browser can be very efficient. The following ## example outputs an HTML formatted alignment with consensus line. StringSet2html <- function(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=TRUE, ...) { msa <- AAStringSet(msa, start=start, end=end) msavec <- sapply(msa, toString) offset <- (counter-1)-nchar(nchar(msavec[1])) legend <- paste(paste(paste(paste(rep(" ", offset), collapse=""), format(seq(0, nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="") consensus <- consensusString(msavec, ambiguityMap=".", ...) msavec <- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ") msavec <- paste(format(c("", names(msa), "Consensus"), justify="left"), c(legend, msavec, consensus), sep=" ") msavec <- c("
", msavec, "
") writeLines(msavec, file) if(browser==TRUE) { browseURL(file) } } StringSet2html(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=T, threshold=1.0) # Writes MSA to file "p450.html", which can be viewed in a browser. StringSet2html(msa=p450, file="p450.html", start=450, end=470, counter=20, browser=T, threshold=1.0) # Same as above, but for alignment positions 450-470. ########################################################## ## Computation of Similarity/Distance Matrices for MSAs ## ########################################################## masim <- sapply(names(p450), function(x) sapply(names(p450), function(y) pid(PairwiseAlignedXStringSet(p450[[x]], p450[[y]]), type="PID2"))) masim # Returns a percent identity matrix. The identity method can be specified under the "type" argument, see help for pid() function. madist <- 1-(masim/100) madist # Returns corresponding distance matrix. disttree <- hclust(as.dist(madist)) plot(as.dendrogram(disttree), edgePar=list(col=3, lwd=4), horiz=T) # Computes and plots distance tree for MSA. #################################################### ## Mapping Sequence Postions to Gapped Alignments ## #################################################### ## Remove gaps from sequences, e.g. for pattern matching removeGaps <- function(align=p450[[1]], gap="-") { myseq <- maskMotif(align, gap) myseq <- paste(as(myseq, "XStringViews"), collapse="") return(myseq) } p450seq <- AAStringSet(sapply(p450, function(x) removeGaps(align=x))) mypos1 <- matchPattern("FSAGKRICAG", p450seq[[1]]); mypos1 # Removes gaps from sequences and provides a pattern matching example for P450 heme binding site. p450gapmask <- maskMotif(p450[[1]], "-") ## Mapping sequence postions to gapped alignments seqAlignmap <- function(seq=p450seq[1], align=p450[1], gap="-") { seqindex <- 1:length(seq[[1]]) gapindex <- which(as.matrix(align)==gap) alignindex <- (1:length(align[[1]]))[-gapindex] names(alignindex) <- seqindex return(alignindex) } alignindex <- seqAlignmap(seq=p450seq[1], align=p450[1], gap="-") # Returns a named vector with the sequence coordinates in the name # field and the corresponding alignment coordinates in the data field. alignindex <- alignindex[start(mypos1)]:alignindex[end(mypos1)]; alignindex # Returns only the alignment positions for the above pattern match result. AAStringSet(p450, start=min(alignindex), end=max(alignindex)) # Returns the alignment segment for the above pattern match result. ############################################################# ## Simple Example for Sliding Window Conservation Analysis ## ############################################################# ## (A) Compute conservation vector. This example uses an oversimplified conservation model solely for demo purposes! consCol <- function(myalign=p450) { myalign <- as.matrix(myalign) consV <- sapply(seq(along=myalign[1,]), function(x) max(table(myalign[,x][!myalign[,x] %in% "-"]))/length(myalign[,1])) return(consV) } consV <- consCol(myalign=p450) ## (B) Perform sliding window analysis on conservation vector. slidingWindow <- function(mydata=consV, win=c(1, 100), start=1, end=length(consV)) { mydata <- mydata[start:end] windex <- t(sapply(0:length(mydata), function(x) win+x)) mywind <- sapply(seq(along=mydata), function(x) mean(mydata[windex[x,1]:windex[x,2]], na.rm=TRUE)) mywind <- mywind[1:(length(mywind)-win[2])] return(mywind) } slideV <- slidingWindow(mydata=consV, win=c(1, 20)) # Returns conservation vector for window size 20. ## (C) Plot sliding window conservation data plot(slideV, type="l", lwd=2, col="red", main="Sliding Window Conservation Analysis", sub="Window Size: 20", xlab="Alignment Position", ylab="Conservation: max=1.0") heme <- AAStringSet(p450, start=454, end=463); heme text(454, 0.71, toString(heme[[1]]), cex=0.8) # The highly conserved cytochrome P450 heme binding signature maps to positions 454-463, # which is located in the last big peak of the plot. }}} {{{#!wiki bi-right [[#contents|Top]]}}} ==== Analysis Routines with Biostrings ==== The following examples introduce a variety of useful analysis routines that can be accomplished with functions provided by the Biostrings package. ===== Mapping of Affy Probes to Transcriptome (similar to RNA-Seq) ===== Example for aligning Affy probe sets back to their target mRNAs. The code retrieves the matching counts and mapping coordinates for all probes and probe sets. In addition, it identifies probe sets matching several genes and/or genes matching several probe sets. The given analysis routine can be easily adjusted to the needs of RNA-Seq data sets for obtaining read counts per mRNA. .{{{ #!python #!wiki style border-top: 3px solid red ########################################## ## Mapping Affy Probes to Transcriptome ## ########################################## ## (1) Import mRNA/cDNA sequences (here Arabidopsis) library(Biostrings) mygenes <- read.DNAStringSet("ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR8_blastsets/TAIR8_cdna_20080412", "fasta") names(mygenes) <- gsub(" \\|.*", "", names(mygenes)) # Extracts locus IDs from description line for sequence naming. ## (2) Create pattern dictionary from probe sets library(ath1121501probe) mydict <- DNAStringSet(ath1121501probe$sequence) # Obtains the sequences of the probes from ath1121501 chip. mynames <- ath1121501probe$Probe.Set.Name # Obtains the corresponding probe set names. mynames <- table(mynames)[mynames] # Next 4 lines create unique names for probes by appending a counter. mycount <- unlist(sapply(mynames[!duplicated(names(mynames))], function(x) seq(1, x))) mynames <- paste(names(mynames), mycount, sep="_") names(mydict) <- mynames mydict <- PDict(mydict) # Creates a PDict dictionary. ## (3) Count matches for each mRNA sequence sapply(names(mygenes)[1:40], function(x) sum(countPDict(mydict, mygenes[[x]]))) # Returns matches for first 40 mRNAs. ## (4) Search with pattern dictionary each mRNA sequence (step 5 provides a much faster solution!) hitCollect <- function(mydict=mydict, myref=mygenes[[1]]) { # Defines search function for many query sequences (here mRNAs) mymatch <- matchPDict(mydict, myref) hits <- sum(countIndex(mymatch)>=1) if(hits==0) { return(NULL) } else { return(unlist(mymatch)) } } myhitList <- lapply(names(mygenes)[1:10], function(x) hitCollect(mydict=mydict, mygenes[[x]])) names(myhitList) <- names(mygenes)[1:10]; myhitList # The last two lines perform the search for the first 10 mRNA sequences and organize the results # in a list object. Searching all 39,000 mRNAs will take about 14 hours. countList <- sapply(names(myhitList), function(x) table(gsub("_at_.*", "_at", names(myhitList[[x]])))); countList # Counts for each mRNA the number of matching probes that belong to the same probe set. data.frame(locusID=rep(names(countList), sapply(countList, length)), affyID=unlist(sapply(countList, names)), Nprobes=unlist(sapply(countList, as.vector))) # Converts countList into a data frame. With this data structure one can easily determine # with tapply() all probe sets matching several genes and/or genes matching several probe sets. ## (5) Same as step (4) but with much faster computation by using the concatenation of all mRNAs as reference myconc <- BString(toString(mygenes)); myconc <- maskMotif(myconc, ", "); myconc <- DNAString(injectHardMask(myconc, "+")) # Concatenates sequences to a single string where each component is separated by '++'. myhitList <- matchPDict(mydict, myconc) # Performs search against the single reference containing all mRNAs. # This way the mapping of all probe sets will take only a few seconds. # Mappings across mRNA boundaries are prevented by inserted "+" marks. mymask <- maskMotif(myconc, "+") # Creates a mask for mRNA boundaries labeled with "+". mRNAloc <- data.frame(start=start(as(mymask, "XStringViews")), end=end(as(mymask, "XStringViews")), IDs=names(mygenes)) # Returns mRNA start and end positions in reference. myhitDF <- as.data.frame(unlist(myhitList)); myhitDF <- myhitDF[order(myhitDF$start), ] mRNAends <- mRNAloc$end; myhitends <- myhitDF$end mylength <- sapply(mRNAends, function(x) { mylength <- sum(myhitends <= x); myhitends <<- myhitends[!myhitends <= x]; return(mylength) }) tmphitDF <- cbind(myhitDF, mRNAloc[rep(seq(along=mRNAloc[,1]), mylength),]) finalhitDF <- cbind(start=tmphitDF[,1]-tmphitDF[,5]+1, end=tmphitDF[,2]-tmphitDF[,5]+1, tmphitDF[,c(4,7)]); finalhitDF[1:28,] # The previous five lines map the hits back to the corresponding mRNAs. countList <- tapply(gsub("_at_.*", "_at", finalhitDF$names), as.vector(finalhitDF$IDs), table); countList[1:20] # Counts for each mRNA the number of matching probes that belong to the same probe set. data.frame(locusID=rep(names(countList), sapply(countList, length)), affyID=unlist(sapply(countList, names)), Nprobes=unlist(sapply(countList, as.vector)))[1:40,] # Converts countList into a data frame. With this data structure one can easily determine # with tapply() all probe sets matching several genes and/or genes matching several probe sets. }}} {{{#!wiki bi-right [[#contents|Top]]}}} ===== Simulate Number of Unique Alignments ===== Example for estimating the percentage of short read sequences that align unambiguously to a reference genome when allowing a specific number of mismatches. Genomes with a high content of repetitive/duplicated sequences will have lower numbers of uniquely aligning short reads than genomes with a low degree of repetitive sequences. The provided functions allow the user to specify the length of the short reads and the number of allowable mismatches. The final result for increasing numbers of aligned reads is represented in a bar plot. .{{{ #!python #!wiki style border-top: 3px solid red #################################################################################### ## Estimate the Percentage of Short Reads that Have Unique Alignments in a Genome ## #################################################################################### ## Comment: The countPattern function is used here to allow alignments with mismatches. If ## no mismatches are required, the much faster matchPDict function should be used instead. ## (1) Store your reference genome in a DNAString object. A random sequence is used ## here for test purposes. library(Biostrings) ref <- DNAString(paste(sample(c("A","T","G","C"), 100000, replace=T), collapse="")) ## (2) Define alignment function where ## ref: reference genome ## Nreads: number of pseudoreads to extract from reference genome ## width: length of pseudoreads ## max.mismatch: max number of mismatche to allow alignCount <- function(ref=ref, Nreads=1000, width=15, max.mismatch=2, ...) { randstart <- sample(1:(length(ref)-width), Nreads) randreads <- Views(ref, randstart, width=width) match_count <- sapply(seq(along=randreads), function(x) countPattern(randreads[[x]], ref, max.mismatch=2)) return(table(match_count == 1)/length(match_count)*100) } ## (3) Choose the number of pseudoreads that should be tested. To obtain accurate estimates, ## one should choose here large enough numbers (e.g. 10,000 or higher for the largest value). testsize <- c(500, 1000, 2000, 5000) ## (4) Run the alignCount function and plot the results (parameter settings here for 15bp reads and 2 mismatches). alignMA <- sapply(testsize, function(x) alignCount(ref=ref, Nreads=x, width=15, max.mismatch=2)) rownames(alignMA) <- c(">1 Alignment", "1 Alignment"); colnames(alignMA) <- testsize barplot(alignMA[c(2,1),], col=c("blue", "red"), legend.text=rev(rownames(alignMA)), ylim=c(0,120), main="Alignments with 0-2 Mismatches", xlab="N Simulated Reads", ylab="%") }}} {{{#!wiki bi-right [[#contents|Top]]}}} ==== Exercises ==== .[[http://bioinfo.ucr.edu/~rsun/Biostrings.txt| Documentation]] {{{#!wiki bi-right [[#contents|Top]]}}} === ShortRead === .[[http://bioconductor.org/packages/2.4/bioc/html/ShortRead.html| Documentation]] .The ShortRead package provides input, quality control, filtering, parsing, and manipulation functionality for short read sequences produced by high throughput sequencing technologies. While support is provided for many sequencing technologies, this package is primairly focused on Solexa/Illumina reads. {{{#!wiki bi-right [[#contents|Top]]}}} ==== Function Overview ==== ||'''Function'''||'''Description'''|| ||SolexaPath("path") || Parses a Solexa output folder, and identifies important files|| ||analysisPath(SolexaPathObject) || Returns a list of GERALD output folders|| ||compose(filter1, filter2, filter3, ...) || Returns a sequence filter based on a combination of various filter objects|| ||readAligned(SolexaPathObject, "filename", filter= myfilter) || Reads in an Eland export file which contains sequences, alignment coordinates, and alignment-normalized quality scores based on a given filter object|| ||sread(AlignedReadObject) || Returns a list of sequences contained in an AlignedReadObject (as returned by readAligned)|| ||coverage(AlignedReadObject, start=startnt, stop=stopnt) || Calculates coverage of an alignment over each base in the alignment genome, and compresses the data to save memory || {{{#!wiki bi-right [[#contents|Top]]}}} ==== Illumina Pipeline ==== The Illumina Pipeline is a set of software tools for the unix platform provided to users of the Illumina Genome Analyzer. The pipeline reads the images produced by the sequencer, analyzes them, performs base calling, and (optionally) aligns sequences to a reference genome. The user is provided with a set of sequences corresponding to each read, along with quality scores representing the accuracy of each base call. {{{#!wiki bi-right [[#contents|Top]]}}} ==== Quality Scores ==== .A numeric Phred score represents the error probability of a given base call. When a nucleotide sequence is produced by sequencing, random error results in the possibility that any given base call may be incorrect. Thus, a quality score is provided for each base. .The phred score can be calculated from the error probability of a given base call: .phred score=-10*log(error probability)/log(10) .illumina score=-10*log(error probability / (1 - error probability))/log(10) ||'''Error Probability'''||'''Phred Score'''||'''Illumina Score'''|| ||1||0||negative infinity|| ||0.1||10||~9.542|| ||0.01||20||~19.956|| ||0.001||30||~29.996|| ||0.0001||40||~40.000|| .Illumina scores are asymptotically identical to Phred scores. Therefore, values of around 15 or greater are practically identical for both scoring systems. When quality scores are used to represent a long sequence, they are often represented using the ASCII alphabet, adding the number 33 to Phred scores, and 64 to Illumina scores. For example, a Phred score of 40 can be represented as the ASCII char "I" (40+33=Ascii #73), and an Illumina score of 40 as "h" (40+64=Ascii #104). .Here is a link to a reference on the ASCII alphabet: [[http://www.asciitable.com/]] .Update: Complicating things further, the most recent Illumina pipeline version 1.3.2 now uses Phred scores, but still represents the values using the old Illumina ASCII character offset. {{{#!wiki bi-right [[#contents|Top]]}}} ==== ShortRead Objects ==== ===== SolexaPath ===== .This object stores output file locations from the Illumina Pipeline. The SolexaPath function which creates this object automatically finds the subfolders for each output filetype. {{{ #!python #!wiki style border-top: 3px solid red library(ShortRead) sp <- SolexaPath("/home/zzz11/sample_flowcell") # Creates a SolexaPath for a pipeline output folder # general format: sp <- SolexaPath("/path/to/folder") baseCallPath(sp) # returns base call path subfolder analysisPath(sp) # returns GERALD/Eland alignment subfolder imageAnalysisPath(sp) # Returns firecrest image analysis subfolder }}} {{{#!wiki bi-right [[#contents|Top]]}}} ===== ShortReadQ ===== .This object stores sequences, quality scores, and ids for a series of reads. .{{{ #!python #!wiki style border-top: 3px solid red reads <- readFastq("/home/zzz11/test", pattern="filename.fastq") # Reads in a FASTQ file id(reads) # outputs read ids as a list as BStringSet sread(reads) # outputs read sequences as a list as DNAStringSet quality(reads) # outputs list of quality scores as BStringSet writeFastq(reads, "output.fastq") # writes ShortReadQ object to output file }}} {{{#!wiki bi-right [[#contents|Top]]}}} ===== AlignedRead ===== .This objects stores the genome alignment coordinates (chromsome, position, strand) for each of a series of reads. .{{{ #!python #!wiki style border-top: 3px solid red # Import AlignedRead Object alignedReads <- readAligned("/home/zzz11/test", pattern="output.soap", type="SOAP") # Reads in a external alignment (change type depending on program) # general format: readAligned("/path/to/folder", pattern="filename", type="AlignmentTool") alignedReads[1] # choose a read and inspect it's alignment coordinates }}} {{{#!wiki bi-right [[#contents|Top]]}}} ==== ShortRead Sequence Filters ==== .ShortRead provides functions that return sequence filter objects, which allow you to filter your input sequences by various criteria. It is also possible to create new filter objects. ||'''Filter Function'''||'''Description'''|| ||srFilter || Creates a custom filter (see Vignettes) || ||chromosomeFilter("chromosome name regex") || Selects reads which align to specific chromosome(s) matching the given regular expression|| ||strandFilter("+/-") || Selects reads which align to the + or - strand on the genome|| ||nFilter(# of Ns) || Selects reads which contain fewer than a given number of N characters|| ||polynFilter(# of given nt) || Selects only reads which contain less than a given number of the same nucleotide|| ||srdistanceFilter(DNAStringSet, DNAString) || Performs Pairwise alignment of each DNAStringSet string against DNAString, and calculates edit distance || ||uniqueFilter()|| Removes duplicate occurances of a given nucleotide sequence from a set of reads || .Depending on the nature of the filter, they may work with different object types. Consult the Reference Manual Vignette for instructions on building filters, and additional built-in filter functions. {{{ #!python #!wiki style border-top: 3px solid red # Create several filters: filter1 <- nFilter(threshold=1) # keep only reads without Ns filter2 <- polynFilter(threshold=20, nuc=c("A")) # remove reads with 20 or more As filter3 <- strandFilter("+") # choose reads which align to positive strand filter <- compose(filter1, filter2, filter3) # Combine filters into one alignedReads <- alignedReads[filter(alignedReads)] # filter alignedReads object with all 3 filters at once }}} {{{#!wiki bi-right [[#contents|Top]]}}} === IRanges === .[[http://bioconductor.org/packages/2.4/bioc/html/IRanges.html| Documentation]] .IRanges provides infrastructure and containers for handling sets of integer ranges. Integer ranges are commonly used to represent coordinates of read alignment loci, or various annotation feature loci within a biological sequence. ==== IRanges Use ==== .{{{ #!python #!wiki style border-top: 3px solid red library(IRanges) # load IRanges library ranges <- IRanges(start=c(1,2,3), end=c(2,3,4)) # Constructs an IRanges object with 3 ranges: 1-2, 2-3, and 3-4. # These ranges are often used to represent genome alignment coordinates, or adapter coordinates # within a read. rangeList <- IRangesList(ranges1, ranges2, ...) # Constructs a list of Iranges objects, and can be used to associate a set of ranges # to filter,trim, or align each string in a BStringSet object coverage <- Rle(c(1,1,2,3)) # Constructs an Rle (run length encoded) object. This is an efficient compressed # way to store a long integer string with a lot of repeated elements, and is # often used to represent genomic coverage of short read alignments for each # residue in the genome. }}} {{{#!wiki bi-right [[#contents|Top]]}}} === BSgenome === .[[http://bioconductor.org/packages/2.4/bioc/html/BSgenome.html| Documentation]] .BSgenome provides an object oriented infrastructure for interacting with a Biostring based genome sequence. BSgenome packages exist for many common genomes, and can be created to represent custom genomes. See the "How to forge a BSgenome data package" Vignette for instructions to create a new BSgenome package if a prebuilt package does not exist for your organism. ==== BSgenome Use ==== .{{{ #!python #!wiki style border-top: 3px solid red library(BSgenome) available.genomes() # Get list of all available pre-built genomes to choose from # source("http://bioconductor.org/biocLite.R") # biocLite("BSgenome.Hsapiens.UCSC.hg18") # Run these commands to install a genome you haven't used before, or to update the version library(BSgenome.Hsapiens.UCSC.hg18) # Download, install, and load the Homo sapiens genome Hsapiens # Inspect overview of data provided with this package masknames(Hsapiens) # Returns a list of built in masks defined for sequences in the genome # AGAPS = assembly gaps # AMB = intra-contig ambiguities # RM = RepeatMasker # TRF = Repeat Regions found via Tandem Repeats Finder chr1 <- Hsapiens$chr1 # load chromosome 1 into memory with default masks length(chr1) # returns length of chromosome 1 }}} {{{#!wiki bi-right [[#contents|Top]]}}} === rtracklayer === .[[http://bioconductor.org/packages/2.4/bioc/html/rtracklayer.html| Documentation]] .rtracklayer provides an interface for exporting annotation feature data to various genome browsers and file formats (such as GFF). See the Small RNA Profiling exercise for an example of using rtracklayer to visualize alignment coverage. {{{#!wiki bi-right [[#contents|Top]]}}} === biomaRt === .[[http://bioconductor.org/packages/2.4/bioc/html/biomaRt.html| Documentation]] .biomaRt provides an R interface to publicly accessible biological databases including Ensembl, Uniprot, Gramene, Wormbase and HapMap. This data is retrieved automatically via the internet, and is not versioned. If a pre-existing Bioconductor package exists with the data you need, it is recommended to use that package instead. ==== biomaRt Use ==== .{{{ #!python #!wiki style border-top: 3px solid red library("biomaRt") listMarts() # get list of all available data sources ensembl <- useMart("ensembl") # load the snp source listDatasets(ensembl) # list available datasets aaegypti_gene_ensembl <- useDataset("aaegypti_gene_ensembl", mart= ensembl) listAttributes(aaegypti_gene_ensembl) # show attributes in this dataset listFilters(aaegypti_gene_ensembl) # show available filters aaegypti_gene_ensembl <- useMart("ensembl","aaegypti_gene_ensembl") # select specific dataset getBM(attributes=c("chromosome_name"), mart= aaegypti_gene_ensembl) # retrieve data regarding specific attributes in table/dataframe # format # optional filters= option allows for specific filters to be applied }}} {{{#!wiki bi-right [[#contents|Top]]}}} === HT-Seq Data Visualization === .[[http://bioconductor.org/packages/2.4/bioc/html/HilbertVis.html|HilbertVis]]: Hilbert genome plots .[[http://bioconductor.org/packages/2.4/bioc/html/GenomeGraphs.html|GenomeGraphs]]: Plotting genomic information from Ensembl .[[http://www.science.oregonstate.edu/~dolanp/tileqc/index.html|TileQC]]: Flow Cell Quality Visualization .[[http://bioconductor.org/packages/2.4/bioc/html/rtracklayer.html|rtracklayer]]: R interface to genome browsers {{{#!wiki bi-right [[#contents|Top]]}}} == Command Line Alignment Utilities == === SOAP === .SOAP stands for "Short Oligonucleotide Analysis Package." This is a particularly powerful and fast alignment tool which supports mismatches and indels. It also contains a new tool for SNP calling. .[[http://soap.genomics.org.cn/|SOAP Web Page]] ==== Running SOAP ==== .SOAP is run from the unix command line, and requires a reference genome in .fasta format, and a list of query sequences in .fasta or .fastq format. This example was tested with SOAP version 2.17. .{{{ #!python #!wiki style border-top: 3px solid red soap # inspect command line options 2bwt-builder genome.fasta # create binary of reference genome soap -a query.fastq -D genome.fasta.index -o output.soap # align query to genome and store output }}} ==== Parsing Output with R ==== .{{{ #!python #!wiki style border-top: 3px solid red library(ShortRead) alignedReads <- readAligned("./", pattern="output.soap", type="SOAP") # Reads in alignment coordinates as a ShortRead AlignedRead object }}} {{{#!wiki bi-right [[#contents|Top]]}}} === MAQ === .MAQ stands for "Mapping and Assembly with Quality." It is a suite of tools which allows you to build a consensus assembly by mapping short reads to a reference. This is particularly useful for identifying SNPs. MAQ can also be used as a general purpose short read alignment tool, and contains some useful file format converters. .[[http://maq.sourceforge.net/|MAQ Web Page]] ==== Running MAQ ==== .{{{ #!python #!wiki style border-top: 3px solid red maq # inspect command line options maq fasta2bfa genome.fasta genome.bfa # create binary of reference genome maq fastq2bfq query.fastq readBinary.bfq # create a binary of dataset maq match out.map genome.bfa readBinary.bfq # align query to genome and store output }}} ==== Parsing Output with R ==== .{{{ #!python #!wiki style border-top: 3px solid red library(ShortRead) alignedReads <- readAligned("./", pattern="out.map", type="MAQ") # Reads in alignment coordinates as a ShortRead AlignedRead object }}} {{{#!wiki bi-right [[#contents|Top]]}}} === Bowtie === .Bowtie is an ultra-fast alignment tool. It has somewhat less flexibility with regard to mismatches than SOAP or MAQ, but much greater performance. It does not currently support gapped alignment. Bowtie is generally an excellent choice if you don't need any of the extended features provided by lower performance tools. .[[http://bowtie-bio.sourceforge.net/index.shtml|Bowtie web page]] ==== Running Bowtie ==== .{{{ #!python #!wiki style border-top: 3px solid red bowtie # inspect command line options bowtie-build ~/AT_Columbia_TAIR8.fasta genome.binary # create binary genome (replace ~/AT_Columbia_TAIR8.fasta with your genome) # You can also download pre-built genomes from the Bowtie web page bowtie -q ~/genome.binary ~/query.fastq output.bowtie # align query to genome and store output (replace ~/query.fastq with your query sequences # and ~/genome.binary with the genome you built in the previous step ) }}} ==== Parsing Output with R ==== .{{{ #!python #!wiki style border-top: 3px solid red library(ShortRead) alignedReads <- readAligned("./", pattern="output.bowtie", type="Bowtie") # Reads in alignment coordinates as a ShortRead AlignedRead object }}} {{{#!wiki bi-right [[#contents|Top]]}}} == Examples == <> == Session Info == .{{{ #!python #!wiki style border-top: 3px solid red > sessionInfo() R version 2.9.0 (2009-04-17) x86_64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.2.0 lattice_0.17-22 BSgenome_1.12.0 Biostrings_2.12.1 [5] IRanges_1.2.0 loaded via a namespace (and not attached): [1] Biobase_2.4.0 grid_2.9.0 hwriter_1.1 }}} == Installing Packages == .{{{ #!python #!wiki style border-top: 3px solid red source("http://bioconductor.org/biocLite.R") biocLite() biocLite(c("ShortRead", "Biostrings", "IRanges", "BSgenome", "rtracklayer", "biomaRt")) }}}