HTWiki: HT_BioC_Manual

IIGB HT-Seq

Workshops

R & BioC

BioC-Seq

R Programming

EMBOSS

Linux

Cluster

HT Sequence Analysis with R and Bioconductor

By Tyler Backman, Rebecca Sun and Thomas Girke, IIGB, UC Riverside

Contents

  1. HT Sequence Analysis with R and Bioconductor
    1. Introduction
    2. Scope of this Manual
    3. Workshops
    4. R Basics
    5. String Handling Utilities in R's Base Distribution
    6. Bioconductor Packages
      1. Biostrings
        1. Handling Single Sequences with XString
        2. Comparing and Subsetting XString Objects
        3. Subfeature Retrieval with StringViews
        4. Masking Sequences
        5. Handling Many Sequences with XStringSet
        6. Comparing and Subsetting of XStringSet Objects
        7. Handling of Sequence and Quality Data with QualityScaleXStringSet
        8. Sequence Import and Export
        9. Basic Sequence Manipulation Utilities
        10. Trimming of Flanking Sequences (e.g. Adaptors)
        11. Alignment Tools for Genome Mapping
        12. Computing Pairwise Sequence Alignments
        13. Multiple Sequence Alignments (MSAs)
        14. Analysis Routines with Biostrings
          1. Mapping of Affy Probes to Transcriptome (similar to RNA-Seq)
          2. Simulate Number of Unique Alignments
        15. Exercises
      2. ShortRead
        1. Function Overview
        2. Illumina Pipeline
        3. Quality Scores
        4. ShortRead Objects
          1. SolexaPath
          2. ShortReadQ
          3. AlignedRead
        5. ShortRead Sequence Filters
      3. IRanges
        1. IRanges Use
      4. BSgenome
        1. BSgenome Use
      5. rtracklayer
      6. biomaRt
        1. biomaRt Use
      7. HT-Seq Data Visualization
    7. Command Line Alignment Utilities
      1. SOAP
        1. Running SOAP
        2. Parsing Output with R
      2. MAQ
        1. Running MAQ
        2. Parsing Output with R
      3. Bowtie
        1. Running Bowtie
        2. Parsing Output with R
    8. Examples
      1. Plot Per-Cycle Quality of FASTQ File
      2. Import Reads, Filter, and Export
        1. Import Reads
        2. Filter Reads
        3. Export Reads
      3. SmallRNA Profiling
        1. Import Reads From Illumina Pipeline
        2. Filter Sequences
        3. Trim Adapters
        4. Choose Size Range and Make Unique
        5. Export Sequences
        6. Perform Alignment With Bowtie Using Unix Command Line
        7. Import Alignment
        8. Find Coverage Peaks
        9. Visualize Alignment In Genome Browser
      4. ChIP-seq Analysis
      5. Read Density and SNP Analysis
        1. Build Reference Genome
        2. Perform Alignment
        3. Build Consensus Genome
        4. Identify SNP Sites
        5. Import SNP Candidates Into R
        6. Further Analysis
    9. Session Info
    10. Installing Packages

Introduction

Scope of this Manual

Workshops

R Basics

String Handling Utilities in R's Base Distribution

Bioconductor Packages

Biostrings

Documentation

Handling Single Sequences with XString

Comparing and Subsetting XString Objects

Subfeature Retrieval with StringViews

Masking Sequences

Handling Many Sequences with XStringSet

Comparing and Subsetting of XStringSet Objects

Handling of Sequence and Quality Data with QualityScaleXStringSet

Sequence Import and Export

Basic Sequence Manipulation Utilities

Trimming of Flanking Sequences (e.g. Adaptors)

Alignment Tools for Genome Mapping

Computing Pairwise Sequence Alignments

Multiple Sequence Alignments (MSAs)

Analysis Routines with Biostrings

Mapping of Affy Probes to Transcriptome (similar to RNA-Seq)

Simulate Number of Unique Alignments

Exercises

.Documentation

ShortRead

Function Overview

Illumina Pipeline

Quality Scores

ShortRead Objects

SolexaPath

ShortReadQ

AlignedRead

ShortRead Sequence Filters

IRanges

IRanges Use

BSgenome

BSgenome Use

rtracklayer

biomaRt

biomaRt Use

HT-Seq Data Visualization

Command Line Alignment Utilities

SOAP

Running SOAP

Parsing Output with R

MAQ

Running MAQ

Parsing Output with R

Bowtie

Running Bowtie

Parsing Output with R

Examples

Plot Per-Cycle Quality of FASTQ File

  •      
    library(ShortRead)
    reads <- readFastq("~/", pattern="query.fastq")
       # Read in datafile (replace "~/" with your path and query.fastq with your filename)
    if (length(reads) > 1000000) {
       reads <- sample(reads,1000000) # if more than a million reads, sample randomly
    }
    qual <- FastqQuality(quality(quality(reads))) # get quality scores
    readM <- as(qual, "matrix") # convert scores to matrix
    boxplot(as.data.frame((readM)), outline = FALSE, main="Per Cycle Read Quality", xlab="Cycle", ylab="Phred Quality")
       # build per cycle boxplot
    

Import Reads, Filter, and Export

Import Reads

  •      
    library(ShortRead)
    reads <- readFastq("~/", pattern="query.fastq")
       # import reads from fastq file into ShortReadQ Object
       # replace "~/" with your path and query.fastq with your filename
    

Filter Reads

  •      
    filter1 <- nFilter(threshold=3) # keep only reads with fewer than 3 Ns
    filter2 <- polynFilter(threshold=20, nuc=c("A", "C", "T", "G"))
       # remove reads with 20 or more of the same letter
    filter <- compose(filter1, filter2) # Combine filters into one
    sequences <- sread(reads) # extract sequences from read object
    filteredReads <- reads[filter(sequences)]
       # apply filter to sequences, and use this to remove "bad" reads
    

Export Reads

  •      
    writeFastq(filteredReads, "output.fastq")
       # export filtered reads to fastq file
    

SmallRNA Profiling

Import Reads From Illumina Pipeline

  • Set the path of your Illumina Pipeline output:
         
    library(ShortRead)
    # sp <- SolexaPath("/path/to/my/flowcell/folder")
    sp <- SolexaPath("/home/zzz11/sample_flowcell")
    
  • Import reads from a lane:
         
    lane <- 7
    
    # read in all tiles of data
    qualityInputFiles <- paste("s_", lane, "_[0-9]+_prb.txt", sep="")
    sequenceInputFiles <- paste("s_", lane, "_[0-9]+_seq.txt", sep="")
    
    reads <- readBaseQuality(baseCallPath(sp), seqPattern=sequenceInputFiles, prbPattern=qualityInputFiles,type="Solexa")
    id(reads) # inspect read ids
    sread(reads) # inspect read sequences
    quality(reads) # inspect quality scores
    

Filter Sequences

  •      
    # Construct several filters
    filter1 <- nFilter(threshold=3) # keep only reads with fewer than 3 Ns
    filter2 <- polynFilter(threshold=20, nuc=c("A", "C", "T", "G")) # remove reads with 20 or more of the same letter
    filter <- compose(filter1, filter2) # Combine filters into one
    reads <- reads[filter(sread(reads))] # filter sequences with all filters at once
       # since these filters work on the sequences themselves, it is necessary to pass them the sequences
       # with the sread(reads) function, rather than the entire ShortReadQ object
    reads # inspect filtered reads
    

Trim Adapters

  •      
    library(Biostrings) # Load Biostrings library
    seqs <- sread(reads) # get sequence list
    qual <- quality(reads) # get quality score list
    qual <- quality(qual) # strip quality score type
    seqs <- DNAStringSet(seqs, start=4) # trim first 3 bases from sequence
    qual <- BStringSet(qual, start=4) # trim first 3 bases from Qscore
       # This is only necessary if you need to shorten the read by a fixed
       # distance at the beginning. This is done here because the samples
       # are multiplexed, and we wish to omit the multiplexing tag.
    adapter <- "TGTAGGCACCATCACTCGGGCACCAAGGA"
       # This is the adapter sequence to be trimmed from the end of your smallRNA reads
    mismatchVector <- c(rep(2,length(DNAString(adapter))))
       # This is a vector of equal length to the adapter with "2" for each value.
       #  This will allow the adapter to match the end of the sequence with any offset
       #  and up to 2 mismatches.
    trimCoords <- trimLRPatterns(Rpattern=adapter, subject=seqs, max.Rmismatch=mismatchVector, ranges=T)
       # Trim sequences looking for a right end pattern
       # Gets IRanges object with trimmed coordinates  
    seqs <- DNAStringSet(seqs, start=start(trimCoords), end=end(trimCoords))
    qual <- BStringSet(qual, start=start(trimCoords), end=end(trimCoords))
       # Use IRanges coordinates to trim sequences and quality scores
    qual <- SFastqQuality(qual) # reapply quality score type
    trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads))
       # Rebuild reads object with trimmed sequences and quality scores
    

Choose Size Range and Make Unique

  •      
    maxSize <- 30
    minSize <- 19
    trimmed <- trimmed[width(sread(trimmed)) <= maxSize] # remove all reads longer than maxSize
    trimmed <- trimmed[minSize <= width(sread(trimmed))] # remove all reads shorter than minSize
    uniquef <- uniqueFilter() # create filter
    unique <- trimmed[uniquef(trimmed)] # run unique filter (eliminate duplicates)
       # keep working with the non-unique reads if analyzing expression level dats
    

Export Sequences

  •      
    writeFastq(trimmed, "filename.fastq") # write reads to filename.fastq
    

Perform Alignment With Bowtie Using Unix Command Line

  •      
    
    system("wget http://illumina.ucr.edu/misc/AT_Columbia_TAIR8.fasta") # Download your reference genome in fasta format with wget
    system("bowtie-build AT_Columbia_TAIR8.fasta genome.binary")
       # build binary reference genome
    system("bowtie -q ~/genome.binary filename.fastq output.bowtie") # perform bowtie alignment with default options
    

Import Alignment

  •      
    alignedReads <- readAligned("./", pattern="output.bowtie", type="Bowtie") # read in alignment
    table(width(sread(alignedReads))) # view table of SmallRNA Size distribution
    

Find Coverage Peaks

  •      
    coverage <- coverage(alignedReads, start=1) # calculate genomic coverage and returns an IRanges Rle object
    islands <- slice(coverage$Chr1, lower = 10) # look for regions on chromsome 1 of 10x coverage or higher
    islands # inspect regions
    

Visualize Alignment In Genome Browser

This requires some archaic object conversions, because exporting shortread data via rtracklayer is not yet stable or well documented.

  •      
    library(rtracklayer)
    
    chr1ranges <- as(coverage$Chr1, "RangedData") # convert coverage for chromosome 1 into rtracklayer object
    irangescores <- IRanges(start=start(chr1ranges),end=end(chr1ranges)) # create iranges object from start/stop coordinates
    gd <- GenomicData(irangescores, chr1ranges$score, chrom="1") # put coverage data into exportable GenomicData object
       # and specify that this data is for chromosome 1
    export(gd, con="chromsome1.wig") # create output track for viewing in genome browser
    
  • If you are working on a remote server, you will need to copy this file locally before you can upload it to a genome browser. If you have trouble with the copy, you can download a sample here: chromsome1.wig

  • Go to an genome browser, upload your .wig file, and view it. You can try the Arabidopsis UCSC browser (http://epigenomics.mcdb.ucla.edu/cgi-bin/hgGateway?org=A.+thaliana&db=araTha1), by uploading your .wig file via the "add your own custom tracks" button. After uploading this, browse to the region you would like to see. A high density region in the test dataset for the workshop can be found at chr1:30,088,863-30,089,042. Also, try inspecting regions of high coverage you found in the previous step.

ChIP-seq Analysis

Read Density and SNP Analysis

  • This exercise is done from the Unix command line rather than the R console,

    and uses the MAQ alignment tool. First you must have a copy of your reference genome in .fasta format, and a copy of your reads in .fastq format (see Import Reads, Filter, and Export section). For further information, and advanced command line options see the documentation provided on the MAQ website.

Build Reference Genome

From the command line:

  •      
    wget http://illumina.ucr.edu/misc/AT_Columbia_TAIR8.fasta # download reference genome
    maq fasta2bfa AT_Columbia_TAIR8.fasta genome.bfa
       # create binary of reference genome
    maq fastq2bfq ~/snpReads.fastq readBinary.bfq
       # create a binary of dataset (replace ~/snpReads.fastq with your query sequences)
    

Perform Alignment

  •      
    maq match -n 2 out.map genome.bfa readBinary.bfq
       # perform alignment
    maq mapcheck genome.bfa out.map >mapcheck.txt
       # get QC statistics on alignment
    

Build Consensus Genome

  •      
    maq assemble consensus.cns genome.bfa out.map 2>assemble.log
       # assemble consensus sequence
    maq cns2fq consensus.cns >cns.fq
       # extract consensus sequence from binary
    

Identify SNP Sites

  •      
    maq cns2snp consensus.cns >cns.snp
       # extract SNP locations
    maq.pl SNPfilter cns.snp > filtered.snp
       # filter SNPs with default values
    

From R:

Import SNP Candidates Into R

  • This section is performed from within R. By creating an IRanges object from SNP candidates in a given genomic region, these SNP sites can be used as genomic annotation data.
         
    library(IRanges)
    snptable <- read.table("filtered.snp")
       # read in SNP table as a data frame
    snptable <- snptable[snptable[1] == "Chr2",]
       # filter out all SNP candidates except those on chromsome 2
    snptable <- snptable[snptable[2] > 10000,]
    snptable <- snptable[snptable[2] < 20000,]
       # filter out all SNP candidates occuring outside the region 10000-20000
    snpSites <- snptable[2][,1]
       # extract SNP coordinates as a numeric vector
    snpRanges <- IRanges(start=snpSites, end=snpSites)
       # create IRanges object for SNP candidate loci
    

Further Analysis

  • Once imported to R, and filtered by target region it may be useful to further filter SNP candidates by their relationship to annotation features such as coding genomic regions, and potential for protein amino acid swaps. Check back, as this example may be added to the manual in the future.

Session Info

Installing Packages

HT_BioC_Manual (last edited 2009-11-19 18:01:52 by ThomasGirke)
This site was accessed

blogger visitor
times (detailed access stats).