Milano R net meeting

I received the announcement of the first Milano R net meeting, and I am glad to post it below. If you are based in the Milano area you might think about going. If you are not based in the Milano area, you might think about going as well!

Milano R net
Milano R net is a users group dedicated to bringing together area practitioners of the popular open source R statistics language to exchange knowledge, learn and share tricks and techniques, and provide R beginners with an opportunity to meet more experienced users. The aim is to spur the adoption of R for innovative research and commercial applications, and to see case studies of real-life R applications in industry and government. The Milano R net founders are: R consulting company Quantide’s Andrea Spanò, R-core member and University of Milan professor Stefano Iacus, and quantitative consultant Daniele Amberti.

First Milano R net meeting
The first meeting will be on May 8. Andrea Spanò will introduce the meeting. Speakers are: Stefano Iacus (R Development Core Team), Fabio Piacenza (Unicredit Group) and Alberto Santini (Monte dei Paschi di Siena).
As of today, about 45 people have subscribed to the meeting. The subscribers’ backgrounds are very different: just what is needed for an interesting exchange of information.

Information and registrations: www.milanoR.net

Quantide (www.quantide.com) and Revolution Analytics (www.revolutionanalytics.com) sponsor the meeting.

Future meetings
The group is looking for suggestions for future talks and presenters, so if you’re in the Milan area (or know another R user who is), please get in touch via www.milanoR.net.

Extract different characters between two strings of equal length

In the desperate effort of understanding the secret of life, it may be too simplistic to just count the differences between two strings of equal length. You may also want to know where they differ.

You can do that by recycling most of the function published in a previous post.

You can use it to compare two nucleotide sequences, two amino acid sequences or just two strings of equal length.

To facilitate the analysis of DNA sequences, I added the option to ignore case (so that an “a” is equivalent to an “A”) and to ignore positions carrying special characters, such as “N”, “?” or “-“.

Input:
a: first sequence (or string)
b: second sequence (or string)
exclude: character (or vector of characters) to be excluded from comparison
ignore.case: logical. If TRUE, consider “a” equal to “A”. If FALSE, consider “a” different from “A”.
show.excluded: logical. If TRUE, positions carrying an excluded character (such as “-“) will be returned as NA. If FALSE, positions in which one of the two strings carries an excluded character will be omitted.

Output:
only.diff: Matrix (or vector) reporting differences between the two strings and their position.

As usual, feedback welcome!

list.string.diff<-function(a="ATTCGA-",b="attTGTT",exclude=c("-","?"),ignore.case=TRUE)
{
  if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.")
  if(ignore.case)
  {
    a<-toupper(a)
    b<-toupper(b)
  }
  # Split both strings into single characters and stack them
  seq.a<-unlist(strsplit(a,split=""))
  seq.b<-unlist(strsplit(b,split=""))
  diff.d<-rbind(seq.a,seq.b)
  # Keep only the positions in which the two sequences differ
  # (drop=FALSE preserves the matrix when a single difference is found)
  only.diff<-diff.d[,diff.d[1,]!=diff.d[2,],drop=FALSE]
  pos<-which(diff.d[1,]!=diff.d[2,])
  only.diff<-rbind(pos,only.diff)
  # Remove positions in which either sequence carries an excluded character
  for(ex.loop in 1:length(exclude))
  {
    only.diff<-only.diff[,!(only.diff["seq.a",]==exclude[ex.loop]|only.diff["seq.b",]==exclude[ex.loop]),drop=FALSE]
  }
  return(only.diff)
}
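As a quick sanity check, the positions returned by the function can be reproduced with a few lines of vectorised base R (a sketch using the function’s default example strings, not part of the original code):

```r
# Mimic list.string.diff() on its default example strings:
# upper-case both, find the mismatching positions, and drop
# positions where either string carries an excluded character.
a <- toupper("ATTCGA-")
b <- toupper("attTGTT")
seq.a <- strsplit(a, split = "")[[1]]
seq.b <- strsplit(b, split = "")[[1]]
keep <- !(seq.a %in% c("-", "?") | seq.b %in% c("-", "?"))
pos <- which(seq.a != seq.b & keep)
rbind(pos = pos, seq.a = seq.a[pos], seq.b = seq.b[pos])
```

The result reports the two retained differences, C/T at position 4 and A/T at position 6; position 7 is dropped because it carries a “-“.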

It might happen that you have two DNA or amino acid sequences of different length. I suggest that you align them with ClustalW and save the output in a text file. You can then use the read.alignment() function of the R package seqinr to get the two aligned sequences, padded with “-” so that they have the same length.
I paste below an example of ClustalW output:

CLUSTAL 2.1 multiple sequence alignment


test1      MESLGVRKGAWIQEEDVLLRKCIEKYGEGKWHLVPLRAGLNRCRKSCRLRWLNYLKPDIK
test2      MESLGVRKGAWIQEEDVLLRKCIEKYGEGKWHLVPLRAGLNRCRKSCRLRWLNYLKPDIK
           ************************************************************

test1      RGEFALDEVDLMIRLHNLLGNRWSLIAGRLPGRTANDVKNYWHSHHFKKEVQFQEEGRDK
test2      RGEFALDEVDLMI--------------GRLPGRTANDVKNYWHSHHFKKEVQFQEEGRDK
           *************              *********************************

test1      PQTHSKTKAIKPHPHKFSKALPRFELKTTAVDTFDTQVSTSRKPSSTSPQPNDDIIWWES
test2      PQTHSKTKAIKPHPHKFSKALPRFELKTTAVDTFDTQVSTSSKPSSTSPQPNDDIIWWES
           ***************************************** ******************

test1      LLAEHAQMDQETDFSASGEMLIASLRTEETATQKKGPMDGMIEQIQGGEGDFPFDVGFWD
test2      LLAEHAPMDQETDFSASGEMLIASLRTEETATQKKGPMDGMIEQIQGGEGDFPFDVGFWD
           ****** *****************************************************

test1      TPNTQVNHLI
test2      TPNTQVNHLI
           **********

You can then save this alignment under the name “test.clustalw” and read it with the following function, which calls both read.alignment() and list.string.diff().

comp.seq<-function(infile="test.clustalw",format="clustal",
                   exclude=c("-","?"))
{
  # read.alignment() comes from the seqinr package
  library(seqinr)
  pseq<-read.alignment(infile,format=format)
  pseq1<-as.character(pseq$seq[1])
  pseq2<-as.character(pseq$seq[2])
  pdiff<-list.string.diff(a=pseq1,b=pseq2,exclude=exclude)
  return(pdiff)
}

If you try, you should get the following output

      [,1]  [,2] 
pos   "162" "187"
seq.a "R"   "Q"  
seq.b "S"   "P"  

Here is an improved version of the function. Thanks to richierocks for his comments!

list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE, show.excluded = FALSE)
{
  if(nchar(a) != nchar(b)) stop("Lengths of input strings differ. Please check your input.")
  if(ignore.case)
  {
    a <- toupper(a)
    b <- toupper(b)
  }
  split_seqs <- strsplit(c(a, b), split = "")
  only.diff <- (split_seqs[[1]] != split_seqs[[2]])
  # Mark positions carrying an excluded character as NA
  only.diff[
    (split_seqs[[1]] %in% exclude) |
    (split_seqs[[2]] %in% exclude)
  ] <- NA
  diff.info <- data.frame(which(is.na(only.diff) | only.diff),
                          split_seqs[[1]][only.diff], split_seqs[[2]][only.diff])
  names(diff.info) <- c("position", "poly.seq.a", "poly.seq.b")
  if(!show.excluded) diff.info <- na.omit(diff.info)
  diff.info
}

Count different positions between two strings of equal length

This is another pretty simple function, written to help me solve the simplest representation of a trivial but tedious task. Most biologists are probably familiar with it: how many nucleotide differences exist between two given sequences? I only faced the easiest part of the problem, i.e. I do not perform alignment; I just assume that the sequences have a perfect one-to-one correspondence, i.e. position 1 of sequence 1 is position 1 of sequence 2, and so on.
The problem is basically to compare two strings of equal length and count the number of positions in which they differ. To facilitate the analysis of DNA sequences, I added the option to ignore case (so that an “a” is equivalent to an “A”) and to ignore positions carrying special characters, such as “N” or “?”.

Input:
a: first DNA sequence (or string)
b: second DNA sequence (or string)
exclude: character (or vector of characters) to be excluded from comparison
ignore.case: logical. If TRUE, consider “a” equal to “A”. If FALSE, consider “a” different from “A”.

Output:
differences: Number of differences between the two DNA sequences (or strings)

As usual, feedback welcome!

Thanks to Larry for suggesting a nice improvement! It is now incorporated in the function.

string.diff.ex<-function(a="ATTCGAN",b="attTGTT",exclude=c("n","N","?"),ignore.case=TRUE)
{
  if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.")
  if(ignore.case)
  {
    a<-toupper(a)
    b<-toupper(b)
  }
  # Split both strings into single characters and stack them
  diff.a<-unlist(strsplit(a,split=""))
  diff.b<-unlist(strsplit(b,split=""))
  diff.d<-rbind(diff.a,diff.b)
  # Remove positions in which either sequence carries an excluded character
  # (drop=FALSE preserves the matrix when a single position remains)
  for(ex.loop in 1:length(exclude))
  {
    diff.d<-diff.d[,!(diff.d[1,]==exclude[ex.loop]|diff.d[2,]==exclude[ex.loop]),drop=FALSE]
  }
  differences<-sum(diff.d[1,]!=diff.d[2,])
  return(differences)
}
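The count returned by the function can be cross-checked with a short vectorised base-R snippet (again a sketch using the default example strings, not part of the original function):

```r
# Mimic string.diff.ex() on its default example strings: upper-case
# both, drop positions where either string carries an excluded
# character, and count the remaining mismatches.
a <- toupper("ATTCGAN")
b <- toupper("attTGTT")
diff.a <- strsplit(a, split = "")[[1]]
diff.b <- strsplit(b, split = "")[[1]]
keep <- !(diff.a %in% c("N", "?") | diff.b %in% c("N", "?"))
differences <- sum(diff.a[keep] != diff.b[keep])
differences  # 2
```

Position 7 (N against T) is excluded, leaving the two mismatches at positions 4 and 6.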

On Reference Managers and Styles

Reference managers are those esoteric instruments that collect references for you from the web and allow you to insert them in your text documents. In addition, they allow you to create a reference list at the end of your document, sorted by the first author’s name or by order of citation. They also allow you to decide whether, in a paper with 283 authors, you really want to list ’em all or whether you are happy with just 5 (or 10) of them.

In the last few years I have probably tried most of the available reference managers, and after some degree of scientific experimentation I can assure you: reference managers are great! And some of them are also available free of charge.

Among free reference managers, two are probably the most widely used: Zotero (http://www.zotero.org/) and Mendeley (http://www.mendeley.com/).

Ok. That was the up side. What about the down side?

If you are a scientist and you have submitted your work to different journals, you probably know how good they are at requiring slight deviations from widely established citation formats. Going back to the example of the number of authors: some want you to cite all the authors of a paper, some want only the first one, followed by et al., some want only the first 5, or 7, or 10 or … AAAAAARRRGHHHH!!! I will never find the right style for the journal I want to submit to! Actually, the number of journals for which a style is available is pretty high (more than 1500, as of today), but Murphy’s law is there to ensure that your journal will be missing. But don’t panic. It is not so difficult to create your own style.

I did it, and it worked.

Citation styles for both Zotero and Mendeley are written in CSL (Citation Style Language), an esoteric language which I heard is similar to XML. Given the large number of styles available, you just have to find a style similar to what you need and then modify the .csl file according to the requirements of the journal.

A nice tutorial on how to modify existing styles is on the Zotero forum:

http://www.zotero.org/support/dev/citation_styles

There you will also find links to CSL primers, and instructions on how to submit your style to the Zotero community. Submitting your style to Zotero also ensures it will be available in Mendeley quite soon.

However, you can also just add your style in Mendeley.

To use it in Mendeley, just copy the my-new-style.csl file in the Mendeley Desktop styles directory (usually citationStyles-1.0).

In your word processing software select “choose citation style”, then “more styles”; your newly created my-new-style will be among the installed styles, and you just have to select “use this style”.

Estimate decay of linkage disequilibrium with distance

It is well known that linkage disequilibrium (LD) decays with distance. Several functions have been proposed to estimate such decay. Among the most widely used are the Hill and Weir (1) formula for describing the decay of r2 and a formula proposed by Abecasis (2) for describing the decay of D’.
I wrote R functions to estimate the decay of LD according to both formulas for a paper I recently published (3). Please refer to the original publications for details. Here I just use a non-linear model to fit the data to the decay function.

Estimate decay of r2 with distance

Input:
n: sample size (i.e. number of sampled chromosomes)
LD.data: estimates of LD as r2 between pairs of markers
distance: the distance between pairs of markers. It can be in bp, Mbp, or cM.
(note that LD.data and distance must be in the same order and of the same length, since they represent respectively the LD values and the distances of the pairs of markers considered)

Output:
HW.nonlinear: object obtained after fitting the non-linear model
new.rho: estimate of population recombination parameter (which is actually C/distance)
fpoints: vector of LD values obtained by fitting the non-linear model. It has a one-to-one correspondence with the distance vector, i.e. the first element of fpoints is the LD estimate for the distance in the first element of distance, and so on.

Below you find the commands, including some sample data. Any feedback is appreciated!

distance<-c(19,49,1981,991,104,131,158,167,30,1000,5000,100,150,11,20,33,1100,1300,1500,100,1400,900,300,100,2000,100,1900,500,600,700,3000,2500,400,792)
LD.data<-c(0.6,0.47,0.018,0.8,0.7,0.09,0.09,0.05,0,0.001,0,0.6,0.4,0.2,0.5,0.4,0.1,0.08,0.07,0.5,0.06,0.11,0.3,0.6,0.1,0.9,0.1,0.3,0.29,0.31,0.02,0.01,0.4,0.5)
n<-52
# Starting value for the population recombination parameter C
HW.st<-c(C=0.1)
# Fit the Hill and Weir (1988) expectation of r2 as a non-linear model
HW.nonlinear<-nls(LD.data~((10+C*distance)/((2+C*distance)*(11+C*distance)))*
                  (1+((3+C*distance)*(12+12*C*distance+(C*distance)^2))/
                  (n*(2+C*distance)*(11+C*distance))),
                  start=HW.st,control=nls.control(maxiter=100))
tt<-summary(HW.nonlinear)
new.rho<-tt$parameters[1]
# Expected r2 at each distance, given the fitted value of C
fpoints<-((10+new.rho*distance)/((2+new.rho*distance)*(11+new.rho*distance)))*
         (1+((3+new.rho*distance)*(12+12*new.rho*distance+(new.rho*distance)^2))/
         (n*(2+new.rho*distance)*(11+new.rho*distance)))

Estimate decay of D’ with distance

Input:
n: sample size
LD.data: estimates of LD as D’ between pairs of markers
distance: the distance between pairs of markers. It is very important that distance is expressed in base pairs (bp). (Note that LD.data and distance must be in the same order and of the same length, since they represent respectively the LD values and the distances of the pairs of markers considered.)

Output:
LD.nonlinear: object obtained after fitting the non-linear model
fpoints: vector of LD values obtained by fitting the non-linear model. It has a one-to-one correspondence with the distance vector, i.e. the first element of fpoints is the LD estimate for the distance in the first element of distance, and so on.

Below you find the commands, including some sample data. Any feedback is appreciated!

distance<-c(19,49,1981,991,104,131,158,167,30,1000,5000,100,150,11,20,33,1100,1300,1500,100,1400,900,300,100,2000,100,1900,500,600,700,3000,2500,400,792)
LD.data<-c(0.6,0.47,0.018,0.8,0.7,0.09,0.09,0.05,0,0.001,0,0.6,0.4,0.2,0.5,0.4,0.1,0.08,0.07,0.5,0.06,0.11,0.3,0.6,0.1,0.9,0.1,0.3,0.29,0.31,0.02,0.01,0.4,0.5)
n<-52
# Starting value for the decay parameter b0
LD.st<-c(b0=12.9)
# The formula requires distance expressed in megabases
distance.mb<-distance/1000000
LD.nonlinear<-nls(LD.data~(1-distance.mb)^b0,start=LD.st,
                  control=nls.control(minFactor=1/1000000000,maxiter=100,warnOnly=TRUE))
summ<-summary(LD.nonlinear)
param<-summ$parameters
beta0<-param["b0","Estimate"]
# Expected D' at each distance, given the fitted value of b0
fpoints<-(1-distance.mb)^beta0

Elizabeth Scholl presented me with a quite important question: how to get the half-decay distance, i.e. the distance at which LD is half of its maximum value.
Here is an approximate solution; if you have many comparisons it should work well.

df<-data.frame(distance,fpoints)
maxld<-max(LD.data)
# You could argue that it is better to use the maximum ESTIMATED value of LD
# In that case you just set: maxld<-max(fpoints)
h.decay<-maxld/2
# Distance at which the fitted LD is closest to half of the maximum
half.decay.distance<-df$distance[which.min(abs(df$fpoints-h.decay))]

Of course, you might decide to plot your data. In the example I plot the actual LD values as points and the estimated LD values as a dotted line.

ld.df<-data.frame(distance,fpoints)
ld.df<-ld.df[order(ld.df$distance),]
plot(distance,LD.data,pch=19,cex=0.9)
lines(ld.df$distance,ld.df$fpoints,lty=3,lwd=1.2)

References:

(1) Hill WG, Weir BS (1988) Variances and covariances of squared linkage disequilibria in finite populations. Theor Popul Biol 33:54–78
(2) Abecasis GR et al (2001) Extent and distribution of linkage disequilibrium in three genomic regions. Am J Hum Genet 68:191–197
(3) Marroni et al (2011) Nucleotide diversity and linkage disequilibrium in Populus nigra cinnamyl alcohol dehydrogenase (CAD4) gene. Tree Genetics & Genomes, DOI 10.1007/s11295-011-0391-5.

Special thanks:
To Vittorio Zamboni and Davide Scaglione for bringing archival research to the next level!

For happy-R blogging

You may notice that I don’t have that many posts on my blog, and they are all about R.
The paucity of my posts makes me a bit sad, but not much, really…
What makes me (or, better, used to make me) sad is that posts of R code (used to) look awful.
However, your code doesn’t deserve and doesn’t need to be ugly. You can find several beautifiers online.
This page on Inside-R provides a useful tool. You just need to paste your R code in the provided form and paste the result back into your page. The nice thing is that you get back HTML code, so you can paste it into your blog, but also into your web site. The drawback: you get back a nearly unintelligible page full of HTML formatting in which your code is lost; to change even a single letter of code you will probably have to go back to your original function. You probably don’t need to edit your code now, but I got nervous about this for some strange reason. Thus, I looked for additional tools. I found out that WordPress now enables R syntax highlighting. I can assure you that it’s fairly easy (I did it!). Instructions are available on this R-bloggers page.
Since I tried these two tools, I am a happy-R blogger!

Estimate Gene Diversity

I provide here an R function to estimate gene diversity of diallelic sites (e.g. SNPs), given allele frequencies at each segregating site.
The function takes three input parameters:

maf: a numeric value (or vector) representing minor allele frequency at each site. Default is 0.5
nreads: size of each resampling experiment. Default is 10000.
nreplicates: Number of resampling experiments to conduct. Default is 200.

The function outputs the median value of estimated gene diversity, together with upper and lower 95% confidence limits.

I tested the function against results provided by the software package Arlequin and found excellent correlation (r>0.95). However, additional testing is welcome!
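Another quick check (an analytic expectation I add here, not an Arlequin result): for a diallelic site the expected gene diversity is simply 2p(1-p), where p is the minor allele frequency, so the median returned by the resampling should sit close to this value:

```r
# Expected gene diversity at a diallelic site with minor allele
# frequency p: the probability that two independently drawn alleles
# differ is 2 * p * (1 - p).
p <- 0.25
expected.gd <- 2 * p * (1 - p)
expected.gd  # 0.375
```

With maf=0.25, egd() should return a median close to 0.375.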

Once you obtain gene diversity you can easily calculate nucleotide diversity by summing gene diversity over the segregating sites and dividing by the length of your sequence.
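For example, the nucleotide diversity calculation reduces to a one-liner; the per-site gene diversity values and the sequence length below are made up for illustration:

```r
# Hypothetical per-site gene diversity at three segregating sites
# of a 500 bp sequence.
site.gd <- c(0.18, 0.42, 0.50)
seq.length <- 500
# Nucleotide diversity: gene diversity summed over the segregating
# sites, divided by the sequence length
nuc.div <- sum(site.gd)/seq.length
nuc.div  # 0.0022
```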

References:
Nei, M. and Li, W.H., (1979) Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci U S A, 76(10), 5269-5273.
Tajima, F., (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123(3), 585-595.

If you use this function in a published work, please refer to:
Marroni F, Pinosio S, Di Centa E, Jurman I, Boerjan W, Felice N, Cattonaro F, Morgante M., (2011) Large scale detection of rare variants via pooled multiplexed next generation sequencing: towards next generation Ecotilling. Plant J. doi: 10.1111/j.1365-313X.2011.04627.x.

egd<-function(maf=0.5,nreads=10000,nreplicates=200)
{
  total<-rep(NA,nreplicates)
  for(aaa in 1:nreplicates)
  {
    # Draw two alleles at random for each read; the proportion of
    # mismatching pairs estimates gene diversity
    pino1<-sample(c("a","b"),replace=TRUE,size=nreads,prob=c(maf,1-maf))
    pino2<-sample(c("a","b"),replace=TRUE,size=nreads,prob=c(maf,1-maf))
    total[aaa]<-sum(pino1!=pino2)/nreads
  }
  # Median and 95% confidence limits across resampling experiments
  gene.div.median<-median(total)
  gene.div.ucl<-quantile(total,probs=0.975)
  gene.div.lcl<-quantile(total,probs=0.025)
  gd<-c(gene.div.median,gene.div.ucl,gene.div.lcl)
  names(gd)<-c("gene.div.median","gene.div.ucl","gene.div.lcl")
  gd
}