It is well known that linkage disequilibrium (LD) decays with distance. Several functions have been proposed to estimate such decay. Among the most widely used are the Hill and Weir (1) formula for describing the decay of r2 and a formula proposed by Abecasis (2) for describing the decay of D’.
I wrote R functions to estimate decay of LD according to both the formulas for a paper I recently published (3). Please, refer to the original publications for details. Here I just use a non-linear model to fit the data do the decay function.
Estimate decay of r2 with distance
Input:
n: sample size (i.e. number of sampled alleles)
LD.data: estimates of LD as r2 between pair of markers
distance: the distance between pair of markers. Can be in bp, Mbp, cM.
(note that LD.data and distance must be in the same order and of the same length since they represent respectively the LD values and distance of any pair of markers considered)
Output:
HW.nonlinear: object obtained after fitting the non-linear model
new.rho: estimate of population recombination parameter (which is actually C/distance)
fpoints: vector of LD obtained fitting the linear model. It has a one-to-one correspondence with the distance vector, i.e. the first element of fpointsis the LD estimate for the distance in the first element of distance and so on.
Below you find the commands, including some sample data. Any feedback is appreciated!
distance=c(19,49,1981,991,104,131,158,167,30,1000,5000,100,150,11,20,33,1100,1300,1500,100,1400,900,300,100,2000,100,1900,500,600,700,3000,2500,400,792) LD.data=c(0.6,0.47,0.018,0.8,0.7,0.09,0.09,0.05,0,0.001,0,0.6,0.4,0.2,0.5,0.4,0.1,0.08,0.07,0.5,0.06,0.11,0.3,0.6,0.1,0.9,0.1,0.3,0.29,0.31,0.02,0.01,0.4,0.5) n=52 HW.st=c(C=0.1) HW.nonlinear=nls(LD.data~((10+C*distance)/((2+C*distance)*(11+C*distance)))*(1+((3+C*distance)*(12+12*C*distance+(C*distance)^2))/(n*(2+C*distance)*(11+C*distance))),start=HW.st,control=nls.control(maxiter=100)) tt=summary(HW.nonlinear) new.rho=tt$parameters[1] fpoints=((10+new.rho*distance)/((2+new.rho*distance)*(11+new.rho*distance)))*(1+((3+new.rho*distance)*(12+12*new.rho*distance+(new.rho*distance)^2))/(n*(2+new.rho*distance)*(11+new.rho*distance)))Estimate decay of D’ with distance
Input:
n: sample size
LD.data: estimates of LD as D’ between pair of markers
distance: the distance between pair of markers. It is very important that distance is expressed in base pairs (bp). (note that LD.data and distance must be in the same order and of the same length since they represent respectively the LD values and distance of any pair of markers considered)Output:
LD.nonlinear: object obtained after fitting the non-linear model
fpoints: vector of LD obtained fitting the linear model. It has a one-to-one correspondence with the distance vector, i.e. the first element of fpoints is the LD estimate for the distance in the first element of distance and so on.Below you find the commands, including some sample data. Any feedback is appreciated!
distance=c(19,49,1981,991,104,131,158,167,30,1000,5000,100,150,11,20,33,1100,1300,1500,100,1400,900,300,100,2000,100,1900,500,600,700,3000,2500,400,792) LD.data=c(0.6,0.47,0.018,0.8,0.7,0.09,0.09,0.05,0,0.001,0,0.6,0.4,0.2,0.5,0.4,0.1,0.08,0.07,0.5,0.06,0.11,0.3,0.6,0.1,0.9,0.1,0.3,0.29,0.31,0.02,0.01,0.4,0.5) n=52 LD.st=c(b0=12.9) distance.mb=distance/1000000 LD.nonlinear=nls(LD.data~(1-distance.mb)^b0,start=LD.st,control=nls.control(minFactor=1/1000000000,maxiter=100,warnOnly=T)) summ=summary(LD.nonlinear) param=summ$parameters beta0=param["b0","Estimate"] fpoints=(1-distance.mb)^beta0Elizabeth Scholl presented me with a quite important question. How to get the half-decay distance, i.e. the distance at which LD is half of its maximum value.
Here you have an approximate solution. However, if you have many comparisons it should work well.df=data.frame(distance,fpoints) maxld=max(LD.data) #You could elucubrate if it's better to use the maximum ESTIMATED value of LD #In that case you just set: maxld=max(fpoints) h.decay=maxld/2 half.decay.distance=df$distance[which.min(abs(df$fpoints-h.decay))]Of course, you might decide to plot your data. In the example I plot the actual LD points as points, and the estimated LD points as a dotted line.
ld.df=data.frame(distance,fpoints) ld.df=ld.df[order(ld.df$distance),] plot(distance,LD.data,pch=19,cex=0.9) lines(ld.df$distance,ld.df$fpoints,lty=3,lwd=1.2)References:
(1) Hill WG, Weir BS (1988) Variances and covariances of squared linkage disequilibria in finite populations. Theor Popul Biol 33:54–78
(2) Abecasis GR et al (2001) Extent and distribution of linkage disequilibrium in three genomic regions. Am J Hum Genet 68:191–197
(3) Marroni et al (2011) Nucleotide diversity and linkage disequilibrium in Populus nigra cinnamyl alcohol dehydrogenase (CAD4) gene. Tree Genetics & Genomes, DOI 10.1007/s11295-011-0391-5.Special thanks:
To Vittorio Zamboni and Davide Scaglione for bringing archival research to the next level!
Hi Fabio
Thanks for the nice post. I am working on estimating LD in cats and I used the decay model you used with little modification. Thanks a lot. I was wondering if I can take a look at the decay function of D`. I would also suggest selecting independent pairwise comparisons only in the model. Thanks again
Hasan Alhaddad
PhD student
UC Davis
halhaddad@ucdavis.edu
Hi Hasan,
thank you for your interest.
I just updated the post, adding the D’ decay function, so that you can give it a try. For measuring D’ decay you should measure distance as base pairs (or modify the function).
Selecting independent pairwise comparisons is a good idea. At present my function only gets distance as input, so the selection should be done before. For example you can calculate distance only for independent pairwise comparison and use only them as input of the function. Right now I am not working on LD decay. Should I get back to that I might integrate also selection of independent comparisons in the workflow.
Fabio
Hi Fabio,
thank you for the post.
I would like to perform this analysis using molecular markers.
i just have a question about :HW.st<-c(C=0.1)
How did you determine this? , is it the population recombination parameter? , I am a little confuse between this and "new.rho"
Thank you very much
Antonio
Hi Antonio,
HW.st is just the starting value of C, which is then estimated as new.rho.
I actually used bad names since it is actually not rho (the population recombination parameter), but the product of rho and the distance in base pair between the two considered markers (which is termed C). So I should actually have called it new.C (and maybe the starting value should have been HW.st.C.
The definition is by Hill and Weir (reference 1), but you can find the formula also in my paper (reference 3).
Please note that it only works for biallelic markers (SNPs).
Hope this helps!
Thank you very Much Fabio,
Sorry but I am still a little bit confused. If new.rho does not contain rho (but the product of rho and distance), why should you wanted to multiply it by distance again in order to obtain fpoints?
I am using DARTs and cM distances, so I think I should be able to use this method.
Antonio
Antonio, I am afraid I contributed to your confusion.
I gave a better look to the function and here what I can say:
Hw.st is the starting value of rho. It’s important to obtain convergence (and eventually convergence on sensible values). 0.1 is maybe a bit large, but it worked well for me.
new.rho is the estimated value of rho which, as you correctly pointed out is then multiplied by distance (which can be in bp or cM) to get C.
Hope this helps!
but before plotting LD decay pot we need some more things like theta and recombination rate which is rho, how your commands deals with all these? Can we use this command for any crop or it is just for the crop you’ve worked.
Thanks,
Vinod
You need specific functions to estimate theta and rho. It shouldn’t be difficult, but I didn’t write anything. The function is appropriate for any diploid organism.
Perfect – thank you, Fabio!
Dear Fabio,
a probably very stupid question. I have phased data and would like to estimate the LD decay. The sample size ‘n’, is it the number of chromosomes sampled or the number of individuals sampled?
Thanks,
axel
Your question is actually very appropriate. I went back to check on the original paper, and the authors are using n as the number of gametes (i.e. in our case the number of chromosomes). I will update my post accordingly.
Thanks for clarification. Actually, it does make more sense to use the number of gametes.
Hi Fabio,
That sounds great you wrote a formula to estimate the LD decay. I tried the example you gave for r2 with distance. I am not sure how to get the LD decay extent something like “LD extent is 10 cM, or 150 kb”. Thanks for shedding light on it for me.
Angovy
In my opinion, sentences as “LD extent is 10cM (or 150 kb)” make no sense. One should more precisely assert: “LD is reduced at x% of its maximum value at x cM (or y Kb)”, or “LD half decay distance is xx cM (or yy Kb)”.
Yeah, that’s clearly stated in your paper. However, when reading sentence like ‘The average genetic distance at which LD on all chromosomes decayed to r2 < 0.2 was 9.9 cM ( Benson et al 2012)', is that what could be the 'maximum' value?
For my experiment, the LD significant threshold was r2 of 0.23. My question is how to get the corresponding genetic distance (cM) , using the non-linear regression model you developped?
Thanks
Angovy
First of all, I would like to clarify that I did not develop anything, I just wrote an R function to fit already known decay functions to data! Actually, Benson et al, just decided (arbitrarily, but reasonably) to select as a threshold of “low LD” the LD value od 0.2. Using my functions you just need to set the value of h.decay to 0.2 and then compute half.decay.distance. Obviously, and in spite of its name, in that case half.decay.distance will be the distance at which the LD (according to the LD decay function) decays below the value of 0.2.
One can also use the approach of Benson, by simply finding what is the distance above which no LD value greater than 0.2 is observed. Note that in that case you don’t need to fit the non-linear models.
Hi Fabio,
I really appreciate you clarified all those things for me. That’s kind of you. However, because I don’t have strong skills in R, could you please tell me the way to do. What parameter should I give the value of 0.2 to compute half.decay.distance? Below is my command lines.
Note that ld_data is already loaded. Thank you so much for your helpful assistance.
distance <- ld_data$DiffPosition
LD.data <- ld_data$R2
n <-169
HW.st<-c(C=0.1)
HW.nonlinear<-nls(LD.data~((10+C*distance)/((2+C*distance)*(11+C*distance)))*(1+((3+C*distance)*(12+12*C*distance+(C*distance)^2))/(n*(2+C*distance)*(11+C*distance))),start=HW.st,control=nls.control
(maxiter=100))
tt<-summary(HW.nonlinear)
new.rho<-tt$parameters[1]
fpoints<-((10+new.rho*distance)/((2+new.rho*distance)*(11+new.rho*distance)))*(1+((3+new.rho*distance)*(12+12*new.rho*distance+(new.rho*distance)^2))/(n*(2+new.rho*distance)*(11+new.rho*distance)))
Oh sorry, please do not consider my previous message. I got it. It worked fine. Thank you so much for those nice R functions. Just awesome.
Best
Hi Fabio.
You say that sample size n is the number of sampled chromsomes. Is this the haploid or diploid number of chromosomes? For example for calculating LD in a human population with genome wide genotyping data should n = 23 (haploid chromosome number) or 46 (diploid chromosome number)?
With thanks,
Mike
Hi Fabio,
I am beginner in this work,
I got R^2 value by tassel and the value of R^2 in some of distance showed NaN, my question; should I avoid this NaN value to calculate LD decay or some program can read this value to calculate LD decay?
duffy
You can remove them or keep them, but NaN will never be used to compute LD decay, because they are not numbers.
thank you so much for your reply
Hi Fabio,
I forget to ask you about N/A value in distance
because the result of calculation by tassel generated N/A value and they have value >0 for R^2, should I remove also this value or convert to “zero”
duffy
Mmm… Actually, I don’t know why Tassel should give NA for a distance value. Maybe the two SNPs are on different chromosomes? Or one of the distance values is misspecified? The distance is a subtraction, shouldn’t give rise to NA. Until you don’t know what’s the cause of this, you’d better not convert them.
Hi Fabio, I work with grape GWAS and I want to know if n = 19 (aploid chromosome number) or to 38 (diploid).
Thank you so much.
Annarita
Hi Annarita, I answer here to you and to Mike (above, who had a similar question). n is the number of sampled chromosomes used to compute LD. This number is not related at all to the number of chromosomes of the species. An example will clarify. If you compute LD in a diploid populations of 50 subjects, n will be 100 irrespective of the haploid number of the species.
Thanks for the clarification and for the script
Annarita
Hi Fabio,
It is quite clear from the above explanation but not really. What do you mean by subjects in this case.
Do you mean that we have 21 chromosomes analysed then n is 42
Thanks
Hi vin.
I am sorry, I think my explanations were not so clear. When you do a LD study you usually have a number i of individuals (study subjects) for which you estimate LD between SNPs. If the species you are studying is diploid, then n is twice the number of individuals (n=2*i), because each individual contributes 2 alleles.
Hi Fabio,
I have a stupid question. I have used your script to estimate the decay of r2 with the distance. I have used base pair distance between SNPs. When I evaluated the half.decay.distance I obtained a value of 141073. I supposed it’s still in base pair. If it’s in bp, it’s to big. Maybe I’m wrong. I don’t know.
Thanks in advance.
Annarita
Hi Fabio
Do you mind having a look at this post? http://stackoverflow.com/questions/28675994/plink-output-get-multiple-nls-output-from-single-table-in-r
Ok, I added a comment to the post, hope it helps!
Hi Fabio,
Thank you very much for your nice R function!
I have the problem that for some of my data negative values for “C” are estimated, which results in an increase of LD with distance (if I change the value manually to a positive value everything is fine). Could you give me a hint how to fix this?
Thanks in advance!
Maik
Hi Maik,
theoretically, if LD decays with distance, you shouldn’t estimate negative Cs. HNow do your data look? i.e. does it looks like at higher distance you have higher LD? While changing the sign of C might apparently fix the problem, I am not sure it is the best thing to do.
Hi Fabio,
thank you for your prompt answer! The manually adjusting of C was just something I tried out of curiosity (I don’t think that it is “allowed”).
It seems that there is no clear pattern of decreasing R^2 with distance in my data (looking at the raw r^2 values). This might be the problem! Thank you for this hint!
You might try focusing on very short distances, or expanding to very long distances, and see if the trend is regained. Good luck!
I am new to this work so could you please let me know how do we get the LD decay plot from this? Like the one’s we see in most publications? I did plot(fpoints) but that doesnt look like an LD decay plot.
Gent.mo Fabio,
ho notato che lo script ha problemi lavorando con un numero molto grande di dati.
1) non riuscivo a inserire distance e LD.data usando c(), ho risolto importando come data.frame di una sola riga (o colonna, e trasponendo poi) e rendendo poi il DF un numeric.
2)con un gran numero di dati il calcolo di D’ non va, restituendomi l’errore “Error in numericDeriv(form[[3L]], names(ind), env) :
Valore mancante o infinito generato durante il calcolo del modello”
tutto questo quando invece R2 funziona
3) in ogni caso ho con lo script un valore di decay sotto lo 0.2 di R2 eccezionalmente basso (11bp), cosa completamente inaspettata in una specie autogama come quella su cui lavoro. ci potrebbe essere qualche problema di calcolo? come posso verificare che il valore sia effettivamente quello, mi puoi suggerire una alternativa?
Chiedo scusa per l’italiano ma sto scrivendo di fretta…
Un saluto e grazie
Emidio Sabatini
Ciao Emidio! Mi fa molto piacere che tu usi questa risorsa!
Cosa intendi per grandi? Svariati genomi? Io fino al genoma intero (di pesco) riesco a farlo funzionare senza problemi! Rispondo sotto alle tue domande.
1) Hai fatto bene. Il c() era solo un esempio per data set piccoli, giustamente quando ci sono molti dati (io fatto sull’intero genoma di pesco!) bisogna importare i dati in altro modo.
2) Eh, lì non c’è molto da fare, temo. Non dipende dal fatto che siano pochi o molti dati, ma dal fatto che il fit non lineare non riesce a convergere. Il suggerimento altamente scientifico che ti do è: prova a variare a caso il valore di partenza dell’algoritmo, quello che si chiama LD.st e vedi se funziona.
3) Ahimé, così senza vedere i dati non saprei se ci siano stati problemi, ma è possibile. Una cosa sensata da fare è dividere i tuoi confronti in classi di distanza tra SNP (ad esempio 5bp 10bp 15bp se sei su scale piccole, oppure anche 1Kb, 2Kb ecc se sei su scale grandi) e poi per ogni classe puoi fare un boxplot (vedi ad esempio come abbiamo fatto in Verde et al 2013, il paper del genoma del pesco), o più semplicemente ti misuri la mediana, e poi vedi a quale distanza la mediana è la metà del valore massimo. Ultimamente questo è il mio metodo preferito!
Hi
i would like to draw LD plot for three population and i want to have a color plot. i dont know, which command should i use?
i use the following command , but not color myplot.
library(ggplot2)
ld <- read.table("plink.ld", sep="\t",header=T)
ggplot(ld) +
geom_line(aes(x=BP_B – BP_A, y = R2))
Hi, Reza, I am sorry, I am not so expert of ggplot2. Maybe you can look at some tutorial of ggplot2, I think they should explain how to change plot colors.
i dont understand how can i draw LD curve for multi decay . could you please explain me and give me code of R
Hi Fabio!
I’m using your half.decay function with a threshold of 0.2, and it gives me the location of the distance that I’m searching for.
However, the distance provided by the function is quite different from the 0.2 threshold visually observed in the plot.
They should be similar, shouldn’t they?
Thanks in advance!
Mmmm… I would need to see the plot to give a more detailed answer. However, the general answer is: yes, in several cases we expect the empirical and estimated thresholds to be similar. However, it is possible, if the empirical decrease is very steep or very gentle, that the two are not so close. I sometime report both the empirical value and the value estimated by the decay function.
hi Mr. Fabio,
Thank you for this nice tutorial.
I tried to use the script for data obtained from TASSEL.
The data contained NaN for R2 and negative value for distances in some instances however, I did not change anything and got he following error
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the mode
Can you please let me know what does it mean and how to proceed from here?
Thank you in advance.
Massub.
Just to add here I used n=3600 as there were 600 hexaploid individuals. thank you.
The error (which I guess you got from running the nls command) denotes problems in the non-linear regression. I tried hard to reproduce the error but not having the data I wasn’t able. However, I can give you some general advice:
1) NaN of LD values are useless. Remove them, and remove the corresponding distance.
2) Negative distance makes no sense. It is negative only because the arbitrary direction you used for the calculation but it is always a positive number of base pairs, so you should always use the absolute value of your distance.
3) What starting value (LD.st) did you use? By reading online I found that “illegal” starting values can generate this error, so maybe you can try to change the LD.st value and see if it helps.
I hope that by applying this suggestions your problems are solved!
Thank you Fabio for the quick response.
I was able to complete the analyis.
1. I removed all the NaN.
2, Converted the negatives into absolute values.
3. HW.st<-c(C=0.1) instead I used HW.st<-c(C=0.01) ; It was a sample data.
Thank you very much for your help though I just have a quick question, suppose If we have large number of markers lets say 20 thousand, is there a code to thin them down for rapid analysis as otherwise it will take a lot of hard disk space.
Appreciate your help.
Massub.
Hi again,
apologies for asking again, dear Fabio.
I was trying on my original data set and am getting the same LD decay and LD distance which is 0.22 and 1 , which is not correct. I am dealing with hexaploid wheat with 2223 individuals, it doesnt matter if i calculate the whole genome LD or separately for each A,B and D genome the final decay and distance remains the same.
Do you have any suggestions regarding this?
If you say, I can share my LD results from TASSEL with you to take a look, if possible.
Once again thank you very much and I really appreciate your help and kindness.
Massub.
Hi Fabio,
This is again regarding ‘n’ (half-decay distance).
I have used ‘16986’ SNPs from ‘199’ scaffolds for LD calculation (it is a non model organism).In this case what will be the value of ‘n’ for half-decay distance calculation ?
n=16986*2 or n=199*2.
please could you clarify my request.
None of them. By “number of chromosomes” I meant the number of each homologue chromosome. I change that to “number of alleles” which I hope is more immediate. So, if your organism is diploid and the number of individuals is “i” it will b n=2*i.
Hi Fabio,
Thanks for your quick reply. This organism is having 18 chromosomes and diploid.Then ‘n’ is 36. Is it ?
Sorry, I never manage to explain this correctly! If it is diploid, then it is 2*number_of_genotyped_individuals. So, for example, if you genotyped 50 diploid individuals, your n will be 2*50=100. I hope this clarifies the issue.
I think now it is very clear. ‘n’ is all about number of samples used for genotype.In my experiment I used 189 samples (individuals) used for the genotype.Below are the summary :
Number of sample used of genotype : 189
Total number of chromosomes in my organism : 18
Total number of SNPs used for LD calculation : 16986
In this case ‘n’ for half decay calculation is 189*2=378.
Is all right Fabio ?
I hope it will help others in future.
Grazie Fabio.
Yes!
Dear Fabio,
I’m new in that issue.
What’s mean ‘beta0’ parameter. What can I get it?
I have an error :
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the mode
thanx in advance
Dear Ula, beta0 is the exponential parameter estimated by the model. If you look in older comments by Massub (and my answers) you will see that you are not the first experiencing this error. It may be due to the presence of NAs in your data or to some other problems. Try to follow those suggestions and see if they work. If they don’t we will have to investigate more in detail.
Dear Fabio,
many thanks for help. Your code geos excellent!
Thank you so much