It is well known that linkage disequilibrium (LD) decays with distance. Several functions have been proposed to estimate such decay. Among the most widely used are the Hill and Weir (1) formula for describing the decay of r^{2} and a formula proposed by Abecasis (2) for describing the decay of D’.

I wrote R functions to estimate decay of LD according to both the formulas for a paper I recently published (3). Please, refer to the original publications for details. Here I just use a non-linear model to fit the data do the decay function.

Estimate decay of r^{2} with distance

Input:

**n**: sample size (i.e. number of sampled chromosomes)

**LD.data**: estimates of LD as r2 between pair of markers

**distance**: the distance between pair of markers. Can be in bp, Mbp, cM.

(note that LD.data and distance must be in the same order and of the same length since they represent respectively the LD values and distance of any pair of markers considered)

Output:

**HW.nonlinear**: object obtained after fitting the non-linear model

**new.rho**: estimate of population recombination parameter (which is actually C/distance)

**fpoints**: vector of LD obtained fitting the linear model. It has a one-to-one correspondence with the **distance** vector, i.e. the first element of **fpoints**is the LD estimate for the distance in the first element of **distance** and so on.

Below you find the commands, including some sample data. Any feedback is appreciated!

distance<-c(19,49,1981,991,104,131,158,167,30,1000,5000,100,150,11,20,33,1100,1300,1500,100,1400,900,300,100,2000,100,1900,500,600,700,3000,2500,400,792) LD.data<-c(0.6,0.47,0.018,0.8,0.7,0.09,0.09,0.05,0,0.001,0,0.6,0.4,0.2,0.5,0.4,0.1,0.08,0.07,0.5,0.06,0.11,0.3,0.6,0.1,0.9,0.1,0.3,0.29,0.31,0.02,0.01,0.4,0.5) n<-52 HW.st<-c(C=0.1) HW.nonlinear<-nls(LD.data~((10+C*distance)/((2+C*distance)*(11+C*distance)))*(1+((3+C*distance)*(12+12*C*distance+(C*distance)^2))/(n*(2+C*distance)*(11+C*distance))),start=HW.st,control=nls.control(maxiter=100)) tt<-summary(HW.nonlinear) new.rho<-tt$parameters[1] fpoints<-((10+new.rho*distance)/((2+new.rho*distance)*(11+new.rho*distance)))*(1+((3+new.rho*distance)*(12+12*new.rho*distance+(new.rho*distance)^2))/(n*(2+new.rho*distance)*(11+new.rho*distance)))

Estimate decay of D’ with distance

Input:

**n**: sample size

**LD.data**: estimates of LD as D’ between pair of markers

**distance**: the distance between pair of markers. It is very important that distance is expressed in **base pairs (bp)**. (note that LD.data and distance must be in the same order and of the same length since they represent respectively the LD values and distance of any pair of markers considered)

Output:

**LD.nonlinear**: object obtained after fitting the non-linear model

**fpoints**: vector of LD obtained fitting the linear model. It has a one-to-one correspondence with the **distance** vector, i.e. the first element of **fpoints** is the LD estimate for the distance in the first element of **distance** and so on.

Below you find the commands, including some sample data. Any feedback is appreciated!

distance<-c(19,49,1981,991,104,131,158,167,30,1000,5000,100,150,11,20,33,1100,1300,1500,100,1400,900,300,100,2000,100,1900,500,600,700,3000,2500,400,792) LD.data<-c(0.6,0.47,0.018,0.8,0.7,0.09,0.09,0.05,0,0.001,0,0.6,0.4,0.2,0.5,0.4,0.1,0.08,0.07,0.5,0.06,0.11,0.3,0.6,0.1,0.9,0.1,0.3,0.29,0.31,0.02,0.01,0.4,0.5) n<-52 LD.st<-c(b0=12.9) distance.mb<-distance/1000000 LD.nonlinear<-nls(LD.data~(1-distance.mb)^b0,start=LD.st,control=nls.control(minFactor=1/1000000000,maxiter=100,warnOnly=T)) summ<-summary(LD.nonlinear) param<-summ$parameters beta0<-param["b0","Estimate"] fpoints<-(1-distance.mb)^beta0

Elizabeth Scholl presented me with a quite important question. How to get the half-decay distance, i.e. the distance at which LD is half of its maximum value.

Here you have an approximate solution. However, if you have many comparisons it should work well.

df<-data.frame(distance,fpoints) maxld<-max(LD.data) #You could elucubrate if it's better to use the maximum ESTIMATED value of LD #In that case you just set: maxld<-max(fpoints) h.decay<-maxld/2 half.decay.distance<-df$distance[which.min(abs(df$fpoints-h.decay))]

Of course, you might decide to plot your data. In the example I plot the actual LD points as points, and the estimated LD points as a dotted line.

ld.df<-data.frame(distance,fpoints) ld.df<-ld.df[order(ld.df$distance),] plot(distance,LD.data,pch=19,cex=0.9) lines(ld.df$distance,ld.df$fpoints,lty=3,lwd=1.2)

**References**:

(1) Hill WG, Weir BS (1988) Variances and covariances of squared linkage disequilibria in finite populations. Theor Popul Biol 33:54–78

(2) Abecasis GR *et al * (2001) Extent and distribution of linkage disequilibrium in three genomic regions. Am J Hum Genet 68:191–197

(3) Marroni *et al* (2011) Nucleotide diversity and linkage disequilibrium in *Populus nigra* cinnamyl alcohol dehydrogenase (CAD4) gene. Tree Genetics & Genomes, DOI 10.1007/s11295-011-0391-5.

**Special thanks:
**To Vittorio Zamboni and Davide Scaglione for bringing archival research to the next level!

Hi Fabio

Thanks for the nice post. I am working on estimating LD in cats and I used the decay model you used with little modification. Thanks a lot. I was wondering if I can take a look at the decay function of D`. I would also suggest selecting independent pairwise comparisons only in the model. Thanks again

Hasan Alhaddad

PhD student

UC Davis

halhaddad@ucdavis.edu

Hi Hasan,

thank you for your interest.

I just updated the post, adding the D’ decay function, so that you can give it a try. For measuring D’ decay you should measure distance as base pairs (or modify the function).

Selecting independent pairwise comparisons is a good idea. At present my function only gets distance as input, so the selection should be done before. For example you can calculate distance only for independent pairwise comparison and use only them as input of the function. Right now I am not working on LD decay. Should I get back to that I might integrate also selection of independent comparisons in the workflow.

Fabio

Hi Fabio,

thank you for the post.

I would like to perform this analysis using molecular markers.

i just have a question about :HW.st<-c(C=0.1)

How did you determine this? , is it the population recombination parameter? , I am a little confuse between this and "new.rho"

Thank you very much

Antonio

Hi Antonio,

HW.st is just the starting value of C, which is then estimated as new.rho.

I actually used bad names since it is actually not rho (the population recombination parameter), but the product of rho and the distance in base pair between the two considered markers (which is termed C). So I should actually have called it new.C (and maybe the starting value should have been HW.st.C.

The definition is by Hill and Weir (reference 1), but you can find the formula also in my paper (reference 3).

Please note that it only works for biallelic markers (SNPs).

Hope this helps!

Thank you very Much Fabio,

Sorry but I am still a little bit confused. If new.rho does not contain rho (but the product of rho and distance), why should you wanted to multiply it by distance again in order to obtain fpoints?

I am using DARTs and cM distances, so I think I should be able to use this method.

Antonio

Antonio, I am afraid I contributed to your confusion.

I gave a better look to the function and here what I can say:

Hw.st is the starting value of rho. It’s important to obtain convergence (and eventually convergence on sensible values). 0.1 is maybe a bit large, but it worked well for me.

new.rho is the estimated value of rho which, as you correctly pointed out is then multiplied by distance (which can be in bp or cM) to get C.

Hope this helps!

but before plotting LD decay pot we need some more things like theta and recombination rate which is rho, how your commands deals with all these? Can we use this command for any crop or it is just for the crop you’ve worked.

Thanks,

Vinod

You need specific functions to estimate theta and rho. It shouldn’t be difficult, but I didn’t write anything. The function is appropriate for any diploid organism.

Perfect – thank you, Fabio!

Dear Fabio,

a probably very stupid question. I have phased data and would like to estimate the LD decay. The sample size ‘n’, is it the number of chromosomes sampled or the number of individuals sampled?

Thanks,

axel

Your question is actually very appropriate. I went back to check on the original paper, and the authors are using n as the number of gametes (i.e. in our case the number of chromosomes). I will update my post accordingly.

Thanks for clarification. Actually, it does make more sense to use the number of gametes.

Hi Fabio,

That sounds great you wrote a formula to estimate the LD decay. I tried the example you gave for r2 with distance. I am not sure how to get the LD decay extent something like “LD extent is 10 cM, or 150 kb”. Thanks for shedding light on it for me.

Angovy

In my opinion, sentences as “LD extent is 10cM (or 150 kb)” make no sense. One should more precisely assert: “LD is reduced at x% of its maximum value at x cM (or y Kb)”, or “LD half decay distance is xx cM (or yy Kb)”.

Yeah, that’s clearly stated in your paper. However, when reading sentence like ‘The average genetic distance at which LD on all chromosomes decayed to r2 < 0.2 was 9.9 cM ( Benson et al 2012)', is that what could be the 'maximum' value?

For my experiment, the LD significant threshold was r2 of 0.23. My question is how to get the corresponding genetic distance (cM) , using the non-linear regression model you developped?

Thanks

Angovy

First of all, I would like to clarify that I did not develop anything, I just wrote an R function to fit already known decay functions to data! Actually, Benson et al, just decided (arbitrarily, but reasonably) to select as a threshold of “low LD” the LD value od 0.2. Using my functions you just need to set the value of h.decay to 0.2 and then compute half.decay.distance. Obviously, and in spite of its name, in that case half.decay.distance will be the distance at which the LD (according to the LD decay function) decays below the value of 0.2.

One can also use the approach of Benson, by simply finding what is the distance above which no LD value greater than 0.2 is observed. Note that in that case you don’t need to fit the non-linear models.

Hi Fabio,

I really appreciate you clarified all those things for me. That’s kind of you. However, because I don’t have strong skills in R, could you please tell me the way to do. What parameter should I give the value of 0.2 to compute half.decay.distance? Below is my command lines.

Note that ld_data is already loaded. Thank you so much for your helpful assistance.

distance <- ld_data$DiffPosition

LD.data <- ld_data$R2

n <-169

HW.st<-c(C=0.1)

HW.nonlinear<-nls(LD.data~((10+C*distance)/((2+C*distance)*(11+C*distance)))*(1+((3+C*distance)*(12+12*C*distance+(C*distance)^2))/(n*(2+C*distance)*(11+C*distance))),start=HW.st,control=nls.control

(maxiter=100))

tt<-summary(HW.nonlinear)

new.rho<-tt$parameters[1]

fpoints<-((10+new.rho*distance)/((2+new.rho*distance)*(11+new.rho*distance)))*(1+((3+new.rho*distance)*(12+12*new.rho*distance+(new.rho*distance)^2))/(n*(2+new.rho*distance)*(11+new.rho*distance)))

Oh sorry, please do not consider my previous message. I got it. It worked fine. Thank you so much for those nice R functions. Just awesome.

Best

Hi Fabio.

You say that sample size n is the number of sampled chromsomes. Is this the haploid or diploid number of chromosomes? For example for calculating LD in a human population with genome wide genotyping data should n = 23 (haploid chromosome number) or 46 (diploid chromosome number)?

With thanks,

Mike

Hi Fabio,

I am beginner in this work,

I got R^2 value by tassel and the value of R^2 in some of distance showed NaN, my question; should I avoid this NaN value to calculate LD decay or some program can read this value to calculate LD decay?

duffy

You can remove them or keep them, but NaN will never be used to compute LD decay, because they are not numbers.

thank you so much for your reply

Hi Fabio,

I forget to ask you about N/A value in distance

because the result of calculation by tassel generated N/A value and they have value >0 for R^2, should I remove also this value or convert to “zero”

duffy

Mmm… Actually, I don’t know why Tassel should give NA for a distance value. Maybe the two SNPs are on different chromosomes? Or one of the distance values is misspecified? The distance is a subtraction, shouldn’t give rise to NA. Until you don’t know what’s the cause of this, you’d better not convert them.

Hi Fabio, I work with grape GWAS and I want to know if n = 19 (aploid chromosome number) or to 38 (diploid).

Thank you so much.

Annarita

Hi Annarita, I answer here to you and to Mike (above, who had a similar question). n is the number of sampled chromosomes used to compute LD. This number is not related at all to the number of chromosomes of the species. An example will clarify. If you compute LD in a diploid populations of 50 subjects, n will be 100 irrespective of the haploid number of the species.

Thanks for the clarification and for the script

Annarita

Hi Fabio,

I have a stupid question. I have used your script to estimate the decay of r2 with the distance. I have used base pair distance between SNPs. When I evaluated the half.decay.distance I obtained a value of 141073. I supposed it’s still in base pair. If it’s in bp, it’s to big. Maybe I’m wrong. I don’t know.

Thanks in advance.

Annarita

Hi Fabio

Do you mind having a look at this post? http://stackoverflow.com/questions/28675994/plink-output-get-multiple-nls-output-from-single-table-in-r

Ok, I added a comment to the post, hope it helps!

Hi Fabio,

Thank you very much for your nice R function!

I have the problem that for some of my data negative values for “C” are estimated, which results in an increase of LD with distance (if I change the value manually to a positive value everything is fine). Could you give me a hint how to fix this?

Thanks in advance!

Maik

Hi Maik,

theoretically, if LD decays with distance, you shouldn’t estimate negative Cs. HNow do your data look? i.e. does it looks like at higher distance you have higher LD? While changing the sign of C might apparently fix the problem, I am not sure it is the best thing to do.

Hi Fabio,

thank you for your prompt answer! The manually adjusting of C was just something I tried out of curiosity (I don’t think that it is “allowed”).

It seems that there is no clear pattern of decreasing R^2 with distance in my data (looking at the raw r^2 values). This might be the problem! Thank you for this hint!

You might try focusing on very short distances, or expanding to very long distances, and see if the trend is regained. Good luck!

I am new to this work so could you please let me know how do we get the LD decay plot from this? Like the one’s we see in most publications? I did plot(fpoints) but that doesnt look like an LD decay plot.

Gent.mo Fabio,

ho notato che lo script ha problemi lavorando con un numero molto grande di dati.

1) non riuscivo a inserire distance e LD.data usando c(), ho risolto importando come data.frame di una sola riga (o colonna, e trasponendo poi) e rendendo poi il DF un numeric.

2)con un gran numero di dati il calcolo di D’ non va, restituendomi l’errore “Error in numericDeriv(form[[3L]], names(ind), env) :

Valore mancante o infinito generato durante il calcolo del modello”

tutto questo quando invece R2 funziona

3) in ogni caso ho con lo script un valore di decay sotto lo 0.2 di R2 eccezionalmente basso (11bp), cosa completamente inaspettata in una specie autogama come quella su cui lavoro. ci potrebbe essere qualche problema di calcolo? come posso verificare che il valore sia effettivamente quello, mi puoi suggerire una alternativa?

Chiedo scusa per l’italiano ma sto scrivendo di fretta…

Un saluto e grazie

Emidio Sabatini

Ciao Emidio! Mi fa molto piacere che tu usi questa risorsa!

Cosa intendi per grandi? Svariati genomi? Io fino al genoma intero (di pesco) riesco a farlo funzionare senza problemi! Rispondo sotto alle tue domande.

1) Hai fatto bene. Il c() era solo un esempio per data set piccoli, giustamente quando ci sono molti dati (io fatto sull’intero genoma di pesco!) bisogna importare i dati in altro modo.

2) Eh, lì non c’è molto da fare, temo. Non dipende dal fatto che siano pochi o molti dati, ma dal fatto che il fit non lineare non riesce a convergere. Il suggerimento altamente scientifico che ti do è: prova a variare a caso il valore di partenza dell’algoritmo, quello che si chiama LD.st e vedi se funziona.

3) Ahimé, così senza vedere i dati non saprei se ci siano stati problemi, ma è possibile. Una cosa sensata da fare è dividere i tuoi confronti in classi di distanza tra SNP (ad esempio 5bp 10bp 15bp se sei su scale piccole, oppure anche 1Kb, 2Kb ecc se sei su scale grandi) e poi per ogni classe puoi fare un boxplot (vedi ad esempio come abbiamo fatto in Verde et al 2013, il paper del genoma del pesco), o più semplicemente ti misuri la mediana, e poi vedi a quale distanza la mediana è la metà del valore massimo. Ultimamente questo è il mio metodo preferito!

Hi

i would like to draw LD plot for three population and i want to have a color plot. i dont know, which command should i use?

i use the following command , but not color myplot.

library(ggplot2)

ld <- read.table("plink.ld", sep="\t",header=T)

ggplot(ld) +

geom_line(aes(x=BP_B – BP_A, y = R2))

Hi, Reza, I am sorry, I am not so expert of ggplot2. Maybe you can look at some tutorial of ggplot2, I think they should explain how to change plot colors.