In the desperate effort of understanding the secret of life it may be too simplistic to just count the differences between two strings of equal length. You might as well want to know where they differ.
You can do that recycling most of the function published in a previous post.
You can use it to compare two nucleotide sequences, two amino acid sequences or just two strings of equal length.
To facilitate analysis of DNA sequences, I gave the opportunity to ignore case (so that an “a” is equivalent to “A”) and to ignore positions carrying special characters, such as “N”, “?” or “-“.
Input:
a: first sequence (or string)
b: second sequence (or string)
exclude: character (or vector of characters) to be excluded from comparison
ignore.case: logical. If TRUE consider “a” equal to “A”. If FALSE consider “a” different from “A”.
show.excluded: logical. If TRUE, positions carrying an excluded character (such as “-“) will be returned, as NA. IF FALSE, positions in which one of the two strings carry an excluded character will be omitted.
Output:
only.diff: Matrix (or vector) reporting differences between the two strings and their position.
As usual, feedback welcome!
list.string.diff<-function(a="ATTCGA-",b="attTGTT",exclude=c("-","?"),ignore.case=TRUE) { if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.") if(ignore.case==TRUE) { a<-toupper(a) b<-toupper(b) } seq.a<-unlist(strsplit(a,split="")) seq.b<-unlist(strsplit(b,split="")) diff.d<-rbind(seq.a,seq.b) only.diff<-diff.d[,diff.d[1,]!=diff.d[2,]] pos<-which(diff.d[1,]!=diff.d[2,]) only.diff<-rbind(pos,only.diff) for(ex.loop in 1:length(exclude)) { only.diff<-only.diff[,!(only.diff["seq.a",]==exclude[ex.loop]|only.diff["seq.b",]==exclude[ex.loop])] } return(only.diff) }
It might happen that you have two DNA or amino acid sequences of different length. I suggest that you align them with clustalW and save the output in a text file. You can then use the read.alignment() function of the R package seqinr to get the two aligned sequences, padded by “-” so that they have the same length.
I paste below an example of clustalW output
CLUSTAL 2.1 multiple sequence alignment test1 MESLGVRKGAWIQEEDVLLRKCIEKYGEGKWHLVPLRAGLNRCRKSCRLRWLNYLKPDIK test2 MESLGVRKGAWIQEEDVLLRKCIEKYGEGKWHLVPLRAGLNRCRKSCRLRWLNYLKPDIK ************************************************************ test1 RGEFALDEVDLMIRLHNLLGNRWSLIAGRLPGRTANDVKNYWHSHHFKKEVQFQEEGRDK test2 RGEFALDEVDLMI--------------GRLPGRTANDVKNYWHSHHFKKEVQFQEEGRDK ************************************************************ test1 PQTHSKTKAIKPHPHKFSKALPRFELKTTAVDTFDTQVSTSRKPSSTSPQPNDDIIWWES test2 PQTHSKTKAIKPHPHKFSKALPRFELKTTAVDTFDTQVSTSSKPSSTSPQPNDDIIWWES ***************************************** ****************** test1 LLAEHAQMDQETDFSASGEMLIASLRTEETATQKKGPMDGMIEQIQGGEGDFPFDVGFWD test2 LLAEHAPMDQETDFSASGEMLIASLRTEETATQKKGPMDGMIEQIQGGEGDFPFDVGFWD ****** ***************************************************** test1 TPNTQVNHLI test2 TPNTQVNHLI **********
You can then save this alignment with the name “test.clustalw”, read it with the following function which calls both read.alignment() and list.string.diff().
comp.seq<-function(infile="test.clustalw",format="clustal", exclude=c("-","?")) { library(seqinr) pseq<-read.alignment(infile,format=format) pseq1<-as.character(pseq$seq[1]) pseq2<-as.character(pseq$seq[2]) pdiff<-list.string.diff(a=pseq1,b=pseq2,exclude=exclude) return(pdiff) }
If you try, you should get the following output
[,1] [,2] pos "162" "187" seq.a "R" "Q" seq.b "S" "P"
Some nice improvements to the function. Thanks to richierocks for his comments!
list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE, show.excluded = FALSE) { if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.") if(ignore.case) { a <- toupper(a) b <- toupper(b) } split_seqs <- strsplit(c(a, b), split = "") only.diff <- (split_seqs[[1]] != split_seqs[[2]]) only.diff[ (split_seqs[[1]] %in% exclude) | (split_seqs[[2]] %in% exclude) ] <- NA diff.info<-data.frame(which(is.na(only.diff)|only.diff), split_seqs[[1]][only.diff],split_seqs[[2]][only.diff]) names(diff.info)<-c("position","poly.seq.a","poly.seq.b") if(!show.excluded) diff.info<-na.omit(diff.info) diff.info }
I notice a few things about your code.
If the lengths of the strings differ then throwing an error seems harsh. I’m not an expert in nulceotide sequences, but it seems that you could at least compare up to the length of the shortest sequence.
For testing if(some_logical_value == TRUE), you get the same thing as simply if(some_logical_value), so the latter is preferred.
unlisting and rbinding the output from strsplit seems an unnecessary task. The data is in a suitable format anyway.
I think you can get rid of the for loop for excluing values in exclude.
Here’s my suggested alternative.
list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE)
{
if(nchar(a) != nchar(b))
{
n <- min(nchar(a), nchar(b))
length(a) <- n
length(b) <- n
}
if(ignore.case)
{
a <- toupper(a)
b <- toupper(b)
}
split_seqs <- strsplit(c(a, b), split = "")
(split_seqs[[1]] != split_seqs[[2]]) &
!(split_seqs[[1]] %in% exclude) &
!(split_seqs[[2]] %in% exclude)
}
Actually, on second thoughts, you probably want to return NA for values where there was an excluded value. Possible alternative:
list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE)
{
if(nchar(a) != nchar(b))
{
n <- min(nchar(a), nchar(b))
length(a) <- n
length(b) <- n
}
if(ignore.case)
{
a <- toupper(a)
b <- toupper(b)
}
split_seqs <- strsplit(c(a, b), split = "")
only.diff <- (split_seqs[[1]] != split_seqs[[2]])
only.diff[
(split_seqs[[1]] %in% exclude) |
(split_seqs[[2]] %in% exclude)
] <- NA
only.diff
}
Hey, thank you very much for your comments!
I especially like the way of avoiding the loop (something I’m not good at).
I will incorporate most of them.
However, I prefer not to include the comparison of the shortest sequence for this reason: imagine you have a string a<-"abcdefgh" and a string b<-"bcdefgh", which is one character shorter. You can easily see that they are almost identical. However, if you compare them to the shorter sequence you might choose to compare "abcdefg" with "bcdefgh" and conclude that they are different. Instead, if you use an alignment software you might easily get a<-"abcdefgh" and b<-"-bcdefgh" which are actually identical. It might be a nice generalization to read the two strings, send them to an aligner and then read the output. However, since sending two strings to an alignment software is just a shell command, I prefer keeping it our of R.
You will also notice that I am insisting in trying to get a data frame out of the function, since I would like to have both the position at which the strings differ and the content of both strings where they differ. I found nothing better than actually building a data frame object. Any other options?
If you want to return a data frame, my natural instinct would be to return
as.data.frame(c(split_seqs, only.diff))
Then if you want the numeric locations of the differences, you can call which afterwards.
If your strings are sufficiently short (65000 characters or less), a nice way to visualise them might be to use the xlsx package to write them to an Excel file, using setCellStyle to give them a different coloured background if they match/don’t match.
(I don’t usually advocate using Excel because the data analysis facilities are garbage, but it’s great for colouring in tabular data.)
Hope this helps.
I will try your solution, thanks! At present my strings are shorter than 65000 characters, but they might often be much longer (1M characters). We’ll see…
Nice job!