Extract different characters between two strings of equal length

In the desperate effort of understanding the secret of life it may be too simplistic to just count the differences between two strings of equal length. You might as well want to know where they differ.

You can do that recycling most of the function published in a previous post.

You can use it to compare two nucleotide sequences, two amino acid sequences or just two strings of equal length.

To facilitate analysis of DNA sequences, I gave the opportunity to ignore case (so that an “a” is equivalent to “A”) and to ignore positions carrying special characters, such as “N”, “?” or “-“.

Input:
a: first sequence (or string)
b: second sequence (or string)
exclude: character (or vector of characters) to be excluded from comparison
ignore.case: logical. If TRUE consider “a” equal to “A”. If FALSE consider “a” different from “A”.
show.excluded: logical. If TRUE, positions carrying an excluded character (such as “-“) will be returned, as NA. IF FALSE, positions in which one of the two strings carry an excluded character will be omitted.

Output:
only.diff: Matrix (or vector) reporting differences between the two strings and their position.

As usual, feedback welcome!

list.string.diff<-function(a="ATTCGA-",b="attTGTT",exclude=c("-","?"),ignore.case=TRUE)
{
if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.")
if(ignore.case==TRUE)
{
a<-toupper(a)
b<-toupper(b)
}
seq.a<-unlist(strsplit(a,split=""))
seq.b<-unlist(strsplit(b,split=""))
diff.d<-rbind(seq.a,seq.b)
only.diff<-diff.d[,diff.d[1,]!=diff.d[2,]]
pos<-which(diff.d[1,]!=diff.d[2,])
only.diff<-rbind(pos,only.diff)
for(ex.loop in 1:length(exclude))
{
only.diff<-only.diff[,!(only.diff["seq.a",]==exclude[ex.loop]|only.diff["seq.b",]==exclude[ex.loop])]
}
return(only.diff)
}

It might happen that you have two DNA or amino acid sequences of different length. I suggest that you align them with clustalW and save the output in a text file. You can then use the read.alignment() function of the R package seqinr to get the two aligned sequences, padded by “-” so that they have the same length.
I paste below an example of clustalW output

CLUSTAL 2.1 multiple sequence alignment


test1      MESLGVRKGAWIQEEDVLLRKCIEKYGEGKWHLVPLRAGLNRCRKSCRLRWLNYLKPDIK
test2      MESLGVRKGAWIQEEDVLLRKCIEKYGEGKWHLVPLRAGLNRCRKSCRLRWLNYLKPDIK
           ************************************************************

test1      RGEFALDEVDLMIRLHNLLGNRWSLIAGRLPGRTANDVKNYWHSHHFKKEVQFQEEGRDK
test2      RGEFALDEVDLMI--------------GRLPGRTANDVKNYWHSHHFKKEVQFQEEGRDK
           ************************************************************

test1      PQTHSKTKAIKPHPHKFSKALPRFELKTTAVDTFDTQVSTSRKPSSTSPQPNDDIIWWES
test2      PQTHSKTKAIKPHPHKFSKALPRFELKTTAVDTFDTQVSTSSKPSSTSPQPNDDIIWWES
           ***************************************** ******************

test1      LLAEHAQMDQETDFSASGEMLIASLRTEETATQKKGPMDGMIEQIQGGEGDFPFDVGFWD
test2      LLAEHAPMDQETDFSASGEMLIASLRTEETATQKKGPMDGMIEQIQGGEGDFPFDVGFWD
           ****** *****************************************************

test1      TPNTQVNHLI
test2      TPNTQVNHLI
           **********

You can then save this alignment with the name “test.clustalw”, read it with the following function which calls both read.alignment() and list.string.diff().

comp.seq<-function(infile="test.clustalw",format="clustal",
                   exclude=c("-","?"))
{
library(seqinr)
pseq<-read.alignment(infile,format=format)
pseq1<-as.character(pseq$seq[1])
pseq2<-as.character(pseq$seq[2])
pdiff<-list.string.diff(a=pseq1,b=pseq2,exclude=exclude)
return(pdiff)
}

If you try, you should get the following output

      [,1]  [,2] 
pos   "162" "187"
seq.a "R"   "Q"  
seq.b "S"   "P"

Some nice improvements to the function. Thanks to richierocks for his comments!

list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE, show.excluded = FALSE)
{
if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.")
if(ignore.case)
{
a <- toupper(a)
b <- toupper(b)
}
split_seqs <- strsplit(c(a, b), split = "")
only.diff <- (split_seqs[[1]] != split_seqs[[2]])
only.diff[
(split_seqs[[1]] %in% exclude) |
(split_seqs[[2]] %in% exclude)
] <- NA
diff.info<-data.frame(which(is.na(only.diff)|only.diff),
split_seqs[[1]][only.diff],split_seqs[[2]][only.diff])
names(diff.info)<-c("position","poly.seq.a","poly.seq.b")
if(!show.excluded) diff.info<-na.omit(diff.info)
diff.info
}

amino acid sequences, dna sequences, nucleotide sequences

January 3, 2012

6 Comments

6 comments

richierocks

I notice a few things about your code.

If the lengths of the strings differ then throwing an error seems harsh. I’m not an expert in nulceotide sequences, but it seems that you could at least compare up to the length of the shortest sequence.

For testing if(some_logical_value == TRUE), you get the same thing as simply if(some_logical_value), so the latter is preferred.

unlisting and rbinding the output from strsplit seems an unnecessary task. The data is in a suitable format anyway.

I think you can get rid of the for loop for excluing values in exclude.

Here’s my suggested alternative.

list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE)
{
if(nchar(a) != nchar(b))
{
n <- min(nchar(a), nchar(b))
length(a) <- n
length(b) <- n
}
if(ignore.case)
{
a <- toupper(a)
b <- toupper(b)
}
split_seqs <- strsplit(c(a, b), split = "")
(split_seqs[[1]] != split_seqs[[2]]) &
!(split_seqs[[1]] %in% exclude) &
!(split_seqs[[2]] %in% exclude)
}

Permalink, Reply

January 4, 2012 4:40 pm

richierocks

Actually, on second thoughts, you probably want to return NA for values where there was an excluded value. Possible alternative:

list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE)
{
if(nchar(a) != nchar(b))
{
n <- min(nchar(a), nchar(b))
length(a) <- n
length(b) <- n
}
if(ignore.case)
{
a <- toupper(a)
b <- toupper(b)
}
split_seqs <- strsplit(c(a, b), split = "")
only.diff <- (split_seqs[[1]] != split_seqs[[2]])
only.diff[
(split_seqs[[1]] %in% exclude) |
(split_seqs[[2]] %in% exclude)
] <- NA
only.diff
}

Permalink, Reply

January 4, 2012 4:44 pm

fabiomarroni

Hey, thank you very much for your comments!
I especially like the way of avoiding the loop (something I’m not good at).
I will incorporate most of them.
However, I prefer not to include the comparison of the shortest sequence for this reason: imagine you have a string a<-"abcdefgh" and a string b<-"bcdefgh", which is one character shorter. You can easily see that they are almost identical. However, if you compare them to the shorter sequence you might choose to compare "abcdefg" with "bcdefgh" and conclude that they are different. Instead, if you use an alignment software you might easily get a<-"abcdefgh" and b<-"-bcdefgh" which are actually identical. It might be a nice generalization to read the two strings, send them to an aligner and then read the output. However, since sending two strings to an alignment software is just a shell command, I prefer keeping it our of R.
You will also notice that I am insisting in trying to get a data frame out of the function, since I would like to have both the position at which the strings differ and the content of both strings where they differ. I found nothing better than actually building a data frame object. Any other options?

Permalink, Reply

January 4, 2012 6:24 pm
- richierocks
  
  If you want to return a data frame, my natural instinct would be to return
  
  as.data.frame(c(split_seqs, only.diff))
  
  Then if you want the numeric locations of the differences, you can call which afterwards.
  
  If your strings are sufficiently short (65000 characters or less), a nice way to visualise them might be to use the xlsx package to write them to an Excel file, using setCellStyle to give them a different coloured background if they match/don’t match.
  
  (I don’t usually advocate using Excel because the data analysis facilities are garbage, but it’s great for colouring in tabular data.)
  
  Hope this helps.
  
  Permalink
  
  January 4, 2012 8:29 pm

fabiomarroni

I will try your solution, thanks! At present my strings are shorter than 65000 characters, but they might often be much longer (1M characters). We’ll see…

Permalink, Reply

January 5, 2012 9:40 am

Janko

Nice job!

Permalink, Reply

January 5, 2012 11:03 am

Fabio Marroni's Blog