## Perform pairwise Wilcoxon test, classify groups by significance and plot results

This post is the result of work performed in collaboration with my colleague Eleonora Paparelli (who actually did most of the work!). We wanted to compare several distributions using Wilcoxon test and summarize results (i.e. indicate the comparisons showing significant differences).
R base includes pairwise.wilcox.test to perform Wilcoxon rank sum test between all pairs of samples in a study.
A common way to represent significance in pairwise comparisons is the use of letters. Samples sharing a letter are not different from each other. Samples not sharing letters are different.
Library multcompView in R can take a square matrix of p-values and return letters for all samples.
Sadly enough, pairwise.wilcox.test returns a triangular matrix, so I had to write a small function – named tri.to.squ – to take the output of pairwise.wilcox.test and convert it in a suitable input for the multcompLetters function of multcompView.
Now we can easily plot the distributions as box plot and add the letters as text.

library(multcompView)

tri.to.squ<-function(x)
{
rn<-row.names(x)
cn<-colnames(x)
an<-unique(c(cn,rn))
myval<-x[!is.na(x)]
mymat<-matrix(1,nrow=length(an),ncol=length(an),dimnames=list(an,an))
for(ext in 1:length(cn))
{
for(int in 1:length(rn))
{
if(is.na(x[row.names(x)==rn[int],colnames(x)==cn[ext]])) next
mymat[row.names(mymat)==rn[int],colnames(mymat)==cn[ext]]<-x[row.names(x)==rn[int],colnames(x)==cn[ext]]
mymat[row.names(mymat)==cn[ext],colnames(mymat)==rn[int]]<-x[row.names(x)==rn[int],colnames(x)==cn[ext]]
}

}
return(mymat)
}
first.set<-cbind(rnorm(100,mean=1.8),1)
second.set<-cbind(rnorm(100,mean=0.9),2)
third.set<-cbind(rnorm(100,mean=1),3)

full<-rbind(first.set,second.set,third.set)

pp<-pairwise.wilcox.test(full[,1], full[,2], p.adjust.method = "none", paired = FALSE)
mymat<-tri.to.squ(pp\$p.value)
myletters<-multcompLetters(mymat,compare="<=",threshold=0.05,Letters=letters)

boxplot(full[,1] ~ full[,2],ylab="Something nice",names=c("Set 1","Set 2","Set 3"),ylim=c(min(full[,1]),0.3+max(full[,1])))
text(c(1,2,3),0.2+max(full[,1]),c(myletters\$Letters[1],myletters\$Letters[2],myletters\$Letters[3]))

## Reading large data tables in R

Ok, I confess.

Until yesterday I was one of those still trying to read large data tables using read.table.

Then, I came across this thread in stackoverflow and I saw the light.

Since I noticed that a lot of people still struggle with read.table I decided to write this very brief post.

Imagine that you have your large file named “mylargefile.txt”. Then all you have to do is the following.

library(data.table)

I promise I’ll never go back to read.table again!

## Summarize content of a vector or data.frame every n entries

I imagine that the same result can be achieved by a proper use of quantile, but I like to have an easy way to obtain summary statistics every n entries of my dataset be it a vector or data.frame.

The function takes three parameters: the R object on which we need to obtain statistics (x), how many entries should each summary contain (step, defaulting to 1000), and the function we want to apply (fun, defaulting to “mean”).

Then, it’s all about using aggregate.

The present version incorporates useful comments by pat and ap53!

summarize.by <- function(x,step=1000,fun="mean")
{
n <- NROW(x)
group<-sort(rep(seq(1,ceiling(n/step)),step)[1:n])
x <- data.frame(group,x)
x <- aggregate(x,by=list(x\$group),FUN=fun)
x <- x[,-c(1,2)]
return(x)
}

Example application and result for a data.frame:

dummy<-data.frame(matrix(runif(100000,0,1),ncol=10))
summarize.by(dummy)
X1        X2        X3        X4        X5        X6        X7
1  0.5081756 0.5206011 0.4972622 0.5060707 0.4907807 0.5063138 0.4982252
2  0.5014300 0.5093051 0.5015310 0.4718058 0.4931249 0.4882382 0.5084970
3  0.4994759 0.4979546 0.4964157 0.5138695 0.5018427 0.5228862 0.4980824
4  0.4970300 0.4953163 0.4954068 0.5157935 0.4770471 0.5000562 0.4960250
5  0.5118221 0.4967686 0.5114420 0.4945936 0.5016019 0.5003544 0.5016693
6  0.5026323 0.4995367 0.5003587 0.4970245 0.4992188 0.4993896 0.4873300
7  0.4911944 0.5081578 0.4858666 0.4974576 0.4864710 0.5022401 0.5058064
8  0.5050684 0.5021456 0.4970707 0.4829222 0.4980984 0.4901941 0.5053296
9  0.4910359 0.4883865 0.4915000 0.4984415 0.4941274 0.4933778 0.4964306
10 0.4832396 0.4986647 0.5017873 0.5008766 0.4952849 0.5036030 0.5084799
X8        X9       X10
1  0.5052379 0.4906292 0.4916262
2  0.5074966 0.5117570 0.5183119
3  0.4988349 0.5029704 0.5077726
4  0.4889516 0.5066026 0.5078195
5  0.5068717 0.4988389 0.5018225
6  0.5010366 0.4870614 0.4827767
7  0.5148197 0.5083662 0.5037901
8  0.4979452 0.5273463 0.4944513
9  0.5130718 0.5061075 0.5058208
10 0.4896030 0.4911127 0.4956848

And for a vector

dummy<-runif(10000,0,1)
summarize.by(dummy)
[1] 0.4914789 0.4908839 0.4951939 0.4928015 0.4911908 0.4994735 0.4947729
[8] 0.5058204 0.5026956 0.5018375

## A biologist’s guide to linux “screen”

I write here a very concise guide to linux screen (ok, I assume you know something about linux, even if you are a biologist…).
It takes five minutes to learn and you can live with it for all of your life. Trust me, I’m a biologist!

Why screen?
If you work on a remote server and there is an interruption of your internet/LAN connection, all you were doing is gone. Screen prevents you against these data losses.
How does it work?
Simple. To create a screen or resume an existing screen you can use the ever useful command
screen -dR

Then, when you are in screen you can do all your work ignoring you are in screen except that you have a series of screen commands that help you create several screens inside the same terminal and also move around them.
Any screen command starts pressing at the same time ctrl-a (yes, even in mac).
AFTER pressing ctrl-a you can press one of the following letters:
n: move to next screen (if existing)
p: move to previous screen (if existing)
d: detach screen

The only command that needs not be preceded by ctrl-a is exit. If you type exit inside screen you will quite screen and erase it from memory (so do not type exit if you have unsaved work).

## Restituire dignità alla wi-fi fastweb

Ammettiamolo, ultimamente la rete fastweb non è il massimo.
So di svariate persone che si riducono ad usare la chiavetta per navigare… da casa.
Anche la mitologica “riqualifica della linea”, la prima opzione che immancabilmente fastweb propone, ultimamente non dava risultati.
Stavo veramente meditando a quale nuovo operatore affidarmi quando ho avuto un lampo di genio (ebbene sì, io non ho culo, ho lampi di genio) e mi sono imbattuto in una serie di post che suggerivano di cambiare il canale del wireless.
Nonostante io sappia a malapena cambiare il canale della TV (strumento che da decenni non possiedo più) ho deciso di cimentarmi in questa impresa.
Con un po’ di colpi di c… ehm di genio, sono riuscito a farlo. Scrivo qui la mia avventura a imperitura memoria e affinché gli altri poveri sfigati come me vittime della lentezza del wi-fi fastweb vi trovino consolazione e – magari – la salvezza, come è capitato a me.

Allora, un po’ rocambolescamente ho trovato questa pagina di fastweb in cui si accede alla configurazione del wifi di fastweb (ovviamente bisogna effettuare il login):
http://www.fastweb.it/myfastpage/assistenza/guide/assistenza-tecnica/internet/non-funziona-internet-wireless/

Si presenta una pagina, nelle cui prime righe c’è un link un po’ mimetizzato che io però ho evidenziato con una freccia rossa

Una volta cliccato, si aprirà un pop-up nel quale ci sono vari settaggi. Basta cercare “Modifica canale wi-fi” e cambiare il numero.

Io, dall’alto delle mie competenze informatiche, ho cambiato a caso.
Ha funzionato.
Auguro anche a voi la stessa fortuna.

## R and theater

You might ask what R has to do with theater.
I assure you it has. I act in the theater group ‘ndescenze. We will soon present (actually, we just performed) a show based on the Marx Brothers Radio Shows. We shuffle actors and characters during the show (we like it complicated!) and we needed to find at least one sequence of scenes which allowed people to change dress and be ready quickly to be back on stage.
I manually described a number of rules based on the roles of the actors.
For example, “Teatro”, in which I act as Ravelli and I am on stage until the end of the scene, cannot be followed by “Autobus”, in which I act as Miss Dimple and I am on stage in the beginning of the scene…
After doing that, I just needed to find the combination of six scenes (named “affitto”,”autobus”,”eredita”,”ristorante”,”tassista” and “teatro”) which didn’t break any of the imposed rules.
No need to say that I used R! It worked, I found a good answer, but with a really ugly piece of code… Let’s see if you can find the right answer(s) and with a better piece of code (ok, the latter is not so difficult)

Here I list the rules:
“teatro” must take place after “affitto”, because in “affitto” we explain some things that will be needed for “teatro”.
They don’t need to be consecutive, though.
Then a list of consecutive combinations that are forbidden
“affitto” cannot be followed by “tassista” nor by “teatro”
“autobus” cannot be followed by “ristorante”
“eredita” cannot be followed by “tassista” nor by “autobus”
“ristorante” cannot be followed by “affitto”, “tassista”, “teatro” nor by “autobus”
“tassista” cannot be followed by “affitto”, “teatro” nor by “eredita”
“teatro” cannot be followed by “autobus”, “affitto”, nor by “ristorante”

Here’s the code:

scenes<-c("affitto","autobus","eredita","ristorante","tassista","teatro")
choose.me<-expand.grid(list(scenes,scenes,scenes,scenes,scenes,scenes),
stringsAsFactors=FALSE)
for (bbb in 1: nrow(choose.me))
{
if(length(unique(as.character(choose.me[bbb,])))<6)
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="affitto") > which(choose.me[bbb,]=="teatro"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="affitto")+1==which(choose.me[bbb,]=="tassista"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="affitto")+1==which(choose.me[bbb,]=="teatro"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="autobus")+1==which(choose.me[bbb,]=="ristorante"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="eredita")+1==which(choose.me[bbb,]=="tassista"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="eredita")+1==which(choose.me[bbb,]=="autobus"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="ristorante")+1==which(choose.me[bbb,]=="affitto"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="ristorante")+1==which(choose.me[bbb,]=="tassista"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="ristorante")+1==which(choose.me[bbb,]=="teatro"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="ristorante")+1==which(choose.me[bbb,]=="autobus"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="tassista")+1==which(choose.me[bbb,]=="affitto"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="tassista")+1==which(choose.me[bbb,]=="teatro"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="tassista")+1==which(choose.me[bbb,]=="eredita"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="teatro")+1==which(choose.me[bbb,]=="autobus"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="teatro")+1==which(choose.me[bbb,]=="affitto"))
{choose.me[bbb,]<-rep(NA,6);next}
if(which(choose.me[bbb,]=="teatro")+1==which(choose.me[bbb,]=="ristorante"))
{choose.me[bbb,]<-rep(NA,6);next}
}
choose.me<-na.omit(choose.me)

Using this code, I was able to find 10 combinations of scenes that were technically feasible. Among them we chose the one likely to have the best impact on the audience…
The list of all the technically possible scenes and the chosen one will be out soon… Obviously, if you attended our show you already know!

Improved code by Paul Fruin
Following Paul’s comment I paste below the complete code for solving the same problem. I tested it and it provides the right answer. In addition, Paul’s code avoids the loop and thus saves a considerable amount of time! I am always happy when I learn a way to avid a loop, so thanks Paul!

BTW: stay tuned, I will soon add alternative versions suggested by comments… I just need some time to test them and comment them!!

Here’s Paul’s code:

scenes<-c("affitto","autobus","eredita","ristorante","tassista","teatro")
choose.me<-expand.grid(list(scenes,scenes,scenes,scenes,scenes,scenes),
stringsAsFactors=FALSE)
##now knock out rows step by step, using apply() to index the bad rows
choose.me[ apply(choose.me,1,function(x)
! length(unique(as.character(x)))<6),] -> choose.me

#remove rows based on next criteria
choose.me[ apply(choose.me,1,function(x)
! which(x=="affitto") > which(x=="teatro") ) , ] -> choose.me

#remove al the other rows...
choose.me[ apply(choose.me,1,function(x)
! which(x=="affitto")+1 == which(x=="tassista") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="affitto")+1 == which(x=="teatro") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="autobus")+1 == which(x=="ristorante") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="eredita")+1 == which(x=="tassista") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="eredita")+1 == which(x=="autobus") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="ristorante")+1 == which(x=="affitto") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="ristorante")+1 == which(x=="tassista") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="ristorante")+1 == which(x=="teatro") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="ristorante")+1 == which(x=="autobus") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="tassista")+1 == which(x=="affitto") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="tassista")+1 == which(x=="teatro") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="tassista")+1 == which(x=="eredita") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="teatro")+1 == which(x=="autobus") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="teatro")+1 == which(x=="affitto") ) , ] -> choose.me
choose.me[ apply(choose.me,1,function(x)
! which(x=="teatro")+1 == which(x=="ristorante") ) , ] -> choose.me

Carlos Ortega provided an even faster solution. It uses libraries “combinat” and “stringr” and it runs in less than 3 secs even on my old laptop! I tested it and it provides the right answer. I promise that I will try to learn this efficient implementation!
You can find Carlo’s code below:

a<-Sys.time()
library(combinat)
library(stringr)

scenes<-c("affitto","autobus","eredita","ristorante","tassista","teatro")
all.sce<-permn(scenes )

foo<-function(x) {
str_c(x, collapse="-")
}
sce.god<-lapply(all.sce, foo)

# Working data.frame
sce.end<-as.data.frame(sce.god)
names(sce.end)<-NULL
sce.wrk<-t(sce.end)

###### Forbidden Rules
r01<-c('affitto-tassista')
r02<-c('affitto-teatro')
r03<-c('autobus-ristorante')
r04<-c('eredita-tassista')
r05<-c('eredita-autobus')
r06<-c('ristorante-affitto')
r07<-c('ristorante-tassista')
r08<-c('ristorante-teatro')
r09<-c('ristorante-autobus')
r10<-c('tassista-affitto')
r11<-c('tassista-teatro')
r12<-c('tassista-eredita')
r13<-c('teatro-autobus')
r14<-c('teatro-affitto')
r15<-c('teatro-ristorante')
r16<-c('^teatro')

rul.for<-data.frame(
r01, r02, r03, r04, r05,
r06, r07, r08, r09, r10,
r11, r12, r13, r14, r15, r16
)

############# Process to get the valid scenes
sce.ini<-sce.wrk

for(i in 1:ncol(rul.for)) {

rul.tmp<-as.vector(rul.for[1,i])
val.tmp<-str_locate(sce.ini, rul.tmp)
sce.tmp<-sce.ini[is.na(val.tmp[,1])]

sce.ini<-sce.tmp

}

### To eliminate first rule Ð teatro after affittoÉ
te.val<-str_locate(sce.ini, "teatro")
af.val<-str_locate(sce.ini, "affitto")
te.af<-data.frame(te=te.val[,1],af=af.val[,1])

no.val<-as.numeric(row.names(te.af[te.af\$te < te.af\$af,]))

### Valid Scenes
valid.scenes <- sce.ini[-no.val]
valid.scenes

b<-Sys.time()
b-a

Last but not least, Ben Bolker found another elegant solution providing the correct answer.
I paste it below:

scenes<-c("affitto","autobus","eredita","ristorante","tassista","teatro")
choose.me<-expand.grid(list(scenes,scenes,scenes,scenes,scenes,scenes),
stringsAsFactors=FALSE)
##now knock out rows step by step, using apply() to index the bad rows
choose.me[ apply(choose.me,1,function(x) ! length(unique(as.character(x)))<6),] -> choose.me

choose.me[ apply(choose.me,1,function(x)
! which(x=="affitto") > which(x=="teatro") ) , ] -> choose.me

##remove rows based on next criteria
apply(choose.me,1,function(x)
! (which(x==t1)+1 == which(x==t2)))
}
omit_pairs <- matrix(c("affitto","tassista",
"affitto","teatro",
"autobus","ristorante",
"eredita","tassista",
"eredita","autobus",
"ristorante","affitto",
"ristorante","tassista",
"ristorante","teatro",
"ristorante","autobus",
"tassista","affitto",
"tassista","teatro",
"tassista","eredita",
"teatro","autobus",
"teatro","affitto",
"teatro","ristorante"),
ncol=2,byrow=TRUE)
invisible(apply(omit_pairs,1,
function(z) {