Reading large data tables in R

Ok, I confess.

Until yesterday I was one of those still trying to read large data tables using read.table.

Then, I came across this thread in stackoverflow and I saw the light.

Since I noticed that a lot of people still struggle with read.table I decided to write this very brief post.

Imagine that you have your large file named “mylargefile.txt”. Then all you have to do is the following.


I promise I’ll never go back to read.table again!


Summarize content of a vector or data.frame every n entries

I imagine that the same result can be achieved by a proper use of quantile, but I like to have an easy way to obtain summary statistics every n entries of my dataset be it a vector or data.frame.

The function takes three parameters: the R object on which we need to obtain statistics (x), how many entries should each summary contain (step, defaulting to 1000), and the function we want to apply (fun, defaulting to “mean”).

Then, it’s all about using aggregate.

The present version incorporates useful comments by pat and ap53! <- function(x,step=1000,fun="mean")
n <- NROW(x)
x <- data.frame(group,x)
x <- aggregate(x,by=list(x$group),FUN=fun)
x <- x[,-c(1,2)]

Example application and result for a data.frame:

          X1        X2        X3        X4        X5        X6        X7
1  0.5081756 0.5206011 0.4972622 0.5060707 0.4907807 0.5063138 0.4982252
2  0.5014300 0.5093051 0.5015310 0.4718058 0.4931249 0.4882382 0.5084970
3  0.4994759 0.4979546 0.4964157 0.5138695 0.5018427 0.5228862 0.4980824
4  0.4970300 0.4953163 0.4954068 0.5157935 0.4770471 0.5000562 0.4960250
5  0.5118221 0.4967686 0.5114420 0.4945936 0.5016019 0.5003544 0.5016693
6  0.5026323 0.4995367 0.5003587 0.4970245 0.4992188 0.4993896 0.4873300
7  0.4911944 0.5081578 0.4858666 0.4974576 0.4864710 0.5022401 0.5058064
8  0.5050684 0.5021456 0.4970707 0.4829222 0.4980984 0.4901941 0.5053296
9  0.4910359 0.4883865 0.4915000 0.4984415 0.4941274 0.4933778 0.4964306
10 0.4832396 0.4986647 0.5017873 0.5008766 0.4952849 0.5036030 0.5084799
          X8        X9       X10
1  0.5052379 0.4906292 0.4916262
2  0.5074966 0.5117570 0.5183119
3  0.4988349 0.5029704 0.5077726
4  0.4889516 0.5066026 0.5078195
5  0.5068717 0.4988389 0.5018225
6  0.5010366 0.4870614 0.4827767
7  0.5148197 0.5083662 0.5037901
8  0.4979452 0.5273463 0.4944513
9  0.5130718 0.5061075 0.5058208
10 0.4896030 0.4911127 0.4956848

And for a vector

 [1] 0.4914789 0.4908839 0.4951939 0.4928015 0.4911908 0.4994735 0.4947729
 [8] 0.5058204 0.5026956 0.5018375

A biologist’s guide to linux “screen”

I write here a very concise guide to linux screen (ok, I assume you know something about linux, even if you are a biologist…).
It takes five minutes to learn and you can live with it for all of your life. Trust me, I’m a biologist!

Why screen?
If you work on a remote server and there is an interruption of your internet/LAN connection, all you were doing is gone. Screen prevents you against these data losses.
How does it work?
Simple. To create a screen or resume an existing screen you can use the ever useful command
screen -dR

Then, when you are in screen you can do all your work ignoring you are in screen except that you have a series of screen commands that help you create several screens inside the same terminal and also move around them.
Any screen command starts pressing at the same time ctrl-a (yes, even in mac).
AFTER pressing ctrl-a you can press one of the following letters:
c: create an additional screen
n: move to next screen (if existing)
p: move to previous screen (if existing)
d: detach screen

The only command that needs not be preceded by ctrl-a is exit. If you type exit inside screen you will quite screen and erase it from memory (so do not type exit if you have unsaved work).

Restituire dignità alla wi-fi fastweb

Ammettiamolo, ultimamente la rete fastweb non è il massimo.
So di svariate persone che si riducono ad usare la chiavetta per navigare… da casa.
Anche la mitologica “riqualifica della linea”, la prima opzione che immancabilmente fastweb propone, ultimamente non dava risultati.
Stavo veramente meditando a quale nuovo operatore affidarmi quando ho avuto un lampo di genio (ebbene sì, io non ho culo, ho lampi di genio) e mi sono imbattuto in una serie di post che suggerivano di cambiare il canale del wireless.
Nonostante io sappia a malapena cambiare il canale della TV (strumento che da decenni non possiedo più) ho deciso di cimentarmi in questa impresa.
Con un po’ di colpi di c… ehm di genio, sono riuscito a farlo. Scrivo qui la mia avventura a imperitura memoria e affinché gli altri poveri sfigati come me vittime della lentezza del wi-fi fastweb vi trovino consolazione e – magari – la salvezza, come è capitato a me.

Allora, un po’ rocambolescamente ho trovato questa pagina di fastweb in cui si accede alla configurazione del wifi di fastweb (ovviamente bisogna effettuare il login):

Si presenta una pagina, nelle cui prime righe c’è un link un po’ mimetizzato che io però ho evidenziato con una freccia rossa


Una volta cliccato, si aprirà un pop-up nel quale ci sono vari settaggi. Basta cercare “Modifica canale wi-fi” e cambiare il numero.


Io, dall’alto delle mie competenze informatiche, ho cambiato a caso.
Ha funzionato.
Auguro anche a voi la stessa fortuna.

R and theater

You might ask what R has to do with theater.
I assure you it has. I act in the theater group ‘ndescenze. We will soon present (actually, we just performed) a show based on the Marx Brothers Radio Shows. We shuffle actors and characters during the show (we like it complicated!) and we needed to find at least one sequence of scenes which allowed people to change dress and be ready quickly to be back on stage.
I manually described a number of rules based on the roles of the actors.
For example, “Teatro”, in which I act as Ravelli and I am on stage until the end of the scene, cannot be followed by “Autobus”, in which I act as Miss Dimple and I am on stage in the beginning of the scene…
After doing that, I just needed to find the combination of six scenes (named “affitto”,”autobus”,”eredita”,”ristorante”,”tassista” and “teatro”) which didn’t break any of the imposed rules.
No need to say that I used R! It worked, I found a good answer, but with a really ugly piece of code… Let’s see if you can find the right answer(s) and with a better piece of code (ok, the latter is not so difficult)

Here I list the rules:
“teatro” must take place after “affitto”, because in “affitto” we explain some things that will be needed for “teatro”.
They don’t need to be consecutive, though.
Then a list of consecutive combinations that are forbidden
“affitto” cannot be followed by “tassista” nor by “teatro”
“autobus” cannot be followed by “ristorante”
“eredita” cannot be followed by “tassista” nor by “autobus”
“ristorante” cannot be followed by “affitto”, “tassista”, “teatro” nor by “autobus”
“tassista” cannot be followed by “affitto”, “teatro” nor by “eredita”
“teatro” cannot be followed by “autobus”, “affitto”, nor by “ristorante”

Here’s the code:

for (bbb in 1: nrow(
    if(which([bbb,]=="affitto") > which([bbb,]=="teatro")) 

Using this code, I was able to find 10 combinations of scenes that were technically feasible. Among them we chose the one likely to have the best impact on the audience…
The list of all the technically possible scenes and the chosen one will be out soon… Obviously, if you attended our show you already know!
Waiting for comments on how to improve the code and your guesses for the right answer!

Improved code by Paul Fruin
Following Paul’s comment I paste below the complete code for solving the same problem. I tested it and it provides the right answer. In addition, Paul’s code avoids the loop and thus saves a considerable amount of time! I am always happy when I learn a way to avid a loop, so thanks Paul!

BTW: stay tuned, I will soon add alternative versions suggested by comments… I just need some time to test them and comment them!!

Here’s Paul’s code:

##now knock out rows step by step, using apply() to index the bad rows[ apply(,1,function(x) 
					! length(unique(as.character(x)))<6),] ->

#remove rows based on next criteria[ apply(,1,function(x)
					! which(x=="affitto") > which(x=="teatro") ) , ] ->

#remove al the other rows...[ apply(,1,function(x)
					! which(x=="affitto")+1 == which(x=="tassista") ) , ] ->[ apply(,1,function(x)
					! which(x=="affitto")+1 == which(x=="teatro") ) , ] ->[ apply(,1,function(x)
					! which(x=="autobus")+1 == which(x=="ristorante") ) , ] ->[ apply(,1,function(x)
					! which(x=="eredita")+1 == which(x=="tassista") ) , ] ->[ apply(,1,function(x)
					! which(x=="eredita")+1 == which(x=="autobus") ) , ] ->[ apply(,1,function(x)
					! which(x=="ristorante")+1 == which(x=="affitto") ) , ] ->[ apply(,1,function(x)
					! which(x=="ristorante")+1 == which(x=="tassista") ) , ] ->[ apply(,1,function(x)
					! which(x=="ristorante")+1 == which(x=="teatro") ) , ] ->[ apply(,1,function(x)
					! which(x=="ristorante")+1 == which(x=="autobus") ) , ] ->[ apply(,1,function(x)
					! which(x=="tassista")+1 == which(x=="affitto") ) , ] ->[ apply(,1,function(x)
					! which(x=="tassista")+1 == which(x=="teatro") ) , ] ->[ apply(,1,function(x)
					! which(x=="tassista")+1 == which(x=="eredita") ) , ] ->[ apply(,1,function(x)
					! which(x=="teatro")+1 == which(x=="autobus") ) , ] ->[ apply(,1,function(x)
					! which(x=="teatro")+1 == which(x=="affitto") ) , ] ->[ apply(,1,function(x)
					! which(x=="teatro")+1 == which(x=="ristorante") ) , ] ->

Carlos Ortega provided an even faster solution. It uses libraries “combinat” and “stringr” and it runs in less than 3 secs even on my old laptop! I tested it and it provides the right answer. I promise that I will try to learn this efficient implementation!
You can find Carlo’s code below:


all.sce<-permn(scenes )

foo<-function(x) {
	str_c(x, collapse="-")
sce.god<-lapply(all.sce, foo)

# Working data.frame

###### Forbidden Rules

		r01, r02, r03, r04, r05,
		r06, r07, r08, r09, r10,
		r11, r12, r13, r14, r15, r16

############# Process to get the valid scenes

for(i in 1:ncol(rul.for)) {
	val.tmp<-str_locate(sce.ini, rul.tmp)

### To eliminate first rule Ð teatro after affittoÉ
te.val<-str_locate(sce.ini, "teatro")
af.val<-str_locate(sce.ini, "affitto")<-data.frame(te=te.val[,1],af=af.val[,1])

no.val<-as.numeric(row.names([$te <$af,]))

### Valid Scenes
valid.scenes <- sce.ini[-no.val]


Last but not least, Ben Bolker found another elegant solution providing the correct answer.
I paste it below:

##now knock out rows step by step, using apply() to index the bad rows[ apply(,1,function(x) ! length(unique(as.character(x)))<6),] ->[ apply(,1,function(x)
! which(x=="affitto") > which(x=="teatro") ) , ] ->

##remove rows based on next criteria
not_adjacent <- function(t1,t2) {
! (which(x==t1)+1 == which(x==t2)))
omit_pairs <- matrix(c("affitto","tassista",
function(z) { <<-[not_adjacent(z[1],z[2]),]
nrow( ## 10

All the solutions provided so far were correct, fast and elegant. This makes me realize the paucity of my R skills, but at the same time stimulates me to learn more!

Introduction to R

I am happy to repost the information I got about the course “Introduction to R” that will be organized by Milano R net in collaboration with Quantide.
The course will be held in Milano, Italy, June 7-8, 2012, and is intended to introduce the unexperienced user to R.
For furhter info visit milano R net blog.

Milano R net meeting

I received the announce of the first Milano R net meeting, and I am glad to post it below. If you are based in Milano area you might think about going. If you are not based in Milano area, you might think about going as well!

Milano R net
Milano R net is a users group dedicated to bringing together area practitioners of the popular open source R statistics language to exchange knowledge, learn and share tricks and techniques, provide R beginners with an opportunity to meet more experienced users. The aims is to spur the adoption of R for innovative research and commercial applications, see case studies of real-life R applications in industry and government. Milano R net founders are: R consulting company Quantide’s Andrea Spanò, R-core member and University of Milan professor Stefano Iacus, and quantitative consultant Daniele Amberti.

First Milano R net meeting
The first meeting will be on May 8. Andrea Spanò will introduce the meeting. Speakers are: Stefano Iacus (R Development Core Team), Fabio Piacenza (Unicredit Group) and Alberto Santini (Monte dei Paschi di Siena).
Since today about 45 people subscribed to meeting. Subscribers background are very different: this is required for an interesting exchange of information.

Information and registrations:

Quantide ( and Revolution Analytics ( sponsor the meeting.

Future meetings
The group is looking for suggestions for future talks and presenters, so if you’re in the Milan area (or know another R user who is), please get in touch via the