A simple mbox file splitter in R

As part of my job I had to find a way to extract individual messages from a Linux mail spool file. R has very good built-in functions for text manipulation, so I wrote some helper functions in it that read in a spool file and sort the individual messages into a list object. After that I can filter those messages to my taste using R's capabilities.

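as.list.mbox <- function(mbox) {
  ## Candidate message starts: lines beginning with the "From " separator
  start0 <- grep("^From ",mbox)
  ## Keep only those that open the file or follow a blank line
  start <- start0[sapply(start0,function(x) ifelse(x > 1,mbox[x-1]=="",TRUE))]
  ## Each message ends on the line before the next one starts
  end <- c((start-1)[-1],length(mbox))
  lapply(as.data.frame(rbind(start,end)),function(x) mbox[x[1]:x[2]])
}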

It works like this:

mbox <- readLines("/var/mail/username")
class(mbox) <- "mbox"   # so that as.list() dispatches to as.list.mbox()
mails <- as.list(mbox)
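
To try the splitting logic without a real spool file, here is a minimal sketch on a made-up in-memory spool. The name split_mbox and the sample addresses are hypothetical. It is a vectorised variant of the same idea (a line starting with "From " that opens the file or follows a blank line starts a new message), not the function above.

```r
# Hypothetical stand-in for readLines("/var/mail/username"): two messages.
spool <- c(
  "From alice@example.org Mon Jan  1 10:00:00 2024",
  "From: alice@example.org",
  "Subject: hello there",
  "",
  "First line of the body",
  "From now on this is still the body",  # starts with "From " but no blank line before it
  "",
  "From bob@example.org Tue Jan  2 11:00:00 2024",
  "From: bob@example.org",
  "Subject: re: hello",
  "",
  "Second message body"
)

# Sketch of a vectorised splitter (hypothetical name).
split_mbox <- function(lines) {
  is_from    <- grepl("^From ", lines)
  # TRUE for the first line and for every line whose predecessor is empty
  prev_blank <- c(TRUE, lines[-length(lines)] == "")
  start <- which(is_from & prev_blank)
  if (length(start) == 0) return(list())
  end <- c(start[-1] - 1, length(lines))
  Map(function(s, e) lines[s:e], start, end)
}

mails <- split_mbox(spool)
length(mails)
```

The body line that begins with "From " stays inside the first message because the line before it is not blank, which is exactly what the blank-line check is for.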

I agree that it's not the most efficient way of extracting mails from a large mbox file, but it does the job in my case because my file is small. Besides that, I also have a header parser:

header <- function(m) {
  ## From field (anchored, so body text or "X-From:" does not match)
  ind.from <- grep("^From:",m)[1]
  from <- strsplit(m[ind.from]," ")[[1]][2]
  ## Subject field (first word only)
  ind.sub <- grep("^Subject:",m)[1]
  sub <- strsplit(m[ind.sub]," ")[[1]][2]
  ## Lines field
  ind.lin <- grep("^Lines:",m)[1]
  lin <- strsplit(m[ind.lin]," ")[[1]][2]
  ## A missing field comes back as NA
  list(from=from,subject=sub,lines=as.integer(lin))
}

It includes only the fields I needed for my work, but you can extend it along the same lines if necessary. Parsing of the subject field is a bit imperfect: I'm only interested in the first word of the subject, so that is all it returns. Its usage is:

header(mails[[1]])
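
If you need the whole value of a field rather than just its first word, one possible extension is sketched below. The helper name header_field is hypothetical, and the sketch assumes that each header fits on a single line (folded headers that continue on the next line are not handled) and that the header block ends at the first blank line.

```r
# Sketch: return the full value of one header field, or NA if it is absent.
header_field <- function(m, field) {
  # Restrict the search to the header block (everything before the first blank line)
  blank <- which(m == "")
  hdr <- if (length(blank) > 0) m[seq_len(blank[1] - 1)] else m
  hit <- grep(paste0("^", field, ":"), hdr, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Drop the field name and the colon, then trim surrounding whitespace
  trimws(sub("^[^:]*:", "", hit[1]))
}

# Hypothetical sample message
m <- c(
  "From alice@example.org Mon Jan  1 10:00:00 2024",
  "From: alice@example.org",
  "Subject: hello there",
  "Lines: 2",
  "",
  "Subject: this line is body text, not a header",
  "Bye"
)

header_field(m, "Subject")
header_field(m, "Cc")
```

Limiting the search to the header block keeps a body line that happens to look like a header from being picked up.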

I also have a function for getting the actual content of the mail:

content <- function(m) {
  l <- header(m)$lines
  ll <- length(m)
  m[(ll-l):(ll-2)]
}
For instance,
content(mails[[2]])
will print the second mail in the spool. Later I might write more about what I use it for, but it is still a work in progress, so it's not worth publishing at this stage.
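
As a rough illustration of the filtering step mentioned at the start, here is a small sketch using base R's Filter() and vapply() on a list of messages. The sample messages and the helper name sender are made up for the example; this is not the work-in-progress code.

```r
# Hypothetical sample: a list of messages as a splitter would return them.
mails <- list(
  c("From alice@example.org Mon Jan  1 10:00:00 2024",
    "From: alice@example.org", "Subject: report ready", "", "See attachment."),
  c("From bob@example.org Tue Jan  2 11:00:00 2024",
    "From: bob@example.org", "Subject: lunch?", "", "Noon?"),
  c("From alice@example.org Wed Jan  3 12:00:00 2024",
    "From: alice@example.org", "Subject: report v2", "", "Updated.")
)

# Sketch: second whitespace-separated token of the first "From:" line, or NA.
sender <- function(m) {
  hit <- grep("^From:", m, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  strsplit(hit[1], " ")[[1]][2]
}

# Keep only the messages sent by one address
from_alice <- Filter(function(m) identical(sender(m), "alice@example.org"), mails)

# Count messages per sender
senders <- vapply(mails, sender, character(1))
counts <- table(senders)
counts
```

Because each message is just a character vector inside a list, the usual list tools (Filter, lapply, vapply) are enough for this kind of selection.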
