Process list of histograms into separate histograms - R

I have a list of data frames that look like this:
F0001
PoseID Score
1 AAAA_1 -13.70
2 AAAA_2 -9.21
3 AAAA_3 -7.60
4 AAAA_4 -6.28
F0002
PoseID Score
1 AAAB_1 -14.90
2 AAAB_2 -13.92
3 AAAB_3 -13.49
And essentially I'd like to generate plots for each data frame's $Score and spit them out as images.
One of the ways I've tried was to import all the data frames into a list.
lst <- mget(ls(pattern='^F\\d+'))
then run the hist() on each separate data frame in the list and push that out into a list of histograms.
hist <- lapply(lst, function(x) hist(x$Score))
The idea would then be to spit out that list as separate histograms saved to files. Seems like a simple thing but it's beating me at the moment. Any R boffins have a good way to do this? Maybe other approaches (e.g. for-loop on each separate data frame rather than adding it to a list and performing operations on it)?

The following saves each histogram as image1.pdf, image2.pdf, ... in your working directory. You can also replace pdf() with jpeg(), png() or postscript() (adjusting the file extension to match).
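lapply(1:2, function(i) {
  pdf(paste0("image", i, ".pdf"))  # open a PDF device named after the index
  hist(mtcars[, i])                # draw the histogram of column i
  dev.off()                        # close the device so the file is written
})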
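Applied to the list from the question, the same pattern could look like the sketch below. The two small data frames are stand-ins built from the rows shown in the question, and writing to tempdir() with file names taken from the list names is an assumption made for illustration.

```r
# Stand-ins for the F0001, F0002, ... data frames from the question
lst <- list(
  F0001 = data.frame(PoseID = paste0("AAAA_", 1:4),
                     Score = c(-13.70, -9.21, -7.60, -6.28)),
  F0002 = data.frame(PoseID = paste0("AAAB_", 1:3),
                     Score = c(-14.90, -13.92, -13.49))
)

out_dir <- tempdir()  # assumption: write the images to a temporary directory

# One PDF per data frame, named after the list element
files <- vapply(names(lst), function(nm) {
  path <- file.path(out_dir, paste0(nm, ".pdf"))
  pdf(path)
  hist(lst[[nm]]$Score, main = nm, xlab = "Score")
  dev.off()
  path
}, character(1))
```

Iterating over names(lst) rather than over the list itself keeps the data frame name available for the file name and the plot title.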

Related

R Refer to (part of) data frame using string in R

I have a large data set in which I have to search for specific codes depending on what I want. For example, chemotherapy is coded by ~40 codes that can appear in any of 40 columns (called diag1, diag2, etc.).
I am in the process of writing a function that produces plots depending on what I want to show. I thought it would be good to specify what I want to plot in a input data frame. Thus, for example, in case I only want to plot chemotherapy events for patients, I would have a data frame like this:
Dataframe name: Style
Name SearchIn codes PlotAs PlotColour
Chemo data[substr(names(data),1,4)=="diag"] 1,2,3,4,5,6 | red
I already have a function that searches for codes in specific parts of the data frame and flags the events of interest. What I cannot do, and need your help with, is referring to a data frame (Style$SearchIn[1]) using codes in a data frame as above.
> Style$SearchIn[1]
[1] data[substr(names(data),1,4)=="diag"]
Levels: data[substr(names(data),1,4)=="diag"]
I thought perhaps get() would work, but I can't get it to work:
> get(Style$SearchIn[1])
Error in get(vars$SearchIn[1]) : invalid first argument
or
> get(as.character(Style$SearchIn[1]))
Error in get(as.character(Style$SearchIn[1])) :
object 'data[substr(names(data),1,5)=="TDIAG"]' not found
Obviously, running data[substr(names(data),1,5)=="TDIAG"] works.
Example:
library(survival)
ex <- data.frame(SearchIn="lung[substr(names(lung),1,2) == 'ph']")
lung[substr(names(lung),1,2) == 'ph'] #works
get(ex$SearchIn[1]) # does not work
It is not a good idea to store R code in strings and then try to eval them when needed; there are nearly always better solutions for dynamic logic, such as lambdas.
I would recommend using a list to store the plot specification, rather than a data.frame. This would allow you to include a function as one of the list's components which could take the input data and return a subset of it for plotting.
For example:
library(survival);
plotFromSpec <- function(data,spec) {
filteredData <- spec$filter(data);
## ... draw a plot from filteredData and other stuff in spec ...
};
spec <- list(
Name='Chemo',
filter=function(data) data[,substr(names(data),1,2)=='ph'],
Codes=c(1,2,3,4,5,6),
PlotAs='|',
PlotColour='red'
);
plotFromSpec(lung,spec);
If you want to store multiple specifications, you could create a list of lists.
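A list of lists along those lines might look like the following sketch. The data frame dat, the second specification and the element names chemo and age are made up for illustration; only the overall shape follows the spec list above.

```r
# Hypothetical data: two "diag" columns and one unrelated column
dat <- data.frame(diag1 = c(1, 7, 3), diag2 = c(9, 2, 8), age = c(50, 61, 47))

# Each element is one plot specification; filter is an ordinary function
specs <- list(
  chemo = list(
    Name = "Chemo",
    filter = function(d) d[, startsWith(names(d), "diag"), drop = FALSE],
    Codes = 1:6,
    PlotColour = "red"
  ),
  age = list(
    Name = "Age",
    filter = function(d) d[, "age", drop = FALSE],
    Codes = integer(0),
    PlotColour = "blue"
  )
)

# Apply every specification's filter to the same data
subsets <- lapply(specs, function(s) s$filter(dat))
names(subsets$chemo)
```

Using drop = FALSE keeps each result a data frame even when a filter selects a single column.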
Have you tried using quote()?
I'm not entirely sure what you want but maybe you could store the things you're trying to get() like
quote(data[substr(names(data),1,4)=="diag"])
and then use eval()
eval(quote(data[substr(names(data),1,4)=="diag"]), list(data=data))
For example,
dat <- data.frame("diag1"=1:10, "diag2"=1:10, "other"=1:10)
Style <- list(SearchIn=c(quote(data[substr(names(data),1,4)=="diag"]), quote("Other stuff")))
> head(eval(Style$SearchIn[[1]], list(data=dat)))
diag1 diag2
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6

How to store trees/nested lists in R?

I have a list of boroughs and a list of localities (like this one). Each locality lies in exactly one borough. What's the best way to store this kind of hierarchical structure in R, considering that I'd like to have a convenient and readable way of accessing these, and of using this list to aggregate locality-level data to the borough level?
I've come up with the following:
localities <- list("Mitte" = c("Mitte", "Moabit", "Hansaviertel", "Tiergarten", "Wedding", "Gesundbrunnen"),
"Friedrichshain-Kreuzberg" = c("Friedrichshain", "Kreuzberg")
)
But I am not sure if this is the most elegant and accessible way.
If I wanted to assign additional information on the locality level, I could do that by replacing the c(...) with some other call, like rbind(c('0201', '0202'), c("Friedrichshain", "Kreuzberg")). But if I wanted to add additional information on the borough level (like an abbreviated name and a full name for each list), how would I do this?
Edit: For example, I'd like to condense a table like this into a borough-wise version.
Hard to know without having a better view on how you intend to use this, but I would strongly recommend moving away from a nested list structure to a data frame structure:
library(reshape2)
loc.df <- melt(localities)
This is what the molten data looks like:
value L1
1 Mitte Mitte
2 Moabit Mitte
3 Hansaviertel Mitte
4 Tiergarten Mitte
5 Wedding Mitte
6 Gesundbrunnen Mitte
7 Friedrichshain Friedrichshain-Kreuzberg
8 Kreuzberg Friedrichshain-Kreuzberg
You can then use all the standard data frame and other computations:
loc.df$population <- sample(100:500, nrow(loc.df)) # make up population
tapply(loc.df$population, loc.df$L1, mean) # population by borough
This gives the mean population by borough:
Friedrichshain-Kreuzberg Mitte
278.5000 383.8333
For more complex calculations you can use data.table and dplyr.
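The same long format can also be built and summarised with base R alone. The sketch below uses stack() in place of melt() and aggregate() for the borough totals; the column names locality and borough and the population figures are invented for illustration.

```r
localities <- list(
  "Mitte" = c("Mitte", "Moabit", "Hansaviertel", "Tiergarten",
              "Wedding", "Gesundbrunnen"),
  "Friedrichshain-Kreuzberg" = c("Friedrichshain", "Kreuzberg")
)

# stack() turns a named list of vectors into a two-column data frame
loc.df <- stack(localities)
names(loc.df) <- c("locality", "borough")

# Made-up population figures, one per locality
loc.df$population <- seq(100, by = 50, length.out = nrow(loc.df))

# Sum the locality-level values up to the borough level
borough.totals <- aggregate(population ~ borough, data = loc.df, FUN = sum)
borough.totals
```

Borough-level attributes such as an abbreviated name could then live in a second data frame keyed by borough and be joined in with merge().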
You can extract all of this data directly into a data.frame using the XML package.
library(XML)
theurl <- "http://en.wikipedia.org/wiki/Boroughs_and_localities_of_Berlin#List_of_localities"
tables<-readHTMLTable(theurl)
boroughs<-tables[[1]]$Borough
localities<-tables[c(3:14)]
names(localities) <- as.character(boroughs)
all<-do.call("rbind", localities)
@Roland, I think you will find data frames superior to lists for the reasons cited earlier, but also because there is other data on the web page you reference. Loading to a data frame will make it easy to go further if you wish. For example, making comparisons based on population density or other items provided "for free" on the page will be a snap from a data frame.

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to compute:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
The first option is generally preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
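A small self-contained check of the three forms, reading the first rows of the question's file from an inline string instead of foo.txt (the inline text is only there so the example runs on its own):

```r
# First three rows of the file from the question, supplied inline
tbl <- read.csv(text = "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10")

area   <- tbl$area          # by name with $
energy <- tbl[["energy"]]   # by name with [[ ]]
first  <- tbl[, 1]          # by position

is.numeric(area)            # plain numeric vector, no extra conversion needed
is.null(tbl$first)          # no column is called "first", so $ returns NULL
```

The last line is why the attempt in the question appeared not to work: tbl$first and tbl$second refer to column names that do not exist.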

R stack alternative

I am trying to write code that takes values from one column of each of many files and prints out the values of a different column depending on the values found in the first, if that makes sense. I have read the files in, but I am having trouble managing the table. I would like to limit the table to just those two columns, because the files are very large and the remaining columns are cumbersome and unnecessary. In my attempt to do so I had this line:
tmp<-stack(lapply(inputFiles,function(x) x[,3]))
But ideally I would like to include two columns (3 and 1), not just one, so that I can use lines such as these:
search<-tmp[tmp$values < 100, "Target"]
write(search, file = "Five", ncolumns = 2)
But I am not sure how. I am almost certain that stack is not going to work for more than one column. I tried some different things, similar to this:
tmp<-stack(lapply(inputFiles,function(x) x[,3], x[,1]))
But of course that didn't work.
But I don't know where to look. Does anyone have any suggestions?
The taRifx package has a list method for stack that will do what you want. It stacks lists of data.frames.
Untested code:
library(taRifx)
tmp<-stack(lapply(inputFiles,function(x) x[,c(1,3)]))
But you didn't change anything! Why does this work?
lapply() returns a list. In your case, it returns a list where each element is a data.frame.
Base R does not have a special method for stacking lists. So when you call stack() on your list of data.frames, it calls stack.default, which doesn't work.
Loading the taRifx library loads a method of stack that deals specifically with lists of data.frames. So everything works fine since stack() now knows how to properly handle a list of data.frames.
Tested example:
dat <- replicate(10, data.frame(x=runif(2),y=rnorm(2)), simplify=FALSE)
str(dat)
stack(dat)
x y
1 0.42692948 0.32023455
2 0.75388820 0.24154125
3 0.64035957 1.96580059
4 0.47690790 -1.89772855
5 0.41668993 0.78083412
6 0.12643784 0.38029833
7 0.01656855 0.51225268
8 0.40653094 1.09408159
9 0.94236491 -0.13410923
10 0.05578115 1.12475364
11 0.75651062 -0.65441493
12 0.48210444 1.67325343
13 0.95348755 0.04828449
14 0.02315498 -0.28481193
15 0.27370762 0.43927826
16 0.83045889 0.75880763
17 0.40049367 0.06945058
18 0.86212662 1.49918712
19 0.97611629 0.13959291
20 0.29107186 0.64483646
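If installing an extra package is not an option, a base R route is to select the two columns in each data frame and bind the pieces with rbind(). The sketch below is an alternative to the taRifx method, not part of it; the inputFiles list and its column names are made up to mirror the question.

```r
# Hypothetical stand-in for the list of data frames read from the files
inputFiles <- list(
  a = data.frame(Target = c("t1", "t2"), other = 0, values = c(50, 150),
                 stringsAsFactors = FALSE),
  b = data.frame(Target = c("t3", "t4"), other = 0, values = c(99, 100),
                 stringsAsFactors = FALSE)
)

# Keep columns 1 and 3 of every data frame, then stack the rows
tmp <- do.call(rbind, lapply(inputFiles, function(x) x[, c(1, 3)]))

# Same filter as in the question: targets whose value is below 100
search <- tmp[tmp$values < 100, "Target"]
search
```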

How to split a data frame by rows, and then process the blocks?

I have a data frame with several columns, one of which is a factor called "site". How can I split the data frame into blocks of rows each with a unique value of "site", and then process each block with a function? The data look like this:
site year peak
ALBEN 5 101529.6
ALBEN 10 117483.4
ALBEN 20 132960.9
ALBEN 50 153251.2
ALBEN 100 168647.8
ALBEN 200 184153.6
ALBEN 500 204866.5
ALDER 5 6561.3
ALDER 10 7897.1
ALDER 20 9208.1
ALDER 50 10949.3
ALDER 100 12287.6
ALDER 200 13650.2
ALDER 500 15493.6
AMERI 5 43656.5
AMERI 10 51475.3
AMERI 20 58854.4
AMERI 50 68233.3
AMERI 100 75135.9
AMERI 200 81908.3
and I want to create a plot of year vs peak for each site.
You can use isplit (from the "iterators" package) to create an iterator object that loops over the blocks defined by the site column:
require(iterators)
site.data <- read.table("isplit-data.txt",header=T)
sites <- isplit(site.data,site.data$site)
Then you can use foreach (from the "foreach" package) to create a plot within each block:
require(foreach)
foreach(site=sites) %dopar% {
pdf(paste(site$key[[1]],".pdf",sep=""))
plot(site$value$year,site$value$peak,main=site$key[[1]])
dev.off()
}
As a bonus, if you have a multiprocessor machine and call registerDoMC() first (from the "doMC" package), the loops will run in parallel, speeding things up. More details in this Revolutions blog post: Block-processing a data frame with isplit
Another choice is to use the ddply function from the plyr package. But you mention you mostly want to do a plot of peak vs. year, so you could also just use qplot from ggplot2:
A <- read.table("example.txt",header=TRUE)
library(ggplot2)
qplot(peak,year,data=A,colour=site,geom="line",group=site)
ggsave("peak-year-comparison.png")
On the other hand, I do like David Smith's solution that allows the applying of the function to be run across several processors.
I seem to recall that plain old split() has a method for data.frames, so that split(data,data$site) would produce a list of blocks. You could then operate on this list using sapply/lapply/for.
split() is also nice because of unsplit(), which will create a vector the same length as the original data and in the correct order.
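A minimal sketch of the split() approach, using a few rows from the question's table and writing one PDF per site. The output directory (tempdir()) and the file naming are assumptions.

```r
# A few rows from the question's data
site.data <- data.frame(
  site = rep(c("ALBEN", "ALDER"), each = 3),
  year = rep(c(5, 10, 20), times = 2),
  peak = c(101529.6, 117483.4, 132960.9, 6561.3, 7897.1, 9208.1)
)

# One data frame per site, returned as a named list
blocks <- split(site.data, site.data$site)

out_dir <- tempdir()  # assumption: write plots to a temporary directory
for (nm in names(blocks)) {
  pdf(file.path(out_dir, paste0(nm, ".pdf")))
  plot(blocks[[nm]]$year, blocks[[nm]]$peak,
       main = nm, xlab = "year", ylab = "peak")
  dev.off()
}
```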
Here's what I would do, although it looks like you guys have it handled by library functions.
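for (s in unique(data$site)) {
  # keep only the rows for this site (note the comma: rows, all columns)
  constrainedData <- data[data$site == s, ]
  doSomething(constrainedData)  # doSomething() is a placeholder for your own function
}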
This kind of code is more direct and might be less efficient, but I prefer being able to read what it is doing over learning some new library function for the same thing. It feels more flexible too, but in all honesty this is just the way I figured it out as a novice.
There are two handy built-in functions for dealing with this kind of situation: ?aggregate and ?by. In this case, because you want a plot and aren't returning a scalar, use by():
data <- read.table("example.txt",header=TRUE)
by(data[, c('year', 'peak')], data$site, plot)
The output says NULL because that's what plot returns. You might want to set the graphics device to pdf to capture all the output.
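Capturing the by() output in a single multi-page PDF could be sketched as follows. The sample rows, the file name sites.pdf and the use of tempdir() are assumptions for illustration.

```r
# A few rows from the question's data
site.data <- data.frame(
  site = rep(c("ALBEN", "ALDER"), each = 3),
  year = rep(c(5, 10, 20), times = 2),
  peak = c(101529.6, 117483.4, 132960.9, 6561.3, 7897.1, 9208.1)
)

out_file <- file.path(tempdir(), "sites.pdf")  # hypothetical output path

pdf(out_file)                                   # each plot becomes a new page
res <- by(site.data[, c("year", "peak")], site.data$site, plot)
dev.off()                                       # close the device to write the file
```

Assigning the result to res suppresses the printed NULLs; the plots themselves are a side effect of calling plot() on each block.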
It is also very easy to generate your plots with the lattice package:
library(lattice)
xyplot(year~peak | site, data)
You could use the split function.
If you read in your data and split it by site like this:
data <- read.table('your_data.txt', header=TRUE)
blocks <- split(data, data$site)
After that, blocks contains the data for each site, which you can access like any other data.frame:
plot(blocks$ALBEN$year, blocks$ALBEN$peak)
And so on for each plot.
