I'm writing a gene-level analysis script in R and I'll have to handle large amounts of data.
My initial idea was to create a super list structure, a set of lists within lists. Essentially the structure is
#12.8 mins
list[[1:8]][[1:1000]][[1:6]][[1:1000]]
This is huge and takes in excess of 12 mins purely to set up the data structure. Streamlining this process, I can get it down to about 1.6 mins when setting up one value of the 1:8 list, so essentially...
#1.6 mins
list[[1:1]][[1:1000]][[1:6]][[1:1000]]
Normally I'd create the structure on the fly, as and when it's needed; however, I'm distributing the 1:1000 steps, which means I don't know which order they'll come back in.
Are there any other packages for handling the creation of this level of data?
Could I use any more efficient data structures in my approach?
I apologise if this seems like the wrong approach entirely, but this is my first time handling big data in R.
Note that lists are vectors, and like any other vector, they can have a dim attribute.
l <- vector("list", 8 * 1000 * 6 * 1000)
dim(l) <- c(8, 1000, 6, 1000)
This is effectively instantaneous. You access individual elements with [[, e.g. l[[1, 2, 3, 4]].
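Since the structure already exists, results from the distributed 1:1000 jobs can be slotted in by index as they arrive, in any order; a minimal sketch (the indices and value below are made up):
## assign into one cell as soon as a distributed job returns; arrival order doesn't matter
l[[1, 437, 2, 12]] <- rnorm(10)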
A different strategy is to create a vector and a partitioning, e.g., to represent
list(1:4, 5:7)
as
l = list(data=1:7, partition=c(4, 7))
then one can do vectorized calculations, e.g.,
logl = list(data=log(l$data), partition = l$partition)
and other clever things. This avoids creating complicated lists and the iteration they imply. This approach is formalized in the Bioconductor IRanges package *List classes.
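Before turning to IRanges, a hand-rolled sketch of pulling one group out of the data/partition representation (the helper name is made up):
l <- list(data = 1:7, partition = c(4, 7))   # 'partition' holds the end position of each group
get_group <- function(l, i) {
  starts <- c(1L, head(l$partition, -1L) + 1L)
  l$data[starts[i]:l$partition[i]]
}
get_group(l, 2)   # 5 6 7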
> library(IRanges)
> l <- NumericList(1:4, 5:7)
> l
NumericList of length 2
[[1]] 1 2 3 4
[[2]] 5 6 7
> log(l)
NumericList of length 2
[[1]] 0 0.693147180559945 1.09861228866811 1.38629436111989
[[2]] 1.6094379124341 1.79175946922805 1.94591014905531
One idiom for working with this data is to unlist, transform, then relist; both unlist and relist are inexpensive, so the long-hand version of the above is relist(log(unlist(l)), l)
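Spelled out on the NumericList l from above (a sketch; the intermediate names are arbitrary):
flat <- unlist(l)               # one plain numeric vector; cheap
logl <- relist(log(flat), l)    # reuses the partitioning of 'l'
## logl should match log(l) element for element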
Depending on your data structure, the DataFrame class may be appropriate, e.g., the following can be manipulated like a data.frame (subset, etc.) but contains *List elements.
> DataFrame(Sample=c("A", "B"), VariableA=l, LogA=log(l))
DataFrame with 2 rows and 3 columns
Sample VariableA LogA
<character> <NumericList> <NumericList>
1 A 1,2,3,... 0,0.693147180559945,1.09861228866811,...
2 B 5,6,7 1.6094379124341,1.79175946922805,1.94591014905531
For genomic data where the coordinates of genes (or other features) on chromosomes is of fundamental importance, the GenomicRanges package and GRanges / GRangesList classes are appropriate.
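A minimal sketch of constructing a GRanges object (the gene names and coordinates below are invented):
library(GenomicRanges)
gr <- GRanges(seqnames = c("chr1", "chr2"),
              ranges   = IRanges(start = c(1000, 5000), width = 2000),
              strand   = c("+", "-"),
              gene_id  = c("geneA", "geneB"))   # metadata column
gr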
I am reading several files together into a list of data frames to be able to apply functions to the combined data, but I am running into memory allocation problems when I have too many data frames ("Error: R cannot allocate memory").
e.g. a variable number of data frames is read; let's say for now 3 data frames:
x = data.frame(A=rnorm(100), B=rnorm(200))
y = data.frame(A=rnorm(30), B=rnorm(300))
z = data.frame(A=rnorm(20), B=rnorm(600))
listDF <- list(x,y,z)
Error: R cannot allocate memory
I was wondering whether someone here knows whether, for example, an array or one single data frame with many columns would be a more efficient way of storing and manipulating the data frames.
The list of data frames is very practical because I can manipulate the many columns in the data based on the name of the data frame; when dealing with a variable number of data frames this is convenient. Anyway, if there are any ideas or ways you like doing this, please share them :) Thank you!
This solution may not be ideal as it isn't free, but Revolution R Enterprise is designed to deal with the problem of big data in R. It uses some of the data manipulation capabilities of SQL within R to do faster computations on big data. There is a learning curve as it has different functions to deal with the new data type, but if you are dealing with big data, the speed-up is worth it. You just have to decide if the time to learn it and the cost of the product are more valuable to you than some of the slower and more kludgy workarounds.
data.tables are very efficient data structures in R; take a look, maybe they are useful for your case.
Your example and your mention of the apply family of functions suggest that the structure of the data frames is identical, i.e., they all have the same columns.
If this is the case, and if the total volume of data (all data frames together) still fits in available RAM, then a solution could be to pack all the data into one large data.table with an extra id column. This can be achieved with the rbindlist function:
library(data.table)
x <- data.table(A = rnorm(100), B = rnorm(200))
y <- data.table(A = rnorm(30), B = rnorm(300))
z <- data.table(A = rnorm(20), B = rnorm(600))
dt <- rbindlist(list(x, y, z), idcol = TRUE)
dt
.id A B
1: 1 -0.10981198 -0.55483251
2: 1 -0.09501871 -0.39602767
3: 1 2.07894635 0.09838722
4: 1 -2.16227936 0.04620932
5: 1 -0.85767886 -0.02500463
---
1096: 3 1.65858606 -1.10010088
1097: 3 -0.52939876 -0.09720765
1098: 3 0.59847826 0.78347801
1099: 3 0.02024844 -0.37545346
1100: 3 -1.44481850 -0.02598364
The rows originating from the individual source data frames can be distinguished by the .id variable. All the memory-efficient data.table operations can be applied on all rows, selected rows (dt[.id == 1, some_function(A)]) or group-wise (dt[, another_function(B), by = .id]).
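For example, with the dt built above (mean() here is just a stand-in for whatever function you actually need):
dt[.id == 1, mean(A)]                 # operate on rows from the first source frame only
dt[, .(mean_B = mean(B)), by = .id]   # one summary row per source data frame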
Although the data.table operations are memory efficient, RAM might still be a limiting factor. Use the tables() function to monitor memory consumption of all created data.table objects:
tables()
NAME NROW NCOL MB COLS KEY
[1,] dt 1,100 3 1 .id,A,B
[2,] x 200 2 1 A,B
[3,] y 300 2 1 A,B
[4,] z 600 2 1 A,B
Total: 4MB
and remove objects from memory which are no longer needed
rm(x, y, z)
tables()
NAME NROW NCOL MB COLS KEY
[1,] dt 1,100 3 1 .id,A,B
Total: 1MB
I would like to make a data frame from a list of length n. Each list element contains 3 different lists inside, and I am only interested in 1 of those 3 lists. The list I am interested in is a data.frame with 12 obs. of 12 variables.
The input tmp to my lapply function is a list of length n, each element with 5 observations.
2 of those observations are the Latitude and Longitude. This is what my lapply function looks like:
DF_Google_Places<- lapply(tmp, function(tmp){
Latitude<-tmp$Latitude
Longitude<-tmp$Longitude
LatLon<- paste(Latitude,Longitude, sep=",")
res<-GET(paste("https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=",LatLon,"&radius=200&types=food&key=AIzaSyDS6usHdhdoGIj0ILtXJKCjuj7FBmDEzpM", sep=""))
jsonAnsw<-content(res,"text")
myDataframe<- jsonlite::fromJSON(content(res,"text"))
})
My question is: how do I get this list of 12 obs. of 12 variables into a data frame, starting from a list of length n?
Could anyone help me out? Thanks
I'm just posting my comment as an answer so I can show output to illustrate the idea:
x <- list(a=list(b=1,c=2),d=list(b=3,c=4))
So x is a nested list structure, in this case with consistent naming / structure one level down.
> x
$a
$a$b
[1] 1
$a$c
[1] 2
$d
$d$b
[1] 3
$d$c
[1] 4
Now we'll use do.call to build the data.frame. We need to pass it a named list of arguments, so we'll wrap the sapply() call in list() to get one. We'll walk the higher level of the list by position, and the inner level by name, since the names are consistent across sub-lists at the inner level. Note that the key idea is essentially to reverse what would be the intuitive way of indexing; since I want to pull observations at the second level from across observations at the first level, the inner call to sapply traverses multiple values of level one for each value of the name at level two.
y <- do.call(data.frame,
list(sapply(names(x[[1]]),
function(t) sapply(1:length(x),
function(j) x[[j]][[t]]))))
> y
b c
1 1 2
2 3 4
Try breaking apart the command to see what each step does. If there is any consistency in your sub-list structure, you should be able to adapt this approach to walk that structure in the right order and fold the data you need.
On a large dataset, this would not be efficient, but for 12x12 it should be fine.
I have two data frames. One of them contains 165 columns (species names) and almost 193,000 rows; each cell holds a number from 0 to 1, which is the probability of the species being present in that cell.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
2 0.0279037 0.604687 0.0388309 0.0161980 0.0143966 0.240152
3 0.0294101 0.674846 0.0673055 0.0481405 0.0397423 0.231308
4 0.0292839 0.603869 0.0597947 0.0526606 0.0463431 0.188875
6 0.0331264 0.541165 0.0470451 0.0270871 0.0373348 0.256662
8 0.0393825 0.672371 0.0715808 0.0559353 0.0565391 0.230833
9 0.0376557 0.663732 0.0747417 0.0445794 0.0602539 0.229265
The second data frame contains 164 columns (species names, as in the first data frame) and one row, which is the threshold: above it we assume that the species is present, and below it the species is absent.
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
0.3155 0.2816 0.2579 0.2074 0.3007 0.3513 0.3514
What I want to do is make a new data frame that contains, for every species in the presence-probability data (my.data), the probability value if it is above the threshold (thres), and zero if it is below the threshold.
I know that this could be done with a for loop and an if statement, but I am new to R and I don't know how to do it.
Please help me.
I think you want something like this:
(Making up a small reproducible example:)
set.seed(101)
speciesdat <- data.frame(pointID=1:10,matrix(runif(100),ncol=10,
dimnames=list(NULL,LETTERS[1:10])))
threshdat <- rbind(seq(0.1,1,by=0.1))
Now process:
thresh <- unlist(threshdat) ## flatten the one-row threshold object into a plain vector
## 'sweep' runs the function column-by-column if MARGIN=2
ss2 <- sweep(as.matrix(speciesdat[,-1]),MARGIN=2,STATS=thresh,
FUN=function(x,y) ifelse(x<y,0,x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID,ss2)
It's simpler to have the same number of columns (with the same meanings of course).
frame2 = data.frame(POINTID=0, frame2)
R works with vectors, so a row of frame1 can be directly compared to frame2:
frame1[1, ] < frame2
You could use an explicit loop over every row of frame1, but it's common to use the implicit loop of apply:
answer = apply(frame1, 1, function(x) x < frame2)
This was all a rather sloppy solution (especially changing frame2), but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices when possible (they can still use labels but are generally faster).
This produces a logical matrix which can be used to generate assignments with "[<-" (assuming the name of the multi-row data frame is "cols" and the named vector is "vec"):
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch between the number of columns and the length of the vector, but presumably you can adjust the length of the vector to the correct number of entries.
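To make this concrete, a small self-contained sketch using the cols/vec names from above (the numbers are made up):
cols <- data.frame(POINTID   = 2:4,
                   Abie_Xbor = c(0.0279, 0.4100, 0.0293),
                   Acer_Camp = c(0.6047, 0.1200, 0.6039))
vec  <- c(Abie_Xbor = 0.3155, Acer_Camp = 0.2816)
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0   # zero out every value below its species threshold
cols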
So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump straight into a for loop to do this, but I know that R isn't great with loops of that type, and in this case I have hundreds of thousands of rows of data to sort through, so I am wondering whether anyone can offer suggestions for a speedy way of doing this.
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples of how to use it.
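A minimal illustration of na.locf on a vector (the values are made up):
library(zoo)
x <- c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA)
na.locf(x)   # carries the last non-NA value forward
# [1] "FIRST_EVENT"  "FIRST_EVENT"  "SECOND_EVENT" "SECOND_EVENT" "SECOND_EVENT"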
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last one carried forward" part).
2. The result of 1. is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors,
4. so I use sapply() to run the subsetting function '['() and extract the 1st element from each list component.
5. The whole thing is wrapped in transform() so (i) I don't need to refer to dat$, and (ii) I can add the result as a new variable directly into the data dat.
So earlier I answered my own question on thinking in vectors in R. But now I have another problem which I can't 'vectorize.' I know vectors are faster and loops slower, but I can't figure out how to do this in a vector method:
I have a data frame (which for sentimental reasons I like to call my.data) on which I want to do a full marginal analysis. I need to remove certain elements one at a time and 'value' the data frame, then I need to iterate again, removing only the next element. Then do it again... and again... The idea is to do a full marginal analysis on a subset of my data. Anyhow, I can't conceive of how to do this in a vector-efficient way.
I've shortened the looping part of the code down and it looks something like this:
for (j in my.data$item[my.data$fixed==0]) { # <-- selects the items I want to loop
# through
my.data.it <- my.data[my.data$item!= j,] # <-- this kicks item j out of the list
sum.data <-aggregate(my.data.it, by=list(year), FUN=sum, na.rm=TRUE) #<-- do an
# aggregation
do(a.little.dance) && make(a.little.love) -> get.down(tonight) # <-- a little
# song and dance
delta <- (get.love) # <-- get some love
delta.list<-append(delta.list, delta, after=length(delta.list)) #<-- put my love
# in a vector
}
So obviously I hacked out a bunch of stuff in the middle, just to make it less clumsy. The goal would be to remove the j loop using something more vector-efficient. Any ideas?
Here's what seems like another very R-type way to generate the sums. Generate a vector that is as long as your input vector, containing nothing but the repeated sum of the n elements. Then subtract your original vector from the sums vector. The result is a vector (isums) where each entry is the sum of your original vector less the ith element.
> (my.data$item[my.data$fixed==0])
[1] 1 1 3 5 7
> sums <- rep(sum(my.data$item[my.data$fixed==0]),length(my.data$item[my.data$fixed==0]))
> sums
[1] 17 17 17 17 17
> isums <- sums - (my.data$item[my.data$fixed==0])
> isums
[1] 16 16 14 12 10
Strangely enough, learning to vectorize in R is what helped me get used to basic functional programming. A basic technique would be to define your operations inside the loop as a function:
data = ...;
items = ...;
leave_one_out = function(i) {
data1 = data[items != i];
delta = ...; # some operation on data1
return(delta);
}
for (j in items) {
delta.list = cbind(delta.list, leave_one_out(j));
}
To vectorize, all you do is replace the for loop with the sapply mapping function:
delta.list = sapply(items, leave_one_out);
This is no answer, but I wonder if any insight lies in this direction:
> tapply((my.data$item[my.data$fixed==0])[-1], my.data$year[my.data$fixed==0][-1], sum)
tapply produces a table of statistics (sums, in this case; the third argument) grouped by the parameter given as the second argument. For example:
2001 2003 2005 2007
1 3 5 7
The [-1] notation drops observation (row) one from the selected rows. So you could loop and use [-i] on each iteration:
for (i in 1:length(my.data$item)) {
tapply((my.data$item[my.data$fixed==0])[-i], my.data$year[my.data$fixed==0][-i], sum)
}
keeping in mind that if you have any years with only 1 observation, then the tables returned by the successive tapply calls won't have the same number of columns (i.e., if you drop out the only observation for 2001, then 2003, 2005, and 2007 would be the only columns returned).