Breaking a list of character strings into partitions - r

Here is my problem. I have a dataset with 200k rows.
Each row corresponds to a test conducted on a subject.
Subjects have unequal number of tests.
Each test is dated.
I want to assign an index to each test. For example, the first test of subject 1 would be 1, the second test of subject 1 would be 2, the first test of subject 2 would be 1, and so on.
My strategy is to get a list of unique Subject IDs, use lapply to subset the dataset into a list of dataframes using the unique Subject IDs, with each Subject having his/her own dataframe with the tests conducted. Ideally I would then be able to sort each dataframe of each subject and assign an index for each test.
However, doing this over a 200k x 32 dataframe made my laptop (i5, Sandy Bridge, 4GB ram) run out of memory quite quickly.
I have 2 questions:
Is there a better way to do this?
If there is not, my only option to overcome the memory limit is to break my unique SubjectID list into smaller sets like 1000 SubjectIDs per list, lapply it through the dataset and at the end of everything, join the lists together. Then, how do I create a function to break my SubjectID list by supplying say an integer that denotes the number of partitions. e.g. BreakPartition(Dataset, 5) will break the dataset into 5 partitions equally.
Here is code to generate some dummy data:
UniqueSubjectID <- sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse =""))
UniqueSubjectID <- subset(UniqueSubjectID, !duplicated(UniqueSubjectID))
Dataset <- data.frame(SubID = sample(sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse ="")),5000, replace = TRUE))
Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d.%m.%Y')), 5000, replace = TRUE)
Dataset <- cbind(Dataset, Dates)

I would guess that the splitting/lapply is what is using up the memory. You should consider a more vectorized approach. Starting with a slightly modified version of your example code:
n <- 200000
UniqueSubjectID <- replicate(500, paste(letters[sample(26, 5, replace=TRUE)], collapse =""))
UniqueSubjectID <- unique(UniqueSubjectID)
Dataset <- data.frame(SubID = sample(UniqueSubjectID , n, replace = TRUE))
Dataset$Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d.%m.%Y')), n, replace = TRUE)
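And assuming that what you want is an index counting the tests in date order within each subject, you could do the following.
# Parse the dd.mm.yyyy strings so that rows sort chronologically rather than alphabetically
Dataset <- Dataset[order(Dataset$SubID, as.Date(Dataset$Dates, format = "%d.%m.%Y")), ]
ids.rle <- rle(as.character(Dataset$SubID))
# sequence() builds 1:n for each run length and always returns a plain vector
Dataset$SubIndex <- sequence(ids.rle$lengths)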
Now the 'SubIndex' column in 'Dataset' contains a by-subject numbered index of the tests. This takes a very small amount of memory and runs in a few seconds on my 4GB Core 2 duo Laptop.
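A base R alternative to the run-length approach is ave(), which applies a function within each group and returns the results in the current row order. The following is a minimal sketch on a small made-up data frame (the values are illustrative, not taken from the question); it assumes the dates are stored as dd.mm.yyyy strings, as in the dummy data.

```r
# Toy data: two subjects, dates as dd.mm.yyyy strings (illustrative values)
d <- data.frame(
  SubID = c("b", "a", "b", "a", "a"),
  Dates = c("03.01.2010", "02.01.2010", "01.01.2010", "05.02.2010", "01.01.2010"),
  stringsAsFactors = FALSE
)

# Parse the strings so that ordering is chronological
d$Date <- as.Date(d$Dates, format = "%d.%m.%Y")
d <- d[order(d$SubID, d$Date), ]

# Number the rows 1, 2, ... within each subject
d$SubIndex <- ave(seq_len(nrow(d)), d$SubID, FUN = seq_along)
```

Because ave() works on a single vector, it does not need to copy the other 31 columns per subject, which is where the split/lapply strategy tends to spend its memory.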

This sounds like a job for the plyr package. I would add the index in this way:
require(plyr)
system.time(new_dat <- ddply(Dataset, .(SubID), function(dum) {
  dum = dum[order(dum$SubID, dum$Dates), ]
  mutate(dum, index = 1:nrow(dum))
}))
This splits the dataset into chunks per SubID and adds an index to each chunk. The new object has all rows for each SubID grouped together and sorted in time. Your example took about 2 seconds on my machine, and used almost no memory. I'm not sure how ddply scales to your data size and characteristics, but you could try. If this does not work fast enough, definitely take a look at the data.table package. A blog post of mine which compares (among others) ddply and data.table could serve as some inspiration.
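The second question, how to break the list of subject IDs into a given number of partitions, can be sketched in base R as follows. The function name break_partition and the example IDs are hypothetical; the sketch assumes you want contiguous partitions of nearly equal size and that the number of partitions does not exceed the number of IDs.

```r
# Split a vector into k contiguous partitions of nearly equal size.
# break_partition is a hypothetical helper name, not an existing function.
break_partition <- function(ids, k) {
  stopifnot(k >= 1, k <= length(ids))
  # Map position i to partition ceiling(i * k / n), giving labels 1..k
  grp <- ceiling(seq_along(ids) * k / length(ids))
  split(ids, grp)
}

ids <- c("aaaaa", "bbbbb", "ccccc", "ddddd", "eeeee",
         "fffff", "ggggg", "hhhhh", "iiiii", "jjjjj")
parts <- break_partition(ids, 3)
```

Each element of parts can then be passed to lapply, and the per-partition results joined afterwards, although the vectorized approaches above may make the partitioning unnecessary.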

Related

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
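Once the data frame is split, the per-chunk work and the recombination can be done without pairwise rbind() calls. The sketch below is a minimal illustration on a tiny stand-in data frame; the id column and the "double column B" step are made up purely to have something to compute.

```r
# Small stand-in for the large data frame (illustrative values only)
dat <- data.frame(id = 1:10, B = c(2, 0, 1, 3, 1, 0, 2, 1, 0, 4))
N <- 3

# Same modulo grouping as above, then name the pieces
dat_list <- split(dat, seq_len(nrow(dat)) %% N + 1)
names(dat_list) <- paste0("newdf_", seq_along(dat_list))

# Hypothetical per-chunk work: add a doubled copy of column B
processed <- lapply(dat_list, function(chunk) {
  chunk$B2 <- chunk$B * 2
  chunk
})

# Recombine all chunks in one call, then restore the original row order
result <- do.call(rbind, unname(processed))
result <- result[order(result$id), ]
```

Note that the modulo grouping interleaves rows across chunks, so an explicit row identifier such as id is needed if the original order matters after recombining.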
Here's a tidyverse-based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution: keep only the matching rows of each chunk
partial_data <- read_csv_chunked("test.csv",
  DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
  chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

Repetitive Action Over Ten Matrices in R

I have ten datasets, and each dataset contains "ratings" and "occupation" columns. From each of those ten datasets I want to find out the "average" of "ratings" per three occupation groups (i.e. artists, technician, marketing).
The code I have written is as follows:
Average.Rating.per.Interval <- data.frame(interval=as.numeric(),
occupation=as.character(),
average.rating=as.numeric(),
stringsAsFactors=FALSE)
##interval number refers to the dataset number (e.g. for 'e.1' it is 1, for 'e.2' it's 2)
Average.Rating.per.Interval <- as.matrix(Average.Rating.per.Interval)
e.1.artist <- e.1[which(e.1[,"occupation"]=='artist', arr.ind = TRUE),]
mean(e.1.artist$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(interval=1,occupation="artist",average.rating=mean(e.1.artist$rating)))
e.1.technician <- e.1[which(e.1[,"occupation"]=='technician', arr.ind = TRUE),]
mean(e.1.technician$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(1,"technician",mean(e.1.technician$rating)))
e.1.marketing <- e.1[which(e.1[,"occupation"]=='marketing', arr.ind = TRUE),]
mean(e.1.marketing$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(1,"marketing",mean(e.1.marketing$rating)))
This is clearly not efficient at all, because for ten datasets, I have to rewrite the same code 9 more times to get the average ratings for each of those occupations groups for all of my ten datasets. Is there a better way to do this? I cannot think of anything better! I found out that apply/lapply can be a way to do this, but I could not figure out how they can work for my case.
Two of my datasets (e1 and e2) can be found here. (I have only included 10% of the entire observations in each)
You can use the tidyverse package to summarize each of your data frames. First, you'll want to put them in a list. Then you can iterate over each of the data frames in the list, summarizing by occupation:
library(tidyverse)
# Create sample data
set.seed(2353)
# rerun() is deprecated as of purrr 1.0.0; map() over 1:10 builds the same list
sample_data <- map(1:10, ~ tibble(
  occupation = sample(c("Artist", "Technician", "Marketing"), 100, replace = TRUE),
  ratings = sample(1:100, 100, replace = TRUE)
))
# Summarize by occupation
summarized_data <- sample_data %>%
  map(~ .x %>% group_by(occupation) %>% summarize(avg_rating = mean(ratings)))
Another option, with base. First load the files into a list, then use lapply to calculate the means for each dataset
# Set the working directory to the folder that contains the files
files <- list.files()
# Load all the data at once into a single list
l <- lapply(files, dget)
names(l) <- substr(files, 1, 2) # gives meaningful names to list elements (datasets)
# Calculate the mean by group for each dataset
all_group_means <- lapply(l, function(x) tapply(x$rating, x$occupation, mean, na.rm = TRUE))
# Subset all the group means to just those you're interested in
sapply(all_group_means, function(x) x[c("artist", "technician", "marketing")])
                 d1       d2
artist     3.540984 3.612048
technician 3.519512 3.651106
marketing  3.147208 3.342569
Note that if your data are already all loaded, you could just put them into a list (rather than loading all the data directly into a list) and then use the lapply function, and it should still work.
Edit
I just realized you only wanted the means for the three groups. I've edited the code above to subset all means to only the three groups.
I recommend the "plyr" package for this kind of manipulation; it is well worth the investment of an hour or so to learn. In your case, I loaded your first example dataset in "d1", and I can summarise it like so:
ddply(d1, .(occupation), summarise, mean_rating=mean(rating))
This shows the results for all occupations, and you only wanted a specific three, so we can filter it to those:
ddply(subset(d1, occupation %in% c('artist','technician','marketing')), .(occupation), summarise, mean_rating=mean(rating))
Now we just need to generalize it to running over 10 datasets without cut and paste. Let's store our data frames inside a list:
dataset_list <- list(d1=d1) # you would put all of them here; I just have one
Now we can run the same code on all of them, with lapply, and get a list back out:
filtered_occupations <- c('artist','technician','marketing')
lapply(dataset_list, function(dataset) {
  ddply(subset(dataset, occupation %in% filtered_occupations),
        .(occupation), summarise, mean_rating=mean(rating))
})
Result:
$d1
  occupation mean_rating
1     artist    3.540984
2  marketing    3.147208
3 technician    3.519512
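If the goal is a single table with interval, occupation and average.rating columns, as in the question's own code, the per-dataset summaries can be stacked with base R. This is a minimal sketch; the two small data frames and their values are made up for illustration, and avg_by_interval is a hypothetical name.

```r
# Made-up stand-ins for the datasets e.1 and e.2 (illustrative values only)
e.1 <- data.frame(
  occupation = rep(c("artist", "technician", "marketing", "writer"), each = 2),
  rating = c(1, 3, 2, 4, 5, 5, 1, 1)
)
e.2 <- data.frame(
  occupation = rep(c("artist", "technician", "marketing", "writer"), each = 2),
  rating = c(2, 2, 3, 5, 4, 2, 5, 5)
)
datasets <- list(e.1, e.2)
wanted <- c("artist", "technician", "marketing")

# One summary per dataset, tagged with the dataset's position as the interval
per_interval <- Map(function(d, k) {
  d <- d[d$occupation %in% wanted, ]
  out <- aggregate(rating ~ occupation, data = d, FUN = mean)
  names(out)[2] <- "average.rating"
  cbind(interval = k, out)
}, datasets, seq_along(datasets))

# Stack the summaries into one long data frame
avg_by_interval <- do.call(rbind, per_interval)
```

With ten datasets, only the datasets list changes; the summarising code stays the same.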

Efficient Combination and Operating on Large Data Frames

I have 2 relatively large data frames in R. I'm attempting to merge them / find all combinations as efficiently as possible. The resulting df turns out to be huge (the length is dim(myDF1)[1]*dim(myDF2)[1]), so I'm attempting to implement a solution using ff. I'm also open to other solutions, such as the bigmemory package, to work around these memory issues. I have virtually no experience with either of these packages.
Working example - assume I'm working with some data frame that looks similar to USArrests:
library('ff')
library('ffbase')
myNames <- USArrests
myNames$States <- rownames(myNames)
rownames(myNames) <- NULL
Now, I will fabricate 2 data frames, which represent some particular sets of observations from myNames. I'm going to try to reference them by their rownames later.
myDF1 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(3*1e5, 1, 50))], ncol = 3)))
myDF2 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(2*1e5, 1, 50))], ncol = 2)))
# unique combos:
myDF1 <- unique(myDF1)
myDF2 <- unique(myDF2)
For example, my first set of states in myDF1 are myNames[unlist(myDF1[1, ]), ]. Then I will find all combos of myDF1 and myDF2 using ikey :
# create keys:
myDF1$key <- ikey(myDF1)
myDF2$key <- ikey(myDF2)
startTime <- Sys.time()
# Create some huge vectors:
myVector1 <- ffrep.int(myDF1$key, dim(myDF2)[1])
myVector2 <- ffrep.int(myDF2$key, dim(myDF1)[1])
# This takes about 25 seconds on my machine:
print(Sys.time() - startTime)
# Sort one DF (to later combine with the other):
myVector2 <- ffsorted(myVector2)
# Sorting takes an additional 2.5 minutes:
print(Sys.time() - startTime)
1) Is there a faster way to sort this?
# finally, find all combinations:
myDF <- as.ffdf(myVector1, myVector2)
# Very fast:
print(Sys.time() - startTime)
2) Is there an alternative to this type of combination (without using RAM)?
Finally, I'd like to be able to reference any of the original data by row / column. Specifically, I'd like to get different types of rowSums. For example:
# Here are the row numbers (from myNames) for the top 6 sets of States:
this <- cbind(myDF1[myDF[1:6,1], -4], myDF2[myDF[1:6,2], -3])
this
# Then, the original data for the first set of States is:
myNames[unlist(this[1,]),]
# Suppose I want to get the sum of the Urban Population for every row, such as the first:
sum(myNames[unlist(this[1,]),]$UrbanPop)
3) Ultimately, I'd like a vector with the above rowSum, so I can perform some type of subset on myDF. Any advice on how to most efficiently accomplish this?
Thanks!
It's unclear to me what you intend to do with the rowSum and your point 3), but if you want an efficient and RAM-friendly way to get all combinations of 2 ff vectors, you can use expand.ffgrid from ffbase.
The following will generate your ffdf, with roughly 160 million rows x 2 columns, in a few seconds.
require(ffbase)
x <- expand.ffgrid(myDF1$key, myDF2$key)
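For point 3), one observation may help: the UrbanPop sum of a combined row is the sum for its myDF1 part plus the sum for its myDF2 part, so the per-set sums can be computed once on the two small tables and then added per combination. The following is a minimal in-memory sketch of that idea in base R, using expand.grid instead of ff and deliberately small made-up sets; set1, set2, sum1, sum2 and combos are hypothetical names, and whether the same arithmetic is efficient on ff vectors is an assumption not tested here.

```r
pop <- USArrests$UrbanPop

set.seed(42)
# Small stand-ins for myDF1 (3 states per row) and myDF2 (2 states per row)
set1 <- unique(matrix(sample(50, 30, replace = TRUE), ncol = 3))
set2 <- unique(matrix(sample(50, 20, replace = TRUE), ncol = 2))

# Per-set UrbanPop sums, computed once on the small tables
sum1 <- rowSums(matrix(pop[set1], nrow = nrow(set1)))
sum2 <- rowSums(matrix(pop[set2], nrow = nrow(set2)))

# All key combinations, and the combined sum as a plain vector addition
combos <- expand.grid(key1 = seq_len(nrow(set1)), key2 = seq_len(nrow(set2)))
combos$urban_sum <- sum1[combos$key1] + sum2[combos$key2]
```

The urban_sum column can then be used to subset the combinations directly, without looking up the original rows for each combination.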

R data samples counting

I have a task to do using R. I need to make 10000 samples of a vector of 12 elements each of them between 1 and 7. I did this using:
dataSet = t(replicate(10000, sample(1:7, 12, r=T)))
Now I need to count the rows of this dataSet that contain all the values from 1:7.
How can I do that and is there a better way to represent the data than this?
One way would be the following (call set.seed before sampling if you want the count to be reproducible):
indx <- 1:7
sum(apply(dataSet, 1, function(x) all(indx %in% x)))
## 2336
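A self-contained version with a fixed seed might look like the sketch below. The seed value is arbitrary, so the resulting count will generally differ from the one shown above; the second count is an equivalent check, included only to illustrate that a row contains all of 1:7 exactly when it has 7 distinct values.

```r
set.seed(1)  # arbitrary seed, chosen only to make the run repeatable
dataSet <- t(replicate(10000, sample(1:7, 12, replace = TRUE)))

# Rows that contain every value from 1 to 7
n_complete <- sum(apply(dataSet, 1, function(x) all(1:7 %in% x)))

# Equivalent check: such rows have exactly 7 distinct values
n_complete_alt <- sum(apply(dataSet, 1, function(x) length(unique(x)) == 7))
```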

Optimization: splitting dataframe into a list of dataframes, transforming data per row

Preliminaries: this question is mostly of educational value, the actual task at hand is completed, even if the approach is not entirely optimal. My question is whether the code below can be optimized for speed and/or implemented more elegantly. Perhaps using additional packages, such as plyr or reshape. Run on the actual data it takes about 140 seconds, much higher than the simulated data, since some of the original rows contain nothing but NA, and additional checks have to be made. To compare, the simulated data are processed in about 30 seconds.
Conditions: the dataset contains 360 variables, 30 times the set of 12. Let's name them V1_1, V1_2... (first set), V2_1, V2_2 ... (second set) and so forth. Each set of 12 variables contains dichotomous (yes/no) responses, in practice corresponding to a career status. For instance: work (yes/no), study (yes/no) and so forth, in total 12 statuses, repeated 30 times.
Task: the task at hand is to recode each set of 12 dichotomous variables into a single variable with 12 response categories (e.g. work, study... ). Ultimately we should get 30 variables, each with 12 response categories.
Data: I cannot post the actual dataset, but here is a good simulated approximation:
randomRow <- function() {
  # make a row with a single 1 and some NA's
  sample(x=c(rep(0,9),1,NA,NA),size=12,replace=F)
}
# create a data frame with 12 variables and 1500 cases
makeDf <- function() {
  data <- matrix(NA,ncol=12,nrow=1500)
  for (i in 1:1500) {
    data[i,] <- randomRow()
  }
  return(data)
}
mydata <- NULL
# combine 30 of these dataframes horizontally
for (i in 1:30) {
  mydata <- cbind(mydata,makeDf())
}
mydata <- as.data.frame(mydata) # example data ready
My solution:
# Divide the dataset into a list with 30 dataframes, each with 12 variables
S1 <- lapply(1:30,function(i) {
  Z <- rep(1:30,each=12) # define selection vector
  mydata[Z==i] # use selection vector to get groups of variables (x12)
})
recodeDf <- function(df) {
  result <- as.numeric(apply(df,1,function(x) {
    if (any(!is.na(df))) which(x == 1) else NA # return the position of "1" per row
  })) # the if/else check is for the real data
  return(result)
}
# Combine individual position vectors into a dataframe
final.df <- as.data.frame(do.call(cbind,lapply(S1,recodeDf)))
All in all, there is a double *apply function, one across the list, the other across the dataframe rows. This makes it a bit slow. Any suggestions? Thanks in advance.
Here is an approach that is basically instantaneous. (system.time = 0.1 seconds)
Use set. The columnMatch component will depend on your data, but if it is every 12 columns, then the following will work.
library(data.table)
MYD <- data.table(mydata)
# a new data.table (changed to numeric : Arun)
newDT <- as.data.table(replicate(30, numeric(nrow(MYD)),simplify = FALSE))
# for each column, which values equal 1
whiches <- lapply(MYD, function(x) which(x == 1))
# create a list of column matches (those you wish to aggregate)
columnMatch <- split(names(mydata), rep(1:30,each = 12))
setattr(columnMatch, 'names', names(newDT))
# cycle through all new columns
# and assign the rows in the new data.table
## Arun: had to generate numeric indices for
## cycling through 1:12, 13:24 in whiches[[.]]. That was the problem.
for(jj in seq_along(columnMatch)) {
  for(ii in seq_along(columnMatch[[jj]])) {
    set(newDT, j = jj, i = whiches[[ii + 12 * (jj-1)]], value = ii)
  }
}
This would work just as well adding columns by reference to the original.
Note set works on data.frames as well....
I really like @Arun's matrix multiplication idea. Interestingly, if you compile R against an OpenBLAS library, you could get this to operate in parallel.
However, I wanted to provide you with another, perhaps slower than matrix multiplication, solution that uses your original pattern, but is much faster than your implementation:
# Match is usually faster than which, because it only returns the first match
# (and therefore won't fail on multiple matches)
# It also neatly handles your *all NA* case
recodeDf2 <- function(df) apply(df,1,match,x=1)
# You can split your data.frame by column with split.default
# (Using split on data.frame will split-by-row)
S2<-split.default(mydata,rep(1:30,each=12))
final.df2<-lapply(S2,recodeDf2)
If you had a very large data frame, and many processors, you may consider parallelizing this operation with:
library(parallel)
final.df2<-mclapply(S2,recodeDf2,mc.cores=numcores)
# Where numcores is your number of processors.
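To make the match-based recoding concrete, here is a minimal sketch on a tiny made-up data frame with two sets of three columns instead of thirty sets of twelve. The toy values and the names toy, groups and recoded are illustrative assumptions.

```r
# Two sets of 3 dichotomous columns (V1-V3 and V4-V6), three cases
toy <- data.frame(
  V1 = c(1, 0, NA), V2 = c(0, 1, NA), V3 = c(NA, 0, NA),
  V4 = c(0, 0, 1),  V5 = c(0, 1, 0),  V6 = c(1, NA, 0)
)

# split.default splits a data.frame by column into a list of data.frames
groups <- split.default(toy, rep(1:2, each = 3))

# For each row, match() returns the position of the first 1, or NA if none
recoded <- data.frame(lapply(groups, function(df) apply(df, 1, match, x = 1)))
```

The third case has only NA in the first set, and match() returns NA for it without any extra check.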
Having read @Arun's and @mnel's answers, I learned a lot about how to improve this function by avoiding the coercion to an array and by processing the data.frame by column instead of by row. I don't mean to "steal" an answer here; OP should consider switching the checkbox to @mnel's answer.
I wanted, however, to share a solution that doesn't use data.table and avoids for. It is still, however, slower than @mnel's solution, albeit slightly.
nograpes2<-function(mydata) {
  test<-function(df) {
    l<-lapply(df,function(x) which(x==1))
    lens<-lapply(l,length)
    rep.int(seq.int(l),times=lens)[order(unlist(l))]
  }
  S2<-split.default(mydata,rep(1:30,each=12))
  data.frame(lapply(S2,test))
}
I would also like to add that @Aaron's approach, using which with arr.ind=TRUE, would also be very fast and elegant if mydata started out as a matrix rather than a data.frame. Coercion to a matrix is slower than the rest of the function. If speed were an issue, it would be worth considering reading the data in as a matrix in the first place.
IIUC, you have only one 1 per set of 12 columns in each row, and the rest are 0s or NAs. If so, the operation can be performed much faster with the following idea.
The idea: Instead of going through each row and asking for the position of 1, you could use a matrix with dimensions 1500 * 12 where each row is just 1:12. That is:
mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)
Now, you can multiply this matrix element-wise with each of your subset data.frames (of the same dimensions, 1500 * 12 here) and then take their rowSums (which is vectorised) with na.rm = TRUE. This directly gives the column position of the 1 in each row (because that 1 will have been multiplied by the corresponding value between 1 and 12).
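As a small worked illustration of this multiplication idea, consider one block with three columns instead of twelve. The block values are made up; mul.mat is built the same way as above.

```r
# One block of 3 dichotomous columns, four cases (illustrative values)
block <- matrix(c(0,  1,  NA,
                  1,  0,  0,
                  NA, NA, 1,
                  NA, NA, NA), ncol = 3, byrow = TRUE)

# Every row of the multiplier is 1:3
mul.mat <- matrix(rep(1:3, nrow(block)), ncol = 3, byrow = TRUE)

# Element-wise product, then row sums ignoring NA
pos <- rowSums(block * mul.mat, na.rm = TRUE)
pos
# [1] 2 1 3 0
```

Note that a row consisting only of NA yields 0 rather than NA, so such rows need to be recoded afterwards if NA is the desired result.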
data.table implementation: Here, I'll use data.table to illustrate the idea. Since it adds columns by reference, I'd expect the same idea used on a data.frame to be a tad slower, although it should still drastically speed up your current code.
require(data.table)
DT <- data.table(mydata)
ids <- seq(1, ncol(DT), by=12)
# for multiplying with each subset and taking rowSums to get position of 1
mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)
for (i in ids) {
  sdcols <- i:(i+12-1)
  # keep appending the new columns by reference to the original data
  DT[, paste0("R", i %/% 12 + 1) := rowSums(.SD * mul.mat,
       na.rm = TRUE), .SDcols = sdcols]
}
# delete all original 360 columns by reference from the original data
DT[, grep("V", names(DT), value=TRUE) := NULL]
Now, you'll be left with 30 columns that correspond to the position of 1's. On my system, this takes about 0.4 seconds.
all(unlist(final.df) == unlist(DT)) # not a fan of `identical`
# [1] TRUE
Another way this could be done with base R is with simply getting the values you want to put in the new matrix and filling them in directly with matrix indexing.
idx <- which(mydata==1, arr.ind=TRUE) # get indices of 1's
i <- ((idx[,2] - 1) %% 12) + 1 # get position within the set (1 to 12; a plain %% 12 would give 0 for the 12th column)
idx[,2] <- ((idx[,2] - 1) %/% 12) + 1 # get "group" and put in "col" of idx
out <- array(NA, dim=c(1500,30)) # make empty matrix
out[idx] <- i # and fill it in!
