Cumulative sums over run lengths. Can this loop be vectorized?

I have a data frame on which I calculate a run length encoding for a specific column. The values of the column, dir, are either -1, 0, or 1.
dir.rle <- rle(df$dir)
I then take the run lengths and compute segmented cumulative sums across another column in the data frame. I'm using a for loop, but I feel like there should be a way to do this more intelligently.
ndx <- 1
for(i in 1:length(dir.rle$lengths)) {
  l <- dir.rle$lengths[i] - 1
  s <- ndx
  e <- ndx + l
  tmp[s:e,]$cumval <- cumsum(df[s:e,]$val)
  ndx <- e + 1
}
The run lengths of dir define the start, s, and end, e, for each run. The above code works but it does not feel like idiomatic R code. I feel as if there should be another way to do it without the loop.

This can be broken down into a two-step problem. First, if we create an indexing column based on the rle, we can use it to group by and run the cumsum. The grouping can then be performed by any number of aggregation techniques. I'll show two options, one using data.table and the other using plyr.
library(data.table)
library(plyr)
#data.table is the same thing as a data.frame for most purposes
#Fake data
dat <- data.table(dir = sample(-1:1, 20, TRUE), value = rnorm(20))
dir.rle <- rle(dat$dir)
#Compute an indexing column to group by
dat <- transform(dat, indexer = rep(1:length(dir.rle$lengths), dir.rle$lengths))
#What does the indexer column look like?
> head(dat)
dir value indexer
[1,] 1 0.5045807 1
[2,] 0 0.2660617 2
[3,] 1 1.0369641 3
[4,] 1 -0.4514342 3
[5,] -1 -0.3968631 4
[6,] -1 -2.1517093 4
#data.table approach
dat[, cumsum(value), by = indexer]
#plyr approach
ddply(dat, "indexer", summarize, V1 = cumsum(value))

Both Spacedman & Chase make the key point that a grouping variable simplifies everything (and Chase lays out two nice ways to proceed from there).
I'll just throw in an alternative approach to forming that grouping variable. It doesn't use rle and, at least to me, feels more intuitive. Basically, at each point where diff() detects a change in value, the cumsum that will form your grouping variable is incremented by one:
df$group <- c(0, cumsum(!(diff(df$dir)==0)))
# Or, equivalently
df$group <- c(0, cumsum(as.logical(diff(df$dir))))
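As an aside (a sketch, not part of the original answers): once that group column exists, base R's ave() computes the grouped cumulative sum in a single call, assuming the value column is named val as in the question:
df$cumval <- ave(df$val, df$group, FUN = cumsum)  # cumsum restarts within each group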

Add a 'group' column to the data frame. Something like:
df=data.frame(z=rnorm(100)) # dummy data
df$dir = sign(df$z) # dummy +/- 1
rl = rle(df$dir)
df$group = rep(1:length(rl$lengths),times=rl$lengths)
then use tapply to sum within groups:
tapply(df$z,df$group,sum)
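If the goal is the within-run cumulative sum from the original question rather than a per-group total, the same grouping variable still works (a sketch; it relies on the rle-derived groups being contiguous, so unlisting preserves row order; cumz is a made-up column name):
df$cumz <- unlist(tapply(df$z, df$group, cumsum), use.names = FALSE)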

Related

Looking for a way to create a "median split" variable for several variables

I have a dataset which contains about 40 different variables. Now I would like to create a new variable indicating whether each observation is above or below the median.
I managed to create a new variable "var1_mediansplit" from the existing "var1" (values 1 for below median, 2 for everything else):
mydata$var1_mediansplit <- ifelse(mydata$var1 < median(mydata$var1), "1", "2")
I am looking for a way to run it through several variables (with a loop, I would guess). I appreciate any help!
Edit: The solution from jblood94 worked for me, so thank you!
Using the colMedians and eachrow from the Rfast package:
library(Rfast)
df <- as.data.frame(matrix(runif(4000), ncol = 40)) # dummy data
m <- as.matrix(df)
df2 <- as.data.frame((eachrow(m, colMedians(m), "-") >= 0) + 1)
Detailed explanation:
colMedians(m) returns the median of each column (a vector of
length 40).
eachrow takes a matrix for the first argument, a vector for the second argument (with the same length as the number of columns in the matrix), and an operator for the third argument. Each row of the matrix has the vector applied to it according to the operator. So here, the colMedians(m) vector is subtracted from each row of m.
The result of eachrow is compared to 0 (FALSE if it is less than 0, TRUE otherwise).
Operating on a logical with a numeric will coerce it to numeric: FALSE + 1 = 1, and TRUE + 1 = 2.
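If Rfast isn't available, a base-R sketch of the same idea substitutes sweep() for eachrow (assuming, as with the dummy data above, that every column of df is numeric):
m <- as.matrix(df)
# sweep() subtracts each column's median from that column; >= 0 plus 1 gives 1/2
df2 <- as.data.frame((sweep(m, 2, apply(m, 2, median), "-") >= 0) + 1)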
Don't overcomplicate it, consider this:
v <- c(1:100)
x <- median(v)
y <- v >= x
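Extended across all columns, that simple idea becomes a short lapply (a sketch, assuming mydata holds only the numeric survey variables as in the question):
# 1 = below the column median, 2 = otherwise, matching the question's coding
splits <- lapply(mydata, function(v) ifelse(v < median(v), 1, 2))
names(splits) <- paste0(names(mydata), "_mediansplit")
mydata <- data.frame(mydata, splits)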

Recoding sequentially-named variables based on values of answers

I'm struggling with using lapply to recode values parsimoniously.
Let's say I have 10 survey questions with 4 answers each, in which there is always one right or wrong answer. The questions are labeled q_1 through q_10, and my dataframe is called df. I'd like to create new variables with the same sequential labels that simply code the question as "right" (1) or "wrong" (0).
If I were to make a list of the right answers, it would be:
right_answers<-c(1,2,3,4,2,3,4,1,2,4)
Then, I'm trying to write a function that simply recodes all of the variables into new variables while using the same sequential identifier, such as
lapply(1:10, function(fx) {
  df$know_[fx] <- ifelse(df$q_[fx] == right_answers[fx], 1, 0)
})
In a hypothetical universe where this code was remotely correct, I'd get results such that:
id q_1 know_1 q_2 know_2
 1   1      1   2      1
 2   4      0   3      0
 3   3      0   2      1
 4   4      0   1      0
Thanks so much for your help!
For the same matrix output as the other answers, I would suggest:
q_names <- paste0("q_", seq_along(right_answers))
answers <- df[q_names]
correct <- mapply(`==`, answers, right_answers)
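To turn that logical matrix into the 0/1 know_ columns requested in the question, something like the following should work (a sketch; the unary + coerces logical to numeric):
colnames(correct) <- sub("^q_", "know_", q_names)
df <- cbind(df, +correct)  # +correct turns TRUE/FALSE into 1/0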
This should give you a matrix of whether or not each answer was correct:
t(apply(df[, grep("q_", names(df))], 1, function(X) X == right_answers))
You are likely having trouble with this part of the code: df$q_[fx]. You could instead construct the column names using paste, such as:
df = read.table(text = "
id q_1 q_2
1 1 2
2 4 3
3 3 2
4 4 1", header = TRUE)
right_answers = c(1,2,3,4,2,3,4,1,2,4)
dat2 = sapply(1:2, function(fx) {
  ifelse(df[paste("q", fx, sep = "_")] == right_answers[fx],
         1, 0)
})
This doesn't add columns to your data.frame, but instead makes a new matrix much like #SenorO's answer. You can name the columns in the matrix and then add them to the original data.frame as follows.
colnames(dat2) = paste("know", 1:2, sep = "_")
data.frame(df, dat2)
I'd like to suggest a different approach to your question, using the reshape2 package. In my opinion, this has the advantages of being: 1) more idiomatic R (for what that's worth), 2) more readable code, 3) less error prone, particularly if you want to add analysis in the future. In this approach, everything is done within dataframes, which I think is desirable when possible -- easier to keep all the values for a single record (id in this case) and easier to use the power of R tools.
# Creating a dataframe with the form you describe
df <- data.frame(id = c('1','2','3','4'),
                 q_1 = c(1,4,3,4), q_2 = c(2,3,2,1), q_3 = rep(1, 4),
                 q_4 = rep(2, 4), q_5 = rep(3, 4), q_6 = rep(4, 4),
                 q_7 = c(1,4,3,4), q_8 = c(2,3,2,1), q_9 = rep(1, 4),
                 q_10 = rep(2, 4))
right_answers<-c(1,2,3,4,2,3,4,1,2,4)
# Associating the right answers explicitly with the corresponding question labels in a data frame
answer_df <- data.frame(questions=paste('q', 1:10, sep='_'), right_answers)
library(reshape2)
# "Melting" the dataframe from "wide" to "long" form -- now questions labels are in variable values rather than in column names
melt_df <- melt(df) # melt function is from reshape2 package
# Now merging the correct answers into the data frame containing the observed answers
merge_df <- merge(melt_df, answer_df, by.x='variable', by.y='questions')
# At this point comparing the observed to correct answers is trivial (using as.numeric to convert from logical to 0/1 as you request, though keeping as TRUE/FALSE may be clearer)
merge_df$correct <- as.numeric(merge_df$value==merge_df$right_answers)
# If desirable (not sure it is), put back into "wide" dataframe form
cast_obs_df <- dcast(merge_df, id ~ variable, value.var='value') # dcast function is from reshape2 package
cast_cor_df <- dcast(merge_df, id ~ variable, value.var='correct')
names(cast_cor_df) <- gsub('q_', 'know_', names(cast_cor_df))
final_df <- merge(cast_obs_df, cast_cor_df)
The new tidyr package would probably be even better here than reshape2.
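For instance, a sketch with the current tidyr/dplyr API (the function names below are from tidyr >= 1.0, well after this answer was written):
library(tidyr)
library(dplyr)
long_df <- df %>%
  pivot_longer(starts_with("q_"), names_to = "questions", values_to = "value") %>%
  # answer_df as built above; a factor 'questions' column may trigger a harmless join warning
  left_join(answer_df, by = "questions") %>%
  mutate(correct = as.numeric(value == right_answers))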

Vectorized meta data computation based on multiple columns on R data.frame

I have a data.frame with 3 columns, each of which can be thought of as a factor. I'd like to compute some stats on the data.frame and store it in a new frame. To be more specific, I have the following fields:
obs len src
  A  10   X
  B  10   Y
I'd like to compute the breakdown of each source at each length (i.e. what percentage of observations from source X that are of length 10 are "A", "B", etc.)
An obvious approach is to use two for loops to iterate over the lengths and sources and then use nrow() and count() (the latter from the plyr package) to get the values needed, like so:
relevant_subset <- data[data$src == source & data$len == length, ]
breakdown_info <- count(relevant_subset)
breakdown_info$frac <- breakdown_info$freq / nrow(relevant_subset)
Is there a way to avoid using the double for loop and use a more vectorized approach? Is there a smart way to pre-allocate the new frame that would hold the modified breakdown_info for each length and source?
aggregate is your friend for these tasks:
Example data:
set.seed(23)
test <- data.frame(
  obs = sample(LETTERS[1:2], 20, replace = TRUE),
  len = sample(c(10, 20), 20, replace = TRUE),
  src = sample(LETTERS[24:25], 20, replace = TRUE)
)
Aggregate it:
aggregate(obs ~ src + len,data=test, function(x) prop.table(table(x)))
src len obs.A obs.B
1 X 10 0.6000000 0.4000000
2 Y 10 0.2000000 0.8000000
3 X 20 0.2500000 0.7500000
4 Y 20 0.1666667 0.8333333
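One caveat worth knowing (an editor's note, not part of the original answer): when the aggregation function returns a vector, aggregate() stores the result as a matrix column, so obs.A and obs.B above are really one matrix named obs. It can be flattened into ordinary columns like this:
agg <- aggregate(obs ~ src + len, data = test, function(x) prop.table(table(x)))
agg_flat <- do.call(data.frame, agg)  # splits the matrix column into obs.A, obs.B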
This is what the plyr package was made for!
The format is <input_type><output_type>ply. For example if the input is a data.frame and you want the output to be a data.frame use ddply.
To use it, you specify the input data.frame, the columns to group by and then a function that constructs a data.frame from each group. The resulting data.frames appended with the grouping columns are assembled together into the output data.frame.
In something similar to your example, you could do
require(plyr)
a <- data.frame(
  obs = factor(c('A','A','A','B','B')),
  len = c(10, 10, 10, 10, 210),
  src = factor(c('X','X','Y','Y','Z')))
then
z <- ddply(
  a,
  .(obs),
  function(df) {
    data.frame(mean.len = mean(df$len))
  })
would produce
data.frame(
  obs = c('A', 'B'),
  mean.len = c(10, 110))
while
ddply(a, .(src), function(df) {
  data.frame(
    num.obs.A = sum(df$obs == 'A'),
    num.obs.B = sum(df$obs == 'B'))})
would produce
data.frame(
  src = c('X', 'Y', 'Z'),
  num.obs.A = c(3, 1, 0),
  num.obs.B = c(0, 1, 1))
The website http://plyr.had.co.nz/ has good documentation too.
You haven't stated a reason why you want a data.frame here as output. Perhaps it's best for you, perhaps not. You also aren't really clear on what proportions are what but I think the following might solve your problem best.
prop.table( table(test) )
You could enter it slightly differently and play with the order of columns so that what you want to compare is most easily examined. But, this output is a 3-dimensional array and quite a bit different from a data.frame.
(example of alternate usage)
prop.table(with(test, table(src, obs, len) ))
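A sketch of a modern alternative, in case a flat data.frame of proportions is the goal (assumes the dplyr package; not part of the original answers):
library(dplyr)
test %>%
  count(src, len, obs) %>%       # frequency of each obs within src/len
  group_by(src, len) %>%
  mutate(frac = n / sum(n)) %>%  # proportion within each src/len cell
  ungroup()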

Optimization: splitting dataframe into a list of dataframes, transforming data per row

Preliminaries: this question is mostly of educational value; the actual task at hand is completed, even if the approach is not entirely optimal. My question is whether the code below can be optimized for speed and/or implemented more elegantly, perhaps using additional packages such as plyr or reshape. Run on the actual data it takes about 140 seconds, much longer than on the simulated data, since some of the original rows contain nothing but NA and additional checks have to be made. For comparison, the simulated data are processed in about 30 seconds.
Conditions: the dataset contains 360 variables, 30 times the set of 12. Let's name them V1_1, V1_2... (first set), V2_1, V2_2 ... (second set) and so forth. Each set of 12 variables contains dichotomous (yes/no) responses, in practice corresponding to a career status. For instance: work (yes/no), study (yes/no) and so forth, in total 12 statuses, repeated 30 times.
Task: the task at hand is to recode each set of 12 dichotomous variables into a single variable with 12 response categories (e.g. work, study... ). Ultimately we should get 30 variables, each with 12 response categories.
Data: I cannot post the actual dataset, but here is a good simulated approximation:
randomRow <- function() {
  # make a row with a single 1 and some NA's
  sample(x = c(rep(0, 9), 1, NA, NA), size = 12, replace = FALSE)
}
# create a data frame with 12 variables and 1500 cases
makeDf <- function() {
  data <- matrix(NA, ncol = 12, nrow = 1500)
  for (i in 1:1500) {
    data[i, ] <- randomRow()
  }
  return(data)
}
mydata <- NULL
# combine 30 of these dataframes horizontally
for (i in 1:30) {
  mydata <- cbind(mydata, makeDf())
}
mydata <- as.data.frame(mydata) # example data ready
My solution:
# Divide the dataset into a list with 30 dataframes, each with 12 variables
S1 <- lapply(1:30, function(i) {
  Z <- rep(1:30, each = 12) # define selection vector
  mydata[Z == i] # use selection vector to get groups of variables (x12)
})
recodeDf <- function(df) {
  result <- as.numeric(apply(df, 1, function(x) {
    if (any(!is.na(x))) which(x == 1) else NA # return the position of "1" per row
  })) # the if/else check is for the real data (note: the test is on x, not df)
  return(result)
}
# Combine individual position vectors into a dataframe
final.df <- as.data.frame(do.call(cbind,lapply(S1,recodeDf)))
All in all, there is a double *apply: one across the list, the other across the data frame rows. This makes it a bit slow. Any suggestions? Thanks in advance.
Here is an approach that is basically instantaneous (system.time = 0.1 seconds). It uses set. The columnMatch component will depend on your data, but if it is every 12 columns, then the following will work.
MYD <- data.table(mydata)
# a new data.table (changed to numeric : Arun)
newDT <- as.data.table(replicate(30, numeric(nrow(MYD)), simplify = FALSE))
# for each column, which values equal 1
whiches <- lapply(MYD, function(x) which(x == 1))
# create a list of column matches (those you wish to aggregate)
columnMatch <- split(names(mydata), rep(1:30, each = 12))
setattr(columnMatch, 'names', names(newDT))
# cycle through all new columns
# and assign the rows in the new data.table
## Arun: had to generate numeric indices for
## cycling through 1:12, 13:24 in whiches[[.]]. That was the problem.
for (jj in seq_along(columnMatch)) {
  for (ii in seq_along(columnMatch[[jj]])) {
    set(newDT, j = jj, i = whiches[[ii + 12 * (jj - 1)]], value = ii)
  }
}
This would work just as well adding columns by reference to the original.
Note set works on data.frames as well....
I really like #Arun's matrix multiplication idea. Interestingly, if you compile R against an optimized BLAS such as OpenBLAS, you could get this to operate in parallel.
However, I wanted to provide you with another, perhaps slower than matrix multiplication, solution that uses your original pattern, but is much faster than your implementation:
# match is usually faster than which, because it only returns the first match
# (and therefore won't fail on multiple matches)
# It also neatly handles your *all NA* case
# (apply passes each row as match()'s 'table' argument, since x = 1 is matched by name)
recodeDf2 <- function(df) apply(df, 1, match, x = 1)
# You can split your data.frame by column with split.default
# (Using split on a data.frame will split by row)
S2 <- split.default(mydata, rep(1:30, each = 12))
final.df2 <- lapply(S2, recodeDf2)
If you had a very large data frame, and many processors, you may consider parallelizing this operation with:
library(parallel)
final.df2<-mclapply(S2,recodeDf2,mc.cores=numcores)
# Where numcores is your number of processors.
Having read #Arun and #mnel, I learned a lot about how to improve this function, by avoiding the coercion to an array, by processing the data.frame by column instead of by row. I don't mean to "steal" an answer here; OP should consider switching the checkbox to #mnel's answer.
I wanted, however, to share a solution that doesn't use data.table, and avoids for. It is still, however, slower than #mnel's solution, albeit slightly.
nograpes2 <- function(mydata) {
  test <- function(df) {
    l <- lapply(df, function(x) which(x == 1))
    lens <- sapply(l, length)
    rep.int(seq_along(l), times = lens)[order(unlist(l))]
  }
  S2 <- split.default(mydata, rep(1:30, each = 12))
  data.frame(lapply(S2, test))
}
I would also like to add that #Aaron's approach, using which with arr.ind=TRUE would also be very fast and elegant, if mydata started out as a matrix, rather than a data.frame. Coercion to a matrix is slower than the rest of the function. If speed were an issue, it would be worth considering reading the data in as a matrix in the first place.
IIUC, you have only one 1 per set of 12 columns, and the rest are 0's or NA's. If so, the operation can be performed much faster with this idea.
The idea: instead of going through each row and asking for the position of the 1, you can use a matrix with dimensions 1500 * 12 where each row is just 1:12. That is:
mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)
Now, you can multiply this matrix with each of your 12-column subsets (of the same dimensions, 1500 * 12 here) and then take their rowSums (which is vectorised) with na.rm = TRUE. This directly gives the position of the 1 in each row (because that 1 will have been multiplied by the corresponding value between 1 and 12).
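Stripped of data.table, the core of the trick on a single 12-column block looks like this (an editor's illustrative sketch, not part of the original answer):
block <- as.matrix(mydata[, 1:12])
mul.mat <- matrix(rep(1:12, nrow(block)), ncol = 12, byrow = TRUE)
pos <- rowSums(block * mul.mat, na.rm = TRUE)  # position of the 1 in each row
# note: rows that are all NA come out as 0 here, not NA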
data.table implementation: Here, I'll use data.table to illustrate the idea. Since it creates column by references, I'd expect that the same idea used on a data.frame would be a tad slower, although it should drastically speed up your current code.
require(data.table)
DT <- data.table(mydata)
ids <- seq(1, ncol(DT), by=12)
# for multiplying with each subset and taking rowSums to get position of 1
mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)
for (i in ids) {
  sdcols <- i:(i + 12 - 1)
  # keep appending the new columns by reference to the original data
  DT[, paste0("R", i %/% 12 + 1) := rowSums(.SD * mul.mat, na.rm = TRUE),
     .SDcols = sdcols]
}
# delete all original 360 columns by reference from the original data
DT[, grep("V", names(DT), value=TRUE) := NULL]
Now, you'll be left with 30 columns that correspond to the position of 1's. On my system, this takes about 0.4 seconds.
all(unlist(final.df) == unlist(DT)) # not a fan of `identical`
# [1] TRUE
Another way this could be done with base R is with simply getting the values you want to put in the new matrix and filling them in directly with matrix indexing.
idx <- which(mydata==1, arr.ind=TRUE) # get indices of 1's
i <- (idx[,2] - 1) %% 12 + 1 # get column within its set of 12 that was 1
idx[,2] <- ((idx[,2] - 1) %/% 12) + 1 # get "group" and put in "col" of idx
out <- array(NA, dim=c(1500,30)) # make empty matrix
out[idx] <- i # and fill it in!
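As a quick sanity check (a sketch), this should agree with the loop-based final.df from the question wherever both are non-NA:
all(out == as.matrix(final.df), na.rm = TRUE)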

ranking multiple data frames and summing across them in R

I have 10 data frames with 2 columns each, I'm calling the dataframes a, b, c, d, e, f, g, h, i and j.
The first column in each data frame is called s for sequences and the second is p for p-values corresponding to each sequence. The s column contains the same sequences across all 10 data frames, essentially the only difference is in the p-values.
Below is a short version of data frame a, which has 600,000 rows.
s p
gtcg 0.06
gtcgg 0.05
gggaa 0.07
cttg 0.05
I want to rank each dataframe by p-value, the smallest p-value should get a rank of 1 and equal p-values should get the same rank. Each final data frame should be in this format:
s p_rank_a
gtcg 2
gtcgg 1
gggaa 3
cttg 1
I've used this to do one:
r<-rank(a$p)
cbind(a$s,r)
but I'm not very familiar with loops and I don't know how to do this automatically. Ultimately I would like a final file that has the s column and in the next column the rank sum of all the ranks across all data frames for each specific sequence.
SO basically this:
s ranksum_P_a-j
gtcg 34
gtcgg 5
gggaa 5009093
cttg 499
Please help and thanks!
For a single data.frame, you can do it in one line, as follows:
credit to #Arun for pointing out to use as.numeric(factor(p))
library(data.table)
aDT <- data.table(a)[, p_rank := as.numeric(factor(p))]
I would suggest keeping all the data.frames in a single list, so that you can easily iterate over them.
Since your date.frames are letters, it's easy to collect the ten of them:
# collect them all
allOfThem <- lapply(letters[1:10], get, envir=.GlobalEnv)
# keep in mind you named an object `c`
# convert to DT and create the ranks
allOfThem <- lapply(allOfThem, function(x) data.table(x)[, p_rank := as.numeric(factor(p))])
On a separate note: it might be a good habit to avoid naming objects c and other common functions in R. Otherwise, you will find yourself encountering many "unexplainable" behaviors which, after you've beaten your head against a wall for an hour trying to debug them, turn out to come from having overwritten the name of a function. This has never happened to me :)
I'd put all the data.frames in a list and then use lapply and transform as follows:
my_l <- list(a,b,c) # all your data.frames
# you can use rank but it'll give you the average in case of ties
# lapply(my_l, function(x) transform(x, rank_p = rank(p)))
# I prefer this method instead
my_o <- lapply(my_l, function(x) transform(x, p = as.numeric(factor(p))))
# now bind them in to a single data.frame
my_o <- do.call(rbind, my_o)
# now paste them
aggregate(data = my_o, p ~ s, function(x) paste(x, collapse=","))
# s p
# 1 cttg 1,1,1
# 2 gggaa 3,3,3
# 3 gtcg 2,2,2
# 4 gtcgg 1,1,1
Edit: since you've asked for a potentially faster solution (due to large data), I'd suggest, like #Ricardo, a data.table solution:
require(data.table)
# bind all your data.frames together
dt <- rbindlist(my_l) # my_l is your list of data.frames
# replace p-value with their "rank"
dt[, p := as.numeric(factor(p))]
# set key
setkey(dt, "s")
# combine them using `,`
dt[, list(p_ranks = paste(p, collapse=",")), by=s]
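And since the question's ultimate goal is a rank sum per sequence, the same keyed data.table gets there directly (a sketch):
dt[, list(ranksum = sum(p)), by = s]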
