Creating multiple subsets all in one data.frame (possibly with ddply) - r

I have a large data.frame, and I'd like to be able to reduce it to the rows above a given quantile, computed separately within each level of one of the variables. For example:
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame(x,rnorm(100))
df2 <- subset(df, df$x == 1)
df3 <- subset(df2, df2[2] > quantile(df2$rnorm.100.,0.8))
What I would like to end up with is a single data.frame that contains, for each of x = 1, 2, 3, ..., 10, the rows above that group's 0.8 quantile.
Is there a way to do this with ddply?

You could try:
ddply(df, .(x), subset, rnorm.100. > quantile(rnorm.100., 0.8))
And off topic: you could use df <- data.frame(x,y=rnorm(100)) to name a column on-the-fly.
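For reference, a minimal end-to-end sketch of that answer (it assumes the plyr package is installed; the column name rnorm.100. is the one data.frame() generates automatically from the unnamed rnorm(100) argument):
library(plyr)
set.seed(1)                       # only so the sketch is reproducible
x  <- rep(1:10, 10)
df <- data.frame(x, rnorm(100))   # second column gets auto-named rnorm.100.
# within each level of x, keep only the rows above that group's 80th percentile
df.top <- ddply(df, .(x), subset, rnorm.100. > quantile(rnorm.100., 0.8))
head(df.top)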

Here's a different approach using the little-used ave() function, which is very fast for this kind of grouped calculation.
Make a new column that contains the 0.8 quantile computed within each level of x:
df$quantByX <- ave(df$rnorm.100., df$x, FUN = function (x) quantile(x,0.8))
Then select the x column and the new column, keeping only the unique rows:
df2 <- unique(df[,c(1,3)])
The result is one data frame with the unique items in the x column and the calculated quantile for each level of x.
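If the goal is still the one from the original question (keeping the rows above each group's cutoff rather than just the cutoffs themselves), the ave() column can be compared against the original values directly. A small sketch, reusing the column names from above:
df$quantByX <- ave(df$rnorm.100., df$x, FUN = function(v) quantile(v, 0.8))
df.top <- df[df$rnorm.100. > df$quantByX, ]  # rows above their group's 80th percentile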

Related

How to make a for loop to find the average of the columns in several data frames and add a new column?

I looked at the variations of this question, but was not able to find an answer that worked... so here goes. I have a lot of data frames, each representing a psychological index (the kind where several questions are asked and their average gives you a score on whatever you are measuring: anger, anxiety, etc.). For this example, I will choose three of them: SA, SE, GT.
I would like to make a for loop to automatically calculate the average of the columns in each data frame, and then add a new column with that average.
I was able to make a for loop to do this for one data frame, but how do I then loop this loop to do it for all of my data frames (which is a lot more than 3)?
# This is the for loop to do it for just one data frame (SA)
avg <- c()
for (i in 1:nrow(SA)) {
  avg[i] <- sum(as.numeric(SA[i, ]), na.rm = TRUE) / ncol(SA)
}
SA$avg <- avg
# This is what I tried to do for multiple:
my.list <- list(SA, SE, GT)
for (l in my.list) {
  avg <- c()
  for (i in 1:nrow(l)) {
    avg[i] <- sum(as.numeric(l[i, ]), na.rm = TRUE) / ncol(l)
  }
  l$avg <- avg
}
This may work for you. I've created some dummy data frames, assuming that you have the same number of observations for each psychological index. You then bash them all together into one big dataframe. The colMeans function will compute means for each column:
SA <- data.frame(SA=runif(10))
SE <- data.frame(SE=runif(10))
GT <- data.frame(GT=runif(10))
MP <- data.frame(MP=runif(10))
df <- cbind(SA, SE, GT, MP)
av <- colMeans(df, na.rm = TRUE)
If the indices have differing numbers of observations, you can combine them into a list as you did, and then use the function sapply(). Since each element of the list is a dataframe, you need to extract the actual column by using the index operator [, 1] (first column):
df <- list(SA, SE, GT, MP)
sapply(df, function(x) mean(x[,1], na.rm=TRUE))
UPDATE:
You can create a list of your dataframes again, but as you need means across rows, just use the rowMeans() function:
SA <- data.frame(matrix(runif(50), nrow=10))
SE <- data.frame(matrix(runif(80), nrow=10))
df <- list(SA, SE)
lapply(df, function(x) {x$index_means <- rowMeans(x, na.rm=TRUE); return(x) })
This will give you a list of data frames with a new column of means for each index.
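A small usage note, under the assumption that you want the updated data frames back under their original names (the loop in the question leaves SA, SE and GT unchanged because l is a copy of each list element): name the list and write the results back, for example
df <- list(SA = SA, SE = SE, GT = GT)
df <- lapply(df, function(x) { x$index_means <- rowMeans(x, na.rm = TRUE); x })
# overwrite the originals in the workspace with the updated versions
list2env(df, envir = .GlobalEnv)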

Split Apply Combine

I have a large list, and would like to apply the exact technique detailed in the answer here:
Create mutually exclusive dummy variables from categorical variable in R
However, my data is much larger, and I would like to split the data, apply the operation to each individual row, and combine the results.
This code, which of course does not work, illustrates what I am trying to do:
id <- c(1,1,1,1)
time <- c(1,2,3,4)
time <- as.character(time)
df <- data.frame(id, time)
unique.time <- as.character(unique(df$time))
df1 <- split(df, row(df))
sapply(df1, (unique.time, function(x)as.numeric(df1$time == x)))
z <- unsplit(lapply(df1, row(df)), scale), x)
Thanks!
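A minimal sketch of what the linked technique might look like applied to this example, assuming the goal is one 0/1 dummy column per unique value of time; note that the comparison is already vectorised over rows, so an explicit split()/unsplit() step isn't needed:
id   <- c(1, 1, 1, 1)
time <- as.character(c(1, 2, 3, 4))
df   <- data.frame(id, time, stringsAsFactors = FALSE)
unique.time <- unique(df$time)
# one dummy column per unique time value; sapply column-binds the results
dummies <- sapply(unique.time, function(t) as.numeric(df$time == t))
colnames(dummies) <- paste0("time_", unique.time)
cbind(df, dummies)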

How can I cumulatively apply a custom function to a vector in R? In an efficient and idiomatic way?

I know the function cumsum in R, which computes a cumulative sum of its vector argument.
I need to "cumulatively apply" not the sum function but a generic function, in my specific case, the quantile function.
My current solution is based on a loop:
set.seed(42)
df <- data.frame(measurement = rnorm(1000), upper = 0, lower = 0)
for (r in seq(1, nrow(df))) {
  df$upper[r] <- quantile(df[seq(1, r), "measurement"], c(.99))
  df$lower[r] <- quantile(df[seq(1, r), "measurement"], c(.01))
}
x=seq(1,nrow(df))
plot(df$measurement,type="l",col="grey")
lines(x,df$upper,col="red")
lines(x,df$lower,col="blue")
It works but it is not efficient and I feel there should be a more idiomatic way of doing it in R.
You can use this approach:
set.seed(42)
df <- data.frame(measurement = rnorm(1000))
res <- sapply(seq(nrow(df)), function(x)
  quantile(df[seq(x), "measurement"], c(.01, .99)))
It creates a matrix with nrow(df) columns and 2 rows, one row for the 1st percentile and one row for the 99th percentile.
You can add this information to your data frame df (as two columns):
df <- setNames(cbind(df, t(res)), c(names(df), "lower", "upper"))
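To check the result against the question's plot, the same plotting calls can be reused with the new columns (this reuses the df built just above):
x <- seq_len(nrow(df))
plot(df$measurement, type = "l", col = "grey")
lines(x, df$upper, col = "red")
lines(x, df$lower, col = "blue")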

R: Apply function on specific columns preserving the rest of the dataframe

I'd like to learn how to apply functions to specific columns of my data frame without "excluding" the other columns from my df. For example, I'd like to multiply some specific columns by 1000 and leave the other ones as they are.
Using the sapply function, for example like this:
a<-as.data.frame(sapply(table.xy[,1], function(x){x*1000}))
I get a new data frame with the first column multiplied by 1000, but without the other columns that I didn't use in the operation. So my attempt was to do it like this:
a<-as.data.frame(sapply(table.xy, function(x) if (colnames=="columnA") {x/1000} else {x}))
but this one didn't work.
My workaround was to give both data frames another column with IDs and later merge the old data frame with the newly created one to get a complete one. But I think there must be a better solution, isn't there?
If you only want to do a computation on one or a few columns you can use transform or simply index it manually:
# with transform:
df <- data.frame(A = 1:10, B = 1:10)
df <- transform(df, A = A*1000)
# Manually:
df <- data.frame(A = 1:10, B = 1:10)
df$A <- df$A * 1000
The following code will apply the desired function to only the columns you specify.
I'll create a simple data frame as a reproducible example.
(df <- data.frame(x = 1, y = 1:10, z=11:20))
(df <- cbind(df[1], apply(df[2:3],2, function(x){x*1000})))
Basically, use cbind() to select the columns you don't want the function to run on, then use apply() with desired functions on the target columns.
In dplyr we would use mutate_at, where you can select specific variables or exclude them (by preceding the variable name with a minus sign, "-").
You can just name a function:
df <- df %>%
  mutate_at(vars(columnA), scale)
or create your own, for example to match the question's goal of multiplying by 1000:
df <- df %>%
  mutate_at(vars(columnA, columnC), function(x) x * 1000)
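A minimal runnable sketch of both forms mentioned above (columnA, columnB and columnC are placeholder names; dplyr must be installed):
library(dplyr)
df <- data.frame(columnA = 1:5, columnB = 1:5, columnC = 1:5)
# select specific columns by name...
df1 <- df %>% mutate_at(vars(columnA, columnC), function(x) x * 1000)
# ...or exclude a column with the minus sign and transform the rest
df2 <- df %>% mutate_at(vars(-columnB), function(x) x * 1000)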

data.frame: create column by applying a function to groups of rows

I have a data frame consisting of results from multiple runs of an experiment, each of which serves as a log, with its own ascending counter. I'd like to add another column to the data frame that has the maximum value of iteration for each distinct value of experiment.num in the sample below:
df <- data.frame(
  iteration = rep(1:5, 5),
  experiment.num = c(rep(1, 5), rep(2, 5), rep(3, 5), rep(4, 5), rep(5, 5)),
  some.val = 42,
  another.val = 12
)
In this example, the extra column would look like this (as all the subsets have the same maximum for iteration):
df$max <- rep(5,25)
The naive solution I currently use is:
df$max <- sapply(df$experiment.num,function(exp.num) max(df$iteration[df$experiment.num == exp.num]))
I've also used sapply(unique(df$experiment.num), function(n) c(n,max(df$iteration[df$experiment.num==n]))) to build another frame which I can then merge with the original, but both of these approaches seem more complicated than necessary.
The experiment.num column is a factor, so I think I might be able to exploit that to avoid iteratively doing this naive subsetting for all rows.
Is there a better way to get a column of maximum values for subsets of a data.frame?
Using plyr:
ddply(df, .(experiment.num), transform, max = max(iteration))
Using ave in base R:
df$i_max <- with(df, ave(iteration, experiment.num, FUN=max))
Here's a way in base R:
within(df[order(df$experiment.num), ],
       max <- rep(tapply(iteration, experiment.num, max),
                  rle(experiment.num)$lengths))
I think you can use data.table:
install.packages("data.table")
library("data.table")
dt <- data.table(df)  # make your data frame into a data table
dt[, max := max(iteration), by = experiment.num]  # adds a new column "max" with the per-group maximum of iteration
