I have a data frame of results from multiple runs of an experiment; each run serves as a log with its own ascending counter. I'd like to add a column to the data frame holding the maximum value of iteration for each distinct value of experiment.num in the sample below:
df <- data.frame(
  iteration = rep(1:5, 5),
  experiment.num = rep(1:5, each = 5),
  some.val = 42,
  another.val = 12
)
In this example, the extra column would look like this (as all the subsets have the same maximum for iteration):
df$max <- rep(5,25)
The naive solution I currently use is:
df$max <- sapply(df$experiment.num, function(exp.num) max(df$iteration[df$experiment.num == exp.num]))
I've also used sapply(unique(df$experiment.num), function(n) c(n, max(df$iteration[df$experiment.num == n]))) to build another frame that I can then merge with the original, but both approaches seem more complicated than necessary.
The experiment.num column is a factor, so I think I might be able to exploit that to avoid this naive row-by-row subsetting.
Is there a better way to get a column of maximum values for subsets of a data.frame?
Using plyr:
ddply(df, .(experiment.num), transform, max = max(iteration))
Using ave in base R:
df$i_max <- with(df, ave(iteration, experiment.num, FUN=max))
Here's a way in base R (note that rle() needs an atomic vector, so if experiment.num really is a factor, wrap it in as.integer() first):
within(df[order(df$experiment.num), ],
       max <- rep(tapply(iteration, experiment.num, max),
                  rle(experiment.num)$lengths))
I think you can use data.table:
install.packages("data.table")
library(data.table)
dt <- data.table(df)  # convert your data frame into a data table
dt[, max.iter := max(iteration), by = experiment.num]  # adds a column with each group's maximum of iteration
I have a sample dataset. I've created a subset of the original data frame using some condition. Now I need the remaining rows of the original data frame, i.e. everything except the subset. How can I do this?
data("mtcars")
fulldf <- mtcars
subdf <- subset.data.frame(fulldf, subset = fulldf$disp < 100)
restdf <- subset.data.frame(fulldf, subset = <fulldf without subdf>)
There are a lot of questions on subsetting data frames in R, but I couldn't find one that satisfied my requirement.
Also the final solution need not necessarily be using subset.data.frame. Any method/package will do.
In base R it is better to assign the logical condition to an object and then negate it with !:
i1 <- fulldf$disp < 100
subdf <- subset.data.frame(fulldf, subset = i1)
restdf <- subset.data.frame(fulldf, subset = !i1)
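As a quick sanity check, the two pieces should partition the original (illustrative; it assumes i1 contains no NAs, which holds for mtcars$disp):
stopifnot(nrow(subdf) + nrow(restdf) == nrow(fulldf))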
Another option is to create a list of the two datasets with split:
lst1 <- split(fulldf, i1)
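split() names the list elements after the logical values, so the two pieces can be pulled out by name (illustrative):
restdf <- lst1[["FALSE"]]  # rows where the condition is FALSE
subdf <- lst1[["TRUE"]]    # rows where the condition is TRUE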
If 'subdf' was created with multiple conditions (this isn't entirely clear from the question), one option is to add a sequence variable to the data and then subset with %in%:
fulldf$ind <- seq_len(nrow(fulldf))
then after the 'subdf' step
restdf <- subset(fulldf, !ind %in% subdf$ind)
and finally remove the 'ind' column from both:
restdf$ind <- NULL
subdf$ind <- NULL
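Since mtcars carries row names, a row-name based anti-selection also works here (a sketch; it assumes unique row names, which mtcars has):
restdf <- fulldf[!rownames(fulldf) %in% rownames(subdf), ]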
I have a fairly big data.frame with out-of-date names, and I want to pull in the correct names, which are stored in another data.frame.
I am using the stringdist function to find the closest match between the two name columns, and then I want to put the new names into the original data.frame.
My current code is based on sapply, as in the following example:
dat1 <- data.frame("name" = paste0("abc", seq(1:5)),
"value" = round(rnorm(5), 1))
dat2 <- data.frame("name" = paste0("abd", seq(1:5)),
"other_info" = seq(11:15))
dat1$name2 <- sapply(dat1$name, function(x) {
  char_min <- stringdist::stringdist(x, dat2$name)
  dat2[which.min(char_min), "name"]
})
dat1
However, this code is too slow given the size of my data.frame.
Is there a faster alternative, for example using the data.table package?
First convert the data frames into data tables:
dat1 <- data.table(dat1)
dat2 <- data.table(dat2)
Then use the ":=" and "amatch" command to create a new column that approximately matches the two names:
dat1[,name2 := dat2[stringdist::amatch(name, dat2$name)]$name]
This should be much faster than the sapply function. Hope this helps!
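The same vectorized lookup also works on a plain data frame, with no data.table required (illustrative):
dat1$name2 <- dat2$name[stringdist::amatch(dat1$name, dat2$name, maxDist = Inf)]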
I have the following data frame:
df <- data.frame(
  Target = rep(LETTERS[1:3], each = 8),
  Prov = rep(letters[1:4], each = 2),
  B = "5MB",
  S = "1MB",
  BUF = "8kB",
  M = c('g', 'p'),
  Thr.mean = 1:24)
whose column Thr.mean I would like to normalize by the values where Target=='C' (I don't mind attaching a new column).
To clarify, I would like to end up with:
Thr.mean <- c(1/17,2/18,3/19,4/20,5/21,6/22,7/23,8/24,9/17,10/18,11/19,12/20,13/21,14/22,15/23,16/24,1,1,1,1,1,1,1,1)
Now, it may happen that this data frame has rows where Target != 'C' with values of S or B that are not present in any row where Target == 'C', and for these I would also like to calculate the overhead. The most important column for matching is M, then BUF, B, and S.
Any ideas how to do it? I could write several loops and ifs, but I'm looking for a more elegant solution.
For posterity, here is how I solved the problem using data.table:
DT <- data.table(df)
DT[, Thr.Norm.C := .SD[Target == 'C', Thr.mean], by = .(B, BUF, Prov)]  # per-group reference value(s) taken from the Target == 'C' rows
DT[, over.thr := Thr.Norm.C / Thr.mean]
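For comparison, here is a base R sketch of the same normalization using merge; it also keys on Prov and M, which is what makes the Target == 'C' reference row unique in this sample:
# look-up table of reference values taken from the Target == 'C' rows
norm <- df[df$Target == 'C', c("Prov", "M", "B", "S", "BUF", "Thr.mean")]
names(norm)[names(norm) == "Thr.mean"] <- "Thr.Norm.C"
out <- merge(df, norm, by = c("Prov", "M", "B", "S", "BUF"))
out$Thr.mean.norm <- out$Thr.mean / out$Thr.Norm.C  # 1/17, 2/18, ... (merge may reorder rows)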
I'd like to learn how to apply functions to specific columns of my data frame without "excluding" the other columns. For example, I'd like to multiply some specific columns by 1000 and leave the others as they are.
Using the sapply function for example like this:
a <- as.data.frame(sapply(table.xy[, 1], function(x) { x * 1000 }))
I get a new data frame with the first column multiplied by 1000, but without the other columns that weren't part of the operation. So my attempt was to do it like this:
a<-as.data.frame(sapply(table.xy, function(x) if (colnames=="columnA") {x/1000} else {x}))
but this one didn't work.
My workaround was to add an ID column to both data frames and then merge the old data frame with the new one to get a complete result. But I think there must be a better solution. Isn't there?
If you only want to do a computation on one or a few columns, you can use transform or simply index the columns manually:
# with transform:
df <- data.frame(A = 1:10, B = 1:10)
df <- transform(df, A = A * 1000)
# manually:
df <- data.frame(A = 1:10, B = 1:10)
df$A <- df$A * 1000
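transform also accepts several reassignments in a single call (illustrative):
df <- transform(df, A = A * 1000, B = B * 1000)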
The following code applies the desired function to only the columns you specify.
I'll create a simple data frame as a reproducible example.
(df <- data.frame(x = 1, y = 1:10, z = 11:20))
(df <- cbind(df[1], apply(df[2:3], 2, function(x) x * 1000)))
Basically, use cbind() to keep the columns you don't want the function to run on, then use apply() with the desired function on the target columns.
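A related idiom avoids the cbind() step entirely: assigning with lapply() replaces the target columns in place and keeps the data frame intact (illustrative):
df[2:3] <- lapply(df[2:3], function(x) x * 1000)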
In dplyr we would use mutate_at, which lets you select or exclude specific variables (precede a variable name with a minus sign to exclude it). You can just name a function:
df <- df %>%
  mutate_at(vars(columnA), scale)
or create your own:
df <- df %>%
  mutate_at(vars(columnA, columnC), function(x) {do this})
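In current dplyr (1.0.0 and later), mutate_at is superseded by across(); a sketch of the equivalent call, using the question's multiply-by-1000 operation (columnA and columnC are the placeholder names from above):
df <- df %>%
  mutate(across(c(columnA, columnC), function(x) x * 1000))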
I have a large data.frame, and I'd like to be able to reduce it by using a quantile subset by one of the variables. For example:
x <- rep(1:10, 10)
df <- data.frame(x, rnorm(100))
df2 <- subset(df, df$x == 1)
df3 <- subset(df2, df2[2] > quantile(df2$rnorm.100., 0.8))
What I would like to end up with is a data.frame containing, for each x = 1, 2, ..., 10, only the rows above that group's 0.8 quantile.
Is there a way to do this with ddply?
You could try:
ddply(df, .(x), subset, rnorm.100. > quantile(rnorm.100., 0.8))
And off topic: you could use df <- data.frame(x, y = rnorm(100)) to name the column on the fly.
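For completeness, a dplyr sketch of the same per-group filter (plyr is now retired in favor of dplyr); it uses the auto-generated column name rnorm.100.:
library(dplyr)
df %>%
  group_by(x) %>%
  filter(rnorm.100. > quantile(rnorm.100., 0.8))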
Here's a different approach with the little-used ave() function (it's very fast to compute this way). Make a new column that contains the quantile calculation for each level of x:
df$quantByX <- ave(df$rnorm.100., df$x, FUN = function(x) quantile(x, 0.8))
Then select the unique combinations of the x column and the new column:
df2 <- unique(df[, c(1, 3)])
The result is one data frame with the unique items in the x column and the calculated quantile for each level of x.
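And since the per-group cutoff now sits alongside every row, the subset the question asks for is a one-liner (illustrative):
df3 <- df[df$rnorm.100. > df$quantByX, ]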