I have the following data frame:
df <- data.frame(
  Target = rep(LETTERS[1:3], each = 8),
  Prov = rep(letters[1:4], each = 2),
  B = rep("5MB"),
  S = rep("1MB"),
  BUF = rep("8kB"),
  M = rep(c('g', 'p')),
  Thr.mean = 1:24)
whose column Thr.mean I would like to normalize by the values in the rows where Target=='C' (I don't mind attaching the result as a new column).
To clarify, I would like to end up with:
Thr.mean <- c(1/17,2/18,3/19,4/20,5/21,6/22,7/23,8/24,9/17,10/18,11/19,12/20,13/21,14/22,15/23,16/24,1,1,1,1,1,1,1,1)
Now, it may happen that some rows where Target!='C' have values in S or B that are not present in any row where Target=='C'; for these I would also like to calculate the overhead. The most important column for matching is M, then BUF, then B, and then S.
Any ideas how to do it? I could write several loops and ifs, but I'm looking for a more elegant solution.
For posterity, the way I solved my problem was with data.table:
library(data.table)
DT <- data.table(df)
DT[, Thr.Norm.C := .SD[Target=='C', Thr.mean], by = 'B,BUF,Prov']
DT[, over.thr := Thr.Norm.C/Thr.mean]
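An alternative sketch, just for comparison (it assumes that within each B/S/BUF combination the pair of M and Prov identifies the matching Target=='C' row): an update join that attaches the reference value and divides, giving the normalized values shown in the expected output above while keeping the original row order.
ref <- DT[Target == 'C', .(B, S, BUF, M, Prov, Thr.ref = Thr.mean)]  # reference throughput from the C rows
DT[ref, Thr.norm := Thr.mean / i.Thr.ref, on = c('B', 'S', 'BUF', 'M', 'Prov')]  # 1/17, 2/18, ..., and 1 for the C rows themselves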
I'm able to use the following to generate/replace a numeric column (var) with each cell's quartile rank (1-4) for the column:
df <- as.data.table(df)
df[, var:=ntile(var, 4)]
I want to iterate this conversion over each of the columns in the data table. When I try the following, every cell in the table becomes the number 1. Any guidance on why I'm not getting the expected output? I'm sure there's also a simpler approach, so alternatives are welcome too. Thanks!
for (i in 1:ncol(df)) {
  df[, colnames(df)[i] := ntile(df[i], 4)]
}
I work mostly with data table but a data frame solution would work as well.
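For what it's worth, here is a minimal sketch of one fix, assuming all columns are numeric and that ntile() comes from dplyr. The likely core issue is that, on a data.table, df[i] with a numeric i subsets row i rather than selecting the i-th column, so ntile() never sees the full column and the recycled result fills the column with a single value. Applying ntile() over .SD converts every column in place:
library(data.table)
library(dplyr)  # for ntile()
df <- as.data.table(df)
df[, (names(df)) := lapply(.SD, ntile, 4)]  # replace each column with its quartile rank (1-4)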
I have the result of an apply which returns this data structure
applyres=structure(c(0.0260, 3.6775, 0.92
), .Names = c("a.1", "a.2", "a.3"))
Then I have a data table
coltoadd=c('a.1','a.2','a.3')
dt <- data.table(variable1 = factor(c("what","when","where")))
dt[,coltoadd]=as.numeric(NA)
Now I would like to add the elements of applyres to the corresponding columns, just one row at a time, because applyres is calculated from another function. I have tried different assignments but nothing seems to work. Ideally I would like to assign based on column name, just in case the columns change order in one of the two structures.
This doesn't work
dt[1,coltoadd]=applyres
I also tried
dt[1,coltoadd := applyres]
I also tried changing applyres to a matrix or a data table and transposing it.
I would like to do something like this
dt[1,coltoadd[i]]=applyres[coltoadd[i]]
But I'm not sure whether it should go in a loop, and it doesn't seem like the best way to do it.
How do I avoid doing single assignments if I have a large number of columns?
1) data.frame Convert to data.frame, perform the assignments and convert back.
DF <- as.data.frame(dt)
DF[1, -1] <- applyres
# perform remaining assignments
dt <- as.data.table(DF)
2) loop Another possibility is a for loop:
for(i in 2:ncol(dt)) dt[1, i] <- applyres[i-1]
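A data.table-native sketch (assuming the names of applyres match coltoadd) assigns all the columns of row 1 in one := call by turning the named vector into a list:
library(data.table)
dt[1, (coltoadd) := as.list(applyres[coltoadd])]  # match values to columns by name for row 1
Indexing applyres by coltoadd keeps the assignment correct by name, even if the two structures list the columns in different orders.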
This is similar to Update values in data.table with values from another data.table and R data.table replacing an index of values from another data.table, except in my situation the number of variables is very large so I do not want to list them explicitly.
What I have is a large data.table (let's call it dt_original) and a smaller data.table (let's call it dt_newdata) whose IDs are a subset of the first and it has only some of the variables of the first. I would like to update the values in dt_original with the values from dt_newdata. For an added twist, I only want to update the values conditionally - in this case, only if the values in dt_newdata are larger than the corresponding values in dt_original.
For a reproducible example, here are the data. In the real world the tables are much larger:
library(data.table)
set.seed(0)
## This data.table with 20 rows and many variables is the existing data set
dt_original <- data.table(id = 1:20)
setkey(dt_original, id)
for(i in 2015:2017) {
  varA <- paste0('varA_', i)
  varB <- paste0('varB_', i)
  varC <- paste0('varC_', i)
  dt_original[, (varA) := rnorm(20)]
  dt_original[, (varB) := rnorm(20)]
  dt_original[, (varC) := rnorm(20)]
}
## This table with a strict subset of IDs from dt_original and only a part of
## the variables is our potential replacement data
dt_newdata <- data.table(id = sample(1:20, 3))
setkey(dt_newdata, id)
newdata_vars <- sample(names(dt_original)[-1], 4)
for(var in newdata_vars) {
  dt_newdata[, (var) := rnorm(3)]
}
Here is a way of doing it using a loop and pmax, but there has to be a better way, right?
for(var in newdata_vars) {
  k <- pmax(dt_newdata[, (var), with = FALSE],
            dt_original[id %in% dt_newdata$id, (var), with = FALSE])
  dt_original[id %in% dt_newdata$id, (var) := k, with = FALSE]
}
It seems like there should be a way using join syntax, and maybe the prefix i. and/or .SD or something like that, but nothing I've tried comes close enough to warrant repeating here.
This code should work as-is, given your criteria.
dt_original[dt_newdata, names(dt_newdata) := Map(pmax, mget(names(dt_newdata)), dt_newdata)]
It joins on the IDs that match between the two data.tables and then performs an assignment with :=. Because we want the right-hand side to be a list, I use Map to run pmax over the columns of the two data.tables, matched by the names of dt_newdata. Note that it is necessary that all names of dt_newdata are also present in dt_original.
Following Frank's comment, you can drop the first element from the Map arguments and from the assigned column names with [-1], because the first column holds the IDs, which don't need to be computed. Removing it avoids one pass of pmax and also preserves the key on id. Thanks to @brian-stamper for pointing out the key preservation in the comments.
dt_original[dt_newdata,
names(dt_newdata)[-1] := Map(pmax,
mget(names(dt_newdata)[-1]),
dt_newdata[, .SD, .SDcols=-1])]
Note that the use of [-1] assumes that the ID variable is located in the first position of dt_newdata. If it is elsewhere, you could change the index manually or use grep.
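If you want to convince yourself the conditional update worked, a quick check (just a sketch) joins the two tables and compares each overlapping column with its i.-prefixed counterpart coming from dt_newdata:
vars <- setdiff(names(dt_newdata), "id")
merged <- dt_original[dt_newdata, on = "id"]  # the i.* columns hold dt_newdata's values
all(sapply(vars, function(v) all(merged[[v]] >= merged[[paste0("i.", v)]])))  # should be TRUE after the update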
I need to iterate through a dataframe, df, where
colnames(df) == c('year','month','a','id','dollars')
I need to iterate through all of the unique pairs ('a','id'), which I've found via
counts <- count(df, c('a','id'))  # count() here is plyr::count
uniquePairs <- counts[ counts$freq > 10, c('a','id') ]
Next I iterate through each of the unique pairs, finding the corresponding rows like so (I have named each column of uniquePairs appropriately):
aVec <- as.vector( uniquePairs$a )
idVec <- as.vector( uniquePairs$id )
for (i in 1:nrow(uniquePairs))  # nrow(), not length(): length() of a data frame counts columns
{
  a <- aVec[i]
  id <- idVec[i]
  selectRows <- (df$a==a & df$id==id)
  # ... get those rows and do stuff with them ...
  df <- df[!selectRows,]  # so lookups are slightly faster next time through
  # ...
}
I know for loops are discouraged in general, but in this case I think one is appropriate. That seems beside the point for this question, though a more efficient approach might well get rid of the loop.
There are 10-100k rows in the data frame, and it makes sense (though I haven't tested it) that the relationship between lookup time and nrow(df) would be worse than linear.
Now, whatever found these unique pairs must have seen where each of them occurred, even if it didn't save that information. Is there a way to save it off, so that I have a boolean vector I could use for each pair to select its rows out of the data frame more efficiently? Or is there an alternative, better way to do this?
I have a feeling that some use of plyr or reshape could help me out, but I'm still relatively new to the large R ecosystem, so some guidance would be greatly appreciated.
data.table is your best option by far:
dt = data.table(df)
dt[,{do stuff in here, then leave results in list form},by=list(a, id)]
For the simple case of taking the mean of some variable:
dt[,list(Mean = mean(dollars)), by = list(a, id)]
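Applied to the setup in the question, a fuller sketch (assuming the per-pair work amounts to summarising dollars, and repeating the setup so the snippet is self-contained) keeps only pairs seen more than 10 times and summarises them per group:
library(data.table)
dt <- data.table(df)
dt[, if (.N > 10) list(Mean = mean(dollars), Total = sum(dollars)), by = list(a, id)]  # groups with .N <= 10 are dropped from the result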
I have a data frame consisting of results from multiple runs of an experiment, each of which serves as a log, with its own ascending counter. I'd like to add another column to the data frame that has the maximum value of iteration for each distinct value of experiment.num in the sample below:
df <- data.frame(
  iteration = rep(1:5, 5),
  experiment.num = c(rep(1,5), rep(2,5), rep(3,5), rep(4,5), rep(5,5)),
  some.val = 42,
  another.val = 12
)
In this example, the extra column would look like this (as all the subsets have the same maximum for iteration):
df$max <- rep(5,25)
The naive solution I currently use is:
df$max <- sapply(df$experiment.num,function(exp.num) max(df$iteration[df$experiment.num == exp.num]))
I've also used sapply(unique(df$experiment.num), function(n) c(n,max(df$iteration[df$experiment.num==n]))) to build another frame which I can then merge with the original, but both of these approaches seem more complicated than necessary.
The experiment.num column is a factor, so I think I might be able to exploit that to avoid iteratively doing this naive subsetting for all rows.
Is there a better way to get a column of maximum values for subsets of a data.frame?
Using plyr:
ddply(df, .(experiment.num), transform, max = max(iteration))
Using ave in base R:
df$i_max <- with(df, ave(iteration, experiment.num, FUN=max))
Here's a way in base R:
within(df[order(df$experiment.num), ],
       max <- rep(tapply(iteration, experiment.num, max),
                  rle(experiment.num)$lengths))
I think you can use data.table:
install.packages("data.table")
library("data.table")
dt <- data.table(df)  # make your data frame into a data table
dt[, max := max(iteration), by = list(experiment.num)]  # this adds a new column called max holding each group's maximum iteration
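With the example data above, a quick sanity check (just for this toy case) confirms every row gets the expected group maximum of 5:
all(dt$max == 5)  # TRUE, since each experiment runs through iterations 1 to 5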