I have a data table with 3 columns: Field1, Field2 and Value.
For each attribute in Field2, I want to find the attribute in Field1 that corresponds to the largest sum of Value (i.e. there are multiple Field1 / Field2 rows in the data table).
When I try x[,.(Field1 = Field1[which.max(sum(Value))]),.(Field2)], I seem to get the first Field1 row for each Field2 rather than the row corresponding to the max sum of Value.
As an extension, what if you wanted to return the sum of Value, the total number of rows, and the Field1 value corresponding to the largest sum of Value within each Field2?
Below is a reproducible example.
library(data.table)
#Set random seed
set.seed(2017)
#Create a table with the attributes we need
x = data.table(rbind(data.frame(Field1 = 1:12,Field2 = rep(1:3, each = 4), Value = runif(12)),
data.frame(Field1 = 1:12,Field2 = rep(1:3, each = 4), Value = runif(12))))
#Let's order by Field2/ Field1 / Value
x = x[order(Field2,Field1,Value)]
#Check
print(x)
# This works, but requires 2 steps which can complicate things when needing
# to pull other attributes too.
(x[,.(Value = sum(Value)),.(Field2,Field1)][,.SD[which.max(Value)],.(Field2)])
#This instead provides the row corresponding to the largest Value.
(x[,.(Field1 = Field1[which.max(Value)]),.(Field2)])
# This is what I was ideally looking for but it only returns the first row of the attribute
# regardless of the value of Value, or the corresponding sum.
(x[,.(Field1 = Field1[which.max(sum(Value))]),.(Field2)])
# This works but seems clumsy
(x[,
.SD[, .(RKCNT=length(.I),TotalValue=sum(Value)), .(Field1)]
[,.(RKCNT = sum(RKCNT), TotalValue = sum(TotalValue),
Field1 = Field1[which.max(TotalValue)])],
.(Field2)])
We can nest the per-Field1 sum inside .SD:
x[, .SD[, sum(Value), Field1][which.max(V1)], Field2]
This is concise and thus somewhat easier to read, but it does not give any performance improvement.
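The original attempt returns the first row because sum(Value) is a single number per group, so which.max() of it is always 1. For the extension in the question (row count, total and the winning Field1 in one pass), a minimal sketch could look like the following. The toy values and the result name res are made up for illustration; the column names RKCNT and TotalValue are taken from the question.

```r
library(data.table)

# Toy data (hypothetical values, chosen so the result is easy to check by hand)
x <- data.table(
  Field2 = c("a", "a", "a", "b", "b", "b"),
  Field1 = c(1, 2, 2, 3, 3, 4),
  Value  = c(5, 3, 3, 1, 1, 4)
)

res <- x[, {
  # Sum Value per Field1 inside the current Field2 group
  s <- .SD[, .(TotalValue = sum(Value)), by = Field1]
  .(RKCNT      = .N,                                  # rows in this Field2 group
    TotalValue = sum(Value),                          # total over the group
    Field1     = s$Field1[which.max(s$TotalValue)])   # Field1 with the largest sum
}, by = Field2]

print(res)
```

In group "a" the largest single row belongs to Field1 = 1, but Field1 = 2 has the largest sum, which is what the sketch returns.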
How do I multiply the values inside a column, grouped by another column?
Let's say I have :
dt = data.table(group = c(1,1,2,2), value = c(2,3,4,5))
I want to multiply the elements of value with each other, but only those that belong to the same group, so the result would be:
dt=data.table(group=c(1,2), value=c(6,20))
I tried it with cumprod
dt[, new_value := cumprod(value), by = group]
but then that returns
dt=data.table(group=c(1,1,2,2), value=c(2,6,4,20))
and I don't know how to remove the rows that I don't need: those with the values 2 and 4.
...
Taking the maximum is not a solution because the values could also be negative.
Updating for visibility, using @chinsoon12's solution from the comments.
dt[, .(new_value = prod(value)), by = group]
Here's one option where you first perform the calculation and then take the last row by group.
dt[, .(new_value = cumprod(value)), by = group][,.SD[.N], by = group]
group new_value
1: 1 6
2: 2 20
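As a quick check of the two approaches above, the following sketch runs both on the question's data plus one extra group with a negative value (the third group and the names res_prod and res_last are added here for illustration only). It suggests that prod() and "last element of cumprod()" agree, including when values are negative.

```r
library(data.table)

# Question's data plus a hypothetical third group containing a negative value
dt <- data.table(group = c(1, 1, 2, 2, 3, 3),
                 value = c(2, 3, 4, 5, -2, 3))

# Direct product per group
res_prod <- dt[, .(new_value = prod(value)), by = group]

# Same result via the running product, keeping only its last element (.N)
res_last <- dt[, .(new_value = cumprod(value)[.N]), by = group]

print(res_prod)
```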
I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5))
And I would like to create a cumulative sum of "Val". I know how to do the simple cumulative sum
df <- df %>% group_by(id) %>% mutate(cumval=cumsum(Val))
However, I would like my final data to look like this
final <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5),
"cumval" = c(2,5,6,11,2,7,5,10))
The basic idea is that when two "Val"'s are of the same "Group" the one happening later (Year) substitutes the previous one.
For instance, in the sample dataset, observation 3 has a "cumval" of 6 rather than 8 because the "Val" at time 1972 replaced the "Val" at time 1970. The same applies to Beta.
I thank you in advance for your help
In my head, this requires a for loop. First we split the dataframe by the id column into a list of two. Then we create two empty lists. In the og list, we will put the row where the first unique non NA group identifier occurs. For alpha this is the first row and for Beta this is the second row. We will use this to subtract from the cumulative sum when the value gets substituted.
mylist <- split(df, f = df$id)
og <- list()
vals <- list()
df_num <- 1
We shall use a nested loop, the outer loop loops over each object (dataframe in this case) in the list and the inner loop loops over each value in the Group column.
We need to keep track of the row numbers, which we do with the r variable. We initially set it to 0 outside the inner loop and add 1 on every iteration. First we check if we are in the first row of the data frame, in which case the cumulative sum is simply equal to the value in the first row of the Val column. Then within the if test, we use another if test to check if the Group id is an NA. If it isn't, then this is the first occurrence of the number that will indicate a substitution of the current value if this number appears again. So we save the number to the temporary variable temp. We also extract and save the row that contains the value to the og list.
After this, it goes to the next iteration. We check if the current Group value is NA. If it is, then we just add the value to the cumulative sum. If it is not NA, we check whether it is equal to the value stored in temp. If it is, then we need to substitute. We extract the original value stored in the og list and save it as old. We then subtract the old value from the cumulative sum and add the current value. We also replace the original value in og with the current replacement value. This is because if the value needs to be replaced again, we will need to subtract the current value and not the original value.
If j is not NA but is not equal to temp, then this is a new instance of Group. So we save the row with the original value to the og list, and save the Group in temp. The sum continues as normal, as this is not an instance of replacing a value. Note that the variable x that is used to count the elements in the og list is only incremented when a new occurrence is added to the list. Thus, og[[x-1]] will always hold the most recent value for the current Group, i.e. the one to be replaced.
for (my_df in mylist) {
  x <- 1           # next free slot in og
  r <- 0           # current row number within my_df
  temp <- NA       # last Group id seen in this data frame
  vals <- list()   # running sums for this data frame
  for (j in my_df$Group) {
    r <- r + 1
    if (r == 1) {
      # First row: the cumulative sum is just the first Val
      vals[[1]] <- my_df$Val[1]
      if (!is.na(j)) {
        og[[x]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
        temp <- j
        x <- x + 1
      }
      next
    }
    if (is.na(j)) {
      # No Group: plain cumulative sum
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
    } else if (isTRUE(j == temp)) {
      # Same Group as before: swap the old value for the current one
      old <- og[[x-1]]
      old <- old[,2]
      vals[[r]] <- vals[[r-1]] - old + my_df$Val[r]
      og[[x-1]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
    } else {
      # New Group: add as normal and remember this row
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
      og[[x]] <- my_df[r, c('Group', 'Val')]
      temp <- j
      x <- x + 1
    }
  }
  cumval <- unlist(vals) %>% as.data.frame()
  colnames(cumval) <- 'cumval'
  my_df <- cbind(my_df, cumval)
  mylist[[df_num]] <- my_df
  df_num <- df_num + 1
}
Lastly, we combine the two dataframes in the list by binding them on rows with bind_rows from the dplyr package. Then I check whether final_df is identical to your desired output with identical(), and it evaluates to TRUE.
final_df <- bind_rows(mylist)
identical(final_df, final)
[1] TRUE
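For comparison, here is a loop-free sketch of the same substitution rule. It uses data.table rather than the loop above, so it is an alternative technique, not the answer's method. It assumes rows are ordered by Year within id and that a later row replaces the most recent earlier row with the same non-NA Group. The helper column prev is a name introduced here.

```r
library(data.table)

df <- data.frame(id    = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
                 Year  = c(1970, 1971, 1972, 1973, 1974, 1977, 1978, 1990),
                 Group = c(1, NA, 1, NA, NA, 2, 2, NA),
                 Val   = c(2, 3, 3, 5, 2, 5, 3, 5))

dt <- as.data.table(df)
setorder(dt, id, Year)   # assumption: "later" means a larger Year within id

# prev = the Val this row replaces (previous row of the same id and Group), else 0
dt[, prev := shift(Val, fill = 0), by = .(id, Group)]
dt[is.na(Group), prev := 0]   # rows without a Group never replace anything

# Each row contributes its Val minus the value it replaces
dt[, cumval := cumsum(Val - prev), by = id]
dt[, prev := NULL]

print(dt)
```

The idea is that a replacement changes the running total by Val minus the replaced Val, so a plain cumsum over those differences reproduces the desired column for the sample data.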
I have a question about filling a data.frame using values from a data.table. I am stuck with a data.table a that has 4 columns: for each set_a I need to pick rows by the largest Average and return the corresponding value in a new data.frame z. The rule I was trying to set up is: if the highest Average is > 80, select that row's value and repeat it for both red_a and red_b; otherwise use the highest Average to select the value for red_a and the 2nd-largest Average to select the value for red_b. Is there an easy way, which I am missing, to generate red_a and red_b automatically and fill the table? I was able to get the max, but I am stuck with only a single value to fill the table. Below is what I have tried so far with data.table.
Example data set
set_a <- c("a","a","a","a","b","b","b","b","c","c","c","c")
set_b <- c("red","red","red","red","red","red","red","red","red","red","red","red")
value <- c(42,68,90,91,22,65,89,98,78,88,91,33)
Average <- c(94,3,2,1,50,40,5,5,80,9,1,1)
a = data.frame(set_a,set_b,value,Average)
Desired output
z <- data.frame(set_a = c("a","a","b","b"),set_b = c("red_a","red_b","red_a","red_b"),value = c(42,42,22,65))
library(reshape2)
example <- acast(z,set_a ~ set_b, value.var = "value")
red_a red_b
a 42 42
b 22 65
R code used so far:
I have been able to get the max value but have trouble setting it up to get the second value.
a = data.frame(set_a,set_b,value,Average)
library(data.table)
a = data.table(a)
y =a[,max(na.omit(Average)), by=.(set_a,set_b),all=T]
y
mydata2 = merge(a[0], y, by.x = c("set_a","set_b"),by.y = c("set_a","set_b"),all=TRUE)
mydata2
y$match <- a$value[match(y$V1,a$Average)]
y$V1 <- NULL
library(reshape2)
set_matrix = acast(y, set_a ~ set_b, value.var = "match")
set_matrix
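One possible reading of the rule described in the question can be sketched as follows. This is an interpretation, not a confirmed solution: it assumes "Average > 80" refers to the highest Average within each set_a, and it also returns a row pair for set_a = "c", which the desired output above does not list.

```r
library(data.table)

a <- data.table(
  set_a   = rep(c("a", "b", "c"), each = 4),
  set_b   = "red",
  value   = c(42, 68, 90, 91, 22, 65, 89, 98, 78, 88, 91, 33),
  Average = c(94, 3, 2, 1, 50, 40, 5, 5, 80, 9, 1, 1)
)

# Sort by Average (descending), then work within each set_a
z <- a[order(-Average), {
  first  <- value[1]                                     # value at the highest Average
  second <- if (Average[1] > 80) value[1] else value[2]  # repeat it, or take the runner-up
  .(set_b = c("red_a", "red_b"), value = c(first, second))
}, keyby = set_a]

# Wide layout, analogous to the acast() call in the question
wide <- dcast(z, set_a ~ set_b, value.var = "value")
print(wide)
```

For "a" and "b" this gives the values shown in the desired output; whether "c" should be dropped depends on a rule the question does not state.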
I'm trying to solve a problem in data.table which requires me to use the value just predicted in the next step of the prediction.
I have the data set up like this, with NA rows generated ready to receive the predictions. Each NA is calculated by multiplying the value preceding it by the current parameter.
library(data.table)
dt <- data.table(
date = as.Date(paste(rep(c(2015, 2016), each = 12, times = 2), 1:12, 1, sep = "-")),
val = c(rnorm(12, 50, 5), rep(NA, 12)),
param1 = runif(48),
cat = rep(c("a", "b"), each = 24)
)
I can't do it this way
dt[, {
dt_in <- .SD
lapply(dt_in[year(date) > 2015, date], function(d){
dt_sub <- dt_in[date <= d]
pred <- dt_sub[.N-1, val] * dt_sub[.N, param1]
dt_in[date == d, val := pred]
})
} , by = cat]
because trying to update .SD within {} gives me the '.SD is locked...' error. My current solution involves breaking the data.table into a list and updating each list item row by row:
# Create a list of data.tables, one for each category
break_list <- lapply(dt[, unique(cat)], function(c){
dt[cat == c]
})
l_out <- lapply(break_list, function(dt_in){
# Select the dates requiring prediction
lapply(dt_in[year(date) > 2015, date], function(d){
# Subset by date
dt_sub <- dt_in[date <= d]
# Prediction = value from the second to last row * parameter in the last row
pred <- dt_sub[.N-1, val] * dt_sub[.N, param1]
# Update data.table
dt_in[date == d, val := pred]
})
dt_in
})
dt_out <- rbindlist(l_out)
This works and gives me the desired solution, but it can be slow and feels like I've broken all the data.table rules. Is there a better way?
You are looking to iteratively update rows of a data.table with values computed from rows updated in a previous iteration. It is generally better to find an explicit formulation of the problem that makes the updates independent, and that is possible in your case using a helper column holding the cumprod of param1 and a rolling join (dt[dt[...], ..., roll=TRUE]). However, since that is not always easy or possible, I will show how to do iterative updates of a data.table efficiently using data.table::set:
setkey(dt, cat, date) # sort by cat first, then by date, so that the reference value used for each calculation is in the row above
val_col_nr <- which(colnames(dt)=="val") # column number of val (set accepts a column number or a name)
dt[is.na(val), # we want to compute new values for val where val currently is NA
# .I is a vector of the row numbers (in dt) of each row in .SD
for (ii in .I) set(dt, i=ii, j=val_col_nr, value=dt[ii,param1]*dt[ii-1L,val]),
by=cat] # for every 'cat'
You can use identical(dt, setkey(dt_out,cat,date)) to check the result.
Please also note that it is generally a bad idea to use names of base functions (cat in your case) as variable names (even in a distinct namespace).
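The explicit formulation mentioned above can be sketched without a loop. This version uses only a grouped cumprod and no rolling join, so it is a simplified variant rather than the exact approach the answer alludes to. It assumes that, within each group, all known values come before the NA rows. The toy data and the column name grp (used instead of cat) are made up for illustration.

```r
library(data.table)

# Toy data: three known values followed by three rows to predict, per group
dt <- data.table(
  grp    = rep(c("a", "b"), each = 6),
  step   = rep(1:6, times = 2),
  val    = c(10, 20, 30, NA, NA, NA, 1, 2, 3, NA, NA, NA),
  param1 = c(1, 1, 1, 2, 0.5, 3, 1, 1, 1, 2, 2, 2)
)
setkey(dt, grp, step)

dt[, val := {
  known      <- !is.na(val)
  last_known <- val[max(which(known))]   # assumes known rows precede the NA rows
  # Running product of param1 over the rows to predict (factor 1 for known rows)
  growth <- cumprod(ifelse(known, 1, param1))
  ifelse(known, val, last_known * growth)
}, by = grp]

print(dt)
```

Because each prediction is the previous value times the current parameter, the k-th predicted value equals the last known value times the product of the first k parameters, which is what the cumprod computes.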
I am trying to create a column with a random integer that is between 1 and the length of the unique values of a different column.
In other words it is a way in which to randomly re-assign the category of each row from a sample of all the unique values that column could possibly be.
Here is what I have. Unfortunately, it returns exactly what I put in.
randomBinAssigner <- function(testingDT) {
levelsInCat <- levels(testingDT$randomCat)
testingDT[, randomCatKey := sample(1:length(levelsInCat), 1, replace = T)]
testingDT[, randomCat := levelsInCat[randomCatKey]]
testingDT[, randomCatKey := NULL]
return(testingDT)
}
The question has no reproducible example with a clear desired output, but my guess is that you're simply missing a "by":
testingDT[, randomCatKey := sample(length(levelsInCat), 1, T), by = randomCat]
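If the goal is instead an independent random category for every row, as the question's wording suggests, another reading is to draw .N values at once rather than a single value that gets recycled. The sketch below shows that variant; it is a different technique from the "by" suggestion above, and the toy table is made up for illustration.

```r
library(data.table)

set.seed(42)

# Hypothetical example table with a factor column to be shuffled
testingDT <- data.table(
  id        = 1:10,
  randomCat = factor(rep(c("x", "y", "z"), length.out = 10))
)

levelsInCat <- levels(testingDT$randomCat)

# Draw one level per row (.N rows), with replacement
testingDT[, randomCat := factor(sample(levelsInCat, .N, replace = TRUE),
                                levels = levelsInCat)]

print(testingDT)
```

With sample(..., 1) and no "by", the single draw is recycled across all rows, so every row receives the same level.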