I have the following data.table:
# Load Library
library(data.table)
# Generate Data
test_data <- data.table(
  year = c(2000, 2000, 2001, 2001),
  grp = rep(c("A", "B"), 2),
  value = 1:4
)
I want to multiply the values of group A by parameters that vary over year. My attempt uses sapply with fifelse and a fixed multiplier of 2, but this solution is going to get messy if I want to vary the value over time.
multiply_effect <- sapply(
  1:nrow(test_data), function(i) {
    fifelse(
      test = test_data$grp[i] == "A", test_data$value[i] * 2, test_data$value[i]
    )
  }
)
Let's say that I want to multiply the value of grp A by 2 in 2000 and by 3 in 2001, keeping grp B as it is. Then my desired output would be:
   year grp value new_value
1: 2000   A     1         2
2: 2000   B     2         2
3: 2001   A     3         9
4: 2001   B     4         4
I'm looking for a data.table solution only.
You could define the factors for the years / groups you want to modify in another lookup data.table and update the main table with a join:
test_data <- data.table(
  year = c(2000, 2000, 2001, 2001),
  grp = rep(c("A", "B"), 2),
  value = 1:4
)

factor_lookup <- data.table(
  year = c(2000, 2001),
  grp = rep("A", 2),
  factor = c(2, 3)
)

test_data[factor_lookup, value := value * factor, on = .(year, grp)][]
    year    grp value
   <num> <char> <int>
1:  2000      A     2
2:  2000      B     2
3:  2001      A     9
4:  2001      B     4
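If you would rather leave value untouched and add the new_value column from your desired output, the same join works on a copy of the column (a sketch, assuming a freshly created test_data, i.e. before the update above):
# copy value, then scale only the rows matched by the lookup;
# unmatched rows (grp B) simply keep the copied value
test_data[, new_value := value]
test_data[factor_lookup, new_value := value * factor, on = .(year, grp)][]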
I would like to calculate the squared sum of occurrences (number of rows, respectively) of the unique values of group A (industry) by group B (country) over the previous year.
Calculation example for row 5: 2x A + 1x B + 1x C = 2^2 + 1^2 + 1^2 = 6 (this does not include the A from row 1 because it is older than a year, nor the A from row 6 because it is in another country).
I managed to calculate the numbers by row but I am failing to move this to the aggregated date level:
dt[, count_by_industry := sapply(date, function(x) length(industry[between(date, x - lubridate::years(1), x)])),
   by = c("country", "industry")]
The solution ideally scales to real data with ~2mn rows and around 10k dates and group elements (hence the data.table tag).
Example Data
ID <- c("1", "2", "3", "4", "5", "6")
Date <- c("2016-01-02", "2017-01-01", "2017-01-03", "2017-01-03", "2017-01-04", "2017-01-03")
Industry <- c("A", "A", "B", "C", "A", "A")
Country <- c("UK", "UK", "UK", "UK", "UK", "US")
Desired <- c(1, 4, 3, 3, 6, 1)

library(data.table)
dt <- data.frame(id = ID, date = Date, industry = Industry, country = Country, desired_output = Desired)
setDT(dt)[, date := as.Date(date)]
Adapting from your start:
dt[, output := sapply(date, function(x) sum(table(industry[between(date, x - lubridate::years(1), x)])^2)),
   by = c("country")]
dt
   id       date industry country desired_output output
1:  1 2016-01-02        A      UK              1      1
2:  2 2017-01-01        A      UK              4      4
3:  3 2017-01-03        B      UK              3      3
4:  4 2017-01-03        C      UK              3      3
5:  5 2017-01-04        A      UK              6      6
6:  6 2017-01-03        A      US              1      1
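As a quick check on the logic: inside each window, table() gives the per-industry counts, and summing their squares reproduces the worked example for row 5:
# rows within one year of row 5 in the same country contain industries A, A, B, C
sum(table(c("A", "A", "B", "C"))^2)
# [1] 6   (= 2^2 + 1^2 + 1^2)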
I am trying to summarize a data.table using a character variable as the name for the new column along with by.
library(data.table)
dt <- data.table(g = rep(1:3, 4), xa = runif(12), xb = runif(12))
# desired output
dt[, .(sa = mean(xa)), by = g]
   g        sa
1: 1 0.4755900
2: 2 0.5372602
3: 3 0.6465111
The issue is that the following code still returns the entire data.table, without reducing to just the unique values of g.
cn <- paste0('s', 'a')
# returns all rows
dt[, (cn) := mean(xa), by = g][]
     g        xa         xb        sa
 1:  1 0.3423699 0.81447505 0.4755900
 2:  2 0.0932055 0.06853225 0.5372602
 3:  3 0.2486223 0.13286546 0.6465111
 4:  1 0.6942175 0.66405944 0.4755900
 5:  2 0.7225208 0.83110248 0.5372602
 6:  3 0.9898293 0.09520907 0.6465111
 7:  1 0.3523753 0.72743182 0.4755900
 8:  2 0.5504942 0.01966303 0.5372602
 9:  3 0.3523625 0.55257436 0.6465111
10:  1 0.5133974 0.39650089 0.4755900
11:  2 0.7828203 0.89909528 0.5372602
12:  3 0.9952302 0.16872205 0.6465111
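For completeness, one workaround is to keep the := approach and collapse the result to one row per group afterwards (a sketch using the cn defined above):
# := repeats the group summary on every row, so deduplicate
# the grouping and summary columns to get the aggregated shape
unique(dt[, c("g", cn), with = FALSE])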
How do I get the usual summarized data.table? (This is a simplified example. In my actual problem, there will be multiple names passed to a loop)
There is a pending PR which will make this kind of operation much easier:
data.table#4304. Once implemented, under its current design the query will look like:
dt[, .(cn = mean(xa)), by = g, env = list(cn="sa")]
#        g        sa
#    <int>     <num>
# 1:     1 0.2060352
# 2:     2 0.1707827
# 3:     3 0.6850591
Installation of the PR branch:
remotes::install_github("Rdatatable/data.table#4304")
data
library(data.table)
dt <- data.table(g = rep(1:3, 4), xa = runif(12), xb = runif(12))
Either wrap setNames around the list column (.(mean(xa))):
dt[, setNames(.(mean(xa)), cn), by = g]
#    g        sa
# 1: 1 0.2010599
# 2: 2 0.4710056
# 3: 3 0.4871248
or use setnames after getting the summarised output:
setnames(dt[, mean(xa), by = g], 'V1', cn)[]
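Since the question mentions passing multiple names in a loop, the setNames pattern extends to several columns at once; a sketch with hypothetical output names sa and sb for the two value columns:
src <- c("xa", "xb")  # source columns from the example data
out <- c("sa", "sb")  # hypothetical output names
dt[, setNames(lapply(.SD, mean), out), by = g, .SDcols = src]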
In data.table, the := operator is used for creating/modifying a column in the original dataset. But this operator behaves differently when used in the tidyverse context:
library(dplyr)
dt %>%
  group_by(g) %>%
  summarise(!!cn := mean(xa), .groups = 'drop')
# A tibble: 3 x 2
#       g    sa
#   <int> <dbl>
# 1     1 0.201
# 2     2 0.471
# 3     3 0.487
I have a set of 85 possible combinations from two variables, one with five values (years) and one with 17 values (locations). I make a dataframe that has the years in the first column and the locations in the second column. For each combination of year and location I want to calculate the weighted mean value and then add it to the third column, according to the year and location values.
My code is as follows:
for (i in unique(data1$year)) {
  for (j in unique(data1$location)) {
    data2 <- crossing(data1$year, data1$location)
    dataname <- subset(data1, year %in% i & location %in% j)
    result <- weighted.mean(dataname$length, dataname$raising_factor, na.rm = T)
  }
}
The result I get puts the last calculated mean in the third column for each row.
How can I get it to add the mean according to the matching year and location combination?
Thanks.
A base R option would be by:
by(df[c('x', 'y')], df[c('group', 'year')],
   function(x) weighted.mean(x[, 1], x[, 2]))
Based on @LAP's example.
As @A.Suleiman suggested, we can use dplyr::group_by.
Example data:
df <- data.frame(group = rep(letters[1:5], each = 4),
                 year = rep(2001:2002, 10),
                 x = 1:20,
                 y = rep(c(0.3, 1, 1/0.3, 0.4), each = 5))
library(dplyr)
df %>%
  group_by(group, year) %>%
  summarise(test = weighted.mean(x, y))
# A tibble: 10 x 3
# Groups: group [?]
    group  year      test
   <fctr> <int>     <dbl>
 1      a  2001  2.000000
 2      a  2002  3.000000
 3      b  2001  6.538462
 4      b  2002  7.000000
 5      c  2001 10.538462
 6      c  2002 11.538462
 7      d  2001 14.000000
 8      d  2002 14.214286
 9      e  2001 18.000000
10      e  2002 19.000000
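For comparison, the same grouped weighted mean as a data.table one-liner (a sketch using the example df above):
library(data.table)
# convert in place, then aggregate the weighted mean by group and year
setDT(df)[, .(test = weighted.mean(x, y)), by = .(group, year)]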
I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A", "B", "C"), 4), "CatNum" = rep(1:2, 6),
                  "SomeVal" = runif(12))
I would like to quickly build a data frame with sum values for all combinations of the categories derived from CatA and CatNum, as well as for the categories derived from each column separately. For the small example above, the first couple of combinations can be computed with simple code:
df_sums <- data.frame(
  "Category" = c("Total for A",
                 "Total for A and 1",
                 "Total for A and 2"),
  "Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
            sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
            sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
           Category       Sum
1       Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time.
Achieve some flexibility with respect to how the function is applied; for instance, I may want to apply mean instead of sum.
Save the "Total for" string in a separate object that I can easily edit when applying a function other than sum.
I was initially thinking of using dplyr, along the lines of:
require(dplyr)

df_sums_experiment <- dta %>%
  group_by(CatA, CatNum) %>%
  summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.
You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)

dta %>% unite(measurevar, CatA, CatNum, remove = FALSE) %>%
  gather(key, val, -SomeVal) %>%
  group_by(val) %>%
  summarise(sum(SomeVal))
     val sum(SomeVal)
   (chr)        (dbl)
1      1    2.8198078
2      2    3.0778622
3      A    2.1801780
4    A_1    1.2101839
5    A_2    0.9699941
6      B    1.4405782
7    B_1    0.4076565
8    B_2    1.0329217
9      C    2.2769138
10   C_1    1.2019674
11   C_2    1.0749464
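If you also want the question's "Total for ... and ..." labels, one possible extension of the same pipeline is to replace the underscore that unite() inserts (a sketch):
dta %>% unite(measurevar, CatA, CatNum, remove = FALSE) %>%
  gather(key, val, -SomeVal) %>%
  group_by(val) %>%
  summarise(Sum = sum(SomeVal)) %>%
  mutate(Category = paste0("Total for ", sub("_", " and ", val, fixed = TRUE)))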
Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta)  # or setDT to convert in place
cols = c('CatA', 'CatNum')

rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
                function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = TRUE)
#    CatA CatNum        V1
# 1:    A      1 1.2101839
# 2:    B      2 1.0329217
# 3:    C      1 1.2019674
# 4:    A      2 0.9699941
# 5:    B      1 0.4076565
# 6:    C      2 1.0749464
# 7:    A     NA 2.1801780
# 8:    B     NA 1.4405782
# 9:    C     NA 2.2769138
#10:   NA      1 2.8198078
#11:   NA      2 3.0778622
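To add the question's "Total for" labels to this result, a possible follow-up (assuming the rbindlist() output above is stored in res) is:
res <- rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
                       function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = TRUE)
# label each row from whichever grouping columns it actually has
res[, Category := paste0("Total for ",
                         fifelse(is.na(CatNum), as.character(CatA),
                                 fifelse(is.na(CatA), as.character(CatNum),
                                         paste(CatA, CatNum, sep = " and "))))]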
Split the data, then apply the function over the pieces:
# result
res <- do.call(rbind,
               lapply(
                 c(split(dta, dta$CatA),
                   split(dta, dta$CatNum),
                   split(dta, dta[, 1:2])),
                 function(i) sum(i[, "SomeVal"])))

# prettify the result
res1 <- data.frame(Category = paste0("Total for ", rownames(res)),
                   Sum = res[, 1])
res1$Category <- sub(".", " and ", res1$Category, fixed = TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
#             Category       Sum
# 1        Total for A 2.1801780
# 2        Total for B 1.4405782
# 3        Total for C 2.2769138
# 4        Total for 1 2.8198078
# 5        Total for 2 3.0778622
# 6  Total for A and 1 1.2101839
# 7  Total for B and 1 0.4076565
# 8  Total for C and 1 1.2019674
# 9  Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464
I have an id variable and a date variable, with multiple dates for a given id (a panel). I would like to generate a new variable based on whether ANY of the years for a given id meets a logical condition. I am not sure how to code it, so please don't take the following as R code, just as logical pseudocode. Something like:
foreach(i in min(id):max(id)) {
if(var1[yearvar[1:max(yearvar)]=="A") then { newvar==1}
}
As an example:
ID Year Letter
 1 1999      A
 1 2000      B
 2 2000      C
 3 1999      A
This should return newvar:
1
1
0
1
Since the data for ID == 1 contain an A in some year, newvar should also be 1 in 2000, despite Letter == B in that year.
Here's a way of approaching it with base R:
# find which IDs meet the first criterion
withA <- unique(dat$ID[dat$Letter == "A"])
# add new column based on whether each ID is in withA
dat$newvar <- as.numeric(dat$ID %in% withA)
#   ID Year Letter newvar
# 1  1 1999      A      1
# 2  1 2000      B      1
# 3  2 2000      C      0
# 4  3 1999      A      1
Here's a solution using plyr:
library(plyr)
a <- ddply(dat, .(ID), summarise, newvar = as.numeric(any(Letter == "A")))
merge(dat, a, by = "ID")
Without using a package:
dat <- data.frame(
  ID = c(1, 1, 2, 3),
  Year = c(1999, 2000, 2000, 1999),
  Letter = c("A", "B", "C", "A")
)

# count Letter occurrences per ID, then flag IDs with at least one "A"
tableData <- table(dat[, c("ID", "Letter")])
newvar <- ifelse(tableData[dat$ID, "A"] > 0, 1, 0)
dat <- cbind(dat, newvar)
#  ID Year Letter newvar
#1  1 1999      A      1
#2  1 2000      B      1
#3  2 2000      C      0
#4  3 1999      A      1
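For a data.table take on the same idea, a short sketch (a grouped any() per ID, run on the original dat):
library(data.table)
# flag all rows of an ID when any of that ID's rows has Letter == "A"
setDT(dat)[, newvar := as.numeric(any(Letter == "A")), by = ID]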