Calculate a mean, by a condition, within a factor

I'm looking to calculate the simple mean of an outcome variable, but only for the outcome associated with the maximal instance of another running variable, grouped by factors.
Of course, the calculated statistic could be substituted for any other function, and the evaluation within the group could be any other function.
library(data.table) #1.9.5
dt <- data.table(name = rep(LETTERS[1:7], each = 3),
target = rep(c(0,1,2), 7),
filter = 1:21)
dt
## name target filter
## 1: A 0 1
## 2: A 1 2
## 3: A 2 3
## 4: B 0 4
## 5: B 1 5
## 6: B 2 6
## 7: C 0 7
With this table, the desired output is the mean of target over the row(s) where filter reaches its maximum within each name, which here is exactly 2 for every group.
Something like:
dt[ , .(mFilter = which.max(filter),
target = target), by = name][ ,
mean(target), by = c("name", "mFilter")]
... seems close, but isn't hitting it quite right.
The solution should return:
## name V1
## 1: A 2
## 2: B 2
## 3: ...

You could do this with:
dt[, .(meantarget = mean(target[filter == max(filter)])), by = name]
# name meantarget
# 1: A 2
# 2: B 2
# 3: C 2
# 4: D 2
# 5: E 2
# 6: F 2
# 7: G 2
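One reason the attempt in the question misses: which.max(filter) returns a position within the group (here always 3), not a row selection, so the second step still averages all three target values. Below is a minimal sketch of two ways to select on the maximum, using the question's data. The names res_ties and res_first are only illustrative.

```r
library(data.table)

dt <- data.table(name   = rep(LETTERS[1:7], each = 3),
                 target = rep(c(0, 1, 2), 7),
                 filter = 1:21)

# Variant 1: keep every row tied for the group maximum, then average them.
res_ties <- dt[, .(meantarget = mean(target[filter == max(filter)])), by = name]

# Variant 2: which.max() gives only the first maximal position,
# so exactly one row per group is used.
res_first <- dt[, .(meantarget = target[which.max(filter)]), by = name]
```

The two variants agree here because filter has no ties within a group. With ties, only the first variant averages over all tied rows.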


Add new blank rows into dataset by group (in R)

I use R. I have dataframe like this:
dat <- data.frame(
group = c(1,1,1,1,1,1,2,2,2,2,2),
horizon = c(1,3,5,6,7,10,1,3,5,9,10),
value = c(1.0,0.9,0.8,0.6,0.3,0.0,0.5,0.6,0.8,0.9,0.8),
other = c(a,a,a,a,a,a,b,b,b,b,b)
)
I would like to add a row for every horizon that is missing (2, 4, 8 and 9 for the first group and 2, 4, 6, 7 and 8 for the second group). The values (value) for the missing horizons would be blank.
I would like to get something like this:
datx <- data.frame(
group = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
horizon = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10),
value = c(1.0,"na",0.9,"na",0.8,0.6,0.3,"na","na",0.0,0.5,"na",0.6,"na",0.8,"na","na","na",0.9,0.8),
other = c(a,a,a,a,a,a,a,a,a,a,b,b,b,b,b,b,b,b,b,b)
)
i.e. an enlarged dataset with the new horizons, blank or "na" entries in the "value" variable, and the "other" variable retained.
This is just an example. I am actually working with a much larger dataset.
Without the groups, the problem would be much easier to solve. I would use something like this:
newdat <- merge(data.frame(horizon=seq(1,10,1)),dat,all=TRUE)
newdat <- newdat[order(newdat$horizon),]
Thanks for help!
I'll assume that the values in the variable other are the characters a or b, and that this variable is completely redundant with your variable group. If so, you can accomplish this with full_join from the dplyr package.
a="a"
b="b"
dat <- data.frame(
group = c(1,1,1,1,1,1,2,2,2,2,2),
horizon = c(1,3,5,6,7,10,1,3,5,9,10),
value = c(1.0,0.9,0.8,0.6,0.3,0.0,0.5,0.6,0.8,0.9,0.8),
other = c(a,a,a,a,a,a,b,b,b,b,b)
)
library(dplyr) # loads the %>% pipe used below
groups <- expand.grid(group=c(1,2),horizon=1:10)
groups <- groups %>% dplyr::mutate(other=ifelse(group==1,"a","b"))
dat %>%
dplyr::full_join(groups,by=c('group','horizon','other')) %>%
dplyr::arrange(group,horizon)
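The same grid-then-join idea can be sketched in base R, extending the merge() call from the question. This assumes, as above, that other holds the strings "a" and "b" and is constant within each group. The helper names keys, grid and newdat are illustrative only.

```r
dat <- data.frame(
  group   = c(1,1,1,1,1,1,2,2,2,2,2),
  horizon = c(1,3,5,6,7,10,1,3,5,9,10),
  value   = c(1.0,0.9,0.8,0.6,0.3,0.0,0.5,0.6,0.8,0.9,0.8),
  other   = c("a","a","a","a","a","a","b","b","b","b","b")
)

# One row per (group, other) pair, so `other` is carried into the new rows.
keys <- unique(dat[, c("group", "other")])

# by = NULL makes merge() return the Cartesian product:
# every key pair crossed with every horizon from 1 to 10.
grid <- merge(keys, data.frame(horizon = seq(1, 10, by = 1)), by = NULL)

# Left join: grid rows with no match in dat get NA in `value`.
newdat <- merge(grid, dat, by = c("group", "other", "horizon"), all.x = TRUE)
newdat <- newdat[order(newdat$group, newdat$horizon), ]
```

Building the grid from the observed (group, other) pairs avoids hard-coding the ifelse() mapping, at the cost of assuming each group has a single value of other.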
Using data.table:
library(data.table)
setDT(dat)
fill = c("other")
RES =
dat[CJ(group = group, horizon = min(horizon):max(horizon), unique = TRUE),
on = .(group, horizon)
][, (fill) := lapply(.SD, \(x) x[which.min(is.na(x))]), by = group, .SDcols = fill]
RES[]
# group horizon value other
# <num> <int> <num> <char>
# 1: 1 1 1.0 a
# 2: 1 2 NA a
# 3: 1 3 0.9 a
# 4: 1 4 NA a
# 5: 1 5 0.8 a
# 6: 1 6 0.6 a
# 7: 1 7 0.3 a
# 8: 1 8 NA a
# 9: 1 9 NA a
# 10: 1 10 0.0 a
# 11: 2 1 0.5 b
# 12: 2 2 NA b
# 13: 2 3 0.6 b
# 14: 2 4 NA b
# 15: 2 5 0.8 b
# 16: 2 6 NA b
# 17: 2 7 NA b
# 18: 2 8 NA b
# 19: 2 9 0.9 b
# 20: 2 10 0.8 b
# group horizon value other

How to use functions to do a recursive calculation in data.table/R?

I am new to programming and have gotten stuck. I want to calculate the hourly temperature variation of an object throughout the year using several variables that change every hour. The original data contains 60 columns and 8760 rows for the calculation.
I got the desired output using a for loop, but the model takes a long time to run. I wonder if there is a way to replace the loop with functions, which I suspect could also speed up the calculation.
Here is a small reproducible example to show what I did.
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table
A B C
1: 1 1 10
2: 1 2 10
3: 1 3 10
4: 1 4 10
5: 1 5 10
The for loop:
for (j in (2: nrow(table))) {
table$A[j] = (table$A[j-1] + table$B[j-1]) * table$B[j]
table$C[j] = table$B[j] * table$A[j]
}
I got the output as I desired:
A B C
1: 1 1 10
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
but the whole program took 15 minutes to run on my real data (not on this example).
So I tried to use a function instead of the for loop.
I tried this:
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
myfun <- function(df){
df = df %>% mutate(A = (lag(A) + lag(B)) * B,
C = B * A)
return(df)
}
myfun(table)
But the output was
A B C
1 NA 1 NA
2 4 2 8
3 9 3 27
4 16 4 64
5 25 5 125
It seems that the function refers to the rows of the original table, not to the rows updated earlier in the calculation. Is there any way to obtain the desired output using functions? This is my first R project, so any help is very much appreciated. Thank you.
A much faster alternative using data.table. Note that the calculation of C can be separated from the calculation of A so we can do less within the loop:
for (i in 2:nrow(table)) {
set(table, i = i, j = "A", value = with(table, (A[i-1] + B[i-1]) * B[i]))
}
table[-1, C := A * B]
table
# A B C
# <num> <int> <num>
# 1: 1 1 10
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200
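A related sketch, under the assumption that the recursion only needs columns A and B: run the loop over plain vectors and write the result back once, so that no data.table method is called inside the loop. The names tbl and a are illustrative (a lower-case a is used so that it does not clash with column A inside the brackets).

```r
library(data.table)

tbl <- data.table(A = 1, B = 1:5, C = 10)

# Pull the columns out as ordinary vectors.
a <- tbl$A
b <- tbl$B

# The recursion itself: each A depends on the previous A and B.
for (i in 2:length(a)) {
  a[i] <- (a[i - 1] + b[i - 1]) * b[i]
}

# Write A back in one step, then fill C for every row except the first.
tbl[, A := a]
tbl[-1, C := A * B]
```

Whether this is faster than set() on the real 8760-row data would need to be timed; it is offered only as a variant of the same idea.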
You can try Reduce, as below:
dt[
,
A := Reduce(function(x, Y) (x + Y[2]) * Y[1],
asplit(embed(B, 2), 1),
init = A[1],
accumulate = TRUE
)
][
,
C := A * B
]
which updates dt as
> dt
A B C
1: 1 1 1
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
data
dt <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
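To see why the Reduce call works, it may help to look at what embed() and asplit() hand to the function. The following sketch uses plain vectors instead of a data.table. The names m, pairs and A_new are illustrative.

```r
B  <- 1:5
A1 <- 1            # starting value, A[1] in the table

# embed(B, 2) puts the current value in column 1 and the previous one in column 2.
m <- embed(B, 2)

# asplit(m, 1) turns each row into one element of a list: c(B[i], B[i-1]).
pairs <- asplit(m, 1)

# x is the previous A; Y[2] is B[i-1] and Y[1] is B[i].
A_new <- Reduce(function(x, Y) (x + Y[2]) * Y[1],
                pairs,
                init = A1,
                accumulate = TRUE)
A_new <- as.numeric(A_new)
```

With accumulate = TRUE and an init value, Reduce returns the starting value followed by one result per row of m, which is why the output has the same length as B.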
Here's a solution using purrr::accumulate2 which lets you use the result of the previous computation as the input to the next one:
library(data.table)
library(purrr)
library(magrittr)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table$A <- accumulate2(
table$A,
seq(table$A),
~ (..1 + table$B[..3]) * table$B[..3 + 1],
.init = table$A[1]
) %>%
unlist() %>%
extract(1:nrow(table))
table$C <- table$B * table$A
table
# A B C
# 1: 1 1 1
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200

Create a list of dataframes/data.tables which created from function with argument in R?

I am impressed by how efficient R code can be when using functions and loops.
I will first give a simplified example of the question, and then explain my actual problem (where the code is probably not reproducible).
Suppose I have several vectors that differ in content and length, like:
tables_vector_1 <- c(1,2,3)
tables_vector_2 <- c(1:10)
And I have a function to create data.tables from the vector, like:
create_dt <- function(tables_vector, i){
DT <- data.table(id = 1:i, name = c("a","b","c"))
return(DT)
}
I am wondering whether there is a way to write a loop or function that creates all (or some) of the data.tables for a vector by running the function defined above.
(probably something like)
for (i in 1:length(tables_vector)) {
create_dt(tables_vector, i)
}
And then combine the results in a list, same as the result if you run:
list(create_dt(tables_vector_1,1),create_dt(tables_vector_1,2),create_dt(tables_vector_1,3))
I have tried to use lapply(list(1:3),create_dt,tables_vector = tables_vector_1, i), but it fails, since I don't know how to specify the i argument correctly in lapply().
Here is why this problem arises:
In the real situation, I have created a function to import data.table from the database:
import_data <- function(tables_vector,i){
end <- Sys.time()
start <- end - 7200
con <- dbConnect("PostgreSQL", dbname="db", host = "host", user=db_user, password=db_password)
query <- sprintf("SELECT %s.timeutc, %s.scal AS %s FROM %s WHERE timeutc BETWEEN '%s' AND '%s' AND mode='General';",
tables_vector[i],tables_vector[i],tables_vector[i], tables_vector[i],start,end)
rs <- dbSendQuery(con, query)
df <- fetch(rs, n = -1)
dbClearResult(rs)
dbDisconnect(con)
return(as.data.table(df))
}
I have tens of vectors, which are defined by group (e.g. vector1 contains the channels for purpose 1, vector2 contains the channels for purpose 2).
Since they are created for different analysis purposes, I cannot simply combine them into one vector.
Moreover, some vectors contain 7 or 8 channels, so it is quite tedious to list them by repeating the function call one by one.
How about something like this:
tables_vector_1 <- c(1,2,3)
tables_vector_2 <- c(1:10)
create_dt <- function(tables_vector, i){
DT <- data.table(id = 1:i, name = letters[1:i])
return(DT)
}
make_list <- function(x){
lapply(seq_along(x), function(i)create_dt(x, i))
}
make_list(tables_vector_1)
[[1]]
id name
1: 1 a
[[2]]
id name
1: 1 a
2: 2 b
[[3]]
id name
1: 1 a
2: 2 b
3: 3 c
make_list(tables_vector_2)
[[1]]
id name
1: 1 a
[[2]]
id name
1: 1 a
2: 2 b
[[3]]
id name
1: 1 a
2: 2 b
3: 3 c
[[4]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
[[5]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
[[6]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
[[7]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
[[8]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
[[9]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
9: 9 i
[[10]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
9: 9 i
10: 10 j
Note, I changed the create_dt() function so it did not produce a warning, but the mechanics should still work as intended.
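On the lapply() call from the question: one possible way to specify i is to name the tables_vector argument, so that the value lapply() iterates over falls into the first remaining argument, i. The sketch below also shows one way to cover several vectors at once by keeping them in a named list. The names groups, res1 and all_res are hypothetical, and create_dt is the modified version from this answer.

```r
library(data.table)

create_dt <- function(tables_vector, i) {
  data.table(id = 1:i, name = letters[1:i])
}

tables_vector_1 <- c(1, 2, 3)
tables_vector_2 <- 1:10

# tables_vector is matched by name, so each index lands in `i`.
res1 <- lapply(seq_along(tables_vector_1), create_dt,
               tables_vector = tables_vector_1)

# Several vectors: keep them in a named list and loop over that list.
groups  <- list(purpose1 = tables_vector_1, purpose2 = tables_vector_2)
all_res <- lapply(groups, function(v) {
  lapply(seq_along(v), create_dt, tables_vector = v)
})
```

all_res$purpose2[[4]] would then be the table built for the fourth channel of the second vector.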

data.table grouping by column names from a variable sets column names as "get"

I need to aggregate a data table by columns given as strings in variables. I'm using get to achieve this but the column names in the resulting table are named as "get" instead of the original names. How to avoid this?
dt = data.table(id = rep(LETTERS[1:4], 1, each = 3),
grp = round(runif(12)),
val = runif(12))
col.names = names(dt)
dt[, .(meanByIDByGrp = mean(val)), by = .(get(col.names[1]), get(col.names[2]))]
get get meanByIDByGrp
1: A 1 0.5628882
2: A 0 0.6021001
3: B 1 0.4013824
4: B 0 0.0551370
5: C 1 0.6031302
6: C 0 0.7107527
7: D 1 0.2778507
You can pass the column names to by directly as a character vector:
dt[, .(meanByIDByGrp = mean(val)), by = col.names[1:2]]
# id grp meanByIDByGrp
# 1: A 1 0.1638516
# 2: A 0 0.5859206
# 3: B 1 0.4907845
# 4: B 0 0.3665976
# 5: C 1 0.6644277
# 6: D 0 0.5028973
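If get() is still needed for some reason, a possible workaround is to name each grouping expression inside .(). The sketch below compares both forms. The set.seed() call is added here only to make the random data repeatable, and the names res_chr and res_get are illustrative.

```r
library(data.table)

set.seed(1)  # not in the original; only for repeatable random values
dt <- data.table(id  = rep(LETTERS[1:4], each = 3),
                 grp = round(runif(12)),
                 val = runif(12))
col.names <- names(dt)

# Character vector in `by`: the result keeps the original column names.
res_chr <- dt[, .(meanByIDByGrp = mean(val)), by = col.names[1:2]]

# get() with explicit names: each grouping column is named by hand.
res_get <- dt[, .(meanByIDByGrp = mean(val)),
              by = .(id = get(col.names[1]), grp = get(col.names[2]))]
```

The character-vector form scales to any number of grouping columns, whereas the get() form needs one named entry per column.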

Order factor levels in order of appearance in data set

I have a survey in which a unique ID must be assigned to questions. Some questions appear multiple times. This means that there is an extra layer of questions. In the sample data below only the first layer is included.
Question: how do I assign a unique index by order of appearance? The solution provided here works alphabetically. I can order the factors, but this defeats the purpose of doing it in R [there are many questions to sort].
library(data.table)
dt = data.table(question = c("C", "C", "A", "B", "B", "D"),
value = c(10,20,30,40,20,30))
dt[, idx := as.numeric(as.factor(question))]
gives:
# question value idx
# 1: C 10 3
# 2: C 20 3
# 3: A 30 1
# 4: B 40 2
# 5: B 20 2
# 6: D 30 4
# but required is:
dt[, idx.required := c(1, 1, 2, 3, 3, 4)]
I think the data.table way to do this will be
dt[, idx := .GRP, by = question]
## question value idx
## 1: C 10 1
## 2: C 20 1
## 3: A 30 2
## 4: B 40 3
## 5: B 20 3
## 6: D 30 4
You could respecify the factor levels:
dt[, idx := as.numeric(factor(question, levels=unique(question)))]
# question value idx
# 1: C 10 1
# 2: C 20 1
# 3: A 30 2
# 4: B 40 3
# 5: B 20 3
# 6: D 30 4
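Another possible variant skips the factor entirely: match() returns, for each question, its position in the vector of unique values, and unique() keeps values in order of first appearance. A minimal sketch on the question's data:

```r
library(data.table)

dt <- data.table(question = c("C", "C", "A", "B", "B", "D"),
                 value    = c(10, 20, 30, 40, 20, 30))

# Position of each question among the unique values, in order of appearance.
dt[, idx := match(question, unique(question))]
```

This gives an integer index, the same type that .GRP produces.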
