How to aggregate a data frame by columns and rows in R?

I have the following data set:
Class Total AC Final_Coverage
A     1000  1   55
A     1000  2   66
B     1000  1   77
A     1000  3   88
B     1000  2   99
C     1000  1   11
B     1000  3   12
B     1000  4   13
B     1000  5   22
C     1000  2   33
C     1000  3   44
C     1000  4   55
C     1000  5  102
A     1000  4  105
A     1000  5  109
I would like to get the average of the AC and the Final_Coverage for the first three rows of each class. Then, I want to store the average values along with the class name in a new dataframe. To do that, I did the following:
dataset <- read_csv("/home/ad/Desktop/testt.csv")
classes <- unique(dataset$Class)
new_data <- data.frame(Class = character(0), AC = numeric(0), Coverage = numeric(0))
for (class in classes) {
  new_data$Class <- class
  dataClass <- subset(dataset, Class == class)
  tenRows <- dataClass[1:3, ]
  coverageMean <- mean(tenRows$Final_Coverage)
  acMean <- mean(tenRows$AC)
  new_data$Coverage <- coverageMean
  new_data$AC <- acMean
}
Everything works fine except entering the average value into the new_data frame. I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "Class", value = "A") :
replacement has 1 row, data has 0
Do you know how to solve this?

This should get you the new dataframe by using dplyr.
library(dplyr)
dataset %>%
  group_by(Class) %>%
  slice(1:3) %>%
  summarise(AC = mean(AC),
            Coverage = mean(Final_Coverage))
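With the sample data, this should return something like:
# A tibble: 3 x 3
#   Class    AC Coverage
#   <chr> <dbl>    <dbl>
# 1 A         2     69.7
# 2 B         2     62.7
# 3 C         2     29.3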
The error in your approach is that you initialised new_data with 0 rows and then tried to assign a single value to it; as the message says, the replacement has 1 row but the data has 0. Initialising the data frame with one row per class (and dropping the new_data$Class <- class line, which would otherwise overwrite the whole column on every iteration) makes it work:
new_data <- data.frame(Class = classes, AC = NA, Coverage = NA)
for (class in classes) {
  dataClass <- subset(dataset, Class == class)
  firstThree <- dataClass[1:3, ]
  coverageMean <- mean(firstThree$Final_Coverage)
  acMean <- mean(firstThree$AC)
  # index by class so each mean lands in its own row
  new_data$Coverage[new_data$Class == class] <- coverageMean
  new_data$AC[new_data$Class == class] <- acMean
}

You could look into aggregate().
> aggregate(df1[df1$AC <= 3, 3:4], by=list(Class=df1[df1$AC <= 3, 1]), FUN=mean)
Class AC Final_Coverage
1 A 2 69.66667
2 B 2 62.66667
3 C 2 29.33333
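The same computation can also be written with aggregate()'s formula interface, which some find easier to read (a sketch; the formula method accepts a subset argument):
aggregate(cbind(AC, Final_Coverage) ~ Class, data = df1, subset = AC <= 3, FUN = mean)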
DATA
df1 <- structure(list(Class = structure(c(1L, 1L, 2L, 1L, 2L, 3L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
Total = c(1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L,
1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L),
AC = c(1L, 2L, 1L, 3L, 2L, 1L, 3L, 4L, 5L, 2L, 3L, 4L, 5L,
4L, 5L), Final_Coverage = c(55L, 66L, 77L, 88L, 99L, 11L,
12L, 13L, 22L, 33L, 44L, 55L, 102L, 105L, 109L)), class = "data.frame", row.names = c(NA,
-15L))

Related

How to run a regression for every combination of levels in two or more factor independent variables?

I was wondering how to perform multiple independent linear regressions on a combination of levels for two or more factor variables.
Let's say our dataset has one dependent continuous variable, and then two factor independent variables and one continuous independent variable.
Then let's say our regression formula in R is this:
model <- lm(weight ~ city + diet + height)
Or, to write it in pseudo code, I'm trying to do this:
lm(weight ~ height) %>% group by city
lm(weight ~ height) %>% group by diet
lm(weight ~ height) %>% group by city & diet
I know that we could run a linear regression for each city and diet one by one, but do you know of a way we could create a loop so that we do an independent regression for each city and diet in our dataset?
To illustrate this better I've made this fake dataset in this image and then listed the three types of outputs I would want. However, I don't want to manually do them one by one, but would rather use a loop.
Does anyone know how to do this in r?
We can define the model specification in a list and then use lapply() over the list of desired models.
Code
models <- list("m1" = c("weight", "height"),
               "m2" = c("weight", "height", "city"),
               "m3" = c("weight", "height", "diet"),
               "m4" = c("weight", "height", "diet", "city"))

lapply(models, function(x){
  lm(weight ~ ., data = df[, x])
})
# $m1
#
# Call:
# lm(formula = weight ~ ., data = df[, x])
#
# Coefficients:
# (Intercept) height
# -0.2970 0.1219
#
#
# $m2
#
# Call:
# lm(formula = weight ~ ., data = df[, x])
#
# Coefficients:
# (Intercept) height cityHouston
# -0.3705 0.1259 0.1205
#
#
# $m3
#
# Call:
# lm(formula = weight ~ ., data = df[, x])
#
# Coefficients:
# (Intercept) height dietVegan dietVegetarian
# -0.1905 0.1270 -0.1288 -0.1757
#
#
# $m4
#
# Call:
# lm(formula = weight ~ ., data = df[, x])
#
# Coefficients:
# (Intercept) height dietVegan dietVegetarian cityHouston
# -0.2615 0.1310 -0.1417 -0.1663 0.1197
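If you want a more compact overview than the full print of each model, you could (as a sketch) keep the fitted models in a list and map over them:
fits <- lapply(models, function(x) lm(weight ~ ., data = df[, x]))
# One coefficient table per model
lapply(fits, function(m) summary(m)$coefficients)
# Quick comparison of the fits by adjusted R^2
sapply(fits, function(m) summary(m)$adj.r.squared)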
Data
df <- data.frame("weight" = rnorm(100),
"height" = rexp(100),
"diet" = as.factor(sample(c("Vegan", "Vegetarian", "Meat"), replace = TRUE, 100)),
"city" = as.factor(sample(c("Houston", "Chicago"), replace = TRUE, 100)))
First define a small regfun that computes the desired summary statistics. Then, using by, apply it group-wise. For the combination of two groups we may paste the columns together or, for factors, use the interaction operator :.
regfun <- function(x) summary(lm(w ~ h, x))$coe[2, c(1, 4)]
do.call(rbind, by(d, d$city, regfun))
# Estimate Pr(>|t|)
# a -0.1879530 0.4374580
# b -0.2143780 0.4674864
# c -0.2866948 0.5131854
do.call(rbind, by(d, d$diet, regfun))
# Estimate Pr(>|t|)
# y -0.1997162 0.3412652
# z -0.3512349 0.4312766
# do.call(rbind, by(d, Reduce(paste, d[1:2]), regfun))
with(d, do.call(rbind, by(d, city:diet, regfun))) ## credits to @G.Grothendieck
# Estimate Pr(>|t|)
# a y -0.2591764 0.5576043
# a z -0.1543536 0.8158689
# b y -0.1966501 0.7485405
# b z -0.4354839 0.7461538
# c y -0.5000000 0.3333333
# c z -1.0671642 0.7221495
Edit
If we have an unbalanced panel, i.e. with(d, city:diet) yields "impossible" combinations that aren't actually in the data, we have to code this slightly differently. You can think of by as a combination of first split, then lapply, so let's do that. Because the empty combinations will throw errors, we can use tryCatch to return a placeholder instead.
s <- with(d2, split(d2, city:diet))
do.call(rbind, lapply(s, function(x)
  tryCatch(regfun(x),
           error = function(e) cbind.data.frame(Estimate = NA, `Pr(>|t|)` = NA))))
# Estimate Pr(>|t|)
# a:y -0.2591764 0.5576043
# a:z NA NA
# b:y 5.2500000 NaN
# b:z NA NA
# c:y -0.5000000 0.3333333
# c:z 9.5000000 NaN
# d:y NA NA
# d:z 1.4285714 NaN
# e:y NA NA
# e:z -7.0000000 NaN
# f:y NA NA
# f:z 2.0000000 NaN
Data:
d <- structure(list(city = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("a",
"b", "c"), class = "factor"), diet = structure(c(1L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("y",
"z"), class = "factor"), id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), w = c(66L, 54L, 50L,
74L, 59L, 53L, 67L, 75L, 66L, 64L, 73L, 56L, 53L, 74L, 54L, 63L,
69L, 75L), h = c(152L, 190L, 174L, 176L, 185L, 186L, 180L, 194L,
154L, 169L, 183L, 177L, 189L, 152L, 182L, 191L, 173L, 179L)), out.attrs = list(
dim = c(city = 3L, diet = 2L, id = 3L), dimnames = list(city = c("city=a",
"city=b", "city=c"), diet = c("diet=y", "diet=z"), id = c("id=1",
"id=2", "id=3"))), row.names = c(NA, -18L), class = "data.frame")
d2 <- structure(list(city = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 1L,
2L, 3L, 4L, 5L, 3L, 1L, 6L, 3L, 6L, 2L, 3L), .Label = c("a",
"b", "c", "d", "e", "f"), class = "factor"), diet = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L,
2L), .Label = c("y", "z"), class = "factor"), id = c(1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L
), w = c(66L, 54L, 50L, 74L, 59L, 53L, 67L, 75L, 66L, 64L, 73L,
56L, 53L, 74L, 54L, 63L, 69L, 75L), h = c(152L, 190L, 174L, 176L,
185L, 186L, 180L, 194L, 154L, 169L, 183L, 177L, 189L, 152L, 182L,
191L, 173L, 179L)), out.attrs = list(dim = c(city = 3L, diet = 2L,
id = 3L), dimnames = list(city = c("city=a", "city=b", "city=c"
), diet = c("diet=y", "diet=z"), id = c("id=1", "id=2", "id=3"
))), row.names = c(NA, -18L), class = "data.frame")
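For completeness, since the question mentions dplyr: roughly the same group-wise regressions can be run with group_modify() and broom::tidy() (a sketch, assuming the broom package is available):
library(dplyr)
library(broom)
d %>%
  group_by(city, diet) %>%
  group_modify(~ tidy(lm(w ~ h, data = .x))) %>%
  filter(term == "h") %>%
  select(city, diet, estimate, p.value)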
To expand on my comments about the efficiency of the pooled analysis over that of the subgroup analyses...
Using starwars as (a less than ideal) starting point:
library(dplyr)
library(tidyr)  # for replace_na()
d <- starwars %>%
  filter(mass < 1000) %>%  # exclude Jabba
  mutate(maleOrNot = ifelse(sex == "male", sex, "other")) %>%
  replace_na(list(maleOrNot = "other"))
For the sake of argument, say we want to regress a character's mass based only on whether they are male or not and their height and then obtain the standard error of the predicted mass at the mean height.
pData <- d %>%
group_by(maleOrNot) %>%
summarise(height=mean(height), .groups="drop")
pData
# A tibble: 2 x 2
maleOrNot height
* <chr> <dbl>
1 male 178.
2 other 162.
By-group analyses:
lapply(
  d %>% pull(maleOrNot) %>% unique(),
  function(x) {
    m <- lm(mass ~ height, d %>% filter(maleOrNot == x))
    predict(m, pData %>% filter(maleOrNot == x), se.fit=TRUE)$se.fit
  }
)
[[1]]
[1] 2.656427
[[2]]
[1] 5.855176
Now the pooled analysis:
m <- lm(mass ~ maleOrNot + height, d)
predict(m, pData, se.fit=TRUE)$se.fit
1 2
2.789770 4.945734
The prediction for the males is slightly (5%) less precise, but for the non-males, precision is improved by 15.5%.
But the model isn't particularly good. Perhaps an interaction model will improve things:
m <- lm(mass ~ maleOrNot:height, d)
predict(m, pData, se.fit=TRUE)$se.fit
1 2
2.776478 4.880154
Now the figures are 4.5% worse and 16.7% better. Including other terms in the model may well improve the precision even more.
In general terms (though there are exceptions), fitting a pooled model is unlikely to reduce precision compared to fitting several subgroup models, and it can substantially improve precision. This is because all groups contribute to the estimation of the (common) variance.
In terms of computation time:
library(microbenchmark)
byGroup <- function() {
  lapply(
    d %>% pull(maleOrNot) %>% unique(),
    function(x) {
      m <- lm(mass ~ height, d %>% filter(maleOrNot == x))
      predict(m, pData %>% filter(maleOrNot == x), se.fit=TRUE)$se.fit
    }
  )
}
pooled <- function() {
  m <- lm(mass ~ maleOrNot + height, d)
  predict(m, pData, se.fit=TRUE)$se.fit
}
microbenchmark(byGroup(), pooled(), times=100)
Note that the functions have to be called (with ()) inside microbenchmark(); passing the bare names only times the symbol lookup, which takes nanoseconds regardless of the work the functions do. On a dataset this small both approaches are fast either way; more complex examples may give different answers.

Replace all values in dataframe using another dataframe as key in R

I have two dataframes, and I want to replace all values (in all the columns) of df1 with the equivalent value from df2 (df2$value).
df1
structure(list(Cell_ID = c(7L, 2L, 3L, 10L), n_1 = c(0L, 0L,
0L, 0L), n_2 = c(9L, 1L, 4L, 1L), n_3 = c(10L, 4L, 5L, 2L), n_4 = c(NA,
5L, NA, 4L), n_5 = c(NA, 7L, NA, 6L), n_6 = c(NA, 9L, NA, 8L),
n_7 = c(NA, 10L, NA, 3L)), class = "data.frame", row.names = c(NA,
-4L))
df2
structure(list(Cell_ID = 0:10, value = c(5L, 100L, 200L, 300L,
400L, 500L, 600L, 700L, 800L, 900L, 1000L)), class = "data.frame", row.names = c(NA,
-11L))
The desired output would look like this:
So far I have tried this, as suggested in another similar post, but it's not working well (it randomly misses some points):
key= df2$Cell_ID
value = df2$value
lapply(1:8,FUN = function(i){df1[df1 == key[i]] <<- value[i]})
Note that in this example the numbers have just been multiplied by 10 for simplicity; in the real data the numbers are all over the place, so simply multiplying the dataframe by 10 won't work.
An option is to match the elements against the 'Cell_ID' column of the second dataset and use that as an index to return the corresponding 'value' from df2:
library(dplyr)
df1 %>%
mutate(across(everything(), ~ df2$value[match(., df2$Cell_ID)]))
-output
# Cell_ID n_1 n_2 n_3 n_4 n_5 n_6 n_7
#1 700 5 900 1000 NA NA NA NA
#2 200 5 100 400 500 700 900 1000
#3 300 5 400 500 NA NA NA NA
#4 1000 5 100 200 400 600 800 300
Or another option is to use a named vector to do the match
library(tibble)
df1 %>%
mutate(across(everything(), ~ deframe(df2)[as.character(.)]))
The base R equivalent is
df1[] <- lapply(df1, function(x) df2$value[match(x, df2$Cell_ID)])
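The same idea can be phrased with a named lookup vector (a sketch; it assumes every non-NA value in df1 occurs as a Cell_ID in df2, otherwise you get NA):
lookup <- setNames(df2$value, df2$Cell_ID)
df1[] <- lapply(df1, function(x) unname(lookup[as.character(x)]))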

R: using a for loop to create a new data table containing min and max variables given multiple column combinations

I am currently working with a data set in R that contains four variables for a large set of individuals: pid, month, window, and agedays. I'm trying to create a loop that will output the min and max agedays of each group of combinations between month and window into a new data table that I can export as a csv.
Here's an example of the data:
pid agedays month window
1 22 2 1
2 35 3 2
3 33 3 2
4 55 3 2
1 66 2 1
2 55 4 2
3 80 4 2
4 90 4 2
I'd like for the new data table to contain the min and max agedays of each group within each combination of window and month as well as the count of each group within each combination. The range for month is 2-24 and the range for window is 0-2.
The data table should look something like this:
month window min max N
2 1 22 66 1
3 2 33 55 3
etc....
where N is the number of unique individuals (pids) within each group
After grouping by 'month' and 'window', get the min and max of 'agedays' and the number of distinct (n_distinct) 'pid' values:
library(dplyr)
df1 %>%
group_by(month, window) %>%
summarise(min = min(agedays), max = max(agedays), N = n_distinct(pid))
# A tibble: 3 x 5
# Groups: month [3]
# month window min max N
# <int> <int> <int> <int> <int>
#1 2 1 22 66 1
#2 3 2 33 55 3
#3 4 2 55 90 3
We can also do this with data.table
library(data.table)
setDT(df1)[, .(min = min(agedays), max = max(agedays),
N = uniqueN(pid)), by = .(month, window)]
Or using split from base R
do.call(rbind, lapply(split(df1, df1[c('month', 'window')], drop = TRUE),
  function(x) cbind(month = x$month[1], window = x$window[1],
                    min = min(x$agedays), max = max(x$agedays),
                    N = length(unique(x$pid)))))
data
df1 <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
Using data.table, we can calculate the min and max of agedays along with the number of rows (.N) for each combination of month and window. Note that .N counts rows rather than unique individuals; use uniqueN(pid) as in the previous answer if you need distinct pids.
library(data.table)
setDT(df) #Convert to data.table if it is not already
df[, .(min_age = min(agedays, na.rm = TRUE),
max_age = max(agedays, na.rm = TRUE), N = .N), .(month, window)]
# month window min_age max_age N
#1: 2 1 22 66 2
#2: 3 2 33 55 3
#3: 4 2 55 90 3
data
df <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -8L))

Conditional data manipulation using data.table in R

I have 2 dataframes, testx and testy
testx
testx <- structure(list(group = 1:2), .Names = "group", class = "data.frame", row.names = c(NA,
-2L))
testy
testy <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(1L, 3L, 4L, 1L, 4L, 5L, 1L, 5L, 7L), value = c(50L,
52L, 10L, 4L, 84L, 2L, 25L, 67L, 37L)), .Names = c("group",
"time", "value"), class = "data.frame", row.names = c(NA, -9L
))
Based on this topic, I add missing time values using the following code, which works perfectly.
data <- setDT(testy, key='time')[, .SD[J(min(time):max(time))], by = group]
Now I would like to only add these missing time values IF the value for group appears in testx. In this example, I thus only want to add missing time values for groups matching the values for group in the file testx.
data <- setDT(testy, key='time')[,if(testy[group %in% testx[, group]]) .SD[J(min(time):max(time))], by = group]
The error I get is "undefined columns selected". I looked here, here and here, but I don't see why my code isn't working. I am doing this on large datasets, which is why I prefer data.table.
You don't need to refer to testy when you are within testy[] and grouping by group; using group directly as a variable gives the correct result. You also need an else branch that returns the rows whose group is not in testx, if you want to keep all records in testy:
testy[, {if(group %in% testx$group) .SD[J(min(time):max(time))] else .SD}, by = group]
# group time value
# 1: 1 1 50
# 2: 1 2 NA
# 3: 1 3 52
# 4: 1 4 10
# 5: 2 1 4
# 6: 2 2 NA
# 7: 2 3 NA
# 8: 2 4 84
# 9: 2 5 2
# 10: 3 1 25
# 11: 3 5 67
# 12: 3 7 37
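If instead you wanted to drop the groups that don't appear in testx rather than keep them unexpanded, a sketch of one way (rekeying on time after the subset so the J() join still works):
matched <- setDT(testy, key = "time")[group %in% testx$group]
setkey(matched, time)
matched[, .SD[J(min(time):max(time))], by = group]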

Replacing loop in dplyr R

So I am trying to write a function with dplyr without a loop, and here is something I do not know how to do.
Say we have TV stations (x, y, z) and months (2, 3). If I group by these, say we get this output, with a summarised numeric value:
TV months value
x 2 52
y 2 87
z 2 65
x 3 180
y 3 36
z 3 99
This is for the evaluated brand.
Then I have many brands, and I need to filter to keep only those whose value is >= 0.8 * the evaluated brand's value & <= 1.2 * the evaluated brand's value.
So, for example, from the table below I would want to keep only the first two rows, and this should be done for all month & TV combinations:
brand TV MONTH value
sdg x 2 60
sdfg x 2 55
shs x 2 120
sdg x 2 11
sdga x 2 5000
As @akrun said, you need to use a combination of merging and subsetting. Here's a base R solution:
m <- merge(df, data, by.x=c("TV", "MONTH"), by.y=c("TV", "months"))
m[m$value.x >= m$value.y*0.8 & m$value.x <= m$value.y*1.2,][,-5]
# TV MONTH brand value.x
#1 x 2 sdg 60
#2 x 2 sdfg 55
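Since the question is about dplyr, the same merge-then-filter logic can be sketched with a join (the .x/.y suffixes are dplyr's defaults for clashing column names):
library(dplyr)
df %>%
  inner_join(data, by = c("TV", "MONTH" = "months")) %>%
  filter(value.x >= 0.8 * value.y, value.x <= 1.2 * value.y) %>%
  select(-value.y)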
Data
data <- structure(list(TV = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("x",
"y", "z"), class = "factor"), months = c(2L, 2L, 2L, 3L, 3L,
3L), value = c(52L, 87L, 65L, 180L, 36L, 99L)), .Names = c("TV",
"months", "value"), class = "data.frame", row.names = c(NA, -6L
))
df <- structure(list(brand = structure(c(2L, 1L, 4L, 2L, 3L), .Label = c("sdfg",
"sdg", "sdga", "shs"), class = "factor"), TV = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "x", class = "factor"), MONTH = c(2L,
2L, 2L, 2L, 2L), value = c(60L, 55L, 120L, 11L, 5000L)), .Names = c("brand",
"TV", "MONTH", "value"), class = "data.frame", row.names = c(NA,
-5L))