I am trying to create a weighted average for each week, across multiple columns. My data looks like this:
week <- c(1,1,1,2,2,3)
col_a <- c(1,2,2,4,2,7)
col_b <- c(4,2,3,1,2,5)
col_c <- c(4,2,3,2,2,4)
dfreprex <- data.frame(week,col_a,col_b,col_c)
week col_a col_b col_c
1 1 1 4 4
2 1 2 2 2
3 1 2 3 3
4 2 4 1 2
5 2 2 2 2
6 3 7 5 4
weightsreprex <- data.frame(county = c("col_a", "col_b", "col_c")
, weights = c(.3721, .3794, .2485))
How do I weight each column and then get the mean? Is there a simpler way than multiplying each column by its weight in a new column (col_a_weighted) and then taking the row mean of the weighted columns only?
I tried weighted.mean, rowMeans, group_by and summarise.
We may use %*% for matrix multiplication (the weights sum to 1, so the product is the weighted mean):
dfreprex$wtmean <- as.matrix(dfreprex[,-1]) %*% as.matrix(weightsreprex[, 2])
dfreprex
week col_a col_b col_c wtmean
1 1 1 4 4 2.8837
2 1 2 2 2 2.0000
3 1 2 3 3 2.6279
4 2 4 1 2 2.3648
5 2 2 2 2 2.0000
6 3 7 5 4 5.4957
We might also use crossprod
crossprod(t(as.matrix(dfreprex[,-1])), as.matrix(weightsreprex[, 2]))
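Both matrix forms above rely on the rows of weightsreprex being in the same order as the columns of dfreprex, and they equal a weighted mean only because these weights sum to 1. A minimal sketch, assuming the county column holds the column names (as in the question), matches the weights by name and normalises by their total. The per-week step at the end is only one possible reading of "for each week", and the names w, wtmean and by_week are illustrative:

```r
week  <- c(1, 1, 1, 2, 2, 3)
col_a <- c(1, 2, 2, 4, 2, 7)
col_b <- c(4, 2, 3, 1, 2, 5)
col_c <- c(4, 2, 3, 2, 2, 4)
dfreprex <- data.frame(week, col_a, col_b, col_c)
weightsreprex <- data.frame(county = c("col_a", "col_b", "col_c"),
                            weights = c(.3721, .3794, .2485))

# Named weight vector, so columns are matched by name rather than position
w <- setNames(weightsreprex$weights, weightsreprex$county)

# Row-wise weighted mean; dividing by sum(w) keeps it a mean even when
# the weights do not sum to 1
dfreprex$wtmean <- as.vector(as.matrix(dfreprex[names(w)]) %*% w) / sum(w)

# Assumption: "for each week" means averaging the row values within a week
by_week <- aggregate(wtmean ~ week, data = dfreprex, FUN = mean)
```

Selecting the columns with names(w) means a reordered weights table still lines up with the right columns.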
You can use stats::weighted.mean() here:
library(tidyverse)
dfreprex <- structure(list(week = c(1, 1, 1, 2, 2, 3), col_a = c(1, 2, 2, 4, 2, 7), col_b = c(4, 2, 3, 1, 2, 5), col_c = c(4, 2, 3, 2, 2, 4)), class = "data.frame", row.names = c(NA, -6L))
weightsreprex <- data.frame(county = c("col_a", "col_b", "col_c"), weights = c(.3721, .3794, .2485))
dfreprex %>%
rowwise() %>%
mutate(wt_avg = weighted.mean(c(col_a, col_b, col_c), weightsreprex$weights))
#> # A tibble: 6 × 5
#> # Rowwise:
#> week col_a col_b col_c wt_avg
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 4 4 2.88
#> 2 1 2 2 2 2
#> 3 1 2 3 3 2.63
#> 4 2 4 1 2 2.36
#> 5 2 2 2 2 2
#> 6 3 7 5 4 5.50
Created on 2022-11-09 with reprex v2.0.2
My dataframe contains data about political careers, such as a unique identifier column (called: ui) for each politician and the electoral term (called: electoral_term) in which they were elected. Since a politician can be elected in multiple electoral terms, there are multiple rows that contain the same ui.
Now I would like to add another column to my dataframe that counts how many times the politician got re-elected.
So e.g. the politician with ui=1 was re-elected 2 times, since he occurred in 3 electoral_terms.
I already tried
df %>% count(ui)
But that only gives out a table which can't be added into my dataframe.
Thanks in advance!
We may use base R
df$reelected <- with(df, ave(ui, ui, FUN = length)-1)
-output
> df
ui electoral reelected
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
data
df <- structure(list(ui = c(1, 1, 1, 2, 3, 3), electoral = c(1, 2,
3, 2, 7, 9)), class = "data.frame", row.names = c(NA, -6L))
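The count above assumes one row per politician and term. If the data could contain repeated rows for the same term (an assumption, not something the question states), a sketch that counts distinct terms instead might look like this; n_terms and reelected_distinct are illustrative names:

```r
df <- data.frame(ui = c(1, 1, 1, 2, 3, 3),
                 electoral = c(1, 2, 3, 2, 7, 9))

# Number of distinct terms per politician, repeated on every row of the group
n_terms <- ave(df$electoral, df$ui, FUN = function(x) length(unique(x)))

# Re-elections = distinct terms minus the first election
df$reelected_distinct <- n_terms - 1
```

On the example data this agrees with the plain row count, because no term is repeated within a politician.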
mydf <- tibble::tribble(~ui, ~electoral, 1, 1, 1, 2, 1, 3, 2, 2, 3, 7, 3, 9)
library(dplyr)
df |>
add_count(ui, name = "re_elected") |>
mutate(re_elected = re_elected - 1)
# A tibble: 6 × 3
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
library(tidyverse)
df %>%
group_by(ui) %>%
mutate(re_elected = n() - 1)
# A tibble: 6 × 3
# Groups: ui [3]
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
I have a longitudinal data set in wide format, with > 2500 columns. Almost all columns begin with 'W1_' or 'W2_' to indicate the wave (ie, time point) of data collection. In the real data, there are > 2 waves. They look like this:
# Populate wide format data frame
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
wide
#> person W1_resp_sex W2_resp_sex W1_edu W2_q_2_1
#> 1 1 1 1 1 0
#> 2 2 2 2 2 1
#> 3 3 1 1 3 1
#> 4 4 2 2 4 0
I want to reshape from wide to long format so that the data look like this:
# Populate long data frame (this is how we want the wide data above to look after reshaping it)
person <- c(1, 1, 2, 2, 3, 3, 4, 4)
wave <- c(1, 2, 1, 2, 1, 2, 1, 2)
sex <- c(1, 1, 2, 2, 1, 1, 2, 2)
education <- c(1, NA, 2, NA, 3, NA, 4, NA)
q_2_1 <- c(NA, 0, NA, 1, NA, 1, NA, 0)
long_goal <- as.data.frame(cbind(person, wave, sex, education, q_2_1))
long_goal
#> person wave sex education q_2_1
#> 1 1 1 1 1 NA
#> 2 1 2 1 NA 0
#> 3 2 1 2 2 NA
#> 4 2 2 2 NA 1
#> 5 3 1 1 3 NA
#> 6 3 2 1 NA 1
#> 7 4 1 2 4 NA
#> 8 4 2 2 NA 0
To reshape the data, I tried pivot_longer(). How do I fix these issues?
(I prefer not to use data.table.)
The variables have different naming patterns (How can I correctly specify names_pattern() ?)
The multiple columns (see how all values are under the 'sex' column)
Creating a column with 'NA' when a variable was only collected in one wave (ie, if it was only collected in wave 2, I want a column with W1_varname in which all values are NA).
# Re-load wide format data
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
# Load package
pacman::p_load(tidyr)
# Reshape from wide to long
long <- wide %>%
pivot_longer(
cols = starts_with('W'),
names_to = 'Wave',
names_prefix = 'W',
names_pattern = '(.*)_',
values_to = 'sex',
values_drop_na = TRUE
)
long
#> # A tibble: 16 × 3
#> person Wave sex
#> <dbl> <chr> <dbl>
#> 1 1 1_resp 1
#> 2 1 2_resp 1
#> 3 1 1 1
#> 4 1 2_q_2 0
#> 5 2 1_resp 2
#> 6 2 2_resp 2
#> 7 2 1 2
#> 8 2 2_q_2 1
#> 9 3 1_resp 1
#> 10 3 2_resp 1
#> 11 3 1 3
#> 12 3 2_q_2 1
#> 13 4 1_resp 2
#> 14 4 2_resp 2
#> 15 4 1 4
#> 16 4 2_q_2 0
Created on 2022-09-19 by the reprex package (v2.0.1)
We could reshape to 'long' with pivot_longer, using names_pattern to capture substrings of the column names (the parenthesised groups) in the same order as the entries of names_to, i.e. the wave column gets the digits (\\d+) after the 'W', whereas .value (the names of the output columns) corresponds to the substring after the first _ in the column names. Then we could rename resp_sex and edu to sex and education.
library(dplyr)
library(tidyr)
pivot_longer(wide, cols = -person, names_to = c("wave", ".value"),
names_pattern = "^W(\\d+)_(.*)$") %>%
rename_with(~ c("sex", "education"), c("resp_sex", "edu"))
-output
# A tibble: 8 × 5
person wave sex education q_2_1
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 1 1 1 NA
2 1 2 1 NA 0
3 2 1 2 2 NA
4 2 2 2 NA 1
5 3 1 1 3 NA
6 3 2 1 NA 1
7 4 1 2 4 NA
8 4 2 2 NA 0
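To see what the two capture groups pick out, the same pattern can be applied to the column names with base R's sub(). This is only an illustration of the regular expression, not of what pivot_longer does internally; nms, wave and variable are illustrative names:

```r
nms <- c("W1_resp_sex", "W2_resp_sex", "W1_edu", "W2_q_2_1")
pattern <- "^W(\\d+)_(.*)$"

# First capture group: the wave number after the leading "W"
wave <- sub(pattern, "\\1", nms)

# Second capture group: everything after the first underscore
variable <- sub(pattern, "\\2", nms)

data.frame(column = nms, wave = wave, variable = variable)
```

Because \\d+ is followed by a literal underscore, the split happens at the first underscore, so q_2_1 stays in one piece.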
You want to reshape the variables that are measured in both waves. You can find them by tabulating the column names with the wave prefix stripped off.
v <- grep(names(which(table(substring(names(wide)[-1], 4)) == 2)), names(wide))
reshape2::melt(data=wide, id.vars=1, measure.vars=v)
# person variable value
# 1 1 W1_resp_sex 1
# 2 2 W1_resp_sex 2
# 3 3 W1_resp_sex 1
# 4 4 W1_resp_sex 2
# 5 1 W2_resp_sex 1
# 6 2 W2_resp_sex 2
# 7 3 W2_resp_sex 1
# 8 4 W2_resp_sex 2
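grep() takes a single pattern, so the lookup above relies on exactly one variable being measured in both waves. A sketch that also handles several such variables, under the assumption that every wave column is named W<number>_<variable> (wave_cols, stems and both are illustrative names):

```r
wide <- data.frame(person = c(1, 2, 3, 4),
                   W1_resp_sex = c(1, 2, 1, 2),
                   W2_resp_sex = c(1, 2, 1, 2),
                   W1_edu = c(1, 2, 3, 4),
                   W2_q_2_1 = c(0, 1, 1, 0))

# Columns that carry a wave prefix, and their names without it
wave_cols <- grep("^W\\d+_", names(wide), value = TRUE)
stems <- sub("^W\\d+_", "", wave_cols)

# Variables that occur in exactly two waves
n_waves <- table(stems)
both <- names(n_waves)[n_waves == 2]

# All wave columns belonging to those variables
v <- wave_cols[stems %in% both]
```

The resulting v holds column names rather than positions and could be passed as measure.vars in the melt() call.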
This is a follow-up to the question "How to add a row to a dataframe modifying only some columns".
After solving this question I wanted to apply the solution provided by stefan to a larger dataframe with group_by:
My dataframe:
df <- structure(list(test_id = c(1, 1, 1, 1, 1, 1, 1, 1), test_nr = c(1,
1, 1, 1, 2, 2, 2, 2), region = c("A", "B", "C", "D", "A", "B",
"C", "D"), test_value = c(3, 1, 1, 2, 4, 2, 4, 1)), class = "data.frame", row.names = c(NA,
-8L))
test_id test_nr region test_value
1 1 1 A 3
2 1 1 B 1
3 1 1 C 1
4 1 1 D 2
5 1 2 A 4
6 1 2 B 2
7 1 2 C 4
8 1 2 D 1
I now want to add a new row to each group with this code, which gives an error:
df %>%
group_by(test_nr) %>%
add_row(test_id = .$test_id[1], test_nr = .$test_nr[1], region = "mean", test_value = mean(.$test_value))
Error: Can't add rows to grouped data frames.
Run `rlang::last_error()` to see where the error occurred.
My expected output would be:
test_id test_nr region test_value
1 1 1 A 3.00
2 1 1 B 1.00
3 1 1 C 1.00
4 1 1 D 2.00
5 1 1 MEAN 1.75
6 1 2 A 4.00
7 1 2 B 2.00
8 1 2 C 4.00
9 1 2 D 1.00
10 1 2 MEAN 2.75
I have tried so far:
library(tidyverse)
df %>%
group_by(test_nr) %>%
group_split() %>%
map_dfr(~ .x %>%
add_row(!!! map(.[4], mean)))
test_id test_nr region test_value
<dbl> <dbl> <chr> <dbl>
1 1 1 A 3
2 1 1 B 1
3 1 1 C 1
4 1 1 D 2
5 NA NA NA 1.75
6 1 2 A 4
7 1 2 B 2
8 1 2 C 4
9 1 2 D 1
10 NA NA NA 2.75
How could I modify columns 1 to 3 to place my values there?
I actually recently made a little helper function for exactly this. The idea
is to use group_modify() to take the group data, and
bind_rows() the summary statistics calculated with summarise().
This is what it looks like in code:
add_summary_rows <- function(.data, ...) {
group_modify(.data, function(x, y) bind_rows(x, summarise(x, ...)))
}
And here’s how that would work with your data:
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(
test_id = c(1, 1, 1, 1, 1, 1, 1, 1),
test_nr = c(1, 1, 1, 1, 2, 2, 2, 2),
region = c("A", "B", "C", "D", "A", "B", "C", "D"),
test_value = c(3, 1, 1, 2, 4, 2, 4, 1)
)
df %>%
group_by(test_id, test_nr) %>%
add_summary_rows(
region = "MEAN",
test_value = mean(test_value)
)
#> # A tibble: 10 x 4
#> # Groups: test_id, test_nr [2]
#> test_id test_nr region test_value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 1 A 3
#> 2 1 1 B 1
#> 3 1 1 C 1
#> 4 1 1 D 2
#> 5 1 1 MEAN 1.75
#> 6 1 2 A 4
#> 7 1 2 B 2
#> 8 1 2 C 4
#> 9 1 2 D 1
#> 10 1 2 MEAN 2.75
You can combine your two approaches:
df %>%
split(~test_nr) %>%
map_dfr(~ .x %>%
add_row(test_id = .$test_id[1],
test_nr = .$test_nr[1],
region = "mean",
test_value = mean(.$test_value)))
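The same split, extend and recombine idea can be sketched without any packages. This is a base R variant under the assumption that one MEAN row per test_nr is wanted; add_mean_row and out are illustrative names:

```r
df <- data.frame(
  test_id = c(1, 1, 1, 1, 1, 1, 1, 1),
  test_nr = c(1, 1, 1, 1, 2, 2, 2, 2),
  region = c("A", "B", "C", "D", "A", "B", "C", "D"),
  test_value = c(3, 1, 1, 2, 4, 2, 4, 1)
)

# Append one summary row to the rows of a single group
add_mean_row <- function(g) {
  rbind(g, data.frame(test_id = g$test_id[1],
                      test_nr = g$test_nr[1],
                      region = "MEAN",
                      test_value = mean(g$test_value)))
}

# Split by test_nr, extend each piece, and stack the pieces again
out <- do.call(rbind, lapply(split(df, df$test_nr), add_mean_row))
rownames(out) <- NULL
```

Resetting the row names removes the group prefixes that rbind() would otherwise carry over from the split list.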
You could achieve your target with this Base R one-liner:
merge( df, aggregate( df, by = list( df$test_nr ), FUN = mean ), all = TRUE )[ , 1:4 ]
aggregate provides the rows you need, and merge inserts them into the right places of your dataframe. You don't need the last column of the combined dataframe, so use only the first four columns. The code produces some warnings for the region column, which can be disregarded. In the new rows, region is NA rather than showing the function name (MEAN).
Making it a little more generic:
f <- "mean"
df1 <- merge( df, aggregate( df, by = list( df$test_id, df$test_nr ),
FUN = f ), all = TRUE )[ , 1:4 ]
df1$region[ is.na( df1$region ) ] <- toupper( f )
Here, you aggregate also by test_id, you can change the function you are using in one place, and you have it printed in the region column:
> df1
test_id test_nr region test_value
1 1 1 A 3.00
2 1 1 B 1.00
3 1 1 C 1.00
4 1 1 D 2.00
5 1 1 MEAN 1.75
6 1 2 A 4.00
7 1 2 B 2.00
8 1 2 C 4.00
9 1 2 D 1.00
10 1 2 MEAN 2.75
I have the following list
example <- list(a = c(1, 2, 3),
b = c(2, 3),
c = c(3, 4, 5, 6))
that I'd like to transform into the following tibble
# A tibble: 9 × 2
name value
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 2
5 b 3
6 c 3
7 c 4
8 c 5
9 c 6
I've found multiple StackOverflow questions on this subject like here, here or here, but none addresses this particular case, where the name of the vector is not expected to become a column name.
I managed to achieve the desired result with a good old loop like below, but I'm looking for a faster and more elegant way.
library(dplyr)
example_list <- list(a = c(1, 2, 3),
b = c(2, 3),
c = c(3, 4, 5, 6))
example_tibble <- tibble()
for (i in 1:length(example_list)) {
example_tibble <- example_tibble %>%
bind_rows(as_tibble(example_list[[i]]) %>%
mutate(name = names(example_list)[[i]]))
}
example_tibble <- example_tibble %>%
relocate(name)
Try stack
> stack(example)
values ind
1 1 a
2 2 a
3 3 a
4 2 b
5 3 b
6 3 c
7 4 c
8 5 c
9 6 c
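stack() names its columns values and ind, and stores ind as a factor. If the name/value layout from the question is wanted without extra packages, one possible sketch builds the data frame directly (out is an illustrative name):

```r
example <- list(a = c(1, 2, 3),
                b = c(2, 3),
                c = c(3, 4, 5, 6))

out <- data.frame(
  # Repeat each list name once per element of its vector
  name = rep(names(example), lengths(example)),
  # Flatten the vectors in list order, dropping the generated names
  value = unlist(example, use.names = FALSE)
)
```

This gives a plain data frame; wrapping it in tibble::as_tibble() would produce the tibble shown in the question.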
example <- list(a = c(1, 2, 3),
b = c(2, 3),
c = c(3, 4, 5, 6))
library(tidyverse)
enframe(example) %>%
unnest(value)
#> # A tibble: 9 x 2
#> name value
#> <chr> <dbl>
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 2
#> 5 b 3
#> 6 c 3
#> 7 c 4
#> 8 c 5
#> 9 c 6
Created on 2021-11-04 by the reprex package (v2.0.1)
I got data like this
structure(list(id = c(1, 1, 1, 2, 2, 2), time = c(1, 2, 2, 5,
6, 6)), class = "data.frame", row.names = c(NA, -6L))
If, for the same id, the value in the next row is equal to the value in the previous row, I want to increase the value of the duplicate by 1. I want to get this
structure(list(id2 = c(1, 1, 1, 2, 2, 2), time2 = c(1, 2, 3,
5, 6, 7)), class = "data.frame", row.names = c(NA, -6L))
Using base R:
ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
# [1] 1 2 3 5 6 7
(This can be reassigned back into time.)
This deals with 2 or more duplicates. For example, if we append a copy of the 6th row,
df <- rbind(df, df[6,])
df$time2 <- ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
df
# id time time2
# 1 1 1 1
# 2 1 2 2
# 3 1 2 3
# 4 2 5 5
# 5 2 6 6
# 6 2 6 7
# 61 2 6 8
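Note that the call above groups by time alone, so equal times belonging to different ids would be treated as duplicates of each other. A sketch that groups by id and time together, assuming that is the intended reading of "for the same ID":

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                 time = c(1, 2, 2, 5, 6, 6))

# ave() accepts several grouping vectors; each id/time combination
# is handled as its own group
df$time2 <- ave(df$time, df$id, df$time,
                FUN = function(z) z + cumsum(duplicated(z)))
```

On the example data both versions give the same result, since no time value is shared between the two ids.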
You could use accumulate
library(tidyverse)
df %>%
group_by(id) %>%
mutate(time2 = accumulate(time, ~if(.x>=.y) .x + 1 else .y))
# A tibble: 6 x 3
# Groups: id [2]
id time time2
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 2
3 1 2 3
4 2 5 5
5 2 6 6
6 2 6 7
This works even if a value is repeated more than twice.
If the first data.frame is named df, this gives you what you need:
df$time[duplicated(df$id) & duplicated(df$time)] <- df$time[duplicated(df$id) & duplicated(df$time)] + 1
df
id time
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
It finds the rows where both id and time duplicate values seen in an earlier row, and adds 1 to time in those rows. Because it adds 1 only once, it handles a value that is repeated a single time.
You can use dplyr's mutate with cumsum(duplicated()):
df %>%
  group_by(id) %>%
  mutate(time = time + cumsum(duplicated(time))) %>%
  ungroup()
# A tibble: 6 x 2
id time
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7