This is a follow-up to the question "How to add a row to a dataframe modifying only some columns".
After solving that question, I wanted to apply the solution provided by stefan to a larger dataframe grouped with group_by.
My dataframe:
df <- structure(list(test_id = c(1, 1, 1, 1, 1, 1, 1, 1), test_nr = c(1,
1, 1, 1, 2, 2, 2, 2), region = c("A", "B", "C", "D", "A", "B",
"C", "D"), test_value = c(3, 1, 1, 2, 4, 2, 4, 1)), class = "data.frame", row.names = c(NA,
-8L))
test_id test_nr region test_value
1 1 1 A 3
2 1 1 B 1
3 1 1 C 1
4 1 1 D 2
5 1 2 A 4
6 1 2 B 2
7 1 2 C 4
8 1 2 D 1
I now want to add a new row to each group with this code, which gives an error:
df %>%
group_by(test_nr) %>%
add_row(test_id = .$test_id[1], test_nr = .$test_nr[1], region = "mean", test_value = mean(.$test_value))
Error: Can't add rows to grouped data frames.
Run `rlang::last_error()` to see where the error occurred.
My expected output would be:
test_id test_nr region test_value
1 1 1 A 3.00
2 1 1 B 1.00
3 1 1 C 1.00
4 1 1 D 2.00
5 1 1 MEAN 1.75
6 1 2 A 4.00
7 1 2 B 2.00
8 1 2 C 4.00
9 1 2 D 1.00
10 1 2 MEAN 2.75
I have tried so far:
library(tidyverse)
df %>%
group_by(test_nr) %>%
group_split() %>%
map_dfr(~ .x %>%
add_row(!!! map(.[4], mean)))
test_id test_nr region test_value
<dbl> <dbl> <chr> <dbl>
1 1 1 A 3
2 1 1 B 1
3 1 1 C 1
4 1 1 D 2
5 NA NA NA 1.75
6 1 2 A 4
7 1 2 B 2
8 1 2 C 4
9 1 2 D 1
10 NA NA NA 2.75
How could I modify column 1 to 3 to place my values there?
I actually recently made a little helper function for exactly this. The idea
is to use group_modify() to take the group data, and
bind_rows() the summary statistics calculated with summarise().
This is what it looks like in code:
add_summary_rows <- function(.data, ...) {
group_modify(.data, function(x, y) bind_rows(x, summarise(x, ...)))
}
And here’s how that would work with your data:
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(
test_id = c(1, 1, 1, 1, 1, 1, 1, 1),
test_nr = c(1, 1, 1, 1, 2, 2, 2, 2),
region = c("A", "B", "C", "D", "A", "B", "C", "D"),
test_value = c(3, 1, 1, 2, 4, 2, 4, 1)
)
df %>%
group_by(test_id, test_nr) %>%
add_summary_rows(
region = "MEAN",
test_value = mean(test_value)
)
#> # A tibble: 10 x 4
#> # Groups: test_id, test_nr [2]
#> test_id test_nr region test_value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 1 A 3
#> 2 1 1 B 1
#> 3 1 1 C 1
#> 4 1 1 D 2
#> 5 1 1 MEAN 1.75
#> 6 1 2 A 4
#> 7 1 2 B 2
#> 8 1 2 C 4
#> 9 1 2 D 1
#> 10 1 2 MEAN 2.75
You can combine your two approaches (this needs purrr and tibble loaded, and split() accepts a formula for data frames as of R 4.1.0):
df %>%
split(~test_nr) %>%
map_dfr(~ .x %>%
add_row(test_id = .$test_id[1],
test_nr = .$test_nr[1],
region = "mean",
test_value = mean(.$test_value)))
You could achieve your target with this Base R one-liner:
merge( df, aggregate( df, by = list( df$test_nr ), FUN = mean ), all = TRUE )[ , 1:4 ]
aggregate provides the rows you need, and merge inserts them into the right places of your dataframe. You don't need the last column of the combined dataframe, so keep only the first four columns. The code produces some warnings for the region column, which can be disregarded. In the added rows, region is NA rather than a label such as MEAN.
Making it a little more generic:
f <- "mean"
df1 <- merge( df, aggregate( df, by = list( df$test_id, df$test_nr ),
FUN = f ), all = TRUE )[ , 1:4 ]
df1$region[ is.na( df1$region ) ] <- toupper( f )
Here you also aggregate by test_id, you can change the function in one place, and its name is printed in the region column:
> df1
test_id test_nr region test_value
1 1 1 A 3.00
2 1 1 B 1.00
3 1 1 C 1.00
4 1 1 D 2.00
5 1 1 MEAN 1.75
6 1 2 A 4.00
7 1 2 B 2.00
8 1 2 C 4.00
9 1 2 D 1.00
10 1 2 MEAN 2.75
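A variant of the same aggregate-and-combine idea, sketched under the assumption that each MEAN row should simply follow the rows of its group. The formula interface of aggregate() only touches test_value, so no warnings arise for region. The names means and out are illustrative and not part of the answer above.

```r
df <- data.frame(
  test_id    = c(1, 1, 1, 1, 1, 1, 1, 1),
  test_nr    = c(1, 1, 1, 1, 2, 2, 2, 2),
  region     = c("A", "B", "C", "D", "A", "B", "C", "D"),
  test_value = c(3, 1, 1, 2, 4, 2, 4, 1)
)

# One mean per (test_id, test_nr) group; region is not aggregated at all
means <- aggregate(test_value ~ test_id + test_nr, data = df, FUN = mean)
means$region <- "MEAN"

# Append the summary rows, then sort by group. order() leaves ties in their
# original order, so each MEAN row lands after the rows of its group.
out <- rbind(df, means[names(df)])
out <- out[order(out$test_id, out$test_nr), ]
rownames(out) <- NULL
out
```

Because the label is assigned explicitly, switching to another summary function only requires changing FUN and the string stored in region.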
I am trying to create a weighted average for each week, across multiple columns. My data looks like this:
week <- c(1,1,1,2,2,3)
col_a <- c(1,2,2,4,2,7)
col_b <- c(4,2,3,1,2,5)
col_c <- c(4,2,3,2,2,4)
dfreprex <- data.frame(week,col_a,col_b,col_c)
week col_a col_b col_c
1 1 1 4 4
2 1 2 2 2
3 1 2 3 3
4 2 4 1 2
5 2 2 2 2
6 3 7 5 4
weightsreprex <- data.frame(county = c("col_a", "col_b", "col_c")
, weights = c(.3721, .3794, .2485))
How do I weight each column and then get the mean? Is there a simpler way than just multiplying each column by its weight in a new column (col_a_weighted) and then taking the rowmean of the weighted columns only?
I tried weighted.mean, rowMeans, group_by and summarise.
We may use %*% for matrix multiplication:
dfreprex$wtmean <- as.matrix(dfreprex[,-1]) %*% as.matrix(weightsreprex[, 2])
dfreprex
week col_a col_b col_c wtmean
1 1 1 4 4 2.8837
2 1 2 2 2 2.0000
3 1 2 3 3 2.6279
4 2 4 1 2 2.3648
5 2 2 2 2 2.0000
6 3 7 5 4 5.4957
We might also use crossprod:
crossprod(t(as.matrix(dfreprex[,-1])), as.matrix(weightsreprex[, 2]))
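The matrix product above relies on the columns of dfreprex being in the same order as the rows of weightsreprex. A minimal sketch that matches weights to columns by name instead, and then averages within each week, is shown below. Treating "for each week" as the mean of the row-wise values within a week is an assumption, and the names w and weekly are illustrative.

```r
dfreprex <- data.frame(
  week  = c(1, 1, 1, 2, 2, 3),
  col_a = c(1, 2, 2, 4, 2, 7),
  col_b = c(4, 2, 3, 1, 2, 5),
  col_c = c(4, 2, 3, 2, 2, 4)
)
weightsreprex <- data.frame(
  county  = c("col_a", "col_b", "col_c"),
  weights = c(.3721, .3794, .2485)
)

# Named weight vector, so columns are picked by name rather than position
w <- setNames(weightsreprex$weights, weightsreprex$county)

# Row-wise weighted mean; dividing by sum(w) keeps it a mean even if the
# weights do not sum to exactly 1
dfreprex$wtmean <- drop(as.matrix(dfreprex[names(w)]) %*% w) / sum(w)

# Assumption: "for each week" means averaging the row values within a week
weekly <- aggregate(wtmean ~ week, data = dfreprex, FUN = mean)
weekly
```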
You can use stats::weighted.mean() here:
library(tidyverse)
dfreprex <- structure(list(week = c(1, 1, 1, 2, 2, 3), col_a = c(1, 2, 2, 4, 2, 7), col_b = c(4, 2, 3, 1, 2, 5), col_c = c(4, 2, 3, 2, 2, 4)), class = "data.frame", row.names = c(NA, -6L))
weightsreprex <- data.frame(county = c("col_a", "col_b", "col_c"), weights = c(.3721, .3794, .2485))
dfreprex %>%
rowwise() %>%
mutate(wt_avg = weighted.mean(c(col_a, col_b, col_c), weightsreprex$weights))
#> # A tibble: 6 × 5
#> # Rowwise:
#> week col_a col_b col_c wt_avg
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 4 4 2.88
#> 2 1 2 2 2 2
#> 3 1 2 3 3 2.63
#> 4 2 4 1 2 2.36
#> 5 2 2 2 2 2
#> 6 3 7 5 4 5.50
Created on 2022-11-09 with reprex v2.0.2
I have a longitudinal data set in wide format, with > 2500 columns. Almost all columns begin with 'W1_' or 'W2_' to indicate the wave (ie, time point) of data collection. In the real data, there are > 2 waves. They look like this:
# Populate wide format data frame
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
wide
#> person W1_resp_sex W2_resp_sex W1_edu W2_q_2_1
#> 1 1 1 1 1 0
#> 2 2 2 2 2 1
#> 3 3 1 1 3 1
#> 4 4 2 2 4 0
I want to reshape from wide to long format so that the data look like this:
# Populate long data frame (this is how we want the wide data above to look after reshaping it)
person <- c(1, 1, 2, 2, 3, 3, 4, 4)
wave <- c(1, 2, 1, 2, 1, 2, 1, 2)
sex <- c(1, 1, 2, 2, 1, 1, 2, 2)
education <- c(1, NA, 2, NA, 3, NA, 4, NA)
q_2_1 <- c(NA, 0, NA, 1, NA, 1, NA, 0)
long_goal <- as.data.frame(cbind(person, wave, sex, education, q_2_1))
long_goal
#> person wave sex education q_2_1
#> 1 1 1 1 1 NA
#> 2 1 2 1 NA 0
#> 3 2 1 2 2 NA
#> 4 2 2 2 NA 1
#> 5 3 1 1 3 NA
#> 6 3 2 1 NA 1
#> 7 4 1 2 4 NA
#> 8 4 2 2 NA 0
To reshape the data, I tried pivot_longer() (I prefer not to use data.table). How do I fix these issues?
1. The variables have different naming patterns. How can I correctly specify names_pattern?
2. The values of multiple columns all end up under the 'sex' column.
3. Creating a column with NA when a variable was only collected in one wave (ie, if it was only collected in wave 2, I want a column with W1_varname in which all values are NA).
# Re-load wide format data
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
# Load package
pacman::p_load(tidyr)
# Reshape from wide to long
long <- wide %>%
pivot_longer(
cols = starts_with('W'),
names_to = 'Wave',
names_prefix = 'W',
names_pattern = '(.*)_',
values_to = 'sex',
values_drop_na = TRUE
)
long
#> # A tibble: 16 × 3
#> person Wave sex
#> <dbl> <chr> <dbl>
#> 1 1 1_resp 1
#> 2 1 2_resp 1
#> 3 1 1 1
#> 4 1 2_q_2 0
#> 5 2 1_resp 2
#> 6 2 2_resp 2
#> 7 2 1 2
#> 8 2 2_q_2 1
#> 9 3 1_resp 1
#> 10 3 2_resp 1
#> 11 3 1 3
#> 12 3 2_q_2 1
#> 13 4 1_resp 2
#> 14 4 2_resp 2
#> 15 4 1 4
#> 16 4 2_q_2 0
Created on 2022-09-19 by the reprex package (v2.0.1)
We could reshape to 'long' with pivot_longer, specifying names_pattern to capture substrings of the column names (the groups in parentheses) in the same order as names_to, i.e. the wave column gets the digits (\\d+) after the 'W', whereas .value (the names of the value columns) corresponds to the substring after the first _ in the column names. Then we could rename the resp_sex and edu columns:
library(dplyr)
library(tidyr)
pivot_longer(wide, cols = -person, names_to = c("wave", ".value"),
names_pattern = "^W(\\d+)_(.*)$") %>%
rename_with(~ c("sex", "education"), c("resp_sex", "edu"))
-output
# A tibble: 8 × 5
person wave sex education q_2_1
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 1 1 1 NA
2 1 2 1 NA 0
3 2 1 2 2 NA
4 2 2 2 NA 1
5 3 1 1 3 NA
6 3 2 1 NA 1
7 4 1 2 4 NA
8 4 2 2 NA 0
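To see how that names_pattern splits each column name, here is a small base R sketch that applies the same regular expression with sub(). It only illustrates the two capture groups and does not reshape anything; the data frame parts and its column names are illustrative.

```r
nms <- c("W1_resp_sex", "W2_resp_sex", "W1_edu", "W2_q_2_1")
pattern <- "^W(\\d+)_(.*)$"

parts <- data.frame(
  column     = nms,
  wave       = sub(pattern, "\\1", nms),  # first group: digits after "W"
  value_name = sub(pattern, "\\2", nms)   # second group: text after the first "_"
)
parts
```

The first group feeds the wave column and the second group becomes the output column names, which is why a variable measured in only one wave gets NA for the other wave.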
You want to reshape the variables that are measured in both waves. You may find them by tabulating the column names with the wave prefix stripped.
v <- grep(names(which(table(substring(names(wide)[-1], 4)) == 2)), names(wide))
reshape2::melt(data=wide, id.vars=1, measure.vars=v)
# person variable value
# 1 1 W1_resp_sex 1
# 2 2 W1_resp_sex 2
# 3 3 W1_resp_sex 1
# 4 4 W1_resp_sex 2
# 5 1 W2_resp_sex 1
# 6 2 W2_resp_sex 2
# 7 3 W2_resp_sex 1
# 8 4 W2_resp_sex 2
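grep() uses only the first element of a pattern vector, so the selection above covers the case where exactly one variable appears in both waves. A sketch that would also handle several shared variables, assuming every wave column starts with "W", a wave number and "_", could use %in% instead. The names wave_cols, stems and in_both are illustrative.

```r
wide <- data.frame(
  person      = c(1, 2, 3, 4),
  W1_resp_sex = c(1, 2, 1, 2),
  W2_resp_sex = c(1, 2, 1, 2),
  W1_edu      = c(1, 2, 3, 4),
  W2_q_2_1    = c(0, 1, 1, 0)
)

wave_cols <- names(wide)[-1]                 # everything except the id column
stems     <- sub("^W\\d+_", "", wave_cols)   # strip the wave prefix
in_both   <- names(which(table(stems) == 2)) # stems seen in both waves

# Column names (not indices) of all variables measured in both waves
v <- wave_cols[stems %in% in_both]
v
```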
I have a large list of dataframes like the following:
> head(lst)
$Set1
ID Value
1 A 1
2 B 1
3 C 1
$Set2
ID Value
1 A 1
2 D 1
3 E 1
$Set3
ID Value
1 B 1
2 C 1
I would like to change the name of the column "Value" in each dataframe to be similar to the name of the dataframe, so that the list of dataframes looks like this:
> head(lst)
$Set1
ID Set1
1 A 1
2 B 1
3 C 1
$Set2
ID Set2
1 A 1
2 D 1
3 E 1
$Set3
ID Set3
1 B 1
2 C 1
Can anyone think of a function that takes the name of each dataframe in the list and names the column accordingly? My original list has >400 dataframes, so I was hoping to automate this somehow. Sorry if this is a naive question, but I'm somehow stuck...
Thanks so much!
Here is an example of a list of dfs:
lst <- list(
data.frame(ID = c("A", "B", "C"), Value = c(1, 1, 1)),
data.frame(ID = c("A", "D", "E"), Value = c(1, 1, 1)),
data.frame(ID = c("B", "C"), Value = c(1, 1)),
data.frame(ID = c("B", "C"), Value = c(1, 1)),
data.frame(ID = c("B", "C"), Value = c(1, 1)),
data.frame(ID = c("B", "C"), Value = c(1, 1)))
lst_names <- c("Set1", "Set2", "Set3", "Set4", "Set5","Set6")
names(lst) <- lst_names
In the tidyverse we can use purrr::imap and dplyr::rename:
library(purrr)
library(dplyr)
lst %>%
imap(~ rename(.x, "{.y}" := Value))
#> $Set1
#> ID Set1
#> 1 A 1
#> 2 B 1
#> 3 C 1
#>
#> $Set2
#> ID Set2
#> 1 A 1
#> 2 D 1
#> 3 E 1
#>
#> $Set3
#> ID Set3
#> 1 B 1
#> 2 C 1
#>
#> $Set4
#> ID Set4
#> 1 B 1
#> 2 C 1
#>
#> $Set5
#> ID Set5
#> 1 B 1
#> 2 C 1
#>
#> $Set6
#> ID Set6
#> 1 B 1
#> 2 C 1
Created on 2022-03-28 by the reprex package (v2.0.1)
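We can do this in base R, keeping the name of the first column and restoring the list names afterwards:
setNames(
lapply(
names(lst),
function(x) setNames(lst[[x]], c(names(lst[[x]])[1], x))
),
names(lst)
)
$Set1
ID Set1
1 A 1
2 B 1
3 C 1
$Set2
ID Set2
1 A 1
2 D 1
3 E 1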
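Another base R sketch of the same idea uses Map(), which walks over the data frames and their names in parallel and keeps the list names. The helper rename_value is hypothetical, and only the first three sets of the example list are reproduced here.

```r
lst <- list(
  Set1 = data.frame(ID = c("A", "B", "C"), Value = c(1, 1, 1)),
  Set2 = data.frame(ID = c("A", "D", "E"), Value = c(1, 1, 1)),
  Set3 = data.frame(ID = c("B", "C"), Value = c(1, 1))
)

# Rename only the column called "Value"; other columns are left untouched
rename_value <- function(d, nm) {
  names(d)[names(d) == "Value"] <- nm
  d
}

renamed <- Map(rename_value, lst, names(lst))
renamed$Set2
```

Selecting the column by name rather than by position means the sketch still works if "Value" is not the second column.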
I'd like to remove the "(N)" from the column names.
Example data:
df <- tibble(
name = c("A", "B", "C", "D"),
`id (N)` = c(1, 2, 3, 4),
`Number (N)` = c(3, 1, 2, 8)
)
I got this far, but can't figure out the rest of the regex:
df %>%
rename_with(stringr::str_replace,
pattern = "[//(],N//)]", replacement = "")
But the N from "Number (N)" is gone.
name id N) umber (N)
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8
One liner: rename_with(df, ~str_remove_all(., ' \\(N\\)'))
or dplyr only: rename_with(df, ~sub(' \\(N\\)', '', .))
We could use the rename_with function from the dplyr package and apply a function to the names (in this case str_remove_all from the stringr package).
We then use \\ to escape the parentheses:
library(dplyr)
library(stringr)
df %>%
rename_with(~str_remove_all(., ' \\(N\\)'))
name id Number
<chr> <dbl> <dbl>
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8
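To illustrate why escaping matters, the base R sketch below contrasts a character class, which removes the first single character that is one of "(", "N" or ")", with an escaped pattern that matches the literal text " (N)". The class "[(N)]" is a simplified stand-in for the pattern in the question, not a copy of it.

```r
nms <- c("name", "id (N)", "Number (N)")

# Character class: drops the first "(", "N" or ")" it finds in each name
class_result <- sub("[(N)]", "", nms)

# Escaped parentheses: drops the literal substring " (N)"
escaped_result <- sub(" \\(N\\)", "", nms)

# fixed = TRUE treats the pattern as plain text, so no escaping is needed
fixed_result <- sub(" (N)", "", nms, fixed = TRUE)

class_result
escaped_result
```

The character class turns "Number (N)" into "umber (N)", which matches the symptom described in the question.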
A possible solution:
library(tidyverse)
df <- tibble(
name = c("A", "B", "C", "D"),
`id (N)` = c(1, 2, 3, 4),
`Number (N)` = c(3, 1, 2, 8)
)
df %>% names %>% str_remove("\\s*\\(N\\)\\s*") %>% set_names(df,.)
#> # A tibble: 4 × 3
#> name id Number
#> <chr> <dbl> <dbl>
#> 1 A 1 3
#> 2 B 2 1
#> 3 C 3 2
#> 4 D 4 8
Perhaps you can try
setNames(df, gsub("\\s\\(.*\\)", "", names(df)))
which gives
name id Number
<chr> <dbl> <dbl>
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8
A simple solution is
colnames(df) <- gsub(" \\(N\\)", "", colnames(df))
| id | msgid | source | value |
|----|-------|--------|-------|
| 1 | 1 | B | 0 |
| 1 | 2 | A | 1 |
| 1 | 3 | B | 0 |
| 2 | 1 | B | 0 |
| 2 | 2 | A | 0 |
| 2 | 3 | A | 1 |
| 2 | 4 | B | 0 |
In the above snippet, I want to create the column value from the other columns. id identifies a conversation and msgid is the message number within each conversation.
I wish to identify the row number for the last message that came from source=A.
I made an attempt to solve it. However, I was able to identify only the last row within a conversation.
last_values <- dat %>% group_by(id) %>%
slice(which.max(msgid)) %>%
ungroup %>%
mutate(value = cumsum(msgid))
dat$final_val <- 0
dat[last_values$value,5] <- 1
We can create the column 'value' by
dat %>%
group_by(id) %>%
mutate(value1 = as.integer(source == "A" & !duplicated(source == "A", fromLast = TRUE)))
# A tibble: 7 x 5
# Groups: id [2]
# id msgid source value value1
# <int> <int> <chr> <int> <int>
#1 1 1 B 0 0
#2 1 2 A 1 1
#3 1 3 B 0 0
#4 2 1 B 0 0
#5 2 2 A 0 0
#6 2 3 A 1 1
#7 2 4 B 0 0
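The same logic can be sketched in base R with ave(), which applies a function within each id. Inside a conversation, duplicated(..., fromLast = TRUE) is FALSE only for the last occurrence of each value, so combining it with source == "A" flags the last message from A. The data frame is rebuilt from the table in the question.

```r
dat <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 2),
  msgid  = c(1, 2, 3, 1, 2, 3, 4),
  source = c("B", "A", "B", "B", "A", "A", "B")
)

# TRUE only for the last row within each id where source is "A"
last_a <- ave(
  dat$source == "A",
  dat$id,
  FUN = function(is_a) is_a & !duplicated(is_a, fromLast = TRUE)
)

dat$value <- as.integer(last_a)
dat
```

A conversation without any message from A gets 0 in every row, since is_a is FALSE throughout.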
Another dplyr solution:
library(dplyr)
# create data
df <- data.frame(
id = c(1, 1, 1, 2, 2, 2, 2),
msgid = c(1, 2, 3, 1, 2, 3, 4),
source = c("B", "A", "B", "B", "A", "A", "B")
)
df <- df %>%
group_by(id, source) %>% # group by id and source
mutate(value = as.integer(ifelse((row_number() == n()) & source == "A", 1, 0))) # 1 if this is the last occurrence within its (id, source) group and the source is "A"
> df
# A tibble: 7 x 4
# Groups: id, source [4]
id msgid source value
<dbl> <dbl> <fctr> <dbl>
1 1 1 B 0
2 1 2 A 1
3 1 3 B 0
4 2 1 B 0
5 2 2 A 0
6 2 3 A 1
7 2 4 B 0
I came up with the following solution
library(tidyverse)
# first we create the dataframe as it wasn't supplied in the question
df <- tibble(
id = c(1, 1, 1, 2, 2, 2, 2),
msgid = c(1, 2, 3, 1, 2, 3, 4),
source = c("B", "A", "B", "B", "A", "A", "B")
)
df %>%
# group by both id and source
group_by(id, source) %>%
mutate(
# create a new column
value = max(msgid) == msgid & source == "A",
# convert the new column to integers
value = as.integer(value)
)
Output:
# A tibble: 7 x 4
# Groups: id, source [4]
id msgid source value
<dbl> <dbl> <chr> <int>
1 1 1 B 0
2 1 2 A 1
3 1 3 B 0
4 2 1 B 0
5 2 2 A 0
6 2 3 A 1
7 2 4 B 0
I used index flagging for finding the final position of A and checked if that number matches with row number in order to assign 1 to value.
library(dplyr)
mydf <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2),
msgid = c(1, 2, 3, 1, 2, 3, 4),
source = c("B", "A", "B", "B", "A", "A", "B"))
group_by(mydf, id) %>%
mutate(value = if_else(last(grep(source, pattern = "A")) == row_number(),
1, 0))
id msgid source value
<dbl> <dbl> <fctr> <dbl>
1 1.00 1.00 B 0
2 1.00 2.00 A 1.00
3 1.00 3.00 B 0
4 2.00 1.00 B 0
5 2.00 2.00 A 0
6 2.00 3.00 A 1.00
7 2.00 4.00 B 0