This is not a very good title for the question. I want to sum across certain columns in a data frame for each group, excluding one column for each of my groups. A simple example is as follows:
df <- tibble(group_name = c("A", "B","C"), mean_A = c(1,2,3), mean_B = c(2,3,4), mean_C=c(3,4,5))
df %>% group_by(group_name) %>% mutate(m1 = sum(across(contains("mean"))))
This creates column m1, which is the sum across mean_A, mean_B and mean_C for each group. What I want to do is exclude mean_A for group A, mean_B for group B and mean_C for group C. The following does not work, though (not surprisingly).
df %>% group_by(group_name) %>% mutate(m1 = sum(across(c(contains("mean") & !contains(group_name)))))
Do you have an idea how I could do this? My original data contains many more groups, so it would be hard to do by hand.
Edit: I have tried the following, which solves it in a rudimentary fashion, but something (grepl, maybe) does not seem to work well here and I get the wrong result.
df %>% pivot_longer(!group_name) %>% mutate(value2 = case_when(grepl(group_name, name) ~ 0, TRUE ~ value)) %>% group_by(group_name) %>% summarise(m1 = sum(value2))
Edit 2: I found out what was wrong with the above. The version below works but still produces a lot of warnings, so I recommend following TarJae's response below.
df %>% pivot_longer(!group_name) %>% group_by(group_name) %>% mutate(value2 = case_when(grepl(group_name, name) ~ 0, TRUE ~ value)) %>% group_by(group_name) %>% summarise(m1 = sum(value2))
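The warnings most likely come from grepl(), which expects a single pattern rather than a whole column. A minimal sketch of the same long-format idea that avoids grepl() altogether, assuming every column to exclude is named exactly "mean_" followed by the group name (that naming rule is an assumption taken from the example data):

```r
library(dplyr)
library(tidyr)

df <- tibble(group_name = c("A", "B", "C"),
             mean_A = c(1, 2, 3), mean_B = c(2, 3, 4), mean_C = c(3, 4, 5))

res <- df %>%
  # one row per (group_name, column) pair
  pivot_longer(starts_with("mean_")) %>%
  # drop the column that belongs to the row's own group (vectorised comparison)
  filter(name != paste0("mean_", group_name)) %>%
  group_by(group_name) %>%
  summarise(m1 = sum(value), .groups = "drop")

res$m1
# 5 6 7
```

Because the comparison is a plain vectorised !=, no per-group pattern matching is needed.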
Here is another option where you can just use group_name directly with the tidyselect helpers:
df %>%
  rowwise() %>%
  mutate(m1 = rowSums(select(across(starts_with("mean")), -ends_with(group_name)))) %>%
  ungroup()
Output
  group_name mean_A mean_B mean_C    m1
  <chr>       <dbl>  <dbl>  <dbl> <dbl>
1 A               1      2      3     5
2 B               2      3      4     6
3 C               3      4      5     7
How it works
The row-wise output of across is a 1-row tibble containing only the variables that start with "mean".
select unselects the subset of the variables from output by across that end with the value from group_name.
At this point you are left with a 1 x 2 tibble, which is then summed using rowSums.
Here is one way how we could do it:
We create a helper column to match column names
We set the value of a mean column to zero if its column name matches the helper name.
Then we use transmute with select to calculate rowSums
Finally we cbind column m1 to df:
library(dplyr)
df %>%
  mutate(helper = paste0("mean_", group_name)) %>%
  mutate(across(starts_with("mean"), ~ ifelse(cur_column() == helper, 0, .))) %>%
  transmute(m1 = select(., contains("mean")) %>%
              rowSums()) %>%
  cbind(df)
  m1 group_name mean_A mean_B mean_C
1  5          A      1      2      3
2  6          B      2      3      4
3  7          C      3      4      5
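The "ignore the column that matches the group" idea can also be sketched in base R with matrix indexing: take the full row sum and subtract the one cell that belongs to the row's own group. This is only a sketch under the assumption that every group_name has a matching "mean_<group>" column; the names mean_cols, m and own are introduced here for illustration.

```r
df <- data.frame(group_name = c("A", "B", "C"),
                 mean_A = c(1, 2, 3), mean_B = c(2, 3, 4), mean_C = c(3, 4, 5))

mean_cols <- grep("^mean_", names(df), value = TRUE)
m <- as.matrix(df[mean_cols])

# position of each row's own column among the mean columns
own <- match(paste0("mean_", df$group_name), mean_cols)

# full row sum minus the cell at (row i, own[i])
df$m1 <- rowSums(m) - m[cbind(seq_len(nrow(m)), own)]

df$m1
# 5 6 7
```

If a group had no matching column, match() would return NA and the corresponding m1 would be NA, so that case would need separate handling.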
Related
Background
I've got this dataset d:
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
It's got 2 people (IDs) in it, and they each have some events.
The problem
I'm trying to get an average number (count) of events per person, along with a standard deviation for that average, all in one result (it can be a dataframe or not, doesn't matter).
In other words I'm looking for something like this:
| Mean | SD |
|------|------|
| 4.00 | 2.83 |
What I've tried
I'm not far off, I don't think -- it's just that I've got 2 separate pieces of code doing these calculations. Here's the mean:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event))
# A tibble: 1 x 1
ratio
<dbl>
1 4
And here's the SD:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(sd = sd(event))
# A tibble: 1 x 1
sd
<dbl>
1 2.83
But when I try to pipe them together like so...
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event)) %>%
summarise(sd = sd(event))
... I get an error:
Error in `h()`:
! Problem with `summarise()` column `sd`.
i `sd = sd(event)`.
x object 'event' not found
Any insight?
You have to put the last two calls to summarise() in the same call. The only remaining columns after summarise() will be those you named and the grouping columns, so after your second summarise, the event column no longer exists.
library(dplyr)
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
d %>%
group_by(ID) %>%
# the next summarise will be within ID
summarise(event = length(event)) %>%
# this summarise is overall
summarise(sd = sd(event),
ratio = mean(event))
#> # A tibble: 1 × 2
#> sd ratio
#> <dbl> <dbl>
#> 1 2.83 4
The code is a bit confusing because you are renaming the event variable, and doing the first summarise() within groups and the second without grouping. This code would be a little easier to read and get the same result:
d %>%
count(ID) %>%
summarise(sd = sd(n),
ratio = mean(n))
Created on 2022-05-25 by the reprex package (v2.0.1)
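To see why the chained version in the question fails, it can help to inspect which columns survive each summarise(). A short sketch using the same data (the object names per_id, after_mean and both are only illustrative):

```r
library(dplyr)

d <- data.frame(ID = c("a", "a", "a", "a", "a", "a", "b", "b"),
                event = c("G12", "R2", "O99", "B4", "B4", "A24", "L5", "J15"),
                stringsAsFactors = FALSE)

per_id <- count(d, ID)                      # columns: ID, n

# After this call only the named summary column is left, so a further
# summarise() has no per-ID counts to work with any more
after_mean <- summarise(per_id, ratio = mean(n))
names(after_mean)
# "ratio"

# Computing both statistics in one call keeps access to the counts
both <- summarise(per_id, ratio = mean(n), sd = sd(n))
```

Here both holds the mean and the standard deviation in a single one-row result.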
I would like to calculate duration of state using rle() on grouped data. Here is test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length for state == 1, grouped by ID. I have created a function inspired by: https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/
to calculate the rle average portion:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
  ID count avg_length
1 L0     2          2
2 L1     2          2
L0 is correct, L1 has no instances of state == 1 so the average should be zero or NA.
I isolated the problem in terms of breaking it down into just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent summarize_at() for do()? Or is there another fix? Thanks
As the function returns a data.frame, we can store it in a list column and unnest it afterwards
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or remove the list and unpack
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do() code, the column should be extracted not from the whole data (DF$state) but from the grouped data coming from the lhs, i.e. .$state (note that do() is superseded, so it may be better to use summarise() with unnest/unpack):
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
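To check these numbers by hand, here is a small sketch of what rle() returns for each ID's state sequence, with the values copied from the test data in the question (the vector names are only illustrative):

```r
# state values for ID L0 and ID L1, in Time order
state_L0 <- c(0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 0)
state_L1 <- c(0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0)

r0 <- rle(state_L0)
r0$values    # 0 1 0 2 1 0
r0$lengths   # 3 3 2 1 1 1

# runs where state == 1: one of length 3 and one of length 1
runs_L0 <- r0$lengths[r0$values == 1]
length(runs_L0)  # 2
mean(runs_L0)    # 2

# L1 never enters state 1, so the mean of zero runs is NaN
r1 <- rle(state_L1)
runs_L1 <- r1$lengths[r1$values == 1]
length(runs_L1)  # 0
mean(runs_L1)    # NaN
```

This is also why passing DF$state inside do() gave identical rows for both IDs: the whole column was used each time instead of the group's own values.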
I have a data frame similar to this one.
df <- data.frame(id=c(1,2,3), tot_1=runif(3, 0, 100), tot_2=runif(3, 0, 100), tot_3=runif(3, 0, 100), tot_4=runif(3, 0, 100))
I want to select or make an operation only with those with suffixes lower than 3.
#select
df <- df %>% select(id, tot_1, tot_2)
#or sum
df <- df %>% mutate(sumVar = rowSums(across(c(tot_1, tot_2))))
However, in my real data, there are many more variables and not in order. So how could I select them without doing it manually?
We may use matches
df %>%
mutate(sumVar = rowSums(across(matches('tot_[1-2]$'))))
If we need to be more flexible, we can extract the numeric part from the column names that contain 'tot', subset the names based on the condition, and use the resulting names
library(stringr)
nm1 <- str_subset(names(df), 'tot')
nm2 <- nm1[readr::parse_number(nm1) <3]
df %>%
mutate(sumVar = rowSums(across(all_of(nm2))))
Solution with num_range
This is one of the rare cases for the often forgotten num_range() selection helper (available through dplyr), which matches a prefix followed by a number from a given range in a single step:
Determine the threshold
suffix_threshold <- 3
select()
library(dplyr)
df %>% select(id, num_range(prefix='tot_',
range=seq_len(suffix_threshold-1)))
id tot_1 tot_2
1 1 26.75082 26.89506
2 2 21.86453 18.11683
3 3 51.67968 51.85761
mutate() with rowSums()
library(dplyr)
df %>% mutate(sumVar = across(num_range(prefix='tot_', range=seq_len(suffix_threshold-1)))%>%
rowSums)
id tot_1 tot_2 tot_3 tot_4 sumVar
1 1 26.75082 26.89506 56.27829 71.79353 53.64588
2 2 21.86453 18.11683 12.91569 96.14099 39.98136
3 3 51.67968 51.85761 25.63676 10.01408 103.53730
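Since the question mentions that the real columns are not in order, here is a small reproducible sketch with shuffled columns and fixed values instead of runif(); the data are made up for illustration only:

```r
library(dplyr)

# hypothetical data: columns deliberately out of order
df <- data.frame(id = 1:3,
                 tot_4 = c(100, 100, 100),
                 tot_1 = c(10, 20, 30),
                 tot_3 = c(7, 8, 9),
                 tot_2 = c(1, 2, 3))

suffix_threshold <- 3

res <- df %>%
  # num_range() selects tot_1 and tot_2 by name, wherever they sit in the data
  mutate(sumVar = rowSums(across(num_range("tot_", seq_len(suffix_threshold - 1)))))

res$sumVar
# 11 22 33
```

The selection depends only on the column names, so reordering the columns does not change the result.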
Here is a base R way -
cols <- grep('tot_', names(df), value = TRUE)
#Select
df[c('id', cols[as.numeric(sub('tot_', '',cols)) < 3])]
# id tot_1 tot_2
#1 1 75.409112 30.59338
#2 2 9.613496 44.96151
#3 3 58.589574 64.90672
#Rowsums
df$sumVar <- rowSums(df[cols[as.numeric(sub('tot_', '',cols)) < 3]])
df
# id tot_1 tot_2 tot_3 tot_4 sumVar
#1 1 75.409112 30.59338 59.82815 50.495758 106.00250
#2 2 9.613496 44.96151 84.19916 2.189482 54.57501
#3 3 58.589574 64.90672 18.17310 71.390459 123.49629
From a data frame I need a list of all unique values of one column. For a possible later check we need to keep the information from a second column, combined into a single string for simplicity.
Sample data
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df
id source
1 1 x
2 3 y
3 1 z
The desired outcome is
df2
id source
1 1 x,z
2 3 y
It should be pretty easy, still I cannot find the proper function / grammar?
E.g. something like
df %>%
  group_by(id) %>%
  summarise(vlist = paste0(source, collapse = ","))
or
df %>%
  distinct(id) %>%
  summarise(vlist = paste0(source, collapse = ","))
What am I missing? Thanks for any advice!
You can use aggregate from stats to combine per group.
aggregate(source ~ id, df, paste, collapse = ",")
# id source
#1 1 x,z
#2 3 y
Using your code, here is a solution:
library(dplyr)
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ",")) %>%
distinct(id, .keep_all = TRUE)
# A tibble: 2 x 2
id vlist
<dbl> <chr>
1 1 x,z
2 3 y
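Your second approach doesn't work because you call distinct() before you aggregate the data: distinct(id) drops the source column unless you set .keep_all = TRUE, and it keeps only one row per id, so there is nothing left to paste together.
Your first approach already returns one row per id after group_by() and summarise(), so the trailing distinct() in the code above is a safeguard rather than a requirement.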
aggregate(source ~ id, df, toString)
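The two aggregate() calls differ only in the separator: toString() joins with a comma followed by a space. A quick sketch on the sample data (the names by_paste and by_toString are only illustrative):

```r
df <- data.frame(id = c(1, 3, 1), source = c("x", "y", "z"))

by_paste    <- aggregate(source ~ id, df, paste, collapse = ",")
by_toString <- aggregate(source ~ id, df, toString)

by_paste$source
# "x,z" "y"
by_toString$source
# "x, z" "y"
```

Which one to prefer depends on whether the combined string will be split again later on "," or on ", ".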
Hello, I have the following data frame:
colnames(tv_viewing_time) <- c("channel_1", "channel_2", "channel_1", "channel_2")
Each row gives the viewing time of one individual on channel 1 and channel 2. For instance, for individual 1 I get:
tv_viewing_time[1,] <- c(1,2,4,5)
What I would like is a data frame that sums up the values of duplicated columns.
I.e. I would get
colnames(tv_viewing_time) <- c("channel_1", "channel_2")
where, for instance, for individual 1 I would get:
tv_viewing_time[1,] <- c(5,7)
since the row entries are summed whenever they belong to duplicated column names.
I have looked for an answer, but none of the suggestions in other threads worked for my data frame.
Note that there are many more duplicated columns, so I am looking for a solution that can be applied efficiently to all my duplicates.
We could use split.default with rowSums
sapply(split.default(tv_viewing_time,
sub("\\.\\d+$", "", names(tv_viewing_time))), rowSums)
-output
# channel_1 channel_2
# 5 7
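With more than one individual the same call returns a matrix with one row per individual. A small sketch with a hypothetical second row, relying on the fact that data.frame() renames duplicated columns to channel_1.1, channel_2.1 and so on (the objects tv, base_names and summed are only illustrative):

```r
# hypothetical two-individual data; duplicates become channel_1.1, channel_2.1
tv <- data.frame(channel_1 = c(1, 10), channel_2 = c(2, 20),
                 channel_1 = c(4, 40), channel_2 = c(5, 50))

# strip the ".1", ".2", ... suffix to recover the original channel name
base_names <- sub("\\.\\d+$", "", names(tv))

# split the columns by channel name, then sum each block row-wise
summed <- sapply(split.default(tv, base_names), rowSums)

summed[, "channel_1"]   # 5 50
summed[, "channel_2"]   # 7 70
```

If the data were created with check.names = FALSE, the names would stay identical and the sub() step would leave them unchanged, so the split would still group the duplicates together.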
Or using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
tv_viewing_time %>%
pivot_longer(cols = everything()) %>%
group_by(name = str_remove(name, "\\.\\d+$")) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 1 x 2
# channel_1 channel_2
# <dbl> <dbl>
#1 5 7
data
tv_viewing_time <- data.frame(channel_1 = 1, channel_2 = 2,
channel_1 = 4, channel_2 = 5)