I would like to calculate the duration of a state using rle() on grouped data. Here is a test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length for state == 1, grouped by ID. I have created a function inspired by: https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/
to calculate the rle average portion:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
  ID    count avg_length
1 L0        2          2
2 L1        2          2
L0 is correct, but L1 has no instances of state == 1, so its average should be zero or NA.
I isolated the problem in terms of breaking it down into just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent summarize_at() for do()? Or is there another fix? Thanks
As the function returns a data.frame, we can wrap it in a list inside summarise() and unnest afterwards:
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or remove the list and unpack
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do() code, the column should be extracted not from the whole data (DF$state) but from the grouped data coming from the left-hand side of the pipe, i.e. the . pronoun (.$state). Note that do() is superseded in current dplyr, so it may be better to use summarise() with unnest()/unpack() as above.
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
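To see why the original call returned the same numbers for both groups, here is a minimal base R sketch (no dplyr). The vectors `state` and `id` are re-typed from the test data, and `res_all` and `res_by_id` are illustrative names, not part of the original code. Running the function on the whole column reproduces the incorrect 2/2 result, while splitting by ID first gives the expected per-group values.

```r
# Same helper as in the question
rle_mean_lengths <- function(x, value) {
  r <- rle(x)
  cond <- r$values == value
  data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}

# state column of the test data, in row order
state <- c(0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 0,   # ID L0
           0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0)   # ID L1
id <- rep(c("L0", "L1"), each = 11)

# What do(rle_mean_lengths(DF$state, 1)) computes for every group:
# the whole column, ignoring the grouping
res_all <- rle_mean_lengths(state, 1)

# What was intended: one call per ID
res_by_id <- do.call(rbind, lapply(split(state, id), rle_mean_lengths, value = 1))
res_by_id
```

The second result has one row per ID, and the L1 row has count 0 with avg_length NaN because mean() of an empty vector is NaN.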
Background
I've got this dataset d:
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
It's got 2 people (IDs) in it, and they each have some events.
The problem
I'm trying to get an average number (count) of events per person, along with a standard deviation for that average, all in one result (it can be a dataframe or not, doesn't matter).
In other words I'm looking for something like this:
| Mean | SD |
|------|------|
| 4.00 | 2.83 |
What I've tried
I'm not far off, I don't think -- it's just that I've got 2 separate pieces of code doing these calculations. Here's the mean:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event))
# A tibble: 1 x 1
ratio
<dbl>
1 4
And here's the SD:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(sd = sd(event))
# A tibble: 1 x 1
sd
<dbl>
1 2.83
But when I try to pipe them together like so...
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event)) %>%
summarise(sd = sd(event))
... I get an error:
Error in `h()`:
! Problem with `summarise()` column `sd`.
i `sd = sd(event)`.
x object 'event' not found
Any insight?
You have to put the last two calls to summarise() in the same call. The only remaining columns after summarise() will be those you named and the grouping columns, so after your second summarise, the event column no longer exists.
library(dplyr)
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
d %>%
group_by(ID) %>%
# the next summarise will be within ID
summarise(event = length(event)) %>%
# this summarise is overall
summarise(sd = sd(event),
ratio = mean(event))
#> # A tibble: 1 × 2
#> sd ratio
#> <dbl> <dbl>
#> 1 2.83 4
The code is a bit confusing because you are renaming the event variable, and doing the first summarise() within groups and the second without grouping. This code would be a little easier to read and get the same result:
d %>%
count(ID) %>%
summarise(sd = sd(n),
ratio = mean(n))
Created on 2022-05-25 by the reprex package (v2.0.1)
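As a cross-check that does not depend on dplyr, the same two numbers can be computed in base R. This is only a sketch: `n_events` and `result` are names chosen here for illustration.

```r
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
                event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
                stringsAsFactors = FALSE)

# Number of events per ID (a = 6, b = 2)
n_events <- as.vector(table(d$ID))

# Mean and standard deviation of the per-ID counts in one result
result <- data.frame(Mean = mean(n_events), SD = sd(n_events))
result
```

Both statistics are computed from the same vector of counts, which mirrors putting them in a single summarise() call.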
This question is slightly modified from this one.
I have a dataframe in long table format like this:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
ID name value
1 a broad
1 c 50
1 a mangrove
1 c 50
1 a mangrove
1 c 50
2 a coniferous
2 c 50
About the data: the value 50 in the second row corresponds to the value broad in the first row. Similarly, the value 50 in the fourth row corresponds to the value mangrove in the third row, and so on. In other words, each value for name c belongs to the name a value in the row above it.
I want to combine the value in such a way that I could get the corresponding values for each name, which would also aggregate the values with similar names:
df2 <- data.frame(ID=c(1,1,2),
name=c("c_broad","c_mangrove","c_coniferous"),
value=c(50,100,50))
which should look like this:
ID name value
1 c_broad 50
1 c_mangrove 100
2 c_coniferous 50
Using reshape2:
library(reshape2)
df1$grp = cumsum(df1$name == "a")
df2 = dcast(df1, ID + grp ~ name)
df2$c = as.numeric(df2$c)
aggregate(c ~ ID + a, df2, sum)
ID a c
1 1 broad 50
2 2 coniferous 50
3 1 mangrove 100
Column names can be changed if desired, and "c_" can be prepended to the names with paste().
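A possible way to do that renaming, sketched in base R. The data frame `res` is re-typed from the aggregate() output shown above, and `out` is an illustrative name.

```r
# Aggregated result as printed above
res <- data.frame(ID = c(1, 2, 1),
                  a  = c("broad", "coniferous", "mangrove"),
                  c  = c(50, 50, 100))

# Build the requested layout: prefix the category with "c_" and rename columns
out <- data.frame(ID    = res$ID,
                  name  = paste0("c_", res$a),
                  value = res$c)

# Sort by ID to match the desired output order
out <- out[order(out$ID), ]
rownames(out) <- NULL
out
```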
Using tidyverse:
library(dplyr)
value_a <- df1 %>% dplyr::filter(name=="a") %>% dplyr::pull(value)
df1 %>%
dplyr::filter(name=="c") %>% #Modify into a sensible data frame from here
dplyr::mutate(a = value_a,
name = stringr::str_c(name, "_" ,a)) %>%
dplyr::select(-a) %>% # to here
dplyr::group_by(ID, name) %>%
dplyr::summarise(value=sum(as.numeric(value)))
# A tibble: 3 x 3
# Groups: ID [2]
ID name value
<dbl> <chr> <dbl>
1 1 c_broad 50
2 1 c_mangrove 100
3 2 c_coniferous 50
The main problem with your data frame is that a single column contains both names and values, and that is the first thing you should fix. My advice is to always convert the original data frame into a tidy format (https://tidyr.tidyverse.org/articles/tidy-data.html) and from there leverage the full power of the tidyverse, data.table, or your framework of choice.
Note that the temporary variable value_a could be computed directly inside the pipeline; I kept it separate for clarity. The main idea is to separate values and species into different columns (the first three calls in the pipeline) and then apply the usual tidyverse operations.
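The tidying idea itself can be sketched in base R. This assumes, as in the example data, that rows strictly alternate between name a and name c and that each pair shares an ID; `tidy` and `agg` are illustrative names.

```r
df1 <- data.frame(ID = c(1,1,1,1,1,1,2,2),
                  name = c("a","c","a","c","a","c","a","c"),
                  value = c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))

# Rows with name "a" hold the species, rows with name "c" hold the number
is_a <- df1$name == "a"

# One row per observation: species and numeric value in separate columns
tidy <- data.frame(ID      = df1$ID[is_a],
                   species = df1$value[is_a],
                   value   = as.numeric(df1$value[!is_a]))

# With tidy data the aggregation is a single call
agg <- aggregate(value ~ ID + species, data = tidy, FUN = sum)
agg
```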
Might not be the most elegant, but it works:
library(dplyr)
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50)
)
df1 %>% group_by( 1+floor((1:n()-1)/2) ) %>%
summarize(
ID = ID[1],
name = paste0( name[2], "_", value[1] ),
value = as.numeric(value[2])
) %>% ungroup %>% select( -1 ) %>% group_by(name) %>%
mutate( value = sum(value) ) %>%
unique
Here is an improved version that is actually human-readable:
i <- seq( 1, nrow(df1), 2 )
df1 %>% summarise(
ID = ID[i],
name = paste0( name[i+1], "_", value[i] ),
value = as.numeric(value[i+1])
) %>% group_by(name) %>%
summarize(
ID=ID[1], value = sum( value )
) %>% arrange(ID)
Base R solution:
# Nullify numeric values belonging to a grouping category: grps => character vector
grps <- gsub("\\d+", NA, df1$value)
# Interpolate NA values using prior string value: a => character vector
df1$a <- na.omit(grps)[cumsum(!(is.na(grps)))]
# Split-Apply-Combine aggregation: data.frame => stdout(console)
data.frame(do.call(rbind, lapply(with(df1, split(df1, a)), function(x){
y <- transform(subset(x, !grepl("\\D+", value)), value = as.numeric(value))
setNames(
aggregate(value ~ ID + a, y, FUN = function(z){sum(z, na.rm = TRUE)}),
c("ID", "a", "c")
)
}
)
),
row.names = NULL
)
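The fill-forward step used above can be seen in isolation on a short vector. This is a sketch with made-up input; it uses ifelse() with grepl() to blank out the numeric strings, which is a variant of the gsub() line above, not the same call.

```r
value <- c("broad", "50", "mangrove", "50")

# Replace purely numeric strings with NA, keep the category labels
grps <- ifelse(grepl("^\\d+$", value), NA, value)

# cumsum(!is.na(grps)) counts how many labels have been seen so far (1 1 2 2),
# which indexes into the vector of non-missing labels
filled <- na.omit(grps)[cumsum(!is.na(grps))]
filled
```

Each numeric row ends up carrying the label of the closest preceding category row.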
An additional option:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
library(tidyverse)
df1 %>%
pivot_wider(id_cols = ID, names_from = name, values_from = value) %>%
unnest(c("a", "c")) %>%
group_by(ID, name = a) %>%
summarise(value = sum(as.numeric(c), na.rm = TRUE), .groups = "drop")
#> # A tibble: 3 x 3
#> ID name value
#> <dbl> <chr> <dbl>
#> 1 1 broad 50
#> 2 1 mangrove 100
#> 3 2 coniferous 50
Created on 2021-04-12 by the reprex package (v2.0.0)
Hello, I have the following data frame:
colnames(tv_viewing_time) <- c("channel_1", "channel_2", "channel_1", "channel_2")
Each row gives the viewing time for an individual on channel 1 and channel 2. For instance, for individual 1 I get:
tv_viewing_time[1,] <- c(1,2,4,5)
What I would like is actually a dataframe that sums up the values of duplicated columns.
I.e. I would get
colnames(tv_viewing_time) <- c("channel_1", "channel_2")
where, for instance, for individual 1 I would get:
tv_viewing_time[1,] <- c(5,7)
since the entries in each row are summed whenever they belong to duplicated column names.
I have looked for an answer, but none of the suggestions in other threads worked for my data frame.
Note that there are many more duplicated columns, so I am looking for a solution that can be applied efficiently to all my duplicates.
We could use split.default with rowSums
sapply(split.default(tv_viewing_time,
sub("\\.\\d+$", "", names(tv_viewing_time))), rowSums)
-output
# channel_1 channel_2
# 5 7
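To show that this generalizes beyond one row, here is a small sketch with two individuals. The second row of values is invented for illustration, and `tv`, `base_names` and `summed` are illustrative names.

```r
# data.frame() makes duplicated names unique by appending .1, .2, ...
tv <- data.frame(channel_1 = c(1, 10), channel_2 = c(2, 20),
                 channel_1 = c(4, 40), channel_2 = c(5, 50))
names(tv)

# Strip the numeric suffix to recover the original channel name
base_names <- sub("\\.\\d+$", "", names(tv))

# Split the columns by base name and sum each group row-wise
summed <- sapply(split.default(tv, base_names), rowSums)
summed
```

With more than one row, sapply() returns a matrix with one column per channel and one row per individual.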
Or using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
tv_viewing_time %>%
pivot_longer(cols = everything()) %>%
group_by(name = str_remove(name, "\\.\\d+$")) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 1 x 2
# channel_1 channel_2
# <dbl> <dbl>
#1 5 7
data
tv_viewing_time <- data.frame(channel_1 = 1, channel_2 = 2,
channel_1 = 4, channel_2 = 5)
I'm trying to use a sum function to count the number of interactions for unique IDs.
So I have something like this:
Client ID
JOE12_EMI
ABC12_CANC
ABC12_EMI
ABC12_RENE
and so on...
It also has a column next to it that counts how many times each unique ID repeats.
Frequency
1
2
2
1
Is there a way that I can have all the activity types (EMI, TELI, PFL) summed for each ID and then placed into new columns?
I've tried separating the actual ID from the activity type, but this doesn't return the sums. I'm not sure whether that's the best way, or whether I should transpose the data to wide format and then apply another sum function, and I'm unsure how to go about it. Thank you for any help.
separate(frequency, id, c("id", "act_code") )
nd <- melt(frequency, id=(c("id")))
Try this:
library(dplyr)
data=data.frame(Client_ID= c("JOE12_EMI",
"ABC12_CANC",
"ABC12_EMI",
"ABC12_RENE"),
frequency= c(1,2,2,1))
client_and_id <- as.data.frame(do.call(rbind, strsplit(as.character(data$Client_ID), "_")))
names(client_and_id) <- c("client", "id")
data <- cbind(data, client_and_id)
data_sum <- data %>% group_by(id) %>% mutate(sum_freq = sum(frequency))
The output
> data_sum
# A tibble: 4 x 5
# Groups: id [3]
Client_ID frequency client id sum_freq
<fct> <dbl> <fct> <fct> <dbl>
1 JOE12_EMI 1 JOE12 EMI 3
2 ABC12_CANC 2 ABC12 CANC 2
3 ABC12_EMI 2 ABC12 EMI 3
4 ABC12_RENE 1 ABC12 RENE 1
You can also display the output by ID:
distinct(data_sum %>% dplyr::select(id, sum_freq))
# A tibble: 3 x 2
# Groups: id [3]
id sum_freq
<fct> <dbl>
1 EMI 3
2 CANC 2
3 RENE 1
You're on the right track; I think the only thing you need is a group_by. Something like this:
library(dplyr)
library(tidyr)
df = data.frame(ClientID = c("JOE12_EMI",
"ABC12_CANC",
"ABC12_EMI",
"ABC12_RENE"))
df %>%
separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
group_by(id) %>%
mutate(frequency = n()) %>%
ungroup() %>%
group_by(id, act_code) %>%
mutate(act_frequency = n()) %>%
ungroup() %>%
spread(act_code, act_frequency)
(This does the sum by user and the pivot by activity type separately; it's possible to calculate the sum by user after pivoting, but this way is easier for me to read.)
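The alternative order mentioned above (pivot first, then sum per user) might look like this in base R. This is a sketch, assuming the Frequency column from the question is what should be summed; `long`, `wide` and `total` are illustrative names.

```r
client_id <- c("JOE12_EMI", "ABC12_CANC", "ABC12_EMI", "ABC12_RENE")
frequency <- c(1, 2, 2, 1)

# Split "JOE12_EMI" into the client id and the activity code
parts <- do.call(rbind, strsplit(client_id, "_", fixed = TRUE))
long <- data.frame(id = parts[, 1], act_code = parts[, 2], frequency = frequency)

# Wide table: one row per id, one column per activity, cells are summed frequencies
wide <- xtabs(frequency ~ id + act_code, data = long)

# Total per id, computed after pivoting
total <- rowSums(wide)
wide
total
```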
Suppose I have a tibble dat (below). What I would like to do is calculate the maximum of (x 2, x 3) and then subtract x 1, where x can be either a or b. In my real data I have more than 3 columns, so something like 2:n (e.g., 2:3) would be great. I have tried many things, but none works the way I want; I am still struggling with the string vs. column name distinction.
dat <- tibble(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
`b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4)
foo <- function(x = 'a')
{
???
}
end result:
if x == `a`
c(3, 2, 3)
if x == `b`
c(5, 4, 5)
Solution 1
This solution uses only base R. The idea is to define a function (max_minus_first) to calculate the answer. The max_minus_first function has two arguments. The first argument, dat, is a data frame for analysis with the same format as the OP provided. group is the name of the group for analysis. The end product is a vector with the answer.
max_minus_first <- function(dat, group){
# Get all column names with starting string "group"
col_names <- colnames(dat)
dat2 <- dat[, col_names[grepl(paste0("^", group), col_names)]]
# Get the maximum values from all columns except the first column
max_value <- apply(dat2[, -1], 1, max, na.rm = TRUE)
# Calculate max_value minus the values from the first column
final_value <- max_value - unlist(dat2[, 1], use.names = FALSE)
return(final_value)
}
max_minus_first(dat, "a")
# [1] 3 2 3
max_minus_first(dat, "b")
# [1] 5 4 5
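A vectorized variant of the same idea, sketched with pmax() instead of a row-wise apply(). The function name `max_minus_first_pmax` is made up here, a plain data.frame stands in for the tibble, and the prefix match includes the trailing space (an assumption, so that a group named "a" would not also pick up a hypothetical group "ab").

```r
dat <- data.frame(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
                  `b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4,
                  check.names = FALSE)

max_minus_first_pmax <- function(dat, group) {
  # Columns whose names start with "<group> "
  cols <- dat[startsWith(names(dat), paste0(group, " "))]
  # Parallel maximum over all columns except the first, then subtract the first
  do.call(pmax, c(unname(as.list(cols[-1])), na.rm = TRUE)) - cols[[1]]
}

result_a <- max_minus_first_pmax(dat, "a")
result_b <- max_minus_first_pmax(dat, "b")
```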
Solution 2
A solution using the tidyverse. The end product (dat2) is a tibble with the output from each group (a, b, ...)
library(tidyverse)
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid, -ends_with(" 1")) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
gather(Column_1, Value_1, ends_with(" 1")) %>%
separate(Column_1, into = c("Group_1", "Column_Number_1")) %>%
filter(Group == Group_1) %>%
group_by(rowid, Group, Value_1) %>%
summarise(Value = max(Value, na.rm = TRUE)) %>%
mutate(Final = Value - Value_1) %>%
ungroup() %>%
select(-starts_with("Value")) %>%
spread(Group, Final)
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
Explanation
rowid_to_column() is from the tibble package, a way to create a new column based on row ID.
gather is from the tidyr package and converts the data frame from wide format to long format. I used gather twice because the first column of each group is treated differently from the other columns in the same group. ends_with(" 1") is a select helper available through dplyr, which selects the columns whose names end in " 1". Notice that the space in " 1" is important because "1" may select other columns like a 11 if such columns exist.
separate is from the tidyr package to separate a column into two columns. I used it to separate the Group name and column numbers in each Group.
filter(Group == Group_1) is to filter rows with Group == Group_1.
group_by(rowid, Group, Value_1) and then summarise(Value = max(Value, na.rm = TRUE)) make sure the maximum from each Group is calculated.
mutate(Final = Value - Value_1) is to calculate the difference between maximum from each Group and the value from the first column. The results are stored in the Final column.
select(-starts_with("Value")) removes any columns with a name beginning with "Value".
spread from the tidyr package converts the data frame from long format to wide format.
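Two of the points above can be checked in base R without the tidyverse: why the space in " 1" matters, and what splitting a column name into group and number produces. This is only an illustration; the column "a 11" is hypothetical and `key` is an illustrative name.

```r
cols <- c("a 1", "a 2", "a 11", "b 1")

# With the leading space only the true first columns match
with_space <- endsWith(cols, " 1")
# Without it, the hypothetical "a 11" matches as well
without_space <- endsWith(cols, "1")

# Base R analogue of the separate() step: split each name at the space
parts <- do.call(rbind, strsplit(cols, " ", fixed = TRUE))
key <- data.frame(Column = cols,
                  Group = parts[, 1],
                  Column_Number = as.integer(parts[, 2]))
key
```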
Solution 3
Another tidyverse solution, similar to Solution 2. It uses do to apply an operation to each Group, which makes the code more concise.
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
group_by(rowid, Group) %>%
do(tibble(Max = max(.$Value[.$Column_Number != 1]),
First = .$Value[.$Column_Number == 1])) %>%
mutate(Final = Max - First) %>%
select(-Max, -First) %>%
spread(Group, Final) %>%
ungroup()
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5