I have a data frame which tracks service involvement (srvc_inv {1, 0}) for individual x (Bob) over a timeframe of interest (years 1900-1999).
library(tidyverse)
dat <- data.frame(name = rep("Bob", 100),
day = seq(as.Date("1900/1/1"), as.Date("1999/1/1"), "years"),
srvc_inv = c(rep(0, 25), rep(1, 25), rep(0, 25), rep(1, 25)))
As we can see, Bob has two service episodes: one episode between rows 26:50, and the other between rows 76:100.
If we want to determine any service involvement for Bob during the timeframe, we can use a simple max statement as shown below.
dat %>%
group_by(name) %>%
summarise(ever_inv = max(srvc_inv))
However, I would like to determine the number of service episodes that Bob had during the timeframe of interest (in this case, 2). A distinct service episode would be identified by a break in service involvement over consecutive dates. Anybody have any idea how to program this? Thanks!
One more solution based on base R rle
library(dplyr)
dat %>% group_by(name) %>%
summarise(ever_inv = length(with(rle(srvc_inv), lengths[values==1])))
# A tibble: 1 x 2
name ever_inv
<fct> <int>
1 Bob 2
One possibility could be:
dat %>%
group_by(name) %>%
mutate(rleid = with(rle(srvc_inv), rep(seq_along(lengths), lengths))) %>%
summarise(ever_inv = n_distinct(rleid[srvc_inv == 1]))
name ever_inv
<fct> <int>
1 Bob 2
Alternatively to rle() you can use diff():
dat %>%
group_by(name) %>%
summarise(ever_inv = sum(diff(c(0, srvc_inv)) > 0))
# A tibble: 1 x 2
# name ever_inv
# <fct> <int>
# 1 Bob 2
Assuming that srvc_inv is either 0 or 1, diff(srvc_inv) equals 1 only where x[i] is 1 and x[i-1] is 0; otherwise it is 0 or -1. I prepend a 0 to srvc_inv to handle the case where the series starts with a run of 1s.
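To make the role of the prepended 0 concrete, here is a minimal base R sketch on a toy vector. The vector x is made-up illustration data, not taken from the question; it deliberately starts with a run of 1s.

```r
# Hypothetical 0/1 series that starts inside a service episode
x <- c(1, 1, 0, 0, 1, 0, 1, 1)

# Padding with a leading 0 turns the first run of 1s into a 0 -> 1 step
steps <- diff(c(0, x))
episodes <- sum(steps > 0)      # counts every 0 -> 1 transition

# Without the padding, the leading run is missed
unpadded <- sum(diff(x) > 0)

episodes
unpadded
```

With this input the padded version counts three episodes and the unpadded version only two, which is the case the prepended 0 guards against.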
And with rle(), in my opinion, there is an even simpler solution:
dat %>%
group_by(name) %>%
summarise(ever_inv = sum(rle(srvc_inv)$values))
# A tibble: 1 x 2
# name ever_inv
# <fct> <int>
# 1 Bob 2
Assuming that srvc_inv is either 0 or 1, it is enough to sum the values component of the rle object, which gives the number of runs of 1s.
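As a rough illustration of what rle() returns and why the 0/1 assumption matters, consider the sketch below. Both toy vectors (x and s) are invented for this example.

```r
# Hypothetical 0/1 series
x <- c(1, 1, 0, 0, 1, 0, 1, 1)
r <- rle(x)

# One entry per run: the value of the run and how long it lasts
run_values  <- r$values
run_lengths <- r$lengths

# For 0/1 data, summing the run values counts the runs of 1s
n_runs <- sum(run_values)

# Run ids per element, the same construction used in the rleid answer
run_id <- rep(seq_along(run_lengths), run_lengths)

# If other codes can occur (here a 2), compare explicitly instead of summing
s <- c(0, 1, 1, 2, 2, 1)
naive_sum <- sum(rle(s)$values)        # adds up the codes, not a run count
safe_count <- sum(rle(s)$values == 1)  # number of runs equal to 1
```

The explicit comparison in the last line is the safer form whenever the column is not guaranteed to be strictly 0/1.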
I would like to calculate duration of state using rle() on grouped data. Here is test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length for state == 1, grouped by ID. I have created a function inspired by: https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/
to calculate the rle average portion:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
  ID count avg_length
1 L0     2          2
2 L1     2          2
L0 is correct, but L1 has no instances of state == 1, so the average should be zero or NA.
I isolated the problem in terms of breaking it down into just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent summarize_at() for do()? Or is there another fix? Thanks
As it is a data.frame column, we may need to unnest afterwards
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or remove the list and unpack
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do code, the column should be extracted not from the whole data (DF$state) but from the data coming from the lhs, i.e. the . placeholder (.$state). Note that do is superseded in current dplyr, so it may be better to use summarise with unnest/unpack.
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
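To see why the DF$state version returned the same row for both groups, the sketch below reproduces the computation in base R only. It copies the state and ID columns from the question; using split() and lapply() in place of group_by() is an assumption made here to keep the example free of package dependencies.

```r
rle_mean_lengths <- function(x, value) {
  r <- rle(x)
  cond <- r$values == value
  data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}

# state and ID columns as given in the question
state <- c(0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 0,
           0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0)
ID <- rep(c("L0", "L1"), each = 11)

# What do(rle_mean_lengths(DF$state, 1)) computes for every group:
# the whole column, regardless of the grouping
whole <- rle_mean_lengths(state, 1)

# What was intended: one call per group on that group's values only
per_group <- lapply(split(state, ID), rle_mean_lengths, value = 1)
```

Here whole has count 2 and avg_length 2, the values that were repeated for both IDs, while per_group$L1 has count 0 and an avg_length of NaN because mean() of an empty vector is NaN.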
I was wondering if there was a way to create multiple columns from a list in R using the mutate() function within a for loop.
Here is an example of what I mean:
The Problem:
I have a data frame df that has 2 columns: category and rating. I want to add a column for every element of df$category and in that column, I want a 1 if the category column matches the iterator.
library(dplyr)
df <- tibble(
category = c("Art","Technology","Finance"),
rating = c(100,95,50)
)
Doing it manually, I could do:
df <-
df %>%
mutate(art = ifelse(category == "Art", 1,0))
However, what happens when I have 50 categories? (Which is close to what I have in my original problem. That would take a lot of time!)
What I tried:
category_names <- df$category
for(name in category_names){
df <-
df %>%
mutate(name = ifelse(category == name, 1,0))
}
Unfortunately, it doesn't seem to work.
I'd appreciate any light on the subject!
Full Code:
library(dplyr)
#Creates tibble
df <- tibble(
category = c("Art","Technology","Finance"),
rating = c(100,95,50)
)
#Showcases the operation I would like to loop over df
df <-
df %>%
mutate(art = ifelse(category == "Art", 1,0))
#Creates a variable for clarity
category_names <- df$category
#For loop I tried
for(name in category_names){
df <-
df %>%
mutate(name = ifelse(category == name, 1,0))
}
I am aware that what I am essentially doing is a form of model.matrix(); however, before I found out about that function I was still perplexed why what I was doing before wasn't working.
We can use pivot_wider after creating a sequence column
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number(), n = 1) %>%
pivot_wider(names_from = category, values_from = n,
values_fill = list(n = 0)) %>%
select(-rn)
# A tibble: 3 x 4
# rating Art Technology Finance
# <dbl> <dbl> <dbl> <dbl>
#1 100 1 0 0
#2 95 0 1 0
#3 50 0 0 1
Or another option is map
library(purrr)
map_dfc(unique(df$category), ~ df %>%
transmute(!! .x := +(category == .x))) %>%
bind_cols(df, .)
# A tibble: 3 x 5
# category rating Art Technology Finance
#* <chr> <dbl> <int> <int> <int>
#1 Art 100 1 0 0
#2 Technology 95 0 1 0
#3 Finance 50 0 0 1
If we need a for loop
for(name in category_names) df <- df %>% mutate(!! name := +(category == name))
Or in base R with table
cbind(df, as.data.frame.matrix(table(seq_len(nrow(df)), df$category)))
# category rating Art Finance Technology
#1 Art 100 1 0 0
#2 Technology 95 0 0 1
#3 Finance 50 0 1 0
Wanted to throw something in for anyone who stumbles across this question. The problem in the OP is that the "name" column name gets re-used during each iteration of the loop: you end up with only one new column, when you really wanted three (or 50). I consistently find myself wanting to create multiple new columns within loops, and I recently found out that mutate can now take "glue"-like inputs to do this. The following code now also solves the original question:
for(name in category_names){
df <-
df %>%
mutate("{name}" := ifelse(category == name, 1, 0))
}
This is equivalent to akrun's answer using a for loop, but it doesn't involve the !! operator. Note that you still need the "walrus" := operator, and that the column name needs to be a string (I think since it's using "glue" in the background). I'm thinking some people might find this format easier to understand.
Reference: https://www.tidyverse.org/blog/2020/02/glue-strings-and-tidy-eval/
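For comparison, here is a base R sketch of the same loop. It relies on [[<-, which evaluates name instead of treating it as a literal column name, so no tidy-eval syntax is needed. The data frame mirrors the one in the question; using as.integer() instead of ifelse() is a choice made for this sketch.

```r
df <- data.frame(
  category = c("Art", "Technology", "Finance"),
  rating = c(100, 95, 50),
  stringsAsFactors = FALSE
)

# df[[name]] uses the *value* of name as the column name,
# so each iteration creates a different column
for (name in unique(df$category)) {
  df[[name]] <- as.integer(df$category == name)
}

names(df)
```

After the loop the data frame has one indicator column per category, in the order the categories first appear.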
I'm trying to output the number of statuses that are open, grouped by ID. Please see the example below:
(note: (1 status that is open) is used to show why it's 1, I don't want to output the sentence)
Reproducible code:
ID <- c(1,1,1,2,2,2)
Status <- c("status.open","status.closed", "status.wait", "status.open", "status.open", "status.wait" )
df <- data.frame(ID, Status)
pseudo-code:
df %>%
group_by(ID) %>%
summarize(count = length(Status where status like "%open"))
Please help, thanks!
You may achieve this with the following code:
require(dplyr)
df %>% filter(Status == "status.open") %>% ## you only want status.open
count(ID) ## count members of ID
Which produces:
# A tibble: 2 x 2
# Groups: ID [2]
ID n
<dbl> <int>
1 1 1
2 2 2
Solution (as close as possible to your 'pseudo-code') using dplyr and grepl and R's implicit conversion of booleans (where TRUE becomes 1 if we try to do math with it):
library(dplyr)
df %>%
group_by(ID) %>%
summarise(count = sum(grepl("open", Status)))
Returns:
# A tibble: 2 x 2
ID count
<dbl> <int>
1 1 1
2 2 2
Roughly equivalent to SQL's LIKE '%open':
library(stringr)
df %>%
filter(str_detect(Status, "open$")) # open$ = ends with open
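The filter above keeps the matching rows but does not count them. A possible base R sketch of the full "ends with open, counted per ID" step is shown below; the use of endsWith() and tapply() is a choice made for this illustration, not something taken from the answers.

```r
ID <- c(1, 1, 1, 2, 2, 2)
Status <- c("status.open", "status.closed", "status.wait",
            "status.open", "status.open", "status.wait")

# endsWith() gives TRUE/FALSE per row; summing logicals counts the TRUEs
is_open <- endsWith(Status, "open")
counts <- tapply(is_open, ID, sum)

counts
```

The result is a named vector with one count per ID, relying on the same TRUE-as-1 conversion described in the grepl answer.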
What about this solution :
df %>% dplyr::group_by(ID) %>% dplyr::summarize(count = sum(Status == "status.open"))
I'm trying to perform a sum function to count the number of interactions for unique IDs.
So I have something like this:
Client ID
JOE12_EMI
ABC12_CANC
ABC12_EMI
ABC12_RENE
and so on...
It also has a column next to it that counts how many times each unique ID repeats.
Frequency
1
2
2
1
Is there a way that I can have all the activity types (EMI, TELI, PFL) summed for each ID and then placed into new columns?
I've tried to transpose the data by separating the actual ID from the activity type, but this doesn't return the sums. I'm not sure whether that's the best way, or whether I should transpose the data to wide format and then apply another sum function, and I am unsure how to go about it. Thank you for any help.
separate(frequency, id, c("id", "act_code") )
nd <- melt(frequency, id=(c("id")))
Try this:
library(dplyr)
data=data.frame(Client_ID= c("JOE12_EMI",
"ABC12_CANC",
"ABC12_EMI",
"ABC12_RENE"),
frequency= c(1,2,2,1))
client_and_id <- as.data.frame(do.call(rbind, strsplit(as.character(data$Client_ID), "_")))
names(client_and_id) <- c("client", "id")
data <- cbind(data, client_and_id)
data_sum <- data %>% group_by(id) %>% mutate(sum_freq = sum(frequency))
The output
> data_sum
# A tibble: 4 x 5
# Groups: id [3]
Client_ID frequency client id sum_freq
<fct> <dbl> <fct> <fct> <dbl>
1 JOE12_EMI 1 JOE12 EMI 3
2 ABC12_CANC 2 ABC12 CANC 2
3 ABC12_EMI 2 ABC12 EMI 3
4 ABC12_RENE 1 ABC12 RENE 1
You can also display the output by ID:
distinct(data_sum %>% dplyr::select(id, sum_freq))
# A tibble: 3 x 2
# Groups: id [3]
id sum_freq
<fct> <dbl>
1 EMI 3
2 CANC 2
3 RENE 1
You're on the right track; I think the only thing you need is a group_by. Something like this:
library(dplyr)
library(tidyr)
df = data.frame(ClientID = c("JOE12_EMI",
"ABC12_CANC",
"ABC12_EMI",
"ABC12_RENE"))
df %>%
separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
group_by(id) %>%
mutate(frequency = n()) %>%
ungroup() %>%
group_by(id, act_code) %>%
mutate(act_frequency = n()) %>%
ungroup() %>%
spread(act_code, act_frequency)
(This does the sum by user and the pivot by activity type separately; it's possible to calculate the sum by user after pivoting, but this way is easier for me to read.)
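As a dependency-free cross-check of the split-then-pivot idea, the sketch below builds the ID by activity table in base R. It assumes each Client ID has the form id_activity with a single underscore, and that counting rows (rather than summing the Frequency column) is what is wanted; both are assumptions, since the question leaves this open.

```r
client_id <- c("JOE12_EMI", "ABC12_CANC", "ABC12_EMI", "ABC12_RENE")

# Split "ABC12_EMI" into c("ABC12", "EMI") and stack the pieces as rows
parts <- do.call(rbind, strsplit(client_id, "_", fixed = TRUE))

# Cross-tabulate: one row per id, one column per activity code
wide <- table(id = parts[, 1], act_code = parts[, 2])

# Total number of interactions per id
totals <- rowSums(wide)

wide
totals
```

Activity codes that never occur for an ID show up as 0 in the table, which avoids the NA cells that a plain spread() produces.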
I've got a data frame (df) with two variables, location and weather.
I'd like a wide data frame (dfgoal), in which the data is grouped by location and in which there are three new variables (weather_1 to weather_3) with counts for the observations in the original weather variable.
The problem is that when I try to use dplyr::mutate() I only get TRUE/FALSE output rather than counts, or alternatively an error message: Evaluation error: no applicable method for 'summarise_' applied to an object of class "logical".
Any help would be much appreciated.
Starting point (df):
df <- data.frame(location=c("az","az","az","az","bi","bi","bi","ca","ca","ca","ca","ca"),weather=c(1,1,2,3,2,3,2,1,2,3,1,2))
Desired outcome (dfgoal):
dfgoal <- data.frame(location=c("az","bi","ca"),weather_1=c(2,0,2),weather_2=c(1,2,2),weather_3=c(1,1,1))
Current code:
library(dplyr)
df %>% group_by(location) %>% mutate(weather_1 = (weather == 1)) %>% mutate(weather_2 = (weather == 2)) %>% mutate(weather_3 = (weather == 3))
df %>% group_by(location) %>% mutate(weather_1 = summarise(weather == 1)) %>% mutate(weather_2 = summarise(weather == 2)) %>% mutate(weather_3 = summarise(weather == 3))
It is super simple with the function table():
df %>% table
weather
location 1 2 3
az 2 1 1
bi 0 2 1
ca 2 2 1
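If the result needs to be a data frame shaped like dfgoal rather than a table object, one possible sketch is the following. The weather_ prefix and the conversion through as.data.frame.matrix() are choices made here to match the desired output; the data is copied from the question.

```r
df <- data.frame(
  location = c("az", "az", "az", "az", "bi", "bi", "bi",
               "ca", "ca", "ca", "ca", "ca"),
  weather = c(1, 1, 2, 3, 2, 3, 2, 1, 2, 3, 1, 2)
)

# Cross-tabulate, then turn the table into a plain data frame
counts <- as.data.frame.matrix(table(df$location, df$weather))

# Columns are named "1", "2", "3"; add the prefix used in dfgoal
names(counts) <- paste0("weather_", names(counts))

# Move the row names into a regular location column
result <- cbind(location = rownames(counts), counts)
rownames(result) <- NULL

result
```

The counts come out as integers, whereas dfgoal in the question stores them as doubles, so a comparison with identical() would need a type conversion first.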
Krzysztof's solution is the way to go, but if you insist on using tidyverse, here is a solution with dplyr + tidyr:
library(dplyr)
library(tidyr)
df %>%
group_by(location, weather) %>%
summarize(count = n()) %>% # n() counts the rows in each group; count() expects a data frame
spread(weather, count, sep="_") %>%
mutate_all(~ coalesce(., 0L)) # funs() is deprecated; a formula lambda does the same
Result:
# A tibble: 3 x 4
# Groups: location [3]
location weather_1 weather_2 weather_3
<fctr> <int> <int> <int>
1 az 2 1 1
2 bi 0 2 1
3 ca 2 2 1
Krzysztof's answer wins for simplicity, but if you want a tidyverse-only solution (dplyr and tidyr):
df %>%
group_by(location, weather) %>%
summarize(bin = sum(weather==weather)) %>%
spread(weather, bin, fill = 0, sep='_')
This results in:
location weather_1 weather_2 weather_3
az 2 1 1
bi 0 2 1
ca 2 2 1