I'm trying to output the number of statuses that are open, grouped by ID. Please see the example below:
(Note: "(1 status that is open)" is only used to show why the count is 1; I don't want to output that sentence.)
Reproducible code:
ID <- c(1,1,1,2,2,2)
Status <- c("status.open","status.closed", "status.wait", "status.open", "status.open", "status.wait" )
df <- data.frame(ID, Status)
pseudo-code:
df %>%
group_by(ID) %>%
summarize(count = length(Status where status like "%open"))
Please help, thanks!
You may achieve this with the following code:
require(dplyr)
df %>% filter(Status == "status.open") %>% ## you only want status.open
count(ID) ## count members of ID
Which produces:
# A tibble: 2 x 2
ID n
<dbl> <int>
1 1 1
2 2 2
Solution (as close as possible to your 'pseudo-code') using dplyr and grepl, plus R's implicit conversion of logicals (TRUE becomes 1 if we do math with it):
library(dplyr)
df %>%
group_by(ID) %>%
summarise(count = sum(grepl("open", Status)))
Returns:
# A tibble: 2 x 2
ID count
<dbl> <int>
1 1 1
2 2 2
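As a quick aside (an illustration only, using the Status vector defined in the question), the coercion is easy to verify directly:
grepl("open", Status)      # TRUE FALSE FALSE TRUE TRUE FALSE
sum(grepl("open", Status)) # 3: each TRUE counts as 1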
Roughly the equivalent of SQL's LIKE '%open' is:
library(stringr)
df %>%
filter(str_detect(Status, "open$")) # open$ = ends with open
What about this solution:
df %>%
dplyr::group_by(ID) %>%
dplyr::summarize(count = sum(Status == "status.open"))
I would like to calculate the duration of a state using rle() on grouped data. Here is a test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length of the runs where state == 1, grouped by ID. I have created a function, inspired by https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/, to calculate the rle average portion:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
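As a quick sanity check (a made-up toy vector, not from the question), the function behaves as expected on a standalone input:
rle_mean_lengths(c(0, 1, 1, 1, 0, 0, 1, 0), 1)
#   count avg_length
# 1     2          2
There are two runs of 1s, of lengths 3 and 1, so count is 2 and avg_length is 2.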
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
  ID count avg_length
1 L0     2          2
2 L1     2          2
L0 is correct, L1 has no instances of state == 1 so the average should be zero or NA.
I tried to isolate the problem by breaking it down to just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent summarize_at() for do()? Or is there another fix? Thanks
As the function returns a data.frame column, we may need to unnest afterwards:
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or remove the list and unpack
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do code, the column should be extracted not from the whole data but from the data coming from the lhs, i.e. the dot (.). (Note that do() is more or less deprecated, so it may be better to make use of summarise with unnest/unpack.)
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
This seems like it should be fairly easy, but I'm having trouble with it.
Example: I have a dataframe with two columns IDs and perc_change. I want to know which unique IDs have had more than 30% change.
IDs <- c(1,1,2,1,1,2,2,2,3,2,3,4,5,6,3)
perc_change <- c(50,40,60,70,80,30,20,40,23,25,10,30,12,7,70)
df <- data.frame(IDs, perc_change)
So far:
if (df$perc_change > 30) {
unique(df$IDs)
} else {
}
This obviously doesn't work because it returns all unique IDs. Should I be finding the indices and then matching them, or something?
Thanks in advance!
We could do it like this, to get the matching values for each ID:
library(dplyr)
df %>%
group_by(IDs) %>%
filter(perc_change > 30) %>%
mutate(values = paste0(perc_change, collapse = ","), .keep="unused") %>%
distinct(IDs, .keep_all = TRUE)
Output:
IDs values
<dbl> <chr>
1 1 50,40,70,80
2 2 60,40
3 3 70
Just use [ to subset and take the unique - i.e. no need for if/else conditions
with(df, unique(IDs[perc_change > 30]))
[1] 1 2 3
We can group, filter and count using dplyr
library(dplyr)
df %>%
group_by(IDs) %>%
filter(perc_change > 30) %>%
count(IDs)
# A tibble: 3 x 2
# Groups: IDs [3]
IDs n
<dbl> <int>
1 1 4
2 2 2
3 3 1
Or with base R subsetting:
unique(df[df$perc_change > 30, "IDs"])
I have a data frame which tracks service involvement (srvc_inv {1, 0}) for individual x (Bob) over a timeframe of interest (years 1900-1999).
library(tidyverse)
dat <- data.frame(name = rep("Bob", 100),
day = seq(as.Date("1900/1/1"), as.Date("1999/1/1"), "years"),
srvc_inv = c(rep(0, 25), rep(1, 25), rep(0, 25), rep(1, 25)))
As we can see, Bob has two service episodes: one episode between rows 26:50, and the other between rows 76:100.
If we want to determine any service involvement for Bob during the timeframe, we can use a simple max statement as shown below.
dat %>%
group_by(name) %>%
summarise(ever_inv = max(srvc_inv))
However, I would like to determine the number of service episodes that Bob had during the timeframe of interest (in this case, 2). A distinct service episode would be identified by a break in service involvement over consecutive dates. Anybody have any idea how to program this? Thanks!
One more solution, based on base R's rle:
library(dplyr)
dat %>% group_by(name) %>%
summarise(ever_inv = length(with(rle(srvc_inv), lengths[values==1])))
# A tibble: 1 x 2
name ever_inv
<fct> <int>
1 Bob 2
One possibility could be:
dat %>%
group_by(name) %>%
mutate(rleid = with(rle(srvc_inv), rep(seq_along(lengths), lengths))) %>%
summarise(ever_inv = n_distinct(rleid[srvc_inv == 1]))
name ever_inv
<fct> <int>
1 Bob 2
Alternatively to rle() you can use diff():
dat %>%
group_by(name) %>%
summarise(ever_inv = sum(diff(c(0, srvc_inv)) > 0))
# A tibble: 1 x 2
# name ever_inv
# <fct> <int>
# 1 Bob 2
Assuming that srvc_inv is either 0 or 1, diff(srvc_inv) is 1 only when x[i] is 1 and x[i-1] is 0; it becomes 0 or -1 otherwise. I prepended a 0 to srvc_inv to handle the case where it starts with a run of 1s.
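A toy vector makes the rising-edge idea concrete (an illustration only, not from the original answer):
x <- c(0, 1, 1, 0, 1)
diff(c(0, x))          # 0 1 0 -1 1: each 1 marks the start of an episode
sum(diff(c(0, x)) > 0) # 2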
And with rle(), in my opinion, there is an even simpler solution:
dat %>%
group_by(name) %>%
summarise(ever_inv = sum(rle(srvc_inv)$values))
# A tibble: 1 x 2
# name ever_inv
# <fct> <int>
# 1 Bob 2
Assuming that srvc_inv is either 0 or 1, it's enough to just sum the values component of the rle object, which returns the number of runs of 1s.
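On the same toy vector (again just an illustration):
rle(c(0, 1, 1, 0, 1))$values # 0 1 0 1: summing gives 2, the number of runs of 1s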
I'm trying to perform a sum to count the number of interactions for unique IDs.
So I have something like this:
Client ID     Frequency
JOE12_EMI     1
ABC12_CANC    2
ABC12_EMI     2
ABC12_RENE    1
and so on...
The Frequency column counts how many times each unique ID repeats.
Is there a way that I can have all the activity types (EMI, TELI, PFL) summed for each ID and then placed into new columns?
I've tried to transpose the data by separating the actual ID from the activity type, but this doesn't return the sums. I'm not sure if that's the best way, or if I should transpose the data to wide format and then do another sum, but I'm unsure how to go about it. Thank you for any help.
separate(frequency, id, c("id", "act_code") )
nd <- melt(frequency, id=(c("id")))
Try this:
library(dplyr)
data=data.frame(Client_ID= c("JOE12_EMI",
"ABC12_CANC",
"ABC12_EMI",
"ABC12_RENE"),
frequency= c(1,2,2,1))
client_and_id <- as.data.frame(do.call(rbind, strsplit(as.character(data$Client_ID), "_")))
names(client_and_id) <- c("client", "id")
data <- cbind(data, client_and_id)
data_sum <- data %>% group_by(id) %>% mutate(sum_freq = sum(frequency))
The output
> data_sum
# A tibble: 4 x 5
# Groups: id [3]
Client_ID frequency client id sum_freq
<fct> <dbl> <fct> <fct> <dbl>
1 JOE12_EMI 1 JOE12 EMI 3
2 ABC12_CANC 2 ABC12 CANC 2
3 ABC12_EMI 2 ABC12 EMI 3
4 ABC12_RENE 1 ABC12 RENE 1
You can also display the output by ID:
distinct(data_sum %>% dplyr::select(id, sum_freq))
# A tibble: 3 x 2
# Groups: id [3]
id sum_freq
<fct> <dbl>
1 EMI 3
2 CANC 2
3 RENE 1
You're on the right track; I think the only thing you need is a group_by. Something like this:
library(dplyr)
library(tidyr)
df = data.frame(ClientID = c("JOE12_EMI",
"ABC12_CANC",
"ABC12_EMI",
"ABC12_RENE"))
df %>%
separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
group_by(id) %>%
mutate(frequency = n()) %>%
ungroup() %>%
group_by(id, act_code) %>%
mutate(act_frequency = n()) %>%
ungroup() %>%
spread(act_code, act_frequency)
(This does the sum by user and the pivot by activity type separately; it's possible to calculate the sum by user after pivoting, but this way is easier for me to read.)
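For what it's worth, if your tidyr is version 1.0 or newer (an assumption about your setup), the same reshape can be sketched with pivot_wider() and add_count(), reusing the df defined just above:
df %>%
separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
add_count(id, name = "frequency") %>% # total interactions per ID
count(id, frequency, act_code) %>% # interactions per ID and activity type
pivot_wider(names_from = act_code, values_from = n)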
I've got a data frame (df) with two variables, location and weather.
I'd like a wide data frame (dfgoal), in which the data is grouped by location and in which there are three new variables (weather_1 to weather_3) with counts for the observations in the original weather variable.
The problem is that when I try to use dplyr::mutate() I only get TRUE/FALSE output rather than counts, or alternatively an error message: Evaluation error: no applicable method for 'summarise_' applied to an object of class "logical".
Any help would be much appreciated.
Starting point (df):
df <- data.frame(location=c("az","az","az","az","bi","bi","bi","ca","ca","ca","ca","ca"),weather=c(1,1,2,3,2,3,2,1,2,3,1,2))
Desired outcome (df):
dfgoal <- data.frame(location=c("az","bi","ca"),weather_1=c(2,0,2),weather_2=c(1,2,2),weather_3=c(1,1,1))
Current code:
library(dplyr)
df %>% group_by(location) %>% mutate(weather_1 = (weather == 1)) %>% mutate(weather_2 = (weather == 2)) %>% mutate(weather_3 = (weather == 3))
df %>% group_by(location) %>% mutate(weather_1 = summarise(weather == 1)) %>% mutate(weather_2 = summarise(weather == 2)) %>% mutate(weather_3 = summarise(weather == 3))
It is super simple with the table function:
df %>% table
weather
location 1 2 3
az 2 1 1
bi 0 2 1
ca 2 2 1
Krzysztof's solution is the way to go, but if you insist on using tidyverse, here is a solution with dplyr + tidyr:
library(dplyr)
library(tidyr)
df %>%
group_by(location, weather) %>%
summarize(count = n()) %>%
spread(weather, count, sep="_") %>%
mutate_all(funs(coalesce(., 0L)))
Result:
# A tibble: 3 x 4
# Groups: location [3]
location weather_1 weather_2 weather_3
<fctr> <int> <int> <int>
1 az 2 1 1
2 bi 0 2 1
3 ca 2 2 1
Krzysztof's answer wins for simplicity, but if you want a tidyverse-only solution (dplyr and tidyr):
df %>%
group_by(location, weather) %>%
summarize(bin = sum(weather==weather)) %>% # the comparison is all TRUE, so this is just the group size (same as n())
spread(weather, bin, fill = 0, sep='_')
This results in:
location weather_1 weather_2 weather_3
az 2 1 1
bi 0 2 1
ca 2 2 1
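As a side note, with tidyr 1.1 or newer (an assumption, since a scalar values_fill needs that version), the spread() step can be replaced by pivot_wider():
df %>%
count(location, weather) %>%
pivot_wider(names_from = weather, values_from = n,
names_prefix = "weather_", values_fill = 0)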