How to transpose character data for unique IDs in R

I'm trying to use a sum function to count the number of interactions for unique IDs.
So I have something like this:
Client ID
JOE12_EMI
ABC12_CANC
ABC12_EMI
ABC12_RENE
and so on...
It'll also have a column next to it that counts how many times each unique ID repeats.
Frequency
1
2
2
1
Is there a way that I can have all the activity types (EMI, TELI, PFL) summed for each ID and then placed into new columns?
I've tried to transpose the data by separating the actual ID from the activity type, but this doesn't return the sums. I'm not sure whether that's the best approach, or whether I should transpose the data to wide format and then do another sum, but I'm unsure how to go about it. Thanks for any help.
separate(frequency, id, c("id", "act_code") )
nd <- melt(frequency, id=(c("id")))

Try this:
library(dplyr)
data <- data.frame(Client_ID = c("JOE12_EMI",
                                 "ABC12_CANC",
                                 "ABC12_EMI",
                                 "ABC12_RENE"),
                   frequency = c(1, 2, 2, 1))

# split "CLIENT_ACT" on the underscore; here "client" is the ID prefix
# and "id" is the activity code
client_and_id <- as.data.frame(do.call(rbind, strsplit(as.character(data$Client_ID), "_")))
names(client_and_id) <- c("client", "id")
data <- cbind(data, client_and_id)

# sum of frequencies per activity code, kept on every row
data_sum <- data %>% group_by(id) %>% mutate(sum_freq = sum(frequency))
The output:
> data_sum
# A tibble: 4 x 5
# Groups:   id [3]
  Client_ID  frequency client id    sum_freq
  <fct>          <dbl> <fct>  <fct>    <dbl>
1 JOE12_EMI          1 JOE12  EMI          3
2 ABC12_CANC         2 ABC12  CANC         2
3 ABC12_EMI          2 ABC12  EMI          3
4 ABC12_RENE         1 ABC12  RENE         1
You can also display the output by ID:
distinct(data_sum %>% dplyr::select(id, sum_freq))
# A tibble: 3 x 2
# Groups:   id [3]
  id    sum_freq
  <fct>    <dbl>
1 EMI          3
2 CANC         2
3 RENE         1
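If the end goal from the question is one column per activity type, a pivot after splitting gets there; a minimal sketch using tidyr::pivot_wider (column names assumed as above):

```r
library(dplyr)
library(tidyr)

data <- data.frame(Client_ID = c("JOE12_EMI", "ABC12_CANC", "ABC12_EMI", "ABC12_RENE"),
                   frequency = c(1, 2, 2, 1))

wide <- data %>%
  separate(Client_ID, into = c("client", "act_code"), sep = "_") %>%
  group_by(client, act_code) %>%
  summarise(freq = sum(frequency), .groups = "drop") %>%
  pivot_wider(names_from = act_code, values_from = freq, values_fill = 0)

wide
#   client  CANC   EMI  RENE
# 1 ABC12      2     2     1
# 2 JOE12      0     1     0
```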

You're on the right track; I think the only thing you need is a group_by. Something like this:
library(dplyr)
library(tidyr)
df <- data.frame(ClientID = c("JOE12_EMI",
                              "ABC12_CANC",
                              "ABC12_EMI",
                              "ABC12_RENE"))

df %>%
  separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
  group_by(id) %>%
  mutate(frequency = n()) %>%        # interactions per ID
  ungroup() %>%
  group_by(id, act_code) %>%
  mutate(act_frequency = n()) %>%    # interactions per ID and activity type
  ungroup() %>%
  spread(act_code, act_frequency)
(This does the sum by user and the pivot by activity type separately; it's possible to calculate the sum by user after pivoting, but this way is easier for me to read.)
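As a side note, spread() has been superseded by pivot_wider() in newer tidyr; a sketch of the same pipeline with the newer verbs (add_count() standing in for the group_by/mutate(n())/ungroup pairs):

```r
library(dplyr)
library(tidyr)

df <- data.frame(ClientID = c("JOE12_EMI", "ABC12_CANC", "ABC12_EMI", "ABC12_RENE"))

res <- df %>%
  separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
  add_count(id, name = "frequency") %>%                # interactions per ID
  add_count(id, act_code, name = "act_frequency") %>%  # per ID and activity
  pivot_wider(names_from = act_code, values_from = act_frequency)

res
#   id    frequency   EMI  CANC  RENE
# 1 JOE12         1     1    NA    NA
# 2 ABC12         3     1     1     1
```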


Calculating average rle$lengths over grouped data

I would like to calculate the duration of a state using rle() on grouped data. Here is a test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length for state == 1, grouped by ID. To calculate the rle average, I have created a function inspired by https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/:
rle_mean_lengths <- function(x, value) {
  r <- rle(x)
  cond <- r$values == value
  data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
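To see what this helper computes, note that rle() compresses a vector into run values and run lengths; for example:

```r
r <- rle(c(0, 0, 1, 1, 1, 0, 2, 1, 0))
r$values   # 0 1 0 2 1 0
r$lengths  # 2 3 1 1 1 1

# runs where the value is 1 have lengths 3 and 1,
# so count = 2 and avg_length = (3 + 1) / 2 = 2
cond <- r$values == 1
c(count = sum(cond), avg_length = mean(r$lengths[cond]))
#      count avg_length
#          2          2
```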
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
  ID count avg_length
1 L0     2          2
2 L1     2          2
L0 is correct, but L1 has no instances of state == 1, so its average should be zero or NA.
I isolated the problem by breaking it down into just a summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent of summarize_at() for do()? Or is there another fix? Thanks.
As the function returns a data.frame, the result is a data.frame column, and we may need to unnest afterwards:
library(dplyr)
library(tidyr)
DF %>%
  group_by(ID) %>%
  summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
  unnest(new)
Or remove the list and unpack:
DF %>%
  group_by(ID) %>%
  summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
  unpack(new)
# A tibble: 2 × 3
  ID    count avg_length
  <chr> <int>      <dbl>
1 L0        2          2
2 L1        0        NaN
In the OP's do code, the column should be extracted not from the whole data but from the data coming from the lhs, i.e. `.`. (Note that do is kind of deprecated, so it may be better to make use of summarise with unnest/unpack.)
DF %>%
  group_by(ID) %>%
  do(rle_mean_lengths(.$state, 1))
# A tibble: 2 × 3
# Groups:   ID [2]
  ID    count avg_length
  <chr> <int>      <dbl>
1 L0        2          2
2 L1        0        NaN
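For completeness, the same per-group computation can also be sketched in base R with by(); the small demo frame DF2 below is made up for illustration (one run of state 1 for L0, none for L1):

```r
# the helper from the question
rle_mean_lengths <- function(x, value) {
  r <- rle(x)
  cond <- r$values == value
  data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}

# hypothetical demo frame
DF2 <- data.frame(ID    = c("L0", "L0", "L0", "L1", "L1"),
                  state = c(1, 1, 0, 0, 0))

# apply the helper per ID and bind the one-row results
res <- do.call(rbind, by(DF2$state, DF2$ID, rle_mean_lengths, value = 1))
res
#    count avg_length
# L0     1          2
# L1     0        NaN
```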

Matching and returning values based on condition or ID

This seems like it should be fairly easy, but I'm having trouble with it.
Example: I have a dataframe with two columns IDs and perc_change. I want to know which unique IDs have had more than 30% change.
IDs <- c(1,1,2,1,1,2,2,2,3,2,3,4,5,6,3)
perc_change <- c(50,40,60,70,80,30,20,40,23,25,10,30,12,7,70)
df <- data.frame(IDs, perc_change)
So far:
if (df$perc_change > 30) {
  unique(df$IDs)
} else {
}
This obviously doesn't work because it returns all unique IDs. Should I be finding the index and then matching it or something?
Thanks in advance!
We could do the following to get the values for each ID:
library(dplyr)
df %>%
  group_by(IDs) %>%
  filter(perc_change > 30) %>%
  mutate(values = paste0(perc_change, collapse = ","), .keep = "unused") %>%
  distinct(IDs, .keep_all = TRUE)
Output:
    IDs values
  <dbl> <chr>
1     1 50,40,70,80
2     2 60,40
3     3 70
Just use [ to subset and take the unique values; no need for if/else conditions:
with(df, unique(IDs[perc_change > 30]))
[1] 1 2 3
We can group, filter and count using dplyr
library(dplyr)
df %>%
  group_by(IDs) %>%
  filter(perc_change > 30) %>%
  count(IDs)
# A tibble: 3 x 2
# Groups: IDs [3]
    IDs     n
  <dbl> <int>
1     1     4
2     2     2
3     3     1
unique(df[df$perc_change > 30,"IDs"])
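A base R split() variant that returns the qualifying values per ID, similar to the first dplyr answer (same demo data as the question):

```r
IDs <- c(1, 1, 2, 1, 1, 2, 2, 2, 3, 2, 3, 4, 5, 6, 3)
perc_change <- c(50, 40, 60, 70, 80, 30, 20, 40, 23, 25, 10, 30, 12, 7, 70)
df <- data.frame(IDs, perc_change)

# keep only rows above the threshold, then list the values per ID
keep <- df[df$perc_change > 30, ]
by_id <- split(keep$perc_change, keep$IDs)
by_id
# $`1`
# [1] 50 40 70 80
#
# $`2`
# [1] 60 40
#
# $`3`
# [1] 70
```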

Tally()ing Multiple Observations In an Entire Data Frame

I'm having trouble figuring out how to deal with a column that features several observations that I would like to tally. For example:
HTML/CSS;Java;JavaScript;Python;SQL
This is one of the cells for a column of a data frame and I'd like to tally the occurrences of each programming language. Is this something that should be tackled with str_detect(), with corpus(), or is there another way I'm not seeing?
My goal is to make each one of these languages (HTML, CSS, Java, JavaScript, Python, SQL, etc...) into a column name with the tally of how many times they occur in this column of the data frame.
I feel like I might've phrased this strangely so let me know if you need any clarification.
In the tidyverse, you can use separate_rows and count:
library(dplyr)
df %>% tidyr::separate_rows(PL, sep = ';') %>% count(PL)
In base R, we can split the string on the semi-colon and count with table:
table(unlist(strsplit(df$PL, ';')))
#If you need a dataframe
#stack(table(unlist(strsplit(df$PL, ';'))))
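A quick self-contained demonstration of the base R approach (the column name PL and the demo rows are assumed for illustration):

```r
# hypothetical demo data with semi-colon separated labels
df <- data.frame(PL = c("HTML/CSS;Java;JavaScript;Python;SQL",
                        "HTML/CSS;R;SQL"))

# split each cell on ";", flatten, and tally the labels
counts <- table(unlist(strsplit(as.character(df$PL), ";")))
counts
#   HTML/CSS       Java JavaScript     Python          R        SQL
#          2          1          1          1          1          2
```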
If you just want a total count of each label, you can use unnest_longer and a grouped count:
# using #DPH's example data
library(dplyr)
library(tidyr)
df %>%
  mutate(across(PL, strsplit, ";")) %>%
  unnest_longer(PL) %>%
  group_by(PL) %>%
  count()
# A tibble: 6 x 2
# Groups: PL [6]
  PL             n
  <chr>      <int>
1 HTML/CSS       2
2 Java           1
3 JavaScript     2
4 Python         1
5 R              3
6 SQL            2
If I understood your problem correctly, this would be a solution:
library(dplyr)
library(tidyr)
# demo data
df <- dplyr::tibble(ID = c("Line 1: ", "Line 2:"),
                    PL = c("HTML/CSS;JavaScript;Python;SQL;R",
                           "R;HTML/CSS;Java;JavaScript;SQL;R"))

# calculations
df %>%
  dplyr::mutate(PLANG = stringr::str_split(PL, ";")) %>%
  tidyr::unnest(c(PLANG)) %>%
  dplyr::group_by(ID, PLANG) %>%
  dplyr::count() %>%
  tidyr::pivot_wider(names_from = "PLANG", values_from = "n", values_fill = 0)
  ID         `HTML/CSS` JavaScript Python     R   SQL  Java
  <chr>           <int>      <int>  <int> <int> <int> <int>
1 "Line 1: "          1          1      1     1     1     0
2 "Line 2:"           1          1      0     2     1     1

RStudio: using dplyr to summarize (where like)

I'm trying to output the number of statuses that are open, grouped by ID. Please see the example below:
(note: "(1 status that is open)" is used to show why the count is 1; I don't want to output the sentence)
Reproducible code:
ID <- c(1,1,1,2,2,2)
Status <- c("status.open","status.closed", "status.wait", "status.open", "status.open", "status.wait" )
df <- data.frame(ID, Status)
pseudo-code:
df %>%
group_by(ID) %>%
summarize(count = length(Status where status like "%open"))
Please help, thanks!
You may achieve this with the following code:
require(dplyr)
df %>%
  filter(Status == "status.open") %>% # you only want status.open
  count(ID)                           # count members of ID
Which produces:
# A tibble: 2 x 2
# Groups:   ID [2]
     ID     n
  <dbl> <int>
1     1     1
2     2     2
Solution (as close as possible to your 'pseudo-code') using dplyr and grepl, relying on R's implicit conversion of logicals (TRUE becomes 1 in arithmetic):
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(count = sum(grepl("open", Status)))
Returns:
# A tibble: 2 x 2
     ID count
  <dbl> <int>
1     1     1
2     2     2
Roughly the equivalent of SQL's LIKE '%open' is:
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(Status, "open$")) # open$ = ends with "open"
What about this solution:
df %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize(count = sum(Status == "status.open"))
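For reference, a base R sketch without dplyr, using aggregate() (same demo data as the question):

```r
ID <- c(1, 1, 1, 2, 2, 2)
Status <- c("status.open", "status.closed", "status.wait",
            "status.open", "status.open", "status.wait")
df <- data.frame(ID, Status)

# count per ID how many statuses end in "open"
res <- aggregate(Status ~ ID, data = df, FUN = function(s) sum(grepl("open$", s)))
res
#   ID Status
# 1  1      1
# 2  2      2
```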

Dplyr keeps automatically adding one of my columns [duplicate]

This question already has answers here:
"Adding missing grouping variables" message in dplyr in R
(4 answers)
Closed 4 years ago.
So, I have a large data.frame with multiple columns of which "trial.number" and "indexer" are 2.
It annoys me that dplyr constantly, no matter what, adds indexer column.
A simple example:
saccade.df %>%
distinct(trial.number, .keep_all = F)
I would expect to see the unique trial.numbers and only the trial.number column. However, the output looks like this:
How do I stop dplyr from doing this? And why isn't it showing the unique trial.numbers, but only the unique indexer (which I didn't even ask for)?
example.df <- data.frame(trial.number = rep(1:10, each = 10),
                         time = seq(1:100),
                         indexer = rep(21:30, each = 10))

example.df %>%
  distinct(trial.number, .keep_all = F)
This does give the right output. However, it turns out I had somehow grouped my own variables.
Thanks!
Try ungroup:
library(dplyr)

df <- data.frame(trial.number = 1:2, indexer = 3:4)

df %>% distinct(trial.number)
#  trial.number
#1            1
#2            2

df %>% group_by(trial.number, indexer) %>% distinct(trial.number)
## A tibble: 2 x 2
## Groups:   trial.number, indexer [2]
#  trial.number indexer
#         <int>   <int>
#1            1       3
#2            2       4

df %>% group_by(trial.number, indexer) %>% ungroup() %>% distinct(trial.number)
## A tibble: 2 x 1
#  trial.number
#         <int>
#1            1
#2            2
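A handy way to check whether a data frame is still grouped, and hence which variables dplyr will re-add to verb output, is group_vars(); a small sketch:

```r
library(dplyr)

df <- data.frame(trial.number = 1:2, indexer = 3:4) %>%
  group_by(trial.number, indexer)

# which grouping variables are active on this data frame?
group_vars(df)
# [1] "trial.number" "indexer"
```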
