I'm having trouble figuring out how to handle a column whose cells each contain several values that I would like to tally. For example:
HTML/CSS;Java;JavaScript;Python;SQL
This is one of the cells for a column of a data frame and I'd like to tally the occurrences of each programming language. Is this something that should be tackled with str_detect(), with corpus(), or is there another way I'm not seeing?
My goal is to make each one of these languages (HTML, CSS, Java, JavaScript, Python, SQL, etc...) into a column name with the tally of how many times they occur in this column of the data frame.
I feel like I might've phrased this strangely so let me know if you need any clarification.
In tidyverse you can use separate_rows and count.
library(dplyr)
df %>% tidyr::separate_rows(PL, sep = ';') %>% count(PL)
In base R, we can split the string on the semicolon and count with table:
table(unlist(strsplit(df$PL, ';')))
#If you need a dataframe
#stack(table(unlist(strsplit(df$PL, ';'))))
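To get the languages as column names, as the question asks, the table can be turned into a one-row data frame. This is a minimal base R sketch, assuming a data frame `df` with a character column `PL` (both names are illustrative, and the sample strings are borrowed from the demo data in another answer):

```r
# Hypothetical sample data; `df` and `PL` are assumed names
df <- data.frame(
  PL = c("HTML/CSS;JavaScript;Python;SQL;R",
         "R;HTML/CSS;Java;JavaScript;SQL;R"),
  stringsAsFactors = FALSE
)

# Split every cell on ";" and tally the pieces
tab <- table(unlist(strsplit(df$PL, ";", fixed = TRUE)))

# Convert to a plain named vector, then to a one-row data frame;
# check.names = FALSE keeps names such as "HTML/CSS" unchanged
counts <- setNames(as.vector(tab), names(tab))
wide <- data.frame(t(counts), check.names = FALSE)
wide
```

Each language becomes a column and the single row holds its total count across the whole column.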
If you just want a total count of each label, you can use unnest_longer and a grouped count:
# using @DPH's example data
library(dplyr)
library(tidyr)
df %>%
mutate(across(PL, ~ strsplit(.x, ";"))) %>% # lambda form; passing extra arguments through across() is deprecated since dplyr 1.1.0
unnest_longer(PL) %>%
group_by(PL) %>%
count()
# A tibble: 6 x 2
# Groups: PL [6]
PL n
<chr> <int>
1 HTML/CSS 2
2 Java 1
3 JavaScript 2
4 Python 1
5 R 3
6 SQL 2
If I understood your problem correctly, this would be a solution:
library(dplyr)
library(tidyr)
# demo data
df <- dplyr::tibble(ID = c("Line 1: ","Line 2:"),
PL = c("HTML/CSS;JavaScript;Python;SQL;R","R;HTML/CSS;Java;JavaScript;SQL;R"))
# calculations
df %>%
dplyr::mutate(PLANG = stringr::str_split(PL, ";")) %>%
tidyr::unnest(c(PLANG)) %>%
dplyr::group_by(ID, PLANG) %>%
dplyr::count() %>%
tidyr::pivot_wider(names_from = "PLANG", values_from = "n", values_fill = 0)
ID `HTML/CSS` JavaScript Python R SQL Java
<chr> <int> <int> <int> <int> <int> <int>
1 "Line 1: " 1 1 1 1 1 0
2 "Line 2:" 1 1 0 2 1 1
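The question lists HTML and CSS as separate languages, while the answers above keep `HTML/CSS` as one label. If `/` only ever separates languages in the data (an assumption worth checking first), splitting on both delimiters is one option. A base R sketch with the same demo strings:

```r
# Demo strings taken from the example data above
pl <- c("HTML/CSS;JavaScript;Python;SQL;R",
        "R;HTML/CSS;Java;JavaScript;SQL;R")

# "[;/]" is a regular expression that matches either delimiter
langs <- unlist(strsplit(pl, "[;/]"))

# Tally each language; HTML and CSS are now counted separately
lang_counts <- table(langs)
lang_counts
```

If some labels legitimately contain a slash, this would split them too, so it is only a sketch.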
I have an example concatenated text field (please see sample data below) that is created from two or three different fields, however there is no guarantee that the order of the words will be the same. I would like to create a new dataset where fields with the same words, regardless of order, are collapsed. However, since I do not know in advance what words will be concatenated together, the code will have to recognize that all words in both strings match.
Code for example data:
var1<-c("BLUE|RED","RED|BLUE","WHITE|BLACK|ORANGE","BLACK|WHITE|ORANGE")
freq<-c(1,1,1,1)
have<-as.data.frame(cbind(var1,freq))
Have:
var1 freq
BLUE|RED 1
RED|BLUE 1
WHITE|BLACK|ORANGE 1
BLACK|WHITE|ORANGE 1
How can I collapse the data into what I want below?
color freq
BLUE|RED 2
WHITE|BLACK|ORANGE 2
data.frame(table(sapply(strsplit(have$var1, '\\|'),
function(x)paste(sort(x), collapse = '|'))))
Var1 Freq
1 BLACK|ORANGE|WHITE 2
2 BLUE|RED 2
In the world of piping (R >= 4.1, which introduced the native pipe |> and the \(x) lambda shorthand):
have$var1 |>
strsplit('\\|')|>
sapply(\(x)paste0(sort(x), collapse = "|"))|>
table()|>
data.frame()
Here is a tidyverse approach:
library(dplyr)
library(tidyr)
have %>%
group_by(id=row_number()) %>%
separate_rows(var1) %>%
arrange(var1, .by_group = TRUE) %>%
mutate(var1 = paste(var1, collapse = "|")) %>%
slice(1) %>%
ungroup() %>%
count(var1, name = "freq")
var1 freq
<chr> <int>
1 BLACK|ORANGE|WHITE 2
2 BLUE|RED 2
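The answers above count rows, which matches the summed freq only because every freq is 1. If freq can differ between rows, summing it per order-independent key may be closer to what is wanted. A base R sketch; it builds `have` with data.frame() instead of cbind() so that freq stays numeric (cbind() of a character and a numeric vector yields a character matrix), and the helper name `canonical` is made up for illustration:

```r
have <- data.frame(
  var1 = c("BLUE|RED", "RED|BLUE", "WHITE|BLACK|ORANGE", "BLACK|WHITE|ORANGE"),
  freq = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sort the words so that word order no longer matters
canonical <- function(words) paste(sort(words), collapse = "|")

# Build one canonical key per row
have$color <- vapply(strsplit(have$var1, "|", fixed = TRUE),
                     canonical, character(1))

# Sum freq within each key
want <- aggregate(freq ~ color, data = have, FUN = sum)
want
```

The keys come out alphabetically sorted (BLACK|ORANGE|WHITE rather than WHITE|BLACK|ORANGE), as in the answers above.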
Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output
I have tried about a dozen different ideas without getting close. The closest was setting up a crosstab, but I didn't know how to get counts from it. Reshaping long/wide got me nowhere. I guess I'm still too new to think outside the box.
Try this approach. You can arrange your values and then use group_by() and summarise() to reach a structure similar to the one expected:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
arrange(Guest,State) %>%
group_by(Guest) %>%
summarise(Chain=paste0(State,collapse = '-')) %>%
group_by(Chain, .drop = TRUE) %>%
summarise(N=n())
Output:
# A tibble: 4 x 2
Chain N
<chr> <int>
1 AL-MA-TX 1
2 AL-TX 2
3 IA-MA 2
4 IA-TX 1
We can use base R with aggregate and table
table(aggregate(State~ Guest, df[do.call(order, df),], paste, collapse='-')$State)
-output
# AL-MA-TX AL-TX IA-MA IA-TX
# 1 2 2 1
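The same idea can be spelled out step by step, which may be easier to follow than the one-liner: build one sorted "chain" of states per guest, then tabulate the chains. A base R sketch using the sample data from the question (the names `chains` and `chain_counts` are illustrative):

```r
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob",
           "fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest, State)

# For each guest, sort the visited states and join them into one key
chains <- tapply(as.character(df$State), df$Guest,
                 function(s) paste(sort(s), collapse = "-"))

# Count how many guests share each chain
chain_counts <- table(chains)
chain_counts
```

Sorting inside each guest is what makes TX-IA and IA-TX count as the same combination.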
I'm trying to sum the number of interactions for each unique ID.
So I have something like this:
Client ID
JOE12_EMI
ABC12_CANC
ABC12_EMI
ABC12_RENE
and so on...
It'll also have a column next to it that counts how many times each unique ID repeats.
Frequency
1
2
2
1
Is there a way that I can have all the activity types (EMI, TELI, PFL) summed for each ID and then placed into new columns?
I've tried to transpose the data by separating the actual ID from the activity type, but this doesn't return the sums. I'm not sure whether that's the best way, or whether I should transpose the data to wide format and then apply another sum, and I am unsure how to go about it. Thank you for any help.
separate(frequency, id, c("id", "act_code") )
nd <- melt(frequency, id=(c("id")))
Try this:
library(dplyr)
data=data.frame(Client_ID= c("JOE12_EMI",
"ABC12_CANC",
"ABC12_EMI",
"ABC12_RENE"),
frequency= c(1,2,2,1))
client_and_id <- as.data.frame(do.call(rbind, strsplit(as.character(data$Client_ID), "_")))
names(client_and_id) <- c("client", "id")
data <- cbind(data, client_and_id)
data_sum <- data %>% group_by(id) %>% mutate(sum_freq = sum(frequency))
The output
> data_sum
# A tibble: 4 x 5
# Groups: id [3]
Client_ID frequency client id sum_freq
<fct> <dbl> <fct> <fct> <dbl>
1 JOE12_EMI 1 JOE12 EMI 3
2 ABC12_CANC 2 ABC12 CANC 2
3 ABC12_EMI 2 ABC12 EMI 3
4 ABC12_RENE 1 ABC12 RENE 1
You can also display the output by ID:
distinct(data_sum %>% dplyr::select(id, sum_freq))
# A tibble: 3 x 2
# Groups: id [3]
id sum_freq
<fct> <dbl>
1 EMI 3
2 CANC 2
3 RENE 1
You're on the right track; I think the only thing you need is a group_by. Something like this:
library(dplyr)
library(tidyr)
df = data.frame(ClientID = c("JOE12_EMI",
"ABC12_CANC",
"ABC12_EMI",
"ABC12_RENE"))
df %>%
separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
group_by(id) %>%
mutate(frequency = n()) %>%
ungroup() %>%
group_by(id, act_code) %>%
mutate(act_frequency = n()) %>%
ungroup() %>%
spread(act_code, act_frequency)
(This does the sum by user and the pivot by activity type separately; it's possible to calculate the sum by user after pivoting, but this way is easier for me to read.)
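If the goal is one row per client with one column per activity type holding the summed frequency, xtabs() in base R does the split-sum-pivot in one step. A minimal sketch, assuming every Client ID contains exactly one underscore and reusing the sample values from the question (the column names `client` and `act_code` are illustrative):

```r
data <- data.frame(
  Client_ID = c("JOE12_EMI", "ABC12_CANC", "ABC12_EMI", "ABC12_RENE"),
  frequency = c(1, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Split "JOE12_EMI" into client "JOE12" and activity code "EMI"
parts <- do.call(rbind, strsplit(data$Client_ID, "_", fixed = TRUE))
data$client <- parts[, 1]
data$act_code <- parts[, 2]

# Sum frequency for every client/activity pair; missing pairs become 0
wide <- xtabs(frequency ~ client + act_code, data = data)
wide

# Totals per activity type across all clients
colSums(wide)
```

as.data.frame.matrix(wide) converts the result to a regular data frame if one is needed.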
So, I have a large data.frame with multiple columns, two of which are "trial.number" and "indexer".
It annoys me that dplyr constantly, no matter what, adds the indexer column.
A simple example:
saccade.df %>%
distinct(trial.number, .keep_all = F)
I would expect to see the unique trial.numbers and only the trial.number column. However, the output looks like this:
How do I stop dplyr from doing this? And why isn't it showing the unique trial.numbers but only the unique indexer (for which I didn't even ask)?
example.df <- data.frame(trial.number = rep(1:10, each = 10),
time = 1:100, indexer = rep(21:30, each = 10))
example.df %>%
distinct(trial.number, .keep_all = F)
This does give the right output. However, I somehow grouped my own variables.
Thanks!
Try ungroup :
df <- data.frame(trial.number=1:2,indexer=3:4)
df %>% distinct(trial.number)
# trial.number
#1 1
#2 2
df %>% group_by(trial.number,indexer) %>% distinct(trial.number)
## A tibble: 2 x 2
## Groups: trial.number, indexer [2]
# trial.number indexer
# <int> <int>
#1 1 3
#2 2 4
df %>% group_by(trial.number,indexer) %>% ungroup %>% distinct(trial.number)
## A tibble: 2 x 1
# trial.number
# <int>
#1 1
#2 2
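To check whether a data frame is carrying a grouping you forgot about, dplyr's group_vars() lists the current grouping columns. A small sketch, assuming the example data from the question and an accidental grouping by indexer (the object names are illustrative):

```r
library(dplyr)

example.df <- data.frame(trial.number = rep(1:10, each = 10),
                         time = 1:100,
                         indexer = rep(21:30, each = 10))

# Simulate a data frame that was grouped earlier and never ungrouped
grouped <- group_by(example.df, indexer)

# Lists the grouping columns; character(0) would mean no grouping
group_vars(grouped)

# distinct() keeps grouping columns, so indexer comes along
res_grouped <- distinct(grouped, trial.number)

# After ungroup(), only the requested column remains
res_plain <- grouped %>% ungroup() %>% distinct(trial.number)
names(res_plain)
```

If group_vars() returns anything for your own data frame, an earlier group_by() is the likely source of the extra column.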
Some questions are similar to this topic (here or here, as an example) and I know one solution that works, but I want a more elegant response.
I work in epidemiology and I have variables 1 and 0 (or NA). Example:
Does the patient have cancer?
NA or 0 is no
1 is yes
Let's say I have several variables in my dataset and I want to count only the values equal to 1 in each of them. It's a classic frequency table, but dplyr is making things more complicated than I imagined at first glance.
My code is working:
dataset %>%
select(VISimpair, HEARimpai, IntDis, PhyDis, EmBehDis, LearnDis,
ComDis, ASD, HealthImpair, DevDelays) %>% # replace to your needs
summarise_all(funs(sum(1-is.na(.))))
And you can reproduce this code here:
library(tidyverse)
dataset <- data.frame(var1 = rep(c(NA,1),100), var2=rep(c(NA,1),100))
dataset %>% select(var1, var2) %>% summarise_all(funs(sum(1-is.na(.))))
But what I really want is to select the variables of interest, count how many 0s (or NAs) and how many 1s each one has, report them, and have this output
Thanks.
What about the following frequency table per variable?
First, I edit your sample data to also include 0's and load the necessary libraries.
library(tidyr)
library(dplyr)
dataset <- data.frame(var1 = rep(c(NA,1,0),100), var2=rep(c(NA,1,0),100))
Second, I convert the data using gather to make it easier to group_by later for the frequency table created by count, as mentioned by CPak.
dataset %>%
select(var1, var2) %>%
gather(var, val) %>%
mutate(val = factor(val)) %>%
group_by(var, val) %>%
count()
# A tibble: 6 x 3
# Groups: var, val [6]
var val n
<chr> <fct> <int>
1 var1 0 100
2 var1 1 100
3 var1 NA 100
4 var2 0 100
5 var2 1 100
6 var2 NA 100
A quick and dirty method to do this is to coerce your input into factors:
dataset$var1 = as.factor(dataset$var1)
dataset$var2 = as.factor(dataset$var2)
summary(dataset$var1)
summary(dataset$var2)
summary() tells you the number of occurrences of each level of the factor, with NAs counted separately.
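If a compact table with one column per variable is preferred, the three counts can also be computed directly in base R without reshaping. A minimal sketch, assuming the 0/1/NA sample data from the answer above; the row labels `zero`, `one` and `missing` are made up for illustration:

```r
dataset <- data.frame(var1 = rep(c(NA, 1, 0), 100),
                      var2 = rep(c(NA, 1, 0), 100))

# For each column, count the 0s, the 1s and the NAs
count_levels <- function(x) {
  c(zero    = sum(x == 0, na.rm = TRUE),
    one     = sum(x == 1, na.rm = TRUE),
    missing = sum(is.na(x)))
}

# vapply() returns a 3 x ncol matrix: one row per count, one column per variable
counts <- vapply(dataset, count_levels, integer(3))
counts
```

To restrict this to chosen variables, subset the data frame first, for example dataset[c("var1", "var2")].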