Counting the number of strings when multiple elements share one cell - R

I have a vector A <- c("Tom; Jerry", "Lisa; Marc")
and am trying to identify the number of occurrences of every name.
I already used the code:
sort(table(unlist(strsplit(A, ""))), decreasing = TRUE)
However, this code only produces output like this:
Tom; Jerry: 1 - Lisa; Marc: 1
I am looking for a way to count every name, even though two names share one cell. Consequently, my preferred result would be:
Tom: 1 Jerry: 1 Lisa: 1 Marc: 1

The split should be on ";" followed by zero or more spaces (\\s*):
sort(table(unlist(strsplit(A, ";\\s*"))), decreasing = TRUE)
-output
Jerry  Lisa  Marc   Tom 
    1     1     1     1 
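A quick check that the \\s* part earns its keep: it makes the split tolerant of inconsistent spacing around the semicolon, and repeated names are tallied correctly. A sketch with a hypothetical vector mixing "; " and ";":

```r
# Hypothetical vector: separators "; " and ";" both occur, and "Tom" repeats
A2 <- c("Tom; Jerry", "Lisa;Marc", "Tom")
sort(table(unlist(strsplit(A2, ";\\s*"))), decreasing = TRUE)
# Tom is counted twice, the other names once
```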

Use separate_rows to split the strings, group_by the names and summarise them:
library(tidyverse)
data.frame(A) %>%
  separate_rows(A, sep = "; ") %>%
  group_by(A) %>%
  summarise(N = n())
# A tibble: 4 × 2
  A         N
  <chr> <int>
1 Jerry     1
2 Lisa      1
3 Marc      1
4 Tom       1
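The group_by/summarise pair can also be collapsed with count(), which groups and tallies in one step. A sketch against the same two-cell vector:

```r
library(dplyr)
library(tidyr)

A <- c("Tom; Jerry", "Lisa; Marc")
data.frame(A) %>%
  separate_rows(A, sep = ";\\s*") %>%  # split cells on ";" plus optional spaces
  count(A, name = "N")                 # group and tally in one step
```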


R: Using regular expression to keep rows of data with 6 digits

mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
                    name = c("Adam", "Jane", "TJ", "Joyce"))
> mydat
           id  name
1      372303  Adam
2      KN5232  Jane
3      231244    TJ
4 283472-3822 Joyce
In my dataset, I want to keep the rows where id is a 6-digit number. For those that contain a 6-digit number followed by a - and a 4-digit number, I just want to keep the first 6 digits.
My final data should look like this:
> mydat2
      id  name
1 372303  Adam
3 231244    TJ
4 283472 Joyce
I am using grep("^[0-9]{6}$", c("372303", "KN5232", "231244", "283472-3822")), but this does not account for the case where I want to keep only the first 6 digits before the -.
One method would be to split at the - and then extract with filter or subset:
library(dplyr)
library(tidyr)
library(stringr)
mydat %>%
  separate_rows(id, sep = "-") %>%
  filter(str_detect(id, '^\\d{6}$'))
-output
# A tibble: 3 × 2
  id     name 
  <chr>  <chr>
1 372303 Adam 
2 231244 TJ   
3 283472 Joyce
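The same trim-then-filter logic works in base R without reshaping rows: strip everything from the first - onward, then keep the ids that are exactly six digits. A sketch on the same data:

```r
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
                    name = c("Adam", "Jane", "TJ", "Joyce"))

mydat$id <- sub("-.*$", "", mydat$id)   # drop "-" and everything after it
mydat[grepl("^\\d{6}$", mydat$id), ]    # keep rows whose id is exactly 6 digits
```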
You can extract the first standalone 6-digit number from each id and then keep only the rows that have one:
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
                    name = c("Adam", "Jane", "TJ", "Joyce"))
library(stringr)
mydat$id <- str_extract(mydat$id, "\\b\\d{6}\\b")
mydat[grepl("^\\d{6}$",mydat$id),]
Output:
      id  name
1 372303  Adam
3 231244    TJ
4 283472 Joyce
The \b\d{6}\b pattern matches 6-digit codes as standalone numbers, since \b is a word boundary.
You could also extract the first 6-digit number with a very simple regex (\\d{6}), convert to numeric (as I would expect you would anyway) and remove NAs.
E.g.
library(dplyr)
library(stringr)
mydat |>
  mutate(id = as.numeric(str_extract(id, "\\d{6}"))) |>
  na.omit()
Output:
      id  name
1 372303  Adam
3 231244    TJ
4 283472 Joyce

How to arrange my dataframe by number of characters in a column?

I have a dataset:
id  value
1   "include details"
2   "language"
2   "describe what you've tried"
How could I arrange it by the number of characters of the strings in the value column? %>% arrange(value) doesn't work. How can I do that?
I would use stringr package:
Data:
df <- data.frame(id = c(1, 2, 3),
                 value = c("include details", "language", "describe what you've tried"))
Code:
library(stringr)
df %>%
  arrange(str_count(value))
Output:
  id value
1  2 language
2  1 include details
3  3 describe what you've tried
Arrange it by nchar -
library(dplyr)
df %>% arrange(nchar(value))
#  id value
#1  2 language
#2  1 include details
#3  2 describe what you've tried
Or in descending order -
df %>% arrange(desc(nchar(value)))
#  id value
#1  2 describe what you've tried
#2  1 include details
#3  2 language
Or in base R -
df[order(nchar(df$value)), ]
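For completeness, the descending version in base R mirrors the dplyr desc() call by passing decreasing = TRUE to order(). A sketch with the sample data from above:

```r
df <- data.frame(id = c(1, 2, 3),
                 value = c("include details", "language", "describe what you've tried"))

df[order(nchar(df$value), decreasing = TRUE), ]  # longest strings first
```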

Count how many characters from a column appear in another column

I am trying to count how many characters from column expected appear in column read.
They may appear in different order and they should not be counted twice.
For example, in this df
df <- tibble::tibble(expected = c("AL0", "CP1", "NM3", "PK9", "RM2"),
                     read = c("AL0X24", "CXP44", "MLN", "KKRR9", "22MMRRS"))
The result should be:
result <- c(3,2,2,2,3)
An option with str_extract_all/n_distinct: wrap the 'expected' column string in [ and ] using paste to build a character class, extract all characters of 'read' that match it, and count the number of distinct matches with n_distinct:
library(stringr)
library(dplyr)
with(df, sapply(str_extract_all(read, paste0("[", expected, "]")), n_distinct))
#[1] 3 2 2 2 3
Or another option with str_replace_all and str_count: first collapse runs of repeated characters in 'read' with str_replace_all, then count the characters of 'expected' (again pasted into a [ ] character class):
df %>%
  mutate(Count = str_count(str_replace_all(read, "(\\w)\\1+", "\\1"),
                           str_c("[", expected, "]")))
# A tibble: 5 x 3
#  expected read    Count
#  <chr>    <chr>   <int>
#1 AL0      AL0X24      3
#2 CP1      CXP44       2
#3 NM3      MLN         2
#4 PK9      KKRR9       2
#5 RM2      22MMRRS     3
One option could be:
mapply(function(x, y) sum(x %in% unique(y)),
       x = strsplit(df$expected, ""),
       y = strsplit(df$read, ""))
[1] 3 2 2 2 3
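A quick sanity check on a single row, assuming the same data: "PK9" against "KKRR9" shares "K" and "9", and the repeated letters in read are not double-counted because membership is tested once per character of expected:

```r
# One row worked by hand: "PK9" vs "KKRR9"
expected_chars <- strsplit("PK9", "")[[1]]    # "P" "K" "9"
read_chars     <- strsplit("KKRR9", "")[[1]]  # "K" "K" "R" "R" "9"
sum(expected_chars %in% read_chars)           # "K" and "9" match
```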

R creating combinations with replacement

I have a small example like the following:
df1 = data.frame(Id1=c(1,2,3))
I want to obtain the list of all combinations with replacement, i.e. the pairs (1,1), (1,2), (1,3), (2,2), (2,3), (3,3).
So far I have seen the following functions, which produce parts of this list:
a) combn function
t(combn(df1$Id1, 2))
# Does not create the (1,1), (2,2) and (3,3) rows
b) expand.grid function
expand.grid(df1$Id1, df1$Id1)
# Duplicates pairs: in my case the combinations (1,2) and (2,1)
# are the same, so I do not need both of them.
c) CJ function (from data.table)
#install.packages("data.table")
CJ(df1$Id1, df1$Id1)
# Same problem as the previous function
For your reference, I know that in Python I could do the same using the itertools package (link here: https://www.hackerrank.com/challenges/itertools-combinations-with-replacement/problem).
Is there a way to do this in R?
Here's an alternative using expand.grid: create a unique key for every combination and then remove duplicates.
library(dplyr)
expand.grid(df1$Id1, df1$Id1) %>%
  mutate(key = paste(pmin(Var1, Var2), pmax(Var1, Var2), sep = "-")) %>%
  filter(!duplicated(key)) %>%
  select(-key) %>%
  mutate(row = row_number())
#  Var1 Var2 row
#1    1    1   1
#2    2    1   2
#3    3    1   3
#4    2    2   4
#5    3    2   5
#6    3    3   6
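The dedup-by-key step can also be avoided entirely in base R: generate the full Cartesian product and keep only one ordering of each pair with a simple comparison. A sketch producing the same six rows:

```r
df1 <- data.frame(Id1 = c(1, 2, 3))

g <- expand.grid(Var1 = df1$Id1, Var2 = df1$Id1)
g[g$Var1 >= g$Var2, ]   # keep one representative of each unordered pair
```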

Using regex to extract email address after # in dplyr pipe and then groupby to count occurrences [duplicate]

This question already has an answer here:
Filtering observations in dplyr in combination with grepl
I have a dataframe with a column called email. I want to find the part of each email address after the # symbol, then group by it (e.g. gmail, yahoo, hotmail) and count the occurrences.
registrant_email
chamukan#yahoo.com
tmrsons1974#yahoo.com
123ajumohan#gmail.com
123#websiterecovery.org
salesdesk#2techbrothers.com
salesdesk#2techbrothers.com
I can extract the part after # using the code below:
sub(".*#", "", df$registrant_email)
How can I use it in a dplyr pipe and then count the occurrences of each domain?
tidyr::separate is useful for splitting columns:
library(tidyr)
library(dplyr)
# separate email into `user` and `domain` columns
df %>%
  separate(registrant_email, into = c('user', 'domain'), sep = '#') %>%
  # tally occurrences for each level of `domain`
  count(domain)
## # A tibble: 4 x 2
##   domain                  n
##   <chr>               <int>
## 1 2techbrothers.com       2
## 2 gmail.com               1
## 3 websiterecovery.org     1
## 4 yahoo.com               2
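The OP's own sub() call also drops straight into a pipe: build a domain column with mutate and tally it with count. A sketch, assuming a data frame df with the registrant_email column shown above:

```r
library(dplyr)

df <- data.frame(registrant_email = c("chamukan#yahoo.com", "tmrsons1974#yahoo.com",
                                      "123ajumohan#gmail.com", "123#websiterecovery.org",
                                      "salesdesk#2techbrothers.com",
                                      "salesdesk#2techbrothers.com"))

df %>%
  mutate(domain = sub(".*#", "", registrant_email)) %>%  # keep text after "#"
  count(domain)                                          # tally each domain
```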
By first splitting into a character matrix and coercing to a data.frame, we can use common dplyr idioms:
library(dplyr)
library(stringr)
str_split_fixed(df$registrant_email, pattern = "#", n = 2) %>%
  data.frame() %>%
  group_by(X2) %>%
  count(X1)
The result is as follows
  X2                  X1              n
  <fctr>              <fctr>      <int>
1 2techbrothers.com   salesdesk       2
2 gmail.com           123ajumohan     1
3 websiterecovery.org 123             1
4 yahoo.com           chamukan        1
5 yahoo.com           tmrsons1974     1
If you want to set variable names for better code comprehension, you can use
str_split_fixed(df$registrant_email, pattern = "#", n = 2) %>%
  data.frame() %>%
  setNames(c("local", "domain")) %>%
  group_by(domain) %>%
  count(local)
We can use base R methods for this
aggregate(V1 ~ V2, read.table(text = df1$registrant_email,
                              sep = "#", stringsAsFactors = FALSE),
          FUN = length)
# V2 V1
#1 2techbrothers.com 2
#2 gmail.com 1
#3 websiterecovery.org 1
#4 yahoo.com 2
Or using the OP's method and wrap it with table
as.data.frame(table(sub(".*#", "", df1$registrant_email)))
# Var1 Freq
#1 2techbrothers.com 2
#2 gmail.com 1
#3 websiterecovery.org 1
#4 yahoo.com 2
