R: Using regular expression to keep rows of data with 6 digits

mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
name = c("Adam", "Jane", "TJ", "Joyce"))
> mydat
id name
1 372303 Adam
2 KN5232 Jane
3 231244 TJ
4 283472-3822 Joyce
In my dataset, I want to keep the rows where id is a 6-digit number. For ids that consist of a 6-digit number followed by - and a 4-digit number, I just want to keep the first 6 digits.
My final data should look like this:
> mydat2
id name
1 372303 Adam
3 231244 TJ
2 283472 Joyce
I am using the following grep("^[0-9]{6}$", c("372303", "KN5232", "231244", "283472-3822")) but this does not account for the case where I want to only keep the first 6 digits before the -.

One method would be to split at - and then extract with filter or subset
library(dplyr)
library(tidyr)
library(stringr)
mydat %>%
separate_rows(id, sep = "-") %>%
filter(str_detect(id, '^\\d{6}$'))
-output
# A tibble: 3 × 2
id name
<chr> <chr>
1 372303 Adam
2 231244 TJ
3 283472 Joyce

You can extract the first standalone 6-digit number from each ID and then only keep the items with 6-digit codes only:
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),name = c("Adam", "Jane", "TJ", "Joyce"))
library(stringr)
mydat$id <- str_extract(mydat$id, "\\b\\d{6}\\b")
mydat[grepl("^\\d{6}$",mydat$id),]
Output:
id name
1 372303 Adam
3 231244 TJ
4 283472 Joyce
The \b\d{6}\b matches 6-digit codes as standalone numbers since \b are word boundaries.
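To see what the word boundaries change, here is a small base R sketch. The extra id "1234567" is a hypothetical value added for illustration, and perl = TRUE is used only so that \b and \d are interpreted as PCRE.

```r
# "1234567" is a made-up id, not part of the original data
ids <- c("372303", "1234567", "283472-3822", "KN5232")

# Without boundaries: TRUE for any string containing six consecutive digits
loose <- grepl("\\d{6}", ids, perl = TRUE)

# With boundaries: the six digits must not touch another word character
strict <- grepl("\\b\\d{6}\\b", ids, perl = TRUE)

data.frame(ids, loose, strict)
```

The 7-digit id passes the loose test but fails the strict one, while "283472-3822" passes both because "-" is not a word character.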

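You could also extract the first run of six digits with a very simple regex (\\d{6}), convert it to numeric (as I would expect you would anyway) and remove the NAs. Note that str_extract() is used here rather than str_extract_all(): the latter returns a list, which as.numeric() cannot coerce when an element has no match.
E.g.
library(dplyr)
library(stringr)
mydat |>
  mutate(id = as.numeric(str_extract(id, "\\d{6}"))) |>
  na.omit()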
Output:
id name
1 372303 Adam
3 231244 TJ
4 283472 Joyce
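If you would rather not load any packages, a stricter variant can be sketched in base R. It assumes, as in the question, that the only allowed suffix is "-" followed by exactly four digits; the names pattern and keep are introduced here for illustration.

```r
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
                    name = c("Adam", "Jane", "TJ", "Joyce"),
                    stringsAsFactors = FALSE)

# Six digits, optionally followed by "-" and four digits, and nothing else
pattern <- "^(\\d{6})(-\\d{4})?$"

keep <- grepl(pattern, mydat$id, perl = TRUE)              # rows that fit the pattern
mydat2 <- mydat[keep, ]
mydat2$id <- sub(pattern, "\\1", mydat2$id, perl = TRUE)   # keep only the first group
mydat2
```

Because the pattern is anchored at both ends, an id such as "1234567" would be dropped rather than truncated to its first six digits.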

Related

Counting number of strings despite multiple elements in one cell

I have a vector A <- c("Tom; Jerry", "Lisa; Marc")
and am trying to identify the number of occurrences of every name.
I already used the code:
sort(table(unlist(strsplit(A, ""))), decreasing = TRUE)
However, this code is only able to create output like this:
Tom; Jerry: 1 - Lisa; Marc: 1
I am looking for a way to count every name, even though two names are present in one cell. My preferred result would therefore be:
Tom: 1 Jerry: 1 Lisa: 1 Marc:1
The split should be ; followed by zero or more spaces (\\s*)
sort(table(unlist(strsplit(A, ";\\s*"))), decreasing = TRUE)
-output
Jerry Lisa Marc Tom
1 1 1 1
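The reason for \\s* rather than a literal space shows up when the spacing is inconsistent. A minimal sketch, using a hypothetical vector with irregular separators:

```r
# Made-up input: no space, one space, and two spaces after the semicolon
A <- c("Tom; Jerry", "Lisa;Marc", "Tom;  Lisa")

# ";\\s*" consumes the semicolon and any whitespace that follows it
names_flat <- unlist(strsplit(A, ";\\s*"))
counts <- sort(table(names_flat), decreasing = TRUE)
counts
```

Splitting on "; " instead would leave "Lisa;Marc" unsplit and would keep a leading space on the second "Lisa".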
Use separate_rows to split the strings, group_by the names and summarise them:
library(tidyverse)
data.frame(A) %>%
separate_rows(A, sep = "; ") %>%
group_by(A) %>%
summarise(N = n())
# A tibble: 4 × 2
A N
<chr> <int>
1 Jerry 1
2 Lisa 1
3 Marc 1
4 Tom 1

R: Merge rows that share same code and at least one or more strings in name-column

I would like to merge rows in a dataframe if they have at least one word in common and have the same value for 'code'. The column to be searched for matching words is "name". Here's an example dataset:
df <- data.frame(
id = 1:8,
name = c("tiger ltd", "tiger cpy", "tiger", "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c(rep("4564AB", 3), rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The approach that I envision would look something like this:
use group_by on the code-column,
check if the group contains 2 or more rows,
check if there are any shared words among the different rows. If so, merge those rows and combine the information into a single row.
The final dataset would look like this:
final_df <- data.frame(
id = c("1|2|3", 4:8),
name = c(paste(c("tiger ltd", "tiger cpy", "tiger"), collapse = "|"), "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c("4564AB", rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The first three rows have the common word 'tiger' and the same code. Therefore they are merged into a single row with the different values separated by "|". The other rows are not merged because they either do not have a word in common or do not have the same code.
We could use an if/else condition after grouping. Extract the words from the 'name' column and look for elements common to all rows of the group (reduce(intersect)). Create a flag that is TRUE when at least one word is shared and the group has more than one row (n() > 1), then use that flag to decide whether to collapse the other columns with str_c.
library(dplyr)
library(stringr)
library(purrr)
library(magrittr)
df %>%
group_by(code = factor(code, levels = unique(code))) %>%
mutate(flag = n() > 1 &
(str_extract_all(name, "\\w+") %>%
reduce(intersect) %>%
length %>%
is_greater_than(0))) %>%
summarise(across(-flag, ~ if(any(flag))
str_c(.x, collapse = "|") else as.character(.x)), .groups = 'drop') %>%
select(names(df))
-output
# A tibble: 6 × 3
id name code
<chr> <chr> <fct>
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
-OP's expected
> final_df
id name code
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
You can use this helper function f(), and apply it to each group:
f <- function(d) {
if(length(Reduce(intersect,strsplit(d[["name"]]," ")))>0) {
d = lapply(d,paste0,collapse="|")
}
return(d)
}
library(data.table)
setDT(df)[,id:=as.character(id)][, f(.SD),code]
Output:
code id name
<char> <char> <char>
1: 4564AB 1|2|3 tiger ltd|tiger cpy|tiger
2: 7845BC 4 rhino
3: 7845BC 5 hippo
4: 6144DE 6 elephant
5: 7845KI 7 elephant bros
6: 7845EG 8 last comp
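The same logic can be sketched in base R with split() and lapply(). The helper name merge_group is hypothetical. Like both answers above, it merges a group only when a word is shared by every row of that group; merging rows that overlap only pairwise would need a different approach.

```r
df <- data.frame(
  id = 1:8,
  name = c("tiger ltd", "tiger cpy", "tiger", "rhino", "hippo",
           "elephant", "elephant bros", "last comp"),
  code = c(rep("4564AB", 3), rep("7845BC", 2), "6144DE", "7845KI", "7845EG"),
  stringsAsFactors = FALSE
)

# Collapse a group to one row if it has several rows that all share a word
merge_group <- function(d) {
  shared <- Reduce(intersect, strsplit(d$name, "\\s+"))
  if (nrow(d) > 1 && length(shared) > 0) {
    data.frame(id = paste(d$id, collapse = "|"),
               name = paste(d$name, collapse = "|"),
               code = d$code[1],
               stringsAsFactors = FALSE)
  } else {
    transform(d, id = as.character(id))  # keep rows, make id type consistent
  }
}

# Split by code in order of first appearance, then stack the pieces again
groups <- split(df, factor(df$code, levels = unique(df$code)))
result <- do.call(rbind, lapply(groups, merge_group))
rownames(result) <- NULL
result
```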

How to arrange my dataframe by number of characters in column?

I have a dataset:
id value
1 "include details"
2 "language"
2 "describe what you've tried"
How could I arrange it by the number of characters in the string column value? %>% arrange(value) doesn't work. How can I do that?
I would use stringr package:
Data:
df <- data.frame(id = c(1,2,3),
value = c("include details","language","describe what you've tried"))
Code:
library(dplyr)    # provides %>% and arrange()
library(stringr)
df %>%
  arrange(str_count(value))
Output:
id value
1 2 language
2 1 include details
3 3 describe what you've tried
arrange it by nchar -
library(dplyr)
df %>% arrange(nchar(value))
# id value
#1 2 language
#2 1 include details
#3 2 describe what you've tried
Or in descending order -
df %>% arrange(desc(nchar(value)))
# id value
#1 2 describe what you've tried
#2 1 include details
#3 2 language
Or in base R -
df[order(nchar(df$value)), ]
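For reference, arrange(value) sorts the strings alphabetically, which is why it does not give the desired order. A short base R sketch that makes the sort key explicit, using the data frame from the first answer (the variable names lens and sorted are illustrative):

```r
df <- data.frame(id = c(1, 2, 3),
                 value = c("include details", "language",
                           "describe what you've tried"),
                 stringsAsFactors = FALSE)

lens <- nchar(df$value)                          # character count per row
sorted <- df[order(lens, decreasing = TRUE), ]   # longest string first
sorted
```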

How to identify the text that sentences have in common?

I would like to find the text or string that appeared in 3 of my columns.
> dput(df1)
structure(list(Jan = "The price of oil declined.", Feb = "The price of gold declined.",
Mar = "Prices remained unchanged."), row.names = c(NA, -1L
), class = c("tbl_df", "tbl", "data.frame"))
I want to get something like
Word Count
The 2
price 3
declined 2
of 2
Thank you.
You can count the occurrence of each word in the text and keep only the ones that occur more than once.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = everything()) %>%
separate_rows(value, sep = '\\s+') %>%
mutate(value = tolower(gsub('[[:punct:]]', '', value))) %>%
count(value) %>%
filter(n > 1)
Maybe this:
setNames(
  data.frame(table(unlist(strsplit(
    trimws(tolower(stack(df1)$values), whitespace = '\\.'), '\\s+', perl = TRUE
  )))),
  c('words', 'Frequency')
)
stack(df1) reshapes the one-row data frame into long form, so its values column holds all the sentences. tolower() normalises the case, trimws(..., whitespace = '\\.') strips leading and trailing periods, and strsplit() splits each sentence on whitespace. unlist() flattens the result, table() counts each word, data.frame() turns the counts into a data frame, and setNames() renames the columns.
Output:
# words Frequency
#1 declined 2
#2 gold 1
#3 of 2
#4 oil 1
#5 price 2
#6 prices 1
#7 remained 1
#8 the 2
#9 unchanged 1
This code may not process the data exactly as you wish; for example, it does not treat "price" and "Prices" as the same word. Handling that would make it more complicated.
> data.frame(table(strsplit(tolower(gsub("\\.|\\,","",paste(as.character(unlist(df1)),collapse=" ")))," ")))
Var1 Freq
1 declined 2
2 gold 1
3 of 2
4 oil 1
5 price 2
6 prices 1
7 remained 1
8 the 2
9 unchanged 1
Base R solution:
setNames(
data.frame(
table(
unlist(strsplit(tolower(do.call(c, df1)), "\\s+|[[:punct:]]"))
)
),
c("Words", "Frequency")
)
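The answers above count total occurrences. If the goal is instead to know in how many of the sentences a word appears, one possible sketch is to de-duplicate the words within each sentence before counting. The sentences are copied from the question into a plain named vector; the variable names are illustrative.

```r
sentences <- c(Jan = "The price of oil declined.",
               Feb = "The price of gold declined.",
               Mar = "Prices remained unchanged.")

# Lower-case, strip punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(sentences)), "\\s+")

# unique() makes each word count at most once per sentence
doc_freq <- table(unlist(lapply(tokens, unique)))

# Words that occur in more than one sentence
common <- doc_freq[doc_freq > 1]
common
```

As with the other answers, "price" and "prices" are still treated as different words.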

Using regex to extract email address after # in dplyr pipe and then groupby to count occurrences [duplicate]

I have a dataframe with a column called registrant_email. I want to extract the part of each email address after the # symbol, then group by it (e.g. gmail, yahoo, hotmail) and count the occurrences of each.
registrant_email
chamukan#yahoo.com
tmrsons1974#yahoo.com
123ajumohan#gmail.com
123#websiterecovery.org
salesdesk#2techbrothers.com
salesdesk#2techbrothers.com
I can extract the part after # using the code below:
sub(".*#", "", df$registrant_email)
How can I use it in a dplyr pipe and then count the occurrences of each domain?
tidyr::separate is useful for splitting columns:
library(tidyr)
library(dplyr)
# separate email into `user` and `domain` columns
df %>% separate(registrant_email, into = c('user', 'domain'), sep = '#') %>%
# tally occurrences for each level of `domain`
count(domain)
## # A tibble: 4 x 2
## domain n
## <chr> <int>
## 1 2techbrothers.com 2
## 2 gmail.com 1
## 3 websiterecovery.org 1
## 4 yahoo.com 2
By first splitting into a character matrix and then coercing it to a data.frame, we can use common dplyr idioms:
library(dplyr)
library(stringr)
str_split_fixed(df$registrant_email, pattern = "#", n =2) %>%
data.frame %>% group_by(X2) %>% count(X1)
The result is as follows
X2 X1 n
<fctr> <fctr> <int>
1 2techbrothers.com salesdesk 2
2 gmail.com 123ajumohan 1
3 websiterecovery.org 123 1
4 yahoo.com chamukan 1
5 yahoo.com tmrsons1974 1
If you want to set variable names for better code comprehension, you can use
str_split_fixed(df$registrant_email, pattern = "#", n =2) %>%
data.frame %>% setNames(c("local", "domain")) %>%
group_by(domain) %>% count(local)
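We can use base R methods for this. Because "#" is read.table's default comment character, comment.char = "" is needed so that it is treated as the separator:
aggregate(V1~V2, read.table(text = df$registrant_email,
  sep="#", comment.char = "", stringsAsFactors=FALSE), FUN = length)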
# V2 V1
#1 2techbrothers.com 2
#2 gmail.com 1
#3 websiterecovery.org 1
#4 yahoo.com 2
Or use the OP's method and wrap it with table:
as.data.frame(table(sub(".*#", "", df$registrant_email)))
# Var1 Freq
#1 2techbrothers.com 2
#2 gmail.com 1
#3 websiterecovery.org 1
#4 yahoo.com 2
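To use the OP's sub() call directly inside a dplyr pipe, one possible sketch is to compute the domain with mutate() and then tally it with count(). The data frame below is rebuilt from the sample in the question; the column name domain and the result name domain_counts are assumptions made for illustration.

```r
library(dplyr)

df <- data.frame(
  registrant_email = c("chamukan#yahoo.com", "tmrsons1974#yahoo.com",
                       "123ajumohan#gmail.com", "123#websiterecovery.org",
                       "salesdesk#2techbrothers.com", "salesdesk#2techbrothers.com"),
  stringsAsFactors = FALSE
)

domain_counts <- df %>%
  mutate(domain = sub(".*#", "", registrant_email)) %>%  # drop everything up to "#"
  count(domain, sort = TRUE)                             # one row per domain, largest n first
domain_counts
```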
