Count word matches between two variables in R

Assume two datasets A and B:
X1 <- c('a', 'b', 'c')
place <- c('andes', 'brooklyn', 'comorin')
A <- data.frame(X1, place)

X2 <- c('a', 'a', 'a', 'b', 'c', 'c', 'd')
place2 <- c('andes', 'alamo', 'andes', 'brooklyn', 'comorin', 'camden', 'dover')
B <- data.frame(X2, place2)
I want to count how many times each word in A$place occurs in B$place2.

A possible solution:
library(tidyverse)
A %>%
  rowwise() %>%
  mutate(n = sum(place == B$place2)) %>%
  ungroup()
#> # A tibble: 3 × 3
#>   X1    place        n
#>   <chr> <chr>    <int>
#> 1 a     andes        2
#> 2 b     brooklyn     1
#> 3 c     comorin      1
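Since place == B$place2 is an exact whole-string comparison, the same counts can be obtained without rowwise via a join; a minimal sketch, assuming dplyr is loaded (coalesce() only fills in zero for places that never occur in B):
library(dplyr)
A %>%
  left_join(count(B, place2), by = c('place' = 'place2')) %>%
  mutate(n = coalesce(n, 0L))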

Use str_detect from the stringr package.
library(stringr)
sapply(A$place, function(x) sum(str_detect(x, B$place2)))
   andes brooklyn  comorin
       2        1        1
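Note that str_detect() is vectorised over its pattern argument here, so each element of B$place2 is treated as a regular expression and tested against the single string x. If the place names could contain regex metacharacters, a safer sketch wraps the patterns in stringr's fixed() for literal matching:
library(stringr)
# fixed() treats each pattern as a literal string rather than a regex
sapply(A$place, function(x) sum(str_detect(x, fixed(B$place2))))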

Alternatively, tabulate the values of B$place2 that also appear in A$place:
table(B$place2[B$place2 %in% A$place])
#    andes brooklyn  comorin
#        2        1        1

Here's a base R version of user438383's answer.
sapply(A$place, function(y) sum(grepl(y, B$place2)))
   andes brooklyn  comorin
       2        1        1
The key pieces are sapply(), which repeats an operation over all elements of a vector; grepl(), which checks for matches and returns TRUE or FALSE; and sum(): summing a logical vector gives the count of TRUE values.
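A quick illustration of that last point, using the B defined above:
grepl('andes', B$place2)
# [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
sum(grepl('andes', B$place2))
# [1] 2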

Related

Determine duplicate rows where at least one row has a different value in a column [duplicate]

I have data with a grouping variable ("from") and values ("number"):
from number
   1      1
   1      1
   2      1
   2      2
   3      2
   3      2
I want to subset the data and select groups which have two or more unique values. In my data, only group 2 has more than one distinct 'number', so this is the desired result:
from number
   2      1
   2      2
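The answers below refer to this data as df or df1 interchangeably; a minimal setup, assuming the data shown above:
df <- df1 <- data.frame(
  from   = c(1, 1, 2, 2, 3, 3),
  number = c(1, 1, 1, 2, 2, 2)
)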
Several possibilities; here's my favorite:
library(data.table)
setDT(df)[, if(+var(number)) .SD, by = from]
#    from number
# 1:    2      1
# 2:    2      2
Basically, for each group we check whether there is any variance; if there is (var > 0), we return the group's rows. (Note that var() returns NA for single-row groups, so this trick assumes every group has at least two rows.)
With base R, I would go with
df[as.logical(with(df, ave(number, from, FUN = var))), ]
#   from number
# 3    2      1
# 4    2      2
Edit: for non-numerical data you could try the uniqueN function from the development version of data.table (or use length(unique(number)) > 1 instead):
setDT(df)[, if(uniqueN(number) > 1) .SD, by = from]
You could try
library(dplyr)
df1 %>%
  group_by(from) %>%
  filter(n_distinct(number) > 1)
#   from number
# 1    2      1
# 2    2      2
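With more recent dplyr (1.1.0 or later), the per-operation .by argument can replace the explicit group_by; a sketch, assuming that version is available:
df1 %>%
  filter(n_distinct(number) > 1, .by = from)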
Or using base R
indx <- rowSums(!!table(df1)) > 1
subset(df1, from %in% names(indx)[indx])
#   from number
# 3    2      1
# 4    2      2
Or, using anyDuplicated(), which returns 0 when a group's values are all distinct, so the negation keeps exactly those groups:
df1[with(df1, !ave(number, from, FUN=anyDuplicated)),]
#   from number
# 3    2      1
# 4    2      2
Using the variance concept shared by David, but doing it the dplyr way:
library(dplyr)
df %>%
  group_by(from) %>%
  mutate(variance = var(number)) %>%
  filter(variance != 0) %>%
  select(from, number)
# Source: local data frame [2 x 2]
# Groups: from
#   from number
# 1    2      1
# 2    2      2

How to apply functions sequentially with purrr and pipes

I am struggling with the purrr package.
I am trying to apply the function is.factor to a data frame, and then fct_count to those columns that are factors.
I have tried some variations of modify_if and summarise_if. I guess I am using the dots (.) incorrectly when referring to the previous object.
(A guide about purrr and the dots would be really helpful if you have a link.)
For example,
df <- data.frame(f1 = c("men", "woman", "men", "men"),
                 f2 = c("high", "low", "low", "low"),
                 n1 = c(1, 3, 3, 6),
                 stringsAsFactors = TRUE) # needed from R 4.0 on, where strings no longer default to factors
Then
map(df, is.factor)
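(As an aside, map() returns a list of TRUE/FALSE values here; map_lgl() collapses the same check into a named logical vector:)
library(purrr)
map_lgl(df, is.factor)
#   f1    f2    n1
# TRUE  TRUE FALSE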
If I use
map_if(df, is.factor, forcats::fct_count)
I get results for every variable, instead of only for the factors.
I think it is a pretty simple problem that a bit of understanding of the dots (.) would solve.
Thanks in advance
:)
The issue is that map_if returns the unmodified columns as well. Hence, when the OP tries the code (repeating the same code as in the OP just to show):
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
#   f         n
#   <fct> <int>
# 1 men       3
# 2 woman     1
#
#$f2
# A tibble: 2 x 2
#   f         n
#   <fct> <int>
# 1 high      1
# 2 low       3
#
#$n1
# [1] 1 3 3 6   ## the same column, returned unchanged
Here, we can use the .else argument to return NULL for the non-factor columns and then discard those NULL elements, leaving only the factor counts:
library(tidyverse)
map_if(df, is.factor, forcats::fct_count, .else = ~ NULL) %>%
  discard(is.null)
#$f1
# A tibble: 2 x 2
#   f         n
#   <fct> <int>
# 1 men       3
# 2 woman     1
#
#$f2
# A tibble: 2 x 2
#   f         n
#   <fct> <int>
# 1 high      1
# 2 low       3
Another option is summarise_if, placing the output in a list:
df %>%
  summarise_if(is.factor, list(~ list(fct_count(.)))) %>%
  unclass
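In current dplyr (1.0 or later), summarise_if is superseded by across(); a sketch of the same idea:
df %>%
  summarise(across(where(is.factor), ~ list(fct_count(.)))) %>%
  unclass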
Or reshape into 'long' format with gather and then count once:
gather(df, key, val, f1:f2) %>%
  dplyr::count(key, val)
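gather is likewise superseded; with tidyr 1.0 or later, pivot_longer does the same reshape:
tidyr::pivot_longer(df, cols = c(f1, f2),
                    names_to = "key", values_to = "val") %>%
  dplyr::count(key, val)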
Or this can be done with lapply from base R
lapply(df[sapply(df, is.factor)], fct_count)
Or using only base R
lapply(df[sapply(df, is.factor)], table)
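Equivalently, base R's Filter() selects the factor columns without the sapply() indexing:
lapply(Filter(is.factor, df), table)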
Or the results can be represented in a different way
table(names(df)[1:2][col(df[1:2])], unlist(df[1:2]))
The issue with map_if/modify_if is that they apply the function only to the columns which satisfy the predicate; the rest are returned as-is.
Hence, when you try
library(tidyverse)
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
#   f         n
#   <fct> <int>
# 1 men       3
# 2 woman     1
#
#$f2
# A tibble: 2 x 2
#   f         n
#   <fct> <int>
# 1 high      1
# 2 low       3
#
#$n1
# [1] 1 3 3 6
fct_count is applied to columns f1 and f2, which are factors, and column n1 is returned as-is. If you want only the factor columns in the output, one way is to select them first and then apply the function:
df %>%
  select_if(is.factor) %>%
  map(forcats::fct_count)
#$f1
# A tibble: 2 x 2
#   f         n
#   <fct> <int>
# 1 men       3
# 2 woman     1
#
#$f2
# A tibble: 2 x 2
#   f         n
#   <fct> <int>
# 1 high      1
# 2 low       3

Using regex to extract email address after # in dplyr pipe and then groupby to count occurrences [duplicate]

I have a dataframe with a column called registrant_email. I want to extract the part of each address after the # symbol, group by it (e.g. gmail, yahoo, hotmail), and count the occurrences.
registrant_email
chamukan#yahoo.com
tmrsons1974#yahoo.com
123ajumohan#gmail.com
123#websiterecovery.org
salesdesk#2techbrothers.com
salesdesk#2techbrothers.com
Now I can extract the domains after # using the code below:
sub(".*#", "", df$registrant_email)
How can I use this in a dplyr pipe and then count the occurrences of each domain?
tidyr::separate is useful for splitting columns:
library(tidyr)
library(dplyr)
# separate email into `user` and `domain` columns
df %>% separate(registrant_email, into = c('user', 'domain'), sep = '#') %>%
# tally occurrences for each level of `domain`
count(domain)
## # A tibble: 4 x 2
##   domain                  n
##   <chr>               <int>
## 1 2techbrothers.com       2
## 2 gmail.com               1
## 3 websiterecovery.org     1
## 4 yahoo.com               2
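separate is superseded in current tidyr; a sketch of the same split with separate_wider_delim (assuming tidyr 1.3.0 or later):
df %>%
  tidyr::separate_wider_delim(registrant_email, delim = '#',
                              names = c('user', 'domain')) %>%
  dplyr::count(domain)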
By first splitting into a character matrix and coercing it to a data.frame, we can use common dplyr idioms:
library(dplyr)
library(stringr)
str_split_fixed(df$registrant_email, pattern = "#", n = 2) %>%
  data.frame %>%
  group_by(X2) %>%
  count(X1)
The result is as follows
                   X2          X1     n
               <fctr>      <fctr> <int>
1   2techbrothers.com   salesdesk     2
2           gmail.com 123ajumohan     1
3 websiterecovery.org         123     1
4           yahoo.com    chamukan     1
5           yahoo.com tmrsons1974     1
If you want to set variable names for better code comprehension, you can use
str_split_fixed(df$registrant_email, pattern = "#", n = 2) %>%
  data.frame %>%
  setNames(c("local", "domain")) %>%
  group_by(domain) %>%
  count(local)
We can use base R methods for this
aggregate(V1 ~ V2, read.table(text = df1$registrant_email,
    sep = "#", stringsAsFactors = FALSE), FUN = length)
#                    V2 V1
# 1   2techbrothers.com  2
# 2           gmail.com  1
# 3 websiterecovery.org  1
# 4           yahoo.com  2
Or use the OP's method and wrap it with table:
as.data.frame(table(sub(".*#", "", df1$registrant_email)))
#                  Var1 Freq
# 1   2techbrothers.com    2
# 2           gmail.com    1
# 3 websiterecovery.org    1
# 4           yahoo.com    2
