I have a data set that might contain some very similar keys - something like a row of data for each of the email addresses john.doe#foo.com and john.m.doe#foo.com. How can I combine similarly named keys and aggregate their values in R?
Sample input
| Email              | Subscriptions |
|--------------------|---------------|
| john.doe#foo.com   | 10            |
| john.m.doe#foo.com | 11            |
| jane.doe#foo.com   | 20            |
Expected result
| Email            | Subscriptions |
|------------------|---------------|
| john.doe#foo.com | 21            |
| jane.doe#foo.com | 20            |
I know agrep and a few other libraries can do fuzzy matching, but how do I employ it to combine rows in a data set?
Here is one way to use agrep in combination with dplyr:
df <- data.frame(mail = c("john.doe#foo.com", "john.m.doe#foo.com", "jane.doe#foo.com"),
                 sub = c(10, 11, 20))

library(dplyr)

df %>%
  rowwise() %>%
  # for each row, collect the indices of all fuzzy matches and use them as a group key
  mutate(new = paste(agrep(mail, df$mail, max.distance = 2, ignore.case = TRUE), collapse = ",")) %>%
  group_by(new) %>%
  mutate(sub = sum(sub)) %>% # aggregate subscriptions within each fuzzy-match group
  slice(1)                   # keep one representative row per group
mail sub new
<fct> <dbl> <chr>
1 john.doe#foo.com 21 1,2
2 jane.doe#foo.com 20 3
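To see what drives the grouping, look at what agrep itself returns: the indices of all entries within the allowed edit distance of the pattern. Rows that produce the same index string end up in the same group:
agrep("john.doe#foo.com", df$mail, max.distance = 2, ignore.case = TRUE)
[1] 1 2
agrep("jane.doe#foo.com", df$mail, max.distance = 2, ignore.case = TRUE)
[1] 3
Note that max.distance = 2 is fairly permissive; tune it to your data, since genuinely distinct addresses that are only two edits apart would also be merged.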
Related
I've encountered another challenge: combining two or more rows into one based on an identifier column.
My dataset looks like this:
var<-c("round","round","round","hhid","hhid","chid","chid","sex")
dfile<-c("df1","df2","df3","df1","df2","df1","df2","df1")
uniquevar<-c("df1::round","df2::round","df3::round", "df1::hhid","df2::hhid","df1::chid","df2::chid","df1::sex")
flag<-c("dup","dup","dup","dup","dup","dup","dup","NA")
df<-data.frame(var, dfile,flag)
I am trying to:
1. Find the observations marked as "dup".
2. If a row is marked "dup", combine the two/three/multiple rows into one with the format df1::var | df2::var | df3::var.
So, the ideal outcome would look like this:
var    dfile            uniquevar                              flag
round  df1 | df2 | df3  df1::round | df2::round | df3::round   dup
hhid   df1 | df2        df1::hhid | df2::hhid                  dup
chid   df1 | df2        df1::chid | df2::chid                  dup
sex    df1                                                     NA
So far I can only do this manually in Excel, which is really time-consuming. I'd appreciate advice on how to achieve it in R, which would be much faster given that the dataset contains over 600,000 observations. Thanks a lot!
You can paste cells together after using group_by(var). Use sep = "::" to specify the separator between columns, and collapse = " | " as the separator between rows. You can do this inside summarize from the dplyr package.
library(dplyr)
df %>%
  group_by(var) %>%
  summarize(uniquevar = ifelse(all(flag == "dup"),
                               paste(dfile, var, sep = "::", collapse = " | "),
                               ""),
            dfile = paste(dfile, collapse = " | "),
            dup = flag[1]) %>%
  select(var, dfile, uniquevar, dup)
#> # A tibble: 4 x 4
#> var dfile uniquevar dup
#> <chr> <chr> <chr> <chr>
#> 1 chid df1 | df2 "df1::chid | df2::chid" dup
#> 2 hhid df1 | df2 "df1::hhid | df2::hhid" dup
#> 3 round df1 | df2 | df3 "df1::round | df2::round | df3::round" dup
#> 4 sex df1 "" NA
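To see the two separators in isolation, here is paste applied to the hhid group from the sample data:
paste(c("df1", "df2"), "hhid", sep = "::", collapse = " | ")
[1] "df1::hhid | df2::hhid"
sep joins element-wise, producing df1::hhid and df2::hhid; collapse then flattens those two strings into one.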
Here is a sample of the dataset that I have. I am looking to find the state that has the maximum number of stores (in this case, CA) and also to see how many IDs come from that state.
| ID  | State | Stores |
|-----|-------|--------|
| a11 | CA    | 16585  |
| a12 | CA    | 45552  |
| a13 | AK    | 7811   |
| a14 | MA    | 4221   |
I have this code using dplyr
max_state <- df %>%
  group_by(State) %>%
  summarise(total_stores = sum(Stores)) %>%
  top_n(1) %>%
  select(State)
This gives me "CA".
Can I use this variable max_state in a filter, and then use summarise(n()) to count the number of IDs for CA?
A few ways:
# This takes your max_state (CA) and brings in the rows of
# your original table that have the same State
max_state %>%
  left_join(df) %>%
  summarize(n = n())

# Filter the State in df to match the State in max_state
df %>%
  filter(State == max_state$State) %>%
  summarize(n = n())

# Add Stores_total for each State, ungroup, keep only the rows from the
# State with the largest total, and count the IDs therein
df %>%
  group_by(State) %>%
  mutate(Stores_total = sum(Stores)) %>%
  ungroup() %>% # without this, max() is computed within each State and every row passes the filter
  filter(Stores_total == max(Stores_total)) %>%
  count(ID)
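If you would rather carry a plain string than a one-row data frame, you can extract the column with pull; this is a small variation on the second approach above:
# pull(max_state, State) returns the character value "CA" rather than a tibble
df %>%
  filter(State == pull(max_state, State)) %>%
  summarize(n = n())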
You can combine more operations into one summarize call that will be applied to the same group:
df |>
  group_by(State) |>
  summarize(gsum = sum(Stores), nids = n()) |>
  filter(gsum == max(gsum))
#> # A tibble: 1 × 3
#>   State  gsum  nids
#>   <chr> <dbl> <int>
#> 1 CA    62137     2
Where the dataset df is obtained by:
df <- data.frame(ID = c("a11", "a12", "a13", "a14"),
                 State = c("CA", "CA", "AK", "MA"),
                 Stores = c(16585, 45552, 7811, 4221))
I have a dataframe with an ID variable and a bunch of similarly named columns with information
| ID | C1 | C2  | C3  | ... |
|----|----|-----|-----|-----|
| 1  | 99 | 101 | 102 | ... |
I need to count the number of columns that fulfil a certain condition (e.g. < 100). If the number of columns were small, I would do something like
df %>% mutate(counter = case_when(C1 < 100 & C2 < 100 & C3 < 100 ~ 3,
                                  C1 < 100 & C2 < 100 ~ 2, ...))
But that is obviously not an option with 100+ columns. I could also pivot, summarise, and pivot back, but that doesn't seem like the cleanest solution either. Any ideas on how to do this properly?
We may use rowSums from base R on a logical matrix (df[-1] < 100) to get the count of elements in each row that are less than 100.
df$counter <- rowSums(df[-1] < 100, na.rm = TRUE)
TRUE coerces to 1 and FALSE to 0, so when we take the row-wise sum of the logical matrix, each TRUE contributes 1 to the count.
Or in a dplyr pipe
library(dplyr)
df %>%
  mutate(counter = rowSums(across(-1) < 100, na.rm = TRUE))
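As a quick sanity check, here is the approach applied to data shaped like the question's single example row:
df <- data.frame(ID = 1, C1 = 99, C2 = 101, C3 = 102)
rowSums(df[-1] < 100, na.rm = TRUE)
[1] 1
Only C1 is below 100, so the counter for that row is 1.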
Consider the dataframe given below:
Sample DataFrame
| Name | Age | Type |
|------|-----|------|
| EF   | 50  | A    |
| GH   | 60  | B    |
| VB   | 70  | C    |
Code to perform the filter
df2 <- df1 %>% filter(Type == 'C') %>% select(Name)
The above code gives me a dataframe with a single column and row.
I would like to perform a conditional filter where, if a certain type is not present, the name is returned as NULL/NA.
Example
df2 <- df1 %>% filter(Type == 'D') %>% select(Name)
This must give the following output, rather than throwing an error:

| Name |
|------|
| NA   |

Any input would be really helpful; dplyr or any other method is welcome.
Here is a base R approach:
name <- df[df$Type == "D", "Name"]
ifelse(identical(name, character(0)), NA, name)
[1] NA
If no Type matches "D", the subset operation returns character(0). We compare the output against this and return NA as appropriate.
Data:
df <- data.frame(Name = c("EF", "GH", "VB"),
                 Age = c(50, 60, 70),
                 Type = c("A", "B", "C"),
                 stringsAsFactors = FALSE)
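If you need this check in several places, it can be wrapped in a small helper (the function name is just illustrative). Testing length(name) == 0 instead of identical(name, character(0)) also covers the case where Name is a factor:
# Hypothetical helper: return the matching Name(s), or NA if the Type is absent
get_name <- function(df, type) {
  name <- df[df$Type == type, "Name"]
  if (length(name) == 0) NA_character_ else name
}
get_name(df, "C") # "VB"
get_name(df, "D") # NA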
An approach with complete from tidyr would be:
library(dplyr)
library(tidyr)

df1 %>%
  complete(Type = LETTERS) %>% # specify which Types you'd expect; missing ones are filled with NA
  filter(Type == 'D') %>%
  select(Name)
# A tibble: 1 x 1
# Name
# <fct>
# 1 NA
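To see what complete contributes before the filter, inspect the intermediate step: it pads df1 out to one row per expected Type, filling Name and Age with NA for every Type not already present (here, everything except A, B, and C):
df1 %>%
  complete(Type = LETTERS)
# 26 rows: A, B, and C keep their original Name and Age; D through Z get NA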
I have this table:
Date | Column1 | Column2
------+---------+--------
6/1/1 | A | 3
5/1/1 | B | 4
4/1/1 | C | 5
1/1/1 | A | 1
7/1/1 | B | 2
1/1/1 | C | 3
I need this table:
Date | Column1 | Column2
------+---------+--------
6/1/1 | A | 3
4/1/1 | C | 5
7/1/1 | B | 2
How can I remove old rows based on two criteria (Column1, Column2)?
Group by Column1, arrange in descending date order within each group, then keep the first row with slice, like this:
library(dplyr)

ans <- df %>%
  group_by(Column1) %>% # grouping by Column1 alone reproduces your expected output;
                        # in your sample every (Column1, Column2) pair is unique,
                        # so grouping by both would keep all rows
  arrange(desc(as.Date(Date))) %>% # sorts within each group
  slice(1) %>%                     # keep the first (most recent) row of each group
  ungroup()
If as.Date complains or sorts incorrectly, it is because your date format is a bit unusual. I recommend lubridate::parse_date_time, which is more robust than base R's datetime functions.
library(lubridate)
library(dplyr)

ans <- df %>%
  group_by(Column1) %>%
  arrange(desc(parse_date_time(Date, orders = "mdy"))) %>% # dates are month/day/year
  slice(1) %>% # keep the first (most recent) row of each group
  ungroup()
EDIT
Based on a helpful comment by @count, we can simplify the dplyr chain to:
library(lubridate)
library(dplyr)

ans <- df %>%
  group_by(Column1) %>%
  slice(which.max(parse_date_time(Date, orders = "mdy"))) %>% # keep the row with the latest Date in each group
  ungroup()
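For reference, the df assumed in the answers above can be reconstructed from the question's input table (with Date stored as character):
df <- data.frame(Date = c("6/1/1", "5/1/1", "4/1/1", "1/1/1", "7/1/1", "1/1/1"),
                 Column1 = c("A", "B", "C", "A", "B", "C"),
                 Column2 = c(3, 4, 5, 1, 2, 3))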