Using a variable number of groups with do in a function

I would like to understand if and how this could be achieved using the tidyverse framework.
Assume I have the following simple function:
my_fn <- function(list_char) {
  data.frame(comma_separated = rep(paste0(list_char, collapse = ","), 2),
             second_col = "test",
             stringsAsFactors = FALSE)
}
Given the below list:
list_char <- list(name = "Chris", city = "London", language = "R")
my function works fine if you run:
my_fn(list_char)
However, if we replace some of the list's elements with character vectors, we can use the dplyr::do function in the following way:
list_char_mult <- list(name = c("Chris", "Mike"),
                       city = c("New York", "London"),
                       language = "R")
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
  tbl_df() %>%
  group_by_all() %>%
  do(my_fn(list(name = .$name, city = .$city, language = "R")))
The question is how to write a function that could do this for a list with a variable number of elements. For example:
my_fn_generic <- function(list_char_mult) {
  expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
    tbl_df() %>%
    group_by_all() %>%
    do(my_fn(...))
}
Thanks

Regarding how to use the function with a variable number of elements, you can pass each grouped row to my_fn with do.call:
my_fn_generic <- function(list_char) {
  expand.grid(list_char, stringsAsFactors = FALSE) %>%
    tbl_df() %>%
    group_by_all() %>%
    do(do.call(my_fn, list(.)))
}
my_fn_generic(list_char_mult)
# A tibble: 4 x 4
# Groups: name, city, language [4]
# name city language comma_separated
# <chr> <chr> <chr> <chr>
#1 Chris London R Chris,London,R
#2 Chris New York R Chris,New York,R
#3 Mike London R Mike,London,R
#4 Mike New York R Mike,New York,R
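The same generic behaviour can also be had in base R, without grouping at all: iterate over the rows of expand.grid and row-bind my_fn's results. A minimal sketch (self-contained, repeating the definitions above; my_fn_generic_base is an illustrative name, not from either answer):

```r
my_fn <- function(list_char) {
  data.frame(comma_separated = rep(paste0(list_char, collapse = ","), 2),
             second_col = "test",
             stringsAsFactors = FALSE)
}

list_char_mult <- list(name = c("Chris", "Mike"),
                       city = c("New York", "London"),
                       language = "R")

my_fn_generic_base <- function(list_char) {
  grid <- expand.grid(list_char, stringsAsFactors = FALSE)
  # apply my_fn to each row, treating the row as a named list,
  # and keep the grid columns alongside my_fn's output
  do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
    data.frame(grid[i, ], my_fn(as.list(grid[i, ])), row.names = NULL)
  }))
}

res <- my_fn_generic_base(list_char_mult)
```

Because my_fn returns two rows per input, the four grid combinations give eight rows in total, with the grid columns recycled alongside my_fn's two output columns.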
Or use purrr::pmap:
library(tidyverse)
list_char_mult %>%
  expand.grid(., stringsAsFactors = FALSE) %>%
  mutate(comma_separated = purrr::pmap_chr(.l = ., .f = paste, sep = ", "))
# name city language comma_separated
#1 Chris New York R Chris, New York, R
#2 Mike New York R Mike, New York, R
#3 Chris London R Chris, London, R
#4 Mike London R Mike, London, R

If I understand your question, you could use apply without grouping:
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
  mutate(comma_separated = apply(., 1, paste, collapse = ","))
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
  mutate(comma_separated = apply(., 1, my_fn))
name city language comma_separated
1 Chris London R Chris,London,R
2 Chris New York R Chris,New York,R
3 Mike London R Mike,London,R
4 Mike New York R Mike,New York,R

How to extract matches from stringr::str_detect in R into a list vector

I am trying to perform the following search on a database of text.
Here is the sample database, df
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  name = c("john doe", "carol jones", "jimmy smith",
           "jenny ruiz", "joey jones", "tim brown"),
  place = c("reno nevada", "poland maine", "warsaw poland",
            "trenton new jersey", "brooklyn new york", "atlanta georgia")
)
I have a vector of strings which contains terms I am trying to find.
new_search <- c("poland", "jones")
I pass the vector to str_detect to find ANY of the strings in new_search in ANY of the columns in df and then return rows which match...
df %>%
  filter_all(any_vars(str_detect(., paste(new_search, collapse = "|"))))
Question... how can I extract the results of str_detect into a new column?
For each row which is returned... I would like to generate a list of the terms which were successfully matched and put them in a list or character vector (matched_terms)...something like this...
id name place matched_terms
1 2 carol jones poland maine c("jones", "poland")
2 3 jimmy smith warsaw poland c("poland")
3 5 joey jones brooklyn new york c("jones")
This is my naive solution:
new_search <- c("poland", "jones") %>% paste(collapse = "|")
df %>%
  mutate(new_var = str_extract_all(paste(name, place), new_search))
You can extract all the matched patterns across multiple columns with str_extract_all and then combine them into one column with unite. Since unite collapses each column into a string, empty matches become the literal text "character(0)"; we strip that with str_remove_all and keep only the rows with at least one matched term.
library(tidyverse)
pat <- str_c(new_search, collapse = "|")
df %>%
  mutate(across(-id, ~str_extract_all(., pat), .names = '{col}_new')) %>%
  unite(matched_terms, ends_with('new'), sep = ',') %>%
  mutate(matched_terms = str_remove_all(matched_terms,
                                        'character\\(0\\),?|,character\\(0\\)')) %>%
  filter(matched_terms != '')
# id name place matched_terms
#1 2 carol jones poland maine jones,poland
#2 3 jimmy smith warsaw poland poland
#3 5 joey jones brooklyn new york jones
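If you'd rather keep matched_terms as a real list-column instead of a collapsed string (closer to the c("jones", "poland") shape the question asked for), a base-R sketch of the same per-row matching, assuming the df and new_search values defined above:

```r
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  name = c("john doe", "carol jones", "jimmy smith",
           "jenny ruiz", "joey jones", "tim brown"),
  place = c("reno nevada", "poland maine", "warsaw poland",
            "trenton new jersey", "brooklyn new york", "atlanta georgia"),
  stringsAsFactors = FALSE
)
new_search <- c("poland", "jones")

# for each row, keep the search terms that appear anywhere in its text columns
df$matched_terms <- lapply(seq_len(nrow(df)), function(i) {
  row_text <- paste(df$name[i], df$place[i])
  new_search[vapply(new_search, grepl, logical(1), x = row_text)]
})

# keep only the rows with at least one match
matched <- df[lengths(df$matched_terms) > 0, ]
```

lengths() counts the matches per row, so the filter drops rows whose list entry is empty without any string round-trip through "character(0)".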

Matching strings to values in a different data frame

Consider this data frame, containing multiple entries for a person named Steve/Stephan Jones and a person named Steve/Steven Smith (as well as Jane Jones and Matt/Matthew Smith)
df <- data.frame(First = c("Steve", "Stephan", "Steve", "Jane", "Steve", "Steven", "Matt"),
                 Last = c(rep("Jones", 4), rep("Smith", 3)))
What I'd like is to match values of First to the appropriate value of Name in this data frame.
nicknames <- data.frame(Name = c("Stephan", "Steven", "Stephen", "Matthew"),
                        N1 = c(rep("Steve", 3), "Matt"))
To yield this target
target <- data.frame(First = c("Stephan", "Stephan", "Stephan", "Jane", "Steven", "Steven", "Matthew"),
                     Last = c(rep("Jones", 4), rep("Smith", 3)))
The issue is that there are multiple values of Name corresponding to a N1 (or First) value of "Steve", so I need to check within each group based of df$Last to see which version of Steven/Stephan/Stephen is correct.
Using something like this
library(dplyr)
library(stringr)
df %>%
  group_by(Last) %>%
  mutate(First = First[which.max(str_length(First))])
won't work because the value of "Jane" in row 4 will be converted to "Stephan"
I'm not sure if this solves your problem and is consistent with your desired output:
library(dplyr)
df %>%
  mutate(id = row_number()) %>%
  left_join(nicknames, by = c("First" = "N1")) %>%
  mutate(real_name = coalesce(Name, First)) %>%
  group_by(Last, real_name) %>%
  mutate(id = n()) %>%
  group_by(Last, First) %>%
  filter(id == max(id)) %>%
  select(-Name, -id)
returns
# A tibble: 7 x 3
# Groups: Last, First [6]
First Last real_name
<chr> <chr> <chr>
1 Steve Jones Stephan
2 Stephan Jones Stephan
3 Steve Jones Stephan
4 Jane Jones Jane
5 Steve Smith Steven
6 Steven Smith Steven
7 Matt Smith Matthew
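If you additionally want the First column itself replaced to match the target (including Matt becoming Matthew even though "Matthew" never appears in df), here is a base-R sketch: resolve each nickname against the full names seen in the same Last group, and fall back to the nickname's single candidate when it has exactly one. The resolve helper is hypothetical, not part of either answer above.

```r
df <- data.frame(First = c("Steve", "Stephan", "Steve", "Jane", "Steve", "Steven", "Matt"),
                 Last = c(rep("Jones", 4), rep("Smith", 3)),
                 stringsAsFactors = FALSE)
nicknames <- data.frame(Name = c("Stephan", "Steven", "Stephen", "Matthew"),
                        N1 = c(rep("Steve", 3), "Matt"),
                        stringsAsFactors = FALSE)

# resolve one (First, Last) pair to a full name
resolve <- function(first, last) {
  cands <- nicknames$Name[nicknames$N1 == first]  # full names this nickname could stand for
  seen  <- df$First[df$Last == last]              # names observed in the same Last group
  hit   <- intersect(cands, seen)
  if (length(hit)) {
    hit[1]                      # a full form appears in the group (Steve Jones -> Stephan)
  } else if (length(cands) == 1) {
    cands                       # unambiguous nickname (Matt -> Matthew)
  } else {
    first                       # not a nickname (Jane) or unresolvable
  }
}

df$First <- unname(mapply(resolve, df$First, df$Last))
```

This leaves "Jane" untouched because she has no entry in nicknames, and disambiguates "Steve" per family because the lookup is restricted to the matching Last group.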

html_nodes returns an empty list

I am scraping the number of newspaper articles containing certain words. For example, the word "Republican" in CA in 1929, from this website:
url <- "https://www.newspapers.com/search/#query=republican&dr_year=1929-1929&p_place=CA"
I want to copy the number of hits (in the example, 23490), and I am using this code:
hits <- url %>%
  read_html() %>%
  html_nodes('.total-hits') %>%
  html_text()
but html_text() returns an empty list. I would appreciate any help. Thanks!
Here is one way. Looking at the page source, it seems that you want to target td. Then do some string manipulation and create the output. The first 10 rows are shown below.
library(rvest)
library(dplyr)

read_html("https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA") %>%
  html_nodes("td") %>%
  html_text() %>%
  gsub(pattern = "\\n", replacement = "") %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame() %>%
  rename(state = V1, count = V2)
state count
1 California 23,490
2 Pennsylvania 51,697
3 New York 35,428
4 Indiana 23,199
5 New Jersey 22,787
6 Missouri 20,650
7 Ohio 15,270
8 Illinois 14,920
9 Iowa 14,676
10 Wisconsin 13,821
Another way is the following. I further specified where to get the text. There are two targets, so I used map_dfc() to build a data frame directly. Then I did similar cleaning, this time also converting character to numeric.
map_dfc(.x = c("td.tn", "td.tar"),
        .f = function(x) {
          read_html("https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA") %>%
            html_nodes(x) %>%
            html_text()
        }) %>%
  rename(state = `...1`, count = `...2`) %>%
  mutate(state = gsub(x = state, pattern = "\\n", replacement = ""),
         count = as.numeric(sub(x = count, pattern = ",", replacement = "")))
state count
<chr> <dbl>
1 California 23490
2 Pennsylvania 51697
3 New York 35428
4 Indiana 23199
5 New Jersey 22787
6 Missouri 20650
7 Ohio 15270
8 Illinois 14920
9 Iowa 14676
10 Wisconsin 13821
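One small caveat about the sub() call above: sub() replaces only the first comma, which happens to be enough for these five-digit counts but would silently break on counts with two or more commas. gsub() removes them all. A quick base-R check:

```r
x <- "1,234,567"

sub(",", "", x)                 # only the first comma is removed -> "1234,567"
gsub(",", "", x)                # all commas removed -> "1234567"

# as.numeric() on the sub() result yields NA (with a warning),
# while the gsub() result parses cleanly
count <- as.numeric(gsub(",", "", x))
```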
The problem is that you are scraping the wrong URL. Change it to https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA and change html_nodes to html_node, and your code will work.

From state and county names to fips in R

I have the following data frame in R and would like to get FIPS codes from it. I tried the fips function in usmap (https://rdrr.io/cran/usmap/man/fips.html), but I could not get it to work because I thought I needed to enclose each value in double quotes. I tried paste0('"', df$state, '"'), but that did not work either. Is there an efficient way to get the FIPS codes?
> df1
state county
1 california napa
2 florida palm beach
3 florida collier
4 florida duval
UPDATE
I can get "\"california\"" by using dQuote. Thanks. After converting each column, I tried the following, but ran into errors. How do I deal with this issue?
> df1$state <- dQuote(df1$state, FALSE)
> df1$county <- dQuote(df1$county, FALSE)
> fips(state = df1$state, county = df1$county)
Error in fips(state = df1$state, county = df1$county) :
`county` parameter cannot be used with multiple states.
> fips(state = df1$state[1], county = df1$county[1])
Error in fips(state = df1$state[1], county = df1$county[1]) :
"napa" is not a valid county in "california".
> fips(state = "california", county = "napa")
[1] "06055"
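As an aside, the errors in the UPDATE come from dQuote itself: it embeds literal quote characters into the string, so fips() receives "\"california\"" rather than "california", which is not a valid state name. Plain character vectors are all fips() needs; no quoting step is required. A quick base-R check of what dQuote actually produces:

```r
x <- dQuote("california", FALSE)   # FALSE -> plain ASCII quotes, not fancy ones

x                                  # the quotes are now part of the string
identical(x, "california")         # FALSE
nchar(x)                           # 12 = 10 letters + 2 quote characters
```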
We can split the dataset by state and apply the fips
library(usmap)
lapply(split(df1, df1$state), function(x)
  fips(state = x$state[1], county = x$county))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with Map
lst1 <- split(df1$county, df1$state)
Map(fips, lst1, state = names(lst1))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with tidyverse
library(dplyr)
library(tidyr)
df1 %>%
  group_by(state) %>%
  summarise(new = list(fips(state = first(state), county = county))) %>%
  unnest(c(new))
# A tibble: 4 x 2
# state new
# <chr> <chr>
#1 california 06055
#2 florida 12099
#3 florida 12021
#4 florida 12031
data
df1 <- structure(list(state = c("california", "florida", "florida",
                                "florida"),
                      county = c("napa", "palm beach", "collier", "duval")),
                 class = "data.frame", row.names = c("1", "2", "3", "4"))

How can I fuzzy string match multiple strings from different sized data frames?

I would like to match the strings from my first dataset with all of their closest common matches.
Data looks like:
dataset1:
California
Texas
Florida
New York
dataset2:
Californiia
callifoornia
T3xas
Te xas
texas
Fl0 rida
folrida
New york
new york
desired result is:
col_1 col_2 col_3 col_4
California Californiia callifoornia
Texas T3xas texas Te xas
Florida folrida Fl0 rida
New York New york new york
The question is:
How do I search for common strings between the first dataset and the second dataset, and generate a list of terms in the second dataset that align with each term in the first?
Thanks in advance.
library(fuzzyjoin); library(tidyverse)
dataset1 %>%
  stringdist_left_join(dataset2, max_dist = 3) %>%
  rename(col_1 = "states.x") %>%
  group_by(col_1) %>%
  mutate(col = paste0("col_", row_number() + 1)) %>%
  spread(col, states.y)
#Joining by: "states"
## A tibble: 4 x 4
## Groups: col_1 [4]
# col_1 col_2 col_3 col_4
# <chr> <chr> <chr> <chr>
#1 California Californiia callifoornia NA
#2 Florida Fl0 rida folrida NA
#3 New York New york new york NA
#4 Texas T3xas Te xas texas
data:
dataset1 <- data.frame(states = c("California",
                                  "Texas",
                                  "Florida",
                                  "New York"),
                       stringsAsFactors = FALSE)
dataset2 <- data.frame(stringsAsFactors = FALSE,
                       states = c("Californiia",
                                  "callifoornia",
                                  "T3xas",
                                  "Te xas",
                                  "texas",
                                  "Fl0 rida",
                                  "folrida",
                                  "New york",
                                  "new york"))
I read a bit about stringdist and came up with this. It's a workaround, but I like it. It can definitely be improved:
library(stringdist)
library(janitor)
ds1a <- read.csv('dataset1')
ds2a <- read.csv('dataset2')
distancematrix <- stringdistmatrix(ds2a$name, ds1a$name, useNames = TRUE)
df <- as.data.frame(distancematrix)
# go through this df, and for every cell that's < 4, replace it with the column name,
# otherwise replace it with an empty string
for (j in 1:ncol(df)) {
  trigger <- df[, j] < 4
  df[trigger, j]  <- names(df)[j]
  df[!trigger, j] <- ""
}
df <- remove_constant(df)
write.csv(df, file="~/Desktop/df.csv")
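For a dependency-free variant of the same idea, base R's adist() computes generalized Levenshtein distances, and a which.min per row of the distance matrix picks each string's closest dataset1 entry. A sketch assuming the dataset1/dataset2 values from the question (as plain character vectors rather than files):

```r
dataset1 <- c("California", "Texas", "Florida", "New York")
dataset2 <- c("Californiia", "callifoornia", "T3xas", "Te xas", "texas",
              "Fl0 rida", "folrida", "New york", "new york")

# distance matrix: rows are dataset2 strings, columns are dataset1 strings
d <- adist(dataset2, dataset1, ignore.case = TRUE)

# for each dataset2 string, the dataset1 entry with the smallest distance
best <- dataset1[apply(d, 1, which.min)]

# group the misspellings under their closest clean name
matches <- split(dataset2, best)
```

Unlike the stringdist_left_join approach, this always assigns every dataset2 string to its single nearest match rather than dropping strings that exceed a max_dist cutoff, which may or may not be what you want.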
