I have a dataset with emails like:
my_df <- data.frame(email = c("mirko#asdoi.com", "elsa#asodida.co.uk", "elsapina#asoqw.com"))
And I have an open source dataset like:
open_data <- data.frame(name = c("mirko", "elsa", "pina"), gender = c("male", "female", "male"))
How can I perform a lookup of my_df with open_data to associate the gender to each email?
If an email matches more than one name, I want one record per match.
The result should be:
result <- data.frame(email = c("mirko#asdoi.com", "elsa#asodida.co.uk", "elsapina#asoqw.com", "elsapina#asoqw.com"), gender = c("male", "female", "female", "male"))
One option is to use the sqldf library and solve this via a database style join between the two data frames:
library(sqldf)
my_df$name <- sub("#.*$", "", my_df$email)
sql <- "select t1.email, t2.gender from my_df t1 inner join open_data t2 "
sql <- paste0(sql, "on t1.name like '%' || t2.name || '%'")
result <- sqldf(sql)
Perhaps something along these lines? Not sure how robust this will be for more complex cases though.
library(tidyverse)
open_data %>%
rowwise() %>%
mutate(email = list(grep(name, my_df$email))) %>%
unnest(email) %>%
mutate(email = my_df$email[email])
## A tibble: 4 x 3
#  name  gender email
#  <fct> <fct>  <fct>
#1 mirko male   mirko#asdoi.com
#2 elsa  female elsa#asodida.co.uk
#3 elsa  female elsapina#asoqw.com
#4 pina  male   elsapina#asoqw.com
Explanation: We use grep to find matches of open_data$name in my_df$email; then unnest to expand multiple matches, and use row indices to extract email entries.
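For comparison, here is a minimal base R sketch of the same lookup, assuming each name should be treated as a literal substring of the whole email (domain included). The object names `matches` and `result` are only illustrative.

```r
my_df <- data.frame(
  email = c("mirko#asdoi.com", "elsa#asodida.co.uk", "elsapina#asoqw.com"),
  stringsAsFactors = FALSE
)
open_data <- data.frame(
  name = c("mirko", "elsa", "pina"),
  gender = c("male", "female", "male"),
  stringsAsFactors = FALSE
)

# For each name, keep every email that contains it as a literal substring
matches <- lapply(seq_len(nrow(open_data)), function(i) {
  hit <- grepl(open_data$name[i], my_df$email, fixed = TRUE)
  data.frame(
    email = my_df$email[hit],
    gender = rep(open_data$gender[i], sum(hit)),
    stringsAsFactors = FALSE
  )
})

# Stack the per-name pieces; an email matching two names appears twice
result <- do.call(rbind, matches)
result
```

Because the search runs over the full address, a name that happens to occur in the domain would also match. Stripping everything after `#` first, as the sqldf answer does, avoids that.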
I am using the tidycensus package to pull out some census variables. I am making a list of desired variables with set variable names (dummy data below). I also want to create a codebook where, ideally, I'd use the list of variable names to pull the rest of the information from the variable list that you can access with load_variables(). I'm not sure how to do that join, or pull out that information, using just a named character vector. Any suggestions?
library("tidycensus")
library("dplyr")
decvarlist <- load_variables(2000, "sf1")
desiredvars = c(var1 = "H001001",
var2 = "H002002",
var3 = "H002003"
)
#this bit doesnt work, but is sort of how I'm thinking of it
codebook <- left_join(desiredvars, decvarlist, by = ())
Perhaps we need to filter
library(dplyr)
decvarlist %>%
filter(name %in% desiredvars) %>%
mutate(id = names(desiredvars), .before = 1)
-output
# A tibble: 3 × 4
  id    name    label                                concept
  <chr> <chr>   <chr>                                <chr>
1 var1  H001001 Total                                HOUSING UNITS [1]
2 var2  H002002 Total!!Urban                         URBAN AND RURAL [6]
3 var3  H002003 Total!!Urban!!Inside urbanized areas URBAN AND RURAL [6]
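One caveat with the filter approach: the ids are attached by position, so it relies on decvarlist listing the variables in the same order as desiredvars, and on every desired variable being present. A sketch of a join-based alternative follows. The small `decvarlist` here is a stand-in built from the three rows shown above (deliberately shuffled), not the real `load_variables()` result.

```r
library(dplyr)
library(tibble)

# Stand-in for load_variables(2000, "sf1"), with rows in a different order
decvarlist <- tibble(
  name = c("H002003", "H001001", "H002002"),
  label = c("Total!!Urban!!Inside urbanized areas", "Total", "Total!!Urban"),
  concept = c("URBAN AND RURAL [6]", "HOUSING UNITS [1]", "URBAN AND RURAL [6]")
)

desiredvars <- c(var1 = "H001001", var2 = "H002002", var3 = "H002003")

# Turn the named vector into a two-column tibble, then join on the variable code
codebook <- enframe(desiredvars, name = "id", value = "name") %>%
  left_join(decvarlist, by = "name")
codebook
```

With a left join, a code missing from the variable list would show up as a row of NA values rather than causing a length error.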
Problem: I have 2 tables I'd like to join. However, the column upon which I wish to join the second to the first will vary dependent upon a successful parsing of the 2nd data frame to identify which column and row to join.
Request: I have found a solution to the problem (see below), but it does not seem to me to be very computationally efficient. That is not a problem for the reproducible example below, but it is less ideal when scaled up to a larger problem, i.e. ~200,000+ rows / observations.
Wondering if anyone might be able to help identify something better - ideally utilising functionality from dplyr.
Reproducible Example:
# Equipment alias table
alias1 <- c('a1a1', 'a2a2', 'a3a3', 'a4a4', 'a5a5', 'a6a6')
alias2 <- c('bc001', 'bc002', 'bc003', 'bc004', 'bc005', 'bc006')
alias3 <- c('e1o1', 'e202', 'e303', 'e404', 'e505', 'e606')
df_alias <- data.frame(alias1, alias2, alias3)
# Attribute table
equip <- c('a1a1','bc006', 'e404')
att1 <- c('a', 'b', 'c')
att2 <- c('1', '2', '3')
df_att <- data.frame(equip, att1, att2)
Desired Outcome:
I'm looking to achieve the following....
# DESIRED OUTPUT - combining equipment alias table into attribute table based on string match between the attribute table's equip column and any one of the columns in the equipment alias table
equip <- c('a1a1','bc006', 'e404')
att1 <- c('a', 'b', 'c')
att2 <- c('1', '2', '3')
alias1 <- c('a1a1','a6a6', 'a4a4')
alias2 <- c('bc001','bc006', 'bc004')
alias3 <- c('e1o1','e606', 'e404')
df_att <- data.frame(equip, att1, att2, alias1, alias2, alias3)
Current Solution:
library(dplyr)
left_join(df_att, df_alias, by = character()) %>%
filter(equip == alias1 | equip == alias2 | equip == alias3)
Effective but not exactly elegant as there's a great deal of duplication for ultimately a filter to then be applied to undo that duplication.
An option is to filter with if_any and then bind the subset rows with the df_att
library(dplyr)
df_att2 <- df_alias %>%
filter(if_any(everything(), ~ .x %in% df_att$equip)) %>%
arrange(na.omit(unlist(across(everything(), ~ match(df_att$equip, .x))))) %>%
bind_cols(df_att, .)
-checking with OP's expected (changed the object name 'df_att' to 'out' to avoid any confusion)
> all.equal(df_att2, out)
[1] TRUE
I don't know how it compares efficiency-wise, but one idea is to pivot a copy of each alias so that you can left_join against a single column instead of multiple ones.
library(tidyr)
library(dplyr)
df_alias %>%
mutate(across(everything(), ~., .names = "_{.col}")) %>%
pivot_longer(starts_with('_'), names_to = NULL, values_to = 'equip') %>%
left_join(df_att, .)
#> Joining, by = "equip"
#>   equip att1 att2 alias1 alias2 alias3
#> 1  a1a1    a    1   a1a1  bc001   e1o1
#> 2 bc006    b    2   a6a6  bc006   e606
#> 3  e404    c    3   a4a4  bc004   e404
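If the cross join is the bottleneck, one possible base R alternative is a single vectorised `match()` against all alias values at once. This is only a sketch: it assumes every alias value is unique across the whole alias table, and `pos`, `row_idx` and `matched` are illustrative names.

```r
df_alias <- data.frame(
  alias1 = c("a1a1", "a2a2", "a3a3", "a4a4", "a5a5", "a6a6"),
  alias2 = c("bc001", "bc002", "bc003", "bc004", "bc005", "bc006"),
  alias3 = c("e1o1", "e202", "e303", "e404", "e505", "e606"),
  stringsAsFactors = FALSE
)
df_att <- data.frame(
  equip = c("a1a1", "bc006", "e404"),
  att1 = c("a", "b", "c"),
  att2 = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Flatten the alias table column by column and find each equip value in it
pos <- match(df_att$equip, unlist(df_alias, use.names = FALSE))

# Convert the position in the flattened vector back to a row of df_alias
row_idx <- (pos - 1) %% nrow(df_alias) + 1

# Pull the matching alias rows and attach them to the attribute table
matched <- df_alias[row_idx, ]
rownames(matched) <- NULL
out <- cbind(df_att, matched)
out
```

This never materialises the full cross product, so memory use should grow with the size of the two tables rather than with their product.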
I know it should be an easier or smarter way of doing what I need, but I haven't found it yet after several days.
I have 2 dataframes that I need to merge using an extra condition. For example:
df1 <- data.frame(Username = c("user1", "user2", "user3", "user4", "user5", "user6"))
df2 <- data.frame(File_Name = c(rep("StudyABC", 5), rep("AnotherStudyCDE", 4)), Username = c("user1", rep(c("user2", "user3", "user4", "user5"),2)))
print(df1)
print(df2)
What I need is to create 2 new columns in df1 called ABC and CDE that includes their "File_Name" values. Of course the real data is hundreds of lines and not ordered so no way of selecting by range.
One of the solutions (not elegant) that I have found is:
library(dplyr)
library(stringr)
df2_filtered <- df2 %>% filter(str_detect(File_Name, "ABC"))
df1 <- left_join(df1, df2_filtered, by = "Username")
names(df1)[2] <- "ABC"
df2_filtered <- df2 %>% filter(str_detect(File_Name, "CDE"))
df1 <- left_join(df1, df2_filtered, by = "Username")
names(df1)[3] <- "CDE"
print(df1)
Is there a shorter way of doing it? Because I have to repeat the same logic 160 times.
Thanks
You can extract either "ABC" or "CDE" from File_Name and cast the data into wide format. We can join the data with df1 to get all the Username in the final dataframe.
library(dplyr)
df2 %>%
mutate(name = stringr::str_extract(File_Name, 'ABC|CDE')) %>%
tidyr::pivot_wider(names_from = name, values_from = File_Name) %>%
right_join(df1, by = 'Username')
#  Username ABC      CDE
#  <chr>    <chr>    <chr>
#1 user1    StudyABC NA
#2 user2    StudyABC AnotherStudyCDE
#3 user3    StudyABC AnotherStudyCDE
#4 user4    StudyABC AnotherStudyCDE
#5 user5    StudyABC AnotherStudyCDE
#6 user6    NA       NA
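Since the question mentions repeating this for 160 studies, the pattern does not have to be typed by hand. A sketch, assuming the study codes are available as a character vector (`studies` is a hypothetical name) and contain no regex special characters:

```r
library(dplyr)
library(tidyr)
library(stringr)

df1 <- data.frame(Username = paste0("user", 1:6))
df2 <- data.frame(
  File_Name = c(rep("StudyABC", 5), rep("AnotherStudyCDE", 4)),
  Username = c("user1", rep(c("user2", "user3", "user4", "user5"), 2))
)

# Hypothetical vector of study codes; in practice this would hold all 160
studies <- c("ABC", "CDE")

# Build one alternation pattern, "ABC|CDE", from the codes
pattern <- paste(studies, collapse = "|")

wide <- df2 %>%
  mutate(study = str_extract(File_Name, pattern)) %>%
  pivot_wider(names_from = study, values_from = File_Name)

# Keep every user from df1, including those with no files
result <- left_join(df1, wide, by = "Username")
result
```

If a user can have several files for the same study, `pivot_wider()` would produce list-columns here, so that case needs a decision about how to combine the file names first.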
What you're looking for is a way of casting data from long to wide. Using the data.table package, for example, I would do this:
library(data.table)
# converts data.frame to data.table
dt <- as.data.table(df2)
# I copy File_Name so that one copy is used for pivoting from long to wide and the other is used for filling in the data
dt[, study := File_Name]
dt_wide <- dcast(Username~File_Name, data=dt, value.var = "study")
# have a look at df2 in wide format
dt_wide[]
# now it's just a direct merge to pull it back into df1 and turn
# it back into a data.frame for you
out <- merge(as.data.table(df1), dt_wide, by="Username", all.x=TRUE)
setDF(out)
out
Plenty of tutorials on melting/casting even without data.table. It's just knowing what to search for eg Google throws up https://ademos.people.uic.edu/Chapter8.html as the first result.
If one study can have more than one file path (which I assume is the case from your previous attempts), just converting your data to a wide format before joining won't work as you'll have one column per file path, not per study.
One method in this case could be to use a for-loop to create an additional column in df2 with the study name, then convert the data to a wide format using pivot_wider.
It's not very idiomatic R, though, so I'd welcome suggestions to avoid creating the empty study column and the for-loop.
library(dplyr)
library(tidyr)
studies <- c("ABC", "CDE")
#create empty column named "study"
df2 <- df2 %>%
mutate(study = NA_character_)
for (i in studies) {
df2 <- df2 %>%
mutate(study = if_else(grepl(i, File_Name), i, study))
}
df2 <- df2 %>%
pivot_wider(names_from = study, values_from = File_Name)
> df2
# A tibble: 5 x 3
  Username ABC      CDE
  <chr>    <chr>    <chr>
1 user1    StudyABC NA
2 user2    StudyABC AnotherStudyCDE
3 user3    StudyABC AnotherStudyCDE
4 user4    StudyABC AnotherStudyCDE
5 user5    StudyABC AnotherStudyCDE
df2 is now in a wide format and you can join it to df1 as before to get your desired output.
df3 <- left_join(df1, df2)
I have the following data:
library(dplyr)
d <- tibble(
region = c('all', 'one', 'eleven', 'six'),
forename = c('John', 'Jane', 'Rich', 'Clive'),
surname = c('Smith', 'Jones', 'Smith', 'Jones'))
I would like to anonymise the values within the 'forename' and 'surname' variables so that the data looks like this.
d <- tibble(
region = c('all', 'one', 'eleven', 'six'),
forename = c('forename1', 'forename2', 'forename3', 'forename4'),
surname = c('surname1', 'surname2', 'surname3', 'surname4'))
I could just do this manually but I have a df with millions of rows. What I would like is for the row number in the df to coincide with the value rename. So the data on row 67 for example would show:
d <- tibble(
region = c('all'),
forename = c('forename67'),
surname = c('surname67'))
Does anyone know how I would achieve this using dplyr if possible?
Thanks
As every row is a unique user, we can paste row_number to the column names.
library(dplyr)
d %>%
mutate(forename = paste0("forename", row_number()),
surname = paste0("surname", row_number()))
# A tibble: 4 x 3
#  region forename  surname
#  <chr>  <chr>     <chr>
#1 all    forename1 surname1
#2 one    forename2 surname2
#3 eleven forename3 surname3
#4 six    forename4 surname4
An option with stringr
library(dplyr)
library(stringr)
d %>%
mutate(forename = str_c("forename", row_number()),
surname = str_c("surname", row_number()))
Or with lapply from base R
d[c('forename', 'surname')] <- lapply(c('forename', 'surname'), function(x)
paste0(x, seq_len(nrow(d))))
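If more columns need the same treatment, the column name does not have to be typed twice per column. A sketch using `across()` with `cur_column()`, assuming dplyr 1.0.0 or later; `cols` and `d_anon` are illustrative names.

```r
library(dplyr)

d <- data.frame(
  region = c("all", "one", "eleven", "six"),
  forename = c("John", "Jane", "Rich", "Clive"),
  surname = c("Smith", "Jones", "Smith", "Jones"),
  stringsAsFactors = FALSE
)

# Columns to anonymise; extend this vector as needed
cols <- c("forename", "surname")

# Replace each value with "<column name><row number>"
d_anon <- d %>%
  mutate(across(all_of(cols), ~ paste0(cur_column(), seq_along(.x))))
d_anon
```

Note that this numbers rows rather than distinct people, so two rows with the same original name get different labels, just as in the answers above.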
demo_df <- data_frame(id = c(1,2,3), names = c("Hillary", "Madison", "John"), stock = c(43,5,2), bill = c(43,112,33))
How can I identify the gender from the names column?
Expected output:
demo_df <- data_frame(id = c(1,2,3), names = c("Hillary", "Madison", "John"), gender = c("female", "female", "male"), stock = c(43,5,2), bill = c(43,112,33))
Tried this
library(gender)
test <- gender_df(demo_df, method = "demo",
name_col = "name", year_col = c("1900", "2000"))
but I receive this error
Error in gender_df(demo_df, method = "demo", name_col = "name") :
year_col %in% names(data) is not TRUE
Use gender() instead of gender_df().
Note that gender() automatically sorts output alphabetically by name, so it won't work to simply add the output as a new vector to demo_df, as the ordering may be wrong.
Two options to handle this:
1. Sort demo_df alphabetically by name before you call gender().
library(dplyr)
demo_df %>%
arrange(names) %>%
mutate(gender = gender::gender(demo_df$names)$gender)
2. Use a join method, like dplyr::inner_join, to merge demo_df and the resulting data frame output of the call to gender(), on the names column.
gender_df <- gender::gender(demo_df$names) %>%
select(names = name, gender)
inner_join(demo_df, gender_df, by = "names")
Output:
  id   names stock bill gender
1  1 Hillary    43   43 female
2  2 Madison     5  112 female
3  3    John     2   33   male
All of this is possible in base R, too, not including the gender imputation part. I just prefer dplyr.
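To illustrate why the join is the safer of the two options, here is a small sketch that does not call the gender package at all. `gender_lookup` is a hand-written stand-in for the output of gender(), sorted alphabetically by name as described above, with the genders taken from the expected output in the question.

```r
library(dplyr)

demo_df <- data.frame(
  id = c(1, 2, 3),
  names = c("Hillary", "Madison", "John"),
  stock = c(43, 5, 2),
  bill = c(43, 112, 33),
  stringsAsFactors = FALSE
)

# Stand-in for the result of gender::gender(demo_df$names), sorted by name
gender_lookup <- data.frame(
  name = c("Hillary", "John", "Madison"),
  gender = c("female", "male", "female"),
  stringsAsFactors = FALSE
)

# Assigning the sorted vector by position would mislabel Madison and John
wrong <- gender_lookup$gender

# Joining on the name keeps each gender attached to the right row
joined <- left_join(demo_df, gender_lookup, by = c("names" = "name"))
joined
```

Using `left_join()` instead of `inner_join()` also keeps rows whose name is not found, with NA in the gender column.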