R: join by tolower

I've got some sample data:
data1 = data.frame(name = c("cat", "dog", "parrot"), freq = c(1,2,3))
data2 = data.frame(name = c("Cat", "snake", "Dog"), freq2 = c(2,3,4))
data1$name = as.character(data1$name)
data2$name = as.character(data2$name)
which I want to join, but e.g. "cat" and "Cat" should be treated as the same value. I thought of using tolower to first determine the entries that appear in both data frames:
in_both = data1[(tolower(data1$name) %in% tolower(data2$name)),]
Then I want to join with data2, but that doesn't work because the names don't match.
library(dplyr)
left_join(in_both, data2)
Is there a way to join by using tolower?

Why not create a function that lowercases the name column of the left data frame and then performs the join?
With a custom function, you get more control and you don't have to repeat the steps every time.
f_dplyr <- function(left, right){
  left$name <- tolower(left$name)
  inner_join(left, right, by = "name")
}
f_dplyr(data2, data1)
Result:
  name freq2 freq
1  cat     2    1
2  dog     4    2

If you don't want to alter your original data2, as @AshofFire suggested, you can lowercase the values in name inside a pipe %>% and then perform the join operation:
library(stringr)
data2 %>%
  mutate(name = str_to_lower(name)) %>%
  inner_join(data1, by = "name")
name freq2 freq
1 cat 2 1
2 dog 4 2
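If you'd rather not overwrite either original name column at all, a base R variant is to add a temporary lowercased key to both frames and merge on that. A sketch, reusing the question's sample data:

```r
# Reusing the sample data from the question
data1 <- data.frame(name = c("cat", "dog", "parrot"), freq = c(1, 2, 3),
                    stringsAsFactors = FALSE)
data2 <- data.frame(name = c("Cat", "snake", "Dog"), freq2 = c(2, 3, 4),
                    stringsAsFactors = FALSE)

# add a temporary lowercased key to each frame and merge on it;
# the original name columns survive as name.x / name.y
data1$key <- tolower(data1$name)
data2$key <- tolower(data2$name)
joined <- merge(data1, data2, by = "key")
```

Because merge() keeps both original spellings (name.x from data1, name.y from data2), you can decide afterwards which casing to keep.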

Related

How to recode dataframe values to keep only those that satisfy a certain set, replace others with "other"

I'm looking for a concise solution, preferably using dplyr, to clean up values in a dataframe column so that I can keep as they are values that match a certain set, but others that don't match will be recoded as "other".
Example
I have a dataframe with names of animals. There are 4 legit animal names, but other rows contain gibberish rather than names. I want to clean the column up, to keep only the legit animal names: zebra, lion, cow, or cat.
Data
library(tidyverse)
library(stringi)
real_animals_names <- sample(c("zebra", "cow", "lion", "cat"), size = 50, replace = TRUE)
gibberish <- do.call(paste0, Map(stri_rand_strings, n = 50, length = c(5, 4, 1),
                                 pattern = c('[a-z]', '[0-9]', '[A-Z]')))
df <- tibble(animals = sample(c(real_animals_names, gibberish)))
> df
## # A tibble: 100 x 1
## animals
## <chr>
## 1 zebra
## 2 zebra
## 3 rbzal0677O
## 4 lion
## 5 cat
## 6 cfsgt0504G
## 7 cat
## 8 jhixe2566V
## 9 lion
## 10 zebra
## # ... with 90 more rows
One way to solve the problem, which I find annoying and not concise
Using dplyr 1.0.2
df %>%
  mutate(across(animals, recode,
                "lion" = "lion",
                "zebra" = "zebra",
                "cow" = "cow",
                "cat" = "cat",
                .default = "other"))
This gets it done, but this code repeats each animal name twice, and I find it clunky. Is there a cleaner solution, preferably using dplyr?
EDIT GIVEN SUGGESTED ANSWERS BELOW
Since I like the readability of dplyr::recode but dislike having to repeat each animal name twice, and since the answers below utilize %in%: could I incorporate %in% in my own recode solution to make it simpler and more concise?
A base solution:
keep_names <- c('lion', 'zebra', 'cow', 'cat')
within(df, animals[!animals %in% keep_names] <- "other")
A dplyr option with replace():
library(tidyverse)
df %>%
  mutate(animals = replace(animals, !animals %in% keep_names, "other"))
With recode(), you can use a named character vector for unquote splicing with !!!.
df %>%
  mutate(animals = recode(animals, !!!set_names(keep_names), .default = "other"))
Note: set_names(keep_names) is equivalent to setNames(keep_names, keep_names).
You could keep the animals that you need as it is and turn the rest to "Others" :
library(dplyr)
keep_names <- c('lion', 'zebra', 'cow', 'cat')
df %>% mutate(animals = ifelse(animals %in% keep_names, animals, 'Others'))
I know you asked preferably for a dplyr solution, but here is a data.table solution (note that I changed the tibble() call to data.table()):
library(stringi)
library(data.table)
real_animals_names <- sample(c("zebra", "cow", "lion", "cat"), size = 50, replace = TRUE)
gibberish <- do.call(paste0, Map(stri_rand_strings, n = 50, length = c(5, 4, 1),
                                 pattern = c('[a-z]', '[0-9]', '[A-Z]')))
df <- data.table(animals = sample(c(real_animals_names, gibberish)))
keep_names <- c("lion", "zebra", "cow", "cat")
df[!animals %in% keep_names, animals := "other"]
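For completeness, forcats (also part of the tidyverse) has fct_other(), which was built for exactly this keep-these-levels-and-lump-the-rest operation. Note it returns a factor, so wrap it in as.character() if you want to stay with character data. A sketch with a small hand-made vector:

```r
library(forcats)

keep_names <- c("lion", "zebra", "cow", "cat")
animals <- c("zebra", "rbzal0677O", "cat", "lion", "cfsgt0504G", "cow")

# fct_other() keeps the levels listed in `keep` and collapses everything
# else into a single `other_level`; the result is a factor
cleaned <- as.character(fct_other(animals, keep = keep_names, other_level = "other"))
```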

How do I aggregate data in R in a way that returns the entire row that satisfies the aggregation condition? [no dplyr]

I have data that looks like this:
ID FACTOR_VAR INT_VAR
1 CAT 1
1 DOG 0
I want to aggregate by ID such that the resulting dataframe contains the entire row that satisfies my aggregate condition. So if I aggregate by the max of INT_VAR, I want to return the whole first row:
ID FACTOR_VAR INT_VAR
1 CAT 1
The following will not work because FACTOR_VAR is a factor:
new_data <- aggregate(data[,c("ID", "FACTOR_VAR", "INT_VAR")], by=list(data$ID), fun=max)
How can I do this? I know dplyr has a group by function, but unfortunately I am working on a computer for which downloading packages takes a long time. So I'm looking for a way to do this with just vanilla R.
If you want to keep all the columns, use ave instead:
subset(df, as.logical(ave(INT_VAR, ID, FUN = function(x) x == max(x))))
You can use aggregate for this. If you want to retain all the columns, merge can be used with it.
merge(aggregate(INT_VAR ~ ID, data = df, max), df, all.x = T)
# ID INT_VAR FACTOR_VAR
#1 1 1 CAT
data
df <- structure(list(ID = c(1L, 1L), FACTOR_VAR = structure(1:2, .Label = c("CAT", "DOG"), class = "factor"), INT_VAR = 1:0), class = "data.frame", row.names = c(NA,-2L))
We can do this in dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(INT_VAR == max(INT_VAR))
Or using data.table
library(data.table)
setDT(df)[, .SD[INT_VAR == max(INT_VAR)], by = ID]

How to semi_join two dataframes by string column with one being colon-separated

I have two dataframes, dfa and dfb:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5)
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
id = c(6:10)
)
which look like this:
> dfa
gene_name id
1 MUC16 1
2 MUC2 2
3 MET 3
4 FAT1 4
5 TERT 5
> dfb
gene_name id
1 MUC1 6
2 MET; BLEP 7
3 MUC21 8
4 FAT 9
5 TERT 10
dfa is my genes of interest list: I want to keep the dfb rows where they appear, minding the digits (MUC1 is not MUC16). My new_df should look like this:
> new_df
gene_name id
1 MET; BLEP 7
2 TERT 10
My problem is that the regular dplyr::semi_join() does exact matches, which doesn't take into account the fact that dfb$gene_names can contain genes separated with "; ". Meaning that with this example, "MET" is not retained.
I tried to look into fuzzyjoin::regex_semi_join, but I can't make it do what I want...
A tidyverse solution would be welcome. (Maybe with stringr?!)
EDIT: Follow-up question...
How would I go about to do the reciprocal anti_join? Simply changing semi_join to anti_join in this method doesn't work because the row MET; BLEP is present when it shouldn't be...
Adding a filter(gene_name == new_col) after the anti_join works with the provided simple dataset, but if I twist it a little like this:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5)
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
id = c(6:10)
)
...then it doesn't anymore. Here and in my real-life dataset, dfa doesn't contain semicolons, it's only one column of individual gene names. But dfb contains a lot of information, and multiple combinations of semicolons...
You can use separate_rows() to split the dataframe before joining. Note that if BLEP existed in dfa, it would result in a duplicate row, which is why distinct() is used:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5),
stringsAsFactors = FALSE
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
id = c(6:10),
stringsAsFactors = FALSE
)
library(tidyverse)
dfb %>%
  mutate(new_col = gene_name) %>%
  separate_rows(new_col, sep = "; ") %>%
  semi_join(dfa, by = c("new_col" = "gene_name")) %>%
  select(gene_name, id) %>%
  distinct()
Here's a solution using stringr and purrr.
library(tidyverse)
dfb %>%
  mutate(gene_name_list = str_split(gene_name, "; ")) %>%
  mutate(gene_of_interest = map_lgl(gene_name_list, some, ~ . %in% dfa$gene_name)) %>%
  filter(gene_of_interest) %>%
  select(gene_name, id)
I think I finally managed to make fuzzyjoin's regex joins do what I want. It was ridiculously simple; I just had to tweak my dfa filter list:
library(fuzzyjoin)
# add the "\b" regex (word boundary) before/after each gene in the filter list
# (to match whole words only)
dfa$gene_name <- paste0("\\b", dfa$gene_name, "\\b")
# to keep genes from dfb that are present in the dfa filter list
dfb %>%
regex_semi_join(dfa, by = c(gene_name = "gene_name"))
# to exclude genes from dfb that are present in the dfa filter blacklist
dfb %>%
regex_anti_join(dfa, by = c(gene_name = "gene_name"))
One drawback though: it's quite slow...

dplyr join by exclusion?

When using the various join functions from dplyr you can either join all variables with the same name (by default) or specify those ones using by = c("a" = "b"). Is there a way to join by exclusion? For example, I have 1000 variables in two data frames and I want to join them by 999 of them, leaving one out. I don't want to do by = c("a1" = "b1", ...,"a999" = "b999"). Is there a way to join by excluding the one variable that is not used?
Ok, using this example from one answer:
set.seed(24)
df1 <- data_frame(alala = LETTERS[1:3], skks = letters[1:3],
                  sskjs = letters[1:3], val = rnorm(3))
df2 <- data_frame(alala = LETTERS[1:3], skks = letters[1:3],
                  sskjs = letters[1:3], val = rnorm(3))
I want to join them using all variables excluding val. I'm looking for a more general solution: assuming there are 1000 variables and I only remember the name of the one I want to exclude, not its index, how can I perform the join knowing only the variable name to exclude? I understand I can find the column index first, but is there a simple way to specify exclusions in by =?
We create a named vector to do this
library(dplyr)
grps <- setNames(paste0("b", 1:999), paste0("a", 1:999))
Note the 'grps' vector is created with paste as the OP's post suggested a pattern. If there is no pattern, but we know the column that is not to be grouped
nogroupColumn <- "someColumn"
grps <- setNames(setdiff(names(df1), nogroupColumn),
                 setdiff(names(df2), nogroupColumn))
inner_join(df1, df2, by = grps)
Using a reproducible example
set.seed(24)
df1 <- data_frame(a1 = LETTERS[1:3], a2 = letters[1:3], val = rnorm(3))
df2 <- data_frame(b1 = LETTERS[3:4], b2 = letters[3:4], valn = rnorm(2))
grps <- setNames(paste0("b", 1:2), paste0("a", 1:2))
inner_join(df1, df2, by = grps)
# A tibble: 1 x 4
# a1 a2 val valn
# <chr> <chr> <dbl> <dbl>
#1 C c 0.420 -0.584
To exclude a certain field (or fields), you need to identify the indexes of the columns you want to keep. Here's one way:
which(!names(df1) %in% "sskjs")  # excludes the column "sskjs"
[1] 1 2 4                        # leaving only the desired column indexes
Use unite() (from tidyr) to create a join_id in each dataframe, and join by it:
df1 <- df1 %>%
  unite(join_id, which(!names(.) %in% "sskjs"), remove = F)
df2 <- df2 %>%
  unite(join_id, which(!names(.) %in% "sskjs"), remove = F)
left_join(df1, df2, by = "join_id")
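When the key columns share the same names in both frames (as in the reproducible example above, not the a1/b1 renaming case), you can also skip the renaming machinery and build the by vector with setdiff() over the shared names. A sketch:

```r
library(dplyr)

set.seed(24)
df1 <- tibble(alala = LETTERS[1:3], skks = letters[1:3],
              sskjs = letters[1:3], val = rnorm(3))
df2 <- tibble(alala = LETTERS[1:3], skks = letters[1:3],
              sskjs = letters[1:3], val = rnorm(3))

# join on every column the two frames share, except "val"
join_cols <- setdiff(intersect(names(df1), names(df2)), "val")
res <- inner_join(df1, df2, by = join_cols)
```

The non-key val columns come out as val.x and val.y, just as with merge().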

Joining multiple DataFrames using SparkR

I have a DataFrame with Person data and also have like 20 more DataFrames with a common key Person_Id. I want to join all of them to the Person DataFrame to have all my data in the same DataFrame.
I tried both join and merge like this:
merge(df_person, df_1, by="Person_Id", all.x=TRUE)
and
join(df_person, df_1, df_person$Person_Id == df_1$Person_Id, "left")
In both cases I run into the same problem: both functions join the datasets correctly, but the Person_Id field ends up duplicated. Is there any way to tell those functions not to duplicate the Person_Id field?
Also, does anyone know a more efficient way to join all those DataFrames together?
Thank you so much for your help in advance.
Other supported languages provide a simplified equi-join syntax, but it looks like it is not implemented in R, so you have to do it the old way (rename and drop):
library(magrittr)
withColumnRenamed(df_1, "Person_Id", "Person_Id_") %>%
  join(df_2, column("Person_Id") == column("Person_Id_")) %>%
  drop("Person_Id_")
If you're doing a lot of joins in SparkR, it is worthwhile to make your own function that renames, joins, and then removes the renamed column:
DFJoin <- function(left_df, right_df, key = "key", join_type = "left"){
  left_df <- withColumnRenamed(left_df, key, "left_key")
  right_df <- withColumnRenamed(right_df, key, "right_key")
  result <- join(left_df, right_df,
                 left_df$left_key == right_df$right_key,
                 joinType = join_type)
  result <- withColumnRenamed(result, "left_key", key)
  result$right_key <- NULL
  return(result)
}
df1 <- as.DataFrame(data.frame(Person_Id = c("1", "2", "3"), value_1 = c(2, 4, 6)))
df2 <- as.DataFrame(data.frame(Person_Id = c("1", "2"), value_2 = c(3, 6)))
df3 <- DFJoin(df1, df2, key = "Person_Id", join_type = "left")
head(df3)
Person_Id value_1 value_2
1 3 6 NA
2 1 2 3
3 2 4 6
