I've got some sample data:
data1 = data.frame(name = c("cat", "dog", "parrot"), freq = c(1,2,3))
data2 = data.frame(name = c("Cat", "snake", "Dog"), freq2 = c(2,3,4))
data1$name = as.character(data1$name)
data2$name = as.character(data2$name)
which I want to join, but e.g. "cat" and "Cat" should be treated as the same value. I thought of using tolower and first to determine the entries which appear in both data frames by
in_both = data1[(tolower(data1$name) %in% tolower(data2$name)),]
Then I want to join with data2, but that doesn't work because the names don't match.
library(dplyr)
left_join(in_both, data2)
Is there a way to join by using tolower?
Why not write a small function around the dplyr join that lowercases name in the left data.frame and then performs the join?
With a custom function you get more control and you don't have to repeat the same steps each time.
f_dplyr <- function(left,right){
left$name <- tolower(left$name)
inner_join(left,right,by="name")
}
f_dplyr(data2, data1)
Result
name freq2 freq
cat 2 1
dog 4 2
If you don't want to alter your original data2, as @AshofFire suggested, you can lowercase the values in name inside a pipe (%>%) and then perform the join operation:
library(stringr)  # provides str_to_lower()
data2 %>%
mutate(name = str_to_lower(name)) %>%
inner_join(data1, by = "name")
name freq2 freq
1 cat 2 1
2 dog 4 2
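If the original capitalization in data2 should survive the join, one option is to match on a temporary lowercase key instead of overwriting name. This is only a sketch: name_key is a made-up helper column, and stringsAsFactors = FALSE is set so the example behaves the same on older R versions.

```r
library(dplyr)

data1 <- data.frame(name = c("cat", "dog", "parrot"), freq = c(1, 2, 3),
                    stringsAsFactors = FALSE)
data2 <- data.frame(name = c("Cat", "snake", "Dog"), freq2 = c(2, 3, 4),
                    stringsAsFactors = FALSE)

# name_key is a hypothetical helper column used only for matching
joined <- data2 %>%
  mutate(name_key = tolower(name)) %>%
  inner_join(
    data1 %>%
      mutate(name_key = tolower(name)) %>%
      select(-name),            # drop data1's spelling, keep data2's
    by = "name_key"
  ) %>%
  select(-name_key)             # remove the helper column again
```

joined keeps "Cat" and "Dog" as they are spelled in data2, and neither input data frame is modified.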
I'm looking for a concise solution, preferably using dplyr, to clean up values in a dataframe column so that I can keep as they are values that match a certain set, but others that don't match will be recoded as "other".
Example
I have a dataframe with names of animals. There are 4 legit animal names, but other rows contain gibberish rather than names. I want to clean the column up, to keep only the legit animal names: zebra, lion, cow, or cat.
Data
library(tidyverse)
library(stringi)
real_animals_names <- sample(c("zebra", "cow", "lion", "cat"), size = 50, replace = TRUE)
gibberish <- do.call(paste0, Map(stri_rand_strings, n = 50, length=c(5, 4, 1),
pattern = c('[a-z]', '[0-9]', '[A-Z]')))
df <- tibble(animals = sample(c(real_animals_names, gibberish)))
> df
## # A tibble: 100 x 1
## animals
## <chr>
## 1 zebra
## 2 zebra
## 3 rbzal0677O
## 4 lion
## 5 cat
## 6 cfsgt0504G
## 7 cat
## 8 jhixe2566V
## 9 lion
## 10 zebra
## # ... with 90 more rows
One way to solve the problem -- which I find annoying and not concise
Using dplyr 1.0.2
df %>%
mutate(across(animals, recode,
"lion" = "lion",
"zebra" = "zebra",
"cow" = "cow",
"cat" = "cat",
.default = "other"))
This gets it done, but this code repeats each animal name twice, and I find it clunky. Is there a cleaner solution, preferably using dplyr?
EDIT GIVEN SUGGESTED ANSWERS BELOW
Since I do like the readability of dplyr::recode, but dislike having to repeat each animal name twice; and since the answers below utilize %in% – could I incorporate %in% in my own recode solution to make it simpler/more concise?
A base solution:
keep_names <- c('lion', 'zebra', 'cow', 'cat')
within(df, animals[!animals %in% keep_names] <- "other")
A dplyr option with replace():
library(tidyverse)
df %>%
mutate(animals = replace(animals, !animals %in% keep_names, "other"))
With recode(), you can use a named character vector for unquote splicing with !!!.
df %>%
mutate(animals = recode(animals, !!!set_names(keep_names), .default = "other"))
Note: set_names(keep_names) is equivalent to setNames(keep_names, keep_names).
You could keep the animals that you need as they are and turn the rest into "Others":
library(dplyr)
keep_names <- c('lion', 'zebra', 'cow', 'cat')
df %>% mutate(animals = ifelse(animals %in% keep_names, animals, 'Others'))
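To check that the %in%-based answers agree, here is a small deterministic sketch on a plain character vector instead of the random df. The two gibberish strings are copied from the printout in the question, and "other" is used as the fallback label in both calls.

```r
keep_names <- c("lion", "zebra", "cow", "cat")
animals <- c("zebra", "rbzal0677O", "lion", "cfsgt0504G", "cat")

# replace(): overwrite the positions that are NOT in keep_names
via_replace <- replace(animals, !animals %in% keep_names, "other")

# ifelse(): keep the value when it is in keep_names, otherwise use "other"
via_ifelse <- ifelse(animals %in% keep_names, animals, "other")

print(via_replace)
print(identical(via_replace, via_ifelse))
```

Either way each animal name is written only once, in keep_names.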
I know you asked preferably for a dplyr solution, but here is a data.table solution (note that I changed the tibble() call to data.table()):
library(stringi)
library(data.table)
real_animals_names <- sample(c("zebra", "cow", "lion", "cat"), size = 50, replace = TRUE)
gibberish <- do.call(paste0, Map(stri_rand_strings, n = 50, length=c(5, 4, 1),
pattern = c('[a-z]', '[0-9]', '[A-Z]')))
df <- data.table(animals = sample(c(real_animals_names, gibberish)))
keep_names <- c("lion", "zebra", "cow", "cat")
df[!animals %in% keep_names, animals := "other"]
I have data that looks like this:
ID FACTOR_VAR INT_VAR
1 CAT 1
1 DOG 0
I want to aggregate by ID such that the resulting dataframe contains the entire row that satisfies my aggregate condition. So if I aggregate by the max of INT_VAR, I want to return the whole first row:
ID FACTOR_VAR INT_VAR
1 CAT 1
The following will not work because FACTOR_VAR is a factor:
new_data <- aggregate(data[,c("ID", "FACTOR_VAR", "INT_VAR")], by=list(data$ID), FUN=max)
How can I do this? I know dplyr has a group by function, but unfortunately I am working on a computer for which downloading packages takes a long time. So I'm looking for a way to do this with just vanilla R.
If you want to keep all the columns, use ave instead:
subset(df, as.logical(ave(INT_VAR, ID, FUN = function(x) x == max(x))))
You can use aggregate for this. If you want to retain all the columns, merge can be used with it.
merge(aggregate(INT_VAR ~ ID, data = df, max), df, all.x = TRUE)
# ID INT_VAR FACTOR_VAR
#1 1 1 CAT
data
df <- structure(list(ID = c(1L, 1L), FACTOR_VAR = structure(1:2, .Label = c("CAT", "DOG"), class = "factor"), INT_VAR = 1:0), class = "data.frame", row.names = c(NA,-2L))
We can do this in dplyr
library(dplyr)
df %>%
group_by(ID) %>%
filter(INT_VAR == max(INT_VAR))
Or using data.table
library(data.table)
setDT(df)[, .SD[INT_VAR == max(INT_VAR)], by = ID]
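One thing to watch for: the ave, merge, dplyr and data.table answers all return every row that ties for the maximum within an ID. If exactly one row per ID is needed, a vanilla R sketch is to sort and then drop the duplicated IDs. The ID 2 rows below are invented to show the tie case.

```r
df <- data.frame(
  ID = c(1L, 1L, 2L, 2L),
  FACTOR_VAR = factor(c("CAT", "DOG", "CAT", "DOG")),
  INT_VAR = c(1L, 0L, 3L, 3L)   # ID 2 has a tie at the maximum
)

# sort by ID, then by INT_VAR descending, so each group's maximum comes first
ordered <- df[order(df$ID, -df$INT_VAR), ]

# duplicated() flags every row after the first one of each ID
one_per_id <- ordered[!duplicated(ordered$ID), ]
print(one_per_id)
```

Because order() keeps tied rows in their original order, the tie in ID 2 is resolved in favour of the row that appears first in df.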
I have two dataframes, dfa and dfb:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5)
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
id = c(6:10)
)
which look like this:
> dfa
gene_name id
1 MUC16 1
2 MUC2 2
3 MET 3
4 FAT1 4
5 TERT 5
> dfb
gene_name id
1 MUC1 6
2 MET; BLEP 7
3 MUC21 8
4 FAT 9
5 TERT 10
dfa is my genes of interest list: I want to keep the dfb rows where they appear, minding the digits (MUC1 is not MUC16). My new_df should look like this:
> new_df
gene_name id
1 MET; BLEP 7
2 TERT 10
My problem is that the regular dplyr::semi_join() does exact matches, which doesn't take into account the fact that dfb$gene_name can contain several genes separated with "; ". Meaning that with this example, "MET" is not retained.
I tried to look into fuzzyjoin::regex_semi_join, but I can't make it do what I want...
A tidyverse solution would be welcome. (Maybe with stringr?!)
EDIT: Follow-up question...
How would I do the reciprocal anti_join? Simply changing semi_join to anti_join in this method doesn't work, because the row MET; BLEP is still present when it shouldn't be...
Adding a filter(gene_name == new_col) after the anti_join works with the provided simple dataset, but if I twist it a little like this:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5)
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
id = c(6:10)
)
...then it doesn't anymore. Here and in my real-life dataset, dfa doesn't contain semicolons, it's only one column of individual gene names. But dfb contains a lot of information, and multiple combinations of semicolons...
You can use separate_rows() to split the rows of the data frame before joining. Note that if BLEP existed in dfa, it would result in a duplicate, which is why distinct() is used:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5),
stringsAsFactors = FALSE
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
id = c(6:10),
stringsAsFactors = FALSE
)
library(tidyverse)
dfb%>%
mutate(new_col = gene_name)%>%
separate_rows(new_col,sep = "; ")%>%
semi_join(dfa,by = c("new_col" = "gene_name"))%>%
select(gene_name,id)%>%
distinct()
Here's a solution using stringr and purrr.
library(tidyverse)
dfb %>%
mutate(gene_name_list = str_split(gene_name, "; ")) %>%
mutate(gene_of_interest = map_lgl(gene_name_list, some, ~ . %in% dfa$gene_name)) %>%
filter(gene_of_interest == TRUE) %>%
select(gene_name, id)
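The same per-row test also covers the follow-up anti_join question, because negating it gives the rows to exclude. Below is a base R sketch of that idea using the "twisted" dfb from the edit; has_match, kept and dropped are names I made up for illustration.

```r
dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = 1:5,
  stringsAsFactors = FALSE
)
dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
  id = 6:10,
  stringsAsFactors = FALSE
)

# TRUE when at least one "; "-separated gene in the row appears in dfa
has_match <- vapply(
  strsplit(dfb$gene_name, "; ", fixed = TRUE),
  function(genes) any(genes %in% dfa$gene_name),
  logical(1)
)

kept    <- dfb[has_match, ]    # behaves like the semi_join
dropped <- dfb[!has_match, ]   # behaves like the anti_join
```

Since the comparison is done on whole split-out names with %in%, MUC1 and MUC21 are not confused with MUC16 or MUC2.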
I think I finally managed to make fuzzyjoin::regex_joins do what I want. It was ridiculously simple, I just had to tweak my dfa filter list:
library(fuzzyjoin)
# add the "\b" word-boundary anchor before/after each gene in the filter list
# (to match whole words only)
dfa$gene_name <- paste0("\\b", dfa$gene_name, "\\b")
# to keep genes from dfb that are present in the dfa filter list
dfb %>%
regex_semi_join(dfa, by = c(gene_name = "gene_name"))
# to exclude genes from dfb that are present in the dfa filter blacklist
dfb %>%
regex_anti_join(dfa, by = c(gene_name = "gene_name"))
One drawback though: it's quite slow...
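To see what the \\b anchors are doing, here is a base R check with grepl() on the gene names from the question. Collapsing the list into a single alternation is my own variation, not part of fuzzyjoin, and I have not benchmarked it against regex_semi_join; it also assumes the gene names contain no regex metacharacters.

```r
genes <- c("MUC16", "MUC2", "MET", "FAT1", "TERT")
candidates <- c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT")

# one pattern: any gene from the list, delimited by word boundaries
pattern <- paste0("\\b(", paste(genes, collapse = "|"), ")\\b")

# perl = TRUE selects the PCRE engine, where \b is a word boundary
is_hit <- grepl(pattern, candidates, perl = TRUE)

print(candidates[is_hit])    # rows a semi join would keep
print(candidates[!is_hit])   # rows an anti join would keep
```

"MUC21" is not a hit for MUC2 because there is no word boundary between the 2 and the 1, while "MET; BLEP" is a hit because the semicolon ends the word MET.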
When using the various join functions from dplyr you can either join all variables with the same name (by default) or specify those ones using by = c("a" = "b"). Is there a way to join by exclusion? For example, I have 1000 variables in two data frames and I want to join them by 999 of them, leaving one out. I don't want to do by = c("a1" = "b1", ...,"a999" = "b999"). Is there a way to join by excluding the one variable that is not used?
Ok, using this example from one answer:
set.seed(24)
df1 <- data_frame(alala = LETTERS[1:3], skks = letters[1:3],
sskjs = letters[1:3], val = rnorm(3))
df2 <- data_frame(alala = LETTERS[1:3], skks = letters[1:3],
sskjs = letters[1:3], val = rnorm(3))
I want to join them using all variables excluding val. I'm looking for a more general solution. Assume there are 1000 variables and I only remember the name of the one that I want to exclude from the join, without knowing the index of that variable. How can I perform the join while only knowing the variable names to exclude? I understand I can find the column index first, but is there a simple way to add exclusions in by =?
We create a named vector to do this
library(dplyr)
grps <- setNames(paste0("b", 1:999), paste0("a", 1:999))
Note that the 'grps' vector is created with paste because the OP's post suggested a pattern. If there is no pattern, but we know the column that should be left out of the join:
nogroupColumn <- "someColumn"
# names are the df1 (left) columns, values are the matching df2 (right) columns
grps <- setNames(setdiff(names(df2), nogroupColumn),
setdiff(names(df1), nogroupColumn))
inner_join(df1, df2, by = grps)
Using a reproducible example
set.seed(24)
df1 <- data_frame(a1 = LETTERS[1:3], a2 = letters[1:3], val = rnorm(3))
df2 <- data_frame(b1 = LETTERS[3:4], b2 = letters[3:4], valn = rnorm(2))
grps <- setNames(paste0("b", 1:2), paste0("a", 1:2))
inner_join(df1, df2, by = grps)
# A tibble: 1 x 4
# a1 a2 val valn
# <chr> <chr> <dbl> <dbl>
#1 C c 0.420 -0.584
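When the key columns have the same names in both data frames, as in the question's own example, the named vector is not needed: by accepts a plain character vector, so the exclusion can be done with setdiff(). A sketch under that assumption (exclude and join_cols are illustrative names; data.frame is used instead of data_frame only to keep the example free of deprecated calls):

```r
library(dplyr)

set.seed(24)
df1 <- data.frame(alala = LETTERS[1:3], skks = letters[1:3],
                  sskjs = letters[1:3], val = rnorm(3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(alala = LETTERS[1:3], skks = letters[1:3],
                  sskjs = letters[1:3], val = rnorm(3),
                  stringsAsFactors = FALSE)

exclude <- "val"   # the only column name we need to remember

# every column the two data frames share, minus the excluded one
join_cols <- setdiff(intersect(names(df1), names(df2)), exclude)

joined <- inner_join(df1, df2, by = join_cols)
print(names(joined))
```

The excluded column is carried along from both sides and gets dplyr's default suffixes, so it shows up as val.x and val.y.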
To exclude certain fields, you need to identify the indices of the columns you want. Here's one way:
which(!names(df1) %in% "sskjs" ) #<this excludes the column "sskjs"
[1] 1 2 4 #<and shows only the desired index columns
Use unite() (from tidyr) to create a join_id in each data frame, and join by it.
df1 <- df1 %>%
unite(join_id, which(!names(.) %in% "sskjs"), remove = FALSE)
df2 <- df2 %>%
unite(join_id, which(!names(.) %in% "sskjs"), remove = FALSE)
left_join(df1, df2, by = "join_id" )
I have a DataFrame with Person data and also have like 20 more DataFrames with a common key Person_Id. I want to join all of them to the Person DataFrame to have all my data in the same DataFrame.
I tried both join and merge like this:
merge(df_person, df_1, by="Person_Id", all.x=TRUE)
and
join(df_person, df_1, df_person$Person_Id == df_1$Person_Id, "left")
Both give me the same problem: the join itself is correct, but the Person_Id field ends up duplicated in the result. Is there any way to tell these functions not to duplicate the Person_Id field?
Also, does anyone know a more efficient way to join all those DataFrames together?
Thank you so much in advance for your help.
Other supported languages offer a simplified equi-join syntax, but it looks like it is not implemented in R, so you have to do it the old way (rename, join, and drop):
library(magrittr)
withColumnRenamed(df_1, "Person_Id", "Person_Id_") %>%
join(df_2, column("Person_Id") == column("Person_Id_")) %>%
drop("Person_Id_")
If you're doing a lot of joins in SparkR, it is worthwhile to write your own function that renames the key, joins, and then removes the renamed column:
DFJoin <- function(left_df, right_df, key = "key", join_type = "left"){
left_df <- withColumnRenamed(left_df, key, "left_key")
right_df <- withColumnRenamed(right_df, key, "right_key")
result <- join(
left_df, right_df,
left_df$left_key == right_df$right_key,
joinType = join_type)
result <- withColumnRenamed(result, "left_key", key)
result$right_key <- NULL
return(result)
}
df1 <- as.DataFrame(data.frame(Person_Id = c("1", "2", "3"), value_1 = c(2, 4, 6)))
df2 <- as.DataFrame(data.frame(Person_Id = c("1", "2"), value_2 = c(3, 6)))
df3 <- DFJoin(df1, df2, key = "Person_Id", join_type = "left")
head(df3)
Person_Id value_1 value_2
1 3 6 NA
2 1 2 3
3 2 4 6
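On the second part of the question (joining around 20 DataFrames), the usual R idiom is to put them in a list and fold over it with Reduce(). The sketch below shows the pattern on plain local data.frames with base merge(), so it runs without a Spark session; left_merge and the sample columns are invented for illustration. With SparkR DataFrames the same Reduce() call should work with a two-argument join helper such as DFJoin above, but I have not tested that.

```r
df_person <- data.frame(Person_Id = c("1", "2", "3"),
                        name = c("a", "b", "c"),
                        stringsAsFactors = FALSE)
df_1 <- data.frame(Person_Id = c("1", "2"), value_1 = c(2, 4),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(Person_Id = c("1", "3"), value_2 = c(3, 6),
                   stringsAsFactors = FALSE)

# hypothetical two-argument helper: left join on the shared key
left_merge <- function(left, right) {
  merge(left, right, by = "Person_Id", all.x = TRUE)
}

# start from df_person and left-join each data frame in the list in turn
combined <- Reduce(left_merge, list(df_1, df_2), init = df_person)
print(combined)
```

Because merge() is given by = "Person_Id", the key appears only once in combined.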