Joining multiple DataFrames using SparkR

I have a DataFrame with Person data and about 20 more DataFrames that share a common key, Person_Id. I want to join all of them to the Person DataFrame so that all my data ends up in a single DataFrame.
I tried both join and merge like this:
merge(df_person, df_1, by="Person_Id", all.x=TRUE)
and
join(df_person, df_1, df_person$Person_Id == df_1$Person_Id, "left")
In both cases I run into the same problem: both functions join the datasets correctly, but they duplicate the Person_Id field. Is there any way to tell these functions not to duplicate the Person_Id field?
Also, does anyone know a more efficient way to join all those DataFrames together?
Thank you so much in advance for your help.

The other languages Spark supports offer a simplified equi-join syntax, but it does not appear to be implemented in R, so you have to do it the old way (rename, join, and drop):
library(magrittr)
withColumnRenamed(df_1, "Person_Id", "Person_Id_") %>%
  join(df_2, column("Person_Id") == column("Person_Id_")) %>%
  drop("Person_Id_")

If you're doing a lot of joins in SparkR, it is worthwhile to write your own function that renames the key, joins, and then removes the renamed column:
DFJoin <- function(left_df, right_df, key = "key", join_type = "left"){
  # Give the key a distinct name on each side so the join condition is unambiguous
  left_df <- withColumnRenamed(left_df, key, "left_key")
  right_df <- withColumnRenamed(right_df, key, "right_key")
  result <- join(
    left_df, right_df,
    left_df$left_key == right_df$right_key,
    joinType = join_type)
  # Restore the original key name and drop the duplicate from the right side
  result <- withColumnRenamed(result, "left_key", key)
  result$right_key <- NULL
  return(result)
}
df1 <- as.DataFrame(data.frame(Person_Id = c("1", "2", "3"), value_1 = c(2, 4, 6)))
df2 <- as.DataFrame(data.frame(Person_Id = c("1", "2"), value_2 = c(3, 6)))
df3 <- DFJoin(df1, df2, key = "Person_Id", join_type = "left")
head(df3)
  Person_Id value_1 value_2
1         3       6      NA
2         1       2       3
3         2       4       6
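
For the second part of the question (joining about 20 DataFrames), one option is to fold a join helper over a list of DataFrames with Reduce. The following is a minimal sketch under assumptions, not a tuned recipe: it assumes a local Spark installation that sparkR.session() can start, and join_on_key, df_person, df_a and df_b are hypothetical names made up for the illustration.

```r
library(SparkR)
sparkR.session()  # assumption: a local Spark installation is available

# Hypothetical helper: left-join right_df onto left_df and keep one key column
join_on_key <- function(left_df, right_df, key) {
  tmp_key <- paste0(key, "_right")
  right_df <- withColumnRenamed(right_df, key, tmp_key)
  joined <- join(left_df, right_df, column(key) == column(tmp_key), "left")
  drop(joined, tmp_key)
}

# Small made-up inputs standing in for the Person table and its satellites
df_person <- as.DataFrame(data.frame(Person_Id = c("1", "2", "3"),
                                     name = c("Ann", "Bob", "Cy"),
                                     stringsAsFactors = FALSE))
df_a <- as.DataFrame(data.frame(Person_Id = c("1", "2"), value_a = c(10, 20),
                                stringsAsFactors = FALSE))
df_b <- as.DataFrame(data.frame(Person_Id = c("2", "3"), value_b = c(5, 7),
                                stringsAsFactors = FALSE))

# Reduce applies the helper pairwise, starting from df_person
all_data <- Reduce(function(acc, df) join_on_key(acc, df, "Person_Id"),
                   list(df_a, df_b), df_person)
head(all_data)
```

This does not change how Spark executes the joins; it only avoids writing each join by hand.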

Related

How to rbind multiple dataframes with a while-loop?

I'm trying to rbind multiple loaded datasets (all of them have the same number of columns, named "num", "source" and "target"). In this case I have ten data frames, named "test1", "test2", "test3" and so on.
I thought that the solution below (creating an empty data frame and looping through the others) would solve my problem, but I guess I'm missing something in the second argument of the rbind function. I don't know whether using paste0("test", i) to build the name of each data frame is correct. I'm afraid that I'm just trying to rbind a data frame with a string object (and getting an error). Is that right?
test = as.data.frame(matrix(ncol = 3, nrow = 0)) %>%
setNames(c("num", "source", "target"))
i=1
while (i < 11) {
test = rbind(test, paste0("test", i))
i = i + 1
}

We need replicate to return a list:
out <- setNames(replicate(10, test, simplify = FALSE),
paste0("test", seq_len(10)))
If the datasets already exist in the global environment, collect them into a list with mget and rbind them within do.call:
out <- do.call(rbind, mget(paste0("test", 1:10)))
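
The loop in the question fails because paste0("test", i) only builds the string "test1"; it does not fetch the data frame with that name, so rbind tries to append the text itself. mget() (or get() for a single name) looks objects up by name. A small sketch, with three made-up data frames standing in for test1 to test10:

```r
# Three small stand-ins for the question's test1 ... test10
test1 <- data.frame(num = 1, source = "a", target = "b")
test2 <- data.frame(num = 2, source = "b", target = "c")
test3 <- data.frame(num = 3, source = "c", target = "d")

# paste0() alone yields a character string, not the data frame
class(paste0("test", 1))

# mget() fetches the named objects into a list; do.call() binds them in one step
combined <- do.call(rbind, mget(paste0("test", 1:3)))
rownames(combined) <- NULL  # drop the row names derived from the list names
combined
```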

We could bind test1:test10 using the common pattern in the name:
library(dplyr)
result <- mget(ls(pattern="^test\\d+")) %>%
bind_rows()

If I understood correctly, this might help you
Libraries
library(dplyr)
Example data
list_of_df <-
list(
df1 = data.frame(a = "1"),
df2 = data.frame(a = "2"),
df3 = data.frame(a = "1"),
df4 = data.frame(a = "2")
)
Code
bind_rows(list_of_df,.id = "dataset")
Result
dataset a
1 df1 1
2 df2 2
3 df3 1
4 df4 2

matching large vector of string against large vector of patterns

I have a very large dataframe with a column containing postal codes:
data <- data.frame(data = rnorm(n = 4),
code = c("1001", "1130", "2001", "9010"),
stringsAsFactors = F)
I also have a second, fairly large data frame with postal code patterns mapped to a zone.
mapping <- data.frame(code = c("10*", "20*"),
zone = c("zone1", "zone2"),
stringsAsFactors = F)
I would like to join those two tables to add the zone column to the data dataframe but the volume of the data is too large to do a "rowwise" grepl. What is the most efficient way of doing this?

data.table is an efficient way to deal with large objects. To do a join, you need a common column in both objects, so I use substr to keep only the first two digits of the code column in data (note that this overwrites the original code values). I also removed the "*" from mapping, as that character is not present in data.
library(data.table)
setDT(data)
setDT(mapping)
data[, code := substr(code, start = 1, stop = 2)]
mapping[data, on="code"]
code zone data
1: 10 zone1 -1.0481912
2: 11 <NA> 1.1339476
3: 20 zone2 -0.8072921
4: 90 <NA> 1.5883562
DATA
data <- data.frame(data = rnorm(n = 4),
code = c("1001", "1130", "2001", "9010"),
stringsAsFactors = F)
mapping <- data.frame(code = c("10", "20"),
zone = c("zone1", "zone2"),
stringsAsFactors = F)

I am not sure what specific method you mean by "rowwise", but here is what I would do in the dplyr world.
library(dplyr) # needed for %>%
mapping <- dplyr::rename(mapping, codeString = code) # rename for joining
data <- data %>%
  dplyr::mutate(codeString = paste0(substr(code, 1, 2), "*")) %>%
  dplyr::left_join(mapping, by = "codeString")
You should be able to join like this and avoid any rowwise operation, since the pattern you're looking for is easy to create.
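
If the original code column should be kept intact, the same prefix idea works with a separate helper column. A minimal base R sketch, assuming every pattern in mapping is a two-character prefix followed by "*"; the prefix column name and the fixed data values are made up for the example:

```r
data <- data.frame(value = c(0.5, 1.5, 2.5, 3.5),
                   code = c("1001", "1130", "2001", "9010"),
                   stringsAsFactors = FALSE)
mapping <- data.frame(code = c("10*", "20*"),
                      zone = c("zone1", "zone2"),
                      stringsAsFactors = FALSE)

# Turn each pattern into a plain prefix by stripping the trailing "*"
mapping$prefix <- sub("\\*$", "", mapping$code)
# Derive the same key on the data side without touching the original code
data$prefix <- substr(data$code, 1, 2)

# Left join on the helper column; rows without a matching prefix get NA
result <- merge(data, mapping[, c("prefix", "zone")], by = "prefix", all.x = TRUE)
result
```

Patterns of varying length would need a different key (or a non-equi approach), so this only covers the fixed-prefix case shown in the question.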

dplyr join by exclusion?

When using the various join functions from dplyr you can either join all variables with the same name (by default) or specify those ones using by = c("a" = "b"). Is there a way to join by exclusion? For example, I have 1000 variables in two data frames and I want to join them by 999 of them, leaving one out. I don't want to do by = c("a1" = "b1", ...,"a999" = "b999"). Is there a way to join by excluding the one variable that is not used?
Ok, using this example from one answer:
set.seed(24)
df1 <- data_frame(alala= LETTERS[1:3], skks= letters[1:3], sskjs=
letters[1:3], val = rnorm(3))
df2 <- data_frame(alala= LETTERS[1:3], skks= letters[1:3], sskjs=
letters[1:3], val = rnorm(3))
I want to join them using all variables except val. I'm looking for a more general solution: assume there are 1000 variables and I only remember the name of the one I want to exclude from the join, without knowing its index. How can I perform the join while only knowing the variable names to exclude? I understand I can find the column index first, but is there a simple way to add exclusions in by =?

We create a named vector to do this
library(dplyr)
grps <- setNames(paste0("b", 1:999), paste0("a", 1:999))
Note that the 'grps' vector is created with paste because the OP's post suggested a pattern. If there is no pattern, but we know the column that should not be joined on:
nogroupColumn <- "someColumn"
# names = join columns of df1 (left), values = join columns of df2 (right)
grps <- setNames(setdiff(names(df2), nogroupColumn),
                 setdiff(names(df1), nogroupColumn))
inner_join(df1, df2, by = grps)
Using a reproducible example
set.seed(24)
df1 <- data_frame(a1 = LETTERS[1:3], a2 = letters[1:3], val = rnorm(3))
df2 <- data_frame(b1 = LETTERS[3:4], b2 = letters[3:4], valn = rnorm(2))
grps <- setNames(paste0("b", 1:2), paste0("a", 1:2))
inner_join(df1, df2, by = grps)
# A tibble: 1 x 4
# a1 a2 val valn
# <chr> <chr> <dbl> <dbl>
#1 C c 0.420 -0.584

To exclude certain fields, you first need the indices of the columns you want to keep. Here's one way:
which(!names(df1) %in% "sskjs") # excludes the column "sskjs"
[1] 1 2 4 # indices of the remaining columns
Then use unite (from tidyr) to create a join_id in each data frame, and join by it:
library(tidyr)
df1 <- df1 %>%
  unite(join_id, which(!names(.) %in% "sskjs"), remove = FALSE)
df2 <- df2 %>%
  unite(join_id, which(!names(.) %in% "sskjs"), remove = FALSE)
left_join(df1, df2, by = "join_id")
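
When the join columns have the same names in both data frames, as in the question's example, the by vector can also be built by exclusion with setdiff. A sketch that uses fixed values in place of rnorm so the result is predictable; exclude and join_cols are names chosen for the example:

```r
library(dplyr)

df1 <- tibble(alala = LETTERS[1:3], skks = letters[1:3],
              sskjs = letters[1:3], val = c(1, 2, 3))
df2 <- tibble(alala = LETTERS[1:3], skks = letters[1:3],
              sskjs = letters[1:3], val = c(10, 20, 30))

# Only the name of the column to leave out needs to be known
exclude <- "val"
join_cols <- setdiff(intersect(names(df1), names(df2)), exclude)

# The excluded column appears twice in the result, as val.x and val.y
joined <- inner_join(df1, df2, by = join_cols)
joined
```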

R join by tolower

I've got some sample data
data1 = data.frame(name = c("cat", "dog", "parrot"), freq = c(1,2,3))
data2 = data.frame(name = c("Cat", "snake", "Dog"), freq2 = c(2,3,4))
data1$name = as.character(data1$name)
data2$name = as.character(data2$name)
which I want to join, but e.g. "cat" and "Cat" should be treated as the same value. I thought of using tolower to first determine the entries that appear in both data frames:
in_both = data1[(tolower(data1$name) %in% tolower(data2$name)),]
Then I want to join with data2, but that doesn't work because the names don't match.
library(dplyr)
left_join(in_both, data2)
Is there a way to join by using tolower?

Why not create a function that lowercases the name column of the left data.frame and then performs the join?
With a custom function you get more control, and you don't have to repeat the same steps.
f_dplyr <- function(left,right){
left$name <- tolower(left$name)
inner_join(left,right,by="name")
}
f_dplyr(data2, data1)
Result
name freq2 freq
cat 2 1
dog 4 2

If you don't want to alter your original data2, as @AshofFire suggested, you can lowercase the values of name in a pipe (%>%) and then perform the join:
library(dplyr)
library(stringr) # for str_to_lower
data2 %>%
  mutate(name = str_to_lower(name)) %>%
  inner_join(data1, by = "name")
name freq2 freq
1 cat 2 1
2 dog 4 2
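
Both answers above replace the original spelling in one of the tables. If both spellings should survive, a lowercase helper column can serve as the join key instead. A base R sketch; the key column name and the suffixes are arbitrary choices for the example:

```r
data1 <- data.frame(name = c("cat", "dog", "parrot"), freq = c(1, 2, 3),
                    stringsAsFactors = FALSE)
data2 <- data.frame(name = c("Cat", "snake", "Dog"), freq2 = c(2, 3, 4),
                    stringsAsFactors = FALSE)

# Build a case-insensitive key on both sides, leaving name untouched
data1$key <- tolower(data1$name)
data2$key <- tolower(data2$name)

# Inner join on the key; the two name columns get distinguishing suffixes
joined <- merge(data1, data2, by = "key", suffixes = c("_1", "_2"))
joined
```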

"Not Join" in R

I am looking for a quick way to do a 'not join' (i.e. keep the rows that didn't merge, the inverse of an inner join). The way I've been doing it is to use data.table for X and Y, then set the keys. For example:
require(data.table)
X <- data.table(category = c('A','B','C','D'), val1 = c(0.2,0.3,0.8,0.7))
Y <- data.table(category = c('B','C','D','E'), val2 = c(2,3,5,7))
XY <- merge(X,Y,by='category')
> XY
category val1 val2
1: B 0.3 2
2: C 0.8 3
3: D 0.7 5
But I need the inverse of this, so I have to do:
XY_All <- merge(X,Y,by='category',all=TRUE)
setkey(XY,category)
setkey(XY_All,category)
notXY <- XY_All[!XY] #data.table not join (finally)
> notXY
category val1 val2
1: A 0.2 NA
2: E NA 7
I feel like this is quite long-winded (especially starting from a data.frame). Am I missing something?
EDIT: I got this after thinking more about not joins
X <- data.table(category = c('A','B','C','D'), val1 = c(0.2,0.3,0.8,0.7),key = "category")
Y <- data.table(category = c('B','C','D','E'), val2 = c(2,3,5,7), key = "category")
notXY <- merge(X[!Y],Y[!X],all=TRUE)
But WheresTheAnyKey's answer below is clearer. One last hurdle is presetting the data.table keys; it'd be nice not to have to do that.
EDIT: To clarify, the accepted solution is:
merge(anti_join(X, Y, by = 'category'),anti_join(Y, X, by = 'category'), by = 'category', all = TRUE)

require(dplyr)
bind_rows(anti_join(X, Y), anti_join(Y, X))
EDIT:
Since someone asked for some explanation, here's what is happening:
The first anti_join() returns the rows of X that have no matching row in Y, with the match determined by the columns the join is performed on. The second does the reverse. bind_rows() (the replacement for the older rbind_list(), which current dplyr no longer provides) stacks its inputs into a single table with all the observations from each, filling variables missing from an input with NA.

setkey(X,category)
setkey(Y,category)
rbind(X[!Y], Y[!X], fill = TRUE)
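
On the wish not to preset keys: data.table versions that support the on argument (1.9.6 and later, as far as I know) allow the same not-join without setkey. A sketch using the question's X and Y:

```r
library(data.table)

X <- data.table(category = c("A", "B", "C", "D"), val1 = c(0.2, 0.3, 0.8, 0.7))
Y <- data.table(category = c("B", "C", "D", "E"), val2 = c(2, 3, 5, 7))

# X[!Y, on = ...] keeps the rows of X whose category has no match in Y;
# fill = TRUE pads the columns that exist on only one side with NA
notXY <- rbind(X[!Y, on = "category"], Y[!X, on = "category"], fill = TRUE)
notXY
```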

You can make it more concise like this:
X <- data.table(category = c('A','B','C','D'), val1 = c(0.2,0.3,0.8,0.7),key = "category")
Y <- data.table(category = c('B','C','D','E'), val2 = c(2,3,5,7), key = "category")
notXY <- merge(X,Y,all = TRUE)[!merge(X,Y)]

Try this.
First, merge with "all" set to "TRUE". Then take out all complete cases:
XY_All <- merge(X,Y,by='category',all=TRUE)
notXY <- XY_All[!complete.cases(XY_All),]

require(dplyr)
notXY = merge(X[!X$category %in% Y$category,], Y[!Y$category %in% X$category,], by = "category", all = TRUE)
One way to look at an Anti-Join is that you need observations from X not in Y and observations from Y not in X concatenated together. This can be achieved in one step as shown above.
