I have a second question on data.tables. As far as I understand, merges are called joins in data.table. How can I control which type of merge I get (one-to-one, many-to-one, one-to-many), and whether the variables in the 'using' dataset will replace the variables in the master dataset?
Also, if keys are necessary in order to perform the merge, and I have to do more than one merge on my data, do I have to keep changing the keys? This does not seem very clean to me...
Thank you in advance,
Matteo
You could try to play with the merge() function. There you could define how you want to merge your data.frames.
x, y: data frames, or objects to be coerced to one.
by, by.x, by.y: specifications of the columns used for merging. See ‘Details’.
all: logical; all = L is shorthand for all.x = L and all.y = L, where L is either TRUE or FALSE.
all.x: logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.
all.y: logical; analogous to all.x.
Try ?merge for more information.
You can also have a look at QuickR Merge.
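As a rough sketch of how these arguments map to join types: the data frames `master` and `using` below are invented for illustration and only borrow the question's wording. Note that merge() has no argument for one-to-one versus one-to-many; that follows from whether key values repeat in the data. Non-key columns that share a name are not replaced either; both are kept and given the suffixes .x and .y.

```r
# Hypothetical example data; `master` and `using` are made-up names.
master <- data.frame(id = c(1, 2, 3), a = c("x", "y", "z"))
using  <- data.frame(id = c(2, 3, 3, 4), b = c(10, 20, 30, 40))  # id 3 repeats

inner <- merge(master, using, by = "id")                # ids present in both
left  <- merge(master, using, by = "id", all.x = TRUE)  # keep every row of master
right <- merge(master, using, by = "id", all.y = TRUE)  # keep every row of using
full  <- merge(master, using, by = "id", all = TRUE)    # keep rows from both sides
```

Because id 3 occurs twice in `using`, the single master row with id 3 appears twice in every result, which is what a one-to-many merge looks like here.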
I know I can use plyr and its friends, as well as merge, to combine dataframes, but so far I don't know how to merge two dataframes that have multiple columns based on 2 of those columns.
See the documentation on ?merge, which states:
By default the data frames are merged on the columns with names they both have,
but separate specifications of the columns can be given by by.x and by.y.
This clearly implies that merge will merge data frames based on more than one column. From the final example given in the documentation:
x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match
This example was meant to demonstrate the use of incomparables, but it illustrates merging using multiple columns as well. You can also specify separate columns in each of x and y using by.x and by.y.
Hope this helps:
df1 <- data.frame(CustomerId = 1:10,
                  Hobby = c(rep("sing", 4), rep("pingpong", 3), rep("hiking", 3)),
                  Product = c(rep("Toaster", 3), rep("Phone", 2), rep("Radio", 3), rep("Stereo", 2)))
df2 <- data.frame(CustomerId = c(2, 4, 6, 8, 10),
                  State = c(rep("Alabama", 2), rep("Ohio", 1), rep("Cal", 2)),
                  like = c("sing", "hiking", "pingpong", "hiking", "sing"))
# Match CustomerId with CustomerId and Hobby with like
df3 <- merge(df1, df2, by.x = c("CustomerId", "Hobby"), by.y = c("CustomerId", "like"))
This assumes that df1$Hobby and df2$like mean the same thing.
You can also use the join functions from dplyr (inner_join(), left_join(), right_join(), full_join()).
For example:
library(dplyr)
new_dataset <- dataset1 %>% right_join(dataset2, by = c("column1", "column2"))
I'm combining two paleoclimatology data sets into one for use in a regression model. Each data set has an integer value for time from 0 to 802 kyr.
However, one of the sets skips a year after 600 kyr (1). When I put all the data into one frame, the time series with the missing times is shorter, falls out of alignment with the other, and restarts itself. What I am after is for the incomplete time series to have an NA value so I can omit these rows.
i.e. when v2=601 (see image 1), I want the respective columns to read NA, 601, 3.97
My code for combining is:
df_new <- cbind(Df1$Age,
Df2$Age,
Df1$Benthic,
Df2$Deut)
Just merging the data.frames should be enough, since both seem to have keys to match. You just have to make sure that additional rows are created where there is no matching key.
merge(Df1, Df2, all.x = TRUE, all.y = TRUE)
This should work for you, and it is a base R solution.
all.x / all.y does the following:
logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.
Information on how to merge data.frames:
How to join (merge) data frames (inner, outer, left, right)
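A small sketch of that idea, using made-up stand-ins for the two series (the real Df1 and Df2 are not shown in the question, so the values and the explicit by = "Age" are assumptions):

```r
# Made-up stand-ins for the two series; Df2 has no entry for Age 601.
Df1 <- data.frame(Age = c(599, 600, 601, 602, 603),
                  Benthic = c(3.90, 4.00, 3.97, 4.10, 4.20))
Df2 <- data.frame(Age = c(599, 600, 602, 603),
                  Deut = c(-430, -432, -435, -433))

# Full outer merge on Age: the row for Age 601 gets Deut = NA.
both <- merge(Df1, Df2, by = "Age", all = TRUE)

# Drop the rows that contain an NA before fitting the regression.
complete <- na.omit(both)
```

Unlike cbind(), which pairs values by position, the merge aligns rows by the value of Age, so the shorter series does not shift or recycle.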
I am merging two dataframes that have some overlapping observations. These observations don't overlap on all columns so they are not identical, but they are the same on the columns I've decided are important for linking. How do I merge/join such that the matched observations are excluded?
I'm familiar with the different join functions and how to perform inner and outer joins using merge(), but I don't see an option for excluding the rows that would constitute an inner join.
This is a similar question on the topic, Exclusive Full Join in r
but it assumes there are different columns in each dataframe that will produce NAs upon full join. How would you do it if the dataframes shared all the same columns?
The workaround I am using is to call duplicated() on the linking columns from both the first and the last row, and to remove the flagged rows after the full join. Is there a more elegant way to get the complement of an inner join?
df_joined <- merge(df1, df2, all = TRUE)
df_joined <- subset(df_joined,
                    !(duplicated(df_joined[, linking_cols]) |
                      duplicated(df_joined[, linking_cols], fromLast = TRUE)))
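You need to combine two anti joins. Pass the linking columns in by; otherwise anti_join() matches on all shared columns and only rows that are identical in every column are excluded:
library(dplyr)
bind_rows(
  anti_join(df1, df2, by = linking_cols),
  anti_join(df2, df1, by = linking_cols)
)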
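For comparison, here is a base R sketch of the same idea without dplyr. The example data and the helper `row_key()` are hypothetical, and pasting the linking columns into one string is just one way to compare rows on several columns at once:

```r
# Hypothetical data: both frames share all columns; rows are linked on id and grp.
df1 <- data.frame(id = c(1, 2, 3), grp = c("a", "b", "c"), val = c(10, 20, 30))
df2 <- data.frame(id = c(2, 3, 4), grp = c("b", "c", "d"), val = c(21, 30, 40))
linking_cols <- c("id", "grp")

# Build one comparable key string per row from the linking columns.
row_key <- function(d) do.call(paste, c(d[linking_cols], sep = "\r"))

only_df1 <- df1[!(row_key(df1) %in% row_key(df2)), ]  # rows of df1 with no match in df2
only_df2 <- df2[!(row_key(df2) %in% row_key(df1)), ]  # rows of df2 with no match in df1
unmatched <- rbind(only_df1, only_df2)                # complement of the inner join
```

The row with id 2 is dropped from both sides even though its val differs, because only the linking columns take part in the comparison.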
I am trying to perform a left outer join on two data frames in R and I get some weird behavior. The first data frame (a) contains 509100 elements (rows) and the second one (b) contains 325020 rows. The function I have used for the left outer join is the following:
merge(a, b, by=c("ID","SEQUENCE"), all.x = T, all.y = F)
The resulting data frame now contains 513248 rows. I have used the same method with the same parameter configuration earlier in the script and it worked fine (i.e., the resulting data frame had the same number of rows as the first data frame passed as argument in the merge function). I have also created a column in each of the two data frames, as a combination of ID_SEQUENCE (at character level, for instance ID = 345 and SEQUENCE = 4, then the resulting value is 345_4) in order to avoid merging on multiple columns, if that would raise a problem, but the result is the same... 513248 rows instead of the expected 509100. Any ideas why this is happening or what am I doing wrong?
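One possible cause, which the post does not confirm, is that some ID/SEQUENCE combinations occur more than once in b: a left outer join then returns one output row per matching pair, so the result has more rows than a. A miniature sketch with invented data:

```r
# Made-up miniature of the situation; the real a and b are not shown in the post.
a <- data.frame(ID = c(1, 1, 2), SEQUENCE = c(1, 2, 1), va = c("p", "q", "r"))
b <- data.frame(ID = c(1, 1, 2), SEQUENCE = c(1, 1, 1), vb = c("u", "v", "w"))

res <- merge(a, b, by = c("ID", "SEQUENCE"), all.x = TRUE, all.y = FALSE)
n_res <- nrow(res)  # larger than nrow(a): key (1, 1) occurs twice in b

# Count rows of b whose key combination has already appeared earlier in b.
n_dup_keys <- sum(duplicated(b[, c("ID", "SEQUENCE")]))
```

If the same check on the real b returns a value above zero, duplicated keys would explain the extra rows; pasting ID and SEQUENCE into a single column would not change that.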
I have two structurally identical dataframes with the columns id-part1, id-part2 and data1.
id-part1 and id-part2 are used together as an index.
Now I want to calculate the difference in column data1 between the two dataframes with respect to the two id columns. It may happen that a combination of id-part1 and id-part2 does not exist in one of the dataframes...
So it is essentially an SQL join operation, isn't it?
The merge() function is what you are looking for.
It works similarly to an SQL join operation. Given your description, a solution would be:
solution <- merge(DF1, DF2, by = c('id-part1', 'id-part2'), all.x = TRUE, all.y = TRUE)
DF1 and DF2 are your data frames. merge() refers to them as x and y, where x is the first (DF1) and y the second (DF2).
The by argument defines the column names to match on (you can even specify different names for each data frame with by.x and by.y).
all.x and all.y specify the kind of join to perform, depending on which rows you want to keep.
The result is a new data frame with a separate data1 column from each input (by default named data1.x and data1.y). You can then continue with your calculations.
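A minimal sketch of the follow-up calculation, with invented data. The column names `id_part1` and `id_part2` stand in for the question's id-part1 and id-part2, and the new column name `diff` is also an assumption:

```r
# Hypothetical data; id_part1/id_part2 stand in for the post's id-part1/id-part2.
DF1 <- data.frame(id_part1 = c(1, 1, 2), id_part2 = c("a", "b", "a"), data1 = c(10, 20, 30))
DF2 <- data.frame(id_part1 = c(1, 2, 2), id_part2 = c("a", "a", "b"), data1 = c(7, 25, 40))

# Full outer join on both id columns; data1 becomes data1.x (DF1) and data1.y (DF2).
solution <- merge(DF1, DF2, by = c("id_part1", "id_part2"), all = TRUE)

# Difference per id combination; NA where the combination is missing on one side.
solution$diff <- solution$data1.x - solution$data1.y
```

Combinations that exist in only one data frame keep their row but get NA as the difference, so they are easy to filter out or inspect afterwards.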