I'm combining two paleoclimatology data sets into one for use in a regression model. Each data set has an integer time value running from 0 to 802 kyr.
However, one of the sets skips a time step after 600 kyr (image 1). When I put all the data into one frame, the time series with the missing time step is shorter, falls out of alignment with the other, and gets recycled from its start. What I am after is for the incomplete time series to have an NA value there so I can omit these rows.
For example, when v2 = 601 (see image 1), I want the respective columns to read NA, 601, 3.97.
My code for combining is:
df_new <- cbind(Df1$Age,
Df2$Age,
Df1$Benthic,
Df2$Deut)
Merging the data frames should be enough, since both have a key column to match on. You only have to make sure that extra rows are created wherever a key has no match in the other data frame.
merge(Df1, Df2, all.x = TRUE, all.y = TRUE)
This should work for you, and it is a base R solution.
According to the merge() documentation, all.x and all.y do the following:
logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.
Information on how to merge data.frames:
How to join (merge) data frames (inner, outer, left, right)
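As a rough illustration of the full outer merge suggested above, here is a minimal sketch on toy data. The numbers are made up, and which data set lacks the 601 step is an assumption made only for the example.

```r
# Toy stand-ins for Df1 and Df2 (values are invented for illustration).
Df1 <- data.frame(Age = 599:603,
                  Benthic = c(3.91, 3.94, 3.97, 4.01, 4.05))
# Assumption: this is the series that skips Age 601.
Df2 <- data.frame(Age = c(599L, 600L, 602L, 603L),
                  Deut = c(-430, -432, -436, -438))

# Full outer join on Age: every Age found in either frame gets one row,
# and columns from the frame without that Age are filled with NA.
df_new <- merge(Df1, Df2, by = "Age", all = TRUE)
print(df_new)

# Rows containing NA can then be dropped before fitting the regression.
df_complete <- na.omit(df_new)
```

Unlike cbind(), which pairs rows purely by position, merge() pairs them by the value of Age, so the shorter series cannot drift out of alignment.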
Related
I'm usually a SAS user but was wondering if there was a similar way in R to list data that can only be found in one data frame after merging them. In SAS I would have used
data want;
merge have1 (In=in1) have2 (IN=in2) ;
if not in2;
run;
to find the entries only in have1.
My R code is:
inner <- merge(have1, have2, by= "Date", all.x = TRUE, sort = TRUE)
I've tried setdiff() and anti_join(), but neither seems to give me what I want. Additionally, I would like to find a way to do the converse of this: find the entries in have1 and have2 that have the same "Date" entry and then keep the remaining variables from both data frames. For example, consider have1 with columns "Date", "ShotHeight", "ShotDistance" and have2 with columns "Date", "ThrowHeight", "ThrowDistance", so that the new data frame, call it "new", has columns "Date", "ShotHeight", "ShotDistance", "ThrowHeight", "ThrowDistance".
Assuming only one by-variable, the simplest solution is not to merge at all:
want <- subset(have1, !(Date %in% have2$Date))
This subsets have1 to exclude rows where the value of Date also appears in have2.
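Both halves of the question can be sketched with base R, using the Date key and column names from the question. The data values below are invented for illustration.

```r
# Toy versions of have1 and have2 (values are made up).
have1 <- data.frame(Date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
                    ShotHeight = c(1.2, 1.4, 1.1),
                    ShotDistance = c(10.5, 11.0, 9.8))
have2 <- data.frame(Date = as.Date(c("2020-01-02", "2020-01-03", "2020-01-04")),
                    ThrowHeight = c(2.0, 2.2, 1.9),
                    ThrowDistance = c(20.1, 21.3, 19.7))

# Rows of have1 whose Date does not occur in have2
# (roughly what "if not in2;" does in the SAS data step).
only1 <- have1[!(have1$Date %in% have2$Date), ]

# The converse: an inner join keeps only Dates present in both frames
# and carries along the remaining columns of each.
new <- merge(have1, have2, by = "Date")
print(only1)
print(new)
```

With more than one by-variable, the %in% test no longer applies directly; merging and then filtering on NA in a column that comes only from have2 is one possible alternative.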
I have two dataframes, one original and one that should be the original plus several additional columns of data after processing. I would like to make sure that the correspondence between original columns was preserved between dataframes (i.e., all subject identifiers still match up to the original vectors of data in each row.)
If the original (orig) has dimensions 5000 x 50 and the post-processed version (pp) has 5000 x 100, and the first 50 columns should be the same in each, how can I check? Is there something like setdiff() that can compare full data frames?
SETDIFF <- setdiff(orig[,c(1:50)], pp[,c(1:50)])
In reply to comment above: to find the row and column indices where values are not equal, use which(orig[,1:50] != pp[,1:50], arr.ind = TRUE).
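A small sketch of that check on toy data follows. The frames here are tiny stand-ins for orig and pp, and the mismatch is planted deliberately so there is something to find.

```r
# Toy stand-ins: pp is orig plus one extra column, with one value altered.
orig <- data.frame(id = 1:4, score = c(10, 20, 30, 40))
pp <- cbind(orig, extra = c("a", "b", "c", "d"))
pp$score[3] <- 99  # deliberately introduced mismatch

shared <- names(orig)  # the columns that should be unchanged

# Quick yes/no answer: all.equal() returns TRUE or a description of differences.
same <- isTRUE(all.equal(orig, pp[, shared]))

# Positions of the differing cells, as a matrix with "row" and "col" columns.
# Cells where the comparison is NA are silently skipped by which().
mismatches <- which(orig != pp[, shared], arr.ind = TRUE)
print(same)
print(mismatches)
```

This comparison is positional, so it assumes both frames are in the same row order; if the processing may have reordered rows, sorting both by the subject identifier first would be needed.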
I am trying to perform a left outer join on two data frames in R and I get some weird behavior. The first data frame (a) contains 509100 elements (rows) and the second one (b) contains 325020 rows. The function I have used for the left outer join is the following:
merge(a, b, by=c("ID","SEQUENCE"), all.x = T, all.y = F)
The resulting data frame now contains 513248 rows. I have used the same method with the same parameter configuration earlier in the script and it worked fine (i.e., the resulting data frame had the same number of rows as the first data frame passed as argument in the merge function). I have also created a column in each of the two data frames, as a combination of ID_SEQUENCE (at character level, for instance ID = 345 and SEQUENCE = 4, then the resulting value is 345_4) in order to avoid merging on multiple columns, if that would raise a problem, but the result is the same... 513248 rows instead of the expected 509100. Any ideas why this is happening or what am I doing wrong?
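A left join can only return more rows than its left input when some key combination in a matches more than one row in b. The sketch below reproduces that effect on toy data and shows one way to look for repeated keys; the column names x and y and all values are invented for the example.

```r
# Toy data: the key (ID = 1, SEQUENCE = 1) appears twice in b.
a <- data.frame(ID = c(1, 1, 2), SEQUENCE = c(1, 2, 1), x = c("p", "q", "r"))
b <- data.frame(ID = c(1, 1, 2), SEQUENCE = c(1, 1, 1), y = c("u", "v", "w"))

merged <- merge(a, b, by = c("ID", "SEQUENCE"), all.x = TRUE)
print(nrow(a))       # rows in the left table
print(nrow(merged))  # larger, because one row of a matched two rows of b

# Key combinations that occur more than once in b.
dup_keys <- b[duplicated(b[, c("ID", "SEQUENCE")]), c("ID", "SEQUENCE")]
print(dup_keys)
```

If dup_keys is non-empty on the real data, deduplicating or aggregating b on the key columns before merging would bring the row count back to that of a.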
I have two structurally identical data frames: column id-part1, column id-part2 and column data1.
id-part1 and id-part2 are used together as an index.
Now I want to calculate the difference in column data1 between the two data frames, matched on the two id columns. It may happen that a combination of id-part1 and id-part2 does not exist in one of the data frames.
So it is somehow an SQL join operation, isn't it?
The merge() function is what you are looking for.
It works much like an SQL join operation. Given your description, a solution would be:
solution <- merge(DF1, DF2, by = c('id-part1', 'id-part2'), all.x = TRUE, all.y = TRUE)
DF1 and DF2 are your corresponding data frames. merge() uses x and y to reference these data frames where x is the first (DF1) and y the second (DF2).
The by argument defines the column names to match on (you can even specify different names for each data frame with by.x and by.y).
all.x and all.y specify the kind of join you would like to perform, depending on which rows you want to keep.
The result is a new data frame with a separate data1 column from each input (named data1.x and data1.y by default). You can then continue with your calculations.
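To make the last step concrete, here is a minimal sketch that merges two toy frames and takes the difference. The values are invented, and check.names = FALSE plus the suffixes argument are choices made for this example, not part of the answer above.

```r
# Toy frames; check.names = FALSE keeps the hyphenated column names as they are.
DF1 <- data.frame("id-part1" = c(1, 1, 2),
                  "id-part2" = c("a", "b", "a"),
                  data1 = c(10, 20, 30),
                  check.names = FALSE)
DF2 <- data.frame("id-part1" = c(1, 2, 2),
                  "id-part2" = c("a", "a", "b"),
                  data1 = c(7, 25, 40),
                  check.names = FALSE)

# Full outer join on both id columns; suffixes label the two data1 columns.
solution <- merge(DF1, DF2, by = c("id-part1", "id-part2"),
                  all = TRUE, suffixes = c(".df1", ".df2"))

# Difference per id combination; NA where a combination exists in only one frame.
solution$diff <- solution$data1.df1 - solution$data1.df2
print(solution)
```

Using all = FALSE instead would keep only the id combinations present in both frames, so no NA differences would appear.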
I have a second question on data.tables. As far as I have understood, merges are called joins in data tables. How can I control which type of merge I have (one-to-one, many-to-one, one-to-many), and whether the variables in the 'using' dataset will replace the variables in the master dataset?
Also, if keys are necessary in order to perform the merge, and I have to do more than one merge on my data, do I have to keep changing the keys? This appears not very clean to me ....
Thank you in advance,
Matteo
You could try to play with the merge() function. There you could define how you want to merge your data.frames.
x, y
data frames, or objects to be coerced to one.
by, by.x, by.y
specifications of the columns used for merging. See ‘Details’.
all
logical; all = L is shorthand for all.x = L and all.y = L, where L is either TRUE or FALSE.
all.x
logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.
all.y
logical; analogous to all.x.
Try ?merge for more information.
You can also have a look here QuickR Merge.
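The arguments listed above can be tried out on a small example. This sketch uses base merge() on plain data frames rather than data.table's X[Y] syntax; the frame names master and using echo the question's wording, and the data are made up.

```r
# Toy data: the key is called "id" in master and "key" in using,
# and key 3 appears twice in using (a one-to-many match).
master <- data.frame(id = c(1, 2, 3), value = c("a", "b", "c"))
using <- data.frame(key = c(2, 3, 3, 4), extra = c("x", "y", "z", "w"))

# by.x / by.y name the matching column in each frame, so nothing has to be
# renamed or re-keyed between merges.
inner <- merge(master, using, by.x = "id", by.y = "key")
left  <- merge(master, using, by.x = "id", by.y = "key", all.x = TRUE)
right <- merge(master, using, by.x = "id", by.y = "key", all.y = TRUE)
full  <- merge(master, using, by.x = "id", by.y = "key", all = TRUE)

# Row counts differ by join type.
print(c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full)))

# merge() does not enforce one-to-one matching; check for repeated keys yourself.
print(anyDuplicated(using$key) > 0)
```

Columns with the same name in both frames are not overwritten by merge(); both are kept and told apart with suffixes, so replacing master values with values from the other frame is a separate step after the merge.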