Left outer join in R

I am trying to perform a left outer join on two data frames in R and I am getting unexpected behavior. The first data frame (a) contains 509100 rows and the second one (b) contains 325020 rows. I used the following call for the left outer join:
merge(a, b, by = c("ID", "SEQUENCE"), all.x = TRUE, all.y = FALSE)
The resulting data frame contains 513248 rows. I used the same call with the same arguments earlier in the script and it worked fine (i.e., the resulting data frame had the same number of rows as the first data frame passed to merge). I have also created a column in each of the two data frames that combines ID and SEQUENCE as a character string (for instance, ID = 345 and SEQUENCE = 4 give 345_4), in order to avoid merging on multiple columns in case that was the problem, but the result is the same: 513248 rows instead of the expected 509100. Any ideas why this is happening or what I am doing wrong?
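A left join returns more rows than the left table whenever a key in the right table occurs more than once, so the 4148 extra rows here would be consistent with duplicated ID/SEQUENCE pairs in b. The following is a minimal sketch of how to check for that, using small made-up data frames (the column names va and vb and all values are assumptions for illustration, not the asker's data):

```r
# Toy data: keys in `a` are unique; `b` contains the key (ID = 1, SEQUENCE = 1) twice.
a <- data.frame(ID = c(1, 1, 2), SEQUENCE = c(1, 2, 1), va = c("x", "y", "z"))
b <- data.frame(ID = c(1, 1, 2), SEQUENCE = c(1, 1, 1), vb = c(10, 11, 20))

joined <- merge(a, b, by = c("ID", "SEQUENCE"), all.x = TRUE, all.y = FALSE)

# The row of `a` with key (1, 1) matches two rows of `b`, so it appears twice.
nrow(a)
nrow(joined)

# Count rows of `b` whose key already appeared earlier in `b`.
n_dup <- sum(duplicated(b[, c("ID", "SEQUENCE")]))
n_dup

# Show the offending keys so they can be inspected or de-duplicated.
b[duplicated(b[, c("ID", "SEQUENCE")]), c("ID", "SEQUENCE")]
```

If n_dup is greater than zero on the real b, de-duplicating or aggregating b by the key before merging should bring the row count back to that of a.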

Related

Combining data sets and aligning 2 separate time series

I'm combining two paleoclimatology data sets into one for use in a regression model. Each data set has an integer value for time from 0 to 802 kyr.
However, one of the sets skips a year after 600 kyr (see image 1). When I put all the data into one frame, the time series with the missing times is shorter, falls out of alignment with the other, and restarts itself. What I am after is for the incomplete time series to have an NA value so I can omit these rows.
That is, when v2 = 601 (see image 1), I want the respective columns to read NA, 601, 3.97.
My code for combining is:
df_new <- cbind(Df1$Age,
Df2$Age,
Df1$Benthic,
Df2$Deut)
Just merging the data frames should be enough, since both seem to have a key to match on. You just have to make sure that additional rows are created when there is no matching key.
merge(Df1, Df2, all.x = TRUE, all.y = TRUE)
This should probably work for you, and it is a base R solution.
The documentation describes all.x as follows (all.y is analogous for rows in y):
logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.
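As a rough illustration of that behaviour, here is a small sketch with invented values (the numbers in Benthic and Deut are assumptions, chosen only so that Df2 skips Age 601):

```r
# Toy versions of the two series; Df2 has no row for Age 601.
Df1 <- data.frame(Age = c(599, 600, 601, 602),
                  Benthic = c(4.10, 4.05, 3.97, 3.90))
Df2 <- data.frame(Age = c(599, 600, 602),
                  Deut = c(-420, -418, -415))

# Full outer join on Age: every Age from either frame gets a row.
df_new <- merge(Df1, Df2, by = "Age", all = TRUE)
df_new

# Rows where either series is missing can then be dropped.
complete <- na.omit(df_new)
```

Here all = TRUE is shorthand for all.x = TRUE, all.y = TRUE. The row for Age 601 keeps its Benthic value and gets NA for Deut, so the two series stay aligned.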
Information on how to merge data.frames:
How to join (merge) data frames (inner, outer, left, right)

How do I merge 2 data frames on R based on 2 columns?

I am looking to merge 2 data frames based on 2 columns in R. The two data frames are called popr and droppedcol, and they share the same 2 variables, USUBJID and TRTAG2N, which are the variables that I want to combine the 2 data frames by.
The merge function works when I am only trying to do it based off of one column:
merged <- merge(popr,droppedcol,by="USUBJID")
When I attempt to merge by using 2 columns and view the data frame "Duration", the table is empty and there are no values, only column headers. It says "no data available in table".
I am tasked with replicating the SAS code for this in R:
data duration;
set pop combined1 ;
by usubjid trtag2n;
run;
On R, I have tried the following
duration<- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")
duration <- full_join(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
duration <- merge(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
I would like to see a data frame with the columns USUBJID, TRTAG2N, TRTAG2, and FUDURAG2, sorted by first FUDURAG2 and then USUBJID.
Per the SAS documentation, Combining SAS Data Sets, and as confirmed by @Tom in the comments above, set with by simply means you are interleaving the data sets. No merge (which, by the way, is also a SAS statement that you do not use here) is taking place:
Interleaving uses a SET statement and a BY statement to combine
multiple data sets into one new data set. The number of observations
in the new data set is the sum of the number of observations from the
original data sets. However, the observations in the new data set are
arranged by the values of the BY variable or variables and, within
each BY group, by the order of the data sets in which they occur. You
can interleave data sets either by using a BY variable or by using an
index.
Therefore, the best translation of set without by in R is rbind(), and set with by is rbind + order (on the rows):
duration <- rbind(pop, combined1) # STACK DFs
duration <- with(duration, duration[order(usubjid, trtag2n),]) # ORDER ROWS
Note, however, that rbind does not allow unmatched columns between the data frames being concatenated. Third-party packages do allow unmatched columns, including plyr::rbind.fill, dplyr::bind_rows, and data.table::rbindlist.
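To make the interleaving concrete, here is a small base R sketch on made-up data (the values and the extra src column, which only records which data set a row came from, are assumptions for illustration):

```r
# Two toy data sets with the same columns.
pop <- data.frame(usubjid = c("S1", "S3"), trtag2n = c(1, 2), src = "pop")
combined1 <- data.frame(usubjid = c("S2", "S1"), trtag2n = c(1, 1), src = "combined1")

# Stack the rows: the result has nrow(pop) + nrow(combined1) rows.
stacked <- rbind(pop, combined1)

# Order by the BY variables; tied rows keep their stacking order,
# so within a BY group rows from pop come before rows from combined1.
duration <- stacked[order(stacked$usubjid, stacked$trtag2n), ]
duration
```

No rows are matched or dropped, which is the difference from merge(): the output simply contains every row of both inputs, sorted by the BY variables.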

R: multiple merge with big data frames

I have two big data frames: DBa and DBb. All columns of DBb are in DBa.
I want to merge these two data frames by all of DBb's columns.
I'm trying:
new <- merge(DBa, DBb, by=colnames(DBb))
but it gives me the error:
Elements listed in `by` must be valid column names in x and y
How can I do it?
I don't think you are looking to merge the data frames; you should stack them on top of each other with rbind. With merge you put two data frames next to each other, and you only need one common column (the key), which should be unique, otherwise matching rows are multiplied and the result will be a mess.
So use row bind (rbind). For data frames the columns are matched by name, so both data frames must have the same set of columns.
new_data <- rbind(data1, data2)
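If a merge is what is wanted after all, the error message suggests that at least one name in colnames(DBb) is not an exact column name of DBa. The sketch below, on invented data (the columns id, grp, value and Value are hypothetical), shows one way to find such names and to merge on the columns the two frames really share:

```r
# Toy data: DBb's third column differs from DBa's only by capitalisation.
DBa <- data.frame(id = 1:3, grp = c("a", "b", "c"), value = c(10, 20, 30))
DBb <- data.frame(id = 2:3, grp = c("b", "c"), Value = c(20, 30))

# Names that are in DBb but not in DBa; these would make `by` invalid.
missing_in_a <- setdiff(colnames(DBb), colnames(DBa))
missing_in_a

# Merge on the columns both frames actually have in common.
common <- intersect(colnames(DBa), colnames(DBb))
new <- merge(DBa, DBb, by = common)
new
```

Differences in case, trailing whitespace, or automatically altered names are typical reasons for a non-empty missing_in_a.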

Joining two structurally similar dataframes on two index columns?

I have two structurally identical data frames, each with the columns id-part1, id-part2 and data1.
id-part1 and id-part2 are used together as an index.
Now I want to calculate the difference in column data1 between the two data frames with respect to the two id columns. In one of the data frames, a given combination of id-part1 and id-part2 might not exist.
So it is essentially an SQL join operation, isn't it?
The merge() function is what you are looking for.
It works similarly to an SQL join operation. Given your description, a solution would be:
solution <- merge(DF1, DF2, by = c('id-part1', 'id-part2'), all.x = TRUE, all.y = TRUE)
DF1 and DF2 are your data frames. merge() refers to them as x and y, where x is the first (DF1) and y the second (DF2).
The by argument defines the column names to match on (with by.x and by.y you can even specify different names for each data frame).
all.x and all.y specify the kind of join to perform, depending on which rows you want to keep.
The result is a new data frame with a separate data1 column for each input (named data1.x and data1.y by default). You can then continue with your calculations.
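A minimal sketch of the whole calculation might look like this. The data are invented, and the syntactic column names id_part1 and id_part2 as well as the suffixes .df1 and .df2 are assumptions made to keep the example short:

```r
# Toy data frames; key (1, "b") exists only in DF1 and (2, "b") only in DF2.
DF1 <- data.frame(id_part1 = c(1, 1, 2),
                  id_part2 = c("a", "b", "a"),
                  data1 = c(10, 20, 30))
DF2 <- data.frame(id_part1 = c(1, 2, 2),
                  id_part2 = c("a", "a", "b"),
                  data1 = c(7, 25, 40))

# Full outer join on both id columns; suffixes label the two data1 columns.
solution <- merge(DF1, DF2,
                  by = c("id_part1", "id_part2"),
                  all = TRUE,
                  suffixes = c(".df1", ".df2"))

# Difference per key; it is NA where the key is missing from one frame.
solution$diff <- solution$data1.df1 - solution$data1.df2
solution
```

Using all = TRUE keeps keys that exist in only one data frame, so missing combinations show up as NA differences instead of silently disappearing.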

R Using index of rows for merging file

I'm working with the survival library in R. I used residuals() on a survival object, which nicely outputs the residuals I want.
My question is on how R treats its row indexes.
Here is a sample data set. My goal is to merge them back together.
Data Frame:
> 1 - .2
> 2 - .4
> 3 - .6
> 4 - .8
Output:
> 1 - .2X
> 2 - .4X
> 4 - .8X
The output is a subset of the input (some data couldn't be processed). What I'd like is to add this new list back to the original input file to plot and run regressions, etc.
I don't know how to access the row indexes outside of the simple df[number] command. I think my approach to doing this is a bit prehistoric: I write.table() the objects, which turns their row numbers into an actual printed column, and then go back and merge based on this new key. I feel like there is a smarter way than writing out and reading back in the files. Any suggestions on how?
I hope this isn't a duplicate, as I looked around and couldn't quite find a good explanation on row indices.
Edit:
I can add column or row names to a data frame, but this results in a NULL value if done to a one dimensional object (my output file). The one dimensional object just has a subset of rows that I can't access.
rownames(res)
NULL
Instead of creating a new object as proposed in the other answer, you can simply use merge directly, provided both objects are data frames with row names.
Just write:
merge(df1, df2, by.x = 0, by.y = 0)
Here by.x = 0 refers to the row names of df1 and by.y = 0 refers to the row names of df2. The merge is performed using those as the link.
You can create an appropriate data.frame object out of res:
res.df <- data.frame(names=names(res), res=res)
Then use this as one of the inputs to merge.
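Put together, the approach could look like the following sketch. The data frame df1 and the named vector res are stand-ins invented for illustration; it assumes the residuals carry the original row labels in names(res):

```r
# Toy input with default row names "1".."4".
df1 <- data.frame(x = c(0.2, 0.4, 0.6, 0.8))

# Hypothetical residuals: a named vector with no entry for row "3".
res <- c("1" = 0.21, "2" = 0.39, "4" = 0.83)

# Turn the named vector into a data frame with an explicit key column.
res.df <- data.frame(names = names(res), res = res)

# Join row names of df1 (by.x = 0) to the key column of res.df,
# keeping every row of df1; unmatched rows get NA in `res`.
merged <- merge(df1, res.df, by.x = 0, by.y = "names", all.x = TRUE)
merged
```

The row names of df1 appear in the result as a column called Row.names, and the row that could not be processed is kept with an NA residual, so nothing has to be written to disk and read back.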
For the join you can use merge() or join() from the plyr package.
Here is a question covering both:
How to join (merge) data frames (inner, outer, left, right)?
I find join() more intuitive because it follows SQL logic, and it also seems to perform better with large data sets.

Resources