Joining two structurally similar dataframes on two index columns? - r

I have two structurally identical dataframes: column id-part1, column id-part2 and column data1.
id-part1 and id-part2 are used together as an index.
Now I want to calculate the difference in column data1 between the two dataframes with respect to the two id columns. It might happen that a combination of id-part1 and id-part2 exists in only one of the data frames...
So it is essentially a SQL join operation, isn't it?

The merge() function is what you are looking for.
It works similarly to an SQL join operation. Given your description, a solution would be:
solution <- merge(DF1, DF2, by = c('id-part1', 'id-part2'), all.x = TRUE, all.y = TRUE)
DF1 and DF2 are your corresponding data frames. merge() uses x and y to reference these data frames, where x is the first (DF1) and y the second (DF2).
The by argument defines the column names to match (you can even specify different names for each data frame with by.x and by.y).
all.x and all.y specify the kind of join you want to perform, depending on the data you want to keep. Setting both to TRUE gives a full outer join, so id combinations that exist in only one data frame are kept and filled with NA.
The result is a new data frame with separate columns for data1, named data1.x (from DF1) and data1.y (from DF2) by default. You can then continue with your calculations.
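As a minimal sketch of the remaining step, the difference can be computed from the two suffixed columns after the merge. The data below is made up for illustration, and the column names id_part1 and id_part2 (with underscores instead of hyphens) are an assumption chosen so the names are syntactic in R.

```r
# Toy data (hypothetical): each frame has one id combination the other lacks
DF1 <- data.frame(id_part1 = c(1, 1, 2),
                  id_part2 = c("a", "b", "a"),
                  data1    = c(10, 20, 30))
DF2 <- data.frame(id_part1 = c(1, 2, 2),
                  id_part2 = c("a", "a", "b"),
                  data1    = c(7, 25, 40))

# Full outer join on both id columns; data1 becomes data1.x and data1.y
joined <- merge(DF1, DF2, by = c("id_part1", "id_part2"), all = TRUE)

# Difference per id combination; NA where one side has no matching row
joined$diff <- joined$data1.x - joined$data1.y
print(joined)
```

Rows that exist in only one data frame end up with NA in diff, which makes the missing combinations easy to spot with is.na(joined$diff).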

Related

Match observations between two datasets by ID

I am working with the following data: http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv
What I want to do is train my algorithm to predict correctly if a person will drop out in the subsequent period.
data1 <- subset(data, YEAR==1984)
data2 <- subset(data, YEAR==1985)
didtheydrop <- as.integer(data1$id)
didtheydrop <- lapply(didtheydrop, function(x) as.integer(ifelse(x==data2$id, 0, 1)))
This created a large list with the values that I think I wanted, but I'm not sure. In the end, I would like to append this variable to the 1984 data and then use that to create my model.
What can I do to ensure that the appropriate values are compared? The list lengths aren't the same, and it's also not the case that they appear in the correct order (i.e. respondents 3 - 7 do not respond in 1984 but they appear in 1985).
Assuming data1 and data2 are two dataframes (unclear, because it appears that you extracted them from an original larger single dataframe called data), I think it is better to merge them and work with a single dataframe. That is, if there is a single larger dataframe, do not subset it, just delete the columns you do not need; if data1 and data2 are two dataframes, merge them and work with only one dataframe.
There are multiple ways to do this in R.
You should review the merge function by calling ?merge in your console and reading the function description.
Essentially, to merge two dataframes, you should do something like:
merge(data1, data2, by = "columnID") # where "columnID" is the name of the variable that identifies the ID; if it is named differently in data1 and data2 you can use by.x and by.y
Then you have to decide which unmatched rows to keep, using the parameters all.x, all.y, and all: all.x = TRUE keeps all rows from data1 even if no match is found in data2, all.y = TRUE keeps all rows from data2 even if no match is found in data1, and all = TRUE keeps all rows regardless of whether there is a matching ID in the other data frame.
merge() is part of base R, so it is available in any installation of R.
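To make the effect of all.x, all.y and all concrete, here is a small sketch on made-up data. The column names ID, y84 and y85 are placeholders, not columns from the healthcare dataset.

```r
# Toy data (hypothetical): IDs 2 and 3 appear in both frames
data1 <- data.frame(ID = c(1, 2, 3), y84 = c("a", "b", "c"))
data2 <- data.frame(ID = c(2, 3, 4), y85 = c("B", "C", "D"))

inner <- merge(data1, data2, by = "ID")                # only matching IDs
left  <- merge(data1, data2, by = "ID", all.x = TRUE)  # every row of data1
right <- merge(data1, data2, by = "ID", all.y = TRUE)  # every row of data2
full  <- merge(data1, data2, by = "ID", all = TRUE)    # every row of both

# Row counts for inner, left, right and full join
print(c(nrow(inner), nrow(left), nrow(right), nrow(full)))
```

Unmatched rows are filled with NA in the columns that come from the other data frame.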
You can also use the dplyr package, which makes the type of join even more explicit:
inner_join(data1, data2, by = "ID")
left_join(data1, data2, by = "ID")
right_join(data1, data2, by = "ID")
full_join(data1, data2, by = "ID")
A good reference for dplyr joins: https://rpubs.com/williamsurles/293454
Hope it helps
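For the dropout question itself, one possible approach (a sketch, not necessarily what the answer above had in mind) is a left join of the 1984 rows against the IDs seen in 1985, flagging rows with no match. The toy IDs and the helper columns in1985 and dropped are assumptions for illustration.

```r
# Toy data (hypothetical): respondents 1 and 8 do not reappear in 1985
data1 <- data.frame(id = c(1, 2, 8, 9), YEAR = 1984)
data2 <- data.frame(id = c(2, 3, 9), YEAR = 1985)

# One row per ID observed in 1985, with a marker column
seen85 <- data.frame(id = unique(data2$id), in1985 = TRUE)

# Left join keeps every 1984 row; in1985 is NA where the ID is absent in 1985
data1 <- merge(data1, seen85, by = "id", all.x = TRUE)

# 1 = dropped out after 1984, 0 = still present in 1985
data1$dropped <- as.integer(is.na(data1$in1985))
print(data1)
```

Because the join matches on id, the differing lengths and row orders of the two years do not matter. The same flag can also be computed without a merge as as.integer(!(data1$id %in% data2$id)).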

In R, is there a way to extract exactly the same data frames from a list of data frames without using any packages?

I have a list of 3 data.frame prepared as below:
df1<-read.csv(file="D:/PRADYUMNA MURALIDHAR/DATA SCIENCE/R FUNCTION/specdata/001.csv",sep=",")
df2<-read.csv(file="D:/PRADYUMNA MURALIDHAR/DATA SCIENCE/R FUNCTION/specdata/002.csv",sep=",")
df3<-read.csv(file="D:/PRADYUMNA MURALIDHAR/DATA SCIENCE/R FUNCTION/specdata/003.csv",sep=",")
dfcheck<-c(df1,df2,df3)
How do I extract each data.frame and merge them all together... considering the column variables are the same?
As you say the column variables are the same, I suppose you want to append the data.frames to each other rather than merging them.
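Further, given the title of your question, I'll point out that dfcheck <- c(df1, df2, df3) will not give you a list of data.frames. c() concatenates the columns of all three data.frames into one flat list of vectors; use list(df1, df2, df3) to get a list of data.frames.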
My guess is you want to do do.call("rbind", list(df1, df2, df3)).
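A short sketch of the difference between c() and list(), followed by the row-binding step. The small data.frames stand in for the CSV files, and the column names ID and value are placeholders.

```r
# Toy data (hypothetical) with identical columns, standing in for the CSV files
df1 <- data.frame(ID = 1:2, value = c(0.1, 0.2))
df2 <- data.frame(ID = 3:4, value = c(0.3, 0.4))
df3 <- data.frame(ID = 5:6, value = c(0.5, 0.6))

# c() flattens the data.frames into one list of column vectors
flat <- c(df1, df2, df3)
print(length(flat))        # 6 elements: two columns from each of three frames

# list() keeps each data.frame intact, so individual frames can be extracted
dfs <- list(df1, df2, df3)
second <- dfs[[2]]

# Append all frames row-wise into a single data.frame
combined <- do.call(rbind, dfs)
print(nrow(combined))
```

Using [[ ]] on the list returns the original data.frame unchanged, which is the extraction the question title asks about.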

Join or merge between two tables with different date formats

I'm new to R and I'm trying to join two tables. The shared field between the two tables is the date, but when I import the data it arrives with a different structure in each table.
First Table:
Second Table:
What I actually need is to join the data by operating system and remove Linux, like an inner join in SQL with a condition on the operating system. Thanks.
Say that your first dataset is called df1 and the second one df2, you can join the two by calling:
merge(df1, df2, by = "operatingSystem")
You can specify the kind of join by using all = TRUE, all.x = TRUE, or all.y = TRUE.
I have not reproduced your example, but I will give it a go as is.
First, in your second table, you need to convert the date column to an actual date.
You can do this easily with lubridate.
Assuming df1 and df2 for the first and second table respectively:
library(lubridate)
df2$date <- ymd(df2$date) #ymd function assumes `year` then `month` then `day` when converting
Then you can use dplyr's inner_join to perform the desired join.
From stat545:
inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
library(dplyr)
inner_join(df1, df2, by = c("date", "operatingSystem"))
This will keep all rows in df1 that have a match in df2 - Linux stays out - and will keep the column newusers as well as df2$users (if both tables have a column with the same name, dplyr disambiguates them with the suffixes .x and .y).
Note: You might need to convert df1$date to a Date object with lubridate::date(df1$date) so that both date columns have the same type.
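The same idea can be sketched in base R without extra packages, using as.Date() in place of lubridate and merge() in place of inner_join(). Since the original tables are not shown, the data, the "%Y%m%d" date format and the column values below are all assumptions.

```r
# Toy data (hypothetical): df1 already holds proper Date values
df1 <- data.frame(date = as.Date(c("2017-01-01", "2017-01-01", "2017-01-02")),
                  operatingSystem = c("Windows", "Linux", "Windows"),
                  newusers = c(5, 2, 7),
                  stringsAsFactors = FALSE)

# df2 holds dates as text in an assumed yyyymmdd layout, and has no Linux rows
df2 <- data.frame(date = c("20170101", "20170102"),
                  operatingSystem = c("Windows", "Windows"),
                  users = c(50, 70),
                  stringsAsFactors = FALSE)

# Convert the text dates so both date columns have the same type
df2$date <- as.Date(df2$date, format = "%Y%m%d")

# Inner join (the default for merge): only date/OS pairs present in both tables
joined <- merge(df1, df2, by = c("date", "operatingSystem"))
print(joined)
```

The Linux row drops out here only because df2 has no matching Linux row; if Linux appeared in both tables, an explicit filter such as joined[joined$operatingSystem != "Linux", ] would be needed.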

left outer join in R

I am trying to perform a left outer join on two data frames in R and I get some weird behavior. The first data frame (a) contains 509100 elements (rows) and the second one (b) contains 325020 rows. The function I have used for the left outer join is the following:
merge(a, b, by=c("ID","SEQUENCE"), all.x = T, all.y = F)
The resulting data frame now contains 513248 rows. I have used the same method with the same parameter configuration earlier in the script and it worked fine (i.e., the resulting data frame had the same number of rows as the first data frame passed as argument in the merge function). I have also created a column in each of the two data frames, as a combination of ID_SEQUENCE (at character level, for instance ID = 345 and SEQUENCE = 4, then the resulting value is 345_4) in order to avoid merging on multiple columns, if that would raise a problem, but the result is the same... 513248 rows instead of the expected 509100. Any ideas why this is happening or what am I doing wrong?
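One common cause of this symptom, though not confirmed for this particular data, is that some ID/SEQUENCE combinations occur more than once in b: a left join returns one output row per match, so each extra match adds a row. The following sketch reproduces the effect on made-up data and shows one way to check for duplicated keys.

```r
# Toy data (hypothetical): the key (2, 1) appears twice in b
a <- data.frame(ID = c(1, 2, 3), SEQUENCE = c(1, 1, 1), x = c("p", "q", "r"))
b <- data.frame(ID = c(1, 2, 2), SEQUENCE = c(1, 1, 1), y = c("u", "v", "w"))

# Left outer join: the row of a with key (2, 1) matches two rows of b
joined <- merge(a, b, by = c("ID", "SEQUENCE"), all.x = TRUE)
print(nrow(a))       # 3
print(nrow(joined))  # 4, one more than nrow(a)

# Diagnose: count key combinations in b that repeat an earlier row
n_dup <- sum(duplicated(b[, c("ID", "SEQUENCE")]))
print(n_dup)         # 1
```

If the count is non-zero on the real b, deduplicating or aggregating b by the key columns before the merge would bring the row count back to that of a.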

data.tables: merge [duplicate]

I have a second question on data.tables. As far as I have understood, merges are called joins in data tables. How can I control which type of merge I have (one-to-one, many-to-one, one-to-many), and whether the variables in the 'using' dataset will replace the variables in the master dataset?
Also, if keys are necessary in order to perform the merge, and I have to do more than one merge on my data, do I have to keep changing the keys? This appears not very clean to me ....
Thank you in advance,
Matteo
You could try the merge() function. Its arguments let you define how you want to merge your data.frames:
x, y: data frames, or objects to be coerced to one.
by, by.x, by.y: specifications of the columns used for merging. See ‘Details’.
all: logical; all = L is shorthand for all.x = L and all.y = L, where L is either TRUE or FALSE.
all.x: logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.
all.y: logical; analogous to all.x.
Try ?merge for more information.
You can also have a look at QuickR Merge.
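As a rough illustration of the arguments listed above, the sketch below performs a one-to-many merge where the key column is named differently in each data frame. It uses plain data.frames and base merge(), which needs no keys to be set; the names master, using, person_id and pid are made up for the example.

```r
# Toy data (hypothetical): person 1 has two visits, person 2 has none
master <- data.frame(person_id = c(1, 2, 3), age = c(30, 41, 25))
using  <- data.frame(pid = c(1, 1, 3), visit = c("2010", "2011", "2010"))

# by.x / by.y name the key in each frame; all.x = TRUE keeps unmatched master rows
out <- merge(master, using, by.x = "person_id", by.y = "pid", all.x = TRUE)
print(out)
```

The key column in the result takes the name given in by.x, and columns are added side by side rather than replaced. Whether the merge is one-to-one or one-to-many is not set by an argument; it follows from how often each key value occurs in the two inputs.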
