I am trying to join two tables in R. Since the join key is not unique in either table, there is a chance of duplicate rows. To avoid these duplicates I would like to add some conditions within the joining statement.
My hypothetical approach is as follows: I would like to join DF and GF by id, but only where the following conditions hold: DF's Incomedate equals GF's Outgoingdate, or DF's Incomedate equals GF's Outgoingdate + 1 day, and DF's REFID = 1000.
```
DF_new <- DF %>%
  left_join(GF, by = c('id' = 'id', 'Incomedate' = 'Outgoingdate',
                       'Incomedate' = as.Date('Outgoingdate') + 1, REFID = 1000))
```
I can do it in SQL as follows:
```
DF LEFT JOIN GF ON DF.[id] = GF.[id]
  AND (DF.Incomedate = GF.Outgoingdate OR
       DATEADD(DAY, 1, GF.[Outgoingdate]) = DF.Incomedate)
  AND (DF.REFID = 1000)
```
How can I replicate it in R?
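One option, since the logic translates almost verbatim from SQL, is the sqldf package (a sketch, not a tested solution: it assumes DF and GF are data frames whose date columns are Date class, which sqldf passes to SQLite as day counts, so + 1 means one day):
```
library(sqldf)

# Table and column names taken from the question above
DF_new <- sqldf("
  SELECT *
  FROM DF
  LEFT JOIN GF
    ON  DF.id = GF.id
    AND (DF.Incomedate = GF.Outgoingdate OR
         DF.Incomedate = GF.Outgoingdate + 1)
    AND DF.REFID = 1000
")

# Date columns may come back as day counts; if so, restore them with
# as.Date(DF_new$Incomedate, origin = "1970-01-01")
```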
I am working with the following data: http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv
What I want to do is train my algorithm to predict correctly if a person will drop out in the subsequent period.
```
data1 <- subset(data, YEAR==1984)
data2 <- subset(data, YEAR==1985)

didtheydrop <- as.integer(data1$id)
didtheydrop <- lapply(didtheydrop, function(x) as.integer(ifelse(x==data2$id, 0, 1)))
```
This created a large list with the values that I think I wanted, but I'm not sure. In the end, I would like to append this variable to the 1984 data and then use that to create my model.
What can I do to ensure that the appropriate values are compared? The list lengths aren't the same, and the respondents don't appear in the same order either (e.g. respondents 3-7 do not respond in 1984 but they appear in 1985).
Assuming data1 and data2 are two dataframes (unclear, because it appears you extracted them from an original, larger dataframe called data), I think it is better to merge them and work with a single dataframe. That is: if there is a single larger dataframe, do not subset it, just delete the columns you do not need; if data1 and data2 are two dataframes, merge them and work with only one.
There are multiple ways to do this in R.
You should review the merge function by calling ?merge in your console and reading the function description.
Essentially, to merge two dataframes, you should do something like:
```
merge(data1, data2, by = "columnID")  # columnID is the name of the ID variable;
                                      # if it differs between data1 and data2, use by.x and by.y
```
Then you have to decide which rows to keep, using the parameters all.x, all.y, and all: all rows from data1 even if no match is found in data2 (all.x = TRUE), all rows from data2 even if no match is found in data1 (all.y = TRUE), or all rows regardless of whether there is a matching ID in the other table (all = TRUE).
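Applied to the dropout question, a minimal sketch (assuming, as in the question, that id identifies respondents in both years; the suffixed column name YEAR.1985 is produced by the suffixes argument below):
```
# Keep every 1984 respondent; 1985 columns are NA where no match exists
merged <- merge(data1, data2, by = "id", all.x = TRUE,
                suffixes = c(".1984", ".1985"))

# A missing 1985 row means the respondent dropped out
merged$didtheydrop <- as.integer(is.na(merged$YEAR.1985))
```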
merge is in base R, so it is available with any installation. You can also use the dplyr package, which makes the type of join more explicit:
```
inner_join(data1, data2, by = "ID")
left_join(data1, data2, by = "ID")
right_join(data1, data2, by = "ID")
full_join(data1, data2, by = "ID")
```
This is a good reference for dplyr joins: https://rpubs.com/williamsurles/293454
Hope it helps
I am working on a project in R. I have two data frames with multiple entries for each employee ID in both data frames. That is, for example, employee ID 1 has multiple entries in Table 1 and Table 2, so there is no primary key in these tables.
I want to merge these two tables for better analysis. When I try to merge them, the result contains every combination of each ID's rows (a Cartesian product per ID), which distorts the data in the resulting table.
Can anyone please suggest a way out?
You can merge two tables with the merge command.
by = "employeeid" lets you specify the key column; if you have more than one key column, use by = c("employeeid", "period").
```
table3 <- merge(table1, table2, by = "employeeid")
```
?merge will give you more options.
One idea is to wrangle your data so there are no more multiple entries.
Another is to summarise your data so there is only one row per employee in each table (a sketch of this follows the join example below).
A third is to use a full join to connect all matching IDs:
https://dplyr.tidyverse.org/reference/join.html
```
library(dplyr)
full_join(df1, df2, by = "EmployeeID")
```
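For the summarise route mentioned above, a sketch (hours is a hypothetical column; aggregate whatever your analysis actually needs):
```
library(dplyr)

# Collapse each table to one row per employee before joining
df1_by_employee <- df1 %>%
  group_by(EmployeeID) %>%
  summarise(total_hours = sum(hours, na.rm = TRUE))
```
Joining the collapsed tables then gives one row per employee instead of a Cartesian product.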
Check out the dplyr "Data Transformation Cheat Sheet": https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
I have two tables:
one (table1) contains the ring numbers of caught birds plus all the information associated with that ringing (morphological characteristics, dates, locations, etc.). The other (table2) has all the ring numbers from a different campaign, which I have already searched and trimmed down to remove the duplicates between the two.
Because there are a lot of rings (>600), it would be time-consuming to go one by one from one list to the other and copy-paste the entire row of information into a new table.
I want to extract all the rows of table1 whose Ring column matches the ring values in table2, and obtain a new table with only the extracted rows.
I tried to code it for one of the rings, just to see if it would select by ring number directly on table1, but it didn't work:
```
newtbl <- as.data.frame(table1[table1$ring == L45523, ])
```
There should be a way of pulling the list of ring numbers from table2 and selecting only those rows in table1.
(Screenshot of table1 omitted.)
Hope this is possible. Thank you in advance!
This sounds like a classic relational join scenario. See the notes on relational data in R4DS (R for Data Science).
If you want the columns in table2 to also be pulled through into your results, then use:
```
library(dplyr)
results <- table1 %>% inner_join(table2, by = "Ring.No.")
```
If you just want those records from table1 that match a ring number in table2 you can try:
```
library(dplyr)
results <- table1 %>% semi_join(table2, by = "Ring.No.")
```
Note that if the ring number column is called something else in table2, you can use the more complete by = ... syntax:
```
library(dplyr)
results <- table1 %>%
  semi_join(table2, by = c("Ring.No." = "name_of_ring_column_in_table2"))
```
You can try the dplyr package:
```
library(dplyr)
table1 %>% filter(Ring.No. %in% unique(table2$Ring.No.))
```
I am not sure of the structure of your table2, so adapt the code depending on whether it's a list, a data.frame, or something else.
Why not merge the two tables based on RING, since that seems to be a common column name between them, and then filter on the RING value? Otherwise, post an excerpt of your two tables.
```
newtable <- merge(table1, table2, by = "RING", all = TRUE)
subset(newtable, RING == YYYY)
```
Adapt the code to suit your desired results.
I've run into a question I can't answer about conditional merging of 2 data frames. Let me describe the data frames (names changed):
The first, DF1, has a column called 'proceduredate' that contains the date of the procedure per instance (already formatted by as.Date in format %Y-%m-%d).
The second, DF2, has a variable called 'orderdate' that contains the date of each lab order (also formatted by as.Date in format %Y-%m-%d).
Each dataframe has an identifier (called 'id') for each individual that is used to merge "by" across the two dataframes. I would like to merge the dataframes conditionally to include only the DF2 instances that have an orderdate within 30 days of the proceduredate in DF1. As I understand it, this would look something like:
```
if (abs(DF1$procdate - DF2$orderdate) <= 30) {
  merge(DF1, DF2, by = "id")
}
```
However, I can't figure out a way to turn this idea into working code. Would you suggest any references or similar prior solutions?
SQL handles this better than (base) R - though I believe there's a way to do it in data.table.
```
library(sqldf)

result <- sqldf("
  select *
  from DF1 left join DF2 on
    abs(DF1.procdate - DF2.orderdate) <= 30
    AND DF1.id = DF2.id
")
```
I'm not sure this will work with your dates as-is; it should if they are Date class columns. If you provide a reproducible example I'm happy to test.
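On the data.table point, a non-equi join can express the same 30-day window (a sketch, assuming the column names from the question's pseudocode; lo and hi are helper columns introduced here):
```
library(data.table)
setDT(DF1); setDT(DF2)

# Window bounds around each procedure date
DF1[, `:=`(lo = procdate - 30, hi = procdate + 30)]

# Keep every DF1 row; match DF2 rows whose orderdate falls inside the window
result <- DF2[DF1, on = .(id, orderdate >= lo, orderdate <= hi)]
```
One data.table quirk to watch for: in non-equi joins, the join columns in the result display the bound values rather than the original orderdate.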
I'm new to R, and I'm trying to join two tables. The shared field between the two tables is the date, but when I import the data it comes in with a different structure.
(Screenshots of the first and second tables omitted.)
What I actually need is to join the data by operating system and remove Linux, like an inner join in SQL with a condition on the operating system. Thanks.
Say your first dataset is called df1 and the second one df2; you can join the two by calling:
```
merge(df1, df2, by = "operatingSystem")
```
You can specify the kind of join using all = TRUE, all.x = TRUE, or all.y = TRUE.
I am a bit lazy to reproduce your example, but I will give it a go as is.
First, in your second table, you need to convert the date column to an actual date.
You can do this easily with lubridate.
Assuming df1 and df2 for the first and second table respectively:
```
library(lubridate)
df2$date <- ymd(df2$date)  # ymd() assumes `year` then `month` then `day` when converting
```
Then you can use dplyr's inner_join to perform the desired join. From STAT 545:
inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
```
library(dplyr)
inner_join(df1, df2, by = c("date", "operatingSystem"))
```
This will keep all rows in df1 that have a match in df2 (Linux stays out), keep df1's newusers column, and keep df2's users column renamed to users.1 to avoid a name clash.
Note: you might need to convert df1$date to a Date object with lubridate::date(df1$date).
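Putting the steps together, a sketch (operatingSystem, newusers, and users are column names assumed from the answer above; the explicit filter makes the Linux exclusion visible even if df2 happens to contain Linux rows):
```
library(dplyr)
library(lubridate)

df2$date <- ymd(df2$date)  # align the date structures first

result <- df1 %>%
  inner_join(df2, by = c("date", "operatingSystem")) %>%
  filter(operatingSystem != "Linux")  # drop Linux explicitly
```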