From looking through Stack Overflow and other sources, I believe that changing my dataframes to data.tables and using setkey, or something similar, will give me what I want, but so far I have been unable to get a working syntax.
I have two data frames, one containing 26000 rows and the other containing 6410 rows.
The first dataframe contains the following columns:
Customer name, Base_Code, Idenity_Number, Financials
The second dataframe holds the following:
Customer name, Base_Code, Idenity_Number, Financials, Lapse
Both sets of data have identical formatting.
My goal is to join the Lapse column from the second dataframe to the first. The issue is that the numeric value in Financials does not match between the two datasets, and I only want the closest match in DF1 to have the Lapse value from DF2 against it.
There will be examples where there are multiple entries for the same customer ID and Base_Code in each dataframe, so I need to merge the two on Idenity_Number and Base_Code (which match exactly) and then match each entry against the nearest Financials value only.
There will never be more entries in DF2 than in DF1 for any customer and Base_Code.
Here is an example of DF1:
Here is an example of DF2:
And finally, here is what I want end up with:
If we use Jessica Rabbit as the example, we have a match between DF1 and DF2: the financial value of 1240 from DF1 was matched against 1058 in DF2 as that was the closest match.
I could not work out how to get a working solution using data.table, so I re-thought my approach and have come up with a solution.
First of all I merged the two datasets and then removed any entries that did not have a status of "LAP", which gave me all of the non-lapsed entries:
# Left join on the exact keys, then drop every row whose Status contains "LAP"
NON_LAP <- merge(x = Merged, y = LapsesMonth, by = c("POLICY_NO", "LOB_BASE"), all.x = TRUE)
NON_LAP <- NON_LAP[!grepl("LAP", NON_LAP$Status, ignore.case = FALSE), ]
Next I merged again, this time looking specifically for the lapsed cases. To work out which was the closest match I used the abs function, then ordered by the lowest difference so the closest matches came first. Finally I removed duplicates to keep only the closest matches, and separately kept the duplicates, stripping out the "LAP" status, to ensure that the entries which were not the closest match remained in the data.
Finally I merged them all together giving me the required outcome.
# Keep only rows present in both datasets, matched on the exact keys
FIND_LAP <- merge(x = Merged, y = LapsesMonth, by = c("POLICY_NO", "LOB_BASE"), all.y = FALSE)
# Absolute gap between the two financial figures
FIND_LAP$Difference <- abs(FIND_LAP$GWP - FIND_LAP$ACTUAL_PRICE)
# Sort so the smallest gap (the closest match) comes first
FIND_LAP <- FIND_LAP[order(FIND_LAP$Difference), ]
# The first occurrence per key is the closest match; the rest are not
FOUND_LAP <- FIND_LAP[!duplicated(FIND_LAP[c("POLICY_NO", "LOB_BASE")]), ]
NOT_LAP <- FIND_LAP[duplicated(FIND_LAP[c("POLICY_NO", "LOB_BASE")]), ]
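For reference, here is a sketch of the data.table rolling join the question originally asked about. It is untested against the real data (DF1, DF2 and the column names are taken from the question), and note that roll = "nearest" gives every DF1 row the Lapse of its nearest DF2 row; it does not enforce a one-to-one matching, so a duplicate-removal step like the one above would still be needed if only the single closest DF1 entry should receive the value:

library(data.table)
dt1 <- as.data.table(DF1)
dt2 <- as.data.table(DF2)
# For each dt1 row, attach the dt2 row with the same Idenity_Number and
# Base_Code whose Financials value is numerically closest
matched <- dt2[dt1, on = .(Idenity_Number, Base_Code, Financials), roll = "nearest"]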
Hopefully this will help someone else who might be new to R and encounters the same issue.
I have two data frames of unequal lengths, each with a column of timestamps. I would like to return the corresponding ID from df2 to df1 as a new column if the time difference is less than 60 minutes, so that I know which ID in df2, with its specific appointment time, is responsible for which entries in df1. Each ID should have 8 entries in df1.
To calculate the difference between each element in df1 and df2, I've tried
outer(df1$DataEntryTime, df2$ApptTime, '-')
and got a matrix of results.
What do I need to do next to build a conditional statement so it can return the ID# to df1 based on the results?
Many thanks!
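One possible next step (a sketch only; it assumes both columns are POSIXct date-times and that the ID column in df2 is called ID):

# Convert to seconds, take absolute pairwise differences, express in minutes
diffs <- abs(outer(as.numeric(df1$DataEntryTime), as.numeric(df2$ApptTime), "-")) / 60
# For each df1 row, take the nearest df2 appointment if it is within 60 minutes
idx <- apply(diffs, 1, function(d) {
  j <- which.min(d)
  if (d[j] < 60) j else NA_integer_
})
df1$ID <- df2$ID[idx]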
I have two data sets: data2 and data3.
The relevant information from data3 should be added to the respective rows of data2; the common columns in both sets are "Inschrijfjaar" and "Leeftijd".
I am using the code:
data4 <- merge(x = data2, y = data3, by = c("Inschrijfjaar", "Leeftijd"), all.x = TRUE)
A check up gives me:
dim(data2)
[1] 525380 5
dim(data3)
[1] 1707 7
dim(data4)
[1] 5307668 10
So the merge is not done correctly: since it is a left join, data4 should also have 525380 rows, yet I am getting far more rows than the left data set. What could be the cause?
I also tried the code:
data4 <- merge(x = data2, y = data3, all.x = TRUE)
Sorry, I cannot comment; this is not a full answer but is meant as a comment:
There are many different forms of joins.
I find them well explained here.
You do a left join, which returns all rows from the left table, and any rows with matching keys from the right table. So you would expect more values in your data4.
What you actually want seems to be a left semi-join: "A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y" (from this question, which may help you answer yours).
This behaviour occurs when there are multiple rows of data3 that have the same values in your columns c("Inschrijfjaar", "Leeftijd") (and these values appear in data2). Where a data3 row can be merged with multiple records in data2, every combination is included, leading to more records in data4 than in data2.
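A quick way to confirm this (a sketch; the column names are taken from the question) is to count duplicated key pairs in data3 and, if only one data3 row per key is wanted, drop the extras before merging:

# Count key pairs that occur more than once in data3
dups <- duplicated(data3[, c("Inschrijfjaar", "Leeftijd")])
sum(dups)  # anything > 0 means duplicate keys are inflating the merge
# Keep one data3 row per key; the left join then preserves data2's row count
data3_unique <- data3[!dups, ]
data4 <- merge(x = data2, y = data3_unique, by = c("Inschrijfjaar", "Leeftijd"), all.x = TRUE)
nrow(data4) == nrow(data2)  # should now be TRUE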
I'm trying to assign a variable in one dataframe into multiple rows of another dataframe - namely the AWND variable here (average wind speed).
I'm trying to obtain the AWND from here, and I am trying to match it to multiple rows based on the date here.
Here's what I've tried so far.
dfNew <- merge(dfWeather, dfFlight, by="DATE")
I'm not sure how to proceed with this.
Should I do a join?
(EDIT: Here's the data, the dput output of the data I am getting AWND from: https://shrib.com/#-7dXevTkb12Bt6Kdfxim)
I got the flights data (that I am trying to match dates with) from the nycflights13 package, and then I subset the flights data to include only the carriers that had at least 1000 flights depart from LaGuardia.
The flights data has the date-time class as shown in your tibble. First, make sure that the elements you want to join on are the same; i.e. 2013-01-01 05:00:00 will not match 2013-01-01 in your dfWeather data.frame.
# Make sure dates match between data.frames
dfFlight$DATE <- stringr::str_extract(dfFlight$DATE, "\\S*")
# Join AWND wherever dates match to left-hand side
dfNew <- dplyr::left_join(dfFlight, dfWeather, by = "DATE")
I did assume some things about your data since I couldn't fully see what you're working with from the screenshot. This is my first answer on Stack Overflow, so feel free to edit or leave me suggestions.
I am still new to R and I am attempting to solve a seemingly simple problem. I would like to identify all of the unique combinations of values across several columns, and update an additional column in my df to annotate whether or not each row is unique.
Given a df with columns A-Z, I have used the following code to identify unique combinations of columns A, B, C, D, and E. I am trying to update column F with this information.
unique(df[, c("A", "B", "C", "D", "E")])
This returns each of the individual rows with unique combinations as expected, but I cannot figure out what step I should take next to update column "F" with a value indicating that the row is unique. Thanks in advance for any pointers!
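One way to do this (a sketch; it flags rows whose A-E combination occurs exactly once, assuming that is what "unique" means here):

key_cols <- df[, c("A", "B", "C", "D", "E")]
# duplicated() from both ends catches every occurrence of a repeated combination
df$F <- !(duplicated(key_cols) | duplicated(key_cols, fromLast = TRUE))

If F should instead mark only the first occurrence of each combination, !duplicated(key_cols) on its own would do it.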
I want to aggregate and count how often a particular kind of disease occurs in my dataset on a given date. (I don't use duplicated because I want all rows, not only the duplicated ones.)
My original data set looks like:
id dat kinds kind
AE00302 2011-11-20 valv 1
AE00302 2011-10-31 vask 2
(of course my data.frame is much larger)
I try this:
xagg <- aggregate(kind ~ id + dat + kinds, subx, length)
names(xagg) <- c("id", "dat", "kinds", "kindn")
and get:
id dat kinds kindn
AE00302 2011-10-31 valv 1
AE00302 2011-11-20 vask 1
I wonder why R is mixing up the 'dat' and 'kinds' columns.
Does anybody have an idea?
I still don't know why.
But I found out that aggregate goes wrong because of columns that are not used for aggregating.
Therefore these steps solve the problem for me (sketched in code after the list):
# 1st step: reduce the data.frame to only the needed columns
# 2nd Step: aggregate the reduced data.frame
# 3rd Step: merge aggregated data to reduced dataset
# 4th step: remove duplicated rows from reduced dataset (if they occur)
# 5th step: merge reduced dataset without dublicated data to original dataset
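Here is what those steps might look like in code (a sketch only; subx and the column names id, dat and kinds are taken from the question):

# 1st step: reduce the data.frame to only the needed columns
redx <- subx[, c("id", "dat", "kinds")]
# 2nd step: aggregate the reduced data.frame, counting rows per combination
xagg <- aggregate(list(kindn = redx$kinds),
                  by = list(id = redx$id, dat = redx$dat, kinds = redx$kinds),
                  FUN = length)
# 3rd step: merge the aggregated counts back onto the reduced data
redx <- merge(redx, xagg, by = c("id", "dat", "kinds"))
# 4th step: remove duplicated rows from the reduced dataset (if they occur)
redx <- redx[!duplicated(redx), ]
# 5th step: merge the de-duplicated data back onto the original dataset
subx <- merge(subx, redx, by = c("id", "dat", "kinds"), all.x = TRUE)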
Maybe the problem occurs if there are duplicated rows in the aggregated data.frame.
Thanks for all your help, questions and attempts to solve my problem!
elchvonoslo