How to join dataframes by ID column? [duplicate] - r

This question already has answers here:
Simultaneously merge multiple data.frames in a list
I have 3 data frames like the ones below, with IDs that do not necessarily occur in all of them.
DF1:
| ID   | Name    | Phone#     | Country | State | Amount_month1 |
|------|---------|------------|---------|-------|---------------|
| 0210 | John K. | 8942829725 | USA     | PA    | 1300          |
| 0215 | Peter   | 8711234566 | USA     | KS    | 50            |
| 2312 | Steven  | 9012341221 | USA     | TX    | 1000          |
| 0005 | Haris   | 9167456363 | USA     | NY    | 1200          |
DF2:
| ID   | Name    | Phone#     | Country | State | Amount_month2 |
|------|---------|------------|---------|-------|---------------|
| 0210 | John K. | 8942829725 | USA     | PA    | 200           |
| 2312 | Steven  | 9012341221 | USA     | TX    | 350           |
| 1212 | Jerry   | 9817273794 | USA     | CA    | 100           |
DF3:
| ID   | Name    | Phone#     | Country | State | Amount_month3 |
|------|---------|------------|---------|-------|---------------|
| 0210 | John K. | 8942829725 | USA     | PA    | 300           |
| 0005 | Haris   | 9167456363 | USA     | NY    | 1250          |
| 1212 | Jerry   | 9817273794 | USA     | CA    | 1200          |
| 1210 | Drew    | 8012341234 | USA     | TX    | 1400          |
I would like to join these 3 data frames by ID, adding the per-month amounts as separate columns; the missing amount values can be either 0 or NA, like this:
| ID   | Name    | Phone#     | Country | State | Amount_month1 | Amount_month2 | Amount_month3 |
|------|---------|------------|---------|-------|---------------|---------------|---------------|
| 0210 | John K. | 8942829725 | USA     | PA    | 1300          | 200           | 300           |
| 0215 | Peter   | 8711234566 | USA     | KS    | 50            | 0             | 0             |
| 2312 | Steven  | 9012341221 | USA     | TX    | 1000          | 350           | 0             |
| 0005 | Haris   | 9167456363 | USA     | NY    | 1200          | 0             | 1250          |
| 1212 | Jerry   | 9817273794 | USA     | CA    | 0             | 100           | 1200          |
| 1210 | Drew    | 8012341234 | USA     | TX    | 0             | 0             | 1400          |

This can be done in a single line using Reduce and merge:
Reduce(function(x, y) merge(x, y, all=TRUE), list(DF1, DF2, DF3))
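Since the question allows the missing amounts to be either 0 or NA, and merge leaves them as NA, a short follow-up sketch can convert them to 0 (the variable name merged is just illustrative):
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), list(DF1, DF2, DF3))
# replace NA with 0 in the Amount_month* columns only
amount_cols <- grep("^Amount_month", names(merged))
merged[amount_cols][is.na(merged[amount_cols])] <- 0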

You can use left_join from the dplyr package, first joining the first two data frames, then joining that result with the third:
library(dplyr)
df_12 <- left_join(df1, df2, by = "ID")
df_123 <- left_join(df_12, df3, by = "ID")
Result:
df_123
ID Amount_month1 Amount_month2 Amount_month3
1 1 100 NA NA
2 2 200 50 NA
3 3 300 177 666
4 4 400 NA 77
Mock data:
df1 <- data.frame(
  ID = as.character(1:4),
  Amount_month1 = c(100, 200, 300, 400)
)
df2 <- data.frame(
  ID = as.character(2:3),
  Amount_month2 = c(50, 177)
)
df3 <- data.frame(
  ID = as.character(3:4),
  Amount_month3 = c(666, 77)
)
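Note that left_join keeps only the IDs already present in the first data frame. If, as in the question's expected output, IDs that appear only in a later data frame (Jerry, Drew) must be kept as well, a full_join sketch along the same lines would be:
library(dplyr)
df_123 <- full_join(df1, df2, by = "ID") %>%
  full_join(df3, by = "ID")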

Related

R - Join two dataframes based on date difference

Let's consider two data frames, df1 and df2. I would like to join them based on the date difference only. For example:
Dataframe 1: (df1)
| version_id | date_invoiced | product_id |
-------------------------------------------
| 1 | 03-07-2020 | 201 |
| 1 | 02-07-2020 | 2013 |
| 3 | 02-07-2020 | 2011 |
| 6 | 01-07-2020 | 2018 |
| 7 | 01-07-2020 | 201 |
Dataframe 2: (df2)
| validfrom | pricelist| pricelist_id |
------------------------------------------
|02-07-2020 | 10 | 101 |
|01-07-2020 | 20 | 102 |
|29-06-2020 | 30 | 103 |
|28-07-2020 | 10 | 104 |
|25-07-2020 | 5 | 105 |
I need to map the pricelist_id and the pricelist based on the validfrom column present in df2: the row with the least difference between date_invoiced (df1) and validfrom (df2) should be mapped.
Expected Outcome:
| version_id | date_invoiced | product_id | date_diff | pricelist_id | pricelist |
----------------------------------------------------------------------------------
| 1 | 03-07-2020 | 201 | 1 | 101 | 10 |
| 1 | 02-07-2020 | 2013 | 1 | 102 | 20 |
| 3 | 02-07-2020 | 2011 | 1 | 102 | 20 |
| 6 | 01-07-2020 | 2018 | 1 | 103 | 30 |
| 7 | 01-07-2020 | 201 | 1 | 103 | 30 |
The mapping should be based purely on the date difference, and that difference should be the least: each date_invoiced (df1) should be matched to the closest validfrom (df2). Thanks
Perhaps you might want to try using data.table and a rolling join with roll = 'nearest'. Here, the join is made on DATE, which is DATEINVOICED from df1 and VALIDFROM from df2.
library(data.table)
setDT(df1)
setDT(df2)
# the data has four-digit years, so the format needs %Y rather than %y
df1$DATEINVOICED <- as.Date(df1$DATEINVOICED, format = "%d-%m-%Y")
df2$VALIDFROM <- as.Date(df2$VALIDFROM, format = "%d-%m-%Y")
setkey(df1, DATEINVOICED)[, DATE := DATEINVOICED]
setkey(df2, VALIDFROM)[, DATE := VALIDFROM]
df2[df1, on = "DATE", roll='nearest']
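Under the question's original lowercase column names (again assuming four-digit years, hence %d-%m-%Y), a sketch that also recovers the date_diff column from the expected output might look like this:
library(data.table)
setDT(df1); setDT(df2)
df1[, date_invoiced := as.Date(date_invoiced, format = "%d-%m-%Y")]
df2[, validfrom := as.Date(validfrom, format = "%d-%m-%Y")]
df1[, DATE := date_invoiced]  # shared join key
df2[, DATE := validfrom]
res <- df2[df1, on = "DATE", roll = "nearest"]  # nearest match in either direction
res[, date_diff := abs(as.numeric(date_invoiced - validfrom))]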

How to repeat the rows in a table

I have two tables.
Table1:
+-----+--------+-------------+
| Key | region | region_name |
+-----+--------+-------------+
| ABC | NT | NORTH |
| ABC | ST | SOUTH |
| XYZ | NT | NORTH |
| XYZ | ST | SOUTH |
| DEF | ST | SOUTH |
+-----+--------+-------------+
Table2:
+-----+-------+------+--------+
| KEY | Sales | cost | profit |
+-----+-------+------+--------+
| ABC | 130 | 100 | 30 |
| XYZ | 120 | 95 | 25 |
| DEF | 110 | 90 | 20 |
+-----+-------+------+--------+
I want the final output be like below.
+-----+-------+------+--------+--------+-------------+
| KEY | Sales | cost | profit | region | region_name |
+-----+-------+------+--------+--------+-------------+
| ABC | 130 | 100 | 30 | NT | NORTH |
| ABC | 130 | 100 | 30 | ST | SOUTH |
| XYZ | 120 | 95 | 25 | NT | NORTH |
| XYZ | 120 | 95 | 25 | ST | SOUTH |
| DEF | 110 | 90 | 20 | ST | SOUTH |
+-----+-------+------+--------+--------+-------------+
Thanks in advance!
We can use left_join
library(dplyr)
left_join(df1, df2, by = 'Key')
Or with merge in base R
merge(df1, df2, by = 'Key', all.x = TRUE)
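One detail worth checking first: in the tables above the key column is spelled Key in Table1 but KEY in Table2, and the desired output begins with Table2's columns. If the names really do differ like that, a sketch spelling out the mapping (with df2 as Table2 and df1 as Table1) would be:
library(dplyr)
left_join(df2, df1, by = c("KEY" = "Key"))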

Removing duplicated data based on each group using R

I have a dataset which contains employee id, name, and bank account information. Some of these employees have duplicate names, with either the same or a different employee id for the same employee name. A few of these employees also have the same bank account information under the same name, while some have different bank account numbers under the same name. The aim is to find those employees who have the same name but different bank account numbers. Here's a sample of the data:
| Emp_id | Name | Bank Account |
|--------|:-------:|-------------:|
| 123 | Joan | 6758 |
| 134 | Karyn | 1244 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
| 235 | Larry | 5201 |
| 433 | Larry | 5201 |
| 231 | Larry | 5201 |
| 120 | Amy | 7890 |
| 135 | Amy | 7890 |
| 150 | Chris | 1280 |
| 150 | Chris | 6565 |
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
I had to find the employees who were duplicates based on their names, which I could do successfully. Once that was done, I had to identify the employees with the same name but different bank account numbers. Right now the issue is that the code is not grouping the employees by name before searching for different bank accounts. Instead, it compares account numbers across different individuals, and if it finds a match it removes one of the duplicate rows. For example, Chris and Cassy have the same bank account number '1280', so it treats them as the same and automatically removes one of Chris's records (bank account no 1280 in the output). The output that I'm getting is shown below:
| Emp_id | Name | Bank Account |
|--------|:-----:|-------------:|
| 120 | Amy | 7890 |
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
| 150 | Chris | 6565 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
This is the code that I have followed:
library(dplyr)
sample <- data.frame(
  Id = c("123", "134", "143", "143", "235", "433", "231", "120", "135", "150", "150", "900", "900"),
  Name = c("Joan", "Karyn", "Larry", "Larry", "Larry", "Larry", "Larry", "Amy", "Amy", "Chris", "Chris", "Cassy", "Cassy"),
  Bank_Account = c("6758", "1244", "4900", "5201", "5201", "5201", "5201", "7890", "7890", "1280", "6565", "1280", "9873")
)
n_occur <- data.frame(table(sample$Name))
n_occur <- n_occur[n_occur$Freq > 1, ]
Duplicates <- sample[sample$Name %in% n_occur$Var1[n_occur$Freq > 1], ]
Duplicates <- Duplicates %>% arrange(Name)
Duplicates <- Duplicates[!duplicated(Duplicates$Bank_Account), ]
The actual output however, should have considered the bank account nos within each name (same name). The output should look something like this:
| Emp_id | Name | Bank Account |
|--------|:-------:|-------------:|
| 900 | Cassy |1280 |
| 900 | Cassy |9873 |
| 150 | Chris | 1280 |
| 150 | Chris | 6565 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
Can someone please direct me towards right code?
We can use n_distinct to filter:
library(dplyr)
sample %>%
  group_by(Name) %>%
  filter(n() > 1) %>%
  group_by(Id, add = TRUE) %>%
  filter(n_distinct(Bank_Account) > 1) %>%
  arrange(desc(Id))
# A tibble: 6 x 3
# Groups: Name, Id [3]
# Id Name Bank_Account
# <fct> <fct> <fct>
#1 900 Cassy 1280
#2 900 Cassy 9873
#3 150 Chris 1280
#4 150 Chris 6565
#5 143 Larry 4900
#6 143 Larry 5201
Step 1 - Identifying duplicate names:
step_1 <- sample %>%
  arrange(Name) %>%
  mutate(dup = duplicated(Name)) %>%
  filter(Name %in% unique(as.character(Name[dup == TRUE])))
Step 2 - Identifying duplicate accounts for these names:
step_2 <- step_1 %>%
  group_by(Name, Bank_Account) %>%
  mutate(dup = duplicated(Bank_Account)) %>%
  filter(dup == FALSE)
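For completeness, the same filtering can be sketched in base R, without dplyr (assuming the sample data frame defined above):
# keep names that occur more than once
dups <- sample[ave(seq_along(sample$Name), sample$Name, FUN = length) > 1, ]
# within each Name/Id pair, keep groups with more than one distinct account
keep <- ave(as.integer(factor(dups$Bank_Account)), dups$Name, dups$Id,
            FUN = function(x) length(unique(x))) > 1
dups[keep, ]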

In R, how can I add some specific columns from a dataframe to another dataframe when some values are equal in both dataframes?

I have two datasets that both contain the row combination Country & Year, and I would like to add some columns from one dataset to the other in such a way that the row combinations match.
Dataset 1:
+----------+------+---------+---------+-----+
| Country | Year | exports | imports | ... |
+----------+------+---------+---------+-----+
| Germany | 2000 | 0.70 | 0.40 | ... |
| Germany | 2001 | 0.68 | 0.41 | ... |
| Germany | 2002 | 0.71 | 0.48 | ... |
| Germany | 2003 | ... | ... | ... |
| Spain | 2000 | 0.51 | 0.56 | ... |
| Spain | 2001 | 0.48 | 0.50 | ... |
| Spain | 2002 | 0.50 | 0.53 | ... |
| Spain | 2003 | ... | ... | ... |
| ... | ... | ... | ... | ... |
+----------+------+---------+---------+-----+
Dataset 2:
+----------+-----+------+--------------+-------+-----+
| Country | CC | Year | unemployment | Pop | ... |
+----------+-----+------+--------------+-------+-----+
| Germany | GER | 2000 | 0.03 | 79.50 | ... |
| Germany | GER | 2001 | 0.05 | 79.53 | ... |
| Germany | GER | 2002 | 0.04 | 79.80 | ... |
| Germany | GER | 2003 | ... | ... | ... |
| Hungary | HUN | 2000 | ... | ... | ... |
| Hungary | HUN | 2001 | ... | ... | ... |
| Hungary | HUN | 2002 | ... | ... | ... |
| Hungary | HUN | 2003 | ... | ... | ... |
| Spain | ESP | 2000 | 0.08 | 40.2 | ... |
| Spain | ESP | 2001 | 0.11 | 40.5 | ... |
| Spain | ESP | 2002 | 0.10 | 40.55 | ... |
| Spain | ESP | 2003 | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
+----------+-----+------+--------------+-------+-----+
I want the merged data to look like this:
+----------+-----+------+---------+---------+--------------+-------+-----+
| Country | CC | Year | exports | imports | unemployment | Pop | ... |
+----------+-----+------+---------+---------+--------------+-------+-----+
| Germany | GER | 2000 | 0.70 | 0.40 | 0.03 | 79.50 | ... |
| Germany | GER | 2001 | 0.68 | 0.41 | 0.05 | 79.53 | ... |
| Germany | GER | 2002 | 0.71 | 0.48 | 0.04 | 79.80 | ... |
| Germany | GER | 2003 | ... | ... | ... | ... | ... |
| Spain | ESP | 2000 | 0.51 | 0.56 | 0.08 | 40.2 | ... |
| Spain | ESP | 2001 | 0.48 | 0.50 | 0.11 | 40.5 | ... |
| Spain | ESP | 2002 | 0.50 | 0.53 | 0.10 | 40.55 | ... |
| Spain | ESP | 2003 | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
+----------+-----+------+---------+---------+--------------+-------+-----+
So, the countries which are not in dataset 1 (like Hungary in this case) should not appear in the merged dataset, and the country code should be carried into the new dataset. Could someone tell me how I can achieve this? I have 28 years for about 100 countries each, so a function in which I have to specify every combination would not be practical...
I tried merging with merge() but did not succeed, since it just created hundreds of rows with the same country and year combination.
You can do this with inner_join() from the dplyr package:
dplyr::inner_join(df1, df2, by=c("Country", "Year"))
merge absolutely should work for this. You should specify that you are merging on two columns.
merge( df1 , df2 , by=c( "Country", "Year") )
Also confirm that the class of the merging variables is the same:
sapply( df1[, c( "Country", "Year")] , class )
sapply( df2[, c( "Country", "Year")] , class )
Confirm that the variables are spelled the same way in both data frames:
intersect( names( df1 ) , names( df2 ))
Finally, confirm that the Country and Year combinations are unique in both data frames:
sum( duplicated( df1[ ,c( "Country", "Year") ] ))
sum( duplicated( df2[ ,c( "Country", "Year") ] ))
The answer with merge() worked! Now I am facing the problem that e.g. Spain does not have any unemployment data for the year 2000. However, I still want to add all years of Spain and would like to have a NA in the unemployment column for Spain in 2000 in the merged dataset. How can I achieve this?
I tried to use merge(df1, df2, all.x = TRUE) but sometimes it just creates NA's for some reason...
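A likely culprit: when by is not supplied, merge joins on every column name the two data frames share, so any other overlapping column silently becomes part of the key. Spelling out the keys is a safer sketch:
# keep every row of df1; unmatched unemployment values become NA
merge(df1, df2, by = c("Country", "Year"), all.x = TRUE)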

Copy column data when function unaggregates a single row into multiple in R

I need help taking an annual total (for each of many initiatives) and breaking it down to each month using a simple division formula. I need to do this for each distinct combination of a few columns, copying those columns down as the annual figure is broken into monthly totals. The loop will apply the formula to two columns and iterate through each distinct group in a vector. I have tried to explain with an example below, as it's somewhat complex.
What I have :
| Init | Name | Date | Total Savings | Total Costs |
| A    | John | 2015 | TotalD        | TotalD      |
| A    | Mike | 2015 | TotalE        | TotalE      |
| A    | Rob  | 2015 | TotalF        | TotalF      |
| B    | John | 2015 | TotalG        | TotalG      |
| B    | Mike | 2015 | TotalH        | TotalH      |
......
| Init | Name | Date | Total Savings | Total Costs |
| A    | John | 2016 | TotalI        | TotalI      |
| A    | Mike | 2016 | TotalJ        | TotalJ      |
| A    | Rob  | 2016 | TotalK        | TotalK      |
| B    | John | 2016 | TotalL        | TotalL      |
| B    | Mike | 2016 | TotalM        | TotalM      |
I'm going to loop a function over the rows that takes the "Total Savings" and "Total Costs" and divides by 12 where Date = 2015 and by 9 where Date = 2016 (YTD to September), creating an individual row for each month. I'm essentially breaking an annual total out of one row into a row for each month of the year. I need help making that loop also copy the "Init" and "Name" columns for each distinct "Init", "Name" combination. Note that the division formula differs by year. I suppose I could separate the datasets for 2015 and 2016, use two different functions, and merge, if that would be easier. Below is the expected output:
| Init | Name | Date       | Monthly Savings | Monthly Costs |
| A    | John | 01-01-2015 | TotalD/12*      | MonthD        |
| A    | John | 02-01-2015 | MonthD          | MonthD        |
| A    | John | 03-01-2015 | MonthD          | MonthD        |
...
| A    | Mike | 01-01-2016 | TotalE/9*       | MonthE        |
| A    | Mike | 02-01-2016 | MonthE          | MonthE        |
| A    | Mike | 03-01-2016 | MonthE          | MonthE        |
...
| B    | John | 01-01-2015 | TotalG/12*      | MonthG        |
| B    | John | 02-01-2015 | MonthG          | MonthG        |
| B    | John | 03-01-2015 | MonthG          | MonthG        |
TotalD/12* = MonthD - this is the formula for 2015
TotalE/9* = MonthE - this is the formula for 2016
Any help would be appreciated...
As a start, here are some reproducible data, with the columns described:
myData <-
  data.frame(
    Init = rep(LETTERS[1:3], each = 4)
    , Name = rep(c("John", "Mike"), each = 2)
    , Date = 2015:2016
    , Savings = (1:12) * 1200
    , Cost = (1:12) * 2400
  )
Next, set the divisor to be used for each year:
toDivide <-
c("2015" = 12, "2016" = 9)
Then, I am using the magrittr pipe as I split the data up into single rows, then looping through them with lapply to expand each row into the appropriate number of rows (9 or 12) with the savings and costs divided by the number of months. Finally, dplyr's bind_rows stitches the rows back together.
library(dplyr)  # provides the pipe and bind_rows()
myData %>%
  split(1:nrow(.)) %>%
  lapply(function(x){
    data.frame(
      Init = x$Init
      , Name = x$Name
      , Date = as.Date(paste(x$Date
                             , formatC(1:toDivide[as.character(x$Date)]
                                       , width = 2, flag = "0")
                             , "01"
                             , sep = "-"))
      , Savings = x$Savings / toDivide[as.character(x$Date)]
      , Cost = x$Cost / toDivide[as.character(x$Date)]
    )
  }) %>%
  bind_rows()
The head of this looks like:
Init Name Date Savings Cost
1 A John 2015-01-01 100.0000 200.0000
2 A John 2015-02-01 100.0000 200.0000
3 A John 2015-03-01 100.0000 200.0000
4 A John 2015-04-01 100.0000 200.0000
5 A John 2015-05-01 100.0000 200.0000
6 A John 2015-06-01 100.0000 200.0000
with similar entries for each expanded row.
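For reference, the same expansion can be sketched in base R without an explicit per-row loop (assuming the myData and toDivide objects above): repeat each row by its divisor, then divide once.
n <- toDivide[as.character(myData$Date)]  # 12 for 2015, 9 for 2016
expanded <- myData[rep(seq_len(nrow(myData)), times = n), ]
expanded$Date <- as.Date(sprintf("%d-%02d-01", expanded$Date, sequence(n)))
expanded$Savings <- expanded$Savings / rep(n, times = n)
expanded$Cost <- expanded$Cost / rep(n, times = n)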
