Merging two dataframes by multiple columns without losing data [duplicate]

I am new to R and have two very large datasets I want to merge. They look as follows
ID year val1 val3
1 1 2001 2 34
2 2 2004 1 25
3 3 2003 3 36
4 4 2003 2 46
5 5 1999 1 55
6 6 2005 3 44
The second dataframe is as follows
ID year val2
1 1 2001 2
2 2 2004 1
3 3 2003 3
4 4 2002 5
5 5 1998 4
6 6 2004 6
I want the final merged set to look like this
ID year val1 val3 val2
1 1 2001 2 34 2
2 2 2004 1 25 1
3 3 2003 3 36 3
4 4 2002 NA NA 5
5 4 2003 2 46 NA
6 5 1998 NA NA 4
7 5 1999 1 55 NA
8 6 2004 NA NA 6
9 6 2005 3 44 NA
I tried merging by ID and year using the following:
total <- merge(df1, df2, by = c("ID", "year"))
But this only merges rows where ID and year BOTH match. I want it to work so that when the ID matches but the year doesn't, a new row is added for that ID carrying the year and val2 entries, with val1 and val3 left as NA.
I then tried merging only by ID and removing rows where year.x != year.y, but since the datasets are so large this wasn't efficient.

merge has an argument all that specifies whether you want to keep all rows from the left and the right side (i.e. all rows from x and all rows from y):
total <- merge(df1, df2, by = c("ID", "year"), all = TRUE)

Specify all.x=TRUE and all.y=TRUE so you keep all rows from both datasets (this is what all=TRUE expands to):
total <- merge(df1, df2, by = c("ID", "year"), all.x = TRUE, all.y = TRUE)
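For reference, a minimal reproducible sketch built from the example frames above (values copied from the question), showing that the full outer join produces the desired NA-padded rows:
df1 <- data.frame(ID   = 1:6,
                  year = c(2001, 2004, 2003, 2003, 1999, 2005),
                  val1 = c(2, 1, 3, 2, 1, 3),
                  val3 = c(34, 25, 36, 46, 55, 44))
df2 <- data.frame(ID   = 1:6,
                  year = c(2001, 2004, 2003, 2002, 1998, 2004),
                  val2 = c(2, 1, 3, 5, 4, 6))
# Full outer join: keep every ID/year combination from either frame,
# filling the side with no match with NA
total <- merge(df1, df2, by = c("ID", "year"), all = TRUE)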

Related

Paste values in a column based on other observations in the dataframe in R

I have a very large (~30M observations) dataframe in R and I am having trouble with a new column I want to create.
The data is formatted like this:
Country Year Value
1 A 2000 1
2 A 2001 NA
3 A 2002 2
4 B 2000 4
5 B 2001 NA
6 B 2002 NA
7 B 2003 3
My problem is that I would like to impute the NAs in the Value column based on other values in that column. Specifically, within each country the most recent non-NA value should be carried forward to replace the NAs in later years, until another non-NA value appears (i.e. last observation carried forward).
The data above would therefore be transformed into this:
Country Year Value
1 A 2000 1
2 A 2001 1
3 A 2002 2
4 B 2000 4
5 B 2001 4
6 B 2002 4
7 B 2003 3
To solve this, I first tried using a loop with a lookup function and some if_else statements, but I wasn't able to get it to behave as I expected. In general, I am struggling to find a solution efficient enough to finish in minutes or hours rather than days.
Is there an easy way to do this?
Thanks!
Using tidyr's fill:
library(tidyverse)
df %>%
  group_by(Country) %>%
  fill(Value)  # the default .direction = "down" carries the last non-NA value forward
Result:
# A tibble: 7 × 3
# Groups: Country [2]
Country Year Value
<chr> <dbl> <dbl>
1 A 2000 1
2 A 2001 1
3 A 2002 2
4 B 2000 4
5 B 2001 4
6 B 2002 4
7 B 2003 3
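At ~30M rows, a data.table approach may be faster. A sketch under the same column names (nafill needs data.table >= 1.12.4):
library(data.table)
setDT(df)
# last observation carried forward within each country, updating by reference
df[, Value := nafill(Value, type = "locf"), by = Country]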

Merging two dataframes creates new missing observations

I have two dataframes with the following matching keys: year, region and province. They each have a set of variables (in this illustrative example I use x1 for df1 and x2 for df2) and both variables have several missing values on their own.
df1:
year region province x1 ... xn
2019 1 5 NA
2019 2 4 NA
2019 2 4 NA
2018 3 7 13
2018 3 7 15
2018 3 7 17

df2:
year region province x2 ... xn
2019 1 5 NA
2019 2 4 NA
2019 2 4 NA
2018 3 7 13
2018 3 7 15
2018 3 7 17
I want to merge both dataframes such that they end up like this:
year region province x1 x2
2019 1 5 3 NA
2019 2 4 27 NA
2019 2 4 15 NA
2018 3 7 12 13
2018 3 7 NA 15
2018 3 7 NA 17
2017 4 9 NA 12
2017 4 9 19 30
2017 4 9 20 10
However, when doing so with merged_df <- merge(df1, df2, by = c("year","region","province"), all.x = TRUE), R seems to create a lot of additional missing values in the variable columns (x1 and x2) that were not there before. What is happening here? I have tried sorting both frames with df1 %>% arrange(province, -year) and df2 %>% arrange(province, -year), which is enough to give the two dataframes a matching order, only to find the same issue when running the merge command. I've tried a bunch of other things too, but nothing seems to work. R's output looks roughly like this:
year region province x1 x2
2019 1 5 NA NA
2019 2 4 NA NA
2019 2 4 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2017 4 9 15 NA
2017 4 9 19 30
2017 4 9 20 10
I have done this before; in fact, one of the dataframes is an already merged dataframe in which I did not encounter this issue.
Perhaps the concept of merge() is not clear. Here are two examples with simulated data that I hope will clarify it.
# Data
set.seed(123)
DF1 <- data.frame(year     = rep(c(2017, 2018, 2019), 3),
                  region   = rep(c(1, 2, 3), 3),
                  province = round(runif(9, 1, 5), 0),
                  x1       = rnorm(9, 3, 1.5))
DF2 <- data.frame(year     = rep(c(2016, 2018, 2019), 3),
                  region   = rep(c(1, 2, 3), 3),
                  province = round(runif(9, 1, 5), 0),
                  x2       = rnorm(9, 3, 1.5))
# Merge keeping only rows from DF1 (left join)
Merged1 <- merge(DF1, DF2, by = intersect(names(DF1), names(DF2)), all.x = TRUE)
Merged1
year region province x1 x2
1 2017 1 2 2.8365510 NA
2 2017 1 3 3.7557187 NA
3 2017 1 5 4.9208323 NA
4 2018 2 4 2.8241371 NA
5 2018 2 5 6.7925048 1.460993
6 2018 2 5 0.4090941 1.460993
7 2019 3 1 5.5352765 NA
8 2019 3 3 3.8236451 4.256681
9 2019 3 3 3.2746239 4.256681
# Merge keeping all rows from both frames, even without a match (full outer join)
Merged2 <- merge(DF1, DF2, by = intersect(names(DF1), names(DF2)), all = TRUE)
Merged2
year region province x1 x2
1 2016 1 3 NA 4.052034
2 2016 1 4 NA 2.062441
3 2016 1 5 NA 2.673038
4 2017 1 2 2.8365510 NA
5 2017 1 3 3.7557187 NA
6 2017 1 5 4.9208323 NA
7 2018 2 1 NA 0.469960
8 2018 2 2 NA 2.290813
9 2018 2 4 2.8241371 NA
10 2018 2 5 6.7925048 1.460993
11 2018 2 5 0.4090941 1.460993
12 2019 3 1 5.5352765 NA
13 2019 3 2 NA 1.398264
14 2019 3 3 3.8236451 4.256681
15 2019 3 3 3.2746239 4.256681
16 2019 3 4 NA 1.906663
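For comparison, a sketch of the dplyr equivalents of these two merges (assuming the dplyr package is available):
library(dplyr)
# left join: keep only rows from DF1
Merged1 <- left_join(DF1, DF2, by = c("year", "region", "province"))
# full outer join: keep all rows from both frames
Merged2 <- full_join(DF1, DF2, by = c("year", "region", "province"))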

Assign unique ID based on two columns [duplicate]

I have a dataframe (df) that looks like this:
School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000
And I would like to create a person ID column so that df looks like this:
ID School Student Year
1 A 10 1999
1 A 10 2000
2 A 20 1999
2 A 20 2000
2 A 20 2001
3 B 10 1999
3 B 10 2000
In other words, the ID variable indicates which person it is in the dataset, accounting for both Student number and School membership (here we have 3 students total).
I did df$ID <- df$Student and tried to increment the value by 1 whenever the c("School", "Student") pair was new. It isn't working. Help appreciated.
We can do this in base R without doing any group by operation
df$ID <- cumsum(!duplicated(df[1:2]))
df
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
NOTE: This assumes that the rows are already ordered by 'School' and 'Student'.
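If the rows are not ordered, an order-robust base R variant (a sketch using the same columns) is to match each School/Student pair against the vector of unique pairs, which numbers the groups by first appearance:
key <- paste(df$School, df$Student)
df$ID <- match(key, unique(key))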
Or using tidyverse
library(dplyr)
df %>%
  mutate(ID = group_indices_(df, .dots = c("School", "Student")))
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
As @radek mentioned, in more recent versions (dplyr 0.8.0) group_indices_ is deprecated; use group_indices instead
df %>%
  mutate(ID = group_indices(., School, Student))
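In dplyr >= 1.0.0, group_indices is itself deprecated; as far as I know the current idiom is cur_group_id() inside a grouped mutate:
df %>%
  group_by(School, Student) %>%
  mutate(ID = cur_group_id()) %>%
  ungroup()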
Group by School and Student, then assign the group id to the ID variable.
library('data.table')
df[, ID := .GRP, by = .(School, Student)]
# School Student Year ID
# 1: A 10 1999 1
# 2: A 10 2000 1
# 3: A 20 1999 2
# 4: A 20 2000 2
# 5: A 20 2001 2
# 6: B 10 1999 3
# 7: B 10 2000 3
Data:
df <- fread('School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000')

How can I drop observations within a group following the occurrence of NA?

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of the variable "assets", but I have some NAs. However, I cannot simply delete the NA observations; I need to delete all subsequent observations following the NA event as well.
Here an example:
productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to carry out functions by group using plyr, and I have also been able to create a 0-1 column where 0 indicates a valid assets entry and 1 marks a missing (NA) value.
mydf$missing <- ifelse(mydf$assets >= 0, 0, 1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
  filter(cumsum(is.na(assets)) == 0)  # keep only rows before the first NA in each group
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[, nas := cumsum(is.na(assets)), by = "productreference"][nas == 0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
Here is a base R option:
# build a logical index per group that is TRUE until the first NA,
# then unsplit it back into the original row order
mydf[unsplit(lapply(split(mydf, mydf$productreference),
                    function(x) cumsum(is.na(x$assets)) == 0),
             mydf$productreference), ]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or an option with data.table:
library(data.table)
# within each group, keep the rows before the first NA (or all rows if none)
setDT(mydf)[, if (any(is.na(assets))) .SD[seq(which(is.na(assets))[1] - 1)]
            else .SD, by = productreference]
You can also do it using base R and a for loop. This code is a bit longer than some of the code in the other answers. In the loop we subset mydf by productreference, and for every subset we look for the first occurrence of an NA in assets and exclude that row and all following rows.
mydf2 <- NULL
for (i in 1:max(mydf$productreference)) {
  s1 <- mydf[mydf$productreference == i, ]
  # keep rows up to (but excluding) the first NA, or all rows if there is none
  s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets))) - 1), ]
  mydf2 <- rbind(mydf2, s2)
  mydf2 <- mydf2[!is.na(mydf2$assets), ]
}
mydf2

Merging on pairs of values in one column of each data frame

I am trying to merge two data frames with columns of different lengths and rows. To give the exact idea, DF1 is:
ID year freq1 mun
1 2005 2 61137
1 2006 1 61383
2 2005 3 14520
2 2006 2 14604
4 2005 3 101423
4 2006 1 102257
6 2005 0 39039
6 2006 1 39346
Whereas DF2 is:
ID year freq2 mun
1 2004 5 60857
1 2005 3 61137
2 2004 4 14278
2 2005 4 14520
3 2004 2 22563
3 2005 0 22635
4 2004 6 101015
4 2005 4 101423
5 2004 6 61152
5 2005 3 61932
6 2004 4 38456
6 2005 3 39039
As you can see, both the year and mun variables are somewhat different and have only one common entry per ID. So what I'm trying to achieve is to merge the freq1 and freq2 columns with respect to IDs. However, the trick is that DF1 should take priority (a left merge?) in such a way that the year and mun values are the ones chosen from DF1. Desired output:
ID year freq1 mun freq2
1 2005 2 61137 5
1 2006 1 61383 3
2 2005 3 14520 4
2 2006 2 14604 4
4 2005 3 101423 6
4 2006 1 102257 4
6 2005 0 39039 4
6 2006 1 39346 3
As well as the other way around, with DF2 taking priority:
ID year freq2 mun freq1
1 2004 5 60857 2
1 2005 3 61137 1
2 2004 4 14278 3
2 2005 4 14520 2
3 2004 2 22563 0
3 2005 0 22635 0
4 2004 6 101015 3
4 2005 4 101423 1
5 2004 6 61152 0
5 2005 3 61932 0
6 2004 4 38456 0
6 2005 3 39039 1
I've tried deleting the year and mun columns and merging freq1 and freq2 by the common IDs, but that only gives me multiple duplicate entries. Any suggestions?
It appears that you are trying to match pairs of IDs in the data frames, in the order presented.
Matching on the ID column alone will cause a cross-product to be formed, giving four rows for ID == 1, which is what I assume you mean by "multiple duplicate entries."
To merge the pairs of ID values, you need to disambiguate the individual values, so the merge merges the first ID value in df1 with the first ID value in df2, and similarly for the second ID values.
This disambiguation can be done by adding another column that counts the number of times each ID value has been seen so far. seq_along does the counting, and ave applies it within the "levels" of ID:
df1$ID2 <- ave(df1$ID, df1$ID, FUN=seq_along)
df2$ID2 <- ave(df2$ID, df2$ID, FUN=seq_along)
Here's the new df1. df2 is similarly modified.
> df1
ID year freq1 mun ID2
1 1 2005 2 61137 1
2 1 2006 1 61383 2
3 2 2005 3 14520 1
4 2 2006 2 14604 2
5 4 2005 3 101423 1
6 4 2006 1 102257 2
7 6 2005 0 39039 1
8 6 2006 1 39346 2
These are now appropriate for passing to merge to get the two sides that you want. Removing the unused column from each side prevents the merge from taking data that you don't want:
> merge(df1, df2[-c(2,4)], by=c('ID', 'ID2'), all.x=T)[-2]
ID year freq1 mun freq2
1 1 2005 2 61137 5
2 1 2006 1 61383 3
3 2 2005 3 14520 4
4 2 2006 2 14604 4
5 4 2005 3 101423 6
6 4 2006 1 102257 4
7 6 2005 0 39039 4
8 6 2006 1 39346 3
> merge(df1[-c(2,4)], df2, by=c('ID', 'ID2'), all.y=T)[-2]
ID freq1 year freq2 mun
1 1 2 2004 5 60857
2 1 1 2005 3 61137
3 2 3 2004 4 14278
4 2 2 2005 4 14520
5 3 NA 2004 2 22563
6 3 NA 2005 0 22635
7 4 3 2004 6 101015
8 4 1 2005 4 101423
9 5 NA 2004 6 61152
10 5 NA 2005 3 61932
11 6 0 2004 4 38456
12 6 1 2005 3 39039
Note that NA values are used where there is no match. You can replace these with 0 values if that is really appropriate.
The [-2] at the end removes the added column ID2.
This is a fairly unusual way to merge. It depends on the order of the data in addition to the values, so it does seem fragile. But I do think that I've captured what you want to accomplish.
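A dplyr sketch of the same idea (assuming dplyr >= 1.0; row_number() plays the role of the seq_along counter):
library(dplyr)
df1k <- df1 %>% group_by(ID) %>% mutate(ID2 = row_number()) %>% ungroup()
df2k <- df2 %>% group_by(ID) %>% mutate(ID2 = row_number()) %>% ungroup()
# df1 takes priority; drop the helper column afterwards
left_join(df1k, select(df2k, ID, ID2, freq2), by = c("ID", "ID2")) %>%
  select(-ID2)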
match can also be used to find corresponding rows between DF1 and DF2, but matching on year alone would return the first year match regardless of ID. Instead, build a key that combines each ID with its occurrence number within that ID, and match on the key. See the code below.
# occurrence counter within each ID, pasted onto the ID to form a unique key
k1 <- paste(DF1$ID, ave(DF1$ID, DF1$ID, FUN = seq_along))
k2 <- paste(DF2$ID, ave(DF2$ID, DF2$ID, FUN = seq_along))
# For each row of DF1, find the matching row in DF2 and take its freq2 value.
cbind(DF1, freq2 = DF2$freq2[match(k1, k2)])
# For each row of DF2, find the matching row in DF1 and take its freq1 value.
cbind(DF2, freq1 = DF1$freq1[match(k2, k1)])
