Merging on pairs of values in one column of each data frame - r

I am trying to merge two data frames with columns of different lengths and rows. To give the exact idea, DF1 is:
ID year freq1 mun
1 2005 2 61137
1 2006 1 61383
2 2005 3 14520
2 2006 2 14604
4 2005 3 101423
4 2006 1 102257
6 2005 0 39039
6 2006 1 39346
Whereas DF2 is:
ID year freq2 mun
1 2004 5 60857
1 2005 3 61137
2 2004 4 14278
2 2005 4 14520
3 2004 2 22563
3 2005 0 22635
4 2004 6 101015
4 2005 4 101423
5 2004 6 61152
5 2005 3 61932
6 2004 4 38456
6 2005 3 39039
As you can see, the year and mun variables differ somewhat between the two data frames and have only one common entry. What I'm trying to achieve is to merge the freq1 and freq2 columns with respect to the IDs. The trick is that DF1 should take priority (left merge?) in such a way that the year and mun variables are the ones taken from DF1. Desired output:
ID year freq1 mun freq2
1 2005 2 61137 5
1 2006 1 61383 3
2 2005 3 14520 4
2 2006 2 14604 4
4 2005 3 101423 6
4 2006 1 102257 4
6 2005 0 39039 4
6 2006 1 39346 3
And the other way around, with DF2 taking priority, so that:
ID year freq2 mun freq1
1 2004 5 60857 2
1 2005 3 61137 1
2 2004 4 14278 3
2 2005 4 14520 2
3 2004 2 22563 0
3 2005 0 22635 0
4 2004 6 101015 3
4 2005 4 101423 1
5 2004 6 61152 0
5 2005 3 61932 0
6 2004 4 38456 0
6 2005 3 39039 1
I've tried deleting the year and mun columns and merging freq1 and freq2 on the common IDs, but that only gives me multiple duplicate entries. Any suggestions?

It appears that you are trying to match pairs of IDs in the data frames, in the order presented.
Matching on the ID column alone will cause a cross-product to be formed, giving four rows for ID == 1, which is what I assume you mean by "multiple duplicate entries."
To merge the pairs of ID values, you need to disambiguate the individual values, so the merge merges the first ID value in df1 with the first ID value in df2, and similarly for the second ID values.
This disambiguation can be done by adding another column, which adds a counter for the number of ID values seen. seq_along counts, and ave applies to the "levels" of ID:
df1$ID2 <- ave(df1$ID, df1$ID, FUN=seq_along)
df2$ID2 <- ave(df2$ID, df2$ID, FUN=seq_along)
Here's the new df1. df2 is similarly modified.
> df1
ID year freq1 mun ID2
1 1 2005 2 61137 1
2 1 2006 1 61383 2
3 2 2005 3 14520 1
4 2 2006 2 14604 2
5 4 2005 3 101423 1
6 4 2006 1 102257 2
7 6 2005 0 39039 1
8 6 2006 1 39346 2
These are now appropriate for passing to merge to get the two sides that you want. Removing the unused column from each side prevents the merge from taking data that you don't want:
> merge(df1, df2[-c(2,4)], by=c('ID', 'ID2'), all.x=TRUE)[-2]
ID year freq1 mun freq2
1 1 2005 2 61137 5
2 1 2006 1 61383 3
3 2 2005 3 14520 4
4 2 2006 2 14604 4
5 4 2005 3 101423 6
6 4 2006 1 102257 4
7 6 2005 0 39039 4
8 6 2006 1 39346 3
> merge(df1[-c(2,4)], df2, by=c('ID', 'ID2'), all.y=TRUE)[-2]
ID freq1 year freq2 mun
1 1 2 2004 5 60857
2 1 1 2005 3 61137
3 2 3 2004 4 14278
4 2 2 2005 4 14520
5 3 NA 2004 2 22563
6 3 NA 2005 0 22635
7 4 3 2004 6 101015
8 4 1 2005 4 101423
9 5 NA 2004 6 61152
10 5 NA 2005 3 61932
11 6 0 2004 4 38456
12 6 1 2005 3 39039
Note that NA values are used where there is no match. You can replace these with 0 values if that is really appropriate.
The [-2] at the end removes the added column ID2.
This is a fairly unusual way to merge. It depends on the order of the data in addition to the values, so it does seem to be fragile. But I do think that I've captured what you want to accomplish.
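The steps above can be put together as a self-contained sketch. It rebuilds the question's data, runs the DF2-priority merge, and replaces the unmatched NA values with 0 as the desired output suggests. The name res and selecting the df1 columns by name (rather than by position) are choices made for this illustration only.

```r
# Rebuild the two example data frames from the question.
df1 <- data.frame(
  ID    = rep(c(1, 2, 4, 6), each = 2),
  year  = rep(c(2005, 2006), times = 4),
  freq1 = c(2, 1, 3, 2, 3, 1, 0, 1),
  mun   = c(61137, 61383, 14520, 14604, 101423, 102257, 39039, 39346)
)
df2 <- data.frame(
  ID    = rep(c(1, 2, 3, 4, 5, 6), each = 2),
  year  = rep(c(2004, 2005), times = 6),
  freq2 = c(5, 3, 4, 4, 2, 0, 6, 4, 6, 3, 4, 3),
  mun   = c(60857, 61137, 14278, 14520, 22563, 22635,
            101015, 101423, 61152, 61932, 38456, 39039)
)

# Number the rows within each ID (1, 2, ...) so pairs are matched in order.
df1$ID2 <- ave(df1$ID, df1$ID, FUN = seq_along)
df2$ID2 <- ave(df2$ID, df2$ID, FUN = seq_along)

# DF2 takes priority: keep every row of df2 and pull in only freq1 from df1.
res <- merge(df1[c("ID", "ID2", "freq1")], df2,
             by = c("ID", "ID2"), all.y = TRUE)
res$ID2 <- NULL                      # drop the helper column
res$freq1[is.na(res$freq1)] <- 0     # IDs missing from df1 get 0
res
```

Whether 0 is the right stand-in for a missing freq1 depends on what the counts mean; leaving NA keeps "no data" distinct from "zero occurrences".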

Use the match function to find corresponding rows between DF1 and DF2. See the code below.
# Find the rows in DF2 whose year matches each row of DF1 and take their "freq2" values.
cbind(DF1, DF2[ match( DF1[,"year"], DF2[,"year"] ), "freq2" ])
# Find the rows in DF1 whose year matches each row of DF2 and take their "freq1" values.
cbind(DF2, DF1[ match( DF2[,"year"], DF1[,"year"] ), "freq1" ])
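Note that match returns only the first row with a given year, whatever its ID. If the goal is to pair rows in order within each ID, as in the desired output, one possible variation is to match on a composite key instead. This is a minimal sketch under that assumption; key1, key2, and out are names introduced here for illustration.

```r
DF1 <- data.frame(
  ID    = rep(c(1, 2, 4, 6), each = 2),
  year  = rep(c(2005, 2006), times = 4),
  freq1 = c(2, 1, 3, 2, 3, 1, 0, 1),
  mun   = c(61137, 61383, 14520, 14604, 101423, 102257, 39039, 39346)
)
DF2 <- data.frame(
  ID    = rep(c(1, 2, 3, 4, 5, 6), each = 2),
  year  = rep(c(2004, 2005), times = 6),
  freq2 = c(5, 3, 4, 4, 2, 0, 6, 4, 6, 3, 4, 3),
  mun   = c(60857, 61137, 14278, 14520, 22563, 22635,
            101015, 101423, 61152, 61932, 38456, 39039)
)

# Composite key: the ID plus the running position of the row within that ID.
key1 <- paste(DF1$ID, ave(DF1$ID, DF1$ID, FUN = seq_along))
key2 <- paste(DF2$ID, ave(DF2$ID, DF2$ID, FUN = seq_along))

# DF1 takes priority: look up the freq2 value for each DF1 row.
out <- cbind(DF1, freq2 = DF2$freq2[match(key1, key2)])
out
```

Swapping the roles of the two data frames (and of key1 and key2) gives the DF2-priority version, with NA where DF1 has no matching ID.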

Related

Paste values in a column based on other observations in the dataframe in R

I have a very large (~30M observations) dataframe in R and I am having trouble with a new column I want to create.
The data is formatted like this:
Country Year Value
1 A 2000 1
2 A 2001 NA
3 A 2002 2
4 B 2000 4
5 B 2001 NA
6 B 2002 NA
7 B 2003 3
My problem is that I would like to impute the NAs in the value column based on other values in that column. Specifically, if there is a non-NA value for the same country I would like that to replace the NA in later years, until there is another non-NA value.
The data above would therefore be transformed into this:
Country Year Value
1 A 2000 1
2 A 2001 1
3 A 2002 2
4 B 2000 4
5 B 2001 4
6 B 2002 4
7 B 2003 3
To solve this, I first tried using a loop with a lookup function and also some if_else statements, but wasn't able to get it to behave as I expected. In general, I am struggling to find an efficient solution that will be able to perform the task in the order of minutes-hours and not days.
Is there an easy way to do this?
Thanks!
Using tidyr's fill:
library(tidyverse)
df %>%
group_by(Country) %>%
fill(Value)
Result:
# A tibble: 7 × 3
# Groups: Country [2]
Country Year Value
<chr> <dbl> <dbl>
1 A 2000 1
2 A 2001 1
3 A 2002 2
4 B 2000 4
5 B 2001 4
6 B 2002 4
7 B 2003 3
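If adding the tidyverse is not an option, the same last-observation-carried-forward fill can be sketched in base R. The helper name fill_down is hypothetical, and the sketch assumes the rows are already sorted by Year within each Country.

```r
df <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B", "B"),
  Year    = c(2000, 2001, 2002, 2000, 2001, 2002, 2003),
  Value   = c(1, NA, 2, 4, NA, NA, 3)
)

# Carry the last non-NA value forward within a single vector.
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))        # number of non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # idx 0 (a leading NA) stays NA
}

# Apply the helper separately to each country.
df$Value <- ave(df$Value, df$Country, FUN = fill_down)
df
```

The helper is fully vectorised, so the only per-group work is the split that ave performs on Country.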

Merging two dataframes by multiple columns without losing data [duplicate]

This question already has an answer here:
merging in R keeping all rows of a data set
(1 answer)
Closed 1 year ago.
I am new to R and have two very large datasets I want to merge. They look as follows
ID year val1 val3
1 1 2001 2 34
2 2 2004 1 25
3 3 2003 3 36
4 4 2003 2 46
5 5 1999 1 55
6 6 2005 3 44
The second dataframe is as follows
ID year val2
1 1 2001 2
2 2 2004 1
3 3 2003 3
4 4 2002 5
5 5 1998 4
6 6 2004 6
I want the final merged set to look like this
ID year val1 val3 val2
1 1 2001 2 34 2
2 2 2004 1 25 1
3 3 2003 3 36 3
4 4 2002 NA NA 5
5 4 2003 2 46 NA
6 5 1998 NA NA 4
7 5 1999 1 55 NA
8 6 2004 NA NA 6
9 6 2005 3 44 NA
I tried merging by ID and year using the following
total <- merge(df1,df2,by=c("ID","year"))
But this results in only merging if ID and year BOTH match. I want it to happen so that if the ID matches but the year doesn't, a new row is added under the same ID with the entries for year and val2, leaving val1 and val3 as NA.
I then tried merging only by ID and then removing rows if year.x != year.y, but since the datasets were too large it wasn't very efficient.
merge has an argument all that specifies whether you want to keep all rows from both the left and the right side (i.e. all rows from x and all rows from y):
total <- merge(df1,df2,by=c("ID","year"), all=TRUE)
You can also specify all.x=TRUE and all.y=TRUE, which is equivalent to all=TRUE and keeps all unique rows from both datasets:
total <- merge(df1,df2,by=c("ID","year"),all.x=TRUE,all.y=TRUE)
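As a quick check, here is a small reproducible sketch of the full outer join on the question's data. The data frames are retyped from the printed tables, and the by columns are spelled "ID" and "year" to match those column names exactly.

```r
df1 <- data.frame(
  ID   = 1:6,
  year = c(2001, 2004, 2003, 2003, 1999, 2005),
  val1 = c(2, 1, 3, 2, 1, 3),
  val3 = c(34, 25, 36, 46, 55, 44)
)
df2 <- data.frame(
  ID   = 1:6,
  year = c(2001, 2004, 2003, 2002, 1998, 2004),
  val2 = c(2, 1, 3, 5, 4, 6)
)

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA.
total <- merge(df1, df2, by = c("ID", "year"), all = TRUE)
total
```

The result has nine rows: three where ID and year match in both data frames, and six where only one side has a row for that ID and year.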

How can I drop observations within a group following the occurrence of NA?

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of a variable "assets", but I have some NAs. However, I cannot simply delete the NA observations, but need to delete all subsequent observations following the NA event.
Here is an example:
productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to carry out functions by group using plyr and I have also been able to create a column with 0-1, where 0 indicates that assets has a valid entry and 1 highlights missing values of NA.
mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
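To see why the filter works: within a group, the running count of NA values stays at 0 until the first NA appears and is at least 1 from then on, so keeping rows where the count is 0 drops the first NA and everything after it. A minimal sketch of that idea without extra packages, using ave for the per-group step; na_count and kept are names made up for this illustration.

```r
productreference <- c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
assets <- c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf <- data.frame(productreference, assets)

# On one group: the count is 0 up to the first NA, then 1 or more.
cumsum(is.na(c(2, 3, NA, 2)))
# [1] 0 0 1 1

# Same running count, restarted for every productreference.
na_count <- ave(as.integer(is.na(mydf$assets)), mydf$productreference,
                FUN = cumsum)
kept <- mydf[na_count == 0, ]
kept
```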
Here is a base R option
mydf[unsplit(lapply(split(mydf, mydf$productreference),
function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or an option with data.table
library(data.table)
setDT(mydf)[, if(any(is.na(assets))) .SD[seq_len(which(is.na(assets))[1]-1)]
else .SD, by = productreference]
You can do it using base R and a for loop. This code is a bit longer than some of the code in the other answers. In the loop we subset mydf by productreference and for every subset we look for the first occurrence of assets==NA, and exclude that row and all following rows.
mydf2 <- NULL
for (i in 1:max(mydf$productreference)){
s1 <- mydf[mydf$productreference==i,]
s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets)==T))-1),]
mydf2 <- rbind(mydf2, s2)
mydf2 <- mydf2[!is.na(mydf2$assets),]
}
mydf2

expand.grid() based on values in two variables in R

I would like to expand a grid in R such that the expansion occurs for unique values of one variable but joint values for two variables. For example:
frame <- data.frame(id = seq(1:2),id2 = seq(1:2), year = c(2005, 2008))
I would like to expand the frame for each year, but such that id and id2 are considered jointly (e.g. (1,1) and (2,2)) to generate an output like:
id id2 year
1 1 2005
1 1 2006
1 1 2007
1 1 2008
2 2 2005
2 2 2006
2 2 2007
2 2 2008
Using expand.grid(), does someone know how to do this? I have not been able to wrangle the code past looking at each id uniquely and producing a frame with all combinations given the following code:
with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), id2 = unique(id2)))
Thanks for any and all help.
You could do this with reshape::expand.grid.df
require(reshape)
expand.grid.df(data.frame(id=1:2,id2=1:2), data.frame(year=c(2005:2008)))
> expand.grid.df(data.frame(id=1:2,id2=1:2), data.frame(year=c(2005:2008)))
id id2 year
1 1 1 2005
2 2 2 2005
3 1 1 2006
4 2 2 2006
5 1 1 2007
6 2 2 2007
7 1 1 2008
8 2 2 2008
Here is another way using base R
indx <- diff(frame$year)+1
indx1 <- rep(1:nrow(frame), each=indx)
frame1 <- transform(frame[indx1,1:2], year=seq(frame$year[1], length.out=indx, by=1))
row.names(frame1) <- NULL
frame1
# id id2 year
#1 1 1 2005
#2 1 1 2006
#3 1 1 2007
#4 1 1 2008
#5 2 2 2005
#6 2 2 2006
#7 2 2 2007
#8 2 2 2008
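Since the question asks about expand.grid() specifically, one further possibility is to expand over row indices of the unique (id, id2) pairs rather than over the two id columns themselves. This is only a sketch; pairs, grid, and out are names chosen for the illustration.

```r
frame <- data.frame(id = 1:2, id2 = 1:2, year = c(2005, 2008))

# Keep id and id2 together by treating each unique pair as one unit.
pairs <- unique(frame[c("id", "id2")])

# Cross the full range of years with the row index of each pair.
grid <- expand.grid(
  year = seq(min(frame$year), max(frame$year)),
  row  = seq_len(nrow(pairs))
)

# Look the pairs up again by row index and attach the year.
out <- cbind(pairs[grid$row, ], year = grid$year)
rownames(out) <- NULL
out
```

Because expand.grid varies its first argument fastest, the rows come out grouped by pair with the years in ascending order.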

repeat rows in a dataset based on a column, but increment the rows [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 5 years ago.
I have a dataset which has project name, start year and contract term. I need to develop this dataset into time series. For example, one row in my dataset is: Project A, start year 2003 and contract term 5. I would like to repeat each row based on contract term. My dataset looks like this:
Project Name Start Year Contract Term
A 2003 5
B 2013 3
C 2000 2
My desired result should look like this:
Project Name Start Year Contract Term
A 2003 5
A 2004 5
A 2005 5
A 2006 5
A 2007 5
B 2013 3
B 2014 3
B 2015 3
C 2000 2
C 2001 2
I have tried:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
But this only repeats each project by the number in contract term. I cannot make it increment the years.
Thanks in advance!
Here it is in two steps:
Step 1, you know:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2003 5
# 1.1 A 2003 5
# 1.2 A 2003 5
# 1.3 A 2003 5
# 1.4 A 2003 5
# 2 B 2013 3
# 2.1 B 2013 3
# 2.2 B 2013 3
# 3 C 2000 2
# 3.1 C 2000 2
Step 2 makes use of sequence and basic addition:
sequence(rpsInput$Contract.Term) ## This will be helpful...
# [1] 1 2 3 4 5 1 2 3 1 2
rpsData$Start.Year <- rpsData$Start.Year + sequence(rpsInput$Contract.Term)
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2004 5
# 1.1 A 2005 5
# 1.2 A 2006 5
# 1.3 A 2007 5
# 1.4 A 2008 5
# 2 B 2014 3
# 2.1 B 2015 3
# 2.2 B 2016 3
# 3 C 2001 2
# 3.1 C 2002 2
Just to piggy back on Ananda's answer, change
sequence(rpsInput$Contract.Term)
to
(sequence(rpsInput$Contract.Term)-1)
to get the output you desire.
ProjectName<-c("A","B","C")
Start.Year<-c(2003,2013,2000)
Contract.Term<-c(5,3,2)
rpsInput<-data.frame(ProjectName,Start.Year,Contract.Term)
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData$Start.Year <- rpsData$Start.Year + (sequence(rpsInput$Contract.Term)-1)
rpsData
# ProjectName Start.Year Contract.Term
#1 A 2003 5
#1.1 A 2004 5
#1.2 A 2005 5
#1.3 A 2006 5
#1.4 A 2007 5
#2 B 2013 3
#2.1 B 2014 3
#2.2 B 2015 3
#3 C 2000 2
#3.1 C 2001 2
