Merge two dataframes with repeated columns - r

I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November.
Considering a "customer X", I have three possible scenarios:
1- Customer X is listed in the january database, but he left and now is not listed in february
2- Customer X is listed in both january and february databases
3- Customer X entered the database in february, so he is not listed in january
I am stuck on the following problem: I need to create a single database with all customers and their respective information that are listed in both dataframes. However, considering a customer that is listed in both dataframes, I want to pick his information from his first entry, that is, january.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of which all, all.x or all.y I choose, I get the same undesired output called data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think that what would work here is to merge both databases with this type of join:
Then, merge the resulting dataframe with data.jan with the full outer join. But I don't know how to code this in R.
Thanks,
Bernardo

d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from January (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from February (i.e. y.y, z.y)
na_rows <- is.na(d3[, 2]) # rows with no January entry
d3[na_rows, 2:3] <- d3[na_rows, 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may become tiresome for more than two months, though, so consider @flodel's comments. Also note that there are pitfalls if your original January data contains NAs (and you still want the first month's data retained, NA or not), although you never mentioned them in your question.
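For more than two months, one alternative to repeated merging is to stack the monthly frames in chronological order and keep each ID's first appearance. This is a minimal sketch of a different technique than the merge above, and the toy data is hypothetical:

```r
# Hypothetical toy data standing in for the monthly csv files
data.jan <- data.frame(ID = c(1, 2), AGE = c(25, 40),
                       CITY = c("NY", "LA"), GENDER = c("M", "F"))
data.feb <- data.frame(ID = c(2, 3), AGE = c(41, 33),
                       CITY = c("SF", "DC"), GENDER = c("F", "M"))

# The list must be in chronological order (January first)
months <- list(data.jan, data.feb)
stacked <- do.call(rbind, months)

# duplicated() flags every occurrence of an ID after the first,
# so negating it keeps each customer's earliest row
first_seen <- stacked[!duplicated(stacked$ID), ]
first_seen
```

Because the earliest row is kept as-is, an NA in a customer's first month is retained rather than overwritten by a later month.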

Try:
data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")
I haven't tested this, since no data was provided, but if you join only the ID column from February, the result should keep just the customers that appear in both frames, with their January information.
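A quick check of that behaviour on hypothetical toy data (the values below are made up for illustration) suggests the result is an inner join that carries only the January columns:

```r
# Hypothetical toy data: customer 2 is in both months, with a changed AGE
data.jan <- data.frame(ID = c(1, 2), AGE = c(25, 40))
data.feb <- data.frame(ID = c(2, 3), AGE = c(41, 33))

# Only the ID column of February takes part in the join,
# so no .x/.y suffixed columns are created
both <- merge(data.jan, data.frame(ID = data.feb$ID), by = "ID")
both
```

Note that this drops customers who appear in only one month, so it covers scenario 2 but not scenarios 1 and 3.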

@user1317221_G's solution is excellent. If your tables are large (lots of customers), data tables might be faster:
library(data.table)
# some sample data
jan <- data.table(id=1:10, age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)
setkey(jan,id)
setkey(feb,id)
join <- data.table(merge(jan, feb, by="id", all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
Edit: This adds processing for multiple months.
f <- function(x,y) {
setkey(x,id)
setkey(y,id)
join <- data.table(merge(x,y,by="id",all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
join[,names(join)[5:7]:=NULL] # get rid of extra columns
setnames(join,2:4,c("age","city","gender")) # rename columns that remain
return(join)
}
Reduce("f",list(jan,feb,mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
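To make the folding order visible, here is a small sketch that uses a stand-in function which only records how it was called (the function and the labels are illustrative, not part of the answer above):

```r
# Stand-in for f: builds a string describing the call instead of merging
show_call <- function(acc, x) paste0("f(", acc, ",", x, ")")

# accumulate = TRUE returns every intermediate result of the fold
steps <- Reduce(show_call, c("jan", "feb", "mar"), accumulate = TRUE)
steps
```

The last element of steps is what a plain Reduce call would return.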

Related

In R, left join two tables whose 2 potential keys contain missing data

Background:
I'm working with a fairly large (>10,000 rows) dataset of individual cars, and I need to do some analysis on it. I need to keep this dataset d intact, but I'm only going to be analyzing cars made by Japanese companies (e.g. Nissan, Honda, etc.). d contains information like VIN_prefix (the first two letters of a VIN number that indicates the "World Manufacturer Number"), model year, and make, but no explicit indicator of whether the car is made by a Japanese firm. Here's d:
d <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
stringsAsFactors=FALSE)
Here, rows 3, 4, and 5 correspond to Japanese cars: the NA in row 3 is actually an Acura whose make is missing. See below when I get to the other dataset about why this is.
d also lacks some attributes (columns) about cars that I need for my analysis, e.g. the current CEO of Japanese car firms.
Enter another dataset, a, a dataset about Japanese car firms which contains those extra attributes as well as columns that could be used to identify whether a given car (row) in d is made by a Japanese firm. One of those is VIN_prefix; the other is jp_makes, a list of Japanese auto firms. Here's a:
a <- data.frame(
VIN_prefix = c("JH","JF","1N"),
jp_makes = c("Acura","Subaru","Nissan"),
current_ceo = c("Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida"),
stringsAsFactors=FALSE)
Here, we can see that the "Acura" make, missing in the car from row 3 in d, could be identified by its VIN_prefix "JH", which in row 3 of d is not NA.
Goal:
Left join a onto d so that each of the 3 Japanese cars in d gets the relevant corresponding attributes from a - mainly, current_ceo. (Non-Japanese cars in d would have NA for columns joined from a; this is fine.)
Problem:
As you can tell, the two relevant variables in d that could be used as keys in a join - make and VIN_prefix - have missing data in d. The "matching rules" we could use are imperfect: I could match on d$make == a$jp_makes or on d$VIN_prefix == a$VIN_prefix, but they'd each be wrong due to the missing data in d.
What to do?
What I've tried:
I can try left joining on either one of these potential keys, but not all 3 of the Japanese cars in d wind up with their correct information from a:
try1 <- left_join(d, a, by = c("make" = "jp_makes"))
try2 <- left_join(d, a, by = c("VIN_prefix" = "VIN_prefix"))
I can successfully generate a logical 'indicator' variable in d that tells me whether a car is Japanese or not:
entries_make <- a$jp_makes
entries_vin_prefix <- a$VIN_prefix
d<- d %>%
mutate(is_jp = ifelse(d$VIN_prefix %in% entries_vin_prefix | d$make %in% entries_make, 1, 0)
%>% as.logical())
But that only gets me halfway: I still need those other columns from a to sit next to those Japanese cars in d. It's unfeasible to manually fill all the missing data in some other way; the real datasets these toy examples correspond to are too big for that and I don't have the manpower or time.
Ideally, I'd like a dataset that looks something like this:
ideal <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
current_ceo = c("NA", "NA", "Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida", "NA"),
stringsAsFactors=FALSE)
What do you all think? I've looked at other posts (e.g. here) but their solutions don't really apply. Any help is much appreciated!
Left join on an OR of the two conditions.
library(sqldf)
sqldf("select d.*, a.current_ceo
from d
left join a on d.VIN_prefix = a.VIN_prefix or d.make = a.jp_makes")
giving:
make model_yr VIN_prefix current_ceo
1 GMC 1999 1G <NA>
2 Dodge 2004 1D <NA>
3 NA 1989 JH Toshihiro Mibe
4 Subaru 1999 JF Tomomi Nakamura
5 Nissan 2006 NA Makoto Ushida
6 Chrysler 2012 2C <NA>
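If adding sqldf is not an option, a similar lookup can be sketched in base R with match(): try VIN_prefix first and fall back to make where that fails. This is an assumption-laden alternative rather than an exact equivalent of the SQL, since it returns at most one match per row of d:

```r
d <- data.frame(
  make = c("GMC", "Dodge", "NA", "Subaru", "Nissan", "Chrysler"),
  model_yr = c("1999", "2004", "1989", "1999", "2006", "2012"),
  VIN_prefix = c("1G", "1D", "JH", "JF", "NA", "2C"),
  stringsAsFactors = FALSE)
a <- data.frame(
  VIN_prefix = c("JH", "JF", "1N"),
  jp_makes = c("Acura", "Subaru", "Nissan"),
  current_ceo = c("Toshihiro Mibe", "Tomomi Nakamura", "Makoto Ushida"),
  stringsAsFactors = FALSE)

# Row of a matched by VIN_prefix (NA where there is no match)
idx <- match(d$VIN_prefix, a$VIN_prefix)
# Row of a matched by make, used only where the VIN lookup failed
fallback <- match(d$make, a$jp_makes)
idx[is.na(idx)] <- fallback[is.na(idx)]

# Indexing with NA yields NA, so non-Japanese cars get NA
d$current_ceo <- a$current_ceo[idx]
d
```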
Use a two-pass method. First fill in the missing make (or VIN_prefix) values. I'll illustrate by filling in the make values. Do notice that "NA" is not the same as NA: the first is a character value while the latter is a true R missing value, so I'd first convert those to true missing values. In natural language, I am replacing the missing values in d with values of 'jp_makes' that are taken from a on the basis of matching VIN_prefix values:
is.na(d$make) <- d$make == "NA"
d$make[is.na(d$make)] <- a$jp_makes[
match(d$VIN_prefix[is.na(d$make)], a$VIN_prefix)]
Now you have the make values filled in on the basis of the table lookup in a. It should be trivial to do the match you wanted by using by.x='make', by.y='jp_makes':
merge(d, a, by.x='make', by.y='jp_makes', all.x=TRUE)
make model_yr VIN_prefix.x VIN_prefix.y current_ceo
1 Acura 1989 JH JH Toshihiro Mibe
2 Chrysler 2012 2C <NA> <NA>
3 Dodge 2004 1D <NA> <NA>
4 GMC 1999 1G <NA> <NA>
5 Nissan 2006 NA 1N Makoto Ushida
6 Subaru 1999 JF JF Tomomi Nakamura
You can then use the values in VIN_prefix.y to replace the "NA" values in VIN_prefix.x.
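That last step could look like the following sketch. The small frame m is a hypothetical stand-in for the merged result, holding only the two VIN columns:

```r
# Hypothetical stand-in for the merged result shown above
m <- data.frame(
  VIN_prefix.x = c("JH", "2C", "NA"),
  VIN_prefix.y = c("JH", NA, "1N"),
  stringsAsFactors = FALSE)

# Rows where d had the placeholder string "NA" and a supplies a real prefix
fill <- m$VIN_prefix.x == "NA" & !is.na(m$VIN_prefix.y)
m$VIN_prefix.x[fill] <- m$VIN_prefix.y[fill]
m
```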

Using variables from two different size datasets (and logic relations) to create a new variable

I have two dataframes. The number of observations is very different, and I would like to use some information from one dataframe into the other, conditioning to some logical relations, and I can't seem to be able to. A down-scaled example would look something like this:
year <- as.vector(c(rep(1949,5), rep(1950,5), rep(1951,5), rep(1952,5)))
moneyband <- as.vector(c(rep(c(10,20,30,40,50),4)))
rate <-as.vector(c(rep(c(0.1,0.2,0.3,0.4,0.5),2),rep(c(0.15,0.25,0.35,0.45,0.55),2)))
datasmall <- as.data.frame(cbind(year,moneyband,rate))
yearbig <- as.vector(c(rep(1949,10), rep(1950,10), rep(1951,10), rep(1952,11)))
earnings <- as.vector(c(rep(c(9,19,30,39,50),8),60))
databig <- as.data.frame(cbind(yearbig,earnings))
Now I want to create a new variable in the big database (let's call it ratebig) that assigns to that variable the rate associated with that amount of earnings, if earnings (in the big database) equal moneyband (in the small database) for a given year. As you can see, in this example this would happen with the values 30 and 50. The rest I would like them to be NA.
I tried this:
databig$ratebig <- NA
for (i in 1949:1952) {
databig$ratebig[datasmall$year == i & (databig$earnings[databig$yearbig==i]==datasmall$moneyband[datasmall$year == i])] <- datasmall$rate[datasmall$year == i & (databig$earnings[databig$yearbig==i]==datasmall$moneyband[datasmall$year == i])]
}
But the different sizes of the databases (or other things) are giving me trouble: I get errors and the results are wrong. The result does not seem to respect the conditions as I intended, and it is influenced by the relative position and structure of the two datasets.
In principle, I wouldn't want to merge the datasets (we are talking about a high number of observations in the real data) and was hoping for a way to do this.
Thanks!!
For your case, merge works fine:
merge(databig, datasmall, by.x = c("yearbig", "earnings"),
by.y = c("year", "moneyband"), all.x = TRUE)
# yearbig earnings rate
#1 1949 9 NA
#2 1949 9 NA
#3 1949 19 NA
#4 1949 19 NA
#5 1949 30 0.30
#6 1949 30 0.30
#7 1949 39 NA
#8 1949 39 NA
#9 1949 50 0.50
#10 1949 50 0.50
#.....
As for why your for loop doesn't work as expected: you need to run the check for every row of databig:
databig$ratebig <- NA
for (i in 1:nrow(databig)) {
inds <- databig$yearbig[i] == datasmall$year &
databig$earnings[i] == datasmall$moneyband
if (any(inds))
databig$ratebig[i] <- datasmall$rate[inds]
}
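If you would rather avoid both the merge and the row-by-row loop, a vectorised lookup with match() on a combined key is one option. This is a sketch on hypothetical cut-down data, and it assumes each (year, moneyband) pair occurs only once in datasmall:

```r
# Hypothetical cut-down versions of the two data frames
datasmall <- data.frame(year = c(1949, 1949, 1950),
                        moneyband = c(30, 50, 30),
                        rate = c(0.30, 0.50, 0.35))
databig <- data.frame(yearbig = c(1949, 1949, 1950, 1950),
                      earnings = c(9, 30, 30, 50))

# Build one key per row from year and amount
key_big <- paste(databig$yearbig, databig$earnings)
key_small <- paste(datasmall$year, datasmall$moneyband)

# match() gives the row of datasmall for each row of databig, or NA
databig$ratebig <- datasmall$rate[match(key_big, key_small)]
databig
```

Row order and size of databig are left untouched, and unmatched rows get NA.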

Adding missing end_of_months values by different variables in R

I have the following xlsx file df.xlsx which looks like this:
client_id dax dpd
1 2000-05-30 7
1 2000-12-31 6
2 2003-05-21 6
3 1999-12-30 5
3 2000-10-30 6
3 2001-12-30 5
4 1999-12-30 5
4 2002-05-30 6
It's about loan migration from one snapshot to another. The problem is that I don't have all the months in between (e.g. for client_id = 1, dax jumps from 2000-05-30 to 2000-12-31). I have tried several approaches without success. For each client_id I need to populate all the months between the dax values and keep the same dpd as the first month (e.g. for client_id = 1, dpd = 7 for all months except the last one, 2000-12-31, where dpd = 6). If a client_id appears only once (like client_id = 2), it should remain the same.
(dpd means days past due aka rating bucket)
I have tried this code:
df2 <- data.frame(dax=seq(min(df$dax), max(df$dax), by="month"))
df3 <- merge(x=df2a, y=df, by="dax", all.x=T)
idx <- which(is.na(df3$values))
for (client_id in idx)
df3$values[client_id] <- df3$values[client_id-1]
df3
but the results were not quite what I need.
I appreciate any advice. Thank you very much!
If I understand your question correctly, you want to generate a sequence of dates, given the start and end dates.
R code to do this would be (insert values from your dataframe):
seq(as.Date("2017-01-30"), as.Date("2017-12-30"), "month")
Edit after comment:
In this case you can split your data by clients first and then generate the sequences:
new_data <- data.frame()
customerslist <- split(YOURDATA, YOURDATA$id)
for(i in 1:length(customerslist)){
dates <- seq(min(as.Date(customerslist[[i]]$dax)), max(as.Date(customerslist[[i]]$dax)), "month")
id <- rep(customerslist[[i]]$id[1], length(dates))
dpd <- rep(customerslist[[i]]$dpd[1], length(dates))
add <- cbind(id, as.character(dates), dpd)
new_data <- rbind(new_data, add)
}
new_data$V2 <- as.Date(new_data$V2)
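The loop above repeats the first dpd for every generated month. If the dpd should instead change whenever a later snapshot is reached, a variant could look like the sketch below. It uses hypothetical data whose dates all fall on the first of the month, which sidesteps month-end arithmetic; dates on day 30 or 31, as in the question, would need extra handling:

```r
# Hypothetical data with first-of-month snapshot dates
df <- data.frame(
  client_id = c(1, 1, 2),
  dax = as.Date(c("2000-05-01", "2000-08-01", "2003-05-01")),
  dpd = c(7, 6, 6))

fill_client <- function(g) {
  g <- g[order(g$dax), ]
  dates <- seq(min(g$dax), max(g$dax), by = "month")
  # Index of the latest snapshot on or before each generated date
  last_known <- findInterval(as.numeric(dates), as.numeric(g$dax))
  data.frame(client_id = g$client_id[1], dax = dates,
             dpd = g$dpd[last_known])
}

# Apply per client and stack the pieces back together
filled <- do.call(rbind, lapply(split(df, df$client_id), fill_client))
filled
```

A client with a single snapshot yields a one-row sequence and is returned unchanged.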

select only rows that have the same id in r

ID Julian Month Year Location Distance
2 40749 July 2011 8300 39625
2 41425 May 2013 Hatchery 31325
3 40749 July 2011 6950 38625
3 41057 May 2012 Hatchery 31325
6 40735 July 2011 8300 39650
12 40743 July 2011 11025 42350
Above is the head() for the data frame I'm working with. It contains over 7,000 rows and 3,000 unique ID values. I want to delete all the rows that have only one ID value. Is this possible? Maybe the solution is in keeping only rows where the ID is repeated?
If d is your data frame, I'd use duplicated to find the rows that have recurring IDs. Calling it with both values of fromLast (FALSE and TRUE) and combining the results also catches the first and last row of each duplicated ID.
d[(duplicated(d$ID, fromLast = FALSE) | duplicated(d$ID, fromLast = TRUE)),]
This double-duplicated method has a variety of uses:
Finding ALL duplicate rows, including "elements with smaller subscripts"
How to get a subset of a dataframe which only has elements which appear in the set more than once in R
How to identify "similar" rows in R?
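As a small illustration of the double-duplicated approach, here it is on a hypothetical data frame that uses the ID values from the question's head() output:

```r
# Hypothetical frame with the IDs shown in the question
d <- data.frame(ID = c(2, 2, 3, 3, 6, 12),
                Year = c(2011, 2013, 2011, 2012, 2011, 2011))

# Scanning forward flags the later copies, scanning backward flags
# the earlier ones; together they flag every row of a repeated ID
repeated <- duplicated(d$ID, fromLast = FALSE) |
  duplicated(d$ID, fromLast = TRUE)
kept <- d[repeated, ]
kept
```

IDs 6 and 12 occur once, so their rows are dropped.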
Here is how I would do it:
new.dataframe <- c()
ids <- unique(dataframe$ID)
for(id in ids){
temp <- dataframe[dataframe$ID == id, ]
if(nrow(temp) > 1){
new.dataframe <- rbind(new.dataframe, temp)
}}
This will remove all the IDs that have only one row.

How to Find difference between two values of last two dates in R program

DF2
Date EMMI ACT NO2
2011/02/12 12345 21 11
2011/02/14 43211 22 12
2011/02/19 12345 21 13
2011/02/23 43211 13 12
2011/02/23 56341 13 12
2011/03/03 56431 18 20
I need to find the difference between the ACT values on consecutive dates for each EMMI. For example, for EMMI 12345 the dates are 2011/02/12 and 2011/02/19, so the difference is 21 - 21 = 0. I want to do this for the entire ACT column and store the result in a new column. Can anybody let me know how to do it, please?
This is the output I want:
DF3
Date EMMI ACT NO2 DifACT
2011/02/12 12345 21 11 NA
2011/02/14 43211 22 12 NA
2011/02/19 12345 21 13 0
2011/02/23 43211 13 12 -9
2011/02/23 56341 13 12 5
Try this:
DF3 <- DF2
DF3$difACT <- ave( DF3$ACT, DF3$EMMI, FUN= function(x) c(NA, diff(x)) )
This works as long as the dates are sorted within EMMI. If they are not, sort within EMMI first: I would probably sort the entire data frame by date (saving the result of order), then run the code above. If you need the original order back, call order on that saved result to "unorder" the data frame.
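The sort, compute, and "unorder" steps might be sketched as follows, using hypothetical unsorted data (the ACT values are made up so that the differences are non-zero):

```r
# Hypothetical unsorted data
DF2 <- data.frame(
  Date = as.Date(c("2011/02/19", "2011/02/12", "2011/02/14", "2011/02/23"),
                 format = "%Y/%m/%d"),
  EMMI = c(12345, 12345, 43211, 43211),
  ACT = c(21, 25, 22, 13))

ord <- order(DF2$Date)   # remember the permutation that sorts by date
sorted <- DF2[ord, ]
sorted$difACT <- ave(sorted$ACT, sorted$EMMI,
                     FUN = function(x) c(NA, diff(x)))

# order(ord) is the inverse permutation, restoring the original row order
DF3 <- sorted[order(ord), ]
DF3
```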
This is based on plyr package (not tested):
library(plyr)
# diff() returns one value fewer than its input, so pad with a leading NA
DF3 <- ddply(DF2, .(EMMI), mutate, difACT = c(NA, diff(ACT)))
