Binding rows from a list with meaningful duplicates in R

I need to merge the data frames in a list by row while keeping the information contained in duplicate rows. Each row holds a daily observation of some variables (stock prices), and each data frame covers a different time span (years). The variables can change from one data frame to the next (the columns are the stocks in the index). bind_rows from dplyr does a good job of adding columns for the new variables and leaving NAs elsewhere.
The problem is that some of the data frames contain the last day of the previous period (which has therefore already been bound from the previous data frame), but with slightly different variables (columns). I don't want to drop one of the duplicate rows, because both contain information I need; I would rather merge them. The duplicate rows contain either the same value (because they refer to the same day) or one NA and one value (because they refer to different variables in the set). How can I do this?
The problem can be summarized in the following example:
library(dplyr)
df_1 <- data.frame(Date=c(1:4),A=c(20,30,20,30),B=c(15,16,15,16))
df_2 <- data.frame(Date=c(4:7),A=c(30,35,60,40),C=c(15,18,25,20))
dfs<-list(df_1,df_2)
bind_rows(dfs)
Outcome:
Date A B C
1 1 20 15 NA
2 2 30 16 NA
3 3 20 15 NA
4 4 30 16 NA
5 4 30 NA 15
6 5 35 NA 18
7 6 60 NA 25
8 7 40 NA 20
Desired outcome:
Date A B C
1 1 20 15 NA
2 2 30 16 NA
3 3 20 15 NA
4 4 30 16 15
5 5 35 NA 18
6 6 60 NA 25
7 7 40 NA 20

Instead of binding rows you can do a full join by Date and A column.
library(dplyr)
full_join(df_1, df_2, by = c('Date', 'A'))
# Thanks to @duckmayr for the suggestion.
# Date A B C
#1 1 20 15 NA
#2 2 30 16 NA
#3 3 20 15 NA
#4 4 30 16 15
#5 5 35 NA 18
#6 6 60 NA 25
#7 7 40 NA 20
In base R, this can be done as:
merge(df_1, df_2, by = c('Date', 'A'), all = TRUE)
If the data frames are in a list, we can use purrr::reduce:
purrr::reduce(dfs, full_join, by = c('Date', 'A'))
Or, in base R, Reduce:
Reduce(function(x, y) merge(x, y, by = c('Date', 'A'), all = TRUE), dfs)
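As a self-contained check of the approach above, here is a minimal base R sketch using the example data from the question. The helper name join_shared is made up for illustration. It assumes that the columns shared between frames (here Date and A) hold identical values on overlapping dates; if they differ, the join keeps both rows instead of merging them.

```r
df_1 <- data.frame(Date = 1:4, A = c(20, 30, 20, 30), B = c(15, 16, 15, 16))
df_2 <- data.frame(Date = 4:7, A = c(30, 35, 60, 40), C = c(15, 18, 25, 20))
dfs <- list(df_1, df_2)

# Hypothetical helper: full outer join on every column the two frames share
join_shared <- function(x, y) {
  merge(x, y, by = intersect(names(x), names(y)), all = TRUE)
}

# Fold the list into one data frame, then restore chronological order
merged <- Reduce(join_shared, dfs)
merged <- merged[order(merged$Date), ]
rownames(merged) <- NULL
print(merged)
```

Joining on all shared columns avoids hard-coding c('Date', 'A'), which may matter when the set of stocks changes from one period to the next.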

Related

lag/lead entire dataframe in R

I am having a very hard time leading or lagging an entire data frame. I can shift individual columns with the following attempts, but not the whole thing:
require('DataCombine')
df_l <- slide(df, Var = var1, slideBy = -1)
Using colnames(x_ret_mon) as Var does not work; I am told the variable names are not found in the data frame.
This attempt shifts the columns right but not down:
df_l<- dplyr::lag(df)
This only creates new variables for the lagged variables, but then I do not know how to effectively delete the old, non-lagged values:
df_l<-shift(df, n=1L, fill=NA, type=c("lead"), give.names=FALSE)
Use dplyr::mutate_all to apply lags or leads to all columns.
df = data.frame(a = 1:10, b = 21:30)
# use dplyr::lag explicitly; a bare lag resolves to stats::lag unless dplyr is attached
dplyr::mutate_all(df, dplyr::lag)
a b
1 NA NA
2 1 21
3 2 22
4 3 23
5 4 24
6 5 25
7 6 26
8 7 27
9 8 28
10 9 29
I don't see the point in lagging all columns in a data.frame. Wouldn't that just correspond to rbinding an NA row to your original data.frame (minus its last row)?
df = data.frame(a = 1:10, b = 21:30)
rbind(NA, df[-nrow(df), ]);
# a b
#1 NA NA
#2 1 21
#3 2 22
#4 3 23
#5 4 24
#6 5 25
#7 6 26
#8 7 27
#9 8 28
#10 9 29
And similarly for leading all columns.
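For the leading case, a minimal sketch along the same lines (base R only; the name lead_df is just for illustration) drops the first row and appends a row of NAs:

```r
df <- data.frame(a = 1:10, b = 21:30)

# Lead every column by one step: drop the first row, pad the end with NA
lead_df <- rbind(df[-1, ], NA)
rownames(lead_df) <- NULL  # reset row names after dropping the first row
print(lead_df)
```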
A couple more options (with dplyr attached, so that lag refers to dplyr::lag rather than stats::lag):
data.frame(lapply(df, lag))
require(purrr)
map_df(df, lag)
If your data is a data.table you can do
require(data.table)
as.data.table(shift(df))
Or, if you're overwriting df
df[] <- lapply(df, lag) # Thanks Moody
require(magrittr)
df %<>% map_df(lag)

fill missing values with value from previous column

I have a data.frame with some columns with missing values, and I want the missing values to be filled in with data from the previous column. For example:
country <- c('a','b','c')
yr01 <- c(15,16,7)
yr02 <- c(NA,18,NA)
yr03 <- c(20,22,NA)
yr04 <- c(15,18,19)
tab <- data.frame(country,yr01,yr02,yr03,yr04)
tab
country yr01 yr02 yr03 yr04
1 a 15 NA 20 15
2 b 16 18 22 18
3 c 7 NA NA 19
How can I make it so that each NA is replaced by the previous value? For example, for country a, column yr02 will be equal to 15, and for country c, columns yr02 and yr03 will be 7. Thanks!
It's usually easier to work with columns, but we can apply to rows the standard answer from the R-FAQ Replace NAs with latest non-NA value.
tab[-1] = t(apply(tab[-1], 1, zoo::na.locf))
tab
# country yr01 yr02 yr03 yr04
# 1 a 15 15 20 15
# 2 b 16 18 22 18
# 3 c 7 7 7 19
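If zoo is not available, a base R sketch of the same idea is a left-to-right loop over the year columns. It assumes the year columns are adjacent and start at column 2, as in the example. A missing value in the first year column has nothing to its left and stays NA.

```r
tab <- data.frame(
  country = c("a", "b", "c"),
  yr01 = c(15, 16, 7),
  yr02 = c(NA, 18, NA),
  yr03 = c(20, 22, NA),
  yr04 = c(15, 18, 19)
)

# Walk the year columns left to right; each NA takes the value to its left.
# Because earlier columns are already filled, runs of NAs are handled too.
for (j in 3:ncol(tab)) {
  gap <- is.na(tab[[j]])
  tab[[j]][gap] <- tab[[j - 1]][gap]
}
print(tab)
```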

How to combine unknown number of data frames in R?

Data
I have a data frame df. Following is a sample:
df <- data.frame(ID = rep(c(-1,7,8), each=3), LV.vel.fps = 40:48, frames = 1:9)
There is also an unknown number of other data frames, each with the prefix "comb." followed by a number. Each of these data frames represents data for a vehicle. The following data frame contains the names of those vehicles (the numbers change with the experiment, so there are 2 vehicles now, but after another experiment there could be 9):
> ADO.names
name
1 TrucPropk
2 Truck
So, nrow(ADO.names) tells us how many data frames there are. Following are the "comb." data frames for this particular example:
comb.1 <- data.frame(frames = 4:6, ADO.name = "TrucPropk", speed.fps = 43:45)
comb.2 <- data.frame(frames = 7:9, ADO.name = "Truck", speed.fps = 46:48)
Also, these data frames could have different numbers of rows.
What I want to do
The "ID" variable in df contains the IDs of the vehicles in the "comb." data frames. -1 means no vehicle. The IDs are not available in the "comb." data frames but I want to add a new column "final.name" in df that contains the name of the vehicle for a given ID. This can be done by matching "speed.fps" from "comb." to "LV.vel.fps" in df because both are speeds in feet per second.
Therefore, the final output should look like this:
> df
ID LV.vel.fps frames final.name
1 -1 40 1 NA
2 -1 41 2 NA
3 -1 42 3 NA
4 7 43 4 TrucPropk
5 7 44 5 TrucPropk
6 7 45 6 TrucPropk
7 8 46 7 Truck
8 8 47 8 Truck
9 8 48 9 Truck
Problems
For these sample data frames, I could do the following to join the data frames:
library(dplyr)
df <- df %>%
left_join(x = ., y = comb.1, by = "frames") %>%
left_join(x = ., y = comb.2, by = "frames")
And ifelse for "final.name":
df$final.name <- ifelse(df$speed.fps.x==df$LV.vel.fps,
df$ADO.name.x,
ifelse(df$speed.fps.y==df$LV.vel.fps,
df$ADO.name.y, "NA"))
But the output I get is wrong:
> df
ID LV.vel.fps frames final.name ADO.name.x speed.fps.x ADO.name.y speed.fps.y
1 -1 40 1 NA <NA> NA <NA> NA
2 -1 41 2 NA <NA> NA <NA> NA
3 -1 42 3 NA <NA> NA <NA> NA
4 7 43 4 1 TrucPropk 43 <NA> NA
5 7 44 5 1 TrucPropk 44 <NA> NA
6 7 45 6 1 TrucPropk 45 <NA> NA
7 8 46 7 NA <NA> NA Truck 46
8 8 47 8 NA <NA> NA Truck 47
9 8 48 9 NA <NA> NA Truck 48
Questions
So basically I have 2 questions:
1) How do I write code so that all "comb." data frames are joined with df regardless of number of those data frames? I knew there were 2 in this case so manually wrote "comb.1" and "comb.2" But the code should be robust for any number of data frames.
2) Why is my ifelse statement not generating correct output? How could I write robust code for this case as well?
We can get the data.frame objects whose names start with comb in a list using mget, rbind the list elements, and then merge with the dataset 'df'.
res <- merge(df, do.call(rbind,
mget(ls(pattern='^comb\\.\\d+')))[1:2], by='frames', all.x=TRUE)
colnames(res)[4] <- 'final.name'
res
# frames ID LV.vel.fps final.name
#1 1 -1 40 <NA>
#2 2 -1 41 <NA>
#3 3 -1 42 <NA>
#4 4 7 43 TrucPropk
#5 5 7 44 TrucPropk
#6 6 7 45 TrucPropk
#7 7 8 46 Truck
#8 8 8 47 Truck
#9 9 8 48 Truck
EDIT: As the OP mentioned matching the 'speed' columns, we can include that in the merge as well:
res <- merge(df,
do.call(rbind,mget(ls(pattern='^comb\\.\\d+'))),
by.x=c('frames', 'LV.vel.fps'), by.y= c('frames', 'speed.fps'),
all.x=TRUE)
colnames(res)[4] <- 'final.name'
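Putting the pieces together, here is a self-contained sketch of the second variant run against the sample data. It assumes the comb.* objects live in the current environment. ADO.name is converted to character first, so the result holds names rather than factor codes on older R versions, where data.frame() creates factors by default. The intermediate name comb_all is illustrative.

```r
df <- data.frame(ID = rep(c(-1, 7, 8), each = 3), LV.vel.fps = 40:48, frames = 1:9)
comb.1 <- data.frame(frames = 4:6, ADO.name = "TrucPropk", speed.fps = 43:45)
comb.2 <- data.frame(frames = 7:9, ADO.name = "Truck", speed.fps = 46:48)

# Collect every object named comb.<number>, however many there are, and stack them
comb_all <- do.call(rbind, mget(ls(pattern = "^comb\\.\\d+$")))
comb_all$ADO.name <- as.character(comb_all$ADO.name)

# Left join on frame number and speed, so unmatched rows of df get NA
res <- merge(df, comb_all,
             by.x = c("frames", "LV.vel.fps"),
             by.y = c("frames", "speed.fps"),
             all.x = TRUE)
names(res)[names(res) == "ADO.name"] <- "final.name"
res <- res[order(res$frames), c("ID", "LV.vel.fps", "frames", "final.name")]
print(res)
```

Renaming by column name instead of by position keeps the code working if the column order of the merge result changes.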

How to merge tables and fill the empty cells at the same time in R?

Assume there are two tables a and b.
Table a:
ID AGE
1 20
2 empty
3 40
4 empty
Table b:
ID AGE
2 25
4 45
5 60
How to merge the two tables in R so that the resulting table becomes:
ID AGE
1 20
2 25
3 40
4 45
You could try
library(data.table)
setkey(setDT(a), ID)[b, AGE:= i.AGE][]
# ID AGE
#1: 1 20
#2: 2 25
#3: 3 40
#4: 4 45
data
a <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
b <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
Assuming you have NA in every position in the first table where you want to use the second table's age numbers, you can use rbind and na.omit.
Example
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
na.omit(rbind(x,y))
This results in what you're after (although unordered, and I assume you just forgot ID 5):
ID AGE
1 20
3 40
2 25
4 45
5 60
EDIT
If you want to merge two different data.frames and keep the columns, it's a different thing. You can use merge to achieve this.
Here are two data frames with different columns:
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA), COUNTY=c(1,2,3,4))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60), STATE=c('CA','CA','IL'))
Add them together into one data.frame
res <- merge(x, y, by='ID', all=TRUE)
giving us
ID AGE.x COUNTY AGE.y STATE
1 20 1 NA <NA>
2 NA 2 25 CA
3 40 3 NA <NA>
4 NA 4 45 CA
5 NA NA 60 IL
Then massage it into the form we want
idx <- which(is.na(res$AGE.x)) # find missing rows in x
res$AGE.x[idx] <- res$AGE.y[idx] # replace them with y's values
names(res)[names(res) == 'AGE.x'] <- 'AGE' # rename merged column AGE.x to AGE (exact match)
subset(res, select=-AGE.y) # dump the AGE.y column
Which gives us
ID AGE COUNTY STATE
1 20 1 <NA>
2 25 2 CA
3 40 3 <NA>
4 45 4 CA
5 60 NA IL
The package in the other answer will work. Here is a dirty hack if you don't want to use the package:
x$AGE[is.na(x$AGE)] <- y$AGE[y$ID %in% x$ID]
> x
ID AGE
1 1 20
2 2 25
3 3 40
4 4 45
But, I would use the package to avoid the clunky code.
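A somewhat safer base R variant of that idea, sketched here under the assumption that ID is unique in b, looks up each missing age by ID with match() instead of relying on the rows lining up. IDs that are absent from b simply stay NA.

```r
a <- data.frame(ID = c(1, 2, 3, 4), AGE = c(20, NA, 40, NA))
b <- data.frame(ID = c(2, 4, 5), AGE = c(25, 45, 60))

# For each row of a with a missing AGE, find the row of b with the same ID
missing_age <- is.na(a$AGE)
a$AGE[missing_age] <- b$AGE[match(a$ID[missing_age], b$ID)]
print(a)
```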

Skip NA values using "FUN=first"

There's probably a really simple explanation as to what I'm doing wrong, but I've been working on this for quite some time today and I still cannot get it to work. I thought this would be a walk in the park, but my code isn't working as expected.
For this example, let's say I have a data frame as follows.
df
Row# user columnB
1 1 NA
2 1 NA
3 1 NA
4 1 31
5 2 NA
6 2 NA
7 2 15
8 3 18
9 3 16
10 3 NA
Basically, I would like to create a new column that uses the first (as well as last) function (within the TTR library package) to obtain the first non-NA value for each user. So my desired data frame would be this.
df
Row# user columnB firstValue
1 1 NA 31
2 1 NA 31
3 1 NA 31
4 1 31 31
5 2 NA 15
6 2 NA 15
7 2 15 15
8 3 18 18
9 3 16 18
10 3 NA 18
I've looked around mainly using google, but I couldn't really find my exact answer.
Here's some of my code that I've tried, but I didn't get the results that I wanted (note, I'm bringing this from memory, so there are quite a few more variations of these, but these are the general forms that I've been trying).
df$firstValue<-ave(df$columnB,df$user,FUN=first,na.rm=True)
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){x,first,na.rm=True})
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){first(x,na.rm=True)})
df$firstValue<-by(df,df$user,FUN=function(x){x,first,na.rm=True})
These failed: they just give the first value of each group, which is NA.
Again, these are just a few examples from the top of my head, I played around with na.rm, using na.exclude, na.omit, na.action(na.omit), etc...
Any help would be greatly appreciated. Thanks.
A data.table solution
require(data.table)
DT <- data.table(df, key="user")
DT[, firstValue := na.omit(columnB)[1], by=user]
Here is a solution with plyr :
ddply(df, .(user), transform, firstValue=na.omit(columnB)[1])
Which gives :
Row user columnB firstValue
1 1 1 NA 31
2 2 1 NA 31
3 3 1 NA 31
4 4 1 31 31
5 5 2 NA 15
6 6 2 NA 15
7 7 2 15 15
8 8 3 18 18
9 9 3 16 18
10 10 3 NA 18
If you want to capture the last value, you can do:
ddply(df, .(user), transform, lastValue=tail(na.omit(columnB),1))
Using data.table
library (data.table)
DT <- data.table(df, key="user")
DT <- setnames(DT[unique(DT[!is.na(columnB), list(columnB), by="user"])], "columnB.1", "first")
Using a very small helper function
finite <- function(x) x[is.finite(x)]
here is a one-liner using only standard R functions:
df <- cbind(df, firstValue = unlist(sapply(unique(df[,1]), function(user) rep(finite(df[df[,1] == user,2])[1], sum(df[,1] == user))))
For a better overview, here is the one-liner unfolded into a "multi-liner":
# for each user, find the first finite (in this case non-NA) value of the second column and replicate it as many times as the user has rows
# then, the results of all users are joined into one vector (unlist) and appended to the data frame as column
df <- cbind(
df,
firstValue = unlist(
sapply(
unique(df[,1]),
function(user) {
rep(
finite(df[df[,1] == user,2])[1],
sum(df[,1] == user)
)
}
)
)
)
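Since the question started from ave(), here is a minimal base R sketch along those lines. The helpers first_non_na and last_non_na and the lastValue column are illustrative names, and the data frame is rebuilt from the question's table.

```r
df <- data.frame(
  user    = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  columnB = c(NA, NA, NA, 31, NA, NA, 15, 18, 16, NA)
)

# Hypothetical helpers: first/last non-NA element, or NA if the group has none
first_non_na <- function(x) x[!is.na(x)][1]
last_non_na  <- function(x) rev(x[!is.na(x)])[1]

# ave() applies FUN within each user and puts the results back in row order
df$firstValue <- ave(df$columnB, df$user,
                     FUN = function(x) rep(first_non_na(x), length(x)))
df$lastValue  <- ave(df$columnB, df$user,
                     FUN = function(x) rep(last_non_na(x), length(x)))
print(df)
```

The NA handling lives inside the helper, so no na.rm argument needs to be passed through ave().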
