Padding missing rows between two data frames in R

I have two large data frames, A (N1 by 6) and B (N2 by 2). The first two columns of A are the keys for matching against B, and the keys in A are a subset of the keys in B.
What I want to do is pad A with the keys that are in B but not in A, filling the other 4 columns with NA so that they can be imputed later.
A
1 2 3 4 5 6
1 3 4 5 6 7
B
1 2
1 3
1 4
My new A
1 2 3 4 5 6
1 3 4 5 6 7
1 4 NA NA NA NA
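I came up with something like this:
# base setdiff() does not compare data frames row by row; dplyr::setdiff() does
rowDiff <- dplyr::setdiff(B, A[, 1:2])   # keys in B that are missing from A
pad <- cbind(rowDiff, matrix(NA, nrow = nrow(rowDiff), ncol = 4))
names(pad) <- names(A)                    # rbind() matches data frame columns by name
A <- rbind(A, pad)
Is there a more efficient solution? Thanks.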

Would this work?
merge(B, A, all.x=TRUE)
It tests OK:
> A <- read.table(text="1 2 3 4 5 6
+ 1 3 4 5 6 7")
>
> B <- read.table(text="1 2
+ 1 3
+ 1 4")
> merge(B, A, all.x=TRUE)
V1 V2 V3 V4 V5 V6
1 1 2 3 4 5 6
2 1 3 4 5 6 7
3 1 4 NA NA NA NA
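If the key columns are named differently in the two data frames, the same left join can be written with explicit by.x/by.y arguments. The following is a minimal sketch; the column names k1, k2, key1, key2, x and y are made up for illustration and do not come from the question.

```r
# Hypothetical data: A holds values for some keys, B lists every key.
A <- data.frame(k1 = c(1, 1), k2 = c(2, 3), x = c(3, 4), y = c(4, 5))
B <- data.frame(key1 = c(1, 1, 1), key2 = c(2, 3, 4))

# Left join: keep every row of B and fill A's value columns with NA
# wherever A has no row for that key pair.
padded <- merge(B, A,
                by.x = c("key1", "key2"),
                by.y = c("k1", "k2"),
                all.x = TRUE)
padded
```

Note that merge() sorts the result by the key columns by default, so the original row order of A is not necessarily preserved.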

Related

Create another data set if elements are the same

I have two data sets A and B (shown below) and want to create a third data set, C, based on this condition: where the elements of A and B are the same (matched), C takes that value; where they do not match, the element should be NA (missing).
A
2 5 9 3
5 3 2 1
2 1 1 3
B
2 7 9 3
5 3 6 1
2 2 2 3
The expected C should look like:
2 NA 9 3
5 3 NA 1
2 NA NA 3
Both data sets have the same dimensions. Any suggestions, please?
`is.na<-`(A,!A==B)
V1 V2 V3 V4
1 2 NA 9 3
2 5 3 NA 1
3 2 NA NA 3
This should work for both data frame and matrix.
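The backtick call above is the functional form of the replacement function `is.na<-`. Written as an ordinary assignment it may be easier to read. This sketch uses the A and B from the question and should give the same result.

```r
A <- read.table(text = "2 5 9 3
5 3 2 1
2 1 1 3")
B <- read.table(text = "2 7 9 3
5 3 6 1
2 2 2 3")

C <- A
# Mark as NA every cell of C where A and B disagree.
is.na(C) <- A != B
C
```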
If A and B are data frames:
C <- A
C[C != B] <- NA
C
# V1 V2 V3 V4
# 1 2 NA 9 3
# 2 5 3 NA 1
# 3 2 NA NA 3
If A and B are matrices:
A <- as.matrix(A)
B <- as.matrix(B)
C <- A
C[C != B] <- NA
C
# V1 V2 V3 V4
# [1,] 2 NA 9 3
# [2,] 5 3 NA 1
# [3,] 2 NA NA 3
DATA
A <- read.table(text = "2 5 9 3
5 3 2 1
2 1 1 3",
header = FALSE)
B <- read.table(text = "2 7 9 3
5 3 6 1
2 2 2 3",
header = FALSE)

Replace NA with average of the case before and after the NA, unless row starts or ends with NA [duplicate]

Say I have a data.frame:
t<-c(1,1,2,4,NA,3)
u<-c(1,3,4,6,4,2)
v<-c(2,3,4,NA,3,2)
w<-c(2,3,4,5,2,3)
x<-c(2,3,4,5,6,NA)
df<-data.frame(t,u,v,w,x)
df
t u v w x
1 1 1 2 2 2
2 1 3 3 3 3
3 2 4 4 4 4
4 4 6 NA 5 5
5 NA 4 3 2 6
6 3 2 2 3 NA
I would like to replace each NA with the average of the value before it and the value after it in the same row. However, if a row starts with an NA, it should be replaced by the value that follows it, and if a row ends with an NA, it should be replaced by the value before it. Thus, I would like to get the following result:
t u v w x
1 1 1 2 2 2
2 1 3 3 3 3
3 2 4 4 4 4
4 4 6 5.5 5 5 --> NA becomes average of 6 and 5
5 4 4 3 2 6 --> NA becomes value of next case
6 3 2 2 3 3 --> NA becomes value of previous case
I have thousands of rows, so any help is very much appreciated!
Based on previous na.approx solutions this might do the trick:
library(zoo)
t(apply(df, 1,function(x) na.approx(x,rule=2)))
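For readers without zoo, roughly the same row-wise interpolation can be sketched with approx() from base R's stats package. This is an alternative, not the answer's method, and fill_row is a hypothetical helper name. With rule = 2, positions outside the observed range take the nearest observed value, which covers rows that start or end with NA.

```r
df <- data.frame(t = c(1, 1, 2, 4, NA, 3),
                 u = c(1, 3, 4, 6, 4, 2),
                 v = c(2, 3, 4, NA, 3, 2),
                 w = c(2, 3, 4, 5, 2, 3),
                 x = c(2, 3, 4, 5, 6, NA))

# Hypothetical helper: linearly interpolate NAs within one row.
# approx() drops NA pairs before interpolating; rule = 2 extends the
# first/last observed value to the ends of the row.
fill_row <- function(x) {
  idx <- seq_along(x)
  approx(idx, x, xout = idx, rule = 2)$y
}

filled <- t(apply(df, 1, fill_row))
colnames(filled) <- names(df)
filled
```

Two caveats of this sketch: a run of several consecutive NAs is interpolated linearly rather than filled with a single average, and approx() needs at least two non-NA values in each row.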
Always check whether the functions you use have an na.rm = TRUE parameter.
In this case you want to take the mean of one of the columns with na.rm set to TRUE.
Then you want to substitute the NAs.
df[is.na(df[,'t']),'t'] = 0
(assuming that I did not reverse the order of dimensions)
Here is a possible solution:
if a value is NA, replace it with (lag + lead) / 2; if it is still NA, replace it with lead; if it is still NA, replace it with lag.
library(dplyr)
t(apply(df, 1, function(x){
  lagx = dplyr::lag(x)
  leadx = dplyr::lead(x)
  b = ifelse(is.na(x),(leadx+lagx)/2, x)
  b = ifelse(is.na(b), leadx, b)
  b = ifelse(is.na(b), lagx, b)
  return(b)
}
))
#output
t u v w x
[1,] 1 1 2.0 2 2
[2,] 1 3 3.0 3 3
[3,] 2 4 4.0 4 4
[4,] 4 6 5.5 5 5
[5,] 4 4 3.0 2 6
[6,] 3 2 2.0 3 3
t<-c(1,1,2,4,NA,3)
u<-c(1,3,4,6,4,2)
v<-c(2,3,4,NA,3,2)
w<-c(2,3,4,5,2,3)
x<-c(2,3,4,5,6,NA)
df<-data.frame(t,u,v,w,x)
df[which(is.na(t)), "t"] <- df[which(is.na(t)), "u"]
df[which(is.na(x)), "x"] <- df[which(is.na(x)), "w"]
df[which(is.na(v)), "v"] <- (df[which(is.na(v)), "u"] + df[which(is.na(v)), "w"])/2
> df
t u v w x
1 1 1 2.0 2 2
2 1 3 3.0 3 3
3 2 4 4.0 4 4
4 4 6 5.5 5 5
5 4 4 3.0 2 6
6 3 2 2.0 3 3

R: Updating a data frame with another data frame

Let's say our initial data frame looks like this:
df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,NA,NA,NA),C=c(1,2,3,NA,NA,NA))
> df1
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
Another data frame contains new information for col B and C
df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5))
> df2
Index B C
1 4 4 5
2 5 4 5
3 6 4 5
How can you update the missing values in df1 so it looks like this:
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 5
5 5 5 4 5
6 6 6 4 5
My attempt:
library(dplyr)
> full_join(df1,df2)
Joining by: c("Index", "B", "C")
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
7 4 NA 4 5
8 5 NA 4 5
9 6 NA 4 5
Which as you can see has created duplicate rows for the 4,5,6 index instead of replacing the NA values.
Any help would be greatly appreciated!
merge then aggregate:
aggregate(. ~ Index, data=merge(df1, df2, all=TRUE), na.omit, na.action=na.pass )
# Index B C A
#1 1 1 1 1
#2 2 2 2 2
#3 3 3 3 3
#4 4 4 5 4
#5 5 4 5 5
#6 6 4 5 6
Or in dplyr speak:
df1 %>%
full_join(df2) %>%
group_by(Index) %>%
summarise_each(funs(na.omit))
#Joining by: c("Index", "B", "C")
#Source: local data frame [6 x 4]
#
# Index A B C
# (dbl) (int) (dbl) (dbl)
#1 1 1 1 1
#2 2 2 2 2
#3 3 3 3 3
#4 4 4 4 5
#5 5 5 4 5
#6 6 6 4 5
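A base-R variant of the merge idea that joins on Index only and then fills the gaps column by column may also be worth considering. This is a sketch under the assumption that df2 has at most one row per Index; the ".new" suffix is an arbitrary choice for illustration.

```r
df1 <- data.frame(Index = 1:6, A = 1:6,
                  B = c(1, 2, 3, NA, NA, NA),
                  C = c(1, 2, 3, NA, NA, NA))
df2 <- data.frame(Index = c(4, 5, 6), B = c(4, 4, 4), C = c(5, 5, 5))

# Left join on Index only; df2's B and C arrive as B.new and C.new.
m <- merge(df1, df2, by = "Index", all.x = TRUE, suffixes = c("", ".new"))

# Fill NAs in each shared column from its ".new" counterpart,
# then drop the helper column.
for (col in c("B", "C")) {
  new_col <- paste0(col, ".new")
  m[[col]] <- ifelse(is.na(m[[col]]), m[[new_col]], m[[col]])
  m[[new_col]] <- NULL
}
m
```

Because the join key is Index alone, no duplicate rows are created, so no aggregation step is needed afterwards.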
We can use a join from data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), join with 'df2' on "Index", and assign (:=) the values of 'i.B' and 'i.C' to 'B' and 'C'.
library(data.table)
setDT(df1)[df2, c('B', 'C') := .(i.B, i.C), on = "Index"]
df1
# Index A B C
#1: 1 1 1 1
#2: 2 2 2 2
#3: 3 3 3 3
#4: 4 4 4 5
#5: 5 5 4 5
#6: 6 6 4 5
For those interested, I've extended this problem to:
- handle updating a data frame with another data frame that has new columns
- replace any existing entries regardless of whether they're NA or not.
Here's the solution I found using the aggregate function from @thelatemail :)
df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,3,3,3),C=c(1,2,3,3,3,3))
df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5),D=c(6,6,6),E=c(7,7,7))
library(dplyr)  # provides full_join() and last()
df3 = full_join(df1,df2)
# Create a function na.omit.last: drop NAs, then keep the last remaining value
na.omit.last = function(x){
  x <- na.omit(x)
  last(x)
}
# For the columns not in df1
dfA = aggregate(. ~ Index, df3, na.omit,na.action = na.pass)
dfA = dfA[,-(1:ncol(df1))]
dfA = data.frame(lapply(dfA,as.numeric))
dfB = aggregate(. ~ Index, df3[,1:ncol(df1)], na.omit.last, na.action = na.pass)
# If there are more columns in df2 append dfA
if (ncol(df2) > ncol(df1)) {
df3 = cbind(dfB,dfA)
} else {
df3 = dfB
}
print(df3)
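Not sure what the general case or conditions would be, but this works for this instance without dplyr:
df3 <- as.matrix(df1)
# Only B and C are used: the NAs in df1 are filled in column order, so df2's Index column must be left out
df3[which(is.na(df3))] <- as.matrix(df2[, c("B", "C")])
df3 <- as.data.frame(df3)
df3
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 5
5 5 5 4 5
6 6 6 4 5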
As of dplyr >= 1.0.0 you can use rows_update:
library(dplyr)
df1 %>%
rows_update(df2, by = "Index")
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 5
5 5 5 4 5
6 6 6 4 5
Alternatively, there is rows_patch:
rows_patch() works like rows_update() but only overwrites NA values.
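For comparison, the "only overwrite NA" behaviour can also be sketched in base R with match(). This illustrates the idea and is not how dplyr implements rows_patch(); the name patched is hypothetical, and the sketch assumes every Index in df2 also occurs in df1.

```r
df1 <- data.frame(Index = 1:6, A = 1:6,
                  B = c(1, 2, 3, NA, NA, NA),
                  C = c(1, 2, 3, NA, NA, NA))
df2 <- data.frame(Index = c(4, 5, 6), B = c(4, 4, 4), C = c(5, 5, 5))

patched <- df1
# Row positions in df1 that correspond to each row of df2.
idx <- match(df2$Index, patched$Index)

for (col in setdiff(names(df2), "Index")) {
  # Only touch cells that are currently NA in the target.
  missing <- is.na(patched[idx, col])
  patched[idx[missing], col] <- df2[missing, col]
}
patched
```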

How to only keep the columns with same names between two data frames?

I have two data frames like the following:
a<-c(1,3,4,5,6,8)
b<-c(2,3,4,2,6,7)
c<-c(2,5,6,3,5,6)
df1<-data.frame(a,b,c)
d<-c(3,4,5,6,7,8)
e<-c(1,2,3,2,1,1)
c<-c(1,3,4,5,6,2)
df2<-data.frame(d,e,c)
> df1
a b c
1 1 2 2
2 3 3 5
3 4 4 6
4 5 2 3
5 6 6 5
6 8 7 6
> df2
d e c
1 3 1 1
2 4 2 3
3 5 3 4
4 6 2 5
5 7 1 6
6 8 1 2
I want to combine the two data frames and keep only the columns with the same names. The final data frame should look like this:
> df3
c1 c2
1 2 1
2 5 3
3 6 4
4 3 5
5 5 6
6 6 2
My real data frames have hundreds of columns, so I need code to do this job. Can anyone help me?
Find out which names belong to both dataframes and then bind them:
eqnames <- names(df1)[names(df1) %in% names(df2)]
df3 <- cbind(df1[eqnames], df2[eqnames])
You can then rename the columns:
names(df3) <- paste0(names(df3), 1:ncol(df3))
Resulting in:
> df3
c1 c2
1 2 1
2 5 3
3 6 4
4 3 5
5 5 6
6 6 2
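When more than one column name is shared, numbering the columns 1:ncol(df3) would produce names such as b1, c2, b3, c4. The sketch below, on made-up data, uses intersect() to find the shared names and builds the suffix from the source data frame instead; the variable name shared is hypothetical.

```r
# Hypothetical data with two shared column names, "b" and "c".
df1 <- data.frame(a = 1:3, b = 4:6, c = 7:9)
df2 <- data.frame(c = 10:12, b = 13:15, e = 16:18)

# Names present in both data frames, in df1's order.
shared <- intersect(names(df1), names(df2))

df3 <- cbind(df1[shared], df2[shared])

# Suffix 1 for columns taken from df1, 2 for columns taken from df2.
names(df3) <- paste0(rep(shared, times = 2),
                     rep(1:2, each = length(shared)))
df3
```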

Replace the rows in dataframe with condition

Hi, in relation to the question "Dynamically replace row in dataframe with vector":
I have a data.frame for example:
d <- read.table(text=' V1 V2 V3 V4 V5 V6 V7
1 1 a 2 3 4 9 6
2 1 b 2 2 4 5 NA
3 1 c 1 3 4 5 8
4 1 d 1 2 3 6 9
5 2 a 1 2 3 4 5
6 2 b 1 4 5 6 7
7 2 c 1 2 3 5 8
8 2 d 2 3 6 7 9', header=TRUE)
Now I want to take one row, for example the first one (1a) and:
Get the min and max value from that row. In this case min=2 and max=9 (note there are missing values in between for example there is no 5, 7, or 8 in that row).
Now I want to replace that row with the full sequence, including the missing values, which extends it: the row will be longer than all the others, as it runs from 2 to 9 (2,3,4,5,6,7,8,9). The whole data.frame should then be automatically extended with NA columns for the other rows that are not as long as the one I replaced.
Now the following code does achieve this:
row.to.change <- 1
(new.row <- seq(min(d[row.to.change,c(-1, -2)], na.rm=TRUE), max(d[row.to.change,c(-1,-2)], na.rm=TRUE)))
(num.add <- length(new.row) - ncol(d) + 2)
# [1] 3
if (num.add > 0) {
  d <- cbind(d, replicate(num.add, rep(NA, nrow(d))))
} else if (num.add <= 0) {
  new.row <- c(new.row, rep(NA, -num.add))
}
and finally fills in the row and renames the extended data.frame headers to the default ones:
d[row.to.change,c(-1, -2)] <- new.row
colnames(d) <- paste0("V", seq_len(ncol(d)))
Now: this works for the row that I specify in row.to.change, but how can I make it work for, say, all rows that have a 'b' in the second column? Something like "do this where d$V2 == 'b'", in case the data.frame is 5000 rows long.
You have already solved it. Just make a function and then apply it to each row of your data.
rtc=function(row.to.change){# <- 1
  (new.row <- seq(min(d[row.to.change,c(-1, -2)], na.rm=TRUE), max(d[row.to.change,c(-1,-2)], na.rm=TRUE)))
  (num.add <- length(new.row) - ncol(d) + 2)
  # [1] 3
  if (num.add <= 0) {
    new.row <- c(new.row, rep(NA, -num.add))
  }
  new.row
}
#d2=d
newr=lapply(1:nrow(d),rtc) # for the whole data
# for specific condition, like lines with "b" in V2 change to:
# newr=lapply(1:nrow(d),function(z)if(d$V2[z]=="b")rtc(z) else as.numeric(d[z,c(-1, -2)]))
mxl=max(sapply(newr,length))
newr=lapply(newr,function(z)if(length(z)<mxl)c(z,rep(NA,mxl-length(z))) else z)
if (ncol(d)-2 < mxl) {
  d <- cbind(d, replicate(mxl-ncol(d)+2, rep(NA, nrow(d))))
}
d[,c(-1, -2)] <- do.call(rbind,newr)
colnames(d) <- paste0("V", seq_len(ncol(d)))
d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 1 a 2 3 4 5 6 7 8 9 NA
2 1 b 2 3 4 5 NA NA NA NA NA
3 1 c 1 2 3 4 5 6 7 8 NA
4 1 d 1 2 3 4 5 6 7 8 9
5 2 a 1 2 3 4 5 NA NA NA NA
6 2 b 1 2 3 4 5 6 7 NA NA
7 2 c 1 2 3 4 5 6 7 8 NA
8 2 d 2 3 4 5 6 7 8 9 NA
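The conditional case from the question (only rows with 'b' in V2) can also be sketched as one self-contained function. This is an illustrative rewrite, not the answer's code: expand_rows and its id_cols argument are hypothetical names, and the sketch assumes the first two columns are identifiers and all remaining columns are numeric.

```r
d <- read.table(text = "V1 V2 V3 V4 V5 V6 V7
1 a 2 3 4 9 6
1 b 2 2 4 5 NA
1 c 1 3 4 5 8
1 d 1 2 3 6 9
2 a 1 2 3 4 5
2 b 1 4 5 6 7
2 c 1 2 3 5 8
2 d 2 3 6 7 9", header = TRUE)

# Hypothetical helper: replace the selected rows with the full min:max
# sequence of their values and pad every row with NA to a common width.
expand_rows <- function(d, rows, id_cols = 1:2) {
  vals <- as.matrix(d[, -id_cols])
  new_rows <- lapply(seq_len(nrow(d)), function(i) {
    x <- unname(vals[i, ])
    if (i %in% rows) seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE)) else x
  })
  width <- max(lengths(new_rows))
  padded <- do.call(rbind, lapply(new_rows, function(x) {
    c(as.numeric(x), rep(NA_real_, width - length(x)))
  }))
  out <- cbind(d[, id_cols], as.data.frame(padded))
  colnames(out) <- paste0("V", seq_len(ncol(out)))
  out
}

res <- expand_rows(d, which(d$V2 == "b"))
res
```

Rows that are not selected keep their original values and are only padded with NA on the right.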
