How to remove rows with NAs from two dataframes based on NAs from one? - r

I am trying to remove from df2 the same rows that contain NA in df1,
e.g.
df1
A
1 1
2 NA
3 7
4 NA
df2
A B C D
1 2 4 7 10
2 3 6 1 3
3 9 5 1 3
4 4 9 2 5
Intended outcome:
df1
A
1 1
3 7
df2
A B C D
1 2 4 7 10
3 9 5 1 3
I have already tried things along the lines of...
newdf <- df2[-which(rowSums(is.na(df1))),]
and
noNA <- function(x) { x[!rowSums(!is.na(df1)) == 1]}
NMR_6mos_noNA <- as.data.frame(lapply(df2, noNA))
or
noNA <- function(x) { x[,!is.na(df1)]}
newdf3 <- as.data.frame(lapply(df2, noNA))

We can use is.na to create a logical index from df1$A and use it to subset the rows of both 'df1' and 'df2'. drop = FALSE keeps the single-column 'df1' a data frame.
i1 <- !is.na(df1$A)
df1[i1, , drop = FALSE]
# A
#1 1
#3 7
df2[i1,]
# A B C D
# 1 2 4 7 10
#3 9 5 1 3
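If df1 has more than one column, the same idea can be extended with complete.cases(), which flags the rows that contain no NA in any column. A minimal sketch, assuming both data frames have the same number of rows in the same order (column B in df1 is made up here to show the multi-column case):

```r
# Example data; column B in df1 is hypothetical, added for illustration
df1 <- data.frame(A = c(1, NA, 7, NA), B = c(5, 6, NA, 8))
df2 <- data.frame(A = c(2, 3, 9, 4), B = c(4, 6, 5, 9),
                  C = c(7, 1, 1, 2), D = c(10, 3, 3, 5))

# TRUE for the rows of df1 that have no NA in any column
keep <- complete.cases(df1)

# Apply the same logical index to both data frames
df1_clean <- df1[keep, , drop = FALSE]
df2_clean <- df2[keep, , drop = FALSE]
```

Logical indexing is used here rather than -which(...), because negative indexing with an empty which() result would drop every row instead of none.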

Related

How to assign a value to a column based on a column index

Given a data frame, I would like to assign a calculated value to a column selected by a given column index.
df <- data.frame(a = c(2,4,7,3,5,3), b = c(8,3,8,2,6,1))
> df
a b
1 2 8
2 4 3
3 7 8
4 3 2
5 5 6
6 3 1
max <- apply(df, 1, which.max)
> max
[1] 2 1 2 1 2 1
addition <- apply(df, 1, sum)
> addition
[1] 10 7 15 5 11 4
Then some operation which I cannot figure out with the following result being assigned to df2
> df2
a b
1 2 10
2 7 3
3 7 15
4 5 2
5 5 11
6 4 1
highly appreciate your ideas and your help. Thank you
You can use cbind to access your selected columns for each row:
df2 = df
df2[cbind(1:nrow(df2),max)] = addition
df2
a b
1 2 10
2 7 3
3 7 15
4 5 2
5 5 11
6 4 1
Here, cbind returns a matrix of 2 columns and 6 rows that we use to subset the dataframe using matrix subsetting.
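To see what the index matrix does, here is a small self-contained sketch of the same idea. It goes through as.matrix() so that the cell-wise indexing is explicit; the names col_idx, row_sum and idx are illustrative and were chosen to avoid masking base::max:

```r
df <- data.frame(a = c(2, 4, 7, 3, 5, 3), b = c(8, 3, 8, 2, 6, 1))

col_idx <- apply(df, 1, which.max)   # column that holds the row maximum
row_sum <- rowSums(df)               # value to write into that column

# Two-column matrix: one (row, column) pair per cell to overwrite
idx <- cbind(seq_len(nrow(df)), col_idx)

m <- as.matrix(df)
m[idx] <- row_sum                    # one assignment per (row, column) pair
df2 <- as.data.frame(m)
```

When two values in a row are tied, which.max returns the first column, so the sum lands in a.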
You can also use vectorised ifelse directly:
with(df, cbind.data.frame(a = ifelse(a > b, a + b, a), b = ifelse(a > b, b, a + b)));
# a b
#1 2 10
#2 7 3
#3 7 15
#4 5 2
#5 5 11
#6 4 1

Insert NA-rows in data frame according to rownames of other data frame

I have 2 data frames with different rownames, e.g.:
df1 <- data.frame(A = c(1,3,7,1,5), B = c(5,2,9,5,5), C = c(1,1,3,4,5))
df2 <- data.frame(A = c(4,3,2), B = c(4,4,9), C = c(3,9,3))
rownames(df2) <- c(1, 3, 6)
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
> df2
A B C
1 4 4 3
3 3 4 9
6 2 9 3
I need to insert NA-rows in both data frames for each row that exists in only one of the data frames. In the given example:
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
6 NA NA NA
> df2
A B C
1 4 4 3
2 NA NA NA
3 3 4 9
4 NA NA NA
5 NA NA NA
6 2 9 3
I will have to perform this operation many times with different data frames, so I need an automated way to do this. I tried to solve the issue with various if/else constructs and loops, but I am sure there must be a cleaner way.
We can use union to get the combined row names, create an all-NA data frame with those row names, and then fill in the rows whose names are found (via %in%) in the original dataset
un1 <- union(rownames(df1), rownames(df2))
d1 <- as.data.frame(matrix(NA, ncol = ncol(df1),
nrow = length(un1), dimnames = list(un1, names(df1))))
d2 <- d1
d1[rownames(d1) %in% rownames(df1),] <- df1
d2[rownames(d2) %in% rownames(df2),] <- df2
d2
# A B C
#1 4 4 3
#2 NA NA NA
#3 3 4 9
#4 NA NA NA
#5 NA NA NA
#6 2 9 3
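Since the operation has to be repeated for many data frames, it may help to wrap it in a function. The following is only a sketch: align_rows is a hypothetical helper name, and it uses match() so that row names are compared exactly and in whatever order they appear.

```r
# Hypothetical helper: reindex `x` so it has exactly the rows in `all_rows`,
# filling rows that `x` lacks with NA
align_rows <- function(x, all_rows) {
  # match() compares names exactly; unmatched names give NA,
  # and indexing a data frame with NA produces an all-NA row
  out <- x[match(all_rows, rownames(x)), , drop = FALSE]
  rownames(out) <- all_rows
  out
}

df1 <- data.frame(A = c(1, 3, 7, 1, 5), B = c(5, 2, 9, 5, 5), C = c(1, 1, 3, 4, 5))
df2 <- data.frame(A = c(4, 3, 2), B = c(4, 4, 9), C = c(3, 9, 3))
rownames(df2) <- c(1, 3, 6)

all_rows <- union(rownames(df1), rownames(df2))
df1_full <- align_rows(df1, all_rows)
df2_full <- align_rows(df2, all_rows)
```

union() keeps the names of df1 first and appends the new ones from df2, so sort all_rows beforehand if a particular row order matters.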

Remove semi duplicate rows in R

I have the following data.frame.
a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)
a b c d
1 A NA 1 1
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
6 B NA 1 6
7 C 2 2 7
8 C NA 2 8
9 D NA 2 9
I want to remove duplicate rows (based on columns a and c) so that the rows with a value in column b are kept. In this example, rows 1, 6, and 8 are removed.
One way to do this is to order by 'a', then 'b', and then the logical vector is.na(b), so that the 'NA' elements come last within each group of 'a'. Then apply duplicated on columns 'a' and 'c' and keep only the non-duplicated rows
df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
# a b c d
#2 A 1 1 2
#3 A 2 2 3
#5 B 1 1 5
#4 B 4 4 4
#7 C 2 2 7
#9 D NA 2 9
setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
First split the data into two datasets, one with the values of column a that occur more than once and one with the values that occur only once:
x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]
Now use na.omit on x to delete the rows with NA, then rbind the result with x1 to get the final data set. Note that this drops every NA row in a repeated group of a, without looking at column c.
rbind(na.omit(x),x1)
Answer:
a b c d
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
7 C 2 2 7
9 D NA 2 9
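You can use dplyr to do this. Note that distinct() keeps the first row it meets for each combination of a and c, so on the unsorted data it keeps row 1 (with NA in b) rather than row 2.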
df %>% distinct(a, c, .keep_all = TRUE)
Output
a b c d
1 A NA 1 1
2 A 2 2 3
3 B 4 4 4
4 B 1 1 5
5 C 2 2 7
6 D NA 2 9
There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr
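Because distinct() keeps the first row of each (a, c) combination, the rows with NA in b need to be sorted to the end of each group first if the non-NA rows are the ones to keep. A possible sketch, assuming dplyr is installed; the name res is illustrative:

```r
library(dplyr)

df <- data.frame(
  a = c(rep("A", 3), rep("B", 3), rep("C", 2), "D"),
  b = c(NA, 1, 2, 4, 1, NA, 2, NA, NA),
  c = c(1, 1, 2, 4, 1, 1, 2, 2, 2),
  d = 1:9
)

res <- df %>%
  arrange(a, is.na(b)) %>%           # within each `a`, rows with NA in b go last
  distinct(a, c, .keep_all = TRUE)   # keep the first row per (a, c) pair
```

Row 9 survives even though b is NA, because it is the only row for its (a, c) pair.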

R: Updating a data frame with another data frame

Let's say our initial data frame looks like this:
df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,NA,NA,NA),C=c(1,2,3,NA,NA,NA))
> df1
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
Another data frame contains new information for col B and C
df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5))
> df2
Index B C
1 4 4 5
2 5 4 5
3 6 4 5
How can you update the missing values in df1 so it looks like this:
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 5
5 5 5 4 5
6 6 6 4 5
My attempt:
library(dplyr)
> full_join(df1,df2)
Joining by: c("Index", "B", "C")
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
7 4 NA 4 5
8 5 NA 4 5
9 6 NA 4 5
Which as you can see has created duplicate rows for the 4,5,6 index instead of replacing the NA values.
Any help would be greatly appreciated!
merge then aggregate:
aggregate(. ~ Index, data=merge(df1, df2, all=TRUE), na.omit, na.action=na.pass )
# Index B C A
#1 1 1 1 1
#2 2 2 2 2
#3 3 3 3 3
#4 4 4 5 4
#5 5 4 5 5
#6 6 4 5 6
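Or in dplyr speak (this was written for an older dplyr; summarise_each() and funs() have since been deprecated in favour of across()):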
df1 %>%
full_join(df2) %>%
group_by(Index) %>%
summarise_each(funs(na.omit))
#Joining by: c("Index", "B", "C")
#Source: local data frame [6 x 4]
#
# Index A B C
# (dbl) (int) (dbl) (dbl)
#1 1 1 1 1
#2 2 2 2 2
#3 3 3 3 3
#4 4 4 4 5
#5 5 5 4 5
#6 6 6 4 5
We can use a join from data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), join with 'df2' on "Index", and assign (:=) the values of 'i.B' and 'i.C' to 'B' and 'C'.
library(data.table)
setDT(df1)[df2, c('B', 'C') := .(i.B, i.C), on = "Index"]
df1
# Index A B C
#1: 1 1 1 1
#2: 2 2 2 2
#3: 3 3 3 3
#4: 4 4 4 5
#5: 5 5 4 5
#6: 6 6 4 5
For those interested, I've extended this problem to:
- handle updating a data frame with another data frame that has new columns
- replace any existing entries, regardless of whether they're NA or not.
Here's the solution I found using the aggregate function from @thelatemail :)
df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,3,3,3),C=c(1,2,3,3,3,3))
df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5),D=c(6,6,6),E=c(7,7,7))
df3 = full_join(df1,df2)
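# Create a function na.omit.last that returns the last non-NA value
# (last() and full_join() come from dplyr, so library(dplyr) must be loaded)
na.omit.last = function(x){
  x <- na.omit(x)
  last(x)
}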
# For the columns not in df1
dfA = aggregate(. ~ Index, df3, na.omit,na.action = na.pass)
dfA = dfA[,-(1:ncol(df1))]
dfA = data.frame(lapply(dfA,as.numeric))
dfB = aggregate(. ~ Index, df3[,1:ncol(df1)], na.omit.last, na.action = na.pass)
# If there are more columns in df2 append dfA
if (ncol(df2) > ncol(df1)) {
df3 = cbind(dfB,dfA)
} else {
df3 = dfB
}
print(df3)
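Not sure what the general case or conditions would be, but this works for this instance without dplyr. It fills the NA cells purely by position, so the Index column is dropped from both data frames first, and the NA cells of df1 must line up, in column order, with the values of df2:
df3 <- as.matrix(df1[-1])
df3[which(is.na(df3))] <- as.matrix(df2[-1])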
df3 <- as.data.frame(df3)
df3
A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 5
5 5 4 5
6 6 4 5
As of dplyr >= 1.0.0 you can use rows_update:
library(dplyr)
df1 %>%
rows_update(df2, by = "Index")
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 5
5 5 5 4 5
6 6 6 4 5
Alternatively, there is rows_patch:
rows_patch() works like rows_update() but only overwrites NA values.
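The difference between the two only shows up when the target already holds a non-NA value in a matched row. A small sketch with made-up data, assuming dplyr >= 1.0.0 (old, new, patched and updated are illustrative names):

```r
library(dplyr)

# Hypothetical data: row 4 of `old` already has B = 9, rows 5 and 6 are missing
old <- data.frame(Index = 1:6, B = c(1, 2, 3, 9, NA, NA))
new <- data.frame(Index = 4:6, B = c(4, 4, 4))

patched <- rows_patch(old, new, by = "Index")   # fills only the NA cells
updated <- rows_update(old, new, by = "Index")  # overwrites every matched cell
```

Here patched keeps the existing 9 in row 4, while updated replaces it with 4.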

Loop function for comparing the columns

I have a very large data set with 400 string and numeric variables. I want to compare each pair of consecutive columns (3 & 4, 5 & 6, etc.): the third variable (.x) with the fourth (.y), the fifth with the sixth, the seventh with the eighth, and so on, in the following way: if (.y) is NA, replace the NA with the value from the corresponding row of (.x). For example, if number.y is NA, we replace the NA with the corresponding value from number.x, which would be 5. Likewise, if day.y is NA, we replace the NA in day.y with the corresponding value from day.x, which would be 3. How can I write a loop to do that?
A<-c(1,2,3,4,5,6,7,NA,NA,5,5,6)
B<-c(3,4,5,6,1,2,7,6,7,NA,NA,6)
number.x<-c(1,2,3,4,5,6,7,NA,NA,5,5,6)
number.y<-c(3,4,5,6,1,2,7,6,7,NA,NA,6)
day.x<-c(1,3,4,5,6,7,8,1,NA,3,5,3)
day.y<-c(4,5,6,7,8,7,8,1,2,3,5,NA)
school.x<-c("a","b","b","c","n","f","h","NA","F","G","z","h")
school.y<-c("a","b","b","c","m","g","h","NA","NA","G","H","T")
city.x<- c(1,2,3,7,5,8,7,5,6,7,5,1)
city.y<- c(1,2,3,5,5,7,7,NA,NA,3,4,5)
df<-data.frame(A,B,number.x,number.y,day.x,day.y,school.x,school.y,city.x,city.y)
This is a hacked approach to your question and it requires that every two columns are going to be compared against one another.
library(dplyr)
# Pairs start at column 3 (number.x / number.y); columns A and B are skipped
start_group <- seq(3, length(df), by = 2)
df2 <- data.frame(id = 1:nrow(df))
for(i in start_group){
j <- i + 1
dnames <- df[, c(i, j)] %>%
names
df_ <- data.frame(col1 = df[, i],
col2 = df[, j]) %>%
mutate(col1 = ifelse(is.na(col1), col2 %>% paste, col1 %>% paste)) %>%
mutate(col2 = ifelse(is.na(col2), col1 %>% paste, col2 %>% paste))
names(df_) <- dnames
df2 <- cbind(df2, df_)
}
df2[, -1]
number.x number.y day.x day.y school.x school.y city.x city.y
1 1 3 1 4 a a 1 1
2 2 4 3 5 b b 2 2
3 3 5 4 6 b b 3 3
4 4 6 5 7 c c 7 5
5 5 1 6 8 n m 5 5
6 6 2 7 7 f g 8 7
7 7 7 8 8 h h 7 7
8 6 6 1 1 NA NA 5 5
9 7 7 2 2 F F 6 6
10 5 5 3 3 G G 7 3
11 5 5 5 5 z H 5 4
12 6 6 3 3 h T 1 5
Consider the following base R solution. Essentially, it loops through a distinct list of column stem names (number, day, school, city) and replaces NA values in .x columns with the corresponding values in .y columns and vice versa. NOTE: The school columns require conversion from factor to character, and one of their rows has NA in both the .x and .y columns
# CONVERT TO CHARACTER (NOTE: NA VALUE BECOME "NA" STRINGS)
df[,c('school.x', 'school.y')] <-
sapply(df[,c('school.x', 'school.y')], as.character)
# SET UP FINAL DF
finaldf <- df
# OBTAIN UNIQUE LIST OF COLUMNS STEM (W/O x AND y SUFFIXES)
distinctcols <- unique(gsub("[.][x]|[.][y]", "", names(df)[3:ncol(df)]))
# LOOP THROUGH COLUMN STEM REPLACING NA VALUES
for (col in distinctcols) {
# REPLACE NA .x COLUMN VALUES
finaldf[is.na(finaldf[paste0(col,'.x')])|finaldf[paste0(col,'.x')]=="NA",
paste0(col,'.x')] <-
finaldf[is.na(finaldf[paste0(col,'.x')])|finaldf[paste0(col,'.x')]=="NA",
paste0(col,'.y')]
# REPLACE NA .y COLUMN VALUES
finaldf[is.na(finaldf[paste0(col,'.y')])|finaldf[paste0(col,'.y')]=="NA",
paste0(col,'.y')] <-
finaldf[is.na(finaldf[paste0(col,'.y')])|finaldf[paste0(col,'.y')]=="NA",
paste0(col,'.x')]
}
OUTPUT
number.x number.y day.x day.y school.x school.y city.x city.y
1 1 3 1 4 a a 1 1
2 2 4 3 5 b b 2 2
3 3 5 4 6 b b 3 3
4 4 6 5 7 c c 7 5
5 5 1 6 8 n m 5 5
6 6 2 7 7 f g 8 7
7 7 7 8 8 h h 7 7
8 6 6 1 1 NA NA 5 5
9 7 7 2 2 F F 6 6
10 5 5 3 3 G G 7 3
11 5 5 5 5 z H 5 4
12 6 6 3 3 h T 1 5
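Both answers above fill in both directions. If only the .y columns should be filled from .x, as the question asks, a shorter loop is enough. The following is a sketch on a small made-up data set; it pairs columns by name rather than by position, and stems, fill and the example values are illustrative:

```r
# Small hypothetical data set with two .x/.y pairs
df <- data.frame(
  id       = 1:4,
  number.x = c(1, 5, 3, NA),
  number.y = c(3, NA, 5, 6),
  day.x    = c(1, 3, 4, 5),
  day.y    = c(4, NA, NA, 7)
)

# Column stems that have a .y column with a matching .x column
y_cols <- grep("\\.y$", names(df), value = TRUE)
stems  <- sub("\\.y$", "", y_cols)
stems  <- stems[paste0(stems, ".x") %in% names(df)]

for (s in stems) {
  x <- df[[paste0(s, ".x")]]
  y <- df[[paste0(s, ".y")]]
  fill <- is.na(y)        # rows where .y is missing
  y[fill] <- x[fill]      # take the value from .x (which may itself be NA)
  df[[paste0(s, ".y")]] <- y
}
```

The .x columns are left untouched, and literal "NA" strings, such as those in the school columns of the question, would have to be converted to real NA values first.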

Resources