Unlist and merge selected columns to data frame in R

Suppose I have a list like this:
df1<-data.frame(n=letters[1:4], x=1:4, y=2:5, z=3:6)
df2<-data.frame(n=letters[2:5], x=2:5, y=3:6, z=4:7)
df3<-data.frame(n=letters[3:7], x=2:6, y=3:7, z=4:8)
ls<-list(df1, df2, df3)
ls
[[1]]
n x y z
1 a 1 2 3
2 b 2 3 4
3 c 3 4 5
4 d 4 5 6
[[2]]
n x y z
1 b 2 3 4
2 c 3 4 5
3 d 4 5 6
4 e 5 6 7
[[3]]
n x y z
1 c 2 3 4
2 d 3 4 5
3 e 4 5 6
4 f 5 6 7
5 g 6 7 8
What I want is to merge the first two columns (n and x) of each data frame in the list by column n. The desired output would be:
n x1 x2 x3
1 a 1 NA NA
2 b 2 2 NA
3 c 3 3 2
4 d 4 4 3
5 e NA 5 4
6 f NA NA 5
7 g NA NA 6
And same thing for y and z:
n y1 y2 y3
1 a 2 NA NA
2 b 3 3 NA
3 c 4 4 3
4 d 5 5 4
5 e NA 6 5
6 f NA NA 6
7 g NA NA 7
n z1 z2 z3
1 a 3 NA NA
2 b 4 4 NA
3 c 5 5 4
4 d 6 6 5
5 e NA 7 6
6 f NA NA 7
7 g NA NA 8

We get the unique column names from the list of data.frames, excluding 'n' (stored as 'nm1'). We then loop through those names (lapply(nm1, ...)), subset the matching columns from each data.frame in 'ls' (lapply(ls, function(x) ...)), and use Reduce with merge to merge the datasets in the list. The output names are built with make.unique, using one copy of the column name per data.frame in the list.
nm1 <- setdiff(unlist(lapply(ls, names)), "n")
lapply(nm1, function(nm) setNames(
  Reduce(function(...) merge(..., all = TRUE, by = "n"),
         lapply(ls, function(x) x[c("n", nm)])),
  make.unique(c("n", rep(nm, length(ls))))))
#[[1]]
# n x x.1 x.2
#1 a 1 NA NA
#2 b 2 2 NA
#3 c 3 3 2
#4 d 4 4 3
#5 e NA 5 4
#6 f NA NA 5
#7 g NA NA 6
#[[2]]
# n y y.1 y.2
#1 a 2 NA NA
#2 b 3 3 NA
#3 c 4 4 3
#4 d 5 5 4
#5 e NA 6 5
#6 f NA NA 6
#7 g NA NA 7
#[[3]]
# n z z.1 z.2
#1 a 3 NA NA
#2 b 4 4 NA
#3 c 5 5 4
#4 d 6 6 5
#5 e NA 7 6
#6 f NA NA 7
#7 g NA NA 8
NOTE: ls is also the name of a base R function that lists the objects in an environment. It is better to avoid naming objects after existing R functions.
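Following the note above, the same merge can be sketched with the list stored under a different name and with each column renamed before merging, which gives the x1, x2, x3 names from the question directly. The names df_list and merge_by_var are hypothetical, chosen only for illustration.

```r
# Rebuild the example data; `df_list` is a hypothetical name replacing `ls`
df1 <- data.frame(n = letters[1:4], x = 1:4, y = 2:5, z = 3:6)
df2 <- data.frame(n = letters[2:5], x = 2:5, y = 3:6, z = 4:7)
df3 <- data.frame(n = letters[3:7], x = 2:6, y = 3:7, z = 4:8)
df_list <- list(df1, df2, df3)

# Hypothetical helper: full outer merge of one variable across all data frames
merge_by_var <- function(lst, var, key = "n") {
  # Keep only key + var and rename var to var1, var2, ... so names never clash
  pieces <- lapply(seq_along(lst), function(i) {
    setNames(lst[[i]][c(key, var)], c(key, paste0(var, i)))
  })
  Reduce(function(a, b) merge(a, b, by = key, all = TRUE), pieces)
}

vars <- setdiff(names(df_list[[1]]), "n")
res <- setNames(lapply(vars, function(v) merge_by_var(df_list, v)), vars)
res$x
```

Renaming before the merge avoids relying on the suffixes that merge adds to duplicated column names.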

Here is another base R method that uses do.call, data.frame, and cbind within a nested pair of lapply functions.
# get all levels of n across data frames
allN <- unique(unlist(sapply(ls, "[[", "n")))
# extract desired columns and provide names with setNames
lapply(names(ls[[1]])[-1], function(var) {
cbind("n"=allN, setNames(do.call(data.frame,
lapply(seq_along(ls), function(i) {
ls[[i]][[var]][match(allN, ls[[i]]$n, nomatch=NA)]
})), paste0(var, seq_along(ls))))
})
The first lapply runs through each of the variable names; the second lapply extracts the current variable from each data frame in the list. In the middle, do.call makes the list a data.frame, setNames provides the desired names, and the n column is added with cbind.
In the innermost portion of the inner lapply, the code ls[[i]][[var]][match(allN, ls[[i]]$n, nomatch=NA)] is used to expand (and potentially reorder) the current vector according to the levels in allN. If the current vector is missing a level, the nomatch=NA tells match to instead return NA.
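The expansion step with match can be seen in isolation with a small sketch. The vectors all_keys, keys, and values below are hypothetical stand-ins for allN, ls[[i]]$n, and ls[[i]][[var]].

```r
# Keys wanted in the result, and the keys/values actually present in one data frame
all_keys <- letters[1:7]
keys     <- c("c", "d", "e", "f", "g")
values   <- 2:6

# match() gives, for each wanted key, its position in `keys` (NA if absent)
pos <- match(all_keys, keys)

# Indexing by those positions expands `values` to the full key set, with NA gaps
expanded <- values[pos]
expanded
```

Because indexing a vector with NA yields NA, the missing keys a and b end up as NA without any extra handling.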

Related

Find the index of columns containing more than 5 NA values

I want to subset a dataframe and extract only the columns that contain 5 or more NA values.
data.frame(A = rep(1, 10), B = c(rep(2,5), rep(3,5)), D = rep(5, 10), E = c(rep(1,2), rep(NA,6), rep(6,2)), F = c(rep(NA,2), rep(2,8)))
A B D E F
1 1 2 5 1 NA
2 1 2 5 1 NA
3 1 2 5 NA 2
4 1 2 5 NA 2
5 1 2 5 NA 2
6 1 3 5 NA 2
7 1 3 5 NA 2
8 1 3 5 NA 2
9 1 3 5 6 2
10 1 3 5 6 2
So in this example I want to have the index of the column "E".
My original dataset has about 3000 columns, so speed is more or less important.
I have been trying to do this with sum(is.na) and filter_if(any_vars) but all to no avail..
Using colSums with is.na:
names(df)[colSums(is.na(df))>5]
[1] "E"
We can use colSums on logical matrix (is.na(df1)), get the index with which and extract the names
names(which(colSums(is.na(df1)) >= 5))
#[1] "E"
which(unlist(lapply(df, function(x) sum(is.na(x)) > 5)))
E
4
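The answers above differ on whether the threshold is "> 5" or ">= 5". A minimal sketch on the question's data, assuming "5 or more" is what is wanted, keeps the per-column counts in a named vector so the threshold is easy to change:

```r
# Example data from the question, assigned to `df`
df <- data.frame(A = rep(1, 10), B = c(rep(2, 5), rep(3, 5)), D = rep(5, 10),
                 E = c(rep(1, 2), rep(NA, 6), rep(6, 2)),
                 F = c(rep(NA, 2), rep(2, 8)))

# Number of NA values in each column, as a named vector
na_counts <- colSums(is.na(df))

# Named indices of the columns with 5 or more NAs
idx <- which(na_counts >= 5)
idx
```

Here idx carries both the column position and the column name, so either names(idx) or the index itself can be used for subsetting.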

Insert NA-rows in data frame according to rownames of other data frame

I have 2 data frames with different rownames, e.g.:
df1 <- data.frame(A = c(1,3,7,1,5), B = c(5,2,9,5,5), C = c(1,1,3,4,5))
df2 <- data.frame(A = c(4,3,2), B = c(4,4,9), C = c(3,9,3))
rownames(df2) <- c(1, 3, 6)
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
> df2
A B C
1 4 4 3
3 3 4 9
6 2 9 3
I need to insert NA-rows in both data frames for each row that does exist in only one of the data frames. In the given example:
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
6 NA NA NA
> df2
A B C
1 4 4 3
2 NA NA NA
3 3 4 9
4 NA NA NA
5 NA NA NA
6 2 9 3
I will have to perform this operation many times with different data frames, so I need an automated way to do this. I was trying to solve the issue with various if/else constructs, but I am sure there must be a more direct way.
We can use functions union, %in% or intersect to find the common rownames and assign rows of an NA dataframe with the values of the dataset if it matches the rownames
un1 <- union(rownames(df1), rownames(df2))
d1 <- as.data.frame(matrix(NA, ncol = ncol(df1),
nrow = length(un1), dimnames = list(un1, names(df1))))
d2 <- d1
d1[rownames(d1) %in% rownames(df1),] <- df1
d2[rownames(d2) %in% rownames(df2),] <- df2
d2
# A B C
#1 4 4 3
#2 NA NA NA
#3 3 4 9
#4 NA NA NA
#5 NA NA NA
#6 2 9 3
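Since the operation has to be repeated for many data frames, the steps above could be wrapped in a small helper. This is a sketch only; pad_rows is a hypothetical name, and it assumes row names are unique within each data frame.

```r
df1 <- data.frame(A = c(1, 3, 7, 1, 5), B = c(5, 2, 9, 5, 5), C = c(1, 1, 3, 4, 5))
df2 <- data.frame(A = c(4, 3, 2), B = c(4, 4, 9), C = c(3, 9, 3))
rownames(df2) <- c(1, 3, 6)

# Hypothetical helper: reindex `d` to `all_rows`, filling absent rows with NA
pad_rows <- function(d, all_rows) {
  # match() returns NA for row names missing from `d`; indexing a data frame
  # by NA produces an all-NA row
  out <- d[match(all_rows, rownames(d)), , drop = FALSE]
  rownames(out) <- all_rows
  out
}

all_rows <- union(rownames(df1), rownames(df2))
df1_padded <- pad_rows(df1, all_rows)
df2_padded <- pad_rows(df2, all_rows)
df2_padded
```

The row order follows union(), which matches the desired output here; for row names that are not already ordered, sorting all_rows first may be needed.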

Fill missing values with new data R-Python

I have two datasets, x and y:
> x
a index b
1 1 1 5
2 NA 2 6
3 2 3 NA
4 NA 4 9
> y
index a
1 2 100
2 4 101
>
I would like to fill the missing values of x with the values contained in y.
I have tried to use the merge function but the result is not what I want.
> merge(x,y, by = 'index', all=T)
index a.x b a.y
1 1 1 5 NA
2 2 NA 6 100
3 3 2 NA NA
4 4 NA 9 101
In the real problem there are additional limitations:
1 - y does not fill all the missing values
2 - x and y have in common more variables (so not only a and index)
EDIT : More realistic example
> x
a index b c
1 1 1 5 NA
2 NA 2 6 NA
3 2 3 NA 5
4 NA 4 9 NA
5 NA 5 10 6
> y
index a c
1 2 100 4
2 4 101 NA
>
A solution in either Python or R would be accepted.
I used your merge idea and did the following using dplyr. I am sure there will be better ways of doing this task.
index <- 1:5
a <- c(1, NA, 2, NA, NA)
b <- c(5,6,NA,9,10)
c <- c(NA,NA,5,NA,6)
ana <- data.frame(index, a,b,c, stringsAsFactors=F)
index <- c(2,4)
a <- c(100, 101)
c <- c(4, NA)
bob <- data.frame(index, a,c, stringsAsFactors=F)
> ana
index a b c
1 1 1 5 NA
2 2 NA 6 NA
3 3 2 NA 5
4 4 NA 9 NA
5 5 NA 10 6
> bob
index a c
1 2 100 4
2 4 101 NA
library(dplyr)
ana %>%
  merge(., bob, by = "index", all = TRUE) %>%
  mutate(a.x = ifelse(is.na(a.x), a.y, a.x)) %>%
  mutate(c.x = ifelse(is.na(c.x), c.y, c.x))
index a.x b c.x a.y c.y
1 1 1 5 NA NA NA
2 2 100 6 4 100 4
3 3 2 NA 5 NA NA
4 4 101 9 NA 101 NA
5 5 NA 10 6 NA NA
I overwrote a.x (ana$a) with a.y (bob$a) using mutate, and did the same for c.x (ana$c). If you remove a.y and c.y at the end, that should be the outcome you expect.
Try:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
m1$b = x$b[match(m1$index, x$index)]
m1$c = x$c[match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
or, if there many other columns like b and c:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
for(nn in names(x)[3:4]) m1[,nn] = x[,nn][match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
If there are multiple columns to replace, you could try converting from wide to long form as shown in the first two methods and replace in one step
m1 <- merge(x,y, by="index", all=TRUE)
m1L <- reshape(m1, idvar="index", varying=grep("\\.", colnames(m1)), direction="long", sep=".")
row.names(m1L) <- 1:nrow(m1L)
lst1 <- split(m1L, m1L$time)
indx <- is.na(lst1[[1]][,4:5])
lst1[[1]][,4:5][indx] <- lst1[[2]][,4:5][indx]
res <- lst1[[1]][,c(4,1,2,5)]
res
# a index b c
#1 1 1 5 NA
#2 100 2 6 4
#3 2 3 NA 5
#4 101 4 9 NA
#5 NA 5 10 6
Or you could use dplyr with tidyr
library(dplyr)
library(tidyr)
z <- left_join(x, y, by="index") %>%
gather(Var, Val, matches("\\.")) %>%
separate(Var, c("Var1", "Var2"))
indx1 <- which(is.na(z$Val) & z$Var2=="x")
z$Val[indx1] <- z$Val[indx1+nrow(z)/2]
z %>%
spread(Var1, Val) %>%
filter(Var2=="x") %>%
select(-Var2)
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6
Or split the columns by matching names before the . and use lapply to replace the NA's.
indx <- grep("\\.", colnames(m1),value=TRUE)
res <- cbind(m1[!names(m1) %in% indx],
sapply(split(indx, gsub("\\..*", "", indx)), function(x) {
x1 <- m1[x]
indx1 <- is.na(x1[,1])
x1[,1][indx1] <- x1[,2][indx1]
x1[,1]} ))
res
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6
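For comparison, the fill can also be sketched in base R without merging at all, by looking up each row of x in y with match and patching only the NA cells. This assumes index is the only key column and that every other column shared by x and y should be filled; the names shared and pos are illustrative.

```r
# The "more realistic" example from the question
x <- data.frame(a = c(1, NA, 2, NA, NA), index = 1:5,
                b = c(5, 6, NA, 9, 10), c = c(NA, NA, 5, NA, 6))
y <- data.frame(index = c(2, 4), a = c(100, 101), c = c(4, NA))

# Columns present in both data frames, excluding the key
shared <- setdiff(intersect(names(x), names(y)), "index")

# For each row of x, the matching row of y (NA when y has no such index)
pos <- match(x$index, y$index)

for (col in shared) {
  fill <- is.na(x[[col]])                 # cells of x that need a value
  x[[col]][fill] <- y[[col]][pos[fill]]   # stays NA where y has nothing
}
x
```

Cells that y cannot fill remain NA, which covers the question's first limitation, and the loop over shared covers the second.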

Retrieving subset of a data frame by finding entries with NA in specific columns

Suppose we had a data frame with NA values like so,
>data
A B C D
1 3 NA 4
2 1 3 4
NA 3 3 5
4 2 NA NA
2 NA 4 3
1 1 1 2
I wish to know a general method for retrieving the subset of data with NA values in C or A. So the output should be,
A B C D
1 3 NA 4
NA 3 3 5
4 2 NA NA
I tried using the subset command like so, subset(data, A==NA | C==NA), but it didn't work. Any ideas?
A very handy function for this sort of thing is complete.cases. It checks each row for NAs and returns FALSE if the row contains any; if the row has no NAs, it returns TRUE.
So, you need to subset just the two columns of your data and then use complete.cases(.) and negate it and subset those rows back from your original data, as follows:
# assuming your data is in 'df'
df[!complete.cases(df[, c("A", "C")]), ]
# A B C D
# 1 1 3 NA 4
# 3 NA 3 3 5
# 4 4 2 NA NA
Here is one possibility:
# Read your data
data <- read.table(text="
A B C D
1 3 NA 4
2 1 3 4
NA 3 3 5
4 2 NA NA
2 NA 4 3
1 1 1 2",header=T,sep="")
# Now subset your data
subset(data, is.na(C) | is.na(A))
A B C D
1 1 3 NA 4
3 NA 3 3 5
4 4 2 NA NA
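The attempt in the question, subset(data, A==NA | C==NA), returns no rows because any comparison with NA evaluates to NA rather than TRUE or FALSE, and subset drops rows whose condition is NA. A short sketch on the same data shows the difference:

```r
data <- read.table(text = "
A B C D
1 3 NA 4
2 1 3 4
NA 3 3 5
4 2 NA NA
2 NA 4 3
1 1 1 2", header = TRUE)

# Comparing with == yields NA for every element, never TRUE
cmp <- data$A == NA

# So the original attempt selects nothing
attempt <- subset(data, A == NA | C == NA)

# is.na() is the test that actually detects missing values
rows <- data[is.na(data$A) | is.na(data$C), ]
rows
```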

Selecting values in a dataframe based on a priority list

I am new to R so am still getting my head around the way it works. My problem is as follows: I have a data frame and a prioritised list of columns (pl), and I need:
1 - To find the maximum value from the columns in pl for each row and create a new column with this value (df$max)
2 - Using the priority list, subtract this maximum value from the priority value, ignoring NAs and returning the absolute difference
Probably better with an example:
My priority list is
pl <- c("E","D","A","B")
and the data frame is:
A B C D E F G
1 15 5 20 9 NA 6 1
2 3 2 NA 5 1 3 2
3 NA NA 3 NA NA NA NA
4 0 1 0 7 8 NA 6
5 1 2 3 NA NA 1 6
So for the first row the maximum is from column A (15) and the priority value is from column D (9), since E is NA. The answer I want should look like this:
A B C D E F G MAX MAX-PR
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA NA NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
How about this?
df$MAX <- apply(df[,pl], 1, max, na.rm = T)
df$MAX_PR <- df$MAX - apply(df[,pl], 1, function(x) x[!is.na(x)][1])
df$MAX[is.infinite(df$MAX)] <- NA
> df
# A B C D E F G MAX MAX_PR
# 1 15 5 20 9 NA 6 1 15 6
# 2 3 2 NA 5 1 3 2 5 4
# 3 NA NA 3 NA NA NA NA NA NA
# 4 0 1 0 7 8 NA 6 8 0
# 5 1 2 3 NA NA 1 6 2 1
Example:
df <- data.frame(A=c(1,NA,2,5,3,1),B=c(3,5,NA,6,NA,10),C=c(NA,3,4,5,1,4))
pl <- c("B","A","C")
#now we find the maximum per row, ignoring NAs
max.per.row <- apply(df,1,max,na.rm=T)
#and the first element according to the priority list, ignoring NAs
#(there may be a more efficient way to do this)
first.per.row <- apply(df[,pl],1, function(x) as.vector(na.omit(x))[1])
#and finally compute the difference
max.less.first.per.row <- max.per.row - first.per.row
Note that this code does not handle rows that are all NA: for such a row, max(..., na.rm=TRUE) returns -Inf with a warning. There is no check against that.
Here is a simple version. First, I take only the pl columns; then, for each row, I remove the NAs and compute the max and its difference from the first (highest-priority) value.
df <- dat[, pl]
cbind(dat, t(apply(df, 1, function(x) {
  x <- na.omit(x)
  c(max(x), max(x) - x[1])
})))
A B C D E F G 1 2
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA -Inf NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
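The -Inf in row 3 above comes from taking max of an empty vector. A sketch that guards against all-NA rows, assuming the question's data is stored in dat and that NA is the wanted result for such rows, could look like this (the intermediate names are illustrative):

```r
dat <- data.frame(A = c(15, 3, NA, 0, 1), B = c(5, 2, NA, 1, 2),
                  C = c(20, NA, 3, 0, 3), D = c(9, 5, NA, 7, NA),
                  E = c(NA, 1, NA, 8, NA), F = c(6, 3, NA, NA, 1),
                  G = c(1, 2, NA, 6, 6))
pl <- c("E", "D", "A", "B")

sub <- as.matrix(dat[, pl])

# How many non-missing priority values each row has
n_valid <- rowSums(!is.na(sub))

# Row maximum; rows with no valid value get NA instead of -Inf
row_max <- ifelse(n_valid > 0,
                  suppressWarnings(apply(sub, 1, max, na.rm = TRUE)),
                  NA)

# First non-missing value in priority order (NA when the row is all NA)
first_valid <- apply(sub, 1, function(r) r[!is.na(r)][1])

dat$MAX <- row_max
dat$MAX_PR <- abs(row_max - first_valid)
dat
```

Columns are taken in the order given by pl, so changing the priority only requires reordering that vector.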
