Remove null value from nested list in R [duplicate]

I have a nested list of data frames. In those data frames I have NA variables (vectors now?). I want to remove those elements.
EDIT: actually I have NULL instead of NA.
df.ls <- list(list(id = NULL, x = 3, works = NULL),
list(id = 2, x = 4, works = NULL),
NULL)
I tried this code, but I don't know how to tell it which level of the list to operate on.
df.ls[sapply(df.ls, is.null)] <- NULL

For NULL values we can do
l1 <- lapply(df.ls, function(x) x[lengths(x) > 0])
For NAs we can do
l1 <- lapply(df.ls, function(x) x[!is.na(x)])
l1
#[[1]]
#[[1]]$x
#[1] 3
#[[2]]
#[[2]]$id
#[1] 2
#[[2]]$x
#[1] 4
#[[3]]
#list()
If you want to remove the empty list, you can do
l1[lengths(l1) > 0]
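On the "which level" part of the question: instead of hard-coding one level, a recursive helper can strip NULLs at every depth. The following is only a sketch; `drop_nulls` is a hypothetical name (not a base R function), and it assumes that zero-length elements and lists that become empty after cleaning should be dropped as well.

```r
# Hypothetical helper: remove NULL (and other zero-length) elements at every level.
drop_nulls <- function(x) {
  # Leave atomic vectors and data frames untouched.
  if (!is.list(x) || is.data.frame(x)) return(x)
  x <- lapply(x, drop_nulls)   # clean the children first
  x[lengths(x) > 0]            # then drop NULLs and lists that ended up empty
}

df.ls <- list(list(id = NULL, x = 3, works = NULL),
              list(id = 2, x = 4, works = NULL),
              NULL)
cleaned <- drop_nulls(df.ls)
str(cleaned)
```

Because the check is `lengths(x) > 0`, things like `character(0)` are removed too; swap in `Filter(Negate(is.null), x)` if only literal NULLs should go.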

I am not sure what you are trying to do, since you say you have a list of data.frames but the example you provide is only a list of lists with elements of length one.
Let's assume you have a list of data.frames, which in turn contain vectors of length > 1, and you want to drop all columns that contain only NAs.
df.ls <- list(data.frame(id = c(NA,NA,NA),
x = c(NA,3,5),
works = c(4,5,NA)),
data.frame(id = c("a","b","c"),
x = c(NA,3,5),
works = c(NA,NA,NA)),
data.frame(id = c("e","d",NA),
x = c(NA,3,5),
works = c(4,5,NA)))
[[1]]
id x works
1 NA NA 4
2 NA 3 5
3 NA 5 NA
[[2]]
id x works
1 a NA NA
2 b 3 NA
3 c 5 NA
[[3]]
id x works
1 e NA 4
2 d 3 5
3 <NA> 5 NA
Then this approach will work:
library(dplyr)
library(purrr)
non_empty_col <- function(x) {
sum(is.na(x)) != length(x)
}
map(df.ls, ~ .x %>% select_if(non_empty_col))
Which returns your list of data.frames without columns that contain only NA.
[[1]]
x works
1 NA 4
2 3 5
3 5 NA
[[2]]
id x
1 a NA
2 b 3
3 c 5
[[3]]
id x works
1 e NA 4
2 d 3 5
3 <NA> 5 NA
If you, however, prefer your list to have only complete cases in each data.frame (rows with no NAs), then the following code will work.
library(purrr) # map() comes from purrr, not dplyr
map(df.ls, ~ .x[complete.cases(.x), ])
With my example data, this leaves only row 2 of data.frame 3; the other two data.frames end up with zero rows.
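If you would rather avoid the dplyr/purrr dependency, both steps can probably be done in base R as well. This is a minimal sketch on the same example data; the names `drop_na_cols` and `complete_rows` are just illustrative.

```r
df.ls <- list(
  data.frame(id = c(NA, NA, NA), x = c(NA, 3, 5), works = c(4, 5, NA)),
  data.frame(id = c("a", "b", "c"), x = c(NA, 3, 5), works = c(NA, NA, NA)),
  data.frame(id = c("e", "d", NA), x = c(NA, 3, 5), works = c(4, 5, NA))
)

# Keep a column if it has at least one non-NA value.
drop_na_cols <- lapply(df.ls, function(d) d[, colSums(!is.na(d)) > 0, drop = FALSE])
lapply(drop_na_cols, names)

# Keep only the rows that contain no NA at all.
complete_rows <- lapply(df.ls, function(d) d[complete.cases(d), , drop = FALSE])
sapply(complete_rows, nrow)
```

`drop = FALSE` is there so that a data.frame left with a single column is not silently turned into a vector.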

To remove the NULL
discard(map(df.ls, ~ discard(.x, is.null)), is.null)
#[[1]]
#[[1]]$x
#[1] 3
#[[2]]
#[[2]]$id
#[1] 2
#[[2]]$x
#[1] 4
Or in base R with Filter and is.null
Filter(Negate(is.null), lapply(df.ls, function(x) Filter(Negate(is.null), x)))
Earlier version before the OP's update
library(purrr)
map(df.ls, ~ .x[!is.na(.x)])
#[[1]]
#[[1]]$x
#[1] 3
#[[2]]
#[[2]]$id
#[1] 2
#[[2]]$x
#[1] 4
#[[3]]
#list()

Related

How to organize the output of the list of list in R

Suppose this is my list of list (I would like to organize the result as my data contains more than 40 results and it is difficult for me to organize them manually).
s <- c(1,2,3)
ss <- c(4,5,6)
S <- list(s,ss)
h <- c(4,8,7)
hh <- c(0,3,4)
H <- list(h,hh)
HH <- list(S,H)
names1 <- c("First","Second")
lapply(setNames(HH, paste0(names1, '_Model')), function(x)
setNames(x, paste0('Res_', seq_along(x))))
#$First_Model
#$First_Model$Res_1
#[1] 1 2 3
#$First_Model$Res_2
#[1] 4 5 6
#$Second_Model
#$Second_Model$Res_1
#[1] 4 8 7
#$Second_Model$Res_2
#[1] 0 3 4
I would like to have the result similar to the following:
#$First_Model
#$First_Model$Res_1
#[1] 1 2 3
#$Second_Model
#$Second_Model$Res_1
#[1] 4 8 7
#$First_Model$Res_2
#[1] 4 5 6
#$Second_Model$Res_2
#[1] 0 3 4
The problem in question is how to rearrange the nested list from "Model No. > Results No." to "Results No. > Model No."
I was going for something similar to Wimpel's answer.
Res_no <- seq_along(HH[[1]]) # results elements
lapply(setNames(Res_no, paste0("Res_", Res_no)), function(x)
lapply(setNames(HH, paste0(names1, '_Model')), `[[`, x)
)
Output
#$Res_1
#$Res_1$First_Model
#[1] 1 2 3
#
#$Res_1$Second_Model
#[1] 4 8 7
#
#
#$Res_2
#$Res_2$First_Model
#[1] 4 5 6
#
#$Res_2$Second_Model
#[1] 0 3 4
The basis of this solution is to extract the x-th element of each nested list (see the inner lapply() call in the code). You can do this with lapply or purrr::map.
The outer lapply() function lets you repeat it for all the "Results No."
Something like this perhaps?
# From your code, create a list L
L <- lapply(setNames(HH, paste0(names1, '_Model')), function(x)
setNames(x, paste0('Res_', seq_along(x))))
# get all x-th elements from the list, and add them to new list L2
L2 <- lapply( 1:length(L[[1]]), function(x) {
lapply(L, "[[", x)
})
# set names of L2
names(L2) <- names(L[[1]])
output
# $Res_1
# $Res_1$First_Model
# [1] 1 2 3
#
# $Res_1$Second_Model
# [1] 4 8 7
#
#
# $Res_2
# $Res_2$First_Model
# [1] 4 5 6
#
# $Res_2$Second_Model
# [1] 0 3 4
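For what it's worth, the same "Model > Result" to "Result > Model" flip can also be written as a one-liner with Map(). This is a sketch that assumes every model has the same number of results; it rebuilds the data from the question so it runs on its own.

```r
s <- c(1, 2, 3); ss <- c(4, 5, 6)
h <- c(4, 8, 7); hh <- c(0, 3, 4)
HH <- list(list(s, ss), list(h, hh))
names1 <- c("First", "Second")

L <- lapply(setNames(HH, paste0(names1, "_Model")), function(x)
  setNames(x, paste0("Res_", seq_along(x))))

# Map() walks the inner lists in parallel; list() collects the i-th
# element of every model into one entry of the result.
L2 <- do.call(Map, c(f = list, L))
str(L2)
```

The outer names (Res_1, Res_2) come from the first inner list, and the inner names come from the model names in L.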

R lapply: check if data frame contains a column. If not, create this column

I have a list of dataframes.
I would like to check every column name of the dataframes. If the column name is missing, I want to create this column to the dataframe, and complete with NA values.
Dummy data:
d1 <- data.frame(a=1:2, b=2:3, c=4:5)
d2 <- data.frame(a=1:2, b=2:3)
l<-list(d1, d2)
# Check the columns names of the dataframes
# If column is missing, add new column, add NA as values
lapply(l, function(x) if(!("c" %in% colnames(x)))
{
c<-rep(NA, nrow(x))
cbind(x, c) # does not work!
})
What I get:
[[1]]
NULL
[[2]]
a b c
1 1 2 NA
2 2 3 NA
What I want instead:
[[1]]
a b c
1 1 2 4
2 2 3 5
[[2]]
a b c
1 1 2 NA
2 2 3 NA
Thanks for your help!
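You could use dplyr::mutate with an if/else. (A scalar ifelse() does not work here: its result is only as long as its test, so only the first value of c would be kept and recycled.)
library(dplyr)
lapply(l, function(x) mutate(x, c = if ("c" %in% names(x)) c else NA))
[[1]]
a b c
1 1 2 4
2 2 3 5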
[[2]]
a b c
1 1 2 NA
2 2 3 NA
You have some good answers, but if you want to stick to base R:
lapply(l, function(x)
if(!("c" %in% colnames(x))) {
c<-rep(NA, nrow(x))
return(cbind(x, c))
}
else(return(x))
)
Your code was returning NULL for the first df because you had no else branch to handle the case where c already exists (i.e., when the if condition is FALSE).
One way is to use dplyr::bind_rows to bind data.frames in the list and fill entries from missing columns with NA, and then split the resulting data.frame again to produce a list of data.frames:
df <- dplyr::bind_rows(l, .id = "id")
lapply(split(df, df$id), function(x) x[, -1])
#$`1`
# a b c
#1 1 2 4
#2 2 3 5
#
#$`2`
# a b c
#3 1 2 NA
#4 2 3 NA
Or the same as a tidyverse/magrittr chain
bind_rows(l, .id = "id") %>% split(., .$id) %>% lapply(function(x) x[, -1])
library(purrr)
map(l, ~{if(!length(.x$c)) .x$c <- NA; .x})
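If more than one column might be missing, the base R idea generalises to a small helper. The sketch below is an assumption about what you might want, not part of any answer above: `add_missing_cols` is a hypothetical name, and `cols` is the full set of columns you expect.

```r
d1 <- data.frame(a = 1:2, b = 2:3, c = 4:5)
d2 <- data.frame(a = 1:2, b = 2:3)
l <- list(d1, d2)

# Hypothetical helper: add every column in `cols` that `df` lacks, filled with NA.
add_missing_cols <- function(df, cols) {
  for (col in setdiff(cols, names(df))) {
    df[[col]] <- NA   # a single NA is recycled to nrow(df)
  }
  df
}

res <- lapply(l, add_missing_cols, cols = c("a", "b", "c"))
res
```

Data frames that already have all the columns pass through unchanged, which avoids the NULL result from the original attempt.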

Identify duplicate values and remove them

I have a vector:
vec <- c(2,3,5,5,5,5,6,1,9,4,4,4)
I want to check if a particular value is repeated consecutively and if yes, keep the first two values and assign NA to the rest of the values.
For example, in the above vector, 5 is repeated 4 times, therefore I will keep the first two 5's and make the second two 5's NA.
Similarly, 4 is repeated three times, so I will keep the first two 4's and remove the third one.
In the end my vector should look like:
2,3,5,5,NA,NA,6,1,9,4,4,NA
I did this:
bad.values <- vec - binhf::shift(vec, 1, dir="right")
bad.repeat <- bad.values == 0
vec[bad.repeat] <- NA
[1] 2 3 5 NA NA NA 6 1 9 4 NA NA
I can only get it to work to keep the first 5 and 4 (rather than first two 5's or 4',4's).
Any solutions?
Another option with just base R functions:
rl <- rle(vec)
i <- unlist(lapply(rl$lengths, function(l) if (l > 2) c(FALSE,FALSE,rep(TRUE, l - 2)) else rep(FALSE, l)))
vec * NA^i
which gives:
[1] 2 3 5 5 NA NA 6 1 9 4 4 NA
I figured it out. I just had to change the argument to 2 in binhf::shift
vec <- c(2,3,5,5,5,5,6,1,9,4,4,4)
bad.values <- vec - binhf::shift(vec, 2, dir="right")
bad.repeat <- bad.values == 0
vec[bad.repeat] <- NA
[1] 2 3 5 5 NA NA 6 1 9 4 4 NA
I think this might work, if I got your problem right:
vec <- c(2,3,5,5,5,5,6,1,9,4,4,4)
diffs1<-vec-binhf::shift(vec,1,dir="right")
diffs2<-vec-binhf::shift(vec,2,dir="right")
get_zeros<-abs(diffs1)+abs(diffs2)
vec[which(get_zeros==0)]<-NA
I hope this helps!
This question may refer to a problem you encountered in a dataframe, not a vector. In any case, here's a tidyverse solution to both.
library(dplyr)
tibble(x = vec) %>%
group_by(x) %>%
mutate(mycol = ifelse(row_number()>2, NA, x) ) %>%
pull(mycol)
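One thing to watch: as far as I can tell, the lag-2 comparison and the group_by(x) version only give the right answer when equal values always sit next to each other. With c(5, 3, 5), for example, the third value equals the one two positions back and is the third 5 overall, even though nothing is repeated consecutively. A run-based sketch avoids that; `cap_runs` is a hypothetical helper name, and it assumes the input has no NAs to begin with.

```r
vec <- c(2, 3, 5, 5, 5, 5, 6, 1, 9, 4, 4, 4)

# Hypothetical helper: keep at most `keep` values per consecutive run.
cap_runs <- function(v, keep = 2) {
  pos <- sequence(rle(v)$lengths)   # position of each value inside its run
  v[pos > keep] <- NA
  v
}

cap_runs(vec)
cap_runs(c(5, 3, 5))   # no consecutive repeats, so nothing is replaced
```

rle() gives the length of each run and sequence() turns those lengths into 1, 2, 3, ... counters, so anything past position `keep` within a run is set to NA.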

Complete.cases used on list of data frames

I'm trying to remove all the NA values from a list of data frames. The only way I have got it to work is by cleaning the data with complete.cases in a for loop. Is there another way of doing this with lapply as I had been trying for a while to no avail. Here is the code that works.
I start with
data_in <- lapply (file_name,read.csv)
Then have:
clean_data <- list()
for (i in seq_along(id)) {
clean_data[[i]] <- data_in[[i]][complete.cases(data_in[[i]]), ]
}
But what I tried to get to work was using lapply all the way like this.
comp <- lapply(data_in, complete.cases)
clean_data <- lapply(data_in, data_in[[id]][comp,])
Which returns this error: "Error in `[.default`(xj, i) : invalid subscript type 'list'"
What I'd like to know is whether there are alternatives, or whether I was going about this the right way. And why didn't the last example work?
Thank you so much for your time. Have a nice day.
I'm not sure what you expected with
clean_data <- lapply(data_in, data_in[[id]][comp,])
The second parameter to lapply should be a proper function to which each member of the data_in list will be passed one at a time. Your expression data_in[[id]][comp,] is not a function. I'm not sure where you expected id to come from, but lapply does not create magic variables for you like that. Also, at this point comp is now a list itself of indices. You are making no attempt to iterate over this list in sync with your data_in list. If you wanted to do it in two separate steps, a more appropriate approach would be
comp <- lapply(data_in, complete.cases)
clean_data <- Map(function(d,c) {d[c,]}, data_in, comp)
Here we use Map to iterate over the data_in and comp lists simultaneously. They each get passed in to the function as a parameter and we can do the proper extraction that way. Otherwise, if we wanted to do it in one step, we could do
clean_data <- lapply(data_in, function(x) x[complete.cases(x),])
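To check that the two-step and one-step versions really agree, here is a small self-contained sketch. The data frames are made-up stand-ins for whatever read.csv would return, and `keep` is just an illustrative parameter name.

```r
# Stand-in for the data read from the CSV files (hypothetical values).
data_in <- list(
  data.frame(x = 1:3, y = c(1, NA, 3)),
  data.frame(x = 1:2, y = c(NA, 2))
)

# Two steps: compute the logical row masks, then pair each mask with its data frame.
comp <- lapply(data_in, complete.cases)
two_step <- Map(function(d, keep) d[keep, ], data_in, comp)

# One step: compute and apply the mask inside a single function.
one_step <- lapply(data_in, function(d) d[complete.cases(d), ])

identical(two_step, one_step)
```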
Welcome to SO; please provide some working code next time.
Here is how I would do it with na.omit (since complete.cases only returns a logical vector):
(dat.l <- list(dat1 = data.frame(x = 1:2, y = c(1, NA)),
dat2 = data.frame(x = 1:3, y = c(1, NA, 3))))
# $dat1
# x y
# 1 1 1
# 2 2 NA
#
# $dat2
# x y
# 1 1 1
# 2 2 NA
# 3 3 3
Map(na.omit, dat.l)
# $dat1
# x y
# 1 1 1
#
# $dat2
# x y
# 1 1 1
# 3 3 3
Do you mean like the below?
> lst
$a
a
1 1
2 2
3 NA
4 3
5 4
$b
b
1 1
2 NA
3 2
4 3
5 4
$d
d e
1 NA 1
2 NA 2
3 3 3
4 4 NA
5 5 NA
> f <- function(x) x[complete.cases(x),]
> lapply(lst, f)
$a
[1] 1 2 3 4
$b
[1] 1 2 3 4
$d
d e
3 3 3
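Note that in the output above the single-column data frames $a and $b come back as plain vectors, because `[` drops the data frame structure when only one column is left. If that is not wanted, adding drop = FALSE should keep them as data frames. A minimal sketch with data of the same shape as above (the values are re-typed here so the example runs on its own):

```r
lst <- list(a = data.frame(a = c(1, 2, NA, 3, 4)),
            d = data.frame(d = c(NA, NA, 3, 4, 5), e = c(1, 2, 3, NA, NA)))

# drop = FALSE keeps the result a data frame even when it has one column.
f <- function(x) x[complete.cases(x), , drop = FALSE]
res <- lapply(lst, f)
sapply(res, is.data.frame)
```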
file_name[complete.cases(file_name), ]
complete.cases() returns only a logical vector, so it has to be used as a row index. This should do the job and return only the rows with no NA values.

Difference between `names(df[1]) <- ` and `names(df)[1] <- `

Consider the following:
df <- data.frame(a = 1, b = 2, c = 3)
names(df[1]) <- "d" ## First method
## a b c
##1 1 2 3
names(df)[1] <- "d" ## Second method
## d b c
##1 1 2 3
Neither method returned an error, but the first didn't change the column name, while the second did.
I thought it had something to do with the fact that I'm operating only on a subset of df, but then why does the following work fine?
df[1] <- 2
## a b c
##1 2 2 3
What I think is happening is that replacement into a data frame ignores the attributes of the data frame that is drawn from. I am not 100% sure of this, but the following experiments appear to back it up:
df <- data.frame(a = 1:3, b = 5:7)
# a b
# 1 1 5
# 2 2 6
# 3 3 7
df2 <- data.frame(c = 10:12)
# c
# 1 10
# 2 11
# 3 12
df[1] <- df2[1] # in this case `df[1] <- df2` is equivalent
Which produces:
# a b
# 1 10 5
# 2 11 6
# 3 12 7
Notice how the values changed for df, but not the names. Basically the replacement operator `[<-` only replaces the values. This is why the name was not updated. I believe this explains all the issues.
In the scenario:
names(df[2]) <- "x"
You can think of the assignment as follows (this is a simplification, see end of post for more detail):
tmp <- df[2]
# b
# 1 5
# 2 6
# 3 7
names(tmp) <- "x"
# x
# 1 5
# 2 6
# 3 7
df[2] <- tmp # `tmp` has "x" for names, but it is ignored!
# a b
# 1 10 5
# 2 11 6
# 3 12 7
The last step of which is an assignment with `[<-`, which doesn't respect the names attribute of the RHS.
But in the scenario:
names(df)[2] <- "x"
you can think of the assignment as (again, a simplification):
tmp <- names(df)
# [1] "a" "b"
tmp[2] <- "x"
# [1] "a" "x"
names(df) <- tmp
# a x
# 1 10 5
# 2 11 6
# 3 12 7
Notice how we directly assign to names, instead of assigning to df which ignores attributes.
df[2] <- 2
works because we are assigning directly to the values, not the attributes, so there are no problems here.
EDIT: based on some commentary from @AriB.Friedman, here is a more elaborate version of what I think is going on (note I'm omitting the S3 dispatch to `[.data.frame`, etc., for clarity):
Version 1 names(df[2]) <- "x" translates to:
df <- `[<-`(
df, 2,
value=`names<-`( # `names<-` here returns a re-named one column data frame
`[`(df, 2),
value="x"
) )
Version 2 names(df)[2] <- "x" translates to:
df <- `names<-`(
df,
`[<-`(
names(df), 2, "x"
) )
Also, turns out this is "documented" in R Inferno Section 8.2.34 (Thanks @Frank):
right <- wrong <- c(a=1, b=2)
names(wrong[1]) <- 'changed'
wrong
# a b
# 1 2
names(right)[1] <- 'changed'
right
# changed b
# 1 2
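The two expansions can also be run directly as ordinary function calls, which is a quick way to convince yourself that they reproduce the behaviour from the question. This is a sketch under the same simplification as above (S3 dispatch is left to R); `v1` and `v2` are just illustrative names.

```r
df <- data.frame(a = 1, b = 2, c = 3)

# Expansion of names(df[1]) <- "d": the renamed one-column data frame is
# written back with `[<-`, which keeps the existing column name.
v1 <- `[<-`(df, 1, value = `names<-`(`[`(df, 1), value = "d"))
names(v1)

# Expansion of names(df)[1] <- "d": the modified names vector is
# written back with `names<-`, so the rename sticks.
v2 <- `names<-`(df, `[<-`(names(df), 1, "d"))
names(v2)
```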
