Remove NAs from nested list data frame - r

The following really seems to be a tough nut to crack:
I have a data frame with a nested list:
df <- structure(list(zerobonds = c(1, 1, NA), nominal = c(20, 20, NA
), calls = list(list(c(NA, -1), 1), list(list(NA, -1), 1), NA),
call_strike = list(list(c(NA, 90), 110), list(list(NA, 90),
110), NA), puts = list(NA, NA, list(c(NA, 1), -1)), put_strike = list(
NA, NA, list(c(NA, 110), 90))), row.names = c(NA, -3L
), class = "data.frame")
df
##   zerobonds nominal     calls call_strike      puts  put_strike
## 1         1      20 NA, -1, 1 NA, 90, 110        NA          NA
## 2         1      20 NA, -1, 1 NA, 90, 110        NA          NA
## 3        NA      NA        NA          NA NA, 1, -1 NA, 110, 90
I want to print the structure without any NAs (dots instead of the blanks are ok too):
  zerobonds nominal calls call_strike  puts put_strike
1         1      20 -1, 1     90, 110
2         1      20 -1, 1     90, 110
3                                     1, -1    110, 90
I have tried all kinds of things, the best approach so far seems to be something like rapply(df, na.omit, how = "replace") where I can't even suppress the Warnings (suppressWarnings doesn't seem to work here!). print(df, na.print = "") doesn't help either.
I am really exhausted now, nothing seems to work... Data frames with nested list columns don't seem to be a good idea after all... Could anybody help?

You can try the code below
df[]<-rapply(Map(as.list,df), na.omit, how = "replace")
which gives
> df
  zerobonds nominal calls call_strike  puts put_strike
1         1      20 -1, 1     90, 110
2         1      20 -1, 1     90, 110
3                                     1, -1    110, 90
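A small sketch of why the columns are first converted with as.list: applied to a whole atomic column, na.omit shortens the vector, whereas applied cell by cell it keeps one (possibly empty) entry per row. The vector `col` is made up for illustration and is not part of the original answer.

```r
# Hypothetical atomic column with one missing value
col <- c(1, 1, NA)

# On the whole column, na.omit() drops an element, so the result
# no longer has one entry per row
short <- na.omit(col)
print(length(short))

# After as.list(), rapply() visits each cell separately and the
# list keeps its length; the NA cell becomes a zero-length vector
cells <- rapply(as.list(col), na.omit, how = "replace")
print(length(cells))
print(lengths(cells))
```

Zero-length cells are what end up printed as blanks in the data frame.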

You can create your own recursive function and apply it to each column:
rm_nested_na <- function(x) {
  if (is.atomic(x)) {
    na.omit(x)
  } else {
    lapply(x, rm_nested_na)
  }
}
res <- df
listcol <- sapply(res, is.list)
res[listcol] <- lapply(res[listcol], rm_nested_na)
res
This is clearly inefficient if the nesting is deep.
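To see what the recursive function does to a single list column, here is a minimal sketch on a standalone nested list. The input `x` mirrors the shape of the `calls` column but is otherwise just an illustration.

```r
# Same recursive helper as above, written compactly
rm_nested_na <- function(x) {
  if (is.atomic(x)) na.omit(x) else lapply(x, rm_nested_na)
}

# Hypothetical nested list shaped like the `calls` column
x <- list(list(c(NA, -1), 1), list(list(NA, -1), 1), NA)

cleaned <- rm_nested_na(x)

# The outer list still has one entry per row
print(length(cleaned))

# Flattening shows that only the non-missing values are left
flat <- as.numeric(unlist(cleaned))
print(flat)
```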

Related

Error in eval(predvars, data, env) : object 'oly.success' in Regression model

I have looked into this problem, and some people suggest that changing column names might work, but I can't seem to figure out which column is causing the issue.
my code
library(Amelia)
library(corrplot)
library(GGally)
library(caret)
data <- asianmen_100.free
summary(data)
#remove unwanted variables
reject_vars <- names(data) %in% c("firstname","lastname","country","Event","Pool.Length","Competition",
"Comp.Country","name","DOB","Date","mins","secs","minsAsSecDuration","earliest_date",
"Final_Medal","Time","secsAsDuration")
data.new <- data[!reject_vars]
data.new$Age. <- as.numeric(data.new$Age.)
#Remove Target variables
remove_vars <- names(data.new) %in% c("oly_success")
data.new <- data.new[!remove_vars]
ggcorr(data.new, label = TRUE)
# find variables that have higher cross-correlation
M <- data.matrix(data.new)
corrM <- cor(M)
highlyCorrM <- findCorrelation(corrM, cutoff=0.5)
names(data.new)[highlyCorrM]
#sample size
smp_size <- floor(2/3 * nrow(data.new))
set.seed(2)
#sample dataset
data.new <- data.new[sample(nrow(data.new)), ]
data.train <- data.new[1:smp_size, ]
data.test <- data.new[(smp_size+1):nrow(data.new), ]
#model building
formula = oly_success ~ .
rmodel <- glm(formula = formula,
data=data.train,
family=binomial(link="logit"))
summary(rmodel)
This is the data :
> head(data.new)
# A tibble: 6 x 8
   Age. timeAsDuration Success oly_success first_appear.age first_oly.age age_diff total_medal
  <dbl> <Duration>       <dbl>       <dbl>            <dbl>         <dbl>    <dbl>       <dbl>
1    20 49.37s               0           0               17            NA       NA           1
2    21 49.8s                0           0               21            NA       NA           0
3    16 57.75s               0           0               16            NA       NA           0
4    20 51.42s               0           0               17            NA       NA           0
5    21 51.01s               0           0               16            NA       NA           2
6    NA 54.11s               0           0               NA            NA       NA           0
Sample data
> dput(data.new[1:10,])
structure(list(Age. = c(20, 21, 16, 20, 21, NA, 19, 25, 26, 24
), timeAsDuration = new("Duration", .Data = c(49.37, 49.8, 57.75,
51.42, 51.01, 54.11, 50.88, 57.69, 51.49, 49.97)), Success = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), oly_success = c(0, 0, 0, 0, 0, 0,
0, 1, 0, 0), first_appear.age = c(17, 21, 16, 17, 16, NA, 19,
25, 25, 23), first_oly.age = c(NA, NA, NA, NA, NA, NA, NA, 26,
NA, NA), age_diff = c(NA, NA, NA, NA, NA, NA, NA, 1, NA, NA),
total_medal = c(1, 0, 0, 0, 2, 0, 0, 0, 0, 1)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
I have tried changing some of the column names, and even renaming the target variable to oly.success, still without success. Where am I going wrong?
First of all, in your dput(data.new) the target variable is called oly_success, but in the formula you use oly.success. Second, you remove the target variable with:
#Remove Target variables
remove_vars <- names(data.new) %in% c("oly_success")
data.new <- data.new[!remove_vars]
If you fix these errors, your code works:
library(Amelia)
library(corrplot)
library(GGally)
library(caret)
ggcorr(data.new, label = TRUE)
# find variables that have higher cross-correlation
M <- data.matrix(data.new)
corrM <- cor(M)
highlyCorrM <- findCorrelation(corrM, cutoff=0.5)
names(data.new)[highlyCorrM]
#sample size
smp_size <- floor(2/3 * nrow(data.new))
set.seed(2)
#sample dataset
data.new <- data.new[sample(nrow(data.new)), ]
data.train <- data.new[1:smp_size, ]
data.test <- data.new[(smp_size+1):nrow(data.new), ]
#model building
# The entire dataset is used here because, with such a small example dataset,
# the training split does not have all the levels for the logistic regression
rmodel <- glm(formula = oly_success ~ .,
              data = data.new,
              family = binomial(link = "logit"))
summary(rmodel)
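A quick way to catch this kind of mismatch before fitting is to check that the response named in the formula is actually a column of the data. This is only a sketch: the toy data frame `d` and the helper `response_name` are hypothetical and not part of the original code.

```r
# Hypothetical toy data; only the column names matter here
d <- data.frame(oly_success = c(0, 1, 0, 1), Age = c(20, 25, 19, 26))

# Name of the response: the variable on the left-hand side of the formula
response_name <- function(f) all.vars(f[[2]])

f_bad  <- oly.success ~ .   # misspelled response
f_good <- oly_success ~ .   # matches the column name

# Is the response present among the columns?
bad_ok  <- response_name(f_bad)  %in% names(d)
good_ok <- response_name(f_good) %in% names(d)
print(c(bad = bad_ok, good = good_ok))
```

The same check also flags the second problem in the question: once the target column has been dropped from the data, the response is no longer in `names()`.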

Remove duplicate rows in nested list data frame

I have a data frame with a nested list:
df <- structure(list(zerobonds = c(1, 1, NA), nominal = c(20, 20, NA
), calls = list(list(c(NA, -1), 1), list(list(NA, -1), 1), NA),
call_strike = list(list(c(NA, 90), 110), list(list(NA, 90),
110), NA), puts = list(NA, NA, list(c(NA, 1), -1)), put_strike = list(
NA, NA, list(c(NA, 110), 90))), row.names = c(NA, -3L
), class = "data.frame")
df
##   zerobonds nominal     calls call_strike      puts  put_strike
## 1         1      20 NA, -1, 1 NA, 90, 110        NA          NA
## 2         1      20 NA, -1, 1 NA, 90, 110        NA          NA
## 3        NA      NA        NA          NA NA, 1, -1 NA, 110, 90
My question: You see that the first and second row are duplicated. I want to remove all duplicate rows in such data frames and I am looking for some general method.
What I tried: duplicated doesn't seem to work, I guess because of this special structure of a data frame with nested lists inside.
You may need to flatten the nested lists of each column and then apply unique, e.g.,
> unique({df[]<-Map(function(x) Map(unlist,x),df);df})
  zerobonds nominal     calls call_strike      puts  put_strike
1         1      20 NA, -1, 1 NA, 90, 110        NA          NA
3        NA      NA        NA          NA NA, 1, -1 NA, 110, 90
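The reason flattening helps can be seen on a single cell. Rows 1 and 2 print the same, but their cells are nested differently, so they only compare equal once unlisted. A minimal sketch using the two `calls` cells from the data above:

```r
# The `calls` cells of rows 1 and 2: same values, different nesting
a <- list(c(NA, -1), 1)
b <- list(list(NA, -1), 1)

# As nested structures they are not identical
same_nested <- identical(a, b)

# After unlist() both become the plain numeric vector c(NA, -1, 1)
same_flat <- identical(unlist(a), unlist(b))

print(c(nested = same_nested, flat = same_flat))
```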

adding multiple columns include na in dataframe in r

I have dataframe like this:
I want to create a new column which is the sum of other columns by ignoring NA if there is any numeric value in a row. But if all value (like the second row) in a row are na, the sum column gets NA.
As this is your first activity here on SO, you should have a look at the guide that describes how a minimal, reproducible example is made. You will certainly need it in the future if you have more questions. An image is mostly not accepted as a starting point.
Fortunately your table was a small one. I turned it into a tribble and then used rowSums to calculate the numbers you seem to want.
df <- tibble::tribble(
~x, ~y, ~z,
6000, NA, NA,
NA, NA, NA,
100, 7000, 1000,
0, 0, NA
)
df$sum <- rowSums(df, na.rm = T)
df
#> # A tibble: 4 x 4
#>       x     y     z   sum
#>   <dbl> <dbl> <dbl> <dbl>
#> 1  6000    NA    NA  6000
#> 2    NA    NA    NA     0
#> 3   100  7000  1000  8100
#> 4     0     0    NA     0
Created on 2020-06-15 by the reprex package (v0.3.0)
Let's say that your data frame is called df
cbind(df, apply(df, 1, function(x) {if (all(is.na(x))) {NA} else {sum(x, na.rm = TRUE)}}))
Note that if your data frame has other columns, you will need to restrict the df call within apply to only be the columns you're after.
You can count the non-NA values in each row of df. If a row has no non-NA value, assign NA as the output; otherwise calculate the row-wise sum using rowSums.
ifelse(rowSums(!is.na(df)) == 0, NA, rowSums(df, na.rm = TRUE))
#[1] 6000 NA 10000 8100 0
data
df <- structure(list(x = c(6000, NA, 10000, 100, 0), y = c(NA, NA,
NA, 7000, 0), z = c(NA, NA, NA, 1000, NA)), class = "data.frame",
row.names = c(NA, -5L))
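Putting the pieces together as a new column, here is a minimal sketch on the four-row table from the question, rebuilt as a base data.frame. The variable names `s` and `all_na` are only illustrative.

```r
# The table from the question as a plain data.frame
df <- data.frame(
  x = c(6000, NA, 100, 0),
  y = c(NA, NA, 7000, 0),
  z = c(NA, NA, 1000, NA)
)

# Row sums ignoring NA; an all-NA row would give 0 here
s <- rowSums(df, na.rm = TRUE)

# Rows without any non-NA value get NA instead of 0
all_na <- rowSums(!is.na(df)) == 0
s[all_na] <- NA

df$sum <- s
print(df)
```

Computing `s` before adding the column keeps the new `sum` column out of its own calculation.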

Excel: Merging Repeating Columns Within A Dataset

Forgive the very basic question. I have some output from an experiment that had 3 different versions of the same question, depending on the condition. The output file treated each question as a separate column, so my output looks like this, where the headers for the columns repeat:
Q1,Q2,Q3,Q1,Q2,Q3,Q1,Q2,Q3
1, 0, 1
-----------0, 1, 0
--------------------1, 1, 1
How would I be able to merge the output (preferably in Excel - my output is currently stored in an excel file, or alternatively in R), so that the desired output looks like this:
Q1,Q2,Q3
1, 0, 1
0, 1, 0
1, 1, 1
Thanks in advance!
One option in R, after reading the dataset with a function that reads the Excel file (read_excel etc.), is to loop over the unique column names, extract the matching columns, unlist them, and remove the NA elements (if any, assuming the blanks are NA):
nm1 <- unique(sub("\\.\\d+", "", names(df1)))
out <- sapply(nm1, function(x) na.omit(unlist(df1[grep(x, names(df1))])))
row.names(out) <- NULL
out
# Q1 Q2 Q3
#[1,] 1 0 1
#[2,] 0 1 0
#[3,] 1 1 1
Or with the tidyverse, using gather/spread (note that rowid comes from data.table, not the tidyverse):
library(tidyverse)
gather(df1, na.rm = TRUE) %>%
  mutate(key = str_remove(key, "\\.\\d+$"), ind = data.table::rowid(key)) %>%
  spread(key, value) %>%
  select(-ind)
# Q1 Q2 Q3
#1 1 0 1
#2 0 1 0
#3 1 1 1
Another option is to split the data into a list of data.frames that share the same base column name, then use coalesce to reduce each one to a single vector, taking the first non-NA element in each row:
split.default(df1, nm1) %>%
map_df(reduce, coalesce)
# A tibble: 3 x 3
# Q1 Q2 Q3
# <dbl> <dbl> <dbl>
#1 1 0 1
#2 0 1 0
#3 1 1 1
data
df1 <- structure(list(Q1 = c(1, NA, NA), Q2 = c(0, NA, NA), Q3 = c(1,
NA, NA), Q1.1 = c(NA, 0, NA), Q2.1 = c(NA, 1, NA), Q3.1 = c(NA,
0, NA), Q1.2 = c(NA, NA, 1), Q2.2 = c(NA, NA, 1), Q3.2 = c(NA,
NA, 1)), class = "data.frame", row.names = c(NA, -3L))
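For readers without the tidyverse, the split-and-coalesce idea can be sketched in base R. This is an illustration under the same assumption that blanks are read as NA. The helper `coalesce_cols` and the variable `base_names` are hypothetical names, and `ifelse` stands in for dplyr's coalesce.

```r
df1 <- data.frame(
  Q1 = c(1, NA, NA),   Q2 = c(0, NA, NA),   Q3 = c(1, NA, NA),
  Q1.1 = c(NA, 0, NA), Q2.1 = c(NA, 1, NA), Q3.1 = c(NA, 0, NA),
  Q1.2 = c(NA, NA, 1), Q2.2 = c(NA, NA, 1), Q3.2 = c(NA, NA, 1)
)

# Strip the ".1", ".2" suffixes that mark repeated headers
base_names <- sub("\\.\\d+$", "", names(df1))

# Keep the first non-NA value per row across a group of columns
coalesce_cols <- function(cols) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), cols)
}

# Group the columns by base name and collapse each group
merged <- as.data.frame(lapply(split.default(df1, base_names), coalesce_cols))
print(merged)
```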

Complete.obs of cor() function

I am establishing a correlation matrix for my data, which looks like this
df <- structure(list(V1 = c(56, 123, 546, 26, 62, 6, NA, NA, NA, 15
), V2 = c(21, 231, 5, 5, 32, NA, 1, 231, 5, 200), V3 = c(NA,
NA, 24, 51, 53, 231, NA, 153, 6, 700), V4 = c(2, 10, NA, 20,
56, 1, 1, 53, 40, 5000)), .Names = c("V1", "V2", "V3", "V4"), row.names = c(NA,
10L), class = "data.frame")
This gives the following data frame:
V1 V2 V3 V4
1 56 21 NA 2
2 123 231 NA 10
3 546 5 24 NA
4 26 5 51 20
5 62 32 53 56
6 6 NA 231 1
7 NA 1 NA 1
8 NA 231 153 53
9 NA 5 6 40
10 15 200 700 5000
I normally use a complete.obs command to establish my correlation matrix using this command
crm <- cor(df, use="complete.obs", method="pearson")
My question is: how does complete.obs treat the data? Does it omit every row that has an NA value, build an NA-free table, and compute the correlation matrix from that in one go, like this?
df2 <- structure(list(V1 = c(26, 62, 15), V2 = c(5, 32, 200), V3 = c(51,
53, 700), V4 = c(20, 56, 5000)), .Names = c("V1", "V2", "V3",
"V4"), row.names = c(NA, 3L), class = "data.frame")
Or does it omit NA values in a pairwise fashion? For example, when calculating the correlation between V1 and V2, do rows that contain an NA value in V3 (such as rows 1 and 2 in my example) get omitted too?
If that is the case, I am looking for a command that preserves as much of the data as possible by omitting NA values in a pairwise fashion.
Many thanks,
Look at the help file for cor, i.e. ?cor. In particular,
If ‘use’ is ‘"everything"’, ‘NA’s will propagate conceptually, i.e., a
resulting value will be ‘NA’ whenever one of its contributing
observations is ‘NA’.
If ‘use’ is ‘"all.obs"’, then the presence of missing observations
will produce an error. If ‘use’ is ‘"complete.obs"’ then missing
values are handled by casewise deletion (and if there are no complete
cases, that gives an error).
To get a better feel for what is going on, create an (even) simpler example:
df1 = df[1:5,1:3]
cor(df1, use="pairwise.complete.obs", method="pearson")
cor(df1, use="complete.obs", method="pearson")
cor(df1[3:5,], method="pearson")
So, when we use complete.obs, we discard the entire row if an NA is present. In my example, this means we discard rows 1 and 2. However, pairwise.complete.obs uses the non-NA values when calculating the correlation between V1 and V2.
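The difference can also be checked by counting how many rows each method uses. A minimal sketch on the same five rows and three columns; the names `ok` and `n_pairs` are only illustrative.

```r
# Rows 1-5, columns V1-V3 of the data above
df1 <- data.frame(
  V1 = c(56, 123, 546, 26, 62),
  V2 = c(21, 231, 5, 5, 32),
  V3 = c(NA, NA, 24, 51, 53)
)

# complete.obs: only rows without any NA are used
n_complete <- sum(complete.cases(df1))
print(n_complete)

# pairwise.complete.obs: rows available for each pair of variables
ok <- (!is.na(df1)) + 0        # 1 where a value is present, 0 where NA
n_pairs <- crossprod(ok)
print(n_pairs)

# complete.obs gives the same matrix as dropping rows 1 and 2 by hand
same_as_manual <- isTRUE(all.equal(cor(df1, use = "complete.obs"),
                                   cor(df1[3:5, ])))
print(same_as_manual)
```

Here the V1/V2 pair keeps all five rows under pairwise deletion, while every pair involving V3 is limited to the three rows where V3 is present.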
