I want to loop through a lot of columns in an R data frame and replace NAs with the column mean.
I can get the mean of a column like this:
mean(df$col20, na.rm = TRUE)
But this version gives the warning argument is not numeric or logical: returning NA:
mean(df[ , 20], na.rm = TRUE)
I tried the above syntax with a small dummy df including some NAs and it works fine. Any idea what else to look for to fix this?
P.S. head(df[20]) tells me it's a dbl and str(df) says it's num.
(And [ , 20] is just an example; I actually get lots of warnings because the line sits in a for loop, but I have executed it by itself as a test.)
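One possible cause of that warning (an assumption on my part, since the question does not say what class df is): if df is a tibble rather than a base data.frame, df[ , 20] returns a one-column tibble instead of a vector, and mean() on a tibble returns NA with exactly this warning. A minimal sketch with a hypothetical stand-in tibble:
library(tibble)
tb <- tibble(x = c(1, NA, 3))   # stand-in for the real df
mean(tb[, 1], na.rm = TRUE)     # NA, with "argument is not numeric or logical"
mean(tb[[1]], na.rm = TRUE)     # 2 -- [[ ]] extracts the underlying vector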
1) na.aggregate Create a logical vector ok which is TRUE for each numeric column and FALSE for other columns. Then use na.aggregate on the numeric columns.
library(zoo)
df <- data.frame(a = c(1, NA, 2), b = c("a", NA, "b")) # test data
ok <- sapply(df, is.numeric)
replace(df, ok, na.aggregate(df[ok]))
giving:
    a    b
1 1.0    a
2 1.5 <NA>
3 2.0    b
2) dplyr/tidyr Alternatively, use dplyr and tidyr. df is from above and the output is the same.
library(dplyr)
library(tidyr)
df %>%
  mutate(across(where(is.numeric), ~ replace_na(., mean(., na.rm = TRUE))))
3) collapse We could alternatively use ftransformv in collapse.
library(collapse)
library(zoo)
ftransformv(df, is.numeric, na.aggregate)
4) base A base solution would be:
fill_na <- function(x) {
  if (!is.numeric(x) || all(is.na(x))) x
  else replace(x, is.na(x), mean(x, na.rm = TRUE))
}
replace(df, TRUE, lapply(df, fill_na))
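For completeness, the for-loop version the question describes could look like this (a sketch; it assumes df is the data frame from above, skips non-numeric columns such as b, and uses [[ ]] so it also works on tibbles):
for (j in seq_along(df)) {
  if (is.numeric(df[[j]])) {
    df[[j]][is.na(df[[j]])] <- mean(df[[j]], na.rm = TRUE)
  }
}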
I need to find the max and min values in a data frame "df" like this:
col1 col2 col3
   7    4    5
   2   NA    6
   3    2    4
  NA   NA    1
The result should be: min = 1 and max = 7.
I have used this function:
min <- min(df, na.rm=TRUE)
max <- max(df, na.rm=TRUE)
but it gives me the following error:
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
So I converted all the values to numeric in this way:
df <- as.numeric(as.character(df))
but it introduces NAs by coercion and now the results are:
min = -Inf and max = Inf
How can I operate on the df ignoring NAs?
If the columns are not numeric, convert them with type.convert:
df <- type.convert(df, as.is = TRUE)
Or force the conversion via the matrix route:
df[] <- as.numeric(as.matrix(df))
Or with lapply
df[] <- lapply(df, function(x) as.numeric(as.character(x)))
With R 4.1.0 we can also do
sapply(df, \(x) as.numeric(as.character(x))) |>
range(na.rm = TRUE)
#[1] 1 7
Once the columns are numeric, the functions work as expected
min(df, na.rm = TRUE)
#[1] 1
max(df, na.rm = TRUE)
#[1] 7
Note that as.character/as.numeric require a vector input, not a data.frame.
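A quick illustration of that note, assuming df still holds the character (or factor) columns from the question:
as.numeric(as.character(df))           # data.frame input: NAs introduced by coercion
as.numeric(as.character(unlist(df)))   # vector input: works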
We could use the minMax function from the dataMaid package (it handles NAs):
library(dataMaid)
minMax(df, maxDecimals = 2)
Output:
Min. and max.: 2; 7
data:
library(tibble)
df <- tribble(
  ~col1, ~col2, ~col3,
  7, 4, 5,
  2, NA, 6,
  3, 2, 4,
  NA, NA, 1)
Another base R option
range(na.omit(as.numeric(unlist(df))))
#[1] 1 7
If the columns are of factor class, you should use (thanks to @akrun's comment):
as.numeric(as.character(unlist(df)))
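For example, with a hypothetical all-factor version of the data, the unlist route recovers the numeric range (a sketch):
dff <- data.frame(col1 = factor(c(7, 2, 3, NA)),
                  col2 = factor(c(4, NA, 2, NA)),
                  col3 = factor(c(5, 6, 4, 1)))
range(as.numeric(as.character(unlist(dff))), na.rm = TRUE)
#[1] 1 7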
I need to carry values forward from one column into the NA cells of the columns to its right. Example data is below:
df <- data.frame(a = c(1,2,NA,NA,NA,NA,NA,NA,NA,NA),
                 b = c(NA,NA,3,4,NA,NA,NA,NA,NA,NA),
                 c = c(NA,NA,NA,NA,5,6,NA,NA,NA,NA),
                 d = c(NA,NA,NA,NA,NA,NA,7,8,NA,NA),
                 e = c(NA,NA,NA,NA,NA,NA,NA,NA,9,10))
I have tried to use a loop with the na.locf function in zoo, but this only carries forward the previous column's values:
columns <- seq(2, ncol(df))
output <- list()
for (i in columns) {
  output[[i]] <- t(zoo::na.locf(t(df[, (i-1):i])))[, 2]
}
The expected output would be:
expected_output <- data.frame(a = c(1,2,NA,NA,NA,NA,NA,NA,NA,NA),
                              b = c(1,2,3,4,NA,NA,NA,NA,NA,NA),
                              c = c(1,2,3,4,5,6,NA,NA,NA,NA),
                              d = c(1,2,3,4,5,6,7,8,NA,NA),
                              e = c(1,2,3,4,5,6,7,8,9,10))
Transpose df, apply na.locf, transpose again and replace df contents with that to make it a data frame with the correct names.
library(zoo)
out <- replace(df, TRUE, t(na.locf(t(df), fill = NA)))
identical(out, expected_output)
## [1] TRUE
This also works and is similar except it applies na.locf0 to each row instead of applying na.locf to the transpose.
out <- replace(df, TRUE, t(apply(df, 1, na.locf0)))
identical(out, expected_output)
## [1] TRUE
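If you want to avoid zoo altogether, the same row-wise carry-forward can be sketched in base R with Reduce() over the columns (my addition, not part of the answer above):
out2 <- as.data.frame(setNames(
  Reduce(function(prev, cur) ifelse(is.na(cur), prev, cur), df, accumulate = TRUE),
  names(df)))
identical(out2, expected_output)
## [1] TRUE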
I suspect that this will be a duplicate, but my efforts to find an answer have failed. Suppose I have a data frame whose columns are all either integers or factors. Some of the factor columns have many levels and some do not. Suppose I want to subset the data so that I only get the columns with factors that have fewer than 10 levels. How can I do this? My first thought was to write a particularly nasty sapply command, but I'm hoping for a better way.
We can use select_if
library(dplyr)
df1 %>%
  select_if(~ is.factor(.) && nlevels(.) < 10)
With a reproducible example using iris
data(iris)
iris %>%
  select_if(~ is.factor(.) && nlevels(.) < 10)
Or using sapply
i1 <- sapply(df1, function(x) is.factor(x) && nlevels(x) < 10)
df1[i1]
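Note that select_if() is superseded in dplyr 1.0.0 and later; the same selection can be written with select() plus where() (a sketch on the same df1):
df1 %>%
  select(where(function(x) is.factor(x) && nlevels(x) < 10))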
With data.table you can do:
library(data.table)
setDT(df)
df[, .SD, .SDcols = sapply(df, function(x) length(levels(x)) < 10)]
Example:
df <- data.table(x = factor(1:3, levels = 1:5), y = factor(1:3, levels = 1:10))
df[, .SD, .SDcols = sapply(df, function(x) length(levels(x)) > 5)]
   y
1: 1
2: 2
3: 3
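A base R alternative (my addition, not part of the data.table answer) is Filter(), which keeps the columns for which the predicate returns TRUE. It expects a plain data.frame such as df1 above; for a data.table, use the .SDcols approach or wrap the object in as.data.frame() first:
Filter(function(x) is.factor(x) && nlevels(x) < 10, df1)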
I have a data.frame "dat" and a numeric vector "test":
code <- c("A22", "B15", "C03")
v.1 <- 1:3
v.2 <- 3:1
v.3 <- c(2, NA, 2)
bob <- c("yes", "no", "no")
dat <- data.frame(code, v.1, v.2, v.3, bob, stringsAsFactors = FALSE)
test <- c(3, 1, 2)
I want to find the row in the data.frame where the second to fourth columns ("v.1", "v.2", "v.3") contain the same values as the vector, in the same order, and return the value from the "code"-column (in this case "C03").
I tried
dat[dat[, 2:4] == test]$code
and
which(apply(dat, 1, function(x) all.equal(dat[, 2:4], test)) == FALSE)
neither of which works.
I would prefer a solution with base R.
Your second option (with which) does not work for several reasons: using apply on the whole of dat converts it to a character matrix; you're not actually using x, the function argument; and you should use all instead of all.equal, and probably TRUE instead of FALSE (though the comparison is actually not needed).
You can modify it a bit to make it work:
which(apply(dat[, 2:4], 1, function(x) all(x==test)))
[1] 3
Or
dat[apply(dat[, 2:4], 1, function(x) all(x==test)), "code"]
[1] C03
With apply we can paste the columns of each row together, check which row gives the same string as test pasted together, and select the code column of that row.
dat[apply(dat[2:4], 1, paste0, collapse = "|") ==
      paste0(test, collapse = "|"), "code"]
#[1] C03
We just need to replicate 'test' to make the lengths equal before doing the comparison:
dat[2:4] == test[col(dat[2:4])]
If we need the 'code'
dat$code[rowSums(dat[2:4] == test[col(dat[2:4])], na.rm = TRUE) == 3]
#[1] C03
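To see what the replication does, the expanded test is a matrix in which every row equals test (a small illustration using the same dat and test):
matrix(test[col(dat[2:4])], nrow = nrow(dat))
#     [,1] [,2] [,3]
#[1,]    3    1    2
#[2,]    3    1    2
#[3,]    3    1    2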
I am writing a function which needs to check whether (and which!) columns (variables) have all missing values (NA, <NA>). The following is a fragment of the function:
test1 <- data.frame (matrix(c(1,2,3,NA,2,3,NA,NA,2), 3,3))
test2 <- data.frame (matrix(c(1,2,3,NA,NA,NA,NA,NA,2), 3,3))
na.test <- function (data) {
if (colSums(!is.na(data) == 0)){
stop ("The some variable in the dataset has all missing value,
remove the column to proceed")
}
}
na.test (test1)
Warning message:
In if (colSums(!is.na(data) == 0)) { :
the condition has length > 1 and only the first element will be used
Q1: Why do I get this warning, and how can I fix it?
Q2: Is there any way to find which columns are all NA, for example outputting the variable names or column numbers?
This is easy enough to do with sapply and a small anonymous function:
sapply(test1, function(x) all(is.na(x)))
X1 X2 X3
FALSE FALSE FALSE
sapply(test2, function(x) all(is.na(x)))
X1 X2 X3
FALSE TRUE FALSE
And inside a function:
na.test <- function(x) {
  w <- sapply(x, function(x) all(is.na(x)))
  if (any(w)) {
    stop(paste("All NA in columns", paste(which(w), collapse = ", ")))
  }
}
na.test(test1)
na.test(test2)
Error in na.test(test2) : All NA in columns 2
In dplyr
ColNums_NotAllMissing <- function(df) { # helper function
  as.vector(which(colSums(is.na(df)) != nrow(df)))
}
df %>%
  select(ColNums_NotAllMissing(.))
example:
x <- data.frame(x = c(NA, NA, NA), y = c(1, 2, NA), z = c(5, 6, 7))
x %>%
  select(ColNums_NotAllMissing(.))
or, the other way around
Cols_AllMissing <- function(df) { # helper function
  as.vector(which(colSums(is.na(df)) == nrow(df)))
}
x %>%
  select(-Cols_AllMissing(.))
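With dplyr 1.0.0 or later the helper functions are not needed; where() can do the selection directly (a sketch using the same x as above):
x %>%
  select(where(function(x) !all(is.na(x))))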
To find the columns with all values missing:
allmisscols <- apply(dataset, 2, function(x) all(is.na(x)))
colswithallmiss <- names(allmisscols[allmisscols > 0])
print("the columns with all values missing")
print(colswithallmiss)
This one will generate the column names that are full of NAs:
library(purrr)
df %>% keep(~all(is.na(.x))) %>% names
To test whether columns have all missing values:
apply(test1, 2, function(x) all(is.na(x)))
To keep only the columns with no missing values:
test1.nona <- test1[, colSums(is.na(test1)) == 0]
dplyr approach to finding the number of NAs for each column:
df %>%
  summarise_all(funs(sum(is.na(.))))
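funs() is deprecated in current dplyr; an across() equivalent would be (a sketch):
df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))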
The following command gives you a named logical vector flagging the columns that contain NA values:
sapply(dataframe, function(x) any(is.na(x)))
It's an improvement on the first answer, which doesn't work properly in some cases.
sapply(b, function(X) sum(is.na(X)))
This will give you the count of NAs in each column of the dataset, returning 0 for columns with no NAs.
Variant dplyr approach:
dataframe %>% select_if(function(x) all(is.na(x))) %>% colnames()