I'm having trouble handling NAs while calculating aggregated means. Please see the following code:
tab=data.frame(a=c(1:3,1:3), b=c(1,2,NA,3,NA,NA))
tab
a b
1 1 1
2 2 2
3 3 NA
4 1 3
5 2 NA
6 3 NA
attach(tab)
aggregate(b, by=list(a), data=tab, FUN=mean, na.rm=TRUE)
Group.1 x
1 1 2
2 2 2
3 3 NaN
I want NA instead of NaN if the vector has all NAs i.e. I want the output to be
Group.1 x
1 1 2
2 2 2
3 3 NA
I tried using a custom function:
adjmean=function(x) {if(all(is.na(x))) NA else mean(x,na.rm=TRUE)}
However, I get the following error:
aggregate(b, by=list(a), data=tab, FUN=adjmean)
Error in FUN(X[[1L]], ...) :
unused argument (data = list(a = c(1, 2, 3, 1, 2, 3), b = c(1, 2, NA, 3, NA, NA)))
In short, if the column has all NAs I want NA as an output instead of NaN. If it has few NAs, then it should compute the mean ignoring the NAs.
Any help would be appreciated.
Thanks
This is very close to what you had, but replaces mean(x, na.rm=TRUE) with a custom function which either computes the mean of the non-NA values, or supplies NA itself:
R> with(tab,
aggregate(b, by=list(a), FUN=function(x)
if (any(is.finite(z<-na.omit(x)))) mean(z) else NA))
Group.1 x
1 1 2
2 2 2
3 3 NA
R>
That is really one line, but I broke it up to make it fit into the SO display.
And you already had a similar idea, but I altered the function a bit more to return suitable values in all cases.
There is nothing wrong with your function. What is wrong is that you are passing an argument, data, that the default method for aggregate does not have, so it is forwarded to FUN through the ... argument:
adjmean = function(x) {if(all(is.na(x))) NA else mean(x,na.rm=TRUE)}
attach(tab) ## Just because you did it. I don't recommend this.
## Your error
aggregate(b, by=list(a), data=tab, FUN=adjmean)
# Error in FUN(X[[i]], ...) :
# unused argument (data = list(a = c(1, 2, 3, 1, 2, 3), b = c(1, 2, NA, 3, NA, NA)))
## Dropping the "data" argument
aggregate(b, list(a), FUN = adjmean)
# Group.1 x
# 1 1 2
# 2 2 2
# 3 3 NA
If you want to use the data argument, use the formula method for aggregate. However, this method defaults to na.action = na.omit, which drops the rows containing NA before FUN is called, so group 3 disappears from the result. To keep it, pass na.action = na.pass.
Example:
detach(tab) ## I don't like having things attached
aggregate(b ~ a, data = tab, adjmean)
# a b
# 1 1 2
# 2 2 2
aggregate(b ~ a, data = tab, adjmean, na.action = na.pass)
# a b
# 1 1 2
# 2 2 2
# 3 3 NA
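Another route, sketched here on the assumption that a NaN in the result can only come from an all-NA group, is to let mean(..., na.rm = TRUE) produce NaN and convert it afterwards. The name res is only for illustration.

```r
tab <- data.frame(a = c(1:3, 1:3), b = c(1, 2, NA, 3, NA, NA))

# na.action = na.pass keeps the NA rows, so group 3 still reaches mean();
# na.rm = TRUE is forwarded to mean() through ...
res <- aggregate(b ~ a, data = tab, FUN = mean, na.rm = TRUE, na.action = na.pass)

# mean() of an all-NA group is NaN; turn those into NA afterwards
res$b[is.nan(res$b)] <- NA
res
```

If the data could contain genuine NaN values that should stay NaN, the custom function approach above is probably the safer choice.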
I have a dataset with over 200,000 rows, and when I ran summary(df), I noticed that it contains Inf values. However, when I tried to check using any(is.na(Data) | is.infinite(Data)), I got this error:
Error in is.infinite(Data) :
default method not implemented for type 'list'
I then tried to replace the Inf with NA, yet I got the same error.
>Data[is.infinite(Data)] <- NA
Error in is.infinite(Data) :
default method not implemented for type 'list'
Please is there something I am not doing right?
Under the hood, a data frame is a special type of list, and as the message says, is.infinite can't be used on lists. Let's use this example:
Data <- data.frame(a = c(1, 2, Inf), b = c(-Inf, 3, 4))
Data
#> a b
#> 1 1 -Inf
#> 2 2 3
#> 3 Inf 4
is.infinite(Data)
#> Error in is.infinite(Data): default method not implemented for type 'list'
We instead need to apply your NA replacement to each column in the data frame. We can do this quickly using
Data[] <- lapply(Data, function(x) replace(x, is.infinite(x), NA))
Data
#> a b
#> 1 1 NA
#> 2 2 3
#> 3 NA 4
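If the data frame also has non-numeric columns, it may help to restrict the replacement to the numeric ones and to check first which columns are affected. This is only a sketch: the id column and the names num and has_inf are made up for illustration.

```r
# Example data with an added non-numeric column (id is hypothetical)
Data <- data.frame(a = c(1, 2, Inf), b = c(-Inf, 3, 4), id = c("x", "y", "z"))

# Which columns are numeric, and which of those contain Inf or -Inf?
num <- vapply(Data, is.numeric, logical(1))
has_inf <- vapply(Data, function(x) is.numeric(x) && any(is.infinite(x)), logical(1))
has_inf

# Replace infinite values in the numeric columns only
Data[num] <- lapply(Data[num], function(x) replace(x, is.infinite(x), NA))
Data
```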
An alternative that requires a bit less code would be to convert your data frame to a matrix on the fly to get the replacement indices which are applied to the data frame itself for subsetting.
Data <- data.frame(a = c(1, 2, Inf), b = c(-Inf, 3, 4))
replace(Data, is.infinite(as.matrix(Data)), NA)
#> a b
#> 1 1 NA
#> 2 2 3
#> 3 NA 4
Created on 2023-02-16 with reprex v2.0.2
I want to write a function that creates a new column with rowmeans for Columns 1-3, only if more than 2 questions for Columns 1-3 per row were answered, otherwise print 'N'.
Here is my dataframe:
test <- data.frame(Manager1 = c(1, 3, 3), Manager2 = c(3, 4, 1), Manager3 = c(NA , 4, 2), Team1 = c(3, 4, 1))
Desired output:
Manager1 Manager2 Manager3 Team1 mean_score
       1        3       NA     3          N
       3        4        4     4    3.66667
       3        1        2     1          2
My code is as follows, but it's not working:
#create function
mean_score <- function(x) {
for (i in 1:nrow(test)){
if (sum(test[i, x] != "NA", na.rm = TRUE) >2){
test$mean_score[i] <- rowMeans(test[i, x], na.rm = TRUE)
} else
test$mean_score[i] <- print("N")
}
}
#compute function
mean_score(1:3)
What am I missing? Suggestions on better code are welcome too.
I think it is not ideal to put a character together with a numeric value, since it will convert the whole column into character. However, if that is what you want:
my_sum <- function(x, min = 2) {
  s <- mean(x, na.rm = TRUE)     # mean of the non-NA values
  no_na <- sum(!is.na(x))        # number of non-NA values
  if (no_na > min) s else "N"    # return the mean only if more than `min` values are present
}
test$mean <- apply(test[, 1:3], 1, my_sum)
test
Manager1 Manager2 Manager3 Team1 mean
1 1 3 NA 3 N
2 3 4 4 4 3.66666666666667
3 3 1 2 1 2
str(test)
'data.frame': 3 obs. of 5 variables:
$ Manager1: num 1 3 3
$ Manager2: num 3 4 1
$ Manager3: num NA 4 2
$ Team1 : num 3 4 1
$ mean : chr "N" "3.66666666666667" "2"
You can simply use rowMeans, which returns NA for any row that contains an NA. With three columns, this is equivalent to requiring that more than 2 questions in columns 1-3 were answered.
test$mean_score <- rowMeans(test[,1:3])
# Manager1 Manager2 Manager3 Team1 mean_score
#1 1 3 NA 3 NA
#2 3 4 4 4 3.666667
#3 3 1 2 1 2.000000
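If the threshold should be adjustable (for example, more columns than required answers), a variation along these lines might work while keeping the column numeric. It is a sketch only: cols, min_answers and answered are illustrative names, and NA stands in for the "N" from the question.

```r
test <- data.frame(Manager1 = c(1, 3, 3), Manager2 = c(3, 4, 1),
                   Manager3 = c(NA, 4, 2), Team1 = c(3, 4, 1))

cols <- 1:3        # columns to average
min_answers <- 2   # a mean is reported only if more than this many are answered

# Count the answered (non-NA) questions per row
answered <- rowSums(!is.na(test[, cols]))

# Mean of the answered questions where there are enough of them, NA otherwise
test$mean_score <- ifelse(answered > min_answers,
                          rowMeans(test[, cols], na.rm = TRUE),
                          NA)
test
```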
While GKi has a better, simpler answer that you should use, here is how I changed your code so that it works.
Generally, when writing a function you want the input to be the data frame (in this case test) and build the function from there. Your original function changed only a local copy of test and never returned it, so the new column was lost.
Another important point: it is usually cleaner to build a vector of values first and then attach that vector to the data frame, as I do in the code below, rather than filling a data frame column cell by cell. To do so, create the result vector before the loop.
Also, you don't need to use print() to insert a character into a vector.
Hope this helps explain why your function was having issues, but frankly GKi's answer is better for general R use!
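mean_score <- function(x, cols = 1:3) {
  mean_score <- character(nrow(x))             # create the result vector first
  for (i in seq_len(nrow(x))) {
    if (sum(!is.na(x[i, cols])) > 2) {         # more than 2 answers in the chosen columns
      mean_score[i] <- rowMeans(x[i, cols], na.rm = TRUE)
    } else {
      mean_score[i] <- "N"
    }
  }
  x$mean_score <- mean_score                   # attach the finished vector
  return(x)
}
mean_score(test)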
Can somebody please help me with a recode from SPSS into R?
SPSS code:
RECODE variable1
(1,2=1)
(3 THRU 8 =2)
(9, 10 =3)
(ELSE = SYSMIS)
INTO variable2
I can create new variables with the different values. However, I'd like it to be in the same variable, as SPSS does.
Many thanks.
x <- y<- 1:20
x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
y[x %in% (1:2)] <- 1
y[x %in% (3:8)] <- 2
y[x %in% (9:10)] <- 3
y[!(x %in% (1:10))] <- NA
y
[1] 1 1 2 2 2 2 2 2 3 3 NA NA NA NA NA NA NA NA NA NA
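For contiguous ranges like these, base R's cut() could do the same recode in one call. This is a sketch that assumes the variable holds whole numbers only; with fractional values such as 2.5 the intervals would behave differently from the SPSS ranges. The variable names follow the question.

```r
variable1 <- 1:20

# The intervals are (0, 2], (2, 8] and (8, 10]; labels = FALSE returns the
# interval number, and anything outside the breaks becomes NA
variable2 <- cut(variable1, breaks = c(0, 2, 8, 10), labels = FALSE)
variable2
```

To overwrite the original variable instead of creating a new one, assign the result back to variable1.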
I wrote a function whose syntax is very similar to the SPSS RECODE command:
variable1 <- -1:11
recodeR(variable1, c(1, 2, 1), c(3:8, 2), c(9, 10, 3), else_do= "missing")
NA NA 1 1 2 2 2 2 2 2 3 3 NA
This function now also works for other examples. It is defined as follows:
recodeR <- function(vec_in, ..., else_do){
l <- list(...)
# extract the "from" values
from_vec <- unlist(lapply(l, function(x) x[1:(length(x)-1)]))
# extract the "to" values
to_vec <- unlist(lapply(l, function(x) rep(x[length(x)], length(x)-1)))
# plyr is required for mapvalues
require(plyr)
# recode the variable
vec_out <- mapvalues(vec_in, from_vec, to_vec)
# if "missing" is given, every value not listed in a rule becomes NA (like ELSE = SYSMIS).
# Otherwise unmatched values stay the same
if(else_do == "missing"){
vec_out[!(vec_in %in% from_vec)] <- NA
}
# return resulting vector
return(vec_out)}
I'm trying to remove all the NA values from a list of data frames. The only way I have got it to work is by cleaning the data with complete.cases in a for loop. Is there another way of doing this with lapply as I had been trying for a while to no avail. Here is the code that works.
I start with
data_in <- lapply (file_name,read.csv)
Then have:
clean_data <- list()
for (i in seq_along(id)) {
clean_data[[i]] <- data_in[[i]][complete.cases(data_in[[i]]), ]
}
But what I tried to get to work was using lapply all the way like this.
comp <- lapply(data_in, complete.cases)
clean_data <- lapply(data_in, data_in[[id]][comp,])
Which returns this error "Error in [.default(xj, i) : invalid subscript type 'list' "
What I'd like to know is whether there are alternatives, or whether I was going about this the right way. And why didn't the last example work?
Thank you so much for your time. Have a nice day.
I'm not sure what you expected with
clean_data <- lapply(data_in, data_in[[id]][comp,])
The second parameter to lapply should be a proper function to which each member of the data_in list will be passed one at a time. Your expression data_in[[id]][comp,] is not a function. I'm not sure where you expected id to come from, but lapply does not create magic variables for you like that. Also, at this point comp is now a list itself of indices. You are making no attempt to iterate over this list in sync with your data_in list. If you wanted to do it in two separate steps, a more appropriate approach would be
comp <- lapply(data_in, complete.cases)
clean_data <- Map(function(d,c) {d[c,]}, data_in, comp)
Here we use Map to iterate over the data_in and comp lists simultaneously. They each get passed in to the function as a parameter and we can do the proper extraction that way. Otherwise, if we wanted to do it in one step, we could do
clean_data <- lapply(data_in, function(x) x[complete.cases(x),])
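To see that the two routes agree, here is a small self-contained sketch. The toy data_in list below is made up; in the question it comes from read.csv.

```r
# Toy stand-in for the list of data frames read from files
data_in <- list(
  data.frame(x = 1:3, y = c(1, NA, 3)),
  data.frame(x = c(NA, 2), y = c("a", "b"))
)

# Two steps: logical index per data frame, then walk both lists in parallel
comp <- lapply(data_in, complete.cases)
two_step <- Map(function(d, keep) d[keep, ], data_in, comp)

# One step: compute the index and subset inside the same function
one_step <- lapply(data_in, function(d) d[complete.cases(d), ])

identical(two_step, one_step)
```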
welcome to SO, please provide some working code next time
here is how i would do it with na.omit (since complete.cases only returns a logical)
(dat.l <- list(dat1 = data.frame(x = 1:2, y = c(1, NA)),
dat2 = data.frame(x = 1:3, y = c(1, NA, 3))))
# $dat1
# x y
# 1 1 1
# 2 2 NA
#
# $dat2
# x y
# 1 1 1
# 2 2 NA
# 3 3 3
Map(na.omit, dat.l)
# $dat1
# x y
# 1 1 1
#
# $dat2
# x y
# 1 1 1
# 3 3 3
Do you mean like the below?
> lst
$a
a
1 1
2 2
3 NA
4 3
5 4
$b
b
1 1
2 NA
3 2
4 3
5 4
$d
d e
1 NA 1
2 NA 2
3 3 3
4 4 NA
5 5 NA
> f <- function(x) x[complete.cases(x),]
> lapply(lst, f)
$a
[1] 1 2 3 4
$b
[1] 1 2 3 4
$d
d e
3 3 3
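Note that $a and $b above come back as plain vectors, because subsetting a one-column data frame with [rows, ] drops the data frame structure. If the elements should stay data frames, adding drop = FALSE should do it. A sketch with a shortened version of lst:

```r
lst <- list(a = data.frame(a = c(1, 2, NA, 3, 4)),
            d = data.frame(d = c(NA, NA, 3, 4, 5), e = c(1, 2, 3, NA, NA)))

# drop = FALSE keeps the result a data frame even when only one column is left
f <- function(x) x[complete.cases(x), , drop = FALSE]
res <- lapply(lst, f)
res
```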
file_name[complete.cases(file_name), ]
complete.cases() returns only a logical vector, so use it to index the rows. Assuming file_name here is a single data frame, this returns only the rows with no NA values.
I have a dataframe where some of the values are NA. I would like to remove the columns that contain them.
My data.frame looks like this
v1 v2
1 1 NA
2 1 1
3 2 2
4 1 1
5 2 2
6 1 NA
I tried to estimate the col mean and select the column means !=NA. I tried this statement, it does not work.
data=subset(Itun, select=c(is.na(colMeans(Itun))))
I got an error,
error : 'x' must be an array of at least two dimensions
Can anyone give me some help?
The data:
Itun <- data.frame(v1 = c(1,1,2,1,2,1), v2 = c(NA, 1, 2, 1, 2, NA))
This will remove all columns containing at least one NA:
Itun[ , colSums(is.na(Itun)) == 0]
An alternative way is to use apply:
Itun[ , apply(Itun, 2, function(x) !any(is.na(x)))]
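One thing to watch with the [ , cols] form: when only a single column survives, as with this data, the result is simplified to a vector. A short sketch of two ways to keep a data frame (keep is just an illustrative name):

```r
Itun <- data.frame(v1 = c(1, 1, 2, 1, 2, 1), v2 = c(NA, 1, 2, 1, 2, NA))

# TRUE for the columns without any NA
keep <- colSums(is.na(Itun)) == 0

as_vector <- Itun[, keep]                 # simplified to a numeric vector
as_df     <- Itun[, keep, drop = FALSE]   # stays a data frame
as_df2    <- Itun[keep]                   # list-style indexing also stays a data frame
```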
Here's a convenient way to do it using the dplyr function select_if(). Combine not (!), any() and is.na(), which is equivalent to selecting all columns that don't contain any NA values.
library(dplyr)
Itun %>%
select_if(~ !any(is.na(.)))
Alternatively, select(where(~FUNCTION)) can be used:
library(dplyr)
(df <- data.frame(x = letters[1:5], y = NA, z = c(1:4, NA)))
#> x y z
#> 1 a NA 1
#> 2 b NA 2
#> 3 c NA 3
#> 4 d NA 4
#> 5 e NA NA
# Remove columns where all values are NA
df %>%
select(where(~!all(is.na(.))))
#> x z
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e NA
# Remove columns with at least one NA
df %>%
select(where(~!any(is.na(.))))
#> x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
You can use transpose twice (note that the result is a matrix, not a data frame):
newdf <- t(na.omit(t(df)))
data[,!apply(is.na(data), 2, any)]
A base R method related to the apply answers is
Itun[!unlist(vapply(Itun, anyNA, logical(1)))]
v1
1 1
2 1
3 2
4 1
5 2
6 1
Here, vapply is used as we are operating on a list and, unlike apply, it does not coerce the object into a matrix. Also, since we know that the output will be a logical vector of length 1, we can tell vapply so and potentially get a little speed boost. For the same reason, I used anyNA instead of any(is.na()).
Another alternative, in base R, would be to make use of the Filter function
Filter(function(x) !any(is.na(x)), Itun)
With data.table it would be a little more cumbersome:
library(data.table)
setDT(Itun)[,.SD,.SDcols=setdiff((1:ncol(Itun)),
which(colSums(is.na(Itun))>0))]
To drop only the columns that are entirely NA, you can also try:
df <- df[,colSums(is.na(df))<nrow(df)]
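The difference between "all NA" and "any NA" is easy to mix up, so here is a base R sketch of both, reusing the example data frame from the dplyr answer above. The names all_na and any_na are only for illustration.

```r
df <- data.frame(x = letters[1:5], y = NA, z = c(1:4, NA))

na_count <- colSums(is.na(df))     # number of NAs per column
all_na <- na_count == nrow(df)     # columns that are entirely NA
any_na <- na_count > 0             # columns with at least one NA

no_all_na <- df[!all_na]   # drops y only
no_any_na <- df[!any_na]   # drops y and z
```

Indexing with a single logical vector (df[...]) keeps the result a data frame even when one column remains.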