function in lapply not working when applied to some columns - r

I have a dataframe, say
data <- data.frame(x1 = c(5, NA, 1, 6),
                   x2 = c(4, 3, 0, NA),
                   c = c('a', 'b', 'a', NA)); data
x1 x2 c
1 5 4 a
2 NA 3 b
3 1 0 a
4 6 NA NA
I want to replace the NAs with 0 in the x1 and x2 columns only, so I use the lapply function as below:
data[c("x1","x2")] <- lapply(data[c("x1","x2")], function (x) {x[is.na(x)] <- 0}); data
This does not work as the output is:
x1 x2 c
1 0 0 a
2 0 0 b
3 0 0 a
4 0 0 NA
I then tried to create a separate function
fxNAtoZero <- function (x) {
  x[is.na(x)] <- 0
  return(x)
}
and if I use this like below:
data[c("x1","x2")] <- lapply(data[c("x1","x2")], fxNAtoZero); data
it works, but the first case does not. I do not understand why the function created on the fly does not work in lapply.

Your problem is that your first attempt returns only the last expression evaluated inside the anonymous function, and that assignment evaluates to 0:
lapply(data[c("x1","x2")], function (x) {x[is.na(x)] <- 0})
$x1
[1] 0
$x2
[1] 0
while your second attempt explicitly returns the entire vector after replacing the NAs, because you used return. If you want to stick with lapply, you could prefer:
lapply(data[c("x1","x2")], function (x) {ifelse(is.na(x),0,x) })
because ifelse returns a vector of the same length as its input.
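If you prefer to keep the assignment style rather than switch to ifelse, the same lapply call works once the anonymous function returns the modified vector as its last expression. A minimal sketch, starting again from the original data:
data[c("x1","x2")] <- lapply(data[c("x1","x2")],
                             function(x) { x[is.na(x)] <- 0; x })
# or equivalently, using base R's replace():
data[c("x1","x2")] <- lapply(data[c("x1","x2")],
                             function(x) replace(x, is.na(x), 0))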

You can also try using dplyr and tidyr verbs to reshape the data and replace the NAs only where wanted. This is perhaps a bit more readable than lapply, but note that all values end up as character strings, because column c is character.
library(dplyr)
library(tidyr) # for gather() and spread()
data <- data.frame(x1 = c(5, NA, 1, 6),
                   x2 = c(4, 3, 0, NA),
                   c = c('a', 'b', 'a', NA),
                   id = 1:4) # create a row id, needed for spread
data %>% gather(k, v, -id) %>%
  mutate(v = ifelse(is.na(v) & k != 'c', 0, v)) %>% # replace NAs based on conditions
  spread(k, v) %>% select(-id)
c x1 x2
1 a 5 4
2 b 0 3
3 a 1 0
4 <NA> 6 0
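As an aside (not part of the original answer), if the tidyr version in use provides replace_na(), the same replacement can be written without reshaping at all. A minimal sketch on the same data:
library(dplyr)
library(tidyr)
data %>% replace_na(list(x1 = 0, x2 = 0)) %>% select(-id)
#   x1 x2    c
# 1  5  4    a
# 2  0  3    b
# 3  1  0    a
# 4  6  0 <NA>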

Related

Perform a function on a dataframe across variable number of columns after removing zeros

I'm trying to create a function to which I can pass a function name as an argument, to be applied over a variable number of columns after removing zeros. I'm not too comfortable with ellipses yet, and I'm guessing that is where the problem arises. As written, the function takes all the values of the specified columns, summarizes them with the selected function, and mutates that single value into every row. Instead, I'd like the function to operate across each row (e.g. like rowMeans).
Example:
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
df <- data.frame(a, b, c, d)
FUNexcludeZero <- function(function_name, ...){
  # Match function name
  FUN <- match.fun(function_name)
  # get all the values - I'm sure this is the problem, need to somehow turn it back into a df?
  vals <- unlist(list(...))
  # Remove 0's and perform function
  valsNo0 <- vals[vals != 0]
  compiledVals <- FUN(valsNo0)
  return(compiledVals)
}
df %>%
mutate(foo = FUNexcludeZero(function_name = 'sd', a, b))
a b c d foo
1 1 0 5 2 1.457738
2 2 4 4 0 1.457738
3 3 3 3 1 1.457738
4 4 0 2 0 1.457738
5 5 1 1 4 1.457738
df %>%
mutate(foo = FUNexcludeZero(function_name = 'min', a, b))
a b c d foo
1 1 0 5 2 1
2 2 4 4 0 1
3 3 3 3 1 1
4 4 0 2 0 1
5 5 1 1 4 1
# Try row-function (same error occurs with rowMeans)
df %>%
mutate(foo = FUNexcludeZero(function_name = 'pmin', a, b))
Error in mutate_impl(.data, dots) :
Column `foo` must be length 5 (the number of rows) or one, not 8
For function_name = 'sd' the column should be c(NA, 1.41, 0, NA, 2.828) and the min and pmin should be c(1, 2, 3, 4, 1). I'm 100% sure the error has something to do with the list/unlist, but any other way I try it I end up with an error.
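Tracing the function body on a and b shows where the length 8 in the error message comes from:
vals <- unlist(list(a, b))   # a and b are flattened into one vector: length 10
valsNo0 <- vals[vals != 0]   # the two zeros in b are dropped: length 8
length(pmin(valsNo0))        # pmin() of a single vector is that vector, so length 8
# mutate() then complains that foo has length 8 instead of 5 (or 1)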
I am not sure if this is exactly what you want. You need to perform a row-wise operation on the two vectors, so I used the apply function. This should work for any number of equal-length vectors.
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
#df <- data.frame(a, b, c, d) #not used
FUNexcludeZero <- function(function_name, ...){
  # Match function name
  FUN <- match.fun(function_name)
  # combine the vectors into a matrix
  df <- cbind(...)
  # remove 0 from rows and apply function to the rows
  compiledVals <- apply(df, 1, function(x) {
    x <- x[x != 0]
    FUN(x)
  })
  return(compiledVals)
}
FUNexcludeZero(function_name = 'sd', a, b)
#[1] NA 1.414214 0.000000 NA 2.828427
FUNexcludeZero(function_name = 'min', a, b)
#[1] 1 2 3 4 1
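Used inside the original dplyr pipeline, this version returns one value per row, so mutate() no longer complains. A sketch, reusing the asker's df:
library(dplyr)
df <- data.frame(a, b, c, d)
df %>% mutate(foo = FUNexcludeZero(function_name = 'min', a, b))
#   a b c d foo
# 1 1 0 5 2   1
# 2 2 4 4 0   2
# 3 3 3 3 1   3
# 4 4 0 2 0   4
# 5 5 1 1 4   1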

Double for loop with NA in R

I have a couple of questions about my R script. I have a data set with many series that contain both NAs and numeric values. I would like to replace an NA with 0 once the series has started (i.e. once a numeric value has appeared), but keep the NA if the series has not started yet.
As shown below, in the second column I would like to keep the first two NAs but replace the fourth one with 0.
[example data shown as an image in the original post]
Here is my script, but it doesn't work:
[script shown as an image in the original post]
Any suggestions would be very welcome. Many thanks.
In case you, or anyone else, want to avoid for loops:
# example dataset
df = data.frame(x1 = c(23,NA,NA,35),
x2 = c(NA,NA,45,NA),
x3 = c(4,34,NA,5))
# function to replace NAs not in the beginning of vector with 0
f = function(x) { x[is.na(x) & cumsum(!is.na(x)) != 0] = 0; x }
# apply function and save as dataframe
data.frame(sapply(df, f))
# x1 x2 x3
# 1 23 NA 4
# 2 0 NA 34
# 3 0 45 0
# 4 35 0 5
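To see why f only zeroes the NAs that come after the first observed value, it helps to trace the cumsum() condition on a single column (illustrative, using x2 from the example):
x <- c(NA, NA, 45, NA)
cumsum(!is.na(x))                   # 0 0 1 1  -> 0 means the series has not started yet
is.na(x) & cumsum(!is.na(x)) != 0   # FALSE FALSE FALSE TRUE -> only this NA is set to 0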
Or using tidyverse and the same function f:
library(tidyverse)
df %>% map_df(f)
# # A tibble: 4 x 3
# x1 x2 x3
# <dbl> <dbl> <dbl>
# 1 23. NA 4.
# 2 0. NA 34.
# 3 0. 45. 0.
# 4 35. 0. 5.
If this is your dataset:
ORIGINAL_DATA <- data.frame(X1 = c(23, NA, NA, 35),
                            X2 = c(NA, NA, 45, NA),
                            X3 = c(4, 34, NA, 5))
This could probably work:
for (i in 1:ncol(ORIGINAL_DATA)) {
  for (j in 1:nrow(ORIGINAL_DATA)) {
    if (!is.na(ORIGINAL_DATA[j, i])) {
      rows <- j:nrow(ORIGINAL_DATA)
      ORIGINAL_DATA[rows, i] <- ifelse(is.na(ORIGINAL_DATA[rows, i]), 0,
                                       ORIGINAL_DATA[rows, i])
      break # the rest of column i has been handled, move on to the next column
    }
  }
}

The Number of Rows with a Specific Number of Missing Values

Imagine a small data set like the one below, composed of three variables:
v1 <- c(0, 1, NA, 1, NA, 0)
v2 <- c(0, 0, NA, 1, NA, NA)
v3 <- c(1, NA, 0, 0, NA, 0)
df <- data.frame(v1, v2, v3)
df
v1 v2 v3
1 0 0 1
2 1 0 NA
3 NA NA 0
4 1 1 0
5 NA NA NA
6 0 NA 0
One can use the is.na command as follows to calculate the number of rows with at least one missing value - and R would return 4:
sum(is.na(df$v1) | is.na(df$v2) | is.na(df$v3))
Or the number of rows with all three values missing - and R would return 1:
sum(is.na(df$v1) & is.na(df$v2) & is.na(df$v3))
Two questions at this point:
(1) How can I calculate the number of rows where "exactly one" or "exactly two" values are missing?
(2) If I am to do the above in a large data set, how can I limit the scope of the calculation to v1, v2 and v3 (that is, without having to create a subset)?
I tried variations of is.na, nrow and df, but could not get any of them to work.
Thanks!
We can use rowSums on the logical matrix (is.na(df)) and check whether the number of NAs is equal to the value of interest.
n1 <- 1
sum(rowSums(is.na(df))==n1)
To make it easier, create a function to do this
f1 <- function(dat, n){
  sum(rowSums(is.na(dat)) == n)
}
f1(df, 0)
#[1] 2
f1(df, 1)
#[1] 2
f1(df, 3)
#[1] 1
f1(df, 2)
#[1] 1
NOTE: rowSums is very fast, but on a large dataset creating the full logical matrix can cause memory problems. In that case, we can use Reduce after looping over the columns of the dataset (lapply(df, is.na)).
sum(Reduce(`+`, lapply(df, is.na))==1)
#[1] 2
f2 <- function(dat, n){
  sum(Reduce(`+`, lapply(dat, is.na)) == n)
}
f2(df, 1)
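As a further aside (not in the original answer), if the counts for every possible number of missing values are wanted at once, tabulating the row sums gives them in a single call:
table(rowSums(is.na(df)))
# 0 1 2 3
# 2 2 1 1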
Try this:
num.rows.with.x.NA <- function(df, x, cols = names(df)) {
  return(sum(apply(df, 1, function(y) sum(is.na(y[cols])) == x)))
}
df
v1 v2 v3
1 0 0 1
2 1 0 NA
3 NA NA 0
4 1 1 0
5 NA NA NA
6 0 NA 0
num.rows.with.x.NA(df, 0, names(df))
#[1] 2
num.rows.with.x.NA(df, 1, names(df))
#[1] 2
num.rows.with.x.NA(df, 2, names(df))
#[1] 1
num.rows.with.x.NA(df, 3, names(df))
#[1] 1
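For question (2), limiting the calculation to v1, v2 and v3 in a wider data set is then just a matter of passing those names through cols (bigdf here is a hypothetical data frame that contains v1, v2 and v3 among other columns):
# bigdf is hypothetical; only v1, v2 and v3 are inspected for NAs
num.rows.with.x.NA(bigdf, 1, cols = c("v1", "v2", "v3"))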

R: Find the Variance of all Non-Zero Elements in Each Row

I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on the non-zero values, and return NA where this is not possible.
So for instance, in this example the variance for ID 1 would be var(c(20, 25)),
for ID 2 it would return NA because you can't calculate a variance from just one entry, for ID 3 the variance would be var(c(15, 32, 16)), and for ID 4 it would again return NA because there are no non-zero values to calculate a variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len = nrow(d)
variances = numeric(len)
for (i in 1:len){
  # get all non-zero values in the ith row of the data into a vector nonzerodat here
  currentvar = var(nonzerodat)
  variances[i] = currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
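If the variances should sit next to the IDs rather than in a bare vector, the result of apply() can be assigned straight back to the data frame:
Data$variance <- apply(Data[, -1], 1, var_nonzero)
Data
#   ID Value1 Value2 Value3 variance
# 1  1     20     25      0     12.5
# 2  2      2      0      0       NA
# 3  3     15     32     16     91.0
# 4  4      0      0      0       NA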
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
melt(., id.var = "id") %>%
group_by(id) %>%
summarise(variance = var(value[value!=0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA

Insert count number of elements in columns into table in R

I'm working in R and I've got a matrix with A, B and NA values, and I would like to count the number of A, B and NA values in every column and insert the results into a table. I used the code below to count the A, B and NA values.
mydata <- matrix(c(rep("A", 8), rep("B", 2), rep(NA, 2), rep("A", 4),
rep(c("B", "A", "A", "A"), 2), rep("A", 4)), ncol = 4, byrow = TRUE)
myFun <- function(x) {
  data.frame(n.A = sum(x == "A", na.rm = TRUE),
             n.B = sum(x == "B", na.rm = TRUE),
             n.NA = sum(is.na(x)))
}
count <- apply(mydata, 2, myFun)
Now, I need to insert the results from count (count <- apply(mydata, 2, myFun)) into a dataframe as a table with a header.
Almost identical in concept to mnel's answer, you can also try the following in base R:
sapply(as.data.frame(mydata),
function(x) table(factor(x, levels = unique(as.vector(mydata))),
useNA = "always"))
# V1 V2 V3 V4
# A 4 6 6 6
# B 3 1 0 0
# <NA> 0 0 1 1
Here, rather than manually specifying the factor levels, I've made use of the data in mydata.
I think the easiest is to use plyr with adply or ldply.
You can replace myFun with a call to table.
library(plyr)
adply(mydata,2, function(x) table(factor(x, levels = c('A','B')), useNA = 'always'))
# X1 A B NA
# 1 1 4 3 0
# 2 2 6 1 0
# 3 3 6 0 1
# 4 4 6 0 1
If you have large data, then plyr isn't the way to go. apply will work nicely:
apply(mydata, 2, function(x) {
xx <- table(factor(x, levels = c('A','B')), useNA = 'always')
names(xx) <- c('nA','nB', 'nNA')
xx})
[,1] [,2] [,3] [,4]
nA 4 6 6 6
nB 3 1 0 0
nNA 0 0 1 1
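If the question's goal of storing the counts in a data frame with a header is taken literally, the matrix returned by apply can simply be wrapped in as.data.frame (a sketch reusing the apply call above):
counts <- as.data.frame(apply(mydata, 2, function(x) {
  xx <- table(factor(x, levels = c('A', 'B')), useNA = 'always')
  names(xx) <- c('nA', 'nB', 'nNA')
  xx
}))
counts
#     V1 V2 V3 V4
# nA   4  6  6  6
# nB   3  1  0  0
# nNA  0  0  1  1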
