Double for loop with NA in R - r

I have a couple of questions about my R script. I have a dataset with many series that contain NA and numeric values. I would like to replace each NA with 0 from the first numeric value onward, but keep the NAs that come before a series has started.
As shown below, in the second column for example, I would like to keep the first two NAs but replace the fourth value with 0.
example
Here is my script, but it doesn't work:
my actual script
I would be grateful for any suggestions.
Many thanks
ER

In case you, or anyone else, want to avoid for loops:
# example dataset
df = data.frame(x1 = c(23,NA,NA,35),
x2 = c(NA,NA,45,NA),
x3 = c(4,34,NA,5))
# function to replace NAs not in the beginning of vector with 0
f = function(x) { x[is.na(x) & cumsum(!is.na(x)) != 0] = 0; x }
# apply function and save as dataframe
data.frame(sapply(df, f))
# x1 x2 x3
# 1 23 NA 4
# 2 0 NA 34
# 3 0 45 0
# 4 35 0 5
Or using tidyverse and the same function f:
library(tidyverse)
df %>% map_df(f)
# # A tibble: 4 x 3
# x1 x2 x3
# <dbl> <dbl> <dbl>
# 1 23. NA 4.
# 2 0. NA 34.
# 3 0. 45. 0.
# 4 35. 0. 5.
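To see why the condition inside f works, it can help to evaluate its pieces on a single column. The sketch below uses the x2 column from the example; the intermediate names seen_value and to_zero are introduced here for illustration only and are not part of the answer above.

```r
# x2 column from the example dataset
x <- c(NA, NA, 45, NA)

# cumsum(!is.na(x)) counts how many non-NA values have been seen so far,
# so it stays 0 until the series has started
seen_value <- cumsum(!is.na(x)) != 0   # FALSE FALSE TRUE TRUE

# only NAs that come after the first real value are selected
to_zero <- is.na(x) & seen_value       # FALSE FALSE FALSE TRUE

x[to_zero] <- 0
x                                      # NA NA 45 0
```

Leading NAs are never selected because the running count is still 0 at those positions.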

If this is your dataset:
ORIGINAL_DATA <- data.frame(X1 = c(23, NA, NA, 35),
X2 = c(NA, NA, 45, NA),
X3 = c(4, 34, NA, 5))
This could probably work:
for (i in seq_len(ncol(ORIGINAL_DATA))) {
  for (j in seq_len(nrow(ORIGINAL_DATA))) {
    if (!is.na(ORIGINAL_DATA[j, i])) {
      # rows from the first non-NA value to the end of the column
      rows <- j:nrow(ORIGINAL_DATA)
      ORIGINAL_DATA[rows, i] <- ifelse(is.na(ORIGINAL_DATA[rows, i]), 0, ORIGINAL_DATA[rows, i])
      # Stop scanning this column (assigning to j would not end the for loop)
      break
    }
  }
}
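If an explicit loop is preferred, the inner loop over rows can be avoided by locating the first non-NA position with which(). This is only a sketch of that idea, not the answerer's method; the names col, first_value and later are hypothetical.

```r
ORIGINAL_DATA <- data.frame(X1 = c(23, NA, NA, 35),
                            X2 = c(NA, NA, 45, NA),
                            X3 = c(4, 34, NA, 5))

for (i in seq_along(ORIGINAL_DATA)) {
  col <- ORIGINAL_DATA[[i]]
  # index of the first non-NA value; NA if the whole column is NA
  first_value <- which(!is.na(col))[1]
  if (!is.na(first_value)) {
    # positions at or after the start of the series
    later <- seq_along(col) >= first_value
    col[later & is.na(col)] <- 0
    ORIGINAL_DATA[[i]] <- col
  }
}
ORIGINAL_DATA
```

A column that contains only NA is left unchanged, because the series never starts.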

Related

How to verify if when a column is NA the other is not?

I have a data frame with two columns. I need to check whether, wherever one column is NA, the other is not. Thanks
Edited.
I would like to know, for each row of the data frame, whether both columns are not NA.
You can use the following code to find the rows that have no NA values:
df <- data.frame(x = c(1, NA),
y = c(2, NA))
which(rowSums(is.na(df)) == 0)
Output:
[1] 1
As you can see, the first row has no NA values, so both of its columns are non-NA.
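The question can be read in more than one way, so here is a small sketch that computes each reading as a logical vector per row. The four-row data frame and the variable names are made up for illustration.

```r
# hypothetical data covering all four NA combinations
df <- data.frame(x = c(1, NA, 3, NA),
                 y = c(2, 5, NA, NA))

# rows where both columns are present
both_present <- complete.cases(df)                 # TRUE FALSE FALSE FALSE

# rows where exactly one of the two columns is NA
exactly_one_na <- xor(is.na(df$x), is.na(df$y))    # FALSE TRUE TRUE FALSE

# rows where both columns are NA
both_na <- is.na(df$x) & is.na(df$y)               # FALSE FALSE FALSE TRUE
```

Wrapping any of these in which() gives the row numbers instead of a logical vector.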
Here's some simple code to generate a column with the NA count for each row:
x <- sample(c(1, NA), 25, replace = TRUE)
y <- sample(c(1, NA), 25, replace = TRUE)
df <- data.frame(x, y)
df$NA_Count <- apply(df, 1, function(x) sum(is.na(x)))
df
x y NA_Count
1 NA 1 1
2 NA NA 2
3 1 NA 1
4 1 NA 1
5 NA NA 2
6 1 NA 1
7 1 1 0
8 1 1 0
9 1 1 0
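The same per-row count can also be obtained without apply, since is.na() on a data frame returns a logical matrix that rowSums() can add up directly. A minimal sketch, with an arbitrary seed added only so that the comparison is reproducible:

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible
df <- data.frame(x = sample(c(1, NA), 25, replace = TRUE),
                 y = sample(c(1, NA), 25, replace = TRUE))

# row-wise count using apply, as in the answer above
na_count_apply <- apply(df, 1, function(r) sum(is.na(r)))

# vectorized equivalent
na_count_rowsums <- rowSums(is.na(df))

all(na_count_apply == na_count_rowsums)
```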

Counting split rules in decision trees in R

I'm trying to count each unique split rule from a data frame of decision trees in R. For example, if I have a data frame containing 4 trees like the one shown below:
df <- data.frame(
var = c('x10', NA, NA,
'x10', NA, 'x7', NA, NA,
'x5', 'x2', NA, NA, 'x9', NA, NA,
'x5', NA, NA),
num = c(1,1,1,
2,2,2,2,2,
1,1,1,1,1,1,1,
2,2,2),
iter = c(rep(1, 8), rep(2, 10))
)
> df
var num iter
1 x10 1 1
2 <NA> 1 1
3 <NA> 1 1
4 x10 2 1
5 <NA> 2 1
6 x7 2 1
7 <NA> 2 1
8 <NA> 2 1
9 x5 1 2
10 x2 1 2
11 <NA> 1 2
12 <NA> 1 2
13 x9 1 2
14 <NA> 1 2
15 <NA> 1 2
16 x5 2 2
17 <NA> 2 2
18 <NA> 2 2
The var column contains the variable name used in the splitting rule and is ordered by depth first. So, for example, the 4 trees created from that data would look like this:
I'm trying to find a way to return the count of each pair of variables used in a split rule, grouped by iter. For example, if we look at the 2nd tree (i.e., num == 2, iter == 1), we can see that x7 splits on x10. So the pair x10:x7 appears once when iter == 1.
My desired output would look something like this:
allSplits count iter
1 x10:x7 1 1
2 x5:x2 1 2
3 x5:x9 1 2
Any suggestions as to how I could do this?
There is probably a package that knows how to operate on this kind of data frame, but maybe these two hand-crafted recursive functions can get you started.
mkTree <- function(x, pos = 1L) {
var <- x[pos]
if (is.na(var)) {
list(NA_character_, NULL, NULL, 1L)
} else {
node <- vector("list", 4L)
node[[1L]] <- var
node[[2L]] <- l <- Recall(x, pos + 1L)
node[[3L]] <- r <- Recall(x, pos + 1L + l[[4L]])
node[[4L]] <- 1L + l[[4L]] + r[[4L]]
node
}
}
tabTree <- function(tree, sep = ":") {
x <- rep.int(NA_character_, tree[[4L]])
pos <- 1L
recurse <- function(subtree) {
var1 <- subtree[[1L]]
if (!is.na(var1)) {
for (i in 2:3) {
var2 <- subtree[[c(i, 1L)]]
if (!is.na(var2)) {
x[pos] <<- paste0(var1, sep, var2)
pos <<- pos + 1L
Recall(subtree[[i]])
}
}
}
}
recurse(tree)
x <- x[!is.na(x)]
if (length(x)) {
x <- factor(x)
setNames(tabulate(x), levels(x))
} else {
integer(0L)
}
}
mkTree transforms into recursive lists the segments of var in your data frame that specify a tree. Nodes in these recursive structures have the form:
list(variable_name, left_node, right_node, subtree_size)
tabTree takes the mkTree result and returns a named integer vector tabulating the splits. So you could do
f <- function(x) tabTree(mkTree(x))
L <- tapply(df[["var"]], df[c("num", "iter")], f, simplify = FALSE)
to get a list matrix storing the tabulated splits for each [num, iter] pair (i.e., for each tree).
L
## iter
## num 1 2
## 1 integer,0 integer,2
## 2 1 integer,0
L[2L, 1L]
## [[1]]
## x10:x7
## 1
L[1L, 2L]
## [[1]]
## x5:x2 x5:x9
## 1 1
And you could sum over num to get tabulated splits for each level of iter.
g <- function(l) {
x <- unlist(unname(l))
tapply(x, names(x), sum)
}
apply(L, 2L, g)
## $`1`
## x10:x7
## 1
## $`2`
## x5:x2 x5:x9
## 1 1
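For comparison, here is a non-recursive sketch that produces a data frame in the shape of the desired output. It is a different technique from mkTree and tabTree: it walks each tree's depth-first var sequence with an explicit stack, and it assumes every internal node has exactly two children, as in the example. The names count_splits and split_counts are hypothetical.

```r
df <- data.frame(
  var = c('x10', NA, NA,
          'x10', NA, 'x7', NA, NA,
          'x5', 'x2', NA, NA, 'x9', NA, NA,
          'x5', NA, NA),
  num = c(1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
  iter = c(rep(1, 8), rep(2, 10))
)

# Return the "parent:child" pairs of one tree given in depth-first order
count_splits <- function(var) {
  pairs <- character(0)
  stack_var <- character(0)   # open internal nodes
  stack_left <- integer(0)    # children still expected for each open node
  for (v in as.character(var)) {
    n <- length(stack_var)
    if (n > 0L) {
      if (!is.na(v)) pairs <- c(pairs, paste0(stack_var[n], ":", v))
      stack_left[n] <- stack_left[n] - 1L
    }
    if (!is.na(v)) {
      # an internal node expects two children
      stack_var <- c(stack_var, v)
      stack_left <- c(stack_left, 2L)
    } else {
      # a leaf: close every node whose children are all accounted for
      while (length(stack_left) > 0L && stack_left[length(stack_left)] == 0L) {
        stack_var <- stack_var[-length(stack_var)]
        stack_left <- stack_left[-length(stack_left)]
      }
    }
  }
  pairs
}

# Tabulate the pairs of all trees within each level of iter
split_counts <- do.call(rbind, lapply(split(df, df$iter), function(d) {
  pairs <- unlist(lapply(split(d$var, d$num), count_splits), use.names = FALSE)
  if (length(pairs) == 0L) return(NULL)
  tab <- table(pairs)
  data.frame(allSplits = names(tab), count = as.integer(tab), iter = d$iter[1])
}))
rownames(split_counts) <- NULL
split_counts
```

On the example data this gives one row each for x10:x7 (iter 1), x5:x2 and x5:x9 (iter 2), each with a count of 1.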

How to make a For loop that keeps the original row value

I am trying to run multiple conditional statements in a loop. My first conditional is an if, else if with 3 conditions (4 technically if nothing matches). My second really only needs one condition, and I want to keep the original row value if it doesn't meet that condition. The problem is my output doesn't match the row numbers, and I'm not sure how to output only to a specific row in a loop.
I want to loop over each column, and within each column I use sapply to check each value: outside of range1 is marked 4, inside of range1 is marked 1, NA is marked 9, and anything else is marked -999. A narrower range, range2, is then applied: a value that falls outside of range2 is marked 3; otherwise the mark is not updated.
My partially working code, and a reproducible example is below. My input and first loop is:
df <- structure(list(A = c(-2, 3, 5, 10, NA), A.c = c(NA, NA, NA, NA, NA), B = c(2.2, -55, 3, NA, 99), B.c = c(NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -5L))
> df
A A.c B B.c
1 -2 NA 2.2 NA
2 3 NA -55.0 NA
3 5 NA 3.0 NA
4 10 NA NA NA
5 NA NA 99.0 NA
min1 <- 0
max1 <- 8
test1.func <- function(x) {
  val <- if (!is.na(x) & is.numeric(x) & (x < min1 | x > max1)) {
    num = 4
  } else if (!is.na(x) & is.numeric(x) & x >= min1 & x <= max1) {
    num = 1
  } else if (is.na(x)) { # TODO it would be better to make this just what is already present in the row
    num = 9
  } else {
    num = -999
  }
  val
}
Test1 <- function(x) {
i <- NA
for(i in seq(from = 1, to = ncol(x), by = 2)){
x[, i + 1] <- sapply(x[[i]], test1.func)
}
x
}
df_result <- Test1(df)
> df_result
A A.c B B.c
1 -2 4 2.2 1
2 3 1 -55.0 4
3 5 1 3.0 1
4 10 4 NA 9
5 NA 9 99.0 4
The next loop and conditional (any existing values of 4 or 9 would remain):
min2 <- 3
max2 <- 5
test2.func <- function(x) {
val <- if (!is.na(x) & is.numeric(x) & (x < min2 | x > max2)) {
num = 3
}
val
}
Test2 <- function(x) {
i <- NA
for(i in seq(from = 1, to = ncol(x), by = 2)){
x[, i + 1] <- sapply(x[[i]], test2.func)
}
x
}
df_result2 <- Test2(df_result)
# Only 2.2 matches, if working correctly would output
> df_result2
A A.c B B.c
1 -2 4 2.2 3
2 3 1 -55.0 4
3 5 1 3.0 1
4 10 4 NA 9
5 NA 9 99.0 4
Current code errors, since there is only one match:
Warning messages:
1: In `[<-.data.frame`(`*tmp*`, , i + 1, value = list(3, NULL, NULL, :
provided 5 variables to replace 1 variables
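The warning comes from the shape of what sapply returns: an if without an else evaluates to NULL when the condition is FALSE, so the results have different lengths and sapply falls back to a list. The sketch below reproduces that on a short vector and shows one possible way to keep the existing flag instead. The vectors values and flags are made up, and treating only flags equal to 1 as eligible for update is an assumption based on the expected output.

```r
values <- c(-2, 2.2, 3)   # hypothetical measurements
flags  <- c(4, 1, 1)      # hypothetical flags from the first pass
min2 <- 3
max2 <- 5

# if without else gives NULL for the non-matching value,
# so sapply cannot simplify and returns a list
res <- sapply(values, function(x) if (x < min2 | x > max2) 3)
is.list(res)   # TRUE

# assumption: only a flag of 1 may be overwritten; 4 and 9 are kept
updated <- ifelse(flags == 1 & !is.na(values) & (values < min2 | values > max2),
                  3, flags)
updated        # 4 3 1
```

Because ifelse always returns a vector as long as its test, the result can be assigned back to a column without the length mismatch.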
Some thoughts.
for loops are not necessary, it is better to capitalize on R's vectorized operations;
it appears that your values of 4 and 3 are really something like "outside band 1" and "outside band 2", in which case this can be resolved in one function.
Testing for == "NA" is a bit off ... if one of the values in a column is a string "NA" (and not R's NA value), then all values in that column are strings and you have other problems. Because of this, I don't explicitly check for is.numeric, though it is not hard to work back in.
Try this:
func <- function(x, range1, range2) {
ifelse(is.na(x), 9L,
ifelse(x < range1[1] | x > range1[2], 4L,
ifelse(x < range2[1] | x > range2[2], 3L,
1L)))
}
df[,c("A.c", "B.c")] <- lapply(df[,c("A", "B")], func, c(0, 8), c(3, 5))
df
# A A.c B B.c
# 1 -2 4 2.2 3
# 2 3 1 -55.0 4
# 3 5 1 3.0 1
# 4 10 4 NA 9
# 5 NA 9 99.0 4
One problem I have with this is that it uses three nested ifelse calls. While this works fine, it can be difficult to trace and troubleshoot (and ifelse has problems of its own). If you have other conditions to incorporate, it might be nice to use dplyr::case_when.
func2 <- function(x, range1, range2) {
dplyr::case_when(
is.na(x) ~ 9L,
x < range1[1] | x > range1[2] ~ 4L,
x < range2[1] | x > range2[2] ~ 3L,
TRUE ~ 1L
)
}
I find this second method much easier to read, though it does have the added dependency of dplyr (which, while it definitely has advantages and strengths, includes an army of other dependencies). If you are already using any of the tidyverse packages in your workflow, though, this is likely the better solution.
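If neither nested ifelse nor a dplyr dependency is attractive, the same flags can be built in base R with successive logical-index assignments, where later assignments take priority over earlier ones. This is a sketch of that alternative, not something from the answer above; the name flag_ranges is hypothetical.

```r
df <- data.frame(A = c(-2, 3, 5, 10, NA),
                 B = c(2.2, -55, 3, NA, 99))

flag_ranges <- function(x, range1, range2) {
  out <- rep(1L, length(x))                                  # default: inside both ranges
  out[!is.na(x) & (x < range2[1] | x > range2[2])] <- 3L     # outside the narrow range
  out[!is.na(x) & (x < range1[1] | x > range1[2])] <- 4L     # outside the wide range wins
  out[is.na(x)] <- 9L                                        # missing values
  out
}

df[c("A.c", "B.c")] <- lapply(df[c("A", "B")], flag_ranges,
                              range1 = c(0, 8), range2 = c(3, 5))
df
```

The order of the assignments encodes the priority, so adding another condition means adding one line in the right place.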

function in lapply not working when applied to some columns

I have a dataframe, say
data <- data.frame(x1 = c(5, NA, 1, 6),
x2 = c(4, 3, 0, NA),
c = c('a', 'b', 'a', NA)); data
x1 x2 c
1 5 4 a
2 NA 3 b
3 1 0 a
4 6 NA NA
I want to replace the NAs by 0 on x1 and x2 columns only, so I use the lapply function as below:
data[c("x1","x2")] <- lapply(data[c("x1","x2")], function (x) {x[is.na(x)] <- 0}); data
This does not work as the output is:
x1 x2 c
1 0 0 a
2 0 0 b
3 0 0 a
4 0 0 NA
I then tried to create a separate function
fxNAtoZero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
and if I use this like below:
data[c("x1","x2")] <- lapply(data[c("x1","x2")], fxNAtoZero); data
it works, but the first case does not. I do not understand why the function created on fly is not working in lapply?
Your problem is that in your first attempt the anonymous function returns the value of its last expression, the assignment x[is.na(x)] <- 0, which evaluates to 0:
lapply(data[c("x1","x2")], function (x) {x[is.na(x)] <- 0})
$x1
[1] 0
$x2
[1] 0
while your second attempt explicitly returns the entire vector after replacing the NAs, because you used return. If you want to use lapply with an anonymous function, prefer:
lapply(data[c("x1","x2")], function (x) {ifelse(is.na(x),0,x) })
because ifelse does return a vector of the same length as the initial one.
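The point about the return value can be checked directly, and the original anonymous function can be repaired by making x its last expression. A minimal sketch; the name assigned is introduced here for illustration only.

```r
data <- data.frame(x1 = c(5, NA, 1, 6),
                   x2 = c(4, 3, 0, NA),
                   c = c("a", "b", "a", NA))

# an assignment expression evaluates to the assigned value, not to x
x <- c(5, NA, 1, 6)
assigned <- (x[is.na(x)] <- 0)
assigned   # 0

# ending the function body with x returns the modified vector
data[c("x1", "x2")] <- lapply(data[c("x1", "x2")],
                              function(x) { x[is.na(x)] <- 0; x })
data
```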
You can also try using dplyr verbs to transform your data, and replace NA's for the desired cases. This is perhaps a bit more readable than using lapply, but note that the variables are converted to strings since that is the format for variable c.
library(dplyr) # mutate, select, %>%
library(tidyr) # gather, spread
data <- data.frame(x1 = c(5, NA, 1, 6),
                   x2 = c(4, 3, 0, NA),
                   c = c('a', 'b', 'a', NA),
                   id = c(1:4)) # create with row id, for spread
data %>% gather(k, v, -id) %>%
  mutate(v = ifelse(is.na(v) & k != 'c', 0, v)) %>% # replace NA's based on conditions
  spread(k, v) %>% select(-id)
c x1 x2
1 a 5 4
2 b 0 3
3 a 1 0
4 <NA> 6 0

R: Find the Variance of all Non-Zero Elements in Each Row

I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and to return NA where this is not possible.
So for instance, in this example the variance for ID 1 would be var(c(20, 25)),
for ID 2 it would return NA because you can't calculate a variance on just one entry, for ID 3 the variance would be var(c(15, 32, 16)), and for ID 4 it would again return NA because it has no numbers at all to calculate a variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len=nrow(d)
variances = numeric(len)
for (i in 1:len){
#get all nonzero values in ith row of data into a vector nonzerodat here
currentvar = var(nonzerodat)
variances[i]=currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
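For readers who want to stay close to the loop in the question, here is one way it could be completed. This is a sketch rather than the answerer's code: row_values is a hypothetical name, and the explicit length check simply makes the NA cases visible instead of relying on what var returns for short inputs.

```r
d <- data.frame(ID = 1:4,
                Value1 = c(20, 2, 15, 0),
                Value2 = c(25, 0, 32, 0),
                Value3 = c(0, 0, 16, 0))

variances <- rep(NA_real_, nrow(d))
for (i in seq_len(nrow(d))) {
  # all value columns of row i as a plain numeric vector
  row_values <- unlist(d[i, -1])
  nonzerodat <- row_values[row_values != 0]
  # a variance needs at least two observations; otherwise keep NA
  if (length(nonzerodat) >= 2) {
    variances[i] <- var(nonzerodat)
  }
}
variances
```

Since the value columns are selected with d[i, -1], the loop does not need to change when more value columns are added.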
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
melt(., id.var = "id") %>%
group_by(id) %>%
summarise(variance = var(value[value!=0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA
