Iterating over columns to create flagging variables - r

I've got a dataset that has a lot of numerical columns (in the example below these columns are x, y, z). I want to create individual flagging variables for each of those columns (x_YN, y_YN, z_YN) such that, if the numerical column is > 0, the flagging variable is = 1 and otherwise it's = 0. What might be the most efficient way to tackle this?
Thanks for the help!
x <- c(3, 7, 0, 10)
y <- c(5, 2, 20, 0)
z <- c(0, 0, 4, 12)
df <- data.frame(x,y,z)

We may create a logical matrix with df > 0 and coerce it to 0/1 with unary +:
df[paste0(names(df), "_YN")] <- +(df > 0)
Output:
> df
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
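For clarity, df > 0 returns a logical (TRUE/FALSE) matrix and the unary + coerces it to 1/0. A more explicit sketch of the same idea, using the x, y, z columns defined above:
# Spelled out step by step: build the logical matrix, then turn TRUE/FALSE into 1/0
logical_mat <- df[c("x", "y", "z")] > 0
df[paste0(c("x", "y", "z"), "_YN")] <- ifelse(logical_mat, 1, 0)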

The dplyr alternative:
library(dplyr)
df %>%
  mutate(across(everything(), ~ +(.x > 0), .names = "{col}_YN"))
Output:
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1

Get the first non-null value from selected cells in a row

Good afternoon, friends!
I'm currently performing some calculations in R (df is displayed below). My goal is to display in a new column the first non-null value from selected cells for each row.
My df is:
MD <- c(100, 200, 300, 400, 500)
liv <- c(0, 0, 1, 3, 4)
liv2 <- c(6, 2, 0, 4, 5)
liv3 <- c(1, 1, 1, 1, 1)
liv4 <- c(1, 0, 0, 3, 5)
liv5 <- c(0, 2, 7, 9, 10)
df <- data.frame(MD, liv, liv2, liv3, liv4, liv5)
I want to display (in a column called "liv6") the first non-null value from the 5 cells (given the data, liv = 0, liv2 = 6, liv3 = 1, liv4 = 1 and liv5 = 0). The result should be 6. And this calculation should be repeated for each row in my data frame.
I do know how to do this in Python, but not in R.
Any help is highly appreciated!
One option with dplyr could be:
df %>%
  rowwise() %>%
  mutate(liv6 = with(rle(c_across(liv:liv5)), values[which.max(values != 0)]))
MD liv liv2 liv3 liv4 liv5 liv6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 0 6 1 1 0 6
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 3
5 500 4 5 1 5 10 4
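The rle() trick works because values[which.max(values != 0)] picks the value of the first non-zero run. An equivalent rowwise sketch that indexes the first non-zero value directly (same df as above):
library(dplyr)
# Sketch: keep the first element of the non-zero values in each row
df %>%
  rowwise() %>%
  mutate(liv6 = {
    v <- c_across(liv:liv5)
    v[v != 0][1]
  }) %>%
  ungroup()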
A Base R solution:
df$liv6 <- apply(df[-1], 1, function(x) x[min(which(x != 0))])
Output:
df
   MD liv liv2 liv3 liv4 liv5 liv6
1 100   0    6    1    1    0    6
2 200   0    2    1    0    2    2
3 300   1    0    1    0    7    1
4 400   3    4    1    3    9    3
5 500   4    5    1    5   10    4
A simple base R option is to apply across the relevant columns (I exclude MD here; you can use any data frame subsetting style you want), then take the first of the non-zero values in each row.
df$liv6 <- apply(df[-1], 1, \(x) head(x[x > 0], 1))
df
#> MD liv liv2 liv3 liv4 liv5 liv6
#> 1 100 0 6 1 1 0 6
#> 2 200 0 2 1 0 2 2
#> 3 300 1 0 1 0 7 1
#> 4 400 3 4 1 3 9 3
#> 5 500 4 5 1 5 10 4
One approach is to use purrr::detect to detect the first non-zero element of each row.
We define a function which takes a numeric vector (row) and returns a boolean indicating whether each element is non-zero:
is_nonzero <- function(x) x != 0
We use this function to detect the first non-zero element in each row via purrr::detect:
first_nonzero <- apply(df %>% dplyr::select(liv:liv5), 1, function(x) {
  purrr::detect(x, is_nonzero, .dir = "forward")
})
We finally create the new column:
df$liv6 <- first_nonzero
As a result, we have
> df
MD liv liv2 liv3 liv4 liv5 liv6
100 0 6 1 1 0 6
200 0 2 1 0 2 2
300 1 0 1 0 7 1
400 3 4 1 3 9 3
500 4 5 1 5 10 4
Another straightforward solution is:
Reduce(function(x, y) ifelse(!x, y, x), df[, -1])
#[1] 6 2 1 3 4
This should be very efficient, since we "scan" by column and, presumably, the data have far fewer columns than rows (a rough benchmark sketch follows the loop version below).
The Reduce approach is a more functional form of a simple, old-school loop:
ans = df[, 2]
for(j in 3:ncol(df)) {
  i = !ans
  ans[i] = df[i, j]
}
ans
#[1] 6 2 1 3 4
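To check the column-scan claim, here is a rough benchmark sketch; the data size and the use of the microbenchmark package are assumptions, not part of the original question:
library(microbenchmark)
set.seed(1)
# A taller data frame: many rows, few columns
big <- as.data.frame(matrix(sample(0:5, 5e5, replace = TRUE), ncol = 5))
microbenchmark(
  by_column = Reduce(function(x, y) ifelse(!x, y, x), big),
  by_row    = apply(big, 1, function(x) x[which(x != 0)[1]]),
  times = 10
)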

Add X number of columns to a data.frame

I would like to add a varying number (X) of columns with 0 to an existing data.frame within a function.
Here is an example data.frame:
dt <- data.frame(x=1:3, y=4:6)
I would like to get this result if X=1:
a x y
1 0 1 4
2 0 2 5
3 0 3 6
And this if X=3:
a b c x y
1 0 0 0 1 4
2 0 0 0 2 5
3 0 0 0 3 6
What would be an efficient way to do this?
We can assign multiple columns to '0' based on the value of 'X'
X <- 3
nm1 <- names(dt)
dt[letters[seq_len(X)]] <- 0
dt[c(setdiff(names(dt), nm1), nm1)]
Also, we can use add_column from tibble and create columns at a specific location
library(tibble)
add_column(dt, .before = 1, !!!setNames(as.list(rep(0, X)), letters[seq_len(X)]))
A second option is cbind
f <- function(x, n = 3) {
  cbind.data.frame(matrix(
    0,
    ncol = n,
    nrow = nrow(x),
    dimnames = list(NULL, letters[1:n])
  ), x)
}
f(dt, 5)
# a b c d e x y
#1 0 0 0 0 0 1 4
#2 0 0 0 0 0 2 5
#3 0 0 0 0 0 3 6
NOTE: because letters has a length of 26, the function would need some adjustment to the naming scheme if n > 26; one possible adjustment is sketched below.
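A possible adjustment, sketched here with a made-up helper col_letters(), recycles the letters with a numeric suffix once n goes past 26:
# Hypothetical helper: a, b, ..., z, a1, b1, ... for n > 26
col_letters <- function(n) {
  suffix <- rep(c("", seq_len(ceiling(n / 26) - 1)), each = 26, length.out = n)
  paste0(rep(letters, length.out = n), suffix)
}
col_letters(3)   # "a" "b" "c"
col_letters(28)  # ..., "z", "a1", "b1"
col_letters(n) can then replace letters[1:n] inside f.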
You can try the code below
dt <- cbind(`colnames<-`(t(rep(0,X)),letters[seq(X)]),dt)
If you don't care about the column names of the added columns, you can just use
dt <- cbind(t(rep(0,X)),dt)
which is much shorter

R: modify a variable conditioned on data from multiple previous rows

Hi, I would really appreciate some help with this; I couldn't find a solution in previous questions.
I have a tibble in long format (rows grouped by id and arranged by time).
I want to create a variable "eleg" based on "varx". The condition is that eleg = 1 if varx == 0 in all of the previous 3 rows and varx == 1 in the current row, and 0 otherwise, within each ID. If possible using dplyr.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3)
time <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,1,2,3,4)
varx <- c(0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1)
eleg <- c(0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1)
table <- data.frame(id, time, varx, eleg)
In my real dataset the condition is "in the previous 24 rows" and the same ID could have eleg == 1 more than one time if it suits the condition.
Thank you.
One approach could be
library(dplyr)
m <- 3 #number of times previous rows are looked back
df %>%
  group_by(id) %>%
  mutate(eleg = ifelse(rowSums(sapply(1:m, function(k) lag(varx, n = k, order_by = id, default = 1) == 0)) == m & varx == 1,
                       1,
                       0)) %>%
  data.frame()
which gives
id time varx eleg
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 0 0
5 1 5 1 1
6 1 6 1 0
7 1 7 0 0
8 2 1 0 0
9 2 2 1 0
10 2 3 1 0
11 2 4 1 0
12 2 5 1 0
13 2 6 1 0
14 3 1 0 0
15 3 2 0 0
16 3 3 0 0
17 3 4 1 1
Sample data:
df <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3), time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,
1, 2, 3, 4), varx = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 1)), .Names = c("id", "time", "varx"), row.names = c(NA,
-17L), class = "data.frame")
library(data.table)
df %>%
  mutate(elegnew = ifelse(Reduce("+", shift(df$varx, 1:3)) == 0 & df$varx == 1, 1, 0))
id time varx eleg elegnew
1 1 1 0 0 0
2 1 2 0 0 0
3 1 3 0 0 0
4 1 4 0 0 0
5 1 5 1 1 1
6 1 6 1 0 0
7 1 7 0 0 0
8 2 1 0 0 0
9 2 2 1 0 0
10 2 3 1 0 0
11 2 4 1 0 0
12 2 5 1 0 0
13 2 6 1 0 0
14 3 1 0 0 0
15 3 2 0 0 0
16 3 3 0 0 0
17 3 4 1 1 1
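Note that the shift() version above is applied to the whole column rather than within each id; a grouped sketch of the same idea in data.table syntax (not part of the original answer):
library(data.table)
dt <- as.data.table(df)
# Same condition, but shift() is evaluated within each id;
# fill = 1 mirrors default = 1 in the lag() answer, so early rows can't be flagged
dt[, elegnew := as.integer(Reduce(`+`, shift(varx, 1:3, fill = 1)) == 0 & varx == 1),
   by = id]
dt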
Here's another approach, using dplyr and zoo:
library(dplyr)
library(zoo)
df %>%
  group_by(id) %>%
  mutate(elegnew = as.integer(varx == 1 &
                                rollsum(varx == 1, k = 4, align = "right", fill = 0) == 1))
# # A tibble: 17 x 5
# # Groups: id [3]
# id time varx eleg elegnew
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 1. 1. 0. 0. 0
# 2 1. 2. 0. 0. 0
# 3 1. 3. 0. 0. 0
# 4 1. 4. 0. 0. 0
# 5 1. 5. 1. 1. 1
# 6 1. 6. 1. 0. 0
# 7 1. 7. 0. 0. 0
# 8 2. 1. 0. 0. 0
# 9 2. 2. 1. 0. 0
# 10 2. 3. 1. 0. 0
# 11 2. 4. 1. 0. 0
# 12 2. 5. 1. 0. 0
# 13 2. 6. 1. 0. 0
# 14 3. 1. 0. 0. 0
# 15 3. 2. 0. 0. 0
# 16 3. 3. 0. 0. 0
# 17 3. 4. 1. 1. 1
The idea is to group by id and then check a) whether varx is 1 and b) whether the sum of varx=1 events in the previous 3 plus current row (k=4) is 1 (which means all previous 3 must be 0). I assume that varx is either 0 or 1.
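To make that intermediate quantity visible, a small sketch that keeps the rolling count as its own column before turning it into the flag (same df; the column name recent_ones is made up):
library(dplyr)
library(zoo)
# Sketch: expose the rolling count of varx == 1 over the current row plus the 3 before it
df %>%
  group_by(id) %>%
  mutate(recent_ones = rollsum(varx == 1, k = 4, align = "right", fill = 0),
         elegnew = as.integer(varx == 1 & recent_ones == 1)) %>%
  ungroup()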
You asked preferably for a dplyr solution.
The following is a base R one, with a function that you can adapt to "in the previous 24 rows": just pass n = 24 to the function.
fun <- function(DF, crit = "varx", new = "eleg", n = 3){
  DF[[new]] <- 0
  for(i in seq_len(nrow(DF))[-seq_len(n)]){
    if(all(DF[[crit]][(i - n):(i - 1)] == 0) && DF[[crit]][i] == 1)
      DF[[new]][i] <- 1
  }
  DF
}
sp <- split(table[-4], table[-4]$id)
new_df <- do.call(rbind, lapply(sp, fun))
row.names(new_df) <- NULL
identical(table, new_df)
#[1] TRUE
Note that if you are creating a new column, eleg, you would probably not need to split table[-4], just table since the 4th column wouldn't exist yet.
You could do do.call(rbind, lapply(sp, fun, n = 24)) and the rest would be the same.

Generate a dummy matrix from multiple factor columns

I already searched the web and found no answer. I have a big data.frame that contains multiple columns. Each column is a factor variable.
I want to transform the data.frame such that each possible value of the factor variables is a variable that either contains a "1" if the variable is present in the factor column or "0" otherwise.
Here is an example of what I mean.
labels <- c("1", "2", "3", "4", "5", "6", "7")
#create data frame (note, not all factor levels have to be in the columns,
#NA values are possible)
input <- data.frame(ID = c(1, 2, 3),
                    Cat1 = factor(c(4, 1, 1), levels = labels),
                    Cat2 = factor(c(2, NA, 4), levels = labels),
                    Cat3 = factor(c(7, NA, NA), levels = labels))
#the seven factor levels now are the variables of the data.frame
desired_output <- data.frame(ID = c(1, 2, 3),
                             Dummy1 = c(0, 1, 1),
                             Dummy2 = c(1, 0, 0),
                             Dummy3 = c(0, 0, 0),
                             Dummy4 = c(1, 0, 1),
                             Dummy5 = c(0, 0, 0),
                             Dummy6 = c(0, 0, 0),
                             Dummy7 = c(1, 0, 0))
input
ID Cat1 Cat2 Cat3
1 4 2 7
2 1 <NA> <NA>
3 1 4 <NA>
desired_output
ID Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 Dummy7
1 0 1 0 1 0 0 1
2 1 0 0 0 0 0 0
3 1 0 0 1 0 0 0
My actual data.frame has over 3000 rows and factors with more than 100 levels.
I hope you can help me convert the input to the desired output.
Greetings
sush
A couple of methods, that riff off of Gregor's and Aaron's answers.
From Aaron's: factorsAsStrings=FALSE keeps the factor variables, and hence all labels, when using dcast.
library(reshape2)
dcast(melt(input, id="ID", factorsAsStrings=FALSE), ID ~ value, drop=FALSE)
ID 1 2 3 4 5 6 7 NA
1 1 0 1 0 1 0 0 1 0
2 2 1 0 0 0 0 0 0 2
3 3 1 0 0 1 0 0 0 1
Then you just need to remove the last column.
From Gregor's
na.replace <- function(x) replace(x, is.na(x), 0)
options(na.action='na.pass') # this keeps the NA's which are then converted to zero
Reduce("+", lapply(input[-1], function(x) na.replace(model.matrix(~ 0 + x))))
x1 x2 x3 x4 x5 x6 x7
1 0 1 0 1 0 0 1
2 1 0 0 0 0 0 0
3 1 0 0 1 0 0 0
Then you just need to cbind the ID column back on.
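A sketch of that last step, reusing na.replace and the na.action option set above:
# Sum the per-column dummy matrices, then bind the ID column back on
dummies <- Reduce("+", lapply(input[-1], function(x) na.replace(model.matrix(~ 0 + x))))
cbind(ID = input$ID, dummies)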
One way to do this is with matrix indexing. You have data specifying which locations in your output matrix should be 1 (the rest should be zero), so we'll make a matrix of zeros and then fill in the 1's based on your data. To do that, your data needs to be in a two-column matrix, with the first column giving the row (ID) of the output and the second column giving the column index.
Put input data in long format, remove missings, convert values to integers matching the labels, then make a matrix as needed.
in2 <- reshape2::melt(input, id.vars="ID")
in2 <- subset(in2, !is.na(value))
in2$value <- match(in2$value, labels)
in2$variable <- NULL
in2 <- as.matrix(in2)
Then make the new output matrix with all zeros, and fill in the ones using that matrix.
out <- matrix(0, nrow=nrow(input), ncol=length(labels))
colnames(out) <- labels
rownames(out) <- input$ID
out[in2] <- 1
out
## 1 2 3 4 5 6 7
## 1 0 1 0 1 0 0 1
## 2 1 0 0 0 0 0 0
## 3 1 0 0 1 0 0 0
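If the Dummy-style column names and a data.frame (as in desired_output) are wanted, a short follow-up sketch using the out matrix from above:
# Optional: reshape the matrix into the desired_output layout
res <- data.frame(ID = input$ID, out)
names(res)[-1] <- paste0("Dummy", labels)
res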
Here's a way using model.matrix. We convert the missing values to 0s, and specify 0 as the reference level for the factor contrasts. Then we just add the individual model matrices together and stick on the IDs:
new_lab = as.character(0:7)
for (i in 2:4) {
  temp = as.character(input[[i]])
  temp[is.na(temp)] = "0"
  input[[i]] = factor(temp, levels = new_lab)
}
mm =
  model.matrix(~ Cat1, data = input) +
  model.matrix(~ Cat2, data = input) +
  model.matrix(~ Cat3, data = input)
mm[, 1] = input$ID
colnames(mm) = c("ID", paste0("Dummy", 1:(ncol(mm) - 1)))
mm
# ID Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 Dummy7
# 1 1 0 1 0 1 0 0 1
# 2 2 1 0 0 0 0 0 0
# 3 3 1 0 0 1 0 0 0
# attr(,"assign")
# [1] 0 1 1 1 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$Cat1
# [1] "contr.treatment"
You can leave the result as a model matrix, change it back to a data frame, or whatever else.
This should work on your data frame. I converted the values to numeric before running the ifelse statement. Hope it works:
# Make dummy df
Cat1 = factor(c(4, 1, 1))
Cat2 = factor(c(2, NA, 4))
Cat3 = factor(c(7, NA, NA))
df <- data.frame(Cat1, Cat2, Cat3)
# The %<>% pipe comes from magrittr
library(magrittr)
# Specify columns
cols <- c(1:length(df))
# Convert values to numeric
df[, cols] %<>% lapply(function(x) as.numeric(as.character(x)))
# Perform ifelse: if it's NA, write 0, else 1
df[, cols] %<>% lapply(function(x) ifelse(is.na(x), 0, 1))
Based on input:
Cat1 Cat2 Cat3
1 4 2 7
2 1 <NA> <NA>
3 1 4 <NA>
Output looks like this:
Cat1 Cat2 Cat3
1 1 1 1
2 1 0 0
3 1 1 0

Set NA to 0 in R

After merging a data frame with another, I'm left with random NAs for the occasional row. I'd like to set these NAs to 0 so I can perform calculations with them.
I'm trying to do this with:
bothbeams.data = within(bothbeams.data, {
  bothbeams.data$x.x = ifelse(is.na(bothbeams.data$x.x) == TRUE, 0, bothbeams.data$x.x)
  bothbeams.data$x.y = ifelse(is.na(bothbeams.data$x.y) == TRUE, 0, bothbeams.data$x.y)
})
Where $x.x is one column and $x.y is the other of course, but this doesn't seem to work.
You can just use the output of is.na to replace directly with subsetting:
bothbeams.data[is.na(bothbeams.data)] <- 0
Or with a reproducible example:
dfr <- data.frame(x=c(1:3,NA),y=c(NA,4:6))
dfr[is.na(dfr)] <- 0
dfr
x y
1 1 0
2 2 4
3 3 5
4 0 6
However, be careful using this method on a data frame containing factors that also have missing values:
> d <- data.frame(x = c(NA,2,3),y = c("a",NA,"c"))
> d[is.na(d)] <- 0
Warning message:
In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
invalid factor level, NA generated
It "works":
> d
x y
1 0 a
2 2 <NA>
3 3 c
...but you likely will want to specifically alter only the numeric columns in this case, rather than the whole data frame. See, e.g., the answer below using dplyr::mutate_if.
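A base R sketch for that numeric-columns-only case (self-contained, re-creating the small d data frame from above):
# Replace NA with 0 only in numeric columns; factor/character columns are left alone
d <- data.frame(x = c(NA, 2, 3), y = c("a", NA, "c"))
num_cols <- vapply(d, is.numeric, logical(1))
d[num_cols] <- lapply(d[num_cols], function(x) replace(x, is.na(x), 0))
d
#   x    y
# 1 0    a
# 2 2 <NA>
# 3 3    c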
A solution using mutate_all from dplyr in case you want to add that to your dplyr pipeline:
library(dplyr)
df %>%
  mutate_all(funs(ifelse(is.na(.), 0, .)))
Result:
A B C
1 0 0 0
2 1 0 0
3 2 0 2
4 3 0 5
5 0 0 2
6 0 0 1
7 1 0 1
8 2 0 5
9 3 0 2
10 0 0 4
11 0 0 3
12 1 0 5
13 2 0 5
14 3 0 0
15 0 0 1
If you only want to replace the NAs in numeric columns, which I assume might be the case in modeling, you can use mutate_if:
library(dplyr)
df %>%
  mutate_if(is.numeric, funs(ifelse(is.na(.), 0, .)))
or in base R:
replace(df, is.na(df), 0)
Result:
A B C
1 0 0 0
2 1 <NA> 0
3 2 0 2
4 3 <NA> 5
5 0 0 2
6 0 <NA> 1
7 1 0 1
8 2 <NA> 5
9 3 0 2
10 0 <NA> 4
11 0 0 3
12 1 <NA> 5
13 2 0 5
14 3 <NA> 0
15 0 0 1
Update
with dplyr 1.0.0, across is introduced:
library(dplyr)
# Replace `NA` for all columns
df %>%
  mutate(across(everything(), ~ ifelse(is.na(.), 0, .)))
# Replace `NA` for numeric columns
df %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), 0, .)))
Data:
set.seed(123)
df <- data.frame(A = rep(c(0:3, NA), 3),
                 B = rep(c("0", NA), length.out = 15),
                 C = sample(c(0:5, NA), 15, replace = TRUE))
You can use replace_na() from the tidyr package
df %>% replace_na(list(column1 = 0, column2 = 0))
To add to James's example, it seems you always have to create an intermediate when performing calculations on NA-containing data frames.
For instance, adding two columns (A and B) together from a data frame dfr:
temp.df <- data.frame(dfr) # copy the original
temp.df[is.na(temp.df)] <- 0
dfr$C <- temp.df$A + temp.df$B # or any other calculation
remove('temp.df')
When I do this I throw away the intermediate afterwards with remove/rm.
If you only want to replace NAs with 0s for a few select columns, you can also use an lapply solution, e.g.:
data = data.frame(
  one = c(NA, 0),
  two = c(NA, NA),
  three = c(1, 2),
  four = c("A", NA)
)
data[1:2] = lapply(data[1:2], function(x){
  x[is.na(x)] = 0
  return(x)
})
data
Why not try this?
na.zero <- function(x) {
  x[is.na(x)] <- 0
  return(x)
}
na.zero(df)
