Count several rows and make a new column in R

I want to sum several columns (x1-x4) within each row and store the result in a new column (x1_x4) in R. Can anyone help me?
df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
x1 = c(0,NA,0,1,0,0,1,1,1,NA),
x2 = c(0,NA,1,0,0,NA,0,1,0,0),
x3 = c(0,NA,0,1,1,0,1,1,1,0),
x4 = c(0,NA,0,0,0,0,1,1,1,1))

You can use rowSums, then use apply to test whether all values in a row are NA.
df$x1_x4 <- rowSums(df[-1], na.rm = TRUE)
df$x1_x4[apply(is.na(df[2:5]), 1, all)] <- NA
# ID x1 x2 x3 x4 x1_x4
#1 1 0 0 0 0 0
#2 2 NA NA NA NA NA
#3 3 0 1 0 0 1
#4 4 1 0 1 0 2
#5 5 0 0 1 0 1
#6 6 0 NA 0 0 0
#7 7 1 0 1 1 3
#8 8 1 1 1 1 4
#9 9 1 0 1 1 3
#10 10 NA 0 0 1 1

One dplyr solution could be:
library(dplyr)
df %>%
rowwise() %>%
mutate(x1_x4 = any(!is.na(c_across(-ID)))^NA * sum(c_across(-ID), na.rm = TRUE))
ID x1 x2 x3 x4 x1_x4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 0 0 0
2 2 NA NA NA NA NA
3 3 0 1 0 0 1
4 4 1 0 1 0 2
5 5 0 0 1 0 1
6 6 0 NA 0 0 0
7 7 1 0 1 1 3
8 8 1 1 1 1 4
9 9 1 0 1 1 3
10 10 NA 0 0 1 1
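The `^NA` part of that expression is easy to misread, so here is a minimal sketch of what it appears to rely on: in R, `1^NA` is 1 while `0^NA` is NA, so the logical returned by `any()` turns into a multiplier that is either 1 or NA. The vectors `some_values` and `all_missing` are made up for illustration.

```r
# TRUE is coerced to 1 and FALSE to 0 before exponentiation
TRUE^NA    # 1  (1 raised to any power is 1)
FALSE^NA   # NA (0 raised to a missing power is missing)

some_values <- c(NA, 0, 0, 1)     # hypothetical row with one NA
all_missing <- rep(NA_real_, 4)   # hypothetical row that is entirely NA

# Multiplier is 1 here, so the NA-skipping sum is kept
res_some <- any(!is.na(some_values))^NA * sum(some_values, na.rm = TRUE)
# Multiplier is NA here, so the result becomes NA instead of 0
res_all <- any(!is.na(all_missing))^NA * sum(all_missing, na.rm = TRUE)
```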

vars <- paste0("x", 1:4)
df$x1_x4 <- rowSums(df[vars], na.rm = TRUE)
df[rowSums(is.na(df[vars]), na.rm = TRUE) == 4, "x1_x4"] <- NA
df
# ID x1 x2 x3 x4 x1_x4
# 1 1 0 0 0 0 0
# 2 2 NA NA NA NA NA
# 3 3 0 1 0 0 1
# 4 4 1 0 1 0 2
# 5 5 0 0 1 0 1
# 6 6 0 NA 0 0 0
# 7 7 1 0 1 1 3
# 8 8 1 1 1 1 4
# 9 9 1 0 1 1 3
# 10 10 NA 0 0 1 1

Base R one (obfuscated) expression:
within(df, {x1_x4 <- apply(df[,grepl("^x", names(df))], 1,
function(x){ifelse(all(is.na(x)), NA_integer_, sum(x, na.rm = TRUE))})})
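All of the answers above follow the same rule: sum while skipping NAs, but return NA when every value in the row is missing. As a minimal sketch, that rule can be wrapped in a small helper; the name `row_sum_or_na` is hypothetical and not part of any package.

```r
# Hypothetical helper: row sums that are NA only when every value is NA
row_sum_or_na <- function(data, cols) {
  m <- as.matrix(data[cols])
  total <- rowSums(m, na.rm = TRUE)       # NAs are skipped, so all-NA rows give 0
  total[rowSums(!is.na(m)) == 0] <- NA    # restore NA where nothing was observed
  total
}

df <- data.frame(ID = 1:10,
                 x1 = c(0, NA, 0, 1, 0, 0, 1, 1, 1, NA),
                 x2 = c(0, NA, 1, 0, 0, NA, 0, 1, 0, 0),
                 x3 = c(0, NA, 0, 1, 1, 0, 1, 1, 1, 0),
                 x4 = c(0, NA, 0, 0, 0, 0, 1, 1, 1, 1))

df$x1_x4 <- row_sum_or_na(df, paste0("x", 1:4))
df$x1_x4
# 0 NA 1 2 1 0 3 4 3 1
```

Passing the column names explicitly keeps the helper from accidentally including `ID` or a previously created `x1_x4` column in the sum.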

Related

How to filter out data with conditional statement for series of numbers in R?

Data
Here is the data for my example:
#### Create Data ####
df <- data.frame(X1 = c(NA,1,1,1,0),
X2 = c(1,1,1,0,0),
X3 = c(1,1,NA,0,0),
X4 = c(1,1,1,1,NA),
X5 = c(1,1,1,0,NA),
X6 = c(1,NA,1,1,NA)) %>%
as_tibble()
Problem
When you print the data, it looks like this:
# A tibble: 5 × 6
X1 X2 X3 X4 X5 X6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA 1 1 1 1 1
2 1 1 1 1 1 NA
3 1 1 NA 1 1 1
4 1 0 0 1 0 1
5 0 0 0 NA NA NA
Basically, there are cases of sporadic, random missingness in this data (rows 1-4). However, where there are three zeros in a row, the remaining values were converted to NA by a stopping rule for multiple "wrong" answers (row 5). Theoretically I could just blindly replace all of the NAs with the following code:
df %>%
mutate(across(everything(),
~ replace(.,
is.na(.),
0)))
And the NA's would be removed:
# A tibble: 5 × 6
X1 X2 X3 X4 X5 X6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 1 1 1 1 1
2 1 1 1 1 1 0
3 1 1 0 1 1 1
4 1 0 0 1 0 1
5 0 0 0 0 0 0
However, this does not faithfully address the problem. The NAs that are random are actually missing, whereas the values that were made NA by the stopping rule are not. So I need a way to conditionally replace these values in all cases where three 0s are recorded in a row, but I'm struggling to figure out how to do this.
Using is.na, we could paste0 each row into a string and check whether the number of matches of 111 (three NAs in a row) is greater than zero, using stringi::stri_count, to create a flag. After that, replace the NAs with zeros in the rows where the flag is set.
num_NA <- 3
flag <- apply(+(is.na(df)), 1, paste0, collapse='') |>
stringi::stri_count(regex=paste(rep(1, num_NA), collapse='')) |> base::`>`(0)
df[flag, ] <- lapply(df[flag, ], \(x) replace(x, is.na(x), 0))
df
# X1 X2 X3 X4 X5 X6
# 1 NA 1 1 1 1 1
# 2 1 1 1 1 1 NA
# 3 1 1 NA 1 1 1
# 4 1 0 0 1 0 1
# 5 0 0 0 0 0 0
Data:
df <- structure(list(X1 = c(NA, 1, 1, 1, 0), X2 = c(1, 1, 1, 0, 0),
X3 = c(1, 1, NA, 0, 0), X4 = c(1, 1, 1, 1, NA), X5 = c(1,
1, 1, 0, NA), X6 = c(1, NA, 1, 1, NA)), class = "data.frame", row.names = c(NA,
-5L))
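The same flag can probably be built without pasting strings, by running `rle` on the missingness pattern of each row. This is only a sketch under the assumption that "at least `n` consecutive NAs" is the intended condition; the helper name `has_na_run` is hypothetical.

```r
df <- data.frame(X1 = c(NA, 1, 1, 1, 0), X2 = c(1, 1, 1, 0, 0),
                 X3 = c(1, 1, NA, 0, 0), X4 = c(1, 1, 1, 1, NA),
                 X5 = c(1, 1, 1, 0, NA), X6 = c(1, NA, 1, 1, NA))

# Hypothetical helper: TRUE if x contains a run of at least n consecutive NAs
has_na_run <- function(x, n = 3) {
  r <- rle(is.na(x))
  any(r$values & r$lengths >= n)
}

flag <- apply(df, 1, has_na_run)   # one logical per row

# is.na(df) is a matrix; flag is recycled down its columns, so cell [i, j]
# is TRUE only when the value is NA and row i is flagged
fill <- is.na(df) & flag
df[fill] <- 0
df
```

Only row 5 is flagged here, so the isolated NAs in rows 1-3 are left untouched.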
Using base R, and complicating things a little...
df2 <- rbind(df, df)
> df2
X1 X2 X3 X4 X5 X6
1 NA 1 1 1 1 1
2 1 1 1 1 1 NA
3 1 1 NA 1 1 1
4 1 0 0 1 0 1
5 0 0 0 NA NA NA
6 NA 1 1 1 1 1
7 1 1 1 1 1 NA
8 1 1 NA 1 1 1
9 1 0 0 1 0 1
10 0 0 0 NA NA NA
# fiddle with it
df2[3,] <- c(0,NA,0,NA,0,NA)
df2[6,] <- c(NA,0,0,0,NA,NA)
You're at your earliest stage: data wrangling.
df2
X1 X2 X3 X4 X5 X6
1 NA 1 1 1 1 1
2 1 1 1 1 1 NA
3 0 NA 0 NA 0 NA
4 1 0 0 1 0 1
5 0 0 0 NA NA NA
6 NA 0 0 0 NA NA
7 1 1 1 1 1 NA
8 1 1 NA 1 1 1
9 1 0 0 1 0 1
10 0 0 0 NA NA NA
After applying @jay52's solution above (entirely correct given the data offered), what should be said to the test takers in rows 5 and 10 about the good fortune of the test taker in row 6?
df3
X1 X2 X3 X4 X5 X6
1 NA 1 1 1 1 1
2 1 1 1 1 1 NA
3 0 NA 0 NA 0 NA
4 1 0 0 1 0 1
5 0 0 0 0 0 0
6 NA 0 0 0 NA NA
7 1 1 1 1 1 NA
8 1 1 NA 1 1 1
9 1 0 0 1 0 1
10 0 0 0 0 0 0
After all, per the scoring protocol a series of 3 zeros in a row is meant to have consequences (three in a row, it appears, and you're out). I would say that an rle-style test is necessary to capture this circumstance, as unwieldy as rle seems to be. Continuing with base:
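rle_lst_unc <- lapply(apply(df2, 1, rle), unclass)
# the result list must exist before it is filled inside the loop
idx_3_0 <- vector("list", length(rle_lst_unc))
for (k in seq_along(rle_lst_unc)) {
# for each run of length 3 in row k, record whether its value is 0
idx_3_0[[k]] <- unname(rle_lst_unc[[k]]$values[rle_lst_unc[[k]]$lengths == 3] == 0)
}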
true_3_0 <- which(lengths(idx_3_0) == 1)[which(unlist(idx_3_0) == TRUE)]
df2[true_3_0, ] <- 0
df2
X1 X2 X3 X4 X5 X6
1 NA 1 1 1 1 1
2 1 1 1 1 1 NA
3 0 NA 0 NA 0 NA
4 1 0 0 1 0 1
5 0 0 0 0 0 0
6 0 0 0 0 0 0
7 1 1 1 1 1 NA
8 1 1 NA 1 1 1
9 1 0 0 1 0 1
10 0 0 0 0 0 0
And due to rle, 6 is treated the same as 5 & 10.
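For comparison, here is a more compact sketch of the same rle idea. It is not the code above: it flags any run of at least `n` zeros (the answer tests for runs of exactly 3), and the helper name `has_zero_run` is hypothetical.

```r
df <- data.frame(X1 = c(NA, 1, 1, 1, 0), X2 = c(1, 1, 1, 0, 0),
                 X3 = c(1, 1, NA, 0, 0), X4 = c(1, 1, 1, 1, NA),
                 X5 = c(1, 1, 1, 0, NA), X6 = c(1, NA, 1, 1, NA))
df2 <- rbind(df, df)
df2[3, ] <- c(0, NA, 0, NA, 0, NA)
df2[6, ] <- c(NA, 0, 0, 0, NA, NA)

# Hypothetical helper: TRUE if x contains a run of at least n zeros.
# rle() puts every NA in its own run, so NAs never extend a run of zeros.
has_zero_run <- function(x, n = 3) {
  r <- rle(x)
  any(!is.na(r$values) & r$values == 0 & r$lengths >= n)
}

hit <- which(apply(df2, 1, has_zero_run))
unname(hit)
# 5 6 10

df2[hit, ] <- 0   # same consequence as above: the whole row is scored 0
```

Row 3 is not flagged because its zeros are separated by NAs, which matches the result shown above.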

How to use forloop to create a sequence of columns

I need to create the variable x out of the variable y as below.
df$x<-0
df$x<-ifelse(df$y==0 | df$y==1, 1, 0)
df$x[is.na(df$x)] <- 0
However, I have y ranging from 1 to 52, which means I need to create x1 through x52. I am an avid Stata user, and this is pretty straightforward to do there using the forval function. However, I am having difficulty doing it in R. I thought about the following, but it didn't work out very well:
for (i in 1:52){
df$x[i] <- 0
.
.
.
}
I thought I could let R replace the i with the values from the loop, the same way Stata does.
Thanks
Try something like this. Here is an example using dummy data:
set.seed(123)
#Data
df <- as.data.frame(matrix(rnorm(520),nrow = 10,ncol = 52))
names(df) <- paste0('y',1:52)
#new names
vals <- paste0('x',1:52)
#Loop
for(i in vals)
{
df[[i]]<-0
df[[i]]<-ifelse(df[[gsub('x','y',i)]]==0 | df[[gsub('x','y',i)]]==1, 1, 0)
df[[i]][is.na(df[[i]])] <- 0
}
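The loop can also be written without an explicit `for`, by assigning a list of new columns in one step. This is a sketch on a small made-up data frame with three `y` columns instead of 52. Using `%in%` rather than `==` means NA is already treated as "no match", so the separate NA fix-up step is not needed.

```r
# Small hypothetical data: three y columns instead of 52
df <- data.frame(y1 = c(0, 1, 2, NA),
                 y2 = c(3, 0, NA, 1),
                 y3 = c(1, 1, 5, 0))

y_cols <- paste0("y", 1:3)
x_cols <- paste0("x", 1:3)

# NA == 0 is NA, but NA %in% c(0, 1) is FALSE
c(0, 1, 2, NA) == 0 | c(0, 1, 2, NA) == 1   # TRUE TRUE FALSE NA
c(0, 1, 2, NA) %in% c(0, 1)                 # TRUE TRUE FALSE FALSE

# One new x column per y column, in the same order
df[x_cols] <- lapply(df[y_cols], function(y) as.integer(y %in% c(0, 1)))
df
```

For the real data, replacing `1:3` with `1:52` should be the only change needed, assuming the columns are named `y1` to `y52`.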
Suppose you had data that looked something like this:
data
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
1 1 3 0 NA 3 NA 3 1 2 3
2 1 1 NA NA 3 0 3 3 0 0
3 0 1 1 0 2 2 1 1 0 1
4 0 NA 2 1 3 2 NA 0 2 0
5 1 2 NA 2 0 1 2 3 2 3
6 3 NA 1 3 NA NA NA 3 NA 3
7 2 NA 3 3 NA 0 NA 1 1 1
8 NA 3 2 1 1 NA 1 0 1 2
9 0 1 0 NA NA 0 2 0 NA 2
10 1 0 3 0 3 2 NA 0 1 2
One approach might be to use dplyr::mutate with across:
library(tidyverse)
data %>%
mutate(across(everything(),~ ifelse(. %in% c(0,1), 1, 0),
.names = "y{.col}")) %>%
rename_all(~str_replace(.,"yx","y"))
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 y1 y2 y3 y4 y5 y6 y7 y8 y9 y10
1 2 1 2 2 2 2 0 1 1 0 0 1 0 0 0 0 1 1 1 1
2 3 2 3 1 3 3 3 2 0 1 0 0 0 1 0 0 0 0 1 1
3 0 2 2 1 0 2 3 0 2 0 1 0 0 1 1 0 0 1 0 1
4 2 2 3 1 0 0 1 1 2 3 0 0 0 1 1 1 1 1 0 0
5 1 3 1 2 2 2 3 2 3 3 1 0 1 0 0 0 0 0 0 0
6 3 1 1 1 0 3 2 2 1 2 0 1 1 1 1 0 0 0 1 0
7 1 1 3 1 3 1 1 0 1 2 1 1 0 1 0 1 1 1 1 0
8 1 2 3 3 2 1 2 2 2 0 1 0 0 0 0 1 0 0 0 1
9 1 2 3 0 2 3 0 0 2 1 1 0 0 1 0 0 1 1 0 1
10 2 0 1 0 3 2 3 2 2 3 0 1 1 1 0 0 0 0 0 0
Example data:
set.seed(123)
data <- as.data.frame(matrix(sample(c(NA,0:3),100,replace = TRUE),ncol =10))
names(data) <- paste0("x",1:10)

Count of a string of values between values

I have a simple dataframe that is a set of ID columns and values of 0 or 1, for an example:
data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 1 0 1 0 0 1 1 1 0
2 0 0 0 1 0 1 0 0 1 0
3 0 1 1 1 1 0 1 1 1 1
4 0 0 0 1 1 1 1 1 1 0
5 1 0 1 0 1 1 0 1 1 0
6 0 1 1 1 1 1 0 1 1 1
I want to write a code or loop that for every column, counts the number of 0's until encountering another 1, and continues down the column. So ideally the output is a new dataframe with the same ID column head, and a list of counts:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 3 1 2 1 2 1 1 1 NA 2
2 1 2 1 1 NA 1 2 NA NA 2
I'm not sure how to do this and also the row outcome may be of different lengths. If each column has to create a new dataframe that's fine.
Here's a base R solution. I used a size-10 example instead of a size 1000 example so we can actually see what's going on and make sure it looks right.
set.seed(47)
d = data.frame(replicate(10,sample(0:1,10,rep=TRUE)))
d
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 0 0 0 0 0 0 1 1 0 0
# 2 0 1 0 1 0 0 0 0 0 0
# 3 1 1 1 0 1 0 0 0 1 0
# 4 0 0 0 0 0 1 1 1 1 1
# 5 1 1 0 1 0 0 1 1 1 0
# 6 0 1 1 1 1 1 1 1 0 1
# 7 1 1 0 0 1 0 0 1 1 0
# 8 0 0 1 0 1 0 1 0 0 0
# 9 0 0 0 1 1 1 0 0 1 1
# 10 1 1 1 0 1 0 1 1 0 0
results = lapply(d, function(x) with(rle(x), lengths[values == 0]))
max_length = max(lengths(results))
results = lapply(results, function(x) {length(x) = max_length; x})
results = do.call(cbind, results)
results
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# [1,] 2 1 2 1 2 3 2 2 2 3
# [2,] 1 1 2 2 2 1 1 2 1 1
# [3,] 1 2 1 2 NA 2 1 NA 1 2
# [4,] 2 NA 1 1 NA 1 NA NA 1 1
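To see what the `rle` step does to a single column, here is a short worked example on a made-up vector `x`:

```r
x <- c(0, 0, 1, 0, 1, 1, 0, 0, 0)   # hypothetical column

r <- rle(x)
r$lengths   # 2 1 1 2 3  (length of each run)
r$values    # 0 1 0 1 0  (value repeated in each run)

# Keep only the lengths of the runs whose value is 0
zero_runs <- r$lengths[r$values == 0]
zero_runs
# 2 1 3
```

Because each column can produce a different number of runs, the answer pads the shorter results with NA (via `length(x) = max_length`) before binding them together.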
One dplyr and purrr option could be:
map(.x = names(df),
~ df %>%
mutate(rleid = with(rle(get(.x)), rep(seq_along(lengths), lengths))) %>%
filter(get(.x) == 0) %>%
group_by(rleid = cumsum(!duplicated(rleid))) %>%
summarise(!!.x := n())) %>%
reduce(full_join, by = c("rleid" = "rleid"))
rleid X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 2 2 9 2 3 4 1 1
2 2 1 1 NA 3 NA 1 1 2 1 1
3 3 1 3 NA NA NA 2 1 NA 2 2
4 4 1 NA NA NA NA 1 NA NA 1 2
Sample data:
set.seed(123)
df <- data.frame(replicate(10, sample(0:1, 10, rep = TRUE)))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 0 1 1 1 0 0 1 1 0 0
2 1 0 1 1 0 0 0 1 1 1
3 0 1 1 1 0 1 0 1 0 0
4 1 1 1 1 0 0 0 0 1 1
5 1 0 1 0 0 1 1 0 0 0
6 0 1 1 0 0 0 0 0 0 0
7 1 0 1 1 0 0 1 0 1 1
8 1 0 1 0 0 1 1 1 1 0
9 1 0 0 0 0 1 1 0 1 0
10 0 1 0 0 1 0 0 0 0 1
Here's an alternate approach that uses the indices of the 1 values to determine the runs of zeros (using Gregor's data, here assumed to be stored in df):
library(purrr)
map(df, ~ {
y <- diff(c(0, which(.x == 1), nrow(df) + 1)) - 1
y[y != 0]
}) %>%
map_df(`length<-`, max(lengths(.)))
# A tibble: 4 x 10
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 2 1 2 3 2 2 2 3
2 1 1 2 2 2 1 1 2 1 1
3 1 2 1 2 NA 2 1 NA 1 2
4 2 NA 1 1 NA 1 NA NA 1 1
Or same in base R:
res <- lapply(df, function(x) {
y <- diff(c(0, which(x == 1), nrow(df) + 1)) - 1
y[y != 0]})
data.frame(do.call(cbind, lapply(res, `length<-`, max(lengths(res)))))
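The index arithmetic is easier to follow on one small vector. In this sketch the vector `x` and the helper name `zero_runs` are made up; the body is the same expression as in the answer, with `length(x)` standing in for `nrow(df)`.

```r
# Hypothetical helper: lengths of the runs of 0 in a 0/1 vector
zero_runs <- function(x) {
  ones <- which(x == 1)                        # positions of the 1s
  # Pad with position 0 and position length(x) + 1 so that leading and
  # trailing zeros are counted too; the gap between neighbours minus 1
  # is the number of zeros between them
  gaps <- diff(c(0, ones, length(x) + 1)) - 1
  gaps[gaps != 0]                              # adjacent 1s give a gap of 0
}

x <- c(0, 0, 1, 0, 1, 1, 0, 0, 0)
which(x == 1)   # 3 5 6
zero_runs(x)
# 2 1 3
```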

Creating dummies with apply in R

I have data about different study strategies for individuals (stored in columns labeled Strategy_A, Strategy_B, Strategy_C). The strategies are coded 1-15. I want to create a dummy for each strategy (e.g. strategy1, strategy2, etc.) because each student can list up to 3 strategies.
Example Data
ID = c(1, 2, 3, 4, 5)
Strategy_A = c(10, 12, 13, 1, 2)
Strategy_B = c(1, 2, 1, 4, 5)
Strategy_C = c(2, 3, 6, 8, 15)
all = data.frame(ID, Strategy_A, Strategy_B, Strategy_C)
I thought about using apply and creating a function linked to the fastDummies package.
dummies = function(x){
dummy_cols(x)
}
new = apply(all [,-1], 2, dummies)
new = as.data.frame(new)
However, this creates dummies for StrategyA_1 StrategyA_2 StrategyA_3 rather than summarizing the dummies as Strategy1 Strategy2 Strategy3. Any ideas how to fix this?
After a small transformation of all, you can use dummy.data.frame() from dummies (you can also use dummy_cols() from fastDummies) and then aggregate per ID.
all <- data.frame(ID = rep(all$ID, 3),
Strategy = c(all$Strategy_A, all$Strategy_B, all$Strategy_C)) # data frame "all" with one column Strategy
library(dummies)
all <- dummy.data.frame(all, "Strategy") # or fastDummies::dummy_cols(all, "Strategy")
aggregate(. ~ ID, all, sum) # since strategies are now dummies, the sum will always be 0 or 1
# output
ID Strategy1 Strategy2 Strategy3 Strategy4 Strategy5 Strategy6 Strategy8 Strategy10 Strategy12 Strategy13 Strategy15
1 1 1 1 0 0 0 0 0 1 0 0 0
2 2 0 1 1 0 0 0 0 0 1 0 0
3 3 1 0 0 0 0 1 0 0 0 1 0
4 4 1 0 0 1 0 0 1 0 0 0 0
5 5 0 1 0 0 1 0 0 0 0 0 1
Here is a method that uses the tidyverse.
library(tidyverse)
new <- all %>% gather(select = -ID) %>%
mutate(key = NULL, num = 1) %>%
spread(value, num)
# ID 1 2 3 4 5 6 8 10 12 13 15
# 1 1 1 1 NA NA NA NA NA 1 NA NA NA
# 2 2 NA 1 1 NA NA NA NA NA 1 NA NA
# 3 3 1 NA NA NA NA 1 NA NA NA 1 NA
# 4 4 1 NA NA 1 NA NA 1 NA NA NA NA
# 5 5 NA 1 NA NA 1 NA NA NA NA NA 1
new[is.na(new)] <- 0
new
# ID 1 2 3 4 5 6 8 10 12 13 15
# 1 1 1 1 0 0 0 0 0 1 0 0 0
# 2 2 0 1 1 0 0 0 0 0 1 0 0
# 3 3 1 0 0 0 0 1 0 0 0 1 0
# 4 4 1 0 0 1 0 0 1 0 0 0 0
# 5 5 0 1 0 0 1 0 0 0 0 0 1
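If adding a package is not an option, a base R sketch of the same reshape-then-tabulate idea is possible with `table()`. The intermediate names `long` and `dummies` are made up for illustration, and the cap at 1 is an assumption about how a strategy listed twice by the same student should be coded.

```r
all <- data.frame(ID = c(1, 2, 3, 4, 5),
                  Strategy_A = c(10, 12, 13, 1, 2),
                  Strategy_B = c(1, 2, 1, 4, 5),
                  Strategy_C = c(2, 3, 6, 8, 15))

# Stack the three strategy columns into one vector, repeating the IDs
long <- data.frame(ID = rep(all$ID, times = 3),
                   Strategy = unlist(all[-1], use.names = FALSE))

# Cross-tabulate ID against strategy code; each cell counts the mentions
dummies <- as.data.frame.matrix(table(long$ID, long$Strategy))
dummies[dummies > 1] <- 1L   # assumption: a repeated code still counts once
names(dummies) <- paste0("Strategy", names(dummies))
dummies <- cbind(ID = as.numeric(rownames(dummies)), dummies)
dummies
```

As in the other answers, only strategy codes that actually occur in the data get a column.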

Row-wise operation by group over time R

Problem:
I am trying to create variable x2 which is equal to 1, for all rows within each ID group where over time x1 switches from 1 to 0.
Additionally, after the switch, every consecutive 0 in the run, x2 is set to 1.
I tried to figure out how to do this using library(dplyr), but could not figure out how to look at previous records within the group.
Input Data:
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<-c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1")
df<-data.frame(ID,time,x1)
Required Output:
ID time x1 x2
1 1 0 0
1 2 1 0
1 3 1 0
1 4 1 0
1 5 1 0
2 1 0 0
2 2 0 0
2 3 0 0
2 4 0 0
3 1 1 0
3 2 0 1
3 3 0 1
4 1 1 0
4 2 1 0
5 1 1 0
5 2 0 1
5 3 1 0
It is better to have 'x1' as a numeric column.
library(data.table)
setDT(df)[, x2 := (cumsum(x1) < 2)*cumsum(c(FALSE, diff(x1) < 0)), ID]
df
# ID time x1 x2
# 1: 1 1 0 0
# 2: 1 2 1 0
# 3: 1 3 1 0
# 4: 1 4 1 0
# 5: 1 5 1 0
# 6: 2 1 0 0
# 7: 2 2 0 0
# 8: 2 3 0 0
# 9: 2 4 0 0
#10: 3 1 1 0
#11: 3 2 0 1
#12: 3 3 0 1
#13: 4 1 1 0
#14: 4 2 1 0
#15: 5 1 1 0
#16: 5 2 0 1
#17: 5 3 1 0
data
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<- as.integer(c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1"))
df<-data.frame(ID,time,x1)
If you want a dplyr answer, you can use @akrun's code in mutate after grouping by ID.
library(dplyr)
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<- as.integer(c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1"))
df<-data.frame(ID,time,x1)
df <- df %>%
group_by(ID) %>%
mutate(x2 = (cumsum(x1) < 2)*cumsum(c(FALSE, diff(x1) < 0)))
df
# ID time x1 x2
# 1 1 0 0
# 1 2 1 0
# 1 3 1 0
# 1 4 1 0
# 1 5 1 0
# 2 1 0 0
# 2 2 0 0
# 2 3 0 0
# 2 4 0 0
# 3 1 1 0
# 3 2 0 1
# 3 3 0 1
# 4 1 1 0
# 4 2 1 0
# 5 1 1 0
# 5 2 0 1
# 5 3 1 0
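A base R variant is also possible with `ave()`. The sketch below uses a different rule from the `cumsum` expression above: it flags every 0 that occurs after a 1 has already been seen within the ID. That appears to match the required output and would also flag a second 1-to-0 switch, but it assumes rows are ordered by time within each ID and that x1 has no NAs. The helper name `flag_zero_after_one` is hypothetical.

```r
df <- data.frame(
  ID   = c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5"),
  time = c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3"),
  x1   = as.integer(c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1"))
)

# Hypothetical helper: 1 where the value is 0 and a 1 occurred earlier
# in the same group (cummax() becomes 1 from the first 1 onwards)
flag_zero_after_one <- function(v) as.integer(v == 0 & cummax(v) == 1)

# ave() applies the helper within each ID and returns the values in row order
df$x2 <- ave(df$x1, df$ID, FUN = flag_zero_after_one)
df$x2
# 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0
```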

Resources