Count of a string of values between values in R

I have a simple data frame that is a set of ID columns containing values of 0 or 1, for example:
data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 1 0 1 0 0 1 1 1 0
2 0 0 0 1 0 1 0 0 1 0
3 0 1 1 1 1 0 1 1 1 1
4 0 0 0 1 1 1 1 1 1 0
5 1 0 1 0 1 1 0 1 1 0
6 0 1 1 1 1 1 0 1 1 1
I want to write code or a loop that, for every column, counts the number of 0s until it encounters another 1, and continues down the column. So ideally the output is a new data frame with the same ID column headers and the counts:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 3 1 2 1 2 1 1 1 NA 2
2 1 2 1 1 NA 1 2 NA NA 2
I'm not sure how to do this, and the resulting columns may also end up with different lengths. If each column has to go into its own data frame, that's fine.

Here's a base R solution. I used a size-10 example instead of a size 1000 example so we can actually see what's going on and make sure it looks right.
set.seed(47)
d = data.frame(replicate(10,sample(0:1,10,rep=TRUE)))
d
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 0 0 0 0 0 0 1 1 0 0
# 2 0 1 0 1 0 0 0 0 0 0
# 3 1 1 1 0 1 0 0 0 1 0
# 4 0 0 0 0 0 1 1 1 1 1
# 5 1 1 0 1 0 0 1 1 1 0
# 6 0 1 1 1 1 1 1 1 0 1
# 7 1 1 0 0 1 0 0 1 1 0
# 8 0 0 1 0 1 0 1 0 0 0
# 9 0 0 0 1 1 1 0 0 1 1
# 10 1 1 1 0 1 0 1 1 0 0
results = lapply(d, function(x) with(rle(x), lengths[values == 0]))
max_length = max(lengths(results))
results = lapply(results, function(x) {length(x) = max_length; x})
results = do.call(cbind, results)
results
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# [1,] 2 1 2 1 2 3 2 2 2 3
# [2,] 1 1 2 2 2 1 1 2 1 1
# [3,] 1 2 1 2 NA 2 1 NA 1 2
# [4,] 2 NA 1 1 NA 1 NA NA 1 1
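The heavy lifting here is done by rle(), which collapses a vector into runs of equal values. As a quick illustration of the core trick on the first column of d above:
x <- d$X1                  # 0 0 1 0 1 0 1 0 0 1
r <- rle(x)
r$lengths                  # 2 1 1 1 1 1 2 1   (length of each run)
r$values                   # 0 1 0 1 0 1 0 1   (the value each run consists of)
r$lengths[r$values == 0]   # 2 1 1 2           (lengths of the zero runs, i.e. column X1 of results)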

One dplyr and purrr option could be:
library(dplyr)
library(purrr)

map(.x = names(df),
    ~ df %>%
      mutate(rleid = with(rle(get(.x)), rep(seq_along(lengths), lengths))) %>%
      filter(get(.x) == 0) %>%
      group_by(rleid = cumsum(!duplicated(rleid))) %>%
      summarise(!!.x := n())) %>%
  reduce(full_join, by = c("rleid" = "rleid"))
rleid X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 2 2 9 2 3 4 1 1
2 2 1 1 NA 3 NA 1 1 2 1 1
3 3 1 3 NA NA NA 2 1 NA 2 2
4 4 1 NA NA NA NA 1 NA NA 1 2
Sample data:
set.seed(123)
df <- data.frame(replicate(10, sample(0:1, 10, rep = TRUE)))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 0 1 1 1 0 0 1 1 0 0
2 1 0 1 1 0 0 0 1 1 1
3 0 1 1 1 0 1 0 1 0 0
4 1 1 1 1 0 0 0 0 1 1
5 1 0 1 0 0 1 1 0 0 0
6 0 1 1 0 0 0 0 0 0 0
7 1 0 1 1 0 0 1 0 1 1
8 1 0 1 0 0 1 1 1 1 0
9 1 0 0 0 0 1 1 0 1 0
10 0 1 0 0 1 0 0 0 0 1

Here's an alternate approach that uses the indices of the 1 values to determine the runs of zero (using Gregor's data):
library(purrr)
map(df, ~ {
  y <- diff(c(0, which(.x == 1), nrow(df) + 1)) - 1
  y[y != 0]
}) %>%
  map_df(`length<-`, max(lengths(.)))
# A tibble: 4 x 10
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 2 1 2 3 2 2 2 3
2 1 1 2 2 2 1 1 2 1 1
3 1 2 1 2 NA 2 1 NA 1 2
4 2 NA 1 1 NA 1 NA NA 1 1
Or the same in base R:
res <- lapply(df, function(x) {
  y <- diff(c(0, which(x == 1), nrow(df) + 1)) - 1
  y[y != 0]
})
data.frame(do.call(cbind, lapply(res, `length<-`, max(lengths(res)))))

How to filter out data with conditional statement for series of numbers in R?

Data
Here is the data for my example:
#### Create Data ####
library(dplyr)  # for %>%, as_tibble() and the mutate()/across() calls below
df <- data.frame(X1 = c(NA, 1, 1, 1, 0),
                 X2 = c(1, 1, 1, 0, 0),
                 X3 = c(1, 1, NA, 0, 0),
                 X4 = c(1, 1, 1, 1, NA),
                 X5 = c(1, 1, 1, 0, NA),
                 X6 = c(1, NA, 1, 1, NA)) %>%
  as_tibble()
Problem
When you print the data, it looks like this:
# A tibble: 5 × 6
X1 X2 X3 X4 X5 X6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA 1 1 1 1 1
2 1 1 1 1 1 NA
3 1 1 NA 1 1 1
4 1 0 0 1 0 1
5 0 0 0 NA NA NA
Basically, there are cases of sporadic, random missingness in this data (rows 1-4). However, the rows with three zeroes in a row are the ones whose remaining values were converted to NA after a stopping rule for multiple "wrong" answers kicked in (row 5). Theoretically I could just blindly replace all the NAs with 0 using the following code:
df %>%
  mutate(across(everything(),
                ~ replace(., is.na(.), 0)))
And the NAs would be replaced with 0:
# A tibble: 5 × 6
X1 X2 X3 X4 X5 X6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 1 1 1 1 1
2 1 1 1 1 1 0
3 1 1 0 1 1 1
4 1 0 0 1 0 1
5 0 0 0 0 0 0
However, this does not faithfully address the problem. The NAs that are random are genuinely missing, whereas the values that were set to NA by the stopping rule are not. So I need a way to conditionally convert the NAs to 0 only in cases where three 0s are recorded in a row, but I'm struggling to figure out how to do this.
Using is.na we could paste0 each row of the NA indicator into a string and use stringi::stri_count to check whether the number of matches of "111" (i.e. three consecutive NAs) is greater than zero, creating a flag. After that, replace NAs with zeros in the flagged rows.
num_NA <- 3
flag <- apply(+(is.na(df)), 1, paste0, collapse = '') |>
  stringi::stri_count(regex = paste(rep(1, num_NA), collapse = '')) |>
  base::`>`(0)
df[flag, ] <- lapply(df[flag, ], \(x) replace(x, is.na(x), 0))
df
# X1 X2 X3 X4 X5 X6
# 1 NA 1 1 1 1 1
# 2 1 1 1 1 1 NA
# 3 1 1 NA 1 1 1
# 4 1 0 0 1 0 1
# 5 0 0 0 0 0 0
Data:
df <- structure(list(X1 = c(NA, 1, 1, 1, 0), X2 = c(1, 1, 1, 0, 0),
X3 = c(1, 1, NA, 0, 0), X4 = c(1, 1, 1, 1, NA), X5 = c(1,
1, 1, 0, NA), X6 = c(1, NA, 1, 1, NA)), class = "data.frame", row.names = c(NA,
-5L))
Using base, and complicating things a little...
df2 <- rbind(df, df)
df2
X1 X2 X3 X4 X5 X6
1 NA 1 1 1 1 1
2 1 1 1 1 1 NA
3 1 1 NA 1 1 1
4 1 0 0 1 0 1
5 0 0 0 NA NA NA
6 NA 1 1 1 1 1
7 1 1 1 1 1 NA
8 1 1 NA 1 1 1
9 1 0 0 1 0 1
10 0 0 0 NA NA NA
# fiddle with it
df2[3,] <- c(0,NA,0,NA,0,NA)
df2[6,] <- c(NA,0,0,0,NA,NA)
You're at your earliest stage, data-wrangling.
df2
X1 X2 X3 X4 X5 X6
1 NA 1 1 1 1 1
2 1 1 1 1 1 NA
3 0 NA 0 NA 0 NA
4 1 0 0 1 0 1
5 0 0 0 NA NA NA
6 NA 0 0 0 NA NA
7 1 1 1 1 1 NA
8 1 1 NA 1 1 1
9 1 0 0 1 0 1
10 0 0 0 NA NA NA
After applying #jay52's solution above (entirely correct given the data offered), what should be said to the test takers in rows 5 and 10 about the good fortune of the test taker in row 6?
df3
X1 X2 X3 X4 X5 X6
1 NA 1 1 1 1 1
2 1 1 1 1 1 NA
3 0 NA 0 NA 0 NA
4 1 0 0 1 0 1
5 0 0 0 0 0 0
6 NA 0 0 0 NA NA
7 1 1 1 1 1 NA
8 1 1 NA 1 1 1
9 1 0 0 1 0 1
10 0 0 0 0 0 0
given that a series of three zeros in a row is, per the scoring protocol, meant to have consequences (three in a row, it appears, and you're out). I would say a resort to an rle-style test is necessary to capture this circumstance, unwieldy as rle can be. Continuing with base:
rle_lst_unc <- lapply(apply(df2, 1, rle), unclass)
idx_3_0 <- vector("list", length(rle_lst_unc))  # initialise before filling in the loop
for (k in seq_along(rle_lst_unc)) {
  idx_3_0[[k]] <- unname(rle_lst_unc[[k]]$values[rle_lst_unc[[k]]$lengths == 3] == 0)
}
true_3_0 <- which(lengths(idx_3_0) == 1)[which(unlist(idx_3_0) == TRUE)]
df2[true_3_0, ] <- 0
df2
X1 X2 X3 X4 X5 X6
1 NA 1 1 1 1 1
2 1 1 1 1 1 NA
3 0 NA 0 NA 0 NA
4 1 0 0 1 0 1
5 0 0 0 0 0 0
6 0 0 0 0 0 0
7 1 1 1 1 1 NA
8 1 1 NA 1 1 1
9 1 0 0 1 0 1
10 0 0 0 0 0 0
And due to rle, row 6 is treated the same as rows 5 and 10.
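For what it's worth, a more compact sketch of the same rle-style test (an illustration only, assuming as above that a run of exactly three zeros is the trigger):
# Flag rows whose values contain a run of exactly three zeros, then zero those rows out.
has_three_zeros <- vapply(seq_len(nrow(df2)), function(k) {
  r <- rle(unlist(df2[k, ]))
  any(r$lengths == 3 & !is.na(r$values) & r$values == 0)
}, logical(1))
df2[has_three_zeros, ] <- 0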

How to use a for loop to create a sequence of columns

I need to create the variable x out of the variable y as below.
df$x<-0
df$x<-ifelse(df$y==0 | df$y==1, 1, 0)
df$x[is.na(df$x)] <- 0
However I have y ranging from 1 to 52, which means I need to create x1 through x52. I am an avid Stata user and this is pretty straightforward to do with the forval function. However, I am having difficulties doing it in R. I thought about the following, but it didn't work out very well:
for (i in 1:52){
df$x[i] <- 0
.
.
.
}
I thought I could let R replace the i with the values from the loop, the same way Stata does.
Thanks
Try something like this. Here is an example using dummy data:
set.seed(123)
#Data
df <- as.data.frame(matrix(rnorm(520),nrow = 10,ncol = 52))
names(df) <- paste0('y',1:52)
#new names
vals <- paste0('x',1:52)
#Loop
for (i in vals) {
  df[[i]] <- 0
  df[[i]] <- ifelse(df[[gsub('x', 'y', i)]] == 0 | df[[gsub('x', 'y', i)]] == 1, 1, 0)
  df[[i]][is.na(df[[i]])] <- 0
}
Suppose you had data that looked something like this:
data
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
1 1 3 0 NA 3 NA 3 1 2 3
2 1 1 NA NA 3 0 3 3 0 0
3 0 1 1 0 2 2 1 1 0 1
4 0 NA 2 1 3 2 NA 0 2 0
5 1 2 NA 2 0 1 2 3 2 3
6 3 NA 1 3 NA NA NA 3 NA 3
7 2 NA 3 3 NA 0 NA 1 1 1
8 NA 3 2 1 1 NA 1 0 1 2
9 0 1 0 NA NA 0 2 0 NA 2
10 1 0 3 0 3 2 NA 0 1 2
One approach might be to use dplyr::mutate with across:
library(tidyverse)
data %>%
  mutate(across(everything(), ~ ifelse(. %in% c(0, 1), 1, 0),
                .names = "y{.col}")) %>%
  rename_all(~ str_replace(., "yx", "y"))
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 y1 y2 y3 y4 y5 y6 y7 y8 y9 y10
1 2 1 2 2 2 2 0 1 1 0 0 1 0 0 0 0 1 1 1 1
2 3 2 3 1 3 3 3 2 0 1 0 0 0 1 0 0 0 0 1 1
3 0 2 2 1 0 2 3 0 2 0 1 0 0 1 1 0 0 1 0 1
4 2 2 3 1 0 0 1 1 2 3 0 0 0 1 1 1 1 1 0 0
5 1 3 1 2 2 2 3 2 3 3 1 0 1 0 0 0 0 0 0 0
6 3 1 1 1 0 3 2 2 1 2 0 1 1 1 1 0 0 0 1 0
7 1 1 3 1 3 1 1 0 1 2 1 1 0 1 0 1 1 1 1 0
8 1 2 3 3 2 1 2 2 2 0 1 0 0 0 0 1 0 0 0 1
9 1 2 3 0 2 3 0 0 2 1 1 0 0 1 0 0 1 1 0 1
10 2 0 1 0 3 2 3 2 2 3 0 1 1 1 0 0 0 0 0 0
Example data:
set.seed(123)
data <- as.data.frame(matrix(sample(c(NA,0:3),100,replace = TRUE),ncol =10))
names(data) <- paste0("x",1:10)

Count across several columns and make a new column in R

I want to count across several columns (x1-x4) and make a new column (x1_x4) with the row totals, as in the example below. Can anyone help me?
df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
                 x1 = c(0,NA,0,1,0,0,1,1,1,NA),
                 x2 = c(0,NA,1,0,0,NA,0,1,0,0),
                 x3 = c(0,NA,0,1,1,0,1,1,1,0),
                 x4 = c(0,NA,0,0,0,0,1,1,1,1))
You can use rowSums, and use apply with is.na to test whether all values in a row are NA.
df$x1_x4 <- rowSums(df[-1], na.rm = TRUE)
df$x1_x4[apply(is.na(df[2:5]), 1, all)] <- NA
# ID x1 x2 x3 x4 x1_x4
#1 1 0 0 0 0 0
#2 2 NA NA NA NA NA
#3 3 0 1 0 0 1
#4 4 1 0 1 0 2
#5 5 0 0 1 0 1
#6 6 0 NA 0 0 0
#7 7 1 0 1 1 3
#8 8 1 1 1 1 4
#9 9 1 0 1 1 3
#10 10 NA 0 0 1 1
One dplyr solution could be (the any(!is.na(...))^NA term evaluates to 1 when at least one value is non-NA and to NA when all values are NA, because 1^NA is 1 while 0^NA is NA):
df %>%
rowwise() %>%
mutate(x1_x4 = any(!is.na(c_across(-ID)))^NA * sum(c_across(-ID), na.rm = TRUE))
ID x1 x2 x3 x4 x1_x4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 0 0 0
2 2 NA NA NA NA NA
3 3 0 1 0 0 1
4 4 1 0 1 0 2
5 5 0 0 1 0 1
6 6 0 NA 0 0 0
7 7 1 0 1 1 3
8 8 1 1 1 1 4
9 9 1 0 1 1 3
10 10 NA 0 0 1 1
vars <- paste0("x", 1:4)
df$x1_x4 <- rowSums(df[vars], na.rm = TRUE)
df[rowSums(is.na(df[vars]), na.rm = TRUE) == 4, "x1_x4"] <- NA
df
# ID x1 x2 x3 x4 x1_x4
# 1 1 0 0 0 0 0
# 2 2 NA NA NA NA NA
# 3 3 0 1 0 0 1
# 4 4 1 0 1 0 2
# 5 5 0 0 1 0 1
# 6 6 0 NA 0 0 0
# 7 7 1 0 1 1 3
# 8 8 1 1 1 1 4
# 9 9 1 0 1 1 3
# 10 10 NA 0 0 1 1
Base R, in one (obfuscated) expression:
within(df, {
  x1_x4 <- apply(df[, grepl("^x", names(df))], 1,
                 function(x) ifelse(all(is.na(x)), NA_integer_, sum(x, na.rm = TRUE)))
})

R: rowsum function changes order of groups after aggregation

I've got this data frame which has duplicates (same ID but different numbers):
ID X1 X2 X3 X4 X5
45 1 0 0 1 0
45 0 1 0 0 1
15 1 0 1 0 0
7 1 0 1 1 0
7 0 1 0 0 0
I want to sum the rows that have the same ID, so I've used rowsum:
m <- rowsum(m, m$ID)
However, it messes up the order of the rows, showing something like this:
ID X1 X2 X3 X4 X5
15 1 0 1 0 0
45 1 1 0 1 1
7 1 1 1 1 0
Instead of what I want:
ID X1 X2 X3 X4 X5
45 1 1 0 1 1
15 1 0 1 0 0
7 1 1 1 1 0
Anyone knows how to fix this?
Put reorder = FALSE in rowsum.
From ?rowsum:
reorder: if ‘TRUE’, then the result will be in order of
‘sort(unique(group))’, if ‘FALSE’, it will be in the order
that groups were encountered.
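Applied to the data in the question, that could look like the sketch below (assuming m is the data frame shown above; summing m[-1] rather than m keeps the ID column itself from being added up):
m_agg <- rowsum(m[-1], m$ID, reorder = FALSE)  # groups kept in order of first appearance
m_agg <- data.frame(ID = unique(m$ID), m_agg, row.names = NULL)
m_agg
#   ID X1 X2 X3 X4 X5
# 1 45  1  1  0  1  1
# 2 15  1  0  1  0  0
# 3  7  1  1  1  1  0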

How to change data to binary in R and keep the row names column?

I have a data frame that looks like this
Site <- c("X1","X2","X3","X4","X5","X6","X7","X8","X9","X10")
A <- c(0,0,1,2,4,5,6,7,13,56)
B <- c(1,0,0,0,0,4,5,7,7,8)
C <- c(2,3,0,0,4,5,67,8,43,21)
D <- c(134,0,0,2,0,0,9,0,45,55)
mydata <- data.frame(Site,A,B,C,D,stringsAsFactors=FALSE)
I want to convert all values > 0 to be 1 (i.e. binary), without jeopardising the column and row names.
I have tried mydata[mydata >= 1] <- 1, but it changed my first column (the row names) to 1 as well:
head(mydata)
Site A B C D
1 1 0 1 1 1
2 1 0 0 1 0
3 1 1 0 0 0
4 1 1 0 0 1
5 1 1 0 1 0
6 1 1 1 1 0
So how do I change just the values to binary, not the row names?
We can create a logical matrix and coerce it to binary (the unary + converts TRUE/FALSE to 1/0):
mydata[-1] <- +(mydata[-1] > 0)
As an alternative to the answer given by #akrun (+1), we can also try using sapply() to logically convert any non-zero number to 1 or else 0:
mydata[-1] <- sapply(mydata[-1], function(x) { as.numeric(x > 0) })
mydata
Site A B C D
1 X1 0 1 1 1
2 X2 0 0 1 0
3 X3 1 0 0 0
4 X4 1 0 0 1
5 X5 1 0 1 0
6 X6 1 1 1 0
7 X7 1 1 1 1
8 X8 1 1 1 0
9 X9 1 1 1 1
10 X10 1 1 1 1
If we weren't sure about the relative positioning of the columns, we could also address the numeric columns using mydata[c("A", "B", "C", "D")] or something similar.
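For instance, a quick sketch of the same conversion addressing the columns by name (assuming the numeric columns are A through D, as in the example data):
num_cols <- c("A", "B", "C", "D")
mydata[num_cols] <- sapply(mydata[num_cols], function(x) as.numeric(x > 0))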
You could also try this, which treats any non-zero value as 1 regardless of sign (dividing each value by itself gives NaN for the zeros, which is.na then picks up):
mydata[-1] <- (!is.na(mydata[-1]/mydata[-1]))*1
The ifelse function lets you assign new values depending on whether or not a condition is met. It works on vectors but also on data frames. Here I bind the Site column with the transformed columns.
myBinData <- data.frame(Site = mydata$Site, ifelse(mydata[, -1] == 0, 0, 1))
Site A B C D
1 X1 0 1 1 1
2 X2 0 0 1 0
3 X3 1 0 0 0
4 X4 1 0 0 1
5 X5 1 0 1 0
6 X6 1 1 1 0
7 X7 1 1 1 1
8 X8 1 1 1 0
9 X9 1 1 1 1
10 X10 1 1 1 1
