Counts of occurrences for column combinations - r

I have a dataset with three variables (X1,X2,X3) and these variables only take the value of 0 or 1.
The dataset is
dput(data)
structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0), .Dim = c(10L, 3L), .Dimnames = list(
NULL, c("x1", "x2", "x3")))
In the example above each row is an observation and each column is a variable.
I need to know the frequency of
(X1=1), (X2=1), (X3=1), (X1=1,X2=1), (X1=1,X3=1), (X2=1,X3=1), (X1=1,X2=1,X3=1)
I tried
table(rowSums(data !=0))
But this only gives me the number of rows in which one, two or three variables equal 1, not the frequency of each specific combination.

You could do:
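# Column-index combinations of size 1, 2 and 3
comb <- sapply(1:3, combn, x = 3)
# Count the rows in which every column of colComb equals 1 (data is the matrix from the question)
count_ones <- function(colComb) sum(rowSums(data.frame(data[, colComb])) == length(colComb))
counts <- sapply(comb, function(colComb) apply(colComb, 2, count_ones))
names(counts) <- sapply(comb, function(colComb) paste(apply(colComb, 2, paste, collapse = "&"), collapse = "|"))
counts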
$`1|2|3`
[1] 10 9 4
$`1&2|1&3|2&3`
[1] 9 4 3
$`1&2&3`
[1] 3
As suggested by user2957945, a shorter version:
lapply(1:3, function(x) combn(3, x, FUN=function(y) sum(Reduce("&", as.data.frame(data[,y])))))
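A variation on the idea above, as a minimal sketch: it labels each count with the column names rather than the column indices. It assumes the matrix from the question is stored in data; the names subsets and counts are only illustrative.

```r
# The matrix from the question
data <- structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0),
                  .Dim = c(10L, 3L),
                  .Dimnames = list(NULL, c("x1", "x2", "x3")))

# Every non-empty subset of column indices, as a flat list of integer vectors
subsets <- unlist(
  lapply(seq_len(ncol(data)), function(k) combn(ncol(data), k, simplify = FALSE)),
  recursive = FALSE
)

# For each subset, count the rows in which all selected columns equal 1
counts <- vapply(
  subsets,
  function(idx) sum(rowSums(data[, idx, drop = FALSE] == 1) == length(idx)),
  integer(1)
)

# Label each count, e.g. "x1&x3"
names(counts) <- vapply(
  subsets,
  function(idx) paste(colnames(data)[idx], collapse = "&"),
  character(1)
)
counts
```

Using drop = FALSE keeps a single selected column as a one-column matrix, so rowSums works for subsets of any size without wrapping the selection in data.frame.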

You could use xtabs, which builds contingency tables from a formula (here a three-way table):
s <- structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0), .Dim = c(10L, 3L), .Dimnames = list(
NULL, c("x1", "x2", "x3")))
mytab <- xtabs(~x1+x2+x3, data = s)
mytab
, , x3 = 0
   x2
x1  0 1
  1 0 6
, , x3 = 1
   x2
x1  0 1
  1 1 3
If you would like it to look better, follow this up with ftable
ftable(mytab)
      x3 0 1
x1 x2
1  0     0 1
   1     6 3
Please note though, your example only has one value for x1.

Related

Create column conditioning the behavior of rows in the dataset

I would like to do something very specific. I have a vast set of data, which, in summary, looks more or less like this, with values 0, 1 and 2:
I need to create a situation variable so that it contains the value 0, 1 and 2.
The value 0 for cases that contain only 0's and 1's in the entire line.
The value 1 for the case where the value 2 appears, but at some point 1 appears before it.
The value 2 for the case where the value 2 appears, but at some point 0 appears before it.
So it's something close to:
structure(list(X1 = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1), X2 = c(1,
1, 1, 1, 0, 0, 0, 0, 0, 2), X3 = c(0, 1, 1, 1, 1, 0, 0, 1, 0,
0), X4 = c(0, 1, 1, 0, 1, 1, 0, 0, 0, 0), X5 = c(2, 1, 1, 0,
2, 1, 1, 0, 0, 0), X6 = c(2, 1, 1, 0, 2, 1, 1, 0, 0, 0), X7 = c(2,
1, 1, 1, 2, 1, 1, 2, 0, 0), X8 = c(0, 1, 1, 1, 2, 1, 2, 2, 2,
0)), class = "data.frame", row.names = c(NA, 10L))
I wrote a score function and applied it over all the rows of your dataframe.
score <- function(x) {
a <- which(x == 2)
ifelse(length(a) > 0, ifelse(a[1] >=2, 2 - x[a[1] - 1], 1), 0)
}
df <- structure(list(X1 = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1),
X2 = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 2),
X3 = c(0, 1, 1, 1, 1, 0, 0, 1, 0, 0),
X4 = c(0, 1, 1, 0, 1, 1, 0, 0, 0, 0),
X5 = c(2, 1, 1, 0, 2, 1, 1, 0, 0, 0),
X6 = c(2, 1, 1, 0, 2, 1, 1, 0, 0, 0),
X7 = c(2, 1, 1, 1, 2, 1, 1, 2, 0, 0),
X8 = c(0, 1, 1, 1, 2, 1, 2, 2, 2, 0)),
class = "data.frame", row.names = c(NA, 10L))
df$situation <- sapply(1:nrow(df), function(i) score(as.numeric(df[i,])))
df
Here's a tidyverse approach.
I'll first concatenate all columns together, then use grepl() to look for 12 or 02.
library(tidyverse)
df %>% rowwise() %>%
mutate(concat = paste(c_across(everything()), collapse = "")) %>%
ungroup() %>%
mutate(situation = case_when(
!grepl(2, concat) ~ 0,
grepl("12", concat) ~ 1,
grepl("02", concat) ~ 2
)) %>%
select(-concat)
Output
# A tibble: 10 x 9
X1 X2 X3 X4 X5 X6 X7 X8 situation
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 0 2 2 2 0 2
2 1 1 1 1 1 1 1 1 0
3 1 1 1 1 1 1 1 1 0
4 1 1 1 0 0 0 1 1 0
5 1 0 1 1 2 2 2 2 1
6 1 0 0 1 1 1 1 1 0
7 1 0 0 0 1 1 1 2 1
8 1 0 1 0 0 0 2 2 2
9 0 0 0 0 0 0 0 2 2
10 1 2 0 0 0 0 0 0 1
Note that this solution assumes that:
2 will not appear in the first column
1 or 2 in the situation is defined by the number immediately before 2 in your dataset
There will not be a case of 12 and 02 happening in the same row
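If the last assumption does not hold, one possible workaround is to classify each row by the first 2 only. The sketch below is an assumption-laden variation, not part of the answer above: first_two_situation is a hypothetical helper, and a 2 in the first column (which the question leaves undefined) is treated as "no match" and scored 0.

```r
df <- data.frame(
  X1 = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1),
  X2 = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 2),
  X3 = c(0, 1, 1, 1, 1, 0, 0, 1, 0, 0),
  X4 = c(0, 1, 1, 0, 1, 1, 0, 0, 0, 0),
  X5 = c(2, 1, 1, 0, 2, 1, 1, 0, 0, 0),
  X6 = c(2, 1, 1, 0, 2, 1, 1, 0, 0, 0),
  X7 = c(2, 1, 1, 1, 2, 1, 1, 2, 0, 0),
  X8 = c(0, 1, 1, 1, 2, 1, 2, 2, 2, 0)
)

# Hypothetical helper: takes strings such as "11002220" (one per row)
first_two_situation <- function(concat) {
  # Position of the leftmost "02" or "12"; -1 if there is none
  m <- as.integer(regexpr("[01]2", concat))
  pair <- ifelse(m > 0, substr(concat, m, m + 1), NA_character_)
  # No pair found -> 0; "12" -> 1; "02" -> 2
  ifelse(is.na(pair), 0, ifelse(pair == "12", 1, 2))
}

# Paste the columns together row by row, then classify
concat <- do.call(paste0, unname(as.list(df)))
df$situation <- first_two_situation(concat)
df
```

Because regexpr returns only the leftmost match, a row such as "10212" is scored from its first 2 (preceded by 0), whereas checking grepl("12", ...) before grepl("02", ...) would score it as 1.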

Cumulative count for a column using R

I got data like this
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1 = c(0,
0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1), drug_2 = c(0, 1, 1, 1, 1, 0,
1, 0, 0, 1, 0, 1)), class = "data.frame", row.names = c(NA, -12L
))
I would like to get the cumulative count of each column for each id and get the data like this
structure(list(id2 = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1_b = c(0,
0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 2), drug_2_b = c(0, 1, 2, 3, 4,
0, 5, 0, 0, 1, 0, 2)), class = "data.frame", row.names = c(NA,
-12L))
You can get a cumulative sum with cumsum.
To split data.frame into subsets, you can use split and then lapply cumsum over the list of the data.frames and again over the list of the columns, or you can use the ave function which does exactly that:
data = structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1 = c(0,
0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1), drug_2 = c(0, 1, 1, 1, 1, 0,
1, 0, 0, 1, 0, 1)), class = "data.frame", row.names = c(NA, -12L
))
data[-1] = ave(data[-1], data$id, FUN=cumsum)
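For comparison, the split and lapply route mentioned above can be sketched as follows, one column at a time. This is only an illustration of what ave does internally; by_hand is a hypothetical name, and the column names are taken from the question's data.

```r
data <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                       drug_1 = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1),
                       drug_2 = c(0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1)),
                  class = "data.frame", row.names = c(NA, -12L))

by_hand <- data
for (col in c("drug_1", "drug_2")) {
  # One vector of values per id
  pieces <- split(data[[col]], data$id)
  # Cumulative sum within each id, then back into the original row order
  by_hand[[col]] <- unsplit(lapply(pieces, cumsum), data$id)
}

# The same result, column by column, using ave
identical(by_hand$drug_1, ave(data$drug_1, data$id, FUN = cumsum))
identical(by_hand$drug_2, ave(data$drug_2, data$id, FUN = cumsum))
```

The ave version is shorter because it handles the split, the per-group call and the reassembly in one step.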
edit:
I assumed that a cumulative sum is requested (as per the instructions) and that there is a mistake in the example data. If the example data is correct, then the rule is: where the value is zero, leave it at zero instead of taking the cumulative sum, i.e. ifelse(x == 0, 0, cumsum(x)) (as per @r2evans). However, this construct doesn't work when applied to a data.frame, so a more complex helper function is required. Run it on the original data, because the ave assignment above has already overwritten the drug columns:
data[-1] = ave(data[-1], data$id, FUN=function(x){
y = cumsum(x)
y[x == 0] = 0
y
})
We can now compare it with the requested (renamed) data:
result = structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1 = c(0,
0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 2), drug_2 = c(0, 1, 2, 3, 4,
0, 5, 0, 0, 1, 0, 2)), class = "data.frame", row.names = c(NA,
-12L))
identical(data, result)
Base R,
ave(df$drug_2, df$id, FUN = function(z) ifelse(z == 0, z, cumsum(z)))
# [1] 0 1 2 3 4 0 5 0 0 1 0 2
Edit Simplified the solution after reading r2evans' approach.
You could use
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with("drug"),
~ifelse(.x == 0, 0, cumsum(.x)))) %>%
ungroup()
This returns
# A tibble: 12 x 3
id drug_1 drug_2
<dbl> <dbl> <dbl>
1 1 0 0
2 1 0 1
3 1 0 2
4 1 0 3
5 1 0 4
6 1 1 0
7 1 2 5
8 2 0 0
9 2 0 0
10 2 1 1
11 2 0 0
12 2 2 2
Base R solution:
# Resolve the names of vectors we want to cumulatively sum:
# drug_vec_names => character vector
drug_vec_names <- grep( "^drug\\_", colnames(df), value = TRUE)
# Resolve the names of vectors we want to keep:
# not_drug_vec_names => character vector
not_drug_vec_names <- names(df)[!(names(df) %in% drug_vec_names)]
# Calculate the result: res => data.frame
res <- setNames(
cbind(
df[,not_drug_vec_names],
replace(
ave(
df[,drug_vec_names],
df[,not_drug_vec_names],
FUN = cumsum
),
df[,drug_vec_names] == 0,
0
)
),
c(not_drug_vec_names, drug_vec_names)
)
If you have binary values (1/0) in the drug columns, you can multiply the cumulative sum by the original column to get 0 wherever the original value is 0.
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with('drug'), ~cumsum(.) * .)) %>%
ungroup
# id drug_1 drug_2
# <dbl> <dbl> <dbl>
# 1 1 0 0
# 2 1 0 1
# 3 1 0 2
# 4 1 0 3
# 5 1 0 4
# 6 1 1 0
# 7 1 2 5
# 8 2 0 0
# 9 2 0 0
#10 2 1 1
#11 2 0 0
#12 2 2 2

Within rows of data frame, find first occurrence and longest sequence of value

Consider this data frame, which provides the scored responses on a 15-item test for 10 individuals:
library(tidyverse)
input <- tribble(
~ID, ~i1, ~i2, ~i3, ~i4, ~i5, ~i6, ~i7, ~i8, ~i9, ~i10, ~i11, ~i12, ~i13, ~i14, ~i15,
"A", 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
"B", 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
"C", 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0,
"D", 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0,
"E", 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
"F", 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
"G", 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,
"H", 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
"I", 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
"J", 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1
)
I want R to go row-by-row, and scan the cells in each row from left to right, in order to create these new columns:
first_0_name: returns the column name of the cell containing the first occurrence of the value 0
first_0_loc: returns the column location of the cell containing the first occurrence of the value 0
streak_1: starting from the first occurrence of 0, find the next occurrence of 1, and then count the consecutive 1s before the next occurrence of 0.
The new columns should appear as below
new_cols <- tribble(
~first_0_name, ~first_0_loc, ~streak_1,
"i9", 10, 5,
"i4", 5, 4,
"i6", 7, 8,
"i8", 9, 4,
"i9", 10, 5,
NA, NA, NA,
"i1", 2, 5,
"i3", 4, 8,
"i2", 3, NA,
"i1", 2, 1
)
Thanks in advance for any help!
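If you want to use base R more directly and avoid the cost of reshaping the whole data frame, you can apply a function over the rows. This also retains the order of rows without having to create extra ordering columns (unlike the tidyverse solution). Note that streak_1 is computed as the gap between the first two zeros, so a row whose first zero is immediately followed by another zero (row D) returns NA rather than the requested 4.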
results <- apply(input, 1, function(x) {
# get indices of all zeros
zeros <- which(x == 0)
# exit early if no zeros are found
if (length(zeros) == 0) {
return(data.frame(first_0_name = NA, first_0_loc = NA, streak_1 = NA))
}
first.name <- names(zeros[1]) # name of first 0 column
first.idx <- zeros[1] # location of first zero
longest.streak <- diff(zeros)[1] - 1 # length of first 0-0 streak
return(data.frame(first_0_name = first.name,
first_0_loc = first.idx,
streak_1 = ifelse(longest.streak == 0, NA, longest.streak))
)
})
output <- do.call(rbind, results)
first_0_name first_0_loc streak_1
i9 i9 10 5
i4 i4 5 4
i6 i6 7 8
i8 i8 9 NA
i91 i9 10 5
1 <NA> NA NA
i1 i1 2 5
i3 i3 4 8
i2 i2 3 NA
i31 i3 4 2
Edit #2: Rewrote as combination of two summarizations.
input_tidy <- input %>%
gather(col, val, -ID) %>%
group_by(ID) %>%
arrange(ID) %>%
mutate(col_num = row_number() + 1)
input[,1] %>%
# Combine with summary of each ID's first zero
left_join(input_tidy %>% filter(val == 0) %>%
summarize(first_0_name = first(col),
first_0_loc = first(col_num))) %>%
# Combine with length of each ID's first post-0 streak of 1's
left_join(input_tidy %>%
filter(val == 1 & cumsum(val == 1 & lag(val, default = 1) == 0) == 1) %>%
summarize(streak_1 = n()))
# A tibble: 10 x 4
ID first_0_name first_0_loc streak_1
<chr> <chr> <dbl> <int>
1 A i9 10 5
2 B i4 5 4
3 C i6 7 8
4 D i8 9 4
5 E i9 10 5
6 F NA NA NA
7 G i1 2 5
8 H i3 4 8
9 I i2 3 NA
10 J i3 4 2
An option using melt from data.table
library(data.table)
melt(setDT(input), id.var = 'ID')[, .(first_o_name = first(variable[value == 0]),
first_o_loc = which(value == 0)[1] +1,
streak_1 = sum(cumsum(c(TRUE, diff(value == 0) < 0)) == 2) - 1 ), ID
][streak_1 < 0, streak_1 := NA_real_][]
A base R option is also possible with apply and rle
do.call(rbind, apply(input[-1], 1, function(x) {
first_o_loc <- unname(which(x == 0)[1] + 1)
first_o_name <- names(x)[first_o_loc-1]
rl <- rle(x)
rl1 <- within.list(rl, {
i1 <- cumsum(values == 0) == 1
values <- values[i1]
lengths <- lengths[i1]})
streak_1 <- unname(rl1$lengths[2])
data.frame(first_o_name, first_o_loc, streak_1)}))
# first_o_name first_o_loc streak_1
#1 i9 10 5
#2 i4 5 4
#3 i6 7 8
#4 i8 9 4
#5 i9 10 5
#6 <NA> NA NA
#7 i1 2 5
#8 i3 4 8
#9 i2 3 NA
#10 i3 4 2
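To check any of the answers above on a single row, a small standalone helper can spell out the definition from the question step by step. This is only a sketch: first_zero_streak is a hypothetical name, and the + 1 on the location assumes, as in the question, that the ID column occupies position 1.

```r
# Hypothetical helper: x is one respondent's 0/1 scores as a named numeric vector
first_zero_streak <- function(x) {
  zeros <- which(x == 0)
  if (length(zeros) == 0) {
    # No 0 in the row: nothing to report
    return(list(name = NA_character_, loc = NA_integer_, streak = NA_integer_))
  }
  first0 <- zeros[1]
  after <- x[seq_along(x) > first0]   # scores to the right of the first 0
  ones <- which(after == 1)
  if (length(ones) == 0) {
    streak <- NA_integer_             # no 1 follows the first 0
  } else {
    run <- after[ones[1]:length(after)]     # from the next 1 onwards
    next0 <- unname(which(run == 0))[1]     # where that run of 1s ends
    streak <- if (is.na(next0)) length(run) else next0 - 1L
  }
  list(name = names(x)[first0], loc = unname(first0) + 1L, streak = streak)
}

# Row D from the question
row_D <- setNames(c(1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0), paste0("i", 1:15))
first_zero_streak(row_D)
```

Row D is a useful check because its first zero (i8) is followed directly by a second zero before the streak of 1s begins.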

Replacing values in one matrix with values from another

I'm trying to compare two matrices. When the values aren't equal, I want to use the value from mat2 as long as it is greater than 0; if it is zero, I want the value from mat1. As the code currently stands, it appears to always return the value of mat1.
Here is my attempt:
mat.data1 <- c(1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1)
mat1 <- matrix(data = mat.data1, nrow = 5, ncol = 5, byrow = TRUE)
mat.data2 <- c(0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 2, 0, 0, 0, 1, 2, 2, 0, 2, 1, 0, 1)
mat2 <- matrix(data = mat.data2, nrow = 5, ncol = 5, byrow = TRUE)
mat3 = if(mat1 == mat2){mat1} else {if(mat2>0){mat2} else {mat1}}
the expected output should be
1 0 1 1 1
0 1 2 1 1
1 1 2 2 0
1 1 1 2 2
1 1 1 0 1
Here is one potential way to do it.
mat.data1 <- c(1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1)
mat1 <- matrix(data = mat.data1, nrow = 5, ncol = 5, byrow = TRUE)
mat.data2 <- c(0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 2, 0, 0, 0, 1, 2, 2, 0, 2, 1, 0, 1)
mat2 <- matrix(data = mat.data2, nrow = 5, ncol = 5, byrow = TRUE)
mat3 <- mat1
to_change <- which(mat2 != mat1 & mat2 > 0)
mat3[to_change] <- mat2[to_change]
This use of which returns the positions where mat2 differs from mat1 AND mat2 is greater than zero. You can then index with those positions to copy the corresponding mat2 values into mat3.
The output is then:
> mat3
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 1
[2,] 0 1 2 1 1
[3,] 1 1 2 2 0
[4,] 1 1 1 2 2
[5,] 1 2 1 0 1
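The attempt in the question does not work element by element because if() tests a single condition rather than a whole matrix. As a possible alternative sketch, the vectorised ifelse() expresses the same rule in one line (mat3_alt is just an illustrative name). Where the two matrices are equal it makes no difference which value is taken, so the rule reduces to "use mat2 where it is positive, otherwise mat1".

```r
mat.data1 <- c(1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1)
mat1 <- matrix(data = mat.data1, nrow = 5, ncol = 5, byrow = TRUE)
mat.data2 <- c(0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 2, 0, 0, 0, 1, 2, 2, 0, 2, 1, 0, 1)
mat2 <- matrix(data = mat.data2, nrow = 5, ncol = 5, byrow = TRUE)

# Element-wise choice: mat2 where it is positive, mat1 everywhere else
mat3_alt <- ifelse(mat2 > 0, mat2, mat1)

# Same result as the which()-based replacement
mat3 <- mat1
to_change <- which(mat2 != mat1 & mat2 > 0)
mat3[to_change] <- mat2[to_change]
all(mat3_alt == mat3)
```

ifelse() keeps the dimensions of its test argument, so the result is again a 5 x 5 matrix.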
We can use coalesce
library(dplyr)
out <- coalesce(replace(mat2, !mat2, NA), replace(mat1, !mat1, NA))
replace(out, is.na(out), 0)
Or as @Axeman mentioned
coalesce(out, 0)

Main diagonal into anti-diagonal

I need to transform main diagonal
{matrix(
1 1 1 1,
0 2 2 2,
0 0 3 3,
0 0 0 4)
}
into:
{matrix(
0 0 0 1,
0 0 1 2,
0 1 2 3,
1 2 3 4)
}
I tried all the operators I could find, such as t(), arev(), flipud() and apply(x,2,rev), without success. I hope you can help me.
Does this work for you? Takes each column and 'rotates' (for lack of a better word) x places, where x is the column index.
res <- sapply(1:ncol(input),function(x){
#get relevant column
base <- input[,x]
n <- length(base)
indices <- 1:n
#reshuffle indices: first above x, then below x
out <- base[c(indices[indices>x],indices[indices<=x])]
out
})
all(res==output)
[1] TRUE
data used:
input <- structure(c(1, 0, 0, 0, 1, 2, 0, 0, 1, 2, 3, 0, 1, 2, 3, 4), .Dim = c(4L,
4L))
output <- structure(c(0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4), .Dim = c(4L,
4L))
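The same column rotation can also be written without sapply, using index arithmetic. This is a sketch that assumes a square matrix, as in the example; src_row is an illustrative name. Element [i, j] of the result is taken from row ((i + j - 1) mod n) + 1 of column j, which is exactly the reshuffled index order used above.

```r
input <- structure(c(1, 0, 0, 0, 1, 2, 0, 0, 1, 2, 3, 0, 1, 2, 3, 4), .Dim = c(4L, 4L))
output <- structure(c(0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4), .Dim = c(4L, 4L))

n <- nrow(input)
i <- row(input)   # row index of every cell
j <- col(input)   # column index of every cell

# Source row for each target cell: rotate column j upwards by j places
src_row <- ((i + j - 1) %% n) + 1

# Two-column matrix indexing picks input[src_row, j] for every cell
res <- matrix(input[cbind(as.vector(src_row), as.vector(j))], nrow = n)
all(res == output)
```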
