How to count the number of changes in ROW (R) - r

df <- data.frame(Num1=c(1,0,1,0,1), Num2=c(0,1,1,0,1), Num3=c(1,1,1,1,1), Num4=c(1,1,0,0,1), Num5=c(1,1,1,0,0))
I need to count how many times the value changes within each row in R.
Thank you!

rowSums(df[-1] != df[-ncol(df)])
[1] 2 1 2 2 1
That is, in the first row the value changes from 1 to 0 and then back to 1, for a total of 2 changes, and so on.
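To see why the one-liner works, here is a minimal sketch in base R using the data frame from the question. The names `right`, `left` and `changed` are only illustrative. Dropping the first column and dropping the last column gives two equally sized frames, so the element-wise comparison pairs every value with its left-hand neighbour.

```r
df <- data.frame(Num1 = c(1, 0, 1, 0, 1), Num2 = c(0, 1, 1, 0, 1),
                 Num3 = c(1, 1, 1, 1, 1), Num4 = c(1, 1, 0, 0, 1),
                 Num5 = c(1, 1, 1, 0, 0))

right <- df[-1]         # columns 2..5
left  <- df[-ncol(df)]  # columns 1..4

# TRUE wherever a value differs from the value to its left
changed <- right != left

# number of TRUEs per row = number of changes per row
n_changes <- rowSums(changed)
n_changes
```

Because the comparison is `!=` rather than a difference, this also works for values other than 0 and 1.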

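Here is an attempt. Note that sum(abs(diff(x))) counts changes correctly only because the data are 0/1; for general values use sum(diff(x) != 0) instead.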
df <- data.frame(Num1=c(1,0,1,0,1),
Num2=c(0,1,1,0,1),
Num3=c(1,1,1,1,1),
Num4=c(1,1,0,0,1),
Num5=c(1,1,1,0,0))
df$n_changes <- apply(df , MARGIN = 1 , function(x) sum(abs(diff(x))))
df
#> Num1 Num2 Num3 Num4 Num5 n_changes
#> 1 1 0 1 1 1 2
#> 2 0 1 1 1 1 1
#> 3 1 1 1 0 1 2
#> 4 0 0 1 0 0 2
#> 5 1 1 1 1 0 1
Created on 2022-06-02 by the reprex package (v2.0.1)

We may loop over the rows with apply (MARGIN = 1), apply rle (run-length encoding) to each row, and take the number of runs; the number of changes is the number of runs minus 1.
apply(df, 1, function(x) lengths(rle(x)))[1,]-1
[1] 2 1 2 2 1
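As a small illustration of the run-length idea, the sketch below applies rle to the first row of the question's data and then wraps the same logic in a helper. The name `count_changes` is hypothetical and not part of the answer above.

```r
x <- c(1, 0, 1, 1, 1)   # first row of df

runs <- rle(x)
runs$lengths            # 1 1 3: the runs are (1), (0), (1, 1, 1)

# a change happens at every boundary between two runs
count_changes <- function(x) length(rle(x)$lengths) - 1

count_changes(x)        # 2
```

The helper can be passed to apply(df, 1, count_changes) to get one count per row.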

Related

Counting Frequencies of Sequences

Suppose there are two students - each student takes an exam multiple times (e.g. result_id = 1 is the first exam, result_id = 2 is the second exam, etc.). The student can either "pass" (1) or "fail" (0).
The data looks something like this:
library(data.table)
my_data = data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,2,2), results = c(0,1,0,1,0,0,1,1,1,0,1,1,0,1,0), result_id = c(1,2,3,4,5,6,1,2,3,4,5,6,7,8,9))
my_data = setDT(my_data)
id results result_id
1: 1 0 1
2: 1 1 2
3: 1 0 3
4: 1 1 4
5: 1 0 5
6: 1 0 6
7: 2 1 1
8: 2 1 2
9: 2 1 3
10: 2 0 4
11: 2 1 5
12: 2 1 6
13: 2 0 7
14: 2 1 8
15: 2 0 9
I am interested in counting the number of times that a student passes an exam, given that the student passed the previous two exams.
I tried to do this with the following code:
my_data$current_exam = shift(my_data$results, 0)
my_data$prev_exam = shift(my_data$results, 1)
my_data$prev_2_exam = shift(my_data$results, 2)
# Count the number of exam results for each record
out <- my_data[!is.na(prev_exam), .(tally = .N), by = .(id, current_exam, prev_exam, prev_2_exam)]
out = na.omit(out)
My code produces the following results:
> out
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 2
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 1
However, I do not think that my code is correct.
For example, with Student_ID = 2 :
My code says that "Current_Exam = 1, Prev_Exam = 1, Prev_2_Exam = 0" happens 1 time, but looking at the actual data - this does not happen at all
Can someone please show me what I am doing wrong and how I can correct this?
Note: I think that this should be the expected output:
> expected_output
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 1
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 0
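Your code shifts results over the whole table, so the first rows of each id pick up values from the previous id. The shift has to be done within each id, padding the start of each group with NA.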
. <- my_data[order(my_data$id, my_data$result_id),] #sort if needed
.$p1 <- ave(.$results, .$id, FUN = \(x) c(NA, x[-length(x)]))
.$p2 <- ave(.$p1, .$id, FUN = \(x) c(NA, x[-length(x)]))
aggregate(list(tally=.$p1), .[c("id","results", "p1", "p2")], length)
# id results p1 p2 tally
#1 1 0 1 0 2
#2 2 0 1 0 1
#3 2 1 1 0 1
#4 1 0 0 1 1
#5 1 1 0 1 1
#6 2 1 0 1 2
#7 2 0 1 1 2
#8 2 1 1 1 1
.
# id results result_id p1 p2
#1 1 0 1 NA NA
#2 1 1 2 0 NA
#3 1 0 3 1 0
#4 1 1 4 0 1
#5 1 0 5 1 0
#6 1 0 6 0 1
#7 2 1 1 NA NA
#8 2 1 2 1 NA
#9 2 1 3 1 1
#10 2 0 4 1 1
#11 2 1 5 0 1
#12 2 1 6 1 0
#13 2 0 7 1 1
#14 2 1 8 0 1
#15 2 0 9 1 0
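The same within-id shift can also be sketched with data.table, which the question already loads. This is an adaptation of the asker's own code under the assumption that grouping the shift by id is the only change needed; it is not taken from the answer above.

```r
library(data.table)

my_data <- data.table(
  id        = c(1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
  results   = c(0,1,0,1,0,0,1,1,1,0,1,1,0,1,0),
  result_id = c(1,2,3,4,5,6,1,2,3,4,5,6,7,8,9)
)
setorder(my_data, id, result_id)

# by = id restarts the shift for every student, so the first rows of each id
# get NA instead of borrowing results from the previous student
my_data[, prev_exam   := shift(results, 1), by = id]
my_data[, prev_2_exam := shift(results, 2), by = id]

# tally each (current, previous, previous-but-one) pattern per student,
# using only rows where both lags exist
out <- my_data[!is.na(prev_2_exam), .(tally = .N),
               by = .(id, current_exam = results, prev_exam, prev_2_exam)]
out
```

With the lags computed per id, the pattern "passed three times in a row" appears once, for id 2.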
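An option would be to use stats::filter (which dplyr::filter masks if dplyr is loaded) to compute a rolling sum over the current and the two previous exams; a value of 3 marks three passes in a row.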
cbind(., n=ave(.$results, .$id, FUN = \(x) filter(x, c(1,1,1), sides=1)))
# id results result_id n
#1 1 0 1 NA
#2 1 1 2 NA
#3 1 0 3 1
#4 1 1 4 2
#5 1 0 5 1
#6 1 0 6 1
#7 2 1 1 NA
#8 2 1 2 NA
#9 2 1 3 3
#10 2 0 4 2
#11 2 1 5 2
#12 2 1 6 2
#13 2 0 7 2
#14 2 1 8 2
#15 2 0 9 1
If you only need the number of times that a student passes an exam, given that the student passed the previous two exams:
sum(ave(.$results, .$id, FUN = \(x) filter(x, c(1,1,1))==3), na.rm=TRUE)
#[1] 1
sum(ave(.$results, .$id, FUN = \(x)
x==1 & c(x[-1], 0) == 1 & c(x[-1:-2], 0, 0) == 1))
#[1] 1
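For streaks of arbitrary length, the same count can be sketched in base R with embed(), which lays out each value next to its predecessors. The helper name `count_streak_passes` and its argument `k` are assumptions made for this illustration.

```r
id      <- c(1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
results <- c(0,1,0,1,0,0,1,1,1,0,1,1,0,1,0)

# number of exams that are a pass and were preceded by k - 1 passes
count_streak_passes <- function(x, k = 3) {
  if (length(x) < k) return(0L)      # embed() needs at least k values
  windows <- embed(x, k)             # row t holds x[t], x[t-1], ..., x[t-k+1]
  sum(rowSums(windows == 1) == k)    # windows made up entirely of passes
}

# split by student so that windows never span two ids
per_student <- vapply(split(results, id), count_streak_passes, integer(1))
per_student
```

Splitting before windowing plays the same role as the grouped ave() calls above.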
When trying to count events that happen in series, cumsum() comes in quite handy. As opposed to creating multiple lagged variables, this scales well to counts across a larger number of events:
library(tidyverse)
d <- my_data |>
group_by(id) |> # group to cumulate within student only
mutate(
csum = cumsum(results), # cumulative sum of results
i = csum - lag(csum, 3, 0) # subtract the cumulative sum from 3 observations before; this gives the number of exams passed in the current and previous 2 observations
)
# Ungroup to get global count
d |>
ungroup() |>
count(i == 3) # Count the number of cases where the number of exams passes within 3 observations equals 3
#> # A tibble: 2 × 2
#> `i == 3` n
#> <lgl> <int>
#> 1 FALSE 14
#> 2 TRUE 1
# Retaining the group gives counts by student
d |>
count(i == 3) # Count the number of cases where the number of exams passes within 3 observations equals 3
#> # A tibble: 3 × 3
#> # Groups: id [2]
#> id `i == 3` n
#> <dbl> <lgl> <int>
#> 1 1 FALSE 6
#> 2 2 FALSE 8
#> 3 2 TRUE 1
Since you provided the data as a data.table, here is how to do the same in that ecosystem, using data.table's shift() in place of dplyr's lag():
my_data[ , csum := cumsum(results), by = id]
my_data[ , i := csum - shift(csum, 3, fill = 0), by = id]
my_data[ , .(n_cases = sum(i == 3)), by = id]
#> id n_cases
#> 1: 1 0
#> 2: 2 1
Here's an approach using dplyr. It uses the lag function to look back 1 and 2 results. If the sum together with the current result is 3, then the condition is met. In the example you provided, the condition is met only once.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(!is.na(threex))
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 1 0 3 0
2 1 1 4 0
3 1 0 5 0
4 1 0 6 0
5 2 1 3 1
6 2 0 4 0
7 2 1 5 0
8 2 1 6 0
9 2 0 7 0
10 2 1 8 0
11 2 0 9 0
If you then just want to capture the cases when the condition is met, add a filter.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(threex == 1)
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 2 1 3 1
If you are looking to understand how many times the condition is met per id, you can do this.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(threex == 1) %>%
select(id) %>%
summarize(count = n())
id count
<dbl> <int>
1 2 1

one hot encoding dirty column in R dplyr

I have a column like so. The column begins and ends with a ',' and each value is separated by ',,'.
col1
,101,,9,,201,,200,
,201,,101,,102,
,9,,101,,102,,200,,201,
,101,,200,,9,,102,,102,
How can I transform this column into the following:
col1_9 col1_101 col1_102 col1_200 col1_201
1 1 0 1 1
0 1 1 0 1
1 1 1 1 1
1 1 2 1 0
library(dplyr)
library(tidyr)
df %>%
mutate(rowid = row_number(), value = 1) %>%
separate_rows(col1) %>%
filter(nzchar(col1)) %>%
pivot_wider(id_cols = rowid, names_from = col1, values_from = value,
values_fn = sum, names_prefix = 'col1_',
values_fill = 0)
# A tibble: 4 x 6
rowid col1_101 col1_9 col1_201 col1_200 col1_102
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 0
2 2 1 0 1 0 1
3 3 1 1 1 1 1
4 4 1 1 0 1 2
in Base R:
a <- setNames(strsplit(trimws(df$col1, whitespace = ','), ',+'), seq(nrow(df)))
as.data.frame.matrix(t(table(stack(a))))
101 102 200 201 9
1 1 0 1 1 1
2 1 1 0 1 0
3 1 1 1 1 1
4 1 2 1 0 1
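The table() output above sorts the codes as strings and has no column prefix. A minimal base R sketch that orders the columns numerically and adds the "col1_" prefix could look like this; the variable names `tokens`, `codes` and `counts` are illustrative only.

```r
df <- data.frame(col1 = c(",101,,9,,201,,200,", ",201,,101,,102,",
                          ",9,,101,,102,,200,,201,", ",101,,200,,9,,102,,102,"),
                 stringsAsFactors = FALSE)

# split on runs of commas and drop the empty piece left by the leading comma
tokens <- lapply(strsplit(df$col1, ",+"), function(p) p[nzchar(p)])

# numerically sorted set of all codes that occur anywhere
codes <- sort(unique(as.numeric(unlist(tokens))))

# count each code per row; fixed factor levels keep the zero counts
counts <- t(vapply(tokens,
                   function(p) as.integer(table(factor(as.numeric(p), levels = codes))),
                   integer(length(codes))))
colnames(counts) <- paste0("col1_", codes)
as.data.frame(counts)
```

Fixing the factor levels up front is what guarantees that every row gets a column for every code, including codes it does not contain.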
An option could be:
First remove the "," at begin and end using str_sub from stringr
One Hot encode the column using mtabulate and strsplit with sep of ",,"
Order the column names based on number
Finally, give the columns the "col1_" names using paste0
Which gives this as result:
df <- read.table(text = "col1
,101,,9,,201,,200,
,201,,101,,102,
,9,,101,,102,,200,,201,
,101,,200,,9,,102,,102,", header = TRUE)
library(stringr)
library(qdapTools)
df$col1 <- str_sub(df$col1, 2, -2)
df <- mtabulate(strsplit(df$col1, ",,"))
df <- df[, order(as.numeric(names(df)))]
names(df) <- paste0("col1_", names(df))
df
#> col1_9 col1_101 col1_102 col1_200 col1_201
#> 1 1 1 0 1 1
#> 2 0 1 1 0 1
#> 3 1 1 1 1 1
#> 4 1 1 2 1 0
Created on 2022-07-21 by the reprex package (v2.0.1)

In R: Subset observations that have values, 0, 1, and 2 by group

I have the following data:
companyID status
1 1
1 1
1 0
1 2
2 1
2 1
2 1
3 1
3 0
3 2
3 2
3 2
And would like to subset those observations (by companyID) where status has 0, 1, and 2 across the group (companyID). My preferred outcome would look like the following:
companyID status
1 1
1 1
1 0
1 2
3 1
3 0
3 2
3 2
3 2
Thank you in advance for any help!!
You can select groups where all the values from 0-2 are present in the group.
library(dplyr)
df %>% group_by(companyID) %>% filter(all(0:2 %in% status))
# companyID status
# <int> <int>
#1 1 1
#2 1 1
#3 1 0
#4 1 2
#5 3 1
#6 3 0
#7 3 2
#8 3 2
#9 3 2
In base R and data.table :
#Base R :
subset(df, as.logical(ave(status, companyID, FUN = function(x) all(0:2 %in% x))))
#data.table
library(data.table)
setDT(df)[, .SD[all(0:2 %in% status)], companyID]
We can use
library(dplyr)
df %>%
group_by(companyID) %>%
filter(sum(0:2 %in% status) == 3)
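The group-wise test used in all of these answers can be spelled out step by step in base R. In this sketch the data frame is rebuilt from the question, and `required`, `has_all` and `keep` are names chosen for illustration.

```r
df <- data.frame(companyID = c(1,1,1,1,2,2,2,3,3,3,3,3),
                 status    = c(1,1,0,2,1,1,1,1,0,2,2,2))

required <- 0:2

# one TRUE/FALSE per company: does its status column contain every required value?
has_all <- tapply(df$status, df$companyID, function(s) all(required %in% s))
has_all

# keep only the rows of companies that passed the check
keep <- df[df$companyID %in% names(has_all)[has_all], ]
keep
```

Company 2 only ever has status 1, so it is the one that drops out.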

Combining two (boolean) categorical factors into a new one

Given two boolean, categorical factors, how can I get the combination of them as a third category?
> my_data <- data.frame(a = c(0, 0, 1, 1, 1),
b = c(0, 1, 0, 1, 1))
> my_data
a b
1 0 0
2 0 1
3 1 0
4 1 1
5 1 1
I want to add a new category, with the combination of a and b so that:
> my_data
a b c
1 0 0 1
2 0 1 2
3 1 0 3
4 1 1 4
5 1 1 4
I didn't want to be lazy and thought about it for myself:
my_data$c <- as.numeric(as.factor(my_data$a + 1 + (my_data$b + 1) * 2))
This comes close, but I don't find it particularly elegant.
Therefore, any nicer solution in base R would be appreciated.
There are certainly also packages like reshape2 which would offer similar functionality.
The following logic seems to be enough for all the cases you have provided.
my_data$c <- with(my_data, 2*a + b + 1)
my_data
a b c
1 0 0 1
2 0 1 2
3 1 0 3
4 1 1 4
5 1 1 4
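Another option with base R. Note that rle numbers consecutive runs, so this matches the expected output only when identical combinations are adjacent, as they are here: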
r <- rle(do.call(paste0, my_data))
r$values <- seq_along(r$values)
my_data$c <- inverse.rle(r)
The result:
> my_data
a b c
1 0 0 1
2 0 1 2
3 1 0 3
4 1 1 4
5 1 1 4
A shorter version of above code:
r <- rle(do.call(paste0, my_data))$lengths
my_data$c <- rep(seq_along(r), r)
The expected output in the question is just the input seen as numbers in base 2 converted to base 10 plus 1.
So, looking for a function that converts from base 2 to base 10 I have found the accepted answer to this SO question.
So it's a matter of apply()ing that function to the data frame.
apply(my_data, 1, bitsToInt) + 1
#[1] 1 2 3 4 4
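The definition of bitsToInt is not reproduced in this answer. As a hypothetical stand-in (an assumption, not the code from the linked answer), a function that reads a vector of 0/1 values as binary digits, most significant bit first, is enough to run the line above:

```r
# hypothetical stand-in for the bitsToInt() helper referenced above:
# interprets x as binary digits with the most significant bit first
bitsToInt <- function(x) {
  sum(x * 2^(rev(seq_along(x)) - 1))
}

my_data <- data.frame(a = c(0, 0, 1, 1, 1), b = c(0, 1, 0, 1, 1))

# each row (a, b) is read as the two-digit binary number "ab"
c_new <- apply(my_data, 1, bitsToInt) + 1
c_new
```

For two columns this reduces to 2*a + b + 1, the formula from the first answer.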
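A general solution with dplyr. In dplyr >= 1.0.0, passing grouping columns to group_indices() is deprecated; group_by(a, b) %>% mutate(c = cur_group_id()) is the current equivalent: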
library(dplyr)
my_data %>% mutate(c = group_indices(.,a,b))
# a b c
# 1 0 0 1
# 2 0 1 2
# 3 1 0 3
# 4 1 1 4
# 5 1 1 4
A base equivalent:
temp <- unique(my_data)
temp$c <- seq(nrow(temp))
merge(my_data,temp)
# a b c
# 1 0 0 1
# 2 0 1 2
# 3 1 0 3
# 4 1 1 4
# 5 1 1 4
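Another base R route, offered here as a sketch rather than as part of the answers above, is interaction(), which builds one factor out of several columns; converting it to integer gives the combined category directly.

```r
my_data <- data.frame(a = c(0, 0, 1, 1, 1), b = c(0, 1, 0, 1, 1))

# lex.order = TRUE makes `a` the slowest-varying factor, so the levels are
# ordered 0.0, 0.1, 1.0, 1.1; drop = TRUE removes combinations that never occur
my_data$c <- as.integer(interaction(my_data$a, my_data$b,
                                    drop = TRUE, lex.order = TRUE))
my_data
```

Unlike the merge() approach, this keeps the original row order and does not depend on identical combinations being adjacent.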

the max number of an occurrence in R

I have this array:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
[25] 1 1 1 1 2 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1
[49] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[73] 1 1 1 1 1 4 3 2 5 3 2 3 3 2 3 2 3 2 3 3 2 3 3 2
[97] 3 2 2 2 3 2 2 2 2 2 3 2 3 3 2 3 2 1 2 2 3 2 2 3
I need a function that returns only the number of the maximum occurrences. For example, if I use:
table(x[1:80])
I will get:
1 2 3 4
74 3 2 1
How can I get automatically the value '74'? Meaning that I can't know in advance whether '1' or '2' and so on is the maximum occurrence in my array. Thanks!
Edit:
I run:
tf <- tabulate(x)
[1] 75 24 19 1 1
and tried to run a for loop to get the "maximum" of each element on the "tabulate result" as following:
for (element in tf)
{
+ b= max(table(x[element]))
+ print (b)
+ }
I don't get the expected result, it is probably simple but not really for me.
I tried this:
> a=max(table(C[1:75]))
[1] 72
> b=max(table(C[76:99]))
[1] 11
> c=max(table(C[100:118]))
[1] 12
> d=max(table(C[119]))
[1] 1
> e=max(table(C[120]))
[1] 1
and so on.
and it works but it's really long and not fun if I have a big dataset.
To the commenter's tip, if you want a function use:
maximum <- function(vector, upto=length(vector)) {
max(table(vector[1:upto]))
}
So for:
set.seed(123)
x <- sample(1:3, 100, replace=T)
maximum(x)
[1] 34
maximum(x, 55) #checking at the 55th number in the vector
[1] 19
Update
To answer your edited question. Use this function:
maxtable <- function(vector) {
index <- cumsum(1:length(vector) %in% cumsum(tabulate(vector)))
s <- split(vector, index)
sapply(s, function(v) max(table(v)))
}
maxtable(x)
0 1 2 3 4 5
71 11 12 1 1 1
Edit
I think this small change is more of what you're looking for:
maxtable2 <- function(vector) {
index <- cumsum(1:length(vector) %in% (cumsum(tabulate(vector))+1))
s <- split(vector, index+1)
sapply(s, function(v) max(table(v)))
}
maxtable2(x)
1 2 3 4 5
72 11 12 1 1
There is a function in the library modeest that gets you most of the way there called mfv. But the function itself is simple enough to make yourself:
> mfv
function (x, ...)
{
f <- factor(x)
tf <- tabulate(f)
return(as.numeric(levels(f)[tf == max(tf)]))
}
<environment: namespace:modeest>
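So you can do sum(x == mfv(x)) to get 74. This assumes a single most frequent value; with ties, mfv returns several values and the comparison is no longer meaningful.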
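For the original question (the largest count, whichever value it belongs to), base R alone is enough. The vector below is a toy example built to have the same counts as table(x[1:80]) in the question; it is not the asker's actual data.

```r
# toy vector with the counts shown in the question: 74 ones, 3 twos, 2 threes, 1 four
x <- c(rep(1, 74), rep(2, 3), rep(3, 2), 4)

max(table(x))                # largest count, whatever value it belongs to
max(tabulate(x))             # same result, valid for positive integers
names(which.max(table(x)))   # the value that attains it (first one if tied)
```

Both max() calls return the count itself, so ties between values do not affect the result.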
