I have a data frame "Data" with a grouping variable "grp" and a binary classification variable "classif". For each group in grp, I want to create a "result" variable that indexes the separate blocks of 0s in the classif variable. So far I haven't found a way to reset the count for each level of the grouping variable, nor a way to number only the blocks of 0s (ignoring the 1s).
Example data:
grp <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
classif <- c(0,1,0,0,1,0,0,1,1,0,0,0,0,1,0,1,1,1,0,0,1,1,0,0,0,1,0,1,0)
result <- c(1,0,2,2,0,3,3,0,0,1,1,1,1,0,2,0,0,0,3,3,0,0,1,1,1,0,2,0,3)
wrong_result <- c(1,2,3,3,4,5,5,1,1,2,2,2,2,3,4,5,5,5,6,6,1,1,2,2,2,3,4,5,6)
Data <- data.frame(grp,classif,result, wrong_result)
I have tried using rleid from data.table, but the following commands produce "wrong_result", which is not what I'm after.
library(data.table)
setDT(Data)
Data[, wrong_result := rleid(classif)]
Data[, wrong_result := rleid(classif), by = grp]
With dplyr, use cumsum() and lag() to number the blocks of zeroes within each group. (The .by argument requires dplyr 1.1.0 or later; with older versions, use group_by(grp) instead.)
library(dplyr)
Data %>%
  mutate(
    result2 = ifelse(
      classif == 0,
      cumsum(classif == 0 & lag(classif, default = 1) == 1),
      0
    ),
    .by = grp
  )
   grp classif result result2
1    1       0      1       1
2    1       1      0       0
3    1       0      2       2
4    1       0      2       2
5    1       1      0       0
6    1       0      3       3
7    1       0      3       3
8    2       1      0       0
9    2       1      0       0
10   2       0      1       1
11   2       0      1       1
12   2       0      1       1
13   2       0      1       1
14   2       1      0       0
15   2       0      2       2
16   2       1      0       0
17   2       1      0       0
18   2       1      0       0
19   2       0      3       3
20   2       0      3       3
21   3       1      0       0
22   3       1      0       0
23   3       0      1       1
24   3       0      1       1
25   3       0      1       1
26   3       1      0       0
27   3       0      2       2
28   3       1      0       0
29   3       0      3       3
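For anyone who wants to see the cumsum idea without dplyr, here is a minimal base R sketch of the same logic. The helper name block_index and the shortened grp/classif vectors are made up for illustration; only ave(), cumsum(), head() and ifelse() from base R are used.

```r
# Number the runs of 0s in a 0/1 vector; positions holding a 1 get index 0.
# block_index is a hypothetical helper name, not part of any package.
block_index <- function(x) {
  prev   <- c(1, head(x, -1))         # previous value; the start counts as "after a 1"
  starts <- x == 0 & prev == 1        # TRUE where a new block of 0s begins
  ifelse(x == 0, cumsum(starts), 0)   # running block count, zeroed out on the 1s
}

# Shortened illustrative data (not the full example from the question)
grp     <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
classif <- c(0, 1, 0, 0, 1, 1, 0, 1, 0)

# ave() applies block_index() within each group, so the count restarts per group
res <- ave(classif, grp, FUN = block_index)
res
# 1 0 2 2 0 0 1 0 2
```

The default of 1 for the "previous" value plays the same role as lag(classif, default = 1): a group that starts with 0 immediately opens block 1.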
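Use rle() to find the runs, number the runs of 0s sequentially, expand back with inverse.rle(), and zero out the runs of 1s. No packages are needed.
seq0 <- function(x) {
  r <- rle(x)                          # run lengths and values
  is0 <- r$values == 0                 # which runs are runs of 0s
  r$values[is0] <- seq_len(sum(is0))   # number the 0-runs 1, 2, 3, ...
  inverse.rle(r) * !x                  # expand, then zero out positions where x is 1
}
transform(Data, result2 = ave(classif, grp, FUN = seq0))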
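To make the rle() steps easier to follow, here is a small trace on the classif values of the first group only. This is just the body of seq0 unrolled line by line, assuming the same input as in the question.

```r
x <- c(0, 1, 0, 0, 1, 0, 0)    # classif values for grp 1 in the example data

r <- rle(x)
r$lengths                      # 1 1 2 1 2
r$values                       # 0 1 0 1 0

is0 <- r$values == 0           # TRUE FALSE TRUE FALSE TRUE
r$values[is0] <- seq_len(sum(is0))
r$values                       # 1 1 2 1 3  (the 1-runs still hold the value 1)

expanded <- inverse.rle(r)     # 1 1 2 2 1 3 3
traced <- expanded * !x        # multiply by FALSE (0) wherever x is 1
traced                         # 1 0 2 2 0 3 3
```

The final multiplication is what removes the leftover 1s that belong to runs of 1 rather than to the first block of 0s.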
I have the following data frame:
# the original dataset
dat <- data.frame(a = c(0,0,2,3), b= c(1,0,0,0), c=c(0,0,1,3))
It looks like this:
> dat
  a b c
1 0 1 0
2 0 0 0
3 2 0 1
4 3 0 3
What I want to do is remove the rows that are all zero, resulting in:
a b c
0 1 0
2 0 1
3 0 3
How can I do that with data.table?
In reality my data has far more rows and columns, so the solution needs to be very fast.
I tried this, but it is still slow:
dat <- dat[Reduce(`|`, dat), ]
You can try with rowSums -
library(data.table)
setDT(dat)
dat[rowSums(dat != 0) != 0]
#   a b c
#1: 0 1 0
#2: 2 0 1
#3: 3 0 3
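Again using rowSums(), which I think is more readable. Note that this only works when all values are non-negative; otherwise positive and negative entries in a row can cancel out to zero.
library(data.table)
dat[rowSums(dat) != 0, ]
  a b c
1 0 1 0
3 2 0 1
4 3 0 3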
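A small sketch of why the two rowSums() filters can differ. The data frame dat2 is hypothetical and contains a negative value, which the original question's data does not; it is only here to show the cancellation case.

```r
# Hypothetical data with a negative entry (not from the question)
dat2 <- data.frame(a = c(0, 0, -1), b = c(1, 0, 1))

keep_sum   <- rowSums(dat2) != 0        # row 3 sums to 0, so it would be dropped
keep_count <- rowSums(dat2 != 0) != 0   # counts non-zero cells, so row 3 is kept

dat2[keep_sum, ]     # only row 1 survives
dat2[keep_count, ]   # rows 1 and 3 survive
```

If the data is known to be non-negative (as in the question), both filters select the same rows.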
I used count() to count identical rows and get their frequency. It was working very well about two hours ago, and now it gives me an error that I do not understand. What I want is that whenever several rows are identical, their concentrations are added up. Here is my toy data and my function.
df=data.frame(ID=seq(1:6),A=rep(0,6),B=c(rep(0,5),1),C=c(rep(1,5),0),D=rep(1,6),E=c(rep(0,3),rep(1,2),0),concentration=c(0.002,0.004,0.001,0.0075,0.00398,0.006))
df
  ID A B C D E concentration
1  1 0 0 1 1 0       0.00200
2  2 0 0 1 1 0       0.00400
3  3 0 0 1 1 0       0.00100
4  4 0 0 1 1 1       0.00750
5  5 0 0 1 1 1       0.00398
6  6 0 1 0 1 0       0.00600
freq.concentration=function(df,Vars){
  df=data.frame(df)
  Vars=as.character(Vars)
  compte=count(df,Vars)
  frequence.C= (compte$freq)/nrow(df)
  output=cbind(compte,frequence.C)
  return(output)
}
freq.concentration(df,colnames(df[2:6]))
# And here is the error that I get when I run the function, which was working perfectly a while ago!
# Error: Must group by variables found in `.data`.
# * Column `Vars` is not found.
# Run `rlang::last_error()` to see where the error occurred.
PS: I do not know whether this is related, but the problem appeared after I opened an Rmd script and copy-pasted all my functions into it; all of a sudden my function stopped working.
I really appreciate your help in advance. Thank you.
Here is the output that I got when it was working properly:
output
  ID A B C D E concentration.C.1 concentration.C.2
1  1 0 0 1 1 0                 3             0.007
2  4 0 0 1 1 1                 2           0.01148
3  6 0 1 0 1 0                 1           0.00600
The first three rows are identical (ignoring ID), so we sum their concentrations and get 0.007; rows 4 and 5 are identical, so we add their concentrations and get 0.01148; the last row is unique, so its concentration stays the same.
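The error comes from dplyr::count(), which looks for a column literally named Vars. (The freq column used in your function suggests it was written for plyr::count(), so dplyr was probably loaded afterwards and masked it.) With dplyr, we can convert the names to symbols and splice them into count() with !!! to get the frequency count based on those columns, and then compute 'frequence.C' as each 'n' divided by the sum of the counts:
library(dplyr)
freq.concentration <- function(df, Vars){
  df %>%
    count(!!! rlang::syms(Vars)) %>%
    mutate(frequence.C = n/sum(n))
}
Testing:
freq.concentration(df,colnames(df)[2:6])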
# A B C D E n frequence.C
#1 0 0 1 1 0 3 0.5000000
#2 0 0 1 1 1 2 0.3333333
#3 0 1 0 1 0 1 0.1666667
If we need the sum of 'concentration', we could use a group_by operation instead of count:
freq.concentration <- function(df, Vars){
  df %>%
    group_by(across(all_of(Vars))) %>%
    summarise(n = n(), frequency.C = sum(concentration), .groups = 'drop')
}
Testing:
freq.concentration(df,colnames(df)[2:6])
# A tibble: 3 x 7
# A B C D E n frequency.C
# <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#1 0 0 1 1 0 3 0.007
#2 0 0 1 1 1 2 0.0115
#3 0 1 0 1 0 1 0.006
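If loading dplyr is not an option, a rough base R alternative is aggregate(). This is a different technique from the dplyr answer above, shown only as a sketch; the helper column n and the intermediate names tmp and out are made up for illustration.

```r
df <- data.frame(
  ID = 1:6, A = rep(0, 6), B = c(rep(0, 5), 1), C = c(rep(1, 5), 0),
  D = rep(1, 6), E = c(rep(0, 3), rep(1, 2), 0),
  concentration = c(0.002, 0.004, 0.001, 0.0075, 0.00398, 0.006)
)
Vars <- c("A", "B", "C", "D", "E")

# Add a column of 1s so that summing it gives the row count per group
tmp <- cbind(n = 1, df["concentration"])

# Sum both columns within each unique combination of the Vars columns
out <- aggregate(tmp, by = df[Vars], FUN = sum)
out
```

The result should have one row per distinct A to E combination, with n holding the count and concentration holding the summed value. Row order may differ from the dplyr output, since aggregate() sorts the groups its own way.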
I have a data frame where each Item has three categories (a, b, c) and a numeric Answer (either 0 or 1) is recorded for each category. I would like to create a new column that depends on the rows in the Answer column. This is what my data frame looks like:
Item <- rep(c(1:3), each=3)
Option <- rep(c('a','b','c'), times=3)
Answer <- c(1,1,0,1,0,1,1,1,1)
df <- data.frame(Item, Option, Answer)
  Item Option Answer
1    1      a      1
2    1      b      1
3    1      c      0
4    2      a      1
5    2      b      0
6    2      c      1
7    3      a      1
8    3      b      1
9    3      c      1
What is needed: whenever the Answer is 1 for all three categories of an Item, the New column should be 1 for every row of that Item. In any other case, it should be 0. The desired output should look like this:
  Item Option Answer New
1    1      a      1   0
2    1      b      1   0
3    1      c      0   0
4    2      a      1   0
5    2      b      0   0
6    2      c      1   0
7    3      a      1   1
8    3      b      1   1
9    3      c      1   1
I tried to achieve this without using a loop, but I got stuck because I don't know how to make a new column contingent on a group of rows, not just a single one. I have tried this solution but it doesn't work if the rows are not grouped in pairs.
Do you have any suggestions? Thanks a bunch!
This should work:
library(dplyr)
df %>%
  group_by(Item) %>%
  mutate(New = as.numeric(all(as.logical(Answer))))
Using data.table:
library(data.table)
DT <- data.table(Item, Option, Answer)
DT[, Index := as.numeric(all(as.logical(Answer))), by = Item]
DT
   Item Option Answer Index
1:    1      a      1     0
2:    1      b      1     0
3:    1      c      0     0
4:    2      a      1     0
5:    2      b      0     0
6:    2      c      1     0
7:    3      a      1     1
8:    3      b      1     1
9:    3      c      1     1
Or using only base R
df$Index <- with(df, +(ave(!!Answer, Item, FUN = all)))
df$Index
#[1] 0 0 0 0 0 0 1 1 1
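The base R one-liner packs three steps together, so here is the same thing unrolled as a sketch. The intermediate names is_one and all_one are made up for readability; the data is the Item/Answer example from the question.

```r
Item   <- rep(1:3, each = 3)
Answer <- c(1, 1, 0, 1, 0, 1, 1, 1, 1)

is_one  <- !!Answer                       # double negation: 0/1 becomes FALSE/TRUE
all_one <- ave(is_one, Item, FUN = all)   # all() per Item, repeated on every row of it
Index   <- +all_one                       # unary plus: logical becomes integer 0/1
Index
# 0 0 0 0 0 0 1 1 1
```

Because ave() returns a vector of the same length as its input, the group-level result lands on every row of the group without any explicit merge.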
I am trying to match all combinations of a dataframe (each combination reduces to a 1 or a 0 based on the sum) to another column and count the matches. I hacked this together but I feel like there is a better solution. Can someone suggest a better way to do this?
library(HapEstXXR)
test<-data.frame(a=c(0,1,0,1),b=c(1,1,1,1),c=c(0,1,0,1))
actual<-c(0,1,1,1)
ps<-powerset(1:dim(test)[2])
lapply(ps,function(x){
  tt<-rowSums(test[,c(x)]) #Note: this fails when there is only one column
  tt[tt>1]<-1 #if the sum is greater than 1 reduce it to 1
  cbind(sum(tt==actual,na.rm=T),colnames(test)[x])
})
> test
  a b c
1 0 1 0
2 1 1 1
3 0 1 0
4 1 1 1
Goal: compare every combination of columns (order doesn't matter) to the actual column and see which one matches most often.
b c a ab ac bc abc actual
1 0 0  0  0  0   0      0
1 1 1  1  1  1   1      1
1 0 0  0  0  0   0      1
1 1 1  1  1  1   1      1
matches:
a: 3
b: 3
c: 3
ab: 3
....
Your code seems fine to me; I just simplified it a little bit. Using drop = FALSE also fixes the single-column case:
sapply(ps,function(x){
  tt <- rowSums(test[,x,drop=FALSE]) > 0
  colname <- paste(names(test)[x],collapse='')
  setNames(sum(tt==actual,na.rm=TRUE), colname) # a named vector of length one
})
# a b ab c ac bc abc
# 3 3 3 3 3 3 3
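In case HapEstXXR is not available, the column subsets can also be built with combn() from base R. This is only a sketch under that assumption; the name matches is made up, and the subsets come out ordered by size, so the element order differs from the powerset() output above.

```r
test   <- data.frame(a = c(0, 1, 0, 1), b = c(1, 1, 1, 1), c = c(0, 1, 0, 1))
actual <- c(0, 1, 1, 1)

# All non-empty subsets of column indices, built size by size with combn()
n  <- ncol(test)
ps <- unlist(lapply(seq_len(n), function(k) combn(n, k, simplify = FALSE)),
             recursive = FALSE)

matches <- sapply(ps, function(x) {
  tt <- rowSums(test[, x, drop = FALSE]) > 0   # TRUE if any selected column is non-zero
  setNames(sum(tt == actual, na.rm = TRUE),
           paste(names(test)[x], collapse = ""))
})
matches
#   a   b   c  ab  ac  bc abc
#   3   3   3   3   3   3   3
```

With three columns this yields seven subsets; the number doubles with every added column, so for wide data it may be worth limiting the subset size k.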