ifelse: replace a value if it is lower than the previous one - R

I am working with a dataset that contains some errors: numbers are sometimes registered incorrectly. Here is a toy data example (reproduced as the dat data frame at the bottom of this post):
The issue is that the Reversal column should only count up (per unique ID). So in a vector like 0,0,0,1,1,1,0,1,2,2,0,0,2,3, the 0s that follow the 1s and 2s should not be 0s; instead, each should equal whatever value came before it. I tried to remedy this by using the lag function from the dplyr package:
Data$Reversal <- ifelse(Data$Reversal < lag(Data$Reversal), lag(Data$Reversal), Data$Reversal)
But this results in several issues:
1. The first value becomes NA. I've tried using the default = Data$Reversal argument in the lag function, but to no avail.
2. The Reversal value should reset to 0 for each unique ID, but now it carries over across IDs. I tried a messy version using group_by(ID) but could not get it to work; it broke my earlier ifelse call.
3. It only works when there is a single error. If there are two errors in a row, it only fixes one value.
Alternatively, I found this thread in which the answer provided by Andrie also seems promising. It fixes problems 1 and 3, but I can't get the code to work per ID (using the group_by function).
Andrie's answer:
local({
  r <- rle(data)
  x <- r$values
  x0 <- which(x == 0)                # index positions of zeroes
  xt <- x[x0-1] == x[x0+1]           # zeroes surrounded by the same value
  r$values[x0[xt]] <- x[x0[xt] - 1]  # substitute with the surrounding value
  inverse.rle(r)
})
Any help would be much appreciated.

I think cummax does exactly what you need.
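For instance, applied to the vector from the question, cummax() carries the running maximum forward over the wrongly-registered 0s:
cummax(c(0, 0, 0, 1, 1, 1, 0, 1, 2, 2, 0, 0, 2, 3))
# [1] 0 0 0 1 1 1 1 1 2 2 2 2 2 3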
Base R
dat$Reversal <- ave(dat$Reversal, dat$ID, FUN = cummax)
dat
# ID Owner Reversal Success
# 1 1 A 0 0
# 2 1 A 0 0
# 3 1 A 0 0
# 4 1 B 1 1
# 5 1 B 1 0
# 6 1 B 1 0
# 7 1 error 1 0
# 8 1 error 1 0
# 9 1 B 1 0
# 10 1 B 1 0
# 11 1 C 1 1
# 12 1 C 2 0
# 13 1 error 2 0
# 14 1 C 2 0
# 15 1 C 3 1
# 16 2 J 0 0
# 17 2 J 0 0
dplyr
dat %>%
  group_by(ID) %>%
  mutate(Reversal = cummax(Reversal)) %>%
  ungroup()
data.table
as.data.table(dat)[, Reversal := cummax(Reversal), by = .(ID)][]
Data, courtesy of https://extracttable.com/
dat <- read.table(header = TRUE, text = "
ID Owner Reversal Success
1 A 0 0
1 A 0 0
1 A 0 0
1 B 1 1
1 B 1 0
1 B 1 0
1 error 0 0
1 error 0 0
1 B 1 0
1 B 1 0
1 C 1 1
1 C 2 0
1 error 0 0
1 C 2 0
1 C 3 1
2 J 0 0
2 J 0 0")
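As an optional sanity check (assuming you have applied one of the three fixes above to dat), you can verify that Reversal is now non-decreasing within each ID:
with(dat, tapply(Reversal, ID, function(x) all(diff(x) >= 0)))
#    1    2
# TRUE TRUE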

Related

Create an index variable for blocks of values

I have a dataframe "data" with a grouping variable "grp" and a binary classification variable "classif". For each group in grp, I want to create a "result" variable indexing the separate blocks of 0s in the classif variable. At the moment I don't know how to reset the count for each level of the grouping variable, and I can't find a way to create the index only for blocks of 0s (ignoring the 1s).
Example data:
grp <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
classif <- c(0,1,0,0,1,0,0,1,1,0,0,0,0,1,0,1,1,1,0,0,1,1,0,0,0,1,0,1,0)
result <- c(1,0,2,2,0,3,3,0,0,1,1,1,1,0,2,0,0,0,3,3,0,0,1,1,1,0,2,0,3)
wrong_result <- c(1,2,3,3,4,5,5,1,1,2,2,2,2,3,4,5,5,5,6,6,1,1,2,2,2,3,4,5,6)
Data <- data.frame(grp,classif,result, wrong_result)
I have tried using rleid() from data.table, but the following commands produce wrong_result, which is not what I'm after:
setDT(Data)
Data[, wrong_result := rleid(classif)]
Data[, wrong_result := rleid(classif), by = grp]
With dplyr, use cumsum() and lag() to find the blocks of zeroes, grouped per grp with the .by argument. (Make sure you're using dplyr 1.1.0 or later; that is when .by was introduced.)
library(dplyr)
Data %>%
  mutate(
    result2 = ifelse(
      classif == 0,
      cumsum(classif == 0 & lag(classif, default = 1) == 1),
      0
    ),
    .by = grp
  )
grp classif result result2
1 1 0 1 1
2 1 1 0 0
3 1 0 2 2
4 1 0 2 2
5 1 1 0 0
6 1 0 3 3
7 1 0 3 3
8 2 1 0 0
9 2 1 0 0
10 2 0 1 1
11 2 0 1 1
12 2 0 1 1
13 2 0 1 1
14 2 1 0 0
15 2 0 2 2
16 2 1 0 0
17 2 1 0 0
18 2 1 0 0
19 2 0 3 3
20 2 0 3 3
21 3 1 0 0
22 3 1 0 0
23 3 0 1 1
24 3 0 1 1
25 3 0 1 1
26 3 1 0 0
27 3 0 2 2
28 3 1 0 0
29 3 0 3 3
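If you are on a dplyr version older than 1.1.0 (no .by argument), the same logic works with an explicit group_by()/ungroup() pair:
Data %>%
  group_by(grp) %>%
  mutate(result2 = ifelse(classif == 0,
                          cumsum(classif == 0 & lag(classif, default = 1) == 1),
                          0)) %>%
  ungroup()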
Use rle() to find the runs, sequentially number the runs of zeroes, then convert back with inverse.rle() and zero out the runs of 1s. No packages are used.
seq0 <- function(x) {
  r <- rle(x)
  is0 <- r$values == 0
  r$values[is0] <- seq_len(sum(is0))  # number the zero-runs 1, 2, 3, ...
  inverse.rle(r) * !x                 # expand back; !x zeroes out the runs of 1s
}
transform(Data, result2 = ave(classif, grp, FUN = seq0))
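For the example data this reproduces the expected result column, as a quick check confirms:
identical(ave(Data$classif, Data$grp, FUN = seq0), Data$result)
# [1] TRUE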

How to remove rows where all columns are zero using data.table

I have the following data frame:
# the original dataset
dat <- data.frame(a = c(0,0,2,3), b= c(1,0,0,0), c=c(0,0,1,3))
It looks like this:
> dat
a b c
1 0 1 0
2 0 0 0
3 2 0 1
4 3 0 3
What I want to do is remove the rows where all values are zero, resulting in:
a b c
0 1 0
2 0 1
3 0 3
How can I do that with data.table?
In reality I have much higher-dimensional data to process, so it needs to be very fast.
I tried this, but it is still slow:
dat <- dat[Reduce(`|`, dat), ]
You can try with rowSums():
library(data.table)
setDT(dat)
dat[rowSums(dat != 0) != 0]
# a b c
#1: 0 1 0
#2: 2 0 1
#3: 3 0 3
Again using rowSums(); I think this is more readable. Note, however, that summing the raw values assumes they are non-negative: a row like (1, -1, 0) also sums to zero and would be dropped.
library(data.table)
dat[rowSums(dat) != 0, ]
a b c
1 0 1 0
3 2 0 1
4 3 0 3
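Since the question stresses speed, here is one way to compare the two approaches on larger data, using the microbenchmark package (a sketch with made-up dimensions; timings will depend on your data and machine):
library(data.table)
library(microbenchmark)
set.seed(1)
# hypothetical larger test set: 100,000 rows x 10 columns, mostly zeroes
big <- as.data.table(matrix(sample(0:3, 1e6, replace = TRUE,
                                   prob = c(0.7, 0.1, 0.1, 0.1)),
                            ncol = 10))
microbenchmark(
  reduce  = big[Reduce(`|`, big)],
  rowsums = big[rowSums(big != 0) != 0],
  times   = 10
)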

ERROR when using count in R which was working before

I used count() in order to count identical rows and get their frequency, and it was working very well about two hours ago; now it gives me an error that I do not understand. I want to add up the concentration values of identical rows. Here is my toy data and my function:
df = data.frame(ID = seq(1:6), A = rep(0, 6), B = c(rep(0, 5), 1),
                C = c(rep(1, 5), 0), D = rep(1, 6), E = c(rep(0, 3), rep(1, 2), 0),
                concentration = c(0.002, 0.004, 0.001, 0.0075, 0.00398, 0.006))
df
ID A B C D E concentration
1 1 0 0 1 1 0 0.00200
2 2 0 0 1 1 0 0.00400
3 3 0 0 1 1 0 0.00100
4 4 0 0 1 1 1 0.00750
5 5 0 0 1 1 1 0.00398
6 6 0 1 0 1 0 0.00600
freq.concentration = function(df, Vars){
  df = data.frame(df)
  Vars = as.character(Vars)
  compte = count(df, Vars)
  frequence.C = (compte$freq) / nrow(df)
  output = cbind(compte, frequence.C)
  return(output)
}
freq.concentration(df,colnames(df[2:6]))
# and here is the error that I get when I run the function, which was working perfectly a while ago:
# Error: Must group by variables found in `.data`.
# * Column `Vars` is not found.
# Run `rlang::last_error()` to see where the error occurred.
PS: I do not know if this is related, but I got this problem after I opened an Rmd script and copy-pasted all my functions into it; all of a sudden my function stopped working.
I really appreciate your help in advance. Thank you.
Here is the output that i had when it was working properly :
output
ID A B C D E concentration.C.1 concentration.C.2
1 1 0 0 1 1 0 3 0.007
2 4 0 0 1 1 1 2 0.01148
3 6 0 1 0 1 0 1 0.00600
The first 3 rows are identical, so we sum their concentrations and get 0.007; rows 4 and 5 are identical, so we add their concentrations and get 0.01148; the last row is unique, so its concentration stays the same.
We can convert the column names to symbols and evaluate them (!!!) inside count() to get the frequency count based on those columns, then compute 'frequence.C' as the proportion of 'n' over the total count.
library(dplyr)
freq.concentration <- function(df, Vars){
  df %>%
    count(!!! rlang::syms(Vars)) %>%
    mutate(frequence.C = n/sum(n))
}
Testing:
freq.concentration(df,colnames(df)[2:6])
# A B C D E n frequence.C
#1 0 0 1 1 0 3 0.5000000
#2 0 0 1 1 1 2 0.3333333
#3 0 1 0 1 0 1 0.1666667
If we need the sum of 'concentration', we could use a group_by operation instead of count
freq.concentration <- function(df, Vars){
  df %>%
    group_by(across(all_of(Vars))) %>%
    summarise(n = n(), frequency.C = sum(concentration), .groups = 'drop')
}
Testing:
freq.concentration(df,colnames(df)[2:6])
# A tibble: 3 x 7
# A B C D E n frequency.C
# <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#1 0 0 1 1 0 3 0.007
#2 0 0 1 1 1 2 0.0115
#3 0 1 0 1 0 1 0.006
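As an aside, dplyr 1.0.0+ also accepts a tidy-select specification directly inside count() via across(), which avoids the rlang::syms()/!!! step; a sketch of the same function:
library(dplyr)
freq.concentration <- function(df, Vars){
  df %>%
    count(across(all_of(Vars))) %>%
    mutate(frequence.C = n / sum(n))
}
freq.concentration(df, colnames(df)[2:6])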

How to write new column conditional on grouped rows in R?

I have a data frame where each Item has three categories (a, b, c) and a numeric Answer (either 0 or 1) is recorded for each category. I would like to create a new column contingent on the rows in the Answer column. This is what my data frame looks like:
Item <- rep(c(1:3), each=3)
Option <- rep(c('a','b','c'), times=3)
Answer <- c(1,1,0,1,0,1,1,1,1)
df <- data.frame(Item, Option, Answer)
Item Option Answer
1 1 a 1
2 1 b 1
3 1 c 0
4 2 a 1
5 2 b 0
6 2 c 1
7 3 a 1
8 3 b 1
9 3 c 1
What is needed: whenever Answer is 1 for all three options of an Item, the New column should receive a 1 for that Item. In any other case it should be 0. The desired output looks like this:
Item Option Answer New
1 1 a 1 0
2 1 b 1 0
3 1 c 0 0
4 2 a 1 0
5 2 b 0 0
6 2 c 1 0
7 3 a 1 1
8 3 b 1 1
9 3 c 1 1
I tried to achieve this without using a loop, but I got stuck because I don't know how to make a new column contingent on a group of rows rather than a single one. I tried this solution, but it doesn't work when the rows are not grouped in pairs.
Do you have any suggestions? Thanks a bunch!
This should work:
library(dplyr)
df %>%
  group_by(Item) %>%
  mutate(New = as.numeric(all(as.logical(Answer))))
Using data.table:
library(data.table)
DT <- data.table(Item, Option, Answer)
DT[, Index := as.numeric(all(as.logical(Answer))), by = Item]
DT
Item Option Answer Index
1: 1 a 1 0
2: 1 b 1 0
3: 1 c 0 0
4: 2 a 1 0
5: 2 b 0 0
6: 2 c 1 0
7: 3 a 1 1
8: 3 b 1 1
9: 3 c 1 1
Or using only base R (the double negation !!Answer coerces the numeric vector to logical, and the outer + converts the logical result back to 0/1):
df$Index <- with(df, +(ave(!!Answer, Item, FUN = all)))
df$Index
#[1] 0 0 0 0 0 0 1 1 1

calculating dataframe row combinations and matches with a separate column

I am trying to match all combinations of a dataframe's columns (each combination is reduced to 1 or 0 based on its row sum) against another column and count the matches. I hacked this together, but I feel like there is a better solution. Can someone suggest a better way to do this?
library(HapEstXXR)
test<-data.frame(a=c(0,1,0,1),b=c(1,1,1,1),c=c(0,1,0,1))
actual<-c(0,1,1,1)
ps <- powerset(1:dim(test)[2])
lapply(ps, function(x){
  tt <- rowSums(test[, c(x)])  # Note: this fails when there is only one column
  tt[tt > 1] <- 1              # if the sum is greater than 1, reduce it to 1
  cbind(sum(tt == actual, na.rm = TRUE), colnames(test)[x])
})
> test
a b c
1 0 1 0
2 1 1 1
3 0 1 0
4 1 1 1
Goal: compare all combinations of columns (order doesn't matter) to the actual column and see which matches most.
b c a ab ac bc abc actual
1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1
1 0 0 0 0 0 0 1
1 1 1 1 1 1 1 1
matches:
a: 3
b: 3
c: 3
ab: 3
....
Your code seems fine to me; I just simplified it a little bit:
sapply(ps, function(x){
  tt <- rowSums(test[, x, drop = FALSE]) > 0          # drop = FALSE handles single-column subsets
  colname <- paste(names(test)[x], collapse = '')
  setNames(sum(tt == actual, na.rm = TRUE), colname)  # a named vector of length one
})
# a b ab c ac bc abc
# 3 3 3 3 3 3 3
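If the HapEstXXR package is not available on your setup (it has at times been archived on CRAN), the power set of column indices can also be built with base R's combn(); a sketch:
ps <- unlist(lapply(seq_along(test),
                    function(k) combn(seq_along(test), k, simplify = FALSE)),
             recursive = FALSE)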
