ERROR when using count in R which was working before

I used count to count identical rows and get their frequency, and it was working very well about 2 hours ago; now it gives me an ERROR that I do not understand. I want that every time I have the same row, the concentrations of these rows are added together. Here are my toy data and my function.
df = data.frame(ID = 1:6, A = rep(0, 6), B = c(rep(0, 5), 1),
                C = c(rep(1, 5), 0), D = rep(1, 6),
                E = c(rep(0, 3), rep(1, 2), 0),
                concentration = c(0.002, 0.004, 0.001, 0.0075, 0.00398, 0.006))
df
ID A B C D E concentration
1 1 0 0 1 1 0 0.00200
2 2 0 0 1 1 0 0.00400
3 3 0 0 1 1 0 0.00100
4 4 0 0 1 1 1 0.00750
5 5 0 0 1 1 1 0.00398
6 6 0 1 0 1 0 0.00600
freq.concentration = function(df, Vars){
  df = data.frame(df)
  Vars = as.character(Vars)
  compte = count(df, Vars)
  frequence.C = compte$freq / nrow(df)
  output = cbind(compte, frequence.C)
  return(output)
}
freq.concentration(df,colnames(df[2:6]))
# and here is the error that i get when i run the function which was working perfectly a while ago!
# Error: Must group by variables found in `.data`.
# * Column `Vars` is not found.
# Run `rlang::last_error()` to see where the error occurred.
PS: I do not know if this is related, but I got this problem after I opened an Rmd script and copy-pasted all my functions into it; all of a sudden my function stopped working.
I really appreciate your help in advance. Thank you.
Here is the output that i had when it was working properly :
output
ID A B C D E concentration.C.1 concentration.C.2
1 1 0 0 1 1 0 3 0.007
2 4 0 0 1 1 1 2 0.01148
3 6 0 1 0 1 0 1 0.00600
The first 3 rows are identical, so we sum their concentrations and get 0.007; rows 4 and 5 are the same, so we add their concentrations and get 0.01148; the last row is unique, so its concentration remains the same.

We can convert the column names to symbols and splice them (!!!) into count to get the frequency count based on those columns, and then get 'frequence.C' as the proportion of 'n' over the sum of that count. (The function most likely worked before because plyr::count accepts a character vector of column names and returns a freq column; once dplyr is attached, dplyr::count masks it and expects unquoted column names, hence the error.)
library(dplyr)
freq.concentration <- function(df, Vars){
  df %>%
    count(!!! rlang::syms(Vars)) %>%
    mutate(frequence.C = n / sum(n))
}
Testing:
freq.concentration(df,colnames(df)[2:6])
# A B C D E n frequence.C
#1 0 0 1 1 0 3 0.5000000
#2 0 0 1 1 1 2 0.3333333
#3 0 1 0 1 0 1 0.1666667
If we need the sum of 'concentration', we could use a group_by operation instead of count:
freq.concentration <- function(df, Vars){
  df %>%
    group_by(across(all_of(Vars))) %>%
    summarise(n = n(), frequency.C = sum(concentration), .groups = 'drop')
}
Testing:
freq.concentration(df,colnames(df)[2:6])
# A tibble: 3 x 7
# A B C D E n frequency.C
# <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#1 0 0 1 1 0 3 0.007
#2 0 0 1 1 1 2 0.0115
#3 0 1 0 1 0 1 0.006
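If both the per-group count and the summed 'concentration' are wanted in one output, as in the original working result, the two ideas combine naturally. A minimal sketch, assuming the toy df above (frequence.C is each group's share of the rows):
library(dplyr)
freq.concentration <- function(df, Vars){
  df %>%
    group_by(across(all_of(Vars))) %>%
    # n = how many identical rows; concentration = their summed concentration
    summarise(n = n(), concentration = sum(concentration), .groups = 'drop') %>%
    mutate(frequence.C = n / sum(n))
}
freq.concentration(df, colnames(df)[2:6])
# one row per distinct (A,B,C,D,E) combination, with concentration
# sums 0.007, 0.01148 and 0.006 as in the expected output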


Mark IDs that have a specific value across columns

I am trying to create a variable 'check' with values 1/0. These should be assigned based on whether, within each ID, each of the columns V1 to V3 contains at least one value = 1.
DF <- data.frame(ID = c(1,1,1,2,3,3,4,5,5,6),
                 V1 = c(1,0,0,1,1,0,0,1,0,0),
                 V2 = c(1,1,0,0,1,0,1,1,0,0),
                 V3 = c(0,1,0,1,0,0,0,0,1,0))
This is the code I am using, but group_by doesn't seem to work: it goes across columns and marks as 1 every row having at least one value of 1, but not by ID.
DF %>%
  dplyr::group_by(ID) %>%
  dplyr::mutate(Check = case_when(if_any('V1':'V3', ~ .x != 0) ~ 1, TRUE ~ 0)) %>%
  dplyr::ungroup()
So the output I am looking for is this one:
ID V1 V2 V3 check
 1  1  1  0     1
 1  0  1  1     1
 1  0  0  0     1
 2  1  0  1     0
 3  1  1  0     0
 3  0  0  0     0
 4  0  1  0     0
 5  1  1  0     1
 5  0  0  1     1
 6  0  0  0     0
Could you help?
Many thanks!
Edit: apologies, I have noticed a mistake in the output, it should be fine now.
Please check the code below.
These are the steps I followed:
After grouping by the ID column, derive new columns in which a 0 is replaced with NA.
Then fill the previous values down to the remaining rows, so a 1 is carried forward within each group.
Then sum the three new variables per row; where the sum is 3, set the check variable to 1.
Fill that 1 to the other rows within the group, and set check to zero everywhere else.
library(dplyr)
library(tidyr)

DF %>% group_by(ID) %>%
  mutate(across(starts_with('V'), ~ ifelse(.x == 0, NA, .x), .names = 'new_{col}')) %>%
  fill(starts_with('new')) %>%
  mutate(check = ifelse(rowSums(across(starts_with('new'))) == 3, 1, 0)) %>%
  fill(check, .direction = 'downup') %>%
  mutate(check = replace_na(check, 0)) %>%
  select(-starts_with('new'))
Created on 2023-02-03 with reprex v2.0.2
# A tibble: 10 × 5
# Groups: ID [6]
ID V1 V2 V3 check
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 0 1
2 1 0 1 1 1
3 1 0 0 0 1
4 2 1 0 1 0
5 3 1 1 0 0
6 3 0 0 0 0
7 4 0 1 0 0
8 5 1 1 0 1
9 5 0 0 1 1
10 6 0 0 0 0
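A more compact alternative (a sketch, under the same rule that each of V1 to V3 must contain at least one 1 within the ID) skips the helper columns entirely:
library(dplyr)

DF %>%
  group_by(ID) %>%
  # all three column sums positive within the ID => check = 1 for all its rows
  mutate(check = +all(colSums(across(V1:V3)) > 0)) %>%
  ungroup()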

Is there an R function for preparing datasets for survival analysis, like stset in Stata?

Datasets look like this
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
As you can see, when id = 1, it's just the usual data input to coxph in the survival package. However, for id = 2, failure occurs at the beginning and the end, but disappears in the middle.
Is there a general function to extract the data for id = 2 and get a result in the same form as id = 1?
I think for id = 2 the result should look like below.
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
A bit hacky, but should get the job done.
Data:
# Load data
library(tidyverse)
df <- read_table("
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
")
Data wrangling:
# Check for sub-groups within IDs and remove all but the last one
df <- df %>%
  # Group by ID
  group_by(id) %>%
  mutate(
    # Check if a new sub-group is starting (after a failure)
    new_group = case_when(
      # First row is always group 0
      row_number() == 1 ~ 0,
      # If the previous row was a failure, a new sub-group starts here
      lag(failure) == 1 ~ 1,
      # Otherwise not
      TRUE ~ 0
    ),
    # Assign sub-group numbers via cumulative sum
    group = cumsum(new_group)
  ) %>%
  # Keep only the last sub-group for each ID
  filter(group == max(group)) %>%
  ungroup() %>%
  # Remove working columns
  select(-new_group, -group)
Result:
> df
# A tibble: 6 × 5
id start end failure x1
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0
2 1 1 3 0 0
3 1 3 6 1 0
4 2 3 4 0 1
5 2 4 6 0 1
6 2 6 7 1 1
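The same keep-only-the-last-episode logic also fits in a few lines of base R. A sketch, assuming df as first read in above (i.e. before the pipeline overwrites it):
# Episode counter per id: increments on the row after each failure
epi <- ave(df$failure, df$id, FUN = function(f) cumsum(c(0, head(f, -1))))
# Keep only the rows belonging to each id's last episode
df[ave(epi, df$id, FUN = function(g) g == max(g)) == 1, ]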

ifelse replace value if it is lower than previous

I am working with a dataset which has some errors in the data: numbers are sometimes registered wrong (the toy data is reproduced in the answer below).
The issue is that the Reversal column should only count up (per unique ID). So in a vector of 0,0,0,1,1,1,0,1,2,2,0,0,2,3, the 0's following the 1 and the 2 should not be 0's; instead, they should be equal to whatever value came before. I tried to remedy this by using the lag function from the dplyr package:
Data$Reversal <- ifelse(Data$Reversal < lag(Data$Reversal), lag(Data$Reversal), Data$Reversal)
But this results in several issues:
The first value becomes NA. I've tried using default = Data$Reversal in the lag call, but to no avail.
The Reversal value should reset to 0 for each unique ID, but now it continues across IDs. I tried a messy version using group_by(ID) but could not get it to work, as it broke my earlier ifelse call.
This only works when there is one error; if there are two errors in a row, it fixes only one value.
Alternatively, I found a thread in which the answer provided by Andrie also seems promising. It fixes problems 1 and 3, but I can't get that code to work per ID (using the group_by function).
Andrie's answer:
local({
  r <- rle(data)
  x <- r$values
  x0 <- which(x == 0)              # index positions of zeroes
  xt <- x[x0-1] == x[x0+1]         # zeroes surrounded by same value
  r$values[x0[xt]] <- x[x0[xt]-1]  # substitute with surrounding value
  inverse.rle(r)
})
Any help would be much appreciated.
I think cummax does exactly what you need.
Base R
dat$Reversal <- ave(dat$Reversal, dat$ID, FUN = cummax)
dat
# ID Owner Reversal Success
# 1 1 A 0 0
# 2 1 A 0 0
# 3 1 A 0 0
# 4 1 B 1 1
# 5 1 B 1 0
# 6 1 B 1 0
# 7 1 error 1 0
# 8 1 error 1 0
# 9 1 B 1 0
# 10 1 B 1 0
# 11 1 C 1 1
# 12 1 C 2 0
# 13 1 error 2 0
# 14 1 C 2 0
# 15 1 C 3 1
# 16 2 J 0 0
# 17 2 J 0 0
dplyr
dat %>%
  group_by(ID) %>%
  mutate(Reversal = cummax(Reversal)) %>%
  ungroup()
data.table
library(data.table)
as.data.table(dat)[, Reversal := cummax(Reversal), by = .(ID)][]
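To see why cummax resolves the reported problems at once (no leading NA, and runs of consecutive errors fixed together; the per-ID reset comes from the grouping), run it on the vector from the question:
cummax(c(0,0,0,1,1,1,0,1,2,2,0,0,2,3))
# [1] 0 0 0 1 1 1 1 1 2 2 2 2 2 3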
Data, courtesy of https://extracttable.com/
dat <- read.table(header = TRUE, text = "
ID Owner Reversal Success
1 A 0 0
1 A 0 0
1 A 0 0
1 B 1 1
1 B 1 0
1 B 1 0
1 error 0 0
1 error 0 0
1 B 1 0
1 B 1 0
1 C 1 1
1 C 2 0
1 error 0 0
1 C 2 0
1 C 3 1
2 J 0 0
2 J 0 0")

How do you apply if else statements across user id #?

I am trying to create a dummy variable that flags the user id of people who attended a specific event. Each user id has multiple rows and I would like this dummy variable to apply to every row of the flagged user id. For example, using the data set below, I would like to flag the user IDs of everyone who attended "event b" (using a "1" for attended event b and "0" for did not attend event b). The tricky part is that I want the 1 to appear in every row that matches the user IDs of the people who attended "event b".
I want to use this dummy variable to eventually subset the data so that I can assess the event attending patterns of the users who attended a particular event.
df <- data.frame(id = c(100,100,100,101,101,102,102,103,103,103,103),
                 event = c("a","b","c","b","d","a","c","a","c","d","e"))
Consider ifelse and ave, iterating across the unique values (or levels) of event:
for(ev in unique(df$event)) {   # or: for(ev in levels(df$event))
  df[[paste0("event_", ev, "_flag")]] <- with(df, ave(ifelse(event == ev, 1, 0), id, FUN = max))
}
df
# id event event_a_flag event_b_flag event_c_flag event_d_flag event_e_flag
# 1 100 a 1 1 1 0 0
# 2 100 b 1 1 1 0 0
# 3 100 c 1 1 1 0 0
# 4 101 b 0 1 0 1 0
# 5 101 d 0 1 0 1 0
# 6 102 a 1 0 1 0 0
# 7 102 c 1 0 1 0 0
# 8 103 a 1 0 1 1 1
# 9 103 c 1 0 1 1 1
# 10 103 d 1 0 1 1 1
# 11 103 e 1 0 1 1 1
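If only the single 'event b' flag from the question is needed, the same grouped-maximum idea is a one-liner in dplyr. A sketch, with any() standing in for the max over the 0/1 indicator:
library(dplyr)

df %>%
  group_by(id) %>%
  # TRUE if this id attended event b anywhere; + coerces to 1/0 on every row
  mutate(event_b_flag = +any(event == "b")) %>%
  ungroup()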
As I understood it, you want to one-hot encode.
You can use the following code with the dummyVars function of the caret package. Afterwards, you aggregate the duplicate rows with base R's aggregate.
library(caret)
library(dplyr)
df <- data.frame(id = c(100,100,100,101,101,102,102,103,103,103,103),
                 event = c("a","b","c","b","d","a","c","a","c","d","e"))
dmy <- dummyVars(" ~ .", data = df)
trsf <- data.frame(predict(dmy, newdata = df))
aggregate(.~id, trsf, FUN=sum)
id event.a event.b event.c event.d event.e
1 100 1 1 1 0 0
2 101 0 1 0 1 0
3 102 1 0 1 0 0
4 103 1 0 1 1 1
Perhaps I'm using a way too simple approach. Using dplyr and tidyr:
library(dplyr)
library(tidyr)

df %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = "event", values_fill = 0)
returns
# A tibble: 4 x 6
id a b c d e
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 1 1 1 0 0
2 101 0 1 0 1 0
3 102 1 0 1 0 0
4 103 1 0 1 1 1
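Note that this returns one row per id, not one per row of df. If, as the question asks, the flag should appear on every original row, the wide table can be joined back (a sketch, reusing the pivot above):
library(dplyr)
library(tidyr)

flags <- df %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = "event", values_fill = 0)

# Attach the per-id flags to every row of the original data
df %>% left_join(flags, by = "id")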

Only Use The First Match For Every N Rows

I have a data.frame that looks like this.
Date Number
1 1
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 1
I would like to create a new column that contains a 1 if the row holds the first 1 within each block of 3 rows, and a 0 otherwise. For example, this is how I would like the new data.frame to look:
Date Number New
1 1 1
2 0 0
3 1 0
4 0 0
5 0 0
6 1 1
7 0 0
8 0 0
9 1 1
Every three rows, we find the first 1 and mark it; everything else gets a 0. Thank you.
Hmm, at first glance I thought akrun's answer provided the solution. However, it is not exactly what I am looking for. Here is what akrun's first solution provides:
df1 = data.frame(Number = c(1,0,1,0,1,1,1,0,1,0,0,0))
head(df1,9)
Number
1 1
2 0
3 1
4 0
5 1
6 1
7 1
8 0
9 1
Attempt at solution:
df1 %>%
  group_by(grp = as.integer(gl(n(), 3, n()))) %>%
  mutate(New = +(Number == row_number()))
Number grp New
<dbl> <int> <int>
1 1 1 1
2 0 1 0
3 1 1 0
4 0 2 0
5 1 2 0 #should be a 1
6 1 2 0
7 1 3 1
8 0 3 0
9 1 3 0
As you can see, the code misses the 1 on row 5. I am looking for the first 1 in every chunk; everything else should be 0.
Sorry if I was unclear, akrun.
Edit: akrun's new answer is exactly what I am looking for. Thank you very much.
Here is an option: create a grouping column with gl, then compare (==) the row_number against the index of the matched 1. match returns only the index of the first match, so everything after it within the chunk gets 0.
library(dplyr)
df1 %>%
  group_by(grp = as.integer(gl(n(), 3, n()))) %>%
  mutate(New = +(row_number() == match(1, Number, nomatch = 0)))
# A tibble: 12 x 3
# Groups: grp [4]
# Number grp New
# <dbl> <int> <int>
# 1 1 1 1
# 2 0 1 0
# 3 1 1 0
# 4 0 2 0
# 5 1 2 1
# 6 1 2 0
# 7 1 3 1
# 8 0 3 0
# 9 1 3 0
#10 0 4 0
#11 0 4 0
#12 0 4 0
Looking at the logic, perhaps you want to check if Number == 1 and the prior 2 values were both 0 (a slightly different rule from "first 1 per 3-row block", though the two coincide on the data shown). If that is not correct, please let me know.
library(dplyr)
df %>%
  mutate(New = ifelse(Number == 1 &
                        lag(Number, n = 1L, default = 0) == 0 &
                        lag(Number, n = 2L, default = 0) == 0, 1, 0))
Output
Date Number New
1 1 1 1
2 2 0 0
3 3 1 0
4 4 0 0
5 5 0 0
6 6 1 1
7 7 0 0
8 8 0 0
9 9 1 1
You can replace the Number values with 0, except for the first occurrence of 1 in each chunk of 3 rows.
library(dplyr)
df %>%
  group_by(gr = ceiling(row_number() / 3)) %>%
  mutate(New = replace(Number, -which.max(Number), 0)) %>%
  # Or, to be safe and specific, use:
  # mutate(New = replace(Number, -which(Number == 1)[1], 0)) %>%
  ungroup() %>%
  select(-gr)
# A tibble: 9 x 3
# Date Number New
# <int> <int> <int>
#1 1 1 1
#2 2 0 0
#3 3 1 0
#4 4 0 0
#5 5 0 0
#6 6 1 1
#7 7 0 0
#8 8 0 0
#9 9 1 1
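For completeness, the same first-match-per-block rule in base R with ave (a sketch, assuming the question's df with its Date and Number columns):
df$New <- ave(df$Number, ceiling(seq_len(nrow(df)) / 3),
              FUN = function(x) +(seq_along(x) == match(1, x, nomatch = 0)))
df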
