Add an incremental counter to a dataframe in R - r

I have a data-frame composed of 3 columns and the third column ('V3') contains the name of markers that were presented during the experiment.
I'd like to add another column that tells me, for each marker, how many times that specific marker was presented before that instance.
I have done it in R without using the tidyverse but it is quite time consuming, and I was wondering if you could help me do it using the tidyverse.
At the moment my script goes like this:
data$counter <- NA
x <- 1
for (i in 1:nrow(data)) {
  if ((str_detect(data$V3[i], 'NEGATIVE')) == TRUE) {
    data$counter[i] <- x
    x <- x + 1
  }
}

I think cumsum will do an excellent job here:
testdata <- data.frame(V3=sample(c("NEGATIVEsomething","others"),50,replace=TRUE), stringsAsFactors = FALSE)
negatives <- grepl("NEGATIVE",testdata$V3)
negatives <- as.numeric(negatives)
negatives <- cumsum(negatives)
negatives[negatives == 0] <- NA
testdata$counter <- negatives
Edit: Since you want to increment the counter after finding a "NEGATIVE" and place the old counter at the position, you should use
negatives <- cumsum(negatives)-1
and then remove the 0 and -1 counts at the beginning:
negatives[negatives %in% c(0,-1)] <- NA
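For comparison, here is a minimal base R sketch that reproduces the counter from the loop in the question (NA on rows without "NEGATIVE", and 1, 2, 3, ... on the rows with it) and, separately, the number of "NEGATIVE" markers seen before each row. The data frame markers and the column name seen_before are made up for illustration; they are not from the question.

```r
# Toy data (hypothetical); V3 holds the marker names
markers <- data.frame(
  V3 = c("NEGATIVE_a", "other", "NEGATIVE_b", "other", "NEGATIVE_c"),
  stringsAsFactors = FALSE
)

# TRUE where the marker name contains the literal text "NEGATIVE"
is_neg <- grepl("NEGATIVE", markers$V3, fixed = TRUE)

# Same result as the loop: running count on NEGATIVE rows, NA elsewhere
markers$counter <- ifelse(is_neg, cumsum(is_neg), NA_integer_)

# Number of NEGATIVE markers presented strictly before each row
markers$seen_before <- cumsum(is_neg) - is_neg

print(markers)
```

Which of the two columns is wanted depends on whether the current row should be included in its own count.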

You can do this with group_by() and then generating the counters with seq_along().
library(dplyr)
data %>%
  group_by(V3) %>%
  mutate(counter = seq_along(V3)) %>%
  ungroup()

Try
library(stringr)
data$counter <- cumsum(str_detect(data$V3, 'NEGATIVE'))

Related

R 4.1.2: Dynamically check values for a cumulative pattern. Null following values if that pattern occurs at any time across values

This relates to another problem I posted, but I did not quite ask the right question. If anyone can help with this, it would really be appreciated.
I have a DF with several players' answers to 100 questions in a quiz (example data frame below with 10 questions and 10 players-not the real data, which is not really from a quiz, but the principle is the same).
My goal is to create a function that will check when a player has answered 3 questions incorrectly cumulatively at any point during their answers, and then change their following answers to the string "disc". I would like to be able to change the parameters also, so it could be 4 or 5 questions incorrect etc. In the df: 1=correct, 0=incorrect, and 2=unanswered. Unanswered is considered incorrect, but I do not want to recode it as 0.
df=data.frame(playerID=numeric(),
q1=numeric(),
q2=numeric(),
q3=numeric(),
q4=numeric(),
q5=numeric(),
q6=numeric(),
q7=numeric(),
q8=numeric(),
q9=numeric(),
q10=numeric())
set.seed(1)
for(i in 1:10){
list_i=c(i,sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1),sample(0:2,1))
df[i,]=list_i
}
So, in this DF, for example, playerID = 3, 8 and 9 should have their answers set to "disc" from q4 onwards, whereas playerID 5 should have "disc" from q8 onwards. So any time there are 3 consecutive incorrect answers (including values of 2), the following answers should change to "disc".
I presume the syntax would be a for loop with an if statement inside using mutate or similar.
One possible solution using mutate and across:
library(dplyr)
df %>%
  ungroup() %>%
  mutate(
    # Mutate across all question columns
    across(
      starts_with("q"),
      function(col) {
        # Get previous columns
        col_i <- which(names(cur_data()) == cur_column())
        previous_cols <- 2:(col_i - 1)
        # Get results for previous questions as string (1 = incorrect, i.e. zero or 2)
        previous_qs <- select(cur_data(), all_of(previous_cols)) %>%
          mutate(across(everything(), ~as.numeric(.x %in% c(0, 2)))) %>%
          tidyr::unite("str", sep = "") %>%
          pull(str)
        # Check for three successive incorrect answers at some previous point
        results <- grepl(pattern = "111", previous_qs)
        # For those with three successive incorrect answers at some previous point, overwrite value with 'disc'
        col[results] <- "disc"
        col
      }
    )
  )
Are you looking for something like this?
library(tidyverse)
n <- 100
f <- function(v, cap, new_value){
df <-
data.frame(v = v) |>
mutate(
b = cumsum(v),
v_new = ifelse(b > cap, new_value, v)
)
return(df$v_new)
}
# apply function to vector
v <- runif(n)
v_new <- f(v, 5, "disc")
# apply function in a dataframe with mutate
df <-
data.frame(a = runif(n))
df |>
mutate(
b = f(a, 5, "disc")
)
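The cumsum() version above accumulates over the whole vector and never resets on a correct answer. If the rule is strictly about consecutive incorrect answers, one possible base R sketch is to track the length of the current run of incorrect answers per player. The helper name disqualify_after, its arguments k and label, and the toy data frame quiz are all hypothetical and only serve the illustration.

```r
# Hypothetical helper: once `k` consecutive incorrect answers (0 or 2)
# have occurred, replace every later answer with `label`.
disqualify_after <- function(answers, k = 3, label = "disc") {
  wrong <- answers %in% c(0, 2)
  # streak[i] = length of the run of incorrect answers ending at position i;
  # every correct answer starts a new group, so the count resets there
  streak <- ave(as.integer(wrong), cumsum(!wrong), FUN = cumsum)
  hit <- which(streak >= k)
  out <- as.character(answers)
  if (length(hit) > 0 && hit[1] < length(answers)) {
    out[(hit[1] + 1):length(answers)] <- label
  }
  out
}

# Toy data (not the data from the question): one row per player
quiz <- data.frame(
  playerID = 1:2,
  q1 = c(0, 1), q2 = c(2, 1), q3 = c(0, 0), q4 = c(1, 1), q5 = c(1, 0)
)

# Apply the helper row by row to the question columns only
qcols <- grep("^q", names(quiz))
quiz[qcols] <- as.data.frame(
  t(apply(quiz[qcols], 1, disqualify_after, k = 3)),
  stringsAsFactors = FALSE
)
print(quiz)
```

Changing k gives the 4- or 5-question variants. Note that the question columns become character, because "disc" cannot be stored in a numeric column.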

Calculate cumulative mean for dataset randomized 100 times

I have a dataset, and I would like to randomize the order of this dataset 100 times and calculate the cumulative mean each time.
# example data
library(dplyr)
ID <- seq.int(1, 100)
val <- rnorm(100)
df <- cbind(ID, val) %>%
  as.data.frame()
I already know how to calculate the cumulative mean using the function "cummean()" in dplyr.
df2 <- df %>%
mutate(cm = cummean(val))
However, I don't know how to randomize the dataset 100 times and apply the cummean() function to each iteration of the dataframe. Any advice on how to do this would be greatly appreciated.
I realize this could probably be solved via either a loop, or in tidyverse, and I'm open to either solution.
Additionally, if possible, I'd like to include a column that indicates which iteration the data was produced from (i.e., randomization #1, #2, ..., #100), as well as include the "ID" value, which indicates how many data values were included in the cumulative mean. Thanks in advance!
Here is an approach using the purrr package. Also, not sure what cummean is calculating (maybe someone can share that in the comments) so I included an alternative, the column cm2 as a comparison.
library(tidyverse)
set.seed(2000)
num_iterations <- 100
num_sample <- 100
1:num_iterations %>%
  map_dfr(
    function(i) {
      tibble(
        iteration = i,
        id = 1:num_sample,
        val = rnorm(num_sample),
        cm = cummean(val),
        cm2 = cumsum(val) / seq_along(val)
      )
    }
  )
You can mutate to create 100 samples then call cummean:
library(dplyr)
library(purrr)
df %>% mutate(map_dfc(1:100, ~cummean(sample(val))))
We may use rerun from purrr
library(dplyr)
library(purrr)
f1 <- function(dat, valcol) {
dat %>%
sample_n(size = n()) %>%
mutate(cm = cummean({{valcol}}))
}
n <- 100
out <- rerun(n, f1(df, val))
The output of rerun is a list, which we can name with the iteration sequence. If we need a single data frame with a new column identifying the iteration, use bind_rows with .id:
out1 <- bind_rows(out, .id = 'ID')
> head(out1)
  ID        val          cm
1  1  0.3376980  0.33769804
2  1 -1.5699384 -0.61612019
3  1  1.3387892  0.03551628
4  1  0.2409634  0.08687807
5  1  0.7373232  0.21696708
6  1 -0.8012491  0.04726439
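As a dependency-free alternative, the following base R sketch shuffles the same data frame 100 times and records the iteration number, the number of values included in each cumulative mean, and the original ID. The function name shuffle_cummean and the column names iteration and n are assumptions chosen for the example; the cumulative mean is computed as cumsum(val) / seq_along(val) rather than with cummean().

```r
set.seed(1)
df <- data.frame(ID = 1:100, val = rnorm(100))

# Hypothetical helper: shuffle the rows once and compute the running mean
shuffle_cummean <- function(iteration, dat) {
  shuffled <- dat[sample(nrow(dat)), , drop = FALSE]
  data.frame(
    iteration = iteration,               # which randomization this is
    n = seq_len(nrow(shuffled)),         # how many values are in the mean
    ID = shuffled$ID,                    # original row identifier
    val = shuffled$val,
    cm = cumsum(shuffled$val) / seq_len(nrow(shuffled))
  )
}

# One data frame per iteration, stacked into a single long data frame
result <- do.call(rbind, lapply(1:100, shuffle_cummean, dat = df))
head(result)
```

Because every iteration uses all 100 values, the last cumulative mean of each iteration equals mean(df$val), which is a quick sanity check on the result.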

For loop to extract data

I have a data set with these variables (Branch, Item, Sales, Stock). I need to write a for loop that extracts the rows for the same item where:
1- the item appears in different branches
2- its sales are higher than its stock
and saves the result in a data frame.
The code I used is:
trials <- sample_n(Data_with_stock,1000)
for (i in 1:nrow(trials))
{
if(trials$sales[i] > trials$stock[i] & trials$item[i] == trials$item[i+1] & trials$branch[i] != trials$branch[i+1])
{s <-data.frame( (trials$NAME[i])
,(trials$branch[i]))
}
}
I suggest using the dplyr library. After installing it, and assuming "df" is your dataset, use the commands below for questions 1 and 2.
Question 1
question_one = df %>%
group_by(Item) %>%
summarise(No_of_branches = n_distinct(Branch))
items_with_more_than_one_branch = question_one[which(question_one$No_of_branches > 1), "Item"]
Question 2: Similarly,
question_two = df %>%
group_by(Item) %>%
summarise(Stock_Val = sum(Stock), Sales_Val = sum(Sales))
item_with_sales_greater_than_stock = question_two[which(question_two$Sales_Val > question_two$Stock_Val), "Item"]
I couldn't see a way to solve this without dplyr; if you haven't used it yet, it is very useful for data crunching.
As you just want to fix your code:
You were missing one = in your code.
Use:
trials <- sample_n(Data_with_stock, 1000)
# s must be defined before the loop: one row per iteration, 2 values per row
s <- array(NA, dim = c(nrow(trials), 2))
# stop at nrow - 1 because row i is compared with row i + 1
for (i in 1:(nrow(trials) - 1)) {
  # but I don't get why you compare the second condition.
  if (trials$sales[i] > trials$stock[i] & trials$item[i] == trials$item[i + 1] & trials$branch[i] != trials$branch[i + 1]) {
    s[i, ] <- c(trials$NAME[i], trials$branch[i])
  } else {
    s[i, ] <- NA # keeps the index i aligned; drop these rows afterwards with na.omit()
  }
}
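If a loop is not a hard requirement, the two conditions can also be checked row by row without comparing neighbouring rows. This is only a sketch on invented data: the data frame stock_data and its values are hypothetical, and it assumes "different branches" means the item occurs in more than one distinct branch.

```r
# Toy data (hypothetical values)
stock_data <- data.frame(
  Branch = c("A", "B", "A", "B", "A"),
  Item   = c("pen", "pen", "ink", "ink", "cup"),
  Sales  = c(10, 3, 5, 9, 8),
  Stock  = c(4, 6, 7, 2, 1),
  stringsAsFactors = FALSE
)

# For every row, the number of distinct branches that carry its item
n_branches <- ave(
  as.integer(factor(stock_data$Branch)),
  stock_data$Item,
  FUN = function(b) length(unique(b))
)

# Keep rows whose item is in more than one branch and whose sales exceed stock
result <- stock_data[n_branches > 1 & stock_data$Sales > stock_data$Stock, ]
print(result)
```

Unlike the loop, this does not depend on the order of the rows, so it also works on a random sample.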

Replace NA with Value with Previous Value

I have a column in data frame that I created in R. After a certain month, the values become NA. I would like to replace the NAs with the record 12 months back. Is there a function in R for me to do this? Or do I have to do a loop?
So Jan-11 would then become 10, Feb-11 would become 11 and so forth.
EDIT:
I also tried:
for (i in 1:length(df$var)) {
df$var[i] <- ifelse(is.na(df$var[i]), df$var[i - 12],
df$var[i]) }
but the whole column ends up being NA.
Aha, from the last comment it sounds like you'd like a "chained" lag, where it uses the last value of that month that is available, however many years back you need to go.
"Jan-11 will show the value 10, but when it comes to Jan-12, it shows NA (when it should be 10)."
Here's an approach that relies on first grouping by month, and then using tidyr::fill() to fill in from the last valid value for that month.
First, some fake data. (BTW it would be useful to include something like this in your question so that answerers don't have to retype your numbers or generate new ones.)
# Make fake data with 1 year values, 2 yrs NAs
library(lubridate)
set.seed(42);
data <- data.frame(
dates = seq.Date(from = ymd(20100101), to = ymd(20121201), by = "month"),
values = c(as.integer(rnorm(12, 10, 3)), rep(NA_integer_, 24))
)
# Group by months, fill within groups, ungroup.
library(tidyverse)
data_filled <- data %>%
group_by(month = month(dates)) %>%
fill(values) %>%
ungroup() %>%
arrange(dates)
I can't think of a way to do this without a loop, but this should give you what you need:
df <- data.frame(col1 = LETTERS[1:24],
                 col2 = c(rnorm(12), rep(NA, 12)))
for(i in 1:nrow(df)) {
  if(is.na(df[i, 2])) {
    df[i, 2] <- df[i - 12, 2]
  }
}
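The same "chained" fill can be sketched in base R without an explicit loop by grouping positions that are a fixed period apart and carrying the last non-missing value forward within each group. The helper names locf and fill_by_period are made up for this example, and a period of 3 is used only to keep the toy vector short; for monthly data the period would be 12.

```r
# Hypothetical helper: last observation carried forward within one vector
locf <- function(v) {
  # index of the most recent non-NA value so far (0 if there is none yet)
  idx <- cummax(ifelse(is.na(v), 0L, seq_along(v)))
  idx[idx == 0] <- NA
  v[idx]
}

# Hypothetical helper: fill NAs from the value one or more periods back
fill_by_period <- function(x, period = 12) {
  slot <- (seq_along(x) - 1) %% period   # position within the cycle
  ave(x, slot, FUN = locf)
}

x <- c(10, 11, 12, NA, NA, NA, NA, NA, NA)
fill_by_period(x, period = 3)
# 10 11 12 10 11 12 10 11 12
```

Values that have no earlier non-missing observation in their slot stay NA.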

Simple mutate with dplyr gives "wrong result size" error

My data table df has a subject column (e.g. "SubjectA", "SubjectB", ...). Each subject answers many questions, and the table is in long format, so there are many rows for each subject. The subject column is a factor. I want to create a new column - call it subject.id - that is simply a numeric version of subject. So for all rows with "SubjectA", it would be 1; for all rows with "SubjectB", it would be 2; etc.
I know that an easy way to do this with dplyr would be to call df %>% mutate(subject.id = as.numeric(subject)). But I was trying to do it this way:
subj.list <- unique(as.character(df$subject))
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
And I get this error:
Error: wrong result size (12), expected 72 or 1
Why does this happen? I'm not interested in other ways to solve this particular problem. Rather, I worry that my inability to understand this error reflects a deep misunderstanding of dplyr or mutate. My understanding is that this call should be conceptually equivalent to:
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
But the latter works and the former doesn't. Why?
Reproducible example:
df <- InsectSprays %>% rename(subject = spray)
subj.list <- unique(as.character(df$subject))
# this works
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
# but this doesn't
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
The issue is that operators and functions are applied in a vectorized way by mutate. Thus, which is applied to the vector produced by as.character(df$subject) == subj.list, not to each row (as in your loop).
Using rowwise as described here would solve the issue: https://stackoverflow.com/a/24728107/3772587
So, this will work:
df %>%
rowwise() %>%
mutate(subject.id = which(as.character(subject) == subj.list))
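To see where the 12 in the error message comes from, the comparison can be run outside of mutate. The sketch below uses the built-in InsectSprays data from the reproducible example; the variable names subject, hits and subject.id are just local names for the illustration. It also shows match(), a vectorized base R lookup that returns one position per row and therefore fits what mutate expects.

```r
subject <- as.character(InsectSprays$spray)   # 72 values
subj.list <- unique(subject)                  # 6 values: "A" ... "F"

# `==` recycles subj.list 12 times along the 72 rows, so each row is
# compared with only ONE element of subj.list, not with all six
hits <- which(subject == subj.list)
length(hits)
# 12, the "wrong result size" reported by mutate

# match() looks each subject up in subj.list and returns 72 positions
subject.id <- match(subject, subj.list)
length(subject.id)
# 72
```

Within each block of 12 identical subjects, the recycled comparison happens to line up twice, which gives 2 x 6 = 12 matches.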
Since your df$subject is a factor, you could simply do:
df %>% mutate(subj.id=as.numeric(subject))
Or use a left join approach:
subj.df <- df$subject %>%
unique() %>%
as_tibble() %>%
rownames_to_column(var = 'subj.id')
df %>% left_join(subj.df,by = c("subject"="value"))
