Add an occasion flag to data frame

For each individual, I would like to add an occasion flag to my data frame that increments whenever the amount is bigger than zero. I need this flag for further calculations. Here is what I would like to achieve.
dfin <-
ID AMT
1 50
1 NA
1 10
1 NA
2 15
2 NA
2 NA
3 10
3 15
dfout <-
ID AMT FLAG
1 50 1
1 NA 1
1 10 2
1 NA 2
2 15 1
2 NA 1
2 NA 1
3 10 1
3 15 2
How can I achieve this in R?

You can test which values are not NA and compute the cumulative sum. Note that this running count continues across IDs rather than restarting for each one.
dfout = dfin
dfout$FLAG = cumsum(!is.na(dfin$AMT))
dfout
ID AMT FLAG
1 1 50 1
2 1 NA 1
3 1 10 2
4 1 NA 2
5 2 15 3
6 2 NA 3
7 2 NA 3
8 3 10 4

Since I have changed the output that I want, I am answering the question myself, based on the answer provided by @G5W, so that the flag is computed by ID.
library(dplyr)
dfout <- dfin %>%
group_by(ID) %>%
mutate(FLAG = cumsum(!is.na(AMT)))
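
For anyone who prefers to avoid the dplyr dependency, the same per-ID count can probably be sketched in base R with ave(). This is only an illustration: the dfin below is rebuilt by hand from the table in the question, and FLAG counts non-NA amounts, as in the answers above.

```r
# Rebuild the example input from the question (typed in by hand)
dfin <- data.frame(
  ID  = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  AMT = c(50, NA, 10, NA, 15, NA, NA, 10, 15)
)

dfout <- dfin
# ave() applies cumsum() separately within each ID,
# so the count of non-NA amounts restarts for every individual
dfout$FLAG <- ave(as.integer(!is.na(dfin$AMT)), dfin$ID, FUN = cumsum)
dfout
```

The FLAG column should match the dfout shown in the question.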

Related

How to make the next number in a column a sequence in r

Sorry to bother everyone. I have been stuck on this piece of code. My data looks like this:
Student Number
1 NA
1 NA
1 1
1 1
2 NA
2 1
2 1
2 1
3 NA
3 NA
3 1
3 1
I tried using dplyr to group by student and find a way so that every time it reads a 1, it adds it to the running total for that student, so the column would read as
Student Number
1 NA
1 NA
1 1
1 2
2 NA
2 1
2 2
2 3
3 NA
3 NA
3 1
3 2
etc
Thank you! It'd help with attendance.

A data.table solution:
library(data.table)
setDT(df)
df[!is.na(Number),Number:=cumsum(Number),by=Student]
df
Student Number
<int> <int>
1 1 NA
2 1 NA
3 1 1
4 1 2
5 2 NA
6 2 1
7 2 2
8 2 3
9 3 NA
10 3 NA
11 3 1
12 3 2

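Try using cumsum. Note that cumsum itself cannot skip NA values, so replace them with 0 first; adding 0 * Number afterwards puts NA back in the rows where Number was missing.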
library(dplyr)
df %>%
group_by(Student) %>%
mutate(n = cumsum(ifelse(is.na(Number), 0, Number)) + 0 * Number)
Student Number n
<int> <int> <dbl>
1 1 NA NA
2 1 NA NA
3 1 1 1
4 1 1 2
5 2 NA NA
6 2 1 1
7 2 1 2
8 2 1 3
9 3 NA NA
10 3 NA NA
11 3 1 1
12 3 1 2
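
The same idea can be sketched without any packages. The df below is typed in from the question, and the helper name running is just an illustrative choice, not something from the answers above.

```r
# Example data typed in from the question
df <- data.frame(
  Student = rep(1:3, each = 4),
  Number  = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA, 1, 1)
)

# Treat NA as 0 so cumsum() can run, and restart the sum for each Student
running <- ave(ifelse(is.na(df$Number), 0, df$Number), df$Student, FUN = cumsum)

# Put NA back wherever the original Number was missing
df$n <- ifelse(is.na(df$Number), NA, running)
df
```

The explicit second ifelse() plays the same role as the + 0 * Number trick: any row where Number is NA ends up NA in the result.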

Shifting rows up in columns and flush remaining ones

I have a problem with shifting rows up. Within each group, I would like to move the non-NA values up into a single row and drop the rows that become entirely NA. My current approach, however, still keeps the second row of each group.
Here is my approach:
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
so using this approach
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see, I still have the second row of each group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!

If there's at most one valid value per gr, you can use na.omit then take the first value from it:
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5

You can do it with dplyr and tidyr (fill() comes from tidyr) like this:
library(dplyr)
library(tidyr)
data$ind <- rep(c(1, 2), times = nrow(data) / 2)
data %>% fill(A, B, C) %>% filter(ind == 2) %>% mutate(ind = NULL)
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.

One more solution, using data.table and zoo:
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5
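
A package-free variant of the same collapse can be sketched with aggregate(). It assumes, like the na.omit answer above, that taking the first non-NA value per column within each gr is what you want; the name collapsed is only illustrative.

```r
data <- data.frame(
  gr = rep(1:3, each = 2),
  A  = c(1, NA, 2, NA, 4, NA),
  B  = c(NA, 1, NA, 3, NA, 7),
  C  = c(1, NA, 4, NA, 5, NA)
)

# For every gr, keep the first non-NA value found in each column
collapsed <- aggregate(
  data[c("A", "B", "C")],
  by  = list(gr = data$gr),
  FUN = function(x) na.omit(x)[1]
)
collapsed
```

If a column has no non-NA value in some group, na.omit(x)[1] evaluates to NA, so that cell simply stays missing.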

Replace duplicate values in vector using criteria from other columns in data frame

I have a very similar problem to:
Identify and replace duplicates elements from a vector
I need to replace duplicate values in a column occurring in a sequence BUT based on criteria from other columns in the data frame.
I have a data frame like this (plus a number of extra columns):
ID<- c("1V","1V","1V","1V","2V","2V","4V","4V","4V","4V","4V")
year<- c(1,1,1,2,1,1,2,2,3,3,3)
sequence<- c(1,2,2,1, 1,2,1,2,1,1,1)
score <- c(5,5,5,5,10,10,10,10,11,11,11)
examp <- data.frame(ID,year, sequence, score)
> examp
ID year sequence score
1 1V 1 1 5
2 1V 1 2 5
3 1V 1 2 5
4 1V 2 1 5
5 2V 1 1 10
6 2V 1 2 10
7 4V 2 1 10
8 4V 2 2 10
9 4V 3 1 11
10 4V 3 1 11
11 4V 3 1 11
What I need is to replace the duplicate scores within each ID, year and sequence with NA. The sequence value paired with each such score should also be replaced with NA. Thus, no rows are deleted, only specific entries.
> examp
ID year sequence score
1 1V 1 1 5
2 1V 1 2 5
3 1V 1 NA NA
4 1V 2 1 5
5 2V 1 1 10
6 2V 1 2 10
7 4V 2 1 10
8 4V 2 2 10
9 4V 3 1 11
10 4V 3 NA NA
11 4V 3 NA NA
All rows are retained. The same scores may occur across different IDs/years/sequences, but only within each unique combination of these three columns can I replace a duplicate score.
Example with a single vector and solution from the other linked question:
a <- c(1, 1, 1, 2, 3, 2, 2, 2, 2, 1, 0, 0, 0, 0, 2, 3, 4, 4, 1, 1)
ifelse(a == c(a[1]-1,a[(1:length(a)-1)]) , 0 , a)
[1] 1 0 0 2 3 2 0 0 0 1 0 0 0 0 2 3 4 0 1 0
I am unsure how to adapt the code above to my problem with multiple criteria. Is it possible?
Replacing the scores is the most important part, but if someone has a solution that replaces both scores and sequence I would be very happy.

In base R, you can use subsetting and is.na.
is.na(examp[duplicated(examp[1:3]), c("sequence", "score")]) <- TRUE
examp
ID year sequence score
1 1V 1 1 5
2 1V 1 2 5
3 1V 1 NA NA
4 1V 2 1 5
5 2V 1 1 10
6 2V 1 2 10
7 4V 2 1 10
8 4V 2 2 10
9 4V 3 1 11
10 4V 3 NA NA
11 4V 3 NA NA
Here, duplicated(examp[1:3]) returns a logical vector the length of your data.frame that signals whether each row of the first three variables is a duplicate of a previous row. c("sequence", "score") determines the columns that are to be replaced. Then is.na is set to TRUE in those columns for the duplicated rows.
A longer, but more readable version is to use the variable names rather than their positions.
is.na(examp[duplicated(examp[c("ID", "year", "sequence")]), c("sequence", "score")]) <- TRUE
This is also safer in the long run in case the positions shift due to merging or other manipulations. It may be also easier to read/interpret when reviewing the code six months from now.
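
To see what duplicated() marks before overwriting anything, it may help to store the logical vector first. The sketch below rebuilds examp from the question; the name dup is just an illustrative choice.

```r
ID <- c("1V", "1V", "1V", "1V", "2V", "2V", "4V", "4V", "4V", "4V", "4V")
year <- c(1, 1, 1, 2, 1, 1, 2, 2, 3, 3, 3)
sequence <- c(1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1)
score <- c(5, 5, 5, 5, 10, 10, 10, 10, 11, 11, 11)
examp <- data.frame(ID, year, sequence, score)

# TRUE only for the second and later occurrence of an ID/year/sequence combination
dup <- duplicated(examp[c("ID", "year", "sequence")])
which(dup)  # rows 3, 10 and 11

# Blank out both columns in exactly those rows; no rows are dropped
examp[dup, c("sequence", "score")] <- NA
examp
```

Because the first occurrence of each combination is never flagged, one score per ID/year/sequence always survives.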

We can use data.table. Convert the data.frame to a data.table with setDT(examp). Grouped by ID and year, take the row indices (.I) where sequence is duplicated, then set the sequence and score columns to NA at those rows. This should be very efficient because set() modifies the table in place.
library(data.table)
i1 <- setDT(examp)[, .I[duplicated(sequence)], .(ID, year)]$V1
for(j in 3:4){
set(examp, i = i1, j=j, value = NA)
}
examp
# ID year sequence score
# 1: 1V 1 1 5
# 2: 1V 1 2 5
# 3: 1V 1 NA NA
# 4: 1V 2 1 5
# 5: 2V 1 1 10
# 6: 2V 1 2 10
# 7: 4V 2 1 10
# 8: 4V 2 2 10
# 9: 4V 3 1 11
#10: 4V 3 NA NA
#11: 4V 3 NA NA
Or with dplyr
library(dplyr)
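examp %>%
group_by(ID, year) %>%
# blank score first, while sequence still holds its original values
mutate(score = replace(score, duplicated(sequence), NA),
sequence = replace(sequence, duplicated(sequence), NA)) %>%
ungroup()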
With base R, we can do a compact option
examp[duplicated(examp[1:3]), 3:4] <- NA
examp
# ID year sequence score
#1 1V 1 1 5
#2 1V 1 2 5
#3 1V 1 NA NA
#4 1V 2 1 5
#5 2V 1 1 10
#6 2V 1 2 10
#7 4V 2 1 10
#8 4V 2 2 10
#9 4V 3 1 11
#10 4V 3 NA NA
#11 4V 3 NA NA
Or another option is replace with lapply
examp[3:4] <- lapply(examp[3:4], function(x) replace(x, duplicated(examp[1:3]), NA))

Exclude a Specific Value from a Unique Value Counter

I am trying to count how many different responses a person gives during a trial of an experiment, but there is a catch.
There are supposed to be 6 possible responses (1,2,3,4,5,6) BUT sometimes 0 is recorded as a response (it's a glitch / flaw in design).
I need to count the number of different responses they give, BUT ONLY counting unique values within the range 1-6. This helps us calculate their accuracy.
Is there a way to exclude the value 0 from contributing to a unique value counter? Any other work-arounds?
Currently I am trying this method below, but it includes 0, NA, and I think any other entry in a cell in the Unique Value Counter Column (I have named "Span6"), which makes me sad.
# My Span6 calculator:
ASixImageTrials <- data.frame(eSOPT_831$T8.RESP, eSOPT_831$T9.RESP, eSOPT_831$T10.RESP, eSOPT_831$T11.RESP, eSOPT_831$T12.RESP, eSOPT_831$T13.RESP)
ASixImageTrials$Span6 = apply(ASixImageTrials, 1, function(x) length(unique(x)))

Use na.omit inside unique and sum the resulting logical vector, as below:
df$res = apply(df, 1, function(x) sum(unique(na.omit(x)) > 0))
df
Output:
X1 X2 X3 X4 X5 res
1 2 1 1 2 1 2
2 3 0 1 1 2 3
3 3 NA 1 1 3 2
4 3 3 3 4 NA 2
5 1 1 0 NA 3 2
6 3 NA NA 1 1 2
7 2 0 2 3 0 2
8 0 2 2 2 1 2
9 3 2 3 0 NA 2
10 0 2 3 2 2 2
11 2 2 1 2 1 2
12 0 2 2 2 NA 1
13 0 1 4 3 2 4
14 2 2 1 1 NA 2
15 3 NA 2 2 NA 2
16 2 2 NA 3 NA 2
17 2 3 2 2 2 2
18 2 NA 3 2 2 2
19 NA 4 5 1 3 4
20 3 1 2 1 NA 3
Data:
set.seed(752)
mat <- matrix(rbinom(100, 10, .2), nrow = 20)
mat[sample(1:100, 15)] = NA
data.frame(mat) -> df
df$res = apply(df, 1, function(x) sum(unique(na.omit(x)) > 0))

Could you edit your question and clarify why this doesn't solve your problem?
# here is a numeric vector with a bunch of numbers
mtcars$carb
# here is how to limit that vector to only 1-6
mtcars$carb[ mtcars$carb %in% 1:6 ]
# here is how to tabulate that result
table( mtcars$carb[ mtcars$carb %in% 1:6 ] )
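
Putting the %in% filter together with the row-wise counter from the question might look like the sketch below. The trials data frame and its column names are made up for illustration (they are not the real eSOPT_831 columns), and 0, 7 and NA stand in for the unwanted entries.

```r
# Hypothetical responses: one row per participant, one column per trial
trials <- data.frame(
  T8  = c(1, 4, 3),
  T9  = c(2, 2, NA),
  T10 = c(2, 6, 3),
  T11 = c(0, 7, 3)
)

# Keep only values in 1:6 before counting distinct responses per row;
# NA %in% 1:6 is FALSE, so missing values are dropped as well
trials$Span <- apply(trials, 1, function(x) length(unique(x[x %in% 1:6])))
trials
```

With this toy data the three rows give 2, 3 and 1 distinct valid responses.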

calculate each chunk by group using dplyr?

How can I get the expected calculation using dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df=read.table(header=1, text=' row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate for each chunk (row 1-3, 4-5, 6-7) although row 1-3 and 6-7 are labelled as the same group number?

Here is a similar approach. I created a new group variable using cumsum. Whenever the difference between two numbers in group is not 0, R assigns a new group number. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(TRUE, diff(group) != 0))) %>%
group_by(foo) %>%
mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3

As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
df$aux <- rep(seq_along(rle(df$group)$values), times = rle(df$group)$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3

Here is an option using data.table_1.9.5. The devel version introduced new functions rleid and shift (default type is "lag" and fill is "NA") that can be useful for this.
library(data.table)
setDT(df)[, expected:=value-shift(value) ,by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3
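
For completeness, the run-based grouping used in all three answers can also be sketched in base R. The name run_id is an illustrative choice; the data is copied from the question.

```r
df <- data.frame(
  row   = 1:7,
  value = c(2, 4, 5, 6, 11, 12, 15),
  group = c(1, 1, 1, 2, 2, 1, 1)
)

# Start a new run every time group differs from the previous row
run_id <- cumsum(c(TRUE, diff(df$group) != 0))

# Within each run, the first row has no predecessor, so it gets NA
df$expected <- ave(df$value, run_id, FUN = function(v) c(NA, diff(v)))
df
```

Here run_id comes out as 1 1 1 2 2 3 3, so rows 6 and 7 form their own chunk even though they share group 1 with rows 1 to 3.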