R: recode by a splitting rule

I have a student dataset that includes student information, a question id (5 questions), and the sequence number of each visit to a question. I would like to create a variable that marks exactly where a student starts reviewing questions after having seen all of them.
Here is a sample dataset:
data <- data.frame(
  person = c(1,1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  question = c(1,2,2,3,3,3,4,3,5,1,2, 1,1,1,2,3,4,4,4,5,5,4,3,4,4,5,4,5),
  sequence = c(1,1,2,1,2,3,1,4,1,2,3, 1,2,3,1,1,1,2,3,1,2,4,2,5,6,3,7,4))
data
person question sequence
1 1 1 1
2 1 2 1
3 1 2 2
4 1 3 1
5 1 3 2
6 1 3 3
7 1 4 1
8 1 3 4
9 1 5 1
10 1 1 2
11 1 2 3
12 2 1 1
13 2 1 2
14 2 1 3
15 2 2 1
16 2 3 1
17 2 4 1
18 2 4 2
19 2 4 3
20 2 5 1
21 2 5 2
22 2 4 4
23 2 3 2
24 2 4 5
25 2 4 6
26 2 5 3
27 2 4 7
28 2 5 4
The sequence variable numbers each visit to a question. In general, revisits can happen before the student has seen all questions; those should still count as initial. The new attempt variable should only flag review once the student has seen all 5 questions. With the new variable, I am targeting this dataset:
> data
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 initial
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
Any ideas?
Thanks!

What a challenging question. It took almost 2 hours to find a solution.
Try this
library(dplyr)

# dist_cum: rolling count of distinct values seen so far
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))

data %>%
  mutate(var0 = n_distinct(question)) %>%               # total number of questions (5)
  group_by(person) %>%
  mutate(var1 = dist_cum(question),                     # distinct questions seen so far
         var2 = cumsum(c(1, diff(question) != 0))) %>%  # id of each run of consecutive visits
  ungroup() %>%
  mutate(var3 = if_else(sequence == 1 | var1 < var0, 0, 1)) %>%  # 0 = first visit or not all seen yet
  group_by(person, var2) %>%
  mutate(var4 = min(var3)) %>%                          # a whole run is initial if any row in it is
  ungroup() %>%
  mutate(attempt = if_else(var4 == 0, "initial", "review")) %>%
  select(-starts_with("var")) %>%
  as.data.frame
Result
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 initial
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
dist_cum is a helper function that computes a rolling count of distinct values seen so far; var0...var4 are intermediate helper columns that are dropped at the end.
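To see what dist_cum does, here is a quick illustration on a small vector:
dist_cum(c(1, 2, 2, 3, 1))
# [1] 1 2 2 3 3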

One way to do it is by finding where the reviewing starts: the first entry after question 5 has been seen whose sequence is 2 (see v1 and v2 below). Then, by subsetting per person and looping over each subset, you can fill in the remaining entries of the attempt variable, since it is now known where the reviewing starts.
# v1: TRUE where the previous row was question 5 (question 5 has just been seen)
v1 <- c(FALSE, (data$question == 5)[-nrow(data)])
# v2: TRUE where this row is the second visit to a question
v2 <- data$sequence == 2
# Mark the presumed start of reviewing; all other rows stay NA for now
data$attempt <- ifelse(v1 & v2, "review", NA)
persons <- unique(data$person)
persons.list <- vector(mode = "list", length = length(persons))
for (i in seq_along(persons)) {
  person.i <- subset(data, person == persons[i])
  n <- which(person.i$attempt == "review")[1]  # first flagged row (assumes one exists)
  m <- nrow(person.i)
  person.i$attempt[(n + 1):m] <- "review"      # everything after it is review too
  person.i$attempt[is.na(person.i$attempt)] <- "initial"
  persons.list[[i]] <- person.i
}
do.call(rbind, persons.list)
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 review
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
Note that this heuristic marks row 21 (person 2's second visit to question 5) as review, which differs from the desired output above, where that row is still initial. Alternatively, you can also use lapply:
do.call(rbind,
        lapply(persons, function(x) {
          person.x <- subset(data, person == x)
          n <- which(person.x$attempt == "review")[1]
          m <- nrow(person.x)
          person.x$attempt[(n + 1):m] <- "review"
          person.x$attempt[is.na(person.x$attempt)] <- "initial"
          person.x
        }))


Sorting out the data with specific headers in R

A small sample of the data is as follows:
df <- read.table(text = " ID Class1a Time1a MD1a MD2a Class1b Time1b MD1b MD2b Class2a Time2a MD3a MD4a Class2b Time2b MD3b MD4b Class3a Time3a MD5a MD6a Class3b Time3b MD5b MD6b
1 1 1 1 2 2 1 1 2 9 2 2 2 10 2 1 1 17 3 2 2 18 3 1 1
2 3 1 1 1 4 1 2 1 11 2 2 1 12 2 1 1 19 3 2 1 20 3 1 1
3 5 1 2 1 6 1 2 2 13 2 1 1 14 2 2 2 21 3 1 1 22 3 2 2
4 7 1 1 1 8 1 2 2 15 2 1 1 16 2 1 1 23 3 1 1 24 3 1 1
", header=TRUE)
I want to get the following output; the headers in particular matter:
ID Class Time MD MD1 MD2
1 1 1 1-2 1 2
2 3 1 1-2 1 1
3 5 1 1-2 2 1
4 7 1 1-2 1 1
1 2 1 1-2 1 2
2 4 1 1-2 2 2
3 6 1 1-2 2 2
4 8 1 1-2 2 2
1 9 2 3-4 2 2
2 11 2 3-4 2 1
3 13 2 3-4 1 1
4 15 2 3-4 1 1
1 10 2 3-4 2 1
2 12 2 3-4 2 1
3 14 2 3-4 2 2
4 16 2 3-4 2 1
1 17 3 5-6 2 2
2 19 3 5-6 2 2
3 21 3 5-6 1 2
4 23 3 5-6 1 2
1 18 3 5-6 1 1
2 20 3 5-6 1 1
3 22 3 5-6 2 2
4 24 3 5-6 1 1
library(dplyr)
library(tidyr)
df1 <- df %>%
  pivot_longer(
    cols = starts_with("Time"),
    names_to = "Q",
    values_to = "Score",
    values_drop_na = TRUE)
df2 <- df1 %>%
  pivot_longer(
    cols = starts_with("Class"),
    names_prefix = "MD",
    values_drop_na = TRUE) %>%
  dplyr::select(-value)
But I have failed to get the output of interest.
This answer started as a pivot_longer example using names_pattern, but while renaming some of the columns that way made sense, it is less intuitive to extract the MD range column (e.g., 1-2, 3-4) during the pivoting itself.
Instead, let's split the frame by column-group, rename the columns as you'd like, then bind_rows them.
bind_rows(
  # Split the columns (minus ID) into groups of four, one group per Class* column
  lapply(split.default(df[, -1], cumsum(grepl("Class", names(df)[-1]))),
         function(Z) {
           out <- transform(Z,
             ID = df$ID,
             # e.g. "1-2" built from the digits of the MD* column names in this group
             MD = paste(gsub("\\D", "", grep("^MD", names(Z), value = TRUE)), collapse = "-"))
           names(out)[1:4] <- c("Class", "Time", "MD1", "MD3")
           out
         })
)
# Class Time MD1 MD3 ID MD
# 1 1 1 1 2 1 1-2
# 2 3 1 1 1 2 1-2
# 3 5 1 2 1 3 1-2
# 4 7 1 1 1 4 1-2
# 5 2 1 1 2 1 1-2
# 6 4 1 2 1 2 1-2
# 7 6 1 2 2 3 1-2
# 8 8 1 2 2 4 1-2
# 9 9 2 2 2 1 3-4
# 10 11 2 2 1 2 3-4
# 11 13 2 1 1 3 3-4
# 12 15 2 1 1 4 3-4
# 13 10 2 1 1 1 3-4
# 14 12 2 1 1 2 3-4
# 15 14 2 2 2 3 3-4
# 16 16 2 1 1 4 3-4
# 17 17 3 2 2 1 5-6
# 18 19 3 2 1 2 5-6
# 19 21 3 1 1 3 5-6
# 20 23 3 1 1 4 5-6
# 21 18 3 1 1 1 5-6
# 22 20 3 1 1 2 5-6
# 23 22 3 2 2 3 5-6
# 24 24 3 1 1 4 5-6
This relies on:
ID being the first column (ergo df[,-1] and names(df)[-1]), and
Each group of columns starting with a Class* column.
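To see how the column grouping works: split.default receives one group index per column, and the index increments at every Class* column, so each group is a Class/Time/MD/MD quadruple:
cumsum(grepl("Class", names(df)[-1]))
# [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6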

Add new variables to a financial dataset in R

I have a question about a financial transaction dataset.
The data set look as the following:
Account_from Account_to Value Timestamp
1 1 2 10 1
2 1 3 15 1
3 3 4 20 1
4 2 1 10 2
5 1 3 25 2
6 2 1 15 3
7 1 3 10 3
8 1 4 20 4
I would like to create a couple of extra variables based on the Account_from column:
Tot_val_out: total value going out of Account_from in the last two timestamps,
Tot_val_inc: total value coming into Account_from in the last two timestamps,
Tot_val_out_1time: total value of all transactions done during that timestamp,
val_out/prev_val_out: value out of Account_from divided by its previous value out.
So that it will look like this:
Acc_from Acc_to Value Timestamp Tot_val_out Tot_val_inc Tot_val_out_1time val_out/prev_val_out
1 1 2 10 1 10 0 45 0
2 1 3 15 1 25 0 45 1.5
3 3 4 20 1 20 15 45 0
4 2 1 10 2 10 10 35 0
5 1 3 25 2 50 10 35 1.67
6 2 1 15 3 25 0 25 1.5
7 1 3 10 3 35 25 25 0.4
8 1 4 20 4 30 15 20 2
For example row 5 tot_val_out is 50, this means that account 1 transferred the amount of 50 in the last two timestamps (timestamps 1 and 2). At row 8 account 1 transferred 30 in the last two timestamps (timestamps 3 and 4).
The same should be done for incoming value.
Additionally I would like to create the variables:
(number of transactions done by account from in the previous 4 timestamps)
(number of transactions done by account from in the previous 2 timestamps)
So that:
Account_from Account_to Value Timestamp Transactions_previous4 Transactions_previous2
1 1 2 10 1 1 1
2 1 3 15 1 2 2
3 3 4 20 1 1 1
4 2 1 10 2 1 1
5 1 3 25 2 3 3
6 2 1 15 3 2 2
7 1 3 10 3 4 2
8 1 4 20 4 5 2
At row 8 account 1 has made 5 transactions the last 4 timestamps (timestamps 1 till 4), but in the last 2 timestamps (timestamps 3 and 4) only 2 transactions.
I cannot figure out how to do this. It would be extremely helpful if someone knew how.
Thanks in advance,
Here is a base R solution:
# Pull the columns out once, before looping
nr <- seq(nrow(df))
Account_from <- df$Account_from
Account_to <- df$Account_to
Value <- df$Value
Timestamp <- df$Timestamp
for (k in nr) {
  # Restrict to rows up to k whose timestamp is the current or previous one
  df$Tot_val_out[k] <- sum(Value[Account_from == Account_from[k] &
                                 Timestamp %in% (Timestamp[k] - (1:0)) & nr <= k])
  df$Tot_val_in[k] <- sum(Value[Account_to == Account_from[k] &
                                Timestamp %in% (Timestamp[k] - (1:0)) & nr <= k])
  # Transaction counts over the previous 4 and 2 timestamps, respectively
  df$Transactions_previous4[k] <- sum(Account_from == Account_from[k] &
                                      Timestamp %in% (Timestamp[k] - (3:0)) & nr <= k)
  df$Transactions_previous2[k] <- sum(Account_from == Account_from[k] &
                                      Timestamp %in% (Timestamp[k] - (1:0)) & nr <= k)
}
dfout <- cbind(df,
               with(df,
                    data.frame(
                      # Total value moved during each timestamp
                      Tot_val_out_1time = ave(Value, Timestamp, FUN = sum),
                      # Ratio of each outgoing value to the account's previous one (0 for the first)
                      val_out_prev_val_out = ave(Value, Account_from,
                                                 FUN = function(x) c(0, x[-1] / x[-length(x)])))))
such that
> dfout
Account_from Account_to Value Timestamp Tot_val_out Tot_val_in Transactions_previous4 Transactions_previous2 Tot_val_out_1time val_out_prev_val_out
1 1 2 10 1 10 0 1 1 45 0.000000
2 1 3 15 1 25 0 2 2 45 1.500000
3 3 4 20 1 20 15 1 1 45 0.000000
4 2 1 10 2 10 10 1 1 35 0.000000
5 1 3 25 2 50 10 3 3 35 1.666667
6 2 1 15 3 25 0 2 2 25 1.500000
7 1 3 10 3 35 25 4 2 25 0.400000
8 1 4 20 4 30 15 5 2 20 2.000000
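As a quick sanity check of the two-timestamp window: for row 5 (account 1 at timestamp 2), the outgoing total should sum rows 1, 2 and 5:
with(df[1:5, ], sum(Value[Account_from == 1 & Timestamp %in% 1:2]))
# [1] 50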

dplyr: comparing values within a variable dependent on another variable

How can I compare values within a variable dependent on another variable with dplyr?
The df is based on choice data (long format) from a survey. It has one variable that indicates a participant's id, another that indicates the choice instance, and one that indicates which alternative was chosen.
In my data I have the feeling that a lot of people get bored of the task and therefore stick to one alternative for every remaining instance. I would therefore like to identify people who always selected the same option from a certain instance onwards until the end.
Here is an example df:
library(dplyr)
set.seed(0)
df <- tibble(
  id = rep(1:5, each = 12),
  inst = rep(1:12, 5),
  alt = sample(1:3, size = 60, replace = TRUE)
)
That looks like the following:
id inst alt
1 1 1 3
2 1 2 1
3 1 3 2
4 1 4 2
5 1 5 3
6 1 6 1
7 1 7 3
8 1 8 3
9 1 9 2
10 1 10 2
11 1 11 1 <-
12 1 12 1 <-
13 2 1 1
14 2 2 3
...
I would like to create two new variables, count and count_alt. count should give, per id, the length of the final run of identical values in alt, i.e. counting from the last instance backwards. So for participant id == 1 the count variable should be 2, since alternative 1 was chosen in the last two instances (11 & 12). count_alt should hold the alternative chosen in the final instance (here 1, always the value at inst == 12).
The new df should look like the following
id inst alt count count_alt
1 1 1 3 2 1
2 1 2 1 2 1
3 1 3 2 2 1
4 1 4 2 2 1
5 1 5 3 2 1
6 1 6 1 2 1
7 1 7 3 2 1
8 1 8 3 2 1
9 1 9 2 2 1
10 1 10 2 2 1
11 1 11 1 2 1
12 1 12 1 2 1
...
I would prefer to solve this with dplyr and not with a loop, since I want to incorporate it into further data wrangling steps.
See if that solves it:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(
    # run id: increments whenever alt changes (an integer default, so row 1 starts a run)
    count = cumsum(alt != lag(alt, default = 0L)),
    # length of the last run within each id
    count = sum(count == max(count)),
    # alternative chosen in the final instance
    count_alt = alt[n()]
  )
Output:
id inst alt count count_alt
1 1 1 3 2 1
2 1 2 1 2 1
3 1 3 2 2 1
4 1 4 2 2 1
5 1 5 3 2 1
6 1 6 1 2 1
7 1 7 3 2 1
8 1 8 3 2 1
9 1 9 2 2 1
10 1 10 2 2 1
11 1 11 1 2 1
12 1 12 1 2 1
13 2 1 1 1 2
14 2 2 3 1 2
15 2 3 2 1 2
16 2 4 3 1 2
17 2 5 2 1 2
18 2 6 3 1 2
19 2 7 3 1 2
20 2 8 2 1 2
21 2 9 3 1 2
22 2 10 3 1 2
23 2 11 1 1 2
24 2 12 2 1 2
25 3 1 1 1 3
26 3 2 1 1 3
27 3 3 2 1 3
28 3 4 1 1 3
29 3 5 2 1 3
30 3 6 3 1 3
31 3 7 2 1 3
32 3 8 2 1 3
33 3 9 2 1 3
34 3 10 2 1 3
35 3 11 1 1 3
36 3 12 3 1 3
37 4 1 3 1 1
38 4 2 3 1 1
39 4 3 1 1 1
40 4 4 3 1 1
41 4 5 2 1 1
42 4 6 3 1 1
43 4 7 2 1 1
44 4 8 3 1 1
45 4 9 2 1 1
46 4 10 2 1 1
47 4 11 3 1 1
48 4 12 1 1 1
49 5 1 2 2 2
50 5 2 3 2 2
51 5 3 3 2 2
52 5 4 2 2 2
53 5 5 3 2 2
54 5 6 2 2 2
55 5 7 1 2 2
56 5 8 1 2 2
57 5 9 1 2 2
58 5 10 1 2 2
59 5 11 2 2 2
60 5 12 2 2 2
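If the goal is then to flag the "bored" participants, here is a minimal follow-up sketch; the threshold of 6 instances is an arbitrary illustrative choice:
library(dplyr)
suspects <- df %>%
  group_by(id) %>%
  summarise(final_run = {
    runs <- cumsum(alt != lag(alt, default = 0L))  # run ids, as above
    sum(runs == max(runs))                         # length of the final run
  }) %>%
  filter(final_run >= 6)  # keep ids whose last run spans 6+ instances
suspects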

R: Return values in a column when the value in another column becomes negative for the first time

For each ID, I want to return the value in the 'distance' column where the value becomes negative for the first time. If the value does not become negative at all, return the value 99 (or some other random number) for that ID. A sample data frame is given below.
df <- data.frame(ID = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4)),
                 distance = rep(1:4, 5),
                 value = c(1, 4, 3, -1, 2, 1, -4, 1, 3, 2, -1, 1, -4, 3, 2, 1, 2, 3, 4, 5))
> df
ID distance value
1 1 1 1
2 1 2 4
3 1 3 3
4 1 4 -1
5 2 1 2
6 2 2 1
7 2 3 -4
8 2 4 1
9 3 1 3
10 3 2 2
11 3 3 -1
12 3 4 1
13 4 1 -4
14 4 2 3
15 4 3 2
16 4 4 1
17 5 1 2
18 5 2 3
19 5 3 4
20 5 4 5
The desired output is as follows
> df2
ID first_negative_distance
1 1 4
2 2 3
3 3 3
4 4 1
5 5 99
I tried but couldn't figure out how to do it through dplyr. Any help would be much appreciated. The actual data I'm working on has thousands of ID's with 30 different distance levels for each. Bear in mind that for any ID, there could be multiple instances of negative values. I just need the first one.
Edit:
Tried the solution proposed by AntonoisK.
> df %>% group_by(ID) %>% summarise(first_neg_dist = first(distance[value < 0]))
first_neg_dist
1 4
This is the result I am getting. It does not match what AntonoisK got, and I am not sure why.
library(dplyr)
df %>%
group_by(ID) %>%
summarise(first_neg_dist = first(distance[value < 0]))
# # A tibble: 5 x 2
# ID first_neg_dist
# <dbl> <int>
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 1
# 5 5 NA
If you really prefer 99 instead of NA you can use
summarise(first_neg_dist = coalesce(first(distance[value < 0]), 99L))
instead. (As for the single-row result in your edit: that usually happens when plyr is attached after dplyr, so plyr::summarise masks dplyr::summarise; calling dplyr::summarise() explicitly avoids it.)
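Putting it together, a sketch that reproduces the desired df2 exactly (the column name is taken from the question):
library(dplyr)
df2 <- df %>%
  group_by(ID) %>%
  summarise(first_negative_distance = coalesce(first(distance[value < 0]), 99L))
df2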

Assigning test / control group vector using split-apply-combine strategy [duplicate]

This question already has answers here:
Stratified random sampling from data frame (6 answers)
Closed 6 years ago.
This should be simple, but it's got me pulling my hair out!
Here is some data:
Clicks <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2)
Cost <- c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
Cluster <- c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1)
df <- data.frame(Clicks, Cost, Cluster)
I want to filter my df by cluster, assign a new vector that allocates "test" and "control" groups at random, then recombine with the original data frame.
Step 1: Filter (by cluster 1)
Clicks Cost Cluster
1 1 10 1
2 2 11 1
3 3 12 1
4 6 15 1
5 5 14 1
6 4 13 1
7 3 12 1
8 2 11 1
Step 2: Assign test and control group at random
Clicks Cost Cluster group
1 1 10 1 Test
2 2 11 1 Control
3 3 12 1 Control
4 6 15 1 Test
5 5 14 1 Control
6 4 13 1 Control
7 3 12 1 Test
8 2 11 1 Control
Step 3: Get back to the original data frame
Clicks Cost Cluster group
1 1 10 1 Test
2 2 11 1 Control
3 3 12 1 Control
4 4 13 2 NULL
5 5 14 2 NULL
6 6 15 1 Test
7 5 14 1 Control
8 4 13 1 Control
9 3 12 1 Test
10 2 11 1 Control
Step 4: do the same for cluster 2
Thanks :)
How about
df$Group <- 'NULL'
df1 <- df
# Independent coin flip per row of cluster 1
df1$Group[df1$Cluster == 1] <- ifelse(runif(sum(df1$Cluster == 1)) > 0.5, 'Control', 'Test')
df1
Clicks Cost Cluster Group
1 1 10 1 Test
2 2 11 1 Test
3 3 12 1 Test
4 4 13 2 NULL
5 5 14 2 NULL
6 6 15 1 Control
7 5 14 1 Test
8 4 13 1 Test
9 3 12 1 Control
10 2 11 1 Control
df2 <- df
# Same idea for cluster 2
df2$Group[df2$Cluster == 2] <- ifelse(runif(sum(df2$Cluster == 2)) > 0.5, 'Control', 'Test')
df2
Clicks Cost Cluster Group
1 1 10 1 NULL
2 2 11 1 NULL
3 3 12 1 NULL
4 4 13 2 Test
5 5 14 2 Control
6 6 15 1 NULL
7 5 14 1 NULL
8 4 13 1 NULL
9 3 12 1 NULL
10 2 11 1 NULL
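Note that runif(n) > 0.5 flips an independent coin per row, so the Test/Control counts are only balanced on average. If an exact 50/50 split within each cluster is wanted, here is a sketch that also handles all clusters in one pass:
df$Group <- NA_character_
for (cl in unique(df$Cluster)) {
  idx <- df$Cluster == cl
  # Shuffle a balanced Test/Control vector of the right length
  df$Group[idx] <- sample(rep(c("Test", "Control"), length.out = sum(idx)))
}
df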
