I have a dataframe with 2 columns: time and day. There are 3 days, and for each day, time runs from 1 to 12. I want to add new rows for each day with times: -2, 1 and 0. How do I do this?
I have tried using add_row and specifying the row number to add at, but this changes each time a new row is added, making the process tedious. Thanks in advance
[picture of the dataframe]
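Since the picture is not reproducible here, assume the data look something like this (object and column names are guesses based on the answers below):
# Hypothetical reconstruction of the pictured data:
# 3 days, each with time running from 1 to 12.
df <- data.frame(time = rep(1:12, times = 3),
                 Day  = rep(1:3, each = 12))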
We could use add_row, then slice the desired sequence, and bind all to a dataframe:
library(tibble)
library(dplyr)
df1 <- df %>%
  add_row(time = -2:0, Day = c(1, 1, 1), .before = 1) %>%
  slice(1:15)
df2 <- bind_rows(df1, df1, df1) %>%
  mutate(Day = rep(row_number(), each = 15, length.out = n()))
Output:
# A tibble: 45 x 2
time Day
<dbl> <int>
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
Here's a fast way to create the desired dataframe from scratch using expand.grid(), rather than adding individual rows:
df <- expand.grid(-2:12,1:3)
colnames(df) <- c("time","day")
Results:
df
time day
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
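As a small aside (not part of the original answer), expand.grid() also accepts named arguments, which avoids the separate colnames() call:
# Equivalent one-liner with named columns:
df <- expand.grid(time = -2:12, day = 1:3)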
You can use tidyr::crossing
library(dplyr)
library(tidyr)
add_values <- c(-2, 1, 0)
crossing(time = add_values, Day = unique(day$Day)) %>%
  bind_rows(day) %>%
  arrange(Day, time)
# A tibble: 45 x 2
# time Day
# <dbl> <int>
# 1 -2 1
# 2 0 1
# 3 1 1
# 4 1 1
# 5 2 1
# 6 3 1
# 7 4 1
# 8 5 1
# 9 6 1
#10 7 1
# … with 35 more rows
If you meant -2, -1 and 0, you can also use complete:
tidyr::complete(day, Day, time = -2:0)
I have a student dataset including student information, a question id (5 questions), and the sequence of each attempt at answering the questions. I would like to create a variable that identifies where exactly a student starts reviewing questions after finishing all of them.
Here is a sample dataset:
data <- data.frame(
person = c(1,1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
question = c(1,2,2,3,3,3,4,3,5,1,2, 1,1,1,2,3,4,4,4,5,5,4,3,4,4,5,4,5),
sequence = c(1,1,2,1,2,3,1,4,1,2,3, 1,2,3,1,1,1,2,3,1,2,4,2,5,6,3,7,4))
data
person question sequence
1 1 1 1
2 1 2 1
3 1 2 2
4 1 3 1
5 1 3 2
6 1 3 3
7 1 4 1
8 1 3 4
9 1 5 1
10 1 1 2
11 1 2 3
12 2 1 1
13 2 1 2
14 2 1 3
15 2 2 1
16 2 3 1
17 2 4 1
18 2 4 2
19 2 4 3
20 2 5 1
21 2 5 2
22 2 4 4
23 2 3 2
24 2 4 5
25 2 4 6
26 2 5 3
27 2 4 7
28 2 5 4
The sequence variable records each visit to a question by giving it a sequence number. Revisits can generally occur before a student has seen all questions; however, the attempt variable should only switch to review after the student has seen all 5 questions. With the new variable, I am aiming for this dataset:
> data
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 initial
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
Any ideas?
Thanks!
What a challenging question. It took almost 2 hours to find a solution. Try this:
library(dplyr)
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))

data %>%
  mutate(var0 = n_distinct(question)) %>%
  group_by(person) %>%
  mutate(var1 = dist_cum(question),
         var2 = cumsum(c(1, diff(question) != 0))) %>%
  ungroup() %>%
  mutate(var3 = if_else(sequence == 1 | var1 < var0, 0, 1)) %>%
  group_by(person, var2) %>%
  mutate(var4 = min(var3)) %>%
  ungroup() %>%
  mutate(attempt = if_else(var4 == 0, "initial", "review")) %>%
  select(-starts_with("var")) %>%
  as.data.frame
Result
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 initial
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
dist_cum is a helper function that computes a rolling count of distinct values; var0 ... var4 are intermediate helper columns.
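As an aside (not from the original answer), the same rolling distinct count can be computed without sapply, because the count increases exactly when a value has not been seen before:
# Equivalent rolling-distinct helper (hypothetical alternative to dist_cum):
dist_cum2 <- function(var) cumsum(!duplicated(var))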
One way to do it is by finding where the reviewing starts, i.e. the first entry after the fifth question has been seen whose sequence is 2 (see v1 and v2 below). Then, by subsetting the data for every individual person and looping over each subset, you can fill in the missing entries of the attempt variable, since it is now known where the reviewing starts.
# Flag the row right after the fifth question has been seen (v1)
# where the sequence is 2 (v2): that is where reviewing starts.
v1 <- c(FALSE, (data$question == 5)[-(nrow(data))])
v2 <- data$sequence == 2
data$attempt <- ifelse(v1 * v2 == 1, "review", NA)

persons <- unique(data$person)
persons.list <- vector(mode = "list", length = length(persons))

for (i in 1:length(persons)) {
  person.i <- subset(data, person == persons[i])
  n <- which(person.i$attempt == "review")
  m <- nrow(person.i)
  person.i$attempt[(n + 1):m] <- "review"  # everything after the start is review
  person.i$attempt[which(is.na(person.i$attempt))] <- "initial"
  persons.list[[i]] <- person.i
}

do.call(rbind, persons.list)
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 review
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
Alternatively, you can also use lapply:
do.call(rbind,
        lapply(persons, function(x) {
          person.x <- subset(data, person == x)
          n <- which(person.x$attempt == "review")
          m <- nrow(person.x)
          person.x$attempt[(n + 1):m] <- "review"
          person.x$attempt[which(is.na(person.x$attempt))] <- "initial"
          person.x
        }))
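For what it's worth, the same fill-down logic can also be written without an explicit loop, reusing the v1 and v2 vectors from above: within each person, everything from the first flagged row onward is a review. This is a sketch of the same idea, not part of the original answer:
# Within each person, cumulative count of flagged rows; > 0 means the
# review phase has started.
data$attempt <- ifelse(ave(v1 & v2, data$person, FUN = cumsum) > 0,
                       "review", "initial")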
I have a tidy data.frame of experimental data with subject IDs (ID) who were measured on three trials (Trial) at a varying(!) number of time points (Session) in two different conditions (Direction) on a dependent continuous variable, say LC:
set.seed(5)
nSubjects <- 4
nDirections <- 2
nTrials <- 3
# Between 1 and 3 sessions per subject:
nSessions <- round(runif(nSubjects,
                         min = 1, max = 3))
mydat <- data.frame(ID = do.call(rep, args = list(1:nSubjects,
                                                  times = nSessions * nDirections * nTrials)),
                    Session = rep(sequence(nSessions),
                                  each = nDirections * nTrials),
                    Trial = rep(rep(1:nTrials,
                                    each = nDirections),
                                times = sum(nSessions)),
                    Direction = rep(c("up", "down"),
                                    times = nTrials * sum(nSessions)),
                    LC = 1:(nDirections * nTrials * sum(nSessions)))
What I would like to calculate is a vector of length nrow(mydat) that contains, for a given subject, trial and direction, the difference in LC between the current session and the first session. In other words, from each (absolute) LC score of any ID, session, trial and direction, the (absolute) LC from session == 1 of the same ID, trial and direction is subtracted, like this (for the sake of simplicity I chose LC to be monotonically increasing):
# ID Session Trial Direction LC LC_diff
# 7 2 1 1 up 7 0
# 8 2 1 2 down 8 0
# 9 2 1 3 up 9 0
# 10 2 1 1 down 10 0
# 11 2 1 2 up 11 0
# 12 2 1 3 down 12 0
# 13 2 2 1 up 13 6
# 14 2 2 2 down 14 6
# 15 2 2 3 up 15 6
# 16 2 2 1 down 16 6
# 17 2 2 2 up 17 6
# 18 2 2 3 down 18 6
I thought the following code would yield the desired result:
library(dplyr)
ordered <- group_by(mydat, ID, Session, Trial, Direction)
mydat$LC_diff <- summarise(ordered,
                           Diff = sum(abs(LC[Trial != 1]),
                                      - abs(LC[Trial == 1])))$Diff
But, alas:
mydat[7:18, ]
# ID Session Trial Direction LC LC_diff
# 7 2 1 1 up 7 -8
# 8 2 1 2 down 8 -7
# 9 2 1 3 up 9 10
# 10 2 1 1 down 10 9
# 11 2 1 2 up 11 12
# 12 2 1 3 down 12 11
# 13 2 2 1 up 13 -14
# 14 2 2 2 down 14 -13
# 15 2 2 3 up 15 16
# 16 2 2 1 down 16 15
# 17 2 2 2 up 17 18
# 18 2 2 3 down 18 17
I am at a complete loss here and would appreciate any pointers to where my code is wrong.
I'm not sure whether this is what you meant, but with data.table it would be like this:
library(data.table)
setDT(mydat)[, new := abs(LC) - abs(LC[1]), by = .(ID, Trial, Direction)]
mydat[ID==2,]
ID Session Trial Direction LC new
1: 2 1 1 up 7 0
2: 2 1 1 down 8 0
3: 2 1 2 up 9 0
4: 2 1 2 down 10 0
5: 2 1 3 up 11 0
6: 2 1 3 down 12 0
7: 2 2 1 up 13 6
8: 2 2 1 down 14 6
9: 2 2 2 up 15 6
10: 2 2 2 down 16 6
11: 2 2 3 up 17 6
12: 2 2 3 down 18 6
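Since the question was phrased with dplyr, here is a rough dplyr equivalent of the data.table line above (a sketch, assuming rows are ordered by Session within each ID/Trial/Direction group, as in the generated mydat):
library(dplyr)

mydat %>%
  group_by(ID, Trial, Direction) %>%
  mutate(LC_diff = abs(LC) - abs(first(LC))) %>%
  ungroup()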
New to R and computer science, so help is appreciated.
I'm trying to figure out how to create a new column (y) in a data frame whose value corresponds to the value of x: every block of 10 in x should be given a new value of y. It's quite difficult to explain, and I don't know whether to start with a loop or an if statement. For example, the current data set:
event_id x
1 0
2 2
3 5
4 11
5 12
6 17
7 25
8 28
9 30
10 34
but I want it to look like this
event_id x y
1 0 1
2 2 1
3 5 1
4 11 2
5 12 2
6 17 2
7 25 3
8 28 3
9 30 3
10 34 4
Hope this makes sense: the first 3 values are all < 10, so they are given a y of 1; this repeats as the next 3 values are between 10 and 20, so their y is 2, and so on.
df$y <- with(
df,
findInterval(x, seq(0, max(x) + 10, by = 10))
)
df
event_id x y
1 1 0 1
2 2 2 1
3 3 5 1
4 4 11 2
5 5 12 2
6 6 17 2
7 7 25 3
8 8 28 3
9 9 30 4
10 10 34 4
This assumes 30 should be mapped to 4, just as 0 is mapped to 1.
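For anyone new to findInterval(): it returns, for each value, the index of the last breakpoint that is less than or equal to it. A quick illustration with the breakpoints the answer builds for this data:
breaks <- seq(0, 44, by = 10)  # 0 10 20 30 40
findInterval(c(0, 9, 11, 30, 34), breaks)
# [1] 1 1 2 4 4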
You can use
df$y = df$x %/% 10 + 1
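For illustration (not part of the original answer): %/% is integer division, so each block of ten values of x gets the next integer. Note that, like the findInterval answer, this maps 30 to 4:
c(0, 9, 10, 19, 30, 34) %/% 10 + 1
# [1] 1 1 2 2 4 4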
The following matches the OP's requested output exactly (30 maps to 3):
df$y <- floor(df$x / 10.1) + 1
# event_id x y
#1 1 0 1
#2 2 2 1
#3 3 5 1
#4 4 11 2
#5 5 12 2
#6 6 17 2
#7 7 25 3
#8 8 28 3
#9 9 30 3
#10 10 34 4
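If the intent is really to keep 30 in bin 3 while still mapping 0 to 1 (i.e. intervals (0, 10], (10, 20], ...), a ceiling-based variant also reproduces the OP's table; this is just an illustrative alternative, not one of the original answers:
# Bins (0, 10], (10, 20], ..., with 0 forced into bin 1:
df$y <- pmax(ceiling(df$x / 10), 1)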
The dataset looks like this:
ID week
1 2
1 3
1 4
1 5
1 6
1 7
2 10
2 11
2 12
2 13
2 14
3 13
3 14
3 15
3 16
3 17
3 18
3 19
3 20
3 21
3 22
Each ID has a different start week. I want to randomly select only 3 consecutive weeks for each ID in R. The output will look like this:
ID week
1 4
1 5
1 6
2 14
2 15
2 16
3 20
3 21
3 22
Is there any fast way to achieve it? Thanks
Here is one approach with data.table
library(data.table)
setDT(df1)[df1[, if (.N < 4) .I[1:.N] else {
  # pick a random start row, then take it and the next two rows
  i1 <- .I[sample(.N - 2, 1)]
  i1:(i1 + 2)
}, by = ID]$V1]
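For comparison, a dplyr sketch of the same idea (assuming df1 is already ordered by week within each ID; an illustrative alternative, not part of the original answer):
library(dplyr)

df1 %>%
  group_by(ID) %>%
  slice(if (n() < 3) seq_len(n()) else {
    start <- sample(n() - 2, 1)  # random start of a 3-week window
    start:(start + 2)
  }) %>%
  ungroup()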