Sample part of a dataset while keeping subgroups intact - r

I have a dataframe which I would like to split into two parts: one with 75% and one with 25% of the original.
I thought a good first step would be to create the 25% dataset from the original dataset, by randomly sampling a quarter of the data.
However, the sampling shouldn't be entirely random: I want to preserve the groups of a certain variable.
So with the example below, I want to randomly sample 1/4 of the data frame, but data needs to remain grouped via the 'team' variable. I have 8 teams, so I want to randomly sample 2 teams.
Data example (dput below)
team points assists
1 1 99 33
2 1 90 28
3 1 86 31
4 1 88 39
5 2 95 34
6 2 92 30
7 2 91 32
8 2 79 35
9 3 85 36
10 3 90 29
11 3 91 24
12 3 97 26
13 4 96 28
14 4 94 18
15 4 95 19
16 4 98 25
17 5 78 36
18 5 80 34
19 5 85 39
20 5 89 33
21 6 94 34
22 6 85 39
23 6 99 28
24 6 79 31
25 7 78 35
26 7 99 29
27 7 98 36
28 7 75 39
29 8 97 33
30 8 68 26
31 8 86 38
32 8 76 31
I've tried this using the slice_sample code from dplyr, but this does the exact opposite of what I want (it splits all teams)
testdata <- df %>% group_by(team) %>% slice_sample(n = 2)
My code results in
team points assists
<dbl> <dbl> <dbl>
1 1 90 28
2 1 99 33
3 2 95 34
4 2 92 30
5 3 91 24
6 3 85 36
7 4 95 19
8 4 98 25
9 5 80 34
10 5 78 36
11 6 85 39
12 6 94 34
13 7 78 35
14 7 98 36
15 8 76 31
16 8 86 38
Example of the dataframe:
structure(list(team = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8), points = c(99,
90, 86, 88, 95, 92, 91, 79, 85, 90, 91, 97, 96, 94, 95, 98, 78,
80, 85, 89, 94, 85, 99, 79, 78, 99, 98, 75, 97, 68, 86, 76),
assists = c(33, 28, 31, 39, 34, 30, 32, 35, 36, 29, 24, 26,
28, 18, 19, 25, 36, 34, 39, 33, 34, 39, 28, 31, 35, 29, 36,
39, 33, 26, 38, 31)), class = "data.frame", row.names = c(NA,
-32L))

With dplyr, if you group_by(team) and then sample, that's sampling within each team--the opposite of what you want. Here's a direct approach:
test_teams = sample(unique(dataset$team), size = 2)
test = dataset %>% filter(team %in% test_teams)
train = dataset %>% filter(!team %in% test_teams)
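The same idea can be written in base R with the number of teams derived from the desired fraction instead of being hard-coded. This is only a minimal sketch on the question's example data; the names `test_frac`, `n_test` and the seed value are illustrative assumptions, not part of the original answer.

```r
# Rebuild the example data: 8 teams with 4 rows each
df <- data.frame(
  team    = rep(1:8, each = 4),
  points  = c(99, 90, 86, 88, 95, 92, 91, 79, 85, 90, 91, 97, 96, 94, 95, 98,
              78, 80, 85, 89, 94, 85, 99, 79, 78, 99, 98, 75, 97, 68, 86, 76),
  assists = c(33, 28, 31, 39, 34, 30, 32, 35, 36, 29, 24, 26, 28, 18, 19, 25,
              36, 34, 39, 33, 34, 39, 28, 31, 35, 29, 36, 39, 33, 26, 38, 31)
)

set.seed(1)        # arbitrary seed, only so the split is reproducible
test_frac <- 0.25  # assumed share of teams that goes to the test set

teams      <- unique(df$team)
n_test     <- round(test_frac * length(teams))
test_teams <- sample(teams, size = n_test)  # sample team ids, not rows

test  <- df[df$team %in% test_teams, ]
train <- df[!df$team %in% test_teams, ]

# Each sampled team contributes all of its rows to the test set
table(test$team)
```

Because whole teams are drawn, the row split is exactly 25/75 only when all teams have the same number of rows, as they do here.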

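library(caTools)
# sample.split() preserves label ratios, so calling it on dataset$team would split every team.
# Split the unique team ids instead and then select rows by team.
teams <- unique(dataset$team)
in_train <- sample.split(teams, SplitRatio = 0.75)
training_set <- subset(dataset, team %in% teams[in_train])
test_set <- subset(dataset, team %in% teams[!in_train])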


Calculate a complex difference score with tidyverse?

I have a large dataset of 70 000 rows that I want to perform some operations on, but I can't find an appropriate solution.
bib sta run course finish comment day
1 42 9 1 SG 19.88 99 1
2 42 17 2 A 19.96 11 1
3 42 27 3 B 20.92 22 1
4 42 39 4 A 19.60 11 1
5 42 48 5 SG 20.24 99 1
6 42 61 6 C 22.90 33 1
7 42 76 7 B 20.70 22 1
8 42 86 8 C 22.74 33 1
9 42 93 9 C 22.75 33 1
10 42 103 10 A 19.79 11 1
11 42 114 11 B 20.67 22 1
12 42 120 12 SG 20.10 99 1
I want to end up with a tibble that:
calculates the mean finish time in the SG course for each bib number on one particular day. For example, (19.88 + 20.24 + 20.10) / 3
calculates a difference score for each observation in the dataset by subtracting this mean SG score from finish. For example, 19.88 - mean(SG), 19.96 - mean(SG).
I have tried the following approach:
First group by day, bib and course. Then filter by SG and calculate the mean:
avg.sgtime <- df %>%
group_by(day, bib, course) %>%
filter(course == 'SG') %>%
mutate(avg.sg = mean(finish))
Resulting in the following tibble
bib sta run course finish comment day avg.sg
<int> <int> <int> <chr> <dbl> <int> <chr> <dbl>
1 42 9 1 SG 19.9 99 1 20.1
2 42 48 5 SG 20.2 99 1 20.1
3 42 120 12 SG 20.1 99 1 20.1
4 42 6 1 SG 20.0 99 2 19.9
5 42 42 5 SG 19.8 77 2 19.9
6 42 130 15 SG 19.9 99 2 19.9
7 42 6 1 SG 20.6 99 3 20.5
8 42 68 12 SG 20.6 77 3 20.5
9 42 90 15 SG 20.4 77 3 20.5
Finally I join the two tibbles together using the following syntax:
df %>% full_join(avg.sgtime) %>%
mutate(diff = finish - avg.sg)
However, this doesn't work. It only works for the SG course but not for course A, B and C. Is there a way to fix this or is there a better solution to the problem?
bib sta run course finish comment day avg.sg diff
1 42 9 1 SG 19.88 99 1 20.07333 -0.193333333
2 42 17 2 A 19.96 11 1 NA NA
3 42 27 3 B 20.92 22 1 NA NA
4 42 39 4 A 19.60 11 1 NA NA
5 42 48 5 SG 20.24 99 1 20.07333 0.166666667
You can filter your values for finish within the mutate() and calculate the mean based on those:
df %>%
group_by(day,bib) %>%
mutate(
avg.sg = mean(finish[course=="SG"]),
diff = finish - avg.sg)
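For readers without dplyr, the same "subset inside the group" idea can be sketched in base R with ave(). This is an illustration on six rows taken from the question, not the answerer's code; the helper vector `sg_only` is an assumed name.

```r
df <- data.frame(
  bib    = c(42, 42, 42, 42, 42, 42),
  day    = c(1, 1, 1, 1, 1, 1),
  course = c("SG", "A", "B", "A", "SG", "SG"),
  finish = c(19.88, 19.96, 20.92, 19.60, 20.24, 20.10)
)

# Keep finish times only for SG rows; all other rows become NA
sg_only <- ifelse(df$course == "SG", df$finish, NA)

# Per (bib, day) group, average the SG times and recycle the mean to every row
df$avg.sg <- ave(sg_only, df$bib, df$day,
                 FUN = function(x) mean(x, na.rm = TRUE))
df$diff <- df$finish - df$avg.sg
df
```

A bib/day group without any SG row would get NaN as its mean, so the difference score for that group would be NaN as well.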
Is the following what you are aiming for?
(note that I added a few random values for a second bib just to make sure the join is done properly)
The difference to your attempt is using summarise() instead of mutate() to consolidate the avg.sgtime data frame, and also dropping a few columns so that the join is not populated with NAs. Instead of dropping you can also set the relevant columns to join by passing the by argument to the left_join() function.
library(dplyr)
library(tidyr) # not needed for the join itself; left_join() comes from dplyr
avg.sgtime <- df %>%
group_by(day, bib, course) %>%
filter(course == 'SG') %>%
summarise(avg.sg = mean(finish), .groups = "drop") %>%
select(c(bib, day, avg.sg))
avg.sgtime
#> # A tibble: 3 x 3
#> bib day avg.sg
#> <dbl> <dbl> <dbl>
#> 1 42 1 20.1
#> 2 43 1 19.1
#> 3 44 2 19.3
df %>% left_join(avg.sgtime) %>%
mutate(diff = finish - avg.sg)
#> Joining, by = c("bib", "day")
#> # A tibble: 36 x 9
#> bib sta run course finish comment day avg.sg diff
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 42 9 1 SG 19.9 99 1 20.1 -0.193
#> 2 42 17 2 A 20.0 11 1 20.1 -0.113
#> 3 42 27 3 B 20.9 22 1 20.1 0.847
#> 4 42 39 4 A 19.6 11 1 20.1 -0.473
#> 5 42 48 5 SG 20.2 99 1 20.1 0.167
#> 6 42 61 6 C 22.9 33 1 20.1 2.83
#> 7 42 76 7 B 20.7 22 1 20.1 0.627
#> 8 42 86 8 C 22.7 33 1 20.1 2.67
#> 9 42 93 9 C 22.8 33 1 20.1 2.68
#> 10 42 103 10 A 19.8 11 1 20.1 -0.283
#> # … with 26 more rows
Created on 2021-07-04 by the reprex package (v2.0.0)
data
df <- tribble(~bib, ~sta, ~run, ~course, ~finish, ~comment, ~day,
42, 9, 1, "SG", 19.88, 99, 1,
42, 17, 2, "A", 19.96, 11, 1,
42, 27, 3, "B", 20.92, 22, 1,
42, 39, 4, "A", 19.60, 11, 1,
42, 48, 5, "SG", 20.24, 99, 1,
42, 61, 6, "C", 22.90, 33, 1,
42, 76, 7, "B", 20.70, 22, 1,
42, 86, 8, "C", 22.74, 33, 1,
42, 93, 9, "C", 22.75, 33, 1,
42, 103, 10, "A", 19.79, 11, 1,
42, 114, 11, "B", 20.67, 22, 1,
42, 120, 12, "SG", 20.10, 99, 1,
43, 9, 1, "SG", 19.12, 99, 1,
43, 17, 2, "A", 19.64, 11, 1,
43, 27, 3, "B", 20.62, 22, 1,
43, 39, 4, "A", 19.23, 11, 1,
43, 48, 5, "SG", 20.11, 99, 1,
43, 61, 6, "C", 22.22, 33, 1,
43, 76, 7, "B", 20.33, 22, 1,
43, 86, 8, "C", 22.51, 33, 1,
43, 93, 9, "C", 22.78, 33, 1,
43, 103, 10, "A", 19.98, 11, 1,
43, 114, 11, "B", 20.11, 22, 1,
43, 120, 12, "SG", 18.21, 99, 1,
44, 9, 1, "SG", 19.18, 99, 2,
44, 17, 2, "A", 19.56, 11, 2,
44, 27, 3, "B", 20.62, 22, 2,
44, 39, 4, "A", 19.20, 11, 2,
44, 48, 5, "SG", 20.74, 99, 2,
44, 61, 6, "C", 22.50, 33, 2,
44, 76, 7, "B", 20.60, 22, 2,
44, 86, 8, "C", 22.74, 33, 2,
44, 93, 9, "C", 22.85, 33, 2,
44, 103, 10, "A", 19.59, 11, 2,
44, 114, 11, "B", 20.27, 22, 2,
44, 120, 12, "SG", 18.10, 99, 2,
)
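The `by` option mentioned in this answer could look roughly like the sketch below. It uses a reduced stand-in table (`runs`, an assumed name, with a few values copied from bibs 42 and 43) and groups by bib and day only, so the summary has no extra column that could clash in the join.

```r
library(dplyr)

# Small stand-in for the data above
runs <- tribble(
  ~bib, ~course, ~finish, ~day,
  42, "SG", 19.88, 1,
  42, "A",  19.96, 1,
  42, "SG", 20.24, 1,
  43, "SG", 19.12, 1,
  43, "A",  19.64, 1
)

# One row per bib/day holding the mean SG time
avg_sg <- runs %>%
  filter(course == "SG") %>%
  group_by(bib, day) %>%
  summarise(avg.sg = mean(finish), .groups = "drop")

# Naming the keys makes the join explicit and silences the "Joining, by" message
result <- runs %>%
  left_join(avg_sg, by = c("bib", "day")) %>%
  mutate(diff = finish - avg.sg)
result
```

Stating the keys explicitly also protects against an accidental join on some other column that happens to exist in both tables.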
Thanks @Marcelo Avila for providing me with a very good hint:
I hope this is what you are looking for:
library(dplyr)
df %>%
group_by(bib, day) %>%
mutate(across(finish, ~ mean(.x[course == "SG"]), .names = "avg_{.col}"),
diff = finish - avg_finish,
avg_finish = ifelse(course == "SG", avg_finish, NA))
# A tibble: 12 x 9
# Groups: bib, day [1]
bib sta run course finish comment day avg_finish diff
<int> <int> <int> <chr> <dbl> <int> <int> <dbl> <dbl>
1 42 9 1 SG 19.9 99 1 20.1 -0.193
2 42 17 2 A 20.0 11 1 NA -0.113
3 42 27 3 B 20.9 22 1 NA 0.847
4 42 39 4 A 19.6 11 1 NA -0.473
5 42 48 5 SG 20.2 99 1 20.1 0.167
6 42 61 6 C 22.9 33 1 NA 2.83
7 42 76 7 B 20.7 22 1 NA 0.627
8 42 86 8 C 22.7 33 1 NA 2.67
9 42 93 9 C 22.8 33 1 NA 2.68
10 42 103 10 A 19.8 11 1 NA -0.283
11 42 114 11 B 20.7 22 1 NA 0.597
12 42 120 12 SG 20.1 99 1 20.1 0.0267
I also added another alternative solution with a minor change, using dear @Marcelo Avila's data set:
df %>%
group_by(bib, day) %>%
mutate(across(finish, ~ mean(.x[select(cur_data(), course) == "SG"]), .names = "avg_{.col}"),
diff = finish - avg_finish,
avg_finish = ifelse(course == "SG", avg_finish, NA))
# A tibble: 36 x 9
# Groups: bib, day [3]
bib sta run course finish comment day avg_finish diff
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 42 9 1 SG 19.9 99 1 20.1 -0.193
2 42 17 2 A 20.0 11 1 NA -0.113
3 42 27 3 B 20.9 22 1 NA 0.847
4 42 39 4 A 19.6 11 1 NA -0.473
5 42 48 5 SG 20.2 99 1 20.1 0.167
6 42 61 6 C 22.9 33 1 NA 2.83
7 42 76 7 B 20.7 22 1 NA 0.627
8 42 86 8 C 22.7 33 1 NA 2.67
9 42 93 9 C 22.8 33 1 NA 2.68
10 42 103 10 A 19.8 11 1 NA -0.283
# ... with 26 more rows

R - Identify and remove duplicate rows based on two columns

I have some data that looks like this:
Course_ID Text_ID
33 17
33 17
58 17
5 22
8 22
42 25
42 25
17 26
17 26
35 39
51 39
Not having a background in programming, I'm finding it tricky to articulate my question, but here goes: I only want to keep rows where Course_ID varies but where Text_ID is the same. So for example, the final data would look something like this:
Course_ID Text_ID
5 22
8 22
35 39
51 39
As you can see, Text_ID 22 and 39 are the only ones that have different Course_ID values. I suspect subsetting the data would be the way to go, but as I said, I'm quite a novice at this kind of thing and would really appreciate any advice on how to approach this.
Select those groups where there are no repeats of Course_ID.
In dplyr you can write this as -
library(dplyr)
df %>% group_by(Text_ID) %>% filter(n_distinct(Course_ID) == n()) %>% ungroup
# Course_ID Text_ID
# <int> <int>
#1 5 22
#2 8 22
#3 35 39
#4 51 39
and in data.table -
library(data.table)
setDT(df)[, .SD[uniqueN(Course_ID) == .N], Text_ID]
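To see why the condition "number of distinct Course_ID values equals number of rows" picks the right groups, the counts can be laid out per Text_ID in base R. This is just an illustrative sketch; `n_rows`, `n_unique` and `keep_ids` are assumed names.

```r
x <- data.frame(
  Course_ID = c(33, 33, 58, 5, 8, 42, 42, 17, 17, 35, 51),
  Text_ID   = c(17, 17, 17, 22, 22, 25, 25, 26, 26, 39, 39)
)

# Rows per Text_ID and distinct Course_ID values per Text_ID
n_rows   <- tapply(x$Course_ID, x$Text_ID, length)
n_unique <- tapply(x$Course_ID, x$Text_ID, function(v) length(unique(v)))

# Text_IDs in which every Course_ID occurs exactly once
keep_ids <- names(n_rows)[n_rows == n_unique]
keep_ids

res <- x[x$Text_ID %in% as.numeric(keep_ids), ]
res
```

Note that Text_ID 17 is dropped even though it contains two different Course_ID values, because one of them (33) is repeated; that matches the expected output in the question.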
You can use ave() with anyDuplicated() and keep the groups for which it returns 0 (no duplicated Course_ID).
x[ave(x$Course_ID, x$Text_ID, FUN=anyDuplicated)==0,]
# Course_ID Text_ID
#4 5 22
#5 8 22
#10 35 39
#11 51 39
Data:
x <- read.table(header=TRUE, text="Course_ID Text_ID
33 17
33 17
58 17
5 22
8 22
42 25
42 25
17 26
17 26
35 39
51 39")
Here is my approach with rlist and dplyr:
library(dplyr)
your_data %>%
split(~ Text_ID) %>%
rlist::list.filter(length(unique(Course_ID)) == length(Course_ID)) %>%
bind_rows()
Returns:
# A tibble: 4 x 2
Course_ID Text_ID
<dbl> <dbl>
1 5 22
2 8 22
3 35 39
4 51 39
# Data used:
your_data <- structure(list(Course_ID = c(33, 33, 58, 5, 8, 42, 42, 17, 17, 35, 51), Text_ID = c(17, 17, 17, 22, 22, 25, 25, 26, 26, 39, 39)), row.names = c(NA, -11L), class = c("tbl_df", "tbl", "data.frame"))

How to nest ifelse statements to accommodate three conditions

I have a simple dataframe, my data, with two variables, A and B. Here's a sample of the first 100 rows:
structure(list(A = c(0, 6, 35, 0, 99, 20, 3, 6, 80, 12, 23, 77,
28, 80, 18, 90, 12, 60, 99, 90, 1, 3, 99, 100, 24, 99, 0, 40,
0, 0, 99, 10, 23, 7, 99, 0, 76, 57, 99, 0, 21, 6, 0, 0, 0, 0,
0, 0, 25, 50, 0, 100, 35, 40, 25, 90, 10, 20, 25, 100, 0, 15,
98, 35, 85, 90, 0, 0, 90, 90, 90, 50, 45, 90, 20, 15, 85, 100,
90, 15, 90, 85, 15, 25, 35, 90, 10, 35, 35, 100, 20, 0, 60, 100,
19, 60, 0, 50, 50, 6), B = c(10, 14, 5, 25, 87, 12, 12, 5, 80,
87, 60, 78, 23, 60, 18, 45, 12, 34, 99, 70, 2, 21, 50, 57, 50,
70, 12, 18, 34, 34, 23, 45, 34, 12, 99, 29, 76, 34, 50, 12, 20,
12, 50, 45, 2, 5, 12, 34, 25, 25, 25, 90, 45, 25, 35, 80, 15,
15, 20, 80, 4, 45, 27, 15, 85, 20, 58, 25, 20, 58, 45, 45, 48,
80, 25, 10, 80, 45, 25, 10, 45, 65, 45, 25, 35, 87, 10, 13, 25,
45, 25, 15, 25, 85, 19, 40, 12, 45, 65, 10)), row.names = 52:151, class = "data.frame")
I want to add a new column for variable P, but the calculation for P differs for three conditions. Such that...
If A < B, then P is equal to (B - A)/(B - 1)
If A > B, then P is equal to (A - B)/(100 - B)
If A = B, then P is equal to 0
How do I apply this logic? I have attempted to use a nested ifelse function as follows:
mydata$P <- ifelse(mydata$A < mydata$B, ((mydata$B-mydata$A)/(mydata$B - 1)),
ifelse(mydata$A == mydata$B), 0,
((mydata$A-mydata$B)/(100 - mydata$B)))
But it returns this error:
Error in ifelse(mydata$A < mydata$B, ((mydata$B - mydata$A)/(mydata$B - :
unused arguments (0, ((mydata$A - mydata$B)/(100 - mydata$B)))
Where am I going wrong?
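The inner ifelse() is closed right after its test, so the remaining two values are passed to the outer call as extra arguments. Moving that parenthesis to the end fixes it. Here's a tidier alternative using with():
mydata$P <- with(mydata,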
ifelse(A < B, (B - A)/(B - 1),
ifelse(A > B, (A - B)/(100 - B), 0)))
Here's a solution that uses case_when from dplyr, as I find it quite neat and tidy for structuring these sorts of statements. First, I define the data:
# Define data frame
df <- structure(list(A = c(0, 6, 35, 0, 99, 20, 3, 6, 80, 12, 23, 77,
28, 80, 18, 90, 12, 60, 99, 90, 1, 3, 99, 100, 24, 99, 0, 40,
0, 0, 99, 10, 23, 7, 99, 0, 76, 57, 99, 0, 21, 6, 0, 0, 0, 0,
0, 0, 25, 50, 0, 100, 35, 40, 25, 90, 10, 20, 25, 100, 0, 15,
98, 35, 85, 90, 0, 0, 90, 90, 90, 50, 45, 90, 20, 15, 85, 100,
90, 15, 90, 85, 15, 25, 35, 90, 10, 35, 35, 100, 20, 0, 60, 100,
19, 60, 0, 50, 50, 6),
B = c(10, 14, 5, 25, 87, 12, 12, 5, 80,
87, 60, 78, 23, 60, 18, 45, 12, 34, 99, 70, 2, 21, 50, 57, 50,
70, 12, 18, 34, 34, 23, 45, 34, 12, 99, 29, 76, 34, 50, 12, 20,
12, 50, 45, 2, 5, 12, 34, 25, 25, 25, 90, 45, 25, 35, 80, 15,
15, 20, 80, 4, 45, 27, 15, 85, 20, 58, 25, 20, 58, 45, 45, 48,
80, 25, 10, 80, 45, 25, 10, 45, 65, 45, 25, 35, 87, 10, 13, 25,
45, 25, 15, 25, 85, 19, 40, 12, 45, 65, 10)),
row.names = 52:151, class = "data.frame")
Then, I apply case_when, like so:
# Perform calculation
df$P <- with(df,
dplyr::case_when(
A < B ~ (B - A)/(B - 1),
A > B ~ (A - B)/(100 - B),
A == B ~ 0
))
which gives
df
#> A B P
#> 52 0 10 1.11111111
#> 53 6 14 0.61538462
#> 54 35 5 0.31578947
#> 55 0 25 1.04166667
#> 56 99 87 0.92307692
#> 57 20 12 0.09090909
#> 58 3 12 0.81818182
#> 59 6 5 0.01052632
#> 60 80 80 0.00000000
#> 61 12 87 0.87209302
#> 62 23 60 0.62711864
#> 63 77 78 0.01298701
#> 64 28 23 0.06493506
#> 65 80 60 0.50000000
#> 66 18 18 0.00000000
#> 67 90 45 0.81818182
#> 68 12 12 0.00000000
#> 69 60 34 0.39393939
#> 70 99 99 0.00000000
#> 71 90 70 0.66666667
#> 72 1 2 1.00000000
#> 73 3 21 0.90000000
#> 74 99 50 0.98000000
#> 75 100 57 1.00000000
#> 76 24 50 0.53061224
#> 77 99 70 0.96666667
#> 78 0 12 1.09090909
#> 79 40 18 0.26829268
#> 80 0 34 1.03030303
#> 81 0 34 1.03030303
#> 82 99 23 0.98701299
#> 83 10 45 0.79545455
#> 84 23 34 0.33333333
#> 85 7 12 0.45454545
#> 86 99 99 0.00000000
#> 87 0 29 1.03571429
#> 88 76 76 0.00000000
#> 89 57 34 0.34848485
#> 90 99 50 0.98000000
#> 91 0 12 1.09090909
#> 92 21 20 0.01250000
#> 93 6 12 0.54545455
#> 94 0 50 1.02040816
#> 95 0 45 1.02272727
#> 96 0 2 2.00000000
#> 97 0 5 1.25000000
#> 98 0 12 1.09090909
#> 99 0 34 1.03030303
#> 100 25 25 0.00000000
#> 101 50 25 0.33333333
#> 102 0 25 1.04166667
#> 103 100 90 1.00000000
#> 104 35 45 0.22727273
#> 105 40 25 0.20000000
#> 106 25 35 0.29411765
#> 107 90 80 0.50000000
#> 108 10 15 0.35714286
#> 109 20 15 0.05882353
#> 110 25 20 0.06250000
#> 111 100 80 1.00000000
#> 112 0 4 1.33333333
#> 113 15 45 0.68181818
#> 114 98 27 0.97260274
#> 115 35 15 0.23529412
#> 116 85 85 0.00000000
#> 117 90 20 0.87500000
#> 118 0 58 1.01754386
#> 119 0 25 1.04166667
#> 120 90 20 0.87500000
#> 121 90 58 0.76190476
#> 122 90 45 0.81818182
#> 123 50 45 0.09090909
#> 124 45 48 0.06382979
#> 125 90 80 0.50000000
#> 126 20 25 0.20833333
#> 127 15 10 0.05555556
#> 128 85 80 0.25000000
#> 129 100 45 1.00000000
#> 130 90 25 0.86666667
#> 131 15 10 0.05555556
#> 132 90 45 0.81818182
#> 133 85 65 0.57142857
#> 134 15 45 0.68181818
#> 135 25 25 0.00000000
#> 136 35 35 0.00000000
#> 137 90 87 0.23076923
#> 138 10 10 0.00000000
#> 139 35 13 0.25287356
#> 140 35 25 0.13333333
#> 141 100 45 1.00000000
#> 142 20 25 0.20833333
#> 143 0 15 1.07142857
#> 144 60 25 0.46666667
#> 145 100 85 1.00000000
#> 146 19 19 0.00000000
#> 147 60 40 0.33333333
#> 148 0 12 1.09090909
#> 149 50 45 0.09090909
#> 150 50 65 0.23437500
#> 151 6 10 0.44444444
Created on 2019-08-08 by the reprex package (v0.3.0)
Alternatively, you could avoid using ifelse in the first place:
mydata$P <- with(mydata, abs(B - A) / ((A <= B) * (B - 1) + (A >= B) * (100 - B)))
NB: if A equals B, the numerator is zero and the denominator is 99 independent of the value of B, so there will be no issues trying to divide by zero.
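A quick way to convince yourself that the arithmetic form matches the nested ifelse() is to run both on a few rows that cover all three cases. The sketch below uses five A/B pairs taken from the sample data; `nested` and `arithmetic` are assumed names.

```r
# A few rows covering A < B, A > B and A == B
mydata <- data.frame(A = c(0, 35, 80, 6, 99),
                     B = c(10, 5, 80, 14, 87))

nested <- with(mydata,
               ifelse(A < B, (B - A) / (B - 1),
                      ifelse(A > B, (A - B) / (100 - B), 0)))

arithmetic <- with(mydata,
                   abs(B - A) / ((A <= B) * (B - 1) + (A >= B) * (100 - B)))

all.equal(nested, arithmetic)
```

Both versions still divide by zero when A < B and B is 1, or when A > B and B is 100, so such rows would need separate handling if they can occur.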

R Impute NA's by Linear Increase Depending on Time Interval

PROBLEM
I need to impute the NA's in my data frame, which comes from a repeated measures study. For this particular outcome, each NA should be replaced by the last observed non-NA value, plus 1 for every full 52-week interval that has passed since the week of that last observation.
EXAMPLE
An example data frame with the target imputation goal included.
df <- data.frame(
subject = rep(1:3, each = 12),
week = rep(c(8, 10, 12, 16, 20, 26, 32, 44, 52, 64, 78, 104),3),
value = c(112, 97, 130, 104, NA, NA, NA, NA, NA, NA, NA, NA,
89, 86, 94, 96, 88,107, 110, 102, 107, NA, NA, NA,
107, 110, 102, 130, 104, 88, 82, 79, 92, 106, NA, NA),
goal = c(112, 97, 130, 104, 104, 104, 104, 104, 104, 104, 105, 105,
89, 86, 94, 96, 88,107, 110, 102, 107, 107,107, 108,
107, 110, 102, 130, 104, 88, 82, 79, 92, 106, 106, 106)
)
I left the intermediate columns in to make what's happening more obvious, but you can remove them with a simple select.
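library(dplyr)
df = df %>%
group_by(subject) %>%
mutate(last_obs_week = max(week[!is.na(value)]),
since_last_week = pmax(0, week - last_obs_week),
inc_52 = since_last_week %/% 52,
# na.rm = FALSE keeps the vector length unchanged even if a subject starts with NA
result = zoo::na.locf(value, na.rm = FALSE) + inc_52
)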
all(df$goal == df$result)
# [1] TRUE
print.data.frame(df)
# subject week value goal last_obs_week since_last_week inc_52 result
# 1 1 8 112 112 16 0 0 112
# 2 1 10 97 97 16 0 0 97
# 3 1 12 130 130 16 0 0 130
# 4 1 16 104 104 16 0 0 104
# 5 1 20 NA 104 16 4 0 104
# 6 1 26 NA 104 16 10 0 104
# 7 1 32 NA 104 16 16 0 104
# 8 1 44 NA 104 16 28 0 104
# 9 1 52 NA 104 16 36 0 104
# 10 1 64 NA 104 16 48 0 104
# 11 1 78 NA 105 16 62 1 105
# 12 1 104 NA 105 16 88 1 105
# 13 2 8 89 89 52 0 0 89
# ...
One can use dplyr and tidyr::fill to get the desired result. The idea is to add a column that tracks the week of the last non-NA value, use tidyr::fill to carry both that week and the value forward, and then add 1 to the value for every full 52 weeks between the current week and that last non-NA week.
library(dplyr)
library(tidyr)
df %>% group_by(subject) %>%
mutate(weekWithLastNonNaValue = ifelse(is.na(value), NA, week)) %>%
fill(value, weekWithLastNonNaValue) %>%
mutate(value = value + (week-weekWithLastNonNaValue) %/% 52) %>%
select(-weekWithLastNonNaValue) %>%
as.data.frame()
# subject week value goal
# 1 1 8 112 112
# 2 1 10 97 97
# 3 1 12 130 130
# 4 1 16 104 104
# 5 1 20 104 104
# 6 1 26 104 104
# 7 1 32 104 104
# 8 1 44 104 104
# 9 1 52 104 104
# 10 1 64 104 104
# 11 1 78 105 105
# 12 1 104 105 105
# 13 2 8 89 89
# 14 2 10 86 86
# 15 2 12 94 94
# 16 2 16 96 96
# 17 2 20 88 88
# 18 2 26 107 107
# 19 2 32 110 110
# 20 2 44 102 102
#
# so on
#
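The same logic can also be sketched without any packages. The helper `impute_subject()` below is a hypothetical name introduced for illustration; it carries the last observed value forward and adds 1 per full 52 weeks elapsed since the week of that observation, assuming rows are processed per subject in week order.

```r
df <- data.frame(
  subject = rep(1:3, each = 12),
  week    = rep(c(8, 10, 12, 16, 20, 26, 32, 44, 52, 64, 78, 104), 3),
  value   = c(112, 97, 130, 104, NA, NA, NA, NA, NA, NA, NA, NA,
              89, 86, 94, 96, 88, 107, 110, 102, 107, NA, NA, NA,
              107, 110, 102, 130, 104, 88, 82, 79, 92, 106, NA, NA)
)

# Hypothetical helper: impute one subject's rows
impute_subject <- function(d) {
  d <- d[order(d$week), ]
  idx <- ifelse(is.na(d$value), 0L, seq_len(nrow(d)))
  last_idx <- cummax(idx)        # position of the most recent observed value
  last_idx[last_idx == 0] <- NA  # rows before the first observation stay NA
  d$result <- d$value[last_idx] + (d$week - d$week[last_idx]) %/% 52
  d
}

imputed <- do.call(rbind, lapply(split(df, df$subject), impute_subject))
rownames(imputed) <- NULL
imputed[imputed$subject == 1, ]
```

For observed rows the week difference is 0, so their values are left unchanged.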

Creating a subset in R using a double loop Continuation

DF:
Year species lat long Julian_Day Land
1901 1 49 -79.22 79 16
1901 1 46 -79.5 125 24
1903 2 47 -78.22 165 25
1968 65 47 -79.84 178 30
1978 1 48 -78.11 193 34
2002 82 43.1 -77.114 68 34
2006 3 44.23 -76.33 90 39
2010 1 47.11 -76.2 230 41
There are more variables, but that's an example of the data. I only want to keep, for each year and each species, the row with the lowest Julian_Day. For example, the second row would be dropped here, because 79 is less than 125 for species 1 in 1901.
First of all, I would suggest providing the data in a format that is easy for people to use, so that we can help you better and faster:
df <- structure(list(Year = c(1901, 1901, 1903, 1968, 1978,
2002, 2006, 2010), species = c(1, 1, 2, 65, 1, 82, 3, 1), lat =
c(49, 46, 47, 47, 48, 43.1, 44.23, 47.11), long = c(-79.22,
-79.5, -78.22, -79.84, -78.11, -77.114, -76.33, -76.2),
Julian_Day = c(79, 125, 165, 178, 193, 68, 90, 230), Land =
c(16, 24, 25, 30, 34, 34, 39, 41)), .Names =
c("Year", "species", "lat", "long", "Julian_Day", "Land"),
row.names = c(NA, -8L), class = "data.frame")
Here is your data.frame
df
# Year species lat long Julian_Day Land
#1: 1901 1 49.00 -79.220 79 16
#2: 1901 1 46.00 -79.500 125 24
#3: 1903 2 47.00 -78.220 165 25
#4: 1968 65 47.00 -79.840 178 30
#5: 1978 1 48.00 -78.110 193 34
#6: 2002 82 43.10 -77.114 68 34
#7: 2006 3 44.23 -76.330 90 39
#8: 2010 1 47.11 -76.200 230 41
Generally, you just have to run dput(head(your_dataframe)), but you can also build a small fake data frame to illustrate your point if you cannot reveal your data.
Here's a possible solution using the data.table package:
library(data.table)
setDT(df)[ ,.SD[which.min(Julian_Day)], .(species, Year)]
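If data.table is not available, the same selection can be sketched in base R by sorting and then dropping repeated Year/species pairs. The column subset and the names `sorted` and `earliest` are assumptions made to keep the illustration short.

```r
df <- data.frame(
  Year       = c(1901, 1901, 1903, 1968, 1978, 2002, 2006, 2010),
  species    = c(1, 1, 2, 65, 1, 82, 3, 1),
  Julian_Day = c(79, 125, 165, 178, 193, 68, 90, 230),
  Land       = c(16, 24, 25, 30, 34, 34, 39, 41)
)

# Sort so the lowest Julian_Day comes first within each Year/species pair
sorted <- df[order(df$Year, df$species, df$Julian_Day), ]

# Keep only the first row of each Year/species pair
earliest <- sorted[!duplicated(sorted[c("Year", "species")]), ]
earliest
```

Like which.min(), this keeps a single row per pair when several rows share the lowest Julian_Day.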
