Partition groups of data by group - r

I have the following dataset:
library(dplyr)
df <- data.frame(Location = c(rep("a", times = 9), rep("b", times = 18), rep("c", times = 27)))
df$Year <- c(rep(1:3, times = 3), rep(1:6, times = 3), rep(1:9, times = 3))
df <- df %>%
  mutate(Predictor = seq_along(Location))
print(df)
Location Year Predictor
a 1 1
a 2 2
a 3 3
a 1 4
a 2 5
a 3 6
a 1 7
a 2 8
a 3 9
b 1 10
b 2 11
b 3 12
b 4 13
b 5 14
... 40 more rows
I want to split the above dataframe into training and test sets. For the test set, I want to randomly sample a third of the number of years in each Location, while keeping the years together. So if year "1" is selected for location "a", I want all three "1's" in the test set and so on. My test set should look something like this:
Location Year Predictor
a 1 1
a 1 4
a 1 7
b 3 12
b 3 18
b 3 24
b 5 14
b 5 20
b 5 26
c 3 30
c 3 39
c 3 48
c 6 33
c 6 42
c 6 51
c 7 34
c 7 43
c 7 52
I found a similar question here, but this procedure would sample the same year and the same number of years from every location (and YEAR is numeric, not a factor). I want a different random sample of years from each location and a proportional number of samples.
I would like to do this in dplyr if possible.

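You can first create a distinct set of Location/Year combinations, sample some of them within each Location, and then use the result in a semi_join against the original data. sample_frac rounds the per-group sample size, so .3 picks 1, 2, and 3 years for locations with 3, 6, and 9 years. This could be done as:
library(dplyr)
df %>%
  distinct(Location, Year) %>%
  group_by(Location) %>%
  sample_frac(.3) %>%
  semi_join(df, ., by = c("Location", "Year"))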
# Location Year Predictor
# 1 a 3 3
# 2 a 3 6
# 3 a 3 9
# 4 b 4 13
# 5 b 4 19
# 6 b 4 25
# 7 b 5 14
# 8 b 5 20
# 9 b 5 26
# 10 c 8 35
# 11 c 8 44
# 12 c 8 53
# 13 c 1 28
# 14 c 1 37
# 15 c 1 46
# 16 c 2 29
# 17 c 2 38
# 18 c 2 47
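To get both the test and the training set from the same draw, the sampled Location/Year keys can be stored and reused: semi_join keeps the matching rows and anti_join keeps everything else. The following is a minimal sketch, not part of the original answer. The names test_keys, test_set, and train_set and the seed are illustrative assumptions, and it takes exactly one third of the years per Location via integer division.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the split is reproducible

# Rebuild the example data from the question
df <- data.frame(
  Location = rep(c("a", "b", "c"), times = c(9, 18, 27)),
  Year = c(rep(1:3, times = 3), rep(1:6, times = 3), rep(1:9, times = 3))
)
df$Predictor <- seq_len(nrow(df))

# One row per Location/Year, then keep a random third of the years per Location
test_keys <- df %>%
  distinct(Location, Year) %>%
  group_by(Location) %>%
  filter(row_number() %in% sample.int(n(), n() %/% 3)) %>%
  ungroup()

# Rows whose Location/Year was sampled form the test set, the rest the training set
test_set  <- semi_join(df, test_keys, by = c("Location", "Year"))
train_set <- anti_join(df, test_keys, by = c("Location", "Year"))
```

Because both joins use the same keys, every row of df lands in exactly one of the two sets, and all rows of a sampled year stay together.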

Related

How to fill a column by group with sampled row numbers according to n per group

I am working with a data frame in R whose groups are given by the column Group1. I need to create a new column named sampled: within each group, I draw a sample from 1 to the number of rows in that group and fill the sampled rows with a specific value. Here is the data I have:
library(tidyverse)
#Data
dat <- data.frame(Group1=sample(letters[1:3],15,replace = T))
Then dat looks like this:
dat
Group1
1 b
2 a
3 a
4 c
5 c
6 c
7 a
8 b
9 c
10 b
11 a
12 b
13 c
14 c
15 c
In order to get the N per group, we do this:
#Code
dat %>%
arrange(Group1) %>%
group_by(Group1) %>%
mutate(N=n())
Which produces:
# A tibble: 15 x 2
# Groups: Group1 [3]
Group1 N
<chr> <int>
1 a 4
2 a 4
3 a 4
4 a 4
5 b 4
6 b 4
7 b 4
8 b 4
9 c 7
10 c 7
11 c 7
12 c 7
13 c 7
14 c 7
15 c 7
What I need to do next is this: given the N per group, I draw a sample of 3 numbers from 1:N. For group a, with N=4, this would be sample(1:4,3), which produces for example [1] 2 4 3. The rows of group a at the sampled positions must then be filled with 999. So for the first group we would have:
Group1 N sampled
<chr> <int> <int>
1 a 4 NA
2 a 4 999
3 a 4 999
4 a 4 999
And then the same for the remaining groups, so that sample picks random rows within each group. Is it possible to do this using dplyr or the tidyverse? Many thanks!
You could try:
set.seed(3242)
library(dplyr)
dat %>%
arrange(Group1) %>%
add_count(Group1, name = 'N') %>%
group_by(Group1) %>%
mutate(
sampled = case_when(
row_number() %in% sample(1:n(), 3L) ~ 999L,
TRUE ~ NA_integer_
)
)
Output:
# A tibble: 15 × 3
# Groups: Group1 [3]
Group1 N sampled
<chr> <int> <int>
1 a 4 999
2 a 4 999
3 a 4 NA
4 a 4 999
5 b 4 999
6 b 4 999
7 b 4 999
8 b 4 NA
9 c 7 NA
10 c 7 999
11 c 7 NA
12 c 7 999
13 c 7 NA
14 c 7 NA
15 c 7 999
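The answer above draws 3 positions per group, so sample would fail for a group with fewer than 3 rows. A hedged variant, assuming it is acceptable to mark every row of such a small group, caps the sample size at the group size. The data below is a made-up example (not the asker's random data), and the seed is arbitrary.

```r
library(dplyr)

set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical data: group b has only 2 rows
dat <- data.frame(Group1 = c("a", "a", "a", "a", "b", "b", "c", "c", "c", "c", "c"))

out <- dat %>%
  group_by(Group1) %>%
  mutate(
    N = n(),
    # sample.int(n(), k) draws k distinct positions from 1..n();
    # min() keeps k from exceeding the group size
    sampled = if_else(
      row_number() %in% sample.int(n(), min(3L, n())),
      999L,
      NA_integer_
    )
  ) %>%
  ungroup()
```

Groups a and c get three rows marked with 999, while group b gets both of its rows marked.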

Assign value to data based on more than two conditions and on other data

I have a data frame that looks like this
> df
name time count
1 A 10 9
2 A 12 17
3 A 24 19
4 A 3 15
5 A 29 11
6 B 31 14
7 B 7 7
8 B 30 18
9 C 29 13
10 C 12 12
11 C 3 16
12 C 4 6
and for each name group (A, B, C) I would need to assign a category following the rules below:
if time<= 10 then category = 1
if 10 <time<= 20 then category = 2
if 20 <time<= 30 then category = 3
if time> 30 then category = 4
to have a data frame that looks like this:
> df_final
name time count category
1 A 10 9 1
2 A 12 17 2
3 A 24 19 3
4 A 3 15 1
5 A 29 11 3
6 B 31 14 4
7 B 7 7 1
8 B 30 18 3
9 C 29 13 3
10 C 12 12 2
11 C 3 16 1
12 C 4 6 1
After that I would need to sum the values in count by category. The ultimate data frame should look like this:
> df_ultimate
name count category
1 A 24 1
2 A 17 2
3 A 30 3
4 A NA 4
5 B 7 1
6 B NA 2
7 B 18 3
8 B 14 4
9 C 22 1
10 C 12 2
11 C 13 3
12 C NA 4
I have tried to play around with summarise and group_by but without much success.
Thanks for your help
With cut + complete:
library(dplyr)
library(tidyr)
df %>%
group_by(name, category = cut(time, breaks = c(-Inf, 10, 20, 30, Inf), labels = 1:4)) %>%
summarise(count = sum(count)) %>%
complete(category)
# # Groups: name [3]
# name category count
# 1 A 1 24
# 2 A 2 17
# 3 A 3 30
# 4 A 4 NA
# 5 B 1 7
# 6 B 2 NA
# 7 B 3 18
# 8 B 4 14
# 9 C 1 22
# 10 C 2 12
# 11 C 3 13
# 12 C 4 NA
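For reference, here is a self-contained sketch of the same cut + complete idea with the question's data typed in. cut uses right-closed intervals by default, so time = 10 falls in category 1 and time = 30 in category 3, matching the rules in the question. The variant below drops the grouping before calling complete(name, category), which is an assumption of this sketch rather than what the answer does.

```r
library(dplyr)
library(tidyr)

# Data copied from the question
df <- data.frame(
  name  = rep(c("A", "B", "C"), times = c(5, 3, 4)),
  time  = c(10, 12, 24, 3, 29, 31, 7, 30, 29, 12, 3, 4),
  count = c(9, 17, 19, 15, 11, 14, 7, 18, 13, 12, 16, 6)
)

df_ultimate <- df %>%
  # Intervals: (-Inf,10], (10,20], (20,30], (30,Inf]
  mutate(category = cut(time, breaks = c(-Inf, 10, 20, 30, Inf), labels = 1:4)) %>%
  group_by(name, category) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  # Add the name/category combinations that never occur, with count = NA
  complete(name, category)
```

Since category is a factor, complete fills in all four levels for every name, including levels that have no rows.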

R, How to generate additional observations denoted by numbered sequence

I'm currently a bit stuck, since I'm a bit unsure of how to even formulate my problem.
What I have is a dataframe of observations with a few variables.
Let's say this is my initial dataset:
test <- data.frame(var1=c("a","b"),var2=c(15,12))
What I want to end up with is something like:
test2 <- data.frame(var1_p=c("a","a","a","a","a","b","b","b","b","b"),
var2=c(15,15,15,15,15,12,12,12,12,12),
var3=c(1,2,3,4,5,1,2,3,4,5))
However, the number of initial observations, and the fact that I need the numbering to run from 0 to 9, make it rather tedious to do by hand.
Does anybody have a nice alternative solution?
Thank you.
What I tried so far was:
a)
testdata$C <- 0
testdata <- for (i in testdata$Combined_Number) {add_row(testdata,C=seq(0,9))}
which results in an empty dataset.
b)
testdata$C <- with(testdata, ave(Combined_Number,flur, FUN = seq(0,9)))
which gives the following error code:
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'FUN' of mode 'function' was not found
Perhaps crossing helps:
library(tidyr)
crossing(test, var3 = 0:9)
Output:
# A tibble: 20 × 3
var1 var2 var3
<chr> <dbl> <int>
1 a 15 0
2 a 15 1
3 a 15 2
4 a 15 3
5 a 15 4
6 a 15 5
7 a 15 6
8 a 15 7
9 a 15 8
10 a 15 9
11 b 12 0
12 b 12 1
13 b 12 2
14 b 12 3
15 b 12 4
16 b 12 5
17 b 12 6
18 b 12 7
19 b 12 8
20 b 12 9
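With dplyr this is one approach (note that since dplyr 1.1.0, returning more than one row per group from summarize is deprecated in favor of reframe):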
library(dplyr)
df %>%
group_by(var1) %>%
summarize(var2, var3 = 0:9, .groups="drop")
# A tibble: 20 × 3
var1 var2 var3
<chr> <dbl> <int>
1 a 15 0
2 a 15 1
3 a 15 2
4 a 15 3
5 a 15 4
6 a 15 5
7 a 15 6
8 a 15 7
9 a 15 8
10 a 15 9
11 b 12 0
12 b 12 1
13 b 12 2
14 b 12 3
15 b 12 4
16 b 12 5
17 b 12 6
18 b 12 7
19 b 12 8
20 b 12 9
Data
df <- structure(list(var1 = c("a", "b"), var2 = c(15, 12)), class = "data.frame", row.names = c(NA,
-2L))
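If no extra packages are wanted, the same expansion can be sketched in base R by repeating each row index and then numbering the copies. This is an illustrative alternative, not one of the posted answers. It reuses the question's test data frame and assumes 10 copies per row, numbered 0 to 9.

```r
# Data from the question
test <- data.frame(var1 = c("a", "b"), var2 = c(15, 12))

# Repeat each original row 10 times (row 1 ten times, then row 2 ten times)
test2 <- test[rep(seq_len(nrow(test)), each = 10), ]

# Number the copies 0..9 within each original row
test2$var3 <- rep(0:9, times = nrow(test))

# Drop the duplicated row names left over from indexing
rownames(test2) <- NULL
```

The each/times pairing is what keeps the numbering aligned: rows are repeated block by block, and 0:9 is cycled once per block.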

Filtering out a grade variable that is not meeting consecutive order in r

I have a combined dataset that consists of three years of data for the same ids. When I merged the dataset, I see some of the students' grades are not consecutive in the following years.
Here is what a sample dataset looks like:
df <- data.frame( id = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
category = c("A","A","A","B","B","B","A","A","A","B","B","B","A","A","A","B","B","B"),
year = c(18,19,20,18,19,20,18,19,20,18,19,20,18,19,20,18,19,20),
grade = c(3,4,5,3,4,5,5,6,8,5,6,8,3,4,6,3,4,6))
> df
id category year grade
1 1 A 18 3
2 1 A 19 4
3 1 A 20 5
4 1 B 18 3
5 1 B 19 4
6 1 B 20 5
7 2 A 18 5
8 2 A 19 6
9 2 A 20 8
10 2 B 18 5
11 2 B 19 6
12 2 B 20 8
13 3 A 18 3
14 3 A 19 4
15 3 A 20 6
16 3 B 18 3
17 3 B 19 4
18 3 B 20 6
In this sample dataset, the grades for id=2 and id=3 are not consecutive: id=2 has 5,6,8 instead of 5,6,7, and id=3 has 3,4,6 instead of 3,4,5. I would like to remove those students from the dataset. My desired output would include only id=1, whose grades are in order across the consecutive years.
My desired output file would be:
> df
id category year grade
1 1 A 18 3
2 1 A 19 4
3 1 A 20 5
4 1 B 18 3
5 1 B 19 4
6 1 B 20 5
Any ideas?
Thanks!
Take the diff of grade and check that all differences equal 1, grouped by 'id' and 'category', to filter the groups:
library(dplyr)
df %>%
  group_by(id, category) %>%
  filter(all(diff(grade) == 1)) %>%
  ungroup()
Output:
# A tibble: 6 × 4
id category year grade
<dbl> <chr> <dbl> <dbl>
1 1 A 18 3
2 1 A 19 4
3 1 A 20 5
4 1 B 18 3
5 1 B 19 4
6 1 B 20 5
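One caveat: diff compares rows in the order they appear, so the check only works if each group is sorted by year. A small sketch, assuming the merged data may arrive in arbitrary order, sorts first. The reversed input here is a made-up stress case, and kept is a hypothetical name.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  id       = rep(1:3, each = 6),
  category = rep(rep(c("A", "B"), each = 3), times = 3),
  year     = rep(18:20, times = 6),
  grade    = c(3, 4, 5, 3, 4, 5, 5, 6, 8, 5, 6, 8, 3, 4, 6, 3, 4, 6)
)

# Reverse the rows to mimic unsorted input; the differences would then be -1
shuffled <- df[rev(seq_len(nrow(df))), ]

kept <- shuffled %>%
  arrange(id, category, year) %>%      # restore chronological order per group
  group_by(id, category) %>%
  filter(all(diff(grade) == 1)) %>%    # keep groups whose grade rises by 1 each year
  ungroup()
```

Without the arrange step, the reversed input would fail the check for every group, including id = 1.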

R - calculating the average value of a dataframe column from the top row to bottom row

The title may not be that clear, since it was difficult to summarize the problem in a few words, although I don't think the problem is that difficult to solve. To explain the problem, let me share a dataframe for reference:
head(df, n = 10)
team score
1 A 10
2 A 4
3 A 10
4 A 16
5 A 20
6 B 5
7 B 11
8 B 8
9 B 16
10 B 5
I'd like to add a third column, that calculates the average score for each team, with the average score updating as I go down the rows for each team, and then resetting at a new team. For example, the output column I am hoping for would look like this:
head(df, n = 10)
team score avg_score
1 A 10 10
2 A 4 7
3 A 10 8
4 A 16 10
5 A 20 12
6 B 5 5
7 B 11 8
8 B 8 8
9 B 16 10
10 B 5 9
# row1: 10 = 10
# row2: 7 = (10 + 4)/2
# row3: 8 = (10 + 4 + 10)/3
# ...
with the pattern following, and the calculation restarting for a new team.
Thanks,
library("data.table")
setDT(df)[, `:=` (avg_score = cumsum(score)/1:.N), by = team]
or, more readably, as per the comment by @snoram:
setDT(df)[, avg_score := cumsum(score)/(1:.N), by = team]
# team score avg_score
# 1: A 10 10
# 2: A 4 7
# 3: A 10 8
# 4: A 16 10
# 5: A 20 12
# 6: B 5 5
# 7: B 11 8
# 8: B 8 8
# 9: B 16 10
# 10: B 5 9
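Here's a base R solution (it assumes the rows are already sorted by team, because tapply returns the groups in sorted order):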
df$avg_score <- unlist(tapply(df$score, df$team, function(x) cumsum(x)/seq_along(x)))
> df
team score avg_score
1 A 10 10
2 A 4 7
3 A 10 8
4 A 16 10
5 A 20 12
6 B 5 5
7 B 11 8
8 B 8 8
9 B 16 10
10 B 5 9
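A base R variant that does not depend on row order is ave, which applies a function within each group and returns the results in the original row positions. This is a sketch added for illustration, with the question's data typed in by hand.

```r
# Data from the question
df <- data.frame(
  team  = rep(c("A", "B"), each = 5),
  score = c(10, 4, 10, 16, 20, 5, 11, 8, 16, 5)
)

# Running mean within each team: cumulative sum divided by the running row count
df$avg_score <- ave(df$score, df$team, FUN = function(x) cumsum(x) / seq_along(x))
```

For team A this gives 10, 7, 8, 10, 12, and the calculation restarts at the first row of team B.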
We can use cummean from dplyr (as @aosmith also noted in a comment):
library(dplyr)
df %>%
group_by(team) %>%
mutate(avg_score = cummean(score))
# team score avg_score
# <chr> <int> <dbl>
#1 A 10 10
#2 A 4 7
#3 A 10 8
#4 A 16 10
#5 A 20 12
#6 B 5 5
#7 B 11 8
#8 B 8 8
#9 B 16 10
#10 B 5 9
