How to keep grouped variables together in training and test data - r

I'm building and testing the accuracy of age extrapolations from growth measurements, and to do this I have to split my data into training and test sets.
The issue is that individuals in my data set were measured multiple times: some were measured twice, others three times. In the dataset, Bird is the individual chick, Age is the age at measurement, and Wing is the measurement value.
I've tried using the group_by function to keep each individual's measurements together, but this doesn't seem to work. I also tried nesting the data, but that puts the data in a new table and my code doesn't like that. Is there another way to keep the groups together while still randomly assigning them to training and test data?
library('tidyverse')
library("ggplot2")
library("readxl")
library("writexl")
library('dplyr')
library('Rmisc')
library('cowplot')
library('purrr')
library('caTools')
library('MLmetrics')
Bird<-c(1,1,1,2,2,3,3,3,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10)
Age<-c(10,17,27,17,28,10,17,27,10,17,10,17,28,10,17,28,10,17,28,10,17,28,10,17,28,11,18)
Wing<-c(39,63,98,61,99,34,48,80,30,37,35,51,71,40,55,79,34,47,77,36,55,84,35,55,88,36,59)
Set14<-data.frame(Bird, Age, Wing) %>%
group_by(Bird)
Set14$Bird<-as.factor((Set14$Bird))
Set14
sample_size = floor(0.7*nrow(Set14))
picked = sample(seq_len(nrow(Set14)),size = sample_size)
Training =Set14[picked,]
Training
Test =Set14[-picked,]
Test
trm<-lm(Age~Wing, data=Training)
predval<-predict(object=trm,
newdata=Test)
predval
error<-data.frame(actual=Test$Age, calculated=predval)
error
MAPE(error$actual, error$calculated)

In Base R you could do:
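a <- as.integer(Set14$Bird)
n <- length(unique(a))
# sample 70% of the bird IDs, then flag every row belonging to a sampled bird
train_index <- a %in% sample(n, floor(0.7 * n))
train <- Set14[train_index, ]
test <- Set14[!train_index, ]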
In the tidyverse:
ungroup(Set14) %>%
nest_by(Bird) %>%
ungroup() %>%
mutate(tt = floor(.7*n()),
tt = sample(rep(c('train', 'test'), c(tt[1], n()-tt[1])))) %>%
unnest(data) %>%
group_split(tt, .keep = FALSE)
[[1]]
# A tibble: 9 x 3
Bird Age Wing
<fct> <dbl> <dbl>
1 1 10 39
2 1 17 63
3 1 27 98
4 3 10 34
5 3 17 48
6 3 27 80
7 7 10 34
8 7 17 47
9 7 28 77
[[2]]
# A tibble: 18 x 3
Bird Age Wing
<fct> <dbl> <dbl>
1 2 17 61
2 2 28 99
3 4 10 30
4 4 17 37
5 5 10 35
6 5 17 51
7 5 28 71
8 6 10 40
9 6 17 55
10 6 28 79
11 8 10 36
12 8 17 55
13 8 28 84
14 9 10 35
15 9 17 55
16 9 28 88
17 10 11 36
18 10 18 59
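The idea behind both answers is to sample bird IDs rather than rows. A minimal end-to-end sketch of that idea follows, using the data from the question. The seed is arbitrary, and the error measure is written out by hand as the mean of |actual - predicted| / actual instead of relying on a package, so treat it as an illustration rather than a drop-in replacement for the original script.

```r
# Toy data copied from the question
Bird <- c(1,1,1,2,2,3,3,3,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10)
Age  <- c(10,17,27,17,28,10,17,27,10,17,10,17,28,10,17,28,10,17,28,10,17,28,10,17,28,11,18)
Wing <- c(39,63,98,61,99,34,48,80,30,37,35,51,71,40,55,79,34,47,77,36,55,84,35,55,88,36,59)
Set14 <- data.frame(Bird = factor(Bird), Age, Wing)

set.seed(42)  # arbitrary seed, only so the split is reproducible

# Sample bird IDs, not rows, so all measurements of a bird stay together
birds       <- unique(Set14$Bird)
train_birds <- sample(birds, size = floor(0.7 * length(birds)))

Training <- Set14[Set14$Bird %in% train_birds, ]
Test     <- Set14[!Set14$Bird %in% train_birds, ]

# Same model as in the question, evaluated on birds the model has never seen
trm     <- lm(Age ~ Wing, data = Training)
predval <- predict(trm, newdata = Test)

# Mean absolute percentage error, relative to the actual age
mape <- mean(abs((Test$Age - predval) / Test$Age))
mape
```

Because birds have two or three rows each, the row-level split will only be approximately 70/30 even though exactly 7 of the 10 birds go to training.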

Related

R: Creating a new row for each group, with values being the difference of existing entries in the group

Region  Age  Student Type  Values
A       17   Any           32
A       17   Full time     24
A       18   Any           27
A       18   Full time     19
B       17   Any           22
B       17   Full time     14
B       18   Any           80
B       18   Full time     75
I am working with this dataset in R. I am hoping to create a new row for each region and age, with student type being "Part time" and values being the values of "Any" - "Full time". It seems I can use lag in the process, but I was hoping to be more explicit and specify that it is "Any" - "Full time": while this dataset is well organized, there may be data sets where the entries are reversed.
Ideally the result would look something like
Region  Age  Student Type  Values
A       17   Any           32
A       17   Full time     24
A       17   Part time     8
A       18   Any           27
A       18   Full time     19
A       18   Part time     8
B       17   Any           22
B       17   Full time     14
B       17   Part time     8
B       18   Any           80
B       18   Full time     75
B       18   Part time     5
Thank you!
You may try
library(dplyr)
df %>%
group_by(Region, Age) %>%
summarize(Student.Type = "Part time",
Values = abs(diff(Values))) %>%
rbind(., df) %>%
arrange(Region, Age, Student.Type)
Region Age Student.Type Values
<chr> <int> <chr> <int>
1 A 17 Any 32
2 A 17 Full time 24
3 A 17 Part time 8
4 A 18 Any 27
5 A 18 Full time 19
6 A 18 Part time 8
7 B 17 Any 22
8 B 17 Full time 14
9 B 17 Part time 8
10 B 18 Any 80
11 B 18 Full time 75
12 B 18 Part time 5
With dplyr, you could use group_modify() + add_row().
df %>%
group_by(Region, Age) %>%
group_modify(~ {
.x %>%
summarise(StudentType = "Part time", Values = -diff(Values)) %>%
add_row(.x, .)
}) %>%
ungroup()
# # A tibble: 12 × 4
# Region Age StudentType Values
# <chr> <int> <chr> <int>
# 1 A 17 Any 32
# 2 A 17 Full time 24
# 3 A 17 Part time 8
# 4 A 18 Any 27
# 5 A 18 Full time 19
# 6 A 18 Part time 8
# 7 B 17 Any 22
# 8 B 17 Full time 14
# 9 B 17 Part time 8
# 10 B 18 Any 80
# 11 B 18 Full time 75
# 12 B 18 Part time 5
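Both answers above rely on diff(), which depends on the row order within each group. Since the question asks for an explicit "Any" - "Full time", here is a sketch that looks the two values up by label instead. The data frame is retyped from the question, and the column name Student.Type is an assumption (the two answers spell it differently).

```r
library(dplyr)

# Data from the question; the column name Student.Type is an assumption
df <- data.frame(
  Region       = rep(c("A", "B"), each = 4),
  Age          = rep(c(17L, 17L, 18L, 18L), times = 2),
  Student.Type = rep(c("Any", "Full time"), times = 4),
  Values       = c(32L, 24L, 27L, 19L, 22L, 14L, 80L, 75L)
)

part_time <- df %>%
  group_by(Region, Age) %>%
  summarise(
    # Look both values up by label so the row order does not matter.
    # Values is computed first, while Student.Type still holds the
    # original labels; Student.Type is overwritten only afterwards.
    Values       = Values[Student.Type == "Any"] - Values[Student.Type == "Full time"],
    Student.Type = "Part time",
    .groups      = "drop"
  )

# Columns are matched by name, so their order in part_time does not matter
result <- bind_rows(df, part_time) %>%
  arrange(Region, Age, Student.Type)
result
```

This assumes each Region/Age group has exactly one "Any" row and one "Full time" row; groups that violate this would need extra handling.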

add rows to data frame for non-observations

I have a dataframe that summarizes the number of times birds were observed at their breeding site on each day and each hour during daytime (i.e., when the sun was above the horizon). Example:
head(df)
ID site day hr nObs
1 19 A 202 11 60
2 19 A 202 13 18
3 19 A 202 15 27
4 8 B 188 8 6
5 8 B 188 9 6
6 8 B 188 11 7
However, this dataframe does not include hours when the bird was not observed. E.g., there is no line for bird 19 on day 202 at hour 14 with an nObs value of 0.
I'd like to find a way, preferably with dplyr (tidyverse), to add in those rows for when individuals were not observed.
You can use complete from tidyr, i.e.
library(tidyverse)
df %>%
group_by(ID, site) %>%
complete(hr = seq(min(hr), max(hr)))
which gives,
# A tibble: 9 x 5
# Groups: ID, site [2]
ID site hr day nObs
<int> <fct> <int> <int> <int>
1 8 B 8 188 6
2 8 B 9 188 6
3 8 B 10 NA NA
4 8 B 11 188 7
5 19 A 11 202 60
6 19 A 12 NA NA
7 19 A 13 202 18
8 19 A 14 NA NA
9 19 A 15 202 27
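The output above leaves day and nObs as NA in the added rows, whereas the question asks for nObs to be 0. A possible variation, sketched below, adds day to the grouping so it is carried into the new rows and uses the fill argument of complete() to set nObs to 0. The data frame is retyped from the question's head(df), and filling only between each bird's first and last observed hour is an assumption carried over from the answer; actual daylight hours would have to come from elsewhere.

```r
library(dplyr)
library(tidyr)

# Retyped from head(df) in the question
df <- data.frame(
  ID   = c(19L, 19L, 19L, 8L, 8L, 8L),
  site = c("A", "A", "A", "B", "B", "B"),
  day  = c(202L, 202L, 202L, 188L, 188L, 188L),
  hr   = c(11L, 13L, 15L, 8L, 9L, 11L),
  nObs = c(60L, 18L, 27L, 6L, 6L, 7L)
)

filled <- df %>%
  group_by(ID, site, day) %>%              # day is a grouping column, so it is kept
  complete(hr = seq(min(hr), max(hr)),     # every hour between first and last sighting
           fill = list(nObs = 0L)) %>%     # new rows get 0 instead of NA
  ungroup()
filled
```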
One way to do this would be to first build a "template" of all possible combinations where birds can be observed and then merge ("left join") the actual observations onto that template:
a = read.table(text = " ID site day hr nObs
1 19 A 202 11 60
2 19 A 202 13 18
3 19 A 202 15 27
4 8 B 188 8 6
5 8 B 188 9 6
6 8 B 188 11 7")
tpl <- expand.grid(c(unique(a[, 1:3]), list(hr = 1:24)))
merge(tpl, a, all.x = TRUE)
Edit based on comment by @user3220999: in case we want to do the process per ID, we can just use split to get a list of data.frames per ID, build a list of templates, and mapply merge over the two lists:
a <- split(a, a$ID)
tpl <- lapply(a, function(ai) {
expand.grid(c(unique(ai[, 1:3]), list(hr = 1:24)))
})
res <- mapply(merge, tpl, a, SIMPLIFY = FALSE, MoreArgs = list(all.x = TRUE))

summarise dataset conditioning on variable using dplyr

I want to summarise my dataset grouping the variable age into 5 years age groups, so instead of single age 0 1 2 3 4 5 6... I would have 0 5 10 15 etc. with 80 being my open-ended category. I could do this by categorizing everything by hand creating a new variable, but I am sure there must be a quicker way!
a <- cbind(age=c(rep(seq(0, 90, by=1), 2)), value=rnorm(182))
Any ideas?
like this ?
library(dplyr)
a %>% data.frame %>% group_by(age_group = (sapply(age,min,80) %/% 5)*5) %>%
summarize(avg_val = mean(value))
# A tibble: 17 x 2
age_group avg_val
<dbl> <dbl>
1 0 -0.151470805
2 5 0.553619149
3 10 0.198915973
4 15 -0.436646287
5 20 -0.024193193
6 25 0.102671120
7 30 -0.350059839
8 35 0.010762264
9 40 0.339268917
10 45 -0.056448481
11 50 0.002982158
12 55 0.348232262
13 60 -0.364050091
14 65 0.177551510
15 70 -0.178885909
16 75 0.664215782
17 80 -0.376929230
Example data
set.seed(1)
df <- data.frame(age=runif(1000)*100,
value=runif(1000))
Simply add the max value of your group to seq(0,80,5) for irregular breaks with c(..., max(age))
library(dplyr)
df %>%
mutate(age = cut(age, breaks=c(seq(0,80,5), max(age)))) %>%
group_by(age) %>%
summarise(value=mean(value))
Output
age value
<fctr> <dbl>
1 (0,5] 0.4901119
2 (5,10] 0.5131055
3 (10,15] 0.5022297
4 (15,20] 0.4712481
5 (20,25] 0.5610872
6 (25,30] 0.4207203
7 (30,35] 0.5218318
8 (35,40] 0.4377102
9 (40,45] 0.5007616
10 (45,50] 0.4941768
11 (50,55] 0.5350272
12 (55,60] 0.5226967
13 (60,65] 0.5031688
14 (65,70] 0.4652641
15 (70,75] 0.5667020
16 (75,80] 0.4664898
17 (80,100] 0.4604779
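Note that with the default right-closed intervals, an age of exactly 0 falls outside (0,5] and becomes NA. For integer ages like those in the question, a sketch using left-closed intervals and an open-ended last break may be closer to what was asked. The seed is arbitrary, and the data mimic the question's example rather than reproduce it.

```r
library(dplyr)

set.seed(1)  # arbitrary seed for reproducibility
a <- data.frame(age = rep(0:90, 2), value = rnorm(182))

res <- a %>%
  mutate(age_group = cut(age,
                         breaks = c(seq(0, 80, 5), Inf),  # last group is 80 and over
                         right  = FALSE)) %>%             # intervals are [lower, upper)
  group_by(age_group) %>%
  summarise(avg_val = mean(value), n = n())
res
```

With right = FALSE, age 0 lands in the first group and age 80 starts the open-ended group, so no row is dropped.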

Group by followed by select only rows if its value in a particular column is less than its value from the same column

I am new to R
I have a data frame [1390 *6], where the last variable is the rank.
[Example of the Dataset]
So I would like to group_by the "ID", then ignore the rows for each "ID" whose rank is higher than that of "15001" (highlighted in yellow).
This is what I have tried so far:
SS3<-SS1 %>% group_by(ID) %>% filter(any(DC== 15001) & any(SS1$rank <SS1$rank[DC== 15001]))
[Expected result]
Example that's similar to the data you provide, with only the relevant rows required for your operation. This should work with your own data (given what you've shown):
set.seed(1)
df <- data.frame(ID=c(rep(2122051,20),rep(2122052,20)),
DC=as.integer(runif(40)*100),
rank=rep(1:20,2),
stringsAsFactors=F)
df$DC[c(10,30)] <- as.integer(15001)
I store the rank of each row where DC==15001 as a vector
positions <- df$rank[df$DC==15001]
[1] 10 10
I use tidyverse map2 to keep, for each group, the entries whose rank is no greater than the one indicated in positions.
library(tidyverse)
df1 <- df %>%
group_by(ID) %>%
nest() %>%
mutate(data = map2(data, 1:length(unique(df$ID)), ~head(.x,positions[.y]))) %>%
unnest(data)
Output
ID DC rank
1 2122051 26 1
2 2122051 37 2
3 2122051 57 3
4 2122051 90 4
5 2122051 20 5
6 2122051 89 6
7 2122051 94 7
8 2122051 66 8
9 2122051 62 9
10 2122051 15001 10
11 2122052 93 1
12 2122052 21 2
13 2122052 65 3
14 2122052 12 4
15 2122052 26 5
16 2122052 38 6
17 2122052 1 7
18 2122052 38 8
19 2122052 86 9
20 2122052 15001 10
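The same result can probably be reached without nesting, by comparing each row's rank with the rank of the DC == 15001 row inside a grouped filter(). The sketch below reuses the example data from this answer; using match() to locate the 15001 row is my own choice, not something from the original post.

```r
library(dplyr)

set.seed(1)
df <- data.frame(ID   = c(rep(2122051, 20), rep(2122052, 20)),
                 DC   = as.integer(runif(40) * 100),
                 rank = rep(1:20, 2))
df$DC[c(10, 30)] <- 15001L

kept <- df %>%
  group_by(ID) %>%
  # match() gives the position of the first DC == 15001 row in the group.
  # If a group has no such row, match() returns NA, the comparison is NA,
  # and filter() drops every row of that group.
  filter(rank <= rank[match(15001, DC)]) %>%
  ungroup()
kept
```

Unlike head(), this does not rely on the rows being sorted by rank within each ID.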

How to calculate the prediction power of each independent variable on a new data frame

I would like to calculate the prediction power of each independent variable. I have a training data frame named df and a test data frame named df1. I wrote code that should append the prediction results based on each column to the test data frame. My code gives a strange result: it presents only one variable's prediction results, and without its name. I would like to see the predictions for all variables, and their names too. I'm new to function writing, so any help is welcome.
df <- read.table(text = " target birds wolfs
32 9 7
56 8 4
11 2 8
22 2 3
33 8 3
54 1 2
34 7 16
66 1 5
74 17 7
52 8 7
45 2 7
65 20 3
99 6 3
88 1 1
77 3 11
55 30 1 ",header = TRUE)
df1 <- read.table(text = " target birds wolfs
34 9 7
23 8 4
43 2 8
45 2 3
65 8 3
23 1 2
22 7 16
99 1 5
56 17 7
32 8 7
19 2 7
91 20 3
78 6 3
62 1 1
78 3 11
69 30 1 ",header = TRUE)
Here is the code that I use
for(i in names(df))
{
if(is.numeric(df[3,i])) ##if row 3 is numeric, the entire column is
{
fit_pred <- predict(lm(df[,i] ~ target, data=df), newdata=df1)
res <- fit_pred
g<-as.data.frame(cbind(df1,res))
g
}
}
The output that I got is :
userid target birds wolfs res
10 321 45 8 7 0.0515967
8 608 33 1 5 0.1696638
3 234 23 2 8 0.1696638
7 294 44 7 1 0.0515967
2 444 46 8 4 0.0515967
11 226 90 2 7 0.1696638
9 123 89 9 7 0.0515967
1 222 67 9 7 0.0515967
5 678 43 8 3 0.0515967
15 999 12 3 9 0.1696638
6 987 33 1 2 0.1696638
14 225 18 1 1 0.1696638
16 987 83 1 1 0.1696638
12 556 77 2 3 0.1696638
You should not use a for loop here. You should use one of the apply family of functions. Here is the R way to do this:
fit_pred <- function(x) predict(lm(x ~ target, data=df), newdata=df1)
do.call(cbind, lapply(df, fit_pred))
I wrap your code in a function.
I use lapply to loop over all the columns.
I use do.call and cbind to aggregate the results.
Here is a process that uses the packages dplyr and tidyr to create models based on y~x combinations (the dependent variables you specify ~ the independent variables you specify) and then uses those models to predict new data.
The idea behind it is that both y and x variables might change (even if here you have only "target" as y). I'm using the dataframes df and df1 you specified in the beginning (I don't know why "target" becomes binary in your output).
Run the process step by step to see how it works and modify it to better fit your objective.
library(dplyr)
library(tidyr)
# input what you want as dependent variables y and independent variables x
ynames = c("target")
xnames = c("birds","wolfs")
###### build models
# create and reshape train y dataframes
dty = df[ynames]
dty = dty %>% gather(yvariable, yvalue)
# create and reshape train x dataframes
dtx = df[xnames]
dtx = dtx %>% gather(xvariable, xvalue)
# build model for each y~x combination
dt_model =
dty %>% do(data.frame(.,dtx)) %>% # create combinations of y and x variables
group_by(yvariable, xvariable) %>% # for each pair y and x
do(model = lm(yvalue~xvalue, data=.)) # build the lm y~x
# you've managed to create a model for each combination and it's stored in a dataframe
dt_model
# yvariable xvariable model
# 1 target birds <S3:lm>
# 2 target wolfs <S3:lm>
####### predict
# create and reshape test y dataframes
dty = df1[ynames]
dty = dty %>% gather(yvariable, yvalue)
# create and reshape test x dataframes
dtx = df1[xnames]
dtx = dtx %>% gather(xvariable, xvalue)
dty %>% do(data.frame(.,dtx)) %>% # create combinations of y and x variables
group_by(yvariable, xvariable) %>% # for each pair y and x
do(data.frame(., pred = predict(dt_model$model[dt_model$yvariable==.$yvariable[1] &
dt_model$xvariable==.$xvariable[1]][[1]], newdata = .))) %>% # get the corresponding model and predict on the new data (without newdata, predict() returns fitted values for the training data)
ungroup()
# yvariable yvalue xvariable xvalue pred
# 1 target 34 birds 9 54.30627
# 2 target 23 birds 8 53.99573
# 3 target 43 birds 2 52.13249
# 4 target 45 birds 2 52.13249
# 5 target 65 birds 8 53.99573
# 6 target 23 birds 1 51.82195
# 7 target 22 birds 7 53.68519
# 8 target 99 birds 1 51.82195
# 9 target 56 birds 17 56.79059
# 10 target 32 birds 8 53.99573
# 11 target 19 birds 2 52.13249
# 12 target 91 birds 20 57.72220
# 13 target 78 birds 6 53.37465
# 14 target 62 birds 1 51.82195
# 15 target 78 birds 3 52.44303
# 16 target 69 birds 30 60.82760
# 17 target 34 wolfs 7 51.49364
# 18 target 23 wolfs 4 56.38136
# 19 target 43 wolfs 8 49.86441
# 20 target 45 wolfs 3 58.01059
# 21 target 65 wolfs 3 58.01059
# 22 target 23 wolfs 2 59.63983
# 23 target 22 wolfs 16 36.83051
# 24 target 99 wolfs 5 54.75212
# 25 target 56 wolfs 7 51.49364
# 26 target 32 wolfs 7 51.49364
# 27 target 19 wolfs 7 51.49364
# 28 target 91 wolfs 3 58.01059
# 29 target 78 wolfs 3 58.01059
# 30 target 62 wolfs 1 61.26907
# 31 target 78 wolfs 11 44.97669
# 32 target 69 wolfs 1 61.26907
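For the simple case in the question (one response, a handful of predictors), a shorter base R sketch may be enough. It assumes target is the response and every other column is a predictor, fits one single-predictor lm per column on df, and appends named prediction columns to df1. The pred_ prefix is just an illustrative naming choice.

```r
# Training and test data retyped from the question
df <- data.frame(
  target = c(32, 56, 11, 22, 33, 54, 34, 66, 74, 52, 45, 65, 99, 88, 77, 55),
  birds  = c(9, 8, 2, 2, 8, 1, 7, 1, 17, 8, 2, 20, 6, 1, 3, 30),
  wolfs  = c(7, 4, 8, 3, 3, 2, 16, 5, 7, 7, 7, 3, 3, 1, 11, 1)
)
df1 <- data.frame(
  target = c(34, 23, 43, 45, 65, 23, 22, 99, 56, 32, 19, 91, 78, 62, 78, 69),
  birds  = c(9, 8, 2, 2, 8, 1, 7, 1, 17, 8, 2, 20, 6, 1, 3, 30),
  wolfs  = c(7, 4, 8, 3, 3, 2, 16, 5, 7, 7, 7, 3, 3, 1, 11, 1)
)

# Assumption: every column except target is a candidate predictor
predictors <- setdiff(names(df), "target")

# One model per predictor, fitted on df and used to predict df1
preds <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "target"), data = df)
  predict(fit, newdata = df1)
})
colnames(preds) <- paste0("pred_", predictors)

res <- cbind(df1, preds)
res
```

Building the formula with reformulate() keeps the variable name inside the model, so predict() can find the matching column in df1. Comparing each pred_ column with df1$target (for example via a mean squared error) would then give one measure of each variable's predictive power.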

Resources