summarise dataset conditioning on variable using dplyr - r

I want to summarise my dataset by grouping the variable age into 5-year age groups, so instead of single ages 0 1 2 3 4 5 6... I would have 0 5 10 15 etc., with 80 being my open-ended category. I could do this by hand by creating a new variable, but I am sure there must be a quicker way!
a <- cbind(age=c(rep(seq(0, 90, by=1), 2)), value=rnorm(182))
Any ideas?

Like this? pmin() caps every age at 80, so everything from 80 upwards lands in the open-ended group:
library(dplyr)
a %>% data.frame %>% group_by(age_group = (pmin(age, 80) %/% 5) * 5) %>%
summarize(avg_val = mean(value))
# A tibble: 17 x 2
age_group avg_val
<dbl> <dbl>
1 0 -0.151470805
2 5 0.553619149
3 10 0.198915973
4 15 -0.436646287
5 20 -0.024193193
6 25 0.102671120
7 30 -0.350059839
8 35 0.010762264
9 40 0.339268917
10 45 -0.056448481
11 50 0.002982158
12 55 0.348232262
13 60 -0.364050091
14 65 0.177551510
15 70 -0.178885909
16 75 0.664215782
17 80 -0.376929230

Example data
set.seed(1)
df <- data.frame(age=runif(1000)*100,
value=runif(1000))
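Add the maximum age to seq(0,80,5) so that the last break closes the open-ended group: c(seq(0,80,5), max(age)). Note that cut() builds right-closed intervals by default, so an age of exactly 0 falls outside (0,5] and becomes NA unless you also pass include.lowest=TRUE.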
library(dplyr)
df %>%
mutate(age = cut(age, breaks=c(seq(0,80,5), max(age)))) %>%
group_by(age) %>%
summarise(value=mean(value))
Output
age value
<fctr> <dbl>
1 (0,5] 0.4901119
2 (5,10] 0.5131055
3 (10,15] 0.5022297
4 (15,20] 0.4712481
5 (20,25] 0.5610872
6 (25,30] 0.4207203
7 (30,35] 0.5218318
8 (35,40] 0.4377102
9 (40,45] 0.5007616
10 (45,50] 0.4941768
11 (50,55] 0.5350272
12 (55,60] 0.5226967
13 (60,65] 0.5031688
14 (65,70] 0.4652641
15 (70,75] 0.5667020
16 (75,80] 0.4664898
17 (80,100] 0.4604779
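For whole-number ages like those in the question, left-closed intervals may be a better fit, since 0 to 4 then share a group and 5 starts the next one. The following is a minimal sketch under that assumption; the data frame name ages, the labels, and the use of Inf as the last break are illustrative choices, not part of the original answer.

```r
library(dplyr)

set.seed(1)
# Integer ages 0..90, twice, as in the question (names are illustrative)
ages <- data.frame(age = rep(0:90, 2), value = rnorm(182))

# 16 closed groups (0-4, ..., 75-79) plus the open-ended 80+ group
group_labels <- c(paste(seq(0, 75, 5), seq(4, 79, 5), sep = "-"), "80+")

by_group <- ages %>%
  mutate(age_group = cut(age,
                         breaks = c(seq(0, 80, 5), Inf),
                         right = FALSE,        # intervals are [0,5), [5,10), ...
                         labels = group_labels)) %>%
  group_by(age_group) %>%
  summarise(avg_val = mean(value), n = n())

by_group
```

With right = FALSE, age 0 is included in the first group and age 80 opens the last one, which matches the grouping described in the question.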

Related

How to keep grouped variables together in training and test data

I'm making and testing the accuracy of age extrapolations from growth measurements and to do this I have to split my data into my training and test data.
The issue is that individuals in my data set were measured multiple times: some twice, some three times. In the dataset, Bird is the individual chick, Age is the age at measurement, and Wing is the measurement value.
I've tried using the group_by function to keep their measurements together, but this doesn't seem to work. I also tried nesting the data but that puts the data in a new table and my code doesn't like that. Is there another way I could keep the groups together while still randomly assigning them to training and test data?
library('tidyverse')
library("ggplot2")
library("readxl")
library("writexl")
library('dplyr')
library('Rmisc')
library('cowplot')
library('purrr')
library('caTools')
library('MLmetrics')
Bird<-c(1,1,1,2,2,3,3,3,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10)
Age<-c(10,17,27,17,28,10,17,27,10,17,10,17,28,10,17,28,10,17,28,10,17,28,10,17,28,11,18)
Wing<-c(39,63,98,61,99,34,48,80,30,37,35,51,71,40,55,79,34,47,77,36,55,84,35,55,88,36,59)
Set14<-data.frame(Bird, Age, Wing) %>%
group_by(Bird)
Set14$Bird<-as.factor((Set14$Bird))
Set14
sample_size = floor(0.7*nrow(Set14))
picked = sample(seq_len(nrow(Set14)),size = sample_size)
Training =Set14[picked,]
Training
Test =Set14[-picked,]
Test
trm<-lm(Age~Wing, data=Training)
predval<-predict(object=trm,
newdata=Test)
predval
error<-data.frame(actual=Test$Age, calculated=predval)
error
MAPE(error$actual, error$calculated)
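In base R you could sample whole birds rather than rows:
# as.integer() on the factor gives level codes 1..n, one per bird
a <- as.integer(Set14$Bird)
n <- length(unique(a))
# TRUE for every row that belongs to a bird drawn for training
train_index <- a %in% sample(n, floor(0.7 * n))
train <- Set14[train_index, ]
test <- Set14[!train_index, ]
In the tidyverse: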
ungroup(Set14) %>%
nest_by(Bird) %>%
ungroup() %>%
mutate(tt = floor(.7*n()),
tt = sample(rep(c('train', 'test'), c(tt[1], n()-tt[1])))) %>%
unnest(data) %>%
group_split(tt, .keep = FALSE)
[[1]]
# A tibble: 9 x 3
Bird Age Wing
<fct> <dbl> <dbl>
1 1 10 39
2 1 17 63
3 1 27 98
4 3 10 34
5 3 17 48
6 3 27 80
7 7 10 34
8 7 17 47
9 7 28 77
[[2]]
# A tibble: 18 x 3
Bird Age Wing
<fct> <dbl> <dbl>
1 2 17 61
2 2 28 99
3 4 10 30
4 4 17 37
5 5 10 35
6 5 17 51
7 5 28 71
8 6 10 40
9 6 17 55
10 6 28 79
11 8 10 36
12 8 17 55
13 8 28 84
14 9 10 35
15 9 17 55
16 9 28 88
17 10 11 36
18 10 18 59
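The same idea can be written more directly by sampling the bird IDs themselves and then filtering on them. This is a minimal sketch, not either answerer's code; the seed and the name train_birds are assumptions introduced for illustration.

```r
library(dplyr)

Set14 <- data.frame(
  Bird = c(1,1,1,2,2,3,3,3,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10),
  Age  = c(10,17,27,17,28,10,17,27,10,17,10,17,28,10,17,28,10,17,28,10,17,28,10,17,28,11,18),
  Wing = c(39,63,98,61,99,34,48,80,30,37,35,51,71,40,55,79,34,47,77,36,55,84,35,55,88,36,59)
)

set.seed(42)  # arbitrary seed, only to make the split reproducible

# Draw 70% of the individual birds, not 70% of the rows
bird_ids    <- unique(Set14$Bird)
train_birds <- sample(bird_ids, size = floor(0.7 * length(bird_ids)))

Training <- filter(Set14, Bird %in% train_birds)
Test     <- filter(Set14, !Bird %in% train_birds)
```

Because the split is made on birds, every measurement of a chick ends up on the same side, so the model is never tested on an individual it was trained on. The row proportions will only be roughly 70/30, since birds have different numbers of measurements.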

Data manipulation: gather or spread or both?

I am trying to change my data frame so I can look at it with some different plots. Essentially I want to compare different models. This is what I have:
variable = c('A','B','C','A','B','C')
optimal = c(10,20,30,40,80,100)
control = c(15,15,15,15,15,15)
method_1 = c(11,22,28,44,85,95)
method_2 = c(9, 19,31,39,79,102)
df = data.frame(variable, optimal, control, method_1, method_2)
df
and so it looks like this:
variable optimal control method_1 method_2
1 A 10 15 11 9
2 B 20 15 22 19
3 C 30 15 28 31
4 A 40 15 44 39
5 B 80 15 85 79
6 C 100 15 95 102
And I need something that looks like this:
variable A B C
1 optimal 10 20 30
2 optimal 40 80 100
3 control 15 15 15
4 control 15 15 15
5 method_1 11 22 28
6 method_1 44 85 95
7 method_2 9 19 31
8 method_2 39 79 102
I've tried gather, spread and transpose, but nothing worked. Any thoughts? It feels like this should be an easy fix, but I could not get my head around it. Thanks in advance.
You have to go long first and then wide, i.e.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-1) %>%
pivot_wider(names_from = variable, values_from = value) %>%
unnest(cols = c(A, B, C))
name A B C
<chr> <dbl> <dbl> <dbl>
1 optimal 10 20 30
2 optimal 40 80 100
3 control 15 15 15
4 control 15 15 15
5 method_1 11 22 28
6 method_1 44 85 95
7 method_2 9 19 31
8 method_2 39 79 102
I think you need both. Also note that gather and spread have been superseded by pivot_longer and pivot_wider.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -variable) %>%
group_by(variable) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = variable, values_from = value) %>%
select(-row)
# name A B C
# <chr> <dbl> <dbl> <dbl>
#1 optimal 10 20 30
#2 control 15 15 15
#3 method_1 11 22 28
#4 method_2 9 19 31
#5 optimal 40 80 100
#6 control 15 15 15
#7 method_1 44 85 95
#8 method_2 39 79 102
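If the row order of the desired output matters (both rows of each model next to each other), one option is to number the repeats first and sort at the end. This is a sketch under that assumption; the helper column row and the explicit ordering vector model_order are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  variable = c("A", "B", "C", "A", "B", "C"),
  optimal  = c(10, 20, 30, 40, 80, 100),
  control  = c(15, 15, 15, 15, 15, 15),
  method_1 = c(11, 22, 28, 44, 85, 95),
  method_2 = c(9, 19, 31, 39, 79, 102)
)

model_order <- c("optimal", "control", "method_1", "method_2")

wide <- df %>%
  group_by(variable) %>%
  mutate(row = row_number()) %>%        # 1 for the first A/B/C, 2 for the second
  ungroup() %>%
  pivot_longer(cols = -c(variable, row)) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  arrange(match(name, model_order), row) %>%
  select(-row)

wide
```

The row column is what makes each (name, row) pair unique, so pivot_wider returns plain numeric columns instead of list columns.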

Group_by / summarize by two variables within a function

I would like to write a function that summarizes the provided data by some specified criteria, in this case by age.
The example data is a table of users' age and their stats.
df <- data.frame('Age'=rep(18:25,2), 'X1'=10:17, 'X2'=28:35,'X4'=22:29)
Next I define the output columns that are relevant for the analysis
output_columns <- c('Age', 'X1', 'X2', 'X3')
This function computes the sums of X1, X2 and X3 grouped by age.
aggr <- function(data, criteria, output_columns){
k <- data %>% .[, colnames(.) %in% output_columns] %>%
group_by_(.dots = criteria) %>%
#summarise_each(funs(count), age) %>%
summarize_if(is.numeric, sum)
return (k)
}
When I call it like this
> e <- aggr(df, "Age", output_columns)
> e
# A tibble: 8 x 3
Age X1 X2
<int> <int> <int>
1 18 20 56
2 19 22 58
3 20 24 60
4 21 26 62
5 22 28 64
6 23 30 66
7 24 32 68
8 25 34 70
I want to have another column called count which shows the number of observations in each age group. Desired output is
> desired
Age X1 X2 count
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
I have tried different ways to do that, e.g. tally(), summarize_each,
etc. They all deliver wrong results.
I believe there should be an easy and simple way to do that.
Any help is appreciated.
Since you're already summing all variables, you can just add a column of all 1s before the summary function
aggr <- function(data, criteria, output_columns){
data %>%
.[, colnames(.) %in% output_columns] %>%
group_by_(.dots = criteria) %>%
mutate(n = 1L) %>%
summarize_if(is.numeric, sum)
}
# A tibble: 8 x 4
Age X1 X2 n
<int> <int> <int> <int>
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
We could create the 'count' column before summarise_if
aggr<- function(data, criteria, output_columns){
data %>%
select(intersect(names(.), output_columns))%>%
group_by_at(criteria)%>%
group_by(count = n(), add= TRUE) %>%
summarize_if(is.numeric,sum) %>%
select(setdiff(names(.), 'count'), count)
}
aggr(df,"Age",output_columns)
# A tibble: 8 x 4
# Groups: Age [8]
# Age X1 X2 count
# <int> <int> <int> <int>
#1 18 20 56 2
#2 19 22 58 2
#3 20 24 60 2
#4 21 26 62 2
#5 22 28 64 2
#6 23 30 66 2
#7 24 32 68 2
#8 25 34 70 2
In base R you could do
aggr <- function(data, criteria, output_columns){
ds <- data[, colnames(data) %in% output_columns]
d <- aggregate(ds, by=list(criteria), function(x) c(sum(x), length(x)))
"names<-"(do.call(data.frame, d)[, -c(2:3, 5)], c(names(ds), "n"))
}
> with(df, aggr(df, Age, output_columns))
Age X1 X2 n
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
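group_by_() is deprecated and summarize_if() is superseded in current dplyr releases. A sketch of the same function written with across() follows, assuming dplyr 1.0 or later; it is an alternative to the answers above rather than a correction of them.

```r
library(dplyr)

df <- data.frame(Age = rep(18:25, 2), X1 = 10:17, X2 = 28:35, X4 = 22:29)
output_columns <- c("Age", "X1", "X2", "X3")

aggr <- function(data, criteria, output_columns) {
  data %>%
    # any_of() silently skips names that are absent (X3 here)
    select(any_of(output_columns)) %>%
    # criteria is a character vector of grouping column names
    group_by(across(all_of(criteria))) %>%
    summarise(across(where(is.numeric), sum),
              count = n(),               # rows per group, computed alongside the sums
              .groups = "drop")
}

res <- aggr(df, "Age", output_columns)
res
```

Since count is created inside summarise() after the across() call, it is not itself summed, and no helper column of 1s is needed.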

Appending many columns - functions of existing columns - to data frame

I have a data frame with 200 columns: A_1, ..., A_100, B_1, ..., B_100. The entries of A are integers from 1 to 5 or NA, while the entries of B are -1, 0, 1, NA.
I want to append 100 more columns: C_1, ..., C_100 where C_i = A_i + B_i, except when it would yield 0 or 6, in which case it should stay as is.
What would be the best way to do this in R, in terms of clarity and computational complexity? There has to be a better way than a for loop or something like that, perhaps there are functions for this in some library? I'm going to have to do similar operations a lot so I'd like a streamlined method.
You can try:
library(tidyverse)
# some data
d <- data.frame(A_1=1:10,
A_2=1:10,
A_3=1:10,
B_1=11:20,
B_2=21:30,
B_3=31:40)
d %>%
gather(key, value) %>%
separate(key, into = c("a","b")) %>%
group_by(b, a) %>%
mutate(n=row_number()) %>%
unite(a2,b, n) %>%
spread(a, value) %>%
mutate(Sum=A+B) %>%
separate(a2, into = c("a", "b"), remove = T) %>%
select(-A,-B) %>%
mutate(a=paste0("C_",a)) %>%
spread(a, Sum) %>%
arrange(as.numeric(b)) %>%
left_join(d %>% rownames_to_column(), by=c("b"="rowname"))
# A tibble: 10 x 10
b C_1 C_2 C_3 A_1 A_2 A_3 B_1 B_2 B_3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 12 22 32 1 1 1 11 21 31
2 2 14 24 34 2 2 2 12 22 32
3 3 16 26 36 3 3 3 13 23 33
4 4 18 28 38 4 4 4 14 24 34
5 5 20 30 40 5 5 5 15 25 35
6 6 22 32 42 6 6 6 16 26 36
7 7 24 34 44 7 7 7 17 27 37
8 8 26 36 46 8 8 8 18 28 38
9 9 28 38 48 9 9 9 19 29 39
10 10 30 40 50 10 10 10 20 30 40
The idea is to use tidyr's gather and spread to get the columns A and B side by side. Then you can calculate the sum and transform it back to the expected data.frame. This works as long as your data.frame has the same number of A and B columns.
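The answer above does not cover the exception for sums of 0 or 6. A base R sketch that handles it with matrix arithmetic is shown here. It reads "stay as is" as "keep the value of A_i", which is an interpretation of the question, and the small example data frame is made up for illustration.

```r
# Hypothetical data: A in 1..5 or NA, B in -1, 0, 1 or NA
d <- data.frame(
  A_1 = c(1L, 5L, 3L, NA),
  A_2 = c(2L, 1L, 5L, 4L),
  B_1 = c(-1L, 1L, 0L, 1L),
  B_2 = c(1L, -1L, 1L, NA)
)

# Column suffixes shared by the A and B blocks ("1", "2", ...)
idx <- sub("^A_", "", grep("^A_", names(d), value = TRUE))

A <- as.matrix(d[paste0("A_", idx)])
B <- as.matrix(d[paste0("B_", idx)])

C <- A + B                                   # element-wise, NA propagates
out_of_range <- !is.na(C) & (C == 0 | C == 6)
C[out_of_range] <- A[out_of_range]           # assumption: fall back to A_i
colnames(C) <- paste0("C_", idx)

d <- cbind(d, C)
d
```

Everything is vectorised over the whole block of columns, so the same few lines apply unchanged to 100 column pairs.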

Group by followed by select only rows if its value in a particular column is less than its value from the same column

I am new to R
I have a data frame [1390 *6], where the last variable is the rank.
[Example of the Dataset]
So I would like to group by "ID", then drop the rows within each "ID" whose rank is higher than that of DC "15001" (highlighted in yellow).
This is what I have tried so far:
SS3<-SS1 %>% group_by(ID) %>% filter(any(DC== 15001) & any(SS1$rank <SS1$rank[DC== 15001]))
[Expected result]
Example that's similar to the data you provide, with only the relevant rows required for your operation. This should work with your own data (given what you've shown):
set.seed(1)
df <- data.frame(ID=c(rep(2122051,20),rep(2122052,20)),
DC=as.integer(runif(40)*100),
rank=rep(1:20,2),
stringsAsFactors=F)
df$DC[c(10,30)] <- as.integer(15001)
I store the rank of each row where DC==15001 as a vector
positions <- df$rank[df$DC==15001]
[1] 10 10
I use tidyverse map2 to store the entries that have rank less than those indicated in positions for each group.
library(tidyverse)
df1 <- df %>%
group_by(ID) %>%
nest() %>%
mutate(data = map2(data, 1:length(unique(df$ID)), ~head(.x,positions[.y]))) %>%
unnest(data)
Output
ID DC rank
1 2122051 26 1
2 2122051 37 2
3 2122051 57 3
4 2122051 90 4
5 2122051 20 5
6 2122051 89 6
7 2122051 94 7
8 2122051 66 8
9 2122051 62 9
10 2122051 15001 10
11 2122052 93 1
12 2122052 21 2
13 2122052 65 3
14 2122052 12 4
15 2122052 26 5
16 2122052 38 6
17 2122052 1 7
18 2122052 38 8
19 2122052 86 9
20 2122052 15001 10
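The head() approach relies on the rows of each ID being sorted by rank, with rank equal to the row position. A grouped filter avoids that assumption. This is a minimal sketch on the same example data, not the answerer's code; the name kept is illustrative.

```r
library(dplyr)

set.seed(1)
df <- data.frame(ID   = c(rep(2122051, 20), rep(2122052, 20)),
                 DC   = as.integer(runif(40) * 100),
                 rank = rep(1:20, 2))
df$DC[c(10, 30)] <- 15001L

kept <- df %>%
  group_by(ID) %>%
  # keep only IDs that contain DC 15001, and within them only the rows
  # ranked at or above (numerically <=) the 15001 row
  filter(any(DC == 15001), rank <= min(rank[DC == 15001])) %>%
  ungroup()

kept
```

For an ID without any DC of 15001, min() would be called on an empty vector and emit a warning; such groups are dropped by the any() condition, but it may be cleaner to remove them in a separate step first.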
