Create a column from groupby with a calculated label - r

I have a dataframe and I would like to create a column based on grouping another column. The grouping should be in increments of 50 on that column, and the label should be the middle number of each group's range. I am demonstrating this here with a reproducible example.
Here is the dataframe
das <- data.frame(val = 1:27,
                  weigh = c(20,25,37,38,50,52,56,59,64,68,69,70,75,76,82,85,
                            90,100,109,150,161,178,181,179,180,201,201))
val weigh
1 1 20
2 2 25
3 3 37
4 4 38
5 5 50
6 6 52
7 7 56
8 8 59
9 9 64
10 10 68
11 11 69
12 12 70
13 13 75
14 14 76
15 15 82
16 16 85
17 17 90
18 18 100
19 19 109
20 20 150
21 21 161
22 22 178
23 23 181
24 24 179
25 25 180
26 26 201
27 27 201
The desired output will be
val weigh label
1 1 20 45
2 2 25 45
3 3 37 45
4 4 38 45
5 5 50 45
6 6 52 45
7 7 56 45
8 8 59 45
9 9 64 45
10 10 68 45
11 11 69 45
12 12 70 45
13 13 75 95
14 14 76 95
15 15 82 95
16 16 85 95
17 17 90 95
18 18 100 95
19 19 109 95
20 20 150 145
21 21 161 145
22 22 178 195
23 23 181 195
24 24 179 195
25 25 180 195
26 26 201 195
27 27 201 195
Here the 45 is calculated as (20 + (20+50))/2 = 45, where 20 is where the first group starts and 20 + 50 = 70 is where it stops. The label is the middle number between 20 and 70, which is 45.
Similarly for the other labels:
(70 + (70+50))/2 = 95
(120 + (120+50))/2 = 145
(170 + (170+50))/2 = 195
I am new to R and tried looking at many sources here, but I couldn't find anything that will do something like this. The closest I could find is grouping like this using cut2
df %>% mutate(label = as.numeric(cut2(weigh, g=5)))

library(dplyr)
# create your breaks
breaks <- unique(c(seq(min(das$weigh), max(das$weigh) + 1, 50), max(das$weigh) + 1))
das %>%
  group_by(group = cut(weigh, breaks, right = FALSE)) %>%  # group by intervals
  mutate(group2 = as.numeric(group),                       # use the intervals as a number
         label = (breaks[group2] + breaks[group2] + 50) / 2) %>%  # midpoint of the corresponding break
  ungroup()
# # A tibble: 27 x 5
# val weigh group group2 label
# <int> <dbl> <fct> <dbl> <dbl>
# 1 1 20 [20,70) 1 45
# 2 2 25 [20,70) 1 45
# 3 3 37 [20,70) 1 45
# 4 4 38 [20,70) 1 45
# 5 5 50 [20,70) 1 45
# 6 6 52 [20,70) 1 45
# 7 7 56 [20,70) 1 45
# 8 8 59 [20,70) 1 45
# 9 9 64 [20,70) 1 45
#10 10 68 [20,70) 1 45
# # ... with 17 more rows
You can remove any unnecessary columns. I left them there just to make it easier to understand how the process works.
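Since every bin is 50 wide and anchored at the smallest weigh value, the label can also be computed with plain arithmetic, no grouping needed. A base-R sketch of that idea (each label is its bin's start plus 25):

```r
das <- data.frame(val = 1:27,
                  weigh = c(20,25,37,38,50,52,56,59,64,68,69,70,75,76,82,85,
                            90,100,109,150,161,178,181,179,180,201,201))
# which width-50 bin each weigh falls in, counted from the minimum,
# then the bin's midpoint (start + 25)
start <- min(das$weigh)                                   # 20 here
das$label <- start + 50 * floor((das$weigh - start) / 50) + 25
```

This gives the same 45/95/145/195 labels as the cut() approach, without constructing the breaks explicitly.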

Related

Creating a table viable for t.test function

I am given a data frame with multiple variables, but I am only interested in 2 of them and am required to group them into 2 groups (i.e. group 1: mean age at child-birth with 10+ years of education; group 2: mean age at child-birth with less than 10 years of education). I am trying to figure out how to put this into a table, but I am having trouble grouping the rows I want based on years of education. I currently have a table that looks like this, produced with the following code:
means <- table(bfeed_df$ybirth, bfeed_df$yschool)
giving me:
   3 6 7 8 9 10 11 12 13 14 15 16 17 18 19
78 0 0 2 2 5  8  8 26  1  2  1  0  0  0  0
79 1 2 2 3 6 12 12 38 10  5  0  0  0  0  0
80 0 0 0 5 10 11 13 38 10  5  2  0  0  0  0
.
.
I want:
<10years +10years
78 9 46
79 14 77
80 15 88
. . .
. . .
# Let's generate some fake data that matches your input
temp <- matrix(sample(60, 60), ncol = 15)
colnames(temp) <- c(3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
rownames(temp) <- c(78, 79, 80, 81)
# 3 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# 78 5 4 21 13 18 17 34 43 19 41 55 36 12 52 15
# 79 56 14 38 28 30 25 8 44 35 59 39 49 20 2 58
# 80 22 27 3 9 33 54 26 50 53 45 10 40 48 7 6
# 81 42 46 23 1 60 57 47 16 24 51 37 32 11 29 31
Now we can create the summations using apply
# columns 1:5 are yschool 3,6,7,8,9 (<10 years); columns 6:15 are 10+
sums <- t(apply(temp, 1, function(x) c(sum(x[1:5]), sum(x[6:15]))))
colnames(sums) <- c("<10y", "+10y")
sums
#    <10y +10y
# 78   61  324
# 79  166  339
# 80   94  339
# 81  172  335
Is this what you are looking for?
You can use cut to divide yschool into two categories and use that in table. With right = FALSE the breaks give the intervals [-Inf, 10) and [10, Inf), so 10 years of schooling falls in the "10+" group as intended:
means <- table(bfeed_df$ybirth, cut(bfeed_df$yschool, c(-Inf, 10, Inf), right = FALSE))
colnames(means) <- c('<10years', '10+years')
means
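The cut() approach can be checked on simulated data, since the original bfeed_df isn't shown (the data frame below is made up for illustration):

```r
set.seed(1)
# fake stand-in for bfeed_df: birth years 78-80, schooling 3-19 years
bfeed_df <- data.frame(ybirth  = sample(78:80, 50, replace = TRUE),
                       yschool = sample(3:19, 50, replace = TRUE))
# right = FALSE makes the intervals [-Inf, 10) and [10, Inf),
# so exactly-10 years lands in the "10+" column
means <- table(bfeed_df$ybirth,
               cut(bfeed_df$yschool, c(-Inf, 10, Inf), right = FALSE))
colnames(means) <- c("<10years", "10+years")
means
```

Every observation falls in exactly one of the two columns, so each row of means sums to that birth year's total count.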

How can I add a colored legend with ggplot?

I have this dataframe and I want to plot, with ggplot, the result_df50$id column on the x axis and the columns result_df50$Sens and result_df50$Spec on the y axis.
I also want result_df50$Sens and result_df50$Spec displayed in different colors, and the legend should show the color of each column.
> result_df50
Acc Sens Spec id
1 12 51 15 1
2 24 78 28 2
3 31 86 32 3
4 78 23 90 4
5 49 43 56 5
6 25 82 33 6
7 6 87 8 7
8 60 33 61 8
9 54 4 66 9
10 5 54 9 10
11 1 53 4 11
12 2 59 7 12
13 4 73 3 13
14 48 41 55 14
15 30 72 39 15
16 57 10 67 16
17 80 31 91 17
18 30 65 36 18
19 58 45 61 19
20 12 50 19 20
21 39 47 46 21
22 38 49 45 22
23 3 69 5 23
24 68 24 76 24
25 35 64 42 25
So far I tried this and I am happy with it.
ggplot(data = result_df50) +
  geom_line(data = result_df50, aes(x = result_df50$id, y = result_df50$Spec), colour = "blue") +
  geom_line(data = result_df50, aes(x = result_df50$id, y = result_df50$Sens), colour = "red") +
  labs(x = "Number of iterations")
Now I just want to add a legend with the color of each line. I tried fill, but R gives a warning and ignores the unknown aesthetic: fill.
How can I do this?
This is because your dataset has the wrong format (wide). You'll have to convert it into long format to make it work as follows:
result_df50 <- read.table(text="Acc Sens Spec id
1 12 51 15 1
2 24 78 28 2
3 31 86 32 3
4 78 23 90 4
5 49 43 56 5
6 25 82 33 6
7 6 87 8 7
8 60 33 61 8
9 54 4 66 9
10 5 54 9 10
11 1 53 4 11
12 2 59 7 12
13 4 73 3 13
14 48 41 55 14
15 30 72 39 15
16 57 10 67 16
17 80 31 91 17
18 30 65 36 18
19 58 45 61 19
20 12 50 19 20
21 39 47 46 21
22 38 49 45 22
23 3 69 5 23
24 68 24 76 24
25 35 64 42 25")
# conversion to long format
library(reshape2)
result_df50 <- melt(result_df50, id.vars=c("Acc", "id"))
head(result_df50)
# Acc id variable value
# 1 12 1 Sens 51
# 2 24 2 Sens 78
# 3 31 3 Sens 86
# 4 78 4 Sens 23
# 5 49 5 Sens 43
# 6 25 6 Sens 82
# your plot
ggplot(data = result_df50, aes(x = id, y = value, color = variable)) +
  geom_line() +
  labs(x = "Number of iterations") +
  scale_color_manual(values = c("red", "blue"))  # in case you want to keep your colors
Is this what you want?
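On current tidyverse versions, reshape2::melt() is superseded by tidyr::pivot_longer(), which does the same wide-to-long conversion. A sketch using only the first few rows of result_df50 for brevity:

```r
library(tidyr)  # pivot_longer() supersedes reshape2::melt()

# first rows of result_df50, abbreviated for the example
result_df50 <- data.frame(Acc  = c(12, 24, 31),
                          Sens = c(51, 78, 86),
                          Spec = c(15, 28, 32),
                          id   = 1:3)
# stack Sens and Spec into a single value column, keyed by "variable"
long <- pivot_longer(result_df50, cols = c(Sens, Spec),
                     names_to = "variable", values_to = "value")
# feed `long` into the same ggplot call as above:
# ggplot(long, aes(x = id, y = value, color = variable)) + geom_line()
```

The resulting column names (variable/value) match the melt() output, so the ggplot call works unchanged.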

Filter using paste and name in dplyr

Sample data
df <- data.frame(loc.id = rep(1:5, each = 6), day = sample(1:365, 30),
                 ref.day1 = rep(c(20, 30, 50, 80, 90), each = 6),
                 ref.day2 = rep(c(10, 28, 33, 49, 67), each = 6),
                 ref.day3 = rep(c(31, 49, 65, 55, 42), each = 6))
For each loc.id, if I want to keep days that are >= ref.day1, I do this:
df %>% group_by(loc.id) %>% dplyr::filter(day >= ref.day1)
I want to make 3 data frames, each whose rows are filtered by ref.day1, ref.day2,ref.day3 respectively
I tried this:
col.names <- c("ref.day1", "ref.day2", "ref.day3")
temp.list <- list()
for (cl in seq_along(col.names)) {
  col.sub <- col.names[cl]
  columns <- c("loc.id", "day", col.sub)
  df.sub <- df[, columns]
  temp.dat <- df.sub %>% group_by(loc.id) %>%
    dplyr::filter(day >= paste0(col.sub))  # this line does not work
  temp.list[[cl]] <- temp.dat
}
final.dat <- rbindlist(temp.list)  # rbindlist() is from data.table
I was wondering how to refer to columns by names and paste function in dplyr in order to filter it out.
The reason your original code doesn't work is that your col.names are strings, but dplyr uses non-standard evaluation, which doesn't accept strings, so you need to convert each string into a symbol. rlang::sym() can do that.
Also, you can use map function in purrr package, which is much more compact:
library(dplyr)
library(purrr)
col_names <- c("ref.day1","ref.day2","ref.day3")
map(col_names, ~ df %>% dplyr::filter(day >= !!rlang::sym(.x)))
# !! unquotes the symbol (older code spells this UQ(rlang::sym(.x)))
#it will return you a list of dataframes
By the way I removed group_by() because they don't seem to be useful.
Returned result:
[[1]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 362 20 10 31
2 1 69 20 10 31
3 1 65 20 10 31
4 1 88 20 10 31
5 1 142 20 10 31
6 2 355 30 28 49
7 2 255 30 28 49
8 2 136 30 28 49
9 2 156 30 28 49
10 2 194 30 28 49
11 2 204 30 28 49
12 3 129 50 33 65
13 3 254 50 33 65
14 3 279 50 33 65
15 3 201 50 33 65
16 3 282 50 33 65
17 4 351 80 49 55
18 4 114 80 49 55
19 4 338 80 49 55
20 4 283 80 49 55
21 5 199 90 67 42
22 5 141 90 67 42
23 5 241 90 67 42
24 5 187 90 67 42
[[2]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 16 20 10 31
2 1 362 20 10 31
3 1 69 20 10 31
4 1 65 20 10 31
5 1 88 20 10 31
6 1 142 20 10 31
7 2 355 30 28 49
8 2 255 30 28 49
9 2 136 30 28 49
10 2 156 30 28 49
11 2 194 30 28 49
12 2 204 30 28 49
13 3 129 50 33 65
14 3 254 50 33 65
15 3 279 50 33 65
16 3 201 50 33 65
17 3 282 50 33 65
18 4 351 80 49 55
19 4 114 80 49 55
20 4 338 80 49 55
21 4 283 80 49 55
22 4 79 80 49 55
23 5 199 90 67 42
24 5 67 90 67 42
25 5 141 90 67 42
26 5 241 90 67 42
27 5 187 90 67 42
[[3]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 362 20 10 31
2 1 69 20 10 31
3 1 65 20 10 31
4 1 88 20 10 31
5 1 142 20 10 31
6 2 355 30 28 49
7 2 255 30 28 49
8 2 136 30 28 49
9 2 156 30 28 49
10 2 194 30 28 49
11 2 204 30 28 49
12 3 129 50 33 65
13 3 254 50 33 65
14 3 279 50 33 65
15 3 201 50 33 65
16 3 282 50 33 65
17 4 351 80 49 55
18 4 114 80 49 55
19 4 338 80 49 55
20 4 283 80 49 55
21 4 79 80 49 55
22 5 199 90 67 42
23 5 67 90 67 42
24 5 141 90 67 42
25 5 241 90 67 42
26 5 187 90 67 42
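On dplyr 0.7+ you can also skip rlang::sym() entirely and use the .data pronoun, which looks columns up by their string names inside the data mask. A sketch rebuilding the question's sample data (with a set.seed added here so it is reproducible):

```r
suppressPackageStartupMessages(library(dplyr))
library(purrr)

set.seed(1)  # added for reproducibility; the question's code has no seed
df <- data.frame(loc.id = rep(1:5, each = 6), day = sample(1:365, 30),
                 ref.day1 = rep(c(20, 30, 50, 80, 90), each = 6),
                 ref.day2 = rep(c(10, 28, 33, 49, 67), each = 6),
                 ref.day3 = rep(c(31, 49, 65, 55, 42), each = 6))
col_names <- c("ref.day1", "ref.day2", "ref.day3")
# .data[[.x]] resolves the string .x to the matching column of df
res <- map(col_names, ~ df %>% dplyr::filter(day >= .data[[.x]]))
```

res is a list of three data frames, one per reference column, just like the map() + sym() version.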
You may also want to check these:
https://dplyr.tidyverse.org/articles/programming.html
Use variable names in functions of dplyr

Filling "implied missing values" in a data frame that has varying observations per time unit

I have a large dataset with spatiotemporal data. Each set of coordinates are associated with an id (player id in a computer game). Unfortunately the coordinates for each id aren't logged at every time unit. If a reading is not available for a specific id at x time stamp, then that row was entirely omitted from the dataset rather than logged as NA.
I would like to have the same exact amount of observations per time unit as there are unique ids (i.e. inserting "implied missing NAs"). On time units where ids are missing, they should be inserted as new rows with NAs as their coordinates.
Here's a dummy dataset to illustrate:
time <- c(10,10,10,10,11,11,11,11,11,11,12,12,12,12,13,13,14,14,14,14,14,14,15,15,15)
id <- c(1,3,4,5,1,2,3,4,5,6,2,4,5,6,3,6,1,2,3,4,5,6,2,4,5)
x <- c(128,128,64,64,124,128,120,68,64,64,122,71,65,64,112,74,116,114,113,73,70,70,111,75,70)
y <- c(128,128,64,66,125,128,124,66,67,64,124,67,71,68,113,68,115,119,113,76,69,77,116,80,82)
spatiodf <- as.data.frame(cbind(time, id, x, y))
time id x y
1 10 1 128 128
2 10 3 128 128
3 10 4 64 64
4 10 5 64 66
5 11 1 124 125
6 11 2 128 128
7 11 3 120 124
8 11 4 68 66
9 11 5 64 67
10 11 6 64 64
11 12 2 122 124
12 12 4 71 67
13 12 5 65 71
14 12 6 64 68
15 13 3 112 113
16 13 6 74 68
17 14 1 116 115
18 14 2 114 119
19 14 3 113 113
20 14 4 73 76
21 14 5 70 69
22 14 6 70 77
23 15 2 111 116
24 15 4 75 80
25 15 5 70 82
From the above output I would like to get to the following below output where the data frame was recreated with each time unit having an equal amount of observations (and NA values were manually inserted into rows that had missing values).
time <- rep(10:15, each = 6)
id <- rep(1:6, times = 6)
x <- c(128,NA,128,64,64,NA,124,128,120,68,64,64,NA,122,NA,71,65,64,NA,NA,112,NA,NA,74,116,114,113,73,70,70,NA,111,NA,75,70,NA)
y <- c(128,NA,128,64,66,NA,125,128,124,66,67,64,NA,124,NA,67,71,68,NA,NA,113,NA,NA,68,115,119,113,76,69,77,NA,116,NA,80,82,NA)
spatiodf_equal_obs <- as.data.frame(cbind(time, id, x, y))
library(dplyr)
spatiodf_equal_obs %>%
  arrange(id)
time id x y
1 10 1 128 128
2 11 1 124 125
3 12 1 NA NA
4 13 1 NA NA
5 14 1 116 115
6 15 1 NA NA
7 10 2 NA NA
8 11 2 128 128
9 12 2 122 124
10 13 2 NA NA
11 14 2 114 119
12 15 2 111 116
13 10 3 128 128
14 11 3 120 124
15 12 3 NA NA
16 13 3 112 113
17 14 3 113 113
18 15 3 NA NA
19 10 4 64 64
20 11 4 68 66
21 12 4 71 67
22 13 4 NA NA
23 14 4 73 76
24 15 4 75 80
25 10 5 64 66
26 11 5 64 67
27 12 5 65 71
28 13 5 NA NA
29 14 5 70 69
30 15 5 70 82
31 10 6 NA NA
32 11 6 64 64
33 12 6 64 68
34 13 6 74 68
35 14 6 70 77
36 15 6 NA NA
The reason the data needs to be in the above format is because I want to be able to fill in the NA values with the nearest available previous or following entry from the same id. Once we have the dataframe in the above output that can be done using fill() from tidyr:
library(tidyr)
res <- spatiodf_equal_obs %>%
  group_by(id) %>%
  fill(x, y, .direction = "down") %>%
  fill(x, y, .direction = "up")
I've tried a lot of combinations of spreading, gathering (and trickery with creating new dataframes to merge(df1, df2, all=TRUE)). I can't seem to figure out how to go from that first data frame to the second one though.
The final output should look like this:
time id x y
1 10 1 128 128
2 11 1 124 125
3 12 1 124 125
4 13 1 124 125
5 14 1 116 115
6 15 1 116 115
7 10 2 128 128
8 11 2 128 128
9 12 2 122 124
10 13 2 122 124
11 14 2 114 119
12 15 2 111 116
13 10 3 128 128
14 11 3 120 124
15 12 3 120 124
16 13 3 112 113
17 14 3 113 113
18 15 3 113 113
19 10 4 64 64
20 11 4 68 66
21 12 4 71 67
22 13 4 71 67
23 14 4 73 76
24 15 4 75 80
25 10 5 64 66
26 11 5 64 67
27 12 5 65 71
28 13 5 65 71
29 14 5 70 69
30 15 5 70 82
31 10 6 64 64
32 11 6 64 64
33 12 6 64 68
34 13 6 74 68
35 14 6 70 77
36 15 6 70 77
To fill in gaps with values taken from the nearest row, you can do:
library(data.table)
setDT(spatiodf)
resDT <- spatiodf[
  CJ(id = id, time = min(time):max(time), unique = TRUE),
  on = .(id, time), roll = "nearest"
]
# compare with the tidyr result; note that roll = "nearest" takes
# whichever observation is closer in time, so it can differ from the
# down-then-up fill() result when the following reading is nearer than
# the previous one (e.g. id 1 at time 13 takes the time-14 reading)
fsetequal(data.table(res), resDT)
How it works
setDT converts to a data.table in place, so no <- is needed.
DT[i, on=, roll=] uses i to look up rows in DT, rolling each i to a row in DT. The "roll" is done on the final column in on=.
CJ(a, b, unique = TRUE) returns all combos of a and b, like expand.grid in base.
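A tidyverse-only route to the same result is tidyr::complete() (tidyr 1.0+), which inserts the missing time/id rows as NA, chained with fill() to impute them:

```r
suppressPackageStartupMessages(library(dplyr))
library(tidyr)

# the question's dummy data
time <- c(10,10,10,10,11,11,11,11,11,11,12,12,12,12,13,13,
          14,14,14,14,14,14,15,15,15)
id <- c(1,3,4,5,1,2,3,4,5,6,2,4,5,6,3,6,1,2,3,4,5,6,2,4,5)
x <- c(128,128,64,64,124,128,120,68,64,64,122,71,65,64,112,74,
       116,114,113,73,70,70,111,75,70)
y <- c(128,128,64,66,125,128,124,66,67,64,124,67,71,68,113,68,
       115,119,113,76,69,77,116,80,82)
spatiodf <- data.frame(time, id, x, y)

res <- spatiodf %>%
  complete(time = 10:15, id = 1:6) %>%   # add missing time/id rows as NA
  group_by(id) %>%
  fill(x, y, .direction = "downup") %>%  # previous value, then following
  ungroup()
```

complete() replaces the spread/gather/merge trickery in one step, and .direction = "downup" collapses the two fill() calls into one.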

Generate population data with specific distribution in R

I have a distribution of ages in a population.
For instance, you can imagine something like this:
Ages <24: 15%
Ages 25-49: 40%
Ages 50-60: 20%
Ages >60: 25%
I don't have the mean and standard deviation for each stratum/age group in the data. I am trying to generate a sample population of 1000 individuals where the generated data matches the distribution of ages shown above.
Let's put this data in a more friendly format:
(dat <- data.frame(min=c(0, 25, 50, 60), max=c(25, 50, 60, 100), prop=c(0.15, 0.40, 0.20, 0.25)))
# min max prop
# 1 0 25 0.15
# 2 25 50 0.40
# 3 50 60 0.20
# 4 60 100 0.25
We can easily sample 1000 rows of the table using the sample function:
set.seed(144) # For reproducibility
rows <- sample(nrow(dat), 1000, replace=TRUE, prob=dat$prop)
table(rows)
# rows
# 1 2 3 4
# 139 425 198 238
To sample actual ages you will need to define a distribution over the ages represented by each row. A simple one would be uniformly distributed ages:
age <- round(dat$min[rows] + runif(1000) * (dat$max[rows] - dat$min[rows]))
table(age)
# age
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
# 2 5 5 3 7 7 9 6 7 6 1 7 7 5 5 6 2 4 6 7 4 11 8 2 3 10 11 13
# 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
# 19 16 20 16 18 21 16 19 14 20 15 13 18 15 24 20 16 16 29 16 11 12 18 17 17 26 27 21
# 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
# 17 26 11 13 20 3 8 9 6 4 3 3 5 4 3 3 5 8 3 13 5 6 4 7 9 9 6 4
# 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
# 5 5 9 9 5 6 8 9 5 4 6 5 9 6 8 4 1
Of course, if uniformly sampling the ages in each range is inappropriate in your application, then you would need to pick some other function to get ages from buckets.
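You can sanity-check that the simulated ages reproduce the target proportions by cutting them back into the original brackets:

```r
set.seed(144)  # same seed as above, for reproducibility
dat <- data.frame(min = c(0, 25, 50, 60), max = c(25, 50, 60, 100),
                  prop = c(0.15, 0.40, 0.20, 0.25))
rows <- sample(nrow(dat), 1000, replace = TRUE, prob = dat$prop)
age <- dat$min[rows] + runif(1000) * (dat$max[rows] - dat$min[rows])
# empirical share of each age bracket; should be close to dat$prop
prop.table(table(cut(age, c(0, 25, 50, 60, 100), include.lowest = TRUE)))
```

The recovered proportions fluctuate around 0.15/0.40/0.20/0.25 with sampling noise on the order of a percentage point at n = 1000.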
This doesn't do exactly what you were looking for, but does help with the cut-offs. Hope it helps!
install.packages("truncnorm")  # if not already installed
library(truncnorm)
set.seed(123)
pop <- 1000
ages <- rtruncnorm(n = pop, a = 0, b = 100, mean = 40, sd = 25)  # set your own mean and sd
summary(ages)
