Generate population data with specific distribution in R

I have a distribution of ages in a population.
For instance, you can imagine something like this:
Ages <24: 15%
Ages 25-49: 40%
Ages 50-60: 20%
Ages >60: 25%
I don't have the mean and standard deviation for each stratum/age group in the data. I am trying to generate a sample population of 1000 individuals where the generated data matches the distribution of ages shown above.

Let's put this data in a more friendly format:
(dat <- data.frame(min=c(0, 25, 50, 60), max=c(25, 50, 60, 100), prop=c(0.15, 0.40, 0.20, 0.25)))
#   min max prop
# 1   0  25 0.15
# 2  25  50 0.40
# 3  50  60 0.20
# 4  60 100 0.25
We can easily sample 1000 rows of the table using the sample function:
set.seed(144) # For reproducibility
rows <- sample(nrow(dat), 1000, replace=TRUE, prob=dat$prop)
table(rows)
# rows
#   1   2   3   4
# 139 425 198 238
To sample actual ages you will need to define a distribution over the ages represented by each row. A simple one would be uniformly distributed ages:
age <- round(dat$min[rows] + runif(1000) * (dat$max[rows] - dat$min[rows]))
table(age)
# age
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
# 2 5 5 3 7 7 9 6 7 6 1 7 7 5 5 6 2 4 6 7 4 11 8 2 3 10 11 13
# 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
# 19 16 20 16 18 21 16 19 14 20 15 13 18 15 24 20 16 16 29 16 11 12 18 17 17 26 27 21
# 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
# 17 26 11 13 20 3 8 9 6 4 3 3 5 4 3 3 5 8 3 13 5 6 4 7 9 9 6 4
# 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
# 5 5 9 9 5 6 8 9 5 4 6 5 9 6 8 4 1
Of course, if uniformly sampling the ages in each range is inappropriate in your application, then you would need to pick some other function to get ages from buckets.
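For example, a hypothetical variant that skews ages toward the lower end of each bucket replaces the uniform draw with a Beta(1, 2) draw (the shape parameters here are just an illustration, not something implied by the question):
# Beta(1, 2) puts more mass near 0, so ages cluster toward each bucket's lower bound
age_skewed <- round(dat$min[rows] +
                    rbeta(1000, shape1 = 1, shape2 = 2) *
                    (dat$max[rows] - dat$min[rows]))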

This doesn't do exactly what you were looking for, but does help with the cut-offs. Hope it helps!
install.packages("truncnorm")
library(truncnorm)
set.seed(123)
pop <- 1000
ages <- rtruncnorm(n=pop, a=0, b=100, mean=40, sd=25) # ---> You can set your own mean and sd
summary(ages)
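Whichever generator you use, it can be worth checking how closely the simulated ages reproduce the target proportions; a quick sketch, reusing the break points from dat above:
prop.table(table(cut(ages, breaks = c(0, 25, 50, 60, 100), right = FALSE)))
dat$prop  # target proportions for comparison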

Creating a table viable for t.test function

I am given a data frame with multiple variables, but I am only interested in two of them and need to split the observations into two groups (group 1: mean age at child-birth for mothers with 10+ years of education; group 2: mean age at child-birth for mothers with less than 10 years of education). I am trying to figure out how to put this into a table, but I am having trouble grouping the rows I want based on years of education. I currently have a table produced by the following code:
means <- table(bfeed_df$ybirth, bfeed_df$yschool)
giving me:
    3  6  7  8  9 10 11 12 13 14 15 16 17 18 19
78  0  0  2  2  5  8  8 26  1  2  1  0  0  0  0
79  1  2  2  3  6 12 12 38 10  5  0  0  0  0  0
80  0  0  0  5 10 11 13 38 10  5  2  0  0  0  0
...
I want:
   <10years +10years
78        9       46
79       14       77
80       15       88
...
# Let's generate some fake data that matches your input
temp = matrix(sample(60,60), ncol = 15)
colnames(temp) = c(3,6,7,8,9,10,11,12,13,14,15,16,17,18,19)
rownames(temp) = c(78, 79, 80, 81)
# 3 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# 78 5 4 21 13 18 17 34 43 19 41 55 36 12 52 15
# 79 56 14 38 28 30 25 8 44 35 59 39 49 20 2 58
# 80 22 27 3 9 33 54 26 50 53 45 10 40 48 7 6
# 81 42 46 23 1 60 57 47 16 24 51 37 32 11 29 31
Now we can create the summations using apply
sums = t(apply(temp, 1, function(x) c(sum(x[1:5]), sum(x[6:15]))))  # columns 3-9 are <10 years; columns 10-19 are 10+
colnames(sums) = c("<10y","+10y")
sums
#    <10y +10y
# 78   61  324
# 79  166  339
# 80   94  339
# 81  172  335
Is this what you are looking for?
You can use cut to divide yschool into two categories and use it in table (right = FALSE keeps 10 in the 10+ group):
means <- table(bfeed_df$ybirth, cut(bfeed_df$yschool, c(-Inf, 10, Inf), right = FALSE))
colnames(means) <- c('<10years', '10+years')
means
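Equivalently, a small sketch that builds the two groups from a logical comparison, so there is no need to remember which side of each cut interval is closed (assuming the same bfeed_df columns):
schoolgrp <- factor(bfeed_df$yschool < 10, levels = c(TRUE, FALSE),
                    labels = c('<10years', '10+years'))
means <- table(bfeed_df$ybirth, schoolgrp)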

boot.ci error, due to too little variation in data?

I have a dataset of 79 values ranging from 0 to 22 (and not a lot of variation; the median is 5). Using the boot.ci function, I am trying to bootstrap confidence intervals. I receive the error: [1] "All values of t are equal to 5 \n Cannot calculate confidence intervals"
I assumed this is because my number of replications wasn't high enough (as there is little variation in the data), but increasing it even to 1 million doesn't help.
Does anyone have any tips?
Boot_symp_pres = boot(data$Delay,
                      function(x, i) median(x[i], na.rm = TRUE),
                      R = 1000000)
boot.ci(Boot_symp_pres,
        conf = 0.95,
        type = c("norm", "basic", "perc", "bca"))
Data
Delay
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 2
10 2
11 2
12 2
13 3
14 3
15 3
16 4
17 5
18 5
19 5
20 5
21 5
22 5
23 5
24 5
25 5
26 5
27 5
28 5
29 5
30 5
31 5
32 5
33 5
34 5
35 5
36 5
37 5
38 5
39 5
40 5
41 5
42 5
43 5
44 5
45 5
46 5
47 5
48 5
49 5
50 5
51 5
52 5
53 5
54 5
55 5
56 5
57 5
58 5
59 5
60 5
61 5
62 5
63 5
64 5
65 5
66 5
67 5
68 5
69 5
70 6
71 6
72 6
73 6
74 8
75 9
76 10
77 11
78 13
79 22
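As a quick diagnostic (a sketch, assuming the Delay column above is stored in data$Delay): 53 of the 79 values are exactly 5, so the median of nearly every bootstrap resample is 5 and the bootstrap distribution of the statistic is essentially degenerate, which is what the error message reports.
sum(data$Delay == 5)     # 53 of the 79 values equal 5
median(data$Delay)       # 5
table(Boot_symp_pres$t)  # nearly every bootstrap replicate's median is 5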

Create a column from groupby with a calculated label

I have a data frame and I would like to create a new column based on grouping another column in increments of 50, where the label is the midpoint of each group's range. I am demonstrating this here with a reproducible example.
Here is the dataframe
das <- data.frame(val=1:27,
weigh=c(20,25,37,38,50,52,56,59,64,68,69,70,75,76,82,85,90,100,109,150,161,178,181,179,180,201,201))
val weigh
1 1 20
2 2 25
3 3 37
4 4 38
5 5 50
6 6 52
7 7 56
8 8 59
9 9 64
10 10 68
11 11 69
12 12 70
13 13 75
14 14 76
15 15 82
16 16 85
17 17 90
18 18 100
19 19 109
20 20 150
21 21 161
22 22 178
23 23 181
24 24 179
25 25 180
26 26 201
27 27 201
The desired output will be
val weigh label
1 1 20 45
2 2 25 45
3 3 37 45
4 4 38 45
5 5 50 45
6 6 52 45
7 7 56 45
8 8 59 45
9 9 64 45
10 10 68 45
11 11 69 45
12 12 70 45
13 13 75 95
14 14 76 95
15 15 82 95
16 16 85 95
17 17 90 95
18 18 100 95
19 19 109 95
20 20 150 145
21 21 161 145
22 22 178 195
23 23 181 195
24 24 179 195
25 25 180 195
26 26 201 195
27 27 201 195
Here the 45 is calculated as (20 + (20 + 50)) / 2 = 45, where 20 is where the first group starts and 20 + 50 = 70 is where this group needs to stop; the label is the middle number between 20 and 70, which is 45.
Similarly for the other labels:
(70 + (70 + 50)) / 2 = 95
(120 + 170) / 2 = 145
(170 + 220) / 2 = 195
I am new to R and tried looking at many sources here, but I couldn't find anything that does something like this. The closest I could find is grouping like this using cut2 from the Hmisc package:
df %>% mutate(label = as.numeric(cut2(weigh, g=5)))
library(dplyr)

# create your breaks
breaks = unique(c(seq(min(das$weigh), max(das$weigh) + 1, 50), max(das$weigh) + 1))

das %>%
  group_by(group = cut(weigh, breaks, right = FALSE)) %>%          # group by intervals
  mutate(group2 = as.numeric(group),                               # use the intervals as a number
         label = (breaks[group2] + breaks[group2] + 50) / 2) %>%   # call the corresponding break value and calculate your label
  ungroup()
# # A tibble: 27 x 5
# val weigh group group2 label
# <int> <dbl> <fct> <dbl> <dbl>
# 1 1 20 [20,70) 1 45
# 2 2 25 [20,70) 1 45
# 3 3 37 [20,70) 1 45
# 4 4 38 [20,70) 1 45
# 5 5 50 [20,70) 1 45
# 6 6 52 [20,70) 1 45
# 7 7 56 [20,70) 1 45
# 8 8 59 [20,70) 1 45
# 9 9 64 [20,70) 1 45
#10 10 68 [20,70) 1 45
# # ... with 17 more rows
You can remove any unnecessary columns. I left them there just to make easier to understand how the process works.
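If the buckets are always a fixed 50 units wide and start at min(das$weigh), a more compact sketch computes the label arithmetically, with no grouping step at all:
start <- min(das$weigh)
das$label <- start + 50 * floor((das$weigh - start) / 50) + 25  # bucket lower bound + half the bucket width
This reproduces the labels above (45, 95, 145, 195) because each label is just the midpoint of its 50-unit bucket.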

Distance Matrix from table in R

Good evening,
I need to solve a location problem in R and I'm stuck on one of the first steps.
From a .txt file I need to create a distance matrix using the Euclidean method.
datos <- file.choose()
servidores <- read.table(datos)
servidores
From which I obtain the following information:
X50 is the total number of servers.
X5 is the number of hubs required.
X120 is the total capacity.
The first column gives the x coordinate of each node.
The second column gives the y coordinate of each node.
The third column gives the requirement of each node.
X50 X5 X120
1 2 62 3
2 80 25 14
3 36 88 1
4 57 23 14
5 33 17 19
6 76 43 2
7 77 85 14
8 94 6 6
9 89 11 7
10 59 72 6
11 39 82 10
12 87 24 18
13 44 76 3
14 2 83 6
15 19 43 20
16 5 27 4
17 58 72 14
18 14 50 11
19 43 18 19
20 87 7 15
21 11 56 15
22 31 16 4
23 51 94 13
24 55 13 13
25 84 57 5
26 12 2 16
27 53 33 3
28 53 10 7
29 33 32 14
30 69 67 17
31 43 5 3
32 10 75 3
33 8 26 12
34 3 1 14
35 96 22 20
36 6 48 13
37 59 22 10
38 66 69 9
39 22 50 6
40 75 21 18
41 4 81 7
42 41 97 20
43 92 34 9
44 12 64 1
45 60 84 8
46 35 100 5
47 38 2 1
48 9 9 7
49 54 59 9
50 1 58 2
I tried to use the dist() function:
distance_matrix <- dist(servidores, method = "euclidean", diag = TRUE, upper = TRUE)
but since x and y are on different columns I am not sure what to do to get a 50x50 matrix with all the distances.
Does anybody know how I could create such a matrix?
Many thanks in advance.
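A minimal sketch of one way to do this, assuming the first two columns of servidores hold the x and y coordinates: pass only those columns to dist() and coerce the result to a matrix.
coords <- servidores[, 1:2]      # keep only the x and y coordinate columns
distance_matrix <- as.matrix(dist(coords, method = "euclidean", diag = TRUE, upper = TRUE))
dim(distance_matrix)             # 50 x 50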

Filling "implied missing values" in a data frame that has varying observations per time unit

I have a large dataset with spatiotemporal data. Each set of coordinates is associated with an id (player id in a computer game). Unfortunately the coordinates for each id aren't logged at every time unit. If a reading is not available for a specific id at a given time stamp, that row was omitted entirely from the dataset rather than logged as NA.
I would like to have exactly as many observations per time unit as there are unique ids (i.e. inserting "implied missing" NAs). At time units where ids are missing, they should be inserted as new rows with NA coordinates.
Here's a dummy dataset to illustrate:
time <- c(10,10,10,10,11,11,11,11,11,11,12,12,12,12,13,13,14,14,14,14,14,14,15,15,15)
id <- c(1,3,4,5,1,2,3,4,5,6,2,4,5,6,3,6,1,2,3,4,5,6,2,4,5)
x <- c(128,128,64,64,124,128,120,68,64,64,122,71,65,64,112,74,116,114,113,73,70,70,111,75,70)
y <- c(128,128,64,66,125,128,124,66,67,64,124,67,71,68,113,68,115,119,113,76,69,77,116,80,82)
spatiodf <- as.data.frame(cbind(time, id, x, y))
time id x y
1 10 1 128 128
2 10 3 128 128
3 10 4 64 64
4 10 5 64 66
5 11 1 124 125
6 11 2 128 128
7 11 3 120 124
8 11 4 68 66
9 11 5 64 67
10 11 6 64 64
11 12 2 122 124
12 12 4 71 67
13 12 5 65 71
14 12 6 64 68
15 13 3 112 113
16 13 6 74 68
17 14 1 116 115
18 14 2 114 119
19 14 3 113 113
20 14 4 73 76
21 14 5 70 69
22 14 6 70 77
23 15 2 111 116
24 15 4 75 80
25 15 5 70 82
From the above output I would like to get to the output below, where the data frame is recreated so that each time unit has an equal number of observations (with NA values inserted into the rows that were missing).
time <- rep(10:15, each = 6)
id <- rep(1:6, times = 6)
x <- c(128,NA,128,64,64,NA,124,128,120,68,64,64,NA,122,NA,71,65,64,NA,NA,112,NA,NA,74,116,114,113,73,70,70,NA,111,NA,75,70,NA)
y <- c(128,NA,128,64,66,NA,125,128,124,66,67,64,NA,124,NA,67,71,68,NA,NA,113,NA,NA,68,115,119,113,76,69,77,NA,116,NA,80,82,NA)
spatiodf_equal_obs <- as.data.frame(cbind(time, id, x, y))
library(dplyr)
spatiodf_equal_obs %>%
  arrange(id)
time id x y
1 10 1 128 128
2 11 1 124 125
3 12 1 NA NA
4 13 1 NA NA
5 14 1 116 115
6 15 1 NA NA
7 10 2 NA NA
8 11 2 128 128
9 12 2 122 124
10 13 2 NA NA
11 14 2 114 119
12 15 2 111 116
13 10 3 128 128
14 11 3 120 124
15 12 3 NA NA
16 13 3 112 113
17 14 3 113 113
18 15 3 NA NA
19 10 4 64 64
20 11 4 68 66
21 12 4 71 67
22 13 4 NA NA
23 14 4 73 76
24 15 4 75 80
25 10 5 64 66
26 11 5 64 67
27 12 5 65 71
28 13 5 NA NA
29 14 5 70 69
30 15 5 70 82
31 10 6 NA NA
32 11 6 64 64
33 12 6 64 68
34 13 6 74 68
35 14 6 70 77
36 15 6 NA NA
The reason the data needs to be in the above format is that I want to fill in the NA values with the nearest available previous or following entry from the same id. Once the data frame is in the above format, that can be done using fill() from tidyr:
library(tidyr)
res <- spatiodf_equal_obs %>%
  group_by(id) %>%
  fill(x, y, .direction = "down") %>%
  fill(x, y, .direction = "up")
I've tried a lot of combinations of spreading and gathering (and trickery with creating new data frames to merge(df1, df2, all=TRUE)), but I can't seem to figure out how to go from the first data frame to the second one.
The final output should look like this:
time id x y
1 10 1 128 128
2 11 1 124 125
3 12 1 124 125
4 13 1 124 125
5 14 1 116 115
6 15 1 116 115
7 10 2 128 128
8 11 2 128 128
9 12 2 122 124
10 13 2 122 124
11 14 2 114 119
12 15 2 111 116
13 10 3 128 128
14 11 3 120 124
15 12 3 120 124
16 13 3 112 113
17 14 3 113 113
18 15 3 113 113
19 10 4 64 64
20 11 4 68 66
21 12 4 71 67
22 13 4 71 67
23 14 4 73 76
24 15 4 75 80
25 10 5 64 66
26 11 5 64 67
27 12 5 65 71
28 13 5 65 71
29 14 5 70 69
30 15 5 70 82
31 10 6 64 64
32 11 6 64 64
33 12 6 64 68
34 13 6 74 68
35 14 6 70 77
36 15 6 70 77
To fill in gaps with values taken from the nearest row, you can do:
library(data.table)
setDT(spatiodf)
resDT = spatiodf[
  CJ(id = id, time = min(time):max(time), unique = TRUE),
  on = .(id, time),
  roll = "nearest"
]
# verify
fsetequal(data.table(res), resDT) # TRUE
How it works
setDT converts to a data.table in place, so no <- is needed.
DT[i, on=, roll=] uses i to look up rows in DT, rolling each i to a row in DT. The "roll" is done on the final column in on=.
CJ(a, b, unique = TRUE) returns all combos of a and b, like expand.grid in base.
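For comparison, a tidyverse sketch of the reshaping step itself (my own addition, not part of the original answer): tidyr::complete() inserts the missing id/time combinations as NA rows, which is exactly the intermediate data frame the question asks for; the fill() calls shown earlier then carry values down and up within each id.
library(dplyr)
library(tidyr)

res_tidy <- spatiodf %>%
  complete(id, time = full_seq(time, 1)) %>%   # add the missing id/time rows as NA
  arrange(id, time) %>%
  group_by(id) %>%
  fill(x, y, .direction = "down") %>%
  fill(x, y, .direction = "up") %>%
  ungroup()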
