I have a data frame with many variables, but I am only interested in two of them, and I need to split the observations into two groups (group 1: mean age at childbirth with 10+ years of education; group 2: mean age at childbirth with fewer than 10 years of education). I am trying to put this into a table, but I am having trouble grouping the table by years of education. I currently have a table produced by the following code:
'''
means<-table(bfeed_df$ybirth,bfeed_df$yschool)
'''
giving me:
'''
3 6 7 8 9 10 11 12 13 14 15 16 17 18 19
78 0 0 2 2 5 8 8 26 1 2 1 0 0 0 0
79 1 2 2 3 6 12 12 38 10 5 0 0 0 0 0
80 0 0 0 5 10 11 13 38 10 5 2 0 0 0 0
.
.
'''
I want:
<10years 10+years
78 9 46
79 14 77
80 15 79
. . .
. . .
# Let's generate some fake data that matches your input
temp = matrix(sample(60,60), ncol = 15)
colnames(temp) = c(3,6,7,8,9,10,11,12,13,14,15,16,17,18,19)
rownames(temp) = c(78, 79, 80, 81)
# 3 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# 78 5 4 21 13 18 17 34 43 19 41 55 36 12 52 15
# 79 56 14 38 28 30 25 8 44 35 59 39 49 20 2 58
# 80 22 27 3 9 33 54 26 50 53 45 10 40 48 7 6
# 81 42 46 23 1 60 57 47 16 24 51 37 32 11 29 31
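Now we can create the sums using apply. Columns 1 to 5 hold the counts for fewer than 10 years of schooling (3, 6, 7, 8, 9) and columns 6 to 15 hold the counts for 10 or more years:
sums = t(apply(temp, 1, function(x) c(sum(x[1:5]), sum(x[6:15])) ))
colnames(sums) = c("<10y","10+y")
sums
<10y 10+y
78 61 324
79 166 339
80 94 339
81 172 335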
Is this what you are looking for?
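Positional indices are easy to get wrong when some schooling levels are absent from the table (there is no column for 4 or 5 years here), so selecting columns by name may be safer. A minimal sketch, assuming the column names are the years of schooling as in temp above; the matrix is rebuilt with the fixed values shown in the comments instead of sample(), and yrs and sums_by_name are names made up for this illustration:

```r
# Rebuild the example matrix from the values printed above
temp <- matrix(
  c( 5,  4, 21, 13, 18, 17, 34, 43, 19, 41, 55, 36, 12, 52, 15,
    56, 14, 38, 28, 30, 25,  8, 44, 35, 59, 39, 49, 20,  2, 58,
    22, 27,  3,  9, 33, 54, 26, 50, 53, 45, 10, 40, 48,  7,  6,
    42, 46, 23,  1, 60, 57, 47, 16, 24, 51, 37, 32, 11, 29, 31),
  nrow = 4, byrow = TRUE,
  dimnames = list(as.character(78:81), as.character(c(3, 6:19)))
)

# Years of schooling, recovered from the column names
yrs <- as.numeric(colnames(temp))

# Sum the counts on each side of the 10-year boundary, row by row
sums_by_name <- cbind(
  "<10y" = rowSums(temp[, yrs < 10, drop = FALSE]),
  "10+y" = rowSums(temp[, yrs >= 10, drop = FALSE])
)
sums_by_name
```

The two columns always add up to the row totals of temp, which is a quick sanity check that no column was dropped or counted twice.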
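You can use cut to divide yschool into two categories and use it in table. Setting right = FALSE makes the intervals closed on the left, so exactly 10 years of schooling is counted in the 10+ group.
means <- table(bfeed_df$ybirth, cut(bfeed_df$yschool, c(-Inf, 10, Inf), right = FALSE))
colnames(means) <- c('<10years', '10+years')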
means
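Which group the boundary value 10 lands in depends on the right argument of cut(). A small sketch on made-up data (yschool_demo is a hypothetical vector, not taken from bfeed_df):

```r
yschool_demo <- c(3, 9, 10, 11, 12)

# Default right = TRUE: intervals are (-Inf, 10] and (10, Inf],
# so a value of exactly 10 falls in the lower group
default_grp <- cut(yschool_demo, c(-Inf, 10, Inf))
table(default_grp)

# right = FALSE: intervals are [-Inf, 10) and [10, Inf),
# so a value of exactly 10 falls in the upper group
grp <- cut(yschool_demo, c(-Inf, 10, Inf), right = FALSE,
           labels = c("<10years", "10+years"))
table(grp)
```

With the default, three of the five values end up in the lower group; with right = FALSE only two do.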
I have a data frame and would like to add a column based on grouping by another column. The groups should be in increments of 50 on that column, and the label should be the midpoint of each group's range. Here is a reproducible example.
Here is the dataframe
das <- data.frame(val=1:27,
weigh=c(20,25,37,38,50,52,56,59,64,68,69,70,75,76,82,85,90,100,109,150,161,178,181,179,180,201,201))
val weigh
1 1 20
2 2 25
3 3 37
4 4 38
5 5 50
6 6 52
7 7 56
8 8 59
9 9 64
10 10 68
11 11 69
12 12 70
13 13 75
14 14 76
15 15 82
16 16 85
17 17 90
18 18 100
19 19 109
20 20 150
21 21 161
22 22 178
23 23 181
24 24 179
25 25 180
26 26 201
27 27 201
The desired output will be
val weigh label
1 1 20 45
2 2 25 45
3 3 37 45
4 4 38 45
5 5 50 45
6 6 52 45
7 7 56 45
8 8 59 45
9 9 64 45
10 10 68 45
11 11 69 45
12 12 70 45
13 13 75 95
14 14 76 95
15 15 82 95
16 16 85 95
17 17 90 95
18 18 100 95
19 19 109 95
20 20 150 145
21 21 161 145
22 22 178 195
23 23 181 195
24 24 179 195
25 25 180 195
26 26 201 195
27 27 201 195
Here 45 is calculated as (20 + (20 + 50)) / 2 = 45, where 20 is where the group starts and 20 + 50 = 70 is where it stops, so the label is the midpoint between 20 and 70.
Similarly for the other labels:
(70 + 120) / 2 = 95
(120 + 170) / 2 = 145
(170 + 220) / 2 = 195
I am new to R and tried looking at many sources here, but I couldn't find anything that does something like this. The closest I could find is grouping with cut2 from the Hmisc package:
das %>% mutate(label = as.numeric(cut2(weigh, g=5)))
library(dplyr)
# create your breaks
breaks = unique(c(seq(min(das$weigh), max(das$weigh)+1, 50), max(das$weigh)+1))
das %>%
group_by(group = cut(weigh, breaks, right = FALSE)) %>% # group by left-closed intervals
mutate(group2 = as.numeric(group), # position of the interval among the breaks
label = (breaks[group2] + (breaks[group2] + 50)) / 2) %>% # midpoint between the interval start and start + 50
ungroup()
# # A tibble: 27 x 5
# val weigh group group2 label
# <int> <dbl> <fct> <dbl> <dbl>
# 1 1 20 [20,70) 1 45
# 2 2 25 [20,70) 1 45
# 3 3 37 [20,70) 1 45
# 4 4 38 [20,70) 1 45
# 5 5 50 [20,70) 1 45
# 6 6 52 [20,70) 1 45
# 7 7 56 [20,70) 1 45
# 8 8 59 [20,70) 1 45
# 9 9 64 [20,70) 1 45
#10 10 68 [20,70) 1 45
# # ... with 17 more rows
You can remove any unnecessary columns. I left them there just to make it easier to understand how the process works.
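For reference, the same labels can also be computed without grouping, since each label is just the start of a 50-wide bin plus 25. A minimal base R sketch, assuming bins start at the minimum of weigh and are closed on the left, like in the cut() call above; width, start and bin are names introduced here for illustration:

```r
das <- data.frame(val = 1:27,
                  weigh = c(20, 25, 37, 38, 50, 52, 56, 59, 64, 68, 69, 70, 75, 76,
                            82, 85, 90, 100, 109, 150, 161, 178, 181, 179, 180, 201, 201))

width <- 50
start <- min(das$weigh)

# 0-based index of the 50-wide bin each weight falls into
bin <- floor((das$weigh - start) / width)

# Label = midpoint of the bin, i.e. bin start + half the width
das$label <- start + bin * width + width / 2

unique(das$label)
```

Note that with left-closed bins a weight of exactly 70 belongs to the bin starting at 70 and gets the label 95, not 45.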
I have a large dataset with spatiotemporal data. Each set of coordinates is associated with an id (a player id in a computer game). Unfortunately, the coordinates for each id aren't logged at every time unit. If no reading is available for a specific id at a given time stamp, that row is omitted from the dataset entirely rather than logged as NA.
I would like to have exactly as many observations per time unit as there are unique ids (i.e. insert the "implied missing" NAs). For time units where ids are missing, they should be inserted as new rows with NA coordinates.
Here's a dummy dataset to illustrate:
time <- c(10,10,10,10,11,11,11,11,11,11,12,12,12,12,13,13,14,14,14,14,14,14,15,15,15)
id <- c(1,3,4,5,1,2,3,4,5,6,2,4,5,6,3,6,1,2,3,4,5,6,2,4,5)
x <- c(128,128,64,64,124,128,120,68,64,64,122,71,65,64,112,74,116,114,113,73,70,70,111,75,70)
y <- c(128,128,64,66,125,128,124,66,67,64,124,67,71,68,113,68,115,119,113,76,69,77,116,80,82)
spatiodf <- as.data.frame(cbind(time, id, x, y))
time id x y
1 10 1 128 128
2 10 3 128 128
3 10 4 64 64
4 10 5 64 66
5 11 1 124 125
6 11 2 128 128
7 11 3 120 124
8 11 4 68 66
9 11 5 64 67
10 11 6 64 64
11 12 2 122 124
12 12 4 71 67
13 12 5 65 71
14 12 6 64 68
15 13 3 112 113
16 13 6 74 68
17 14 1 116 115
18 14 2 114 119
19 14 3 113 113
20 14 4 73 76
21 14 5 70 69
22 14 6 70 77
23 15 2 111 116
24 15 4 75 80
25 15 5 70 82
From the above data I would like to get to the output below, where every time unit has the same number of observations (here the data frame is rebuilt by hand, with NA inserted for the missing readings).
time <- rep(10:15, each = 6)
id <- rep(1:6, times = 6)
x <- c(128,NA,128,64,64,NA,124,128,120,68,64,64,NA,122,NA,71,65,64,NA,NA,112,NA,NA,74,116,114,113,73,70,70,NA,111,NA,75,70,NA)
y <- c(128,NA,128,64,66,NA,125,128,124,66,67,64,NA,124,NA,67,71,68,NA,NA,113,NA,NA,68,115,119,113,76,69,77,NA,116,NA,80,82,NA)
spatiodf_equal_obs <- as.data.frame(cbind(time, id, x, y))
library(dplyr)
spatiodf_equal_obs %>%
arrange(id)
time id x y
1 10 1 128 128
2 11 1 124 125
3 12 1 NA NA
4 13 1 NA NA
5 14 1 116 115
6 15 1 NA NA
7 10 2 NA NA
8 11 2 128 128
9 12 2 122 124
10 13 2 NA NA
11 14 2 114 119
12 15 2 111 116
13 10 3 128 128
14 11 3 120 124
15 12 3 NA NA
16 13 3 112 113
17 14 3 113 113
18 15 3 NA NA
19 10 4 64 64
20 11 4 68 66
21 12 4 71 67
22 13 4 NA NA
23 14 4 73 76
24 15 4 75 80
25 10 5 64 66
26 11 5 64 67
27 12 5 65 71
28 13 5 NA NA
29 14 5 70 69
30 15 5 70 82
31 10 6 NA NA
32 11 6 64 64
33 12 6 64 68
34 13 6 74 68
35 14 6 70 77
36 15 6 NA NA
The data needs to be in the above format because I want to fill in the NA values with the nearest available previous or following entry from the same id. Once the data frame is in that shape, this can be done using fill() from tidyr:
library(tidyr)
res <- spatiodf_equal_obs %>%
group_by(id) %>%
fill(x, y, .direction = "down") %>%
fill(x, y, .direction = "up")
I've tried a lot of combinations of spreading, gathering (and trickery with creating new dataframes to merge(df1, df2, all=TRUE)). I can't seem to figure out how to go from that first data frame to the second one though.
The final output should look like this:
time id x y
1 10 1 128 128
2 11 1 124 125
3 12 1 124 125
4 13 1 124 125
5 14 1 116 115
6 15 1 116 115
7 10 2 128 128
8 11 2 128 128
9 12 2 122 124
10 13 2 122 124
11 14 2 114 119
12 15 2 111 116
13 10 3 128 128
14 11 3 120 124
15 12 3 120 124
16 13 3 112 113
17 14 3 113 113
18 15 3 113 113
19 10 4 64 64
20 11 4 68 66
21 12 4 71 67
22 13 4 71 67
23 14 4 73 76
24 15 4 75 80
25 10 5 64 66
26 11 5 64 67
27 12 5 65 71
28 13 5 65 71
29 14 5 70 69
30 15 5 70 82
31 10 6 64 64
32 11 6 64 64
33 12 6 64 68
34 13 6 74 68
35 14 6 70 77
36 15 6 70 77
To fill in gaps with values taken from the nearest row, you can do:
library(data.table)
setDT(spatiodf)
resDT = spatiodf[
CJ(id = id, time = min(time):max(time), unique = TRUE), on=.(id, time), roll="nearest"
]
# verify
fsetequal(data.table(res), resDT) # TRUE
How it works
setDT converts to a data.table in place, so no <- is needed.
DT[i, on=, roll=] uses i to look up rows in DT, rolling each i to a row in DT. The "roll" is done on the final column in on=.
CJ(a, b, unique = TRUE) returns all combos of a and b, like expand.grid in base.
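To see the two pieces in isolation, here is a small sketch on made-up readings (DT, grid and filled are names invented for this illustration, not part of the data above). The gaps are chosen so that no lookup time is equally far from two readings. One thing to keep in mind: roll = "nearest" picks the closest time in either direction, which is not necessarily the same as filling down and then up when a gap lies closer to the following reading.

```r
library(data.table)

# Made-up readings: id 1 is logged at times 1 and 6, id 2 only at time 3
DT <- data.table(id = c(1, 1, 2), time = c(1L, 6L, 3L), x = c(10, 60, 30))

# All id/time combinations, similar to expand.grid();
# unique = TRUE drops the repeated id before crossing
grid <- CJ(id = DT$id, time = 1:6, unique = TRUE)
nrow(grid)

# For each row of grid, look up the row of DT with the same id
# whose time is closest to the requested time
filled <- DT[grid, on = .(id, time), roll = "nearest"]

filled[id == 1, x]
filled[id == 2, x]
```

For id 1, times 2 and 3 are closer to the reading at time 1, while times 4 and 5 are closer to the reading at time 6, so they take x = 60. A down-fill would have given 10 for those two rows instead.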