Creating a table viable for t.test function - r

I am given a data frame with many variables, but I am only interested in two of them, and I need to split the observations into two groups (group 1: mean age at childbirth with 10+ years of education; group 2: mean age at childbirth with less than 10 years of education). I am trying to put this into a table, but I am having trouble grouping the rows by years of education. I currently have a table produced by the following code:
'''
means<-table(bfeed_df$ybirth,bfeed_df$yschool)
'''
giving me:
'''
3 6 7 8 9 10 11 12 13 14 15 16 17 18 19
78 0 0 2 2 5 8 8 26 1 2 1 0 0 0 0
79 1 2 2 3 6 12 12 38 10 5 0 0 0 0 0
80 0 0 0 5 10 11 13 38 10 5 2 0 0 0 0
.
.
'''
I want:
<10years +10years
78 9 46
79 14 77
80 15 88
. . .
. . .

# Let's generate some fake data that matches your input
temp = matrix(sample(60,60), ncol = 15)
colnames(temp) = c(3,6,7,8,9,10,11,12,13,14,15,16,17,18,19)
rownames(temp) = c(78, 79, 80, 81)
# 3 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# 78 5 4 21 13 18 17 34 43 19 41 55 36 12 52 15
# 79 56 14 38 28 30 25 8 44 35 59 39 49 20 2 58
# 80 22 27 3 9 33 54 26 50 53 45 10 40 48 7 6
# 81 42 46 23 1 60 57 47 16 24 51 37 32 11 29 31
Now we can create the summations using apply. The first five columns (3 to 9 years) are below 10 years; the remaining ten columns (10 to 19 years) are 10 or more.
sums = t(apply(temp, 1, function(x) c(sum(x[1:5]), sum(x[6:15])) ))
colnames(sums) = c("<10y","+10y")
sums
<10y +10y
78 61 324
79 166 339
80 94 339
81 172 335
Is this what you are looking for?
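If you would rather not hard-code column positions, the split can be derived from the column names instead. A minimal sketch, assuming the column names are the years of schooling as in the question; it uses the first two rows of the table shown in the question, and the names tab and yrs are made up for illustration:

```r
# First two rows of the table shown in the question (birth years 78 and 79)
tab <- matrix(
  c(0, 0, 2, 2, 5,  8,  8, 26,  1, 2, 1, 0, 0, 0, 0,
    1, 2, 2, 3, 6, 12, 12, 38, 10, 5, 0, 0, 0, 0, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("78", "79"),
                  as.character(c(3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)))
)

# Years of schooling, recovered from the column names
yrs <- as.numeric(colnames(tab))

# Sum the counts on each side of the 10-year threshold
sums <- cbind("<10years" = rowSums(tab[, yrs < 10, drop = FALSE]),
              "10+years" = rowSums(tab[, yrs >= 10, drop = FALSE]))
sums
#    <10years 10+years
# 78        9       46
# 79       14       77
```

This keeps working if the set of schooling values in the data changes, since nothing depends on where the column for 10 years happens to sit.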

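You can use cut to divide yschool into two categories and use it in table. Pass right = FALSE so that exactly 10 years falls into the upper group; by default cut closes intervals on the right, which would put 10 in the lower one.
means <- table(bfeed_df$ybirth, cut(bfeed_df$yschool, c(-Inf, 10, Inf), right = FALSE))
colnames(means) <- c('<10years', '10+years')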
means
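To see how the boundary value is handled, here is a small sketch on made-up data. The data frame toy is a hypothetical stand-in for bfeed_df (only the column names ybirth and yschool come from the question), and right = FALSE is what keeps exactly 10 years in the 10+ group:

```r
# Hypothetical stand-in for bfeed_df, with the two columns of interest
toy <- data.frame(
  ybirth  = c(78, 78, 78, 79, 79, 80),
  yschool = c( 9, 10, 12,  8, 10, 11)
)

# Two schooling groups: [-Inf, 10) and [10, Inf)
grp <- cut(toy$yschool, breaks = c(-Inf, 10, Inf), right = FALSE,
           labels = c("<10years", "10+years"))

# Cross-tabulate birth year against the schooling group
tab <- table(toy$ybirth, grp)
tab
```

Both rows with yschool equal to 10 are counted under 10+years here; without right = FALSE they would be counted under <10years.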

Related

boot.ci error, due to too little variation in data?

I have a dataset of 79 values ranging from 0-22 (and not a lot of variation; median 5). Using the boot.ci function, I am trying to bootstrap to calculate confidence intervals. I receive the error: [1] "All values of t are equal to 5 \n Cannot calculate confidence intervals"
I assumed this is because my number of replications wasn't high enough (as there is little variation in the data), but increasing to even 1M doesn't help.
Does anyone have any tips?
library(boot)
Boot_symp_pres = boot(data$Delay,
function(x,i) median(x[i], na.rm=TRUE),
R=1000000)
boot.ci(Boot_symp_pres,
conf = 0.95,
type = c("norm", "basic" ,"perc", "bca"))
Data
Delay
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 2
10 2
11 2
12 2
13 3
14 3
15 3
16 4
17 5
18 5
19 5
20 5
21 5
22 5
23 5
24 5
25 5
26 5
27 5
28 5
29 5
30 5
31 5
32 5
33 5
34 5
35 5
36 5
37 5
38 5
39 5
40 5
41 5
42 5
43 5
44 5
45 5
46 5
47 5
48 5
49 5
50 5
51 5
52 5
53 5
54 5
55 5
56 5
57 5
58 5
59 5
60 5
61 5
62 5
63 5
64 5
65 5
66 5
67 5
68 5
69 5
70 6
71 6
72 6
73 6
74 8
75 9
76 10
77 11
78 13
79 22
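One way to check whether the number of replications is really the issue is to look at the bootstrap medians directly. A rough sketch in base R, without the boot package; the vector delay is rebuilt from the counts in the listing above, and the seed and the 2000 replicates are arbitrary choices:

```r
# Delay values from the listing above, rebuilt from their counts
delay <- c(rep(0, 5), rep(1, 3), rep(2, 4), rep(3, 3), 4,
           rep(5, 53), rep(6, 4), 8, 9, 10, 11, 13, 22)
length(delay)   # 79 observations, as in the question

# Resample with replacement and take the median each time
set.seed(1)     # arbitrary seed, only for reproducibility
meds <- replicate(2000, median(sample(delay, replace = TRUE)))

# How often does each value occur among the bootstrap medians?
table(meds)
```

Since 53 of the 79 values are equal to 5, a resample needs at least 40 values below 5 (or at least 40 above 5) for its median to move, which is extremely unlikely. If practically every replicate is 5, there is no spread for an interval to be computed from, and adding replicates would not change that.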

R plot numbers of factor levels having n, n+1, .... counts

I have a very large dataset (> 200000 lines) with 6 variables (only the first two shown)
>head(gt7)
ChromKey POS
1 2447 25
2 2447 183
3 26341 75
4 26341 2213
5 26341 2617
6 54011 1868
I have converted the ChromKey variable to a factor with > 55000 levels.
> gt7[1] <- lapply(gt7[1], factor)
> is.factor(gt7$ChromKey)
[1] TRUE
I can further make a table with counts of ChromKey levels
> table(gt7$ChromKey)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
88 88 44 33 11 11 33 22 121 11 22 11 11 11 22 11 33
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
22 22 44 55 22 11 22 66 11 11 11 22 11 11 11 187 77
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
77 11 44 11 11 11 11 11 11 22 66 11 22 11 44 22 22
... output cropped
Which I can save in table format
> table <- table(gt7$ChromKey)
> head(table)
1 2 3 4 5 6
88 88 44 33 11 11
I would like to know whether it is possible to get a table (and a histogram) of the number of levels that have each specific count. From the example above, I would expect
88 44 33 11
2 1 1 2
I would very much appreciate any hint.
We can apply table again on the output to get the frequency count of the frequency
table(table(gt7$ChromKey))
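A small sketch of what that gives, using only the six counts shown by head(table) in the question (the vector counts is typed in by hand for illustration, and the axis labels are just suggestions):

```r
# The six counts shown by head(table) in the question, keyed by ChromKey level
counts <- c("1" = 88, "2" = 88, "3" = 44, "4" = 33, "5" = 11, "6" = 11)

# Tabulating the counts gives the number of levels per count value
freq <- table(counts)
freq
# counts
# 11 33 44 88
#  2  1  1  2

# A bar chart of the same information
barplot(freq, xlab = "count per ChromKey level", ylab = "number of levels")
```

Note that the result is sorted by count value in increasing order, so the columns come out as 11, 33, 44, 88 rather than in the order written in the question.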

Create a column from groupby with a calculated label

I have a dataframe and I would like to create a dataframe column based on the groupby on another column. The group by should be in increments of 50 on the column and the label should be the middle number in the group numbers. I am demonstrating this here with a reproducible example.
Here is the dataframe
das <- data.frame(val=1:27,
weigh=c(20,25,37,38,50,52,56,59,64,68,69,70,75,76,82,85,90,100,109,150,161,178,181,179,180,201,201))
val weigh
1 1 20
2 2 25
3 3 37
4 4 38
5 5 50
6 6 52
7 7 56
8 8 59
9 9 64
10 10 68
11 11 69
12 12 70
13 13 75
14 14 76
15 15 82
16 16 85
17 17 90
18 18 100
19 19 109
20 20 150
21 21 161
22 22 178
23 23 181
24 24 179
25 25 180
26 26 201
27 27 201
The desired output will be
val weigh label
1 1 20 45
2 2 25 45
3 3 37 45
4 4 38 45
5 5 50 45
6 6 52 45
7 7 56 45
8 8 59 45
9 9 64 45
10 10 68 45
11 11 69 45
12 12 70 45
13 13 75 95
14 14 76 95
15 15 82 95
16 16 85 95
17 17 90 95
18 18 100 95
19 19 109 95
20 20 150 145
21 21 161 145
22 22 178 195
23 23 181 195
24 24 179 195
25 25 180 195
26 26 201 195
27 27 201 195
Here 45 is calculated as (20 + (20+50))/2 = 45, where 20 is where the group starts and 20+50 = 70 is where it needs to stop. The label is the midpoint between 20 and 70, which is 45.
Similarly for the other labels:
(70 + (70+50))/2 = 95
(120 + 170)/2 = 145
(170 + 220)/2 = 195
I am new to R and tried looking at many sources here, but I couldn't find anything that will do something like this. The closest I could find is grouping like this using cut2
df %>% mutate(label = as.numeric(cut2(weigh, g=5)))
library(dplyr)
# create your breaks
breaks = unique(c(seq(min(das$weigh), max(das$weigh)+1, 50), max(das$weigh)+1))
das %>%
group_by(group = cut(weigh, breaks, right=F)) %>% # group by intervals
mutate(group2 = as.numeric(group), # use the intervals as a number
label = (breaks[group2]+breaks[group2]+50)/2) %>% # call the corresponding break value and calculate your label
ungroup()
# # A tibble: 27 x 5
# val weigh group group2 label
# <int> <dbl> <fct> <dbl> <dbl>
# 1 1 20 [20,70) 1 45
# 2 2 25 [20,70) 1 45
# 3 3 37 [20,70) 1 45
# 4 4 38 [20,70) 1 45
# 5 5 50 [20,70) 1 45
# 6 6 52 [20,70) 1 45
# 7 7 56 [20,70) 1 45
# 8 8 59 [20,70) 1 45
# 9 9 64 [20,70) 1 45
#10 10 68 [20,70) 1 45
# # ... with 17 more rows
You can remove any unnecessary columns. I left them in just to make it easier to understand how the process works.
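Because every group has the same width, the label can also be computed with plain arithmetic, without cut. A minimal base R sketch, assuming groups of width 50 starting at the smallest weigh and closed on the left; the names width, start and bin are made up for illustration:

```r
das <- data.frame(val = 1:27,
                  weigh = c(20, 25, 37, 38, 50, 52, 56, 59, 64, 68, 69, 70, 75, 76,
                            82, 85, 90, 100, 109, 150, 161, 178, 181, 179, 180, 201, 201))

width <- 50               # size of each group
start <- min(das$weigh)   # the first group starts at the smallest value (20)

# Zero-based index of the group each row falls into: [20,70) -> 0, [70,120) -> 1, ...
bin <- floor((das$weigh - start) / width)

# Midpoint of the group: lower bound plus half the width
das$label <- start + bin * width + width / 2

unique(das$label)
# [1]  45  95 145 195
```

As with right=F in the cut version, a weigh of exactly 70 lands in the second group here and gets 95, whereas the desired output in the question shows 45 for that row, so adjust the boundary handling if the upper end should be inclusive.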

Distance Matrix from table in R

Good evening,
I need to solve a location problem in R and I'm stuck in one of the first steps.
From a .txt file I need to create a distance matrix using the euclidean method.
datos <- file.choose()
servidores <- read.table(datos)
servidores
From which I obtain the following information:
X50 shows the total number of servers.
X5 shows the number of hubs required.
X120 shows the total capacity.
The first column shows the x coordinate.
The second column shows the y coordinate.
The third column shows the requirements of the node.
X50 X5 X120
1 2 62 3
2 80 25 14
3 36 88 1
4 57 23 14
5 33 17 19
6 76 43 2
7 77 85 14
8 94 6 6
9 89 11 7
10 59 72 6
11 39 82 10
12 87 24 18
13 44 76 3
14 2 83 6
15 19 43 20
16 5 27 4
17 58 72 14
18 14 50 11
19 43 18 19
20 87 7 15
21 11 56 15
22 31 16 4
23 51 94 13
24 55 13 13
25 84 57 5
26 12 2 16
27 53 33 3
28 53 10 7
29 33 32 14
30 69 67 17
31 43 5 3
32 10 75 3
33 8 26 12
34 3 1 14
35 96 22 20
36 6 48 13
37 59 22 10
38 66 69 9
39 22 50 6
40 75 21 18
41 4 81 7
42 41 97 20
43 92 34 9
44 12 64 1
45 60 84 8
46 35 100 5
47 38 2 1
48 9 9 7
49 54 59 9
50 1 58 2
I tried to use the dist() function:
distance_matrix <-dist(servidores,method = "euclidean",diag = TRUE,upper = TRUE)
but since x and y are in different columns, I am not sure what to do to get a 50x50 matrix with all the distances.
Does anybody know how I could create such a matrix?
Many thanks in advance.
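For what it is worth, dist() treats each row as one point and each column as one coordinate, so having x and y in separate columns is the layout it expects; the thing to watch is that the third (requirements) column is left out of the call. A minimal sketch using the first three nodes from the listing above; the data frame coords and its column names x and y are made up for illustration:

```r
# First three nodes from the listing: x and y only, requirements left out
coords <- data.frame(x = c(2, 80, 36),
                     y = c(62, 25, 88))

# dist() returns a compact "dist" object; as.matrix() expands it to n x n
D <- as.matrix(dist(coords, method = "euclidean"))

dim(D)    # 3 x 3 here
D[1, 2]   # distance between nodes 1 and 2: sqrt(78^2 + 37^2)
```

Applied to the full table, something like as.matrix(dist(servidores[, 1:2])) should give the 50x50 matrix, assuming the first two columns hold the coordinates as described.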

Generate population data with specific distribution in R

I have a distribution of ages in a population.
For instance, you can imagine something like this:
Ages <24: 15%
Ages 25-49: 40%
Ages 50-60: 20%
Ages >60: 25%
I don't have the mean and standard deviation for each stratum/age group in the data. I am trying to generate a sample population of 1000 individuals where the generated data matches the distribution of ages shown above.
Let's put this data in a more friendly format:
(dat <- data.frame(min=c(0, 25, 50, 60), max=c(25, 50, 60, 100), prop=c(0.15, 0.40, 0.20, 0.25)))
# min max prop
# 1 0 25 0.15
# 2 25 50 0.40
# 3 50 60 0.20
# 4 60 100 0.25
We can easily sample 1000 rows of the table using the sample function:
set.seed(144) # For reproducibility
rows <- sample(nrow(dat), 1000, replace=TRUE, prob=dat$prop)
table(rows)
# rows
# 1 2 3 4
# 139 425 198 238
To sample actual ages you will need to define a distribution over the ages represented by each row. A simple one would be uniformly distributed ages:
age <- round(dat$min[rows] + runif(1000) * (dat$max[rows] - dat$min[rows]))
table(age)
# age
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
# 2 5 5 3 7 7 9 6 7 6 1 7 7 5 5 6 2 4 6 7 4 11 8 2 3 10 11 13
# 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
# 19 16 20 16 18 21 16 19 14 20 15 13 18 15 24 20 16 16 29 16 11 12 18 17 17 26 27 21
# 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
# 17 26 11 13 20 3 8 9 6 4 3 3 5 4 3 3 5 8 3 13 5 6 4 7 9 9 6 4
# 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
# 5 5 9 9 5 6 8 9 5 4 6 5 9 6 8 4 1
Of course, if uniformly sampling the ages in each range is inappropriate in your application, then you would need to pick some other function to get ages from buckets.
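If the group sizes should match the proportions exactly rather than on average, the rows can be repeated a fixed number of times instead of being sampled. A minimal sketch under the same uniform-within-bucket assumption; it uses floor() instead of round() so that each age stays inside its own bucket, and the name n is made up for illustration:

```r
dat <- data.frame(min  = c(0, 25, 50, 60),
                  max  = c(25, 50, 60, 100),
                  prop = c(0.15, 0.40, 0.20, 0.25))
n <- 1000

# Repeat each row index a fixed number of times: 150, 400, 200 and 250
rows <- rep(seq_len(nrow(dat)), times = round(n * dat$prop))
table(rows)

# Uniform ages within each bucket; floor() keeps them in [min, max)
set.seed(144)  # for reproducibility
age <- floor(runif(n, min = dat$min[rows], max = dat$max[rows]))
summary(age)
```

This only works cleanly when n times each proportion is a whole number; otherwise the rounded counts may not add up to n and would need adjusting.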
This doesn't do exactly what you were looking for, but does help with the cut-offs. Hope it helps!
install.packages("truncnorm")
library(truncnorm)
set.seed(123)
pop <- 1000
ages <- rtruncnorm(n=pop, a=0, b=100, mean=40, sd=25) # ---> You can set your own mean and sd
summary(ages)
