How to draw a boxplot by creating age buckets from integer age entries in R

I have a data frame in R, an example of which is given below:
age number_of_visits
19 10
50 24
25 50
24 35
31 19
42 26
55 40
64 15
20 35
67 20
69 18
33 15
28 50
62 18
I need to create age bins like 18 to 24, 25 to 39, 40 to 54, 55 to 65, and above 65, and then for each of these age bins I need to create boxplots for number of visits.
It would be helpful if anyone could provide code to be used in RStudio.
Thank you!

You could do this with cut2 from the Hmisc package:
library(Hmisc)
# Toy Data
age <- rnorm(100, mean=45, sd=15)
number_of_visits <- rnorm(100, mean=20, sd=10)
# cut2 lets you set custom cutpoints
interval <- cut2(age, c(18,25,40,55,65))
boxplot(number_of_visits ~ interval)
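If you prefer to stay in base R, cut() can do the same job; a minimal sketch, assuming the bin labels below match the groups you described (ages under 18 become NA and are dropped from the plot):
# base R alternative: explicit breaks and labels
age_bins <- cut(age,
                breaks = c(18, 25, 40, 55, 65, Inf),
                labels = c("18-24", "25-39", "40-54", "55-65", "65+"),
                right = FALSE)
boxplot(number_of_visits ~ age_bins,
        xlab = "Age group", ylab = "Number of visits")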

Related

Randomly Select 10 percent of data from the whole data set in R

For my project, I have taken a data set which has 1,296,765 observations of 23 columns, and I want to take just 10% of this data randomly. How can I do that in R?
I tried the code below, but it only sampled 10 rows; I wanted to randomly select 10% of the data. I am a beginner, so please help.
library(dplyr)
x <- sample_n(train, 10)
Here is a function from dplyr that selects rows at random by a specified proportion:
dplyr::slice_sample(train, prop = 0.1)
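If your dplyr version predates slice_sample(), the (now superseded) sample_frac() verb also takes a proportion directly:
library(dplyr)
# superseded predecessor of slice_sample(prop = ...)
x <- sample_frac(train, 0.1)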
In base R, you can subset by sampling a proportion of nrow():
set.seed(13)
train <- data.frame(id = 1:101, x = rnorm(101))
train[sample(nrow(train), nrow(train) / 10), ]
id x
69 69 1.14382456
101 101 -0.36917269
60 60 0.69967564
58 58 0.82651036
59 59 1.48369123
72 72 -0.06144699
12 12 0.46187091
89 89 1.60212039
8 8 0.23667967
49 49 0.27714729
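One caveat: nrow(train)/10 need not be a whole number (here 101/10 = 10.1), and sample() truncates a fractional size. Being explicit about the rounding avoids surprises:
# take exactly floor(10%) of the rows
idx <- sample(nrow(train), size = floor(0.1 * nrow(train)))
train_sample <- train[idx, ]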

How to calculate 95% confidence interval for a proportion in R?

Assume I own a factory that produces 150 screws a day and there is a 22% error rate. Now I am going to estimate how many screws are faulty each day for a year (365 days) with
rbinom(n = 365, size = 150, prob = 0.22)
which generates 365 values like this:
45 31 35 31 34 37 33 41 37 37 26 32 37 38 39 35 44 36 25 27 32 25 30 33 25 37 36 31 32 32 43 42 32 33 33 38 26 24 ...
Now, for each of the values generated, I am supposed to calculate a 95% confidence interval for the proportion of faulty screws that day.
I am not sure how to do this. Are there any built-in functions for this (I am not supposed to use any packages), or should I create a new function?
If the number of trials per day is large enough and the probability of failure not too extreme, then you can use the normal approximation (https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval).
# number of failures, for each of the 365 days
f <- rbinom(365, size = 150, prob = 0.22)
# failure rates
p <- f/150
# 95% confidence interval for the failure rate, for each day (upper, then lower)
p + 1.96*sqrt((p*(1-p)/150))
p - 1.96*sqrt((p*(1-p)/150))
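Base R also ships an exact (Clopper-Pearson) interval via binom.test(), which still honours the "no packages" constraint and avoids the approximation for small counts; a sketch:
# exact 95% CI for each day's failure proportion, one row per day
ci <- t(sapply(f, function(k) binom.test(k, n = 150)$conf.int))
colnames(ci) <- c("lower", "upper")
head(ci)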

Recursive regression in R (extract residuals)

For a different question, the following procedure was suggested for a recursive regression of Y on X: start with, say, the first 20 observations and increase the regression window by one observation at a time until it covers the full sample:
X1 <- runif(50, 0, 1)
X2 <- runif(50, 0, 10)
Y <- runif(50, 0, 1)
df <- data.frame(X1,X2,Y)
rolling_lms <- lapply( seq(20,nrow(df) ), function(x) lm( Y ~ X1+X2, data = df[1:x , ]) )
This works fine, but is there a way to:
1. get the residuals for the first 20 observations, and
2. add on the residuals one by one for each subsequent regression,
so that the 21st residual is the one from the regression including 21 observations, the 22nd residual is the one from the regression with 22 observations, and so on?
Here is a possible solution to your problem.
set.seed(1)
X1 <- runif(50, 0, 1)
X2 <- runif(50, 0, 10)
Y <- runif(50, 0, 1)
df <- data.frame(X1,X2,Y)
rolling_lms <- lapply(seq(20,nrow(df)), function(x) lm(Y ~ X1+X2, data = df[1:x , ]))
# for the first window return all 20 residuals; afterwards only the newest one
resk <- function(k) if (k == 1) rolling_lms[[k]]$residuals else tail(rolling_lms[[k]]$residuals, 1)
unlist(sapply(seq_along(rolling_lms), resk))
############
1 2 3 4 5 6
0.051243613 -0.284725835 -0.209235819 0.677747763 0.085196300 -0.077111032
7 8 9 10 11 12
-0.185700617 0.016194254 0.422214060 -0.067994796 0.265315143 0.130531648
13 14 15 16 17 18
-0.083662353 -0.098826853 -0.298235953 -0.459746026 0.282954796 -0.281752756
19 20 21 22 23 24
-0.037180134 0.152774597 0.576060893 -0.121303797 0.001336554 -0.357956306
25 26 27 28 29 30
0.205847757 -0.111231524 -0.082662882 -0.291013740 -0.223480493 0.051223304
31 32 33 34 35 36
0.082970698 -0.393398739 -0.428164426 0.122919273 0.457861478 0.148282532
37 38 39 40 41 42
0.081855106 0.023024731 0.500627476 0.005097244 0.189354101 0.092481013
43 44 45 46 47 48
-0.245542247 -0.217881519 0.234771342 -0.023343600 -0.328489644 0.242163946
49 50
-0.358311100 0.373917319
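If you would rather not keep all 31 fitted lm objects around, the same vector can be built incrementally; a sketch of an equivalent loop:
# residuals of the first window, then one new residual per added observation
res <- residuals(lm(Y ~ X1 + X2, data = df[1:20, ]))
for (n in 21:nrow(df)) {
  fit <- lm(Y ~ X1 + X2, data = df[1:n, ])
  res <- c(res, tail(residuals(fit), 1))
}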

Regroup rows of a data frame for which a column value is less than x

I have this data frame:
> df
Z freq proba
1 17 1 0.0033289263
2 18 4 0.0055569026
3 19 2 0.0087878028
4 20 3 0.0132023556
5 21 16 0.0188900561
6 22 12 0.0257995234
7 23 30 0.0337042731
8 24 41 0.0421963455
9 25 56 0.0507149437
10 26 65 0.0586089198
11 27 65 0.0652230449
12 28 93 0.0699913154
13 29 82 0.0725182432
14 30 94 0.0726318551
15 31 72 0.0703990113
16 32 74 0.0661024717
17 33 58 0.0601873020
18 34 66 0.0531896431
19 35 38 0.0456625487
20 36 45 0.0381117389
21 37 27 0.0309498221
22 38 17 0.0244723502
23 39 15 0.0188543771
24 40 13 0.0141629367
25 41 4 0.0103793600
26 42 1 0.0074254435
27 43 2 0.0051886582
28 45 1 0.0023658767
29 46 1 0.0015453804
30 49 2 0.0003792308
# Here is my data:
> dput(df)
structure(list(Z = c(17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 45, 46, 49), freq = c(1, 4, 2, 3, 16, 12, 30, 41, 56, 65,
65, 93, 82, 94, 72, 74, 58, 66, 38, 45, 27, 17, 15, 13, 4, 1,
2, 1, 1, 2), proba = c(0.0033289262662263, 0.00555690264007235,
0.00878780282243439, 0.0132023555702843, 0.0188900560866825,
0.0257995234198431, 0.0337042730520012, 0.0421963455163949, 0.0507149437492447,
0.0586089198012906, 0.0652230449359029, 0.0699913153996099, 0.0725182432348992,
0.0726318551493006, 0.0703990113442269, 0.0661024716831246, 0.0601873020200862,
0.0531896430528685, 0.045662548708844, 0.0381117389181843, 0.030949822142559,
0.0244723501557229, 0.01885437705459, 0.0141629366839816, 0.0103793599644779,
0.00742544354411115, 0.00518865818999788, 0.00236587669133322,
0.00154538036835848, 0.000379230768851682)), .Names = c("Z",
"freq", "proba"), row.names = c(NA, -30L), class = "data.frame")
I want to regroup rows for which the value of "freq" is < 5 with the next row, merging for as long as the next row is also < 5.
I don't know if I'm clear enough, so this is the output I expect:
> df2
labels effectifs pi
1 17;20 10 0.03087599
2 21 16 0.01889006
3 22 12 0.02579952
4 23 30 0.03370427
5 24 41 0.04219635
6 25 56 0.05071494
7 26 65 0.05860892
8 27 65 0.06522304
9 28 93 0.06999132
10 29 82 0.07251824
11 30 94 0.07263186
12 31 72 0.07039901
13 32 74 0.06610247
14 33 58 0.06018730
15 34 66 0.05318964
16 35 38 0.04566255
17 36 45 0.03811174
18 37 27 0.03094982
19 38 17 0.02447235
20 39 15 0.01885438
21 40 13 0.01416294
22 41;49 11 0.02728395
I did it with nested while loops, but I find this solution very painful and unoptimized.
i <- 1
freqs <- c()
labels <- c()
pi <- c()
while (i < nrow(df)) {
  if (df$freq[i] >= 5) {
    freqs <- c(freqs, df$freq[i])
    labels <- c(labels, df$Z[i])
    pi <- c(pi, df$proba[i])
    i <- i + 1
  } else {
    count <- df$freq[i]
    countPi <- df$proba[i]
    k <- i
    j <- i
    while (df$freq[i] < 5 & i < nrow(df)) {
      if (df$freq[i+1] < 5) {
        count <- count + df$freq[i+1]
        countPi <- countPi + df$proba[i+1]
        j <- i + 1
      }
      i <- i + 1
    }
    labels <- c(labels, paste0(df$Z[k], ";", df$Z[j]))
    freqs <- c(freqs, count)
    pi <- c(pi, countPi)
  }
}
df2 <- data.frame(labels, freqs, pi)
I'm sure there is a far better way, maybe with dplyr. If you have a better solution... Thanks!
We could use the "devel" version of data.table, since it introduces new functions such as rleid. First, convert the data.frame to a data.table (setDT(df)) and create a grouping variable "gr" from the logical index (freq < 5) using rleid. Because the "Z" column is numeric/integer, also create a character column "Z1" from "Z". Then, grouped by "gr": if "freq" is less than 5 for all elements of a group, summarise the rows to a single row by taking the first observation of the columns (.SD[1L]), remove the unwanted columns (.SD includes "Z1", which would otherwise be duplicated), and append a "Z1" obtained by pasting the min and max of "Z" for that group; otherwise, leave the group unchanged (else .SD). Finally, remove the columns we don't need by assigning them NULL.
library(data.table) # data.table_1.9.5
res <- setDT(df)[, gr := rleid(freq < 5)][, Z1 := as.character(Z)][,
  if (all(freq < 5)) c(.SD[1L][, -4, with = FALSE],
                       list(Z1 = toString(c(min(Z), max(Z)))))
  else .SD, gr][, 1:2 := NULL][]
head(res, 3)
# freq proba Z1
#1: 1 0.003328926 17, 20
#2: 16 0.018890056 21
#3: 12 0.025799523 22
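Note that .SD[1L] keeps only the first row's freq and proba for a merged group, while the expected output sums them (the 17;20 group should show 10, not 1). If you want the sums, here is a hedged sketch built on the same rleid grouping:
library(data.table)
res2 <- setDT(df)[, gr := rleid(freq < 5)][,
  if (all(freq < 5))
    # merged run of freq < 5 rows: sum freq/proba, label with the Z range
    .(Z1 = toString(unique(range(Z))), freq = sum(freq), proba = sum(proba))
  else
    # run of freq >= 5 rows: keep each row as-is
    .(Z1 = as.character(Z), freq = freq, proba = proba),
  by = gr][, gr := NULL][]
head(res2, 3)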
Since this is a dplyr question, here is a dplyr solution. First I use a grouping function to define the groups (similar to the rleid function in data.table). The summary is then fairly simple.
# grouping function
grouping <- function(condition){
# calculate runs for grouping
run <- rle((!condition) * 1:length(condition))
# revalue
run$values <- seq_along(run$values)
# invert to get grouping
inverse.rle(run)
}
# load dplyr
require(dplyr)
df %>%
  mutate(group = grouping(freq < 5)) %>%          # add groups
  group_by(group) %>%                             # group data
  summarize(freq = sum(freq),                     # sum freq
            proba = sum(proba),                   # sum proba
            Z = toString(unique(range(Z)))) %>%   # collapse Z into a range label
  mutate(group = NULL)                            # remove groups
## Source: local data table [22 x 3]
##
## freq proba Z
## 1 10 0.03087599 17, 20
## 2 16 0.01889006 21
## 3 12 0.02579952 22
## 4 30 0.03370427 23
## 5 41 0.04219635 24
## 6 56 0.05071494 25
## 7 65 0.05860892 26
## 8 65 0.06522304 27
## 9 93 0.06999132 28
## 10 82 0.07251824 29
## .. ... ... ...
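For completeness, the same run-grouping idea works in base R with no packages at all; a sketch, starting from the original data frame, where a new group begins whenever the current row or its predecessor has freq >= 5:
# each freq >= 5 row is its own group; consecutive freq < 5 rows are merged
start <- df$freq >= 5 | c(TRUE, head(df$freq, -1) >= 5)
gr <- cumsum(start)
df2 <- do.call(rbind, lapply(split(df, gr), function(d)
  data.frame(labels = paste(unique(range(d$Z)), collapse = ";"),
             effectifs = sum(d$freq),
             pi = sum(d$proba))))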

How to generate a frequency table in R with cumulative frequency and relative frequency

I'm new to R. I need to generate a simple frequency table (as in books) with cumulative frequency and relative frequency.
So I want to generate from some simple data like
> x
[1] 17 17 17 17 17 17 17 17 16 16 16 16 16 18 18 18 10 12 17 17 17 17 17 17 17 17 16 16 16 16 16 18 18 18 10
[36] 12 15 19 20 22 20 19 19 19
a table like:
frequency cumulative relative
(9.99,11.7] 2 2 0.04545455
(11.7,13.4] 2 4 0.04545455
(13.4,15.1] 1 5 0.02272727
(15.1,16.9] 10 15 0.22727273
(16.9,18.6] 22 37 0.50000000
(18.6,20.3] 6 43 0.13636364
(20.3,22] 1 44 0.02272727
I know it should be simple, but I don't know how.
I got some results using this code:
factorx <- factor(cut(x, breaks=nclass.Sturges(x)))
as.matrix(table(factorx))
You're close! There are a few functions that will make this easy for you, namely cumsum() and prop.table(). Here's how I'd probably put this together. I make some random data, but the point is the same:
#Fake data
x <- sample(10:20, 44, TRUE)
#Your code
factorx <- factor(cut(x, breaks=nclass.Sturges(x)))
#Tabulate and turn into data.frame
xout <- as.data.frame(table(factorx))
#Add cumFreq and proportions
xout <- transform(xout, cumFreq = cumsum(Freq), relative = prop.table(Freq))
#-----
factorx Freq cumFreq relative
1 (9.99,11.4] 11 11 0.25000000
2 (11.4,12.9] 3 14 0.06818182
3 (12.9,14.3] 11 25 0.25000000
4 (14.3,15.7] 2 27 0.04545455
5 (15.7,17.1] 6 33 0.13636364
6 (17.1,18.6] 3 36 0.06818182
7 (18.6,20] 8 44 0.18181818
The base functions table, cumsum and prop.table should get you there:
cbind( Freq=table(x), Cumul=cumsum(table(x)), relative=prop.table(table(x)))
Freq Cumul relative
10 2 2 0.04545455
12 2 4 0.04545455
15 1 5 0.02272727
16 10 15 0.22727273
17 16 31 0.36363636
18 6 37 0.13636364
19 4 41 0.09090909
20 2 43 0.04545455
22 1 44 0.02272727
With cbind and naming of the columns to your liking, this should be pretty easy for you in the future. The output from the table function is a matrix, so this result is also a matrix. If this were being done on something big, it would be more efficient to do this:
tbl <- table(x)
cbind( Freq=tbl, Cumul=cumsum(tbl), relative=prop.table(tbl))
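To reproduce the binned intervals from the question rather than a table of raw values, the same pattern applies to the factor produced by cut(); a sketch:
# tabulate Sturges-binned values, then add cumulative and relative columns
factorx <- cut(x, breaks = nclass.Sturges(x))
tbl <- table(factorx)
cbind(Freq = tbl, Cumul = cumsum(tbl), relative = prop.table(tbl))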
If you are looking for something pre-packaged, consider the freq() function from the descr package.
library(descr)
x = c(sample(10:20, 44, TRUE))
freq(x, plot = FALSE)
Or to get cumulative percents, use the ordered() function
freq(ordered(x), plot = FALSE)
To add a "cumulative frequencies" column:
tab = as.data.frame(freq(ordered(x), plot = FALSE))
# cumulate over all rows except the trailing "Total" row, then pad with NA
CumFreq = cumsum(tab[-dim(tab)[1], ]$Frequency)
tab$CumFreq = c(CumFreq, NA)
tab
If your data has missing values, a valid percent column is added to the table.
x = c(sample(10:20, 44, TRUE), NA, NA)
freq(ordered(x), plot = FALSE)
Yet another possibility:
library(SciencesPo)
x = c(sample(10:20, 50, TRUE))
freq(x)
My suggestion is to check out the agricolae package:
library(agricolae)
weight <- c(68, 53, 69.5, 55, 71, 63, 76.5, 65.5, 69, 75, 76, 57, 70.5,
            71.5, 56, 81.5, 69, 59, 67.5, 61, 68, 59.5, 56.5, 73,
            61, 72.5, 71.5, 59.5, 74.5, 63)
h1<- graph.freq(weight,col="yellow",frequency=1,las=2,xlab="h1")
print(summary(h1),row.names=FALSE)
