Creating a dataframe of means of 5 randomly sampled observations in R

I'm currently reading "Practical Statistics for Data Scientists" and following along in R as they demonstrate some code. There is one chunk of code I'm particularly struggling to follow the logic of and was hoping someone could help. The code in question is creating a dataframe with 1000 rows where each observation is the mean of 5 randomly drawn income values from the dataframe loans_income. However, I'm getting confused about the logic of the code as it is fairly complicated with a tapply() function and nested rep() statements.
The code to create the dataframe in question is as follows:
samp_mean_5 <- data.frame(income = tapply(sample(loans_income$income, 1000*5),
                                          rep(1:1000, rep(5, 1000)),
                                          FUN = mean),
                          type = 'mean_of_5')
In particular, I'm confused about the nested rep() statements and the 1000*5 portion of the sample() function. Any help understanding the logic of the code would be greatly appreciated!
For reference, the original dataset loans_income simply has a single column of 50,000 income values.

You have 50,000 income values in a single vector. Let's break your code down:
tapply(sample(loans_income$income, 1000*5),
       rep(1:1000, rep(5, 1000)),
       FUN = mean)
I will replace 1000 with 10 and income with random numbers, so it's easier to explain. I also set set.seed(1) so the result can be reproduced.
sample(loans_income$income,1000*5)
We draw 50 random incomes from your vector without replacement. They are (temporarily) put into a vector of length 50, so the output looks like this:
> sample(runif(50000),10*5)
[1] 0.73283101 0.60329970 0.29871173 0.12637654 0.48434952 0.01058067 0.32337850
[8] 0.46873561 0.72334215 0.88515494 0.44036341 0.81386225 0.38118213 0.80978822
[15] 0.38291273 0.79795343 0.23622492 0.21318431 0.59325586 0.78340477 0.25623138
[22] 0.64621658 0.80041393 0.68511759 0.21880083 0.77455662 0.05307712 0.60320912
[29] 0.13191926 0.20816298 0.71600799 0.70328349 0.44408218 0.32696205 0.67845445
[36] 0.64438336 0.13241312 0.86589561 0.01109727 0.52627095 0.39207860 0.54643661
[43] 0.57137320 0.52743012 0.96631114 0.47151170 0.84099503 0.16511902 0.07546454
[50] 0.85970500
rep(1:1000,rep(5,1000))
Now we are creating an indexing vector of length 50:
> rep(1:10,rep(5,10))
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6
[29] 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10
Those indices "group" the samples from step 1. So basically this vector tells R that the first 5 entries of your "sample vector" belong together (index 1), the next 5 entries belong together (index 2) and so on.
FUN = mean
This just applies the mean function to the data in each group.
tapply
So tapply takes the sampled data (the sample() part), groups it by the second argument (the rep() part), and applies the mean function to each group.
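If it helps to see what tapply is doing, it behaves like splitting the data by the grouping vector and then applying the function to each piece; a rough base R equivalent, using the 10-group version:
set.seed(1)
x <- sample(runif(5000), 10*5)
g <- rep(1:10, rep(5, 10))
sapply(split(x, g), mean) # same group means as tapply(x, g, mean)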
If you are familiar with data.frames and the dplyr package, take a look at this (only the first 10 rows are displayed):
set.seed(1)
df <- data.frame(income=sample(runif(5000),10*5), index=rep(1:10,rep(5,10)))
income index
1 0.42585569 1
2 0.16931091 1
3 0.48127444 1
4 0.68357403 1
5 0.99374923 1
6 0.53227877 2
7 0.07109499 2
8 0.20754511 2
9 0.35839481 2
10 0.95615917 2
I attached an index to the random numbers (your income). Now we calculate the mean per group:
library(dplyr)
df %>%
  group_by(index) %>%
  summarise(mean = mean(income))
which gives us
# A tibble: 10 x 2
index mean
<int> <dbl>
1 1 0.551
2 2 0.425
3 3 0.827
4 4 0.391
5 5 0.590
6 6 0.373
7 7 0.514
8 8 0.451
9 9 0.566
10 10 0.435
Compare it to
set.seed(1)
tapply(sample(runif(5000), 10*5),
       rep(1:10, rep(5, 10)),
       mean)
which yields basically the same result:
1 2 3 4 5 6 7 8 9
0.5507529 0.4250946 0.8273149 0.3905850 0.5902823 0.3730092 0.5143829 0.4512932 0.5658460
10
0.4352546
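Putting it all back together at full scale: loans_income is not shared here, so this sketch simulates 50,000 incomes (the uniform range is an arbitrary stand-in) purely to make the snippet runnable end to end:
set.seed(1)
loans_income <- data.frame(income = runif(50000, 20000, 120000)) # simulated stand-in data
samp_mean_5 <- data.frame(income = tapply(sample(loans_income$income, 1000*5),
                                          rep(1:1000, rep(5, 1000)),
                                          FUN = mean),
                          type = 'mean_of_5')
nrow(samp_mean_5) # 1000 rows, each the mean of 5 sampled incomes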

Related

Different ways of indexing dataframe in R

Say, I have a dataframe df in R as follows,
id inflam
1 1 0.03093764
2 2 0.50115406
3 3 0.82153770
4 4 0.01985961
5 5 0.04994588
6 6 0.91714810
7 7 0.83438400
8 8 0.80832225
9 9 0.12360681
10 10 0.08490079
I can access the entirety of the inflam column by indexing as df[,2] or df[2]. However, typeof(df[,2]) returns double, whereas typeof(df[2]) returns list. The comma seems to be the differentiator, but why is this the case? What is going on under the hood?
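The comma switches between two subsetting methods: df[, 2] is matrix-style indexing, which by default simplifies a single column to a bare vector (drop = TRUE), while df[2] is list-style indexing, which returns a one-column data frame, and a data frame is a list of columns. A quick illustration using the df above:
typeof(df[, 2])               # "double" - simplified to a numeric vector
typeof(df[2])                 # "list"   - a one-column data frame (a list)
typeof(df[, 2, drop = FALSE]) # "list"   - matrix-style, simplification turned off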

purrr map / lapply / sapply across groups of multiple (n > 1) elements at a time?

Suppose we have a vector; we can easily enough lapply, sapply, or map across one element at a time.
Is there a way to do the same across groups of (>1) elements of the vector?
Example
Suppose we are constructing API calls by appending comma-separated user_identifiers to the URL, like so:
user_identifiers <- c("0011399", "0011400", "0013581", "0013769", "0013770", "0018374",
"0018376", "0018400", "0018401", "0018410", "0018415", "0018417",
"0018419", "0018774", "0018775", "0018776", "0018777", "0018778",
"0018779", "0021627", "0023492", "0023508", "0023511", "0023512",
"0024120", "0025672", "0025673", "0025675", "0025676", "0028226",
"0028227", "0028266", "0028509", "0028510", "0028512", "0028515",
"0028518", "0028520", "0028523", "0029160", "0033141", "0034586",
"0035035", "0035310", "0035835", "0035841", "0035862", "0036503",
"0036580", "0036583", "0036587", "0037577", "0038582", "0038583",
"0038587", "0039727", "0039729", "0039731", "0044703", "0044726"
)
library(jsonlite) # assumed source of fromJSON

get_data <- function(user_identifier) {
  url <- paste0("https://www.myapi.com?userIdentifier=",
                paste0(user_identifier, collapse = ","))
  fromJSON(url)
}
In the above, get_data(user_identifiers) would return the API's response for all 60 user_identifiers in one single request.
But suppose the API accepts a maximum of 10 identifiers at a time (so we cannot do all 60 at once).
A simple solution could be to map/lapply/sapply over each element, e.g. sapply(user_identifiers, get_data). This would work fine; however, we would make 60 API calls, when all we really need is 6. If we could map/lapply/sapply over groups of 10 at a time, that would be ideal.
Question
Is there an elegant way to map/lapply/sapply over groups of n elements at a time (where n>1)?
We can split user_identifiers into groups of 10 and use sapply/map/lapply:
sapply(split(user_identifiers, gl(length(user_identifiers)/10, 10)), get_data)
where gl creates groups from 1 to 6 each of length 10.
gl(length(user_identifiers)/10, 10)
# [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
# [31] 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6
# Levels: 1 2 3 4 5 6
The same groups can be created with rep:
rep(1:ceiling(length(user_identifiers)/10), each = 10)
As #thelatemail mentioned, we can use cut and specify the number of groups to cut the data into:
sapply(split(user_identifiers, cut(seq_along(user_identifiers),6)), get_data)
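Since the question also asks about purrr, the map equivalent is direct; a sketch (it returns a list of responses rather than sapply's simplified result):
library(purrr)
map(split(user_identifiers, gl(length(user_identifiers)/10, 10)), get_data)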

Frequency distribution using binCounts

I have a dataset of customer ages, and I want to make a frequency distribution with age bins 9 years wide.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to the table shared below; variable names can differ (name them as you wish).
Could I use binCounts for this? If yes, could you help me out with the code, as I'm not sure what bx and idxs are in its signature?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know about binCounts or even the package it is in, but here is a base R solution:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit = c(37,46,55,64,73,82,91,100)
Labels = paste(head(lowerlimit,-1)+1, lowerlimit[-1], sep="-") # add one to get 38, 47, etc.
group = cut(Ages, lowerlimit, Labels) # determine which group each age belongs to
tab = table(group) # form a frequency table
as.data.frame(tab) # transform the table into a dataframe
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))
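To answer the binCounts part of the question (binCounts comes from the matrixStats package, assumed installed): bx takes the B+1 bin boundaries, and the default right = FALSE makes the bins left-closed, i.e. [38,47), [47,56), ..., which matches the 38-46, 47-55, ... groups above. A sketch:
library(matrixStats)
bx <- seq(38, 101, by = 9) # 38 47 56 65 74 83 92 101
data.frame(Age = paste(head(bx, -1), bx[-1] - 1, sep = "-"),
           Count = binCounts(Ages, bx = bx))
which reproduces the desired table.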

An easier way to get average of a table with some conditions in R

I am trying to get the average of all 6 quizzes for each male student.
Here is part of the code that I've tried:
a<-subset(mydf,Sex=="M")
b<-a[4:9]
b
sum(b[1:6])
My logic is to get a table only contains male students with each of their 6 quizzes, then sum the table and divide by the number of male student. But I think there should be an easier way to do this.
Sample data:
df <- data.frame(Section=c(rep('A',9)),
                 Degree=c(rep('MBA',4),'MS','MBA','MBA','MS','MBA'),
                 Sex=c(rep('M',5),'F','M','M','F'),
                 Quiz1=c(0,10,2,2,8,6,6,2,3),
                 Quiz2=c(0,1,4,4,1,5,0,3,9),
                 Quiz3=c(6,5,6,6,4,2,7,9,3),
                 Quiz4=c(5,4,5,5,10,5,7,7,3),
                 Quiz5=c(7,3,6,3,10,7,6,10,5),
                 Quiz6=c(3,8,6,6,5,8,10,10,5))
How about this:
data.frame(df[which(df$Sex=='M'),],
           QuizMeans=rowMeans(df[which(df$Sex=='M'),c(4:9)]))
Note: "c(4:9)" in the code above is takes the row average for quiz columns 4-9.
So we're calculating quiz scores for each individual this way.
Output:
Section Degree Sex Quiz1 Quiz2 Quiz3 Quiz4 Quiz5 Quiz6 QuizMeans
1 A MBA M 0 0 6 5 7 3 3.500000
2 A MBA M 10 1 5 4 3 8 5.166667
3 A MBA M 2 4 6 5 6 6 4.833333
4 A MBA M 2 4 6 5 3 6 4.333333
5 A MS M 8 1 4 10 10 5 6.333333
7 A MBA M 6 0 7 7 6 10 6.000000
8 A MS M 2 3 9 7 10 10 6.833333
Then if you wanted to take the mean of their means (i.e. the grand mean), you could store the above as something like "df", then use mean() to calculate the mean of the column QuizMeans, like this:
df <- data.frame(df[which(df$Sex=='M'),],
                 QuizMeans=rowMeans(df[which(df$Sex=='M'),c(4:9)]))
mean(df$QuizMeans)
[1] 5.285714
If there are missing values in your data, you'll need to add na.rm=TRUE to either the mean() or rowMeans() function, like this:
mean(df$QuizMeans, na.rm=TRUE)
[1] 5.285714
You could use the following without specifying column positions
ans <- sum(df[df$Sex=="M", grepl("Quiz",names(df))])/sum(df$Sex=="M")
# 31.71429
If you know the column positions
ans <- sum(df[df$Sex=="M", 4:9])/sum(df$Sex=="M")
# 31.71429
Use dplyr. (Note the sample data above codes Sex as "M", not "Male".)
library(dplyr)
df %>%
  filter(Sex == "M") %>%
  summarise(avg = mean(c(Quiz1, Quiz2, Quiz3, Quiz4, Quiz5, Quiz6)))
# 5.285714
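Another base R angle, using the sample df from the question: colMeans gives the per-quiz averages for the male students in one call, and (since every quiz column has the same number of students) the mean of those column means equals the grand mean from above:
colMeans(df[df$Sex == "M", 4:9])
mean(colMeans(df[df$Sex == "M", 4:9])) # 5.285714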

Summing depth data (consecutive rows) in R

How is it possible to sum up consecutive depth data with R?
For instance:
a <- data.frame(label = as.factor(c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood")),
                depth = as.numeric(c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)))
The given output should be something like:
Label Depth
Air 7
Wood 3
Stone 1
First the removal of negative values is done with cummax(), because depth can only increase in this special case. Hence:
label depth
1 Air 1
2 Air 2
3 Air 3
4 Air 3
5 Air 4
6 Air 5
7 Wood 5
8 Wood 5
9 Wood 5
10 Wood 6
11 Wood 8
12 Air 9
13 Air 9
14 Air 9
15 Air 10
16 Stone 10
17 Stone 10
18 Stone 11
19 Stone 11
20 Air 11
21 Air 12
22 Air 12
23 Air 12
24 Air 13
25 Wood 14
26 Wood 14
Now, taking max minus min of the depth within each consecutive run, you would get the following (the question is how to do this step):
label depth
1 Air 4
2 Wood 3
3 Air 1
4 Stone 1
5 Air 2
6 Wood 0
And finally, summing up those max-min values gives the output presented above.
Steps tried to achieve the output:
The first obvious solution, for instance for Air, would be:
diff(cummax(a[a$label=="Air",]$depth))
This solution gets rid of the negative data, which is necessary due to an expected constant increase in depth.
The problem is the output also takes into account the big steps in between each consecutive subset. Hence, the sum for Air would be 12 instead of 7.
[1] 1 1 0 1 1 4 0 0 1 1 1 0 0 1
Even worse would be a solution with aggregate, e.g.:
aggregate(depth~label, a, FUN=function(x){sum(x>0)})
Note: solutions that filter out big jumps are not what I'm looking for. Sure, you could hard-code a limit, for instance <2, for the example of Air once again:
sum(diff(cummax(a[a$label=="Air",]$depth))[diff(cummax(a[a$label=="Air",]$depth))<2])
This gives you almost the right result but does not work as expected here. I'm pretty sure there is already a function for what I'm looking for, because this is not an uncommon problem across many different tasks.
I guess taking the minimum and maximum value of each set of consecutive rows per material and summing those up would be one possible solution, but I'm not sure how to apply a function to only the consecutive subsets.
You can use data.table::rleid to quickly group by run, or reconstruct it with rle if you really like. After that, aggregating is fairly easy in any grammar. In dplyr,
library(dplyr)
a <- data.frame(label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood"),
depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14))
a2 <- a %>%
  # filter to rows where previous value is lower, equal, or NA
  filter(depth >= lag(depth) | is.na(lag(depth))) %>%
  # group by label and its run
  group_by(label, run = data.table::rleid(label)) %>%
  summarise(depth = max(depth) - min(depth)) # aggregate
a2 %>% arrange(run) # sort to make it pretty
#> # A tibble: 6 x 3
#> # Groups: label [3]
#> label run depth
#> <fctr> <int> <dbl>
#> 1 Air 1 4
#> 2 Wood 2 3
#> 3 Air 3 1
#> 4 Stone 4 1
#> 5 Air 5 2
#> 6 Wood 6 0
a3 <- a2 %>% summarise(depth = sum(depth)) # a2 is still grouped, so aggregate more
a3
#> # A tibble: 3 x 2
#> label depth
#> <fctr> <dbl>
#> 1 Air 7
#> 2 Stone 1
#> 3 Wood 3
A base R method using aggregate is
aggregate(cbind(val=cummax(a$depth)),
list(label=a$label, ID=c(0, cumsum(diff(as.integer(a$label)) != 0))),
function(x) diff(range(x)))
The first argument to aggregate calculates the cumulative maximum of the input vector, as the OP does above; wrapping it in cbind provides a named column (val) in the final output. The second argument is the grouping argument; rather than using rle, it builds a run ID by taking the cumulative sum of the points where the label changes. Finally, the third argument provides the function that calculates the desired output, taking the difference of the range within each group.
This returns
label ID val
1 Air 0 4
2 Wood 1 3
3 Air 2 1
4 Stone 3 1
5 Air 4 2
6 Wood 5 0
The data.table way (borrowing in part from #alistaire):
library(data.table)

setDT(a)
a[, depth := cummax(depth)]
depth_gain <- a[,
  list(
    depth = max(depth) - depth[1], # only need the starting and max values
    label = label[1]
  ),
  by = rleidv(label)
]
result <- depth_gain[, list(depth = sum(depth)), by = label]
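For reference, the rle() reconstruction mentioned in the first answer can be done entirely in base R; a sketch, assuming a is the plain data.frame from the question (re-create it if setDT() was run above):
a$depth <- cummax(a$depth) # remove the negative dips first
a$run <- with(rle(as.character(a$label)), rep(seq_along(lengths), lengths))
per_run <- aggregate(depth ~ label + run, data = a, function(x) diff(range(x)))
aggregate(depth ~ label, data = per_run, sum)
#   label depth
# 1   Air     7
# 2 Stone     1
# 3  Wood     3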
