Different ways of indexing dataframe in R - r

Say, I have a dataframe df in R as follows,
id inflam
1 1 0.03093764
2 2 0.50115406
3 3 0.82153770
4 4 0.01985961
5 5 0.04994588
6 6 0.91714810
7 7 0.83438400
8 8 0.80832225
9 9 0.12360681
10 10 0.08490079
I can access the entirety of the inflam column by indexing as df[,2] or df[2]. However, typeof(df[,2]) returns double, whereas typeof(df[2]) returns list. The comma seems to be the differentiator, but why is this the case? What is going on under the hood?
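A short sketch of the difference, using made-up inflam values (the data below is an assumption, not the original df). A data frame is a list of columns, so df[2] is list-style subsetting and returns a one-column data frame, while df[,2] is matrix-style subsetting, which by default (drop = TRUE) simplifies a single column to a plain vector:

```r
# Illustrative data only; the values are random, not those from the question
set.seed(1)
df <- data.frame(id = 1:10, inflam = runif(10))

by_matrix <- df[, 2]                # matrix-style: row, column; result is dropped to a vector
by_list   <- df[2]                  # list-style: keeps the data frame wrapper
by_elem   <- df[[2]]                # list-style extraction of the column itself
kept_df   <- df[, 2, drop = FALSE]  # matrix-style, but without dropping

typeof(by_matrix)   # "double"
typeof(by_list)     # "list"
class(by_list)      # "data.frame"
```

So the comma switches between the two subsetting modes, and df[[2]] is the list-style way to get the same vector as df[,2].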

Related

creating a dataframe of means of 5 randomly sampled observations

I'm currently reading "Practical Statistics for Data Scientists" and following along in R as they demonstrate some code. There is one chunk of code I'm particularly struggling to follow the logic of and was hoping someone could help. The code in question is creating a dataframe with 1000 rows where each observation is the mean of 5 randomly drawn income values from the dataframe loans_income. However, I'm getting confused about the logic of the code as it is fairly complicated with a tapply() function and nested rep() statements.
The code to create the dataframe in question is as follows:
samp_mean_5 <- data.frame(income = tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean),
type='mean_of_5')
In particular, I'm confused about the nested rep() statements and the 1000*5 portion of the sample() function. Any help understanding the logic of the code would be greatly appreciated!
For reference, the original dataset loans_income simply has a single column of 50,000 income values.
You have 50,000 income values in a single vector. Let's break your code down:
tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean)
I will replace 1000 with 10 and income with random numbers, so it's easier to explain. I also set set.seed(1) so the result can be reproduced.
sample(loans_income$income,1000*5)
We draw 50 random incomes from your vector without replacement. They are (temporarily) put into a vector of length 50, so the output looks like this:
> sample(runif(50000),10*5)
[1] 0.73283101 0.60329970 0.29871173 0.12637654 0.48434952 0.01058067 0.32337850
[8] 0.46873561 0.72334215 0.88515494 0.44036341 0.81386225 0.38118213 0.80978822
[15] 0.38291273 0.79795343 0.23622492 0.21318431 0.59325586 0.78340477 0.25623138
[22] 0.64621658 0.80041393 0.68511759 0.21880083 0.77455662 0.05307712 0.60320912
[29] 0.13191926 0.20816298 0.71600799 0.70328349 0.44408218 0.32696205 0.67845445
[36] 0.64438336 0.13241312 0.86589561 0.01109727 0.52627095 0.39207860 0.54643661
[43] 0.57137320 0.52743012 0.96631114 0.47151170 0.84099503 0.16511902 0.07546454
[50] 0.85970500
rep(1:1000,rep(5,1000))
Now we are creating an indexing vector of length 50:
> rep(1:10,rep(5,10))
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6
[29] 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10
Those indices "group" the samples from step 1. So basically this vector tells R that the first 5 entries of your "sample vector" belong together (index 1), the next 5 entries belong together (index 2) and so on.
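As a side note, the nested call can be written more readably with the each argument of rep. A small check, using the same reduced sizes (10 groups of 5):

```r
nested <- rep(1:10, rep(5, 10))  # repeat each of 1..10 five times, via a times vector
simple <- rep(1:10, each = 5)    # same grouping vector, stated directly

identical(nested, simple)        # TRUE
table(nested)                    # every group index appears 5 times
```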
FUN = mean
Just apply the mean-function on the data.
tapply
So tapply takes the sampled data (sample-part) and groups them by the second argument (the rep()-part) and applies the mean-function on each group.
If you are familiar with data.frames and the dplyr package, take a look at this (only the first 10 rows are displayed):
set.seed(1)
df <- data.frame(income=sample(runif(5000),10*5), index=rep(1:10,rep(5,10)))
income index
1 0.42585569 1
2 0.16931091 1
3 0.48127444 1
4 0.68357403 1
5 0.99374923 1
6 0.53227877 2
7 0.07109499 2
8 0.20754511 2
9 0.35839481 2
10 0.95615917 2
I attached an index to the random numbers (your income). Now we calculate the mean per group:
df %>%
group_by(index) %>%
summarise(mean=mean(income))
which gives us
# A tibble: 10 x 2
index mean
<int> <dbl>
1 1 0.551
2 2 0.425
3 3 0.827
4 4 0.391
5 5 0.590
6 6 0.373
7 7 0.514
8 8 0.451
9 9 0.566
10 10 0.435
Compare it to
set.seed(1)
tapply(sample(runif(5000),10*5),
rep(1:10,rep(5,10)),
mean)
which yields basically the same result:
1 2 3 4 5 6 7 8 9
0.5507529 0.4250946 0.8273149 0.3905850 0.5902823 0.3730092 0.5143829 0.4512932 0.5658460
10
0.4352546
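Another way to see what tapply is doing here, sketched with the same stand-in random numbers: lay the 50 sampled values out as a matrix with 5 rows, so that each column holds one group of 5, and take column means. The variable names are only illustrative.

```r
set.seed(1)
s <- sample(runif(5000), 10 * 5)                 # 50 sampled values

by_tapply <- tapply(s, rep(1:10, each = 5), mean)

# matrix() fills column by column, so column j holds samples 5*(j-1)+1 .. 5*j
by_matrix <- colMeans(matrix(s, nrow = 5))

all.equal(as.numeric(by_tapply), by_matrix)      # TRUE
```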

purrr map / lapply / sapply across groups of multiple (n > 1) elements at a time?

Given a vector, it is easy enough to lapply, sapply or map across one element at a time.
Is there a way to do the same across groups of (>1) elements of the vector?
Example
Suppose we are constructing API calls by appending comma-separated user_identifiers to the URL, like so:
user_identifiers <- c("0011399", "0011400", "0013581", "0013769", "0013770", "0018374",
"0018376", "0018400", "0018401", "0018410", "0018415", "0018417",
"0018419", "0018774", "0018775", "0018776", "0018777", "0018778",
"0018779", "0021627", "0023492", "0023508", "0023511", "0023512",
"0024120", "0025672", "0025673", "0025675", "0025676", "0028226",
"0028227", "0028266", "0028509", "0028510", "0028512", "0028515",
"0028518", "0028520", "0028523", "0029160", "0033141", "0034586",
"0035035", "0035310", "0035835", "0035841", "0035862", "0036503",
"0036580", "0036583", "0036587", "0037577", "0038582", "0038583",
"0038587", "0039727", "0039729", "0039731", "0044703", "0044726"
)
get_data <- function(user_identifier) {
url <- paste0("https://www.myapi.com?userIdentifier=",
paste0(user_identifier, collapse=","))
fromJSON(url)
}
In the above, get_data(user_identifiers) would return the API's response for all 60 user_identifiers in one single request.
But suppose the API accepts a maximum of 10 identifiers at a time (so we cannot do all 60 at once).
A simple solution could be to map/lapply/sapply over each element, e.g. sapply(user_identifiers, get_data). This would work fine; however, we would make 60 API calls when all we really need is 6. If we could map/lapply/sapply over groups of 10 at a time, that would be ideal.
Question
Is there an elegant way to map/lapply/sapply over groups of n elements at a time (where n>1)?
We can split user_identifiers into groups of 10 and use sapply/map/lapply
sapply(split(user_identifiers, gl(length(user_identifiers)/10, 10)), get_data)
where gl creates groups from 1 to 6 each of length 10.
gl(length(user_identifiers)/10, 10)
# [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
# 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6
#Levels: 1 2 3 4 5 6
The same groups can be created with rep
rep(1:ceiling(length(user_identifiers)/10), each = 10)
As @thelatemail mentioned, we can use cut and specify the number of groups to cut the data into
sapply(split(user_identifiers, cut(seq_along(user_identifiers),6)), get_data)
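A minimal sketch of a reusable chunking helper built on the same split idea. The names chunk and build_url are hypothetical, and build_url only assembles the request URL instead of calling the API, so the example runs offline. Grouping by ceiling(seq_along(x) / n) also yields a final, shorter group when the vector length is not a multiple of n.

```r
# Hypothetical helper: split x into consecutive groups of at most n elements
chunk <- function(x, n) split(x, ceiling(seq_along(x) / n))

# Hypothetical stand-in for get_data(): builds the URL but makes no request
build_url <- function(ids) {
  paste0("https://www.myapi.com?userIdentifier=", paste0(ids, collapse = ","))
}

ids <- sprintf("%07d", 1:23)       # 23 made-up identifiers
groups <- chunk(ids, 10)
lengths(groups)                    # group sizes: 10, 10, 3

urls <- lapply(groups, build_url)  # one URL (one request) per group
```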

Frequency distribution using binCounts

I have a dataset of customer ages and I want to build a frequency distribution with age bins 9 years wide.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to the table below; the variable names can differ (as you wish).
Could I use binCounts for this? If yes, could you help me with the code, as I am not sure what bx and idxs mean in it?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know binCounts or even the package it is in, but here is a base R solution:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit <- c(37,46,55,64,73,82,91,100)
Labels <- paste(head(lowerlimit,-1)+1, lowerlimit[-1], sep="-") # add one to each lower limit to get 38, 47, etc.
group <- cut(Ages, lowerlimit, Labels) # determine which group each age belongs to
tab <- table(group) # form a frequency table
as.data.frame(tab) # transform the table into a data frame
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))
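On the binCounts part of the question: as far as I know, binCounts comes from the matrixStats package, bx is the vector of bin boundaries, and idxs is an optional subset of x to count. A hedged sketch follows; the matrixStats call runs only if the package is installed, and the base R counts are computed either way for comparison.

```r
Ages <- c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
          70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
          84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)

breaks <- 0:7 * 9 + 37   # 37, 46, ..., 100: bins (37,46], (46,55], ...

# Base R reference counts
base_counts <- as.vector(table(cut(Ages, breaks)))

# Assumption: matrixStats::binCounts(x, idxs = NULL, bx, right = FALSE);
# right = TRUE makes the bins right-closed, matching cut()'s default
if (requireNamespace("matrixStats", quietly = TRUE)) {
  bin_counts <- matrixStats::binCounts(Ages, bx = breaks, right = TRUE)
  print(bin_counts)
}
```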

R - set bucket from a mapper data frame

Probably a similar situation has already been solved but I could not find it.
I have a mapper data frame like the following
mapper
bucket_label bucket_no
1 (-Inf; 9.99) 1
2 (25.01; 29.99) 1
3 (29.99; 30.01) 1
4 (30.01; Inf) 1
5 (19.99; 20.01) 2
6 (20.01; 24.99) 2
7 (24.99; 25.01) 2
8 (9.99; 10.11) 3
9 (10.11; 14.99) 3
10 (14.99; 15.01) 3
11 (15.01; 19.99) 3
and a vector x with random data
x <- rnorm(100)*100
I need to assign the corresponding bucket to each entry of x quickly, and findInterval and cut do not seem to help here.
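One possible approach, sketched under assumptions: the labels are parsed as "(lower; upper)", the intervals tile the whole real line once sorted by lower bound, and values exactly on a boundary are assigned to the interval that starts there (the labels look like open intervals, so this boundary rule is a choice, not something the mapper specifies). Sorting the mapper by lower bound lets findInterval return a row position, which is then used to look up bucket_no.

```r
mapper <- data.frame(
  bucket_label = c("(-Inf; 9.99)", "(25.01; 29.99)", "(29.99; 30.01)", "(30.01; Inf)",
                   "(19.99; 20.01)", "(20.01; 24.99)", "(24.99; 25.01)",
                   "(9.99; 10.11)", "(10.11; 14.99)", "(14.99; 15.01)", "(15.01; 19.99)"),
  bucket_no = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
)

# Lower bound = text between "(" and ";"
lower <- as.numeric(sub("^\\((.*);.*$", "\\1", mapper$bucket_label))

ord    <- order(lower)
breaks <- lower[ord]             # -Inf, 9.99, 10.11, ..., 30.01
lookup <- mapper$bucket_no[ord]  # bucket of each sorted interval

# Hypothetical helper: vectorised bucket assignment
assign_bucket <- function(x) lookup[findInterval(x, breaks)]

x_demo <- c(0, 10, 12, 20, 22, 27, 100)
assign_bucket(x_demo)            # 1 3 3 2 2 1 1
```

The same function can be applied to the full vector, e.g. assign_bucket(rnorm(100) * 100).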

Remove lines with efficient way

I have a data frame named df:
number value
1 5
2 5
3 5
4 6
5 6
6 6
7 6
8 7
9 7
10 7
11 7
12 7
13 8
14 9
15 9
I want to remove the rows whose value falls outside a min and a max level. I tried applying the conditions separately, first this:
df[df$value>5 , ]
and after that this:
df[df$value>8 , ]
After that I tried this:
df[df$value>5 & df$value>8, ]
but it only applies the df$value>8 condition.
Another problem I observed is that when I type
df[df$value>5, ]
the rows are filtered out of the output, but when I then type df it still contains the rows I tried to remove. What could be wrong, and why don't I get a clean data frame without the removed rows?
An example of the output data:
number value
4 6
5 6
6 6
7 6
8 7
9 7
10 7
11 7
12 7
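If you want to remove rows with a value lower than the min or higher than the max, note that the following selects the rows outside the range instead of dropping them:
df[df$value<5 | df$value>8, ]
Edit
The correct code keeps the rows in between and assigns the result back to df, because subsetting alone does not modify df:
df <- df[df$value>5 & df$value<8,]
It works for me.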
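A self-contained sketch that rebuilds the example data and checks the result against the desired output. The name kept is only illustrative; subset() is shown as an equivalent base R spelling.

```r
df <- data.frame(
  number = 1:15,
  value  = c(5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 9, 9)
)

# Subsetting returns a new data frame; df itself changes only on assignment
kept <- df[df$value > 5 & df$value < 8, ]

# Equivalent filter using subset()
kept2 <- subset(df, value > 5 & value < 8)

kept$number   # 4 5 6 7 8 9 10 11 12
nrow(df)      # still 15, because df was not overwritten
```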
