Getting observations corresponding to each quartile - r

q <- quantile(faithful$eruptions)
> q
0% 25% 50% 75% 100%
1.60000 2.16275 4.00000 4.45425 5.10000
I get the result above; the dataset ships with R:
head(faithful)
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
I want a data frame containing the data plus an additional column indicating the quartile to which each observation belongs. For example, the final dataset should look like this:
eruptions waiting Quartile
1 3.600 79 Q1
2 1.800 54 Q2
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
How can this be done?

Something along the lines of this? Use the values returned by quantile() as the break points for cut().
faithful$kva <- cut(faithful$eruptions, q)
levels(faithful$kva) <- c("Q1", "Q2", "Q3", "Q4")
faithful
eruptions waiting kva
1 3.600 79 Q2
2 1.800 54 Q1
3 3.333 74 Q2
4 2.283 62 Q2
5 4.533 85 Q4

The cut function has the option to create numeric labels for each quantile right away:
faithful$Quartile <- cut(faithful$eruptions,
quantile(faithful$eruptions),
labels = FALSE)
This will create an NA for the smallest eruption. If you want to assign the lowest eruption to the first quartile, add include.lowest = TRUE when calling cut():
faithful$Quartile <- cut(faithful$eruptions,
                         quantile(faithful$eruptions),
                         labels = FALSE,
                         include.lowest = TRUE)
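The two ideas above (quartile breaks from quantile() and include.lowest = TRUE) can be combined with character labels. This is a minimal base R sketch on the built-in faithful data; the variable names breaks and quartile are chosen here for illustration only.

```r
# Quartile break points computed from the data itself
breaks <- quantile(faithful$eruptions)

# include.lowest = TRUE closes the first interval on the left,
# so the minimum falls into Q1 instead of becoming NA
quartile <- cut(faithful$eruptions,
                breaks = breaks,
                labels = c("Q1", "Q2", "Q3", "Q4"),
                include.lowest = TRUE)

# Count observations per quartile; an NA row would appear if any value were missed
table(quartile, useNA = "ifany")
```

Note that cut() needs unique break points, so this sketch would fail on data where two quartiles coincide.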

This can now be done more conveniently via a dplyr pipe and ggplot2::cut_number().
library(dplyr)
library(ggplot2)
faithful %>%
mutate(Quartile = cut_number(eruptions, n = 4, labels = c("Q1", "Q2", "Q3", "Q4")))
The lowest observation is included by default, unlike with base R cut().

Related

Using Rolling Average to Calculate over Window of Values

I am trying to calculate rolling averages of Heart Rate over 15 second intervals. I have millisecond data for many participants and as such the millisecond values can potentially be repeated multiple times, and due to inconsistent time readings, creating intervals by row is not viable.
Below is a small sample of the data for one participant. Data for another participant would obviously feature different millisecond data taken at different intervals.
Ideal output would involve a new column with the rolling average for each value of millisecond data.
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(MS, HR)
I have tried a few packages (namely zoo's suite of rolling functions) but have had trouble applying them to this problem.
Thank you!
rollapplyr in zoo accepts a vector of widths, and findInterval can be used to find the index of the last MS value at or before 15 seconds earlier. Subtracting that index from 1:n gives w, the number of positions to average at each point. The question does not say exactly which intervals to produce, so we will assume that the right-hand edge of each interval is at an input point.
library(zoo)
w <- with(df, seq_along(MS) - findInterval(MS - 15000, MS))
transform(df, roll = rollapplyr(HR, w, mean, fill = NA))
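As a cross-check that needs no packages, the same trailing window can be sketched in base R. This assumes that MS is sorted with no repeated values and that the window is the half-open interval (t - 15 s, t], which is what the widths in w amount to under that assumption. The names window_ms and roll_base are made up for this sketch.

```r
MS <- c(36148, 36753, 37364, 38062, 38737, 39580, 40029, 40387, 41208, 42006,
        42796, 43533, 44274, 44988, 45696, 46398, 47079, 47742, 48429, 49135,
        49861, 50591, 51324, 52059)
HR <- c(84, 84, 84, 84, 84, 96, 84, 84, 96, 84, 84, 96,
        84, 84, 96, 84, 84, 84, 84, 84, 84, 84, 84, 84)
df <- data.frame(MS, HR)

window_ms <- 15000  # 15 seconds in milliseconds

# For each time stamp t, average every HR reading taken in (t - 15 s, t]
df$roll_base <- sapply(df$MS, function(t) {
  mean(df$HR[df$MS > t - window_ms & df$MS <= t])
})

head(df)
```

This is quadratic in the number of rows, so it is only meant for checking results on small samples, not for the full data.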
An option using non-equi join in data.table which also handles an ID:
library(data.table)
setDT(df)[, avgHR :=
df[.(ID=ID, start=MS-15000, end=MS), on=.(ID, MS>=start, MS<=end),
by=.EACHI, mean(HR)]$V1
]
output:
ID MS HR avgHR
1: 1 36148 84 84.00000
2: 1 36753 84 84.00000
3: 1 37364 84 84.00000
4: 1 38062 84 84.00000
5: 1 38737 84 84.00000
6: 1 39580 96 86.00000
7: 1 40029 84 85.71429
8: 1 40387 84 85.50000
9: 1 41208 96 86.66667
10: 1 42006 84 86.40000
11: 1 42796 84 86.18182
12: 1 43533 96 87.00000
13: 1 44274 84 86.76923
14: 1 44988 84 86.57143
15: 1 45696 96 87.20000
16: 1 46398 84 87.00000
17: 1 47079 84 86.82353
18: 1 47742 84 86.66667
19: 1 48429 84 86.52632
20: 1 49135 84 86.40000
21: 1 49861 84 86.28571
22: 1 50591 84 86.18182
23: 1 51324 84 86.18182
24: 1 52059 84 86.18182
ID MS HR avgHR
data:
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(ID=1, MS, HR)
I'm not totally sure how you want to apply the 15s rolling average, but here is one way to go about what I think you're looking for. First we subset the data that lies between 7.5s before and 7.5s after each point, then we take the average. This, however, will have an edge effect, since there is no data 7.5s before the first value.
library(tidyverse)
roll_vec <- c()
for(i in 1:nrow(df)){
ref <- df$MS[[i]]
val <- df %>%
filter(MS <= ref + 7500 & MS >= ref- 7500) %>%
pull(HR) %>%
mean
roll_vec[[i]] <- val
}
df %>%
mutate(roll_15s = roll_vec)
#> MS HR roll_15s
#> 1 36148 84 87.00000
#> 2 36753 84 87.00000
#> 3 37364 84 86.76923
#> 4 38062 84 86.57143
#> 5 38737 84 86.57143
#> 6 39580 96 86.57143
#> 7 40029 84 86.57143
#> 8 40387 84 86.57143
#> 9 41208 96 86.57143
#> 10 42006 84 86.57143
#> 11 42796 84 86.57143
#> 12 43533 96 86.57143
#> 13 44274 84 87.00000
#> 14 44988 84 87.27273
#> 15 4569 96 96.00000
df %>%
mutate(roll_15s = roll_vec) %>%
ggplot(aes(MS, HR))+
geom_line()+
geom_line(aes(y = roll_15s), color = "blue")
Notice that in the plot, the black line is the raw data and the blue line is the 15s rolling average.
One possible solution:
library(magrittr)   # provides %>%
library(data.table) # provides %between%
start_range <- df$MS[df$MS < max(df$MS)-15000]
lapply(start_range,function(t){
data.frame(MS = mean(df$MS[df$MS %between% c(t,t+15000)]),
HR = mean(df$HR[df$MS %between% c(t,t+15000)]))
}) %>% Reduce(rbind,.)
MS HR
1 43218.00 86.18182
2 43907.82 86.18182
3 44603.55 86.18182
4 44948.29 86.28571
5 45673.38 86.33333
I added some points to your data (with the data you gave, I got only two points):
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059,53289,54424)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84,85,88)
df <- data.frame(MS, HR)
The idea here is to calculate, for each MS value, the mean of HR and of the time MS over all points whose time lies between this value (t in lapply) and 15 s later.
I restrict this to the range where the data cover a full 15 s window: the start_range vector.

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the amount of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I rearrange the data by stacking it, i.e.:
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
data <- ddply(df, c("site"), summarise,
N = length(values),
mean = mean(values),
sd = sd(values),
se = sd / sqrt(N),
sum = sum(values)
)
data
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on @docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
melt(variable.name="site") %>%
group_by(site) %>%
summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. The tidyr package offers analogous functions (gather(), and more recently pivot_longer()).
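If the goal is to avoid stacking altogether, one option is to apply the statistics column by column. The following is a base R sketch, not the approach used in the answer above; the helper name site_stats is hypothetical, and DF is rebuilt here from the numbers in the question.

```r
# The wide data from the question: one column per site
DF <- data.frame(
  Selidor.Bay  = c(39, 70, 13, 0, 43, 0),
  Enlades.Bay  = c(29, 370, 44, 65, 110, 30),
  Cumphrey.Bay = c(187, 50, 52, 20, 220, 266)
)

# Hypothetical helper: the same statistics as in the ddply call, for one column
site_stats <- function(x) {
  n <- length(x)
  c(N = n, mean = mean(x), sd = sd(x), se = sd(x) / sqrt(n), sum = sum(x))
}

# sapply() returns one column per site; t() puts the sites in rows
stats <- t(sapply(DF, site_stats))
stats
```

The result is a numeric matrix with one row per site, which can be wrapped in as.data.frame() if a data frame is needed.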

Loop over the data-set columns and calculate statistics in R

I am just starting with R and need help with looping over the data-set and calculating statistics.
I have two data-sets:
>head(windows)
W1
W1
W2
W2
W3
W4
W4
W5
...
>head(values) # this is very large file (>20Gb)
Case1 Case2 Case3 Case4 ...
21 19 14 64
14 24 48 13
21 34 65 83
45 53 25 63
62 32 72 11
24 75 12 66
12 23 73 37
45 23 56 74
...
What I want to do:
For every Case column in values join it with windows row by row;
Should look something like this (Case1):
W1 21
W1 14
W2 21
W2 45
W3 62
W4 24
W4 12
W5 45
For every joined window group, e.g.:
W1(Case1): 21,14
W2(Case1): 21,45
W3(Case1): 62
W4(Case1): 24,12
W5(Case1): 45
W1(Case2): 19,24
Calculate mean (or median);
Perfect output would look like this:
Case1 Case2 Case3 Case4
W1 17.50 21.50 mean mean
W2 33.00 mean mean mean
W3 62.00 mean mean mean
W4 18.00 mean mean mean
W5 45.00 mean mean mean
Pseudo code might be:
For cases in values
join row by row with windows
For every window
Calculate mean
end
end
NB: I have tried joining windows with values using rbind,merge,data.frame, but data-sets are too large and process gets killed.
Since you have a considerably large data file, I think there are two good options to do it, either using data.table or dplyr. So here's how you could do it using dplyr.
But first of all, I think you don't really want to merge values and windows. Based on your description, I think what you want to do is add windows as an additional column to values (since there is nothing that could be merged, it seems).
So I would first create that additional column in values. (I assume here that windows is a vector. That is not clear from your question; it might also be a data.frame, but you could do it very similarly in that case):
values$windows <- windows #assuming windows is a vector
Then you can use dplyr for the calculation:
Method 1:
Referencing each column you want to operate on:
library(dplyr)
values %>%
group_by(windows) %>%
summarize(Case1 = mean(Case1, na.rm=TRUE),
Case2 = mean(Case2, na.rm=TRUE),
Case3 = mean(Case3, na.rm=TRUE),
Case4 = mean(Case4, na.rm=TRUE))
Method 2:
Using summarise_each to do the same operation for all columns except the grouping variables (windows in this case). If you have a large number of columns you want to do the same operation on, this saves you some typing. Plus, you can specify more functions to be calculated, for example mean and median, if you want.
library(dplyr) # if it's not yet loaded
values %>%
group_by(windows) %>%
summarise_each(funs(mean(., na.rm=TRUE)))
The result is the same in both cases:
# windows Case1 Case2 Case3 Case4
#1 W1 17.5 21.5 31.0 38.5
#2 W2 33.0 43.5 45.0 73.0
#3 W3 62.0 32.0 72.0 11.0
#4 W4 18.0 49.0 42.5 51.5
#5 W5 45.0 23.0 56.0 74.0
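For comparison, the same grouped means can be sketched in base R with aggregate(), which takes the grouping vector separately and so does not need windows to be added as a column. The data below are the eight sample rows from the question; whether this scales to the full 20 GB file is not something this sketch addresses.

```r
# Sample rows from the question
windows <- c("W1", "W1", "W2", "W2", "W3", "W4", "W4", "W5")
values <- data.frame(
  Case1 = c(21, 14, 21, 45, 62, 24, 12, 45),
  Case2 = c(19, 24, 34, 53, 32, 75, 23, 23),
  Case3 = c(14, 48, 65, 25, 72, 12, 73, 56),
  Case4 = c(64, 13, 83, 63, 11, 66, 37, 74)
)

# Mean of every Case column within each window; swap in median if preferred
means <- aggregate(values, by = list(windows = windows), FUN = mean)
means
```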
Edit
Here's an example with much larger sample data including conversion from matrix to data.frame/vector. If your conversion from "big.matrix" to matrix works, then I think, this should work the same way with your original data.
# create a matrix with 100 columns and 5 million rows per column
m <- matrix(runif(100*5e6), ncol=100)
dim(m)
#[1] 5000000 100
object.size(m)
# 4000000200 bytes
# convert to data.frame
df <- as.data.frame(m)
# create a second matrix "windows" with a single column
windows <- matrix(sample(1:1000, nrow(df), replace=TRUE), ncol = 1)
# convert matrix "windows" to vector
windows.vec <- as.vector(windows[,1])
# add windows.vec as a grouping variable to "df"
df$windows <- windows.vec # you could also do this directly from the "windows" matrix
# check dimensions of "df"
dim(df)
#[1] 5000000 101
# now you can do the calculation
df %>%
group_by(windows) %>%
summarise_each(funs(mean(., na.rm=TRUE), median(., na.rm=TRUE)))
This is by no means the most elegant solution, but it seems to do what you want simply by stacking your values data into a single column and then using a tapply() function. It also prevents the need to bind together your windows factors and values data.
First, a small sample dataset, similar to the above format:
> set.seed(42)
> values <- data.frame(replicate(4, sample(1:100, 1e3, replace=T)))
> head(values)
[,1] [,2] [,3] [,4]
[1,] 85 34 42 77
[2,] 21 3 72 66
[3,] 36 45 77 14
[4,] 78 50 7 31
[5,] 51 89 42 92
[6,] 61 23 55 2
> windows <- rep(1:(1e3/2), each=2)
> head(windows)
[1] 1 1 2 2 3 3
Now stack the values data into a single column, creating a new variable ind:
> values <- stack(values)
And repeat your windows values to match the length of the stacked dataframe:
> windows <- rep(windows, 4)
Now you can use a simple tapply to calculate the mean by windows variable for each column:
> tapply(values$values, list(values$ind, windows), mean)
Sample output:
1 2 3 ...
X1 50.0 81.5 39.5
X2 36.0 26.5 52.5
X3 68.5 77.5 85.5
X4 52.0 90.0 91.5
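The stacking step can also be skipped by running tapply() once per column. This is a sketch under the same setup as above (the variable name by_window is made up here); it returns windows in rows and cases in columns, i.e. the transpose of the layout shown above.

```r
set.seed(42)
values  <- data.frame(replicate(4, sample(1:100, 1e3, replace = TRUE)))
windows <- rep(1:(1e3 / 2), each = 2)

# One tapply() per column; sapply() binds the per-column results into a matrix
by_window <- sapply(values, function(col) tapply(col, windows, mean))

dim(by_window)   # one row per window, one column per case
head(by_window)
```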

How to find the highest value of a column in a data frame in R?

I have the following data frame which I called ozone:
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
I would like to extract the highest value from ozone, Solar.R, Wind...
Also, if possible, how would I sort Solar.R or any column of this data frame in descending order?
I tried
max(ozone, na.rm=T)
which gives me the highest value in the dataset.
I have also tried
max(subset(ozone,Ozone))
but got the error "'subset' must be logical".
I can set an object to hold the subset of each column, by the following commands
ozone <- subset(ozone, Ozone >0)
max(ozone,na.rm=T)
but it gives the same value of 334, which is the max value of the data frame, not the column.
Any help would be great, thanks.
Similar to colMeans, colSums, etc, you could write a column maximum function, colMax, and a column sort function, colSort.
colMax <- function(data) sapply(data, max, na.rm = TRUE)
colSort <- function(data, ...) sapply(data, sort, ...)
I use ... in the second function in hopes of sparking your intrigue.
Get your data:
dat <- read.table(h=T, text = "Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9")
Use colMax function on sample data:
colMax(dat)
# Ozone Solar.R Wind Temp Month Day
# 41.0 313.0 20.1 74.0 5.0 9.0
To do the sorting on a single column,
sort(dat$Solar.R, decreasing = TRUE)
# [1] 313 299 190 149 118 99 19
and over all columns use our colSort function,
colSort(dat, decreasing = TRUE) ## compare with '...' above
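To make the role of ... concrete: any extra arguments given to colSort are handed on unchanged to sort(). A small sketch on a toy data frame (toy is invented for illustration and contains no NA values, so sapply() can return a matrix):

```r
colSort <- function(data, ...) sapply(data, sort, ...)

# Toy data, made up for this example
toy <- data.frame(a = c(3, 1, 2), b = c(10, 30, 20))

asc  <- colSort(toy)                     # no extra arguments: sort()'s default, ascending
desc <- colSort(toy, decreasing = TRUE)  # decreasing = TRUE travels through ... to sort()

asc
desc
```

With columns that contain NA, sort() drops the missing values, so columns can end up with different lengths and sapply() then returns a list instead of a matrix.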
To get the max of any column you want something like:
max(ozone$Ozone, na.rm = TRUE)
To get the max of all columns, you want:
apply(ozone, 2, function(x) max(x, na.rm = TRUE))
And to sort:
ozone[order(ozone$Solar.R),]
Or to sort the other direction:
ozone[rev(order(ozone$Solar.R)),]
Here's a dplyr solution:
library(dplyr)
# find max for each column
summarise_each(ozone, funs(max(., na.rm=TRUE)))
# sort by Solar.R, descending
arrange(ozone, desc(Solar.R))
UPDATE: summarise_each() has been deprecated in favour of a more featureful family of functions: mutate_all(), mutate_at(), mutate_if(), summarise_all(), summarise_at(), summarise_if()
Here is how you could do:
# find max for each column
ozone %>%
summarise_if(is.numeric, funs(max(., na.rm=TRUE)))%>%
arrange(Ozone)
or
ozone %>%
summarise_at(vars(1:6), funs(max(., na.rm=TRUE)))%>%
arrange(Ozone)
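In dplyr 1.0.0 and later, the scoped verbs and funs() are themselves superseded by across(). Here is a minimal sketch, assuming dplyr >= 1.0.0 is installed and using the built-in airquality data as a stand-in for ozone (an assumption: the question's data frame looks like its first rows).

```r
library(dplyr)

# airquality stands in for the question's `ozone` data frame (assumption)
col_max <- airquality %>%
  summarise(across(where(is.numeric), ~ max(.x, na.rm = TRUE)))

col_max

# Sorting by Solar.R in descending order works as before
sorted <- arrange(airquality, desc(Solar.R))
head(sorted, 3)
```

The summary has a single row, so sorting it afterwards has no effect; arrange() is only useful on the original data frame.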
In response to finding the max value for each column, you could try using the apply() function:
> apply(ozone, MARGIN = 2, function(x) max(x, na.rm=TRUE))
Ozone Solar.R Wind Temp Month Day
41.0 313.0 20.1 74.0 5.0 9.0
Another way would be to use ?pmax
do.call('pmax', c(as.data.frame(t(ozone)),na.rm=TRUE))
#[1] 41.0 313.0 20.1 74.0 5.0 9.0
The matrixStats package provides functions for column and row summaries (see the package vignette), but you have to convert your data.frame into a matrix first.
Then you run: colMaxs(as.matrix(ozone), na.rm = TRUE)
max(may$Ozone, na.rm = TRUE)
Without $Ozone, max() is computed over the whole data frame. This is covered in the swirl package's lessons.
I'm studying this course on Coursera too ~
Assuming that your data is in a data.frame called maxinozone, you can do this to get the maximum of the first column:
max(maxinozone[, 1], na.rm = TRUE)
max(ozone$Ozone, na.rm = TRUE) should do the trick. Remember to include the na.rm = TRUE or else R will return NA.
Try this solution:
Oz <- subset(data, Month == 5, select = Ozone) # select the Ozone values for May (i.e. Month == 5)
summary(Oz) # gives characteristics of the table (one Ozone column), including max, min, ...

Select a value for based on a highest value in another column

I don't understand why I can't find a solution for this, since I feel that this is a pretty basic question. Need to ask for help, then. I want to rearrange airquality dataset by month with maximum temp value for each month. In addition I want to find the corresponding day for each monthly maximum temperature. What is the laziest (code-wise) way to do this?
I have tried the following without success:
require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"))
dcast(mm, month + day ~ variable, max)
aggregate(formula = temp ~ month + day, data = airquality, FUN = max)
I am after something like this:
month day temp
5 7 89
...
There was quite a discussion a while back about whether being lazy is good or not. Anyway, this is short and natural to write and read (and is fast for large data, so you don't need to change or optimize it later):
require(data.table)
DT=as.data.table(airquality)
DT[,.SD[which.max(Temp)],by=Month]
Month Ozone Solar.R Wind Temp Day
[1,] 5 45 252 14.9 81 29
[2,] 6 NA 259 10.9 93 11
[3,] 7 97 267 6.3 92 8
[4,] 8 76 203 9.7 97 28
[5,] 9 73 183 2.8 93 3
.SD is the subset of the data for each group, and you just want the row from it with the largest Temp, if I understand correctly. If you need the row number, that can be added.
Or to get all the rows where the max is tied :
DT[,.SD[Temp==max(Temp)],by=Month]
Month Ozone Solar.R Wind Temp Day
[1,] 5 45 252 14.9 81 29
[2,] 6 NA 259 10.9 93 11
[3,] 7 97 267 6.3 92 8
[4,] 7 97 272 5.7 92 9
[5,] 8 76 203 9.7 97 28
[6,] 9 73 183 2.8 93 3
[7,] 9 91 189 4.6 93 4
Another approach with plyr
require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"), value.name = 'temp')
library(plyr)
ddply(mm, .(month), subset, subset = temp == max(temp), select = -variable)
Gives
month day temp
1 5 29 81
2 6 11 93
3 7 8 92
4 7 9 92
5 8 28 97
6 9 3 93
7 9 4 93
Or, even simpler
require(reshape2)
require(plyr)
names(airquality) <- tolower(names(airquality))
ddply(airquality, .(month), subset,
subset = temp == max(temp), select = c(month, day, temp) )
How about with plyr?
max.func <- function(df) {
max.temp <- max(df$Temp) # column names in airquality are capitalized
return(data.frame(day = df$Day[df$Temp==max.temp],
temp = max.temp))
}
ddply(airquality, .(Month), max.func)
As you can see, the max temperature for the month happens on more than one day. If you want different behavior, the function is easy enough to adjust.
Or if you want to use the data.table package (for instance, if speed is an issue and the data set is large or if you prefer the syntax):
library(data.table)
DT <- data.table(airquality)
DT[, list(maxTemp=max(Temp), dayMaxTemp=.SD[max(Temp)==Temp, Day]), by="Month"]
If you want to know what .SD stands for (it is short for "Subset of Data"), have a look here: SO
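For completeness, the same result can be sketched without any packages. ave() returns each month's maximum repeated alongside every row, so a plain comparison keeps all rows that reach it, ties included. The names is_max and hottest are made up for this sketch, which assumes the column names of airquality have not been lowercased earlier in the session.

```r
# TRUE for every row whose Temp equals the maximum Temp of its month
is_max <- with(airquality, Temp == ave(Temp, Month, FUN = max))

# Keep only the columns asked for in the question
hottest <- airquality[is_max, c("Month", "Day", "Temp")]
hottest
```

To keep only one day per month, the ties could be dropped afterwards with hottest[!duplicated(hottest$Month), ].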

Resources