I have a data.frame with daily maximum and minimum temperatures for 40 years, and I need to select all days whose maximum temperature is above the 90th percentile of maximum temperature and whose minimum temperature is above the 85th percentile of minimum temperature.
I was able to do that using percentiles computed over the whole series:
> head(df)
YEAR MONTH DAY Date MEAN MAX MIN
1 1965 1 1 1/1/1965 NA 27.0 17.0
2 1965 1 2 1/2/1965 24.0 28.0 20.7
3 1965 1 3 1/3/1965 19.9 23.7 16.2
4 1965 1 4 1/4/1965 18.0 23.4 12.0
5 1965 1 5 1/5/1965 19.7 24.0 14.0
6 1965 1 6 1/6/1965 18.6 24.0 13.0
library(data.table); setDT(df) # := requires a data.table
df[, hotday := +(MAX >= quantile(MAX, .90, na.rm = TRUE, type = 6) & MIN >= quantile(MIN, .85, na.rm = TRUE, type = 6))
] [, length := with(rle(hotday), rep(lengths, lengths)) # run length, so I can select consecutive days only
] [hotday == 0, length := 0][!!hotday, Highest_Mean := max(MEAN), rleid(length)][] # highest mean temp for each consecutive group
But I need to do the same thing using centered rolling percentiles over a 15-day window (i.e., for a given day, the 90th percentile of maximum temperature is the 90th percentile of the historical data in a 15-day window centered on that day).
I mean that the percentile is to be calculated from the historical data of each calendar day, using a 15-day calendar window. That is, there are 365 calendar days, so for day 118 I will use the historical data for days 111, 112, ..., 125. In my case, I have data for 40 years, so the 15-day window yields a total sample size of 40 years × 15 days = 600 for each calendar day. The moving window is based on the calendar day, not on the time series.
Any thoughts, please?
What about something like this to select the rows you want?
Since you want a sliding window of 15 days centered on the day of interest, each window consists of the 7 preceding days, the day of interest, and the 7 following days. Because of this constraint, the first 7 and the last 7 days (rows) of the dataset have no complete window; they are excluded by forcing them to FALSE with rep(FALSE, 7).
The code inside the sapply() call tests each day, starting from day 8 (7 + 1), against its 15-day sliding window and checks whether the maximum temperature is at or above the 90th percentile of that window (test1). A similar test (test2) checks whether the minimum temperature is at or below the 15th percentile of its window. If either test is TRUE, TRUE is returned; otherwise FALSE is returned. You can easily adapt this to your needs, for example by changing the thresholds or by combining the tests with & instead of |.
The resulting logical vector (stored in KEEP) can be used for subsetting the initial data frame.
set.seed(111)
df <- data.frame(MIN = sample(50:70, size = 50, replace = TRUE),
                 MAX = sample(70:90, size = 50, replace = TRUE))
head(df)
KEEP <- c(rep(FALSE, 7),
          sapply(8:(length(df$MAX) - 7), function(i) {
            test1 <- df$MAX[i] >= as.numeric(quantile(df$MAX[(i-7):(i+7)], 0.9, na.rm = TRUE))
            test2 <- df$MIN[i] <= as.numeric(quantile(df$MIN[(i-7):(i+7)], 0.15, na.rm = TRUE))
            test1 | test2
          }),
          rep(FALSE, 7))
head(KEEP)
df <- df[KEEP,]
df
This should return
MIN MAX
10 51 86
13 51 73
14 50 75
15 53 89
22 55 83
28 55 90
31 51 72
32 60 88
37 52 84
42 56 87
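Note that the window in the answer above slides along the rows of the series. For the calendar-day window described in the question (all years pooled for days d-7 to d+7), a minimal base R sketch could look like the following. The toy data, the helper name window_quantile and the wrap-around at the year boundary are my own assumptions, and leap days are not handled (the toy years are all non-leap).

```r
# Toy data (hypothetical): three non-leap years of daily temperatures
set.seed(1)
dates <- seq(as.Date("2001-01-01"), as.Date("2003-12-31"), by = "day")
toy <- data.frame(Date = dates,
                  MAX = rnorm(length(dates), mean = 25, sd = 5),
                  MIN = rnorm(length(dates), mean = 15, sd = 5))
toy$doy <- as.integer(format(toy$Date, "%j"))  # calendar day 1..365

# Percentile for each calendar day, pooled over all years in a centered window.
# Assumption: the window wraps around the year boundary (day 1 uses days 359..365 and 1..8).
window_quantile <- function(x, doy, prob, half = 7) {
  sapply(1:365, function(d) {
    win <- ((d - half):(d + half) - 1) %% 365 + 1
    quantile(x[doy %in% win], prob, na.rm = TRUE, type = 6, names = FALSE)
  })
}

q90_max <- window_quantile(toy$MAX, toy$doy, 0.90)
q85_min <- window_quantile(toy$MIN, toy$doy, 0.85)

# Compare every day with the thresholds of its own calendar day
toy$hotday <- +(toy$MAX >= q90_max[toy$doy] & toy$MIN >= q85_min[toy$doy])
```

With 40 years of data, each call to quantile() would see up to 40 × 15 = 600 values, as described in the question. The hotday column can then go through the same rle() step as before.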
Related
I have 1000+ rows and I want to calculate the coefficient of variation (CV) for each group of rows that share the same condition.
The data look like this:
Condition Y
0.5 25
0.5 26
0.5 27
1 43
1 45
1 75
5 210
5 124
5 20
10 54
10 78
10 10
and then I did:
CV <- function(x){
(sd(x)/mean(x))*100
}
CV.by.condition <- aggregate(Y ~ Condition,
data = df,
FUN = CV)
I have the feeling that what I did uses the mean of the whole column, because the results look off.
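One way to check this: aggregate() with a formula applies FUN separately to each group, so the result for one condition should match a CV computed by hand on that group only. A small sketch using the rows shown above (the names cv_by_condition and cv_manual are just for illustration):

```r
# The sample rows from the question
df <- data.frame(
  Condition = rep(c(0.5, 1, 5, 10), each = 3),
  Y = c(25, 26, 27, 43, 45, 75, 210, 124, 20, 54, 78, 10)
)

# Coefficient of variation in percent
CV <- function(x) (sd(x) / mean(x)) * 100

# One CV per condition
cv_by_condition <- aggregate(Y ~ Condition, data = df, FUN = CV)

# Manual check for a single group: uses only the rows with Condition == 0.5
y_05 <- df$Y[df$Condition == 0.5]
cv_manual <- sd(y_05) / mean(y_05) * 100

cv_by_condition
cv_manual
```

If the two values for condition 0.5 agree, the grouping is working and the whole-column mean is not being used.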
Suppose I have a dataframe named score.master that looks like this:
school perc.prof num.tested
A 8 482
B 6-9 34
C 40-49 49
D GE50 81
E 80-89 26
Here, school A's percent proficient is 8%, and the number of students tested is 482. However, suppose that when num.tested falls below a certain number (in this case arbitrarily 100), data suppression is introduced. In most cases, ranges of perc.prof are given but in other cases a value such as "GE50" is given, indicating greater than or equal to 50.
My question is, in a much larger dataset, what is the best way to replace a range with its median? So for example I want the final dataset to look like this:
school perc.prof num.tested
A 8 482
B 8 34
C 44 49
D 75 81
E 85 26
I know this can be done manually like this:
score.master$perc.prof[score.master$perc.prof == "6-9"] <- round(median(6:9), 0)
But the actual dataset has many more range combinations. One way I thought of selecting the correct values is by length; all provided values are 1-2 characters long (no more than 99 percent proficient) whereas the range values are 3 or more characters long.
You can use stringr::str_split() to get the lower and upper bounds, then calculate the median. Values such as "GE50" do not fit this pattern (they come out as NA); you could use ifelse() to handle such special cases.
df <- data.frame(perc.prof = c('8', '6-9', '40-49', 'GE50', '80-89'))
df$lower.upper <- sapply(stringr::str_split(df$perc.prof, '-'), as.integer)
df$perc.prof.median <- sapply(df$lower.upper, median)
df$lower.upper <- NULL
> df
perc.prof perc.prof.median
1 8 8.0
2 6-9 7.5
3 40-49 44.5
4 GE50 NA
5 80-89 84.5
You could do the following to replace your ranges with their median. However, I did not handle the "GExx" or "LExx" cases, since they are not defined well enough.
Note that you need the stringr package for my solution.
score.master$perc.prof <- sapply(score.master$perc.prof, function(x){
sep <- stringr::str_locate(x, "-")[, 1]
if(is.na(sep)) {
x
} else {
as.character(round(median(as.integer(stringr::str_sub(x, c(1L, sep+1), c(sep-1, -1L))))))
}
})
Here's a tidyverse approach. First I replace "GE50" with its expected output, then use tidyr::separate to split perc.prof where possible. The last step either uses the given perc.prof for large schools, or uses the midpoint of the range for small schools.
library(tidyverse)
df %>%
mutate(perc.prof = if_else(perc.prof == "GE50", "75", perc.prof)) %>%
separate(perc.prof, c("low", "high"), remove = FALSE, convert = TRUE) %>%
mutate(perc.prof.adj = if_else(num.tested > 100,
as.numeric(perc.prof),
rowMeans(select(., low, high), na.rm = TRUE)) # mean of the available bounds, so a single value such as "75" is kept as is
)
school perc.prof low high num.tested perc.prof.adj
1 A 8 8 NA 482 8.0
2 B 6-9 6 9 34 7.5
3 C 40-49 40 49 49 44.5
4 D 75 75 NA 81 75.0
5 E 80-89 80 89 26 84.5
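For completeness, here is a base R sketch that handles plain values, ranges and the "GE" case in one pass. The helper name range_midpoint is hypothetical. Treating "GExx" as the midpoint of xx..100 follows the question's example (GE50 becomes 75), while the "LExx" rule (midpoint of 0..xx) is purely my assumption, since the question does not define it.

```r
# Hypothetical helper: midpoint of a suppressed range, base R only
range_midpoint <- function(x) {
  x <- as.character(x)
  out <- suppressWarnings(as.numeric(x))   # plain numbers pass through, the rest become NA

  # "6-9" style ranges: mean of the two bounds (equals the median of the integer range)
  is_range <- grepl("^[0-9]+-[0-9]+$", x)
  bounds <- strsplit(x[is_range], "-", fixed = TRUE)
  out[is_range] <- vapply(bounds, function(b) mean(as.numeric(b)), numeric(1))

  # "GE50": midpoint between the bound and 100
  is_ge <- grepl("^GE[0-9]+$", x)
  out[is_ge] <- (as.numeric(sub("^GE", "", x[is_ge])) + 100) / 2

  # "LE10": assumed convention, midpoint between 0 and the bound
  is_le <- grepl("^LE[0-9]+$", x)
  out[is_le] <- as.numeric(sub("^LE", "", x[is_le])) / 2

  out
}

midpoints <- range_midpoint(c("8", "6-9", "40-49", "GE50", "80-89"))
midpoints
```

The result is numeric and unrounded; wrap it in round() if whole percentages are wanted, keeping in mind that R rounds halves to the nearest even number.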
Some sample data first:
set.seed(123)
dat <- data.frame(day= 1:50 ,demand = sample(0:17, size = 50,replace = T),supply = sample(2:9, size = 50,replace = T))
reservoir <- 200
I have demand and supply data from day 1 to day 50, and a fourth column which is the difference between supply and demand:
dat$balance <- dat$supply - dat$demand
I want to calculate another column called net deficit. Here's the logic:
If, for a given day, Demand > Supply, a deficit exists. However, this deficit can be met by the reservoir, in which case the net deficit column gets zero.
If Supply > Demand, the excess supply is added to the reservoir (only if reservoir < 200). If the reservoir is at its full capacity (200), the excess supply is discarded.
If Demand > Supply and the reservoir is zero, the net deficit column gets the difference between Demand and Supply.
For example, on day 1 there was a deficit (balance) of 3. This deficit is met by the reservoir (making it 197), and the net deficit is zero.
Day 2: the balance is -9, which is borrowed from the reservoir (making it 188), and the net deficit is zero again.
Day 3: there is an excess of 1, which is used to fill the reservoir (since reservoir < 200); the net deficit gets a value of 0 and the reservoir becomes 189.
Day 4: there is a deficit of 13 in balance, which is met by the reservoir. The reservoir is further reduced to 176.
I hope this is clear.
If at some point the reservoir becomes 0, the deficit cannot be compensated, and therefore the net deficit gets the value of dat$balance.
The solution basically uses a for loop to construct the reservoir vector from the balance of each day. The provided sample did not actually manage to empty the reservoir in 50 days, so I made it longer (which means the numbers are not the same as in the 50-day example). You can then simply bind the vector as a column to your data, and set the net_deficit column to zero while the reservoir is non-negative.
library(tidyverse)
set.seed(123)
dat <- tibble(
day = 1:100,
demand = sample(0:17, size = 100,replace = T),
supply = sample(2:9, size = 100,replace = T)
)
balance <- dat$supply - dat$demand
reservoir <- rep(200, nrow(dat))
reservoir[1] <- reservoir[1] + balance[1]
for (day in 2:nrow(dat)){
reservoir[day] <- reservoir[day - 1] + balance[day]
}
out <- dat %>%
bind_cols(balance = balance, reservoir = reservoir) %>%
mutate(net_deficit = ifelse(reservoir >= 0, 0, reservoir))
out[61:70, ]
# A tibble: 10 x 6
day demand supply balance reservoir net_deficit
<int> <int> <int> <int> <dbl> <dbl>
1 61 11 6 - 5 3.00 0
2 62 1 4 3 6.00 0
3 63 6 7 1 7.00 0
4 64 4 4 0 7.00 0
5 65 14 4 -10 - 3.00 - 3.00
6 66 8 6 - 2 - 5.00 - 5.00
7 67 14 7 - 7 -12.0 -12.0
8 68 14 3 -11 -23.0 -23.0
9 69 14 5 - 9 -32.0 -32.0
10 70 7 4 - 3 -35.0 -35.0
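In the output above the reservoir is allowed to go negative and is never capped at 200. If you want the rules from the question applied literally (excess discarded at full capacity, reservoir never below zero, net deficit equal to the unmet part of the balance), a sketch along these lines might work. The function name simulate_reservoir and its arguments are hypothetical, and the net deficit is kept negative to match the sign of dat$balance.

```r
# Hypothetical helper: track the reservoir day by day with a cap and a floor
simulate_reservoir <- function(balance, capacity = 200, start = capacity) {
  n <- length(balance)
  reservoir <- numeric(n)
  net_deficit <- numeric(n)
  level <- start
  for (i in seq_len(n)) {
    level <- level + balance[i]
    if (level > capacity) level <- capacity  # excess supply is discarded
    if (level < 0) {
      net_deficit[i] <- level                # part of the deficit the reservoir could not cover
      level <- 0                             # reservoir cannot go below empty
    }
    reservoir[i] <- level
  }
  data.frame(balance = balance, reservoir = reservoir, net_deficit = net_deficit)
}

# The first four days described in the question
first_days <- simulate_reservoir(c(-3, -9, 1, -13))
first_days

# A small capacity makes the cap and the floor visible
small <- simulate_reservoir(c(3, -4, -4, 2), capacity = 5)
small
```

When the reservoir is already empty, the whole balance of a deficit day ends up in net_deficit, which is the case the question describes.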
I have a data set of items downloaded from a website, based on reports we generate. The idea is to remove reports that are no longer needed, based on the number of downloads. The logic is basically: count all the downloads of each report over the last year, check whether they fall outside two absolute deviations around the median for the current year, and check whether the report has been downloaded within the last 4 weeks and, if so, how many times.
I have the code below, which doesn't work. I was wondering if anyone can help.
It gives me this error for the n_recent_downloads section:
Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
reports <- c("Report_A","Report_B","Report_C","Report_D","Report_A","Report_A","Report_A","Report_D","Report_D","Report_D")
Week_no <- c(36,36,33,32,20,18,36,30,29,27)
New.Downloads <- data.frame (Report1 = reports, DL.Week = Week_no)
test <- New.Downloads %>%
group_by(report1) %>%
summarise(n_downloads = n(),
n_recent_downloads = ifelse(sum((as.integer(DL.Week) >= (as.integer(max(DL.Week))) - 4),value,0)))
Providing a reproducible example would make life a lot easier. Nonetheless, I have modified your code to do what I think you were trying to achieve.
I've split it into two steps so you can see what is going on. I moved the ifelse statement to a mutate call, which gives:
library(dplyr)
New.Downloads <- data.frame(
Report1 = c("Report_A","Report_B","Report_C","Report_D","Report_A","Report_A","Report_A","Report_D","Report_D","Report_D"),
DL.Week = as.numeric(c(36,36,33,32,20,18,36,30,29,27))
)
test <- New.Downloads %>%
group_by(Report1) %>%
mutate(
median = median(DL.Week),
mad = 2 * mad(DL.Week),
check = ifelse(DL.Week > median + mad | DL.Week < median - mad, 0, DL.Week)
)
test
Source: local data frame [10 x 5]
Groups: Report1
Report1 DL.Week median mad check
1 Report_A 36 28.0 23.7216 36
2 Report_B 36 36.0 0.0000 36
3 Report_C 33 33.0 0.0000 33
4 Report_D 32 29.5 4.4478 32
5 Report_A 20 28.0 23.7216 20
6 Report_A 18 28.0 23.7216 18
7 Report_A 36 28.0 23.7216 36
8 Report_D 30 29.5 4.4478 30
9 Report_D 29 29.5 4.4478 29
10 Report_D 27 29.5 4.4478 27
Note that in your example none of the values are classed as extreme relative to the median ± 2 * mad criterion, so the check values are identical to DL.Week.
You can then chain a summarise onto the end of this to give you the sums.
test %>%
summarise(
n_recent_downloads = sum(check)
)
Source: local data frame [4 x 2]
Report1 n_recent_downloads
1 Report_A 110
2 Report_B 36
3 Report_C 33
4 Report_D 118
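Note that sum(check) adds up week numbers. If what you need is the number of downloads per report in the last 4 weeks, counting rows may be closer to the goal. A base R sketch, assuming "last 4 weeks" means within 4 weeks of the latest week in the whole data set (that reading, and the names downloads, recent and summary_tbl, are my assumptions):

```r
downloads <- data.frame(
  Report1 = c("Report_A", "Report_B", "Report_C", "Report_D", "Report_A",
              "Report_A", "Report_A", "Report_D", "Report_D", "Report_D"),
  DL.Week = c(36, 36, 33, 32, 20, 18, 36, 30, 29, 27)
)

# Flag downloads that fall within 4 weeks of the most recent week overall
latest_week <- max(downloads$DL.Week)
downloads$recent <- downloads$DL.Week >= latest_week - 4

# Count all downloads and recent downloads per report
n_downloads <- tapply(downloads$DL.Week, downloads$Report1, length)
n_recent <- tapply(downloads$recent, downloads$Report1, sum)

summary_tbl <- data.frame(
  Report1 = names(n_downloads),
  n_downloads = as.vector(n_downloads),
  n_recent_downloads = as.vector(n_recent)
)
summary_tbl
```

Reports with n_recent_downloads equal to zero would then be the candidates for removal.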
As you can see, I have Price and Day columns below:
Price Day
2 1
5 2
8 3
11 4
14 5
17 6
20 7
23 8
26 9
29 10
32 11
35 12
38 13
41 14
44 15
47 16
50 17
53 18
56 19
59 20
I then want the output below:
Difference Day
12 5
15 10
15 15
15 20
So now I have the difference in prices every 5 days. It basically subtracts the price on the 1st day from the price on the 5th day, then the price on the 5th day from the price on the 10th day, and so on.
I already wrote code that separates my data into 5-day intervals, but I want code that lets me subtract the 1st day from the 5th day, the 5th day from the 10th day, etc.
So the code should look something like this:
difference<-tapply(Price[,1],Day, ____________)
Here Price[,1] is my Price data, while "Day" is the variable I created to separate my Day data into 5-day intervals. I'm thinking that in the blank section I could put a function or another variable that subtracts the 1st-day price from the 5th-day price, then the 5th-day price from the 10th-day price, etc. You don't have to help me separate my days into intervals, just with the "difference" part. Thanks, guys.
Here's one option, assuming your data.frame is called "SODF":
within(SODF[c(1, seq(5, nrow(SODF), 5)), ], {
Price <- diff(c(0, Price))
})[-1, ]
# Price Day
# 5 12 5
# 10 15 10
# 15 15 15
# 20 15 20
The first step is basic subsetting. According to your description and expected answer, you want the first row, and then every fifth row starting from row 5:
> SODF[c(1, seq(5, nrow(SODF), 5)), ]
Price Day
1 2 1
5 14 5
10 29 10
15 44 15
20 59 20
From there, you can use diff on the "Price" column, but since diff returns a vector that is one element shorter than its input, you need to "pad" the input vector, which I did with diff(c(0, Price)).
# Correct values, but the number of rows needs to be 5
> diff(SODF[c(1, seq(5, nrow(SODF), 5)), "Price"])
[1] 12 15 15 15
Then, the [-1, ] at the end just deletes the extraneous row.
Update
@geektrader points out in the comments (thanks!) that, as an alternative to using:
SODF[c(1, seq(5, nrow(SODF), 5)), ]
as your input data.frame, you may consider using the following instead:
rbind(SODF[1, ], SODF[SODF$Day %% 5 == 0, ])
The difference between the two approaches is that the first simply subsets by row number, while the second subsets according to the value in the "Day" column, extracting rows where "Day" is a multiple of 5. This second approach might be useful, for instance, when there are missing rows in the dataset.
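To illustrate the missing-rows case, here is a small sketch. The data frame is rebuilt from the prices in the question, and dropping day 7 is just a made-up example; the names SODF_gap, by_value and by_row are hypothetical.

```r
# Rebuild the example data, then remove day 7 to emulate a missing row
SODF <- data.frame(Price = seq(2, 59, by = 3), Day = 1:20)
SODF_gap <- SODF[-7, ]

# Subsetting by value still picks days 1, 5, 10, 15 and 20
by_value <- rbind(SODF_gap[1, ], SODF_gap[SODF_gap$Day %% 5 == 0, ])
result <- data.frame(Difference = diff(by_value$Price), Day = by_value$Day[-1])
result

# Subsetting by row number drifts once a row is missing
by_row <- SODF_gap[c(1, seq(5, nrow(SODF_gap), 5)), ]
by_row$Day
```

Here by_row no longer lands on multiples of 5 after the gap, which is why the value-based subset is the safer choice when days can be missing.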
Ananda's is a nice approach (always forget about within myself). Here's another approach:
dat2 <- dat[seq(0, nrow(dat), by=5), ]
data.frame(Difference=diff(c(dat[1,1], dat2[, 1])), Day=dat2[, 2])
Here is a solution if you have a matrix as input.
The function below, given a matrix m, a column col_id and a numeric interval interv, takes every interv-th row and subtracts from its value in the col_id column the value at the previous selected row (interv rows earlier, same column; the first difference is taken against row 1).
The results are stored in a new column called diff, appended to the selected rows of m.
In short, the approach is very similar to the one used by @Ananda Mahto.
So, this is the function:
subtract_column <- function(m, col_id, interv) {
select <- c(1, seq(interv, nrow(m), interv))
cbind(m[select[-1], ], diff = diff(m[select, col_id]))
}
Example:
# this emulates your data as a matrix
price_vect <- c(2,5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50,53,56,59)
day_vect <- 1:20
matr <- do.call(cbind, list(price = price_vect, day = day_vect))
# and this calls the function above and does the job:
# subtracts every 5 rows the current and the previous (5 rows back) value in the column `price` of matrix `matr`
subtract_column(matr, 'price', 5)
Output:
price day diff
[1,] 14 5 12
[2,] 29 10 15
[3,] 44 15 15
[4,] 59 20 15