I am working with a dataset of hourly temperatures and I need to calculate "degree hours" above a heat threshold for each extreme event. I intend to run stats on the intensities (combined magnitude and duration) of each event to compare multiple sites over the same time period.
Example of data:
Temp
1 14.026
2 13.714
3 13.25
.....
21189 12.437
21190 12.558
21191 12.703
21192 12.896
Data after selecting only hours above the threshold of 18 degrees and then subtracting 18 to reveal degrees above 18:
Temp
5297 0.010
5468 0.010
5469 0.343
5470 0.081
5866 0.010
5868 0.319
5869 0.652
After this step, I need help summing the values over consecutive hours during which the reading exceeded my threshold.
What I am hoping to produce from the above sample:
Temp
1 0.010
2 0.434
3 0.010
4 0.971
I've debated manipulating these data within a time series or by adding additional columns, but I do not want multiple rows for each warming event. I would immensely appreciate any advice.
This is an alternative solution in base R.
You have some data that walks around, and you want to sum up the points above a cutoff. For example:
set.seed(99999)
x <- cumsum(rnorm(30))
plot(x, type='b')
abline(h=2, lty='dashed')
which plots the walk with the cutoff drawn as a dashed horizontal line at 2.
First, we want to split the data into groups based on when they cross the cutoff. We can use run-length encoding on the indicator to get a compressed version:
x.rle <- rle(x > 2)
which has the value:
Run Length Encoding
lengths: int [1:8] 5 2 3 1 9 4 5 1
values : logi [1:8] FALSE TRUE FALSE TRUE FALSE TRUE ...
The first group is the first 5 points where x > 2 is FALSE; the second group is the two following points, and so on.
We can create a group id by replacing the values in the rle object, and then back transforming:
x.rle$values <- seq_along(x.rle$values)
group <- inverse.rle(x.rle)
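To make the grouping concrete: with the lengths shown above, the back-transform is just rep(seq_along(x.rle$lengths), times = x.rle$lengths), i.e. one group ID per point:
head(group, 12)
# [1] 1 1 1 1 1 2 2 3 3 3 4 5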
Finally, we aggregate by group, keeping only the data above the cutoff (passing the vectors explicitly as a data frame):
aggregate(x ~ group, data = data.frame(x, group), subset = x > 2, FUN = sum)
Which produces:
group x
1 2 5.113291213
2 4 2.124118005
3 6 11.775435706
4 8 2.175868979
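Applied to the original question, the same pattern runs on the raw hourly series; a minimal sketch, assuming the temperatures are in a vector Temp and the threshold is 18:
r <- rle(Temp > 18)              # runs of above/below-threshold hours
r$values <- seq_along(r$values)  # one ID per run
event <- inverse.rle(r)          # back-transform: one ID per hour
# sum the excess over 18 within each above-threshold run ("degree hours")
aggregate(excess ~ event,
          data = data.frame(excess = Temp - 18, event = event),
          subset = excess > 0, FUN = sum)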
I'd use data.table for this, although there are certainly other ways.
library( data.table )
setDT( df )
temp.threshold <- 18
First make a column holding the previous value of Temp for each row. This will help find the points at which the temperature rose above your threshold value.
df[ , lag := shift( Temp, fill = 0, type = "lag" ) ]
Now use that previous value column to compare with the Temp column. Mark every point at which the temperature rose above the threshold with a 1, and all other points as 0.
df[ , group := 0L
][ Temp > temp.threshold & lag <= temp.threshold, group := 1L ]
Now we can take the cumulative sum of that new column, which gives each sequence after the temperature rose above the threshold its own group ID.
df[ , group := cumsum( group ) ]
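To see why this gives one ID per event: cumsum turns the 0/1 start markers into a running count, so every row from one crossing up to the next shares the same ID. For example:
cumsum( c( 0L, 1L, 0L, 0L, 1L, 0L ) )
# [1] 0 1 1 1 2 2
Rows before the first crossing get group 0, but the filtering in the next step removes them anyway.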
Now we can get rid of every value not above the threshold.
df <- df[ Temp > temp.threshold, ]
And summarise what's left by finding the "degree hours" of each "group".
bygroup <- df[ , sum( Temp - temp.threshold ), by = group ]
I modified your input data a little to provide a couple of test events where the data rose above the threshold:
df <- structure(list(num = c(1L, 2L, 3L, 4L, 5L, 21189L, 21190L, 21191L,
21192L, 21193L, 21194L), Temp = c(14.026, 13.714, 13.25, 20,
19, 12.437, 12.558, 12.703, 12.896, 21, 21)), class = "data.frame",
row.names = c(NA, -11L))
With that data, here's the output of the code above (note $V1 is in "degree hours"):
> bygroup
group V1
1: 1 3
2: 2 6
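For reference, the same steps collapse into a single data.table chain (a sketch using the df and temp.threshold objects above; cumsum coerces the logical start marker to integer directly):
bygroup <- df[ , lag := shift( Temp, fill = 0 )
  ][ , group := cumsum( Temp > temp.threshold & lag <= temp.threshold )
  ][ Temp > temp.threshold,
     .( degree.hours = sum( Temp - temp.threshold ) ), by = group ]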
I'm having trouble understanding how to apply lead to a data.table.
I would like to calculate the bearing between the current point and the next point, i.e. between the current row and the next row of the data.table.
Here is what I have tried. I have the route data.table:
library(geosphere)
library(data.table)
route<-structure(list(counter = 1:6, lon = c(11.829711, 11.8336202,
11.8333238, 11.8341994, 11.8336198, 11.8337213), lat = c(48.1091400999115,
48.1153102999101, 48.1269571999072, 48.1273386999071, 48.1297995999066,
48.1309630999063)), row.names = c(NA, -6L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x55b3b7da26f0>)
Next I create a "lead" data.table holding the next rows:
lead_route_dt<-route[, data.table::shift(.SD, 1, NA, "lead", TRUE), ]
And try to apply bearingRhumb on both data.tables:
apply(data.frame(route$lon,route$lat), 1, FUN = function(x) bearingRhumb(x,cbind(lead_route_dt$lon_lead_1,lead_route_dt$lat_lead_1)))
but as a result I get an error:
Error in if (sum(keep) == 0) { : missing value where TRUE/FALSE needed
What am I doing wrong?
The error is due to bearingRhumb not handling NA:
bearingRhumb(c(NA,NA),c(11,42))
#Error in if (sum(keep) == 0) { : missing value where TRUE/FALSE needed
You need to exclude rows containing NA from the calculation:
route[,`:=`(next_lon = shift(lon,1,type='lead'),next_lat = shift(lat,1,type='lead'))]
route[!(is.na(next_lon)),bearing:=bearingRhumb(cbind(lon,lat),cbind(next_lon,next_lat))]
route
counter lon lat next_lon next_lat bearing
1: 1 11.82971 48.10914 11.83362 48.11531 22.928957
2: 2 11.83362 48.11531 11.83332 48.12696 359.026720
3: 3 11.83332 48.12696 11.83420 48.12734 56.865278
4: 4 11.83420 48.12734 11.83362 48.12980 351.066053
5: 5 11.83362 48.12980 11.83372 48.13096 3.332292
6: 6 11.83372 48.13096 NA NA NA
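Alternatively, the helper columns can be skipped: bearingRhumb is vectorised over two-column matrices of points, so each row can be paired with the next one directly. A sketch on the same route object (bearing2 is just a hypothetical column name):
n <- nrow(route)
route[ , bearing2 := c(
  bearingRhumb(cbind(lon[-n], lat[-n]),   # points 1 .. n-1
               cbind(lon[-1], lat[-1])),  # points 2 .. n
  NA) ]                                   # last row has no successor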
I have a genetic dataset where I am grouping gene variants that are physically close together in the genome: per chromosome, I group variants that lie within ±500 of certain spots.
I've coded this, but when I look at my output, the newly formed group column doesn't start numbering from 1; the lowest-numbered group is 5, and I can't figure out why.
My 'spots' dataset holds the positions that variants need to be within ±500 of, and looks like:
connected_spots chrom min max low high
1 1 1000 1200 500 1700
2 1 20000 20100 19500 20600
3 5 900 1000 400 1500
The low and high columns give the range I collect variants in; they are created by taking min - 500 and max + 500:
spots[, c("low", "high") := .(min - 500, max + 500)]
I then check whether my second dataset, df, has any variants (rows) whose genome position falls between low and high, and group variants together in a group column when they are in range of the same spot:
df$connected_spots <- seq.int(nrow(df))
#Find matches in df on matching chromosomes and position between low&high:
df <- df[spots, group := i.connected_spots, on = .(chrom, position > low, position < high ) ]
My df, whose gene variants I'm checking against the spot ranges, looks like:
Gene chrom position connected_spots
Gene1 1 1200 1
Gene2 1 10000 2
Gene3 5 500 3
I pre-create a connected_spots column in df just to give each variant/row an ID to be picked up during the position matching. Is this actually problematic to do, given that I'm ultimately searching for which variant is in connected_spots range in my first dataset? And what would cause my real data's lowest group to be 5?
So the output I'm getting (when I sort groups lowest to highest) is:
Gene chrom position group
Gene4 1 1200 5
Gene8 1 10000 5
Gene9 5 500 6
Why doesn't group start with group 1?
Is there another way I can code this line:
df <- df[spots, group := i.connected_spots, on = .(chrom, position > low, position < high ) ]
to potentially avoid the problems I'm having with my real data?
Input data:
#Spots dataset:
structure(list(connected_spots = 1:3, chrom = c(1L, 1L, 5L),
min = c(1000L, 20000L, 900L), max = c(1200L, 20100L, 1000L
), low = c(500L, 19500L, 400L), high = c(1700L, 20600L, 1500L
)), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
#Variants to find in range of spots:
structure(list(Gene = c("Gene1", "Gene2", "Gene3"), chrom = c(1L,
1L, 5L), position = c(1200L, 10000L, 500L), connected_spots = 1:3), row.names = c(NA,
-3L), class = c("data.table", "data.frame"))
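As for why the lowest group can be 5: the update join copies spots$connected_spots into df$group, so group holds spot IDs rather than a fresh 1-based sequence; if nothing in your real data matches spots 1 through 4, numbering starts at 5. If consecutive numbering is wanted, the matched IDs can be densely re-ranked afterwards; a sketch, assuming the df and spots objects above:
df[spots, group := i.connected_spots,
   on = .(chrom, position > low, position < high)]
# dense re-rank so matched groups run 1, 2, ... regardless of spot IDs
df[!is.na(group), group := as.integer(frank(group, ties.method = "dense"))]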
Based on the following data frame, I would like to compute rolling correlations with a window size of 12:
library(rugarch)
library(rmgarch)
data(dji30retw)
Dat = dji30retw[, 1:8, drop = FALSE]
> dput(head(Dat))
structure(list(AA = c(-0.00595239852729524, 0.00595239852729524,
-0.0149479614358734, 0.0470675108579858, 0.0170944333593002,
0.0251059211310762), AXP = c(-0.00794285351393668, -0.0258495814613253,
-0.0265355536259657, -0.0359320092260634, -0.0555200763856309,
0.0254559199933486), BA = c(-0.00886642920564158, -0.0102302682508148,
-0.0142397228111357, -0.0237478178500363, -0.046456440823212,
-0.0590524317817008), BAC = c(-0.0311983708558615, 0, -0.0358461317731357,
-0.0258794479878207, -0.0304205967007118, -0.0116506172199752
), C = c(-0.0258635105899192, -0.0176216013498196, -0.0134230203321406,
-0.0944096844710748, -0.0352681388374579, 0.0203052661607457),
CAT = c(0.0158733491562901, 0.0411369055604894, -0.046400075604764,
-0.00794706169253204, -0.0106952891167477, 0.0369435151916841
), CVX = c(-0.0220481372217624, 0.0632438936600297, -0.0165791288029112,
-0.0340063679851951, -0.0287101058824313, 0.0112631922787107
), DD = c(0.00638979809877117, 0.0354573118367292, -0.0354573118367292,
0.00529381860971498, -0.031101702565588, -0.0198026272961791
)), .Names = c("AA", "AXP", "BA", "BAC", "C", "CAT", "CVX",
"DD"), row.names = c("1987-03-27", "1987-04-03", "1987-04-10",
"1987-04-17", "1987-04-24", "1987-05-01"), class = "data.frame")
After computing the rolling correlations, I would like to create a data frame with one column holding the average correlation coefficient per time period T (in this case, per week).
Is there anyone out there who could help me out? I would really appreciate it!
Thanks in advance!
There are methods in R more tailored to time series analysis than the one I'm about to show.
This is a very inelegant solution. I've created my own data for the example:
library(dplyr)
#set seed
set.seed(123)
#initialize data frame
roll_corr <- data.frame(matrix(nrow = 365,ncol = 5))
names(roll_corr) <- c("date","week","sales1","sales2","corr")
#generate sequence of dates
roll_corr$date <- seq(as.Date("2000/01/01"), as.Date("2000/12/30"), by="day")
# calculate week number
roll_corr$week <- as.numeric(roll_corr$date - roll_corr$date[1]) %/% 7
#generate random variates for sales
roll_corr$sales1 <- rnorm(365,500,1000)
roll_corr$sales2 <- runif(365,1000,80000)
#calculate correlation using a for loop (note: this is an expanding window; each value uses all rows up to i, not a fixed 12-row window)
for(i in 1:365) {
roll_corr$corr[i] <- cor(roll_corr$sales1[1:i],roll_corr$sales2[1:i])
}
#use dplyr to group data by week and calculate average correlation
weekly_roll_corr <- roll_corr %>%
group_by(week) %>%
summarize(average = mean(corr,na.rm = TRUE))
head(weekly_roll_corr)
week average
1 0 0.1480184
2 1 -0.1008872
3 2 0.1265146
4 3 0.2481083
5 4 0.2518001
6 5 0.1892407
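Note that the loop above is an expanding window, not the fixed 12-observation window asked about. For a true 12-period rolling correlation averaged across all pairs of the eight series, a sketch with zoo::rollapplyr (assuming the Dat data frame from the question; zoo is an extra dependency):
library(zoo)
# zoo series indexed by the row dates
z <- zoo(as.matrix(Dat), as.Date(rownames(Dat)))
# 12-week rolling correlation matrix, averaged over all pairs of series;
# by.column = FALSE passes each 12-row window to FUN as one matrix
avg_corr <- rollapplyr(z, width = 12, by.column = FALSE, fill = NA,
                       FUN = function(w) { cm <- cor(w); mean(cm[lower.tri(cm)]) })
result <- data.frame(date = index(avg_corr), avg_corr = coredata(avg_corr))
Since dji30retw holds weekly returns, result already contains one average correlation per week.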
I'm a trying-to-be R user. I never learned to code properly and have been getting by with what I find online.
I've run into a problem that I need some expert help with.
I have two data files.
1. Particulate matter (PM) concentrations (~20,000 observations)
2. Coefficient combinations to use with the particulate matter concentrations to calculate final concentrations.
For example:
Data set 1.
ID PM
1 5
2 10
... ...
1500 25
Data set 2.
alpha beta
5 6
1 2
... ...
I ultimately have to use all the coefficient combinations (alpha and beta) for each of the IDs from data set 1. For example, if I have 10 observations in data set 1, and 10 coefficient combinations in data set 2, my output table should have 100 different output values (10*10=100).
for (i in cmaq$FID) {
mean=cmaq$PM*IER$alpha*IER$beta
}
I used the above code to do what I'm trying to do, but it only gave me 10 output values rather than 100. I think using the split function first and then somehow combining the result with the second dataset would work, but I have not figured out how...
It may be a very very simple problem, but after spending hours to figure it out, I thought it may be a better strategy to get some help from R experts.
Thank you in advance!!!
You can do:
df1 = data.frame(
ID = c(1, 2, 1500),
PM = c(5, 10, 25)
)
df2 = data.frame(
alpha = c(5, 6),
beta = c(1, 2)
)
library(dplyr)
df1 %>%
group_by(ID) %>%
do(data.frame(result = .$PM * df2$alpha * df2$beta,
alpha = df2$alpha,
beta = df2$beta))
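do() is superseded in current dplyr; the same cross join can be sketched with tidyr::expand_grid (assuming tidyr >= 1.0, and the same df1/df2 as above):
library(tidyr)
# every row of df1 paired with every row of df2, then the product
expand_grid(df1, df2) %>%
  mutate(result = PM * alpha * beta)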
Look for the term 'cross join' or 'Cartesian join' (e.g., How to do cross join in R?).
If that doesn't address the issue, please see https://stackoverflow.com/help/mcve. I think there is a mistake inside the loop: the body never uses the loop index i, so every iteration just recomputes the same elementwise product, one value per row of cmaq, which is why you get 10 values instead of 100.
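For a base R illustration of that cross join, a sketch with the df1/df2 objects from the other answers (merge with by = NULL returns the Cartesian product of the rows):
combos <- merge(df1, df2, by = NULL)   # cross join: every ID x every coefficient pair
combos$result <- combos$PM * combos$alpha * combos$beta
nrow(combos)                           # 3 IDs x 2 coefficient rows = 6 results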
We can do this with outer
data.frame(ID = rep(df1$ID, each = nrow(df2)), alpha = df2$alpha,
beta = df2$beta, result = c(t(outer(df1$PM, df2$alpha*df2$beta))))
# ID alpha beta result
#1 1 5 1 25
#2 1 6 2 60
#3 2 5 1 50
#4 2 6 2 120
#5 1500 5 1 125
#6 1500 6 2 300
data
df1 <- structure(list(ID = c(1, 2, 1500), PM = c(5, 10, 25)), .Names = c("ID",
"PM"), row.names = c(NA, -3L), class = "data.frame")
df2 <- structure(list(alpha = c(5, 6), beta = c(1, 2)), .Names = c("alpha",
"beta"), row.names = c(NA, -2L), class = "data.frame")
A novice R user here. So I have a data set formatted like:
Date Temp Month
1-Jan-90 10.56 1
2-Jan-90 11.11 1
3-Jan-90 10.56 1
4-Jan-90 -1.67 1
5-Jan-90 0.56 1
6-Jan-90 10.56 1
7-Jan-90 12.78 1
8-Jan-90 -1.11 1
9-Jan-90 4.44 1
10-Jan-90 10.00 1
In R syntax:
datacl <- structure(list(Date = structure(1:10, .Label = c("1990/01/01",
"1990/01/02", "1990/01/03", "1990/01/04", "1990/01/05", "1990/01/06",
"1990/01/07", "1990/01/08", "1990/01/09", "1990/01/10"), class = "factor"),
Temp = c(10.56, 11.11, 10.56, -1.67, 0.56, 10.56, 12.78,
-1.11, 4.44, 10), Month = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L)), .Names = c("Date", "Temp", "Month"), class = "data.frame", row.names = c(NA,
-10L))
I would like to subset the data for a particular month, apply a change factor to the temp, then save the results. So I have something like:
idx <- subset(datacl, Month == 1) # Index
results[idx[,2],1] = idx[,2]+change # change applied to only index values
but I keep getting an error like
Error in results[idx[, 2], 1] = idx[, 2] + change:
only 0's may be mixed with negative subscripts
Any help would be appreciated.
First, give the change factor a value:
change <- 1
Now, here is how to create an index:
# one approach to subsetting is to create a logical vector:
jan.idx <- datacl$Month == 1
# alternatively the which function returns numeric indices:
jan.idx2 <- which(datacl$Month == 1)
If you want just the subset of data from January,
jandata <- datacl[jan.idx,]
transformed.jandata <- transform(jandata, Temp = Temp + change)
To keep the entire data frame, but only add the change factor to Jan temps:
datacl$Temp[jan.idx] <- datacl$Temp[jan.idx] + change
First, note that subset does not produce an index, it produces a subset of your original dataframe containing all rows with Month == 1.
Then when you are doing idx[,2], you are selecting out the Temp column.
results[idx[,2],1] = idx[,2] + change
But then you are using these as an index into results, i.e. you're using them as row numbers. Row numbers can't be things like 10.56 or -1.11, hence your error. Also, you're selecting the first column of results which is Date and trying to add temperatures to it.
There are a few ways you can do this.
You can create a logical index that is TRUE for a row with Month == 1 and FALSE otherwise like so:
idx <- datacl$Month == 1
Then you can use that index to select the rows in datacl you want to modify (this is what you were trying to do originally, I think):
datacl$Temp[idx] <- datacl$Temp[idx] + change # or 'results' instead of 'datacl'?
Note that datacl$Temp[idx] selects the Temp column of datacl and the idx rows.
You could also do
datacl[idx,'Temp']
or
datacl[idx,2] # as Temp is the second column.
If you only want results to be the subset where Month == 1, try:
results <- subset(datacl, Month == 1)
results$Temp <- results$Temp + change
This is because results only contains the rows you are interested in, so there's no need to do subsetting.
Personally, I would use ifelse() and leverage the syntactic beauty that is within() for a nice one-liner:
datacl <- within(datacl, Temp <- ifelse(Month == 1, Temp + change, Temp))
Well, I said one-liner, but you'd need to define change somewhere else too.