Create subset of data in R

I have data with one column which specifies day of the year, the code below provides an example dataset. No errors are appearing with my code but when I look at the number of observations in 'df_2' and 'df_3' something is wrong. I can't work out what it is.
#Example data
height <- c(21,34,64,27,74,90)
weight <- c(1,45,2,46,3,7)
day <- c(23,67,34,1,90,54)
df <- data.frame(height,weight,day)
#get days between 30 &70, and between 80 & 100
df_2 <- subset(df, day>"30" & day<"70")
df_3 <- subset(df, day>"80" & day<"100")
df_4 <- rbind(df_2,df_3)
I have also tried typing it as a range, e.g. subset(df, day[30:70]), but this produces an error.

Remove the quotes, because day is numeric. Comparing a number with a quoted value such as "30" makes R compare the two as text rather than as numbers.
df_2 <- subset(df, day>= 30 & day <= 70)
df_3 <- subset(df, day>=80 & day<=100)
df_4 <- rbind(df_2,df_3)
> print(df_4)
height weight day
2 34 45 67
3 64 2 34
6 90 7 54
5 74 3 90
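To see why the quoted version misbehaves without raising an error: when a number is compared with a string, R converts the number to a string and compares the two as text. The following minimal sketch reuses the example data from the question. The single call with `|` is one possible alternative to building df_2 and df_3 separately, and the name quoted is only illustrative.

```r
# Example data from the question
height <- c(21, 34, 64, 27, 74, 90)
weight <- c(1, 45, 2, 46, 3, 7)
day <- c(23, 67, 34, 1, 90, 54)
df <- data.frame(height, weight, day)

# Quoted bounds: day is converted to character, so "90" is compared with
# "100" as text and day 90 is not treated as smaller than 100
quoted <- subset(df, day > "80" & day < "100")

# Numeric bounds: both ranges selected in a single call
df_4 <- subset(df, (day >= 30 & day <= 70) | (day >= 80 & day <= 100))
```

This version keeps the rows in their original order, whereas rbind(df_2, df_3) places the rows from the 80 to 100 range last.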

Related

Select the same period every year in R

This seems really simple, yet I can't find an easy solution. I'm working with future streamflow projections for every day of a 27-year period (2024-2050). I'm only interested in streamflow during the 61-day period between the 11th of April and the 10th of June each year. I want to extract the data from the seq and Data columns that fall within this period for each year and have it together in a data frame.
Example data:
library(xts)
seq <- timeBasedSeq('2024-01-01/2050-12-31')
Data <- xts(1:length(seq),seq)
I want to achieve something like this BUT with all the dates between April 11 and June 10th and for all years (2024-2050). This is a shortened sample output:
seq_x <- c("2024-04-11","2024-06-10","2025-04-11","2025-06-10","2026-04-11","2026-06-10",
"2027-04-11", "2027-06-10")
Data_x <- c(102, 162, 467, 527, 832, 892, 1197, 1257)
output <- data.frame(seq_x, Data_x)
This question is similar to:
Calculating average for certain time period in every year
and
select date ranges for multiple years in r
but doesn't provide an efficient answer to my question on how to extract the same period over multiple years.
Here is a base R approach:
dates <- index(Data)
month <- as.integer(format(dates, '%m'))
day <- as.integer(format(dates, '%d'))
result <- Data[(month == 4 & day >= 11) | month == 5 | (month == 6 & day <= 10)]
result
#2024-04-11 102
#2024-04-12 103
#2024-04-13 104
#2024-04-14 105
#2024-04-15 106
#2024-04-16 107
#...
#...
#2024-06-07 159
#2024-06-08 160
#2024-06-09 161
#2024-06-10 162
#2025-04-11 467
#2025-04-12 468
#...
#...
Create an mmdd character string and subset using it:
mmdd <- format(time(Data), "%m%d")
Data1 <- Data[mmdd >= "0411" & mmdd <= "0610"]
These would also work. They shift the dates back by 10 days, so that April 11 to June 10 becomes April 1 to May 31, i.e. exactly the months of April and May:
Data2 <- Data[format(time(Data)-10, "%m") %in% c("04", "05")]
or
Data3 <- Data[ cycle(as.yearmon(time(Data)-10)) %in% 4:5 ]
The command fortify.zoo(x) can be used to convert an xts object x to a data frame.
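The mmdd comparison does not depend on xts. Here is a minimal sketch of the same idea on a plain data frame, which may be convenient when the result is needed as a data frame anyway. The names flow and spring are illustrative and do not appear in the question.

```r
# Daily sequence covering 2024-2050, with a running counter as stand-in data
dates <- seq(as.Date("2024-01-01"), as.Date("2050-12-31"), by = "day")
flow <- data.frame(date = dates, value = seq_along(dates))

# Zero-padded month-day strings compare correctly as text
mmdd <- format(flow$date, "%m%d")
spring <- flow[mmdd >= "0411" & mmdd <= "0610", ]

# Sanity check: number of selected rows per year
rows_per_year <- table(format(spring$date, "%Y"))
```

Checking rows_per_year is a quick way to confirm that every year contributes the same number of days.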
Here is an option. Group by the year of 'seq_x', then summarise to create a list column by subsetting 'Data' with the first and last elements of 'seq_x' in each group, pull that column, and bind the pieces together:
library(dplyr)
library(lubridate)
library(tidyr)
library(purrr)
library(stringr)   # provides str_c()
output %>%
  group_by(year = year(seq_x)) %>%
  summarise(new = list(Data[str_c(first(seq_x), last(seq_x), sep="::")]),
            .groups = 'drop') %>%
  pull(new) %>%
  invoke(rbind, .)
# [,1]
#2024-04-11 102
#2024-04-12 103
#2024-04-13 104
#2024-04-14 105
#2024-04-15 106
#2024-04-16 107
# ...

R filtering/selecting data by POSIXct time and a condition

I have measured temperature at a high time resolution of 10 minutes on different urban tree species, whose reactions should be compared. I am therefore especially interested in periods of heat. The task I cannot get to work on my dataset is selecting complete days based on a maximum value. For example, every day with at least one measurement above 30 °C should be extracted from my data frame in full.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether each individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether or not to pick a day for a new data frame. Whenever the value is above 30 °C at any time of the day, I want to include that whole date, from 00:00 to 23:59, in the new data frame for further analysis.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example yields a data frame analogous to the structure of my data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame containing every day on which at least one entry is marked as "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how can I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions for a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that holds only the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
library(dplyr)
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
newdf <- lapply(unique(Measurings$Time2), function(x) {
  rr <- Measurings %>% filter(Time2 == x) # all rows for date x
  # keep the day only if its heat30 values contain "heat" at least once;
  # days returning NULL are dropped by bind_rows()
  if (any(rr$heat30 == "heat")) rr else NULL
}) %>% bind_rows()
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour; format() writes every time stamp
# out in full so that each one splits into exactly two parts
time_df <- as.data.frame(
  str_split(format(Measurings$Time, "%Y-%m-%d %H:%M:%S"), " ", simplify = TRUE)
)
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by keeping every day that has at least one "heat" row
new_df <- new_measurings_df %>%
  filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are drawing a random sample of 1000 hourly temperatures between 20 and 35, which spans about 42 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is good practice to call set.seed() before sampling to ensure reproducibility.
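For comparison, the same selection can be sketched in base R without a loop, by attaching each day's maximum to every row of that day. This is only one possible approach. The temperature range 10:31 and the names Day, day_max and hot_days are assumptions made here so that some days actually fall below the threshold.

```r
set.seed(42)  # make the random example reproducible
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
Measurings <- data.frame(
  Time = seq(from = start, length.out = 1000, by = "hours"),
  Temp = sample(10:31, 1000, replace = TRUE)
)

# Calendar day of each measurement
Measurings$Day <- format(Measurings$Time, "%Y-%m-%d")

# ave() returns, for every row, the maximum Temp of that row's day
day_max <- ave(Measurings$Temp, Measurings$Day, FUN = max)

# Keep all rows of days whose maximum exceeds 30
hot_days <- Measurings[day_max > 30, ]
```

Because day_max has one value per row, the final subset keeps or drops whole days rather than individual measurements.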

Calculate the mean from a specific date

I have two dataframes.
I would like to compute the mean of the sp variable over the 5 days ending on each date given in a second data frame.
For example, for 1997-05-05 the mean would cover 1997-05-01 to 1997-05-05, and for 1997-05-31 it would cover 1997-05-27 to 1997-05-31, using only the days that have values (3 in this case).
Here are the variables:
sp <- c(7,9,9,4,2,5,2,9,NA,14,NA,NA,NA,NA,NA,14,25,NA,11,10,12,NA,9,NA,6,8,6,1,NA,7,NA)
Date <- c("1997-05-01","1997-05-02","1997-05-03","1997-05-04","1997-05-05",
"1997-05-06","1997-05-07","1997-05-08","1997-05-09","1997-05-10",
"1997-05-11","1997-05-12","1997-05-13","1997-05-14","1997-05-15",
"1997-05-16","1997-05-17","1997-05-18","1997-05-19","1997-05-20",
"1997-05-21","1997-05-22","1997-05-23","1997-05-24","1997-05-25",
"1997-05-26","1997-05-27","1997-05-28","1997-05-29","1997-05-30",
"1997-05-31")
data1 <- data.frame(sp, Date)
DateX <- c("1997-05-05","1997-05-15","1997-05-31")
data2 <- data.frame(DateX)
What is the best way to do that? Help would be much appreciated.
Here is my expected result (in the second dataframe, data2):
DateX spMean
1997-05-05 6.2
1997-05-15 NA
1997-05-31 4.6
I have made a few type changes to your initial code. Give the below a shot...I use lapply to run a quick function against the data1 object using the dates in your second object.
sp <- c(7,9,9,4,2,5,2,9,NA,14,NA,NA,NA,NA,NA,14,25,NA,11,10,12,NA,9,NA,6,8,6,1,NA,7,NA)
Date <- as.Date(c("1997-05-01","1997-05-02","1997-05-03","1997-05-04","1997-05-05",
"1997-05-06","1997-05-07","1997-05-08","1997-05-09","1997-05-10",
"1997-05-11","1997-05-12","1997-05-13","1997-05-14","1997-05-15",
"1997-05-16","1997-05-17","1997-05-18","1997-05-19","1997-05-20",
"1997-05-21","1997-05-22","1997-05-23","1997-05-24","1997-05-25",
"1997-05-26","1997-05-27","1997-05-28","1997-05-29","1997-05-30",
"1997-05-31"))
data1 <- data.frame(sp, Date)
DateX <- as.Date(c("1997-05-05","1997-05-15","1997-05-31"))
data2 <- data.frame(DateX)
#Add column for mean over the 5 days ending on each date (m - 4 to m); NA values return NA
data2$spMean_na <- sapply(DateX,
  function(m) mean(data1$sp[data1$Date >= m - 4 & data1$Date <= m]))
#Add column for mean, remove NA values (NaN when every value in the window is NA)
data2$spMean_na_omit <- sapply(DateX,
  function(m) mean(data1$sp[data1$Date >= m - 4 & data1$Date <= m],
                   na.rm = TRUE))
> data2
DateX spMean_na spMean_na_omit
1 1997-05-05 6.2 6.200000
2 1997-05-15 NA NaN
3 1997-05-31 NA 4.666667
I think you might need to change your expected result. Row 29 has an NA for the sp value and is within 5 days of 1997-05-31. So it should return an NA per your requirements as I understand them.
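One detail worth checking in this kind of window: both ends of the comparison are inclusive, so a condition written as Date >= m - k & Date <= m covers k + 1 days. The sketch below makes the window length explicit. window_mean() and its arguments are hypothetical names introduced here for illustration.

```r
sp <- c(7,9,9,4,2,5,2,9,NA,14,NA,NA,NA,NA,NA,14,25,NA,11,10,12,NA,9,NA,6,8,6,1,NA,7,NA)
Date <- seq(as.Date("1997-05-01"), as.Date("1997-05-31"), by = "day")
data1 <- data.frame(sp, Date)

# Hypothetical helper: mean of sp over the n_days days ending on `end`
window_mean <- function(end, n_days, na.rm = FALSE) {
  in_window <- data1$Date >= end - (n_days - 1) & data1$Date <= end
  mean(data1$sp[in_window], na.rm = na.rm)
}

end <- as.Date("1997-05-31")
five_day <- window_mean(end, 5, na.rm = TRUE)  # 27 May to 31 May
six_day  <- window_mean(end, 6, na.rm = TRUE)  # 26 May to 31 May
```

The two results differ because the six-day window also picks up the value for 26 May.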

Using lapply to output values between date ranges within different factor levels

I have 2 dataframes, one representing daily sales figures of different stores (df1) and one representing when each store has been audited (df2). I need to create a new dataframe displaying sales information from each site taken 1 week before each audit (i.e. the information in df2). Some example data, firstly for the daily sales figures from different stores across a certain period:
Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Sales <- as.data.frame(matrix(sample(0:50, 30*10, replace=TRUE), ncol=3))
df1 <- cbind(Dates,Sales)
colnames(df1) <- c("Dates","Site.A","Site.B","Site.C")
And for the dates of each audit across different stores:
Store<- c("Store.A","Store.A","Store.B","Store.C","Store.C")
Audit_Dates <- as.data.frame(as.POSIXct(c("2016/1/4","2016/3/1","2016/2/1","2016/2/1","2016/3/1")))
df2 <- as.data.frame(cbind(Store,Audit_Dates ))
colnames(df2) <- c("Store","Audit_Dates")
Note that there will be an uneven number of dates within each output (i.e. there may not be a full week's worth of information prior to some store audits). I have previously asked a question addressing a similar problem, "Creating a dataframe from an lapply function with different numbers of rows". Below is an answer from that question, which would work if I were to consider information from only 1 store:
library(lubridate)
##Data input
Store.A_Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Store.A_Sales <- as.data.frame(matrix(sample(0:50, 10*10, replace=TRUE), ncol=1))
Store.A_df1 <- cbind(Store.A_Dates,Store.A_Sales)
colnames(Store.A_df1) <- c("Store.A_Dates","Store.A_Sales")
Store.A_df2 <- as.Date(c("2016/1/3","2016/3/1"))
##Output
Store.A_output <- lapply(Store.A_df2, function(x) {
  Store.A_df1[difftime(Store.A_df1[, 1], x - days(7)) >= 0 &
              difftime(Store.A_df1[, 1], x) <= 0, ]
})
n1 <- max(sapply(Store.A_output, nrow))
output <- data.frame(lapply(Store.A_output, function(x) x[seq_len(n1),]))
But I don't know how I would get this for multiple sites.
Try this:
# Renamed vars for my convenience...
colnames(df1) <- c("t","Store.A","Store.B","Store.C")
colnames(df2) <- c("Store","t")
library(tidyr)
library(dplyr)
# Gather df1 so that df1 and df2 have the same format:
df1 = gather(df1, Store, Sales, -t)
head(df1)
t Store Sales
1 2015-12-30 Store.A 16
2 2015-12-31 Store.A 24
3 2016-01-01 Store.A 8
4 2016-01-02 Store.A 42
5 2016-01-03 Store.A 7
6 2016-01-04 Store.A 46
# This lapply call does not iterate over actual values, just indexes, which allows
# you to subset the data comfortably:
r <- lapply(1:nrow(df2), function(i) {
  audit.t = df2[i, "t"] #time of audit
  audit.s = df1[, "Store"] == df2[i, "Store"] #store audited
  df = df1[audit.s, ] #data from audited store
  df[, "audited"] = audit.t #add extra column with audit date
  week_before = difftime(df[, "t"], audit.t - (7*24*3600)) >= 0
  week_audit = difftime(df[, "t"], audit.t) <= 0
  df[week_before & week_audit, ]
})
Does this give you the proper subsets?
Also, to summarise your results:
r = do.call("rbind", r) %>%
group_by(audited, Store) %>%
summarise(sales = sum(Sales))
r
audited Store sales
<time> <chr> <int>
1 2016-01-04 Store.A 97
2 2016-02-01 Store.B 156
3 2016-02-01 Store.C 226
4 2016-03-01 Store.A 115
5 2016-03-01 Store.C 187
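The same subsetting can also be sketched in base R with merge(), assuming the sales data is already in long format. In this sketch both date columns are kept as Date, which avoids mixing Date and POSIXct in the comparisons. The names sales, audits, both, week_before and totals are illustrative, and the window is taken as the 7 days before each audit plus the audit day itself.

```r
set.seed(1)  # reproducible example data
days <- seq(as.Date("2015-12-30"), as.Date("2016-04-07"), by = "day")
sales <- data.frame(
  Date  = rep(days, times = 3),
  Store = rep(c("Store.A", "Store.B", "Store.C"), each = length(days)),
  Sales = sample(0:50, 3 * length(days), replace = TRUE)
)
audits <- data.frame(
  Store = c("Store.A", "Store.A", "Store.B", "Store.C", "Store.C"),
  Audit = as.Date(c("2016-01-04", "2016-03-01", "2016-02-01",
                    "2016-02-01", "2016-03-01"))
)

# Pair every sales row with every audit of the same store
both <- merge(sales, audits, by = "Store")

# Keep rows from 7 days before the audit up to and including the audit day
week_before <- both[both$Date >= both$Audit - 7 & both$Date <= both$Audit, ]

# Total sales per store and audit
totals <- aggregate(Sales ~ Store + Audit, data = week_before, FUN = sum)
```

The first Store.A audit falls only five days after the data starts, so it contributes fewer rows than the others, which is the uneven-length case mentioned in the question.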

Date intervals and data manipulation

I'm a new user of R and I'm a little bit stuck, my data looks like this:
dates temp
01/31/2011 40
01/30/2011 34
01/29/2011 30
01/28/2011 52
01/27/2011 39
01/26/2011 37
...
01/01/2011 31
I want to keep only the periods where temp is under 40 degrees, with the start date, the end date, and how many days each period lasts, for example:
from to days
01/29/2011 01/30/2011 2
01/26/2011 01/27/2011 2
I tried with difftime but it didn't work; maybe it will with a function.
Any help would be appreciated.
I'd do something like this. I'll use data.table here.
df <- read.table(header=TRUE, text="dates temp
01/31/2011 40
01/30/2011 34
01/29/2011 30
01/28/2011 52
01/27/2011 39
01/26/2011 37", stringsAsFactors=FALSE)
require(data.table)
dt <- data.table(df)
dt <- dt[, `:=`(date.form = as.Date(dates, format="%m/%d/%Y"),
id = cumsum(as.numeric(temp >= 40)))][temp < 40]
dt[, list(from=min(date.form), to=max(date.form), count=.N), by=id]
# id from to count
# 1: 1 2011-01-29 2011-01-30 2
# 2: 2 2011-01-26 2011-01-27 2
The idea is to first create a column with the dates column converted to Date format. Then create another column, id, that finds the positions where temp >= 40 and uses them to group the values lying between two temp >= 40 entries. That is, if you have c(40, 34, 30, 52, 39, 37), then you'd want c(1,1,1,2,2,2): everything between two values >= 40 must belong to the same group (34, 30 -> 1 and 39, 37 -> 2). After doing this, I'd remove the temp >= 40 entries.
Then you can split by this group and take min, max and length(.) (which is stored by default in .N).
Not as elegant as Arun's data.table answer, but here is a base R solution:
DF <- read.table(text = "dates temp\n01/31/2011 40\n01/30/2011 34\n01/29/2011 30\n01/28/2011 52\n01/27/2011 39\n01/26/2011 37",
header = TRUE, stringsAsFactors = FALSE)
DF$dates <- as.POSIXct(DF$dates, format = "%m/%d/%Y")
DF <- DF[order(DF$dates), ]
DF$ID <- cumsum(DF$temp >= 40)
DF2 <- DF[DF$temp < 40, ]
# Explanation split : split DF2 by DF2$ID
# lapply : apply function on each list element given by split
# rbind : bind all the data together
do.call(rbind, lapply(split(DF2, DF2$ID), function(x)
data.frame(from = min(x$dates),
to = max(x$dates),
count = length(x$dates))))
## from to count
## 0 2011-01-26 2011-01-27 2
## 1 2011-01-29 2011-01-30 2
First read in the data. read.zoo handles many of the details all in one line including reordering the data to be ascending and converting the dates to "Date" class. If z is the resulting zoo object then coredata(z) gives the temperatures and time(z) gives the dates.
Lines <- "
dates temp
01/31/2011 40
01/30/2011 34
01/29/2011 30
01/28/2011 52
01/27/2011 39
01/26/2011 37
"
library(zoo)
z <- read.zoo(text = Lines, header = TRUE, format = "%m/%d/%Y")
The crux of all this is the use of rle which computes lengths and values from which we can derive all quantities:
tt <- time(z)
with(rle(coredata(z) < 40), {
to <- cumsum(lengths)[values]
lengths <- lengths[values]
from <- to - lengths + 1
data.frame(from = tt[from], to = tt[to], days = lengths)
})
Using the first 6 lines of the input data shown we get:
from to days
1 2011-01-26 2011-01-27 2
2 2011-01-29 2011-01-30 2
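To make the rle step easier to follow, here is a minimal base R sketch of what rle returns for the six temperatures in ascending date order, and how run ends and starts are derived from it. The names runs, ends and starts are illustrative.

```r
# Temperatures for 26 Jan to 31 Jan, in ascending date order
temp <- c(37, 39, 52, 30, 34, 40)

# Run-length encoding of the condition: lengths of consecutive TRUE/FALSE runs
runs <- rle(temp < 40)

# Position of the last element of each run where the condition is TRUE
ends <- cumsum(runs$lengths)[runs$values]

# Position of the first element of each such run
starts <- ends - runs$lengths[runs$values] + 1L
```

Indexing the date vector with starts and ends then gives the from and to columns, and runs$lengths[runs$values] gives the number of days.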

Resources