Multiple data frame handling in R

I have several data frames, named like this:
plant1_wd_hrly, plant2_wd_hrly, plant3_wd_hrly, ...
Each of them contains data like this:
time temp
1 2012-01-01 00:00:00 20
2 2012-01-01 01:00:00 21
3 2012-01-01 02:00:00 22
4 2012-01-01 03:00:00 23
5 2012-01-01 04:00:00 24
I need to aggregate all of them to the daily level and also calculate the daily max and min.
Here is the code to generate such data frames:
x <- seq(
  from = as.POSIXct("2012-1-1 0:00", tz = "UTC"),
  to = as.POSIXct("2012-1-3 23:00", tz = "UTC"),
  by = "hour")
plant1_wd_hrly <- data.frame(time = x, temp = seq(20, length.out = length(x)))
plant1_wd_hrly$time <- as.POSIXct(substr(plant1_wd_hrly$time, 1, 10))
plant2_wd_hrly <- data.frame(time = x, temp = seq(25, length.out = length(x)))
plant2_wd_hrly$time <- as.POSIXct(substr(plant2_wd_hrly$time, 1, 10))
plant1_wd_hrly$temp[2:3] <- NA
plant2_wd_hrly$temp[5:6] <- NA
With only one data frame, I usually do the aggregation with the dplyr package:
library(dplyr)
plant1_dly <- plant1_wd_hrly %>%
  group_by(time) %>%
  summarise(
    temp_avg = mean(temp, na.rm = TRUE),
    temp_max = max(temp, na.rm = TRUE),
    temp_min = min(temp, na.rm = TRUE))
But with multiple data frames, what is a more efficient way to do this?
The first thing I'm thinking of is a for loop. Can I look up a dynamically generated variable name in R, so I can loop through the different data frames, since they all have very similar names? To assign a value to a dynamically generated variable name I can use assign(), but how do I read one?
Thank you.

Make a vector of the data frame names, for instance:
df_names <- grep("plant", ls(), value = TRUE)
This works if no other variable names contain "plant"; otherwise you need a stricter regex, or pick the names by hand.
Then just loop over the names, using get() and assign() in the loop body.
get() takes a name as a string and returns the value of that variable; assign() takes a name and a value and assigns the value to that name.
for (df_n in df_names) {
  temp_data <- get(df_n) %>%
    group_by(time) %>%
    summarise(
      temp_avg = mean(temp, na.rm = TRUE),
      temp_max = max(temp, na.rm = TRUE),
      temp_min = min(temp, na.rm = TRUE))
  assign(paste0(df_n, "_agr"), temp_data)
}
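A list-based alternative avoids get()/assign() entirely: collect the data frames into a named list with mget() and aggregate each element with lapply(). A minimal sketch (the plants/plants_agr names are my choices, not something the question fixes):
library(dplyr)
# collect all matching data frames into one named list
plants <- mget(ls(pattern = "plant\\d+_wd_hrly"))
# aggregate each element; the results stay together in one named list
plants_agr <- lapply(plants, function(d) {
  d %>%
    group_by(time) %>%
    summarise(
      temp_avg = mean(temp, na.rm = TRUE),
      temp_max = max(temp, na.rm = TRUE),
      temp_min = min(temp, na.rm = TRUE))
})
Keeping the results in a list also makes later steps (binding with bind_rows(), plotting, writing out) easier than managing many loose variables.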

Related

How can I pass multiple arguments from one dataframe to modify values in another dataframe in R?

I want to use manual inputs from a QAQC 'log file' to update an existing data frame. The following log file indicates date ranges (bounded by datetime_min and datetime_max) for which observations of a particular variable (or 'all' of them) should be omitted from the data frame (set to NA).
library(tidyverse)
library(lubridate)
QC_log <- tibble(
  variable = c("SpCond", "pH", "pH", "all"),
  datetime_min = ymd_hms(c("2021-06-01 18:00:00", "2021-07-19 18:00:00", "2021-08-19 18:00:00", "2021-11-23 18:00:00")),
  datetime_max = ymd_hms(c("2021-06-02 18:00:00", "2021-07-25 21:00:00", "2021-08-19 20:00:00", "2021-11-26 05:00:00"))
)
The log should modify the following example of a data frame, removing the observations for each variable (for now I am not worried about 'all') that fall between the date min/max.
df <- tibble(
  Datetime = ymd_hms(c("2021-06-01 17:00:00", "2021-06-01 18:00:00", "2021-06-01 19:00:00", "2021-11-23 16:00:00", "2021-11-23 17:00:00", "2021-11-23 18:00:00")),
  SpCond = c(220, 225, 224, 230, 231, 235),
  pH = c(7.8, 7.9, 8.0, 7.7, 7.8, 7.7)
)
I have tried pmap like this:
df %>%
  {pmap(QC_log, mutate(., ..1 = ifelse(Datetime > ..2 & Datetime < ..3, "NA", ..1)))}
I assumed pmap() was taking ..1, ..2, and ..3 from QC_log, where ..1 is variable, ..2 is datetime_min, and ..3 is datetime_max, and passing those as arguments into mutate() one QC_log row at a time, which would then conditionally replace observations with NA when they fall into the specified date range.
I think I am having a hard time understanding non-standard evaluation and how arguments get passed through functions, among other things. Hopefully this is simple for now. Eventually I would like this functionality to be more sophisticated, e.g. changing all observations to NA when variable = 'all', adding separate actions such as attaching a data flag rather than omitting, or omitting observations by a specific criterion (e.g., "<10") rather than a date range.
You can do the following:
inner_join(
  df %>% pivot_longer(cols = c("SpCond", "pH")),
  QC_log,
  by = c("name" = "variable")
) %>%
  filter((Datetime < datetime_min) | (Datetime > datetime_max)) %>%
  select(Datetime, name, value) %>%
  distinct() %>%
  pivot_wider(id_cols = Datetime)
Output
Datetime SpCond pH
<dttm> <dbl> <dbl>
1 2021-06-01 17:00:00 220 7.8
2 2021-06-01 18:00:00 NA 7.9
3 2021-06-01 19:00:00 NA 8
4 2021-11-23 16:00:00 230 7.7
5 2021-11-23 17:00:00 231 7.8
6 2021-11-23 18:00:00 235 7.7
And here is a data.table approach:
dcast(
  unique(
    melt(setDT(df), id = "Datetime")[setDT(QC_log), on = .(variable), allow.cartesian = TRUE, nomatch = 0] %>%
      .[(Datetime < datetime_min) | (Datetime > datetime_max), .(Datetime, variable, value)]
  ),
  Datetime ~ variable, value.var = "value"
)
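As a side note on the original pmap() attempt: the non-standard evaluation difficulty disappears if you index columns by name with [[ or [ instead of passing bare names to mutate(). A minimal base-R sketch looping over the log rows (the handling of variable == "all" is my extrapolation of the stated goal, not part of the answers above):
df_qc <- df
for (i in seq_len(nrow(QC_log))) {
  v <- QC_log$variable[i]
  # columns affected by this log row; "all" masks every measured variable
  cols <- if (v == "all") setdiff(names(df_qc), "Datetime") else v
  in_range <- df_qc$Datetime >= QC_log$datetime_min[i] &
              df_qc$Datetime <= QC_log$datetime_max[i]
  df_qc[in_range, cols] <- NA
}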

Automatically convert multiple date-time formats to date only in R

I have a data frame that was put together by binding together data read in from multiple .csv files. The data comprise 6 variables and approximately 560,000 observations.
One variable, date.time, is unfortunately in two formats: dd/mm/yyyy hh:mm:ss and dd/mm/yy hh:mm. What I would like to do is mutate() the variable to a date-only format.
I have tried df %>% mutate(date = as.Date(dmy_hms(date.time))), but I get a "failed to parse" error, as you would expect given that there are two date/time formats in the same column.
Another approach I have tried is df %>% mutate(date = anydate(date.time)), using anydate() from the anytime package, but this is far too slow, and given the size of the data frame it uses all available memory in the CPU environment I'm working in.
I'm hoping there is a swift and easy way of addressing this.
Thanks.
How about this:
library(tidyverse)
library(lubridate)
df %>%
  mutate(time_temp = dmy_hms(time, quiet = TRUE)) %>%
  mutate(time = if_else(is.na(time_temp),
                        dmy_hm(time, quiet = TRUE),
                        time_temp)) %>%
  select(-time_temp)
#> # A tibble: 4 x 1
#> time
#> <dttm>
#> 1 2020-01-01 00:00:01
#> 2 2020-01-02 00:01:01
#> 3 2020-01-04 00:02:00
#> 4 2020-01-03 01:02:00
reprex data
df <- tibble(
  time = c("01/01/2020 00:00:01", "02/01/2020 00:01:01", "04/01/20 00:02", "03/01/20 01:02")
)
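A more compact route, if parsing speed is acceptable, is lubridate's parse_date_time(), which tries several candidate formats in one pass; a sketch (untested at the question's 560,000-row scale, and wrapped in as.Date() to get the date-only column the question asks for):
library(dplyr)
library(lubridate)
df %>%
  mutate(date = as.Date(parse_date_time(time, orders = c("dmy HMS", "dmy HM"))))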

R - create a time series from file names

I have 900 files named like 20120412_bwDD2yYa.txt. The first part, up to the _, is in year-month-day format. Some days have multiple files associated with them.
I'd like to use the dates extracted from the file names to compile a time series with the dates on the x axis and the number of files on the y axis.
How can I do this?
Here is a solution in base R. Since the question does not include a reproducible example, we'll simulate the file names, parse out the dates, and create the counts by date.
# use list.files() to pull the file names from a directory;
# the pattern is a regex, so the dot must be escaped
files <- list.files(path = "./data", pattern = "\\.txt$", full.names = FALSE)
# simulate the result from list.files()
files <- c("20120101_aaa.txt", "20120101_bbb.txt", "20120102_ccc.txt")
# extract the dates from the file names
date <- as.Date(substr(files, 1, 8), "%Y%m%d")
df <- data.frame(date, count = rep(1, length(date)))
aggregate(count ~ date, data = df, sum)
...and the output:
date count
1 2012-01-01 2
2 2012-01-02 1
dplyr solution
A solution with dplyr::summarise() looks like this:
files <- list.files(path = "./data", pattern = "\\.txt$", full.names = FALSE)
# simulate the result from list.files()
files <- c("20120101_aaa.txt", "20120101_bbb.txt", "20120102_ccc.txt")
library(dplyr)
data.frame(date = as.Date(substr(files, 1, 8), "%Y%m%d")) %>%
  group_by(date) %>%
  summarise(count = n())
# A tibble: 2 x 2
date count
<date> <int>
1 2012-01-01 2
2 2012-01-02 1
Accounting for dates with no files
In response to a comment on my answer, here is a solution that fills in gaps in the file list where there are days with 0 files. We take the minimum and maximum dates from the file list and create a data frame containing the full sequence of dates. Then we left_join() this with the previously aggregated data and recode the NA values of count to 0.
# create a gap in the dates that have files
files <- c("20120101_aaa.txt", "20120101_bbb.txt", "20120102_ccc.txt",
           "20120104_aaa.txt", "20120104_aab.txt", "20120104_aac.txt")
library(dplyr)
data.frame(date = as.Date(substr(files, 1, 8), "%Y%m%d")) %>%
  group_by(date) %>%
  summarise(count = n()) -> fileCounts
# create a df with all dates, left_join(), and recode NA to 0
data.frame(date = as.Date(min(fileCounts$date):max(fileCounts$date),
                          origin = "1970-01-01")) %>%
  left_join(., fileCounts) %>%
  mutate(count = if_else(is.na(count), 0, as.numeric(count)))
...and the output:
Joining, by = "date"
date count
1 2012-01-01 2
2 2012-01-02 1
3 2012-01-03 0
4 2012-01-04 3
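The same gap-filling can also be done in one step with tidyr::complete(); a minimal sketch reusing the fileCounts built above:
library(tidyr)
# expand to the full daily sequence; 0L keeps count an integer, matching n()
fileCounts %>%
  complete(date = seq(min(date), max(date), by = "day"),
           fill = list(count = 0L))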
You can use table() to count the frequencies and then stack() the result to get a data frame.
Using @Len Greski's files:
files <- c("20120101_aaa.txt","20120101_bbb.txt","20120102_ccc.txt")
stack(table(as.Date(sub("_.*", "", files), "%Y%m%d")))[2:1]
# ind values
#1 2012-01-01 2
#2 2012-01-02 1
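None of the answers above actually draw the time series the question describes. A minimal ggplot2 sketch, assuming the fileCounts data frame from the gap-filling step:
library(ggplot2)
# dates on the x axis, number of files on the y axis
ggplot(fileCounts, aes(x = date, y = count)) +
  geom_col()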

R dplyr and string values: how to split and get the second element? vapply/sapply

I've been having difficulty with one data frame manipulation in R.
I have two columns for well height and a date-time string ("yyyy-mm-dd HH:MM:ss").
I would like to extract all the rows from this table that occur at midnight (00:00:00).
I could manipulate this table in seconds with python, but I want to figure it out in R using strsplit() instead of POSIXct.
How do I mutate the table so that I split the date-time string and extract just the time value into a new column?
I think the answer is in vapply, but I have been drenching myself in manuals for the last couple of weeks and still can't figure it out.
Welcome to SO. It can be done in multiple ways. Try this:
## some data
df <- data.frame(height = c(11, 12),
                 time = c("1999-9-9 00:00:00", "1999-9-9 00:00:02"),
                 stringsAsFactors = FALSE)
df
#> height time
#> 1 11 1999-9-9 00:00:00
#> 2 12 1999-9-9 00:00:02
## In base R
df2 <- df
df2$hms <- do.call(rbind, strsplit(df2$time, " "))[, 2]
df2[df2$hms == "00:00:00", ]
#> height time hms
#> 1 11 1999-9-9 00:00:00 00:00:00
## In tidyverse
library(dplyr)
df3 <- df %>%
  mutate(hms = gsub(".*(..:..:..).*", "\\1", time)) %>%
  filter(hms == "00:00:00")
df3
#> height time hms
#> 1 11 1999-9-9 00:00:00 00:00:00
Created on 2018-10-04 by the reprex package (v0.2.1)
You don't provide an example, so here is my guess:
Let's say you have a character vector (could be a column):
dateTimes <- c("1999-01-01 11:11:11", "1999-01-01 12:12:12", "1999-01-01 13:13:13")
You extract the times at the end:
ans <- sub(".*-\\d+\\s", "", dateTimes, perl = TRUE)
#[1] "11:11:11" "12:12:12" "13:13:13"
Save them into a new variable or column. When you want to extract the rows that occur at 00:00:00, simply use a string comparison to subset your data:
df1[ans == "00:00:00",]
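Since the question specifically asks about vapply(), here is a minimal sketch of the strsplit() + vapply() combination, using the df from the first answer (the column names height and time are that answer's, not the asker's):
# split each date-time string on the space and keep the second piece;
# vapply() guarantees a character(1) result for every element
df$hms <- vapply(strsplit(df$time, " "), `[`, character(1), 2)
df[df$hms == "00:00:00", ]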

R, Aggregate Summing Across All Values in Reference Column (Instead of Just One)

I've tried several different methods to get a summary table of averages by half hour, similar to an averages pivot table. My preferred method is aggregate(), but I seem to get nothing but an average for the top row.
The data are as shown below; the Group and Messages columns can be ignored.
The code I'm using is...
library(readr)
Data <- read_csv("P:/Book3.csv",
                 col_types = cols(Date = col_date(format = "%m/%d/%Y"),
                                  Time = col_time(format = "%H:%M:%S")))
View(Data)
class(Data)
[1] "tbl_df" "tbl" "data.frame"
aggregate(Data[, 3:4], list(Data$Time), mean)
Group.1 Calls Estimated_Calls
1 08:30:00 15.38889 14.55556
You'll notice the single row; ideally the output would have averages for every time.
Any help would be great. Thanks.
I like to use the dplyr library for problems of this sort:
library(dplyr)
Data %>%
  group_by(Time) %>%
  summarise(Mean_Calls = mean(Calls), Mean_Est_Calls = mean(Estimated_Calls))
I find the pipe %>% makes code easier to read (once you get used to it); it comes from the magrittr package and is re-exported by dplyr.
I prefer to use data.table for summary operations like this:
library(data.table)
setDT(Data)
Data[, .(Mean_Calls = mean(Calls), Mean_Est_Calls = mean(Estimated_Calls)),
     by = .(Group, Time)]
This will group by Group and Time, meaning you'll have one row for each combination of Group and Time.
With dummy data (used 3 "hours" for time; also changed by to keyby to sort):
set.seed(48)
df1 <- data.table(Group = sample(LETTERS[1:3], 10, TRUE),
                  Time = sample(1:3, 10, TRUE),
                  Calls = sample(1:50, 10, TRUE),
                  Estimated_Calls = sample(1:50, 10, TRUE))
df1[, .(Mean_Calls = mean(Calls), Mean_Est_Calls = mean(Estimated_Calls)),
    keyby = .(Group, Time)]
Output:
Group Time Mean_Calls Mean_Est_Calls
1: A 2 27.00000 22.00000
2: A 3 34.66667 25.66667
3: B 2 26.00000 6.50000
4: B 3 20.00000 1.00000
5: C 2 35.50000 32.00000
With aggregate:
df2 <- aggregate(df1[, 3:4], by = with(df1, list(Group, Time)), mean)
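The by = list(...) form labels the grouping columns Group.1 and Group.2 in the result. The formula interface keeps the original column names; a small sketch with the same df1 (my preference, not part of the answer above):
# the formula method preserves Group and Time as column names in the output
aggregate(cbind(Calls, Estimated_Calls) ~ Group + Time, data = df1, mean)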
