Store data according to starting date - r

I am working with 2488 CSV files stored in a folder. Each CSV contains monthly data from a gauging station in a years * months layout. I want to store all of these files in one big data frame in which the columns are the IDs of the gauging stations and the rows are the dates (%Y-%m-%d).
For this purpose, I have listed all the files with list.files:
a <- list.files(pattern="./*.csv",full.names=TRUE)
And created a dummy data frame with the final dimensions:
gst <- as.data.frame(matrix(NA, nrow = 1860, ncol = 2488)) # 1860 times x 2488 stations
Each csv file starts in a different date. The earliest date found is January 1863, so I created the data frame with 1860 rows from that starting date to June 2017.
I create a sequence of dates to name the rows of gst:
s <- paste0(1863,"-01-01")
e <- paste0(2017,"-12-31")
ss<- chron(s, format='y-m-d')
ee<- chron(e, format='y-m-d')
dates <- seq.dates(ss,ee,by='months')
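For reference, the same monthly index can also be built in base R without chron. This is only a minimal sketch, assuming the index should run from January 1863 through December 2017; the name `month_index` is made up for illustration.

```r
# Monthly sequence of first-of-month dates, base R only
month_index <- seq(as.Date("1863-01-01"), as.Date("2017-12-01"), by = "month")

length(month_index)                    # number of months covered
format(month_index[1:3], "%Y-%m-%d")   # first three dates as text
```

Because the result is of class Date, it can be used directly as a join key later on.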
In the following loop, I read each csv file. First I change the initial data frame format: years * months + total column.
# Initial format
Jan Feb Mar Apr
1993 NA 0.05 0.05 0.06
1994 0.18 0.15 0.1 0.19
1995 0.22 0.23 0.26 0.11
1996 0.14 0.11 0.1 0.08
1997 0.12 0.16 0.07 0.05
1998 0.12 0.07 0.12 0.18
1999 0.07 0.32 0.14 0.15
2000 0.13 0.22 0.15 0.1
2001 0.18 0.09 0.5 0.26
to a single column data frame (kk.df) with data stored as:
Date Value
93-01-01 NA
93-02-01 0.05
93-03-01 0.05
93-04-01 0.06
93-05-01 0.05
93-06-01 0.05
93-07-01 0.03
93-08-01 0.03
93-09-01 0.05
93-10-01 0.09
93-11-01 0.04
93-12-01 0.10
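The wide-to-long step described above can be sketched in base R as follows. It assumes the years are the row names and the columns are months in calendar order; the toy data frame `kk` (only four months, two years) and the name `long` are stand-ins, not the real files.

```r
# Toy stand-in for one station file: rows are years, columns are months
# (only 4 months shown to keep it short; real files have 12 plus a total)
kk <- data.frame(
  Jan = c(NA, 0.18), Feb = c(0.05, 0.15),
  Mar = c(0.05, 0.10), Apr = c(0.06, 0.19),
  row.names = c("1993", "1994")
)

n_months <- ncol(kk)
years    <- as.integer(rownames(kk))

long <- data.frame(
  # one first-of-month date per cell, year by year
  date  = as.Date(sprintf("%d-%02d-01",
                          rep(years, each = n_months),
                          rep(seq_len(n_months), times = length(years)))),
  # t() makes the values run across each row (year) before moving to the next
  value = as.vector(t(as.matrix(kk)))
)
long
```

With real files, a total column would have to be dropped before this step so that only the 12 month columns remain.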
This is the loop I am working with:
for (i in 1:length(a)){
kk <- read.csv(a[i])
colnames(kk) <- c(seq(1,12,1),'total') # 12 (months) columns and a total column
kk.ts <- ts(as.vector(t(as.matrix(kk))),
start=as.numeric(c(rownames(kk)[1],1)), end= as.numeric(c(rownames(kk)[dim(kk)[1]],12)),frequency=12)
kk.df <- as.data.frame(kk.ts)
colnames(kk.df) <- a[[i]]
a <- paste0(start,"-01-01")
b <- paste0(end,"-12-31")
ac<- chron(a, format='y-m-d')
bc<- chron(b, format='y-m-d')
times <- seq.dates(ac,bc, by="months")
rownames(kk.df) <- times
gst[i,] <- kk.df
}
My question is: since gst has one column per gauging station (2488 stations) and each station starts at a different year and month, how can I specify, when storing station i, the row at which its data must start?
If i = 1 and the first record is in 1993-01-01, I want that column to start at the row of gst that corresponds to 1993-01-01, and so on for the rest of the stations.
Thank you so much.

Maybe you could incorporate a join in your for loop for that:
library(dplyr)
df = data.frame(date = seq(Sys.Date(),Sys.Date()+3,by=1))
station1 = data.frame(date = seq(Sys.Date()+2,Sys.Date()+3,by=1),data = c(1,2))
station2 = data.frame(date = seq(Sys.Date()+1,Sys.Date()+2,by=1),data = c(2,3))
df = df %>% left_join(station1, by = "date") %>% rename(station1=data)
df = df %>% left_join(station2, by = "date") %>% rename(station2=data)
Input:
> df
date
1 2017-07-17
2 2017-07-18
3 2017-07-19
4 2017-07-20
> station1
date data
1 2017-07-19 1
2 2017-07-20 2
> station2
date data
1 2017-07-18 2
2 2017-07-19 3
Output:
> df
date station1 station2
1 2017-07-17 NA NA
2 2017-07-18 NA 2
3 2017-07-19 1 3
4 2017-07-20 2 NA
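Applied to monthly station data, the same idea might look like the sketch below. It uses base R merge() in place of left_join(), and it assumes each station has already been reshaped to a long data frame with `date` and `value` columns. The station IDs `st_A` and `st_B` and their values are invented for illustration.

```r
# Full monthly index (shortened here; the question's runs from 1863 to 2017)
gst <- data.frame(date = seq(as.Date("1993-01-01"), by = "month", length.out = 6))

# Hypothetical stations in long format, each starting in a different month
stations <- list(
  st_A = data.frame(date  = seq(as.Date("1993-03-01"), by = "month", length.out = 3),
                    value = c(0.05, 0.06, 0.05)),
  st_B = data.frame(date  = seq(as.Date("1993-01-01"), by = "month", length.out = 2),
                    value = c(0.18, 0.15))
)

for (id in names(stations)) {
  st <- stations[[id]]
  names(st)[names(st) == "value"] <- id             # name the column after the station
  gst <- merge(gst, st, by = "date", all.x = TRUE)  # left join: keep every date in gst
}
gst
```

Months with no record for a station come out as NA, so each column effectively starts at the row matching its first date. With thousands of stations, repeated merging may be slow; looking up row positions with match() on the date column is one possible alternative.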

Related

How to combine 2 rows in 1

I'm trying to reshape a data frame but I'm totally lost in how to proceed:
> test
# Time Entry Order Size Price S / L T / P Profit Balance
1 1 2017-01-11 00:00:00 buy 1 0.16 1.05403 1.0449 1.07838 NA NA
2 3 2017-01-24 16:00:00 s/l 1 0.16 1.04490 1.0449 1.07838 -97.28 9902.72
As you can see, there are 2 (or more) records for each order ID. What I want to do is combine those 2 rows into one by adding several new columns: Exit (where the "s/l" entry of the second observation should go) and Exit Price (where the Price value of the second entry should go), and by replacing the NAs in the Profit and Balance columns of the first entry with the values from the second one.
By the way, the original name of the Entry column is "Type" but I already changed that, so that's why it doesn't make that much sense of having the exit reason of the trade on a column called "Entry". So far I've only thought of extracting the data on several vectors and then just do a mutate on the first entry and dropping the second one, but I'm quite sure there's a better way of doing that. Also, that stone-age approach would be useless when applied to the whole data frame.
If possible, I'd like to stick to the tidyverse library to do this just for ease of replication. Thank you in advance for your suggestions!
I ended up sorting it out! My solution was to split the data frame in 2, reshape each half as needed, and then full joining them. Here's the initial data frame:
> head(backtest_table, n = 10)
# Time Type Order Size Price S / L T / P Profit Balance
1 1 2017.01.11 00:00 buy 1 0.16 1.05403 1.04490 1.07838 NA NA
2 2 2017.01.19 00:00 buy 2 0.16 1.05376 1.04480 1.07764 NA NA
3 3 2017.01.24 16:00 s/l 1 0.16 1.04490 1.04490 1.07838 -97.28 9902.72
4 4 2017.01.24 16:00 s/l 2 0.16 1.04480 1.04480 1.07764 -95.48 9807.24
5 5 2017.02.09 00:00 buy 3 0.15 1.05218 1.04265 1.07758 NA NA
6 6 2017.03.03 16:00 t/p 3 0.15 1.07758 1.04265 1.07758 251.75 10058.99
7 7 2017.03.29 00:00 buy 4 0.15 1.08826 1.07859 1.11405 NA NA
8 8 2017.04.04 00:00 close 4 0.15 1.08416 1.07859 1.11405 -41.24 10017.75
9 9 2017.04.04 00:00 sell 5 0.15 1.08416 1.09421 1.05737 NA NA
10 10 2017.04.07 00:00 sell 6 0.15 1.08250 1.09199 1.05718 NA NA
Here's the code I used to modify everything:
# Re-format data
library(dplyr)
library(lubridate)
# Separate entries and exits
entries <- backtest_table %>% filter(Type %in% c("buy", "sell"))
exits <- backtest_table %>% filter(!Type %in% c("buy", "sell"))
# Reshape entries and exits
# Entries
entries <- entries[-c(1, 9, 10)]
colnames(entries) <- c("Entry time", "Entry type", "Order", "Entry volume",
"Entry price", "Entry SL", "Entry TP")
entries$`Entry time` <- entries$`Entry time` %>% ymd_hm()
entries$`Entry type` <- as.factor(entries$`Entry type`)
# Exits
exits <- exits[-1]
colnames(exits) <- c("Exit time", "Exit type", "Order", "Exit volume",
"Exit price", "Exit SL", "Exit TP", "Profit", "Balance")
exits$`Exit time` <- exits$`Exit time` %>% ymd_hm()
exits$`Exit type` <- as.factor(exits$`Exit type`)
# Join re-shaped data
test <- full_join(entries, exits, by = c("Order"))
And here's the output of that:
> head(test, n = 10)
Entry time Entry type Order Entry volume Entry price Entry SL Entry TP Exit time
1 2017-01-11 buy 1 0.16 1.05403 1.04490 1.07838 2017-01-24 16:00:00
2 2017-01-19 buy 2 0.16 1.05376 1.04480 1.07764 2017-01-24 16:00:00
3 2017-02-09 buy 3 0.15 1.05218 1.04265 1.07758 2017-03-03 16:00:00
4 2017-03-29 buy 4 0.15 1.08826 1.07859 1.11405 2017-04-04 00:00:00
5 2017-04-04 sell 5 0.15 1.08416 1.09421 1.05737 2017-05-26 10:00:00
6 2017-04-07 sell 6 0.15 1.08250 1.09199 1.05718 2017-05-01 09:20:00
7 2017-04-19 sell 7 0.15 1.07334 1.08309 1.04733 2017-04-25 10:00:00
8 2017-05-05 sell 8 0.14 1.07769 1.08773 1.05093 2017-05-29 14:00:00
9 2017-05-24 sell 9 0.14 1.06673 1.07749 1.03803 2017-06-22 18:00:00
10 2017-06-14 sell 10 0.14 1.04362 1.05439 1.01489 2017-06-15 06:40:00
Exit type Exit volume Exit price Exit SL Exit TP Profit Balance
1 s/l 0.16 1.04490 1.04490 1.07838 -97.28 9902.72
2 s/l 0.16 1.04480 1.04480 1.07764 -95.48 9807.24
3 t/p 0.15 1.07758 1.04265 1.07758 251.75 10058.99
4 close 0.15 1.08416 1.07859 1.11405 -41.24 10017.75
5 t/p 0.15 1.05737 1.09421 1.05737 265.58 10091.18
6 s/l 0.15 1.09199 1.09199 1.05718 -94.79 9825.60
7 s/l 0.15 1.08309 1.08309 1.04733 -97.36 9920.39
8 t/p 0.14 1.05093 1.08773 1.05093 247.61 10338.79
9 t/p 0.14 1.03803 1.07749 1.03803 265.59 10504.05
10 s/l 0.14 1.05439 1.05439 1.01489 -100.33 10238.46
This combines each observation where a trade was opened (with NAs in the last columns) with the observation where that trade was closed, filling the last columns with the actual result and the new account balance.
If someone has suggestions on how to improve this, please let me know!
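One possible refinement is to pick and rename columns by name instead of by position, which keeps working if the column order changes. The sketch below is a cut-down illustration, not the full backtest table: the data frame `trades` keeps only a few of the original columns, and the new column names are invented.

```r
library(dplyr)

# Toy version of the trade log: one entry row and one exit row per order
trades <- data.frame(
  Time    = c("2017.01.11 00:00", "2017.01.19 00:00",
              "2017.01.24 16:00", "2017.01.24 16:00"),
  Type    = c("buy", "buy", "s/l", "s/l"),
  Order   = c(1, 2, 1, 2),
  Price   = c(1.05403, 1.05376, 1.04490, 1.04480),
  Profit  = c(NA, NA, -97.28, -95.48),
  Balance = c(NA, NA, 9902.72, 9807.24)
)

# Entries: keep and rename the columns that describe the opening of a trade
entries <- trades %>%
  filter(Type %in% c("buy", "sell")) %>%
  select(Order, entry_time = Time, entry_type = Type, entry_price = Price)

# Exits: same idea, plus the result columns that are only filled on exit
exits <- trades %>%
  filter(!Type %in% c("buy", "sell")) %>%
  select(Order, exit_time = Time, exit_type = Type, exit_price = Price,
         Profit, Balance)

# One row per order, entry and exit side by side
combined <- full_join(entries, exits, by = "Order")
combined
```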

Webscraping Returns xml_nodeset 0 in R

I am trying to scrape this website: https://web.tmxmoney.com/earnings.php?qm_symbol=DOL, specifically the table at the bottom of the page.
I have tried countless CSS selectors and XPath expressions but still get {xml_nodeset(0)}. I am looking for an intuitive answer rather than just the code.
Here are a few of my attempts.
library(httr)
library(rvest)
library(dplyr)
tbl = read_html('https://web.tmxmoney.com/earnings.php?qm_symbol=DOL')%>%
html_nodes("table")%>%.[2]%>%html_table(fill = T)#no luck
tbl = read_html('https://web.tmxmoney.com/earnings.php?qm_symbol=DOL')%>%
html_nodes(xpath = '//*[@id="DataTables_Table_0"]')%>%html_table(fill = T)#node set(0)
I have tried countless others, using selector gadget and inspecting the source code.
I didn't check the complete Terms of Service, so please be aware that scraping might not be legal.
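The id DataTables_Table_0 suggests that the table is built in the browser by JavaScript, so it is probably not part of the HTML that read_html() downloads, which would explain the empty node set. Driving a real browser with RSelenium lets the page's scripts run first, so the following should do the trick: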
library(rvest)
library(data.table)
library(httr)
library(XML)
library(RSelenium)
# Start a Selenium server and a Firefox session
mybrowser <- rsDriver(browser = 'firefox')
link <- "https://web.tmxmoney.com/earnings.php?qm_symbol=DOL"
mybrowser$client$navigate(link)
# Print the rendered table text to check that the element is found
mybrowser$client$findElement(using = 'css selector', "#DataTables_Table_0")$getElementText()
# Grab the rendered table's HTML and parse it with rvest
html.table <- mybrowser$client$findElement(using = 'css selector', "#DataTables_Table_0")
webElem5txt <- html.table$getElementAttribute("outerHTML")[[1]]
df.table <- read_html(webElem5txt) %>% html_table() %>% data.frame(.)
# Shut down the Selenium server
mybrowser$server$stop()
# Excerpt of the data:
> df.table
Var.1 Quarter.End X..EPS.Actual X..EPS.Estimate X..Estimates X..Surprise X..Surprise.1 Date
1 NA 2019-07-31 (Q2 2020) 0.45 0.47 3 -0.02 -4.26% 2019-09-12
2 NA 2019-04-30 (Q1 2020) 0.33 0.33 4 0.00 0.00% 2019-06-13
3 NA 2019-01-31 (Q4 2019) 0.54 0.55 4 -0.01 -1.82% 2019-03-28
4 NA 2018-10-31 (Q3 2019) 0.41 0.42 3 -0.01 -2.38% 2018-12-06
5 NA 2018-07-31 (Q2 2019) 0.43 0.44 3 -0.01 -2.27% 2018-09-13
6 NA 2018-04-30 (Q1 2019) 0.31 0.31 3 0.00 0.00% 2018-06-07
7 NA 2018-01-31 (Q4 2018) 0.48 0.47 4 0.01 2.13% 2018-03-29

How to use filter function to pick out specific dates?

I cannot filter my data set for a specific date.
padf_12 <- read.table("Paddus_2012_sheet2.csv", head = TRUE, sep=";" )
tibble(padf_12)
padf_12 <- padf_12 %>%
mutate_at(vars(Block, Treatment), factor) %>%
mutate("Date"=dmy(Date, tz = "UTC"))
I have tried different ways to filter the data e.g.
padf_12 <- padf_12 %>%
filter_at(padf_12, vars(Date == "2012-08-14"))
and
padf_12 <- padf_12 %>%
filter(padf_12, Treatment == "2012-08-14")
Because of the errors, I have also tried converting my dates to POSIXct:
padf_12 <- padf_12 %>%
mutate(as.POSIXct(padf_12$Date, tz="", format="%Y-%m-%d"))
My data looks like this:
Sample Date Time Plot Ch..Vol..L. Plot..old. iButton Treatment Block X2.methylbutyl.acetat X3.hexenyl.acetate
1 31-K1 20120522 2012-05-22 14:01:00 C1 13 K1 2198C9 C 1 0.00 0.02
2 32-K1 20120613 2012-06-13 10:19:00 C1 13 K1 2198C9 C 1 0.00 0.00
3 33-K1 20120626 2012-06-26 12:19:00 C1 13 K1 21980 C 1 0.00 0.00
4 34-K1 20120715 2012-07-15 12:15:00 C1 13 K1 2198CD C 1 0.00 0.02
5 35-K1 20120814 2012-08-14 C1 13 K1 C 1 0.00 2.34
6 36-K2 20120522 2012-05-22 15:12:00 C2 13 K2 2198C9 C 2 0.01 0.04
And here's a link to the full data set:
https://www.dropbox.com/s/m4qfrdagqxvdxnh/Filtering%20problem.R?dl=0
Any help is much appreciated.
Removing the data object from inside the filter() call, as suggested by TinglTanglBob, did the trick!
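The fix can be illustrated with a small sketch. The data frame `padf` below is a made-up stand-in for the real data, with a Date column of class Date rather than the POSIXct values produced in the question, so details may differ for the actual file.

```r
library(dplyr)

# Toy data with a Date column of class Date (values are made up)
padf <- data.frame(
  Sample = c("31-K1", "32-K1", "35-K1"),
  Date   = as.Date(c("2012-05-22", "2012-06-13", "2012-08-14"))
)

# With the pipe, the data frame is already the first argument of filter(),
# so it must not be passed again inside the call
one_day <- padf %>% filter(Date == as.Date("2012-08-14"))
one_day
```

Comparing against as.Date("2012-08-14") rather than a bare string keeps both sides of the comparison in the same class.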

One-liner to find corresponding value in large dataframe r

I am looking for a simple one-liner that will help me find a corresponding value in a dataframe.
Data sample:
weather <-data.frame("date" = seq(as.Date("2000/1/1"), by ="days", length.out = 10), temp = runif(10))
weather
date temp
1 2000-01-01 0.08520875
2 2000-01-02 0.69003449
3 2000-01-03 0.85892903
4 2000-01-04 0.37790250
5 2000-01-05 0.04121786
6 2000-01-06 0.31550816
7 2000-01-07 0.86219597
8 2000-01-08 0.30844555
9 2000-01-09 0.96949855
10 2000-01-10 0.18851018
Let's say I now want to find the day on which the maximum temperature occurred:
max_temp <- max(weather$temp)
max_temp
[1] 0.9694985
Now there are a couple of ways that I can find the date of this temperature (i.e. the corresponding value that I am after):
weather[which(weather$temp == max_temp), which(colnames(weather) == "date")]
[1] "2000-01-09"
But this is kind of laborious. I could also use dplyr:
library(dplyr)
filter(weather, temp == max_temp) %>%
select(date)
date
1 2000-01-09
But again, a two liner in the console just to get this seems like overkill.
I can't help but feel that there must be something like:
function(df, name_of_known_variable, value_of_known_variable, character_vector_of_variables_of_interest)
So for this example this would look like (assuming the function is "correspond"):
correspond(weather, temp, max_temp, date)
1 2000-01-09
I have looked all over and can't seem to find something simple for this. Please note that I understand that I could use:
weather[which.max(weather$temp), 1]
[1] "2000-01-09"
But let's assume that I am not necessarily looking for the maximum temperature (let's imagine I just have a value of interest and I am trying to find the corresponding value). Let's also imagine I have a massive data frame with lots and lots of columns (so many as to make counting them laborious). Further, let's imagine that I want to return corresponding values from multiple columns.
Turning my comment into an answer, using Base R only:
Create data, adding two more columns to provide a broader perspective:
set.seed( 1110 )
weather <-data.frame( "date" = seq( as.Date("2000/1/1"), by = "days", length.out = 10),
temp = round( runif( 10 ), 2 ),
loc = round( runif( 10 ) * 10, 2 ),
speed = round( runif( 10 ) * 50, 1 ) )
> weather
date temp loc speed
1 2000-01-01 0.48 9.79 18.9
2 2000-01-02 0.79 9.20 18.6
3 2000-01-03 0.88 9.65 46.3
4 2000-01-04 0.58 0.59 5.3
5 2000-01-05 0.22 6.12 38.7
6 2000-01-06 0.09 3.05 42.6
7 2000-01-07 0.49 4.09 2.1
8 2000-01-08 0.99 8.60 31.9
9 2000-01-09 0.56 4.27 12.6
10 2000-01-10 0.36 6.02 42.7
Now we can select with a one-liner, based on column names rather than numbers, as required:
# The day with the maximum temperature
weather[ weather$temp == max( weather$temp ), "date" ]
[1] "2000-01-08"
But we can do a lot more:
# Speed and Location (order reversed) on the day with a temperature of 0.49
weather[ weather$temp == .49, c( "speed", "loc" ) ]
speed loc
7 2.1 4.09
# Date and speed, based upon two selection criteria (Temperature or Location)
# here we need to use which() to get the row indices
weather[ c( which( weather$temp == min( weather$temp ) ), which( weather$loc == 6.12 ) ), c( "date", "speed" ) ]
date speed
6 2000-01-06 42.6
5 2000-01-05 38.7
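If a function with the signature sketched in the question is preferred, the indexing above can be wrapped in a small helper. `correspond` is the hypothetical name from the question, not an existing function, and this version takes column names as strings. The toy `weather` data is invented.

```r
# Hypothetical helper: rows where `known` equals `value`, columns in `wanted`
correspond <- function(df, known, value, wanted) {
  rows <- which(df[[known]] == value)   # which() drops NA comparisons
  df[rows, wanted, drop = FALSE]        # keep a data frame even for one column
}

weather <- data.frame(
  date  = seq(as.Date("2000-01-01"), by = "day", length.out = 5),
  temp  = c(0.48, 0.79, 0.88, 0.58, 0.22),
  speed = c(18.9, 18.6, 46.3, 5.3, 38.7)
)

correspond(weather, "temp", max(weather$temp), "date")
correspond(weather, "temp", 0.58, c("date", "speed"))
```

Exact equality on floating-point values is only reliable when the value comes from the column itself or is typed exactly as stored; for computed values, a tolerance-based comparison may be safer.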
Use the data.table package; the syntax is simple (a must be a data.table):
a[variable == value_you_want]
a[variable == max(variable)]
a[variable == 0]
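A runnable version of that pattern, on invented toy data, might look like this (a sketch that assumes the data.table package is installed):

```r
library(data.table)

weather <- data.table(
  date = seq(as.Date("2000-01-01"), by = "day", length.out = 5),
  temp = c(0.48, 0.79, 0.88, 0.58, 0.22)
)

weather[temp == max(temp)]          # whole row with the highest temperature
weather[temp == max(temp), date]    # only the matching date
```

The second argument inside the brackets selects columns, so several columns can be returned with a list such as .(date, temp).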
dplyr::slice is also a possibility here:
set.seed(1)
weather <-data.frame("date" = seq(as.Date("2000/1/1"), by ="days", length.out = 10), temp = runif(10))
library(dplyr)
weather %>% arrange(desc(temp)) %>% slice(1)
# A tibble: 1 x 2
date temp
<date> <dbl>
1 2000-01-07 0.9446753
And you can use dplyr::filter if you need to look for a specific value.

Summarize R data frame based on a date range in a second data frame

I have two data frames, one that includes data by day, and one that includes data by irregular multi-day intervals. For example:
A data frame precip_range with precipitation data by irregular time intervals:
start_date<-as.Date(c("2010-11-01", "2010-11-04", "2010-11-10"))
end_date<-as.Date(c("2010-11-03", "2010-11-09", "2010-11-12"))
precipitation<-(c(12, 8, 14))
precip_range<-data.frame(start_date, end_date, precipitation)
And a data frame precip_daily with daily precipitation data:
day<-as.Date(c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04", "2010-11-05",
"2010-11-06", "2010-11-07", "2010-11-08", "2010-11-09", "2010-11-10",
"2010-11-11", "2010-11-12"))
precip<-(c(3, 1, 2, 1, 0.25, 1, 3, 0.33, 0.75, 0.5, 1, 2))
precip_daily<-data.frame(day, precip)
In this example, precip_daily represents daily precipitation estimated by a model and precip_range represents measured cumulative precipitation for specific date ranges. I am trying to compare modeled to measured data, which requires synchronizing the time periods.
So, I want to summarize the precip column in data frame precip_daily (count of observations and sum of precip) by the date ranges between start_date and end_date in the data frame precip_range. Any thoughts on the best way to do this?
You can use the start_dates from precip_range as breaks to cut() to group your daily values. For example
rng <- cut(precip_daily$day,
breaks=c(precip_range$start_date, max(precip_range$end_date)),
include.lowest=T)
Here we cut the values in daily using the start dates in the range data.frame. We're sure to include the lowest value and stop at the largest end value. If we merge that with the daily values we see
cbind(precip_daily, rng)
# day precip rng
# 1 2010-11-01 3.00 2010-11-01
# 2 2010-11-02 1.00 2010-11-01
# 3 2010-11-03 2.00 2010-11-01
# 4 2010-11-04 1.00 2010-11-04
# 5 2010-11-05 0.25 2010-11-04
# 6 2010-11-06 1.00 2010-11-04
# 7 2010-11-07 3.00 2010-11-04
# 8 2010-11-08 0.33 2010-11-04
# 9 2010-11-09 0.75 2010-11-04
# 10 2010-11-10 0.50 2010-11-10
# 11 2010-11-11 1.00 2010-11-10
# 12 2010-11-12 2.00 2010-11-10
which shows that the values have been grouped. Then we can do
aggregate(cbind(count=1, sum=precip_daily$precip)~rng, FUN=sum)
# rng count sum
# 1 2010-11-01 3 6.00
# 2 2010-11-04 6 6.33
# 3 2010-11-10 3 3.50
This gives the count and the total for each of those ranges (labeled by their start date).
Or
library(zoo)
library(data.table)
temp <- merge(precip_daily, precip_range, by.x = "day", by.y = "start_date", all.x = T)
temp$end_date <- na.locf(temp$end_date)
setDT(temp)[, list(Sum = sum(precip), Count = .N), by = end_date]
## end_date Sum Count
## 1: 2010-11-03 6.00 3
## 2: 2010-11-09 6.33 6
## 3: 2010-11-12 3.50 3
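Both approaches above rely on the ranges being contiguous. As a hedged alternative, the sketch below checks each day against both start_date and end_date in base R, so days falling in a gap between ranges would simply be left out. The names `summary_by_range`, `measured`, `count` and `modelled` are invented for illustration.

```r
precip_range <- data.frame(
  start_date    = as.Date(c("2010-11-01", "2010-11-04", "2010-11-10")),
  end_date      = as.Date(c("2010-11-03", "2010-11-09", "2010-11-12")),
  precipitation = c(12, 8, 14)
)
precip_daily <- data.frame(
  day    = seq(as.Date("2010-11-01"), as.Date("2010-11-12"), by = "day"),
  precip = c(3, 1, 2, 1, 0.25, 1, 3, 0.33, 0.75, 0.5, 1, 2)
)

# For each measured interval, summarise the modelled days inside [start, end]
summary_by_range <- do.call(rbind, lapply(seq_len(nrow(precip_range)), function(i) {
  inside <- precip_daily$day >= precip_range$start_date[i] &
            precip_daily$day <= precip_range$end_date[i]
  data.frame(start_date = precip_range$start_date[i],
             end_date   = precip_range$end_date[i],
             measured   = precip_range$precipitation[i],
             count      = sum(inside),                      # days in the interval
             modelled   = sum(precip_daily$precip[inside])) # modelled total
}))
summary_by_range
```

Keeping the measured value next to the modelled total puts the two quantities to compare in the same row.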
