I am trying to pull some economic data from Investing.com. Here is a link to the non-farm payroll I am looking to pull.
https://ca.investing.com/economic-calendar/nonfarm-payrolls-227
As you can see, once you click the show more button, more rows are loaded. I would like to scrape all the hidden data in the table.
If you inspect the page you can quite easily see the HTML tags associated with each row. I was wondering if there is an easy way to scrape the data without using RSelenium.
Here is my current code, which only returns the 6 rows initially shown when the page first loads.
library(rvest)  # provides read_html(), html_nodes(), html_table() and %>%
x <- read_html("https://ca.investing.com/economic-calendar/nonfarm-payrolls-227") %>%
  html_nodes('table') %>% .[1] %>% html_table(fill = TRUE)
print(x)
# Release Date Time Actual Forecast Previous
1 May 03, 2019 (Apr) 08:30 263K 181K 189K NA
2 Apr 05, 2019 (Mar) 08:30 196K 175K 33K NA
3 Mar 08, 2019 (Feb) 09:30 20K 181K 311K NA
4 Feb 01, 2019 (Jan) 09:30 304K 165K 222K NA
5 Jan 04, 2019 (Dec) 09:30 312K 178K 176K NA
6 Dec 07, 2018 (Nov) 09:30 155K 200K 237K NA
I have a text file of many rows containing date and time and the end goal is for me to group together the number of rows per week that their date values are in. This is so that I can plot a scatter diagram with x values being the week number and y values being the frequency. For example the text file (dates.txt):
Mon May 11 22:51:27 2013
Mon May 11 22:58:34 2013
Wed May 13 23:15:27 2013
Thu May 14 04:11:22 2013
Sat May 16 19:46:55 2013
Sat May 16 22:29:54 2013
Sun May 17 02:08:45 2013
Sun May 17 23:55:15 2013
Mon May 18 00:42:07 2013
So from here, week 1 will have a frequency of 6 and week 2 will have a frequency of 1.
As I want to plot a scatter diagram for this, I want to convert them to text values first using strptime() with the format %a %b.
My attempt so far has been:
time_stamp <- strptime(time_stamp, format='%a.%b')
However, it reports that the input string is too long. I'm very new to RStudio, so could somebody please help me figure this out?
Thank you
Example of final output graph : https://imgur.com/a/3o3DivA
You could use readLines() to avoid the data frame, then read time using strptime, and finally strftime to format the output.
strftime(strptime(readLines('dates.txt'), '%c'), '%a.%b')
# [1] "Sat.May" "Sat.May" "Mon.May" "Tue.May" "Thu.May" "Thu.May" "Fri.May" "Fri.May" "Sat.May"
Edit
So it appears that your dates have a time zone abbreviation "Mon Apr 06 23:49:29 PDT 2009". Since it is constant during the dates we can specify it literally in the pattern.
We will use '%d_%m' for strftime to get something numeric separated by _, which we feed to strsplit and then convert to numerics with type.convert.
Finally we unlist, create a matrix that we fill byrow, and plot the guy.
strptime(readLines('timestamp.txt'), '%a %b %d %H:%M:%S PDT %Y') |>
strftime('%d_%m') |>
strsplit('_') |>
type.convert(as.is=TRUE) |>
unlist() |>
matrix(ncol=2, byrow=TRUE) |>
plot(pch=20, col=4, main='My Plot', xlab='day', ylab='month')
Note: Please use R>=4.1 for the |> pipes.
You need to first read (or assign) the data, parse it to a date type and then use that to e.g. get the number of the week.
Here is one example
text <- "Mon May 11 22:51:27 2013
Mon May 11 22:58:34 2013
Wed May 13 23:15:27 2013
Thu May 14 04:11:22 2013
Sat May 16 19:46:55 2013
Sat May 16 22:29:54 2013
Sun May 17 02:08:45 2013
Sun May 17 23:55:15 2013
Mon May 18 00:42:07 2013"
data <- read.table(text=text, sep='\n', col.names="dates")
data$parse <- anytime::anytime(data$dates)
data$week <- as.integer(format(data$parse, "%V"))
data
The result is a new data.frame object:
> data
dates parse week
1 Mon May 11 22:51:27 2013 2013-05-11 22:51:27 19
2 Mon May 11 22:58:34 2013 2013-05-11 22:58:34 19
3 Wed May 13 23:15:27 2013 2013-05-13 23:15:27 20
4 Thu May 14 04:11:22 2013 2013-05-14 04:11:22 20
5 Sat May 16 19:46:55 2013 2013-05-16 19:46:55 20
6 Sat May 16 22:29:54 2013 2013-05-16 22:29:54 20
7 Sun May 17 02:08:45 2013 2013-05-17 02:08:45 20
8 Sun May 17 23:55:15 2013 2013-05-17 23:55:15 20
9 Mon May 18 00:42:07 2013 2013-05-18 00:42:07 20
>
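For readers working in Python, the step the question ultimately needs (rows per week) can be sketched as follows. This is a minimal illustration, not part of the original answers: the function name week_counts is hypothetical, the sample lines are copied from the question, and ISO week numbers are assumed, matching the %V format used above.

```python
from collections import Counter
from datetime import datetime

# Sample lines copied from the question's dates.txt.
lines = [
    "Mon May 11 22:51:27 2013",
    "Mon May 11 22:58:34 2013",
    "Wed May 13 23:15:27 2013",
    "Thu May 14 04:11:22 2013",
    "Sat May 16 19:46:55 2013",
    "Sat May 16 22:29:54 2013",
    "Sun May 17 02:08:45 2013",
    "Sun May 17 23:55:15 2013",
    "Mon May 18 00:42:07 2013",
]


def week_counts(stamps, fmt="%a %b %d %H:%M:%S %Y"):
    """Return {ISO week number: number of timestamps falling in that week}."""
    # isocalendar() returns (ISO year, ISO week, ISO weekday); index 1 is the week.
    weeks = (datetime.strptime(s, fmt).isocalendar()[1] for s in stamps)
    return dict(sorted(Counter(weeks).items()))


counts = week_counts(lines)
print(counts)  # {19: 2, 20: 7}
```

The keys and values of counts can then be used directly as the x and y values of the scatter plot.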
I have a dataset with repeated measurements on multiple individuals over time. It looks something like this:
ID Time Event
1 Jan 1 2012, 4pm Abx
1 Jan 2 2012, 2pm Test
1 Jan 26 2012 3 pm Test
1 Jan 29 2012 10 pm Abx
1 Jan 30 2012, 3 pm Test
1 Jan 5 2012 3 pm Test
2 Jan 1 2012, 4pm Abx
2 Jan 2 2012, 2pm Test
2 Jan 26 2012 3 pm Test
The dataset is currently based around events. It will later be filtered down to just tests. What I need to do is make a new variable that is 1 when certain events (Abx, in this case) occur within a certain time range of tests. So if the event 'Abx' occurs within, let's say, 48 hours of a Test event, the new variable should equal 1. Otherwise, it should equal zero.
I'm hoping to produce something like this:
ID Time Event New_variable
1 Jan 1 2012, 4pm Abx 1
1 Jan 2 2012, 2pm Test 1
1 Jan 26 2012 3 pm Test 0
1 Jan 29 2012 10 pm Abx 1
1 Jan 30 2012, 3 pm Test 1
1 Jan 5 2012 3 pm Test 0
2 Jan 1 2012, 4pm Abx 1
2 Jan 2 2012, 2pm Test 1
2 Jan 26 2012 3 pm Test 0
I know that I could probably solve this with a combination of dplyr mutate functions and ifelse statements, and if I just wanted to make a variable that reads "1" when the antibiotic event occurs, I could do that like this:
test %>%
mutate(New_variable = ifelse(Event == 'Abx', 1, 0)) -> test2
But I don't know how to factor in time so that Test events = 1 within 48 hours of an Abx event. I also am not sure how to make sure that the condition is applied only within the same ID. How can I do this?
Any help is appreciated!
Update: Thank you so much for the suggestions! I'm going to try these out on the data, but I think they'll work. If they don't, I'll be back soon. Success! I also modified the suggested helper function to include additional options (for more than one type of Abx):
abxRows <- type == "Abx" | type == "Abx2"
To the data provided, I added two "Abx" events whose New_variable should not be 1 (i.e. one that was not within 48 hours of any test, and one that was not in the same group as the test it was within 48 hours of).
library(dplyr)
library(lubridate)
library(purrr)
eventData <-
  data.frame(stringsAsFactors = FALSE,
             ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1),
             Time = c("Jan 1 2012 4 pm", "Jan 2 2012, 2pm",
                      "Jan 26 2012 3 pm", "Jan 29 2012 10 pm",
                      "Jan 30 2012 3 pm", "Jan 5 2012 3 pm",
                      "Jan 1 2012 4 pm", "Jan 2 2012, 2pm",
                      "Jan 26 2012 3 pm", "Feb 12 2012 1pm",
                      "Jan 16 2012 3 pm", "Jan 16 2012 1 pm"),
             Event = c("Abx", "Test", "Test", "Abx", "Test", "Test",
                       "Abx", "Test", "Test", "Abx", "Abx", "Test")
  ) %>%
  mutate(Time = mdy_h(Time),
         window = if_else(Event == "Test",
                          interval(Time - hours(48), Time + hours(48)),
                          interval(NA, NA))
  )
First, you want to make sure the Time column is a time format. Then create a column of the lubridate Interval class that creates a 48 hr window around "Test" events.
Define the helper function that will check if the event occurred within the window.
chkFun <- function(eventTime, intervals, grp, type){
  abxRows <- type == "Abx"
  testRows <- !abxRows
  hits <- map2_lgl(eventTime, grp,
                   ~any(.x %within% intervals[grp %in% .y], na.rm = TRUE)) &
    abxRows
  testHits <- map_lgl(which(testRows),
                      ~any(eventTime[abxRows & (grp[.x] == grp)] %within%
                             intervals[.x]))
  hits[testRows] <- testHits
  as.integer(hits)
}
This function first tests whether each "Abx" event occurs within any of the intervals for its group. It then determines which "Test" rows have an interval that contains an "Abx" event. The function returns the combination of these, cast to integers.
Last, just use a mutate statement with the helper function, dropping the window column:
eventData %>%
mutate(New_variable = chkFun(Time, window, ID, Event)) %>%
select(-window)
Alternatively, the helper function could just take the data.frame as an argument and assume the column names. In the form above, though, if you define it first in your script, it could also be used in the original definition of eventData.
Results:
#> ID Time Event New_variable
#> 1 1 2012-01-01 16:00:00 Abx 1
#> 2 1 2012-01-02 14:00:00 Test 1
#> 3 1 2012-01-26 15:00:00 Test 0
#> 4 1 2012-01-29 22:00:00 Abx 1
#> 5 1 2012-01-30 15:00:00 Test 1
#> 6 1 2012-01-05 15:00:00 Test 0
#> 7 2 2012-01-01 16:00:00 Abx 1
#> 8 2 2012-01-02 14:00:00 Test 1
#> 9 2 2012-01-26 15:00:00 Test 0
#> 10 2 2012-02-12 13:00:00 Abx 0
#> 11 2 2012-01-16 15:00:00 Abx 0
#> 12 1 2012-01-16 13:00:00 Test 0
I don't have a copy of your data, so I'm not sure what format your dates are in.
I would recommend converting the dates to the right format using as.POSIXct(Time, format="%b %d %Y, %I%p"). For more information on the format codes, look up ?strptime, but I think that is right for your column.
Assume your data frame is like this (I know I have changed parts of it, but this is for simplicity):
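# Assumed values (not in the original post), chosen to reproduce the output below
start <- as.POSIXct("2012-01-01", tz = "UTC"); end <- as.POSIXct("2012-02-01", tz = "UTC")
interval <- 60  # seconds, so interval*6840 is a step of 114 hours
df <- data.frame(ID = c(rep(1,6),rep(2,3)),
Time=c(seq(from=start, by=interval*6840, to=end)[1:6],seq(from=start, by=interval*6840, to=end)[1:3]),
Event = rep(c("Abs","Test","Test"),3))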
This would look like this
ID Time Event
1 1 2012-01-01 00:00:00 Abs
2 1 2012-01-05 18:00:00 Test
3 1 2012-01-10 12:00:00 Test
4 1 2012-01-15 06:00:00 Abs
5 1 2012-01-20 00:00:00 Test
6 1 2012-01-24 18:00:00 Test
7 2 2012-01-01 00:00:00 Abs
8 2 2012-01-05 18:00:00 Test
9 2 2012-01-10 12:00:00 Test
So you can use the following code to test whether a Test falls within 48 hours of an Abs
df[which(df$Event=="Test"),]$Time %in% unlist(Map(`:`, df[which(df$Event=="Abs"),]$Time-48*60*60, df[which(df$Event=="Abs"),]$Time+48*60*60))
This returns FALSE for every test here, but only because the synthetic data uses larger time steps.
To unpack this:
df[which(df$Event=="Test"),]$Time gives the times of the tests.
%in% looks for the values that precede it in the set of values that follows it.
What follows it is: unlist(Map(`:`, df[which(df$Event=="Abs"),]$Time-48*60*60, df[which(df$Event=="Abs"),]$Time+48*60*60))
This creates the list of times within +/- 48 hours of each Abs. Arithmetic on POSIXct objects is done in seconds, hence the 48*60*60 to add or subtract 48 hours. Note that this check does not restrict the comparison to rows with the same ID.
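The same 48-hour check, restricted to events that share an ID, can also be sketched in plain Python. This is only an illustration under assumptions: the function name flag_events is hypothetical, the rows are copied from the question, and the time strings have been normalised to a single format so that one strptime pattern parses them all.

```python
from datetime import datetime, timedelta

# Rows from the question; time strings normalised to one format (an assumption).
events = [
    (1, "Jan 1 2012 4pm", "Abx"),
    (1, "Jan 2 2012 2pm", "Test"),
    (1, "Jan 26 2012 3pm", "Test"),
    (1, "Jan 29 2012 10pm", "Abx"),
    (1, "Jan 30 2012 3pm", "Test"),
    (1, "Jan 5 2012 3pm", "Test"),
    (2, "Jan 1 2012 4pm", "Abx"),
    (2, "Jan 2 2012 2pm", "Test"),
    (2, "Jan 26 2012 3pm", "Test"),
]


def flag_events(rows, window=timedelta(hours=48)):
    """Return 1 for each row that has an event of the other type,
    with the same ID, within `window` of it; otherwise 0."""
    parsed = [(i, datetime.strptime(t, "%b %d %Y %I%p"), e) for i, t, e in rows]
    flags = []
    for row_id, when, kind in parsed:
        other = "Test" if kind == "Abx" else "Abx"
        near = any(
            j == row_id and k == other and abs(u - when) <= window
            for j, u, k in parsed
        )
        flags.append(int(near))
    return flags


new_variable = flag_events(events)
print(new_variable)  # [1, 1, 0, 1, 1, 0, 1, 1, 0]
```

The pairwise comparison is quadratic in the number of rows, which is fine for small tables; for large data a sorted merge per ID would be the more usual choice.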
I would like to subset a timeseries dataframe based on my requirement.
I have a dataframe something similar to the one mentioned below.
> df
Date Year Month Day Time Parameter
2012-04-19 2012 04 19 7:00:00 26
2012-04-19 2012 04 19 7:00:00 20
.................................................
2012-05-01 2012 05 01 00:00:00 23
2012-05-01 2012 05 01 00:30:00 22
.................................................
2015-04-30 2015 04 30 23:30:00 20
.................................................
2015-05-01 2015 05 01 00:00:00 26
From a dataframe like this, I would like to select all the data from the first of May 2012 (2012-05-01) to the end of April 2015 (2015-04-30), regardless of the start and end dates of the dataframe.
I am familiar with using grep to select data based on one particular column. I have been using the following code with grep and with.
# To select one particular year
> df.2012 <- df[grep("2012", df$Year),]
# To select two or more years at the same time
> df.sel.yr <- df[grep("201[2-5]", df$Year),]
# To select one particular month of a particular year.
> df.Dec.2012 <- df[with(df, Year=="2012" & Month=="12"), ]
I can do it with several lines of commands, but it would save a lot of time if I could do it with only one or a few lines.
Any help will be appreciated. Thank you in advance.
If your Date column is not of class Date, first convert it with:
df$Date <- as.Date(df$Date)
Then you can subset by date with:
df[df$Date >= as.Date("2012-05-01") & df$Date <= as.Date("2015-04-30"), ]
# Date Year Month Day Time Parameter
#3 2012-05-01 2012 5 1 00:00:00 23
#4 2012-05-01 2012 5 1 00:30:00 22
#5 2015-04-30 2015 4 30 23:30:00 20
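For comparison, the same inclusive range filter can be sketched in Python with the standard library. The function name in_range and the sample rows are illustrative assumptions, reduced to the Date and Parameter columns of the question's data.

```python
from datetime import date

# (Date, Parameter) pairs taken from the sample rows in the question.
rows = [
    ("2012-04-19", 26),
    ("2012-04-19", 20),
    ("2012-05-01", 23),
    ("2012-05-01", 22),
    ("2015-04-30", 20),
    ("2015-05-01", 26),
]


def in_range(records, start, end):
    """Keep records whose ISO date string falls within [start, end], inclusive."""
    return [r for r in records if start <= date.fromisoformat(r[0]) <= end]


selected = in_range(rows, date(2012, 5, 1), date(2015, 4, 30))
print(selected)  # [('2012-05-01', 23), ('2012-05-01', 22), ('2015-04-30', 20)]
```

As in the R answer, the comparison is done on real date values rather than on strings or on separate Year and Month columns.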
I am new to Access 2010 and need to get the number of days in a workweek excluding holidays, but with a twist. I have been able to use the standard VB workday code that appears on the internet, and it works great for a simple Monday to Friday or Monday to Saturday calculation. My question is: is it possible to adapt this code to calculate the number of days if Friday, Saturday and Sunday together count as 1 day?
Example: Calculate the number of days from Tuesday 11/25/14 to today.
-Today's date = Monday, December 01, 2014;
-Monday, December 01, 2014 = 0;
-Sunday, November 30, 2014 = 3;
-Saturday, November 29, 2014 = 3;
-Friday, November 28, 2014 = 3;
-Thursday, November 27, 2014(Holiday) = 2;
-Wednesday, November 26, 2014 = 2;
-Tuesday, November 25, 2014 = 1
So in the example above, the number of days would be 3.
If you need to account for Statutory Holidays you'll really need to use some kind of table. Purely algorithmic approaches to the problem are difficult to manage and prone to failure, primarily because:
Holidays that fall on a fixed date may be observed on some other date. For example, if Christmas falls on a Saturday then employees may get a day off on Friday.
Some holiday dates are difficult to calculate. In particular, Good Friday is defined (here in Canada, at least) as "the Friday before the first Sunday after the first full moon following the Spring Equinox".
In its simplest form, the [DatesTable] could look something like this:
theDate dayOff comment
---------- ------ ----------------
2014-11-21 False
2014-11-22 True Saturday
2014-11-23 True Sunday
2014-11-24 False
2014-11-25 False
2014-11-26 False
2014-11-27 True Thanksgiving Day
2014-11-28 False
2014-11-29 True Saturday
2014-11-30 True Sunday
2014-12-01 False
2014-12-02 False
Counting the number of work days between 2014-11-25 and 2014-11-30 (inclusive) would simply be
SELECT COUNT(*) AS WorkDays
FROM DatesTable
WHERE theDate Between #2014-11-25# And #2014-11-30#
AND dayOff=False;
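The counting rule from the question (Friday, Saturday and Sunday together count as one day) can be sketched in Python to check the expected result. This is an illustration under assumptions, not Access code: the function name collapsed_workdays is hypothetical, the end date ("today") is excluded as in the question's example, and a weekend block counts once if at least one of its three days is not a holiday.

```python
from datetime import date, timedelta


def collapsed_workdays(start, end, holidays=frozenset()):
    """Count days in [start, end), treating each Fri-Sat-Sun run as one day."""
    units = set()
    day = start
    while day < end:
        if day not in holidays:
            wd = day.weekday()  # Monday is 0, Friday is 4, Sunday is 6
            # Map Saturday and Sunday back to their Friday so all three share one key.
            units.add(day - timedelta(days=wd - 4) if wd >= 4 else day)
        day += timedelta(days=1)
    return len(units)


holidays = {date(2014, 11, 27)}  # Thanksgiving Day, from the example
result = collapsed_workdays(date(2014, 11, 25), date(2014, 12, 1), holidays)
print(result)  # 3
```

In a table-based design the same effect could be had by marking Saturday and Sunday as days off and leaving Friday as a work day, which is what the sample [DatesTable] above does.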
I am trying to plot a series of sunset times in matplotlib but I get the following error:
"TypeError: Empty 'DataFrame': no numeric data to plot"
I have looked at several options for converting the values, e.g. plt.dates.date2num, but that doesn't really fulfil my needs, as I would like to plot them in a readable format, i.e. as times. All the examples I have found have times on the x-axis, but none have them on the y-axis.
Is there no way of accomplishing this task? Has anyone got an idea?
I look forward to your replies.
Best regards, Arne
3 Jan 2013 16:44:00
4 Jan 2013 16:45:00
5 Jan 2013 16:46:00
6 Jan 2013 16:47:00
7 Jan 2013 16:48:00
8 Jan 2013 16:49:00
9 Jan 2013 16:51:00
10 Jan 2013 16:52:00
11 Jan 2013 16:53:00
12 Jan 2013 16:55:00
13 Jan 2013 16:56:00
14 Jan 2013 16:57:00
It's not quite clear from your question if you're trying to plot some unspecified data on the x-axis with date/time on the y-axis or if you're trying to plot days on the x-axis with times on the y-axis.
From your question, though, I'm going to assume it's the latter.
It sounds like you might be using pandas, but for the moment, I'll just assume you have two sequences of strings: One with the day, and another sequence with the time.
To treat a given axis as dates, just call ax.xaxis_date() or ax.yaxis_date(). In this case, both will actually be dates. (The times will have today as the day, though you won't see this directly.)
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
date = ['3 Jan 2013', '4 Jan 2013', '5 Jan 2013', '6 Jan 2013', '7 Jan 2013',
'8 Jan 2013', '9 Jan 2013', '10 Jan 2013', '11 Jan 2013', '12 Jan 2013',
'13 Jan 2013', '14 Jan 2013']
time = ['16:44:00', '16:45:00', '16:46:00', '16:47:00', '16:48:00', '16:49:00',
'16:51:00', '16:52:00', '16:53:00', '16:55:00', '16:56:00', '16:57:00']
# Convert to matplotlib's internal date format.
x = mdates.datestr2num(date)
y = mdates.datestr2num(time)
fig, ax = plt.subplots()
ax.plot(x, y, 'ro-')
ax.yaxis_date()
ax.xaxis_date()
# Optional. Just rotates x-ticklabels in this case.
fig.autofmt_xdate()
plt.show()
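The remark above that the times get a day attached can be made explicit. As a minimal sketch using only the standard library, each time string is parsed and placed on one fixed reference day, so that all y values are comparable datetimes. The name time_on_reference_day and the choice of 1970-01-01 are assumptions made for illustration.

```python
from datetime import datetime

REFERENCE_DAY = datetime(1970, 1, 1).date()  # arbitrary fixed day (an assumption)


def time_on_reference_day(text, fmt="%H:%M:%S"):
    """Parse a time-of-day string and attach it to the fixed reference day."""
    return datetime.combine(REFERENCE_DAY, datetime.strptime(text, fmt).time())


times = ['16:44:00', '16:45:00', '16:57:00']
y = [time_on_reference_day(s) for s in times]
print(y[0])  # 1970-01-01 16:44:00
```

Datetimes built this way can be passed to ax.plot as the y values, and setting matplotlib.dates.DateFormatter('%H:%M') as the y-axis major formatter should then show only hours and minutes on the tick labels.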