I wish to scrape weather data from
https://www.wetterzentrale.de/weatherdata.php?station=260&jaar=2019&maand=1&dag=1
from 2019-01-01 until today, but I don't know how to write code that changes jaar=2019 (i.e., year=2019), maand=1 (i.e., month=1), and dag=1 (i.e., day=1) to the desired dates.
I tried to work with lapply as follows:
library(data.table)
years <- c("2019", "2020")
urls <- rbindlist(lapply(years, function(x) {
  url <- paste0("https://www.wetterzentrale.de/weatherdata.php?station=260&jaar=", x, "&maand=1&dag=1")
  data.frame(url)
}))
However, this only gives the URLs for January 1st of 2019 and 2020. Is there a way to include months and days?
library(lubridate)
allYourDates <- seq(ymd(20190101), Sys.Date(), by = "days")
urls <- paste0("https://www.wetterzentrale.de/weatherdata.php?station=260",
               "&jaar=", year(allYourDates),
               "&maand=", month(allYourDates),
               "&dag=", day(allYourDates))
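Using the urls vector built above, each daily page can then be fetched in a loop. A minimal sketch (rvest as the scraping package is an assumption, and the one-second pause is just to be polite to the server):

```r
library(rvest)  # assumption: rvest is the scraping package you intend to use

# Fetch each daily page; a failed request yields NULL instead of aborting the loop
pages <- lapply(urls, function(u) {
  Sys.sleep(1)  # throttle requests
  tryCatch(read_html(u), error = function(e) NULL)
})
```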
I am trying to complete a problem that pulls from two data sets that need to be combined into one. To get to that point, I need to rbind both data sets by their year-month information. Unfortunately, the first data set needs to be tallied by year-month, and I can't seem to figure out how to change the date so I have month-year info rather than month-day-year info.
This is data on avalanches, and I need to write code totaling the number of avalanches each month for the snow season, defined as Dec-Mar. How do I do that?
I keep trying to convert the date format to month-year, but after I change it with
as.Date(avalancheslc$Date, format="%y-%m")
all the values for Date turn into NAs... help!
# write the webscraper
library(XML)
library(RCurl)
avalanche <- data.frame()
avalanche.url <- "https://utahavalanchecenter.org/observations?page="
all.pages <- 0:202
for(page in all.pages){
  this.url <- paste0(avalanche.url, page)
  this.webpage <- htmlParse(getURL(this.url))
  thispage.avalanche <- readHTMLTable(this.webpage, which=1, header=TRUE)
  avalanche <- rbind(avalanche, thispage.avalanche)
}
# subset the data to the Salt Lake Region
avalancheslc<-subset(avalanche, Region=="Salt Lake")
str(avalancheslc)
avalancheslc$monthyear<-format(as.Date(avalancheslc$Date),"%Y-%m")
# How can I tally the number of avalanches?
The final output of my dataset should be something like:
date      avalanches
2000-1    18
2000-2    4
2000-3    10
2000-12   12
2001-1    52
This should work (I tried it on only one page, not all 203). Note the use of stringsAsFactors = FALSE in readHTMLTable, and the need to add names because one column does not automatically get one.
library(XML)
library(RCurl)
library(dplyr)
library(lubridate)

avalanche <- data.frame()
avalanche.url <- "https://utahavalanchecenter.org/observations?page="
all.pages <- 0:202
for(page in all.pages){
  this.url <- paste0(avalanche.url, page)
  this.webpage <- htmlParse(getURL(this.url))
  thispage.avalanche <- readHTMLTable(this.webpage, which = 1, header = TRUE,
                                      stringsAsFactors = FALSE)
  names(thispage.avalanche) <- c('Date', 'Region', 'Location', 'Observer')
  avalanche <- rbind(avalanche, thispage.avalanche)
}
avalancheslc <- subset(avalanche, Region == "Salt Lake")
str(avalancheslc)
avalancheslc <- mutate(avalancheslc,
                       Date = as.Date(Date, format = "%m/%d/%Y"),
                       monthyear = paste(year(Date), month(Date), sep = "-"))
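To tally the avalanches per month afterwards, one option is a dplyr group-and-count. A small self-contained sketch (the toy dates below are made up; with the real data you would group avalancheslc directly, and the %in% filter keeps only the Dec-Mar snow season):

```r
library(dplyr)
library(lubridate)

# Toy stand-in for avalancheslc after the mutate() above
avalancheslc <- data.frame(
  Date = as.Date(c("2000-01-05", "2000-01-20", "2000-12-03", "2000-06-15"))
)

tally_by_month <- avalancheslc %>%
  filter(month(Date) %in% c(12, 1, 2, 3)) %>%            # snow season: Dec-Mar
  group_by(date = paste(year(Date), month(Date), sep = "-")) %>%
  summarise(avalanches = n())
# date      avalanches
# 2000-1    2
# 2000-12   1
```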
I am at a complete loss at this point on how to resolve my query issue. We have been using the R package RGA for about a year now without any problems. I have a script that fetches data from 7 views, matching sessions based on specific pages on our website and totaling them by our product offerings.
This had been working without problems for months. Out of nowhere I have been getting 503 and 500 internal server errors and I'm not sure why.
I've tried changing fetch.by to "month", "quarter", "year", "day", etc., but I think the initial query is just too big.
I've also tried changing the max.results options and fetching just one profile ID at a time. We have 7 to process.
date1 <- as.character(cut(Sys.Date(), "month"))
date1_name <- format(as.Date(date1), format = '%Y%m')
date2 <- as.character(Sys.Date())
date2_name <- format(as.Date(date2), format = '%Y%m')
dimensions <- c("ga:yearMonth")
metrics <- c("ga:sessions")
filters2 <- "ga:sessions>0"
#fetch trip level data for all users and for the micro-goal segment
# country_short_table
short_unq <- unique(country_short_table$destination)
brand_trip_unique <- unique(trip_country_brand$brand_trip)
# leftover manual test values; the loop below overwrites brand_trip
brand_trip <- 1
brand_trip <- 73
all_sessions <- data.frame()
for (brand_trip in 1:length(brand_trip_unique)){
  mkt <- gsub('_.*', '', brand_trip_unique[brand_trip])
  trip <- gsub('.*_', '', brand_trip_unique[brand_trip])
  id <- as.character(ids[ids$market == mkt, 'id'])
  segment <- paste('ga:pagePath=~(reisen|circuit)/.*/', trip, sep = '')
  segment_def <- paste('users::condition::', segment, sep = '')
  table <- get_ga(profileId = id,
                  start.date = date1,
                  end.date = date2,
                  metrics = metrics,
                  dimensions = dimensions,
                  filters = filters2,
                  segment = segment_def,
                  samplingLevel = "HIGHER_PRECISION",
                  fetch.by = "quarter",
                  max.results = NULL)
  if (is.list(table)) {
    table$trip <- trip
    table$market <- mkt
    all_sessions <- bind_rows(all_sessions, table)
  } else {
    next
  }
}
GOAL: Can you recommend any way to avoid this issue, perhaps by separating the date queries and cycling through them by weeks or days of the month? I need monthly data aggregated every day, but I'm not sure how to edit this script I inherited.
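One common workaround for oversized queries is to split the overall date range into short chunks and issue one request per chunk, re-aggregating afterwards. A sketch, not the inherited script itself (the week-long chunk size is an assumption):

```r
# Split a date range into week-long chunks; each row is one small query window
date_chunks <- function(start, end, by = "7 days") {
  starts <- seq(as.Date(start), as.Date(end), by = by)
  ends   <- c(starts[-1] - 1, as.Date(end))
  data.frame(start = starts, end = ends)
}

chunks <- date_chunks("2017-01-01", "2017-01-31")
# One get_ga() call per row of 'chunks', then bind_rows() and re-aggregate
# by ga:yearMonth, e.g.:
# results <- lapply(seq_len(nrow(chunks)), function(i)
#   get_ga(profileId = id, start.date = chunks$start[i],
#          end.date = chunks$end[i], metrics = metrics,
#          dimensions = dimensions))
```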
I have an R script that I would like to run to import some data. The script is called load_in_year1-4. I have designed it so that I only have to make three changes at the top and everything will run and import the correct data files.
The three changes are:
year <- "Year4"
weeks <- "1270_1321"
product <- "/cookies"
However I have 20 years worth of data and more than 50 products.
I am currently manually changing the top of each file and running it, so I have no errors currently in the data.
What I would like to do is to create a separate R script which will run the current script.
I would like to have something like
year <- c("year1", "year2", "year3"....)
weeks <- c("1270_1321", "1321_1327"....)
product <- c("product1", "product2"....)
So it will take year 1, week 1270_1321 and product1, call them year, week, product and run the R script which I have created.
Is there a grid function anybody can suggest?
EDIT: I have something like the following
#Make changes here
year <- "Year11"
weeks <- "1635_1686"
product <- "/cigets"
# year1: "1114_1165", year2: "1166_1217", year3: "1218_1269"
#Does not need changing
files <- gsub("Year1", as.character(year), "E:/DATA/Dataset/Year1")
parsedstub <- "E:/DATA/Dataset/files/"
produc <- paste0("prod", gsub("/", "_", as.character(product)))
drug <- "_drug_"
groc <- "_groc_"
####################Reading in the data###########################################
drug <- read.table(paste0(files, product, product, drug, weeks), header = TRUE)
groc <- read.table(paste0(files, product, product, groc, weeks), header = TRUE)
To make a function out of your script, do something like this:
get.tables <- function(year, weeks, product){
  files <- gsub("Year1", as.character(year), "E:/DATA/Dataset/Year1")
  parsedstub <- "E:/DATA/Dataset/files/"
  product <- paste0("prod", gsub("/", "_", as.character(product)))
  drug <- "_drug_"
  groc <- "_groc_"
  ####################Reading in the data###########################################
  drug <- read.table(paste0(files, product, product, drug, weeks), header = TRUE)
  groc <- read.table(paste0(files, product, product, groc, weeks), header = TRUE)
  list(drug = drug, groc = groc)
}
Then you could use something in the apply family to apply this function to different years, weeks, and products.
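For example, assuming each year has exactly one weeks string (so years and weeks line up element-wise), expand.grid() can build the full year × product grid and Map() then runs get.tables once per combination. The vectors below are made-up placeholders:

```r
years    <- c("Year1", "Year2", "Year3")
weeks    <- c("1114_1165", "1166_1217", "1218_1269")  # one weeks range per year
products <- c("/cookies", "/cigets")

# Full grid: every (year, weeks) pair crossed with every product
combos <- expand.grid(year_idx = seq_along(years),
                      product = products,
                      stringsAsFactors = FALSE)

# One get.tables() call per row; results collected in a list
# (commented out because the data files only exist on the original machine)
# all_tables <- Map(get.tables,
#                   year    = years[combos$year_idx],
#                   weeks   = weeks[combos$year_idx],
#                   product = combos$product)
```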
I already had a similar question here:
R - How to choose files by dates in file names?
But I need to make a small change.
I still have a list of filenames, similar to this:
list = c("AT0ACH10000700100dymax.1-1-1993.31-12-2003",
"AT0ILL10000700500dymax.1-1-1990.31-12-2011",
"AT0PIL10000700500dymax.1-1-1992.31-12-2011",
"AT0SON10000700100dymax.1-1-1990.31-12-2011",
"AT0STO10000700100dymax.1-1-1992.31-12-2006",
"AT0VOR10000700500dymax.1-1-1981.31-12-2011",
"AT110020000700100dymax.1-1-1993.31-12-2001",
"AT2HE190000700100dymax.1-1-1973.31-12-1994",
"AT2KA110000700500dymax.1-1-1991.31-12-2010",
"AT2KA410000700500dymax.1-1-1991.31-12-2011")
I already have a command to sort out files with at least a certain length of recording (for example, 10 years in this case):
# Listing files (creates the list above)
files = list.files(pattern="*00007.*dymax", recursive = TRUE)
# Making the dates readable
split_daymax = strsplit(files, split=".", fixed=TRUE)
from = unlist(lapply(split_daymax, "[[", 2))
to = unlist(lapply(split_daymax, "[[", 3))
from = as.POSIXct(from, format="%d-%m-%Y")
to = as.POSIXct(to, format="%d-%m-%Y")
timelistmax = difftime(to, from, units="days")
# Files with more than 10 years of recording
index = timelistmax >= 10*360
files = files[index]
My problem now is that I have way too many files and no computer can handle them all.
I only want to read in files that contain data from 1993 (or any other given year) onward and have 10 years of recording from then on, so the recording should last at least until 2003.
So the file from 1973-1994 should not be included, but the file from 1981-2011 is fine.
I don't know how to select a year in this case.
I am thankful for any help.
library(stringr)
library(lubridate)
fileDates <- str_extract_all(files, "[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}")

find_file <- function(x, whichYear, noYears = 10) {
  start <- as.Date(x[[1]], "%d-%m-%Y")
  end <- as.Date(x[[2]], "%d-%m-%Y")
  years <- as.numeric(end - whichYear, units = "days") / 365
  years > noYears & (year(start) <= year(whichYear) &
                       year(end) >= year(whichYear))
}

sapply(fileDates, find_file, whichYear = as.Date("1993-01-01"), noYears = 10)
You have two conditions: first calculate the number of years of recording since 1993, and then use Boolean logic to figure out whether 1993 falls within the date range.
Using files, to, and from as you've defined them above, this should get you the files that contain at least a ten-year span of data between 1993 and 2003:
library(lubridate)
df <- data.frame(file_name = files, file_start = from, file_end = to)
df_index <- year(df$file_start) <= 1993 & year(df$file_end) >= 2003
files_to_load <- df$file_name[df_index]
If a base only solution is desired, turn the POSIXct to POSIXlt and extract the year component as such:
df <- data.frame(file_name = files,
                 file_start = as.POSIXlt(from),
                 file_end = as.POSIXlt(to))
df_index <- (df$file_start$year + 1900 <= 1993 &
               df$file_end$year + 1900 >= 2003)
files_to_load <- df$file_name[df_index]
I would like to concatenate the following values in R.
day <- sprintf("%02d", 1:31)
month <- sprintf("%02d", 1:12)
year <- 2015:as.numeric(format(Sys.time(), "%Y"))
I need them in the format 2015/01/01012015 (YYYY/MM/MMDDYYYY), where the two MMs must always be equal.
Ultimately I want to attach this to the end of the URL http://brocktonpolice.com/wp-content/uploads/ so I can pass it to a download function to download the files.
Here is what I have so far (note that paste0 has no sep argument, so the original sep = "/" was just being pasted on as an extra string):
links <- NULL
i <- 1
while (i <= length(year)) {
  links[i] <- paste0("http://brocktonpolice.com/wp-content/uploads/", year[i], "/")
  i <- i + 1
}
I would like it to span the entire years 2015 and 2016.
For example:
http://brocktonpolice.com/wp-content/uploads/2015/01/01012015.pdf
http://brocktonpolice.com/wp-content/uploads/2015/01/01022015.pdf
http://brocktonpolice.com/wp-content/uploads/2015/01/01032015.pdf
http://brocktonpolice.com/wp-content/uploads/2015/01/01042015.pdf
...
http://brocktonpolice.com/wp-content/uploads/2015/02/02012015.pdf
http://brocktonpolice.com/wp-content/uploads/2015/02/02022015.pdf
http://brocktonpolice.com/wp-content/uploads/2015/02/02032015.pdf
...
etc
Use seq.Date. It's much easier.
prefix <- "http://brocktonpolice.com/wp-content/uploads/"
AllDays <- seq.Date(from = as.Date('2015-01-01'), to = Sys.Date(), by = "day")
links <- paste0(prefix, format(AllDays, '%Y/%m/%m%d%Y'), '.pdf')
print(links)
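Using the links vector from above, the actual download step could look like this. A sketch only: not every date necessarily has a report, so missing files are skipped rather than stopping the loop, and the destination filenames are an assumption:

```r
# Download each report into the working directory; skip dates with no file
for (link in links) {
  dest <- basename(link)  # e.g. "01012015.pdf"
  tryCatch(download.file(link, destfile = dest, mode = "wb"),
           error = function(e) message("skipping ", link))
}
```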