Building off of this question (Retrieve modified DateTime of a file from an FTP Server), it's clear how to get the date modified value. However, the full date is not returned even though it's visible from the FTP site.
This shows how to get the date modified values for files at ftp://ftp.FreeBSD.org/pub/FreeBSD/
library(curl)
library(stringr)
con <- curl("ftp://ftp.FreeBSD.org/pub/FreeBSD/")
dat <- readLines(con)
close(con)
# drop the directory entries (lines starting with "d" in the listing)
no_dirs <- grep("^d", dat, value=TRUE, invert=TRUE)
# strip the fixed-width fields before the date (permissions, links, owner, group, size)
date_and_name <- sub("^[[:alnum:][:punct:][:blank:]]{43}", "", no_dirs)
# drop the trailing file name, keeping only the date fields
dates <- sub('\\s[[:alpha:][:punct:][:alpha:]]+$', '', date_and_name)
dates
## [1] "May 07 2015" "Apr 22 15:15" "Apr 22 10:00"
Some dates are in "month day year" format, others are in "month day hour:minute" format.
Looking at the FTP site, all dates are shown in month/day/year hour:minutes:seconds format.
I assume it's got something to do with Unix format standards (explained in FTP details command doesn't seem to return the year the file was modified, is there a way around this?). It would be nice to get the full date.
If you use download.file, you get an HTML representation of the directory, which you can parse with the xml2 package.
read_ftp <- function(url)
{
tmp <- tempfile()
download.file(url, tmp, quiet = TRUE)
# the downloaded listing is an HTML page; read up to 1e6 characters and parse it
html <- xml2::read_html(readChar(tmp, 1e6))
file.remove(tmp)
lines <- strsplit(xml2::xml_text(html), "[\n\r]+")[[1]]
# keep only the lines containing a date (two groups of two digits, then four digits)
lines <- grep("(\\d{2}/){2}\\d{4}", lines, value = TRUE)
result <- read.table(text = lines, stringsAsFactors = FALSE)
setNames(result, c("Date", "Time", "Size", "File"))
}
Which allows you to just do this:
read_ftp("ftp://ftp.FreeBSD.org/pub/FreeBSD/")
#> Date Time Size File
#> 1 05/07/2015 12:00AM 4,259 README.TXT
#> 2 04/22/2020 08:00PM 35 TIMESTAMP
#> 3 04/22/2020 08:00PM Directory development
#> 4 04/22/2020 10:00AM 2,325 dir.sizes
#> 5 11/12/2017 12:00AM Directory doc
#> 6 11/12/2017 12:00AM Directory ports
#> 7 04/22/2020 08:00PM Directory releases
#> 8 11/09/2018 12:00AM Directory snapshots
Created on 2020-04-22 by the reprex package (v0.3.0)
I'm trying to save data that includes time information with write_csv.
But it keeps showing the T and Z characters (ISO 8601 format, as far as I know).
For example, 2022-12-12 08:00:00 is shown as 2022-12-12T08:00:00Z when the csv file is opened with Notepad.
I want to keep the original format after saving the csv file, but I couldn't find an option for this.
I just saw an article about this problem, but there was no answer.
Here are two solutions.
First write a data set to a temp file.
library(readr)
df1 <- data.frame(datetime = as.POSIXct("2022-12-12 08:00:00"),
x = 1L, y = 2)
csvfile <- tempfile(fileext = ".csv")
# write the data; this is the step that causes the problem
write_csv(df1, file = csvfile)
Created on 2023-01-31 with reprex v2.0.2
1. Change nothing
This is probably not what you want, but read_csv recognizes write_csv's ISO 8601 output format, so if the data is written to file with write_csv and read back in with read_csv, the problem doesn't occur.
# read from file as text, problem format is present
readLines(csvfile)
#> [1] "datetime,x,y" "2022-12-12T08:00:00Z,1,2"
# read from file as spec_tbl_df, problem format is not present
read_csv(csvfile, show_col_types = FALSE)
#> # A tibble: 1 × 3
#> datetime x y
#> <dttm> <dbl> <dbl>
#> 1 2022-12-12 08:00:00 1 2
Created on 2023-01-31 with reprex v2.0.2
2. Coerce to "character"
If the datetime column of class "POSIXct" is coerced to character, the ISO 8601 format is gone and everything is OK; afterwards read_csv will still recognize the datetime column.
This is done in a pipe below, using the base pipe operator introduced in R 4.1, so that the original data is not changed.
# coerce the problem column to character and write to file
# done in a pipe it won't alter the original data set
df1 |>
dplyr::mutate(datetime = as.character(datetime)) |>
write_csv(file = csvfile)
# check result, both are OK
readLines(csvfile)
#> [1] "datetime,x,y" "2022-12-12 08:00:00,1,2"
read_csv(csvfile, show_col_types = FALSE)
#> # A tibble: 1 × 3
#> datetime x y
#> <dttm> <dbl> <dbl>
#> 1 2022-12-12 08:00:00 1 2
Created on 2023-01-31 with reprex v2.0.2
Final clean up.
unlink(csvfile)
I want to be able to store a list of all files in a directory for use to build a dataset later, but I need to ignore all files that are in certain folders. The files I need are all stored by year in separate folders, and using a pattern argument isn't a great option as the lengths of file names are inconsistent.
So the easiest way I can think of is to first ignore all folders before a certain year in this step (storing the file list) and then ignore certain files again when I begin to actually read in the files.
Start by listing all directories:
dir_list <- list.dirs()
dir_list
#> [1] "./stuff_2021" "./things" "./other_stuff_2015"
Then use grep to pick out directories that contain four consecutive digits (if the years are abbreviated to two digits, change the regex to \\d{2})
dirs_with_years <- grep("\\d{4}", dir_list, value = TRUE)
dirs_without_years <- grep("\\d{4}", dir_list, invert = TRUE, value = TRUE)
dirs_with_years
#> [1] "./stuff_2021" "./other_stuff_2015"
dirs_without_years
#> [1] "./things"
Now extract the four digits from each directory name using gsub and convert to numeric:
year_of_dir <- as.numeric(gsub("^.*(\\d{4}).*$", "\\1", dirs_with_years))
year_of_dir
#> [1] 2021 2015
You can now use year_of_dir to filter out the folders you want according to the year:
dirs_before_2020 <- dirs_with_years[year_of_dir < 2020]
dirs_after_2020 <- dirs_with_years[year_of_dir >= 2020]
dirs_before_2020
#> [1] "./other_stuff_2015"
dirs_after_2020
#> [1] "./stuff_2021"
If this all seems a bit long-winded, it can easily be compressed into a short function:
get_post_2020_files <- function()
{
dirs_with_years <- grep("\\d{4}", list.dirs(), value = TRUE)
year_of_dir <- as.numeric(gsub("^.*(\\d{4}).*$", "\\1", dirs_with_years))
dirs_with_years[year_of_dir >= 2020]
}
get_post_2020_files()
#> [1] "./stuff_2021"
Created on 2021-11-10 by the reprex package (v2.0.0)
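If you then need the files inside those folders rather than the folder names themselves (as the original question asks), list.files() accepts a vector of paths; a small sketch building on the function above:
# list every file under the post-2020 directories
post_2020_files <- list.files(get_post_2020_files(),
                              full.names = TRUE, recursive = TRUE)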
I'm trying to scrape weather data (in R) for the 2nd of March on the following web page: https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020 I am interested in the table at the end, below "Stockholm Weather History for..."
Just above and to the right of that table is a drop-down list where I chose the 2nd of March. But when I scrape using RSelenium I only get the data for the 1st of March.
How can I get the data for the 2nd (and any other date except the 1st)?
I have also tried to scrape the entire page using read_html but I can't find a way to extract the data I want from that.
The following code only seems to work for the 1st, not for any other date in the month.
library(tidyverse)
library(rvest)
library(RSelenium)
library(stringr)
library(dplyr)
rD <- rsDriver(browser="chrome", port=4234L, chromever ="85.0.4183.83")
remDr <- rD[["client"]]
remDr$navigate("https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020")
webElems <- remDr$findElements(using = "class name", value = "sticky-wr")
s <- webElems[[1]]$getElementText()
s <- as.character(s)
print(s)
Here's an approach with RSelenium
library(RSelenium)
library(rvest)
driver <- rsDriver(browser="chrome", port=4234L, chromever ="87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020")
client$findElement(using = "link text","Mar 2")$clickElement()
source <- client$getPageSource()[[1]]
read_html(source) %>%
html_node(xpath = '//*[@id="wt-his"]') %>%
html_table %>%
head
Conditions Conditions Conditions Comfort Comfort Comfort
1 Time Temp Weather Wind Humidity Barometer Visibility
2 12:20 amMon, Mar 2 39 °F Chilly. 7 mph ↑ 87% 29.18 "Hg N/A
3 12:50 am 37 °F Chilly. 7 mph ↑ 87% 29.18 "Hg N/A
4 1:20 am 37 °F Passing clouds. 7 mph ↑ 87% 29.18 "Hg N/A
5 1:50 am 37 °F Passing clouds. 7 mph ↑ 87% 29.18 "Hg N/A
6 2:20 am 37 °F Overcast. 8 mph ↑ 87% 29.18 "Hg N/A
You can then iterate over dates with findElement().
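For example, a minimal (untested) sketch of such a loop, reusing the client object from above and assuming each day shows up as a "Mar N" link:
get_day <- function(day) {
  # click the link for the requested day, e.g. "Mar 2"
  client$findElement(using = "link text", sprintf("Mar %d", day))$clickElement()
  Sys.sleep(2)  # crude wait for the table to refresh
  client$getPageSource()[[1]] %>%
    read_html() %>%
    html_node(xpath = '//*[@id="wt-his"]') %>%
    html_table()
}
# list of data frames for March 2 through March 5
march_tables <- lapply(2:5, get_day)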
You can find the xpath by right clicking on the table and choosing Inspect in Chrome:
Then, you can find the table element, right click and choose Copy > Copy XPath.
It is always useful to use your browser's "developer tools" to inspect the web page and figure out how to extract the information you need.
A couple of tutorials that explain this, found with a quick Google search:
https://towardsdatascience.com/tidy-web-scraping-in-r-tutorial-and-resources-ac9f72b4fe47
https://www.scrapingbee.com/blog/web-scraping-r/
For example, in this particular webpage, when we select a new date in the drop-down list, the webpage sends a GET request to the server, which returns a JSON string with the data for the requested date. The webpage then updates the data in the table (probably using javascript; I did not check this).
So, in this case you need to emulate this behavior: capture the JSON response and parse the info in it.
In Chrome, if you look at the developer tool network pane, you will see that the address of the GET request is of the form:
https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=YYYYMMDD&month=M&year=YYYY&json=1
where YYYY stands for year with 4 digits, MM(M) month with two (one) digits, and DD day of the month with two digits.
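For example, for 2 March 2020 the request would be:
https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=20200302&month=3&year=2020&json=1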
So you can set your code to do the GET request directly to this address, get the json response and parse it accordingly.
library(rjson)
library(rvest)
library(plyr)
library(dplyr)
year <- 2020
month <- 3
day <- 7
# create formatted url with desired dates
url <- sprintf('https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=%4d%02d%02d&month=%d&year=%4d&json=1', year, month, day, month, year)
webpage <- read_html(url) %>% html_text()
# json string is not formatted the way fromJSON function needs
# so I had to parse it manually
# split string on each row
x <- strsplit(webpage, "\\{c:")[[1]]
# remove first element (garbage)
x <- x[2:length(x)]
# clean last 2 characters in each row
x <- sapply(x, FUN=function(xx){substr(xx[1], 1, nchar(xx[1])-2)}, USE.NAMES = FALSE)
# function to get actual data in each row and put it into a dataframe
parse.row <- function(row.string) {
# parse columns using '},{' as divider
a <- strsplit(row.string, '\\},\\{')[[1]]
# remove some lefover characters from parsing
a <- gsub('\\[\\{|\\}\\]', '', a)
# remove what I think is metadata
a <- gsub('h:', '', gsub('s:.*,', '', a))
df <- data.frame(time=a[1], temp=a[3], weather=a[4], wind=a[5], humidity=a[7],
barometer=a[8])
return(df)
}
# use ldply to run function parse.row for each element of x and combine the results in a single dataframe
df.final <- ldply(x, parse.row)
Result:
> head(df.final)
time temp weather wind humidity barometer
1 "12:20 amSat, Mar 7" "28 °F" "Passing clouds." "No wind" "100%" "29.80 \\"Hg"
2 "12:50 am" "28 °F" "Passing clouds." "No wind" "100%" "29.80 \\"Hg"
3 "1:20 am" "28 °F" "Passing clouds." "1 mph" "100%" "29.80 \\"Hg"
4 "1:50 am" "30 °F" "Passing clouds." "2 mph" "100%" "29.80 \\"Hg"
5 "2:20 am" "30 °F" "Passing clouds." "1 mph" "100%" "29.80 \\"Hg"
6 "2:50 am" "30 °F" "Low clouds." "No wind" "100%" "29.80 \\"Hg"
I left everything as strings in the data frame, but you can convert the columns to numeric or dates as you need.
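For instance, a rough follow-up sketch for the temperature and humidity columns (the cleaning regex is an assumption based on the values shown above, which still contain quotes and units):
# strip everything except digits and the decimal point, then convert
df.final$temp     <- as.numeric(gsub("[^0-9.]", "", df.final$temp))
df.final$humidity <- as.numeric(gsub("[^0-9.]", "", df.final$humidity))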
I have 100+ csv files in the current directory, all with the same characteristics. Some examples:
ABC.csv
,close,high,low,open,time,volumefrom,volumeto,timestamp
0,0.05,0.05,0.05,0.05,1405555200,100.0,5.0,2014-07-17 02:00:00
1,0.032,0.05,0.032,0.05,1405641600,500.0,16.0,2014-07-18 02:00:00
2,0.042,0.05,0.026,0.032,1405728000,12600.0,599.6,2014-07-19 02:00:00
...
1265,0.6334,0.6627,0.6054,0.6266,1514851200,6101389.25,3862059.89,2018-01-02 01:00:00
XYZ.csv
,close,high,low,open,time,volumefrom,volumeto,timestamp
0,0.0003616,0.0003616,0.0003616,0.0003616,1412640000,11.21,0.004054,2014-10-07 02:00:00
...
1183,0.0003614,0.0003614,0.0003614,0.0003614,1514851200,0.0,0.0,2018-01-02 01:00:00
The idea is to build in R a time series dataset in xts so that I could use the PerformanceAnalytics and quantmod libraries. Something like this:
## ABC XYZ ... ... JKL
## 2006-01-03 NaN 20.94342
## 2006-01-04 NaN 21.04486
## 2006-01-05 9.728111 21.06047
## 2006-01-06 9.979226 20.99804
## 2006-01-09 9.946529 20.95903
## 2006-01-10 10.575626 21.06827
## ...
Any idea? I can provide my trials if required.
A solution using base R
If you know that your files are formatted the same way then you can merge them. Below is what I would have done.
Get a list of files (this assumes that all the .csv files are the ones you actually need and that they are placed in the working directory):
vcfl <- list.files(pattern = "\\.csv$")
Use lapply() to open all the files and store them as data frames:
lsdf <- lapply(vcfl, read.csv)
Merge them. Here I used the column high but you can apply the same code on any variable (there likely is a solution without a loop)
out_high <- lsdf[[1]][,c("timestamp", "high")]
for (i in 2:length(vcfl)) {
out_high <- merge(out_high, lsdf[[i]][,c("timestamp", "high")], by = "timestamp")
}
Rename the columns using the vector of file names:
names(out_high)[2:(length(vcfl) + 1)] <- gsub(vcfl, pattern = "\\.csv$", replacement = "")
You can now use as.xts() from the xts package (https://cran.r-project.org/web/packages/xts/xts.pdf).
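For example, a minimal sketch of that last step (assuming the out_high data frame built above and that its timestamp column parses with as.POSIXct):
library(xts)
# index the merged series by the parsed timestamps; drop the timestamp column itself
xts_high <- xts(out_high[, -1], order.by = as.POSIXct(out_high$timestamp))
head(xts_high)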
I guess there is an alternative solution using the tidyverse; perhaps somebody else can add one.
Hope this helps.
Create an .xlsx file with three sheets named: "Test 1", "S&P500 TR" and "SP500 TR". Put some random content in each sheet and save it as "Book1.xlsx".
Run:
> a <- getSheetNames("Book1.xlsx")
> a
[1] "Test 1" "S&P500 TR" "SP500 TR"
Now try:
> read.xlsx("Book1.xlsx", a[2])
Error in read.xlsx.default("Book1.xlsx", a[2]) :
Cannot find sheet named "S&P500 TR"
First, check whether actually typing the name "S&P500 TR" instead of using a[2] changes anything.
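That is, something like this (same file and vector a as in the question):
read.xlsx("Book1.xlsx", a[2])          # errors, as reported
read.xlsx("Book1.xlsx", "S&P500 TR")   # does typing the name directly behave differently?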
Alternatively, you can use the readxl package for importing:
library(readxl)
X1 <- read_excel("C:/1.xls", sheet = "S&P500 TR")
This is a spreadsheet that I had, and this is the result after it is imported:
head(X1)
# A tibble: 6 × 4
# Year Month Community ` Average Daily`
# <dbl> <chr> <chr> <dbl>
# 1 2016 Jan Arlington 5.35
# 2 2016 Jan Ashland 1.26
# 3 2016 Jan Bedford 2.62
# 4 2016 Jan Belmont 3.03
# 5 2016 Jan Boston 84.89
# 6 2016 Jan Braintree 8.16
I ran into the same problem, but found a workaround. First load the workbook with loadWorkbook() (read.xlsx() returns a data frame, whereas renameWorksheet() needs a workbook object). Then rename the problematic sheet to avoid the ampersand. To fix the code in your example:
wb <- loadWorkbook("Book1.xlsx")
renameWorksheet(wb, "S&P500 TR", "NEW NAME")
output <- read.xlsx(wb, "NEW NAME")
Hope this helps!
First load the workbook, then use the which() and grepl() functions to return the index of the sheet whose name contains the string you want (which can include the '&' character when done this way). This seems to work quite well in an application I am currently working on.
An (incomplete) example is given below that should be easily modified to your context. In my case 'i' is a file name (looping over many files). The "toy" code is here:
wb <- loadWorkbook(file = i)
which( grepl("CAPEX & Depreciation", names(wb)) )
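A possible way to finish the sketch (my assumption, not part of the original snippet) is to feed that index straight to read.xlsx(), which accepts a sheet position as well as a name:
idx <- which(grepl("CAPEX & Depreciation", names(wb)))
# read the sheet by position, sidestepping the "&" in its name
dat <- read.xlsx(wb, sheet = idx)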