How to get true file creation date in R, on Unix systems? - r

I'm trying to get file creation dates from R and I understand that this information might not be possible to retrieve at all on some operating systems that just don't store it anywhere. However, I'm unsure how to retrieve it generically when it is (at least, theoretically) retrievable.
On Windows, this is straightforward because ctime from file.info provides this information. For reference, this is the relevant excerpt from ?file.info:
What is meant by the three file times depends on the OS and file system. On Windows native file systems ctime is the file creation time (something which is not recorded on most Unix-alike file systems).
However, although most Unix systems don't record this information (as pointed out in the help), some Unix-based systems such as OS X do in fact store it. On OS X, for example, the system command mdls will print file metadata and list kMDItemContentCreationDate (the actual creation date of the file) as one of the file attributes.
My question is, what advice do people have for getting at file creation dates (if they are available at all) from file metadata? (e.g. specifically in the case of OS X where there's a system command but no direct R call)
UPDATE:
Thanks to info from the comments + details on SO and SE here and here, I've come up with a way to solve this in R on OS X type Unix platforms that track creation date and have the BSD-style stat command. However, I still couldn't figure out how to do this in R on other Linux systems that track creation date but don't have this version of stat. In this answer on Unix SE, it is suggested that this info could be retrieved with debugfs + stat even when stat itself does not report it (provided the file system records the birth date), but I couldn't get that solution to work (the only Linux system I could test on didn't have debugfs). Anyway, here's how far I got:
get_birthdate <- function(filepath) {
switch(Sys.info()[['sysname']],
Windows = {
# Windows
file.info(filepath)$ctime
},
Darwin = {
# OS X
cmd <- paste('stat -f "%DB"', shQuote(filepath)) # use BSD stat command; quote the path so spaces survive the shell
ctime_sec <- as.integer(system(cmd, intern = TRUE)) # retrieve birth date in seconds from start of epoch (%DB)
as.POSIXct(ctime_sec, origin = "1970-01-01", tz = "") # convert to POSIXct
},
Linux = {
# Linux
stop("not sure how to do this")
})
}
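For the Linux branch left open above, one possible sketch uses GNU stat's %W format (birth time in seconds since the epoch). This assumes GNU coreutils stat on a kernel and file system that actually expose birth time; where they don't, stat reports the value as unknown (0, or "-" in some versions as far as I know), which the helper below turns into NA. The names parse_birth_epoch and linux_birthdate are made up for illustration.

```r
# Convert the text printed by `stat -c %W` into POSIXct.
# Assumption: "0" or "-" means the birth time is unknown.
parse_birth_epoch <- function(x) {
  secs <- suppressWarnings(as.numeric(x))
  secs[is.na(secs) | secs <= 0] <- NA
  as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
}

# Hypothetical Linux counterpart to the Darwin branch (GNU stat only).
linux_birthdate <- function(filepath) {
  out <- system2("stat", c("-c", "%W", shQuote(path.expand(filepath))),
                 stdout = TRUE)
  parse_birth_epoch(out)
}

# The parser can be checked without touching the file system:
parse_birth_epoch(c("1361232441", "0", "-"))
```

If the result is NA, the debugfs route is still the fallback for file systems such as ext4 that store crtime but do not report it through stat.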

Following others' pointers, this should work reasonably well.
Unfortunately it needs root privileges (due to debugfs) and it's not very efficient yet (the regular expressions in particular are a bit quick and dirty, but it's 1 o'clock in the morning here :) ).
BTW, we set the pager to cat (so that debugfs prints to standard output), find which device the file is stored on in order to call debugfs properly, and finally get the stats and process them a bit.
In general, on UNIX, once you have a shell command, you can read its output in R by opening a pipe in read mode (the default) and calling readLines.
Test done in a Debian Gnu Linux.
np350v5c:/home/l# R
> my.file <- "/etc/network/interfaces"
>
> setup_pager <- function() {Sys.setenv(PAGER = "cat")} # system("export ...") would only affect a throwaway subshell
>
> where_is <- function(file) {
con <- pipe(sprintf("df %s", file))
res <- strsplit(readLines(con)[2], " ")[[1]][1]
close(con)
res
}
>
> where_is(my.file) # could be /dev/sda1 as well, depending on /etc/fstab
[1] "/dev/disk/by-uuid/9ce40c2b-60d8-40b1-890f-1e5da4199c88"
>
> my.command <- sprintf("debugfs -R 'stat %s' %s",
my.file,
where_is(my.file))
>
> ## root privileges especially here ..
> setup_pager()
> con <- pipe(my.command)
> debugfs <- readLines(con)
debugfs 1.42.9 (4-Feb-2014)
> close(con)
>
> my.date <- gsub("^crtime:.+-- ", "", grep("^crtime", debugfs, value = TRUE))
> my.date
[1] "Tue Feb 19 00:07:21 2013"
> strptime(tolower(substr(my.date, 5, nchar(my.date))),
format = "%b %d %H:%M:%S %Y")
[1] "2013-02-19 00:07:21 CET"
HTH, Luca
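The crtime parsing step can be pulled out into a small helper that does not depend on the session locale (strptime with %b only matches English month names in an English or C locale). This is a sketch only: parse_crtime is a made-up name, the sample line mimics the debugfs output shown above, and treating the timestamp as UTC is an assumption, since debugfs does not print a time zone.

```r
# Extract the creation time from the lines printed by `debugfs -R 'stat ...'`.
parse_crtime <- function(lines, tz = "UTC") {
  hit <- grep("^crtime:", lines, value = TRUE)
  if (length(hit) == 0) return(as.POSIXct(NA))
  stamp <- sub("^crtime:.*-- ", "", hit[1])      # e.g. "Tue Feb 19 00:07:21 2013"
  parts <- strsplit(trimws(stamp), " +")[[1]]    # weekday, month, day, time, year
  month <- match(parts[2], month.abb)            # month.abb is always English
  iso <- sprintf("%s-%02d-%02d %s", parts[5], month,
                 as.integer(parts[3]), parts[4])
  as.POSIXct(iso, tz = tz, format = "%Y-%m-%d %H:%M:%S")
}

# Illustrative input, shaped like debugfs output:
sample_lines <- c("ctime: 0x5122b2d9:00000000 -- Tue Feb 19 00:07:21 2013",
                  "crtime: 0x5122b2d9:00000000 -- Tue Feb 19 00:07:21 2013")
parse_crtime(sample_lines)
```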

I know I am a little late to the game here, but here is a pretty easy solution for unix/Mac OS:
file.name <- "~/dir/file.extension"
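df$file_created_dt <- system(paste("stat -f %SB", shQuote(path.expand(file.name))), intern = TRUE) # df is an existing data frame; the path is expanded and quoted for the shell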
And then you can format it however you like:
df$file_created_dt <- as.POSIXct(df$file_created_dt, format = "%b %d %H:%M:%S %Y", origin = "1970-01-01 00:00:00", tz = "your/timezone")

Related

Prompt 'Yes' every time to getFilings

I am going to download the 2005 10-Ks for several corporations in R using the EDGAR package. I have a mini loop to test which is working:
for (CIK in c(789019, 777676, 849399)){
getFilings(2005,CIK,'10-K')
}
However each time this runs I get a yes/no prompt and I have to type 'yes':
Total number of filings to be downloaded=1. Do you want to download (yes/no)? yes
Total number of filings to be downloaded=1. Do you want to download (yes/no)? yes
Total number of filings to be downloaded=1. Do you want to download (yes/no)? yes
How can I prompt R to answer 'yes' for each run? Thank you
Please remember to include a minimal reproducible example in your question, including library(...) and all other necessary commands:
library(edgar)
report <- getMasterIndex(2005)
We can bypass the prompt by doing some code surgery. Here, we retrieve the code for getFilings, and replace the line that asks for the prompt with just a message. We then write the new function (my_getFilings) to a temporary file, and source that file:
x <- capture.output(dput(edgar::getFilings))
x <- gsub("choice <- .*", "cat(paste(msg3, '\n')); choice <- 'yes'", x)
x <- gsub("^function", "my_getFilings <- function", x)
writeLines(x, con = tmp <- tempfile())
source(tmp)
Everything downloads fine:
for (CIK in c(789019, 777676, 849399)){
my_getFilings(2005, CIK, '10-K')
}
list.files(file.path(getwd(), "Edgar filings"))
# [1] "777676_10-K_2005" "789019_10-K_2005" "849399_10-K_2005"
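The same "code surgery" idea can be tried on a toy function before applying it to a package function. In this sketch, ask_then_run is a hypothetical stand-in for a function that prompts with readline; it is not part of the edgar package. The patched copy is built with deparse, gsub and eval(parse(...)), so no temporary file is needed.

```r
# Hypothetical function that asks for confirmation before doing its work.
ask_then_run <- function(n) {
  choice <- readline("Download (yes/no)? ")
  if (choice == "yes") n * 2 else NA
}

# Deparse the function to text, replace the prompting line, and rebuild it.
src <- deparse(ask_then_run)
src <- gsub("choice <- .*", "choice <- \"yes\"", src)
auto_run <- eval(parse(text = src))

auto_run(21)  # runs without prompting
```

Patching by regular expression is fragile: it silently stops working if the package author renames the variable, so it is worth checking that the pattern still matches after each package update.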

Rscript: How to inject options for an R script [duplicate]

I've got an R script for which I'd like to be able to supply several command-line parameters (rather than hardcode parameter values in the code itself). The script runs on Windows.
I can't find info on how to read parameters supplied on the command-line into my R script. I'd be surprised if it can't be done, so maybe I'm just not using the best keywords in my Google search...
Any pointers or recommendations?
Dirk's answer here is everything you need. Here's a minimal reproducible example.
I made two files: exmpl.bat and exmpl.R.
exmpl.bat:
set R_Script="C:\Program Files\R-3.0.2\bin\RScript.exe"
%R_Script% exmpl.R 2010-01-28 example 100 > exmpl.batch 2>&1
Alternatively, using Rterm.exe:
set R_TERM="C:\Program Files\R-3.0.2\bin\i386\Rterm.exe"
%R_TERM% --no-restore --no-save --args 2010-01-28 example 100 < exmpl.R > exmpl.batch 2>&1
exmpl.R:
options(echo=TRUE) # if you want see commands in output file
args <- commandArgs(trailingOnly = TRUE)
print(args)
# trailingOnly=TRUE means that only your arguments are returned, check:
# print(commandArgs(trailingOnly=FALSE))
start_date <- as.Date(args[1])
name <- args[2]
n <- as.integer(args[3])
rm(args)
# Some computations:
x <- rnorm(n)
png(paste(name,".png",sep=""))
plot(start_date+(1L:n), x)
dev.off()
summary(x)
Save both files in the same directory and start exmpl.bat. In the result you'll get:
example.png with some plot
exmpl.batch with all that was done
You could also add an environment variable %R_Script%:
"C:\Program Files\R-3.0.2\bin\RScript.exe"
and use it in your batch scripts as %R_Script% <filename.r> <arguments>
Differences between RScript and Rterm:
Rscript has simpler syntax
Rscript automatically chooses architecture on x64 (see R Installation and Administration, 2.6 Sub-architectures for details)
Rscript needs options(echo=TRUE) in the .R file if you want to write the commands to the output file
A few points:
Command-line parameters are accessible via commandArgs(), so see help(commandArgs) for an overview.
You can use Rscript.exe on all platforms, including Windows. It will support commandArgs(). littler could be ported to Windows but lives right now only on OS X and Linux.
There are two add-on packages on CRAN -- getopt and optparse -- which were both written for command-line parsing.
Edit in Nov 2015: New alternatives have appeared and I wholeheartedly recommend docopt.
Add this to the top of your script:
args<-commandArgs(TRUE)
Then you can refer to the arguments passed as args[1], args[2] etc.
Then run
Rscript myscript.R arg1 arg2 arg3
If your args are strings with spaces in them, enclose within double quotes.
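Since everything in commandArgs() arrives as a character string, it can help to convert and check the values in one place. The following is a minimal sketch assuming a script that expects a date, a name and a count, as in the exmpl.R example; read_args is a made-up helper name.

```r
# Convert three positional arguments into typed values, failing early if
# too few were supplied. `args` defaults to the real command line.
read_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) < 3) {
    stop("usage: Rscript myscript.R <start_date> <name> <n>", call. = FALSE)
  }
  list(start_date = as.Date(args[1]),
       name       = args[2],
       n          = as.integer(args[3]))
}

# Passing a vector explicitly makes the helper easy to try interactively:
str(read_args(c("2010-01-28", "example", "100")))
```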
Try library(getopt) ... if you want things to be nicer. For example:
spec <- matrix(c(
'in' , 'i', 1, "character", "file from fastq-stats -x (required)",
'gc' , 'g', 1, "character", "input gc content file (optional)",
'out' , 'o', 1, "character", "output filename (optional)",
'help' , 'h', 0, "logical", "this help"
),ncol=5,byrow=TRUE)
opt = getopt(spec);
# `in` is a reserved word in R, so opt$in does not parse; index by name instead
if (!is.null(opt$help) || is.null(opt[["in"]])) {
cat(paste(getopt(spec, usage=TRUE),"\n"));
q();
}
Since optparse has been mentioned a couple of times in the answers, and it provides a comprehensive kit for command line processing, here's a short simplified example of how you can use it, assuming the input file exists:
script.R:
library(optparse)
option_list <- list(
make_option(c("-n", "--count_lines"), action="store_true", default=FALSE,
help="Count the line numbers [default]"),
make_option(c("-f", "--factor"), type="integer", default=3,
help="Multiply output by this number [default %default]")
)
parser <- OptionParser(usage="%prog [options] file", option_list=option_list)
args <- parse_args(parser, positional_arguments = 1)
opt <- args$options
file <- args$args
if(opt$count_lines) {
print(paste(length(readLines(file)) * opt$factor))
}
Given an arbitrary file blah.txt with 23 lines.
On the command line:
Rscript script.R -h outputs
Usage: script.R [options] file
Options:
-n, --count_lines
Count the line numbers [default]
-f FACTOR, --factor=FACTOR
Multiply output by this number [default 3]
-h, --help
Show this help message and exit
Rscript script.R -n blah.txt outputs [1] "69"
Rscript script.R -n -f 5 blah.txt outputs [1] "115"
you need littler (pronounced 'little r')
Dirk will be by in about 15 minutes to elaborate ;)
In bash, you can construct a command line like the following:
$ z=10
$ echo $z
10
$ Rscript -e "args<-commandArgs(TRUE);x=args[1]:args[2];x;mean(x);sd(x)" 1 $z
[1] 1 2 3 4 5 6 7 8 9 10
[1] 5.5
[1] 3.027650
$
You can see that the variable $z is substituted by bash shell with "10" and this value is picked up by commandArgs and fed into args[2], and the range command x=1:10 executed by R successfully, etc etc.
FYI: there is a function args(), which retrieves the arguments of R functions, not to be confused with a vector of arguments named args
If you need to specify options with flags, (like -h, --help, --number=42, etc) you can use the R package optparse (inspired from Python):
http://cran.r-project.org/web/packages/optparse/vignettes/optparse.pdf.
At least this how I understand your question, because I found this post when looking for an equivalent of the bash getopt, or perl Getopt, or python argparse and optparse.
I just put together a nice data structure and chain of processing to generate this switching behaviour, no libraries needed. I'm sure it will have been implemented numerous times over, and came across this thread looking for examples - thought I'd chip in.
I didn't even particularly need flags (the only flag here is a debug mode, creating a variable which I check for as a condition of starting a downstream function: if (!exists("debug.mode")) {...} else {print(variables)}). The flag-checking lapply statements below produce the same result as:
if ("--debug" %in% args) debug.mode <- T
if ("-h" %in% args || "--help" %in% args) cat(help.prompt)
where args is the variable read in from the command-line arguments (a character vector, equivalent to c('--debug','--help') when you supply both flags, for instance).
It's reusable for any other flag and you avoid all the repetition, and no libraries so no dependencies:
args <- commandArgs(TRUE)
flag.details <- list(
"debug" = list(
def = "Print variables rather than executing function XYZ...",
flag = "--debug",
output = "debug.mode <- T"),
"help" = list(
def = "Display flag definitions",
flag = c("-h","--help"),
output = "cat(help.prompt)") )
flag.conditions <- lapply(flag.details, function(x) {
paste0(paste0('"',x$flag,'"'), " %in% args", collapse = " || ")
})
flag.truth.table <- unlist(lapply(flag.conditions, function(x) {
if (eval(parse(text = x))) {
return(T)
} else return(F)
}))
help.prompts <- lapply(names(flag.truth.table), function(x){
# joins the space-separated flags and the flag description with a tab
paste0(c(paste0(flag.details[x][[1]][['flag']], collapse=" "),
flag.details[x][[1]][['def']]), collapse="\t")
} )
help.prompt <- paste(c(unlist(help.prompts),''),collapse="\n\n")
# The following lines handle the flags, running the corresponding 'output' entry in flag.details for any supplied
flag.output <- unlist(lapply(names(flag.truth.table), function(x){
if (flag.truth.table[x]) return(flag.details[x][[1]][['output']])
}))
eval(parse(text = flag.output))
Note that in flag.details here the commands are stored as strings, then evaluated with eval(parse(text = '...')). Optparse is obviously desirable for any serious script, but minimal-functionality code is good too sometimes.
Sample output:
$ Rscript check_mail.Rscript --help
--debug Print variables rather than executing function XYZ...
-h --help Display flag definitions
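A variation on the same idea stores the actions as functions rather than strings, which avoids eval(parse(text = ...)) altogether. This is only a sketch: flag_table, supplied_flags and run_flags are illustrative names, and the actions just return strings so that the behaviour is easy to inspect.

```r
# Each entry lists its flags, a description, and a function to run.
flag_table <- list(
  debug = list(flag = "--debug",
               def  = "Print variables rather than executing function XYZ...",
               run  = function() "debug mode on"),
  help  = list(flag = c("-h", "--help"),
               def  = "Display flag definitions",
               run  = function() "help requested")
)

# Names of the entries whose flags appear in the argument vector.
supplied_flags <- function(args, table = flag_table) {
  hit <- Filter(function(entry) any(entry$flag %in% args), table)
  names(hit)
}

# Run the action of every supplied flag and collect the results.
run_flags <- function(args, table = flag_table) {
  lapply(table[supplied_flags(args, table)], function(entry) entry$run())
}

run_flags(c("--debug", "input.txt"))
```

In a real script, args would come from commandArgs(TRUE) and the run functions would set options or print the help text.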

"download.file" Incomplete and inconsistent downloads

Am trying to understand why I am having inconsistent results downloading CSV files from a website archive. Don't know if the problem is at my end, the other side or just failed communications in between. Any suggestions are welcomed.
I'm using an R script to automate the downloading of CSV files by month and year from the HYCOM archives for analysis. The script generated the following URL: 'http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly?var=salinity&var=water_temp&var=water_u&var=water_v&latitude=13.875&longitude=-72.25&time_start=2012-05-01T00:00:00Z&time_end=2012-05-31T21:00:00Z&vertCoord=&accept=csv'
Running download.file successfully obtains the file about half the time and fails otherwise. The image below shows the failed run; the log of a successful run follows.
Successful Log
#download one month of data
MM = '05'
LastDay = ndays(paste(year,MM,'01',sep="-"))
H1 = paste( as shown in image)
H2 = '-01T00:00:00Z&time_end='
#H3 = 'T21:00:00Z&timeStride=1&vertCoord=&accept=csv'
H3 = 'T21:00:00Z&vertCoord=&accept=csv'
HtmlLink <- paste(H1,year,"-",MM,H2,year,"-",MM,"-",LastDay,H3,sep="")
dest = paste("../data/",year,MM,".csv",sep="")
download.file(url =HtmlLink ,destfile=dest,cacheOK=FALSE, method="auto")
trying URL 'as shown in image'
Content type 'text/plain;charset=UTF-8' length unknown
..................................................
................downloaded 666 KB
user system elapsed
28.278 6.605 5201.421
LOG OF FAILED RUN
You can/should turn the following into a function accepting parameters and replace the hardcoded values with said params (I used httr:::parse_query() to make the list):
library(httr)
URL <- "http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly"
params <- list(var = "salinity",
var = "water_temp",
var = "water_u",
var = "water_v",
latitude = "13.875",
longitude = "-72.25",
time_start = "2012-05-01T00:00:00Z",
time_end = "2012-05-31T21:00:00Z",
vertCoord = "",
accept = "csv")
dest_file <- "filename"
res <- GET(url=URL,
query=params,
timeout(360),
write_disk(dest_file, overwrite=TRUE),
verbose())
warn_for_status(res)
You can (eventually) remove the verbose() from that GET call, but it's helpful during debugging.
The main issue is that this server is s l o w and times out before the transfer is complete. Even the value of 360 might not be enough (you'll need to experiment).
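Because the failures are intermittent, another option is to wrap whichever download call you use in a small retry loop. The helper below is a sketch in base R; with_retries is a made-up name, and it assumes the wrapped function signals an error (rather than only a warning) when the transfer fails.

```r
# Call fn() up to `attempts` times, pausing `wait` seconds between tries,
# and return the first successful result.
with_retries <- function(fn, attempts = 3, wait = 0) {
  for (i in seq_len(attempts)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("attempt ", i, " failed: ", conditionMessage(result))
    if (i < attempts) Sys.sleep(wait)
  }
  stop("all ", attempts, " attempts failed", call. = FALSE)
}

# Possible use with the request above (not run here):
# with_retries(function() download.file(HtmlLink, dest, mode = "wb"),
#              attempts = 5, wait = 30)
```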
Many thanks to all for the help. The suggestion by hrbrmstr appears to be an elegant answer and I look forwards to testing it. However, I was unable to install a working copy using the program manager. Installation from a local download also failed since R complained that the OS X version that I downloaded from CRAN was a windows version, not OS X. Yes, I repeated the download several times to make sure I had the right package.
As suggested by Cyrus Mohammadian, I tried the procedures in the curl library.
Running the same URL, download.file transfers failed about 50% of the time. Using curl reduced the transfer times from 2000 seconds to 1000 seconds with no failures in 12 tries.
## calculate number of days in month
ndays <- function(d) {
last_days <- 28:31
rev(last_days[which(!is.na(
as.Date( paste( substr(d, 1, 8),
last_days, sep = ''),
'%Y-%m-%d')))])[1] }
nlat = 13.875
elon = -72.25
#download one month of data
year = 2008
MM = '01'
LastDay = ndays(paste(year,MM,'01',sep="-"))
H1 = paste('http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly?',
'var=salinity&var=water_temp&var=water_u&var=water_v&latitude=',
nlat,'&longitude=', elon,'&time_start=',sep="")
H2 = '-01T00:00:00Z&time_end='
H3 = 'T21:00:00Z&timeStride=1&vertCoord=&accept=csv'
HtmlLink <- paste(H1,year,"-",MM,H2,year,"-",MM,"-",LastDay,H3,sep="")
dest = paste("../data/",year,MM,".csv",sep="")
curl::curl_download(url =HtmlLink ,destfile=dest,quiet=FALSE, mode="wb") # curl_download comes from the curl package
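The date arithmetic and URL assembly can also be written with base R date functions, which avoids probing 28:31 for valid dates. This is a sketch: last_day_of_month and build_hycom_url are made-up names, and the endpoint and query parameters are simply copied from the URL in the question.

```r
# Last day of a month: step to the first of the next month, go back one day.
last_day_of_month <- function(year, month) {
  first <- as.Date(sprintf("%04d-%02d-01", as.integer(year), as.integer(month)))
  next_first <- seq(first, by = "month", length.out = 2)[2]
  as.integer(format(next_first - 1, "%d"))
}

# Assemble the request URL for one month at one location.
build_hycom_url <- function(year, month, lat, lon) {
  ym <- sprintf("%04d-%02d", as.integer(year), as.integer(month))
  paste0("http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly",
         "?var=salinity&var=water_temp&var=water_u&var=water_v",
         "&latitude=", lat, "&longitude=", lon,
         "&time_start=", ym, "-01T00:00:00Z",
         "&time_end=", ym, "-", last_day_of_month(year, month), "T21:00:00Z",
         "&vertCoord=&accept=csv")
}

build_hycom_url(2012, "05", 13.875, -72.25)
```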

Importing data into R (rdata) from Github

I want to put some R code plus the associated data file (RData) on Github.
So far, everything works okay. But when people clone the repository, I want them to be able to run the code immediately. At the moment, this isn't possible because they will have to change their work directory (setwd) to directory that the RData file was cloned (i.e. downloaded) to.
Therefore, I thought it might be easier, if I changed the R code such that it linked to the RData file on github. But I cannot get this to work using the following snippet. I think perhaps there is some issue text / binary issue.
x <- RCurl::getURL("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData")
y <- load(x)
Any help would be appreciated.
Thanks
This works for me:
githubURL <- "https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData"
load(url(githubURL))
head(df)
# X Y Z
# 1 16602794 -4183983 94.92019
# 2 16602814 -4183983 91.15794
# 3 16602834 -4183983 87.44995
# 4 16602854 -4183983 83.79617
# 5 16602874 -4183983 80.19643
# 6 16602894 -4183983 76.65052
EDIT Response to OP comment.
From the documentation:
Note that the https:// URL scheme is not supported except on Windows.
So you could try this:
download.file(githubURL,"myfile")
load("myfile")
which works for me as well, but this will clutter your working directory. If that doesn't work, try setting method="curl" in the call to download.file(...).
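To avoid leaving the downloaded file in the working directory, the same two steps can be wrapped so that the download goes to a temporary file and the objects land in their own environment. A minimal sketch, assuming download.file can reach the URL with its default method; load_remote_rdata is a made-up name.

```r
# Download an .RData file to a temporary location, load it into `envir`,
# and remove the temporary file again on exit.
load_remote_rdata <- function(url, envir = new.env()) {
  tmp <- tempfile(fileext = ".RData")
  on.exit(unlink(tmp))
  download.file(url, tmp, mode = "wb", quiet = TRUE)  # "wb" keeps the binary intact
  load(tmp, envir = envir)
  envir
}

# Possible use with the repository from the question (not run here):
# e <- load_remote_rdata(githubURL)
# head(e$df)
```

Loading into a separate environment also avoids silently overwriting objects such as df in the global workspace.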
I've had trouble with this before as well, and the solution I've found to be the most reliable is to use a tiny modification of source_url from the fantastic [devtools][1] package. This works for me (on a Mac).
load_url <- function (url, ..., sha1 = NULL) {
# based very closely on code for devtools::source_url
stopifnot(is.character(url), length(url) == 1)
temp_file <- tempfile()
on.exit(unlink(temp_file))
request <- httr::GET(url)
httr::stop_for_status(request)
writeBin(httr::content(request, type = "raw"), temp_file)
file_sha1 <- digest::digest(file = temp_file, algo = "sha1")
if (is.null(sha1)) {
message("SHA-1 hash of file is ", file_sha1)
}
else {
if (nchar(sha1) < 6) {
stop("Supplied SHA-1 hash is too short (must be at least 6 characters)")
}
file_sha1 <- substr(file_sha1, 1, nchar(sha1))
if (!identical(file_sha1, sha1)) {
stop("SHA-1 hash of downloaded file (", file_sha1,
")\n does not match expected value (", sha1,
")", call. = FALSE)
}
}
load(temp_file, envir = .GlobalEnv)
}
I use a very similar modification to get text files from github using read.table, etc. Note that you need to use the "raw" version of the github URL (which you included in your question).
[1] https://github.com/hadley/devtools
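load takes a filename, and an .RData file is binary, so fetch it as raw bytes and write it out with writeBin (getURL plus writeLines would mangle the content).
x <- RCurl::getBinaryURL("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData", followlocation = TRUE)
writeBin(x, tmp <- tempfile())
y <- load(tmp)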

Check if a timezone is valid in R

I am reading a file that contains timestamps and a timezone specification. I would like to be able to detect if a given timezone on this file is recognized by R or not, and supply my own default in case it isn't.
However, it seems like as.POSIXct silently falls back to UTC if given an invalid timezone, with no error or warning I could catch and handle:
> as.POSIXct("1970-01-01", tz="blah")
[1] "1970-01-01 UTC"
What would be a 'proper' way in R to check if a given timezone is recognized or not?
help("time zones") explains a lot of the issues with time zones in detail and is well worth the read.
Results will vary based on your OS, but example("time zones") shows how you can read a zone.tab file if your OS has one.
tzfile <- "/usr/share/zoneinfo/zone.tab"
tzones <- read.delim(tzfile, row.names = NULL, header = FALSE,
col.names = c("country", "coords", "name", "comments"),
as.is = TRUE, fill = TRUE, comment.char = "#")
str(tzones$name)
#chr [1:415] "Europe/Andorra" "Asia/Dubai" "Asia/Kabul" "America/Antigua" "America/Anguilla" ...
NROW(tzones)
#[1] 415
head(tzones)
# country coords name comments
#1 AD +4230+00131 Europe/Andorra
#2 AE +2518+05518 Asia/Dubai
#3 AF +3431+06912 Asia/Kabul
#4 AG +1703-06148 America/Antigua
#5 AI +1812-06304 America/Anguilla
#6 AL +4120+01950 Europe/Tirane
You could use a timezone library which has knowledge of time zones. This is from the SVN version of RcppBDT:
R> tz <- new(bdtTz, "America/Chicago")
R> cat("tz object initialized as: ", format(tz), "\n")
tz object initialized as: America/Chicago
R> tzBAD <- new(bdtTz, "blah")
Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) :
Unknown region supplied, no tz object created
R>
In general, time zone support is dependent on the operating system. So for a portable solution you need to supply a list of valid time zones from somewhere...
And for what it is worth, I am using the csv file from the Boost sources. A copy of that time zones file is eg here at github.
Just stumbled on this question since I was trying to figure out the same thing. I ended up using the following. Leaving this for anyone who might stumble on this question...
is.valid.timezone <- function(timezone) {
return(timezone %in% (OlsonNames()))
}
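Building on that check, the fallback asked for in the question can be sketched as below. tz_or_default is a made-up name, and the list returned by OlsonNames() depends on the time zone database available to R on the system, so a name that is valid on one machine may be unknown on another.

```r
# Return `tz` if R's time zone database knows it, otherwise `default`.
tz_or_default <- function(tz, default = "UTC") {
  known <- suppressWarnings(OlsonNames())
  if (length(tz) == 1 && !is.na(tz) && tz %in% known) tz else default
}

as.POSIXct("1970-01-01", tz = tz_or_default("blah"))
```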
You can also use the timeDate package from Rmetrics to check for a valid time zone.
require(timeDate)
timeDate("1970-01-01", zone = "Africa/Dakar")
## [1] [1970-01-01]
timeDate("1970-01-01", zone = "blah")
## Error in .formatFinCenterNum(unclass(ct), zone, type = "any2gmt") :
## 'blah' is not a valid FinCenter.
