list files in R by dates - r

I have a set of NetCDF files in my directory, organised by date (each file holds one day of data). I read all the files in R using
require(RNetCDF)
files <- list.files(pattern = '\\.nc$', full.names = TRUE)
When I run the code, R lists 2013 and 2014 first, and parts of 2010 come at the end (see the sample output from R below):
"./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820223.SUB.nc"
"./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820224.SUB.nc"
"./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820225.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130829.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130830.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130831.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100626.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100827.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100828.SUB.nc"
I am trying to generate a daily time series from these files using a loop, so when I apply the rest of my code, the data from June to August 2010 end up at the end of the daily time series. I suspect this has to do with how the files are listed in R.
Is there any way to list files in R and ensure they are ordered by date?

Here are your files unsorted
paths <- c("./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820223.SUB.nc",
"./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820224.SUB.nc",
"./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820225.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130829.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130830.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130831.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100626.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100827.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100828.SUB.nc")
I'm using a regular expression to extract the 8 digits of the date, YYYYMMDD. Because the digits run from year to day, sorting the strings already gives chronological order, but you can also convert them into dates.
## matches ...Nx.<number of digits = 8>... and captures the stuff in <>
## and saves this match to the first capture group, \\1
pattern <- '.*Nx\\.(\\d{8}).*'
gsub(pattern, '\\1', paths)
# [1] "19820223" "19820224" "19820225" "20130829" "20130830" "20130831"
# [7] "20100626" "20100827" "20100828"
sort(gsub(pattern, '\\1', paths))
# [1] "19820223" "19820224" "19820225" "20100626" "20100827" "20100828"
# [7] "20130829" "20130830" "20130831"
## not necessary to convert that into dates but you can
as.Date(sort(gsub(pattern, '\\1', paths)), '%Y%m%d')
# [1] "1982-02-23" "1982-02-24" "1982-02-25" "2010-06-26" "2010-08-27"
# [6] "2010-08-28" "2013-08-29" "2013-08-30" "2013-08-31"
And order the original paths
## so you can use the above to order the paths
paths[order(gsub(pattern, '\\1', paths))]
# [1] "./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820223.SUB.nc"
# [2] "./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820224.SUB.nc"
# [3] "./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820225.SUB.nc"
# [4] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100626.SUB.nc"
# [5] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100827.SUB.nc"
# [6] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100828.SUB.nc"
# [7] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130829.SUB.nc"
# [8] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130830.SUB.nc"
# [9] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130831.SUB.nc"
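If it helps, the steps above can be wrapped into a small helper and applied straight to the output of list.files(). This is only a sketch: the name sort_paths_by_date and its default pattern are made up for illustration, and it assumes every file name contains one dot-delimited run of 8 digits in YYYYMMDD form.

```r
# Hypothetical helper (not from the original answer): order paths by the
# YYYYMMDD stamp embedded in each file name.
sort_paths_by_date <- function(paths, pattern = ".*\\.([0-9]{8})\\..*") {
  # keep only the 8-digit stamp captured by the first group
  stamps <- sub(pattern, "\\1", basename(paths))
  dates <- as.Date(stamps, format = "%Y%m%d")
  # names without a parsable stamp become NA, so fail loudly
  if (anyNA(dates)) stop("could not parse a date from every file name")
  paths[order(dates)]
}

demo <- c("./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130829.SUB.nc",
          "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100626.SUB.nc",
          "./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820223.SUB.nc")
sorted <- sort_paths_by_date(demo)
```

With real files the call would look like sort_paths_by_date(list.files(pattern = "\\.nc$", full.names = TRUE)).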

Related

how to sort list.files() in correct date order?

Using plain list.files() in the working directory returns the file list, but the numeric order is messed up.
f <- list.files(pattern = "\\.nc$")
f
# [1] "te1971-1.nc" "te1971-10.nc" "te1971-11.nc" "te1971-12.nc"
# [5] "te1971-2.nc" "te1971-3.nc" "te1971-4.nc" "te1971-5.nc"
# [9] "te1971-6.nc" "te1971-7.nc" "te1971-8.nc" "te1971-9.nc"
where the number after "-" describes the month number.
I used the following to try to sort it
myFiles <- paste("te", i, "-", c(1:12), ".nc", sep = "")
mixedsort(myFiles)
It returns the files in order, but reversed:
[1] "te1971-12.nc" "te1971-11.nc" "te1971-10.nc" "te1971-9.nc"
[5] "te1971-8.nc" "te1971-7.nc" "te1971-6.nc" "te1971-5.nc"
[9] "te1971-4.nc" "te1971-3.nc" "te1971-2.nc" "te1971-1.nc"
How do I fix this?
The issue is that the names are sorted alphabetically, so "10" comes before "2".
You could capture the year and month as groups with gsub, append "-1" as the first day of the month, coerce the result with as.Date, and order by that.
x[order(as.Date(gsub('.*(\\d{4})-(\\d{1,2}).*', '\\1-\\2-1', x)))]
# [1] "te1971-1.nc" "te1971-2.nc" "te1971-3.nc" "te1971-4.nc" "te1971-5.nc"
# [6] "te1971-6.nc" "te1971-7.nc" "te1971-8.nc" "te1971-9.nc" "te1971-10.nc"
# [11] "te1971-11.nc" "te1971-12.nc"
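An alternative sketch that avoids building dates: pull the year and the month out as integers and order on both. The vector month_files and the variable names are only for illustration, and the patterns assume names of the form te<YYYY>-<M>.nc.

```r
# Illustrative file names; the last one is from a different year
month_files <- c("te1971-1.nc", "te1971-10.nc", "te1971-2.nc", "te1970-12.nc")

# extract year and month as integers so that 10 sorts after 2
yr <- as.integer(sub("^te([0-9]{4})-.*$", "\\1", month_files))
mo <- as.integer(sub("^.*-([0-9]{1,2})\\.nc$", "\\1", month_files))

# order by year first, then by month within the year
sorted_files <- month_files[order(yr, mo)]
sorted_files
```

Ordering on integers rather than strings is what removes the "1, 10, 11, 12, 2" effect.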
Data:
x <- c("te1971-1.nc", "te1971-10.nc", "te1971-11.nc", "te1971-12.nc",
"te1971-2.nc", "te1971-3.nc", "te1971-4.nc", "te1971-5.nc", "te1971-6.nc",
"te1971-7.nc", "te1971-8.nc", "te1971-9.nc")

Convert Dat Data To Data Frame

I'm reading data from a .dat file and trying to convert it to a data frame so I can save it in a readable format. However, I have no clue how to convert the .dat data. I'm a bit of a beginner in R. Any help will be highly appreciated.
Code so Far:
data <- readLines("Day8.dat")
print(data)
Output So Far:
[1] "<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\"
modelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\"
xmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">
<d2lm:exchange><d2lm:supplierIdentification><d2lm:country>gb</d2lm:country>
<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:supplierIdentification></d2lm:exchange>
<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\"
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><d2lm:feedType>Event Data</d2lm:feedType>
<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime><d2lm:publicationCreator>
<d2lm:country>gb</d2lm:country><d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>
</d2lm:publicationCreator><d2lm:situation version=\"\" id=\"2922904\"><d2lm:headerInformation>
<d2lm:areaOfInterest>national</d2lm:areaOfInterest>
....
Thanks
It all depends on what you want to do with the data, i.e., how you want to process it.
For example, let's assume you want to parse all XML tags as separate strings. You can then extract the tags using a regular expression and the function str_extract_all:
library(stringr)
str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")
This regex works even if the XML element names are variable:
str_extract_all(dat, "<([^>]*)>.*</\\1>|<[^>]*>")
The result is a list:
[[1]]
[1] "<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\" \nmodelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\" \nxmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">"
[2] "<d2lm:exchange>"
[3] "<d2lm:supplierIdentification>"
[4] "<d2lm:country>gb</d2lm:country>"
[5] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"
[6] "<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\" \nxmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">"
[7] "<d2lm:feedType>Event Data</d2lm:feedType>"
[8] "<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime>"
[9] "<d2lm:publicationCreator>"
[10] "<d2lm:country>gb</d2lm:country>"
[11] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"
[12] "<d2lm:situation version=\"\" id=\"2922904\">"
[13] "<d2lm:headerInformation>"
[14] "<d2lm:areaOfInterest>national</d2lm:areaOfInterest>"
To turn the list into a dataframe:
datDF <- data.frame(tags = unlist(str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")))
EDIT:
If you want to have a dataframe with the text values between XML start tag and XML end tag, you can extract these tags and values along these lines:
datDF <- data.frame(
tags = unlist(str_extract_all(dat, "<([^>]*)>(?=[^>]*</\\1>)")),
values = unlist(str_extract_all(dat, "(?<=<([^>]{1,100})>).*(?=</\\1>)"))
)
datDF
                       tags                        values
1            <d2lm:country>                            gb
2 <d2lm:nationalIdentifier>                          NTIS
3           <d2lm:feedType>                    Event Data
4    <d2lm:publicationTime> 2020-05-10T00:00:44.778+01:00
5            <d2lm:country>                            gb
6 <d2lm:nationalIdentifier>                          NTIS
7     <d2lm:areaOfInterest>                      national
Is this--roughly--what you had in mind?
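If you would rather not depend on stringr, roughly the same tag/value table can be built with base R. This is a sketch under assumptions: xml_txt is a made-up, much shorter stand-in for the real file, leaf_pattern is my own pattern, and it only picks up elements that have no attributes and contain text but no child elements. For a complete, well-formed file a proper XML parser is likely the more robust route.

```r
# Made-up sample in the same style as the question's data
xml_txt <- paste0("<d2lm:exchange><d2lm:country>gb</d2lm:country>",
                  "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>",
                  "<d2lm:feedType>Event Data</d2lm:feedType></d2lm:exchange>")

# group 1 = element name, group 2 = text between start and end tag
leaf_pattern <- "<([^<>/ ]+)>([^<>]*)</\\1>"

# all complete <name>text</name> matches in the string
leaves <- regmatches(xml_txt, gregexpr(leaf_pattern, xml_txt, perl = TRUE))[[1]]

leaf_df <- data.frame(
  tag   = sub(leaf_pattern, "\\1", leaves, perl = TRUE),
  value = sub(leaf_pattern, "\\2", leaves, perl = TRUE),
  stringsAsFactors = FALSE
)
leaf_df
```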
DATA:
dat <- '<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\"
modelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\"
xmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">
<d2lm:exchange><d2lm:supplierIdentification><d2lm:country>gb</d2lm:country>
<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:supplierIdentification></d2lm:exchange>
<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\"
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><d2lm:feedType>Event Data</d2lm:feedType>
<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime><d2lm:publicationCreator>
<d2lm:country>gb</d2lm:country><d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>
</d2lm:publicationCreator><d2lm:situation version=\"\" id=\"2922904\"><d2lm:headerInformation>
<d2lm:areaOfInterest>national</d2lm:areaOfInterest>'

How to select files in directory according to date in filename and copy to another folder

I have a set of files with the following names in a folder (tempfiles1):
ASC05012019R.DBF
ASC05012019R.NTX
ASC05012019H.DBF
ASC05012019H.NTX
ASC05012019F.DBF
ASC05012019F.NTX
ROS12012019R.DBF
ROS12012019R.NTX
ROS12012019H.DBF
ROS12012019H.NTX
ROS12012019F.DBF
ROS12012019F.NTX
BAL25012019R.DBF
BAL25012019R.NTX
BAL25012019H.DBF
BAL25012019H.NTX
BAL25012019F.DBF
BAL25012019F.NTX
ROK20012019R.DBF
ROK20012019R.NTX
ROK20012019H.DBF
ROK20012019H.NTX
ROK20012019F.DBF
ROK20012019F.NTX
Each filename has 3 different letters to begin with, but all are followed by a date in the format ddmmyyyy.
There are other files in the folder (as shown above) like R.NTX, H.NTX or F.NTX and others but I am only looking for the files with extension of "R.DBF", "H.DBF" and "F.DBF".
I wish to select from a date range, say 05012019 to 22012019, and copy all of the R.DBF, H.DBF and F.DBF files to another folder (tempfiles2).
I have been able to specify my folders:
current_folder <- "G:/m/HR/tempfiles1"
new_folder <- "G:/m/HR/tempfiles2"
and extract the date from each of the file names:
list_of_files <- substr(list.files(current_folder, ".DBF"),4,11)
list_of_files <- as.Date(list_of_files, format= "%d%m%Y")
But here is where I'm stuck. I tried using pattern but this returned an invalid pattern argument error:
list_of_files1 <- list.files(current_folder, pattern = 2019-01-05)
Plus this will only give me one date.
I can copy the files using file.copy, like so:
file.copy(file.path(current_folder,list_of_files), new_folder)
But I can't figure out how to select the dates.
The end result of the example above, using dates from 05012019 to 22012019, would be to have the correct files copied to the folder tempfiles2:
ASC05012019R.DBF
ASC05012019H.DBF
ASC05012019F.DBF
ROS12012019R.DBF
ROS12012019H.DBF
ROS12012019F.DBF
ROK20012019R.DBF
ROK20012019H.DBF
ROK20012019F.DBF
Here's an approach:
# make some files to work with
files <- c("ASC05012019R.DBF", "ASC05012019R.NTX", "ASC05012019H.DBF",
"ASC05012019H.NTX", "ASC05012019F.DBF", "ASC05012019F.NTX", "ROS12012019R.DBF",
"ROS12012019R.NTX", "ROS12012019H.DBF", "ROS12012019H.NTX", "ROS12012019F.DBF",
"ROS12012019F.NTX", "BAL25012019R.DBF", "BAL25012019R.NTX", "BAL25012019H.DBF",
"BAL25012019H.NTX", "BAL25012019F.DBF", "BAL25012019F.NTX", "ROK20012019R.DBF",
"ROK20012019R.NTX", "ROK20012019H.DBF", "ROK20012019H.NTX", "ROK20012019F.DBF",
"ROK20012019F.NTX")
dir.create('temp')
setwd('temp')
file.create(files)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# use a set in a regex pattern to get files with the right ending
files <- list.files(pattern = '[RHF]\\.DBF$')
files
#> [1] "ASC05012019F.DBF" "ASC05012019H.DBF" "ASC05012019R.DBF"
#> [4] "BAL25012019F.DBF" "BAL25012019H.DBF" "BAL25012019R.DBF"
#> [7] "ROK20012019F.DBF" "ROK20012019H.DBF" "ROK20012019R.DBF"
#> [10] "ROS12012019F.DBF" "ROS12012019H.DBF" "ROS12012019R.DBF"
# extract and parse the dates
file_dates <- as.Date(sub('\\D+(\\d+).*', '\\1', files), '%d%m%Y')
file_dates
#> [1] "2019-01-05" "2019-01-05" "2019-01-05" "2019-01-25" "2019-01-25"
#> [6] "2019-01-25" "2019-01-20" "2019-01-20" "2019-01-20" "2019-01-12"
#> [11] "2019-01-12" "2019-01-12"
# subset based on the dates (both ends of the range inclusive)
wanted_files <- files[file_dates >= as.Date('2019-01-05') & file_dates <= as.Date('2019-01-22')]
wanted_files
#> [1] "ASC05012019F.DBF" "ASC05012019H.DBF" "ASC05012019R.DBF"
#> [4] "ROK20012019F.DBF" "ROK20012019H.DBF" "ROK20012019R.DBF"
#> [7] "ROS12012019F.DBF" "ROS12012019H.DBF" "ROS12012019R.DBF"
# make a new directory
new_dir <- 'temp2'
dir.create(new_dir)
# move the files you care about (use file.copy() instead to keep the originals)
file.rename(wanted_files, file.path(new_dir, wanted_files))
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# check that they're there
list.files(new_dir)
#> [1] "ASC05012019F.DBF" "ASC05012019H.DBF" "ASC05012019R.DBF"
#> [4] "ROK20012019F.DBF" "ROK20012019H.DBF" "ROK20012019R.DBF"
#> [7] "ROS12012019F.DBF" "ROS12012019H.DBF" "ROS12012019R.DBF"
Try this
#Define start and end date to select from files
start_range <- as.Date("05012019", format = "%d%m%Y")
end_range <- as.Date("22012019", format = "%d%m%Y")
#Get full path of the R.DBF, H.DBF and F.DBF files to copy
file_path <- list.files(current_folder, "[RHF]\\.DBF$", full.names = TRUE)
#Get date from the filenames to compare
list_date <- as.Date(substr(basename(file_path), 4, 11), format = "%d%m%Y")
#Select the files which lie in the range of dates
files_to_copy <- file_path[list_date %in% seq(start_range, end_range, by = "1 day")]
#Copy the files
file.copy(files_to_copy, new_folder)
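For reuse, the selection and the copy can be folded into one function with an inclusive date range. This is a sketch, not a tested tool: copy_files_in_range and its arguments are hypothetical names, the demo folders live under tempdir() rather than the G: drive from the question, and it assumes the date always sits in characters 4 to 11 of the file name.

```r
# Hypothetical helper: copy files whose ddmmyyyy stamp lies in [from, to]
copy_files_in_range <- function(src, dest, from, to, pattern = "[RHF]\\.DBF$") {
  paths <- list.files(src, pattern = pattern, full.names = TRUE)
  # characters 4-11 of the file name hold the date, as in the question
  dates <- as.Date(substr(basename(paths), 4, 11), format = "%d%m%Y")
  keep <- !is.na(dates) & dates >= from & dates <= to
  dir.create(dest, showWarnings = FALSE, recursive = TRUE)
  ok <- file.copy(paths[keep], dest)
  # return the names of the files that were actually copied
  invisible(basename(paths[keep][ok]))
}

# Demonstration in throwaway directories
src <- file.path(tempdir(), "tempfiles1_demo")
dest <- file.path(tempdir(), "tempfiles2_demo")
unlink(c(src, dest), recursive = TRUE)
dir.create(src)
file.create(file.path(src, c("ASC05012019R.DBF", "ASC05012019R.NTX",
                             "BAL25012019R.DBF", "ROK20012019H.DBF")))
copied <- copy_files_in_range(src, dest,
                              as.Date("2019-01-05"), as.Date("2019-01-22"))
```

Comparing with >= and <= keeps both end dates in the selection, which matches the expected result listed in the question.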

Converting a list of dates into another format in R?

I am attempting to convert these lists of dates into another format for use with the RJDBC package. The syntax for Oracle will only accept dates in the format '1-OCT-18'. However, the dates in these lists print in the format '2018-10-01'.
How can I convert these lists into the format: 1-OCT-18
library(lubridate)  # provides ymd(), months() and days() as used here
begin <- ymd("2017-01-01") + months(0:21)
last <- ymd("2017-02-01") + months(0:22) - days(1)
You can use format, wrapped in toupper for upper case and in trimws to trim the whitespace created by %e. I used %e because it gives the day of the month as a decimal number (1–31), with a leading space for a single-digit number, so we just remove that whitespace afterward. Note that %b produces the month abbreviation of the current locale, so this assumes an English locale.
trimws(toupper(format(begin, "%e-%b-%y")))
# [1] "1-JAN-17" "1-FEB-17" "1-MAR-17" "1-APR-17" "1-MAY-17" "1-JUN-17"
# [7] "1-JUL-17" "1-AUG-17" "1-SEP-17" "1-OCT-17" "1-NOV-17" "1-DEC-17"
# [13] "1-JAN-18" "1-FEB-18" "1-MAR-18" "1-APR-18" "1-MAY-18" "1-JUN-18"
# [19] "1-JUL-18" "1-AUG-18" "1-SEP-18" "1-OCT-18"
For the last vector, you can omit trimws because %e won't generate any whitespace for two-digit numbers.
toupper(format(last, "%e-%b-%y"))
# [1] "31-JAN-17" "28-FEB-17" "31-MAR-17" "30-APR-17" "31-MAY-17"
# [6] "30-JUN-17" "31-JUL-17" "31-AUG-17" "30-SEP-17" "31-OCT-17"
# [11] "30-NOV-17" "31-DEC-17" "31-JAN-18" "28-FEB-18" "31-MAR-18"
# [16] "30-APR-18" "31-MAY-18" "30-JUN-18" "31-JUL-18" "31-AUG-18"
# [21] "30-SEP-18" "31-OCT-18" "30-NOV-18"
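If the locale is a concern, a variant that does not rely on %b or %e can be sketched with the built-in month.abb constant, which always holds the English abbreviations. The function name to_oracle_date is made up for this example.

```r
# Hypothetical helper: format dates as D-MON-YY independent of the locale
to_oracle_date <- function(d) {
  d <- as.Date(d)
  sprintf("%d-%s-%s",
          as.integer(format(d, "%d")),                      # day without padding
          toupper(month.abb[as.integer(format(d, "%m"))]),  # JAN, FEB, ...
          format(d, "%y"))                                  # two-digit year
}

oracle_dates <- to_oracle_date(c("2018-10-01", "2017-02-28"))
oracle_dates
```

Because the day goes through as.integer(), there is no padding to trim afterwards.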

How do I use file.path() on a list of subdirectories

I want to add "_quants" to a list of folder names contained in samples$sample. When I use the following:
files <- file.path(dir, "quants", samples$sample, "_quants")
> dir
[1] "E:/ubuntu-shared/salmonTutorial/"
> samples$sample
[1] DRR016125 DRR016126 DRR016127 DRR016128 DRR016129 DRR016130 DRR016131 DRR016132 DRR016133 DRR016134 DRR016135 DRR016136 DRR016137 DRR016138 DRR016139
[16] DRR016140
16 Levels: DRR016125 DRR016126 DRR016127 DRR016128 DRR016129 DRR016130 DRR016131 DRR016132 DRR016133 DRR016134 DRR016135 DRR016136 DRR016137 ... DRR016140
I get:
[1] "E:/ubuntu-shared/salmonTutorial//quants/DRR016125/_quants"
How do I remove the double // and append "_quants" to "DRR016125" using file.path() to get the desired:
[1] "E:/ubuntu-shared/salmonTutorial/quants/DRR016125_quants"
[2] "E:/ubuntu-shared/salmonTutorial/quants/DRR016126_quants"
Solution using base::paste0:
dir <- "E:/ubuntu-shared/salmonTutorial/"
samples <- list(sample = c("DRR016125", "DRR016126", "DRR016127"))
paste0(dir, "quants/", samples$sample, "_quants")
[1] "E:/ubuntu-shared/salmonTutorial/quants/DRR016125_quants"
[2] "E:/ubuntu-shared/salmonTutorial/quants/DRR016126_quants"
[3] "E:/ubuntu-shared/salmonTutorial/quants/DRR016127_quants"
paste0 concatenates its arguments element-wise (after converting them to character) without inserting a separator, so you get one string per sample. Because dir already ends in "/", only the "/" after "quants" has to be added by hand.
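Since the question asks about file.path() specifically, here is a sketch that keeps it: build the "<sample>_quants" names with paste0 first, and strip the trailing slash from dir so that file.path() does not produce "//". The variable names base_dir and quant_dirs are my own, and a factor is used for the samples to mirror the question.

```r
dir <- "E:/ubuntu-shared/salmonTutorial/"
samples <- data.frame(sample = factor(c("DRR016125", "DRR016126")))

# drop any trailing slash(es) so file.path() adds exactly one separator
base_dir <- sub("/+$", "", dir)

# paste0() converts the factor to its labels before appending the suffix
quant_dirs <- file.path(base_dir, "quants", paste0(samples$sample, "_quants"))
quant_dirs
```

file.path() puts a separator between every pair of arguments, which is why the suffix has to be joined to the sample name before the call rather than passed as a separate argument.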
