Good morning everyone,
I am currently using the code written by Antonio Olinto Avila-da-Silva on this link: https://oceancolor.gsfc.nasa.gov/forum/oceancolor/topic_show.pl?tid=5954
It allows me to extract sst/chlor_a data from .nc files. It uses a loop to build an Excel file with all the data. Unfortunately, I noticed that the function only reads the first data file on every pass through the loop, so I end up with the same data repeated 20 times in a row in my Excel file.
Does anyone have a solution to make this loop work properly?
I would first check that these two lines return all the files you are expecting:
(f <- list.files(".", pattern="*.L3m_MO_SST_sst_9km.nc",full.names=F))
(lf<-length(f))
And then there's a bug in the for-loop. This line:
data<-nc_open(f)
Needs to reference the iterator i, so change it to something like this:
data<-nc_open(f[[i]])
It appears both scripts have this same bug.
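A minimal sketch of the fix, assuming small CSV files as stand-ins for the .nc files so that it runs without ncdf4 or real data. The folder name, file names and the sst column are hypothetical; only the use of f[[i]] inside the loop comes from the answer above.

```r
# Stand-in data: three small CSV files (hypothetical names) instead of .nc files.
demo_dir <- file.path(tempdir(), "loop_demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir)
for (k in 1:3) {
  write.csv(data.frame(sst = k),
            file.path(demo_dir, paste0("month", k, ".csv")), row.names = FALSE)
}

f <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
lf <- length(f)

values <- numeric(lf)
for (i in seq_len(lf)) {
  # f[[i]] selects the i-th file; f on its own is the whole vector of names.
  dat <- read.csv(f[[i]])
  values[i] <- dat$sst
}
print(values)
```

With nc_open() the pattern is the same: index the file vector with the loop variable on every pass.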
I am trying to deal with extracting a subset from multiple .grb2 files in the same file path, and write them in a csv. I am able to do it for one (or a few) by using the following set of commands:
GRIB <- brick("tmp2m.1989102800.time.grb2")
GRIB <- as.array(GRIB)
readGDAL("tmp2m.1989102800.time.grb2")
tmp2m.6hr <- GRIB[51,27,c(261:1232)]
str(tmp2m.6hr)
tmp2m.data <- data.frame(tmp2m.6hr)
write.csv(tmp2m.data,"tmp1.csv")
The commands above write to a CSV file the temperature values for a specific latitude index (51) and longitude index (27), over a specific time range (261:1232).
Now I have hundreds of these files (with different file names, of course) in the same directory and I want to do the same for all of them. As you know better than I do, I cannot do this one by one, changing the file name each time.
I have struggled a lot with this, but so far I have not managed to do it. Since I am new to R and my knowledge is limited, I would very much appreciate any possible help with this.
The simplest way would be to use a normal for loop:
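library(raster) # brick()
library(rgdal) # readGDAL()
path <- "your file path here"
# full.names = TRUE keeps the directory in each name, so the files are found
# regardless of the working directory; the CSVs are written next to the inputs
input.file.names <- dir(path, pattern = "\\.grb2$", full.names = TRUE)
output.file.names <- paste0(tools::file_path_sans_ext(input.file.names), ".csv")
for (i in seq_along(input.file.names)) {
  GRIB <- brick(input.file.names[i])
  GRIB <- as.array(GRIB)
  readGDAL(input.file.names[i]) # edited line
  tmp2m.6hr <- GRIB[51, 27, c(261:1232)]
  str(tmp2m.6hr)
  tmp2m.data <- data.frame(tmp2m.6hr)
  write.csv(tmp2m.data, output.file.names[i])
}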
You could of course wrap the body of the for loop in a function and then use the standard lapply or the map function from purrr.
Note that this code writes a separate CSV file for each input file. If you want to append the data to a single file, you should check out write.table.
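Both suggestions can be sketched in base R as follows. This is only an illustration under stated assumptions: extract_one is a hypothetical helper, and it reads small CSV stand-in files rather than GRIB files so that it runs without raster or rgdal. All file and folder names are made up.

```r
# Stand-in inputs: three small CSV files in a temporary folder (hypothetical data).
in_dir <- file.path(tempdir(), "grib_demo")
unlink(in_dir, recursive = TRUE)
dir.create(in_dir)
for (k in 1:3) {
  write.csv(data.frame(value = k * 1:4),
            file.path(in_dir, paste0("file", k, ".csv")), row.names = FALSE)
}

# Hypothetical per-file function: in the real case this would hold the
# brick() / as.array() / subsetting steps from the loop body.
extract_one <- function(path) {
  values <- read.csv(path)$value
  data.frame(file = basename(path), value = values)
}

input_files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
results <- lapply(input_files, extract_one)

# Append every result to one file; write the header only for the first chunk.
out_file <- file.path(tempdir(), "combined.csv")
unlink(out_file)
for (res in results) {
  write.table(res, out_file, sep = ",", row.names = FALSE,
              col.names = !file.exists(out_file),
              append = file.exists(out_file))
}
combined <- read.csv(out_file)
```

Keeping the source file name as a column makes it possible to tell the rows apart once everything sits in one file.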
I have rawdata as 20 offline html files stored in following format
../rawdata/1999_table.html
../rawdata/2000_table.html
../rawdata/2001_table.html
../rawdata/2002_table.html
.
.
../rawdata/2017_table.html
These files contain tables that I am extracting and reshaping to a particular format.
I want to read these files at once to a list and process them one by one through a function that I have written.
What I tried:
I put the names of these files into an Excel file called filestoread.xlsx and used a for loop to load these files using the names mentioned in the sheet. But it doesn't seem to work
filestoread <- fread("../rawdata/filestoread.csv")
x <- list()
for (i in nrow(filestoread)) {
x[[i]] <- read_html(paste0("../rawdata/", filestoread[i]))
}
How can this be done?
Also, after reading the HTML files I want to extract the tables from them and reshape them using a function I wrote after converting it to a data table.
My final objective is to rbind all the tables and have a single data table with year wise entries of the tables in the html file.
First, store the paths to your files in one of the following ways.
Either, hardcoded
filestoread <- paste0("../rawdata/", 1999:2017, "_table.html")
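or by listing all HTML files in the directory (full.names = TRUE keeps the folder in each path, so read_html() can find the files)
filestoread <- list.files(path = "../rawdata", pattern = "\\.html$", full.names = TRUE)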
Then use lapply()
library(rvest)
lapply(filestoread, function(x) try(read_html(x)))
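Note: wrapping read_html() in try() lets lapply() continue when a file is missing or unreadable. The failed element then holds a "try-error" object instead of stopping the whole run.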
The second part of your question is a little broad, depends on the content of your files, and there are already some answers, you could consider e.g. this answer. In principle you use a combination of ?html_nodes and ?html_table.
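As a rough sketch of that combination, assuming each file holds a single table and that the year is the first four characters of the file name: the helper read_year_table, the demo folder and the table contents below are hypothetical, and html_nodes() and html_table() are the rvest functions mentioned above.

```r
library(rvest)

# Hypothetical stand-ins for the yearly files: two tiny HTML documents
# written to a temporary folder.
raw_dir <- file.path(tempdir(), "rawdata_demo")
unlink(raw_dir, recursive = TRUE)
dir.create(raw_dir)
for (yr in 1999:2000) {
  html <- paste0("<html><body><table>",
                 "<tr><th>item</th><th>count</th></tr>",
                 "<tr><td>a</td><td>", yr - 1990, "</td></tr>",
                 "<tr><td>b</td><td>", yr - 1980, "</td></tr>",
                 "</table></body></html>")
  writeLines(html, file.path(raw_dir, paste0(yr, "_table.html")))
}

filestoread <- list.files(raw_dir, pattern = "\\.html$", full.names = TRUE)

# Hypothetical helper: read one file, take its first table, tag it with the year.
read_year_table <- function(path) {
  page <- read_html(path)
  tab <- as.data.frame(html_table(html_nodes(page, "table")[[1]]))
  tab$year <- as.integer(substr(basename(path), 1, 4))
  tab
}

# One data frame with the rows of every yearly table stacked together.
all_tables <- do.call(rbind, lapply(filestoread, read_year_table))
```

Your own reshaping function would replace or follow read_year_table, and data.table::rbindlist could stand in for do.call(rbind, ...) if you want a data table as the result.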
I am working with very large data layers for an SDM class, and because of this I ended up breaking some of my layers into a bunch of blocks to avoid memory constraints. These blocks were written out as .grd files, and now I need to read them back into R and merge them together. I am extremely new to R and programming in general, so any help would be appreciated. What I have been trying so far looks like this:
merge.coarse=raster("coarseBlock1.grd")
for (i in 2:nBlocks){
merge.coarse=merge(merge.coarse,raster(paste("coarseBlock", i, ".grd", sep="")))
}
where my files are named coarseBlock<i>.grd and are sequentially numbered from 1 to nBlocks (259).
Any feedback would be greatly appreciated.
Growing an object inside a for loop is generally slow in R. Calling functions like merge or rbind on the accumulated result in every iteration also eats up a lot of memory, because each call builds a new, larger copy of the result.
A more efficient way to do this task is to call lapply (see this tutorial on apply functions for details) to load the files into R. This gives a list of rasters, which can then be combined in a single call to merge using do.call:
rasters <- lapply(list.files(GRDFolder, pattern = "\\.grd$", full.names = TRUE), FUN = raster)
merge.coarse <- do.call(merge, rasters)
I'm not too familiar with .grd files, but this overall process should at least get you going in the right direction. Assuming all your .grd files (1 through 259) are stored in the same folder (which I refer to as GRDFolder), the loop version looks like this:
grd.files <- list.files(GRDFolder, pattern = "\\.grd$", full.names = TRUE)
merge.coarse <- raster(grd.files[1])
for (filename in grd.files[-1])
{
  temp <- raster(filename)
  merge.coarse <- merge(merge.coarse, temp)
}
I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.), without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function metainfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. In particular, getting the name of the file from which the plot is produced doesn't necessarily work (the solution I use is one provided by @hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
# message - character string - Any message to be written into the information
# file (e.g., data used).
# filename - character string - the name of the txt file (including relative
# path). Should be the same as the output file it describes (RData,
# csv, pdf).
#
if (missing(filename) || is.null(filename))
{
stop('Provide an output filename - parameter filename.')
}
filename <- paste(filename, '.txt', sep='')
# Try to get as close as possible to getting the file name from which the
# function is called.
source.file <- lapply(sys.frames(), function(x) x$ofile)
source.file <- Filter(Negate(is.null), source.file)
t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
silent=TRUE)
if (inherits(t.sf, 'try-error'))
{
source.file <- NULL
}
func <- deparse(sys.call(-1))
# MetaInfo isn't always called from within another function, so func could
# return as NULL or as general environment.
if (any(grepl('eval', func, ignore.case=TRUE)))
{
func <- NULL
}
time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
meta <- list(Message=message,
Source=paste(source.file, ' on ', time, sep=''),
Functions=func,
System=Sys.info(),
Session=sessionInfo(),
Git.hash=git.h)
sink(file=filename)
print(meta)
sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
fn <- 'random_plot'
pdf(file=paste(fn, '.pdf', sep=''))
plot(x, y)
MetaInfo(message=NULL, filename=fn)
dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called ProjectTemplate that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There are also a number of helpful documents, such as Workflow of statistical data analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question betrays a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file was produced. You should in fact be able to wipe out any output and reproduce it by rerunning the code. So while I might still use the above function for extra information, it really is a question of being ruthless and cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or ProjectTemplate, which I will try to pick up. Thanks again for the suggestions @noah and @alex!
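The "wipe and rerun" idea can be sketched like this. It is only an illustration: rebuild_outputs, the output folder and the file names are hypothetical, and a temporary folder is used so the sketch is safe to run.

```r
# Hypothetical output folder; a temporary location keeps the sketch harmless.
out_dir <- file.path(tempdir(), "output_demo")

# Hypothetical master function: wipe the folder, then regenerate every output.
rebuild_outputs <- function(out_dir) {
  unlink(out_dir, recursive = TRUE)      # throw away all previous output
  dir.create(out_dir, recursive = TRUE)
  set.seed(1)                            # fixed seed so reruns give the same data
  x <- 1:10
  y <- runif(10)
  write.csv(data.frame(x = x, y = y),
            file.path(out_dir, "random_data.csv"), row.names = FALSE)
  pdf(file.path(out_dir, "random_plot.pdf"))
  plot(x, y)
  dev.off()
  invisible(list.files(out_dir))         # names of the files just produced
}

produced <- rebuild_outputs(out_dir)
```

Because everything in the folder is recreated from code on each call, nothing in it needs a separate record of where it came from.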
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is a package developed by RStudio called pins that might address this problem.