How can I read all hdf files in a folder using R?

I have thousands of hdf files in a folder. Is there a way to create a loop to read all of the hdf files in that folder and write some specific data to another file?
I read the first file in the folder using the code below:
mydata <- h5read("/path to file/name of the file.he5", "/HDFEOS/GRIDS/Northern Hemisphere/Data Fields/SWE_NorthernDaily")
But I have 1686 more files in the folder, and it is not possible to read them one by one. I think I need to write a for loop to read all the files in the folder.
I found some code that lists the txt files in a folder and then reads all of them:
nm <- list.files(path="path/to/file")
do.call(rbind, lapply(nm, function(x) read.table(file=x)[, 2]))
I tried to change the code as seen below:
nm <- list.files(path="path/to/file")
do.call(rbind, lapply(nm, function(x) h5read(file=x)[, 2]))
But the error message says:
Error in h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
Error in h5checktypeOrOpenLoc(). Cannot open file. File 'D:\path to file\name of the file.he5' does not exist.
What should I do in that situation?

If you are not bound to a specific technology, you may want to take a look at HDFql. Using HDFql in R, your issue can be solved as follows (for the sake of this example, assume that (1) dataset /HDFEOS/GRIDS/Northern Hemisphere/Data Fields/SWE_NorthernDaily exists in all the HDF5 files stored in the directory, (2) it has one dimension (size 1024), and (3) is of data type integer):
# load HDFql R wrapper (make sure it can be found by the R interpreter)
source("HDFql.R")
# create variable "values" and initialize it
values <- array(dim = c(1024))
for (x in 1:1024)
{
    values[x] <- as.integer(0)
}

# show (i.e. get) files stored in directory "/path/to/hdf5/files" and populate HDFql default cursor with it
hdfql_execute("SHOW FILE /path/to/hdf5/files")

# iterate HDFql default cursor
while (hdfql_cursor_next() == HDFQL_SUCCESS)
{
    file_name <- hdfql_cursor_get_char()

    # select (i.e. read) data from dataset "/HDFEOS/GRIDS/Northern Hemisphere/Data Fields/SWE_NorthernDaily" and populate variable "values" with it
    hdfql_execute(paste("SELECT FROM", file_name, "\"/HDFEOS/GRIDS/Northern Hemisphere/Data Fields/SWE_NorthernDaily\" INTO MEMORY", hdfql_variable_transient_register(values)))

    # display values stored in variable "values"
    for (x in 1:1024)
    {
        print(values[x])
    }
}
Additional examples on how to read datasets using HDFql can be found in the quick start guide and reference manual.
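For completeness: the error in the question most likely comes from list.files() returning file names relative to the folder rather than full paths, and h5read() also needs the dataset name as its second argument. A minimal sketch sticking with rhdf5 (not tested here; the dataset path is the one from the question):
library(rhdf5)
# full.names = TRUE returns paths that h5read() can actually open
nm <- list.files(path = "path/to/file", pattern = "\\.he5$", full.names = TRUE)
dataset <- "/HDFEOS/GRIDS/Northern Hemisphere/Data Fields/SWE_NorthernDaily"
swe_list <- lapply(nm, function(x) h5read(file = x, name = dataset))
# e.g. bind the second column of each file together, as in the question's attempt
swe_mat <- do.call(rbind, lapply(swe_list, function(m) m[, 2]))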

Related

Read many .sas7bdat files via a loop in R

I am a noob in R and experience a lot of trouble with the following:
I have to read in over 200 datasets and I want to do this automatically. I wrote some code that works perfectly for Rdata extensions, but when I try it for SAS files it always blocks...
path= "road"
# I make a list of all the different paths of all the files in my folder
File_pathnames <- list.files (path= Road, pattern = "*.sas7bdat", full.names=T)
# I create an empty list
list.data<-list()
# I try to run a loop to load all the SAS files:
for (i in 1: length(File_pathnames))
{
list.data[[i]] <- read_sas(File_pathnames[i])
}
Problem: it does not load the tables into my global environment (when I used the Rdata files I used the load function and all the data appeared in the global environment). How can I solve this?
Many thanks!
Actually, your data ARE in the global environment, as elements of list.data (check list.data[[1]], list.data[[2]], ...).
The issue you have is linked to the fact that load restores an object into the environment under the name it had when it was saved. As an example:
x <- 10
save(x, file = 'tmp')  # saves the object together with its name "x"
rm(x)
x                      # Error: object 'x' not found
load('tmp')
x                      # 10 again, restored under its original name
The example saves x and reloads it under its original name, while read_sas only returns the data, which you then have to assign to a variable yourself.
If you want each data set assigned to its own object, you have to define a name for each of them and assign the data. Your loop would look like:
for (i in 1:length(File_pathnames))
{
    namei <- paste0("name", i)
    data <- read_sas(File_pathnames[i])
    assign(namei, data)
}
and your data would be stored in "name1", "name2", ...
You could then assign each SAS file read from File_pathnames[i] to an object named FilenamesS[i]. Try:
for (i in 1:length(File_pathnames))
{
    data <- read_sas(File_pathnames[i])
    assign(FilenamesS[i], data)
}
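A named-list alternative that avoids assign() altogether; this is only a sketch, assuming read_sas comes from the haven package and that naming the elements after the files is acceptable:
library(haven)
# read every file into one named list; names are taken from the file names
list.data <- setNames(
    lapply(File_pathnames, read_sas),
    tools::file_path_sans_ext(basename(File_pathnames))
)
# only if you really want separate objects in the global environment:
list2env(list.data, envir = .GlobalEnv)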

parameter not passed to the function when using walk function in PURRR package

I am using purrr::walk to read multiple Excel files and it failed. I have 3 questions:
(1) I used the function list.files to read the Excel file list in one folder. But the returned values also included the subfolders. I tried setting the parameters recursive= and include.dirs=, but it didn't work.
setwd(file_path)
files <- as_tibble(list.files(file_path, recursive = F, include.dirs = F)) %>%
    filter(str_detect(value, ".xlsx"))
files
(2) When I used the following piece of code, it runs without any error or warning message, but no data is returned.
###read the excel data
file_read <- function(value1) {
    print(value1)
    file1 <- read_excel(value1, sheet = 1)
}
walk(files$value, file_read)
When I used the following, it worked. Not sure why.
test<-read_excel(files$value,sheet=1)
(3) In Q2, actually I want to create file1 to file6, suppose there are 6 excel files. How can I dynamically assign the dataset name?
list.files has a pattern argument where you can specify what kind of files you are looking for. This will help you avoid the filter(str_detect(value, ".xlsx")) step. Also, list.files only returns the files that are included in the main directory (file_path) and not its subdirectories unless you specify recursive = TRUE.
library(readxl)
setwd(file_path)
files <- list.files(pattern = '\\.xlsx')
In the function you need to return the object.
file_read <- function(value1) {
    data <- read_excel(value1, sheet = 1)
    return(data)
}
Now you can use map/lapply to read the files. Unlike walk, which is meant for side effects and invisibly returns its input rather than the function's results (which is why no data appeared to come back), map keeps whatever file_read returns.
result <- purrr::map(files,file_read)
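Regarding question (3): rather than creating file1 to file6 as separate objects, a common pattern is to name the list elements after the files. A sketch (the file1, file2, ... objects are only needed if you really want them in the global environment):
result <- purrr::map(files, file_read)
names(result) <- tools::file_path_sans_ext(basename(files))
# only if separate objects named file1, file2, ... are really required:
list2env(setNames(result, paste0("file", seq_along(result))), envir = .GlobalEnv)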

Why does "write.dat" (R) save data files within folders?

In order to conduct some analysis using a particular software, I am required to have separate ".dat" files for each participant, with each file named as the participant number, all saved in one directory.
I have tried to do this using the "write.dat" function in R (from the 'multiplex' package).
I have written a loop that outputs a ".dat" file for each participant in a dataset. I would like each file that is outputted to be named the participant number, and for them all to be stored in the same folder.
## Using write.dat
participants_ID <- unique(newdata$SJNB)
for (i in 1:length(participants_ID)) {
    data_list[[i]] <- newdata %>%
        filter(SJNB == participants_ID[i])
    write.dat(data_list[[i]], paste0("/Filepath/Directory/", participants_ID[i], ".dat"))
}
## Using write_csv this works perfectly:
participants_ID <- unique(newdata$SJNB)
for (i in 1:length(participants_ID)) {
    newdata %>%
        filter(SJNB == participants_ID[i]) %>%
        write_csv(paste0("/Filepath/Directory/", participants_ID[i], ".csv"), append = FALSE)
}
If I use the function "write_csv", this works perfectly (saving .csv files for each participant). However, if I use the function "write.dat", each participant file is saved inside a separate folder: the folder name is the participant number, and the file inside the folder is called "data_list[[i]]". In order to get all of the data_list files into the same directory, I then have to rename them, which is time-consuming.
I could theoretically output the files to .csv and then convert them to .dat, but I'm just intrigued to know if there's anything I could do differently to get the write.dat function to work the way I'm trying it :)
The documentation on write.dat is subminimal, but it would appear that you have confused a directory path with a file name. You have deliberately created a directory named "/Filepath/Directory/[participants_ID[i]].dat", and that's where each output file is placed. That you cannot assign a name to the .dat file itself appears to be a defect in the package as supplied.
However, not all is lost. Inside your loop, replace your write.dat line with the following lines, or something similar (not tested):
edit
It occurs to me that there's a smoother solution, albeit using the dreaded eval:
Again inside the loop, (assuming participants_ID[i] is a char string)
eval(parse(text = paste0(participants_ID[i], ' <- data_list[[i]]')))
eval(parse(text = paste0('write.dat(', participants_ID[i], ', "/Filepath/Directory/")')))
previous answer
write.dat(data_list[[i]], "/Filepath/Directory/")
thecommand <- paste0('mv "/Filepath/Directory/data_list[[i]]" "/Filepath/Directory/', participants_ID[i], '.dat"')
system(thecommand)
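A platform-independent variant of the rename step, assuming (as the previous answer does) that write.dat(data_list[[i]], "/Filepath/Directory/") drops a file literally named data_list[[i]] into that directory; file.rename() avoids shelling out to mv (not tested):
library(dplyr)     # for %>% and filter()
library(multiplex) # for write.dat()
data_list <- list()
for (i in 1:length(participants_ID)) {
    data_list[[i]] <- newdata %>% filter(SJNB == participants_ID[i])
    write.dat(data_list[[i]], "/Filepath/Directory/")
    # write.dat names its output after the deparsed first argument,
    # so rename that file to <participant number>.dat
    file.rename(from = "/Filepath/Directory/data_list[[i]]",
                to   = paste0("/Filepath/Directory/", participants_ID[i], ".dat"))
}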

Importing files with almost similar path and name

I have many txt files that I want to import into R. I import these files one by one, do the operations that I want, and then import the next file.
All these files are located in a database system where all the folders have almost the same names, e.g.
database\type4\system50
database\type6\system50
database\type4\system30
database\type4\system50
Similarly, the names of the files are also almost the same, referring to the folder where they are positioned, e.g.
type4.system50.txt
type6.system50.txt
type4.system30.txt
type4.system50.txt
I have heard that there should be an easier way of importing all these files than simply repeated setwd and read.csv2 commands. As far as I understand this is possible with the macro import function in SAS, where you specify an overall path and then, each time you want to import a file, you specify only what is specific about that file name/folder name.
Is there a similar function in R? I tried to look at Importing Data in R like SAS macro, but that question did not really show me how to specify the folder name/file name.
Thank you for your help.
If you want to specify folder name / file name, try this
databasepath="path/to/database"
## list all files
list.files(getwd(),recursive = T,full.names = T,include.dirs = T) -> tmp
## filter files you want to read
readmyfile <- function(foldername,filename){
tmp[which(grepl(foldername,tmp) & grepl(filename,tmp))]
}
files_to_read <- readmyfile("type4", "system50")
some_files <- lapply(files_to_read, read.csv2)
## Or you can read all of them (if memory is large enough to hold them)
all_files <- lapply(tmp,read.csv2)
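If you would rather build each path from its varying pieces, closer to the SAS macro style described in the question, something like this sketch may work (directory and file naming follow the question's examples; read.csv2 as in the question):
database_path <- "path/to/database"
# build the full path from the two varying pieces, e.g. "type4" and "system50"
read_one <- function(type, system) {
    file <- file.path(database_path, type, system, paste0(type, ".", system, ".txt"))
    read.csv2(file)
}
dat_type4_system50 <- read_one("type4", "system50")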
Instead of using setwd continuously, you could specify the absolute path for each file, save all of the paths to a vector, loop through the vector of paths, and load the files into a list:
library(data.table)
file_dir <- "path/to/files/"
file_vec <- list.files(path = file_dir, pattern = "\\.txt$")
file_list <- list()
for (n in seq_along(file_vec)){
    file_list[[n]] <- fread(input = paste0(file_dir, file_vec[n]))
}

To stack up results in one masterfile in R

Using this script I have created a specific folder for each csv file and then saved all my further analysis results in this folder. The folder and the csv file have the same name. The csv files are stored in the main/master directory.
Now, I have created a csv file in each of these folders which contains a list of all the fitted values.
I would now like to do the following:
Set the working directory to the particular filename
Read fitted values file
Add a row/column stating the name of the site/ unique ID
Add it to the masterfile, which is stored in the main directory, with a title specifying the site name/filename. It can be stacked by rows or by columns; it doesn't really matter.
Come to the main directory to pick the next file
Repeat the loop
Using merge(), rbind(), or cbind() combines all the data under one set of column names. I want to keep all the sites separate for comparison at a later stage.
This is what I'm using at the moment and I'm lost on how to proceed further.
setwd( "path") # main directory
path <-"path" # need this for convenience while switching back to main directory
# import all files and create a character type array
files <- list.files(path=path, pattern="*.csv")
for(i in seq(1, length(files), by = 1)){
fileName <- read.csv(files[i]) # repeat to set the required working directory
base <- strsplit(files[i], ".csv")[[1]] # getting the filename
setwd(file.path(path, base)) # setting the working directory to the same filename
master <- read.csv(paste(base,"_fiited_values curve.csv"))
# read the fitted value csv file for the site and store it in a list
}
I want to construct a for loop to make one master file with the files in different directories. I do not want to merge all under one column name.
For example, if I have 50 similar csv files and each has two columns of data, I would like to have one csv file which accommodates all of them, but in their original format rather than appended to the existing rows/columns. So then I will have 100 columns of data.
Please tell me what further information can I provide?
For reading a group of files from a number of different directories, with pathnames patha, pathb, and pathc:
paths = c('patha','pathb','pathc')
files = unlist(sapply(paths, function(path) list.files(path, pattern = "\\.csv$", full.names = TRUE)))
listContainingAllFiles = lapply(files, read.csv)
If you want to be really quick about it, you can grab fread from data.table:
library(data.table)
listContainingAllFiles = lapply(files, fread)
Either way this will give you a list of all objects, kept separate. If you want to join them together vertically/horizontally, then:
do.call(rbind, listContainingAllFiles)
do.call(cbind, listContainingAllFiles)
EDIT: Note that the latter only makes sense if your rows actually correspond across files. It usually makes far more sense to just create a field tracking which location the data is from.
If you want to include the names of the files as the method of determining sample location (I don't see where you're getting this info from in your example), then you want to do this as you read in the files, so:
listContainingAllFiles = lapply(files,
function(file) data.frame(filename = file,
read.csv(file)))
Then later you can split that column to get your details (assuming, of course, you have a standard naming convention).
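To get the side-by-side master file described in the question (50 files of two columns each, giving 100 columns), a sketch along these lines may help; it assumes every file has the same number of rows and that path points at the main directory:
files <- list.files(path = path, pattern = "\\.csv$", full.names = TRUE)
all_data <- lapply(files, read.csv)
# prefix each file's column names with the site/file name so the columns stay distinguishable
site_names <- tools::file_path_sans_ext(basename(files))
for (k in seq_along(all_data)) {
    colnames(all_data[[k]]) <- paste(site_names[k], colnames(all_data[[k]]), sep = "_")
}
master <- do.call(cbind, all_data) # requires every file to have the same number of rows
write.csv(master, file.path(path, "masterfile.csv"), row.names = FALSE)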
