Can we load .txt files to vaex? - bigdata

I have a folder of .txt files totalling 52.6 GB. The .txt files are located in various subfolders; each subfolder has a unique label ("F", "G", etc.) and contains many .txt files. I need to combine all the .txt files of each unique label ("F", "G") into one single file. I tried to use vaex, but I could not find a way to do this for .txt files. Can anyone please help me out?

Provided the text files contain CSV-formatted data with the same structure across files, you could use:
df = vaex.open_many([fpath1, fpath2, ..., fpathX])
To collect all the filenames and their paths, you can use pathlib to recursively glob the file paths:
import vaex
from pathlib import Path

txt_files = Path('your_label_folder_path').rglob('*.txt')
# rglob returns a generator and vaex.open_many expects a list,
# so materialise it; while we're at it, resolve to absolute paths as strings
txt_files = [str(txt.resolve()) for txt in txt_files]
df = vaex.open_many(txt_files)

Related

Importing and converting multiple files based on criteria in R

Thank you guys in advance for your help. I am faced with the task of importing multiple Excel files into R. Such files may have a .csv or an .xlsx extension, and they all share the same name format: the word "bookings" followed by a date in the format YYYY_MM_DD, like this:
bookings_2016_05_23
bookings_2016_12_06
After loading these files I have to write them as .xlsx files to a specific directory, but I have to replace their names with the last two characters of the original file name. To illustrate: if I load a file named bookings_2016_05_23, I have to read it properly whether it is .xlsx or .csv and then save (write) it as 23.xlsx, and do that for every file. I am trying to accomplish this with the following code:
library(readxl)  # excel_sheets(), read_excel()
library(purrr)   # set_names(), map_df()
library(dplyr)   # mutate(), %>%

path <- fs::dir_ls(choose.dir())
read_all_files <- function(path) {
  path %>%
    excel_sheets() %>%
    set_names() %>%
    map_df(~ read_excel(.x, path = path, col_types = "text") %>%
             mutate(file = path, sheet = .x))
}
This allows me to import all files and their respective sheets, but it only works if all the files are .xlsx. To change the file name I am trying to use a regex within set_names(), with no luck. I would also be thankful if you could teach me (or point me to a reference on) how to save all the files as separate .xlsx files in a working directory using a for loop. Thank you guys so much, I truly do.
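One way to sketch this, assuming all the bookings files sit in a single folder: read each file with read.csv() or readxl::read_excel() depending on its extension, take the last two characters of the base name (the day), and write it out with writexl::write_xlsx(). The folder paths below are placeholders, and writexl is just one possible xlsx writer:
library(readxl)   # read_excel() for .xlsx files
library(writexl)  # write_xlsx() to save each data frame as .xlsx

in_dir  <- "path/to/bookings"  # placeholder: folder holding the bookings_* files
out_dir <- "path/to/output"    # placeholder: where the renamed .xlsx files go

files <- list.files(in_dir, pattern = "^bookings_\\d{4}_\\d{2}_\\d{2}\\.(csv|xlsx)$",
                    full.names = TRUE)

for (f in files) {
  # read according to the extension
  dat <- if (tolower(tools::file_ext(f)) == "csv") {
    read.csv(f, stringsAsFactors = FALSE)
  } else {
    read_excel(f)
  }
  # last two characters of the name without its extension, e.g. "23"
  stem     <- tools::file_path_sans_ext(basename(f))
  new_name <- substr(stem, nchar(stem) - 1, nchar(stem))
  write_xlsx(dat, file.path(out_dir, paste0(new_name, ".xlsx")))
}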

How to read an Excel file from a folder without specifying the filename?

Is there a way to read an Excel file into R directly from a folder without having to specify the filename?
I'm using the readxl library and I have only 1 file in the folder, but the file's name changes monthly.
Thanks!
The computer needs the path anyway, BUT you can get it without supplying the filename if you are absolutely sure that it is the only file in your folder.
See https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.files.html
to learn more about listing a directory and getting the filenames inside it.
If it isn't the only file but it is the only Excel file, you will have to get the extension of each file and do some parsing to decide which file you want to open.
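A short sketch of that idea, checking each file's extension with tools::file_ext(); the folder path is a placeholder and it assumes exactly one .xlsx file is present:
folder <- "path/to/monthly/folder"  # placeholder path
files  <- list.files(folder, full.names = TRUE)
# keep the file whose extension is xlsx
xlsx_file <- files[tolower(tools::file_ext(files)) == "xlsx"]

library(readxl)
df <- read_excel(xlsx_file)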
As noted in other answers, this can be solved using the dir function.
Assuming the variable path contains the path of the directory in which the XLSX file is located, then the following will give you the full path to the file:
file_path = dir(path, pattern = '\\.xlsx$', full.names = TRUE)
Note that pattern uses regular expressions rather than glob format! Using the glob pattern *.xlsx might appear to work but it’s incorrect, only works by accident, and will also match other file names.
Suppose your file is located inside a folder called myFolder on drive E:\\
library(readxl)
folderPath <- "E:/myFolder"
fileName <- grep("\\.xlsx$", dir(folderPath), value = TRUE)
filePath <- file.path(folderPath, fileName)
df <- read_xlsx(filePath)
This code will get the name of your xlsx file inside folderPath each time you run it and then you can import it with readxl::read_xlsx. I assume there is only one xlsx file inside the folder.

Is there a way to load csv files saved in different folders with only a partial file name in R

I am trying to load multiple csv files that are each saved in different folders within my working directory in R. However, I only know part of each file name.
For example the file in "folder1" will be named "xxx_xxx_folder1.csv", and the file in "folder2" is "xxx_xxx_folder2.csv" etc. There is only one csv in each folder.
I was wondering: is there a way to load files saved in different folders with only a partial file name?
The only way I have got it to partially work so far is to have all the files in one folder.
Thanks and sorry if any of this is unclear!
From your description you could use list.files with option recursive=TRUE to get a list of your csv files. You could then loop over the list to read your files:
fn <- list.files(PATH_TO_WORKING_DIRECTORY, "\\.csv$", recursive = TRUE, full.names = TRUE)
lapply(fn, read.csv)

Importing a batch of CSV files in R with unique file paths

I have successfully read batches of files with lapply once I moved all the files I want into one specific place, but that's a short-term fix. I'm trying to read a batch of CSV files (all with the same name, seed.csv) that are located at unique file paths. Some aspects of the path are uniform and some are not. The file path structure is as follows:
resting/8000/8102/2000-09-26/rsfMRI_26-b/ROI/name/seed.csv
resting is uniform (all paths have it), and 8000 represents a folder that has subfolders 8000-8999. I'm interested in reading the files under 8102 that fall within three months of a specified date (listed in an Excel sheet elsewhere). I want to read any folder that starts with rsfMRI_ (the 26-b part varies), and the rest of the path is uniform: ROI/name/seed.csv.
If the top-level directory is consistent, just use recursive = TRUE for list.files:
library(data.table)
DT <- rbindlist(lapply(list.files(
  "resting/", pattern = "seed\\.csv$", full.names = TRUE, recursive = TRUE), fread))
If this is reading in too many files, you'll have to use list.dirs and trim the output from that.
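If it does pull in too many files, here is a rough sketch of that trimming, assuming you only want the rsfMRI_* session folders under one subject (8102 here is just an example); filtering the session dates against your Excel sheet would be an extra step on the date part of the path:
library(data.table)

# all directories under resting/, searched recursively
dirs <- list.dirs("resting", recursive = TRUE)

# keep only the rsfMRI_* session folders under subject 8102 (subject/date/session)
session_dirs <- grep("/8102/[^/]+/rsfMRI_[^/]+$", dirs, value = TRUE)

# read every seed.csv below those session folders and stack them
seed_files <- unlist(lapply(session_dirs, list.files,
                            pattern = "^seed\\.csv$", recursive = TRUE, full.names = TRUE))
DT <- rbindlist(lapply(seed_files, fread))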

Merge all files using R in a directory while removing the headers and more

I need to merge multiple .csv files together in RStudio while removing the header row from each file except the first. All the files have the same number of columns and I just need to merge all the rows from each file.
However, this is the complicated part, or what I think is: the way this data is produced, each file is in its own folder. So if I have 100 files, then I have 100 individual folders, and inside each folder is one file. The folders are named by day and the file is named by day as well; the only part of the file name that changes is the date. So, for example, I'll have a folder named "20160420" with the file inside named "20160420_file". The next file would be named "20160419" with the file inside named "20160419_file", and so on. Each file has a header row, and below it is a day's worth of data, one row every minute.
The machines archive data every day. We have over 100 machines, and each machine has been producing these files for the past 8 years. So you can imagine how many files there are and just how long it would take if I did this manually.
How would I write the R code to combine all these files into one file and remove the duplicate header rows? Any ideas or help would be greatly appreciated.
You can use list.files() or dir() with argument full.names = TRUE and recursive = TRUE to get a vector of file names with paths from across multiple directories.
files <- dir(path = "c:/", pattern = "\\.csv$", full.names = TRUE, recursive = TRUE)
Then you can use a loop of some sort to process the files, for example
library(plyr)
allData <- ldply(files, read.csv)
