Reading files from folder and storing them to dataframe in R

My goal eventually is to build a classifier -- something like a spam detector.
I am not sure, however, how to read the text files that contain the text that will feed the classifier and store them in a dataframe.
So suppose that I have assembled text files in a folder -- raw text created in Notepad and saved as .txt files -- whose names are indicative of their content, e.g. xx_xx_xx__yyyyyyyyyyy_zzzzz, where xx are numbers representing a date, yyyyyyyyyyy is a character string representing a theme, and zzzzz is a character string representing a source. The theme and source parts are of variable length.
My objective would be to create a function that would loop through the files, read them, store the information contained in their names in separate columns of a dataframe --e.g. "Date", "Theme", "Source" -- and the text content in a fourth column (e.g. "Content").
Any ideas how this could be achieved?
Your advice will be appreciated.

Hi, here is a possible answer. I'm storing results in a list instead of a data frame, but you can convert from one to the other by using do.call(rbind.data.frame, result)
require(stringr)
datawd <- "C:/my/path/to/folder/" # your data directory
listoffiles <- list.files(datawd) # list of files
listoffiles <- listoffiles[grep("\\.txt$", listoffiles)] # keep only .txt files
my_paths <- str_c(datawd, listoffiles) # vector of full paths
# the following works on Windows only
progress <- winProgressBar(title = "loading text files",
                           label = "progression %",
                           min = 0,
                           max = length(my_paths),
                           initial = 0,
                           width = 400)
result <- list() # initialise once, outside the loop, so it is not reset at each iteration
#000000000000000000000000000000000000000 loop
for (i in 1:length(my_paths)) {
  setWinProgressBar(progress, i, label = listoffiles[i])
  the_date <- sapply(strsplit(listoffiles[i], "_"), "[[", 1)
  the_theme <- sapply(strsplit(listoffiles[i], "_"), "[[", 2)
  the_source <- sapply(strsplit(listoffiles[i], "_"), "[[", 3)
  # open connection for reading
  con <- file(my_paths[i], open = "r")
  # readLines returns one element per line; here I'm collapsing them all
  # into a single string, you will do what you need....
  the_text <- str_c(readLines(con, warn = FALSE), collapse = " ")
  close(con) # close the connection
  result[[i]] <- list()
  result[[i]]["date"] <- the_date
  result[[i]]["source"] <- the_source
  result[[i]]["theme"] <- the_theme
  result[[i]]["text"] <- the_text
}
close(progress)
#000000000000000000000000000000000000000 end loop
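To get the data frame mentioned at the start of this answer, the finished list can then be collapsed in one call:
# one row per file, with columns date, source, theme and text
df <- do.call(rbind.data.frame, result)
str(df)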

Related

Reading .xlsx file names from cells and outputting specific cell as mutated column

I have an excel sheet with input data for an experiment I ran, and I want to get that input data alongside the results of the experiment.
Each row in the excel sheet contains all the input data for one unique test in the experiment. In each of these rows, I'd like to display additional cells that show some of the results from the experiment's output files. Each test has its own unique output file, and the name of each of these files (e.g. "Output1.xlsx") is contained in a column alongside the input data. So each row in the input file contains all the input data for a test as well as the file name of the output file for that test.
I'd like to run a code that reads the file names from the "Filename" column in the input file, finds the files in the working directory, accesses a value from a specific cell in each of the files, and creates a mutated vector containing the values from the cells referenced in these files.
So far my code looks like this:
library(tidyverse)
library(readxl)

# Import testing inputs from testing matrix sheet
testing_inputs <- read_xlsx("Testing Setup Info.xlsx",
                            sheet = 2,
                            range = NULL,
                            col_names = TRUE,
                            col_types = NULL,
                            na = "",
                            trim_ws = TRUE,
                            skip = 0,
                            progress = readxl_progress(),
                            .name_repair = "unique")

# Isolate results files from testing inputs sheet
testing_results <- testing_inputs %>%
  select("Filename")

# Create df with columns for test inputs and results. Inputs from testing_inputs.
# Results from individual cells read from the files in the testing_results df.
analysis_df <- testing_inputs %>%
  select("Test ID", "Test Order", "Box Size (in.)",
         "Coil Rows (#)", "HWST (degF)", "Damper Position",
         "Insulation Level", "HW Flow # HWST (GPM)", "heating SA flow (cfm)",
         "H.T. - from Price (MBH)", "Filename") %>%
  mutate(FS_Cap = lapply(testing_results, read_xlsx(path = ".", sheet = 2, range = "Q27",
                                                    col_names = FALSE,
                                                    col_types = NULL)))
The error message that results from this:
"Error in mutate_cols():
! Problem with mutate() column FS_Cap.
i FS_Cap = lapply(...).
x zip file 'C:\Users\pwend\Box\BSG\Projects\HVAC\Integrated\PIR-19-013 Hot Water\Laboratory testing\Test Results\Combined' cannot be opened
Caused by error in utils::unzip():
! zip file 'C:\Users\pwend\Box\BSG\Projects\HVAC\Integrated\PIR-19-013 Hot Water\Laboratory testing\Test Results\Combined' cannot be opened"
There is no zip file in the working directory, so I'm not sure what to do about this.
For reference, every file I'm using in this script is contained in the working directory, but I wouldn't be able to use the list.files function, since some of the files it would pick up are ones I don't need. Using list.files would also make it difficult to match the values returned from the files to the rows they correspond to in the inputs file.
Does anyone know how I could achieve the output I'm looking for?
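One direction that might work (untested; the sheet and range are taken from the question): the error suggests that read_xlsx(path = ".", ...) was evaluated once, on the working directory itself, before lapply ever ran, because the second argument of lapply must be a function rather than a finished call. Looping over the Filename column one path at a time avoids this; the helper name below is made up:
library(readxl)
library(dplyr)

# hypothetical helper: read the single cell Q27 from sheet 2 of one output file
read_result_cell <- function(f) {
  cell <- read_xlsx(path = f, sheet = 2, range = "Q27", col_names = FALSE)
  if (nrow(cell) == 0) NA else cell[[1]][1]
}

analysis_df <- testing_inputs %>%
  mutate(FS_Cap = sapply(Filename, read_result_cell))
Because mutate works over the existing Filename column, each value lands on the row it came from, which also addresses the matching concern above.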

Loop over a large number of CSV files with the same statements in R?

I'm having a lot of trouble reading/writing to CSV files. Say I have over 300 CSV's in a folder, each being a matrix of values.
If I wanted to find out a characteristic of each individual CSV file, such as which rows had an exact number of 3's, and write the result to another CSV file for each test, how would I go about iterating this over 300 different CSV files?
For example, say I have this code I am running for each file:
values_04 <- read.csv(file = 'values_04.csv', header = FALSE) # read CSV in as its own DF
values_04$howMany3s <- apply(values_04, 1, function(x) length(which(x == 3))) # compute number of 3's
values_04$exactly4 <- apply(values_04[50], 1, function(x) length(which(x == 4))) # show 1/0 where there are exactly four 3's
values_04 # print new matrix
I am then copying and pasting this code over and over, changing the "4" to a 5, 6, etc., and noting the values. This seems wildly inefficient to me, but I'm not experienced enough with R to know exactly what my options are. Should I look at adding all 300 CSV files to a single list and somehow looping through them?
Appreciate any help!
Here's one way you can read all the files and process them. Untested code, as you haven't given us anything to work on.
# Get a list of CSV files. Use the path argument to point to a folder
# other than the current working directory
files <- list.files(pattern=".+\\.csv")
# For each file, work your magic
# lapply runs the function defined in the second argument on each
# value of the first argument
everything <- lapply(
  files,
  function(f) {
    values <- read.csv(f, header = FALSE)
    apply(values, 1, function(x) length(which(x == 3)))
  }
)
# And returns the results in a list. Each element consists of
# the results from one function call.
# Make sure you can access the elements of the list by filename
names(everything) <- files
# The return value is a list. Access all of it with
everything
# Or a single element with
everything[["values_04.csv"]]
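If, as described in the question, you also want the result written to another CSV for each test, a small follow-up sketch (the "_counts" output names are invented here):
# write one result file per input CSV, next to the originals
for (f in files) {
  out <- data.frame(howMany3s = everything[[f]])
  write.csv(out, file = sub("\\.csv$", "_counts.csv", f), row.names = FALSE)
}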

creating date variable from file names in R

I need some help creating a dataset in R where each observation contains a latitude, longitude, and date. Right now, I have a list of roughly 2,000 files gridded by lat/long, and each file contains observations for one date. Ultimately, what I need to do, is combine all of these files into one file where each observation contains a date variable that is pulled from the name of its file.
So for instance, a file is named "MERRA2_400.tavg1_2d_flx_Nx.20120217.SUB.nc". I want all observations from that file to contain a date variable for 02/17/2012.
That "nc" extension describes a netCDF file, which can be read into R as follows:
library(RNetCDF)
setwd("~/Desktop/Thesis Data")
p1a <- "MERRA2_300.tavg1_2d_flx_Nx.20050101.SUB.nc"
pid <- open.nc(p1a)
dat <- read.nc(pid)
I know the ldply command can be useful for extracting and designating a new variable from the file name. But I need to create a loop that combines all the files in the 'Thesis Data' folder above (set as my wd), and gives them date variables in the process.
I have been attempting this using two separate loops. The first loop uploads files one by one, creates a date variable from the file name, and then resaves them into a new folder. The second loop concatenates all files in that new folder. I have had little luck with this strategy.
View(dat)
As you can hopefully see in the picture describing the data file loaded above, each file contains a time variable, but that variable holds only one observation, 690, in each file. So I could replace that variable with the date from the file name, or I could create a new variable - either works.
Any help would be much appreciated!
I do not have any experience working with .nc files, but what I think you need to do, in broad strokes, is this:
filenames <- list.files(path = ".") # Creates a character vector of all file names in working directory
Creating empty dataframe with column names:
final_data <- data.frame(matrix(ncol = ..., nrow = 0)) # enter number of columns you will have in the final dataset
colnames(final_data) <- c("...", "...", "...", ...) # create column names
For each filename, read in the file, create the date column, and append the result to final_data:
for (i in filenames) {
  pid <- open.nc(i)
  dat <- read.nc(pid)
  date <- ... # use regex to get your date from i and convert it into a date
  dat$date <- date
  final_data <- rbind(final_data, dat)
}
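The date <- ... placeholder could be filled in along these lines, assuming every file follows the MERRA-2 naming pattern shown in the question:
# pull the 8-digit date stamp out of the file name and convert it
i <- "MERRA2_400.tavg1_2d_flx_Nx.20120217.SUB.nc"
stamp <- sub(".*\\.([0-9]{8})\\.SUB\\.nc$", "\\1", i)
date <- as.Date(stamp, format = "%Y%m%d")
date # "2012-02-17"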

Counting the times a word in a row appears in subsequent rows in r

I'm attempting to develop a specific string count script, and there's one step I can't seem to solve. I have several files (tab-delimited tables); each file contains a data frame with over 1,000 strings, one per row. I'm trying to count the number of times a particular string in a row appears in another row as part of that row's string.

Here's what I have so far. It yields a list of each file name and the number of times a string appears in a row, either by itself or inside another string. I'm able to develop the concept, but right now I have to search each string manually, which is impractical when dealing with thousands of strings of different lengths. As you can see, the script iterates over each file in the folder. The result should only list those strings that appear in other rows, and the number of times each does, per file. Also, the files don't necessarily have the same list of strings, so each file should be checked separately.
Here’s a simple example of the data frame.
north.txt
1. abcd
2. bdcd
3. tabcdt
4. bdcad
I've been able to get the script to check for each word, but I have to input the word manually.
library(stringr)
library(tidyverse)

# Read all .txt files in folder.
files <- list.files(path = "/Data/Processed_data/docs_by_name", pattern = ".txt")

### Action on each file
# Select the column with the sequences-clones
for (i in files) {
  print(i)
  data <- read.table(file = paste0("/Data/Processed_data/samples_by_name/", i), sep = '\t', header = TRUE)
  # Compare selected string with strings of other rows and count matches
  # Select file
  for (t in unique(data)) {
    word <- deframe(data)
    number.word <- str_count(word, "abcd") # the string still has to be typed in by hand
    repeats <- sum(number.word) - 1
    print(repeats)
  }
}
Here’s an example of what I’m hoping to get.
north.txt
abcd
2
bdca
1
south.txt
abcd…
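One way to drop the manual search (untested; it assumes the strings sit in the first column and keeps the sum(...) - 1 convention from the script above) is to loop over the strings themselves instead of hard-coding "abcd":
library(stringr)
# for every string in the file, count its occurrences across all rows,
# minus the one match in its own row
words <- data[[1]]
counts <- sapply(words, function(w) sum(str_count(words, fixed(w))) - 1)
counts[counts > 0] # keep only the strings that also appear in other rows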

how to write binary data to csv file in R

I am trying to write binary data to a csv file so that I can later read the file back with 'read.csv2', 'read.table' or 'fread' to get a dataframe. The script is as follows:
library(iotools)
library(data.table)

# make a dataframe
n <- data.frame(x = 1:100000, y = rnorm(1:100000), z = rnorm(1:100000), w = c("1dfsfsfsf"))
# file name variable
file_output <- "test.csv"
# check whether the file already exists -> if true -> remove it
if (file.exists(file_output)) file.remove(file_output)
# make a file connection object in binary write mode
zz <- file(file_output, "wb")
# make a binary vector with column names
rnames <- as.output(rbind(colnames(n), ""), sep = ";", nsep = "\t")
# make a binary vector with the dataframe
r <- as.output(n, sep = ";", nsep = "\t")
# write column names to the file
writeBin(rnames, zz)
# write data to the file
writeBin(r, zz)
# close the file connection
close(zz)

# test readings
check <- read.table(file_output, header = TRUE, sep = ";", dec = ".",
                    stringsAsFactors = FALSE, blank.lines.skip = TRUE)
str(check)
class(check)
check <- fread(file_output, dec = ".", data.table = FALSE, stringsAsFactors = FALSE)
str(check)
class(check)
check <- read.csv2(file_output, dec = ".")
str(check)
class(check)
The output from the file is attached:
My questions are:
1. How can I remove the blank line from the file without reading it back into R? The column names were deliberately rbind-ed with an empty row so that they would be written as a single line; otherwise they came out as a one-column vector. Maybe it is possible to remove the blank line before 'writeBin()'?
2. How can I make the file so that all numeric values are written as numeric rather than as character?
I use the binary data transfer on purpose because it is much faster than 'write.csv2'. For instance, if you run
system.time(write.table.raw(n, "test.csv", sep = ";", col.names = TRUE))
the elapsed time is roughly 4 times less than with 'write.table'.
I could not comment on your question because of my reputation, but I hope this helps you.
Two things come to mind:
Use the fill argument of read.table: if TRUE, then when the rows have unequal length, blank fields are implicitly added (see ?read.table).
You have already mentioned blank.lines.skip = TRUE: if TRUE, blank lines in the input are ignored.
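On the write side, the blank line can probably be avoided altogether by emitting the header as its own raw vector instead of rbind-ing it with an empty row; since the ";;;" row would then no longer be read back in as character data, this may also cure the numeric columns arriving as character. An untested sketch using the same objects as above:
# write the header as one raw line, then the data, with no dummy blank row
zz <- file(file_output, "wb")
writeBin(charToRaw(paste0(paste(colnames(n), collapse = ";"), "\n")), zz)
writeBin(as.output(n, sep = ";", nsep = "\t"), zz)
close(zz)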
