Reading multiple txt files from a directory - julia

I am very new to Julia and I have a question regarding reading some files. I need to read 12500 .txt files from the same directory and save them all into 1 array but I'm having performance issues. Is there a fast way of doing this? My code takes around like 60 seconds which is way more than I can afford. Here is what I have:
function load_train(directory)
data = []
dir = joinpath("./aclImdb/train/",directory)
for f in readdir(dir)
s = read(joinpath(dir,f),String)
push!(data,s)
end
data
end
trainPos = load_train("pos/")

Related

Moving large amounts of files from one large folder to several smaller folders using R

I have over 7,000 .wav files in one folder which need to be split up into groups of 12 and placed into separate smaller folders.
The files correspond to 1-minute recordings taken every 5 minutes, so every 12 files corresponds to 1 hour.
The files are stored on my PC in the working directory: "E:/Audiomoth Files/Winter/Rural/Emma/"
Examples of the file names are as follows:
20210111_000000.wav
20210111_000500.wav
20210111_001000.wav
20210111_001500.wav
20210111_002000.wav
20210111_002500.wav
20210111_003000.wav
20210111_003500.wav
20210111_004000.wav
20210111_004500.wav
20210111_005000.wav
20210111_005500.wav
which would be one hour, then
20210111_010000.wav
20210111_010500.wav
20210111_011000.wav
and so on.
I need the files split into groups of 12 and then I need a new folder to be created in: "E:/Audiomoth Files/Winter/Rural/Emma/Organised Files"
With the new folders named 'Hour 1', 'Hour 2' and so on.
What is the exact code I need to do this?
As is probably very obvious I'm a complete beginner with R so if the answer could be spelt out in layman's terms that would be brilliant.
Thank you in advance
Something like this?
I intentionally used copy instead of cut in order to prevent data from being lost. I edited the answer so the files will keep their old names. I order to give them new names, replace name in the last line by "Part_", i, ".wav", for example.
# get a list of the paths to all the files
old_files <- list.files("E:/Audiomoth Files/Winter/Rural/Emma/", pattern = "\\.wav$", full.names = TRUE)
# create new directory
dir.create("E:/Audiomoth Files/Winter/Rural/Emma/Organised Files")
# start a loop, repeat as often as there are groups of 12 within the list of files
for(i in 1:(round(length(old_files)/12)+1)){
# create a directory for the hour
directory <- paste("E:/Audiomoth Files/Winter/Rural/Emma/Organised Files", "/Hour_", i, sep = "")
dir.create(directory)
# select the files that are to copy (I guess it will start with 1*12-11 = 1st file
# and end with i*12 = 12th file)
filesToCopy <- old_files[(i*12-11):(i*12)]
# for those files run another loop:
for(file in 1:12){
# get the name of the file
name <- basename(filesToCopy[file])
# copy the file to the current directory
file.copy(filesToCopy[file], paste(directory, "/", name, sep = ""))
}
}
When you're not entirely sure, I'd recommend to copy the files instead of moving them directly (which is what I hope this script here does). You can delete them manually, later on. After you checked that everything worked well and all data is where it should be. Otherwise data can be lost due to even small errors, which we do not want to happen.

How to delete a row in a csv file with powershell in R?

Good morning,
I'm new about powershell and I'd like to ask you if somebody can help me.
I have a big csv file around 3.5gb and my goal is to load it with fread (a data.table function) in R environment, but this function makes a error.
> n_a<-fread("C:/x/xy/xyz/name_file.csv",sep=";", fill = TRUE)
The error is:
Warning message:
In fread("C:/x/xy/xyz/name_file.csv") :
Stopped early on line 458945. Expected 29 fields but found 30. Consider fill=TRUE and comment.char=. First discarded non-empty line
I tried to use different way (I putted in my code fill=true, but doesn't work) to solve the problem, but I couldn't do it.
After different researches I found this kind of solution (always to do in R):
>system("powershell Get-Content C:/a/b/c/file.csv | Select -Index (0..458944 + 1000000) > output.csv")
The focus about the use of powershell in R is to delete a specific row and to load with fread the file.
My question is:
How I can delete a specific row in a csv in powershell but without specifying the length of the matrix?
Thank you in advance for every type of help.
Francesco
I'd hazard a guess that the invalid row's location is not known. In such a case, it might be sensible to read the original file and create a new file that contains only valid data. What's more, if the source data would benefit of manipulation, it can be done before reading it into R.
A file as large as 3,5 GiB is a bit on the large side to read in memory as such. Sure, it can be done in the days of 64 bit systems, but for simple row processing it's unwieldy. A scalable solution uses .Net methods and row-by-row approach.
To process a file on row-by-row basis, use .Net methods for efficient row reading. A StringBuilder is created to store rows that contain valid data, others are discarded. The StringBuilder is flushed on disk every so often. Even on days of SSDs, a write operation for each row is relatively slow in respect to writing in a bulk of, say, 10 000 rows a time.
$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText("MyCsvFile.csv")
$i = 0
$MaxRows = 10000
$colonCount = 30
while($null -ne ($line = $reader.ReadLine())) {
# Split the line on semicolons
$elements = $line -split ';'
# If there were $colonCount elements, add those to builder
if($elements.count -eq $colonCount) {
# If $line's contents need modifications, do it here
# before adding it into the builder
[void]$sb.AppendLine($line)
++$i
}
# Write builder contents into file every now and then
if($i -ge $MaxRows) {
add-content "MyCleanCsvFile.csv" $sb.ToString()
[void]$sb.Clear()
$i = 0
}
}
# Flush the builder after the loop if there's data
if($sb.Length -gt 0) {
add-content "MyCleanCsvFile.csv" $sb.ToString()
}
This is easy done in powershell: Read csv in generic list, remove line and write back:
Add-Type -AssemblyName System.Collections
[System.Collections.Generic.List[string]]$csvList = #()
$csvFile = 'C:\test\myfile.csv'
$csvList = [System.IO.File]::ReadLines( $csvFile )
$lineToDelete = 2
[void]$csvList.RemoveAt( $lineToDelete - 1 )
[System.IO.File]::WriteAllLines( $csvFile, $csvList ) | Out-Null
vonPryz's helpful answer offers the best solution, given the size of your input file.
The following works too, but will be slow - in general, due to the overhead of using a pipeline, but also because Get-Content itself is slow due to decorating each line read with additional properties (see green-lighted, but not yet implemented GitHub suggestion #7537):
# Exclude line number 458945 (0-based index 458944)
Get-Content C:/a/b/c/file.csv | Select-Object -SkipIndex 458944 > output.csv
The beneficial flip side of use of the pipeline is that it acts as a memory throttle, so the above command can be used to process arbitrarily large files (though it may take a long time).

saveWorkbook Execution time

I'm sorry if this was already answered but I couldn't find.
I'm using the XLConnect package to add new entries to a spreadsheet, but the execution time of saveWorkbook is increasing and delaying all other tasks that depend on the updated spreadsheet.
The work flow is the following:
Query a SQL db for new entries (Load the result using read.table);
Load out-of-date spreadsheet and save each sheets as a entry of a
list;
Add entries to appropriate sheets/list element;
Color lines, using setCellStyel, according to a series of
parameters (example in code bellow);
saveWorkbook
cs_completo=getOrCreateCellStyle(wb, name = "Cs_Completo")
setFillPattern(cs_completo, fill = XLC$FILL.SOLID_FOREGROUND)
setFillForegroundColor(cs_completo, color = XLC$COLOR.LIGHT_GREEN)
for(status in c("Conferido","Impresso","Entregue","Envelopado")){
if(sum(grepl(status,dados$NomeStatusExame))>0){
index=which(grepl(status,dados$NomeStatusExame))+1
lapply(1:length(desired_tabs),function(x) setCellStyle(wb, sheet = sheet, row=index, col=x,cellstyle = cs_completo))}
}
}
Steps 1 through 4 are complete under 3 three minutes (some sheets have as much as 2000 lines).
Step 5 takes at least 30 minutes!
Is there a way to speed up the saveWorkbook writing process?
I don't know why but saving the workbook to a new file take much less time (under a minute) than overwrite the existing one!

Making writing to a file process more efficient

I am new to programming and i am running this script to clean a large text file (over 12000 lines) and write it to another .txt file. The problem is when a run this with a smaller file (roughly around 500 line) it executes fast, therefore my conclusion was it is taking time due to the size of the file. So if someone can guide me to make this code efficient it will be highly appreciated.
input_file = open('bNEG.txt', 'rt', encoding='utf-8')
l_p = LanguageProcessing()
sentences=[]
for lines in input_file.readlines():
tokeniz = l_p.tokeniz(lines)
cleaned_url = l_p.clean_URL(tokeniz)
remove_words = l_p.remove_non_englishwords(cleaned_url)
stopwords_removed = l_p.remove_stopwords(remove_words)
cleaned_sentence=' '.join(str(s) for s in stopwords_removed)+"\n"
output_file = open('cNEG.txt', 'w', encoding='utf-8')
sentences.append(cleaned_sentence)
output_file.writelines(sentences)
input_file.close()
output_file.close()
EDIT: Below is the corrected code as mentioned in the answer with few other alteration to suit my requirements
input_file = open('chromehistory_log.txt', 'rt', encoding='utf-8')
output_file = open('dNEG.txt', 'w', encoding='utf-8')
l_p = LanguageProcessing()
#sentences=[]
for lines in input_file.readlines():
#print(lines)
tokeniz = l_p.tokeniz(lines)
cleaned_url = l_p.clean_URL(tokeniz)
remove_words = l_p.remove_non_englishwords(cleaned_url)
stopwords_removed = l_p.remove_stopwords(remove_words)
#print(stopwords_removed)
if stopwords_removed==[]:
continue
else:
cleaned_sentence=' '.join(str(s) for s in stopwords_removed)+"\n"
#sentences.append(cleaned_sentence)
output_file.writelines(cleaned_sentence)
input_file.close()
output_file.close()
To have the discussion as answer:
Two problems are here:
You open / create the outputfile and write the data in the loop - for every line of the input file. Additional you are collection all data in an array (sentences).
You have two possibilities:
a) Create the file before the loop, and write in the loop just "cleaned_sentence" (and delete the collecting "sentences").
b) Collect everything in "sentences" and write "sentences" at once after the loop.
Disadvantage of a) is: this is a bit slower than b) (as long as the OS di not have to swap memory for b). But the advantage is: This is much less memory consuming and will work no matter how big the file is and how less memory in the computer is installed.

How do I read multiple binary files in R?

Suppose we have files in one folder file1.bin, file2.bin, ... , and file1460.bin in directory C:\R\Data and we want to read them and make a loop to go from 1 to 4 and take the average then from 4 to 8 average and so on till 1460.in the end will get 360 files
I tried to have them in a list,but did not know how to make the loop.
How do I read multiple files and manupulat them? in R language
I have been wasting countless hourse to figuer it out.any help
results <- array(dim=360)
for (i in 1:360){
results <- mean(yourlist[[(i*4):(i*4+3)]])
}
YMMV with the mean(yourList) call, but that structure would be how you could loop through the data once it's loaded.

Resources