I'm doing a project that involves file reading and I need to know the exact number of rows in the file. Does anyone know how to count the number of rows in a file without having to read the whole file? I mean, is there a built-in function for that in Lua? Thanks in advance.
Lua has a built-in file lines iterator.
Very convenient.
Recommended. :-)
local ctr = 0
for _ in io.lines'filename.txt' do
    ctr = ctr + 1
end
There is no built-in function for that. The only way is to read the whole file, as in Egor's answer for instance.
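If iterating line by line turns out to be slow on very big files, a variation (just a sketch, not taken from either answer) is to read fixed-size chunks and count newline characters; it still scans the whole file, and a file without a trailing newline will have its last line uncounted:

local function count_lines(path)
  local f = assert(io.open(path, "rb"))
  local count = 0
  while true do
    local chunk = f:read(64 * 1024)      -- read 64 KiB at a time
    if not chunk then break end
    local _, n = chunk:gsub("\n", "\n")  -- gsub also returns the number of matches
    count = count + n
  end
  f:close()
  return count
end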
Is there a more efficient way than
int fileSize = size(readFileLines(fileLoc));
to get the total number of lines in a file? I presume this code has to read the entire file first, which could become costly for huge files.
I have looked into IO and Loc to see whether some of this info might be stored along with the file.
This is the way, unless you'd like to call wc -l via util::ShellExec 😁
Apart from streaming the file and saving some memory, counting lines is always linear in the size of the file, so you won't win much time.
I'm using the Julia programming language.
I save each simulation result in a loop, and the parameter value is saved in the file name.
Is there any good way to unify the number of decimal places in file names?
For example,
using DelimitedFiles
for i = 0:0.05:1
    a = rand(1)
    writedlm("i=$i.txt", a)
end
then I got file names with inconsistent numbers of decimal places (e.g. i=0.05.txt next to i=0.1.txt).
What I want is file names padded to the same number of decimal places (e.g. i=0.05.txt and i=0.10.txt).
Thanks for your comment
You can use the Formatting.jl package. Here is an example (there are many more options; I show you the one I typically use when I want the file names to be nicely sorted):
julia> format.(0.0:0.5:10.0, precision=2, width=5, zeropadding=true)
21-element Vector{String}:
"00.00"
"00.50"
"01.00"
"01.50"
"02.00"
"02.50"
"03.00"
"03.50"
"04.00"
"04.50"
"05.00"
"05.50"
"06.00"
"06.50"
"07.00"
"07.50"
"08.00"
"08.50"
"09.00"
"09.50"
"10.00"
Good morning,
I'm new to PowerShell and I'd like to ask if somebody can help me.
I have a big CSV file, around 3.5 GB, and my goal is to load it with fread (a data.table function) in the R environment, but this function throws an error.
> n_a<-fread("C:/x/xy/xyz/name_file.csv",sep=";", fill = TRUE)
The error is:
Warning message:
In fread("C:/x/xy/xyz/name_file.csv") :
Stopped early on line 458945. Expected 29 fields but found 30. Consider fill=TRUE and comment.char=. First discarded non-empty line
I tried different ways to solve the problem (I put fill=TRUE in my code, but it doesn't work), but I couldn't fix it.
After some research I found this kind of solution (still to be run from R):
>system("powershell Get-Content C:/a/b/c/file.csv | Select -Index (0..458944 + 1000000) > output.csv")
The point of using PowerShell from R is to delete a specific row and then load the file with fread.
My question is:
How can I delete a specific row in a CSV in PowerShell without specifying the length of the matrix?
Thank you in advance for every type of help.
Francesco
I'd hazard a guess that the invalid row's location is not known. In such a case, it might be sensible to read the original file and create a new file that contains only valid data. What's more, if the source data would benefit from manipulation, that can be done before reading it into R.
A file as large as 3.5 GiB is a bit on the large side to read into memory as such. Sure, it can be done in the days of 64-bit systems, but for simple row processing it's unwieldy. A scalable solution uses .NET methods and a row-by-row approach.
To process a file on a row-by-row basis, use .NET methods for efficient row reading. A StringBuilder is created to store rows that contain valid data; others are discarded. The StringBuilder is flushed to disk every so often. Even in the age of SSDs, a write operation per row is relatively slow compared to writing a bulk of, say, 10,000 rows at a time.
$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText("MyCsvFile.csv")
$i = 0
$MaxRows = 10000
$colonCount = 30
while($null -ne ($line = $reader.ReadLine())) {
    # Split the line on semicolons
    $elements = $line -split ';'
    # If there were $colonCount elements, add those to the builder
    if($elements.Count -eq $colonCount) {
        # If $line's contents need modifications, do it here
        # before adding it into the builder
        [void]$sb.AppendLine($line)
        ++$i
    }
    # Write builder contents into the file every now and then
    if($i -ge $MaxRows) {
        Add-Content "MyCleanCsvFile.csv" $sb.ToString()
        [void]$sb.Clear()
        $i = 0
    }
}
$reader.Close()
# Flush the builder after the loop if there's data left
if($sb.Length -gt 0) {
    Add-Content "MyCleanCsvFile.csv" $sb.ToString()
}
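Once the cleaned file exists, it should load with the original fread call (the file name here follows the sketch above):

library(data.table)
n_a <- fread("MyCleanCsvFile.csv", sep = ";", fill = TRUE)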
This is easily done in PowerShell: read the CSV into a generic list, remove the line, and write it back:
Add-Type -AssemblyName System.Collections
[System.Collections.Generic.List[string]]$csvList = @()
$csvFile = 'C:\test\myfile.csv'
$csvList = [System.IO.File]::ReadLines( $csvFile )
$lineToDelete = 2
[void]$csvList.RemoveAt( $lineToDelete - 1 )
[System.IO.File]::WriteAllLines( $csvFile, $csvList ) | Out-Null
vonPryz's helpful answer offers the best solution, given the size of your input file.
The following works too, but will be slow - in general, due to the overhead of using a pipeline, but also because Get-Content itself is slow due to decorating each line read with additional properties (see green-lighted, but not yet implemented GitHub suggestion #7537):
# Exclude line number 458945 (0-based index 458944)
Get-Content C:/a/b/c/file.csv | Select-Object -SkipIndex 458944 > output.csv
The beneficial flip side of using the pipeline is that it acts as a memory throttle, so the above command can be used to process arbitrarily large files (though it may take a long time).
I'm having issues when trying to read in a binary file I've previously written in another program. I have been able to open it and read it into an array without compilation errors; however, the array is not populated (all 0's). Any suggestions or thoughts would be great. Here is the open/read statement I'm using:
allocate(dummy(imax,jmax))
open(unit=io, file=trim(input), form='binary', access='stream', &
iostat=ioer, status='old', action='READWRITE')
if(ioer/=0) then
print*, 'Cannot open file'
else
print*,'success opening file'
end if
read(unit=io, fmt=*, iostat=ioer) dummy
j=0
k=0
size: do j=1, imax
do k=1, jmax
if(dummy(j,k) > 0.) print*,dummy(j,k)
end do
end do size
Please let me know if you need more info.
Here is how the file is originally written:
out_file = trim(output_dir)//'SEVIRI_FRP_.08deg_'//trim(season)//'.bin'
print*, out_file
print*, i_max,' i_max,',j_max,' j_max'
open (io, file = out_file, access = 'direct', status = 'replace', recl = i_max*j_max*4)
write(io, rec = 1) sev_frp
write(io, rec = 2) count_sev_frp
write(io, rec = 3) sum_sev_frp
check: do n=1, i_max
inna: do m=1, j_max
!if (sev_frp(n,m) > 0) print*, count_sev_frp(n,m)
end do inna
end do check
print*,'n-',n,'m-',m
close(io)
First of all, the form= specifier takes two possible values as far as I know: "FORMATTED" or "UNFORMATTED".
Second, to read, you should use an open statement that is symmetric to the one you used to write the file, unless you know exactly what you are doing. I suggest that for reading you open with:
open(unit=io, file=trim(input), access='direct', &
iostat=ioer, status='old', action='READ', recl = i_max*j_max*4)
That corresponds to the open statement that you used to save the file.
As innoSPG says, you have a mismatch in the way the file is written and how it is read.
An external file may be connected with one of three access methods: sequential; direct; stream. Further, a connection may be formatted or unformatted.
When the file is opened for writing it uses the direct access method with unformatted records. The records are unformatted because this is the default (in the absence of the form= specifier).
When you open the file for reading you use the non-standard extension of form="binary" and stream access. There is possibly nothing wrong with this, but it does require care.
However, with the read statements you are using formatted (list-directed) input. This will not be allowed.
The way suggested in the previous answer, of using a similar access method and record length, will require a further change to the code. [You'll also need to set the value of the record length somehow.]
Not only will you need to remove the format, to match the unformatted records written, but you'll want to use the rec= specifier to access the records of the file.
Finally, if you are using the iostat= specifier you really should check the resulting value.
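Putting both answers together, a read that mirrors the original write might look like the sketch below. It assumes imax/jmax equal the writer's i_max/j_max and that dummy has the same type and kind as sev_frp; also note that the unit of recl is compiler dependent (gfortran counts bytes by default, while ifort counts 4-byte words unless -assume byterecl is given):

open(unit=io, file=trim(input), access='direct', form='unformatted', &
     recl=imax*jmax*4, iostat=ioer, status='old', action='read')
if (ioer /= 0) stop 'cannot open file'
read(io, rec=1, iostat=ioer) dummy    ! record 1 holds sev_frp
if (ioer /= 0) stop 'read of record 1 failed'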
I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often a single line is repeated; however, the 10 GB file is too large to sort and my computer runs out of memory.
Is there a way to compress the file while preserving newlines (or an entirely different method all together) that would reduce the file to a small enough size to sort, yet still leave the file in a condition that's sortable?
Or is there any other method of finding out / counting how many times each line is repeated inside a large file (a ~10 GB CSV-like file)?
Thanks for any help!
Are you sure you're running out of memory (RAM) with your sort?
My experience debugging sort problems leads me to believe that you have probably run out of disk space for sort to create its temporary files. Also recall that the disk space used for sorting is usually in /tmp or /var/tmp.
So check your available disk space with:
df -g
(some systems don't support -g; try -m for megabytes or -k for kilobytes)
If you have an undersized /tmp partition, do you have another partition with 10-20GB free? If yes, then tell your sort to use that dir with
sort -T /alt/dir
Note that for sort version
sort (GNU coreutils) 5.97
The help says
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
I'm not sure if this means you can combine a bunch of -T=/dr1/ -T=/dr2 ... to get to your 10GB*sortFactor space or not. My experience was that it only used the last dir in the list, so try to use one dir that is big enough.
Also, note that you can go to whatever dir you are using for sort, and you'll see the activity of the temporary files used for sorting.
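For example, applied to the pipeline from the question (the directory is just an example; -rn on the final sort orders the result by count):

sort -T /alt/dir FILE | uniq -c | sort -rn -T /alt/dir > line_counts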
I hope this helps.
As you appear to be a new user here on S.O., allow me to welcome you and remind you of four things we do:
1) Read the FAQs
2) Please accept the answer that best solves your problem, if any, by pressing the checkmark sign. This gives the respondent with the best answer 15 points of reputation. It is not subtracted (as some people seem to think) from your reputation points ;-)
3) When you see good Q&A, vote them up by using the gray triangles, as the credibility of the system is based on the reputation that users gain by sharing their knowledge.
4) As you receive help, try to give it too, answering questions in your area of expertise.
There are some possible solutions:
1 - Use any text-processing language (Perl, awk) to extract each line and save the line number and a hash for that line, and then compare the hashes
2 - Can / want to remove the duplicate lines, leaving just one occurrence per file? You could use a script (command) like:
awk '!x[$0]++' oldfile > newfile
3 - Why not split the file according to some criterion? Supposing all your lines begin with letters:
- break your original_file into 20 smaller files: grep "^a" original_file > a_file
- sort each small file: a_file, b_file, and so on
- verify the duplicates, count them, do whatever you want (a sketch of this approach follows below).
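A rough bash sketch of option 3, assuming lines start with lowercase letters and that /alt/dir (from the other answer) has enough room for sort's temporary files:

for letter in {a..z}; do
    grep "^$letter" original_file | sort -T /alt/dir | uniq -c | sort -rn > "${letter}_counts"
done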