gzgrep help in multiple large archives - Solaris - Unix

On Solaris, I need to perform a gzgrep across archives, but I need to filter so I'm not searching ALL the archives - maybe just files with '09.30-12' in the name - and then search IN that particular file or files for a particular expression. I have this close, but it takes WAY too long, as it searches unnecessary files first and matches on those before moving on to the October archives and finding what I need in them. Basically, I need to search any files whose filename contains 'x', then look in those files for text 'y' and output to > fileoutput. Perhaps just change the *.gz to match only a set of files? I cannot figure out how, though. Any help is MUCH appreciated.
Something like this works, but I get way too much output and it takes way too long.
gzgrep 'firstexpression' *.gz > /fileoutput.file

maybe just files with '09.30-12' in the name..
You could say:
gzgrep 'firstexpression' *09.30-12*.gz > fileoutput.file
or
gzgrep pattern_to_search *filename_pattern*.gz > outfile
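If the date glob still expands to more archives than the shell can pass on one command line, find can hand them to gzgrep in batches. A minimal sketch, assuming a POSIX find with support for -exec ... {} +, and with /path/to/archives as a placeholder directory:

# let find filter the archives by name, then search only those
find /path/to/archives -name '*09.30-12*.gz' -exec gzgrep 'firstexpression' {} + > fileoutput.file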

Related

Extract exactly one file (any) from each 7zip archive, in bulk (Unix)

I have 1,500 7zip archives, each archive contains 2 to 10 files, with no subdirectories.
Each file has the same extension, however the filename varies.
I only want one file out of each archive, but I'd like to perform this in bulk. I do not care which file is taken out, as long as only one file is taken out. It can be the first file, the newest, the biggest, the smallest, it doesn't matter.
Here's an example:
aa.7z {blah 56.smc, blah 57.smc, 1 blah 58.smc}
ab.7z {xx.smc, xx 1.smc, xx_2.smc}
ac.7z {1.smc}
I want to run something equivalent to:
7z e *.7z # But somehow only extract one file
Thank you!
Ultimately my solution was to extract all files and run the following in the directory:
for n in *; do echo "$n"; done > files.txt
I then imported that list into Excel and split the filenames on a special character that divided the title of the file from the qualifying data inside the filename (for example: Some Title (V1) [X2].smc); specifically, I used a bracket delimiter.
Then I removed all duplicates, leaving me with only one edition of each file from the archives. I finally re-merged the columns (unfortunately the bracket was deleted during the splitting, so I wrote a function to add it back on the condition that there was content in the next column), re-saved files.txt, and then, after a bit of reviewing Stack Overflow for answers, deleted files based on that input file (files.txt). A word of warning on this: spaces in filenames cause problems with rm and xargs, so I had to wrap the variable in quotes, as in the sketch below.
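A minimal sketch of that last deletion step, reading one filename per line from files.txt (the quotes around "$f" are what protect filenames containing spaces):

# delete every file named in files.txt, one per line
while IFS= read -r f; do
    rm -- "$f"
done < files.txt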
Ultimately this still didn't serve me well enough, so I just used a different resource entirely.
I'm posting this answer so others who find themselves in a similar predicament can find an alternative resolution.
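As an alternative that skips the spreadsheet round-trip entirely, the extraction itself can be limited to one file per archive. A minimal sketch, assuming the p7zip 7z binary: in 7z's -slt listing the first "Path = " line is the archive itself, so the second one is the first member.

for a in *.7z; do
    # second "Path = " line of the -slt listing = first file in the archive
    f=$(7z l -slt "$a" | sed -n 's/^Path = //p' | sed -n '2p')
    7z e "$a" "$f"
done

The quoting keeps filenames with spaces intact, which was the problem with rm and xargs above.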

Parse flat file every 410 characters

I have a flat file into which I need to insert a carriage return every 410 characters. I know this sounds weird, but for whatever reason my workplace was given several huge flat files from a clearinghouse, and I need to parse them out.
There is nothing that separates what is supposed to be each new line, but each record is exactly 410 characters, so I can't even search for anything specific and split on that.
There are 21 files total, each about 12-13 MB.
I have asked for a CSV file, and they are unable to provide that.
I am trying to see if Notepad++ will do a character count so that I can just hit "Enter" after every 410th character.
I am also trying to see if I can do this in Java.
Any help you all can provide would be appreciated.
In Notepad++ you can search for the regular expression (.{410}) and replace it with \1\r.
It has happened to me that Notepad++ swallowed some characters when doing regex-based search and replace operations in large files, so I would try this for one file, then remove all the carriage returns again and compare the result size to the original size, just to make sure that nothing got swallowed during the replace operation.
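If the files can go through a Unix toolchain instead of an editor, the standard fold utility does the same job in one pass. A minimal sketch, assuming the input is one long line with no embedded newlines (input.txt and output.txt are placeholder names); note that fold inserts newline characters rather than bare carriage returns:

# break the stream into lines of exactly 410 characters
fold -w 410 input.txt > output.txt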

Improperly formatted CSV, how to repair?

I have a CSV, and each line reads as follows:
"http://www.videourl.com/video,video title,video duration,thumbnail,<iframe src=""http://embed.videourl.com/video"" frameborder=0 width=510 height=400 scrolling=no> </iframe>,tag 1,tag 2",,,,,,,,,,,,,,,,,,,,,,,,,,
Is there a program I can use to clean this up? I'm trying to import it into WordPress and map it to the current fields, but it isn't functioning properly. Any suggestions?
Just use search and replace in this case: remove the commas at the end, then replace the remaining commas with ",".
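A minimal sed sketch of that replace, assuming every line looks like the example (one fully quoted record followed by a run of empty fields); input.csv and output.csv are placeholder names:

# drop the trailing commas, then turn the remaining commas into ","
sed -e 's/,*$//' -e 's/,/","/g' input.csv > output.csv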
Should anyone else have the same issue, know that this solution will only work with data much like the example given. If the data has a lot of text and there are commas within the text that need to be kept, then search-replacing commas will not work; using regex would be the next option, and that can be done in Notepad++.
However, I think the regex pattern depends on the data, so there is not much point creating an example.
PHP could also be used to explode each line: remove the values that match one of several regexes (e.g. URL, money), and what is left could be (depending on the data again) just a block of text. That approach may not work if there are two or more columns with a lot of text.

Parsing an XML file manually with R

For some reason, I cannot download the R XML package at work. I have an XML file with contents like this:
x<-read.table("info.xml")
x
</name></content></item><item id="id-123"><content><name>
</name></content></item><item id="id-456"><content><name>
</name></content></item><item id="id-5559"><content><name>
I need to pick out the values that start with id- and the numbers, like
id-123, id-456, id-5559, etc.
I tried this:
str_extract_all(x, "id-[0-9]")
but it is only printing id-1. I really need help very quickly. Any ideas?
library(stringr)   # str_extract_all() comes from the stringr package
str_extract_all(x, "id-[0-9]+")
The regular expression "id-[0-9]" is missing a "+" at the end.
There may be more issues, but that one jumps out.

Reading a file into R with partly unknown filename

Is there a way to read a file into R when I do not know the complete file name? Something like:
read.csv("abc_*")
In this case I do not know the complete file name after abc_.
If you have exactly one file matching your criteria, you can do it like this:
read.csv(dir(pattern='^abc_')[1])
If there is more than one file, this approach would just use the first hit. In a more elaborate version you could loop over all matches and append them to one data frame, or something like that.
Note that the pattern uses regular expressions and thus is a bit different from what you might expect (and what I wrongly assumed in my first attempt at answering the question). Details can be found via ?regex.
If you have a directory you want to pass in, you have to modify the dir command accordingly:
read.csv(dir('path/to/your/file', full.names=T, pattern="^abc"))
The submitted path in your case might be c:\\users\\user\\desktop, with the pattern as above. full.names=T forces dir() to output the whole path, not just the file name. Try running the dir(...) call without the read.csv to understand what is happening there.
If you want to give your path as a complete string, it again gets a bit more complicated:
filepath <- 'path/to/your/file/abc_'
read.csv(dir(dirname(filepath), full.names=T, pattern=paste("^", basename(filepath), sep='')))
That process will fail if your filename contains any regular expression metacharacters; you would have to replace them with their corresponding escape sequences up front. But that again is another topic.
