I am trying to find strings within files. There are a few things I want to accomplish:
Load the files into R. I have a directory of files, all of them .rtf, but I can't work out how to read them all in with one line of code.
Search every file for a string that I can change (for example, I need to find words like "wrist" and/or "rist" and/or "ris").
Return all of the files that contain these words.
Thank you so much! Please help!
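A minimal sketch of one way to approach the three steps in base R, assuming it is acceptable to search the raw text of each file. RTF files are plain text with formatting codes mixed in, so a pattern can also match those codes (for example, "ansi" would match the header). The function name find_rtf_matches and the demo files are made up for illustration.

```r
# Return the paths of all .rtf files in `dir` whose raw text matches `pattern`.
# `pattern` is a regular expression, so alternatives can be joined with "|".
find_rtf_matches <- function(dir, pattern, ignore_case = TRUE) {
  files <- list.files(dir, pattern = "\\.rtf$", full.names = TRUE,
                      ignore.case = TRUE)
  has_match <- vapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    any(grepl(pattern, lines, ignore.case = ignore_case))
  }, logical(1))
  unname(files[has_match])
}

# Small demo in a temporary directory (hypothetical file contents).
rtf_dir <- file.path(tempdir(), "rtf_search_demo")
dir.create(rtf_dir, showWarnings = FALSE)
writeLines("{\\rtf1\\ansi Pain in the left wrist.}",
           file.path(rtf_dir, "a.rtf"))
writeLines("{\\rtf1\\ansi Nothing relevant here.}",
           file.path(rtf_dir, "b.rtf"))

hits <- find_rtf_matches(rtf_dir, "wrist|rist|ris")
basename(hits)
```

Changing the second argument changes the search term; the returned vector holds the matching file paths.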
Related
I'm not sure how to ask this properly, but I have a single, densely packed 400-line file for a Kaggle competition I was working on, and I want to split it into multiple files (say, one file for data cleaning, another for feature engineering, and so on). I still want one main file that runs everything, from reading the CSV files all the way to making the model predictions. How can I do that in R? Do I have to wrap the contents of each file in a single function and then call that? If so, how does that work? Thanks in advance.
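You can use the source() function and pass it the filename. It runs the code in that file as if it had been typed at that point, so the files do not need to be wrapped in functions. Try ?source.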
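A minimal sketch of what that could look like. The file names (01_cleaning.R, 02_features.R) and the helper functions inside them are hypothetical; the two files are written to a temporary directory here only so that the example runs on its own.

```r
# Hypothetical project layout, created in a temporary directory for the demo.
proj <- file.path(tempdir(), "kaggle_demo")
dir.create(proj, showWarnings = FALSE)

# Each file defines the helpers for one stage of the pipeline.
writeLines(
  "clean_data <- function(df) df[!is.na(df$x), , drop = FALSE]",
  file.path(proj, "01_cleaning.R")
)
writeLines(
  "add_features <- function(df) { df$x_sq <- df$x^2; df }",
  file.path(proj, "02_features.R")
)

# The main script sources each file in order and then uses what they define.
source(file.path(proj, "01_cleaning.R"))
source(file.path(proj, "02_features.R"))

raw <- data.frame(x = c(1, NA, 3))
prepared <- add_features(clean_data(raw))
prepared
```

Putting reusable steps in functions is optional but keeps the main file short; a sourced file can equally contain plain top-level code.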
I have used file.create and file.append successfully to aggregate multiple .txt files. When I try the same with .rtf files, however, I get a larger .rtf file that only shows the contents of the first of the .rtf files being aggregated.
For example, say I have 5 .rtf files. dirFiles_r is the vector of file names to be aggregated and fileCollection_r is the combined file:
file.create(fileCollection_r)
file.append(fileCollection_r,dirFiles_r)
Is this a bug, and if so, how would I report it?
How can I aggregate multiple .rtf files?
First of all, it is not clear what file.create() and file.append() are doing. You didn't tag the question with a specific programming language, so that part of the question is unclear and needs improving.
Having said that: RTF files are, in the end, plain text files. They contain formatting information, such as:
{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf100
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
So, theoretically, you can just pull that text content from multiple RTF files, and put all of that into a single file.
So: simply use a file viewer such as less or cat, or a Windows/macOS equivalent, to A) check the text content of each individual RTF file and B) check the text content of the file that you created this way. That will tell you whether the plain text append worked.
But beyond that: it could very well be that the RTF format itself has limitations that make it impossible to append arbitrary RTF file content and end up with a valid RTF document.
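The check described above can also be done from R. The sketch below is based on the assumption that each RTF document is a single brace-delimited group starting with {\rtf1, and that a viewer may stop reading at the end of the first group. If the combined file contains several such headers, the append itself worked and the format is the likely cause. The helper name count_rtf_headers and the demo files are hypothetical.

```r
# Count how many RTF document headers ("{\rtf") appear in a file.
count_rtf_headers <- function(path) {
  text <- paste(readLines(path, warn = FALSE), collapse = "\n")
  hits <- gregexpr("{\\rtf", text, fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Demo: append two tiny RTF files the same way as in the question.
rtf_tmp <- file.path(tempdir(), "rtf_append_demo")
dir.create(rtf_tmp, showWarnings = FALSE)
part_a <- file.path(rtf_tmp, "a.rtf")
part_b <- file.path(rtf_tmp, "b.rtf")
combined <- file.path(rtf_tmp, "combined.rtf")
writeLines("{\\rtf1\\ansi First document.}", part_a)
writeLines("{\\rtf1\\ansi Second document.}", part_b)

file.create(combined)
file.append(combined, c(part_a, part_b))

# More than one header means the text was appended, but the result is
# several RTF documents placed back to back rather than one merged document.
count_rtf_headers(combined)
```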
I've got a folder full of .doc files and I want to read them all into R to create a data frame with the filename as one column and the content as another column (which would include all of the content from the .doc file).
Is this even possible? If so, could you provide me with an overview of how to go about doing this?
I started out by trying to convert all the files to text with readtext(), using the following code:
DATA_DIR <- system.file("C:/Users/MyFiles/Desktop")
readtext(paste0(DATA_DIR, "/files/*.doc"))
I also tried:
setwd("C:/Users/My Files/Desktop")
I couldn't get either to work (the output from R was: Error in list_files(file, ignore_missing, TRUE, verbosity) : File '' does not exist.), and I'm not sure whether this step is even necessary for what I want to do.
Sorry that this is quite vague; I guess I want to know first and foremost if what I want to do can be done. Many thanks!
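A rough sketch of the data frame structure being asked for, under some stated assumptions. system.file() looks for files inside installed packages, so giving it an ordinary folder path returns an empty string, which may be related to the error above. The helper files_to_df and its reader argument are hypothetical; plain-text files stand in for the documents because .doc is a binary format that base R cannot read directly, so a real reader (such as readtext(), which returns a data frame with doc_id and text columns) would have to be used in its place.

```r
# system.file() searches installed packages, so an ordinary path gives "".
system.file("C:/Users/MyFiles/Desktop")

# Build a data frame with one row per file: file name plus full content.
# `reader` is a placeholder: it takes a path and returns a character vector.
files_to_df <- function(dir, pattern,
                        reader = function(p) readLines(p, warn = FALSE)) {
  paths <- list.files(dir, pattern = pattern, full.names = TRUE)
  content <- vapply(paths,
                    function(p) paste(reader(p), collapse = "\n"),
                    character(1))
  data.frame(filename = basename(paths), content = unname(content),
             stringsAsFactors = FALSE)
}

# Demo with plain-text stand-ins for the documents.
doc_dir <- file.path(tempdir(), "doc_df_demo")
dir.create(doc_dir, showWarnings = FALSE)
writeLines(c("alpha", "beta"), file.path(doc_dir, "one.txt"))
writeLines("gamma", file.path(doc_dir, "two.txt"))

docs <- files_to_df(doc_dir, "\\.txt$")
docs
```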
I need to untar specific files from a large number of archives. The names of the desired files have a dynamic part in them, so my approach is a wildcard search.
The following approach does not work:
untar(archive, files = glob2rx("*name"))
At the moment, I have to either extract the whole archive and search the files, or use untar with list = TRUE to get the filename.
Both take too much time. How can I open the archive in memory, look for the file, and extract only that file? Surely there is a more efficient way to solve this?
EDIT: It's tar.gz files I'm working with.
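For reference, a minimal sketch of the list-then-extract approach mentioned above, written as one helper. It treats the files argument of untar() as a set of member names rather than a regular expression, so the pattern is applied to the listing first. The helper name untar_matching and the demo archive are hypothetical, and this still reads the archive twice, so it does not remove the cost described in the question.

```r
# Extract only the members of `archive` whose names match a glob pattern.
untar_matching <- function(archive, glob, exdir = ".") {
  members <- untar(archive, list = TRUE)
  wanted <- members[grepl(glob2rx(glob), members)]
  if (length(wanted) > 0) untar(archive, files = wanted, exdir = exdir)
  wanted
}

# Demo: build a small .tar.gz in a temporary directory.
src <- file.path(tempdir(), "tar_demo_src")
out <- file.path(tempdir(), "tar_demo_out")
dir.create(src, showWarnings = FALSE)
dir.create(out, showWarnings = FALSE)
writeLines("wanted", file.path(src, "run42_name"))
writeLines("other", file.path(src, "other.txt"))

archive <- file.path(tempdir(), "tar_demo.tar.gz")
old_wd <- setwd(src)
tar(archive, files = c("run42_name", "other.txt"),
    compression = "gzip", tar = "internal")
setwd(old_wd)

extracted <- untar_matching(archive, "*name", exdir = out)
extracted
```

If the system tar is GNU tar, passing the raw glob together with extras = "--wildcards" might let tar do the matching in a single pass; that is an untested assumption and depends on the tar program that R calls.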
I have a parent folder with around 30 subfolders, each of which contains .pdf, .doc, .docx, and .jpg files. I need to combine all of the files into one large PDF. I want the order in which the files are appended to the 'master PDF' to reflect my current folder and file order (which is alphabetical for the subfolder names and numeric for the files within each subfolder).
I am fairly new to Unix and am a bit stuck on this. I would be most grateful for any advice you may have on how to approach this problem. Thank you.
There are three problems here:
Traverse the directory tree to find all documents
Convert each file into PDF
Merge the PDFs
For the first part, you could use the find command to get the list of files, or script the directory traversal.
For the second part, you could use the OpenOffice/LibreOffice command-line interface to convert the .doc and .docx files, and Ghostscript to convert the .jpg files.
For the third part, probably Ghostscript again.
Alternatively, there are good PDF APIs available for some programming languages, such as iText from Lowagie for Java.
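The three parts could be sketched roughly as follows, here in R. This is only an outline under several assumptions: file names within each subfolder start with a number, LibreOffice (soffice) and Ghostscript (gs) are installed, and the function names are made up for illustration. The external commands are only built as argument vectors, not run, and the .jpg conversion is not shown.

```r
# Part 1: list documents under `root`, ordered by subfolder name
# (alphabetical) and then by the leading number in each file name.
list_documents <- function(root) {
  paths <- list.files(root, pattern = "\\.(pdf|docx?|jpe?g)$",
                      recursive = TRUE, ignore.case = TRUE)
  lead <- sub("^([0-9]+).*$", "\\1", basename(paths))
  num <- suppressWarnings(as.numeric(lead))  # NA if no leading number
  paths[order(dirname(paths), num, basename(paths))]
}

# Part 2: arguments for a headless LibreOffice conversion of one file to PDF.
soffice_convert_args <- function(input, outdir) {
  c("--headless", "--convert-to", "pdf", "--outdir", outdir, input)
}

# Part 3: arguments for merging PDFs, in the given order, with Ghostscript.
gs_merge_args <- function(pdfs, output) {
  c("-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite",
    paste0("-sOutputFile=", output), pdfs)
}

# Demo of the ordering on empty placeholder files.
root <- file.path(tempdir(), "merge_demo")
dir.create(file.path(root, "A"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "B"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "A", c("1.pdf", "2.docx", "10.jpg")))
file.create(file.path(root, "B", "1.pdf"))

ordered_docs <- list_documents(root)
ordered_docs

# With real files, the commands could then be run along these lines:
# system2("soffice", shQuote(soffice_convert_args(doc_path, pdf_dir)))
# system2("gs", shQuote(gs_merge_args(pdf_paths, "master.pdf")))
```

Sorting on the leading number keeps 10.jpg after 2.docx, which a plain alphabetical sort would not.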