How to modify i in an R loop? - r

I have several large R objects saved as .RData files: "this.RData", "that.RData", "andTheOther.RData" and so on. I don't have enough memory, so I want to load each in a loop, extract some rows, and unload it. However, once I load(i), I need to strip the ".RData" part of (i) before I can do anything with objects "this", "that", "andTheOther". I want to do the opposite of what is described in How to iterate over file names in a R script? How can I do that? Thx
Edit: I omitted to mention the files are not in the working directory and have a filepath as well. I came across Getting filename without extension in R and file_path_sans_ext takes out the extension but the rest of the path is still there.

Do you mean something like this?
i <- c("/path/to/this.RDat", "/another/path/to/that.RDat")
f <- gsub(".*/([^/]+)", "\\1", i)
f1 <- gsub("\\.RDat", "", f)
f1
[1] "this" "that"
For Windows paths written with backslashes, the regexp has to match "\\\\" (a literal backslash) instead of "/".
Edit: Explanation. Technically, these are called "regular expressions" (regexps), not "patterns".
.        any character
.*       arbitrary number (including 0) of any kind of characters
.*/      arbitrary number of any kind of characters, followed by a /
[^/]     any character but not /
[^/]+    arbitrary number (1 or more) of any kind of characters, but not /
( and )  enclose groups. You can use the groups when replacing as \\1, \\2 etc.
So, look for any kind of character, followed by /, followed by anything but not the path separator. Replace this with the "anything but not separator".
There are many good tutorials for regexps, just look for it.
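To see the groups at work, here is a minimal sketch that applies the same kind of pattern with two groups. The path is only an illustrative value, not a real file.

```r
# Illustrative path (an assumption, not taken from a real file system)
p <- "/path/to/this.RDat"

# Group 1 captures everything before the last "/", group 2 the file name
dir_part  <- sub("(.*)/([^/]+)", "\\1", p)
file_part <- sub("(.*)/([^/]+)", "\\2", p)

dir_part   # "/path/to"
file_part  # "this.RDat"
```

Because .* is greedy, the first group runs up to the last /, which is why the second group ends up holding only the file name.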

A simple way to do this would be to extract the base name from the file paths with base::basename() and then remove the file extension with tools::file_path_sans_ext().
paths_to_files <- c("./path/to/this.RData", "./another/path/to/that.RData")
tools::file_path_sans_ext(basename(paths_to_files))
## Returns:
## [1] "this" "that"
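Putting this back into the loop from the question, a rough sketch might look like the following. It assumes each .RData file holds a single object named after the file, as the question describes. The temporary files, the data frames and the row selection are all made up for illustration.

```r
# Hypothetical setup: write two small objects to temporary .RData files
tmp_dir <- tempfile("rdata_demo")
dir.create(tmp_dir)
this <- data.frame(id = 1:5, value = letters[1:5])
that <- data.frame(id = 6:10, value = letters[6:10])
save(this, file = file.path(tmp_dir, "this.RData"))
save(that, file = file.path(tmp_dir, "that.RData"))
rm(this, that)

paths_to_files <- file.path(tmp_dir, c("this.RData", "that.RData"))
extracted <- list()

for (i in paths_to_files) {
  # Strip the directory and the extension: "this", "that"
  obj_name <- tools::file_path_sans_ext(basename(i))
  env <- new.env()
  load(i, envir = env)                  # load into a scratch environment
  obj <- get(obj_name, envir = env)     # look the object up by its name
  extracted[[obj_name]] <- obj[1:2, ]   # keep only the rows of interest
  rm(env, obj)                          # drop the large object before the next file
}

names(extracted)  # "this" "that"
```

Loading into a scratch environment keeps the global workspace clean and makes it easy to discard each object once the rows have been copied out.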

Related

Reading multiple csv files from a folder with R using regex

I wish to use R to read multiple csv files from a single folder. If I wanted to read every csv file I could use:
list.files(folder, pattern="*.csv")
See, for example, these questions:
Reading multiple csv files from a folder into a single dataframe in R
Importing multiple .csv files into R
However, I only wish to read one of four subsets of the files at a time. Below is an example grouping of four files each for three models.
JS.N_Nov6_2017_model220_N200.csv
JS.N_Nov6_2017_model221_N200.csv
JS.N_Nov6_2017_model222_N200.csv
my.IDs.alt_Nov6_2017_model220_N200.csv
my.IDs.alt_Nov6_2017_model221_N200.csv
my.IDs.alt_Nov6_2017_model222_N200.csv
parms_Nov6_2017_model220_N200.csv
parms_Nov6_2017_model221_N200.csv
parms_Nov6_2017_model222_N200.csv
supN_Nov6_2017_model220_N200.csv
supN_Nov6_2017_model221_N200.csv
supN_Nov6_2017_model222_N200.csv
If I only wish to read, for example, the parms files I try the following, which does not work:
list.files(folder, pattern="parm*.csv")
I am assuming that I may need to use regex to read a given group of the four groups present, but I do not know.
How can I read each of the four groups separately?
EDIT
I am unsure whether I would have been able to obtain the solution from answers to this question:
Listing all files matching a full-path pattern in R
I may have had to spend a fair bit of time brushing up on regex to apply those answers to my problem. The answer provided below by Mako212 is outstanding.
A quick REGEX 101 explanation:
For the case of matching the beginning and end of the string, which is all you need to do here, the following principles apply to match files that are .csv and start with parm:
list.files(folder, pattern="^parm.*?\\.csv")
^ asserts we're at the beginning of the string, so ^parm means match parm, but only if it's at the beginning of the string.
.*? means match anything up until the next part of the pattern matches. In this case, match until we see a period \\.
. means match any character in REGEX, so we need to escape it with \\ to match the literal . (note that in R you need the double escape \\, in other languages a single escape \ is sufficient).
Finally, csv means match csv after the period. If we were going to be really thorough, we might use \\.csv$ using the $ to indicate the end of the string. You'd need the dollar sign if you had other files with an extension like .csv2. \\.csv would match .csv2, whereas \\.csv$ would not.
In your case, you could simply replace parm in the REGEX pattern with JS, my, or supN to select one of your other file types.
Finally, if you wanted to match a subset of your total file list, you could use the | logical "or" operator:
list.files(folder, pattern = "^(parm|JS|supN).*?\\.csv")
Which would return all the file names except the ones that start with my.
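The patterns can be tried without touching the file system by running grep over a character vector. This is only a sketch: the vector holds a few of the file names from the question, plus one hypothetical .csv2 file added to show what the $ anchor changes.

```r
files <- c(
  "JS.N_Nov6_2017_model220_N200.csv",
  "my.IDs.alt_Nov6_2017_model220_N200.csv",
  "parms_Nov6_2017_model220_N200.csv",
  "supN_Nov6_2017_model220_N200.csv",
  "parms_Nov6_2017_model220_N200.csv2"   # hypothetical extra file
)

# Without the $ anchor the .csv2 file is matched as well
unanchored <- grep("^parm.*?\\.csv", files, value = TRUE)

# With the $ anchor only names ending in .csv are kept
anchored <- grep("^parm.*\\.csv$", files, value = TRUE)

# Alternation selects several groups at once
several <- grep("^(parm|JS|supN).*\\.csv$", files, value = TRUE)

length(unanchored)  # 2
anchored            # "parms_Nov6_2017_model220_N200.csv"
length(several)     # 3
```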
The list.files statement shown in the question is using globs but list.files accepts regular expressions, not globs.
Sys.glob: To use globs, use Sys.glob like this:
olddir <- setwd(folder)
parm <- lapply(Sys.glob("parm*.csv"), read.csv)
setwd(olddir) # restore the previous working directory
parm is now a list of data frames read in from those files.
glob2rx: Note that the glob2rx function can be used to convert globs to regular expressions. Pass full.names = TRUE so that read.csv receives paths that include folder:
parm <- lapply(list.files(folder, pattern = glob2rx("parm*.csv"), full.names = TRUE), read.csv)
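As a small self-contained check of the glob2rx route, the sketch below writes a few made-up CSV files to a temporary folder and reads back only the ones matching the glob. The file names and contents are assumptions for the demo.

```r
# glob2rx turns the glob into an anchored regular expression
rx <- glob2rx("parm*.csv")
rx  # "^parm.*\\.csv$"

# Hypothetical folder with three small CSV files
folder <- tempfile("csv_demo")
dir.create(folder)
write.csv(data.frame(a = 1:2), file.path(folder, "parms_model220.csv"), row.names = FALSE)
write.csv(data.frame(a = 3:4), file.path(folder, "parms_model221.csv"), row.names = FALSE)
write.csv(data.frame(a = 5:6), file.path(folder, "supN_model220.csv"), row.names = FALSE)

# full.names = TRUE returns paths that read.csv can open from any working directory
parm_files <- list.files(folder, pattern = rx, full.names = TRUE)
parm <- lapply(parm_files, read.csv)
names(parm) <- basename(parm_files)

names(parm)  # "parms_model220.csv" "parms_model221.csv"
```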

remove/replace specific words or phrases from character strings - R

I looked around both here and elsewhere, I found many similar questions but none which exactly answer mine. I need to clean up naming conventions, specifically replace/remove certain words and phrases from a specific column/variable, not the entire dataset. I am migrating from SPSS to R, I have an example of the code to do this in SPSS below, but I am not sure how to do it in R.
EG:
"Acadia Parish" --> "Acadia" (removes Parish and space before Parish)
"Fifth District" --> "Fifth" (removes District and space before District)
SPSS syntax:
COMPUTE county=REPLACE(county,' Parish','').
There are only a few instances of this issue in the column with 32,000 cases, and what needs replacing/removing varies and the cases can repeat (there are dozens of instances of a phrase containing 'Parish'), meaning it's much faster to code what needs to be removed/replaced, it's not as simple or clean as a regular expression to remove all spaces, all characters after a specific word or character, all special characters, etc. And it must include leading spaces.
I have looked at the replace() gsub() and other similar commands in R, but they all involve creating vectors, or it seems like they do. What I'd like is syntax that looks for characters I specify, which can include leading or trailing spaces, and replaces them with something I specify, which can include nothing at all, and if it does not find the specific characters, the case is unchanged.
Yes, I will end up repeating the same syntax many times, it's probably easier to create a vector but if possible I'd like to get the syntax I described, as there are other similar operations I need to do as well.
Thank you for looking.
> x <- c("Acadia Parish", "Fifth District")
> x2 <- gsub("^(\\w*).*$", "\\1", x)
> x2
[1] "Acadia" "Fifth"
Legend:
^ Start of the string.
() Group (or token).
\w* Zero or more word characters.
.* Zero or more occurrences of any character.
$ End of the string.
\1 Backreference to the first group (written "\\1" in an R string).
Maybe I'm missing something but I don't see why you can't simply use alternation (|) in your regular expression, then trim out the annoying white space.
string <- c("Arcadia Parish", "Fifth District")
bad_words <- c("Parish", "District") # Write all the words you want removed here!
bad_regex <- paste(bad_words, collapse = "|")
trimws( sub(bad_regex, "", string) )
# [1] "Arcadia" "Fifth"
dataframename$varname <- gsub(" Parish","", dataframename$varname)
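For the SPSS-style literal REPLACE the question asks for, one possible sketch is gsub with fixed = TRUE, which treats the search text as plain characters, leading space included, and leaves non-matching cases untouched. The data frame and the list of phrases below are made up for illustration.

```r
# Hypothetical data: one column to clean, other rows left alone
df <- data.frame(
  county = c("Acadia Parish", "Fifth District", "Orleans", "Parish Road"),
  stringsAsFactors = FALSE
)

# Phrases to remove, each with its leading space
remove_phrases <- c(" Parish", " District")

for (p in remove_phrases) {
  # fixed = TRUE: match the phrase literally, not as a regular expression
  df$county <- gsub(p, "", df$county, fixed = TRUE)
}

df$county  # "Acadia" "Fifth" "Orleans" "Parish Road"
```

"Parish Road" is unchanged because the phrase being removed starts with a space, which is the behaviour the question describes.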

skip lines in reading files using reg ex

I have files with similar contents:
!software version: $Revision$
!date: 07/06/2016 $
!
! from Mouse Genome Database (MGD) & Gene Expression Database (GXD)
!
MGI
I am using read.csv to read the files. But I need to skip the lines with "!" in the beginning. How can I do that?
The read.csv function and read.table that it is based on have an argument called comment.char which can be used to specify a character that if seen will ignore the rest of that line. Setting that to "!" may be enough to do what you want.
If you really need a regular expression, then the best approach is to read the file using readLines (or a similar function), then apply the regular expression to the resulting vector of character strings to drop the unwanted elements (rows), then pass the result to the text argument of read.table (or use a text connection).
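Both approaches can be sketched on a small temporary file. The header lines mimic the ones in the question, while the column names and data rows are invented for the demo.

```r
# Hypothetical file: "!" header lines followed by made-up CSV data
lines <- c(
  "!software version: $Revision$",
  "!date: 07/06/2016 $",
  "!",
  "id,symbol",
  "MGI:1,Abc",
  "MGI:2,Def"
)
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)

# Option 1: let read.csv treat "!" as the comment character
df_comment <- read.csv(tmp, comment.char = "!")

# Option 2: filter with a regular expression, then parse the remaining text
raw  <- readLines(tmp)
keep <- raw[!grepl("^!", raw)]
df_regex <- read.csv(text = paste(keep, collapse = "\n"))

names(df_regex)  # "id" "symbol"
nrow(df_regex)   # 2
```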
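To calculate how many lines to skip, find the first line that doesn't start with a ! and subtract one (skip is the number of lines to drop before reading begins):
to_skip <- min(grep('^[^!]', trimws(readLines('file.csv')))) - 1
df <- read.csv('file.csv', skip = to_skip)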
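A quick sketch of the skip-based idea on a made-up file shows why the count matters. skip is the number of lines dropped before reading starts, so it is one less than the index of the first line that does not begin with "!". The file contents here are assumptions.

```r
# Hypothetical file: three "!" lines, then a header and one data row
lines <- c(
  "!software version: $Revision$",
  "!date: 07/06/2016 $",
  "!",
  "id,symbol",
  "MGI:1,Abc"
)
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)

# Index of the first line that does not start with "!"
first_data <- min(grep("^[^!]", trimws(readLines(tmp))))
first_data  # 4

# Skip everything before that line, so the header row is kept
df <- read.csv(tmp, skip = first_data - 1)
names(df)  # "id" "symbol"
```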

Removing '|' from object names?

I have created a data frame in R with the following name:
table_file1_C.txt|file2_C.txt
This name was generated by the assign() function, in reference to a single .txt file that was generated by a program run on the command line. Here is a sample from the loop that created this object:
assign(x=paste("table_",
dir(file.dir, pattern="\\.txt$")[i],
sep=''),
value=tmpTables[[i]])#tmpTables holds the data I'm manipulating, as read in from readHTMLtable
The issue is that I am unable to reference this object after its creation:
>table_file1_C.txt|file2_C.txt
Error: object 'file2_C.txt' not found
I believe that R is seeing the '|' character, and reading it as an instruction, not a part of the object's name, even though it already accepted it as part of the object's name.
So, I need to strip the | from the object's name. I planned to accomplish this with gsub() embedded within the assign() function, using something like this:
assign(x=paste("table_",#creating the name of the object
gsub(x=dir(file.dir, pattern="\\.txt$")[i],
pattern="|",
replacement="."),#need to remove the | characters!!
sep=''),
value=tmpTables[[i]])
However, this output gives something like this:
[1] ".t.a.b.l.e._.f.i.l.e.1...t.x.t.|.f.i.l.e.2...t.x.t."
As you can see, the name has been mangled, and the | has not actually been removed.
I need to find a way to remove the | from the name, so I can process the object that I have created. Or, prevent it from being included in the name in the first place. I can only do this within R, as I cannot modify the output of the program that I used to generate the data.
Does this make sense? Let me know if more information is needed. Thank you for taking the time to read this.
You need to escape the | character in the regular expression. Otherwise it is an empty pattern, which matches everything.
Escaping the character with brackets (character class):
x <- 'a|b'
gsub('[|]', '.', x)
## [1] "a.b"
Escaping with a backslash:
gsub('\\|', '.', x)
## [1] "a.b"
If you don't escape the | character, it is an "or" operation in the regular expression. Nothing or nothing, same as matching nothing. Thus it inserts the . between each character:
gsub('', '.', x)
## [1] ".a.|.b."
gsub('|', '.', x) # Same as above
## [1] ".a.|.b."
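Another option, sketched here as an alternative to escaping, is fixed = TRUE, which makes gsub treat the pattern as a literal string. The file name and the placeholder data frame are assumptions for the demo.

```r
# Hypothetical file name containing the pipe character
file_name <- "file1_C.txt|file2_C.txt"

# fixed = TRUE: "|" is matched literally, no escaping needed
clean <- gsub("|", ".", file_name, fixed = TRUE)
clean  # "file1_C.txt.file2_C.txt"

# Build the object name from the cleaned file name and assign placeholder data
obj_name <- paste0("table_", clean)
assign(obj_name, data.frame(a = 1:3))

nrow(get(obj_name))  # 3
```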
For some reason, escaping with ' ' as per Matthew Lundberg did not work correctly for me, but escaping with ` did.
> 'file1.txt|file2.txt'
[1] "file1.txt|file2.txt"
>`denovo_AR_C.txt|FOXA1_C.txt`
*data*
Thanks go to Matthew
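If the object has already been created under a name containing |, it can still be reached without renaming. This is a minimal sketch with a made-up name and placeholder data: backticks quote a non-syntactic name, and get() looks it up from a string.

```r
# Hypothetical object created under a non-syntactic name
assign("table_a.txt|b.txt", data.frame(a = 1:3))

# Backticks quote the whole name, so "|" is not parsed as an operator
via_backticks <- `table_a.txt|b.txt`

# get() takes the name as a character string
via_get <- get("table_a.txt|b.txt")

identical(via_backticks, via_get)  # TRUE
```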

Finding number of occurrences of a word in a file using R functions

I am using the following code for finding number of occurrences of a word memory in a file and I am getting the wrong result. Can you please help me to know what I am missing?
NOTE1: The question is looking for exact occurrence of word "memory"!
NOTE2: I have realized that they are looking for exactly "memory", and even something like "memory," is not accepted! I guess that was the part that caused the confusion. I tried it for the word "action" and the correct answer is 7! You can try as well.
#names=scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
> length(grep("memory",names))
[1] 9
Here's the file
The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function scan encounters an apostrophe it assumes it is the start of a quoted string and reads all characters up until the next apostrophe into a single entry of your names array. One of these long entries happens to include two instances of the word "memory" and so reduces the total number of matches by one.
You can fix the problem by telling scan to regard all quotation marks as normal characters and not treat them specially:
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
Be careful when using the R implementation of grep. It does not behave in exactly the same way as the usual GNU/Linux program. In particular, the way you have used it here WILL find the number of matching words and not just the total number of matching lines as some people have suggested.
As pointed out by @andrew, my previous answer would give wrong results if a word repeats on the same line. Based on other answers/comments, this one seems ok:
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)
length(idxs)
# [1] 10
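The difference between partial, exact and case-insensitive counts can be sketched on a few made-up lines standing in for the file, so the numbers below say nothing about the real text. scan is called with quote = "" so that apostrophes are read as ordinary characters.

```r
# Hypothetical lines standing in for the text file
txt <- c(
  "while memory holds a seat",
  "from the table of my memory, I'll wipe away",
  "'tis in my memory lock'd",
  "Memory and memories"
)

# quote = "" disables quote handling, so each whitespace-separated token is one entry
words <- scan(text = txt, what = character(), quote = "", quiet = TRUE)

# grep counts entries that contain "memory", including "memory,"
partial <- length(grep("memory", words))
partial  # 3

# == counts only entries that are exactly "memory"
exact <- sum(words == "memory")
exact  # 2

# Lower-casing and stripping punctuation first also counts "memory," and "Memory"
cleaned <- gsub("[[:punct:]]", "", tolower(words))
loose <- sum(cleaned == "memory")
loose  # 4
```

Which of the three counts is the right one depends on how the exercise defines an occurrence, so it is worth checking against a word with a known answer, as the question does with "action".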

Resources