How to rename file name from Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs to Pbs_d7_s2.fcs
For multiple files keeping in mind that _juliam_08July2020_02_1_0_live singlets is not the same for all files?
It's a bit unclear what you're asking for, but it looks like you only want to keep the chunks before the third underscore. If so, you can tackle this with regular expressions. The regular expression
str_extract(input_string, "^(([^_])+_){3}")
will extract the first 3 blocks of non-underscore characters that each end in an underscore. The first ^ "anchors" the match to the beginning of the string, the "[^_]+_" matches one or more non-underscore characters followed by an underscore, and the {3} repeats the preceding group 3 times.
So for "Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs" you'll end up with "Pbs_d7_s2_". Now you just replace the last underscore with ".fcs" like so
str_replace(modified_string, "_$", ".fcs")
The $ "anchors" the characters that precede it to the end of the string so in this case it's replacing the last underscore. The full sequence is
library(stringr)
library(magrittr)  # provides the %>% pipe
string1 <- "Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs"
str_extract(string1, "^(([^_])+_){3}") %>%
  str_replace("_$", ".fcs")
[1] "Pbs_d7_s2.fcs"
Now let's assume your filenames are in a vector named stringvec.
output <- vector("character", length(stringvec))
for (i in seq_along(stringvec)) {
  output[[i]] <- str_extract(stringvec[[i]], "^(([^_])+_){3}") %>%
    str_replace("_$", ".fcs")
}
output
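As an aside, the stringr functions are vectorized over their first argument, so the loop isn't strictly necessary. A sketch (the second filename here is made up for illustration):

```r
library(stringr)

# Hypothetical example vector; the second name is invented for illustration
stringvec <- c("Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs",
               "Ctrl_d3_s1_anna_01Aug2020_05_2_1_live singlets.fcs")

# str_extract and str_replace operate on whole vectors at once
output <- str_replace(str_extract(stringvec, "^(([^_])+_){3}"), "_$", ".fcs")
output
# [1] "Pbs_d7_s2.fcs"  "Ctrl_d3_s1.fcs"
```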
I'm making some assumptions here - namely that the naming convention is the same for all of your files. If that's not true you'll need to find ways to modify the regex search pattern.
For the actual renaming step I recommend this answer: How do I rename files using R?. If you have a vector of original file names, you can use my for loop to generate a vector of new names, and then use the information in the link to replace one with the other. There may also be solutions that avoid for loops.
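Following the approach in that linked answer, the renaming step can be sketched like this; file.rename is vectorized over both arguments (the vectors below are hypothetical and assume old and new names are in matching order):

```r
# Hypothetical vectors of old and new names, in matching order
old_names <- c("Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs")
new_names <- c("Pbs_d7_s2.fcs")

# file.rename returns a logical vector, TRUE for each successful rename
file.rename(from = old_names, to = new_names)
```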
I wish to use R to read multiple csv files from a single folder. If I wanted to read every csv file I could use:
list.files(folder, pattern="*.csv")
See, for example, these questions:
Reading multiple csv files from a folder into a single dataframe in R
Importing multiple .csv files into R
However, I only wish to read one of four subsets of the files at a time. Below is an example grouping of four files each for three models.
JS.N_Nov6_2017_model220_N200.csv
JS.N_Nov6_2017_model221_N200.csv
JS.N_Nov6_2017_model222_N200.csv
my.IDs.alt_Nov6_2017_model220_N200.csv
my.IDs.alt_Nov6_2017_model221_N200.csv
my.IDs.alt_Nov6_2017_model222_N200.csv
parms_Nov6_2017_model220_N200.csv
parms_Nov6_2017_model221_N200.csv
parms_Nov6_2017_model222_N200.csv
supN_Nov6_2017_model220_N200.csv
supN_Nov6_2017_model221_N200.csv
supN_Nov6_2017_model222_N200.csv
If I only wish to read, for example, the parms files I try the following, which does not work:
list.files(folder, pattern="parm*.csv")
I am assuming that I may need to use regex to read a given group of the four groups present, but I do not know.
How can I read each of the four groups separately?
EDIT
I am unsure whether I would have been able to obtain the solution from answers to this question:
Listing all files matching a full-path pattern in R
I may have had to spend a fair bit of time brushing up on regex to apply those answers to my problem. The answer provided below by Mako212 is outstanding.
A quick REGEX 101 explanation:
For the case of matching the beginning and end of the string, which is all you need to do here, the following principles apply to match files that are .csv and start with parm:
list.files(folder, pattern="^parm.*?\\.csv")
^ asserts we're at the beginning of the string, so ^parm means match parm, but only if it's at the beginning of the string.
.*? means match anything up until the next part of the pattern matches. In this case, match until we see a period \\.
. means match any character in REGEX, so we need to escape it with \\ to match the literal . (note that in R you need the double escape \\; in other languages a single escape \ is sufficient).
Finally csv means match csv after the .. If we were going to be really thorough, we might use \\.csv$, using the $ to indicate the end of the string. You'd need the dollar sign if you had other files with an extension like .csv2: \\.csv would match .csv2, whereas \\.csv$ would not.
In your case, you could simply replace parm in the REGEX pattern with JS, my, or supN to select one of your other file types.
Finally, if you wanted to match a subset of your total file list, you could use the | logical "or" operator:
list.files(folder, pattern = "^(parm|JS|supN).*?\\.csv")
Which would return all the file names except the ones that start with my.
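As a quick check of the pattern against a few of the filenames from the question (grepl applies the same regex matching that list.files uses for its pattern argument):

```r
fnames <- c("parms_Nov6_2017_model220_N200.csv",
            "JS.N_Nov6_2017_model220_N200.csv",
            "my.IDs.alt_Nov6_2017_model220_N200.csv")

# Only the name anchored to start with "parm" matches
grepl("^parm.*?\\.csv", fnames)
# [1]  TRUE FALSE FALSE
```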
The list.files statement shown in the question is using globs but list.files accepts regular expressions, not globs.
Sys.glob: To use globs, use Sys.glob like this:
olddir <- setwd(folder)
parm <- lapply(Sys.glob("parm*.csv"), read.csv)
parm is now a list of data frames read in from those files.
glob2rx: Note that the glob2rx function can be used to convert globs to regular expressions:
parm <- lapply(list.files(folder, pattern = glob2rx("parm*.csv")), read.csv)
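To see what glob2rx actually produces for this glob (my understanding of its output, worth confirming in your own session):

```r
# glob2rx anchors the pattern, converts * to .*, and escapes the literal dot
glob2rx("parm*.csv")
# [1] "^parm.*\\.csv$"
```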
I've got some kind of logfile I'd like to read and analyse. Unfortunately the files are saved in a pretty "ugly" way (with lots of special characters in between), so I'm not able to read in just the lines with each one being an entry. The only way to separate the different entries is using regular expressions, since the beginning of each entry follows a specified pattern.
My first approach was to identify the pattern in the character vector (I use read_file from the readr package) and use the corresponding positions to split the vector with strsplit. Unfortunately the positions don't always seem to match, since the result doesn't always correspond to the entries (I'd guess there's a problem with the special characters).
A typical line of the file looks as follows:
16/10/2017, 21:51 - George: This is a typical entry here
The corresponding regular expressions looks as follows:
([[:digit:]]{2})/([[:digit:]]{2})/([[:digit:]]{4}), ([[:digit:]]{2}):([[:digit:]]{2}) - ([[:alpha:]]+):
The first thing I want is a data.frame with each line corresponding to a specific entry (in a next step I'd split the pattern into its different parts).
What I tried so far was the following:
regex.log = "([[:digit:]]{2})/([[:digit:]]{2})/([[:digit:]]{4}), ([[:digit:]]{2}):([[:digit:]]{2}) - ([[:alpha:]]+):"
log.regex = gregexpr(regex.log, file.log)[[1]]
log.splitted = substring(file.log, log.regex, log.regex[2:355]-1)
As can be seen, this logfile has 355 entries. The first ones are separated correctly. How can I split the character vector using a regular expression without losing the information matched by the regular expression/pattern?
Use capturing and non-capturing groups to identify the parts you want to keep, and be sure to use anchors:
file.log = "16/10/2017, 21:51 - George: This is a typical entry here"
regex.log = "^((?:[[:digit:]]{2})\\/(?:[[:digit:]]{2})\\/(?:[[:digit:]]{4}), (?:[[:digit:]]{2}):(?:[[:digit:]]{2}) - (?:[[:alpha:]]+)): (.*)$"
gsub(regex.log,"\\1",file.log)
>> "16/10/2017, 21:51 - George"
gsub(regex.log,"\\2",file.log)
>> "This is a typical entry here"
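To go one step further and get each part into its own column, base R's regexec/regmatches can pull out every captured group in one pass. A sketch using the example line (the column names are my own choice, not from the question):

```r
file.log <- "16/10/2017, 21:51 - George: This is a typical entry here"
regex.parts <- "^([[:digit:]]{2}/[[:digit:]]{2}/[[:digit:]]{4}), ([[:digit:]]{2}:[[:digit:]]{2}) - ([[:alpha:]]+): (.*)$"

# regexec returns positions of the full match and each group;
# regmatches extracts the corresponding substrings
m <- regmatches(file.log, regexec(regex.parts, file.log))[[1]]
# m[2:5]: "16/10/2017" "21:51" "George" "This is a typical entry here"

log.df <- data.frame(date = m[2], time = m[3], name = m[4], text = m[5])
```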
I have some files in a directory, all with the same extension. I want to list all the files, but ignore the file names that contain certain strings. I tried grepl using this answer: using-r-to-list-all-files-with-a-specified-extension.
For instance, in this example I would like to exclude the files that contain 'B'. So I tried,
file_names <- c('AA','BA','KK','CB')
files <- paste0(file_names,'.txt')
Filter_files <- files[-grepl('.*B.txt|.B*.txt', files)]
Filter_files
"BA.txt" "KK.txt" "CB.txt"
Interestingly, only AA.txt was excluded!
This will work:
file_names <- c('AA','BA','KK','CB')
files <- paste0(file_names,'.txt')
Filter_files <- files[!grepl('.*B.*\\.txt', files)]
Filter_files
## "AA.txt" "KK.txt"
These are the changes I made:
Instead of -, the grepl is preceded by !, which negates the result of the grepl (i.e., the TRUE results become FALSE and vice versa).
To capture all the Bs regardless of where they are located, I search for any character (that's what . indicates) appearing 0 or more times (indicated by the * sign). This way whether the B is at the beginning or the end of the filename it is equally captured.
Since . has the special meaning of 'any character' in regex, to have a literal dot in the expression you have to escape it, thus \\. before the txt extension.
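For completeness, here is (I believe) why the original -grepl version kept three files: because the dots were unescaped, the pattern matched every filename, and negative indexing then ignored the duplicated -1 values:

```r
files <- c("AA.txt", "BA.txt", "KK.txt", "CB.txt")

# The unescaped . matches any character, so every name matches the pattern
grepl('.*B.txt|.B*.txt', files)
# [1] TRUE TRUE TRUE TRUE

# - coerces the logicals to numeric, giving c(-1, -1, -1, -1);
# duplicate negative indices are ignored, so only element 1 is dropped
files[-grepl('.*B.txt|.B*.txt', files)]
# [1] "BA.txt" "KK.txt" "CB.txt"
```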
I looked around both here and elsewhere, I found many similar questions but none which exactly answer mine. I need to clean up naming conventions, specifically replace/remove certain words and phrases from a specific column/variable, not the entire dataset. I am migrating from SPSS to R, I have an example of the code to do this in SPSS below, but I am not sure how to do it in R.
EG:
"Acadia Parish" --> "Acadia" (removes Parish and space before Parish)
"Fifth District" --> "Fifth" (removes District and space before District)
SPSS syntax:
COMPUTE county=REPLACE(county,' Parish','').
There are only a few instances of this issue in the column (which has 32,000 cases), and what needs replacing or removing varies; the affected phrases can also repeat (there are dozens of instances containing 'Parish'). This means it's much faster to code exactly what needs to be removed or replaced; it's not as simple or clean as a regular expression that removes all spaces, all characters after a specific word or character, all special characters, etc. And it must include leading spaces.
I have looked at the replace() gsub() and other similar commands in R, but they all involve creating vectors, or it seems like they do. What I'd like is syntax that looks for characters I specify, which can include leading or trailing spaces, and replaces them with something I specify, which can include nothing at all, and if it does not find the specific characters, the case is unchanged.
Yes, I will end up repeating the same syntax many times, it's probably easier to create a vector but if possible I'd like to get the syntax I described, as there are other similar operations I need to do as well.
Thank you for looking.
> x <- c("Acadia Parish", "Fifth District")
> x2 <- gsub("^(\\w*).*$", "\\1", x)
> x2
[1] "Acadia" "Fifth"
Legend:
^ Start of pattern.
() Group (or token).
\w* Zero or more word characters (* means "zero or more", not "one or more").
.* Zero or more occurrences of any character.
$ End of pattern.
\1 Backreference: returns the first captured group from the regexp.
Maybe I'm missing something, but I don't see why you can't simply use alternation in your regex pattern, then trim out the annoying white space.
string <- c("Arcadia Parish", "Fifth District")
bad_words <- c("Parish", "District") # Write all the words you want removed here!
bad_regex <- paste(bad_words, collapse = "|")
trimws( sub(bad_regex, "", string) )
# [1] "Arcadia" "Fifth"
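Applied to a data frame column, the same idea looks like this (a sketch with made-up data; rows with nothing to remove pass through unchanged):

```r
# Hypothetical data; the third value has no listed word and stays unchanged
df <- data.frame(county = c("Acadia Parish", "Fifth District", "Orleans"),
                 stringsAsFactors = FALSE)

bad_words <- c("Parish", "District")
bad_regex <- paste(bad_words, collapse = "|")

df$county <- trimws(sub(bad_regex, "", df$county))
df$county
# "Acadia" "Fifth" "Orleans"
```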
dataframename$varname <- gsub(" Parish","", dataframename$varname)
I have a folder which contains files with the following names: "5_45name.Rdata" and "15_45name.Rdata".
I would like to list only the ones starting with "5", in the example above this means that I would like to exclude "15_45name.Rdata".
Using list.files(pattern="5_*name.Rdata"), however, will list both of these files.
Is there a way to tell list.files() that I want the filename to start with a specific character?
We need to use the metacharacter ^ to specify the start of the string, followed by the number 5. So it can be a more specific pattern like the one below:
list.files(pattern ="^5[0-9]*_[0-9]*name.Rdata")
Or, more concisely, if we are not concerned about the _ and the other numbers following it:
list.files(pattern = "^5.*name.Rdata")
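A quick check with the two filenames from the question (grepl uses the same regex matching that list.files applies to its pattern argument):

```r
fnames <- c("5_45name.Rdata", "15_45name.Rdata")

# The ^ anchor rejects the name that merely contains a 5
grepl("^5.*name.Rdata", fnames)
# [1]  TRUE FALSE
```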