I've got a logfile I'd like to read and analyse. Unfortunately the files are saved in a pretty "ugly" way (with lots of special characters in between), so I'm not able to read them in line by line with each line being an entry. The only way to separate the entries is with a regular expression, since the beginning of each entry follows a fixed pattern.
My first approach was to identify the pattern in the character vector (I use read_file from the readr package) and use the corresponding positions to split the vector with strsplit. Unfortunately the positions don't always seem to match, since the result doesn't always correspond to the entries (I'd guess there's a problem with the special characters).
A typical line of the file looks as follows:
16/10/2017, 21:51 - George: This is a typical entry here
The corresponding regular expression looks as follows:
([[:digit:]]{2})/([[:digit:]]{2})/([[:digit:]]{4}), ([[:digit:]]{2}):([[:digit:]]{2}) - ([[:alpha:]]+):
The first thing I want is a data.frame with each line corresponding to a specific entry (in a subsequent step I'd split the pattern into its different parts).
What I tried so far was the following:
regex.log = "([[:digit:]]{2})/([[:digit:]]{2})/([[:digit:]]{4}), ([[:digit:]]{2}):([[:digit:]]{2}) - ([[:alpha:]]+):"
log.regex = gregexpr(regex.log, file.log)[[1]]
log.splitted = substring(file.log, log.regex, log.regex[2:355]-1)
As can be seen, this logfile has 355 entries. The first ones are separated correctly. How can I split the character vector using a regular expression without losing the information of the regular expression/pattern?
Use capturing and non-capturing groups to identify the parts you want to keep, and be sure to use anchors:
file.log = "16/10/2017, 21:51 - George: This is a typical entry here"
regex.log = "^((?:[[:digit:]]{2})\\/(?:[[:digit:]]{2})\\/(?:[[:digit:]]{4}), (?:[[:digit:]]{2}):(?:[[:digit:]]{2}) - (?:[[:alpha:]]+)): (.*)$"
gsub(regex.log, "\\1", file.log)
[1] "16/10/2017, 21:51 - George"
gsub(regex.log, "\\2", file.log)
[1] "This is a typical entry here"
I have a script that was very kindly provided to me a while ago which allows me to generate input files by inserting coordinates from a series of .xyz files into a template file (Create new files by copying contents of coordinate files into template file).
I'm trying to adapt that script to do something very similar, but different in a slight yet annoying way. In the script, the new directories created to house these new files are named like this:
# File name is in the form '....Hnnn.xyz';
# this will parse nnn from that name.
local inputNumber=$coordFile
# Remove '.xyz'.
inputNumber=${inputNumber%.xyz}
# Remove everything up to and including the 'H'.
inputNumber=${inputNumber##*H}
# Subdirectory name is based on the input number.
local outDir=$baseDir/D$inputNumber
# Create the directory if it doesn't exist.
if [[ ! -d $outDir ]]; then
mkdir $outDir
fi
This worked for my last problem, because the files were all named in the form xxxx_DH000.xyz. However, now the files I have are named using the form xxxx.000.xyz. While everything else in the script works, I cannot figure out how to name the new directories in the form 000.
The line in the script which I think needs to be edited slightly is where it says inputNumber=${inputNumber##*H}. What I cannot figure out is how to get the script to delete everything up to but not including a 0. I've searched online, but the only questions/answers I've found relating to renaming files by stripping part of the original names speak about deleting everything 'up to and including' a string.
I was able to generate directories named 1, 2, 3, etc. with inputNumber=${inputNumber##*0}; however, I want all three digits present (i.e. I would like to create directories 001, 002, 003, etc.).
As an aside, I cannot use the . as the cutoff point, as there are multiple .s in each file name. An example of one of the file names is tma.h2s-2-pes-b97m-d4-tz.011.xyz.
Is there some way to get the script to simply name the files based on the full three digit number?
Although it's not needed in this case, zsh does support deleting text just before a matched pattern in a string. These parameter expansions will remove everything prior to the first 0 in the string, but keep the 0:
inputNumber='tma.h2s-2-pes-b97m-d4-tz.011.xyz'
inputNumber=${inputNumber:r} # remove '.xyz'
inputNumber=${(SM)inputNumber##0*}
print ${inputNumber}
# ==> 011
This includes a few zsh-isms:
${...:r} returns the 'root' of a filename, removing the extension.
(S) - parameter expansion flag to change the behavior of the ## expansion. It will now search for patterns in the middle of a string, not just at the beginning.
(M) - flag to include the pattern match (the 0*) in the result.
This depends on the number always starting with 0, which may not be a good choice - what file comes after 099?
This next version uses a zsh extended glob pattern to find a number between two periods, and returns that number - i.e. it will find the number in .11., .011., or .2345., but not in .x11.:
coordFile='tma.h2s-2-pes-b97m-d4-tz.022.xyz'
inputNumber=${(*)coordFile//(#b)*.(<->).*/${match}}
print ${inputNumber}
# ==> 022
Some of the pieces:
${...//.../...} - substitution expansion.
(*) - enables extendedglob for this expansion.
(#b) - globbing flag to enable 'backreferences', so that $match will work.
<-> - matches a number. This can be restricted to a range if needed, like <100-199>.
(<->) - puts the number into a match group.
*. and .* - everything before and after the number; these are not in the match group.
${match} - the matched string from the parenthesized part of the pattern. This is used as the replacement for the entire string, so we get just the number. If more than one part of the input string matches the pattern, this will be the last one. match is actually an array, but since there's only one match group in the pattern, it does not need to be indexed with ${match[1]}.
This variant uses a standard regular expression to find the number:
coordFile='tma.h2s-2-pes-b97m-d4-tz.033.xyz'
match=
[[ $coordFile =~ .*\\.([[:digit:]]+)\\..* ]]
inputNumber=${match[1]}
print ${inputNumber}
# ==> 033
After the [[ ]] test, the match array will contain matches from any parenthesized groups in the regular expression - here, that will be a set of one or more digits in between two periods / full stops.
But, as @choroba and Fravadona have noted, since the number will always be at the end of the string, you can use the standard #/##/%/%% expansions to remove parts of the string based only on the .s. This is a common idiom that will be familiar to many shell programmers, and it will also work in bash (note that other parts of your original script depend on zsh).
inputNumber='tma.h2s-2-pes-b97m-d4-tz.044.xyz'
inputNumber=${inputNumber%.xyz}
inputNumber=${inputNumber##*.}
print ${inputNumber}
# ==> 044
In zsh everything can be consolidated into a single nested substitution:
baseDir='files/are/here'
coordFile='tma.h2s-2-pes-b97m-d4-tz.055.xyz'
local outDir=$baseDir/D${${coordFile:r}##*.}
print $outDir
# ==> files/are/here/D055
How do I rename a file from Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs to Pbs_d7_s2.fcs?
I need to do this for multiple files, keeping in mind that _juliam_08July2020_02_1_0_live singlets is not the same for all files.
It's a bit unclear what you're asking for, but it looks like you only want to keep the chunks before the third underscore. If so, you can tackle this with regular expressions. The regular expression
str_extract(input_string, "^(([^_])+_){3}")
will extract the first 3 blocks of non-underscore characters that each end in an underscore. The ^ "anchors" the match to the beginning of the string, the "[^_]+_" matches one or more non-underscore characters followed by an underscore, and the {3} repeats the preceding group 3 times.
So for "Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs" you'll end up with "Pbs_d7_s2_". Now you just replace the last underscore with ".fcs" like so
str_replace(modified string, "_$", ".fcs")
The $ "anchors" the characters that precede it to the end of the string so in this case it's replacing the last underscore. The full sequence is
string1<- "Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs"
str_extract(string1, "^(([^_])+_){3}") %>%
str_replace("_$",".fcs")
[1] "Pbs_d7_s2.fcs"
Now let's assume your filenames are in a vector named stringvec.
output <- vector("character",length(stringvec))
for (i in seq_along(stringvec)) {
  output[[i]] <- str_extract(stringvec[[i]], "^(([^_])+_){3}") %>%
    str_replace("_$", ".fcs")
}
output
I'm making some assumptions here - namely that the naming convention is the same for all of your files. If that's not true you'll need to find ways to modify the regex search pattern.
I recommend this answer How do I rename files using R? for replacing vectors of file names. If you have a vector of original file names you can use my for loop to generate a vector of new names, and then you can use the information in the link to replace one with the other. Perhaps there are other solutions not involving for loops.
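As one such alternative: str_extract() and str_replace() are vectorised, so the loop can be collapsed into a couple of lines. A minimal sketch, assuming stringvec holds the original file names and that you actually want to rename the files on disk (new_names is just an illustrative name; file.rename is base R):
library(stringr)
new_names <- str_replace(str_extract(stringvec, "^(([^_])+_){3}"), "_$", ".fcs")
file.rename(stringvec, new_names)  # renames the files in the current working directory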
I have a question.
My text file contains lines such as:
1.1 Description.
This is the description.
1.1.1 Quality Assurance
Random sentence.
1.6.1 Quality Control. Quality Control is the responsibility of the contractor.
I'm trying to find out how to get:
1.1 Description
1.1.1 Quality Assurance
1.6.1 Quality Control
Right now, I have:
txt1 <- readLines("text1.txt")
txt2 <- grep("^[0-9.]+", txt1, value = TRUE)
write(txt2, "text3.txt")
which results in:
1.1 Description.
1.1.1 Quality Assurance
1.6.1 Quality Control. Quality Control is the responsibility of the contractor.
You are using grep with value=TRUE, which
returns a character vector containing the selected elements of x
(after coercion, preserving names but no other attributes).
This means that if your regular expression matches anything in the line, the whole line will be returned. You built your regular expression to match numbers at the beginning of the line, so all the lines which begin with numbers get selected.
It seems that your goal is not to select the whole line, but to select only up to a period or the end of the line.
So, you need to adjust the regular expression to be more specific, and you need to extract only the matching portion of the line.
A regular expression that matches what you want can be:
"^([0-9]\\.?)+ .+?(\\.|$)"
It matches numbers separated by dots, followed by a space, followed by anything, and it stops matching when a . appears or the line ends. I recommend the following website to better understand what the regex does: https://regexr.com/
The next step is extracting from the given lines only the matching portion, and not the whole line where the regex has a match. For this we'll use the function regexpr, which tells us where the matches are, and the function regmatches, which helps us extract those matches:
txt1 <- readLines("text.txt")
regmatches(txt1, regexpr("^([0-9]\\.?)+ .+?(\\.|$)", txt1))
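As a small follow-up sketch, the extracted headings can be written back out to the "text3.txt" file from the question; the extra sub() call drops any trailing period so the output matches the desired result exactly:
txt2 <- regmatches(txt1, regexpr("^([0-9]\\.?)+ .+?(\\.|$)", txt1))
txt2 <- sub("\\.$", "", txt2)  # drop a trailing period, if any
writeLines(txt2, "text3.txt")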
attach.files = c(paste("/users/joesmith/nosection_", currentDate,".csv",sep=""),
paste("/users/joesmith/withsection_", currentDate,".csv",sep=""))
Basically, if I did it like
c("nosection_051418.csv", "withsection_051418.csv")
And if I did that manually it would work fine, but since I'm automating this to run every day I can't do that.
I'm trying to attach files to an automated email, but when I structure it like this, it doesn't work. How can I rewrite this so that the character vector is built correctly?
I thought your example implied the need for "parallel" inputs for the path stem, the first portion of the file name, and the date portion of those full paths. Consider this illustration of using a two-item vector and a one-item vector (produced by Sys.Date, replacing your currentDate) to populate the %s positions in that sprintf string (suggested by @Gregor):
sprintf("/users/joesmith/%s_%s.csv", c("nosection", "withsection"), Sys.Date() )
[1] "/users/joesmith/nosection_2018-05-14.csv" "/users/joesmith/withsection_2018-05-14.csv"
I'm almost certain this has been asked before, but thanks to a certain social media app I'm drowning in unrelated search results.
The data set that I'm importing contains literal "#" characters, as in Apartment #404, and I'd like to preserve them if possible, but R seems to treat them as an end-of-line marker or something. At first it would bomb out on the first occurrence; then I set fill=TRUE and now it just ignores the rest of the line after the #.
How does one instruct R to treat #'s as regular characters?
If you are not using "#" as a comment symbol in your data, you can use
read.table(..., comment.char="")
That should treat "#" like any other character.
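A minimal sketch of what that call might look like (the file name, separator, and fill=TRUE are just assumptions based on your description):
# comment.char = "" turns off comment handling, so "#" is read as an ordinary character
dat <- read.table("apartments.txt", sep = "\t", header = TRUE, fill = TRUE, comment.char = "")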