R: Using a regular expression to swap strings separated by a comma

So I have this list of names:
library(stringr)
names <- c("stewart,pat", "peterson,greg")
from which I extract only the lastname,firstname items with the following regular expression:
myregexpr <- "(\\w+),(\\w+)?"
str_view(str_extract_all(names, myregexpr), myregexpr)
This yields a view like:
stewart,pat
peterson,greg
My question: Is there a way for me to write the regular expression such that the result would instead look like:
pat_stewart
greg_peterson
i.e. where the result is first_last? I believe there is a way to do it, as I've seen it on other, similar questions. I've tried:
myregexpr <- "(\\w+),(\\w+)?\\2_\\1"
but that returns only `character(0)`. I've attempted many versions - some of which crash RStudio. Any ideas?
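The swap actually happens in the replacement string rather than in the pattern itself: capture the two name parts and write them back in reverse order with backreferences. A minimal sketch with base R's sub (stringr::str_replace takes the same pattern and replacement), reusing the names vector from above:
# "\\2" and "\\1" in the replacement refer back to the second and first captured group
sub("(\\w+),(\\w+)", "\\2_\\1", names)
[1] "pat_stewart"   "greg_peterson"
stringr::str_replace(names, "(\\w+),(\\w+)", "\\2_\\1")
[1] "pat_stewart"   "greg_peterson"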

Related

Create list from column and filter from the resulting list

The below code allows simple filtering of a list:
#Filter to applicable codes only
ICS_List <- "QMM|QJG|QH8|QM7|QUE|QHG"
EofEMSOAs <- EofEMSOAs[grep(ICS_List, EofEMSOAs$Code),]
What I am looking to do instead is take all the data from a column in another dataframe within the project and use grep to filter for the values contained in that column - there could be hundreds, so typing a manual list is not practical.
I have tried the below, but it results in the error 'argument 'pattern' has length > 1 and only the first element will be used'. It seems that using dplyr in this way does not create the same output as manually typing in a list, which is what throws the error, so I only get one result.
#To filter from required dataframe 'EofEMSOAsIMD'
EofEMSOAsCodeListOnly <- dplyr::pull(EofEMSOAsIMD, "Area Code")
EofEMSOAsFinalList <- EofEMSOAs[grep(EofEMSOAsCodeListOnly, EofEMSOAs$msoa11cd),]
Could anyone please amend the above so that it works using logic similar to the code at the top of this question, namely: 1. a list created from a column, 2. the dataframe filtered for matches to that list? Thank you.
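One way to keep the grep logic from the top of the question is to collapse the pulled column into a single alternation pattern, so that grep receives one pattern of length 1 instead of a whole vector. A sketch reusing the object names from the question (the intermediate CodePattern name is just illustrative); if the codes should match exactly rather than as regex fragments, %in% avoids building a pattern at all:
EofEMSOAsCodeListOnly <- dplyr::pull(EofEMSOAsIMD, "Area Code")
#Collapse the codes into one "code1|code2|..." pattern
CodePattern <- paste(unique(EofEMSOAsCodeListOnly), collapse = "|")
EofEMSOAsFinalList <- EofEMSOAs[grep(CodePattern, EofEMSOAs$msoa11cd), ]
#Exact-match alternative, no regex involved
EofEMSOAsFinalList <- EofEMSOAs[EofEMSOAs$msoa11cd %in% EofEMSOAsCodeListOnly, ]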

Rename a column with R

I'm trying to rename a specific column in my R script using the colnames function, but with no success so far.
I'm kind of new to programming, so it may be something simple to solve.
Basically, I'm trying to rename a column called Reviewer Overall Notes to Nota Final in a data frame called notas with the code:
colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
and it returns to me:
> colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
Error: object 'Nota Final' not found
I also found in another post a code that goes:
colnames(notas) [13] <- `Nota Final`
But it also returns the same message.
What am I doing wrong?
Ps.: Sorry for any misspelling, English is not my primary language.
You probably want
colnames(notas)[colnames(notas) == "Reviewer Overall Notes"] <- "Nota Final"
(@Whatif's answer shows how you can do this with the numeric index, but it's probably better practice to do it this way; working with strings rather than column indices makes your code both easier to read [you can see what you're renaming] and more robust [in case the order of columns changes in the future])
Alternatively,
notas <- notas %>% dplyr::rename(`Nota Final` = `Reviewer Overall Notes`)
Here you do use back-ticks, because tidyverse (of which dplyr is a part) prefers its arguments to be passed as symbols rather than strings.
Why use backticks? Use normal quotation marks instead.
colnames(notas)[13] <- 'Nota Final'
This seems to matter:
df <- data.frame(a = 1:4)
colnames(df)[1] <- `b`
Error: object 'b' not found
You should not use single or double quotes in naming:
I have learned that we should not use spaces in names. A name with spaces still works, but it is called a non-syntactic name, and according to Hadley Wickham's description in the Advanced R book this is due to historical reasons:
"You can also create non-syntactic bindings using single or double quotes (e.g. "_abc" <- 1) instead of backticks, but you shouldn’t, because you’ll have to use a different syntax to retrieve the values. The ability to use strings on the left hand side of the assignment arrow is an historical artefact, used before R supported backticks."
To get an overview of what syntactic names are, use ?make.names:
make.names("Nota Final")
[1] "Nota.Final"

Match everything up until first instance of a colon

Trying to code up a Regex in R to match everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?
Your regex seems fine in my opinion, but I don't think you should use grep here. Also, you are missing perl=TRUE, which is why you are getting the error.
I would recommend using :
stringr::str_extract( time, "\\d+?(?=:)")
grep is a little different from how it is being used here; it is good for matching separate values and filtering out those which have a similar pattern, but you can't pluck out values within a string using grep.
If you want to use Base R you can also go for sub:
sub("^(\\d+?)(?=:)(.*)$","\\1",time, perl=TRUE)
Also, you may split the string using strsplit and take the first element, like below:
strsplit(time, ":")[[1]][1]
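As one more base R option, the original lookahead pattern can be used directly once perl = TRUE is set: regexpr() locates the match and regmatches() plucks it out of the string, which is exactly what grep() cannot do.
time <- "12:05:41"
# regexpr() finds the first match, regmatches() extracts it from the string
regmatches(time, regexpr(".+?(?=:)", time, perl = TRUE))
[1] "12"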

Use LaF and grepl together

I would like to read in a possibly large text file and filter the relevant lines on the fly based on a regular expression. My first approach was using the package LaF which supports chunkwise reading and then grepl to filter. However, this seems not to work:
library(LaF)
fh <- laf_open_csv("myfile.txt", column_types="string", sep="°")
# would be nice to declare *no* separator
fh[grepl("abc", fh[[1]]), ]
returns an error in as.character.default(x) -- no method to convert this S4 object to character. It seems like grepl is applied to the S4 object itself and not to the chunks.
Is there a nice way to read text lines from a large file and filter them selectively?
OK, I just discovered process_blocks:
regfilter <- function(df, result) c(result, df[grepl("1745", df[[1]]),1])
process_blocks(fh, regfilter)
This works; now I only need to find a way to ignore separators.
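A plain base R alternative sketch, assuming the file name "myfile.txt" and the pattern "abc" from the question: read the file chunkwise with readLines and keep only the matching lines, so no separator has to be declared at all.
con <- file("myfile.txt", open = "r")
keep <- character(0)
repeat {
  chunk <- readLines(con, n = 10000)  # read up to 10,000 raw lines per chunk
  if (length(chunk) == 0) break       # stop at end of file
  keep <- c(keep, chunk[grepl("abc", chunk)])
}
close(con)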

R - Split character vector using regex

I've got some kind of logfile I'd like to read and analyse. Unfortunately the files are saved in a pretty "ugly" way (with lots of special characters in between), so I'm not able to read in just the lines with each one being an entry. The only way to separate the different entries is using regular expressions, since the beginning of each entry follows a specified pattern.
My first approach was to identify the pattern in the character vector (I use read_file from the readr-package) and use the corresponding positions to split the vector with strsplit. Unfortunately the positions seem not always to match, since the result doesn't always correspond to the entries (I'd guess that there's a problem with the special characters).
A typical line of the file looks as follows:
16/10/2017, 21:51 - George: This is a typical entry here
The corresponding regular expression looks as follows:
([[:digit:]]{2})/([[:digit:]]{2})/([[:digit:]]{4}), ([[:digit:]]{2}):([[:digit:]]{2}) - ([[:alpha:]]+):
The first thing I want is a data.frame with each line corresponding to a specific entry (in a next step I'd split the pattern into its different parts).
What I tried so far was the following:
regex.log = "([[:digit:]]{2})/([[:digit:]]{2})/([[:digit:]]{4}), ([[:digit:]]{2}):([[:digit:]]{2}) - ([[:alpha:]]+):"
log.regex = gregexpr(regex.log, file.log)[[1]]
log.splitted = substring(file.log, log.regex, log.regex[2:355]-1)
As can be seen, this logfile has 355 entries. The first ones are separated correctly. How can I separate the character vector using a regular expression without losing the information of the regular expression/pattern?
Use capturing and non-capturing groups to identify the parts you want to keep, and be sure to use anchors:
file.log = "16/10/2017, 21:51 - George: This is a typical entry here"
regex.log = "^((?:[[:digit:]]{2})\\/(?:[[:digit:]]{2})\\/(?:[[:digit:]]{4}), (?:[[:digit:]]{2}):(?:[[:digit:]]{2}) - (?:[[:alpha:]]+)): (.*)$"
gsub(regex.log,"\\1",file.log)
>> "16/10/2017, 21:51 - George"
gsub(regex.log,"\\2",file.log)
>> "This is a typical entry here"
