Use regular expression to change column names - r

The df column names are, for instance:
split_the_tree1.zip.name, split_the_tree2.zip.region, split_the_tree3.zip.type
I would like to remove everything that comes before '.zip.' including the '.zip.'
To end up with column names: name, region, type
I tried something like this:
gsub(".*\zip\/", "", colnames(df))
gsub('*.zip', '', colnames(df))

colnames(df) <- gsub('.*zip.', '', colnames(df))
Note that the pattern uses .* (any character, repeated) and not *. (a quantifier with nothing in front of it to repeat).
A quicker way is to note that the names you want are simply stored as file extensions, so you can use the tools::file_ext function:
strings <- c("split_the_tree1.zip.name", "split_the_tree2.zip.region",
"split_the_tree3.zip.type")
tools::file_ext(strings)
[1] "name" "region" "type"
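The dots in '.*zip.' are unescaped, so each one matches any character, not just a literal period. A minimal sketch of a stricter pattern follows, assuming the prefix always ends in a literal ".zip."; the second name in the vector is made up for illustration.

```r
# Example names; "unzipped.value" is a hypothetical name without a ".zip." prefix
x <- c("split_the_tree1.zip.name", "unzipped.value")

# Unescaped dots match any character, so "unzipp" is stripped by mistake
loose <- gsub(".*zip.", "", x)

# Escaped dots match a literal "." only; names without ".zip." are left alone
strict <- sub("^.*\\.zip\\.", "", x)

print(loose)   # "name" "ed.value"
print(strict)  # "name" "unzipped.value"
```

For a data frame, the same call would be applied as colnames(df) <- sub("^.*\\.zip\\.", "", colnames(df)).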

Related

Changing column name - replace the prefix

Suppose I have the dataframe below:
> colnames(df)
[1] "0939.HK.Open" "0939.HK.High" "0939.HK.Low" "0939.HK.Close" "0939.HK.Volume" "0939.HK.Adjusted"
I am wondering whether I can replace the prefix of the column names with a defined symbol:
> symbol <- '0123.KI'
What I would like to obtain is:
[1] "0123.KI.Open" "0123.KI.High" "0123.KI.Low" "0123.KI.Close" "0123.KI.Volume" "0123.KI.Adjusted"
Thank you so much guys.
Try:
colnames(df) <- paste0(symbol, ".", tools::file_ext(colnames(df)))
As the column names look like filenames, I take the extension with file_ext and then prepend the symbol using paste0.
Another way:
symbol <- '0123.KI'
colnames(df) <- gsub("0939.HK", symbol, colnames(df), fixed=TRUE)
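If the old prefix is not known in advance, a regular expression can replace everything up to the last dot. This is only a sketch: it assumes the part to keep never contains a dot and that symbol contains no backslashes (which are special in a replacement string). The cols vector stands in for colnames(df).

```r
symbol <- "0123.KI"
cols <- c("0939.HK.Open", "0939.HK.High", "0939.HK.Adjusted")

# Greedy "^.*\\." consumes everything up to and including the last dot,
# so the old prefix does not have to be spelled out
new_cols <- sub("^.*\\.", paste0(symbol, "."), cols)

print(new_cols)  # "0123.KI.Open" "0123.KI.High" "0123.KI.Adjusted"
```

The fixed=TRUE approach is the safer choice when the old prefix is known, because no character in it is interpreted as a regex metacharacter.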

grep() and sub() and regular expression

I'd like to change the variable names in my data.frame from e.g. "pmm_StartTimev4_E2_C19_1" to "pmm_StartTimev4_E2_C19". So if the name ends with an underscore followed by any number it gets removed.
But I'd like for this to happen only if the variable name has the word "Start" in it.
I've got a muddled up bit of code that doesn't work. Any help would be appreciated!
# Current data frame:
dfbefore <- data.frame(a=c("pmm_StartTimev4_E2_C19_1","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19_2","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Desired data frame:
dfafter <- data.frame(a=c("pmm_StartTimev4_E2_C19","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Current code:
sub((.*{1,}[0-9]*).*","",grep("Start",names(df),value = TRUE)
How about something like this using gsub().
stripcol <- function(x) {
  gsub("(.*Start.*)_\\d+$", "\\1", as.character(x))
}
dfnew <- dfbefore
dfnew[] <- lapply(dfbefore, stripcol)
We use the regular expression to look for "Start" and then grab everything but the underscore number at the end. We use lapply to apply the function to all columns.
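Another option is to check each value for "Start" first and strip the suffix only when it is found:
doit <- function(x) {
  x <- as.character(x)
  if (grepl("Start", x)) {
    # Remove only a trailing underscore followed by one or more digits
    x <- sub("_[0-9]+$", "", x)
  }
  return(x)
}
apply(dfbefore, c(1, 2), doit)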
     a                          b
[1,] "pmm_StartTimev4_E2_C19"   "pmm_StartTo_v4_E2_C19"
[2,] "pmm_StartTimev4_E2_E2_C1" "complete_E1_C12_1"
[3,] "delivery_C1_C12"          "pmm_StartTo_v4_E2_C19"
We can use sub with a capture group that matches everything before a trailing underscore and one or more digits, but only when the '_Start' substring is present. In the replacement, use the backreference to the captured group. As there are multiple columns, use lapply to loop over the columns, apply sub, and assign the output back to a copy of the original data:
out <- dfbefore
out[] <- lapply(dfbefore, sub,
pattern = "^(.*_Start.*)_\\d+$", replacement ="\\1")
out
dfafter[] <- lapply(dfafter, as.character)
all.equal(out, dfafter, check.attributes = FALSE)
#[1] TRUE
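The answers above operate on the values in the data frame. If the strings are the variable names themselves, as the question's wording suggests, the same idea can be applied to a names vector. A minimal sketch, with nms standing in for names(df):

```r
# Hypothetical column names modelled on the values in the question
nms <- c("pmm_StartTimev4_E2_C19_1", "complete_E1_C12_1", "pmm_StartTimev4_E2_E2_C1")

# Only touch names that contain "Start"
has_start <- grepl("Start", nms, fixed = TRUE)

# Drop a trailing underscore followed by digits from those names only
nms[has_start] <- sub("_[0-9]+$", "", nms[has_start])

print(nms)
# "pmm_StartTimev4_E2_C19" "complete_E1_C12_1" "pmm_StartTimev4_E2_E2_C1"
```

Because grepl and sub are both vectorised, no loop or apply call is needed for a plain character vector.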

How to separate names using regular expression?

I have a name vector like the following:
vname<-c("T.Lovullo (73-58)","K.Gibson (63-96) and A.Trammell (1-2)","T.La Russa (81-81)","C.Dressen (16-10), B.Swift (32-25) and F.Skaff (40-39)")
Watch out for T.La Russa, who has a space in his name.
I want to use str_match to separate the names. The difficulty is that some strings contain several names while others contain only one, as in the example above.
I have written some code, but it does not work:
str_match_all(ss,"(D[.]D+.+)s(\\(d+-d+\\))(s(and)s(D[.]D+.+)s(\\(d+-d+\\)))?")
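Perhaps this helps (the output shown is for the first three elements of vname; the pattern splits on "and" but not on commas):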
res <- unlist(strsplit(vname, "(?<=\\))(\\sand\\b\\s)*", perl = TRUE))
res
#[1] "T.Lovullo (73-58)" "K.Gibson (63-96)" "A.Trammell (1-2)" "T.La Russa (81-81)"
To get the names only (if that is what is expected):
sub("\\s*\\(.*", "", res)
#[1] "T.Lovullo" "K.Gibson" "A.Trammell" "T.La Russa"
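The fourth element of vname also separates entries with a comma. One way to cover that case is to extract each "name (number-number)" entry rather than split on the separators. This is a sketch in base R, assuming every name starts with a capital initial followed by a dot and contains no comma or parenthesis:

```r
vname <- c("T.Lovullo (73-58)",
           "K.Gibson (63-96) and A.Trammell (1-2)",
           "T.La Russa (81-81)",
           "C.Dressen (16-10), B.Swift (32-25) and F.Skaff (40-39)")

# An initial, a dot, then anything up to the "(number-number)" record
pattern <- "[A-Z]\\.[^(,]+\\([0-9]+-[0-9]+\\)"

# gregexpr() locates every match in each string; regmatches() extracts them
entries <- unlist(regmatches(vname, gregexpr(pattern, vname)))

# Strip the record to keep only the names
managers <- sub("\\s*\\(.*", "", entries)
print(managers)
# "T.Lovullo" "K.Gibson" "A.Trammell" "T.La Russa" "C.Dressen" "B.Swift" "F.Skaff"
```

Dropping unlist() keeps the matches grouped per input string, which may be useful if the original rows need to stay identifiable.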

Creating a list for search query from column using dbhydroR package

I am trying to create a list that I can copy into a search query for a function from a data.frame column. My output from the below code is in the format of :
'C-484,''F-409,''S-18A,''G-850,''PB-632,'...etc.
But I need it to read
'C-484','F-409','S-18A','G-850','PB-632', ...etc.
There are 1,974 variables. How can I switch the placement of the last quote around each variable with the comma?
inactivestations <- read.csv("INACTIVE_WELLS.csv",header=TRUE)
#subset and make data.frame for only STATION (station names)
allstations_inactive <- inactivestations['STATION']
#not separated in a way that can be copied into a query
list(allstations_inactive$STATION)
#separated by commas and has quotes around each variable but commas inside quotes
test<-paste0(allstations_inactive$STATION, collapse="''",sep=",")
##separated by commas and has quotes around each variable but commas inside quotes
test1<-paste0(allstations_inactive$STATION, sep=",",collapse="''")
Thank you in advance
This approach works:
input <- c("C-484", "F-409", "S-18A", "G-850", "PB-632")
output <- paste0("'", input, "'", collapse = ",")
# cat(output)
# 'C-484','F-409','S-18A','G-850','PB-632'
So in your specific case it becomes:
test1 <- paste0("'", allstations_inactive$STATION, "'", collapse = ",")
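It may help to see why the attempts in the question put the comma inside the quotes. paste0 has no sep argument, so sep="," is treated as one more string to append to each element. The sketch below uses a short made-up vector in place of the STATION column and shows an equivalent fix using sprintf:

```r
x <- c("C-484", "F-409", "S-18A")

# paste0() has no `sep` argument, so sep = "," is appended to every element
# before the elements are collapsed with two single quotes
attempt <- paste0(x, sep = ",", collapse = "''")
print(attempt)  # "C-484,''F-409,''S-18A,"

# Wrap each element in quotes first, then collapse on the comma
fixed <- paste(sprintf("'%s'", x), collapse = ",")
print(fixed)    # "'C-484','F-409','S-18A'"
```

If the column was read in as a factor, wrapping it in as.character() before sprintf is a reasonable precaution.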

Avoid that space in column name is replaced with period (".") when using read.csv()

I am using R to do some data pre-processing, and here is the problem I am facing: I read the data using read.csv(filename, header=TRUE), and the spaces in variable names become ".". For example, a variable named Full Code becomes Full.Code in the resulting data frame. After processing, I use write.xlsx(filename) to export the results, and the changed variable names are written out. How can I address this problem?
Besides, in the output .xlsx file, the first column becomes indices (i.e., 1 to N), which is not what I expect.
If you set check.names=FALSE in read.csv when you read the data in, the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you need to quote the column names (with backticks in some cases) or refer to the columns by position rather than by name while editing.
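A small sketch of the difference, using inline CSV text in place of a real file (the data values are made up):

```r
# Inline CSV text stands in for the real file (hypothetical data)
csv_text <- "Full Code,Value\nA1,10\nB2,20"

default_df <- read.csv(text = csv_text)                       # names are sanitised
raw_df     <- read.csv(text = csv_text, check.names = FALSE)  # names kept as-is

print(names(default_df))  # "Full.Code" "Value"
print(names(raw_df))      # "Full Code" "Value"

# Columns with spaces need backticks or [[ ]] when accessed by name
raw_df$`Full Code`
raw_df[["Value"]] * 2
```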
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
                        pattern = "\\.",
                        replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
  # FIXME: Repetitive.
  # Convert any number of consecutive dots to a single space.
  names(ds) <- gsub(x = names(ds),
                    pattern = "(\\.)+",
                    replacement = " ")
  # Drop the trailing spaces.
  names(ds) <- gsub(x = names(ds),
                    pattern = "( )+$",
                    replacement = "")
  ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
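One possible way to shorten the function is to work on the names vector directly and let trimws handle the trailing spaces. This is only a sketch; tidy_names is a hypothetical name and the example names are made up:

```r
# tidy_names() is a hypothetical helper operating on a character vector of names
tidy_names <- function(nms) {
  nms <- gsub("\\.+", " ", nms)  # any run of dots becomes a single space
  trimws(nms, which = "right")   # drop trailing spaces only
}

cleaned <- tidy_names(c("Full.Code", "Total..Amount.", "id"))
print(cleaned)  # "Full Code" "Total Amount" "id"
```

It would be used as names(ds) <- tidy_names(names(ds)).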
Just to add to the answers already provided, here is another way of replacing "." or any other punctuation in column names, using a regex with the stringr package:
require("stringr")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"
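For comparison, a base R sketch of the same replacement without stringr. Note that in base R the POSIX class has to sit inside a bracket expression, i.e. "[[:punct:]]"; a bare "[:punct:]" would be read as a set of the individual characters.

```r
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")

# In base R the POSIX class must be wrapped in a bracket expression: [[:punct:]]
colnames(data) <- gsub("[[:punct:]]", " ", colnames(data))

print(colnames(data))  # "variable x" "variable y" "variable z"
```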
