How to transform long names into shorter (two-part) names - r

I have a character vector of long names, each consisting of several words joined by dots.
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
The names differ in length, but only the first two words of each name matter.
My goal is to get names of up to 7 characters: the first 3 characters of each of the first two words, separated by a dot.
The examples below are very close to what I need, but I do not know how to apply these code variations to my case.
R How to remove characters from long column names in a data frame and
how to append names to " column names" of the output data frame in R?
What should I do to get output names that look like this?
x <- c("Dus.fru",
"Bet.nan",
"Sal.gla",
"Sal.jen",
"Vac.min")
Any help would be appreciated.

You can do the following:
gsub("(\\w{1,3})[^\\.]*\\.(\\w{1,3}).*", "\\1.\\2", x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
First we capture up to 3 word characters with (\\w{1,3}), then skip anything that is not a dot with [^\\.]*, match a dot with \\., and capture up to 3 word characters again with (\\w{1,3}). The trailing .* consumes the rest of the string. The replacement \\1.\\2 keeps only the two captured groups, separated by a dot.
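As a minimal sketch of the same idea, the pattern can be anchored with ^ and $ and used with sub(), since each name needs only one substitution. The variable name short is only illustrative, and the anchors are an addition, not part of the answer above.

```r
x <- c("Duschekia.fruticosa..Rupr...Pouzar", "Betula.nana.L.")

# Group 1: up to 3 word characters at the start of the string.
# Group 2: up to 3 word characters right after the first dot.
short <- sub("^(\\w{1,3})[^.]*\\.(\\w{1,3}).*$", "\\1.\\2", x)

short
# [1] "Dus.fru" "Bet.nan"

# Every result is at most 7 characters long.
all(nchar(short) <= 7)
```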

Split on dot, substring 3 characters, then paste back together:
sapply(strsplit(x, ".", fixed = TRUE), function(i){
paste(substr(i[ 1 ], 1, 3), substr(i[ 2], 1, 3), sep = ".")
})
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"

Here is a less elegant solution than kath's, but it is a bit easier to read if you are not a regex expert.
# Your data
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
# A function that takes three characters from first two words and merges them
cleaner_fun <- function(ugly_string) {
words <- strsplit(ugly_string, "\\.")[[1]]
short_words <- substr(words, 1, 3)
new_name <- paste(short_words[1:2], collapse = ".")
return(new_name)
}
# Testing function
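# USE.NAMES = FALSE keeps sapply from labelling each result with its input string
sapply(x, cleaner_fun, USE.NAMES = FALSE)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"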
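The split-and-paste approach generalizes easily. The sketch below is a hypothetical helper (shorten_names and its arguments n_words, n_chars and sep are names invented here, not taken from the answers) that drops the empty pieces produced by consecutive dots and lets you choose how many words and characters to keep.

```r
shorten_names <- function(x, n_words = 2, n_chars = 3, sep = ".") {
  parts <- strsplit(x, ".", fixed = TRUE)
  vapply(parts, function(words) {
    words <- words[nzchar(words)]   # drop empty strings from ".." runs
    words <- head(words, n_words)   # keep only the leading words
    paste(substr(words, 1, n_chars), collapse = sep)
  }, character(1))
}

x <- c("Duschekia.fruticosa..Rupr...Pouzar",
       "Betula.nana.L.",
       "Salix.glauca.L.",
       "Salix.jenisseensis..F..Schmidt..Flod.",
       "Vaccinium.minus..Lodd...Worosch")

shorten_names(x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
```

Using vapply() with character(1) makes the function fail loudly if any element does not reduce to a single string, which sapply() would not do.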

Related

Find strings where the first half matches the second

I have a list of IP address pairs separated by "::".
ip_pairs <- c("104.124.199.136::192.168.1.67", "104.124.199.136::192.168.137.174", "192.168.1.67::104.124.199.136", "192.168.137.174::104.124.199.136")
As you can see, the third and fourth elements of the vector are the same as the first two, but reversed. My actual problem is to find all unique pairings of IPs, so the solution would drop the pair B::A if A::B is already present. This could be solved using stringr or regex, I'm guessing.
One option:
library(stringr)
split_function = function(x) {
x = sort(x)
paste(x, collapse="::")
}
pairs = str_split(ip_pairs, "::")
unique(sapply(pairs, split_function))
[1] "104.124.199.136::192.168.1.67" "104.124.199.136::192.168.137.174"
Use read.table to create a two column data frame from the pairs, sort each row and find the duplicates using duplicated. Then extract out the non-duplicates. No packages are used.
DF <- read.table(text = ip_pairs, sep = ":")[-2]
ip_pairs[! duplicated(t(apply(DF, 1, sort)))]
## [1] "104.124.199.136::192.168.1.67" "104.124.199.136::192.168.137.174"
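The two answers can also be combined into a package-free sketch: build an order-independent key for each pair, then keep the first occurrence of each key. The names canonical_key and unique_pairs are illustrative only.

```r
ip_pairs <- c("104.124.199.136::192.168.1.67",
              "104.124.199.136::192.168.137.174",
              "192.168.1.67::104.124.199.136",
              "192.168.137.174::104.124.199.136")

# Sort the two addresses inside each pair so that A::B and B::A share a key.
canonical_key <- vapply(
  strsplit(ip_pairs, "::", fixed = TRUE),
  function(ips) paste(sort(ips), collapse = "::"),
  character(1)
)

# duplicated() flags later repeats, so the first spelling of each pair survives.
unique_pairs <- ip_pairs[!duplicated(canonical_key)]
unique_pairs
# [1] "104.124.199.136::192.168.1.67"    "104.124.199.136::192.168.137.174"
```

Unlike unique() on the sorted keys, this keeps each pair in the orientation in which it first appeared.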

Substitute based on regex [duplicate]

I'm relatively new to R and need help with applying a regex-based substitution.
I have a data frame in one column of which I have a sequence of digits (my values of interest) followed by a string of all sorts of characters.
Example:
4623(randomcharacters)
I need to remove everything after the initial digits to continue working with the values. My idea was to use gsub to remove the non-digit characters by positive lookbehind.
The code I have is:
sub_function <- function() {
gsub("?<=[[:digit:]].", " ", fixed = T)
}
data_frame$`x` <- data_known$`x` %>%
sapply(sub_function)
But I then get the error:
Error in FUN(X[[i]], ...) : unused argument (X[[i]])
Any help would be greatly appreciated!
Here is a base R function.
It uses sub, not gsub, since there will be only one substitution. There's no need for a lookbehind either: the metacharacter ^ marks the beginning of the string, -* allows a leading minus sign (strictly, zero or more of them; use -? to allow at most one), and [[:digit:]]+ requires at least one digit. Everything after the captured group is discarded.
sub_function <- function(x){
sub("(^-*[[:digit:]]+).*", "\\1", x)
}
data <- data.frame(x = c("4623(randomcharacters)", "-4623(randomcharacters)"))
sub_function(data$x)
#[1] "4623" "-4623"
Edit
With this simple modification the function returns a numeric vector.
sub_function <- function(x){
y <- sub("(^-*[[:digit:]]+).*", "\\1", x)
as.numeric(y)
}
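One caveat: when a string has no leading digits, sub() returns it unchanged and as.numeric() then yields NA with a coercion warning. A minimal sketch of a variant that checks for a match first is shown below; extract_leading_number is a hypothetical name, and the pattern uses -? (at most one minus sign) as an assumption about the data.

```r
extract_leading_number <- function(x) {
  x <- as.character(x)
  pattern <- "^(-?[[:digit:]]+).*"
  has_number <- grepl(pattern, x)

  # Start with NA everywhere, then fill in only the elements that matched.
  out <- rep(NA_real_, length(x))
  out[has_number] <- as.numeric(sub(pattern, "\\1", x[has_number]))
  out
}

extract_leading_number(c("4623(randomcharacters)", "-12abc", "abc"))
# [1] 4623  -12   NA
```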
There are a few ways to accomplish this, but I like using functions from {tidyverse}:
library(tidyverse)
# Create some dummy data
df <- tibble(targetcol = c("4658(randomcharacters)", "5847(randomcharacters)", "4958(randomcharacters)"))
df <- mutate(df, just_digits = str_extract(targetcol, pattern = "^[[:digit:]]+"))
Output (contents of df):
  targetcol              just_digits
  <chr>                  <chr>
1 4658(randomcharacters) 4658
2 5847(randomcharacters) 5847
3 4958(randomcharacters) 4958
If you always want to extract numbers from the data, you can use parse_number from readr. It returns the data in numeric form by default.
Using @Rory S's data:
sub_function <- function(x) {
readr::parse_number(x)
}
sub_function(df$targetcol)
#[1] 4658 5847 4958

argument 'replacement' has length > 1 and only the first element will be used

I would like to replace the first 3 characters of the names of my .txt files with a sequence number.
x <- list.files()
n <- seq(length(list.files()))
x2 <- gsub('^.{3}', n, x)
file.rename(x, x2)
The 4 files in the folder
2eEMORT.txt
3h4MORT.txt
4F1MORT.txt
841MORT.txt
were replaced by one file
1MORT.txt
In the OP's code, gsub (or sub) is not vectorized over the replacement argument, i.e. it takes a vector of length 1. Hence, we get the warning message. One option is to use substring (fast and efficient) along with paste0
x2 <- paste0(seq_along(x), substring(x, 4))
x2
#[1] "1MORT.txt" "2MORT.txt" "3MORT.txt" "4MORT.txt"
Or with paste and sub. Here, we match first 3 characters as in the OP's code and replace it with blank ("") and then paste
x2 <- paste0(seq_along(x), sub("^.{3}", "", x))
Also, if we need to do this using regex, a vectorized option is str_replace
library(stringr)
x2 <- str_replace(x, "^.{3}", as.character(seq_along(x)))
x2
#[1] "1MORT.txt" "2MORT.txt" "3MORT.txt" "4MORT.txt"
NOTE: None of the solutions use any loop
Now, we simply do
file.rename(x, x2)
data
x <- c("2eEMORT.txt", "3h4MORT.txt", "4F1MORT.txt", "841MORT.txt")
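To see why four files collapsed into one, the following sketch reproduces the question's gsub() call on the sample data (with the warning suppressed) and compares it with the substring() approach. The names collapsed and fixed are illustrative.

```r
x <- c("2eEMORT.txt", "3h4MORT.txt", "4F1MORT.txt", "841MORT.txt")
n <- seq_along(x)

# gsub() uses only the first element of `n` as the replacement,
# so every file name ends up with the same prefix.
collapsed <- suppressWarnings(gsub("^.{3}", n, x))
unique(collapsed)
# [1] "1MORT.txt"

# paste0() and substring() are vectorized, so each name gets its own prefix.
fixed <- paste0(n, substring(x, 4))
fixed
# [1] "1MORT.txt" "2MORT.txt" "3MORT.txt" "4MORT.txt"
```

Because all four target names were identical, each file.rename() call overwrote the previous result, leaving a single file.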
You're getting the warning "argument 'replacement' has length > 1 and only the first element will be used" because you're supplying n, a vector of the form c(1, 2, ...), where gsub expects a single replacement string for the substring matching your regex ^.{3}.
If what you want to do is replace the first three characters of each filename with a number you can sort by, here is one way to do it (comments explain each step):
# the files to be renamed
fnames <- list.files()
# new prefixes to add: '001', '002', '003', etc.
# (note usage of sprintf() to get left-padding for nice sorting)
fname_prefixes <- sprintf("%03d", seq_along(fnames))
# sub the i-th prefix for the first three characters of the i-th filename
new_fnames <- Map(function(fname, idx) gsub("^.{3}", idx, fname),
fnames, fname_prefixes)
Then you can rename each file by iterating over the named list new_fnames:
for (idx in seq_along(new_fnames)){
# can show a message so you can track what's going on
message('renaming ', names(new_fnames)[idx], ' to: ', new_fnames[[idx]])
file.rename(from=names(new_fnames)[idx], to=new_fnames[[idx]])
}
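Since renaming is destructive, it can help to rehearse it on throwaway files first. The sketch below assumes the four sample file names from the question and works entirely inside a fresh temporary directory; tmp_dir, old_names and new_fnames are illustrative names.

```r
# Create a scratch directory with empty copies of the sample files.
tmp_dir <- tempfile("rename_demo")
dir.create(tmp_dir)
old_names <- c("2eEMORT.txt", "3h4MORT.txt", "4F1MORT.txt", "841MORT.txt")
invisible(file.create(file.path(tmp_dir, old_names)))

fnames <- list.files(tmp_dir)
new_fnames <- paste0(sprintf("%03d", seq_along(fnames)), substring(fnames, 4))

# Refuse to rename if two files would end up with the same name.
stopifnot(anyDuplicated(new_fnames) == 0)

# file.rename() is vectorized over both arguments.
renamed_ok <- file.rename(file.path(tmp_dir, fnames),
                          file.path(tmp_dir, new_fnames))
list.files(tmp_dir)
# [1] "001MORT.txt" "002MORT.txt" "003MORT.txt" "004MORT.txt"
```

The anyDuplicated() check guards against exactly the failure in the question, where several files were silently renamed to the same target.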

R - How to replace a string from multiple matches (in a data frame)

I need to replace parts of a string with replacements that are stored in a data frame.
For example -
input_string = "Whats your name and Where're you from"
I need to replace part of this string from a data frame. Say the data frame is
matching <- data.frame(from_word=c("Whats your name", "name", "fro"),
to_word=c("what is your name","names","froth"))
The expected output is: what is your name and Where're you from
Note -
It should match the longest string. In this example, name is not replaced with names, because name was part of a bigger match.
It has to match whole words, not partial ones. The fro in "from" should not be replaced with "froth".
I referred to the below link but somehow could not get this work as intended/described above
Match and replace multiple strings in a vector of text without looping in R
This is my first post here. If I haven't given enough details, kindly let me know
Edit
Based on the input from Sri's comment I would suggest using:
library(gsubfn)
# words to be replaced
a <-c("Whats your","Whats your name", "name", "fro")
# their replacements
b <- c("What is yours","what is your name","names","froth")
# named list as an input for gsubfn
replacements <- setNames(as.list(b), a)
# the test string
input_string = "fro Whats your name and Where're name you from to and fro I Whats your"
# match entire words
gsubfn(paste(paste0("\\w*", names(replacements), "\\w*"), collapse = "|"), replacements, input_string)
Original
I would not say this is easier to read than your simple loop, but it might take better care of the overlapping replacements:
# define the sample dataset
input_string = "Whats your name and Where're you from"
matching <- data.frame(from_word=c("Whats your name", "name", "fro", "Where're", "Whats"),
to_word=c("what is your name","names","froth", "where are", "Whatsup"))
# load used library
library(gsubfn)
# make sure data is of class character
matching$from_word <- as.character(matching$from_word)
matching$to_word <- as.character(matching$to_word)
# extract the words in the sentence
test <- unlist(strsplit(input_string, " ", fixed = TRUE))
# find where individual words from sentence match with the list of replaceable words
test2 <- sapply(paste0("\\b", test, "\\b"), grepl, matching$from_word)
# change rownames to see what is the format of output from the above sapply
rownames(test2) <- matching$from_word
# reorder the data so that largest replacement blocks are at the top
test3 <- test2[order(rowSums(test2), decreasing = TRUE),]
# where the word is already being replaced by larger chunk, do not replace again
test3[apply(test3, 2, cumsum) > 1] <- FALSE
# define the actual pairs of replacement
replacements <- setNames(as.list(as.character(matching[,2])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1]),
as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1])
# perform the replacement
gsubfn(paste(as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1], collapse = "|"),
replacements,input_string)
Build a named list of the form
toreplace <- list("x1" = "y1", "x2" = "y2", ..., "xn" = "yn")
where each name xi is a pattern (find what) and each value yi is its replacement (replace with), then pass it to gsubfn:
library(gsubfn)
input_string <- "Whats your name and Where're you from"
toreplace <- list("Whats your name" = "what is your name", "name" = "names", "fro" = "froth")
gsubfn(paste(names(toreplace), collapse = "|"), toreplace, input_string)
Was trying out different things and the below code seems to work.
a <-c("Whats your name", "name", "fro")
b <- c("what is your name","names","froth")
c <- c("Whats your name and Where're you from")
for(i in seq_along(a)) c <- gsub(paste0('\\<',a[i],'\\>'), gsub(" ","_",b[i]), c)
c <- gsub("_"," ",c)
c
I took help from the linked question "Making gsub only replace entire words?"
However, I would like to avoid the loop if possible. Can someone please improve this answer so that it works without the loop?
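One loop-free possibility in base R is sketched below, under two assumptions: the search strings contain no regex metacharacters, and longer patterns should win over shorter ones. It builds a single alternation ordered from longest to shortest, wraps it in word boundaries, and swaps each match for its replacement through a lookup vector. The names ord, pattern, lookup and hits are illustrative.

```r
input_string <- "Whats your name and Where're you from"
matching <- data.frame(
  from_word = c("Whats your name", "name", "fro"),
  to_word   = c("what is your name", "names", "froth"),
  stringsAsFactors = FALSE
)

# Longest patterns first: with perl = TRUE the first alternative that matches wins.
ord <- order(nchar(matching$from_word), decreasing = TRUE)
pattern <- paste0("\\b(", paste(matching$from_word[ord], collapse = "|"), ")\\b")

# Named vector mapping each search string to its replacement.
lookup <- setNames(matching$to_word, matching$from_word)

# Find all matches in one pass and replace each with its lookup value.
hits <- gregexpr(pattern, input_string, perl = TRUE)
regmatches(input_string, hits) <- lapply(
  regmatches(input_string, hits),
  function(m) unname(lookup[m])
)

input_string
# [1] "what is your name and Where're you from"
```

Because the string is scanned only once, text that has already been replaced is never rescanned, so name inside the new "what is your name" is left alone, and the \\b anchors keep fro from matching inside from.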

Avoid that space in column name is replaced with period (".") when using read.csv()

I am using R to do some data pre-processing, and here is the problem I am facing: I read the data with read.csv(filename,header=TRUE), and the spaces in variable names become ".". For example, a variable named Full Code becomes Full.Code in the resulting data frame. After processing, I export the results with write.xlsx(filename), and the changed variable names are written out. How can I address this problem?
Besides, in the output .xlsx file, the first column becomes an index (i.e., 1 to N), which is not what I expect.
If you set check.names=FALSE in read.csv when you read the data in, the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you will need to quote the column names (with backquotes in some cases) or refer to the columns by position rather than by name while editing.
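A small round trip illustrates the difference. This sketch writes a two-column CSV to a temporary file (the column names and values are made up for the example), reads it with and without check.names = FALSE, and writes it back out with write.csv() instead of write.xlsx() so that it needs no extra packages.

```r
# A tiny CSV whose first header contains a space.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Full Code,Value", "A1,10", "B2,20"), csv_path)

# Default behaviour: names are made syntactically valid.
default_names <- names(read.csv(csv_path, header = TRUE))
default_names
# [1] "Full.Code" "Value"

# With check.names = FALSE the original header is preserved.
kept <- read.csv(csv_path, header = TRUE, check.names = FALSE)
names(kept)
# [1] "Full Code" "Value"

# Columns with spaces need backquotes or [[ ]] when accessed by name.
full_code <- kept$`Full Code`
same_column <- kept[["Full Code"]]

# row.names = FALSE suppresses the 1..N index column on export.
out_path <- tempfile(fileext = ".csv")
write.csv(kept, out_path, row.names = FALSE)
header_line <- readLines(out_path)[1]
```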
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
pattern = "\\.",
replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
# FIXME: Repetitive.
# Convert any number of consecutive dots to a single space.
names(ds) <- gsub(x = names(ds),
pattern = "(\\.)+",
replacement = " ")
# Drop the trailing spaces.
names(ds) <- gsub(x = names(ds),
pattern = "( )+$",
replacement = "")
ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
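The two gsub() calls could also be folded into one expression. The sketch below is an alternative, not the answer's own code: tidy_names is a hypothetical helper that collapses runs of dots into a single space and then lets trimws() remove any leading or trailing space.

```r
tidy_names <- function(nms) {
  # "\\.+" matches one or more consecutive dots.
  trimws(gsub("\\.+", " ", nms))
}

tidy_names(c("Full.Code", "Salix.jenisseensis..F..Schmidt..Flod.", "Value"))
# [1] "Full Code"                         "Salix jenisseensis F Schmidt Flod"
# [3] "Value"
```

It operates on a character vector of names, so it would be applied as names(ds) <- tidy_names(names(ds)).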
Just to add to the answers already provided, here is another way to replace the "." or any other kind of punctuation in column names, using a regex with the stringr package:
library(stringr)
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"

Resources