Adding a period between characters in a column in R

species <- c("Dacut","Hhyde","Faffi","Dmelan","Jrobusta")
leg <- c(1,2,3,4,5)
df <- data.frame(species, leg)
I am trying to add a period (".") between the first and second letter of every string in the first column of a data frame.
#End Goal:
#D.acut
#H.hyde
#F.affi
#D.melan
#J.robusta
Does anyone know of any code I can use for this issue?

Using substr() to split the string at the positions:
species <- c("Dacut","Hhyde","Faffi","Dmelan","Jrobusta")
leg <- c(1,2,3,4,5)
df <- data.frame(species, leg, stringsAsFactors = FALSE)
df$species <- paste0(
substr(df$species, 1, 1),
".",
substr(df$species, 2, nchar(df$species))
)
df$species
The first substr() extracts characters 1 to 1; the second extracts character 2 through the last character of the string. With paste0() we can put the . in between.
Or sub() with a back-reference:
df$species <- sub("(^.)", "\\1.", df$species)
(^.) is the first character in the string, grouped with (). sub() replaces the first match with the back-reference to the group (\\1) followed by a literal dot.

Using sub, we can match the zero-width lookbehind (?<=^.), and then replace with a dot. This has the effect of inserting a dot into the second position.
df$species <- sub("(?<=^.)", "\\.", df$species, perl=TRUE)
df$species
[1] "D.acut" "H.hyde" "F.affi" "D.melan" "J.robusta"
Note: If, for some reason, you only want to do this replacement when the first character in the species name is an actual capital letter, then match the following pattern instead:
(?<=^[A-Z])
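As a minimal sketch of the capital-letter variant, the example below uses a made-up input in which one name starts with a lowercase letter (that input and the names dotted and dotted_alt are assumptions for illustration, not part of the original answers). It also shows a capture-group form that should behave the same way without a lookbehind.

```r
# Hypothetical input: "faffi" deliberately starts with a lowercase letter
species <- c("Dacut", "Hhyde", "faffi", "Dmelan", "Jrobusta")

# Insert "." only after a leading capital letter (lookbehind requires perl = TRUE)
dotted <- sub("(?<=^[A-Z])", ".", species, perl = TRUE)

# Same idea with a capture group and a back-reference instead of a lookbehind
dotted_alt <- sub("^([A-Z])", "\\1.", species)

print(dotted)
print(identical(dotted, dotted_alt))
```

Names that do not start with a capital letter are left unchanged, since sub() returns the input as-is when the pattern does not match.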

Related

In R, how do I split each string in a vector to return everything before the Nth instance of a character?

Example:
df <- data.frame(Name = c("J*120_234_458_28", "Z*23_205_a834_306", "H*_39_004_204_99_04902"))
I would like to be able to select everything before the third underscore for each row in the dataframe. I understand how to split the string apart:
df$New <- sapply(strsplit((df$Name),"_"), `[`)
But this places a list in each row. I've thus far been unable to figure out how to use sapply to unlist() each row of df$New select the first N elements of the list to paste/collapse them back together. Because the length of each subelement can be distinct, and the number of subelements can also be distinct, I haven't been able to figure out an alternative way of getting this info.
We specify n and, after splitting the character column on '_', extract the first n - 1 components:
n <- 4
lapply(strsplit(as.character(df$Name), "_"), `[`, seq_len(n - 1))
If we need to paste them back together, we can loop over the list with lapply/sapply and an anonymous function (function(x)), take the first n - 1 elements with head, and paste them together:
sapply(strsplit(as.character(df$Name), "_"), function(x)
paste(head(x, n - 1), collapse="_"))
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or use regex method
sub("^([^_]+_[^_]+_[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or, if n is really large, build the pattern with sprintf():
pat <- sprintf("^(([^_]+_){%d}[^_]+)_.*", n - 2)
sub(pat, "\\1", df$Name)
Or
sub("^(([^_]+_){2}[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
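The split/head/paste approach above can be wrapped into a small reusable helper. This is only a sketch: the function name before_nth_sep and its arguments k and sep are hypothetical, and it assumes the separator is a fixed string rather than a regex.

```r
# Hypothetical helper: keep everything before the k-th occurrence of `sep`
before_nth_sep <- function(x, k, sep = "_") {
  parts <- strsplit(as.character(x), sep, fixed = TRUE)
  # head() copes with strings that have fewer than k pieces
  vapply(parts, function(p) paste(head(p, k), collapse = sep), character(1))
}

df <- data.frame(Name = c("J*120_234_458_28", "Z*23_205_a834_306",
                          "H*_39_004_204_99_04902"))
df$New <- before_nth_sep(df$Name, k = 3)
print(df$New)
```

Strings with fewer than k separators come back unchanged, which may or may not be the behaviour you want.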

add running counter for semi-consecutive strings in vector

I would like to add a number indicating the x^th occurrence of a word in a vector. (So this question is different from Make a column with duplicated values unique in a dataframe , because I have a simple vector and try to avoid the overhead of casting it to a data.frame).
E.g. for the vector:
book, ship, umbrella, book, ship, ship
the output would be:
book, ship, umbrella, book2, ship2, ship3
I have solved this myself by transposing the vector to a dataframe and next using the grouping function. That feels like using a sledgehammer to crack nuts:
library(dplyr)
# add consecutive number for equal string
words <- c("book", "ship", "umbrella", "book", "ship", "ship")
# transpose word vector to data.frame for grouping
df <- data.frame(words = words)
df <- df %>% group_by(words) %>% mutate(seqN = row_number())
# combine columns and remove '1' for first occurrence
wordsVec <- paste0(df$words, df$seqN)
gsub("1", "", wordsVec)
# [1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Is there a cleaner solution, e.g. using the stringr package?
You can still utilize row_number() from dplyr but you don't need to convert to data frame, i.e.
sub('1$', '', ave(words, words, FUN = function(i) paste0(i, row_number(i))))
#[1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Another option is to use make.unique along with gsubfn to increment your values by 1, i.e.
library(gsubfn)
gsubfn("\\d+", function(x) as.numeric(x) + 1, make.unique(words))
#[1] "book" "ship" "umbrella" "book.2" "ship.2" "ship.3"
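For comparison, here is a base R sketch that needs no extra packages. It counts occurrences with ave() and seq_along(), then appends the counter only from the second occurrence onward; the variable names occurrence and labelled are chosen here for illustration.

```r
words <- c("book", "ship", "umbrella", "book", "ship", "ship")

# Running count of each word: 1 for the first occurrence, 2 for the second, ...
occurrence <- ave(seq_along(words), words, FUN = seq_along)

# Append the counter only when it is greater than 1
labelled <- ifelse(occurrence == 1, words, paste0(words, occurrence))
print(labelled)
```

Because the counter is appended conditionally rather than stripped afterwards with a regex, a word that occurs 11 or more times is not affected by removing a trailing "1".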

argument 'replacement' has length > 1 and only the first element will be used

I would like to replace the first 3 letters of txt.files with a sequence.
x <- list.files()
n <- seq(length(list.files()))
x2 <- gsub('^.{3}', n, x)
file.rename(x, x2)
the 4 files in the folder
2eEMORT.txt
3h4MORT.txt
4F1MORT.txt
841MORT.txt
were replaced by one file
1MORT.txt
In the OP's code, gsub (or sub) is not vectorized over replacement (i.e. it takes a vector of length 1), hence the warning message. One option is to use substring (fast and efficient) along with paste0:
x2 <- paste0(seq_along(x), substring(x, 4))
x2
#[1] "1MORT.txt" "2MORT.txt" "3MORT.txt" "4MORT.txt"
Or with paste and sub. Here, we match first 3 characters as in the OP's code and replace it with blank ("") and then paste
x2 <- paste0(seq_along(x), sub("^.{3}", "", x))
Also, if we need to do this using regex, a vectorized option is str_replace
library(stringr)
x2 <- str_replace(x, "^.{3}", as.character(seq_along(x)))
x2
#[1] "1MORT.txt" "2MORT.txt" "3MORT.txt" "4MORT.txt"
NOTE: None of the solutions use any loop
Now, we simply do
file.rename(x, x2)
data
x <- c("2eEMORT.txt", "3h4MORT.txt", "4F1MORT.txt", "841MORT.txt")
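Since the original attempt collapsed four files into one, it may be worth checking the new names for collisions before calling file.rename(). The sketch below works only on the character vector, not on real files; the names bad and x2 and the duplicate check are illustrative additions rather than part of the answer above.

```r
x <- c("2eEMORT.txt", "3h4MORT.txt", "4F1MORT.txt", "841MORT.txt")

# The OP's approach: only the first replacement value is used for every file
bad <- suppressWarnings(gsub("^.{3}", seq_along(x), x))
print(bad)

# Vectorized alternative: one prefix per file name
x2 <- paste0(seq_along(x), substring(x, 4))
print(x2)

# Refuse to rename if two files would end up with the same name
stopifnot(anyDuplicated(x2) == 0)
```

Running the same check on bad would stop with an error, which is the signal that file.rename() would overwrite files.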
The reason you're getting the warning "argument 'replacement' has length > 1 and only the first element will be used" is that you're supplying n -- a vector of the form c(1, 2, ...) -- as the replacement for the substring matching your regex ^.{3}, where gsub expects a single string.
If what you want to do is replace the first three characters of each filename with a number you can sort by, here is one way to do it (comments explain each step):
# the files to be renamed
fnames <- list.files()
# new prefixes to add: '001', '002', '003', etc.
# (note usage of sprintf() to get left-padding for nice sorting)
fname_prefixes <- sprintf("%03d", seq_along(fnames))
# sub the i-th prefix for the first three characters of the i-th filename
new_fnames <- Map(function(fname, idx) gsub("^.{3}", idx, fname),
fnames, fname_prefixes)
Then you can rename each file by iterating over the named list new_fnames:
for (idx in seq_along(new_fnames)){
# can show a message so you can track what's going on
message('renaming ', names(new_fnames)[idx], ' to: ', new_fnames[[idx]])
file.rename(from=names(new_fnames)[idx], to=new_fnames[[idx]])
}

How to transform long names into shorter (two-part) names

I have a character vector in which long names are used, which will consist of several words connected by delimiters in the form of a dot.
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
The length of the names is different. But only the first two words of the entire name are important.
My goal is to get names up to 7 symbols: 3 initial symbols from the first two words and a separator in the form of a "dot" between them.
Very close to my request are these examples, but I do not know how to apply these code variations to my case.
R How to remove characters from long column names in a data frame and
how to append names to " column names" of the output data frame in R?
What should I do to get output names that look like this?
x <- c("Dus.fru",
"Bet.nan",
"Sal.gla",
"Sal.jen",
"Vac.min")
Any help would be appreciated.
You can do the following:
gsub("(\\w{1,3})[^\\.]*\\.(\\w{1,3}).*", "\\1.\\2", x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
First we match up to 3 characters (\\w{1,3}), then ignore anything which is not a dot [^\\.]*, match a dot \\. and then again up to 3 characters (\\w{1,3}). Finally anything, that comes after that .*. We then only use the things in the brackets and separate them with a dot \\1.\\2.
Split on dot, substring 3 characters, then paste back together:
sapply(strsplit(x, ".", fixed = TRUE), function(i){
paste(substr(i[ 1 ], 1, 3), substr(i[ 2], 1, 3), sep = ".")
})
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
Here is a less elegant solution than kath's, but one that is a bit easier to read if you are not an expert in regex.
# Your data
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
# A function that takes three characters from first two words and merges them
cleaner_fun <- function(ugly_string) {
words <- strsplit(ugly_string, "\\.")[[1]]
short_words <- substr(words, 1, 3)
new_name <- paste(short_words[1:2], collapse = ".")
return(new_name)
}
# Testing function
# USE.NAMES = FALSE drops the original long names from the result
sapply(x, cleaner_fun, USE.NAMES = FALSE)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
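The split-and-substring idea can also be written as a single vectorized helper with an adjustable width. This is a sketch only: the name abbreviate_binomial and the width argument are hypothetical, and it assumes every name contains at least two dot-separated words.

```r
# Hypothetical helper: first `width` characters of the first two words
abbreviate_binomial <- function(x, width = 3) {
  parts <- strsplit(x, ".", fixed = TRUE)
  vapply(parts,
         function(p) paste(substr(p[1:2], 1, width), collapse = "."),
         character(1))
}

x <- c("Duschekia.fruticosa..Rupr...Pouzar",
       "Betula.nana.L.",
       "Salix.glauca.L.",
       "Salix.jenisseensis..F..Schmidt..Flod.",
       "Vaccinium.minus..Lodd...Worosch")
short <- abbreviate_binomial(x)
print(short)
```

Using vapply() with character(1) returns a plain unnamed character vector and fails loudly if any element does not produce exactly one string.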

gsub not working on colnames?

I have a dataframe called df with column names in the following format:
"A Agarwal" "A Agrawal" "A Balachandran"
"A.Brush" "A.Casavant" "A.Chakrabarti"
They are first initial and last name. However, some of them are separated with a space, while others are separated with a period. I need to replace the space with a period. (The first column is called author.ID, and I excluded it from the following code.)
I have tried the following codes but the resulting colnames still do not change.
colnames(df[, -1]) = gsub("\\s", "\\.", colnames(df[, -1]))
colnames(df[, -1]) = gsub(" ", ".", colnames(df[, -1]))
What am I doing wrong?
Thanks.
Note that df[, -1] gets you all rows and all columns except the first column (see this reference), so assigning to colnames(df[, -1]) does not rename the columns of df itself. In order to modify the column names you should assign to colnames(df).
To replace the first literal space with a dot, use
colnames(df) <- sub(" ", ".", colnames(df), fixed=TRUE)
If there can be more than one whitespace, use a regex:
colnames(df) <- sub("\\s+", ".", colnames(df))
If you need to replace every whitespace sequence in the column names with a single dot, use gsub:
colnames(df) <- gsub("\\s+", ".", colnames(df))
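If the first column really must be left out of the substitution, one option is to subset the vector of names rather than the data frame. The toy data frame below is made up for illustration (only the column names come from the question), and unchanged and fixed are names chosen here.

```r
# Toy data frame; check.names = FALSE keeps the space in "A Agarwal"
df <- data.frame(1:2, c(0, 1), c(1, 0), check.names = FALSE)
colnames(df) <- c("author.ID", "A Agarwal", "A.Brush")

# The OP's form: the names of df are not updated
unchanged <- df
colnames(unchanged[, -1]) <- gsub(" ", ".", colnames(unchanged[, -1]))
print(colnames(unchanged))

# Index the names vector instead of the data frame
fixed <- df
colnames(fixed)[-1] <- gsub("\\s+", ".", colnames(fixed)[-1])
print(colnames(fixed))
```

The second form modifies only elements 2 onward of the names vector, so author.ID is never passed to gsub.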
