Inserting character into variable names [duplicate] - r

This question already has answers here:
Insert a character at a specific location in a string
(8 answers)
Closed 5 years ago.
I have a dataset with variable names such as FamId00 and ISCO8899 and would like to write a command to insert an underscore before the last two digits, which represent years. What is the best way of doing it? I have tried regex, but the furthest I got was:
gsub('.{2}$', '', varname)
which gives me:
FamId
How do I add '_' and the original last two digits back? Also, some variables in the dataset do not end in a two-digit year (e.g. ID and sex). Is there a way to keep the regular expression from affecting those?

We don't need gsub; sub is enough, as there is only a single replacement per string. Capture the last two characters as a group ((...)) and, in the replacement, use _ followed by the backreference to that capture group (\\1)
sub("(.{2})$", "_\\1", varname)
#[1] "FamId_00" "ISCO88_99"
The . is a metacharacter that matches any character. To restrict the match to digits, use \\d{2} in place of .{2}; names that do not end in two digits, such as ID and sex, are then left unchanged.
data
varname <- c("FamId00", "ISCO8899")
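As a minimal sketch of the digit-specific variant, the snippet below applies the \\d{2} pattern to a vector that also contains names without a trailing year. The names ID and sex come from the question; the vector name all_names is an assumption made for illustration.

```r
# Hypothetical vector mixing year-suffixed names with plain ones
all_names <- c("FamId00", "ISCO8899", "ID", "sex")

# Only names ending in two digits match; the others pass through unchanged
renamed <- sub("(\\d{2})$", "_\\1", all_names)
print(renamed)

# For comparison: .{2} matches any two trailing characters,
# so "ID" and "sex" would be altered as well
print(sub("(.{2})$", "_\\1", all_names))
```

For a data frame, the same call can be applied to names(df) and assigned back with names(df) <- ...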

An alternative solution, still using sub() or gsub(), but with a different pattern:
ids <- c("FamId00", "ISCO8899")
gsub("(^.*)([[:digit:]]{2}$)", "\\1_\\2", ids)
[1] "FamId_00" "ISCO88_99"

Related

Split a column of strings (with different patterns) based on two different conditions

Was hoping to get some help with this problem. So I have a column with two types of strings and I would need to split the strings into multiple columns using 2 different conditions. I can figure out how to split them individually but struggling to add maybe an IF statement to my code. This is the example dataset below:
data = data.frame(string=c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956"))
For the first type of string (with a single _), I would like to split at the _, so I used the following code:
strsplit(data$string, "_")
For strings that contain .docx, I would like to split after the docx. I cannot split on "_" because it occurs more than once in these strings, so I used the following code:
strsplit(data$string, "x_")
My question: both types of strings appear in the same column. Is there a way to tell R to split after x_ if "docx" is in the string, and to split on the _ otherwise?
Any help would be appreciated - Thank you guys!
Here's a tidyr solution:
library(tidyr)
data %>%
  extract(string,
          into = c("1", "2"),  # choose your own column labels
          "(.*?)_([^_]+)$")
                                1        2
1                    HFUFN-087836      661
2 207465-125 - IK_6 Mar 2009.docx 37484956
How the regex works:
The regex partitions the strings into two "capture groups" plus an underscore in-between:
(.*?): first capture group, matching any character (.) zero or more times (*) non-greedily (?)
_: a literal underscore
([^_]+)$: the second capture group, matching any character that is not an underscore ([^_]) one or more times (+) at the very end of the string ($)
Data:
data = data.frame(string=c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956"))
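If tidyr is not available, the same idea can be sketched in base R. This is an illustrative variant rather than the answer's method: it uses a greedy (.*) anchored with ^ instead of the non-greedy group, and the column names left and right are assumptions.

```r
strings <- c("HFUFN-087836_661",
             "207465-125 - IK_6 Mar 2009.docx_37484956")

# Greedy (.*) backtracks to the last underscore, because the second
# group ([^_]+) may not contain an underscore and must reach the end ($)
pattern <- "^(.*)_([^_]+)$"

parts <- data.frame(
  left  = sub(pattern, "\\1", strings),  # everything before the last "_"
  right = sub(pattern, "\\2", strings),  # everything after the last "_"
  stringsAsFactors = FALSE
)
print(parts)
```

Because both string types are split at their last underscore, no separate "docx" condition is needed.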

Changing a full last name to just the first letter of the name in R [duplicate]

This question already has answers here:
Getting and removing the first character of a string
(7 answers)
Extract the first (or last) n characters of a string
(5 answers)
Closed 2 years ago.
I'm working in R. I have a dataset with people's first and last names. There is a column called "First" and another column called "Last".
I want to change "Bodie" to just "B" and do the same for all the observations in the "Last" column.
I'm newer to programming so I don't even know where to start. I have looked at some of the string packages in R and can't quite figure out what to do. Thanks for the help.
We can use substr to extract the first letter of the 'Last' column
df1$Last <- substr(df1$Last, 1, 1)
Or sub to remove all the characters other than the first
df1$Last <- sub("^(.).*", "\\1", df1$Last)
Or another option is to split the characters, select the first element
df1$Last <- sapply(strsplit(df1$Last, ""), `[`, 1)
Just a variation on @akrun's answer, using sub without a capture group:
df1$Last <- sub("(?<=.).*$", "", df1$Last, perl=TRUE)
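To check that the approaches above agree, here is a small sketch on made-up data. Only "Bodie" comes from the question; the other values in df1 are hypothetical.

```r
# Hypothetical data; stringsAsFactors = FALSE keeps Last as character
df1 <- data.frame(First = c("Sam", "Ann"),
                  Last  = c("Bodie", "Smith"),
                  stringsAsFactors = FALSE)

by_substr     <- substr(df1$Last, 1, 1)                    # first character
by_sub        <- sub("^(.).*", "\\1", df1$Last)            # keep capture group 1
by_split      <- sapply(strsplit(df1$Last, ""), `[`, 1)    # first split element
by_lookbehind <- sub("(?<=.).*$", "", df1$Last, perl = TRUE)

print(by_substr)
print(identical(by_substr, by_sub))
```

substr is probably the simplest choice here, since no pattern matching is involved.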

Substitute multiple periods in all column names in R [duplicate]

This question already has answers here:
R: How to replace . in a string?
(5 answers)
Closed 2 years ago.
I have the following data.frame.
df = data.frame(a.dfs.56=c(rep("a",8), rep("b",5), rep("c",7), rep("d",10)),
                b.fqh.28=rnorm(30, 6, 2),
                c.34.2.fgs=rnorm(30, 12, 3.5),
                d.tre.19.frn=rnorm(30, 8, 3)
)
How can I substitute all periods "." in the column names to have them become dashes "-"?
I am aware of options like check.names=FALSE when using read.table or data.frame, but in this case, I cannot use this.
I have also tried variations of the following posts, but they did not work for me.
Specifying column names in a data.frame changes spaces to "."
How can I use gsub in multiple specific column in r
R gsub column names in all data frames within a list
Thank you.
You can use gsub for name replacement
names(df) <- gsub(".", "-", names(df), fixed=TRUE)
Note that you need fixed=TRUE because normally gsub expects regular expressions and . is a special regular expression character.
But be aware that - is a non-standard character for variable names. If you try to use those columns with functions that use non-standard evaluation, you will need to surround the names in back-ticks to use them. For example
dplyr::filter(df, `a-dfs-56`=="a")
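The same caveat applies in base R. The sketch below uses a small hypothetical data frame, df_small, to show two ways of reaching a column whose name contains dashes.

```r
# Hypothetical two-column data frame with dotted names
df_small <- data.frame(a.dfs.56 = c("a", "b", "a"),
                       b.fqh.28 = c(1.5, 2.5, 3.5),
                       stringsAsFactors = FALSE)
names(df_small) <- gsub(".", "-", names(df_small), fixed = TRUE)

# Back-ticks are needed with $, because a-dfs-56 would otherwise
# be parsed as a subtraction
col_a <- df_small$`a-dfs-56`

# A quoted string works with [[ ]], with no back-ticks required
rows_a <- df_small[df_small[["a-dfs-56"]] == "a", ]
print(rows_a)
```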
gsub("\\.", "-", names(df)) is the regex (regular expressions) way. The . is a special symbol in regex that means "match any single character". That's why the fixed = TRUE argument is included in MrFlick's answer.
The \\ escapes the period: R's string parser turns \\ into a single backslash, and the regex engine then reads \. as a literal period rather than the special symbol.
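A short sketch comparing the two forms, plus the unescaped pattern to show why the escape matters. The vector nms simply holds the column names from the question.

```r
nms <- c("a.dfs.56", "b.fqh.28", "c.34.2.fgs", "d.tre.19.frn")

fixed_way <- gsub(".", "-", nms, fixed = TRUE)  # "." treated as plain text
regex_way <- gsub("\\.", "-", nms)              # "." escaped inside a regex
unescaped <- gsub(".", "-", nms)                # "." matches every character

print(fixed_way)
print(identical(fixed_way, regex_way))
print(unescaped[1])  # every character of "a.dfs.56" replaced by a dash
```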

Split column names in R [duplicate]

This question already has answers here:
Remove all characters before a period in a string
(5 answers)
Closed 4 years ago.
I have a data frame like below.
df:
X1.Name X1.ID X1.Prac X1.SCD
But, I need to split the column name by dot and display as,
output df:
Name ID Prac SCD
Using sub:
names(df) <- sub("^[^.]+\\.", "", names(df))
The regex pattern I used will match everything from the start of the string up to, and including, the first dot. Then, it replaces that, and only that, with empty string.
^ from the start of the string
[^.]+ match one or more characters which are NOT dots
\\. then match a literal dot
We then replace this entire pattern with empty string "", i.e. we remove it from the original string.
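A minimal sketch of the pattern on a plain character vector. The first four names are from the question; "X1.a.b" and "ID" are hypothetical additions that show the behaviour with several dots and with no dot at all.

```r
col_names <- c("X1.Name", "X1.ID", "X1.Prac", "X1.SCD", "X1.a.b", "ID")

# sub() replaces only the first match, and the pattern is anchored at
# the start, so only the prefix up to the first dot is removed
stripped <- sub("^[^.]+\\.", "", col_names)
print(stripped)
```

Names without a dot do not match the pattern and are returned unchanged.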

R-taking reverse of substring of strsplit sentence [duplicate]

This question already has answers here:
How to reverse a string in R
(14 answers)
Closed 5 years ago.
I have a sentence, ['this', 'is', 'my', 'house'].
After splitting it using "-" as a separator and reversing it to [house, my, is, this], how do I access the last part of the string? And how do I join my and is together with house to form another sentence?
sentence <- c("this","is","my","house")
strsplit(sentence[4], split="")[[1]][nchar(sentence[4]):1]
This code might be a bit dense for a beginner to interpret. The [[1]] is necessary because the value of strsplit is always a list, even when it's just one vector of individual characters; the indexing extracts that vector. The indexing after that, [nchar(sentence[4]):1], reorders the letters in that vector backwards, from the last to the first, in this case c(5,4,3,2,1). The split="" argument causes the strsplit function to split the string at every possible point, i.e. between each character.
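The same steps can be written out one at a time, which may be easier to follow. This is a sketch with intermediate variable names (chars, reversed_chars, reversed_word) that are not in the original answer; the final paste call, which glues the letters back into one string, is an added step.

```r
sentence <- c("this", "is", "my", "house")

# strsplit returns a list, so [[1]] pulls out the character vector
chars <- strsplit(sentence[4], split = "")[[1]]

# Index from the last position down to the first to reverse the letters
reversed_chars <- chars[nchar(sentence[4]):1]

# Collapse the letters back into a single string
reversed_word <- paste(reversed_chars, collapse = "")
print(reversed_word)

# rev() performs the same reordering without explicit indices
same_word <- paste(rev(chars), collapse = "")
```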
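out <- unlist(strsplit(sentence, "-"))  # flatten the list returned by strsplit
flip <- rev(out)                        # reverse the word order
last <- flip[length(flip)]              # last element of the reversed vector
word <- paste(flip, collapse=' ')       # join the words into one sentence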
