Move characters from beginning of column name to end of column name - r

I have a dataset where column names have prefixes (corresponding to panel waves), e.g.
a_age
a_sex
a_jbstat
b_age
b_sex
b_jbstat
I would like to convert the prefixes into suffixes, so that it becomes:
age_a
sex_a
jbstat_a
age_b
sex_b
jbstat_b
I'd be grateful for suggestions on efficient ways of doing this.

You can use sub and backreference:
sub("([a-z])_([a-z]+)", "\\2_\\1", x)
[1] "age_a" "sex_a" "jbstat_a" "age_b" "sex_b" "jbstat_b"
The backreferences \\1 and \\2 recall the strings matched by the two capturing groups: ([a-z]) is recalled by \\1 and ([a-z]+) by \\2. To obtain the desired result, the two are simply swapped in the replacement argument to sub.
EDIT:
If the elements are column names, you can do this:
names(df) <- sub("([a-z])_([a-z]+)", "\\2_\\1", names(df))
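The pattern above assumes a single-letter prefix and no further underscores in the variable name. As a hedged sketch, assuming prefixes may be longer than one character and names may themselves contain underscores (the names in `x` below are made up for illustration), anchoring the pattern and splitting at the first underscore only might look like this:

```r
# Hypothetical column names; "id" has no prefix and should stay unchanged
x <- c("a_age", "b_sex", "ab_jb_stat", "id")

# ^([^_]+) captures everything before the first underscore (the wave prefix),
# (.+)$ captures the rest of the name, and the replacement swaps the two
swapped <- sub("^([^_]+)_(.+)$", "\\2_\\1", x)
print(swapped)
```

Names without an underscore do not match the pattern, so sub returns them untouched.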

One way to do it is to use a regex:
x <- c(
"a_age",
"a_sex",
"a_jbstat",
"b_age",
"b_sex",
"b_jbstat"
)
stringr::str_replace(x, "^([a-z]+)_([a-z]+)$", "\\2_\\1")
#> [1] "age_a" "sex_a" "jbstat_a" "age_b" "sex_b" "jbstat_b"
Created on 2020-05-25 by the reprex package (v0.3.0)
Edit: Full Example
df <- data.frame(
a_age = 1,
a_sex = 1,
b_age = 2,
b_sex = 2
)
df
#> a_age a_sex b_age b_sex
#> 1 1 1 2 2
names(df) <- stringr::str_replace(names(df), "^([a-z]+)_([a-z]+)$", "\\2_\\1")
df
#> age_a sex_a age_b sex_b
#> 1 1 1 2 2
Created on 2020-05-26 by the reprex package (v0.3.0)

Related

Split a string by two delimiters only in the first occurrence

I have read many examples here and on other forums and tried things myself, but I still can't do what I want:
I have a string like this:
myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")
And I want to split it into columns by the first dot and the vertical slash so it looks like this:
data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2"))
The biggest problem here is the dot that is sometimes present to the right of the slash (e.g. third row), which I don't want to split on.
Among others, what I tried was:
data.frame(do.call(rbind, strsplit(myString,"(\\.)|(\\|)")))
but this also creates a fourth column when it splits after the second dot.
I tried to tell it to only split once for the dot:
data.frame(do.call(rbind, strsplit(myString,"(\\.{1})|(\\|)")))
but same result.
Then tried to tell it that the dot could not be preceded by a slash:
data.frame(do.call(rbind, strsplit(myString,"([^\\|]\\.)|(\\|)")))
data.frame(do.call(rbind, strsplit(myString,"([[:alnum:]][^\\|]\\.)|(\\|)")))
but in both cases it splits by both dots.
I tried various combinations with reshape2::colsplit as well, similar results; either it splits in both dots, or it splits on the first dot but not on the slash:
reshape2::colsplit(myString, "([^\\|]\\.)|(\\|)", c("col1", "col2"))
Does anyone have an idea on how to solve this?
It is totally ok if it creates 3 columns instead of 2, I can then select the ones of interest.
E.g.
data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("10","9","1"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2", "col3"))
library(stringr)
str_split_fixed(myString, "[.|]", 3)
output:
[,1] [,2] [,3]
[1,] "ENSG00000185561" "10" "TLCD2"
[2,] "ENSG00000124785" "9" "NRN1"
[3,] "ENSG00000287339" "1" "RP11-575F12.4"
This should work. The secret sauce is the option extra = "merge", which means that any extra separated parts get added back onto the last column.
library(tidyr)
tibble(string = c(
"ENSG00000185561.10|TLCD2",
"ENSG00000124785.9|NRN1",
"ENSG00000287339.1|RP11-575F12.4"
)) %>%
separate(
string, into = c("c1", "c2", "c3"), sep = "[.]|[|]", extra = "merge"
)
#> # A tibble: 3 x 3
#> c1 c2 c3
#> <chr> <chr> <chr>
#> 1 ENSG00000185561 10 TLCD2
#> 2 ENSG00000124785 9 NRN1
#> 3 ENSG00000287339 1 RP11-575F12.4
Created on 2021-10-21 by the reprex package (v2.0.0)
NB, reshape2 is superseded by tidyr. You should make the switch ASAP!
I would suggest using matching instead of splitting (i.e. write a regex that specifies the parts that should be matched, rather than the splitter):
df = tibble(ID = myString)
df %>% extract(ID, into = c('ID', 'Name'), '([^.]+).*\\|(.+)')
# A tibble: 3 × 2
ID Name
<chr> <chr>
1 ENSG00000185561 TLCD2
2 ENSG00000124785 NRN1
3 ENSG00000287339 RP11-575F12.4
Just like the other answer, this is using ‘tidyr’ (which supersedes ‘reshape2’).
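The same matching idea can be sketched in base R, assuming every string has the shape ID.version|name. This is an illustration rather than a drop-in replacement; the names `pattern`, `parts`, and `res` are introduced here and are not from the original answers:

```r
myString <- c("ENSG00000185561.10|TLCD2",
              "ENSG00000124785.9|NRN1",
              "ENSG00000287339.1|RP11-575F12.4")

# Three capture groups: ID before the first dot, version digits, name after the |
pattern <- "^([^.]+)\\.([0-9]+)\\|(.*)$"
m <- regmatches(myString, regexec(pattern, myString))

# Each list element is c(full match, group 1, group 2, group 3); drop the full match
parts <- do.call(rbind, lapply(m, function(v) v[-1]))
res <- data.frame(col1 = parts[, 1], col2 = parts[, 2], col3 = parts[, 3],
                  stringsAsFactors = FALSE)
print(res)
```

Because the name is captured with (.*) after the |, any later dots in it are left alone. A string that does not match the pattern yields an empty element in `m`, so that case would need separate handling.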
This could also help in base R:
as.data.frame(do.call(rbind, strsplit(myString, "\\.\\d+.+?", perl = TRUE)))
V1 V2
1 ENSG00000185561 TLCD2
2 ENSG00000124785 NRN1
3 ENSG00000287339 RP11-575F12.4
You can use str_extract and lookahead (?=\\|) and, respectively, lookbehind (?<=\\|) to assert the | as demarcation point:
library(stringr)
df <- data.frame(
col1 = str_extract(myString, ".*?(?=\\|)"),
col2 = str_extract(myString, "(?<=\\|).*$")
)
df
col1 col2
1 ENSG00000185561.10 TLCD2
2 ENSG00000124785.9 NRN1
3 ENSG00000287339.1 RP11-575F12.4
EDIT:
If you want three columns:
df <- data.frame(
col1 = str_extract(myString, ".*?(?=\\.)"),
col2 = str_extract(myString, "(?<=\\.)\\d+(?=\\|)"),
col3 = str_extract(myString, "(?<=\\|).*$")
)
df
col1 col2 col3
1 ENSG00000185561 10 TLCD2
2 ENSG00000124785 9 NRN1
3 ENSG00000287339 1 RP11-575F12.4
It seems to me that you are trying to cram two operations into a single command. First split at | and create two columns, then remove the dot suffix from the first column. I think this is simpler and there is no need for external packages either:
myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")
df <- do.call(rbind, strsplit(myString, '\\|'))
df[,1] <- sub('\\..*', '', df[,1])
df
[,1] [,2]
[1,] "ENSG00000185561" "TLCD2"
[2,] "ENSG00000124785" "NRN1"
[3,] "ENSG00000287339" "RP11-575F12.4"
or am I missing something...?

Is there a way in R to count the number of substrings in a string enclosed in square brackets, all substrings are separated by commas and are quoted?

['ax', 'byc', 'crm', 'dop']
This is a character string, and I want a count of all substrings, ie 4 here as output. Want to do this for the entire column containing such strings.
We may use str_count
library(stringr)
str_count(str1, "\\w+")
[1] 4
Or we may extract the alphanumeric runs into a list and get the lengths:
lengths(str_extract_all(str1, "[[:alnum:]]+"))
If it is a data.frame column, extract the column as a vector and apply str_count
str_count(df1$str1, "\\w+")
data
str1 <- "['ax', 'byc', 'crm', 'dop']"
df1 <- data.frame(str1)
Here are a few base R approaches. We use the 2 row input defined reproducibly in the Note at the end. No packages are used.
lengths(strsplit(DF$b, ","))
## [1] 4 4
nchar(gsub("[^,]", "", DF$b)) + 1
## [1] 4 4
count.fields(textConnection(DF$b), ",")
## [1] 4 4
Note
DF <- data.frame(a = 1:2, b = "['ax', 'byc', 'crm', 'dop']")
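The approaches above count word runs or commas, which gives the right answer for the example but can miscount if an item contains a space or a comma, or if the list is empty. A hedged alternative, assuming items are always wrapped in single quotes and never contain a single quote themselves, is to count the quoted pieces directly (the vector `x` is made-up test input):

```r
# Made-up inputs: a normal list, items containing a space and a comma, and an empty list
x <- c("['ax', 'byc', 'crm', 'dop']", "['a b', 'c,d']", "[]")

# gregexpr finds every '...' run; regmatches extracts them; lengths counts per string
n_items <- lengths(regmatches(x, gregexpr("'[^']*'", x)))
print(n_items)
```

For a data frame column, the same expression can be applied to the column vector, e.g. DF$b.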

Extract String Part to Column in R

Consider the following dataframe:
status
1 file-status-done-bad
2 file-status-maybe-good
3 file-status-underreview-good
4 file-status-complete-final-bad
We want to extract the second-to-last part of status, where parts are delimited by -. Like this:
status status_extract
1 file-status-done-bad done
2 file-status-maybe-good maybe
3 file-status-underreview-good underreview
4 file-status-complete-final-bad final
In SQL this is easy, select split_part(status, '-', -2).
However, the solutions I've seen with R either operate on vectors or are messy to extract particular elements (they return ALL elements). How is this done in a mutate chain? The below is a failed attempt.
df %>%
mutate(status_extract = str_split_fixed(status, pattern = '-')[[-2]])
Found a really simple answer.
library(tidyverse)
df %>%
mutate(status_extract = word(status, -2, sep = "-"))
In base R you can combine the functions sapply and strsplit
df$status_extract <- sapply(strsplit(df$status, "-"), function(x) x[length(x) - 1])
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
You can use map_chr() and nth() to extract the nth value from each split vector.
library(tidyverse)
df %>%
mutate(status_extract = map_chr(str_split(status, "-"), nth, -2))
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
which is equivalent to a base version like
sapply(strsplit(df$status, "-"), function(x) rev(x)[2])
# [1] "done" "maybe" "underreview" "final"
You can use regex to get what you want without splitting the string.
sub('.*-(\\w+)-.*$', '\\1', df$status)
#[1] "done" "maybe" "underreview" "final"
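For something closer to SQL's split_part, the strsplit approach can be wrapped in a small helper. This is only a sketch: `split_part` below is a hypothetical function defined here, not part of base R or any package, and it assumes a fixed (non-regex) separator.

```r
# Hypothetical helper: return the n-th piece of each string,
# counting from the end when n is negative; NA if the piece does not exist
split_part <- function(x, sep, n) {
  pieces <- strsplit(x, sep, fixed = TRUE)
  vapply(pieces, function(p) {
    i <- if (n < 0) length(p) + n + 1 else n
    if (i < 1 || i > length(p)) NA_character_ else p[i]
  }, character(1))
}

status <- c("file-status-done-bad",
            "file-status-maybe-good",
            "file-status-underreview-good",
            "file-status-complete-final-bad")

status_extract <- split_part(status, "-", -2)
print(status_extract)
```

Since the helper is vectorized over x, it should also be usable inside a mutate() call, e.g. mutate(status_extract = split_part(status, "-", -2)).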

Remove non-unique string components from a column in R

example <- data.frame(
file_name = c("some_file_name_first_2020.csv",
"some_file_name_second_and_third_2020.csv",
"some_file_name_4_2020_update.csv"),
a = 1:3
)
example
#> file_name a
#> 1 some_file_name_first_2020.csv 1
#> 2 some_file_name_second_and_third_2020.csv 2
#> 3 some_file_name_4_2020_update.csv 3
I have a dataframe that looks something like this example. The "some_file_name" part changes often and the unique identifier is usually in the middle and there can be suffixed information (sometimes) that is important to retain.
I would like to end up with the dataframe below. The approach I can think of is finding all common string "components" and removing them from each row.
desired
#> file_name a
#> 1 first 1
#> 2 second_and_third 2
#> 3 4_update 3
This works for the example shared; perhaps you can use it to build a more general solution:
#split the data on "_" or "."
list_data <- strsplit(example$file_name, '_|\\.')
#Get the words that occur only once
unique_words <- names(Filter(function(x) x==1, table(unlist(list_data))))
#Keep only unique_words and paste the string back.
sapply(list_data, function(x) paste(x[x %in% unique_words], collapse = "_"))
#[1] "first" "second_and_third" "4_update"
However, this answer relies on the fact that you would have separators like "_" in the filenames to detect each "component".
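A variation closer to the idea in the question (remove the components common to every row) can be sketched with Reduce and intersect. It assumes "_" separates components and that the file extension should be dropped first; the variable names are introduced here for illustration:

```r
file_name <- c("some_file_name_first_2020.csv",
               "some_file_name_second_and_third_2020.csv",
               "some_file_name_4_2020_update.csv")

# Drop the extension, then split each name into its "_"-separated components
stem <- sub("\\.[^.]*$", "", file_name)
tokens <- strsplit(stem, "_", fixed = TRUE)

# Components present in every file name
common <- Reduce(intersect, tokens)

# Keep only the components that are not shared by all rows
cleaned <- sapply(tokens, function(x) paste(x[!x %in% common], collapse = "_"))
print(cleaned)
```

Unlike filtering on words that occur exactly once, this keeps a component that appears in some rows but not all of them.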

Extracting nth character till the end of string in R

I'm trying to extract the nth character onwards in a string, using R. Here's my data:
StringField
example_string1
example_string2
example_string3
example_string4
example_string5
example_string6
example_string7
example_string8
example_string9
example_string10
example_string11
example_string12
I want to extract only the numbers after example_string, so the result would be:
1
2
3
4
5
6
7
8
9
10
11
12
I've tried something along the lines of:
df$unique_number <- substr(df$stringField, 15:)
to indicate I want everything from the 15th position onward, till the end of the string. Is there an easy way to accomplish what I'm trying to do?
Here is an easy option using sub. We can capture the final digits in the input, and then replace with only that captured quantity.
x <- "example_string10"
num <- sub("^.*?(\\d+)$", "\\1", x)
num
[1] "10"
x <- "example_string10"
substr(x, 15, 20)
#> [1] "10"
Created on 2020-02-06 by the reprex package (v0.3.0)
Replace each non-digit (\D) with an empty string and convert to numeric:
transform(df, unique_number = as.numeric(gsub("\\D", "", StringField)))
Note
We used this as input:
df <- data.frame(StringField = c("example_string1", "example_string2",
"example_string3"), stringsAsFactors = FALSE)
df %>% tidyr::extract(StringField, into = "nmb", "([0-9]+)")
If you are interested in extracting only numbers from a string, this can be a solution:
library(stringr)
as.numeric(str_extract(df$StringField,"\\d+"))
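For the literal question (everything from the nth character to the end), base R's substring() can be called with only a start position, since its end position defaults to a very large number. A minimal sketch, assuming the prefix is always "example_string" (14 characters):

```r
x <- c("example_string1", "example_string10", "example_string12")

# Everything from the 15th character to the end of each string
from_15 <- substring(x, 15)

# Same result without hard-coding 15: start right after the known prefix
after_prefix <- substring(x, nchar("example_string") + 1)

# Convert to numbers if needed
unique_number <- as.numeric(after_prefix)
print(unique_number)
```

This only works when the prefix length is constant; the regex-based answers above are safer when it varies.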
