I have read many examples here and on other forums and tried things myself, but I still can't do what I want.
I have a string like this:
myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")
I want to split it into columns at the first dot and at the vertical bar (|), so that it looks like this:
data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2"))
The main problem is that a dot sometimes also appears to the right of the vertical bar (e.g. in the third element), and I don't want to split on that one.
Among others, what I tried was:
data.frame(do.call(rbind, strsplit(myString,"(\\.)|(\\|)")))
but this also creates a fourth column when it splits after the second dot.
I tried to tell it to only split once for the dot:
data.frame(do.call(rbind, strsplit(myString,"(\\.{1})|(\\|)")))
but same result.
Then I tried to tell it that the dot must not be preceded by a vertical bar:
data.frame(do.call(rbind, strsplit(myString,"([^\\|]\\.)|(\\|)")))
data.frame(do.call(rbind, strsplit(myString,"([[:alnum:]][^\\|]\\.)|(\\|)")))
but in both cases it splits by both dots.
I tried various combinations with reshape2::colsplit as well, with similar results: either it splits on both dots, or it splits on the first dot but not on the vertical bar:
reshape2::colsplit(myString, "([^\\|]\\.)|(\\|)", c("col1", "col2"))
Does anyone have an idea on how to solve this?
It is totally ok if it creates 3 columns instead of 2, I can then select the ones of interest.
E.g.
data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("10","9","1"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2", "col3"))
library(stringr)
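# myString is the character vector from the question; n = 3 caps the number of
# pieces, so the second dot stays inside the third piece
str_split_fixed(myString, "[.|]", 3)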
output:
[,1] [,2] [,3]
[1,] "ENSG00000185561" "10" "TLCD2"
[2,] "ENSG00000124785" "9" "NRN1"
[3,] "ENSG00000287339" "1" "RP11-575F12.4"
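The piece count is what keeps the second dot intact here. A quantifier such as {1} in the pattern only says "one dot per match"; it does not limit how many matches strsplit acts on. Below is a minimal base R sketch of the same "split at the first match only" idea. It assumes every element contains at least one dot followed by one vertical bar, and the names split_first, head_tail, ver_name and result are made up for illustration.

```r
myString <- c("ENSG00000185561.10|TLCD2",
              "ENSG00000124785.9|NRN1",
              "ENSG00000287339.1|RP11-575F12.4")

# Split each string at the FIRST match of `pattern` only.
# regexpr() reports just the first match, and invert = TRUE returns the
# text before and after that match.
split_first <- function(x, pattern) {
  pieces <- regmatches(x, regexpr(pattern, x), invert = TRUE)
  do.call(rbind, pieces)  # assumes every element matched (two pieces each)
}

head_tail <- split_first(myString, "[.]")        # ID, then "10|TLCD2" etc.
ver_name  <- split_first(head_tail[, 2], "[|]")  # version, then name

result <- data.frame(col1 = head_tail[, 1],
                     col2 = ver_name[, 1],
                     col3 = ver_name[, 2],
                     stringsAsFactors = FALSE)
result
```

Elements without a dot or a vertical bar would break the rbind step, so real data may need a check before this is applied.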
This should work. The secret sauce is the option extra = "merge", which means that any extra separated parts get added back onto the last column.
library(tidyr)
tibble(string = c(
"ENSG00000185561.10|TLCD2",
"ENSG00000124785.9|NRN1",
"ENSG00000287339.1|RP11-575F12.4"
)) %>%
separate(
string, into = c("c1", "c2", "c3"), sep = "[.]|[|]", extra = "merge"
)
#> # A tibble: 3 x 3
#> c1 c2 c3
#> <chr> <chr> <chr>
#> 1 ENSG00000185561 10 TLCD2
#> 2 ENSG00000124785 9 NRN1
#> 3 ENSG00000287339 1 RP11-575F12.4
Created on 2021-10-21 by the reprex package (v2.0.0)
NB, reshape2 is superseded by tidyr. You should make the switch ASAP!
I would suggest using matching instead of splitting (i.e. write a regex that specifies the parts that should be matched, rather than the splitter):
df = tibble(ID = myString)
df %>% extract(ID, into = c('ID', 'Name'), '([^.]+).*\\|(.+)')
# A tibble: 3 × 2
ID Name
<chr> <chr>
1 ENSG00000185561 TLCD2
2 ENSG00000124785 NRN1
3 ENSG00000287339 RP11-575F12.4
Just like the other answer, this is using ‘tidyr’ (which supersedes ‘reshape2’).
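For anyone who wants the same matching approach without extra packages, here is a rough base R sketch using strcapture() from utils. The pattern and the column names ID, Version and Name are assumptions for illustration; the pattern expects an ID, a dot, a numeric version, a vertical bar and then the name.

```r
myString <- c("ENSG00000185561.10|TLCD2",
              "ENSG00000124785.9|NRN1",
              "ENSG00000287339.1|RP11-575F12.4")

# Each parenthesised group becomes one column; `proto` supplies the
# column names and types (the version is converted to integer).
parts <- strcapture(
  "^([^.]+)[.]([0-9]+)[|](.+)$",
  myString,
  proto = data.frame(ID = character(),
                     Version = integer(),
                     Name = character(),
                     stringsAsFactors = FALSE)
)
parts
```

Strings that do not match the pattern produce a row of NA values rather than an error, which makes malformed input easy to spot.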
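This could also help in base R. The pattern matches the version suffix together with the vertical bar, so a later dot in the name is never treated as a split point:
as.data.frame(do.call(rbind, strsplit(myString, "\\.\\d+\\|", perl = TRUE)))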
V1 V2
1 ENSG00000185561 TLCD2
2 ENSG00000124785 NRN1
3 ENSG00000287339 RP11-575F12.4
You can use str_extract and lookahead (?=\\|) and, respectively, lookbehind (?<=\\|) to assert the | as demarcation point:
library(stringr)
df <- data.frame(
col1 = str_extract(myString, ".*?(?=\\|)"),
col2 = str_extract(myString, "(?<=\\|).*$")
)
df
col1 col2
1 ENSG00000185561.10 TLCD2
2 ENSG00000124785.9 NRN1
3 ENSG00000287339.1 RP11-575F12.4
EDIT:
If you want three columns:
df <- data.frame(
col1 = str_extract(myString, ".*?(?=\\.)"),
col2 = str_extract(myString, "(?<=\\.)\\d+(?=\\|)"),
col3 = str_extract(myString, "(?<=\\|).*$")
)
df
col1 col2 col3
1 ENSG00000185561 10 TLCD2
2 ENSG00000124785 9 NRN1
3 ENSG00000287339 1 RP11-575F12.4
It seems to me that you are trying to cram two operations into a single command. First split at | to create two columns, then remove the dot suffix from the first column. I think this is simpler, and it needs no external packages:
myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")
df <- do.call(rbind, strsplit(myString, '\\|'))
df[,1] <- sub('\\..*', '', df[,1])
df
[,1] [,2]
[1,] "ENSG00000185561" "TLCD2"
[2,] "ENSG00000124785" "NRN1"
[3,] "ENSG00000287339" "RP11-575F12.4"
or am I missing something...?
I have a dataframe that looks like the below:
BaseRating contRating Participant
5,4,6,3,2,4 5 01
4 4 01
I would first like to run some code that checks whether there are any commas in the dataframe and returns the number of the column where they occur. I have tried some of the solutions in the questions below, but they don't seem to work when looking for a comma rather than a whole string or value. I'm probably missing something simple here, but any help is appreciated!
Selecting data frame rows based on partial string match in a column
Filter rows which contain a certain string
Check if value is in data frame
Having determined whether there are commas in my data, I then want to extract just the last number in the list separated by commas in that entry, and replace the entry with that value. For instance, I want the first row in the BaseRating column to become '4' because it is last in that list.
Is there a way to do this in R without manually changing the number?
A possible solution is below.
EXPLANATION
In what follows, I explain the regular expression used in the str_extract function, as asked for by @milsandhills:
The symbol | in the middle means the logical OR operator.
We use that because BaseRating can have multiple numbers or only one number — hence the need to use |, to treat each case separately.
The left-hand side of | means a number formed by one or more digits (\\d+), which starts (^) and finishes the string ($).
The right-hand side of | means a number formed by one or more digits (\\d+), which finishes the string ($). And (?<=\\,) is used to guarantee that the number is preceded by a comma.
You can find more details at the stringr cheat sheet.
library(tidyverse)
df <- data.frame(
BaseRating = c("5,4,6,3,2,4", "4"),
contRating = c(5L, 4L),
Participant = c(1L, 1L)
)
df %>%
mutate(BaseRating = sapply(BaseRating,
function(x) str_extract(x, "^\\d+$|(?<=\\,)\\d+$") %>% as.integer))
#> BaseRating contRating Participant
#> 1 4 5 1
#> 2 4 4 1
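To see the two branches of the alternation at work, here is a small base R check on a few made-up inputs (the vector x and the name last_number are illustrative only). perl = TRUE is needed because of the lookbehind.

```r
x <- c("5,4,6,3,2,4", "4", "12,345", "abc")
pattern <- "^\\d+$|(?<=,)\\d+$"

# regexpr() returns -1 where nothing matches, so fill those with NA
m <- regexpr(pattern, x, perl = TRUE)
last_number <- rep(NA_character_, length(x))
last_number[m != -1] <- regmatches(x, m)
last_number
# [1] "4"   "4"   "345" NA
```

The first two inputs exercise the right-hand and left-hand branch respectively, and "abc" shows that entries without a trailing number come back as NA.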
Or:
library(tidyverse)
df %>%
separate_rows(BaseRating, sep = ",", convert = TRUE) %>%
group_by(contRating, Participant) %>%
summarise(BaseRating = last(BaseRating), .groups = "drop") %>%
relocate(BaseRating, .before = 1)
#> # A tibble: 2 × 3
#> BaseRating contRating Participant
#> <int> <int> <int>
#> 1 4 4 1
#> 2 4 5 1
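If we want a quick option, we can use trimws from base R (the whitespace argument requires R 3.6.0 or later). The pattern ".*," strips everything up to and including the last comma: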
df$BaseRating <- as.numeric(trimws(df$BaseRating, whitespace = ".*,"))
-output
> df
BaseRating contRating Participant
1 4 5 1
2 4 4 1
Another option is stri_extract_last_regex from the stringi package:
library(stringi)
df$BaseRating <- as.numeric(stri_extract_last_regex(df$BaseRating, "\\d+"))
data
df <- structure(list(BaseRating = c("5,4,6,3,2,4", "4"), contRating = 5:4,
Participant = c(1L, 1L)), class = "data.frame", row.names = c(NA,
-2L))
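The first part of the question, finding which columns contain a comma at all, can be sketched in base R as follows. This is only one possible approach, and the names has_comma and comma_cols are made up for illustration.

```r
df <- data.frame(BaseRating = c("5,4,6,3,2,4", "4"),
                 contRating = c(5L, 4L),
                 Participant = c(1L, 1L),
                 stringsAsFactors = FALSE)

# For every column, check whether any entry contains a literal comma.
# fixed = TRUE treats "," as plain text rather than as a regex.
has_comma <- vapply(df, function(col) any(grepl(",", col, fixed = TRUE)),
                    logical(1))

# Positions (and names) of the columns that contain a comma
comma_cols <- which(has_comma)
comma_cols
# BaseRating
#          1
```

The resulting positions can then be used to restrict the clean-up step to the affected columns.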
I have a dataset with a single column that contains multiple ICD-10 codes separate by spaces, eg
Identifier Codes
1 A14 R17
2 R069 D136 B08
3 C11 K71 V91
I have a vector with the ICD-10 codes that are relevant to my analysis, eg goodcodes<-c("C11","A14","R17","O80"). I want to select rows from my dataset where the Codes column contains any of the codes in my vector, but does not need to exactly match a code in my vector.
Using medicalinfo<-filter(medicalinfo, Codes %in% goodcodes) returns only rows where a single matching code is listed in the Codes column. I could also filter based on a partial string, but I only know how to do that for a single partial string, not for all of those in my codes vector.
Is there a way to get all the rows where any of these codes are present in the column?
One trick is to combine the goodcodes into a regular expression:
library(dplyr)
ptn <- paste0("\\b(", paste(goodcodes, collapse = "|"), ")\\b")
ptn
# [1] "\\b(C11|A14|R17|O80)\\b"
FYI, the \\b( and )\\b are absolutely necessary if there's a chance that you will have codes A10 and A101; without \\b(...)\\b, then grepl("A10", "A101") will be a false-positive. See
grepl("A10|B20", "A101")
# [1] TRUE
grepl("\\b(A10|B20)\\b", "A101")
# [1] FALSE
Finally, let's use that ptn:
dat %>%
filter(grepl(ptn, Codes))
# Identifier Codes
# 1 1 A14 R17
# 2 3 C11 K71 V91
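If the entries in the vector are meant as prefixes rather than complete codes (an assumption about the question, e.g. "R06" should also find "R069"), one possible variation is to keep the leading \\b and drop the trailing one. The vector prefixes below is hypothetical.

```r
dat <- data.frame(Identifier = 1:3,
                  Codes = c("A14 R17", "R069 D136 B08", "C11 K71 V91"),
                  stringsAsFactors = FALSE)

prefixes <- c("R06", "C1")  # hypothetical prefixes, not from the question

# Leading \\b anchors each alternative at the start of a code;
# omitting the trailing \\b lets the code continue after the prefix.
prefix_ptn <- paste0("\\b(", paste(prefixes, collapse = "|"), ")")

matched <- dat[grepl(prefix_ptn, dat$Codes), ]
matched
#   Identifier         Codes
# 2          2 R069 D136 B08
# 3          3   C11 K71 V91
```

This works as written only while the prefixes contain no regex metacharacters; codes with characters such as "." would need escaping first.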
Another way is to split the Codes column into a list of individual codes, and look for membership with %in%:
sapply(strsplit(trimws(dat$Codes), "\\s+"), function(a) any(a %in% goodcodes))
# [1] TRUE FALSE TRUE
Depending on how complex things are, a third way is to "unnest" Codes and look for matches.
dat %>%
mutate(Codes = strsplit(trimws(Codes), "\\s+")) %>%
tidyr::unnest(Codes) %>%
group_by(Identifier) %>%
filter(any(Codes %in% goodcodes)) %>%
ungroup()
# # A tibble: 5 x 2
# Identifier Codes
# <dbl> <chr>
# 1 1 A14
# 2 1 R17
# 3 3 C11
# 4 3 K71
# 5 3 V91
(If you really prefer them combined into a single space-delimited string as before, that's easy enough to do with group_by(Identifier) %>% summarize(Codes = paste(Codes, collapse = " ")). I don't recommend it, per se, since I prefer to have that type of information broken out like this, but there is likely context I don't know.)
With subset from base R. Loop over the 'goodcodes' vector, use that as pattern in grepl, Reduce the list of logical vectors into a single logical vector to subset the rows
subset(dat, Reduce(`|`, lapply(goodcodes, function(x) grepl(x, Codes))))
# Identifier Codes
#1 1 A14 R17
#3 3 C11 K71 V91
data
dat <- structure(list(Identifier = 1:3, Codes = c("A14 R17", "R069 D136 B08",
"C11 K71 V91")), class = "data.frame", row.names = c(NA, -3L))
Consider the following dataframe:
status
1 file-status-done-bad
2 file-status-maybe-good
3 file-status-underreview-good
4 file-status-complete-final-bad
We want to extract the second-to-last part of status, where parts are delimited by -. Like so:
status status_extract
1 file-status-done-bad done
2 file-status-maybe-good maybe
3 file-status-underreview-good underreview
4 file-status-complete-final-bad final
In SQL this is easy, select split_part(status, '-', -2).
However, the solutions I've seen with R either operate on vectors or are messy to extract particular elements (they return ALL elements). How is this done in a mutate chain? The below is a failed attempt.
df %>%
mutate(status_extract = str_split_fixed(status, pattern = '-')[[-2]])
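Found a really simple answer: word() from stringr accepts negative positions, which count from the end, so -2 picks the second-to-last part.
library(tidyverse)
df %>%
mutate(status_extract = word(status, -2, sep = "-"))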
In base R you can combine the functions sapply and strsplit
df$status_extract <- sapply(strsplit(df$status, "-"), function(x) x[length(x) - 1])
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
You can use map_chr() and nth() to extract the nth element from each split vector; a negative n counts from the end.
library(tidyverse)
df %>%
mutate(status_extract = map_chr(str_split(status, "-"), nth, -2))
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
which is equivalent to a base version like
sapply(strsplit(df$status, "-"), function(x) rev(x)[2])
# [1] "done" "maybe" "underreview" "final"
You can use regex to get what you want without splitting the string.
sub('.*-(\\w+)-.*$', '\\1', df$status)
#[1] "done" "maybe" "underreview" "final"
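For readers who miss SQL's split_part, here is a rough base R imitation. The function below is hypothetical (it is not part of any package), and it treats the separator as literal text.

```r
# Return the n-th piece of each string; a negative n counts from the end.
# Out-of-range positions give NA.
split_part <- function(x, sep, n) {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) {
    i <- if (n < 0) length(p) + n + 1 else n
    if (i < 1 || i > length(p)) NA_character_ else p[i]
  }, character(1))
}

status <- c("file-status-done-bad",
            "file-status-maybe-good",
            "file-status-underreview-good",
            "file-status-complete-final-bad")

split_part(status, "-", -2)
# [1] "done"        "maybe"       "underreview" "final"
```

Because it is vectorised over x, it can be called inside mutate(), e.g. mutate(status_extract = split_part(status, "-", -2)).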
example <- data.frame(
file_name = c("some_file_name_first_2020.csv",
"some_file_name_second_and_third_2020.csv",
"some_file_name_4_2020_update.csv"),
a = 1:3
)
example
#> file_name a
#> 1 some_file_name_first_2020.csv 1
#> 2 some_file_name_second_and_third_2020.csv 2
#> 3 some_file_name_4_2020_update.csv 3
I have a dataframe that looks something like this example. The "some_file_name" part changes often and the unique identifier is usually in the middle and there can be suffixed information (sometimes) that is important to retain.
I would like to end up with the dataframe below. The approach I can think of is finding all common string "components" and removing them from each row.
desired
#> file_name a
#> 1 first 1
#> 2 second_and_third 2
#> 3 4_update 3
This works for the example shared; perhaps you can use it as the basis for a more general solution:
#split the data on "_" or "."
list_data <- strsplit(example$file_name, '_|\\.')
#Get the words that occur only once
unique_words <- names(Filter(function(x) x==1, table(unlist(list_data))))
#Keep only unique_words and paste the string back.
sapply(list_data, function(x) paste(x[x %in% unique_words], collapse = "_"))
#[1] "first" "second_and_third" "4_update"
However, this answer relies on the fact that you would have separators like "_" in the filenames to detect each "component".
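A slightly different sketch, closer to the "remove the common components" idea in the question: instead of keeping words that occur exactly once overall, drop the words that occur in every file name. It still assumes "_" and "." as separators, and the names tokens, common and cleaned are illustrative.

```r
file_name <- c("some_file_name_first_2020.csv",
               "some_file_name_second_and_third_2020.csv",
               "some_file_name_4_2020_update.csv")

# Split every file name into its components
tokens <- strsplit(file_name, "[_.]")

# Components shared by ALL file names
common <- Reduce(intersect, tokens)

# Drop the shared components and glue the rest back together
cleaned <- vapply(tokens,
                  function(x) paste(x[!x %in% common], collapse = "_"),
                  character(1))
cleaned
# [1] "first"            "second_and_third" "4_update"
```

The practical difference is that a word shared by only some of the files is kept here, whereas the count-based version would remove it.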
I am trying to train a model on data that has been converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that distinguishes the same word coming from different fields. For example, the word hello can appear in both the positive and the negative comment fields (and thus be represented as a column in my dataframe), so in my model I want to tell these apart by naming the columns positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with them. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
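Applied to the situation in the question, this might look as follows. The two small data frames stand in for the positive and negative document term matrices and are invented for illustration.

```r
# Toy stand-ins for the two document term matrices (made-up counts)
positive <- data.frame(hello = c(1, 0), great = c(0, 2))
negative <- data.frame(hello = c(0, 1), awful = c(3, 0))

# Tag every column with the field it came from, then combine
names(positive) <- paste0("positive_", names(positive))
names(negative) <- paste0("negative_", names(negative))
combined <- cbind(positive, negative)
names(combined)
# [1] "positive_hello" "positive_great" "negative_hello" "negative_awful"

# The same idea with a suffix, as in the mtcars example from the question
mtcars_sample <- setNames(mtcars, paste0(names(mtcars), "_sample"))
head(names(mtcars_sample), 3)
# [1] "mpg_sample"  "cyl_sample"  "disp_sample"
```

paste0() is vectorised over the column names, so no sapply or lapply loop is needed.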
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to prepend the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package, it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4