Remove non-unique string components from a column in R - r

example <- data.frame(
file_name = c("some_file_name_first_2020.csv",
"some_file_name_second_and_third_2020.csv",
"some_file_name_4_2020_update.csv"),
a = 1:3
)
example
#> file_name a
#> 1 some_file_name_first_2020.csv 1
#> 2 some_file_name_second_and_third_2020.csv 2
#> 3 some_file_name_4_2020_update.csv 3
I have a dataframe that looks something like this example. The "some_file_name" part changes often and the unique identifier is usually in the middle and there can be suffixed information (sometimes) that is important to retain.
I would like to end up with the dataframe below. The approach I can think of is finding all common string "components" and removing them from each row.
desired
#> file_name a
#> 1 first 1
#> 2 second_and_third 2
#> 3 4_update 3

This works for the example shared, perhaps you can use this to make a more general solution :
#split the data on "_" or "."
list_data <- strsplit(example$file_name, '_|\\.')
#Get the words that occur only once
unique_words <- names(Filter(function(x) x==1, table(unlist(list_data))))
#Keep only unique_words and paste the string back.
sapply(list_data, function(x) paste(x[x %in% unique_words], collapse = "_"))
#[1] "first" "second_and_third" "4_update"
However, this answer relies on the fact that you would have separators like "_" in the filenames to detect each "component".

Related

Split a string by two delimiters only in the first occurrence

I have read many examples here and other forums, tried things myself, but still can´t do what I want:
I have a string like this:
myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")
And I want to split it into columns by the first dot and the vertical slash so it looks like this:
data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2"))
The biggest problem here is the dot that is sometimes present in the right part of the slash (e.g. third row), by which I don´t want to split.
Among others, what I tried was:
data.frame(do.call(rbind, strsplit(myString,"(\\.)|(\\|)")))
but this also creates a fourth column when it splits after the second dot.
I tried to tell it to only split once for the dot:
data.frame(do.call(rbind, strsplit(myString,"(\\.{1})|(\\|)")))
but same result.
Then tried to tell it that the dot could not be preceded by a slash:
data.frame(do.call(rbind, strsplit(myString,"([^\\|]\\.)|(\\|)")))
data.frame(do.call(rbind, strsplit(myString,"([[:alnum:]][^\\|]\\.)|(\\|)")))
but in both cases it splits by both dots.
I tried various combinations with reshape2::colsplit as well, similar results; either it splits in both dots, or it splits on the first dot but not on the slash:
reshape2::colsplit(myString, "([^\\|]\\.)|(\\|)", c("col1", "col2"))
Does anyone have an idea on how to solve this?
It is totally ok if it creates 3 columns instead of 2, I can then select the ones of interest.
E.g.
data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("10","9","1"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2", "col3"))
library(stringr)
str_split_fixed(df$myString, "[\\.,\\|]", 3)
output:
[,1] [,2] [,3]
[1,] "ENSG00000185561" "10" "TLCD2"
[2,] "ENSG00000124785" "9" "NRN1"
[3,] "ENSG00000287339" "1" "RP11-575F12.4"
This should work. The secret sauce is the option extra = "merge", which means that any extra separated parts get added back onto the last column.
library(tidyr)
tibble(string = c(
"ENSG00000185561.10|TLCD2",
"ENSG00000124785.9|NRN1",
"ENSG00000287339.1|RP11-575F12.4"
)) %>%
separate(
string, into = c("c1", "c2", "c3"), sep = "[.]|[|]", extra = "merge"
)
#> # A tibble: 3 x 3
#> c1 c2 c3
#> <chr> <chr> <chr>
#> 1 ENSG00000185561 10 TLCD2
#> 2 ENSG00000124785 9 NRN1
#> 3 ENSG00000287339 1 RP11-575F12.4
Created on 2021-10-21 by the reprex package (v2.0.0)
NB, reshape2 is superseded by tidyr. You should make the switch ASAP!
I would suggest using matching instead of splitting (i.e. write a regex that specifies the parts that should be matched, rather than the splitter):
df = tibble(ID = myString)
df %>% extract(ID, into = c('ID', 'Name'), '([^.]+).*\\|(.+)')
# A tibble: 3 × 2
ID Name
<chr> <chr>
1 ENSG00000185561 TLCD2
2 ENSG00000124785 NRN1
3 ENSG00000287339 RP11-575F12.4
Just like the other answer, this is using ‘tidyr’ (which supersedes ‘reshape2’).
This could also help in base R:
as.data.frame(do.call(rbind, strsplit(myString, "\\.\\d+.+?", perl = TRUE)))
V1 V2
1 ENSG00000185561 TLCD2
2 ENSG00000124785 NRN1
3 ENSG00000287339 RP11-575F12.4
You can use str_extract and lookahead (?=\\|) and, respectively, lookbehind (?<=\\|) to assert the | as demarcation point:
library(stringr)
df <- data.frame(
col1 = str_extract(myString, ".*?(?=\\|)"),
col2 = str_extract(myString, "(?<=\\|).*$")
)
df
col1 col2
1 ENSG00000185561.10 TLCD2
2 ENSG00000124785.9 NRN1
3 ENSG00000287339.1 RP11-575F12.4
EDIT:
If you want three columns:
df <- data.frame(
col1 = str_extract(myString, ".*?(?=\\.)"),
col2 = str_extract(myString, "(?<=\\.)\\d+(?=\\|)"),
col3 = str_extract(myString, "(?<=\\|).*$")
)
df
col1 col2 col3
1 ENSG00000185561 10 TLCD2
2 ENSG00000124785 9 NRN1
3 ENSG00000287339 1 RP11-575F12.4
It seems to me that you are trying to cram two operations into a single command. First split at | and create two columns, than remove the dot suffix from the first column. I think this is simpler and there is no need for external packages either:
myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")
df <- do.call(rbind, strsplit(myString, '\\|'))
df[,1] <- sub('\\..*', '', df[,1])
df
[,1] [,2]
[1,] "ENSG00000185561" "TLCD2"
[2,] "ENSG00000124785" "NRN1"
[3,] "ENSG00000287339" "RP11-575F12.4"
or am I missing something...?

Extract String Part to Column in R

Consider the following dataframe:
status
1 file-status-done-bad
2 file-status-maybe-good
3 file-status-underreview-good
4 file-status-complete-final-bad
We want to extract the last part of status, wherein part is delimited by -. Such:
status status_extract
1 file-status-done-bad done
2 file-status-maybe-good maybe
3 file-status-ok-underreview-good underreview
4 file-status-complete-final-bad final
In SQL this is easy, select split_part(status, '-', -2).
However, the solutions I've seen with R either operate on vectors or are messy to extract particular elements (they return ALL elements). How is this done in a mutate chain? The below is a failed attempt.
df %>%
mutate(status_extract = str_split_fixed(status, pattern = '-')[[-2]])
Found the a really simple answer.
library(tidyverse)
df %>%
mutate(status_extract = word(status, -1, sep = "-"))
In base R you can combine the functions sapply and strsplit
df$status_extract <- sapply(strsplit(df$status, "-"), function(x) x[length(x) - 1])
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
You can use map() and nth() to extract the nth value from a vector.
library(tidyverse)
df %>%
mutate(status_extract = map_chr(str_split(status, "-"), nth, -2))
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
which is equivalent to a base version like
sapply(strsplit(df$status, "-"), function(x) rev(x)[2])
# [1] "done" "maybe" "underreview" "final"
You can use regex to get what you want without splitting the string.
sub('.*-(\\w+)-.*$', '\\1', df$status)
#[1] "done" "maybe" "underreview" "final"

R: Count the frequency of every unique character in a column

I have a data frame df which contains a column named strings. The values in this column are some sentences.
For example:
id strings
1 "I want to go to school, how about you?"
2 "I like you."
3 "I like you so much"
4 "I like you very much"
5 "I don't like you"
Now, I have a list of stop word,
["I", "don't" "you"]
How can I make another data frame which stores the total number of occurrence of each unique word (except stop word)in the column of previous data frame.
keyword frequency
want 1
to 2
go 1
school 1
how 1
about 1
like 4
so 1
very 1
much 2
My idea is that:
combine the strings in the column to a big string.
Make a list storing the unique character in the big string.
Make the df whose one column is the unique words.
Compute the frequency.
But this seems really inefficient and I don't know how to really code this.
At first, you can create a vector of all words through str_split and then create a frequency table of the words.
library(stringr)
stop_words <- c("I", "don't", "you")
# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))
# create a frequency table
word_list <- as.data.frame(table(all_words))
# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
One way is using tidytext. Here a book and the code
library("tidytext")
library("tidyverse")
#> df <- data.frame( id = 1:6, strings = c("I want to go to school", "how about you?",
#> "I like you.", "I like you so much", "I like you very much", "I don't like you"))
df %>%
mutate(strings = as.character(strings)) %>%
unnest_tokens(word, string) %>% #this tokenize the strings and extract the words
filter(!word %in% c("I", "i", "don't", "you")) %>%
count(word)
#> # A tibble: 11 x 2
#> word n
#> <chr> <int>
#> 1 about 1
#> 2 go 1
#> 3 how 1
#> 4 like 4
#> 5 much 2
EDIT
All the tokens are transformed to lower case, so you either include i in the stop_words or add the argument lower_case = FALSE to unnest_tokens
Assuming you have a mystring object and a vector of stopWords, you can do it like this:
# split text into words vector
wordvector = strsplit(mystring, " ")[[1]]
# remove stopwords from the vector
vector = vector[!vector %in% stopWords]
At this point you can turn a frequency table() into a dataframe object:
frequency_df = data.frame(table(words))
Let me know if this can help you.

How to return rows of a df that contain strings from a character list

I have a character list. I would like to return rows in a df that contain any of the strings in the list in a given column.
I have tried things like:
hits <- df %>%
filter(column, any(strings))
strings <- c("ape", "bat", "cat")
head(df$column)
[1] "ape and some other text here"
[2] "just some random text"
[3] "Something about cats"
I would like only rows 1 and 3 returned
Thanks in advance for the help.
Use grepl() with a regular expression matching any of the strings in your strings vector:
strings <- c("ape", "bat", "cat")
Firstly, you can collapse the strings vector to the regex you need:
regex <- paste(strings, collapse = "|")
Which gives:
> regex <- paste(strings, collapse = "|")
> regex
[1] "ape|bat|cat"
The pipe symbol | acts as an or operator, so this regex ape|bat|cat will match ape or bat or cat.
If your data.frame df looks like this:
> df
# A tibble: 3 x 1
column
<chr>
1 ape and some other text here
2 just some random text
3 something about cats
Then you can run the following line of code to return just the rows matching your desired strings:
df[grepl(regex, df$column), ]
The output is as follows:
> df[grepl(regex, df$column), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about cats
Note that the above example is case-insensitive, it will only match the lower case strings exactly as specified. You can overcome this easily using the ignore.case parameter of grepl() (note the upper case Cats):
> df[grepl(regex, df$column, ignore.case = TRUE), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about Cats
This can be accomplished with a regular expression.
aColumn <- c("ape and some other text here","just some random text","Something about cats")
aColumn[grepl("ape|bat|cat",aColumn)]
...and the output:
> aColumn[grepl("ape|bat|cat",aColumn)]
[1] "ape and some other text here" "Something about cats"
>
One an also set up the regular expression in an R object, as follows.
# use with a variable
strings <- "ape|cat|bat"
aColumn[grepl(strings,aColumn)]

strpslit a character array and convert to dataframe simultaneously

I have what feels like a difficult data manipulation problem, and am hoping to get some guidance. Here is a test version of what my current array looks like, as well as what dataframe I hope to obtain:
dput(test)
c("<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>", "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>")
test
[1] "<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>"
[2] "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>"
desired_df
quarter oncourt-id time-minutes time-seconds id
1 1 NA 12 0 1
2 3 NA 10 NA 1
There are a few problems I am dealing with:
the character array "test" has backslashes where there should be nothing, but i was having difficulty using gsub in this format gsub("\", "", test).
not every element in test has the same number of entries, note in the example that the 2nd element doesn't have time-seconds, and so for the dataframe I would prefer it to return NA.
I have tried using strsplit(test, " ") to first split on spaces, which only exist between different column entires, but then I am returned with a list of lists that is just as difficult to deal with.
You've got xml there. You could parse it, then run rbindlist on the result. This will probably be a lot less hassle than trying to split the name-value pairs as strings.
dflist <- lapply(test, function(x) {
df <- as.data.frame.list(XML::xmlToList(x))
is.na(df) <- df == ""
df
})
data.table::rbindlist(dflist, fill = TRUE)
# quarter oncourt.id time.minutes time.seconds id
# 1: 1 NA 12 0 1
# 2: 2 NA 10 NA 1
Note: You will need the XML and data.table packages for this solution.

Resources