Is there a way in R to find values in a column that contain a word? For example, I want to find all the values that contain the word "the", where some values of the column are "the_cat" and "the_dog" and "dog"
x <- c("the_dog", "the_cat", "dog")
Using the example above, the answer would be 2.
I know this is relatively easy to do in Python, but I am wondering if there is a way to do this in R. Thanks!
Try:
sum(grepl("(?<![A-Za-z])the(?![A-Za-z])", x, perl = TRUE))
This gives a sum of 2 on your example.
But let's consider also a slightly more complex example:
x <- c("the_dog", "the_cat", "dog", "theano", "menthe", " the")
Output:
[1] 3
The pattern above matches any the that has no letter directly before or after it, so theano and menthe are not counted.
You can add more characters inside the [] to exclude other neighbours. For example, if the99 should not count as the word the, use [A-Za-z0-9].
You can also use the above with stringr, for example (I've included the exclusion of numbers, so below the99 wouldn't be counted as a word):
library(stringr)
sum(str_detect(x, "(?<![A-Za-z0-9])the(?![A-Za-z0-9])"))
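As a rough base R comparison (a sketch only; the element "the99" is added here for illustration), the two lookaround patterns can be set against the \\b word-boundary shortcut. \\b treats digits and the underscore as word characters, so it does not count the_dog:

```r
# Example vector from the answer above, plus "the99" (added for illustration)
x <- c("the_dog", "the_cat", "dog", "theano", "menthe", " the", "the99")

letters_only   <- "(?<![A-Za-z])the(?![A-Za-z])"
letters_digits <- "(?<![A-Za-z0-9])the(?![A-Za-z0-9])"

# Lookarounds need perl = TRUE in base R
n_letters  <- sum(grepl(letters_only, x, perl = TRUE))    # "the99" is counted
n_alnum    <- sum(grepl(letters_digits, x, perl = TRUE))  # "the99" is excluded
n_boundary <- sum(grepl("\\bthe\\b", x, perl = TRUE))     # "_" and digits block \\b

c(letters = n_letters, alnum = n_alnum, boundary = n_boundary)
```

Which variant is right depends on what should count as a separator in your data.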
library(stringr)
library(dplyr) # provides tibble(), filter() and %>%
##with a vector
sum(str_detect(c("the_dog", "the_cat", "dog"),"the"))
##In a dataframe
tibble(x = c("the_dog", "the_cat", "dog")) %>%
filter(str_detect(x, "the")) %>%
nrow()
x <- c("the_dog", "the_cat", "dog")
stringr::str_detect(x, "the")
#> [1] TRUE TRUE FALSE
Created on 2019-02-23 by the reprex package (v0.2.1)
Try also:
x <- c("the_dog", "the_cat", "dog")
sum(stringi::stri_count(x,regex="^the"))#matches the at the beginning
Result:
[1] 2
Or:
x <- c("the_dog", "the_cat", "dog")
sum(stringi::stri_count(x, regex = "the")) # counts every occurrence of "the", anywhere in the string
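Note that counting occurrences and counting matching elements are not the same thing once a value contains the pattern more than once. A small base R sketch of the difference (the value "the_other" is made up for illustration):

```r
x <- c("the_dog", "the_other", "dog")  # "the_other" contains "the" twice

# Occurrences per element, a base R counterpart of counting matches
occurrences <- lengths(regmatches(x, gregexpr("the", x)))

# Elements with at least one occurrence
has_the <- grepl("the", x)

sum(occurrences)  # total number of occurrences
sum(has_the)      # number of elements that contain the pattern
```

If the question is "how many values contain the word", the second sum is the one to use.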
Let's say my data is df <- c("Author1","Reference1","Abstract1","Author2","Reference2","Abstract2","Author3","Reference3","Author4","Reference4","Abstract4").
This is a series in which the order is Author, Reference and Abstract. But in some cases, the Abstract data is missing. (In this example, the third Abstract is missing.) So, how can I add NA values in place of Abstract, when Abstract is missing?
In other words, If an element in the vector starts with the word "Reference", but its next element doesn't start with the word "Abstract", I want to add an NA value just after the element starting with "Reference". The result vector should be
result <- c("Author1","Reference1","Abstract1","Author2","Reference2","Abstract2","Author3","Reference3",NA,"Author4","Reference4","Abstract4")
How can I do it?
I have tried the append function in R, but for using it, I need to have the index number of the element where I want to add NA. So, it takes a manual entry for each NA element.
Here's an approach.
Basically, you build two logical vectors:
one tests whether each element contains Reference, the other tests whether each element does not contain Abstract.
You offset the second vector by one position, because you want to test whether an Abstract follows each Reference.
You take the logical AND of the two.
Then you insert NA at the position where an Abstract should be but isn't, using append(). Note that append() takes a single position in after, so this works as written when exactly one Abstract is missing.
ab_missing <- grepl("Reference", df) & c(!grepl("Abstract", df)[-1], FALSE)
df <- append(df, NA, which(ab_missing))
df
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3" "Reference3" NA "Author4"
[11] "Reference4" "Abstract4"
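If several Abstracts can be missing, one possible generalisation is to insert the NA values one at a time, starting from the back. This is only a sketch: insert_missing_abstracts() is a hypothetical helper name, and unlike the code above it also treats a Reference at the very end of the vector as missing its Abstract.

```r
insert_missing_abstracts <- function(v) {
  is_ref <- grepl("^Reference", v)
  # TRUE where the following element is an Abstract; the last element has no successor
  next_is_abs <- c(grepl("^Abstract", v)[-1], FALSE)
  gaps <- which(is_ref & !next_is_abs)
  # Work from the back so that earlier positions are not shifted by the inserts
  for (i in rev(gaps)) v <- append(v, NA, after = i)
  v
}

# Made-up input with two missing Abstracts (records 1 and 3)
df <- c("Author1", "Reference1", "Author2", "Reference2", "Abstract2",
        "Author3", "Reference3")
insert_missing_abstracts(df)
```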
One way (and the only way I get these things done) is to think in tibbles or data frames, so this is not necessarily the best approach:
1. We create a tibble with one column called x.
2. We group by the number in each value (e.g. 1, 1, 1) using parse_number() from readr (I love parse_number()).
3. With summarise(cur_data()[seq(3),]) we expand each group to three rows; see "Expand each group to the max n of rows".
3a. Stop here and pull x if NA is desired; otherwise continue.
4. Finally, we use paste0() with R's recycling and pull the vector.
1. In case NA is desired:
library(dplyr)
library(readr)
my_vector <- tibble(x = c("Author1","Reference1","Abstract1","Author2","Reference2",
"Abstract2","Author3","Reference3","Author4","Reference4","Abstract4")) %>%
group_by(group= parse_number(x)) %>%
summarise(cur_data()[seq(3),]) %>%
pull(x)
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" NA "Author4" "Reference4" "Abstract4"
2. In case the lacking word is desired:
library(dplyr)
library(readr)
my_vector <- tibble(x = c("Author1","Reference1","Abstract1","Author2","Reference2",
"Abstract2","Author3","Reference3","Author4","Reference4","Abstract4")) %>%
group_by(group= parse_number(x)) %>%
summarise(cur_data()[seq(3),]) %>%
mutate(group = paste0(c("Author", "Reference", "Abstract"), group)) %>%
pull(group)
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" "Abstract3" "Author4" "Reference4" "Abstract4"
A slightly different approach, with the input vector stored in x, might be:
c(sapply(split(x, cumsum(grepl("Author", x))), function(x) head(c(x, NA_character_), 3)))
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" NA "Author4" "Reference4" "Abstract4"
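The one-liner pads each record at the end, which fits the case where only the Abstract can be missing. A variation that places each element by its role is sketched below; it assumes every record starts with an Author element and that names follow the "<role><number>" layout, and fill_group() is a hypothetical helper name.

```r
# Made-up input: Reference2 and Abstract3 are missing
x <- c("Author1", "Reference1", "Abstract1",
       "Author2", "Abstract2",
       "Author3", "Reference3")
roles <- c("Author", "Reference", "Abstract")

# One group per record; a new record starts at each "Author"
groups <- split(x, cumsum(grepl("^Author", x)))

# Reorder each group to the expected roles; absent roles become NA
fill_group <- function(g) g[match(roles, sub("[0-9]+$", "", g))]

result <- unlist(lapply(groups, fill_group), use.names = FALSE)
result
```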
I have a string
str1 <- "T-759..780, -D-27..758_E, -D-781..1338_C"
And I tried to use gtools::mixedsort to order these comma separated strings.
sapply(strsplit(str1 , ','), function(x) toString(gtools::mixedsort(x)))
I get
" -D-781..1338_C, -D-27..758_E, T-759..780"
I am expecting
"-D-27..758_E, T-759..780 -D-781..1338_C"
Not sure what I need to do to get the expected output.
I think you have a misconception about how mixedsort() works. It doesn't sort by the numbers in the string alone: it splits each string into its text parts and number parts and sorts by all of them in order. I hope this small example illustrates how mixedsort() works. It starts by sorting the elements of the vector c("B_1", "A_2", "A_10") by their first text part c("B", "A", "A"), so A always comes before B, and then it sorts the two A elements by their numbers 2 and 10:
# example showing how mixedsort works
example <- c("B_1", "A_2", "A_10")
gtools::mixedsort(example)
#> [1] "A_2" "A_10" "B_1"
sort(example) # in comparison to normal sort, which doesn't recognize parts of the string as numbers
#> [1] "A_10" "A_2" "B_1"
Created on 2022-09-02 by the reprex package (v2.0.1)
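The same ordering can be reproduced by hand in base R, which may make the idea clearer. This is only an illustration of the principle for this particular input, not a description of how gtools implements mixedsort():

```r
example <- c("B_1", "A_2", "A_10")

# Split each element into its text part and its number part
# (assumes the simple "<text>_<number>" layout of this example)
text_part   <- sub("_.*$", "", example)
number_part <- as.numeric(sub("^.*_", "", example))

# Order by text first, then by number within ties
by_hand <- example[order(text_part, number_part)]
by_hand
```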
But according to your example, you want to sort a vector by the first number that appears in each element, ignoring a possible - in front of the number. In that case, you can use a regular expression to extract the first number in a string with gsub(".*?([0-9]+).*", "\\1", x) and use that to sort the vector. I wrote a small function for it:
# function to sort by first number, ignoring minus before the number
sort.first.number <- function(x) {
v <- gsub(".*?([0-9]+).*", "\\1", x)
x[order(as.numeric(v))] # compare the extracted digits as numbers, not as text
}
str1 <- "T-759..780, -D-27..758_E, -D-781..1338_C"
sapply(strsplit(str1 , ','), function(x) toString(sort.first.number(x)))
#> [1] " -D-27..758_E, T-759..780, -D-781..1338_C"
Created on 2022-09-02 by the reprex package (v2.0.1)
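One thing to watch when ordering by extracted digits: the regular expression returns character strings, and character strings are compared as text, so "1000" sorts before "27". A short sketch with made-up values (the third element is invented to show the effect, and every element is assumed to contain at least one digit):

```r
parts <- c("T-759..780", "-D-27..758_E", "-D-1000..1338_C")

# First run of digits in each element, still as character strings
digits <- sub("^[^0-9]*([0-9]+).*$", "\\1", parts)

as_text   <- parts[order(digits)]              # text order: "1000" < "27" < "759"
as_number <- parts[order(as.numeric(digits))]  # numeric order: 27 < 759 < 1000

as_text
as_number
```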
> Df1
[1] "HM_004_T" "HM_004_T2" "HM_005_T" "HMFN_005_T2" "HM_007_T" "HM_007_T2" "HM_088_TR"
[8] "HM_088_T3"
Reference is made to "change position of word within a string in r". I have a slightly different question. I first wish to delete _T if it appears on its own, and then wish to take _T2, _T3 or _TR off the end and move them in front of all the other text.
My ideal output will be:
Df1 <- c("HM_004", "T2_HM_004", "HM_005", "T2_HM_005", "HM_007", "T2_HM_007", "TR_HM_088", "T3_HM_088")
Input data
Df1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2", "HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")
You can do this with nested sub and backreference:
DF1 <- sub("(.*)_(T\\w)$", "\\2_\\1", sub("_T$", "", DF1))
Here the inner sub deletes a string-final _T. Its result is passed to the outer sub, which swaps (i) whatever comes before the last underscore _ and (ii) T followed by a digit or a letter (\\w), by referring to these two substrings with the backreferences \\1 and \\2.
Result:
DF1
[1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007" "TR_HM_088" "T3_HM_088"
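To see what each sub contributes, the two steps can be run separately. The sketch below also includes one made-up ID with a two-character suffix ("HM_010_T10") to show that T\\w only covers a single character after T; T\\w+ would be needed for longer suffixes.

```r
ids <- c("HM_004_T", "HM_004_T2", "HM_088_TR", "HM_010_T10")  # last one is made up

# Step 1: drop a bare trailing "_T"
step1 <- sub("_T$", "", ids)

# Step 2: move "T" plus exactly one character to the front
step2 <- sub("(.*)_(T\\w)$", "\\2_\\1", step1)

# Variant: allow one or more characters after "T"
step2_plus <- sub("(.*)_(T\\w+)$", "\\2_\\1", step1)

step1
step2
step2_plus
```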
Data:
DF1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2",
"HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")
You can achieve this relatively easily with the stringr package and its functions str_remove() and str_replace().
I am assuming that the patterns of interest always occur at the end of the text and that they are always preceded by _.
Please have a look at the updated code below. It treats the pattern _T*, where * can now also be a letter, as a target.
library(stringr)
Df1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2",
"HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")
# Here I remove a trailing "_T" that stands on its own;
# suffixes like "_T2" or "_TR" are handled in the next step
df2 <- str_remove(Df1, "_T$")
# Here I replace the patterns through the group reference
final <- str_replace( df2, "(^.*)_(T\\d+$|T\\w+$)", "\\2_\\1" )
final
#> [1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007"
#> [7] "TR_HM_088" "T3_HM_088"
# A more concise way is the following, where \\w is the workhorse.
final <- str_replace( df2, "(^.*)_(T\\w$)", "\\2_\\1" )
final
#> [1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007"
#> [7] "TR_HM_088" "T3_HM_088"
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?
I have 71 columns in a dataframe, 10 of which include data that may include a year between 1990 and 2019 in the format YYYY (e.g. 2019). For example:
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
I am trying to find a way to pull the years from relevant cells and insert them in a new column.
So far, I am only aware of how to filter the data in a very time-consuming way. I have produced the following code, which starts like this:
dated_data <- select(undated_data, 1:71) %>%
filter(grepl("1990", id_1) | grepl("1990", id_2) | grepl("1991", id_1) | grepl("1991", id_2))
However, it takes a really long time to write that for all ten columns and all 30 years. I am sure there is a quicker way. I also have no idea how to then pull the dates from each of the matching cells into a new cell.
The output I want looks like this:
dated_data$year <- c("2013", "2014", "2016", "1990")
Does anyone know how I do this? Thank you in advance for your help!
There are many ways. This is one of them:
Step 1: define a pattern you want to match with regex:
pattern <- "(1|2)\\d{3}"
Step 2: define a function to extract raw matches:
extract <- function(x) unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
Step 3: apply the function to your data, e.g., id_1:
extract(id_1)
[1] "2013" "2014" "2016" "1990"
Here's another way, actually simpler ;)
It uses the str_extract function from the stringr package. So you install the package and activate it:
install.packages("stringr")
library(stringr)
and use str_extract to pull your matches:
years <- str_extract(id_1,"(1|2)\\d{3}")
years
[1] "2013" "2014" "2016" "1990"
EDIT:
If not every string contains a match and you want to preserve the length of the vectors/columns, you can use ifelse to test whether the regex finds a match and, where it doesn't, to put NA.
For example, if your data is like this (note the two added strings which do not contain years):
id_3 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759", "gbgbgbgb", "hnhna25")
you can set up the ifelse test like this:
years <- ifelse(grepl("(1|2)\\d{3}", id_3), str_extract(id_3,"(1|2)\\d{3}"), NA)
years
[1] "2013" "2014" "2016" "1990" NA NA
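The same length-preserving behaviour can be had in base R without stringr. The sketch below uses a hypothetical helper, extract_first(), that returns the first match per element and NA where there is none:

```r
extract_first <- function(x, pattern = "(1|2)\\d{3}") {
  out <- rep(NA_character_, length(x))
  m   <- regexpr(pattern, x)      # position of the first match, -1 if none
  hit <- !is.na(m) & m > 0
  out[hit] <- regmatches(x, m)    # regmatches() returns only the actual matches
  out
}

id_3 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759",
          "gbgbgbgb", "hnhna25")
years <- extract_first(id_3)
years
```

Because the result always has the same length as the input, it can be assigned directly to a new data frame column.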
Based on the example in your question, you are trying to filter out any rows without years and then extract the year from the string. It looks like every row only contains 1 year. Here is some code so that you do not have to write long filter statements for 10 columns and 30 years. Keep in mind that I don't have your data so I couldn't test it.
library(tidyverse)
undated_data %>%
select(1:71) %>%
filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
mutate(year = str_extract(id_1, pattern = paste0(1990:2019, collapse = "|")))
EDIT: based on your comment, it looks like maybe some columns have a year and others do not. What we do instead is pull the year out of every id_* column and then coalesce those columns together. Again, without your data it's tough to test this.
undated_data %>%
select(1:71) %>%
filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
mutate_at(vars(starts_with("id_")), list(year = ~str_extract(., pattern = paste0(1990:2019, collapse = "|")))) %>%
mutate(year = coalesce(!!!select(., ends_with("_year")))) %>%
select(-ends_with("_year"))
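For comparison, the same idea (extract a year per id_ column, then keep the first non-missing one per row) can be sketched in base R. The data frame below is made up for illustration, and first_match() is a hypothetical helper:

```r
# Made-up stand-in for the real data
undated_data <- data.frame(
  id_1 = c("regkfg_2013", "no_year_here", "f2016sghsg", "nothing"),
  id_2 = c("abc", "2fgdg_2014_hf", "xyz", "none"),
  stringsAsFactors = FALSE
)
year_pattern <- paste0(1990:2019, collapse = "|")

# First match per element, NA where there is none
first_match <- function(x) {
  out <- rep(NA_character_, length(x))
  m   <- regexpr(year_pattern, x)
  hit <- !is.na(m) & m > 0
  out[hit] <- regmatches(x, m)
  out
}

id_cols    <- grep("^id_", names(undated_data), value = TRUE)
per_column <- lapply(undated_data[id_cols], first_match)

# Row-wise: keep the first non-missing year across the id_ columns
undated_data$year <- Reduce(function(a, b) ifelse(is.na(a), b, a), per_column)

# Drop rows where no id_ column contained a year
dated_data <- undated_data[!is.na(undated_data$year), ]
dated_data
```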
Using tidyverse methods:
undated_data %>%
mutate_at(vars(1:71),
~ str_extract(., "(1|2)[0-9]{3}"))
(Note that the regex pattern will match numbers that may not be years, such as 2999; if your data has many "false positives" like that, you may be better off writing a custom function.)
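If false positives are a concern, one option short of a custom function is a pattern that only admits 1990 to 2019. A small base R sketch of the difference, using made-up strings:

```r
loose  <- "(1|2)[0-9]{3}"          # any four digits starting with 1 or 2
strict <- "199[0-9]|20[01][0-9]"   # 1990-1999 or 2000-2019 only

s <- c("id2999", "x_2013", "gjdg1990_3759", "a1234b")

loose_hits  <- grepl(loose, s)
strict_hits <- grepl(strict, s)

# Sanity check: the strict pattern accepts exactly the years 1990 to 2019
whole <- paste0("^(", strict, ")$")
covers_range  <- all(grepl(whole, 1990:2019))
rejects_edges <- !any(grepl(whole, c(1989, 2020)))

loose_hits
strict_hits
```

The strict pattern can still match inside a longer run of digits, so it reduces false positives rather than ruling them out.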
Here is a similar solution to the one provided, but using dplyr and stringr on a data.frame.
library(stringr)
library(dplyr)
library(tidyr) # provides pivot_longer()
df<-data.frame("X1" = id_1,"X2" = id_2)
#Set in cols the column names from which years are going to be extracted
df %>%
pivot_longer(cols = c("X1","X2"), names_to = "id") %>%
arrange(id) %>%
mutate(new = unlist(str_extract_all(value, pattern = "(1|2)\\d{3}")))
Base R solution:
# Sample data: id_1; id_2 => character vectors
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
# Thanks #Chris Ruehlemann: store the date pattern: date_pattern => character scalar
date_pattern <- "(1|2)\\d{3}"
# Convert to data.frame: df => data.frame
df <- data.frame(id_1, id_2, stringsAsFactors = FALSE)
# Subset the data to only contain date information vectors: dates_subset => data.frame
dates_subset <- df[,sapply(df, function(x){any(grepl(date_pattern, x))}), drop = FALSE]
# Initialise the year vector: year => character vector:
df$years <- NA_character_
# Remove punctuation and letters, return valid dates, combine into a comma-separated string:
# Store the dates found in the string: years => character vector
df$years[which(rowSums(Vectorize(grepl)(date_pattern, dates_subset)) > 0)] <-
apply(sapply(dates_subset, function(x){
grep(date_pattern, unlist(strsplit(x, "[[:punct:]]|[a-zA-Z]")), value = TRUE)}),
1, paste, collapse = ", ")
Here may be another solution.
We just use the gsub() function and set the pattern to ".*(199[0-9]|20[01][0-9]).*".
The pattern captures a year between 1990 and 2019 as its only group, so we replace the original text with the content of that first group :)
library(magrittr)
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_1)
# [1] "2013" "2014" "2016" "1990"
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_2)
#[1] "2013" "2014" "2016" "1990"
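Two properties of this approach are worth knowing, shown in the sketch below with made-up strings: the leading .* is greedy, so with several years in one string the last one is captured, and a string without a match is returned unchanged rather than as NA. The lazy variant and the NA handling are one possible adjustment, not part of the answer above.

```r
pattern <- "(199[0-9]|20[01][0-9])"
x <- c("a2013b2015", "no_year")

# Greedy: captures the last year; unmatched strings come back unchanged
greedy <- gsub(paste0(".*", pattern, ".*"), "\\1", x)

# Lazy leading part (needs perl = TRUE): captures the first year instead
lazy <- sub(paste0("^.*?", pattern, ".*$"), "\\1", x, perl = TRUE)

# Turn unmatched strings into NA explicitly
first_year <- ifelse(grepl(pattern, x), lazy, NA_character_)

greedy
lazy
first_year
```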
I stored some items that didn't fulfill a criterion in a vector.
non.fulfilled <- c('positive', 'beta.1', 'beta.2', 'negative', 'alpha.1', 'alpha.2', 'alpha.3')
Now, I would like to find which words are in my vector multiple times and afterwards add them to this vector. So in this case:
non.fulfilled2 <- c(non.fulfilled, 'beta', 'alpha')
How do I find these words?
If we assume that a "word" here is defined as the first run of \w ("word characters"), we can do as follows to get the desired output:
non.fulfilled <- c('positive', 'beta.1', 'beta.2', 'negative', 'alpha.1', 'alpha.2', 'alpha.3')
library(stringr)
words <- str_extract(non.fulfilled, "\\w+")
unique(words[duplicated(words)])
#> [1] "beta" "alpha"
EDIT: After clarification in the comments, we can get duplicates like so:
words <- str_replace(non.fulfilled, "\\..*", "")
unique(words[duplicated(words)])
#> [1] "beta" "alpha"
Created on 2019-12-23 by the reprex package (v0.3.0)
We can use sub to keep string before dot, count their occurrence using table and select values which occur more than once.
vals <- table(sub('\\..*', '', non.fulfilled))
names(vals[vals > 1])
#[1] "alpha" "beta"
Append them to original vector
c(non.fulfilled, names(vals[vals > 1]))
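The table() route and the duplicated() route find the same words but return them in a different order: table() sorts the names, while duplicated() keeps the order of first repetition. A short base R sketch putting the two side by side and building the extended vector:

```r
non.fulfilled <- c('positive', 'beta.1', 'beta.2', 'negative',
                   'alpha.1', 'alpha.2', 'alpha.3')

# Keep the part before the first dot
stems <- sub("\\..*", "", non.fulfilled)

vals          <- table(stems)
by_table      <- names(vals[vals > 1])            # sorted: "alpha" "beta"
by_first_seen <- unique(stems[duplicated(stems)]) # as encountered: "beta" "alpha"

# Append the repeated words to the original vector
non.fulfilled2 <- c(non.fulfilled, by_first_seen)
non.fulfilled2
```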
We can also use tidyverse approaches
library(dplyr)
library(stringr)
tibble(non.fulfilled) %>%
mutate(non.fulfilled = str_remove(non.fulfilled, "\\.\\d+$")) %>%
count(non.fulfilled) %>%
filter(n > 1) %>%
pull(non.fulfilled)
#[1] "alpha" "beta"