Select words that are in a vector multiple times - r

I stored some items that didn't fulfill a criterion in a vector.
non.fulfilled <- c('positive', 'beta.1', 'beta.2', 'negative', 'alpha.1', 'alpha.2', 'alpha.3')
Now, I would like to find which words appear in my vector multiple times and then append them to the vector. So in this case:
non.fulfilled2 <- c(non.fulfilled, 'beta', 'alpha')
How do I find these words?

If we assume that a "word" here is defined as the first run of \w ("word characters"), we can do as follows to get the desired output:
non.fulfilled <- c('positive', 'beta.1', 'beta.2', 'negative', 'alpha.1', 'alpha.2', 'alpha.3')
library(stringr)
words <- str_extract(non.fulfilled, "\\w+")
unique(words[duplicated(words)])
#> [1] "beta" "alpha"
EDIT: After clarification in the comments, we can get duplicates like so:
words <- str_replace(non.fulfilled, "\\..*", "")
unique(words[duplicated(words)])
#> [1] "beta" "alpha"
Created on 2019-12-23 by the reprex package (v0.3.0)

We can use sub to keep the part of each string before the dot, count occurrences using table, and select the values which occur more than once.
vals <- table(sub('\\..*', '', non.fulfilled))
names(vals[vals > 1])
#[1] "alpha" "beta"
Append them to the original vector:
c(non.fulfilled, names(vals[vals > 1]))
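For the example data, the combined vector then contains (output reconstructed by hand from the inputs above):
#[1] "positive" "beta.1" "beta.2" "negative" "alpha.1" "alpha.2" "alpha.3" "alpha" "beta"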

We can also use a tidyverse approach:
library(dplyr)
library(stringr)
tibble(non.fulfilled) %>%
  mutate(non.fulfilled = str_remove(non.fulfilled, "\\.\\d+$")) %>%
  count(non.fulfilled) %>%
  filter(n > 1) %>%
  pull(non.fulfilled)
#[1] "alpha" "beta"

Related

How can I add NA values conditionally in a vector in R?

Let's say my data is df <- c("Author1","Reference1","Abstract1","Author2","Reference2","Abstract2","Author3","Reference3","Author4","Reference4","Abstract4").
This is a series in which the order is Author, Reference and Abstract. But in some cases, the Abstract data is missing. (In this example, the third Abstract is missing.) So, how can I add NA values in place of Abstract, when Abstract is missing?
In other words, If an element in the vector starts with the word "Reference", but its next element doesn't start with the word "Abstract", I want to add an NA value just after the element starting with "Reference". The result vector should be
result <- c("Author1","Reference1","Abstract1","Author2","Reference2","Abstract2","Author3","Reference3",NA,"Author4","Reference4","Abstract4")
How can I do it?
I have tried the append function in R, but for using it, I need to have the index number of the element where I want to add NA. So, it takes a manual entry for each NA element.
Here's an approach.
Basically you build two logical vectors:
one tests whether each element contains Reference; the other tests whether each element does not contain Abstract.
You offset the second vector by one position, because you want to test whether an Abstract follows each Reference.
You take the logical AND of the two,
then you insert NAs into the positions where an Abstract should be but isn't, with append().
ab_missing <- grepl("Reference", df) & c(!grepl("Abstract", df)[-1], FALSE)
df <- append(df, NA, which(ab_missing))
df
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3" "Reference3" NA "Author4"
[11] "Reference4" "Abstract4"
One way (and the only way I get these things done) is to think in tibbles or data frames, so this is probably not the best approach!
We create a tibble of one column called x,
then we group by the trailing numbers (e.g. 1,1,1) using the parse_number() function from readr (I love parse_number()),
then with summarise(cur_data()[seq(3),]) we expand each group to three rows (see Expand each group to the max n of rows),
stop here and pull if NA is desired, otherwise continue:
finally we use paste0 with R's recycling ability and pull the vector.
1. In case NA is desired:
library(dplyr)
library(readr)
my_vector <- tibble(x = c("Author1","Reference1","Abstract1","Author2","Reference2",
                          "Abstract2","Author3","Reference3","Author4","Reference4","Abstract4")) %>%
  group_by(group = parse_number(x)) %>%
  summarise(cur_data()[seq(3),]) %>%
  pull(x)
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" NA "Author4" "Reference4" "Abstract4"
2. In case the missing word is desired:
library(dplyr)
library(readr)
my_vector <- tibble(x = c("Author1","Reference1","Abstract1","Author2","Reference2",
                          "Abstract2","Author3","Reference3","Author4","Reference4","Abstract4")) %>%
  group_by(group = parse_number(x)) %>%
  summarise(cur_data()[seq(3),]) %>%
  mutate(group = paste0(c("Author", "Reference", "Abstract"), group)) %>%
  pull(group)
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" "Abstract3" "Author4" "Reference4" "Abstract4"
A slightly different approach might be (using df, the vector from the question):
c(sapply(split(df, cumsum(grepl("Author", df))), function(x) head(c(x, NA_character_), 3)))
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" NA "Author4" "Reference4" "Abstract4"

Find year in random data in R

I have 71 columns in a dataframe, 10 of which include data that may include a year between 1990 and 2019 in the format YYYY (e.g. 2019). For example:
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
I am trying to find a way to pull the years from relevant cells and insert them in a new column.
So far, I am only aware of how to filter the data in a very time-consuming way. I have produced the following code, which starts like this:
dated_data <- select(undated_data, 1:71) %>%
  filter(grepl("1990", id_1) | grepl("1990", id_2) | grepl("1991", id_1) | grepl("1991", id_2) |
However, it takes a really long time to write that out for all ten columns and all 30 years. I am sure there is a quicker way. I also have no idea how to then pull the dates from each of the matching cells into a new column.
The output I want looks like this:
dated_data$year <- c("2013", "2014", "2016", "1990")
Does anyone know how I do this? Thank you in advance for your help!
There are many ways. This is one of them:
Step 1: define a pattern you want to match with regex:
pattern <- "(1|2)\\d{3}"
Step 2: define a function to extract raw matches:
extract <- function(x) unlist(regmatches(x, gregexpr(pattern, x, perl = T)))
Step 3: apply the function to your data, e.g., id_1:
extract(id_1)
[1] "2013" "2014" "2016" "1990"
Here's another way, actually simpler ;)
It uses the str_extract function from the stringr package. So you install the package and load it:
install.packages("stringr")
library(stringr)
and use str_extract to pull your matches:
years <- str_extract(id_1,"(1|2)\\d{3}")
years
[1] "2013" "2014" "2016" "1990"
EDIT:
If not every string contains a match and you want to preserve the length of the vectors/columns, you can use ifelse to test whether the regex finds a match and, where it doesn't, to put NA.
For example, if your data is like this (note the two added strings which do not contain years):
id_3 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759", "gbgbgbgb", "hnhna25")
you can set up the ifelse test like this:
years <- ifelse(grepl("(1|2)\\d{3}", id_3), str_extract(id_3,"(1|2)\\d{3}"), NA)
years
[1] "2013" "2014" "2016" "1990" NA NA
Based on the example in your question, you are trying to filter out any rows without years and then extract the year from the string. It looks like every row only contains 1 year. Here is some code so that you do not have to write long filter statements for 10 columns and 30 years. Keep in mind that I don't have your data so I couldn't test it.
library(tidyverse)
undated_data %>%
  select(1:71) %>%
  filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
  mutate(year = str_extract(id_1, pattern = paste0(1990:2019, collapse = "|")))
EDIT: based on your comment, it looks like maybe some columns have a year and others do not. What we do instead is pull the year out of every column with id_* and then coalesce the columns together. Again, without your data it's tough to test this.
undated_data %>%
  select(1:71) %>%
  filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
  mutate_at(vars(starts_with("id_")), list(year = ~str_extract(., pattern = paste0(1990:2019, collapse = "|")))) %>%
  mutate(year = coalesce(!!!select(., ends_with("_year")))) %>% # splice the id_*_year columns into coalesce()
  select(-ends_with("_year"))
Using tidyverse methods:
undated_data %>%
  mutate_at(vars(1:71),
            funs(str_extract(., "(1|2)[0-9]{3}")))
(Note that the regex pattern will match numbers that may not be years, such as 2999; if your data has many "false positives" like that, you may be better off writing a custom function.)
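For instance, a minimal sketch of such a custom function (my own illustration; extract_year is a hypothetical name, and the 1990–2019 range is taken from the question):
# hypothetical helper: extract the first 4-digit run, keep it only if it is 1990-2019
extract_year <- function(x) {
  candidate <- stringr::str_extract(x, "\\d{4}")
  ifelse(!is.na(candidate) & candidate >= "1990" & candidate <= "2019",
         candidate, NA_character_)
}
extract_year(c("regkfg_2013", "f2999sghsg"))
[1] "2013" NA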
Here is a similar solution to the one provided, but using dplyr, tidyr and stringr on a data.frame.
library(stringr)
library(dplyr)
library(tidyr)
df <- data.frame("X1" = id_1, "X2" = id_2, stringsAsFactors = FALSE)
# Set in cols the column names from which years are going to be extracted
df %>%
  pivot_longer(cols = c("X1", "X2"), names_to = "id") %>%
  arrange(id) %>%
  mutate(new = unlist(str_extract_all(value, pattern = "(1|2)\\d{3}")))
Base R solution:
# Sample data: id_1; id_2 => character vectors
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
# Thanks #Chris Ruehlemann: store the date pattern: date_pattern => character scalar
date_pattern <- "(1|2)\\d{3}"
# Convert to data.frame: df => data.frame
df <- data.frame(id_1, id_2, stringsAsFactors = FALSE)
# Subset the data to only contain date information vectors: dates_subset => data.frame
dates_subset <- df[,sapply(df, function(x){any(grepl(date_pattern, x))}), drop = FALSE]
# Initialise the year vector: years => character vector:
df$years <- NA_character_
# Remove punctuation and letters, return valid dates, combine into a comma-separated string:
# Store the dates found in the string: years => character vector
df$years[which(rowSums(Vectorize(grepl)(date_pattern, dates_subset)) > 0)] <-
  apply(sapply(dates_subset, function(x){
    grep(date_pattern, unlist(strsplit(x, "[[:punct:]]|[a-zA-Z]")), value = TRUE)}),
    1, paste, collapse = ", ")
Here may be another solution.
We just use the gsub() function and set the pattern ".*(199[0-9]|20[01][0-9]).*".
The pattern captures a year between 1990 and 2019 as a group (and only one group), so we replace the original text with that first group's string :)
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_1)
# [1] "2013" "2014" "2016" "1990"
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_2)
#[1] "2013" "2014" "2016" "1990"

Find out single word names

I have a Name column and names are like this:
Preety ..
Sudalai Rajkumar S.
Parvathy M. S.
Navaraj Ranjan Arthur
I want to get which of these are single-word names, like in this case Preety.
I have tried eliminating the "." and " ", counting the length, and using the difference between this length and the original string length.
But it's not giving me the desired output. Please help.
NBData3$namewodot <- gsub(" .","",NBData3$Client.Name)
NBData3$namewoblank <- gsub(" ","",NBData3$namewodot)
wordlength <- NBData3$namelengthchar-nchar(as.character(NBData3$namewoblank))
You could use str_count from stringr inside an ifelse() statement to flag one-word names, first removing the dots from the names with gsub:
library(stringr)
NBData3$namewodot <- gsub("\\.", "", NBData3$Client.Name)
NBData3$oneword <- ifelse(str_count(NBData3$namewodot , '\\w+') == 1, TRUE, FALSE)
# Client.Name namewodot oneword
# 1 Preety .. Preety TRUE
# 2 Sudalai Rajkumar S. Sudalai Rajkumar S FALSE
# 3 Parvathy M. S. Parvathy M S FALSE
# 4 Navaraj Ranjan Arthur Navaraj Ranjan Arthur FALSE
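A small simplification (my addition, not part of the original answer): the comparison already yields TRUE/FALSE, so the ifelse() wrapper can be dropped:
# equivalent, since the == comparison is already logical
NBData3$oneword <- str_count(NBData3$namewodot, '\\w+') == 1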
This seems to work for your example:
names = c("Preety ..",
          "Sudalai Rajkumar S.",
          "Parvathy M. S.",
          "Navaraj Ranjan Arthur")
names[sapply(strsplit(gsub(".", "", names, fixed = TRUE), " ", fixed = TRUE), function(x) length(x) == 1)]
[1] "Preety .."
This may be a bit roundabout, but here is a text mining approach. There are definitely more streamlined ways, but I thought there might be concepts in here that are also useful.
# define the data frame
df <- data.frame(Name = c("Preety ..",
                          "Sudalai Rajkumar S.",
                          "Parvathy M. S.",
                          "Navaraj Ranjan Arthur"),
                 stringsAsFactors = FALSE)
library(tidyverse)
library(tidytext)
# break each name out by words. remove all the periods
df_token <- df %>%
  rowid_to_column(var = "name_id") %>%
  mutate(Name = str_remove_all(Name, pattern = "\\.")) %>%
  unnest_tokens(name_split, Name, to_lower = FALSE)
# find the lines with only one word
df_token %>%
  group_by(name_id) %>%
  summarize(count = n()) %>%
  filter(count == 1) %>%
  left_join(df_token) %>%
  pull(name_split)
[1] "Preety"
In base R you could use grep:
grep("^\\S+$", gsub("\\W+$", "", names), value = TRUE)
[1] "Preety"
If you need the names as originally given, then you will just use [:
names[grep("^\\S+$", gsub("\\W+$", "", names))]
[1] "Preety .."

Finding a value in R that contains a certain string

Is there a way in R to find values in a column that contain a certain word? For example, I want to find all the values that contain the word "the", where some values of the column are "the_cat", "the_dog" and "dog".
x <- c("the_dog", "the_cat", "dog")
Using the example above, the answer would be 2.
I know this is relatively easy to do in Python, but I am wondering if there is a way to do this in R. Thanks!
Try:
sum(grepl("(?<![A-Za-z])the(?![A-Za-z])", x, perl = T))
This gives a sum of 2 on your example.
But let's consider also a slightly more complex example:
x <- c("the_dog", "the_cat", "dog", "theano", "menthe", " the")
Output:
[1] 3
Above we're trying to match any "the" that doesn't have another letter before or after it (so e.g. theano is excluded).
You could also put other characters inside the [] that you don't want to allow next to the match; e.g. if you wouldn't count the99 as the word "the", you would use [A-Za-z0-9], etc.
You can also use the above with stringr, for example (I've included the exclusion of numbers, so below the99 wouldn't be counted as a word):
library(stringr)
sum(str_detect(x, "(?<![A-Za-z0-9])the(?![A-Za-z0-9])"))
library(dplyr)
library(stringr)
## with a vector
sum(str_detect(c("the_dog", "the_cat", "dog"), "the"))
## in a data frame
tibble(x = c("the_dog", "the_cat", "dog")) %>%
  filter(str_detect(x, "the")) %>%
  nrow()
x <- c("the_dog", "the_cat", "dog")
stringr::str_detect(x, "the")
#> [1] TRUE TRUE FALSE
Created on 2019-02-23 by the reprex package (v0.2.1)
Try also:
x <- c("the_dog", "the_cat", "dog")
sum(stringi::stri_count(x, regex = "^the")) # matches "the" at the beginning
Result:
[1] 2
Or:
x <- c("the_dog", "the_cat", "dog")
sum(stringi::stri_count(x, regex = "the{1,}")) # matches any "the"

Using Regex to edit a column in R [duplicate]

I've got a column people$food that has entries like chocolate or apple-orange-strawberry.
I want to split people$food by - and get the first entry from the split.
In python, the solution would be food.split('-')[0], but I can't find an equivalent for R.
If you need to extract the first (or nth) entry from each split, use:
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word,"-"), `[`, 1)
#[1] "apple" "chocolate"
Or faster and more explicitly:
vapply(strsplit(word,"-"), `[`, 1, FUN.VALUE=character(1))
#[1] "apple" "chocolate"
Both bits of code cope well with selecting whichever position in the split list you need, and deal gracefully with indices that are out of range:
vapply(strsplit(word,"-"), `[`, 2, FUN.VALUE=character(1))
#[1] "orange" NA
For example
word <- 'apple-orange-strawberry'
strsplit(word, "-")[[1]][1]
[1] "apple"
or, equivalently:
unlist(strsplit(word, "-"))[1]
Essentially the idea is that strsplit gives a list as a result, whose elements have to be accessed either by slicing (the former case) or by unlisting (the latter).
If you want to apply the method to an entire column:
first.word <- function(my.string){
  unlist(strsplit(my.string, "-"))[1]
}
words <- c('apple-orange-strawberry', 'orange-juice')
sapply(words, first.word)
apple-orange-strawberry            orange-juice
                "apple"                "orange"
I would use sub() instead. Since you want the first "word" before the split, we can simply remove everything after the first - and that's what we're left with.
sub("-.*", "", people$food)
Here's an example -
x <- c("apple", "banana-raspberry-cherry", "orange-berry", "tomato-apple")
sub("-.*", "", x)
# [1] "apple" "banana" "orange" "tomato"
Otherwise, if you want to use strsplit() you can round up the first elements with vapply()
vapply(strsplit(x, "-", fixed = TRUE), "[", "", 1)
# [1] "apple" "banana" "orange" "tomato"
I would suggest using head rather than [ in R.
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word, "-"), head, 1)
# [1] "apple" "chocolate"
dplyr/magrittr approach:
library(magrittr)
library(dplyr)
word = c('apple-orange-strawberry', 'chocolate')
strsplit(word, "-") %>% sapply(extract2, 1)
# [1] "apple" "chocolate"
Using str_remove() to delete everything after the pattern:
df <- data.frame(words = c('apple-orange-strawberry', 'chocolate'))
dplyr::mutate(df, short = stringr::str_remove(words, "-.*")) # mutate method
stringr::str_remove(df$words, "-.*") # str_remove example
stringr::str_replace(df$words, "-.*", "") # str_replace method
stringr::str_split_fixed(df$words, "-", n=2)[,1] # str_split method similar to original question's Python code
tidyr::separate(df, words, into = c("short", NA)) # using the separate function
words short
1 apple-orange-strawberry apple
2 chocolate chocolate
stringr 1.5.0 introduced str_split_i to do this easily:
library(stringr)
str_split_i(c('apple-orange-strawberry','chocolate'), "-", 1)
[1] "apple" "chocolate"
The third argument represents the index you want to extract. Also new is that you can use negative values to index from the right:
str_split_i(c('apple-orange-strawberry','chocolate'), "-", -1)
[1] "strawberry" "chocolate"