R How to conditionally remove the last N characters from multiple observations - r

I am working with a large dataset and am having difficulty removing the last N characters from my variable pitchResult. Below is a sample of what the column looks like:
pitchResult <- c("strike", "Home Run on 368.96 ft fly ball", "Ground out","Home Run
on a 415.0 ft fly ball", "Home Run on a 401.77 ft line drive", "ball")
However, I only want to remove the last N characters from all observations starting with Home Run so only Home Run is displayed and not the following information. I was thinking gsub could work but unsure how to conditionally format it in this fashion. Thanks all in advance!

You don't need an ifelse statement. Just use regex backreferencing with capturing parenthesis.
> gsub(x = pitchResult, pattern = "(Home Run).+$", replacement = "\\1")
[1] "strike" "Home Run" "Ground out" "Home Run" "Home Run" "ball"

ifelse would probably help. Like this:
library(stringr)
> pitchResult
[1] "strike" "Home Run on 368.96 ft fly ball"
[3] "Ground out" "Home Run \non a 415.0 ft fly ball"
[5] "Home Run on a 401.77 ft line drive" "ball"
> ifelse(grepl("Home Run",pitchResult),str_extract(pitchResult,"Home Run"),pitchResult)
[1] "strike" "Home Run" "Ground out" "Home Run" "Home Run"
[6] "ball"

Related

How to remove hidden characters in R from string imported from Excel?

Using the openxlsx package in R, I am importing data from an Excel file that originated in Brazil. In the character strings, there seem to be hidden characters. As you can see from my code, I remove the white space but strings 2 and 3 are still shown as unique strings. Strings 6 and 7 are also appearing as unique strings when they should be identical.
moths %<>%
mutate(Details = str_trim(Details))
sort(unique(moths$Details))
[1] "Check Outside" "Check Plot"
[3] "Check Plot ​​" "Check Plot ​​ (between treatments)"
[5] "PRX-01GA1-21022.00" "PRX-01GA2-22001​"
[7] "PRX-01GA2-22001" "PRX-01GA2-22002​"
[9] "PRX-01GA2-22002" "PRX-01GA2-22003"
[11] "PRX-01GA2-22004" "SF2.5VP"
[13] "YM001-22" "YM001-PRX-01GA2-22001​ PRX-01GA2-22001​"
Unfortunately, since I can't attach the Excel file that the data are coming from, I can't make a completely reproducible example here, but hopefully someone can still provide some insight.
There may be some non-ascii characters in your data. If you're happy to remove them, you can use textclean, like so (this example uses the first 4 values of your data):
vec <- c("Check Outside", "Check Plot", "Check Plot ​​",
"Check Plot ​​ (between treatments)")
unique(vec)
# [1] "Check Outside" "Check Plot"
# [3] "Check Plot ​​" "Check Plot ​​ (between treatments)"
library(textclean)
vec2 <- replace_non_ascii(vec)
unique(vec2)
# [1] "Check Outside" "Check Plot" "Check Plot (between treatments)"
So tl;dr this should do what you’re after
library(textclean)
moths <- moths %>%
mutate(Details = replace_non_ascii(str_trim(Details)))
you haven't saved your cleaned data
moths %<>%
mutate(Details = str_trim(Details)) -> moths
sort(unique(moths$Details))

Extract title from multiple lines

I have multiple files each one has a different title, I want to extract the title name from each file. Here is an example of one file
[1] "<START" "ID=\"CMP-001\"" "NO=\"1\">"
[4] "<NAME>Plasma-derived" "vaccine" "(PDV)"
[7] "versus" "placebo" "by"
[10] "intramuscular" "route</NAME>" "<DIC"
[13] "CHI2=\"3.6385\"" "CI_END=\"0.6042\"" "CI_START=\"0.3425\""
[16] "CI_STUDY=\"95\"" "CI_TOTAL=\"95\"" "DF=\"3.0\""
[19] "TOTAL_1=\"0.6648\"" "TOTAL_2=\"0.50487622\"" "BLE=\"YES\""
.
.
.
[789] "TOTAL_2=\"39\"" "WEIGHT=\"300.0\"" "Z=\"1.5443\">"
[792] "<NAME>Local" "adverse" "events"
[795] "after" "each" "injection"
[798] "of" "vaccine</NAME>" "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>"
[801] "</GROUP_LABEL_2>" "<GRAPH_LABEL_1>" "PDV</GRAPH_LABEL_1>"
the extracted expected title is
Plasma-derived vaccine (PDV) versus placebo by intramuscular route
Note, each file has a different title's length.
Here is a solution using stringr. This first collapses the vector into one long string, and then captures all words / characters that are not a newline \n between every pair of "<NAME>" and "</NAME>". In the future, people will be able to help you easier if you make a reproducible example (e.g., using dput()). Hope this helps!
Note: if you just one the first title you can use str_match() instead of str_match_all().
library(stringr)
str_match_all(paste0(string, collapse = " "), "<NAME>(.*?)</NAME>")[[1]][,2]
[1] "Plasma-derived vaccine (PDV) versus placebo by intramuscular route"
[2] "Local adverse events after each injection of vaccine"
Data:
string <- c("<START", "ID=\"CMP-001\"", "NO=\"1\">", "<NAME>Plasma-derived", "vaccine", "(PDV)", "versus", "placebo", "by", "intramuscular", "route</NAME>", "<DIC", "CHI2=\"3.6385\"", "CI_END=\"0.6042\"", "CI_START=\"0.3425\"", "CI_STUDY=\"95\"", "CI_TOTAL=\"95\"", "DF=\"3.0\"", "TOTAL_1=\"0.6648\"", "TOTAL_2=\"0.50487622\"", "BLE=\"YES\"",
"TOTAL_2=\"39\"", "WEIGHT=\"300.0\"", "Z=\"1.5443\">", "<NAME>Local", "adverse", "events", "after", "each", "injection", "of", "vaccine</NAME>", "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>", "</GROUP_LABEL_2>", "<GRAPH_LABEL_1>", "PDV</GRAPH_LABEL_1>")

Split 2 numbers character into it's own new line

I've a character object with 84 elements.
> head(output.by.line)
[1] "\n17"
[2] "Now when Joseph saw that his father"
[3] "laid his right hand on the head of"
[4] "Ephraim, it displeased him; so he took"
[5] "hold of his father's hand to remove it"
[6] "from Ephraim's head to Manasseh's"
But there is a line that has 2 numbers (49) that is not in it's own line:
[35] "49And Jacob called his sons and"
I'd like to transform this into:
[35] "\n49"
[36] "And Jacob called his sons and"
And insert this in the correct numeration, after object 34.
Dput Output:
dput(output.by.line)
c("\n17", "Now when Joseph saw that his father", "laid his right hand on the head of",
"Ephraim, it displeased him; so he took", "hold of his father's hand to remove it",
"from Ephraim's head to Manasseh's", "head.", "\n18", "And Joseph said to his father, \"Not so,",
"my father, for this one is the firstborn;", "put your right hand on his head.\"",
"\n19", "But his father refused and said, \"I", "know, my son, I know. He also shall",
"become a people, and he also shall be", "great; but truly his younger brother shall",
"be greater than he, and his descendants", "shall become a multitude of nations.\"",
"\n20", "So he blessed them that day, saying,", "\"By you Israel will bless, saying, \"May",
"God make you as Ephraim and as", "Manasseh!\"' And thus he set Ephraim",
"before Manasseh.", "\n21", "Then Israel said to Joseph, \"Behold, I",
"am dying, but God will be with you and", "bring you back to the land of your",
"fathers.", "\n22", "Moreover I have given to you one", "portion above your brothers, which I",
"took from the hand of the Amorite with", "my sword and my bow.\"",
"49And Jacob called his sons and", "said, \"Gather together, that I may tell",
"you what shall befall you in the last", "days:", "\n2", "\"Gather together and hear, you sons of",
"Jacob, And listen to Israel your father.", "\n3", "\"Reuben, you are my firstborn, My",
"might and the beginning of my strength,", "The excellency of dignity and the",
"excellency of power.", "\n4", "Unstable as water, you shall not excel,",
"Because you went up to your father's", "bed; Then you defiled it-- He went up to",
"my couch.", "\n5", "\"Simeon and Levi are brothers;", "Instruments of cruelty are in their",
"dwelling place.", "\n6", "Let not my soul enter their council; Let",
"not my honor be united to their", "assembly; For in their anger they slew a",
"man, And in their self-will they", "hamstrung an ox.", "\n7",
"Cursed be their anger, for it is fierce;", "And their wrath, for it is cruel! I will",
"divide them in Jacob And scatter them", "in Israel.", "\n8",
"\"Judah, you are he whom your brothers", "shall praise; Your hand shall be on the",
"neck of your enemies; Your father's", "children shall bow down before you.",
"\n9", "Judah is a lion's whelp; From the prey,", "my son, you have gone up. He bows",
"down, he lies down as a lion; And as a", "lion, who shall rouse him?",
"\n10", "The scepter shall not depart from", "Judah, Nor a lawgiver from between his",
"feet, Until Shiloh comes; And to Him", "shall be the obedience of the people.",
"\n11", "Binding his donkey to the vine, And his", "donkey's colt to the choice vine, He"
)
Please, check this:
library(tidyverse)
split_line_number <- function(x) {
x %>%
str_replace("^([0-9]+)", "\n\\1\b") %>%
str_split("\b")
}
output.by.line %>%
map(split_line_number) %>%
unlist()
# Output:
# [35] "\n49"
# [36] "And Jacob called his sons and"
# [37] "said, \"Gather together, that I may tell"
# [38] "you what shall befall you in the last"
An option using stringr::str_match is to match two components of an optional number followed by everything. Get the captured output from the matched matrix (2:3) and create a new vector of strings by dropping NAs and empty strings.
vals <- c(t(stringr::str_match(output.by.line, "(\n?\\d+)?(.*)")[, 2:3]))
output <- vals[!is.na(vals) & vals != ""]
output[32:39]
#[1] "portion above your brothers, which I"
#[2] "took from the hand of the Amorite with"
#[3] "my sword and my bow.\""
#[4] "49"
#[5] "And Jacob called his sons and"
#[6] "said, \"Gather together, that I may tell"
#[7] "you what shall befall you in the last" "days:"
We'll make use of the stringr package:
library(stringr)
Modify the object:
output.by.line <- unlist(
ifelse(grepl('[[:digit:]][[:alpha:]]', output.by.line), str_split(gsub('([[:digit:]]+)([[:alpha:]])', paste0('\n', '\\1 \\2'), output.by.line), '[[:blank:]]', n = 2), output.by.line)
)
Print the resuts:
dput(output.by.line)
#[32] "portion above your brothers, which I"
#[33] "took from the hand of the Amorite with"
#[34] "my sword and my bow.\""
#[35] "\n49"
#[36] "And Jacob called his sons and"
#[37] "said, \"Gather together, that I may tell"
#[38] "you what shall befall you in the last"

how to extract data frame elements that has a specified word in R?

I have a dataframe twts consisting of movie tweets that was retrieved from twitter . I want to extract specified elements of the data frame that has the word like "music" and store them in another data frame object. I thought of using for loops and parsing every sentence word by word. So is there any other efficient ways or in-buid functions to do the desired action.?
input : twt (initial data.frame)
[1] "Music is so nice"
[2] "the movie rocked"
[3] "the hero is the best"
[4] "theme music at its peak"
output : music (new data.frame)
[1] "Music is so nice"
[2] "theme music at its peak"
Thanks in advance :)
> twt <- c("Music is so nice", "the movie rocked", "the hero is the best", "theme music at its peak")
> twt[grep("music", twt, ignore.case = TRUE)]
[1] "Music is so nice" "theme music at its peak"

Words in sentences and their nearest neighbors in a lexicons

I have following data frame:
sent <- data.frame(words = c("just right size", "size love quality", "laptop worth price", "price amazing user",
"explanation complex what", "easy set", "product best buy", "buy priceless when"), user = c(1,2,3,4,5,6,7,8))
Sent data frame resulted into:
words user
just right size 1
size love quality 2
laptop worth price 3
price amazing user 4
explanation complex what 5
easy set 6
product best buy 7
buy priceless when 8
I need to remove word at the begining of following sentence which is the same as a word at the end of previous sentece.
I mean eg. we have a sentences "just right size" and "size love quality", so I need to remove word size at the second user possition.
Then sentences "laptop worth price" and "price amazing user", so I need to remove word price at fourth user possition.
Can anyone help me, I'll appreciate any of your help. Thank you very much in advance.
You could extract the "first" and "last" word from the "words" column for the succeeding row and the current row using sub. If the words are the same, remove the first word from the succeeding row or else keep it as such (ifelse(...))
w1 <- sub(' .*', '', sent$words[-1])
w2 <- sub('.* ', '', sent$words[-nrow(sent)])
sent$words <- as.character(sent$words)
sent$words
#[1] "just right size" "size love quality"
#[3] "laptop worth price" "price amazing user"
#[5] "explanation complex what" "easy set"
#[7] "product best buy" "buy priceless when"
sent$words[-1] <- with(sent, ifelse(w1==w2, sub('\\w+ ', '',words[-1]),
words[-1]))
sent$words
#[1] "just right size" "love quality"
#[3] "laptop worth price" "amazing user"
#[5] "explanation complex what" "easy set"
#[7] "product best buy" "priceless when"

Resources