I am familiar with the functions toupper() and tolower(), however that doesn't exactly give what I want here. Here is an example of the string I have and the string I want:
this = "This is the string THAT I have!"
that = "tHIS IS THE STRING that i HAVE!"
simple enough to describe with an example, harder to implement (i think).
Thanks!
I'm sort of curious if there is a better way than:
chartr(x = this,
old = paste0(c(letters,LETTERS),collapse = ""),
new = paste0(c(LETTERS,letters),collapse = ""))
Helpful observation by #Joris in the comments that ?chartr notes that you can use character ranges, avoiding the paste:
chartr("a-zA-Z", "A-Za-z",this)
Here's a way that works with characters that aren't between a and z / A and Z :
text <- "aBàÉ"
up <- gregexpr("\\p{Lu}", text, perl=TRUE)
lo <- gregexpr("\\p{Ll}", text, perl=TRUE)
regmatches(text,up)[[1]] <- tolower(regmatches(text,up)[[1]])
regmatches(text,lo)[[1]] <- toupper(regmatches(text,lo)[[1]])
text
#> [1] "AbÀé"
Related
Ideally, in base R I need some kind of string manipulation that will let me detect a pattern and change the string 3 positions after the pattern.
example <- "when string says SOMETHING = #c792ea"
desired output:
when string says SOMETHING = #001628
I have tried gsub but I am not sure how I can get it to replace the characters after a pattern.
If it based on the position of character, then we can use substring assignment
substring(example, 30) <- "#001628"
example
#[1] "when string says SOMETHING = #001628"
Or if we need to find the position of the word that starts with #
library(stringr)
posvec <- c(str_locate(example, "#\\w+"))
substring(example, posvec[1], posvec[2]) <- "#001628"
# // or with
# str_sub(example, posvec[1], posvec[2]) <- "#001628"
Another option is sub to change the substring after the = and one or more space (\\s*)
sub("=\\s*.*", "= #001628", example)
#[1] "when string says SOMETHING = #001628"
Hello I have a column in a data.frame, it has many rows, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"))
I want to make a new column "Species_new" where the "*" is moved to the end of the character string, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"),
"Species_new" = c("Briza minor*", "Briza minor*", "Wattle"))
Is there a way to do this using gsub? The manual example would take far too long as I have approximately 50,000 rows.
Thanks in advance
One option is to capture the * as a group and in the replacement reverse the backreferences
df$Species_new <- sub("^([*])(.*)$", "\\2\\1", df$Species)
df$Species_new
#[1] "Briza minor*" "Briza minor*" "Wattle"
NOTE: * is a metacharacter meaning 0 or more, so we can either escape (\\*) or place it in brackets ([]) to evaluate the raw character i.e. literal evaluation
Thanks so much for the quick response, I also found a workaround;
df$Species_new = sub("[*]","",df$Species, perl=TRUE)
differences = setdiff(df$Species,df$Species_new)
tochange = subset(df,df$Species == differences)
toleave = subset(df,!df$Species == differences)
tochange$Species_new = paste(tochange$Species_new, "*", sep = "")
df = rbind(tochange,toleave)
I'm trying to scrape information from this url: http://www.sports-reference.com/cbb/boxscores/index.cgi?month=2&day=3&year=2017 and have gotten decently far to the point where I have strings for each game that look like this:
str <-"Yale\n\t\t\t87\n\t\t\t\n\t\t\t\tFinal\n\t\t\t\t\n\t\t\t\n\t\tColumbia\n\t\t\t78\n\t\t\t \n\t\t\t\n\t\t"
Ideally I'd like to get to a vector or dataframe that looks something like:
str_vec <- c('Yale',87,'Columbia',78)
I've tried a few things that didn't work like:
without_n <- gsub(x = str, pattern = '\n')
without_Final <- gsub(x = without_n, pattern = 'Final')
str_vec <- strslpit(x = without_Final, split = '\t')
Thanks in advance for any helpful tips/answers!
You can use gsub to first replace all the non-alphanumeric characters in the string with an empty string. Then insert a space between the name and score. Thereafter you can split the string on space to a data structure needed.
require(stringr)
step_1 <- gsub('([^[:alnum:]]|(Final))', "", str)
#"Yale87Columbia78"
step_2 <- gsub("([[:alpha:]]+)([[:digit:]]+)", "\\1 \\2 ", step_1)
strsplit(str_trim(step_2)," ")
#"Yale" "87" "Columbia" "78"
I assume the string pattern is consistent, for this to work reliably.
how can I encode a url as this
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InchI=1S/C21H30O9/c1-11(5-6-21(28)12(2)8-13(23)9-20(21,3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19/h5-8,14,16-19,22,25-28H,9-10H2,1-4H3/b6-5+,11-7-/t14-,16-,17+,18-,19+,21-/m1/s1&token=e4a6d6fb-ae07-4cf6-bae8-c0e6115bc681
to make this
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InChI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1
on R?
I tried
URLencode
but it does not work.
Thanks
It seems that you want to get rid of all but first URL GET data specifier and then to encode the associated data.
url <- "..."
library(stringi)
(addr <- stri_replace_all_regex(url, "\\?.*", ""))
## [1] "http://www.chemspider.com/inchi.asmx/InChIToSMILES"
args <- stri_match_first_regex(url, "[?&](.*?)=([^&]+)")
(data <- stri_replace_all_regex(
stri_trans_general(args[,3], "[^a-zA-Z0-9\\-()]Any-Hex/XML"),
"&#x([0-9a-fA-F]{2});", "%$1"))
## [1] "InchI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1"
(addr <- stri_c(addr, "?", args[,2], "=", data))
## [1] "http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InchI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1"
Here I made use of the ICU's transliterator (via stri_trans_general). All characters but A..Z, a..z, 0..9, (, ), and - have been converted to hexadecimal representation
(it seems that URLencode does not handle , even with reserved=TRUE) of the form &#xNN;. Then, each &#xNN; was converted to %NN with stri_replace_all_regex.
Here are two approaches:
1) gsubfn/URLencode If u is an R character string containing the URL then try this. This inputs everything after ? to URLencode replacing the input with the output of that function. Note that "\\K" kills everything in the buffer up to that point so that the ? itself does not get encoded:
library(gsubfn)
gsubfn("\\?\\K(.*)", ~ URLencode(x, TRUE), u, perl = TRUE)
It gives the following (which is not identical to the output in the question but may be sufficient):
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi%3dInchI%3d1S%2fC21H30O9%2fc1-11(5-6-21(28)12(2)8-13(23)9-20(21,3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2fh5-8,14,16-19,22,25-28H,9-10H2,1-4H3%2fb6-5+,11-7-%2ft14-,16-,17+,18-,19+,21-%2fm1%2fs1%26token%3de4a6d6fb-ae07-4cf6-bae8-c0e6115bc681
2) gsubfn/curlEscape For a somewhat different output continuing to use gsubfn try:
library(RCurl)
gsubfn("\\?\\K(.*)", curlEscape, u, perl = TRUE)
giving:
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi%3DInchI%3D1S%2FC21H30O9%2Fc1%2D11%285%2D6%2D21%2828%2912%282%298%2D13%2823%299%2D20%2821%2C3%294%297%2D15%2824%2930%2D19%2D18%2827%2917%2826%2916%2825%2914%2810%2D22%2929%2D19%2Fh5%2D8%2C14%2C16%2D19%2C22%2C25%2D28H%2C9%2D10H2%2C1%2D4H3%2Fb6%2D5%2B%2C11%2D7%2D%2Ft14%2D%2C16%2D%2C17%2B%2C18%2D%2C19%2B%2C21%2D%2Fm1%2Fs1%26token%3De4a6d6fb%2Dae07%2D4cf6%2Dbae8%2Dc0e6115bc681
ADDED curlEscape approach
I'm having trouble doing some extraction & coercing of a string in R. I'm not very good with R... just enough to be dangerous. Any help would be appreciated.
I am trying to take a string of this form:
"AAA,BBB,CCC'
And create two items:
A list containing each element separately (i.e. 3 entries) - c("AAA","BBB","CCC"). I've tried strsplit(string, ",") but I get a list of length 1
A data frame with names = lower case entries, values = entries. e.g. df = data.frame(aaa=AAA, bbb=BBB, ccc=CCC). I'm not sure how to pull out each of the elements, and lowercase the references.
Hopefully this is doable with R. Appreciate your time!
If the string is malformed read in with quotes changed
malform <- read.table("weirdstring.txt", colClasses='character',quote = "")
str = gsub("\'|\"", "", malform[1,1])
The string should now look like:
str = "AAA,BBB,CCC"
## as list
ll <- unlist(strsplit(str, ","))
## df
df <- data.frame(t(ll))
names(df) <- sapply(ll, tolower)