Partial Match String and full replacement over multiple vectors - r

I would like to efficiently replace all partially matching strings in a single column by supplying a vector of strings that serves both as the search patterns and as the replacements. That is, each element of df below is checked for a partial match against each element of vec_string. Where a match is found, the entire string is replaced with the matching element of vec_string, e.g. turning 'subscriber manager' into 'manager'. Adding more elements to vec_string should extend the search across the whole of df until every pattern has been processed.
I have started the function, but can't seem to finish it off by replacing the elements of df with those of vec_string. I'd appreciate your help.
df <- c(
'solicitor'
,'subscriber manager'
,'licensed conveyancer'
,'paralegal'
,'property assistant'
,'secretary'
,'conveyancing paralegal'
,'licensee'
,'conveyancer'
,'principal'
,'assistant'
,'senior conveyancer'
,'law clerk'
,'lawyer'
,'legal practice director'
,'legal secretary'
,'personal assistant'
,'legal assistant'
,'conveyancing clerk')
vec_string <- c('manager','law')
#function to search and replace
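replace_func <-
function(vec, str_vec) {
  repl_str <- list()
  for (i in seq_along(str_vec)) {
    # indices of the unique, lower-cased elements that contain str_vec[i]
    repl_str[[i]] <- grep(str_vec[i], unique(tolower(vec)))
  }
  # name the result after the argument, not the global vec_string
  names(repl_str) <- str_vec
  return(repl_str)
}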
replace_func(df,vec_string)
$`manager`
[1] 2
$law
[1] 13 14
As you can see, the function returns a named list with the indices of the elements to which the replacement will be applied.

This should do the trick:
res = sapply(df, function(x){
  # positions of the vec_string elements that occur in x
  match = which(sapply(vec_string, function(y) grepl(y, x)))
  if (length(match)) vec_string[match[1]] else x
}, USE.NAMES = FALSE)
res
[1] "solicitor" "manager" "licensed conveyancer"
[4] "paralegal" "property assistant" "secretary"
[7] "conveyancing paralegal" "licensee" "conveyancer"
[10] "principal" "assistant" "senior conveyancer"
[13] "law" "law" "legal practice director"
[16] "legal secretary" "personal assistant" "legal assistant"
[19] "conveyancing clerk"
We compare each element of df with each element of vec_string. If there is a match, the matching vec_string element is returned; otherwise the original string is left as it is. Note that if more than one pattern matches, only the first one is kept.
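The same "first match wins" behaviour can also be sketched without the nested sapply. This is only an illustration, not the answerer's code: the helper name replace_partial and the shortened jobs vector are made up for the example, and fixed = TRUE assumes the patterns are plain substrings rather than regular expressions.

```r
# Hypothetical helper (not from the original answer).
replace_partial <- function(vec, patterns) {
  out <- vec
  # Apply patterns in reverse order so that, when several patterns match the
  # same element, the earliest pattern is written last and therefore wins.
  for (p in rev(patterns)) {
    out[grepl(p, vec, fixed = TRUE)] <- p
  }
  out
}

jobs <- c("solicitor", "subscriber manager", "law clerk", "lawyer")
replace_partial(jobs, c("manager", "law"))
# [1] "solicitor" "manager"   "law"       "law"
```

Matching is always done against the original vec, so a replacement made for one pattern can never be matched again by a later one.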

Related

R: How to get column names for columns that contain a certain word AND their associated index number?

I want to create a list of column names that contain the word "arrest" AND their associated index number. I do not want all the columns, so I DO NOT want to subset the arrest columns into a new data frame. I merely want to see the list of names and their index numbers so I can delete the ones I don't want from the original data frame.
I tried getting the column names and their associated index numbers by using the below codes, but they only gave one or the other.
This gives me their names only
colnames(x2009_2014)[grepl("arrest",colnames(x2009_2014))]
[1] "poss_cannabis_tot_arrests" "poss_drug_total_tot_arrests"
[3] "poss_heroin_coke_tot_arrests" "poss_other_drug_tot_arrests"
[5] "poss_synth_narc_tot_arrests" "sale_cannabis_tot_arrests"
[7] "sale_drug_total_tot_arrests" "sale_heroin_coke_tot_arrests"
[9] "sale_other_drug_tot_arrests" "sale_synth_narc_tot_arrests"
[11] "total_drug_tot_arrests"
This gives me their index numbers only
grep("arrest", colnames(x2009_2014))
[1] 93 168 243 318 393 468 543 618 693 768 843
But I want their name AND index number so that it looks something like this
[93] "poss_cannabis_tot_arrests"
[168] "poss_drug_total_tot_arrests"
[243] "poss_heroin_coke_tot_arrests"
[318] "poss_other_drug_tot_arrests"
[393] "poss_synth_narc_tot_arrests"
[468] "sale_cannabis_tot_arrests"
[543] "sale_drug_total_tot_arrests"
[618] "sale_heroin_coke_tot_arrests"
[693] "sale_other_drug_tot_arrests"
[768] "sale_synth_narc_tot_arrests"
[843] "total_drug_tot_arrests"
Lastly, using advice here, I used the below code, but it did not work.
K=sapply(x2009_2014,function(x)any(grepl("arrest",x)))
which(K)
named integer(0)
The person who provided the advice in the above link used
K=sapply(df,function(x)any(grepl("\\D+",x)))
names(df)[K]
Zo.A Zo.B
which(K)
Zo.A Zo.B
2 4
I'd prefer the list I showed in the third block of code, but the code this person used provides a structure I can work with. It just did not work for me when I tried using it.
This is hacky as a one-liner, because I really dislike using <- inside a function call, but it should work:
setNames(
nm = matches <- grep("arrest", colnames(x2009_2014)),
colnames(x2009_2014)[matches]
)
Reproducible example:
setNames(nm = x <- grep("b|c", letters), letters[x])
# 2 3
# "b" "c"
Or write your own function that does it. Here I put it in a data frame, which seems nicer than a named vector:
grep_ind_value = function(pattern, x, ...) {
  # grep() takes the pattern first, then the vector to search
  index = grep(pattern, x, ...)
  value = x[index]
  data.frame(index, value)
}
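The sapply attempt in the question returned named integer(0) because it searched the values inside each column rather than the column names. Here is a small sketch of matching on the names instead, using a made-up data frame toy in place of x2009_2014 (its column names are assumptions for illustration only):

```r
# Toy data frame standing in for x2009_2014 (column names are made up).
toy <- data.frame(
  id = 1:2,
  poss_tot_arrests = c(3, 4),
  county = c("x", "y"),
  sale_tot_arrests = c(5, 6)
)

# Match against the column NAMES, keeping both the index and the name.
idx <- grep("arrest", names(toy))
arrest_cols <- data.frame(index = idx, name = names(toy)[idx])
print(arrest_cols)

# Drop one unwanted column by its index, here the second match.
trimmed <- toy[, -idx[2], drop = FALSE]
names(trimmed)
```

With the index and the name side by side, the unwanted columns can be removed from the original data frame by negative indexing, as in the last two lines.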

paste0 multiple lists of different lengths without looping

I am paste0'ing a bunch of variables into a definitive url list
id <- 1:10
animal <- c("dog", "cat", "fish")
base <- "www.google.com/"
urls <- paste0(base, "id=", id, "search=", animal)
The output looks like:
[1] "www.google.com/id=1search=dog" "www.google.com/id=2search=cat" "www.google.com/id=3search=fish"
[4] "www.google.com/id=4search=dog" "www.google.com/id=5search=cat" "www.google.com/id=6search=fish"
[7] "www.google.com/id=7search=dog" "www.google.com/id=8search=cat" "www.google.com/id=9search=fish"
[10] "www.google.com/id=10search=dog"
But I actually want the ids and animals to be repeated in sequence like:
[1] "www.google.com/id=1search=dog" "www.google.com/id=2search=dog" "www.google.com/id=3search=dog"
[4] "www.google.com/id=4search=dog" "www.google.com/id=5search=dog" "www.google.com/id=6search=dog"
[7] "www.google.com/id=7search=dog" "www.google.com/id=8search=dog" "www.google.com/id=9search=dog"
[10] "www.google.com/id=10search=dog" "www.google.com/id=1search=cat" ...
You can modify the code by including rep in paste0 or sprintf
sprintf('%sid=%dsearch=%s', base, id, rep(animal,each=length(id)))
Or
paste0(base, 'id=',id, 'search=', rep(animal,each=length(id)))
Or, as @MrFlick suggested, we can use expand.grid to get all the combinations of 'id' and 'animal'. The first argument of expand.grid varies fastest, so 'id' goes first to get the order you asked for:
with(expand.grid(i=id, a=animal), paste0(base, "id=", i, "search=", a))
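To see how the ordering works, here is a reduced sketch with three ids and two animals (the smaller vectors are only for readability). It relies on two documented behaviours: rep(..., each = n) repeats each element n times, and expand.grid varies its first argument fastest.

```r
id <- 1:3
animal <- c("dog", "cat")
base <- "www.google.com/"

# rep() version: each animal is repeated once per id, and id is recycled.
urls_rep <- paste0(base, "id=", id, "search=", rep(animal, each = length(id)))

# expand.grid() version: the first argument (id) varies fastest.
grid <- expand.grid(id = id, animal = animal, stringsAsFactors = FALSE)
urls_grid <- paste0(base, "id=", grid$id, "search=", grid$animal)

urls_rep
# [1] "www.google.com/id=1search=dog" "www.google.com/id=2search=dog"
# [3] "www.google.com/id=3search=dog" "www.google.com/id=1search=cat"
# [5] "www.google.com/id=2search=cat" "www.google.com/id=3search=cat"
identical(urls_rep, urls_grid)
# [1] TRUE
```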

R - get values from multiple variables in the environment

I have some variables in my current R environment:
ls()
[1] "clt.list" "commands.list" "dirs.list" "eq" "hurs.list" "mlist" "prec.list" "temp.list" "vars"
[10] "vars.list" "wind.list"
where each one of the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" is a (huge) list of strings.
For example:
clt.list[1:20]
[1] "clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc" "clt_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc"
[3] "clt_Amon_bcc-csm1-1_historical_r1i1p1_185001-201212.nc" "clt_Amon_bcc-csm1-1-m_historical_r1i1p1_185001-201212.nc"
[5] "clt_Amon_BNU-ESM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc"
[7] "clt_Amon_CCSM4_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-BGC_historical_r1i1p1_185001-200512.nc"
[9] "clt_Amon_CESM1-CAM5_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-CAM5-1-FV2_historical_r1i1p1_185001-200512.nc"
[11] "clt_Amon_CESM1-FASTCHEM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-WACCM_historical_r1i1p1_185001-200512.nc"
[13] "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-190412.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-200512.nc"
[15] "clt_Amon_CMCC-CESM_historical_r1i1p1_190501-190912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_191001-191412.nc"
[17] "clt_Amon_CMCC-CESM_historical_r1i1p1_191501-191912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_192001-192412.nc"
[19] "clt_Amon_CMCC-CESM_historical_r1i1p1_192501-192912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_193001-193412.nc"
What I need to do is extract the subset of the string that is between "Amon_" and "_historical".
I can do this for a single variable, as shown here:
levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", clt.list[1:20])))
[1] "ACCESS1-0" "ACCESS1-3" "bcc-csm1-1" "bcc-csm1-1-m" "BNU-ESM" "CanESM2" "CCSM4"
[8] "CESM1-BGC" "CESM1-CAM5" "CESM1-CAM5-1-FV2" "CESM1-FASTCHEM" "CESM1-WACCM" "CMCC-CESM"
However, what I'd like to do is to run the command above for all five variables at once. Instead of using just "clt.list" as the argument in the command above, I'd like to use all of the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" at once.
How can I do that?
Many thanks in advance!
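You can put your operation into a function and then iterate over it:
get_my_substr <- function(vecname)
  levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", get(vecname))))
lapply(my_vecnames, get_my_substr)
lapply acts like a loop. You can create your vector of variable names (before calling lapply) with
my_vecnames <- ls(pattern="\\.list$")
Note that in your workspace this pattern also picks up commands.list, dirs.list and vars.list, so spell out the five names explicitly if you only want those.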
It is generally good practice to post a reproducible example in your question. Since none was provided here, I tested this approach with...
# example-maker
prestr <- "grr_Amon_"
posstr <- "_historical_zzz"
make_ex <- function()
replicate(
sample(10,1),
paste0(prestr,paste0(sample(LETTERS,sample(5,1)),collapse=""),posstr)
)
# make a couple examples
set.seed(1)
m01 <- make_ex()
m02 <- make_ex()
# test result
lapply(ls(pattern="^m[0-9][0-9]$"),get_my_substr)
One solution would be to create a vector containing the names of the variables that you want to extract the data from, for example:
var.names <- c("clt.list", "commands.list", "dirs.list")
Then to access the value of each variable from the name:
for (var.name in var.names) {
var.value <- as.list(environment())[[var.name]]
# Do something with var.value
}
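As a further sketch along the same lines, base R's mget() fetches several objects by name in one call and returns them as a named list, which can then be passed to lapply. The two small vectors a.list and b.list and the helper name extract_model are invented for this example; the regular expression is the one from the question.

```r
# Made-up stand-ins for clt.list, hurs.list, etc.
a.list <- c("clt_Amon_M1_historical_r1.nc", "clt_Amon_M2_historical_r1.nc")
b.list <- c("hurs_Amon_M2_historical_r1.nc", "hurs_Amon_M3_historical_r1.nc")

# Hypothetical helper: unique, sorted text between "Amon_" and "_historical".
extract_model <- function(x) {
  sort(unique(sub(".*?Amon_(.*?)_historical.*", "\\1", x)))
}

# mget() looks the names up and returns a named list of their values.
models <- lapply(mget(c("a.list", "b.list")), extract_model)
models
# $a.list
# [1] "M1" "M2"
#
# $b.list
# [1] "M2" "M3"
```

Because the result keeps the variable names, it stays clear which set of models came from which list.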

R: How to remove quotation marks in a vector of strings, but maintain vector format as to call each individual value?

I want to create a vector of names that act as variable names so I can then use them later on in a loop.
years=1950:2012
varname=character(length(years))
for(i in 1:length(years))
{
varname[i]=paste("mydata",years[i],sep="")
}
this gives:
> [1] "mydata1950" "mydata1951" "mydata1952" "mydata1953" "mydata1954" "mydata1955" "mydata1956" "mydata1957" "mydata1958"
[10] "mydata1959" "mydata1960" "mydata1961" "mydata1962" "mydata1963" "mydata1964" "mydata1965" "mydata1966" "mydata1967"
[19] "mydata1968" "mydata1969" "mydata1970" "mydata1971" "mydata1972" "mydata1973" "mydata1974" "mydata1975" "mydata1976"
[28] "mydata1977" "mydata1978" "mydata1979" "mydata1980" "mydata1981" "mydata1982" "mydata1983" "mydata1984" "mydata1985"
[37] "mydata1986" "mydata1987" "mydata1988" "mydata1989" "mydata1990" "mydata1991" "mydata1992" "mydata1993" "mydata1994"
[46] "mydata1995" "mydata1996" "mydata1997" "mydata1998" "mydata1999" "mydata2000" "mydata2001" "mydata2002" "mydata2003"
[55] "mydata2004" "mydata2005" "mydata2006" "mydata2007" "mydata2008" "mydata2009" "mydata2010" "mydata2011" "mydata2012"
All I want to do is remove the quotes and be able to call each value individually.
I want:
>[1] mydata1950 mydata1951 mydata1952 mydata1953, #etc...
stored as a variable such that
varname[1]
> mydata1950
varname[2]
> mydata1951
and so on.
I have played around with
cat(varname[i],"\n")
but this just prints values as one line and I can't call each individual string. And
gsub("'",'',varname)
but this doesn't seem to do anything.
Suggestions? Is this possible in R? Thank you.
There are no quotes in that character vector's values. Use:
cat(varname)
.... if you want to see the unquoted values. The R print mechanism is set to use quotes as a signal to your brain that distinct values are present. You can also use:
print(varname, quote=FALSE)
If there are that many named objects in your workspace, then you desperately need to learn to use lists. There are mechanisms for "promoting" character values to names, but this would be seen as a failure on your part to learn to use the language effectively:
var <- 2
> eval(as.name('var'))
[1] 2
> eval(parse(text="var"))
[1] 2
> get('var')
[1] 2
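The list-based approach recommended above might look like the following sketch. The three-year range and the contents of each data frame are placeholders invented for the example; the point is only that the names become list names rather than separate workspace objects.

```r
years <- 1950:1952
varname <- paste0("mydata", years)   # paste0 is vectorised, no loop needed

# One list element per year instead of one workspace object per year.
# The data frame contents are placeholders.
mydata <- lapply(years, function(y) data.frame(year = y, value = y - 1949))
names(mydata) <- varname

# Access by name (a character string) or by position.
mydata[["mydata1951"]]$value
# [1] 2
mydata[[varname[3]]]$year
# [1] 1952

# Print the names without quotes.
print(varname, quote = FALSE)
```

Looping over names(mydata) or using lapply(mydata, ...) then replaces any need for get() or eval(parse()).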

Extracting synonymous terms from wordnet using synonyms()

Suppose I pull the synonyms of "help" with the synonyms() function from wordnet and get the following:
Str = synonyms("help")
Str
[1] "c(\"aid\", \"assist\", \"assistance\", \"help\")"
[2] "c(\"aid\", \"assistance\", \"help\")"
[3] "c(\"assistant\", \"helper\", \"help\", \"supporter\")"
[4] "c(\"avail\", \"help\", \"service\")"
Then I can get a single character vector using
unique(unlist(lapply(parse(text=Str),eval)))
at the end that looks like this:
[1] "aid" "assist" "assistance" "help" "assistant" "helper" "supporter"
[8] "avail" "service"
The above process was suggested by Gabor Grothendieck. His/Her solution is good, but I still can't figure out why an error is raised when I change the query term to "company", "boy", or some other word.
One possible reason may be that the sixth synonym of "company" (please see below) is a single term and does not follow the format "c(\"company\")".
synonyms("company")
[1] "c(\"caller\", \"company\")"
[2] "c(\"company\", \"companionship\", \"fellowship\", \"society\")"
[3] "c(\"company\", \"troupe\")"
[4] "c(\"party\", \"company\")"
[5] "c(\"ship's company\", \"company\")"
[6] "company"
Could someone kindly help me to solve this problem.
Many thanks.
You can solve this by creating a little helper function that uses R's try mechanism to catch errors. In this case, if the eval produces an error, then return the original string, else return the result of eval:
Create a helper function:
evalOrValue <- function(expr, ...){
z <- try(eval(expr, ...), TRUE)
if(inherits(z, "try-error")) as.character(expr) else unlist(z)
}
unique(unlist(sapply(parse(text=Str), evalOrValue)))
Produces:
[1] "caller" "company" "companionship"
[4] "fellowship" "society" "troupe"
[7] "party" "ship's company"
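An alternative sketch that avoids eval altogether is to pull the quoted terms out with a regular expression and fall back to the raw string when nothing is quoted. The helper name parse_syn is hypothetical, and the approach assumes that no synonym itself contains a double quote.

```r
Str <- c("c(\"caller\", \"company\")",
         "c(\"company\", \"companionship\", \"fellowship\", \"society\")",
         "c(\"company\", \"troupe\")", "c(\"party\", \"company\")",
         "c(\"ship's company\", \"company\")", "company")

# Hypothetical helper: extract every "..." chunk; if there is none, the
# string is already a bare term and is returned unchanged.
parse_syn <- function(s) {
  quoted <- regmatches(s, gregexpr('"[^"]*"', s))[[1]]
  if (length(quoted) == 0) s else gsub('"', "", quoted, fixed = TRUE)
}

syns <- unique(unlist(lapply(Str, parse_syn)))
syns
# [1] "caller"         "company"        "companionship"  "fellowship"
# [5] "society"        "troupe"         "party"          "ship's company"
```

Since nothing is evaluated, a bare term such as company cannot accidentally pick up the value of an existing object with the same name.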
I reproduced your data and then used dput to reproduce it here:
Str <- c("c(\"caller\", \"company\")", "c(\"company\", \"companionship\", \"fellowship\", \"society\")",
"c(\"company\", \"troupe\")", "c(\"party\", \"company\")", "c(\"ship's company\", \"company\")",
"company")
Those synonyms are in a form that looks like an expression, so you should be able to parse them as you illustrated. BUT: When I execute your original code above I get an error from the synonyms call because you included no part-of-speech argument.
> synonyms("help")
Error in charmatch(x, WN_synset_types) :
argument "pos" is missing, with no default
Observe that the code of synonyms uses getSynonyms, and that its code has a unique wrapped around its result, so all of the pre-processing you are doing is no longer needed (if you update):
> synonyms("company", "NOUN")
[1] "caller" "companionship" "company"
[4] "fellowship" "party" "ship's company"
[7] "society" "troupe"
> synonyms
function (word, pos)
{
filter <- getTermFilter("ExactMatchFilter", word, TRUE)
terms <- getIndexTerms(pos, 1L, filter)
if (is.null(terms))
character()
else getSynonyms(terms[[1L]])
}
<environment: namespace:wordnet>
> getSynonyms
function (indexterm)
{
synsets <- .jcall(indexterm, "[Lcom/nexagis/jawbone/Synset;",
"getSynsets")
sort(unique(unlist(lapply(synsets, getWord))))
}
<environment: namespace:wordnet>
