Extracting synonymous terms from wordnet using synonyms() - r

Suppose I pull the synonyms of "help" with the synonyms() function from wordnet and get the following:
Str = synonyms("help")
Str
[1] "c(\"aid\", \"assist\", \"assistance\", \"help\")"
[2] "c(\"aid\", \"assistance\", \"help\")"
[3] "c(\"assistant\", \"helper\", \"help\", \"supporter\")"
[4] "c(\"avail\", \"help\", \"service\")"
Then I can get a single character vector using
unique(unlist(lapply(parse(text=Str),eval)))
at the end that looks like this:
[1] "aid" "assist" "assistance" "help" "assistant" "helper" "supporter"
[8] "avail" "service"
The above process was suggested by Gabor Grothendieck. The solution is good, but I cannot figure out why an error message is returned when I change the query term to "company", "boy", or some other word.
One possible reason is that the sixth synonym of "company" (see below) is a single term that does not follow the format "c(\"company\")".
synonyms("company")
[1] "c(\"caller\", \"company\")"
[2] "c(\"company\", \"companionship\", \"fellowship\", \"society\")"
[3] "c(\"company\", \"troupe\")"
[4] "c(\"party\", \"company\")"
[5] "c(\"ship's company\", \"company\")"
[6] "company"
Could someone kindly help me solve this problem?
Many thanks.

You can solve this by creating a little helper function that uses R's try mechanism to catch errors. In this case, if the eval produces an error, then return the original string, else return the result of eval:
Create a helper function:
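evalOrValue <- function(expr, ...){
  # try to evaluate the parsed expression; silent = TRUE suppresses the error message
  z <- try(eval(expr, ...), TRUE)
  # on error (e.g. a bare word such as company), fall back to the original text
  if(inherits(z, "try-error")) as.character(expr) else unlist(z)
}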
unique(unlist(sapply(parse(text=Str), evalOrValue)))
Produces:
[1] "caller" "company" "companionship"
[4] "fellowship" "society" "troupe"
[7] "party" "ship's company"
I reproduced your data and then used dput to reproduce it here:
Str <- c("c(\"caller\", \"company\")", "c(\"company\", \"companionship\", \"fellowship\", \"society\")",
"c(\"company\", \"troupe\")", "c(\"party\", \"company\")", "c(\"ship's company\", \"company\")",
"company")
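If you would rather not eval() strings at all, one possible alternative is to pull the quoted terms out with a regular expression and keep any bare word as it is. This is only a sketch under stated assumptions: extract_terms is a hypothetical helper name, the input is a shortened copy of Str, and the terms themselves are assumed not to contain double quotes.

```r
# Shortened copy of the data above
Str <- c("c(\"caller\", \"company\")",
         "c(\"ship's company\", \"company\")",
         "company")

# Hypothetical helper: extract every "..." term without evaluating anything
extract_terms <- function(s) {
  # one character vector of quoted matches per input string
  quoted <- regmatches(s, gregexpr('"[^"]*"', s))
  terms <- mapply(function(q, raw) {
    # no quoted match means the entry is a bare word, so keep it unchanged
    if (length(q)) gsub('"', "", q, fixed = TRUE) else raw
  }, quoted, s, SIMPLIFY = FALSE)
  unique(unlist(terms, use.names = FALSE))
}

result <- extract_terms(Str)
print(result)
```

Compared with the try() approach, this never runs code that came from a string, at the cost of relying on the c("...") text format staying the same.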

Those synonyms are in a form that looks like an expression, so you should be able to parse them as you illustrated. BUT: When I execute your original code above I get an error from the synonyms call because you included no part-of-speech argument.
> synonyms("help")
Error in charmatch(x, WN_synset_types) :
argument "pos" is missing, with no default
Observe that the code of synonyms uses getSynonyms, and that its code has a unique wrapped around it, so all of the pre-processing you are doing is no longer needed (if you update):
> synonyms("company", "NOUN")
[1] "caller" "companionship" "company"
[4] "fellowship" "party" "ship's company"
[7] "society" "troupe"
> synonyms
function (word, pos)
{
filter <- getTermFilter("ExactMatchFilter", word, TRUE)
terms <- getIndexTerms(pos, 1L, filter)
if (is.null(terms))
character()
else getSynonyms(terms[[1L]])
}
<environment: namespace:wordnet>
> getSynonyms
function (indexterm)
{
synsets <- .jcall(indexterm, "[Lcom/nexagis/jawbone/Synset;",
"getSynsets")
sort(unique(unlist(lapply(synsets, getWord))))
}
<environment: namespace:wordnet>

Related

Partial Match String and full replacement over multiple vectors

I would like to efficiently replace all partially matching strings in a single column by supplying a vector of strings that are both searched for and used as the replacement. That is, each element of df below is partially matched against the elements of vec_string; where a match is found, the entire string is replaced with the matching element of vec_string, e.g. turning 'subscriber manager' into 'manager'. With more elements in vec_string, the search continues through the whole of df until every pattern has been applied.
I have started the function, but can't seem to finish it off by replacing the elements of df with vec_string. Appreciate your help
df <- c(
'solicitor'
,'subscriber manager'
,'licensed conveyancer'
,'paralegal'
,'property assistant'
,'secretary'
,'conveyancing paralegal'
,'licensee'
,'conveyancer'
,'principal'
,'assistant'
,'senior conveyancer'
,'law clerk'
,'lawyer'
,'legal practice director'
,'legal secretary'
,'personal assistant'
,'legal assistant'
,'conveyancing clerk')
vec_string <- c('manager','law')
#function to search and replace
replace_func <-
function(vec,str_vec) {
  repl_str <- list()
  for(i in seq_along(str_vec)) {
    repl_str[[i]] <- grep(str_vec[i],unique(tolower(vec)))
  }
  # name the result after the argument, not the global vec_string
  names(repl_str) <- str_vec
  return(repl_str)
}
replace_func(df,vec_string)
$`manager`
[1] 2
$law
[1] 13 14
As you can see, the function returns a named list with the indices of the elements to which the replacement will be applied.
This should do the trick
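res = sapply(df, function(x) {
  # indices of the patterns in vec_string that occur in x
  hits = which(sapply(vec_string, function(y) grepl(y, x)))
  # replace with the first matching pattern, otherwise keep x
  if (length(hits)) vec_string[hits[1]] else x
}, USE.NAMES = FALSE)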
res
[1] "solicitor" "manager" "licensed conveyancer"
[4] "paralegal" "property assistant" "secretary"
[7] "conveyancing paralegal" "licensee" "conveyancer"
[10] "principal" "assistant" "senior conveyancer"
[13] "law" "law" "legal practice director"
[16] "legal secretary" "personal assistant" "legal assistant"
[19] "conveyancing clerk"
We compare each element of df with each element of vec_string. If there is a match, the vec_string element is returned; otherwise the original is left as it is. Note that if there is more than one match, only the first one is kept.
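If the nested sapply() calls become slow on a long column, one possible alternative is to loop over the patterns only and let grepl() work on the whole vector at once. This is a minimal sketch rather than a drop-in replacement: replace_partial is a hypothetical helper name, the data is a shortened copy of df, and fixed = TRUE assumes the patterns are plain substrings rather than regular expressions.

```r
# Shortened copy of the data from the question
df <- c("solicitor", "subscriber manager", "law clerk", "lawyer", "legal assistant")
vec_string <- c("manager", "law")

# Hypothetical helper: one vectorised grepl() call per pattern
replace_partial <- function(x, patterns) {
  out <- x
  # go through the patterns in reverse so that, when several patterns match
  # the same element, the first pattern is written last and therefore wins
  for (p in rev(patterns)) {
    out[grepl(p, x, fixed = TRUE)] <- p
  }
  out
}

res <- replace_partial(df, vec_string)
print(res)
```

Matching is always done against the original x, not the partly replaced out, so an earlier replacement cannot accidentally trigger a later pattern.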

How to find algo type(regression,classification) in Caret in R for all algos at once?

How do I find the model type for all models at once? I know how to access this info if I know the algo name, e.g.:
library(caret)
tail(names(getModelInfo()))
[1] "widekernelpls" "WM" "wsrf" "xgbLinear" "xgbTree"
[6] "xyf"
getModelInfo()$xyf$type
[1] "Classification" "Regression"
How do I see the $type for all the algos in one place?
Look at the help page ?models. Also, here are some links too.
Also:
> mods <- getModelInfo()
> is_class <- unlist(lapply(mods, function(x) any(x$type == "Classification")))
> class_mods <- names(is_class)[is_class]
> head(class_mods)
[1] "ada" "AdaBag" "AdaBoost.M1" "amdai" "avNNet"
[6] "bag"
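To see every $type in one place, one option is to collapse each model's type vector into a single string. The sketch below rests on assumptions: type_table is a hypothetical helper, and the small hand-built list only stands in for the structure returned by getModelInfo() (a named list whose elements each carry a type field), so that the example runs without caret installed.

```r
# Hypothetical helper: one collapsed type string per model
type_table <- function(info) {
  vapply(info, function(m) paste(m$type, collapse = ", "), character(1))
}

# Stand-in for getModelInfo(); only the type field is mimicked here
mock_info <- list(
  ada = list(type = "Classification"),
  lm  = list(type = "Regression"),
  xyf = list(type = c("Classification", "Regression"))
)

types <- type_table(mock_info)
print(types)

# With caret installed, the same helper should apply to the real list:
# type_table(caret::getModelInfo())
```

The result is a named character vector, so it can be filtered with grepl() or turned into a data frame for browsing.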

R - get values from multiple variables in the environment

I have some variables in my current R environment:
ls()
[1] "clt.list" "commands.list" "dirs.list" "eq" "hurs.list" "mlist" "prec.list" "temp.list" "vars"
[10] "vars.list" "wind.list"
where each one of the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" is a (huge) list of strings.
For example:
clt.list[1:20]
[1] "clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc" "clt_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc"
[3] "clt_Amon_bcc-csm1-1_historical_r1i1p1_185001-201212.nc" "clt_Amon_bcc-csm1-1-m_historical_r1i1p1_185001-201212.nc"
[5] "clt_Amon_BNU-ESM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc"
[7] "clt_Amon_CCSM4_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-BGC_historical_r1i1p1_185001-200512.nc"
[9] "clt_Amon_CESM1-CAM5_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-CAM5-1-FV2_historical_r1i1p1_185001-200512.nc"
[11] "clt_Amon_CESM1-FASTCHEM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-WACCM_historical_r1i1p1_185001-200512.nc"
[13] "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-190412.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-200512.nc"
[15] "clt_Amon_CMCC-CESM_historical_r1i1p1_190501-190912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_191001-191412.nc"
[17] "clt_Amon_CMCC-CESM_historical_r1i1p1_191501-191912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_192001-192412.nc"
[19] "clt_Amon_CMCC-CESM_historical_r1i1p1_192501-192912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_193001-193412.nc"
What I need to do is extract the subset of the string that is between "Amon_" and "_historical".
I can do this for a single variable, as shown here:
levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", clt.list[1:20])))
[1] "ACCESS1-0" "ACCESS1-3" "bcc-csm1-1" "bcc-csm1-1-m" "BNU-ESM" "CanESM2" "CCSM4"
[8] "CESM1-BGC" "CESM1-CAM5" "CESM1-CAM5-1-FV2" "CESM1-FASTCHEM" "CESM1-WACCM" "CMCC-CESM"
However, what I'd like to do is to run the command above for all five variables at once. Instead of using just "clt.list" as the argument in the command above, I'd like to use all of the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" at once.
How can I do that?
Many thanks in advance!
You can put your operation into a function and then iterate over it:
get_my_substr <- function(vecname)
levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", get(vecname))))
lapply(my_vecnames,get_my_substr)
lapply acts like a loop. You can create your list of vector names with
my_vecnames <- ls(pattern="\\.list$")
It is generally good practice to post a reproducible example in your question. Since none was provided here, I tested this approach with...
# example-maker
prestr <- "grr_Amon_"
posstr <- "_historical_zzz"
make_ex <- function()
replicate(
sample(10,1),
paste0(prestr,paste0(sample(LETTERS,sample(5,1)),collapse=""),posstr)
)
# make a couple examples
set.seed(1)
m01 <- make_ex()
m02 <- make_ex()
# test result
lapply(ls(pattern="^m[0-9][0-9]$"),get_my_substr)
One solution would be to create a vector containing the names of the variables that you want to extract the data from, for example:
var.names <- c("clt.list", "commands.list", "dirs.list")
Then to access the value of each variable from the name:
for (var.name in var.names) {
var.value <- as.list(environment())[[var.name]]
# Do something with var.value
}
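Another way to fetch several variables by name is mget(), which returns them as a named list that lapply() can then process. A minimal sketch, assuming two shortened stand-in vectors (the clt file names come from the question; the hurs file name is a made-up example following the same pattern), with perl = TRUE added so that the lazy quantifiers are handled by the PCRE engine:

```r
# Shortened stand-ins for the large vectors in the question
clt.list <- c("clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc",
              "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc",
              "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc")
# made-up file name, for illustration only
hurs.list <- c("hurs_Amon_CanESM2_historical_r1i1p1_185001-200512.nc")

var.names <- c("clt.list", "hurs.list")

# mget() looks up all names at once and returns a named list
vars <- mget(var.names, envir = environment())

# same extraction as in the question, applied to each vector
models <- lapply(vars, function(v) {
  sort(unique(sub(".*?Amon_(.*?)_historical.*", "\\1", v, perl = TRUE)))
})
print(models)
```

Because the result keeps the variable names, models$clt.list and models$hurs.list can be compared directly.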

parent.env( x ) confusion

I've read the documentation for parent.env() and it seems fairly straightforward - it returns the enclosing environment. However, if I use parent.env() to walk the chain of enclosing environments, I see something that I cannot explain. First, the code (taken from "R in a nutshell")
library( PerformanceAnalytics )
x = environment(chart.RelativePerformance)
while (environmentName(x) != environmentName(emptyenv()))
{
print(environmentName(parent.env(x)))
x <- parent.env(x)
}
And the results:
[1] "imports:PerformanceAnalytics"
[1] "base"
[1] "R_GlobalEnv"
[1] "package:PerformanceAnalytics"
[1] "package:xts"
[1] "package:zoo"
[1] "tools:rstudio"
[1] "package:stats"
[1] "package:graphics"
[1] "package:utils"
[1] "package:datasets"
[1] "package:grDevices"
[1] "package:roxygen2"
[1] "package:digest"
[1] "package:methods"
[1] "Autoloads"
[1] "base"
[1] "R_EmptyEnv"
How can we explain the "base" at the top and the "base" at the bottom? Also, how can we explain "package:PerformanceAnalytics" and "imports:PerformanceAnalytics"? Everything would seem consistent without the first two lines. That is, function chart.RelativePerformance is in the package:PerformanceAnalytics environment which is created by xts, which is created by zoo, ... all the way up (or down) to base and the empty environment.
Also, the documentation is not super clear on this - is the "enclosing environment" the environment in which another environment is created and thus walking parent.env() shows a "creation" chain?
Edit
Shameless plug: I wrote a blog post that explains environments, parent.env(), enclosures, namespace/package, etc. with intuitive diagrams.
1) Regarding how base could be there twice (given that environments form a tree), it's the fault of the environmentName function. Actually, the first occurrence is .BaseNamespaceEnv and the latter occurrence is baseenv().
> identical(baseenv(), .BaseNamespaceEnv)
[1] FALSE
2) Regarding imports:PerformanceAnalytics: that is a special environment that R sets up to hold the imports mentioned in the package's NAMESPACE or DESCRIPTION file, so that objects in it are encountered before anything else.
Try running this for some clarity. The str(p) and following if statements will give a better idea of what p is:
library( PerformanceAnalytics )
x <- environment(chart.RelativePerformance)
str(x)
while (environmentName(x) != environmentName(emptyenv())) {
p <- parent.env(x)
cat("------------------------------\n")
str(p)
if (identical(p, .BaseNamespaceEnv)) cat("Same as .BaseNamespaceEnv\n")
if (identical(p, baseenv())) cat("Same as baseenv()\n")
x <- p
}
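The same pattern can be seen without installing PerformanceAnalytics. As a rough sketch, the loop below walks the enclosing environments of stats::sd (chosen here only because stats ships with R) and collects the names; env_chain is a hypothetical helper, and it compares environments with identical() instead of comparing their names.

```r
# Hypothetical helper: names of all enclosing environments of a function
env_chain <- function(f) {
  e <- environment(f)
  out <- character()
  # stop at the empty environment, which has no parent
  while (!identical(e, emptyenv())) {
    out <- c(out, environmentName(e))
    e <- parent.env(e)
  }
  c(out, environmentName(emptyenv()))
}

chain <- env_chain(stats::sd)
print(chain)
```

The chain starts with the package namespace, its imports and the base namespace, and ends with base and the empty environment; what lies in between depends on which packages are attached in the session.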
The first few items in your results give evidence of the rules R uses to search for variables used in functions in packages with namespaces. From the R-ext manual:
The namespace controls the search strategy for variables used by functions in the package.
If not found locally, R searches the package namespace first, then the imports, then the base
namespace and then the normal search path.
Elaborating just a bit, have a look at the first few lines of chart.RelativePerformance:
head(body(chart.RelativePerformance), 5)
# {
# Ra = checkData(Ra)
# Rb = checkData(Rb)
# columns.a = ncol(Ra)
# columns.b = ncol(Rb)
# }
When a call to chart.RelativePerformance is being evaluated, each of those symbols --- whether the checkData on line 1, or the ncol on line 3 --- needs to be found somewhere on the search path. Here are the first few enclosing environments checked:
First off is namespace:PerformanceAnalytics. checkData is found there, but ncol is not.
Next stop (and the first location listed in your results) is imports:PerformanceAnalytics. This is the list of functions specified as imports in the package's NAMESPACE file. ncol is not found here either.
The base environment namespace (where ncol will be found) is the last stop before proceeding to the normal search path. Almost any R function will use some base functions, so this stop ensures that none of that functionality can be broken by objects in the global environment or in other packages. (R's designers could have left it to package authors to explicitly import the base environment in their NAMESPACE files, but adding this default pass through base does seem like the better design decision.)
The second entry, base, is .BaseNamespaceEnv, while the second-to-last entry, also base, is baseenv(). The two differ mainly in their parents: the parent of .BaseNamespaceEnv is .GlobalEnv, while that of baseenv() is emptyenv().
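The claim about the two parents can be checked directly in any R session. A small sketch using only base functions (the variable names are chosen here for illustration):

```r
# Both environments report the same name, which is why "base" shows up twice
print(environmentName(.BaseNamespaceEnv))
print(environmentName(baseenv()))

# ...but they are distinct environments with different parents
same_env <- identical(baseenv(), .BaseNamespaceEnv)
ns_parent_is_global <- identical(parent.env(.BaseNamespaceEnv), globalenv())
base_parent_is_empty <- identical(parent.env(baseenv()), emptyenv())

print(c(same_env = same_env,
        ns_parent_is_global = ns_parent_is_global,
        base_parent_is_empty = base_parent_is_empty))
```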
In a package, as @Josh says, R searches the namespace of the package, then the imports, and then the base namespace (i.e., .BaseNamespaceEnv).
You can see this with, e.g.:
> library(zoo)
> packageDescription("zoo")
Package: zoo
# ... snip ...
Imports: stats, utils, graphics, grDevices, lattice (>= 0.18-1)
# ... snip ...
> x <- environment(zoo)
> x
<environment: namespace:zoo>
> ls(x) # objects in zoo
[1] "-.yearmon" "-.yearqtr" "[.yearmon"
[4] "[.yearqtr" "[.zoo" "[<-.zoo"
# ... snip ...
> y <- parent.env(x)
> y # namespace of imported packages
<environment: 0x116e37468>
attr(,"name")
[1] "imports:zoo"
> ls(y) # objects in the imported packages
[1] "?" "abline"
[3] "acf" "acf2AR"
# ... snip ...

spChFIDs() on level 1 or higher map-files

Hopefully (one of) the last question on map-files.
Why is this not working, and how would I do that right?
load(url('http://gadm.org/data/rda/CUB_adm1.RData'))
CUB <- gadm
CUB <- spChFIDs(CUB, paste("CUB", rownames(CUB), sep = "_"))
Thank you very much!!!
It seems to work with row.names():
load(url('http://gadm.org/data/rda/CUB_adm1.RData'))
CUB <- gadm
CUB <- spChFIDs(CUB, paste("CUB", row.names(CUB), sep = "_"))
The answer is apparent once one reads the help for ?row.names() and ?rownames().
The rownames() function only knows something about matrix-like objects, and CUB is not one of those, hence it doesn't have row names that rownames() can find:
> rownames(CUB)
NULL
row.names() is different: it is an S3 generic function, which means package authors can write methods for specific types of objects so that the row names of those objects can be extracted.
Here is a list of the methods available for row.names() in my current session, with the sp package loaded:
> methods(row.names)
[1] row.names.data.frame
[2] row.names.default
[3] row.names.SpatialGrid*
[4] row.names.SpatialGridDataFrame*
[5] row.names.SpatialLines*
[6] row.names.SpatialLinesDataFrame*
[7] row.names.SpatialPixels*
[8] row.names.SpatialPoints*
[9] row.names.SpatialPointsDataFrame*
[10] row.names.SpatialPolygons*
[11] row.names.SpatialPolygonsDataFrame*
Non-visible functions are asterisked
The class of the object CUB is:
> class(CUB)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
So what is happening is that the SpatialPolygonsDataFrame method of the row.names() function is being used and it knows where to find the required row names.
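The difference can be reproduced without sp or the GADM file. The sketch below is an illustration only: toy_spatial is a made-up class and row.names.toy_spatial a made-up method, standing in for the way sp provides row.names() methods for its own classes.

```r
# A made-up object that stores its IDs in a list element, not in dimnames
obj <- structure(list(ids = c("a", "b", "c")), class = "toy_spatial")

# A made-up S3 method telling row.names() where the IDs live
row.names.toy_spatial <- function(x) x$ids

plain <- rownames(obj)     # rownames() only inspects dimnames, so this is NULL
generic <- row.names(obj)  # dispatches to row.names.toy_spatial

print(plain)
print(generic)
```

rownames() is an ordinary function with no dispatch, so it cannot be taught about new classes, whereas row.names() picks up any method defined for the object's class.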
