Classifying tags in R with grepl in ifelse

I am having an issue with some R code. I am trying to classify text values from a column into a new column. My data is a collection of tags used on the gis.stackexchange site (~2,500 rows). My goal is to classify the tags as either COTS, FOSS, or other. Reviewing the tags, there are two scenarios: tags that are used only once (e.g. anaconda) and tags that share a term used multiple times (e.g. qgis, qgis-desktop, qgis-server, etc.). This is true for both COTS and FOSS tags.
My approach was to do the following:
create a vector with all tags that represent FOSS
create a vector with all tags that represent COTS
create a new column called software and code it using ifelse
in the ifelse, where the tagName is %in% the FOSS vector, code it as FOSS
also in the ifelse, use grep on the FOSS vector to pattern-match tags that share a term used multiple times (e.g. qgis) and code those as FOSS
repeat this for COTS
I am getting an issue where the last grep (COTS) is being coded as FOSS. Obviously there is something wrong, but I cannot seem to figure out the issue. Below is the code and a link to the source data.
Shared folder with source CSV
Tag vectors -- FOSS and COTS
foss <- c("anaconda", "android", "apache", "aptana", "google", "blender", "cordova",
"docker", "drupal", "eclipse", "facebook", "firefox", "ftools", "fwtools",
"geodjango", "geopandas", "geomoose", "geonetwork", "geonode", "geotools",
"ggmap", "ggplot2", "gimp", "github", "gme", "chrome", "gvsig", "h2gis",
"hadoop", "inkscape", "lastools", "laszip", "mongodb", "neo4j", "numpy",
"open-data-kit", "opencv", "opendronemap", "openev", "opengeo-suite-composer",
"opengl", "openjump", "openstreetmap", "opentopomap", "opentripplanner", "openwind",
"orfeo-toolbox", "pandas", "pdal", "pgrouting", "pg2shape", "phonegap",
"plpgsql", "ppygis", "pydev", "pygdal", "pyproj", "pyqspatialite", "rasterlite",
"raster2pgsql", "rdal", "saga", "shapely", "shp2pgsql", "sp", "sf",
"spatialite-gui", "three-js", "unity3d", "wordpress", "youtube", "bing-maps",
"dropbox", "instagram", "sketchup", "carto", "django", "gdal", "geoserver",
"grass", "jupyter", "leaflet", "mapbox", "matplotlib", "mysql", "ogr", "openlayers",
"osgeo", "osm", "pgadmin", "postgis", "postgresql", "proj4", "pyqgis", "qgis",
"qt", "scikit", "scipy", "tilemill")
cots <- c("autodesk", "bentley", "cityengine", "drone2map", "ecognition", "envi", "er-mapper",
"et-geowizards", "excel", "geomatica", "geosoft", "global-mapper", "illustrator",
"mac", "matlab", "microstation", "modelbuilder", "pix4d", "plsql", "powerpoint",
"silverlight", "spss", "tableau", "xtools-pro", "mapinfo", "arc", "oracle",
"erdas", "esri", "fme", "microsoft", "-analyst")
Create new column with classified values calculated based on tag vector
tags$software <- ifelse(tags$tagName %in% foss, "FOSS",
                 ifelse(grep(foss, tags$tagName, fixed = TRUE), "FOSS",
                 ifelse(tags$tagName %in% cots, "COTS",
                 ifelse(grep(cots, tags$tagName, fixed = TRUE), "COTS",
                        "other"))))
When I run the code the following error is produced: argument 'pattern' has length > 1 and only the first element will be used
I am sure it is a very simple issue, but I cannot seem to figure it out.

With tidyverse:
library(dplyr)
tags <- data.frame(tagName = c("opengl", "openglGHSAJKGNKS", "arc", "arc93257", "asnsgn"))
tags %>%
  mutate(software = case_when(
    tagName %in% foss ~ "FOSS",
    grepl(paste(foss, collapse = "|"), tagName) ~ "FOSS",
    tagName %in% cots ~ "COTS",
    grepl(paste(cots, collapse = "|"), tagName) ~ "COTS",
    TRUE ~ "other"))
tagName software
1 opengl FOSS
2 openglGHSAJKGNKS FOSS
3 arc COTS
4 arc93257 COTS
5 asnsgn other

Two things. First, you need grepl() because ifelse() needs a logical test, and grep() returns integer positions instead. Second, grepl() does not accept a vector of patterns, so you need to collapse them into a single regular expression like "anaconda|android|..." and drop fixed = TRUE so the | alternation is interpreted as regex syntax.
This should do it:
tags$software <- ifelse(tags$tagName %in% foss, "FOSS",
                 ifelse(grepl(paste(foss, collapse = "|"), tags$tagName), "FOSS",
                 ifelse(tags$tagName %in% cots, "COTS",
                 ifelse(grepl(paste(cots, collapse = "|"), tags$tagName), "COTS",
                        "other"))))

Related

Create binary yes/no animal variable based on match with any term in a dictionary, "animal" in R

Continuing off this question: R: Create category column reflecting match between a dictionary and column in df
I have a big dataset, "df", of 30,000 rows, and two big dictionary data frames: (1) animal, 600k rows; (2) nature, 300k rows.
I am simply trying to figure out how to create two binary variables, "df$content_animal" and "df$content_nature", based on whether each row of df$content has any match with the "animal" or "nature" dictionaries (1 = match, 0 = no match).
Below are data samples; it's impossible for me to include the entire datasets here:
library(tibble)
df <- tibble(content = c("hello turkey feet blah blah blah",
                         "i love rabbits haha",
                         "wow this sunlight is amazing",
                         "omg did u see the rainbow?!",
                         "turtles like swimming in the water",
                         "i love running across grassy lawns with my dog"))
animal=c("turkey", "rabbit", "turtle", "dog", "cat", "bear")
nature=c("sunlight", "water", "rainbow", "grass", "lawn", "mountain", "ice")
I have tried the following code based on multiple-pattern matching, with no success; I suspect it is because of the size of both my dataset and the dictionary/pattern:
df$content_animal <- grepl(paste(animal,collapse="|"),df$content,ignore.case=TRUE)
df$content_nature <- grepl(paste(nature,collapse="|"),df$content,ignore.case=TRUE)
which returns the error:
Error in grepl(paste(animal,collapse="|"), df$content, :
  invalid regular expression, reason 'Out of memory'
Error in grepl(paste(nature,collapse="|"), df$content, :
  invalid regular expression, reason 'Out of memory'
I also tried:
df<-df %>%
mutate(
content_animal = case_when(grepl(animal, content) ~ "1")
)
df<-df %>%
mutate(
content_nature = case_when(grepl(nature, content) ~ "1")
)
which returns the error:
Problem with `mutate()` input `content_animal`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
ℹ Input `content_animal` is `case_when(grepl(animal, content) ~ "1")`.
argument 'pattern' has length > 1 and only the first element will be used
Problem with `mutate()` input `content_nature`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
ℹ Input `content_nature` is `case_when(grepl(nature, content) ~ "1")`.
argument 'pattern' has length > 1 and only the first element will be used
I ALSO tried
bench::mark(basic = mutate(df, content_animal = 1L*map_lgl(content, ~any(str_detect(.x, animal))),
content_nature = 1L*map_lgl(content, ~any(str_detect(.x, nature)))),
fixed = mutate(df, content_animal = 1L*map_lgl(content, ~any(str_detect(.x, fixed(animal)))),
content_nature = 1L*map_lgl(content, ~any(str_detect(.x, fixed(nature))))))
which ran for over two hours, without giving me any output.
I'm really at a loss here as to what I should do. Does anyone have any ideas? Is there a better package or approach for data this large?
It may be better to loop with lapply and Reduce:
Reduce(`|`, lapply(nature, function(x) grepl(x, df$content, ignore.case = TRUE)))
#[1] FALSE FALSE TRUE TRUE TRUE TRUE
which is the same as
grepl(paste(nature,collapse="|"),df$content,ignore.case=TRUE)
#[1] FALSE FALSE TRUE TRUE TRUE TRUE
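If the single collapsed pattern is too large for the regex engine (the "Out of memory" error above), one workaround, sketched here as my own addition rather than part of either answer, is to batch the dictionary into smaller alternation patterns and OR the results; the chunk size of 1000 is arbitrary:
match_any <- function(text, terms, chunk = 1000L) {
  hits <- logical(length(text))                 # starts all FALSE
  for (i in seq(1, length(terms), by = chunk)) {
    pat <- paste(terms[i:min(i + chunk - 1L, length(terms))], collapse = "|")
    hits <- hits | grepl(pat, text, ignore.case = TRUE)
  }
  hits
}
df$content_animal <- as.integer(match_any(df$content, animal))
df$content_nature <- as.integer(match_any(df$content, nature))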
Here's an approach with the quanteda package, which has built-in functions for doing exactly what you want. (I tried this only on the sample dataset; I'd be interested to hear what its performance is on the whole thing.)
library(quanteda)
c = corpus(df$content)
d = dictionary(list(animal = animal, nature = nature))
df = cbind(df, convert(dfm(c, dictionary = d), to = "data.frame")[,-1])
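The dfm columns hold match counts rather than 1/0 flags; a small follow-up of my own (assuming the converted columns come back named animal and nature, after the dictionary keys) gives the binary variables the question asks for:
df$content_animal <- as.integer(df$animal > 0)
df$content_nature <- as.integer(df$nature > 0)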

R GWASTools createDataFile() error: "Error ... %in% names(...) is not TRUE"

I'm trying to create an intensity GDS file from existing Illumina files using the createDataFile() function of GWASTools.
I tried this:
col.nums <- as.integer(c(1,11,12,13,14))
names(col.nums) <- c("snp", "BAlleleFreq", "LogRRatio", "a1", "a2")
variables <- c("genotype","BAlleleFreq","LogRRatio")
intens <- createDataFile(path="/pathexample/", "/pathexample/IntensityGDS", file.type="gds",
                         variables=variables, snp.annotation=snpAnnot, scan.annotation=scanAnnot,
                         sep.type=",", skip.num=12, col.total=14, col.nums=col.nums,
                         scan.name.in.file=-1, allele.coding="nucleotide", precision="single",
                         compress="LZMA_RA:1M", compress.geno="", compress.annot="LZMA_RA",
                         array.name=NULL, genome.build=NULL,
                         diagnostics.filename="createDataFile.diagnostics.RData", verbose=TRUE)
The error I'm getting is:
Error: all(c("snpID", "chromosome", "position", "snpName") %in% names(snp.annotation)) is not TRUE
However, I know those column names are in both the snp.annotation SnpAnnotationDataFrame (aka snpAnnot) and the underlying data frame I used to create it. E.g.:
varLabels(snpAnnot)
yields
"snpName" "chromosome" "position" "rsID_real" "snpID"
Thanks!!
Apparently the problem was that createDataFile() takes regular R data frames for the snp.annotation and scan.annotation arguments, not an object of class SnpAnnotationDataFrame. I.e., there is no need to run SnpAnnotationDataFrame() on your data frame; just pass in the data frame itself.
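In other words, something like the sketch below should work. It assumes snpAnnot and scanAnnot were built with SnpAnnotationDataFrame()/ScanAnnotationDataFrame() from plain data frames, in which case pData() recovers the underlying data frame that createDataFile() expects:
library(GWASTools)
snp.df  <- pData(snpAnnot)    # plain data.frame with snpID, chromosome, position, snpName, ...
scan.df <- pData(scanAnnot)
# then call createDataFile() with snp.annotation = snp.df and scan.annotation = scan.df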

String pulled directly from source data seems to not match string in source data

I have a string that is failing to evaluate as a match with itself. I am trying to do a simple subset based on one of 8 possible values in a column:
out <- df[df$`Var name` == "string",]
I've had it work multiple times with different strings, but for some reason this string fails. I have tried to get the exact string (thinking there may be some character encoding issue) from the source using the four avenues below, but have had no success. Even when I make an explicit call to a cell I know contains that string and copy that into an evaluation statement, it fails:
> df[i,j]
[1] "string"
df[i,j]=="string" # pasted from above line
I don't understand how I can paste the exact output I was just given and have it not match.
## attempts to get exact string to paste into subset statement
# from dput
"IF APPLICABLE – Which of the following best characterizes the expectations with"
# from calling a specific row/col (df[i, j])
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
# from the source pane of rstudio
IF APPLICABLE – Which of the following best characterizes the expectations with
# from the source excel file
IF APPLICABLE – Which of the following best characterizes the expectations with
I don't have a clue what could be going on here. I am explicitly drawing the string straight from the data and yet it still fails to evaluate as true. Is there something going on in the background that I'm not seeing? Am I overlooking something ridiculously simple?
Edit:
I subset another way; below are a dput and an actual example of what I'm doing:
> dput(temp)
structure(list(`Item Stem` = "IF APPLICABLE – Which of the following best characterizes the expectations with",
`Item Response` = "It was required.", orgchar_group = "locale",
`Org Characteristic` = "Rural", N = 487, percent = 34.5145287030475,
`Graphs note` = NA_character_, `Report note` = NA_character_,
`Other note` = NA_character_, subsig = 1, overall = 0, varname = NA_character_,
statsig = NA_real_, use = NA_real_, difference = 9.16044821292665), .Names = c("Item Stem",
"Item Response", "orgchar_group", "Org Characteristic", "N",
"percent", "Graphs note", "Report note", "Other note", "subsig",
"overall", "varname", "statsig", "use", "difference"), row.names = 288L, class = "data.frame")
> temp[1,1]
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
> temp[1,1] == "IF APPLICABLE – Which of the following best characterizes the expectations with"
[1] FALSE
Turns out it was in fact a non-ASCII character; shoutout to the commenters for helping me figure it out by (1) suggesting it and (2) showing that the comparison worked for them.
I was able to figure it out using insights from here (& here) and here.
I used a grep command (from @Tyler Rinker) to determine that there was in fact a non-ASCII character in my string, and a stringi command (from @hadley) to determine what kind. I then used a base solution from @Josh O'Brien to remove it. Turns out it was the dash: an en dash rather than a plain hyphen.
# working in the temp df
> x <- temp[1,1]
> grepl("[^ -~]", x)
[1] TRUE
> stringi::stri_enc_mark(x)
[1] "UTF-8"
> iconv(x, "UTF-8", "ASCII", sub="")
[1] "IF APPLICABLE Which of the following best characterizes the expectations with"
# set x as df$`Var name` and reassign it to fix
df$`Var name` <- iconv(df$`Var name`, "UTF-8", "ASCII", sub="")
Still don't understand it enough to explain why it happened but it's fixed now.
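For anyone hitting the same thing, a small diagnostic of my own (not from the original answer) that makes the culprit visible by inspecting code points:
x <- temp[1, 1]
utf8ToInt(x)[utf8ToInt(x) > 127]   # non-ASCII code points in the string; here 8211, the en dash (U+2013)
intToUtf8(8211)                    # "–", which is not the same character as the ASCII hyphen "-"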

Give a new variable value 0 or 1 based on the distance between two words in another variable

I am new to R. In my dataset, I have a variable called Reason. I want to create a new column called Price, set to 1 if any of the following conditions is met:
the words "price" and "high" are both mentioned in Reason and the distance between them is less than 6 words
the words "price" and "expensive" are both mentioned in Reason and the distance between them is less than 6 words
the words "price" and "increase" are both mentioned in Reason and the distance between them is less than 6 words
Otherwise, Price = 0.
I found the following user-defined function to get the distance between two words
distance <- function(string, term1, term2) {
  words <- strsplit(string, "\\s")[[1]]
  indices <- 1:length(words)
  names(indices) <- words
  abs(indices[term1] - indices[term2])
}
but I don't know how to apply it to the whole column to get the expected results. I tried the following code, but it only gives me "logical(0)" as the result:
for (j in seq(Survey$Reason))
{
Survey$Price[[j]]<- distance(Survey$Reason[[j]], " price ", " high ") <=6
}
Any help is highly appreciated.
Thanks
Starting from your sample data:
survey <- structure(list(Reason = c("Their price are extremely high.", "Because my price was increased so much, I wouldn't want anyone else to have to deal with that.", "Just because the intial workings were fine, but after we realised it would affect our contract, it left a sour taste in our mouth.", "Problems with the repair", "They did not handle my complaint as well I would have liked.", "Bad service overall.")), .Names = "Reason", row.names = c(NA, 6L), class = "data.frame")
First, I updated your function to remove punctuation and directly return the position test:
distanceOK <- function(string, term1, term2, n = 6) {
  words <- strsplit(gsub("[[:punct:]]", "", string), "\\s")[[1]]
  indices <- 1:length(words)
  names(indices) <- words
  dist <- abs(indices[term1] - indices[term2])
  ifelse(is.na(dist) | dist > n, 0, 1)
}
Then we apply:
survey$Price <- sapply(survey$Reason, FUN = function(str) distanceOK(str, "price", "high"))
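To cover all three conditions from the question, one possible extension of my own (note that distanceOK() matches exact tokens, so an inflected form like "increased" would need its own term):
pairs <- list(c("price", "high"), c("price", "expensive"), c("price", "increase"))
survey$Price <- sapply(survey$Reason, function(str) {
  as.integer(any(sapply(pairs, function(p) distanceOK(str, p[1], p[2]) == 1)))
})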

Efficiently match multiple strings/keywords to multiple texts in R

I am trying to efficiently map exact peptides (short sequences of amino acids in the 26-character alphabet A-Z [1]) to proteins (longer sequences of the same alphabet). The most efficient way to do this that I'm aware of is an Aho-Corasick trie (where the peptides are the keywords). Unfortunately I can't find a version of Aho-Corasick in R that will work with a non-nucleotide alphabet (Biostrings' PDict and Starr's match_ac are both hard-coded for DNA).
As a crutch I've been trying to parallelize a basic grep approach. But I'm having trouble figuring out a way to do so without incurring significant IO overhead. Here is a brief example:
peptides = c("FSSSGGGGGGGR","GAHLQGGAK","GGSGGSYGGGGSGGGYGGGSGSR","IISNASCTTNCLAPLAK")
if (!exists("proteins"))
{
  biocLite("biomaRt", ask=F, suppressUpdates=T, suppressAutoUpdate=T)
  library(biomaRt)
  ensembl = useMart("ensembl", dataset="hsapiens_gene_ensembl")
  proteins = getBM(attributes=c('peptide', 'refseq_peptide'), filters='refseq_peptide',
                   values=c("NP_000217", "NP_001276675"), mart=ensembl)
  row.names(proteins) = proteins$refseq_peptide
}
library(snowfall)
library(Biostrings)
library(plyr)
sfInit(parallel=T, cpus=detectCores()-1)
allPeptideInstances = NULL
i=1
increment=100
count=nrow(proteins)
while(T)
{
  print(paste(i, min(count, i+increment), sep=":"))
  text_source = proteins[i:min(count, i+increment),]
  text = text_source$peptide
  #peptideInstances = sapply(peptides, regexpr, text, fixed=T, useBytes=T)
  peptideInstances = sfSapply(peptides, regexpr, text, fixed=T, useBytes=T)
  dimnames(peptideInstances) = list(text_source$refseq_peptide, colnames(peptideInstances))
  sparsePeptideInstances = alply(peptideInstances, 2, .fun = function(x) {x[x > 0]}, .dims = T)
  allPeptideInstances = c(allPeptideInstances, sparsePeptideInstances, recursive=T)
  if (i==count | nrow(text_source) < increment)
    break
  i = i+increment
}
sfStop()
There are a few issues here:
1. peptideInstances here is a dense matrix, so returning it from each worker is very verbose. I have broken it up into blocks so that I'm not dealing with a 40,000 (proteins) x 60,000 (peptides) matrix.
2. I'm parallelizing over the peptides, when it would make more sense to parallelize over the proteins because they're bigger. But I got frustrated with trying to do it by protein, because this code breaks if there is only one protein in text_source.
Alternatively, if anyone is aware of a better solution in R, I'm happy to use that. I've spent enough time on this I probably would have been better served implementing Aho-Corasick.
[1] Some of those are ambiguity codes, but for simplicity, ignore that.
I learned Rcpp and implemented Aho-Corasick myself. CRAN now has a good general-purpose multiple-keyword search package, AhoCorasickTrie, which provides the AhoCorasickSearch() and AhoCorasickSearchList() functions used below.
Here are some usage examples:
listEquals = function(a, b) { is.null(unlist(a)) && is.null(unlist(b)) || !is.null(a) && !is.null(b) && all(unlist(a) == unlist(b)) }
# simple search of multiple keywords in a single text
keywords = c("Abra", "cadabra", "is", "the", "Magic", "Word")
oneSearch = AhoCorasickSearch(keywords, "Is Abracadabra the Magic Word?")
stopifnot(listEquals(oneSearch[[1]][[1]], list(keyword="Abra", offset=4)))
stopifnot(listEquals(oneSearch[[1]][[2]], list(keyword="cadabra", offset=8)))
stopifnot(listEquals(oneSearch[[1]][[3]], list(keyword="the", offset=16)))
stopifnot(listEquals(oneSearch[[1]][[4]], list(keyword="Magic", offset=20)))
stopifnot(listEquals(oneSearch[[1]][[5]], list(keyword="Word", offset=26)))
# search a list of lists
# * sublists are accessed by index
# * texts are accessed by index
# * non-matched texts are kept (to preserve index order)
listSearch = AhoCorasickSearchList(keywords, list(c("What in", "the world"), c("is"), "secret about", "the Magic Word?"))
stopifnot(listEquals(listSearch[[1]][[1]], list()))
stopifnot(listEquals(listSearch[[1]][[2]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[2]][[1]][[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(listSearch[[3]], list()))
stopifnot(listEquals(listSearch[[4]][[1]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[4]][[1]][[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(listSearch[[4]][[1]][[3]], list(keyword="Word", offset=11)))
# named search of a list of lists
# * sublists are accessed by name
# * matched texts are accessed by name
# * non-matched texts are dropped
namedSearch = AhoCorasickSearchList(keywords, list(subject=c(phrase1="What in", phrase2="the world"),
verb=c(phrase1="is"),
predicate1=c(phrase1="secret about"),
predicate2=c(phrase1="the Magic Word?")))
stopifnot(listEquals(namedSearch$subject$phrase2[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$verb$phrase1[[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(namedSearch$predicate1, list()))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[3]], list(keyword="Word", offset=11)))
# named search of multiple texts in a single list with keyword grouping and aminoacid alphabet
# * all matches to a keyword are accessed by name
# * non-matched keywords are dropped
proteins = c(protein1="PEPTIDEPEPTIDEDADADARARARARAKEKEKEKEPEPTIDE",
protein2="DERPADERPAPEWPEWPEEPEERAWRAWWARRAGTAGPEPTIDEKESEQUENCE")
peptides = c("PEPTIDE", "DERPA", "SEQUENCE", "KEKE", "PEPPIE")
peptideSearch = AhoCorasickSearch(peptides, proteins, alphabet="aminoacid", groupByKeyword=T)
stopifnot(listEquals(peptideSearch$PEPTIDE, list(list(keyword="protein1", offset=1),
list(keyword="protein1", offset=8),
list(keyword="protein1", offset=37),
list(keyword="protein2", offset=38))))
stopifnot(listEquals(peptideSearch$DERPA, list(list(keyword="protein2", offset=1),
list(keyword="protein2", offset=6))))
stopifnot(listEquals(peptideSearch$SEQUENCE, list(list(keyword="protein2", offset=47))))
stopifnot(listEquals(peptideSearch$KEKE, list(list(keyword="protein1", offset=29),
list(keyword="protein1", offset=31),
list(keyword="protein1", offset=33))))
stopifnot(listEquals(peptideSearch$PEPPIE, NULL))
# grouping by keyword without text names: offsets are given without reference to the text
names(proteins) = NULL
peptideSearch = AhoCorasickSearch(peptides, proteins, groupByKeyword=T)
stopifnot(listEquals(peptideSearch$PEPTIDE, list(1, 8, 37, 38)))
stopifnot(listEquals(peptideSearch$DERPA, list(1, 6)))
stopifnot(listEquals(peptideSearch$SEQUENCE, list(47)))
stopifnot(listEquals(peptideSearch$KEKE, list(29, 31, 33)))
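Tying this back to the original problem, a sketch of my own: assuming the peptides vector and the biomaRt-derived proteins data frame defined at the top of the question (not the toy ones redefined just above), the same call pattern applies:
library(AhoCorasickTrie)
protSeqs <- setNames(proteins$peptide, proteins$refseq_peptide)   # named character vector of protein sequences
peptideHits <- AhoCorasickSearch(peptides, protSeqs, alphabet = "aminoacid", groupByKeyword = TRUE)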
