Combining two grepl searches with 'OR' and adding a 'NOT' - r

I have this dataframe:
ID Description
 1 Tree fell on car
 2 Tree was uprooted
 3 While cutting tree, it came down
 4 Tree came down
I am trying to search a column in a data frame for weather words. I am doing this by using multiple grepl calls separated by an 'OR'. However, I want to combine two grepl calls to say "if the description has THIS WORD and THIS WORD, but not THIS WORD, it is weather". Looking at the data frame above, one can assume that "Tree came down" would be classified as weather, while "While cutting tree, it came down" is not weather related.
The code that I tried, based on other Stack Overflow answers, is:
Data$Type <- ifelse(grepl(' Tree|^Tree|-Tree|:Tree', Data$DESCRIPTION, ignore.case = TRUE) &
                    grepl('^[^Cutting]*[Feel|Fell|Fall|Up Rooted|Uprooted|Came Down| Down|Knocked Onto|Caused Damage][^Cutting]*$',
                          Data$DESCRIPTION, ignore.case = TRUE),
                    "weather", "Not Classified")
But this is not working. I tried:
Data$Type <- ifelse(grepl(' Tree|^Tree|-Tree|:Tree', Data$DESCRIPTION, ignore.case = TRUE) &
                    grepl('Feel|Fell|Fall|Up Rooted|Uprooted|Came Down| Down|Knocked Onto|Caused Damage',
                          Data$DESCRIPTION, ignore.case = TRUE) &
                    !grepl('Cutting', Data$DESCRIPTION, ignore.case = TRUE),
                    "Weather", "Not Classified")
I am expecting this outcome:
ID Description                      Type
 1 Tree fell on car                 "Weather"
 2 Tree was uprooted                "Weather"
 3 While cutting tree, it came down "Non-Weather"
 4 Tree came down                   "Weather"
But these do not work. Thank you

Since you have only two cases (Weather and Non-Weather), I think it is sufficient to use grepl for one of them only:
df$Type <- sapply(df$Description,
                  function(x) ifelse(grepl(pattern = 'Tree|fell|^cutting', x = x),
                                     'Weather', 'Non-Weather'))
[1] "Weather" "Weather" "Non-Weather" "Weather"

I ended up just doing things like this, to make sure "Ice" is treated as a weather word but "Ice Maker" is not:
Data$Type <- ifelse(grepl('Ice$| Ice |,Ice |^Ice | Ice,', Data$DESCRIPTION, ignore.case = TRUE) &
                    !grepl('Maker', Data$DESCRIPTION, ignore.case = TRUE),
                    "Weather", "Not Classified")
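A word-boundary pattern can replace that list of positions; a sketch under the same assumptions (a DESCRIPTION column and the two-class output):

# \\b matches a word boundary, so 'Ice' is found at the start or end of the
# text and next to spaces or punctuation, without listing each position.
Data$Type <- ifelse(grepl('\\bIce\\b', Data$DESCRIPTION, ignore.case = TRUE) &
                    !grepl('\\bMaker\\b', Data$DESCRIPTION, ignore.case = TRUE),
                    "Weather", "Not Classified")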

Related

How to remove the first words of specific rows that appear in another column?

Is there a way to remove the first n words of the column "content" when there are words present in the "keyword" column?
I am working with a data frame similar to this:
keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee", "My uncle Sam is sitting in the kitchen with my uncle Richard", "Tom is playing with Tom's family's dog", "Cassandra is jogging for her first time", "The librarian is jogging with her")
data <- data.frame(keyword, content)
data
In some cases, the first few words of the "keyword" string are contained in the "content" string.
In others, the "keyword" string remains empty and only "content" is filled.
What I want to achieve here is to remove the first appearance of the word combination in "keyword" that appears in the same row in "content".
Unfortunately, I am only able to create code that deletes all the matching words. But as you can see, some words (like "uncle" or "Tom") appear more than once in a cell.
I'd like to only delete the first appearance and keep all that come after in the same cell.
My next-best solution was to use the following code:
data$content <- mapply(function(x, y) gsub(x, "", y), gsub(" ", "|", data$keyword), data$content)
This code was designed to remove all of the words from "content" that are present in "keyword" of the same row. (It was initially posted here).
Another option that I tried was to design a function for this:
I first created a new variable which counted the number of words that are included in the "keyword" string of the corresponding line:
numw <- lengths(gregexpr("\\S+", data$keyword))
data <- cbind(data, numw)
Second, I tried to formulate a function to remove the first n words of content[i], with n = numw[i]
shorten <- function(v, z){
  v <- gsub(".*^\\w+", z, v)
}
shorten(data$content, data$numw)
Unfortunately, I am not able to make the function work and the following error message will be generated:
Error in gsub(".*^\w+", z, v) : invalid 'replacement' argument
So, I'd be incredibly grateful if someone could help me formulate a function that deals with the issue more appropriately.
Here is a solution based on str_remove. Since str_remove warns when the pattern is '', the first mutate replaces empty keywords with NA. Then, if keyword is not NA it is stripped from content; if it is NA, content is kept as is.
library(tidyverse)

data |>
  mutate(keyword = na_if(keyword, '')) |>
  mutate(content = case_when(
    !is.na(keyword) ~ str_remove(content, keyword),
    is.na(keyword)  ~ content))
#>         keyword                                          content
#> 1     Mr. Jones                               is drinking coffee
#> 2  My uncle Sam  is sitting in the kitchen with my uncle Richard
#> 3           Tom               is playing with Tom's family's dog
#> 4          <NA>          Cassandra is jogging for her first time
#> 5 The librarian                              is jogging with her
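For comparison, a minimal base R sketch (same data frame as above): sub() replaces only the first match, unlike gsub(), and fixed = TRUE treats the keyword as a literal string rather than a regex:

# Strip the first occurrence of each row's keyword from its content;
# an empty keyword leaves the content unchanged.
data$content <- mapply(function(k, x) sub(k, "", x, fixed = TRUE),
                       data$keyword, data$content)
# trimws() drops the leading space left behind where the keyword was removed.
data$content <- trimws(data$content)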

Matching phrases (word-search), noting findings in different columns

I'm working with feedback from teachers: multiple selections from a Google form. All the selections end up in a single column.
There was also the option to type in responses, so some words/responses are only present once. There are about 50 responses, with roughly 1 to 7 words per entry. For the MRE I've included 5 as a vector.
eg_responses <- c("Difficult, Challenging, Fair", "Necessary, Useful", "Cruel", "Easy, Challenging", "School's shouldn't have to do that")
words_example <- c("Difficult","Easy","Fair","Challenging","Necessary","Useful")
While I can get a summary of the results, I need to create isolated responses so that I can select and compare with other variables later on...
What I would like to be able to do:
have columns representative of each word, and an extra column with other comments.
Look through each row for each word that was in the drop down.
while there are rows left to check look through word x.
if word is present mark in x column.
if word is not present mark NA in x column.
next column and next word when rows run out.
any other words/phrases left at the end go in the last column,
if no other words phrases are left mark end column with NA.
I'm sure I've been overcomplicating things. But I'm an absolute beginner. I'm working in R.
I have tried separating the words, turning things into factors, then turning them back to character... Being able to search for exact complete phrases, or to search through the words, would be really helpful.
There are 6 other sections with the same problem, and other sections of other variables.
I would like to have something like this at the end...
(apologies, I'm just going to make it up; it's a bit of a chicken-and-egg situation)
respcolnames <- c("Challenging", "Fair", "Unfair", "Other")
row1 <- c("NA", "Fair", "NA", "NA")
row2 <- c("Challenging", "NA", "Unfair", "NA")
row3 <- c("NA", "NA", "NA", "School's shouldn't have to do that")
teacher_responses <- rbind(row1, row2, row3)
colnames(teacher_responses) <- respcolnames
so that the corresponding response is in the correct column. I can then strip away the "Other" responses for a more social science analysis, and use the drop down response selections for some graphics.
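A hedged base R sketch of one way to approach this, using the eg_responses and words_example vectors from the question (the helper names resp_split, word_cols and other are made up here):

# Split each response on commas into its individual phrases.
resp_split <- strsplit(eg_responses, ",\\s*")

# One column per drop-down word: the word itself if present, NA otherwise.
word_cols <- sapply(words_example, function(w)
  ifelse(sapply(resp_split, function(p) w %in% p), w, NA))

# Anything not in the drop-down list goes into the "Other" column.
other <- sapply(resp_split, function(p) {
  leftover <- setdiff(p, words_example)
  if (length(leftover) == 0) NA else paste(leftover, collapse = "; ")
})

teacher_responses <- data.frame(word_cols, Other = other)

On the five example responses this gives one row per response, with "Cruel" and "School's shouldn't have to do that" landing in the Other column.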

R - Filter any rows and show all columns

I would like an output that shows the column names that have rows containing a given string value. Assume the following...
Animals      Sex
I like Dogs  Male
I like Cats  Male
I like Dogs  Female
I like Dogs  Female
Data Missing Male
Data Missing Male
I found an SO thread here where David Arenburg provided an answer which works very well, but I was wondering if it is possible to get an output that doesn't show all the rows. So if I want to find the string "Data Missing", the output I would like to see is...
Animals
Data Missing
or
Animal
TRUE
instead of
Animals      Sex
Data Missing Male
Data Missing Male
I have also found that using filters such as df$columnName works, but I have a big file with a large number of columns, so typing column names would be tedious. Assume the string "Data Missing" is also in other columns and that there could be different kinds of strings; that is why I like David Arenburg's answer. Bear in mind I don't have just the two columns shown in the sample above.
Cheers
One thing you could do is grep for "Data Missing" like this:
x <- apply(data, 2, grep, pattern = "Data Missing")
lengths(x) > 0
This will give you the:
Animal
TRUE
result you're after. It's also good because it checks all columns, which you mentioned was something you wanted.
If we want only the first row where it matches, use match
data[match("Data Missing", data$Animals), "Animals", drop = FALSE]
# Animals
#5 Data Missing
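If the goal is literally just the names of the columns that contain the string, a hedged one-liner sketch on the same data frame:

# grepl() tests each column, any() collapses each column to TRUE/FALSE,
# and names() keeps only the columns with at least one match.
names(data)[sapply(data, function(col) any(grepl("Data Missing", col)))]
#> [1] "Animals"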

Gene ontology (GO) analysis for a list of Genes (with ENTREZID) in R?

I am very new to GO analysis and I am a bit confused about how to do it for my list of genes.
I have a list of genes (n=10):
gene_list
    SYMBOL ENTREZID GENENAME
 1   AFAP1    60312 actin filament associated protein 1
 2 ANAPC11    51529 anaphase promoting complex subunit 11
 3  ANAPC5    51433 anaphase promoting complex subunit 5
 4    ATL2    64225 atlastin GTPase 2
 5   AURKA     6790 aurora kinase A
 6   CCNB2     9133 cyclin B2
 7   CCND2      894 cyclin D2
 8   CDCA2   157313 cell division cycle associated 2
 9   CDCA7    83879 cell division cycle associated 7
10  CDCA7L    55536 cell division cycle associated 7-like
and I simply want to find their function; it was suggested that I use GO analysis tools.
I am not sure if that is the correct way to do so.
Here is my solution:
library(org.Hs.eg.db)

x <- org.Hs.egGO
# Get the Entrez gene identifiers that are mapped to a GO ID
xx <- as.list(x[gene_list$ENTREZID])
So, I've got a list with EntrezID that are assigned to several GO terms for each genes.
for example:
> xx$`60312`
$`GO:0009966`
$`GO:0009966`$GOID
[1] "GO:0009966"
$`GO:0009966`$Evidence
[1] "IEA"
$`GO:0009966`$Ontology
[1] "BP"
$`GO:0051493`
$`GO:0051493`$GOID
[1] "GO:0051493"
$`GO:0051493`$Evidence
[1] "IEA"
$`GO:0051493`$Ontology
[1] "BP"
My question is: how can I find the function of each of these genes in a simpler way, and am I doing it right? I want to add the function to gene_list as a function/GO column.
Thanks in advance,
EDIT: There is a new Bioinformatics SE (currently in beta mode).
I hope I understand what you are aiming for here.
BTW, for bioinformatics-related topics, you can also have a look at Biostars, which has the same purpose as SO but for bioinformatics.
If you just want a list of the functions related to each gene, you can query a database such as Ensembl through the biomaRt Bioconductor package, which is an API for querying BioMart databases.
You will need an internet connection to do the query, though.
Bioconductor provides packages for bioinformatics studies, and these packages generally come with good vignettes which walk you through the different steps of the analysis (and even highlight how you should design your data, or what some of the pitfalls would be).
In your case, directly from the biomaRt vignette - task 2 in particular:
Note: there are slightly quicker ways than the one I report below.
# load the library
library("biomaRt")

# I prefer Ensembl, so that is the one I will query, but you can
# query other databases; try out: listMarts()
ensembl <- useMart("ensembl")

# as it seems that you are looking for human genes:
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)

# if you want other model organisms, have a look at:
# listDatasets(ensembl)
You need to create your query (your list of ENTREZ ids). To see which filters you can query:
filters = listFilters(ensembl)
And then you want to retrieve attributes: your GO number and description. To see the list of available attributes:
attributes = listAttributes(ensembl)
For you, the query would look something like this:
goids <- getBM(
  # you want entrezgene so you know which gene is which; go_id is the GO ID
  # and name_1006 is the identifier of the 'GO term name' attribute
  attributes = c('entrezgene', 'go_id', 'name_1006'),
  filters = 'entrezgene',
  values = gene_list$ENTREZID,
  mart = ensembl)
The query itself can take a while.
Then you can always collapse the information into two columns (but I wouldn't recommend that for anything other than reporting purposes):
Go.collapsed <- Reduce(rbind, lapply(gene_list$ENTREZID, function(x) {
  tempo <- goids[goids$entrezgene == x, ]
  data.frame('ENTREZGENE' = x,
             'Go.ID'      = paste(tempo$go_id, collapse = ' ; '),
             'GO.term'    = paste(tempo$name_1006, collapse = ' ; '))
}))
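To get this back onto the original table as extra columns (the asker's stated goal), a small sketch using base merge():

# Hedged sketch: join the collapsed GO annotation back onto gene_list;
# all.x = TRUE keeps genes that returned no GO terms.
gene_list_annotated <- merge(gene_list, Go.collapsed,
                             by.x = "ENTREZID", by.y = "ENTREZGENE",
                             all.x = TRUE)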
Edit:
If you want to query a past version of the ensembl database:
ens82 <- useMart(host = 'sep2015.archive.ensembl.org',
                 biomart = 'ENSEMBL_MART_ENSEMBL',
                 dataset = 'hsapiens_gene_ensembl')
and then the query would be:
goids <- getBM(attributes = c('entrezgene', 'go_id', 'name_1006'),
               filters = 'entrezgene', values = gene_list$ENTREZID,
               mart = ens82)
However, if you had a GO enrichment analysis in mind, your list of genes is too short.
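As an offline alternative (no BioMart query needed), the org.Hs.eg.db annotation package the asker is already using can be queried directly with AnnotationDbi::select; a hedged sketch, with the term names then looked up in GO.db:

library(AnnotationDbi)
library(org.Hs.eg.db)
library(GO.db)

# One row per gene / GO ID / evidence code combination.
go_tab <- AnnotationDbi::select(org.Hs.eg.db,
                                keys    = as.character(gene_list$ENTREZID),
                                columns = c("SYMBOL", "GO", "ONTOLOGY"),
                                keytype = "ENTREZID")

# Look up the human-readable term names in GO.db and join them on.
go_names <- AnnotationDbi::select(GO.db,
                                  keys    = unique(na.omit(go_tab$GO)),
                                  columns = "TERM",
                                  keytype = "GOID")
go_tab <- merge(go_tab, go_names, by.x = "GO", by.y = "GOID", all.x = TRUE)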

Pasting (or merging) two elements of a column together

I have two sources of clinical procedure billing information that I have added together (with rbind). In each row there is a CPT field and a CPT.description field that supplies a brief explanation. However, the descriptions are slightly different between the two sources. I want to be able to combine them, so that if different words or abbreviations are used I can just do a string search to find what I am looking for.
So let's make up a simplified representation of a data table that I was able to generate.
cpt <- c(23456,23456,10000,44555,44555)
description <- c("tonsillectomy","tonsillectomy in >12 year old","brain transplant","castration","orchidectomy")
cpt.desc <- data.frame(cpt,description)
And here is what I want to get to.
cpt.wanted <- c(23456,10000,44555)
description.wanted <- c("tonsillectomy; tonsillectomy in >12 year old","brain transplant","castration; orchidectomy")
cpt.desc.wanted <- data.frame(cpt.wanted,description.wanted)
I have tried using functions such as unstack and then lapply(list, paste), but that does not paste the elements of each list. I also tried reshape, but there was no categorical variable to differentiate the first or second (or in some cases third) version of the description. The really annoying part is that I had a similar problem a few months or years ago and someone helped me either on Stack Overflow or on r-help, and for the life of me I cannot find it.
So the underlying problem is: imagine that I have a spreadsheet in front of me. I need to do a vertical merge (paste) of two or maybe even three description cells that have the same CPT code in the adjacent column.
What buzzwords should I have been using to search for a solution to this problem?
Thank you so much for your help.
sapply(sapply(unique(cpt), function(x) grep(x, cpt)),
       # the inner sapply creates sets of index vectors as a list
       function(x) paste(description[x], collapse = ";"))
       # ... and this pastes each set of selected items from the "description" vector
[1] "tonsillectomy;tonsillectomy in >12 year old"
[2] "brain transplant"
[3] "castration;orchidectomy"
Here is an approach that uses plyr.
library("plyr")
cpt.desc.wanted <- ddply(cpt.desc, .(cpt), summarise,
                         description.wanted = paste(unique(description), collapse = "; "))
which gives
> cpt.desc.wanted
cpt description.wanted
1 10000 brain transplant
2 23456 tonsillectomy; tonsillectomy in >12 year old
3 44555 castration; orchidectomy
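For reference, the same collapse can also be done in base R with aggregate(), a minimal sketch using the cpt.desc data frame from the question:

# One row per cpt; descriptions are de-duplicated and pasted with "; ".
cpt.desc.wanted <- aggregate(description ~ cpt, data = cpt.desc,
                             FUN = function(x) paste(unique(x), collapse = "; "))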
