text searching in R - r

I am trying to build a query for the R package RISmed, which searches the PubMed database and downloads the relevant journal article information. I want to search for two words that always appear together, for example:
query= "gene sequencing"
search<-EUtilsSummary(query,type="esearch",db = "pubmed",mindate=2014, maxdate=2014, retmax=20)
With the command above, PubMed searches for gene and sequencing separately and then combines them, so any record whose text contains both words is returned. I want the search to treat "gene sequencing" as a phrase, with the two words always adjacent. How can I write that query? Would anyone please help me?
Thanks in advance !

I would try this:
query <- '"gene sequencing"[Title/Abstract]'
The PubMed search engine does accept quoted strings; you just need to know how to preserve them within R. Surrounding the string with single quotes is one method. Backslash-escaping the inner double quotes is another. Notice that the value returned from my experiment with your code shows that backslash escaping is how the implementers of that package do it:
> str(search)
Formal class 'EUtilsSummary' [package "RISmed"] with 6 slots
..@ db : chr "pubmed"
..@ count : num 542
..@ retmax : num 20
..@ retstart : num 0
..@ PMID : chr [1:20] "25548628" "25543043" "25542841" "25540641" ...
..@ querytranslation: chr "\"gene sequencing\"[Title/Abstract] AND 2014[EDAT] : 2014[EDAT]"
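A minimal base R sketch of the two quoting styles mentioned above, to show that they produce the same string. The variable names q_single and q_escaped are made up for illustration; nothing here calls RISmed or PubMed.

```r
# Two equivalent ways to keep the double quotes inside an R string
q_single  <- '"gene sequencing"[Title/Abstract]'     # single quotes on the outside
q_escaped <- "\"gene sequencing\"[Title/Abstract]"   # backslash-escaped double quotes

# Both literals produce exactly the same characters
identical(q_single, q_escaped)

# print() displays the escapes; cat() writes the raw characters
print(q_single)
cat(q_single, "\n")
```

Either form should be usable as the query argument in the call from the question.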

Related

How to specify a text column when read a csv file?

I use this way to read a csv file:
Here is the str() output:
$ an_id : int 4840 41981 40482 37473 33278 29083 30940 29374 24023 23922 ...
It is an integer column; the following converts it to character:
df$an_id <- paste0("doc_", df$an_id)
However when I execute this command I receive this error:
toks <- corpus(df, docid_field = "an_id") %>%
tokens()
Error in corpus.data.frame(df, docid_field = "an_id") :
column name text not found
Is there any different way to read the file or pass the column as text?
If I save this data to a CSV file, read the file back, and run the commands, they work properly:
dtext <- data.frame(id = c(1,2,3,4), text = c("here","This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.", "The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.", "There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."),stringsAsFactors = F)
As mentioned in the comments by @Nathalie, the following does the trick if the data is in a data.frame. docid_field refers to the column of document ids, and text_field should refer to the column that contains the text.
toks <- corpus(df,
docid_field = "an_id",
text_field = "text") %>%
tokens()
str(toks)
List of 4
$ doc_1: chr "here"
$ doc_2: chr [1:39] "This" "dataset" "contains" "movie" ...
$ doc_3: chr [1:36] "The" "core" "dataset" "contains" ...
$ doc_4: chr [1:105] "There" "are" "two" "top-level" ...
- attr(*, "types")= chr [1:102] "here" "This" "dataset" "contains" ...
- attr(*, "padding")= logi FALSE
- attr(*, "class")= chr "tokens"
- attr(*, "what")= chr "word"
- attr(*, "ngrams")= int 1
- attr(*, "skip")= int 0
- attr(*, "concatenator")= chr "_"
- attr(*, "docvars")='data.frame': 4 obs. of 0 variables
data:
df <- structure(list(an_id = c("doc_1", "doc_2", "doc_3", "doc_4"),
text = c("here", "This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.",
"The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.",
"There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."
)), row.names = c(NA, -4L), class = "data.frame")
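On the other half of the question (reading the id column as text in the first place), one possible approach is the colClasses argument of read.csv(). This is only a sketch: the inline CSV text and the name df_chr are invented for illustration, and the real file's layout may differ.

```r
# Hypothetical two-row CSV standing in for the real file
csv_text <- "an_id,text\n4840,here\n41981,another document\n"

# A named colClasses entry forces an_id to character; other columns are guessed as usual
df_chr <- read.csv(text = csv_text,
                   colClasses = c(an_id = "character"),
                   stringsAsFactors = FALSE)

# Prefix the ids as in the question; the column is already character
df_chr$an_id <- paste0("doc_", df_chr$an_id)
str(df_chr)
```

For a file on disk, the same colClasses argument would be passed along with the file path instead of text =.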

Get help file description in R as an object [duplicate]

I'm trying to build a shiny app that manipulates some base R datasets, like datasets::airmiles. I'd like to display somewhere within that app the description of this dataset. For instance, ?datasets::airmiles yields -
Is there any way I can extract the description string? e.g. in this case, it would be the string 'The revenue passenger miles flown by commercial airlines in the United States for each year from 1937 to 1960.'
I have tried -
x <- help('airmiles')
str(x)
# 'help_files_with_topic' chr "C:/Program Files/R/R-3.5.1/library/datasets/help/airmiles"
# - attr(*, "call")= language help(topic = "airmiles")
# - attr(*, "topic")= chr "airmiles"
# - attr(*, "tried_all_packages")= logi FALSE
# - attr(*, "type")= chr "html"
But x does not contain the description of airmiles.
Seems to work for what you want but I'd take out the semi-colons:
R help page as object
Can be accessed by var$description.
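For reference, here is a rough sketch of one way to pull the description out of the parsed Rd database with tools::Rd_db(). It is not necessarily what the linked answer does. The helper name get_description is made up, and the sketch assumes the help file is named after the topic (airmiles.Rd), which does not hold for every topic.

```r
# Hypothetical helper: return the \description section of a help topic as one string
get_description <- function(topic, package) {
  db <- tools::Rd_db(package)                       # named list of parsed Rd objects
  idx <- which(sub("\\.Rd$", "", names(db)) == topic)
  if (length(idx) == 0) stop("no Rd file named after topic: ", topic)
  rd <- db[[idx[1]]]
  # Each top-level element carries its section name in the "Rd_tag" attribute
  tags <- vapply(rd, function(x) attr(x, "Rd_tag"), character(1))
  desc <- rd[[which(tags == "\\description")[1]]]
  # Flatten the section and collapse line breaks and runs of spaces
  trimws(gsub("\\s+", " ", paste(unlist(desc), collapse = "")))
}

desc <- get_description("airmiles", "datasets")
desc
```

The resulting string could then be passed to a text output in the shiny app.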

Read cell values without formatting into R with googlesheets

Would like to be able to read Google Sheets cell values into R with googlesheets package, but without any cell formatting applied (e.g. comma separators, percentage conversion, etc.).
Have tried gs_read() without specifying a range, which uses gs_read_csv(), which will "request the data from the Sheets API via the exportcsv link". Can't find a way to tell it to provide underlying cell value without formatting applied.
Similarly, tried gs_read() and specifying a range, which uses gs_read_cellfeed(). But can't find a way to indicate that I want un-formatted cell values.
Note: I'm not after the formulas in any cells, just the values without any formatting applied.
Example:
(It looks like I'm not able to post images.)
Here's a screenshot of an example Google Sheet:
https://www.dropbox.com/s/qff05u8nn3do33n/Screenshot%202015-07-26%2008.42.58.png?dl=0
First and third columns are numeric with no formatting applied, 2nd column applies comma separators for thousands, 4th column applies percentage formatting.
Reading this sheet with the following code:
library(googlesheets)
gs <- gs_title("GoogleSheets Test")
ws <- gs_read(gs, ws = "Sheet1")
yields:
> str(ws)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 4 variables:
$ Number : int 123456 123457 123458
$ Number_wFormat : chr "123,456" "123,457" "123,458"
$ Percent : num 0.123 0.234 0.346
$ Percent_wFormat: chr "12.34%" "23.45%" "34.56%"
Would like to be able to read a worksheet that has formatting applied (ala columns 2 and 4), but read the unformatted values (ala columns 1 and 3).
At this point, I think your best bet is to fix the imported data like so:
> ws$Number_fixed <- type.convert(gsub(',', '', ws$Number_wFormat))
> ws$Percent_fixed <- type.convert(gsub('%', '', ws$Percent_wFormat)) / 100
> str(ws)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 6 variables:
$ Number : int 123456 123457 123458
$ Number_wFormat : chr "123,456" "123,457" "123,458"
$ Percent : num 0.123 0.234 0.346
$ Percent_wFormat: chr "12.34%" "23.45%" "34.56%"
$ Number_fixed : int 123456 123457 123458
$ Percent_fixed : num 0.123 0.234 0.346
I had some hope that post-processing with functions from readr would be a decent answer, but it looks like percentages and "currency" style numbers are open issues there too.
I have opened an issue to solve this better in googlesheets, one way or another.
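The two fixes above can be folded into one small helper. This is only a base R sketch with a made-up function name, parse_formatted, and it assumes the only formatting present is comma thousands separators and a trailing percent sign.

```r
# Hypothetical helper: turn strings like "123,456" or "12.34%" into plain numbers
parse_formatted <- function(x) {
  is_pct <- grepl("%$", x)                    # remember which values were percentages
  value  <- as.numeric(gsub("[,%]", "", x))   # strip commas and percent signs
  ifelse(is_pct, value / 100, value)          # rescale percentages to fractions
}

parse_formatted(c("123,456", "123,457", "12.34%", "23.45%"))
```

Applied per column, e.g. ws$Number_fixed <- parse_formatted(ws$Number_wFormat), this should give the same values as the type.convert() calls, except that the result is always numeric rather than integer.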

Hidden Markov Model in R - Predict the next observation with RHmm

This is my first post on StackOverflow and I could use a little help... Please forgive me if I am not following the correct posting protocols.
There is another example on StackOverflow on which I am heavily basing my work, but I can't quite figure out how to adapt the code. In particular, I am looking at the solution provided to that question.
Here is the link:
Getting the next observation from a HMM gaussian mixture distribution
Some background:
RHmm - version 2.1.0 downloaded from R Forge.
RStudio - 0.98.953
R - 3.0.2 32 bit
I am trying to figure out the following issues with my code:
1. How do I amend the solution from the link above (prediction of the next observation) to work with my Baum-Welch model?
   Ex. hm_model <- HMMFit(obs=TWII_Train, nStates=5)
2. The R / RStudio session aborts when I run the Baum-Welch version with a mixture distribution, hm_model <- HMMFit(obs=TWII_Train, dis="MIXTURE", nStates=5, nMixt=4). Can you recreate the error and propose a workaround?
Here is my R code:
library(quantmod)
library(RHmm)
getSymbols("^TWII")
TWII_Subset <- window(TWII, start=as.Date("2012-01-01"), end = as.Date("2013-04-01"))
TWII_Train <- cbind(TWII_Subset$TWII.Close - TWII_Subset$TWII.Open,
TWII_Subset$TWII.Volume)
hm_model <- HMMFit(obs=TWII_Train, nStates=5)
VitPath <- viterbi(hm_model, TWII_Train)
I'm not a user of this package and this is not really an answer, but a comment would obscure some of the structures. It appears that the "proportion" value of your model is missing (so the structures are different). The "mean" value looks like this:
$ mean :List of 5
..$ : num [1:2] 6.72 3.34e+06
..$ : num [1:2] -12.4 2420174.5
..$ : num [1:2] -2.4 1832546.5
..$ : num [1:2] -10.4 1432636.1
..$ : num [1:2] 5.02 1.96e+06
I also suspect that you should be using 2 and 5 rather than 4 and 5 for m and n. Look at the rest of the model with:
str(hm_model)

Exporting data from UScensus2010

I need to extract Census 2010 data for CDPs (nationwide) and figured it would be simplest to use the UScensus2010 package to extract the key tables I need, one state at a time, and then export the resulting data. Towards this end I ran the following code
library(UScensus2010)
install.cdp("osx")
demographics(dem="P0030001", state="oh",level="cdp")
p3.race <- demographics(dem=c("P0010001", "P0030001", "P0030002", "P0030003", "P0030004", "P0030005", "P0030006"), state="oh",level="cdp")
Note that for p3.race we have
..$ : chr [1:1204] "Hamler village" "Florida village" "Deshler village" "New Bavaria village" ...
..$ : chr [1:7] "P0010001" "P0030001" "P0030002" "P0030003" ...
So far so good. Now, what would be the best way to export p3.race (as *.csv, *.dta, or any other format) such that the CDP names show up as column A and P0010001:P0030006 show up as columns B:H?
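One possible approach, sketched in base R on a toy matrix rather than the real p3.race: move the row names into an ordinary first column and write the result with write.csv(). The sketch assumes p3.race is a matrix (or something coercible to one) with the CDP names as row names, as the str() output above suggests. The object names m and out, and the numbers in m, are invented.

```r
# Toy stand-in for p3.race: CDP names as row names, table codes as column names
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("Hamler village", "Florida village"),
                            c("P0010001", "P0030001", "P0030002")))

# Put the row names into a regular column so they land in column A
out <- data.frame(CDP = rownames(m), m,
                  check.names = FALSE, row.names = NULL)

# Write without the automatic row-name column
csv_file <- tempfile(fileext = ".csv")
write.csv(out, csv_file, row.names = FALSE)

# Read back to check the layout
back <- read.csv(csv_file, stringsAsFactors = FALSE)
back
```

For a Stata file, the same data frame could probably be passed to foreign::write.dta() instead of write.csv().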
