How to extract sample titles (names) using GEOquery package? - r

GEOquery is a great R package for retrieving and analyzing gene expression data stored in the NCBI Gene Expression Omnibus (GEO) database. I have used the following code, provided by GEO's GEO2R service (which automatically generates an initial R script for analyzing your chosen data), to extract a GEO series of experiments:
gset <- getGEO("GSE10246", GSEMatrix =TRUE)
if (length(gset) > 1) idx <- grep("GPL1261", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
gset # displays a summary of the data stored in this variable
The problem is that I cannot retrieve the sample titles from it. I found the Columns() function, which works on GDS datasets and returns the sample names, but it does not work on GSE objects.
Please note I am not interested in the sample accession IDs (e.g. GSM258609, GSM258610, etc.); what I want is the human-readable sample titles.
Does anyone have an idea? Thanks.

After
gset <- getGEO("GSE10246", GSEMatrix =TRUE)
gset is a simple list; its first element is an ExpressionSet, and the sample information is stored in the phenoData slot (accessible with pData), so maybe you're looking for
pData(gset[[1]])
See ?ExpressionSet for more.
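For the specific titles, a minimal sketch building on that answer; it assumes the series matrix carries the usual "title" column in its phenotype data, which is worth confirming with colnames(pData(...)):
library(GEOquery)

gset <- getGEO("GSE10246", GSEMatrix = TRUE)
if (length(gset) > 1) idx <- grep("GPL1261", attr(gset, "names")) else idx <- 1
eset <- gset[[idx]]

# phenotype data: one row per sample; columns typically include "title" and "geo_accession"
pheno <- pData(eset)
colnames(pheno)

# human-readable sample titles, named by their GSM accessions
titles <- setNames(as.character(pheno$title), rownames(pheno))
head(titles)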

Related

Using R and an API to extract multiple stock market data

I have set up an API access key with a data provider of stock market data. With this key I am able to extract stock market data based on ticker code (e.g. APPL: Apple, FB: Facebook, etc.).
I am able to extract stock data for an individual ticker using R, but I want to write code that extracts data for multiple tickers and puts it all in one data frame (the structure is the same for all stocks). I'm not sure how to create a loop that updates the data frame each time stock data is extracted. I get a message saying 'No encoding supplied: defaulting to UTF-8', which does not tell me much. A point in the right direction would be helpful.
I have the following code:
if (!require("httr")) {
install.packages("httr")
library(httr)
}
if (!require("jsonlite")) {
install.packages("jsonlite")
library(jsonlite)
}
stocks <- c("FB","APPL") #Example stocks actual stocks removed
len <- length(stocks)
url <- "URL" #Actual url removed
access_key <- "MY ACCESS KEY" #Actual access key removed
extraction <- lapply(stocks[1:len], function(i) {
  # 'i' is the ticker symbol itself, so it is used directly in the query string
  call1 <- paste(url, "?access_key=", access_key, "&", "symbols", "=", i, sep = "")
  get_prices <- GET(call1)
  # specifying the encoding avoids the "No encoding supplied" message
  get_prices_text <- content(get_prices, "text", encoding = "UTF-8")
  get_prices_json <- fromJSON(get_prices_text, flatten = TRUE)
  get_prices_df <- as.data.frame(get_prices_json)
  return(get_prices_df)
})
file <- do.call(rbind,extraction)
I realised that this is not the most efficient way of doing this. A better way is to update the URL to include multiple stocks rather than using an lapply call, so I am closing the question.
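For anyone reading later, a minimal sketch of that single-request idea, assuming the (unnamed) API accepts a comma-separated symbols parameter and returns JSON in the same structure as above; url and access_key are the question's placeholders, not real values:
library(httr)
library(jsonlite)

stocks <- c("FB", "APPL")        # placeholder tickers from the question
url <- "URL"                     # actual endpoint removed
access_key <- "MY ACCESS KEY"    # actual key removed

# one request covering all tickers, assuming the API accepts "symbols=FB,APPL"
call1 <- paste0(url, "?access_key=", access_key,
                "&symbols=", paste(stocks, collapse = ","))
resp <- GET(call1)
prices_json <- fromJSON(content(resp, "text", encoding = "UTF-8"), flatten = TRUE)
prices_df <- as.data.frame(prices_json)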

Extract data of several gene sets from an RNAseq result summary file using R

I am trying to extract data of several gene sets from an RNAseq result summary file:
Example gene lists:
I am using Excel to first highlight duplicated genes, sort the summary file, and then copy the data I need. It is time-consuming, and Excel always freezes when sorting, especially for big gene lists.
I was wondering if R can do a better job. Could someone kindly provide the code if R can be a better solution?
I think I got the solution, although I still need to process the lists one by one.
It is faster than Excel anyway. :)
# the read_excel() calls below need the readxl package
library(readxl)
# read the RNAseq result summary file
result <- read_excel("RNAseq_Result.xlsx")
# read the gene lists file
geneset <- read_excel("Gene set list.xlsx")
# read one specific list from the gene lists file
ListA <- geneset$ListA
# subset the summary to the genes in ListA
ResultListA <- result[result$Gene_name %in% ListA, ]
# output file
write.csv(ResultListA, 'ResultListA.csv')
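If the gene lists file keeps one list per column, the one-by-one step can be replaced with a loop over the columns. A minimal sketch, assuming the column names (ListA, ListB, ...) are the list names and that empty cells should be dropped:
library(readxl)

result  <- read_excel("RNAseq_Result.xlsx")
geneset <- read_excel("Gene set list.xlsx")

# write one CSV per gene list, named Result<ListName>.csv
for (list_name in colnames(geneset)) {
  genes <- na.omit(geneset[[list_name]])            # drop empty cells
  ResultList <- result[result$Gene_name %in% genes, ]
  write.csv(ResultList, paste0("Result", list_name, ".csv"), row.names = FALSE)
}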

How to Subset One Spatial Data Set into Multiple Spatial Datasets Based on Feature Name in R

I have a spatial dataset, polyline, that contains 115 line features, and I am trying to figure out whether it is possible to select each line feature and save it to its own shapefile using a loop or a similar approach.
I understand how to do this individually using subset (example below), but repeating the process 115 times seems like a waste of time and of the power of R.
I am including an example of the data below:
trailname <- ("trail1", "trail2", "trail3")
trailtype <- ("mountain", "flat", "hilly")
parking <- ("no", "yes", "no")
shapelength <- ("835", "5728", "367")
trails <- data.frame(accessname, trailtype, parking, shapelength)
Here is a single subset example:
trail1 <- subset(trails, trailname == "trail1")
I would like to select each trail and save it out under the name that appears in the "trailname" column, i.e., trail1.shp.
In base R, couldn't you use the assign function in a for loop to do this?
trailname <- c("trail1", "trail2", "trail3")
trailtype <- c("mountain", "flat", "hilly")
parking <- c("no", "yes", "no")
shapelength <- c("835", "5728", "367")
trails <- data.frame(trailname, trailtype, parking, shapelength)
for(i in 1:nrow(trails)){
  name <- as.character(trails$trailname[[i]])
  assign(name, subset(trails, trailname == trails$trailname[[i]]))
}
EDITED TO ANSWER OP'S COMMENT
This should be doable with a few tweaks. One thing to note is that the example you provided is a data frame, while the writeOGR function takes...
SpatialPointsDataFrame, SpatialLinesDataFrame, or SpatialPolygonsDataFrame objects as defined in the sp package.
These types of objects contain a data frame, but also other attributes that are likely of interest. Let's assume your data is in one of these accepted types. I'll use the rgdal cities data as an example. If all we care about is saving the files outside of our R session, then skip the assign function and pass the subset straight into writeOGR:
library('rgdal')

# load the example data shipped with rgdal
cities <- readOGR(system.file("vectors", package = "rgdal")[1], "cities")

# take only the first two rows for this example
shap <- cities[1:2, ]

# where to save the files; this uses the current working directory
location <- getwd()

for(i in 1:nrow(shap)){
  # name of the output file
  name <- as.character(shap$NAME[[i]])
  # change shap to your 'SpatialPointsDataFrame'
  writeOGR(subset(shap, NAME == shap$NAME[[i]]), location, name, driver = "ESRI Shapefile")
}
There is an R package called ShapePattern; look up the function shpsplitter, which seems to do what you want. Otherwise you can do it in other GIS software; see https://gis.stackexchange.com/questions/25709/splitting-shapefile-into-separate-files-for-each-feature-using-qgis-gdal-saga
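Since rgdal has since been retired from CRAN, here is a similar sketch using the sf package instead. This is not from the original answers; it assumes the trails layer can be read from a hypothetical trails.shp with a trailname attribute:
library(sf)

# hypothetical input file containing the 115 line features
trails <- st_read("trails.shp")

# write one shapefile per trail, named after the trailname column
for (nm in unique(trails$trailname)) {
  st_write(trails[trails$trailname == nm, ], paste0(nm, ".shp"))
}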

Fixed it. What is the option_description used for in the build_dict function in the dataMeta package in R?

I have a dataset with some 100,000 tweets and their sentiment scores. The original dataset has just two columns: one for the tweets and one for their sentiment scores.
I am trying to build a data dictionary for it using the dataMeta package. Here is the code I have written so far:
# build_linker()/build_dict() come from dataMeta; kable() comes from knitr
library(dataMeta)
library(knitr)

# Data dictionary
var_desc <- c("Sentiment Score 0 for Negative sentences and 4 for Positive sentences", "The tweets collected")
var_type <- c(0, 1)

# Creating the linker data frame
linker <- build_linker(tweets_train, variable_description = var_desc, variable_type = var_type)
linker

# Build the data dictionary
dict <- build_dict(my.data = tweets_train, linker = linker, option_description = NULL, prompt_varopts = FALSE)
kable(dict, format = "html", caption = "Data dictionary for the Training dataset")
My problem is that in the data dictionary I have provided the variable name and the variable description, but the variable options column appears to be printing all 100,000 tweets, which I want to avoid. Is it possible to set that column up manually too? Would the option_description argument of build_dict be of any help here?
I tried to find guidance online, but without success. Here is the link I have followed so far:
https://cran.r-project.org/web/packages/dataMeta/vignettes/dataMeta_Vignette.html
This is the first time I am trying to build a data dictionary, hence the struggle. Any suggestions would be greatly appreciated. Thanks in advance.
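The title says "Fixed it." but no fix is shown, so here is a hedged sketch based on my reading of the dataMeta vignette linked above: variable_type = 1 asks build_linker to enumerate a variable's distinct values as options, while 0 records only its range, so flipping the type for the free-text tweet column should stop the dictionary from listing every tweet. Treat the 0/1 semantics as an assumption to verify against the vignette.
library(dataMeta)
library(knitr)

# Assumption (per the dataMeta vignette): 1 = fixed set of options (distinct values
# are enumerated), 0 = quantitative/free-form (only a range is recorded).
# The sentiment score has the fixed options 0 and 4; the tweet text should not be enumerated.
var_desc <- c("Sentiment Score 0 for Negative sentences and 4 for Positive sentences", "The tweets collected")
var_type <- c(1, 0)

linker <- build_linker(tweets_train, variable_description = var_desc, variable_type = var_type)
dict <- build_dict(my.data = tweets_train, linker = linker, option_description = NULL, prompt_varopts = FALSE)
kable(dict, format = "html", caption = "Data dictionary for the Training dataset")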

Pairing qualitative user data with text-mining results

I have pairs of customer feedback data in a CSV: a column "rec" denoting whether the customer recommended the service they received (1 or 0), and an associated comment column, "comment". I am trying to compare the feedback of customers who recommended the service with that of those who did not.
I have used the tm package to read all the lines of a CSV containing only the comments and do some follow-on text mining on them, which worked:
file_loc <- "C:/Users/..(etc)...file.csv"
x <- read.csv(file_loc, header = TRUE)
require(tm)
fdbk <- Corpus(DataframeSource(x))
Now I am trying to compare the comments of customers who recommend and those who do not by including the "rec" column, but I have not been able to create a corpus from a single column of the CSV. I tried the following:
file_loc <- "C:/Users/..(etc)...file.csv"
x <- read.csv(file_loc, header = TRUE)
require(tm)
fdbk <- Corpus(DataframeSource(x$comment))
But I get an error saying
"Error in if (vectorized && (length <= 0))
stop("vectorized sources must have positive length") :
missing value where TRUE/FALSE needed"
I also tried binding the "rec" codes to the comments after creating a topic model, but certain comments end up being filtered out by the topic function, so the "rec" column is longer than the number of documents in the resulting topic model.
Is this something I can do simply with the tm package? I haven't worked with the qdap package at all, but is it more appropriate here?
... as ben mentioned:
vec <- as.character(x[,"place of comments"])
Corpus(VectorSource(vec))
perhaps adding some customer ID as metadata would be nice...
hth
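To make the recommend / not-recommend comparison the question asks for, a minimal sketch that sticks to Corpus and VectorSource: it splits the comments by the rec column before building two corpora. The file path and column names are the ones from the question.
library(tm)

file_loc <- "C:/Users/..(etc)...file.csv"   # path as given in the question
x <- read.csv(file_loc, header = TRUE, stringsAsFactors = FALSE)

# one corpus per group, so each set of comments can be mined separately
rec_corpus    <- Corpus(VectorSource(x$comment[x$rec == 1]))
nonrec_corpus <- Corpus(VectorSource(x$comment[x$rec == 0]))

rec_corpus
nonrec_corpus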
