Extract raw data from an AffyBatch object in R

I have an affyBatch object with gene expression data. The data is read in using
dat <- ReadAffy()
with no options. I then extract the 5600 genes that I am interested in using,
dat <- RemoveProbes(listOutProbeSets, cdfpackagename, probepackagename)
I then normalise the expression data using
dat.rma <- rma(dat)
Now I want to export the raw data AND the rma-normalised data to .csv files. Inspecting the data, I find that exprs(dat) has dimensions 226576 by 30 while dat.rma has dimensions 5600 by 30. How do I extract the 5600 by 30 matrix of the RAW expression values? I don't know where the 226576 rows in the raw data have come from!
I'm a bit of a beginner with bioconductor data structures! Sorry for not providing runnable example code - not sure how I would do that in this case.

During the transformation from raw to rma-normalised data you have, among other things, combined/summarised low-level probe intensity values into probe-set values (which map to genes). This explains why you have more features in a raw AffyBatch object than in an ExpressionSet instance (created by the rma function). Also, depending on the chip you have, there are several perfect-match (PM) and mismatch (MM) probes per probe set, which boosts the number of probes per probe set. The mapping probe -> probe set is defined in the chip definition file and handled automatically.
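A quick way to see this probe-level structure is with the affy accessors pm() and mm(); a minimal sketch using the affydata example data (not your chip):
library(affy)
library(affydata)
data(Dilution) ## small example AffyBatch
dim(exprs(Dilution)) ## all probe-level intensities (PM and MM cells)
dim(pm(Dilution)) ## perfect-match probes only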
A few additional thoughts though. Removing probes before normalisation might not be a good thing to do. One assumption when performing normalisation is that most of your 'genes' do not change, so keeping only 'those of interest' might break this, depending of course on what 'those of interest' means. You can always do your filtering on the ExpressionSet, after normalisation:
> library(affydata)
> data(Dilution) ## gets some test data
> eset <- rma(Dilution) ## rma normalisation
> ps <- featureNames(eset)[1:10] ## gets some probesets of interest
> ps
[1] "100_g_at" "1000_at" "1001_at" "1002_f_at" "1003_s_at" "1004_at"
[7] "1005_at" "1006_at" "1007_s_at" "1008_f_at"
> dim(eset) ## full dataset
Features Samples
12625 4
> dim(eset[ps,]) ## only 10 first probesets of interest
Features Samples
10 4
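As for the export itself, exprs() works on both objects, so something like the following should do; a sketch assuming the dat and dat.rma objects from your question (the file names are made up):
write.csv(exprs(dat), file = "raw_probe_level.csv") ## 226576 x 30, probe level
write.csv(exprs(dat.rma), file = "rma_probesets.csv") ## 5600 x 30, probe-set level
Note that there is no 5600 by 30 matrix of raw values to extract: the 5600 probe sets only come into existence when rma() summarises the probe-level data.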
Hope this helps.


How can I fit my data and question in a script for the ace function of the ape package in RStudio?

I have 96 amino acid sequences which I aligned with MAFFT and trimmed manually (FASTA format), chose the model of amino acid substitution with ProtTest (LG+I+G model), did the phylogenetic reconstruction with MEGAX (ML method, bootstrap test with 1000 replicates, tree in Newick format) and the ancestral reconstruction with PAML, for a total of 664 final amino acid positions. However, my alignment has indels. I am naming each indel with a letter (A to T) and the respective amino acid position ranges: A:89-92, B:66-67, C:181-186, D:208-208, E:214-219, F:244-250, G:237-296, H:278-280, I:295-295, J:329-334, K:345-349, L:371-375, M:390-425, N:432-433, O:440-443, P:480-480, Q:500-500, R:541-544, S:600-600. Both the initial and final parts of the sequences are very variable, so from positions 0 to 34 (initial) and 600 to 664 (final), each amino acid position may represent an indel.
I want to know, at each ancestral node, the probability that each indel is present in the ancestral sequence. I was told that the "ace" function in the R package "ape - analysis of phylogenetics and evolution" can perform this task. I have installed both "ape" and "ggtree". I checked this webpage https://www.rdocumentation.org/packages/ape/versions/3.0-1/topics/ace, however, I have no idea how to construct the script. I am a biologist and a newbie to R.
Can someone please help? Would be greatly appreciated, thanks.
It's hard to figure out exactly what you'll need from your example, but the following could fit the general idea:
1 - Load your tree in R
For this step you can use the function read.tree or read.nexus, depending on your tree format: i.e. whether your phylogenetic software outputs a NEXUS file (usually the first line in these files is #NEXUS and the last line is end; or END;) or a Newick output (usually the first line directly starts with the phylogeny, like ((my_species..., and finishes with ;). You can locate this file and then read it into R using:
## Loading the package
library(ape)
## Reading the tree
my_tree <- read.tree("<the_path_to_your_file>")
2 - Load your trait data in R
You will then need to load your trait data (for example the indel positions you've listed above) as a matrix or a data.frame. The easiest is to have them in .csv format ("comma-separated values"), which you can then read into R using the function read.csv:
## Reading the variables as a matrix
my_variables <- read.csv("<the_path_to_your_file>")
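One practical gotcha: ace takes the trait values in the same order as the tree's tip labels. Assuming (and this is an assumption about your file) that the first column of your .csv holds the tip names, you can read it with row.names = 1 instead and reorder the rows to match the tree:
## Use the first column as row names, then reorder rows to match the tree
my_variables <- read.csv("<the_path_to_your_file>", row.names = 1)
my_variables <- my_variables[my_tree$tip.label, ]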
3 - Running an ancestral character estimation
And finally you can run your ancestral character estimation for each of your variables using the ace function from the package ape:
## Selecting the variable of interest (e.g. the first column of the dataset)
one_variable <- my_variables[, 1]
## Running the ancestral character estimation for this variable
my_ace <- ace(x = one_variable, phy = my_tree, type = "discrete")
## Looking at the results
my_ace
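Since the question is about the probability that each indel is present at each node: for a discrete trait, the fitted object stores the scaled likelihoods of each state at every internal node, so (assuming a two-state absent/present coding) these are the per-node probabilities you are after:
## one row per internal node, one column per state (e.g. absent/present)
my_ace$lik.anc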
Of course there is much more to it, but hopefully this will get you started.

Generating 3,000,000 strings of length 11 in R

Apparently if I try this:
# first grab the package
install.packages("stringi")
library(stringi)
# and then try to generate some serious dummy data
my_try <- as.vector(sample(1111111111:99999999999,3000000,replace=T))
R will say NOPE, sorry:
Error: cannot allocate vector of size 736.8 Gb
Should I buy more RAM*?
*this is a joke, but I seriously appreciate any help!
EDIT:
The desired output is a data frame of 20 variables and 3x10^6 rows. Some columns/variables should be strings, some integers, all with lengths ranging from 2 to 12.
The error isn't coming from sampling 3 million values; it's from trying to materialise the whole population of roughly 99 billion values in 1111111111:99999999999 before sampling from it (about 99 billion values at 8 bytes each is the 736.8 Gb in the error). You don't need that population: since you want 11-digit numbers, sample from the range 1:88888888889 and add 11111111110 using
sample(88888888889, 3000000, replace=TRUE) + 11111111110
There's no need for as.vector at the end; it's already a vector.
P.S. I believe in R-devel the range 1111111111:99999999999 will be stored much more efficiently (basically just the limits), but I don't know if sample() will be modified to work with it that way.
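Since stringi is already loaded and the EDIT says the columns should end up as strings anyway, a hedged alternative is to skip the numeric range entirely and generate the character strings directly (note that stri_rand_strings can produce leading zeros, which may or may not matter here):
library(stringi)
## 3,000,000 random digit strings of length 11
my_try <- stri_rand_strings(3000000, 11, pattern = "[0-9]")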

R package spatstat: how to use a point process model covariate as a factor, starting from a shapefile

I have a question similar to this one from 2014, which was answered but the datasets are no longer available and our original data structures differ. (I'm in crunch time and stumped, so if you're able to respond quickly I would greatly appreciate it!!)
Goal: use the type of bedrock as a covariate in a Point Process Model (ppm) in spatstat with mine locations in Connecticut.
Data: the files are available from this Dropbox folder. The rock data and CT poly outline comes from UConn Magic Library, and the mine data comes from the USGS Mineral Resources Data System.
Approach:
I loaded some relevant packages and read in the shapefiles (and converted coords to match CT's system), and used the CT polygon as an owin object.
library(rgdal)
library(splancs)
library(spatstat)
library(sp)
library(raster)
library(geostatsp)
#read in shapefiles
ct <- readOGR(".", "CONNECTICUT_STATE_POLY")
mrds <- readOGR(".", "mrds-2017-02-20-23-30-58")
rock <- readOGR(".", "bedrockpolyct_37800_0000_2000_s50_ctgnhs_1_shp_wgs84")
#convert mrds and rock to ct's coord system
tempcrs <- ct@proj4string
mrds <- spTransform(mrds, tempcrs)
rock <- spTransform(rock, tempcrs)
#turn ct shapefile into owin, call it w for window
w <- as.owin(ct)
#subset mrds data to just CT mines
mrdsCT <- subset(mrds, mrds@data$state == "Connecticut")
#ppm can't handle marked data yet, so need to unmark()
#create ppp object for mrds data, set window to w
mrdsCT.ppp <- as.ppp(mrdsCT)
Window(mrdsCT.ppp) <- w
From "Modelling Spatial Point Patterns in R" by Baddeley & Turner (page 39):
Unfortunately a pixel image in spatstat cannot have categorical (factor) values, because R refuses to create a factor-valued matrix. In order to represent a categorical variate as a pixel image, the categorical values should be encoded as integers (for efficiency’s sake) and assigned to an integer-valued pixel image. Then the model formula should invoke the factor command on this image. For example if fim is an image with integer values which represent levels of a factor, then:
ppm(X, ~factor(f), Poisson(), covariates=list(f=fim))
There are several different types of rock classification included in the shapefile. I'm interested in LITHO1, which is a factor with 27 levels. It's the sixth attribute.
litho1<-rock[,6]
My (limited but researched) understanding is that I need to convert the shapefile to a raster, and later convert it to an image in order to be used in ppm. I created a mask from ct, and used that.
ctmask <- raster(ct, resolution = 2000)
ctmask[!is.na(ctmask)] <- 0
litho1rast <- rasterize(litho1, ctmask)
After this point, I've tried several approaches and haven't had success just yet. I've attempted to follow the approaches laid out in the question linked, as well as searched the documentation for relevant examples to adapt (factor, ratify, levels). Unlike the prior question, my data was already a factor, so it wasn't clear why I should apply the factor function to it.
Looking at litho1rast, the @data@attributes data frame contains the following. If I plot it, it just plots the ID; the levelplot function does plot LITHO1. When I apply the factor functions, the ID is retained but not LITHO1.
$ ID : int [1:1891] 1 2 3 4 5 6 7 8 9 10 ...
$ LITHO1: Factor w/ 27 levels "amphibolite",..: 23 16 23 16 23 16 24 23 16 24 ...
The ppm model needs an object of class im, so I converted the raster to an im. I tried two ways. I can make ppm execute... but it treats every polygon ID as its own factor level rather than using the 27 LITHO1 levels (with either litho1.im or litho1.im2)...
litho1.im <- as.im(litho1rast)
litho1.im2 <- as.im.RasterLayer(litho1rast)
model1 <- ppm(unmark(mrdsCT.ppp) ~ factor(COV1), covariates = list(COV1 = litho1.im))
model1
So, I'm not quite sure where to go from here. It seems like I need to pass an argument to the as.im so that it knows to retain the LITHO1 not the ID. Clever ideas or leads to pertinent functions or approaches much appreciated!
The quoted statement from Baddeley & Turner is no longer true --- that quotation is from a very old set of workshop notes.
Pixel images of class im can have factor values (since 2011). If Z is an integer-valued pixel image (of class im), you can make it into a factor-valued image by setting levels(Z) <- lev where lev is the character vector of labels for the possible values.
You should not need to use rasterize: it should be possible to convert rock[,6] directly into a pixel image using as.im (after loading the maptools package).
See the book by Baddeley, Rubak and Turner (Spatial point patterns: methodology and applications with R, CRC Press, 2016) for a full explanation.
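For instance, a minimal sketch along those lines, assuming the integer codes in litho1rast follow the order of the LITHO1 factor levels (worth verifying against the raster's attribute table):
litho1.im <- as.im(litho1rast) ## integer-valued im
levels(litho1.im) <- levels(litho1$LITHO1) ## now a factor-valued im
model1 <- ppm(unmark(mrdsCT.ppp) ~ COV1, covariates = list(COV1 = litho1.im))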
Looking at your code, you don't seem to be providing the field argument to rasterize.
From rasterize help:
field: numeric or character. The value(s) to be transferred. This can be a single number, or a vector of numbers that has the same length as the number of spatial features (points, lines, polygons). If x is a Spatial*DataFrame, this can be the column name of the variable to be transferred. If missing, the attribute index is used (i.e. numbers from 1 to the number of features). You can also provide a vector with the same length as the number of spatial features, or a matrix where the number of rows matches the number of spatial features.
at this line:
litho1rast <- rasterize(litho1, ctmask)
you probably have to specify which column of the litho object to use in rasterization. Something like:
litho1rast <- rasterize(litho1, ctmask, field = "LITHO1")

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable names and the third column contains the score between them. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0,1]. The file is approximately 50 GB. The pairs (A,B) whose score is 1 have been removed, as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix, just stored in sparse pairwise form.
However, hierarchical clustering usually scales as O(n^3). That won't work with a data set of your size. Plus, implementations usually need more than one copy of the matrix in memory; with two copies of a dense 250,000 x 250,000 matrix of 8-byte doubles, that is 2 * 8 * 250000 * 250000 bytes, about 1 TB of RAM.
Some special cases can run in O(n^2): SLINK (single linkage) does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.
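To make the single-pass idea concrete: single-link clustering is equivalent to building a minimum spanning tree, so you can stream the pairs in order of increasing score and merge clusters with a union-find structure. A sketch of the idea in R on a toy edges data frame (illustrative only, treating the score as a distance; at 50 GB you would implement this streaming pass in a compiled language, as said above):
## edges: data.frame(a, b, score) with integer variable IDs 1..n,
## already sorted by increasing score (hypothetical input)
single_link_merges <- function(edges, n) {
  parent <- seq_len(n)
  find <- function(i) { ## find cluster root, with path halving
    while (parent[i] != i) {
      parent[i] <<- parent[parent[i]]
      i <- parent[i]
    }
    i
  }
  heights <- numeric(0)
  for (k in seq_len(nrow(edges))) {
    ra <- find(edges$a[k]); rb <- find(edges$b[k])
    if (ra != rb) { ## first edge joining two clusters
      parent[ra] <- rb ## = a single-link merge at this score
      heights <- c(heights, edges$score[k])
    }
  }
  heights ## dendrogram merge heights, in order
}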

Calculate percentage over time on very large data frames

I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this; it's at least 20M observations I'm retrieving from a Mongo database. Time is epoch and we're recording many thousands of observations per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs, I only care about the top 50. The end result would be a line plot over time of the % of TCP_HIT relative to total occurrences, by URI. Hope that's clearer.
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset, in ten-second intervals (using the order of the factor levels, TCP_HIT=1 and TCP_MISS=2, since alphabetical order is used by default):
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
                               FUN = function(x) sum((2 - as.numeric(x)) / length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
       function(l) list(uri = levels(u$uri)[l],
                        hits = ratio(u[as.numeric(u$uri) == l, ])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
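For the "top 50 URIs" part, one way (taking "top" to mean most frequent, which is an assumption) is to filter before computing the ratios:
## keep only the 50 most frequent URIs
top50 <- names(sort(table(u$uri), decreasing = TRUE))[1:50]
u50 <- droplevels(u[u$uri %in% top50, ])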
@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
You can see why visually here: data.table is just faster.
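A sketch of the same per-URI, per-interval hit ratio in data.table, assuming the u data frame from above (hit_pct and interval are names made up for this example):
library(data.table)
dt <- as.data.table(u)
hits <- dt[, .(hit_pct = mean(action == "TCP_HIT")),
           by = .(uri, interval = time %/% 10)]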
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine through here, but this is what I came up with. It makes a basic line plot of the actual values; I haven't done any conversions.
## h is the list returned by the lapply above: one (uri, hits) pair per URI
for (i in seq_along(h)) {
  name <- unlist(h[[i]][1]) ## the URI
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2])) ## its time/ratio table
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type = "o")
  title(main = name)
}
