How to use image() function to plot the data in R - r

I have a clinical dataset and I would like to plot it using the image() function to see if I can spot the different groups within my data.
The structure of this data is a List of 2: 56 samples and 5000 gene expressions.
When I use image(lung), all I see is a plot of orange color, and I do not see any pattern or group standing out.
Basically, there are four types of clinical conditions in the dataset: Colon cancer (13 samples), smallcell (6 samples), etc.
I wanted to see whether, for instance, "smallcell" with 6 samples has its own pattern compared to the rest of the groups/conditions within this dataset.
load(url("https://github.com/hughng92/dataset/raw/master/lung.RData"))
rownames(lung)
image(lung)
This is all I see:
I am wondering whether it would look different if I combined four separate plots, one for each of the 4 conditions in the data set.
Any tip would be great!

I'd suggest looking at the image output after rearranging the data so that like types sit together. I think I now see some group differences in those gene expression profiles. Specifically, the "Normal" category generally has fewer red bands, although there are a couple where "Normal" is red and the others are not. I think it is interesting, and not particularly surprising, that there appears to be less variability within the Normal columns (in the image) than there is within each of the tumor types. I have a friend who's a molecular biologist who characterizes tumors as "genetic train wrecks":
table( rownames( lung[order(rownames(lung)), ]))
Carcinoid Colon Normal SmallCell
20 13 17 6
------------------
image( lung[order(rownames(lung)), ])
This would give a better indication of the boundaries of the type grouping:
image( lung[order(rownames(lung)), ], xaxt="n")
axis(1, at=(cumsum( table( rownames( lung[order(rownames(lung)), ])))-1)/56 ,
labels=names(table( rownames( lung[order(rownames(lung)), ]))),las=2)
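The same idea can be sketched on synthetic data, with a vertical line drawn at each group boundary. This is only an illustration: the matrix `expr` and its random values are made up, and only the group sizes are taken from the table above. It relies on image() placing the rows of the matrix along the x axis at evenly spaced positions from 0 to 1.

```r
set.seed(1)

# Synthetic stand-in for the lung data: 56 samples (rows) x 200 genes (columns).
groups <- rep(c("Carcinoid", "Colon", "Normal", "SmallCell"),
              times = c(20, 13, 17, 6))
expr <- matrix(rnorm(56 * 200), nrow = 56,
               dimnames = list(sample(groups), NULL))

# Put samples of the same type next to each other.
sorted <- expr[order(rownames(expr)), ]

tab    <- table(rownames(sorted))
counts <- as.vector(tab)
ends   <- cumsum(counts)          # index of the last row in each group
n      <- nrow(sorted)

# Row i is centred at (i - 1) / (n - 1) on the x axis.
mids   <- (ends - counts / 2 - 0.5) / (n - 1)      # centre of each group
bounds <- (ends[-length(ends)] - 0.5) / (n - 1)    # edges between groups

image(sorted, xaxt = "n", yaxt = "n")
axis(1, at = mids, labels = names(tab), las = 2)
abline(v = bounds, lwd = 2)
```

Centring the labels on each group, rather than placing them at the group's last row, tends to make the boundaries easier to read.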

Related

Scatterplot for comparing species abundance

I have a homework question that states the following:
The file “channel_islands_counts_edit.csv” contains survey data on temperate rocky reef fishes from the Channel Islands, collected at many sites over many years. The data has columns for Year, Date, Site, count, and SpeciesName (broken into adults and juveniles). The version of the data that I’ve given you looks at 16 sites over 27 years, with count data for 27 categories of fish. Imagine we’re interested in whether the abundance of different species are correlated across sites (to get a sense for whether species have similar habitat preferences and/or interact with each other), and whether the across-site correlations are consistent over time. To visualize this, make some code that does the following:
For each year, draw a scatterplot that compares the abundance of Hypsypops rubicundus (adults) and the abundance of Paralabrax clathratus (adults) across sites. Feel free to transform the data for plotting purposes, if you think that helps you see any patterns.
I imported my data set, and ran the following code which is giving me 27 plots, with Site as x and Count as y, but there is no data shown in the plots.
head(channel_islands)
sapply(channel_islands, class)
levels(channel_islands$SpeciesName)
par(mfrow= c(6,5)) # set the plotting area into a 6 row*5 column array
for (i in 1:27) {
HR11<-subset(channel_islands,SpeciesName=="Hypsypops rubicundus,adult"[i] & Site==11)
PC15<-subset(channel_islands,SpeciesName=="Paralabrax clathratus,adult"[i] & Site==15)
with(HR11,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='green',main=i))
with(PC15,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='blue',main=i))
}
If anyone could help me figure out how to compare species abundance across sites, over 27 years, I would really appreciate it.
The code "Hypsypops rubicundus,adult"[i] doesn't really make sense. It indexes a character vector of length one, so it works when i == 1, but for any larger i it just returns NA. SpeciesName == NA is never TRUE, so you get an empty subset.
Consider looking into using ggplot2 with facet_grid to quickly make multiple plots without the loop. The R Graphics Cookbook has good examples on using facets.
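A minimal base-R sketch of one way to get one scatterplot per year is shown here. It runs on synthetic data, so the values are made up; the column names (Year, Site, count, SpeciesName) and the species strings are assumed to match the question. It also assumes one count per species, site and year; real survey data with repeated visits would need aggregating first.

```r
set.seed(1)

# Synthetic stand-in for channel_islands: 4 years x 16 sites x 2 species.
species <- c("Hypsypops rubicundus,adult", "Paralabrax clathratus,adult")
channel_islands <- expand.grid(Year = 2001:2004, Site = 1:16,
                               SpeciesName = species,
                               stringsAsFactors = FALSE)
channel_islands$count <- rpois(nrow(channel_islands), lambda = 5)

par(mfrow = c(2, 2))  # one panel per year in this small example
for (yr in sort(unique(channel_islands$Year))) {
  hr <- subset(channel_islands, SpeciesName == species[1] & Year == yr)
  pc <- subset(channel_islands, SpeciesName == species[2] & Year == yr)

  # Pair the two species site by site.
  both <- merge(hr, pc, by = "Site", suffixes = c(".hr", ".pc"))

  # log1p() is an optional transform that copes with zero counts.
  plot(log1p(both$count.hr), log1p(both$count.pc), pch = 19,
       xlab = "log(1 + H. rubicundus)", ylab = "log(1 + P. clathratus)",
       main = yr)
}
```

Each point is one site, so the x and y axes are the two species' abundances rather than Site and count.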

Using a column that contains a frequency/weight/count in R [closed]

This is an easy question to ask, but a hard one to search for. Frequency is used all over the place. I tried a synonym (weight), but since mtcars is so widely used, I get a lot of false negatives as well. Same thing for counts.
I'm looking at datasets::HairEyeColor, partly reproduced here:
    Hair   Eye  Sex Freq
1  Black Brown Male   32
2  Brown Brown Male   53
3    Red Brown Male   10
4  Blond Brown Male    3
5  Black  Blue Male   11
6  Brown  Blue Male   50
7    Red  Blue Male   10
8  Blond  Blue Male   30
9  Black Hazel Male   10
10 Brown Hazel Male   25
...
I came across this when trying to show someone how to make a mosaic plot of any two of Hair, Eye, and Gender. On first read, I didn't see a way to say "this column represents 32 of the set members", but I didn't read too carefully.
I suppose I could reshape the data using melt() and reshape() every time I receive data with a frequency column, but that seems kind of drastic.
In other languages I know, I could add a parameter to the fitting function to let it know "there's not just one row with this set of levels, there are n of them." So if I wanted to see a distribution, I might say
DISTR(Y=Hair, FREQ=freq)
...which would generate a histogram or density plot with n values per row
Alternately,
lm(hair ~ eye + sex, data = ‘HairEyeColor’, freq = ‘freq’)
would fit a linear model with 32 replications of the first row rather than 1.
I’m asking about a way to use the 32 in the first row (for example) to tell the modeling or graphing function that there are 32 cases with this combination of levels, 53 with the combination in the second row, etc.
Surely this kind of data shows up a lot. I see it all the time, but there’s usually a way to say that this number specifies the frequency that this row represents in the actual data. Rather than a data table with 32 rows of Black, Brown, Male, there’s one row with frequency 32.
(No plyr please.)
No, there is not a standard way to use this type of data across all of R.
Many of the basic modeling functions, e.g., lm, glm, nls, loess, and more from the stats package accept a weights argument that will meet your needs. prop.test accepts data in either format. But many other modeling functions do not, e.g., knn, princomp, and many others not in base R.
barplot accepts input in either format. mosaicplot expects input as an aggregated contingency table. Other types of plots would require more custom handling, because there are a lot of different things you could do with frequency.
Of course, anything not in base R is up to whoever writes it.
ggplot2 (which is not base R) generally handles this really well, e.g., geom_bar will stack up values by default, or in the case of scatterplots you could map size or color or alpha to visually convey the intensity.
randomForest and xgboost do not accept weights
I will say that I very rarely find this to be a problem. I'd encourage you to ask specific questions about methods where it is causing you issues. I think mosaicplot is a bad example as it expects a contingency table, so the problem would be the opposite: using it with disaggregated data would require first aggregating it up to a frequency table.
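As a rough sketch of the options mentioned above, using the built-in HairEyeColor data: xtabs() builds the contingency table that mosaicplot expects from a Freq column, rep() expands the rows when a function has no weights argument, and glm() with weights gives the same coefficients as fitting the expanded data. The logistic model of Sex on Hair is only an illustration, not a recommended analysis.

```r
# Frequency-form data: one row per combination of levels, plus a count.
hec <- as.data.frame(HairEyeColor)

# 1. Plots that want a contingency table: rebuild it from the Freq column.
tab <- xtabs(Freq ~ Hair + Eye, data = hec)
mosaicplot(tab, main = "Hair by Eye")

# 2. Functions without a weights argument: expand to one row per case.
long <- hec[rep(seq_len(nrow(hec)), times = hec$Freq),
            c("Hair", "Eye", "Sex")]

# 3. Modeling functions with a weights argument: pass the counts directly.
fit_weighted <- glm(Sex ~ Hair, family = binomial, data = hec,
                    weights = Freq)
fit_expanded <- glm(Sex ~ Hair, family = binomial, data = long)

# The two fits should agree up to numerical tolerance.
all.equal(coef(fit_weighted), coef(fit_expanded), tolerance = 1e-4)
```

Standard errors also agree here because the weights are integer case counts. That would not hold for weights with another meaning, such as precision or sampling weights.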

In R, how can I use different colors for each range in my scatterplot?

I'm trying to plot with R a series of 120 numbers where the first 40 are of one type, the next 40 items are of a second type and the last 40 items are of a third type.
Right now I'm just plotting it as a scatterplot, and it's hard to tell the three sections apart:
data <- read.table("mydata.txt")
plot(data[,1])
Is there a way to distinguish the three sections, as in the following mockup that I made?
You could provide a colour vector if the data are already ordered.
mydata <- runif(120)
plot(mydata, col = rep(rainbow(3), each = 40))
rainbow(3) makes a colour vector of 3 colours, and rep with each = 40 makes 40 copies of each.
Here is a longer answer. It is not as nice as Mark O'Connell's, but I think it has the merit of being more flexible.
data<-data.frame(y=seq(1,1000)+rnorm(1000,0,100),index=seq(1,1000))
data.blue<-data[data$index<200,]
data.green<-data[data$index>=200&data$index<400, ]
data.red<-data[data$index>=400&data$index<600, ]
data.purple<-data[data$index>=600&data$index<1000, ]
plot(data.blue,col='blue',xlim=c(-200,1300),ylim=c(0,1000))
points(data.green$index,data.green$y,col='green')
points(data.red$index,data.red$y,col='red')
points(data.purple$index,data.purple$y,col='purple')
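A third variant, sketched here under the assumption that each point's type is available as a label, looks the colour up from a named vector. This also works when the points are not sorted by type. The names `values`, `type`, and `palette_by_type` are made up for the example.

```r
set.seed(1)

values <- runif(120)
type   <- factor(rep(c("first", "second", "third"), each = 40))

# One colour per type; indexing by name gives one colour per point.
palette_by_type <- c(first = "red", second = "darkgreen", third = "blue")
point_cols <- palette_by_type[as.character(type)]

plot(values, col = point_cols, pch = 19)
legend("topright", legend = names(palette_by_type),
       col = palette_by_type, pch = 19)
```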

HMM text recognition in R depmixS4

I'm wondering how I would use the depmixS4 package for R to run an HMM on a dataset. What functions would I use to get a classification of a test data set?
I have a file of training data, a file of label data, and a file of test data.
The training data consists of 4620 rows. Each row has 1079 values: 83 windows with 13 values per window, in other words 83 states with 13 observations each. Each row of 1079 values is one spoken word, so there are 4620 utterances. In total the data has only 7 distinct words, and each distinct word has 660 different utterances, hence the 4620 rows.
So we have words (0-6).
The label file is a list where each row is labeled 0-6 according to which word it is. For example, row 300 is labeled 2, row 450 is labeled 6, and row 520 is labeled 0.
The test file contains about 5000 rows structured exactly like the training data, except that there are no labels associated with it.
I want to use an HMM trained on the training data to classify the test data.
How would I use depmixS4 to output a classification of my test data?
I'm looking at :
depmix(response, data=NULL, nstates, transition=~1, family=gaussian(),
prior=~1, initdata=NULL, respstart=NULL, trstart=NULL, instart=NULL,
ntimes=NULL,...)
but I don't know what response refers to or any of the other parameters.
Here's a quick, albeit incomplete, test to get you started, if only to familiarize you with the basic outline. Please note that this is a toy example and it merely scratches the surface of HMM design/analysis. The vignette for the depmixS4 package, for instance, offers quite a lot of context and examples. Meanwhile, here's a brief intro.
Let's say that you wanted to investigate if industrial production offers clues about economic recessions. First, let's load the relevant packages and then download the data from the St. Louis Fed:
library(quantmod)
library(depmixS4)
library(TTR)
fred.tickers <-c("INDPRO")
getSymbols(fred.tickers,src="FRED")
Next, transform the data into rolling 1-year percentage changes to reduce noise, and convert the result into data.frame format for analysis in depmixS4:
indpro.1yr <-na.omit(ROC(INDPRO,12))
indpro.1yr.df <-data.frame(indpro.1yr)
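For readers without TTR installed, the transformation can be sketched in base R. With its default type = "continuous", ROC(x, 12) returns the 12-period log difference, which approximates the 1-year percentage change. The series `x` here is synthetic, not the INDPRO data.

```r
set.seed(42)

# Synthetic monthly index, 36 observations (not the real INDPRO series).
x <- cumprod(c(100, 1 + rnorm(35, mean = 0.002, sd = 0.01)))

# 12-month log difference: log(x[t] / x[t - 12]).
roc12 <- diff(log(x), lag = 12)

length(roc12)  # 12 fewer values than x, like na.omit(ROC(x, 12))
```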
Now, let's run a simple HMM model and choose just 2 states--growth and contraction. Note that we're only using industrial production to search for signals:
model <- depmix(response=INDPRO ~ 1,
family = gaussian(),
nstates = 2,
data = indpro.1yr.df ,
transition=~1)
Now let's fit the resulting model, generate posterior states for analysis, and estimate probabilities of recession. Also, we'll bind the data with dates in an xts format for easier viewing/analysis. (Note the use of set.seed(1), which is used to create a replicable starting value to launch the modeling.)
set.seed(1)
model.fit <- fit(model, verbose = FALSE)
model.prob <- posterior(model.fit)
prob.rec <-model.prob[,2]
prob.rec.dates <-xts(prob.rec,
order.by=as.Date(index(indpro.1yr)))
Finally, let's review and ideally plot the data:
head(prob.rec.dates)
[,1]
1920-01-01 1.0000000
1920-02-01 1.0000000
1920-03-01 1.0000000
1920-04-01 0.9991880
1920-05-01 0.9999549
1920-06-01 0.9739622
High values (>0.80 ??) indicate/suggest that the economy is in recession/contraction.
Again, a very, very basic introduction, perhaps too basic. Hope it helps.

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to compute:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
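Here is a small self-contained sketch of the column extraction, reading the sample rows from the question out of a string instead of foo.txt. It also shows why tbl$first did not work: there is no column with that name, so the result is NULL.

```r
# The sample data from the question, read from text rather than a file.
tbl <- read.csv(text = "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0", header = TRUE)

# Columns are referred to by their names in the header, not by position words.
area   <- tbl$area
energy <- tbl[["energy"]]

is.null(tbl$first)           # TRUE: no column is called "first"
identical(area, tbl[, 1])    # TRUE: the three access styles are equivalent
```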
