Number of rows of matrices must match (see arg 2) - rpart - r

I'm trying to analyse some tennis data and I'm hitting a problem with this code:
library(rpart)
library(rpart.plot)
library(ggplot2)
library(wesanderson)
# read the full dataset; empty cells become NA
train <- read.csv("/ags_test.csv", header = TRUE, na.strings = c("", "NA"))
Please note this is a complete dataset, not one I've cobbled together through the code. All the gaps have NA values in them.
control <- rpart.control(cp = 0.007)
train$res <- as.factor(train$res)
tree <- rpart(res ~ Tournament + Surface + Round + J1Rank + J2Rank + J1Pts + J2Pts + DRank + DPts,
              data = train, control = control, parms = list(split = "gini"))
All good until the last line when it kicks out:
Error in cbind(yval2, yprob, nodeprob) :
number of rows of matrices must match (see arg 2)
The data isn't a massive set; it comprises 17 columns and 50 rows.
Any ideas would be much appreciated.

Turns out the problem is that the data is too certain, i.e. the pros all sit in the same columns and the cons follow a similar structure.
Therefore, there's little for the decision tree to split on.
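If you hit the same error, one way to confirm that diagnosis (a quick sketch against the question's train data frame; Surface is just one example predictor) is to cross-tabulate the response against the predictors:
# class balance of the response, including NAs
table(train$res, useNA = "ifany")
# a predictor that separates the classes perfectly leaves rpart little to split on;
# cross-tabulations against each predictor make this visible
table(train$res, train$Surface)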

Generating 3,000,000 strings of length 11 in R

Apparently if I try this:
# first grab the package
install.packages("stringi")
library(stringi)
# and then try to generate some serious dummy data
my_try <- as.vector(sample(1111111111:99999999999,3000000,replace=T))
R will say NOPE, sorry:
Error: cannot allocate vector of size 736.8 Gb
Should I buy more RAM*?
*this is a joke, but I seriously appreciate any help!
EDIT:
The desired output is a data frame of 20 variables and 3×10^6 rows. Some columns/variables should be strings, some integers, all with lengths ranging from 2 to 12.
The error isn't coming from sampling 3 million values; it's from trying to create a population of tens of billions of values (1111111111:99999999999) from which to sample. If you want 11-digit values from that range, sample from 1:88888888889 and add the offset using
sample(88888888889, 3000000, replace = TRUE) + 11111111110
which yields values from 11111111111 through 99999999999.
There's no need for as.vector at the end; sample() already returns a vector.
P.S. I believe in R-devel the range 1111111111:99999999999 will be stored much more efficiently (basically just the limits), but I don't know if sample() will be modified to work with it that way.
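For the broader goal in the EDIT (a data frame of 20 variables and 3×10^6 rows mixing strings and integers), stringi can generate random strings directly, without materialising a huge population first. A minimal sketch, showing only 3 hypothetical columns of the 20:
library(stringi)
set.seed(42)
n <- 3000000
dummy <- data.frame(
  id    = stri_rand_strings(n, 11, pattern = "[0-9]"),           # 11-character digit strings
  label = stri_rand_strings(n, sample(2:12, n, replace = TRUE)), # lengths from 2 to 12
  score = sample.int(1000000, n, replace = TRUE),                # an integer column
  stringsAsFactors = FALSE
)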

Reverse coding the output for reliability analysis using psych package in R

I am performing a reliability analysis of a psychometric scale which measures user engagement during video game play. The scale has 28 questions, of which eight (highlighted in yellow in a screenshot not reproduced here) need to be reverse coded, since they are opposite to what the scale purports to measure.
library(readxl)
GES <- read_excel("CAGE Datadump Copy.xlsx")
GESpreEngagement <- GES[,11:38] # these 28 columns contain the responses to the 28 survey questions
GESpreEngagementReverseCoded <- GESpreEngagement
# data in columns 9 through 16 needs to be reverse coded, corresponding to the highlighted questions
GESpreEngagementReverseCoded[,c(9:16)] <- 5 - GESpreEngagement[,c(9:16)]
# calculate Cronbach's alpha
psych::alpha(GESpreEngagementReverseCoded, check.keys = TRUE)
This code produces the following warning (with or without check.keys = TRUE):
In psych::alpha(GESpreEngagementReverseCoded, check.keys = TRUE) :
Some items were negatively correlated with total scale and were
automatically reversed. This is indicated by a negative sign for the
variable name.
However, there is no such warning if I do not reverse code at all, i.e. if I just run psych::alpha(GESpreEngagement). It seems logical to reverse code these items, but R is telling me otherwise. What should I do in this case?
I also ran into this issue recently and it was driving me crazy. In my case I later found out that the raw data had already been reverse-coded before I imported it, which is why reversing the items again produced this warning. This is pretty late, but I hope it's been sorted out!
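A quick way to check whether items arrive already reversed (a sketch using the question's objects, nothing beyond base R): compute each item's correlation with the total score before any recoding. Items that genuinely need reversing show negative item-total correlations.
totals <- rowSums(GESpreEngagement, na.rm = TRUE)
# negative values here flag items that still need reverse coding
sapply(GESpreEngagement, function(x) cor(x, totals, use = "pairwise.complete.obs"))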

Using R to cluster based on Euclidean distance and complete linkage, too many vectors?

I am trying to figure out how to read a counts matrix into R and then cluster based on Euclidean distance with complete linkage. The original matrix has 56,000 rows (genes) and 7 columns (treatments), and I want to see whether there is a clustering relationship between the treatments. However, every time I try this, I first get the error: Error: cannot allocate vector of size 544.4 Gb. Since I'm trying to reproduce work that has been published by someone else, I am wondering if I am making a mistake with my initial data entry.
Second, if I try such clustering with just 20 of the 56,000 genes, I am able to make a clustering dendrogram, but the branches are not my experimental samples. The paper I am trying to replicate did such clustering, with the resulting dendrogram displaying clustered samples.
Here is the code I am trying to run:
exprs <- as.matrix(read.table("small_RMA_table.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
eucl_dist=dist(matrix(exprs),method = 'euclidean')
hie_clust=hclust(eucl_dist,method = 'complete')
plot(hie_clust)
And here is a sample of my data table:
AGS KATOIII MKN45 N87 SNU1 SNU5 SNU16
1_DDR1 11.18467721 11.91358171 11.81568242 11.08565284 8.054326631 12.46899188 10.54972491
2_RFC2 9.19869822 9.609015734 8.925772678 8.3641799 8.550993726 10.32160527 9.421779056
3_HSPA6 6.455324139 6.088320986 7.949175048 6.128573129 6.113793411 6.317460116 7.726657567
4_PAX8 8.511225092 8.719103196 8.706242048 8.705618546 8.696547633 9.292782564 8.710369119
5_GUCA1A 3.773404228 3.797729793 3.574286779 3.848753216 3.684193193 3.66065606 3.88239872
6_UBA7 6.477543321 6.631538303 6.506133756 6.433793116 6.145507918 6.92197071 6.479113995
7_THRA 6.263090367 6.507397854 6.896879084 6.696356125 6.243160864 6.936051147 6.444444498
8_PTPN21 6.88050894 6.342007735 6.55408163 6.099950167 5.836763044 5.904301086 6.097067306
9_CCL5 6.197989448 4.00619542 4.445053893 7.350765625 3.892650264 7.140038596 4.123639647
10_CYP2E1 4.379433632 4.867741561 4.719912827 4.547433566 6.530890968 4.187701905 4.453267508
11_EPHB3 6.655231606 7.984278173 7.025962652 7.111129175 6.246989328 6.169529157 6.546374446
12_ESRRA 8.675023046 9.270153715 8.948209029 9.412638347 9.4470612 9.98312055 9.534236722
13_CYP2A6 6.834018146 7.18386746 6.826740822 7.244411918 6.744588768 6.715122111 7.302922762
14_SCARB1 8.856802264 8.962211232 8.975200168 9.710291176 9.120002571 10.29588004 10.55749325
15_TTLL12 8.659539601 9.93935462 8.309244963 9.21145716 9.792647852 10.46958091 10.51879844
16_LINC00152 5.108632654 4.906321384 4.958158343 5.315532543 5.456138001 5.242577092 5.180295902
17_WFDC2 5.595843025 5.590991341 5.776102664 5.622086284 5.273603946 5.304240608 5.573746302
18_MAPK1 6.970036434 5.739881305 4.927993642 5.807358161 7.368137365 6.17697538 5.985006279
19_MAPK1 8.333269232 8.758733916 7.855324572 9.03596893 7.808283302 7.675434022 7.450262521
20_ADAM32 4.075355477 4.216259982 4.653654879 4.250333684 4.648194266 4.250333684 4.114286071
The rows describe genes (Ex., 1_DDR1, 2_RFC2, etc.) and the columns are experimental samples (Ex. AGS, KATOIII). I wish to see the relatedness of the samples in the cluster.
Here is the dendrogram my code produces (image not shown); I thought it would only show 7 branches reflecting my 7 samples.
The paper's dendrogram (including these samples and many more, also not shown) clusters by sample.
Thanks for any help you can provide!
You're running out of RAM. That's it. You can't allocate a vector that exceeds your memory space. Move to a computer with more memory, or maybe try the bigmemory package (I've never tried it).
https://support.bioconductor.org/p/53848/
In case anybody was wondering, the answer to my second question is below. I was calling matrix() on what was already a matrix, which flattened it into a single column and made dist() treat every value as its own row. The following code works now!
exprs <- as.matrix(read.table("small_RMA_table.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
eucl_dist=dist(exprs,method = 'euclidean')
hie_clust=hclust(eucl_dist,method = 'complete')
plot(hie_clust)
Do you want to cluster on columns (detect similarities between treatments) or on rows (detect similarities between genes)? It sounds like you want the former, given that you're expecting 7 dendrogram branches for 7 treatments.
If so, then you need to transpose your dataset. dist computes a distance matrix for rows, not columns, which is not what you want.
Once you've done the transpose, your clustering should take no time at all, and minimal memory.
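Concretely, a minimal sketch of the transposed clustering, reusing the question's exprs object:
# t(exprs) makes the 7 samples the rows, so dist() compares samples
eucl_dist <- dist(t(exprs), method = "euclidean")  # 7 x 7 distances
hie_clust <- hclust(eucl_dist, method = "complete")
plot(hie_clust)  # one leaf per sample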

Attributing row name of irregular number of rows (populations)

The GENELAND tutorial gives this command to assign population names to a dataset of three populations of 60 individuals each:
pop.mbrship1<-rep(c(1,2,3), each=60)
Nevertheless, my dataset comprises 10 populations of irregular sizes, to which I would give the names 1 through 10, and the distribution of my individuals (one row each) would be:
1:24, 25:39, 40:58, 59:79, 80:103, 104:126, 127:147, 148:171, 172:191, 192:214
I'd be tempted to use each population's size as the number of repeats, which would make it
pop.mbrship1<-rep[c(1,2,3,4,5,6,7,8,9,10), each=c(24,15,19,21,24,23,21,24,20,23)]
Or try their distribution...
pop.mbrship1<-rep[c(1,2,3,4,5,6,7,8,9,10),
c(1:24,25:39,40:58,59:79,80:103,104:126,127:147,148:171,172:191,192:214)]
In both cases, R gives me Error: unexpected '>' in ">"
I'm sure I'm really close to having it work, but I've spent a shameful amount of time on this and I'd definitely need a hand. Thanks a lot!
I'm looking at the GENELAND tutorial and I see that the lines you're copying/editing have > at the beginning.
You are copying everything including the console prompt >. All you need to copy/paste is:
# replicates each element 60 times
pop.mbrship1 <- rep(c(1,2,3),each=60)
# replicates each element, respectively
pop.mbrship2 <- rep(c(1,2,3),times=c(60,40,30))
Your answer is what Henrik said above, without the preceding >.
pop.mbrship1 <- rep(c(1,2,3,4,5,6,7,8,9,10), c(24,15,19,21,24,23,21,24,20,23))
# same as
pop.mbrship1 <- rep(c(...),times=c(...))
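A quick sanity check of that call, with the sizes from the question:
sizes <- c(24,15,19,21,24,23,21,24,20,23)
pop.mbrship1 <- rep(1:10, times = sizes)
length(pop.mbrship1)  # 214, one entry per individual
table(pop.mbrship1)   # recovers each population's size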

Extract raw data from affyBatch object

I have an affyBatch object with gene expression data. The data is read in using
dat <- ReadAffy()
with no options. I then extract the 5600 genes that I am interested in using,
dat <- RemoveProbes(listOutProbeSets, cdfpackagename, probepackagename)
I then normalise the expression data using
dat.rma <- rma(dat)
Now I want to export the raw data AND the rma-normalised data to .csv files. Inspecting the data, I find that exprs(dat) has dimensions 226576 by 30 and dat.rma has dimensions 5600 by 30. How do I extract the 5600 by 30 matrix of the RAW expression values? I don't know where the 226576 rows in the raw data have come from!
I'm a bit of a beginner with bioconductor data structures! Sorry for not providing runnable example code - not sure how I would do that in this case.
During the transformation from raw to rma-normalised data you have, among other things, combined/summarised low-level probe intensity values into probe-set values (which map to genes). This explains why a raw AffyBatch object has more features than an ExpressionSet instance (created by the rma function). Also, depending on the chip you have, there are several perfect match (PM) and mismatch (MM) probes per probeset, which boosts the number of probes per probeset. The probe -> probeset mapping is defined in the chip definition file and handled automatically.
A few additional thoughts, though. Removing probes before normalisation might not be a good idea. One assumption when performing normalisation is that most of your 'genes' do not change, so keeping only 'those of interest' might break this, depending on what 'those of interest' means, of course. You can always do your filtering on the ExpressionSet, after normalisation:
> library(affydata)
> data(Dilution) ## gets some test data
> eset <- rma(Dilution) ## rma normalisation
> ps <- featureNames(eset)[1:10] ## grab some probesets of interest
> ps
[1] "100_g_at" "1000_at" "1001_at" "1002_f_at" "1003_s_at" "1004_at"
[7] "1005_at" "1006_at" "1007_s_at" "1008_f_at"
> dim(eset) ## full dataset
Features Samples
12625 4
> dim(eset[ps,]) ## only 10 first probesets of interest
Features Samples
10 4
Hope this helps.
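As for the original export question: exprs() works on both the raw AffyBatch and the normalised ExpressionSet, so both matrices can be written out directly. A sketch using the question's object names (the file names are made up). Note there is no 5600 by 30 matrix of raw values: the raw data live at the probe level, before summarisation into probesets.
# probe-level raw intensities (226576 x 30: individual PM/MM probes)
write.csv(exprs(dat), file = "raw_probe_intensities.csv")
# probeset-level normalised values (5600 x 30)
write.csv(exprs(dat.rma), file = "rma_normalised.csv")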
