I am trying to make a scatter-plot matrix from a data frame (available at http://statweb.stanford.edu/~tibs/ElemStatLearn/). However, the variables are not in the order I want, and I would like to leave out the variable train.
Dataframe order:
lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa,train
The order I wish:
lpsa, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45
For the moment, here is my code:
prostate1 <- read.table("C:/Users/.../Desktop/prostate.data")
prostate=as.data.frame.matrix(prostate1)
pairs(prostate, col="purple")
I tried to add the arguments horInd and verInd, but I get the following warnings:
1: "horInd" is not a graphical parameter
2: "verInd" is not a graphical parameter
If anyone could help me, it would really be appreciated.
Try this:
prostate1 <- read.table("C:/Users/.../Desktop/prostate.data")
prostate = as.matrix(prostate1)
prostate.reordered = prostate[, c("lpsa", "lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45")]
pairs(prostate.reordered, col="purple")
The idea is to select the columns you want, in the order you want, using the column names for selection.
Of course, it would probably be even more efficient not to convert the whole data frame into a matrix, but only the required columns...
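A minimal sketch of that last remark, selecting the columns straight from the data frame without any matrix conversion. The data here are made-up random numbers standing in for the prostate file, and the `wanted` vector is only an illustrative name:

```r
# Made-up stand-in for the prostate data: same column names, random values
set.seed(1)
vars <- c("lcavol", "lweight", "age", "lbph", "svi",
          "lcp", "gleason", "pgg45", "lpsa", "train")
prostate <- as.data.frame(
  matrix(rnorm(20 * length(vars)), ncol = length(vars),
         dimnames = list(NULL, vars))
)

# Put lpsa first, keep the others in their original order, drop train
wanted <- c("lpsa", setdiff(vars, c("lpsa", "train")))

# Indexing a data frame by name returns a data frame with only those columns
prostate.reordered <- prostate[, wanted]

pairs(prostate.reordered, col = "purple")
```

Because `pairs()` accepts a data frame directly, the `as.matrix()` step is not needed for plotting.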
I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stored in the CA object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as to how to fix this or what exactly I am doing wrong? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)
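As a side note on the conversion loop in the question: calling `as.numeric()` directly on a factor returns the internal level codes, not the values that were read from the file, so the numbers reaching CA may not be the original counts. A small base-R sketch with made-up values illustrates the difference:

```r
# Made-up column read in as a factor, as colClasses = "factor" would do
f <- factor(c("0", "10", "2", "10"))

# Direct conversion gives the position of each value among the sorted levels
codes <- as.numeric(f)

# Going through character first recovers the values as written in the file
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

If the original values are what is wanted, converting via `as.character()` first (or reading the columns as numeric) would be the safer route; whether that also removes the `eigen()` error depends on the data, which is not shown here.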
I have programmed a function that yields percentage values. When I pass the values on to a data frame, the formatting is lost; e.g. 99.90% will turn into 0.9990. Some simple code to illustrate this (1st column is to indicate that I don't want percentages in all columns):
DT = data.frame(matrix(0,nrow=1,ncol=5), stringsAsFactors = FALSE)
DT[1,1] = 100
DT[1,2:5] = percent(c(0.9999,0.4567,0.3256,0.7777))
DT
X1 X2 X3 X4 X5
1 100 0.9999 0.4567 0.3256 0.7777
I noticed that, in order to keep the percentage formatting, I need to format the data frame in advance. However, I can only do so column by column.
This works:
DT[,2] = percent(DT[,2])
DT[,3] = percent(DT[,3])
...
But it is a bit tedious. Unfortunately, both lines below yield the same error:
DT[,c(2,3,4,5)] = percent(DT[,c(2,3,4,5)])
DT[,2:5] = percent(DT[,2:5])
Error in UseMethod("percent") :
no applicable method for 'percent' applied to an object of class "data.frame"
This also doesn't work:
DT[,2:5] = apply(DT[,2:5],MARGIN = 2 ,FUN=percent)
I was therefore wondering whether there is:
a way to keep the percentage formatting of a value when it is passed on to a data frame when, for the latter, it has not been specified in advance what the formatting is; or,
an efficient way to assign the percentage formatting to a large number of columns of a data frame.
As @hrbrmstr pointed out, I have to use scales::percent() to make my code work. To obtain two digits I can use either scales::percent(round(x,4)) or formattable::percent(x, 2). Specifying the namespace (scales or formattable) allows you to use functions from masked packages.
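For the second part of the question (formatting many columns at once), one possible approach is to go through the columns with `lapply()`, which hands each column to the formatter as a plain vector instead of a data frame. The sketch below uses a hypothetical base-R helper, `as_percent()`, as a stand-in so that it runs without extra packages; `scales::percent` could presumably be passed to `lapply()` in the same way.

```r
# Hypothetical helper (not from any package): formats a numeric vector
# as percentage strings with a fixed number of decimals
as_percent <- function(x, digits = 2) {
  sprintf(paste0("%.", digits, "f%%"), 100 * x)
}

DT <- data.frame(matrix(0, nrow = 1, ncol = 5), stringsAsFactors = FALSE)
DT[1, 1] <- 100
DT[1, 2:5] <- c(0.9999, 0.4567, 0.3256, 0.7777)

# lapply() visits each selected column as a vector, so the formatter
# never sees a data.frame; the result is assigned back column by column
DT[2:5] <- lapply(DT[2:5], as_percent)

print(DT)
```

Note that the formatted columns are character vectors afterwards, so this is best done as a last step before display.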
I am trying to create a boxplot, using boxplot(data) for this sample data
1,0.3074855004
1,0.5342907151
1,0.1243014226
1,0.8373050862
1,0.2964970712
2,0.2753391378
2,0.0662903741
2,0.7435585174
2,0.141665858
2,0.8710871406
3,0.683215396
3,0.9968826184
3,0.8009274979
3,0.6164554236
3,0.9880523647
4,0.6854059871
4,0.4828904583
4,0.6001796951
4,0.3790802876
4,0.5728325425
I expect to get a graph with four columns but the output currently only shows two columns. Here is the output
I have tried following the documentation here
http://octave.sourceforge.net/statistics/function/boxplot.html
but I'm still having trouble getting desired results.
Please help me with the correct syntax for getting a proper boxplot in octave.
Thanks,
Your expectations are wrong. Why would boxplot() assume that the first column is the group number? The documentation for boxplot() says:
DATA is a matrix with one column for each data set, or data is a cell vector with one cell for each data set.
Your data is not any of the above.
Also, why are you even wasting memory by setting it up like that? Why do you have a column just to store the group number? Since each group seems to have the same number of values, you can reshape your second column into a matrix with one column per group:
octave> reshape (data(:,2), 5, 4)
ans =
0.307486 0.275339 0.683215 0.685406
0.534291 0.066290 0.996883 0.482890
0.124301 0.743559 0.800927 0.600180
0.837305 0.141666 0.616455 0.379080
0.296497 0.871087 0.988052 0.572833
or, if each group has a different number of values, use a cell array:
octave> accumarray (data(:,1), data(:,2), [], @(x) {x})
ans =
{
[1,1] =
0.30749
0.53429
0.12430
0.83731
0.29650
[2,1] =
0.275339
0.066290
0.743559
0.141666
0.871087
[3,1] =
0.68322
0.99688
0.80093
0.61646
0.98805
[4,1] =
0.68541
0.48289
0.60018
0.37908
0.57283
}
Once your data is in a sensible format, boxplot() will work as you expected.
I am trying to use a custom function inside 'ddply' in order to create a new variable (NormViability) in my data frame, based on values of a pre-existing variable (CelltiterGLO).
The function is meant to create a rescaled (%) value of 'CelltiterGLO' based on the mean 'CelltiterGLO' values at a specific sub-level of the variable 'Concentration_nM' (0.01).
So if the mean of 'CelltiterGLO' at 'Concentration_nM'==0.01 is set as 100, I want to rescale all other values of 'CelltiterGLO' over the levels of other variables ('CTSC', 'Time_h' and 'ExpType').
The normalization function is the following:
normalize.fun = function(CelltiterGLO) {
idx = Concentration_nM==0.01
jnk = mean(CelltiterGLO[idx], na.rm = T)
out = 100*(CelltiterGLO/jnk)
return(out)
}
and this is the code I try to apply to my dataframe:
library("plyr")
df.bis=ddply(df,
.(CTSC, Time_h, ExpType),
transform,
NormViability = normalize.fun(CelltiterGLO))
The code runs, but when I double-check (with aggregate or tapply) whether the mean of 'NormViability' equals 100 at 'Concentration_nM'==0.01, I get different numbers instead. However, if I subset my df by the two levels of 'ExpType', the code returns the correct numbers on each separate subset. I tried making 'ExpType' either character or factor, with similar results. 'ExpType' has two levels/values, "Combinations" and "DoseResponse". I can't figure out why the code does not work on the entire df. I wonder whether this is because the two levels of 'ExpType' do not contain the same number of levels for all the other variables; e.g. one of the levels of 'Time_h' is missing for the level "Combinations" of 'ExpType'.
Thanks very much for your help, and I apologize in advance if the answer is already present on Stack Overflow and I was not able to find it.
Michele
I (the OP) found out that the function used a variable in its body that was missing from its arguments. Simply adding the variable Concentration_nM to the arguments of the custom function solved the problem.
THANKS
m.
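A sketch of what the corrected function might look like, with `Concentration_nM` passed in explicitly. The toy data are made up, the `ref` argument is an illustrative addition, and the grouping is done with base-R `split()`/`lapply()` instead of `ddply` so that the example runs without plyr. In the original `ddply(..., transform, ...)` call the change would presumably be `NormViability = normalize.fun(CelltiterGLO, Concentration_nM)`.

```r
# Corrected function: the concentration vector is an explicit argument,
# so the baseline is computed within whatever group is passed in
normalize.fun <- function(CelltiterGLO, Concentration_nM, ref = 0.01) {
  idx <- Concentration_nM == ref
  baseline <- mean(CelltiterGLO[idx], na.rm = TRUE)
  100 * CelltiterGLO / baseline
}

# Made-up toy data with two groups on very different scales
df <- data.frame(
  ExpType          = rep(c("Combinations", "DoseResponse"), each = 4),
  Concentration_nM = rep(c(0.01, 0.01, 1, 10), times = 2),
  CelltiterGLO     = c(10, 12, 8, 4, 100, 120, 60, 30)
)

# Apply the normalization separately within each group
pieces <- lapply(split(df, df$ExpType), function(d) {
  d$NormViability <- normalize.fun(d$CelltiterGLO, d$Concentration_nM)
  d
})
df.bis <- do.call(rbind, pieces)

# Check: the group means at the reference concentration
ref_rows <- df.bis[df.bis$Concentration_nM == 0.01, ]
check <- aggregate(NormViability ~ ExpType, data = ref_rows, FUN = mean)
print(check)
```

With the original one-argument version, `Concentration_nM` inside the function body was not taken from the current group, which would explain why the per-group means came out wrong on the full data frame.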
New to R and having a problem with a very simple task! I have read a few columns of .csv data into R; the variables take values in the natural numbers plus zero and have missing values. After trying to use the non-parametric package, I have two problems: first, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error that "number of regression data and response data do not match". Why do I get this, given that I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If x is of the following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting the data. Note the difference between the various ways of subsetting: data$x and data[,i] (where i is the column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data), in which case they work with column names.
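The subsetting difference can be checked quickly in base R. The data frame below is made up for illustration, and the last step shows one possible way to make sure x and y lose the same rows to missing values before they are passed on:

```r
# Made-up data with natural numbers, zeros and missing values
data <- data.frame(x = c(3, 4, NA, 2, 0), y = c(1, NA, 5, 2, 7))

# These two forms return a plain numeric vector
v1 <- data$x
v2 <- data[, 1]

# These two forms return a one-column data frame
d1 <- data["x"]
d2 <- data[1]

print(class(v1))
print(class(d1))

# Keep only rows where both x and y are present, so that the two
# vectors have the same length
ok <- complete.cases(data$x, data$y)
x <- data$x[ok]
y <- data$y[ok]
```

After this step `x` and `y` are ordinary vectors of equal length, which is the shape of input the answer above describes.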