Boxplot in octave - graph

I am trying to create a boxplot using boxplot(data) for this sample data:
1,0.3074855004
1,0.5342907151
1,0.1243014226
1,0.8373050862
1,0.2964970712
2,0.2753391378
2,0.0662903741
2,0.7435585174
2,0.141665858
2,0.8710871406
3,0.683215396
3,0.9968826184
3,0.8009274979
3,0.6164554236
3,0.9880523647
4,0.6854059871
4,0.4828904583
4,0.6001796951
4,0.3790802876
4,0.5728325425
I expect to get a graph with four boxes, one per group, but the output currently shows only two.
I have tried following the documentation here
http://octave.sourceforge.net/statistics/function/boxplot.html
but I'm still having trouble getting desired results.
Please help me with the correct syntax for getting a proper boxplot in octave.
Thanks,

Your expectations are wrong. Why would boxplot() assume that the first column is the group number? The documentation for boxplot() says:
DATA is a matrix with one column for each data set, or data is a cell vector with one cell for each data set.
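Your data is neither of these. It is a 20x2 matrix, so boxplot() treats each of its two columns (group number and value) as a data set and draws two boxes.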
Also, why are you even wasting memory by setting it up like that? Why do you have a column just to store the group number? Since each group seems to have the same number of values, you can reshape your second column into a matrix with one column per group:
octave> reshape (data(:,2), 5, 4)
ans =
0.307486 0.275339 0.683215 0.685406
0.534291 0.066290 0.996883 0.482890
0.124301 0.743559 0.800927 0.600180
0.837305 0.141666 0.616455 0.379080
0.296497 0.871087 0.988052 0.572833
or if each group has different number of values, use a cell array:
octave> accumarray (data(:,1), data(:,2), [], @(x) {x})
ans =
{
[1,1] =
0.30749
0.53429
0.12430
0.83731
0.29650
[2,1] =
0.275339
0.066290
0.743559
0.141666
0.871087
[3,1] =
0.68322
0.99688
0.80093
0.61646
0.98805
[4,1] =
0.68541
0.48289
0.60018
0.37908
0.57283
}
Once your data is in a sensible format, boxplot() will work as you expected.

Related

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)

min() does not work as expected

I am trying to get the minimum of a column.
The data has been split into groups using the "abbr" factor. My objective is to return the data in column 2 corresponding to the minimum in the column whose number is passed as the argument. If it helps, this is part of the Coursera introductory R programming course.
The minimum is supposed to be somewhere around 8, but it shows 10.
Please help me here.
Here's the link to the CSV file, which I loaded with read.csv:
https://drive.google.com/file/d/0Bxkj3-FNtxqrLW14MFZCeEl6UGc/view?usp=sharing
best <- function(abbr, outvar){
  ## outcome is a dataframe consisting of a column labelled "State" (one of many)
  ## outvar is the desired column number
  statecol <- split(outcome, outcome$State) ## State is a factor; abbr selects one of its levels
  dislist <- statecol[[abbr]][, 2][statecol[[abbr]][, outvar] ==
                                   min(statecol[[abbr]][, outvar])]
  dislist
}
I think the problem is how missing values are handled. Make sure "Not Available" is read in as NA (via na.strings) and pass na.rm=TRUE to min():
filedata<-read.table(file.choose(),quote='"',sep=",",dec=".",header=TRUE,stringsAsFactors=FALSE, na.strings="Not Available")
f<-function(df,abbr,outVar,na.rm=TRUE){
outlist<-split(df,df["State"])
tempCol<-outlist[[abbr]][outVar]
outlist[[abbr]][,2][which(tempCol==min(tempCol,na.rm=na.rm))]
}
f(filedata,"AK",44)
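As a hedged illustration of why the result could come out as 10 rather than 8: if the column was read as text because of the "Not Available" entries (an assumption about the file, not something confirmed above), min() compares strings rather than numbers, so "10.1" sorts before "8.2". The values below are made up for illustration.

```r
# Made-up values standing in for one outcome column that was read as text
rates_chr <- c("10.1", "8.2", "Not Available", "9.5")

# On character data, min() compares strings, so "10.1" comes before "8.2"
min_chr <- min(rates_chr)

# Convert to numeric; "Not Available" becomes NA (the coercion warning is expected)
rates_num <- suppressWarnings(as.numeric(rates_chr))

# With na.rm = TRUE the NA is ignored and the numeric minimum is returned
min_num <- min(rates_num, na.rm = TRUE)

print(min_chr)
print(min_num)
```

Reading the file with na.strings="Not Available", as in the answer above, avoids the manual conversion step.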

Easiest way to apply series of calculations to similar data frames in R

The following is an example of how I want to treat my data sets. It might be a bit difficult to understand how my data frame is structured, but I hope it makes sense:
First, density must be calculated for columns A, B, and C using raw data from columns ADry, AEthanol, BDry, and so on. (Since these were earlier defined as vectors too, I used the vectors instead of the data frame columns as it was shorter: ADry_1_0 instead of Sample_1_0$ADry_1_0.)
Sample_1_0$ADensi_1_0=(ADry_1_0/(ADry_1_0-AEthanol_1_0))*(peth-pair)+pair
Sample_1_0$BDensi_1_0=(BDry_1_0/(BDry_1_0-BEthanol_1_0))*(peth-pair)+pair
Sample_1_0$CDensi_1_0=(CDry_1_0/(CDry_1_0-CEthanol_1_0))*(peth-pair)+pair
This yields 10 densities each for A, B, and C. What's interesting is the mean density:
Mean_1_0=apply(Sample_1_0[7:9],2,mean)
Next standard deviations are found. We are mainly interested in standard deviations for our raw data columns (ADry and AEthanol), as error propagation calculations are afterwards carried out to find out how the deviations sum up when calculating the densities
StdAfv_1_0=apply(Sample_1_0,2,sd)
Error propagation (same for B and C)
ASd_1_0=(sqrt((sd(Sample_1_0$ADry_1_0)/mean(Sample_1_0$ADry_1_0))^2+(sqrt((sd(Sample_1_0$ADry_1_0)^2+sd(Sample_1_0$AEthanol_1_0)^2))/(mean(Sample_1_0$ADry_1_0)-mean(Sample_1_0$AEthanol_1_0)))^2))*mean(Sample_1_0$ADensi_1_0)
In the end we semi-manually gathered the final information (mean density and its deviation) into a plottable data frame. Some of the code might be a tad long, and maybe we could have achieved the same results with shorter code, but bear with us, we are rookies.
So now to the actual problem:
This was for A_1_0, B_1_0, and C_1_0. We would like to apply the same series of commands to 15 other data frames. The dimensions are the same, and they will be named A_1_1, A_1_2, A_2_0 and so on.
Is it possible to use some kind of loop, or to make a loadable script containing x and y placeholders where we can easily insert A_1_1, for instance?
Thanks in advance. I tried to keep the amount of confusion at a minimum, although it's tough!
Data list
If, instead of individual vectors, you combine the raw data into data frames (or even better, data.tables) and then store the data frames for all runs in a list, as @Gregor suggested, you can use the function below together with lapply.
my_func <- function(dataset, peth, pair){
  library(data.table)
  # Column names by position: columns 1-3 are paired with columns 4-6
  names <- names(dataset)
  # First add the three density columns, then summarise them.
  # The error terms follow the propagation formula given in the question:
  # sqrt(rel_sd_first^2 + rel_sd_difference^2) * mean density
  setDT(dataset)[, `:=` (ADens = (get(names[1])/(get(names[1])-get(names[4])))*(peth-pair)+pair,
                         BDens = (get(names[2])/(get(names[2])-get(names[5])))*(peth-pair)+pair,
                         CDens = (get(names[3])/(get(names[3])-get(names[6])))*(peth-pair)+pair)
  ][, .(ADens_mean = mean(ADens),
        ADens_sd = sd(ADens),
        AErr = sqrt((sd(get(names[1]))/mean(get(names[1])))^2 +
                    (sqrt(sd(get(names[1]))^2 + sd(get(names[4]))^2)/
                       (mean(get(names[1])) - mean(get(names[4]))))^2) * mean(ADens),
        BDens_mean = mean(BDens),
        BDens_sd = sd(BDens),
        BErr = sqrt((sd(get(names[2]))/mean(get(names[2])))^2 +
                    (sqrt(sd(get(names[2]))^2 + sd(get(names[5]))^2)/
                       (mean(get(names[2])) - mean(get(names[5]))))^2) * mean(BDens),
        CDens_mean = mean(CDens),
        CDens_sd = sd(CDens),
        CErr = sqrt((sd(get(names[3]))/mean(get(names[3])))^2 +
                    (sqrt(sd(get(names[3]))^2 + sd(get(names[6]))^2)/
                       (mean(get(names[3])) - mean(get(names[6]))))^2) * mean(CDens))
  ]
}
# One summary row per data set in the list
rbindlist(lapply(list_datasets, my_func, peth = 2, pair = 1))
Now, this assumes that you put your raw vectors into data frames with the columns in the order in which they appeared in your example (and that they are the only columns in the data set). If this is not the case, you may just have to edit the indices in the names[x] calls. If you want a little more flexibility, you could also define a list of lists with the column names for each of your raw data sets, add that as an argument to my_func, and then replace every get(names[x]) with get(list_column_names[x]).
This function should output a data.table with the results for each set of data sets (1-16) in individual rows with 6 columns (ADens_mean, ADens_sd, ...)
NOTE: since there was no actual data to work with, I can't say for sure that this function does exactly what you want, but I think it will be close. This will also require you to install the data.table package.
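The list-plus-lapply idea can also be sketched in base R without data.table. This is a minimal sketch rather than a replacement for the function above: the data are randomly generated stand-ins, only column A is handled, and the names make_sample, samples and density_summary are hypothetical.

```r
set.seed(1)

# Hypothetical stand-in for one raw data frame (10 measurements, column A only)
make_sample <- function() {
  data.frame(ADry = runif(10, 2, 3), AEthanol = runif(10, 0.5, 1))
}

# Store every data frame in one named list instead of separate objects
samples <- list(Sample_1_0 = make_sample(),
                Sample_1_1 = make_sample(),
                Sample_1_2 = make_sample())

# One function holds the calculation and returns a one-row summary
density_summary <- function(df, peth, pair) {
  dens <- (df$ADry / (df$ADry - df$AEthanol)) * (peth - pair) + pair
  data.frame(ADens_mean = mean(dens), ADens_sd = sd(dens))
}

# Apply the function to every data frame and stack the one-row results
results <- do.call(rbind, lapply(samples, density_summary, peth = 2, pair = 1))
print(results)
```

Adding a new run then only means adding one more element to the list; the calculation itself is written once.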

Pairs in R - Re-order variables

I am trying to make a scatter-plot matrix from a data frame (available at http://statweb.stanford.edu/~tibs/ElemStatLearn/). However, the order of the variables is not the one I want, and I would like to ignore the variable train.
Dataframe order:
lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa, train
The order I wish:
lpsa, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45
For the moment, here is my code:
prostate1 <- read.table("C:/Users/.../Desktop/prostate.data")
prostate=as.data.frame.matrix(prostate1)
pairs(prostate, col="purple")
I tried to add the arguments horInd and verInd, but I get the following warnings:
1: "horInd" is not a graphical parameter
2: "verInd" is not a graphical parameter
If anyone could help me, it would really be appreciated.
Try this:
prostate1 <- read.table("C:/Users/.../Desktop/prostate.data")
prostate = as.matrix(prostate1)
prostate.reordered = prostate[, c("lpsa", "lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45")]
pairs(prostate.reordered, col="purple")
The idea is to select the columns you want, in the order you want, using the column names for selection.
Of course, it would probably be even more efficient not to convert the whole data frame into a matrix, but to select only the required columns.
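Selecting the columns directly from the data frame could look like the sketch below. The data frame here is a small random stand-in with the same column names (the real file is not loaded), and all_cols and wanted are hypothetical variable names.

```r
set.seed(42)

# Random stand-in with the same column names as the prostate data
all_cols <- c("lcavol", "lweight", "age", "lbph", "svi",
              "lcp", "gleason", "pgg45", "lpsa", "train")
prostate1 <- as.data.frame(matrix(rnorm(20 * length(all_cols)),
                                  ncol = length(all_cols)))
names(prostate1) <- all_cols

# Desired order: lpsa first, train left out
wanted <- c("lpsa", "lcavol", "lweight", "age", "lbph",
            "svi", "lcp", "gleason", "pgg45")

# Index the data frame by name; no conversion to a matrix is needed
prostate.reordered <- prostate1[, wanted]

# pairs() accepts a data frame directly
pairs(prostate.reordered, col = "purple")
```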

Bandwidth selection using NP package

New to R and having a problem with a very simple task! I have read a few columns of .csv data into R. The variables take values in the natural numbers plus zero, and some values are missing. After trying to use the np package, I have two problems. First, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error "number of regression data and response data do not match". Why do I get this, as I have the same number of elements in each vector?
Second, I would like to treat the data as ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic? It is just a vector of natural numbers and NAs.
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of the following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way the data was subset. Note the difference between the various ways of subsetting: data$x and data[,i] (where i is the column number of x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors unless they take a data = argument, in which case they work with column names.
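The subsetting difference can be checked quickly. In this sketch the data frame dat and its values are made up. complete.cases() is shown as one possible way to drop rows where either variable is missing so that x and y keep the same length; it is not something the thread itself used.

```r
# Made-up data with NAs in different positions
dat <- data.frame(x = c(3, 4, NA, 2, 1), y = c(1.5, NA, 2.0, 0.7, 3.1))

x_vec <- dat$x      # numeric vector
x_col <- dat[, 1]   # numeric vector (single column selected with [ , i])
x_df1 <- dat["x"]   # one-column data frame
x_df2 <- dat[1]     # one-column data frame

print(class(x_vec))
print(class(x_df1))

# Keep only rows where both x and y are present, so the lengths still match
ok <- complete.cases(dat$x, dat$y)
x_clean <- dat$x[ok]
y_clean <- dat$y[ok]

# ordered() on a plain vector gives an ordered factor
x_ord <- ordered(x_clean)
print(x_ord)
```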

Resources