How to solve 'x' must be numeric in r? - r

I'm trying to make a histogram but I keep running into the an error message.
Here is my code
library(readxl)
data <- read_excel("data.xls")
hist(data)
This is my sample. I want to create a histogram the y-axis be 0-100, x-axis (safely, basic, limited, etc) the numbers (39,29,8,12,12) be in the graph. Does this help make sense?
Safely Basic Limited Unimproved Open
39 29 8 12 12
Error in hist.default(data) : 'x' must be numeric
What am I doing wrong? I don't know understand the error message.

In your case data is not a variable, but a dataframe that contains variables.
You can take the histogram of each single variable like this:
library(readxl)
data <- read_excel("data.xls")
If you want to look at the histogram of variable Safely:
hist(data$Safely)
You can access each variable contained in data in the same way.

The issue is that you are passing a dataframe to the hist() function, when it requires a vector for its argument x (see ?hist). Based on your edited post, you would want:
hist(as.numeric(data[1,]))
Where data[1,] creates a vector from the first row of your dataframe.
Though it seems like you may actually be looking for a bar plot. In that case, try:
plot_data <- data.frame(t(data)) %>%
tibble::rownames_to_column()
ggplot(plot_data,aes(x = rowname,y=t.data.)) +
stat_identity(geom = "bar")
From #user2554330, a simpler base graphics method:
f <- as.numeric(data[1,])
names(f) <- names(data)
barplot(f)

Related

Wrong Tukey-letter ordering in R multcompView package

I do not understand letter ordering in the function multcompletters from multcompView. According to documentation it should be according to mean of the group. In the following example, the middle group got c (from abc) and should have got b. Is this a bug?
require(multcompView)
# Data
datacol <- c(21.1,20.2,21.8,20.9,23.3,21.1,20.2,21.8,20.9,23.3,19.8,16.4,
16.9,16.0,17.6,17.5,16.9,13.3,18.0,17.6,13.5,12.2,15.2,15.1,15.2,14.0)
# Group
faccol <- c(rep(c(1,2),each=10),rep(3,6))
# Combined Dataframe
tukeyset <- data.frame(datacol,as.factor(faccol))
colnames(tukeyset)[2] <- "faccol"
# Tukeytest
tukeyres <- TukeyHSD(x=aov(lm(datacol~faccol,data=tukeyset)))
Tlevels <- tukeyres$faccol[,4]
multcompLetters(Tlevels) # WRONG ORDER, even reversed
# Boxplot
boxplot(tukeyset$datacol~tukeyset$faccol)
# adding the labels
text(x=c(1,2,3),y=c(aggregate(data=tukeyset,datacol~faccol,mean)$datacol),
labels=as.character(multcompLetters(Tlevels,reversed=TRUE)$Letters)[order(names(multcompLetters(Tlevels,reversed=TRUE)['Letters']$Letters))])
Oh no just encountered this problem! Spent 2 hours finally sorting it out!
You cannot call multcompLetters. It will give you the completely wrong order.
You have to use multcompLetters2, multcompLetters3 or multcompLetters4.
Another very important point is you have to convert your input dataset into a dataframe, but not tibble! Tibble doesn't work for this.

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)

How to check if a column has numeric or categorical levels in R?

I am trying to plot 9 barplots in a 3X3 matrix in R using base-R wrapped inside a for loop. (I am working on a workhorse solution for visualizing every column before I begin working on manipulating data) Below is the code:
library(ISLR);
library(ggplot2);
# load wage data
data(Wage)
par(mfrow=c(3,3))
for(i in 1:(dim(Wage)[2]-2)){
plot(Wage[,i],main = paste0(names(Wage)[i]),las = 2)
}
But unfortunately can't do properly for first 2 columns because they are numeric and actually needs a histogram. I get it that I need to fit if-else condition somewhere inside for() statement but that is giving me errors. below is the output where first 2 columns are plotted wrong. (Age and year are actually numeric and I may need to use them in X-axis instead of defaulting them to y).
Kindly requesting to suggest an edit/hack? I also learnt that I cant' use par() when I am wrapping ggplot inside for so I had to use base-R otherwise ggplot would have been great aesthetically.

Unable to Convert Chi-Squared Values into a Numeric Column in R

I've been working on a project for a little bit for a homework assignment and I've been stuck on a logistical problem for a while now.
What I have at the moment is a list that returns 10000 values in the format:
[[10000]]
X-squared
0.1867083
(This is the 10000th value of the list)
What I really would like is to just have the chi-squared value alone so I can do things like create a histogram of the values.
Is there any way I can do this? I'm fine with repeating the test from the start if necessary.
My current code is:
nsims = 10000
for (i in 1:nsims) {cancer.cells <- c(rep("M",24),rep("B",13))
malig[i] <- sum(sample(cancer.cells,21)=="M")}
benign = 21 - malig
rbenign = 13 - benign
rmalig = 24 - malig
for (i in 1:nsims) {test = cbind(c(rbenign[i],benign[i]),c(rmalig[i],malig[i]))
cancerchi[i] = chisq.test(test,correct=FALSE) }
It gives me all I need, I just cannot perform follow-up analysis on it such as creating a histogram.
Thanks for taking the time to read this!
I'll provide an answer at the suggestion of #Dr. Mike.
hist requires a vector as input. The reason that hist(cancerchi) will not work is because cancerchi is a list, not a vector.
There a several ways to convert cancerchi, from a list into a format that hist can work with. Here are 3 ways:
hist(as.data.frame(unlist(cancerchi)))
Note that if you do not reassign cancerchi it will still be a list and cannot be passed directly to hist.
# i.e
class(cancerchi)
hist(cancerchi) # will still give you an error
If you reassign, it can be another type of object:
(class(cancerchi2 <- unlist(cancerchi)))
(class(cancerchi3 <- as.data.frame(unlist(cancerchi))))
# using the ldply function in the plyr package
library(plyr)
(class(cancerchi4 <- ldply(cancerchi)))
these new objects can be passed to hist directly
hist(cancerchi2)
hist(cancerchi3[,1]) # specify column because cancerchi3 is a data frame, not a vector
hist(cancerchi4[,1]) # specify column because cancerchi4 is a data frame, not a vector
A little extra information: other useful commands for looking at your objects include str and attributes.

performing a calculation with a `paste`d vector reference

So I have some lidar data that I want to calculate some metrics for (I'll attach a link to the data in a comment).
I also have ground plots that I have extracted the lidar points around, so that I have a couple hundred points per plot (19 plots). Each point has X, Y, Z, height above ground, and the associated plot.
I need to calculate a bunch of metrics on the plot level, so I created plotsgrouped with split(plotpts, plotpts$AssocPlot).
So now I have a data frame with a "page" for each plot, so I can calculate all my metrics by the "plot page". This works just dandy for individual plots, but I want to automate it. (yes, I know there's only 19 plots, but it's the principle of it, darn it! :-P)
So far, I've got a for loop going that calculates the metrics and puts the results in a data frame called Results. I pulled the names of the groups into a list called groups as well.
for(i in 1:length(groups)){
Results$Plot[i] <- groups[i]
Results$Mean[i] <- mean(plotsgrouped$PLT01$Z)
Results$Std.Dev.[i] <- sd(plotsgrouped$PLT01$Z)
Results$Max[i] <- max(plotsgrouped$PLT01$Z)
Results$75%Avg.[i] <- mean(plotsgrouped$PLT01$Z[plotsgrouped$PLT01$Z <= quantile(plotsgrouped$PLT01$Z, .75)])
Results$50%Avg.[i] <- mean(plotsgrouped$PLT01$Z[plotsgrouped$PLT01$Z <= quantile(plotsgrouped$PLT01$Z, .50)])
...
and so on.
The problem arises when I try to do something like:
Results$mean[i] <- mean(paste("plotsgrouped", groups[i],"Z", sep="$")). mean() doesn't recognize the paste as a reference to the vector plotsgrouped$PLT27$Z, and instead fails. I've deduced that it's because it sees the quotes and thinks, "Oh, you're just some text, I can't get the mean of you." or something to that effect.
Btw, groups is a list of the 19 plot names: PLT01-PLT27 (non-consecutive sometimes) and FTWR, so I can't simply put a sequence for the numeric part of the name.
Anyone have an easier way to iterate across my test plots and get arbitrary metrics?
I feel like I have all the right pieces, but just don't know how they go together to give me what I want.
Also, if anyone can come up with a better title for the question, feel free to post it or change it or whatever.
Try with:
for(i in seq_along(groups)) {
Results$Plot[i] <- groups[i] # character names of the groups
tempZ = plotsgrouped[[groups[i]]][["Z"]]
Results$Mean[i] <- mean(tempZ)
Results$Std.Dev.[i] <- sd(tempZ)
Results$Max[i] <- max(tempZ)
Results$75%Avg.[i] <- mean(tempZ[tempZ <= quantile(tempZ, .75)])
Results$50%Avg.[i] <- mean(tempZ[tempZ <= quantile(tempZ, .50)])
}

Resources