R in counting data - r

Right now I'm trying to draw a bell curve for a file called output9.csv.
Here is my code. I want to use the z-score to detect outliers: the difference between each value and the mean of the data set is compared with the standard deviation to find the outliers.
#DATA LOAD
data <- read.csv('output9.csv')
height <- data$Height
hist(height) #histogram
#POPULATION PARAMETER CALCULATIONS
pop_sd <- sd(height)*sqrt((length(height)-1)/(length(height)))
pop_mean <- mean(height)
But I have this error after I tried the histogram part,
> hist(height)
Error in hist.default(height) : 'x' must be numeric
how should I fix this?

Since I don't have your data I can only guess. Can you provide it? Or at least a portion of it?
What class is your data? You can use class(data) to find out. The most common way is to have table-like data in data.frames. To subset one of your columns to use it for the hist you can use the $ operator. Be sure you subset on a column that actually exists. You can use names(data) (if data is a data.frame) to find out what columns exist in your data. Use nrow(data) to find out how many rows there are in your data.
After extracting your height you can go further. First check that your height object is numeric and has something in it. You can use class(height) to find out.
As you posted in your comment you have the following names
names(data)
# [1] "Host"        "TimeStamp"   "TimeZone"    "Command"     "RequestLink" "HTTP"
# [7] "ReplyCode"   "Bytes"
Therefore you can extract your height with
height <- data$Bytes
Did you try converting it to numeric? as.numeric(height) might do the trick. as.numeric() coerces values that are stored as character strings but represent numbers. Try as.numeric("3") as an example.
Here is an example I made up.
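One caveat worth checking first: in R versions before 4.0, read.csv turned text columns into factors by default, and as.numeric() on a factor returns the internal level codes rather than the printed values. A small sketch with made-up values (bytes_chr and bytes_fac are hypothetical names, not from the question):

```r
# Made-up values standing in for a column read from a CSV file
bytes_chr <- c("512", "2048", "128")
bytes_fac <- factor(bytes_chr)

as.numeric(bytes_chr)                 # character: gives the numbers themselves
as.numeric(bytes_fac)                 # factor: gives level codes, not the values
as.numeric(as.character(bytes_fac))   # factor via character: gives the numbers
```

If class(height) reports "factor", the last form is the one to use.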
height <- c(1,1,2,3,1)
class(height)
# [1] "numeric"
hist(height)
This works just fine, because the data is numeric.
In the following the data are numbers but formatted as characters.
height_char <- c("1","1","2","3","1")
class(height_char)
# [1] "character"
hist(height_char)
# Error in hist.default(height_char) : 'x' must be numeric
So you have to coerce it first:
hist(as.numeric(height_char))
...and then it works fine.
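Once the column is numeric, the z-score step described in the question could look roughly like this. It is only a sketch: the values are invented, and the cutoff of 2 is an assumption rather than something taken from the question.

```r
# Hypothetical values; in the question these would come from the coerced column
height <- as.numeric(c("10", "12", "11", "13", "12", "11", "10", "12", "11", "60"))

n <- length(height)
pop_mean <- mean(height)
pop_sd <- sd(height) * sqrt((n - 1) / n)   # population SD, as in the question

z <- (height - pop_mean) / pop_sd          # z-score of each value
cutoff <- 2                                # assumed threshold, adjust to taste
outliers <- height[abs(z) > cutoff]
outliers
```

With the population SD, no z-score in a sample of n values can exceed sqrt(n - 1), so a fixed cutoff of 3 would never flag anything in a sample of 10.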
For future questions: Try to give Minimal, Complete, and Verifiable Examples.

Related

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
library(FactoMineR)
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)

Efficient Way to Convert to Numeric

I have converted a bunch of my columns from factor to numeric, but the code was very cumbersome. I had to individually convert each column, which ended up taking more time than it should. This is the code I used (only a short sample - I actually have many more columns):
city1$NY <-as.numeric(levels(city1$NY))[city1$NY]
city1$CHI<-as.numeric(levels(city1$CHI))[city1$CHI]
city1$LA <-as.numeric(levels(city1$LA))[city1$LA]
city1$ATL<-as.numeric(levels(city1$ATL))[city1$ATL]
city1$MIA<-as.numeric(levels(city1$MIA))[city1$MIA]
I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
Where CityNames is just all of the columns for the data that I would like to convert.. But that doesn't work, as I get:
Error in as.numeric(levels(city1[, CityNames]))[city1[, CityNames]] :
invalid subscript type 'list'
Can anyone tell what I am doing wrong? Or is there just simply no easier way to do this task other than my long, annoying first method?
I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
So, a small change is needed:
city1[,CityNames] <- lapply(city1[,CityNames], function(x) as.numeric(levels(x))[x] )
The original approach didn't work for two reasons:
1) levels are vector-specific, so it's not clear what myvec = levels(city1[,CityNames]) is.
2) myvec[ city1[,CityNames] ] throws an error because city1[,CityNames] is a data.frame and cannot be used to subset in this way.
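Both points can be checked on a toy data frame. The values below are invented for illustration; only the names city1 and CityNames come from the question.

```r
# Toy data frame with hypothetical values; factor columns mimic the question
city1 <- data.frame(
  NY  = factor(c("10.5", "3", "7")),
  CHI = factor(c("1", "20", "4.25")),
  id  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
CityNames <- c("NY", "CHI")

levels(city1[, CityNames])   # NULL: a data frame has no levels of its own

# Convert each selected column separately, leaving the others untouched
city1[CityNames] <- lapply(city1[CityNames], function(x) as.numeric(levels(x))[x])
str(city1)
```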
This is typically what I do when I want to convert many columns in a data.frame to a different data type:
convNames <- c("NY", "CHI", "LA", "ATL", "MIA")
for (name in convNames) { city1[, name] <- as.numeric(as.character(city1[, name])) }
It's a nice two lines, and to coerce another column you just add its name to the convNames vector.
EDIT: Due to a factor issue, use the lapply method above.
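I'm not sure if it is faster, but it may be, since the lookups may be what is slowing you down. Try as.numeric(as.character(x)) on each column x, for example city1$NY <- as.numeric(as.character(city1$NY)); calling it on the whole data frame does not give the intended result, because as.character() needs a single vector. The as.character() converts to the level values and then as.numeric() interprets those strings as their numeric equivalents. It avoids indexing into the levels vector for each value, although ?factor describes as.numeric(levels(f))[f] as slightly more efficient.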

R warning message - invalid factor level, NA generated

I have the following block of code. I am a complete beginner in R (a few days old), so I am not sure how much of the code I need to share to explain my problem. So here is everything I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
  if (outcome=="heart attack"){
    pdata <- spldata[[i]]
    pdata[,11] <- as.numeric(pdata[,11])
    bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
    inorder <- order(bestof[,11],bestof[,2])
    if (num=="worst") num <- nrow(bestof)
    hospital <- bestof[inorder[num],2]
    state <- allstate[i]
    ranklist <- rbind(ranklist,c(hospital,state))
  }
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind but I cannot figure out what it is. I have tried googling this, and also tried troubleshooting using other similar questions on this site. I have checked that none of the vectors I am trying to bind are factors. I also tried forcing the coercion by wrapping hospital and state in as.character() during assignment, but that didn't work.
I would be grateful for any help.
Thanks in advance!
Since this is apparently from a Coursera assignment I am not going to give you a solution but I am going to hint at it: Have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, true or false? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check what the classes of the columns in mdata and ranklist are. read.csv additionally has an na.strings argument. If you use it correctly, also the NAs introduced by coercion warning will disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way, you did not get an error but a warning. R didn't interrupt the loop; it just let you know that some values have been coerced to NA. You can also simplify the seq_len statement by using seq_along instead.
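The preallocation idea, shown on stand-in data rather than the assignment itself. This is a rough sketch: states and best_by_state are hypothetical, and how the best hospital per state is found is left out on purpose.

```r
# Hypothetical stand-in data; the real assignment data is not reproduced here
states <- c("AK", "AL", "TX")
best_by_state <- c(AK = "Hospital A", AL = "Hospital B", TX = "Hospital C")

# Preallocate with the final number of rows and keep strings as character
ranklist <- data.frame(
  hospital = rep(NA_character_, length(states)),
  state = states,
  stringsAsFactors = FALSE
)

for (i in seq_along(states)) {
  # Fill row i in place instead of growing the data frame with rbind
  ranklist$hospital[i] <- best_by_state[[states[i]]]
}

str(ranklist)
```

Because both columns are character vectors from the start, assigning a new hospital name cannot trigger the "invalid factor level" warning.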

Unable to Convert Chi-Squared Values into a Numeric Column in R

I've been working on a project for a little bit for a homework assignment and I've been stuck on a logistical problem for a while now.
What I have at the moment is a list that returns 10000 values in the format:
[[10000]]
X-squared
0.1867083
(This is the 10000th value of the list)
What I really would like is to just have the chi-squared value alone so I can do things like create a histogram of the values.
Is there any way I can do this? I'm fine with repeating the test from the start if necessary.
My current code is:
nsims = 10000
for (i in 1:nsims) {
  cancer.cells <- c(rep("M",24),rep("B",13))
  malig[i] <- sum(sample(cancer.cells,21)=="M")
}
benign = 21 - malig
rbenign = 13 - benign
rmalig = 24 - malig
for (i in 1:nsims) {
  test = cbind(c(rbenign[i],benign[i]),c(rmalig[i],malig[i]))
  cancerchi[i] = chisq.test(test,correct=FALSE)
}
It gives me all I need, I just cannot perform follow-up analysis on it such as creating a histogram.
Thanks for taking the time to read this!
I'll provide an answer at the suggestion of @Dr. Mike.
hist requires a vector as input. The reason that hist(cancerchi) will not work is because cancerchi is a list, not a vector.
There are several ways to convert cancerchi from a list into a format that hist can work with; three are shown below. The simplest is to convert on the fly:
hist(unlist(cancerchi))
Note that if you do not reassign cancerchi it will still be a list and cannot be passed directly to hist.
# i.e
class(cancerchi)
hist(cancerchi) # will still give you an error
If you reassign, it can be another type of object:
(class(cancerchi2 <- unlist(cancerchi)))
(class(cancerchi3 <- as.data.frame(unlist(cancerchi))))
# using the ldply function in the plyr package
library(plyr)
(class(cancerchi4 <- ldply(cancerchi)))
These new objects can be passed to hist:
hist(cancerchi2)
hist(cancerchi3[,1]) # specify column because cancerchi3 is a data frame, not a vector
hist(cancerchi4[,1]) # specify column because cancerchi4 is a data frame, not a vector
A little extra information: other useful commands for looking at your objects include str and attributes.
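If repeating the simulation is acceptable, as the question says it is, one option is to collect the statistic as a plain numeric vector from the start, so no conversion is needed afterwards. A minimal sketch, assuming the same sampling scheme as the question; the seed, the smaller nsims and the name chi_values are my own choices.

```r
set.seed(1)                      # arbitrary seed so the sketch is repeatable
nsims <- 200                     # smaller than the question's 10000, for speed
cancer.cells <- c(rep("M", 24), rep("B", 13))

chi_values <- vapply(seq_len(nsims), function(i) {
  malig  <- sum(sample(cancer.cells, 21) == "M")
  benign <- 21 - malig
  tab <- cbind(c(13 - benign, benign), c(24 - malig, malig))
  # $statistic is a named numeric of length 1; unname() drops "X-squared"
  unname(chisq.test(tab, correct = FALSE)$statistic)
}, numeric(1))

class(chi_values)
hist(chi_values)
```

vapply checks that every iteration returns exactly one number, so the result is guaranteed to be a numeric vector that hist accepts.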

Bandwidth selection using NP package

New to R and having a problem with a very simple task! I have read a few columns of .csv data into R; the variables take values in the natural numbers plus zero and have missing values. After trying to use the non-parametric package, I have two problems: first, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error that "number of regression data and response data do not match". Why do I get this, as I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of the following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting data. Note the difference in various ways of subsetting. data$x and data[,i] (where i = column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data). In that case they work with column names.
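The subsetting difference, and one way to make sure x and y lose the same rows to missing values, can be checked in base R without the np package. The data frame dat below is hypothetical.

```r
# Hypothetical data frame with missing values in different rows
dat <- data.frame(x = c(3, 4, NA, 2), y = c(1, NA, 5, 6))

is.atomic(dat$x)          # TRUE: a plain numeric vector
is.atomic(dat[, 1])       # TRUE: same column, selected by position
is.data.frame(dat["x"])   # TRUE: single brackets keep the data frame
is.data.frame(dat[1])     # TRUE

# Keep only rows where both x and y are present, so lengths match
ok <- complete.cases(dat)
x <- dat$x[ok]
y <- dat$y[ok]
length(x) == length(y)
```

Filtering both vectors with the same logical index is what keeps them aligned; dropping NAs from each one separately would leave vectors of different lengths whenever the missing values sit in different rows.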
