R: "undefined columns selected" error after check.names=FALSE? - r
Brand new to R; trying to get my data read in and reshaped properly. The file has seven columns of "id"-ish data, then about sixty columns of annual growth values, labelled by year. First pass was:
> firstData <- read.csv("~/theData.csv")
> nuData <- melt(firstData, id=1:7)
That produced the right arrangement, but read.csv() had prepended an X to all the year names ("X1983", e.g.), so they don't work as values. I understand why that happens, so:
> firstData <- read.csv("~/theData.csv",check.names = FALSE)
> nuData <- melt(firstData, id=1:7)
Error in `[.data.frame`(data, , x) : undefined columns selected
The Xs are gone now (plain "1983", etc.), but melt() fails. I've retried many times and consulted the references, but it's hard to know where to look. R seems to think the structure is okay:
> is.data.frame(firstData)
[1] TRUE
> ncol(firstData)
[1] 71
I suspect something about the bare-number labels on columns 8-71 is throwing it off. How do I reassure it that everything's fine?
EDIT
Didn't want to dump the data-mess if someone could answer offhand, but here's a sample. I thought I'd figured it out when I found spaces in column labels... but I fixed them and still get the same error. Is it a problem that the rows don't all have values in the 2016 column?
Tree,Gap,TransX,TransY,DBH,Nodes,Ht,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,
1,1,3,0,4.4,23,366,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7,3,3,7,3,4,4,13,7,23,17,34,25,30,23,19,25,22,29,28,20,14,6,
2,1,4,0,3.3,24,398,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,12,11,16,10,7,7,16,13,16,12,25,14,24,21,20,22,20,24,15,27,18,17,15,16,
3,1,5,2,2.8,24,325,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,7,16,8,6,16,18,10,17,7,21,10,14,12,16,14,23,15,21,20,14,14,12,9,
4,1,5,2.5,3.5,22,388,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,6,6,5,5,15,9,12,13,29,16,20,13,17,19,27,25,13,31,32,26,26,23,
5,1,10.2,0,9.5,43,739,,,,,,,,,,,,,,,,,,,,,16,18,9,14,18,13,14,10,6,8,8,10,12,11,13,11,6,6,7,8,8,9,11,13,20,27,17,23,11,38,21,29,27,31,29,19,23.1,22,33,40,24,22,24,
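One plausible diagnosis (an assumption on my part, not confirmed from the question): every line in the sample above ends with a trailing comma, so with check.names = FALSE, read.csv() keeps a final column whose name is the empty string, and column lookups on it fail with "undefined columns selected". A minimal base-R sketch of that diagnosis:

```r
# Tiny CSV mimicking the sample above: note the trailing comma on every line.
csv <- "Tree,Gap,1953,1954,
1,1,5,7,
2,1,3,4,
"
d <- read.csv(text = csv, check.names = FALSE)
names(d)   # the last column's name is the empty string ""
# Dropping empty-named columns before melting sidesteps the error:
d <- d[, nchar(names(d)) > 0]
names(d)   # "Tree" "Gap" "1953" "1954"
```

With check.names = TRUE the same column would have been renamed (which is why the first attempt melted fine), so stripping the trailing comma from the file, or dropping empty-named columns as above, should let melt(firstData, id=1:7) run.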
`$<-.data.frame`(`*tmp*`, Numero, value = numeric(0)) error [duplicate]
I have a numeric column ("value") in a dataframe ("df"), and I would like to generate a new column ("valueBin") based on "value". I have the following conditional code to define df$valueBin:

df$valueBin[which(df$value<=250)] <- "<=250"
df$valueBin[which(df$value>250 & df$value<=500)] <- "250-500"
df$valueBin[which(df$value>500 & df$value<=1000)] <- "500-1,000"
df$valueBin[which(df$value>1000 & df$value<=2000)] <- "1,000 - 2,000"
df$valueBin[which(df$value>2000)] <- ">2,000"

I'm getting the following error:

Error in `$<-.data.frame`(`*tmp*`, "valueBin", value = c(NA, NA, NA, : replacement has 6530 rows, data has 6532

Every element of df$value should fit into one of my which() statements. There are no missing values in df$value. Even if I run just the first conditional statement (<=250), I get the exact same error, with "...replacement has 6530 rows...", although there are far fewer than 6530 records with value<=250, and value is never NA. This SO link notes a similar error when using aggregate() was a bug, but it recommends installing the version of R I already have, and the bug report says it's fixed: R aggregate error: "replacement has <foo> rows, data has <bar>". This SO link seems more related to my issue; there the problem was conditional logic that caused fewer elements of the replacement array to be generated: R error in '[<-.data.frame'... replacement has # items, need #. I figured that must be my issue as well, and that I had a "<=" where I needed a "<" or vice versa, but after checking, I'm pretty sure they all cover every value of "value" without overlaps.
The answer by @akrun certainly does the trick. For future googlers who want to understand why, here is an explanation: the new variable needs to be created first. The variable "valueBin" needs to already exist in df for the conditional assignment to work. The syntax of the code is otherwise correct; just add one line in front of the code chunk to create the column:

df$newVariableName <- NA

Then continue with whatever conditional assignment rules you have, like:

df$newVariableName[which(df$oldVariableName<=250)] <- "<=250"

The debugging was made especially confusing by the error message, which points at the two mismatched lengths rather than at the real problem: the column didn't exist yet. Simply create the new column first. For more details, consult this post: https://www.r-bloggers.com/translating-weird-r-errors/
You could use cut:

df$valueBin <- cut(df$value, c(-Inf, 250, 500, 1000, 2000, Inf),
                   labels=c('<=250', '250-500', '500-1,000', '1,000-2,000', '>2,000'))

data

set.seed(24)
df <- data.frame(value = sample(0:2500, 100, replace=TRUE))
TL;DR: late to the party, but this short explanation might help future googlers. In general, the error means that the replacement doesn't fit into the corresponding column of the dataframe. A minimal example:

df <- data.frame(a = 1:2)
df$a <- 1:3

throws the error

Error in `$<-.data.frame`(`*tmp*`, a, value = 1:3) : replacement has 3 rows, data has 2

which is clear, because column a of df has 2 entries (rows) whilst the vector we try to assign has 3.
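Putting the accepted fix together as a runnable sketch (the column names and cut points are taken from the question; the sample values are made up):

```r
df <- data.frame(value = c(100, 300, 700, 1500, 2500))

df$valueBin <- NA_character_  # create the column first, so partial assignment can't mismatch lengths
df$valueBin[which(df$value <= 250)]                    <- "<=250"
df$valueBin[which(df$value > 250 & df$value <= 500)]   <- "250-500"
df$valueBin[which(df$value > 500 & df$value <= 1000)]  <- "500-1,000"
df$valueBin[which(df$value > 1000 & df$value <= 2000)] <- "1,000 - 2,000"
df$valueBin[which(df$value > 2000)]                    <- ">2,000"

df$valueBin
# [1] "<=250" "250-500" "500-1,000" "1,000 - 2,000" ">2,000"
```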
Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA
I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start that all of my columns are of type 'factor', just to loop over them afterwards and convert them back to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:

Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or missing values in 'x'

Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come it works perfectly fine when I read everything in as factor first and then convert it to numeric before applying the CA, instead of just performing the CA directly? The original issue with the HCPC is the following:

# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv", row.names=1, colClasses=c(rep('factor',267)))
# loop over the 267 columns, converting them to numeric
for(i in 1:267) data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca, graph=F)
# perform HCPC for rows (i.e. individuals); up to here everything works fine
data.hcpc <- HCPC(data.ca, graph=T)
# perform HCPC for columns (i.e. variables), using their coordinates stored in the CA object created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord, graph=T)

The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:

Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w = res.sauv$call$row.w.init) : object 'data.clust' not found

It's worth noting that when I perform MCA on my data and try to perform HCPC on the columns in that case, I get the exact same error.
Would anyone have any clue as to how to fix this, or what I am doing wrong exactly? For completeness, I've inserted a screenshot of the upper-left corner of my dataset to show what it looks like. Thanks in advance for any possible help!
I know this is old, but I've been troubleshooting this problem for a while today: HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard CA object, it returns this error. My best guess is that there's some metadata it actually needs that isn't present in a plain data frame of coordinates, but I can't figure out what that is or how to pass it in. The current version of FactoMineR will, however, let you give HCPC the whole CA object and tell it whether to cluster the rows or the columns. So your last line of code should be:

data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)
R in counting data
Right now I'm trying to do a bell curve on a file called output9.csv. I want to use z-scores to detect outliers: take the difference between each value and the mean of the data set, then compare that difference with the standard deviation. Here is my code:

#DATA LOAD
data <- read.csv('output9.csv')
height <- data$Height
hist(height) #histogram

#POPULATION PARAMETER CALCULATIONS
pop_sd <- sd(height)*sqrt((length(height)-1)/(length(height)))
pop_mean <- mean(height)

But I get this error at the histogram step:

> hist(height)
Error in hist.default(height) : 'x' must be numeric

How should I fix this?
Since I don't have your data I can only guess. Can you provide it, or at least a portion of it? What class is your data? You can use class(data) to find out. The most common arrangement is table-like data in a data.frame. To subset one of your columns for use in hist you can use the $ operator. Be sure you subset a column that actually exists: use names(data) (if data is a data.frame) to see which columns exist, and nrow(data) to see how many rows there are. After extracting your height you can go further. First check that your height object is numeric and has something in it, with class(height). As you posted in your comment, you have the following names:

names(data)
# [1] "Host" "TimeStamp" "TimeZone" "Command" "RequestLink" "HTTP"
# [7] "ReplyCode" "Bytes"

Therefore you can extract your height with:

height <- data$Bytes

Did you try converting it to numeric? as.numeric(height) might do the trick: as.numeric() can coerce things that are stored as characters but are really numbers. Try as.numeric("3") as an example. Here is an example I made up:

height <- c(1,1,2,3,1)
class(height)
# [1] "numeric"
hist(height)

This works just fine, because the data is numeric. In the following, the data are numbers but stored as characters:

height_char <- c("1","1","2","3","1")
class(height_char)
# [1] "character"
hist(height_char)
# Error in hist.default(height_char) : 'x' must be numeric

So you have to coerce it first:

hist(as.numeric(height_char))

...and then it works fine. For future questions, try to give a minimal, complete, and verifiable example.
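One caveat worth adding to the coercion advice above (my own note, not part of the original answer): if the column was read in as a factor, as.numeric() returns the internal level codes rather than the printed values, so convert via as.character() first:

```r
# A numeric-looking column that was read in as a factor:
height_fac <- factor(c("10", "2", "5"))
as.numeric(height_fac)                # 1 2 3 -- the level codes, not the data
as.numeric(as.character(height_fac))  # 10 2 5 -- the actual values
```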
get mean from a column in a csv file in R
I am a very beginner user of R. I am taking the Coursera R programming course and I am stuck on a homework (the pollutant mean assignment). The objective is to obtain means from columns in csv files. The files have four columns; we have 300+ files and each has 1000+ observations, most of them NA. In the csv file I am working with there are only 117 numeric observations. I have been trying things like this:

cmydata1 <- read.csv("/Users/joshuavincent/Documents/specdata/001.csv")

Once I had cmydata1, I tried to get the mean of one of the columns, "nitrate", but got this:

> mean(cmydata1, "nitrate")
[1] NA
Warning message:
In mean.default(cmydata1, "nitrate") : argument is not numeric or logical: returning NA

To solve it I created a new list like this:

> cmydata2 <- list(na.omit(cmydata1))
> cmydata2[[1]]

The outcome is the cleaned matrix, with no NA anymore. The column names are "Date", "sulfate", "nitrate" and "ID". However, I still can't get the mean:

> mean(cmydata2, "nitrate")
[1] NA
Warning message:
In mean.default(cmydata2, "nitrate") : argument is not numeric or logical: returning NA

I try to fix it, so I type... and get NULL:

> colnames(cmydata2)
NULL

So, what could I fix to get the mean of that column? (Afterwards I think I have to try loops and such to finish the homework, but I am taking baby steps towards it.) A note that might help: cmydata1 shows a table icon in the environment pane, while cmydata2 shows an organigram-like icon. Thanks
This is a rather simple question and you should probably just reference other questions that have been asked before. However, to try to answer: you reference columns in dataframes in two main ways (although there are others).

data(mtcars) # calling in some data that ships with R

## METHOD 1 ##
mean(mtcars$mpg, na.rm=T) # 'na.rm=T' removes missing values before calculating the mean
# [1] 20.09062

## METHOD 2 ##
mean(mtcars[,'mpg'], na.rm=T)
# [1] 20.09062
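For the actual assignment (averaging one column across several files), the same idea extends naturally. A hedged sketch, using in-memory data frames as stand-ins for the read.csv() results so it runs anywhere; only the "nitrate" column name comes from the question:

```r
# Stand-ins for read.csv(...) results from two of the assignment's files.
d1 <- data.frame(nitrate = c(1, NA, 3))
d2 <- data.frame(nitrate = c(NA, 5))

# Collect the column from every file into one vector, then average it,
# dropping the NAs:
vals <- unlist(lapply(list(d1, d2), function(d) d$nitrate))
mean(vals, na.rm = TRUE)  # 3
```

In the real homework you would build the list with lapply over file paths (e.g. lapply(files, read.csv)) instead of hand-made data frames.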
R: 'Missing Value where True/False needed'
So I know this has been asked before, but from what I've searched I can't really find an answer to my problem. I should also add that I'm relatively new to R (and to coding in general), so when it comes to fixing problems in code I'm not too sure what to look for. My code is:

education_ge <- data.frame(matrix(ncol=2, nrow=1))
colnames(education_ge) <- c("Education","Genetic.Engineering")
for (i in 1:nrow(survey))
  if (survey[i,12]=="Bachelors")
    education_ge$Education <- survey[i,12]

For more context, 'survey' is a data frame with 12 columns and 26 rows, and the 12th column, 'Education', is a factor with levels such as 'Bachelors', 'Masters', 'Doctorate', etc. This is the error as it appears in R:

Error in if (survey[i, 12] == "Bachelors") education_ge$Education <- survey[i, :
  missing value where TRUE/FALSE needed

Any help would be greatly appreciated!
If you just want to ignore any records with missing values and get on with your analysis, try inserting this at the beginning:

survey <- survey[complete.cases(survey), ]

It finds the indexes of all the rows with no NAs anywhere and subsets survey down to only those rows. For more information on subsetting, try reading this chapter: http://adv-r.had.co.nz/Subsetting.html The command:

sapply(survey, function(x) sum(is.na(x)))

will show you how many NAs you have in each column. That might help your data cleaning.
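The root cause can also be reproduced in two lines: if() needs a single TRUE or FALSE, and comparing an NA yields NA. Guarding the comparison avoids the error (a general note, using the "Bachelors" value from the question):

```r
x <- NA
# if (x == "Bachelors") ...     # would fail: missing value where TRUE/FALSE needed
isTRUE(x == "Bachelors")        # FALSE -- treats the NA comparison as not-true
!is.na(x) && x == "Bachelors"   # FALSE -- short-circuits before the NA comparison
```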
You can try this:

sub <- subset(survey, survey$Education=="Bachelors")
education_ge$Education <- sub$Education

Let me know if this helps.