Get the mean from a column in a CSV file in R - r
I am a complete beginner in R. I am taking the Coursera R Programming course and I am stuck on a homework assignment (the pollutant mean homework). The objective is to obtain means from columns in CSV files. Each file has four columns. We have 300+ files and each has 1000+ observations, most of which are NA. In the CSV file I am working with there are only 117 numeric observations. I have been trying things like this:
cmydata1 <- read.csv("/Users/joshuavincent/Documents/specdata/001.csv")
Once I had cmydata1, I tried to get the mean of one of the columns, "nitrate" but I got this:
> mean(cmydata1, "nitrate")
[1] NA
Warning message:
In mean.default(cmydata1, "nitrate") :
argument is not numeric or logical: returning NA
To solve it I created a new list like this:
> cmydata2 <- list(na.omit(cmydata1))
> cmydata2[[1]]
The outcome is the cleaned matrix, no NA anymore
The column names are: "Date" "sulfate" "nitrate" and ID.
However, I still can't get the mean
> mean(cmydata2, "nitrate")
[1] NA
Warning message:
In mean.default(cmydata2, "nitrate") :
argument is not numeric or logical: returning NA
I tried to fix it by checking the column names, but got NULL:
> colnames(cmydata2)
NULL
So, what should I fix to get the mean from that column? (Afterwards I think I will have to try loops and so on to finish the homework, but I am taking baby steps towards it.)
Note that might help: I have cmydata1 with a table icon in the autofill, while cmydata2 has some shapes, seems like an organigram icon.
Thanks
This is a fairly basic question that earlier questions already cover, but to answer it: mean() expects a numeric vector, so passing the whole data frame plus a column name, as in mean(cmydata1, "nitrate"), returns NA. Extract the column first. The two most common ways to reference a column in a data frame are listed below (there are other ways too).
data(mtcars) # load an example dataset that ships with R
## METHOD 1 ##
mean(mtcars$mpg, na.rm = TRUE) # na.rm = TRUE removes missing values before calculating the mean
# [1] 20.09062
## METHOD 2 ##
mean(mtcars[, "mpg"], na.rm = TRUE)
# [1] 20.09062
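To connect this back to the question, here is a minimal sketch on a made-up data frame (the name `pollution` and all values are invented for illustration, not taken from 001.csv). It shows the column extraction working, and why wrapping the output of na.omit() in list() made the column names disappear.

```r
# Hypothetical stand-in for one monitor file; values are invented.
pollution <- data.frame(
  Date    = c("2003-01-01", "2003-01-02", "2003-01-03", "2003-01-04"),
  sulfate = c(NA, 3.5, 4.5, NA),
  nitrate = c(1.0, NA, 2.0, 3.0),
  ID      = 1L
)

# Extract the column as a numeric vector, then drop NAs inside mean().
nitrate_mean <- mean(pollution$nitrate, na.rm = TRUE)
print(nitrate_mean)

# na.omit() on the whole data frame drops every row with an NA in ANY column,
# so it can discard rows that still hold a valid nitrate reading.
complete_rows <- na.omit(pollution)
print(nrow(complete_rows))

# Wrapping the result in list() gives a list of length 1, which has no colnames.
wrapped <- list(complete_rows)
print(colnames(wrapped))        # NULL
print(colnames(wrapped[[1]]))   # the data frame inside still has its names
```

Using na.rm = TRUE on the single column is usually the safer choice here, because it only ignores missing values in the column being averaged.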
Related
R in counting data
Right now I'm trying to draw a bell curve from a file called output9.csv. I want to use z-scores to detect outliers: take the difference between each value and the mean of the data set, and compare that difference with the standard deviation. Here is my code:
#DATA LOAD
data <- read.csv('output9.csv')
height <- data$Height
hist(height) #histogram
#POPULATION PARAMETER CALCULATIONS
pop_sd <- sd(height)*sqrt((length(height)-1)/(length(height)))
pop_mean <- mean(height)
But I get this error after the histogram part:
> hist(height)
Error in hist.default(height) : 'x' must be numeric
How should I fix this?
Since I don't have your data I can only guess. Can you provide it, or at least a portion of it?
What class is your data? You can use class(data) to find out. The most common way is to have table-like data in data.frames. To subset one of your columns for use in hist() you can use the $ operator. Be sure you subset on a column that actually exists. You can use names(data) (if data is a data.frame) to find out what columns exist in your data. Use nrow(data) to find out how many rows there are in your data.
After extracting your height you can go further. First check that your height object is numeric and has something in it. You can use class(height) to find out.
As you posted in your comment, you have the following names:
names(data)
# [1] "Host" "TimeStamp" "TimeZone" "Command" "RequestLink" "HTTP"
# [7] "ReplyCode" "Bytes"
Therefore you can extract your height with
height <- data$Bytes
Did you try to convert it to numeric? as.numeric(height) might do the trick. as.numeric() converts values that are stored as characters but look like numbers. Try as.numeric("3") as an example.
Here is an example I made up.
height <- c(1,1,2,3,1)
class(height)
# [1] "numeric"
hist(height)
This works just fine, because the data is numeric. In the following the data are numbers but formatted as characters.
height_char <- c("1","1","2","3","1")
class(height_char)
# [1] "character"
hist(height_char)
# Error in hist.default(height_char) : 'x' must be numeric
So you have to coerce it first:
hist(as.numeric(height_char))
..and then it works fine.
For future questions: Try to give Minimal, Complete, and Verifiable Examples.
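Before coercing a real column, it can help to see which entries refuse to convert. The following is a small sketch with invented values (`bytes_raw` and `bytes_factor` are hypothetical names, and the "-" entry is only an assumption about what might appear in such a column). It also shows the extra step a factor needs.

```r
# Invented example: numbers stored as text, with one non-numeric entry.
bytes_raw <- c("512", "1024", "-", "2048")

# as.numeric() turns anything it cannot parse into NA (with a warning).
bytes_num <- suppressWarnings(as.numeric(bytes_raw))

# Show which original values failed to convert.
failed <- bytes_raw[is.na(bytes_num) & !is.na(bytes_raw)]
print(failed)

# Factors need an extra step: as.numeric() on a factor returns the level codes.
bytes_factor <- factor(c("10", "20", "10"))
codes  <- as.numeric(bytes_factor)                 # level codes, not the values
values <- as.numeric(as.character(bytes_factor))   # the actual numbers
print(codes)
print(values)
```

If `failed` is not empty, those entries become NA after coercion, so it is worth deciding whether they really are missing values before plotting.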
R: errors in cor() and corrplot()
Another stumbling block. I have a large set of data (called "brightly") with about ~180k rows and 165 columns. I am trying to create a correlation matrix of these columns in R. Several problems have arisen, none of which I can resolve with the suggestions proposed on this site and others.
First, how I created the data set: I saved it as a CSV file from Excel. My understanding is that CSV should remove any formatting, such that anything that is a number should be read as a number by R. I loaded it with
brightly = read.csv("brightly.csv", header=TRUE)
But I kept getting "'x' must be numeric" error messages every time I ran cor(brightly), so I replaced all the NAs with 0s. (This may be altering my data, but I think it will be all right--anything that's "NA" is effectively 0, either for the continuous or dummy variables.)
Now I am no longer getting the error message about text. But any time I run cor()--either on all of the variables simultaneously or combinations of the variables--I get
Warning message:
In cor(brightly$PPV, brightly, use = "complete") : the standard deviation is zero
I am also having some of the correlations of that one variable with others show up as "NA." I have ensured that no cell in the data is "NA," so I do not know why I am getting "NA" values for the correlations.
I also tried both of the following to make REALLY sure I wasn't including any NA values:
cor(brightly$PPV, brightly, use = "pairwise.complete.obs")
and
cor(brightly$PPV,brightly,use="complete")
But I still get warnings about the SD being zero, and I still get the NAs. Any insights as to why this might be happening?
Finally, when I try to do corrplot to show the results of the correlations, I do the following:
brightly2 <- cor(brightly)
Warning message:
In cor(brightly) : the standard deviation is zero
corrplot(brightly2, method = "number")
Error in if (min(corr) < -1 - .Machine$double.eps || max(corr) > 1 + .Machine$double.eps) { : missing value where TRUE/FALSE needed
And instead of making my nice color-coded correlation matrix, I get this. I have yet to find an explanation of what that means. Any help would be HUGELY appreciated! Thanks very much!!
Please check whether you replaced your NAs with 0 or '0': the first is numeric and the second is character. You can also try as.numeric(column_name) to convert character 0s to numeric 0s. This error also occurs if your dataset contains factors; because those are not numeric values, corrplot throws this error. It would be helpful if you put a sample of your data in the question using str(head(your_dataset)). That would also let you check the data types of the columns. Let me know if I am wrong. Cheerio.
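The "standard deviation is zero" warning deserves a separate check: cor() returns NA for any column that never varies, and replacing NAs with 0 can easily produce such columns. The sketch below uses an invented data frame (`demo` and its columns are hypothetical) to show one way of finding non-numeric and constant columns before calling cor(); it is an illustration, not a diagnosis of the actual "brightly" data.

```r
# Invented data: `flag` is constant and `label` is non-numeric.
demo <- data.frame(
  x     = c(1, 2, 3, 4, 5),
  y     = c(2, 4, 5, 4, 5),
  flag  = c(0, 0, 0, 0, 0),
  label = c("a", "b", "c", "d", "e")
)

# Keep only the numeric columns.
is_num <- vapply(demo, is.numeric, logical(1))
numeric_cols <- demo[, is_num, drop = FALSE]

# Find columns whose standard deviation is zero (or undefined).
col_sd <- vapply(numeric_cols, sd, numeric(1), na.rm = TRUE)
constant <- names(col_sd)[is.na(col_sd) | col_sd == 0]
print(constant)

# Correlate only the columns that actually vary.
varying <- numeric_cols[, !(names(numeric_cols) %in% constant), drop = FALSE]
cor_matrix <- cor(varying)
print(cor_matrix)
```

A matrix built this way contains no NA entries, which is likely what corrplot() needs, since its range check on min(corr) and max(corr) cannot be evaluated when NAs are present.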
R: "undefined columns selected" error after check.names=FALSE?
Brand new to R; trying to get my data read in and reshaped properly. File format has seven columns of "id"-ish data, then about sixty columns of annual growth values, columns labelled by year. First pass was:
> firstData <- read.csv("~/theData.csv")
> nuData <- melt(firstData, id=1:7)
That made the right arrangement but read.csv() had prepended an X to all the years ("X1983", e.g.), so now they don't work as values. I get that, so:
> firstData <- read.csv("~/theData.csv",check.names = FALSE)
> nuData <- melt(firstData, id=1:7)
Error in `[.data.frame`(data, , x) : undefined columns selected
The Xs were kept away (plain "1983", etc.), but now it won't melt(). Many retries; lots of reference-consulting; hard to figure out the right way to find the answer. It seems to think the structure is okay:
> is.data.frame(firstData)
[1] TRUE
> ncol(firstData)
[1] 71
I suspect that something about the bare-number column labels for 8-71 is throwing it. How do I reassure it that everything's fine?
EDIT
Didn't want to dump the data-mess if someone could answer offhand, but here's a sample. I thought I'd figured it out when I found spaces in column labels... but I fixed them and still get the same error. Is it a problem that the rows don't all have values in the 2016 column?
Tree,Gap,TransX,TransY,DBH,Nodes,Ht,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,
1,1,3,0,4.4,23,366,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7,3,3,7,3,4,4,13,7,23,17,34,25,30,23,19,25,22,29,28,20,14,6,
2,1,4,0,3.3,24,398,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,12,11,16,10,7,7,16,13,16,12,25,14,24,21,20,22,20,24,15,27,18,17,15,16,
3,1,5,2,2.8,24,325,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,7,16,8,6,16,18,10,17,7,21,10,14,12,16,14,23,15,21,20,14,14,12,9,
4,1,5,2.5,3.5,22,388,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,6,6,5,5,15,9,12,13,29,16,20,13,17,19,27,25,13,31,32,26,26,23,
5,1,10.2,0,9.5,43,739,,,,,,,,,,,,,,,,,,,,,16,18,9,14,18,13,14,10,6,8,8,10,12,11,13,11,6,6,7,8,8,9,11,13,20,27,17,23,11,38,21,29,27,31,29,19,23.1,22,33,40,24,22,24,
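One possible workaround, offered as a sketch rather than a diagnosis of the melt() error: keep the default check.names = TRUE so the column names stay syntactically valid, reshape, and strip the "X" prefix afterwards. The example below uses only base R and a tiny invented file (`csv_text`, `wide`, and `long` are hypothetical names; the real file has more id columns).

```r
# Tiny invented sample with year-labelled columns.
csv_text <- "Tree,Gap,1953,1954
1,1,7,3
2,1,12,11"

# Default check.names = TRUE turns 1953 into X1953, and so on.
wide <- read.csv(text = csv_text)
year_cols <- grep("^X[0-9]{4}$", names(wide), value = TRUE)

# Build the long format by hand: one row per tree per year.
long <- data.frame(
  Tree   = rep(wide$Tree, times = length(year_cols)),
  Gap    = rep(wide$Gap, times = length(year_cols)),
  year   = rep(as.integer(sub("^X", "", year_cols)), each = nrow(wide)),
  growth = unlist(wide[year_cols], use.names = FALSE)
)
print(long)
```

The same prefix-stripping step, as.integer(sub("^X", "", ...)), can be applied to the variable column that melt() produces when it is run on the default column names.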
R warning message - invalid factor level, NA generated
I have the following block of code. I am a complete beginner in R (a few days old) so I am not sure how much of the code will I need to share to counter my problem. So here is all of it I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
if (outcome=="heart attack"){
pdata <- spldata[[i]]
pdata[,11] <- as.numeric(pdata[,11])
bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
inorder <- order(bestof[,11],bestof[,2])
if (num=="worst") num <- nrow(bestof)
hospital <- bestof[inorder[num],2]
state <- allstate[i]
ranklist <- rbind(ranklist,c(hospital,state))
}
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion. However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind but I cannot figure out what is it. I have tried googling about this, and also tried troubleshooting using other similar queries on this site too. I have checked any of my vectors I am trying to bind are not factors. I also tried forcing the coercion by setting the hospital and state as.character() during assignment, but didn't work. I would be grateful for any help. Thanks in advance!
Since this is apparently from a Coursera assignment I am not going to give you a solution, but I am going to hint at it:
Have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, true or false? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check what the classes of the columns in mdata and ranklist are.
read.csv additionally has an na.strings argument. If you use it correctly, the "NAs introduced by coercion" warning will also disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way, you did not get an error but a warning. R didn't interrupt the loop; it just let you know that some values have been coerced to NA. You can also simplify the seq_len statement by using seq_along instead.
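To illustrate the preallocation hint without touching the assignment itself, here is a generic sketch on invented data (`states` and `best_hospital` are hypothetical stand-ins, not the assignment's values). It builds a data frame of known size with character columns and fills it row by row instead of calling rbind().

```r
# Invented stand-ins for the per-state results; not the assignment data.
states <- c("AL", "AK", "AZ")
best_hospital <- c(AL = "Hospital A", AK = "Hospital B", AZ = "Hospital C")

# Preallocate with the final number of rows and keep strings as character.
ranklist <- data.frame(
  hospital = character(length(states)),
  state    = character(length(states)),
  stringsAsFactors = FALSE
)

# Fill row i instead of growing the data frame with rbind().
for (i in seq_along(states)) {
  ranklist[i, "hospital"] <- best_hospital[[states[i]]]
  ranklist[i, "state"]    <- states[i]
}

str(ranklist)
```

Because both columns are plain character vectors, assigning new strings cannot trigger the "invalid factor level" warning.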
Recoding over multiple data frames in R
(edited to reflect help...I'm not doing great with formatting, but appreciate the feedback)
I'm a bit stuck on what I suspect is an easy enough problem. I have multiple different data sets that I have loaded into R, all of which have different numbers of observations, but all of which have three variables named "A1," "A2," and "A3". I want to create a new variable in each of the data frames that contains the value held in "A1" if A3 contains a value greater than zero, and the value held in "A2" if A3 contains a value less than zero. Seems simple enough, right?
My attempt at this code uses this faux-data:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=cbind(A1,A2,A3)
A3=runif(100,-1,1)
df2=cbind(A1,A2,A3)
I'm about a thousand percent sure that R has some functionality for creating the same named variable in multiple data frames, but I have tried doing this with lapply:
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
But the newVar is not available for me once I leave the lapply loop. For example, if I ask for the mean of the new variable:
mean(df1$newVar)
[1] NA
Warning message:
In mean.default(df1$newVar) : argument is not numeric or logical: returning NA
Any help would be appreciated. Thank you.
Well first of all, df1 and df2 are not data.frames but matrices (the dollar syntax doesn't work on matrices). In fact, if you do:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=as.data.frame(cbind(A1,A2,A3))
A3=runif(100,-1,1)
df2=as.data.frame(cbind(A1,A2,A3))
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2
})
the code almost works but gives some warnings. In fact, there's still an error in the last line of the function called by lapply. If you change it like this, it works as expected:
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0] # you need to subset x$A2 otherwise it's too long
return(x) # better to state explicitly what's the return value
})
EDIT (as per comment):
as basically always happens in R, functions do not mutate existing objects but return brand new objects. So, in this case df1 and df2 are still the same but lapply returns a list with the expected 2 new data.frames i.e. :
resultList <- lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
newDf1 <- resultList[[1]]
newDf2 <- resultList[[2]]
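A slightly more compact variant of the same idea, as a sketch: name the list elements so the results can be looked up by name, and use ifelse() for the recoding. The smaller data frames and the name `frames` are invented for illustration; the rule mirrors the code above (A2 where A3 > 0, otherwise A1).

```r
set.seed(1)
df1 <- data.frame(A1 = 1:5, A2 = -(5:1), A3 = runif(5, -1, 1))
df2 <- data.frame(A1 = 1:5, A2 = -(5:1), A3 = runif(5, -1, 1))

# Name the list elements so the results can be looked up by name.
frames <- list(df1 = df1, df2 = df2)

# Assign the result back; lapply() never modifies df1 or df2 themselves.
frames <- lapply(frames, function(x) {
  # Same rule as the code above: A2 where A3 > 0, otherwise A1.
  x$newVar <- ifelse(x$A3 > 0, x$A2, x$A1)
  x
})

print(mean(frames$df1$newVar))
print("newVar" %in% names(df1))   # the original data frame is unchanged
```

Keeping the data frames together in one named list also makes later steps, such as computing the same summary for each of them, a single lapply() or sapply() call.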