Mean function in R for data in csv file - r

I searched for the error I get when using the "mean" function in R 3.1.2 but could not find an explanation.
Purpose: Calculate the mean of a dataset.
Functions used: sapply and summary, to calculate the mean as shown below:
sapply(data,mean,na.rm=TRUE)
summary(data)
Problem: Now I am trying to use the "mean" function to calculate the mean from the complete dataset. I used the function like this:
> testingnew <-data[complete.cases(data),]
> mean(testingnew)
Warning shown:
[1] NA
Warning message:
In mean.default(testingnew) :
argument is not numeric or logical: returning NA
Question: Can someone please tell me why this warning appears? I tried to remove NA (missing values) using complete.cases.

# To eliminate rows with missing values (complete.cases() is TRUE for rows without any NA):
testingnew <- data[complete.cases(data), ]
#Choose a column to calculate the mean:
#Make sure it is numeric or integer
class(testingnew$Col1)
mean(testingnew$Col1, na.rm=TRUE)
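The class check above can be extended to the whole data frame. The following is a minimal sketch, assuming a small made-up data frame (`df`, `Ozone` and `Month` are illustrative stand-ins for the CSV contents, not the asker's data); it takes the mean only of the columns that are actually numeric.

```r
# Hypothetical data standing in for the CSV file
df <- data.frame(
  Ozone = c(41, NA, 12),
  Month = c("May", "May", "June"),
  stringsAsFactors = FALSE
)

# Inspect the class of every column
classes <- sapply(df, class)

# Keep only numeric columns, then take the mean of each one
num_cols <- vapply(df, is.numeric, logical(1))
means <- sapply(df[num_cols], mean, na.rm = TRUE)
print(means)
```

Character or factor columns are skipped here instead of triggering the "argument is not numeric or logical" warning.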

Maybe you can try to reproduce this workflow with your own dataset... It seems the only thing missing is referring to individual columns with the mean function, or using sapply as you did before.
Create a dataframe using random values
my.df <- data.frame(x1 = rnorm(n = 200), x2 = rnorm(n=200))
Spread NA's randomly into the df
is.na(my.df) <- matrix(sample(c(TRUE,FALSE), replace= TRUE, size = 400,
prob=c(0.10, 0.90)),
ncol = 2)
For getting means without using complete cases:
mean(my.df$x1, na.rm=TRUE) # mean(my.df[,1], na.rm=TRUE) is equivalent
mean(my.df$x2, na.rm=TRUE) # mean(my.df[,2], na.rm=TRUE) is equivalent
Complete-case approach (if this is what you really need):
my.df.complete <- my.df[complete.cases(my.df),]
Get means for both columns
sapply(X = my.df.complete, FUN = mean)
Get mean from individual columns
mean(my.df.complete$x1)
mean(my.df.complete$x2)

Creating a subset helped:
data3 <-subset(data, !is.na(Ozone))
mean(data3$Ozone)
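As for why the warning appears in the first place: removing NA rows with complete.cases still leaves a data frame, and a data frame is a list of columns rather than a numeric vector. A small sketch with made-up data (the names `df`, `a` and `b` are assumptions for illustration):

```r
# Hypothetical data frame with some missing values
df <- data.frame(a = c(1, 2, NA, 4), b = c(10, NA, 30, 40))

# Dropping incomplete rows does not change the type of the object
complete <- df[complete.cases(df), ]
is.data.frame(complete)  # still a data frame
is.numeric(complete)     # FALSE, which is what mean.default() complains about

# Per-column means of the complete rows
col_means <- colMeans(complete)
print(col_means)
```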

Related

Applying a function to all columns in R [duplicate]

I have a dataset which I created by column binding using the cbindX function from the gdata package. This function allows me to bind columns with different numbers of rows. So, NA's are introduced when there are no values in a particular column. Now, I want to calculate the standard deviation for each column. I tried using
sapply(dataset,sd)
This returns the standard deviation for the column having all rows with values and NA for the columns having fewer rows. I tried using the na.rm argument with the sd function:
sapply(dataset,sd(na.rm=T))
and got the error message
Error in is.data.frame(x) : argument "x" is missing, with no default
For example:
firstcol <- matrix(c(1:150),ncol=1)
secondcol <- matrix(c(1:300),ncol=1)
thirdcol <- matrix(c(1:450),ncol=1)
fourthcol <- matrix(c(1:600),ncol=1)
fifthcol <- matrix(c(1:30),ncol=1)
sixthcol <- matrix(c(1:30),ncol=1)
seventhcol <- matrix(c(1:30),ncol=1)
library(gdata)
allcolscomb <- data.frame(cbindX (firstcol,secondcol,thirdcol,fourthcol,fifthcol,sixthcol,seventhcol))
names(allcolscomb) <- c("1stcol","2ndcol","3rdcol","4thcol","5thcol","6thcol","7thcol")
sapply(allcolscomb,sd)
sapply(allcolscomb,sd(na.rm=T))
How can I compute standard deviation using the sapply function?
You should read the ?sapply help page. Below are examples of sapply with extra arguments:
sapply(allcolscomb, sd, na.rm=TRUE)
sapply(allcolscomb, function(x) sd(x, na.rm=TRUE))
Try this.
sapply(allcolscomb,sd, na.rm = TRUE)
In the apply family of functions the call syntax is (data, fun, ...). The three dots are the "ellipsis": they collect any extra arguments and pass them on to the function given as fun.
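To make the forwarding concrete, here is a minimal sketch on a small made-up data frame (`df` and the wrapper `col_stat` are hypothetical names, not part of the question). It also shows why sd(na.rm=T) fails: that expression calls sd immediately, without any data, before sapply ever runs.

```r
# Hypothetical data frame: column b is padded with NA, like cbindX output
df <- data.frame(a = c(1, 2, 3, 4), b = c(2, 4, NA, NA))

# Without na.rm the padded column yields NA
sd_default <- sapply(df, sd)

# na.rm = TRUE is not an argument of sapply; it travels through ... to sd
sd_narm <- sapply(df, sd, na.rm = TRUE)

# Calling sd() with no data is an error, which is what sd(na.rm = T) does
direct_call_fails <- inherits(try(sd(na.rm = TRUE), silent = TRUE), "try-error")

# A hypothetical wrapper that forwards extra arguments the same way
col_stat <- function(data, fun, ...) sapply(data, fun, ...)
wrapped <- col_stat(df, sd, na.rm = TRUE)
print(sd_narm)
```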

Replacing outlier 2.5%, 97.5% code error in R

I used the following code to try to replace the values of variables that fall below the 2.5% quantile or above the 97.5% quantile with specific values. You can run the code yourself; it uses an open data file.
credit<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
fun <- function(x){
quantiles <- quantile( x, c(.025, .975 ) )
x[ x < quantiles[1] ] <- quantiles[1]
x[ x > quantiles[2] ] <- quantiles[2]
x
}
fun(credit)
But an error message appears:
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
undefined columns selected
What's the problem? I'd be happy for any help!
Additional comment:
I found that the above function does not work on a data frame, only on a vector.
I can replace the outliers of each variable in the data file with the following code:
credit$Duration.of.Credit..month. <- pmax(quantile(credit$Duration.of.Credit..month.,.025),
pmin(credit$Duration.of.Credit..month., quantile(credit$Duration.of.Credit..month.,.975)))
However, my data file has so many variables that it is inconvenient to write this code for each one.
So how can I replace the outliers of all variables with specific values without repeating pmax and pmin for every column?
There's actually nothing wrong with your function as long as you apply it to a column. I'd use mutate_at or mutate_all (if you really want to apply it to all columns) of the dplyr package. Something like this:
library(dplyr)
# Pass the function directly; funs() is deprecated in recent dplyr versions
credit_trunc <- credit %>%
mutate_at(vars(Credit.Amount, Creditability), fun)
or
credit_trunc <- credit %>%
mutate_all(fun)
or if you also have columns of another type (e.g. factor, character) in your data frame, you can use:
credit_trunc <- credit %>%
mutate_if(is.numeric, fun)
This will give you back the data frame with the chosen / all columns / all numeric columns modified as you wanted it.
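The same idea can be sketched in base R without dplyr. This is only an illustration on made-up data: `winsorize` is a renamed variant of the question's function that uses pmin/pmax, the `na.rm = TRUE` is an added assumption, and `df`, `amount` and `group` are hypothetical names.

```r
# Clamp values to the 2.5% and 97.5% quantiles of the vector
winsorize <- function(x) {
  q <- quantile(x, c(0.025, 0.975), na.rm = TRUE)
  pmin(pmax(x, q[[1]]), q[[2]])
}

# Hypothetical data: one numeric column with an outlier, one character column
df <- data.frame(
  amount = c(1:99, 1000),
  group  = rep(c("a", "b"), 50),
  stringsAsFactors = FALSE
)

# Apply the function to numeric columns only, leaving the rest untouched
num <- vapply(df, is.numeric, logical(1))
df[num] <- lapply(df[num], winsorize)

range(df$amount)
```

Assigning into `df[num]` keeps the data frame structure and column order, which plays the role of mutate_if(is.numeric, ...) here.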

Cross tabulating missing values in SparkR data frame across all columns

I'm interested in arriving at a cross-tab of missing values across all columns in SparkR data frame. The data I'm trying to utilise can be generated with use of the code below:
Data
set.seed(2)
# Create basic matrix
M <- matrix(
nrow = 100,
ncol = 100,
data = base::sample(x = letters, size = 1e4, replace = TRUE)
)
## Force missing values
M[base::sample(1:nrow(M), 10),
base::sample(1:ncol(M), 10)] <- NA
table(is.na(M))
SparkR
Following this answer, I would like to arrive at the desired solution using flatMap. The idea is to replace missing / non-missing values with T/F and then count occurrences for each variable. First, it appears that flatMap is not exported by SparkR 2.1, so I had to dig it out with :::
# Import data to SparkR ---------------------------------------------------
# Feed data into SparkR
dtaSprkM <- createDataFrame(sqc, as.data.frame(M))
## Preview
describe(dtaSprkM)
# Missing values count ----------------------------------------------------
# Function to convert missing to T/F
convMiss <- function(x) {
ifelse(test = isNull(x),
yes = FALSE,
no = TRUE)
}
# Apply
dtaSprkMTF <- SparkR:::flatMap(dtaSprkM, isNull)
## Derive data frame
dtaSprkMTFres <- createDataFrame(sqc, dtaSprkMTF)
Second, the code fails with the following error message:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘isNull’ for signature ‘"list"’
Desired results
On an ordinary data frame in R the desired results can be achieved in the following manner
sapply(as.data.frame(M), function(x) {
prop.table(table(is.na(x)))
})
I like the flexibility that table and prop.table offer and ideally I would like to be able to arrive at similar flexibility via SparkR.
Compute the fraction of non-NULL values per column:
fractions <- select(dtaSprkM, lapply(columns(dtaSprkM), function(c)
alias(avg(cast(isNotNull(dtaSprkM[[c]]), "integer")), c)
))
This will create a single-row DataFrame which can be safely collected and easily reshaped locally, for example with tidyr:
library(tidyr)
fractions %>% as.data.frame %>% gather(variable, fraction_not_null)
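For checking the collected result, the same per-column fraction can be computed locally on the original matrix. This is a plain R sketch that does not use SparkR; it rebuilds `M` as in the question, and `frac_not_na` is a hypothetical name.

```r
set.seed(2)

# Rebuild the example matrix from the question
M <- matrix(sample(letters, 1e4, replace = TRUE), nrow = 100, ncol = 100)

# Force missing values in 10 rows of 10 columns
M[sample(nrow(M), 10), sample(ncol(M), 10)] <- NA

# Fraction of non-missing values per column, computed locally
frac_not_na <- colMeans(!is.na(as.data.frame(M)))

# Columns that contain at least one missing value
frac_not_na[frac_not_na < 1]
```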

How to calculate an overall mean from more than two columns in a data frame?

I would like to get a single mean value from selected columns in a data frame, but it doesn't work with two or more columns. I tried this:
testDF <- data.frame(v1 = c(1,3,15,7,18,3,5,NA,4,5,7,9),
v2 = c(11,33,55,7,88,33,55,NA,44,5,67,99),
v3 = c(NA,33,5,77,88,3,55,NA,4,55,87,14))
mean(testDF[,2:3], na.rm=T)
and I get this warning message:
mean(testDF[,2:3], na.rm=T)
[1] NA
Warning message:
In mean.default(testDF[, 2:3], na.rm = T) :
argument is not numeric or logical: returning NA
If I use the sum() function it works perfectly, but I don't understand why it doesn't work with the mean() function. After some steps I got it working with the melt() function from the reshape2 package, but I'm looking for a shorter, simpler way because I have a lot of variables and data.
Regards
The help for mean says:
Currently there are methods for numeric/logical vectors and date, date-time and time interval objects.
which makes me think that mean does not work on data frames.
Indeed, you will see that mean(testDF) produces the same warning, while mean(testDF[,1]) works.
The easiest solution is to do:
mean(as.matrix(testDF[,2:3]), na.rm=T)
Also, you can use colMeans to get the mean of each column.
Indeed, if you look at the source for colMeans, the first lines are:
if (is.data.frame(x))
x <- as.matrix(x)
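The options above can be compared side by side on the question's testDF. In this sketch the variable names (`cols`, `overall_matrix` and so on) are illustrative. Note that sum() succeeds on a data frame because base R provides a Summary method for data frames, while mean() has no data frame method.

```r
testDF <- data.frame(v1 = c(1,3,15,7,18,3,5,NA,4,5,7,9),
                     v2 = c(11,33,55,7,88,33,55,NA,44,5,67,99),
                     v3 = c(NA,33,5,77,88,3,55,NA,4,55,87,14))
cols <- testDF[, 2:3]

# One overall mean across every cell of the selected columns
overall_matrix <- mean(as.matrix(cols), na.rm = TRUE)
overall_unlist <- mean(unlist(cols), na.rm = TRUE)  # same value, no matrix needed

# One mean per column
per_column <- colMeans(cols, na.rm = TRUE)

# Averaging the column means is NOT the overall mean when NA counts differ
mean_of_means <- mean(per_column)

print(c(overall = overall_matrix, mean_of_means = mean_of_means))
```

Here v2 has 11 non-missing values and v3 has 10, so the mean of the column means weights the columns differently from the overall mean.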
