I need your help here. I need to calculate variance manually in R. I have achieved it with the code below, but it is not robust enough for missing values and non-numeric data types.
a= c(1,2,3,4,5)
k=mean(a,na.rm = T)
storage=a
for(i in 1:length(a)) {
storage[i]= ((a[i]-k)^2)
}
storage =sum((storage)/(length(a)-1))
storage
I run into trouble when I have a= c(1,2,3,4,5,c,NA)
How should I edit the code?
First, a few observations:
In R, you can do an operation on the whole vector. E.g. (c(1, 2, 3))^2 yields 1 4 9. There's no need to use a for loop.
mean isn't the only function that needs na.rm = TRUE; sum does too.
In R, atomic vectors (which are pretty much all vectors that aren't a list) can only hold elements of a single data type. There are four primary types: logical, integer, double and character. If a vector mixes types, all the elements are coerced to the most general type present, with precedence character → double → integer → logical. For example, c(1, 'c') returns the character vector "1", "c". That's why you were having trouble. (Note: an NA in a vector takes on the type of that vector.)
Unfortunately, for that specific vector, c(1,2,3,4,5,c,NA), I don't think there's a simple way to coerce it to a numeric vector. That's because it's a list with a function as one of its elements: the function c().
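The observations above can be checked directly at the console. This is only a small sketch; the names v1, v2 and sq are made up for illustration.

```r
# Mixed numbers and strings: everything is coerced to character
v1 <- c(1, "c")
typeof(v1)            # "character"

# Numbers plus the bare name c (a function): the result is a list,
# not an atomic vector
v2 <- c(1, 2, 3, 4, 5, c, NA)
typeof(v2)            # "list"
is.function(v2[[6]])  # TRUE

# Vectorised arithmetic, no loop needed
sq <- c(1, 2, 3)^2    # 1 4 9
```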
However, this function works whenever x is an atomic vector:
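variance <- function(x){
  x <- as.numeric(x)   # non-numeric strings become NA (with a warning)
  x <- na.omit(x)      # drop the NAs once, up front
  m <- mean(x)
  # sample variance: sum of squared deviations over n - 1
  sum((x - m)^2) / (length(x) - 1)
}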
First we coerce the vector to numeric, so we can deal with a vector like c(1, 2, 'a'). Then we remove the NA's, so we don't have to write na.rm = TRUE in mean and sum. Then we just write down the formula.
A minor inconvenience is that when converting a character vector to numeric, we get a warning saying that NAs were generated. This can be solved if we write x = suppressWarnings(as.numeric(x)) instead.
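As a rough sketch of that variant (the name variance_quiet is hypothetical, and the NAs are dropped by plain subsetting rather than na.omit), the whole thing could look like this, with the built-in var() used only as a check:

```r
# Hypothetical quiet version: coercion warnings are suppressed
variance_quiet <- function(x) {
  x <- suppressWarnings(as.numeric(x))  # "c" becomes NA silently
  x <- x[!is.na(x)]                     # drop all NAs
  sum((x - mean(x))^2) / (length(x) - 1)
}

a <- c(1, 2, 3, 4, 5, "c", NA)  # a character vector because of "c"
variance_quiet(a)               # 2.5
var(1:5)                        # 2.5, the built-in result for comparison
```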
If you want your function to be able to handle lists with functions, let me know.
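One possible way to handle such lists, offered only as a sketch: keep the elements that are single numbers and discard everything else. The helper name to_numeric is made up, and it makes an assumption you may not want: inside a list, strings such as "3" are dropped rather than converted.

```r
# Hypothetical helper: pull the usable numbers out of a vector or a list
to_numeric <- function(x) {
  if (is.list(x)) {
    # keep only list elements that are a single numeric value
    keep <- vapply(x, function(e) is.numeric(e) && length(e) == 1L, logical(1))
    x <- unlist(x[keep])
  } else {
    x <- suppressWarnings(as.numeric(x))
  }
  x[!is.na(x)]
}

a <- c(1, 2, 3, 4, 5, c, NA)  # a list: 5 numbers, a function, a logical NA
vals <- to_numeric(a)
sum((vals - mean(vals))^2) / (length(vals) - 1)  # 2.5
```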
You are using a for loop, but that is unnecessary. You can write a vectorised function whose first step removes the NAs and non-numeric entries by converting to character and then to numeric (the detour through character is needed because c is a function, so x2 below is a list):
# Create data
set.seed(1)
x1 <- sample(1:10, 5)
x2 <- c(x1, c, NA)
# Make the function
varFunc <- function(x){
  # Convert to character then numeric (non numeric become NA)
  x <- as.numeric(as.character(x))
  # Remove NAs
  x <- x[!is.na(x)]
  # Return Variance
  sum((x - mean(x))^2) / (length(x) - 1)
}
# Use the function
varFunc(x1)
varFunc(x2)
# Sanity check: both results above should equal var(x1).
# var(x2, na.rm = TRUE) would fail because x2 is a list, not an atomic vector.
var(x1)
One possible approach: first, clean up a. If you start with something like a = c(1, 2, 3, 4, 5, "c", NA), then a will not be stored as a numeric variable (because of the non-numeric entry). You might first coerce it to a numeric vector, which will give an extra NA entry:
a = c(1, 2, 3, 4, 5, "c", NA)
a <- as.numeric(a)
a
## 1 2 3 4 5 NA NA
Then, you could keep only the entries that are not NA (by negating is.na() with !):
a <- a[!is.na(a)]
a
## 1 2 3 4 5
You could do these right after your initial declaration of a, for instance. Gregor Thomas also suggested na.omit(), which could work if combined properly with as.numeric().
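For what it's worth, here is one way the na.omit() route might be combined with as.numeric(). It is a sketch, not necessarily what the commenter had in mind; the names a_num and a_clean are made up.

```r
a <- c(1, 2, 3, 4, 5, "c", NA)

# Coerce first; "c" turns into NA (warning suppressed here)
a_num <- suppressWarnings(as.numeric(a))

# na.omit() drops the NAs but attaches an "na.action" attribute;
# the outer as.numeric() strips it again, leaving a plain vector
a_clean <- as.numeric(na.omit(a_num))
a_clean
```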
I notice that you computed the mean by using the built-in mean() function and using na.rm = T... if you're able to use that same approach here, note that var() also has an optional na.rm = T parameter. I suspect you're not allowed to use it since you were instructed to compute the variance by hand, but perhaps you could use this to check your answers.
Related
I have a dataset which contains about 40 different variables. Now I would like to create a new variable indicating whether each observation is above or below the median.
I managed to create a new variable "var1_mediansplit" from the existing "var1" (values 1 for below median, 2 for everything else):
mydata$var1_mediansplit <- ifelse(mydata$var1 < median(mydata$var1), "1", "2")
I am looking for a way to run it through several variables (with a loop, I would guess). I appreciate any help!
Edit: The solution from jblood94 worked for me, so thank you!
Using the colMedians and eachrow from the Rfast package:
library(Rfast)
df <- as.data.frame(matrix(runif(4000), ncol = 40)) # dummy data
m <- as.matrix(df)
df2 <- as.data.frame((eachrow(m, colMedians(m), "-") >= 0) + 1)
Detailed explanation:
colMedians(m) returns the median of each column (a vector of length 40).
eachrow takes a matrix for the first argument, a vector for the second argument (with the same length as the number of columns in the matrix), and an operator for the third argument. Each row of the matrix has the vector applied to it according to the operator. So here, the colMedians(m) vector is subtracted from each row of m.
The result of eachrow is compared to 0 (FALSE if it is less than 0, TRUE otherwise).
Operating on a logical with a numeric will coerce it to numeric: FALSE + 1 = 1, and TRUE + 1 = 2.
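If Rfast is not available, the same steps can be sketched in base R, swapping in apply() for colMedians and sweep() for eachrow. This is a substitute technique and will likely be slower on large matrices; the names meds and med_split and the matrix size are illustrative.

```r
set.seed(1)
m <- matrix(runif(200), ncol = 4)    # dummy data: 50 rows, 4 columns

meds <- apply(m, 2, median)          # median of each column

# Subtract each column's median from that column, compare to 0,
# then add 1 so that FALSE/TRUE become 1/2
med_split <- (sweep(m, 2, meds, "-") >= 0) + 1
head(med_split)
```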
Don't overcomplicate it, consider this:
v <- c(1:100)
x <- median(v)
y <- v >= x
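To carry the same idea across several columns of a data frame, a short loop over column names is probably enough. The sketch below uses dummy data; mydata, vars and the "_mediansplit" suffix mirror the question, everything else is assumed.

```r
set.seed(42)
mydata <- data.frame(var1 = runif(10), var2 = runif(10), var3 = runif(10))

vars <- c("var1", "var2")   # the columns to split (choose your own)
for (v in vars) {
  med <- median(mydata[[v]], na.rm = TRUE)
  # 1 = below the median, 2 = everything else
  mydata[[paste0(v, "_mediansplit")]] <- ifelse(mydata[[v]] < med, 1L, 2L)
}
names(mydata)
```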
I have a large list with many sublists, each of which is a vector of values. I want to apply CJ, a fast form of expand.grid, to this list, but the function fails when one of the sublists is integer(0). How can I convert Z so that every sublist that is integer(0) is turned into something the function below accepts? I know I could check length(Z[[4]]), but I need a method that works for lists with thousands of elements, a few of which may be integer(0), so I want to convert any such elements in Z at once.
Z <- list (c(1,2,3,4,3,2,1,2),c(1,2,3,4),c(5,6,4),c(integer(0)))
do.call ( CJ , args = Z ) # get all combinations
Is there any way to convert Z as a whole so that the function works on it without raising an error?
# The desired output is the same as if the last sublist were a numeric 0, so that it is represented in the fast expand.grid.
Z <- list (c(1,2,3,4,3,2,1,2),c(1,2,3,4),c(5,6,4),c((0)))
do.call(CJ,Z)
The CJ function comes from data.table, so it is worth adding that tag to the question.
There is an open feature request to make CJ a generic method, so that it could handle different types separately.
Below is a function that addresses your question.
library(data.table)
f = function(x){
stopifnot(is.list(x))
ll = sapply(x, length)
if(any(ll == 0L)) x[ll == 0L] = 0L
do.call(CJ, args = x)
}
x = list(c(1,2,3,4,3,2,1,2),c(1,2,3,4),c(5,6,4),c(integer(0)))
f(x)
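The replacement step on its own can also be written with lengths(). To keep this sketch free of dependencies it uses base expand.grid to show the effect; with data.table loaded, the repaired list should be usable in do.call(CJ, Z) the same way. The shortened Z is made up for illustration.

```r
Z <- list(c(1, 2, 3), c(5, 6), integer(0))

# Replace every zero-length element with a single 0
Z[lengths(Z) == 0L] <- list(0L)

combos <- expand.grid(Z)  # base R stand-in for CJ
nrow(combos)              # 6 = 3 * 2 * 1
```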
I've got a list of lists of bootstrap statistics from a function that I wrote in R. The main list has the 1000 bootstrap iterations. Each element within the list is itself a list of three things, including fitted values for each of the four variables ("fvboot" -- a 501x4 matrix).
I want to make a vector of the values for each position on the grid of x values, from 1:501, and for each variable, from 1:4.
For example, for the ith point on the xgrid of the jth variable, I want to make a vector like the following:
vec = bootfits$fvboot[[1:1000]][i,j]
but when I do this, I get:
recursive indexing failed at level 2
Googling around, I think I understand why R is doing this, but I haven't found how to get the ijth element of each fvboot matrix into a 1000x1 vector.
Help would be much appreciated.
Use unlist() function in R. From example(unlist),
unlist(options())
unlist(options(), use.names = FALSE)
l.ex <- list(a = list(1:5, LETTERS[1:5]), b = "Z", c = NA)
unlist(l.ex, recursive = FALSE)
unlist(l.ex, recursive = TRUE)
l1 <- list(a = "a", b = 2, c = pi+2i)
unlist(l1) # a character vector
l2 <- list(a = "a", b = as.name("b"), c = pi+2i)
unlist(l2) # remains a list
ll <- list(as.name("sinc"), quote( a + b ), 1:10, letters, expression(1+x))
utils::str(ll)
for(x in ll)
stopifnot(identical(x, unlist(x)))
This would be easier if you gave a minimal example object. In general, you cannot index lists with vectors like [[1:1000]]. I would use the plyr functions. This should do it, assuming bootfits is the list of 1000 iterations and each element holds an fvboot matrix (although I haven't tested it):
require("plyr")
laply(bootfits, function(l) l$fvboot[i, j])
If you are not familiar with plyr: I always found Hadley Wickham's article 'The split-apply-combine strategy for data analysis' very useful.
You can extract one vector at a time using sapply, e.g. for i=1 and j=1:
i <- 1
j <- 1
vec <- sapply(bootfits, function(x){x$fvboot[i,j]})
sapply carries out the function (in this case an inline function we have written) to each element of the list bootfits, and simplifies the result if possible (i.e. converts it from a list to a vector).
To extract a whole set of values as a matrix (e.g. over all the i's) you can wrap this in another sapply, but this time over the i's for a specified j:
j <- 1
mymatrix <- sapply(1:501, function(i){
sapply(bootfits, function(x){x$fvboot[i,j]})
})
Warning: I haven't tested this code but I think it should work.
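Since no example object was given, here is a small stand-in to try the idea on. It is a sketch under assumptions: 20 iterations and 3x4 matrices instead of 1000 and 501x4, and vapply in place of sapply so the result type is checked. Stacking the matrices into a 3-d array is an optional alternative.

```r
set.seed(1)
# Dummy bootfits: a list of iterations, each holding an fvboot matrix
bootfits <- lapply(1:20, function(b) {
  list(fvboot = matrix(rnorm(12), nrow = 3, ncol = 4))
})

i <- 2
j <- 3
# One value per bootstrap iteration
vec <- vapply(bootfits, function(x) x$fvboot[i, j], numeric(1))

# Alternative: stack all matrices into a 3 x 4 x 20 array, then slice
arr <- sapply(bootfits, function(x) x$fvboot, simplify = "array")
vec2 <- arr[i, j, ]
```

With the array in hand, arr[, j, ] gives all grid points for variable j across iterations in one step.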
When I pass a row of a data frame to a function using apply, I lose the class information of the elements of that row. They all turn into 'character'. The following is a simple example. I want to add a couple of years to the 3 stooges' ages. When I try to add 2 to a value that had been numeric, R says "non-numeric argument to binary operator." How do I avoid this?
age = c(20, 30, 50)
who = c("Larry", "Curly", "Mo")
df = data.frame(who, age)
colnames(df) <- c( '_who_', '_age_')
dfunc <- function (er) {
print(er['_age_'])
print(er[2])
print(is.numeric(er[2]))
print(class(er[2]))
return (er[2] + 2)
}
a <- apply(df,1, dfunc)
Output follows:
_age_
"20"
_age_
"20"
[1] FALSE
[1] "character"
Error in er[2] + 2 : non-numeric argument to binary operator
apply only really works on matrices (which have the same type for all elements). When you run it on a data.frame, it simply calls as.matrix first.
The easiest way around this is to work on the numeric columns only:
# skips the first column; inside dfunc, index by name (er['_age_']),
# because the age is now at position 1, not 2
a <- apply(df[, -1, drop=FALSE],1, dfunc)
# Or in two steps:
m <- as.matrix(df[, -1, drop=FALSE])
a <- apply(m,1, dfunc)
The drop=FALSE keeps the result a one-column data frame; without it, a single column would be dropped to a plain vector.
-1 means all but the first column. You could instead explicitly specify the columns you want, for example df[, c('foo', 'bar')].
UPDATE
If you want your function to access one full data.frame row at a time, there are (at least) two options:
# "loop" over the index and extract a row at a time
sapply(seq_len(nrow(df)), function(i) dfunc(df[i,]))
# Use split to produce a list where each element is a row
sapply(split(df, seq_len(nrow(df))), dfunc)
The first option is probably better for large data frames since it doesn't have to create a huge list structure upfront.
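A quick way to see the coercion for yourself, and a loop-free alternative for this particular task, might look like the sketch below. The column names are simplified to who and age here, which is an assumption made for readability.

```r
df <- data.frame(who = c("Larry", "Curly", "Mo"), age = c(20, 30, 50))

# apply() converts the data frame with as.matrix() first, and a matrix
# holding a character column becomes character throughout
is.character(as.matrix(df))  # TRUE

# For simple arithmetic, operate on the column directly
older <- df$age + 2
older                        # 22 32 52

# Row-by-row access that keeps column types, one index at a time
older2 <- vapply(seq_len(nrow(df)), function(i) df$age[i] + 2, numeric(1))
```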
I have a column which contains numeric as well as non-numeric values. I want to find the mean of the numeric values so that I can use it to replace the non-numeric values. How can this be done in R?
Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.
You have to unfactor and then convert to numeric. This will give you NAs for all the character strings that cannot be coerced to numbers.
nums <- as.numeric(as.character(df$x))
As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric
nums <- as.numeric(levels(df$x))[as.integer(df$x)]
To get the mean, you use mean() but pass na.rm = T
m <- mean(nums, na.rm = T)
Assign the mean to all the NA values.
nums[is.na(nums)] <- m
You could then replace the old data, but I don't recommend it. Instead just add a new column
df$new.x <- nums
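Put together on a toy column, the steps above might run as follows. The data are invented for illustration, and suppressWarnings() is added only to silence the expected coercion warning.

```r
# Toy data: a factor column mixing numbers and junk
df <- data.frame(x = c("1", "2", "abc", "4", "n/a"), stringsAsFactors = TRUE)

nums <- suppressWarnings(as.numeric(as.character(df$x)))  # junk becomes NA
m <- mean(nums, na.rm = TRUE)                             # mean of 1, 2, 4
nums[is.na(nums)] <- m                                    # fill the gaps
df$new.x <- nums
df
```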
This is a function I wrote yesterday to combat the non-numeric types. I have a data.frame with unpredictable type for each column. I want to calculate the means for numeric, and leave everything else untouched.
colMeans2 <- function(x) {
# This function tries to guess column type. Since all columns come as
# characters, it first tries to see if x == "TRUE" or "FALSE". If
# not so, it tries to coerce vector into integer. If that doesn't
# work it tries to see if there's a ' \" ' in the vector (meaning a
# column with character), it uses that as a result. Finally if nothing
# else passes, it means the column type is numeric, and it calculates
# the mean of that. The end.
# try if logical (%in% works for both character vectors and factors)
if (any(x %in% c("TRUE", "FALSE"))) return(NA)
# try if integer
try.int <- strtoi(x)
if (all(!is.na(try.int))) return(try.int[1])
# try if character
if (any(grepl("\\\"", x))) return(x[1])
# what's left is numeric
mean(as.numeric(as.character(x)), na.rm = TRUE)
# a possible warning about coerced NAs probably originates in the above line
}
You would use it like so:
apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
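If the columns still carry their real types (that is, they have not all been read in as character), a different and simpler route is to pick the numeric columns first and skip apply() altogether. This is an alternative sketch, not a drop-in replacement for colMeans2; dat, num_cols and col_means are made-up names.

```r
dat <- data.frame(
  a = c(1.5, 2.5, NA),
  b = c("x", "y", "z"),
  c = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Which columns are numeric? (logical and character columns are not)
num_cols <- vapply(dat, is.numeric, logical(1))

# Means of the numeric columns only; other columns are left untouched
col_means <- colMeans(dat[, num_cols, drop = FALSE], na.rm = TRUE)
col_means
```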
It sort of depends on what your data looks like.
Does it look like this?
data = list(1, 2, 'new jersey')
Then you could
data.numbers = sapply(data, as.numeric)
and get
c(1, 2, NA)
And you can find the mean with
mean(data.numbers, na.rm=T)
A compact conversion:
vec <- c(0:10,"a","z")
vec2 <- as.numeric(vec)
vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])
as.numeric will emit the warning message shown below and convert the non-numeric entries to NA.
Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion