I want to get the standard deviation of specific columns in a dataframe and store those standard deviations in a list in R.
The specific variable names of the columns are stored in a vector. For those specific variables (depends on user input) I want to calculate the standard deviation and store those in a list, over which I can loop then to use it in another part of my code.
I tried as follows, e.g.:
specific_variables <- c("variable1", "variable2") # can be of a different length depending on user input
data <- data.frame(...) # a dataframe with multiple columns, including "variable1" and "variable2"
sd_list <- 0 # empty variable for storage purposes
# for loop over the variables
for (i in length(specific_variables)) {
sd_list[i] <- sd(data$specific_variables[i], na.rm = TRUE)
}
print(sd_list)
I get an error.
Second attempt using colSds and sapply:
colSds(data[sapply(specific_variables, na.rm = TRUE)])
But the colSds function doesn't work (anymore?).
Ideally, I'd like to store the standard deviations of the named columns in a list.
Let's assume you have a dataframe with two columns. The easiest way is to use apply:
frame<-data.frame(X=1:6,Y=rnorm(6))
sd_list<-apply(frame,2,sd)
The "2" in apply means: calculate the sd for each column. A "1" would mean: calculate it for each row.
There is no colSds() function in base R (the matrixStats package provides one), but colMeans() and colSums() do exist.
With the help of @shghm I found a way:
sd_list <- as.list(unname(apply(data[specific_variables], 2, sd, na.rm = TRUE)))
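For reference, here is a minimal self-contained sketch of the same idea using sapply, which walks over the selected columns without first converting the data frame to a matrix. The data frame and its values are made up for illustration; only specific_variables mirrors the question.

```r
# Hypothetical example data; the real data frame comes from the question
data <- data.frame(
  variable1 = c(1, 2, 3, NA),
  variable2 = c(2, 4, 6, 8),
  other     = c("a", "b", "c", "d")
)
specific_variables <- c("variable1", "variable2")

# sapply visits each selected column; na.rm = TRUE is passed on to sd()
sd_vec  <- sapply(data[specific_variables], sd, na.rm = TRUE)
sd_list <- as.list(sd_vec)  # named list, one element per variable

# Loop over the stored values by name
for (name in names(sd_list)) {
  cat(name, ":", sd_list[[name]], "\n")
}
```

Keeping the names (instead of calling unname) makes it easier to see which value belongs to which variable when looping later.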
Related
I want to get the mean of specific columns in a dataframe and store those means in a vector in R.
The specific variable names of the columns are stored in a vector. For those specific variables (depends on user input) I want to calculate the mean and store those in a vector, over which I can loop then to use it in another part of my code.
I tried as follows, e.g.:
specific_variables <- c("variable1", "variable2") # can be of a different length depending on user input
data <- data.frame(...) # a dataframe with multiple columns, including "variable1" and "variable2"
mean_xm <- 0 # empty variable for storage purposes
# for loop over the variables
for (i in length(specific_variables)) {
mean_xm[i] <- mean(data$specific_variables[i], na.rm = TRUE)
}
print(mean_xm)
I get an error saying
Error: object of type 'closure' is not subsettable
Second attempt using sapply:
colMeans(data[sapply(data, is.numeric)])
This gives me the means of all numeric columns of the dataframe, but I only want those of the columns specified in specific_variables. Ideally, I'd like to store those means in a vector as I did in my first attempt.
We may use
v1 <- unname(colMeans(data[specific_variables], na.rm = TRUE))
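If you would rather keep the loop, here is a sketch of what probably goes wrong in the first attempt. `for (i in length(specific_variables))` visits only the last index, and `data$specific_variables` looks for a column literally called specific_variables. The 'closure' error suggests that `data` still referred to R's built-in data() function at that point, i.e. the data frame had not been created. The example data below is made up.

```r
# Hypothetical example data
data <- data.frame(
  variable1 = c(1, 2, NA),
  variable2 = c(10, 20, 30),
  variable3 = c(5, 5, 5)
)
specific_variables <- c("variable1", "variable2")

mean_xm <- numeric(length(specific_variables))  # preallocate the result
for (i in seq_along(specific_variables)) {
  # [[ ]] looks a column up by the name stored in the vector
  mean_xm[i] <- mean(data[[specific_variables[i]]], na.rm = TRUE)
}
print(mean_xm)

# The vectorised version from the answer gives the same values
v1 <- unname(colMeans(data[specific_variables], na.rm = TRUE))
```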
I have a data set with Air Quality Data. The Data Frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
Finally, I want to do that using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean. Though the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the mean of the corresponding column. By default, na.aggregate uses FUN = mean.
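If you prefer not to load a package, the same column-wise mean imputation can be sketched in base R. This is only an illustration of the idea, not how zoo implements it, and the data frame below is made up.

```r
# Hypothetical all-numeric data frame with some missing values
df1 <- data.frame(a = c(1, NA, 3), b = c(NA, 4, 8))

# Replace each NA with the mean of the non-missing values in its column;
# df1[] <- keeps the data.frame structure while replacing the columns
df1[] <- lapply(df1, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})
print(df1)
```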
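I can't see your data, but probably something like this? The vector needs to be initialized before the loop, and each value has to be appended rather than overwriting the previous one. It is better to avoid loops in R when you can...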
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
if(!is.na(myDataFrame[i,1])){
New.Vec <- c(New.Vec, myDataFrame[i,1])
}
}
print(mean(New.Vec))
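Applied to the reproducible example from the question (the 'y' vector instead of the CSV file), the corrected loop can be checked against the one-liner. This is a sketch using the question's own data.

```r
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)
x <- matrix(y, nrow = 15)

New.Vec <- c()                      # initialise before the loop
for (i in seq_len(nrow(x))) {
  if (!is.na(x[i, 1])) {
    New.Vec <- c(New.Vec, x[i, 1])  # append rather than overwrite
  }
}

loop_mean   <- mean(New.Vec)
direct_mean <- mean(x[, 1], na.rm = TRUE)
print(c(loop_mean, direct_mean))    # both are 7.5
```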
I want to write a function that dynamically uses different correlation methods depending on the scale of measure of the feature (continuous, dichotomous, ordinal). The label is always continuous. My idea was to use the apply() function to iterate over every feature (i.e. column), check its scale of measure (numeric, factor with two levels, factor with more than two levels) and then use the appropriate correlation function. Unfortunately my code seems to convert every feature into a character vector, and as a consequence the condition in the if statement is always false for every column. I don't know why my code is doing this. How can I prevent it from converting my features to character vectors?
set.seed(42)
foo <- sample(c("x", "y"), 200, replace = T, prob = c(0.7, 0.3))
bar <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.5,0.05,0.1,0.1,0.25))
y <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.25,0.1,0.1,0.05,0.5))
data <- data.frame(foo,bar,y)
features <- data[, !names(data) %in% 'y']
dyn.corr <- function(x,y){
# print out structure of every column
print(str(x))
# if feature is numeric and has more than two outcomes use corr.test
if(is.numeric(x) & length(unique(x))>2){
result <- corr.test(x,y)[['r']]
} else {
result <- "else"
}
}
result <- apply(features,2,dyn.corr,y)
apply is built for matrices. When you apply to a data frame, the first thing that happens is coercing your data frame to a matrix. A matrix can only have one data type, so all columns of your data are converted to the most general type among them when this happens.
Use sapply or lapply to work with columns of a data frame.
This should work fine (I tried to test, but I don't know what package to load to get the corr.test function). Note that the label in your example is called y:
result <- sapply(features, dyn.corr, y)
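To see the coercion for yourself, compare what each function reports as the class of every column. This is a small made-up data frame, not the one from the question; stringsAsFactors = TRUE is set explicitly so the result does not depend on the R version.

```r
features <- data.frame(
  foo = c("x", "y", "x"),
  bar = c(1, 2, 3),
  stringsAsFactors = TRUE
)

# apply() converts the data frame with as.matrix() first,
# so every column arrives in the function as character
apply_classes <- apply(features, 2, class)

# sapply() treats the data frame as a list of columns,
# so each column keeps its own type
sapply_classes <- sapply(features, class)

print(apply_classes)   # "character" for both columns
print(sapply_classes)  # foo: "factor", bar: "numeric"
```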
I have a data.frame where each column represents a different individual and each row represents different food items eaten.
My goal is to resample each column via bootstrapping and then calculate a metric score and C.I.s for each individual (data column) using a defined function.
I have done this successfully on a single vector but cannot figure out how to apply the bootstrapping and metric function to individual columns in a data frame. Below is the code I have to apply it to a single vector:
data.1 <- c(10, 50, 200, 54, 6) ## example vector
## create function
metric.function <- function(x){
p <- x/sum(x)
dap <- 1/sum(p^2)
return(dap)
}
vect <- c() ## empty vector for bootstrap data
for (i in 1:1000){
data.2 <- sample(data.1, replace = TRUE) ##bootstrap sample ##
vect[i] <- metric.function (data.2) ## apply metric.function ##
}
summary(vect) ## summary
quantile(vect, probs = c(0.025, 0.975)) ## C.I.
This works fine for a single vector but I want to apply it independently to multiple columns in a data frame, for example in the example.df below I want to apply it to x1:x10 independently resulting in 10 metric scores and 10 C.I.s
example.df<-data.frame(replicate(10,sample(0:50,10,rep=TRUE)))
I have tried changing the vector item to a data.frame and messing around with apply and dply but cannot figure it out. Can anyone suggest how to do it, or point me in the direction of a useful guide/website?
This is a perfect chance to use replicate and sapply.
replicate(1000, sapply(example.df, function(x)
metric.function(sample(x, replace = TRUE))))
sapply will operate column-wise (given that a data.frame is in a sense a list of columns); once we've isolated a column within sapply, we need only resample it & apply our metric.
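To get from the replicates to one confidence interval per individual, one option is to take quantiles across the rows of the resulting matrix. A sketch under some assumptions: the data here is made up (4 columns, counts of at least 1 so that no resample can sum to zero), and set.seed is only there for reproducibility.

```r
set.seed(1)  # for reproducibility only

metric.function <- function(x) {
  p <- x / sum(x)
  1 / sum(p^2)
}

# Hypothetical data: 4 individuals (columns), 10 food items (rows)
example.df <- data.frame(replicate(4, sample(1:50, 10, replace = TRUE)))

# Result has one row per individual and one column per bootstrap replicate
boot <- replicate(1000, sapply(example.df, function(x)
  metric.function(sample(x, replace = TRUE))))

# 2.5% and 97.5% quantiles of the bootstrap distribution for each individual
ci <- apply(boot, 1, quantile, probs = c(0.025, 0.975))
print(t(ci))
```

rowMeans(boot) or apply(boot, 1, summary) would give the per-individual summaries in the same way.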
I am new to R and I tried to use a function that tests for outliers in a large dataframe with over 600 variables, all numeric except for the last 2 columns. I tried the outlier function in the outliers package to test one column at a time, but I ended up with a numeric vector which I could not use. Is there a better way to identify all outliers in a dataframe?
myout <- c()
for (i in 1:dim(training)[2]){
if (is.numeric(training[,i])) {
myout <- c(myout,outlier(training[,i])) }
}
As you can read in the helpfile of outlier it finds one value for each variable, the one that differs the most from the mean. I think what you want is finding for each variable the index of all data points that are outliers. This can be done in the following way (of course you need to remove your non-numeric variables first):
# first write a custom function that returns the index of all outliers
# I define an outlier as 3 sd's away from the mean, you can adjust that
is.outlier <- function(x) which(abs(x - mean(x)) > 3*sd(x))
# turn the df into a list, and apply the function to each variable with lapply
df.as.list <- as.list(df) # enter the name of your data frame instead of df
lapply(df.as.list, is.outlier)
It will return a list in which element i holds the indices of the outliers of the variable in column i.
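A small worked example of this approach, with made-up data. Two things here are my own additions rather than part of the answer above: na.rm = TRUE inside the function, and selecting the numeric columns with sapply before applying it.

```r
# Indices of values more than 3 sd's away from the mean (NAs ignored)
is.outlier <- function(x) {
  which(abs(x - mean(x, na.rm = TRUE)) > 3 * sd(x, na.rm = TRUE))
}

# Hypothetical data frame: two numeric columns and one non-numeric column
training <- data.frame(
  a     = c(rep(10, 20), 100),  # the last value is far from the rest
  b     = 1:21,                 # no outliers
  label = rep("z", 21)
)

# Keep only the numeric columns, then apply the function to each one
numeric_cols <- training[sapply(training, is.numeric)]
outlier_idx  <- lapply(numeric_cols, is.outlier)
print(outlier_idx)  # a: row 21, b: none
```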
You may not actually want to remove outliers, but per this answer from 2 years ago, you can drop them from a vector x with boxplot.stats:
x[!x %in% boxplot.stats(x)$out]
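A quick illustration on a made-up vector. boxplot.stats flags points beyond 1.5 times the hinge spread by default, which is a different rule from the 3-sd one.

```r
x <- c(1:10, 100)

out     <- boxplot.stats(x)$out  # values flagged as outliers
x_clean <- x[!x %in% out]        # everything except the flagged values

print(out)      # 100
print(x_clean)  # 1 to 10
```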