How can I get each numeric column's mean in one data? - r

I have a data set named cluster_1. Its first three columns hold nominal variables.
# select the columns based on the clustering results
cluster_1 <- mat[which(groups==1),]
m_cluster_1 <- mean(cluster_1[c(-(1:3))])
With the last statement I get one mean over all the columns. What I want instead is the mean of each variable (column), attached to the bottom of that column.
How can I do that? Please let me know.

colMeans() will give you the mean of each column in a data frame or matrix. And rbind() can be used to append the result.
rbind(cluster_1[, -(1:3)], colMeans(cluster_1[, -(1:3)]))
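As a minimal sketch of the approach above, the following uses a small made-up stand-in for cluster_1 (the column names and values are assumptions, not the asker's data) and appends the column means as a final row:

```r
# Hypothetical stand-in for cluster_1: three nominal columns, then numeric ones
cluster_1 <- data.frame(
  id     = c("a", "b", "c"),
  site   = c("x", "y", "x"),
  grp    = c("g1", "g1", "g2"),
  height = c(1, 2, 3),
  weight = c(10, 20, 60)
)

# Drop the three nominal columns, keeping only the numeric ones
num <- cluster_1[, -(1:3)]

# One mean per numeric column
col_means <- colMeans(num)

# Append the means as an extra row and label that row
with_means <- rbind(num, col_means)
rownames(with_means)[nrow(with_means)] <- "mean"

print(with_means)
```

If the nominal columns have to stay in the result, the appended row would also need placeholder values (for example NA) in those three positions.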

A generalization of what you are doing can be found with the function addmargins. Try, for example:
cluster_1Means <- addmargins(cluster_1[, -(1:3)], margin = 1, FUN = mean)
cluster_1Means

Related

How to get standard deviation of multiple columns in R?

I want to get the standard deviation of specific columns in a data frame and store those values in a list in R.
The specific variable names of the columns are stored in a vector. For those specific variables (depends on user input) I want to calculate the standard deviation and store those in a list, over which I can loop then to use it in another part of my code.
I tried as follows, e.g.:
specific_variables <- c("variable1", "variable2") # can be of a different length depending on user input
data <- data.frame(...) # this is a dataframe with multiple columns, of which "variable1" and "variable2" are both columns from
sd_list <- 0 # empty variable for storage purposes
# for loop over the variables
for (i in length(specific_variables)) {
sd_list[i] <- sd(data$specific_variables[i], na.rm = TRUE)
}
print(sd_list)
I get an error.
Second attempt using colSds and sapply:
colSds(data[sapply(specific_variables, na.rm = TRUE)])
But the colSds function doesn't work (anymore?).
Ideally, I'd like to store the standard deviations of certain columns, selected by name, in a list.
Let's assume you have a data frame with two columns. The easiest way is to use apply:
frame<-data.frame(X=1:6,Y=rnorm(6))
sd_list<-apply(frame,2,sd)
The "2" in apply means: calculate the sd for each column. A "1" would mean: calculate it for each row.
Base R has no colSds() function, although colMeans() and colSums() do exist; colSds() is provided by the matrixStats package.
With the help of @shghm I found a way:
sd_list <- as.list(unname(apply(data[specific_variables], 2, sd, na.rm = TRUE)))
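For context, the original loop fails for two reasons: `for (i in length(specific_variables))` visits only the last index, and `data$specific_variables[i]` looks for a column literally named specific_variables, which does not exist. The sketch below, on made-up data (the values are assumptions), shows a corrected loop next to an lapply() version that returns a named list directly:

```r
# Hypothetical data; only the column names follow the question
data <- data.frame(
  variable1 = c(1, 2, 3, NA),
  variable2 = c(2, 4, 6, 8),
  other     = c(5, 5, 5, 5)
)
specific_variables <- c("variable1", "variable2")

# lapply() over the selected columns gives a list named after the columns
sd_list <- lapply(data[specific_variables], sd, na.rm = TRUE)

# Equivalent loop: iterate over the names and use [[ ]] to look up
# a column whose name is stored in a variable
sd_loop <- vector("list", length(specific_variables))
names(sd_loop) <- specific_variables
for (v in specific_variables) {
  sd_loop[[v]] <- sd(data[[v]], na.rm = TRUE)
}

print(sd_list)
```

Unlike apply(), lapply() does not convert the data frame to a matrix first, which matters if the selected columns could have different types.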

How can I insert a data frame in a function and then group by groups with tapply

I am new to programming in R, and I have written a function that returns some basic statistics for a list or vector passed to it. The problem comes when I want to pass it a data frame.
The data frame has 2 columns: the first identifies a group (1 or 2) and the second holds skull widths in cm (numeric values). I would like to compute the mean, mode, median, quartiles and so on (everything inside the function) for each group separately, so that I can later compare groups 1 and 2.
My idea was to reuse the function I had written for lists and vectors and do the grouping with tapply, but that gives me this error in the console:
Error in tapply(archivo, archivo$`Época histórica`, descriptive_statistics) :
arguments must have same length
Here you have the function and the tapply that I did:
descriptive_statistics = function(x){
  # modes() is not a base R function; it is assumed to be defined elsewhere
  result <- list(
    mean(x), exp(mean(log(x))), median(x), modes(x),
    (range(x)[2] - range(x)[1]), var(x), sqrt(var(x)), sqrt(var(x)) / mean(x)
  )
  names(result) <- c('Arithmetic mean', 'Geometric mean', 'Median', 'Mode', 'Range', 'Variance', 'Standard deviation', 'Pearsons coefficient of variation')
  result
}
tapply(archivo, archivo$`Época histórica`, descriptive_statistics)
How could I improve my function so that it accepts data frames? Or what could I change in the tapply call to make it work? Can someone give me a hand with this? I am also open to other ideas. I have tried aggregate with the summary function inside it, but that does not give me the statistics I want, such as Pearson's coefficient.
Thank you very much in advance, greetings
Pass a column of the data frame to tapply instead of the whole data frame. You haven't shared your data, so it is difficult to give a specific answer, but let's assume the numeric column is called col1. In that case you can do:
tapply(archivo$col1, archivo$`Época histórica`, descriptive_statistics)
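The "arguments must have same length" error arises because tapply() compares the length of its first argument with the length of the grouping vector, and the length of a data frame is its number of columns, not its number of rows. A minimal sketch on made-up data follows; the column names epoca and ancho, the values, and the reduced basic_stats() helper are all assumptions for illustration:

```r
# Hypothetical data: a group label (1 or 2) and a skull width in cm
archivo <- data.frame(
  epoca = c(1, 1, 1, 2, 2, 2),
  ancho = c(10, 12, 14, 20, 22, 27)
)

# Reduced stand-in for descriptive_statistics(), using base R functions only
basic_stats <- function(x) {
  list(
    mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    cv     = sd(x) / mean(x)  # coefficient of variation
  )
}

# Pass the numeric column, not the whole data frame
by_group <- tapply(archivo$ancho, archivo$epoca, basic_stats)

# Each element holds the statistics of one group
group1 <- by_group[[1]]
group2 <- by_group[[2]]
print(group1$mean)
print(group2$mean)
```

Because the function returns a list, tapply() yields one list of statistics per group, which can then be compared element by element.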

Finding Mean of a column in an R Data Set, by using FOR Loops to remove Missing Values

I have a data set with air quality data. The data frame has 153 rows and 5 columns.
I want to find the mean of the first column in this data frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally, I want to do that using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean. Though the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for a for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the mean of the corresponding column. By default, na.aggregate uses mean as its FUN argument.
I can't see your data, but probably something like this? The vector needs to be initialized before the loop. It is better to avoid loops in R when you can.
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
  if(!is.na(myDataFrame[i,1])){
    New.Vec <- c(New.Vec, myDataFrame[i,1])
  }
}
print(mean(New.Vec))
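A variant that avoids growing a vector is to keep a running total and a count of the non-missing values. The sketch below reuses the y vector from the question; the names total, count and loop_mean are illustrative only:

```r
# Reproducible example from the question
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)
x <- matrix(y, nrow = 15)

total <- 0  # running sum of the non-missing values
count <- 0  # number of non-missing values seen so far

for (i in seq_len(nrow(x))) {
  if (!is.na(x[i, 1])) {
    total <- total + x[i, 1]
    count <- count + 1
  }
}

loop_mean <- total / count

# Cross-check against the built-in approach
builtin_mean <- mean(x[, 1], na.rm = TRUE)

print(loop_mean)
print(builtin_mean)
```

Using seq_len(nrow(x)) rather than a hard-coded 1:15 keeps the loop correct if the number of rows changes.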

Subtract a row value from the mean of the corresponding column and then sum the differences together

Here is a reproducible example:
c = c(1,2,3,4)
d = c(4,1,2,4)
e = c(2,1,5,4)
f = c(2,3,3,4)
tdf <- data.frame(c,d,e,f)
I can't figure out how to subtract the mean of each column from every value in that column, then sum these differences within each column and save the results.
Basically, I want to compute summation(xi - xavg) for each column. I would really appreciate any help. Thank you.
The apply() family of functions will solve this. sapply applies a function to each column of a data.frame and returns the results, so simply pass it the data frame and the function you want performed:
sapply(tdf, function(x) sum(x-mean(x)))
An option would be to replicate the colMeans so that the dimensions match those of the original data, take the difference, and find the sum of each column with colSums:
colSums(tdf - colMeans(tdf)[col(tdf)])
Another option is to take the transpose of 'tdf', subtract colMeans from it, and then take the rowSums:
rowSums(t(tdf) - colMeans(tdf))
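One thing worth checking before using any of these: the deviations from a column's mean always sum to zero (up to floating-point rounding), so summation(xi - xavg) carries no information about spread. The sketch below verifies this on the question's data and, as an assumption about what may actually be wanted, also computes the sum of squared deviations:

```r
# Data from the question, built directly as a data frame
tdf <- data.frame(
  c = c(1, 2, 3, 4),
  d = c(4, 1, 2, 4),
  e = c(2, 1, 5, 4),
  f = c(2, 3, 3, 4)
)

# Sum of deviations from the column mean, column by column
dev_sums <- sapply(tdf, function(x) sum(x - mean(x)))

# Same quantity via centering with scale() and then colSums()
dev_sums_scaled <- colSums(scale(tdf, center = TRUE, scale = FALSE))

# Sum of squared deviations, which does measure spread
sq_dev_sums <- sapply(tdf, function(x) sum((x - mean(x))^2))

print(dev_sums)
print(sq_dev_sums)
```

If the goal is a measure of variability, the squared (or absolute) deviations are the usual choice; dividing the squared sum by n - 1 gives the same value as var().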

R - Summation of data frame columns changes data type

I have a data frame of 15 columns where the first column is an integer and the others are numeric. I have to generate a one-line summary containing the sum of every column except the last one, plus the mean of the last column. So, I am doing something like this:
summary <- c(sum(df$col1), ... mean(df$col15))
The summary then shows every value with two decimal places, even for the integer column (the first one). I have been trying the round function to fix this. I would understand the conversion if different types were being added, e.g. 1 + 1.0, but in this case shouldn't the summation keep the data type?
Please let me know what I am missing.
If you are looking for a one-line summary:
lst <- c(lapply(df[-ncol(df)], function(x) sum(x)), mean=mean(df[,ncol(df)]))
as.data.frame(lst)
# int num1 mean
#1 10 6 2.5
The output is a data frame that preserves the classes of each vector. If you would like the output to be added to the original data frame you can replace as.data.frame(lst) with:
names(lst) <- names(df)
rbind(df, lst)
If you are trying to get the sum of all integer columns and the mean of numeric columns, go with @Frank's answer.
Data
df <- data.frame(int=1:4, num1=seq(1,2,length.out=4), num2=seq(2,3,length.out=4))
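The type change comes from c() rather than from sum(): c() builds an atomic vector, so all of its elements are coerced to one common type, here double. A list (or a one-row data frame built from it) keeps each element's own type. A short sketch using the data above; the names as_vector, as_list and one_row are illustrative only:

```r
df <- data.frame(int = 1:4,
                 num1 = seq(1, 2, length.out = 4),
                 num2 = seq(2, 3, length.out = 4))

# sum() of an integer column is itself an integer
int_sum <- sum(df$int)

# Combining it with a double in c() coerces everything to double
as_vector <- c(int_sum, mean(df$num2))

# A list stores each element with its own type
as_list <- list(int = int_sum, mean = mean(df$num2))

# Converting the list to a data frame keeps the per-column classes
one_row <- as.data.frame(as_list)

print(typeof(as_vector))
print(sapply(one_row, class))
```

This is why the lapply()-based answer above returns an integer for the first column while the c()-based summary does not.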
Perhaps an adaptation of this?
apply(iris[,1:4], 2, sum) / c(rep(1,3), nrow(iris))
