Compute Variance in each column between certain number of rows - r

I want to compute the variances for each column of a matrix, but that variance must be calculated every 7 rows, for example
9.8 4.5 0.9 7.8.....
5.4 9.8 1.2 3.5....
3.1 2.6 9.5 7.1.....
3.4 NA 1.1 1.5.....
7.9 5.9 3.4 2.6.....
4.5 5.1 7.4 NA.....
VAR VAR VAR VAR
VAR is the variance of the column.
After every 7 rows in the same matrix I have to compute the variance again, removing the NAs. The dimension of the matrix is 266x107.
I tried colVars from the boa package, but that command computes the variance for the entire column.

Here is the data.table approach:
require(data.table)
# Create the data table
dt <- as.data.table(matrix(rnorm(266*107), 266, 107))
# For every 7 rows, calculate variance of each column, ignoring NAs
dt[, lapply(.SD, var, na.rm=T), by=gl(ceiling(266/7), 7, 266)]

aggregate() is a mighty function for this kind of task, so there is no need for another package in this case:
lolzdf <- matrix(rnorm(266*107), 266, 107)
n<-7
aggregate(lolzdf,list(rep(1:(nrow(lolzdf)%/%n+1),each=n,len=nrow(lolzdf))),var,na.rm=TRUE)[-1];
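Both answers depend on a grouping index that gives rows 1 to 7 the value 1, rows 8 to 14 the value 2, and so on. Here is a minimal base R sketch of the same idea on a small made-up matrix (the names m, block and block_var are only for illustration). It also shows that NA values and a shorter final block are handled:

```r
set.seed(42)
m <- matrix(rnorm(20 * 3), nrow = 20, ncol = 3)  # made-up data: 20 rows, 3 columns
m[4, 2] <- NA                                    # one missing value

# Block index: 1 for rows 1-7, 2 for rows 8-14, 3 for rows 15-20
block <- ceiling(seq_len(nrow(m)) / 7)

# Variance of every column within every block, ignoring NAs
block_var <- t(sapply(split(seq_len(nrow(m)), block), function(rows) {
  apply(m[rows, , drop = FALSE], 2, var, na.rm = TRUE)
}))

block_var  # one row per block, one column per original column
```

The last block simply has fewer than 7 rows; whether that is acceptable depends on the data.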

Related

how to use a loop | apply | map to slice a data-frame for multiple variable values and create multiple statistics summary() in r

I am trying to get multiple summary() outputs from a data frame. I want to subset it according to some characteristics multiple times, then get the summary() of a certain variable for each slice and put all summary() outputs together in either a data frame or a list.
Ideally I would like to use the building_id of each slice as the name for that row of summary(), so I thought of using a for loop.
The data are fairly large (about 20 million lines). I am using the train and building_metadata data frames, joined into one, from the ASHRAE energy prediction on Kaggle.
I have created a tibble which holds the building ids I want to subset by. I want to get the summary() of the variable "energy_sqm" (which I have already created), so I am trying to put this slicing in a for loop:
Warning 1: My building_id tibble has values like 50, 67, 778, 1099 etc. So one of the problems I have is with using these numbers for some sort of indexing or for naming my summary outputs. I think it tries to make row 50, 67 etc. in the several different trials I did.
summaries_output <- tibble() # or list()
for (id in building_id){
temp_stats <- joined %>%
filter(building_id == "id") %>%
pull(energy_sqm) %>%
summary() %>%
broom:tidy()
summaries_output <- bind_rows(summaries_output, temp_stats, .id = "id")
}
My problems:
a) Whatever I use to initialize summaries_output, I can't get it to retain anything inside the loop, so I am guessing I am messing up the loop as well.
b) Ideally I would like to have the building_id as an identifier of the summary() statistic.
c) Could someone suggest what the good-practice principle is for this kind of loop in terms of using a list, a tibble or whatever?
Details: The class() of summary() is "summaryDefault" "table", which I don't know anything about.
Thanks for the help.
We can also use the tidyverse. After grouping by 'Species', tidy the summary output of 'Sepal.Length'. Here, the tidy output is a tibble/data.frame. In dplyr 1.0.0, we could use it without wrapping in a list, but the result would then carry column names with a $ in them, because we have out plus the column names from tidy. To avoid that, we wrap it in a list and then unnest the column created:
library(dplyr)
library(broom)
library(tidyr)
iris %>%
group_by(Species) %>%
summarise(out = list(tidy(summary(Sepal.Length)))) %>%
unnest(c(out))
# A tibble: 3 x 7
# Species minimum q1 median mean q3 maximum
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 setosa 4.3 4.8 5 5.01 5.2 5.8
#2 versicolor 4.9 5.6 5.9 5.94 6.3 7
#3 virginica 4.9 6.22 6.5 6.59 6.9 7.9
This appears to be summarizing by group. Here's a way to do it with data.table, although I am unsure of your exact expected output:
library(broom)
library(data.table)
dt_iris = as.data.table(iris)
dt_iris[, tidy(summary(Sepal.Length)), by = Species]
#> Species minimum q1 median mean q3 maximum
#> 1: setosa 4.3 4.800 5.0 5.006 5.2 5.8
#> 2: versicolor 4.9 5.600 5.9 5.936 6.3 7.0
#> 3: virginica 4.9 6.225 6.5 6.588 6.9 7.9
Created on 2020-07-11 by the reprex package (v0.3.0)
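If a loop-style solution is still wanted, the loop in the question appears to have two separate problems: filter(building_id == "id") compares against the literal string "id" rather than the loop variable, and broom:tidy() needs two colons (broom::tidy()). A minimal base R sketch of the "collect in a list, bind once at the end" pattern follows. The small joined data frame and the ids vector are made-up stand-ins; only the column names building_id and energy_sqm come from the question:

```r
# Hypothetical stand-in for the joined data
joined <- data.frame(
  building_id = c(50, 50, 50, 67, 67, 778, 778, 778),
  energy_sqm  = c(1, 2, 3, 10, 20, 5, 5, 8)
)
ids <- c(50, 67, 778)  # hypothetical ids to slice by

# One summary per id, collected in a list
summaries <- lapply(ids, function(id) {
  x <- joined$energy_sqm[joined$building_id == id]  # compare to the value of id
  unclass(summary(x))                               # plain named numeric vector
})

# Bind once at the end and keep the id as an ordinary column,
# so values like 50 or 778 are never used as row positions
summaries_output <- data.frame(
  building_id = ids,
  do.call(rbind, summaries),
  check.names = FALSE
)
summaries_output
```

Keeping the id as a column avoids the indexing confusion described in "Warning 1". If some slices contain NA values, summary() returns an extra "NA's" element for them, so the vectors would no longer line up and would need handling first.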

Performing the Same mutate on all variables in a data frame

I have a 28-variable data frame, and I would like to mutate each variable in the same data frame with the same function. For example, add an extra column for each variable in the data frame where the new column is the log of the variable. So for example if I had
dataframe <- data.frame(X=data1, Y=data2, Z=data3)
I want a new data frame that contains X Y and Z, but also log(X), log(Y) and log(Z). This is easy enough to do using
mutate(dataframe, log(X)); mutate(dataframe, log(Y))
etc., but for 28 variables (and multiple transformations on each variable: I want to get sqrt and ^2 of each too) it's a bit too much. I'm aware of the existence of mutate_all, but for some reason when I try to use that it replaces all the variables rather than adding new ones.
We can use mutate_all and give the function a name inside funs, so that the name is used as a suffix and the result is created as a new column. Otherwise, it would replace the original with the output of the function.
dataframe %>%
mutate_all(funs(log = log(.)))
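Note that funs() has been deprecated since dplyr 0.8.0. In dplyr 1.0.0 and later, a similar result can be obtained with across(). The sketch below uses a small made-up data frame, and the suffixes log, sqrt and sq are arbitrary choices:

```r
library(dplyr)

# Made-up example data
dataframe <- data.frame(X = c(1, 2, 4), Y = c(10, 20, 40))

result <- dataframe %>%
  mutate(across(
    everything(),
    list(log = log, sqrt = sqrt, sq = ~ .x^2),  # several transformations at once
    .names = "{.col}_{.fn}"                     # e.g. X_log, X_sqrt, X_sq
  ))

names(result)
```

Because every function in the list is named, the original columns are kept and the transformed ones are added next to them.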
A base R option would be
df <- head(iris[1:2])
df[paste("log", names(df), sep = "_")] <- log(df)
df
# Sepal.Length Sepal.Width log_Sepal.Length log_Sepal.Width
#1 5.1 3.5 1.629241 1.252763
#2 4.9 3.0 1.589235 1.098612
#3 4.7 3.2 1.547563 1.163151
#4 4.6 3.1 1.526056 1.131402
#5 5.0 3.6 1.609438 1.280934
#6 5.4 3.9 1.686399 1.360977
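The question also asks for sqrt and ^2 of each variable. The base R idea above can be extended with a named list of transformations. This is only a sketch: the list name transforms and the prefixes are arbitrary, and it assumes all columns are numeric:

```r
df <- head(iris[1:2])

# Named list of transformations; the names become column prefixes
transforms <- list(log = log, sqrt = sqrt, sq = function(x) x^2)

# Apply each transformation to the whole data frame and rename its columns
new_cols <- lapply(names(transforms), function(nm) {
  out <- transforms[[nm]](df)
  names(out) <- paste(nm, names(df), sep = "_")
  out
})

# Original columns followed by all transformed columns
result <- do.call(cbind, c(list(df), new_cols))
names(result)
```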

Determining observations with same values in one variable

My problem is that I want to determine households which have the same income and then use the rank number (ranked by income) to create another rank variable. [Image: Sample.Data.Frame]
For example, you have a data.frame like the one displayed in the image. The first 2 observations have no income. So there are 2 (=n) observations with the same income, with ranks of 1 (=y) and 2 (=x). The new rank variable I want to create for both observations is rank.new = (y+x)/n, so that there is a new column "rank.new" where the value for observations 1 and 2 is 1.5.
Of course I have many more observations and more households with identical incomes, so I want to ask how I could do this in R.
You are looking for the function rank
Income = c(0,0,150,300,300,440,500,500,500)
rank(Income)
[1] 1.5 1.5 3.0 4.5 4.5 6.0 8.0 8.0 8.0
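The averaging of tied ranks comes from the default ties.method = "average" of rank(). A short sketch comparing it with another tie-handling option, and showing that the input does not have to be sorted (the variable names are only for illustration):

```r
Income <- c(0, 0, 150, 300, 300, 440, 500, 500, 500)

avg_rank <- rank(Income, ties.method = "average")  # the default: ties share the mean rank
min_rank <- rank(Income, ties.method = "min")      # ties share the lowest rank instead

# rank() does not require sorted input
shuffled_rank <- rank(rev(Income))

avg_rank
min_rank
shuffled_rank
```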
I am making your test data a little bigger to show what happens when there are more than two points in the same group. You just need to group the points that have the same income and take the average of the groups. I am assuming that the data has been sorted by Income.
## Test Data
Income = c(0,0,150,300,300,440,500,500,500)
Rank = 1:length(Income)
Group = cumsum(c(1, diff(Income) != 0))
NewRank = aggregate(Rank, list(Group), mean)[Group,2]
NewRank
[1] 1.5 1.5 3.0 4.5 4.5 6.0 8.0 8.0 8.0
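The same group-and-average step can also be written with ave(), which returns the group mean for every element in its original position. A sketch on the same test data, again assuming it is sorted by Income so that Rank is just the position:

```r
Income <- c(0, 0, 150, 300, 300, 440, 500, 500, 500)
Rank <- as.numeric(seq_along(Income))

# Mean of Rank within each distinct Income value
NewRank <- ave(Rank, Income, FUN = mean)
NewRank
```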

Double For Loop and calculate averages in R

I have a minor problem, and I'm unsure how to fix the error.
Basically, I have two columns and I want to use a Double For Loop to calculate the averages between each number in both columns so it results in a vector of averages. To clarify, the apply and mean functions aren't the best fit because I need only half of the total possible combinations to obtain averages. For example:
Col1<-c(1,2,3,4,5)
Col2<-c(1,2,3,4,5)
Q1<-data.frame(cbind(Col1, Col2))
Q1$mean<-0
for (i in 1:length(Q1$Col1)) {
for (j in i+1:length(Q1$Col2)) {
Q1$mean[i]<-(Q1$Col1[i]+Q1$Col2[j])/2
}
}
Basically, for each number in Q1$Col1, I want to average it with Q1$Col2. The reason why I want to use a double for loop is to eliminate duplicates. This is the matrix version to provide visualization:
1.0 1.5 2.0 2.5 3.0
1.5 2.0 2.5 3.0 3.5
2.0 2.5 3.0 3.5 4.0
2.5 3.0 3.5 4.0 4.5
3.0 3.5 4.0 4.5 5.0
Here, each row represents a number from Q1$Col1 and each column represents a number from Q1$Col2. However, notice that there is redundancy on both sides of the matrix diagonal. So using the Double For Loop, I eliminate the redundancy to obtain the averages of the unique combination of cases. Using the matrix above, it should look like this:
1.0 1.5 2.0 2.5 3.0
2.0 2.5 3.0 3.5
3.0 3.5 4.0
4.0 4.5
5.0
What I think you're asking is this: given two vectors of numbers, how can I find the mean of the first items in each vector, the mean of the second items in each vector, and so on. If that's the case, then here is a way to do that.
First, you want to use cbind() not rbind() in order to get columns not rows.
Col1<-c(1,2,3,4,5)
Col2<-c(2,3,4,5,6)
Q1<-cbind(Col1, Col2)
Then you can use the function rowMeans() to figure out (you guessed it) the means of each row. (See also rowSums() and colMeans() and colSums().)
rowMeans(Q1)
#> [1] 1.5 2.5 3.5 4.5 5.5
The more general way to do this is the apply() function, which will let us apply a function to each column or row. Here we use the argument 1 to apply it to rows (because the first row takes the first item from Col1 and Col2, etc.).
apply(Q1, 1, mean)
The results are these:
#> [1] 1.5 2.5 3.5 4.5 5.5
If you really want them in your existing matrix, you could do something like this:
means <- rowMeans(Q1)
cbind(Q1, means)
You do not need the loops to get the averages, you can use vectorised operations:
Col1 <- c(1,2,3,4,5)
Col2 <- c(2,3,4,5,6)
Mean <- (Col1+Col2)/2
Q1 <- rbind(Col1, Col2, Mean)
However, rbind treats your vectors as rows; you could use cbind for columns.
You could just use the outer function to first calculate the averages, then use lower.tri to fill the area underneath the diagonal of the matrix with NA values.
matrix<-outer(Q1$Col1, Q1$Col2, "+")/2
matrix[lower.tri(matrix)] = NA
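To get the vector of unique averages the question asks for, the non-NA entries can be pulled out of that matrix. The sketch below also includes a corrected double loop for comparison. One likely source of the error in the original loop is operator precedence: i+1:length(x) is parsed as i + (1:length(x)), so j runs past the end of the column. The names avg, pair_means and loop_means are only for illustration:

```r
Col1 <- c(1, 2, 3, 4, 5)
Col2 <- c(1, 2, 3, 4, 5)

# Vectorised: all pairwise averages, then blank out the redundant lower triangle
avg <- outer(Col1, Col2, "+") / 2
avg[lower.tri(avg)] <- NA

# Read the remaining entries row by row into a vector
pair_means <- t(avg)[!is.na(t(avg))]

# Equivalent double loop: j starts at i so the diagonal is kept
n <- length(Col1)
loop_means <- numeric(0)
for (i in seq_len(n)) {
  for (j in i:n) {
    loop_means <- c(loop_means, (Col1[i] + Col2[j]) / 2)
  }
}

pair_means
```

Both approaches give the same 15 values here. The vectorised version avoids growing a vector inside a loop.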

Create a new (identical) data frame by sampling an existing data frame column-wise

I am trying to create a new data frame which has the same number of columns (but not rows) as an existing data frame. All columns are of identical type, numeric. I need to sample each column of the original data frame (n=241 samples, replace=T) and add those samples to the new data frame at the same column number as the original data frame.
My code so far:
#create the new data frame
tree.df <- data.frame(matrix(nrow=0, ncol=72))
#give same column names as original data frame (data3)
colnames(tree.df)<-colnames(data3)
#populate with NA values
tree.df[1:241,]=NA
#sample original data frame column wise and add to new data frame
for (i in colnames(data3)){
rbind(sample(data3[i], 241, replace = T),tree.df)}
The code isn't working out. Any ideas on how to get this to work?
Use the fact that a data frame is a list, and pass to lapply to perform a column-by-column operation.
Here's an example, taking 5 elements from each column in iris:
as.data.frame(lapply(iris, sample, size=5, replace=TRUE))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.7 3.2 1.7 0.2 versicolor
## 2 5.8 3.1 1.5 1.2 setosa
## 3 6.0 3.8 4.9 1.9 virginica
## 4 4.4 2.5 5.3 0.2 versicolor
## 5 5.1 3.1 3.3 0.3 setosa
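Applied to the dimensions in the question, the same lapply pattern might look like the sketch below. The data3 here is a made-up stand-in with 72 numeric columns, since the real data is not shown:

```r
set.seed(1)
# Hypothetical stand-in for data3: 100 rows, 72 numeric columns
data3 <- as.data.frame(matrix(rnorm(100 * 72), nrow = 100, ncol = 72))

# Sample 241 values with replacement from each column independently
tree.df <- as.data.frame(lapply(data3, sample, size = 241, replace = TRUE))

dim(tree.df)  # 241 rows, 72 columns, same column names as data3
```

Each column is sampled on its own, so rows of tree.df do not correspond to rows of data3.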
There are several issues here. Probably the one that is causing things not to work is the way you are trying to access a column of the data frame data3. To do that, use data3[, i]. Note the comma, which separates the row index from the column index.
Additionally, since you already know how big your data frame will be, allocate the space from the beginning:
tree.df <- data.frame(matrix(nrow = 241, ncol = 72))
tree.df is already prepopulated with missing (NA) values so you don't need to do it again. You can now rewrite your for loop as
for (i in colnames(data3)){
tree.df[, i] <- sample(data3[, i], 241, replace = TRUE)
}
Notice I spelled out TRUE. This is better practice than using T because T can be reassigned. Compare:
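T                # TRUE
T <- FALSE       # allowed: T is just a variable
T                # FALSE
TRUE <- FALSE    # Error: invalid (do_set) left-hand side to assignment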
