I have a dataset called spam which contains 58 columns and approximately 3500 rows of data related to spam messages.
I plan on running some linear regression on this dataset in the future, but I'd like to do some pre-processing beforehand and standardize the columns to have zero mean and unit variance.
I've been told the best way to go about this is with R, so I'd like to ask: how can I achieve this standardization in R? I've already got the data loaded properly and I'm just looking for packages or methods to perform this task.
I have to assume you meant a mean of 0 and a standard deviation of 1. If your data is in a data frame and all the columns are numeric, you can simply call the scale function on the data to do what you want.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)
# check that we get mean of 0 and sd of 1
colMeans(scaled.dat) # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)
Using built-in functions is classy.
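One more point that matters for the later regression step: scale() stores the column means and standard deviations as attributes on its result, so you can always map back to the original units. A quick sketch:
attr(scaled.dat, "scaled:center")  # the column means that were subtracted
attr(scaled.dat, "scaled:scale")   # the column sds that were divided by
# undo the scaling if needed
orig.dat <- sweep(sweep(scaled.dat, 2, attr(scaled.dat, "scaled:scale"), "*"),
                  2, attr(scaled.dat, "scaled:center"), "+")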
Realizing that the question is old and one answer is accepted, I'll provide another answer for reference.
scale is limited by the fact that it scales all variables. The solution below allows you to scale only specific variables while leaving the others unchanged (and the variable names can be generated dynamically):
library(dplyr)
set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2),
y = runif(10, 3, 5),
z = runif(10, 10, 20))
dat
dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
dat2
which gives me this:
> dat
x y z
1 29.75859 3.633225 14.56091
2 30.05549 3.605387 12.65187
3 30.21689 3.318092 13.04672
4 29.53086 3.079992 15.07307
5 30.08582 3.437599 11.81096
6 30.10121 4.621197 17.59671
7 29.88505 4.051395 12.01248
8 29.89067 4.829316 12.58810
9 29.88711 4.662690 19.92150
10 29.82199 3.091541 18.07352
and
> dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
> dat2
x y z
1 29.75859 -0.3004815 -0.06016029
2 30.05549 -0.3423437 -0.72529604
3 30.21689 -0.7743696 -0.58772361
4 29.53086 -1.1324181 0.11828039
5 30.08582 -0.5946582 -1.01827752
6 30.10121 1.1852038 0.99754666
7 29.88505 0.3283513 -0.94806607
8 29.89067 1.4981677 -0.74751378
9 29.88711 1.2475998 1.80753470
10 29.82199 -1.1150515 1.16367556
EDIT 1 (2016): Addressed Julian's comment: the output of scale is an N x 1 matrix, so ideally we should add an as.vector to convert the matrix type back into a vector type. Thanks Julian!
EDIT 2 (2019): Quoting Duccio A.'s comment: For the latest dplyr (version 0.8) you need to change dplyr::funs to list, like dat %>% mutate_each_(list(~scale(.) %>% as.vector), vars=c("y","z"))
EDIT 3 (2020): Thanks to @mj_whales: the old solution is deprecated and now we need to use mutate_at.
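EDIT 4: with dplyr 1.0.0 and later, mutate_at() is in turn superseded by across(). A sketch of the equivalent call:
dat2 <- dat %>% mutate(across(all_of(c("y", "z")), ~ as.vector(scale(.))))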
This is 3 years old. Still, I feel I have to add the following:
The most common normalization is the z-transformation, where you subtract the mean and divide by the standard deviation of your variable. The result will have mean=0 and sd=1.
For that, you don't need any package.
zVar <- (myVar - mean(myVar)) / sd(myVar)
That's it.
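To apply this to every column of a data frame at once, a sketch (this assumes all columns are numeric, e.g. the asker's spam data):
spam.std <- as.data.frame(lapply(spam, function(v) (v - mean(v)) / sd(v)))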
The caret package provides methods for preprocessing data (e.g. centering and scaling). You can use the following code:
library(caret)
# Assuming goal class is column 10
preObj <- preProcess(data[, -10], method=c("center", "scale"))
newData <- predict(preObj, data[, -10])
More details: http://www.inside-r.org/node/86978
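A useful property of this approach is that preObj stores the training-set centers and scales, so the identical transformation can be applied to new data later. A sketch, where testData is a hypothetical data frame with the same columns as data:
newTest <- predict(preObj, testData[, -10])  # testData is hypothetical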
When I used the solution stated by Dason, instead of getting a data frame as a result, I got a vector of numbers (the scaled values of my df).
In case someone is having the same trouble, you have to add as.data.frame() to the code, like this:
df.scaled <- as.data.frame(scale(df))
I hope this will be useful for people having the same issue!
You can also normalize the data easily using the data.Normalization function in the clusterSim package. It provides many different normalization methods; for example, type "n1" gives the (x-mean)/sd standardization asked about here (see the sketch after the argument list).
data.Normalization(x, type="n0", normalization="column")
Arguments
x - vector, matrix or dataset
type - type of normalization:
n0 - without normalization
n1 - standardization ((x-mean)/sd)
n2 - positional standardization ((x-median)/mad)
n3 - unitization ((x-mean)/range)
n3a - positional unitization ((x-median)/range)
n4 - unitization with zero minimum ((x-min)/range)
n5 - normalization in range <-1,1> ((x-mean)/max(abs(x-mean)))
n5a - positional normalization in range <-1,1> ((x-median)/max(abs(x-median)))
n6 - quotient transformation (x/sd)
n6a - positional quotient transformation (x/mad)
n7 - quotient transformation (x/range)
n8 - quotient transformation (x/max)
n9 - quotient transformation (x/mean)
n9a - positional quotient transformation (x/median)
n10 - quotient transformation (x/sum)
n11 - quotient transformation (x/sqrt(SSQ))
n12 - normalization ((x-mean)/sqrt(sum((x-mean)^2)))
n12a - positional normalization ((x-median)/sqrt(sum((x-median)^2)))
n13 - normalization with zero being the central point ((x-midrange)/(range/2))
normalization - "column" for normalization by variable, "row" for normalization by object
With dplyr v0.7.4 all variables can be scaled by using mutate_all():
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
set.seed(1234)
dat <- tibble(x = rnorm(10, 30, .2),
y = runif(10, 3, 5),
z = runif(10, 10, 20))
dat %>% mutate_all(scale)
#> # A tibble: 10 x 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 -0.827 -0.300 -0.0602
#> 2 0.663 -0.342 -0.725
#> 3 1.47 -0.774 -0.588
#> 4 -1.97 -1.13 0.118
#> 5 0.816 -0.595 -1.02
#> 6 0.893 1.19 0.998
#> 7 -0.192 0.328 -0.948
#> 8 -0.164 1.50 -0.748
#> 9 -0.182 1.25 1.81
#> 10 -0.509 -1.12 1.16
Specific variables can be excluded using mutate_at():
dat %>% mutate_at(scale, .vars = vars(-x))
#> # A tibble: 10 x 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 29.8 -0.300 -0.0602
#> 2 30.1 -0.342 -0.725
#> 3 30.2 -0.774 -0.588
#> 4 29.5 -1.13 0.118
#> 5 30.1 -0.595 -1.02
#> 6 30.1 1.19 0.998
#> 7 29.9 0.328 -0.948
#> 8 29.9 1.50 -0.748
#> 9 29.9 1.25 1.81
#> 10 29.8 -1.12 1.16
Created on 2018-04-24 by the reprex package (v0.2.0).
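For reference, with dplyr 1.0.0 and later both patterns can be written with across(); a sketch:
dat %>% mutate(across(everything(), ~ as.vector(scale(.))))   # all columns
dat %>% mutate(across(-x, ~ as.vector(scale(.))))             # all columns except x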
Again, even though this is an old question, it is very relevant! I have found a simple way to normalise certain columns without the need for any packages:
normFunc <- function(x){(x-mean(x, na.rm = T))/sd(x, na.rm = T)}
For example
x<-rnorm(10,14,2)
y<-rnorm(10,7,3)
z<-rnorm(10,18,5)
df<-data.frame(x,y,z)
df[2:3] <- apply(df[2:3], 2, normFunc)
You will see that the y and z columns have been normalised. No packages needed :-)
scale() can be used on both a full data frame and specific columns.
For specific columns, the following code can be used:
trainingSet[, 3:7] = scale(trainingSet[, 3:7]) # For column 3 to 7
trainingSet[, 8] = scale(trainingSet[, 8]) # For column 8
Full data frame
trainingSet <- scale(trainingSet)
The collapse package provides the fastest scale function, implemented in C++ using Welford's online algorithm:
dat <- data.frame(x = rnorm(1e6, 30, .2),
y = runif(1e6, 3, 5),
z = runif(1e6, 10, 20))
library(collapse)
library(microbenchmark)
microbenchmark(fscale(dat), scale(dat))
Unit: milliseconds
expr min lq mean median uq max neval cld
fscale(dat) 27.86456 29.5864 38.96896 30.80421 43.79045 313.5729 100 a
scale(dat) 357.07130 391.0914 489.93546 416.33626 625.38561 793.2243 100 b
Furthermore, fscale is an S3 generic for vectors, matrices and data frames, and it also supports grouped and/or weighted scaling operations, as well as scaling to arbitrary means and standard deviations.
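A sketch of those extra features (argument names g, mean and sd, as I read the collapse documentation; treat this as illustrative):
g <- rep(c("a", "b"), each = 5e5)          # example grouping vector for the 1e6 rows
head(fscale(dat$x, g = g))                 # scale within each group
head(fscale(dat$x, mean = 100, sd = 15))   # scale to mean 100 and sd 15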
The dplyr package has two functions that do this.
> require(dplyr)
> require(data.table)
To mutate specific columns of a data table, you can use the function mutate_at(). To mutate all columns, you can use mutate_all.
The following is a brief example for using these functions to standardize data.
Mutate specific columns:
dt = data.table(a = runif(3500), b = runif(3500), c = runif(3500))
dt = data.table(dt %>% mutate_at(vars("a", "c"), scale)) # can also index columns by number, e.g., vars(c(1,3))
> apply(dt, 2, mean)
a b c
1.783137e-16 5.064855e-01 -5.245395e-17
> apply(dt, 2, sd)
a b c
1.0000000 0.2906622 1.0000000
Mutate all columns:
dt = data.table(a = runif(3500), b = runif(3500), c = runif(3500))
dt = data.table(dt %>% mutate_all(scale))
> apply(dt, 2, mean)
a b c
-1.728266e-16 9.291994e-17 1.683551e-16
> apply(dt, 2, sd)
a b c
1 1 1
Before I happened to find this thread, I had the same problem. I had user-dependent column types, so I wrote a for loop going through them and scaling the needed columns. There are probably better ways to do it (a sketch of one follows below), but this solved the problem just fine:
for(i in 1:length(colnames(df))) {
if(class(df[,i]) == "numeric" || class(df[,i]) == "integer") {
df[,i] <- as.vector(scale(df[,i])) }
}
as.vector is needed because scale() returns an N x 1 matrix, which is usually not what you want in your data.frame.
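One of those better ways might be to pick out the numeric columns up front and skip the explicit loop. A sketch:
num_cols <- sapply(df, is.numeric)                                     # TRUE for numeric/integer columns
df[num_cols] <- lapply(df[num_cols], function(v) as.vector(scale(v)))  # scale them in place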
@BBKim pretty much gave the best answer, but it can be done even more concisely. I'm surprised no one has come up with it yet.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
dat <- apply(dat, 2, function(x) (x - mean(x)) / sd(x))
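# note: apply() returns a matrix here; wrap it in as.data.frame() if you need a data frame back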
Use the package "recommenderlab". Download and install the package.
This package has a command "Normalize" in built. It also allows you to choose one of the many methods for normalization namely 'center' or 'Z-score'
Follow the following example:
## create a matrix with ratings
m <- matrix(sample(c(NA,0:5),50, replace=TRUE, prob=c(.5,rep(.5/6,6))), nrow=5, ncol=10, dimnames = list(users=paste('u', 1:5, sep=''), items=paste('i', 1:10, sep='')))
## do normalization
r <- as(m, "realRatingMatrix")
#here, 'center' is the default method
r_n1 <- normalize(r)
#here "Z-score" is the used method used
r_n2 <- normalize(r, method="Z-score")
r
r_n1
r_n2
## show normalized data
image(r, main="Raw Data")
image(r_n1, main="Centered")
image(r_n2, main="Z-Score Normalization")
The code below could be the shortest way to achieve this (note that apply() returns a matrix, so wrap the result in as.data.frame() if you need a data frame back):
dataframe <- apply(dataframe, 2, scale)
The normalize function from the BBmisc package was the right tool for me, since it can deal with NA values.
Here is how to use it:
Given the following dataset,
library(BBmisc)
library(data.table)
ASR_API <- c("CV", "F", "IER", "LS-c", "LS-o")
Human <- c(NA, 5.8, 12.7, NA, NA)
Google <- c(23.2, 24.2, 16.6, 12.1, 28.8)
GoogleCloud <- c(23.3, 26.3, 18.3, 12.3, 27.3)
IBM <- c(21.8, 47.6, 24.0, 9.8, 25.3)
Microsoft <- c(29.1, 28.1, 23.1, 18.8, 35.9)
Speechmatics <- c(19.1, 38.4, 21.4, 7.3, 19.4)
Wit_ai <- c(35.6, 54.2, 37.4, 19.2, 41.7)
dt <- data.table(ASR_API,Human, Google, GoogleCloud, IBM, Microsoft, Speechmatics, Wit_ai)
> dt
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai
1: CV NA 23.2 23.3 21.8 29.1 19.1 35.6
2: F 5.8 24.2 26.3 47.6 28.1 38.4 54.2
3: IER 12.7 16.6 18.3 24.0 23.1 21.4 37.4
4: LS-c NA 12.1 12.3 9.8 18.8 7.3 19.2
5: LS-o NA 28.8 27.3 25.3 35.9 19.4 41.7
normalized values can be obtained like this:
> dtn <- normalize(dt, method = "standardize", range = c(0, 1), margin = 1L, on.constant = "quiet")
> dtn
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai
1: CV NA 0.3361245 0.2893457 -0.28468670 0.3247336 -0.18127203 -0.16032655
2: F -0.7071068 0.4875320 0.7715885 1.59862532 0.1700986 1.55068347 1.31594762
3: IER 0.7071068 -0.6631646 -0.5143923 -0.12409420 -0.6030768 0.02512682 -0.01746131
4: LS-c NA -1.3444981 -1.4788780 -1.16064578 -1.2680075 -1.24018782 -1.46198764
5: LS-o NA 1.1840062 0.9323361 -0.02919864 1.3762521 -0.15435044 0.32382788
whereas the hand-calculated method simply fails on columns containing NAs:
> dt %>% mutate(normalizedHuman = (Human - mean(Human))/sd(Human)) %>%
+ mutate(normalizedGoogle = (Google - mean(Google))/sd(Google)) %>%
+ mutate(normalizedGoogleCloud = (GoogleCloud - mean(GoogleCloud))/sd(GoogleCloud)) %>%
+ mutate(normalizedIBM = (IBM - mean(IBM))/sd(IBM)) %>%
+ mutate(normalizedMicrosoft = (Microsoft - mean(Microsoft))/sd(Microsoft)) %>%
+ mutate(normalizedSpeechmatics = (Speechmatics - mean(Speechmatics))/sd(Speechmatics)) %>%
+ mutate(normalizedWit_ai = (Wit_ai - mean(Wit_ai))/sd(Wit_ai))
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai normalizedHuman normalizedGoogle
1 CV NA 23.2 23.3 21.8 29.1 19.1 35.6 NA 0.3361245
2 F 5.8 24.2 26.3 47.6 28.1 38.4 54.2 NA 0.4875320
3 IER 12.7 16.6 18.3 24.0 23.1 21.4 37.4 NA -0.6631646
4 LS-c NA 12.1 12.3 9.8 18.8 7.3 19.2 NA -1.3444981
5 LS-o NA 28.8 27.3 25.3 35.9 19.4 41.7 NA 1.1840062
normalizedGoogleCloud normalizedIBM normalizedMicrosoft normalizedSpeechmatics normalizedWit_ai
1 0.2893457 -0.28468670 0.3247336 -0.18127203 -0.16032655
2 0.7715885 1.59862532 0.1700986 1.55068347 1.31594762
3 -0.5143923 -0.12409420 -0.6030768 0.02512682 -0.01746131
4 -1.4788780 -1.16064578 -1.2680075 -1.24018782 -1.46198764
5 0.9323361 -0.02919864 1.3762521 -0.15435044 0.32382788
(normalizedHuman ends up as a column of NAs ...)
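For completeness: the hand-calculated method can be made NA-aware simply by passing na.rm = TRUE to mean() and sd(). A sketch for one column:
dt %>% mutate(normalizedHuman = (Human - mean(Human, na.rm = TRUE)) / sd(Human, na.rm = TRUE))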
Regarding the selection of specific columns for the calculation, a generic method can be employed like this one:
data_vars <- df_full %>% dplyr::select(-ASR_API,-otherVarNotToBeUsed)
meta_vars <- df_full %>% dplyr::select(ASR_API,otherVarNotToBeUsed)
data_varsn <- normalize(data_vars, method = "standardize", range = c(0, 1), margin = 1L, on.constant = "quiet")
dtn <- cbind(meta_vars,data_varsn)
Let's pretend I am measuring the distance grasshoppers can jump pre- and post-treatment. This is just for fun; the real measurement could be anything, and the bigger picture is to understand the group_by() command.
For the statistical test I would like to run, each observation needs to have its own column, but I'm given a dataset that is not in this format. I would like to use the dplyr package and the group_by() command to shape the data for my needs, because if this were to happen again, I could write more general code to work over other datasets. :)
I am able to do this using commands such as filter() and then cbind() at a later step (see example below), but it also requires renaming a column. Additionally, if I wanted to add a column, say "difference", to calculate the observed difference between observation 1 and observation 2, I can do this, but then I need to add another line of code (again, see example below).
It would be great to do this with fewer lines of code.
Please see what I have tried, and let me know how I could modify the group_by() code to work properly.
example_df <- data.frame( "observation" = character(0), "distance" = integer(0))
Assign names for our "observations", remember, in this example, it's done twice
variable_names <- c( "obs_1", "obs_2")
Assign fictitious values to w, x, y, and z
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
Combine everything for this pretend exercise
df <- data.frame( "observation" = variable_names, "distance" = c(w,x,y,z))
attach(df)
Here's how I achieved the desired results for this example
library(dplyr)
dat = filter(df,observation=="obs_1")
dat2 = filter(df,observation=="obs_2")
names(dat2)
colnames(dat2)[2] <- "distance_2"
final <- cbind(dat,dat2)
attach(final)
final$difference <- distance-distance_2
I tried using the group_by() command, I just get an error message
final <- df %>% group_by(observation,distance) %>% summarise(
Observation_1 = first(observation), distance_1 = first(distance),
Observation_2 = last(observation), distance_2 = last(distance,difference=distance-distance_2)))
It would be great to get the above code to work
To make things even more "fun" :), what if more than one variable was measured? Could I write general code to achieve the desired results, again without having to go over the whole filter() process, with cbind(), etc.?
Here's an example (expanded on the above one)
example_df <- data.frame( "observation" = character(0), "distance" = integer(0),"weight" = integer(0),"speed" = integer(0))
variable_names <- c( "obs_1", "obs_2")
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
a<-rnorm(200, mean=5, sd=2)
b<-rnorm(200, mean=5, sd=2)
df <- data.frame( "observation" = variable_names, "distance" = c(w,x),"weight" = c(y,z),"speed" = c(a,b))
attach(df)
library(dplyr)
dat = filter(df,observation=="obs_1")
dat2 = filter(df,observation=="obs_2")
names(dat2)
colnames(dat2)[2] <- "distance_2"
colnames(dat2)[3] <- "weight_2"
colnames(dat2)[4] <- "speed_2"
final <- cbind(dat,dat2)
attach(final)
final$difference <- distance-distance_2
final$difference_weight <- weight-weight_2
final$difference_speed <- speed-speed_2
Thanks everyone!
This would be simple with pivot_wider(), though I presume your data also has an id column to link observations somehow, so I have added one here:
library(tidyverse)
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
a<-rnorm(200, mean=5, sd=2)
b<-rnorm(200, mean=5, sd=2)
variable_names <- c( "obs_1", "obs_2")
df <-
data.frame(
"id" = rep(1:200, each = 2),
"observation" = variable_names,
"distance" = c(w, x),
"weight" = c(y, z),
"speed" = c(a, b)
)
df %>%
pivot_wider(
id_cols = id,
names_from = observation,
values_from = distance:speed
)
#> # A tibble: 200 x 7
#> id distance_obs_1 distance_obs_2 weight_obs_1 weight_obs_2 speed_obs_1
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.63 2.80 2.98 -0.795 3.58
#> 2 2 4.96 6.84 4.11 9.92 8.21
#> 3 3 4.84 7.51 6.32 3.28 9.02
#> 4 4 3.79 6.82 5.42 6.86 7.96
#> 5 5 5.48 2.84 9.56 3.27 3.55
#> 6 6 8.78 2.06 3.81 4.35 5.93
#> 7 7 8.42 4.21 3.92 4.40 9.37
#> 8 8 8.26 9.67 4.05 6.19 3.17
#> 9 9 3.80 4.47 6.58 5.38 6.09
#> 10 10 4.67 2.86 6.27 6.88 3.72
#> # ... with 190 more rows, and 1 more variable: speed_obs_2 <dbl>
Follow-up
You can also tell pivot_wider to apply a function when combining values. In this example I've passed names_from = NULL so that every column is paired up by id, and used the diff function to calculate the difference:
df %>%
pivot_wider(
id_cols = id,
names_from = NULL,
values_from = distance:speed,
values_fn = diff,
names_sep = ""
)
#> # A tibble: 200 x 4
#> id distance weight speed
#> <int> <dbl> <dbl> <dbl>
#> 1 1 -0.828 -3.77 4.45
#> 2 2 1.88 5.82 -1.07
#> 3 3 2.66 -3.04 -4.31
#> 4 4 3.03 1.45 -0.969
#> 5 5 -2.64 -6.29 5.06
#> 6 6 -6.72 0.541 -2.24
#> 7 7 -4.20 0.481 -5.82
#> 8 8 1.41 2.14 3.71
#> 9 9 0.669 -1.19 -1.14
#> 10 10 -1.81 0.607 -2.62
#> # ... with 190 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)
In the code below, I've simulated dice rolls at increasing sample sizes and computed the average roll at each sample size. My lapply function works, but I'm uncomfortable with it since sample_n() has been superseded by slice_sample(). I would like to make my code better with a dplyr solution that uses slice_sample() rather than sample_n() within the lapply. I think I may have other syntactical errors within the lapply. Here is the code:
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a dice roll
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = sample_n(dice_df,var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
The final step is computing the difference compared to the expected value, 3.5. I want a column where that shows the difference between 3.5 and the sample mean. We should see the difference decreasing as the sample size increases.
output <- output %>%
mutate(difference = across(sample_mean, ~3.5 - .x))
When I run this, it's throwing this error:
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "list"
I've tried using sapply but I get a similar error: no applicable method for 'mutate' applied to an object of class "c('matrix', 'array', 'list')"
If it helps, here was my failed attempt at using slice_sample:
output <- lapply(X=sample_sizes, FUN = function(...){
obs = slice_sample(dice_df, ..., .preserve=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, ...)
return(new.df)
})
I got this error: Error: '...' used in an incorrect context
The output is just a list of single-row data.frames. We can bind them with bind_rows and simply subtract once instead of doing it multiple times:
library(dplyr)
bind_rows(output) %>%
mutate(difference = 3.5 - sample_mean )
sample_mean var difference
1 3.500000 10 0.00000000
2 2.800000 25 0.70000000
3 3.440000 50 0.06000000
4 3.510000 100 -0.01000000
5 3.495000 1000 0.00500000
6 3.502200 10000 -0.00220000
7 3.502410 100000 -0.00241000
8 3.498094 1000000 0.00190600
9 3.500183 100000000 -0.00018332
The n argument of slice_sample corresponds to sample_n's size argument.
And to calculate the difference of your output list we can use purrr::map instead of dplyr::across.
library(dplyr)
library(purrr)
set.seed(123)
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a dice role
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = slice_sample(dice_df,n = var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
output %>%
map(~ 3.5 - .x$sample_mean)
#> [[1]]
#> [1] -0.5
#>
#> [[2]]
#> [1] 0.42
#>
#> [[3]]
#> [1] -0.04
#>
#> [[4]]
#> [1] -0.34
#>
#> [[5]]
#> [1] 0.025
#>
#> [[6]]
#> [1] 0.0317
#>
#> [[7]]
#> [1] 0.00416
#>
#> [[8]]
#> [1] -2.6e-05
#>
#> [[9]]
#> [1] -4.405e-05
Created on 2021-08-02 by the reprex package (v0.3.0)
Alternatively, we can use purrr::map_df and add a diff column inside each tibble, as proposed by Martin Gal in the comments:
output %>%
map_df(~ tibble(.x, diff = 3.5 - .x$sample_mean))
#> # A tibble: 9 x 3
#> sample_mean var diff
#> <dbl> <dbl> <dbl>
#> 1 2.6 10 0.9
#> 2 3.28 25 0.220
#> 3 3.66 50 -0.160
#> 4 3.5 100 0
#> 5 3.53 1000 -0.0270
#> 6 3.50 10000 -0.00180
#> 7 3.50 100000 -0.00444
#> 8 3.50 1000000 -0.000226
#> 9 3.50 100000000 -0.0000669
Here is a base R way -
transform(do.call(rbind, output), difference = 3.5 - sample_mean)
# sample_mean var difference
#1 3.80 10 -0.300000
#2 3.44 25 0.060000
#3 3.78 50 -0.280000
#4 3.30 100 0.200000
#5 3.52 1000 -0.015000
#6 3.50 10000 -0.004200
#7 3.50 100000 -0.004370
#8 3.50 1000000 0.002696
#9 3.50 100000000 0.000356
If you just need the difference value you can do -
3.5 - sapply(output, `[[`, 'sample_mean')
I am very new to R and I need some advice about some very basic issues.
I want to create a new column that is the sum of existent columns in my data frame Data4
The extended code is this:
Data4$E<-(Data4$E1+Data4$E2+Data4$E3+Data4$E4+Data4$E5)
I would like to simplify the code and find a way to avoid writing out the sequence of column names every time.
I tried this, but it is indeed wrong:
Data4$E<-(Data4$E[1:5])
Do you know a way to do it?
Thank you!
Among your options are:
set.seed(12)
Data4 <- data.frame(replicate(5, rnorm(5, 10, 1)))
colnames(Data4) <- paste0("E", 1:5)
# base R
Data4$E <- rowSums(Data4) # if there are just columns E1 to E5
Data4$E_option2 <- rowSums(subset(Data4, select = paste0("E", 1:5))) # if there are other columns ..
# "tidy"
library(tidyverse)
Data4 <- Data4 %>%
  mutate(E_option3 = pmap_dbl(Data4 %>% select(E1:E5), sum))
# E1 E2 E3 E4 E5 E E_option2 E_option3
#1 8.519432 9.727704 9.222280 9.296536 10.223641 46.98959 46.98959 46.98959
#2 11.577169 9.684651 8.706118 11.188879 12.007201 53.16402 53.16402 53.16402
#3 9.043256 9.371745 9.220433 10.340512 11.011979 48.98793 48.98793 48.98793
#4 9.079995 9.893536 10.011952 10.506968 9.697541 49.18999 49.18999 49.18999
#5 8.002358 10.428015 9.847584 9.706695 8.974755 46.95941 46.95941 46.95941
Use functions like sum or rowSums. It seems you want row sums. These functions are better than + because they have an na.rm argument that controls whether or not NAs are ignored.
Data4$E <- rowSums(Data4[, c("E1", "E2", "E3", "E4", "E5")], na.rm = TRUE)
An easy way to generate column names is to paste them with numbers. Equivalently, we could write it so we can reuse this for other such operations:
E_col_names <- sprintf("E%d", 1:5)
Data4$E <- rowSums(Data4[, E_col_names], na.rm = TRUE)
One more way to do it in dplyr, demonstrated on the toy data created in one of the above answers: just use E1:E5 inside c_across(). Of course, you may also use select helper functions, e.g. starts_with(), here.
#toy_data
set.seed(12)
Data4 <- data.frame(replicate(5, rnorm(5, 10, 1)))
colnames(Data4) <- paste0("E", 1:5)
library(dplyr)
Data4 %>% rowwise() %>%
mutate(E = sum(c_across(E1:E5)))
#> # A tibble: 5 x 6
#> # Rowwise:
#> E1 E2 E3 E4 E5 E
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 8.52 9.73 9.22 9.30 10.2 47.0
#> 2 11.6 9.68 8.71 11.2 12.0 53.2
#> 3 9.04 9.37 9.22 10.3 11.0 49.0
#> 4 9.08 9.89 10.0 10.5 9.70 49.2
#> 5 8.00 10.4 9.85 9.71 8.97 47.0
Created on 2021-05-25 by the reprex package (v2.0.0)
I first simulated 500 samples of size 55 in the normal distribution.
samples <- replicate(500, rnorm(55,mean=50, sd=10), simplify = FALSE)
1) For each sample, I want the mean, median, range, and third quartile. Then I need to store these together in a data frame.
This is what I have. I am not sure about the range or the quantile. I tried sapply and lapply, but I am not sure how they work.
stats <- data.frame(
means = map_dbl(samples,mean),
medians = map_dbl(samples,median),
sd= map_dbl(samples,sd),
range= map_int(samples, max-min),
third_quantile=sapply(samples,quantile,type=3)
)
2) Then plot the sampling distribution (histogram) of the means.
I try to plot but I don't get how to get the mean
stats <- gather(stats, key = "Trials", value = "Mean")
ggplot(stats,aes(x=Trials))+geom_histogram()
3) Then I want to plot the other three statistics in (three separate graphs) of a single plotting window.
I know I need to use something like gather and facet_wrap, but I am not sure how to do it.
You were almost there. All that is needed is to define anonymous functions wherever there are errors.
library(tidyverse)
set.seed(1234) # Make the results reproducible
samples <- replicate(500, rnorm(55,mean=50, sd=10), simplify = FALSE)
str(samples)
stats <- data.frame(
means = map_dbl(samples, mean),
medians = map_dbl(samples, median),
sd = map_dbl(samples, sd),
range = map_dbl(samples, function(x) diff(range(x))),
third_quantile = map_dbl(samples, function(x) quantile(x, probs = 3/4, type = 3))
)
str(stats)
#'data.frame': 500 obs. of 5 variables:
# $ means : num 49.8 51.5 52.2 50.2 51.6 ...
# $ medians : num 51.5 51.7 51 51.1 50.5 ...
# $ sd : num 9.55 7.81 11.43 8.97 10.75 ...
# $ range : num 38.5 37.2 54 36.7 60.2 ...
# $ third_quantile: num 57.7 56.2 58.8 55.6 57 ...
The map_dbl functions you're using are definitely nice, but if you're trying to get a data frame in the end anyway, you might have an easier time converting the list into a data frame at the beginning, then taking advantage of some dplyr functions.
I'm first mapping over the list, creating tibbles, and binding them together with an added ID. The conversion creates a column value holding the sample values. summarise_at lets you take a list of functions; supplying names in the list sets the names in the resulting data frame. You can use purrr's ~. notation to define these functions inline where needed, which cuts down on the number of times you have to call map_dbl and so on.
library(tidyverse)
stats <- samples %>%
map_dfr(as_tibble, .id = "sample") %>%
group_by(sample) %>%
summarise_at(vars(value),
.funs = list(mean = mean, median = median, sd = sd,
range = ~(max(.) - min(.)),
third_quartile = ~quantile(., probs = 0.75)))
head(stats)
#> # A tibble: 6 x 6
#> sample mean median sd range third_quartile
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 45.0 44.4 8.71 47.6 48.6
#> 2 10 51.0 52.0 9.55 49.3 56.2
#> 3 100 51.6 52.2 10.4 60.7 58.1
#> 4 101 51.6 51.1 9.92 37.6 57.2
#> 5 102 49.1 48.2 9.65 39.8 57.0
#> 6 103 52.2 51.3 10.1 47.4 58.5
Next, in your code you gathered the data (which is often the solution folks need on SO), but if you're only trying to show the mean column, you can work with it as is.
ggplot(stats, aes(x = mean)) +
geom_histogram()
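For part (3) of the question, a sketch using the stats tibble built above: gather the remaining statistics into long form, then let facet_wrap() draw one panel per statistic:
stats %>%
  gather(key = "statistic", value = "value", median, sd, range, third_quartile) %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ statistic, scales = "free")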
So, I am trying to run Bartlett's test (or any such test) in R. It works fine with imported data:
data(foster, package = "HSAUR")
bartlett.test(weight ~ litgen,data = foster)
But not with my data:
mdat <- matrix(c(2.3,2.2,2.25, 2.2,2.1,2.2, 2.15, 2.15, 2.2, 2.25, 2.15, 2.25), nrow = 3, ncol = 4)
working_df = data.frame(mdat)
bartlett.test(X1 ~ X2, data = working_df)
Error in bartlett.test.default(c(2.3, 2.2, 2.25), c(2.2, 2.1, 2.2)) :
there must be at least 2 observations in each group
I have tried various functions and assignments, but the problem is that the arguments are treated as a single object rather than their contents.
How can I run a Bartlett test with my data frames? How do I make the arguments be the contents rather than the container?
I don't know what you mean when you talk about "contents" and "container". The documentation at ?bartlett.test is pretty straightforward. You're trying to use a formula, so we'll look at the description of the formula argument:
formula a formula of the form lhs ~ rhs where lhs gives the data values and rhs the corresponding groups.
This matches with the structure of the foster data, where weight is numeric, and litgen is a categorical grouper.
head(foster)
litgen motgen weight
1 A A 61.5
2 A A 68.2
3 A A 64.0
4 A A 65.0
5 A A 59.7
6 A B 55.0
So, you need to put your data in that format.
your_data = data.frame(x = c(mdat), group = c(col(mdat)))
your_data
# x group
# 1 2.30 1
# 2 2.20 1
# 3 2.25 1
# 4 2.20 2
# 5 2.10 2
# 6 2.20 2
# 7 2.15 3
# 8 2.15 3
# 9 2.20 3
# 10 2.25 4
# 11 2.15 4
# 12 2.25 4
bartlett.test(x ~ group, data = your_data)
# Bartlett test of homogeneity of variances
#
# data: x by group
# Bartlett's K-squared = 0.86607, df = 3, p-value = 0.8336
That's all your groups at once. If you want to do pairwise comparisons, give subsets of your data to bartlett.test, as in the sketch below.
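A sketch of one such pairwise comparison, subsetting your_data (built above) to groups 1 and 2:
bartlett.test(x ~ group, data = subset(your_data, group %in% c(1, 2)))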