How to select columns with sequential letter+number names in R?

I am very new to R and I need some advice on a very basic issue.
I want to create a new column that is the sum of existing columns in my data frame Data4.
The extended code is this:
Data4$E <- (Data4$E1 + Data4$E2 + Data4$E3 + Data4$E4 + Data4$E5)
I would like to simplify the code and find a way to avoid writing out the sequence of column names every time.
I tried this, but it is wrong:
Data4$E <- (Data4$E[1:5])
Do you know a way to do it?
Thank you!

Among your options are:
set.seed(12)
Data4 <- data.frame(replicate(5, rnorm(5, 10, 1)))
colnames(Data4) <- paste0("E", 1:5)
# base R
Data4$E <- rowSums(Data4) # if there are just columns E1 to E5
Data4$E_option2 <- rowSums(subset(Data4, select = paste0("E", 1:5))) # if there are other columns ..
# "tidy"
library(tidyverse)
Data4 <- Data4 %>%
  mutate(E_option3 = pmap_dbl(Data4 %>% select(E1:E5), sum))
# E1 E2 E3 E4 E5 E E_option2 E_option3
#1 8.519432 9.727704 9.222280 9.296536 10.223641 46.98959 46.98959 46.98959
#2 11.577169 9.684651 8.706118 11.188879 12.007201 53.16402 53.16402 53.16402
#3 9.043256 9.371745 9.220433 10.340512 11.011979 48.98793 48.98793 48.98793
#4 9.079995 9.893536 10.011952 10.506968 9.697541 49.18999 49.18999 49.18999
#5 8.002358 10.428015 9.847584 9.706695 8.974755 46.95941 46.95941 46.95941
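If you prefer to stay inside mutate(), a more compact sketch uses rowSums() over across() (my addition, assuming dplyr >= 1.0, where across() is available):
# sketch: rowSums() over across() avoids the row-wise pmap_dbl() call
Data4 <- Data4 %>%
  mutate(E_option4 = rowSums(across(E1:E5)))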

Use functions like sum or rowSums; it seems you want row sums here. These functions are better than + because they have an na.rm argument that controls whether or not NAs are ignored.
Data4$E <- rowSums(Data4[, c("E1", "E2", "E3", "E4", "E5")], na.rm = TRUE)
An easy way to generate the column names is to paste them together with numbers. Equivalently, we can store them in a vector so the same names can be reused for other such operations:
E_col_names <- sprintf("E%d", 1:5)
Data4$E <- rowSums(Data4[, E_col_names], na.rm = TRUE)
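To see why na.rm matters, here is a small sketch on toy data of my own (not from the original answer):
d <- data.frame(E1 = c(1, NA), E2 = c(2, 3))
rowSums(d)                # 3 NA -- a missing value propagates by default
rowSums(d, na.rm = TRUE)  # 3 3  -- missing values are dropped from each sum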

One more way to do it in dplyr, demonstrated on the toy_data created in one of the answers above: just use E1:E5 inside c_across(). Of course, you may also use select helper functions, e.g. starts_with(), here (see the sketch after the output below).
#toy_data
set.seed(12)
Data4 <- data.frame(replicate(5, rnorm(5, 10, 1)))
colnames(Data4) <- paste0("E", 1:5)
library(dplyr)
Data4 %>%
  rowwise() %>%
  mutate(E = sum(c_across(E1:E5)))
#> # A tibble: 5 x 6
#> # Rowwise:
#> E1 E2 E3 E4 E5 E
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 8.52 9.73 9.22 9.30 10.2 47.0
#> 2 11.6 9.68 8.71 11.2 12.0 53.2
#> 3 9.04 9.37 9.22 10.3 11.0 49.0
#> 4 9.08 9.89 10.0 10.5 9.70 49.2
#> 5 8.00 10.4 9.85 9.71 8.97 47.0
Created on 2021-05-25 by the reprex package (v2.0.0)
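For completeness, a minimal sketch of the starts_with() variant mentioned above (my addition; it assumes E1 to E5 are the only columns whose names begin with "E"):
Data4 %>%
  rowwise() %>%
  mutate(E = sum(c_across(starts_with("E"))))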

Related

How to multiply specific columns in R by scalar

This is probably a very simple problem, but I can't seem to figure it out.
I am trying to multiply every column (except the first one) in my dataframe by the same scalar. Here is my reproducible example:
df <- data.frame(replicate(200, sample(0:100, 1000, rep=TRUE)))
a <- 0.75
First I tried this:
df2 <- df[,2:200]*a
However, this creates a dataframe df2 that is missing the first column.
I also tried using tidyverse with mutate_at and specifying a multiplication function, but that didn't run at all:
scalar <- function(x) (x*0.75)
df2 <- df %>% mutate_at(across(c(2:200)), scalar)
My apologies in advance if this is very simple.
df <-
  data.frame(
    replicate(
      5,
      sample(0:100, 1000, rep = TRUE)
    )
  )
a <- 0.75
df2 <-
  df |>
  dplyr::mutate(
    dplyr::across(
      # Every column but the first column
      .cols = -c(1),
      .fns = function(x) {
        x * a
      }
    )
  )
head(df2)
#> X1 X2 X3 X4 X5
#> 1 58 33.00 54.75 64.50 35.25
#> 2 39 63.00 2.25 8.25 30.00
#> 3 63 18.00 30.00 9.00 0.00
#> 4 22 39.75 46.50 11.25 29.25
#> 5 42 18.75 60.75 31.50 3.00
#> 6 34 46.50 15.00 74.25 55.50
Created on 2022-11-08 with reprex v2.0.2
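For reference, the same operation in base R, as a minimal sketch (my addition, not part of the original answer):
df2 <- df               # copy the full dataframe, keeping column 1 intact
df2[-1] <- df2[-1] * a  # scale every column except the first in place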

Mutate data to juxtapose repeat measurements

Let's pretend I am measuring the distance grasshoppers can jump pre- and post-treatment. This is just for fun; the real measurement could be anything, and the bigger picture is to understand the group_by() command.
For the statistical test I would like to run, each observation needs to have its own column, but I'm given a dataset that is not in this format. I would like to use the package library(dplyr) and the command group_by() to shape the data for my needs, because if this were to happen again, I could write more general code that works on other datasets :)
I am able to do this using commands such as filter(), and then cbind() at a later step (see example below), but it also requires renaming a column. Additionally, if I wanted to add a column, let's say "difference", to calculate the observed difference between observation 1 and observation 2, I can do this, but then I need to add another line of code (again, see example below).
It would be great to do this with fewer lines of code.
Please see what I have tried, and let me know how I could modify the group_by() code to work properly.
example_df <- data.frame( "observation" = character(0), "distance" = integer(0))
Assign names for our "observations"; remember, in this example, each is measured twice:
variable_names <- c( "obs_1", "obs_2")
Assign fictitious values:
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
Combine everything for this pretend exercise
df <- data.frame( "observation" = variable_names, "distance" = c(w,x,y,z))
attach(df)
Here's how I achieved the desired results for this example
library(dplyr)
dat = filter(df,observation=="obs_1")
dat2 = filter(df,observation=="obs_2")
names(dat2)
colnames(dat2)[2] <- "distance_2"
final <- cbind(dat,dat2)
attach(final)
final$difference <- distance-distance_2
I tried using the group_by() command, but I just get an error message:
final <- df %>% group_by(observation,distance) %>% summarise(
Observation_1 = first(observation), distance_1 = first(distance),
Observation_2 = last(observation), distance_2 = last(distance,difference=distance-distance_2)))
It would be great to get the above code to work
To make things even more "fun" :), what if more than one variable was measured? Could I write a general code to achieve the desired results, again, without having to go over the whole filter() process, with cbind(), etc.?
Here's an example (expanded on the above one)
example_df <- data.frame( "observation" = character(0), "distance" = integer(0),"weight" = integer(0),"speed" = integer(0))
variable_names <- c( "obs_1", "obs_2")
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
a<-rnorm(200, mean=5, sd=2)
b<-rnorm(200, mean=5, sd=2)
df <- data.frame( "observation" = variable_names, "distance" = c(w,x),"weight" = c(y,z),"speed" = c(a,b))
attach(df)
library(dplyr)
dat = filter(df,observation=="obs_1")
dat2 = filter(df,observation=="obs_2")
names(dat2)
colnames(dat2)[2] <- "distance_2"
colnames(dat2)[3] <- "weight_2"
colnames(dat2)[4] <- "speed_2"
final <- cbind(dat,dat2)
attach(final)
final$difference <- distance-distance_2
final$difference_weight <- weight-weight_2
final$difference_speed <- speed-speed_2
Thanks everyone!
Would be simple with pivot_wider, though I presume your data also has an id column to link observations somehow, so I have added one here:
library(tidyverse)
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
a<-rnorm(200, mean=5, sd=2)
b<-rnorm(200, mean=5, sd=2)
variable_names <- c( "obs_1", "obs_2")
df <-
  data.frame(
    "id" = rep(1:200, each = 2),
    "observation" = variable_names,
    "distance" = c(w, x),
    "weight" = c(y, z),
    "speed" = c(a, b)
  )
df %>%
  pivot_wider(
    id_cols = id,
    names_from = observation,
    values_from = distance:speed
  )
#> # A tibble: 200 x 7
#> id distance_obs_1 distance_obs_2 weight_obs_1 weight_obs_2 speed_obs_1
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.63 2.80 2.98 -0.795 3.58
#> 2 2 4.96 6.84 4.11 9.92 8.21
#> 3 3 4.84 7.51 6.32 3.28 9.02
#> 4 4 3.79 6.82 5.42 6.86 7.96
#> 5 5 5.48 2.84 9.56 3.27 3.55
#> 6 6 8.78 2.06 3.81 4.35 5.93
#> 7 7 8.42 4.21 3.92 4.40 9.37
#> 8 8 8.26 9.67 4.05 6.19 3.17
#> 9 9 3.80 4.47 6.58 5.38 6.09
#> 10 10 4.67 2.86 6.27 6.88 3.72
#> # ... with 190 more rows, and 1 more variable: speed_obs_2 <dbl>
Follow-up
You can also tell pivot_wider to use a function when combining values. In this example I've passed names_from = NULL so that every column is paired up by id, and used the diff function to calculate the difference:
df %>%
  pivot_wider(
    id_cols = id,
    names_from = NULL,
    values_from = distance:speed,
    values_fn = diff,
    names_sep = ""
  )
#> # A tibble: 200 x 4
#> id distance weight speed
#> <int> <dbl> <dbl> <dbl>
#> 1 1 -0.828 -3.77 4.45
#> 2 2 1.88 5.82 -1.07
#> 3 3 2.66 -3.04 -4.31
#> 4 4 3.03 1.45 -0.969
#> 5 5 -2.64 -6.29 5.06
#> 6 6 -6.72 0.541 -2.24
#> 7 7 -4.20 0.481 -5.82
#> 8 8 1.41 2.14 3.71
#> 9 9 0.669 -1.19 -1.14
#> 10 10 -1.81 0.607 -2.62
#> # ... with 190 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)
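As an aside, base R can do the first pivot too; a minimal sketch with stats::reshape (my addition, assuming the same df with an id column):
# wide format: one row per id, with columns like distance.obs_1, distance.obs_2, ...
wide <- reshape(df, idvar = "id", timevar = "observation", direction = "wide")
head(wide)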

Arranging Columns in R

I have data of the form:
Department LengthAfter
1 A 8.42
2 B 10.93
3 D 9.98
4 A 10.13
5 B 10.54
6 C 7.82
7 A 9.55
8 D 12.53
9 C 7.87
I would like to make a new table or data frame in which the column header is each department (A, B, C, D) and the lengths under each column are the values of LengthAfter corresponding to each department, e.g.
A B C D
8.42 10.93 7.82 9.98
Can anyone help with this? Thank you
Using tidyverse, you can use pivot_wider to pivot your data into the desired form. Before that, you will need to sort (arrange) by Department if you want to include the values from LengthAfter in their order of appearance and have the columns in order as above.
library(tidyverse)
df %>%
  arrange(Department) %>%
  group_by(Department) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = "Department", values_from = "LengthAfter") %>%
  select(-rn)
Output
A B C D
<dbl> <dbl> <dbl> <dbl>
1 8.42 10.9 7.82 9.98
2 10.1 10.5 7.87 12.5
3 9.55 NA NA NA
You can use the package reshape2 for this. Note that you need a within-group row id so that each department's values line up:
library(reshape2)
df$rn <- ave(df$LengthAfter, df$Department, FUN = seq_along)
df_new <- dcast(df, rn ~ Department, value.var = "LengthAfter")
Base-R
dept_length <- read.csv("/Users/usr/SO_Department_LengthAfter.tsv", sep="\t");
dl_list <- with(dept_length, tapply(LengthAfter, Department, `c`));
n.obs <- sapply(dl_list, length);
seq.max <- seq_len(max(n.obs));
sapply(dl_list, `[`, i = seq.max);
Returns:
A B C D
[1,] 8.42 10.93 7.82 9.98
[2,] 10.13 10.54 7.87 12.53
[3,] 9.55 NA NA NA
References:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html
How to convert a list consisting of vector of different lengths to a usable data frame in R?
https://eeob-biodata.github.io/R-Data-Skills/05-split-apply-combine/

How can I use the names() function with a for loop in R

I have a little problem with my R code. I don't know where, but I am making a mistake.
The problem is:
I have many Excel files with the same column names. I'd like to change the titles of the matrix to other titles.
These are the five files:
AA <- read_excel("AA.xlsx")
BB <- read_excel("BB.xlsx")
CC <- read_excel("CC.xlsx")
DD <- read_excel("DD.xlsx")
EE <- read_excel("EE.xlsx")
head(AA) # the matrix is the same for the other files
DATA Open Max Min Close VAR % CLOSE VOLUME
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2004-07-07 00:00:00 3.73 3.79 3.6 3.70 0 21810440
2 2004-07-08 00:00:00 3.7 3.71 3.47 3.65 -1.43 7226890
3 2004-07-09 00:00:00 3.61 3.65 3.56 3.65 0 3754407
4 2004-07-12 00:00:00 3.64 3.65 3.59 3.63 -0.55 850667
5 2004-07-13 00:00:00 3.63 3.63 3.58 3.59 -1.16 777508
6 2004-07-14 00:00:00 3.54 3.59 3.47 3.5 -2.45 1931765
To change the titles quickly, I decided to use this code:
t <- list(AA, BB, CC, DD, EE)
for (i in t) {
  names(i) <- c("DATA", "OPE", "MAX", "MIN", "CLO", "VAR%", "VOL")
} # R doesn't give any type of error!
head(AA) # the data are the same, as if the for loop didn't exist
Where did I go wrong?
Thank you so much in advance.
Francesco
We can do this with lapply. Get the datasets in a list with mget, loop through the list, set the column names to the vector of names ('nm1'), and modify the objects in the global environment with list2env:
nm1 <- c("DATA", "OPE", "MAX", "MIN", "CLO", "VAR%", "VOL")
lst <- lapply(mget(nm2), setNames, nm1)
list2env(lst, envir = .GlobalEnv)
Or, using a for loop, loop through the vector of object names and assign the column names to the objects in the global environment:
for(nm in nm2) assign(nm, `names<-`(get(nm), nm1))
Or using tidyverse
library(tidyverse)
mget(nm2) %>%
map(set_names, nm1) %>%
list2env(., envir = .GlobalEnv)
data
AA <- mtcars[1:7]
BB <- mtcars[1:7]
CC <- mtcars[1:7]
DD <- mtcars[1:7]
EE <- mtcars[1:7]
nm2 <- strrep(LETTERS[1:5], 2)
I am trying to explain why your code didn't work. In the list t, the address of AA (t[[1]]) is the same as AA in the global environment. In the for loop, i is initially the same copy as the data.frame AA in the global environment. When you change the names of i with names(i) <-, the data.frame i is copied twice. In the end, you are changing the names of a new data.frame i rather than the original data.frame AA in the global environment.
Here is an example to illustrate what I mean (tracemem "marks an object so that a message is printed whenever the internal code copies the object."):
tracemem(mtcars)
# [1] "<0x1095b2150>"
tracemem(iris)
# [1] "<0x10959a350>"
x <- list(mtcars, iris)
for (i in x) {
  cat('-------\n')
  tracemem(i)
  names(i) <- paste(names(i), 'xx')
}
# -------
# tracemem[0x1095b2150 -> 0x10d678c00]:
# tracemem[0x10d678c00 -> 0x10d678ca8]:
# -------
# tracemem[0x10959a350 -> 0x10cb307b0]:
# tracemem[0x10cb307b0 -> 0x10cb30818]:
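A minimal sketch of a for-loop version that does work (my addition): modify the list elements by index rather than via a loop variable, then write the renamed data frames back with list2env.
t <- list(AA = AA, BB = BB, CC = CC, DD = DD, EE = EE)
for (i in seq_along(t)) {
  names(t[[i]]) <- nm1  # modifies the list element itself, not a copy
}
list2env(t, envir = .GlobalEnv)  # overwrite AA..EE in the global environment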

How to simplify this normalization code? [duplicate]

I have a dataset called spam which contains 58 columns and approximately 3500 rows of data related to spam messages.
I plan on running some linear regression on this dataset in the future, but I'd like to do some pre-processing beforehand and standardize the columns to have zero mean and unit variance.
I've been told the best way to go about this is with R, so I'd like to ask: how can I achieve normalization with R? I've already got the data properly loaded, and I'm just looking for some packages or methods to perform this task.
I have to assume you meant to say that you wanted a mean of 0 and a standard deviation of 1. If your data is in a dataframe and all the columns are numeric you can simply call the scale function on the data to do what you want.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)
# check that we get mean of 0 and sd of 1
colMeans(scaled.dat) # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)
Using built-in functions is classy.
Realizing that the question is old and one answer is accepted, I'll provide another answer for reference.
scale is limited by the fact that it scales all variables. The solution below allows you to scale only specific variables while leaving the other variables unchanged (and the variable names can be generated dynamically):
library(dplyr)
set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2),
                  y = runif(10, 3, 5),
                  z = runif(10, 10, 20))
dat
dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
dat2
which gives me this:
> dat
x y z
1 29.75859 3.633225 14.56091
2 30.05549 3.605387 12.65187
3 30.21689 3.318092 13.04672
4 29.53086 3.079992 15.07307
5 30.08582 3.437599 11.81096
6 30.10121 4.621197 17.59671
7 29.88505 4.051395 12.01248
8 29.89067 4.829316 12.58810
9 29.88711 4.662690 19.92150
10 29.82199 3.091541 18.07352
and
> dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
> dat2
x y z
1 29.75859 -0.3004815 -0.06016029
2 30.05549 -0.3423437 -0.72529604
3 30.21689 -0.7743696 -0.58772361
4 29.53086 -1.1324181 0.11828039
5 30.08582 -0.5946582 -1.01827752
6 30.10121 1.1852038 0.99754666
7 29.88505 0.3283513 -0.94806607
8 29.89067 1.4981677 -0.74751378
9 29.88711 1.2475998 1.80753470
10 29.82199 -1.1150515 1.16367556
EDIT 1 (2016): Addressed Julian's comment: the output of scale is an Nx1 matrix, so ideally we should add an as.vector to convert the matrix type back into a vector type. Thanks Julian!
EDIT 2 (2019): Quoting Duccio A.'s comment: for the latest dplyr (version 0.8) you need to change dplyr::funcs to list, as in dat %>% mutate_each_(list(~scale(.) %>% as.vector), vars = c("y", "z")).
EDIT 3 (2020): Thanks to @mj_whales: the old solution is deprecated, and now we need to use mutate_at.
This is 3 years old. Still, I feel I have to add the following:
The most common normalization is the z-transformation, where you subtract the mean and divide by the standard deviation of your variable. The result will have mean=0 and sd=1.
For that, you don't need any package.
zVar <- (myVar - mean(myVar)) / sd(myVar)
That's it.
The caret package provides methods for preprocessing data (e.g. centering and scaling). You could also use the following code:
library(caret)
# Assuming goal class is column 10
preObj <- preProcess(data[, -10], method=c("center", "scale"))
newData <- predict(preObj, data[, -10])
More details: http://www.inside-r.org/node/86978
When I used the solution stated by Dason, instead of getting a data frame as a result, I got a matrix of numbers (the scaled values of my df).
In case someone is having the same trouble, you have to add as.data.frame() to the code, like this:
df.scaled <- as.data.frame(scale(df))
I hope this will be useful for people having the same issue!
You can also easily normalize the data using the data.Normalization function in the clusterSim package. It provides different methods of data normalization (a usage sketch follows the argument list below):
data.Normalization(x, type = "n0", normalization = "column")
Arguments
x
vector, matrix or dataset
type
type of normalization:
n0 - without normalization
n1 - standardization ((x-mean)/sd)
n2 - positional standardization ((x-median)/mad)
n3 - unitization ((x-mean)/range)
n3a - positional unitization ((x-median)/range)
n4 - unitization with zero minimum ((x-min)/range)
n5 - normalization in range <-1,1> ((x-mean)/max(abs(x-mean)))
n5a - positional normalization in range <-1,1> ((x-median)/max(abs(x-median)))
n6 - quotient transformation (x/sd)
n6a - positional quotient transformation (x/mad)
n7 - quotient transformation (x/range)
n8 - quotient transformation (x/max)
n9 - quotient transformation (x/mean)
n9a - positional quotient transformation (x/median)
n10 - quotient transformation (x/sum)
n11 - quotient transformation (x/sqrt(SSQ))
n12 - normalization ((x-mean)/sqrt(sum((x-mean)^2)))
n12a - positional normalization ((x-median)/sqrt(sum((x-median)^2)))
n13 - normalization with zero being the central point ((x-midrange)/(range/2))
normalization
"column" - normalization by variable, "row" - normalization by object
With dplyr v0.7.4 all variables can be scaled by using mutate_all():
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
set.seed(1234)
dat <- tibble(x = rnorm(10, 30, .2),
              y = runif(10, 3, 5),
              z = runif(10, 10, 20))
dat %>% mutate_all(scale)
#> # A tibble: 10 x 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 -0.827 -0.300 -0.0602
#> 2 0.663 -0.342 -0.725
#> 3 1.47 -0.774 -0.588
#> 4 -1.97 -1.13 0.118
#> 5 0.816 -0.595 -1.02
#> 6 0.893 1.19 0.998
#> 7 -0.192 0.328 -0.948
#> 8 -0.164 1.50 -0.748
#> 9 -0.182 1.25 1.81
#> 10 -0.509 -1.12 1.16
Specific variables can be excluded using mutate_at():
dat %>% mutate_at(scale, .vars = vars(-x))
#> # A tibble: 10 x 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 29.8 -0.300 -0.0602
#> 2 30.1 -0.342 -0.725
#> 3 30.2 -0.774 -0.588
#> 4 29.5 -1.13 0.118
#> 5 30.1 -0.595 -1.02
#> 6 30.1 1.19 0.998
#> 7 29.9 0.328 -0.948
#> 8 29.9 1.50 -0.748
#> 9 29.9 1.25 1.81
#> 10 29.8 -1.12 1.16
Created on 2018-04-24 by the reprex package (v0.2.0).
Again, even though this is an old question, it is very relevant! I have found a simple way to normalise certain columns without the need for any packages:
normFunc <- function(x){(x-mean(x, na.rm = T))/sd(x, na.rm = T)}
For example
x<-rnorm(10,14,2)
y<-rnorm(10,7,3)
z<-rnorm(10,18,5)
df<-data.frame(x,y,z)
df[2:3] <- apply(df[2:3], 2, normFunc)
You will see that the y and z columns have been normalised. No packages needed :-)
scale can be used on both a full data frame and specific columns.
For specific columns, the following code can be used:
trainingSet[, 3:7] = scale(trainingSet[, 3:7]) # For column 3 to 7
trainingSet[, 8] = scale(trainingSet[, 8]) # For column 8
Full data frame
trainingSet <- scale(trainingSet)
The collapse package provides the fastest scale function, implemented in C++ using Welford's online algorithm:
dat <- data.frame(x = rnorm(1e6, 30, .2),
y = runif(1e6, 3, 5),
z = runif(1e6, 10, 20))
library(collapse)
library(microbenchmark)
microbenchmark(fscale(dat), scale(dat))
Unit: milliseconds
        expr       min       lq      mean    median        uq      max neval cld
 fscale(dat)  27.86456  29.5864  38.96896  30.80421  43.79045 313.5729   100  a
  scale(dat) 357.07130 391.0914 489.93546 416.33626 625.38561 793.2243   100   b
Furthermore, fscale is an S3 generic for vectors, matrices and data frames, and it also supports grouped and/or weighted scaling operations, as well as scaling to arbitrary means and standard deviations; a small sketch of those extras follows.
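A hedged illustration of those extras (my addition; the g, mean and sd arguments are as I understand collapse's documentation, so treat the exact names as an assumption):
g <- sample(c("a", "b"), 1e6, replace = TRUE)
fscale(dat$x, g = g)                # scale x within each group of g
fscale(dat$x, mean = 100, sd = 15)  # scale to mean 100 and sd 15 instead of 0 and 1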
The dplyr package has two functions that do this.
> require(dplyr)
> require(data.table)  # the examples below also build data.table objects
To mutate specific columns of a data table, you can use the function mutate_at(). To mutate all columns, you can use mutate_all().
The following is a brief example for using these functions to standardize data.
Mutate specific columns:
dt = data.table(a = runif(3500), b = runif(3500), c = runif(3500))
dt = data.table(dt %>% mutate_at(vars("a", "c"), scale)) # can also index columns by number, e.g., vars(c(1,3))
> apply(dt, 2, mean)
a b c
1.783137e-16 5.064855e-01 -5.245395e-17
> apply(dt, 2, sd)
a b c
1.0000000 0.2906622 1.0000000
Mutate all columns:
dt = data.table(a = runif(3500), b = runif(3500), c = runif(3500))
dt = data.table(dt %>% mutate_all(scale))
> apply(dt, 2, mean)
a b c
-1.728266e-16 9.291994e-17 1.683551e-16
> apply(dt, 2, sd)
a b c
1 1 1
Before I happened to find this thread, I had the same problem. I had user-dependent column types, so I wrote a for loop going through them and scaling the needed columns. There are probably better ways to do it, but this solved the problem just fine:
for (i in 1:length(colnames(df))) {
  if (class(df[, i]) == "numeric" || class(df[, i]) == "integer") {
    df[, i] <- as.vector(scale(df[, i]))
  }
}
as.vector is a needed part, because it turns out scale returns an N x 1 matrix, which is usually not what you want to have in your data.frame.
@BBKim pretty much gave the best answer, but it can be done even more briefly. I'm surprised no one has come up with it yet.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
dat <- apply(dat, 2, function(x) (x - mean(x)) / sd(x))
Use the package recommenderlab. Download and install the package.
This package has a built-in normalize command. It also allows you to choose one of several normalization methods, namely 'center' or 'Z-score'.
Follow this example:
## create a matrix with ratings
m <- matrix(sample(c(NA, 0:5), 50, replace = TRUE, prob = c(.5, rep(.5/6, 6))),
            nrow = 5, ncol = 10,
            dimnames = list(users = paste('u', 1:5, sep = ''),
                            items = paste('i', 1:10, sep = '')))
## do normalization
r <- as(m, "realRatingMatrix")
# here, 'center' is the default method
r_n1 <- normalize(r)
# here, "Z-score" is the method used
r_n2 <- normalize(r, method="Z-score")
r
r_n1
r_n2
## show normalized data
image(r, main="Raw Data")
image(r_n1, main="Centered")
image(r_n2, main="Z-Score Normalization")
The code below could be the shortest way to achieve this.
dataframe <- apply(dataframe, 2, scale)
The normalize function from the BBmisc package was the right tool for me, since it can deal with NA values.
Here is how to use it:
Given the following dataset,
ASR_API <- c("CV", "F", "IER", "LS-c", "LS-o")
Human <- c(NA, 5.8, 12.7, NA, NA)
Google <- c(23.2, 24.2, 16.6, 12.1, 28.8)
GoogleCloud <- c(23.3, 26.3, 18.3, 12.3, 27.3)
IBM <- c(21.8, 47.6, 24.0, 9.8, 25.3)
Microsoft <- c(29.1, 28.1, 23.1, 18.8, 35.9)
Speechmatics <- c(19.1, 38.4, 21.4, 7.3, 19.4)
Wit_ai <- c(35.6, 54.2, 37.4, 19.2, 41.7)
library(BBmisc)
library(data.table)
dt <- data.table(ASR_API, Human, Google, GoogleCloud, IBM, Microsoft, Speechmatics, Wit_ai)
> dt
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai
1: CV NA 23.2 23.3 21.8 29.1 19.1 35.6
2: F 5.8 24.2 26.3 47.6 28.1 38.4 54.2
3: IER 12.7 16.6 18.3 24.0 23.1 21.4 37.4
4: LS-c NA 12.1 12.3 9.8 18.8 7.3 19.2
5: LS-o NA 28.8 27.3 25.3 35.9 19.4 41.7
normalized values can be obtained like this:
> dtn <- normalize(dt, method = "standardize", range = c(0, 1), margin = 1L, on.constant = "quiet")
> dtn
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai
1: CV NA 0.3361245 0.2893457 -0.28468670 0.3247336 -0.18127203 -0.16032655
2: F -0.7071068 0.4875320 0.7715885 1.59862532 0.1700986 1.55068347 1.31594762
3: IER 0.7071068 -0.6631646 -0.5143923 -0.12409420 -0.6030768 0.02512682 -0.01746131
4: LS-c NA -1.3444981 -1.4788780 -1.16064578 -1.2680075 -1.24018782 -1.46198764
5: LS-o NA 1.1840062 0.9323361 -0.02919864 1.3762521 -0.15435044 0.32382788
whereas the hand-calculated method just produces NAs for columns containing NAs:
> dt %>% mutate(normalizedHuman = (Human - mean(Human))/sd(Human)) %>%
+ mutate(normalizedGoogle = (Google - mean(Google))/sd(Google)) %>%
+ mutate(normalizedGoogleCloud = (GoogleCloud - mean(GoogleCloud))/sd(GoogleCloud)) %>%
+ mutate(normalizedIBM = (IBM - mean(IBM))/sd(IBM)) %>%
+ mutate(normalizedMicrosoft = (Microsoft - mean(Microsoft))/sd(Microsoft)) %>%
+ mutate(normalizedSpeechmatics = (Speechmatics - mean(Speechmatics))/sd(Speechmatics)) %>%
+ mutate(normalizedWit_ai = (Wit_ai - mean(Wit_ai))/sd(Wit_ai))
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai normalizedHuman normalizedGoogle
1 CV NA 23.2 23.3 21.8 29.1 19.1 35.6 NA 0.3361245
2 F 5.8 24.2 26.3 47.6 28.1 38.4 54.2 NA 0.4875320
3 IER 12.7 16.6 18.3 24.0 23.1 21.4 37.4 NA -0.6631646
4 LS-c NA 12.1 12.3 9.8 18.8 7.3 19.2 NA -1.3444981
5 LS-o NA 28.8 27.3 25.3 35.9 19.4 41.7 NA 1.1840062
normalizedGoogleCloud normalizedIBM normalizedMicrosoft normalizedSpeechmatics normalizedWit_ai
1 0.2893457 -0.28468670 0.3247336 -0.18127203 -0.16032655
2 0.7715885 1.59862532 0.1700986 1.55068347 1.31594762
3 -0.5143923 -0.12409420 -0.6030768 0.02512682 -0.01746131
4 -1.4788780 -1.16064578 -1.2680075 -1.24018782 -1.46198764
5 0.9323361 -0.02919864 1.3762521 -0.15435044 0.32382788
(normalizedHuman is turned into a column of NAs ...)
Regarding the selection of specific columns for the calculation, a generic method can be employed like this one:
data_vars <- df_full %>% dplyr::select(-ASR_API,-otherVarNotToBeUsed)
meta_vars <- df_full %>% dplyr::select(ASR_API,otherVarNotToBeUsed)
data_varsn <- normalize(data_vars, method = "standardize", range = c(0, 1), margin = 1L, on.constant = "quiet")
dtn <- cbind(meta_vars,data_varsn)
