How can I use function names with a for loop in R?

I have a small problem with my R code; I'm making a mistake somewhere, but I don't know where.
The problem is:
I have several Excel files with the same column names. I'd like to replace those column names with new ones.
There are five files.
AA <- read_excel("AA.xlsx")
BB <- read_excel("BB.xlsx")
CC <- read_excel("CC.xlsx")
DD <- read_excel("DD.xlsx")
EE <- read_excel("EE.xlsx")
head(AA) # the structure is the same for the other files
DATA Open Max Min Close VAR % CLOSE VOLUME
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2004-07-07 00:00:00 3.73 3.79 3.6 3.70 0 21810440
2 2004-07-08 00:00:00 3.7 3.71 3.47 3.65 -1.43 7226890
3 2004-07-09 00:00:00 3.61 3.65 3.56 3.65 0 3754407
4 2004-07-12 00:00:00 3.64 3.65 3.59 3.63 -0.55 850667
5 2004-07-13 00:00:00 3.63 3.63 3.58 3.59 -1.16 777508
6 2004-07-14 00:00:00 3.54 3.59 3.47 3.5 -2.45 1931765
To change the column names quickly, I decided to use this code.
t <- list(AA, BB, CC, DD, EE)
for (i in t ) {
names(i) <- c("DATA", "OPE", "MAX", "MIN", "CLO", "VAR%", "VOL")
} # R doesn't give any error!
head(AA) # the names are unchanged, as if the for loop never ran
Where did I go wrong?
Thank you so much in advance.
Francesco

We can do this with lapply. Get the datasets in a list with mget, loop through the list, set the column names to the vector of names ('nm1'), and modify the objects in the global environment with list2env:
nm1 <- c("DATA", "OPE", "MAX", "MIN", "CLO", "VAR%", "VOL")
lst <- lapply(mget(nm2), setNames, nm1)
list2env(lst, envir = .GlobalEnv)
Or, using a for loop: loop through the vector of object names and assign the new column names to the objects in the global environment:
for(nm in nm2) assign(nm, `names<-`(get(nm), nm1))
Or using the tidyverse:
library(tidyverse)
mget(nm2) %>%
map(set_names, nm1) %>%
list2env(., envir = .GlobalEnv)
data
AA <- mtcars[1:7]
BB <- mtcars[1:7]
CC <- mtcars[1:7]
DD <- mtcars[1:7]
EE <- mtcars[1:7]
nm2 <- strrep(LETTERS[1:5], 2)
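After running any of the approaches above, the rename can be verified directly:
names(AA)
# [1] "DATA" "OPE"  "MAX"  "MIN"  "CLO"  "VAR%" "VOL"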

I'll try to explain why your code didn't work. In the list t, the element t[[1]] initially points to the same underlying data as AA in the global environment. In the for loop, i likewise starts out sharing its data with AA. But when you change the names with names(i) <-, R's copy-on-modify semantics kick in: the data.frame is copied, so you end up renaming a fresh data.frame i rather than the original AA in the global environment.
Here is an example to illustrate what I mean (tracemem "marks an object so that a message is printed whenever the internal code copies the object."):
tracemem(mtcars)
# [1] "<0x1095b2150>"
tracemem(iris)
# [1] "<0x10959a350>"
x <- list(mtcars, iris)
for(i in x){
cat('-------\n')
tracemem(i)
names(i) <- paste(names(i), 'xx')
}
# -------
# tracemem[0x1095b2150 -> 0x10d678c00]:
# tracemem[0x10d678c00 -> 0x10d678ca8]:
# -------
# tracemem[0x10959a350 -> 0x10cb307b0]:
# tracemem[0x10cb307b0 -> 0x10cb30818]:
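If you want to keep the for-loop idiom, the fix is to loop over indices so the assignment targets the list element itself, then write the modified copies back; a minimal sketch reusing nm1 and nm2 from the first answer:
t <- mget(nm2)
for (i in seq_along(t)) names(t[[i]]) <- nm1
list2env(t, envir = .GlobalEnv)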


How to select different columns with sequential letter+number in R?

I am very new to R and I need some advice about very basic issues.
I want to create a new column that is the sum of existent columns in my data frame Data4
The extended code is this:
Data4$E<-(Data4$E1+Data4$E2+Data4$E3+Data4$E4+Data4$E5)
I would like to simplify the code and find a way to avoid writing out the full sequence of column names every time.
I tried this, but it is wrong:
Data4$E<-(Data4$E[1:5])
Do you know a way to do it?
Thank you!
Among your options are:
set.seed(12)
Data4 <- data.frame(replicate(5, rnorm(5, 10, 1)))
colnames(Data4) <- paste0("E", 1:5)
# base R
Data4$E <- rowSums(Data4) # if there are just columns E1 to E5
Data4$E_option2 <- rowSums(subset(Data4, select = paste0("E", 1:5))) # if there are other columns ..
# "tidy"
library(tidyverse)
Data4 <- Data4 %>%
mutate(E_option3 = pmap_dbl(Data4 %>%
select(E1:E5),
sum))
# E1 E2 E3 E4 E5 E E_option2 E_option3
#1 8.519432 9.727704 9.222280 9.296536 10.223641 46.98959 46.98959 46.98959
#2 11.577169 9.684651 8.706118 11.188879 12.007201 53.16402 53.16402 53.16402
#3 9.043256 9.371745 9.220433 10.340512 11.011979 48.98793 48.98793 48.98793
#4 9.079995 9.893536 10.011952 10.506968 9.697541 49.18999 49.18999 49.18999
#5 8.002358 10.428015 9.847584 9.706695 8.974755 46.95941 46.95941 46.95941
Use functions like sum or rowSums. It seems you want row sums. These functions are better than + because they have an na.rm argument that controls whether or not NAs are ignored.
Data4$E <- rowSums(Data4[, c("E1", "E2", "E3", "E4", "E5")], na.rm = TRUE)
An easy way to generate the column names is to paste a prefix with numbers; writing it this way lets us reuse it for other such operations:
E_col_names <- sprintf("E%d", 1:5)
Data4$E <- rowSums(Data4[, E_col_names], na.rm = TRUE)
One more way to do it in dplyr, demonstrated on the toy data created in one of the answers above: just use E1:E5 inside c_across. Of course, you may also use select helper functions, e.g. starts_with, here:
#toy_data
set.seed(12)
Data4 <- data.frame(replicate(5, rnorm(5, 10, 1)))
colnames(Data4) <- paste0("E", 1:5)
library(dplyr)
Data4 %>% rowwise() %>%
mutate(E = sum(c_across(E1:E5)))
#> # A tibble: 5 x 6
#> # Rowwise:
#> E1 E2 E3 E4 E5 E
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 8.52 9.73 9.22 9.30 10.2 47.0
#> 2 11.6 9.68 8.71 11.2 12.0 53.2
#> 3 9.04 9.37 9.22 10.3 11.0 49.0
#> 4 9.08 9.89 10.0 10.5 9.70 49.2
#> 5 8.00 10.4 9.85 9.71 8.97 47.0
Created on 2021-05-25 by the reprex package (v2.0.0)
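As a side note, rowwise() recomputes per row and can be slow on large data; assuming dplyr 1.0 or later, a vectorised alternative is rowSums() over across():
Data4 %>% mutate(E = rowSums(across(E1:E5)))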

Using lapply to standardise the ts() function

I am new to R and writing functions. I've spent hours trying to figure this out and searching Google, but can't seem to find anything. Hopefully you can help? I want to use lapply() to analyze the data below using the ts() function.
My code looks like this:
library(dplyr)
#group out different sites
mylist <- data %>%
group_by(Site)
mylist
#Write ts() function
alpha_function = function(x) {
ts_alpha = ts(x$Temperature, frequency=12, start=c(0017, 7, 20))
return(data.frame(ts_alpha))
}
#Run list through lapply()
results = lapply(mylist, alpha_function())
But I get this error: argument "x" is missing with no default.
I have a data set that looks like:
Site(factor) Date(POSIXct) Temperature(num)
1 0017-03-04 2.73
2 0017-03-04 3.73
3 0017-03-04 2.71
4 0017-03-04 2.22
5 0017-03-04 2.89
etc.
I have over 3,000 temperature readings at different dates for 5 different sites.
Thanks in advance!
I'm not exactly an R guy, but I would wager this line:
results = lapply(mylist, alpha_function())
should be
results = lapply(mylist, alpha_function).
What you have calls alpha_function at the moment you supply it to lapply, when what you really (most likely) want is to pass a reference to the function without calling it. (The error you are getting indicates that alpha_function needs an x argument when called as alpha_function().)
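In other words, the fix is just dropping the parentheses. A sketch (note that group_by() alone does not split a data frame into a list; the split() approach in the next answer handles that):
results <- lapply(mylist, alpha_function)  # pass the function object itself
# or equivalently, via an anonymous wrapper:
results <- lapply(mylist, function(x) alpha_function(x))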
A recommended approach when working with dplyr and the tidyverse is to keep things in data frames:
library(tidyverse)
library(zoo)
dat %>%
nest(-Site) %>%
mutate(data = map(data, ~ zoo(.x$Temperature, .x$Date)))
# # A tibble: 5 x 2
# Site data
# <fct> <list>
# 1 a <S3: zoo>
# 2 b <S3: zoo>
# 3 c <S3: zoo>
# 4 d <S3: zoo>
# 5 e <S3: zoo>
Or if we must have ts rather than zoo objects, we can use as.ts(zoo(...)).
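For example, a minimal sketch of that variant, mirroring the pipeline above:
dat %>%
nest(-Site) %>%
mutate(data = map(data, ~ as.ts(zoo(.x$Temperature, .x$Date))))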
In case we still prefer regular lists, we can use base split() and lapply():
dat %>%
split(.$Site) %>%
lapply(function(.x) zoo(.x$Temperature, .x$Date))
# List of 5
# $ a:‘zoo’ series from 2017-03-04 12:00:00 to 2017-05-06 00:30:00
# Data: num [1:3000] 5.37 5.49 5.32 5.44 5.43 ...
# Index: POSIXct[1:3000], format: "2017-03-04 12:00:00" ...
# $ b:‘zoo’ series from 2017-03-04 12:00:00 to 2017-05-06 00:30:00
# Data: num [1:3000] 5.36 5.22 5.15 5.41 5.41 ...
# Index: POSIXct[1:3000], format: "2017-03-04 12:00:00" ...
# $ c:‘zoo’ series from 2017-03-04 12:00:00 to 2017-05-06 00:30:00
# Data: num [1:3000] 6.08 6.11 6.22 6.13 6.03 ...
# Index: POSIXct[1:3000], format: "2017-03-04 12:00:00" ...
# $ d:‘zoo’ series from 2017-03-04 12:00:00 to 2017-05-06 00:30:00
# Data: num [1:3000] 5.06 4.96 5.23 5.16 5.29 ...
# Index: POSIXct[1:3000], format: "2017-03-04 12:00:00" ...
# $ e:‘zoo’ series from 2017-03-04 12:00:00 to 2017-05-06 00:30:00
# Data: num [1:3000] 5.1 5.08 5.14 5.13 5.22 ...
# Index: POSIXct[1:3000], format: "2017-03-04 12:00:00" ...
where dat is generated as follows:
n_sites <- 5
n_dates <- 3000
set.seed(123) ; dat <- tibble(
Site = factor(rep(letters[1:n_sites], each = n_dates)),
Date = rep(seq.POSIXt(as.POSIXct("2017-03-04 12:00:00"), by = "30 min", length.out = n_dates), times = n_sites),
Temperature = as.vector(replicate(n_sites, runif(1, 5, 6) + cumsum(rnorm(n_dates, 0, 0.1))))
)

How to append columns horizontally in R in a loop?

I want to build a matrix of stock data for n companies taken from a ticker list, but I'm struggling to append the series horizontally; I can only get them to append vertically.
I have also tried other functions like merge and rbind, but they cannot work with variable names parsed from strings. The hard part is that I want to append n variables retrieved from a ticker list containing n stocks. Other suggestions that achieve the same result are welcome.
Stocklist data:
> dput(stockslist)
structure(list(V1 = c("AMD", "MSFT", "SBUX", "IBM", "AAPL", "GSPC",
"AMZN")), .Names = "V1", class = "data.frame", row.names = c(NA,
-7L))
code:
library(quantmod)
library(tseries)
library(plyr)
library(PortfolioAnalytics)
library(PerformanceAnalytics)
library(zoo)
library(plotly)
tickerlist <- "sp500.csv" #CSV containing tickers on rows
stockslist <- read.csv("sp500.csv", header = FALSE, stringsAsFactors = F)
nrstocks = length(stockslist[,1]) #The number of stocks to download
maxretryattempts <- 5 # if there is an error downloading a price, how many times to retry
startDate = as.Date("2010-01-13")
for (i in 1:nrstocks) {
stockdata <- getSymbols(c(stockslist[i, 1]), src = "yahoo", from = startDate)
# pick 6th column of the ith stock
write.table((eval(parse(text = paste(stockslist[i, 1]))))[, 6], file = "test.csv", append = TRUE, row.names = FALSE)
}
This is a great opportunity to talk about lists of data frames. Having said that ...
Side bar: I really don't like side-effects. getSymbols defaults to using a side-effect, saving the data into the parent frame/environment; though this may be fine for most uses, I prefer functional methods. Luckily, using auto.assign=FALSE returns its behavior to within my bounds of comfort.
library(quantmod)
stocklist <- c("AMD", "MSFT")
startDate <- as.Date("2010-01-13")
dat <- sapply(stocklist, getSymbols, src = "google", from = startDate, auto.assign = FALSE,
simplify = FALSE)
str(dat)
# List of 2
# $ AMD :An 'xts' object on 2010-01-13/2017-05-16 containing:
# Data: num [1:1846, 1:5] 8.71 9.18 9.13 8.84 8.98 9.01 8.55 8.01 8.03 8.03 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:5] "AMD.Open" "AMD.High" "AMD.Low" "AMD.Close" ...
# Indexed by objects of class: [Date] TZ: UTC
# xts Attributes:
# List of 2
# ..$ src : chr "google"
# ..$ updated: POSIXct[1:1], format: "2017-05-16 21:01:37"
# $ MSFT:An 'xts' object on 2010-01-13/2017-05-16 containing:
# Data: num [1:1847, 1:5] 30.3 30.3 31.1 30.8 30.8 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:5] "MSFT.Open" "MSFT.High" "MSFT.Low" "MSFT.Close" ...
# Indexed by objects of class: [Date] TZ: UTC
# xts Attributes:
# List of 2
# ..$ src : chr "google"
# ..$ updated: POSIXct[1:1], format: "2017-05-16 21:01:37"
Though I only did two symbols, it should work for many more without problem. Also, I shifted to using Google since Yahoo was asking for authentication.
You used write.table(...) to write a CSV; realize that you will lose the timestamp for each datum, since the file will look something like:
"AMD.Open","AMD.High","AMD.Low","AMD.Close","AMD.Volume"
8.71,9.2,8.55,9.15,32741845
9.18,9.26,8.92,9,22658744
9.13,9.19,8.8,8.84,34344763
8.84,9.21,8.84,9.01,24875646
Using "AMD" as an example, consider:
write.csv(as.data.frame(AMD), file="AMD.csv", row.names = TRUE)
head(read.csv("~/Downloads/AMD.csv", row.names = 1))
# AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume
# 2010-01-13 8.71 9.20 8.55 9.15 32741845
# 2010-01-14 9.18 9.26 8.92 9.00 22658744
# 2010-01-15 9.13 9.19 8.80 8.84 34344763
# 2010-01-19 8.84 9.21 8.84 9.01 24875646
# 2010-01-20 8.98 9.00 8.76 8.87 22813520
# 2010-01-21 9.01 9.10 8.77 8.99 37888647
To save all of them at once:
ign <- mapply(function(x, fn) write.csv(as.data.frame(x), file = fn, row.names = TRUE),
dat, names(dat))
There are other ways to store your data such as Rdata files (save()).
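For instance, a minimal sketch of both options:
save(dat, file = "stocks.RData")    # restore later with load("stocks.RData")
saveRDS(dat, "stocks.rds")          # restore later with dat <- readRDS("stocks.rds")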
It is not clear to me if you are intending to append them as additional columns (i.e., cbind behavior) or as rows (rbind). Between the two, I tend towards "rows", but I'll start with "columns" first.
"Appending" by column
This may be appropriate if you want day-by-day ticker comparisons (though there are arguably better ways to prepare for this). You'll run into problems, since they have (and most likely will have) different numbers of rows:
sapply(dat, nrow)
# AMD MSFT
# 1846 1847
In this case, you might want to join based on the dates (row names). To do this well, you should probably convert the row names (dates) to a column and merge on that column:
dat2 <- lapply(dat, function(x) {
x <- as.data.frame(x)
x$date <- rownames(x)
rownames(x) <- NULL
x
})
datwide <- Reduce(function(a, b) merge(a, b, by = "date", all = TRUE), dat2)
As a simple demonstration, remembering that there is one more row in "MSFT" than in "AMD", we can find that row and prove that things are still looking alright with:
which(! complete.cases(datwide))
# [1] 1251
datwide[1251 + -2:2,]
# date AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume
# 1249 2014-12-30 2.64 2.70 2.63 2.63 7783709 47.44 47.62 46.84 47.02 16384692
# 1250 2014-12-31 2.64 2.70 2.64 2.67 11177917 46.73 47.44 46.45 46.45 21552450
# 1251 2015-01-02 NA NA NA NA NA 46.66 47.42 46.54 46.76 27913852
# 1252 2015-01-05 2.67 2.70 2.64 2.66 8878176 46.37 46.73 46.25 46.32 39673865
# 1253 2015-01-06 2.65 2.66 2.55 2.63 13916645 46.38 46.75 45.54 45.65 36447854
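As an aside, since getSymbols returns xts objects, they can also be merged directly on their time index, avoiding the row-name juggling; a sketch using the dat list from above (merge.xts performs an outer join on dates, filling gaps with NA):
datwide_xts <- do.call(merge, unname(dat))
head(datwide_xts)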
"Appending" by row
getSymbols names the columns uniquely per ticker, a slight frustration. Additionally, since we'll be discarding those column names, we should preserve the symbol name in the data itself.
dat3 <- lapply(dat, function(x) {
ticker <- gsub("\\..*", "", colnames(x)[1])
colnames(x) <- gsub(".*\\.", "", colnames(x))
x <- as.data.frame(x)
x$date <- rownames(x)
x$symbol <- ticker
rownames(x) <- NULL
x
}) # can also be accomplished with mapply(..., dat, names(dat))
datlong <- Reduce(function(a, b) rbind(a, b, make.row.names = FALSE), dat3)
head(datlong)
# Open High Low Close Volume date symbol
# 1 8.71 9.20 8.55 9.15 32741845 2010-01-13 AMD
# 2 9.18 9.26 8.92 9.00 22658744 2010-01-14 AMD
# 3 9.13 9.19 8.80 8.84 34344763 2010-01-15 AMD
# 4 8.84 9.21 8.84 9.01 24875646 2010-01-19 AMD
# 5 8.98 9.00 8.76 8.87 22813520 2010-01-20 AMD
# 6 9.01 9.10 8.77 8.99 37888647 2010-01-21 AMD
nrow(datlong)
# [1] 3693

How to simplify this normalization code? [duplicate]

I have a dataset called spam which contains 58 columns and approximately 3500 rows of data related to spam messages.
I plan on running some linear regression on this dataset in the future, but I'd like to do some pre-processing beforehand and standardize the columns to have zero mean and unit variance.
I've been told the best way to go about this is with R, so I'd like to ask: how can I achieve normalization with R? I've already got the data properly loaded and I'm just looking for packages or methods to perform this task.
I have to assume you meant to say that you wanted a mean of 0 and a standard deviation of 1. If your data is in a dataframe and all the columns are numeric you can simply call the scale function on the data to do what you want.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)
# check that we get mean of 0 and sd of 1
colMeans(scaled.dat) # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)
Using built-in functions is classy.
Realizing that the question is old and one answer is accepted, I'll provide another answer for reference.
scale is limited by the fact that it scales all variables. The solution below allows you to scale only specific variables, leaving the others unchanged (and the variable names can be generated dynamically):
library(dplyr)
set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2),
y = runif(10, 3, 5),
z = runif(10, 10, 20))
dat
dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
dat2
which gives me this:
> dat
x y z
1 29.75859 3.633225 14.56091
2 30.05549 3.605387 12.65187
3 30.21689 3.318092 13.04672
4 29.53086 3.079992 15.07307
5 30.08582 3.437599 11.81096
6 30.10121 4.621197 17.59671
7 29.88505 4.051395 12.01248
8 29.89067 4.829316 12.58810
9 29.88711 4.662690 19.92150
10 29.82199 3.091541 18.07352
and
> dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
> dat2
x y z
1 29.75859 -0.3004815 -0.06016029
2 30.05549 -0.3423437 -0.72529604
3 30.21689 -0.7743696 -0.58772361
4 29.53086 -1.1324181 0.11828039
5 30.08582 -0.5946582 -1.01827752
6 30.10121 1.1852038 0.99754666
7 29.88505 0.3283513 -0.94806607
8 29.89067 1.4981677 -0.74751378
9 29.88711 1.2475998 1.80753470
10 29.82199 -1.1150515 1.16367556
EDIT 1 (2016): Addressed Julian's comment: the output of scale is Nx1 matrix so ideally we should add an as.vector to convert the matrix type back into a vector type. Thanks Julian!
EDIT 2 (2019): Quoting Duccio A.'s comment: For the latest dplyr (version 0.8) you need to replace dplyr::funs with list, like dat %>% mutate_each_(list(~scale(.) %>% as.vector), vars=c("y","z"))
EDIT 3 (2020): Thanks to #mj_whales: the old solution is deprecated and now we need to use mutate_at.
This is 3 years old. Still, I feel I have to add the following:
The most common normalization is the z-transformation, where you subtract the mean and divide by the standard deviation of your variable. The result will have mean=0 and sd=1.
For that, you don't need any package.
zVar <- (myVar - mean(myVar)) / sd(myVar)
That's it.
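A quick sanity check that the transformation behaves as claimed (any numeric vector will do):
myVar <- rnorm(100, 50, 10)
zVar <- (myVar - mean(myVar)) / sd(myVar)
round(c(mean = mean(zVar), sd = sd(zVar)), 10)
# mean   sd
#    0    1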
The caret package provides methods for preprocessing data (e.g. centering and scaling). You could also use the following code:
library(caret)
# Assuming goal class is column 10
preObj <- preProcess(data[, -10], method=c("center", "scale"))
newData <- predict(preObj, data[, -10])
More details: http://www.inside-r.org/node/86978
When I used the solution stated by Dason, instead of getting a data frame as a result, I got a matrix of numbers (the scaled values of my df).
In case someone is having the same trouble, you have to add as.data.frame() to the code, like this:
df.scaled <- as.data.frame(scale(df))
I hope this will be useful for people having the same issue!
You can also easily normalize the data using the data.Normalization function in the clusterSim package. It provides many different methods of data normalization.
data.Normalization (x,type="n0",normalization="column")
Arguments:
x: vector, matrix or dataset
type: type of normalization:
n0 - without normalization
n1 - standardization ((x-mean)/sd)
n2 - positional standardization ((x-median)/mad)
n3 - unitization ((x-mean)/range)
n3a - positional unitization ((x-median)/range)
n4 - unitization with zero minimum ((x-min)/range)
n5 - normalization in range <-1,1> ((x-mean)/max(abs(x-mean)))
n5a - positional normalization in range <-1,1> ((x-median)/max(abs(x-median)))
n6 - quotient transformation (x/sd)
n6a - positional quotient transformation (x/mad)
n7 - quotient transformation (x/range)
n8 - quotient transformation (x/max)
n9 - quotient transformation (x/mean)
n9a - positional quotient transformation (x/median)
n10 - quotient transformation (x/sum)
n11 - quotient transformation (x/sqrt(SSQ))
n12 - normalization ((x-mean)/sqrt(sum((x-mean)^2)))
n12a - positional normalization ((x-median)/sqrt(sum((x-median)^2)))
n13 - normalization with zero being the central point ((x-midrange)/(range/2))
normalization: "column" - normalization by variable, "row" - normalization by object
With dplyr v0.7.4 all variables can be scaled by using mutate_all():
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
set.seed(1234)
dat <- tibble(x = rnorm(10, 30, .2),
y = runif(10, 3, 5),
z = runif(10, 10, 20))
dat %>% mutate_all(scale)
#> # A tibble: 10 x 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 -0.827 -0.300 -0.0602
#> 2 0.663 -0.342 -0.725
#> 3 1.47 -0.774 -0.588
#> 4 -1.97 -1.13 0.118
#> 5 0.816 -0.595 -1.02
#> 6 0.893 1.19 0.998
#> 7 -0.192 0.328 -0.948
#> 8 -0.164 1.50 -0.748
#> 9 -0.182 1.25 1.81
#> 10 -0.509 -1.12 1.16
Specific variables can be excluded using mutate_at():
dat %>% mutate_at(scale, .vars = vars(-x))
#> # A tibble: 10 x 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 29.8 -0.300 -0.0602
#> 2 30.1 -0.342 -0.725
#> 3 30.2 -0.774 -0.588
#> 4 29.5 -1.13 0.118
#> 5 30.1 -0.595 -1.02
#> 6 30.1 1.19 0.998
#> 7 29.9 0.328 -0.948
#> 8 29.9 1.50 -0.748
#> 9 29.9 1.25 1.81
#> 10 29.8 -1.12 1.16
Created on 2018-04-24 by the reprex package (v0.2.0).
Again, even though this is an old question, it is very relevant! I have found a simple way to normalise certain columns without needing any packages:
normFunc <- function(x){(x-mean(x, na.rm = T))/sd(x, na.rm = T)}
For example
x<-rnorm(10,14,2)
y<-rnorm(10,7,3)
z<-rnorm(10,18,5)
df<-data.frame(x,y,z)
df[2:3] <- apply(df[2:3], 2, normFunc)
You will see that the y and z columns have been normalised. No packages needed :-)
scale can be used on both a full data frame and specific columns.
For specific columns, the following code can be used:
trainingSet[, 3:7] = scale(trainingSet[, 3:7]) # For column 3 to 7
trainingSet[, 8] = scale(trainingSet[, 8]) # For column 8
Full data frame
trainingSet <- scale(trainingSet)
The collapse package provides the fastest scale function, implemented in C++ using Welford's online algorithm:
dat <- data.frame(x = rnorm(1e6, 30, .2),
y = runif(1e6, 3, 5),
z = runif(1e6, 10, 20))
library(collapse)
library(microbenchmark)
microbenchmark(fscale(dat), scale(dat))
Unit: milliseconds
expr min lq mean median uq max neval cld
fscale(dat) 27.86456 29.5864 38.96896 30.80421 43.79045 313.5729 100 a
scale(dat) 357.07130 391.0914 489.93546 416.33626 625.38561 793.2243 100 b
Furthermore: fscale is an S3 generic for vectors, matrices and data frames, and it also supports grouped and/or weighted scaling operations, as well as scaling to arbitrary means and standard deviations.
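A small sketch of those extras, using argument names as documented in collapse:
x <- rnorm(10)
g <- rep(c("a", "b"), each = 5)
fscale(x, g = g)                # standardize within each group
fscale(x, mean = 100, sd = 15)  # scale to an arbitrary mean and sd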
The dplyr package has two functions that do this.
> require(dplyr)
To mutate specific columns of a data table, you can use the function mutate_at(). To mutate all columns, you can use mutate_all.
The following is a brief example for using these functions to standardize data.
Mutate specific columns:
library(data.table)
dt = data.table(a = runif(3500), b = runif(3500), c = runif(3500))
dt = data.table(dt %>% mutate_at(vars("a", "c"), scale)) # can also index columns by number, e.g., vars(c(1,3))
> apply(dt, 2, mean)
a b c
1.783137e-16 5.064855e-01 -5.245395e-17
> apply(dt, 2, sd)
a b c
1.0000000 0.2906622 1.0000000
Mutate all columns:
dt = data.table(a = runif(3500), b = runif(3500), c = runif(3500))
dt = data.table(dt %>% mutate_all(scale))
> apply(dt, 2, mean)
a b c
-1.728266e-16 9.291994e-17 1.683551e-16
> apply(dt, 2, sd)
a b c
1 1 1
Before I happened to find this thread, I had the same problem. I had user-dependent column types, so I wrote a for loop going through them and scaling the needed columns. There are probably better ways to do it, but this solved the problem just fine:
for(i in 1:length(colnames(df))) {
if(class(df[,i]) == "numeric" || class(df[,i]) == "integer") {
df[,i] <- as.vector(scale(df[,i])) }
}
as.vector is a needed part, because it turns out scale returns an N x 1 matrix, which is usually not what you want to have in your data.frame.
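As a design note, the same idea can be written without an explicit loop; a sketch:
num <- sapply(df, is.numeric)
df[num] <- lapply(df[num], function(col) as.vector(scale(col)))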
@BBKim pretty much gave the best answer, but it can be done more briefly. I'm surprised no one has come up with it yet.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
dat <- apply(dat, 2, function(x) (x - mean(x)) / sd(x))
Use the package recommenderlab. Download and install the package.
This package has a normalize command built in. It also allows you to choose one of several normalization methods, namely 'center' or 'Z-score'.
Follow this example:
## create a matrix with ratings
library(recommenderlab)
m <- matrix(sample(c(NA, 0:5), 50, replace = TRUE, prob = c(.5, rep(.5/6, 6))),
nrow = 5, ncol = 10,
dimnames = list(users = paste('u', 1:5, sep = ''), items = paste('i', 1:10, sep = '')))
## do normalization
r <- as(m, "realRatingMatrix")
# here, 'center' is the default method
r_n1 <- normalize(r)
#here "Z-score" is the used method used
r_n2 <- normalize(r, method="Z-score")
r
r_n1
r_n2
## show normalized data
image(r, main="Raw Data")
image(r_n1, main="Centered")
image(r_n2, main="Z-Score Normalization")
The code below could be the shortest way to achieve this.
dataframe <- apply(dataframe, 2, scale)
The normalize function from the BBmisc package was the right tool for me since it can deal with NA values.
Here is how to use it:
Given the following dataset,
library(BBmisc)
library(data.table)
ASR_API <- c("CV", "F", "IER", "LS-c", "LS-o")
Human <- c(NA, 5.8, 12.7, NA, NA)
Google <- c(23.2, 24.2, 16.6, 12.1, 28.8)
GoogleCloud <- c(23.3, 26.3, 18.3, 12.3, 27.3)
IBM <- c(21.8, 47.6, 24.0, 9.8, 25.3)
Microsoft <- c(29.1, 28.1, 23.1, 18.8, 35.9)
Speechmatics <- c(19.1, 38.4, 21.4, 7.3, 19.4)
Wit_ai <- c(35.6, 54.2, 37.4, 19.2, 41.7)
dt <- data.table(ASR_API,Human, Google, GoogleCloud, IBM, Microsoft, Speechmatics, Wit_ai)
> dt
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai
1: CV NA 23.2 23.3 21.8 29.1 19.1 35.6
2: F 5.8 24.2 26.3 47.6 28.1 38.4 54.2
3: IER 12.7 16.6 18.3 24.0 23.1 21.4 37.4
4: LS-c NA 12.1 12.3 9.8 18.8 7.3 19.2
5: LS-o NA 28.8 27.3 25.3 35.9 19.4 41.7
normalized values can be obtained like this:
> dtn <- normalize(dt, method = "standardize", range = c(0, 1), margin = 1L, on.constant = "quiet")
> dtn
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai
1: CV NA 0.3361245 0.2893457 -0.28468670 0.3247336 -0.18127203 -0.16032655
2: F -0.7071068 0.4875320 0.7715885 1.59862532 0.1700986 1.55068347 1.31594762
3: IER 0.7071068 -0.6631646 -0.5143923 -0.12409420 -0.6030768 0.02512682 -0.01746131
4: LS-c NA -1.3444981 -1.4788780 -1.16064578 -1.2680075 -1.24018782 -1.46198764
5: LS-o NA 1.1840062 0.9323361 -0.02919864 1.3762521 -0.15435044 0.32382788
whereas the hand-calculated method simply yields NAs for columns containing NAs:
> dt %>% mutate(normalizedHuman = (Human - mean(Human))/sd(Human)) %>%
+ mutate(normalizedGoogle = (Google - mean(Google))/sd(Google)) %>%
+ mutate(normalizedGoogleCloud = (GoogleCloud - mean(GoogleCloud))/sd(GoogleCloud)) %>%
+ mutate(normalizedIBM = (IBM - mean(IBM))/sd(IBM)) %>%
+ mutate(normalizedMicrosoft = (Microsoft - mean(Microsoft))/sd(Microsoft)) %>%
+ mutate(normalizedSpeechmatics = (Speechmatics - mean(Speechmatics))/sd(Speechmatics)) %>%
+ mutate(normalizedWit_ai = (Wit_ai - mean(Wit_ai))/sd(Wit_ai))
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai normalizedHuman normalizedGoogle
1 CV NA 23.2 23.3 21.8 29.1 19.1 35.6 NA 0.3361245
2 F 5.8 24.2 26.3 47.6 28.1 38.4 54.2 NA 0.4875320
3 IER 12.7 16.6 18.3 24.0 23.1 21.4 37.4 NA -0.6631646
4 LS-c NA 12.1 12.3 9.8 18.8 7.3 19.2 NA -1.3444981
5 LS-o NA 28.8 27.3 25.3 35.9 19.4 41.7 NA 1.1840062
normalizedGoogleCloud normalizedIBM normalizedMicrosoft normalizedSpeechmatics normalizedWit_ai
1 0.2893457 -0.28468670 0.3247336 -0.18127203 -0.16032655
2 0.7715885 1.59862532 0.1700986 1.55068347 1.31594762
3 -0.5143923 -0.12409420 -0.6030768 0.02512682 -0.01746131
4 -1.4788780 -1.16064578 -1.2680075 -1.24018782 -1.46198764
5 0.9323361 -0.02919864 1.3762521 -0.15435044 0.32382788
(normalizedHuman comes out as a column of NAs ...)
Regarding the selection of specific columns for the calculation, a generic approach can be employed like this one:
data_vars <- df_full %>% dplyr::select(-ASR_API,-otherVarNotToBeUsed)
meta_vars <- df_full %>% dplyr::select(ASR_API,otherVarNotToBeUsed)
data_varsn <- normalize(data_vars, method = "standardize", range = c(0, 1), margin = 1L, on.constant = "quiet")
dtn <- cbind(meta_vars,data_varsn)

Create Variable from Each Column in Data Frame

I have one data frame that contains columns I want to analyze individually. I am not sure what the common method is for this kind of analysis, but I want to create a separate variable/data frame for each column in my original data frame. I know I can subset, but is there a way to use a for loop (is this the easiest way?) to create x new variables from the x columns in my data frame?
For more details on my data frame, I have a product and a corresponding index (which the product is being judged against).
Example data frame:
Date Product 1 Index 1 Product 2 Index 2
1/1/1995 2.89 2.75 4.91 5.01
2/1/1995 1.38 1.65 3.47 3.29
So I would like to create a variable for each product and corresponding index, without manually creating a data frame for each one, or subsetting whenever I want to analyze a product.
Like someone mentioned in the comments, you can do this by indexing. But if you really want separate vectors for each column in your data frame, you could do it like this:
df <- data.frame(x=1:10, y=11:20, z=21:30)
for (i in colnames(df)) {
assign(i, df[, i])
}
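After the loop, each column is available as its own vector in the global environment:
x
# [1]  1  2  3  4  5  6  7  8  9 10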
You could index the columns and put them into a new list with each element containing the product/index pair and the date column.
ind <- seq(2, by = 2, length.out = ncol(dat[-1])/2)
(sets <- lapply(ind, function(i) dat[c(1, i:(i+1))]))
# [[1]]
# Date Product1 Index1
# 1 1/1/1995 2.89 2.75
# 2 2/1/1995 1.38 1.65
#
# [[2]]
# Date Product2 Index2
# 1 1/1/1995 4.91 5.01
# 2 2/1/1995 3.47 3.29
If you want, you can then assign these data frames to the global environment with list2env
list2env(setNames(sets, paste0("Set", seq_along(sets))), .GlobalEnv)
Set1
# Date Product1 Index1
# 1 1/1/1995 2.89 2.75
# 2 2/1/1995 1.38 1.65
Set2
# Date Product2 Index2
# 1 1/1/1995 4.91 5.01
# 2 2/1/1995 3.47 3.29
Data:
dat <-
structure(list(Date = structure(1:2, .Label = c("1/1/1995", "2/1/1995"
), class = "factor"), Product1 = c(2.89, 1.38), Index1 = c(2.75,
1.65), Product2 = c(4.91, 3.47), Index2 = c(5.01, 3.29)), .Names = c("Date",
"Product1", "Index1", "Product2", "Index2"), class = "data.frame", row.names = c(NA,
-2L))
This is what attach does. You can just do attach(my_data_frame).
Most people who know what they're doing would tell you this lies somewhere between "unnecessary" and "not a good idea".
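For completeness, a sketch of what that looks like with the dat defined above, and why it is risky: attached columns are copies, and name clashes silently mask one another.
attach(dat)
Product1         # now visible without dat$
# [1] 2.89 1.38
detach(dat)      # always detach when done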
