Way to customize a zScore function in R

I am new to R and have a question.
Create a function, zScore, that takes a vector of numbers (x) and converts it to a vector of z-scaled numbers (see code below). (Don't worry about NAs.)
#This creates the z-scaled numbers for sepal lengths
(iris$Sepal.Length - mean(iris$Sepal.Length))/sd(iris$Sepal.Length)
#This creates the z-scaled numbers for sepal widths
(iris$Sepal.Width - mean(iris$Sepal.Width))/sd(iris$Sepal.Width)
Write a zScore function that is flexible.
Thank you for any help you provide.

You can use the following code:
# Z-score function (named zScore, as the question asks)
zScore <- function(x) {
  (x - mean(x)) / sd(x)
}
library(tidyverse)  # provides %>% and mutate()
iris %>%
  mutate(zscore_sepal.length = zScore(Sepal.Length)) %>%
  mutate(zscore_sepal.width = zScore(Sepal.Width))
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species zscore_sepal.length zscore_sepal.width
1 5.1 3.5 1.4 0.2 setosa -0.8976739 1.01560199
2 4.9 3.0 1.4 0.2 setosa -1.1392005 -0.13153881
3 4.7 3.2 1.3 0.2 setosa -1.3807271 0.32731751
4 4.6 3.1 1.5 0.2 setosa -1.5014904 0.09788935
5 5.0 3.6 1.4 0.2 setosa -1.0184372 1.24503015
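One possible way to make the function more flexible is sketched below in base R. The na.rm argument and the helper name zScoreColumns are assumptions for illustration, not part of the original answer; na.rm simply mirrors the argument of the same name in mean() and sd().

```r
# Sketch: z-score with optional NA handling (na.rm is an added, assumed argument)
zScore <- function(x, na.rm = FALSE) {
  stopifnot(is.numeric(x))
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

# Hypothetical helper: z-scale every numeric column, leave the others untouched
zScoreColumns <- function(df, na.rm = FALSE) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], zScore, na.rm = na.rm)
  df
}

scaled <- zScoreColumns(iris)
head(scaled, 3)
```

After scaling, each numeric column should have a mean of approximately 0 and a standard deviation of 1, while Species stays a factor.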

Related

using 'ifelse' in R: variable taking static value

I am trying to create a new variable in a dataset based on the value of an indicator. The following is the code for the same:
prac_data <- head(iris,10)
COPY_IND='Y' ##declaring the indicator to be 'Y'
prac_data <- prac_data %>% mutate(New_Var=ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
I get the following output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species New_Var
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3.0 1.4 0.2 setosa 5.1
3 4.7 3.2 1.3 0.2 setosa 5.1
4 4.6 3.1 1.5 0.2 setosa 5.1
5 5.0 3.6 1.4 0.2 setosa 5.1
6 5.4 3.9 1.7 0.4 setosa 5.1
7 4.6 3.4 1.4 0.3 setosa 5.1
8 5.0 3.4 1.5 0.2 setosa 5.1
9 4.4 2.9 1.4 0.2 setosa 5.1
10 4.9 3.1 1.5 0.1 setosa 5.1
I actually want to copy the variable 'Sepal.Length' into 'New_Var' for every observation if the indicator (COPY_IND) is yes ('Y').
If I do the following, I get the desired result:
if (COPY_IND == 'Y') {
  prac_data$New_Var <- prac_data$Sepal.Length
} else {
  prac_data$New_Var <- 'N'
}
I just want to understand why R treats the two 'if-else' approaches differently.
Is there a better, more elegant way to do the same?
Thanks in advance!!
Actually, this might be easier to read as an answer.
From ifelse() help: "ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE".
Your test is a single value, so ifelse() returns a single value, either Sepal.Length[1] or 'N', which is then recycled across the whole column.
To make your approach work, add rowwise() so that each row is processed separately:
prac_data <- prac_data %>% rowwise() %>% mutate(New_Var = ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
COPY_IND is always "Y" in your case, so the code could be simplified to prac_data$New_Var = prac_data$Sepal.Length. Even if you want to use ifelse() row by row, it is better to follow this advice from the help page:
Further note that if(test) yes else no is much more efficient and often much preferable to ifelse(test, yes, no) whenever test is a simple true/false result, i.e., when length(test) == 1.
I guess the desired COPY_IND should be a column of the data frame (or a vector) rather than a single fixed value. In that case, your code generates the right answer, e.g. keeping only the first five values:
library(dplyr)
prac_data <- head(iris,10)
prac_data$COPY_IND=c(rep('Y',5),rep('N',5))
#COPY_IND=c(rep('Y',5),rep('N',5)) works too
prac_data <- prac_data %>% mutate(New_Var=ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
generates the right column.
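The behaviour described above can be seen in a small base R sketch. The variable names (x, flag, and the three result objects) are illustrative only. It shows that ifelse() returns a result as long as its test, that mixing numbers with 'N' coerces the result to character, and that plain if/else keeps the whole vector.

```r
x <- c(5.1, 4.9, 4.7)

# Scalar test: the result has length 1, no matter how long `yes` is
scalar_result <- ifelse("Y" == "Y", x, "N")
length(scalar_result)

# Vector test: one result per element; numbers and 'N' end up as character
flag <- c("Y", "N", "Y")
vector_result <- ifelse(flag == "Y", x, "N")
vector_result

# For a single TRUE/FALSE condition, plain if/else returns the whole vector
whole <- if ("Y" == "Y") x else "N"
length(whole)
```

Because a column can hold only one type, a character fallback such as 'N' turns the copied numbers into strings; NA may be a better placeholder if the column should stay numeric.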

accessing variables in data frame in R

I am trying to open all the csv files in my working directory and read all the tables into a large list of data frames. I found a similar solution on Stack Overflow and it works. The code is:
load_data <- function(path) {
  files <- dir(path, pattern = '\\.csv', full.names = TRUE)
  tables <- lapply(files, read.csv)
  do.call(rbind, tables)
}
pollutantmean <- load_data("specdata")
However, I am confused by some of the steps. If I omit do.call(rbind, tables), I am not able to access the column variables by calling tables[index]$variable; it returns NULL in the console. When I then print tables[index], I do not see the column names in the first row of the table. Can someone explain what causes the column names to go missing and the NULL value to be returned?
To see why you are getting NULL let's create a reproducible example:
df1 <- head(mtcars)
df2 <- head(iris)
my_list <- list(df1, df2)
Test the subsetting with one bracket and two:
my_list[2]$Species
NULL
my_list[[2]]$Species
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
Subsetting with two brackets produces the desired output.
Further Explanation
Why doesn't one bracket work?
> my_list[2]
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
> my_list[[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
If someone couldn't tell the difference between the two outputs I wouldn't blame them; they look alike. There is one small but important difference between using one bracket and two: the first returns a list, the second returns a data frame. To check, notice the [[1]] in the first line of the output of my_list[2], which indicates that the output is a list. That one-element list has no element named Species, so $Species returns NULL. We must use two brackets to get back the data frame itself.
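The distinction can also be checked directly with class(). The sketch below rebuilds the same my_list from the example; the final lapply() call is an illustrative addition showing one way to work with every data frame in the list.

```r
df1 <- head(mtcars)
df2 <- head(iris)
my_list <- list(df1, df2)

class(my_list[2])     # a list of length 1 that contains the data frame
class(my_list[[2]])   # the data frame itself

# `$` on the one-element list looks for a list element named "Species";
# there is none, hence NULL
is.null(my_list[2]$Species)

# To inspect every data frame in the list, apply a function to each element
lapply(my_list, function(d) names(d)[1])
```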

biglm finds the wrong data.frame to take the data from

I am trying to create chunks of my dataset to run biglm. (with fastLm I would need 350Gb of RAM)
My complete dataset is called res. As an experiment I drastically decreased the size to 10,000 rows. I want to create chunks to use with biglm.
library(biglm)
formula <- iris$Sepal.Length ~ iris$Sepal.Width
test <- iris[1:10,]
biglm(formula, test)
And somehow, I get the following output:
> test <- iris[1:10,]
> test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Above you can see that the data frame test contains 10 rows. Yet when running biglm it shows a sample size of 150:
> biglm(formula, test)
Large data regression model: biglm(formula, test)
Sample size = 150
Looks like it uses iris instead of test. How is this possible, and how do I get biglm to use chunk1 the way I intend it to?
I suspect the following line is to blame:
formula <- iris$Sepal.Length ~ iris$Sepal.Width
where the formula explicitly references the iris dataset. When the model is fitted, R evaluates iris$Sepal.Length and iris$Sepal.Width exactly as written. These are not columns of the data you passed, so R looks up iris through its usual scoping rules and uses all 150 rows.
In a formula you normally do not use vectors, but simply the column names:
formula <- Sepal.Length ~ Sepal.Width
This ensures that the formula contains only the column (or variable) names, which are looked up in the data passed to the fitting function. So biglm will use test instead of iris.
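The same effect can be reproduced with base R's lm(), used here as a stand-in so that the sketch does not depend on biglm being installed. The object names fit_wrong and fit_right are illustrative; nobs() reports how many observations each fit actually used.

```r
test <- iris[1:10, ]

# Formula that hard-codes iris: the columns of `test` are never consulted
fit_wrong <- lm(iris$Sepal.Length ~ iris$Sepal.Width, data = test)

# Formula with bare column names: variables are looked up in `test`
fit_right <- lm(Sepal.Length ~ Sepal.Width, data = test)

nobs(fit_wrong)   # all rows of iris
nobs(fit_right)   # only the rows of test
```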

When trying to call an object with get() within group_by and mutate, it brings up the entire object and not the grouped object. How do I fix this?

Here is my code:
data(iris)
spec <- names(iris[1:4])
iris$Size <- factor(ifelse(iris$Sepal.Length > 5, "A", "B"))
for (i in spec) {
  attach(iris)
  output <- iris %>%
    group_by(Size) %>%
    mutate(
      out = mean(get(i)))
  detach(iris)
}
The for loop is written around some graphing and report writing that uses object 'i' in various parts. I am using dplyr and plyr.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Size out
1 5.1 3.5 1.4 0.2 setosa A 1.199333
2 4.9 3.0 1.4 0.2 setosa B 1.199333
3 4.7 3.2 1.3 0.2 setosa B 1.199333
4 4.6 3.1 1.5 0.2 setosa B 1.199333
5 5.0 3.6 1.4 0.2 setosa B 1.199333
Notice how the variable 'out' has the same value in every row, which is the mean of the entire dataset instead of the grouped mean.
> tapply(iris$Petal.Width,iris$Size,mean)
A B
1.432203 0.340625
> mean(iris$Petal.Width)
[1] 1.199333
Using get() and attach() isn't really consistent with dplyr because it messes up the environments in which the functions are evaluated. It would be better to use the standard-evaluation equivalent of mutate here, as described in the NSE vignette (vignette("nse", package="dplyr")):
for (i in spec) {
  output <- iris %>%
    group_by(Size) %>%
    mutate_(.dots = list(out = lazyeval::interp(~mean(x), x = as.name(i))))
  # print(output)
}
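In more recent dplyr releases mutate_() is deprecated. A sketch of an alternative, assuming a dplyr version that provides the .data pronoun, looks the column up by name inside the grouped data without attach() or get(). The results list is an illustrative addition so that each loop iteration keeps its output.

```r
library(dplyr)

data(iris)
spec <- names(iris)[1:4]
iris$Size <- factor(ifelse(iris$Sepal.Length > 5, "A", "B"))

results <- list()  # illustrative: keep one output per column
for (i in spec) {
  results[[i]] <- iris %>%
    group_by(Size) %>%
    mutate(out = mean(.data[[i]])) %>%  # .data[[i]] is evaluated within each group
    ungroup()
}

# Group means for Petal.Width, one value per Size level
unique(results[["Petal.Width"]][, c("Size", "out")])
```

The values in out should now match tapply(iris$Petal.Width, iris$Size, mean) for each group.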

Defining functions (in rollapply) using lines of a dataframe

First of all, I have a data frame (let's call it "years") with 5 rows and 10 columns. I need to build a new one by computing (x1-x2)/x1, where x1 is the first element and x2 the second element of a column in "years", then (x2-x3)/x2, and so forth. I thought rollapply would be the best tool for the task, but I can't figure out how to define such a function to pass to rollapply.
I'm new to R, so I hope my question is not too basic. Anyway, I couldn't find a similar question here so I'd be really thankful if someone could help me.
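You can use transform, diff and length; there is no need for rollapply. Note that diff(x) is x2 - x1, so this computes (x2 - x1)/x1; negate the expression if you need (x1 - x2)/x1 as in the question: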
> df <- head(iris,5) # some data
> transform(df, New = c(NA, diff(Sepal.Length)/Sepal.Length[-length(Sepal.Length)] ))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3.0 1.4 0.2 setosa -0.03921569
3 4.7 3.2 1.3 0.2 setosa -0.04081633
4 4.6 3.1 1.5 0.2 setosa -0.02127660
5 5.0 3.6 1.4 0.2 setosa 0.08695652
diff.zoo in the zoo package with the arithmetic=FALSE argument divides each number by the prior one in each column, so 1 minus that ratio gives (x1-x2)/x1 (DF here stands for your data frame of numeric columns):
library(zoo)
as.data.frame(1 - diff(zoo(DF), arithmetic = FALSE))
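For completeness, here is a minimal base R sketch that applies the question's formula, (x1-x2)/x1, to every column at once. The function name rel_change and the stand-in data are assumptions for illustration; the asker's "years" data frame is assumed to contain only numeric columns.

```r
# (x1 - x2) / x1 for consecutive elements; the first row has no predecessor
rel_change <- function(x) c(NA, -diff(x) / head(x, -1))

# Stand-in for the asker's "years" data frame (assumed to be all numeric)
years <- head(iris[1:4], 5)

result <- as.data.frame(lapply(years, rel_change))
result
```

The result has the same shape as the input, with NA in the first row because there is no earlier element to compare against.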
