Writing an R loop to create new standardized columns - r

I'm using the Ionosphere dataset in R and am trying to write a loop that will create new columns that are standardized iterations of existing columns and name them accordingly.
I've got the "cname" as the new column name and c as the original. The code is:
install.packages("mlbench")
library(mlbench)
data('Ionosphere')
library(robustHD)
col <- colnames(Ionosphere)
for (c in col[1:length(col)-1]){
cname <- paste(c,"Std")
Ionosphere$cname <- standardize(Ionosphere$c)
}
But I get the following error:
"Error in `$<-.data.frame`(`*tmp*`, "cname", value = numeric(0)) :
replacement has 0 rows, data has 351
In addition: Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA"
I feel like there's something super-simple I'm missing but I just can't see it.
Any help gratefully received.

We can use lapply, a custom-made standardization function, setNames, and cbind.
I do not have access to your dataset, so I am using the iris dataset as an example:
df <- iris
# standardize the four numeric columns and bind them on with a "_Std" suffix
cbind(df, setNames(lapply(df[1:4],
                          \(x) (x - mean(x)) / sd(x)),
                   paste0(names(df[1:4]), "_Std")))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_Std Sepal.Width_Std Petal.Length_Std Petal.Width_Std
1 5.1 3.5 1.4 0.2 setosa -0.89767388 1.01560199 -1.33575163 -1.3110521482
2 4.9 3.0 1.4 0.2 setosa -1.13920048 -0.13153881 -1.33575163 -1.3110521482
3 4.7 3.2 1.3 0.2 setosa -1.38072709 0.32731751 -1.39239929 -1.3110521482
4 4.6 3.1 1.5 0.2 setosa -1.50149039 0.09788935 -1.27910398 -1.3110521482
5 5.0 3.6 1.4 0.2 setosa -1.01843718 1.24503015 -1.33575163 -1.3110521482
...
I feel these transformations get easier with dplyr:
library(dplyr)
iris %>% mutate(across(where(is.numeric),
                       ~ (.x - mean(.x)) / sd(.x),
                       .names = "{.col}_Std"))
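As for why the original loop fails: `$` takes the name after it literally, so Ionosphere$c looks for a column actually called "c" and Ionosphere$cname creates a column actually called "cname". Neither uses the value of the loop variable. A minimal sketch of the same loop with `[[ ]]`, which does evaluate the variable, is below. It uses iris and a hand-written formula instead of Ionosphere and robustHD::standardize (both substitutions are assumptions made to keep the example self-contained):

```r
# Work on a copy of iris so the built-in dataset stays untouched
df <- iris

# Only numeric columns can be standardized; fix the list before the loop adds columns
num_cols <- names(df)[sapply(df, is.numeric)]

for (col in num_cols) {
  new_name <- paste0(col, "_Std")  # paste0() avoids the space that paste() inserts
  x <- df[[col]]                   # [[ ]] looks up the column named by the variable
  df[[new_name]] <- (x - mean(x)) / sd(x)
}

names(df)
```

With the Ionosphere data the same pattern should apply, as long as non-numeric columns (such as factors) are skipped.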

Related

Unable to add columns to data table in R

I am trying to add columns with the value "" to a data table, but I am getting the following error. Can anyone help me here?
The data has already been converted to a data.table, and the usual data frame assignment no longer works.
iris_sam <- iris
iris_sam <- as.data.table(iris_sam)
iris_sam[c("new", "New1")] <- ""
Error in `[.data.table`(x, i, which = TRUE) :
When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
data.table uses a different syntax; please look into the documentation. For this case you can assign the new columns like this:
library(data.table)
iris_sam <- iris
iris_sam <- as.data.table(iris_sam)
iris_sam[j = c("new", "New1") := ""]
head(iris_sam, 5)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species new New1
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
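If the new column names are held in a variable rather than typed inline, a commonly used form is to wrap that variable in parentheses on the left of `:=`. The sketch below assumes the data.table package is installed; the name new_cols is only illustrative:

```r
library(data.table)

dt <- as.data.table(iris)

# Hypothetical vector holding the names of the columns to add
new_cols <- c("new", "New1")

# The parentheses tell data.table to use the contents of new_cols as column names
dt[, (new_cols) := ""]

head(dt, 3)
```

Without the parentheses, data.table would create a single column literally named new_cols.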

Passing vector of names to verify to assertr's verify in R

I am importing a dataset from a third party and would like to validate that all of the columns in the incoming dataset are named as agreed and expected. To do this, I intended to use verify from the assertr package together with has_all_names. This works if I type the column names in manually, but not if I pass a vector that contains the names to be verified. For example, using the built-in iris dataset, I can verify the existence of all the column names when I enter them as separate arguments to has_all_names, but if the names are stored in a vector and I use that for verification, it does not work:
#Create a sample list of column names to be verified
#In my real work, I obtain this list from a database
(names(iris)->expected_variable_names)
Which outputs:
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
But then I run the following:
#This works:
iris %>% verify(has_all_names("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"))
#But this does not:
iris %>% verify(has_all_names(expected_variable_names))
When I attempt to run the line that does not work, this generates:
verification [has_all_names(expected_variable_names)] failed! (1 failure)
verb redux_fn predicate column index value
1 verify NA has_all_names(expected_variable_names) NA 1 NA
Error: assertr stopped execution
Obviously, the failed attempt indicates that not all of the column names were found in the data frame, but since I'm passing in all the variable names that are indeed in the dataset, it should succeed. How can I pass verify a vector, or possibly even a list, of column names to validate? I've tried a number of variations of this last attempt with no success.
Thanks.
We may use invoke from purrr (deprecated since purrr 1.0.0 in favor of exec):
library(purrr)
library(dplyr)
library(assertr)
iris %>%
verify(invoke(has_all_names, expected_variable_names))
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
...
Or with exec from rlang
library(rlang)
iris %>%
verify(exec(has_all_names, !!!expected_variable_names))
Or with do.call from base R
iris %>%
verify(do.call(has_all_names,
as.list(expected_variable_names)))
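If it also helps to see which names are missing, rather than only that the check failed, a plain base R comparison can sit alongside the assertr call. This is a sketch outside of assertr, not a replacement for verify; the extra name "Colour" is a made-up example:

```r
expected_variable_names <- names(iris)

# Names that were agreed on but are absent from the incoming data
missing_names <- setdiff(expected_variable_names, names(iris))

if (length(missing_names) > 0) {
  stop("Missing columns: ", paste(missing_names, collapse = ", "))
}

# With a deliberately wrong expectation, setdiff() reports the offending name
setdiff(c(expected_variable_names, "Colour"), names(iris))
```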

different behavior while using createDataFrame and read.df in SparkR

I am using Spark 1.5.1
When I do this
df <- createDataFrame(sqlContext, iris)
#creating a new column for category "Setosa"
df$Species1<-ifelse((df)[[5]]=="setosa",1,0)
head(df)
output: new column created
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
But when I save the iris dataset as a CSV file, read it back, and convert it to a SparkR DataFrame:
df <- read.df(sqlContext,"/Users/devesh/Github/deveshgit2/bdaml/data/iris/",
source = "com.databricks.spark.csv",header = "true",inferSchema = "true")
and then try to create the new column:
df$Species1<-ifelse((df)[[5]]=="setosa",1,0)
I get the below error:
16/02/05 12:11:01 ERROR RBackendHandler: col on 922 failed Error in select(x, x$"*", alias(col, colName)) :
error in evaluating the argument 'col' in selecting a method for function 'select': Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Cannot resolve column name "Sepal.Length" among (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species);
at org.apache.spark.s
SparkSQL doesn't support names with embedded dots. When you use createDataFrame, the names are adjusted for you automatically; for other methods you have to provide the schema explicitly:
schema <- structType(
  structField("Sepal_Length", "double"),
  structField("Sepal_Width", "double"),
  structField("Petal_Length", "double"),
  structField("Petal_Width", "double"),
  structField("Species", "string"))
df <- read.df(sqlContext, path, source = "com.databricks.spark.csv",
              header = "true", schema = schema)
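If you control how the CSV is written, another option (not part of the answer above) is to remove the dots from the column names in plain R before exporting, so that the header Spark reads contains none. A minimal sketch, with a temporary file standing in for the real output path:

```r
# Copy of iris with dots in the column names replaced by underscores
iris_clean <- iris
names(iris_clean) <- gsub(".", "_", names(iris_clean), fixed = TRUE)

# Hypothetical output location; a temporary file is used here
csv_path <- tempfile(fileext = ".csv")
write.csv(iris_clean, csv_path, row.names = FALSE)

# The header that any CSV reader will now see
names(read.csv(csv_path))
```

Whether this is preferable to an explicit schema depends on who owns the file; for third-party data the schema approach is likely the safer one.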

Unable to modify an object returned by an R function

I wrote a function in R to return the name of the first column of a data frame:
my_func <- function(table, firstColumnOnly = TRUE){
  if(firstColumnOnly)
    return(colnames(table)[1])
  else
    return(colnames(table))
}
If I call the function like this:
my_func(fertility)<-"foo"
I get the following error:
Error in my_func(fertility, FALSE)[1] <- "foo" :
could not find function "my_func<-"
Why am I getting this error? I can do this without an error:
colnames(fertility)[1]<-"Country"
It seems like you are expecting that this:
my_func(fertility)<-"foo"
will be understood by R as:
colnames(table)[1] <- "foo" # if firstColumnOnly
or
colnames(table) <- "foo" # if !firstColumnOnly
It will not. One reason is that colnames() and `colnames<-`() are two distinct functions. The first returns the column names; the second assigns new names. Your function can only return the names, not assign them.
One workaround would be to write your function using `colnames<-` and return the modified table:
my_func <- function(table, rep, firstColumnOnly = TRUE){
  if(firstColumnOnly) colnames(table)[1] <- rep
  else colnames(table) <- rep
  return(table)
}
Test
head(my_func(iris,"foo"))
foo Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
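To make the my_func(x) <- "foo" syntax itself work, R needs a replacement function named `my_func<-` whose last argument is called value. A minimal sketch, using a copy of iris in place of the fertility data (which is not available here):

```r
# Replacement function: R rewrites my_func(x) <- v as x <- `my_func<-`(x, value = v)
`my_func<-` <- function(table, firstColumnOnly = TRUE, value) {
  if (firstColumnOnly) {
    colnames(table)[1] <- value
  } else {
    colnames(table) <- value
  }
  table  # a replacement function must return the modified object
}

df <- head(iris)
my_func(df) <- "foo"
colnames(df)
```

The getter my_func() from the question can coexist with this definition, in the same way colnames() and `colnames<-`() do.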

Defining functions (in rollapply) using lines of a dataframe

First of all, I have a data frame (let's call it "years") with 5 rows and 10 columns. I need to build a new one by computing (x1-x2)/x1, where x1 is the first element and x2 the second element of a column in "years", then (x2-x3)/x2, and so forth. I thought rollapply would be the best tool for the task, but I can't figure out how to define such a function to pass to rollapply.
I'm new to R, so I hope my question is not too basic. Anyway, I couldn't find a similar question here so I'd be really thankful if someone could help me.
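You can use transform, diff and length; there is no need for rollapply. Note that this computes (x2-x1)/x1, so the sign is the opposite of what the question asks for: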
> df <- head(iris,5) # some data
> transform(df, New = c(NA, diff(Sepal.Length)/Sepal.Length[-length(Sepal.Length)] ))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3.0 1.4 0.2 setosa -0.03921569
3 4.7 3.2 1.3 0.2 setosa -0.04081633
4 4.6 3.1 1.5 0.2 setosa -0.02127660
5 5.0 3.6 1.4 0.2 setosa 0.08695652
diff.zoo in the zoo package with the arithmetic=FALSE argument will divide each number by the prior in each column:
library(zoo)
as.data.frame(1 - diff(zoo(DF), arithmetic = FALSE))
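The same calculation can be sketched in base R for every numeric column at once, using the sign convention from the question, (x1-x2)/x1. The first five rows of iris stand in for the "years" data frame here, and rel_change is a made-up helper name:

```r
df <- head(iris, 5)
num_cols <- sapply(df, is.numeric)

# (x1 - x2) / x1 for each pair of consecutive rows
rel_change <- function(x) -diff(x) / head(x, -1)

# One row fewer than the input, one column per numeric input column
result <- as.data.frame(lapply(df[num_cols], rel_change))
result
```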
