Using outside function in dplyr to standardize values via selected geometric mean. (Getting it via sample instead of geom mean of full column) - r

Good evening, fellow programmers, statisticians, etc.
I'm trying to standardize a set of variables by dividing them by the geometric mean of a set of reference variables (which may or may not be the same ones). The problem is that when I do it via dplyr, the results differ from what I get when I compute it case by case.
Below is some code showing what I have done and why it failed. It seems that dplyr is not taking my values per sample (row-wise) and is instead using the full column to compute the geometric mean.
I have reviewed some questions, including some about geometric means, but have not yet found a solution.
# A set of functions I'm using to calculate the geom mean.
gm_mean = function(x, na.rm=TRUE){
exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
gm_mean2 = function(x, na.rm=TRUE){
exp(mean(log(x[x > 0]), na.rm=na.rm))
}
# And also psych::geometric.mean()
# x <- c(4, 8, 9, 9, 12, 14, 17)
# gm_mean(x) # It works as intended.
# gm_mean2(x) #It works as intended.
# psych::geometric.mean(x) #Indeed it works
So, using the iris dataset, I want to standardize a set of columns (coln1) by dividing them by the geometric mean of another set of columns. (I would like to hold that second set in a variable too, but since I can't get it to work that way, I'm naming the columns directly for now.)
So far I have tried this, and failed:
library(dplyr)
coln1 <- colnames(iris)[1:2]
iris %>% mutate(across( any_of(coln1), ~ .x / psych::geometric.mean(c(Sepal.Length,Sepal.Width)) )) ## Doesn't work as intended? No. Not at all.
# Let me illustrate. Is the value we get doing it case by case equal to the output?
iris[1,1] / psych::geometric.mean(c(iris[1,1],iris[1,2]))
# 1.207 != 1.2187
iris[1,1] / psych::geometric.mean(c(iris$Sepal.Length,iris$Sepal.Width))
# 1.2187 == 1.2187
# It's taking the full column of values, all of them, and not the values corresponding to that sample (2 values in this case, but we could use more or fewer variables by changing the psych::geometric.mean() call).
# Notes.
# The geometric mean is the nth root of n products, or e to the mean log of x. Useful for describing non-normal, i.e., geometric distributions. We are using it via psych:: because values could be negative and we would need to handle that.
# iris %>% mutate(across( any_of(coln1), ~ .x / exp(mean(log(Sepal.Length+Sepal.Width))) )) # No, because the sum is a single value instead of two, so this is not a mean of the two.

I think you've done a great job setting it up, it's just 'rowwise()' that you're missing really! I've re-arranged the logic in the mutate call but it's basically just rowwise.
coln1 <- colnames(iris)[3:4]
iris %>%
rowwise() %>%
mutate(geo.mean = psych::geometric.mean(c(Sepal.Length,Sepal.Width)),
across(.cols = all_of(coln1), .fns = ~ .x / geo.mean, .names = '{.col}_{.fn}'))
# A tibble: 150 x 8
# Rowwise:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species geo.mean Petal.Length_1 Petal.Width_1
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa 4.22 0.331 0.0473
# prove it's correctly functioning with first entry:
1.4 / psych::geometric.mean(c(5.1, 3.5))
[1] 0.3313667
0.2 / psych::geometric.mean(c(5.1, 3.5))
[1] 0.04733811
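For larger data, a vectorised alternative that avoids rowwise() might be worth considering: the row-wise geometric mean can be computed as exp(rowMeans(log(...))). This is only a sketch and assumes every reference value is strictly positive; ref_cols, target_cols and the "_std" suffix are illustrative names, not from the question. It also keeps the reference columns in a variable, as the question asked.

```r
# Reference columns (used for the geometric mean) and columns to standardize.
# Both vectors are illustrative choices.
ref_cols    <- c("Sepal.Length", "Sepal.Width")
target_cols <- c("Petal.Length", "Petal.Width")

# Row-wise geometric mean: exp of the row mean of the logs.
# Assumes every reference value is > 0.
geo_mean <- exp(rowMeans(log(as.matrix(iris[ref_cols]))))

# Divide each target column by its own row's geometric mean.
std <- iris[target_cols] / geo_mean
names(std) <- paste0(target_cols, "_std")

head(cbind(iris, geo.mean = geo_mean, std), 3)
```

The first row should agree with the hand check above, i.e. 1.4 and 0.2 each divided by the geometric mean of 5.1 and 3.5.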

Related

Making a function with variable inputs

I am trying to make a function that does a simple calculation, however, I want to apply it across several columns, and each column has a different constant in the equation.
This is the formula I want to make into a function.
Example
df<- iris[1:10,2:4]
y_true<- c(3, 1, 0.4) # each number(constant) corresponds to each column in df
y_true_sepal_width<- 3 # (1 corresponds to petal.length, 0.4 to petal.width)
R<- 10 # Number of samples
y_estimated<- mean(df$Sepal.Width)
(((sqrt((y_estimated-y_true_sepal_width)^2/R))*100)) #This is (I believe) how to find the value for one column manually
I want to do this formula, but basically taking the mean of each column and substituting out each y_true as it moves across the data frame. I figured this would be done by putting the true constants into a vector, but haven't had any luck in incorporating it into the function.
Given that you have a df and y_true, you can create an estimate function as follows:
estimate = function(df, y_true) {
R = nrow(df)
y_estimated = apply(df, 2, mean)
((sqrt( (y_estimated - y_true)^2 / R)) / y_true) * 100
}
and then, you can use it with your data as follows:
df = iris[1:10,2:4]
y_true = c(3, 1, 0.4)
estimate(df = df, y_true = y_true)
which outputs:
Sepal.Width Petal.Length Petal.Width
3.267687 14.230249 14.230249
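As a sanity check, the same numbers can be reproduced with colMeans(), using the fact that sqrt(d^2 / R) equals abs(d) / sqrt(R). This is a sketch that assumes the relative form of the formula used in estimate() (with the division by y_true); estimate2 is a hypothetical name.

```r
# Hypothetical variant of estimate(): same formula, written with colMeans().
estimate2 <- function(df, y_true) {
  # y_true is matched to the columns by position, so the lengths must agree.
  stopifnot(length(y_true) == ncol(df))
  abs(colMeans(df) - y_true) / sqrt(nrow(df)) / y_true * 100
}

df <- iris[1:10, 2:4]
y_true <- c(3, 1, 0.4)
estimate2(df, y_true)

# Hand check for the first column only.
(sqrt((mean(df$Sepal.Width) - 3)^2 / nrow(df)) / 3) * 100
```

Because y_true is recycled element by element against the named vector of column means, the order of y_true has to follow the column order of df.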

Convert multiple moran.test outputs into structured, storable, copy-pastable strings

I wish to collapse the output of spdep::moran.test into a single string that is regularly structured with variable names and values and that can both be saved as a text value into a dataframe, and be human readable in the RStudio console and copy-pastable into MS Word to form a table without too many additional manual adjustments. (I have multiple tests to run and wish to copy-paste their output in one go.)
In the course of looking for a solution, I stumbled upon the report package which claims to turn an htest class object into a "report" (I don't know what this looks like in R) and thus may address my goal to some extent. However, the report function doesn't work on moran.test, as presented in the code below.
I am exploring and there are probably alternative and more straightforward approaches which I haven't considered. Thus my question is twofold: 1. Solve the immediate issue with report and/or 2. Provide an alternative and more efficient solution to my goal.
The data preparation below is drawn from https://mgimond.github.io/simple_moransI_example.
library(sf)
library(spdep)
library(report)
# Load shapefile
s <- readRDS(url("https://github.com/mgimond/Data/raw/gh-pages/Exercises/nhme.rds"))
# Prevent error "old-style crs object detected; please recreate object with a recent sf::st_crs()"
st_crs(s) <- st_crs(s)
# Define neighboring polygons
nb <- poly2nb(s, queen=TRUE)
# Assign weights to the neighbors
lw <- nb2listw(nb, style="W", zero.policy=TRUE)
# Run Moran’s I test
(mt <- moran.test(s$Income,lw, alternative="greater"))
#Moran I test under randomisation
#data: s$Income
#weights: lw
#Moran I statistic standard deviate = 5.8525, p-value = 2.421e-09
#alternative hypothesis: greater
#sample estimates:
# Moran I statistic Expectation Variance
#0.68279551 -0.04000000 0.01525284
# Moran’s I test is of class htest required by function report::report
class(mt)
#[1] "htest"
# Function report::report returns an error
report(mt)
#Error in `$<-.data.frame`(`*tmp*`, "tau", value = c(`Moran I statistic` = 0.68279551202875, :
# replacement has 3 rows, data has 1
The desired output could look something like:
"P-value 2.421e-09 | Statistic 0.68279551 | Expectation -0.04000000 | Variance 0.01525284"
The point is the names and values, not the separators. This is based on my current assumptions of how to approach this task, which are probably imperfect.
You might want to take a look at the broom package:
broom::tidy(mt)
#> # A tibble: 1 x 7
#> estimate1 estimate2 estimate3 statistic p.value method alternative
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 0.683 -0.04 0.0153 5.85 2.42e-9 Moran I test u… greater
library(tidyverse)
mt %>%
broom::tidy() %>%
as.list() %>%
enframe() %>%
mutate(value = value %>% as.character()) %>% unite(data, sep = "=") %>%
pull(data) %>%
paste0(collapse = ", ")
#> [1] "estimate1=0.68279551202875, estimate2=-0.04, estimate3=0.0152528397222445, statistic=c(`Moran I statistic standard deviate` = 5.85248209823413), p.value=2.42145194022024e-09, method=Moran I test under randomisation, alternative=greater"
You can make a table and create a csv file from multiple tests (e.g. having multiple objects of class htest like mt, mt2 and mt3):
list(mt, mt2, mt3) %>% map(broom::tidy) %>% bind_rows() %>% write_csv("tests.csv")
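If the single-string format from the question is what you are after, a small base-R helper may be enough, since an htest object is just a list with p.value and estimate components. The sketch below is an assumption-laden illustration: htest_to_string is a hypothetical name, and mt_mock is a hand-built stand-in that copies the values printed in the question so the code runs without spdep.

```r
# Hypothetical helper: collapse an htest object into "name value | name value".
htest_to_string <- function(ht, digits = 4, sep = " | ") {
  vals <- c("P-value" = unname(ht$p.value), ht$estimate)
  paste(names(vals), signif(vals, digits), collapse = sep)
}

# Stand-in with the same fields as the moran.test result in the question
# (values copied from the printed output, not recomputed).
mt_mock <- structure(
  list(p.value = 2.421e-09,
       estimate = c("Moran I statistic" = 0.68279551,
                    "Expectation" = -0.04,
                    "Variance" = 0.01525284)),
  class = "htest"
)

htest_to_string(mt_mock)
```

With several tests in a list, something like vapply(tests, htest_to_string, character(1)) should give a character vector that can be stored as a data frame column.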

R: Applying lm on every row of a dataframe using apply family

I have a data frame
x y z
1 4 6
2 5 7
3 6 8
4 7 9
5 8 10
Reproducible example below:
x <- c(1,2,3,4,5)
y <- c(4,5,6,7,8)
z <- c(6,7,8,9,10)
df <- data.frame(x, y, z)
df
I am trying to run a linear regression using lm of each of rows 1:4 against row 5, and I am trying to use the apply family here. I have seen other links on SO which talk about this, but I am having a tough time understanding the syntax. This is my attempt at it.
apply(df, 1, function(x), lm(x[1,] ~ x[5,])$coefficients)
I am not sure what the syntax is to write apply such that it takes all rows.
I would also be thankful if someone could show me how to do the same thing with lm on the columns of a data frame too.
df = data.frame(x=c(1:5), y= c(4:8), z=c(6:10))
I'm defining the data frame differently in two ways: (a) each variable is a column (which is more natural in R), and (b) add a fourth row to the table, so the regression has enough degrees of freedom. I know I'm answering something slightly different than your question, but I think this scenario will be closer to the real-world one you're facing.
library(magrittr)
predictors <- c("x1", "x2", "x3", "x4")
df <- tibble::tribble(
~x1, ~x2, ~x3, ~x4, ~y,
1, 2, 3, 4, 5,
4, 5, 6, 7, 8,
6, 7, 8, 9, 10,
7, 3, 8, 4, 8 # Added this row for stability
)
The tidyverse functions seem a natural fit to me.
df %>%
dplyr::select(!!predictors) %>%
purrr::map( function(x) coef(lm(df$y ~ x)) ) %>%
tibble::enframe(name="predictor", value="coefficients") %>%
dplyr::mutate(
int = purrr::map_dbl(.$coefficients, "(Intercept)"), # coefficients are numeric, so map_dbl rather than map_chr
slope = purrr::map_dbl(.$coefficients, "x")
) %>%
dplyr::select(predictor, int, slope)
Line 2: use only the predictor variables (for the looping)
Line 3: loop over each predictor (ie, x), and predict df$y. The coef() will produce a vector of numerics. (It may initially seem odd to store two numbers per data.frame cell.)
Line 4: convert to a tibble/data.frame for easier manipulation
Line 6: within each bivariate set of coefficients, extract the intercept.
Line 7: within each bivariate set of coefficients, extract the slope.
The code in the question has these problems:
the apply passes one row at a time so x[1, ] is really just x -- not wrong but pointless
x[5, ] is an error since x only has one row so one cannot ask for its 5th row
the apply includes the last row which would be regressing that row against itself which seems pointless
normally one puts the variables in columns and the cases in rows but df has it reversed. With the conventional orientation when one refers to a variable one is referring to a plain vector. With the orientation of the question df[i, ] is a one row data frame rather than a plain vector which is not what we want.
using the coef function is preferred to messing with the internals of the lm object as done in the question.
in a comment to which the poster agreed, @wibeasley stated that df[i, ] is the predictor, i.e. independent variable (one for each regression) and df[5, ] is the outcome variable, i.e. the dependent variable. That is the model is
df[5, ] = a + b * df[i, ] + error
with a separate regression for each value of i (except 5). In that case the variables are listed on the wrong sides of the formula in the code of the question.
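To make that model concrete, here is one of those regressions written out by hand for i = 1, using the data frame from the question. The names outcome and predictor are just illustrative.

```r
df <- data.frame(x = 1:5, y = 4:8, z = 6:10)

# unlist() turns the one-row data frames into plain numeric vectors.
outcome   <- unlist(df[5, ])  # 5, 8, 10
predictor <- unlist(df[1, ])  # 1, 4, 6

# Regress row 5 on row 1: the fit is exact, with intercept 4 and slope 1.
coef(lm(outcome ~ predictor))
```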
1) Fixing up these problems we get:
DF <- as.data.frame(t(df))
nc <- ncol(DF)
sapply(DF[-nc], function(x) coef(lm(DF[, nc] ~ x)))
giving:
V1 V2 V3 V4
(Intercept) 4 3 2 1
x 1 1 1 1
2) If you do want to express this in terms of df then:
nr <- nrow(df)
apply(df[-nr,], 1, function(x) coef(lm(t(df[nr, ]) ~ x)))
3) If the intent was that df[5, ] is the predictor variable then we would not need an apply at all and this would do (where DF and nc are defined above):
coef(lm(as.matrix(DF[-nc]) ~ DF[[nc]]))
giving:
V1 V2 V3 V4
(Intercept) -4 -3 -2 -1
DF[[nc]] 1 1 1 1
Sorry if I misunderstood your question.
If you want the predicted value generated by the model then you can use
fitted(model)
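A minimal sketch of what that looks like, assuming model is an lm fit; the small data frame here is made up purely for illustration.

```r
# Made-up data, only to have something to fit.
d <- data.frame(x = 1:5, y = c(4.1, 4.9, 6.2, 6.8, 8.1))
model <- lm(y ~ x, data = d)

# Predicted values for the rows the model was fitted on.
fitted(model)

# For new predictor values, predict() takes a data frame with the same column names.
predict(model, newdata = data.frame(x = 6))
```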

R - Using nested dataframe to run function with different sets of parameters

I would like to create a wrapper for the Levenberg-Marquardt Nonlinear Least-Squares function nls.lm (minpack.lm library) similar to nls2 (nls2 library) to give a brute force method for evaluating the fit of a model to observed data.
The idea is to create a range of starting value combinations and either:
pass these to a function, then compare the function output to the observed data to create an R^2 value for each of the starting value combinations and run the nls.lm fitting with the best one of them.
or
run nls.lm on all combinations and select the best returned fit.
I wanted to do this without looping and, after inspiration from here, am trying to use nested dataframes, with one column for the parameter input list, one for the values returned by my function, one for the R^2 values, and one for the best fit models, something like:
df
# start_val fun_out R^2
# 1 {a=2,b=2} {22,24,26...} 0.8
# 2 {a=3,b=5} {35,38,41...} 0.6
This is the code I have so far:
require(dplyr);require(tidyr)
foo <- function(x,a,b) a*x^2+b # function I am fitting
x <- 1:10 # independent variable
y_obs <- foo(x,1.5,2.5) + rnorm(length(x),0,10) # observed data (dependent variable)
start_range <- data.frame(a=c(1,2),b=c(2,3)) # range of allowed starting points for fitting
reps <- 2 # number of starting points to generate
# Create a data frame of starting points
df<-as.data.frame(sapply(start_range, function(x) runif(reps,min=x[[1]],max=x[[2]]))) %>%
mutate(id=seq_len(reps)) %>% # fudge to make nest behave as I want
nest(1:ncol(start_range)) %>%
mutate(data=as.list(data)) %>%
as.data.frame()
df
# id data
# 1 1 1.316356, 2.662923
# 2 2 1.059356, 2.723081
I get stuck now trying to pass the parameters in data into the function foo(). I've tried using do.call(), and even with using constant parameters the following error appears:
mutate(df,y=do.call(foo,list(x,1,2)))
# Error: wrong result size (5), expected 2 or 1
Is there a way to create columns of a dataframe which contain lists directly without using nest()?
Also, when trying to create the list to pass to do.call() using the dataframe columns, how do you create a list where the first element is the vector x, the second is the parameter a and the third is the parameter b? The following splits the list down the column:
mutate(df,my_list=list(x,data))
# id data my_list
# 1 1 1.316356, 2.662923 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
# 2 2 1.059356, 2.723081 1.316356, 2.662923, 1.059356, 2.723081
Running nls2 with algorithm = "random-search", all = TRUE and the specified maxiter evaluates foo at maxiter random points and returns starting_fits, the fits at those points: a set of "nls" class objects, one per randomly chosen starting value. It does not optimize from each of these starting values; it just returns the "nls" object at each, that is, nls is not run. Then, for each starting fit, run nlsLM to get fits, a list of nlsLM fits; summarize them in data (a data frame with one row per run) and show the row(s) with the smallest RSS.
If we only want to pick the best starting value and just run nlsLM once from that then use the alternate code near the end.
library(nls2)
library(minpack.lm) # provides nlsLM
fo <- y_obs ~ foo(x, a, b)
starting_fits <- nls2(fo, algorithm = "random-search",
start = start_range, control = nls.control(maxiter = reps), all = TRUE)
fits <- lapply(starting_fits, function(fit) nlsLM(fo, start = coef(fit)))
data <- data.frame(RSS = sapply(fits, deviance), t(sapply(fits, coef)),
start = t(sapply(starting_fits, coef)))
# data$fits <- fits # optional to store each row's fitted object in that row
subset(data, RSS == min(RSS)) # minimum(s)
giving:
RSS a b start.a start.b
2 706.3956 1.396616 7.226525 1.681819 2.768374
R squared is used for linear regression. It is not valid for nonlinear regression. Residual sum of squares (RSS) is shown above instead.
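For reference, RSS is simply the sum of squared residuals, which is what deviance() returns for an nls-type fit. The sketch below uses base nls instead of nlsLM so it runs without extra packages; the seed and starting values are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of the sketch
foo <- function(x, a, b) a * x^2 + b
x <- 1:10
y_obs <- foo(x, 1.5, 2.5) + rnorm(length(x), 0, 10)

# Base nls stands in for nlsLM here; the starting values are arbitrary.
fit <- nls(y_obs ~ foo(x, a, b), start = list(a = 1, b = 2))

# RSS computed by hand agrees with deviance().
rss <- sum(residuals(fit)^2)
all.equal(rss, deviance(fit))
```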
Alternately if you just want to pick out the best starting value and run nlsLM on that then just omit the all=TRUE argument from the nls2 call and do this. If you need the coefficients and RSS for later code then try coef(fit) and deviance(fit) .
starting_fit <- nls2(fo, algorithm = "random-search",
start = start_range, control = nls.control(maxiter = reps))
fit <- nlsLM(fo, start = coef(starting_fit))
Note 1: If you are getting errors from nlsLM try replacing nlsLM(...) with try(nlsLM(...)). This will issue error messages (use try(..., silent = TRUE) if you don't want them) but will not stop processing.
Note 2: I assume that the foo shown in the question is just an example and the real function is more complex. The foo shown is linear in the coefficients so one could use lm for it. No nonlinear optimization is needed.
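To illustrate Note 2 for this particular foo: since a * x^2 + b is linear in a and b, an ordinary lm on x^2 should recover the same least-squares estimates without any starting values. A sketch, with an arbitrary seed:

```r
set.seed(1)  # arbitrary seed, only for reproducibility of the sketch
foo <- function(x, a, b) a * x^2 + b
x <- 1:10
y_obs <- foo(x, 1.5, 2.5) + rnorm(length(x), 0, 10)

# I(x^2) tells the formula to treat x^2 as arithmetic, not formula syntax.
fit_lm <- lm(y_obs ~ I(x^2))

# "(Intercept)" estimates b and "I(x^2)" estimates a.
coef(fit_lm)
```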
An approach like this perhaps?
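library(dplyr)
library(tidyr) # for nest()
library(purrr)
foo2 <- function(x,data) data$a*x^2+data$b
# R^2 = 1 - SS_res / SS_tot, with e = estimated and o = observed values
r2 <- function(e, o) 1 - sum((e - o)^2) / sum((o - mean(o))^2)
df <- as.data.frame(sapply(start_range, function(x) runif(reps,min=x[[1]],max=x[[2]]))) %>%
mutate(id=seq_len(reps)) %>% # fudge to make nest behave as I want
nest(data = 1:ncol(start_range)) # recent tidyr versions expect the new column to be named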
df %>%
mutate(fun_out = map(data, foo2, x = x),
R2 = map(fun_out, o = y_obs, r2))
Result:
# A tibble: 3 x 4
id data fun_out R2
<int> <list> <list> <list>
1 1 <tibble [1 x 2]> <dbl [10]> <dbl [1]>
2 2 <tibble [1 x 2]> <dbl [10]> <dbl [1]>
3 3 <tibble [1 x 2]> <dbl [10]> <dbl [1]>

log- and z-transforming my data in R

I'm preparing my data for a PCA, for which I need to standardize it. I've been following someone else's code in vegan but am not getting a mean of zero and SD of 1, as I should be.
I'm using a data set called musci which has 13 variables, three of which are labels to identify my data.
log.musci<-log(musci[,4:13],10)
stand.musci<-decostand(log.musci,method="standardize",MARGIN=2)
When I then check for mean=0 and SD=1...
colMeans(stand.musci)
sapply(stand.musci,sd)
I get mean values ranging from -8.9 to 3.8 and SD values are just listed as NA (for every data point in my data set rather than for each variable). If I leave out the last variable in my standardization, i.e.
log.musci<-log(musci[,4:12],10)
the means don't change, but the SDs now all have a value of 1.
Any ideas of where I've gone wrong?
Cheers!
Your data is likely a matrix.
## Sample data
dat <- as.matrix(data.frame(a=rnorm(100, 10, 4), b=rexp(100, 0.4)))
So, either convert to a data.frame and use sapply to operate on columns
dat <- data.frame(dat)
scaled <- sapply(dat, scale)
colMeans(scaled)
# a b
# -2.307095e-16 2.164935e-17
apply(scaled, 2, sd)
# a b
# 1 1
or use apply to do columnwise operations
scaled <- apply(dat, 2, scale)
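It may also be worth noting that scale() itself works column by column on a matrix, so no apply is strictly needed. A short sketch with the same kind of sample data (the seed is an arbitrary choice):

```r
set.seed(42)  # arbitrary seed, only for reproducibility of the sketch
dat <- as.matrix(data.frame(a = rnorm(100, 10, 4), b = rexp(100, 0.4)))

# scale() centres and scales each column of a matrix directly.
scaled <- scale(dat)

round(colMeans(scaled), 10)  # column means, zero up to rounding error
apply(scaled, 2, sd)         # column standard deviations
```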
A z-transformation is quite easy to do manually.
See below using a simple vector of data.
data <- c(1,2,3,4,5,6,7,8,9,10)
data
mean(data)
sd(data)
z <- ((data - mean(data))/(sd(data)))
z
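isTRUE(all.equal(mean(z), 0)) # compare with a tolerance; == can fail because of floating-point rounding
isTRUE(all.equal(sd(z), 1))
The logarithm transformation is done using the log() function, which computes the natural logarithm by default; for base 10, as in the question, use log10(data) or log(data, 10).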
log(data)
Hope this helps!
