R: Applying lm on every row of a data frame using the apply family

I have a data frame
x y z
1 4 6
2 5 7
3 6 8
4 7 9
5 8 10
Reproducible example below:
x <- c(1,2,3,4,5)
y <- c(4,5,6,7,8)
z <- c(6,7,8,9,10)
df <- data.frame(x, y, z)
df
I am trying to run a linear regression using lm between rows 1:4 and row 5, using the apply family. I have seen other links on SO which talk about this, and this link in particular was a good one, but I am having a tough time understanding the syntax. This is my attempt at it.
apply(df, 1, function(x), lm(x[1,] ~ x[5,])$coefficients)
I am not sure what the syntax is to write apply such that it takes all rows.
I would also be thankful if someone could show me how to do the same thing with lm on the columns of a data frame too.
df <- data.frame(x = 1:5, y = 4:8, z = 6:10)

I'm defining the data frame differently in two ways: (a) each variable is a column (which is more natural in R), and (b) I added a fourth row to the table, so the regression has enough degrees of freedom. I know I'm answering something slightly different from your question, but I think this scenario will be closer to the real-world one you're facing.
library(magrittr)
predictors <- c("x1", "x2", "x3", "x4")
df <- tibble::tribble(
  ~x1, ~x2, ~x3, ~x4, ~y,
    1,   2,   3,   4,  5,
    4,   5,   6,   7,  8,
    6,   7,   8,   9, 10,
    7,   3,   8,   4,  8  # Added this row for stability
)
The tidyverse functions seem a natural fit to me.
df %>%
  dplyr::select(!!predictors) %>%
  purrr::map( function(x) coef(lm(df$y ~ x)) ) %>%
  tibble::enframe(name="predictor", value="coefficients") %>%
  dplyr::mutate(
    # map_dbl (not map_chr): the coefficients are numeric
    int   = purrr::map_dbl(.$coefficients, "(Intercept)"),
    slope = purrr::map_dbl(.$coefficients, "x")
  ) %>%
  dplyr::select(predictor, int, slope)
Line 2: use only the predictor variables (for the looping)
Line 3: loop over each predictor (i.e., x), and predict df$y. coef() will produce a numeric vector. (It may initially seem odd to store two numbers per data.frame cell.)
Line 4: convert to a tibble/data.frame for easier manipulation
Line 6: within each bivariate set of coefficients, extract the intercept.
Line 7: within each bivariate set of coefficients, extract the slope.

The code in the question has these problems:
the apply passes one row at a time as a plain vector, so x[1, ] fails (a vector cannot be indexed by row); what was meant is just x
x[5, ] fails for the same reason; x is a single row, so one cannot ask for its 5th row
the apply includes the last row which would be regressing that row against itself which seems pointless
normally one puts the variables in columns and the cases in rows, but df has it reversed. With the conventional orientation, referring to a variable gives a plain vector; with the orientation of the question, df[i, ] is a one-row data frame rather than a plain vector, which is not what we want.
using the coef function is preferred to messing with the internals of the lm object as done in the question.
in a comment to which the poster agreed, @wibeasley stated that df[i, ] is the predictor, i.e. independent variable (one for each regression), and df[5, ] is the outcome variable, i.e. the dependent variable. That is, the model is
df[5, ] = a + b * df[i, ] + error
with a separate regression for each value of i (except 5). In that case the variables are listed on the wrong sides of the formula in the code of the question.
1) Fixing up these problems we get:
DF <- as.data.frame(t(df))
nc <- ncol(DF)
sapply(DF[-nc], function(x) coef(lm(DF[, nc] ~ x)))
giving:
V1 V2 V3 V4
(Intercept) 4 3 2 1
x 1 1 1 1
2) If you do want to express this in terms of df then:
nr <- nrow(df)
apply(df[-nr,], 1, function(x) coef(lm(t(df[nr, ]) ~ x)))
3) If the intent was that df[5, ] is the predictor variable then we would not need an apply at all and this would do (where DF and nc are defined above):
coef(lm(as.matrix(DF[-nc]) ~ DF[[nc]]))
giving:
V1 V2 V3 V4
(Intercept) -4 -3 -2 -1
DF[[nc]] 1 1 1 1

Sorry if I misunderstood your question.
If you want the predicted value generated by the model then you can use
fitted(model)
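For example, a minimal sketch with the question's data frame (regressing z on x, a model chosen here just for illustration):

```r
df <- data.frame(x = 1:5, y = 4:8, z = 6:10)

# Fit a model, then ask the model object for its predictions
model <- lm(z ~ x, data = df)
fitted(model)    # the predicted z for each observation
resid(model)     # the residuals, z - fitted(model)
```

Since z here is exactly x + 5, the fitted values reproduce z and the residuals are numerically zero.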


generating a sum coding scheme

I have a dataframe which looks like this:
df <- data.frame(id = rep(1:125, 3),
                 timpoint = c(rep("T1", 125), rep("T2", 125), rep("T3", 125)),
                 treatment = c(rep("A", 25), rep("B", 25), rep("C", 25), rep("D", 25), rep("E", 25)))
interaction.col <- paste(df$timpoint, df$treatment, sep = "_")
df <- cbind(df, interaction.col)
I am trying to generate a sum coding scheme for the interaction column, which is a combination of the first two columns. According to this paper I should get a matrix of (a−1)×(b−1) columns and n rows (in this case 375).
I have read up on using contrasts:
contrasts(df$interaction.col) <- "contr.sum"
df.c <- contrasts(df$interaction.col)
However, the output is a 15 x 14 matrix, while it should be 375 x 8.
Also, only the very last row is set to -1, which shouldn't be the case. For all the IDs of the last treatment (E), the interaction column should be -1 for the corresponding timepoint. What am I doing wrong here?
Depending on what effects you are interested in and how you fit the model, you will end up with a different number of effects. For example, if you fit both the main and interaction effects, you end up with 8 columns for the interactions, i.e. (a-1) x (b-1). If you do not fit the main effects, you end up with a*b - 1:
Here is how to create your matrix:
With main effects:
model.matrix(~treatment * timpoint, df, list(treatment = contr.sum, timpoint=contr.sum))
In this case, the last 8 columns are the ones you are interested in
Without main effects:
model.matrix(~treatment:timpoint, df, list(treatment = contr.sum, timpoint=contr.sum))
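A quick sketch to confirm the dimensions against the (a−1)×(b−1) formula, using the df built in the question (a = 5 treatments, b = 3 timepoints):

```r
# Build df as in the question: 375 rows, 5 treatments x 3 timepoints
df <- data.frame(id = rep(1:125, 3),
                 timpoint = c(rep("T1", 125), rep("T2", 125), rep("T3", 125)),
                 treatment = c(rep("A", 25), rep("B", 25), rep("C", 25), rep("D", 25), rep("E", 25)))

mm <- model.matrix(~ treatment * timpoint, df,
                   contrasts.arg = list(treatment = contr.sum, timpoint = contr.sum))
dim(mm)    # 375 rows and 1 + (5-1) + (3-1) + (5-1)*(3-1) = 15 columns

# The last 8 columns are the sum-coded interaction terms
interaction.part <- mm[, (ncol(mm) - 7):ncol(mm)]
dim(interaction.part)    # 375 x 8, i.e. n rows by (a-1)*(b-1) columns
```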

How to import a distance matrix for clustering in R

I have got a text file containing 200 models, all compared to each other, with a molecular distance for each pair of models compared. It looks like this:
1 2 1.2323
1 3 6.4862
1 4 4.4789
1 5 3.6476
.
.
All the way down to 200, where the first number is the first model, the second number is the second model, and the third number the corresponding molecular distance when these two models are compared.
I can't think of a way to import this into R and create a nice 200x200 matrix to perform some clustering analyses on. I am still new to Stack and R, but thanks in advance!
Since you don't have the distance between model1 and itself, you would need to insert that yourself, using the answer from this question:
(the exact values don't matter; the dummy data only needs the right shape)
# Create some dummy data that has the same shape as your data:
df <- expand.grid(model1 = 1:200, model2 = 2:200)
df$distance <- runif(n = 199*200, min = 1, max = 10)
head(df)
# model1 model2 distance
# 1 2 7.958746
# 2 2 1.083700
# 3 2 9.211113
# 4 2 5.544380
# 5 2 5.498215
# 6 2 1.520450
inds <- seq(0, 200*199, by = 200)
val <- c(df$distance, rep(0, length(inds)))
inds <- c(seq_along(df$distance), inds + 0.5)
val <- val[order(inds)]
Once that's in place, you can use matrix() with the ncol and nrow to "reshape" your vector of distance in the appropriate way:
matrix(val, ncol = 200, nrow = 200)
Edit:
When your data only contains the distance in one direction, so only between e.g. model1 - model5 and not model5 - model1, you will have to fill one triangle of a matrix, like they do here. Forget about the data I generated in the first part of this answer, and about inserting the self-distances manually.
dist_mat <- matrix(0, nrow = 200, ncol = 200)  # self-distances on the diagonal are 0
# R fills a triangle column by column; with data sorted by model1 then
# model2 (as in the question), that order matches the lower triangle
dist_mat[lower.tri(dist_mat)] <- your_data$distance
To mirror those entries above the diagonal, use:
dist_mat[upper.tri(dist_mat)] <- t(dist_mat)[upper.tri(dist_mat)]
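A compact sketch with 4 hypothetical models makes the filling order concrete. R assigns into a triangle column by column, and when the file is sorted by model1 and then model2 (as in the question) that order matches the lower triangle, so I fill that one and mirror it:

```r
# Hypothetical long-format distances for 4 models, sorted by model1 then model2
pairs_df <- data.frame(model1   = c(1, 1, 1, 2, 2, 3),
                       model2   = c(2, 3, 4, 3, 4, 4),
                       distance = c(1.2, 6.5, 4.5, 3.6, 2.1, 5.0))

n <- 4
dist_mat <- matrix(0, nrow = n, ncol = n)            # self-distances are 0
dist_mat[lower.tri(dist_mat)] <- pairs_df$distance   # fill below the diagonal
dist_mat[upper.tri(dist_mat)] <- t(dist_mat)[upper.tri(dist_mat)]  # mirror

dist_mat[1, 2]          # 1.2, the model1-model2 distance
isSymmetric(dist_mat)   # TRUE
```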
As your question does not say what format the file is in, I will assume a plain text file with whitespace-separated columns.
Then you should look at the file-reading functions: read.table, read.csv, or data.table::fread.
Example code:
dt <- read.table(file, sep = "", header = TRUE)
I suggest using data.table package. Then:
setDT(dt)
dt[, id := paste0(as.character(col1), "-", as.character(col2))]
This creates a new variable out of the first and the second model and serves as a unique id.
What I then do is remove this id and scale the numerical input.
After scaling, run your clustering algorithm.
Merge the result with the id to analyse your results.
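As a rough sketch of those steps in base R (the column names, the toy distances, and the choice of kmeans with 2 centers are all placeholders for your own data and preferred algorithm):

```r
dt <- data.frame(col1 = c(1, 1, 1, 2, 2, 3),
                 col2 = c(2, 3, 4, 3, 4, 4),
                 distance = c(1.2, 6.5, 4.5, 3.6, 2.1, 5.0))
dt$id <- paste0(dt$col1, "-", dt$col2)   # unique id from the two model columns

scaled <- scale(dt$distance)             # drop the id, scale the numeric input
km <- kmeans(scaled, centers = 2)        # run a clustering algorithm

# Merge the cluster labels back with the id to analyse the results
result <- data.frame(id = dt$id, cluster = km$cluster)
```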
Is that what you are looking for?

How to express membership in multiple categories in R?

How does one express a linear model where observations can belong to multiple categories and the number of categories is large?
For example, using time dummies as the categories, here is a problem that is easy to set up since the number of categories (time periods) is small and known:
tmp <- "day 1, day 2
0,1
1,0
1,1"
periods <- read.csv(text = tmp)
y <- rnorm(3)
print(lm(y ~ day.1 + day.2 + 0, data=periods))
Now suppose that instead of two days there were 100. Would I need to create a formula like the following?
y ~ day.1 + day.2 + ... + day.100 + 0
Presumably such a formula would have to be created programmatically. This seems inelegant and un-R-like.
What is the right R way to tackle this? For example, aside from the formula problem, is there a better way to create the dummies than creating a matrix of 1s and 0s (as I did above)? For the sake of concreteness, say that the actual data consists (for each observation) of a start and end date (so that tmp would contain a 1 in each column between start and end).
Update:
Based on the answer of #jlhoward, here is a larger example:
num.observations <- 1000
# Manually create 100 columns of dummies (named X1, ..., X100 by data.frame)
periods <- data.frame(1*matrix(runif(num.observations*100) > 0.5, nrow = num.observations))
y <- rnorm(num.observations)
print(summary(lm(y ~ ., data = periods)))
It illustrates the manual creation of a data frame of dummies (1s and 0s). I would be interested in learning whether there is a more R-like way of dealing with these "multiple dummies per observation" issue.
You can use the . notation to include all variables other than the response in a formula, and -1 to remove the intercept. Also, put everything in your data frame; don't make y a separate vector.
set.seed(1) # for reproducibility
df <- data.frame(y=rnorm(3),read.csv(text=tmp))
fit.1 <- lm(y ~ day.1 + day.2 + 0, df)
fit.2 <- lm(y ~ -1 + ., df)
identical(coef(fit.1),coef(fit.2))
# [1] TRUE
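For the start/end formulation mentioned at the end of the question, the dummy matrix itself can also be built without loops, e.g. with outer(). This is a sketch assuming integer period indices start and end per observation (my own framing, not from the question's data); it reproduces the 3 x 2 example above:

```r
start <- c(2, 1, 1)   # first period in which each observation is active
end   <- c(2, 1, 2)   # last period
n.periods <- 2

# 1 when start <= t <= end, else 0, for each observation i and period t
periods <- 1 * outer(seq_along(start), seq_len(n.periods),
                     function(i, t) start[i] <= t & t <= end[i])
colnames(periods) <- paste0("day.", seq_len(n.periods))
periods
#      day.1 day.2
# [1,]     0     1
# [2,]     1     0
# [3,]     1     1
```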

log- and z-transforming my data in R

I'm preparing my data for a PCA, for which I need to standardize it. I've been following someone else's code in vegan but am not getting a mean of zero and SD of 1, as I should be.
I'm using a data set called musci which has 13 variables, three of which are labels to identify my data.
log.musci<-log(musci[,4:13],10)
stand.musci<-decostand(log.musci,method="standardize",MARGIN=2)
When I then check for mean=0 and SD=1...
colMeans(stand.musci)
sapply(stand.musci,sd)
I get mean values ranging from -8.9 to 3.8 and SD values are just listed as NA (for every data point in my data set rather than for each variable). If I leave out the last variable in my standardization, i.e.
log.musci<-log(musci[,4:12],10)
the means don't change, but the SDs now all have a value of 1.
Any ideas of where I've gone wrong?
Cheers!
Your data is likely a matrix.
## Sample data
dat <- as.matrix(data.frame(a=rnorm(100, 10, 4), b=rexp(100, 0.4)))
So, either convert to a data.frame and use sapply to operate on columns
dat <- data.frame(dat)
scaled <- sapply(dat, scale)
colMeans(scaled)
# a b
# -2.307095e-16 2.164935e-17
apply(scaled, 2, sd)
# a b
# 1 1
or use apply to do columnwise operations
scaled <- apply(dat, 2, scale)
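In fact, scale() already operates column-wise when given a matrix, so the loop is optional; a direct call gives the same result:

```r
# Sample data, as above
dat <- as.matrix(data.frame(a = rnorm(100, 10, 4), b = rexp(100, 0.4)))

scaled <- scale(dat)              # centers and scales each column of the matrix
round(colMeans(scaled), 12)       # 0 0  (up to floating-point error)
round(apply(scaled, 2, sd), 12)   # 1 1
```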
A z-transformation is quite easy to do manually.
See below, using a small numeric vector as data.
data <- c(1,2,3,4,5,6,7,8,9,10)
data
mean(data)
sd(data)
z <- ((data - mean(data))/(sd(data)))
z
isTRUE(all.equal(mean(z), 0))  # use all.equal rather than == to allow for floating-point error
isTRUE(all.equal(sd(z), 1))
The log transformation is done using the log() function; note that log() alone is the natural logarithm, while your code used base 10 via log(musci[, 4:13], 10).
log(data)
Hope this helps!

McNemar test in R - sparse data

I'm attempting to run a good sized dataset through R, using the McNemar test to determine whether I have a difference in the proportion of objects detected by one method over another on paired samples. I've noticed that the test works fine when I have a 2x2 table of
test1
y n
y 34 2
n 12 16
but if I try and run something more like:
34 0
12 0
it errors, telling me that "'x' and 'y' must have the same number of levels (minimum 2)".
I should clarify that I've tried converting wide data to a 2x2 matrix using the table function on my wide data set, but rather than appearing as above, it omits the final column, giving me:
test1
y
y 34
n 12
I've also run mcnemar.test using the factor object option, which gives me the same error, so I'm assuming it does something similar. I'm wondering whether there is either a way to force the table function to generate the 2nd column despite there being no observations in those categories, or a way to make the test overlook the missing data?
Perhaps there's a better way to do this, but you can force R to construct a sparse contingency table by ensuring that the tabulated factors have the same levels attribute and that there are exactly 2 distinct levels specified.
# Example data
x1 <- c(rep("y", 34), rep("n", 12))
x2 <- rep("n", 46)
# Set levels explicitly
x1 <- factor(x1, levels = c("y", "n"))
x2 <- factor(x2, levels = c("y", "n"))
table(x1, x2)
# x2
# x1 y n
# y 0 34
# n 0 12
mcnemar.test(table(x1, x2))
#
# McNemar's Chi-squared test with continuity correction
#
# data: table(x1, x2)
# McNemar's chi-squared = 32.0294, df = 1, p-value = 1.519e-08
