recursively write out model matrix in R - r

in the analysis I am running there are many predictor variables fro which I would like to build a model matrix. However, the model matrix requires a formula in a format such as
t<-model.matrix(f[,1]~f[,2]+f[,3]+....)
if my data frame is called f is there a quick way with paste or somethign just to write out this formula recusively? Otherwise Iw oudl need to type everything

Why not use:
f <- data.frame(z = 1:10, b= 1:10, d=factor(1:10))
model.matrix(~. , data=f[-1])
#-------------
(Intercept) b d2 d3 d4 d5 d6 d7 d8 d9 d10
1 1 1 0 0 0 0 0 0 0 0 0
2 1 2 1 0 0 0 0 0 0 0 0
3 1 3 0 1 0 0 0 0 0 0 0
4 1 4 0 0 1 0 0 0 0 0 0
5 1 5 0 0 0 1 0 0 0 0 0
6 1 6 0 0 0 0 1 0 0 0 0
7 1 7 0 0 0 0 0 1 0 0 0
8 1 8 0 0 0 0 0 0 1 0 0
9 1 9 0 0 0 0 0 0 0 1 0
10 1 10 0 0 0 0 0 0 0 0 1
attr(,"assign")
[1] 0 1 2 2 2 2 2 2 2 2 2
attr(,"contrasts")
attr(,"contrasts")$d
[1] "contr.treatment"
Compare to what you get with:
> model.matrix(z~., f)
(Intercept) b d2 d3 d4 d5 d6 d7 d8 d9 d10
1 1 1 0 0 0 0 0 0 0 0 0
2 1 2 1 0 0 0 0 0 0 0 0
3 1 3 0 1 0 0 0 0 0 0 0
4 1 4 0 0 1 0 0 0 0 0 0
5 1 5 0 0 0 1 0 0 0 0 0
6 1 6 0 0 0 0 1 0 0 0 0
7 1 7 0 0 0 0 0 1 0 0 0
8 1 8 0 0 0 0 0 0 1 0 0
9 1 9 0 0 0 0 0 0 0 1 0
10 1 10 0 0 0 0 0 0 0 0 1
attr(,"assign")
[1] 0 1 2 2 2 2 2 2 2 2 2
attr(,"contrasts")
attr(,"contrasts")$d
[1] "contr.treatment"

Related

Add a new column generated from predict() to a list of dataframes

I have a logistic regression model. I would like to predict the morphology of items in multiple dataframes that have been put into a list.
I have lots of dataframes (most say working with a list of dataframes is better).
I need help with 1:
Applying the predict function to a list of dataframes.
Adding these predictions to their corresponding dataframe inside the list.
I am not sure whether it is better to have the 1000 dataframes separately and predict using loops etc, or to continue having them inside a list.
Prior to this code I have split my data into train and test sets. I then trained the model using:
library(nnet)
#Training the multinomial model
multinom_model <- multinom(Morphology ~ ., data=morph, maxit=500)
#Checking the model
summary(multinom_model)
This was then followed by validation etc.
My new dataset, consisting of multiple dataframes stored in a list, called rose.list was formatted by the following:
filesrose <- list.files(pattern = "_rose.csv")
#Rename all files of rose dataset 'rose.i'
for (i in seq_along(filesrose)) {
assign(paste("rose", i, sep = "."), read.csv(filesrose[i]))
}
#Make a list of the dataframes
rose.list <- lapply(ls(pattern="rose."), function(x) get(x))
I have been using this function to predict on a singular new dataframe
# Predicting the classification for individual datasets
rose.1$Morph <- predict(multinom_model, newdata=rose.1, "class")
Which gives me the dataframe, with the new prediction column 'Morph'
But how would I do this for multiple dataframes in my rose.list? I have tried:
lapply(rose.list, predict(multinom_model, "class"))
Error in eval(predvars, data, env) : object 'Area' not found
and, but also has the error:
lapply(rose.list, predict(multinom_model, newdata = rose.list, "class"))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows:
You can use an anonymous function (those with function(x) or abbreviated \(x)).
library(nnet)
multinom_model <- multinom(low ~ ., birthwt)
lapply(df_list, \(x) predict(multinom_model, newdata=x, type='class'))
# $rose_1
# [1] 1 0 1 1 0 0 0 1 0 1 1 1 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 1 0 1 0
# [40] 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 0 1
# [79] 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0
# [118] 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1
# [157] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 1
# Levels: 0 1
#
# $rose_2
# [1] 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 0 1 0 0 1 0 1 1 0 1
# [40] 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1
# [79] 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0
# [118] 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0
# [157] 0 0 0 1 1 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0
# Levels: 0 1
#
# $rose_3
# [1] 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 1
# [40] 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1
# [79] 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
# [118] 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
# [157] 0 1 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0
# Levels: 0 1
update
To add the predictions as new column to each data frame in the list, modify the code like so:
res <- lapply(df_list, \(x) cbind(x, pred=predict(multinom_model, newdata=x, type="class")))
lapply(res, head)
# $rose_1
# low age lwt race smoke ptl ht ui ftv bwt pred
# 136 0 24 115 1 0 0 0 0 2 3090 0
# 154 0 26 133 3 1 2 0 0 0 3260 0
# 34 1 19 112 1 1 0 0 1 0 2084 1
# 166 0 16 112 2 0 0 0 0 0 3374 0
# 27 1 20 150 1 1 0 0 0 2 1928 1
# 218 0 26 160 3 0 0 0 0 0 4054 0
#
# $rose_2
# low age lwt race smoke ptl ht ui ftv bwt pred
# 167 0 16 135 1 1 0 0 0 0 3374 0
# 26 1 25 92 1 1 0 0 0 0 1928 1
# 149 0 23 119 3 0 0 0 0 2 3232 0
# 98 0 22 95 3 0 0 1 0 0 2751 0
# 222 0 31 120 1 0 0 0 0 2 4167 0
# 220 0 22 129 1 0 0 0 0 0 4111 0
#
# $rose_3
# low age lwt race smoke ptl ht ui ftv bwt pred
# 183 0 36 175 1 0 0 0 0 0 3600 0
# 86 0 33 155 3 0 0 0 0 3 2551 0
# 51 1 20 121 1 1 1 0 1 0 2296 1
# 17 1 23 97 3 0 0 0 1 1 1588 1
# 78 1 14 101 3 1 1 0 0 0 2466 1
# 167 0 16 135 1 1 0 0 0 0 3374 0
Data:
data('birthwt', package='MASS')
set.seed(42)
df_list <- replicate(3, birthwt[sample(nrow(birthwt), replace=TRUE), ], simplify=FALSE) |>
setNames(paste0('rose_', 1:3))

How to keep ID in dummyVars()

I would like to do transform Gender and Country using One-Hot-Encoding.
With the code below I can not create the new dataset including the ID
library(caret)
ID<-1:10
Gender<-c("F","F","F","M","M","F","M","M","F","M")
Country<-c("Mali","France","France","Guinea","Senegal",
"Mali","France","Mali","Senegal","France")
data<-data.frame(ID,Gender,Country)
#One hot encoding
dmy <- dummyVars(" ~Gender+Country", data = data, fullRank = T)
dat_transformed <- data.frame(predict(dmy, newdata = data))
dat_transformed
Gender.M Country.Guinea Country.Mali Country.Senegal
1 0 0 1 0
2 0 0 0 0
3 0 0 0 0
4 1 1 0 0
5 1 0 0 1
6 0 0 1 0
7 1 0 0 0
8 1 0 1 0
9 0 0 0 1
10 1 0 0 0
I want to get a dataset that include the ID without enconding it.
ID Gender.M Country.Guinea Country.Mali Country.Senegal
1 1 0 0 1 0
2 2 0 0 0 0
3 3 0 0 0 0
4 4 1 1 0 0
5 5 1 0 0 1
6 6 0 0 1 0
7 7 1 0 0 0
8 8 1 0 1 0
9 9 0 0 0 1
10 10 1 0 0 0
dat_transformed <- cbind(ID,dat_transformed)
dat_transformed
ID Gender.M Country.Guinea Country.Mali Country.Senegal
1 0 0 1 0
2 0 0 0 0
3 0 0 0 0
4 1 1 0 0
5 1 0 0 1
6 0 0 1 0
7 1 0 0 0
8 1 0 1 0
9 0 0 0 1
10 1 0 0 0

why cannot create rarecurve in vegan?

I am trying to create rarefaction curves for my data but my raremax and Srare are equal to zero and rarecurve does not create any curve. I have 36 obs and 52 variables, I am trying to assess any potential underestimation of species richness due to low sample size. I think that since the raremax is zero it will not create any curve. Thanks
This is my code:
library(vegan)
Moo<-read.csv("data.csv", strip.white=T)
#total number of species at each site (row of data)
S <- specnumber(Moo)
S
[1] 6 9 3 6 12 3 10 5 3 6 6 8 4 12 3 2 5 0 3 4 4 5 3 10 7 1 3 11 11
[30] 4 4 3 5 4 2 5
# Number of INDIVIDUALS per site
raremax <- min(rowSums(Moo))
raremax
[1] 0
Srare <- rarefy(Moo, raremax)
Srare
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0
attr(,"Subsample")
[1] 0
#Plot rarefaction results
par(mfrow = c(1,2))
plot(S, Srare, xlab = "Observed No. of Species",
ylab = "Rarefied No. of Species",
main = " plot(rarefy(Moo, raremax))")
abline(0, 1)
rarecurve(Moo, step = 20,
sample = raremax,
col = "blue",
cex = 0.6,
main = "rarecurve()")
'> head(Moo)
Trapezia.aereolata Trapezia.globosa Trapezia.formosa Trapezia.lutea Trapezia.punctimanus
1 0 0 0 5 0
2 0 0 0 5 0
3 0 0 0 4 0
4 0 0 0 8 0
5 0 0 0 6 0
6 0 0 0 2 0
Trapezia.septata Trapezia.serenei Trapezia.tigrina Alpheus.lottini Alpheus.sp..White
1 0 0 0 4 0
2 0 0 0 4 0
3 0 0 0 0 0
4 0 0 0 2 0
5 0 0 0 4 0
6 0 0 0 1 0
Acanthanas.sp. Ophiuroids Ophiocoma.erinaceous Ophiocoma.sp Ophiactis.sp.
1 0 3 0 0 0
2 0 1 0 0 0
3 0 0 0 0 0
4 0 1 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
Echinometra.sp. Chlamys.sp. Conus.sp. Psaumis.cavipes Tiarina.sp. Calcinus.sp.
1 0 0 0 1 1 0
2 0 0 0 0 0 2
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
Harpiliopsis.sp. Jocaste.sp. Pilodius.pugil Pilodius.sp. Coralliocaris Chrysopetalum.Sp.
1 0 0 0 0 0 0
2 0 0 0 1 2 2
3 0 0 0 1 0 0
4 0 0 0 0 2 0
5 0 0 0 0 2 1
6 0 0 0 0 0 0
Menathius.sp. Chlorodiella.nigra Chlorodiella.sp. Synalpheus.sp. Labrid..wrasse.
1 0 0 0 0 0
2 1 1 0 0 0
3 0 0 0 2 0
4 0 0 0 0 1
5 0 0 0 0 0
6 0 0 0 0 0
Liomera.cinctimana Domecia.hispida Domecia Eviota.Sp. Paguris.sp. Palaemonidae Majiidae
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0
5 0 0 1 1 1 1 0
6 0 0 0 0 0 0 0
Cerithium.litteratum Cerithium.sp. Granulina..margaritula. Coralliophila.monodonta
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 1 3 0 0
6 0 0 0 0
Muricidae Tanaeid Unknown.shell. Amphipoda Polychaeta Strombus..marginatus.. Buccinidae
1 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 1 1 0 0 0 0
6 0 0 0 1 0 0 0
gastropod..glasslike.shell Drupella
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0'

Conditional Sequence repeat In R

I have a matrix made of a very long series of integers bounded between 1 and 6. I would like to create an output matrix of the same length than the original matrix and 6 columns (the maximum value in the original matrix) where 1 is repeated n time the value of the first encountered integer in the column of the integer value. (i.e if the first value is 6 it would repeat 1 x 6 times in the 6th column of the output matrix and then the value in in row 7 of the original matrix would be used for the next repeat sequence. I have shown an example in the below. Is there an efficient way to do this in R?
Original Matrix output Matrix
c1 c1 c2 c3 c4 c5 c6
R1 1 R1 1 0 0 0 0 0
R2 1 R2 1 0 0 0 0 0
R3 3 R3 0 0 1 0 0 0
R4 2 R4 0 0 1 0 0 0
R5 6 R5 0 0 1 0 0 0
R6 1 R6 1 0 0 0 0 0
R7 1 R7 1 0 0 0 0 0
R8 1 R8 1 0 0 0 0 0
R9 1 R9 1 0 0 0 0 0
R10 4 R10 0 0 0 1 0 0
R11 4 R11 0 0 0 1 0 0
R12 2 R12 0 0 0 1 0 0
R13 1 R13 0 0 0 1 0 0
R14 3 R14 0 0 1 0 0 0
R15 1 R15 0 0 1 0 0 0
A further example of the input and output matrix to make my above example clearer.
Input matrix Output matrix
c1 1 2 3 4 5 6
1 2 1 0 1 0 0 0 0
2 2 2 0 1 0 0 0 0
3 1 3 1 0 0 0 0 0
4 6 4 0 0 0 0 0 1
5 3 5 0 0 0 0 0 1
6 4 6 0 0 0 0 0 1
7 5 7 0 0 0 0 0 1
8 4 8 0 0 0 0 0 1
9 5 9 0 0 0 0 0 1
10 4 10 0 0 0 1 0 0
11 3 11 0 0 0 1 0 0
12 3 12 0 0 0 1 0 0
13 2 13 0 0 0 1 0 0
14 3 14 0 0 1 0 0 0
15 4 15 0 0 1 0 0 0
16 5 16 0 0 1 0 0 0
17 5 17 0 0 0 0 1 0
18 5 18 0 0 0 0 1 0
This is a simplistic solution but it works:
input_data <- c(1, 1, 3, 2, 6, 1, 1, 1, 1, 4, 4, 2, 1, 3, 1)
result <- matrix(0, nrow = length(input_data), ncol = 6)
counter <- 0
for (i in 1:length(input_data)){
if (counter == 0){
counter <- set_value <- input_data[i]
}
result[i, set_value] <- 1
counter <- counter - 1
}
> cbind(input_data, result)
[1,] 1 1 0 0 0 0 0
[2,] 1 1 0 0 0 0 0
[3,] 3 0 0 1 0 0 0
[4,] 2 0 0 1 0 0 0
[5,] 6 0 0 1 0 0 0
[6,] 1 1 0 0 0 0 0
[7,] 1 1 0 0 0 0 0
[8,] 1 1 0 0 0 0 0
[9,] 1 1 0 0 0 0 0
[10,] 4 0 0 0 1 0 0
[11,] 4 0 0 0 1 0 0
[12,] 2 0 0 0 1 0 0
[13,] 1 0 0 0 1 0 0
[14,] 3 0 0 1 0 0 0
[15,] 1 0 0 1 0 0 0

Copy a row but with some modifications

I have a large data set like this:
SUB SMOKE AMT MDV ADDL II EVID
1 0 0 0 0 0 0
1 0 20 0 16 24 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
2 1 0 0 0 0 0
2 1 50 0 24 12 1
2 1 0 0 0 0 0
2 1 0 0 0 0 0
...
I want to copy the row where EVID=1 and insert it below, but for the copied row, AMT,ADDL,II and EVID should all equal to 0, SMOKE and MDV remain the same. The expected output should look like this:
SUB SMOKE AMT MDV ADDL II EVID
1 0 0 0 0 0 0
1 0 20 0 16 24 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
1 0 0 0 0 0 0
2 1 0 0 0 0 0
2 1 50 0 24 12 1
2 1 0 0 0 0 0
2 1 0 0 0 0 0
2 1 0 0 0 0 0
...
Does anyone have idea about realizing this?
# repeat EVID=0 rows 1 time and EVID=1 rows 2 times
r <- rep(1:nrow(DF), DF$EVID + 1)
DF2 <- DF[r, ]
# insert zeros
DF2[duplicated(r), c("AMT", "ADDL", "II", "EVID")] <- 0
giving:
> DF2
SUB SMOKE AMT MDV ADDL II EVID
1 1 0 0 0 0 0 0
2 1 0 20 0 16 24 1
2.1 1 0 0 0 0 0 0
3 1 0 0 0 0 0 0
4 1 0 0 0 0 0 0
5 2 1 0 0 0 0 0
6 2 1 50 0 24 12 1
6.1 2 1 0 0 0 0 0
7 2 1 0 0 0 0 0
8 2 1 0 0 0 0 0
Maybe this:
> t2 <- t[t$EVID==1,] # t is your data.frame
> t2[c("AMT","ADDL","II","EVID")] <- 0
> t2
SUB SMOKE AMT MDV ADDL II EVID
2 1 0 0 0 0 0 0
6 2 1 0 0 0 0 0
> rbind(t,t2)
SUB SMOKE AMT MDV ADDL II EVID
1 1 0 0 0 0 0 0
2 1 0 20 0 16 24 1
3 1 0 0 0 0 0 0
4 1 0 0 0 0 0 0
5 2 1 0 0 0 0 0
6 2 1 50 0 24 12 1
7 2 1 0 0 0 0 0
8 2 1 0 0 0 0 0
21 1 0 0 0 0 0 0 # this row
61 2 1 0 0 0 0 0 # and this one are new

Resources