R equivalent of Stata `tabulate , generate( )` command - r

I want to mimic the behavior of Stata's tabulate , generate() command in R. It is illustrated below; the command's functionality is twofold. First, in my example, it produces a one-way table of frequency counts. Second, it generates a dummy variable for each of the values contained in the variable (var1), using the prefix (stubname) declared in the option generate() to name the generated dummy variables (d_1 - d_7). My question is about the second functionality. Base R solutions are preferred, but package-dependent solutions are also welcome.
[Edit]: My final goal is to generate a data.frame() that emulates the last data set printed on the screen.
clear all
input var1
0
1
2
2
2
2
42
42
777
888
999999
end
tabulate var1 ,gen(d_)
/* var1 | Freq. Percent Cum.
------------+-----------------------------------
0 | 1 9.09 9.09
1 | 1 9.09 18.18
2 | 4 36.36 54.55
42 | 2 18.18 72.73
777 | 1 9.09 81.82
888 | 1 9.09 90.91
999999 | 1 9.09 100.00
------------+-----------------------------------
Total | 11 100.00 */
list, sep(11)
/* +--------------------------------------------------+
| var1 d_1 d_2 d_3 d_4 d_5 d_6 d_7 |
|--------------------------------------------------|
1. | 0 1 0 0 0 0 0 0 |
2. | 1 0 1 0 0 0 0 0 |
3. | 2 0 0 1 0 0 0 0 |
4. | 2 0 0 1 0 0 0 0 |
5. | 2 0 0 1 0 0 0 0 |
6. | 2 0 0 1 0 0 0 0 |
7. | 42 0 0 0 1 0 0 0 |
8. | 42 0 0 0 1 0 0 0 |
9. | 777 0 0 0 0 1 0 0 |
10. | 888 0 0 0 0 0 1 0 |
11. | 999999 0 0 0 0 0 0 1 |
+--------------------------------------------------+ */

set.seed(123)
df = data.frame(var1 = factor(sample(10, 20, TRUE)))
df = data.frame(df, model.matrix(~0+var1, df)) # 0 suppresses the intercept, so every factor level gets its own dummy column; without it, the first level would be the base group and be dropped.
names(df)[-1] = paste0('d_', 1:(ncol(df)-1))
df
var1 d_1 d_2 d_3 d_4 d_5 d_6 d_7 d_8 d_9
1 3 0 1 0 0 0 0 0 0 0
2 3 0 1 0 0 0 0 0 0 0
3 10 0 0 0 0 0 0 0 0 1
4 2 1 0 0 0 0 0 0 0 0
5 6 0 0 0 0 1 0 0 0 0
6 5 0 0 0 1 0 0 0 0 0
7 4 0 0 1 0 0 0 0 0 0
8 6 0 0 0 0 1 0 0 0 0
9 9 0 0 0 0 0 0 0 1 0
10 10 0 0 0 0 0 0 0 0 1
11 5 0 0 0 1 0 0 0 0 0
12 3 0 1 0 0 0 0 0 0 0
13 9 0 0 0 0 0 0 0 1 0
14 9 0 0 0 0 0 0 0 1 0
15 9 0 0 0 0 0 0 0 1 0
16 3 0 1 0 0 0 0 0 0 0
17 8 0 0 0 0 0 0 1 0 0
18 10 0 0 0 0 0 0 0 0 1
19 7 0 0 0 0 0 1 0 0 0
20 10 0 0 0 0 0 0 0 0 1
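The same idea can be applied to the data from the question. This is only a minimal sketch: it assumes var1 is a plain numeric vector with no missing values, and the names dummies and out are made up for illustration.

```r
# Data from the Stata example in the question
var1 <- c(0, 1, 2, 2, 2, 2, 42, 42, 777, 888, 999999)

# One-way frequency table (first part of tabulate)
table(var1)

# One indicator column per distinct value; ~ 0 + ... keeps every level
dummies <- model.matrix(~ 0 + factor(var1))
colnames(dummies) <- paste0("d_", seq_len(ncol(dummies)))

# Combine the original variable with its dummies
out <- data.frame(var1, dummies)
out
```

Columns are numbered in the sorted order of the distinct values, which matches how the Stata listing numbers d_1 to d_7.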

I guess you are assuming each value in var1 is unique, so that you get dummy variables rather than counts in the d_ fields.
You could try something like this:
var1 <- 1:5
dummy_matrix <- vapply(var1, function(x) as.numeric(var1 == x), numeric(length(var1))) # create a matrix of dummy vars, one column per value
colnames(dummy_matrix) <- paste0("d_", var1) # name the columns
cbind(var1, dummy_matrix) # bind to var1
Output:
var1 d_1 d_2 d_3 d_4 d_5
1 1 1 0 0 0 0
2 2 0 1 0 0 0
3 3 0 0 1 0 0
4 4 0 0 0 1 0
5 5 0 0 0 0 1
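If var1 contains repeated values, as in the question, the same vapply() idea can loop over the distinct values instead of over every element. A minimal sketch, assuming no missing values; lev and result are illustrative names, not part of the answer above.

```r
# Data from the question, including repeated values
var1 <- c(0, 1, 2, 2, 2, 2, 42, 42, 777, 888, 999999)

# Distinct values in ascending order; one dummy per distinct value
lev <- sort(unique(var1))

# Each column flags the rows whose value equals the corresponding level
dummy_matrix <- vapply(lev, function(x) as.numeric(var1 == x),
                       numeric(length(var1)))
colnames(dummy_matrix) <- paste0("d_", seq_along(lev))

result <- data.frame(var1, dummy_matrix)
result
```

Numbering the columns with seq_along(lev) rather than with the values themselves gives d_1 to d_7, as Stata does, instead of names such as d_999999.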

Related

Add a new column generated from predict() to a list of dataframes

I have a logistic regression model. I would like to predict the morphology of items in multiple dataframes that have been put into a list.
I have lots of dataframes (most say working with a list of dataframes is better).
I need help with:
1. Applying the predict function to a list of dataframes.
2. Adding these predictions to their corresponding dataframe inside the list.
I am not sure whether it is better to have the 1000 dataframes separately and predict using loops etc, or to continue having them inside a list.
Prior to this code I have split my data into train and test sets. I then trained the model using:
library(nnet)
#Training the multinomial model
multinom_model <- multinom(Morphology ~ ., data=morph, maxit=500)
#Checking the model
summary(multinom_model)
This was then followed by validation etc.
My new dataset, consisting of multiple dataframes stored in a list, called rose.list was formatted by the following:
filesrose <- list.files(pattern = "_rose.csv")
#Rename all files of rose dataset 'rose.i'
for (i in seq_along(filesrose)) {
assign(paste("rose", i, sep = "."), read.csv(filesrose[i]))
}
#Make a list of the dataframes
rose.list <- lapply(ls(pattern="rose."), function(x) get(x))
I have been using this call to predict on a single new dataframe:
# Predicting the classification for individual datasets
rose.1$Morph <- predict(multinom_model, newdata=rose.1, "class")
Which gives me the dataframe, with the new prediction column 'Morph'
But how would I do this for multiple dataframes in my rose.list? I have tried:
lapply(rose.list, predict(multinom_model, "class"))
Error in eval(predvars, data, env) : object 'Area' not found
and the following, which also fails with an error:
lapply(rose.list, predict(multinom_model, newdata = rose.list, "class"))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows:
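lapply() expects a function as its second argument, but both attempts pass it the result of a predict() call. Wrap the call in an anonymous function, written as function(x) or, since R 4.1, abbreviated as \(x).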
library(nnet)
multinom_model <- multinom(low ~ ., birthwt)
lapply(df_list, \(x) predict(multinom_model, newdata=x, type='class'))
# $rose_1
# [1] 1 0 1 1 0 0 0 1 0 1 1 1 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 1 0 1 0
# [40] 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 0 1
# [79] 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0
# [118] 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1
# [157] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 1
# Levels: 0 1
#
# $rose_2
# [1] 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 0 1 0 0 1 0 1 1 0 1
# [40] 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1
# [79] 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0
# [118] 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0
# [157] 0 0 0 1 1 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0
# Levels: 0 1
#
# $rose_3
# [1] 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 1
# [40] 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1
# [79] 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
# [118] 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
# [157] 0 1 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0
# Levels: 0 1
Update:
To add the predictions as a new column to each data frame in the list, modify the code like so:
res <- lapply(df_list, \(x) cbind(x, pred=predict(multinom_model, newdata=x, type="class")))
lapply(res, head)
# $rose_1
# low age lwt race smoke ptl ht ui ftv bwt pred
# 136 0 24 115 1 0 0 0 0 2 3090 0
# 154 0 26 133 3 1 2 0 0 0 3260 0
# 34 1 19 112 1 1 0 0 1 0 2084 1
# 166 0 16 112 2 0 0 0 0 0 3374 0
# 27 1 20 150 1 1 0 0 0 2 1928 1
# 218 0 26 160 3 0 0 0 0 0 4054 0
#
# $rose_2
# low age lwt race smoke ptl ht ui ftv bwt pred
# 167 0 16 135 1 1 0 0 0 0 3374 0
# 26 1 25 92 1 1 0 0 0 0 1928 1
# 149 0 23 119 3 0 0 0 0 2 3232 0
# 98 0 22 95 3 0 0 1 0 0 2751 0
# 222 0 31 120 1 0 0 0 0 2 4167 0
# 220 0 22 129 1 0 0 0 0 0 4111 0
#
# $rose_3
# low age lwt race smoke ptl ht ui ftv bwt pred
# 183 0 36 175 1 0 0 0 0 0 3600 0
# 86 0 33 155 3 0 0 0 0 3 2551 0
# 51 1 20 121 1 1 1 0 1 0 2296 1
# 17 1 23 97 3 0 0 0 1 1 1588 1
# 78 1 14 101 3 1 1 0 0 0 2466 1
# 167 0 16 135 1 1 0 0 0 0 3374 0
Data:
data('birthwt', package='MASS')
set.seed(42)
df_list <- replicate(3, birthwt[sample(nrow(birthwt), replace=TRUE), ], simplify=FALSE) |>
setNames(paste0('rose_', 1:3))
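As a side note, the assign()/get() step in the question can probably be skipped by reading the files straight into a named list. The sketch below is self-contained, so it first writes three small made-up CSV files to a temporary directory; the file names, the Perimeter column and the values are assumptions for illustration only.

```r
# Hypothetical setup: three tiny CSV files in a temporary directory
tmp <- tempfile("rose_demo_")
dir.create(tmp)
for (i in 1:3) {
  demo <- data.frame(Area = runif(4), Perimeter = runif(4))
  write.csv(demo, file.path(tmp, paste0("sample", i, "_rose.csv")),
            row.names = FALSE)
}

# Read every matching file directly into a list, no assign()/get() needed
filesrose <- list.files(tmp, pattern = "_rose\\.csv$", full.names = TRUE)
rose.list <- lapply(filesrose, read.csv)
names(rose.list) <- paste0("rose.", seq_along(rose.list))

str(rose.list, max.level = 1)
```

This also avoids ls(pattern = "rose."), where the dot is a regex wildcard and the pattern would match rose.list itself if the code were run a second time.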

Updating running list if event happens

Sample data:
clear
* Input data
input student CITATION EXPELLED hadCITATION hadEXPELLED
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
2 0 0 0 0
2 0 0 0 0
2 1 0 1 0
2 1 0 1 0
2 0 0 1 0
3 1 0 1 0
3 0 1 1 1
3 1 1 1 1
3 1 0 1 1
3 1 0 1 1
4 . . . .
4 . 0 . 0
4 0 0 0 0
4 0 1 0 1
4 1 0 1 0
end
I want to create the hadCITATION and hadEXPELLED columns, which should update based on the values of CITATION and EXPELLED.
This may help. I can't see that this makes sense without a time or sequence variable. My guess is that once you've had a CITATION or EXPULSION, then that's your history. The rules may be more complicated, but I can't see that you're explaining them. I can't see the rationale for your example for student 4.
clear
input student CITATION EXPELLED hadCITATION hadEXPELLED
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
2 0 0 0 0
2 0 0 0 0
2 1 0 1 0
2 1 0 1 0
2 0 0 1 0
3 1 0 1 0
3 0 1 1 1
3 1 1 1 1
3 1 0 1 1
3 1 0 1 1
4 . . . .
4 . 0 . 0
4 0 0 0 0
4 0 1 0 1
4 1 0 1 0
end
gen long time = _n
bysort student (time) : gen want1 = sum(CITATION) > 0
by student: gen want2 = sum(EXPELLED) > 0
list student CIT EXP hadCIT hadEXP want?, sepby(student)
+---------------------------------------------------------------------+
| student CITATION EXPELLED hadCIT~N hadEXP~D want1 want2 |
|---------------------------------------------------------------------|
1. | 1 0 0 0 0 0 0 |
2. | 1 0 0 0 0 0 0 |
3. | 1 0 0 0 0 0 0 |
4. | 1 0 0 0 0 0 0 |
5. | 1 0 0 0 0 0 0 |
|---------------------------------------------------------------------|
6. | 2 0 0 0 0 0 0 |
7. | 2 0 0 0 0 0 0 |
8. | 2 1 0 1 0 1 0 |
9. | 2 1 0 1 0 1 0 |
10. | 2 0 0 1 0 1 0 |
|---------------------------------------------------------------------|
11. | 3 1 0 1 0 1 0 |
12. | 3 0 1 1 1 1 1 |
13. | 3 1 1 1 1 1 1 |
14. | 3 1 0 1 1 1 1 |
15. | 3 1 0 1 1 1 1 |
|---------------------------------------------------------------------|
16. | 4 . . . . 0 0 |
17. | 4 . 0 . 0 0 0 |
18. | 4 0 0 0 0 0 0 |
19. | 4 0 1 0 1 0 1 |
20. | 4 1 0 1 0 1 1 |
+---------------------------------------------------------------------+
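For readers working in R rather than Stata, the same "has it happened so far" logic can be sketched with ave() and cumsum(). This assumes the rows are already in time order within each student and, like Stata's sum(), treats missing values as zero; the helper name ever is made up for this example.

```r
# Same data as the Stata example (NA stands for Stata's missing ".")
d <- data.frame(
  student  = rep(1:4, each = 5),
  CITATION = c(0,0,0,0,0,  0,0,1,1,0,  1,0,1,1,1,  NA,NA,0,0,1),
  EXPELLED = c(0,0,0,0,0,  0,0,0,0,0,  0,1,1,0,0,  NA,0,0,1,0)
)

# Running sum within each group, with NA counted as 0; flag once it exceeds 0
ever <- function(x, g) {
  as.integer(ave(ifelse(is.na(x), 0, x), g, FUN = cumsum) > 0)
}

d$want1 <- ever(d$CITATION, d$student)
d$want2 <- ever(d$EXPELLED, d$student)
d
```

The resulting want1 and want2 columns should agree with the Stata listing above, including the rows for student 4 where the asker's hadEXPELLED differs.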

Using conditionals to count matching rows from two dataframes r

I have two dataframes. I want to count the number of times both have a result_gain of 1 on the same chr with the same start and stop positions, within +/- probes.
Here probes = 1000.
So if dataframe1 has a result_gain of 1 on some chr (say chr 5) and dataframe2 has a result_gain of 1 on the same chr, with start and stop each within +/- probes of those in dataframe1, then I wish to count this. The method I have attached is not working. Do you have any suggestions?
dataframe1
chr start stop result_gain result_loss result_cnloh
2 0 90247720 0 0 0
2 95627407 243199373 0 0 0
7 0 57789531 1 0 0
7 61760895 159138663 1 0 0
8 0 6974050 0 0 0
8 8102641 43646413 1 0 0
8 47060977 146364022 0 0 0
9 0 38771460 0 0 0
9 71034203 141213431 0 0 0
10 0 38685231 0 0 0
10 42810783 135534747 0 0 0
11 0 51530241 0 0 0
11 54835623 135006516 0 0 0
12 0 34768168 0 0 0
12 38416139 133851895 0 0 0
13 19263735 115169878 0 0 0
14 20213937 107349540 0 0 0
15 20161372 102531392 1 0 0
17 0 22175355 0 0 0
17 25375921 81195210 0 0 0
dataframe2
chr start stop result_gain result_loss result_cnloh
2 0 90247720 1 0 0
2 95627407 243199373 0 0 0
7 0 57789531 1 0 0
7 61760895 159138663 1 0 0
8 0 6974050 0 0 0
8 8101641 43646413 1 0 0
8 47060977 146364022 0 0 0
9 0 38771460 0 0 0
9 71034203 141213431 0 0 0
10 0 38685231 0 0 0
10 42810783 135534747 0 0 0
11 0 51530241 0 1 0
11 54835623 135006516 0 0 0
12 0 34768168 0 0 0
12 38416139 133851895 0 0 0
13 19263735 115169878 0 0 0
14 20213937 107349540 0 0 0
15 20161372 102531392 1 0 0
17 0 22175355 0 0 0
17 25375921 81195210 0 0 0
Here are the matching rows from both dataframes:
matching rows from dataframe1
7 0 57789531 1 0 0
7 61760895 159138663 1 0 0
8 8102641 43646413 1 0 0
15 20161372 102531392 1 0 0
matching rows from dataframe2
7 0 57789531 1 0 0
7 61760895 159138663 1 0 0
8 8101641 43646413 1 0 0
15 20161372 102531392 1 0 0
output
score = 4
After subsetting each chr for dataframe1 and dataframe2 I am using the following conditionals to count, but it does not work.
for (e in 1:nrow(gains_idcc)) {
for (h in 1:nrow(gains_dciss)) {
if (gains_dciss$start[e] <= (idcc$start[h] + probes_per_bp) | dciss$start[e] <= (idcc$start[h] - probes_per_bp) | dciss$start[e] == idcc$start[h] && dciss$stop[e] >= (idcc$stop[h] + probes_per_bp) | dciss$stop[e] >= (idcc$stop[h] - probes_per_bp) | dciss$stop[e] <= idcc$stop[h]) {
score_gain = score_gain + nrow(gains_dciss)
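One possible way to get the count without nested loops is to keep only the gain rows, pair them up by chr with merge(), and then test the start and stop distances. This is a minimal sketch, not the asker's method: it uses a shortened subset of the rows shown above, the names df1, df2, g1, g2, pairs and hit are invented here, and "within +/- probes" is read as an absolute difference of at most probes.

```r
probes <- 1000

# Subset of the rows from dataframe1 and dataframe2 in the question
df1 <- data.frame(
  chr   = c(2, 7, 7, 8, 8, 15),
  start = c(0, 0, 61760895, 0, 8102641, 20161372),
  stop  = c(90247720, 57789531, 159138663, 6974050, 43646413, 102531392),
  result_gain = c(0, 1, 1, 0, 1, 1)
)
df2 <- data.frame(
  chr   = c(2, 7, 7, 8, 8, 15),
  start = c(0, 0, 61760895, 0, 8101641, 20161372),
  stop  = c(90247720, 57789531, 159138663, 6974050, 43646413, 102531392),
  result_gain = c(1, 1, 1, 0, 1, 1)
)

# Keep only rows with a gain in each dataframe
g1 <- df1[df1$result_gain == 1, ]
g2 <- df2[df2$result_gain == 1, ]

# All pairs of gain rows on the same chromosome
pairs <- merge(g1, g2, by = "chr", suffixes = c(".1", ".2"))

# A pair matches when both start and stop agree within +/- probes
hit <- abs(pairs$start.1 - pairs$start.2) <= probes &
       abs(pairs$stop.1  - pairs$stop.2)  <= probes

score_gain <- sum(hit)
score_gain
```

Note that this counts matching pairs, so a segment in one dataframe that lies within the tolerance of two segments in the other would be counted twice.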

recursively write out model matrix in R

In the analysis I am running there are many predictor variables for which I would like to build a model matrix. However, the model matrix requires a formula in a format such as
t<-model.matrix(f[,1]~f[,2]+f[,3]+....)
If my data frame is called f, is there a quick way, with paste or something, to write out this formula programmatically? Otherwise I would need to type everything.
Why not use:
f <- data.frame(z = 1:10, b= 1:10, d=factor(1:10))
model.matrix(~. , data=f[-1])
#-------------
(Intercept) b d2 d3 d4 d5 d6 d7 d8 d9 d10
1 1 1 0 0 0 0 0 0 0 0 0
2 1 2 1 0 0 0 0 0 0 0 0
3 1 3 0 1 0 0 0 0 0 0 0
4 1 4 0 0 1 0 0 0 0 0 0
5 1 5 0 0 0 1 0 0 0 0 0
6 1 6 0 0 0 0 1 0 0 0 0
7 1 7 0 0 0 0 0 1 0 0 0
8 1 8 0 0 0 0 0 0 1 0 0
9 1 9 0 0 0 0 0 0 0 1 0
10 1 10 0 0 0 0 0 0 0 0 1
attr(,"assign")
[1] 0 1 2 2 2 2 2 2 2 2 2
attr(,"contrasts")
attr(,"contrasts")$d
[1] "contr.treatment"
Compare to what you get with:
> model.matrix(z~., f)
(Intercept) b d2 d3 d4 d5 d6 d7 d8 d9 d10
1 1 1 0 0 0 0 0 0 0 0 0
2 1 2 1 0 0 0 0 0 0 0 0
3 1 3 0 1 0 0 0 0 0 0 0
4 1 4 0 0 1 0 0 0 0 0 0
5 1 5 0 0 0 1 0 0 0 0 0
6 1 6 0 0 0 0 1 0 0 0 0
7 1 7 0 0 0 0 0 1 0 0 0
8 1 8 0 0 0 0 0 0 1 0 0
9 1 9 0 0 0 0 0 0 0 1 0
10 1 10 0 0 0 0 0 0 0 0 1
attr(,"assign")
[1] 0 1 2 2 2 2 2 2 2 2 2
attr(,"contrasts")
attr(,"contrasts")$d
[1] "contr.treatment"
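If an explicit formula object is still wanted, for instance to pass it on to a modelling function, it can be built from the column names. A minimal sketch using the same toy data frame f, assuming the first column is the response; fml, fml2 and mm are illustrative names.

```r
f <- data.frame(z = 1:10, b = 1:10, d = factor(1:10))

# reformulate() builds "response ~ term1 + term2 + ..." from character vectors
fml <- reformulate(names(f)[-1], response = names(f)[1])
fml

# The paste() route the question asks about gives the same formula
fml2 <- as.formula(paste(names(f)[1], "~",
                         paste(names(f)[-1], collapse = " + ")))

mm <- model.matrix(fml, data = f)
dim(mm)
```

Either formula should produce the same matrix as model.matrix(z ~ ., f) above; the dot shorthand is simply shorter when every remaining column is a predictor.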

Copy a row but with some modifications

I have a large data set like this:
SUB SMOKE AMT MDV ADDL II EVID
1 0 0 0 0 0 0
1 0 20 0 16 24 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
2 1 0 0 0 0 0
2 1 50 0 24 12 1
2 1 0 0 0 0 0
2 1 0 0 0 0 0
...
I want to copy each row where EVID=1 and insert the copy below it. In the copied row, AMT, ADDL, II and EVID should all equal 0, while SUB, SMOKE and MDV stay the same. The expected output should look like this:
SUB SMOKE AMT MDV ADDL II EVID
1 0 0 0 0 0 0
1 0 20 0 16 24 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
1 0 0 0 0 0 0
2 1 0 0 0 0 0
2 1 50 0 24 12 1
2 1 0 0 0 0 0
2 1 0 0 0 0 0
2 1 0 0 0 0 0
...
Does anyone have an idea how to do this?
# repeat EVID=0 rows 1 time and EVID=1 rows 2 times
r <- rep(1:nrow(DF), DF$EVID + 1)
DF2 <- DF[r, ]
# insert zeros
DF2[duplicated(r), c("AMT", "ADDL", "II", "EVID")] <- 0
giving:
> DF2
SUB SMOKE AMT MDV ADDL II EVID
1 1 0 0 0 0 0 0
2 1 0 20 0 16 24 1
2.1 1 0 0 0 0 0 0
3 1 0 0 0 0 0 0
4 1 0 0 0 0 0 0
5 2 1 0 0 0 0 0
6 2 1 50 0 24 12 1
6.1 2 1 0 0 0 0 0
7 2 1 0 0 0 0 0
8 2 1 0 0 0 0 0
Maybe this:
> t2 <- t[t$EVID==1,] # t is your data.frame
> t2[c("AMT","ADDL","II","EVID")] <- 0
> t2
SUB SMOKE AMT MDV ADDL II EVID
2 1 0 0 0 0 0 0
6 2 1 0 0 0 0 0
> rbind(t,t2)
SUB SMOKE AMT MDV ADDL II EVID
1 1 0 0 0 0 0 0
2 1 0 20 0 16 24 1
3 1 0 0 0 0 0 0
4 1 0 0 0 0 0 0
5 2 1 0 0 0 0 0
6 2 1 50 0 24 12 1
7 2 1 0 0 0 0 0
8 2 1 0 0 0 0 0
21 1 0 0 0 0 0 0 # this row
61 2 1 0 0 0 0 0 # and this one are new
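With rbind() the copies end up at the bottom of the data frame. If they should sit directly below the rows they were copied from, one option is to reorder by the original row position. A minimal sketch, assuming the data frame is called DF as in the first answer; pos and out are illustrative names.

```r
# First eight rows of the data from the question
DF <- data.frame(
  SUB   = rep(1:2, each = 4),
  SMOKE = rep(0:1, each = 4),
  AMT   = c(0, 20, 0, 0, 0, 50, 0, 0),
  MDV   = 0,
  ADDL  = c(0, 16, 0, 0, 0, 24, 0, 0),
  II    = c(0, 24, 0, 0, 0, 12, 0, 0),
  EVID  = c(0, 1, 0, 0, 0, 1, 0, 0)
)

# Copy the EVID == 1 rows and zero out the dosing columns
t2 <- DF[DF$EVID == 1, ]
t2[c("AMT", "ADDL", "II", "EVID")] <- 0

# Append the copies, then sort by the position of the row each one came from
out <- rbind(DF, t2)
pos <- c(seq_len(nrow(DF)), which(DF$EVID == 1))
out <- out[order(pos), ]   # ties keep their order, so each copy follows its original
out
```

Because order() leaves ties in their original order, each zeroed copy lands immediately after its source row, which is the same layout the rep()-based answer produces.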
