Data preparation before running exact logistic (elrm in R) - r

I started out using Firth's logistic regression (logistf) to deal with my small sample size (n=80), but wanted to try exact logistic regression using the elrm package. However, I'm having trouble figuring out how to create the "collapsed" data that elrm requires. I import a csv into R as a dataframe with the following variables/columns. Here is some example data (the real data has a few more columns and 80 rows):
+------------+-----------+-----+--------+----------------+
| patien_num | asymmetry | age | female | field_strength |
+------------+-----------+-----+--------+----------------+
|          1 |         1 |  25 |      1 |            1.5 |
|          2 |         0 |  50 |      0 |              3 |
|          3 |         0 |  75 |      1 |            1.5 |
|          4 |         0 |  33 |      1 |              3 |
|          5 |         0 |  66 |      1 |              3 |
|          6 |         0 |  99 |      0 |              3 |
|          7 |         1 |  20 |      0 |            1.5 |
|          8 |         1 |  40 |      1 |              3 |
|          9 |         0 |  60 |      1 |              3 |
|         10 |         0 |  80 |      0 |            1.5 |
+------------+-----------+-----+--------+----------------+
Basically my data is one line per patient (not a frequency table). I'm trying to run a regression with asymmetry as the dependent variable and age (continuous), female (binary), and field_strength (factor) as independent variables. I'm trying to understand how to collapse this into the format elrm expects, so I can supply the "ntrials" part required by its formula.
I've looked at https://stats.idre.ucla.edu/r/dae/exact-logistic-regression/, but they start with data in a different format than mine, and I'm having trouble adapting it. Any help appreciated!
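In case it helps: one way to collapse one-row-per-patient data is to group by the covariate pattern and sum the successes, with a trials counter of 1 per row. A minimal sketch, assuming the data frame is called dat and the column names from the table above (with a continuous covariate like age, most groups will simply end up with ntrials = 1, which elrm accepts):

```r
library(elrm)

dat <- read.csv("patients.csv")  # one row per patient, as in the table above
dat$ntrials <- 1                 # each patient contributes one trial

# Collapse: one row per unique covariate pattern; summing gives the
# number of successes (asymmetry) and the number of trials per pattern
collapsed <- aggregate(dat[c("asymmetry", "ntrials")],
                       by = dat[c("age", "female", "field_strength")],
                       FUN = sum)

# elrm uses a successes/trials response; "interest" names the term(s)
# whose exact inference you want (female here is just an illustration)
fit <- elrm(asymmetry/ntrials ~ age + female + field_strength,
            interest = ~ female,
            dataset = collapsed, iter = 100000, burnIn = 2000)
summary(fit)
```

The iteration counts are placeholders; exact logistic via MCMC often needs tuning of iter and burnIn for acceptable mixing.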


Possible to invert the randomForest function in R?

I computed a random forest to predict a target value in a large data structure.
The data contains a few thousand rows, about 20 input variables, and one output/target/response variable.
For example, the dataframe df is like:
| V1 | V2 | V3 | V4 | ... | Rsp |
|----+----+----+----+-----+-----|
|  1 |  8 |  2 |  3 | ... | 1.5 |
|  2 |  4 |  3 |  4 | ... | 1.3 |
|  5 |  7 |  6 |  3 | ... | 1.4 |
|  2 |  8 |  8 |  4 | ... | 1.9 |
|  9 |  3 |  1 |  6 | ... | 2.1 |
| .. | .. | .. | .. | ... | ... |
I calculated the forest:
library(randomForest)  # needed for randomForest()
df.r <- randomForest(Rsp ~ . , data = df , subset = train ,
                     mtry = 20, ntree = 200)  # mtry cannot exceed the ~20 predictors
p <- predict(df.r, df[-train, ])
I want to minimize the response in order to find the best combinations of input variables. But because both the inputs and the output are noisy, I cannot simply take the variable values at the row with the minimum observed response.
So my question is: is it possible to traverse the trees bottom-up? Is it possible to get the combinations of variables which give me a low response value?
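Not a true inversion, but one pragmatic workaround is to score candidate input combinations with the fitted forest and keep those with the lowest predicted response; the prediction averages over many trees, which smooths out the noise that makes the raw minimum unreliable. A sketch, reusing the df.r fit from above:

```r
# Predict over the observed input combinations
# (or over a denser candidate grid you construct yourself)
preds <- predict(df.r, df)

# The 10 combinations with the lowest *predicted* response
best <- df[order(preds)[1:10], ]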

How to get a query result into a key value form in HiveQL

I have tried different things, but none succeeded. I have the following issue and would be very grateful if someone could help me.
I get the data from a view, several billion records, for different measures:
A)
| s_c_m1 | s_c_m2 | s_c_m3 | s_c_m4 | s_p_m1 | s_p_m2 | s_p_m3 | s_p_m4 |
|--------+--------+--------+--------+--------+--------+--------+--------|
|      0 |      1 |      2 |      3 |      4 |      5 |      6 |      7 |
|      1 |      2 |      3 |      4 |      5 |      6 |      7 |      8 |
|      2 |      3 |      4 |      5 |      6 |      7 |      8 |      9 |
|--------+--------+--------+--------+--------+--------+--------+--------|
Then I need to aggregate it by each measure. So far, so good; this part I have figured out.
B)
| s_c_m1 | s_c_m2 | s_c_m3 | s_c_m4 | s_p_m1 | s_p_m2 | s_p_m3 | s_p_m4 |
|--------+--------+--------+--------+--------+--------+--------+--------|
|      3 |      6 |      9 |     12 |     15 |     18 |     21 |     24 |
|--------+--------+--------+--------+--------+--------+--------+--------|
Then I need to turn the data into the following key-value form:
C)
| measure |  c |  p |
|---------+----+----|
| m1      |  3 | 15 |
| m2      |  6 | 18 |
| m3      |  9 | 21 |
| m4      | 12 | 24 |
|---------+----+----|
The first 4 columns of B) become the c column in C), and the last 4 columns become the p column.
Is there an elegant way to do this that is easy to maintain? The perfect solution would be that if another measure were introduced in A) and B), no modification would be required and the query would pick it up automatically.
I know how to get this done in SQL Server and Postgres, but here I am missing the experience.
I think you should use a map for this.
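Following that suggestion, here is a sketch of the map-based approach, assuming the aggregated result B) lives in a table called b: build a measure -> struct(c, p) map and explode it with a lateral view. Note that the map literal still has to be edited when a new measure appears, so this is only semi-automatic:

```sql
SELECT t.measure,
       t.vals.c AS c,
       t.vals.p AS p
FROM b
LATERAL VIEW explode(
  map('m1', named_struct('c', s_c_m1, 'p', s_p_m1),
      'm2', named_struct('c', s_c_m2, 'p', s_p_m2),
      'm3', named_struct('c', s_c_m3, 'p', s_p_m3),
      'm4', named_struct('c', s_c_m4, 'p', s_p_m4))
) t AS measure, vals;
```

For fully automatic pickup of new measures you would need to generate this SQL from the view's column list, since HiveQL itself cannot enumerate columns dynamically.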

Creation of Panel Data set in R

Programmers,
I have some difficulties structuring my panel data set.
My panel data set, for the moment, has the following structure:
Shown here only with T = 2 and N = 3 (my real data set, however, has T = 6 and N = 20,000,000).
Panel data structure 1:
Year | ID | Variable_1 | ... | Variable_k |
   1 |  1 | A          | ... | B          |
   1 |  2 | C          | ... | D          |
   1 |  3 | E          | ... | F          |
   2 |  1 | G          | ... | H          |
   2 |  2 | I          | ... | J          |
   2 |  3 | K          | ... | L          |
The desired structure is:
Panel data structure 2:
Year | ID | Variable_1 | ... | Variable_k |
   1 |  1 | A          | ... | B          |
   2 |  1 | G          | ... | H          |
   1 |  2 | C          | ... | D          |
   2 |  2 | I          | ... | J          |
   1 |  3 | E          | ... | F          |
   2 |  3 | K          | ... | L          |
This is the classic panel data structure, where each individual's yearly observations over the whole period form one block, individual by individual.
My question: is there a simple and efficient R solution that changes the data structure from Table 1 to Table 2 for very large data sets (data.frame)?
Thank you very much for all responses in advance!!
Enrico
You can reorder the rows of your dataframe using order():
df <- df[order(df$ID, df$Year), ]
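For data of this size (T = 6 and N = 20 million, i.e. roughly 120 million rows), a data.table variant may be preferable, since it reorders by reference instead of copying the whole data.frame:

```r
library(data.table)
setDT(df)               # convert to data.table in place, no copy
setorder(df, ID, Year)  # reorder rows by reference
```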

Two data frames correlation in R

I need to correlate some data.
I have two data frames: df, with patient health conditions (253 columns), and tax2.melt, with the patients' microbiota analyses (3 columns).
tax2.melt is:
| bac_name             | pat_id | percent      |
|----------------------+--------+--------------|
| Unclassified         |      1 | 5.4506702563 |
| Serratia_entomophila |      1 | 0            |
| Faecalibacterium     |      1 | 4.0394862303 |
| Clostridium          |      1 | 5.215098996  |
df is a data frame with the patient ID_CODE and 253 variables:
| ID_CODE | DIAB_GR | SEX | AGE | .... |
|---------+---------+-----+-----+------|
|       1 |     232 |   0 |   0 | .... |
|       2 |      99 |   0 |   0 | .... |
So I need to correlate individual patients' conditions (like abdominal obesity or diabetes) with the percentage of individual gut bacteria in the total gut microbiota (like Faecalibacterium or Clostridium).
The result should be a data frame with columns bac_name, df_testvalue, and corr.
Could you give me advice on how best to do this in R? Thank you!
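One possible sketch in base R, assuming you want the correlation of each bacterium's percentage against a single condition column such as DIAB_GR (swap in whichever columns of df you need): reshape tax2.melt to wide, merge on the patient ID, then correlate column by column.

```r
# One row per patient, one column per bacterium
# (wide columns are named "percent.<bac_name>")
wide <- reshape(tax2.melt, idvar = "pat_id", timevar = "bac_name",
                direction = "wide")

# Join microbiota percentages onto the patient conditions
merged <- merge(df, wide, by.x = "ID_CODE", by.y = "pat_id")

# Correlate each bacterium column with the chosen condition
bac_cols <- grep("^percent\\.", names(merged), value = TRUE)
result <- data.frame(
  bac_name = sub("^percent\\.", "", bac_cols),
  corr = sapply(bac_cols, function(col)
    cor(merged[[col]], merged$DIAB_GR, use = "pairwise.complete.obs"))
)
```

For binary conditions you may want a point-biserial correlation (which cor() with 0/1 coding already gives) or a different test entirely; this only sketches the data wrangling.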

Time dependent data in the coxph function of survival package

When considering time-dependent data in survival analysis, you have multiple start-stop intervals per subject, with covariate measurements for each interval. How does the coxph function keep track of which subject the start and stop times and covariates belong to?
The function looks as follows
coxph(Surv(start, stop, event, type) ~ X)
Your data may look as follows:
subject | start | stop | event | covariate |
--------+-------+------+-------+-----------+
      1 |     1 |    7 |     0 |         2 |
      1 |     7 |   14 |     0 |         3 |
      1 |    14 |   17 |     1 |         6 |
      2 |     1 |    7 |     0 |         1 |
      2 |     7 |   14 |     0 |         1 |
      2 |    14 |   21 |     0 |         2 |
      3 |     1 |    3 |     1 |         8 |
How can the function get away without an individual subject identifier?
My understanding is that survival analysis does not follow individuals through time; at each event time it only looks at the risk set, i.e. the counts of who is at risk and who had the event, so the subject identifier is irrelevant to the estimates. Based on those counts, the probability that any particular subject is alive or dead at a certain time, given certain treatments, can then be estimated.
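That matches how coxph works: the partial likelihood at each event time needs only the risk set, i.e. which (start, stop] intervals cover that time, so the point estimates do not require a subject identifier. The identifier matters only if you want robust standard errors that account for correlated rows from the same subject. An illustrative sketch, assuming a data frame dat shaped like the table above:

```r
library(survival)

# Coefficient estimates are identical with or without the subject column
fit  <- coxph(Surv(start, stop, event) ~ covariate, data = dat)

# cluster() gives robust standard errors across each subject's rows
fit2 <- coxph(Surv(start, stop, event) ~ covariate + cluster(subject),
              data = dat)
```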
