Time-dependent data in the coxph function of the survival package (R)

When considering time-dependent data in survival analysis, you have multiple start-stop intervals for an individual subject, with covariate measurements for each interval. How does the coxph function keep track of which subject the start and stop times and the covariates belong to?
The function looks as follows
coxph(Surv(start, stop, event, type) ~ X)
Your data may look as follows
subject | start | stop | event | covariate |
--------+---------+--------+--------+-----------+
1 | 1 | 7 | 0 | 2 |
1 | 7 | 14 | 0 | 3 |
1 | 14 | 17 | 1 | 6 |
2 | 1 | 7 | 0 | 1 |
2 | 7 | 14 | 0 | 1 |
2 | 14 | 21 | 0 | 2 |
3 | 1 | 3 | 1 | 8 |
How can the function get away without an individual subject specifier?

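My understanding is that the Cox model does not follow individuals through time. At each event time it only needs to know which rows are at risk, that is, which (start, stop] intervals contain that time, together with their covariate values, so the subject specifier is not needed to estimate the coefficients. This works as long as the intervals for one subject do not overlap, so that each subject contributes at most one row to any risk set. Based on these risk sets, probabilities can be estimated that any particular subject will be alive or dead at a certain time given certain treatments.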
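To make the risk-set idea concrete, here is a minimal base R sketch that rebuilds the example data from the question and lists, for each event time, the rows that are at risk. The names `d`, `event_times`, and `risk_sets` are made up for this illustration, and the code only mimics the bookkeeping; it is not how coxph is implemented internally.

```r
# Example data from the question, one row per (start, stop] interval
d <- data.frame(
  subject   = c(1, 1, 1, 2, 2, 2, 3),
  start     = c(1, 7, 14, 1, 7, 14, 1),
  stop      = c(7, 14, 17, 7, 14, 21, 3),
  event     = c(0, 0, 1, 0, 0, 0, 1),
  covariate = c(2, 3, 6, 1, 1, 2, 8)
)

# Distinct times at which an event occurs
event_times <- sort(unique(d$stop[d$event == 1]))

# A row is at risk at time t when start < t <= stop
risk_sets <- lapply(event_times, function(t) which(d$start < t & d$stop >= t))
names(risk_sets) <- event_times

# The subject column is only used here to check that no subject
# appears twice in the same risk set
no_duplicates <- all(sapply(risk_sets, function(r) !any(duplicated(d$subject[r]))))

print(risk_sets)
```

The subject column plays no role in building the risk sets; it is only consulted for the final sanity check, which is why a call such as `coxph(Surv(start, stop, event) ~ covariate, data = d)` can be written without it.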

Related

Data preparation before running exact logistic (elrm in R)

I started out using Firth's logistic (logistf) to deal with my small sample size (n=80), but wanted to try out exact logistic regression using the elrm package. However, I'm having trouble figuring out how to create the "collapsed" data required for elrm to run. I have a csv that I import into R as a dataframe that has the following variables/columns. Here is some example data (real data has a few more columns and 80 rows):
+------------+-----------+-----+--------+----------------+
| patien_num | asymmetry | age | female | field_strength |
+------------+-----------+-----+--------+----------------+
| 1 | 1 | 25 | 1 | 1.5 |
| 2 | 0 | 50 | 0 | 3 |
| 3 | 0 | 75 | 1 | 1.5 |
| 4 | 0 | 33 | 1 | 3 |
| 5 | 0 | 66 | 1 | 3 |
| 6 | 0 | 99 | 0 | 3 |
| 7 | 1 | 20 | 0 | 1.5 |
| 8 | 1 | 40 | 1 | 3 |
| 9 | 0 | 60 | 1 | 3 |
| 10 | 0 | 80 | 0 | 1.5 |
+------------+-----------+-----+--------+----------------+
Basically my data is one line per patient (not a frequency table). I'm trying to run a regression with asymmetry as the dependent variable and age (continuous), female (binary), and field_strength (factor) as independent variables. I'm trying to understand how to collapse this into the appropriate format so I can get that "ntrials" part required for the elrm formula.
I've looked at https://stats.idre.ucla.edu/r/dae/exact-logistic-regression/, but they start with data in a different format than mine, and I'm having trouble adapting it. Any help appreciated!
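One possible way to build the collapsed format is sketched below in base R: give every patient one trial, then sum successes and trials within each covariate pattern using `aggregate`. The data frame name `patients` and the object names `collapsed` and `coarse` are assumptions for this illustration; the column names follow the example table. The sketch stops at the collapsed data and does not call elrm itself.

```r
# Example data from the question, one row per patient
patients <- data.frame(
  patien_num     = 1:10,
  asymmetry      = c(1, 0, 0, 0, 0, 0, 1, 1, 0, 0),
  age            = c(25, 50, 75, 33, 66, 99, 20, 40, 60, 80),
  female         = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
  field_strength = c(1.5, 3, 1.5, 3, 3, 3, 1.5, 3, 3, 1.5)
)

# Each patient is a single Bernoulli trial
patients$ntrials <- 1

# Sum successes (asymmetry) and trials within each covariate pattern
collapsed <- aggregate(cbind(asymmetry, ntrials) ~ age + female + field_strength,
                       data = patients, FUN = sum)

# Same idea without age, to show rows actually being merged
coarse <- aggregate(cbind(asymmetry, ntrials) ~ female + field_strength,
                    data = patients, FUN = sum)

print(collapsed)
print(coarse)
```

Because age is continuous and every patient in the example has a different age, `collapsed` still has one row per patient with `ntrials` equal to 1. Rows only merge when patients share all covariate values, as the `coarse` version shows.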

Subsetting a table in R

In R, I've created a 3-dimensional table from a dataset. The three variables are all factors and are labelled H, O, and S. This is the code I used to simply create the table:
# with() evaluates table() inside df, so attach() is not needed
test <- with(df, table(H, O, S))
Outputting the flattened table produces this table below. The two values of S were split up, so these are labelled S1 and S2:
ftable(test)
+-----------+-----------+-----+-----+
| H | O | S1 | S2 |
+-----------+-----------+-----+-----+
| Isolation | Dead | 2 | 15 |
| | Sick | 64 | 20 |
| | Recovered | 153 | 379 |
| ICU | Dead | 0 | 15 |
| | Sick | 0 | 2 |
| | Recovered | 1 | 9 |
| Other | Dead | 7 | 133 |
| | Sick | 4 | 20 |
| | Recovered | 17 | 261 |
+-----------+-----------+-----+-----+
The goal is to use this table object, subset it, and produce a second table. Essentially, I want only "Isolation" and "ICU" from H, "Sick" and "Recovered" from O, and only S1, so it basically becomes the 2-dimensional table below:
+-----------+------+-----------+
| | Sick | Recovered |
+-----------+------+-----------+
| Isolation | 64 | 153 |
| ICU | 0 | 1 |
+-----------+------+-----------+
S = S1
I know I could first subset the dataframe and then create the new table, but the goal is to subset the table object itself. I'm not sure how to retrieve certain values from each dimension and produce the reduced table.
Edit: ANSWER
I have now found a much simpler method. All I needed to do was index each dimension of the table directly:
> test[1:2, 2:3, 1]
           O
H           Sick Recovered
  Isolation   64       153
  ICU          0         1
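The same subset can also be selected by level name instead of position, which keeps working if the order of the levels changes. The sketch below rebuilds the table from the counts shown in the question, since the original data frame is not available; the vector `counts` and the object `sub` are names made up for this illustration.

```r
# Counts from the flattened table, filled in H-fastest (column-major) order
counts <- c(2, 0, 7,    64, 0, 4,    153, 1, 17,     # S1: Dead, Sick, Recovered
            15, 15, 133, 20, 2, 20,  379, 9, 261)    # S2: Dead, Sick, Recovered

test <- as.table(array(
  counts,
  dim = c(3, 3, 2),
  dimnames = list(
    H = c("Isolation", "ICU", "Other"),
    O = c("Dead", "Sick", "Recovered"),
    S = c("S1", "S2")
  )
))

# Select levels by name in each dimension
sub <- test[c("Isolation", "ICU"), c("Sick", "Recovered"), "S1"]
print(sub)
```

Selecting a single level of S drops that dimension, so the result is two-dimensional.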
Subset the data before running table, example:
ftable(table(mtcars[, c("cyl", "gear", "vs")]))
#          vs  0  1
# cyl gear
# 4   3        0  1
#     4        0  8
#     5        1  1
# 6   3        0  2
#     4        2  2
#     5        1  0
# 8   3       12  0
#     4        0  0
#     5        2  0
# subset then run table
ftable(table(mtcars[ mtcars$gear == 4, c("cyl", "gear", "vs")]))
#          vs 0 1
# cyl gear
# 4   4       0 8
# 6   4       2 2

Possible to invert the randomForest function in R?

I computed a random forest to predict a target value in a large data structure.
The matrix contains some thousand rows, about 20 input variables and one output/target/response variable.
For example, the dataframe df is like:
| V1 | V2 | V3 | V4 | ... | Rsp |
---------------------------------
| 1 | 8 | 2 | 3 | ... | 1.5 |
| 2 | 4 | 3 | 4 | ... | 1.3 |
| 5 | 7 | 6 | 3 | ... | 1.4 |
| 2 | 8 | 8 | 4 | ... | 1.9 |
| 9 | 3 | 1 | 6 | ... | 2.1 |
. . . . . .
I calculated the forest:
df.r <- randomForest(Rsp ~ . , data = df , subset = train , mtry = 50, ntree=200)
p <- predict(df.r, df[-train,])
I want to minimize the response in order to get the best combinations of input variables. But because the input and output are noisy, I cannot directly take the variables at the minimum response value.
So my question is: Is it possible to traverse the trees bottom-up? Is it possible to get the combinations of variables that give me a low response value?

Combining aggregate functions in sqlite

Assuming the following table and using sqlite I have the following question:
Node |Loadcase | Fx | Cluster
---------------------------------
1 | 1 | 50 | A
2 | 1 | -40 | A
3 | 1 | 60 | B
4 | 1 | 80 | C
1 | 2 | 50 | A
2 | 2 | -50 | A
3 | 2 | 80 | B
4 | 2 | -100 | C
I am trying to write a query which fetches the maximum absolute value of Fx and the Load case for each Node 1-4.
An additional requirement is that Fx values belonging to the same Cluster shall be summed before running this query.
In the example above I would expect the following results:
Node | Loadcase | MaxAbsClusteredFx
-----|-----------|-------------------
1 | 1 | 10
2* | |
3 | 2 | 80
4 | 2 | 100
* N/A because Node 2 is summed with Node 1; both belong to Cluster A.
Query:
For Node 1 I would execute a query similar to this
SELECT Loadcase,abs(Fx GROUP BY Cluster) FROM MyTable WHERE abs(Fx GROUP BY Cluster) = max(abs(Fx GROUP BY Cluster)) AND Node = 1
I keep getting "Error while executing query: near "Forces": syntax error" or similar.
Thankful for any help!

Two data frames correlation in R

I need to correlate some data.
I have two data frames: df, with patient health conditions (253 columns), and tax2.melt, with the patients' microbiota analyses (3 columns).
tax2.melt is:
| bac_name | pat_id | percent |
|----------------------|--------|--------------|
| Unclassified | 1 | 5.4506702563 |
| Serratia_entomophila | 1 | 0 |
| Faecalibacterium | 1 | 4.0394862303 |
| Clostridium | 1 | 5.215098996 |
df is a data frame with patient ID_CODE and 253 variables
| ID_CODE | DIAB_GR | SEX | AGE | .... |
|---------|---------|-----|-----|--------|
| 1 | 232 | 0 | 0 | .... |
| 2 | 99 | 0 | 0 | .... |
So I need to correlate individual patients' conditions (like abdominal obesity or diabetes) with the percentage of individual gut bacteria in the total gut microbiota (like Faecalibacterium or Clostridium).
The result should be a data frame with the columns bac_name, df_testvalue, and corr.
Could you give me advice on how best to do this in R?
Thank you!
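One possible approach, sketched below in base R, is to reshape the long microbiota table into one column per bacterium, align its rows with the patient table by ID, and pass both to `cor`. The numeric values here are made up for illustration (only the column names come from the question), and the sketch assumes that the chosen condition columns are numeric and that every patient appears in both tables.

```r
# Toy long-format microbiota data (made-up values, columns as in tax2.melt)
tax2.melt <- data.frame(
  bac_name = rep(c("Faecalibacterium", "Clostridium"), each = 4),
  pat_id   = rep(1:4, times = 2),
  percent  = c(4, 6, 8, 10,  5, 4, 2, 1)
)

# Toy patient data (made-up values, key column as in df)
df <- data.frame(ID_CODE = 1:4,
                 DIAB_GR = c(1, 2, 3, 4),
                 AGE     = c(40, 30, 20, 10))

# Long to wide: rows are patients, columns are bacteria
wide <- with(tax2.melt, tapply(percent, list(pat_id, bac_name), sum))

# Put the rows in the same patient order as df
wide <- wide[as.character(df$ID_CODE), , drop = FALSE]

# Correlate every bacterium with every selected condition (Pearson by default)
cors <- cor(wide, df[, c("DIAB_GR", "AGE")])

# Back to long format with the requested column names
result <- setNames(as.data.frame(as.table(cors)),
                   c("bac_name", "df_testvalue", "corr"))
print(result)
```

For ordinal or skewed variables, `cor` also accepts `method = "spearman"`; categorical columns among the 253 variables would need a different measure of association.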
