How do I create a survival object in R? - r

The question I am posting here is closely linked to another question I posted two days ago about gompertz aging analysis.
I am trying to construct a survival object, see ?Surv, in R. This will hopefully be used to perform Gompertz analysis to produce an output of two values (see original question for further details).
I have survival data from an experiment in flies which examines rates of aging in various genotypes. The data is available to me in several layouts so the choice of which is up to you, whichever suits the answer best.
One dataframe (wide.df) looks like this: each genotype (Exp, of which there are ~640) has a row, and the days run in sequence horizontally from day 4 to day 98, with counts of new deaths every two days.
Exp Day4 Day6 Day8 Day10 Day12 Day14 ...
A 0 0 0 2 3 1 ...
I make the example using this:
wide.df2<-data.frame("A",0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2)
colnames(wide.df2)<-c("Exp","Day4","Day6","Day8","Day10","Day12","Day14","Day16","Day18","Day20","Day22","Day24","Day26","Day28","Day30","Day32","Day34","Day36")
Another version is like this, where each day has a row for each 'Exp' and the number of deaths on that day are recorded.
Exp Deaths Day
A 0 4
A 0 6
A 0 8
A 2 10
A 3 12
.. .. ..
To make this example:
df2<-data.frame(c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),c(0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2),c(4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36))
colnames(df2)<-c("Exp","Deaths","Day")
Each genotype has approximately 50 flies in it. What I need help with now is how to go from one of the above dataframes to a working survival object. What does this object look like? And how do I get from the above to the survival object smoothly?

After noting that the total of Deaths was 55 and that you said the number of flies was "around 50", I decided the likely assumption was that this was a completely observed process (no censoring). So you need to replicate the rows with multiple deaths so that there is one row for each death, and assign an event marker of 1. The "long" format is clearly the preferred format. You can then create a Surv object from 'Day' and 'event':
library(survival)
?Surv
df3 <- df2[rep(rownames(df2), df2$Deaths), ]
str(df3)
#---------------------
'data.frame': 55 obs. of 3 variables:
$ Exp : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
$ Deaths: num 2 2 3 3 3 1 3 3 3 4 ...
$ Day : num 10 10 12 12 12 14 16 16 16 18 ...
#----------------------
df3$event=1
str(with(df3, Surv(Day, event) ) )
#------------------
Surv [1:55, 1:2] 10 10 12 12 12 14 16 16 16 18 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "time" "status"
- attr(*, "type")= chr "right"
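If you start from the wide layout instead, the same kind of Surv object can be reached by stacking the day columns first. This is only a minimal base-R sketch, assuming every day column is named Day followed by a number as in wide.df2; the names day.cols, long.df, expanded, and fly.surv are made up for this illustration.

```r
library(survival)

# Example row from the question (one genotype, deaths counted every two days)
wide.df2 <- data.frame("A", 0, 0, 0, 2, 3, 1, 3, 4, 5, 3, 4, 7, 8, 2, 10, 1, 2)
colnames(wide.df2) <- c("Exp", paste0("Day", seq(4, 36, by = 2)))

# Stack the Day* columns into one row per genotype per day
day.cols <- grep("^Day", names(wide.df2), value = TRUE)
long.df <- data.frame(
  Exp    = rep(wide.df2$Exp, times = length(day.cols)),
  Deaths = as.vector(as.matrix(wide.df2[day.cols])),
  Day    = rep(as.numeric(sub("^Day", "", day.cols)), each = nrow(wide.df2))
)

# One row per death, every death observed (event = 1; assumes no censoring)
expanded <- long.df[rep(seq_len(nrow(long.df)), long.df$Deaths), ]
expanded$event <- 1
fly.surv <- with(expanded, Surv(Day, event))
str(fly.surv)
```

With all ~640 genotypes in the wide data frame, the same stacking keeps Exp aligned with each death row, so Exp can later be used as a grouping variable.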
Note: If this were being done with the coxph function, the expansion to individual lines of data might not have been needed, since that function allows specification of case weights. (I'm guessing that the other regression functions in the survival package would not have needed this to be done either.) In the past Terry Therneau has expressed puzzlement that people create Surv objects outside the formula interface of coxph. The intended use of this Surv object was not described in sufficient detail to know whether a weighted analysis without expansion would be possible.
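To illustrate the case-weights idea, here is a hedged sketch that uses survfit (which also takes a weights argument) rather than coxph, since the example has no covariates. It compares a weighted fit on the aggregated rows with a fit on the expanded rows. The object names fit.weighted and fit.expanded are invented here, and whether weights are appropriate for the Gompertz analysis itself is not something this sketch settles.

```r
library(survival)

# Long-format example from the question
df2 <- data.frame(
  Exp    = "A",
  Deaths = c(0, 0, 0, 2, 3, 1, 3, 4, 5, 3, 4, 7, 8, 2, 10, 1, 2),
  Day    = seq(4, 36, by = 2)
)
df2$event <- 1

# Weighted fit on the aggregated rows; days with zero deaths carry no
# information here, so they are dropped before weighting
fit.weighted <- survfit(Surv(Day, event) ~ 1,
                        data = df2[df2$Deaths > 0, ],
                        weights = Deaths)

# Same fit on the expanded one-row-per-death data, for comparison
df3 <- df2[rep(seq_len(nrow(df2)), df2$Deaths), ]
fit.expanded <- survfit(Surv(Day, event) ~ 1, data = df3)

# Compare the two estimated survival curves
all.equal(fit.weighted$surv, fit.expanded$surv)
```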

Related

R error: level sets of factors are different

I'm working on an assignment practicing Logistic Regression models. Our data is on shots made in NBA games and each row includes a column for what team the player making the shot belongs to and a column for who the home team was.
I am trying to add a column with TRUE/FALSE values based on whether or not the shot was taken by the home team, based on some example code we were provided.
df$home.advntg <- df$Team == df$Home
However I keep getting the error: "Error in Ops.factor(df$Team, df$Home) :
level sets of factors are different"
When I check the columns with str() however these are the results:
str(df$Team) : "Factor w/ 30 levels "ATL","BKN","BOS",..: 7 16 27 3 24 1 10 8 12 12 ..."
str(df$Home) : " Factor w/ 30 levels "ATL","BKN","BOS",..: 7 20 27 5 28 1 10 8 1 12 ..."
The data I'm using is a subset of a much larger dataset which covered shots made from 1997 to 2020. The code worked on the original data, so something about how I've reduced it to just the 2020 shots is probably responsible. The dates of the games are in YMD format, so to filter down to just 2020 I ran:
library(readr)  # read_csv()
library(dplyr)  # filter()
df0 <- read_csv("NBA Shot Locations 1997 - 2020.csv")
df0$Year <- substr(df0$"Game Date",1,4)
df <- filter(df0, Year == 2020)
df <- df[,-23]
When I run str and check the columns with the original data (for which there was no error) I get:
str(df$Team) : "Factor w/ 36 levels "ATL","BKN","BOS",..: 18 17 8 2 18 7 1 10 9 12 ..."
str(df$Home) : "Factor w/ 36 levels "ATL","BKN","BOS",..: 6 17 5 28 18 2 1 10 9 12 ..."
In both cases the Factor levels look like they're the same. I don't really understand what the numbers being returned by the str function represent.
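A small sketch may help with both points. The integers that str() prints are the internal codes of a factor, that is, positions in its levels vector, and str() shows only the first few levels, so two columns that both report 30 levels can still have different level sets. The vectors team and home below are toy stand-ins, not the real data, and comparing as character strings is just one possible workaround.

```r
# Toy stand-ins for df$Team and df$Home (hypothetical values)
team <- factor(c("ATL", "BOS", "CHI"))
home <- factor(c("ATL", "BKN", "CHI"))

# The integers str() prints are codes: positions in the levels vector
as.integer(team)
levels(team)[as.integer(team)]

# Both factors have 3 levels, but the sets differ, so == raises the error
setdiff(levels(team), levels(home))
cmp <- tryCatch(team == home, error = function(e) conditionMessage(e))
cmp

# Comparing the labels as character strings sidesteps the level check
home.advntg <- as.character(team) == as.character(home)
home.advntg
```

On the real data, setdiff(levels(df$Team), levels(df$Home)) and the reverse call would show which teams appear in only one of the two columns.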

"Number of observations <= number of random effects" error

I am using a package called diagmeta for meta-analysis purposes. I can use this package with a built in data set called Schneider2017. However when I make my own database/data set I get the following error:
Error: number of observations (=300) <= number of random effects (=3074) for term (Group * Cutoff | Study); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable
Another thread here on SO suggests the error is caused by the data format of one or more columns. I have made sure every column's data type matches that in the Schneider2017 dataset - no effect.
Link to the other thread
I have tried extracting all of the data from the Schneider2017 dataset into excel and then importing a dataset from Excel through R studio. This again makes no difference. This suggests to me that something in the data format could be different, although I can't see how.
diag2 <- diagmeta(tpos, fpos, tneg, fneg, cutpoint,
studlab = paste(author,year,group),
data = SRschneider,
model = "DIDS", log.cutoff = FALSE,
check.nobs.vs.nRE = "ignore")
The dataset looks like this:
I expected the same successful execution and plotting as with the built-in data set, but keep getting this error.
Result from doing str(mydataset):
> str(SRschneider)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 10 variables:
$ ...1 : num 1 2 3 4 5 6 7 8 9 10 ...
$ study_id: num 1 1 1 1 1 1 1 1 1 1 ...
$ author : chr "Arora" "Arora" "Arora" "Arora" ...
$ year : num 2006 2006 2006 2006 2006 ...
$ group : chr NA NA NA NA ...
$ cutpoint: chr "6" "7.0" "8.0" "9.0" ...
$ tpos : num 133 131 130 127 119 115 113 110 102 98 ...
$ fneg : num 5 7 8 11 19 23 25 28 36 40 ...
$ fpos : num 34 33 31 30 28 26 25 21 19 19 ...
$ tneg : num 0 1 3 4 6 8 9 13 15 15 ...
Just a quick follow-up on Ben's detailed answer.
The statistical method implemented in diagmeta() expects that argument cutpoint is a continuous variable. We added a corresponding check for argument cutpoint (as well as arguments TP, FP, TN, and FN) in version 0.3-1 of R package diagmeta; see commit in GitHub repository for technical details.
Accordingly, the following R commands will result in a more informative error message:
data(Schneider2017)
diagmeta(tpos, fpos, tneg, fneg, as.character(cutpoint),
studlab = paste(author, year, group), data = Schneider2017)
You said that you "have made sure every column's data type matches that in the Schneider2017 dataset", but that doesn't seem to be true. Besides differences between num (numeric) and int (integer) types (which actually aren't typically important), your data has
$ cutpoint: chr "6" "7.0" "8.0" "9.0" ...
while str(Schneider2017) has
$ cutpoint: num 6 7 8 9 10 11 12 13 14 15 ...
Having your cutpoint be a character rather than numeric means that R will try to treat it as a categorical variable (with many discrete levels). This is very likely the source of your problem.
The cutpoint variable is likely a character because R encountered some value in this column that can't be interpreted as numeric (something as simple as a typographic error). You can use SRschneider$cutpoint <- as.numeric(SRschneider$cutpoint) to convert the variable to numeric by brute force (values that can't be interpreted will be set to NA), but it would be better to go upstream and see where the problem is.
If you use tidyverse packages to load your data you should get a list of "parsing problems" that may be useful. You can also try cp <- SRschneider$cutpoint; cp[which(is.na(as.numeric(cp)))] to look at the values that can't be converted.
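The check described above can be tried on a toy vector first. This is a minimal sketch with a made-up cutpoint column in which one value uses a decimal comma; the real offending value in SRschneider is unknown.

```r
# Hypothetical cutpoint column with one value that uses a decimal comma
cutpoint <- c("6", "7.0", "8.0", "9,0", "10.0")

# Brute-force conversion: anything that is not a valid number becomes NA
converted <- suppressWarnings(as.numeric(cutpoint))

# Values that were present in the input but failed to convert
bad <- cutpoint[is.na(converted) & !is.na(cutpoint)]
bad
```

Excluding values that were already NA keeps the list focused on entries that need fixing upstream.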

Can't use glht post-hoc with repeated measures ANOVA in R?

I have a data frame with this structure:
'data.frame': 39 obs. of 3 variables:
$ topic : Factor w/ 13 levels "Acido Folico",..: 1 2 3 4 5 6 7 8 9 10 ...
$ variable: Factor w/ 3 levels "Both","Preconception",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : int 14 1 36 17 5 9 19 9 19 25 ...
and I want to test the effect value ~ variable, considering that observations are grouped in topics. So I thought to use a repeated measures ANOVA, where "variable" is treated as a repeated measure on every topic.
The call is aov(value ~ variable + Error(topic/variable)).
Up to this point everything is OK.
Then I wanted to perform a post-hoc test with glht(model, linfct= mcp(variable = 'Tukey')), but I receive two errors:
‘glht’ does not support objects of class ‘aovlist’
no ‘model.matrix’ method for ‘model’ found!
Since taking out the error term solves the error, I suppose that is the problem.
So, how can I perform a post-hoc test over a repeated measure anova?
Thanks!
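One commonly suggested workaround, sketched here under assumptions, is to refit the design as a mixed model with a random intercept per topic using nlme::lme, which glht does accept. The data below are simulated, the third level name "Other" is a placeholder because the real one is not shown, and the random-intercept model is an approximation of the Error(topic/variable) structure rather than a guaranteed equivalent.

```r
library(nlme)      # lme()
library(multcomp)  # glht(), mcp()

# Simulated data with the same shape: 13 topics x 3 levels of variable
set.seed(42)
d <- expand.grid(
  topic    = factor(paste0("topic", 1:13)),
  variable = factor(c("Both", "Preconception", "Other"))  # "Other" is a placeholder
)
# Topic-specific mean so that the random intercept has something to capture
d$value <- rpois(nrow(d), lambda = rep(runif(13, 5, 30), times = 3))

# Random intercept for topic plays the role of the Error() term
m <- lme(value ~ variable, random = ~ 1 | topic, data = d)

# Tukey-style all-pairwise comparisons of the levels of variable
posthoc <- glht(m, linfct = mcp(variable = "Tukey"))
summary(posthoc)
```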

Looping histograms AND subsets in R and printing to pdf

this is the first question I have asked on Stack Overflow. However, I am a student and have been using this website for several years without needing to ask a question. There is a wealth of information on here and I appreciate the people who take the time to answer questions. If I need to make any changes to the question or format of the question I will be more than happy to.
I am researching habitat use by a wildlife species. I conducted field studies on GPS collared animals and took vegetative measurements in the field and landscape measurements in GIS.
Currently, I need to classify each plot (unique.id) into a “forest type” (i.e., Douglas-fir low-elevation forest, ponderosa pine woodland, etc.) based on the attributes of the plot. The "forest type" is arbitrary and created by me. I am not using R to classify for me, just to provide visual aids and summary statistics on each plot.
To aid in this, I would like to display a histogram of the tree diameter distributions by tree species for each plot. In the same image window, I would like to display a few other variables from that plot such as canopy cover (canopy), stand age (age), species of the tree that age was taken from (agespecies), elevation (Elev), aspect (Aspect), and stem density (density). Due to the large number of plots, it would be nice to print them all to a pdf or other format for review outside of R.
I am not looking for R to classify the plots for me, just to provide some summary and visual information for each plot to assist me in classifying it.
So far, I have been using the “histogram” function in the “lattice” library, but am open to using something different. I have been able to write code to build a diameter histogram and loop it for each plot. I have also been able to add a subset if I am just running one plot at a time, but I don’t know how to loop the subset. I am also unsure of how to add multiple subsets (canopy, age, agespecies, Elev, Aspect, density) to the histogram.
Finally, most plots do not contain every possible tree species. Is there a way to order the histograms by which species has the highest number of counts and/or not show the histograms that are empty?
I have pasted my code so far and the structure of the data below. The data are in two separate files, “dbh” and “masterplot”.
Data:
> str(dbh)
'data.frame': 80719 obs. of 7 variables:
$ unique.id: Factor w/ 1165 levels "CalvA1","CalvA10",..: 1 1 1 1 1 1 1 1 1 1 ...
$ species : Factor w/ 14 levels "abla","abpr",..: 1 2 3 3 4 4 5 5 5 7 ...
$ dbh : num 7.8 1.1 3.3 3.8 4.1 3.4 6.1 4.2 3.2 3.8 ...
str(masterplot)
'data.frame': 1170 obs. of 41 variables:
$ unique.id : Factor w/ 1165 levels "CalvA1","CalvA10",..: 1 2 3 4 5 6 7 8 9 10 ...
$ canopy : num 16 19 28 25 1 3 23 14 7 18 ...
$ age : num 147 72 167 64 153 144 192 154 173 44 ...
$ agespecies : Factor w/ 14 levels "abla","alru",..: 6 11 7 11 7 6 11 6 11 6 ...
$ Elev : num 1597 1850 1638 1540 1695 ...
$ Aspect : num 238.6 246.1 165.5 242.1 24.4 ...
$ density : num 8700 6600 6800 7800 14600 5600 13900 4600 3900 4000 ...
Code:
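library(lattice)
lathist.fx <- function(x) {
  dev.new()  # open a new graphics window (portable alternative to windows())
  # subset the rows for plot x first so that dbh and species have matching lengths
  print(histogram(~ dbh | species, data = dbh[dbh$unique.id == x, ], breaks = c(0, 4, 11, 50)))
}
# loop over each plot once, not once per tree record
for (i in levels(dbh$unique.id))
  lathist.fx(i)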
I think the subsets will look something like this…
sub=masterplot$age[masterplot$unique.id=="LeftA36"]
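One way the pieces could fit together is sketched below: loop over the plots, write each histogram to its own page of a single pdf, put a few masterplot values in the subtitle, and order the species panels by tree count while dropping species that are absent from the plot. The data here are small made-up stand-ins for dbh and masterplot, and out.file, counts, and note are names invented for this example.

```r
library(lattice)

# Hypothetical stand-ins for the dbh and masterplot data frames
set.seed(1)
dbh <- data.frame(
  unique.id = factor(rep(c("CalvA1", "CalvA10"), each = 40)),
  species   = factor(sample(c("abla", "abpr", "psme"), 80, replace = TRUE),
                     levels = c("abla", "abpr", "psme", "pipo")),
  dbh       = runif(80, min = 0.5, max = 45)
)
masterplot <- data.frame(
  unique.id = factor(c("CalvA1", "CalvA10")),
  canopy = c(16, 19), age = c(147, 72), Elev = c(1597, 1850)
)

out.file <- file.path(tempdir(), "plot_histograms.pdf")
pdf(out.file)  # multi-page pdf, one page per plot
for (id in levels(dbh$unique.id)) {
  d <- dbh[dbh$unique.id == id, ]
  # Keep only species present in this plot, ordered by number of trees
  counts <- sort(table(droplevels(d$species)), decreasing = TRUE)
  d$species <- factor(d$species, levels = names(counts))
  # Take the first matching masterplot row for the annotations
  m <- masterplot[masterplot$unique.id == id, ][1, ]
  note <- sprintf("canopy = %s, age = %s, Elev = %s", m$canopy, m$age, m$Elev)
  # lattice plots must be print()ed explicitly inside a loop
  print(histogram(~ dbh | species, data = d, breaks = c(0, 4, 11, 50),
                  main = id, sub = note, as.table = TRUE))
}
dev.off()
```

Taking the first matching row guards against a unique.id that appears more than once in masterplot, which the str output hints at (1170 rows but 1165 levels).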

Repeated measure ANOVA or time series' analysis?

I am quite new in R and (I admit it!) not so good with statistics, so I am sorry if my problem is too trivial, but I would really appreciate some hints on the matter.
I have 9 points (plots) of soil humidity measurements for each of the 2 different plantation systems we have (agroforestry and agriculture), measured weekly over 2 months. We also have the distance in meters between each measurement point and the closest tree (bigger than 5 cm DBH); it varies between 4.2 and 12 m in agroforestry and is 50 m in agriculture. Therefore, I have a profile of humidity (y) over time (x) for each of the 18 plots (9 in agroforestry and 9 in agriculture); the profiles behave similarly but vary due to weather fluctuations. What I need to know is:
Are these variations in humidity between the measurement points over time dependent on (or influenced by) the distance to the trees? Meaning, do the trees hold more water or take more water from the soil if they are closer to the measurement points (which are in the middle of a plantation)?
Are these curves (humidity x time) significantly different from each other?
I thought first about grouping every 3 points of tree measurements (smaller distances from trees, medium distances and higher distances) for the agriforestry system and all 9 from agroforestry and using them as replications, as they behave more similarly. However it confounded me a bit.
So... I got as far as thinking about using a repeated measure ANOVA from the ez package. So in this case I had:
str(SanPedro)
'data.frame': 450 obs. of 6 variables:
Parcel : Factor w/ 2 levels "Forest","Agriculture": 1 1 1 1 1 1 1 1 1 1 ...
Distance: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
Plot : num 1 1 1 1 1 1 1 1 1 1 ...
Date : Date, format: "0011-07-20" "0011-07-24" ...
Humidity: num 0.217 0.205 0.199 0.2 0.192 0.181 0.184 0.18 0.179 0.178 ...
Number : num 1 2 3 4 5 6 7 8 9 10 ..
When I tried to run the ezANOVA as
ezANOVA(data=SanPedro, dv=Humidity, wid=Number, within=Parcel, between=Plot, type=1, return_aov=TRUE)
I got this:
Warning: Converting "Number" to factor for ANOVA.
Warning: "Plot" will be treated as numeric.
Error in ezANOVA_main(data = data, dv = dv, wid = wid, within = within, :
One or more cells is missing data. Try using ezDesign() to check your data.
If I check the ezDesign(SanPedro), I get:
ezDesign(SanPedro)
Error in as.list(c(x, y, row, col)) :
argument "x" is missing, with no default
In the end, I do not really understand the problem with the data, and I am not even sure if the ezANOVA is actually the right analysis for my case... I really deeply appreciate any hints and ideas on the matter!!! Thanks a loooot!!! =)
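The ezDesign error above indicates that the function needs at least an x argument in addition to the data. Independently of ez, a cross-tabulation in base R can show which cells of a design are empty. The sketch below uses made-up toy data and assumes that each plot belongs to exactly one parcel type; under that assumption Parcel cannot vary within a plot, which may be the kind of empty cell the ezANOVA error is reporting.

```r
# Hypothetical toy data: 6 plots, each measured on 3 dates,
# plots 1-3 in "Forest" and plots 4-6 in "Agriculture"
toy <- expand.grid(Plot = factor(1:6), Date = factor(c("w1", "w2", "w3")))
toy$Parcel <- factor(ifelse(as.integer(toy$Plot) <= 3, "Forest", "Agriculture"))

# Rows per Plot x Parcel cell: zeros mean Parcel does not vary within a plot
design <- table(toy$Plot, toy$Parcel)
design
any(design == 0)

# Rows per Plot x Date cell: no zeros, so Date does vary within every plot
by.date <- table(toy$Plot, toy$Date)
by.date
```

In this toy layout Date would be the within-plot factor and Parcel a between-plot factor; whether that matches the real SanPedro data needs to be checked with the same kind of table.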
