Data Cleaning for Survival Analysis

I’m cleaning some data for a survival analysis, and I want each individual to have only a single, sustained transition from symptom present (ss=1) to symptom remitted (ss=0). A remission only counts if it is complete and sustained. Statistical problems/issues aside, I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects. However, the solutions I keep coming to force me to use conditional formatting based on the rows immediately above and below a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
*Bold and underlined characters represent changes from the dataset above
The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 1,1,1,1,1,0,0
ID# 2 (variable ss) to look like this: 1,1,0,0,0,0,0
ID #3 (variable ss) to look like this: 1,1,1,1,1,1,NA (no change because the row with NA will be deleted eventually)
ID #4 (variable ss) to look like this: 1,1,1,1,1,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).

I don't really think you have considered all the edge cases: what should happen with two NAs in a row at the end of a period, or with 4 or 5 NAs in a row? Using the na.locf function, this will reproduce the requested result for IDs 1 to 3 of your tiny test case (for ID 4 it yields 1,1,0,0,0,0,0 rather than the requested 1,1,1,1,1,0,0):
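library(zoo)
# Carry the last observation forward, unless the series ends in NA.
# na.rm = FALSE keeps leading NAs so the length always matches what ave() expects.
fillNA <- function(vec) {
  if (is.na(tail(vec, 1))) vec else na.locf(vec, na.rm = FALSE)
}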
> mydat$locf <- with(mydat, ave(ss, id, FUN=fillNA))
> mydat
id time ss locf
1 1 0 1 1
2 1 1 1 1
3 1 2 1 1
4 1 3 1 1
5 1 4 NA 1
6 1 5 0 0
7 1 6 0 0
8 2 0 1 1
9 2 1 1 1
10 2 2 0 0
11 2 3 NA 0
12 2 4 0 0
13 2 5 0 0
14 2 6 0 0
15 3 0 1 1
16 3 1 1 1
17 3 2 1 1
18 3 3 1 1
19 3 4 1 1
20 3 5 1 1
21 3 6 NA NA
22 4 0 1 1
23 4 1 1 1
24 4 2 0 0
25 4 3 NA 0
26 4 4 NA 0
27 4 5 0 0
28 4 6 0 0
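For readers without zoo installed, the same last-observation-carried-forward idea can be sketched in base R. This is only an illustration of the technique under the same rule as the answer above (leave a series untouched when it ends in NA); the helper names locf_base and fill_unless_trailing_na are made up for this sketch and are not part of any package.

```r
# Rebuild the example data from the question
mydat <- data.frame(
  id   = rep(1:4, each = 7),
  time = rep(0:6, times = 4),
  ss   = c(1, 1, 1, 1, NA, 0, 0,
           1, 1, 0, NA, 0, 0, 0,
           1, 1, 1, 1, 1, 1, NA,
           1, 1, 0, NA, NA, 0, 0)
)

# Hypothetical helper: base-R last observation carried forward.
# cumsum(!is.na(x)) gives, for each position, the index of the most recent
# non-missing value; positions before the first observation stay NA.
locf_base <- function(x) {
  seen <- cumsum(!is.na(x))
  seen[seen == 0] <- NA
  x[!is.na(x)][seen]
}

# Hypothetical helper: same rule as fillNA above, skip series ending in NA
fill_unless_trailing_na <- function(x) {
  if (is.na(x[length(x)])) x else locf_base(x)
}

mydat$locf <- ave(mydat$ss, mydat$id, FUN = fill_unless_trailing_na)
```

As with the zoo version, ID 4 comes out as 1,1,0,0,0,0,0, so the stricter "sustained remission" rule for runs of several NAs would still need its own logic.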

Related

How to convert a list of attributes to a table? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 3 years ago.
I have a big list of attributes and I want to convert them into a table.
It looks like this:
name Key Value
1 age 20
1 sex 1
2 age 20
2 sex 0
3 age 22
4 age 30
5 age 29
6 age 6
I want to convert it to something like this:
name age sex
1 20 1
2 20 0
3 22
4 30
5 29
6 6
Does anyone have an idea?
EDIT
My original list has a lot more values, not just 2. None of the answers I saw answer my question.
I'll try to explain it better with a smaller table.
My table:
name key value
1 first 0
2 first 1
2 sec 1
1 sec 2
1 tr 1
2 tr 0
3 first 0
3 sec 0
4 first 0
wanted result:
name first sec tr
1 0 2 1
2 1 1 0
3 0 0
4 0

Here is an example using data from the OP with tidyr::spread().
rawData <- "name,Key,Value
1,age,20
1,sex,1
2,age,20
2,sex,0
3,age,22
4,age,30
5,age,29
6,age,6"
df <- read.csv(text=rawData,header=TRUE,stringsAsFactors=FALSE)
library(tidyr)
df %>% spread(.,key=Key,value=Value)
...and the output:
> df %>% spread(.,key=Key,value=Value)
name age sex
1 1 20 1
2 2 20 0
3 3 22 NA
4 4 30 NA
5 5 29 NA
6 6 6 NA
>
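For additional background on spread() and its complement, gather(), please see R for Data Science, Chapter 12: Tidy Data. Note that since tidyr 1.0.0 both functions are superseded by pivot_wider() and pivot_longer(), although they still work.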
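The smaller table from the EDIT in the question (keys first, sec and tr) can be widened the same way. As a rough sketch that avoids any package dependency, base R's reshape() does the job; the data frame names long and wide are just illustrative.

```r
# The smaller example table from the question's EDIT
long <- data.frame(
  name  = c(1, 2, 2, 1, 1, 2, 3, 3, 4),
  key   = c("first", "first", "sec", "sec", "tr", "tr", "first", "sec", "first"),
  value = c(0, 1, 1, 2, 1, 0, 0, 0, 0)
)

# One row per name, one column per key; missing combinations become NA
wide <- reshape(long, idvar = "name", timevar = "key", direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix
names(wide) <- sub("^value\\.", "", names(wide))
wide <- wide[order(wide$name), ]
```

Any number of distinct keys is handled the same way, since each distinct value of key simply becomes its own column.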

How to create a complex running calculation on an R data table

I want to create a running calculation that includes logic to restart the running sum when the value is negative. Initially I have a data table or frame like below :
df <- data.frame(value1 = c(0,0,10,0,1,0,2,0)
, value2 = c(5,1,2,6,8,3,7,2))
value1 value2
0 5
0 1
10 2
0 6
1 8
0 3
2 7
0 2
I would like to take the cumulative sum of value2 minus value1. However, if the new value is less than 0, the running calculation should start over from the current value2.
i.e. end up with
value1 value2 newvalue
0 5 5
0 1 6
10 2 2
0 6 8
1 8 15
0 3 18
2 7 23
0 2 25
I made multiple attempts with the data.table and dplyr packages, with no luck.
EDIT: Updated df to match the actual table shown.

I am sure there are other simpler ways to do this by tweaking cumsum or other such functions, but I came up with this basic loop to produce the desired output. Hope it helps !!
> df
GroupID value1 value2
1 1 0 5
2 1 0 1
3 1 10 2
4 2 0 6
5 2 1 8
6 3 0 3
7 3 2 7
8 3 0 2
for(i in 1:nrow(df)) {
  if(i == 1) {
    df$newvalue[i] <- df$value2[i]
  } else {
    df$newvalue[i] <- (df$newvalue[i-1] + df$value2[i]) - df$value1[i]
    # restart when the total goes negative or a new group begins
    if(df$newvalue[i] < 0 || df$GroupID[i] != df$GroupID[i-1]) {
      df$newvalue[i] <- df$value2[i]
    }
  }
}
> df
GroupID value1 value2 newvalue
1 1 0 5 5
2 1 0 1 6
3 1 10 2 2
4 2 0 6 6
5 2 1 8 13
6 3 0 3 3
7 3 2 7 8
8 3 0 2 10

I believe that explicitly looping through the data frame is the only solution for calculating this type of conditional cumulative sum. Sagar's solution was very helpful to me (I up-voted but do not have enough reputation points for it to count).
In my experience, newvalue needs to be initialized prior to starting the loop in order to work properly. Below is how I would approach this:
df$newvalue <- df$value2
for(i in 2:nrow(df)) {
  if(df$GroupID[i] == df$GroupID[i-1]) {
    # restart from value2 whenever the running total would fall below it
    df$newvalue[i] <- max(df$newvalue[i-1] + df$value2[i] - df$value1[i], df$value2[i])
  }
}
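For the ungrouped data frame in the question, the same restart rule can also be written without an explicit for loop by using base R's Reduce() with accumulate = TRUE. This is only a sketch of an alternative, not a claim that it is faster; the helper name step is made up here.

```r
df <- data.frame(value1 = c(0, 0, 10, 0, 1, 0, 2, 0),
                 value2 = c(5, 1, 2, 6, 8, 3, 7, 2))

# Hypothetical helper: one step of the running calculation.
# prev is the previous running total, i is the current row index.
step <- function(prev, i) {
  candidate <- prev + df$value2[i] - df$value1[i]
  # restart from the current value2 when the total would go negative
  if (candidate < 0) df$value2[i] else candidate
}

# Start from the first value2, then fold over rows 2..n keeping every
# intermediate total
df$newvalue <- Reduce(step, x = 2:nrow(df), init = df$value2[1],
                      accumulate = TRUE)
```

On this data the result is 5, 6, 2, 8, 15, 18, 23, 25, which matches the table requested in the question.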

Building a contingency table

I have data like this:
A B
1 10
1 20
1 30
2 10
2 30
2 40
3 20
3 10
3 30
4 20
4 10
5 10
5 10
and I want to build a contingency table like this:
10 20 30 40
10 1 3 2 0
20 3 0 2 0
30 2 2 0 0
40 0 0 0 0
Meaning: within each value of column A, each pair of values in column B adds 1 to the corresponding cell of the contingency table.
Can you help me do this?

Here is a very ugly answer, using the data from the image, because I already spent too much time on your problem. In general, it's not practical to have your result depend on the order of variables.
A <- rep(c(1:4),c(3,2,3,3))
B <- c(10,10,30,10,20,30,20,10,10,20,30)
data <- data.frame(cbind(A,B))
#split by A
library(plyr)
data2 <- ddply(data,.(A),function(x){
combined_pairs <- cbind(x$B[-nrow(x)],
x$B[-1])
#return data where first is always lowest
smallest <- apply(combined_pairs,MARGIN=1,
FUN=min)
largest <- apply(combined_pairs,MARGIN=1,
FUN=max)
return(data.frame(small=smallest,large=largest))
})
library(reshape2)
result <- dcast(small~large,data=data2,
fun.aggregate=length)
> result
small 10 20 30
1 10 1 3 1
2 20 0 0 2
I think you can add the empty rows yourself if you still need them.
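One possible way to get the empty rows and columns without extra packages is to turn both members of each pair into factors with a shared set of levels and let table() do the counting. The sketch below keeps the answer's interpretation (consecutive pairs within each A, smaller value first) and assumes the levels 10, 20, 30 and 40 from the table in the question.

```r
A <- rep(1:4, c(3, 2, 3, 3))
B <- c(10, 10, 30, 10, 20, 30, 20, 10, 10, 20, 30)

# Consecutive pairs within each value of A, smaller member first
pairs <- do.call(rbind, lapply(split(B, A), function(b) {
  first  <- b[-length(b)]
  second <- b[-1]
  data.frame(small = pmin(first, second), large = pmax(first, second))
}))

# Shared levels (assumed from the question) force a square table,
# including rows and columns that have no observations
lv <- c(10, 20, 30, 40)
tab <- table(small = factor(pairs$small, levels = lv),
             large = factor(pairs$large, levels = lv))
```

The populated cells agree with the dcast() result above, and the rows and columns for 30 and 40 appear filled with zeros.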

R - Conditional replacement of column values in a data frame

I have a data frame which has 2 columns - A & B. I want to replace the values of column B in such a way that, when the VALUE>=5 replace with 1, else replace with 0.
Note - There are 2 conditions to be checked.
X=read.csv("Y:/impdat.csv")
A B
3 16
12 3
1 2
12 9
4 4
5 6
21 1
4 14
3 10
12 1
So after replacing, the data should be
A B
3 1
12 0
1 0
12 1
4 0
5 1
21 0
4 1
3 1
12 0
Sounds simple. But I am unable to implement it.
I tried
ifelse(X$B>=5,1,0)
This only prints the new values, but the original data remains the same.

X$B <- as.integer(X$B >= 5)
will do the trick.

X <- transform(X, B=ifelse(B>=5,1,0))

Got it.
Just had to assign the object.
X$B=ifelse(X$B>=5,1,0)

Get frequencies (absolute and relative) of levels of a categorical variable from incidence binary data by combination of columns factors

I would like to have the frequencies of each level of a categorical variable (row vector) denoting ecological type (3 levels: H,F,T) of a set of 93 herbaceous plants, counting only the species observed as present (=1), conditional on sites (3 levels: A,B,C), habitats (4 levels: 1,2,3,4) and years (3 levels: 1,2,3).
I know the procedure goes through tapply(), but the messy part is the logical operation linking the levels of the categorical variable (H,F,T) to the present species (=1) across all of the species, conditional on the combination of column factors.
This could be summarized by a 12 x 3 contingency table indicating the number of species of each ecological type (3) per site (3) and habitat (4).
Example of my data (each habitat contains 20 lines): for each species (Sp1 to Sp93), 0 for absent and 1 for present. The vector "type" contains the ecological type of each species.
Site,Habitat,Year,Sp1,Sp2,Sp3,Sp4,Sp5,Sp6,...,Sp93
type= c(H,H,F,T,F,T,H,....T) # vector of length 93
Thank you in advance.
I hope this would help describe my data objects better.
data = read.csv(file = "Veg_06.csv", header = TRUE)
data = data[1:240, -c(1,4:7)]
Ilot #
Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ... each level has 4 sublevels (from "Site") with 20 lines each, adding up to 80 lines by levels.
Site #
Factor w/ 4 levels "Am","Av","CP","CS": 2 2 2 2 2 2 2 2 2 2 ...
Sp #
int [1:240] 0 0 0 0 0 0 0 0 0 0 ... either "0" or "1" for absence or presence of species.
veg #
Factor w/ 3 levels "H","F","T": 3 3 2 2 3 1 2 1 2 1 ... categorical factor indicating type of species.

First off, I would recommend http://vita.had.co.nz/papers/tidy-data.pdf, Hadley Wickham's paper on Tidy Data, for some ideas on how to organize the data to be better suited to analysis. In essence, we think of each row as a single observation.
It sounds like fundamentally, your data is a collection of year, site, habitat, quadrant(? maybe line, not sure from the description), species with the observation point being that species was observed in that site, habitat, quadrant, and year. For simplicity, a row is present if the species is present.
In addition, there's the concept of type, which is associated with each species.
Analyzing and contingency table
Putting aside the question of how to get your data into this form, let's assume that we have the data in the form described above.
> raw <- expand.grid(species=1:93, quadrant=1:20, habitat=1:4, site=1:3, year=1:3)
> head(raw)
species quadrant habitat site year
1 1 1 1 1 1
2 2 1 1 1 1
3 3 1 1 1 1
4 4 1 1 1 1
5 5 1 1 1 1
6 6 1 1 1 1
And let's take a small sample and a large sample
> set.seed(100); d.small <- raw[sample(nrow(raw),20), ]
> set.seed(100); d.large <- raw[sample(nrow(raw),1000), ]
We can use the ftable function to get this into the kind of contingency table we want (here the year/site combinations by habitat, so 9 x 4), as
> ftable(habitat ~ year + site, data=d.small)
habitat 1 2 3 4
year site
1 1 0 0 1 0
2 0 0 1 1
3 0 1 1 1
2 1 2 1 1 0
2 1 1 0 2
3 0 0 1 0
3 1 2 0 0 1
2 0 1 0 1
3 0 0 0 0
This will count the same species twice if it occurs in two different quadrants of the site/habitat mixture. We can discard the quadrant and unique-ify to get the count across all of them
> ftable(habitat ~ year + site , data=unique(d.small[c('species', 'habitat','year','site')]))
Transforming (tidying the source data)
To transform the data as it stands into a form like this is tricky in vanilla R. With the tidyr package it gets easier (reshape does very similar things as well)
> onerow <- data.frame(year=1, site=1, habitat=2, quadrant=3, sp1=0, sp2=1,sp3=0,sp4=0,sp5=1)
> onerow
year site habitat quadrant sp1 sp2 sp3 sp4 sp5
1 1 1 2 3 0 1 0 0 1
Here I'm making assumptions about what your data look like that seem reasonable
> subset(gather(onerow, species, present, -(year:quadrant)), present==1)
year site habitat quadrant species present
2 1 1 2 3 sp2 1
5 1 1 2 3 sp5 1
> subset(gather(onerow, species, present, -(year:quadrant)), present==1, select=-present)
year site habitat quadrant species
2 1 1 2 3 sp2
5 1 1 2 3 sp5
And now you can proceed with the analysis above.
Merging in the species type data
Looking at your description a little closer, I think you also want to merge in a parallel vector of species type information.
> set.seed(100); sp.type <- data.frame(species=1:93, type=factor(sample(1:4, 93, replace=T)))
> merge(d.small, sp.type)
species quadrant habitat site year type
1 6 16 4 2 3 2
2 27 9 2 2 2 4
3 27 8 4 2 1 4
4 32 18 1 2 2 4
5 33 18 1 1 2 2
6 45 14 4 2 2 3
7 49 6 2 3 1 1
8 54 3 3 2 1 2
9 55 2 1 1 3 3
10 56 2 4 3 1 2
11 56 1 3 1 1 2
12 57 7 2 1 2 1
13 62 18 4 2 2 3
14 70 19 1 1 2 3
15 77 2 3 3 1 4
16 80 7 3 1 2 1
17 81 17 1 1 3 2
18 82 5 2 2 3 3
19 86 9 4 1 3 3
20 87 10 3 3 2 3
And now you can use the subset, unique, and ftable approach above to get the data you need.
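Put together, the last step might look like the sketch below. The data here are made up purely for illustration (three species, a handful of presence records), and the object names obs, merged and tab are hypothetical; the point is only to show unique(), merge() and ftable() producing a site-by-habitat table of type counts.

```r
# Made-up presence records: one row per species seen in a site/habitat
obs <- data.frame(
  site    = c("A", "A", "A", "B", "B", "C"),
  habitat = c(1, 1, 2, 1, 3, 4),
  species = c("sp1", "sp1", "sp2", "sp3", "sp1", "sp2")
)

# Made-up species types
sp.type <- data.frame(species = c("sp1", "sp2", "sp3"),
                      type    = c("H", "F", "T"))

# Drop repeated sightings, then attach the type of each species
merged <- merge(unique(obs), sp.type)

# Rows: site x habitat combinations; columns: ecological type
tab <- ftable(type ~ site + habitat, data = merged)
```

With 3 sites and 4 habitats this gives the 12 x 3 layout described in the question, with zeros for combinations in which no species of a given type was recorded.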

Assuming you had a dataframe with (among other things) the columns named: "sites", "habitats", "years":
dfrm <- data.frame( sites = sample( LETTERS[1:3], 20, replace=TRUE),
habitats= sample( factor(1:4), 20, replace=TRUE),
years = sample( factor(paste("Y",1:4, sep="_")), 20, replace=TRUE) )
Then this will give you an additional factor-mode column that encodes the various levels of each row.
dfrm$three.way.inter <- with(dfrm, interaction(sites, habitats, years))
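By default, interaction() keeps every possible combination as a level, including combinations that have no instances, so if you want those non-populated levels then do nothing else. If you want only the levels that actually occur, use drop=TRUE. Then you can analyze these within individual levels of the three classification variables.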
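A tiny illustration of the effect of drop on a made-up three-row data frame (the names all_lv and used_lv are just for this sketch):

```r
dfrm <- data.frame(
  sites    = c("A", "A", "B"),
  habitats = factor(c(1, 2, 1), levels = 1:4),
  years    = factor(c("Y_1", "Y_1", "Y_2"))
)

# Default: every possible combination is a level (2 sites x 4 habitats x 2 years)
all_lv <- with(dfrm, interaction(sites, habitats, years))

# drop = TRUE: only the combinations that actually occur are kept as levels
used_lv <- with(dfrm, interaction(sites, habitats, years, drop = TRUE))

nlevels(all_lv)
nlevels(used_lv)
```

The first call reports 16 levels and the second 3, which is worth keeping in mind before tabulating, since unused levels show up as zero-count cells.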
