Surv function input - right, left or interval censored? In R

I am at the beginning of setting up a survival analysis in R.
I took a look at this book: https://www.powells.com/book/modeling-survival-data-9780387987842/ but struggle to properly set up the data in the first place. So this is a very basic question about survival analysis, as I cannot find a good example online.
I'd like to understand how to incorporate the censored data into my Surv() function. I understand the status inputs in Surv are:
0 = right censored
1 = event
2 = left censored
3 = interval censored
Right Censored: The time of study ends before an event takes place (ob1)
Left Censored: The event has already happened before the study starts
Event: Typically, death or some other form of expected outcome (marked by x)
Interval Censored: The observation starts at some point in the study and has an event / drops out before the end of the study (ob5)
Left truncated: Ob 3,4,5 are left truncated
To better understand what I am talking about I sketched the described types of censored data below:
"o" marks beginning of data / first occurance in data set
"x" marks event
Start of study End of observation
ob1 o-|-----------------------------------------------------------|--------
| |
ob2 o-|-------------------------------xo |
| |
ob3 | o-----------------------------------xo |
| |
ob4 | o------------------x-|----------o
| |
ob5 | o----------------------------o |
|--------------------------------------------------------------
1999 2010
Finally, here is what I would like to know:
Did I classify ob1- ob5 correctly?
How about the other types of observations?
How do I represent these as input for the Surv function? If, for example, an observation is right censored, i.e. the study ends, how does a "0" indicate this? What is the input for the time series when neither the event (1) nor the end of observation (0) occurs? What happens at a time when "nothing" happens?
When and how is the interval censored data marked? 3 for beginning and end?
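To make my question concrete, here is how I currently picture the coding, as a plain Python sketch (the observation times are made up for illustration, and collapsing non-interval rows into a single (t, t) pair is my own assumption, not something from the book):

```python
# Status codes as listed above: 0 = right censored, 1 = event,
# 2 = left censored, 3 = interval censored.
def encode(time1, status, time2=None):
    """Return a (start, end, status) triple for one observation.

    For status 3 (interval censored) both endpoints are required;
    for the other codes only a single time is used.
    """
    assert status in (0, 1, 2, 3)
    if status == 3:
        assert time2 is not None, "interval censoring needs both endpoints"
        return (time1, time2, 3)
    return (time1, time1, status)

# Hypothetical times for the sketched observations:
rows = [
    encode(11, 0),         # ob1: right censored, still event-free at last follow-up
    encode(8, 1),          # ob2: event observed at time 8
    encode(2, 2),          # left censored: event happened before first observation
    encode(4, 3, time2=9), # ob5: event somewhere between times 4 and 9
]
```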
I can provide some sample code if needed.
Again, thank you for your help on this and for any valuable questions!

Related

Calculate the best distribution for a group of numbers that can FIT on a specific number

I have what I think is an interesting question about Google Sheets and some maths. Here is the scenario:
4 numbers as follows:
64.20 | 107 | 535 | 1070
A reference number into which the previous numbers need to fit, leaving the minimum possible residue, while recording the number of times each of them fits into the reference number. For example, say the reference number is the following:
806.45
So here is the problem:
I'm calculating how many times those 4 numbers can fit into the reference number by starting from the highest number and working down to the lowest, like this:
| 1070 | => =IF(E12/((I15+J15)+IF(H17,K17,0)+IF(H19,K19,0)) > 0,ROUNDDOWN(E12/((I15+J15)+IF(H17,K17,0)+IF(H19,K19,0))),0)
| 535 | => =IF(H15>0,ROUNDDOWN((E12-K15-IF(H17,K17,0)-IF(H19,K19,0))/(I14+J14)),ROUNDDOWN(E12/((I14+J14)+IF(H17,K17,0)+IF(H19,K19,0))))
| 107 | => =IF(OR(H15>0,H14>0),ROUNDDOWN((E12-K15-K14-IF(H17,K17,0)-IF(H19,K19,0))/(I13+J13)),ROUNDDOWN((E12-IF(H17,K17,0)-IF(H19,K19,0))/(I13+J13)))
| 64.20 | => =IF(OR(H15>0,H14>0,H13>0),ROUNDDOWN((E12-K15-K14-K13-IF(H17,K17,0)-IF(H19,K19,0))/(I12+J12)),ROUNDDOWN((E12-IF(H17,K17,0)-IF(H19,K19,0))/(I12+J12)))
As you can see, I'm checking whether the higher values occur at all, so I can subtract their amount from the original number and calculate again how many times the lower number fits into the result of that subtraction. You can also see that I'm including some checkboxes in the formula in order to add a fixed number to the main number.
This actually works, and as you can see in the example, the result is:
| 1070 | -> Fits 0 times
| 535 | -> Fits 1 time
| 107 | -> Fits 2 times
| 64.20 | -> Fits 0 times
The residue of 806.45 in this example is: 57.45
But each number that needs to fit into the main number needs to take the others into consideration; if you solve this exercise manually, you can get something much better, like this:
| 1070 | -> Fits 0 times
| 535 | -> Fits 1 time
| 107 | -> Fits 0 times
| 64.20 | -> Fits 4 times
The residue of 806.45 in this example is: 14.65
When I'm talking about residue I mean the result of the subtraction. I'm sorry if this is not clear; it's hard for me to explain maths in English, since it is not my native language. Please see the spreadsheet and make a copy to better understand what I'm trying to do, or suggest a way for me to explain it better if possible.
So what would you do to make it work more efficiently and "smartly", with the minimum possible residue after the calculation?
Here is the Google's spreadsheet for reference and practice, please make a copy so others can try their own solutions:
LINK TO SPREADSHEET
Thanks in advance for any help or hints.
Delete all current formulas in H12:H15.
Then place this mega-formula in H12:
=ArrayFormula(QUERY(SPLIT(FLATTEN(SPLIT(VLOOKUP(E12,QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&(SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13))),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)&" "&I14)&"|"&QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&((SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13)))),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)*I14)),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I15),1,0)&" "&I15)&"|"&QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&(SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13))),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)&" "&I14)&"|"&QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&((SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13)))),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)*I14)),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I15),1,0)*I15)),"|"),"Select Col2, Col1 WHERE Col2 <= "&E12&" ORDER BY Col2 Asc, Col1 Desc"),2,TRUE)," / ",0,0))," "),"Select Col1"))
Typically, I explain my formulas. In this case, I trust that readers will understand why I cannot explain it. I can only offer it in working order.
To briefly give the general idea, this formula figures out how many times each of the four numbers fits into the target number alone and then adds every possible combination of all of those. Those are then limited to only the combinations less than the target number and sorted smallest to largest in total. Then a VLOOKUP looks up the target number in that list, returns the closest match, SPLITs the multiples from the amounts (which, in the end, have been concatenated into long strings), and returns only the multiples.
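The combination search the formula performs can also be sketched outside Sheets; here is a small Python brute force over the counts of each number, using the figures from the question (for illustration, not as a drop-in replacement):

```python
from itertools import product

def best_fit(values, target):
    """Brute-force the count of each value that leaves the smallest residue."""
    # Upper bound for each count: one value alone cannot exceed the target
    ranges = [range(int(target // v) + 1) for v in values]
    best = None  # (residue, counts)
    for counts in product(*ranges):
        total = sum(c * v for c, v in zip(counts, values))
        if total <= target and (best is None or target - total < best[0]):
            best = (target - total, counts)
    return best

residue, counts = best_fit([64.20, 107, 535, 1070], 806.45)
# counts pairs up with [64.20, 107, 535, 1070]; the residue matches the
# hand-computed optimum from the question (about 14.65)
```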

Finding the average number correct from grid coordinates

I am trying to calculate the average number participants scored correct on a memory task. I have a column called RecallType which tells me if participants were assessed through forward memory recall (called forwards) or through backwards memory recall (called backwards). I also have a column called ProbeState which identifies the type of memory task, of which there are two. In this column I have positions and digits. These are all my variables of interest.
The memory task itself is split across two columns. Recall.CRESP is a column specifying the correct answers on a memory test, selected through grid coordinates. Recall.RESP shows the participant's response.
These columns look something like this:
|Recall.CRESP | Recall.RESP |
|---------------------------------|---------------------------------|
|grid35grid51grid12grid43grid54 | grid35grid51grid12grid43grid54 |
|grid11gird42gird22grid51grid32 | grid11gird15gird55grid42grid32 |
So, for example, in row 1 of this table the participant got 5/5 correct, as the grid coordinates of Recall.CRESP match those of Recall.RESP. However, in row 2 the participant only got 2/5 correct, as only the first and the last grid coordinates are identical. The order of the coordinates must match to be correct.
Ideally I would love to learn from any response. If you do reply please kindly put some comments.
Thanks.
As you are new to stackoverflow, please read the answer here on how to make a reproducible example so your question is clear: How to make a great R reproducible example?.
From what I understand, you are looking to split your string and then count the equal cases. Some code to get you started on that is below:
a = "grid11gird42gird22grid51grid32"
b = "grid11gird15gird55grid42grid32"
a1 = strsplit(a, "grid|gird")
b1 = strsplit(b, "grid|gird")
table(unlist(a1) == unlist(b1))["TRUE"] - 1
You should be able to take mean by your variable of interest using group_by and summarize functionality of package dplyr.
Try using regmatches:
# grab each coordinate token (prefix plus two digits), bind rows, compare element-wise
fun <- function(x) do.call(rbind, regmatches(x, gregexpr(".*?\\d.", x)))
with(dat, rowSums(fun(Recall.CRESP) == fun(Recall.RESP)))
[1] 5 2
DATA:
structure(list(Recall.CRESP = c("grid35grid51grid12grid43grid54",
"grid11grid42grid22grid51grid32"), Recall.RESP = c("grid35grid51grid12grid43grid54",
"grid11grid15grid55grid42grid32")), .Names = c("Recall.CRESP",
"Recall.RESP"), row.names = c(NA, -2L), class = "data.frame")

Azure Machine Learning - how to train with a very limited dataset

I am a beginner and I need some advice on how to go about modelling the scenario below.
I am consuming ~5000 rows of data on average from an external system every day. The number of incoming rows is between 4950 and 5050. I want to build an alerting mechanism that will tell me if the number of incoming rows is not normal, i.e., I want a solution to let me know if I get, say, 2500 rows on a given day, which is 50% less, or, say, 15000 rows, which is way more than the average.
Sample data as below:
| Day | Size of incoming data (in MB) | Number of Rows | Label |
|---------|-------------------------------|----------------|-------|
| Weekday | 3.44 | 5000 | Y |
| Weekday | 3.3 | 4999 | Y |
| Weekday | 3.1 | 4955 | Y |
| Weekday | 3.44 | 5000 | Y |
| Weekend | 4.1 | 5050 | N |
My initial thought was to use some anomaly detection algorithm. I tried using the Principal Component Analysis algorithm to detect the anomaly. I collected the total number of rows I receive every day and used it for training the model. But after training with the data I had, which is quite limited (fewer than 500 observations), I find that the accuracy is very poor. One-Class SVM also did not give me a good result.
I used "Number of rows" as a categorical feature and Label as the label, and ignored the rest of the parameters, as they are of no interest to me in this case. Irrespective of the day and the size of the incoming data, my logic revolves around the number of rows only.
Also, I don't have any negative scenario so far, meaning I never received far too few or far too many records. So I labeled all days when I received 5050 rows as anomalous, and the rest as normal.
I do realize that I am doing something fundamentally wrong here. The question is, does my scenario even qualify for the use of machine learning? (I believe it does, but I wanted your opinion.)
If yes, how do I deal with such a limited set of training data, where you hardly have any sample anomalies? And is it really an anomaly problem, or can I just use some classification algorithm to get a better result?
Thanks
Please see the time-series anomaly detection module; it should do what you need:
https://msdn.microsoft.com/library/azure/96b98cc0-50df-46ff-bc18-c0665d69f3e3?f=255&MSPPError=-2147217396
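Since the logic revolves around the row count only, a plain statistical control check may already be enough; this is a library-free Python sketch (the threshold of 4 standard deviations is an arbitrary choice, not anything Azure-specific):

```python
import statistics

def is_anomalous(history, new_count, k=4.0):
    """Flag new_count if it lies more than k standard deviations from the mean."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    if sigma == 0:          # all historical counts identical
        return new_count != mu
    return abs(new_count - mu) > k * sigma

history = [5000, 4999, 4955, 5000, 5050]   # row counts from the sample data
is_anomalous(history, 2500)    # a 50% drop is flagged
is_anomalous(history, 5010)    # a normal day is not
```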

Modelling GLM in R with discrete non-binary response

I'm new to R and I would like to model the following with a GLM:
A memory retention experiment where I ask each participant a similar question every X days. The question has either a correct or a wrong answer, and the participant must keep trying until the answer is correct. I want to find the probability of a participant answering the question correctly in 1 try, given the past data on the number of tries and the time offset between questions.
I'm following this tutorial to model it:
http://www.theanalysisfactor.com/r-tutorial-glm1/
Here's an example of a part of my table:
| pass | attempts | days since previous |
|------|----------|---------------------|
| 0 | 3 | - |
| 1 | 1 | 2 |
| 0 | 2 | 1 |
| 0 | 5 | 4 |
| 1 | 1 | 2 |
The first value is a binary value whether he 'passes' or 'fails'. Answering in 1 attempt is pass and more than that is fail.
The second value is the number of attempts. We can see that if it is 1 then the first value is also 1, else, the first value is 0.
The third value is the number of days between the question and the previous question.
Right now I'm modelling it as
first ~ second + third
I was thinking there might be a better way to do it, since the first value is directly determined by the second. Something like using only the second and third values and finding P(second = 1). Eventually, I would also like to find P(second = 2).
Thanks for your help :)
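One way to model the attempt count directly, as suggested above, is a geometric model: if each try succeeds independently with probability p, then P(attempts = k) = p(1-p)^(k-1), so P(1 try) is just p. Here is a minimal Python sketch of the maximum-likelihood estimate using the attempt counts from the table (the geometric assumption is mine, not from the question, and it ignores the days-offset covariate):

```python
def geometric_mle(attempts):
    """MLE of the per-try success probability under a geometric model: n / sum(k_i)."""
    return len(attempts) / sum(attempts)

attempts = [3, 1, 2, 5, 1]   # the "attempts" column of the table
p = geometric_mle(attempts)  # P(success in exactly 1 try)
p2 = p * (1 - p)             # P(success in exactly 2 tries)
```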

K Nearest Neighbor Questions

Hi, I am having trouble understanding the workings of the k-nearest-neighbor algorithm, specifically when trying to implement it in code. I am implementing this in R, but I just want to know the workings; I'm not so much worried about the code as about the process. I will post what I have, my data, and my questions:
Training Data (just a portion of it):
Feature1 | Feature2 | Class
2 | 2 | A
1 | 4 | A
3 | 10 | B
12 | 100 | B
5 | 5 | A
So far in my code:
kNN <- function(trainingData, sampleToBeClassified) {
  # file input
  train <- read.table(trainingData, sep = ",", header = TRUE)
  # get the classes (just the class column)
  labels <- as.matrix(train[, ncol(train)])
  # get the features as a matrix (every column but the class column)
  features <- as.matrix(train[, 1:(ncol(train) - 1)])
}
And for this I am calculating the "distance" using this formula:
distance <- function(x1, x2) {
  return(sqrt(sum((x1 - x2)^2)))
}
So is the process for the rest of the algorithm as follows?
1. Loop through every observation (in this case, every row across the 2 feature columns) and calculate its distance to the sampleToBeClassified, one row at a time?
2. In the starting case where I want 1-nearest-neighbor classification, would I just store the row that has the least distance to my sampleToBeClassified?
3. Whatever the closest row is, find out what class it is, and that class becomes the class of the sampleToBeClassified?
My main question is: what role do the features play in this? My instinct is that the two features together are what define a data item as a certain class, so what should I be calculating the distance between?
Am I on the right track at all?
Thanks
It looks as though you're on the right track. The three steps in your process are correct for the 1-nearest-neighbor case. For kNN, you just need to make a list of the k nearest neighbors and then determine which class is most prevalent in that list.
As for features, these are just attributes that define each instance and (hopefully) give us an indication as to which class it belongs to. For instance, if we're trying to classify animals, we could use height and mass as features. So if we have an instance in the class elephant, its height might be 3.27m and its mass might be 5142kg. An instance in the class dog might have a height of 0.59m and a mass of 10.4kg. In classification, if we get something that's 0.8m tall and has a mass of 18.5kg, we know it's more likely to be a dog than an elephant.
Since we're only using 2 features here we can easily plot them on a graph with one feature as the X-axis and the other feature as the Y (it doesn't really matter which one) with the different classes denoted by different colors or symbols or something. If you plot the sample of your training data above, it's easy to see the separation between Class A and B.
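To tie the steps together, here is a compact sketch of the full procedure, written in Python for illustration (distance is computed over both features at once, then a majority vote is taken among the k nearest; the training rows are the ones from the question):

```python
from collections import Counter
from math import dist

def knn_classify(training, sample, k=3):
    """training: list of ((feature1, feature2), class) pairs."""
    # Sort rows by Euclidean distance computed over BOTH features together
    nearest = sorted(training, key=lambda row: dist(row[0], sample))[:k]
    # Majority vote among the k nearest labels (k = 1 reduces to the closest row)
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((2, 2), "A"), ((1, 4), "A"), ((3, 10), "B"),
         ((12, 100), "B"), ((5, 5), "A")]
knn_classify(train, (2, 3), k=3)   # the three closest rows are all class A
```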
