Treating table() frequency labels as numbers rather than factor levels - r

I am trying to learn how to write functions in R and I have a very specific question regarding the use of table and how to treat the "levels variable".
My original problem is to write a cumulative hazard function. My function basically does this:
Example: data x = c(1,1,2,2,2,3,14,25), which has 8 observations/times.
From this vector of 8 observations, do the following operation: F(14) = 2/8 + 3/6 + 1/3 + 1/2,
and F(2) = 2/8 + 3/6, and so on.
Basically I want the sum of (how many observations have time i) / (how many observations have time greater than or equal to i), over the observed times up to t.
So for i = 2, I have two fractions, 2/8 + 3/6, because there are 6 observations with time equal to 2 or more.
Specifically I was using the function table. However, this function gives me the frequencies and treats the value associated with the frequency as a level and not as a number.
For my data I have 5 levels: 1, 2, 3, 14, 25, but when I try to do operations like:
v<-c(1,2,3,14,25)
ta<-as.data.frame(table(v))
as.numeric(ta$v)<14
[1] TRUE TRUE TRUE TRUE TRUE
However, I want the result to be TRUE TRUE TRUE FALSE FALSE. I want the values in table() to be treated as numbers.
How can I do that?
Just so you can see what I am doing, my extra code is below. It works well without censoring, but this part is key for me to advance with censoring.
cumh <- function(x, t, y = rep(1, length(x))) {
  le <- length(x)
  # number of observations with time <= t
  isum <- sum(x <= t)
  # collapse the data into a frequency table
  ta <- as.data.frame(table(x))
  ta$cum <- cumsum(ta$Freq)
  # denominator: number of observations with time >= each observed value
  ta$den <- le
  for (j in 1:(nrow(ta) - 1)) {
    ta$den[j + 1] <- le - ta$cum[j]
  }
  # keep only the terms for observed times <= t
  ind <- isum >= ta$cum
  # correction for right censoring:
  ta2 <- as.data.frame(table(y * x))
  cumhaz <- sum(ind * ta2$Freq / ta$den)
  return(cumhaz)
}
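For reference, with the example data the function reproduces the hand computation above in the uncensored case (a quick sanity check):
cumh(c(1,1,2,2,2,3,14,25), 14)  # 2/8 + 3/6 + 1/3 + 1/2
# [1] 1.583333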

Here is one method using sapply and table:
x <- c(1,1,2,2,2,3,14,25)
myTab <- table(x)
myTab / sapply(seq_along(myTab), function(i) sum(tail(c(0, myTab), -i)))
x
        1         2         3        14        25
0.2500000 0.5000000 0.3333333 0.5000000 1.0000000
Here, tail successively removes values from the beginning of the table of counts, and the remaining counts are summed together, giving the number of observations with time greater than or equal to each value. sapply does this for each position, from the first value to the last; pre-pending 0 offsets the indexing so that the i-th sum excludes the counts of all smaller values. Dividing myTab by these sums returns the proportions.
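As for the factor issue in the question: the names of a table, and the factor column that as.data.frame(table(...)) produces, are character labels, so convert them through character before comparing. A minimal sketch of the usual idiom:
v <- c(1, 2, 3, 14, 25)
ta <- as.data.frame(table(v))
as.numeric(as.character(ta$v)) < 14  # convert labels to numbers before comparing
# [1]  TRUE  TRUE  TRUE FALSE FALSE
And since the end goal is a cumulative hazard, wrapping the proportions above in cumsum() gives the running sum; for the example data it evaluates to 1.583333 at time 14, matching the hand computation 2/8 + 3/6 + 1/3 + 1/2.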

Related

R - I don't understand why my code generates a count rather than a sum

I have a list of 10,000 values that looks like this:
Points
1 118
2 564
3 15
4 729
5 49
6 614
Calling the list t1 and running sum(t1 > quantile(t(t1), 0.8)), I would expect to get the sum of the values in the list that are greater than the 80th percentile, but what I really get is a count (not a sum) of those values.
Try this:
sum(t1[t1>quantile(t(t1),0.8), ])
To see the difference, check t1 > quantile(t(t1), 0.8) and then t1[t1 > quantile(t(t1), 0.8), ].
One is a logical vector that contains TRUE (resp. 1) if the value is greater than the 80% quantile and FALSE (resp. 0) otherwise.
The other is t1 subset by that logical vector, so only values greater than the 80% quantile are returned.
t1 > quantile(t(t1), 0.8) is a logical vector, i.e. a sequence of TRUE/FALSE values (you can check it easily). Consequently, the sum of this vector is the number of TRUE values, i.e. the count of values that satisfy the condition you specified.
Here is an example:
set.seed(123)
df <- data.frame(Point = rnorm(10000))
sum(df$Point > quantile(df$Point, 0.8))
The second line returns the sum of a logical (TRUE/FALSE) vector, hence you get the count (the number of times TRUE occurs). Use
sum(df$Point[df$Point > quantile(df$Point, 0.8)])
to get what you want.
You could use the ifelse function, which will add t1 where t1 is above your threshold and 0 otherwise:
sum(ifelse(t1 > quantile(t(t1), 0.8), t1, 0))
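All three answers turn on the same fact: R coerces logicals to integers (TRUE to 1, FALSE to 0) whenever arithmetic is applied to them, which a minimal snippet makes plain:
sum(c(TRUE, FALSE, TRUE, TRUE))  # logicals coerce to 1/0, so this counts the TRUEs
# [1] 3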

R: Produce Index Values to Group Increasing Values in Vector

I have a vector of increasing year values that occasionally has breaks in it, and I want to create a grouping value for each unbroken sequence. Think of a vector like this one (missing 2005, 2011, and 2012):
x <- c(2001,2002,2003,2004,2006,2007,2008,2009,2010,2013,2014,2015,2016)
I would like to produce an equal-length vector that numbers every value in a run with the same index, to end up with something like this:
[1] 1 1 1 1 2 2 2 2 2 3 3 3 3
I would like to do this using best R practices, so I am trying to avoid falling back on a for loop, but I am not sure how to get from vector A to vector B. Does anyone have any suggestions?
Some things I know I can do:
I can flag the record before or after a gap as TRUE with an ifelse
I can get the index of where the counter should change by wrapping that in a which statement
This is the code to do each:
ifelse(!is.na(lag(x)) & x == lag(x)+1, FALSE, TRUE)
which(ifelse(!is.na(lag(x)) & x == lag(x)+1, FALSE, TRUE))
I think there are a couple of solutions to this problem. One, as d.b posted in the comment above, produces a sequence that increments every time there is a break in the sequence:
cummax(c(1, diff(x)))
There is a similar solution, which I chose to use, that flags breaks with ifelse() and accumulates with cumsum(). I chose it because additional information, like other vectors, can be included in the decision, and the diff/cummax approach can mislabel runs when gap sizes repeat or shrink (two gaps of the same size, for example, would receive the same group number).
cumsum(ifelse(!is.na(lag(x)) & x == lag(x) + 1, FALSE, TRUE))
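Note that lag() here is dplyr's vector version; base R's lag() is meant for time series and does not shift a plain vector this way. For a dependency-free equivalent, the same flag-and-accumulate idea can be written with diff() (a sketch, using the x defined above):
cumsum(c(TRUE, diff(x) != 1))  # TRUE marks the start of each run; cumsum numbers the runs
# [1] 1 1 1 1 2 2 2 2 2 3 3 3 3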

R commands for finding mode in R seem to be wrong

I watched a video on YouTube about finding the mode in R from a list of numerics. When I enter the commands they do not work; R does not even give an error message. The vector is
X <- c(1,2,2,2,3,4,5,6,7,8,9)
Then the instructor says to use
temp <- table(as.vector(x))
to basically sort all unique values in the list. R should give me 1,2,3,4,5,6,7,8,9 from this command, but nothing happens, although when the instructor does it this list is given. Then he says to use the command
names(temp)[temp--max(temp)]
which basically should give me this: 1,3,1,1,1,1,1,1,1, where the 3 shows that the mode is 2 because it is repeated 3 times in the list. I would like to stay with these commands as far as possible, as the instructor explains them in detail. Am I making a typo or something?
You're kind of confused.
X <- c(1,2,2,2,3,4,5,6,7,8,9) ## define vector
temp <- table(as.vector(X))
to basically sort all unique values in list.
That's not exactly what this command does (sort(unique(X)) would give a sorted vector of the unique values; note that in R, lists and vectors are different kinds of objects, so it's best not to use the words interchangeably). What table() does is count the number of instances of each unique value (in sorted order); also, as.vector() is redundant here.
R should give me from this command 1,2,3,4,5,6,7,8,9 but nothing happens except when the instructor does it this list is given.
If you assign results to a variable, R doesn't print anything. If you want to see the value of a variable, type the variable's name by itself:
temp
you should see
1 2 3 4 5 6 7 8 9
1 3 1 1 1 1 1 1 1
The first row shows the labels (unique values); the second shows the counts.
Then he says to use command, names(temp)[temp--max(temp)] which basically should give me this: 1,3,1,1,1,1,1,1,1 where 3 shows that the mode is 2 because it is repeated 3 times in list.
No. You already have the sequence of counts stored in temp. You should have typed
names(temp)[temp==max(temp)]
(note ==, not --) which should print
[1] "2"
i.e., this is the mode. The logic here is that temp == max(temp) gives you a logical vector (a vector of TRUE and FALSE values) that is TRUE only for the elements of temp equal to the maximum value; names(temp)[temp == max(temp)] then selects the elements of the names vector (the first row shown in the printout of temp above) that correspond to TRUE values.
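Wrapped up as a small helper (a sketch of the same idea, not taken from the video; note it returns all modes when there is a tie):
Mode <- function(v) {
  tab <- table(v)                          # count each unique value
  as.numeric(names(tab)[tab == max(tab)])  # keep the most frequent value(s)
}
Mode(c(1,2,2,2,3,4,5,6,7,8,9))
# [1] 2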

Adding numbers within a vector in r

I have a vector
v<-c(1,2,3)
I need to add the numbers in the vector in the following fashion:
1,1+2,1+2+3
producing a second vector
v1<-c(1,3,6)
This is probably quite simple...but I am a bit stuck.
Use the cumulative sum function:
cumsum(v)
#[1] 1 3 6
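For the record, the same running total can also be written with Reduce, though cumsum is the idiomatic (and faster) choice:
Reduce(`+`, v, accumulate = TRUE)
# [1] 1 3 6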

calculating sums of unique values in a log in R

I have a data frame with three columns, timestamp, key, and event, ordered by time.
ts,key,event
3,12,1
8,49,1
12,42,1
46,12,-1
100,49,1
From this, I want to create a data frame with timestamp and prob, where prob is (the number of unique keys seen so far, minus the number of unique keys whose cumulative event sum is 0 up to that timestamp) divided by the number of unique keys seen so far. E.g. for the above example the result should be:
ts,prob
3,1
8,1
12,1
46,2/3
100,2/3
My initial step is to calculate the cumulative sum grouped by key:
library(plyr)
items = data.frame(ts=c(3,8,12,46,100), key=c(12,49,42,12,49), event=c(1,1,1,-1,1))
sumByKey = ddply(items, .(key), transform, sum=cumsum(event))
In the second (and final) step I iterate over sumByKey with a for loop and keep track of both all unique keys and all unique keys that have a 0 in their sum, using vectors, e.g. if (!(k %in% uniqueKeys)) uniqueKeys <- append(uniqueKeys, k). The prob is derived using the two vectors.
Initially, I tried to solve the second step using plyr, but I wanted to avoid re-calculating the unique keys up to a certain timestamp for each row in sumByKey. What I'm missing is a way either to refer to external variables from a function passed to ddply, or, alternatively (and more functionally), to use an accumulator passed back into the function, e.g. function(acc, x) acc + x.
Is it possible to solve the second step in a better way, using e.g. ddply?
If my interpretation is right, then this should do it:
items = data.frame(ts=c(3,8,12,46,100), key=c(12,49,42,12,49), event=c(1,1,1,-1,1))
# number of keys whose events sum to zero up to each row; no ddply necessary
nzero <- cumsum(ave(items$event, items$key, FUN=cumsum) == 0)
# number of unique keys seen up to each row
nunique <- rep(FALSE, length(items$key))
nunique[match(unique(items$key), items$key)] <- TRUE
nunique <- cumsum(nunique)
# which makes:
items$p <- (nunique - nzero) / nunique
items
ts key event p
1 3 12 1 1.0000000
2 8 49 1 1.0000000
3 12 42 1 1.0000000
4 46 12 -1 0.6666667
5 100 49 1 0.6666667
If your problem is only computational time, a better idea may be to implement your algorithm as a chunk of C code; you can first use R to convert the keys to a contiguous range of integers (as.numeric(factor(...))) and then use a boolean array in C to track unique keys easily and very fast. Remember that neither plyr nor the standard R *apply functions are significantly faster than loops (provided both are used without embarrassing errors, of course).
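A minimal sketch of the re-indexing step mentioned there, using the items data frame defined earlier: as.numeric(factor(...)) maps the arbitrary key values onto 1..n, which is exactly what a C-side boolean array would index by.
keyId <- as.numeric(factor(items$key))  # 12 -> 1, 42 -> 2, 49 -> 3
keyId
# [1] 1 3 2 1 3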
