Using tapply or by with non-default settings of a function

I have been searching Stack Overflow for hours hoping to find something I guessed was self-evident, but nobody seems to have asked it (which might mean it is indeed self-evident).
I want to use tapply or by to find the first time a specific event occurs in a data frame (the first non-zero value). The way I did this before was via
max.col(df, ties.method = c("first"))
But somehow this does not work when used in conjunction with either tapply or by. Here's some example data:
FIRM<-as.vector(sample(c("a","b","c","d"),100,replace=T))
MOMENT<-as.vector(sample((1990:1995),100,replace=T))
EVENT<-as.vector(sample(c("x12","x43","x35","y71","y81","xy1","xy67","yy123","xx901"),100,replace=T))
OCCURENCE<-as.vector(sample(c(0,1),100,replace=T))
m<-as.data.frame(cbind(FIRM,MOMENT,EVENT,OCCURENCE))
So here is what I tried, which did not work:
tapply(m[,4],m[,3],max.col) # This gives just 1s for every EVENT, with the length of the resulting vector equal to the number of EVENTs in the dataset
tapply(m[,4],m[,3],max.col(m, ties.method=c("first")))
# Error in match.fun(FUN) :
#   'max.col(m, ties.method = c("first"))' is not a function, character or symbol
# In addition: Warning message:
#   In max.col(m, ties.method = c("first")) : NAs introduced by coercion
Number 2 is really the crux of the problem. For reasons unclear to me, max.col is not recognised as a function once you change the default tie-breaking method (i.e. "random") to the one I need (i.e. "first").
Additionally, I'd want to be able to find the year in which the first non-zero value occurs.
I think a sensible alternative would be to multiply the MOMENT column by the OCCURENCE column (call that ID), look for the first non-zero value in ID (for each factor EVENT), keep that ID value, and turn the other values into zero:
m$MOMENT<-as.numeric(as.character(m$MOMENT))
m$OCCURENCE<-as.numeric(as.character(m$OCCURENCE))
m[,"ID"]<-m$MOMENT * m$OCCURENCE
I have tried to code this with a function containing a while and an if statement and using break, but it does not work:
tapply(m$ID,m$EVENT, function(x) m$ID[i]<- while (m$ID[i] == 0) {m$ID[i]
if (m$ID[i]>0) {m$YEAR[i] && break }})
The idea here was to iterate the function over EVENT while m$ID == 0, then change the value and break once m$ID > 0. It didn't work...
Any ideas on how to fix this (or much simpler solutions)?

The FUN argument of tapply must be a function, but the code in the question supplies a function call, not a function. Try this:
tapply(m[,4], m[,3], max.col, ties.method = "first")
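In general, any arguments given after FUN are passed on through tapply's ... to the function. A quick illustration of the pattern with mean and na.rm (toy data, unrelated to m):
tapply(c(1, NA, 3, 4), c("a", "a", "b", "b"), mean, na.rm = TRUE)
#   a   b
# 1.0 3.5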
To also get the year, compute a logical indicator of the first row within each event that has a 1 in the OCCURENCE column, then select those rows:
o <- order(m$EVENT, m$MOMENT) # omit this and next line if already ordered
m <- m[o,]
is.first <- ave(m$OCCURENCE == 1, m$EVENT, FUN = function(x) x & !duplicated(x))
m[is.first, ]
REVISED: ordered by event and year.
Note that if it is possible that there are events with only zeros, then such events will be omitted entirely from m[is.first, ].
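A quick check on a tiny deterministic example (hypothetical data, not the random m above), including an all-zero event to show it being dropped:
ex <- data.frame(EVENT = c("e1","e1","e2","e2","e3"),
                 MOMENT = c(1990, 1991, 1990, 1991, 1990),
                 OCCURENCE = c(0, 1, 1, 1, 0))
is.first <- ave(ex$OCCURENCE == 1, ex$EVENT, FUN = function(x) x & !duplicated(x))
ex[is.first, ]  # e3 (all zeros) does not appear
#   EVENT MOMENT OCCURENCE
# 2    e1   1991         1
# 3    e2   1990         1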

I'm not quite sure what you are trying to achieve, so here is only some coding advice.
First of all, you need to read help("tapply") to learn how to pass arguments to the function that is passed to tapply:
tapply(m[,4],m[,3],max.col, ties.method="first")
However, I doubt this does what you need. Maybe something like this would be useful:
m<-data.frame(FIRM,MOMENT,EVENT,OCCURENCE)
#note how I create the data.frame in a different way
#in order to avoid coercing all columns to factors
tapply(m[,4],m[,3],which.max)
#   x12   x35   x43 xx901   xy1  xy67   y71   y81 yy123
#     2     1     2     3     1     1     3     1     1
tapply(m[,4],m[,3],function(x) m[which.max(x), "MOMENT"])
#   x12   x35   x43 xx901   xy1  xy67   y71   y81 yy123
#  1995  1995  1995  1991  1995  1995  1991  1995  1995

Related

Cannot fill my R for loop

So I am working in R with a matrix like the following:
diff_0
            SubPop0-1    SubPop1-1   SubPop2-1     SubPop3-1 SubPop4-1
SubPop0-1          NA           NA          NA            NA        NA
SubPop1-1 0.003403100           NA          NA            NA        NA
SubPop2-1 0.005481177 -0.002070277          NA            NA        NA
SubPop3-1 0.002216444  0.005946314 0.001770977            NA        NA
SubPop4-1 0.010344845  0.007151529 0.004237316 -0.0021275130        NA
... but bigger ;-).
This is a matrix of pairwise genetic differentiation between each SubPop from 0 to 4. I would like to obtain a mean differentiation value for each SubPop.
For instance, for SubPop-0, the mean would just correspond to the mean of the 4 values in column 1. However, for SubPop-2, it would be the mean of the 2 values in row 3 and the 2 values in column 3, since this is a half-matrix.
I wanted to write a for loop to compute each mean value for each SubPop, taking this into account. I tried the following:
Mean <- for (r in 1:nrow(diff_0)) {
mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T))
}
First, this isolates each row and column of index [r], whose values refer to the same SubPop r. 'sum' enables me to gather these values and eliminate the NAs. Finally, I get the mean value for SubPop r. I was hoping my for loop would give me one value for each index r, i.e. one per SubPop.
However, even though mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T)), run alone with a fixed r value between 1 and 5, does give me what I want, the for loop itself only returns an empty vector.
Something like for (r in 1:nrow(diff_0)) { print(diff_0[r,1]) } also works, so I do not understand what is going on.
This is a trivial question but I could not find an answer on the internet! Although I know I am probably missing the obvious :-)...
Thank you very much,
Cheers!
Okay, based on what you want to do (and if I understand everything correctly), there are several ways of doing this.
The one that comes to mind now is just turning your lower triangular matrix into a full matrix (i.e. filling the upper triangle with the transpose of the lower triangle) and then taking row- or column-wise means.
My R is running on something else right now, so I can't check my code, but this should work:
diff <- diff_0
diff[upper.tri(diff)] <- t(diff)[upper.tri(diff)]  # mirror the lower triangle into the upper
As I said, my R is running right now so I can't check the correctness of the last line - I might be confused with some transposes there, so I'd appreciate any feedback on whether it actually worked or not.
You can then either set the diagonal values to 0 or, alternatively, add na.rm = TRUE to the mean statement:
mean_diffs = apply(diff,2,FUN = function(x)mean(x, na.rm = TRUE))
That should work.
Also: yes, your code does not work because the assignment is not inside the for loop. A for loop in R returns NULL, so Mean <- for (...) assigns nothing. This should work:
means <- rep(NA, nrow(diff_0))
for (r in 1:length(means)) {
  means[r] <- mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm = TRUE))
}
But in general, for loops are not what you want to use in R.
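For instance, the same computation written with sapply instead of an explicit loop (a sketch, assuming diff_0 is the square numeric matrix shown above):
# for each SubPop r, combine row r and column r, then average the pairwise sums
means <- sapply(seq_len(nrow(diff_0)), function(r)
  mean(apply(cbind(diff_0[r, ], diff_0[, r]), 1, sum, na.rm = TRUE)))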
This may be a solution...
for(i in 1:nrow(diff_0)) {
  k <- mean(cbind(as.numeric(diff_0[,i]), as.numeric(diff_0[i,])), na.rm = TRUE)
  if(i == 1) {
    data_mean <- k
  } else {
    data_mean <- rbind(data_mean, k)
  }
}
colnames(data_mean) <- "mean"
rownames(data_mean) <- c("SubPop0","SubPop1","SubPop2","SubPop3","SubPop4")
data_mean
                mean
SubPop0  0.005361391
SubPop1  0.003607666
SubPop2  0.002354798
SubPop3  0.001951555
SubPop4  0.004901544

Apply which.min to data.table under a condition

I have a data.table and need to know the index of the row containing a minimal value under a given condition. Simple example:
library(data.table)
dt <- data.table(i=11:13, val=21:23)
#     i val
# 1: 11 21
# 2: 12 22
# 3: 13 23
Now, suppose I'd like to know in which row val is minimal under the condition i>=12, which is 2 in this case.
What didn't work:
dt[i>=12, which.min(val)]
# [1] 1
returns 1, because within dt[i>=12] it is the first row.
Also
dt[i>=12, .I[which.min(val)]]
# [1] 1
returned 1, because .I is only supposed to be used with grouping.
What did work:
To apply .I correctly, I added a grouping column:
dt[i>=12, g:=TRUE]
dt[i>=12, .I[which.min(val)], by=g][, V1]
# [1] 2
Note, that g is NA for i<12, thus which.min excludes that group from the result.
But, this requires extra computational power to add the column and perform the grouping. My productive data.table has several millions of rows and I have to find the minimum very often, so I'd like to avoid any extra computations.
Do you have any idea, how to efficiently solve this?
But, this requires extra computational power to add the column and perform the grouping.
So, keep the data sorted by it if it's so important:
setorder(dt, val)
dt[.(i_min = 12), on=.(i >= i_min), mult="first", which = TRUE]
# 2
This can also be extended to check more threshold i values. Just give a vector in i_min =:
dt[.(i_min = 9:14), on=.(i >= i_min), mult="first", which = TRUE]
# [1] 1 1 1 2 3 NA
How it works
x[i, on=, ...] is the syntax for a join.
i can be another table or equivalently a list of equal-length vectors.
.() is a shorthand for list().
on= can have inequalities for a "non-equi join".
mult= can determine what happens when a row of i has more than one match in x.
which=TRUE will return row numbers of x instead of the full joined table.
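Putting those pieces together, here is the same lookup written stepwise on the dt above (a sketch; thresholds is a hypothetical name):
# the i-table: one row per threshold value to look up
thresholds <- data.table(i_min = 12)
# non-equi join: rows of dt with i >= i_min; mult="first" keeps the first
# match and which=TRUE returns its row number instead of the joined row
dt[thresholds, on = .(i >= i_min), mult = "first", which = TRUE]
# [1] 2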
You can use the fact that which.min will ignore NA values to "mask" the values you don't want to consider:
dt[,which.min(ifelse(i>=12, val, NA))]
As a simple example of this behavior, which.min(c(NA, 2, 1)) returns 3, because the 3rd element is the min among all the non-NA values.
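Applied to the example table from the top of the question (recreated here for a self-contained check), the masked which.min returns the row number in the original dt:
library(data.table)
dt <- data.table(i = 11:13, val = 21:23)
# i < 12 becomes NA, so only rows with i >= 12 compete for the minimum
dt[, which.min(ifelse(i >= 12, val, NA))]
# [1] 2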

Excluding a number of answers from a R dataframe

I'm looking for a way to exclude a number of answers from a length function.
This is a follow-on question from "Getting R Frequency counts for all possible answers". In SQL the syntax could be:
select * from someTable
where variableName not in ( 0, null )
Given
Id <- c(1,2,3,4,5)
ClassA <- c(1,NA,3,1,1)
ClassB <- c(2,1,1,3,3)
R <- c(5,5,7,NA,9)
S <- c(3,7,NA,9,5)
df <- data.frame(Id,ClassA,ClassB,R,S)
ZeroTenNAScale <- c(0:10,NA);
R.freq = setNames(nm=c('R','freq'),data.frame(table(factor(df$R,levels=ZeroTenNAScale,exclude=NULL))));
S.freq = setNames(nm=c('S','freq'),data.frame(table(factor(df$S,levels=ZeroTenNAScale,exclude=NULL))));
length(S.freq$freq[S.freq$freq!=0])
# 5
How would I change
length(S.freq$freq[S.freq$freq!=0])
to get an answer of 4 by excluding 0 and NA?
We can use colSums:
colSums(!is.na(S.freq)[S.freq$freq!=0,])[[1]]
#[1] 4
You can use sum to calculate the sum of integers. If NAs are found in your column, you could use na.rm = TRUE; however, because the NA is located in a different column, you first need to remove the row containing it.
Our solution is as follows: we remove the rows containing NA by subsetting S.freq[!is.na(S.freq$S),], but we also need the second column, freq:
sum(S.freq[!is.na(S.freq$S), "freq"])
# 4
You can try na.omit (to remove NAs) and subset (to get rid of all lines where freq equals 0):
subset(na.omit(S.freq), freq != 0)
   S freq
4  3    1
6  5    1
8  7    1
10 9    1
From here, that's straightforward:
length(subset(na.omit(S.freq), freq != 0)$freq)
[1] 4
Does it solve your problem?
Just add !is.na(S.freq$S) as a second filter:
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
If you want to extend it with other conditions, you could make an index vector first for readability:
idx <- S.freq$freq!=0 & !is.na(S.freq$S)
length(S.freq$freq[idx])
You're looking for values with frequency > 0; that means you're looking for unique values. You get this information directly from vector S:
length(unique(df$S))
and leaving NA aside you get answer 4 by:
length(unique(df$S[!is.na(df$S)]))
Regarding your question on how to exclude a number of items based on their value:
In R this is easily done with logical vectors, as you already used in your code:
length(S.freq$freq[S.freq$freq!=0])
You can combine different conditions into one logical vector and use it for subsetting, e.g.
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$freq)])

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain, but here's an example (the original post showed the data as images; this small reconstruction follows the answers below):
   LOG_MESSAGE
1  FIRST_EVENT
2
3 SECOND_EVENT
4
5
Now, I'd like to take that and turn it into this:
   LOG_MESSAGE CURRENT_EVENT
1  FIRST_EVENT   FIRST_EVENT
2                FIRST_EVENT
3 SECOND_EVENT  SECOND_EVENT
4               SECOND_EVENT
5               SECOND_EVENT
Doing so will enable me to split the data up by the current event. In any other language I would jump straight into a for loop to do this, but I know that R isn't great with loops of that type, and in this case I have hundreds of thousands of rows of data to sort through, so I'm wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples of how to use it.
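For reference, a minimal illustration of na.locf ("last observation carried forward") on a character vector, a sketch using the event layout from the question:
library(zoo)
x <- c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA)
na.locf(x)  # each NA is replaced by the most recent non-NA value
# [1] "FIRST_EVENT"  "FIRST_EVENT"  "SECOND_EVENT" "SECOND_EVENT" "SECOND_EVENT"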
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34, 56, 78, 98, 234),
                  log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <- transform(dat,
                 Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
                                                 "_"),
                                        `[`, 1))
Gives
> dat
  ID sample_value  log_message Current_Event
1  1           34  FIRST_EVENT         FIRST
2  2           56         <NA>         FIRST
3  3           78 SECOND_EVENT        SECOND
4  4           98         <NA>        SECOND
5  5          234         <NA>        SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of step 1 is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors,
4. so I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
The whole thing is wrapped in transform() so (i) I don't need to refer to dat$ and (ii) I can add the result as a new variable directly to the data dat.

dataframe where one column only has na values omitted

I have a data frame "accdata".
dim(accdata)
[1] 6496 188
One of the variables, VAL, is of interest to me. I must calculate the number of instances where VAL is equal to 24.
I tried a few functions that returned error messages. After some research, it seems I need to remove the NA values from VAL first.
I would try something like nonaaccdaa <- na.omit(accdata), except this removes rows with an NA in any variable, not just VAL.
I tried nonaval <- na.omit(accdata[accdata$VAL]) but when I then checked the number of rows using nrow, the result was NULL. I had expected a value between 1 and 6,496.
What's up here?
This should do the trick:
sum(accdata$VAL == 24, na.rm=TRUE)
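As for why the original attempt failed: accdata[accdata$VAL] uses the values of VAL as column indices rather than as a row filter. If you do want a copy of the data with only the VAL NAs dropped, subset the rows instead (a sketch):
# keep rows where VAL is not missing, then count the 24s
nonaval <- accdata[!is.na(accdata$VAL), ]
nrow(nonaval)           # rows with a non-missing VAL
sum(nonaval$VAL == 24)  # same count as the one-liner above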
