pLepage function in R

Here is a self-defined function for computing the Lepage D statistic, which returns a result different from the D statistic produced by NSM3::pLepage():
LepageD <- function(x, y){
  m = length(x); n = length(y); N = m + n
  z = sort(c(x, y), index = TRUE)
  rz = seq(1, (N - 1) / 2); rz = c(rz, (N + 1) / 2, rev(rz))
  r = rz[sort(z$ix, index = TRUE)$ix]
  C = sum(r[12:21])
  rk = rank(c(x, y))
  W = sum(rk[12:21])
  Wstar = (W - n * (N + 1) / 2) / sqrt(m * n * (N + 1) / 12)
  Cstar = (C - n * ((N + 1)^2) / (4 * N)) / sqrt(m * n * (N + 1) * (3 + N^2) / (48 * (N^2)))
  D = Wstar^2 + Cstar^2
  D
}
> LepageD(1:10, 2:12)
[1] 1.09216
> pLepage(1:10, 2:12)$obs.stat
[1] 1.112263
My function also cannot handle the situation where x and y have the same sample size.
> LepageD(1:10, 2:11)
[1] NA
I'm confused about where I went wrong.

As far as I can tell, the problem lies in this line:
r=rz[sort(z$ix,index=TRUE)$ix]
The error originates in this line because, in the test case that returns NA, z$ix has 20 elements.
So sort(z$ix, index=TRUE)$ix produces:
1 2 4 6 8 10 12 14 16 18 3 5 7 9 11 13 15 17 19 20
Also, the length of vector rz is 19 (and not 20).
Content of rz vector:
[1] 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.5 9.0 8.0 7.0 6.0 5.0
[16] 4.0 3.0 2.0 1.0
So, when we try to access the 20th element of vector rz, it produces NA.
Since the sums are computed without the na.rm = TRUE argument, C and W become NA.
That in turn makes Wstar, Cstar, and ultimately D become NA.
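If the intended score for rank i is the Ansari-Bradley-style min(i, N + 1 - i) (an assumption on my part; the question does not say), here is a minimal sketch of a score construction that has length N for both odd and even N, and so avoids the NA:
ab_scores <- function(N) pmin(seq_len(N), N + 1 - seq_len(N))  # sketch: score for rank i is min(i, N + 1 - i)
ab_scores(21)  # 1 2 ... 10 11 10 ... 2 1  -- identical to rz when N is odd
ab_scores(20)  # 1 2 ... 10 10 ... 2 1     -- length 20, so indexing never produces NA
Note also that the Ansari-Bradley null mean and variance have different formulas for even and odd N, so the standardization in Cstar would need the even-N form as well.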

Related

What does note_ind:ncol(dataset) mean in R?

I have this line of code, but I don't know what it means, especially the note_ind part.
apply(mydat[,-c(1,2,3,note_ind:ncol(dataset))],c(1,2),as.numeric)
The notation x:y creates a numeric sequence in which each element is the previous element incremented by 1 (or decremented by 1 when x > y). For x <= y it is shorthand for seq(x, y, by = 1). It is most commonly used for integer sequences, but it works on doubles as well.
1:10
[1] 1 2 3 4 5 6 7 8 9 10
1.1:10.1
[1] 1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1
1.5:10.2 # stops at 9.5 because the next value, 10.5, would exceed 10.2 - seq() behaves the same way
[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
Presumably note_ind is an integer value defined somewhere else in your code. ncol(dataset) is the number of columns, so note_ind:ncol(dataset) generates a sequence between those two values, incrementing by 1 for each element.
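As a purely hypothetical illustration (note_ind and mydat are not defined in the question, and the original line mixes mydat and dataset; the sketch below uses mydat for both): if note_ind were 8 in a 10-column data frame, the call would drop columns 1, 2, 3 and 8 through 10, then coerce every remaining cell to numeric.
mydat <- data.frame(id = "a", site = "b", note = "c",
                    v1 = "1", v2 = "2", v3 = "3", v4 = "4",
                    n1 = "x", n2 = "y", n3 = "z",
                    stringsAsFactors = FALSE)
note_ind <- 8  # hypothetical value
apply(mydat[ , -c(1, 2, 3, note_ind:ncol(mydat))], c(1, 2), as.numeric)
##   v1 v2 v3 v4
## 1  1  2  3  4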

R: Creating an index vector

I need some help with R coding here.
The data set Glass consists of 214 rows of data in which each row corresponds to a glass sample. Each row consists of 10 columns. When viewed as a classification problem, column 10 (Type) specifies the class of each observation/instance. The remaining columns are attributes that might be used to infer column 10. Here is an example of the first row:
RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0 1
First, I cast column 10 so that R interprets it as a factor instead of an integer value.
Now I need to create a vector with indices for all observations (it must have values 1-214); this is needed to set up training data for Naive Bayes. I know how to create a vector with 214 values, but not one that holds specific indices for the observations in a data frame. Thanks!
I'm not totally sure that I get what you're trying to do... so please forgive me if my solution isn't helpful. If your data frame is named df, just use the dplyr package to reorder your columns and write
library(dplyr)
df['index'] <- 1:214
df <- df %>% select(index,everything())
Here's an example. So that I can post full dataframes, my dataframes will only have 10 rows...
Let's say my dataframe is:
df <- data.frame(col1 = c(2.3,6.3,9.2,1.7,5.0,8.5,7.9,3.5,2.2,11.5),
                 col2 = c(1.5,2.8,1.7,3.5,6.0,9.0,12.0,18.0,20.0,25.0))
So it looks like
col1 col2
1 2.3 1.5
2 6.3 2.8
3 9.2 1.7
4 1.7 3.5
5 5.0 6.0
6 8.5 9.0
7 7.9 12.0
8 3.5 18.0
9 2.2 20.0
10 11.5 25.0
If I want to add another column that is just 1,2,3,4,5,6,7,8,9,10 and call it 'index', I could do this:
library(dplyr)
df['index'] <- 1:10
df <- df %>% select(index, everything())
That will give me
index col1 col2
1 1 2.3 1.5
2 2 6.3 2.8
3 3 9.2 1.7
4 4 1.7 3.5
5 5 5.0 6.0
6 6 8.5 9.0
7 7 7.9 12.0
8 8 3.5 18.0
9 9 2.2 20.0
10 10 11.5 25.0
Hope this will help
Alternatively, a base R one-liner:
df$ind <- seq.int(nrow(df))
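A hypothetical follow-on (not part of either answer): once the index vector exists, a common way to use it for Naive Bayes training data is to sample row indices, e.g. a 70/30 split (the proportion here is arbitrary):
set.seed(1)                                        # for reproducibility
idx <- seq_len(nrow(Glass))                        # indices 1 to 214
train_idx <- sample(idx, size = round(0.7 * nrow(Glass)))
train <- Glass[train_idx, ]                        # training rows
test  <- Glass[-train_idx, ]                       # held-out rows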

subsetting closest values in a column based on a binary column in a data frame in R

I have a data frame with 85 rows and 35 columns which is sorted by the age column, like below:
No Gender Age
1 F 5.8
2 F 5.9
3 F 6
4 M 6.2
5 F 7
6 F 7.2
7 M 7.4
8 M 7.8
9 M 7.9
10 M 8.1
11 F 8.3
12 F 8.6
13 M 8.9
14 M 9
15 F 9.2
16 F 9.3
I need to subset the closest ages across the two genders, like below:
No Gender Age
1 F 6
2 M 6.2
3 F 7.2
4 M 7.4
5 M 8.1
6 F 8.3
7 F 8.6
8 M 8.9
9 M 9
10 F 9.2
Ok, I think I got this. It was surprisingly difficult, and maybe someone else will be able to come up with a more elegant solution, but here's what I got:
df <- data.frame(No=c(1L,2L,3L,4L,5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,16L),Gender=c('F','F','F','M','F','F','M','M','M','M','F','F','M','M','F','F'),Age=c(5.8,5.9,6,6.2,7,7.2,7.4,7.8,7.9,8.1,8.3,8.6,8.9,9,9.2,9.3),stringsAsFactors=F);
mls <- df$Gender=='M';
mages <- df$Age[mls];
fages <- df$Age[!mls];
fisLower <- findInterval(mages,fages);
TOL <- 1e-5;
fisClosest <- fisLower+ifelse(fisLower==0L | fisLower<length(fages) & mages-fages[replace(fisLower,fisLower==0L,NA)]>fages[fisLower+1L]-mages+TOL,1L,0L);
mis <- unname(tapply(seq_along(mages),fisClosest,function(is) is[which.min(abs(mages[is]-fages[fisClosest[is[1L]]]))]));
fis <- unique(fisClosest);
df[sort(c(which(mls)[mis],which(!mls)[fis])),];
## No Gender Age
## 3 3 F 6.0
## 4 4 M 6.2
## 6 6 F 7.2
## 7 7 M 7.4
## 10 10 M 8.1
## 11 11 F 8.3
## 12 12 F 8.6
## 13 13 M 8.9
## 14 14 M 9.0
## 15 15 F 9.2
Explanation of variables:
df The input data.frame.
mls "male logicals": A logical vector representing which elements of df$Gender are male.
mages "male ages": The subset of df$Age for male rows.
fages "female ages": The subset of df$Age for female rows.
fisLower "female indexes lower": For each element of mages, this has the index into fages of the female age that lies just below (or possibly equal to) the male age. This could be zero if fages has no ages below the element of mages. Hence this vector is "parallel" to mages, meaning it's the same length and the elements correspond to each other.
TOL "tolerance" This was a necessary annoyance to prevent spurious floating-point comparison errors in the following statement.
fisClosest "female indexes closest" This is a simple transformation of fisLower. Basically, we must add 1L to each element of fisLower if the corresponding element of mages is actually closer to the subsequent element of fages (the "upper" one) rather than the one pointed to by the corresponding element of fisLower (the "lower" one). This must be done for two cases: (1) zero elements of fisLower, and (2) where the element of fisLower points to a non-last element of fages and the element of mages is actually closer to the subsequent element of fages.
mis "male indexes" First of all, understand that fisClosest may contain duplicates if multiple male ages have the same female age as their closest, IOW there is no other female age closer to that male age, for all of them. For each of these conflicts, we must find the one male age that is closest to the female age from the set of male ages. This requires a vector aggregation for which tapply() is appropriate. We group by fisClosest, passing mages indexes into the lambda, where we call which.min() on the absolute differences between the ages to get the winning male age, and return its index.
fis "female indexes" This is simply the unique set of indexes into fages which we need to select from df; we get this from fisClosest by removing duplicates.
At this point we can finally convert from mages and fages indexes (mis and fis) to df row indexes by indexing the appropriate respective polarities of mls. After combining and sorting the two index sets, we can finally index df to get the required output.
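As a small added illustration (not from the original answer), this is what findInterval() returns on the example's male and female ages: for each male age, the number of female ages less than or equal to it, i.e. the index of the female age just below it (0 if there is none).
mages <- c(6.2, 7.4, 7.8, 7.9, 8.1, 8.9, 9.0)            # male ages from the example
fages <- c(5.8, 5.9, 6.0, 7.0, 7.2, 8.3, 8.6, 9.2, 9.3)  # female ages from the example
findInterval(mages, fages)
## [1] 3 5 5 5 5 7 7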
Original (Incorrect) Solution
It looks like you want the first and last row of each run length, excepting the first and last row of the entire data.frame. Here's one way to achieve that:
df <- data.frame(No=c(1L,2L,3L,4L,5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,16L),Gender=c('F','F','F','M','F','F','M','M','M','M','F','F','M','M','F','F'),Age=c(5.8,5.9,6,6.2,7,7.2,7.4,7.8,7.9,8.1,8.3,8.6,8.9,9,9.2,9.3),stringsAsFactors=F);
x <- cumsum(rle(df$Gender)$lengths); df2 <- df[unique(c(rbind(c(1L,x[-length(x)]+1L),x))),];
df2 <- df2[-c(1L,nrow(df2)),]; ## remove first and last row from original data.frame
df2;
## No Gender Age
## 3 3 F 6.0
## 4 4 M 6.2
## 5 5 F 7.0
## 6 6 F 7.2
## 7 7 M 7.4
## 10 10 M 8.1
## 11 11 F 8.3
## 12 12 F 8.6
## 13 13 M 8.9
## 14 14 M 9.0
## 15 15 F 9.2
I think you missed the F 7.0 row in your expected output; other than that, this gets the same set of rows. If you want to fix up No to be sequential from 1, you can run df2$No <- seq_len(nrow(df2)). Ditto for the row names (with rownames(df2) on the LHS).
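For completeness, a tiny sketch of that suggested cleanup (assuming df2 from above):
df2$No <- seq_len(nrow(df2))         # renumber No from 1
rownames(df2) <- seq_len(nrow(df2))  # ditto for the row names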

R: Mean of subvectors based on repeats in another vector

I am trying to make two subvectors of equal length from two vectors of equal length.
Values in the first vector are ordered as follows:
a<-c(9,9,9,8,8,7,6,5,5,5)
The second vector is random, but let's take
b<-c(1,2,3,4,5,6,7,8,9,10)
The first subvector is simple: it is just the vector a without repeats
f(a)<-c(9,8,7,6,5)
The second subvector should be made as follows:
for a single value in vector a (no repeats in a), g(b) takes the value of b at the corresponding position. For repeats in a, the g(b) value should be the mean of the values at the corresponding positions of b. So:
g(b)<-c(mean(c(1,2,3)), mean(c(4,5)), 6, 7, mean(c(8,9,10)))
I have no idea where to start. Thanks for the help!
tapply is the function you want. See ?tapply to see how it works. Here:
res<-tapply(b,a,mean)
# 5 6 7 8 9
#9.0 7.0 6.0 4.5 2.0
If you want to preserve the order:
tapply(b,a,mean)[as.character(unique(a))]
# 9 8 7 6 5
#2.0 4.5 6.0 7.0 9.0
As you can see, it gives the unique values of a and for each of them, the desired function (in this case mean(b)) is evaluated.
We can also use ave
unique(ave(b, a))
#[1] 2.0 4.5 6.0 7.0 9.0
Or another option would be to convert the 'b' to factor with levels specified
tapply(b, factor(a, levels=unique(a)), FUN=mean)
# 9 8 7 6 5
#2.0 4.5 6.0 7.0 9.0
You can do it this way:
uniqueA <- a[!duplicated(a)] # or simply unique(a) but I'm not sure about order preservation
uniqueB <- as.numeric(by(b,match(a,uniqueA),mean))
> uniqueA
[1] 9 8 7 6 5
> uniqueB
[1] 2.0 4.5 6.0 7.0 9.0

Computing a "rightmost" moving average?

I would like to compute a moving average (ma) over some time series data, but I would like the ma window of order n to be anchored at the right end of my series, so that the last ma value corresponds to the ma of the last n values of the series. The desired function rightmost_ma would produce this output:
data <- seq(1,10)
> data
[1] 1 2 3 4 5 6 7 8 9 10
rightmost_ma(data, n=2)
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I reviewed the different ma options, e.g. the forecast package, and could not find how to cover this use case. Note that the critical requirement for me is to have valid, non-NA ma values for the last elements of the series; in other words, I want my ma to produce valid results without "looking into the future".
Take a look at the rollmean function from the zoo package:
> library(zoo)
> rollmean(zoo(1:10), 2, align ="right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
You can also use rollapply:
> rollapply(zoo(1:10), width=2, FUN=mean, align = "right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I think using stats::filter is less complicated, and might have better performance (though zoo is well written).
This:
filter(1:10, c(1,1)/2, sides=1)
gives:
Time Series:
Start = 1
End = 10
Frequency = 1
[1] NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
If you don't want the result to be a ts object, use as.vector on the result.
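For instance (a small added illustration, not from the original answer):
as.vector(stats::filter(1:10, c(1, 1)/2, sides = 1))
## [1]  NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5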
