Need help using the npmv package (nonparametric testing) in R

I'm trying to use Bathke's "npmv" package in R to run the nonpartest() function on a dataset that I created. In their paper they use the implemented code with the provided dataset 'sberry':
install.packages("npmv")
library(npmv)
data("sberry",package="npmv")
nonpartest(weight|bot|fungi|rating~treatment,data=sberry,permreps=1000)
This works perfectly for their dataset. However, when I try to run it on my CSV dataset, which has the exact same dimensions, it does not work: it keeps giving me the error "data set not found" and saying the sample size must be at least 2.
Their dataset is as follows:
Treatment Replicate Weight Botrytis Fungi Phomopsis
Kocide 1 6.9 4.1 17.24 1
Kocide 2 8.3 5.13 5.65 1
Kocide 3 8.4 6.07 8.8 1.5
Kocide 4 7.95 2.72 9.51 1.5
Elevate 1 8.6 1.19 17.06 1
Elevate 2 8.5 0.55 12.86 1
Elevate 3 8.2 0.74 6.76 0.5
Elevate 4 9.5 0.99 1.84 1
V-10135 1 6.2 4.29 4.64 1
V-10135 2 9 1.56 3.03 3
V-10135 3 6.8 0.88 5.6 0
V-10135 4 8.5 2.42 8.66 2
Control 1 7.5 15.6 13.08 1
Control 2 6.7 10.28 14.43 1
Control 3 8.7 13.29 10.92 2.5
Control 4 7.4 18.38 16.03 3
while mine is:
Treatment Replicate Weight_Loss Persistent Head_Size Salebarn_Q
LA 200 1 17.90 14.10 14.25 1.0
LA 200 2 19.30 15.30 2.56 1.0
LA 200 3 19.50 16.82 5.80 1.5
LA 200 4 18.94 12.70 7.51 1.5
Excede 1 19.60 11.20 14.52 1.0
Excede 2 19.50 10.54 9.83 1.0
Excede 3 19.10 10.83 3.82 0.5
Excede 4 20.40 11.00 0.04 1.0
Micotil 1 17.30 14.29 1.62 1.0
Micotil 2 20.00 11.65 0.13 3.0
Micotil 3 18.10 10.89 2.41 0.0
Micotil 4 19.50 12.43 5.93 2.0
Zoetis 1 18.50 25.48 10.08 1.0
Zoetis 2 17.60 20.12 11.93 1.0
Zoetis 3 19.70 23.29 7.93 2.5
Zoetis 4 18.50 28.32 13.08 3.0
(Zoetis being my control)
I tried the code
data("Cattle", package = "npmv")
nonpartest(Weight_Loss|Persistent|Head_Size|Salebarn_Q~Treatment,data=Cattle,permreps=1000)
Any idea how I can get the same test statistics for my dataset that they get for their example? Thanks in advance.

The data() call is meant for package-supplied datasets, not for ones you import yourself. I think you are misinterpreting a warning:
data(Cattle,package='npmv')
Warning message:
In data(Cattle, package = "npmv") : data set ‘Cattle’ not found
R distinguishes 'warnings' from 'errors', and I don't think yours is an error. I get no error when loading your data and running that function:
Cattle <- read.table(text=" Treatment Replicate Weight_Loss Persistent Head_Size Salebarn_Q
'LA 200' 1 17.90 14.10 14.25 1.0
'LA 200' 2 19.30 15.30 2.56 1.0
'LA 200' 3 19.50 16.82 5.80 1.5
'LA 200' 4 18.94 12.70 7.51 1.5
Excede 1 19.60 11.20 14.52 1.0
Excede 2 19.50 10.54 9.83 1.0
Excede 3 19.10 10.83 3.82 0.5
Excede 4 20.40 11.00 0.04 1.0
Micotil 1 17.30 14.29 1.62 1.0
Micotil 2 20.00 11.65 0.13 3.0
Micotil 3 18.10 10.89 2.41 0.0
Micotil 4 19.50 12.43 5.93 2.0
Zoetis 1 18.50 25.48 10.08 1.0
Zoetis 2 17.60 20.12 11.93 1.0
Zoetis 3 19.70 23.29 7.93 2.5
Zoetis 4 18.50 28.32 13.08 3.0", header=TRUE)
Here's the call
nonpartest(Weight_Loss|Persistent|Head_Size|Salebarn_Q~Treatment,data=Cattle,permreps=1000)
Hit <Return> to see next plot:
Hit <Return> to see next plot:
Hit <Return> to see next plot:
Hit <Return> to see next plot:
$results
Test Statistic df1 df2 P-value
ANOVA type test p-value 2.843 6.912 27.6479 0.023
McKeon approx. for the Lawley Hotelling Test NA NA NA NA
Muller approx. for the Bartlett-Nanda-Pillai Test NA NA NA NA
Wilks Lambda NA NA NA NA
Permutation Test p-value
ANOVA type test p-value 0.007
McKeon approx. for the Lawley Hotelling Test NA
Muller approx. for the Bartlett-Nanda-Pillai Test NA
Wilks Lambda NA
$releffects
Weight_Loss Persistent Head_Size Salebarn_Q
Excede 0.71875 0.15625 0.50000 0.30469
LA 200 0.43750 0.59375 0.53125 0.53125
Micotil 0.45312 0.37500 0.23438 0.53125
Zoetis 0.39062 0.87500 0.73438 0.63281
Warning message:
In nonpartest(Weight_Loss | Persistent | Head_Size | Salebarn_Q ~ :
Rank covariance matrix is singular, only ANOVA test returned


Why is t() returning a 'vector'?

Just trying to do some basic matrix algebra in R and I'm getting some weird results that I don't completely understand.
So, my data looks like this:
Wt LvrWt Dose Y
1 176 6.5 0.88 0.42
2 176 9.5 0.88 0.25
3 190 9.0 1.00 0.56
4 176 8.9 0.88 0.23
5 200 7.2 1.00 0.23
6 167 8.9 0.83 0.32
7 188 8.0 0.94 0.37
8 195 10.0 0.98 0.41
9 176 8.0 0.88 0.33
10 165 7.9 0.84 0.38
11 158 6.9 0.80 0.27
12 148 7.3 0.74 0.36
13 149 5.2 0.75 0.21
14 163 8.4 0.81 0.28
15 170 7.2 0.85 0.34
16 186 6.8 0.94 0.28
17 146 7.3 0.73 0.30
18 181 9.0 0.90 0.37
19 149 6.4 0.75 0.46
And here is the code I'm using:
# Creating the X matrix
Xmatrix <- subset(questionOneA, select = -c(Y))
Xmatrix <- matrix(Xmatrix)
Xmatrix <- sapply(Xmatrix, as.numeric)
is.numeric(Xmatrix)
# Transposing the x matrix
Xtranspose <- t(Xmatrix)
Xtranspose <- matrix(Xtranspose)
is.numeric(Xtranspose)
The output of Xmatrix seems correct:
V1 V2 V3
1 176 6.5 0.88
2 176 9.5 0.88
3 190 9.0 1.00
4 176 8.9 0.88
5 200 7.2 1.00
6 167 8.9 0.83
7 188 8.0 0.94
8 195 10.0 0.98
9 176 8.0 0.88
10 165 7.9 0.84
11 158 6.9 0.80
12 148 7.3 0.74
13 149 5.2 0.75
14 163 8.4 0.81
15 170 7.2 0.85
16 186 6.8 0.94
17 146 7.3 0.73
18 181 9.0 0.90
19 149 6.4 0.75
However, the output of Xtranspose seems strange to me:
V1
1 176.00
2 6.50
3 0.88
4 176.00
5 9.50
6 0.88
7 190.00
8 9.00
9 1.00
10 176.00
11 8.90
12 0.88
13 200.00
14 7.20
15 1.00
16 167.00
17 8.90
18 0.83
19 188.00
20 8.00
21 0.94
22 195.00
23 10.00
24 0.98
25 176.00
26 8.00
27 0.88
28 165.00
29 7.90
30 0.84
31 158.00
32 6.90
33 0.80
34 148.00
35 7.30
36 0.74
37 149.00
38 5.20
39 0.75
40 163.00
41 8.40
42 0.81
43 170.00
44 7.20
45 0.85
46 186.00
47 6.80
48 0.94
49 146.00
50 7.30
51 0.73
52 181.00
53 9.00
54 0.90
55 149.00
56 6.40
57 0.75
I was expecting an output with 3 rows and 19 columns. What's happened here that I'm not understanding?
Any help would be appreciated!
You should use as.matrix() instead of matrix() to convert a data frame to a matrix; this can also be done in fewer steps.
Xmatrix <- subset(questionOneA, select = -Y)
Xmatrix <- as.matrix(Xmatrix)
Xtranspose <- t(Xmatrix)
Xmatrix
# Wt LvrWt Dose
#1 176 6.5 0.88
#2 176 9.5 0.88
#3 190 9.0 1.00
#4 176 8.9 0.88
#5 200 7.2 1.00
#6 167 8.9 0.83
#7 188 8.0 0.94
#8 195 10.0 0.98
#9 176 8.0 0.88
#10 165 7.9 0.84
#11 158 6.9 0.80
#12 148 7.3 0.74
#13 149 5.2 0.75
#14 163 8.4 0.81
#15 170 7.2 0.85
#16 186 6.8 0.94
#17 146 7.3 0.73
#18 181 9.0 0.90
#19 149 6.4 0.75
Xtranspose
# 1 2 3 4 5 6 7 8
#Wt 176.00 176.00 190 176.00 200.0 167.00 188.00 195.00
#LvrWt 6.50 9.50 9 8.90 7.2 8.90 8.00 10.00
#Dose 0.88 0.88 1 0.88 1.0 0.83 0.94 0.98
# 9 10 11 12 13 14 15 16
#Wt 176.00 165.00 158.0 148.00 149.00 163.00 170.00 186.00
#LvrWt 8.00 7.90 6.9 7.30 5.20 8.40 7.20 6.80
#Dose 0.88 0.84 0.8 0.74 0.75 0.81 0.85 0.94
# 17 18 19
#Wt 146.00 181.0 149.00
#LvrWt 7.30 9.0 6.40
#Dose 0.73 0.9 0.75
See what matrix(Xmatrix) returns:
matrix(Xmatrix)
# [,1]
#[1,] Integer,19
#[2,] Numeric,19
#[3,] Numeric,19
Just check the output from each of your steps, and you will see the matrix becomes a "one column" matrix after this step:
Xtranspose <- matrix(Xtranspose)
This function creates a new matrix. If you look at the manual for matrix(), you will see that it defaults to nrow = 1 and ncol = 1.
Your matrix obviously has more elements than would fit in a 1x1 matrix, but creating a new matrix isn't really what you want at this point. You just want to make sure the 2-dimensional structure you already have is a matrix, for which as.matrix() is better. (But it's unnecessary here; it already is a matrix.)
That said, the manual does not explain this particular behaviour very well: it does not clearly say what happens when you give matrix() input data with more elements than fit in the requested number of rows and columns.
Though it does say this, which is probably applicable to your case:
When coercing a vector, it produces a one-column matrix, and promotes the names (if any) of the vector to the rownames of the matrix.
This is also what you see.
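The difference can be seen with a small standalone example (a toy matrix, nothing from the question):

```r
# A 2 x 3 matrix to experiment with
m <- matrix(1:6, nrow = 2)

# t() returns a proper matrix; wrapping the result in matrix() treats
# it as a plain vector and builds a one-column matrix from it
dim(t(m))          # 3 2
dim(matrix(t(m)))  # 6 1

# as.matrix() leaves something that is already a matrix unchanged
identical(as.matrix(t(m)), t(m))  # TRUE
```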

R data.table, select columns with no NA

I have a table of stock prices here:
https://drive.google.com/file/d/1S666wiCzf-8MfgugN3IZOqCiM7tNPFh9/view?usp=sharing
Some columns have NA's because the company does not exist (until later dates), or the company folded.
What I want to do is select the columns that have no NA's. I use data.table because it is faster. Here is my working code:
example <- fread(file = "example.csv", key = "date")
example_select <- example[, lapply(.SD, function(x) not(sum(is.na(x) > 0)))] %>%
  as.logical(.)
example[, ..example_select]
Is there better (fewer lines) code to do the same? Thank you!
Try:
example[,lapply(.SD, function(x) {if(anyNA(x)) {NULL} else {x}} )]
There are lots of ways you could do this. Here's how I usually do it - a data.table approach without lapply:
example[, .SD, .SDcols = colSums(is.na(example)) == 0]
An answer using tidyverse packages
library(readr)
library(dplyr)
library(purrr)
data <- read_csv("~/Downloads/example.csv")
map2_dfc(data, names(data), .f = function(x, y) {
column <- tibble("{y}" := x)
if(any(is.na(column)))
return(NULL)
else
return(column)
})
Output
# A tibble: 5,076 x 11
date ACU ACY AE AEF AIM AIRI AMS APT ARMP ASXC
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001-01-02 2.75 4.75 14.4 8.44 2376 250 2.5 1.06 490000 179.
2 2001-01-03 2.75 4.5 14.5 9 2409 250 2.5 1.12 472500 193.
3 2001-01-04 2.75 4.5 14.1 8.88 2508 250 2.5 1.06 542500 301.
4 2001-01-05 2.38 4.5 14.1 8.88 2475 250 2.25 1.12 586250 301.
5 2001-01-08 2.56 4.75 14.3 8.75 2376 250 2.38 1.06 638750 276.
6 2001-01-09 2.56 4.75 14.3 8.88 2409 250 2.38 1.06 568750 264.
7 2001-01-10 2.56 5.5 14.5 8.69 2310 300 2.12 1.12 586250 274.
8 2001-01-11 2.69 5.25 14.4 8.69 2310 300 2.25 1.19 564375 333.
9 2001-01-12 2.75 4.81 14.6 8.75 2541 275 2 1.38 564375 370.
10 2001-01-16 2.75 4.88 14.9 8.94 2772 300 2.12 1.62 595000 358.
# … with 5,066 more rows
Using Filter():
library(data.table)
Filter(function(x) all(!is.na(x)), fread('example.csv'))
# date ACU ACY AE AEF AIM AIRI AMS APT
# 1: 2001-01-02 2.75 4.75 14.4 8.44 2376.00 250.00 2.50 1.06
# 2: 2001-01-03 2.75 4.50 14.5 9.00 2409.00 250.00 2.50 1.12
# 3: 2001-01-04 2.75 4.50 14.1 8.88 2508.00 250.00 2.50 1.06
# 4: 2001-01-05 2.38 4.50 14.1 8.88 2475.00 250.00 2.25 1.12
# 5: 2001-01-08 2.56 4.75 14.3 8.75 2376.00 250.00 2.38 1.06
# ---
#5072: 2021-03-02 36.95 10.59 28.1 8.77 2.34 1.61 2.48 14.33
#5073: 2021-03-03 38.40 10.00 30.1 8.78 2.26 1.57 2.47 12.92
#5074: 2021-03-04 37.90 8.03 30.8 8.63 2.09 1.44 2.27 12.44
#5075: 2021-03-05 35.68 8.13 31.5 8.70 2.05 1.48 2.35 12.45
#5076: 2021-03-08 37.87 8.22 31.9 8.59 2.01 1.52 2.47 12.15
# ARMP ASXC
# 1: 4.90e+05 178.75
# 2: 4.72e+05 192.97
# 3: 5.42e+05 300.62
# 4: 5.86e+05 300.62
# 5: 6.39e+05 276.25
# ---
#5072: 5.67e+00 3.92
#5073: 5.58e+00 4.54
#5074: 5.15e+00 4.08
#5075: 4.49e+00 3.81
#5076: 4.73e+00 4.15
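If your data.table is version 1.13.0 or newer (an assumption about your install), .SDcols also accepts a predicate function, which gives another compact variant of the same idea, sketched here with a small made-up table standing in for example.csv:

```r
library(data.table)

# Toy table: column b contains an NA and should be dropped
dt <- data.table(a = 1:3, b = c(1, NA, 3), c = c("x", "y", "z"))

# .SDcols can take a function (data.table >= 1.13.0): keep only the
# columns for which the predicate returns TRUE
dt[, .SD, .SDcols = function(x) !anyNA(x)]   # keeps columns a and c
```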

How would I generate matrices to represent the variants of "R" in these equations?

I am essentially trying to make my own code for the nonpartest() function in the npmv package. I have a dataset:
Cattle <- read.table(text=" Treatment Replicate Weight_Loss Persistent Head_Size Salebarn_Q
'LA 200' 1 17.90 14.10 14.25 1.0
'LA 200' 2 19.30 15.30 2.56 1.0
'LA 200' 3 19.50 16.82 5.80 1.5
'LA 200' 4 18.94 12.70 7.51 1.5
Excede 1 19.60 11.20 14.52 1.0
Excede 2 19.50 10.54 9.83 1.0
Excede 3 19.10 10.83 3.82 0.5
Excede 4 20.40 11.00 0.04 1.0
Micotil 1 17.30 14.29 1.62 1.0
Micotil 2 20.00 11.65 0.13 3.0
Micotil 3 18.10 10.89 2.41 0.0
Micotil 4 19.50 12.43 5.93 2.0
Zoetis 1 18.50 25.48 10.08 1.0
Zoetis 2 17.60 20.12 11.93 1.0
Zoetis 3 19.70 23.29 7.93 2.5
Zoetis 4 18.50 28.32 13.08 3.0", header=TRUE)
I am trying to use it to generate the matrices Ri., R.., and Rij from the equations in the paper below, so that I can calculate the test statistics G and H.
I attempted to do it using
R<-matrix(rank(Cattle,ties.method = "average"),N,p)
R_bar<-matrix(rank(Cattle,ties.method = "average"),1,p)
H<-(1/(a-1))*sum(n*(R-R_bar)*t(R-R_bar))
G<-(1/(N-a)*sum(sum(R-R_bar)*(R_prime-R_bar_prime)))
But that does not work, apparently. I'm not entirely sure what they're describing in the paper regarding the dimensions of the R matrices. I know you should use the rank() function and then transpose with t() for the 'prime' versions.
[Images show excerpts of the paper describing the different matrices, their dimensions, and how they enter the actual equations.]
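For what it's worth, here is a standalone sketch of the usual rank-based MANOVA construction. This follows the standard definitions (ranks taken per variable across all N observations, mid-ranks for ties; Ri. the mean rank vector of group i; R.. the overall mean rank vector; H and G the between- and within-group rank covariance matrices), not the paper excerpts themselves, so treat the details as assumptions:

```r
# Toy data standing in for the Cattle responses and treatment labels
set.seed(1)
Y <- matrix(rnorm(16 * 4), nrow = 16, ncol = 4)        # N x p responses
g <- rep(c("LA 200", "Excede", "Micotil", "Zoetis"), each = 4)

N <- nrow(Y); p <- ncol(Y)
groups <- unique(g); a <- length(groups)

# R_ij: variable-wise ranks of every observation (an N x p matrix);
# note rank() is applied per column, not to the whole data frame
R <- apply(Y, 2, rank, ties.method = "average")

# Ri. (a x p) and R.. (length p); without ties each entry of R.. is (N+1)/2
Ri   <- t(sapply(groups, function(gr) colMeans(R[g == gr, , drop = FALSE])))
Rbar <- colMeans(R)

# H = 1/(a-1) * sum_i n_i (Ri. - R..)(Ri. - R..)'
H <- matrix(0, p, p)
for (i in seq_len(a)) {
  d <- Ri[i, ] - Rbar
  H <- H + sum(g == groups[i]) * tcrossprod(d)   # tcrossprod(d) = d %*% t(d)
}
H <- H / (a - 1)

# G = 1/(N-a) * sum_i sum_j (R_ij - Ri.)(R_ij - Ri.)'
G <- matrix(0, p, p)
for (i in seq_len(a)) {
  for (j in which(g == groups[i])) {
    d <- R[j, ] - Ri[i, ]
    G <- G + tcrossprod(d)
  }
}
G <- G / (N - a)
```

In that framework the test statistics are then built from H and G (e.g. via their traces); to try it on the real data, replace the toy Y and g with Cattle[, 3:6] and Cattle$Treatment.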

Merge function produces more rows than original dataframe

I have two data frames that look like this.
First one:
head(df_2015_2016)
Date HomeTeam AwayTeam B365H B365D B365A BWH BWD BWA IWH IWD IWA
1 08/08/15 Bournemouth Aston Villa 2.00 3.6 4.00 2.00 3.30 3.70 2.10 3.3 3.30
2 08/08/15 Chelsea Swansea 1.36 5.0 11.00 1.40 4.75 9.00 1.33 4.8 8.30
3 08/08/15 Everton Watford 1.70 3.9 5.50 1.70 3.50 5.00 1.70 3.6 4.70
4 08/08/15 Leicester Sunderland 1.95 3.5 4.33 2.00 3.30 3.75 2.00 3.3 3.60
5 08/08/15 Man United Tottenham 1.65 4.0 6.00 1.65 4.00 5.50 1.65 3.6 5.10
6 08/08/15 Norwich Crystal Palace 2.55 3.3 3.00 2.60 3.20 2.70 2.40 3.2 2.85
And the second one
> head(df_matches)
row_names ID scoresway_id club club_bet city
1 1 242 214 Gent Gent Gent
2 2 248 215 Anderlecht Anderlecht Bruxelles
3 3 243 217 Cercle Brugge Cercle Brugge Brugge
4 4 310 218 Sporting Charleroi Charleroi Charleroi
5 5 249 219 Club Brugge Club Brugge Brugge
6 6 234 222 Beerschot #N/B Antwerp
Now I would like to merge them. The data frame that I am merging into has 5062 rows:
nrow(df_2015_2016)
[1] 5062
However, when I try to merge it
df <- merge(df_2015_2016, df_matches, by.x = "HomeTeam", by.y = "club_bet", all.x = T)
The end result has 5733 rows.
nrow(df)
[1] 5733
The output that I want is just 5062 rows, with a match or an NA value in case there is no match.
Any feedback on what goes wrong here?
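The mechanism behind this: merge() returns one output row per matching pair of rows, so duplicated keys in the lookup table multiply rows. A standalone sketch with made-up team data:

```r
# "Gent" appears twice in the lookup table
left  <- data.frame(HomeTeam = c("Gent", "Gent", "Anderlecht"))
right <- data.frame(club_bet = c("Gent", "Gent", "Anderlecht"),
                    city     = c("Gent", "Ghent", "Bruxelles"))

# Each "Gent" row on the left matches two lookup rows, giving
# 2 x 2 = 4 output rows for "Gent" plus 1 for "Anderlecht"
m <- merge(left, right, by.x = "HomeTeam", by.y = "club_bet", all.x = TRUE)
nrow(m)   # 5, not 3

# Duplicated keys in the lookup table are usually the culprit
any(duplicated(right$club_bet))   # TRUE
```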

How to match across 2 data frames IDs and run operations in R loop?

I have 2 data frames, the sampling ("samp") and the coordinates ("coor").
The "samp" data frame:
Plot X Y H L
1 6.4 0.6 3.654 0.023
1 19.1 9.3 4.998 0.023
1 2.4 4.2 5.568 0.024
1 16.1 16.7 5.32 0.074
1 10.8 15.8 6.58 0.026
1 1 16 4.968 0.023
1 9.4 12.4 6.804 0.078
2 3.6 0.4 4.3 0.038
3 12.2 19.9 7.29 0.028
3 2 18.2 7.752 0.028
3 6.5 19.9 7.2 0.028
3 3.7 13.8 5.88 0.042
3 4.9 10.3 9.234 0.061
3 3.7 13.8 5.88 0.042
3 4.9 10.3 9.234 0.061
4 16.3 2.4 5.18 0.02
4 15.7 9.8 10.92 0.096
4 6 12.6 6.96 0.16
5 19.4 16.4 8.2 0.092
10 4.8 5.16 7.38 1.08
11 14.7 16.2 16.44 0.89
11 19 19 10.2 0.047
12 10.8 2.7 19.227 1.2
14 0.6 6.4 12.792 0.108
14 4.6 1.9 12.3 0.122
15 12.2 18 9.6 0.034
16 13 18.3 4.55 0.021
The "coor" data frame:
Plot X Y
1 356154.007 501363.546
2 356154.797 501345.977
3 356174.697 501336.114
4 356226.469 501336.816
5 356255.24 501352.714
10 356529.313 501292.4
11 356334.895 501320.725
12 356593.271 501255.297
14 356350.029 501314.385
15 356358.81 501285.955
16 356637.29 501227.297
17 356652.157 501263.238
18 356691.68 501262.403
19 356755.386 501242.501
20 356813.735 501210.59
22 356980.118 501178.974
23 357044.996 501168.859
24 357133.365 501158.418
25 357146.781 501158.866
26 357172.485 501161.646
I wish to use a for loop to register the "samp" data frame against the GPS coordinates from the "coor" data frame -- e.g. the "new_x" variable is the sum of "X" from "samp" and "X" from "coor" for rows sharing the same "Plot" ID.
This is what I tried, but it is not working:
for (i in 1:nrow(samp)) {
  if (samp$Plot[i] == coor$Plot[i]) {
    samp$new_x[i] <- coor$X[i] + samp$X[i]
  } else {
    samp$new_x[i] <- samp$X[i]
  }
}
The final output I wish to have is the "samp" data frame with a proper coordinate variable ("new_x") added. It should look like this:
Plot X Y H L new_x
1 6.4 0.6 3.654 0.023 356160.407
1 19.1 9.3 4.998 0.023 356173.107
1 2.4 4.2 5.568 0.024 356156.407
1 16.1 16.7 5.32 0.074 356170.107
1 10.8 15.8 6.58 0.026 356164.807
1 1 16 4.968 0.023 356155.007
1 9.4 12.4 6.804 0.078 356163.407
2 3.6 0.4 4.3 0.038 356158.397
3 12.2 19.9 7.29 0.028 356186.897
3 2 18.2 7.752 0.028 356176.697
3 6.5 19.9 7.2 0.028 356181.197
3 3.7 13.8 5.88 0.042 356178.397
3 4.9 10.3 9.234 0.061 356179.597
3 3.7 13.8 5.88 0.042 356178.397
3 4.9 10.3 9.234 0.061 356179.597
4 16.3 2.4 5.18 0.02 356242.769
4 15.7 9.8 10.92 0.096 356242.169
4 6 12.6 6.96 0.16 356232.469
5 19.4 16.4 8.2 0.092 356274.64
10 4.8 5.16 7.38 1.08 356534.113
11 14.7 16.2 16.44 0.89 356349.595
11 19 19 10.2 0.047 356353.895
Any suggestions will be appreciated. Thanks.
You could merge the two datasets and create a new column by summing the X.x and X.y variables.
res <- transform(merge(samp, coor, by='Plot'), new_x=X.x+X.y)[,-c(6:7)]
colnames(res) <- colnames(out) # `out` is the expected result shown above
all.equal(res[1:22,], out, check.attributes=FALSE)
#[1] TRUE
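A vectorised base-R alternative is match(), sketched here with toy versions of samp and coor (note that plots absent from coor come back as NA here, rather than falling back to samp$X as in your loop's else branch):

```r
# Toy versions of `samp` and `coor`; plot 99 has no coordinates
samp <- data.frame(Plot = c(1, 1, 2, 99), X = c(6.4, 19.1, 3.6, 1.0))
coor <- data.frame(Plot = c(1, 2), X = c(356154.007, 356154.797))

# match() returns, for each samp$Plot, the row index of that plot in
# `coor` (or NA if absent), so the lookup is a single indexing step
idx <- match(samp$Plot, coor$Plot)
samp$new_x <- samp$X + coor$X[idx]
# new_x is now 356160.407, 356173.107, 356158.397, NA
```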
