Wilcoxon test in R - x must be numeric error

I have a problem with the wilcox.test in R.
My data object is a matrix in which the first column contains a name, and all other columns contain a (gene expression) measurement, which is numeric:
str(myMatrix)
'data.frame': 2000 obs. of 143 variables:
$ precursor : chr "name1" "name2" "name3" "name4" ...
$ sample1: num 1.46e-03 2.64e+02 1.46e-03 1.46e-03 1.46e-03 ...
$ sample2: num 1.46e-03 1.91e+02 1.46e-03 1.46e-03 1.46e-03 ...
$ sample3: num 1.46e-03 3.01e+02 1.46e-03 1.46e-03 4.96 ...
For all of the 2000 rows I want to test whether there is a difference between 2 given parts of the matrix. I tried this in 4 different ways:
wilcox.test(as.numeric(myMatrix[i,2:87],myMatrix[i,88:98]))$p.value
#[1] 1.549484e-16
wilcox.test(myMatrix[i,2:87],myMatrix[i,88:98])$p.value
#Error in wilcox.test.default(myMatrix[i, 2:87], myMatrix[i, 88:98]) :
#'x' must be numeric
t.test(as.numeric(myMatrix[i,2:87],myMatrix[i,88:98]))$p.value
#[1] 0.2973957
t.test(myMatrix[i,2:87],myMatrix[i,88:98])$p.value
#[1] 0.3098505
As you can see, the Wilcoxon test only runs without an error message when I wrap the already numeric values in as.numeric(), but its result then differs completely from the t.test results, even though it should not.
Manually verifying by using an online tool shows that the t.test results using as.numeric() values are wrong.
Any suggestions about how I can solve this problem and do the correct Wilcoxon test? If you need more information let me know.

Actually, myMatrix[i, 2:87] is still a data.frame. See the following example.
> myMat
fir X1 X2 X3 X4
1 name1 1 5 9 13
2 name2 2 6 10 14
3 name3 3 7 11 15
4 name4 4 8 12 16
> class(myMat[1, 2:4])
[1] "data.frame"
> as.numeric(myMat[1, 2:4])
[1] 1 5 9
Converting your data to a real matrix will solve your problem.
> myMat_01 <- myMat[, 2:5]
> rownames(myMat_01) <- myMat$fir
> myMat_01 <- as.matrix(myMat_01)
> class(myMat_01[1, 2:4])
[1] "integer"
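To connect this back to the question: as.numeric() only converts its first argument, so the calls that put both column ranges inside a single as.numeric() are presumably one-sample tests on columns 2:87 alone, which would explain why their p-values disagree with the two-sample results. Below is a minimal sketch of the row-wise two-sample test after converting to a real matrix. The data, the 1:5 versus 6:10 column split, and the names expr and p_values are made up for illustration.

```r
# Hypothetical stand-in for myMatrix: one name column, then 10 numeric columns
set.seed(1)
myMatrix <- data.frame(precursor = paste0("name", 1:3),
                       matrix(rnorm(3 * 10), nrow = 3))

# Drop the name column and convert once to a real numeric matrix
expr <- as.matrix(myMatrix[, -1])
rownames(expr) <- myMatrix$precursor

# Two-sample Wilcoxon test per row: first 5 columns vs. last 5 columns
p_values <- apply(expr, 1, function(row) {
  wilcox.test(row[1:5], row[6:10])$p.value
})
print(p_values)
```

For the data in the question, the two groups would be columns 1:86 and 87:97 of the converted matrix, since the name column has been dropped.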

Related

Counting the number of factor variables in a data frame

After generating data, I combined 5 variables into a data frame.
Two of those variables are factors.
Task:
I want to count the number of variables in the data frame that
are factors.
I ran the code letting df equal both a matrix and a data frame.
I'm listing both error messages.
I need help with using the rep function, in particular with where it belongs in the R command. Is using the count function the correct approach here, and if not, what should I do?
Can you help with this, please? Thank you. MM
XXX's mark questions in the output
> df
var1 var2 var3 var4 var5
[1,] -1.2070657 1 -0.6319780 3 -0.9952502
[2,] 0.2774292 2 0.3485368 1 1.9176811
[3,] 1.0844412 3 0.2075986 2 0.8032506
> class(df)
[1] "matrix"
> library(plyr)
> count(df[1:5,],as.factor)
Error in df[1:5, ] : subscript out of bounds
> df
var1 var2 var3 var4 var5
[1,] -1.2070657 1 -0.6319780 3 -0.9952502
[2,] 0.2774292 2 0.3485368 1 1.9176811
[3,] 1.0844412 3 0.2075986 2 0.8032506
> #Error in df[1:5, ] : subscript out of bounds df=matrix
no applicable method for 'as.quoted' applied to
an object of class "function" df=dataframe
XXXXXXXXXXXXXXXXXXX
> #2]
>
> #working example
> b=c(1,2,3,4,5,3,6)
> #Let’s count the 3s in the vector b.
> count3 <- length(which(b == 3))
> count3
[1] 2
>
> #apply the technique
> vec=c("var1","var2","var3","var4","var5")
> countF <- length(which(var1==as.factor))
Error in var1 == as.factor :
comparison (1) is possible only for atomic and list types XXXXXXXX
> #apply the technique again
> #count the number of variables that are factors in vec
> #var2 and var4 are factors
> vec=c("var1","var2","var3","var4","var5")
> countF <- length(which(vec==as.factor))
Error in vec == as.factor :
comparison (1) is possible only for atomic and list types
XXXXXXXXXXXXXXXXXXX
I had changed columns 2 and 4 to be factors prior to cbinding, but in that process columns 2 and 4 reverted back to being numeric. I used as.factor trying to get the code to run. As I read over the comments I wondered why lapply would not be appropriate, since we're dealing with an array of variable names in a list. Do all of the apply functions return TRUEs or FALSEs? I'm still learning when to apply each of them.
MM
If you want to count the number of factor variables, you can use sapply combined with is.factor:
sum(sapply(df, is.factor))
where df is your target data frame.
A few problems here:
Your subscript is out of bounds problem is because df[1:5, ] is rows 1:5, whereas columns would be df[ ,1:5]. It appears that you only have 3 rows, not 5.
The second error no applicable method for 'as.quoted' applied to
an object of class "function" is referring to the as.factor, which is a function. It is saying that a function doesn't belong within the function count. You can check exactly what count wants by running ?count in the console
A third problem that I see is that R will not automatically treat integers as factors. You have to convert numbers explicitly, for example with as.factor(). Words read into a data frame are often converted to factors automatically (this was the default before R 4.0.0, when stringsAsFactors was TRUE).
Here is a reproducible example:
> df<-data.frame("var1"=rnorm(3),"var2"=c(1:3),"var3"=rnorm(3),"var4"=c(3,1,2),"var5"=rnorm(3))
> str(df)
'data.frame': 3 obs. of 5 variables:
$ var1: num 0.716 1.43 -0.726
$ var2: int 1 2 3
$ var3: num 0.238 -0.658 0.492
$ var4: num 3 1 2
$ var5: num 1.71 1.54 1.05
Here I used the structure function, str(), to check what type of data I have. Note that var2 is stored as an integer because I generated it as c(1:3), whereas var4, specified as c(3,1,2), is stored as numeric.
Here, I will tell R I want two of the columns to be factors, and I will make another column of words, which will automatically become factors.
> df<-data.frame("var1"=rnorm(3),"var2"=as.factor(c(1:3)),"var3"=rnorm(3),"var4"=as.factor(c(3,1,2))
+ ,"var5"=rnorm(3), "var6"=c("Green","Red","Blue"))
> str(df)
'data.frame': 3 obs. of 6 variables:
$ var1: num -1.18 1.26 -0.53
$ var2: Factor w/ 3 levels "1","2","3": 1 2 3
$ var3: num 1.38 -0.401 -0.924
$ var4: Factor w/ 3 levels "1","2","3": 3 1 2
$ var5: num 1.688 0.547 0.727
$ var6: Factor w/ 3 levels "Blue","Green",..: 2 3 1
You can then ask which are factors:
> sapply(df, is.factor)
var1 var2 var3 var4 var5 var6
FALSE TRUE FALSE TRUE FALSE TRUE
And if you wanted a number for how many are factors something like this would get you there:
> length(which(sapply(df, is.factor)==TRUE))
[1] 3
You have something similar: length(which(vec==as.factor)), but one problem with this is you are asking which things in the vec object are the same as a function as.factor, which doesn't make sense. So it is giving you the error Error in vec == as.factor :
comparison (1) is possible only for atomic and list types
as.factor is for setting things as factor (as I have shown above), but is.factor is for asking if something is a factor, which will return a logical (TRUE vs FALSE) - also shown above.
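On the follow-up about columns 2 and 4 reverting to numeric after cbinding: cbind() on plain vectors builds a matrix, and a matrix can hold only one type, so factors are reduced to their integer codes. Here is a small sketch with made-up vectors (v1, v2 and v3 are hypothetical names) showing the difference from data.frame():

```r
v1 <- c(0.5, 1.5, 2.5)
v2 <- factor(c("a", "b", "c"))
v3 <- factor(c(3, 1, 2))

m <- cbind(v1, v2, v3)        # matrix: the factors become integer codes
d <- data.frame(v1, v2, v3)   # data frame: the factors are preserved

is.factor(m[, "v2"])          # FALSE
n_factors <- sum(sapply(d, is.factor))
n_factors                     # 2
```

Building the object with data.frame() instead of cbind() keeps the factor columns, so the sapply(df, is.factor) count works as intended.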

colnames() function in R - Treating table values as independent objects/variables

I have a list of values which I would like to use as names for separate tables scraped from separate URLs on a certain website.
> Fac_table
[[1]]
[1] "fulltime_fac_table"
[[2]]
[1] "parttime_fac_table"
[[3]]
[1] "honorary_fac_table"
[[4]]
[1] "retired_fac_table"
I would like to loop through the list to automatically generate 4 tables with the respective names.
The result should look like this:
> fulltime_fac_table
職稱
V1 "教授兼系主任"
V2 "教授"
V3 "教授"
V4 "教授"
V5 "特聘教授"
> parttime_fac_table
職稱 姓名
V1 "教授" "XXX"
V2 "教授" "XXX"
V3 "教授" "XXX"
V4 "教授" "XXX"
V5 "教授" "XXX"
V6 "教授" "XXX"
I have another list, named 'headers', containing column headings of the respective tables online.
> headers
[[1]]
[1] "職稱" "姓名" "    研究領域"
[4] "聯絡方式"
[[2]]
[1] "職稱" "姓名" "研究領域" "聯絡方式"
I was able to assign values to the respective tables with this code:
> assign(eval(parse(text="Fac_table[[i]]")), as_tibble(matrix(fac_data,
> nrow = length(headers[[i]])))
This results in a populated table, without column headings, like this one:
> honorary_fac_table
[,1] [,2]
V1 "名譽教授" "XXX"
V2 "名譽教授" "XXX"
V3 "名譽教授" "XXX"
V4 "名譽教授" "XXX"
But I was unable to assign column names to each table.
None of the code below worked:
> assign(colnames(eval(parse(text="Fac_table[1]"))), c(gsub("\\s", "", headers[[1]])))
Error in assign(colnames(eval(parse(text = "Fac_table[1]"))), c(gsub("\\s", :
invalid first argument
> colnames(eval(parse(text="Fac_table[i]"))) <- c(gsub("\\s", "", headers[[i]]))
Error in colnames(eval(parse(text = "Fac_table[i]"))) <- c(gsub("\\s", :
target of assignment expands to non-language object
> do.call("<-", colnames(eval(parse(text="Fac_table[i]"))), c(gsub("\\s", "", headers[[i]])))
Error in do.call("<-", colnames(eval(parse(text = "Fac_table[i]"))), c(gsub("\\s", :
second argument must be a list
To simplify the issue, a reproducible example is as follows:
> varNamelist <- list(c("tbl1","tbl2","tbl3","tbl4"))
> colHeaderlist <- list(c("col1","col2","col3","col4"))
> tableData <- matrix(1:12, ncol=4)
This works:
> assign(eval(parse(text="varNamelist[[1]][1]")), matrix(tableData, ncol
> = length(colHeaderlist[[1]])))
But this doesn't:
> colnames(as.name(varNamelist[[1]][1])) <- colHeaderlist[[1]]
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3", "col4" :
attempt to set 'colnames' on an object with less than two dimensions
It seems like the colnames() function in R is unable to treat the strings as represented by "Fac_table[i]" as variable names, in which independent data (separate from Fac_table) can be stored.
> colnames(as.name(Fac_table[[1]])) <- headers[[1]]
Error in `colnames<-`(`*tmp*`, value = c("a", "b", "c", :
attempt to set 'colnames' on an object with less than two dimensions
Substituting for 'fulltime_fac_table' directly works fine.
> colnames(fulltime_fac_table) <- headers[[1]]
Is there any way around this issue?
Thanks!
There is a solution to this, but I think the current setup may be more complex than necessary, if I understand correctly. So I'll try to make this task easier.
If you're working with one-dimensional data, I'd recommend using vectors, as they're more appropriate than lists for that purpose. So for this project, I'd begin by storing the names of tables and headers, like this:
varNamelist <- c("tbl1","tbl2","tbl3","tbl4")
colHeaderlist <- c("col1","col2","col3","col4")
It's still difficult to determine from your question what the data format and origin of the input for these tables is, but in general a data frame can be easier to work with than a matrix, as long as you're not working with Big Data. The assign function is also typically not necessary for this sort of step. Instead, when setting up a data frame, we can set the name of the data frame, the names of the columns, and the data contents all at once, like this:
tbl1 <- data.frame("col1"=c(1,2,3),
"col2"=c(4,5,6),
"col3"=c(7,8,9),
"col4"=c(10,11,12))
Again, we're using vectors, noted by the c() instead of list(), to fill each column, since each column is its own single dimension.
To check the output of tbl1, we can then use print():
print(tbl1)
col1 col2 col3 col4
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
If it's an option to create the tables closer to this way shown, that might make things easier than using so many lists and assign functions; that quickly becomes overly complicated.
But if you want at the end to store all the tables in a single place, you could put them in a list:
tableList <- list(tbl1=tbl1,tbl2=tbl2,tbl3=tbl3,tbl4=tbl4)
str(tableList)
List of 4
$ tbl1:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl2:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl3:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl4:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
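For the original problem of attaching headers to tables whose names are stored as strings, one option is to keep the tables in a named list and index it by name, which avoids assign() and eval(parse()) altogether. This is a rough sketch with made-up inputs: table_names, col_headers and raw are hypothetical stand-ins for Fac_table, headers and the scraped data.

```r
# Hypothetical inputs, for illustration only
table_names <- c("tbl1", "tbl2")
col_headers <- list(c("col1", "col2", "  col3", "col4"),
                    c("col1", "col2", "col3", "col4"))
raw <- matrix(1:12, ncol = 4)

# Pre-allocate a named list with one slot per table
tables <- setNames(vector("list", length(table_names)), table_names)

for (i in seq_along(table_names)) {
  tbl <- as.data.frame(raw)
  # Strip stray whitespace from the headers before using them as column names
  colnames(tbl) <- gsub("\\s", "", col_headers[[i]])
  tables[[table_names[i]]] <- tbl
}

print(tables[["tbl1"]])
```

If separate top-level variables are really needed, the pattern tmp <- get(name); colnames(tmp) <- new_names; assign(name, tmp) should also work, because `colnames<-` needs an actual object on its left-hand side rather than a string or a symbol.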
I've found a workaround based on @Ryan's recommendation, given by this code:
library(rvest) # provides read_html(), html_nodes(), html_text() and %>%
for (i in seq_along(url)) {
  webpage <- read_html(url[i]) # loop through the URL list to access the HTML data
  fac_data <- html_nodes(webpage, '.tableunder') %>% html_text()
  fac_data1 <- html_nodes(webpage, '.tableunder1') %>% html_text()
  fac_data <- c(fac_data, fac_data1) # store the table data from each URL in a variable
  x <- fac_data %>% matrix(ncol = length(headers[[i]]), byrow = TRUE) # make a matrix to extract column data
  for (j in seq_along(headers[[i]])) {
    y <- cbind(x[, j]) # extract column data and store it in a temporary variable
    colnames(y) <- as.character(headers[[i]][j]) # add the column name
    print(cbind(y)) # print each column in sequence. ** cbind(y) will be overwritten when I try to store the result in a list with 'z <- cbind(y)'.
  }
}
I am now able to print out all values, complete with headers of the data in question.
Follow-up questions have been posted here.
The final code solved this problem as well.

daply: Correct results, but confusing structure

I have a data.frame mydf, that contains data from 27 subjects. There are two predictors, congruent (2 levels) and offset (5 levels), so overall there are 10 conditions. Each of the 27 subjects was tested 20 times under each condition, resulting in a total of 10*27*20 = 5400 observations. RT is the response variable. The structure looks like this:
> str(mydf)
'data.frame': 5400 obs. of 4 variables:
$ subject : Factor w/ 27 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
$ congruent: logi TRUE FALSE FALSE TRUE FALSE TRUE ...
$ offset : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 5 5 1 2 5 5 2 2 3 5 ...
$ RT : int 330 343 457 436 302 311 595 330 338 374 ...
I've used daply() to calculate the mean RT of each subject in each of the 10 conditions:
myarray <- daply(mydf, .(subject, congruent, offset), summarize, mean = mean(RT))
The result looks just the way I wanted, i.e. a 3d-array; so to speak 5 tables (one for each offset condition) that show the mean of each subject in the congruent=FALSE vs. the congruent=TRUE condition.
However if I check the structure of myarray, I get a confusing output:
List of 270
$ : num 417
$ : num 393
$ : num 364
$ : num 399
$ : num 374
...
# and so on
...
[list output truncated]
- attr(*, "dim")= int [1:3] 27 2 5
- attr(*, "dimnames")=List of 3
..$ subject : chr [1:27] "1" "2" "3" "5" ...
..$ congruent: chr [1:2] "FALSE" "TRUE"
..$ offset : chr [1:5] "1" "2" "3" "4" ...
This looks totally different from the structure of the prototypical ozone array from the plyr package, even though it's a very similar format (3 dimensions, only numerical values).
I want to compute some further summarizing information on this array, by means of aaply. Precisely, I want to calculate the difference between the congruent and the incongruent means for each subject and offset.
However, even the most basic application of aaply(), such as aaply(myarray,2,mean), returns nonsense output:
FALSE TRUE
NA NA
Warning messages:
1: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
2: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
I have no idea why the daply() function returns such weirdly structured output and thereby prevents any further use of aaply. Any kind of help is appreciated; I frankly admit that I have hardly any experience with the plyr package.
Since you haven't included your data it's hard to know for sure, but I tried to make a dummy set off your str(). You can do what you want (I'm guessing) with two uses of ddply. First the means, then the difference of the means.
library(plyr)
#Make dummy data
mydf <- data.frame(subject = rep(1:5, each = 150),
congruent = rep(c(TRUE, FALSE), each = 75),
offset = rep(1:5, each = 15), RT = sample(300:500, 750, replace = TRUE))
#Make means
mydf.mean <- ddply(mydf, .(subject, congruent, offset), summarise, mean.RT = mean(RT))
#Calculate difference between congruent and incongruent
#(with rows ordered FALSE then TRUE, diff() gives TRUE minus FALSE)
mydf.diff <- ddply(mydf.mean, .(subject, offset), summarise, diff.mean = diff(mean.RT))
head(mydf.diff)
# subject offset diff.mean
# 1 1 1 39.133333
# 2 1 2 9.200000
# 3 1 3 20.933333
# 4 1 4 -1.533333
# 5 1 5 -34.266667
# 6 2 1 -2.800000
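If a true numeric 3d array is wanted, as in the question, base R's tapply() is one alternative that returns an atomic array rather than a list with dimensions. This swaps plyr for base R. It is a sketch on made-up data with 3 subjects instead of 27, and the names means and diffs are hypothetical.

```r
# Hypothetical dummy data shaped like the question's mydf
set.seed(42)
mydf <- expand.grid(rep = 1:20, offset = 1:5,
                    congruent = c(FALSE, TRUE), subject = 1:3)
mydf$RT <- sample(300:500, nrow(mydf), replace = TRUE)

# Mean RT per subject x congruent x offset as an atomic numeric array
means <- with(mydf, tapply(RT,
                           list(subject = subject,
                                congruent = congruent,
                                offset = offset),
                           mean))
dim(means)   # 3 2 5

# Congruent minus incongruent mean, per subject (rows) and offset (columns)
diffs <- means[, "TRUE", ] - means[, "FALSE", ]
print(diffs)
```

The list structure in the question most likely comes from summarize returning a one-row data frame for each cell. Within plyr, passing daply() a function that returns a single number, such as function(d) mean(d$RT), should give a plain numeric array as well.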

Why is the class and mode of the object returned by matrix() and array() the same?

Below are the first few rows of my large data file:
Symbol|Security Name|Market Category|Test Issue|Financial Status|Round Lot Size
AAC|Australia Acquisition Corp. - Ordinary Shares|S|N|D|100
AACC|Asset Acceptance Capital Corp. - Common Stock|Q|N|N|100
AACOU|Australia Acquisition Corp. - Unit|S|N|N|100
AACOW|Australia Acquisition Corp. - Warrant|S|N|N|100
AAIT|iShares MSCI All Country Asia Information Technology Index Fund|G|N|N|100
AAME|Atlantic American Corporation - Common Stock|G|N|N|100
I read the data in:
data <- read.table("nasdaqlisted.txt", sep="|", quote='', header=TRUE, as.is=TRUE)
and construct an array and a matrix:
d1 <- array(data, dim=c(nrow(data), ncol(data)))
d2 <- matrix(data, nrow=nrow(data), ncol=ncol(data))
However, even though d1 is an array and d2 is a matrix, the class and mode are the same:
> class(d1)
[1] "matrix"
> mode(d1)
[1] "list"
> class(d2)
[1] "matrix"
> mode(d2)
[1] "list"
Why is this?
I'll bite and have a go at explaining my understanding of the issues.
You don't need your large test file to demonstrate the issue. A simple data.frame would do:
test <- data.frame(var1=1:2,var2=letters[1:2])
> test
var1 var2
1 1 a
2 2 b
Keep in mind that a data.frame is just a list internally.
> is.data.frame(test)
[1] TRUE
> is.list(test)
[1] TRUE
With a list-like structure as you would expect.
> str(test)
'data.frame': 2 obs. of 2 variables:
$ var1: int 1 2
$ var2: Factor w/ 2 levels "a","b": 1 2
> str(as.list(test))
List of 2
$ var1: int [1:2] 1 2
$ var2: Factor w/ 2 levels "a","b": 1 2
When you specify a matrix call against a data.frame or a list, you end up with a matrix filled with the elements of the data.frame or list.
result1 <- matrix(test)
> result1
[,1]
[1,] Integer,2
[2,] factor,2
Looking at the structure of result1, you can see it is still a list, but now just with dimensions (see the last line in the output below).
> str(result1)
List of 2
$ : int [1:2] 1 2
$ : Factor w/ 2 levels "a","b": 1 2
- attr(*, "dim")= int [1:2] 2 1
Which means it is now both a matrix and a list
> is.matrix(result1)
[1] TRUE
> is.list(result1)
[1] TRUE
If you strip the dimensions from this object, it will no longer be a matrix and will revert to just being a list.
dim(result1) <- NULL
> result1
[[1]]
[1] 1 2
[[2]]
[1] a b
Levels: a b
> is.matrix(result1)
[1] FALSE
> is.list(result1)
[1] TRUE
> str(result1)
List of 2
$ : int [1:2] 1 2
$ : Factor w/ 2 levels "a","b": 1 2
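If the goal is an ordinary atomic matrix rather than a list with dimensions, as.matrix() (or data.matrix() for a numeric result) is the usual route. Here is a short sketch using the same toy data frame; the names m_list, m_chr and m_num are made up for illustration.

```r
test <- data.frame(var1 = 1:2, var2 = letters[1:2])

m_list <- matrix(test)       # list with a dim attribute
m_chr  <- as.matrix(test)    # atomic character matrix (mixed column types)
m_num  <- data.matrix(test)  # atomic numeric matrix; var2 becomes integer codes

mode(m_list)   # "list"
mode(m_chr)    # "character"
dim(m_chr)     # 2 2
```

Because an atomic matrix holds a single type, as.matrix() falls back to character when the columns are mixed, which is worth keeping in mind for a file like the one in the question.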

R merge() not working (anymore) as intended [duplicate]

This has worked for me before but now it isn't and I have spent two days tinkering with it before I ask for help here.
I have two datasets, one called Access, the other CO2. Each one has three variables, two of which are common and are what I want to use to merge the two datasets. Just to play it really safe, I am pasting the head() and str() outputs here:
> head(Access)
       x     y  access
1 -32.65 83.65    0.00
2 -36.85 83.55 4481.25
3 -36.75 83.55 4464.75
4 -36.65 83.55 4448.25
5 -36.55 83.55 4431.00
6 -36.45 83.55 4414.50
> head(CO2)
       x     y   CO2equ
1 -32.65 83.65 183316.4
2 -36.85 83.55 173327.8
3 -36.75 83.55 301413.9
4 -36.65 83.55 360757.2
5 -36.55 83.55 409523.5
6 -36.45 83.55 448302.0
> str(Access)
'data.frame': 2183106 obs. of 3 variables:
$ x : num -32.7 -36.8 -36.8 -36.7 -36.5 ...
$ y : num 83.7 83.5 83.5 83.5 83.5 ...
$ access: num 0 4481 4465 4448 4431 ...
- attr(*, "data_types")= chr "N" "N" "N"
> str(CO2)
'data.frame': 2183106 obs. of 3 variables:
$ x : num -32.7 -36.9 -36.8 -36.7 -36.6 ...
$ y : num 83.6 83.5 83.5 83.5 83.5 ...
$ CO2equ: num 183316 173328 301414 360757 409523 ...
- attr(*, "data_types")= chr "N" "N" "N"
Now I am trying two versions of merge(). The first one results in an empty data.frame, the second in all rows existing twice, once for the variables from the first dataset, and the second with the variables from the second dataset:
> M1 = merge(Access, CO2, c("x","y"))
> head(M1)
[1] x y access CO2equ
<0 rows> (or 0-length row.names)
> M2 = merge(Access, CO2, by=c("x","y"), all=TRUE)
> length(M2$x)
[1] 4366212
> head(M2)
x y access CO2equ
1 -179.95 -89.95 NA 0
2 -179.95 -89.85 NA 0
3 -179.95 -89.75 NA 0
4 -179.95 -89.65 NA 0
5 -179.95 -89.55 NA 0
6 -179.95 -89.45 NA 0
Obviously, the respective x- and y-values are not recognized as being equivalent, but I do not know why. The data types are the same, the values look the same, and worst of all, I did this successfully a few months ago. Back then, I saved the command history, and now when I just copy and paste it into my R console, it does not work. I tried it in both R 2.13.0 and Revolution R Enterprise 4.3. I am reasonably sure that this is not a software bug but something trivial that I just overlooked even after spending some two days on this.
Cheers,
Jochen
Try round(..., 1) on both x and y before the merge.
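The likely reason is that the coordinates carry tiny floating-point differences that are hidden at print precision, so merge() sees different keys even though the values look identical. Below is a minimal sketch on made-up data. The 1e-9 offset is an assumption chosen to mimic such noise, and a, b, before and after are hypothetical names.

```r
# Hypothetical data: the y keys in 'a' carry noise that printing hides
a <- data.frame(x = c(-36.8, -36.7),
                y = c(83.5, 83.5) + 1e-9,
                access = c(4481.25, 4464.75))
b <- data.frame(x = c(-36.8, -36.7),
                y = c(83.5, 83.5),
                CO2equ = c(173327.8, 301413.9))

print(a$y)                               # displays as 83.5 83.5
before <- merge(a, b, by = c("x", "y"))
nrow(before)                             # 0, the keys do not match

# Round the key columns in both data frames to the grid resolution
for (k in c("x", "y")) {
  a[[k]] <- round(a[[k]], 1)
  b[[k]] <- round(b[[k]], 1)
}
after <- merge(a, b, by = c("x", "y"))
nrow(after)                              # 2
```

The number of digits passed to round() should match the resolution of the grid. The coordinates shown in head() have two decimals, so round(..., 2) may be the safer choice for the data in the question.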
