After generating data, I combined 5 variables into a data frame.
Two of those variables are factors.
Task:
I want to count the number of variables in the data frame that
are factors.
I ran the code letting df equal both a matrix and a data frame.
I'm listing both error messages.
I need help in using the count function, in particular where it goes in the R command. Is using count the correct approach here, and if not, what should I do?
Can you help with this, please? Thank you. MM
XXXs mark the questions in the output.
> df
var1 var2 var3 var4 var5
[1,] -1.2070657 1 -0.6319780 3 -0.9952502
[2,] 0.2774292 2 0.3485368 1 1.9176811
[3,] 1.0844412 3 0.2075986 2 0.8032506
> class(df)
[1] "matrix"
> library(plyr)
> count(df[1:5,],as.factor)
Error in df[1:5, ] : subscript out of bounds
> df
var1 var2 var3 var4 var5
[1,] -1.2070657 1 -0.6319780 3 -0.9952502
[2,] 0.2774292 2 0.3485368 1 1.9176811
[3,] 1.0844412 3 0.2075986 2 0.8032506
> #df = matrix:     Error in df[1:5, ] : subscript out of bounds
> #df = data frame: Error: no applicable method for 'as.quoted' applied to
> #                 an object of class "function"
XXXXXXXXXXXXXXXXXXX
> #2]
>
> #working example
> b=c(1,2,3,4,5,3,6)
> #Let’s count the 3s in the vector b.
> count3 <- length(which(b == 3))
> count3
[1] 2
>
> #apply the technique
> vec=c("var1","var2","var3","var4","var5")
> countF <- length(which(var1==as.factor))
Error in var1 == as.factor :
comparison (1) is possible only for atomic and list types XXXXXXXX
> #apply the technique again
> #count the number of variables that are factors in vec
> #var2 and var4 are factors
> vec=c("var1","var2","var3","var4","var5")
> countF <- length(which(vec==as.factor))
Error in vec == as.factor :
comparison (1) is possible only for atomic and list types
XXXXXXXXXXXXXXXXXXX
I had changed columns 2 and 4 to be factors prior to cbinding, but in that process columns 2 and 4 reverted to being numeric. I used as.factor to try to get the code to run. As I read over the comments, I wondered why lapply would not be appropriate, since we're dealing with an array of variable names in a list. Do all of the apply functions return TRUEs and FALSEs? I'm still learning when to apply each of them.
MM
If you want to count the number of factor variables, you can use sapply combined with is.factor:
sum(sapply(df, is.factor))
where df is your target data frame.
A few problems here:
Your "subscript out of bounds" problem arises because df[1:5, ] selects rows 1:5, whereas columns would be df[, 1:5]. It appears that you only have 3 rows, not 5 (see the quick sketch after this list of problems).
The second error, no applicable method for 'as.quoted' applied to an object of class "function", refers to as.factor, which is a function. It is saying that a bare function doesn't belong as an argument to count. You can check exactly what count expects by running ?count in the console.
A third problem is that R will not automatically treat integers as factors; you have to specify this explicitly for numeric columns. Character (word) columns, on the other hand, are often set as factors automatically when read in.
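To make the rows-versus-columns point concrete, here is a quick sketch on a hypothetical 3-row, 5-column object d (not the asker's data):
d <- data.frame(matrix(rnorm(15), nrow = 3))  # 3 rows, 5 columns
d[, 1:5]   # the first five COLUMNS: fine
d[1:5, ]   # rows 1:5 of a 3-row data frame: rows 4 and 5 come back as all-NA
m <- as.matrix(d)
m[1:5, ]   # on a matrix the same call is an error:
           # Error in m[1:5, ] : subscript out of bounds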
Here is a reproducible example:
> df<-data.frame("var1"=rnorm(3),"var2"=c(1:3),"var3"=rnorm(3),"var4"=c(3,1,2),"var5"=rnorm(3))
> str(df)
'data.frame': 3 obs. of 5 variables:
$ var1: num 0.716 1.43 -0.726
$ var2: int 1 2 3
$ var3: num 0.238 -0.658 0.492
$ var4: num 3 1 2
$ var5: num 1.71 1.54 1.05
Here I used the structure function str() to check what type of data I have. Note that var2 is read in as an integer because I generated it as c(1:3), whereas var4, specified as c(3,1,2), is read in as numeric.
Here, I will tell R I want two of the columns to be factors, and I will make another column of words, which will automatically become factors.
> df<-data.frame("var1"=rnorm(3),"var2"=as.factor(c(1:3)),"var3"=rnorm(3),"var4"=as.factor(c(3,1,2))
+ ,"var5"=rnorm(3), "var6"=c("Green","Red","Blue"))
> str(df)
'data.frame': 3 obs. of 6 variables:
$ var1: num -1.18 1.26 -0.53
$ var2: Factor w/ 3 levels "1","2","3": 1 2 3
$ var3: num 1.38 -0.401 -0.924
$ var4: Factor w/ 3 levels "1","2","3": 3 1 2
$ var5: num 1.688 0.547 0.727
$ var6: Factor w/ 3 levels "Blue","Green",..: 2 3 1
You can then ask which columns are factors:
> sapply(df, is.factor)
var1 var2 var3 var4 var5 var6
FALSE TRUE FALSE TRUE FALSE TRUE
And if you want a single number for how many are factors, something like this will get you there:
> length(which(sapply(df, is.factor)==TRUE))
[1] 3
You had something similar, length(which(vec==as.factor)), but one problem with this is that you are asking which things in the vec object are equal to the function as.factor, which doesn't make sense. So it gives you the error:
Error in vec == as.factor :
comparison (1) is possible only for atomic and list types
as.factor is for setting something as a factor (as I have shown above), whereas is.factor is for asking whether something is a factor, which returns a logical (TRUE vs FALSE), also shown above.
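On the follow-up question about lapply: the apply-family functions don't inherently return TRUEs or FALSEs; they return whatever the function you hand them returns. lapply always gives back a list, while sapply tries to simplify that list to a vector, which is why it pairs so well with is.factor here. A minimal sketch using the six-column df from above:
lapply(df, is.factor)   # named list of logicals, one element per column
sapply(df, is.factor)   # the same result simplified to a logical vector
# either version can be counted, since sum() treats TRUE as 1:
sum(unlist(lapply(df, is.factor)))
sum(sapply(df, is.factor))
## [1] 3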
Related
I am working on a two-way mixed ANOVA using the data below, with one dependent variable, one between-subjects variable and one within-subjects variable. When I tested the normality of the residuals of the dependent variable, I found that they are not normally distributed. At this point I am still able to perform the two-way ANOVA. However, when I perform a log10 transformation and run the script again using the log-transformed variable, I get the error "contrasts can be applied only to factors with 2 or more levels".
> str(m_runjumpFREQ)
'data.frame': 564 obs. of 8 variables:
$ ID1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ Group : Factor w/ 2 levels "II","Non-II": 1 1 1 1 1 1 1 1 1 1 ...
$ Pos : Factor w/ 3 levels "center","forward",..: 2 1 2 3 2 2 1 3 2 2 ...
$ Match_outcome : Factor w/ 2 levels "W","L": 2 2 2 2 2 2 2 2 2 1 ...
$ time : Factor w/ 8 levels "runjump_nADJmin_q1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ runjump : num 0.0561 0.0858 0.0663 0.0425 0.0513 ...
$ log_runjumpFREQ: num -1.25 -1.07 -1.18 -1.37 -1.29 ...
Some answers on StackOverflow to this error mention that one or more of the factors used for the ANOVA has fewer than two levels. But as seen above, they do not.
Another explanation I have read is that it may be an issue of missing values (NAs). There are some:
m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 88
However, I get the same error even after removing the rows containing NAs, as follows.
> m_runjumpFREQ <- na.omit(m_runjumpFREQ)
> m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 0
I can run the same script without the log transformation and it works, but with it I get the same error. The factors are the same either way, and the missing values do not make a difference. Either I am making a crucial mistake or the issue is in the log transformation lines below.
log_runjumpFREQ <- log10(m_runjumpFREQ$runjump)
m_runjumpFREQ <- cbind(m_runjumpFREQ, log_runjumpFREQ)
I appreciate the help.
It is not enough that the factors have 2 levels; those levels must also actually be present in the data. For example, below f has 2 levels but only 1 is actually present.
y <- (1:6)^2
x <- 1:6
f <- factor(rep(1, 6), levels = 1:2)
nlevels(f) # f has 2 levels
## [1] 2
lm(y ~ x + f)
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
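That is the most likely culprit here: after na.omit (or any subsetting), a factor can keep declared levels that no longer occur in the remaining rows. A small sketch of the usual diagnosis and fix, continuing the f example above; applying droplevels to your own data frame after na.omit is my assumption, not something the question confirms:
table(f)   # level "2" is declared but has zero observations
## f
## 1 2
## 6 0
f <- droplevels(f)   # drop levels with no observations
nlevels(f)           # now 1: this factor genuinely cannot enter the model
## [1] 1
# for a whole data frame, e.g. after na.omit:
# m_runjumpFREQ <- droplevels(m_runjumpFREQ)
# then re-check that every factor still has >= 2 levels actually present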
I have a large file that looks like this
region type coeff p-value distance count
82365593523656436 A -0.9494 0.050 -16479472.5 8
82365593523656436 B 0.47303 0.526 57815363.0 8
82365593523656436 C -0.8938 0.106 42848210.5 8
When I read it in using fread, suddenly 82365593523656436 is not found anymore
correlations <- data.frame(fread('all_to_all_correlations.txt'))
> "82365593523656436" %in% correlations$region
[1] FALSE
I can find a slightly different number
> "82365593523656432" %in% correlations$region
[1] TRUE
but this number is not in the actual file
grep 82365593523656432 all_to_all_correlations.txt
gives no results, while
grep 82365593523656436 all_to_all_correlations.txt
does.
When I try to read in the small sample file shown above instead of the full file, I get:
Warning message:
In fread("test.txt") :
Some columns have been read as type 'integer64' but package bit64 isn't loaded.
Those columns will display as strange looking floating point data.
There is no need to reload the data.
Just require(bit64) to obtain the integer64 print method and print the data again.
and the data looks like
region type coeff p.value distance count
1 3.758823e-303 A -0.94940 0.050 -16479472 8
2 3.758823e-303 B 0.47303 0.526 57815363 8
3 3.758823e-303 C -0.89380 0.106 42848210 8
So I think during reading 82365593523656436 was changed into 82365593523656432. How can I prevent this from happening?
IDs (and that's apparently what the first column is) should usually be read as characters:
correlations <- setDF(fread('region type coeff p-value distance count
82365593523656436 A -0.9494 0.050 -16479472.5 8
82365593523656436 B 0.47303 0.526 57815363.0 8
82365593523656436 C -0.8938 0.106 42848210.5 8',
colClasses = c(region = "character")))
str(correlations)
#'data.frame': 3 obs. of 6 variables:
# $ region : chr "82365593523656436" "82365593523656436" "82365593523656436"
# $ type : chr "A" "B" "C"
# $ coeff : num -0.949 0.473 -0.894
# $ p-value : num 0.05 0.526 0.106
# $ distance: num -16479473 57815363 42848211
# $ count : int 8 8 8
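The root cause is floating-point precision: 82365593523656436 is larger than 2^53, beyond which a double can no longer represent every integer exactly, so it silently rounds to the nearest representable value, 82365593523656432. Besides colClasses, recent versions of fread also have an integer64 argument that controls this for all such columns at once; a sketch (the file name is the asker's):
library(data.table)
# read all large-integer columns as character strings up front
correlations <- data.frame(fread("all_to_all_correlations.txt",
                                 integer64 = "character"))
# or keep full precision as 64-bit integers (requires the bit64 package)
library(bit64)
correlations <- data.frame(fread("all_to_all_correlations.txt",
                                 integer64 = "integer64"))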
I have a problem with the wilcox.test in R.
My data object is a matrix in which the first column contains a name, and all other columns contain a (gene expression) measurement, which is numeric:
str(myMatrix)
'data.frame': 2000 obs. of 143 variables:
$ precursor : chr "name1" "name2" "name3" "name4" ...
$ sample1: num 1.46e-03 2.64e+02 1.46e-03 1.46e-03 1.46e-03 ...
$ sample2: num 1.46e-03 1.91e+02 1.46e-03 1.46e-03 1.46e-03 ...
$ sample3: num 1.46e-03 3.01e+02 1.46e-03 1.46e-03 4.96 ...
For all of the 2000 rows I want to test whether there is a difference between 2 given parts of the matrix. I tried this in 4 different ways:
wilcox.test(as.numeric(myMatrix[i,2:87],myMatrix[i,88:98]))$p.value
#[1] 1.549484e-16
wilcox.test(myMatrix[i,2:87],myMatrix[i,88:98])$p.value
#Error in wilcox.test.default(myMatrix[i, 2:87], myMatrix[i, 88:98]) :
#'x' must be numeric
t.test(as.numeric(myMatrix[i,2:87],myMatrix[i,88:98]))$p.value
#[1] 0.2973957
t.test(myMatrix[i,2:87],myMatrix[i,88:98])$p.value
#[1] 0.3098505
So as you can see, only if I use as.numeric() on the already numeric values do I get a result without an error message for the Wilcoxon test, but the results differ completely from the t.test results even though they should not.
Manually verifying with an online tool shows that the t.test results using the as.numeric() values are wrong.
Any suggestions about how I can solve this problem and do the correct Wilcoxon test? If you need more information let me know.
Actually, myMatrix[i, 2:87] is still a data.frame. See the following example.
> myMat
fir X1 X2 X3 X4
1 name1 1 5 9 13
2 name2 2 6 10 14
3 name3 3 7 11 15
4 name4 4 8 12 16
> class(myMat[1, 2:4])
[1] "data.frame"
> as.numeric(myMat[1, 2:4])
[1] 1 5 9
Changing your data to a real matrix will solve your problem.
> myMat_01 <- myMat[, 2:5]
> rownames(myMat_01) <- myMat$fir
> myMat_01 <- as.matrix(myMat_01)
> class(myMat_01[1, 2:4])
[1] "integer"
I have a data.frame mydf, that contains data from 27 subjects. There are two predictors, congruent (2 levels) and offset (5 levels), so overall there are 10 conditions. Each of the 27 subjects was tested 20 times under each condition, resulting in a total of 10*27*20 = 5400 observations. RT is the response variable. The structure looks like this:
> str(mydf)
'data.frame': 5400 obs. of 4 variables:
$ subject : Factor w/ 27 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
$ congruent: logi TRUE FALSE FALSE TRUE FALSE TRUE ...
$ offset : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 5 5 1 2 5 5 2 2 3 5 ...
$ RT : int 330 343 457 436 302 311 595 330 338 374 ...
I've used daply() to calculate the mean RT of each subject in each of the 10 conditions:
myarray <- daply(mydf, .(subject, congruent, offset), summarize, mean = mean(RT))
The result looks just the way I wanted, i.e. a 3d array: in effect, 5 tables (one for each offset condition) that show the mean of each subject in the congruent=FALSE vs. the congruent=TRUE condition.
However if I check the structure of myarray, I get a confusing output:
List of 270
$ : num 417
$ : num 393
$ : num 364
$ : num 399
$ : num 374
...
# and so on
...
[list output truncated]
- attr(*, "dim")= int [1:3] 27 2 5
- attr(*, "dimnames")=List of 3
..$ subject : chr [1:27] "1" "2" "3" "5" ...
..$ congruent: chr [1:2] "FALSE" "TRUE"
..$ offset : chr [1:5] "1" "2" "3" "4" ...
This looks totally different from the structure of the prototypical ozone array from the plyr package, even though it's a very similar format (3 dimensions, only numerical values).
I want to compute some further summarizing information on this array, by means of aaply. Precisely, I want to calculate the difference between the congruent and the incongruent means for each subject and offset.
However, even the most basic application of aaply(), like aaply(myarray, 2, mean), returns nonsense output:
FALSE TRUE
NA NA
Warning messages:
1: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
2: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
I have no idea why the daply() function returns such weirdly structured output, thereby preventing any further use of aaply. Any kind of help is appreciated; I frankly admit that I have hardly any experience with the plyr package.
Since you haven't included your data it's hard to know for sure, but I tried to make a dummy set based on your str(). You can do what you want (I'm guessing) with two uses of ddply: first the means, then the difference of the means.
#Make dummy data
mydf <- data.frame(subject = rep(1:5, each = 150),
congruent = rep(c(TRUE, FALSE), each = 75),
offset = rep(1:5, each = 15), RT = sample(300:500, 750, replace = T))
#Make means
mydf.mean <- ddply(mydf, .(subject, congruent, offset), summarise, mean.RT = mean(RT))
#Calculate difference between congruent and incongruent
mydf.diff <- ddply(mydf.mean, .(subject, offset), summarise, diff.mean = diff(mean.RT))
head(mydf.diff)
# subject offset diff.mean
# 1 1 1 39.133333
# 2 1 2 9.200000
# 3 1 3 20.933333
# 4 1 4 -1.533333
# 5 1 5 -34.266667
# 6 2 1 -2.800000
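As for why daply() produced that weirdly structured output: summarize returns a one-row data frame for each cell, so the array ends up holding data frames (which are lists) instead of plain numbers, and mean() then fails on them. If you do want the 3d array, my understanding of plyr is that passing a function that returns a single number gives a plain numeric array that aaply can digest. A sketch against the dummy mydf above:
library(plyr)
# each cell is now a single numeric value, so the result is a numeric array
myarray <- daply(mydf, .(subject, congruent, offset), function(d) mean(d$RT))
str(myarray)             # num [1:5, 1:2, 1:5], with the same dimnames
aaply(myarray, 2, mean)  # one overall mean per congruent level, no warnings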
Below are the first few rows of my large data file:
Symbol|Security Name|Market Category|Test Issue|Financial Status|Round Lot Size
AAC|Australia Acquisition Corp. - Ordinary Shares|S|N|D|100
AACC|Asset Acceptance Capital Corp. - Common Stock|Q|N|N|100
AACOU|Australia Acquisition Corp. - Unit|S|N|N|100
AACOW|Australia Acquisition Corp. - Warrant|S|N|N|100
AAIT|iShares MSCI All Country Asia Information Technology Index Fund|G|N|N|100
AAME|Atlantic American Corporation - Common Stock|G|N|N|100
I read the data in:
data <- read.table("nasdaqlisted.txt", sep="|", quote='', header=TRUE, as.is=TRUE)
and construct an array and a matrix:
d1 <- array(data, dim=c(nrow(data), ncol(data)))
d2 <- matrix(data, nrow=nrow(data), ncol=ncol(data))
However, even though d1 is an array and d2 is a matrix, the class and mode are the same:
> class(d1)
[1] "matrix"
> mode(d1)
[1] "list"
> class(d2)
[1] "matrix"
> mode(d2)
[1] "list"
Why is this?
I'll bite and have a go at explaining my understanding of the issues.
You don't need your large test file to demonstrate the issue. A simple data.frame would do:
test <- data.frame(var1=1:2,var2=letters[1:2])
> test
var1 var2
1 1 a
2 2 b
Keep in mind that a data.frame is just a list internally.
> is.data.frame(test)
[1] TRUE
> is.list(test)
[1] TRUE
With a list-like structure as you would expect.
> str(test)
'data.frame': 2 obs. of 2 variables:
$ var1: int 1 2
$ var2: Factor w/ 2 levels "a","b": 1 2
> str(as.list(test))
List of 2
$ var1: int [1:2] 1 2
$ var2: Factor w/ 2 levels "a","b": 1 2
When you specify a matrix call against a data.frame or a list, you end up with a matrix filled with the elements of the data.frame or list.
result1 <- matrix(test)
> result1
[,1]
[1,] Integer,2
[2,] factor,2
Looking at the structure of result1, you can see it is still a list, but now just with dimensions (see the last line in the output below).
> str(result1)
List of 2
$ : int [1:2] 1 2
$ : Factor w/ 2 levels "a","b": 1 2
- attr(*, "dim")= int [1:2] 2 1
This means it is now both a matrix and a list:
> is.matrix(result1)
[1] TRUE
> is.list(result1)
[1] TRUE
If you strip the dimensions from this object, it will no longer be a matrix and will revert to just being a list.
dim(result1) <- NULL
> result1
[[1]]
[1] 1 2
[[2]]
[1] a b
Levels: a b
> is.matrix(result1)
[1] FALSE
> is.list(result1)
[1] TRUE
> str(result1)
List of 2
$ : int [1:2] 1 2
$ : Factor w/ 2 levels "a","b": 1 2
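If what you actually wanted was a conventional atomic matrix rather than a list with dimensions, the usual route is as.matrix(), which coerces every column to a single common type (character here, because of the factor var2). A short sketch using the test data frame from above:
m <- as.matrix(test)
m
##   var1 var2
## 1 "1"  "a"
## 2 "2"  "b"
mode(m)   # "character": an atomic matrix, not a list
## [1] "character"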