Problems working with factors and apply functions

I have a data frame that contains, among other columns, a factor whose levels are ranges of values. From what I understand, the levels are essentially bins for numeric values.
I want to convert these to numeric values so I can use them in downstream analysis. The idea is simple enough: (a) write a function that takes a factor level, splits it at the dash, extracts the numeric values and calculates their average, and (b) apply that function to the column:
data$Range.mean <- sapply(data$Range,
  function(d) {
    range <- as.matrix(strsplit(as.character(d), "-"))
    (as.numeric(range[,1]) + as.numeric(range[,2]))/2
  })
Which gives the following error
Error in FUN(X[[1L]], ...) :
(list) object cannot be coerced to type 'double'
I tried lapply instead, which makes no difference. While looking for answers, I found other solutions to this problem, which essentially extract the lower and upper bounds into separate vectors, after which calculating the pairwise average is trivial.
I would like to understand what I am doing/thinking wrong here though. Why is my code giving an error, and what does that error mean, really?

You are correct that factors are in fact integers with labeled bins. So if you have a factor like this
x <- factor(c("0-1", "0-1", "1-2", "1-2"))
it is essentially a combination of the following components
as.integer(x)
levels(x)
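To make the two components concrete, here is a small sketch using the same x. The names codes, labs and rebuilt are only for illustration.

```r
x <- factor(c("0-1", "0-1", "1-2", "1-2"))

codes <- as.integer(x)   # integer codes, one per element
labs  <- levels(x)       # the distinct labels the codes point to

# Indexing the labels by the codes reproduces as.character(x)
rebuilt <- labs[codes]
identical(rebuilt, as.character(x))
```

This is why as.numeric(x) on a factor returns the codes rather than anything parsed from the labels.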
To convert the factor to the actual values specified by its labels, you can take a detour through as.character and parse the result into numbers.
# Recreating a data frame with a factor like yours
data <- data.frame(Range = cut(runif(100), 0:10/10))
levels(data$Range) <- sub("\\((.*),(.*)]", "\\1-\\2", levels(data$Range))
# Calculating range means
sapply(strsplit(as.character(data$Range), "-"),
       function(x) mean(as.numeric(x)))
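As for why the original code fails, the following sketch walks through what happens for a single value. It assumes a level of the form "0-1"; the names parts, m, failed and range_mean are hypothetical. The key point is that strsplit returns a list, and as.matrix on a list yields a matrix whose cells are list elements, not numbers.

```r
d <- "0-1"

parts <- strsplit(d, "-")   # a list of length 1 holding c("0", "1")
m <- as.matrix(parts)       # a 1 x 1 matrix that is still a list underneath

is.list(parts)
dim(m)

# m[, 1] is a list containing a length-2 character vector,
# so as.numeric() cannot coerce it and raises the error from the question
failed <- inherits(try(as.numeric(m[, 1]), silent = TRUE), "try-error")

# Pulling out the first list element gives a plain character vector
range_mean <- mean(as.numeric(parts[[1]]))
```

So the error message is saying that as.numeric was handed a list rather than an atomic vector. Indexing with [[1]], or calling strsplit once on the whole column as above, avoids it.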

Related

How to use 'as.factor' with 'apply'?

I tried to convert the categorical features in a dataset to factors. However, using apply with as.factor did not work:
convert <- c(2:5, 7:9,11,16:17)
read_file[,convert] <- data.frame(apply(read_file[convert], 2, as.factor))
However, switching to lapply did work:
read_file[,convert] <- data.frame(lapply(read_file[convert], as.factor))
Can someone explain the difference to me, and why the second version works while the first fails?

apply returns a matrix and a matrix cannot contain a factor variable. Factor variables are coerced to character variables if you create a matrix from them. The documentation in help("apply") says:
In all cases the result is coerced by as.vector to one of the basic
vector types before the dimensions are set, so that (for example)
factor results will be coerced to a character array.
lapply returns a list and a list can contain (almost) anything. In fact, a data.frame is just a list with some additional attributes. You don't even need to call data.frame there. You can just subset-assign a list into a data.frame.
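A minimal sketch of the difference, using a made-up data frame (df, cols and via_apply are hypothetical names, not the asker's data):

```r
df <- data.frame(a = c("x", "y", "x"),
                 b = c("u", "u", "v"),
                 n = 1:3,
                 stringsAsFactors = FALSE)
cols <- c("a", "b")

# apply() works on a matrix and returns a matrix,
# so the factors produced by as.factor are flattened back to character
via_apply <- apply(df[cols], 2, as.factor)
is.matrix(via_apply)
is.character(via_apply)

# lapply() returns a list of factors, which can be assigned straight
# into the data frame without wrapping it in data.frame()
df[cols] <- lapply(df[cols], as.factor)
sapply(df, class)
```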

Selecting unique values from single column of a data frame

I have a data frame consisting of five character variables which represent specific bacteria. I then have thousands of observations of each variable that all begin with the letter K, e.g.
x <- c("K0001", "K0001", "K0003", "K0006")
y <- c("K0001", "K0001", "K0002", "K0003")
z <- c("K0001", "K0002", "K0007", "K0008")
r <- c("K0001", "K0001", "K0001", "K0001")
o <- c("K0003", "K0009", "K0009", "K0009")
I need to identify unique observations in the first column that don't appear in any of the remaining four columns. I have tried the approach suggested here which I think would work if I could create individual vectors using select ...
How to tell what is in one vector and not another?
but when I try to create a vector for analysis using the code ...
x <- select(data$x)
I get the error
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "character"
I have tried to convert the vectors using as.factor and as.numeric, but neither approach works: the first gives an error equivalent to the one above, and as.numeric returns NAs.
Thanks in advance

The reference that you cited recommended using setdiff. The only thing that you need to do to apply that solution is to convert the four columns into one, so that it can be treated as a set. You can do that with unlist
setdiff(data$x, unlist(data[,2:5]))
"K0006"
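For a self-contained version, the sketch below rebuilds the example columns from the question into a data frame named data (that construction is an assumption about how the real data is laid out) and applies the same setdiff call:

```r
data <- data.frame(
  x = c("K0001", "K0001", "K0003", "K0006"),
  y = c("K0001", "K0001", "K0002", "K0003"),
  z = c("K0001", "K0002", "K0007", "K0008"),
  r = c("K0001", "K0001", "K0001", "K0001"),
  o = c("K0003", "K0009", "K0009", "K0009"),
  stringsAsFactors = FALSE
)

# unlist() flattens columns 2 to 5 into one character vector;
# setdiff() then keeps the unique values of x that are absent from it
only_in_x <- setdiff(data$x, unlist(data[, 2:5]))
only_in_x
```

No dplyr is needed here. select() expects a data frame as its first argument, which is why passing the character vector data$x produced the error.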

Matrix subsetting by column name using the `subset` function

Consider the following simulation snippet:
k <- 1:5
x <- seq(0,10,length.out = 100)
dsts <- lapply(1:length(k), function(i) cbind(x=x, distri=dchisq(x,k[i]),i) )
dsts <- do.call(rbind,dsts)
Why does this code throw an error (dsts is a matrix)?
subset(dsts,i==1)
#Error in subset.matrix(dsts, i == 1) : object 'i' not found
Even this one:
colnames(dsts)[3] <- 'iii'
subset(dsts,iii==1)
But not this one (matrix coerced as dataframe):
subset(as.data.frame(dsts),i==1)
This one also runs without error, because x is already defined in the calling environment:
subset(dsts,x> 500)
The error occurs in subset.matrix() on this line:
else if (!is.logical(subset))
Is this a bug that should be reported to R Core?

The behavior you are describing is by design and is documented on the ?subset help page.
From the help page:
For data frames, the subset argument works on the rows. Note that subset will be evaluated in the data frame, so columns can be referred to (by name) as variables in the expression (see the examples).
In R, data.frames and matrices are very different types of objects. If this is causing a problem, you are probably using the wrong data structure for your data. Matrices are really only necessary if you need matrix arithmetic. If you are thinking of your columns as different attributes of each row's observation, then you should be storing your data in a data.frame in the first place. You could store all your values in a simple vector where every three values represent one observation, but that would also be a poor choice of data structure for your data. I'm not sure if you were trying to be more efficient by choosing a matrix, but it seems like just the wrong choice.
A data.frame is stored as a named list, while a matrix is stored as a dimensioned vector. A list can be used as an environment, which makes it easy to evaluate variable names in that context. The biggest difference between the two is that data.frames can hold columns of different classes (numerics, characters, dates) while matrices can only hold values of exactly one data type. You cannot always easily convert between the two without a loss of information.
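The "list as environment" point can be sketched with eval(), which accepts a list or data frame as the place to look up names. The objects df, keep and m below are hypothetical, and this is only a rough picture of what subset does for data frames, not its exact implementation:

```r
df <- data.frame(i = c(1, 2, 1), x = c(10, 20, 30))

# The unevaluated expression i == 1 is looked up inside df,
# so the name i resolves to the column df$i
keep <- eval(quote(i == 1), df, parent.frame())
df[keep, ]

# A matrix is a dimensioned atomic vector, not a list,
# so it has no named elements to evaluate against
m <- as.matrix(df)
is.list(df)
is.list(m)
```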
Things like $ work with data.frames but not with matrices.
dd <- data.frame(x=1:10)
dd$x
mm <- matrix(1:10, ncol=1, dimnames=list(NULL, "x"))
mm$x # Error
If you want to subset a matrix, you are better off using standard [ subsetting rather than the subset() function.
dsts[ dsts[,"i"]==1, ]
This behavior has been a part of R for a very long time. Any change to it is likely to break existing code that relies on variables being evaluated in a certain context. I think the problem lies with whoever told you to use a matrix in the first place. Rather than cbind(), you should have used data.frame().
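As a sketch of that suggestion, the simulation from the question can be built with data.frame() in place of cbind(); the names first and first_m are only for illustration:

```r
k <- 1:5
x <- seq(0, 10, length.out = 100)

# Build one data frame per value of k, then stack them
dsts <- do.call(rbind, lapply(seq_along(k), function(i) {
  data.frame(x = x, distri = dchisq(x, k[i]), i = i)
}))

# Column names are now visible inside subset()
first <- subset(dsts, i == 1)
nrow(first)

# The matrix equivalent uses plain [ indexing on the column name
m <- as.matrix(dsts)
first_m <- m[m[, "i"] == 1, ]
dim(first_m)
```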

R: Extracting elements from data.matrix(): elements non-numeric

I programmed a function, which created (or at least tried to create) a data frame of numeric values. I need to retrieve these numeric values later on in the function. For that purpose, I explicitly assigned all values in the data frame a numeric class, using
as.numeric()
Later on in my function, when I extract the elements from the data frame, using
mydataframe[1,2]
I get an error "non-numeric argument to binary operator". I don't really understand what is non-numeric in my data frame.
If I ask for class and mode of the values in the data frame, they are both "numeric", storage mode is "double". Can anyone enlighten me? Where do I go wrong?
By the way, I can extract elements without error, if I use
as.numeric(mydataframe[1,2])
But I need to extract quite a lot of elements, so I prefer all elements of my data frame being numeric.
My code:
mydata <- by(data, data[,index], function(data) {
  # ... myfunction including a for-loop, creating a vector of numbers (subvar1) ...
  var1 <- as.numeric(sum(subvar1) / n)
  var2 <- as.numeric(mean(data[,value]))
  var3 <- nrow(data)
  var3 <- as.numeric(var3)
  list(var1=var1, var2=var2, var3=var3)
})
mydataframe <- data.matrix(do.call(rbind, mydata))
Thanks in advance!
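One likely cause, sketched under assumptions: the function returns a list for each group, and rbind on lists builds a matrix whose cells are list elements. data.matrix leaves a non-data-frame input as a matrix, so single-bracket extraction hands back a one-element list. The example below replaces the by() result with a plain list named pieces (hypothetical), since the original data is not shown:

```r
# Stand-in for the by() result: one list of values per group
pieces <- list(a = list(var1 = 1, var2 = 2, var3 = 3),
               b = list(var1 = 4, var2 = 5, var3 = 6))

m <- data.matrix(do.call(rbind, pieces))
is.list(m)          # the matrix is still a list underneath
is.list(m[1, 2])    # so a single cell comes back as a list

# Arithmetic on that cell fails with a non-numeric argument error
failed <- inherits(try(m[1, 2] + 1, silent = TRUE), "try-error")

# Double brackets extract the value itself
cell <- m[[1, 2]]

# Returning a named numeric vector per group gives a true numeric matrix
vec_pieces <- lapply(pieces, unlist)
num <- do.call(rbind, vec_pieces)
is.numeric(num)
num[1, 2] + 1
```

If this matches the real situation, replacing list(var1=var1, ...) with c(var1=var1, ...) in the function would be the smallest change.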

Argument is not numeric or logical: returning NA

I'm trying to import a CSV file into R, which I was able to do with
Lab2x <- read.table("Lab2x.csv")
From here I'm trying to calculate the average, standard deviation, standard error, t-statistic and the p-value. I was taught to do this using:
xbar <- mean(Lab2x) # calculate the sample average
sd <- sqrt(var(Lab2x)) # calculate the sample sd
se <- sd/sqrt(12) # calculate se of sample average
tstat <- (xbar - 2.27)/se # calculate the t statistic
pvalue <- 2*(1-pt(abs(tstat),11)) # calculate the p-value
However, when I try to use any of these I get the error:
Warning message:
In mean.default(Lab2x) : argument is not numeric or logical: returning NA
What am I doing wrong/missing?

Lab2x is a data frame, that is, a list with one or more columns, so functions expecting a numeric vector will report that they are getting the wrong type of argument. Try substituting Lab2x[[1]] for Lab2x, assuming the first column is the one you are interested in.

It's hard to tell without seeing your data (try head(Lab2x)).
My advice is to check the data types of Lab2x: read.table constructs a data.frame from the data, and your values may currently be interpreted as character vectors rather than numeric values. A few possible causes:
Some columns aren't numeric, and the warnings are being thrown there.
None of the columns are numeric, which means it's struggling to find the numbers.
Is it reading in the correct number of columns? Try read.csv instead.
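Putting both answers together, here is a hedged sketch of the whole calculation. The real Lab2x.csv is not shown, so the example writes 12 made-up values to a temporary file and assumes a single column with a header row; the column name x and the variable names vals, n and s are hypothetical.

```r
# Hypothetical sample of 12 measurements written to a temporary CSV
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = c(2.1, 2.4, 2.3, 2.6, 2.2, 2.5,
                           2.0, 2.7, 2.3, 2.4, 2.2, 2.5)),
          tmp, row.names = FALSE)

Lab2x <- read.csv(tmp)   # read.csv assumes a header and comma separators
str(Lab2x)               # check that the column was read as numeric

vals <- Lab2x[[1]]       # extract the column as a plain numeric vector
n <- length(vals)

xbar   <- mean(vals)                         # sample average
s      <- sd(vals)                           # sample standard deviation
se     <- s / sqrt(n)                        # standard error of the average
tstat  <- (xbar - 2.27) / se                 # t statistic against 2.27
pvalue <- 2 * (1 - pt(abs(tstat), n - 1))    # two-sided p-value
```

The result can be cross-checked against t.test(vals, mu = 2.27), which computes the same statistic and p-value.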