Challenge: recoding a data.frame() — make it faster - r

Recoding is a common practice for survey data, but the most obvious routes take more time than they should.
The fastest code that accomplishes the same task with the provided sample data by system.time() on my machine wins.
## Sample data
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)
dat <- as.data.frame(dat)
re.codes <- c("This","That","And","The","Other")
Code to optimize.
for(x in 1:ncol(dat)) {
dat[,x] <- factor(dat[,x], labels=re.codes)
}
Current system.time():
user system elapsed
4.40 0.10 4.49
Hint: dat <- lapply(1:ncol(dat), function(x) factor(dat[,x], labels=re.codes)) is not any faster.

Combining @DWin's answer, and my answer from Most efficient list to data.frame method?:
system.time({
dat3 <- list()
# define attributes once outside of loop
attrib <- list(class="factor", levels=re.codes)
for (i in names(dat)) { # loop over each column in 'dat'
dat3[[i]] <- as.integer(dat[[i]]) # convert column to integer
attributes(dat3[[i]]) <- attrib # assign factor attributes
}
# convert 'dat3' into a data.frame. We can do it like this because:
# 1) we know 'dat' and 'dat3' have the same number of rows and columns
# 2) we want 'dat3' to have the same colnames as 'dat'
# 3) we don't care if 'dat3' has different rownames than 'dat'
attributes(dat3) <- list(row.names=c(NA_integer_,nrow(dat)),
class="data.frame", names=names(dat))
})
identical(dat2, dat3) # 'dat2' is from @DWin's answer

My computer is obviously much slower, but structure is a pretty fast way to do this:
> system.time({
+ dat1 <- dat
+ for(x in 1:ncol(dat)) {
+ dat1[,x] <- factor(dat1[,x], labels=re.codes)
+ }
+ })
user system elapsed
11.965 3.172 15.164
>
> system.time({
+ m <- as.matrix(dat)
+ dat2 <- data.frame( matrix( re.codes[m], nrow = nrow(m)))
+ })
user system elapsed
2.100 0.516 2.621
>
> system.time(dat3 <- data.frame(lapply(dat, structure, class='factor', levels=re.codes)))
user system elapsed
0.484 0.332 0.820
# dat2 does not compare equal to dat1 because its levels get re-ordered
> all.equal(dat1, dat2)
> all.equal(dat1, dat3)
[1] TRUE
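To see why the structure() call can stand in for factor(), it may help to compare the two on a tiny vector. This is only an illustrative sketch: codes, labs, slow and fast are made-up names, and it assumes the input is already an integer vector holding the values 1 to 5.

```r
# Hypothetical miniature of one column and the recode labels
codes <- c(1L, 2L, 4L, 5L, 3L)
labs  <- c("This", "That", "And", "The", "Other")

# factor() sorts the unique values and matches every element against them
slow <- factor(codes, labels = labs)

# structure() only attaches the two attributes that make an integer vector a factor
fast <- structure(codes, class = "factor", levels = labs)

identical(slow, fast)
```

The shortcut relies on the integer codes already lining up with the positions in labs; factor() does not need that assumption, which is part of why it is slower.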

Try this:
m <- as.matrix(dat)
dat <- data.frame( matrix( re.codes[m], nrow = nrow(m)))

A data.table answer for your consideration. We're just using setattr() from it, which works on data.frame, and columns of data.frame. No need to convert to data.table.
The test data again :
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000))
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)
dat <- as.data.frame(dat)
re.codes <- c("This","That","And","The","Other")
Now change the class and set the levels of each column directly, by reference :
require(data.table)
system.time(for (i in 1:ncol(dat)) {
setattr(dat[[i]],"levels",re.codes)
setattr(dat[[i]],"class","factor")
})
# user system elapsed
# 0 0 0
identical(dat, <result in question>)
# [1] TRUE
Does 0.00 win? As you increase the size of the data, this method stays at 0.00.
OK, I admit it: I changed the input data slightly so that all columns are integer (the question has double input data in a third of the columns). Those double columns have to be converted to integer, because factor is only valid for integer vectors, as mentioned in the other answers.
So, strictly with the input data in the question, and including the double to integer conversion :
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)
dat <- as.data.frame(dat)
re.codes <- c("This","That","And","The","Other")
system.time(for (i in 1:ncol(dat)) {
if (!is.integer(dat[[i]]))
set(dat,j=i,value=as.integer(dat[[i]]))
setattr(dat[[i]],"levels",re.codes)
setattr(dat[[i]],"class","factor")
})
# user system elapsed
# 0.06 0.01 0.08 # on my slow netbook
identical(dat, <result in question>)
# [1] TRUE
Note that set works on data.frame, too. You don't have to convert to data.table to use it.
These are very small times, clearly. Since it's only a small input dataset :
dim(dat)
# [1] 250000 36
object.size(dat)
# 68.7 Mb
Scaling up from this should reveal larger differences. But even so I think it should be (just about) measurably fastest. Not a significant difference that anyone minds about, at this size, though.
The setattr function is also in the bit package, btw. So the 0.00 method can be done with either data.table or bit. To do the type conversion by reference (if required) either set or := (both in data.table) is needed, afaik.

The help page for class() says that class<- is deprecated and to use as. methods. I haven't quite figured out why the earlier effort was reporting 0 observations when the data was obviously in the object, but this method results in a complete object:
system.time({ dat2 <- vector(mode="list", length(dat))
for (i in 1:length(dat) ){ dat2[[i]] <- dat[[i]]
storage.mode(dat2[[i]]) <- "integer"
attributes(dat2[[i]]) <- list(class="factor", levels=re.codes)}
names(dat2) <- names(dat)
dat2 <- as.data.frame(dat2)})
#--------------------------
user system elapsed
0.266 0.290 0.560
> str(dat2)
'data.frame': 250000 obs. of 36 variables:
$ V1 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
$ V2 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
$ V3 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
$ V4 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
$ V5 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
$ V6 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
$ V7 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
$ V8 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
snipped
All 36 columns are there.

Making factors is expensive; only doing it once is comparable with the commands using structure, and in my opinion, preferable as you don't have to depend on how factors happen to be constructed.
rc <- factor(re.codes, levels=re.codes)
dat5 <- as.data.frame(lapply(dat, function(d) rc[d]))
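A small sketch of what the indexing does, on a made-up column (col and recoded are hypothetical names). Subsetting a factor keeps its levels, so each column ends up with the full level set in the intended order.

```r
re.codes <- c("This", "That", "And", "The", "Other")

# Build the factor once; its i-th element has integer code i
rc <- factor(re.codes, levels = re.codes)

# A hypothetical survey column holding the codes 1 to 5
col <- c(1, 2, 4, 5, 3, 1)

# Indexing the factor by the column values performs the recode
recoded <- rc[col]

levels(recoded)
as.character(recoded)
```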
EDIT 2: Interestingly, this seems to be a case where lapply does speed things up. This for loop is substantially slower.
for(i in seq_along(dat)) {
dat[[i]] <- rc[dat[[i]]]
}
EDIT 1: You can also speed things up by being more precise with your types. Try any of the solutions (but especially your original one) creating your data as integers, as follows. For details, see a previous answer of mine here.
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000))
This is also a good idea as converting to integers from floating points, as is being done in all of the faster solutions here, can give unexpected behavior, see this question.
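As a hedged illustration of the kind of surprise meant here: as.integer() truncates toward zero, so a double that is only approximately a whole number can land on the wrong integer. The value below is a common textbook case, not taken from the sample data.

```r
# 0.29 * 100 is stored as a double slightly below 29
x <- 0.29 * 100

truncated <- as.integer(x)         # truncation drops the fractional part
rounded   <- as.integer(round(x))  # rounding first gives the intended value

c(truncated = truncated, rounded = rounded)
```

The sample data in the question holds exact whole numbers, so the conversion is safe there; the risk applies to values produced by floating point arithmetic.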

Related

Iteratively change column classes in a set range

I am working with a modest survey dataset (190 x 2162). The platform we are using exports to csv, but when imported every column is a factor. I could assign each column a class on import, but a recent change of hats at work has put many more surveys in my future. So, out of respect for my future self's sanity, I am looking to build a small library of functions to convert ranges of columns as needed.
Goals of the initial function:
Convert from factor to numeric
Take column names as input, rather than numbers
The core of the function appears to work fine:
startNum <- match("Q7_1", names(rawNum))
endNum <- match("Q7_19", names(rawNum))
for(i in c(startNum:endNum)){
rawNum[,i] <- as.character(rawNum[,i])
rawNum[,i] <- as.numeric(rawNum[,i])
}
However, when I attempt to wrap this in a function, it falls apart. I believe the issue is in passing the *_col_name arguments into the function, but I can't seem to find where it is going wrong.
facToNum <- function(frame_name, start_col_name, end_col_name){
startNum <- match(start_col_name, names(frame_name))
endNum <- match(end_col_name, names(frame_name))
for(i in c(startNum:endNum)){
frame_name[,i] <- as.character(frame_name[,i])
frame_name[,i] <- as.numeric(frame_name[,i])
}
}
What am I missing here? I'm sure it's something obvious, but I've had to soldier on with my partial solution, and it rankles.
I don't see any obvious mistakes in your function except that you need to return the changed dataframe at the end of the function.
facToNum <- function(frame_name, start_col_name, end_col_name){
startNum <- match(start_col_name, names(frame_name))
endNum <- match(end_col_name, names(frame_name))
for(i in c(startNum:endNum)){
frame_name[,i] <- as.character(frame_name[,i])
frame_name[,i] <- as.numeric(frame_name[,i])
}
return(frame_name)
}
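The reason the return matters is that R functions work on a copy of their arguments, and a function whose last expression is a for loop returns NULL. A minimal sketch with made-up names (no_return, with_return, df0):

```r
# Modifies its local copy, then ends with a for loop, so the result is NULL
no_return <- function(d) {
  for (i in seq_along(d)) d[[i]] <- d[[i]] * 2
}

# Same work, but hands the modified copy back to the caller
with_return <- function(d) {
  for (i in seq_along(d)) d[[i]] <- d[[i]] * 2
  d
}

df0  <- data.frame(x = 1:3)
res1 <- no_return(df0)
res2 <- with_return(df0)

is.null(res1)   # the caller gets nothing back
df0$x           # the original is untouched in both cases
res2$x          # the doubled values only exist in the returned copy
```

So the call site needs an assignment as well, for example rawNum <- facToNum(rawNum, "Q7_1", "Q7_19").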
You can also simplify your approach using dplyr library.
library(dplyr)
facToNum <- function(frame_name, start_col_name, end_col_name){
frame_name %>%
mutate(across(all_of(start_col_name):all_of(end_col_name), ~as.numeric(as.character(.x))))
}
df <- data.frame(a = factor(rnorm(5)), b = factor(runif(5)), c = 1:5)
str(df)
#'data.frame': 5 obs. of 3 variables:
# $ a: Factor w/ 5 levels "-1.64324479436782",..: 3 1 5 4 2
# $ b: Factor w/ 5 levels "0.11049344507046",..: 1 4 3 2 5
# $ c: int 1 2 3 4 5
result <- facToNum(df, "a", "b")
str(result)
#'data.frame': 5 obs. of 3 variables:
# $ a: num -0.11733 -1.64324 1.26266 0.00823 -0.9531
# $ b: num 0.11 0.614 0.545 0.228 0.902
# $ c: int 1 2 3 4 5

Function works well on dummy data, on real data "Error: grouping factor must have exactly 2 levels"?

I have created a function, which works well on dummy data. But, when I run this function on real data, I've got back an error
Error in wilcox.test.formula(tab[[dependent]] ~ as.factor(tab$group), :
grouping factor must have exactly 2 levels
and warning messages:
In wilcox.test.default(x = c(11.2558701380866, 31.8401548036613, : cannot compute exact p-value with ties
So the "thresholding" in my function does not seem to split the real data correctly into two groups. The subsetting of the real data also seems to be incorrect, but I don't understand why. The structure of the dummy and real tables seems the same:
Structure of dummy and real data:
Dummy:
> str(tab)
'data.frame': 80 obs. of 3 variables:
$ infGrad : num 14.15 12.53 3.03 9.21 16.36 ...
$ distance : int 1 1 1 1 1 1 1 1 1 1 ...
$ uniqueGroup: Factor w/ 2 levels "x","y": 1 2 1 2 1 2 1 2 1 2 ...
Real:
> str(tab)
'data.frame': 142 obs. of 10 variables:
$ distance : num 100 100 100 100 100 100 100 100 100 100 ...
$ infGrad : num 11.3 17.4 31.8 11.1 47.8 ...
$ uniqueGroup: Factor w/ 6 levels "x",..: 5 2 5 2 5 5 5 5 3 6 ...
I have found that NAs might cause these problems, or specification of formula of the wilcox.test(y ~ x).
So, I tried to add na.omit to my function, and instead of wilcox.test(y~x) use wilcox.test(y, x). None of these have worked.
Do you have any ideas how to make my function work or how to make it more robust to accept my real data? Your help is highly appreciated.
What the code does:
classify data in two groups by "moving threshold"
test statistical differences between those groups.
I run the function with nested lapply to vary my thresholds and different data subsets.
My dummy data:
set.seed(10)
infGrad <- c(rnorm(20, mean=14, sd=8),
rnorm(20, mean=13, sd=5),
rnorm(20, mean=8, sd=2),
rnorm(20, mean=7, sd=1))
distance <- rep(c(1:4), each = 20)
uniqueGroup <- rep(c("x", "y"), 40)
tab<-data.frame(infGrad, distance, uniqueGroup)
# Create moving threshold function
movThreshold <- function(th, tab, dependent, ...) {
tab<-na.omit(tab)
# Classify data
tab$group<- ifelse(tab$distance < th, "a", "b") # does not WORK on REAL data
# Calculate Wilcoxon test
test<-wilcox.test(tab[[dependent]] ~ as.factor(tab$group), # specify column name
data = tab)
# Put results in a vector
c(th, dependent, round(test$p.value, 3))
}
# Define two vectors to run through
# unique group
gr.list<-unique(tab$uniqueGroup)
# unique threshold
th.list<-c(2,3,4)
# apply function over thresholds and subsets
res<-lapply(gr.list, function(x) lapply(th.list,
movThreshold,
tab = tab[uniqueGroup == x,], # does not work on REAL data
dependent = "infGrad"))
What seems not working on real data:
Groups classification within the function
tab$group<- ifelse(tab$distance < th, "a", "b")
Data subsetting in nested lapply loop
subsetting: tab = tab[uniqueGroup == x,]
The issue probably happens because of a single value group.
You can reproduce the error for instance adding a high value to th.list.
# unique threshold
th.list<-c(2,3,4,100)
The easiest way to avoid this is to check the number of distinct values in tab$group before performing the test.
This change in the function should suffice:
movThreshold <- function(th, tab, dependent, ...) {
tab<-na.omit(tab)
# Classify data
tab$group<- ifelse(tab$distance < th, "a", "b") # does not WORK on REAL data
# Check there are two groups
if(length(unique(tab$group))<2){return(NA)}
# Calculate Wilcoxon test
test<-wilcox.test(tab[[dependent]] ~ as.factor(tab$group), # specify column name
data = tab)
# Put results in a vector
c(th, dependent, round(test$p.value, 3))
}
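A self-contained sketch of the guarded version follows, run on dummy data with an extra threshold of 100 so the guard is actually exercised. The name safe_threshold_test is made up here. The sketch also subsets with tab$uniqueGroup rather than a bare uniqueGroup: the bare name is looked up outside the data frame, which works for the dummy data only because a vector of that name exists in the workspace. Whether that contributes to the problem on the real data is an assumption, not something the question confirms.

```r
set.seed(10)
tab <- data.frame(
  infGrad     = rnorm(40, mean = 10, sd = 3),
  distance    = rep(1:4, each = 10),
  uniqueGroup = rep(c("x", "y"), 20),
  stringsAsFactors = FALSE
)

safe_threshold_test <- function(th, tab, dependent) {
  tab <- na.omit(tab)
  # Classify rows as below ("a") or at/above ("b") the threshold
  group <- factor(ifelse(tab$distance < th, "a", "b"))
  # Skip the test when the threshold leaves only one group
  if (nlevels(group) != 2) return(c(th, dependent, NA))
  test <- wilcox.test(tab[[dependent]] ~ group)
  c(th, dependent, round(test$p.value, 3))
}

res <- lapply(unique(tab$uniqueGroup), function(g) {
  sub <- tab[tab$uniqueGroup == g, ]   # column taken from 'tab' itself
  lapply(c(2, 3, 4, 100), safe_threshold_test,
         tab = sub, dependent = "infGrad")
})
```

Returning a vector of the same length with NA in the p-value slot, instead of a bare NA, keeps the results easy to bind into a table afterwards.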

Call apply-like function on each row of dataframe with multiple arguments from each row

I have a dataframe with multiple columns. For each row in the dataframe, I want to call a function on the row, and the input of the function is using multiple columns from that row. For example, let's say I have this data and this testFunc which accepts two args:
> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
x y z
1 1 3 5
2 2 4 6
> testFunc <- function(a, b) a + b
Let's say I want to apply this testFunc to columns x and z. So, for row 1 I want 1+5, and for row 2 I want 2 + 6. Is there a way to do this without writing a for loop, maybe with the apply function family?
I tried this:
> df[,c('x','z')]
x z
1 1 5
2 2 6
> lapply(df[,c('x','z')], testFunc)
Error in a + b : 'b' is missing
But got error, any ideas?
EDIT: the actual function I want to call is not a simple sum, but it is power.t.test. I used a+b just for example purposes. The end goal is to be able to do something like this (written in pseudocode):
df = data.frame(
delta=c(delta_values),
power=c(power_values),
sig.level=c(sig.level_values)
)
lapply(df, power.t.test(delta_from_each_row_of_df,
power_from_each_row_of_df,
sig.level_from_each_row_of_df
))
where the result is a vector of outputs for power.t.test for each row of df.
You can apply apply to a subset of the original data.
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
apply(dat[,c('x','z')], 1, function(x) sum(x) )
or if your function is just sum use the vectorized version:
rowSums(dat[,c('x','z')])
[1] 6 8
If you want to use testFunc
testFunc <- function(a, b) a + b
apply(dat[,c('x','z')], 1, function(x) testFunc(x[1],x[2]))
EDIT To access columns by name and not index you can do something like this:
testFunc <- function(a, b) a + b
apply(dat[,c('x','z')], 1, function(y) testFunc(y['z'],y['x']))
A data.frame is a list, so ...
For vectorized functions do.call is usually a good bet. But the names of arguments come into play. Here your testFunc is called with args x and z in place of a and b. The ... allows irrelevant args (here the y column) to be passed without causing an error:
do.call( function(x,z,...) testFunc(x,z), df )
For non-vectorized functions, mapply will work, but you need to match the ordering of the args or explicitly name them:
mapply(testFunc, df$x, df$z)
Sometimes apply will work - as when all args are of the same type so coercing the data.frame to a matrix does not cause problems by changing data types. Your example was of this sort.
If your function is to be called within another function into which the arguments are all passed, there is a much slicker method than these. Study the first lines of the body of lm() if you want to go that route.
Use mapply
> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
x y z
1 1 3 5
2 2 4 6
> mapply(function(x,y) x+y, df$x, df$z)
[1] 6 8
> cbind(df,f = mapply(function(x,y) x+y, df$x, df$z) )
x y z f
1 1 3 5 6
2 2 4 6 8
New answer with dplyr package
If the function that you want to apply is vectorized,
then you could use the mutate function from the dplyr package:
> library(dplyr)
> myf <- function(tens, ones) { 10 * tens + ones }
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mutate(x, value = myf(tens, ones))
hundreds tens ones value
1 7 1 4 14
2 8 2 5 25
3 9 3 6 36
Old answer with plyr package
In my humble opinion,
the tool best suited to the task is mdply from the plyr package.
Example:
> library(plyr)
> x <- data.frame(tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
tens ones V1
1 1 4 14
2 2 5 25
3 3 6 36
Unfortunately, as Bertjan Broeksema pointed out,
this approach fails if you don't use all the columns of the data frame
in the mdply call.
For example,
> library(plyr)
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
Error in (function (tens, ones) : unused argument (hundreds = 7)
Others have correctly pointed out that mapply is made for this purpose, but (for the sake of completeness) a conceptually simpler method is just to use a for loop.
for (row in 1:nrow(df)) {
df$newvar[row] <- testFunc(df$x[row], df$z[row])
}
Many functions are vectorized already, and so there is no need for any iteration (neither for loops nor *pply functions). Your testFunc is one such example. You can simply call:
testFunc(df[, "x"], df[, "z"])
In general, I would recommend trying such vectorization approaches first and see if they get you your intended results.
Alternatively, if you need to pass multiple arguments to a function which is not vectorized, mapply might be what you are looking for:
mapply(power.t.test, df[, "x"], df[, "z"])
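Since mapply matches unnamed vectors to arguments by position, it is probably safer to name them when the target is power.t.test. A minimal sketch, assuming the data frame has columns called delta and power (the column values and the name n_needed are made up for illustration):

```r
# Hypothetical design grid: one row per scenario
df <- data.frame(delta = c(1, 1, 2), power = c(0.90, 0.85, 0.75))

# Wrap power.t.test so each column is passed to the argument of the same name,
# then keep only the per-group sample size from each result
n_needed <- mapply(
  function(delta, power) power.t.test(delta = delta, power = power)$n,
  df$delta, df$power
)

cbind(df, n = n_needed)
```

Leaving n unspecified tells power.t.test to solve for it; sd and sig.level fall back to their defaults of 1 and 0.05.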
Here is an alternate approach. It is more intuitive.
One key aspect I feel some of the answers did not take into account, which I point out for posterity, is that apply() lets you do row calculations easily, but only for matrix (all numeric) data.
Operations on columns are still possible for data frames:
as.data.frame(lapply(df, myFunctionForColumn))
To operate on rows, we make the transpose first.
tdf<-as.data.frame(t(df))
as.data.frame(lapply(tdf, myFunctionForRow))
The downside is that I believe R will make a copy of your data table.
Which could be a memory issue. (This is truly sad, because it is programmatically simple for tdf to just be an iterator to the original df, thus saving memory, but R does not allow pointer or iterator referencing.)
Also, a related question, is how to operate on each individual cell in a dataframe.
newdf <- as.data.frame(lapply(df, function(x) {sapply(x, myFunctionForEachCell)}))
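If the transpose is a concern (t() also goes through a matrix, so mixed column types get coerced), one alternative is to iterate over row indices and pass one-row data frames. This is a sketch with made-up names (row_fun, out), not the method of the answer above:

```r
df <- data.frame(id = c("a", "b"), x = c(1, 2), z = c(5, 6),
                 stringsAsFactors = FALSE)

# Hypothetical row function that needs both a character and numeric columns
row_fun <- function(row) paste(row$id, row$x + row$z)

# df[i, ] is a one-row data frame, so every column keeps its own type
out <- vapply(seq_len(nrow(df)), function(i) row_fun(df[i, ]), character(1))
out
```

This still copies each row, so it trades memory for type safety rather than avoiding copies altogether.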
data.table has a really intuitive way of doing this as well:
library(data.table)
sample_fxn = function(x,y,z){
return((x+y)*z)
}
df = data.table(A = 1:5,B=seq(2,10,2),C = 6:10)
> df
A B C
1: 1 2 6
2: 2 4 7
3: 3 6 8
4: 4 8 9
5: 5 10 10
The := operator can be called within brackets to add a new column using a function
df[,new_column := sample_fxn(A,B,C)]
> df
A B C new_column
1: 1 2 6 18
2: 2 4 7 42
3: 3 6 8 72
4: 4 8 9 108
5: 5 10 10 150
It's also easy to accept constants as arguments as well using this method:
df[,new_column2 := sample_fxn(A,B,2)]
> df
A B C new_column new_column2
1: 1 2 6 18 6
2: 2 4 7 42 12
3: 3 6 8 72 18
4: 4 8 9 108 24
5: 5 10 10 150 30
@user20877984's answer is excellent. Since they summed it up far better than my previous answer, here is my (possibly still shoddy) attempt at an application of the concept:
Using do.call in a basic fashion:
powvalues <- list(power=0.9,delta=2)
do.call(power.t.test,powvalues)
Working on a full data set:
# get the example data
df <- data.frame(delta=c(1,1,2,2), power=c(.90,.85,.75,.45))
#> df
# delta power
#1 1 0.90
#2 1 0.85
#3 2 0.75
#4 2 0.45
lapply the power.t.test function to each of the rows of specified values:
result <- lapply(
split(df,1:nrow(df)),
function(x) do.call(power.t.test,x)
)
> str(result)
List of 4
$ 1:List of 8
..$ n : num 22
..$ delta : num 1
..$ sd : num 1
..$ sig.level : num 0.05
..$ power : num 0.9
..$ alternative: chr "two.sided"
..$ note : chr "n is number in *each* group"
..$ method : chr "Two-sample t test power calculation"
..- attr(*, "class")= chr "power.htest"
$ 2:List of 8
..$ n : num 19
..$ delta : num 1
..$ sd : num 1
..$ sig.level : num 0.05
..$ power : num 0.85
... ...
I came here looking for the tidyverse function name, which I knew existed. Adding this for (my) future reference and for tidyverse enthusiasts: purrrlyr::invoke_rows (purrr::invoke_rows in older versions).
With connection to standard stats methods as in the original question, the broom package would probably help.
If data.frame columns are different types, apply() has a problem.
A subtlety about row iteration is how apply(a.data.frame, 1, ...) does
implicit type conversion to character types when columns are different types;
eg. a factor and numeric column. Here's an example, using a factor
in one column to modify a numeric column:
mean.height = list(BOY=69.5, GIRL=64.0)
subjects = data.frame(gender = factor(c("BOY", "GIRL", "GIRL", "BOY"))
, height = c(71.0, 59.3, 62.1, 62.1))
apply(subjects, 1, function(x) x[2] - mean.height[[x[1]]])
The subtraction fails because the columns are converted to character types.
One fix is to back-convert the second column to a number:
apply(subjects, 1, function(x) as.numeric(x[2]) - mean.height[[x[1]]])
But the conversions can be avoided by keeping the columns separate
and using mapply():
mapply(function(x,y) y - mean.height[[x]], subjects$gender, subjects$height)
mapply() is needed because [[ ]] does not accept a vector argument. So the column
iteration could be done before the subtraction by passing a vector to [],
by a bit more ugly code:
subjects$height - unlist(mean.height[subjects$gender])
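The coercion can be checked directly, and the vectorized lookup can be made independent of the factor's internal codes by indexing with names. A short sketch on the same data (m and diffs are names added here):

```r
mean.height <- list(BOY = 69.5, GIRL = 64.0)
subjects <- data.frame(gender = factor(c("BOY", "GIRL", "GIRL", "BOY")),
                       height = c(71.0, 59.3, 62.1, 62.1))

# apply() converts the data frame to a matrix first; with a factor column
# present, every cell becomes a character string
m <- as.matrix(subjects)
typeof(m)

# Indexing the list by name (not by factor code) avoids relying on the
# list order matching the level order
diffs <- subjects$height -
  unlist(mean.height[as.character(subjects$gender)], use.names = FALSE)
diffs
```

Indexing a list with a factor uses the integer codes, so the shorter version above works only because BOY and GIRL appear in the list in the same order as the factor levels.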
A really nice function for this is adply from plyr, especially if you want to append the result to the original dataframe. This function and its cousin ddply have saved me a lot of headaches and lines of code!
df_appended <- adply(df, 1, mutate, sum=x+z)
Alternatively, you can call the function you desire.
df_appended <- adply(df, 1, mutate, sum=testFunc(x,z))

Convert factor to integer in a data frame

I have the following code
anna.table<-data.frame (anna1,anna2)
write.table(anna.table, file="anna.file.txt",sep='\t', quote=FALSE)
my table in the end contains numbers such as the following
chr start end score
chr2 41237927 41238801 151
chr1 36976262 36977889 226
chr8 83023623 83025129 185
and so on......
after that i am trying to to get only the values which fit some criteria such as score less than a specific value
so i am doing the following
anna3<-"data/anna/anna.file.txt"
anna.total<-read.table(anna3,header=TRUE)
significant.anna<-subset(anna.total,score <=0.001)
Error: In Ops.factor(score, 0.001) <= not meaningful for factors
so i guess the problem is that my table has factors and not integers
I guess that my anna.total$score is a factor and i must make it an integer
If i read correctly the as.numeric might solve my problem
i am reading about the as.numeric function but i cannot understand how i can use it
Hence could you please give me some advices?
thank you in advance
best regards
Anna
PS : i tried the following
anna3<-"data/anna/anna.file.txt"
anna.total<-read.table(anna3,header=TRUE)
anna.total$score.new<-as.numeric (as.character(anna.total$score))
write.table(anna.total,file="peak.list.numeric.v3.txt",append = FALSE ,quote = FALSE,col.names =TRUE,row.names=FALSE, sep="\t")
anna.peaks<-subset(anna.total,fdr.new <=0.001)
Warning messages:
1: In Ops.factor(score, 0.001) : <= not meaningful for factors
again i have the same problem......
With anna.table (it is a data frame by the way, a table is something else!), the easiest way will be to just do:
anna.table2 <- data.matrix(anna.table)
as data.matrix() will convert factors to their underlying integer codes. This will work for a data frame that contains only numeric, integer, factor or other variables that can be coerced to numeric, but any character strings (character) will cause the matrix to become a character matrix.
If you want anna.table2 to be a data frame, not as matrix, then you can subsequently do:
anna.table2 <- data.frame(anna.table2)
Other options are to coerce all factor variables to their integer levels. Here is an example of that:
## dummy data
set.seed(1)
dat <- data.frame(a = factor(sample(letters[1:3], 10, replace = TRUE)),
b = runif(10))
## sapply over `dat`, converting factor to numeric
dat2 <- sapply(dat, function(x) if(is.factor(x)) {
as.numeric(x)
} else {
x
})
dat2 <- data.frame(dat2) ## convert to a data frame
Which gives:
> str(dat)
'data.frame': 10 obs. of 2 variables:
$ a: Factor w/ 3 levels "a","b","c": 1 2 2 3 1 3 3 2 2 1
$ b: num 0.206 0.177 0.687 0.384 0.77 ...
> str(dat2)
'data.frame': 10 obs. of 2 variables:
$ a: num 1 2 2 3 1 3 3 2 2 1
$ b: num 0.206 0.177 0.687 0.384 0.77 ...
However, do note that the above will work only if you want the underlying numeric representation. If your factor has essentially numeric levels, then we need to be a bit cleverer in how we convert the factor to a numeric whilst preserving the "numeric" information coded in the levels. Here is an example:
## dummy data
set.seed(1)
dat3 <- data.frame(a = factor(sample(1:3, 10, replace = TRUE), levels = 3:1),
b = runif(10))
## sapply over `dat3`, converting factor to numeric
dat4 <- sapply(dat3, function(x) if(is.factor(x)) {
as.numeric(as.character(x))
} else {
x
})
dat4 <- data.frame(dat4) ## convert to a data frame
Note how we need to do as.character(x) first before we do as.numeric(). The extra call encodes the level information before we convert that to numeric. To see why this matters, note what dat3$a is
> dat3$a
[1] 1 2 2 3 1 3 3 2 2 1
Levels: 3 2 1
If we just convert that to numeric, we get the wrong data as R converts the underlying level codes
> as.numeric(dat3$a)
[1] 3 2 2 1 3 1 1 2 2 3
If we coerce the factor to a character vector first, then to a numeric one, we preserve the original information not R's internal representation
> as.numeric(as.character(dat3$a))
[1] 1 2 2 3 1 3 3 2 2 1
If your data are like this second example, then you can't use the simple data.matrix() trick as that is the same as applying as.numeric() directly to the factor and as this second example shows, that doesn't preserve the original information.
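For a compact comparison, the sketch below puts the two safe conversions next to the unsafe one on a small factor. The second form, which converts the levels once and then indexes them by the codes, is the variant suggested in the R FAQ; the variable names are made up here.

```r
# Factor with numeric-looking levels stored in reverse order
f <- factor(c(1, 2, 2, 3, 1), levels = 3:1)

wrong         <- as.numeric(f)                # internal codes, not the values
via_character <- as.numeric(as.character(f))  # converts every element
via_levels    <- as.numeric(levels(f))[f]     # converts each level only once

rbind(wrong, via_character, via_levels)
```

On long vectors with few levels the via_levels form does less work, since only the levels are parsed as numbers.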
I know this is an older question, but I just had the same problem, so maybe this helps:
In this case, your score column looks like it should not have become a factor column. That usually happens with read.table when the column is read as text. Depending on which country you are from, you may separate decimals with a "," rather than a ".". R then treats the column as character and makes it a factor. In that case Gavin's answer won't work, because R won't turn "123,456" into 123.456. You can easily fix that in a text editor by replacing "," with ".", though.
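If decimal commas do turn out to be the cause, read.table can also handle them at import time through its dec argument, which avoids editing the file. A minimal sketch on made-up inline data (the rows and values below are invented for illustration and are not Anna's file):

```r
# Hypothetical tab-separated input that uses a decimal comma
txt <- "chr\tscore\nchr2\t0,0005\nchr1\t0,2\n"

# dec = "," tells read.table which character marks the decimal point
d <- read.table(text = txt, header = TRUE, sep = "\t", dec = ",")

str(d$score)
significant <- subset(d, score <= 0.001)
significant
```

With a real file, the text argument would be replaced by the file path; checking str() right after import shows whether score came in as numeric.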
