R error "sum not meaningful for factors" - r

I have a file called rRna_RDP_taxonomy_phylum with the following data:
364 "Firmicutes" 39.31
244 "Proteobacteria" 26.35
218 "Actinobacteria" 23.54
65 "Bacteroidetes" 7.02
22 "Fusobacteria" 2.38
6 "Thermotogae" 0.65
3 unclassified_Bacteria 0.32
2 "Spirochaetes" 0.22
1 "Tenericutes" 0.11
1 Cyanobacteria 0.11
And I'm using this code to create a pie chart in R:
if (file.exists("rRna_RDP_taxonomy_phylum")) {
    family <- read.table("rRna_RDP_taxonomy_phylum", sep="\t")
    piedat <- rbind(family[1:7, ],
                    as.data.frame(t(c(sum(family[8:nrow(family), 1]),
                                      "Others",
                                      sum(family[8:nrow(family), 3])))))
    png(file="../graph/RDP_phylum_low.png", width=600, height=550, res=75)
    pie(as.numeric(piedat$V3), labels=piedat$V3, clockwise=TRUE, col=graph_col, main="More representative Phyliums")
    legend("topright", legend=piedat$V2, cex=0.8, fill=graph_col)
    dev.off()
    png(file="../graph/RDP_phylm_high.png", width=1300, height=850, res=75)
    pie(as.numeric(piedat$V3), labels=piedat$V3, clockwise=TRUE, col=graph_col, main="More representative Phyliums")
    legend("topright", legend=piedat$V2, cex=0.8, fill=graph_col)
    dev.off()
}
I've been using this code with different data files and it works fine, but with the file shown above it crashes with the following message:
Error in Summary.factor(c(6L, 2L, 1L), na.rm = FALSE) :
sum not meaningful for factors
Calls: rbind -> as.data.frame -> t -> Summary.factor
Execution halted
I need to understand why it crashes with this file and whether there's any way to prevent this kind of error.
Thanks!

The error comes from calling sum(x) when x is a factor.
That means one of your columns, although it looks numeric, is actually a factor (what you are seeing is its text representation).
The simple fix is to convert it to numeric. However, that needs an intermediate step of converting to character first. Use the following:
family[, 1] <- as.numeric(as.character( family[, 1] ))
family[, 3] <- as.numeric(as.character( family[, 3] ))
For a detailed explanation of why the intermediate as.character step is needed, take a look at this question: How to convert a factor to integer\numeric without loss of information?
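If you want to see which columns came in as factors before applying the fix, a quick check (a sketch, not part of the original fix) is:
str(family)
# factor columns show up as "Factor w/ n levels ..."
And, assuming the numeric columns of your file really contain only numbers, you can also avoid factors at read time:
family <- read.table("rRna_RDP_taxonomy_phylum", sep="\t", stringsAsFactors=FALSE)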

Related

How to handle "write.xlsx" error: arguments imply differing number of rows

I'm trying to write an xlsx file from a list of data frames I created, but I'm getting an error due to missing data (I couldn't download it). I just want to write the xlsx file despite the missing data. Any help is appreciated.
For replication of the problem:
library(quantmod)
name_of_symbols <- c("AKER","YECO","SNOA")
research_dates <- c("2018-11-19","2018-11-19","2018-11-14")
my_symbols_df <- lapply(name_of_symbols, function(x) tryCatch(getSymbols(x, auto.assign = FALSE),error = function(e) { }))
my_stocks_OHLCV <- list()
for (i in 1:3) {
    trade_date <- paste(as.Date(research_dates[i]))
    OHLCV_data <- my_symbols_df[[i]][trade_date]
    my_stocks_OHLCV[[i]] <- data.frame(OHLCV_data)
}
And you can see the missing data down here in my_stocks_OHLCV[[2]] and the write.xlsx error I'm getting:
print(my_stocks_OHLCV)
[[1]]
AKER.Open AKER.High AKER.Low AKER.Close AKER.Volume AKER.Adjusted
2018-11-19 2.67 3.2 1.56 1.75 15385800 1.75
[[2]]
data frame with 0 columns and 0 rows
[[3]]
SNOA.Open SNOA.High SNOA.Low SNOA.Close SNOA.Volume SNOA.Adjusted
2018-11-14 1.1 1.14 1.01 1.1 107900 1.1
write.xlsx(my_stocks_OHLCV, "C:/Users/MICRO/Downloads/Datasets_stocks/dux_OHLCV.xlsx")
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
  arguments imply differing number of rows: 1, 0
How do I run write.xlsx even though I have this missing data?
The main question you need to ask is: what do you want instead?
As you are working with stock data, the best idea is: if you don't have data for a stock, remove it. Something like this should work:
my_stocks_OHLCV[lapply(my_stocks_OHLCV,nrow)>0]
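For example, filtering first and then writing (a sketch: sapply is used here only to get an ordinary logical vector, and the path is the one from your question):
keep <- sapply(my_stocks_OHLCV, nrow) > 0
write.xlsx(my_stocks_OHLCV[keep], "C:/Users/MICRO/Downloads/Datasets_stocks/dux_OHLCV.xlsx")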
If you want a row full of NA or 0 instead,
use lapply and, for each element of the list with length 0, replace it with NAs, a vector of 0's (c(0,0,0,0,0,0)), etc.
Something like this:
condition <- !lapply(my_stocks_OHLCV,nrow)>0
my_stocks_OHLCV[condition] <- data.frame(rep(NA,6))
Here we define the condition variable to be the elements of the list where you don't have any data. We can then replace those with NA, or swap the NA for 0. However, I can't think of a reason to do this.
A variation on your question, one you could handle inside your for loop, is to check whether you have data and, if you don't, replace the values there with NAs and give them the correct headers, since you know which stock each element relates to.
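A rough sketch of that variation (assumptions: the column names follow the SYMBOL.Open, SYMBOL.High, ... pattern shown in your output, and one-row NA_real_ placeholders are acceptable):
for (i in 1:3) {
    trade_date <- paste(as.Date(research_dates[i]))
    OHLCV_data <- my_symbols_df[[i]][trade_date]
    if (is.null(OHLCV_data) || nrow(OHLCV_data) == 0) {
        # no data for this stock/date: build a one-row NA placeholder with the expected headers
        cols <- paste0(name_of_symbols[i],
                       c(".Open", ".High", ".Low", ".Close", ".Volume", ".Adjusted"))
        my_stocks_OHLCV[[i]] <- setNames(data.frame(t(rep(NA_real_, 6))), cols)
    } else {
        my_stocks_OHLCV[[i]] <- data.frame(OHLCV_data)
    }
}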
Hope this helps.

subsetting data.cube inside custom function

I am trying to make a function of my own to subset a data.cube in R, and format the result automatically for some predefined plots I aim to build.
This is my function.
require(data.table)
require(data.cube)
secciona <- function(cubo = NULL,
                     fecha_valor = list(),
                     loc_valor = list(),
                     prod_valor = list(),
                     drop = FALSE){
    cubo[fecha_valor, loc_valor, prod_valor, drop = drop]
    ## The line above will really be an assignment of the form y <- format(cubo[...drop])
    ## Rest of the code, which will end up plotting the subset produced by this function
}
The thing is, I keep getting the error: Error in eval(expr, envir, enclos) : object 'fecha_valor' not found
What is strangest to me is that everything works fine on the console, but not when it is wrapped inside my subsetting function.
In console:
> dc[list(as.Date("2013/01/01"))]
> dc[list(as.Date("2013/01/01")),]
> dc[list(as.Date("2013/01/01")),,]
> dc[list(as.Date("2013/01/01")),list(),list()]
all give as result:
<data.cube>
fact:
5627 rows x 2 dimensions x 1 measures (0.32 MB)
dimensions:
localizacion : 4 entities x 3 levels (0.01 MB)
producto : 153994 entities x 3 levels (21.29 MB)
total size: 21.61 MB
But whenever I try
secciona(dc)
secciona(dc, fecha_valor = list(as.Date("2013/01/01")))
secciona(dc, fecha_valor = list())
I always get the error above mentioned.
Any ideas why this is happening? Or should I take a different approach to building the subset for plotting?
This is the standard issue R users face when dealing with non-standard evaluation; it is a consequence of the Computing on the language feature of R.
The [.data.cube function expects to be used interactively, which extends the flexibility of the arguments passed to it but imposes some restrictions. In that respect it is similar to [.data.table when passing expressions from a wrapper function to the [ subset operator. I've added a dummy example to make it reproducible.
I see you are already using the data.cube-oop branch, so just to clarify for other readers: the data.cube-oop branch is 92 commits ahead of the master branch; to install it, use the following.
install.packages("data.cube", repos = paste0("https://", c(
    "jangorecki.gitlab.io/data.cube",
    "Rdatatable.github.io/data.table",
    "cran.rstudio.com"
)))
library(data.cube)
set.seed(1)
ar = array(rnorm(8,10,5), rep(2,3),
           dimnames = list(color = c("green","red"),
                           year = c("2014","2015"),
                           country = c("IN","UK"))) # sorted
dc = as.data.cube(ar)
f = function(color=list(), year=list(), country=list(), drop=FALSE){
    expr = substitute(
        dc[color=.color, year=.year, country=.country, drop=.drop],
        list(.color=color, .year=year, .country=country, .drop=drop)
    )
    eval(expr)
}
f(year=list(c("2014","2015")), country="UK")
#<data.cube>
#fact:
# 4 rows x 3 dimensions x 1 measures (0.00 MB)
#dimensions:
# color : 2 entities x 1 levels (0.00 MB)
# year : 2 entities x 1 levels (0.00 MB)
# country : 1 entities x 1 levels (0.00 MB)
#total size: 0.01 MB
You can track the expression just by putting print(expr) before (or instead of) eval(expr).
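For example, with the same toy f as above (renamed f_debug here purely for illustration):
f_debug = function(color=list(), year=list(), country=list(), drop=FALSE){
    expr = substitute(
        dc[color=.color, year=.year, country=.country, drop=.drop],
        list(.color=color, .year=year, .country=country, .drop=drop)
    )
    print(expr)  # shows the call that will actually be evaluated
    eval(expr)
}
f_debug(year=list(c("2014","2015")), country="UK")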
Read more about non-standard evaluation:
- R Language Definition: Computing on the language
- Advanced R: Non-standard evaluation
- manual of substitute function
And some related SO questions:
- Passing on non-standard evaluation arguments to the subset function
- In R, why is [ better than subset?

Undefined columns selected error in R while reading multiple files

I have lots of files in my directory and I want to read all the files, select the second column of each, and put those columns as rows of a matrix, but I'm running into a strange error.
Would anybody help me figure out what's going wrong with my code?
Here is my effort:
# read all files in one directory into R and select the desired column
nm <- list.files(path="April/mRNA_expression/")
Gene_exp <- do.call(rbind, lapply(nm, function(x) read.table(file=x, header=TRUE, sep=",")[, 2]))
save(Gene_exp, file="Path to folder")
The error I get is:
## Error in `[.data.frame`(read.table(file = x, header = TRUE, sep = ""), :
## undefined columns selected
To check that my files really have 2 columns, I did this:
b <- read.table("A.genes.normalized_results", sep="")
dim(b)
## [1] 20532 2
My text file looks like this:
gene_id normalized_count
?|100130426 0.0000
?|100133144 10.6340
?|100134869 5.6790
?|10357 106.4628
?|10431 710.8902
?|136542 0.0000
?|155060 132.2883
?|26823 0.5098
?|280660 0.0000
?|317712 0.0000
?|340602 0.0000
?|388795 1.2745
?|390284 5.3527
?|391343 2.5489
?|391714 0.7647
?|404770 0.0000
?|441362 0.0000
A better solution would be to import only the second column when reading. Use the colClasses argument to skip the first completely:
Gene_exp<-do.call(rbind, lapply(nm, function(x) read.delim(file=x,header=TRUE, colClasses=c('NULL', 'character'))))
I am assuming the second column is character. Change it to the appropriate class if you need to.
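If the second column is actually numeric (which the normalized_count values suggest), a variant like this may be closer to what you need; full.names = TRUE is an addition on my part, assuming R is not already running inside April/mRNA_expression/, and the files are assumed to be tab-delimited:
nm <- list.files(path="April/mRNA_expression/", full.names=TRUE)
Gene_exp <- do.call(rbind, lapply(nm, function(x)
    read.delim(file=x, header=TRUE, colClasses=c("NULL", "numeric"))[, 1]))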

Reading a large fixed format text file in R

I am trying to read a large (> 70 MB) fixed format text file into R. For a smaller file (< 1 MB), I can use the read.fwf() function as shown below.
condodattest1a <- read.fwf(impfile1,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
When I try to run the line of code below,
condodattest1 <- read.fwf(impfile,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
I get the following error message:
Error: cannot allocate vector of size 2 Kb
The only difference between the 2 lines is the size of the input file.
The formatting for the file I want to import is given in the data frame called testcsv3. I show a small snippet of it below:
> head(testcsv3)
Varlen Varname Varclass Varsep Varforfmt
1 2 "V1" "character" 2 "A2.0"
2 15 "V2" "character" 17 "A15.0"
3 28 "V3" "character" 45 "A28.0"
4 3 "V4" "character" 48 "F3.0"
5 1 "V5" "character" 49 "A1.0"
6 3 "V6" "character" 52 "A3.0"
At least part of my problem is that I am reading in all the data as factors when I use read.fwf() and I end up exceeding the memory limit on my computer.
I tried to use read.table() as a way of formatting each variable but it seems I need a text delimiter with that function. There is a suggestion in section 3.3 in the link below that I could use sep to identify the column where every variable starts.
http://data.princeton.edu/R/readingData.html
However, when I use the command below:
condodattest1b <- read.table(impfile1,sep=testcsv3$Varsep,col.names=testcsv3$Varname, colClasses=testcsv3$Varclass)
I get the following error message:
Error in read.table(impfile1, sep = testcsv3$Varsep, col.names = testcsv3$Varname, : invalid 'sep' argument
Finally, I tried to use:
condodattest1c <- read.fortran(impfile1,lengths=testcsv3$Varlen, format=testcsv3$Varforfmt, col.names=testcsv3$Varname)
but I get the following message:
Error in processFormat(format) : missing lengths for some fields
In addition: Warning messages:
1: In processFormat(format) : NAs introduced by coercion
2: In processFormat(format) : NAs introduced by coercion
3: In processFormat(format) : NAs introduced by coercion
All I am trying to do at this point is read the data into R as something other than factors. I am hoping this will limit the amount of memory I am using and allow me to actually read in the file. I would appreciate any suggestions on how to do this. I know the Fortran formats for all the variables and the column at which each variable begins.
Thank you,
Warren
Maybe this code works for you. You have to fill varlen with the field widths and supply the corresponding type strings (e.g. "numeric", "character", "integer") in colclasses:
my.readfwf <- function(filename, varlen, colclasses) {
    # start and end position of each fixed-width field
    sidx <- cumsum(c(1, varlen[1:(length(varlen)-1)]))
    eidx <- sidx + varlen - 1
    # read the file as one character string per line
    filecontent <- scan(filename, character(0), sep="\n")
    if (any(diff(nchar(filecontent)) != 0))
        stop("line lengths differ!")
    nlines <- length(filecontent)
    res <- list()
    for (i in seq_along(varlen)) {
        # cut field i out of every line, then coerce it to the requested mode
        res[[i]] <- sapply(filecontent, substring, first=sidx[i], last=eidx[i])
        mode(res[[i]]) <- colclasses[i]
    }
    # turn the list of columns into a data.frame in place
    attributes(res) <- list(names=paste("V", seq_along(res), sep=""),
                            row.names=seq_along(res[[1]]),
                            class="data.frame")
    return(res)
}
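Assuming testcsv3$Varclass holds valid mode strings ("character", "numeric", ...), usage would then look something like this (condodattest1d is just an illustrative name, and as.character()/as.numeric() guard against those columns having been read in as factors):
condodattest1d <- my.readfwf(impfile,
                             varlen = as.numeric(as.character(testcsv3$Varlen)),
                             colclasses = as.character(testcsv3$Varclass))
names(condodattest1d) <- as.character(testcsv3$Varname)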

Cannot coerce class ... to a data.frame error

I have an "cannot coerce class "c("summary.turnpoints", "turnpoints")" to a data.frame" error when trying to save the summary in a file. I have tried to fix that with as.data.frame with no success.
Code:
library(plyr)
library(pastecs)
data <- read.table("C:\\Users\\Ron\\Desktop\\dataset.txt", header=F, col.name="A")
data.tp=turnpoints(data$A)
print(data.tp)
Turning points for: data$A
nbr observations : 5990
nbr ex-aequos : 51
nbr turning points: 413 (first point is a pit)
E(p) = 3992 Var(p) = 1064.567 (theoretical)
data.sum=summary(data.tp)
print(data.sum)
point type proba info
1 11 pit 7.232437e-15 46.97444
2 21 peak 7.594058e-14 43.58212
3 30 pit 3.479857e-27 87.89303
4 51 peak 5.200612e-29 93.95723
5 62 pit 7.594058e-14 43.58212
6 70 peak 6.213321e-14 43.87163
7 81 pit 6.276081e-16 50.50099
8 91 peak 5.534016e-23 73.93602
.....................................
write.table(data.sum, file = "C:\\Users\\Ron\\Desktop\\datasetTurnP.txt")
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class "c("summary.turnpoints", "turnpoints")" to a data.frame
In addition: Warning messages:
1: package ‘plyr’ was built under R version 3.0.1
2: package ‘pastecs’ was built under R version 3.0.1
How can I save these summary results to a text file?
Thank you.
Look at the Value section of:
?pastecs::summary.turnpoints
It should be clear that this is not a set of lists that all have the same length, hence the error message. So rather than asking for the impossible, ... tell us what you wanted to save.
It's actually not impossible, just not possible with write.table, since it's not a data frame. The dump function will let you construct an ASCII representation of the structure(...) form of that summary object.
dump("data.sum", file="dump_data_sum.asc")
This could then be source()-ed back into R.
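If all you really need is the printed summary as plain text, capturing the console output is another option (not from the original answer, just a base-R sketch reusing the path from your question):
capture.output(summary(data.tp), file="C:\\Users\\Ron\\Desktop\\datasetTurnP.txt")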
