For some reason (it's never done this before), R is not saving out files in the correct way.
The file needs to save out as integer numbers regardless of how big/small the number is. R is doing that for some values, but not for others. Remaking the file just changes what value is contracted.
This is what the incorrect file looks like:
1 834101 248830000
1 4e+06 4005000 #incorrect line
1 4955000 4965000
This is the code I used to get it:
write.table(outtable, 'outtable.txt', sep = "\t",
row.names = F, col.names = F, quote = F)
This is what I need the file to look like:
1 834101 248830000
1 4000000 4005000
1 4955000 4965000
How do I stop R writing out the '4000000' or '6000000' as 4e+06/6e+06?
I'd be very grateful for any help!
Two options:
Change options("scipen") to something bigger; I believe it defaults to 0, so something 2 or more here will work:
dat <- structure(list(V1 = c(1L, 1L, 1L), V2 = c(834101, 4000000, 4955000), V3 = c(248830000L, 4005000L, 4965000L)), class = "data.frame", row.names = c(NA, -3L))
options(scipen = 2) # anything 2 or higher will work, 99 is fine too
write.table(dat, sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
# 1 834101 248830000
# 1 4000000 4005000
# 1 4955000 4965000
(Larger numbers might need higher values of scipen. Note from ?options that scipen is a penalty measured in digits of width: fixed notation is preferred unless it is more than scipen digits wider than scientific notation.)
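If you would rather not change scipen for the whole session, one possible sketch is to raise it only while the file is being written and restore it afterwards. The helper name write_fixed and the value 999 are assumptions made for illustration; they are not part of the answer above.

```r
# Hypothetical helper (not from the original answer): raise scipen only while
# writing, then put back whatever value the caller had.
write_fixed <- function(x, file, scipen = 999) {
  old <- options(scipen = scipen)     # options() returns the previous values
  on.exit(options(old), add = TRUE)   # restore on exit, even after an error
  write.table(x, file = file, sep = "\t",
              row.names = FALSE, col.names = FALSE, quote = FALSE)
}

dat <- data.frame(V1 = 1L, V2 = 4000000, V3 = 4005000L)
before <- getOption("scipen")

path <- tempfile(fileext = ".txt")    # stand-in for 'outtable.txt'
write_fixed(dat, path)
out <- readLines(path)
print(out)
```

The global option is untouched once the function returns, so other output in the session keeps its usual formatting.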
Format as strings before writing.
str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ V1: int 1 1 1
# $ V2: num 834101 4000000 4955000
# $ V3: int 248830000 4005000 4965000
dat[] <- lapply(dat, sprintf, fmt = "%0i")
str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ V1: chr "1" "1" "1"
# $ V2: chr "834101" "4000000" "4955000"
# $ V3: chr "248830000" "4005000" "4965000"
write.table(dat, sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
# 1 834101 248830000
# 1 4000000 4005000
# 1 4955000 4965000
This has the side-effect of either modifying your table or requiring you to have two copies of it; over to you if that's a problem.
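A variation on the same idea, as a rough sketch: convert only the numeric columns, and do it on a copy so the original table keeps its numeric types. It uses format() with scientific = FALSE instead of sprintf(); the name out is just an illustrative choice.

```r
dat <- data.frame(V1 = c(1L, 1L, 1L),
                  V2 = c(834101, 4000000, 4955000),
                  V3 = c(248830000L, 4005000L, 4965000L))

# Work on a copy so 'dat' stays numeric for later calculations.
out <- dat
is_num <- vapply(dat, is.numeric, logical(1))

# scientific = FALSE forces fixed notation; trim = TRUE drops padding spaces.
out[is_num] <- lapply(dat[is_num], format, scientific = FALSE, trim = TRUE)

str(out)
write.table(out, sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
```

This still costs a second copy of the table in memory, which is the trade-off mentioned above.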
I'm new to R, and I'm trying to save data to an xlsx file. I'm using writexl (xlsx was causing trouble).
It seems that having strings and integers in my data frame causes problems when I try to use write_xlsx.
I've recreated the issue here:
library(writexl)
matrix <- matrix(1,2,2)
block <- cbind(list("ones","more ones"),matrix)
df <- data.frame(block)
data = list("sheet1"=df)
write_xlsx(data, path = "data.xlsx", col_names = FALSE, format_headers = FALSE)
The file data.xlsx correctly contains "sheet1", but it is blank. I would like
ones 1 1
more ones 1 1
Any way to get this output using write_xlsx?
I usually use openxlsx package. Try and adapt the following code:
library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, "Sheet1")
writeData(wb, "Sheet1", df, colNames = FALSE)
saveWorkbook(wb, "test.xlsx", overwrite = TRUE)
I'd raise an issue on the github repo of the package: https://github.com/ropensci/writexl/issues.
Doing this:
df <- data.frame(
X1 = c("ones", "more ones"),
X2 = c(1, 1),
X3 = c(1, 1)
)
write_xlsx(df, path = "data.xlsx", col_names = FALSE, format_headers = FALSE)
works fine. I'd say it's because df in your code has list-columns:
> str(df)
'data.frame': 2 obs. of 3 variables:
$ X1:List of 2
..$ : chr "ones"
..$ : chr "more ones"
$ X2:List of 2
..$ : num 1
..$ : num 1
$ X3:List of 2
..$ : num 1
..$ : num 1
Not sure if the package is meant to have this functionality or not.
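If the list-columns are indeed the cause, one possible workaround is to flatten them into ordinary vectors before writing. This is only a sketch, and it assumes every cell of a list-column holds exactly one value; the name flat is illustrative.

```r
# Rebuild the data frame from the question; its columns are list-columns.
block <- cbind(list("ones", "more ones"), matrix(1, 2, 2))
df <- data.frame(block)

# Flatten each list-column into an atomic vector
# (assumes one value per cell, otherwise the lengths would not match).
is_list <- vapply(df, is.list, logical(1))
flat <- df
flat[is_list] <- lapply(df[is_list], unlist, use.names = FALSE)

str(flat)
# 'flat' now has plain character/numeric columns and could be passed to
# write_xlsx() in place of 'df'.
```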
I have a character vector of classes that I would like to apply to a dataframe, so as to convert the current class of each field in that dataframe to the corresponding entry in the vector. For example:
frame <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
With a for-loop, I know that this can be accomplished using lapply. For example:
for(i in 1:2) { frame[i] <- lapply(frame[i], paste("as", classes[i], sep = ".")) }
For my purposes, however, a for-loop cannot work. Is there another solution that I am missing?
Thank you in advance for your input!
Note: I have been informed that this might be a duplicate of this post. And, yes, my question is similar to it. But I have looked at the class() approach before. And it does not seem to effectively deal with converting fields to factors. The lapply approach, on the other hand, does it well. But, unfortunately, I cannot utilize a for-loop in this instance
If you're not averse to using lapply without a for loop, you can try something like the following.
frame[] <- lapply(seq_along(frame), function(x) {
FUN <- paste("as", classes[x], sep = ".")
match.fun(FUN)(frame[[x]])
})
str(frame)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
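The same conversion can also be sketched with Map(), which walks the columns and the class names in parallel rather than by index. This is an alternative phrasing, not necessarily better than the lapply version above.

```r
frame <- data.frame(A = 2:5, B = 3:6)
classes <- c("character", "factor")

# Pair each column with its target class and call the matching as.* function.
frame[] <- Map(function(col, cls) match.fun(paste0("as.", cls))(col),
               frame, classes)

str(frame)
```

Assigning into frame[] keeps the object a data frame with its original column names.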
However, a better option is to try to apply the correct classes when you're reading the data in to begin with.
x <- tempfile() # Just to pretend....
write.csv(frame2, x, row.names = FALSE) # ... that we are reading a csv
frame3 <- read.csv(x, colClasses = classes)
str(frame3)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
Sample data:
frame <- frame2 <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
I have a number of data files that I am reading into R as CSVs. I would like to specify the colClasses of certain columns in these data files, but the lengths of the dataframes are unknown as they contain species abundance data (hence, different numbers of species).
Is there a way that I can set, say, every column after the first 10 to numeric (so, ncol[10]:length(df)) using colClasses in read.csv?
This is what I tried, but to no avail:
df <- read.csv("file.csv", header=T, colClasses=c(ncols[10], rep("numeric", ncols)))
Any help would be greatly appreciated.
Thanks,
Paul
I would start with using count.fields to determine how many columns there are in the data. You can do this just on the first line.
Then, from there, you can use rep for your colClasses.
It's fugly, but works. Here's an example:
The first few lines are just to create a dummy csv file in your workspace since you didn't provide a reproducible example.
X <- tempfile()
cat("A,B,C,D,E,F",
"1,2,3,4,5,6",
"6,5,4,3,2,1", sep = "\n", file = X)
This is where the actual answer starts. Replace X with your actual file name in both places below. The -2 is because two columns are already accounted for.
Y <- read.csv(X, colClasses = c(
"numeric", "numeric", rep("character", count.fields(textConnection(
readLines(X, n=1)), sep=",")-2)))
# Y <- read.csv("file.csv", colClasses = c(
# "numeric", "numeric", rep(
# "character", count.fields(textConnection(readLines(
# "file.csv", n = 1)), sep = ",")-2)))
str(Y)
# 'data.frame': 2 obs. of 6 variables:
# $ A: num 1 6
# $ B: num 2 5
# $ C: chr "3" "4"
# $ D: chr "4" "3"
# $ E: chr "5" "2"
# $ F: chr "6" "1"
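Adapted to the question as asked (leave the first few columns alone, force the rest to numeric), the same approach might look like the sketch below. The file contents, the column names, and n_fixed = 2 are made up for the example; the question would use 10. An NA entry in colClasses leaves that column to read.csv's usual type detection.

```r
# Dummy file standing in for one of the species-abundance CSVs.
path <- tempfile(fileext = ".csv")
writeLines(c("site,date,sp1,sp2,sp3",
             "a,2020-01-01,1,2,3",
             "b,2020-01-02,4,5,6"), path)

n_fixed <- 2                                # columns left to auto-detection
n_cols  <- count.fields(path, sep = ",")[1] # number of fields on the first line

# NA = let read.csv guess; "numeric" for every remaining column.
col_classes <- c(rep(NA, n_fixed), rep("numeric", n_cols - n_fixed))

df <- read.csv(path, header = TRUE, colClasses = col_classes)
str(df)
```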
I have the following factor:
> str(prediction)
Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 2 1 1 ...
- attr(*, "names")= chr [1:9000] "1" "2" "3" "4" ...
and I wish to get a csv of 9000 x 1 vector of ones or zeros.
I have tried:
write.table(prediction, file = "prediction-1-Decision-Tree-08-Oct-2013.csv", sep = ",", col.names = NA, qmethod = "double")
but this gives me a csv with two columns and header:
"","x"
"1","1"
"2","0"
"3","0"
"4","0"
"5","0"
etc.
I wish to have no header, and just one column.
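You're almost there! Add row.names = FALSE to your write.table call and, since you want no header, change col.names = NA to col.names = FALSE (col.names = NA is only accepted when row names are written):
write.table(prediction, file = "prediction-1-Decision-Tree-08-Oct-2013.csv", sep = ",", col.names = FALSE, qmethod = "double",
row.names = FALSE)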
What you are seeing is not a column of data, but the row.names of the original R object. For future reference, there are two things that give away the fact that those are row names and not data - well, 3 if you count the manual ;)
The header for that column is ""
The numbers are sequential, starting at 1 (which is what one would expect if there are no explicit rownames given)
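A small self-contained sketch of the single-column result, using a made-up four-element factor in place of the real prediction vector and a temporary file in place of the real path. Adding quote = FALSE is my assumption about the desired output (bare 0/1 rather than "0"/"1").

```r
# Stand-in for the real 'prediction' factor (which has 9000 named elements).
prediction <- factor(c(1, 0, 0, 1), levels = c(0, 1))
names(prediction) <- seq_along(prediction)

path <- tempfile(fileext = ".csv")

# No row names, no header, no quotes: one value per line.
write.table(prediction, file = path, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

res <- readLines(path)
print(res)
```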
A very unexpected behavior of the otherwise useful data.frame in R arises from keeping character columns as factors. This causes many problems if it is not taken into account. For example, consider the following code:
foo=data.frame(name=c("c","a"),value=1:2)
# name value
# 1 c 1
# 2 a 2
bar=matrix(1:6,nrow=3)
rownames(bar)=c("a","b","c")
# [,1] [,2]
# a 1 4
# b 2 5
# c 3 6
What would you expect from running bar[foo$name,]? It should return the rows of bar named in foo$name, that is, rows 'c' and 'a'. But the result is different:
bar[foo$name,]
# [,1] [,2]
# b 2 5
# a 1 4
The reason is that foo$name is not a character vector but a factor, which is stored as integer codes, and bar is indexed by those codes:
foo$name
# [1] c a
# Levels: a c
To get the expected behavior, I manually convert it to a character vector:
foo$name = as.character(foo$name)
bar[foo$name,]
# [,1] [,2]
# c 3 6
# a 1 4
But the problem is that it is easy to forget this conversion and end up with hidden bugs in our code. Is there any better solution?
This is a feature and R is working as documented. This can be dealt with generally in a few ways:
use the argument stringsAsFactors = FALSE in the call to data.frame(). See ?data.frame
if you detest this behaviour so, set the option globally via
options(stringsAsFactors = FALSE)
(as noted by @JoshuaUlrich in comments) a third option is to wrap character variables in I(....). This alters the class of the object being assigned to the data frame component to include "AsIs". In general this shouldn't be a problem as the object is still (in this case) a character vector underneath, so it should work as before.
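To make the difference concrete, here is a small sketch comparing a factor column with the per-call and I() options, reusing the bar matrix from the question. stringsAsFactors = TRUE is passed explicitly in the first case so that the factor behaviour shows up regardless of the session default; the names foo_fac, foo_chr and foo_asis are illustrative.

```r
bar <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))

# Factor column: levels are sorted ("a", "c"), so "c" "a" is stored as 2 1.
foo_fac  <- data.frame(name = c("c", "a"), value = 1:2, stringsAsFactors = TRUE)
# Per-call option: keep the strings as character.
foo_chr  <- data.frame(name = c("c", "a"), value = 1:2, stringsAsFactors = FALSE)
# I(): protect the character vector from conversion.
foo_asis <- data.frame(name = I(c("c", "a")), value = 1:2)

codes <- as.integer(foo_fac$name)      # the integer codes used for indexing
by_fac  <- rownames(bar[foo_fac$name, ])   # indexed by position 2, 1
by_chr  <- rownames(bar[foo_chr$name, ])   # indexed by row name
by_asis <- rownames(bar[foo_asis$name, ])  # indexed by row name

print(list(codes = codes, factor = by_fac, character = by_chr, asis = by_asis))
```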
You can check what the default for stringsAsFactors is on the currently running R process (it was TRUE before R 4.0.0 and is FALSE from R 4.0.0 onward) via:
> default.stringsAsFactors()
[1] TRUE
The issue is slightly wider than data.frame() in scope as this also affects read.table(). In that function, as well as the two options above, you can also tell R what all the classes of the variables are via argument colClasses and R will respect that, e.g.
> tmp <- read.table(text = '"Var1","Var2"
+ "A","B"
+ "C","C"
+ "B","D"', header = TRUE, colClasses = rep("character", 2), sep = ",")
> str(tmp)
'data.frame': 3 obs. of 2 variables:
$ Var1: chr "A" "C" "B"
$ Var2: chr "B" "C" "D"
In the example data below, author and title are automatically converted to factor (unless you add the argument stringsAsFactors = FALSE when you are creating the data). What if we forgot to change the default setting and don't want to set the options globally?
Some code I found somewhere (most likely SO) uses sapply() to identify factors and convert them to strings.
dat = data.frame(title = c("title1", "title2", "title3"),
author = c("author1", "author2", "author3"),
customerID = c(1, 2, 1))
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : Factor w/ 3 levels "title1","title2",..: 1 2 3
# $ author : Factor w/ 3 levels "author1","author2",..: 1 2 3
# $ customerID: num 1 2 1
dat[sapply(dat, is.factor)] = lapply(dat[sapply(dat, is.factor)],
as.character)
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : chr "title1" "title2" "title3"
# $ author : chr "author1" "author2" "author3"
# $ customerID: num 1 2 1
I assume this would be faster than re-reading in the dataset with the stringsAsFactors = FALSE argument, but have never tested.