R: how to use writexl when a data frame contains strings and numbers

I'm new to R, and I'm trying to save data to an xlsx file. I'm using writexl (xlsx was causing trouble).
It seems that having strings and integers in my data frame causes problems when I try to use write_xlsx.
I've recreated the issue here:
library(writexl)
matrix <- matrix(1,2,2)
block <- cbind(list("ones","more ones"),matrix)
df <- data.frame(block)
data = list("sheet1"=df)
write_xlsx(data, path = "data.xlsx", col_names = FALSE, format_headers = FALSE)
The file data.xlsx correctly contains "sheet1", but it is blank. I would like
ones 1 1
more ones 1 1
Any way to get this output using write_xlsx?

I usually use the openxlsx package. Try adapting the following code:
library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, "Sheet1")
writeData(wb, "Sheet1", df, colNames = FALSE)
saveWorkbook(wb, "test.xlsx", overwrite = TRUE)

I'd raise an issue on the package's GitHub repo: https://github.com/ropensci/writexl/issues.
Doing this:
df <- data.frame(
X1 = c("ones", "more ones"),
X2 = c(1, 1),
X3 = c(1, 1)
)
write_xlsx(df, path = "data.xlsx", col_names = FALSE, format_headers = FALSE)
works fine. I'd say it's because df in your code has list-columns:
> str(df)
'data.frame': 2 obs. of 3 variables:
$ X1:List of 2
..$ : chr "ones"
..$ : chr "more ones"
$ X2:List of 2
..$ : num 1
..$ : num 1
$ X3:List of 2
..$ : num 1
..$ : num 1
I'm not sure whether the package is meant to support list-columns.
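If the list-columns are indeed the cause, one possible workaround is to flatten them into ordinary vectors before writing. The sketch below rebuilds the question's data frame and assumes every cell holds exactly one value, so that unlist() returns one element per row. The write_xlsx call is the one from the question, guarded in case writexl is not installed; the temporary output path is an assumption.

```r
# Rebuild the question's data frame, whose columns are all lists.
m <- matrix(1, 2, 2)
df <- data.frame(cbind(list("ones", "more ones"), m))
sapply(df, is.list)   # TRUE for every column

# Flatten each list-column into an atomic vector.
# Assumption: every cell holds exactly one value.
df[] <- lapply(df, unlist)
str(df)

# Same call as in the question, written to a temporary directory (assumption).
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(list(sheet1 = df),
                      path = file.path(tempdir(), "data.xlsx"),
                      col_names = FALSE, format_headers = FALSE)
}
```

After flattening, the data frame has the same shape as the hand-built one above that is reported to work.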

Related

Load in multiple CSV files and add suffix to column names in R

I am trying to load in a series of CSV files and then append a suffix to each column in the CSV (except for the primary key, subject_id). Each csv looks something like this
subject_id  var1  var2
1           55    57
2           55    57
Imagine this csv file was titled data1 and subsequent files are titled data2, data3... etc.
For each csv that I load, I would like to convert the table into something like
subject_id  var1_data1  var2_data1
1           55          57
2           55          57

subject_id  var1_data2  var2_data2
1           55          57
2           55          57
I know how to load in the datasets:
filenames <- list.files(path= "data", full.names = TRUE)
datasets <- lapply(filenames, read_csv)
but I am struggling with figuring out how to write a loop/lapply statement to add the suffixes in the way I want.
The function below will add a suffix, but it is static.
lapply(datasets, function(df) {
names(df)[-1] <- paste0(names(df)[-1], "_data1")
df
})
The next thing I tried was to sandwich a for loop in the middle of the function above
filenames2 <- sub('\\.csv$', '', list.files(path = "data"))
lapply(dataset3, function(df) {
for (val in filenames2){
names(df)[-1] <- paste0(names(df)[-1], val)
df
}
})
But this just changes everything to NULL/doesn't work. Does anyone have a thought on what might be the best way to proceed? I am also open to solutions in python, though R would be preferred.
Thank you!
Suppose we have the files generated reproducibly in the Note at the end.
Then we get the file names in fnames and use Map to apply a function Read to each of them. Read reads in one file, fixes the names, and returns the fixed-up data frame.
fnames <- Sys.glob("data*.csv")
Read <- function(f) {
df <- read.csv(f)
names(df)[-1] <- paste0(names(df)[-1], "_", sub("\\.csv$", "", basename(f)))
df
}
L <- Map(Read, fnames)
str(L)
giving this named list:
List of 3
$ data1.csv:'data.frame': 2 obs. of 3 variables:
..$ subject_id: int [1:2] 1 2
..$ var1_data1: int [1:2] 55 55
..$ var2_data1: int [1:2] 57 57
$ data2.csv:'data.frame': 2 obs. of 3 variables:
..$ subject_id: int [1:2] 1 2
..$ var1_data2: int [1:2] 55 55
..$ var2_data2: int [1:2] 57 57
$ data3.csv:'data.frame': 2 obs. of 3 variables:
..$ subject_id: int [1:2] 1 2
..$ var1_data3: int [1:2] 55 55
..$ var2_data3: int [1:2] 57 57
Note
Lines <- "subject_id var1 var2
1 55 57
2 55 57"
data1 <- data2 <- data3 <- read.table(text = Lines, header = TRUE)
for(f in c("data1", "data2", "data3")) write.csv(get(f), paste0(f, ".csv"), row.names = FALSE, quote = FALSE)
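If the data frames are already loaded, as in the question, the same idea can be applied without re-reading the files by walking over the data frames and their names in parallel. This is only a sketch: the two in-memory data frames below are hypothetical stand-ins for the question's datasets, and filenames2 is assumed to hold the file names without the .csv extension. Note that the question's attempt returned NULL because a for loop is the last expression in its function, and a for loop evaluates to NULL.

```r
# Hypothetical stand-ins for the question's `datasets` and `filenames2`
datasets <- list(
  data.frame(subject_id = 1:2, var1 = c(55, 55), var2 = c(57, 57)),
  data.frame(subject_id = 1:2, var1 = c(55, 55), var2 = c(57, 57))
)
filenames2 <- c("data1", "data2")

# Append "_<suffix>" to every column name except the first (the key)
add_suffix <- function(df, suffix) {
  names(df)[-1] <- paste0(names(df)[-1], "_", suffix)
  df
}

# Map pairs the i-th data frame with the i-th suffix
renamed <- Map(add_suffix, datasets, filenames2)
names(renamed[[1]])
names(renamed[[2]])
```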
If every dataset has the same columns, another approach would be to build a single data.frame with an extra column that records which dataset each row came from. Here is a way to do that:
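library(readr) # provides read_csv
# .id adds a "dataset" column; with an unnamed filenames vector it holds the position of each file
datasets <-
  purrr::map_df(
    .x = filenames,
    .f = read_csv,
    .id = "dataset"
  )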
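For readers who prefer to avoid the purrr dependency, a roughly equivalent base R sketch is shown here. It is not the answer's method: the named list of in-memory data frames is a hypothetical stand-in for the files, and the origin column is filled from the list names rather than from positions.

```r
# Hypothetical in-memory stand-ins for the files read in the question
datasets <- list(
  data1 = data.frame(subject_id = 1:2, var1 = c(55, 55), var2 = c(57, 57)),
  data2 = data.frame(subject_id = 1:2, var1 = c(55, 55), var2 = c(57, 57))
)

# Prepend a "dataset" column to each data frame, then stack them row-wise
tagged <- Map(function(df, nm) data.frame(dataset = nm, df),
              datasets, names(datasets))
combined <- do.call(rbind, unname(tagged))
combined
```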

Is there an R package with a generalized class of data.frame in which a column can be an array (or how do I define such a class)?

I have been wondering about this for a long time. The data.frame class in base R only allows the columns to be vectors. I was looking for a package that generalizes this so that each "column" can be a 2-d or even n-d array, with methods similar to those of the original data.frame class, such as subsetting with "[]", merge, aggregate, etc.
My reason for such a class is to deal with Monte Carlo simulation data. For example, for each simulation the result can be expressed as a data frame in which the row indices are dates, and columns include character and numeric. If I simulate 1000 times then I get 1000 such data frames. If there is a class in R with which I can store the results in one object and has the convenience of most of the data.frame methods, it'll make my coding a lot easier.
As I couldn't find such a package I attempted to create my own with no success. I came across this package "S4Vectors" with a "DataFrame" class, which "supports the storage of any type of object (with length and [ methods) as columns." Here is my attempt.
library(S4Vectors)
test <- matrix(1:6,2,3)
test1 <- matrix(7:12,2,3)
setClass("Column", slots=list(), contains = "matrix")
setMethod("length", "Column", function(x) {nrow(x)})
'[.Column' <- function(x, i, j, ...) {
i <- ((i-1)*ncol(x)+1):(i*(ncol(x)))
NextMethod()
}
testColumn <- new("Column", test)
testColumn1 <- new("Column", test1)
length(testColumn)
testColumn[1]
testDataFrame <- DataFrame(Col1 = testColumn, Col2 = testColumn1)
I did get the length and [ method to work but the last statement gives an error "cannot coerce class "Column" to a DataFrame".
Has anyone ever tried to do something similar?
Update: Thanks to G. Grothendieck I now know a data frame can take a matrix as a column by using the I() function. Now I am wondering if there is a way to preserve such a structure in all operations. An example would be to aggregate the data frame
data.frame(v = c(1,1,2,2), m = I(diag(4)))
by v so that the result is
data.frame(v = c(1,2), m = I(matrix(c(1,1,0,0,0,0,1,1), 2, 4, byrow = T))).
Data frames do allow matrix columns:
m <- diag(4)
v <- 1:4
DF <- data.frame(v, m = I(m))
str(DF)
giving:
'data.frame': 4 obs. of 2 variables:
$ v: int 1 2 3 4
$ m: 'AsIs' num [1:4, 1:4] 1 0 0 0 0 1 0 0 0 0 ...
Update 1
The R aggregate function can create matrix columns. For example,
DF <- data.frame(v = 1:4, g = c(1, 1, 2, 2))
ag <- aggregate(v ~ g, DF, function(x) c(sum = sum(x), mean = mean(x)))
str(ag)
giving:
'data.frame': 2 obs. of 2 variables:
$ g: num 1 2
$ v: num [1:2, 1:2] 3 7 1.5 3.5
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "sum" "mean"
Update 2
I don't think the aggregation discussed in the comments is nicely supported in R but you can use the following workaround:
m <- matrix(1:16, 4)
v <- c(1, 1, 2, 2)
DF <- data.frame(v, m = I(m))
nr <- nrow(DF)
ag2 <- aggregate(list(sum = 1:nr), DF["v"], function(ix) colSums(DF$m[ix, ]))
str(ag2)
giving:
'data.frame': 2 obs. of 2 variables:
$ v : num 1 2
$ sum: num [1:2, 1:4] 3 7 11 15 19 23 27 31
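For the specific case of summing matrix rows within groups, base R's rowsum() may be a shorter route than the aggregate workaround. The sketch below applies it to the diag(4) example from the question's update. It is an alternative to the answer's method, not part of it, and the names sums and res are illustrative.

```r
m <- diag(4)
v <- c(1, 1, 2, 2)

# rowsum() adds up the rows of m within each group of v,
# returning one row per distinct group value (sorted)
sums <- rowsum(m, group = v)

# Store the result as a matrix column, as in the answer above
res <- data.frame(v = as.numeric(rownames(sums)), m = I(sums))
str(res)
```

This only covers sums; other per-group summaries of a matrix column would still need something like the index-based aggregate workaround.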

Formatting a df column of vectors while maintaining the structure. (R)

I have a 2 column data frame (DF) of which one column contains vectors and the other is characters.
Orig. Matched
AbcD c("ab.d","Acbd","AA.D","")
jKdf c("JJf.","K.dF","JkD.","")
My aim is to strip all the punctuation marks (commas and periods) as well as make everything lowercase. This is easy enough for the character column, but the vector column is more challenging.
Some lower case methods I tried using are
lapply(DF, tolower). This causes the data frame to convert to a matrix. In doing so I lose the column of vectors structure.
In regards to the punctuation, I tried
gsub("\\.", "", DF) and
gsub("\\,", "", DF) to remove the periods and commas respectively.
This causes the data frame to convert to a character list.
I guess my questions are as follows:
Is there another way to remove punctuation and convert to lowercase that preserves the data frame structure?
If not, how might I convert the above outputs back into the original format, that is, a column of vectors?
I'm sure there are other ways to get this done but here's an example that works pretty well:
DF = data.frame(a = c("JJf.","K.dF","JkD.",""), b = c("ab.d","Acbd","AA.D",""))
DF2 = as.data.frame(lapply(X = DF, FUN = tolower))
DF2$a = gsub(pattern = "\\.",replacement = "", x = DF2$a)
Data frames are just special cases of lists where all the elements have the same length, so coercion back and forth isn't usually a problem.
From your description, it sounds like you have some data that looks like:
mydf <- data.frame(Orig = c("AbcD", "jKdf"),
Matched = I(list(c("ab.d","Ac,bd","AA.D",""),
c("JJf.","K.dF","JkD.",""))))
mydf
# Orig Matched
# 1 AbcD ab.d, Ac....
# 2 jKdf JJf., K.....
str(mydf)
# 'data.frame': 2 obs. of 2 variables:
# $ Orig : Factor w/ 2 levels "AbcD","jKdf": 1 2
# $ Matched:List of 2
# ..$ : chr "ab.d" "Ac,bd" "AA.D" ""
# ..$ : chr "JJf." "K.dF" "JkD." ""
# ..- attr(*, "class")= chr "AsIs"
Usually, if you want to replace data while maintaining the same structure, you replace with [], like this:
mydf[] <- lapply(mydf, function(x) {
if (is.list(x)) {
lapply(x, function(y) {
tolower(gsub("[.,]", "", y))
})
} else {
tolower(gsub("[.,]", "", x))
}
})
Here's the result:
mydf
# Orig Matched
# 1 abcd abd, acbd, aad,
# 2 jkdf jjf, kdf, jkd,
str(mydf)
# 'data.frame': 2 obs. of 2 variables:
# $ Orig : chr "abcd" "jkdf"
# $ Matched:List of 2
# ..$ : chr "abd" "acbd" "aad" ""
# ..$ : chr "jjf" "kdf" "jkd" ""

Applying a vector of classes to a dataframe

I have a character vector of classes that I would like to apply to a dataframe, so as to convert the current class of each field in that dataframe to the corresponding entry in the vector. For example:
frame <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
With a for-loop, I know that this can be accomplished using lapply. For example:
for(i in 1:2) { frame[i] <- lapply(frame[i], paste("as", classes[i], sep = ".")) }
For my purposes, however, a for-loop cannot work. Is there another solution that I am missing?
Thank you in advance for your input!
Note: I have been informed that this might be a duplicate of this post. And, yes, my question is similar to it. But I have looked at the class() approach before. And it does not seem to effectively deal with converting fields to factors. The lapply approach, on the other hand, does it well. But, unfortunately, I cannot utilize a for-loop in this instance
If you're not averse to using lapply without a for loop, you can try something like the following.
frame[] <- lapply(seq_along(frame), function(x) {
FUN <- paste("as", classes[x], sep = ".")
match.fun(FUN)(frame[[x]])
})
str(frame)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
However, a better option is to try to apply the correct classes when you're reading the data in to begin with.
x <- tempfile() # Just to pretend....
write.csv(frame2, x, row.names = FALSE) # ... that we are reading a csv
frame3 <- read.csv(x, colClasses = classes)
str(frame3)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
Sample data:
frame <- frame2 <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
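A variation on the lapply approach is to use Map, which walks over the columns and the class names in parallel, so no index is needed. This is a sketch under the same assumption as the answer's code: every entry of classes must correspond to an existing as.<class> function.

```r
frame <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")

# Pair each column with its target class and call the matching as.* function
frame[] <- Map(function(col, cls) match.fun(paste0("as.", cls))(col),
               frame, classes)
str(frame)
```

Assigning into frame[] keeps the object a data frame with its original column names.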

Assign colClasses to certain columns in data frames with unknown length

I have a number of data files that I am reading into R as CSVs. I would like to specify the colClasses of certain columns in these data files, but the lengths of the dataframes are unknown as they contain species abundance data (hence, different numbers of species).
Is there a way that I can set, say, every column after the first 10 to numeric (so, ncol[10]:length(df)) using colClasses in read.csv?
This is what I tried, but to no avail:
df <- read.csv("file.csv", header=T, colClasses=c(ncols[10], rep("numeric", ncols)))
Any help would be greatly appreciated.
Thanks,
Paul
I would start with using count.fields to determine how many columns there are in the data. You can do this just on the first line.
Then, from there, you can use rep for your colClasses.
It's fugly, but works. Here's an example:
The first few lines are just to create a dummy csv file in your workspace since you didn't provide a reproducible example.
X <- tempfile()
cat("A,B,C,D,E,F",
"1,2,3,4,5,6",
"6,5,4,3,2,1", sep = "\n", file = X)
This is where the actual answer starts. Replace X with your actual file name in both places below. The -2 is because we have two columns that are already accounted for.
Y <- read.csv(X, colClasses = c(
"numeric", "numeric", rep("character", count.fields(textConnection(
readLines(X, n=1)), sep=",")-2)))
# Y <- read.csv("file.csv", colClasses = c(
# "numeric", "numeric", rep(
# "character", count.fields(textConnection(readLines(
# "file.csv", n = 1)), sep = ",")-2)))
str(Y)
# 'data.frame': 2 obs. of 6 variables:
# $ A: num 1 6
# $ B: num 2 5
# $ C: chr "3" "4"
# $ D: chr "4" "3"
# $ E: chr "5" "2"
# $ F: chr "6" "1"
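Closer to what the question asks for (leave the leading columns alone and force everything after them to numeric), the class vector can be built in a separate step. This is a sketch with assumptions: n_keep is a hypothetical name for the number of leading columns to leave to R's default type detection (10 in the question, 2 in this small example), and NA entries in colClasses are used to request that default detection.

```r
# Dummy csv file, as in the example above
path <- tempfile(fileext = ".csv")
writeLines(c("A,B,C,D,E,F", "1,2,3,4,5,6", "6,5,4,3,2,1"), path)

n_keep <- 2                                   # hypothetical: leading columns left alone
n_cols <- count.fields(path, sep = ",")[1]    # number of fields on the first line

# NA means "guess the type"; the remaining columns are forced to numeric
classes <- c(rep(NA, n_keep), rep("numeric", n_cols - n_keep))
Y <- read.csv(path, colClasses = classes)
str(Y)
```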
