Combining vectors of unequal size using rbind.na - r

I've imported some data files with an unequal number of columns and was hoping to create a data frame out of them. I've used lapply to convert them into vectors, and now I'm trying to put these vectors into a data frame.
I'm using rbind.na from the package {qpcR} to pad each vector with NA so that they all become the same size. For some reason the function isn't being recognized by do.call. Can anyone figure out why this is the case?
library(plyr)
library(qpcR)
files <- list.files(path = "C:/documents", pattern = "*.txt", full.names = TRUE)
readdata <- function(x)
{
con <- file(x, open="rt")
mydata <- readLines(con, warn = FALSE, encoding = "UTF-8")
close(con)
return(mydata)
}
all.files <- lapply(files, readdata)
combine <- do.call(rbind.na, all.files)
If anyone has any potential alternatives they can think of I'm open to that too. I actually tried using a function from here but my output didn't give me any columns.
Here is the error:
Error in do.call(rbind.na, all.files) : object 'rbind.na' not found
The package has definitely been installed too.
EDIT: changed cbind.na to rbind.na for the error.

It appears that the function is not exported by the package. Using qpcR:::rbind.na will allow you to access the function.
The triple colon allows you to access the internal variables of a namespace. Be aware though that ?":::" advises against using it in your code, presumably because objects that aren't exported can't be relied upon in future versions of a package. It suggests contacting the package maintainer to export the object if it is stable and useful.
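If relying on an unexported function is a concern, the same padding can be done in base R. The following is only a sketch: the toy `all.files` list stands in for the list of character vectors produced by `lapply(files, readdata)` in the question, and `max_len`, `padded` and `combined` are illustrative names.

```r
# Toy stand-in for the list returned by lapply(files, readdata)
all.files <- list(c("a", "b", "c"), c("d"), c("e", "f"))

# Length of the longest vector
max_len <- max(lengths(all.files))

# Extending a vector with `length<-` fills the new positions with NA
padded <- lapply(all.files, function(v) {
  length(v) <- max_len
  v
})

# Row-bind the equal-length vectors into a matrix, one row per file
combined <- do.call(rbind, padded)
combined_df <- as.data.frame(combined, stringsAsFactors = FALSE)
```

With the package function, the equivalent call from the answer above would be `do.call(qpcR:::rbind.na, all.files)`.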

Related

Error in x[is.na(x)] <- na.string : replacement has length zero when exporting data frame to openxlsx in R

I run into an error when I try to export a data frame to Excel with the openxlsx library:
openxlsx::write.xlsx(usertl_lp, file = "Mi_Exportación.xlsx")
Error in x[is.na(x)] <- na.string : replacement has length zero
usertl_lp_clean <- usertl_lp %>% mutate(across(where(is.list), as.character))
openxlsx::write.xlsx(usertl_lp_clean, file = "Mi_Exportación.xlsx")
This error may be caused by cells that contain vectors (list columns). The code above uses across to convert those columns to character.
I posted this here for others in need.
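To check whether list columns are the culprit before exporting, something like the following base R sketch may help. The toy data frame and the column name `tags` are made up for illustration, and joining the elements with `paste` is just one possible flattening. It differs from `as.character`, which would give a deparsed string such as `c("a", "b")`.

```r
# Toy data frame with one list column (hypothetical example data)
df <- data.frame(id = 1:2)
df$tags <- list(c("a", "b"), character(0))

# Flag the columns that are lists rather than atomic vectors
is_list_col <- vapply(df, is.list, logical(1))

# Collapse each cell of a list column into a single string
df[is_list_col] <- lapply(df[is_list_col], function(col) {
  vapply(col, paste, character(1), collapse = ", ")
})
```

After this step every column is atomic, so the data frame can be passed to the export function.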
I think you are looking for the writeData function from the same package.
Check out writeFormula from the same package as well or even write_xlsx from the writexl package.
I was having a similar problem in a data frame, but, in my case, I was using the related openxlsx::writeData.
The data frame was generated using sapply, with functions that could throw errors depending on the data, so I coded it to fill in NA whenever an error was generated. I ended up with NaN and NA values in the same column.
What worked for me is conducting the following treatment before writeData:
df[is.na(df)]<-''
so, for your problem, the following may work:
df[is.na(df)]<-''
openxlsx::write.xlsx(as.data.frame(df), file = "df.xlsx", colNames = TRUE, rowNames = FALSE, append = FALSE)

Using a For-loop to create multiple objects with incremental suffixes, then reading in .csv file to each new object (also with incremental suffixes)

I've just started learning R so forgive me for my ignorance! I'm reading in lots of .csv files, each of which corresponds to a different year (2010-2019). I then filter down the .csv files based on a variable within one of the columns (because the datasets are very large). Currently I am using the code below to do this and then repeating it for each year:
data_2010 <- data.table::fread("//Project/2010 data/2010 data.csv", select = c("date", "id", "type"))
data_b_2010 <- data_2010[which(data_2010$type=="ABC123")]
rm(data_2010)
What I would like to do is use a For-loop to create new object data_20xx for each year, and then read in the .csv files (and apply the filter of "type") for each year too.
I think I know how to create the objects in a For-loop but not entirely sure how I would also assign the .csv files and change the filepath string so it updates with each year (i.e. "//Project/2010 data/2010 data.csv" to "//Project/2011 data/2011 data.csv").
Any help would be greatly appreciated!
Next time please provide a reproducible example so we can help you.
I would use data.table which contains specialized functions to do what you want.
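library(data.table)
setwd("Project")
# List every file below the working directory, then keep only the .csv files
allfiles <- list.files(recursive = TRUE, full.names = TRUE)
allcsv <- allfiles[grepl("\\.csv$", allfiles)]
data_list <- list()
for (i in seq_along(allcsv)) {
  print(round(i / length(allcsv), 2))  # progress as a fraction of files read
  data_list[[i]] <- fread(allcsv[i])   # [[ ]] stores the whole table as one list element
}
# Keep only the rows of interest in each table
data_list_filtered <- lapply(data_list, function(x) {
  y <- data.frame(x)
  return(y[which(y$type == "ABC123"), ])
})
result <- rbindlist(data_list_filtered)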
First, list.files will tell you all the files contained in your working dir by default.
Second, read each csv file into the data_list list using the fast and efficient fread function.
Third, do the filtering within a loop, as requested.
Fourth, use rbindlist from data.table to rbind all of these data.table's.
Finally, if you are not familiar with the data.table syntax, you can run setDF(result) to convert your results back to a data.frame.
I strongly encourage you to learn the data.table syntax as it is quite powerful and efficient for tabular data manipulations. These vignettes will get you started.
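For the part of the question about changing the file path for each year, one possible approach is to build the paths with `sprintf` and keep the results in a named list instead of creating separate `data_20xx` objects. The sketch below is self-contained: it first writes small dummy files into a temporary folder that mimics the layout from the question, and it uses base `read.csv` as a stand-in for `data.table::fread(..., select = ...)`. The names `root`, `paths` and `data_by_year` are illustrative.

```r
# Build a throwaway folder layout like "<root>/2010 data/2010 data.csv"
root <- file.path(tempdir(), "Project")
years <- 2010:2012
for (yr in years) {
  dir.create(file.path(root, sprintf("%d data", yr)),
             recursive = TRUE, showWarnings = FALSE)
  dummy <- data.frame(date = "2010-01-01", id = 1:3,
                      type = c("ABC123", "XYZ", "ABC123"))
  write.csv(dummy,
            file.path(root, sprintf("%d data", yr), sprintf("%d data.csv", yr)),
            row.names = FALSE)
}

# sprintf is vectorised, so one call builds the path for every year
paths <- file.path(root, sprintf("%d data", years), sprintf("%d data.csv", years))
names(paths) <- paste0("data_b_", years)

# Read and filter each file; lapply keeps the names of `paths`
data_by_year <- lapply(paths, function(p) {
  d <- read.csv(p, stringsAsFactors = FALSE)
  d[d$type == "ABC123", c("date", "id", "type")]
})
```

An individual year is then available as `data_by_year[["data_b_2010"]]`, which avoids having to create objects with `assign`.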

How to read and rbind all .xlsx files in a folder efficiently using read_excel

I am new to R and need to create one dataframe from 80 .xlsx files that mostly share the same columns and are all in the same folder. I want to bind all these files efficiently in a manner that would work if I added or removed files from the folder later. I want to do this without converting the files to .csv, unless someone can show me how to do that efficiently for large numbers of files within R itself.
I've previously been reading files individually using the read_excel function from the readxl package. After, I would use rbind to bind them. This was fine for 10 files, but not 80! I've experimented with many solutions offered online however none of these seem to work, largely because they are using functions other than read_excel or formats other than .xlsx. I haven't kept track of many of my failed attempts, so cannot offer code other than one alternate method I tried to adapt to read_excel from the read_csv function.
#Method 1
library(readxl)
library(purrr)
library(dplyr)
library(tidyverse)
file.list <- list.files(pattern='*.xlsx')
alldata <- file.list %>%
map(read_excel) %>%
reduce(rbind)
#Output
New names:
* `` -> ...2
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
Any code on how to do this would be greatly appreciated. Sorry if anything is wrong about this post, it is my first one.
UPDATE:
Using the changes suggested by the answers, I'm now using the code:
file.list <- list.files(pattern='*.xlsx')
alldata <- file.list %>%
map_dfr(read_excel) %>%
reduce(bind_rows)
The output is now as follows:
New names:
* `` -> ...2
Error: Column `10.Alert.alone` can't be converted from numeric to character
This happens regardless of which type of bind() function I use in the reduce() slot. If anyone can help with this, please let me know!
You're on the right track here. But you need to use map_dfr instead of plain-vanilla map. map_dfr outputs a data frame (or actually tibble) for each iteration, and combines them via bind_rows.
This should work:
library(readxl)
library(tidyverse)
file.list <- list.files(pattern='*.xlsx')
alldata <- file.list %>%
map_dfr(~read_excel(.x))
Note that this assumes your files all have consistent column names and data types. If they don't, you may have to do some cleaning. (One trick I've used in complex cases is to add a %>% mutate_all(as.character) to the read_excel command inside the map function. That will turn everything into characters, and then you can convert the data types from there.)
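The "turn everything into characters" trick can be illustrated without any packages. This is a rough base R sketch: the two toy data frames stand in for the results of `read_excel`, and `frames`, `as_chr`, `all_cols` and `filled` are illustrative names. With readxl itself, passing `col_types = "text"` to `read_excel` should have a similar effect at read time.

```r
# Two toy tables whose column "a" has conflicting types and whose columns differ
frames <- list(
  data.frame(a = 1:2, b = c("x", "y")),
  data.frame(a = c("3", "n/a"), c = c(TRUE, FALSE))
)

# Step 1: convert every column of every table to character
as_chr <- lapply(frames, function(d) {
  d[] <- lapply(d, as.character)
  d
})

# Step 2: add any missing columns as NA so all tables share the same layout
all_cols <- unique(unlist(lapply(as_chr, names)))
filled <- lapply(as_chr, function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA_character_
  d[all_cols]
})

# Step 3: the tables now bind without type or column-count errors
combined <- do.call(rbind, filled)
```

The column types can then be restored one column at a time once the offending values have been cleaned.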
This should get you there, or at least close:
library(data.table)
library(readxl)
#create files list
file.list <- list.files( pattern = ".*\\.xlsx$", full.names = TRUE )
#read files to list of data.frames
l <- lapply( file.list, readxl::read_excel )
#bind l together to one larger data.table, by columnname, fill missing with NA
dt <- data.table::rbindlist( l, use.names = TRUE, fill = TRUE )
Try using map_dfr.
alldata <- file.list %>%
map_dfr(read_excel)

How to bind rows in R such that instead of type conversion error binding defaults to filling value to NA?

I am currently tasked with merging multiple xlsx files into one master R (.rds) data file. Since these files are filled in manually, there are a lot of type conversion errors when using approaches such as dplyr::bind_rows, for example:
Column `XYZ` can't be converted from numeric to character
I very much need the binding to be "smart", so that it matches the corresponding column names of the data frames being merged. When it encounters conversion issues, though, I would like the problematic cell contents to be treated as NA, with perhaps just a warning instead of an error.
Is there a convenient way/function for doing this in R?
I have used bind_rows from the dplyr package.
My current import procedure
files <- list.files("data",pattern = "xlsx", full.names = TRUE)
tmp <- read_excel(files[1], sheet = "data", trim_ws = TRUE)
names(tmp) <- make.names(str_squish(names(tmp)))
for (i in 2:length(files)) {
print(i)
tmp2 <- read_excel(files[i], sheet = "data",trim_ws = TRUE)
names(tmp2) <- make.names(str_squish(names(tmp2)))
tmp<-bind_rows(tmp,tmp2)
}
It has been pointed out that using a loop here is not efficient, but since the files are messy - many manual mistakes - and relatively small in number I focused on being able to sequentially track the binding process.
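One way to get the "NA plus a warning" behaviour might be to read or convert every column as text first, bind, and only then coerce selected columns, turning unparseable cells into NA. The helper below is a hypothetical sketch, not a function from dplyr or readxl, and `to_numeric_or_na` is a made-up name.

```r
# Coerce a character vector to numeric; cells that cannot be parsed become NA
# and a single warning reports how many were affected.
to_numeric_or_na <- function(x) {
  out <- suppressWarnings(as.numeric(x))
  bad <- !is.na(x) & is.na(out)   # present in the input, lost in conversion
  if (any(bad)) {
    warning(sprintf("%d value(s) could not be converted and were set to NA",
                    sum(bad)))
  }
  out
}

# Example: "abc" is a manual typing mistake in an otherwise numeric column
raw <- c("1", "2.5", "abc", NA)
clean <- suppressWarnings(to_numeric_or_na(raw))
```

Applied after binding, for example as `tmp$XYZ <- to_numeric_or_na(tmp$XYZ)`, this keeps the sequential loop from the question intact while avoiding the hard error.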

Using rbind() to combine multiple data frames into one larger data.frame within lapply()

I'm using R-Studio 0.99.491 and R version 3.2.3 (2015-12-10). I'm a relative newbie to R, and I'd appreciate some help. I'm doing a project where I'm trying to use the server logs on an old media server to identify which folders/files within the server are still being accessed and which aren't, so that my team knows which files to migrate. Each log is for a 24 hour period, and I have approximately a year's worth of logs, so in theory, I should be able to see all of the access over the past year.
My ideal output is to get a tree structure or plot that will show me the folders on our server that are being used. I've figured out how to read one log (one day) into R as a data.frame and then use the data.tree package in R to turn that into a tree. Now, I want to recursively go through all of the files in the directory, one by one, and add them to that original data.frame, before I create the tree. Here's my current code:
#Create the list of log files in the folder
files <- list.files(pattern = "*.log", full.names = TRUE, recursive = FALSE)
#Create a new data.frame to hold the aggregated log data
uridata <- data.frame()
#My function to go through each file, one by one, and add it to the 'uridata' df, above
lapply(files, function(x){
uriraw <- read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
#print(nrow(uriraw))
uridata <- rbind(uridata, uriraw)
#print(nrow(uridata))
})
The problem is that, no matter what I try, the value of 'uridata' within the lapply loop seems to not be saved/passed outside of the lapply loop, but is somehow being overwritten each time the loop runs. So instead of getting one big data.frame, I just get the contents of the last 'uriraw' file. (That's why there are those two commented print commands inside the loop; I was testing how many lines there were in the data frames each time the loop ran.)
Can anyone clarify what I'm doing wrong? Again, I'd like one big data.frame at the end that combines the contents of each of the (currently seven) log files in the folder.
do.call() is your friend.
big.list.of.data.frames <- lapply(files, function(x){
read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
})
or more concisely (but less-tinkerable):
big.list.of.data.frames <- lapply(files, read.table,
skip = 3,header = TRUE,
stringsAsFactors = FALSE)
Then:
big.data.frame <- do.call(rbind,big.list.of.data.frames)
This is a recommended way to do things because "growing" a data frame dynamically in R is painful: it is slow and memory-expensive, because a new frame gets built at each iteration.
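The reason the original attempt never updated `uridata` is most likely scoping: an assignment with `<-` inside the anonymous function creates a local copy, so the outer object is left untouched. A small sketch with toy data frames in place of the log files (`acc`, `res` and `combined` are illustrative names):

```r
# Outer accumulator, analogous to `uridata` in the question
acc <- data.frame()

# The assignment inside the function only changes a local `acc`
res <- lapply(1:3, function(i) {
  acc <- rbind(acc, data.frame(i = i))
  nrow(acc)   # always 1: each call starts from the empty outer `acc`
})

nrow(acc)     # the outer data frame is still empty

# Returning each piece and binding once at the end gives the full result
combined <- do.call(rbind, lapply(1:3, function(i) data.frame(i = i)))
```

Here `combined` has one row per input, which is the pattern the `do.call(rbind, ...)` answer relies on.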
You can use map_df from purrr package instead of lapply, to directly have all results combined as a data frame.
map_df(files, read.table, skip = 3, header = TRUE, stringsAsFactors = FALSE)
Another option is fread from data.table
library(data.table)
rbindlist(lapply(files, fread, skip=3))
