I have a set of Excel files, each containing one sheet of data, all of similar structure (mostly; see below), that I ultimately want to combine into one large data frame (with each subset indexed by its original source file).
I am able to create a list of multiple dataframes, and then merge these into one dataframe, pretty easily with the following code:
files <- grep(".xlsx", dir(), value=TRUE) # vector of file names
IDnos <- substr(files,20,24) #vector with key 5-digit ID info of each file
library("XLConnect")
library("data.table")
datalist <- lapply(files, readWorksheetFromFile, sheet = "Data")
names(datalist) <- IDnos
bigdatatable <- rbindlist(datalist, idcol = "IDNo")
One data column, "Value", is usually of class numeric, but I found that in several files an "ND" was entered in one row, which makes the column class character, so in the final data frame the column is character.
Although I can fix this with some simple cleaning, I was left wondering whether there is a way to identify, at the "list of dataframes" stage, which files (or which dataframe components of the list I created) have class character for column "Value". For example, sapply(datalist, class) and other variations don't get me there. I am hoping to avoid a for loop.
Is there any way to use lapply or sapply to drill down into dataframes within a list?
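Here's how I would use lapply to find the class of column a in a list of 2 data frames, named x and y. (The output shown is from R before 4.0.0; from 4.0.0 on, data.frame() no longer converts strings to factors by default, so x reports "character" instead.)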
datalist <- list(x = data.frame(a = letters),
y = data.frame(a = 1:26))
lapply(datalist, function(x) class(x$a))
$x
[1] "factor"
$y
[1] "integer"
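Applied to the question's own situation, the same idea can flag which list elements carry a non-numeric Value column. This is only a sketch on made-up data: the IDs and values below are hypothetical stand-ins for the list built from the Excel files.

```r
# Toy stand-in for the list built from the Excel files; the IDs and values
# here are made up for illustration.
datalist <- list(
  "10001" = data.frame(Value = c(1.2, 3.4), stringsAsFactors = FALSE),
  "10002" = data.frame(Value = c("5.6", "ND"), stringsAsFactors = FALSE),
  "10003" = data.frame(Value = c(7.8, 9.0), stringsAsFactors = FALSE)
)

# One class per data frame, returned as a named character vector
value_class <- vapply(datalist, function(d) class(d$Value)[1], character(1))

# Names of the list elements whose Value column is not numeric
bad_ids <- names(value_class)[value_class != "numeric"]
print(bad_ids)
```

vapply is used instead of sapply so that the result is guaranteed to be a character vector with one entry per data frame.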
I have a list of dataframes which I imported using
setwd("C:path")
fnames <- list.files()
csv <- lapply(fnames, read.csv, header = T, sep=";")
I will need to do this multiple times, creating more lists, and I would like to keep all the dataframes available separately (i.e. I don't want or need to combine them); I simply used the above code to import them all quickly. But accessing them now is a little cumbersome and not intuitive (to me at least). Rather than having to use [[1]] to access the first element, is there a way to amend the first bit of code so that I can name the elements in the list, for example based on a Date which is a variable in each of the dataframes in the list? The dates are stored as chr in the format "dd-mm-yyyy", so I could potentially just name the dataframes using dd-mm from the Date variable.
You can extract the required part from the first value in the Date column of each dataframe and assign the results as the names of the list.
names(csv) <- sapply(csv, function(x) substr(x$Date[1], 1, 5))
Or extract the data using regex.
names(csv) <- sapply(csv, function(x) sub("(\\w+-\\w+).*", "\\1", x$Date[1]))
We can use
names(csv) <- sapply(csv, function(x) substring(x$Date[1], 1, 5))
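To see the naming step end to end, here is a small sketch on invented data. The two data frames and their dates are hypothetical; only the "dd-mm-yyyy" character format comes from the question.

```r
# Hypothetical stand-in for the imported list; Date is stored as character
# in "dd-mm-yyyy" format, as described in the question.
csv <- list(
  data.frame(Date = c("01-03-2020", "01-03-2020"), x = 1:2,
             stringsAsFactors = FALSE),
  data.frame(Date = c("15-04-2020", "15-04-2020"), x = 3:4,
             stringsAsFactors = FALSE)
)

# Take the first five characters ("dd-mm") of the first Date in each data frame
names(csv) <- vapply(csv, function(d) substr(d$Date[1], 1, 5), character(1))

# Elements can now be reached by name instead of by position
print(names(csv))
print(csv[["15-04"]]$x)
```

Note that two files sharing the same dd-mm would end up with duplicate names, in which case lookup by name only returns the first match.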
I'm creating an empty list of dataframes that I will append later using lapply.
library(tidyverse)
library(dplyr)
library(purrr)
my.list <- lapply(1:192, function(x, nr = 468, nc = 1) { data.frame(symbol = matrix(nrow=nr, ncol=nc)) })
str(my.list)
If you inspect the structure of my.list, you will notice that the column within each dataframe is of type logical. I would like the column in each dataframe to be character rather than logical.
Can I change anything within my lapply function above so that the columns in the resulting list of dataframes are character? Or how best would I go about this task? I'm creating this empty list of dataframes because I understand that R works faster if it doesn't have to constantly append files. Thus my next step is to perform a map function to populate each dataframe in this list of dataframes with character data.
The issue is that a bare NA is NA_logical_ by default, so an empty matrix of NAs yields a logical column. To create a character column, use NA_character_ instead. The existing list can be fixed with
my.list <- lapply(my.list, function(x) {x[] <- lapply(x, as.character); x})
Or while creating the data.frame column, use
my.list <- lapply(1:192, function(x) data.frame(symbol = rep(NA_character_, 468), stringsAsFactors = FALSE))
The matrix route to a single-column data.frame is not ideal and is sometimes incorrect (a matrix can hold only a single type, whereas data.frame columns can each have a different type). The easiest option is to repeat NA_character_ n times to create a single-column data.frame with n rows. (stringsAsFactors = FALSE only matters on R versions before 4.0.0, where the column would otherwise become a factor.)
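A quick way to convince yourself of the difference, and to check the preallocated list, is sketched below. The sizes (192 data frames, 468 rows) are taken from the question; the helper names n_frames, n_rows and col_classes are just for illustration.

```r
# A bare NA is logical; the typed constant NA_character_ is character
print(class(NA))
print(class(NA_character_))

# Sizes taken from the question
n_frames <- 192
n_rows <- 468

# Preallocate one single-column character data frame per slot
my.list <- lapply(seq_len(n_frames), function(i) {
  data.frame(symbol = rep(NA_character_, n_rows), stringsAsFactors = FALSE)
})

# Confirm that every symbol column is character
col_classes <- vapply(my.list, function(d) class(d$symbol), character(1))
print(table(col_classes))
```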
I have a data.frame that contains one Date type variable. I want to export 4 files, each containing the subset corresponding to one week. The following will divide my data in 4; however, I don't know how to store each of these in a new data.frame.
split(DataAir, sample(rep(1:4)))
Thanks
If you save your split data frames in a variable, you can access the elements with double-bracket subsetting (e.g. s[[1]]). To save them, create a vector of file names as you'd like and write each element to its file.
s <- split(iris, iris$Species)
filenames <- paste0("my_path/file", 1:3, ".csv")
for(i in seq_along(s)) write.csv(s[[i]], filenames[i])
And for R users that get unnecessarily bugged out by for loops:
mapply(function(x,y) write.csv(x,y), s, filenames)
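Since the question asks for one file per week, a sketch of splitting on the Date column itself may help. The data below is invented (28 consecutive days starting on a Monday), and the output location is an assumption: files go to a temporary directory.

```r
# Hypothetical data: 28 consecutive days starting on Monday 2021-03-01
DataAir <- data.frame(
  Date = seq(as.Date("2021-03-01"), by = "day", length.out = 28),
  value = seq_len(28)
)

# Label each row with the start of its week (weeks start on Monday by default),
# then split on that label; drop = TRUE discards any empty week
week_start <- cut(DataAir$Date, breaks = "week")
by_week <- split(DataAir, week_start, drop = TRUE)

# Write one CSV per week into a temporary directory (the path is an assumption)
out_dir <- tempdir()
out_files <- file.path(out_dir, paste0("week_", names(by_week), ".csv"))
invisible(Map(function(d, f) write.csv(d, f, row.names = FALSE),
              by_week, out_files))

print(names(by_week))
```

Each element of by_week is an ordinary data frame, so by_week[[1]] can be used directly if separate objects are wanted rather than files.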
I have a list of filenames that were found by searching the working directory. I want to either make one data frame with multiple elements that can be selected from or multiple data frames. To select either parts of one data frame or pick from multiple data frames, I want to name them using a part of the associated filename.
Currently, I set filenames using list.files and set up the data frame using lapply with read.csv
filenames = list.files(recursive=TRUE,pattern="*dat.csv",full.names=FALSE)
data = lapply(filenames,function(i){
read.csv(i,stringsAsFactors=FALSE)
})
Can someone explain to me the best way to go about this data import and name assignment?
A good way to store this would be as a single, combined data frame with a column describing the original file, let's say type:
data_frames = lapply(filenames,function(i){
ret <- read.csv(i,stringsAsFactors=FALSE)
ret$type <- gsub("dat.csv$", "", i)
ret
})
data = do.call(rbind, data_frames)
Or shorter, with plyr:
library(plyr)
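# .id is only filled in when the input is named, so name each entry by its file
data = ldply(setNames(filenames, filenames), read.csv, stringsAsFactors = FALSE, .id = "type")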
data$type <- gsub("dat.csv$", "", data$type)
That way you could extract whatever subset you wanted with:
# to get all lines from, say, the AAAdat.csv file
subset(data, type == "AAA")
You could store each dataset as an individual variable with a name like AAA, but you shouldn't, because it's a bad idea to use your variable names to store information.
(Note that this assumes your datasets share most, or at least some, columns. If they have entirely different structures, this is not an appropriate approach).
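For anyone who wants to try the combined-data-frame approach without real files, here is a self-contained sketch in base R. The two files, their contents, and the temporary directory are all made up; basename() is added on the assumption that the paths may include directories.

```r
# Create two small hypothetical files so the sketch is self-contained
tmp <- tempfile("demo_")
dir.create(tmp)
write.csv(data.frame(x = 1:2), file.path(tmp, "AAAdat.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:5), file.path(tmp, "BBBdat.csv"), row.names = FALSE)

filenames <- list.files(tmp, pattern = "dat\\.csv$", full.names = TRUE)

# Read each file and record which file the rows came from
data_frames <- lapply(filenames, function(f) {
  ret <- read.csv(f, stringsAsFactors = FALSE)
  # basename() drops the directory part before the suffix is stripped
  ret$type <- sub("dat\\.csv$", "", basename(f))
  ret
})
data <- do.call(rbind, data_frames)

print(table(data$type))
print(subset(data, type == "AAA"))
```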
After merging multiple data frames into one, I would like to know how to change the column headers in the master data frame to represent the original files that they came from. I merged a large number of data frames into one using the code below:
library(plyr)
dflist = list.files(path=dir, pattern="csv$", full.names=TRUE, recursive=FALSE)
import.list = llply(dflist, read.csv)
Master = Reduce(function(x, y) merge(x, y, by="Hours"), import.list)
I would like the columns that belonged to each original data frame to be named after the unique ID in the name of the original data frame/csv file (i.e. aa, ab, ac). The unique ID in each filename comes immediately before an underscore ("_"), so I can isolate it using the code below. However, I am having trouble applying this to the column headers. Any help would be much appreciated.
filename = dflist[1]
unqID = strsplit(filename,"_")[[1]][1]
You could define a function in your llply call that reads each file and assigns the names,
or just rename them after reading them in and before merging, as @joran suggested.
# First get the file names
filenames = dflist
#I am unsure about the line below, as I
# Keep the part of each file name (without its directory) before the first underscore
unqID = sapply(filenames, function(x) strsplit(basename(x), "_")[[1]][1], USE.NAMES = FALSE)
names(import.list) <- unqID # renaming the list items
And then merge using your code
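Naming the list items does not by itself change the column headers after the merge, so one possible way to carry the IDs through is to prefix each data frame's columns before merging. This is a sketch on invented data: the column name temp and the IDs "aa" and "ab" are hypothetical, and only the Hours key comes from the question.

```r
# Hypothetical stand-ins for two imported files with IDs "aa" and "ab"
import.list <- list(
  data.frame(Hours = 1:3, temp = c(10, 11, 12)),
  data.frame(Hours = 1:3, temp = c(20, 21, 22))
)
unqID <- c("aa", "ab")

# Prefix every column except the merge key with the ID of its source file
renamed <- Map(function(d, id) {
  is_key <- names(d) == "Hours"
  names(d)[!is_key] <- paste(id, names(d)[!is_key], sep = ".")
  d
}, import.list, unqID)

# Merge as in the question; the prefixed names survive the merge unchanged
Master <- Reduce(function(x, y) merge(x, y, by = "Hours"), renamed)
print(names(Master))
```

Prefixing before the merge also avoids the .x/.y suffixes that merge() adds when two inputs share a non-key column name.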