I want to read every Excel file in a folder into R and load multiple sheets from each file, using the xlsx package.
I pasted the code below:
filelist <- list.files(pattern = "\\.xls") # list all the xlsx files
library(xlsx)
allxlsx.files <- list() # create a list to populate with xlsx data
for (file in filelist) {
wb <- loadWorkbook(file)
sheets <- getSheets(wb)
sheet <- sheets[['_']] # get sheets with field section reading
res <- readColumns(sheet, 1, 2, 114, 120, colClasses=c("character", "numeric"))
}
traceback()
1: readColumns(sheet, 1, 2, 114, 120, colClasses = c("character",
"numeric")) at #6
Can someone enlighten me how to proceed?
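I think you are subsetting the sheets incorrectly: sheets[['_']] looks for a sheet whose name is exactly "_" and returns NULL when there is none.
You can use grep on the sheet names to get all the sheets whose names contain "_".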
I have created and used a single xlsx file with hypothetical data having 5 sheets with names as below for demonstration.
> names(sheets)
[1] "Sheet_1" "Sheet2" "Sheet_3" "Sheet4" "sheet_4"
Getting the required sheets can be done using
sheet = sheets[grep("_",names(sheets))]
You can check it by using
> names(sheet)
[1] "Sheet_1" "Sheet_3" "sheet_4"
So your final code will look like the following:
filelist <- "sheeetLoadTrial1.xlsx" # a single xlsx file
library(xlsx)
allxlsx.sheets <- list() # create a list to populate with xlsx sheet data
for (file in filelist) {
  wb <- loadWorkbook(file)
  sheets <- getSheets(wb)
  sheet <- sheets[grep("_", names(sheets))] # keep sheets whose name contains "_"
  for (i in seq_along(sheet)) {
    # read columns 1 to 2 and rows 1 to 8, without a header row
    res <- readColumns(sheet[[i]], 1, 2, 1, 8, header = FALSE)
    allxlsx.sheets[[i]] <- res
  }
  names(allxlsx.sheets) <- names(sheet)
}
After this, your final list will be
> allxlsx.sheets
$Sheet_1
X1 X2
1 1 2
2 2 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
$Sheet_3
X1 X2
1 1 2
2 2 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
$sheet_4
X1 X2
1 1 2
2 2 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
For more than one file, you can append each file's allxlsx.sheets to the allxlsx.files list.
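A minimal sketch of that last step, assuming the inner sheet loop has been wrapped in a helper. The helper read_underscore_sheets() and the file names a.xlsx and b.xlsx are hypothetical; the helper returns made-up data here so the example runs without any Excel file, whereas a real one would hold the loadWorkbook(), getSheets() and readColumns() calls shown above.

```r
# Hypothetical stand-in for the per-file sheet loop shown above.
# It returns made-up data so the sketch runs without an Excel file.
read_underscore_sheets <- function(file) {
  list(
    Sheet_1 = data.frame(X1 = 1:2, X2 = 2:3),
    Sheet_3 = data.frame(X1 = 1:2, X2 = 2:3)
  )
}

filelist <- c("a.xlsx", "b.xlsx")  # assumed file names

allxlsx.files <- list()  # one entry per file, each a named list of sheets
for (file in filelist) {
  # index by file name so each file's sheets stay grouped together
  allxlsx.files[[file]] <- read_underscore_sheets(file)
}

names(allxlsx.files)
allxlsx.files[["b.xlsx"]]$Sheet_3
```

Indexing by file name rather than by position keeps the result self-describing: allxlsx.files[["b.xlsx"]]$Sheet_3 picks one sheet of one file.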
I have multiple CSV files that I have already read into R. Now I want to append all of them into one file. I tried a few things but got different errors. Can anyone please help me with this?
TRY 1:
mydata <- rbind(x1,x2,x3,x4,x5,x6,x7,x8)
where x1, x2, ..., x8 are the CSV files I read into R. The error I am getting is
ERROR 1: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA, :
invalid factor level, NA generated
TRY 2: Then I tried another way:
mydata1<- c(x1,x2,x3,x4,x5,x6,x7,x8)
> mydata2 <- do.call('rbind',lapply(mydata1,read.table,header=T))
Error 2: in FUN(X[[i]], ...) :
'file' must be a character string or connection
Can anyone please tell me the right way to do this?
How to import all files from a single folder at once and bind by row (e.g., same format for each file.)
library(tidyverse)
list.files(path = "location_of/data/folder_you_want/",
           pattern = "\\.csv$", # pattern is a regular expression, not a glob
           full.names = TRUE) %>%
  map_df(~read_csv(.))
If there is a file that you want to exclude then
list.files(path = "location_of/data/folder_you_want/",
           pattern = "\\.csv$", # pattern is a regular expression, not a glob
           full.names = TRUE) %>%
  .[ !grepl("data/folder/name_of_file_to_remove.csv", .) ] %>%
  map_df(~read_csv(.))
Sample CSV Files
Note
CSV files to be merged here have
- equal number of columns
- same column names
- same order of columns
- number of rows can be different
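The assumptions in this note can be checked before binding. The following is only a sketch: the two small data frames stand in for abc.csv and pqr.csv after reading, and the names frames, reference and same_layout are made up for the illustration.

```r
# Stand-ins for the data frames read from abc.csv and pqr.csv
abc <- data.frame(A = c(1, 2), B = c(2, 3), C = c(3, 4), D = c(4, 5))
pqr <- data.frame(A = 1, B = 2, C = 3, D = 40)
frames <- list(abc = abc, pqr = pqr)

# Same number, names and order of columns as the first data frame?
reference <- names(frames[[1]])
same_layout <- vapply(frames,
                      function(df) identical(names(df), reference),
                      logical(1))

if (all(same_layout)) {
  merged <- do.call(rbind, frames)
} else {
  stop("Column layout differs in: ",
       paste(names(frames)[!same_layout], collapse = ", "))
}
dim(merged)
```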
1st csv file abc.csv
A,B,C,D
1,2,3,4
2,3,4,5
3,4,5,6
1,1,1,1
2,2,2,2
44,44,44,44
4,4,4,4
4,4,4,4
33,33,33,33
11,1,11,1
2nd csv file pqr.csv
A,B,C,D
1,2,3,40
2,3,4,50
3,4,50,60
4,4,4,4
5,5,5,5
6,6,6,6
List FILENAMES of CSV Files
Note
The path below E:/MergeCSV/ has just the files to be merged. No other csv files. So in this path, there are only two csv files, abc.csv and pqr.csv
## List filenames to be merged.
filenames <- list.files(path="E:/MergeCSV/",pattern="\\.csv$")
## Print filenames to be merged
print(filenames)
## [1] "abc.csv" "pqr.csv"
FULL PATH to CSV Files
## Full path to csv filenames
fullpath=file.path("E:/MergeCSV",filenames)
## Print Full Path to the files
print(fullpath)
## [1] "E:/MergeCSV/abc.csv" "E:/MergeCSV/pqr.csv"
MERGE CSV Files
## Merge listed files from the path above
dataset <- do.call("rbind",lapply(filenames,FUN=function(files){ read.csv(files)}))
## Print the merged csv dataset; if it is large, use the `head()` function to get a glimpse of the merged dataset
dataset
# A B C D
# 1 1 2 3 4
# 2 2 3 4 5
# 3 3 4 5 6
# 4 1 1 1 1
# 5 2 2 2 2
# 6 44 44 44 44
# 7 4 4 4 4
# 8 4 4 4 4
# 9 33 33 33 33
# 10 11 1 11 1
# 11 1 2 3 40
# 12 2 3 4 50
# 13 3 4 50 60
# 14 4 4 4 4
# 15 5 5 5 5
# 16 6 6 6 6
head(dataset)
# A B C D
# 1 1 2 3 4
# 2 2 3 4 5
# 3 3 4 5 6
# 4 1 1 1 1
# 5 2 2 2 2
# 6 44 44 44 44
## Print dimension of merged dataset
dim(dataset)
## [1] 16 4
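The accepted answer above generates the error shown in the comments because lapply() is given the bare file names in filenames rather than the full paths, so read.csv() only finds the files if the working directory is E:/MergeCSV/. Pass fullpath instead to read from the directory of your choice: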
dataset <- do.call("rbind",lapply(fullpath,FUN=function(files){ read.csv(files)}))
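To try this without the E:/MergeCSV/ folder, here is a self-contained sketch that writes two small files into a temporary directory first. The directory csv_dir and the shortened file contents are assumptions made for the example; list.files(..., full.names = TRUE) returns the full paths directly, so no separate file.path() step is needed.

```r
# Temporary stand-in for E:/MergeCSV/ (assumed for this sketch)
csv_dir <- file.path(tempdir(), "MergeCSV")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("A,B,C,D", "1,2,3,4", "2,3,4,5"), file.path(csv_dir, "abc.csv"))
writeLines(c("A,B,C,D", "1,2,3,40"), file.path(csv_dir, "pqr.csv"))

# full.names = TRUE returns paths that read.csv() can open from any
# working directory
fullpath <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

dataset <- do.call(rbind, lapply(fullpath, read.csv))
dim(dataset)
```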
You can use a combination of lapply(), and do.call().
## cd to the csv directory
setwd("mycsvs")
## read in csvs
csvList <- lapply(list.files("./", pattern = "\\.csv$"), read.csv, stringsAsFactors = FALSE)
## bind them all with do.call
csv <- do.call(rbind, csvList)
You can also use the fread() function from the data.table package together with rbindlist() for a performance boost.
I'm trying to read an Excel file with over 30 tabs of data. The complication is that each tab actually has 2 tables in it. There is a table at the top of the sheet, then a few blank rows, then a second table below with completely different column titles.
I'm aware of the openxlsx and readxl packages, but they seem to assume that the Excel data is formatted into tidy tables.
If I can get the raw data into R (perhaps in a text matrix...), I'm confident I can do the dirty work of parsing it into data frames. Any advice? Many thanks.
You can use the XLConnect package to access an arbitrary region of an Excel worksheet and then extract a list of data frames. Please see below:
Simulation:
library(XLConnect)
# simulate xlsx-file
df1 <- data.frame(x = 1:10, y = 0:9)
df2 <- data.frame(x = 1:20, y = 0:19)
wb <- loadWorkbook("temp.xlsx", create = TRUE )
createSheet(wb, "sh1")
writeWorksheet(wb, df1, "sh1", startRow = 1)
writeWorksheet(wb, df2, "sh1", startRow = 15)
lapply(2:30, function(x) cloneSheet(wb, "sh1", paste0("sh", x)))
saveWorkbook(wb)
Extract Data
# read.data
wb <- loadWorkbook("temp.xlsx")
df1s <- lapply(1:30, function(x) readWorksheet(wb, x, startRow = 1, endRow = 11))
df2s <- lapply(1:30, function(x) readWorksheet(wb, x, startRow = 15, endRow = 35))
df1s[[1]]
df2s[[2]]
Output data.frame #1 from the first sheet and data.frame #2 from the second one:
> df1s[[1]]
x y
1 1 0
2 2 1
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
> df2s[[2]]
x y
1 1 0
2 2 1
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
11 11 10
12 12 11
13 13 12
14 14 13
15 15 14
16 16 15
17 17 16
18 18 17
19 19 18
20 20 19
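If the row positions of the two tables are not known in advance, one option is to split on the blank rows. The sketch below assumes the sheet has already been read into a raw character matrix (how it gets there is not covered here); the matrix raw and its contents are made up for the illustration.

```r
# Made-up raw sheet: table 1, two blank rows, then table 2
raw <- rbind(
  c("id",   "score",  NA),
  c("1",    "0.5",    NA),
  c("2",    "0.7",    NA),
  c(NA,     NA,       NA),
  c(NA,     NA,       NA),
  c("name", "city",   "zip"),
  c("Ann",  "Oslo",   "0150"),
  c("Bob",  "Bergen", "5003")
)

# A row is blank when every cell is NA or an empty string
is_blank <- apply(raw, 1, function(r) all(is.na(r) | r == ""))

# Number the consecutive runs of blank / non-blank rows
run_id <- cumsum(c(TRUE, diff(is_blank) != 0))

# Group the non-blank row indices by the run they belong to
blocks <- split(which(!is_blank), run_id[!is_blank])

# First row of each block becomes the header; columns without a header are dropped
tables <- lapply(blocks, function(rows) {
  block <- raw[rows, , drop = FALSE]
  keep <- !is.na(block[1, ]) & block[1, ] != ""
  body <- as.data.frame(block[-1, keep, drop = FALSE], stringsAsFactors = FALSE)
  names(body) <- block[1, keep]
  body
})

tables[[1]]
tables[[2]]
```

All columns come back as character, so type conversion is left as a separate step.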
I just have a data frame and want to split the data frame by rows, assign the several new data frames to new variables and save them as csv files.
a <- rep(1:5,each=3)
b <-rep(1:3,each=5)
c <- data.frame(a,b)
# a b
1 1 1
2 1 1
3 1 1
4 2 1
5 2 1
6 2 2
7 3 2
8 3 2
9 3 2
10 4 2
11 4 3
12 4 3
13 5 3
14 5 3
15 5 3
I want to split c by column a, i.e. all rows where column a is 1 are split from c, assigned to A and saved as A.csv. The same goes for B.csv with all rows where column a is 2.
What I can do is
A<-c[c$a%in%1,]
write.csv (A, "A.csv")
B<-c[c$a%in%2,]
write.csv (B, "B.csv")
...
If I have 1000 rows, there will be lots of subsets. Is there a simple way to do this with a for loop?
The split() function is very useful for splitting a data frame. You can also use lapply() here; it is more concise than an explicit loop.
dfs <- split(c, c$a) # list of dfs
# use numbers as file names
lapply(names(dfs),
function(x){write.csv(dfs[[x]], paste0(x,".csv"),
row.names = FALSE)})
# or use letters (max 26!) as file names
names(dfs) <- LETTERS[seq_along(dfs)]
lapply(names(dfs),
function(x){write.csv(dfs[[x]],
file = paste0(x,".csv"),
row.names = FALSE)})
vals <- unique(c$a) # loop over the actual values, which need not be 1, 2, 3, ...
for (i in seq_along(vals)) {
  write.csv(c[c$a == vals[i], ], paste0(LETTERS[i], ".csv"))
}
You should consider, however, what happens if you have more than 26 subsets. What will those files be named?
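One way around the 26-letter limit is to build the file names from the group values themselves. This is only a sketch: the data frame c_df, the output directory out_dir and the subset_ prefix are assumptions made for the example.

```r
# Example data with 30 groups, more than there are letters
c_df <- data.frame(a = rep(1:30, each = 2), b = seq_len(60))

# Write into a temporary directory (assumed location for this sketch)
out_dir <- file.path(tempdir(), "subsets")
dir.create(out_dir, showWarnings = FALSE)

dfs <- split(c_df, c_df$a)  # names(dfs) are the values of column a

# One file per group, named after the group value: subset_1.csv, subset_2.csv, ...
paths <- file.path(out_dir, sprintf("subset_%s.csv", names(dfs)))
invisible(Map(function(df, path) write.csv(df, path, row.names = FALSE),
              dfs, paths))

length(list.files(out_dir, pattern = "\\.csv$"))
```

Naming by value also means a file name tells you which subset it holds, which letters do not.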
I am a new user to R. I need your advice -
I have around 100 csv files. The number of columns can change in each file. I am looking for help in identifying number of "unique columns" in each file - (If the file has a duplicate column , I want it to count as 1 unique column)
file1.csv
a,b,c,d
1,2,0,4
2,0,3,5
3,0,4,6
4,8,7,0
file2.csv
a,b,c,d,c
1,2,0,3,0
2,3,4,5,4
3,6,2,0,2
4,2,3,5,3
So technically, the code should give me 4 columns (a,b,c,d) for file1.csv and 4 columns for file2.csv (a,b,c,d - column c is a duplicate). I know that dim(df)[2] will give me the number of columns in one file, but how should I do it for 100 files?
If the column names are enough to determine duplicated columns, an easy and faster way to do this would be to read the first line of each file with readLines(), split according to the file separator (",") with strsplit(), and then find the length of the unique vector returned.
You can wrap this in a sapply or lapply to iterate over the file list.
files <- c("file1.csv", "file2.csv")
ncolumns <- sapply(files, function(f) {
header.line <- readLines(f, n=1)
length(unique(strsplit(header.line, ",")[[1]]))
})
ncolumns
# file1.csv file2.csv
# 4 4
Assuming column names are enough to determine uniqueness, this will be faster since you don't have to load the whole csv file.
I would use a loop that reads each file in turn. You don't want to open them all at the same time or you could run out of memory.
get file list:
f = list.files("./dir/", pattern="\\.csv$", full.names=TRUE)
read files, find unique columns and write result to a variable:
answer = sapply(f, function(i){
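# read the file; check.names = FALSE keeps duplicated column names as they are
# (by default read.csv would rename a second "c" to "c.1")
x = read.csv(i, check.names = FALSE)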
# extract column names and then get the unique ones
x = unique(colnames(x))
# return the number of column names
length(x)
})
You can then have a look at the column counts:
# Summary statistics
summary(answer)
# Boxplot
boxplot(answer)
# Plot of number of columns vs names (probably messy with 100)
barplot(answer, names.arg=basename(f))
You can try using the length() and unique() functions together to count the number of unique column names. For example:
data <- data.frame(matrix(c(1:12), nrow=3, ncol=4))
colnames(data) <- c("a","b","c","b")
length(unique(colnames(data)))
Depending on what your upload process is, you can try to integrate this into a loop or run as a batch process.
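In case the column names alone are not enough to identify duplicates: the function f below counts the unique columns of a data frame X, treating two columns as duplicates only if both their names and their values match.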
f <- function( X )
{
A <- mapply(c,as.list(X),colnames(X))
sum( apply(A,2,function(col)
{
1 / sum( colSums( matrix(!(rep(col,ncol(A))==c(A)),nrow(A)) ) == 0 )
} ) )
}
Examples:
> X1 <- data.frame( a = 1:3, b = 5:7, c = 3:1, d = 9:7 )
> X2 <- cbind( X1, c=4:2 )
> X3 <- cbind( X1, c=1:3 )
> X4 <- cbind( X1, e=5:7 )
> X5 <- cbind( X1, b=5:7 )
> X1
a b c d
1 1 5 3 9
2 2 6 2 8
3 3 7 1 7
> X2
a b c d c
1 1 5 3 9 4
2 2 6 2 8 3
3 3 7 1 7 2
> X3
a b c d c
1 1 5 3 9 1
2 2 6 2 8 2
3 3 7 1 7 3
> X4
a b c d e
1 1 5 3 9 5
2 2 6 2 8 6
3 3 7 1 7 7
> X5
a b c d b
1 1 5 3 9 5
2 2 6 2 8 6
3 3 7 1 7 7
>
> f(X1)
[1] 4
> f(X2)
[1] 5
> f(X3)
[1] 5
> f(X4)
[1] 5
> f(X5)
[1] 4
> f(cbind(X1,X1))
[1] 4
> f(cbind(X1,X5))
[1] 4
> f(cbind(X1,X2))
[1] 5
> f(cbind(X2,X3))
[1] 6
>
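A shorter way to get the same counts, offered as a sketch rather than a drop-in replacement: build one text key per column from its name and values, then count the distinct keys. The function name count_unique_columns and the separator character are choices made for this example. Like f, it compares values as text.

```r
count_unique_columns <- function(X) {
  # One key per column: the name followed by the values, joined with a
  # separator that is unlikely to occur in the data (an assumption)
  keys <- mapply(function(values, name) {
    paste(c(name, as.character(values)), collapse = "\x1f")
  }, as.list(X), colnames(X))
  length(unique(keys))
}

X1 <- data.frame(a = 1:3, b = 5:7, c = 3:1, d = 9:7)
X2 <- cbind(X1, c = 4:2)  # same name as c, different values
X3 <- cbind(X1, c = 1:3)  # same name as c, different values
X5 <- cbind(X1, b = 5:7)  # exact duplicate of b

count_unique_columns(X1)
count_unique_columns(X2)
count_unique_columns(X5)
count_unique_columns(cbind(X2, X3))
```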
I am trying to write a table from R into Excel. Here is some sample code:
library(XLConnect)
wb <- loadWorkbook("C:\\Users\\Bob\\Desktop\\Example.xls", create=TRUE)
output <- as.table(output)
createSheet(wb, name="Worksheet 1")
writeWorksheet(wb, output, sheet="Worksheet 1")
saveWorkbook(wb)
But it seems that the writeWorksheet function converts the table into a dataframe. This makes the data look messy and unformatted. I want the table structure to be preserved. How would I modify the above code?
The issue here is that writeWorksheet converts the table object to a data frame. The way that happens is that R will basically "melt" it into long format, whereas a table object is typically printed to the console in "wide" format.
It is a bit of a nuisance, but you generally have to manually convert the table into a data frame that matches the format you're after. An example:
library(reshape2)
tbl <- with(mtcars,table(cyl,gear))
> tbl
gear
cyl 3 4 5
4 1 8 2
6 2 4 1
8 12 0 2
> as.data.frame(tbl)
cyl gear Freq
1 4 3 1
2 6 3 2
3 8 3 12
4 4 4 8
5 6 4 4
6 8 4 0
7 4 5 2
8 6 5 1
9 8 5 2
> tbl_df <- as.data.frame(tbl)
> final <- dcast(tbl_df,cyl~gear,value.var = "Freq")
> final
cyl 3 4 5
1 4 1 8 2
2 6 2 4 1
3 8 12 0 2
> class(final)
[1] "data.frame"
Then you should be able to write that data frame to the Excel worksheet with no problem.
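For a two-way table like this one, base R can produce the same wide layout without reshape2. A minimal sketch using as.data.frame.matrix(); moving the row labels into a cyl column is an extra step added here on the assumption that they should appear as a regular column in the sheet.

```r
tbl <- with(mtcars, table(cyl, gear))

# Keep the wide layout: one row per cyl, one column per gear
wide <- as.data.frame.matrix(tbl)

# Turn the row labels into a regular column
wide <- cbind(cyl = rownames(wide), wide)
rownames(wide) <- NULL

wide
class(wide)
```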