I didn't find an answer to this question, so hopefully this is the place to get some help on this.
I am reading in many Excel files contained in .zip files. Each .zip that I have has about 40 Excel files that I want to read. I am trying to create a list of data frames, but encounter an error when reading some files, depending on file content.
This is the read statement, inside a for loop:
library(readxl)
df[[i]] <- read_excel(xls_lst[i],
                      skip = 4,
                      col_names = FALSE,
                      na = "n/a",
                      col_types = data_types)
data_types has these values:
> data_types
[1] "text" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
which is correct for this file.
The read_excel statement works well on some files, but returns a warning message on others:
In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types,... :
[54, 7]: expecting numeric: got '9999.990000'
Well, the value '9999.990000' looks like a numeric to me.
When I open the Excel file that creates this warning, the file indeed shows these values, and also shows that the column is formatted as text in Excel.
When I change the column formatting to numeric and re-save the Excel sheet, the data is read in correctly.
However, I have several hundred of these files to read ... how can read_excel ignore the column format indicated by Excel, and instead use the col_types definition that I supply in the calling statement?
Thanks,
I tried to build a toy example.
My xlsx file contains:
3 1
3 3
4 4
5 5
7 '999
6 3
Reading it in your way:
data_types <- c("numeric", "numeric")
a <- read_excel("aa.xlsx",
                col_names = FALSE,
                na = "n/a",
                col_types = data_types)
Warning message:
In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, :
[5, 2]: expecting numeric: got '999'
Reading in everything as text:
data_types <- c("text", "text")
dat <- read_excel("aa.xlsx",
                  col_names = FALSE,
                  na = "n/a",
                  col_types = data_types)
And using type.convert:
dat[] <- lapply(dat, type.convert)
works at least for this simple example.
*Edited:
There was a mistake in the code.
*Edit in response to comment:
Another toy example demonstrating how you could apply type.convert to your data:
# list of data frames
l <- list()
l[[1]] <- data.frame(matrix(rep(as.character(1:5), 2), ncol = 2), stringsAsFactors = FALSE)
l <- rep(l, 3)
# looping over your list to encode columns correctly:
for (i in seq_along(l)) {
  l[[i]][] <- lapply(l[[i]], type.convert)
}
There might be better solutions. But I think this should work.
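Applied to the loop from the question, a minimal sketch of the same idea (assuming a readxl version that recycles a single col_type across all columns; as.is = TRUE keeps text columns as character rather than turning them into factors):
library(readxl)

df <- list()
for (i in seq_along(xls_lst)) {
  # read every column as text first ...
  tmp <- read_excel(xls_lst[i],
                    skip = 4,
                    col_names = FALSE,
                    na = "n/a",
                    col_types = "text")
  # ... then let type.convert() decide between numeric and character
  df[[i]] <- as.data.frame(lapply(tmp, type.convert, as.is = TRUE),
                           stringsAsFactors = FALSE)
}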
Related
I have a group of .xls files containing data for different periods of the year. I would like to merge them so that I have all the data in one file. I tried the following code:
#create files list
setwd("~/2010")
file.list <- list.files( pattern = ".*\\.xls$", full.names = TRUE )
When I continue, I get some warnings but I don't think they are relevant. See below:
#read files
> l <- lapply( file.list, readxl::read_excel )
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, ... :
Expecting numeric in F1944 / R1944C6: got '-'
2: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, ... :
Expecting numeric in H1944 / R1944C8: got '-'
Then, I run the following line and the problems with the attributes pop up:
> dt <- data.table::rbindlist( l, use.names = TRUE, fill = TRUE )
Error in data.table::rbindlist(l, use.names = TRUE, fill = TRUE) :
Class attribute on column 15 of item 4 does not match with column 15 of item 1.
Can someone help me to fix this? Many thanks in advance
If you are going to bind together two datasets, the classes of the columns must match. Yours apparently do not. So you somehow need to address these mismatches.
Because you did not supply a col_types argument to readxl::read_excel, it is guessing the column types. I assume you expect the columns to have the same class in all of the data frames (otherwise, why bind them?), in which case you could pass a col_types argument so that readxl::read_excel doesn't have to guess.
The error messages here are useful: I think they are saying that a column was guessed to be numeric but then the parser encountered a "-". Maybe this led to the column being assigned class "character". Perhaps "-" appears in the raw data to indicate a missing value. Then passing na = c("", "-") to readxl::read_excel could resolve the issue.
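A minimal sketch of that suggestion, assuming your files really do share one layout (the col_types vector below is a hypothetical placeholder; substitute the classes your columns actually have):
# hypothetical column types -- adjust to match your files
my_types <- c("text", rep("numeric", 14))
l <- lapply(file.list, readxl::read_excel,
            col_types = my_types,
            na = c("", "-"))
dt <- data.table::rbindlist(l, use.names = TRUE, fill = TRUE)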
I am working on a basketball project. I am struggling to open my data in R:
https://www.basketball-reference.com/leagues/NBA_2019_totals.html
I have imported the data into Excel and then saved it as CSV (for Macintosh).
When I import the data into R, I get an error message:
"Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec, : invalid multibyte string at '<e7>lex<20>Abrines' "
The following seems to work. The readHTMLTable function does give warnings due to the presence of null characters in column Player.
library(XML)
uri <- "https://www.basketball-reference.com/leagues/NBA_2019_totals.html"
data <- readHTMLTable(readLines(uri), which = 1, header = TRUE)
i <- grep("Player", data$Player, ignore.case = TRUE)
data <- data[-i, ]
cols <- c(1, 4, 6:ncol(data))
data[cols] <- lapply(data[cols], function(x) as.numeric(as.character(x)))
Check if there are NA values. This is needed because the table in the link restarts the headers every now and then and character strings become mixed with numeric entries. The grep above is meant to detect such cases but maybe there are others.
sapply(data, function(x) sum(is.na(x)))
No, everything is alright. So write the data set as a CSV file.
write.csv(data, "nba.csv")
Setting fileEncoding to Latin-1 can help.
For example, to read a CSV file and drop its second row:
Test <- read.csv("IMDB.csv", header = TRUE, sep = ",", fileEncoding = "latin1")[-2, ]
I have a text file of names, separated by commas, and I want to read it into R (a data frame or vector is fine). I tried read.csv and it just reads the names in as headers of separate columns, with 0 rows of data. I tried header = FALSE and it reads them in as separate columns. I could work with this, but what I really want is one column with one row per name. For example, when I try to print the data frame, it prints all the column headers, which are useless, and then doesn't print the values. One column of names would be much easier to work with.
Since the OP asked me to, I'll post the comment above as an answer.
It's very simple, and it comes from some practice in reading in sequences of data, numeric or character, using scan.
dat <- scan(file = your_filename, what = 'character', sep = ',')
You can use read.csv to read the string in as headers, then just extract the names (using names) and put them into a data.frame:
data.frame(x = names(read.csv("FILE")))
For example:
write.table("qwerty,asdfg,zxcvb,poiuy,lkjhg,mnbvc",
"FILE", col.names = FALSE, row.names = FALSE, quote = FALSE)
data.frame(x = names(read.csv("FILE")))
x
1 qwerty
2 asdfg
3 zxcvb
4 poiuy
5 lkjhg
6 mnbvc
Something like that?
Make some test data:
# test data
list_of_names <- c("qwerty","asdfg","zxcvb","poiuy","lkjhg","mnbvc" )
list_of_names <- paste(list_of_names, collapse = ",")
list_of_names
# write to temp file
tf <- tempfile()
writeLines(list_of_names, tf)
You need this part:
# read from file
line_read <- readLines(tf)
line_read
list_of_names_new <- unlist(strsplit(line_read, ","))
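If you would rather have the single-column data frame described in the question than a character vector, one more line does it:
# one name per row, in a single column
df_names <- data.frame(x = list_of_names_new, stringsAsFactors = FALSE)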
I have an Excel file with several sheets, each with several columns, so I would rather not specify the type of each column separately but have it determined automatically. I want the columns read the way stringsAsFactors = FALSE does in read.csv, because that interprets the column types correctly. With my current method, a column containing values like "0.492 ± 0.6" is interpreted as numeric and returns NA, "because" the stringsAsFactors option is not available in read_excel. So here I wrote a workaround that works more or less well, but that I cannot use in real life, because I am not allowed to create a new file. Note: I need some columns as numeric or integer, and others that contain only text as character, just as stringsAsFactors does in my read.csv example.
library(readxl)
file= "myfile.xlsx"
firstread<-read_excel(file, sheet = "mysheet", col_names = TRUE, na = "", skip = 0)
#firstread has the problem of the a column with "0.492 ± 0.6",
#being interpreted as number (returns NA)
colna<-colnames(firstread)
# read every column as character
colnumt<-ncol(firstread)
textcol<-rep("text", colnumt)
secondreadchar<-read_excel(file, sheet = "mysheet", col_names = TRUE,
col_types = textcol, na = "", skip = 0)
# another column, with the number 0.532, is now 0.5319999999999999
# and several other similar cases.
# read again with stringsAsFactors
# critical step, in real life, I "cannot" write a csv file.
write.csv(secondreadchar, "allcharac.txt", row.names = FALSE)
stringsasfactor<-read.csv("allcharac.txt", stringsAsFactors = FALSE)
colnames(stringsasfactor)<-colna
# column with "0.492 ± 0.6" now is character, as desired, others numeric as desired as well
Here is a script that imports all the data in your Excel file. It puts each sheet's data in a list called dfs:
library(readxl)
# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")
# Loop through the sheet names and get the data in each sheet
dfs <- lapply(all_sheets, function(x) {
  # Get the number of columns in the current sheet
  col_num <- NCOL(read_excel(path = "myfile.xlsx", sheet = x))
  # Get the data frame with all columns read as text
  df <- read_excel(path = "myfile.xlsx", sheet = x, col_types = rep('text', col_num))
  # Convert from tibble to data.frame
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  # Find the numeric fields by trying to convert them into
  # numeric values. If the conversion returns NA, the field is
  # not a numeric field; otherwise it is.
  cond <- apply(df, 2, function(x) {
    x <- x[!is.na(x)]
    all(suppressWarnings(!is.na(as.numeric(x))))
  })
  numeric_cols <- names(df)[cond]
  df[, numeric_cols] <- sapply(df[, numeric_cols], as.numeric)
  # Return df in the desired format
  df
})
# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets
The process goes as follows:
First, you get all the sheets in the file with excel_sheets and then loop through the sheet names to create data frames. For each of these data frames, you initially import the data as text by setting the col_types parameter to "text". Once you have the data frame's columns as text, you can convert the structure from a tibble to a data.frame. After that, you find the columns that are actually numeric and convert them to numeric values.
Edit:
As of late April, a new version of readxl got released, and the read_excel function got two enhancements pertinent to this question. The first is that you can have the function guess the column types for you by providing the argument "guess" to the col_types parameter. The second enhancement (a corollary to the first) is that the guess_max parameter got added to the read_excel function. This new parameter lets you set the number of rows used for guessing the column types. Essentially, what I wrote above could be shortened with the following:
library(readxl)
# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")
dfs <- lapply(all_sheets, function(sheetname) {
  suppressWarnings(read_excel(path = "myfile.xlsx",
                              sheet = sheetname,
                              col_types = 'guess',
                              guess_max = Inf))
})
# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets
I would recommend that you update readxl to the latest version to shorten your script and as a result avoid possible annoyances.
I hope this helps.
I have several CSV files that contain numbers in the local German style, i.e. with a comma as the decimal separator and a period as the thousands separator, e.g. 10.380,45. The values in the CSV files are separated by ";". The files also contain columns of the classes character, Date, Date & Time, and logical.
The problem with the read.table functions is that you can specify the decimal separator with dec = ",", but NOT the thousands separator. (If I'm wrong, please correct me.)
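For example, a quick sketch of that limitation (a hypothetical one-value file, passed in via read.table's text argument just for illustration):
# read.csv2 already uses dec = ",", but there is no argument for the
# grouping mark, so a value like "10.380,45" cannot be parsed as numeric
x <- read.csv2(text = "col_num\n10.380,45", stringsAsFactors = FALSE)
str(x)  # col_num comes back as character, not numeric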
I know that preprocessing is a workaround, but I want to write my code in a way that others can use it without me.
I found a way to read the CSV file the way I want it with read.csv2, by setting my own classes, as can be seen in the following example.
Based on Most elegant way to load csv with point as thousands separator in R
# Create test example
df_test_write <- cbind.data.frame(
  c("a","b","c","d","e","f","g","h","i","j", rep("k", times = 200)),
  c("5.200,39","250,36","1.000.258,25","3,58","5,55","10.550,00","10.333,00","80,33","20.500.000,00","10,00", rep("3.133,33", times = 200)),
  c("25.03.2015","28.04.2015","03.05.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016", rep("08.08.2016", times = 200)),
  stringsAsFactors = FALSE)
colnames(df_test_write) <- c("col_text","col_num","col_date")
# write test csv
write.csv2(df_test_write, file = "Test.csv", quote = FALSE, row.names = FALSE)
#### read with read.csv2 ####
# First, define your own class
#define your own numeric class
setClass('myNum')
# define the conversion: strip the thousands dots, then turn the decimal comma into a dot
setAs("character", "myNum", function(from) as.numeric(gsub(",", "\\.", gsub("\\.", "", from))))
# own date class
library(lubridate)
setClass('myDate')
setAs("character","myDate",function(from) dmy(from))
# Read the csv file, in colClasses the columns class can be defined
df_test_readcsv <- read.csv2(paste0(getwd(), "/Test.csv"),
                             stringsAsFactors = FALSE,
                             colClasses = c(
                               col_text = "character",
                               col_num = "myNum",
                               col_date = "myDate"
                             ))
My problem now is that the different datasets have up to 200 columns and 350,000 rows. With the above solution I need between 40 and 60 seconds to load one CSV file, and I would like to speed this up.
Through my research I found fread() from the data.table package, which is really fast. It takes approximately 3 to 5 seconds to load the CSV file.
Unfortunately there is no way to specify the thousands separator there either. So I tried to use my colClasses solution, but it seems that you can't use custom classes with fread: https://github.com/Rdatatable/data.table/issues/491
See also my following test code:
##### read with fread ####
library(data.table)
# Test without colclasses
df_test_readfread1 <- fread(paste0(getwd(), "/Test.csv"),
                            stringsAsFactors = FALSE,
                            dec = ",",
                            sep = ";",
                            verbose = TRUE)
str(df_test_readfread1)
# PROBLEM: In my real dataset it turns the numbers into a numeric column;
# unfortunately it sees the "." as the decimal separator, so it turns
# e.g. 10.550 into 10.5
# Here it keeps everything as character
# Test with colclasses
df_test_readfread2 <- fread(paste0(getwd(), "/Test.csv"),
                            stringsAsFactors = FALSE,
                            colClasses = c(
                              col_text = "character",
                              col_num = "myNum",
                              col_date = "myDate"
                            ),
                            sep = ";",
                            verbose = TRUE)
str(df_test_readfread2)
# Keeps everything as character
So my question is: Is there a way to read CSV files with numeric values like 10.380,45 with fread?
(Alternatively: What is the fastest way to read a CSV with such numeric values?)
I have never used the readr package myself, but it's from Hadley Wickham, so it should be good stuff:
https://cran.r-project.org/web/packages/readr/readr.pdf
It is supposed to handle locales:
locale(date_names = "en", date_format = "%AD", time_format = "%AT",
decimal_mark = ".", grouping_mark = ",", tz = "UTC",
encoding = "UTF-8", asciify = FALSE)
decimal_mark and grouping_mark are what you're looking for.
EDIT from PhiSeu: Solution
Thanks to your suggestion, here are two solutions with read_csv2() from the readr package. For my 350,000-row CSV file it takes approximately 8 seconds, which is much faster than the read.csv2 solution.
(Another helpful package from Hadley and RStudio, thanks)
library(readr)
# solution 1 with specified columns
df_test_readr <- read_csv2(paste0(getwd(), "/Test.csv"),
                           locale = locale("de"),
                           col_names = TRUE,
                           col_types = cols(
                             col_text = col_character(),
                             col_num = col_number(),  # numbers are recognized automatically through locale("de")
                             col_date = col_date(format = "%d.%m.%Y")  # date specification
                           ))
# solution 2 with an overall definition of the date format
df_test_readr <- read_csv2(paste0(getwd(), "/Test.csv"),
                           locale = locale("de", date_format = "%d.%m.%Y"),  # date format for the whole file
                           col_names = TRUE)
Remove all the thousands separators (the periods) first, maybe.
filepath <- paste0(getwd(), "/Test.csv")
filestring <- readChar(filepath, file.info(filepath)$size)
# strip the periods; note this also removes the dots inside the dates
# (e.g. "25.03.2015" becomes "25032015")
filestring <- gsub('.', '', filestring, fixed = TRUE)
fread(filestring, dec = ",")  # dec = "," so the remaining commas parse as decimals