I'm using read_excel to import data into R. Sometimes a specific cell in the Excel file is empty. I want R to record this as NA rather than ignore the cell; currently, R doesn't import anything at all.
I've read through the R documentation on read_excel and found that, by default, read_excel treats blank cells as missing data. But instead of importing nothing, I'd like the actual information that these data are missing. I couldn't find anything on how to do that.
In the Excel file, cell A1 on sheet 1 does not contain any data.
x <- read_excel("file.xlsx", sheet = 1, range="A1", col_names = FALSE)
Expected result: x is 1 obs. of 1 variable, with value NA
Actual result: x is 0 obs. of 0 variables
A workaround that could work in the above example is to use:
x <- as.numeric(read_excel("file.xlsx", sheet = 1, range="A1", col_names = FALSE))
It is not very elegant: R coerces the 0x0 tibble into NA. (Note that it will also turn any strings into NA.)
I don't think it works for importing more than one cell.
If anybody finds a more elegant and more generally applicable solution, that would be helpful.
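One more general pattern is a small wrapper that tests for the 0x0 result and substitutes an explicit frame of NAs. This is only a sketch: `ensure_na_cells` is a made-up helper name, and the expected dimensions default to 1x1, so they would need to be passed in when reading a larger range.

```r
# Hypothetical helper: given the result of read_excel(), guarantee at
# least an n_rows x n_cols frame of NAs when the requested range was
# entirely blank and readxl returned a 0x0 tibble.
ensure_na_cells <- function(x, n_rows = 1, n_cols = 1) {
  if (nrow(x) == 0 || ncol(x) == 0) {
    x <- as.data.frame(matrix(NA, nrow = n_rows, ncol = n_cols))
  }
  x
}

# usage with the file from the question (path as in the example above):
# x <- ensure_na_cells(read_excel("file.xlsx", sheet = 1,
#                                 range = "A1", col_names = FALSE))
```

Unlike the as.numeric() trick, this does not destroy strings in non-empty cells, and it works for multi-cell ranges if you pass the expected dimensions.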
Related
So I have two Excel files, VIN1.xlsx and VIN2.xlsx, that I need to compare.
VIN1 has a date column OUTGATE_DT which is populated for at least one row.
VIN2 has a date column OUTGATE_DT which is completely null for all rows.
When I import VIN1.xlsx using read_excel, it creates the object, and when I check the OUTGATE_DT column its datatype is POSIXct[1:4] (which I assume is correct for a date column).
But when I import VIN2.xlsx using read_excel, the OUTGATE_DT column's datatype is logical[1:4] (because the column is entirely empty).
That is why my compare_df(vin1, vin2) call fails, stating:
Error in rbindlist(l, use.names, fill, idcol) :
Class attribute on column 80 of item 2 does not match with column 80 of item 1.
I am completely new to R; your help would be highly appreciated. TIA
You can call read_excel() with col_types = "text".
All your columns will then be read as text, so you won't have any issue comparing them.
Or, if you want to keep the column types in your original df, you can do something like this:
library(dplyr)
library(readxl)
VIN2 <- read_excel("VIN2.xlsx") %>%
mutate(OUTGATE_DT = as.Date(OUTGATE_DT))
Then you shouldn't have a problem using rbind or bind_rows from dplyr.
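If many columns can be affected, the same idea can be generalized. The sketch below uses a made-up helper name and assumes a reference data frame with the correct classes; it relies on the fact that assigning a class attribute to an all-NA logical vector is a safe coercion in base R.

```r
# For each shared column that read_excel guessed as logical because it
# was entirely empty, copy the class from the reference data frame.
fix_logical_na <- function(df, ref) {
  for (nm in intersect(names(df), names(ref))) {
    if (is.logical(df[[nm]]) && all(is.na(df[[nm]])) &&
        !is.logical(ref[[nm]])) {
      class(df[[nm]]) <- class(ref[[nm]])
    }
  }
  df
}
```

With the objects from the question this would be vin2 <- fix_logical_na(vin2, vin1) before calling compare_df().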
When importing data into R using readxl (I don't have the other, Java-based package, as it won't install), with col_names = FALSE the data import as characters instead of numbers, which means I can't actually run any functions on them. Similarly, when creating a table with writexl, setting col_names = FALSE also turns numbers into text. I can't seem to find a workaround for this (I can't find a function to create custom column names), so I tried converting the characters into numbers using:
databook[] <- lapply(databook, function(x) as.numeric(as.character(x)))
This worked; however, the row that had the specific column titles I wanted (which were letters only) disappeared. I had to import the table with col_names = FALSE because column 2 onwards all have the same title, so when R imports it, it renames the column titles, which I don't want.
I tried specifying a particular set of rows on which to run the changes:
databook[2:13,] <- lapply(databook, function(x) as.numeric(as.character(x)))
However, this no longer converted the characters (str(databook) returned chr results instead of num). Furthermore, a random empty row was added as row 2, giving me 14 rows when the original table had 13.
I'd be grateful for any help fixing this.
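A note on the workflow above: the usual pattern is to promote the header row to column names first, drop it, and only then convert, so the title row never goes through as.numeric(). A sketch with made-up data (`databook` stands in for the table imported with col_names = FALSE):

```r
# Stand-in for a table imported with col_names = FALSE: row 1 holds
# the intended column titles, rows 2+ hold the numbers as characters.
databook <- data.frame(
  X1 = c("Length", "1.2", "3.4"),
  X2 = c("Width",  "5.6", "7.8"),
  stringsAsFactors = FALSE
)

names(databook) <- as.character(unlist(databook[1, ]))  # promote row 1
databook <- databook[-1, , drop = FALSE]                # drop title row
databook[] <- lapply(databook, as.numeric)              # now safe
```

Duplicate titles survive this way because assigning via names<- does not de-duplicate them, which matters here since columns 2 onward share a title.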
I was just going through a tremendous headache caused by read_csv messing up my data by substituting content with NA while reading simple, clean CSV files.
I'm iterating over multiple large CSV files that add up to millions of observations. Some columns contain quite a few NAs for some variables.
When reading a CSV whose column contains only NA in the first 1000 + x observations, read_csv populates the entire column with NA, and thus the data is lost for further operations.
The warning "Warning: x parsing failures" is shown, but as I'm reading multiple files I cannot check this file by file. Even so, I would not know an automated fix for the parsing problem indicated by problems(x).
Using read.csv instead of read_csv avoids the problem, but it is slow and I run into encoding issues (handling different encodings requires too much memory for large files).
One way around this behaviour is to add a first row to the data that contains a value in every column, but then I'd still need to read the file first somehow.
See a simplified example below:
library(readr)  # for write_csv() and read_csv()

## create a data frame
df <- data.frame(id = numeric(), string = character(),
                 stringsAsFactors = FALSE)
## populate the columns
df[1:1500, 1] <- seq_len(1500)
df[1500, 2] <- "something"
# variable string gets its first value in obs. 1500
df[1500, ]
## check the number of NAs in variable string
sum(is.na(df$string)) # 1499
## write the df
write_csv(df, "df.csv")
## read it back with read_csv and read.csv
df_readr <- read_csv('df.csv')
df_read_standard <- read.csv('df.csv')
## check the number of NAs in variable string
sum(is.na(df_readr$string)) # 1500
sum(is.na(df_read_standard$string)) # 1499
## the read_csv result is all NA for variable string
problems(df_readr) ## What should that tell me? How do I fix it?
Thanks to MrFlick for answering my question in a comment:
The whole reason read_csv can be faster than read.csv is because it can make assumptions about your data. It looks at the first 1000 rows to guess the column types (via guess_max) but if there is no data in a column it can't guess what's in that column. Since you seem to know what's supposed to be in the columns, you should use the col_types= parameter to tell read_csv what to expect rather than making it guess. See the ?readr::cols help page to see how to tell read_csv what it needs to know.
Also, guess_max = Inf overcomes the problem, but then the speed advantage of read_csv seems to be lost.
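To make the fix concrete, here is the example above re-read with an explicit column specification. This is a sketch: it requires the readr package, and tempfile() stands in for the real path.

```r
library(readr)

# Rebuild the problem file: 'string' is empty for the first 1499 rows,
# so the default type guess (guess_max = 1000 rows) sees only NAs and
# parses the whole column as logical, losing the data.
df <- data.frame(id = seq_len(1500),
                 string = c(rep(NA_character_, 1499), "something"),
                 stringsAsFactors = FALSE)
path <- tempfile(fileext = ".csv")
write_csv(df, path)

# Fix: declare the types instead of letting read_csv guess them.
df_fixed <- read_csv(path, col_types = cols(id = col_double(),
                                            string = col_character()))
```

With the types declared, sum(is.na(df_fixed$string)) is 1499 again, no parsing failures are reported, and the speed of read_csv is retained because no extra guessing pass is needed.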
I use read_excel from the readxl package to read a file where 3 of the columns should be coerced to text, and for the rest I'm happy to let read_excel guess the type. Can I do this?
I tried col_types, setting the columns I want as "text" and the rest as "blank", but this results in the blank columns being skipped. I tried using NA instead of "blank", but that coerces all the columns to text, whereas read_excel would otherwise have read some as numeric.
My code was
col_filter <- rep('blank', 14)
col_filter[c(1, 3, 7)] <- 'text'
read_excel(file, sheet, col_types = col_filter)
14 is the number of columns in the Excel file; 1, 3, and 7 are the columns I want to read as text. Bonus if I can do this without knowing the number of columns in advance (short of reading the file once first just to count them).
You can use this:
col_filter <- readxl:::xlsx_col_types(path = file, n = 100, nskip = 1)
col_filter[c(1, 3, 7)] <- 'text'
read_excel(file, sheet, col_types = col_filter)
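A note on both snippets: indexing a vector as col_filter[1, 3, 7] errors with "incorrect number of dimensions"; c(1, 3, 7) is needed. Also, newer readxl (>= 1.0) accepts "guess" as a column type, which avoids the internal xlsx_col_types call entirely. A sketch (the read_excel call is commented out because file and sheet are placeholders from the question):

```r
# "guess" lets readxl infer most columns while pinning a few as text.
col_filter <- rep("guess", 14)      # 14 columns, as in the question
col_filter[c(1, 3, 7)] <- "text"    # note c(...), not [1, 3, 7]

# dat <- read_excel(file, sheet, col_types = col_filter)
# To avoid hard-coding 14, the column count can be taken from a
# zero-row read first: ncol(read_excel(file, sheet, n_max = 0))
```

This keeps the guessed types for the remaining columns instead of skipping them, which is exactly what "blank" failed to do.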
I have used R for various things over the past year but due to the number of packages and functions available, I am still sadly a beginner. I believe R would allow me to do what I want to do with minimal code, but I am struggling.
What I want to do:
I have roughly a hundred different excel files containing data on students. Each excel file represents a different school but contains the same variables. I need to:
Import the data into R from Excel
Add a variable to each file containing the filename
Merge all of the data (add observations/rows - do not need to match on variables)
I will need to do this for multiple sets of data, so I am trying to make this as simple and easy to replicate as possible.
What the Data Look Like:
Row 1 Title
Row 2 StudentID Var1 Var2 Var3 Var4 Var5
Row 3 11234 1 9/8/2011 343 159-167 32
Row 4 11235 2 9/16/2011 112 152-160 12
Row 5 11236 1 9/8/2011 325 164-171 44
Row 1 is meaningless and Row 2 contains the variable names. The files have different numbers of rows.
What I have so far:
At first I simply tried to import the data from Excel. Using the xlsx package, this works nicely:
dat <- read.xlsx2("FILENAME.xlsx", sheetIndex=1,
sheetName=NULL, startRow=2,
endRow=NULL, as.data.frame=TRUE,
header=TRUE)
Next, I focused on figuring out how to merge the files (also thought this is where I should add the filename variable to the datafiles). This is where I got stuck.
setwd("FILE_PATH_TO_EXCEL_DIRECTORY")
filenames <- list.files(pattern="\\.xlsx?$")
do.call("rbind", lapply(filenames, read.xlsx2, sheetIndex=1, colIndex=6, header=TRUE, startrow=2, FILENAMEVAR=filenames));
I set my directory, make a list of all the Excel file names in the folder, and then try to merge them in one statement, using a variable for the filenames.
When I do this I get the following error:
Error in data.frame(res, ...) :
arguments imply differing number of rows: 616, 1, 5
I know there is a problem with my application of lapply: startrow is not being recognized as an option, and FILENAMEVAR is trying to merge the list of 5 sample filenames instead of adding a column containing each filename.
What next?
If anyone can refer me to a useful resource or function, critique what I have so far, or point me in a new direction, it would be GREATLY appreciated!
I'll post my comment (with bdemerast picking up on the typo). The solution is untested, as xlsx will not run happily on my machine.
You need to pass a single FILENAMEVAR value (the current file name) to read.xlsx2 for each file:
lapply(filenames, function(x) read.xlsx2(file=x, sheetIndex=1, colIndex=6, header=TRUE, startRow=2, FILENAMEVAR=x))
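For illustration, here is the whole pipeline with the filename column added explicitly. This is a sketch: read_one() is a stand-in for the read.xlsx2 call (with made-up data and hypothetical file names) so the example runs without actual Excel files.

```r
# Stand-in for read.xlsx2(file = x, sheetIndex = 1, startRow = 2, ...):
# returns a small data frame and records the source file in a column.
read_one <- function(x) {
  d <- data.frame(StudentID = 11234:11236, Var3 = c(343, 112, 325))
  d$filename <- x
  d
}

filenames <- c("schoolA.xlsx", "schoolB.xlsx")  # hypothetical names
combined <- do.call(rbind, lapply(filenames, read_one))
```

Adding the column inside the per-file function avoids the "differing number of rows" error, because each filename is recycled only within its own data frame. Note that do.call(rbind, ...) requires every file to have the same columns; for files with differing columns, dplyr::bind_rows() pads the missing ones with NA.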