R - Read specific columns from XLSX

This seems like a silly question, but I really could not find a solution! I need to read only specific columns from an Excel file. The file has multiple sheets with different numbers of columns, but the columns I need will always be there. I can do this for csv files, but not for Excel! This is my present code, which reads the first 14 columns (but the columns I need might not always be within the first 14). I can't just read everything in, as rbind will then throw a mismatch error (the sheets have different numbers of columns).
EDIT: I worked around this by omitting the col_types parameter; it worked because the sheets with a different number of columns contained only column headers and no data rows. Still, this is by no means a robust solution, so I hope someone can do a better job than me.
INV <- lapply(sheets, function(X) read_excel("./Inventory.xlsx", sheet = X, col_types = c(rep("text", 14))))
names(INV) <- sheets
INV <- do.call("rbind", INV)
I am trying to do something like this:
INV <- lapply(FILES[grepl("Inventory", FILES)],
              function(n) read_csv(file = paste0(n),
                                    col_types = cols_only(DIVISION = "c",
                                                          DEPARTMENT = "i",
                                                          ITEM_ID = "c",
                                                          DESCRIPTION = "c",
                                                          UNIT_QTY = "i",
                                                          COMP_UNIT_QTY = "i",
                                                          REGION = "c",
                                                          LOCATION_TYPE = "c",
                                                          ZONE = "c",
                                                          LOCATION_ID = "c",
                                                          ATS_IND = "c",
                                                          CONTAINER_ID = "c",
                                                          STATUS = "c",
                                                          TROUBLE_CODES = "c")))
But I need this for an Excel file. I tried read.xlsx from openxlsx and read_excel from readxl, but neither supports doing this. There must be some other way. Don't worry about column types; I am fine with reading everything as character.
I would very much appreciate it if this could be done with readxl or openxlsx.
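One way to do this with readxl is to read only the header row of each sheet first, and then re-read the sheet with "skip" for every column you don't want. A minimal sketch, assuming readxl >= 1.0 (for the n_max argument and the "skip" column type) and the column names from the read_csv example above:
library(readxl)

# Columns we actually want, regardless of where they sit in each sheet
wanted <- c("DIVISION", "DEPARTMENT", "ITEM_ID", "DESCRIPTION", "UNIT_QTY",
            "COMP_UNIT_QTY", "REGION", "LOCATION_TYPE", "ZONE", "LOCATION_ID",
            "ATS_IND", "CONTAINER_ID", "STATUS", "TROUBLE_CODES")

path   <- "./Inventory.xlsx"
sheets <- excel_sheets(path)

INV <- lapply(sheets, function(s) {
  # Read only the header row to learn this sheet's column layout
  hdr <- names(read_excel(path, sheet = s, n_max = 0))
  # Build a col_types vector: "text" for wanted columns, "skip" for the rest
  types <- ifelse(hdr %in% wanted, "text", "skip")
  read_excel(path, sheet = s, col_types = types)
})
names(INV) <- sheets
INV <- do.call("rbind", INV)
Because the unwanted columns are skipped at read time, every element of the list ends up with the same set of columns and rbind no longer complains.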

Related

How do I stop R from using the first row of data as the column name?

I'm extremely new to R and I keep running into an issue with my current data set. The data set consists of several txt files, each with 30 rows and 3 columns of numerical data. However, when I try to work with them in R, the first row of data automatically becomes the column heading, so when I try to combine the files everything gets messed up, as none of them have the same column titles. How do I stop this from happening? The code I've used so far is below!
setwd("U:\\filepath")
library(readr)
library(dplyr)
file.list <- list.files(pattern='*.txt')
df.list <- lapply(file.list, read_tsv)
After this point it just says that there are 29 rows and 1 column, which is not what I want! Any help is appreciated!
You say:
After this point it just says that there are 29 rows and 1 column, which is not what I want!
What that is telling you is that you don't have a tab-separated file. There is no way to tell from that output which delimiter the file actually uses, but it's not a tab: you can tell by the number of columns, and since you got only one column, read_tsv didn't find any tabs. Then there is the separate issue that your column names are all different, which may well mean that your files do not have a header line at all. If you want to see what is actually in your files, you could do something like:
df.list <- lapply(file.list, function(x) readLines(x)[1])
df.list[[1]]
If there are tabs, they will show up as \t in the printed strings.
Generally it is better to determine which delimiter a file uses by looking at it in a text editor (but not MS Word).
Use df.list <- lapply(file.list, read_tsv, col_names = FALSE) so that read_tsv does not treat the first row as column names.
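Putting both points together, a minimal sketch (assuming the files turn out to be whitespace-delimited numeric data with no header row; swap in read_delim with whatever delimiter you actually see when you inspect the first lines):
library(readr)
library(dplyr)

file.list <- list.files(pattern = "\\.txt$")

# Peek at the first line of each file to see what the delimiter really is
sapply(file.list, function(x) readLines(x, n = 1))

# col_names = FALSE keeps the first row as data; readr names the columns X1..X3,
# so every file gets identical column names
df.list <- lapply(setNames(file.list, file.list), read_table2, col_names = FALSE)

# Identical column names make it safe to stack the files
combined <- bind_rows(df.list, .id = "source_file")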

Reading zipped folder containing non-traditional spreadsheet

I'm trying to read a zipped folder called etfreit.zip, linked under "Purchases from April 2016 onward".
Inside the zipped folder is a file called 2016.xls, which is difficult to read as it contains empty rows along with Japanese text.
I have tried various ways of reading the xls from R, but I keep getting errors. This is the code I tried:
download.file("http://www3.boj.or.jp/market/jp/etfreit.zip", destfile="etfreit.zip")
unzip("etfreit.zip")
data <- read.csv(text=readLines("2016.xls")[-(1:10)])
I'm trying to skip the first 10 rows as I simply wish to read the data in the xls file. The code works only to the extent that it runs, but the data looks truly bizarre.
Would greatly appreciate any help on reading the spreadsheet properly in R for purposes of performing analysis.
There is more than one bizarre thing going on here, I think, but I had some success with the (somewhat older) gdata package:
data = gdata::read.xls("2016.xls")
By the way, treating an xls file as csv seldom works. Actually, it shouldn't work at all :) Find the proper import function for your type of data and use it; don't assume that read.csv will take care of anything other than csv (properly).
As per your comment: I'm not sure what you mean by "not properly aligned", but here is some code that cleans the data a bit and gives you numeric variables instead of factors (note I'm using tidyr for that):
data2 = data[-c(1:7), -c(1, 6)]
names(data2) = c("date", "var1", "var2", "var3")
data2[, c(2:4)] = sapply(data2[, c(2:4)], tidyr::extract_numeric)
# Optionally convert the column with factor dates to Posixct
data2$date = as.POSIXct(data2$date)
Also, note that I am removing only the top 7 rows; this seems to be the portion of the data that contains the Japanese header.
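As an alternative sketch, readxl can usually read an .xls file directly and skip the header block in one go (the skip = 8 here is only an assumption based on the rows removed above; adjust it after inspecting the file):
library(readxl)

download.file("http://www3.boj.or.jp/market/jp/etfreit.zip",
              destfile = "etfreit.zip", mode = "wb")  # binary mode matters on Windows
unzip("etfreit.zip")

# Skip the Japanese header block and let readxl guess the column types
data <- read_excel("2016.xls", skip = 8)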
"Odd" unusual excel tables cab be read with the jailbreakr package. It is still in development, but looks pretty ace:
https://github.com/rsheets/jailbreakr

In R, can I get only the column names of a csv (txt) file?

I have many big files, but I would like to get only the names of the columns without loading them.
Using the data.table package, I can do
df1 <- fread("file.txt")
names1 <- names(df1)
But doing this for all of the files is very expensive. Is there some other option?
Many functions to read in data have optional arguments that allow you to specify how many lines you'd like to read in. For example, the read.table function would allow you to do:
df1 <- read.table("file.txt", nrows=1, header=TRUE)
colnames(df1)
I'd bet that fread() has this option too.
(Note that you may even be able to get away with nrows=0, but I haven't checked to see if that works)
EDIT
As a commenter kindly points out, fread() and read.table() work a little differently.
For fread(), you'll want to supply the argument nrows=0:
df1 <- fread("file.txt", nrows=0) ##works
As per the documentation,
nrows=0 is a special case that just returns the column names and types; e.g., a dry run for a large file or to quickly check format consistency of a set of files before starting to read any.
But nrows=0 is one of the ignored cases when supplied to read.table():
df1 <- read.table("file.txt")            ## reads entire file
df1 <- read.table("file.txt", nrows=-1)  ## reads entire file
df1 <- read.table("file.txt", nrows=0)   ## reads entire file

read_excel all columns text [duplicate]

This question already has answers here:
Specifying Column Types when Importing xlsx Data to R with Package readxl
(6 answers)
Closed 2 years ago.
I have an Excel file with all columns of type "text". When read_excel is called, however, some of the columns are guessed to be "dbl" instead. I know I can use col_types to specify the columns, but that requires knowing how many columns there are in my file.
Is there a way I can detect the number of columns? Or, alternatively, specify that the columns are all "text"? Something like
read_excel("file.xlsx", col_types = "text")
which, quite reasonably, gives an error that I haven't specified the type for all the columns.
Currently, I can solve this by reading in the file twice:
read_excel_one_type <- function(filename, col_type = "text") {
  temp <- read_excel(path = filename)
  ncol.temp <- ncol(temp)
  read_excel(path = filename, col_types = rep(col_type, ncol.temp))
}
but a method that doesn't require reading the file twice would be better.
This answer seems to be helpful: https://stackoverflow.com/a/34015430/5220858. I have found that the Excel file needs to be formatted correctly from the start for R to automatically detect the correct data type (i.e. numeric, date, text). I think the post, though, is more relevant to your question. The poster shows a bit of code similar to what you have provided, except that only one line of data is read to determine the number of columns, and then the rest of the file is read into R based on that first line.
library("xlsx")
file<-"myfile.xlsx"
sheetIndex<-1
mydf<-read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL,
startRow=NULL, endRow=NULL, colIndex=NULL,
as.data.frame=TRUE, header=TRUE, colClasses=NA,
keepFormulas=FALSE, encoding="unknown")
works for me
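If reading the file twice in full is the concern, a lighter sketch is to read only the header row to count the columns (this assumes a readxl version that supports the n_max argument; newer readxl versions also recycle a single col_types value, so read_excel("file.xlsx", col_types = "text") may simply work there):
library(readxl)

# n_max = 0 reads no data rows but still returns the header, so columns can be counted cheaply
hdr <- read_excel("file.xlsx", n_max = 0)

# Re-read the file, forcing every column to text
dat <- read_excel("file.xlsx", col_types = rep("text", ncol(hdr)))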

merge multiple files with different rows in R

I know that this question has been asked previously, but answers to the previous posts cannot seem to solve my problem.
I have dozens of tab-delimited .txt files. Each file has two columns ("pos", "score"). I would like to compile all of the "score" columns into one file with multiple columns. The number of rows in each file varies, and the differences are irrelevant for the compilation.
If someone could direct me on how to accomplish this, preferably in R, it would be a big help.
Alternatively, my ultimate goal is to get the median and mean of the "score" column from each file. So if that could be accomplished, with or without compiling the files, it would be even more helpful.
Thanks.
UPDATE:
As appealing as the idea of personal code ninjas is, I understand this will have to remain a fantasy. Sorry for not being explicit.
I have tried lapply and Reduce, e.g.,
> files <- dir(pattern="X.*\\.txt$")
> File_list <- lapply(filesToProcess,function(score)
+ read.table(score,header=TRUE,row.names=1))
> File_list <- lapply(files,function(z) z[c("pos","score")])
> out_file <- Reduce(function(x,y) {merge(x,y,by=c("pos"))},File_list)
which I know doesn't really make sense, considering I have variable row numbers. I have also tried plyr
> files <- list.files()
> out_list <- llply(files,read.table)
As well as cbind and rbind. Usually I get an error message because the row numbers don't match up, or all the "score" data ends up compiled into one column.
The advice on similar posts (e.g., Merging multiple csv files in R, Simultaneously merge multiple data.frames in a list, and Merge multiple files in a list with different number of rows) has not been helpful.
I hope this clears things up.
This problem could be solved in two steps:
Step 1. Read the data from your files into a list of data frames, where files is a vector of file names. If you need to add extra arguments to read.csv, add them as shown below. See ?lapply for details.
list_of_dataframes <- lapply(files, read.csv, stringsAsFactors = FALSE)
Step 2. Calculate means for each data frame:
means <- sapply(list_of_dataframes, function(df) mean(df$score))
Of course, you can always do it in one step like this:
means <- sapply(files, function(filename) mean(read.csv(filename)$score))
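Since you also want medians, the same pattern can collect both statistics per file (still assuming, as above, that read.csv parses your files; for tab-delimited files you may want read.delim instead):
stats <- t(sapply(files, function(filename) {
  score <- read.csv(filename)$score
  c(mean = mean(score), median = median(score))
}))
stats  # one row per file, with columns "mean" and "median"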
I think you want something like this:
all_data = do.call(rbind, lapply(files, function(f) {
  cbind(read.csv(f), file_name = f)
}))
You can then do whatever "by" type of action you like. Also, don't forget to adjust the various read.csv options to suit your needs.
E.g. once you have the above, you can do the following (and much more):
library(data.table)
dt = data.table(all_data)
dt[, list(mean(score), median(score)), by = file_name]
A small note: you could also use data.table's fread to read the files in, instead of read.table and its derivatives, which would be much faster; and while we're at it, use rbindlist instead of do.call(rbind, ...).
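A minimal sketch of that fread/rbindlist variant (assuming the same tab-delimited files with pos and score columns):
library(data.table)

# fread auto-detects the tab delimiter; idcol records which file each row came from
dt <- rbindlist(lapply(setNames(files, files), fread), idcol = "file_name")

dt[, .(mean = mean(score), median = median(score)), by = file_name]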
