I'm importing a lot of datasets. All of them have some empty lines at the top (before the header), but it's not always the same number of rows that I need to skip.
Right now I'm using:
df2 <- read_delim("filename.xls",
"\t", escape_double = FALSE,
guess_max=10000,
locale = locale(encoding = "ISO-8859-1"),
na = "empty", trim_ws = TRUE, skip = 9)
But sometimes I only need to skip 3 lines, for example.
Can I somehow set up a rule that skips a line whenever my column B (in Excel) starts with one of the following words:
Datastatistik
Overførte records
FI-CA
Oprettet
Column A is always empty, but I delete it in code after the import.
This is an example of my data (I have hidden personal numbers):
My first variable header is called "Bilagsnummer" or "Bilagsnr.".
I don't know if it's possible to set up a rule that says something like "the first occurrence of this word is my header"? Really I'm just brainstorming here, because I have no idea how to automate this data import.
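For what it's worth, one way to express such a rule is a two-step import: read the raw lines first, find the first line that contains "Bilagsnummer" or "Bilagsnr.", and use its position as the skip value. A minimal sketch, assuming the file really is tab-delimited text despite the .xls extension:
library(readr)
raw <- readLines("filename.xls")                      # raw lines, just to locate the header
hdr <- grep("Bilagsnummer|Bilagsnr\\.", raw)[1]       # first line that looks like the header
df2 <- read_delim("filename.xls", "\t",
                  escape_double = FALSE, guess_max = 10000,
                  locale = locale(encoding = "ISO-8859-1"),
                  na = "empty", trim_ws = TRUE,
                  skip = hdr - 1)                      # skip everything above the header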
---EDIT---
I looked at the post @Bram linked to, and it solved part of my problem.
I changed some of it.
This is the code I used:
temp <- readLines("file.xls")
skipline <- which(grepl("\tDatastatistik", temp) |
grepl("\tOverførte", temp) |
grepl("FI-CA", temp) |
grepl("Oprettet", temp) |
temp == "")
So the skipline integer vector that I made contains the line numbers that need to be skipped. These are matched correctly by the grepl calls (since the wording at the end of the sentence changes from time to time).
Now, I still have a problem though.
When I use skip = skipline in my read.delim, it only works for the first row.
I get the warning message:
In if (skip > 0L) readLines(file, skip) :
the condition has length > 1 and only the first element will be used
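The skip argument expects a single number, not a vector of line numbers, which is what that warning is about. A hedged workaround sketch, building on the temp and skipline objects above: drop the unwanted lines from the text already read and parse the remainder through the text argument of read.table.
clean <- temp[-skipline]                              # drop the lines flagged by grepl / the empty lines
df2 <- read.table(text = paste(clean, collapse = "\n"),
                  header = TRUE, sep = "\t",
                  na.strings = "empty", strip.white = TRUE,
                  stringsAsFactors = FALSE)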
I may have found a solution, though not the optimal one. Let's see.
Import your df with the empty lines:
df2 <- read_delim("filename.xls",
"\t", escape_double = FALSE,
guess_max=10000,
locale = locale(encoding = "ISO-8859-1"),
na = "empty", trim_ws = TRUE)
Find the number of empty rows at the beginning:
NonNAindex <- which(!is.na(df2[,2]))
lastEmpty <- (min(NonNAindex)-1)
Re-import your document using that info:
df2 <- read_delim("filename.xls",
"\t", escape_double = FALSE,
guess_max=10000,
locale = locale(encoding = "ISO-8859-1"),
na = "empty", trim_ws = TRUE, skip = lastEmpty)
I have found a few people who posted similar issues, but I still can't solve my problem. The expected number of rows in the data frame was 957463, but only 392400 were extracted.
I used read.delim2("test.csv", header = TRUE, sep = ",", quote = "\"", fill = TRUE) to create the data frame, but the output had fewer records than expected.
#set working directory --------------------------------
L <- setwd("C:/Users/abmo8004/Documents/R Project/csv/")
#List files in the path ------------------------
l <- list.files(L)
#form dataframe from csv file ---------------------------
df <- read.delim2("test.csv", header = TRUE, sep = ",", quote = "\"", fill = TRUE)
I expect the output to be 957463 rows, but the actual output is 392400. Can anyone please look at the code?
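A common cause of such a shortfall is stray quote characters (or embedded newlines) that make several physical lines collapse into one record. A diagnostic sketch, using the file name from the question; the quote = "" re-read at the end is only a guess at the fix:
raw <- readLines("test.csv")
length(raw)                                                            # physical lines in the file
nrow(read.delim2("test.csv", sep = ",", quote = "\"", fill = TRUE))    # parsed rows

# field counts per line with quoting disabled; unusual counts point at problem lines
fields <- count.fields("test.csv", sep = ",", quote = "")
table(fields)

# if unbalanced quotes are merging rows, try reading with quoting turned off
df <- read.delim2("test.csv", header = TRUE, sep = ",", quote = "", fill = TRUE)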
How could I import a file:
starting with an undefined number of comment lines
followed by a line with headers, some of them containing the comment character which is used to identify the comment lines above?
For example, with a file like this:
# comment 1
# ...
# comment X
c01,c#02,c03,c04
1,2,3,4
5,6,7,8
Then:
myDF = read.table(myfile, sep=',', header=T)
Error in read.table(myfile, sep = ",", header = T) : more columns
than column names
The obvious problem is that # is used as the comment character to announce comment lines, but it also appears in the headers (which, admittedly, is bad practice, but I have no control over this).
The number of comment lines being unknown a priori, I can't even use the skip argument. Also, I don't know the column names (not even their number) before importing, so I'd really need to read them from the file.
Any solution beyond manually manipulating the file?
It may be easy enough to count the number of lines that start with a comment, and then skip them.
csvfile <- "# comment 1
# ...
# comment X
c01,c#02,c03,c04
1,2,3,4
5,6,7,8"
# return a logical for whether the line starts with a comment.
# remove everything from the first FALSE and afterward
# take the sum of what's left
start_comment <- grepl("^#", readLines(textConnection(csvfile)))
start_comment <- sum(head(start_comment, which(!start_comment)[1] - 1))
# skip the lines that start with the comment character
Data <- read.csv(textConnection(csvfile),
skip = start_comment,
stringsAsFactors = FALSE)
Note that this works with read.csv because read.csv defaults to comment.char = "". If you must use read.table, or must have comment.char = "#", you may need a couple more steps.
start_comment <- grepl("^#", readLines(textConnection(csvfile)))
start_comment <- sum(head(start_comment, which(!start_comment)[1] - 1))
# Get the headers by themselves.
Head <- read.table(textConnection(csvfile),
skip = start_comment,
header = FALSE,
sep = ",",
comment.char = "",
nrows = 1)
Data <- read.table(textConnection(csvfile),
sep = ",",
header = FALSE,
skip = start_comment + 1,
stringsAsFactors = FALSE)
# apply column names to Data
names(Data) <- unlist(Head)
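With a real file on disk instead of the inline textConnection example, the same pattern applies; only the source changes (the file name here is hypothetical):
raw <- readLines("myfile.csv")
start_comment <- grepl("^#", raw)
start_comment <- sum(head(start_comment, which(!start_comment)[1] - 1))
Data <- read.csv("myfile.csv", skip = start_comment, stringsAsFactors = FALSE)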
I have a very large data set that for illustrative purposes looks something like the following.
Cust_ID , Sales_Assistant , Store
123 , Mary, Worthington, 22
456 , Jack, Charles , 42
The real data has many more columns and millions of rows. I'm using the following code to import it into R but it is falling over because one or more of the columns has a comma in the data (see Sales_Assistant above).
df <- read.csv("C:/dataextract.csv", header = TRUE , as.is = TRUE , sep = "," , na.strings = "NA" , quote = "" , fill = TRUE , dec = "." , allowEscapes = FALSE , row.names=NULL)
Adding row.names=NULL imported all the data but it split the Sales_Assistant column over two columns and threw all the other data out of alignment. If I run the code without this I get an error...
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
...and the data won't load.
Can you think of a way around this that doesn't involve tackling the data at source, or opening it in a text editor? Is there a solution in R?
First and foremost, it is a CSV file, so "Mary, Worthington" corresponds to two columns. If you have commas in your values, consider saving the data as TSV (tab-separated values) instead.
However, if your data has the same number of commas per row, with good alignment in some sense, I would consider ignoring the first row of the data frame (which holds the column names as the file is read) and reassigning proper column names.
For instance, in your case you can replace Sales_Assistant by
Sales_Assistant_First_Name, Sales_Assistant_Last_Name
which makes perfect sense. Then I could basically do
df <- df[-1, ]
colnames(df) <- c("Cust_ID" , "Sales_Assistant_First_Name" , "Sales_Assistant_Last_Name", "Store")
df <- read.csv("C:/dataextract.csv", skip = 1, header = FALSE)
df_cnames <- read.csv("C:/dataextract.csv", nrow = 1, header = FALSE)
df <- within(df, V2V3 <- paste(V2, V3, sep = ''))
df <- subset(df, select = (c("V1", "V2V3", "V4")))
colnames(df) <- df_cnames
It may need some modification depending on the actual source.
I'm new to R and am seeking some digestible guidance. I wish to create a data.frame so I can create columns and establish variables in my data. I start by importing the URL into R and saving to Excel:
data <- read.delim("http://weather.uwyo.edu/cgi-bin/wyowx.fcgi?TYPE=sflist&DATE=20170303&HOUR=current&UNITS=M&STATION=wmkd",
                   fill = TRUE, header = TRUE, sep = "\t", stringsAsFactors = TRUE,
                   na.strings = " ", strip.white = TRUE, nrows = 27, skip = 9)
write.xlsx(data, "E:/Self Tutorial/R/data.xls")
This data has missing values somewhere in the middle of its elements, which makes the lengths irregular. Because of the irregular lengths I use write.table instead of data.frame.
On my first attempt, the object shows up in the global environment under Values (as NULL), not under Data:
dat.table = write.table(data)
str(dat.table) # just checking #result NULL?
Try again:
dat.table = write.table(data,"E:/Self Tutorial/R/data.xls", sep = "\t", quote = FALSE)
dat.table ##print nothing
Remove sep =:
dat.table = write.table(data, "E:/Self Tutorial/R/data.xls", quote = FALSE)
dat.table ## prints nothing
Since it's not working, I try read.table:
dat.read <- read.table("E:/Self Tutorial/R/data.xls", header = T, sep = "\t")
The data loads in the R console, but, as expected, with an irregular column distribution (even though I already use na.strings = " " and strip.white = TRUE in the read.delim arguments).
What should I understand from this mistake, and what is it? Thank you.
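For what it's worth, a minimal sketch of the distinction that seems to cause the confusion (paths taken from the question; row.names = FALSE is my addition): write.table() only writes a file to disk and returns NULL, which is why str(dat.table) shows NULL; to get a data.frame into the environment you read the file back in.
# writing: nothing useful comes back to assign, only the file on disk matters
write.table(data, "E:/Self Tutorial/R/data.xls", sep = "\t", quote = FALSE, row.names = FALSE)

# reading: this is what creates a data.frame in the global environment
dat <- read.delim("E:/Self Tutorial/R/data.xls", header = TRUE, sep = "\t",
                  na.strings = " ", strip.white = TRUE, stringsAsFactors = FALSE)
str(dat)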
I have a large number of files, each in tab-delimited format. I need to apply some modeling (glm/gbm, etc.) to each of these files. They are obtained from hospital data, where in exceptional cases entries may not be in the proper format. For example, when entering the glucose level for a patient, the data entry operator may enter N or A by mistake instead of an actual number.
While reading these files in a loop, I am encountering a problem: such columns (glucose) are treated as a factor when they should be numeric. It is painful to investigate each file and look for errors. I am reading the files in the following way, but it is certainly not a good approach.
read.table(fn, header = TRUE, sep= "\t" , na.strings = c('', 'NEG', 'TR', 'NA', '<NA>', "Done", "D", "A"))
Is there any other function through which I can treat those values/outliers as NA?
You can inspect which elements are not numbers (for the glucose case); the sapply call below returns TRUE for entries that parse as numeric, so the FALSE positions are the problem values:
data = read.csv(file, as.is = TRUE, sep = '\t') # don't convert strings to factors
glucose = data$glucose
sapply(glucose, function(x)!is.na(as.numeric(x)), USE.NAMES = FALSE)
Then you can work with these indexes (interpolate or remove).
To loop the files:
files = list.files(path, pattern = "\\.csv$")  # regex pattern, not a glob
for (file in files)
{
data = read.csv(file, sep = '\t', as.is = TRUE)
gluc = data$glucose
idxs = sapply(gluc, function(x)!is.na(as.numeric(x)), USE.NAMES = FALSE)
# interpolate or remove here
}
Use the colClasses argument to read.table and friends to specify which columns should be numeric, so R does not need to guess. Note, though, that if a column is declared "numeric", any entry that is neither a number nor listed in na.strings stops the import with an error rather than being silently converted, so either list the stray codes in na.strings or read the column as character and convert it with as.numeric(), which turns anything non-numeric into NA.
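A minimal sketch of both variants (the file name, column name, and the exact na.strings codes are assumptions taken loosely from the question):
# variant 1: declare glucose numeric and list the codes that should count as missing
dat <- read.table("patient01.txt", header = TRUE, sep = "\t",
                  colClasses = c(glucose = "numeric"),
                  na.strings = c("", "NA", "N", "A", "NEG", "TR", "Done", "D"))

# variant 2: read everything as character, then coerce; non-numbers become NA (with a warning)
dat <- read.table("patient01.txt", header = TRUE, sep = "\t", colClasses = "character")
dat$glucose <- as.numeric(dat$glucose)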