Read csv from specific row in R

I have daily data starting from 1980 in a csv file, but I want to read the data only from 1985 onwards, because the other dataset in another file starts in 1985. How can I skip reading the pre-1985 data in R?

I think you want to take a look at ?read.csv to see all the options.
It's a bit hard to give an exact answer without seeing a sample of your data.
If your data doesn't have a header and you know which line the 1985 data starts on, you can just use something like...
importdata <- read.csv(file, skip = 1825)
...to skip the first 1825 lines.
Otherwise you can always just subset the data after you've imported it if you have a year variable in your data.
impordata <- read.csv("skiplines.csv")
impordata <- subset(impordata,year>=1985)
If you don't know where the 1985 data starts, you can use grep to find the first instance of 1985 in your file's date variable and then keep everything from that line onwards:
impordata <- read.csv("skiplines.csv")
impordata <- impordata[min(grep(1985,impordata$date)):nrow(impordata),]
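Note that grep(1985, ...) matches "1985" anywhere in the date string, not just in the year. A slightly more defensive sketch, assuming the dates are in a format as.Date understands (e.g. "1985-01-01"):
importdata <- read.csv("skiplines.csv")
importdata$date <- as.Date(importdata$date)  # assumes ISO-style dates
importdata <- importdata[importdata$date >= as.Date("1985-01-01"), ]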

Here are a few alternatives. (You may wish to convert the first column to "Date" class afterwards and possibly convert the entire thing to a zoo object or other time series class object.)
# create test data
fn <- tempfile()
dd <- seq(as.Date("1980-01-01"), as.Date("1989-12-31"), by = "day")
DF <- data.frame(Date = dd, Value = seq_along(dd))
write.table(DF, file = fn, row.names = FALSE)
read.table + subset
# if file is small enough to fit in memory try this:
DF2 <- read.table(fn, header = TRUE, as.is = TRUE)
DF2 <- subset(DF2, Date >= "1985-01-01")
read.zoo
# or this which produces a zoo object and also automatically converts the
# Date column to Date class. Note that all columns other than the Date column
# should be numeric for it to be representable as a zoo object.
library(zoo)
z <- read.zoo(fn, header = TRUE)
zw <- window(z, start = "1985-01-01")
If your data is not in the same format as the example you will need to use additional arguments to read.zoo.
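For instance, a minimal sketch for a hypothetical file whose first column holds dates like "15/01/1985" (the format string here is an assumption, not part of the example above):
# hypothetical: non-ISO dates such as "15/01/1985" need an explicit format
z <- read.zoo(fn, header = TRUE, format = "%d/%m/%Y")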
multiple read.table's
# if the data is very large read 1st row (DF.row1) and 1st column (DF.Date)
# and use those to set col.names= and skip=
DF.row1 <- read.table(fn, header = TRUE, nrow = 1)
nc <- ncol(DF.row1)
DF.Date <- read.table(fn, header = TRUE, as.is = TRUE,
colClasses = c(NA, rep("NULL", nc - 1)))
n1985 <- which.max(DF.Date$Date >= "1985-01-01")
DF3 <- read.table(fn, col.names = names(DF.row1), skip = n1985, as.is = TRUE)
sqldf
# this is probably the easiest if the data set is large
library(sqldf)
DF4 <- read.csv.sql(fn, sql = 'select * from file where Date >= "1985-01-01"',
  sep = " ")  # sep = " " matches the space-separated test file written above

A data.table approach, which offers both speed and memory efficiency:
library(data.table)
fread(file, skip = 1825)
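If the row offset isn't known in advance, a sketch of the read-then-filter route (assuming, as in the earlier examples, a date column that as.IDate can parse; the file name is the one used above):
library(data.table)
DT <- fread("skiplines.csv")                 # fast read of the whole file
DT[, date := as.IDate(date)]                 # assumes a parseable date column
DT <- DT[date >= as.IDate("1985-01-01")]     # keep 1985 onwards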

Related

searching multiple txt files for data and reporting results to a new table

I have thousands of txt files containing Mass, %Base data. I need to search each file for rows within a specific mass range and report each matching row to a new table with the filename as an additional column. The goal is a table of (Mass, %Base, Filename) across all of the text files, based on the search condition.
Existing File example for file1name.txt:
Mass %Base
100 .1
101 26.2
...
900 0
Goal:
Mass %Base File
375.004 98 file1name
375.003 96 file2name
My current code is:
library(tidyverse)
library(readr)
#setwd to where data is located
setwd("Z:/Dnigra")
#set path where data is located
path <- "Z:/Dnigra"
mc <- 375.3 #mc is the calculated target mass
limit<- 0.1 # the width of the search window
#finds the files with the correct extensions
fs <-list.files(path, pattern=glob2rx("*.txt$"))
for (f in fs){
fname <- file.path(path, f)
df <- read_tsv(fname,col_names=FALSE, skip =1)
#filters the data that includes the target mass
df <- between(mc,limit,limit)
#create new data based on contents
allSpectra <- data.frame(df,f)
#write new data to sep file
write.table(allSpectra ,"allwobble.csv",
append= T,
sep=",",
row = F
)
}
The end result is a table with:
df f
FALSE filename
Also errors:
Parsed with column specification: cols( X1 = col_character(), X2 = col_character() ) Warning: 2536 parsing failures.
I think there may be a few things here to address:
First, with read_tsv you might want to specify the column types as double if appropriate, so values are not read in as character strings. This would affect your ability to filter and subset based on Mass.
Next, the between statement has the syntax of:
between(x, left, right)
where x <= right and x >= left. If you want to keep rows whose Mass is within limit of mc (i.e. between 375.2 and 375.4), you want between(X1, mc - limit, mc + limit) instead. Note that since no header was read in, the Mass column comes in as X1.
When you use write.table and append, you might want to set col.names to FALSE (or include header on first write).
Hope this is helpful to you.
for (f in fs){
fname <- file.path(path, f)
df <- read_tsv(fname, col_names = FALSE, skip=1, col_types = "dd")
#filters the data that includes the target mass
df <- filter(df, between(X1, mc-limit, mc+limit))
#create new data based on contents
allSpectra <- data.frame(df,f)
#write new data to sep file
write.table(allSpectra ,"allwobble.csv",
append= T,
sep=",",
row = F,
col.names = FALSE
)
}
Thanks @Ben. I had gotten to that point last night and had added a tolerance calculation. The "dd" definitely helped, but it required a col_names argument to get through another error. The final code is below. A parsing error still comes up, but it does what it needs to do!
tol<- .02 # the width of the search window
mmneg <- mc - tol
mmpos <- mc + tol
#finds the files with the correct extensions
fs <-list.files(path, pattern=glob2rx("*.txt$"))
for (f in fs){
fname <- file.path(path, f)
df <- read_tsv(fname, skip =1,skip_empty_rows = T, col_types="dd", col_names=c("X1","X2"))
#filters the data that excludes the offending peak
df<- filter(df,between(X1,mmneg,mmpos))
#create new data based on contents
allSpectra <- data.frame(df,f)
#write new data to sep file
write.table(allSpectra ,"Caviunin_20_.csv",
append= T,
sep=",",
row = F,
col.names = F
)
}

Get column types of an excel sheet automatically

I have an excel file with several sheets, each with several columns, so I would like not to specify the type of each column separately, but to have it detected automatically. I want to read the sheets the way stringsAsFactors = FALSE does, because it interprets the column types correctly. With my current method, a column with "0.492 ± 0.6" is interpreted as numeric, returning NA, "because" the stringsAsFactors option is not available in read_excel. So here I wrote a workaround that works more or less well, but that I cannot use in real life, because I am not allowed to create a new file. Note: I need some columns as numbers or integers, and others that contain only text as characters, just as stringsAsFactors gives me in my read.csv example.
library(readxl)
file= "myfile.xlsx"
firstread<-read_excel(file, sheet = "mysheet", col_names = TRUE, na = "", skip = 0)
#firstread has the problem that a column with "0.492 ± 0.6"
#is interpreted as numeric (returns NA)
colna<-colnames(firstread)
# read every column as character
colnumt<-ncol(firstread)
textcol<-rep("text", colnumt)
secondreadchar<-read_excel(file, sheet = "mysheet", col_names = TRUE,
col_types = textcol, na = "", skip = 0)
# another column, with the number 0.532, is now 0.5319999999999999
# and several other similar cases.
# read again with stringsAsFactors
# critical step, in real life, I "cannot" write a csv file.
write.csv(secondreadchar, "allcharac.txt", row.names = FALSE)
stringsasfactor<-read.csv("allcharac.txt", stringsAsFactors = FALSE)
colnames(stringsasfactor)<-colna
# column with "0.492 ± 0.6" now is character, as desired, others numeric as desired as well
Here is a script that imports all the data in your excel file. It puts each sheet's data in a list called dfs:
library(readxl)
# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")
# Loop through the sheet names and get the data in each sheet
dfs <- lapply(all_sheets, function(x) {
#Get the number of columns in the current sheet
col_num <- NCOL(read_excel(path = "myfile.xlsx", sheet = x))
# Get the dataframe with columns as text
df <- read_excel(path = "myfile.xlsx", sheet = x, col_types = rep('text',col_num))
# Convert to data.frame
df <- as.data.frame(df, stringsAsFactors = FALSE)
# Get numeric fields by trying to convert them into
# numeric values. If it returns NA then not a numeric field.
# Otherwise numeric.
cond <- apply(df, 2, function(x) {
x <- x[!is.na(x)]
all(suppressWarnings(!is.na(as.numeric(x))))
})
numeric_cols <- names(df)[cond]
df[,numeric_cols] <- sapply(df[,numeric_cols], as.numeric)
# Return df in desired format
df
})
# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets
The process goes as follows:
First, you get all the sheets in the file with excel_sheets and then loop through the sheet names to create dataframes. For each of these dataframes, you initially import the data as text by setting the col_types parameter to text. Once you have gotten the dataframe's columns as text, you can convert the structure from a tibble to a data.frame. After that, you then find columns that are actually numeric columns and convert them into numeric values.
Edit:
As of late April, a new version of readxl was released, and the read_excel function gained two enhancements pertinent to this question. The first is that you can have the function guess the column types for you by supplying "guess" to the col_types parameter. The second enhancement (a corollary of the first) is that a guess_max parameter was added to read_excel. This new parameter lets you set the number of rows used for guessing the column types. Essentially, what I wrote above can be shortened with the following:
library(readxl)
# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")
dfs <- lapply(all_sheets, function(sheetname) {
suppressWarnings(read_excel(path = "myfile.xlsx",
sheet = sheetname,
col_types = 'guess',
guess_max = Inf))
})
# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets
I would recommend that you update readxl to the latest version to shorten your script and as a result avoid possible annoyances.
I hope this helps.

reading row names in read_csv2 (readr package)

I am trying to load an example dataset from here: http://www.agrocampus-ouest.fr/math/RforStat/decathlon.csv to run an example PCA.
The correctly loaded data frame can be replicated with this line of code:
decathlon = read.csv('http://www.agrocampus-ouest.fr/math/RforStat/decathlon.csv',
header = TRUE, row.names = 1, check.names = FALSE,
dec = '.', sep = ';')
However, I was wondering if this can be reproduced with functions from the readr package. The suitable function seems to be read_csv2; however, there is no row.names argument:
dplyrtlon = read_csv2('http://www.agrocampus-ouest.fr/math/RforStat/decathlon.csv',
col_names = TRUE, col_types = NULL, skip = 0)
Any suggestion on how to do this within readr?
readr returns tibbles instead of data frames. Tibbles are faster and more memory efficient than data frames, but they do not support row names.
Depending on what you want to do with your data after reading it in, you could either add a name to the first column (it looks like last names):
dplyrtlon <- read_csv2('http://www.agrocampus-ouest.fr/math/RforStat/decathlon.csv',
col_types = NULL, skip = 0)
names(dplyrtlon)[1] <- "last_name"
or you could convert the variable to a data frame, and use the content of the first column to set up row names:
r <- as.data.frame(dplyrtlon)
rownames(r) <- r[, 1]
r <- r[, -1]
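Alternatively, the tibble package (a close companion of readr) provides column_to_rownames, which collapses those three lines into one. A sketch, using the column name assigned above:
library(tibble)
r <- column_to_rownames(as.data.frame(dplyrtlon), var = "last_name")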

Error when assigning column names to a data frame in R

I am running the following code in order to open up a set of CSV files that have temperature vs. time data
temp = list.files(pattern="*.csv")
for (i in 1:length(temp))
{
assign(temp[i], read.csv(temp[i], header=FALSE, skip =20))
colnames(as.data.frame(temp[i])) <- c("Date","Unit","Temp")
}
the data in the data frames looks like this:
V1 V2 V3
1 6/30/13 10:00:01 AM C 32.5
2 6/30/13 10:20:01 AM C 32.5
3 6/30/13 10:40:01 AM C 33.5
4 6/30/13 11:00:01 AM C 34.5
5 6/30/13 11:20:01 AM C 37.0
6 6/30/13 11:40:01 AM C 35.5
I am just trying to assign column names but am getting the following error message:
Error in `colnames<-`(`*tmp*`, value = c("Date", "Unit", "Temp")) :
'names' attribute [3] must be the same length as the vector [1]
I think it may have something to do with how my loop is reading the csv files. They are all stored in the same working directory.
Thanks for your help!
I'd take a slightly different approach which might be more understandable:
temp = list.files(pattern="*.csv")
for (i in 1:length(temp))
{
tmp <- read.csv(temp[i], header=FALSE, skip =20)
colnames(tmp) <- c("Date","Unit","Temp")
# Now what do you want to do?
# For instance, use the file name as the name of a list element containing the data?
}
Update:
temp = list.files(pattern="*.csv")
stations <- vector("list", length(temp))
for (i in 1:length(temp)) {
tmp <- read.csv(temp[i], header=FALSE, skip =20)
colnames(tmp) <- c("Date","Unit","Temp")
stations[[i]] <- tmp
}
names(stations) <- temp # optional; you could also process the file names, e.g. with basename
station1 <- stations[[1]] # etc.; station1 would be a data.frame
This 2nd part could be improved as well, depending upon how you plan to use the data, and how much of it there is. A good command to know is str(some object). It will really help you understand R's data structures.
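For example:
str(stations, max.level = 1)  # one list element per file
str(stations[[1]])            # the Date/Unit/Temp columns of the first file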
Update #2:
Getting individual data frames into your workspace will be quite hard - someone more clever than I may know some tricks. Since you want to plot these, I'd first make names more like you want with:
names(stations) <- paste(basename(temp), 1:length(stations), sep = "_")
Then I would iterate over the list created above as follows, creating your plots as you go:
for (i in 1:length(stations)) {
tmp <- stations[[i]]
# tmp is a data frame with columns Date, Unit, Temp
# plot your data using the plot commands you like to use, for example
p <- qplot(x = Date, y = Temp, data = tmp, geom = "smooth", main = names(stations)[i])
print(p)
# this is approx code, you'll have to play with it, and watch out for Dates
# I recommend the package lubridate if you have any troubles parsing the dates
# qplot is in package ggplot2
}
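On the date comments above: the timestamps in the question look like "6/30/13 10:00:01 AM", so a minimal base-R parse under that assumption (the format string is inferred from that sample) would be:
# assumes the "6/30/13 10:00:01 AM" layout shown in the question
tmp$Date <- as.POSIXct(tmp$Date, format = "%m/%d/%y %I:%M:%S %p")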
And if you want to save them in a file, use this:
pdf("filename.pdf")
# then the plotting loop just above
dev.off()
A multipage pdf will be created. Good Luck!
It is usually not recommended practice to use the 'assign' statement in R. (I should really find some resources on why this is so.)
You can do what you are trying using a function like this:
read.a.file <- function (f, cnames, ...) {
my.df <- read.csv(f, ...)
colnames(my.df) <- cnames
## Here you can add more preprocessing of your files.
my.df  # return the data frame; otherwise the function returns cnames invisibly
}
And loop over the list of files using this:
lapply(X=temp, FUN=read.a.file, cnames=c("Date", "Unit", "Temp"), skip=20, header=FALSE)
"read.csv" returns a data.frame so you don't need "as.data.frame" call;
You can use "col.names" argument to "read.csv" to assign column names;
I don't know what version of R you are using, but "colnames(as.data.frame(...)) <-" is just an incorrect call since it calls for "as.data.frame<-" function that does not exist, at least in version 2.14.
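For example, the col.names point above removes the need for a separate colnames() call (same arguments as in the question's loop):
tmp <- read.csv(temp[i], header = FALSE, skip = 20,
                col.names = c("Date", "Unit", "Temp"))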
A short-term fix to your woes is the following, but you really need to read up more on using R; from what you did above, I expect you'll get into another mess very quickly. Maybe start by never using assign.
lapply(list.files(pattern = "*.csv"), function (f) {
df = read.csv(f, header = F, skip = 20)
names(df) = c('Date', 'Unit', 'Temp')
df
}) -> your_list_of_data.frames
Although more likely you want this (edited to preserve file name info):
df = do.call(rbind,
lapply(list.files(pattern = "*.csv"), function(f)
cbind(f, read.csv(f, header = F, skip = 20))))
names(df) = c('Filename', 'Date', 'Unit', 'Temp')
At a glance it appears that you are missing a set of subsetting brackets, [[ ]], around the elements of your temp list. Your attribute list has three elements, but because you have temp[i] instead of temp[[i]], the for loop isn't actually accessing the elements of the list, treating each as an element of length one, as the error says.
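A quick way to see the length-one problem (the filename here is hypothetical):
# as.data.frame(temp[i]) wraps the file *name*, not the file's contents
df1 <- as.data.frame("station1.csv")
ncol(df1)  # 1, so three column names cannot be assigned to it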

Read xts from CSV file in R

I'm trying to read time series from a CSV file and save them as xts to be able to process them with quantmod. The problem is that the numeric values are not parsed.
CSV file:
name;amount;datetime
test1;3;2010-09-23 19:00:00.057
test2;9;2010-09-23 19:00:00.073
R code:
library(xts)
ColClasses = c("character", "numeric", "character")
Data <- read.zoo("c:\\dat\\test2.csv", index.column = 3, sep = ";", header = TRUE, FUN = as.POSIXct, colClasses = ColClasses)
as.xts(Data)
Result:
name amount
2010-09-23 19:00:00 "test1" "3"
2010-09-23 19:00:00 "test2" "9"
As you can see, the amount column contains character data although it is expected to be numeric. What's wrong with my code?
The internal data structure of both zoo and xts is matrix, so you cannot mix data types.
Just read in the data with read.table:
Data <- read.table("file.csv", sep=";", header=TRUE, colClasses=ColClasses)
I notice your data have subseconds, so you may be interested in xts::align.time. This code takes Data and creates one object with a column for each "name", aligned to one-second resolution:
NewData <- do.call( merge, lapply( split(Data,Data$name), function(x) {
align.time( xts(x[,"amount"],as.POSIXct(x[,"datetime"])), n=1 )
}) )
If you want to create objects test1 and test2 in your global environment, you can do something like:
lapply( split(Data,Data$name), function(x) {
assign(x[1, "name"], xts(x[, "amount"], as.POSIXct(x[, "datetime"])), envir = .GlobalEnv)
})
You cannot mix numeric and character data in a zoo or xts object; however, if the name column is not intended to be time series data but rather distinguishes multiple time series (one for test1, one for test2, etc.), then you can split on column 1 using split = 1, as shown in the following code. Be sure to set digits.secs, or else you won't see the sub-seconds on output (although they will be there in any case):
options(digits.secs = 3)
z <- read.zoo("myfile.csv", sep = ";", split = 1, index = 3, header = TRUE, tz = "")
x <- as.xts(z)
