I want to analyze market data that is being saved in a text file.
The data consists of "Date Time;Price;Size". I only want to look at the sizes. How can I separate this data in R so that I can do statistical analysis on the sizes?
Example:
20170918 040001;50.42;1
20170918 040002;50.42;1
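Just use read.csv with a semicolon as the delimiter. The sample has no header row, so set header=FALSE and supply the column names yourself:
df <- read.csv(file="path/to/your/file.csv", sep=";", header=FALSE, col.names=c("DateTime", "Price", "Size"))
The sizes can then be accessed using df$Size.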
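A minimal sketch of this approach on the two sample rows from the question, assuming the file has no header row. The column names date_time, price and size are illustrative choices, and the text is inlined with the text= argument only so that the example runs without a file:

```r
# Two sample rows from the question, inlined so the sketch is self-contained
raw <- "20170918 040001;50.42;1
20170918 040002;50.42;1"

# header = FALSE because the sample has no header line;
# the column names are illustrative, not taken from the file
df <- read.csv(text = raw, sep = ";", header = FALSE,
               col.names = c("date_time", "price", "size"))

# Extract the sizes as a plain vector for statistical analysis
sizes <- df$size
summary(sizes)
```

For a real file, replace text = raw with file = "path/to/your/file.csv".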
You can use the select argument of data.table:
library(data.table)
#[[1L]] extracts the column of the temporary table to a vector;
# you could also use $V2, but this _may_ not be perfectly robust
price = fread('/path/to/file', select = 2L)[[1L]]
fread should be able to detect automatically that your file doesn't have headers, as well as that the field separator is ;. If not, set header = FALSE and/or sep = ';'.
Of course, it's not likely that you will only use the vector of prices independently of the rest of the data. So you should really just store the whole data file in a data.table:
market_data = fread('/path/to/file', col.names = c('date_time', 'price', 'size'))
Then you can manipulate market_data as you would any data.table (see Getting Started), e.g.
market_data[ , mean(price)]
market_data[ , sd(price)]
and so on.
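Since the question is about sizes rather than prices, here is a small sketch of the same idea applied to the size column. It assumes the data.table package is installed and reads the two sample rows through fread's text argument; the column names are the ones used above, and sep and header are set explicitly rather than relying on auto-detection:

```r
library(data.table)

# The two sample rows from the question, one element per line
raw <- c("20170918 040001;50.42;1",
         "20170918 040002;50.42;1")

# Read the inlined rows into a data.table with explicit column names
market_data <- fread(text = raw, sep = ";", header = FALSE,
                     col.names = c("date_time", "price", "size"))

# Summary statistics on the size column only
size_stats <- market_data[, .(n = .N,
                              mean_size = mean(size),
                              total_size = sum(size))]
size_stats
```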
df=read.table("your file", sep=";")
size=df[3]
Your sizes data will be in size as a one-column data frame (use df[[3]] if you want a plain vector).
I'm trying to import a csv file into a vector. There are 100 entries in this csv file, and this is what the file looks like:
My code reads as follows:
> choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")
> choice_vector
And yet, when I try to display said vector, it shows up as:
It is somehow creating a second column which I cannot figure out why it is doing so. In addition, trying to write to a new csv file actually writes the contents of that second column to that as well.
The second column was enabled (used at some point) in Excel, so it is exported as an empty column.
Option 1: Manually delete the column in Excel.
Option 2: Delete every column that is entirely NA:
choice_vector2 <- choice_vector[,colSums(is.na(choice_vector))<nrow(choice_vector)]
If you are only interested in reading the first column:
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")[,1]
Good luck!
Short answer:
You have an issue with your data file, but
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")$V1
should create the vector that you're expecting.
Long answer:
The read.csv function returns a data frame and you need to address a particular column within the data frame with the $ operator in order to extract that column as a vector. As for why you have an unexpected column of NAs, your CSV probably codes for two columns. When you read a CSV with R, a comma indicates a data field to its right. If you look at your CSV with a text editor, I'm guessing it'll look like this:
A,
B,
D,
A,
A,
F,
The absence of anything (other than another comma or a line break) to the right of a comma is interpreted as NA.
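The effect of a trailing comma can be reproduced without the original file. This is a sketch that assumes the CSV looks like the guess above; the three rows are inlined with text= purely for illustration:

```r
# Each line ends with a comma, so every row has an empty second field
raw <- "A,
B,
D,"

choices <- read.csv(text = raw, header = FALSE)

# Two columns come back: V1 with the letters, V2 entirely NA
str(choices)

# Extract the first column as a vector
choice_vector <- as.character(choices$V1)

# Alternatively, drop every column that is entirely NA
# (drop = FALSE keeps the result a data frame even with one column left)
cleaned <- choices[, colSums(is.na(choices)) < nrow(choices), drop = FALSE]
```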
If we are using fread from data.table, there is a select option to select only the columns of interest
library(data.table)
dt <- fread("choices.csv", select = 1)
Other than that, it is not clear why the issue happens. It could be some stray white space. If that is the case, specify strip.white = TRUE (it is FALSE by default):
read.csv("choices.csv", header = FALSE,
         fileEncoding = "UTF-8-BOM", strip.white = TRUE)
Or as we commented, copy the columns of interest into a new file, save it and then read with read.csv
There is a large dataset that I need to download over the web using R, and I would like to learn how to filter it down to the dates I need while it downloads. Right now I have it set up to download and unzip the file, and then I create another data set with a filter. The file is a ";"-delimited text file.
There is a Date column with the format 1/1/2009, and I only need to select two dates, 3/1/2009 and 3/2/2009. How can I do that in R?
When I import it, R sets it as a factor. Since I only need those two dates, there is no need for a "between"; I can just select the two factor levels and call it a day.
Thanks!
I don't think you can filter while downloading. To select only these dates you can use the subset function:
# do not convert string to factors
d.all = read.csv(file, ..., stringsAsFactors = FALSE, sep = ';')
# Date column is called DATE:
d.filter = subset(d.all, DATE %in% c("3/1/2009", "3/2/2009"))
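A self-contained sketch of this read-then-filter pattern, selecting the two dates the question asks about. The four rows and the VALUE column are made up for illustration; only the DATE column name and the ";" separator come from the thread:

```r
# Made-up sample in the same ";"-delimited layout as the question
raw <- "DATE;VALUE
1/1/2009;10
3/1/2009;20
3/2/2009;30
4/1/2009;40"

# Keep the dates as character strings rather than factors
d.all <- read.csv(text = raw, sep = ";", stringsAsFactors = FALSE)

# Keep only the two dates of interest
wanted <- c("3/1/2009", "3/2/2009")
d.filter <- subset(d.all, DATE %in% wanted)
d.filter
```

Comparing strings works here because only exact dates are needed; a range query would first need a conversion such as as.Date(DATE, format = "%m/%d/%Y").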
My issue is likely with how I'm exporting the data from the for loop, but I'm not sure how to fix it.
I've got over 200 files in a folder, all structured in the same way, from which I'd like to pull the maximum number from a single column. I've made a for loop to do this based off of code from here http://www.r-bloggers.com/looping-through-files/
What I have running so far looks like this:
fileNames<-Sys.glob("*.csv")
for(i in 1:length(fileNames)){
data<-read.csv(fileNames[i])
VelM = max(data[,8],na.rm=TRUE)
write.table(VelM, "Summary", append=TRUE, sep=",",
row.names=FALSE,col.names=FALSE)
}
This works, but I need to figure out a way to have a second column in my summary file that contains the original file name the data in that row came from for reference.
I tried making both a matrix and a data frame instead of going straight to the table writing, but in both cases I wasn't able to append the data and ended up with values from only the last file.
Any ideas would be greatly appreciated!
Here's what I would recommend to improve your current method, also going with fread() because it's very fast and has the select argument. Notice I have moved the write.table() call outside the for() loop. This allows a cleaner way of adding the new column of file names alongside the max column, and eliminates the need to append to the file on every iteration.
library(data.table)
fileNames <- Sys.glob("*.csv")
VelM <- numeric(length(fileNames))
for(i in seq_along(fileNames)) {
VelM[i] <- max(fread(fileNames[i], select = 8)[[1L]], na.rm = TRUE)
}
write.table(data.frame(VelM, fileNames), "Summary", sep = ",",
row.names = FALSE, col.names = FALSE)
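The same collect-then-write idea can be sketched in base R. Everything below is illustrative: the two files are generated in a temporary folder, and the value of interest sits in column 2 rather than column 8.

```r
# Create a temporary folder with two small, made-up CSV files
tmp_dir <- tempfile("vel_demo_")
dir.create(tmp_dir)
write.csv(data.frame(t = 1:3, vel = c(1.5, 4.2, NA)),
          file.path(tmp_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(t = 1:3, vel = c(9.1, 0.3, 2.0)),
          file.path(tmp_dir, "b.csv"), row.names = FALSE)

file_names <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# One maximum per file; column 2 stands in for column 8 of the real data
vel_max <- vapply(file_names,
                  function(f) max(read.csv(f)[, 2], na.rm = TRUE),
                  numeric(1))

# Pair each maximum with the file it came from, then write once
summary_df <- data.frame(file = basename(file_names), VelM = unname(vel_max))
write.table(summary_df, file.path(tmp_dir, "Summary"), sep = ",",
            row.names = FALSE, col.names = FALSE)
summary_df
```

Building the whole result first and writing it once avoids both the repeated append and the problem of keeping only the last file's value.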
If you want to quickly read files, you should consider using data.table::fread or readr::read_csv instead of base read.csv.
For example:
fileNames <- list.files(path = your_path, pattern = '\\.csv$', full.names = TRUE) # instead of Sys.glob
library('data.table')
# name each list element after its file so rbindlist can use the names as ids
dt <- rbindlist(lapply(setNames(fileNames, fileNames), fread, select = 8), idcol = 'id')
res <- dt[, .(max_val = max(your_var, na.rm = TRUE)), by = id]
write.table(res, 'yourfile.csv', sep = ',', row.names = FALSE, col.names = FALSE)
Explanation: data.table::fread reads in only the 8th column (select = 8) from each file (via lapply over fileNames, which returns a list of data.tables). Then data.table::rbindlist combines this list of one-column data.tables into a single data.table, adding an id column through its idcol argument. From ?rbindlist, note that
If input is a named list, ids are generated using them
lapply keeps the names of its input, so naming fileNames with setNames gives a list named by file, which is an easy way of carrying the file names along for grouping.
The rest is data.table syntax. It wasn't clear from your question if there is a header row and whether you know the heading in advance. If so, you can either keep header=TRUE and use the header name for your_var, or you can do skip=1, header=FALSE, col.names = 'your_var'.
I have a data.table in R that I'm trying to write out to a .txt file, and then input back into R.
It's a sizeable table of 6.5M observations and 20 variables, so I want to use fread().
When I use
write.table(data, file = "data.txt")
a table of about 2.2GB is written in data.txt. In manually inspecting it, I can see that there are column names, that it's separated by " ", and that there are quotes on character variables. So everything should be fine.
However,
data <- fread("data.txt")
returns a data.table of 6.5M observations and 1 variable. OK, maybe for some reason fread() isn't automatically understanding the separator string:
data <- fread("data.txt", sep = " ")
All the data is in the proper variables now, but
R has added an unnecessary row-number column
in one (only one) of my columns all NAs have been replaced by 9218868437227407266
All variable names are missing
Maybe fread() isn't recognizing the header, somehow.
data <- fread("data.txt", sep = " ", header = T)
Now my first set of observations is my column names. Not very useful.
I'm completely baffled. Does anyone understand what's happening here?
EDIT:
row.names = F solved the names problem, thanks Ananda Mahto.
Ran
datasub <- data[runif(1000,1,6497651), ]
write.table(datasub, file = "datasub.txt", row.names = F)
fread("datasub.txt")
fread() seems to work fine for the smaller dataset.
EDIT:
Here is the subset of data I created above:
https://github.com/cbcoursera1/ExploratoryDataAnalysisProject2/blob/master/datasub.txt
This data comes from the National Emissions Inventory (NEI) and is made available by the EPA. More information is available here:
http://www.epa.gov/ttn/chief/eiinformation.html
EDIT:
I can no longer reproduce this issue. It may be that row.names = F solved the issue, or possibly restarting R/clearing my environment/something random fixed the problem.
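A small sketch of what row.names changes in the written file, using a made-up two-row data frame. With the write.table defaults, the header line has one fewer field than the data rows, which is one plausible source of the misaligned names described above; this is an illustration, not a confirmed diagnosis of the original problem.

```r
# Made-up data with an NA and a character column
df <- data.frame(x = c(1.5, NA), y = c("a", "b"))
path <- tempfile(fileext = ".txt")

# Default: row names are written, but the header gets no entry for them
write.table(df, file = path)
lines_default <- readLines(path)

# With row.names = FALSE the header and the rows have the same field count
write.table(df, file = path, row.names = FALSE)
lines_clean <- readLines(path)

# Number of space-separated fields on each line
fields_default <- lengths(strsplit(lines_default, " "))
fields_clean <- lengths(strsplit(lines_clean, " "))
fields_default
fields_clean
```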
I have a script that is working perfectly except that in my R cbind operation, adjacent to the numerical value that I require in the first row, is an 'X'.
Here is my script:
library(ncdf)
library(Kendall)
library(forecast)
library(zoo)
setwd("/home/cohara/RainfallData")
files=list.files(pattern="*.nc")
j=81
for (i in seq(1,9))
{
file<-open.ncdf(sprintf("/home/cohara/RainfallData/%s.nc",i))
year<-get.var.ncdf(file,"time")
data<-get.var.ncdf(file,"var61")
fit<-lm(data~year) #least squares regression
mean=rollmean(data,4,fill=NA)
kendall<-Kendall(data,year)
write.table(kendall[[2]],file="/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(kendall[[1]],file="/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
png(sprintf("./10 percent increase over %s years.png",j))
par(family="serif",mar=c(4,6,4,1),oma=c(1,1,1,1))
plot(year,data,pch="*",col=4,ylab="Precipitation (mm)",main=(sprintf("10 percent increase over %s years",j)),cex.lab=1.5,cex.main=2,ylim=c(800,1400),abline(fit,col="red",lty=1.5))
par(new=T)
plot(year,mean,type="l",xlab="year",ylab="Precipitation (mm)",cex.lab=1.5,ylim=c(800,1400),lty=1.5)
legend("bottomright",legend=c("Kendall tau = ",kendall[[1]]))
legend("bottomleft",legend=c("Kendall 2-tailed p-value = ",kendall[[2]]))
legend(x="topright",c("4 year moving average","Simple linear trend"),lty=1.5,col=c("black","red"),cex=1.2)
legend("topleft",c("Annual total"),pch="*",col="blue",cex=1.2)
dev.off()
j=j+1
}
tmp<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv")
tmp2<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv")
tmp<-cbind(tmp,tmp2)
tmp3<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv")
tmp4<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv")
tmp3<-cbind(tmp3,tmp4)
write.table(tmp,"/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
write.table(tmp3,"/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
The output looks like this, from the .csv files created:
X0.0190228056162596 X0.000701081415172666
0.0395622998 0.00531819
0.0126547674 0.0108218994
0.0077754743 0.0015568719
0.0001407317 0.002680057
0.0096391216 0.012719159
0.0107234037 0.0092436085
0.0503448173 0.0103918528
0.0167525802 0.0025036721
I want to be able to use excel functions on the data, so, for simplicity, I don't want row names (I'll be running this loop maybe a hundred times), but I need column names because otherwise the first set of values is cut off.
Can anyone tell me where the 'X' is coming from and how to get rid of it?
Thanks in advance,
Ciara
Here is what I think is going on. Start by running these small examples:
df1 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994")
df2 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994", header = FALSE)
df1
df2
str(df1)
str(df2)
names(df1)
names(df2)
make.names(c(0.0190228056162596, 0.000701081415172666))
Please read ?read.csv and about the header argument. As you will find, header = TRUE is default in read.csv. Thus, if the csv file you read lacks header, read.csv will still 'assume' that the file has a header, and use the values in the first row as a header. Another argument in read.csv is check.names, which defaults to TRUE:
If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names).
In your case, it seems that the data you read lack a header and that the first row contains numbers only. By default, read.csv treats this row as a header. make.names then takes the values in the first row (here the numbers 0.0190228056162596 and 0.000701081415172666) and turns them into the 'syntactically valid variable names' X0.0190228056162596 and X0.000701081415172666, which is not what you want.
Thus, you need to explicitly set header = FALSE to prevent read.csv from converting the first row to (valid) variable names.
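Applied to the cbind step in the question, the fix could look roughly like this. The two inlined columns stand in for the header-less result files, and the column names p_value_run1 and p_value_run2 are hypothetical labels chosen for the example:

```r
# Hypothetical stand-ins for the two header-less result files
p_first <- read.csv(text = "0.0190228056\n0.0395622998\n0.0126547674",
                    header = FALSE)
p_second <- read.csv(text = "0.0007010814\n0.0053181900\n0.0108218994",
                     header = FALSE)

# No row was consumed as a header, so all three values survive
combined <- cbind(p_first, p_second)

# Give the columns explicit names so the output still has a header row
names(combined) <- c("p_value_run1", "p_value_run2")

out_path <- tempfile(fileext = ".csv")
write.table(combined, out_path, sep = "\t", row.names = FALSE)
readLines(out_path)
```

Naming the columns explicitly keeps a header row in the output without sacrificing the first data row.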
For next time, please provide a minimal, self-contained example.