R - Writing data to CSV in a loop

I have a loop that is going through a list of variables (in a csv), accessing a database and extracting the relevant data. It does this for 4 different time periods (which depend on the variables).
I am trying to get R to write this data to a csv, but currently I can only get it to store the data for the last variable, in 4 different csv files, because it overwrites the previous variable each time.
I'd like all of the data for these variables for one time period to end up in the same file/sheet (so either 4 sheets or 4 csv files, each with all of the data on it). This is because I need to do some data manipulation on the variables before I feed them into the next loop of the script.
I'd like it to be something like this, but need 4 separate sheets/files so I can cover each time period.
date/time | var1 | var2 | ... | varn
I would post the code, but even posting only the relevant loop and none of the surrounding code would be ~150 lines. I am not familiar with R (I can follow the script but struggle to write my own); I inherited this project and don't have long to work on it.
Note: each variable is recorded at a different frequency - some will only have one data point an hour, others one every minute - so I will need to match these up based on the time recorded (to the nearest minute).
EDIT: I hope I've explained this clearly enough

Four different .csv files would be easiest, because you could do something like the following in your loop:
out.filename <- paste('Sales', year.of.data, '.csv', sep = '')   # e.g. "Sales2014.csv"
write.csv(period.data, out.filename, row.names = FALSE)          # period.data: the data frame for that time period
You could also append the data for each variable into one data.frame and then export it all at once into a single file. You won't be able to export multiple sheets to a .csv, though, because the CSV format doesn't support multiple sheets.
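If you'd rather end up with one file per time period that already has the variables lined up by time, a rough sketch (not from the original answer; var.list, value, and period1.vars are made-up names) is to round each variable's timestamps to the minute, merge everything into one wide data frame, and write it out once per period:
# Sketch only: var.list is assumed to be a named list of data frames,
# each with a datetime column and a value column for one variable.
combine_period <- function(var.list) {
  merged <- NULL
  for (v in names(var.list)) {
    d <- var.list[[v]]
    d$datetime <- as.POSIXct(round(as.POSIXct(d$datetime), units = "mins"))  # match to the nearest minute
    names(d)[names(d) == "value"] <- v
    merged <- if (is.null(merged)) d else merge(merged, d, by = "datetime", all = TRUE)
  }
  merged
}
write.csv(combine_period(period1.vars), "period1.csv", row.names = FALSE)
Because merge(..., all = TRUE) keeps every timestamp that appears in any variable, an hourly series simply gets NAs between its readings rather than dropping rows.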

Related

R Updating A Column In a Large Dataframe

I've got a dataframe, which is stored in a csv, of 63 columns and 1.3 million rows. Each row is a chess game, each column is details about the game (e.g. who played in the game, what their ranking was, the time it was played, etc). I have a column called "Analyzed", which is whether someone later analyzed the game, so it's a yes/no variable.
I need to use the API offered by chess.com to check whether a game is analyzed. That's easy. However, how do I systematically update the csv file, without wasting huge amounts of time reading in and writing out the csv file, while accounting for the fact that this is going to take a huge amount of time and I need to do it in stages? I believe a best practice for chess.com's API is to use Sys.sleep after every API call so that you lower the likelihood that you are accidentally making concurrent requests, which the API doesn't handle very well. So I have Sys.sleep for a quarter of a second. If we assume the API call itself takes no time, then this means this program will need to run for 90 hours because of the sleep time alone. My goal is to make it so that I can easily run this program in chunks, so that I don't need to run it for 90 hours in a row.
The code below works great to get whether a game has been analyzed, but I don't know how to intelligently update the original csv file. I think my best bet would be to rewrite the new dataframe and replace the old Games.csv every 1000 or so API calls. See the commented code below.
My overall question is: when I need to update a column in a large csv, what is the smart way to update that column incrementally?
library(bigchess)
library(rjson)
library(jsonlite)
df <- read.csv("Games.csv")

for (i in 1:nrow(df)) {
  data <- read_json(df$urls[i])
  if (data$analysisLogExists == TRUE) {
    df$Analyzed[i] <- 1
  }
  if (data$analysisLogExists == FALSE) {
    df$Analyzed[i] <- 0
  }
  Sys.sleep(0.25)
  ## This won't work, because the second time I run it I'll just reread the original lines.
  ## If I try to account for this by subsetting only the rows that haven't been updated,
  ## then this still doesn't work, because the write command below will no longer be
  ## writing the whole dataset to the csv.
  if (i %% 1000 == 0) {
    write.csv(df, "Games.csv", row.names = FALSE)
  }
}
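One way to make the run resumable, sketched below (not from the original thread; it assumes Analyzed starts out as NA for rows that haven't been checked yet), is to loop only over the rows that are still NA and checkpoint the whole data frame every 1000 rows. Stopping and restarting then just picks up from the last checkpoint.
library(jsonlite)

df <- read.csv("Games.csv")
if (!"Analyzed" %in% names(df)) df$Analyzed <- NA   # NA marks rows not yet checked

todo <- which(is.na(df$Analyzed))                   # only the rows still to do
for (i in todo) {
  data <- read_json(df$urls[i])
  df$Analyzed[i] <- as.integer(isTRUE(data$analysisLogExists))
  Sys.sleep(0.25)
  if (i %% 1000 == 0) {
    write.csv(df, "Games.csv", row.names = FALSE)   # periodic checkpoint
  }
}
write.csv(df, "Games.csv", row.names = FALSE)       # final save
Rewriting the whole 1.3-million-row file at every checkpoint is still somewhat wasteful; writing just the updated Analyzed column (plus a row identifier) to a small side file and joining it back at the end would cut the I/O further, at the cost of a little more bookkeeping.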

How do I use a for loop to open .ncdf files and average a matrix variable that has different values over all the files? (Using R Programming)

I'm currently trying to compute an averaged matrix over all matrices of a specific air quality variable (ColumnAmountNO2TropCloudScreened) stored in different .ncdf4 files. The only way I was able to do it was by listing all the files, opening them using lapply, creating a separate NO2 variable for every .ncdf file, and then applying abind to all of the variables. Even though it worked, it took a lot of time to type in different names for the NO2 variables (NO2_1, NO2_2, NO2_3, etc.) and which element of the listed files to access ([[1]], [[2]], [[3]], etc.).
I am trying to write code that's smarter and easier than just typing in a bunch of numbers. I have all the original .ncdf4 files listed, and am trying to loop over the files to open them and get the 'ColumnAmountNO2TropCloudScreened' matrix from each, so that I can then average them. However, I am having no luck. Would someone know what is wrong with this code, or with my thinking behind it? Thanks.
The code I'm trying is as follows:
# Load libraries
library(ncdf4)
library(abind)
library(plot.matrix)
# Set working directory
setwd("~/researchdatasets/2020")
# Declare data frame
df=NULL
# List all the files in the folder
files1 <- list.files(pattern = '\\.nc4$', full.names = FALSE)
# Loop to open files, get NO2 variables
for (i in seq(along = files1)) {
  nc_data <- nc_open(files1[i])
  NO2_var <- ncvar_get(nc_data, 'ColumnAmountNO2TropCloudScreened')
  nc_close(nc_data)
}
# Average variables
list_NO2= apply(abind::abind(NO2_var,along=3),1:2,mean,na.rm=TRUE)
NCO's ncra averages variables across all input files with, e.g.,
ncra in*.nc out.nc
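Within R itself, the core problem in the question's loop is that NO2_var is overwritten on every iteration, so only the last file's matrix survives. A minimal sketch of a fix (same package calls as the question; the file pattern is assumed) collects each file's matrix in a list and binds them afterwards:
library(ncdf4)
library(abind)

files1 <- list.files(pattern = '\\.nc4$', full.names = FALSE)

# Read the NO2 matrix out of each file into one element of a list
NO2_list <- lapply(files1, function(f) {
  nc_data <- nc_open(f)
  on.exit(nc_close(nc_data))
  ncvar_get(nc_data, 'ColumnAmountNO2TropCloudScreened')
})

# Stack the matrices along a third dimension and average cell-by-cell, ignoring NAs
NO2_mean <- apply(abind::abind(NO2_list, along = 3), 1:2, mean, na.rm = TRUE)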

Reading multiple uneven .csv files into one dataframe with R

I need help with the following; please let me know your comments.
My objective:
I have multiple .csv files in one location.
The .csv files have different numbers of rows (m) and columns (n), i.e. m != n.
All the csv files cover almost the same dates (calendar day & time stamps, e.g. 04/01/2016 7:01), but the interesting point is that some files have some time stamps missing.
All the .csv files have the following columns in common (Open, High, Low, Close, Date).
My objective is to import only the "Close" column from all the .csv files, even though each file has a different number of rows because some time stamp data is missing in some files.
If a time stamp is missing but the previous one is present, then repeat the previous value.
If a time stamp is missing and the previous one is also missing, then put NA there. This only applies to the first few data points.
Here is my plan:
Reading/writing files: implement logic to read the files in a certain fashion and then write separate files for different sets of instruments.
Inconsistent time series: the time series is not consistent and continuous for some securities, so generate your own datetime stamps and then fill in the data against each stamp (wherever available).
Missing data points: there will be certain timestamps for which there is no data; make the time series continuous by filling in the missing points with data from the previous timestamp.
Maybe try
library(data.table)

read_in <- function(csv) {
  f <- read.csv(csv)
  f <- f[!is.na(f$Date), ]          # drop rows with no time stamp
  f[, c("Date", "Close")]           # keep only the time stamp and Close columns
}
l <- lapply(csv_list, read_in)      # csv_list: character vector of file paths
df <- rbindlist(l)
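To handle the missing time stamps (points 2 and 3 in the question's plan above), one approach, sketched here with assumed column names and date format, is to build a complete minute-by-minute sequence of time stamps, left-join each file's Close column onto it, and carry the last value forward with zoo::na.locf; leading gaps stay NA:
library(zoo)

fill_close <- function(f) {
  f$Date <- as.POSIXct(f$Date, format = "%m/%d/%Y %H:%M")
  # Complete minute-by-minute index covering the file's date range
  full <- data.frame(Date = seq(min(f$Date), max(f$Date), by = "min"))
  m <- merge(full, f[, c("Date", "Close")], by = "Date", all.x = TRUE)
  m$Close <- zoo::na.locf(m$Close, na.rm = FALSE)   # repeat previous value; leading NAs remain
  m
}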

How to delete all rows in R until a certain value

I have several data frames which start with a bit of text. Sometimes the information I need starts at row 11 and sometimes at row 16, for instance; it changes. What all the data frames have in common is that the useful information starts after a row with the title "location".
I'd like to make a loop to delete all the rows in the data frame above the useful information (including the row with "location").
I'm guessing that you want something like this:
readfun <- function(fn, n = -1, target = "location", ...) {
  r <- readLines(fn, n = n)
  locline <- grep(target, r)[1]
  read.table(fn, skip = locline, ...)
}
This is fairly inefficient because it reads the data file twice (once as raw character strings and once as a data frame), but it should work reasonably well if your files are not too big. (#MrFlick points out in the comments that if you have a reasonable upper bound on how far into the file your target will occur, you can set n so that you don't have to read the whole file just to search for the target.)
I don't know any other details of your files, but it might be safer to use "^location" to identify a line that begins with that string, or some other more specific target ...
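For example (the file name and read.table options here are hypothetical), the function could be called like this:
# Search only the first 200 lines for "location", then read the rest as comma-separated data
dat <- readfun("station_report.txt", n = 200, header = TRUE, sep = ",")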

Import Large Unusual File To R

First time poster here, so I'll try and make myself as clear as possible on the help I need. I'm fairly new to R, and this is my first real independent programming experience.
I have stock tick data for about 2.5 years; each day has its own file. The files are .txt, consist of approximately 20-30 million rows, and average roughly 360 MB each. I am working on one file at a time for now. I don't need all the data these files contain, and I was hoping I could use the program to trim the files down a bit.
Now my problem is that I am having some difficulties with writing the proper code so R understands what I need it to do.
Let me first show you some of the data so you can get an idea of the formatting.
M977
R 64266NRE1VEW107 FI0009653869 2EURXHEL 630 1
R 64516SSA0B 80SHB SE0002798108 8SEKXSTO 40 1
R 645730BBREEW750 FR0010734145 8EURXHEL 640 1
R 64655OXS1C 900SWE SE0002800136 8SEKXSTO 40 1
R 64663OXS1P 450SWE SE0002800219 8SEKXSTO 40 1
R 64801SSIEGV LU0362355355 11EURXCSE 160 1
M978
Another snip of data:
M732
D 3547742
A 3551497B 200000 67110 02800
D 3550806
D 3547743
A 3551498S 250000 69228 09900
So as you can see each line begins with a letter. Each letter denotes what the line means. For instance R means order book directory message, M means milliseconds after last second, H means stock trading action message. There are 14 different letters used in total.
I have used the readLines function to import the data into R. This however seems to take a very long time for R to process when I want to work with the data.
Now I would like to write some sort of If function that says if the first letter is R then from offset 1 to 4 the code means Market Segment Identifier etc., and have R add columns to these so I can work with the data in a more structured fashion.
What is the best way of importing such data, and also creating some form of structure - i.e. use unique ID information in the line of data to analyze 1 stock at a time for instance.
You can try something like this :
options(stringsAsFactors = FALSE)
# Parse an "A" line: take fields 2-5 and append them as a row to the running table
f_A <- function(line, tab_A) {
  values <- unlist(strsplit(line, " "))[2:5]
  rbind(tab_A, list(name_1 = as.character(values[1]), name_2 = as.numeric(values[2]),
                    name_3 = as.numeric(values[3]), name_4 = as.numeric(values[4])))
}
# Empty table to accumulate the "A" messages into
tab_A <- data.frame(name_1 = character(), name_2 = numeric(), name_3 = numeric(),
                    name_4 = numeric(), stringsAsFactors = FALSE)
# Dispatch on the first character of each line
for (i in readLines(con = "/home/data.txt")) {
  switch(substr(i, 1, 1),
         M = cat("1\n"), R = cat("2\n"), D = cat("3\n"),
         A = (tab_A <- f_A(i, tab_A)))
}
Then replace the cat() calls with functions that add values to each type of data.frame. Use the pattern of f_A() to construct the other functions, and do the same for each table's structure.
You can combine your readLines() command with regular expressions. To get more information about regular expressions, look at the R help site for grep()
> ?grep
So you can go through all the lines, check for each line what it means, and then handle or store the content of the line however you like. (Regular Expressions are also useful to split the data within one line...)
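For instance, a small sketch of that approach (the file path is taken from the earlier answer; the field handling is illustrative only):
lines <- readLines("/home/data.txt")
# Keep only the order book directory messages (lines starting with "R")
r_lines <- grep("^R", lines, value = TRUE)
# Split the remainder of each line on runs of whitespace into individual fields
r_fields <- strsplit(sub("^R\\s+", "", r_lines), "\\s+")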

Resources