Converting dataframe into csv with "unexpected token" - r

I am currently writing code to download time series (which will then be converted into CSV files) to conduct an event study. Here is the relevant part of the code I wrote:
tickers = c("^AEX", "^ATX", "^BFX", "^FCHI", "^FTSE", "^GDAXI", "^IBEX", "^OMX", "^OMXH25", "^OSEAX", "^SSMI", "FTSEMIB.MI")
Aggregate <- getSymbols(tickers,
                        from = "2014-01-01",
                        to = "2021-12-31")
na.omit(Aggregate, "iz", interp = "linear")
Ticker <- Aggregate
Ticker
class(Ticker)
data1 <- as.data.frame(Ticker)
data1
class(data1)
data2 <- data1 # Duplicate data frame
data2 # Print new data frame
AEX <- ^AEX # this line throws "unexpected token": ^AEX is not a valid R name
write.zoo(AEX, "//Users/TEST/Library/CloudStorage/OneDrive-Personal/Event Study Basis\\AEX.csv", index.name = "Date", sep = ",")
As the index tickers (^AEX, ^ATX, etc.) all contain a "^", which Excel doesn't "eat", I want to make sure the data frame I export to a .csv file does not contain this "^". For a different analysis (with different tickers) the code worked; now I get an error every time I try to run it.
My questions:
Which command will solve my problem? --> Converting ^AEX into AEX so Excel eats it :)
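One possible fix, as a minimal sketch: quantmod's getSymbols() strips the leading "^" when it auto-assigns, so the data for "^AEX" already lives in an object named AEX. You can build the clean names with gsub() and fetch each object with get(); the output folder below is an assumption, adjust it to your own path.
library(quantmod)
tickers <- c("^AEX", "^ATX", "^BFX") # shortened list for illustration
getSymbols(tickers, from = "2014-01-01", to = "2021-12-31")
# strip the leading "^" so the names match the auto-assigned objects
clean_names <- gsub("^\\^", "", tickers)
for (nm in clean_names) {
  x <- na.omit(get(nm)) # fetch the xts object by its clean name
  write.zoo(x, file.path("~/Event Study Basis", paste0(nm, ".csv")),
            index.name = "Date", sep = ",")
}
The cleaned names also make valid file names, so Excel should open the resulting CSVs without complaint.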

Related

Unable to program a loop for regression analysis in R

I have a problem with my R code. I would like to run about 100 regressions and perform this process with a loop. I have tried to program the loop myself using help from YouTube and the like, but I am getting nowhere. Therefore, I would like to ask you if you can help me.
Specifically, it's about the following:
I have a dataset of the 100 companies in the Nasdaq100 and I would like to regress the sales per share against the stock price performance on a quarterly basis. Another problem is that the data set contains all 100 companies, so for each company a subset with the respective ticker symbol has to be created so that R can access it correctly for each regression.
Here is an excerpt from the code:
Nasdaq_100 = read_xlsx("Nasdaq_100_Sales_Data.xlsx")
#Correlation between quarterly close price and Sales of AMD
AMD <- subset (Nasdaq_100, Nasdaq_100$TickerSymbol=="AMD")
AMD_regression = lm(AMD$Sales ~ AMD$Stockprice_quarterly, data = Nasdaq_100)
summary(AMD_regression)
Can you help me to program this loop for regression analysis?
Thanks in advance for any help!
To convert this to a for loop, first get a list of the .xlsx files in your working directory:
require(data.table)
myfiles <- list.files(pattern = "\\.xlsx$") # regex, not a glob
Then loop through each file, saving the output with minor modifications to your existing code:
library(readxl)
for (file in myfiles) {
  Nasdaq_100 <- readxl::read_xlsx(file) # fread() cannot read .xlsx files
  AMD <- subset(Nasdaq_100, TickerSymbol == "AMD")
  AMD_regression <- lm(Sales ~ Stockprice_quarterly, data = AMD)
  summary(AMD_regression)
  # write the coefficient table; fwrite() cannot serialize a raw lm object
  coefs <- data.table::as.data.table(coef(summary(AMD_regression)), keep.rownames = TRUE)
  data.table::fwrite(coefs, file = paste0("output_", file, ".tsv"), quote = FALSE, sep = "\t")
}
Copy-paste this in R and let me know if it works:
library(readxl)
file <- file.choose() # cross-platform, unlike choose.files()
lmset <- data.frame() # one row of coefficients per sheet
for (i in seq_len(100)) {
  data <- read_excel(file, sheet = i)
  fit <- lm(Sales ~ Stockprice_quarterly, data = data)
  lmset <- rbind(lmset, as.data.frame(t(coef(fit))))
}
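If the goal is one regression per company rather than per sheet, here is a minimal sketch that loops over the unique ticker symbols instead; the column names TickerSymbol, Sales, and Stockprice_quarterly are taken from the question.
library(readxl)
Nasdaq_100 <- read_xlsx("Nasdaq_100_Sales_Data.xlsx")
results <- list()
for (tk in unique(Nasdaq_100$TickerSymbol)) {
  sub <- subset(Nasdaq_100, TickerSymbol == tk)
  results[[tk]] <- lm(Sales ~ Stockprice_quarterly, data = sub)
}
summary(results[["AMD"]]) # inspect any single company's fit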

Parsing large XML to dataframe in R

I have large XML files that I want to turn into dataframes for further processing within R and other programs. This is all being done in macOS.
Each monthly XML is around 1 GB, has 150k records, and 191 different variables. In the end I might not need all 191 variables, but I'd like to keep them and decide later.
The XML files can be accessed here (scroll to the bottom for the monthly zips; when uncompressed, one should look at the "dming" XMLs).
I've made some progress, but processing the larger files takes too long (see below).
The XML looks like this:
<ROOT>
<ROWSET_DUASDIA>
<ROW_DUASDIA NUM="1">
<variable1>value</variable1>
...
<variable191>value</variable191>
</ROW_DUASDIA>
...
<ROW_DUASDIA NUM="150236">
<variable1>value</variable1>
...
<variable191>value</variable191>
</ROW_DUASDIA>
</ROWSET_DUASDIA>
</ROOT>
I hope that's clear enough. This is my first time working with an XML.
I've looked at many answers here and in fact managed to get the data into a dataframe using a smaller sample (using a daily XML instead of the monthly ones) and xml2. Here's what I did:
library(xml2)
raw <- read_xml(filename)
# Find all records
dua <- xml_find_all(raw,"//ROW_DUASDIA")
# Create empty dataframe
dualen <- length(dua)
varlen <- length(xml_children(dua[[1]]))
df <- data.frame(matrix(NA,nrow=dualen,ncol=varlen))
# For loop to enter the data for each record in each row
for (j in 1:dualen) {
  df[j, ] <- xml_text(xml_children(dua[[j]]), trim=TRUE)
}
# Name columns
colnames(df) <- c(names(as_list(dua[[1]])))
I imagine that's fairly rudimentary but I'm also pretty new to R.
Anyway, this works fine with daily data (4-5k records), but it's probably too inefficient for 150k records, and in fact I waited a couple hours and it hadn't finished. Granted, I would only need to run this code once a month but I would like to improve it nonetheless.
I tried to turn the elements for all records into a list using the as_list function within xml2 so I could continue with plyr, but this also took too long.
Thanks in advance.
While there is no guarantee of better performance on larger XML files, the ("old school") XML package maintains a compact data frame handler, xmlToDataFrame, for flat XML files like yours. Nodes missing in one record but present in sibling records result in NA for the corresponding fields.
library(XML)
doc <- xmlParse("/path/to/file.xml")
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA"))
You can even conceivably download the daily zips, unzip the needed XML, and parse it into a data frame, should the large monthly XMLs pose memory challenges. As an example, the code below extracts December 2018 daily data into a list of data frames to be row-bound at the end. The process even adds a DDate field. The method is wrapped in tryCatch to handle missing days in the sequence and other URL or zip issues.
dec_urls <- paste0(1201:1231)
temp_zip <- "/path/to/temp.zip"
xml_folder <- "/path/to/xml/folder"
xml_process <- function(dt) {
  tryCatch({
    # DOWNLOAD ZIP FROM URL
    url <- paste0("ftp://ftp.aduanas.gub.uy/DUA%20Diarios%20XML/2018/dd2018", dt, ".zip")
    file <- paste0(xml_folder, "/dding2018", dt, ".xml")
    download.file(url, temp_zip)
    unzip(temp_zip, files=paste0("dding2018", dt, ".xml"), exdir=xml_folder)
    unlink(temp_zip) # DESTROY TEMP ZIP
    # PARSE XML TO DATA FRAME (paste0, not paste: "2018 1201" would not match %Y%m%d)
    doc <- xmlParse(file)
    df <- transform(xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA")),
                    DDate = as.Date(paste0("2018", dt), format="%Y%m%d", origin="1970-01-01"))
    unlink(file) # DESTROY TEMP XML
    # RETURN XML DF
    return(df)
  }, error = function(e) NA)
}
# BUILD LIST OF DATA FRAMES
dec_df_list <- lapply(dec_urls, xml_process)
# FILTER OUT THE FAILED DAYS CAUGHT IN tryCatch (NROW(NA) is 1, so test the class instead)
dec_df_list <- Filter(is.data.frame, dec_df_list)
# ROW BIND TO FINAL SINGLE DATA FRAME
dec_final_df <- do.call(rbind, dec_df_list)
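As a side note, if some daily files carry a different subset of the 191 variables, data.table's rbindlist() with fill = TRUE is a drop-in replacement for the do.call(rbind, ...) step:
library(data.table)
# fills columns missing on some days with NA instead of erroring
dec_final_df <- rbindlist(dec_df_list, fill = TRUE)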
Here is a solution that processes the entire document at once as opposed to reading each of the 150,000 records in the loop. This should provide a significant performance boost.
This version can also handle cases where the number of variables per record is different.
library(xml2)
doc<-read_xml('<ROOT>
<ROWSET_DUASDIA>
<ROW_DUASDIA NUM="1">
<variable1>value1</variable1>
<variable191>value2</variable191>
</ROW_DUASDIA>
<ROW_DUASDIA NUM="150236">
<variable1>value3</variable1>
<variable2>value_new</variable2>
<variable191>value4</variable191>
</ROW_DUASDIA>
</ROWSET_DUASDIA>
</ROOT>')
#find all of the nodes/records
nodes<-xml_find_all(doc, ".//ROW_DUASDIA")
#find the record NUM and the number of variables under each record
nodenum<-xml_attr(nodes, "NUM")
nodeslength<-xml_length(nodes)
#find the variable names and values
nodenames<-xml_name(xml_children(nodes))
nodevalues<-trimws(xml_text(xml_children(nodes)))
#create dataframe
df <- data.frame(NUM=rep(nodenum, times=nodeslength),
                 variable=nodenames, values=nodevalues, stringsAsFactors = FALSE)
#dataframe is in a long format.
#Use a cast function (e.g. dcast() from reshape2) or spread() from tidyr to convert to wide format
# NUM variable values
# 1 1 variable1 value1
# 2 1 variable191 value2
# 3 150236 variable1 value3
# 4 150236 variable2 value_new
# 5 150236 variable191 value4
#Convert to wide format
library(tidyr)
spread(df, variable, values)
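In newer versions of tidyr, spread() is superseded by pivot_wider(), which does the same reshaping:
#equivalent with the newer tidyr interface
pivot_wider(df, names_from = variable, values_from = values)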

Convert XML Doc to CSV using R - Or search items in XML file using R

I have a large, unorganized XML file that I need to search to determine whether certain ID numbers are in the file. I would like to use R to do so, but because of the format I am having trouble converting it to a data frame, or even a list, to extract to a CSV. I figured I can search easily if it is in CSV format. So I need help understanding how to convert and extract it properly, or how to search the document for values using R. Below is the code I have used to try to convert the doc, but several errors occur with my various attempts.
## Method 1. I tried to convert to a data frame, but the each column is not the same length.
require(XML)
require(plyr)
file<-"EJ.XML"
doc <- xmlParse(file,useInternalNodes = TRUE)
xL <- xmlToList(doc)
data <- ldply(xL, data.frame)
datanew <- read.table(data, header = FALSE, fill = TRUE)
## Method 2. I tried to convert it to a list; the file extracts but only lists 2 words from the file.
data<- xmlParse("EJ.XML")
print(data)
head(data)
xml_data<- xmlToList(data)
class(data)
topxml <- xmlRoot(data)
topxml <- xmlSApply(topxml,function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml),
row.names=NULL)
write.csv(xml_df, file = "MyData.csv",row.names=FALSE)
I am going to do some research on how to search within R as well, but I assume the file needs to be in a data frame or list to do so either way. Any help is appreciated! Attached is a screenshot of the data. I am interested in matching entity ID numbers against a list I have in an Excel doc.
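In case it helps, here is a minimal sketch of searching the XML directly, without converting to CSV, using xml2 (an alternative to the XML package tried above) and XPath. The tag name entity_id and the Excel file layout are pure assumptions, since the screenshot is not reproduced here; substitute the real names.
library(xml2)
library(readxl)
doc <- read_xml("EJ.XML")
# pull the text of every <entity_id> node (tag name is an assumption)
ids <- xml_text(xml_find_all(doc, "//entity_id"))
# the IDs to look for, e.g. from the Excel list (file and column assumed)
wanted <- read_excel("id_list.xlsx")$entity_id
intersect(wanted, ids) # IDs present in both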

Data transpose function in R not working properly

I am using R to do some work but I'm having difficulties transposing data.
My data is in rows and the columns are different variables. When using the function phyDat, the author indicates that a transpose is needed because imported data is stored in columns.
So I use the following code to finish this process:
#read file from local disk in csv format. this format can be generated by save as function of excel.
origin <- read.csv(file.choose(),header = TRUE, row.names = 1)
origin <- t(origin)
events <- phyDat(origin, type="USER", levels=c(0,1))
When I check the data shown in RStudio, it is transposed, but the result is not. So I went back and modified the code as follows:
origin <- read.csv(file.choose(),header = TRUE, row.names = 1)
events <- phyDat(origin, type="USER", levels=c(0,1))
This time the data does not reflect transposed data, and the result is consistent with it.
How I currently solve the problem is transposing the data in CSV file before importing to R. Is there something I can do to fix this problem?
I had the same problem and I solved it by adding one extra step, as follows:
#read file from local disk in csv format. this format can be generated by save as function of excel.
origin <- read.csv(file.choose(),header = TRUE, row.names = 1)
origin <- as.data.frame(t(origin))
events <- phyDat(origin, type="USER", levels=c(0,1))
Maybe it is too late but hope it could help other users with the same problem.
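The reason the extra step helps: t() on a data frame returns a matrix, and as.data.frame() turns the transposed result back into a data frame, which resolved the issue here. A quick illustration:
origin <- data.frame(a = c(0, 1), b = c(1, 0), row.names = c("s1", "s2"))
class(t(origin)) # "matrix" "array"
class(as.data.frame(t(origin))) # "data.frame"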

Quantmod get/viewFinancials in Loop

I would like to cycle through a list of tickers, get their financials, and export them to CSV files in a folder on my desktop. However, I have been running into an error in R related to viewFinancials() in the quantmod package. The code and error are shown below.
So my question is: how do I assign a variable as an object of class financials so that my loop runs properly? Or if anyone has another alternative, I would be excited to hear it!
Here is the error message:
Error in viewFinancials(co.f, "BS", "Q") :
‘x’ must be of type ‘financials’
Here is the code I am working on:
tickers <- c('AAPL','ORCL','MSFT')
for(i in 1:length(tickers)){
  co <- tickers[i] # was tickers[1], which repeated the first ticker on every pass
  #co.f <- paste(co,".f",sep='') #First attempt, was worth a try
  co.f <- getFin(co, auto.assign=T) # automatically assigns data to "co.f" object
  BS.q <- viewFinancials(co.f,'BS',"Q") # quarterly balance sheet
  IS.q <- viewFinancials(co.f,"IS","Q") # quarterly income statement
  CF.q <- viewFinancials(co.f,"CF","Q") # quarterly cash flow statement
  BS <- viewFinancials(co.f,"BS","A") # annual balance sheet
  IS <- viewFinancials(co.f,"IS","A") # annual income statement
  CF <- viewFinancials(co.f,"CF","A") # annual cash flow statement
  d <- Sys.Date()
  combinedA <- rbind(BS,IS,CF)
  combinedQ <- rbind(BS.q,IS.q,CF.q)
  BSAfile <- paste('/Users/dedwards/Desktop/RFinancials/',d,' ',co,'_BS_A.csv',sep='')
  BSQfile <- paste('/Users/dedwards/Desktop/RFinancials/',d,' ',co,'_BS_Q.csv',sep='')
  write.csv(combinedA, file = BSAfile, row.names=TRUE)
  write.csv(combinedQ, file = BSQfile, row.names=TRUE)
}
co.f contains the name of the object in the workspace that actually holds the financials object. To actually use that object, you need to call get(co.f):
obj <- get(co.f)
# now you can use obj where you were previously trying to use co.f
Alternatively it looks like
co.f <- getFin(co, auto.assign = FALSE)
also works and is probably more straightforward.
Rather than writing a loop, you might consider the tidyquant package, which enables multiple stocks to be passed to the tq_get() function. Setting tq_get(get = "financials") will allow you to download the financials for multiple stocks. Here's an example:
library(tidyquant)
c("FB", "AMZN", "NFLX", "GOOG") %>%
tq_get(get = "financials")
This returns a nested data frame of all the financial statement data (income statement, balance sheet, cash flow) in both annual and quarterly periods. You can use the unnest() function to peel away the layers.
If you need to save the data, you can unnest then write to a csv using the write_csv() function.
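For instance, here is a rough sketch of the unnest-and-save step. The column names type and annual are assumptions, since the nested layout can differ across tidyquant versions; inspect names() of the result first.
library(tidyquant)
library(dplyr)
library(tidyr)
library(readr)
fins <- c("FB", "AMZN", "NFLX", "GOOG") %>%
  tq_get(get = "financials")
fins %>%
  filter(type == "BS") %>% # balance sheets only (column name assumed)
  unnest(annual) %>% # peel away the nested annual statements
  write_csv("balance_sheets_annual.csv")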
