reading monthly returns in R

I have the following monthly equity data in file "equity.dat":
2010-03,1e+06
2010-03,1.00611e+06
2010-04,998465
2010-05,1.00727e+06
2010-06,1.00965e+06
I am trying to compute the monthly returns using the following code:
library(PerformanceAnalytics)
y = Return.read(filename="equity.dat", frequency = "m", sep=",", header=FALSE)
y
z = Return.calculate(y)
z
z[1]=0 #added this to remove the NA in first return
but I get the following error:
Error in read.zoo(filename, sep = sep, format = format, FUN = FUN, header = header, :
index has bad entries at data rows: 1 2 3 4 5
I checked the formatting that Return.read expects when using as.mon, which is why I am using yyyy-mm. Should I be using a different format?

According to ?Return.read, the default format.in= here is "%F", which is not the format of your data, so the format has to be specified. Also, the index must be unique (in this case it is not, since 2010-03 appears twice) or else it must be aggregated, as per ?zoo and ?read.zoo, the latter of which Return.read uses internally:
Return.read(filename = "equity.dat", frequency = "m", sep = ",", header = FALSE,
aggregate = function(x) tail(x, 1), format = "%Y-%m")
We have used tail to define the aggregating function -- you may or may not wish to use something else.
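To make the effect of the aggregating function concrete, here is a minimal base-R sketch of the same idea. It is only an illustration and does not reproduce what Return.read or read.zoo do internally: the data frame `equity` and the names `last_per_month` and `returns` are made up for this example, and the returns are computed as simple period-over-period changes.

```r
# Hypothetical in-memory copy of the data in "equity.dat"
equity <- data.frame(
  month = c("2010-03", "2010-03", "2010-04", "2010-05", "2010-06"),
  value = c(1e6, 1.00611e6, 998465, 1.00727e6, 1.00965e6)
)

# Collapse duplicated months by keeping the last value seen for each month,
# which mirrors aggregate = function(x) tail(x, 1) in the answer above
last_per_month <- aggregate(value ~ month, data = equity,
                            FUN = function(x) tail(x, 1))

# Simple returns: (this month - previous month) / previous month
# The first month has no predecessor, so its return is NA
returns <- c(NA, diff(last_per_month$value) / head(last_per_month$value, -1))

print(last_per_month)
print(returns)
```

If the first 2010-03 row is actually a starting balance rather than a duplicate, a different aggregating function (or relabelling that row) may be more appropriate.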

Related

How can I process my StringTie data so that I can run DEseq2 using R?

I have StringTie data for a parental cell line and a KO cell line (which I'll refer to as B10). I am interested in comparing the parental and B10 cell lines. The issue seems to be that my StringTie files are separate, meaning I have one for the parental cell line and one for B10. I've included the code I have written to date for context along with the error messages I received and troubleshooting steps I have already tried. I have no idea where to go from here and I'd appreciate all the help I could get. This isn't something that anyone in my lab has done before so I'm struggling to do this without any guidance.
Thank you all in advance!
`# My code to go from StringTie to count data:
(I copy-pasted this so all my notes are included. I'm new to R so they're really just for me. I'm not trying to condescendingly explain to everyone what every bit of the code means. You all likely know much more than I do)
# Open Data
# List StringTie output files for all samples
# All files should be in same directory
files_B10 <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/B10", recursive = TRUE, full.names = TRUE)
files_parental <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/parental", recursive = TRUE, full.names = TRUE)
tmp_B10 <- read_tsv(files_B10[1])
tx2gene_B10 <- tmp_B10[, c("t_name", "gene_name")]
txi_B10 <- tximport(files_B10, type = "stringtie", tx2gene = tx2gene_B10)
tmp_parental <- read_tsv(files_parental[1])
tx2gene_parental <- tmp_parental[, c("t_name", "gene_name")]
txi_parental <- tximport(files_parental, type = "stringtie", tx2gene = tx2gene_parental)
# Create a filter (logical vector) showing which rows have at least two columns with more than 5 counts
txi_B10.filter<-apply(txi_B10$counts,1,function(x) length(x[x>5])>=2)
txi_parental.filter<-apply(txi_parental$counts,1,function(x) length(x[x>5])>=2)
head(txi_parental.filter)
sum(txi_B10.filter)
# Now filter the txi object to keep only the rows of $counts, $abundance, and $length where the filter is TRUE
txi_B10$counts<-txi_B10$counts[txi_B10.filter,]
txi_B10$abundance<-txi_B10$abundance[txi_B10.filter,]
txi_B10$length<-txi_B10$length[txi_B10.filter,]
txi_parental$counts<-txi_parental$counts[txi_parental.filter,]
txi_parental$abundance<-txi_parental$abundance[txi_parental.filter,]
txi_parental$length<-txi_parental$length[txi_parental.filter,]
# save count data as csv files
write.csv(txi_B10$counts, "txi_B10.counts.csv")
write.csv(txi_parental$counts, "txi_parental.counts.csv")
# Open count data
# Do this in order that the files are organized in file manager
txi_B10_counts <- read_csv("txi_B10.counts.csv")
txi_parental_counts <- read_csv("txi_parental.counts.csv")
# Set column names
colnames(txi_B10_counts) = c("Gene_name", "B10_n1", "B10_n2")
View(txi_B10_counts)
colnames(txi_parental_counts) = c("Gene_name", "parental_n1", "parental_n2")
View(txi_parental_counts)
## R is case sensitive so you just wanna ensure that everything is in the same case
## convert Gene names which is column [[1]] into lowercase
txi_parental_counts[[1]] <- tolower( txi_parental_counts[[1]])
View(txi_parental_counts)
txi_B10_counts[[1]] <- tolower(txi_B10_counts[[1]])
View(txi_B10_counts)
## Capitalize the first letter of each gene name
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_parental_counts$Gene_name <- capFirst(txi_parental_counts$Gene_name)
View(txi_parental_counts)
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_B10_counts$Gene_name <- capFirst(txi_B10_counts$Gene_name)
View(txi_B10_counts)
# Merge PL and KO into one table
# full_join takes all counts from PL and KO even if the gene names are missing
# If a value is missing it writes it as NA
# This site explains different types of merging https://remiller1450.github.io/s230s19/Merging_and_Joining.html
mergedCounts <- full_join (x = txi_parental_counts, y = txi_B10_counts, by = "Gene_name")
view(mergedCounts)
# Replace NA with value = 0
mergedCounts[is.na(mergedCounts)] = 0
view(mergedCounts)
# Save file for merged counts
write.csv(mergedCounts, "MergedCounts.csv")
## --------------------------------------------------------------------------------
# My code to go from count data to DEseq2
# Import data
# I added my metadata in case the issue is how I set up the columns
# metaData is a file with your sample names and Comparison
# Your second column in metadata must be called Comparison, otherwise you'll get an error in the dds line
metaData <- read.csv('metadata.csv', header = TRUE, sep = ",")
countData <- read.csv('MergedCounts.csv', header = TRUE, sep = ",")
# Assign "Gene Names" as row names
# Notice how there's suddenly an extra row (x)?
# R automatically created and assigned column x as row names
# If you don't fix this the # of columns won't add up
rownames(countData) <- countData[,1]
countData <- countData[,-1]
# Create DEseq2 object
# !!!!!!! Here is where I get stuck!!!!!!!
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = metaData,
design = ~ Comparison, tidy = TRUE)
# I can't run this line
# It says Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers
## --------------------------------------------------------------------------------
# How I tried to fix this:
# 1) I saw something here that suggested this might be an issue with having zeros in the count data
# I viewed the countData files to make sure there were no zeros and there weren't any
# I thought that would be the case since I replaced NA with value = 0 earlier using this bit of code
mergedCounts[is.na(mergedCounts)] = 0
view(mergedCounts)
# 2) I was then informed that StringTie outputs non integer values
# It was recommended that I try DESeqDataSetFromTximport instead
dds <- DESeqDataSetFromTximport(countData,
colData = metaData,
design = ~ Comparison, tidy = TRUE)
# I can't run this line either
# It says Error in DESeqDataSetFromTximport(countData, colData = metaData, design = ~Comparison, : is(txi, "list") is not TRUE
# I think this might be because merging the parental and B10 counts led to a file that's no longer a txi or accessible through Tximport
# It seems like this should be done with the original StringTie files from the very beginning of the code
# My concern with doing that is that the files for parental and B10 are separate so I don't see how I could end up comparing the two
# I think this approach would work if I were interested in comparing n1 versus n2 for each cell line but that is not of interest to me
`

Plot function outside the candlestick pattern in R

I have two xts objects: stock and base. I calculate the relative strength (which is simply the ratio of closing price of stock and of the base index) and I want to plot the weekly relative strength outside the candlestick pattern. The links for the data are here and here.
library(quantmod)
library(xts)
read_stock = function(fichier){ #read and preprocess data
stock = read.csv(fichier, header = T)
stock$DATE = as.Date(stock$DATE, format = "%d/%m/%Y") #standardize time format
stock = stock[! duplicated(index(stock), fromLast = T),] # Remove rows with a duplicated timestamp,
# but keep the latest one
stock$CLOSE = as.numeric(stock$CLOSE) #current numeric columns are of type character
stock$OPEN = as.numeric(stock$OPEN) #so need to convert into double
stock$HIGH = as.numeric(stock$HIGH) #otherwise quantmod functions won't work
stock$LOW = as.numeric(stock$LOW)
stock$VOLUME = as.numeric(stock$VOLUME)
stock = xts(x = stock[,-1], order.by = stock[,1]) # convert to xts class
return(stock)
}
relative.strength = function(stock, base = read_stock("vni.csv")){
rs = Cl(stock) / Cl(base)
rs = apply.weekly(rs, FUN = mean)
}
stock = read_stock("aaa.csv")
candleChart(stock, theme='white')
addRS = newTA(FUN=relative.strength,col='red', legend='RS')
addRS()
However R returns me this error:
Error in `/.default`(Cl(stock), Cl(base)) : non-numeric argument to binary operator
How can I fix this?

One problem is that "vni.csv" contains a "Ticker" column. Since xts objects are a matrix at their core, you can't have columns of different types. So the first thing you need to do is ensure that you only keep the OHLC and volume columns of the "vni.csv" file. I've refactored your read_stock function to be:
read_stock = function(fichier) {
# read and preprocess data
stock <- read.csv(fichier, header = TRUE, as.is = TRUE)
stock$DATE = as.Date(stock$DATE, format = "%d/%m/%Y")
stock = stock[!duplicated(index(stock), fromLast = TRUE),]
# convert to xts class
stock = xts(OHLCV(stock), order.by = stock$DATE)
return(stock)
}
Next, it looks like the first argument to relative.strength inside the addRS function is passed as a matrix, not an xts object. So you need to convert to xts, but take care that the index class of the stock object is the same as the index class of the base object.
Then you need to make sure your weekly rs object has an observation for each day in stock. You can do that by merging your weekly data with an empty xts object that has all the index values for the stock object.
So I refactored your relative.strength function to:
relative.strength = function(stock, base) {
# convert to xts
sxts <- as.xts(stock)
# ensure 'stock' index class is the same as 'base' index class
indexClass(sxts) <- indexClass(base)
index(sxts) <- index(sxts)
# calculate relative strength
rs = Cl(sxts) / Cl(base)
# weekly mean relative strength
rs = apply.weekly(rs, FUN = mean)
# merge 'rs' with an empty xts object containing the same index values as 'stock'
merge(rs, xts(,index(sxts)), fill = na.locf)
}
Now, this code:
stock = read_stock("aaa.csv")
base = read_stock("vni.csv")
addRS = newTA(FUN=relative.strength, col='red', legend='RS')
candleChart(stock, theme='white')
addRS(base)
Produces this chart:

The following line in your read_stock function is causing the problem:
stock = xts(x = stock[,-1], order.by = stock[,1]) # convert to xts class
vni.csv has the actual symbol name in the third column of your data, so when you put stock[,-1] you're actually including a character column and xts forces all the other columns to be characters as well. Then R alerts you about dividing a number by a character at Cl(stock) / Cl(base). Here is a simple example of this error message with division:
> x <- c(1,2)
> y <- c("A", "B")
> x/y
Error in x/y : non-numeric argument to binary operator
I suggest you remove the character column in vni.csv that contains "VNIndex" in every row or modify your function called read_stock() to better protect against this type of issue.
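The coercion described above can be reproduced without xts, since it comes from the underlying matrix. The following is a small base-R sketch with made-up column values (`num_only`, `mixed` and `ratio_attempt` are illustrative names, not part of the original code):

```r
# A matrix can hold only one type, so one character column
# turns every column into character
num_only <- cbind(close = c(10.5, 11.0), volume = c(100, 200))
mixed    <- cbind(close = c(10.5, 11.0), ticker = c("VNIndex", "VNIndex"))

print(typeof(num_only))  # numeric storage
print(typeof(mixed))     # character storage, including the "close" column

# Dividing by the coerced column fails the same way as Cl(stock) / Cl(base);
# tryCatch captures the error message instead of stopping the script
ratio_attempt <- tryCatch(
  num_only[, "close"] / mixed[, "close"],
  error = function(e) conditionMessage(e)
)
print(ratio_attempt)

# Converting the column back to numeric makes the division work again
ratio_fixed <- num_only[, "close"] / as.numeric(mixed[, "close"])
print(ratio_fixed)
```

Dropping the character column before building the xts object, as both answers suggest, avoids the need for any conversion afterwards.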

Benford - Dataset with NA strings returns an error in extract.digits

I have a dataset of macroeconomic data like GDP, inflation, etc., where rows = different macroeconomic indicators and columns = years.
Since some values are missing (e.g. the GDP of a given country in a given year), they are coded as "NA".
When I perform these operations:
#
data = read.table("14varnumeros.txt", header = FALSE, sep = "", na.strings = "NA", dec = ".", strip.white = TRUE)
benford(data, number.of.digits = 1, sign = "both", discrete=TRUE, round=3)
#
It gives me this error:
Error in extract.digits(data, number.of.digits, sign, second.order, discrete = discrete, :
Data must be a numeric vector
I assume that this is because of the NA strings, but I do not know how to solve it.

I came across this issue, too. In my case, it wasn't missing data, instead it's because of a quirk in the extract.digits() function of the benford.analysis package. The function is checking if the data supplied to it is numeric data, but it does so using class(dat) != "numeric" instead of using the is.numeric() function.
This produces unexpected errors. Consider the code below:
library(benford.analysis)
dat <- data.frame(v1 = 1:5, v2 = c(1, 2, 3, 4, 5))
benford(dat$v1) # produces error
I've submitted an issue on Github, but you can simply wrap your data in as.numeric() and you should be fine.
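The difference between the two checks can be seen in base R alone, without loading benford.analysis. In this sketch the variable names `v_int`, `v_dbl` and `v_fixed` are made up for illustration:

```r
v_int <- 1:5                  # stored as integer
v_dbl <- c(1, 2, 3, 4, 5)     # stored as double

# class() distinguishes integer from double ...
print(class(v_int))
print(class(v_dbl))

# ... so a check written as class(x) != "numeric" rejects integer vectors,
# even though is.numeric() accepts both
print(class(v_int) != "numeric")
print(is.numeric(v_int))

# Wrapping the data in as.numeric() converts it to double,
# which is the workaround suggested above
v_fixed <- as.numeric(v_int)
print(class(v_fixed))
```

Note that as.numeric() works on a vector, so with a data frame it would presumably need to be applied to one column at a time.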

defining empty data frame with string columns and numeric columns and then adding values later

I am trying to define a data frame initially and then add rows using rbind one by one later. My problem is that in the following definition of result, I want humanReplicate, ratReplicate to be in string format and other columns to be in numeric format.
result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate",
pvalue = "p-value", alternative = "alternative_hypothesis", Conf.int1 = "conf.int1",
Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
Then, I am currently adding new rows (which are results of a Fisher's test) in the following way :
newLine <- data.frame(t(c(humanReplicate = humanReplicateName,
ratReplicate = ratReplicateName, pvalue = fisherTest$p,
alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1],
Conf.int2 =fisherTest$conf.int[2], oddratio = fisherTest$estimate[[1]])))
result <-rbind(result,newLine)
Here all the values of fisherTest$X are numeric. humanReplicateName and ratReplicateName are strings. But if I define result in the above way and then define newLine in this way, all the columns of the data frame become strings. I understand that when I define result here, I am making all of them strings. But if I want a mixture of strings and numbers, how should I define it?
My final goal is to get result as csv file and I am doing the following to do this:
write.table(result, file = "newData4.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")

Your problem is the concatenate function c(). You do not need it, nor the t() function. Indeed,
tmp <- data.frame(t(c("test" = 2,"human" = "Paul")))
appears to give the same result as this line
tmp2 <- data.frame("test" = 2,"human" = "Paul")
> tmp
test human
1 2 Paul
> tmp2
test human
1 2 Paul
but
> tmp$test
test
2
Levels: 2
> tmp2$test
[1] 2
and
> is.numeric(tmp$test)
[1] FALSE
> is.numeric(tmp2$test)
[1] TRUE
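What happens is that c() first builds a single vector, and because that vector mixes text and numbers, every element is coerced to a string (which data.frame() then converts to a factor, as the Levels: line shows). With the direct call to data.frame(), you fill two different and independent columns, each keeping its own type.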
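Applied to the question, one way to set this up is to create the empty data frame from typed zero-length vectors and to build each new row with data.frame() directly. This is a sketch under assumptions: the replicate names and the 2x2 table passed to fisher.test() are invented for illustration, only some of the columns are shown, and stringsAsFactors = FALSE is set explicitly so the string columns stay character in any R version.

```r
# Empty result with explicit column types instead of placeholder strings
result <- data.frame(
  humanReplicate = character(),
  ratReplicate   = character(),
  pvalue         = numeric(),
  Conf.int1      = numeric(),
  Conf.int2      = numeric(),
  oddratio       = numeric(),
  stringsAsFactors = FALSE
)

# Hypothetical inputs standing in for the real replicate names and counts
humanReplicateName <- "human_1"
ratReplicateName   <- "rat_1"
fisherTest <- fisher.test(matrix(c(3, 1, 1, 3), nrow = 2))

# Build the row with data.frame() directly: no c(), no t()
newLine <- data.frame(
  humanReplicate = humanReplicateName,
  ratReplicate   = ratReplicateName,
  pvalue         = fisherTest$p.value,
  Conf.int1      = fisherTest$conf.int[1],
  Conf.int2      = fisherTest$conf.int[2],
  oddratio       = fisherTest$estimate[[1]],
  stringsAsFactors = FALSE
)

result <- rbind(result, newLine)
str(result)
```

The write.table() call from the question can then be used unchanged. When many rows are added in a loop, collecting the rows in a list and combining them once with do.call(rbind, ...) is usually faster than growing the data frame row by row.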

Passing partial column names to function

I am looking to use a function to speed up a data cleaning process. In the example shown I am looking to remove values reported in the am and pm columns if the ".no" column for that day has a value of 1.
df1 = data.frame (identifier = c(1:4),
mon.no = c(1,NA,NA,NA),mon.am = c(2,1,NA,3),mon.pm = c(3,4,NA,5),
tues.no = c(NA,NA,1,NA),tues.am = c(2,3,1,4),tues.pm = c(3,3,2,3))
I envisage using a function that takes the day and cleans the data:
clean1 = function (day) {
df1$day.am[df1$day.no==1] = NA
df1$day.pm[df1$day.no==1] = NA
return (df1)}
df2 = clean1(mon)
However this returns the following error.
Error in `$<-.data.frame`(`*tmp*`, "day.am", value = logical(0)) :
replacement has 0 rows, data has 4
I assume that this is because the function expects a full column name and cannot fill in the gaps around a text input? Is it possible to use a function in that way?
Having read these notes I think that it would be better practice to have my data in a tidy format, and I am working on a solution which involves reorganising my data. However, it would also be handy to be able to do this while the data is in its original format.
Thanks.

You're really close. @Tyler Rinker has explained in the comments why it doesn't work. Here's a fix:
clean1 = function (day) {
day.am = paste(day, "am", sep=".") # make a string from the variable day and the suffixes
day.pm = paste(day, "pm", sep=".")
day.no = paste(day, "no", sep=".")
df1[day.am][df1[day.no]==1] = NA
df1[day.pm][df1[day.no]==1] = NA
return (df1)}
df2 = clean1("mon") # "mon" should be a string
Somebody else might offer more efficient ways of doing this. Note that you're only ever working from your original df1 here. If you now run
df3 = clean1("tues")
you won't get a dataframe with both days cleaned. You could fix this by supplying the dataframe to be acted on to the function too:
clean2 = function(df, day){...
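One possible way to write such a function is sketched below. It is not the answerer's elided code: the name `clean_day` and its body are assumptions made for illustration, and it uses `[[` indexing with an explicit NA check rather than the matrix-style indexing shown above.

```r
df1 <- data.frame(identifier = 1:4,
  mon.no  = c(1, NA, NA, NA), mon.am  = c(2, 1, NA, 3), mon.pm  = c(3, 4, NA, 5),
  tues.no = c(NA, NA, 1, NA), tues.am = c(2, 3, 1, 4),  tues.pm = c(3, 3, 2, 3))

# Hypothetical helper: takes the data frame and the day prefix as a string
clean_day <- function(df, day) {
  no_col <- paste(day, "no", sep = ".")
  # Rows where the ".no" column equals 1; NA is treated as "not flagged"
  flagged <- !is.na(df[[no_col]]) & df[[no_col]] == 1
  for (suffix in c("am", "pm")) {
    col <- paste(day, suffix, sep = ".")
    df[[col]][flagged] <- NA
  }
  df
}

# Chain the calls so that each day is cleaned on top of the previous result
df2 <- clean_day(df1, "mon")
df3 <- clean_day(df2, "tues")
print(df3)
```

Because the data frame is passed in and returned, df1 itself is left unchanged and the cleaned days accumulate in df3.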
