How to save variables & stop variables from being overwritten in R?

I have a bunch of functions that save the column numbers of my data. For example, my data looks something like:
>MergedData
[[1]]
Date EUR.HIGH EUR.LOW EUR.CLOSE EUR.OPEN EUR.LAST
01/01/16 1.00 1.00 1.25 1.30 1.24
[[2]]
Date AUD.HIGH AUD.LOW AUD.CLOSE AUD.OPEN AUD.LAST
01/01/16 1.00 1.00 1.25 1.30 1.24
I have 29 of the above currencies. So in this case, MergedData[[1]] will return all of my Euro prices, and so on for 29 currencies.
I also have a function in R that calculates the variables and saves the numbers 1 to 29 that correspond to the currencies. This code calculates values in the first row of my data, i.e.:
sig.nt <- intersect(which(ma.sig[1, ] != 0), which(pricebreak[1, ] != 0))
which returns something like:
>sig.nt
[1] 1 2 5...
And so I can use this to pull up 'trending' currencies via a for() loop:
for (i in sig.nt) {
  MergedData[[i]]
  # ...misc. code for calculations on trending currencies...
}
I want to be able to 'save' my trending currencies for future reference and calculations. The problem is that the sig.nt variable changes with every new row. I was thinking of using the lockBinding command:
sig.exist <- sig.nt # saves existing trend
lockBinding('sig.exist', .GlobalEnv)
But wouldn't this still get overwritten every time I run my script? Help would be much appreciated!
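One way to keep each row's trending set around is sketched below. This is only an illustrative sketch, not from the original post: the list name trend.history, the file name, and the use of saveRDS() are all assumptions. The idea is to accumulate the indices per row in a list, then write that list to disk so it survives new runs of the script.
# Sketch: accumulate each row's trending indices instead of overwriting them.
# 'trend.history' and the file name are illustrative, not from the question.
trend.history <- list()

for (r in 1:nrow(ma.sig)) {
  sig.nt <- intersect(which(ma.sig[r, ] != 0), which(pricebreak[r, ] != 0))
  trend.history[[r]] <- sig.nt   # row r's trending currency numbers
}

saveRDS(trend.history, "trend_history.rds")   # persists across script runs
# later: trend.history <- readRDS("trend_history.rds")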

Related

How can I see the proportion for each answer while making it a tibble?

I'm trying to create a tibble that will allow me to see the proportion for each answer in one of my variables.
Currently, my code looks like this:
drinkchoice <- tibble(prop.table(table(surveyq$drink_choice)))
When running the code, it returns the proportion of each answer in the variable but does not list the answers that go with them. For example, it returns a table like:
0.007
0.04
0.29
0.13
0.09
but when I remove tibble() from the original line of code, it responds back with
pepsi 0.007
fanta 0.04
sprite 0.29
brisk 0.13
coke 0.09
I was wondering if there is any way to code it so that, even when using the tibble() function, the result still includes each answer together with its proportion.
Edit: I added the line tibble::rownames_to_column("Drink"), and I was wondering whether it is possible to rewrite the numbers under the new column, as that would solve my problem.
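One way to keep the drink names next to the proportions is a minimal sketch like the one below (not from the original post; it assumes surveyq$drink_choice is a character or factor column, and the column names "drink" and "prop" are illustrative). as.data.frame() on a table keeps the category labels as a column of their own, so nothing is lost when converting to a tibble afterwards:
library(tibble)

# Sketch: keep the category labels alongside the proportions.
drinkchoice <- as.data.frame(prop.table(table(surveyq$drink_choice)))
names(drinkchoice) <- c("drink", "prop")   # illustrative column names
drinkchoice <- as_tibble(drinkchoice)
drinkchoice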

Power BI: Make a line chart continuous when source contains null values (handle missing values)

This question spun off from a question I posted earlier:
Custom x-axis values in Power BI
Suppose the following data set:
Focus on the second and third rows. How can I make the line in the corresponding graph below continuous, so that it does not stop in the middle?
In Excel I used to solve this problem by applying NA() in a formula generating data for the graph. Is there a similar solution using DAX perhaps?
The short version:
What you should do:
Unleash Python and, after following the steps there, insert this script (don't worry, I've added some details here as well):
import pandas as pd
# retrieve index column
index = dataset[['YearWeek_txt']]
# subset input data for further calculations
dataset2 = dataset[['1', '2']]
# handle missing values
dataset2 = dataset2.fillna(method='ffill')
Then you will be able to set it up like this:
Why you should do it:
For built-in options, these are your choices as far as I know:
1. Categorical YearMonth and categorical x-axis does not work
2. Categorical YearMonth and continuous x-axis does not work
3. Numerical YearMonth and categorical x-axis does not work
4. Numerical YearMonth and continuous x-axis does not work
The details, starting with why built-in approaches fail:
1. Categorical YearMonth and categorical x-axis
I've used the following dataset that resembles the screenshot of your table:
YearWeek 1 2
201603 2.37 2.83
201606 2.55
201607 2.98
201611 2.33 2.47
201615 2.14 2.97
201619 2.15 2.02
201623 2.33 2.8
201627 3.04 2.57
201631 2.95 2.98
201635 3.08 2.16
201639 2.50 2.42
201643 3.07 3.02
201647 2.19 2.37
201651 2.38 2.65
201703 2.50 3.02
201711 2.10 2
201715 2.76 3.04
And NO, I didn't bother manually copying ALL your data. Just your YearWeek series. The rest are random numbers between 2 and 3.
Then I set the data up as numerical for 1 and 2, and YearWeek as type text in the Power Query Editor:
So this is the original setup with a table and chart like yours:
The data is sorted descending by YearWeek_txt:
And the x-axis is set up as Categorical:
Conclusion: Fail
2. Categorical YearMonth and continuous x-axis
With the same setup as above, you can try to change the x-axis type to Continuous:
But as you'll see, it just flips right back to 'Categorical', presumably because the type of YearWeek is text.
Conclusion: Fail
3. Numerical YearMonth and categorical x-axis
I've duplicated the original setup so that I've got two tables, Categorical and Numerical, where the type of YearWeek is text and integer, respectively:
So numerical YearMonth and categorical x-axis still gives you this:
Conclusion: Fail
4. Numerical YearMonth and continuous x-axis
But now, with the same setup as above, you are able to change the x-axis type to Continuous:
And you'll end up with this:
Conclusion: LOL
And now, Python:
In the Power Query Editor, activate the Categorical table, select Transform > Run Python Script and insert the following snippet in the Run Python Script Editor:
# 'dataset' holds the input data for this script
import pandas as pd
# retrieve index column
index = dataset[['YearWeek_txt']]
# subset input data for further calculations
dataset2 = dataset[['1', '2']]
# handle missing values
dataset2 = dataset2.fillna(method='ffill')
Click OK and click Table next to dataset2 here:
And you'll get this (make sure that the column data types are correct):
As you can see, no more missing values. dataset2 = dataset2.fillna(method='ffill') has replaced all missing values with the preceding value in both columns.
Click Close & Apply to get back to the Desktop, and enjoy your table and chart with no more missing values:
Conclusion: Python is cool
End note:
There are a lot of details that can go wrong here with decimal points, data types and so on. Let me know how things work out for you and I'll have a look at it again if it doesn't work on your end.

For Loop for Correlations

I am wanting to get correlation values between two variables for each county.
I have subset my data as shown below and get the appropriate value for the individual Adams county, but am now wanting to do the other counties:
CorrData <- read.csv("H://Correlation Datasets/CorrelationData_Master_Regression.csv")
CorrData2<-subset(CorrData, CountyName=="Adams")
dzCases <- cor.test(CorrData2$NumVisit, CorrData2$dzdx, method = "kendall")
dzCases
I am wanting to do a for loop or something similar that will make the process more efficient, so that I don't have to write 20 different variable correlations for each of the 93 counties.
When I run the following in R, it doesn't give an error, but it doesn't give me the response I was hoping for either. Rather than the Spearman's Correlation for each county, it seems to be ignoring the loop portion and just giving me the correlation between the two variables for ALL counties.
CorrData <- read.csv("H:\\CorrelationData_Master_Regression.csv")
for (i in CorrData$CountyName)
{
  dzCasesYears <- cor.test(CorrData$NumVisit, CorrData$dzdx,
                           method = "spearman")
}
A very small sample of my data looks similar to this:
CountyName Year NumVisits dzdx
Adams 2010 4.545454545 1.19
Adams 2011 20.83333333 0.20
Elmore 2010 26.92307692 0.24
Elmore 2011 0 0.61
Brown 2010 0 -1.16
Brown 2011 17.14285714 -1.28
Clark 2010 25 -1.02
Clark 2011 0 1.13
Cass 2010 17.85714286 0.50
Cass 2011 27.55102041 0.11
I have tried to find a similar example online, but am not having luck!
Thank you in advance for all your help!
You are looping, but you never use your iterator i inside the loop, so every iteration runs the same cor.test on the full data set. Based on the comments, you might also want to make sure the columns are numeric. I also noticed that you are not storing each cor.test result as you iterate, so only the last one is kept. I'm not sure a loop is the most efficient way to do it, but it will be just fine, and since you started with a loop, you should have something of this kind:
dzCasesYears = list() # Prep a list to store your cor.test results
counter = 0           # Index used to store each result in the list
for (i in unique(CorrData$CountyName))
{
  counter = counter + 1
  # Creating new variables makes the code clearer
  x = as.numeric(CorrData[CorrData$CountyName == i, ]$NumVisit)
  y = as.numeric(CorrData[CorrData$CountyName == i, ]$dzdx)
  dzCasesYears[[counter]] <- cor.test(x, y, method = "spearman")
}
And it's always good to wrap the grouping variable in unique() when iterating, so each county is visited only once.
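A small hedged follow-up (not part of the original answer): naming the list elements after the counties makes it easier to pull out a particular county's result later.
# Sketch: name the stored results by county, then access individual pieces.
names(dzCasesYears) <- unique(CorrData$CountyName)
dzCasesYears[["Adams"]]$estimate   # Spearman's rho for Adams
dzCasesYears[["Adams"]]$p.value    # corresponding p value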
data.table makes operations like this very simple.
library('data.table')
CorrData <- as.data.table(read.csv("H:\\CorrelationData_Master_Regression.csv"))
CorrData[, cor(dzdx, NumVisits), CountyName]
With the sample data, it's all negative ones because there are only two points per county, so the correlation is perfect. The full dataset should be more interesting!
CountyName V1
1: Adams -1
2: Elmore -1
3: Brown -1
4: Clark -1
5: Cass -1
Edit to include p values from cor.test as OP asked in the comment
This is also quite simple!
CorrData[, .(cor=cor(dzdx, NumVisits),
p=cor.test(dzdx, NumVisits)$p.value),
CountyName]
...But it won't work with your sample data, as two points per county are not enough for cor.test to compute a p value. Perhaps you could take @smci's advice and dput a larger subset of the data to make your question truly reproducible.
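For completeness, here is a hedged base-R sketch (not from either answer above) that does the same grouping with split() and lapply(); it assumes the visit column is named NumVisit, as in the question's code.
# Sketch: split by county and run cor.test on each piece (base R only).
by.county <- split(CorrData, CorrData$CountyName)
res <- lapply(by.county, function(d) cor.test(d$NumVisit, d$dzdx, method = "spearman"))
sapply(res, function(x) x$estimate)   # rho for each county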

using getSymbols to load different start time variables (time series data)

getSymbols(c("PI", "RSXFS", "TB3MS"), src = "FRED", from = "1959-1-1", from = "1992-1", from = "1934-1-1")
How can I use getSymbols to load data with different start dates for multiple variables?
I need 200 variables from FRED. I can download the FRED codes easily, but the problem is the dates: each variable has a different starting date.
First I load the data set in time series format, and then I will use the window() command to fix the same time period for all 200 series.
Maybe you are looking for mapply:
symbols <- c("PI", "RSXFS", "TB3MS")
begin.date <- c("1959-1-1", "1992-1", "1934-1-1")
jj <- mapply(function(sym, dt) getSymbols(sym, src = "FRED", from = dt, auto.assign = FALSE),
             symbols, begin.date)
head(jj[[3]])
TB3MS
1934-01-01 0.72
1934-02-01 0.62
1934-03-01 0.24
1934-04-01 0.15
1934-05-01 0.16
1934-06-01 0.15
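As a hedged follow-up to the window() step mentioned in the question (not part of the original answer): since getSymbols(..., auto.assign = FALSE) returns xts objects, the list can be merged on the date index and then trimmed to a common period. The start date "1992-01-01" below is only an illustration.
library(quantmod)   # brings in xts/zoo, whose merge() and window() are used here

# Sketch: merge the downloaded series on their date index, then restrict
# them all to one common time period.
merged <- Reduce(merge, jj)
common <- window(merged, start = as.Date("1992-01-01"))
head(common)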

Ensuring temporal data density in R

ISSUE ---------
I have thousands of time series files (.csv) that contain intermittent data spanning between 20 and 50 years (see df). Each file contains the date_time and a metric (temperature). The data is hourly, and where no measurement exists there is an NA.
>df
date_time temp
01/05/1943 11:00 5.2
01/05/1943 12:00 5.2
01/05/1943 13:00 5.8
01/05/1943 14:00 NA
01/05/1943 15:00 NA
01/05/1943 16:00 5.8
01/05/1943 17:00 5.8
01/05/1943 18:00 6.3
I need to check these files to see if they have sufficient data density. I.e. that the ratio of NA's to data values is not too high. To do this I have 3 criteria that must be checked for each file:
Ensure that no more than 10% of the hours in a day are NA's
Ensure that no more than 10% of the days in a month are NA's
Ensure that there are 3 continuous years of data with valid days and months.
Each criterion must be fulfilled sequentially, and if a file does not meet the requirements then I must add it to a data frame (or any list) of the files that do not meet the criteria.
QUESTION--------
I wanted to ask the community how to go about this. I have considered the value of nested if loops, along with using sqldf, plyr, aggregate or even dplyr. But I do not know the simplest way to achieve this. Any example code or suggestions would be very much appreciated.
I think this will work for you. These functions check every hour for NA's in the next day, month, or 3-year window. They are untested because I didn't make up data to test them. Each one returns the number of NA's in the respective sliding window. So for checkdays, if it returns a value greater than 2.4 (10% of 24 hours), then by your 10% rule you'd have a problem; for months the threshold is 72 (10% of 720 hours), and for 3-year periods you're hoping for values below 2628 (10% of 26,280 hours). Again, please check these functions. By the way, the functions assume your NA data is in column 2. Cheers.
# Counts NA's in every rolling 24-hour window (metric assumed in column 2)
checkdays <- function(data){
  countNA = NULL
  for (i in 1:(length(data[, 2]) - 23)) {
    nadata = data[i:(i + 23), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
# Counts NA's in every rolling 720-hour (30-day) window
checkmonth <- function(data){
  countNA = NULL
  for (i in 1:(length(data[, 2]) - 719)) {
    nadata = data[i:(i + 719), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
# Counts NA's in every rolling 26,280-hour (3-year) window
check3years <- function(data){
  countNA = NULL
  for (i in 1:(length(data[, 2]) - 26279)) {
    nadata = data[i:(i + 26279), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
So I ended up testing these. They work for me. Here are system times for a dataset a year long. So I don't think you'll have problems.
> system.time(checkdays(RM_W1))
user system elapsed
0.38 0.00 0.37
> system.time(checkmonth(RM_W1))
user system elapsed
0.62 0.00 0.62
Optimization:
I took the time to run these functions with the data you posted above, and it wasn't good. For loops work well for small data sets but can slow down dramatically as the data grows if they're not constructed properly. I cannot report system times for the functions above with your data (they never finished), but I waited about 30 minutes. After reading this awesome post, Speed up the loop operation in R, I rewrote the functions to be much faster. By minimising the amount of work done inside the loop and pre-allocating memory, you can really speed things up. You now need to call the function like checkdays(df[, 2]), but it's faster this way.
checkdays <- function(data){
  countNA = numeric(length(data) - 23)   # pre-allocate the result vector
  for (i in 1:(length(data) - 23)) {
    nadata = data[i:(i + 23)]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
> system.time(checkdays(df[,2]))
user system elapsed
4.41 0.00 4.41
I believe this should be sufficient for your needs. Regarding leap years, you should be able to modify the optimized function as I mentioned in the comments; however, make sure you pass a leap-year dataset as a second dataset rather than as a second column.
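To tie this back to the stated goal of listing the files that fail the criteria, here is a hedged sketch (not from the answer above). It assumes all three check functions have been rewritten in the optimized, vector-input form (only checkdays was shown that way; checkmonth and check3years would follow the same pattern), that the temperature sits in column 2 of each file, and that the folder path is a placeholder. The third criterion is simplified to "at least one 3-year window is dense enough", which you may want to refine.
# Sketch: flag files that fail the density criteria, using the 10% thresholds
# discussed above (2.4 NA hours per day window, 72 per month window, 2628 per
# 3-year window). Path and column position are illustrative assumptions.
files <- list.files("path/to/temperature/csvs", pattern = "\\.csv$", full.names = TRUE)

failed <- character(0)
for (f in files) {
  temp <- read.csv(f)[, 2]                 # temperature assumed in column 2
  ok <- all(checkdays(temp) <= 2.4) &&
        all(checkmonth(temp) <= 72) &&
        any(check3years(temp) <= 2628)     # some 3-year window is dense enough
  if (!ok) failed <- c(failed, basename(f))
}
failed.df <- data.frame(file = failed)     # the files that do not meet the criteria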
