Error in correlation matrix of a data frame - r

R version 3.5.1.
I want to create a correlation matrix from a data frame that looks like this:
BGI WINDSPEED SLOPE
4277.2 4.23 7.54
4139.8 5.25 8.63
4652.9 3.59 6.54
3942.6 4.42 10.05
I put that as an example but I have more than 20 columns and over 40,000 entries.
I tried using the following code:
corrplot(cor(Site.2), method = "color", outline = "white")
Matrix_Site <- cor(Site.2, method = "spearman")
but every time, the same error appears:
Error in cor(Site.2) : 'x' must be numeric
I would like to correlate every variable of the data frame with each other and create a graph and a table with it, similar to this.
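The error itself hints at the cause: cor() only accepts numeric input, so Site.2 most likely contains at least one non-numeric column (for example a character or factor column such as a site name). A minimal sketch of one way around this, assuming the data frame is called Site.2 as in the question:
library(corrplot)
# keep only the numeric columns before computing the correlation matrix
num_cols <- sapply(Site.2, is.numeric)
Matrix_Site <- cor(Site.2[, num_cols], method = "spearman")
corrplot(Matrix_Site, method = "color", outline = "white")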

Missing data warning R

I have a data frame with climatic values like temperature_max, temperature_min... at different locations. The data collection is a time series, and on some specific days no data was registered. I would like to impute the missing values taking into account the date and also the location (the place variable in the data frame).
I have tried to impute those missing values with Amelia, but no imputation is done and a warning is printed.
Checking variables:
head(df): PLACE, DATE, TEMP_MAX, TEMP_MIN, TEMP_AVG
PLACE DATE TEMP_MAX TEMP_MIN TEMP_AVG
F 12/01/2007 19.7 2.5 10.1
F 13/01/2007 18.8 3.5 10.4
F 14/01/2007 17.3 2.4 10.4
F 15/01/2007 19.5 4.0 9.2
F 16/01/2007
F 17/01/2007 21.5 2.8 9.7
F 18/01/2007 17.7 3.3 12.9
F 19/01/2007 18.3 3.8 9.7
A 16/01/2007 17.7 3.4 9.7
A 17/01/2007
A 18/01/2007 19.7 6.2 10.4
A 19/01/2007 17.7 3.8 10.1
A 20/01/2007 18.6 3.8 12.9
This is just some of the records of my data set.
DF = amelia(df, m=4, ts= c("DATE"), cs = c("PLACE"))
where DATE is the time series variable (01/01/2001, 02/01/2001, 03/01/2001...), but if you filter by PLACE the time series are not equal (not the same start and end dates).
I have 3 questions:
1. I am not sure whether the time series should be complete for all the places, i.e. the same start and end dates everywhere.
2. I am not using the lags or polytime parameters, so am I imputing correctly while taking the time-series structure into account? I am not sure how to use the lag parameter, although I have checked the R package documentation.
3. When I run the code above, a warning is printed and no imputation is done:
Warning: There are observations in the data that are completely missing.
These observations will remain unimputed in the final datasets.
-- Imputation 1 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 2 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 3 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 4 --
No missing data in bootstrapped sample: EM chain unnecessary
Can someone help me with this?
Thanks very much for your time!
For the software it does not matter whether you have different start and end dates for different places. I think that is more up to you and how you think about the data: ask yourself whether those periods count as missing data (missing at random), and therefore whether you should create empty rows for them in your data set or not.
You want to use lags in order to use past values of the variable to improve the prediction of missing values. It is not mandatory (i.e., the function can impute missing data even without such a specification) but it can be useful.
I contacted the author of the package and he told me that you need to specify the splinetime or polytime arguments to make sure that Amelia will use the time-series information to impute. For instance, if you set polytime = 3, it will impute based on a cubic of time. If you do that, I think you shouldn't see that error anymore.
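A hedged sketch of what such a call might look like, reusing the objects from the question (polytime = 3 gives a cubic of time; lags is optional):
library(Amelia)
# note: DATE may need to be a Date or numeric time index rather than a character string
DF <- amelia(df, m = 4, ts = "DATE", cs = "PLACE",
             polytime = 3,
             lags = c("TEMP_MAX", "TEMP_MIN", "TEMP_AVG"))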

Power BI: Make a line chart continuous when source contains null values (handle missing values)

This question spun off from a question I posted earlier:
Custom x-axis values in Power BI
Suppose the following data set:
Focus on the second and third rows. How can I make the line in the corresponding graph below continuous instead of stopping in the middle?
In Excel I used to solve this problem by applying NA() in a formula generating data for the graph. Is there a similar solution using DAX perhaps?
The short version:
What you should do:
Enable Python scripting in Power BI and, after following the steps to set it up, insert this script (don't worry, I've added some details below as well):
import pandas as pd
# retrieve index column
index = dataset[['YearWeek_txt']]
# subset input data for further calculations
dataset2 = dataset[['1', '2']]
# handle missing values
dataset2 = dataset2.fillna(method='ffill')
Then you will be able to set it up like this:
Why you should do it:
For built-in options, these are your choices as far as I know:
1. Categorical YearMonth and categorical x-axis does not work
2. Categorical YearMonth and continuous x-axis does not work
3. Numerical YearMonth and categorical x-axis does not work
4. Numerical YearMonth and continuous x-axis does not work
The details, starting with why built-in approaches fail:
1. Categorical YearMonth and categorical x-axis
I've used the following dataset that resembles the screenshot of your table:
YearWeek 1 2
201603 2.37 2.83
201606 2.55
201607 2.98
201611 2.33 2.47
201615 2.14 2.97
201619 2.15 2.02
201623 2.33 2.8
201627 3.04 2.57
201631 2.95 2.98
201635 3.08 2.16
201639 2.50 2.42
201643 3.07 3.02
201647 2.19 2.37
201651 2.38 2.65
201703 2.50 3.02
201711 2.10 2
201715 2.76 3.04
And NO, I didn't bother manually copying ALL your data. Just your YearWeek series. The rest are random numbers between 2 and 3.
Then I set the data types to numerical for 1 and 2, and YearWeek as type text in the Power Query Editor:
So this is the original setup with a table and chart like yours:
The data is sorted descending by YearWeek_txt:
And the x-axis is set up as Categorical:
Conclusion: Fail
2. Categorical YearMonth and continuous x-axis
With the same setup as above, you can try to change the x-axis type to Continuous:
But as you'll see, it just flips right back to 'Categorical', presumably because the type of YearWeek is text.
Conclusion: Fail
3. Numerical YearMonth and categorical x-axis
I've duplicated the original setup so that I've got two tables Categorical and Numerical where the type of YearWeek are text and integer, respectively:
So numerical YearMonth and categorical x-axis still gives you this:
Conclusion: Fail
4. Numerical YearMonth and continuous x-axis
But now, with the same setup as above, you are able to change the x-axis type to Continuous:
And you'll end up with this:
Conclusion: LOL
And now, Python:
In the Power Query Editor, activate the Categorical table, select Transform > Run Python Script and insert the following snippet in the Run Python Script Editor:
# 'dataset' holds the input data for this script
import pandas as pd
# retrieve index column
index = dataset[['YearWeek_txt']]
# subset input data for further calculations
dataset2 = dataset[['1', '2']]
# handle missing values
dataset2 = dataset2.fillna(method='ffill')
Click OK and click Table next to dataset2 here:
And you'll get this (make sure that the column data types are correct):
As you can see, no more missing values. dataset2 = dataset2.fillna(method='ffill') has replaced all missing values with the preceding value in both columns.
Click Close & Apply to get back to the Desktop, and enjoy your table and chart with no more missing values:
Conclusion: Python is cool
End note:
There are a lot of details that can go wrong here with decimal points, data types and so on. Let me know how things work out for you and I'll have a look at it again if it doesn't work on your end.

How to save variables & stop variables from being overwritten in R?

So I have a bunch of functions that save the column numbers of my data. For example, my data looks something like:
>MergedData
[[1]]
Date EUR.HIGH EUR.LOW EUR.CLOSE EUR.OPEN EUR.LAST
01/01/16 1.00 1.00 1.25 1.30 1.24
[[2]]
Date AUD.HIGH AUD.LOW AUD.CLOSE AUD.OPEN AUD.LAST
01/01/16 1.00 1.00 1.25 1.30 1.24
I have 29 of the above currencies. So in this case, MergedData[[1]] will return all of my Euro prices, and so on for 29 currencies.
I also have a function in R that calculates the variables and saves the numbers 1 to 29 that correspond to the currencies. This code calculates values for the first row of my data, i.e.:
sig.nt <- intersect(which(!ma.sig[1, ] == 0), which(!pricebreak[1, ] == 0))
which returns something like:
>sig.nt
[1] 1 2 5...
And so I can use this to pull up 'trending' currencies via a for() loop:
for (i in sig.nt) {
  MergedData[[i]]
  # ...misc. code for calculations on trending currencies...
}
I want to be able to 'save' my trending currencies for future reference and calculations. The problem is that the sig.nt variable changes with every new row. I was thinking of using the lockBinding command:
sig.exist <- sig.nt  # saves the existing trend
lockBinding("sig.exist", .GlobalEnv)
But wouldn't this still get overwritten every time I run my script? Help would be much appreciated!
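One possible pattern (a sketch, not from the original post): instead of locking a binding, store each row's result as a separate element of a list, so earlier trends are never overwritten:
# keep one entry per data row; earlier entries are never touched again
trend_history <- list()
for (r in seq_len(nrow(ma.sig))) {
  trend_history[[r]] <- intersect(which(!ma.sig[r, ] == 0),
                                  which(!pricebreak[r, ] == 0))
}
# trend_history[[1]] still holds the first row's trending currencies after later rows run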

Function to identify changes done previously

BACKGROUND
I have a list of 16 data frames. One of the data frames looks like this; all the others have a similar format. The DateTime column is of class Date while the Value column is of a time series class.
> head(train_data[[1]])
DateTime Value
739 2009-07-31 49.9
740 2009-08-31 53.5
741 2009-09-30 54.4
742 2009-10-31 56.0
743 2009-11-30 54.4
744 2009-12-31 55.3
I am forecasting the Value column across all the data frames in this list. The following line of code feeds data into the UCM model.
train_dataucm <- lapply(train_data, transform, Value = ifelse(Value > 50000 , Value/100000 , Value ))
The transform call is used to scale down large values because UCM has some issues with large values (I don't know why, though); I picked that up from user #KRC in this link.
One data frame was affected because it had large values, which were divided by 100000. All the other data frames remained unaffected.
> head(train_data[[5]])
DateTime Value
715 2009-07-31 139901
716 2009-08-31 139492
717 2009-09-30 138818
718 2009-10-31 138432
719 2009-11-30 138659
720 2009-12-31 138013
I only got to know this because I manually checked each of the data frames.
PROBLEM
Is there a function that can call out the data frames that were affected by the condition I inserted?
The function should list the affected data frames and put them into a list.
If I can do that, I can then reverse the transformation on those values and get the actual values back.
This way I can give correct forecasts with minimal human intervention.
I hope I have stated the problem clearly.
Thank You.
Simply check whether any of your values in a data frame is too high:
has_too_high_values <- function(df) any(df$Value > 50000)
And then collect them, e.g. using Filter:
Filter(has_too_high_values, train_data)
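If you also want the positions of the affected elements (for example to rescale the corresponding forecasts afterwards), a small extension of the same idea (a sketch, not part of the original answer):
# positions in the list of the data frames that triggered the condition
affected <- which(vapply(train_data, has_too_high_values, logical(1)))
affected
# if 'forecasts' were a list of forecasts in the same order (hypothetical
# name, not from the question), the scaling could be reversed like this:
# forecasts[affected] <- lapply(forecasts[affected], function(f) f * 100000)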

Plotting a histogram from a data frame

I am a new R user and for the last 7 days I have been using the mixdist package for modal analysis of finite mixture distributions. I work on nanoparticles, so I use R to analyse the particle size distributions recorded by the particle analyser I use in my experiments.
My problem is illustrated below:
First I collect my data from Excel (raw data):
Diameter dN/dlog(dp) frequencies
4.87 1825.078136 0.001541926
5.62 2363.940947 0.001997187
6.49 2022.259831 0.001708516
7.5 1136.653264 0.000960307
8.66 363.4570006 0.000307068
10 255.6702845 0.000216004
11.55 241.6525906 0.000204161
13.34 410.3425535 0.00034668
15.4 886.929307 0.000749327
17.78 936.4632499 0.000791176
20.54 579.7940281 0.000489842
23.71 11.915522 0.00001
27.38 0 0
31.62 0 0
36.52 5172.088 0.004369665
42.17 19455.13684 0.01643677
48.7 42857.20502 0.036208126
56.23 68085.64903 0.057522504
64.94 87135.1959 0.07361661
74.99 96708.55662 0.081704712
86.6 97982.18946 0.082780747
100 95617.46266 0.080782896
115.48 93732.08861 0.079190028
133.35 93718.2981 0.079178377
153.99 92982.3002 0.078556565
177.83 88545.18227 0.074807844
205.35 78231.4116 0.066094203
237.14 63261.43349 0.053446741
273.84 46759.77702 0.039505233
316.23 32196.42834 0.027201315
365.17 21586.84472 0.018237755
421.7 14703.9162 0.012422678
486.97 10539.84662 0.008904643
562.34 7986.233881 0.00674721
649.38 6133.971913 0.005182317
749.89 4500.351801 0.003802145
865.96 2960.469207 0.002501167
1000 1649.858041 0.001393891
Inf 0 0
using the function
pikraw <- read.table(file = "clipboard", sep = "\t", header = TRUE)
After importing the data into R I select the 1st and 3rd columns of the table above:
diameter <- pikraw$Diameter
frequencies <- pikraw[3]
Then I group my data using the functions:
pikgrp <- data.frame(length = diameter, freq = frequencies)
class(pikgrp) <- c("mixdata", "data.frame")
Having done all this I plot the histogram of the data:
plot(pikgrp, log = "x")
and then something strange happens: the horizontal axis and its values look fine, but on the y axis the low frequency values appear as they are while the high values are shown with their decimals cut off, which squashes the plot.
Do you have any explanation of what is happening? The answer is probably very simple, but after exhausting myself and losing a whole weekend on this I believe I am entitled to ask.
It looks to me like you are reading your data in wrong. Try this:
pikraw <- read.table(file = "clipboard", sep = "", header = TRUE)
That is, change the sep argument to sep = "". Everything worked fine from there.
Also, note that using the clipboard as the file argument only works if you have your data on the clipboard. I recommend creating a .txt (or .csv) file with your data; that way you don't have to have the data on the clipboard every time you want to read it.
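For example (a sketch, assuming you save the table above to a plain-text file called pikraw.txt in your working directory):
pikraw <- read.table("pikraw.txt", sep = "", header = TRUE)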
