R data processing and forecasting - r

I am a new R user. I have data in the following xls file
nKPI December-2012 July-2013 January-2014 July-2014 January-2015 June-2015 January-2016 July-2016
NKPI-03001 0.13 0.25 0.23 0.09 0.07 0.08 0.19 0.14
NKPI-03002 0.23 0.22 0.21 0.16 0.20 0.22 0.32 0.37
NKPI-03003 0.38 0.41 0.44 0.36 0.32 0.28 0.36 0.35
NKPI-03004 0.47 0.37 0.49 0.38 0.41 0.43 0.51 0.54
NKPI-03005 0.24 0.41 0.55 0.43 0.41 0.42 0.54 0.52
NKPI-03006 0.31 0.38 0.39 0.36 0.34 0.40 0.59 0.55
NKPI-03008 0.20 0.21 0.17 0.09 0.10 0.13 0.25 0.29
There are 704 rows of nkpi entries to process.
I need to forecast a value for July 2017 and January 2018 using this data and create a plot for each KPI.
I can read the data into a data frame and drop rows with missing data as follows:
kpi_df <- read.xls("ochre_kpi.xls", header=TRUE)
# drop rows with no or missing data
kpi_df <- na.omit(kpi_df)
At this stage I get lost. I thank in advance anyone who can offer guidance and assistance.

It's best to work in tidy data format in R (if you're not sure what this is, get googling). I'm a big fan of tidy tools in the tidyverse library. If you're curious as to why I think it's preferred, you could read Hadley Wickham's tidy tools manifesto. You can find tutorials online, specifically DataCamp, and look to the RStudio cheatsheets for help (RStudio -> Help -> Cheatsheets). In terms of getting started on the analysis above, this should do.
Note: when loading a package (this is done with the call library(name_of_package)) for the first time, you'll need to call install.packages('name_of_package') to install the package.
Start with data cleaning.
To get data into tidy format:
library(tidyverse)
# gather() lives in tidyr; -nKPI keeps the identifier column out of the reshape
kpi <- tidyr::gather(kpi_df, key="date", value="value", -nKPI)
This would make your table kpi look like:
nKPI date value
NKPI-03001 December-2012 0.13
NKPI-03001 July-2013 0.25
The next thing to do is to get R to understand that the date column has dates in it. I'd normally recommend lubridate::parse_date_time, which is on page 37 of this documentation. However, since your dates only have year and month, you run into the same problem as discussed here. To get around that, the zoo package is good, so no lubridate this time. The code to fix your dates would be:
library(zoo)
kpi <- kpi %>% mutate(date = zoo::as.Date(zoo::as.yearmon(date, "%B-%Y")))
Now your data is cleaned and ready to go!
To plot: I'd use ggplot() because of the facet abilities, since you want a plot for each kpi.
# Plot value over time
ggplot(data=kpi, aes(x=date, y=value)) +
# Type of plot is scatter plot
geom_point() +
# Separate plots by the nKPI variable
facet_grid(~nKPI)
As for making a prediction, the function to generate a linear regression model is lm(). You can read about it by typing ?lm into your R console.
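As a minimal sketch of that last step, here is lm() applied to a single KPI, using the NKPI-03001 values from the question. Two assumptions that are not in the original: the month-only dates are pinned to the first of each month, and time enters the model as a plain number of days (the column name days is just illustrative). A straight line through eight points is a rough forecast at best, so treat the result as a starting point.

```r
# Values for NKPI-03001, copied from the question's table
kpi_one <- data.frame(
  date  = as.Date(c("2012-12-01", "2013-07-01", "2014-01-01", "2014-07-01",
                    "2015-01-01", "2015-06-01", "2016-01-01", "2016-07-01")),
  value = c(0.13, 0.25, 0.23, 0.09, 0.07, 0.08, 0.19, 0.14)
)

# Assumption: time is modelled as days since 1970-01-01
kpi_one$days <- as.numeric(kpi_one$date)

# Fit a straight-line trend
fit <- lm(value ~ days, data = kpi_one)

# Predict for the two requested months
targets  <- as.Date(c("2017-07-01", "2018-01-01"))
forecast <- predict(fit, newdata = data.frame(days = as.numeric(targets)))
names(forecast) <- format(targets, "%Y-%m")
forecast
```

To cover all 704 KPIs, the same fit could be run once per nKPI value, for example inside a loop over split(kpi, kpi$nKPI).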
Hope this all helps! Welcome to R!

Related

How can I make an asymmetric correlation in r?

Sometimes I used to use psych::corr.test function with two data frames like:
df1 <- tibble(a=c(1,2,4,5,67,21,21,65,1,5), b=c(21,5,2,6,8,4,2,6,2,2))
df2 <- tibble(a=c(1,2,3,4,5,6,7,8,9,8), b=c(1,6,54,8,3,8,9,5,2,1), c=c(1,4,6,8,5,3,9,7,5,4))
corr <- corr.test(df1,df2, adjust = "BH")
And I was getting p-values from corr$p.adj
But sometimes it gives me strange repetitive p.values like:
a b c
a 0.5727443 0.5964993 0.5727443
b 0.2566757 0.5727443 0.2566757
Does anyone know how adequate these p-values are? Can we do this with the corr.test? If not, how can I make an asymmetric correlation?
I should stress that if I run a symmetric correlation like
df <- bind_cols(df1,df2[-3])
corr <- corr.test(df, adjust = "BH")
its p-values are not so repetitive:
Probability values (Entries above the diagonal are adjusted for multiple tests.)
a...1 b...2 a...3 b...4
a...1 0.00 0.97 0.62 0.72
b...2 0.97 0.00 0.38 0.62
a...3 0.39 0.06 0.00 0.62
b...4 0.60 0.40 0.41 0.00
UPD: Okay, I realised that it's as repetitive as the first and I'm a bit stupid.
The BH correction is based on computing the cumulative minimum of n/i * p, where p holds your n = 6 unadjusted p-values in decreasing order, and i is 6:1. (You can see the calculation in stats::p.adjust.)
Because it's a cumulative minimum (i.e. the first value, then the min of the first and second, then the min of the first to third, etc.) there are likely to be repetitions.
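The calculation can be reproduced by hand to see where the repeats come from. This is a small sketch; the six p-values are made up for illustration and are not the ones from the question.

```r
# Hypothetical unadjusted p-values (not taken from the question)
p <- c(0.04, 0.30, 0.45, 0.50, 0.52, 0.60)
n <- length(p)

# Sort in decreasing order, scale by n/i with i = n:1, take the running minimum
o      <- order(p, decreasing = TRUE)
scaled <- n / (n:1) * p[o]
manual <- pmin(1, cummin(scaled))[order(o)]  # back to the original order

# Same result as the built-in adjustment
builtin <- p.adjust(p, method = "BH")
rbind(manual, builtin)
```

Here five of the six adjusted values collapse to 0.60, because the running minimum never drops until the smallest p-value is reached.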

Find column number that satisfies condition based on another vector

I'm trying to train a classifier for the classes "Hit", "Miss" based on the variables User, Planning Horizon, Material, and some more.
Most of them are categorical variables except for Planning Horizon (integer)
I have unbalanced data, so I'm trying to use thresholding to select the final output of the model (rather than just using the default 0.5 probability).
The variable User has the most impact on the class outcome, so I'm trying to use a different threshold for every user. I'm thinking about using the naive Bayes posterior probability P(Class|User).
The question is, how can I apply those different rules to the output matrix of the model:
The "Thresholds matrix", a different threshold for every user:
User P("Hit"|User)
A 0.80
B 0.40
C 0.61
And the outputs of the classifier (P(Hit) and P(Miss)); the last column (Final Prediction) is what I need to construct.
User P("Miss") P("Hit") Final Prediction
B 0.79 0.21 Miss
B 0.20 0.80 Hit
A 0.15 0.85 Hit
C 0.22 0.78 Hit
A 0.90 0.10 Miss
B 0.80 0.20 Miss
Notice the first row gets a Miss because P(Hit) is lower than P(Hit|User=B).
I would merge my threshold matrix and then create the Final Prediction column by hand like this.
df <- read.table(text='User P("Miss") P("Hit") "Final Prediction"
B 0.79 0.21 Miss
B 0.20 0.80 Hit
A 0.15 0.85 Hit
C 0.22 0.78 Hit
A 0.90 0.10 Miss
B 0.80 0.20 Miss',
header=TRUE, sep=' ', check.names=FALSE)
thm <- read.table(text='User P("Hit"|User)
A 0.80
B 0.40
C 0.61',
header=TRUE, sep=' ', check.names=FALSE)
thmdf <- merge(thm, df)
thmdf['My Final Prediction'] <-
ifelse(thmdf$`P(Hit)` < thmdf$`P(Hit|User)`,
'Miss',
'Hit')
thmdf
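A variant of the same idea, sketched with plain column names so it runs without any quoting tricks. The names user, p_hit, threshold and prediction are illustrative and not from the original, and match() is swapped in for merge() here so that the rows stay in the classifier's output order.

```r
# Classifier output: one row per prediction (values from the question)
scores <- data.frame(
  user  = c("B", "B", "A", "C", "A", "B"),
  p_hit = c(0.21, 0.80, 0.85, 0.78, 0.10, 0.20)
)

# Per-user thresholds, i.e. P("Hit"|User) from the question
thresholds <- data.frame(
  user      = c("A", "B", "C"),
  threshold = c(0.80, 0.40, 0.61)
)

# Look up each row's threshold by user, keeping the original row order
scores$threshold <- thresholds$threshold[match(scores$user, thresholds$user)]

# Predict "Hit" only when P(Hit) reaches that user's threshold
scores$prediction <- ifelse(scores$p_hit < scores$threshold, "Miss", "Hit")
scores
```

A user missing from the thresholds table would get NA for both threshold and prediction, which makes such gaps easy to spot.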

extract percentage explained variance in ssa result (Rssa)

I'm working with the Rssa package to decompose time series, which works fine, except that I can't get the percentage of explained variance for each eigenvector (if these are the right words to explain this). However, these percentages are shown at the top of one of the graphs I can plot with this package.
Let me give an example:
d=rnorm(200,10,3)
plot(d,type="l")
ssa=ssa(d, L = 100,digits=0)
plot(ssa,type="vector") #the percentage I want is in the title of each individual graph
# to reconstruct the trend and the residuals
res <- reconstruct(ssa, groups = list(1))
trend <- res$F1
How do I get these percentages in a vector? Especially since I want to loop over multiple series.
Thank you!
It seems that the code for the weighted norm of the series by component is hidden in the package.
I extracted the code from Rssa:::.plot.ssa.vectors.1d.ssa and wrapped it in a small function:
component_wnorm <-
function(x) {
# Report at most the first 10 components
idx <- seq_len(min(nsigma(x), 10))
# Squared weighted norm of the whole series
total <- wnorm(x)^2
round(100*x$sigma[idx]^2 / total, digits = 2)
}
component_wnorm(ssa)
[1] 92.02 0.35 0.34 0.27 0.27 0.25 0.22 0.20 0.20 0.18
The recent version of Rssa has the function contributions.
Therefore, you can use
> s <- ssa(d, L=100)
> c <- contributions(s)*100
> print(c[1:10], digits = 2)
[1] 92.41 0.28 0.26 0.26 0.26 0.23 0.23 0.21 0.20 0.20

computing percentile rank efficiently in R

I'm developing an R package which requires me to report percentile ranks for each of the returned values. However, the distribution I have is huge (~10 million values).
The way I'm currently doing it is by generating an ecdf function, saving that function to a file and reading it in the package when needed. This is problematic because the file I save ends up being huge (~120mb) and takes too long to load back in:
f = ecdf(rnorm(10000000))
save(f, file='tmp.Rsav')
Is there any way to make this more efficient, maybe by approximating the percentile rank in R?
Thanks
Just do an ecdf on a downsampled distro:
> items <- 100000
> downsample <- 100 # downsample by a factor of 100
> data <- rnorm(items)
> data.down <- sort(data)[(1:(items / downsample)) * downsample] # pick every 100th
> round(ecdf(data.down)(-5:5), 2)
[1] 0.00 0.00 0.00 0.02 0.16 0.50 0.84 0.98 1.00 1.00 1.00
> round(ecdf(data)(-5:5), 2)
[1] 0.00 0.00 0.00 0.02 0.16 0.50 0.84 0.98 1.00 1.00 1.00
Note you probably want to think about the downsampling a little bit as the example here will return slightly biased answers, but the general strategy should work.
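One way to reduce that bias, offered as a sketch and not as the answer's own method: store a fixed grid of quantiles instead of every 100th value, and interpolate between them. The grid size of 1001 points and the name approx_rank are assumptions chosen for illustration.

```r
set.seed(1)
x <- rnorm(1e5)  # stand-in for the real distribution

# Keep only 1001 evenly spaced quantiles of the full data
probs <- seq(0, 1, length.out = 1001)
q     <- quantile(x, probs, names = FALSE)

# Percentile rank by linear interpolation between the stored quantiles;
# values outside the observed range map to 0 or 1
approx_rank <- approxfun(q, probs, yleft = 0, yright = 1, ties = "ordered")

# Compare against the full ecdf at a few points
pts <- -3:3
max_err <- max(abs(approx_rank(pts) - ecdf(x)(pts)))
max_err
```

Only q and probs need to be saved with the package (about two thousand numbers), and the grid can be made finer if more precision is needed in the tails.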

Extracting output from principal function in psych package as a data frame

When I use the principal function, like in the following code, I get a nice table which gives all the standardized loadings, as well as a table with the eigenvalues and the proportion and cumulative proportion explained.
rotatedpca <- principal(PCFdataset, nfactors = 8, rotate = "varimax", scores = T)
I would like to export this output to an excel file (using WriteXLS), but I can only do that for dataframes, and rotatedpca is not a dataframe and cannot be coerced into one it seems. I am able to extract the standardized loadings by using the following code:
loadings<-as.data.frame(unclass(rotatedpca$loadings))
But I cannot figure out how to access the other information that normally displays when I simply call the principal function, in particular the eigenvalues and the proportion and cumulative variance explained. I tried rotatedpca$values, but that returns what looks like the eigenvalues for all 12 original variables as factors without rotation, which I don't understand. And I haven't been able to figure out any way to even try to extract the variance explained values. How can I simply create a dataframe that looks like the R output I get below from the principal function, for example?
RC2 RC3 RC8 RC1 RC4 RC5 RC6 RC7
SS loadings 1.52 1.50 1.45 1.44 1.01 1.00 0.99 0.98
Proportion Var 0.13 0.12 0.12 0.12 0.08 0.08 0.08 0.08
Cumulative Var 0.13 0.25 0.37 0.49 0.58 0.66 0.74 0.82
Proportion Explained 0.15 0.15 0.15 0.15 0.10 0.10 0.10 0.10
Cumulative Proportion 0.15 0.31 0.45 0.60 0.70 0.80 0.90 1.00
Thanks for reading my post!
I have just added this feature to the latest (as of today) release of psych 1.3.10.11.
If you run either
f3 <- fa(Thurstone,3)
#or
p3 <- principal(Thurstone,3)
#then
p <- print(f3)
p # will give you
$Vaccounted
$Vaccounted
MR1 MR2 MR3
SS loadings 2.6411150 1.8621522 1.4951831
Proportion Var 0.2934572 0.2069058 0.1661315
Cumulative Var 0.2934572 0.5003630 0.6664945
Proportion Explained 0.4402995 0.3104389 0.2492616
Cumulative Proportion 0.4402995 0.7507384 1.0000000
In general, if you have suggestions or questions re the psych package, you will get a faster answer if you contact me directly.
Bill
Why not this:
capture.output( print(rotatedpca), file="pc.txt")
You can read desired portions into Excel using its Text to Columns... function off the /Data menu. Or you can just paste it into an open blank Excel document and select the rows you want to convert. Use the "fixed" option that will probably be offered automagically.
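If the installed psych version predates that feature, the table can also be rebuilt from the loadings matrix itself. The sketch below assumes the usual definitions: SS loadings are the column sums of squared loadings, and Proportion Var divides them by the number of variables (which holds for an analysis of a correlation matrix). The helper name variance_table is hypothetical, and an unrotated PCA of the built-in mtcars data stands in for rotatedpca$loadings so the example runs without psych.

```r
# Hypothetical helper: rebuild the variance table from a loadings matrix
variance_table <- function(loadings) {
  loadings <- unclass(loadings)   # drop the "loadings" class if present
  ss   <- colSums(loadings^2)     # SS loadings per component
  nvar <- nrow(loadings)          # assumption: total variance = number of variables
  out <- rbind(
    "SS loadings"           = ss,
    "Proportion Var"        = ss / nvar,
    "Cumulative Var"        = cumsum(ss / nvar),
    "Proportion Explained"  = ss / sum(ss),
    "Cumulative Proportion" = cumsum(ss / sum(ss))
  )
  as.data.frame(out)
}

# Stand-in loadings: first three unrotated components of six mtcars variables
e <- eigen(cor(mtcars[, 1:6]))
pc_loadings <- e$vectors[, 1:3] %*% diag(sqrt(e$values[1:3]))
colnames(pc_loadings) <- paste0("PC", 1:3)

vt <- variance_table(pc_loadings)
round(vt, 2)
```

With psych loaded, variance_table(rotatedpca$loadings) should give a data frame that can be passed to WriteXLS alongside the loadings.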
