Converting a numeric data frame with fractional values to integers in R via interpolation

I have collected data which looks something like this:
wl Spec.94
299.784 57.95
300.151 57.18
300.517 88.18
300.884 18.71
301.252 100.90
301.617 127.06
301.983 75.02
302.349 54.20
302.715 50.93
303.082 50.43
However, the program I use to analyze the data can only handle whole numbers for wl. I have an Excel sheet I inherited that interpolates this data and produces this:
wl Spec
300 41.03
301 61.77
302 51.84
I really don't know how that spreadsheet works, but the column titles that it auto-populates are Target Wl, Nearest Smaller Index, Nearest Smaller Wl, Upper Wl, Bias, Low-side value, High-side Value, and Interpolated value.
I need to be able to replicate this process in my R code to make the analysis reproducible, but I have no idea where to start. How do I interpolate my data in R to get values for Spec.94 at whole-number values of wl?

You could build a sequence over the rounded range and feed it to approx():
wl <- do.call(seq, as.list(round(range(dat$wl))))
cbind(wl, Spec.94=approx(dat$wl, dat$Spec.94, wl)$y)
#       wl  Spec.94
# [1,] 300 57.49681
# [2,] 301 44.61772
# [3,] 302 74.05295
# [4,] 303 50.54172
However, the values are somewhat different, and I'm not sure how your specific Excel sheet arrives at the 41.03 value, which should be somewhere between 57.95 and 57.18. Maybe you could figure that out?
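For reference, here is a self-contained sketch that spells the interpolation out step by step, mirroring the spreadsheet's column names (nearest smaller index, bias, low/high-side values). It is plain linear interpolation, so it reproduces the approx() numbers above rather than the 41.03 from the inherited sheet, and the helper name interp_at is made up for illustration:
# Step-by-step linear interpolation, following the spreadsheet's columns:
# nearest smaller index -> bias -> low-side/high-side values -> interpolated value
interp_at <- function(target, wl, spec) {
  i    <- findInterval(target, wl)                 # nearest smaller index
  bias <- (target - wl[i]) / (wl[i + 1] - wl[i])   # position between the two wavelengths
  spec[i] + bias * (spec[i + 1] - spec[i])         # low-side value + bias * (high - low)
}

dat <- data.frame(
  wl      = c(299.784, 300.151, 300.517, 300.884, 301.252,
              301.617, 301.983, 302.349, 302.715, 303.082),
  Spec.94 = c(57.95, 57.18, 88.18, 18.71, 100.90,
              127.06, 75.02, 54.20, 50.93, 50.43)
)
sapply(300:303, interp_at, wl = dat$wl, spec = dat$Spec.94)
# 57.49681 44.61772 74.05295 50.54172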

Related

R confusion matrix error

I have two lists and I'm creating a confusion matrix like this:
conf.mat <- table(x,y)
but the
accuracy <- sum(diag(conf.mat))/length(y) * 100
is giving me 0 when I know for sure it shouldn't be.
x is a long list that ends like this
[1546] data mining
25 Levels: clustering algorithms ...
and y ends like this
[1546] mixed discrete-continuous optimization
646 Levels: access control ... world wide web
The thing is, even though I expected diag(conf.mat) to cover all 1546 cases, it only contains 25 entries.
Any ideas what's happening? I assume it has something to do with the levels but I'm not sure how to fix this.
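One common cause, sketched below with toy vectors (not your actual lists): when x and y are factors with different level sets, table(x, y) is rectangular, so diag() only runs along the shorter dimension and matching labels don't necessarily sit on the diagonal. Putting both vectors on one common, consistently ordered set of levels avoids that:
x <- factor(c("a", "b", "a"), levels = c("a", "b"))
y <- factor(c("a", "b", "c"), levels = c("c", "a", "b"))
table(x, y)                                        # 2 x 3 table; the "diagonal" no longer matches labels

lev <- union(levels(x), levels(y))                 # shared, consistently ordered levels
conf.mat <- table(factor(x, levels = lev), factor(y, levels = lev))
sum(diag(conf.mat)) / length(y) * 100              # accuracy computed on a square table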

HMM text recognition in R with depmixS4

I'm wondering how I would use the depmixS4 package for R to run an HMM on a dataset. What functions would I use to get a classification of a testing data set?
I have a file of training data, a file of label data, and a file of test data.
The training data consists of 4620 rows. Each row has 1079 values. These values are 83 windows with 13 values per window; in other words, each row of 1079 values is made up of 83 states, and each state has 13 observations. Each of these rows of 1079 values is a spoken word, so there are 4620 utterances. But in total the data only has 7 distinct words; each distinct word has 660 different utterances, hence the 4620 rows of words.
So we have words (0-6)
The label file is a list where each row is labeled 0-6 corresponding to which word it is. For example, row 300 is labeled 2, row 450 is labeled 6, and row 520 is labeled 0.
The test file contains about 5000 rows structured exactly like the training data, except there are no labels associated with it.
I want to use an HMM trained on the training data to classify the test data.
How would I use depmixS4 to output a classification of my test data?
I'm looking at:
depmix(response, data=NULL, nstates, transition=~1, family=gaussian(),
       prior=~1, initdata=NULL, respstart=NULL, trstart=NULL, instart=NULL,
       ntimes=NULL, ...)
but I don't know what response refers to or any of the other parameters.
Here's a quick, albeit incomplete, test to get you started, if only to familiarize you with the basic outline. Please note that this is a toy example and it merely scratches the surface of HMM design/analysis. The vignette for the depmixS4 package, for instance, offers quite a lot of context and examples. Meanwhile, here's a brief intro.
Let's say that you wanted to investigate if industrial production offers clues about economic recessions. First, let's load the relevant packages and then download the data from the St. Louis Fed:
library(quantmod)
library(depmixS4)
library(TTR)
fred.tickers <-c("INDPRO")
getSymbols(fred.tickers,src="FRED")
Next, transform the data into rolling 1-year percentage changes to minimize noise and convert the data into data.frame format for analysis in depmixS4:
indpro.1yr <-na.omit(ROC(INDPRO,12))
indpro.1yr.df <-data.frame(indpro.1yr)
Now, let's run a simple HMM and choose just two states: growth and contraction. Note that we're only using industrial production to search for signals:
model <- depmix(response = INDPRO ~ 1,
                family = gaussian(),
                nstates = 2,
                data = indpro.1yr.df,
                transition = ~1)
Now let's fit the resulting model, generate posterior states for analysis, and estimate probabilities of recession. Also, we'll bind the data with dates in an xts format for easier viewing/analysis. (Note the use of set.seed(1), which is used to create a replicable starting value to launch the modeling.)
set.seed(1)
model.fit <- fit(model, verbose = FALSE)
model.prob <- posterior(model.fit)
prob.rec <-model.prob[,2]
prob.rec.dates <- xts(prob.rec, order.by = as.Date(index(indpro.1yr)))
Finally, let's review and ideally plot the data:
head(prob.rec.dates)
[,1]
1920-01-01 1.0000000
1920-02-01 1.0000000
1920-03-01 1.0000000
1920-04-01 0.9991880
1920-05-01 0.9999549
1920-06-01 0.9739622
High values (say, above 0.80) suggest that the economy is in recession/contraction.
Again, a very, very basic introduction, perhaps too basic. Hope it helps.
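For the plotting step mentioned above, a minimal sketch (assuming the prob.rec.dates object created earlier is still in the workspace) could be:
plot(prob.rec.dates, main = "Estimated probability of recession/contraction")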

HoltWinters initial values not matching Rob Hyndman's theory

I am following this tutorial by Rob Hyndman for initialization (additive).
Steps to calculate initial values are specified as:
I am running the above steps manually (with pen and paper) on the data set provided in Rob Hyndman's free online textbook. The values I got after the first two steps are:
I used the same data set in R, but the seasonal output values in R are drastically different (screenshot below).
Not sure what I am doing wrong. Any help would be appreciated.
Another interesting thing I have just observed is that the initial level (l(t)) in the textbook is 33.8, but in the R output it is 48.24, which proves that I am missing something in my manual calculation.
EDIT:
Here is how I am calculating the moving-average smooth (based on the formula used in Section 2 of this link).
After calculating it, I have de-trended the data, meaning original value minus smoothed value.
Then the seasonal values, which are:
S1 = average of Q1
S2 = average of Q2
...
The first two values of your moving average are incorrect. You have assumed that the values prior to the first observation are zero. They are not zero, they are missing, which is quite different. It is impossible to compute the moving average for the first two observations for this reason.
The third and subsequent values of your moving average are only approximately correct because you have rounded the data to one decimal place instead of using the data as provided in the fpp package in R.
The values obtained following this procedure are used as initial values in the optimization within ets(). So the output from ets() will not contain the initial values but the optimized values. The table in the book gives the optimized values. You will not be able to reproduce them using a simple procedure.
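If you want to see that difference yourself, here is a rough sketch, assuming y holds the same quarterly series used in the book example (and that the first row of the ets() state matrix is its initial state, which is the forecast package's convention):
library(forecast)
fit.ets <- ets(y, model = "AAA")   # additive error, trend and seasonality; initial states are optimized
fit.ets$states[1, ]                # optimized initial level/trend/seasonal states
HoltWinters(y)$fitted[1:4, ]       # fit based on the heuristic (non-optimized) initial values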
However, you can reproduce what is provided by HoltWinters because it does not do any optimization of initial values. Using HoltWinters, the initial seasonal values are given as:
> HoltWinters(y)$fitted[1:4,]
xhat level trend season
[1,] 43.73934 33.21330 1.207739 9.318302
[2,] 28.25863 35.65614 1.376490 -8.774002
[3,] 36.86581 37.57569 1.450688 -2.160566
[4,] 41.87604 38.83521 1.424568 1.616267
(The output in coefficients gives the final states not the initial states.)
The seasonal indices in the last column can be computed as follows:
y MAsmooth detrend detrend.adj
41.72746 NA NA NA
24.04185 NA NA NA
32.32810 34.41724 -2.089139 -2.160566
37.32871 35.64101 1.687695 1.616267
46.21315 36.82342 9.389730 9.318302
29.34633 38.04890 -8.702575 -8.774002
36.48291 NA NA NA
42.97772 NA NA NA
The last column is the adjusted detrended data (so they add to zero).
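For completeness, a rough sketch of that classical calculation for a generic quarterly ts object y with frequency = 4 (this mirrors the steps described above, not the internals of HoltWinters()):
ma.smooth <- stats::filter(y, c(1, 2, 2, 2, 1) / 8, sides = 2)   # centered 2x4 moving average
detrend   <- y - ma.smooth                                       # original minus smoothed (NA at the ends)
s.raw     <- tapply(detrend, cycle(y), mean, na.rm = TRUE)       # average detrended value per quarter
s.adj     <- s.raw - mean(s.raw)                                 # adjust so the seasonal indices sum to zero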

Plotting a histogram from a data frame

I am a new user of R, and for the last 7 days I have been using this language with the mixdist package for the modal analysis of finite mixture distributions. I am working on nanoparticles, so R is used for the analysis of particle size distributions recorded by a particle analyser I use in my experiments.
My problem is illustrated below:
Firstly I am collecting my data from Excel (raw data):
Diameter dN/dlog(dp) frequencies
4.87 1825.078136 0.001541926
5.62 2363.940947 0.001997187
6.49 2022.259831 0.001708516
7.5 1136.653264 0.000960307
8.66 363.4570006 0.000307068
10 255.6702845 0.000216004
11.55 241.6525906 0.000204161
13.34 410.3425535 0.00034668
15.4 886.929307 0.000749327
17.78 936.4632499 0.000791176
20.54 579.7940281 0.000489842
23.71 11.915522 0.00001
27.38 0 0
31.62 0 0
36.52 5172.088 0.004369665
42.17 19455.13684 0.01643677
48.7 42857.20502 0.036208126
56.23 68085.64903 0.057522504
64.94 87135.1959 0.07361661
74.99 96708.55662 0.081704712
86.6 97982.18946 0.082780747
100 95617.46266 0.080782896
115.48 93732.08861 0.079190028
133.35 93718.2981 0.079178377
153.99 92982.3002 0.078556565
177.83 88545.18227 0.074807844
205.35 78231.4116 0.066094203
237.14 63261.43349 0.053446741
273.84 46759.77702 0.039505233
316.23 32196.42834 0.027201315
365.17 21586.84472 0.018237755
421.7 14703.9162 0.012422678
486.97 10539.84662 0.008904643
562.34 7986.233881 0.00674721
649.38 6133.971913 0.005182317
749.89 4500.351801 0.003802145
865.96 2960.469207 0.002501167
1000 1649.858041 0.001393891
Inf 0 0
using the function
pikraw<-read.table(file="clipboard", sep="\t", header=T)
After importing the data into R, I am choosing the 1st and the 3rd columns of the above table:
diameter<- pikraw$Diameter
frequencies<-pikraw[3]
Then I am grouping my data using the functions
pikgrp <- data.frame(length =diameter, freq =frequencies)
class(pikgrp) <- c("mixdata", "data.frame")
Having done all this, I am going to plot the histogram of this data:
plot(pikgrp,log="x")
and there something strange happens: the horizontal axis and its values look fine, but on the y axis the low frequency values appear as they are while the high values appear with the decimal cut off, which lowers the plot.
Have you got any explanation of what is happening? The answer is probably very simple, but after exhausting myself and losing a whole weekend I believe I have every right to ask.
It looks to me like you are reading your data wrong. Try this:
pikraw <- read.table(file="clipboard", sep="", header=T)
That is, change the sep argument to sep="". Everything worked fine from there.
Also, note that using the clipboard as the file argument only works if you have your data on the clipboard. I recommend creating a .txt (or .csv) file with your data. That way you don't have to have your data on the clipboard every time you want to read it.
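For instance (the file name here is just a placeholder for wherever you save the exported data):
pikraw <- read.table("pikraw.txt", sep = "", header = TRUE)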

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt", header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to do the following:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to the columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
The first option is generally preferred (from what I've seen).
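Applied to the data frame in your question, that gives the two vectors you were after:
area   <- tbl$area
energy <- tbl$energy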
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
