Using KNN for pattern matching of time series - R

I want to implement a KNN algorithm for pattern matching (or pattern recognition) in my time series data. The data are consumption measurements. I have a table where the first column is the datetime of the measurement and the other columns are the measurements themselves. Here is an example:
datetime             mains   stove  kitchen  microwave  TV
2013-04-21 14:22:13  341.03  6      57       5          0
2013-04-21 14:22:16  342.36  6      57       5          0
2013-04-21 14:22:20  342.52  6      58       5          0
2013-04-21 14:22:23  342.07  6      57       5          0
2013-04-21 14:22:26  341.77  6      57       5          0
2013-04-21 14:22:30  341.66  6      55       5          0
I want to use the KNN algorithm to compare the pattern of the mains signal with the patterns of the other signals. My training set would consist of labelled measurements of every appliance, and the test set would consist of the mains signal measurements. The aim is to detect changes in the signal: which appliance was turned on at which time.
What I actually want to ask is:
How should the datetime column be handled? In what format should it be passed to KNN? (I assume it needs some conversion to integer, or normalization?)
Is the KNN algorithm suitable for this task?
How does one generally perform pattern matching with KNN?
What I've already tried: I put a single vector consisting of labelled patterns (one per appliance) into KNN as the training set and then used the mains data as the test set, omitting the datetime column entirely. I got bad results.
I'm implementing this in R.
Any ideas?
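One way to frame this is windowed nearest-neighbour classification: cut each labelled appliance signal into fixed-length windows, do the same with the mains signal, and let `class::knn` assign each mains window to the closest appliance pattern. The sketch below illustrates this with simulated data; the window length, the toy signals, and the choice of k are all illustrative assumptions, not values from the question. The datetime column is dropped entirely and only contributes the ordering of the samples.

```r
# Sketch: windowed KNN matching of the mains signal against labelled
# appliance patterns. Window length and the toy data are assumptions.
library(class)

win <- 10  # number of consecutive samples per pattern window (assumption)

# Cut a numeric series into non-overlapping windows (rows of a matrix)
make_windows <- function(x, win) {
  n <- floor(length(x) / win)
  matrix(x[seq_len(n * win)], nrow = n, ncol = win, byrow = TRUE)
}

# Toy labelled appliance signals (training set); real data would come
# from the per-appliance columns of the table above
set.seed(1)
stove     <- rnorm(100, mean = 6,  sd = 0.5)
kitchen   <- rnorm(100, mean = 57, sd = 1)
microwave <- rnorm(100, mean = 5,  sd = 0.3)

train  <- rbind(make_windows(stove, win),
                make_windows(kitchen, win),
                make_windows(microwave, win))
labels <- factor(rep(c("stove", "kitchen", "microwave"), each = 100 / win))

# The mains signal is the test set; only the sample ordering matters,
# so the datetime column is not passed to KNN at all
mains <- rnorm(50, mean = 57, sd = 1)
test  <- make_windows(mains, win)

pred <- knn(train, test, cl = labels, k = 3)
print(pred)  # one predicted appliance label per mains window
```

In practice each window usually needs normalization (e.g. subtracting its mean) so that patterns are compared by shape rather than absolute level, which may explain the bad results from feeding raw vectors to KNN.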

Related

Cluster analysis in R on large data set

I have a data set with rankings as the column names and about 15,000 contestants. My data looks like:
contestant    1    2    3    4
101          13    0    5   12
14            0    1   34    6
...         ...  ...  ...  ...
500           0    2   23    3
I've been doing cluster analysis on this dataset. Dendrograms are obviously not very helpful here: with this many entries they collapse into a thick block line.
I'm wondering if there is a better way to do cluster analysis with this type of data. I've tried fviz_cluster() and similar commands, and went through multiple tutorials, but most of them build dendrograms, and their data are quite different from mine (two variables compared, far fewer rows). Essentially, I'm asking which types of cluster analysis may work well with this type of data.
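For data of this size, partitioning methods such as k-means scale much better than hierarchical clustering and need no dendrogram at all. The sketch below is a minimal illustration on simulated ranking counts (placeholders for the real data); the choice of k = 3 and the elbow range are assumptions.

```r
# Sketch: k-means as a dendrogram-free alternative for ~15,000 rows.
# The simulated ranking counts stand in for the real data.
set.seed(1)
ranks <- matrix(rpois(15000 * 4, lambda = 5), ncol = 4,
                dimnames = list(NULL, c("r1", "r2", "r3", "r4")))

scaled <- scale(ranks)  # put the ranking columns on a common scale

# Elbow heuristic: total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k)
  kmeans(scaled, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

fit <- kmeans(scaled, centers = 3, nstart = 25)
table(fit$cluster)  # cluster sizes
```

Pick k where the within-cluster sum of squares stops dropping sharply; for 15,000 rows this runs in well under a second, unlike a full hierarchical clustering.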

R – How to give a common ID to matching data with close values, and arrange the dataframe for paired tests (e.g. from Hmisc::find.matches())

Hi everyone! I hope you are having a great day.
Aim and context
My two dataframes are built from different methods, but measure the same parameters on the same signals.
I'd like to match every signal in the first dataframe with the same signal in the second dataframe, to compare the parameter values and evaluate the methods against each other.
I would gratefully appreciate any help, as I have reached my beginner's limits both in R coding and in dataframe management.
Basically, I would like to find the matches between two separate dataframes and treat each match as referring to the same entity (for instance by creating an ID variable), in order to perform statistical analysis for paired data.
I could have made the matches by hand in a spreadsheet, but because there are hundreds of entries and more comparisons to come, I'd like to automate the matching and the creation of the dataframe.
To give you an idea, my dataframes look like this:
DF1
Recording  Selection  Start (ms)  Freq.max (kHz)
001        1          11.3        42.4
001        2          122.9       46.2
001        3          232.3       47.5
002        1          22.9        30.9
002        2          512.4       31.3
My second dataframe would look something like this:
DF2
Recording  Selection  Start (ms)  Freq.max (kHz)
001        1          10.9        41.8
001        2          122.1       44.5
001        3          231.3       44.4
002        1          513.0       30.2
My ideas
I thought about identifying each signal, but:
An ID built from "Recording + Selection" (001_1, 001_2, ...) would not work, because some signals are not detected by both methods.
I'd therefore want to use the start position to identify the signals, but rounding to the closest or upper/lower value would not match all of them.
Hmisc::find.matches() function
I tried the find.matches() function from the Hmisc package, which returns the matches between your columns given the tolerance threshold you input.
find <- find.matches(DF_method1$start_one, DF_method2$start_two, tol=(2))
(I arbitrarily chose a tolerance of 2 ms for two detections to be considered the same signal.)
The output looks like this :
Matches:
Match #1 Match #2 Match #3
[1,] 1 7 0
[2,] 2 42 0
[3,] 3 0 0
[4,] 4 0 0
[5,] 0 0 0
[6,] 5 0 0
[7,] 22 6 0
I feel like it is coming together, but I am stuck on these two questions:
How can I find the closest match within each recording, rather than comparing all signals across all recordings? (In the example, all first matches are correctly identified except #7, which is matched with #22, a signal from a different recording.) How could I run the function within each recording?
How can I create a dataframe from the output? It would contain only the signals that had a match, together with their related parameter values.
I feel like this function gets close to my aim, but if you have any other suggestion, I am all ears.
Thanks a lot!
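One way to keep matches inside a recording is to split DF1 by Recording and match only against the rows of DF2 with the same Recording. The sketch below does the nearest-start matching in base R (rather than Hmisc::find.matches) so the per-recording grouping and the output dataframe are explicit; the column names follow the example tables above, and the 2 ms tolerance is the one chosen in the question.

```r
# Sketch: nearest-start matching within each recording, built from the
# example tables above. The matching is done in base R so the grouping
# and the resulting paired dataframe are explicit.
DF1 <- data.frame(Recording = c("001", "001", "001", "002", "002"),
                  Selection = c(1, 2, 3, 1, 2),
                  Start     = c(11.3, 122.9, 232.3, 22.9, 512.4),
                  Freq.max  = c(42.4, 46.2, 47.5, 30.9, 31.3))
DF2 <- data.frame(Recording = c("001", "001", "001", "002"),
                  Selection = c(1, 2, 3, 1),
                  Start     = c(10.9, 122.1, 231.3, 513.0),
                  Freq.max  = c(41.8, 44.5, 44.4, 30.2))

# For each row of d1, find the closest Start in d2 within the tolerance;
# rows without a match (e.g. signals detected by only one method) drop out
match_recording <- function(d1, d2, tol = 2) {
  idx <- sapply(d1$Start, function(s) {
    diffs <- abs(d2$Start - s)
    j <- which.min(diffs)
    if (length(j) && diffs[j] <= tol) j else NA_integer_
  })
  keep <- !is.na(idx)
  data.frame(Recording = d1$Recording[keep],
             ID        = paste(d1$Recording[keep], d1$Selection[keep], sep = "_"),
             Start.1   = d1$Start[keep],    Start.2 = d2$Start[idx[keep]],
             Freq.1    = d1$Freq.max[keep], Freq.2  = d2$Freq.max[idx[keep]])
}

# Run the matcher separately inside each recording, then stack the results
paired <- do.call(rbind, lapply(split(DF1, DF1$Recording), function(d1) {
  match_recording(d1, DF2[DF2$Recording == d1$Recording[1], ])
}))
print(paired)  # one row per matched signal, ready for paired tests
```

With the example data this yields four matched pairs (001_1, 001_2, 001_3, 002_2); signal 002_1 has no counterpart within 2 ms and is dropped, and no signal can ever match across recordings.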

How to fix linear model fitting error in S-plus

I am trying to fit values in my algorithm so that I can predict next month's number. I am getting a "No data for variable" error, even though I have clearly defined the objects that I am putting into the equation.
I've tried placing them in vectors so that one vector could be used as a training data set to predict the new values. The current script has worked for me on a different dataset, but for some reason it isn't working here.
The data is small, so I was wondering if that has anything to do with it. The data is:
Month  io  obs  Units Sold
12     in  1    114
1      in  2    29
2      in  3    105
3      in  4    30
4      in  5
I'm trying to predict Units Sold with the code below
matt<-TEST1
isdf<-matt[matt$month<=3,]
isdf<-na.omit(isdf)
osdf<-matt[matt$Units.Sold==4,]
lmfit<-lm(Units.Sold~obs+Month,data=isdf,na.action=na.omit)
predict(lmFit,osdf[1,1])
I am expecting to be able to place lmfit in predict and get an output.
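The script above contains several visible case and column mismatches that would each trigger an error: `predict(lmFit, ...)` versus the object named `lmfit`, `matt$month` versus the column `Month`, and `Units.Sold == 4` where the row to predict is presumably the one with the missing value. A hedged corrected sketch, assuming those mismatches are the cause and reconstructing the small dataset from the table above:

```r
# Sketch of a corrected version. It assumes the error comes from the
# case mismatches visible in the script (lmFit vs lmfit, month vs Month)
# and that the out-of-sample row is the one with the missing Units.Sold.
TEST1 <- data.frame(Month = c(12, 1, 2, 3, 4),
                    io    = rep("in", 5),
                    obs   = 1:5,
                    Units.Sold = c(114, 29, 105, 30, NA))

matt <- TEST1
isdf <- na.omit(matt)             # training rows with a known Units.Sold
osdf <- matt[matt$Month == 4, ]   # the month to predict

lmfit <- lm(Units.Sold ~ obs + Month, data = isdf)
predict(lmfit, newdata = osdf)    # predicted Units.Sold for Month 4
```

Note the object name in `predict` must match exactly (`lmfit`, not `lmFit`): S-PLUS and R are case-sensitive, so a capitalization slip alone is enough to make the call fail.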

How to deal with an imbalanced dataset for the target variable?

Currently I am facing the problem of predicting the number of goals scored (soccer) with imbalanced data (760 obs x 129 variables). The target variable is FTHG (full-time home goals), with the following counts:
Goals:  0    1    2    3    4   5   6
Count:  175  243  176  107  43  10  6
My objective is to convert it into a binary class, majority (1) and minority (0), in order to apply sampling techniques and then use XGBoost.
Could anyone let me know how to convert it into a binary majority/minority class so I can apply SMOTE-style sampling techniques?
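Given the counts above, one natural reading of "majority (1) and minority (0)" is scored at least one goal (585 obs) versus no goal (175 obs). The sketch below recodes the target that way; the threshold is an assumption, and the SMOTE step is only mentioned, not performed.

```r
# Sketch: collapse FTHG into the binary majority/minority coding
# described above. The 0/1 split (majority = at least one goal) is one
# reading of the question; adjust the threshold to taste.
set.seed(1)
FTHG <- sample(rep(0:6, times = c(175, 243, 176, 107, 43, 10, 6)))

target <- ifelse(FTHG >= 1, 1L, 0L)  # 1 = scored (majority), 0 = no goal (minority)
table(target)                        # 0: 175, 1: 585

# The recoded target can then be fed to a SMOTE implementation
# (e.g. the smotefamily package) before training XGBoost.
```

Note that collapsing 1-6 goals into one class throws away the count information; if the goal count itself matters, an ordinal or Poisson-style model may serve better than a binary one.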

CART Methodology for data with mutually exhaustive rows

I am trying to use CART to analyse a data set in which each row is a segment, for example:
Segment_ID  Attribute_1  Attribute_2  Attribute_3  Attribute_4  Target
1           2            3            100          3            0.1
2           0            6            150          5            0.3
3           0            3            200          6            0.56
4           1            4            103          4            0.23
Each segment has a certain population from the base data (irrelevant to my final use).
I want to condense the segments (in the example above, the 4 segments into 2 big segments) based on the 4 attributes and the target variable. I am currently dealing with 15k segments and want only 10 final segments, each based on the target and with a sensible attribute distribution.
Now, pardon me if I am wrong, but CHAID in SPSS (if not using autogrow) generally splits the data 70:30, building the tree on 70% of the data and testing on the remaining 30%. I can't use this approach, since I need all segments in the data to be included; I essentially want to club these segments into a few big segments, as explained before. My question is whether I can use CART (rpart in R) for this. The rpart function in R has an explicit 'subset' option, but I am not sure whether leaving it out ensures CART uses 100% of my data. I am relatively new to R, hence this very basic question.
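To the specific worry: rpart fits the tree on every row it is given, with no automatic 70:30 holdout (its internal cross-validation only informs pruning via the cp table), and omitting `subset` simply means all rows are used. The sketch below shows this on simulated segments; the attribute distributions and the cp/maxdepth values are illustrative assumptions chosen to yield a handful of leaves.

```r
# Sketch: rpart fits on every row supplied (no automatic 70:30 split),
# and the leaf each row lands in can serve as its condensed "big
# segment". The simulated attributes are placeholders for real data.
library(rpart)

set.seed(42)
seg <- data.frame(Attribute_1 = sample(0:2, 15000, replace = TRUE),
                  Attribute_2 = sample(3:6, 15000, replace = TRUE),
                  Attribute_3 = runif(15000, 100, 200),
                  Attribute_4 = sample(3:6, 15000, replace = TRUE))
seg$Target <- 0.1 + 0.002 * seg$Attribute_3 + rnorm(15000, sd = 0.05)

# Regression tree on the full data; cp / maxdepth control the leaf count
fit <- rpart(Target ~ ., data = seg, method = "anova",
             control = rpart.control(cp = 0.001, maxdepth = 4))

seg$big_segment <- fit$where     # leaf id for every one of the 15k rows
length(unique(seg$big_segment))  # number of condensed segments
```

Every one of the 15k rows receives a leaf assignment in `fit$where`, which is exactly the condensation into a few big segments described above; tightening or loosening `cp` steers the leaf count toward the desired 10.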
