I am new to data mining and I am trying to figure out how to cluster cell tower IDs and infer their locations from the known location labels (Home, Work, Elsewhere, No signal).
I have a location-driven dataset for user A that contains cellID (unique ID of the detected cell tower), starttime (date & time it detected that particular tower), endtime (last date & time before it connected to a different cell tower), and placenames (user-labelled place names such as home, work). There are also unlabelled locations in the dataset that were left empty by the user, and I want to label these cell towers using a clustering approach so that each one is assigned one of the location names.
I am using R and I tried to feed the complete dataset to kmeans clustering, but it fails with an error and a warning that I don't understand:
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(dataset, 4, 15) : NAs introduced by coercion
Any suggestions on how I can use a clustering approach for this problem? Thanks
Since you have all the labeled and unlabeled data available at the training stage, what you are looking for is "transductive learning", which is a little different from clustering (which is "unsupervised learning").
For each cell tower, collect the average starttime and endtime along with its cellID. You can get lat/lng from cellIDs here: https://mozilla-ichnaea.readthedocs.org/en/latest/api/search.html or http://locationapi.org/api (expensive).
This gives you a 4-dimensional feature vector for each tower, and the goal is to assign a ternary label based on these continuous features:
[avg_starttime avg_endtime lat lng] home/work/other
I don't know about R, but in Python basic transductive learning is available:
http://scikit-learn.org/stable/modules/label_propagation.html#label-propagation
If you don't get good results with label propagation, and since off-the-shelf transductive learning tools are rare, you might just want to ignore some of your data during training and use more standard methods. If you ignore the unlabeled data at the start, you have an "inductive" or "supervised" learning problem (solve with an SVM). If you ignore the labels at the start, you can use unsupervised learning (e.g. clustering; use kmeans or DBSCAN) and then assign labels to the clusters after clustering is done.
I don't know how you got the NaN.
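If you go the clustering route, here is a minimal sketch of that "cluster first, label afterwards" idea in R. It assumes a data frame with one row per tower and the numeric features described above (the name towers and the column names are placeholders, not from the question); note that kmeans needs an all-numeric matrix, which is also the usual cause of the "NAs introduced by coercion" warning when character columns such as cellID or placenames are passed in.
# minimal sketch, assuming a data frame `towers` with one row per cell tower,
# numeric columns avg_starttime, avg_endtime, lat, lng, and the user-supplied
# placenames column ("" where unlabelled)
features <- scale(towers[, c("avg_starttime", "avg_endtime", "lat", "lng")])

set.seed(1)
fit <- kmeans(features, centers = 4, iter.max = 15)

# give each cluster the most frequent user label among its labelled members
labelled <- towers$placenames != ""
cluster_label <- tapply(towers$placenames[labelled], fit$cluster[labelled],
                        function(x) names(which.max(table(x))))
towers$predicted <- cluster_label[as.character(fit$cluster)]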
I am trying to integrate a time series model of R in Tableau and I am new to the integration. Please help me resolve the error mentioned below. Below is my calculation in Tableau for integration with R. The calculation is valid but I am getting an error.
SCRIPT_REAL(
"library(forecast);
cln_count_ts <- ts(.arg1,frequency = 7);
arima.fit <- auto.arima(log10(cln_count_ts));
forecast_ts <- forecast(arima.fit, h =10);",
SUM([Count]))
Error : Error in auto.arima(log10(cln_count_ts)) : No suitable ARIMA model found
When Tableau calls R, Python, or another tool, it does so as a "table calc". That means it sends the external system one or more vectors as arguments and expects a single vector in response.
Depending on your data and calculation, you may want to send all your data to R in a single call, passing a very large vector, or call it several times with different vectors - say forecasting each region separately. Or even call R multiple times with many vectors of size one (aka scalars).
So with table calcs, you have other decisions to make beyond just choosing the function to invoke. Chiefly, you have to decide how to partition your data for analysis. And in some cases, you also need to determine the order that the data appears in the vectors you send to R - say if the order implies a time series.
The Tableau terms for specifying how to divide and order data for table calculations are "partitioning and addressing". See the section on that topic in the online help. You can change those settings by using the "Edit Table Calc" menu item.
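As a hedged illustration of the "single vector in response" point, the original calculation could be rewritten so that the last R expression returns a numeric vector the same length as the input; repeating the first back-transformed forecast value across the whole vector is just one possible choice, not the only way to do it.
SCRIPT_REAL(
"library(forecast);
cln_count_ts <- ts(.arg1, frequency = 7);
arima.fit <- auto.arima(log10(cln_count_ts));
forecast_ts <- forecast(arima.fit, h = 10);
# return a vector as long as .arg1, as Tableau expects
rep(10^forecast_ts$mean[1], length(.arg1))",
SUM([Count]))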
Background:
This question is perhaps a bit broad, but hopefully interesting to anyone using relational data, R, Power BI or all of the above.
I'm trying to recreate a relational model for the dataset nycflights13 described in the book R for Data Science by Wickham and Grolemund. And I'm trying to do so using both R and Power BI. The dataset consists of the 5 tables airlines, airports, flights, weather and planes. In section 13.2 on nycflights13 there's a passage that states:
flights connects to weather via origin (the location), and year,
month, day and hour (the time).
The relationships are illustrated by this figure:
Question 1: How can I set up this model in Power BI?
Using the following R script will make the datasets available for Power BI in the folder c:/data:
# install.packages("tidyverse")
# install.packages("nycflights13")
library(tidyverse)
library(nycflights13)
setwd("C:/data/")
#getwd()
airlines
df_airlines <- data.frame(airlines)
df_airports <- data.frame(airports)
df_planes <- data.frame(planes)
df_weather <- data.frame(weather)
df_flights <- data.frame(flights)
write.csv(df_airlines, file = "C:/data/airlines.txt", row.names = FALSE)
write.csv(df_airports, file = "C:/data/airports.txt", row.names = FALSE)
write.csv(df_planes, file = "C:/data/planes.txt", row.names = FALSE)
write.csv(df_weather, file = "C:/data/weather.txt", row.names = FALSE)
write.csv(df_flights, file = "C:/data/flights.txt", row.names = FALSE)
Having imported the tables in Power BI, I'm trying to establish the relations in the Relationships tab:
And I'm able to do so to some extent, but when I try to connect flights to weather using for example year, I'm getting the following error message:
You can't create a relationship between these two columns because one of the columns must have unique values.
And I understand that this happens because primary keys must contain unique values and cannot contain null values. But how can you establish a primary key in Power BI that consists of multiple fields?
Question 2: If there's no answer to question 1, how can you do this in R instead?
I really love this book, and it may even be described there already, but how do you establish a relationship like this in R? Or perhaps you don't need to, since you can join on multiple columns or a composite key using dplyr without there being an 'established' relationship at all?
Put another way, are the relationships illustrated in the figure above with the arrows:
and in Power BI with the lines:
really not necessary in R, as long as you have the required verbs and there actually does exist a relationship between the data in the different tables?
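To make that concrete, the kind of multi-column join I'm alluding to would look like this in dplyr (the key columns are the ones the book names for the flights-weather relationship; this is a sketch, not the book's exact code):
library(dplyr)
library(nycflights13)

# join flights to weather on the composite key: origin (location) plus
# year, month, day and hour (time)
flights_weather <- flights %>%
  left_join(weather, by = c("origin", "year", "month", "day", "hour"))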
Question 3 - Why is flight highlighted in the flights table:
I thought that a highlighted column name indicated that a connection had been established between tables using that column. But as far as I can tell, that is not the case here, and there is no arrow pointing to it:
Does it perhaps indicate that it is a primary key in the flights table without any connection to another table?
I know this is a bit broad, but I'm really curious about these things so I'm hoping some of you will find it interesting!
I can comment on Power BI part.
The key issue here is that Power BI requires a dimensional model, not a relational one. There is a huge difference.
As described, the model from the book is not suitable for BI tools; it must be redesigned. For example, the "Weather" table in the book is presented as a "dimension", while in reality it must be a fact table (similar to the "Flights" table). As a result, "Flights" and "Weather" should never be connected directly - they must share common dimensions (see the sketch after this list), such as:
Airport
Airline
Plane
Date
Time
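As a rough, hypothetical sketch (not from the book or from Power BI itself), a shared Date dimension could be built in R before the CSV export step, so that both fact tables relate to it through a single-column key; dim_date and date_key are made-up names:
library(dplyr)
library(nycflights13)

# build a Date dimension that both Flights and Weather can relate to
dim_date <- flights %>%
  distinct(year, month, day) %>%
  mutate(date_key = sprintf("%04d%02d%02d", year, month, day)) %>%
  select(date_key, year, month, day)

# add the same surrogate key to each fact table before writing the CSVs
df_flights <- flights %>% mutate(date_key = sprintf("%04d%02d%02d", year, month, day))
df_weather <- weather %>% mutate(date_key = sprintf("%04d%02d%02d", year, month, day))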
Similarly, multiple keys and multiple connections between tables are very rare exceptions and are frowned upon (usually, they are indications of design mistakes). In a properly designed model, you should never see them.
If you want to understand the issue more, read this book:
Star Schema Complete Reference
To answer your Q3 specifically, in dimensional modeling "Flight" (I assume it's the flight number) is called a "degenerate dimension". Normally, it would have been a key to a dimension table, but if that table is absent, it stays in the fact table as an orphan key. Such a situation is common for order numbers, invoice numbers, etc.
Degenerate dimensions
Overall, you are on the right track - if you figure out how to transform the model from the book into a proper star schema, and then use it in R and PowerBI, you will be impressed with the new capabilities - it's worth it.
I'm a total R beginner and am trying to cluster user data using the function skmeans.
I always get the error message:
"Error in if (!all(row_norms(x) > 0)) stop("Zero rows are not allowed.") :
missing value where TRUE/FALSE needed".
There already is a topic about this error message explaining that zeros are not allowed in rows.
However, my blueprint for what I'm trying to do is an example based on a data set which is also full of zeros. Working with this example, the error message does not appear and the function works fine. The error message only occurs when I apply the same procedure to my data set which doesn't seem different from the blueprint's data set.
Here's the function used for the kmeans:
weindaten.clusters <- skmeans(wendaten.tr, 5, method="genetic")
And here's the data set:
For my own data set, I used this function
kunden.cluster<- skmeans(test4, 5, method="genetic")
for this data set:
Could somebody please help me understand what the difference between the two data sets is (vector vs. something else, maybe) and how I can change my data to be able to use skmeans?
You cannot use spherical k-means on this data.
Spherical k-means uses angles for similarity. But the all-zero row cannot be used in angular computations.
Choose a different algorithm, unless you can treat the all-zero row specially (for example on text, this would be an empty document).
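If the all-zero rows are not meaningful for your clustering, one hedged option is to drop them before calling skmeans and handle them separately afterwards (test4 is the data set from your question and is assumed here to be numeric):
library(skmeans)

# keep only rows with at least one non-zero entry
nonzero <- rowSums(abs(test4)) > 0
kunden.cluster <- skmeans(as.matrix(test4[nonzero, , drop = FALSE]), 5, method = "genetic")

# the dropped all-zero rows can then be treated as their own "empty" group
table(nonzero)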
I have extensively read and re-read the Troubleshooting R Connections and Tableau and R Integration help documents, but as a new Tableau user they just aren't helping me.
I need to be able to calculate Kaplan-Meier survival probabilities across any dimensions that are dragged onto the sheet. Ideally, I would be able to retrieve this in a tabular format at multiple time points, but for now, I would be happy just to get it at a single time point.
My data in Tableau have columns for [event-boolean] and [time to event]. Let's say I also have columns for Gender and District.
Currently, I have a calculated field [surv] as:
SCRIPT_REAL('
library(survival);
fit <- summary(survfit(Surv(.arg2,.arg1) ~ 1), times=365);
fit$surv'
, min([event-boolean])
, min([time to event])
)
I have messed with Computed Using, Addressing, Partitions, Aggregate Measures, and parameters to the R function, but no combination I have tried has worked.
If [District] is in Columns, do I need to change my SCRIPT_REAL call or do I just need to change some other combination of levers?
I used Andrew's solution to solve this problem. Essentially,
- Turn off Aggregate Measures
- In the Measure Values shelf, select Compute Using > Cell
- In the calculated field, wrap the script call as IF FIRST() == 0 THEN SCRIPT_*(...) END (sketched below)
- Ctrl+drag the measure to the Filters shelf and use a Special > Non-null filter.
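Putting those steps together, the calculated field from the question might end up looking roughly like this (a sketch only; the 365-day time point and field names are taken from the question):
IF FIRST() == 0 THEN
    SCRIPT_REAL('
        library(survival);
        fit <- summary(survfit(Surv(.arg2, .arg1) ~ 1), times = 365);
        fit$surv
    ',
    MIN([event-boolean]),
    MIN([time to event]))
END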
I'm trying to analyse multiple sequences with TraMineR at once. I've had a look at seqdef but I'm struggling to understand how I'd create a TraMineR dataset when I'm dealing with multiple variables. I guess I'm working with something similar to the dataset used by Aassve et al. (as mentioned in the tutorial), whereby each wave has information about several states (e.g. children, marriage, employment). All my variables are binary. Here's an example of a dataset with three waves (D,W2,W3) and three variables.
D<-data.frame(ID=c(1:4),A1=c(1,1,1,0),B1=c(0,1,0,1),C1=c(0,0,0,1))
W2<-data.frame(A2=c(0,1,1,0),B2=c(1,1,0,1),C2=c(0,1,0,1))
W3<-data.frame(A3=c(0,1,1,0),B3=c(1,1,0,1),C3=c(0,1,0,1))
L<-data.frame(D,W2,W3)
I may be wrong, but the material I found deals with the data management and analysis of one variable at a time only (e.g. employment status across several waves). My dataset is much larger than the above, so I can't really input these manually as shown on page 48 of the tutorial. Has anyone dealt with this type of data using TraMineR (or a similar package)?
1) How would you feed the data above to TraMineR?
2) How would you compute the substitution costs and then cluster them?
Many thanks
When using sequence analysis, we are interested in the evolution of one variable (for instance, a sequence of one variable across several waves). You then have several possibilities for analyzing several variables:
Create one sequence object per variable and then analyze the links between the resulting clusters of sequences. In my opinion, this is the best way to go if your variables measure different concepts (for instance, family and employment).
Create a new variable for each wave that is the interaction of the different variables of that wave, using the interaction function. For instance, for wave one, use L$IntVar1 <- interaction(L$A1, L$B1, L$C1, drop=T) (use drop=T to remove unused combinations of answers). Then analyze the sequence of this newly created variable (see the sketch after this list). In my opinion, this is the preferred way if your variables are different dimensions of the same concept. For instance, marriage, children and union are all related to family life.
Create one sequence object per variable and then use seqdistmc to compute the distance (multi-channel sequence analysis). This is equivalent to the previous method depending on how you will set substitution costs (see below).
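To make option 2 concrete, here is a minimal sketch using the example data frame L from your question (the new column names S1-S3 are arbitrary):
library(TraMineR)

# one interaction state per wave, built from the three binary variables
L$S1 <- interaction(L$A1, L$B1, L$C1, drop = TRUE)
L$S2 <- interaction(L$A2, L$B2, L$C2, drop = TRUE)
L$S3 <- interaction(L$A3, L$B3, L$C3, drop = TRUE)

# define the state sequence object on the three wave columns
seq_obj <- seqdef(L, var = c("S1", "S2", "S3"))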
If you use the second strategy, you could set the substitution costs as follows. Count the differences between the original variables to set the substitution cost. For instance, between the states "Married, Child" and "Not married, Child", you could set the substitution cost to 1 because there is a difference on only the "marriage" variable. Similarly, you would set the substitution cost between the states "Married, Child" and "Not married, No Child" to 2 because all of your variables are different. Finally, set the indel cost to half the maximum substitution cost. This is the strategy used by seqdistmc.
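Continuing the sketch above, that "count the differing variables" rule can be turned into a substitution cost matrix programmatically rather than by hand (seq_obj is the object from the earlier sketch; everything else is plain R):
# states created by interaction() look like "1.0.0", so the number of
# differing component variables can be counted after splitting on "."
states <- alphabet(seq_obj)
parts  <- strsplit(states, ".", fixed = TRUE)

sm <- outer(seq_along(states), seq_along(states),
            Vectorize(function(i, j) sum(parts[[i]] != parts[[j]])))
dimnames(sm) <- list(states, states)

# indel cost set to half the maximum substitution cost, as described above
dist_om <- seqdist(seq_obj, method = "OM", sm = sm, indel = max(sm) / 2)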
Hope this helps.
In Biemann and Datta (2013) they talk about multidimensional analysis, which means creating multiple sequences for the same "individuals".
I used the following approach to do so:
1) define 3 dimensional sequences
comp.seq <- seqdef(comp,NULL,states=comp.scodes,labels=comp.labels, alphabet=comp.alphabet,missing="Z")
titles.seq <- seqdef(titles,NULL,states=titles.scodes,labels=titles.labels, alphabet=titles.alphabet,missing="Z")
member.seq <- seqdef(member,NULL,states=member.scodes,labels=member.labels, alphabet=member.alphabet,missing="Z")
2) Compute the multi channel (multi dimension) distance
mcdist <- seqdistmc(channels=list(comp.seq,member.seq,titles.seq),method="OM",sm=list("TRATE","TRATE","TRATE"),with.missing=TRUE)
3) cluster it with ward's method:
library(cluster)
clusterward<- agnes(mcdist,diss=TRUE,method="ward")
plot(clusterward,which.plots=2)
Never mind parameters like "missing" or "left", etc.; I hope the brief code sample helps.