First-time Create ML user here. I am trying to create an activity classification model for my app. For testing, I have created three activities with seven feature columns. All values are integers ranging from -10000 to 10000. When I click on Train Model, I get the following error:
Feature column xx is empty on row 0 of input data table
I have CSV files as input with header rows. If I exclude feature column xx the error is on the next column. I'm on Create ML 1.0 (Xcode 11.6).
Any ideas?
Cheers
Christian
OK, because of the way the sensor transmits its data, I had multiplied the sensor values by 1000 to get integers. I created a script that cleans up the sensor data by dividing it by 1000 again, and now my model works.
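A clean-up script of that kind could look roughly like the sketch below. It is written in R only for illustration; the original script is not shown, so the function name rescale_csv, the column name acc_x and the sample values are all made up.

```r
# Hypothetical helper: divide the chosen feature columns of a CSV by a factor
# and write the result to a new file. Not the original poster's script.
rescale_csv <- function(in_path, out_path, feature_cols, factor = 1000) {
  df <- read.csv(in_path)
  df[feature_cols] <- lapply(df[feature_cols], function(v) v / factor)
  write.csv(df, out_path, row.names = FALSE)
  invisible(df)
}

# Usage with a made-up two-row file (label and acc_x are assumed column names)
tmp_in  <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
write.csv(data.frame(label = c("walk", "run"), acc_x = c(-10000L, 2500L)),
          tmp_in, row.names = FALSE)
out <- rescale_csv(tmp_in, tmp_out, "acc_x")
```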
I have a list of matrices containing association measurements between GPS-tracked animals. One matrix in the list holds the observed association rates; the others hold association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories, resulting in a list of 99 animal association matrices plus the observed association matrix. I expect that for animals belonging to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with whether the observed association data are greater than the randomized trajectory association data, any result that just gives the rank of the observed cells is sufficient.
ls <- list(matrix(10:18,3,3), matrix(18:10,3,3))
I've seen that sapply can be used to get the ranks of particular cells. Could I do the following for every cell and take the final number in the resulting vector as the rank of the cell in that position in the list (given that I know the position of the observed data in the list of matrices, e.g. last)?
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that would take your ls and produce a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: when we write something like ls[[1]] > ls[[2]], we extract the two matrices of interest and compare them element by element. The result is a TRUE/FALSE matrix, which R treats as a 0/1 matrix in arithmetic. We can then multiply it by whatever coefficient we want to represent that situation.
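For the general case with many permuted matrices, one possible sketch is to stack the list into a 3-D array and rank each cell's values across the list. It assumes the observed matrix is the last list element, as in the question, and the names mats, arr and obs_rank are made up for the example. Ties get average ranks, which is the default of rank().

```r
# Same toy data as in the question (named mats here to avoid masking base::ls)
mats <- list(matrix(10:18, 3, 3), matrix(18:10, 3, 3))

# Stack the matrices along a third dimension: rows x cols x list position
arr <- array(unlist(mats), dim = c(dim(mats[[1]]), length(mats)))

# For each cell, rank its values across the list and keep the rank of the
# last element (assumed to be the observed matrix)
obs_rank <- apply(arr, c(1, 2), function(v) rank(v)[length(v)])
obs_rank
```

With 99 permutations plus the observed matrix, a rank of 100 in a cell would mean the observed rate exceeds every randomized rate for that dyad.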
I am having some issues in interpreting the results from prcomp().
Say I have a centered and scaled data.table called dat, with N columns and M rows. Every column represents a feature and every row a record. I also have an M-dimensional vector of outcomes Y.
I wanted to know what the PCA of this system says. So I just executed:
dat.pca=prcomp(dat,retx=TRUE)
By the elbow method I decided to retain 5 PCA modes, accounting for 90% of the variance. Then, I got the following data.table:
dat.pcadata=as.data.table(dat.pca$x)
dat.pcadata has M rows and N columns, and each column corresponds to a PCA mode.
My question is: do I understand correctly if I say that now my system should be trained to forecast the outcomes Y using the first 5 columns of dat.pcadata as features?
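Using the leading score columns as regression features is usually called principal component regression. A minimal sketch of that reading follows, on synthetic data since dat and Y are not shown; the sizes, the 90% threshold variable k and the use of lm() are assumptions made for the example, not something the thread prescribes.

```r
set.seed(1)
M <- 100; N <- 8
dat <- scale(matrix(rnorm(M * N), M, N))  # synthetic stand-in for the real dat
Y   <- rnorm(M)                           # synthetic stand-in for the outcomes

dat.pca <- prcomp(dat, retx = TRUE)

# Cumulative share of variance explained by the first 1, 2, ... components
var_explained <- cumsum(dat.pca$sdev^2) / sum(dat.pca$sdev^2)

# Smallest number of components reaching 90% of the variance
k <- which(var_explained >= 0.90)[1]

# The first k score columns become the features of the downstream model
features <- dat.pca$x[, seq_len(k), drop = FALSE]
fit <- lm(Y ~ features)
```

New records would have to be projected with the same centering and rotation (for example via predict(dat.pca, newdata)) before being passed to the fitted model.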
For this, I am using the banknote data in R given by data(banknote), which shows measurements of 200 Swiss banknotes. My data matrix is called X, and I have performed PCA by pca.banknote<-prcomp(X).
I am trying to show that the inner product between each observation X[i,] and Principal Component Loading 3 given by pca.banknote$rot[,3] is the same as the 3rd PC scores given by pca.banknote$x[,3].
I have attempted:
all.equal(as.matrix(X[,])%*%pca.banknote$rot[,3], as.matrix(pca.banknote$x[,3]), check.attributes=FALSE)
but this simply gives a mean difference of 1, i.e. they are not equal.
Do I need to change the format of one of these to a vector/data frame etc for this to work? Or any ideas at all as to where the issue is?
Any feedback would be much appreciated. Thanks.
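One thing worth checking here: prcomp() centers the columns by default (center = TRUE), so the scores in $x are the centered data times the loadings, not the raw data times the loadings. The sketch below illustrates this on a synthetic matrix standing in for the banknote measurements; the data and the names raw, centered and offset are made up.

```r
set.seed(42)
X <- matrix(rnorm(200 * 6, mean = 100), 200, 6)  # synthetic stand-in, not the banknote data
pca <- prcomp(X)                                 # centers the columns by default

scores <- as.vector(pca$x[, 3])

# Raw data times loading 3: differs from the scores by a constant offset
raw <- as.vector(X %*% pca$rotation[, 3])

# Centered data times loading 3: reproduces the scores
centered <- as.vector(scale(X, center = pca$center, scale = FALSE) %*% pca$rotation[, 3])

# The constant offset is the column means projected onto loading 3
offset <- sum(pca$center * pca$rotation[, 3])

all.equal(centered, scores)
```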
I'm trying to generate an adjacency matrix from a CSV.
The CSV contains 2 columns, 1 for users and 1 for projects. The two columns form a bipartite graph, where each user can be part of multiple projects or none at all, and there are no edges between nodes of the same set. No user-project pair is repeated, but the same user or project can appear several times in different pairs.
I wrote a comparison of each user's projects against the entire project set using Matlab and ismember(a,b). The algorithm runs iteratively through each entry. In the end, I have an adjacency matrix M of size (|users| + |projects|) x (|users| + |projects|).
For a small entry count (< 15,000) it runs fast, but for a sample of more than 15,000 Matlab stalls. I initialize the adjacency matrix with a zeros matrix (zeros(r,c)) and add the results of ismember(a,b) row by row. But on my machine, a zeros matrix zeros(15000,15000) almost maxes out Matlab's memory. I also tried making a zero matrix of that size in R (matrix(0, 15000, 15000)) and it also maxes out R's memory.
Is there a way to get around this? My full sample size is 597,000 rows (with ~70,000 users and ~35,000 projects) and I want to run a network analysis of it.
Also, I want to keep it in matrix format and not as an adjacency list, because I have a max-flow min-cut algorithm I want to run on the results and it only works with matrices.
Updated:
The data looks like this
User | Project
382 2429
385 2838
294 2502
... ...
It is taken from SourceForge using Zerlot from the University of Notre Dame, where each int value is a key in a SQL database.
I want to convert this affiliation data into a one-mode user-to-user adjacency matrix where each edge between users is a shared project.
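A dense 15000 x 15000 matrix of doubles needs 15000 * 15000 * 8 bytes, about 1.8 GB, which explains the memory pressure. One possible way around it, sketched below in R, is a sparse matrix from the Matrix package: build the sparse user-by-project incidence matrix B and take B %*% t(B) to get the user-to-user projection. The first three rows of the toy data come from the sample above; the fourth row is made up so that two users share a project, and the names edges, B and A are assumptions for the example.

```r
library(Matrix)

# Toy edge list; the last row is hypothetical, added so users 382 and 385 share a project
edges <- data.frame(User    = c(382, 385, 294, 382),
                    Project = c(2429, 2838, 2502, 2838))

u <- factor(edges$User)
p <- factor(edges$Project)

# Sparse users x projects incidence matrix: only the non-zero entries are stored
B <- sparseMatrix(i = as.integer(u), j = as.integer(p), x = 1,
                  dims = c(nlevels(u), nlevels(p)),
                  dimnames = list(levels(u), levels(p)))

# One-mode projection: entry (a, b) counts the projects users a and b share
A <- tcrossprod(B)
diag(A) <- 0   # drop self-loops (the diagonal would hold each user's project count)
A
```

Whether the projection itself stays sparse enough depends on the data: a single project with very many members produces a dense block among those users.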
I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable names and the third column contains the score between the two. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0,1]. The file is approximately 50 GB. The pairs whose score is 1 have been removed, as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with your data set's size. Plus, implementations usually need more than one copy of the matrix. You may need 1 TB of RAM then: 2 * 8 * 250000 * 250000 bytes is a lot.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.
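The single-pass idea can be sketched as follows, under two assumptions that the question does not confirm: the score is a distance (smaller means closer) and the file is sorted by ascending score. Each pair is read once and merges two clusters only if its endpoints are still in different clusters, tracked with a union-find structure. This is not SLINK itself but a Kruskal-style equivalent for pre-sorted pairs, and the R version below is a toy-scale illustration only; at 250,000 variables a streaming implementation in a lower-level language would be needed, as noted above. The function name single_link_merges and the sample scores are made up.

```r
# pairs: data.frame with character columns a, b and numeric d, sorted by ascending d
single_link_merges <- function(pairs) {
  ids <- unique(c(pairs$a, pairs$b))
  parent <- setNames(ids, ids)          # union-find: every item starts as its own root

  find <- function(x) {                 # follow parent links up to the cluster root
    while (parent[[x]] != x) x <- parent[[x]]
    x
  }

  merges <- list()
  for (r in seq_len(nrow(pairs))) {
    ra <- find(pairs$a[r])
    rb <- find(pairs$b[r])
    if (ra != rb) {                     # pair joins two different clusters: record a merge
      parent[[rb]] <- ra
      merges[[length(merges) + 1]] <-
        data.frame(a = ra, b = rb, height = pairs$d[r])
    }
  }
  do.call(rbind, merges)
}

# Made-up scores for the three pairs listed in the question
pairs <- data.frame(a = c("A", "B", "A"), b = c("B", "C", "C"),
                    d = c(0.1, 0.4, 0.7), stringsAsFactors = FALSE)
merges <- single_link_merges(pairs)
merges
```

Pairs that were dropped from the file (those with score 1) never trigger a merge here, so items connected only through such pairs would remain in separate clusters.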