CreateML Recommender Training Error: Item IDs in the recommender model must be numbered 0, 1, ..., num_items - 1

I'm using CreateML to generate a Recommender model using an implicit dataset of the format: User ID, Item ID. The data is loaded into CreateML as a CSV with about 400k rows.
When attempting to 'Train' the model, I receive the following error:
Training Error: Item IDs in the recommender model must be numbered 0, 1, ..., num_items - 1
My dataset is in the following format:
"user_id","item_id"
"e7ca1b039bca4f81a33b21acc202df24","f7267c60-6185-11ea-b8dd-0657986dc989"
"1cd4285b19424a94b33ad6637ec1abb2","e643af62-6185-11ea-9d27-0657986dc989"
"1cd4285b19424a94b33ad6637ec1abb2","f2fd13ce-6185-11ea-b210-0657986dc989"
"1cd4285b19424a94b33ad6637ec1abb2","e95864ae-6185-11ea-a254-0657986dc989"
"31042cbfd30c42feb693569c7a2d3f0a","e513a2dc-6185-11ea-9b4c-0657986dc989"
"39e95dbb21854534958d53a0df33cbf2","f27f62c6-6185-11ea-b14c-0657986dc989"
"5c26ca2918264a6bbcffc37de5079f6f","ec080d6c-6185-11ea-a6ca-0657986dc989"
I've tried modifying both Item ID and User ID to enumerated IDs, but I still receive the training error. Example:
"item_ids","user_ids"
0,0
1,0
2,0
2,0
0,225
400,225
409,225
0,282
0,4
8,4
8,4
I receive this error both within the CreateML UI and when using CreateML within a Swift playground. I've also tried removing duplicates and verified that the maximum ID for each column is (num_items - 1).
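For reference, the numbering that the error message describes can be produced and checked along these lines. This is a minimal sketch in plain Python; the helper names `enumerate_ids` and `is_contiguous` and the shortened sample IDs are made up for illustration, and, as noted above, contiguous numbering alone did not clear the error.

```python
def enumerate_ids(values):
    """Map arbitrary IDs to 0, 1, ..., n - 1 in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return [mapping[value] for value in values], mapping


def is_contiguous(ids):
    """True if the distinct IDs are exactly 0, 1, ..., n - 1."""
    distinct = set(ids)
    return distinct == set(range(len(distinct)))


# Shortened, hypothetical sample values (the real IDs are full UUIDs)
item_ids = ["f7267c60", "e643af62", "f2fd13ce", "e643af62"]
encoded, mapping = enumerate_ids(item_ids)
print(encoded)
print(is_contiguous(encoded))
```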
I've searched for documentation on what the exact requirement is for the set of IDs with no luck.
Thank you in advance for any help clarifying this error message.

I was able to discuss this issue with Apple's CoreML developers during WWDC2020. They described this as a known bug which will be fixed with the upcoming OS (Big Sur). The work-around for this bug is:
In the CSV dataset, create records for a single user which interacts with ALL items, and create records for a single item interacted with by ALL users.
Using pandas in Python, I essentially implemented the following (ratings_df is the original dataset, already loaded as a DataFrame):
import csv
import pandas as pd
# Find the unique item ids
item_ids = ratings_df.item_id.unique()
# Find the unique user ids
user_ids = ratings_df.user_id.unique()
# Create a 'dummy user' which interacts with all items
mock_user_interactions_df = pd.DataFrame({'item_id': item_ids, 'user_id': 'mock-user'})
# Create a 'dummy item' which is interacted with by all users
mock_item_interactions_df = pd.DataFrame({'item_id': 'mock-item', 'user_id': user_ids})
# DataFrame.append was removed in pandas 2.0, so combine the frames with pd.concat
ratings_with_mocks_df = pd.concat([ratings_df, mock_user_interactions_df, mock_item_interactions_df])
# Export the CSV
ratings_with_mocks_df.to_csv('data/ratings-w-mocks.csv', quoting=csv.QUOTE_NONNUMERIC, index=True)
Using this CSV, I successfully generated a CoreML model using CreateML.

Try adding an unnamed first column to your CSV data that counts the rows from 0 to number of items - 1, like:
"","userID","itemID","rating"
0,"a","x",1
1,"a","y",0
...
I think it started working for me today after I added this column. I use UUIDs for userID and itemID in my training data. Also be sure to sort the rows by itemID so that all rows for one itemID are next to each other.
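A file in that shape could be generated roughly as follows. This is only a sketch using Python's standard csv module; the function name `write_indexed_csv` and the sample rows are hypothetical, and it writes to an in-memory buffer rather than to a real file.

```python
import csv
import io


def write_indexed_csv(rows, out):
    """Sort (userID, itemID, rating) rows by itemID and prepend a 0-based row counter."""
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    # The first header cell is an empty string, i.e. the unnamed counter column
    writer.writerow(["", "userID", "itemID", "rating"])
    # Keep all rows for one itemID next to each other
    ordered = sorted(rows, key=lambda row: row[1])
    for index, (user_id, item_id, rating) in enumerate(ordered):
        writer.writerow([index, user_id, item_id, rating])


rows = [("a", "y", 0), ("b", "x", 1), ("a", "x", 1)]
buffer = io.StringIO()
write_indexed_csv(rows, buffer)
print(buffer.getvalue())
```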


How to combine many records value into one record

As you can see from the below picture I was able to combine two deals (blocked red) but the output should have one result instead of two. If anyone has any solutions on this please advise.
The red blocked component has more than one record, and each record has an amount. The sum of all record amounts must be shown in a single row.
record1: Amount:100
record2: Amount:200
record3: Amount:500
The merge of all records should be the following:
record: Amount:800
Is it possible to merge many rows into a single row in integromat?
Based on your screenshot, you are aggregating the wrong module. The source module in your aggregator has to be set to a module that generates multiple bundles; in your case, that is module 10.
You aggregate module 14, which generates a single output bundle for every input bundle, so there is nothing to aggregate. Module 10 returns 2 bundles for a single input.
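As a rough analogy in Python (not Integromat code; the functions `module_10`, `module_14` and `aggregate` are stand-ins invented for this sketch), the difference between the two source-module settings looks like this, using the amounts from the question:

```python
def module_10(trigger):
    """Stand-in for the module that emits several bundles for one input."""
    return [{"amount": 100}, {"amount": 200}, {"amount": 500}]


def module_14(bundle):
    """Stand-in for a module that emits exactly one bundle per input bundle."""
    return {"amount": bundle["amount"]}


def aggregate(bundles):
    """Aggregator analogue: collapse a list of bundles into one bundle."""
    return {"amount": sum(b["amount"] for b in bundles)}


bundles = module_10("deal")

# Source module = 14: the aggregator runs once per bundle, so nothing is merged.
per_bundle = [aggregate([module_14(b)]) for b in bundles]

# Source module = 10: all bundles produced by module 10 reach one aggregation.
merged = aggregate([module_14(b) for b in bundles])

print(per_bundle)
print(merged)
```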
Your case:
/---[6]---([14]---[11 aggregator])---
---[10] multiple output bundles
\---[6]---([14]---[11 aggregator])---
Solution:
/---[6]---[14]---\
---([10] [11 aggregator])--- single output bundle
\---[6]---[14]---/
Your scenario has to look like this (Aggregator: Source module = module no.10):

Error while using "EpiEstim" and "ggplot2" libraries

First of all, I must say I'm a complete noob in R, so I apologize in advance for asking for help with such a simple task. My task is to plot a graph of COVID-19 cases for a certain period using data from a CSV file. Unfortunately, at the moment I cannot contact the person from the World Health Organization who provided the data and the script. I am left with an error that I cannot fix either by myself or with the help of Google.
script.R
library(EpiEstim)
library(ggplot2)
COVID<-read.csv("dataset.csv")
res_parametric_si<-estimate_R(COVID$I,method="parametric_si",config=make_config(list(mean_si=4,std_si=3)))
plot(res_parametric_si)
dataset.csv
Date,Suspected per day,Total suspected,Discarded/pending,Confirmed per day,Total confirmed,Deaths per day,Deaths Total,Case fatality rate,Daily confirmed,Recovered per day,Recovered total,Active cases,Tested with PCR,# of PCR tests total,average tests/ 7 days,Inf HCW,Inf HCW/d,Vent HCW,Susp per day
01-Jul-20,1239,91172,45285,889,45887,12,1185,2.58%,889,505,20053,24649,11109,676684,10073,6828,63,,1239
02-Jul-20,1249,92421,45658,876,46763,27,1212,2.59%,876,505,20558,24993,13167,689851,9966,6874,46,,1249
03-Jul-20,1288,93709,46032,914,47677,15,1227,2.57%,914,597,21155,25295,11825,701676,9915.7,6937,63,,1288
04-Jul-20,926,94635,46135,823,48500,22,1249,2.58%,823,221,21376,25875,9934,711610,9957,6990,53,,926
05-Jul-20,680,95315,46272,543,49043,13,1262,2.57%,543,327,21703,26078,6696,718306,9963.7,7030,40,,680
06-Jul-20,871,96186,46579,564,49607,21,1283,2.59%,564,490,22193,26131,9343,727649,10303.9,7046,16,,871
07-Jul-20,1170,97356,46942,807,50414,23,1306,2.59%,807,926,23119,25989,13568,741217,10806,7092,46,,1170
Error
Error in process_I(incid) (script.R#4): incid must be a vector or a dataframe with either i) a column called 'I', or ii) 2 columns called 'local' and 'imported'.
For the example data, the issue seems to be that it covers only 7 data points, while the default configuration assumes it can use estimation windows that need more than 7 days. Note also that the code below passes COVID$Daily.confirmed, since the CSV has no column named I. The following code worked for me (working in the sense that it does not throw an error).
config <- make_config(incid = COVID$Daily.confirmed,
method="parametric_si",
list(mean_si=4,std_si=3, t_start = c(2,3),t_end = c(6,7)))
res_parametric_si<-estimate_R(COVID$Daily.confirmed,method="parametric_si",config=config)
plot(res_parametric_si)
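To see why the window bounds matter, here is a small Python sketch of the arithmetic (it does not use or reproduce EpiEstim itself). It assumes 1-based, inclusive windows whose first start is day 2. The weekly window length is my understanding of the package default, not something taken from its source, and the helper `sliding_windows` is hypothetical.

```python
def sliding_windows(n_days, window_length, first_start=2):
    """List the (t_start, t_end) pairs of the given length that fit in n_days.

    Days are 1-based and t_end is inclusive.
    """
    windows = []
    start = first_start
    while start + window_length - 1 <= n_days:
        windows.append((start, start + window_length - 1))
        start += 1
    return windows


weekly = sliding_windows(7, 7)    # 7-day windows over 7 days of data
five_day = sliding_windows(7, 5)  # 5-day windows over the same data
print(weekly)
print(five_day)
```

With only 7 days of data no 7-day window starting at day 2 fits, whereas the two 5-day windows correspond to t_start = c(2,3) and t_end = c(6,7) in the R code above.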

getRetweeters() returns one id whereas getRetweetCount() returns 2 -- in twitteR package

I use the twitteR package and I am trying to retrieve the account IDs of retweeters.
The retweet count and the list of retweeters do not always appear to be consistent.
For example, I retrieved a status (tweet) using
st<-showStatus("1058168768009043969")
retweeters(st$getId()) # returns "260857015"
st$getRetweetCount() # however returns 2
st$getRetweeters() # returns a known error
Using twitteR's getRetweeters method
The Twitter site shows 2 retweets, as shown here:
https://twitter.com/ConsueloMack/status/1058168768009043969
To run this, one needs valid keys and must set up OAuth as follows:
require('twitteR')
twapi<-read.csv("./coach_keys.json",sep=":",stringsAsFactors=F,header=F)
# on Linux you can set up OAuth as follows
setup_twitter_oauth(twapi[twapi$V1=="API_KEY",c("V2")],
twapi[twapi$V1=="API_SECRET_KEY",c("V2")],
twapi[twapi$V1=="ACCESS_TOKEN",c("V2")],
twapi[twapi$V1=="ACCESS_TOKEN_SECRET",c("V2")])
# then the above snippet can be run
I expected the retweeters method to return as many IDs as indicated by getRetweetCount().
However, it does not. I am seeking some pointers, especially if I am doing something wrong. Is this a common occurrence? Can someone show, for the ID I have, how to retrieve a count and a list that are consistent with each other?
Thank you very much.

Can I use this CSV to load a neo4j graph with cypher?

I am a medical doctor trying to model a drugs-to-enzymes database, and I am starting with a CSV file that I use to load my data into the Gephi graph layout program. I understand the power of a graph DB but am illiterate in Cypher.
The current CSV has the following format:
source;target;arc_type; <- this is a header needed for Gephi import
artemisinin;2B6;induces;
...
amiodarone;1A2;represses;
...
3A457;carbamazepine;metabolizes;
These sample records show the three types of relationships. Drugs can repress or augment a cytochrome, and cytochromes metabolize drugs.
Is there a way to use this CSV as is to load into neo4j and create the graph?
Thank you very much.
In neo4j terminology, a relationship must have a "type", and a node can have any number of labels. It looks like your use case could benefit from labelling your nodes with either Drug or Cytochrome.
Here is a possible neo4j data model for your use case:
(:Drug)-[:MODULATES {induces: false}]->(:Cytochrome)
(:Cytochrome)-[:METABOLIZES]->(:Drug)
The induces property has a boolean value indicating whether a drug induces (true) or represses (false) the related cytochrome.
The following is a (somewhat complex) query that generates the above data model from your CSV file:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///Drugs.csv' AS line FIELDTERMINATOR ';'
WITH line,
CASE line.arc_type
WHEN 'metabolizes' THEN {a: [1]}
WHEN 'induces' THEN {b: [true]}
ELSE {b: [false]}
END AS todo
FOREACH (ignored IN todo.a |
MERGE (c:Cytochrome {id: line.source})
MERGE (d:Drug {id: line.target})
MERGE (c)-[:METABOLIZES]->(d)
)
FOREACH (induces IN todo.b |
MERGE (d:Drug {id: line.source})
MERGE (c:Cytochrome {id: line.target})
MERGE (d)-[:MODULATES {induces: induces}]->(c)
)
The FOREACH clause does nothing if the value after the IN is null.
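The branching in that query can be easier to follow outside Cypher. The Python sketch below mirrors the same CASE logic on the sample rows from the question; it does not talk to Neo4j, and the function name `to_edge` and the tuple layout are made up for illustration.

```python
import csv
import io

SAMPLE = """source;target;arc_type;
artemisinin;2B6;induces;
amiodarone;1A2;represses;
3A457;carbamazepine;metabolizes;
"""


def to_edge(row):
    """Map a CSV row to (start_label, start_id, rel_type, properties, end_label, end_id)."""
    if row["arc_type"] == "metabolizes":
        # Cytochrome -> Drug, no properties
        return ("Cytochrome", row["source"], "METABOLIZES", {}, "Drug", row["target"])
    # Everything else is treated as Drug -> Cytochrome, like the ELSE branch
    induces = row["arc_type"] == "induces"
    return ("Drug", row["source"], "MODULATES", {"induces": induces}, "Cytochrome", row["target"])


# The trailing ';' on each line yields an extra empty column, which is simply ignored
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter=";")
edges = [to_edge(row) for row in reader]
for edge in edges:
    print(edge)
```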
Yes, it's possible, but you will need to install APOC, a library of useful stored procedures for Neo4j. You can find it here: https://neo4j-contrib.github.io/neo4j-apoc-procedures/
Then you should put your CSV file into the import folder of Neo4j and run these queries.
The first one creates a unique constraint on :Node(name):
CREATE CONSTRAINT ON (n:Node) ASSERT n.name IS UNIQUE;
And then this query imports your data:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///my-csv-file.csv' AS line FIELDTERMINATOR ';'
MERGE (n:Node {name: line.source})
MERGE (m:Node {name: line.target})
WITH n, m, line
CALL apoc.create.relationship(n, line.arc_type, {}, m) YIELD rel
RETURN count(rel)

Undefined columns selected error in R

I apologize in advance because I'm extremely new to coding and was thrust into it just a few days ago by my boss for a project.
My data set is called s1. It has 123 variables, and 4 of them have some form of "QISSUE" in their name. I want to take these four variables and duplicate them all, adding "Rec" to the end of each name (that way I can freely play with the new variables while still keeping the original ones).
Running this line of code keeps giving me an error:
b<- llply(s1[,
str_c(names(s1)
[str_detect(names(s1), fixed("QISSUE"))],
"Rec")],table)
The error is as such:
Error in `[.data.frame`(s1, , str_c(names(s1)[str_detect(names(s1), fixed("QISSUE")) & :
undefined columns selected
Thank you!
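The error occurs because the "Rec" columns do not exist in s1 yet, so they cannot be selected. Use this to get the subset and rename the copies. Of course, there are other ways to do this with simpler code.
library(plyr)     # llply
library(stringr)  # str_detect, str_c, fixed
# Names of the existing QISSUE columns
qnames <- names(s1)[str_detect(names(s1), fixed("QISSUE"))]
# Copy those columns out of s1
b <- llply(s1[, qnames], c)
# New names with the "Rec" suffix
nwnam <- str_c(qnames, "Rec")
ndf <- data.frame(do.call(cbind, b))
colnames(ndf) <- nwnam
ndf
# Of course, you can then attach the copies to the original data
cbind(s1, ndf)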
