Fastai tabular data - bad result on simple dataset

I am testing fastai tabular model and getting unexpected results.
Basically, I am trying to predict y = x * x using an input dataframe built on random x.
from fastai.tabular import *
# Build input dataframe
SIZE = 10000
df = pd.DataFrame({'x': np.random.randn(SIZE)})
df['y'] = df['x'] ** 2
# Create data object
dep_var = 'y'
cont_names = ['x']
data = (TabularList.from_df(df, cont_names=cont_names)
.split_by_rand_pct(valid_pct=0.1)
.label_from_df(cols=dep_var, label_cls=FloatList)
.databunch())
# Create model and learn
learn = tabular_learner(data, layers=[200,100], metrics=rmse)
learn.fit_one_cycle(5)
#epoch train_loss valid_loss root_mean_squared_error time
#0 0.706821 0.472120 0.467643 00:01
#1 0.275269 0.077311 0.271610 00:01
#2 0.194118 0.133515 0.325397 00:01
#3 0.176048 0.076927 0.187314 00:01
#4 0.163826 0.078179 0.207878 00:01
# Display result
row = df.sample(1).iloc[0]
print(row)
learn.predict(row)
# Typically:
# x = -1.582047 / y = 2.502874 / predicted_y = 2.324813
I'd expect deep learning to perform better, so I'm probably doing something wrong here.
Could someone explain why I'm getting such poor results?

When you square a distribution of random numbers like that, you're bound to wind up with some tiny values of y, which could lead to some floating point arithmetic issues that could affect your results.

How to generate a random sample according with a vector of different probabilities in a single command in R?

I need to simulate the vote cast ([0] = Reject, [1] = Approve) by a fictitious population according to their "probability to approve" (i.e., their probability of casting a vote of [1]). Each individual (id) has a home state (uf), which supposedly has a different probability to approve (prob_approve) and which is considered known in this toy example.
Here is an example of the data:
pop_tab <- read.table(header=TRUE,sep=',',text = "id,uf,prob_approve
1,SC,0.528788386
2,AM,0.391834279
3,RJ,0.805862415
4,SP,0.762671162
5,CE,0.168054353
6,MG,0.78433876
7,PR,0.529794529
8,PA,0.334581091
9,SP,0.762671162
10,PA,0.334581091")
I tried:
x <- list(prob = pop_tab$prob_approve)
vote <- lapply(x, runif)
... but I don't think runif() actually used the probabilities in the "prob_approve" column.
How could I simulate the vote cast of the population, according to their home-state probabilities, in a single command, without having to process line by line in a for loop?
Thank you in advance.
Use rbinom():
pop_tab <- read.table(header=TRUE,sep=',',text = "id,uf,prob_approve
1,SC,0.528788386
2,AM,0.391834279
3,RJ,0.805862415
4,SP,0.762671162
5,CE,0.168054353
6,MG,0.78433876
7,PR,0.529794529
8,PA,0.334581091
9,SP,0.762671162
10,PA,0.334581091")
rbinom(n = nrow(pop_tab),
       size = 1,
       prob = pop_tab$prob_approve)
## [1] 0 0 1 0 0 1 1 1 1 1
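As a small follow-up sketch of my own (reusing the pop_tab built above), the draws can be stored straight back in the table, one simulated vote per row:
set.seed(1)  # only so the example is reproducible
pop_tab$vote <- rbinom(n = nrow(pop_tab), size = 1, prob = pop_tab$prob_approve)
head(pop_tab)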

singular error when compute a specific variable mlogit (nested)

I'm using the data from ("TravelMode", package = "AER") and trying to follow the (Heiss, 2002) paper.
This is what my code looks like initially:
[Picture of the nesting structure]
library(mlogit)
data("TravelMode", package = "AER")
TravelMode_frame <- mlogit.data(TravelMode,choice = "choice",shape="long",chid.var="individual",alt.var = "mode")
ml_TM <- mlogit(choice ~travel|income,data=TravelMode_frame,nests = list(public = c('train','bus'), car=('car'),air = c('air')), un.nest.el = FALSE,unscaled=TRUE)
Then I want to separate the travel time variable between air and the other three modes, as in the picture, so I wrote:
air <- idx(TravelMode_frame, 2) %in% c('air')
TravelMode_frame$travel_air <- 0
TravelMode_frame$travel_air[air] <- TravelMode_frame$travel[air]
TravelMode_frame$travel[TravelMode_frame$alt == "air"] <- "0"
Then my data looks like this:
individual mode choice wait vcost travel gcost income size idx travel_air
1 1 air FALSE 69 59 0 70 35 1 1:air 100
2 1 train FALSE 34 31 372 71 35 1 1:train 0
3 1 bus FALSE 35 25 417 70 35 1 1:bus 0
4 1 car TRUE 0 10 180 30 35 1 1:car 0
~~~ indexes ~~~~
chid alt
1 1 air
2 1 train
3 1 bus
4 1 car
But when I fit it with
ml_TM <- mlogit(choice ~ travel + travel_air | income, data = TravelMode_frame, nests = list(public = c('train','bus'), car = c('car'), air = c('air')), un.nest.el = FALSE, unscaled = TRUE)
it says: Error in solve.default(H, g[!fixed]) : system is computationally singular: reciprocal condition number = 2.32747e-23
I have no idea why this happens. Could someone please help me?
I tried cutting the variables out of the formula one by one, and it didn't help.
PS: I rolled back to the data from before I created travel_air and tried giving travel time an alternative-specific coefficient with
ml_TM <- mlogit(choice ~ 0 | income | travel, data = TravelMode_frame, nests = list(public = c('train','bus'), car = c('car'), air = c('air')), un.nest.el = FALSE, unscaled = TRUE)
and it can't compute that either (Error in solve.default(crossprod(attr(x, "gradi")[, !fixed])) : system is computationally singular: reciprocal condition number = 1.39039e-20).
I think I get the idea behind this a little bit now (you can tell me if I'm wrong, though). I think my mistakes are:
First, I forgot to rescale income and travel time, so I need to add:
TravelMode$travel <- TravelMode$travel/60+TravelMode$wait/60
TravelMode$income <- TravelMode$income/10
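For completeness, a sketch of how that rescaling would slot into the workflow (my own assembly of the calls already shown above, not a verified reproduction of the paper):
# Rescale first, then rebuild the model frame and re-fit the nested model
library(mlogit)
data("TravelMode", package = "AER")
TravelMode$travel <- TravelMode$travel/60 + TravelMode$wait/60
TravelMode$income <- TravelMode$income/10
TravelMode_frame <- mlogit.data(TravelMode, choice = "choice", shape = "long", chid.var = "individual", alt.var = "mode")
ml_TM <- mlogit(choice ~ travel | income, data = TravelMode_frame, nests = list(public = c('train','bus'), car = c('car'), air = c('air')), un.nest.el = FALSE, unscaled = TRUE)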
About the first question, this one:
ml_TM <- mlogit(choice ~ travel + travel_air | income, data = TravelMode_frame, nests = list(public = c('train','bus'), car = c('car'), air = c('air')), un.nest.el = FALSE, unscaled = TRUE)
my nesting structure has degenerate nests, so the IV parameter no longer acts as a dissimilarity parameter but instead as a parameter that scales the variable, as in models J and L in the table in (Heiss, 2002). And maybe because I tried to make it handle two variables at once, the error arises because the IV parameter would have to scale both of those variables simultaneously.
For this problem,
ml_TM <- mlogit(choice ~ 0 | income | travel, data = TravelMode_frame, nests = list(public = c('train','bus'), car = c('car'), air = c('air')), un.nest.el = FALSE, unscaled = TRUE)
it is like the case above, corresponding to model L in the table.

Distance matrix

I am trying to determine the distance between every point in one data set and every point in the other data set in R. Each data set has an X and a Y parameter. I have been converting the data sets into data frames and then finding the distances. However, my current code creates a large matrix to do this, listing both data sets as both the columns and the rows, and I then need to pick out the specific part of the matrix I care about to get my answers. Is there a way to put just DSA as the columns and DSB as the rows? That would cut the matrix to a quarter of its size, and since my data sets each contain thousands of points, it would really cut down the time the algorithm takes to run.
Here is the code I am using:
tumor <- data.frame(DSA[, c("X_Parameter", "Y_Parameter")])
cells <- data.frame(DSB[, c("X_Parameter", "Y_Parameter")])
distances <- as.matrix(dist(rbind(tumor, cells)))
row.start <- nrow(tumor) + 1
row.end <- nrow(tumor) + nrow(cells)
col.start <- 1
col.end <- nrow(tumor)
d <- distances[row.start:row.end, col.start:col.end]
Try flexclust::dist2:
n_tumor = 2000
n_cells = 2000
tumor = matrix(runif(n_tumor * 2), n_tumor, 2)  # 2000 random (x, y) points
cells = matrix(runif(n_cells * 2), n_cells, 2)
t_dist = system.time({
  distances <- as.matrix(dist(rbind(tumor, cells)))
  row.start <- nrow(tumor) + 1
  row.end <- nrow(tumor) + nrow(cells)
  col.start <- 1
  col.end <- nrow(tumor)
  d <- distances[row.start:row.end, col.start:col.end]
})[3]
require(flexclust)
t_dist2 = system.time({d2 = dist2(x = cells, y = tumor, method = "euclidean")})[3]
t_dist # 1.477
t_dist2 # 0.244
identical(unname(d), d2) # TRUE
EDIT:
Another alternative is proxy::dist.
This will compute only the portion of the matrix you need:
tumoridx <- rep(1:nrow(tumor), each = nrow(cells))
cellsidx <- rep(1:nrow(cells), nrow(tumor))
tcdist <- matrix(sqrt(rowSums((tumor[tumoridx, ] - cells[cellsidx, ])^2)),
                 nrow(cells), nrow(tumor))
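For the proxy route mentioned in the edit, a minimal sketch of my own (assuming the same cells and tumor matrices as in the timing example above):
# proxy::dist(x, y) returns the cross-distance matrix between rows of x and rows of y
library(proxy)
d3 <- as.matrix(proxy::dist(cells, tumor, method = "Euclidean"))
dim(d3)  # nrow(cells) x nrow(tumor), the same block that was sliced out of dist() above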

what can you do to debug linear optimisation when using R lp

The high-level question is in the subject title: what can you do to debug linear optimisation when using R's lp()?
The detailed issue is that I have a working program adapted from http://pena.lt/y/2014/07/24/mathematically-optimising-fantasy-football-teams/
Based on player data it chooses an optimal 15-man squad, which is handy for the start of the year or when you can change all players.
I have changed it to:
1) Read player data from an Excel file (which I can supply; just tell me how)
2) Add 2 constraints for players I definitely want to include in the team and those I definitely don't.
Player data has the following columns:
web_name
team_name
type_name
now_cost
total_points
InTeam
In
Out
Good start. So I went on to model the normal weeks, when you can only transfer 1 player. I think I have the right constraint, but now lp chooses about 200 players for me, not 15. Something is very wrong, but I can't see how it gets there.
I have tried going back from my new code and stripping out the new feature, and it still works.
I have tried removing the In/Out constraints and keeping the new "1 change" constraint. Same result.
I have upgraded the packages and moved to the latest R.
Any pointers?
The code is:
#Straight lift from Web - http://pena.lt/y/2014/07/24/mathematically-optimising-fantasy-football-teams/
# plus extra constraints to exclude and include specific players via Excel In/Out columns
# This variant looks to limit changes (typically 1 or 2) for a normal week
library(gdata)
library(lpSolve)
library(stringr)
library(RCurl)
library(jsonlite)
library(plyr)
excelfile<-"C:/Users/mike/Documents/FF/Start2015R.xlsx"
df=read.xls(excelfile)
# Constants
num_teams = 20
num_constraints = 8
# InTeam,In,Out,Cost + 4 positions
#Create the constraints
num_gk = 2
num_def = 5
num_mid = 5
num_fwd = 3
team_size = num_gk + num_def + num_mid + num_fwd
#max_cost = 1000
max_cost = 998
#max_cost = 2000
max_changes = 2
min_same = team_size - max_changes
# Create vectors to constrain by position
df$Goalkeeper = ifelse(df$type_name == "Goalkeeper", 1, 0)
df$Defender = ifelse(df$type_name == "Defender", 1, 0)
df$Midfielder = ifelse(df$type_name == "Midfielder", 1, 0)
df$Forward = ifelse(df$type_name == "Forward", 1, 0)
# Create vector to constrain by max number of players allowed per team
team_constraint = unlist(lapply(unique(df$team_name), function(x, df){
ifelse(df$team_name==x, 1, 0)
}, df=df))
# next we need the constraint directions. First is for MinSame
const_dir <- c(">=","=","=","=", "=", "=", "=", rep("<=", 21))
# The vector to optimize against
objective = df$total_points
# Put the complete matrix together
# nrow is number of constraints
const_mat = matrix(c(df$Inteam,df$In,df$Out,df$Goalkeeper, df$Defender, df$Midfielder, df$Forward,
df$now_cost, team_constraint),
nrow=( num_constraints + length(unique(df$team_name))),
byrow=TRUE)
const_rhs = c(min_same ,sum(df$In),0,num_gk, num_def, num_mid, num_fwd, max_cost, rep(3, num_teams))
# And solve the linear system
x = lp ("max", objective, const_mat, const_dir, const_rhs, all.bin=TRUE, all.int=TRUE)
print(arrange(df[which(x$solution==1),], desc(Goalkeeper), desc(Defender), desc(Midfielder), desc(Forward), desc(total_points)))
print (df[which(x$solution==1),"web_name",drop=FALSE], row.names = FALSE)
# what changed
df[which(x$solution != df$InTeam),"web_name",drop=FALSE]
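As a generic way to sanity-check an lp() setup like this, here is a small sketch of my own (reusing the objects built above; it is not a diagnosis of the specific bug): confirm that the constraint matrix, directions and right-hand side all line up, inspect the constraint rows, and check the solver status.
# Each row of const_mat should pair with one entry of const_dir and const_rhs,
# and each column with one player (one entry of the objective vector).
stopifnot(nrow(const_mat) == length(const_dir),
          nrow(const_mat) == length(const_rhs),
          ncol(const_mat) == length(objective))
# Count the non-zero entries in each constraint row; a misspelled column name
# (which silently returns NULL) shortens the data fed to matrix() and shifts
# every row, which tends to show up here.
rowSums(const_mat != 0)
# lp() reports success as status 0; 2 means no feasible solution was found.
x$status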

MA terms in arima

With the arima function I found some nice results; however, I now have trouble interpreting them for use outside R.
I am currently struggling with the MA terms. Here is a short example:
ser=c(1, 14, 3, 9) #Example series
mod=arima(ser,c(0,0,1)) #From {stats} library
mod
#Series: ser
#ARIMA(0,0,1) with non-zero mean
#
#Coefficients:
# ma1 intercept
# -0.9999 7.1000
#s.e. 0.5982 0.8762
#
#sigma^2 estimated as 7.676: log likelihood = -10.56
#AIC = 27.11 AICc = Inf BIC = 25.27
mod$resid
#Time Series:
#Start = 1
#End = 4
#Frequency = 1
#[1] -4.3136670 3.1436951 -1.3280435 0.6708065
predict(mod,n.ahead=5)
#$pred
#Time Series:
#Start = 5
#End = 9
#Frequency = 1
#[1] 6.500081 7.100027 7.100027 7.100027 7.100027
#
#$se
#Time Series:
#Start = 5
#End = 9
#Frequency = 1
#[1] 3.034798 3.917908 3.917908 3.917908 3.917908
?arima
When looking at the specification this formula is presented:
X[t] = a[1]X[t-1] + … + a[p]X[t-p] + e[t] + b[1]e[t-1] + … + b[q]e[t-q]
Given my choice of AR and MA terms, and considering that I have included a constant, this should reduce to:
X[t] = e[t] + b[1]e[t-1] + constant
However, this does not hold up when I compare the results from R with manual calculations:
6.500081 != 6.429261 == -0.9999 * 0.6708065 + 7.1000
Furthermore, I also cannot reproduce the in-sample errors, even though, assuming I know the first one, this should be possible:
-4.3136670 * -0.9999 +7.1000 != 14 - 3.1436951
3.1436951 * -0.9999 +7.1000 != 3 + 1.3280435
-1.3280435 * -0.9999 +7.1000 != 9 - 0.6708065
I hope someone can shed some light on this matter so that I can actually use the nice results I have obtained.
Not having your exact data makes it a bit hard to give an appropriate answer, but I think the explanations on this site may help you. The intercept is a bit confusingly named and may in fact be, depending on your specification, a sample mean. If I read your code correctly, this is also true for your estimated results.
Based on the replies given on Stack Exchange, this seems to be the answer:
When MA terms approach -1, the model is nearly non-invertible and therefore should not be used. In this situation the manual calculations may not match the calculations in R, even though the interpretation is actually correct.
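To make that comparison concrete, here is a small sketch of my own (reusing ser and mod from above): R's in-sample fitted values are simply the observed series minus the residuals, and they can be set against the hand-rolled intercept + ma1 * e[t-1] prediction. As the explanation above suggests, with ma1 this close to -1 the two need not agree exactly.
# Fitted values as R computes them: observed minus residuals
ser - mod$residuals
# Hand-rolled one-step MA(1) predictions from the reported coefficients,
# feeding in the previous residual (taken as zero for the first observation)
ma1 <- coef(mod)["ma1"]
mu  <- coef(mod)["intercept"]
mu + ma1 * c(0, head(as.numeric(mod$residuals), -1))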
