h2o deep learning NN with 1 layer not reproducing X + Y = Z - R

I have just started using the h2o package in R to build a supervised neural network with the deeplearning function.
To get on track, I began by trying to approximate a simple function like X + Y = Z.
My code is the following:
data <- read.table("DeepLearningTest.csv", header = FALSE, sep = ",", quote = "", stringsAsFactors = FALSE)
test <- read.table("DeepLearningTestRun.csv", header = FALSE, sep = ",", quote = "", stringsAsFactors = FALSE)
df <- data.frame(data)
tf <- data.frame(test)
h2o.init()
hf <- as.h2o(df)
m2 <- h2o.deeplearning(
  training_frame = hf,
  x = 1:2,   # predictor columns (column indices in R are 1-based)
  y = 3,     # response column (the file has no header, so use the index)
  hidden = c(100),
  epochs = 100000,
  stopping_tolerance = 0.001
)
h2o.predict(m2, as.h2o(tf))
My training samples are the following (for example):
1 1 2
2 2 4
3 3 6
4 4 8
. . .
until
2000 2000 4000
In general, the pattern is X + X = 2X.
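Since the CSV files are not shown, here is a minimal sketch of equivalent data built directly in R (an assumption on my part; the V1-V3 names match what read.table produces with header = FALSE):
df <- data.frame(V1 = 1:2000, V2 = 1:2000, V3 = 2 * (1:2000))  # X + Y = Z with X == Y
tf <- data.frame(V1 = c(100, 101, 7, 4000), V2 = c(100, 101, 7, 4000))  # points to predict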
The thing I do not understand, and the reason I am writing, is this: with a one-layer network (which, by the universal approximation theorem, should be sufficient),
I can train the network and then predict quite good values within the range of the training data.
For instance, the network gives me
100 100 200
101 101 202
7 7 14
but when I put in
4000 4000
the result is way off: it gives me something like 6300.
It seems that the network is not able to generalize.
What am I doing wrong to cause this behavior?
Thank you for your attention.
Regards,
Nicola

Related

Loop same actions in R

I have an issue here.
I want to loop my operations in R, but I do not know how to do this properly and efficiently.
I have several differently sized datasets, and performing the same block of code on each one by hand is time-consuming.
Here is the code I need to apply to each of the datasets and write the data or the output from a model into the datasets with different names.
##########################################################################################################################
#the combined list of separate data frames where the last letter is changing A, B, C...
z <- list(Data_A, Data_B, Data_C)
#need to loop these operations performed by using data from datasets. Here is an example by using data from Data_A dataset.
# TFP estimation by using ACF method
ACF_A <- prodest::prodestACF(Data_A$turn, fX = Data_A$cogs, sX = Data_A$tfa, pX = Data_A$cogs,
                             idvar = Data_A$ID, timevar = Data_A$Year,
                             R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
omegaACF_A <- prodest::omega(ACF_A)
Data_A$omegaACF_A <- omegaACF_A
#########################################################################################################################
# Growth variables
Data_A <- Data_A %>%
  arrange(ID, Year) %>%
  group_by(ID) %>%
  mutate(domegaACF_A = omegaACF_A - dplyr::lag(omegaACF_A),
         debt = LOAN + LTD,
         ddebt = debt - dplyr::lag(debt),
         dsales = SALE - dplyr::lag(SALE)) %>%
  ungroup()
# Panel data frame
PData_A <- pdata.frame(Data_A, index = c("ID","Year"))
# Within estimator
within_2way_A <- plm(domegaACF_A ~ dplyr::lag(domegaACF_A, 1) + dplyr::lag(domegaACF_A, 2) + ddebt +
                       lag(ff1, 1) + ddebt:lag(ff1, 1) + log(Age) + ta + dsales,
                     data = PData_A, effect = "twoways", model = "within", index = c("ID", "Year"))
The main problem is that I do not know how to store the results in separate datasets with corresponding names. For example, in the following block of code, _A should change to _B, _C according to the dataset being used.
ACF_A <- prodest::prodestACF(Data_A$turn, fX = Data_A$cogs, sX = Data_A$tfa, pX = Data_A$cogs,
                             idvar = Data_A$ID, timevar = Data_A$Year,
                             R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
omegaACF_A <- prodest::omega(ACF_A)
Data_A$omegaACF_A <- omegaACF_A
I know there are lapply and for loops, but I do not know how to use them with changing names for the storage variables:
z <- list(df1, df2, df3)
for (i in z){
  ACF_[1 or 2 or 3] <- prodest::prodestACF(i$turn, fX = i$cogs, sX = i$tfa, pX = i$cogs, idvar = i$ID, timevar = i$Year,
                                           R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
  omegaACF_[1 or 2 or 3] <- prodest::omega(ACF_[1 or 2 or 3])
  Data_[]$omegaACF_[1 or 2 or 3] <- prodest::omega(ACF_[1 or 2 or 3])
}
UPD: Here are several datasets: https://drive.google.com/drive/folders/1gBV2ZkywW6JqDjRICafCwtYhh2DHWaUq?usp=sharing
UPD2:
Data_A
turn cogs tfa SALE
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
Data_B
turn cogs tfa SALE
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
After running the loop I need:
ACF_A, ACF_B, etc.: storage variables holding the results of the prodest estimations
omegaACF_A, omegaACF_B, etc.: storage for the omega values returned by prodest
the omegaACF_A, omegaACF_B estimation results added to the Data_A, Data_B datasets accordingly, as new variables
After that, the growth variables should be created in the Data_A, Data_B datasets.
The plm regressions should be stored in within_2way_A, within_2way_B accordingly.
So in the end, I need:
Data_A
turn cogs tfa SALE omegaACF_A domegaACF_A debt ddebt dsales
1 1 1 1 0.1 NA 1 NA NA
2 2 2 2 0.3 0.2 2 1 1
3 3 3 3 0.6 0.3 3 1 1
4 4 4 4 0.9 0.3 4 1 1
Data_B
turn cogs tfa SALE omegaACF_B domegaACF_B debt ddebt dsales
5 5 5 5 1.1 NA 5 NA NA
6 6 6 6 1.5 0.4 6 1 1
7 7 7 7 1.7 0.2 7 1 1
8 8 8 8 2.0 0.3 8 1 1
One approach is to separate the ACF estimation and omega calculation from the summary creation using different lapply() calls. Since you did not supply any example data, this is a blind shot, but try the following. Note that I assumed every dataset has the same column names! In case it doesn't solve your problem I will remove my answer.
data <- list(Data_A, Data_B, Data_C)
Estimates <- lapply(data, function(x){
  prodest::prodestACF(x$turn, fX = x$cogs, sX = x$tfa, pX = x$cogs, idvar = x$ID, timevar = x$Year,
                      R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
})
Summaries_estimates <- lapply(Estimates, summary)
Omegas <- lapply(Estimates, function(x) prodest::omega(x))
Summaries_omega <- lapply(Omegas, summary)
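If you also need the suffixed objects (ACF_A, omegaACF_A, Data_A, ...) back in your workspace, here is a minimal sketch, assuming the list elements were supplied in the order A, B, C; assign() is used purely for illustration:
suffixes <- c("A", "B", "C")
for (i in seq_along(data)) {
  # attach the omega estimates to the i-th dataset under the suffixed column name
  data[[i]][[paste0("omegaACF_", suffixes[i])]] <- Omegas[[i]]
  # recreate the suffixed objects in the global environment
  assign(paste0("ACF_", suffixes[i]), Estimates[[i]])
  assign(paste0("omegaACF_", suffixes[i]), Omegas[[i]])
  assign(paste0("Data_", suffixes[i]), data[[i]])
}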
Alternative using loops
Since you asked, it is also possible to define a single loop that does everything together, though this is usually much slower. For this, we define empty lists to carry the results of the ACF estimation etc. and loop over the list of data.frames we already created.
data <- list(Data_A, Data_B, Data_C)
Estimates <- list()
Summaries_estimates <- list()
Omegas <- list()
Summaries_omegas <- list()
for(i in 1:length(data)){
  Estimates[[i]] <- prodest::prodestACF(data[[i]]$turn, fX = data[[i]]$cogs, sX = data[[i]]$tfa, pX = data[[i]]$cogs,
                                        idvar = data[[i]]$ID, timevar = data[[i]]$Year,
                                        R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
  Summaries_estimates[[i]] <- summary(Estimates[[i]])
  Omegas[[i]] <- prodest::omega(Estimates[[i]])
  Summaries_omegas[[i]] <- summary(Omegas[[i]])
}

Need advice on how to abstract my simulator for opening collectible card packs

I've been building simulators in Excel with VBA to understand the distribution of outcomes a player may experience as they open up collectible card packs. These were largely built with nested for loops and, as you can imagine, were slow as molasses.
I've been spinning up on R over the last couple of months and have come up with a function that handles a particular definition of a pack (i.e., two cards with particular drop rates for n characters on either card). Now I am trying to abstract the function so that it can take any number of cards of whatever type you want to throw at it (e.g., currency, gear, materials, etc.).
What this simulator is basically doing is saying "I want to watch 10,000 people open up 250 packs of 2 cards" and then I perform some analysis after the results are generated to ask questions like "How many $ will you need to spend to acquire character x?" or "What's the distribution of outcomes for getting x, y or z pieces of a character?"
Here's my generic function and then I'll provide some inputs that the function operates on:
mySimAnyCard <- function(observations, packs, lookup, droptable, cardNum){
  obvs <- rep(1:observations, each = packs)
  pks <- rep(1:packs, times = observations)
  crd <- rep(cardNum, length.out = length(obvs))
  if ("prob" %in% colnames(lookup)) {
    awrd <- sample(lookup[, "award"], length(obvs), replace = TRUE, prob = lookup[, "prob"])
  } else {
    awrd <- sample(unique(lookup[, "award"]), length(obvs), replace = TRUE)
  }
  qty <- sample(droptable[, "qty"], length(obvs), prob = droptable[, "prob"], replace = TRUE)
  df <- data.frame(observation = obvs, pack = pks, card = crd, award = awrd, quantity = qty)
  df
}
observations and packs each take an integer.
lookup takes a dataframe:
award prob
1 Nick 0.5
2 Alex 0.4
3 Sam 0.1
and droptable takes a similar dataframe :
qty prob
1 10 0.1355
2 12 0.3500
3 15 0.2500
4 20 0.1500
5 25 0.1000
6 50 0.0080
... continued
cardNum also takes an integer.
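For concreteness, a hypothetical call using the example tables above (the droptable probabilities shown are truncated, which is fine because sample() renormalizes prob internally):
lookup <- data.frame(award = c("Nick", "Alex", "Sam"),
                     prob = c(0.5, 0.4, 0.1),
                     stringsAsFactors = FALSE)
droptable <- data.frame(qty = c(10, 12, 15, 20, 25, 50),
                        prob = c(0.1355, 0.35, 0.25, 0.15, 0.10, 0.008))
# watch 10,000 people open 250 packs, simulating card 1 of each pack
card1 <- mySimAnyCard(observations = 10000, packs = 250,
                      lookup = lookup, droptable = droptable, cardNum = 1)
head(card1)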
It's fine to run this multiple times, assign the output to variables, and then rbind and order the results, but what I'd really like to do is feed a master function a data frame that says which cards to provision and which lookup and droptable to pull against for each card, à la:
card lookup droptable
1 1 char1 chardrops
2 2 char1 chardrops
3 3 char2 <NA>
4 4 credits <NA>
5 5 credits creditdrops
6 6 abilityMats abilityMatDrops
7 7 abilityMats abilityMatDrops
It's probably never going to be more than 20 cards...so I'm willing to take the speed of a for loop, but I'm curious how the SO community would approach this problem.
Here's what I put together thus far:
mySimAllCards <- function(observations, packs, cards){
  full <- data.frame()
  for(i in 1:length(cards$card)){
    tmp <- mySimAnyCard(observations, packs, cards[i, 2], cards[i, 3], i)
    full <- rbind(full, tmp)
  }
  full
}
which trips over
Error in `[.default`(lookup, , "award") : incorrect number of dimensions
I can work through the issues above, but is there a better approach to consider?
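One direction worth sketching (under the assumption that the actual lookup and droptable data frames, hypothetically named char1, char2, chardrops, etc. here, can be stored in list-columns of the card table instead of as names): with real objects in each row, Map can drive mySimAnyCard without any name resolution:
cards <- data.frame(card = 1:3)
cards$lookup <- list(char1, char1, char2)                 # lookup data frames, one per card
cards$droptable <- list(chardrops, chardrops, chardrops)  # droptable data frames, one per card
sims <- Map(mySimAnyCard,
            observations = 10000, packs = 250,   # length-1 arguments are recycled by Map/mapply
            lookup = cards$lookup, droptable = cards$droptable,
            cardNum = cards$card)
full <- do.call(rbind, sims)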

Formatting life-tables for use in survival analysis

I'm trying to use the 'relsurv' package in R to compare the survival of a cohort to national life tables. The code below demonstrates my problem, using the example from relsurv but changing the life-table data. I've used just two years and two ages in the life-table data below; the actual data is much larger but gives the same error. The error is 'invalid ratetable argument', but I've formatted the table as per the example life-tables 'slopop' and 'survexp.us'.
library(survival)
library(relsurv)
data(rdata) # example data from relsurv
raw = read.table(header=T, stringsAsFactors = F, sep=' ', text='
Year Age sex qx
1980 30 1 0.00189
1980 31 1 0.00188
1981 30 1 0.00191
1981 31 1 0.00191
1980 30 2 0.00077
1980 31 2 0.00078
1981 30 2 0.00076
1981 31 2 0.00074
')
ages = c(30,40) # in years
years = c(1980, 1990)
rtab = array(data=NA, dim=c(length(ages), 2, length(years))) # set up blank array: ages, sexes, years
for (y in unique(raw$Year)){
  for (s in 1:2){
    # yearly probability of death, transformed to a daily hazard (see ratetables help)
    rtab[, s, y - min(years) + 1] = -1 * log(1 - subset(raw, Year == y & sex == s)$qx) / 365.24
  }
}
attributes(rtab)$dimnames[[1]] = as.character(ages)
attributes(rtab)$dimnames[[2]] = c('male','female')
attributes(rtab)$dimnames[[3]] = as.character(years)
attributes(rtab)$dimid <- c("age", "sex", 'year')
attributes(rtab)$dim <- c(length(ages), 2, length(years))
attributes(rtab)$factor = c(0,0,1)
attributes(rtab)$type = c(2,1,4)
attributes(rtab)$cutpoints[[1]] = ages*365.24 # must be in days
attributes(rtab)$cutpoints[[2]] = NULL
attributes(rtab)$cutpoints[[3]] = as.date(paste("1Jan", years, sep='')) # must be date
attributes(rtab)$class = "ratetable"
# example from relsurv
rsmul(Surv(time, cens) ~ sex + as.factor(agegr) +
        ratetable(age = age * 365.24, sex = sex, year = year),
      data = rdata, ratetable = rtab, int = 1)
Try using the transrate function from the relsurv package to reformat the data. That should give you a compatible dataset.
Regards,
Josh
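For illustration, a hedged sketch of what that might look like with the question's raw data (check ?transrate for the exact expected input; men and women are matrices of yearly death probabilities with ages as rows and calendar years as columns):
men   <- matrix(subset(raw, sex == 1)$qx, nrow = 2,
                dimnames = list(c(30, 31), c(1980, 1981)))
women <- matrix(subset(raw, sex == 2)$qx, nrow = 2,
                dimnames = list(c(30, 31), c(1980, 1981)))
rtab2 <- relsurv::transrate(men, women, yearlim = c(1980, 1981), int.length = 1)
is.ratetable(rtab2)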
Three things to add:
You should set attributes(rtab)$factor = c(0,1,0), since sex (the second dimension) is a factor (i.e., doesn't change over time).
A good way to check whether something is a valid rate table is to use the is.ratetable() function. is.ratetable(rtab, verbose = TRUE) will even return a message stating what was wrong.
Check the result of is.ratetable without verbose first, because the verbose check can report a problem even for a perfectly valid rate table.
The rest of this answer is about that false report.
If the type attribute isn't given, is.ratetable will calculate it using the factor attribute; you can see this by just printing the function. However, it seems to do so incorrectly. It uses type <- 1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0), where fac is attributes(rtab)$factor.
But the next section, which checks the type attribute if it's provided, says the only valid values are 1, 2, 3, and 4. It's impossible to get 1 from the code above.
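To see this concretely, here is the same calculation run on the factor attribute suggested in the first point:
fac <- c(0, 1, 0)   # sex, the second dimension, is the factor
type <- 1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0)
type
# [1] 2 5 2  -- 5 is not among the documented valid values, and since fac == 1
#             implies fac > 0 (adding 4), no value of fac can ever produce 1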
For example, let's examine the slopop ratetable provided with the relsurv package.
library(relsurv)
data(slopop)
is.ratetable(slopop)
# [1] TRUE
is.ratetable(slopop, verbose = TRUE)
# [1] "wrong length for cutpoints 3"
I think this is where your rate table is being hung up.

R: How to iterate loops over every file in a folder?

I am struggling to iterate 2 loops over all the files in a folder. I have over 600 .csv files, which contain information about the latency and duration of saccades made in a sentence. They look like this:
order subject sentence latency duration
1 1 1 641 76
2 1 1 98 57
3 1 1 252 49
4 1 1 229 43
For each of the files, I want to create 2 new columns called Start and End, to calculate the start and end point of each saccade. The values in each of those are calculated from the values in the latency and duration columns. I can do this using a loop for each file, like so:
SentFile = read.csv(file.choose(), header = TRUE, sep = ",")
# Calculate Start
for (i in 1:(nrow(SentFile)-1)){
  SentFile$Start[1] = SentFile$latency[1]
  SentFile$Start[i+1] = SentFile$Start[i] +
    SentFile$duration[i] + SentFile$latency[i+1]}
# Calculate End
for (i in 1:nrow(SentFile)){
  SentFile$End[i] = SentFile$Start[i] + SentFile$duration[i]}
And then the result looks like this:
order subject sentence latency duration Start End
1 1 1 641 76 641 717
2 1 1 98 57 815 872
3 1 1 252 49 1124 1173
4 1 1 229 43 1402 1445
I am sure there is probably a more efficient way of doing it, but it is very important to use the precise cells specified in the loop to calculate the Start and End values and that was the only way I could think of to get it to work for each individual file.
As I said, I have over 600 files, and I want to be able to calculate the Start and End values for the entire set and add the new columns to each file. I tried using lapply, like this:
sent_files = list.files()
lapply(sent_files, function(x){
  SentFile = read.csv(x, header = TRUE, sep = ",")
  for (i in 1:(nrow(SentFile)-1)){
    SentFile$Start[1] = SentFile$latency[1]
    SentFile$Start[i+1] = SentFile$Start[i] + SentFile$duration[i] +
      SentFile$latency[i+1]}
  # Calculate End of Saccade Absolute Time Stamp
  for (i in 1:nrow(SentFile)){
    SentFile$End[i] = SentFile$Start[i] + SentFile$duration[i]}})
However, I keep getting this error message:
Error in `$<-.data.frame`(`*tmp*`, "SacStart", value = c(2934L, NA)): replacement has 2 rows, data has 1
I would really appreciate any help in getting this to work!
First, replace the for loops:
data <- data.frame(
  "order" = c(1, 2, 3, 4), subject = c(1, 1, 1, 1), sentence = c(1, 1, 1, 1),
  latency = c(641, 98, 252, 229), duration = c(76, 57, 49, 43)
)
data$end <- cumsum(data$latency + data$duration)
data$start <- data$end - data$duration
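On the sample data this reproduces the values from the question (modulo the Start/End capitalization):
data$start
# [1]  641  815 1124 1402
data$end
# [1]  717  872 1173 1445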
Secondly, your anonymous function never returns the loaded data frame, and the result of lapply is not assigned to anything in your environment.
If you want to process all files in one go, change the code for data load to this:
data.list <- lapply(sent_files, function(x){
  data <- read.csv(x, header = TRUE, sep = ",")
  return(data)
})
data <- do.call("rbind", data.list)
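If each file should instead keep its own Start and End columns (rather than everything being combined into one frame), a minimal sketch putting both points together, assuming the lowercase column names from the sample data and that overwriting the original files is acceptable:
sent_files <- list.files(pattern = "\\.csv$")
invisible(lapply(sent_files, function(x) {
  d <- read.csv(x)
  d$End <- cumsum(d$latency + d$duration)  # vectorized version of the loops
  d$Start <- d$End - d$duration
  write.csv(d, x, row.names = FALSE)       # overwrites x; change the path to keep originals
}))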

Vectorization in R

I am new to R and have researched vectorization, but I am still training my mind to think in vectorized terms. Examples of vectorization replacing loops are often either too simple for me to generalize from, or absent altogether.
Can anyone suggest how I can vectorize the following?
Model2 <- subset(Cor.RMA, MODEL == Models.Sort[2,1])
RCM2 <- count(Model2$REPAIR_CODE)
colnames(RCM2) <- c("REPAIR_CODE", "FREQ")
M2M <- merge(RCM.Sort, RCM2, by = "REPAIR_CODE", all.x = TRUE)
M2M.Sort <- M2M[order(M2M$FREQ.x, decreasing = TRUE), ]
M2M.Sort[is.na(M2M.Sort)] <- 0
In the above code, each "2" needs to run from 2 to 85.
writeWorksheetToFile(file = "CL2 - TC - RC.xlsx",
                     data = M2M.Sort[, c("FREQ.y")],
                     sheet = "RC by Model",
                     clearSheets = FALSE,
                     startRow = 6,
                     startCol = 6)
In the above code, "data" should from from "M2M..." to "M85M..." and "startCol" should run from 6 to 89 for an Excel printout.
The data frame this comes from (Cor.RMA) has columns "MODEL", "REPAIR_CODE", and others that are unused.
RCM.Sort is a frequency table of each "REPAIR_CODE" across all models that I use as a Master list to adjoin Device-specific Repair Code counts. (left-join: all.x = TRUE)
Models.Sort is a frequency table I generated using the "count" function from the plyr package, so I can create subsets for each MODEL.
Then I merge a list of each "REPAIR_CODE" that I generated using the "unique" function.
Sample Data:
CASE_NO DEVICE_TYPE MODEL TRIAGE_CODE REPAIR_CODE
12341 Smartphone X TC01 RC01
12342 Smartphone Y TC02 RC02
12343 Smartphone Z TC01, TC05 RC05
12344 Tablet AA TC02 RC37
12345 Wearable BB TC05 RC37
12346 Smartphone X TC07 RC01
12347 Smartphone Y TC04 RC02
I very much appreciate your time and effort if you are willing to help.
Alright, this is not what your original script did, but here goes:
models <- c("X","Y","Z","AA","BB") # in your case it would be Models.Sort[2:85,1]
new <- Cor.RMA[Cor.RMA$MODEL %in% models,]
new2 <- aggregate(new$REPAIR_CODE, list(new$MODEL), table)
temp <- unlist(new2[[2]])
temp <- temp[, order(colSums(temp), decreasing = T)]
out <- data.frame(group=new2[,1], temp)
out <- out[order(rowSums(out[,-1]), decreasing = T),]
out
# group RC01 RC02 RC37 RC05
# 3 X 2 0 0 0
# 4 Y 0 2 0 0
# 1 AA 0 0 1 0
# 2 BB 0 0 1 0
# 5 Z 0 0 0 1
You can then write it easily to an xlsx file, e.g. with:
require(xlsx)
write.xlsx(out, "test.xlsx", row.names = FALSE)
Edit: Added sorting.
Edit2: Fixed sorting.
