I'm trying to find patterns in a large dataset using the neuralnet package.
My data file looks something like this (30,204,447 rows):
id.company,EPS.or.Sales,FQ.or.FY,fiscal,date,value
000001,EPS,FY,2001,20020201,-5.520000
000001,SAL,FQ,2000,20020401,70.300003
000001,SAL,FY,2001,20020325,49.200001
000002,EPS,FQ,2008,20071009,-4.000000
000002,SAL,FY,2008,20071009,1.400000
I have split this initial file into four new files (annual/quarterly sales and EPS). It is on those files that I want to use neural networks; in the case below, I want to see whether the variables id.company, fiscal and date can predict the annual sales results.
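For reference, here is a minimal sketch of how that split could be done in R (the master file name and the tab separator below are placeholders, not my actual setup):
# Hypothetical sketch: split the master file into the four subsets described above
full <- read.table("master_data.txt", header = TRUE, sep = "\t",
                   colClasses = c(id.company = "character"))  # keep leading zeros in the IDs
fy_sal <- subset(full, EPS.or.Sales == "SAL" & FQ.or.FY == "FY")  # annual sales
fq_sal <- subset(full, EPS.or.Sales == "SAL" & FQ.or.FY == "FQ")  # quarterly sales
fy_eps <- subset(full, EPS.or.Sales == "EPS" & FQ.or.FY == "FY")  # annual EPS
fq_eps <- subset(full, EPS.or.Sales == "EPS" & FQ.or.FY == "FQ")  # quarterly EPS
write.table(fy_sal, "fy_sal_data.txt", sep = "\t", row.names = FALSE)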
To do so, I have written the following code:
dataset <- read.table("fy_sal_data.txt", header=T, sep="\t") # my file doesn't actually use commas as separators
#extract training set and testing set
trainset <- dataset[1:1000, ]
testset <- dataset[1001:2000, ]
#building the NN
ann <- neuralnet(value ~ id.company + fiscal + date, trainset, hidden = 3,
                 lifesign = "minimal", threshold = 0.01)
#testing the output
temp_test <- subset(testset, select=c("id.company", "fiscal", "date"))
ann.results <- compute(ann, temp_test)
#display the results
cleanoutput <- cbind(testset$value, as.data.frame(ann.results$net.result))
colnames(cleanoutput) <- c("Expected Output", "NN Output")
head(cleanoutput, 30)
Now my problem is that the compute function returns a constant answer no matter the inputs of the testing set.
Expected Output NN Output
1001 2006.500000 1417.796651
1002 2009.000000 1417.796651
1003 2006.500000 1417.796651
1004 2002.500000 1417.796651
I am very new to R and its neural network packages, but I have found online that results like these can be caused by either:
an insufficient number of training examples (here I'm using a thousand, but I've also tried using a million rows and the results were the same, only it took 4 hours to train)
or an error in the formula.
I am sure I'm doing something wrong but I can't seem to figure out what.
I'm running into some problems while running plm regressions on my panel data. Basically, I have to take out a year from my data and also all observations where a certain variable is zero. I tried to make a reproducible example using a dataset from the AER package.
library(AER)
library(plm)
data("Grunfeld", package = "AER")
View(Grunfeld)
# Here I randomly set some observations of the third variable (capital) to zero, to mimic my dataset
for (i in 1:220) {
  x <- rnorm(10, 0, 1)
  if (mean(x) >= 0) {
    Grunfeld[i, 3] <- 0
  }
}
View(Grunfeld)
panel <- Grunfeld
#First Method
#This is how I was originally manipulating my data and running my regression
panel <- Grunfeld
dd <- pdata.frame(panel, index = c('firm', 'year'))
dd <- dd[dd$year!=1935, ]
dd <- dd[dd$capital !=0, ]
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
summary(ols_model_2)
# However, I couldn't plot the variables of the dataset in graphs, because they weren't vectors. So I tried another way:
#Second Method
panel <- panel[panel$year!= 1935, ]
panel <- panel[panel$capital != 0,]
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
summary(ols_model)
#But this gave extremely different results for the ols regression!
In my understanding, both approaches should have yielded the same output in the OLS regression. Now I'm afraid my entire analysis is wrong, because I was doing it the first way. Could anyone explain to me what is happening?
Thanks in advance!
You are running two different models. I am not sure why you would expect the results to be the same.
Your first model is:
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
While the second is:
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
As you can see from the summary of the models, both are "Oneway (individual) effect Within Model". In the first one you don't specify the index, since dd is a pdata.frame object. In the second you do specify the index, because panel is a plain data.frame. However, this makes no difference at all.
The difference is between using the log of capital and using capital without the log.
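To see this, here is a quick sketch (ols_model_2b is just a name I am introducing): fit the same formula on the pdata.frame from your first approach and compare the coefficients with your second model:
# Same formula as the second model, but using the pdata.frame from the first approach
ols_model_2b <- plm(log(value) ~ log(capital), data = dd)
# The coefficients should agree with ols_model up to numerical tolerance
all.equal(coef(ols_model_2b), coef(ols_model))
summary(ols_model_2b)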
As a side note, dropping the zero observations is often very problematic. If you do that, make sure you also try alternative ways of dealing with zeros, and see how much your results change. You can get started here: https://stats.stackexchange.com/questions/1444/how-should-i-transform-non-negative-data-including-zeros
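For example, one commonly mentioned alternative (this is only a sketch, not a recommendation tailored to your data) is a log(1 + x) transformation, which keeps the zero observations in the sample:
# Keep the zeros and use log1p(capital) = log(1 + capital) instead of dropping rows
panel_all <- Grunfeld[Grunfeld$year != 1935, ]
ols_model_log1p <- plm(log(value) ~ log1p(capital), data = panel_all,
                       index = c("firm", "year"))
summary(ols_model_log1p)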
I am trying to create a dual-species occupancy model using unmarkedFrameOccuMulti. I've been successful in producing the UMF and have even got a basic plot of the detections, but when I try to run an individual model I get the error message:
Error in occu(~1, ~Vill_Dist, umf) : Data is not an unmarkedFrameOccu object.
I've made sure the CSVs have the same number of rows etc. I'm a bit mystified because I can't find much online, and the UMF itself has run perfectly; R just can't seem to separate out the different aspects of it.
S <- 2 # number of species
M <- 354 #number of sites - i.e. number of sites with actual data (#i.e. not NAs/transects that were taken - some transects were done 14 times, others as little as 2 times)
J <- 9.07 #average number of visits per transect
y <- list(matrix(rbinom(354, 1, 0.456)),  # species 1 leopard
          matrix(rbinom(354, 1, 0.033)))  # species 2 wolf
So the above is code I'm following from the R help for unmarkedFrameOccuMulti. The ordering of the numbers is based on the rbinom function, i.e. wolves were seen at a proportion of 0.033 of the sites surveyed.
obscov <- read.csv("grazcov2.csv")
Error message is ObsCovData needs M*obsNum of rows
umf <- unmarkedFrameOccuMulti(y=y, siteCovs = predcovs2, obsCovs = NULL)
predcovs2
summary(umf)
plot(umf)
umf
m1 <- occu(~1, ~Vill_Dist, umf) # this is the code that doesn't work; Vill_Dist is one of the covariates in the csv, spelled correctly/identically etc.
I expected this to produce a model that would predict the occurrence of leopards/wolves based on the covariates.
As I was writing this out I had an idea for what might be going wrong. I couldn't get the model to work previously because I was putting in the detection data in csv format rather than using the simple binomial function.
Is it simply that R cannot mix csv/imported data and the binomial data?
I received a comment from a reviewer who wanted all the p-values for each line of specific variable levels in a demographic characteristics table (Table 1). Even though the request seems quite strange (and imprecise) to me, I would like to go along with the suggestion.
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
print(tableOne, nonnormal = c("bili","chol","copper","alk.phos","trig"), exact = c("status","stage"), smd = TRUE)
the output:
I need to have the p-values for each level of the variables status, edema and stage, with Bonferroni correction. I went through the documentation without success.
In addition, is it correct to use chi-squared to compare sample sizes across rows?
UPDATE:
I'm not sure if my approach is correct; however, I would like to share it with you. For the variable status I generated a dummy variable for each level, then I calculated the chi-squared test.
library(tableone)
## Load data
library(survival); data(pbc)
d <- pbc[,c("status", "trt")]
# Create dummy variables
d$status.0 <- ifelse(d$status==0, 1,0)
d$status.1 <- ifelse(d$status==1, 1,0)
d$status.2 <- ifelse(d$status==2, 1,0)
t <- rbind(
  chisq.test(d$status.0, d$trt),  # p-value = 0.7202
  chisq.test(d$status.1, d$trt),  # p-value = 1
  chisq.test(d$status.2, d$trt)   # p-value = 0.7818
)
t
Bonferroni adjustment for multiple comparisons:
p <- t[,"p.value"]
p.adjust(p, method = "bonferroni")
This question was posted some time ago, so I suppose you have already answered the reviewer.
I don't really understand why you would compute adjusted p-values for just three variables. Adjusting p-values depends on the number of comparisons made: if you use p.adjust() with a vector of 3 p-values, the results will not really be "adjusted" for the number of comparisons you made (you really made more than a dozen and a half!).
I show how to extract all p-values so you can compute the adjusted ones.
To extract p-values from the tableone package there is a way using the object's attributes (explained first), and there are two quick and dirty ways (at the bottom).
To extract them, first I copy your code to create your tableOne:
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
You can see what your "tableOne" object has via attributes()
attributes(tableOne)
You can see that a tableOne usually has one table for continuous variables and one for categorical variables. You can use attributes() on them too:
attributes(tableOne$CatTable)
# you can notice $pValues
Now that you know "where" the p-values are, you can extract them with attr():
attr(tableOne$CatTable, "pValues")
Something similar with numerical variables:
attributes(tableOne$ContTable)
# $pValues are there
attr(tableOne$ContTable, "pValues")
You have p-values for normal and non-normal variables.
Since you specified them before, you can extract both:
mypCont <- attr(tableOne$ContTable, "pValues") # put them in an object
nonnormal = c("bili","chol","copper","alk.phos","trig") # copied from your code
mypCont[rownames(mypCont) %in% c(nonnormal), "pNonNormal"] # extract NonNormal
"%!in%" <- Negate("%in%")
mypCont[rownames(mypCont) %!in% c(nonnormal), "pNormal"] # extract Normal
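Putting it together, here is a sketch that collects all the extracted p-values before adjusting (mypCat, exact, p.cat, p.cont and p.all are names I am introducing here, and I am assuming the categorical p-value matrix has columns "pApprox" and "pExact", analogous to "pNormal"/"pNonNormal" above):
mypCat <- attr(tableOne$CatTable, "pValues")  # categorical p-values
exact <- c("status", "stage")                 # exact tests requested in your print() call
# pick the appropriate column for each variable
p.cat  <- ifelse(rownames(mypCat) %in% exact, mypCat[, "pExact"], mypCat[, "pApprox"])
p.cont <- ifelse(rownames(mypCont) %in% nonnormal, mypCont[, "pNonNormal"], mypCont[, "pNormal"])
p.all  <- c(p.cat, p.cont)
names(p.all) <- c(rownames(mypCat), rownames(mypCont))
# Bonferroni adjustment over the full set of comparisons, not just three
p.adjust(p.all, method = "bonferroni")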
All that said, and with your p-values extracted, I think there are two much more convenient quick and dirty ways to accomplish the same thing:
Quick and dirty way A: use dput() on your printed tableOne, then search the console output for the p-values and copy-paste them into your script to store them in an object.
Quick and dirty way B: if you look at the tableone vignette there is an "Exporting" section; you can use print(tableOne, quote = TRUE) and then just copy and paste the result into a spreadsheet (LibreOffice, Excel...).
Then I would select the column with the p-values, transpose it, read it back into R, compute the adjusted p-values with p.adjust(), and copy them back into the spreadsheet for journal submission.
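If I remember the vignette correctly, you can also capture the printed table and write it straight to a file instead of copy-pasting (the file name below is just a placeholder):
# Capture the formatted table as a character matrix without printing it
tab1 <- print(tableOne, nonnormal = c("bili","chol","copper","alk.phos","trig"),
              exact = c("status","stage"), quote = FALSE, noSpaces = TRUE,
              printToggle = FALSE)
write.csv(tab1, file = "table1.csv")  # opens cleanly in LibreOffice/Excel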
I have this code I am trying to run. It works fine until I try to create my plot.
# Install package to use Support Vector Machine Algorithm
install.packages("e1071")
# If this function does not work click on the packages tab and check e1071
library("e1071", lib.loc="/Library/Frameworks/R.framework/Versions/3.2/Resources/library")
# Choose File
diabetes <- read.csv(file.choose(), na.strings = "?")
View(diabetes)
##### Data Preprocessing
# Count number of rows with missing data
sum(!complete.cases(diabetes))
# Summary of data set
summary(diabetes)
str(diabetes)
# Replace "NO" and ">30" with 0 and "<30" with 1
diabetes$readmitted<-as.character(diabetes$readmitted)
diabetes$readmitted[diabetes$readmitted== "NO"] <- "0"
diabetes$readmitted[diabetes$readmitted== "<30"] <- "1"
diabetes$readmitted[diabetes$readmitted== ">30"] <- "0"
diabetes$readmitted<-factor(diabetes$readmitted)
str(diabetes$readmitted)
summary(diabetes$readmitted)
# Removal of insignificant variables
diabetes$encounter_id<-NULL
diabetes$patient_nbr<-NULL
diabetes$weight<-NULL # Weight had too many missing values to be a part of our model
diabetes$payer_code<-NULL
diabetes$medical_specialty<-NULL
diabetes$nateglinide<-NULL
diabetes$chlorpropamide<-NULL
diabetes$acetohexamide<-NULL
diabetes$tolbutamide<-NULL
diabetes$acarbose<-NULL
diabetes$miglitol<-NULL
diabetes$troglitazone<-NULL
diabetes$tolazamide<-NULL
diabetes$examide<-NULL
diabetes$citoglipton<-NULL
diabetes$glyburide.metformin<-NULL
diabetes$glipizide.metformin<-NULL
diabetes$glimepiride.pioglitazone<-NULL
diabetes$metformin.rosiglitazone<-NULL
diabetes$metformin.pioglitazone<-NULL
# Change variables to be factors
diabetes$admission_type_id<-factor(diabetes$admission_type_id)
diabetes$discharge_disposition_id<-factor(diabetes$discharge_disposition_id)
diabetes$admission_source_id<-factor(diabetes$admission_source_id)
str(diabetes)
# Summary after data pre-processing
summary(diabetes)
# Set Seed and split data set into training and test data
set.seed(1234)
ind <- sample(2, nrow(diabetes), replace = TRUE, prob = c(0.7, 0.3))
train.data <- diabetes[ind == 1, ]
test.data <- diabetes[ind == 2, ]
# Create Model using readmitted as dependent variable
model1<-readmitted~.
model1<-svm(readmitted~., data=train.data)
summary(model1)
plot(model1, diabetes, type='C-classification', kernel='radial')
### I am also having trouble here making the tables###########
# Create table of model vs training data in confusion matrix
table(predict(model1), train.data$readmitted)
# Pull Test data to get confusion matrix
testPred <- predict(model1, newdata = test.data)
table (testPred, test.data$readmitted)
# Create second model using select readmitted and select variables
model2<-readmitted~race + gender + age + admission_type_id + discharge_disposition_id + time_in_hospital + num_lab_procedures + num_procedures + num_medications + number_outpatient + number_inpatient + number_emergency + number_diagnoses + change + diabetesMed
model2<-svm(model2, data=train.data)
summary(model2)
### Also having trouble here making the second table#########
# Create table using second model and training data
table(predict(model2), train.data$readmitted)
testPred2 <- predict(model2, newdata = test.data)
table (testPred2, test.data$readmitted)
I have been playing around with plot and the tables and can't seem to get anything to work.
I have been testing this on a data set with 9,999 rows, but my real data set has 107,000 rows, so it takes a long time to run only to find out I am wrong.
Any help would be greatly appreciated. Thank You
Well, I would need the data you are working on; I have run into these kinds of problems with large data sets before.
For the data set itself, I prefer using the caret package, which also helps with parallel processing and handles large tuning grids (see the sketch at the end of this answer).
For plots, the hexbin or tabplot packages in R might help you.
The suggestions above are about processing your data faster, so that you can use the whole data set, and about visualizing large datasets.
I am not sure what error you are getting from plot. Please tell us the exact error message you get.
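For illustration, here is a rough sketch of what I mean by caret plus a parallel backend (readmitted and train.data come from your code; the number of workers and the tuning settings are arbitrary placeholders):
library(caret)
library(doParallel)
# Register a parallel backend so caret can train resamples in parallel (4 workers assumed)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
# 5-fold cross-validation; caret tunes a radial-basis SVM over a small grid
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
svm_fit <- train(readmitted ~ ., data = train.data,
                 method = "svmRadial",   # requires the kernlab package to be installed
                 trControl = ctrl, tuneLength = 3)
stopCluster(cl)
svm_fit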
As part of my data analysis, I am using linear regression analysis to check whether I can predict tomorrow's value using today's data.
My data are about 100 time series of company returns. Here is my code so far:
returns <- read.zoo("returns.csv", header=TRUE, sep=",", format="%d-%m-%y")
returns_lag <- lag(returns)
lm_univariate <- lm(returns_lag$companyA ~ returns$companyA)
This works without problems; now I wish to run a linear regression for each of the 100 companies. Since setting up each linear regression model manually would take too much time, I would like to use some kind of loop (or an apply function) to shorten the process.
My approach:
test <- lapply(returns_lag ~ returns, lm)
But this leads to the error "unexpected symbol in "test2" " since the tilde is not being recognized there.
So, basically I want to run a linear regression for every company separately.
The only question that looks similar to what I want is Linear regression of time series over multiple columns; however, there the data seem to be stored in a matrix and the code example is quite messy compared to what I was looking for.
Formulas are great when you know the exact names of the variables you want to include in the regression. When you are looping over values, they aren't so great. Here's an example that uses indexing to extract the columns of interest for each iteration:
# sample data
library(zoo)
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14) - 1
returns <- zoo(cbind(companya = rnorm(5), companyb = rnorm(5)), x.Date)
returns_lag <- lag(returns)
# loop over columns/companies
xx <- lapply(setNames(1:ncol(returns), colnames(returns)), function(i) {
  today <- returns_lag[, i]
  yesterday <- head(returns[, i], -1)
  lm(today ~ yesterday)
})
xx
This will return the results for each column as a list.
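As a small usage note (a sketch based on the list produced above), you can then summarise the fitted models or collect their slopes in one step:
# Slope on yesterday's return for each company
sapply(xx, function(m) coef(m)["yesterday"])
# Full summary for a single company
summary(xx$companya)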
Using the dyn package (which loads zoo) we can do this:
library(dyn)
z <- zoo(EuStockMarkets) # test data
lapply(as.list(z), function(z) dyn$lm(z ~ lag(z, -1)))