Adding a linear model from another dataset in ggplot - r

I have a dataset that contains time series information regarding soil elevation from several sampling stations. I have modeled the change in soil elevation over time for each station using ggplot. Now I would like to add a line to my graph that depicts a linear model fit to other geological data over time from a different dataset but I have been unable to do so. I know that I can add the slope and the intercept to my functions manually but I would rather not.
My data is as follows:
str(SETdata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1620 obs. of 6 variables:
$ Observation : num 1 2 3 4 5 6 7 8 9 10 ...
$ Plot_Name : Factor w/ 3 levels "1900-01-01","1900-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PipeDirectionCode: chr "001°" "001°" "001°" "001°" ...
$ Pin : num 1 2 3 4 5 6 7 8 9 1 ...
$ EventDate : num 0 0 0 0 0 0 0 0 0 0 ...
$ PinHeight_mm : num 221 207 192 220 212 212 206 209 203 222 ...
str(FeldsparData)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 540 obs. of 4 variables:
$ Benchmark : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 2 ...
$ Plot : Factor w/ 12 levels "1a","1b","1c",..: 1 1 1 2 2 2 3 3 3 5 ...
$ TotalChange: num 0 0 0 0 0 0 0 0 0 0 ...
$ Day : num 0 0 0 0 0 0 0 0 0 0 ...
The graph I have is
SETdata %>%
ggplot()+
aes(x = EventDate, y = PinHeight_mm, color = Plot_Name, group = Plot_Name)+
stat_summary(fun.y = mean, geom = "point")+
stat_summary(fun.y = mean, geom = "line")
And I would like it to include this line
reg <- lm(TotalChange ~ Day, data = FeldsparData)
My attempts seem to have been thwarted because R does not like that I am using two different datasets.
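One possible approach, sketched here under assumptions: every ggplot2 layer accepts its own data and mapping arguments, so the regression can be drawn by a geom_smooth() layer that reads FeldsparData directly. The two data frames below are small hypothetical stand-ins that only reuse the column names from the question, and fun = mean assumes ggplot2 3.3.0 or later (older versions use fun.y).

```r
library(ggplot2)

# Toy stand-ins for the two datasets (hypothetical values, same column names)
SETdata <- data.frame(
  EventDate    = rep(c(0, 30, 60), times = 2),
  PinHeight_mm = c(221, 223, 226, 207, 208, 211),
  Plot_Name    = rep(c("A", "B"), each = 3)
)
FeldsparData <- data.frame(
  Day         = c(0, 30, 60, 0, 30, 60),
  TotalChange = c(0, 2, 5, 0, 3, 4)
)

p <- ggplot(SETdata, aes(x = EventDate, y = PinHeight_mm,
                         color = Plot_Name, group = Plot_Name)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line") +
  # This layer has its own data and mapping; inherit.aes = FALSE stops it
  # from looking for Plot_Name or PinHeight_mm in FeldsparData
  geom_smooth(data = FeldsparData, aes(x = Day, y = TotalChange),
              method = "lm", formula = y ~ x, se = FALSE,
              inherit.aes = FALSE, color = "black")
p
```

Both layers share one y axis, so if TotalChange and PinHeight_mm are on very different scales the fitted line may sit far from the station curves. If a fitted model object such as reg already exists, geom_abline(intercept = coef(reg)[1], slope = coef(reg)[2]) should draw the same line without typing the coefficients by hand.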

Related

ggplot2: Bar plot in decreasing/increasing order

I have a data frame with this structure :
'data.frame': 1000 obs. of 10 variables:
$ Age : Factor w/ 3 levels "Middle","Old",..: 2 1 3 1 1 3 1 1 1 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 2 1 2 ...
$ OwnHome : Factor w/ 2 levels "Own","Rent": 1 2 2 1 1 1 2 1 1 1 ...
$ Married : Factor w/ 2 levels "Married","Single": 2 2 2 1 2 1 2 2 1 1 ...
$ Location : Factor w/ 2 levels "Close","Far": 2 1 1 1 1 1 1 1 1 2 ...
$ Salary : int 47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
$ Children : int 0 0 0 1 0 0 0 0 3 0 ...
$ History : Factor w/ 3 levels "High","Low","Medium": 1 1 2 1 1 2 3 1 2 NA ...
$ Catalogs : int 6 6 18 18 12 6 12 18 6 18 ...
$ AmountSpent: int 755 1318 296 2436 1304 495 782 1155 158 3034 ...
and want to make a bar plot with geom_bar() for Age:
Age :
Middle:508
Old :205
Young :287
when I run this code below:
age_plt <- ggplot(data = df, aes(x = Age))
age_plt + geom_bar()
I want ggplot to draw the bars in increasing order of count (first Old, then Young, and last Middle).
How can I add this to my code? Preferably without creating any other variables, because in the next steps I want to add more features to the same plot (for example, grouping the bars by the Gender column).
Change the factor order for Age before ggplot
library(tidyverse)
df%>%
mutate(Age = fct_relevel(Age,"Old","Young"))%>%
ggplot(aes(x = Age)) +
geom_bar()
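If typing the level names by hand is not practical, the order can also be derived from the counts. The following is a minimal sketch, assuming the forcats package is available; the data frame is a hypothetical stand-in with made-up counts.

```r
library(forcats)
library(ggplot2)

# Toy stand-in for the Age column (hypothetical counts: Middle 5, Old 2, Young 3)
df <- data.frame(
  Age = factor(rep(c("Middle", "Old", "Young"), times = c(5, 2, 3)))
)

# fct_infreq() orders levels by decreasing frequency; fct_rev() reverses that,
# giving increasing frequency
df$Age <- fct_rev(fct_infreq(df$Age))
levels(df$Age)  # "Old" "Young" "Middle"

age_plt <- ggplot(df, aes(x = Age)) + geom_bar()
age_plt
```

Because the ordering lives in the factor levels rather than in a helper column, later additions such as aes(fill = Gender) should not need any changes to this step.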

R: why does gbm give NA values on Titanic data?

I have the classic titanic data. Here is the description of the cleaned data.
> str(titanic)
'data.frame': 887 obs. of 7 variables:
$ Survived : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 27 54 2 27 14 ...
$ Siblings.Spouses.Aboard: int 1 1 0 1 0 0 0 3 0 1 ...
$ Parents.Children.Aboard: int 0 0 0 0 0 0 0 1 2 0 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
I first split the data.
set.seed(123)
train_ind <- sample(seq_len(nrow(titanic)), size = smp_size)
train <- titanic[train_ind, ]
test <- titanic[-train_ind, ]
Then I changed Survived column to 0 and 1.
train$Survived <- as.factor(ifelse(train$Survived == 'Yes', 1, 0))
test$Survived <- as.factor(ifelse(test$Survived == 'Yes', 1, 0))
Finally, I ran gradient boosting algorithm.
dt_gb <- gbm(Survived ~ ., data = train)
Here are the results.
> print(dt_gb)
gbm(formula = Survived ~ ., data = train)
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 6 predictors of which 0 had non-zero influence.
Since there are 0 predictors with non-zero influence, the predictions are NA. Why is this the case? Is anything wrong with my code?
Refrain from converting Survived to a 0/1 factor in the training and test data. Instead, change the Survived column to a numeric 0/1 vector.
# e.g. like this (factor levels "No"/"Yes" have codes 1/2, so subtract 1)
titanic$Survived <- as.numeric(titanic$Survived) - 1
# data should look like this
> str(titanic)
'data.frame': 887 obs. of 7 variables:
$ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 27 54 2 27 14 ...
$ Siblings.Spouses.Aboard: int 1 1 0 1 0 0 0 3 0 1 ...
$ Parents.Children.Aboard: int 0 0 0 0 0 0 0 1 2 0 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
Then fit the model with Bernoulli loss.
dt_gb <- gbm::gbm(formula = Survived ~ ., data = titanic,
distribution = "bernoulli")
> print(dt_gb)
gbm::gbm(formula = Survived ~ ., distribution = "bernoulli",
data = titanic)
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 6 predictors of which 6 had non-zero influence.
Obtain predicted survival probabilities for the first few passengers:
> head(predict(dt_gb, type = "response"))
[1] 0.1200703 0.9024225 0.5875393 0.9271306 0.1200703 0.1200703
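The conversion step depends on how R stores factors, which a small base R sketch can make concrete. The vector below is a hypothetical stand-in for the Survived column, not the real data.

```r
# Toy stand-in for the Survived column (hypothetical values)
survived <- factor(c("No", "Yes", "Yes", "No"), levels = c("No", "Yes"))

# as.numeric() on a factor returns the integer level codes (1 and 2 here),
# so subtracting 1 yields 0/1
codes_minus_one <- as.numeric(survived) - 1
codes_minus_one  # 0 1 1 0

# Comparing against the label gives the same result without relying on
# the order of the levels
by_label <- as.numeric(survived == "Yes")
by_label  # 0 1 1 0
```

The second form is arguably safer when the level order is not known in advance, since a factor with levels c("Yes", "No") would be coded the other way round by the first form.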

Complex Counting of dataframe with factors

I have this table.
'data.frame': 5303 obs. of 9 variables:
$ Metric.ID : num 7156 7220 7220 7220 7220 ...
$ Metric.Name : Factor w/ 99 levels "Avoid accessing data by using the position and length",..: 51 59 59
$ Technical.Criterion: Factor w/ 25 levels "Architecture - Multi-Layers and Data Access",..: 4 9 9 9 9 9 9 9 9 9 ...
$ RT.Snapshot.name : Factor w/ 1 level "2017_RT12": 1 1 1 1 1 1 1 1 1 1 ...
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 2 1 2 2 2 1 1 1 1 1 ...
$ Critical.Y.N : num 0 0 0 0 0 0 0 0 0 0 ...
$ Grouping : Factor w/ 29 levels "281","Bes",..: 27 6 6 6 6 7 7 7 7 7 ...
$ Object.type : Factor w/ 11 levels "Cobol Program",..: 8 7 7 7 7 7 7 7 7 7 ...
$ Object.name : Factor w/ 3771 levels "[S:\\SOURCES\\",..: 3771 3770 3769 3768 3767 3
I want to have a statistic output like this:
For every Technical.Criterion, one row with the number of rows where Critical.Y.N = 0 and the number where Critical.Y.N = 1.
So I have to combine the rows of my data frame into a new matrix, using the values of the factor sums ...
But I have no idea how to start. Any hints?
Thanks
I believe you're asking for a cross-tabulation. Because you did not provide a reproducible sample, I've used mine:
xtabs(~ Sub.Category + Category, retail)
This produces:
If you want the values to be based on, say, Sales instead of the count, you can modify the code to:
xtabs(Sales ~ Sub.Category + Category, retail)
And you will get the following output:
EDIT based on extra information in the OP's comment
If you want your tables to share a common title and want to change the name of that title, you can use a combination of names() and dimnames(). The result of xtabs() is a cross-tabulation table; calling dimnames() on a two-way table returns a list of length 2, where the first element holds the row labels and the second the column labels.
dimnames(xtabs(~ Technical.Criterion + Object.type, dat))
$Technical.Criterion
[1] "TechnicalCrit1" "TechnicalCrit2" "TechnicalCrit3"
$`Object.type`
[1] "Object.type1" "Object.type2" "Object.type3"
So given a data frame, b:
'data.frame': 3 obs. of 9 variables:
$ Metric.ID : int 101 102 103
$ Metric.Name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Technical.Criterion: Factor w/ 3 levels "TechnicalCrit1",..: 1 2 3
$ RT.Snapshot.name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 1 2 1
$ Critical.Y.N : num 1 0 1
$ Grouping : Factor w/ 3 levels "A","B","C": 1 2 3
$ Object.type : Factor w/ 3 levels "Object.type1",..: 1 2 3
$ Object.name : Factor w/ 3 levels "A","B","C": 1 2 3
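We can use xtabs() and then change the "common" header right at the top of our table. Since I don't know how many levels are in b$Violation.status, I would use a generic for loop:
tab <- list()  # must exist before assigning tab[[i]]
statuses <- unique(b$Violation.status)
for (i in seq_along(statuses)) {
  # cross-tabulate only the rows that belong to the i-th status
  tab[[i]] <- xtabs(Critical.Y.N ~ Technical.Criterion + Object.type,
                    b[b$Violation.status == statuses[i], ])
  names(dimnames(tab[[i]]))[2] <- paste("Violation.status", i)
}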
This produces:
                    Violation.status 1
Technical.Criterion Object.type1 Object.type2 Object.type3
     TechnicalCrit1            1            0            0
     TechnicalCrit2            0            0            0
     TechnicalCrit3            0            0            1
Which I can now use in my shiny app.
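Applied to the columns named in the question, the cross-tabulation might look like the sketch below. The data frame is a hypothetical stand-in with made-up rows; only the column names Technical.Criterion and Critical.Y.N come from the question.

```r
# Toy stand-in for the question's table (hypothetical values)
violations <- data.frame(
  Technical.Criterion = factor(c("Crit1", "Crit1", "Crit2", "Crit2", "Crit2")),
  Critical.Y.N        = c(0, 1, 0, 0, 1)
)

# One row per Technical.Criterion, one column per value of Critical.Y.N;
# each cell is the number of matching rows
counts <- xtabs(~ Technical.Criterion + Critical.Y.N, data = violations)
counts
```

With a one-sided formula xtabs() counts rows, so the "0" and "1" columns hold the number of non-critical and critical violations per criterion. as.data.frame(counts) converts the table to long format if a data frame is easier to work with downstream.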

RMSLE validation between rpart model and test set showing NA and zero

I am seeking a little help as I have hit a wall.
I have trained a model (CART) on a training dataset and am looking to validate the accuracy of the model with RMSLE on a test set.
I have the following:
'data.frame': 5463 obs. of 15 variables:
$ Start_date: chr "2011-01-20 02:00:00" "2011-01-20 05:00:00" "2011-01-20 06:00:00"
$ Season : Factor w/ 4 levels "spring","summer",..: 1 1 1 1 1 1 1 1 1 1
$ Holiday : Factor w/ 2 levels "Not","Holiday": 1 1 1 1 1 1 1 1 1 1 ...
$ Workingday: int 1 1 1 1 1 1 1 1 1 1 ...
$ Weather : Factor w/ 4 levels "Clear","Cloudy",..: 1 1 1 1 1 2 1 2 2 2..
$ Temp : num 10.66 9.84 9.02 9.02 9.02 ...
$ Humidity : int 56 60 60 55 55 52 48 45 42 45 ...
$ Windspeed : num 0 15 15 15 19 ...
$ Count : num 1 1 1 8 18 6 3 4 5 3 ...
$ Date : chr "2011-01-20" "2011-01-20" "2011-01-20" "2011-01-20" ...
$ Hour : Factor w/ 24 levels "00","01","02",..: 3 6 7 8 9 10 11 12 .
$ Year : chr "2011" "2011" "2011" "2011" ...
$ Month : chr "01" "01" "01" "01" ...
$ Weekday : Factor w/ 7 levels "Friday","Monday",..: 5 5 5 5 5 5 5 5 5 5
$ Hour_Bin : num 0 0 0 0 0 0 0 0 0 0 ...
$ temp_Bin : num 1 1 1 2 2 2 2 2 2 2 ...
$ year_Bin : num 1 1 1 1 1 1 1 1 1 1 ...
The predicted values are a named numeric vector:
Named num [1:5463] 9 9 9 9 9 9 9 9 9 9 ...
- attr(*, "names")= chr [1:5463] "9266" "9267" "9268" "9269" ...
I have used the function:
Evaluate_Model <- function (test, pred) {
return(sqrt(1/nrow(test)*sum((log(pred+1)-log(test$Count+1))^2)))
}
and also tried the Metrics package
library('Metrics')
rmsle(test$Count, pred)
When I try to get the Root Mean Squared Logarithmic Error, I get either 0 or NA.
I have gone through the process of converting the Count variable to different data types, and also tried putting the predictions into a data frame and evaluating from there.
I have also trained a model with one attribute and tried to evaluate it, but I am still getting the same result.
My target variable (Count) and the other attributes contain zero values, but these are real values, not NAs.
Is it the training of the algorithm, or the data types?
Any help would be appreciated.
A sample of the model code:
model3 <- rpart(Count~Month+Temp, data = train)
# round prediction
pred <- round(predict(model3, newdata = test))
Evaluate_Model(test, pred)
Thanks in advance.
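One way to narrow this down, offered as a sketch rather than a diagnosis: a single NA in either vector is enough to make sum() or mean() return NA, so it can help to count the problem values before computing the error. The helper name rmsle_checked and the toy vectors are hypothetical and not part of the original code.

```r
# Hypothetical helper: reports suspicious inputs, then computes RMSLE
# on the pairs where both values are present
rmsle_checked <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  # log1p(x) is log(1 + x); it is -Inf at x = -1 and NaN below -1
  message("NA in actual: ", sum(is.na(actual)),
          ", NA in predicted: ", sum(is.na(predicted)),
          ", predicted <= -1: ", sum(predicted <= -1, na.rm = TRUE))
  ok <- !is.na(actual) & !is.na(predicted)
  sqrt(mean((log1p(predicted[ok]) - log1p(actual[ok]))^2))
}

# Toy vectors (made-up values); the NA is dropped instead of propagating
actual    <- c(1, 1, 8, 18, NA)
predicted <- c(9, 9, 9, 9, 9)
res <- rmsle_checked(actual, predicted)
res
```

If the message reports NAs in test$Count, the column probably lost values somewhere upstream, for example during a type conversion. If everything is clean and the result is still unexpected, the next thing to check would be whether test and pred really have the same length and row order.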

sum by factor conditional on another factor

I'm working with a data frame of stock information, here is what it looks like:
> str(test)
'data.frame': 211717 obs. of 19 variables:
$ Symbol : Factor w/ 3378 levels "AACC","AACE",..: 1 1 1 1 1 1 1 1 1 1 ...
$ MktCategory : Factor w/ 3 levels "","NNM","SCM": 2 2 2 2 2 2 2 2 2 2 ...
$ TSO : num 37205115 37205115 37205115 37205115 37205115 ...
$ TSO_Date : Factor w/ 200 levels "","1/1/2006",..: 137 137 137 137 137 137 137 137 137 137 ...
$ X.OfMP : int 56 56 56 56 56 56 56 56 56 56 ...
$ MPID : Factor w/ 670 levels "","ABLE","ABNA",..: 608 459 533 618 550 635 307 146 387 482 ...
$ MP_type : Factor w/ 4 levels "","C","M","NR": 2 3 4 3 3 3 3 4 3 4 ...
$ Total_Vol : int 32900 0 2949 758522 41316 706131 29300 16898 362569 1490 ...
$ Total_Rank : int 18 0 35 2 17 3 21 26 5 40 ...
$ Total_Pct : int 0 0 0 14 0 13 0 0 7 0 ...
$ Block_Vol : int 0 0 0 60800 20000 34900 19200 16600 0 0 ...
$ Block_Rank : int 0 0 0 2 6 4 7 9 0 0 ...
$ Block_Pct : int 0 0 0 15 5 9 5 4 0 0 ...
$ YTD_Total_Vol : num 81615 2929 10684 1949230 190874 ...
$ YTD_Total_Rank: int 28 59 44 3 17 5 30 27 12 67 ...
$ YTD_Total_Pct : int 0 0 0 9 0 7 0 0 2 0 ...
$ YTD_Block_Vol : int 0 0 0 197420 80000 390600 60900 73787 55994 0 ...
$ YTD_Block_Rank: int 0 0 0 5 13 3 16 14 17 0 ...
$ YTD_Block_Pct : int 0 0 0 6 3 12 2 2 2 0 ...
So I know how to sum total volume(Total_Vol) by Symbol with the aggregate function:
volbystock<-aggregate(test$Total_Vol,by=list(test$Symbol),FUN=sum)
but I am trying to analyze volume of only a few MPID values. I want to only add the Total_Vol of a Symbol when the MPID is one of the MPIDs in another list. In other words, I only want the Total_Vol of a certain Symbol added if the corresponding MPID is one of the following:
> use_MPID<-c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
Using dplyr you can do something like this:
# load dplyr
library(dplyr)
# create a vector of the MPIDs you are interested in
use_MPID <- c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
# create a fake dataset just for illustration (TotalVol stays numeric)
test <- data.frame(Symbol = c("ci", "di", "bi", "bi"),
                   MPID = c("GSCO","LATS","TACT","INCA"),
                   TotalVol = c(35, 110, 201, 435))
# keep only the MPIDs of interest, then sum the volume per Symbol
volbystock <- test %>%
  filter(MPID %in% use_MPID) %>%
  group_by(Symbol) %>%
  summarise(TotalVol = sum(TotalVol))
It looks like you could just subset your data.frame, by using:
use_MPID <- c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
relevant.symbols <- which(test$MPID %in% use_MPID)
volbystock <- aggregate(test$Total_Vol[relevant.symbols],
by=list(test$Symbol[relevant.symbols]),
FUN=sum)
Does this solve your problem?
edit
Even better, you could use the optional subset argument of the formula interface and pass the data frame through data:
use_MPID <- c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
volbystock <- aggregate(Total_Vol ~ Symbol, data = test,
                        subset = MPID %in% use_MPID,
                        FUN = sum)
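To check the behaviour on something small, here is a self-contained sketch in base R. The data frame toy and its values are made up for illustration; only the column names follow the question.

```r
# Toy data (hypothetical values); "XXXX" is an MPID that should be ignored
toy <- data.frame(
  Symbol    = c("AACC", "AACC", "AACE", "AACE"),
  MPID      = c("GSCO", "XXXX", "LATS", "TACT"),
  Total_Vol = c(100, 50, 200, 25)
)
use_MPID <- c("GSCO", "LATS", "TACT")

# subset is evaluated inside data, so bare column names can be used
volbystock <- aggregate(Total_Vol ~ Symbol, data = toy,
                        subset = MPID %in% use_MPID, FUN = sum)
volbystock
#   Symbol Total_Vol
# 1   AACC       100
# 2   AACE       225
```

The 50 shares traded through "XXXX" are left out of the AACC total, which is the conditional sum the question asks for.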
