ggplot2: Bar plot in decreasing/increasing order - r

I have a data frame with this structure :
'data.frame': 1000 obs. of 10 variables:
$ Age : Factor w/ 3 levels "Middle","Old",..: 2 1 3 1 1 3 1 1 1 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 2 1 2 ...
$ OwnHome : Factor w/ 2 levels "Own","Rent": 1 2 2 1 1 1 2 1 1 1 ...
$ Married : Factor w/ 2 levels "Married","Single": 2 2 2 1 2 1 2 2 1 1 ...
$ Location : Factor w/ 2 levels "Close","Far": 2 1 1 1 1 1 1 1 1 2 ...
$ Salary : int 47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
$ Children : int 0 0 0 1 0 0 0 0 3 0 ...
$ History : Factor w/ 3 levels "High","Low","Medium": 1 1 2 1 1 2 3 1 2 NA ...
$ Catalogs : int 6 6 18 18 12 6 12 18 6 18 ...
$ AmountSpent: int 755 1318 296 2436 1304 495 782 1155 158 3034 ...
and want to make a bar plot with geom_bar() for Age:
Age :
Middle:508
Old :205
Young :287
I currently run this code:
age_plt <- ggplot(data = df, aes(x = Age))
age_plt + geom_bar()
I want ggplot to draw the bars in increasing order of count (first Old, then Young, and finally Middle).
How can I add this to my code? Preferably without creating any other variables, because in later steps I want to add more features to the same plot (for example, grouping the bars by the Gender column).

Change the order of the factor levels of Age before passing the data to ggplot:
library(tidyverse)
df %>%
  # move "Old" and "Young" to the front; "Middle" stays last
  mutate(Age = fct_relevel(Age, "Old", "Young")) %>%
  ggplot(aes(x = Age)) +
  geom_bar()
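If you would rather not type the level names by hand, the order can be derived from the counts themselves. The following is a minimal sketch, not part of the original answer: the toy `df` is rebuilt from the counts quoted in the question (508, 205, 287), and the alternating `Gender` column is invented purely for illustration.

```r
library(ggplot2)

# Toy data matching the counts in the question; Gender values are made up
df <- data.frame(
  Age    = factor(rep(c("Middle", "Old", "Young"), times = c(508, 205, 287))),
  Gender = factor(rep(c("Female", "Male"), length.out = 1000))
)

# sort(table(...)) orders the counts ascending; its names give the level order
age_order <- names(sort(table(df$Age)))
df$Age <- factor(df$Age, levels = age_order)
levels(df$Age)

# The reordered factor carries over to grouped plots as well
age_plt <- ggplot(df, aes(x = Age)) +
  geom_bar(aes(fill = Gender), position = "dodge")
```

For decreasing order, `sort(table(df$Age), decreasing = TRUE)` should work the same way. With forcats loaded, `fct_rev(fct_infreq(Age))` is a comparable one-liner for increasing order.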

Related

Adding a linear model from another dataset in ggplot

I have a dataset that contains time series of soil elevation from several sampling stations. I have modeled the change in soil elevation over time for each station using ggplot. Now I would like to add a line to my graph that depicts a linear model fitted to other geological data over time from a different dataset, but I have been unable to do so. I know that I can add the slope and the intercept manually, but I would rather not.
My data is as follows:
str(SETdata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1620 obs. of 6 variables:
$ Observation : num 1 2 3 4 5 6 7 8 9 10 ...
$ Plot_Name : Factor w/ 3 levels "1900-01-01","1900-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PipeDirectionCode: chr "001°" "001°" "001°" "001°" ...
$ Pin : num 1 2 3 4 5 6 7 8 9 1 ...
$ EventDate : num 0 0 0 0 0 0 0 0 0 0 ...
$ PinHeight_mm : num 221 207 192 220 212 212 206 209 203 222 ...
str(FeldsparData)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 540 obs. of 4 variables:
$ Benchmark : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 2 ...
$ Plot : Factor w/ 12 levels "1a","1b","1c",..: 1 1 1 2 2 2 3 3 3 5 ...
$ TotalChange: num 0 0 0 0 0 0 0 0 0 0 ...
$ Day : num 0 0 0 0 0 0 0 0 0 0 ...
The graph I have is
SETdata %>%
ggplot()+
aes(x = EventDate, y = PinHeight_mm, color = Plot_Name, group = Plot_Name)+
stat_summary(fun.y = mean, geom = "point")+
stat_summary(fun.y = mean, geom = "line")
And I would like it to include this line
reg <- lm(TotalChange ~ Day, data = FeldsparData)
My attempts seem to have been thwarted because R does not like that I am using two different datasets.
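One possible approach, sketched here under assumptions rather than taken from the thread: ggplot2 layers can take their own `data` argument, so the regression can come from the second dataset. The toy data frames below are invented stand-ins that reuse only the column names from the question, and `fun` (rather than `fun.y`) assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Toy stand-ins for SETdata and FeldsparData (all values are invented)
set.seed(42)
SETdata <- data.frame(
  EventDate    = rep(c(0, 100, 200), each = 6),
  Plot_Name    = factor(rep(c("A", "B", "C"), times = 6)),
  PinHeight_mm = rnorm(18, mean = 210, sd = 5)
)
FeldsparData <- data.frame(Day = rep(c(0, 100, 200), each = 4))
FeldsparData$TotalChange <- 0.05 * FeldsparData$Day + rnorm(12, sd = 1)

reg <- lm(TotalChange ~ Day, data = FeldsparData)

base_plot <- ggplot(SETdata, aes(x = EventDate, y = PinHeight_mm,
                                 color = Plot_Name, group = Plot_Name)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line")

# Option 1: take intercept and slope from the fitted model
p1 <- base_plot +
  geom_abline(intercept = coef(reg)[["(Intercept)"]],
              slope     = coef(reg)[["Day"]])

# Option 2: let ggplot fit the same model on the second dataset
p2 <- base_plot +
  geom_smooth(data = FeldsparData, aes(x = Day, y = TotalChange),
              method = "lm", formula = y ~ x, se = FALSE,
              inherit.aes = FALSE, color = "black")
```

Note that both layers share one y axis, so if TotalChange and PinHeight_mm are on very different scales the line may end up far from the station means or outside the plotted range.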

Complex Counting of dataframe with factors

I have this table.
'data.frame': 5303 obs. of 9 variables:
$ Metric.ID : num 7156 7220 7220 7220 7220 ...
$ Metric.Name : Factor w/ 99 levels "Avoid accessing data by using the position and length",..: 51 59 59
$ Technical.Criterion: Factor w/ 25 levels "Architecture - Multi-Layers and Data Access",..: 4 9 9 9 9 9 9 9 9 9 ...
$ RT.Snapshot.name : Factor w/ 1 level "2017_RT12": 1 1 1 1 1 1 1 1 1 1 ...
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 2 1 2 2 2 1 1 1 1 1 ...
$ Critical.Y.N : num 0 0 0 0 0 0 0 0 0 0 ...
$ Grouping : Factor w/ 29 levels "281","Bes",..: 27 6 6 6 6 7 7 7 7 7 ...
$ Object.type : Factor w/ 11 levels "Cobol Program",..: 8 7 7 7 7 7 7 7 7 7 ...
$ Object.name : Factor w/ 3771 levels "[S:\\SOURCES\\",..: 3771 3770 3769 3768 3767 3
I want to have a statistic output like this:
For every Technical.Criterion, one row with the number of rows where Critical.Y.N = 0 and the number where Critical.Y.N = 1.
So I have to combine the rows of my data frame into a new matrix, using the counts per factor level.
But I have no idea how to start. Any hints?
Thanks
I believe you're asking for a cross-tabulation. Because you did not provide a reproducible sample, I've used mine:
xtabs(~ Sub.Category + Category, retail)
This produces a table of counts for each combination of Sub.Category and Category.
If you want the cells to be based on, say, Sales instead of the count, you can modify the code to:
xtabs(Sales ~ Sub.Category + Category, retail)
The cells then hold the sum of Sales for each combination.
EDIT based on extra information in the OP's comment
If you want your tables to share a common title and want to change that title, you can use a combination of names() and dimnames(). An xtabs object is a cross-tabulation table; calling dimnames() on it returns a list of length 2, whose first element corresponds to the rows and whose second corresponds to the columns.
dimnames(xtabs(~ Technical.Criterion + Object.type, dat))
$Technical.Criterion
[1] "TechnicalCrit1" "TechnicalCrit2" "TechnicalCrit3"
$`Object.type`
[1] "Object.type1" "Object.type2" "Object.type3"
So given a data frame, b:
'data.frame': 3 obs. of 9 variables:
$ Metric.ID : int 101 102 103
$ Metric.Name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Technical.Criterion: Factor w/ 3 levels "TechnicalCrit1",..: 1 2 3
$ RT.Snapshot.name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 1 2 1
$ Critical.Y.N : num 1 0 1
$ Grouping : Factor w/ 3 levels "A","B","C": 1 2 3
$ Object.type : Factor w/ 3 levels "Object.type1",..: 1 2 3
$ Object.name : Factor w/ 3 levels "A","B","C": 1 2 3
We can use xtabs and then change the "common" header right at the top of our table. Since I don't know how many levels are in b$Violation.status, I would use a generic for loop:
tab <- list()
statuses <- levels(b$Violation.status)
for (i in seq_along(statuses)) {
  # cross-tabulate only the rows with the i-th violation status
  tab[[i]] <- xtabs(Critical.Y.N ~ Technical.Criterion + Object.type,
                    b[b$Violation.status == statuses[i], ])
  names(dimnames(tab[[i]]))[2] <- paste("Violation.status", i)
}
This produces:
                   Violation.status 1
Technical.Criterion Object.type1 Object.type2 Object.type3
     TechnicalCrit1            1            0            0
     TechnicalCrit2            0            0            0
     TechnicalCrit3            0            0            1
Which I can now use in my shiny app.
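Applied to the columns named in the question, the basic cross-tabulation could look like the sketch below. The data frame `dat` is a made-up stand-in; only the column names Technical.Criterion and Critical.Y.N come from the question.

```r
# Toy stand-in for the asker's table (values are invented for illustration)
dat <- data.frame(
  Technical.Criterion = factor(c("CritA", "CritA", "CritB", "CritB", "CritB")),
  Critical.Y.N        = c(0, 1, 0, 0, 1)
)

# One row per Technical.Criterion, one column per value of Critical.Y.N
counts <- xtabs(~ Technical.Criterion + Critical.Y.N, data = dat)
counts

# Convert to a plain data frame if a wide table is more convenient
wide <- as.data.frame.matrix(counts)
```

The columns of `wide` are named "0" and "1", so they may need renaming before further processing.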

change value of variable in r using dplyr

refine_original %>%
+ mutate(company=replace(company, grepl("ps",company), "phillips")) %>%
+ as.data.frame()
Error in replace(company, grepl("ps", company), "phillips") :
object 'company' not found
I do not know why it gives the error "object 'company' not found".
> str(refine_original)
'data.frame': 25 obs. of 6 variables:
$ company : Factor w/ 19 levels "ak zo","akz0",..: 10 8 7 13 11 9 3 4 5 2 ...
$ Product.code...number: Factor w/ 23 levels "p-23","p-34",..: 4 3 19 20 17 1 13 11 22 2 ...
$ address : Factor w/ 25 levels "Delfzijlstraat 54",..: 9 10 11 12 13 14 19 20 21 22 ...
$ city : Factor w/ 1 level "arnhem": 1 1 1 1 1 1 1 1 1 1 ...
$ country : Factor w/ 1 level "the netherlands": 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 20 levels "dhr j. Gansen",..: 7 6 1 9 4 5 2 10 3 8 ...
Please help
Your code has extra + signs in it. Remove them and the error should go away:
refine_original %>%
  mutate(company = replace(company, grepl("ps", company), "phillips")) %>%
  as.data.frame()
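One further point worth checking, since company is a factor in the str() output: assigning a value that is not already one of the factor's levels produces NA with a warning. A minimal sketch of a way around that, assuming it is acceptable to convert the column to character first; the toy data below is invented and only reuses the column name from the question.

```r
library(dplyr)

# Toy stand-in for refine_original (spellings are made up)
refine_original <- data.frame(
  company = factor(c("philips", "phillps", "akzo", "fillips"))
)

cleaned <- refine_original %>%
  mutate(
    company = as.character(company),  # avoid "invalid factor level" NAs
    company = replace(company, grepl("ps", company), "phillips")
  )

cleaned$company
```

If a factor is needed afterwards, `factor(cleaned$company)` rebuilds one from the cleaned values.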

Extracting complete dataframe from Hmisc package in R

I used aregImpute to impute the missing values, then used the impute.transcan function to try to get a complete dataset, with the following code:
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
When I call str(y), it gives me a data frame, but it still contains NAs, as if nothing had been imputed. How can I get the complete dataset without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a typo in the above line. Besides, you do not even need the completed object.
If you want to get a data.frame from the impute.transcan function, wrap the call in as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you need to test your missing data pattern, you can also use the md.pattern function provided by the mice package.
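If you do want to keep the original data frame and overwrite only the imputed columns, the line from the question was probably meant as an assignment into a copy of the data. The sketch below shows that copy-back step in base R only; `imputed` here is a hand-made stand-in for the list that impute.transcan(..., list.out = TRUE) returns, and all values are invented.

```r
# Toy data with missing values (invented; column names follow the question)
mov.miss <- data.frame(
  age      = c(30, NA, 35),
  balance  = c(NA, 4789, 1350),
  duration = c(79, 220, 185)
)

# Stand-in for the list returned by impute.transcan(..., list.out = TRUE)
imputed <- list(
  age     = c(30, 33, 35),
  balance = c(1200, 4789, 1350)
)

# Copy the data, then overwrite the imputed columns by name
completed <- mov.miss
completed[names(imputed)] <- imputed

# Check that no missing values remain
anyNA(completed)
colSums(is.na(completed))
```

Columns that were not part of the imputation (duration in this toy example) are carried over unchanged.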

Error in Cross Validation in GLMNET package R for Binomial Target Variable

This is in reference to https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome. I am trying to use cross-validation in glmnet (i.e. cv.glmnet) for a binomial target variable. glmnet works fine, but cv.glmnet throws an error. Here is the error log:
Error in storage.mode(y) = "double" : invalid to change the storage mode of a factor
In addition: Warning messages:
1: In Ops.factor(x, w) : ‘*’ not meaningful for factors
2: In Ops.factor(y, ybar) : ‘-’ not meaningful for factors
Data Types:
'data.frame': 490 obs. of 13 variables:
$ loan_id : Factor w/ 614 levels "LP001002","LP001003",..: 190 381 259 310 432 156 179 24 429 408 ...
$ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
$ married : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 1 ...
$ dependents : Factor w/ 4 levels "0","1","2","3+": 1 1 1 3 1 4 2 3 1 1 ...
$ education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 1 2 1 2 ...
$ self_employed : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ applicantincome : int 9328 3333 14683 7667 6500 39999 3750 3365 2920 2213 ...
$ coapplicantincome: num 0 2500 2100 0 0 ...
$ loanamount : int 188 128 304 185 105 600 116 112 87 66 ...
$ loan_amount_term : Factor w/ 10 levels "12","36","60",..: 6 9 9 9 9 6 9 9 9 9 ...
$ credit_history : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ property_area : Factor w/ 3 levels "Rural","Semiurban",..: 1 2 1 1 1 2 2 1 1 1 ...
$ loan_status : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 2 2 ...
Code used:
xfactors<-model.matrix(loan_status ~ gender+married+dependents+education+self_employed+loan_amount_term+credit_history+property_area,data=data_train)[,-1]
x<-as.matrix(data.frame(applicantincome,coapplicantincome,loanamount,xfactors))
glmmod<-glmnet(x,y=as.factor(loan_status),alpha=1,family='binomial')
plot(glmmod,xvar="lambda")
grid()
cv.glmmod <- cv.glmnet(x,y=loan_status,alpha=1) #This Is Where It Throws The Error
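The credit for the answer goes to @user20650.
I suspect you need to pass family to cv.glmnet as well. An example:
library(glmnet)
x <- model.matrix(am ~ 0 + ., data = mtcars)
# the default family is "gaussian", so a factor response fails
cv.glmnet(x, y = factor(mtcars$am), alpha = 1)
# with family = "binomial" the factor response is accepted
cv.glmnet(x, y = factor(mtcars$am), alpha = 1, family = "binomial")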
