Using lapply in group_by() of two factors in R

I have a data frame named OEM_final. This is its structure:
str(OEM_final)
'data.frame': 2265 obs. of 17 variables:
$ dia_hora_OEM : POSIXct, format: "2019-12-31 06:40:13" "2019-12-31 06:43:00" "2019-12-31 07:11:30" "2019-12-31 07:18:30" ...
$ coche_OEM : Factor w/ 6 levels "356232050832996",..: 3 3 3 3 3 3 3 3 6 6 ...
$ DTC_OEM_dec64: chr "[{\"code\":\"B1182\",\"description\":\"Tire pressure monitor module\",\"faultInformations\":[{\"description\":\"| __truncated__ "[{\"code\":\"B1182\",\"description\":\"Tire pressure monitor module\",\"faultInformations\":[{\"description\":\"| __truncated__ "[{\"code\":\"B1182\",\"description\":\"Tire pressure monitor module\",\"faultInformations\":[{\"description\":\"| __truncated__ "[{\"code\":\"B1182\",\"description\":\"Tire pressure monitor module\",\"faultInformations\":[{\"description\":\"| __truncated__ ...
$ rowname : Factor w/ 2265 levels "1","10","100",..: 1 1112 1489 1600 1711 1822 1933 2044 2155 2 ...
$ B1182 : Factor w/ 2 levels "B1182","NULL": 1 1 1 1 1 1 1 1 2 2 ...
$ B124D : Factor w/ 2 levels "B124D","NULL": 1 1 1 1 1 1 1 1 2 2 ...
$ NA. : Factor w/ 6 levels "c(NA, NA, NA, NA, NA, NA, NA, NA)",..: 3 3 3 3 3 3 3 3 1 1 ...
$ P2000 : Factor w/ 2 levels "c(\"P2000\", \"P2000\", \"P2000\")",..: 2 2 2 2 2 2 2 2 2 2 ...
$ U3003 : Factor w/ 2 levels "NULL","U3003": 1 1 1 1 1 1 1 1 1 1 ...
$ B1D01 : Factor w/ 3 levels "B1D01","c(\"B1D01\", \"B1D01\")",..: 3 3 3 3 3 3 3 3 3 3 ...
$ U0155 : Factor w/ 2 levels "NULL","U0155": 1 1 1 1 1 1 1 1 1 1 ...
$ C1B00 : Factor w/ 2 levels "C1B00","NULL": 2 2 2 2 2 2 2 2 2 2 ...
$ P037D : Factor w/ 2 levels "NULL","P037D": 1 1 1 1 1 1 1 1 1 1 ...
$ P0616 : Factor w/ 2 levels "NULL","P0616": 1 1 1 1 1 1 1 1 1 1 ...
$ P0562 : Factor w/ 2 levels "NULL","P0562": 1 1 1 1 1 1 1 1 1 1 ...
$ U0073 : Factor w/ 2 levels "NULL","U0073": 1 1 1 1 1 1 1 1 1 1 ...
$ P0138 : Factor w/ 2 levels "c(\"P0138\", \"P0138\", \"P0138\")",..: 2 2 2 2 2 2 2 2 2 2 ...
I would like to find the earliest date (dia_hora_OEM) within each group when grouping by two factors. The two factors are:
One factor, coche_OEM, which is common to all the possible combinations.
One of the columns from column 8 (P2000) to the last one (P0138), taken one at a time.
So the group_by() calls would be:
group_by(coche_OEM, P2000)
group_by(coche_OEM, U3003)
group_by(coche_OEM, B1D01)
group_by(coche_OEM, U0155)
...
I tried different ways to accomplish this:
Using for loops:
for (DTC in c(U3003, P2000)) {
OEM_final %>%
group_by(DTC, coche_OEM) %>%
filter(dia_hora_OEM == min(dia_hora_OEM))
}
But I get an error saying:
Error in c(U3003, P2000) : object 'U3003' not found
Using lapply
In this case, I created a function:
groupCombDTC <- function(x) {
OEM_final %>%
group_by(coche_OEM, x) %>%
filter(dia_hora_OEM == min(dia_hora_OEM))
}
And then I ran lapply():
lapply(colnames(OEM_final)[8:17], groupCombDTC)
I get this error:
Error: Column `x` is unknown
Can anybody help me iterate over the different combinations using group_by()?

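This is a classic standard vs. non-standard evaluation problem with dplyr. dplyr verbs use non-standard evaluation, so group_by(coche_OEM, x) looks for a column literally named x rather than the column whose name is stored in x. The string has to be converted to a symbol and unquoted.
Several solutions exist. This one works well: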
groupCombDTC <- function(x) {
OEM_final %>%
group_by(coche_OEM, !!rlang::sym(x)) %>%
filter(dia_hora_OEM == min(dia_hora_OEM))
}
It combines rlang::sym(), which turns the string into a symbol, with !!, which unquotes that symbol so that group_by() sees the column it refers to.
Column names as arguments are easier to handle with data.table. If you want more detail on SE/NSE in dplyr and data.table, you can have a look at a blog post I wrote a few days ago.
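As a minimal sketch of how the function above could be applied over several column names with lapply(), the example below uses a small made-up data frame (toy, a hypothetical stand-in for OEM_final) and adds a data argument to the function for convenience; neither is part of the original question.

```r
library(dplyr)

# Hypothetical toy data standing in for OEM_final
toy <- data.frame(
  dia_hora_OEM = as.POSIXct("2019-12-31 06:00:00", tz = "UTC") +
    c(0, 60, 120, 180, 240, 300),
  coche_OEM = factor(c("a", "a", "a", "b", "b", "b")),
  P2000 = factor(c("P2000", "P2000", "NULL", "NULL", "NULL", "P2000")),
  U3003 = factor(c("NULL", "U3003", "U3003", "NULL", "NULL", "NULL"))
)

# Group by coche_OEM plus one column given by name (a string),
# then keep the earliest row of each group
groupCombDTC <- function(x, data = toy) {
  data %>%
    group_by(coche_OEM, !!rlang::sym(x)) %>%
    filter(dia_hora_OEM == min(dia_hora_OEM)) %>%
    ungroup()
}

# Iterate over the column names; the result is a named list of data frames
dtc_cols <- c("P2000", "U3003")
res <- lapply(dtc_cols, groupCombDTC)
names(res) <- dtc_cols
```

With the real data, dtc_cols would be colnames(OEM_final)[8:17], as in the question.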

Related

ggplot2 : Bar plot in decreasing/increasing order

I have a data frame with this structure :
'data.frame': 1000 obs. of 10 variables:
$ Age : Factor w/ 3 levels "Middle","Old",..: 2 1 3 1 1 3 1 1 1 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 2 1 2 ...
$ OwnHome : Factor w/ 2 levels "Own","Rent": 1 2 2 1 1 1 2 1 1 1 ...
$ Married : Factor w/ 2 levels "Married","Single": 2 2 2 1 2 1 2 2 1 1 ...
$ Location : Factor w/ 2 levels "Close","Far": 2 1 1 1 1 1 1 1 1 2 ...
$ Salary : int 47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
$ Children : int 0 0 0 1 0 0 0 0 3 0 ...
$ History : Factor w/ 3 levels "High","Low","Medium": 1 1 2 1 1 2 3 1 2 NA ...
$ Catalogs : int 6 6 18 18 12 6 12 18 6 18 ...
$ AmountSpent: int 755 1318 296 2436 1304 495 782 1155 158 3034 ...
and want to make a bar plot with geom_bar() for Age:
Age :
Middle:508
Old :205
Young :287
when I run this code below:
age_plt <- ggplot(data = df, aes(x = Age))
age_plt + geom_bar()
I want ggplot to draw the bars in increasing order of count (first Old, then Young, and last Middle).
How can I add this to my code? Preferably without using any other variables, because in the next steps I want to add new features to the same plot (for example, grouping the plot by the Gender column).
Change the factor order of Age before calling ggplot():
library(tidyverse)
df %>%
  mutate(Age = fct_relevel(Age, "Old", "Young")) %>%  # move Old and Young to the front; Middle ends up last
  ggplot(aes(x = Age)) +
  geom_bar()
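If the level order should follow the counts rather than being typed by hand, a possible variant is to let forcats derive it. This is only a sketch: the data frame below is made up to reproduce the Age counts from the question, and it swaps fct_relevel() for fct_infreq() combined with fct_rev().

```r
library(forcats)
library(ggplot2)

# Hypothetical data with the same Age counts as in the question
df <- data.frame(
  Age = factor(rep(c("Middle", "Old", "Young"), times = c(508, 205, 287)))
)

# fct_infreq() orders levels by decreasing frequency;
# fct_rev() reverses that, giving increasing frequency
df$Age <- fct_rev(fct_infreq(df$Age))

age_plt <- ggplot(df, aes(x = Age)) +
  geom_bar()
```

Dropping fct_rev() gives the decreasing order instead.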

Complex Counting of dataframe with factors

I have this table.
'data.frame': 5303 obs. of 9 variables:
$ Metric.ID : num 7156 7220 7220 7220 7220 ...
$ Metric.Name : Factor w/ 99 levels "Avoid accessing data by using the position and length",..: 51 59 59
$ Technical.Criterion: Factor w/ 25 levels "Architecture - Multi-Layers and Data Access",..: 4 9 9 9 9 9 9 9 9 9 ...
$ RT.Snapshot.name : Factor w/ 1 level "2017_RT12": 1 1 1 1 1 1 1 1 1 1 ...
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 2 1 2 2 2 1 1 1 1 1 ...
$ Critical.Y.N : num 0 0 0 0 0 0 0 0 0 0 ...
$ Grouping : Factor w/ 29 levels "281","Bes",..: 27 6 6 6 6 7 7 7 7 7 ...
$ Object.type : Factor w/ 11 levels "Cobol Program",..: 8 7 7 7 7 7 7 7 7 7 ...
$ Object.name : Factor w/ 3771 levels "[S:\\SOURCES\\",..: 3771 3770 3769 3768 3767 3
I want to have a statistic output like this:
For every Technical.Criterion, a row with the number of rows where Critical.Y.N is 0 and the number where it is 1.
So I have to combine the rows of my data frame into a new matrix, using the values of the factor sums ...
But I have no idea how to start. Any hints?
Thanks
I believe you're asking for a cross-tabulation. Because you did not provide a reproducible sample, I've used mine:
xtabs(~ Sub.Category + Category, retail)
This produces a table of counts for each combination of Sub.Category and Category.
If you want the cells to be based on, say, Sales instead of the count, you can modify the code to:
xtabs(Sales ~ Sub.Category + Category, retail)
Each cell then holds the sum of Sales for that combination.
EDIT based on extra information in the OP's comment
If you want your tables to share a common title and want to change the name of that title, you can use a combination of names() and dimnames(). The result of xtabs() is a cross-tabulation table, and calling dimnames() on it returns a list of length 2, the first element corresponding to the rows and the second to the columns.
dimnames(xtabs(~ Technical.Criterion + Object.type, dat))
$Technical.Criterion
[1] "TechnicalCrit1" "TechnicalCrit2" "TechnicalCrit3"
$`Object.type`
[1] "Object.type1" "Object.type2" "Object.type3"
So given a data frame, b:
'data.frame': 3 obs. of 9 variables:
$ Metric.ID : int 101 102 103
$ Metric.Name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Technical.Criterion: Factor w/ 3 levels "TechnicalCrit1",..: 1 2 3
$ RT.Snapshot.name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 1 2 1
$ Critical.Y.N : num 1 0 1
$ Grouping : Factor w/ 3 levels "A","B","C": 1 2 3
$ Object.type : Factor w/ 3 levels "Object.type1",..: 1 2 3
$ Object.name : Factor w/ 3 levels "A","B","C": 1 2 3
We can use xtabs() and then change the "common" header right at the top of our table. Since I don't know how many levels are in b$Violation.status, I would use a generic for loop:
tab <- list()  # create the list before assigning tab[[i]]
for (i in seq_along(unique(b$Violation.status))) {
  # note: every iteration tabulates all of b; only the header differs
  tab[[i]] <- xtabs(Critical.Y.N ~ Technical.Criterion + Object.type, b)
  names(dimnames(tab[[i]]))[2] <- paste("Violation.status", i)
}
This produces:
                   Violation.status 1
Technical.Criterion Object.type1 Object.type2 Object.type3
     TechnicalCrit1            1            0            0
     TechnicalCrit2            0            0            0
     TechnicalCrit3            0            0            1
Which I can now use in my shiny app.
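Applied to the columns named in the question, the cross-tabulation might look like the sketch below. The data frame dat is a made-up miniature with only the two relevant columns; the real data would have all nine.

```r
# Hypothetical miniature of the question's data frame
dat <- data.frame(
  Technical.Criterion = factor(c("Crit1", "Crit1", "Crit2", "Crit2", "Crit2")),
  Critical.Y.N = c(0, 1, 0, 0, 1)
)

# One row per Technical.Criterion, one column per value of Critical.Y.N;
# each cell is the number of matching rows
counts <- xtabs(~ Technical.Criterion + Critical.Y.N, data = dat)
counts
```

Wrapping the result in as.data.frame.matrix(counts) would turn it into an ordinary data frame if that is easier to work with downstream.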

Extracting complete dataframe from Hmisc package in R

I've used aregImpute to impute the missing values, then I used the impute.transcan function to try to get a complete dataset, using the following code.
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
When I run str(y), I do get a data frame, but it still contains NAs, as if nothing had been imputed. My question is: how do I get the complete dataset without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a typo in the above line. Plus, you do not even need completed at all.
Besides, if you want to get a data.frame from the impute.transcan function, then wrap it with as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you need to test your missing data pattern, you can also use the md.pattern function provided by the mice package.
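A quick base R way to confirm whether a data frame still contains missing values is sketched below. The data frame imputed_df is a made-up stand-in for the result of the as.data.frame() call above, deliberately left with NAs so the check has something to report.

```r
# Hypothetical stand-in for the data frame returned after imputation
imputed_df <- data.frame(
  age = c(30, NA, 35),
  balance = c(NA, 4789, 1350)
)

# Number of missing values per column
na_per_col <- colSums(is.na(imputed_df))

# TRUE if any value anywhere in the data frame is missing
any_missing <- anyNA(imputed_df)
```

After a successful imputation, na_per_col should be all zeros and any_missing should be FALSE.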

Error in Cross Validation in GLMNET package R for Binomial Target Variable

This is in reference to https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome. I am trying to use cross-validation in GLMNET (i.e. cv.glmnet) for a binomial target variable. glmnet works fine, but cv.glmnet throws an error. Here is the error log:
Error in storage.mode(y) = "double" : invalid to change the storage mode of a factor
In addition: Warning messages:
1: In Ops.factor(x, w) : ‘*’ not meaningful for factors
2: In Ops.factor(y, ybar) : ‘-’ not meaningful for factors
Data Types:
'data.frame': 490 obs. of 13 variables:
$ loan_id : Factor w/ 614 levels "LP001002","LP001003",..: 190 381 259 310 432 156 179 24 429 408 ...
$ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
$ married : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 1 ...
$ dependents : Factor w/ 4 levels "0","1","2","3+": 1 1 1 3 1 4 2 3 1 1 ...
$ education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 1 2 1 2 ...
$ self_employed : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ applicantincome : int 9328 3333 14683 7667 6500 39999 3750 3365 2920 2213 ...
$ coapplicantincome: num 0 2500 2100 0 0 ...
$ loanamount : int 188 128 304 185 105 600 116 112 87 66 ...
$ loan_amount_term : Factor w/ 10 levels "12","36","60",..: 6 9 9 9 9 6 9 9 9 9 ...
$ credit_history : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ property_area : Factor w/ 3 levels "Rural","Semiurban",..: 1 2 1 1 1 2 2 1 1 1 ...
$ loan_status : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 2 2 ...
Code used:
xfactors<-model.matrix(loan_status ~ gender+married+dependents+education+self_employed+loan_amount_term+credit_history+property_area,data=data_train)[,-1]
x<-as.matrix(data.frame(applicantincome,coapplicantincome,loanamount,xfactors))
glmmod<-glmnet(x,y=as.factor(loan_status),alpha=1,family='binomial')
plot(glmmod,xvar="lambda")
grid()
cv.glmmod <- cv.glmnet(x,y=loan_status,alpha=1) #This Is Where It Throws The Error
Credit for the answer goes to @user20650:
I suspect you need to add the family argument to cv.glmnet as well. An example:
x <- model.matrix(am ~ 0 + . , data=mtcars)
cv.glmnet(x, y=factor(mtcars$am), alpha=1)  # family omitted (defaults to "gaussian"): expected to fail for a factor y
cv.glmnet(x, y=factor(mtcars$am), alpha=1, family="binomial")  # family given explicitly

R nnet multinom (multinomial logistic regression models) - assign penalties to avoid misclassification

I am using multinom from the nnet package to fit a logistic regression model to data consisting of 3 classes; however, the prevalence of the classes is not balanced. I would like to assign weights/penalties in order to tell the model to avoid misclassification for a certain class.
Here is my code and a slice of my data:
mnm <- multinom(formula = cut.rank ~ ., data = training.logist, trace = FALSE, maxit = 1000, weights=c(10,5,1))
> str(head(training.logist))
'data.frame': 6 obs. of 15 variables:
$ is_top_rated_listing : Factor w/ 2 levels "0","1": 1 1 1 2 2 2
$ seller_is_top_rated_seller : int 1 1 1 1 1 1
$ is_auto_pay : Factor w/ 2 levels "0","1": 2 2 2 2 2 2
$ is_returns_accepted : Factor w/ 2 levels "0","1": 2 2 2 2 2 2
$ seller_feedback_rating_star : Factor w/ 11 levels "Blue","Green",..: 7 7 7 9 9 9
$ keywords_title_assoc : num 1 1 1 1 1 1
$ normalized.price_shipping : num 0 0 0.00871 0.01853 0.01853 ...
$ normalized.seller_feedback_score : num 0.7117 0.8791 0.0966 0.095 0.095 ...
$ normalized.seller_positive_feedback_percent: num 0.7117 0.8791 0.0966 0.095 0.095 ...
$ item_condition : Factor w/ 2 levels "New","New other (see details)": 1 1 1 1 1 1
$ listing_type : Factor w/ 2 levels "FixedPrice","StoreInventory": 2 2 2 1 1 1
$ best_offer_enabled : Factor w/ 2 levels "0","1": 1 1 1 1 1 1
$ shipping_handling_time : int 10 10 10 1 1 1
$ shipping_locations : Factor w/ 7 levels "AU,Americas,Europe,Asia",..: 5 5 5 5 5 5
$ cut.rank : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1
>
Does anyone have an idea how to assign misclassification penalties? Specifically, I would like to assign a penalty ratio of 10:5:1 (corresponding to classes 1, 2, 3), meaning I really want to be accurate on class 1.
The distribution of my target variable cut.rank is ~ 0.04,0.08,0.88.
Because class 1 has a low prevalence the model sensitivity for that class is low.
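One thing worth noting about the call above: the weights argument of multinom() takes case weights, one per row of the data, not one per class. A possible way to express the 10:5:1 ratio is therefore to map each row's class to its weight. The sketch below uses made-up data (toy) and is only an approximation of misclassification penalties through case weighting, not a true cost matrix.

```r
library(nnet)

set.seed(1)
# Hypothetical toy data with an imbalanced 3-class target
toy <- data.frame(
  x1 = rnorm(200),
  x2 = rnorm(200),
  cut.rank = factor(sample(c("1", "2", "3"), 200, replace = TRUE,
                           prob = c(0.1, 0.2, 0.7)))
)

# Map the 10:5:1 class ratio onto every row according to its class
class_w <- c("1" = 10, "2" = 5, "3" = 1)
case_w <- unname(class_w[as.character(toy$cut.rank)])

# Fit with one weight per observation
mnm <- multinom(cut.rank ~ ., data = toy, weights = case_w,
                trace = FALSE, maxit = 1000)
```

Rows of class 1 then count ten times as much in the likelihood as rows of class 3, which tends to pull the fit toward the rare class.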
