I have a data table of observations and model output, each taking the values yes and no. For simplicity I have assumed only two groups. I want to calculate some categorical statistics, and I want control over which ones are computed. I know how to do it using eval and saving the result in another data.table, but I want to add the results to the existing data.table, as I have only one row for each group. Could anyone help me?
First I create the contingency table for each group.
library(data.table)
category <- c("yes", "no")  # assumed; category was not defined in the original post
DT <- data.table(obs = rep(c("yes", "no"), 5),
                 mod = c(rep("yes", 5), rep("no", 5)),
                 groupBy = c(1, 1, 1, 1, 1, 2, 1, 1, 2, 1))
categorical <- DT[, .(a = sum(obs == category[1] & mod == category[1]),
                      b = sum(obs == category[2] & mod == category[1]),
                      c = sum(obs == category[1] & mod == category[2]),
                      d = sum(obs == category[2] & mod == category[2])), by = groupBy]
Then define the statistics
my_exprs = quote(list(
  n = a + b + c + d,
  s = (a + c) / (a + b + c + d),
  r = (a + b) / (a + b + c + d)))
If I use the following lines, I get a new data.table:
statList <- c("n","s")
w = which(names(my_exprs) %in% statList)
categorical[, eval(my_exprs[c(1,w)]), by = groupBy]
How can I use := in this example to add the results to my existing data.table, here called categorical? I tried the following and got an error message:
categorical[, `:=`(eval(my_exprs[c(1,w)])), by = groupBy]
Error in `[.data.table`(categorical, , `:=`(eval(my_exprs[c(1, w)])), :
In `:=`(col1=val1, col2=val2, ...) form, all arguments must be named.
Thanks,
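I cannot reproduce your example, but it might work to keep your my_exprs and build the := call from its arguments:
my_newcols = as.call(c(quote(`:=`), as.list(my_exprs)[-1]))
as in Arun's answer.
Alternatively, you could just construct the expression with a := at the start:
my_newcols = quote(`:=`(n = a+b+c+d, s = a+c))
Either way, the call is then evaluated in j, e.g. categorical[, eval(my_newcols), by = groupBy].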
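To make the idea concrete, here is a minimal sketch that also restricts the new columns to the names in statList. The counts in categorical are typed in by hand for illustration, and the helper name stat_args is an assumption, not something from the question.

```r
library(data.table)

# Per-group counts, entered by hand for illustration
categorical <- data.table(groupBy = c(1, 2),
                          a = c(3, 0), b = c(2, 0), c = c(1, 1), d = c(2, 1))

my_exprs <- quote(list(
  n = a + b + c + d,
  s = (a + c) / (a + b + c + d),
  r = (a + b) / (a + b + c + d)))

statList <- c("n", "s")

# Drop the leading `list` symbol, keep only the requested statistics
stat_args <- as.list(my_exprs)[-1]            # hypothetical helper name
stat_args <- stat_args[names(stat_args) %in% statList]

# Build `:=`(n = ..., s = ...) so that every argument is named
my_newcols <- as.call(c(quote(`:=`), stat_args))

# Adds n and s to categorical by reference
categorical[, eval(my_newcols), by = groupBy]
print(categorical)
```

Because the arguments are spliced into the := call individually, each one keeps its name, which is what the error message in the question asks for.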
library(parameters)  # provides model_parameters()
DT <- as.data.frame(datafile)
alpha <- t_alpha <- indicator <- c()  # accumulators; assumed to start empty
for (i in 1:25) {
  MER = DT[, 1]
  PER = DT[, (1 + i)]
  model_i = assign(paste("lm", i, sep = ""), model_parameters(lm(PER ~ MER)))
  alpha = append(alpha, lm(PER ~ MER)$coefficient[1])
  t_alpha = append(t_alpha, model_i[1, 7])
  if (model_i[1, 7] > 1.96) {
    indicator = append(indicator, 1)
  } else {
    indicator = append(indicator, 0)
  }
}
For a csv file (i.e. datafile), what is the benefit of using as.data.frame?
Is it using the second column, third column, ..., 26th column as y and regressing each on the first column (x)?
What is the model_i line doing? Does it create lm1, lm2, ..., lm25 and then assign them to model 1, model 2, ..., model 25? But lm and model are different names, so what does it assign?
How should the append function be used? Is it append(name of the item, how to find this item)? If we have already appended the value, why do we need to store the result on the left-hand side (i.e. alpha)?
Thank you very much for your help.
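A small sketch of how append() and assign() behave in base R may help with the last two questions. The values used here are made up for illustration and have nothing to do with the regression itself.

```r
# append(x, values) returns a NEW vector; x is unchanged unless the
# result is assigned back to it
alpha <- c()
append(alpha, 1.5)           # result is printed and then discarded
length(alpha)                # alpha is still empty
alpha <- append(alpha, 1.5)  # storing the result keeps the new element
alpha <- append(alpha, 2.5)

# assign() creates a variable whose name is built from a string and
# (invisibly) returns the assigned value, so model_i holds the same value
i <- 3
model_i <- assign(paste("lm", i, sep = ""), i * 10)
print(lm3)
print(model_i)
```

So in the loop, lm1, lm2, ... are created as a side effect, while model_i is just a convenient handle on the current iteration's result.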
I am new to R, and I can't fix the bug after searching for one hour. It seems that there's no similar problem posted before.
I followed the instructions from https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/ and want to test the proportional odds assumption for my data.
Following is my code:
library(Hmisc)  # provides the formula method for summary() used below
sf <- function(y) {
  c('Y>=1' = qlogis(mean(y >= 1)),
    'Y>=2' = qlogis(mean(y >= 2)),
    'Y>=3' = qlogis(mean(y >= 3)),
    'Y>=4' = qlogis(mean(y >= 4)),
    'Y>=5' = qlogis(mean(y >= 5)))
}
(s <- with(dat, summary(as.numeric(implied_rating) ~ GDP + importance, fun = sf)))
But the following error occurs:
"Error in summary.formula(matrix(as.numeric(implied_rating)) ~
matrix(GDP) + : matrix variables must have column dimnames"
What should I do?
Many thanks in advance!
Solved. I thought dimnames was the same as colnames...
I just manually set dimnames on every column.
But I still wonder whether there is a better way to solve the problem.
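One possible alternative, sketched below, assumes the columns of dat are one-column matrices (for example, the result of scale()), which is a guess based on the error message. In that case, flattening them to plain vectors avoids having to set dimnames at all. The data frame here is made up for illustration.

```r
# Hypothetical data; GDP becomes a one-column matrix after scale()
dat <- data.frame(implied_rating = c(1, 3, 2, 5, 4),
                  GDP = c(1.2, 3.4, 2.2, 5.1, 4.0))
dat$GDP <- scale(dat$GDP)
print(is.matrix(dat$GDP))

# Find matrix-valued columns and flatten each one to a plain vector
is_mat <- vapply(dat, is.matrix, logical(1))
for (nm in names(dat)[is_mat]) {
  dat[[nm]] <- as.vector(dat[[nm]])
}
print(vapply(dat, is.matrix, logical(1)))
```

If the columns are matrices for some other reason, the same loop still applies, but it is worth checking where the matrix structure came from first.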
So I am encountering the following issue:
df = data.frame(...) # with columns "Article" & "Revenue"
df_agg = aggregate(. ~ Article, data = df, sum)
# Let A be some Article
sum_1 = sum(df$Revenue[df$Article == A], na.rm = TRUE)
sum_2 = sum(df_agg$Revenue[df_agg$Article == A], na.rm = TRUE)
I would expect sum_1 == sum_2 to be TRUE, but this is not the case. Why could that be?
The problem vanishes if I use Revenue ~ Article instead of the dot in the formula argument. But why?
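One likely cause is the formula method's default na.action = na.omit: with the dot, every other column goes into the model frame, so a row with an NA in any column is dropped before summing. The sketch below illustrates this with a made-up data frame. The Cost column is hypothetical and only there to carry an NA.

```r
df <- data.frame(
  Article = c("A", "A", "B", "B"),
  Revenue = c(10, 20, 30, 40),
  Cost    = c(1, NA, 3, 4)   # hypothetical extra column with a missing value
)

# Dot formula: row 2 is dropped entirely because Cost is NA there
agg_dot <- aggregate(. ~ Article, data = df, FUN = sum)

# Explicit formula: only Revenue and Article are checked for NA
agg_rev <- aggregate(Revenue ~ Article, data = df, FUN = sum)

print(agg_dot$Revenue[agg_dot$Article == "A"])  # 10
print(agg_rev$Revenue[agg_rev$Article == "A"])  # 30

# Keeping the dot but passing NAs through to sum(), which then drops them
agg_keep <- aggregate(. ~ Article, data = df, FUN = sum,
                      na.rm = TRUE, na.action = na.pass)
print(agg_keep)
```

Checking colSums(is.na(df)) on the real data would show whether columns other than Revenue contain missing values.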
This question is about efficient use of data.table in R.
Suppose the following DataTable:
DataTable <- data.table(Id = rep(1:10, 5), Method = rep(c("M1","M2","M3","M4", "M5"), each = 10), Value = rnorm(50))
What I want to know is: for which Ids is the maximum absolute difference in Value between M1 and M3 more than 2?
I thought about this code:
DataTable[, if (max(abs(.SD[Method == "M1", Value] - .SD[Method == "M3", Value])) > 2) 1, by = "Id"]$Id
This gives the desired output, but it seems inelegant and is also quite slow. Is there a better way to do this?
Here are 2 possible approaches:
1) As akrun suggested, subset within each group using Value[Method=="M1"]:
DataTable[, Id[any(abs(Value[Method == "M1"] - Value[Method == "M3"]) > 2)], by = .(Id)]$V1
2) Cast into a wide format (might be slower with a large dataset):
dcast(DataTable, Id ~ Method, sum, value.var = "Value")[, Id[abs(M1 - M3) > 2]]
data:
library(data.table)
set.seed(0L)
DataTable <- data.table(Id = rep(1:10, 5),
                        Method = rep(c("M1","M2","M3","M4", "M5"), each = 10),
                        Value = rnorm(50))
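As a quick sanity check, the sketch below runs both approaches on the sample data and confirms they select the same Ids. It assumes one row per Id and Method, as in the sample data, so the wide cast needs no aggregation function. The names ids_grouped, wide and ids_wide are illustrative only.

```r
library(data.table)

set.seed(0L)
DataTable <- data.table(Id = rep(1:10, 5),
                        Method = rep(c("M1", "M2", "M3", "M4", "M5"), each = 10),
                        Value = rnorm(50))

# Approach 1: compare M1 and M3 within each Id group
ids_grouped <- DataTable[, Id[any(abs(Value[Method == "M1"] -
                                      Value[Method == "M3"]) > 2)],
                         by = .(Id)]$V1

# Approach 2: one row per Id, one column per Method, then filter rows
wide <- dcast(DataTable, Id ~ Method, value.var = "Value")
ids_wide <- wide[abs(M1 - M3) > 2, Id]

print(ids_grouped)
print(identical(sort(ids_grouped), sort(ids_wide)))
```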
I am working on a credit card prospect identification case study. I have to replace the values of all columns with their corresponding WOE values. I can do it in 2-3 steps. However, I want to know whether there is a way to do it in one shot.
The scorecard package makes this simple with its woebin(), woebin_plot(), woebin_ply() and iv() functions.
library(scorecard)
temp <- credit_data
bins <- woebin(dt = temp, y = "targetvariable")
woebin_plot(bins$Income)
WOE_temp <- woebin_ply(temp, bins)
View(WOE_temp)
View(temp[is.na(temp$No.of.dependents), ])
IV_values <- iv(dt = temp, y = "targetvariable")
(IV_values)
You might want to take a look at the woe package (in case WOE stands for Weight of Evidence).
Here's the relevant code snippet from the documentation:
library(woe)
res_woe <- woe(Data = mtcars, Independent = "cyl", Continuous = FALSE, Dependent = "am", C_Bin = 10, Bad = 0, Good = 1)
Hi, please follow these steps:
Step 1: Calculate WOE and IV using the Information package:
library(fuzzyjoin)
library(Information)
IV <- Information::create_infotables(data = test_df,
                                     y = "label_column",
                                     parallel = TRUE)
Here 'y' is the name of the label column and 'data' is the data frame.
Step 2: Use the following function.
This is my own custom function to replace the actual values in a data frame with the WOE values calculated by the Information package:
woe_replace <- function(df_orig, IV) {
  df <- cbind(df_orig)  # work on a copy of the input
  # Look-up table of column names and their classes
  df_clmtyp <- data.frame(clmtyp = sapply(df, class))
  df_col_typ <-
    data.frame(clmnm = colnames(df), clmtyp = df_clmtyp$clmtyp)
  for (rownm in 1:nrow(df_col_typ)) {
    colmn_nm <- toString(df_col_typ[rownm, "clmnm"])
    # Only columns that have a WOE table are replaced
    if (colmn_nm %in% names(IV$Tables)) {
      column_woe_df <- cbind(data.frame(IV$Tables[[colmn_nm]]))
      if (df_col_typ[rownm, "clmtyp"] == "factor" | df_col_typ[rownm, "clmtyp"] == "character") {
        # Categorical column: exact join on the level, then swap in WOE
        df <-
          dplyr::inner_join(
            df,
            column_woe_df[, c(colmn_nm, "WOE")],
            by = colmn_nm
          )
        df[colmn_nm] <- NULL
        colnames(df)[colnames(df) == "WOE"] <- colmn_nm
      } else if (df_col_typ[rownm, "clmtyp"] == "numeric" | df_col_typ[rownm, "clmtyp"] == "integer") {
        # Numeric column: parse the "[lower,upper]" bin labels into bounds
        column_woe_df$lv <- as.numeric(stringr::str_sub(
          column_woe_df[, colmn_nm],
          regexpr("\\[", column_woe_df[, colmn_nm]) + 1,
          regexpr(",", column_woe_df[, colmn_nm]) - 1
        ))
        column_woe_df$uv <- as.numeric(stringr::str_sub(
          column_woe_df[, colmn_nm],
          regexpr(",", column_woe_df[, colmn_nm]) + 1,
          regexpr("\\]", column_woe_df[, colmn_nm]) - 1
        ))
        column_woe_df[colmn_nm] <- NULL
        column_woe_df <- column_woe_df[, c("lv", "uv", "WOE")]
        # Temporary name that is unlikely to clash with an existing column
        colnames(df)[colnames(df) == colmn_nm] <- "WOE_temp2381111111111111697"
        # Range join: lv <= value <= uv
        df <-
          fuzzyjoin::fuzzy_inner_join(
            df,
            column_woe_df[, c("lv", "uv", "WOE")],
            by = c("WOE_temp2381111111111111697" = "lv", "WOE_temp2381111111111111697" = "uv"),
            match_fun = list(`>=`, `<=`)
          )
        df["WOE_temp2381111111111111697"] <- NULL
        df["lv"] <- NULL
        df["uv"] <- NULL
        colnames(df)[colnames(df) == "WOE"] <- colmn_nm
      }
    }
  }
  return(df)
}
Function call:
test_df_woe <- woe_replace(test_df, IV)
Or in one shot:
test_df_woe <- woe_replace(test_df, Information::create_infotables(data = test_df, y = "label_column", parallel = TRUE))
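The numeric branch of the function relies on parsing bin labels of the form "[lower,upper]" and then matching each value to the bin whose range contains it. Here is a minimal base-R sketch of that step in isolation. The bin table and its WOE values are made up for illustration and are not output from the Information package.

```r
# Hypothetical bin table using the "[lower,upper]" label style
bin_tbl <- data.frame(
  bin = c("[1,5]", "[6,10]", "[11,20]"),
  WOE = c(-0.4, 0.1, 0.7),   # made-up WOE values
  stringsAsFactors = FALSE
)

# Strip the brackets, split on the comma, and convert both bounds to numbers
bounds <- strsplit(gsub("\\[|\\]", "", bin_tbl$bin), ",", fixed = TRUE)
bin_tbl$lv <- as.numeric(vapply(bounds, `[`, character(1), 1))
bin_tbl$uv <- as.numeric(vapply(bounds, `[`, character(1), 2))

# For each value, find the first bin with lv <= value <= uv (NA if none)
x <- c(3, 10, 12, 6)
idx <- vapply(x, function(v) which(v >= bin_tbl$lv & v <= bin_tbl$uv)[1],
              integer(1))
x_woe <- bin_tbl$WOE[idx]
print(x_woe)
```

Values that fall in no bin come back as NA here, whereas an inner join drops those rows, so the row count of the result is worth checking after calling woe_replace().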