Conditional frequencies calculation for more than one variable - r

I want to calculate conditional probabilities in my data. Therefore I coded the following:
creditrisks <- read.table("kredit.asc", header=TRUE)
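library(dplyr)  # provides glimpse()
glimpse(creditrisks)
# Refer to the columns through the data frame, since attach() is only called further down
creditrisks$moral1 <- as.integer(creditrisks$moral>1)
creditrisks$konto1 <- as.integer(creditrisks$laufkont==1)
creditrisks$konto2 <- as.integer(creditrisks$laufkont==4)
creditrisks$zweck <- as.integer(0<creditrisks$verw & creditrisks$verw<9)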
attach(creditrisks)
prop.table(table(kredit,konto1),2)
prop.table(table(kredit,konto2),2)
prop.table(table(kredit,moral1),2)
prop.table(table(kredit,zweck),2)
The results look like this:
This works well for me. The only thing I want to change is to calculate all conditional frequencies at once, so the table should look like this:
With cbind I lose all the variable names, so I'm searching for a more elegant way.
The dataset can be found here: dataset
Thanks for your help!

Try this.
lapply(creditrisks[, c("konto1", "konto2", "moral1", "zweck")],
function(x) prop.table(table(creditrisks$kredit, x), 2)
)
You can also cbind them together by
do.call(cbind,
lapply(
creditrisks[, c("konto1", "konto2", "moral1", "zweck")],
function(x) prop.table(table(creditrisks$kredit, x), 2)
)
)
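To keep the variable names when binding the tables, one option is to prefix each table's column labels with the variable name before calling cbind. Below is a minimal sketch, assuming a small made-up data frame in place of kredit.asc (the values are invented for illustration; only the column names follow the question):

```r
# Made-up stand-in for the real data; only the column names follow the question.
creditrisks <- data.frame(
  kredit = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  konto1 = c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0),
  konto2 = c(0, 0, 1, 0, 0, 1, 1, 1, 0, 0),
  moral1 = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 1),
  zweck  = c(0, 1, 1, 0, 1, 1, 0, 1, 1, 0)
)
vars <- c("konto1", "konto2", "moral1", "zweck")

# Loop over the names (not the columns) so each table can be labelled.
tabs <- lapply(vars, function(v) {
  tab <- prop.table(table(creditrisks$kredit, creditrisks[[v]]), 2)
  colnames(tab) <- paste(v, colnames(tab), sep = "=")
  tab
})

# One wide matrix; rows are the levels of kredit, columns are "variable=level".
combined <- do.call(cbind, tabs)
round(combined, 2)
```

Looping over the variable names instead of the columns is what makes the label available inside the function.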

Related

Create a function with 'trapz' tool? and apply lapply?

I'm trying to calculate the 'trapezoidal AUC(area under the curve)' by using the 'trapz' tool from 'caTools'. It is very simple to calculate one variable's AUC when using trapz like this:
tAUC <- trapz(df1$time, df1$CAT.19)
tAUC
Now, I want to create a function with this and eventually 'lapply' it to do batch calculation, but I am having trouble making this into a function.
I have tried:
t_func <- function(x){
trapz(df1$time, df1$x)
}
but I get an error that says "non-conformable arguments".
Can anyone help me with this? Thank you so much.
my df1 looks like this
An image is not a helpful way to share data. I have created a fake dataset to reproduce the dataset that you have.
set.seed(123)
df1 <- data.frame(time = seq(0, 120, 15), CAT.01 = rnorm(9), CAT.02 = rnorm(9))
tAUC <- sapply(df1[-1], function(x) caTools::trapz(df1$time, x))
tAUC
# CAT.01 CAT.02
#27.23374 39.27199
If you need a list you may use lapply instead of sapply.
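As for why the original t_func fails: df1$x looks for a column literally named "x" rather than using the value of x, so it returns NULL and the lengths no longer match. If you want a function that takes a column name, indexing with [[ ]] is one way to do it. A minimal sketch, assuming a tiny made-up df1 and a hypothetical base R helper trapz_base standing in for caTools::trapz so that the example runs without extra packages:

```r
# Hypothetical stand-in for caTools::trapz (plain trapezoidal rule).
trapz_base <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Made-up data in the same shape as the question's df1.
df1 <- data.frame(time = c(0, 15, 30), CAT.01 = c(0, 2, 4), CAT.02 = c(1, 1, 1))

# df1[[col]] evaluates col, whereas df1$col would look for a column named "col".
t_func <- function(col) trapz_base(df1$time, df1[[col]])

# Apply to every column name except "time"; the result is a named numeric vector.
tAUC <- sapply(names(df1)[-1], t_func)
tAUC
```

With the real data, trapz_base can be swapped for caTools::trapz inside t_func.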

Way to do this using apply?

I want to take an average for each row across different data frames. Does anyone know of a more clever way to do this using apply statements? Sorry for the wall of code.
You would need a vector of 1000:1006 for each hiXXXX file and then a vector 2:13 for the columns. I have used mapply for something weird like this before, so maybe that could do it somehow?
for (i in 1:nrow(subavg)) {
subavg[i,c(2)] <- mean(c(hi1000[i,c(2)],hi1001[i,c(2)],hi1002[i,c(2)],hi1003[i,c(2)],hi1004[i,c(2)],hi1005[i,c(2)],hi1006[i,c(2)]))
subavg[i,c(3)] <- mean(c(hi1000[i,c(3)],hi1001[i,c(3)],hi1002[i,c(3)],hi1003[i,c(3)],hi1004[i,c(3)],hi1005[i,c(3)],hi1006[i,c(3)]))
subavg[i,c(4)] <- mean(c(hi1000[i,c(4)],hi1001[i,c(4)],hi1002[i,c(4)],hi1003[i,c(4)],hi1004[i,c(4)],hi1005[i,c(4)],hi1006[i,c(4)]))
subavg[i,c(5)] <- mean(c(hi1000[i,c(5)],hi1001[i,c(5)],hi1002[i,c(5)],hi1003[i,c(5)],hi1004[i,c(5)],hi1005[i,c(5)],hi1006[i,c(5)]))
subavg[i,c(6)] <- mean(c(hi1000[i,c(6)],hi1001[i,c(6)],hi1002[i,c(6)],hi1003[i,c(6)],hi1004[i,c(6)],hi1005[i,c(6)],hi1006[i,c(6)]))
subavg[i,c(7)] <- mean(c(hi1000[i,c(7)],hi1001[i,c(7)],hi1002[i,c(7)],hi1003[i,c(7)],hi1004[i,c(7)],hi1005[i,c(7)],hi1006[i,c(7)]))
subavg[i,c(8)] <- mean(c(hi1000[i,c(8)],hi1001[i,c(8)],hi1002[i,c(8)],hi1003[i,c(8)],hi1004[i,c(8)],hi1005[i,c(8)],hi1006[i,c(8)]))
subavg[i,c(9)] <- mean(c(hi1000[i,c(9)],hi1001[i,c(9)],hi1002[i,c(9)],hi1003[i,c(9)],hi1004[i,c(9)],hi1005[i,c(9)],hi1006[i,c(9)]))
subavg[i,c(10)] <- mean(c(hi1000[i,c(10)],hi1001[i,c(10)],hi1002[i,c(10)],hi1003[i,c(10)],hi1004[i,c(10)],hi1005[i,c(10)],hi1006[i,c(10)]))
subavg[i,c(11)] <- mean(c(hi1000[i,c(11)],hi1001[i,c(11)],hi1002[i,c(11)],hi1003[i,c(11)],hi1004[i,c(11)],hi1005[i,c(11)],hi1006[i,c(11)]))
subavg[i,c(12)] <- mean(c(hi1000[i,c(12)],hi1001[i,c(12)],hi1002[i,c(12)],hi1003[i,c(12)],hi1004[i,c(12)],hi1005[i,c(12)],hi1006[i,c(12)]))
subavg[i,c(13)] <- mean(c(hi1000[i,c(13)],hi1001[i,c(13)],hi1002[i,c(13)],hi1003[i,c(13)],hi1004[i,c(13)],hi1005[i,c(13)],hi1006[i,c(13)]))
}
As there are only 7 datasets, we can use that as arguments for Map, then cbind it, and get the rowMeans
Map(function(...) rowMeans(cbind(...)), hi1000, hi1001, hi1002, hi1003,
hi1004, hi1005, hi1006)
Or use + with Reduce after getting the datasets in a list and then divide by the total number of datasets, i.e. 7
Reduce(`+`, mget(paste0("hi", 1000:1006)))/7
The second solution is more compact, but if there are NAs in the datasets, it is better to use the first one because rowMeans has an na.rm argument. By default it is FALSE, but we can set it to TRUE.
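To see the difference between the two approaches when NAs are present, here is a small sketch. It assumes three made-up data frames (hi1000 to hi1002) instead of the seven real ones, and the helper name row_mean is only for illustration:

```r
# Made-up data: three small data frames with the same shape, one NA in column b.
hi1000 <- data.frame(a = c(1, 2), b = c(10, NA))
hi1001 <- data.frame(a = c(3, 4), b = c(20, 40))
hi1002 <- data.frame(a = c(5, 6), b = c(30, 60))

# Map approach: for each column, bind the matching columns side by side
# and average across them row by row, skipping NAs.
row_mean <- function(...) rowMeans(cbind(...), na.rm = TRUE)
avg_map <- as.data.frame(Map(row_mean, hi1000, hi1001, hi1002))

# Reduce approach: element-wise sum divided by the number of data frames.
# Any NA propagates into the result.
dfs <- mget(paste0("hi", 1000:1002))
avg_reduce <- Reduce(`+`, dfs) / length(dfs)

avg_map
avg_reduce
```

The Map version averages the two available values where one is missing, while the Reduce version returns NA for that cell.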

Applying multiple functions via sapply

I'm trying to replicate the solution for applying multiple functions in sapply posted on R-Bloggers, but I can't get it to work in the desired manner. I'm working with a simple data set, similar to the one generated below:
require(datasets)
crs_mat <- cor(mtcars)
# Triangle function
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)] <- NA
return(cormat)
}
require(reshape2)
crs_mat <- melt(get_upper_tri(crs_mat))
I would like to replace some text values across columns Var1 and Var2. The erroneous syntax below illustrates what I am trying to achieve:
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
gsub("mpg","MPG",x),
# Replace second phrase
gsub("gear", "GeArr",x)
# Ideally, perform other changes
})
Naturally, the code is not syntactically correct and fails. To summarise, I would like to do the following:
Go through all the values in first two columns (Var1 and Var2) and perform simple replacements via gsub.
Ideally, I would like to avoid defining a separate function, as discussed in the linked post and keep everything within the sapply syntax
I don't want a nested loop
I had a look at the broadly similar subject discussed here and here but, if possible, I would like to avoid making use of plyr. I'm also interested in replacing the column values not in creating new columns and I would like to avoid specifying any column names. While working with my existing data frame it is more convenient for me to use column numbers.
Edit
Following very useful comments, what I'm trying to achieve can be summarised in the solution below:
fun.clean.columns <- function(x, str_width = 15) {
# Make character
x <- as.character(x)
# Replace various phrases
x <- gsub("perc85","something else", x)
x <- gsub("again", x)
x <- gsub("more","even more", x)
x <- gsub("abc","ohmg", x)
# Clean spaces
x <- trimws(x)
# Wrap strings
x <- stringr::str_wrap(x, width = str_width)
# Return object
return(x)
}
mean_data[,1:2] <- sapply(mean_data[,1:2], fun.clean.columns)
I don't need this function in my global environment, so I can run rm after this, but an even nicer solution would squeeze all of this into the apply call.
We can use mgsub from library(qdap) to replace multiple patterns. Here, I am looping the first and second column using lapply and assign the results back to the crs_mat[,1:2]. Note that I am using lapply instead of sapply as lapply keeps the structure intact
library(qdap)
crs_mat[,1:2] <- lapply(crs_mat[,1:2], mgsub,
pattern=c('mpg', 'gear'), replacement=c('MPG', 'GeArr'))
Here is a start of a solution for you, I think you're capable of extending it yourself. There's probably more elegant approaches available, but I don't see them atm.
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
step1 <- gsub("mpg","MPG",x)
# Replace second phrase. Note that this operates on the result of the first step.
step2 <- gsub("gear", "GeArr",step1)
# Ideally, perform other changes
return(step2)
#or one nested line, not practical if more needs to be done
#return(gsub("gear", "GeArr",gsub("mpg","MPG",x)))
})
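If the list of replacements grows, a base R alternative is to keep the pattern/replacement pairs in a named vector and fold gsub over them with Reduce, all inside the lapply call. This is only a sketch under assumptions: the small crs_mat below is made up to stand in for the melted correlation matrix, and the name replacements is hypothetical.

```r
# Made-up stand-in for the melted correlation matrix.
crs_mat <- data.frame(
  Var1  = c("mpg", "gear", "cyl"),
  Var2  = c("gear", "mpg", "hp"),
  value = c(1, 0.5, -0.2),
  stringsAsFactors = FALSE
)

# Names are the patterns, values are the replacements.
replacements <- c(mpg = "MPG", gear = "GeArr")

# For each of the first two columns, apply every replacement in turn.
crs_mat[, 1:2] <- lapply(crs_mat[, 1:2], function(x) {
  Reduce(
    function(acc, pat) gsub(pat, replacements[[pat]], acc, fixed = TRUE),
    names(replacements),
    as.character(x)
  )
})
crs_mat
```

Columns are still addressed by number, no extra package is needed, and adding another replacement only means adding one entry to the vector.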

Removing quotes in function output in R

I am trying to write a function in R for a simple time series regression (the result of this function is the output for more complicated ones). In the first part I define the variables and create some lags for the function, which are named ar_i depending on the lag used.
However, in the second part I try to combine these lags in a matrix using cbind on the variables initially defined. As you can see, the output is not the expected matrix but the names of the lags themselves. I tried to solve this with the noquote() and cat() functions, but these don't seem to work.
Do you have any suggestions? Thanks in advance!
PS: The code and the results are below.
trans <- dlpib
ar <- dlpib
linear <- 1:4
for (i in linear){
assign(paste("ar_",i,sep = ""), lag(ar,k=-i))
}
linear_dat <- cbind(paste("ar_",linear, collapse=',', sep = ""))
> linear_dat
[,1]
[1,] "ar_1,ar_2,ar_3,ar_4"
I think you could go about this more efficiently with lapply:
linear <- 1:4
linear_list <- lapply(linear, function(i) lag(ar, k=-i))
linear_dat <- do.call(cbind, linear_list)
colnames(linear_dat) <- paste0("ar_", linear)
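The reason the original attempt printed a string is that paste() only builds the text "ar_1,ar_2,ar_3,ar_4"; it never looks up the objects with those names. If you do want to keep the assign() loop, mget() can fetch the objects by name. A small sketch, assuming a made-up series in place of dlpib:

```r
# Made-up series standing in for dlpib.
ar <- ts(c(1.2, 0.7, -0.3, 0.9, 1.5, 0.4))
linear <- 1:4

# Approach from the answer: build the lags in a list, then bind them.
# stats::lag is spelled out because dplyr::lag masks it when dplyr is loaded.
linear_list <- lapply(linear, function(i) stats::lag(ar, k = -i))
linear_dat <- do.call(cbind, linear_list)
colnames(linear_dat) <- paste0("ar_", linear)

# Variant that keeps the assign() loop: mget() turns the names into objects.
for (i in linear) assign(paste0("ar_", i), stats::lag(ar, k = -i))
from_names <- do.call(cbind, unname(mget(paste0("ar_", linear))))
colnames(from_names) <- paste0("ar_", linear)

linear_dat
```

Both variants should give the same matrix of lagged series; the list version simply avoids creating one global variable per lag.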

Function over vectors collected in a list in R

I have looked long and hard for a solution to the following problem, but I couldn't find it. I apologize in advance if this is a duplicate, and I will delete this question if you direct me to an answer.
I have a list (Mylist) where each element holds many different fields. I'm interested in the numeric vector called 'coefficients'. I can thus select the coefficients of the i-th element of the list as
Mylist[[i]]$coefficients
but how do I get the average of coefficients over all i? The average is just meant as an example. What I'm generally interested in is how to compute a function over a list where each field of the list holds more than one data.frame/matrix/string etc.
UPDATE: As kindly supplied by Thomas below, here are some fake data for the problem:
Mylist <- replicate(10,data.frame(coefficients=rnorm(20),
something=rnorm(20)), simplify=FALSE)
I have tried looking at lapply, but since 'Mylist' has other fields than coefficients I don't see how to do it.
Thanks!
You might need to provide more details on the exact structure of your data, but here's a simple example:
# some fake data:
mylist <- replicate(10,data.frame(coefficients=rnorm(20),
something=rnorm(20)), simplify=FALSE)
# take the grand mean:
mean(sapply(mylist,function(x) x$coefficients))
But perhaps you want the mean for each set of corresponding coefficients across all the list entries, which you could get with something like either of the following (which are identical):
colMeans(do.call(rbind,lapply(mylist,function(x) x$coefficients)))
rowMeans(do.call(cbind,lapply(mylist,function(x) x$coefficients)))
Which @SimonO101 rightly points out simplifies to:
rowMeans(sapply(mylist, function(x) x$coefficients))
because sapply is just a wrapper for lapply that does the simplification for you.
If you want the mean for all coefficients across all list elements try...
mean( unlist( sapply( Mylist , function(x) `[`(x , 'coefficients') ) ) )
However, you should clarify what you want because it is unclear if you want...
# A mean for each set of coefficients
sapply( Mylist , function(x) mean( x$coefficients ) )
# The mean for each coefficient across all lists
rowMeans( sapply( Mylist , function(x) x$coefficients ) )
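To make the three readings of "average" concrete, it may help to extract the coefficients into a matrix once and summarise it in different directions. A minimal sketch using the fake data from the question (the seed and the names coef_mat, grand_mean, per_element and per_coefficient are assumptions for illustration):

```r
set.seed(42)  # arbitrary seed so the fake data is reproducible
Mylist <- replicate(10, data.frame(coefficients = rnorm(20),
                                   something = rnorm(20)), simplify = FALSE)

# One column per list element, one row per coefficient position (20 x 10).
coef_mat <- sapply(Mylist, function(x) x$coefficients)

grand_mean      <- mean(coef_mat)      # single number over everything
per_element     <- colMeans(coef_mat)  # one mean per list element (length 10)
per_coefficient <- rowMeans(coef_mat)  # one mean per coefficient (length 20)
```

The same pattern carries over to other summaries, for example apply(coef_mat, 1, median) for a per-coefficient median.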

Resources