I am trying to make a model using the lapply function, where lapply indexes through each column of response variables and creates a linear model using the predictor variables. I then pass each individual linear model to the stepAIC function, and after that to the stepVIF function. I can make this work on a dataset with no NA values; however, my dataset is full of NA values, which gives me a "variable lengths differ" error when I pass the linear models to the stepAIC function.
This is what I have tried so far. I wrote the multiple.func function in an attempt to remove NA values from one column at a time, but it does not work, and I think it would end up removing all of the rows of data except for the fourth column, due to how complete.cases() works. This is why I only want to remove the NA values from one column of the response variables at a time (the column being called in the model), and all of the NA values from the predictor variables (the fourth, fifth and sixth columns).
data_dummy <- data.frame(first_column  = c("A", "B", "c", "d", "e", "f"),
                         second_column = c(1, NA, 3, 4, 5, 6),
                         third_column  = c(NA, 7, 3, NA, 2, 6),
                         fourth_column = c(5, 8, 3, 4, 5, 1),
                         fifth_column  = c(2, NA, 3, NA, 2, 6),
                         sixth_column  = c(5, 9, 3, 4, NA, 1))
g <- 3
multiple.func <- function(g) {
  c(data_dummy[complete.cases(data_dummy[, g]), ],
    lm(reformulate(names(data_dummy)[4:6], response = names(data_dummy)[g]), data_dummy))  # trying to remove NA
}
full.model <- lapply(data_dummy, multiple.func)
step.model <- lapply(full.model, function(x) MASS::stepAIC(x, direction = "both", trace = FALSE))  # fit stepwise regression model
stepmod3 <- lapply(step.model, function(x)pedometrics::stepVIF(model = x, threshold = 10, verbose = TRUE))
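One way around the "variable lengths differ" error is to do the subsetting per response before fitting. A minimal sketch, assuming (as I read the question) that the responses are the second and third columns and the predictors the fourth to sixth columns of data_dummy: keep only the rows that are complete for the current response and the predictors, so each model drops exactly the NAs that affect it, then hand the resulting lm objects to the existing stepAIC/stepVIF pipeline.
predictors <- names(data_dummy)[4:6]

fit_one <- function(g) {
  # keep rows with no NA in this response column or in any predictor column;
  # NAs elsewhere in the data frame are ignored
  keep <- complete.cases(data_dummy[, c(g, 4:6)])
  lm(reformulate(predictors, response = names(data_dummy)[g]),
     data = data_dummy[keep, ])
}

full.model <- lapply(2:3, fit_one)  # one lm per response column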
I have a factor variable with 14 levels, which I'm trying to collapse into only 3 levels. It also contains two NA values, which I want to remove.
My code looks like this:
job <- fct_collapse(E$occupation, other = c("7", "9", "10", "13", "14"), 1 = c("1", "2", "3", "12"), 2 = c("4", "5", "6", "8", "11"))
However, it just gives me tons of errors. Can anyone help me here?
We could also do this with a named list:
library(forcats)
lst1 <- setNames(list(as.character(c(7, 9, 10, 13, 14)),
                      as.character(c(1, 2, 3, 12)),
                      as.character(c(4, 5, 6, 8, 11))),
                 c("other", "1", "2"))
fct_collapse(df$occupation, !!!lst1)
data
df <- structure(list(occupation = c("1", "3", "5", "7", "9", "10",
"12", "14", "13", "4", "7", "6", "5")), class = "data.frame", row.names = c(NA,
-13L))
For numbers, try using backquotes in fct_collapse.
job <- forcats::fct_collapse(df$occupation,
other = c("7","9", "10", "13", "14"),
`1` = c("1", "2", "3", "12"),
`2` = c("4", "5", "6", "8", "11"))
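Neither snippet handles the NA removal the question also asks about. Assuming the missing entries are genuine NA values rather than the literal string "N/A", a minimal follow-up is:
job <- droplevels(job[!is.na(job)])  # drop NA entries, then drop any now-empty levels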
I am working on large data sets, for which I have written code that performs a row-by-row operation on a data frame. The process is sequential and slow.
I am trying to perform the operation using parallel processing to make it fast.
Here is the code:
library(geometry)
# Data set - a
data_a = structure(c(10.4515034409741, 15.6780890052356, 12.5581992918563,
9.19067944250871, 14.4459166666667, 11.414, 17.65325, 12.468,
11.273, 15.5945), .Dim = c(5L, 2L), .Dimnames = list(c("1", "2",
"3", "4", "5"), c("a", "b")))
# Data set - b
data_b = structure(c(10.4515034409741, 15.6780890052356, 12.5581992918563,
9.19067944250871, 14.4459166666667, 11.3318076923077, 13.132273830156,
6.16003995082975, 11.59114820435, 10.9573192090395, 11.414, 17.65325,
12.468, 11.273, 15.5945, 11.5245, 12.0249, 6.3186, 13.744, 11.0921), .Dim = c(10L,
2L), .Dimnames = list(c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"), c("a",
"b")))
conv_hull_1 <- convhulln(data_a, options = "FA")  # compute the convex hull of data_a
test <- c()
for (i in 1:nrow(data_b)) {
  df <- c()
  con_hull_all <- inhulln(conv_hull_1, matrix(data_b[i, ], ncol = 2))
  df$flag <- ifelse(con_hull_all[1] == TRUE, 0, ifelse(con_hull_all[1] == FALSE, 1, 2))
  test <- as.data.frame(rbind(test, df))
  print(i)
}
test
Is there any way to parallelize this row-wise computation?
As you can observe, for small datasets the computational time is really low, but as soon as I increase the data size, the computation time increases drastically.
Can you provide a solution with code?
Thanks in advance.
You could take advantage of the second argument to the inhulln function: it allows more than one row of points to be passed in and tested at once.
I've tried the code below on a 320,000-row matrix that I made from the original data, and it's quick.
library(geometry)
library(dplyr)
# Data set - a
data_a = structure(c(10.4515034409741, 15.6780890052356, 12.5581992918563,
                     9.19067944250871, 14.4459166666667, 11.414, 17.65325,
                     12.468, 11.273, 15.5945),
                   .Dim = c(5L, 2L),
                   .Dimnames = list(c("1", "2", "3", "4", "5"), c("a", "b")))
# Data set - b
data_b = structure(c(10.4515034409741, 15.6780890052356, 12.5581992918563,
                     9.19067944250871, 14.4459166666667, 11.3318076923077,
                     13.132273830156, 6.16003995082975, 11.59114820435,
                     10.9573192090395, 11.414, 17.65325, 12.468, 11.273,
                     15.5945, 11.5245, 12.0249, 6.3186, 13.744, 11.0921),
                   .Dim = c(10L, 2L),
                   .Dimnames = list(c("1", "2", "3", "4", "5",
                                      "6", "7", "8", "9", "10"),
                                    c("a", "b")))
conv_hull_1 <- convhulln(data_a, options = "FA")  # compute the convex hull of data_a
# Make a big data_b: doubling 10 rows 15 times gives 327,680 rows
for (i in 1:15) {
  data_b = rbind(data_b, data_b)
}
In_Or_Out <- inhulln(conv_hull_1, data_b)
result <- data.frame(data_b) %>% bind_cols(InOrOut=In_Or_Out)
I use dplyr::bind_cols to bind the in or out result to a data frame version of the original data so you might need some changes for your specific environment.
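If you still want genuine parallelism (say the per-row work were much heavier than a point-in-hull test), here is a minimal sketch using the base parallel package. It assumes the conv_hull_1 and enlarged data_b from above, chunks the rows so each worker receives one matrix rather than one row, and uses mclapply, which relies on forking and therefore runs sequentially on Windows.
library(parallel)

n_cores <- max(1L, detectCores() - 1L)
row_ids <- seq_len(nrow(data_b))
chunks <- split(row_ids, cut(row_ids, n_cores, labels = FALSE))  # one index block per worker

in_or_out <- unlist(mclapply(chunks, function(idx) {
  inhulln(conv_hull_1, data_b[idx, , drop = FALSE])  # test a whole chunk at once
}, mc.cores = n_cores))

rownames(data_b) <- NULL  # drop the duplicated row names created by the rbind doubling
result <- data.frame(data_b, InOrOut = in_or_out)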
I'm trying to use msSurv for a multi-state modelling problem that looks at an individual's transitions through different stages. Part of that is creating a tree object, which is where I think I'm making a mistake, but I can't understand what it is. I'll include a minimal working example here.
Nodes <- c("1", "2", "3", "4", "5", "6")
Edges <- list("1" = list(edges = c("2", "3", "4", "5", "6")),
"2" = list(edges = c("1", "3", "4", "5", "6")),
"3" = list(edges = c("1", "2", "4", "5", "6")),
"4" = list(edges = c("1", "2", "3", "5", "6")),
"5" = list(edges = c("3", "4", "6")),
"6" = list(edges = NULL))
treeobj <- new("graphNEL", nodes = Nodes, edgeL = Edges, edgemode = "directed")
fit3 <- msSurv(df, treeobj, bs = TRUE, LT = TRUE)
The error I'm getting is as follows.
No states eligible for exit distribution calculation.
Entry distributions calculated for states 6 .
Error in bs.IA[, , j, b] : subscript out of bounds
The dataset in question can be found here.
Any help is sincerely appreciated.
I may be misunderstanding, but your node "6" doesn't have any edges, so the program returns an error because, in essence, you're saying 6 isn't connected to the calculation. As for the solution, I believe "6" should have edges, i.e. this line may need actual edges instead of NULL: "6" = list(edges = NULL))
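A hedged illustration of that suggestion (which states 6 can actually reach depends on the study design, so the edge to "5" below is purely a placeholder):
library(graph)  # provides the graphNEL class

Nodes <- c("1", "2", "3", "4", "5", "6")
Edges <- list("1" = list(edges = c("2", "3", "4", "5", "6")),
              "2" = list(edges = c("1", "3", "4", "5", "6")),
              "3" = list(edges = c("1", "2", "4", "5", "6")),
              "4" = list(edges = c("1", "2", "3", "5", "6")),
              "5" = list(edges = c("3", "4", "6")),
              "6" = list(edges = c("5")))  # placeholder edge so state 6 has an exit
treeobj <- new("graphNEL", nodes = Nodes, edgeL = Edges, edgemode = "directed")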
I have this dataframe called mydf with three principal components (PCA.1, PCA.2, PCA.3). I want to compute the 3D distance matrix and get the shortest Euclidean distance between all of the compared samples. In another dataframe called myref, I have the known identity of some samples, while other samples are unknown. Using the shortest Euclidean distance from mydf, I want to assign the known identities to the unknown samples. Can someone please help me get this done?
mydf
mydf <- structure(list(Sample = c("1", "2", "4", "5", "6", "7", "8",
"9", "10", "12"), PCA.1 = c(0.00338, -0.020373, -0.019842, -0.019161,
-0.019594, -0.019728, -0.020356, 0.043339, -0.017559, -0.020657
), PCA.2 = c(0.00047, -0.010116, -0.011532, -0.011582, -0.013245,
-0.011751, -0.010299, -0.005801, -0.01, -0.011334), PCA.3 = c(-0.008787,
0.001412, 0.003751, 0.00371, 0.004242, 0.003738, 0.000592, -0.037229,
0.004307, 0.00339)), .Names = c("Sample", "PCA.1", "PCA.2", "PCA.3"
), row.names = c(NA, 10L), class = "data.frame")
myref
myref<- structure(list(Sample = c("1", "2", "4", "5", "6", "7", "8",
"9", "10", "12"), Identity = c("apple", "unknown", "ball", "unknown",
"unknown", "car", "unknown", "cat", "unknown", "dog")), .Names = c("Sample",
"Identity"), row.names = c(NA, 10L), class = "data.frame")
uIX <- which(myref$Identity == "unknown")
dMat <- as.matrix(dist(mydf[, -1]))  # Euclidean distance matrix on the PCA columns
nn <- apply(dMat, 1, order)[2, ]     # for each row, the index of its nearest neighbour
                                     # (the 2nd smallest value, because the 1st is the point itself)
myref$Identity[uIX] <- myref$Identity[nn[uIX]]
Note that the above code can still assign "unknown" to some samples, namely when an unknown sample's nearest neighbour is itself unknown. If instead you want to match each unknown to its nearest neighbour with a known identity, add the following before computing nn:
dMat[uIX, uIX] = Inf
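Putting that together, a minimal sketch of the known-neighbours-only variant (note that the Inf assignment also makes the unknown rows' self-distances Inf, so which.min directly returns the nearest known sample):
uIX <- which(myref$Identity == "unknown")
dMat <- as.matrix(dist(mydf[, -1]))
dMat[uIX, uIX] <- Inf  # unknowns can no longer match each other (or themselves)
nn_known <- apply(dMat[uIX, , drop = FALSE], 1, which.min)  # nearest known sample per unknown row
myref$Identity[uIX] <- myref$Identity[nn_known]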