So, let's say I have a 1000-row, 6-column dataframe with columns a1, a2, b1, b2, c1, c2. I want to run some t-tests using the a's, b's, and c's and get an output df with 3 columns for the t-values of a, b, and c, plus another three for the significance information for those values, for a total of 6 columns. The problem I have is with the rows: I want to loop over chunks of 20, so the output is a (1000/20 =) 50-row, 6-column df.
I have already tried creating an index column for my initial df which repeats a 1 for the first 20 rows, a 2 for the next 20 rows, and so on.
convert_n <- function(df) {
  df$n_for_t_tests <- rep(seq_len(nrow(df) / 20), each = 20)
  df
}
df <- convert_n(df)
However, I can't seem to find a way to properly utilize the items in this column as indices for a "for" or any kind of loop.
Below you can see the relevant code that creates a 1-row, 6-column df; I need to modify the [0:20] parts and create a loop that does this for all 50 groups of 20 rows and binds the results.
t_test_a <- t.test(df$a1[0:20], df$a2[0:20], paired = T, conf.level = 0.95)
t_test_b <- t.test(df$b1[0:20], df$b2[0:20], paired = T, conf.level = 0.95)
t_test_c <- t.test(df$c1[0:20], df$c2[0:20], paired = T, conf.level = 0.95)
t_tests_df <- data.frame(t_a = t_test_a$statistic[["t"]],
t_b = t_test_b$statistic[["t"]],
t_c = t_test_c$statistic[["t"]])
t_tests_df <- t_tests_df %>%
  dplyr::mutate(dif_significance_a = ifelse(t_a > 2, "YES", "NO"),
                dif_significance_b = ifelse(t_b > 2, "YES", "NO"),
                dif_significance_c = ifelse(t_c > 2, "YES", "NO")) %>%
  dplyr::select(t_a, dif_significance_a,
                t_b, dif_significance_b,
                t_c, dif_significance_c)
Thank you in advance for your help.
You can use split() and sapply():
set.seed(42)
df <- data.frame(a1 = sample(1000, 1000), a2 = sample(1000, 1000),
b1 = sample(1000, 1000), b2 = sample(1000, 1000),
c1 = sample(1000, 1000), c2 = sample(1000, 1000))
group <- gl(50, 20)
D <- split(df, group)
myt <- function(Di)
with(Di, c(at=t.test(a1, a2)$statistic, ap=t.test(a1, a2)$p.value,
bt=t.test(b1, b2)$statistic, bp=t.test(b1, b2)$p.value,
ct=t.test(c1, c2)$statistic, cp=t.test(c1, c2)$p.value))
sapply(D, FUN=myt) ### or
t(sapply(D, FUN=myt))
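If the paired tests from the question are wanted instead of the default two-sample tests above, a small variation of myt (a sketch, reusing the same D) could be:
myt_paired <- function(Di)
  with(Di, c(t_a = unname(t.test(a1, a2, paired = TRUE)$statistic),
             t_b = unname(t.test(b1, b2, paired = TRUE)$statistic),
             t_c = unname(t.test(c1, c2, paired = TRUE)$statistic)))
res <- t(sapply(D, FUN = myt_paired))  # 50 x 3 matrix of paired t statistics
ifelse(res > 2, "YES", "NO")           # the asker's significance rule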
This is not the prettiest, but I did a for loop like this:
df <- data.frame(a1 = sample(1000, 1000),
a2 = sample(1000, 1000),
b1 = sample(1000, 1000),
b2 = sample(1000, 1000),
c1 = sample(1000, 1000),
c2 = sample(1000, 1000))
df_ttest <- data.frame(p_a = c(1:50),
t_a = c(1:50),
p_b = c(1:50),
t_b = c(1:50),
p_c = c(1:50),
t_c = c(1:50))
index <- 0:50 * 20
for (i in 1:50) {
  rows <- (index[i] + 1):index[i + 1]  # chunks 1:20, 21:40, ..., 981:1000
  df_ttest$p_a[i] = t.test(df$a1[rows])$p.value
  df_ttest$p_b[i] = t.test(df$b1[rows])$p.value
  df_ttest$p_c[i] = t.test(df$c1[rows])$p.value
  df_ttest$t_a[i] = t.test(df$a1[rows])$statistic
  df_ttest$t_b[i] = t.test(df$b1[rows])$statistic
  df_ttest$t_c[i] = t.test(df$c1[rows])$statistic
}
This gives a 50 x 6 dataframe with separate columns of p and t values for every 20-row chunk of a, b and c.
You could even go further and make a nested for loop to cycle through each row in df_ttest to make this a bit prettier; a paired variant of the loop is sketched below.
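Note that the one-sample tests above compare each column's mean against zero; if the paired a1-vs-a2 comparison from the question is wanted, the loop body would change along these lines (a sketch, not part of the original answer):
for (i in 1:50) {
  rows <- ((i - 1) * 20 + 1):(i * 20)
  tt_a <- t.test(df$a1[rows], df$a2[rows], paired = TRUE)
  tt_b <- t.test(df$b1[rows], df$b2[rows], paired = TRUE)
  tt_c <- t.test(df$c1[rows], df$c2[rows], paired = TRUE)
  df_ttest$p_a[i] <- tt_a$p.value; df_ttest$t_a[i] <- tt_a$statistic
  df_ttest$p_b[i] <- tt_b$p.value; df_ttest$t_b[i] <- tt_b$statistic
  df_ttest$p_c[i] <- tt_c$p.value; df_ttest$t_c[i] <- tt_c$statistic
}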
Related
I have a very large matrix, 200 x 1500, where the rows are samples and the columns are data. I want to do pairwise comparisons of all 1500 columns (~1.1M tests), so combn would take too long. I'm trying to run a linear model of the first column against each of the other 1499 columns, process/write the p-value to an output file (e.g. data1.tsv), then repeat for the second column (excluding the comparison with the first column) and save to data2.tsv, and continue until all comparisons have been made. Below is the code for what I am trying to achieve.
library(data.table)
df = as.data.frame(matrix(runif(20, min=0, max=100), nrow=4))
colnames(df) = c("data_1", "data_2", "data_3", "data_4", "data_5")
rownames(df) = c("sample_1", "sample_2", "sample_3", "sample_4")
pval_1 = as.numeric(summary(lm(data_1 ~ data_2, data=df))$coefficients[,"Pr(>|t|)"][2])
pval_2 = as.numeric(summary(lm(data_1 ~ data_3, data=df))$coefficients[,"Pr(>|t|)"][2])
pval_3 = as.numeric(summary(lm(data_1 ~ data_4, data=df))$coefficients[,"Pr(>|t|)"][2])
pval_4 = as.numeric(summary(lm(data_1 ~ data_5, data=df))$coefficients[,"Pr(>|t|)"][2])
data_1 = data.frame(id1 = c("data_1","data_1","data_1","data_1"),
id2 = c("data_2","data_3","data_4","data_5"),
pval = c(pval_1, pval_2,pval_3,pval_4))
fwrite(data_1, to_path)
pval_5 = as.numeric(summary(lm(data_2 ~ data_3, data=df))$coefficients[,"Pr(>|t|)"][2])
pval_6 = as.numeric(summary(lm(data_2 ~ data_4, data=df))$coefficients[,"Pr(>|t|)"][2])
pval_7 = as.numeric(summary(lm(data_2 ~ data_5, data=df))$coefficients[,"Pr(>|t|)"][2])
data_2 = data.frame(id1 = c("data_2","data_2","data_2"),
id2 = c("data_3","data_4","data_5"),
pval = c(pval_5, pval_6,pval_7))
fwrite(data_2, to_path)
pval_8 = as.numeric(summary(lm(data_3 ~ data_4, data=df))$coefficients[,"Pr(>|t|)"][2])
pval_9 = as.numeric(summary(lm(data_3 ~ data_5, data=df))$coefficients[,"Pr(>|t|)"][2])
data_3 = data.frame(id1 = c("data_3","data_3"),
id2 = c("data_4","data_5"),
pval = c(pval_8,pval_9))
fwrite(data_3, to_path )
I don't understand why it is important to split the files instead of creating a single dataframe containing all the pairwise p-values between variables.
That being said, take a look.
I created a random dataframe with the dimensions you want to work with (by the way, this is not that large...):
library(Hmisc)
df = as.data.frame(matrix(runif(200 * 1500, min=0, max=10), nrow=200))
Then, using the rcorr() function from the Hmisc package, I ran the correlation matrix, which also provides the p-values between variables.
mycor <- rcorr(as.matrix(df), type="pearson")
mycor_p = mycor$P
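This works as a drop-in for the lm() approach because, for a simple linear regression, the slope's p-value equals the Pearson correlation's p-value; a quick sanity check on the first two columns (a sketch, not part of the original answer):
p_lm  <- summary(lm(df[[1]] ~ df[[2]]))$coefficients[2, "Pr(>|t|)"]
p_cor <- cor.test(df[[1]], df[[2]])$p.value
all.equal(p_lm, p_cor)  # should be TRUE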
Having done that, I extract the upper triangle of the mycor_p matrix and create a 3-column data frame.
index <- which(upper.tri(mycor_p, diag = TRUE), arr.ind = TRUE)
dim_n <- dimnames(mycor_p)
res = data.frame(row = dim_n[[1]][index[, 1]],
col = dim_n[[2]][index[, 2]],
val = mycor_p[index])
Finally, remove the NA cases:
final_df = res[complete.cases(res), ]
And export this dataframe:
write.csv(final_df, file = your_path, row.names=FALSE)
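If the separate per-column files really are required, the question's own lm()/fwrite() pattern generalizes to a plain loop (a sketch; the data<i>.tsv file names are an assumption, and this keeps all ~1.1M fits, so it will be slow):
library(data.table)
for (i in 1:(ncol(df) - 1)) {
  others <- (i + 1):ncol(df)
  # slope p-value of column i regressed against each later column
  pvals <- sapply(others, function(j)
    summary(lm(df[[i]] ~ df[[j]]))$coefficients[2, "Pr(>|t|)"])
  out <- data.frame(id1 = colnames(df)[i],
                    id2 = colnames(df)[others],
                    pval = pvals)
  fwrite(out, paste0("data", i, ".tsv"))
}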
I'd like to make two sequential operations:
1) Adjust two nls models, one per subset; and
2) Loop the models until the number of iterations = 1.
For the first step I do:
#Packages
library(minpack.lm)
# Data set - Diameter as a function of Feature and Age
Feature<-sort(rep(c("A","B"),22))
Age<-c(60,72,88,96,27,
36,48,60,72,88,96,27,36,48,60,72,
88,96,27,36,48,60,27,27,36,48,60,
72,88,96,27,36,48,60,72,88,96,27,
36,48,60,72,88,96)
Diameter<-c(13.9,16.2,
19.1,19.3,4.7,6.7,9.6,11.2,13.1,15.3,
15.4,5.4,7,9.9,11.7,13.4,16.1,16.2,
5.9,8.3,12.3,14.5,2.3,5.2,6.2,8.6,9.3,
11.3,15.1,15.5,5,7,7.9,8.4,10.5,14,14,
4.1,4.9,6,6.7,7.7,8,8.2)
d<-dados <- data.frame(Feature,Age,Diameter)
str(d)
#Create a nls model (Levenberg-Marquardt algorithm) for each Feature (A and B)
e1<- Diameter ~ a1 * Age^a2
Fecture_vec<-unique(d$Feature)
mod_ND <- list() #List to save each model
for(i in 1:length(Fecture_vec)){
d2 <- subset(d, d$Feature == Fecture_vec[i])
mod_ND[[i]] <- nlsLM(e1, data = d2,
start = list(a1 = 0.1, a2 = 10),
control = nls.control(maxiter = 1000))
print(summary(mod_ND[[i]]))
}
#
Here, so far so good, but it fails if I try to make a loop with 999 simulations, recycling the start values with coef(mod_ND[[i]])[1] and coef(mod_ND[[i]])[2], and stopping when the number of iterations is 1:
e1<- Diameter ~ a1 * Age^a2
Fecture_vec<-unique(d$Feature)
mod_ND <- list() #List to save each model
for(i in 1:length(Fecture_vec)){
d2 <- subset(d, d$Feature == Fecture_vec[i])
mod_ND[[i]] <- nlsLM(e1, data = d2,
start = list(a1 = 0.1, a2 = 10),
control = nls.control(maxiter = 1000))
Xs<-data.frame()
for(z in 1:999){
d2 <- subset(d, d$Feature == Fecture_vec[i])
mod_ND[[z]] <- nlsLM(e1, data = d2,
start = list(a1 = coef(mod_ND[[i]])[1], a2 = coef(mod_ND[[i]])[2]),
control = nls.control(maxiter = 1000))
if (mod_ND[[z]]$convInfo[["finIter"]] <= 1){ break } ## Stop when iterations = 1
print(summary(mod_ND[[z]]))
}
}
#
It doesn't work!! Any ideas, please?
#Packages
library(minpack.lm)
library(plyr) # dlply() comes from plyr
m <- function(d, a = 0.01, b = 10){
  mod <- nlsLM(Diameter ~ a1 * Age^a2, start = list(a1 = a, a2 = b), control = nls.control(maxiter = 1000), data = d)
  par1 <- summary(mod)$coefficients[[1]]
  par2 <- summary(mod)$coefficients[[2]]
  print(summary(mod))
  if(mod$convInfo[["finIter"]] > 1){
    m(d, par1, par2)
  }else{
    mod # return the converged fit so dlply() collects the models
  }
}
list_models <- dlply(d,.(Feature),m)
list_models
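Since m() returns the converged fit, the final coefficients per Feature can be read straight from the list (a small usage note):
lapply(list_models, coef)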
Using the dabestr package I'm trying to get the differences between two sets of control & test data. Modifying slightly the example from the help file, I tried:
library(dabestr)
N <- 70
c1 <- rnorm(N, mean = 50, sd = 20)
t1 <- rnorm(N, mean = 200, sd = 20)
ID <- seq(1:N)
long.data <- tibble::tibble(ID = ID, Control1 = c1, Test1 = t1)
meandiff1 <- long.data %>%
tidyr::gather(key = Group, value = Measurement, Control1:Test1)
ID <- seq(1:N) + N
c2 <- rnorm(N, mean = 100, sd = 70)
t2 <- rnorm(N, mean = 100, sd = 70)
long.data <- tibble::tibble(ID = ID, Control2 = c2, Test2 = t2)
meandiff2 <- long.data %>%
tidyr::gather(key = Group, value = Measurement, Control2:Test2)
meandiff <- dplyr::bind_rows(meandiff1, meandiff2)
paired_mean_diff <-
dabest(meandiff, x = Group, y = Measurement,
idx = c("Control1", "Test1", "Control2", "Test2"),
paired = TRUE,
id.col = ID)
plot(paired_mean_diff)
I get these results:
So not only is everything compared to Control1, but the paired = TRUE option also seems to have no effect. I was hoping to get something similar to the examples from the package page:
Any pointers on how to achieve that?
For a paired plot, you want to nest the idx keyword option, like so:
paired_mean_diff <-
dabest(meandiff, x = Group, y = Measurement,
idx = list(c("Control1", "Test1"),
c("Control2", "Test2")),
paired = TRUE,
id.col = ID)
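Plotting then works as in the question; with the nested idx, each test should be paired against its own control (Test1 vs Control1, Test2 vs Control2), each with its own effect size:
plot(paired_mean_diff)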
In some classification tasks, using the mlr package, I need to deal with a data.frame similar to this one:
set.seed(pi)
# Dummy data frame
df <- data.frame(
# Repeated values ID
ID = sort(sample(c(0:20), 100, replace = TRUE)),
# Some variables
X1 = runif(10, 1, 10),
# Some Label
Label = sample(c(0,1), 100, replace = TRUE)
)
df
I need to cross-validate the model keeping values with the same ID together. I know from the tutorial that:
https://mlr-org.github.io/mlr-tutorial/release/html/task/index.html#further-settings
We could include a blocking factor in the task. This would indicate that some observations "belong together" and should not be separated when splitting the data into training and test sets for resampling.
The question is how can I include this blocking factor in the makeClassifTask?
Unfortunately, I couldn't find any example.
What version of mlr do you have? Blocking has been part of it for a while. You can find it directly as an argument in makeClassifTask.
Here is an example for your data:
library(mlr)

df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = cv10)
# to verify that blocking worked
lapply(1:10, function(i) {
blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
intersect(blocks.testing, blocks.training)
})
#all entries are empty, blocking indeed works!
The answer by @jakob-r no longer works. My guess is something changed with cv10.
Minor edit to use "blocking.cv = TRUE"
Complete working example:
library(mlr)

set.seed(pi)
# Dummy data frame
df <- data.frame(
# Repeated values ID
ID = sort(sample(c(0:20), 100, replace = TRUE)),
# Some variables
X1 = runif(10, 1, 10),
# Some Label
Label = sample(c(0,1), 100, replace = TRUE)
)
df
df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
resDesc <- makeResampleDesc("CV",iters=10,blocking.cv = TRUE)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = resDesc)
# to verify that blocking worked
lapply(1:10, function(i) {
blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
intersect(blocks.testing, blocks.training)
})
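The aggregated performance over the 10 blocked folds can then be read off the resample result (mmce, the mean misclassification error, is mlr's default classification measure):
res$aggr  # e.g. mmce.test.mean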
I am new to R and trying to loop through each row of df1 and search for rows in df2 that are close in distance (5 mi / 8046.72 m). I think df1 is looping as intended, but I don't think it is going through all of df2.
{for (i in 1:1452){
  p1 <- df1[i, 4:5]
  p2 <- df2[1:11, 2:3]
  d <- distCosine(p1, p2, r = 6378137)
  return(d < 8046.72)
  i <- i + 1}
}
I get the output:
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
I would just use an apply function. First, let's make your problem reproducible by creating some "fake" data - I am making the lon/lat pairs artificially close so that we can get a few TRUEs back in the results:
library(geosphere)
df1 <- data.frame(X1 = sample(letters, 100, replace = T),
x2 = sample(letters, 100, replace = T),
x3 = sample(letters, 100, replace = T),
lon = sample(10:12 + rnorm(100, 0, 0.1), 100, replace = T),
lat = sample(10:12 + rnorm(100, 0, 0.1), replace = T))
df2 <- data.frame(x1 = sample(letters, 100, replace = T),
lon = sample(10:12 + rnorm(100, 0, 0.1), 100, replace = T),
lat = sample(10:12 + rnorm(100, 0, 0.1), 100, replace = T))
We can then create two matrices containing the values of interest:
m1 <- as.matrix(df1[, c("lon", "lat")])
m2 <- as.matrix(df2[1:11, c("lon", "lat")])
Now we can use the apply function across the rows of m2, which returns a 100 x 11 matrix:
results <- apply(m2, 1, FUN = function(x) distCosine(x, m1))
To get the results under 5 mi (~8046.72 m), we simply subset:
results[results < 8046.72]
# Showing the next two for alternative output
which(results < 8046.72)
which(results < 8046.72, arr.ind = T)
Note: In your question, it looks like you are interested in the first 1,452 rows -- this would mean the results would be a 1,452 x 11 matrix.
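To tie the matches back to row indices in both data frames, the arr.ind form can be turned into a small lookup table (a sketch, building on the results matrix above):
hits <- which(results < 8046.72, arr.ind = TRUE)
matches <- data.frame(df1_row = hits[, "row"],   # row index into df1
                      df2_row = hits[, "col"],   # row index into df2[1:11, ]
                      dist_m  = results[hits])   # the matching distances
head(matches)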