I need to randomly sample a dataset that is arranged in long format. In my dataset, each subject has 4 observations, so if I randomly sample rows I randomly lose one or more observations per subject.
This is simulated data for illustration purposes; my real data is much bigger.
sub sex group dv1 dv2
P1 m A 0.66 0.94
P1 m B 0.98 0.26
P1 m C 0.02 0.03
P1 m D 0.60 0.30
P2 m A 0.92 0.99
P2 m B 0.82 0.09
P2 m C 0.44 0.67
P2 m D 0.53 0.80
P3 f A 0.29 0.22
P3 f B 0.46 0.20
P3 f C 0.37 0.77
P3 f D 0.76 0.54
P4 m A 0.28 0.99
P4 m B 0.16 0.57
P4 m C 0.46 0.75
P4 m D 0.28 0.21
In this example, I need to randomly select 2 males. I tried using the dplyr package (see below), but if I ask for a sample of 2, it just gives me 2 rows for sex="m" and 2 for sex="f", so 4 randomly chosen rows in total. What I need is 8 rows, where 4 come from one male and 4 from another. Changing the grouping variable to sub doesn't work, as it complains that there are only 2 levels in the group (actually, it would work in this toy example as there are 4 levels for each sub, but note that I am choosing something like 50 samples from a bigger dataset). Also, it would just give me 2 random rows for each sub, which is not what I need.
library(dplyr)
subset <- data %>%
group_by(sex) %>%
sample_n(2)
Please do not suggest reshaping the data to wide format and sampling it there, as I know that I can do that. I am sure there must be a way to sample in long format.
I would sample from the patient names and then filter by those sampled names:
Look at all males
male_subset <- data %>% filter(sex == "m")
Look for unique male ID
male_IDs <- unique(male_subset$sub)
Sample from the unique IDs
sampled_IDs <- sample(male_IDs, 2)
Now you subset your data based on these sampled IDs:
data %>% filter(sub %in% sampled_IDs)
This should return all four rows for each of the 2 sampled individuals.
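The steps above can be wrapped into a small base R helper. This is only a sketch: the name sample_subjects and its arguments are made up for illustration, and the toy data frame just mirrors the layout in the question.

```r
# Toy data mirroring the question: 4 subjects, 4 rows each.
data <- data.frame(
  sub   = rep(paste0("P", 1:4), each = 4),
  sex   = rep(c("m", "m", "f", "m"), each = 4),
  group = rep(c("A", "B", "C", "D"), times = 4)
)

# Hypothetical helper: keep every row of n_subjects randomly chosen
# subjects of the requested sex.
sample_subjects <- function(d, which_sex, n_subjects) {
  ids <- unique(d$sub[d$sex == which_sex])
  # Index into ids instead of calling sample(ids, ...) directly, so that a
  # single numeric ID is never interpreted as the range 1:ids.
  picked <- ids[sample.int(length(ids), n_subjects)]
  d[d$sub %in% picked, ]
}

set.seed(42)
two_males <- sample_subjects(data, "m", 2)
nrow(two_males)  # 8: four rows for each of the two sampled males
```

Because the sampling happens on the unique IDs, the number of rows per subject never matters, so the same call should work when subjects have unequal numbers of observations.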
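I'm not sure if I've quite understood what you want. Would this do it?
data %>% filter(sex == 'm') %>% filter(sub %in% sample(unique(sub), 2))
Sampling from unique(sub) after the first filter means only male IDs can be drawn. (Sampling from a hard-coded vector such as paste0('P', 1:4) could pick P3, who is female, and then return only 4 rows.)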
In base R,
set.seed(1)
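# Sample from the unique male IDs; sampling the sub column directly could draw the same subject twice
subset <- sample(unique(data[data$sex == "m", ]$sub), 2)
data_subset <- data[data$sub %in% subset, ]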
nrow(data_subset)
# [1] 8
Works, but not flashy.
I have a df which is 67200 obs long, with 5 vars. I would like to create a list of subsequences from one var, each of equal length (600 obs). I would like the sequence to be iterative so that I can identify rolling sequences i.e. seq1 = 0:600, seq2 = 1:601, seq3 = 2:602, and so on. I will then sum the data from each subsequence to identify the sequence with the highest total.
I understand how to make a basic sequence using seq, however after reading around SO and other sites, I can only find info on how to identify specific sequences. Any help with ideas on ways to create said subsequences would be great.
Sample Data:
Var1 Var2 Var3 Var4 Var5
0.00 0.31 0.32 0.00 0.01
0.10 0.46 0.46 0.13 0.01
0.20 0.46 0.47 0.14 0.02
0.30 0.40 0.21 0.14 0.02
0.40 0.38 0.11 0.20 0.03
0.50 0.38 0.07 0.25 0.04
Expected Output:
A list of x subsequences
To answer your question, I think you can achieve your expected output with lapply and seq:
x <- 600
# R indexes from 1, so the windows are 1:600, 2:601, ... (each of length x)
n <- seq_len(nrow(df) - x + 1)
lapply(n, function(i) seq(i, i + x - 1))
However, from the description it seems you are trying to perform a rolling calculation, and the above is not the best approach for that. Look into the zoo package: it has functions such as rollsum and rollmean, plus a general rollapply, which are better suited to this.
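If installing zoo is not an option, the rolling sum itself can be sketched in base R with cumulative sums. The helper name rolling_sum and the short vector v are assumptions for illustration; on the real data you would pass one column of df and a width of 600.

```r
# Rolling sums of every window of length `width` in the numeric vector v.
# The sum of v[i:(i + width - 1)] is the difference of two cumulative sums.
rolling_sum <- function(v, width) {
  cs <- c(0, cumsum(v))
  cs[(width + 1):length(cs)] - cs[seq_len(length(v) - width + 1)]
}

v <- c(1, 5, 2, 8, 3, 1)       # stand-in for one column of df
sums <- rolling_sum(v, 3)      # 8 15 13 12
best_start <- which.max(sums)  # 2, i.e. the window v[2:4]
```

This avoids building the list of index vectors altogether, which matters when there are tens of thousands of windows.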
I want to apply two different formulas to four columns of my dataframe df. I have done this manually, but since my original data frame has several columns, I want to be able to use loops or case_when to do this faster.
Here's what the sample dataframe df looks like:
A B C D
20 100 4 1200
40 150 6 2300
34 200 3 1230
32 225 9 1100
12 220 10 1000
Formula 1:
(x-min(x))/(max(x)-min(x))
Formula 2:
(min(x)-x)/(max(x)-min(x))
I'd like to apply formula 1 on columns B and D and formula 2 on columns A and C.
After applying the formula, I want to store the values in a different dataframe but with the same column names.
Here's what I did:
formula_1 <-function(x) {
(((x - min(x)))/(max(x) - min(x)))
}
formula_2 <-function(x){(min(x)-x)/(max(x)-min(x))
}
# Create an empty data frame BI_score with one row per row of df
BI_score <- data.frame(row.names = seq_len(nrow(df)))
BI_score$B <- formula_1(df$B)
BI_score$D <- formula_1 (df$D)
BI_score$A <- formula_2 (df$A)
BI_score$C <- formula_2 (df$C)
EDIT
As there are some NA and Inf values, and if we want to exclude them from the calculation, we can update the functions as below and then apply them to the columns in the same way.
formula_1 <-function(x) {
temp <- x[is.finite(x)]
replace(x, is.finite(x), (((temp - min(temp)))/(max(temp) - min(temp))))
}
formula_2 <-function(x) {
temp <- x[is.finite(x)]
replace(x, is.finite(x), (min(temp)-temp)/(max(temp)-min(temp)))
}
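As a quick check of what the NA-aware version does, here is a small sketch on a made-up vector (x is purely illustrative). The finite entries are rescaled using only the finite values, while NA and Inf stay where they were.

```r
formula_1 <- function(x) {
  temp <- x[is.finite(x)]  # drop NA, NaN, Inf and -Inf before taking min/max
  replace(x, is.finite(x), (temp - min(temp)) / (max(temp) - min(temp)))
}

x <- c(1, NA, 3, Inf, 5)
scaled <- formula_1(x)
scaled  # 0.0  NA 0.5 Inf 1.0
```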
The most straightforward approach would be to use lapply to apply each function separately to the selected columns.
BI_score <- df
fm1_cols <- c("B", "D")
fm2_cols <- c("A", "C")
BI_score[fm1_cols] <- lapply(df[fm1_cols], formula_1)
BI_score[fm2_cols] <- lapply(df[fm2_cols], formula_2)
BI_score
# A B C D
#1 -0.29 0.00 -0.14 0.154
#2 -1.00 0.40 -0.43 1.000
#3 -0.79 0.80 0.00 0.177
#4 -0.71 1.00 -0.86 0.077
#5 0.00 0.96 -1.00 0.000
As mentioned by @Sotos, if you want to apply the functions to alternating columns, you could do
BI_score[c(TRUE, FALSE)] <- lapply(df[c(TRUE, FALSE)], formula_2)
BI_score[c(FALSE, TRUE)] <- lapply(df[c(FALSE, TRUE)], formula_1)
Just for fun, an approach using dplyr:
library(dplyr)
bind_cols(df %>% select(all_of(fm1_cols)) %>% mutate_all(formula_1),
df %>% select(all_of(fm2_cols)) %>% mutate_all(formula_2))
If your goal is to apply the two functions on alternating columns, then you can do it via logical indexing
cbind.data.frame(sapply(df[c(TRUE, FALSE)], formula_2),
sapply(df[c(FALSE, TRUE)], formula_1))
# A C B D
#1 -0.2857143 -0.1428571 0.00 0.15384615
#2 -1.0000000 -0.4285714 0.40 1.00000000
#3 -0.7857143 0.0000000 0.80 0.17692308
#4 -0.7142857 -0.8571429 1.00 0.07692308
#5 0.0000000 -1.0000000 0.96 0.00000000
We can use mutate_at from dplyr
library(dplyr)
df %>%
mutate_at(vars(B, D), formula_1) %>%
mutate_at(vars(A, C), formula_2)
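When the real data frame has many columns, one option is to keep the column-to-formula assignment in a single lookup list and apply it in one pass. The sketch below uses base R only; the name rules is an assumption for illustration, and the data are the sample values from the question.

```r
df <- data.frame(A = c(20, 40, 34, 32, 12),
                 B = c(100, 150, 200, 225, 220),
                 C = c(4, 6, 3, 9, 10),
                 D = c(1200, 2300, 1230, 1100, 1000))

formula_1 <- function(x) (x - min(x)) / (max(x) - min(x))
formula_2 <- function(x) (min(x) - x) / (max(x) - min(x))

# Hypothetical lookup: which formula goes with which column.
rules <- list(A = formula_2, B = formula_1, C = formula_2, D = formula_1)

# Apply each function to the column of the same name.
BI_score <- df
BI_score[names(rules)] <- Map(function(f, col) f(col), rules, df[names(rules)])
BI_score
```

Adding a column then only means adding one entry to rules, rather than another assignment line.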
I am trying to plot all columns of a data frame based on a column in the data frame. The df basically looks like this:
iters a b c
1 1 0.92 0.83 0.97
2 2 0.12 0.93 0.76
3 3 0.55 0.41 0.87
4 4 0.43 0.55 0.49
So far I have tried this code:
df <- melt(acc_s1, id.vars = 'iter', variable.name = 'letter')
ggplot(df, aes(iter,value)) + geom_line(aes(colour = letter))
Unfortunately, my result looks like this (don't mind the slightly different names):
Any ideas where this comes from?
Thanks
I have a data frame of n columns and r rows. I want to determine which column is most correlated with column 1, and then aggregate these two columns. The aggregated column is then treated as the new column 1, and I remove the most correlated column from the set, so the data shrinks by one column. I then repeat the process until the result data frame has n columns, with the second column being the aggregation of two columns, the third column being the aggregation of three columns, etc. I am wondering if there is a more efficient or quicker way to get to the result I'm going for. I've tried various things, but without success so far. Any suggestions?
n <- 5
r <- 6
> df
X1 X2 X3 X4 X5
1 0.32 0.88 0.12 0.91 0.18
2 0.52 0.61 0.44 0.19 0.65
3 0.84 0.71 0.50 0.67 0.36
4 0.12 0.30 0.72 0.40 0.05
5 0.40 0.62 0.48 0.39 0.95
6 0.55 0.28 0.33 0.81 0.60
This is what result should look like:
> result
X1 X2 X3 X4 X5
1 0.32 0.50 1.38 2.29 2.41
2 0.52 1.17 1.78 1.97 2.41
3 0.84 1.20 1.91 2.58 3.08
4 0.12 0.17 0.47 0.87 1.59
5 0.40 1.35 1.97 2.36 2.84
6 0.55 1.15 1.43 2.24 2.57
I think most of the slowness and the eventual crash come from memory overhead during the loop and not from the correlations (though those could be improved too, as @coffeeinjunky says). This is most likely a result of the way data.frames are modified in R. Consider switching to data.tables and taking advantage of their "assignment by reference" paradigm. For example, below is your code translated into data.table syntax. You can time the two loops, compare performance and comment on the results. Cheers.
library(data.table)
n <- 5L
r <- 6L
result <- setDT(data.frame(matrix(NA_real_, nrow = r, ncol = n)))
temp <- copy(df) # Create a temporary data frame in which I calculate the correlations
set(result, j = 1L, value = temp[[1]]) # The first column is the same
for (icol in as.integer(2:n)) {
mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1]) # Determine which are correlated most
set(x=result, i=NULL, j=as.integer(icol), value=(temp[[1]] + temp[[mch]]))# Aggregate and place result in results datatable
set(x=temp, i=NULL, j=1L, value=result[[icol]])# Set result as new 1st column
set(x=temp, i=NULL, j=as.integer(mch), value=NULL) # Remove column
}
Try
for (i in 2:n) {
maxcor <- names(which.max(sapply(temp[, -1, drop = FALSE], function(x) cor(temp[, 1], x))))
result[,i] <- temp[,1] + temp[,maxcor]
temp[,1] <- result[,i] # Set result as new 1st column
temp[,maxcor] <- NULL # Remove column
}
The error was caused because in the last iteration, subsetting temp yields a single vector, and standard R behavior is to reduce the class from dataframe to vector in such cases, which causes sapply to pass on only the first element, etc.
One more comment: currently, you are using the most positive correlation, not the strongest correlation, which may also be negative. Make sure this is what you want.
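To make the difference concrete, here is a small sketch with made-up data in which the two choices disagree. The data frame and its values are illustrative only: X3 is perfectly negatively correlated with X1, while X2 is moderately positively correlated.

```r
temp <- data.frame(X1 = 1:6,
                   X2 = c(2, 1, 4, 3, 6, 5),  # positively correlated with X1
                   X3 = -(1:6))               # perfectly negatively correlated

# Correlation of column 1 with each remaining column.
cors <- sapply(temp[-1], function(x) cor(temp[[1]], x))

most_positive <- names(which.max(cors))       # "X2"
strongest     <- names(which.max(abs(cors)))  # "X3"
```

If the strongest relationship is what you are after, wrapping the correlations in abs() before which.max is the only change needed in the loop.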
To address your question in the comment: note that your old code could be improved by avoiding repeated computation. For instance,
mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1])
contains the command cor(temp) twice. This means each and every correlation is computed twice. Replacing it with
cortemp <- cor(temp)
mch <- match(c(max(cortemp[-1,1])),cortemp[,1])
should cut the computational burden of the initial code line in half.
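Going one step further, only the correlations between column 1 and the other columns are ever used, so the full matrix is not needed at all. The sketch below assumes the sample df from the question and relies on cor() accepting a second argument y; the names first_vs_rest and mch_fast are made up for illustration.

```r
temp <- data.frame(X1 = c(0.32, 0.52, 0.84, 0.12, 0.40, 0.55),
                   X2 = c(0.88, 0.61, 0.71, 0.30, 0.62, 0.28),
                   X3 = c(0.12, 0.44, 0.50, 0.72, 0.48, 0.33),
                   X4 = c(0.91, 0.19, 0.67, 0.40, 0.39, 0.81),
                   X5 = c(0.18, 0.65, 0.36, 0.05, 0.95, 0.60))

# Full correlation matrix, computed once.
cortemp  <- cor(temp)
mch_full <- match(max(cortemp[-1, 1]), cortemp[, 1])

# Only column 1 against the rest: a 1 x (n - 1) matrix instead of n x n.
first_vs_rest <- cor(temp[[1]], temp[-1])
mch_fast <- which.max(first_vs_rest) + 1L  # +1 because column 1 was dropped
```

Both approaches should pick the same column; the second simply skips the correlations among the other columns, which are never looked at.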
I'm working with biochemical data from subjects, analysing the results by sex. I have 19 biochemical tests to analyse for each sex, for each of two drugs (haematology and anatomy tests coming later).
For reasons of reproducibility of results and for preventing transcription errors, I am trying to summarise each test into one table. Included in the table output, I need a column for the Dunnett post hoc comparison p-values. Because the Dunnett test compares to the control results, with a control and 3 drug levels I only get 3 p-values. However, I have 4 mean and sd values.
Using ddply to get the mean and sd results (having limited the number of significant figures), I get a dataset that looks like this:
Sex<- c(rep("F",4), rep("M",4))
Druglevel <- c(rep(0:3,2))
Sample <- c(rep(10,8))
Mean <- c(0.44, 0.50, 0.46, 0.49, 0.48, 0.55, 0.47, 0.57)
sd <- c(0.07, 0.07, 0.09, 0.12, 0.18, 0.19, 0.13, 0.41)
Drug1Biochem1 <- data.frame(Sex, Druglevel, Sample, Mean, sd)
I have used glht in the package multcomp to perform the Dunnett tests on the aov object I constructed from undertaking a normal aov. I've extracted the p-values from the glht summary (I've rounded these to three decimal places). The male and female analyses have been run using separate ANOVA so I have one set of output for each sex. The female results are:
femaleR <- c(0.371, 0.973, 0.490)
and the male results are:
maleR <- c(0.862, 0.999, 0.738)
How can I append a column for the p-values to my original dataframe (Drug1Biochem1) so that both femaleR and maleR are in that final column, with row 1 and row 5 of that column empty (i.e. no p-values for the control)?
I wish to output the resulting combination to html, which can be inserted into a Word document so no transcription errors occur. I have set a seed value so that the results of the program are reproducible (when I finally stop debugging).
In summary, I would like a data frame (or table, or whatever I can output to html) that has the following format:
Sex Druglevel Sample Mean sd p-value
F 0 10 0.44 0.07
F 1 10 0.50 0.07 0.371
F 2 10 0.46 0.09 0.973
F 3 10 0.49 0.12 0.490
M 0 10 0.48 0.18
M 1 10 0.55 0.19 0.862
M 2 10 0.47 0.13 0.999
M 3 10 0.57 0.41 0.738
For each test, I wish to reproduce this exact table. There will always be 4 groups per sex, and there will never be a p-value for the control, which will always be summarised in row 1 (F) and row 5 (M).
You could try merge
dN <- data.frame(Sex=rep(c('M', 'F'), each=3), Druglevel=1:3,
pval=c(maleR, femaleR))
merge(Drug1Biochem1, dN, by=c('Sex', 'Druglevel'), all=TRUE)
# Sex Druglevel Sample Mean sd pval
#1 F 0 10 0.44 0.07 NA
#2 F 1 10 0.50 0.07 0.371
#3 F 2 10 0.46 0.09 0.973
#4 F 3 10 0.49 0.12 0.490
#5 M 0 10 0.48 0.18 NA
#6 M 1 10 0.55 0.19 0.862
#7 M 2 10 0.47 0.13 0.999
#8 M 3 10 0.57 0.41 0.738
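Since the layout is always the same (control in the first row of each sex), another option is to fill the column directly, without a merge. This is a sketch under that assumption: it relies on the rows within each sex being ordered by Druglevel, so that the three p-values line up with levels 1 to 3.

```r
Drug1Biochem1 <- data.frame(
  Sex       = rep(c("F", "M"), each = 4),
  Druglevel = rep(0:3, 2),
  Sample    = 10,
  Mean      = c(0.44, 0.50, 0.46, 0.49, 0.48, 0.55, 0.47, 0.57),
  sd        = c(0.07, 0.07, 0.09, 0.12, 0.18, 0.19, 0.13, 0.41)
)
femaleR <- c(0.371, 0.973, 0.490)
maleR   <- c(0.862, 0.999, 0.738)

# Start with NA everywhere, then fill the non-control rows for each sex.
Drug1Biochem1$pval <- NA_real_
treated <- Drug1Biochem1$Druglevel != 0
Drug1Biochem1$pval[treated & Drug1Biochem1$Sex == "F"] <- femaleR
Drug1Biochem1$pval[treated & Drug1Biochem1$Sex == "M"] <- maleR
Drug1Biochem1
```

The control rows keep NA, which most table-export functions can be told to print as an empty cell.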