I have a dataframe with multiple columns containing, inter alia, words and their position in sentences. For some positions, there are more rows than for others. Here's a mock example:
df <- data.frame(
word = sample(LETTERS, 100, replace = T),
position = sample(1:5, 100, replace = T)
)
head(df)
word position
1 K 1
2 R 5
3 J 2
4 Y 5
5 Z 5
6 U 4
Obviously, the tranches of 'position' are differently sized:
table(df$position)
1 2 3 4 5
15 15 17 28 25
To make the different tranches more easily comparable I'd like to draw equally sized samples on the variable 'position' within one dataframe. This can theoretically be done in steps, such as these:
df_pos1 <- df[df$position==1,]
df_pos1_sample <- df_pos1[sample(1:nrow(df_pos1), 3),]
df_pos2 <- df[df$position==2,]
df_pos2_sample <- df_pos2[sample(1:nrow(df_pos2), 3),]
df_pos3 <- df[df$position==3,]
df_pos3_sample <- df_pos3[sample(1:nrow(df_pos3), 3),]
df_pos4 <- df[df$position==4,]
df_pos4_sample <- df_pos4[sample(1:nrow(df_pos4), 3),]
df_pos5 <- df[df$position==5,]
df_pos5_sample <- df_pos5[sample(1:nrow(df_pos5), 3),]
and so on, to finally combine the individual samples in a single dataframe:
df_samples <- rbind(df_pos1_sample, df_pos2_sample, df_pos3_sample, df_pos4_sample, df_pos5_sample)
but this procedure is cumbersome and error-prone. A more economical solution might be a for loop. I've tried the code below so far; however, it returns not a combination of the individual samples for each position value but a single sample drawn from all values of 'position':
df_samples <-c()
for(i in unique(df$position)){
df_samples <- rbind(df[sample(1:nrow(df[df$position==i,]), 3),])
}
df_samples
word position
13 D 2
2 R 5
12 G 3
4 Y 5
16 Z 3
11 S 3
6 U 4
14 J 3
9 O 5
1 K 1
What's wrong with this code and how can it be improved?
Consider using by to split the data frame by position and sample within each subset. Then rbind all the resulting data frames together in a single call with do.call(), outside any loop.
df_list <- by(df, df$position, function(sub) sub[sample(1:nrow(sub), 3),])
final_df <- do.call(rbind, df_list)
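Currently you index the entire (not subsetted) data frame in each iteration: sample(1:nrow(df[df$position==i,]), 3) draws row numbers between 1 and the size of the subset, but those numbers are then applied to df itself. Also, you are using rbind inside a for loop, which copies the growing object on every iteration; this is memory-intensive and not advised.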
Specifically,
by is an object-oriented wrapper for tapply that splits a data frame into subsets by one or more factors and passes each subset to a function. Here sub is just the name of the function argument that receives each subset (it can be named anything). The result here is a list of data frames.
do.call builds a single function call from a list of arguments, so rbind(df1, df2, df3) is equivalent to do.call(rbind, list(df1, df2, df3)). The key point is that rbind is not called inside a loop (avoiding the danger of growing a complex object like a data frame on each iteration) but once, outside it.
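As a quick check of the by/do.call approach, here is a minimal, self-contained sketch. The data mirrors the question's mock example; the seed is an arbitrary choice made only for reproducibility, and the sketch assumes every position has at least 3 rows.

```r
# Rebuild mock data like the question's (seed 42 is an arbitrary assumption).
set.seed(42)
df <- data.frame(
  word = sample(LETTERS, 100, replace = TRUE),
  position = sample(1:5, 100, replace = TRUE)
)

# by() hands each position's subset to the function, which samples 3 of its rows.
df_list <- by(df, df$position, function(sub) sub[sample(nrow(sub), 3), ])

# do.call() turns the list of data frames into one rbind(...) call.
final_df <- do.call(rbind, df_list)

# Every position should now contribute exactly 3 rows.
table(final_df$position)
```

Because the sampling happens on sub rather than on df, the row numbers always refer to the subset they were drawn from.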
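Each time you run the loop you are overwriting the previous result, because rbind receives only the new sample. Initialize an empty data frame, append to it, and sample only from the rows of the current position. Try:
df_samples <- data.frame()
for (i in unique(df$position)) {
  df_samples <- rbind(df_samples, df[sample(which(df$position == i), 3), ])
}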
We can use data.table: sample the row index .I within each group and use the result to subset the dataset. This should be very efficient. Note that setDT converts df to a data.table by reference.
library(data.table)
i1 <- setDT(df)[, sample(.I, 3), position]$V1
df[i1]
Or use sample_n from dplyr (part of the tidyverse); in dplyr 1.0.0 and later it is superseded by slice_sample(n = 3):
library(tidyverse)
df %>%
group_by(position) %>%
sample_n(3)
Or as a function
f1 <- function(data) {
data <- as.data.table(data)
i1 <- data[, sample(.I, 3), by = position]$V1
data[i1]
}
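The answers above assume every group has at least 3 rows. A base R sketch of the same idea that also copes with smaller groups might look like the following; sample_per_group and the toy data are hypothetical names and values introduced only for illustration.

```r
# Hypothetical helper: sample up to n rows from each level of a grouping column.
sample_per_group <- function(data, group, n) {
  pieces <- lapply(split(data, data[[group]]), function(sub) {
    # min() caps the sample size at the number of rows in the group.
    sub[sample(seq_len(nrow(sub)), min(n, nrow(sub))), , drop = FALSE]
  })
  # Stack the per-group samples in a single rbind call.
  do.call(rbind, pieces)
}

set.seed(1)  # arbitrary seed, for reproducibility only
toy <- data.frame(word = letters[1:7], position = c(1, 1, 1, 1, 2, 2, 3))
res <- sample_per_group(toy, "position", 3)

# Positions 2 and 3 have fewer than 3 rows, so they are returned whole.
table(res$position)
```

Capping the size is one possible design choice; raising an error for undersized groups would be equally defensible when equal sample sizes are a hard requirement.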
I have a data frame of two columns
set.seed(120)
df <- data.frame(m1 = runif(500,1,30),n1 = round(runif(500,10,25),0))
and I wish to add a third column that uses column n1 and m1 to generate a normal distribution and then to get the standard deviation of that normal distribution. I mean to use the values in each row of the columns n1 as the number of replicates (n) and m1 as the mean.
How can I write a function to do this? I have tried to use apply
stdev <- function(x,y) sd(rnorm(n1,m1))
df$Sim <- apply(df,1,stdev)
But this does not work. Any pointers would be much appreciated.
Many thanks,
Matt
Your data frame input looks like:
# > head(df)
# m1 n1
# 1 12.365323 15
# 2 4.654487 15
# 3 10.993779 24
# 4 24.069388 22
# 5 6.684450 18
# 6 15.056766 16
I mean to use the values in each row of the columns n1 and m1 as the number of replicates (n) and as the mean.
First, here is how to use apply:
apply(df, 1, function(x) sd(rnorm(n = x[2], mean = x[1])))
But a better way is to use mapply:
mapply(function(x,y) sd(rnorm(n = x, mean = y)), df$n1, df$m1)
apply is designed for matrix input; given a data frame, it first converts it to a matrix, which adds considerable type-conversion overhead.
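To make the difference concrete, here is a small sketch that stores the mapply result as the new column and then illustrates the matrix conversion that apply performs. The mixed data frame and its label column are made up for the illustration.

```r
set.seed(120)
df <- data.frame(m1 = runif(500, 1, 30), n1 = round(runif(500, 10, 25), 0))

# mapply walks the two columns in parallel: one sd per (n1, m1) pair.
df$Sim <- mapply(function(n, m) sd(rnorm(n = n, mean = m)), df$n1, df$m1)
head(df)

# apply() calls as.matrix() first, so a single non-numeric column
# (label is a made-up example) turns every cell into text.
mixed <- data.frame(m1 = 1.5, n1 = 10, label = "a")
row_class <- apply(mixed, 1, class)
row_class  # "character"
```

With an all-numeric data frame the conversion is harmless apart from the overhead, but mapply avoids it altogether and addresses the columns by name rather than by position.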
Another option
lapply(Map(rnorm, n = df$n1, mean = df$m1), sd)
Let's say I have a data frame consisting of one variable (x)
df <- data.frame(x=c(1,2,3,3,5,6,7,8,9,9,4,4))
I want to know how many numbers are less than each of 2, 3, 4, 5, 6 and 7.
I know how to do this manually using
# This will tell you how many numbers in df less than 4
xnew <- length(df[ which(df$x < 4), ])
My question is: how can I automate this using a for loop or some other method? I also need to store the results in an array as follows
i length
2 1
3 2
4 4
5 6
6 7
7 8
Thanks
One way is to loop over the numbers 2:7 with sapply, check which elements of df$x are less than (<) each number, and sum the resulting logical vector. Combining the counts with the numbers via cbind gives the matrix output:
res <- cbind(i=2:7, length=sapply(2:7, function(y) sum(df$x <y)))
Or you can vectorize by creating a matrix in which each of the numbers 2:7 fills one column, with as many rows as df, and comparing it with df$x using <. df$x is recycled down each column of the matrix, so colSums then gives the count for each number.
length <- colSums(df$x <matrix(2:7, nrow=nrow(df), ncol=6, byrow=TRUE))
#or
#length <- colSums(df$x < `dim<-`(rep(2:7,each=nrow(df)),c(12,6)))
cbind(i=2:7, length=length)
num = c(2,3,4,5,6,7)
res = sapply(num, function(u) length(df$x[df$x < u]))
data.frame(number=num,
numberBelow=res)
A vectorized solution:
findInterval(2:7*(1-.Machine$double.eps),sort(df$x))
The .Machine$double.eps factor ensures that you count only the numbers strictly lower than each threshold, not those lower than or equal to it.
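Two further vectorized variants, offered as a sketch rather than a recommendation: outer builds the full comparison matrix directly, and findInterval has a left.open argument that makes the intervals open on the left, so no epsilon adjustment is needed. The name thresholds is introduced here for illustration.

```r
df <- data.frame(x = c(1, 2, 3, 3, 5, 6, 7, 8, 9, 9, 4, 4))
thresholds <- 2:7

# outer() builds the 12 x 6 logical matrix of all x < threshold comparisons.
counts_outer <- colSums(outer(df$x, thresholds, "<"))

# With left.open = TRUE the intervals are (a, b], so values equal to a
# threshold are not counted.
counts_fi <- findInterval(thresholds, sort(df$x), left.open = TRUE)

data.frame(i = thresholds, length = counts_outer)
```

Both variants reproduce the counts 1, 2, 4, 6, 7, 8 from the question's table.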
Suppose I have a data frame in R where I would like to use 2 columns "factor1" and "factor2" as factors and I need to calculate mean value for all other columns per each pair of the above mentioned factors. After running the code below, the last line gives the following warnings:
Warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Why is it happening and what should I do to make it right?
Thanks.
Here is my code:
# Create data frame
myDataFrame <- data.frame(factor1=c(1,1,1,2,2,2,3,3,3), factor2=c(3,3,3,4,4,4,5,5,5), val1=c(1,2,3,4,5,6,7,8,9), val2=c(9,8,7,6,5,4,3,2,1))
# Split by 2 columns (factors)
splitDataFrame <- split(myDataFrame, list(myDataFrame$factor1, myDataFrame$factor2))
# Calculate mean value for each column per each pair of factors
splitMeanValues <- lapply(splitDataFrame, function(x) apply(x, 2, mean))
# Combine back to reduced table whereas there is only one value (mean) per each pair of factors
MeanValues <- unsplit(splitMeanValues, list(unique(myDataFrame$factor1), unique(myDataFrame$factor2)))
EDIT1: Added data frame creation (see above)
If you need to calculate the mean for all other columns than the factors, you can use the formula syntax of aggregate()
aggregate(.~factor1+factor2, myDataFrame, FUN=mean)
That returns
factor1 factor2 val1 val2
1 1 3 2 8
2 2 4 5 5
3 3 5 8 2
Your split() method didn't work because when you unsplit you must have the same number of rows as when you split your data. You were reducing the number of rows for all groups to just one row. Plus, unsplit really should be used with the exact same list of factors that was used to do the split, otherwise groups may get out of order. You could do a split, then lapply some collapsing function, and then rbind the list back into a single data.frame if you really wanted, but for a simple mean, aggregate is probably best.
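For completeness, the split, lapply and rbind route mentioned above could be sketched as follows. Using colMeans as the collapsing function and drop = TRUE in split are choices made for this sketch, not part of the original code.

```r
myDataFrame <- data.frame(factor1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                          factor2 = c(3, 3, 3, 4, 4, 4, 5, 5, 5),
                          val1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                          val2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1))

# drop = TRUE discards factor pairs that never occur together (e.g. 1 with 4).
groups <- split(myDataFrame, list(myDataFrame$factor1, myDataFrame$factor2),
                drop = TRUE)

# Collapse each group to one row of column means, then stack the rows once.
meanValues <- as.data.frame(do.call(rbind, lapply(groups, colMeans)))
meanValues
```

The row names (1.3, 2.4, 3.5) identify the factor pairs, and the factor columns survive because the mean of a constant column is that constant.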
The same result can be obtained with summaryBy() in the doBy package, although it's pretty much the same as aggregate() in this case.
> library(doBy)
> summaryBy( . ~ factor1+factor2, data = myDataFrame)
# factor1 factor2 val1.mean val2.mean
# 1 1 3 2 8
# 2 2 4 5 5
# 3 3 5 8 2
Have you tried aggregate?
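# val1 stands in for whichever value column you need; by must be a list
aggregate(myDataFrame$val1, by = list(factor1 = myDataFrame$factor1), FUN = mean)
aggregate(myDataFrame$val1, by = list(factor2 = myDataFrame$factor2), FUN = mean)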
I have the following data
set.seed(11)
Data<-rbind(c(1:5),c(2:6))
Candidates <- matrix(1:25 + rnorm(25), ncol=5,
dimnames=list(NULL, paste0("x", 1:5)))
colnames(Data)<-colnames(Candidates)
I want to subtract each row of my Data from each row of the Candidates matrix and return the minimal absolute difference. So for row one I want to find the smallest possible error among:
sum(abs(Data[1,]-Candidates[1,]))
sum(abs(Data[1,]-Candidates[2,]))
sum(abs(Data[1,]-Candidates[3,]))
sum(abs(Data[1,]-Candidates[4,]))
sum(abs(Data[1,]-Candidates[5,]))
In this case it's 38.15826. At the moment I'm not actually interested in finding out which Candidate row results in the smallest absolute deviation, I just want to know the smallest absolute deviation for each Data row.
I would then like to end up with a new dataset which has my original Data and the smallest deviation, e.g. row one would look like this:
x1 x2 x3 x4 x5 MinDev
1 2 3 4 5 38.15826
My real Candidate Matrix is relatively small but my real Data is quite large,
so at the moment I'm just building a loop that
Err[i,]<- min(rbinds(
sum(abs(Data[i,]-Candidates[1,])),
sum(abs(Data[i,]-Candidates[2,]))...))
but I'm sure there's a better, more automated way to do this so that it can accommodate large Data matrices and Candidate matrices of different sizes.
Any ideas?
You can use sweep, rowSums, and apply to automate this. For reference, the value for the first row of Data against the first row of Candidates:
sum(abs(Data[1,]-Candidates[1,])) ## 38.15826
Testing on the first row of Data:
min(
rowSums(abs(
## subtract row 1 of Data from each row of Candidates
sweep(Candidates,2,Data[1,],"-"))))
## 38.15826
For convenience/readability, encapsulate this in a function:
getMinDev <- function(x) {
min(rowSums(abs(sweep(Candidates,2,x,"-"))))
}
Now apply to each row of Data:
cbind(Data,MinDev=apply(Data,1,getMinDev))
There may be methods that are marginally faster than sweep (e.g. the matrix computations given in @e4e5f4's answer), but this should be a good baseline. I like sweep because it is descriptive and doesn't depend on knowing that R uses column-major matrix ordering.
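If sweep is unfamiliar, a toy example may help. The values below are chosen for easy mental arithmetic and are not the question's data; cand and target are names made up for the sketch.

```r
# Two candidate rows, three columns.
cand <- matrix(c(1, 2, 3,
                 4, 5, 6), nrow = 2, byrow = TRUE)
target <- c(1, 1, 1)

# MARGIN = 2 lines target up with the columns, so it is subtracted from each row.
diffs <- sweep(cand, 2, target, "-")
diffs                     # rows: 0 1 2 and 3 4 5

# Sum of absolute differences per candidate row, then the smallest one.
min(rowSums(abs(diffs)))  # 3
```

The first row differs from target by 0 + 1 + 2 = 3 and the second by 3 + 4 + 5 = 12, so the minimum is 3.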
You can use apply with some matrix operations:
CalcMinDev <- function(x)
{
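# repeat x as every row so it lines up with Candidates (also works when Candidates is not square)
m <- matrix(x, nrow = nrow(Candidates), ncol = length(x), byrow = TRUE)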
min(rowSums(abs(m - Candidates)))
}
cbind(Data, MinDev=apply(Data, 1, CalcMinDev))
Following @BenBolker's suggestion to turn my comment (using the dist function with method="manhattan") into an answer:
The idea: The trick is that if you supply a matrix to dist, it'll return the distance of all combinations back as a lower triangular matrix.
dist(rbind(Candidates, Data), method="manhattan")
# 1 2 3 4 5 6
# 2 8.786827
# 3 11.039044 3.718396
# 4 16.120267 7.333440 6.041076
# 5 21.465682 12.678855 10.426638 5.345415
# 6 38.158256 45.763021 48.015238 53.096461 58.441876
# 7 35.158256 40.763021 44.048344 48.096461 53.441876 5.000000
Here, the 6th and 7th rows (columns 1 to 5) hold the distances you're interested in. So, basically, you just have to calculate the indices needed to extract those elements.
The final code would look like:
idx1 <- seq_len(nrow(Data)) + nrow(Candidates)
idx2 <- seq_len(nrow(Candidates))
tt <- dist(rbind(Candidates, Data), method="manhattan")
transform(Data, minDev = apply(as.matrix(tt)[idx1, idx2], 1, min))
# x1 x2 x3 x4 x5 minDev
# 6 1 2 3 4 5 38.15826
# 7 2 3 4 5 6 35.15826
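As a sanity check, the sweep-based and dist-based approaches can be compared on a Candidates matrix that is deliberately not square, since the question asks for candidate matrices of different sizes. The 4 x 5 matrix below is an assumption made for this sketch, not the question's data. In the distance matrix, the columns to extract run over the rows of Candidates.

```r
set.seed(11)
# Deliberately non-square: 4 candidate rows, 5 columns.
Candidates <- matrix(1:20 + rnorm(20), ncol = 5)
Data <- rbind(1:5, 2:6)

# sweep-based minimum for each row of Data
via_sweep <- apply(Data, 1, function(x) {
  min(rowSums(abs(sweep(Candidates, 2, x, "-"))))
})

# dist-based minimum: rows pick the Data entries, columns the Candidates entries
d <- as.matrix(dist(rbind(Candidates, Data), method = "manhattan"))
rows <- nrow(Candidates) + seq_len(nrow(Data))
cols <- seq_len(nrow(Candidates))
via_dist <- apply(d[rows, cols, drop = FALSE], 1, min)

# Both routes compute the same Manhattan distances, so they should agree.
all.equal(via_sweep, unname(via_dist))
```

Note that dist also computes the distances among the Candidates rows and among the Data rows, which are never used, so for a very large Data matrix the sweep version is likely the more economical of the two.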