I want to split my data into 3 parts with a ratio of 6:2:2. Is there an R command that can do that? Thanks.
I used createDataPartition from the caret package, which can split data into two parts. But how do I do it with 3 splits? Is that possible, or do I need two steps to do that?
You can randomly split with (roughly) this ratio using sample:
set.seed(144)
spl <- split(iris, sample(c(1, 1, 1, 2, 3), nrow(iris), replace=T))
This splits your initial data frame into a list. Now you can check that you've gotten the split ratio you were looking for by using lapply to call nrow on each element of the list:
unlist(lapply(spl, nrow))
# 1 2 3
# 98 26 26
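Equivalently, you can hand the target ratio to sample() directly through its prob argument instead of repeating the group labels — just a sketch of the same idea:
set.seed(144)
# prob gives the 6:2:2 weights directly; group sizes are still only approximate
spl_alt <- split(iris, sample(1:3, nrow(iris), replace = TRUE, prob = c(0.6, 0.2, 0.2)))
unlist(lapply(spl_alt, nrow))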
If you wanted to randomly shuffle but get exactly your target ratio for each group, you could shuffle the indices and then select the correct number of each type of index from the shuffled list. For iris, that means 90 rows for group 1, 30 for group 2, and 30 for group 3:
set.seed(144)
nums <- c(90, 30, 30)
assignments <- rep(NA, nrow(iris))
assignments[sample(nrow(iris))] <- rep(c(1, 2, 3), nums)
spl2 <- split(iris, assignments)
unlist(lapply(spl2, nrow))
# 1 2 3
# 90 30 30
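To answer the second part of the question: yes, a two-step approach with caret's createDataPartition also works. A rough sketch, assuming a 60/20/20 split of iris stratified on Species:
library(caret)
set.seed(144)
# step 1: carve off roughly 60% for training
idx_train <- createDataPartition(iris$Species, p = 0.6, list = FALSE)
train <- iris[idx_train, ]
rest  <- iris[-idx_train, ]
# step 2: split the remaining 40% in half (20% validation, 20% test)
idx_val <- createDataPartition(rest$Species, p = 0.5, list = FALSE)
validation <- rest[idx_val, ]
test       <- rest[-idx_val, ]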
I have a data.frame in R containing several categorical variables, each with its own mean and standard deviation. For each categorical variable I want to generate values from a normal distribution defined by those two parameters, and end up with an individual data.frame per variable.
Here's some dummy data
dummy_data <- data.frame(VARIABLE = LETTERS[seq( from = 1, to = 10 )],
MEAN = runif(10, 5, 10), SD = runif(10, 1, 3))
dummy_data
VARIABLE MEAN SD
1 A 6.278751 1.937093
2 B 6.384247 2.487678
3 C 9.017496 2.003202
4 D 5.125994 1.829517
5 E 9.525213 1.914513
6 F 9.004893 2.734934
7 G 9.780757 2.511341
8 H 5.372160 1.510281
9 I 6.240331 2.796826
10 J 8.478280 2.325139
What I'd like to do from here is to generate an individual data.frame for each row, with each data.frame containing draws from a normal distribution based on the MEAN and SD columns.
So, for example, I'd have a separate data.frame that contained....
A <- subset(dummy_data, VARIABLE == 'A')
A <- data.frame(rnorm(20, A$MEAN, A$SD))
A
rnorm.20..A.MEAN..A.SD.
1 5.131331
2 9.388104
3 8.909453
4 5.813257
5 5.353137
6 7.598521
7 2.693924
8 5.425703
9 8.939687
10 9.148066
11 4.528936
12 7.576479
13 8.207456
14 6.838258
15 6.972061
16 7.824283
17 6.283434
18 4.503815
19 2.133388
20 7.472886
The real data I'm working with is much larger than ten rows, and so I don't want to subset the whole thing to generate the individual data.frames if I can help it.
Thanks in advance
What about a solution using dplyr?
library(dplyr)
#A dataframe containing all the information
Huge_df <- dummy_data %>% group_by(VARIABLE) %>% summarise(SIMULATED = rnorm(20, MEAN, SD))
#You can then split the dataframe if needed:
Splitted <- split.data.frame(Huge_df, Huge_df$VARIABLE)
If you then need to save every individual data frame, or do something else with them, you can work with the elements of the Splitted list directly.
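For instance — a sketch, assuming you either want one CSV per variable (the file names here are made up) or one object per variable in the global environment:
# write one file per variable
lapply(names(Splitted), function(v)
  write.csv(Splitted[[v]], paste0(v, "_simulated.csv"), row.names = FALSE))
# or expose each data frame as its own object (A, B, C, ...)
list2env(Splitted, globalenv())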
Using data.table:
library(data.table)
result <- setDT(dummy_data)[, .(sample=rnorm(20, mean=MEAN, sd=SD)), by=.(VARIABLE)]
list.of.df <- split(result, result$VARIABLE)
You can put everything into a list, then return all the elements of the list to the global environment if desired, or simply keep them in the list:
set.seed(123)
dummy_data <- data.frame(VARIABLE = LETTERS[seq( from = 1, to = 10 )],
MEAN = runif(10, 5, 10), SD = runif(10, 1, 3))
# put all the values into a list
list_dist <- vector(mode = "list", length = nrow(dummy_data))
for(i in 1:nrow(dummy_data)){
list_dist[[i]] <- data.frame(values = rnorm(20, dummy_data[i,2], dummy_data[i,3]))
}
# name the list elements
names(list_dist) <- dummy_data$VARIABLE
# or more detailed names, for instance,
# names(list_dist) <- paste0(dummy_data$VARIABLE, "_Distribution")
#return all list values to the global environment
list2env(list_dist,globalenv())
Example data to copy
df <- data.frame(
AA = c(100, 200, 300, 400),
X1 = c(2, 1, 3, 1),
X2 = c(1, 3, 4, 1)
)
Based on the row indices stored in each Xi column and the values of AA, I would like to calculate, for every row, the sum of indicators given by the condition df$AA[i] > df[df$X1[i], c('AA')] (shown here for X1), over a varying number of variables.
My probably naive approach is to use a for-loop, which works perfectly for a fixed number of variables (columns), in the given example X1 and X2. My problem is that I do not know the number of variables beforehand. Theoretically, any number 1, 2, 3, ... is possible.
for (i in 1:nrow(df)) {
df$index[i] <- sum(df$AA[i] > df[df$X1[i], c('AA')],
df$AA[i] > df[df$X2[i], c('AA')])
}
Which gives the desired output for a fixed number of variables X1, X2:
df
#> AA X1 X2 index
#> 1 100 2 1 0
#> 2 200 1 3 1
#> 3 300 3 4 0
#> 4 400 1 1 2
Is there a smooth base R approach which translates my approach to a flexible number of variables X1, ..., Xn?
Note, the reason why I am interested in a base R approach is my aim to extend an existing package, which is fully written in base R. So I would like to keep it like that.
Loops or *apply-family approaches are both very welcome.
I am aware of the fact that operations on dataframes are often considered to be slower. Since all variables AA, X1, ... are of the same length, a solution which does not rely on a dataframe structure would also be great!
Created on 2022-04-06 by the reprex package (v2.0.1)
You don't need to loop through rows. You can use Reduce.
Reduce(`+`, lapply(df[-1], function(x) df$AA > df$AA[x]))
#> [1] 0 1 0 2
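Since the question also asks for a version that does not rely on a data-frame structure, the same Reduce() idea works on a plain vector plus an index matrix — a sketch assuming AA is the first column and all remaining columns are index variables:
AA <- df$AA
X  <- as.matrix(df[-1])  # one column per index variable X1, ..., Xn
Reduce(`+`, lapply(seq_len(ncol(X)), function(j) AA > AA[X[, j]]))
#> [1] 0 1 0 2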
Does this correspond to what you're looking for?
df$index <- apply(df, 1, function(x){sum(x[1] > df$AA[x[-1]])})
assuming that AA is column 1 and the Xi are all the other columns.
The following one-liner works precisely because df is a data frame:
df$index <- rowSums( # To sum over a non-specified number of columns
mapply(
df[,- which(names(df) == "AA")], # Everything except AA
df[,"AA", drop = FALSE], # Only AA, but in a data-frame
FUN = function(index, aa) aa[index] < aa)) # Compare
I have a matrix containing 5 columns and 20 rows. For each row, I want to find the proportion of even numbers in that row and write it next to the row. My trouble is finding the proportion of even numbers.
Here is part of the matrix:
1 2 3 4 5
[1,] 6 5 1 2 5
This is my attempt so far, but it doesn't work:
x <- apply(matrix, 1, length(matrix %% 2 == 0)/5)
matrix <- cbind(matrix, x)
Take a look at ?"%%". Here is an example:
## reproducible example
set.seed(1)
mat <- matrix(
sample(1:10,5*20,replace = TRUE),
nrow = 20, ncol = 5, byrow = TRUE)
## 1- convert matrix to a logical one using %%
## 2- count the occurrences of TRUE values using the vectorised rowSums
## 3- divide by the number of columns to convert counts to proportions
rowSums(mat %% 2 ==0)/ncol(mat)
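To write the proportion next to each row, as asked, you can then bind it onto the matrix — a small sketch continuing from the code above:
prop_even <- rowSums(mat %% 2 == 0) / ncol(mat)
mat <- cbind(mat, prop_even)  # adds a sixth column holding the per-row proportion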
I have a data frame with a group of x and y points. I need to calculate the Euclidean distance of every point relative to every other point. Then I have to figure out, for each row, how many points are within a given range.
For example, if I had this data frame:
x y
- -
1 2
2 2
9 9
I should add a column that signals how many points (if we consider these points to be in a cartesian plane) are within a distance of 3 units from every other point.
x y n
- - -
1 2 1
2 2 1
9 9 0
Thus, the first point (1,2) has one other point (2,2) that is within that range, whereas the point (9,9) has 0 points at a distance of 3 units.
I could do this with a couple of nested for loops, but I am interested in solving this in R in an idiomatic way, preferably using dplyr or another library.
This is what I have:
ddply(.data=mydataframe, .variables('x', 'y'), .fun=count.in.range)
count.in.range <- function (df) {
xp <- df$x
yp <- df$y
return(nrow(filter(df, dist( rbind(c(x,y), c(xp,yp)) ) < 3 )))
}
But, for some reason, this doesn't work. I think it has to do with filter.
Given
df_ <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
You can use the function "dist":
matrix_dist <- as.matrix(dist(df_))
df_$n <- rowSums(matrix_dist <= 3) - 1  # subtract 1 so a point does not count itself
This is a base approach with a straightforward application of a "distance function", but only on a row-by-row basis:
apply( df_ , 1, function(x) sum( (x[1] - df_[['x']])^2+(x[2]-df_[['y']])^2 <=9 )-1 )
#[1] 1 1 0
It's also really a "sweep" operation, although I wouldn't really expect a performance improvement.
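For completeness, a fully vectorised variant of the same comparison — a sketch using outer() rather than an explicit sweep(); it builds the whole n-by-n squared-distance matrix, so it trades memory for avoiding the row-by-row apply:
dx <- outer(df_$x, df_$x, "-")
dy <- outer(df_$y, df_$y, "-")
df_$n <- rowSums(dx^2 + dy^2 <= 9) - 1  # subtract 1 to drop each point itself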
I would suggest you work with pairs of points in long format and then use a data.table solution, which is probably one of the fastest alternatives for working with large datasets.
library(data.table)
library(reshape)
df <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
The first thing you need to do is to reshape your data to long format with all possible combinations of pairs of points:
df_long <- expand.grid.df(df,df)
setDT(df_long)
# rename columns
setnames(df_long, c("x","y","x1","y1"))
Now you only need to do this:
# calculate distance between pairs
df_long[ , mydist := dist ( matrix(c(x,x1,y,y1), ncol = 2, nrow = 2) ) , by=.(x,y,x1,y1)]
# count how many points are within a distance of 3 units
df_long[mydist <3 , .(count = .N), by=.(x,y)]
#> x y count
#> 1: 1 2 2
#> 2: 2 2 2
#> 3: 9 9 1
Note that this count includes the point itself; subtract 1 from count if, as in the desired output in the question, a point should not count itself.
I am trying to perform the following kind of summation on a matrix:
Let's say the matrix is:
mat <- matrix(c(1:5,rep(0,7),c(1:7),rep(0,5),c(1:10), 0,0), 12,3)
I want to do a cumulative sum over rows, up to row numbers 5, 7, 10 for columns 1, 2, 3 respectively. (The real data can have an arbitrary number of rows and columns.)
For now, I have been using following code:
sum1 <- matrix(rep(0, 36), 12, 3)
row_index <- c(5,7,10)
for (k in 1:3) {
sum1[1:row_index[k], k] <- cumsum(mat[1:row_index[k], k])
}
sum1 <- matrix(apply(sum1,1,sum))
To start with, I have the matrix and row_index. I want to avoid using the loop as the data has a lot of columns. I am wondering if there is a way to do that.
depth <- c(5,7,10)
mapply( function(x,y) cumsum(mat[1:x, y]), depth, seq_along(depth) )
[[1]]
[1] 1 3 6 10 15
[[2]]
[1] 1 3 6 10 15 21 28
[[3]]
[1] 1 3 6 10 15 21 28 36 45 55
First, define a function:
sumcolumn <- function(rows, columns, mat){
cumsum(mat[1:rows, columns])
}
then use mapply on your vectors of columns/rows:
mapply(sumcolumn, rows = c(5, 7, 10), columns = c(1, 2, 3), MoreArgs = list(mat = mat))
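If you then want the single combined column that the original sum1/rowSums step produced, one way (just a sketch) is to pad each cumulative sum with zeros up to the full number of rows before summing across columns:
res <- mapply(sumcolumn, rows = c(5, 7, 10), columns = c(1, 2, 3),
              MoreArgs = list(mat = mat))
# pad each vector to nrow(mat) with zeros, giving a 12 x 3 matrix, then sum the rows
padded <- vapply(res, function(v) c(v, rep(0, nrow(mat) - length(v))), numeric(nrow(mat)))
rowSums(padded)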