To apply a function to multiple smaller datasets drawn from a larger one, I need to split the large dataset by multiple variables. For further use of the child datasets, I want to store them in a nested list with the grouping variables as list node names (to be used with rapply).
An example:
head_mtcars <- head(mtcars, 10)
I know from here that I can split the data set using list(data$V1, data$V2), but the resulting list unfortunately keeps all grouping variables on the same level. I would like list nodes such as $6$3, $8$3, etc.:
split(head_mtcars, list(head_mtcars$cyl, head_mtcars$gear), drop = TRUE)
$`6.3`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
$`8.3`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
$`4.4`
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
$`6.4`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
I also tried to change the separator but this does not help:
## only changes the naming separator to a $ but does not actually create a new list level:
split(head_mtcars, list(head_mtcars$cyl, head_mtcars$gear), drop = TRUE, sep = "$")
$`6$3`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
$`8$3`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
$`4$4`
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
$`6$4`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
I also tried to adapt the code from here to multiple splitting variables, but this moves the grouping variables into dimnames, and I don't know how (or whether it is possible at all) to convert those into nested list levels. It works perfectly with only one grouping variable.
by(head_mtcars, list(head_mtcars$cyl, head_mtcars$gear), identity, simplify = FALSE)
: 4
: 3
NULL
-------------------------------------------------------------
: 6
: 3
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
-------------------------------------------------------------
: 8
: 3
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
-------------------------------------------------------------
: 4
: 4
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
-------------------------------------------------------------
: 6
: 4
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
-------------------------------------------------------------
: 8
: 4
NULL
I also tried various tidyverse approaches, but none of them really solved the problem.
In the end, I would like a nested list with the levels of $cyl as the first level and the levels of $gear as the level below. Any advice?
Here is a method to nest the splits an arbitrary number of times using reduce() and map_depth().
Note that the formula interface for split() on data frames is a relatively recent feature (as far as I know, it was added in R 4.1.0), so if it doesn't work you may have to upgrade R to a more recent version.
library(purrr)
head_mtcars <- head(mtcars, 10)
fms <- list(~cyl, ~gear, ~carb)
reduce(.x = fms, .f = ~ map_depth(.x, .depth = vec_depth(.x) - 2, split, .y), .init = head_mtcars)
$`4`
$`4`$`4`
$`4`$`4`$`1`
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
$`4`$`4`$`2`
mpg cyl disp hp drat wt qsec vs am gear carb
Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
$`6`
$`6`$`3`
$`6`$`3`$`1`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
$`6`$`4`
$`6`$`4`$`4`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
$`8`
$`8`$`3`
$`8`$`3`$`2`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
$`8`$`3`$`4`
mpg cyl disp hp drat wt qsec vs am gear carb
Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
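For comparison, the same nesting can be sketched in base R, which avoids both purrr and the formula interface. The helper name nested_split() is made up for this illustration and is not part of the answer above; it assumes the grouping variables are passed as column names, in nesting order.

```r
# Hypothetical helper: split a data frame by each grouping column in turn,
# nesting one list level per column.
nested_split <- function(df, vars) {
  # no grouping columns left: this data frame is a leaf
  if (length(vars) == 0) return(df)
  # split by the first column, then recurse into every piece with the rest
  pieces <- split(df, df[[vars[1]]], drop = TRUE)
  lapply(pieces, nested_split, vars = vars[-1])
}

head_mtcars <- head(mtcars, 10)
nested <- nested_split(head_mtcars, c("cyl", "gear"))

names(nested)           # "4" "6" "8"
names(nested[["6"]])    # "3" "4"
nested[["6"]][["4"]]    # the three 6-cylinder, 4-gear cars
```

Because each level is split only within its parent, combinations that do not occur (such as 4 cylinders with 3 gears here) simply do not appear in the list.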
I am trying to create a function which summarises a grouped dataset and then adds a column to identify which variable is being summarised (ID column).
I am not sure how to add the ID column using the curly-curly approach.
my_fun <- function(dat, var_name){
dat %>%
mutate(id_column = names({{var_name}}))
}
my_fun(mtcars, cyl)
What I want is for the variable name, in this case cyl, to be recycled.
Just use deparse(substitute()) at the start:
my_fun <- function(dat, var_name){
nm1 <- deparse(substitute(var_name))
dat %>%
mutate(id_column = nm1)
}
Testing:
my_fun(mtcars, cyl)
mpg cyl disp hp drat wt qsec vs am gear carb id_column
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 cyl
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 cyl
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 cyl
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 cyl
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 cyl
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 cyl
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 cyl
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 cyl
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 cyl
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 cyl
...
In the tidyverse, this can also be done directly from a symbol: use ensym() to capture the argument as a symbol, then either unquote it with !! to get the column's values or convert it to a string with as_string():
my_fun <- function(dat, var_name){
var_name <- rlang::ensym(var_name)
dat %>%
mutate(id_column = rlang::as_string(var_name), val_column = !! var_name)
}
Testing:
my_fun(head(mtcars), cyl)
mpg cyl disp hp drat wt qsec vs am gear carb id_column val_column
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 cyl 6
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 cyl 6
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 cyl 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 cyl 6
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 cyl 8
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 cyl 6
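Since the question is about summarising a grouped dataset, here is a rough base-R sketch of how the captured name could label a grouped summary. The function summarise_var() and the column names mean_value and id_column are invented for this example; it assumes both arguments are bare column names.

```r
# Hypothetical helper: mean of one column per group, with an id column
# recording which variable was summarised.
summarise_var <- function(dat, group, var_name) {
  # capture the unevaluated arguments as strings
  var_chr   <- deparse(substitute(var_name))
  group_chr <- deparse(substitute(group))
  # grouped mean; aggregate() returns the columns Group.1 and x
  out <- aggregate(dat[[var_chr]], by = list(dat[[group_chr]]), FUN = mean)
  names(out) <- c(group_chr, "mean_value")
  # the variable name is recycled down the id column
  out$id_column <- var_chr
  out
}

res <- summarise_var(mtcars, cyl, mpg)
res
```

One caveat: deparse(substitute()) sees only the expression passed to the function it is called in, so this does not carry over if the argument is forwarded through another wrapper function.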
I have a dataset as below and I want to create a new row that contains the values of colnames(df). Many thanks in advance.
df <- head(mtcars); df
Expected Answer
mpg cyl disp hp drat wt qsec vs am gear carb
newRow mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
rbind is what you need:
rbind(newRow = colnames(df), df)
mpg cyl disp hp drat wt qsec vs am gear carb
newRow mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
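One thing to be aware of: because the new row holds text, every column ends up as character (which is why the values above print as 21 rather than 21.0). A short sketch that checks this and converts the data rows back to numeric, assuming all original columns were numeric as in mtcars:

```r
df <- head(mtcars)
with_names <- rbind(newRow = colnames(df), df)

# every column is now character
sapply(with_names, class)

# drop the name row again and convert the remaining rows back to numeric
restored <- with_names[-1, ]
restored[] <- lapply(restored, as.numeric)
sapply(restored, class)
```

If the name row is only needed for display or export, it may be simpler to keep df numeric and add the row just before writing it out.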
I want to import multiple csv files from a folder and sort them into distinct data frames based on the file name.
The pattern of my file names is chX_imgN_chYROI, where X and Y are 1, 2 or 3 and N is 1 to 5. N does not matter, as I want to combine the .csv files based on distinct combinations of X and Y (e.g. ch1_ch2ROI <- ch1_img1_ch2ROI, ch1_img2_ch2ROI, ..., ch1_img5_ch2ROI).
I'm a novice and any suggestions/insights will be helpful. Thanks!
The first part of this question (import multiple csv files) is really a duplicate of How do I make a list of data frames?.
But the second part -- combining some frames -- is a little different. I'll generate some sample data.
From the duplicate part, you'd probably use something like below to read in the files:
alldat <- sapply(list.files(somedir, pattern = "ch.*_img.*_ch.*\\.csv$", full.names = TRUE),
read.csv, stringsAsFactors = FALSE,
simplify = FALSE)
Even if you just use this code blindly, I still recommend you read over the answers in How do I make a list of data frames?, as the advice and methodology are efficient and very idiomatic to R. Done correctly, they can make many workflows significantly easier to visualize, understand, and maintain.
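To make the import step concrete, here is a self-contained sketch that writes two small CSV files to a temporary folder and reads them back into a named list. The folder, the file contents, the stricter file-name pattern and the name demo_dat are all made up for illustration; with real data you would point list.files() at your own directory.

```r
# create a throwaway folder with two example files (illustration only)
tmpdir <- tempfile("roi_")
dir.create(tmpdir)
write.csv(mtcars[1:2, ], file.path(tmpdir, "ch1_img1_ch1ROI.csv"), row.names = FALSE)
write.csv(mtcars[3:4, ], file.path(tmpdir, "ch1_img2_ch1ROI.csv"), row.names = FALSE)

# find the files; the pattern assumes the chX_imgN_chYROI.csv naming scheme
files <- list.files(tmpdir,
                    pattern = "^ch[0-9]+_img[0-9]+_ch[0-9]+ROI\\.csv$",
                    full.names = TRUE)

# read every file and name each list element after its file (minus ".csv")
demo_dat <- lapply(files, read.csv, stringsAsFactors = FALSE)
names(demo_dat) <- tools::file_path_sans_ext(basename(files))

names(demo_dat)   # "ch1_img1_ch1ROI" "ch1_img2_ch1ROI"
```

Naming the elements without the directory and extension keeps the later regular expressions simple.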
To mimic the import process, I'll use this fake data:
alldat <- list(
"ch1_img1_ch1ROI" = mtcars[1:2,],
"ch1_img1_ch2ROI" = mtcars[3:4,],
"ch1_img2_ch1ROI" = mtcars[5:6,],
"ch2_img1_ch1ROI" = mtcars[7:8,],
"ch2_img1_ch2ROI" = mtcars[9:10,],
"ch2_img2_ch2ROI" = mtcars[11:12,]
)
alldat
# $ch1_img1_ch1ROI
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
# $ch1_img1_ch2ROI
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# $ch1_img2_ch1ROI
# mpg cyl disp hp drat wt qsec vs am gear carb
# Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
# $ch2_img1_ch1ROI
# mpg cyl disp hp drat wt qsec vs am gear carb
# Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
# Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
# $ch2_img1_ch2ROI
# mpg cyl disp hp drat wt qsec vs am gear carb
# Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
# Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
# $ch2_img2_ch2ROI
# mpg cyl disp hp drat wt qsec vs am gear carb
# Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
# Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
By your logic, we have some combinations of X/Y that are unique and some that have multiple N's. Let's group solely by X/Y combinations.
First, we'll extract the X and Y components into a unique string for each filename:
gsub(".*ch([0-9]+)_.*ch([0-9]+).*", "\\1_\\2", names(alldat))
# [1] "1_1" "1_2" "1_1" "2_1" "2_2" "2_2"
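If the combined frames should end up with names like ch1_ch2ROI (as written in the question), the same two capture groups can build that label directly. A small sketch with made-up file names and a stricter pattern that assumes the chX_imgN_chYROI layout holds exactly:

```r
# example names only; in practice this would be names(alldat)
fnames <- c("ch1_img1_ch2ROI", "ch1_img5_ch2ROI", "ch3_img2_ch1ROI")

# \\1 is the X digit(s), \\2 is the Y digit(s); the imgN part is dropped
keys <- sub("^ch([0-9]+)_img[0-9]+_ch([0-9]+)ROI$", "ch\\1_ch\\2ROI", fnames)
keys
# [1] "ch1_ch2ROI" "ch1_ch2ROI" "ch3_ch1ROI"
```

Anchoring the pattern with ^ and $ means a name that does not follow the scheme is returned unchanged, which makes stray files easy to spot.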
Notice that we have some frames that need to be combined, namely elements 1 and 3, and elements 5 and 6.
Next, split the list of frames by this string. The result is a list of 4 elements, each of which is a nested list of one or more frames.
spllists <- split(alldat, gsub(".*ch([0-9]+)_.*ch([0-9]+).*", "\\1_\\2", names(alldat)))
str(spllists, max.level = 2)
# List of 4
# $ 1_1:List of 2
# ..$ ch1_img1_ch1ROI:'data.frame': 2 obs. of 11 variables:
# ..$ ch1_img2_ch1ROI:'data.frame': 2 obs. of 11 variables:
# $ 1_2:List of 1
# ..$ ch1_img1_ch2ROI:'data.frame': 2 obs. of 11 variables:
# $ 2_1:List of 1
# ..$ ch2_img1_ch1ROI:'data.frame': 2 obs. of 11 variables:
# $ 2_2:List of 2
# ..$ ch2_img1_ch2ROI:'data.frame': 2 obs. of 11 variables:
# ..$ ch2_img2_ch2ROI:'data.frame': 2 obs. of 11 variables:
Finally, iterate (lapply) over the outer list, combining the inner lists. To row-combine a single inner list, we'd use:
spllists[[1]]
# $ch1_img1_ch1ROI
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
# $ch1_img2_ch1ROI
# mpg cyl disp hp drat wt qsec vs am gear carb
# Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
do.call(rbind, spllists[[1]])
# mpg cyl disp hp drat wt qsec vs am gear carb
# ch1_img1_ch1ROI.Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# ch1_img1_ch1ROI.Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# ch1_img2_ch1ROI.Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# ch1_img2_ch1ROI.Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
So to do this for all elements of spllists, we'll use:
alldat2 <- lapply(spllists, function(x) do.call(rbind, x))
alldat2
# $`1_1`
# mpg cyl disp hp drat wt qsec vs am gear carb
# ch1_img1_ch1ROI.Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# ch1_img1_ch1ROI.Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# ch1_img2_ch1ROI.Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# ch1_img2_ch1ROI.Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# $`1_2`
# mpg cyl disp hp drat wt qsec vs am gear carb
# ch1_img1_ch2ROI.Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# ch1_img1_ch2ROI.Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# $`2_1`
# mpg cyl disp hp drat wt qsec vs am gear carb
# ch2_img1_ch1ROI.Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
# ch2_img1_ch1ROI.Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
# $`2_2`
# mpg cyl disp hp drat wt qsec vs am gear carb
# ch2_img1_ch2ROI.Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
# ch2_img1_ch2ROI.Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
# ch2_img2_ch2ROI.Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
# ch2_img2_ch2ROI.Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
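In the result above, the originating file ends up as a prefix on the row names. If it is more convenient to keep it as a regular column, one possible variation is sketched below; the helper combine_with_source() and the column name source are invented for this example.

```r
# Hypothetical helper: row-bind a named list of data frames, recording each
# frame's list name in a "source" column instead of in the row names.
combine_with_source <- function(frames) {
  tagged <- Map(function(d, nm) cbind(source = nm, d), frames, names(frames))
  out <- do.call(rbind, tagged)
  rownames(out) <- NULL   # drop the "name.rowname" labels
  out
}

# two fake frames standing in for one element of spllists
one_group <- list(
  "ch1_img1_ch1ROI" = mtcars[1:2, ],
  "ch1_img2_ch1ROI" = mtcars[5:6, ]
)
combined <- combine_with_source(one_group)
combined[, 1:4]
```

Applied to the whole split list, this would be lapply(spllists, combine_with_source).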
How can I iterate Pearson's correlation over the columns of a data frame (the [i] below marks where I need to iterate), and how can I store the results in a data frame assigned to a variable?
Code example:
*INSTEAD OF DOING THIS*
F.ReedBunting.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$ReedBunting,method='pearson')
F.Whitethroat.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$Whitethroat,method='pearson')
F.Rook.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$Rook,method='pearson')
.
.
.
*HOW CAN IT BE DONE QUICKLY WITH THIS*
workspaceone <- sapply(W_farmland_mean, function(x){
cor.test(W_farmland_mean$Years, W_farmland_mean[, 1[i]], method = 'pearson')
})
I think you should try:
result_cor <- apply(W_farmland_mean,2,function(x){cor.test(W_farmland_mean$Years,x, method = 'pearson')$estimate})
It extracts the Pearson coefficient from the comparison of each column with the Years column of your dataset.
Example
With the mtcars dataset:
df <- mtcars[c(1:10),]
> df
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
And if we apply the function:
result_cor = apply(df,2, function(x){cor.test(x,df$mpg,method ='pearson')$estimate})
And you get the following output:
> result_cor
mpg cyl disp hp drat wt qsec
1.0000000 -0.8614165 -0.7739868 -0.8937223 0.5413585 -0.5991894 0.5494131
vs am gear carb
0.4796102 0.2919683 0.6646449 -0.3711956
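The question also asks for the result as a data frame. A possible sketch, again on the first ten rows of mtcars with mpg standing in for Years; the column names variable, estimate and p_value are my own choice:

```r
df <- mtcars[1:10, ]

# every column except the reference column
others <- setdiff(names(df), "mpg")

# one cor.test() per column, collected as one row each
cor_table <- do.call(rbind, lapply(others, function(v) {
  ct <- cor.test(df$mpg, df[[v]], method = "pearson")
  data.frame(variable = v,
             estimate = unname(ct$estimate),
             p_value  = ct$p.value)
}))

cor_table
```

Looping over column names rather than using apply() keeps the data frame from being converted to a matrix and makes it easy to carry the variable name along with each result.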
I am splitting a large data frame into smaller data frames of 5000 records each. After performing the rbind operation on each subsample, I want to shuffle the subsample data. When I try to shuffle the data, it neither throws an error nor shuffles the data. Can anyone help me reshuffle the data?
# splitting the dataframe into smaller dataframes
test_list <-split(New_data_zero, (seq(nrow(New_data_zero))-1) %/% 5000)
# performing the rbind to add data for all the data frames
for (i in 1: length(test_list)){
test_list[[i]] <- rbind(test_list[[i]],New_data)
}
# Trying to shuffle the each subsample but not performing the operation
for (i in 1: length(test_list)){
test_list[[i]] <- test_list[[i]][sample(1:nrow(test_list[[i]])),]
}
Try this
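myfun <- function(df, numobs) {
  # group index with exactly one entry per row: 1, 1, ..., 2, 2, ...
  grp <- ceiling(seq_len(nrow(df)) / numobs)
  sdf <- split(df, grp)
  # shuffle the rows within each chunk
  lapply(sdf, function(x) x[sample(nrow(x)), ])
}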
set.seed(1)
myfun(mtcars, 5)
Output
$`1`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
$`2`
mpg cyl disp hp drat wt qsec vs am gear carb
Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4
Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
etc
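To mirror the workflow in the question (split, append a second data frame to every chunk, then shuffle), the steps can be combined in a single lapply(). The data frames big and extra below are small made-up stand-ins for New_data_zero and New_data, and the chunk size of 5 stands in for 5000.

```r
set.seed(42)

# made-up stand-ins for New_data_zero and New_data
big   <- data.frame(id = 1:12, label = 0)
extra <- data.frame(id = 101:103, label = 1)

# split into chunks of (at most) 5 rows
chunks <- split(big, (seq_len(nrow(big)) - 1) %/% 5)

# append the extra rows to every chunk, then shuffle the combined rows
shuffled <- lapply(chunks, function(chunk) {
  combined <- rbind(chunk, extra)
  combined[sample(nrow(combined)), , drop = FALSE]
})

sapply(shuffled, nrow)   # 8 8 5
shuffled[[1]]
```

Each chunk keeps all of its rows; only their order changes, which can be confirmed by comparing the sorted id values before and after shuffling.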