Combine/merge multiple data frames by element names - r

I have data frames generated by lapply with distinct element names.
head(df1)
$Sample1
G1 G2 Group
1 1.016673 -1.04402692 1
2 1.019958 -0.86763046 1
3 1.033050 -1.09717438 1
4 1.036969 0.26971351 1
5 1.044059 1.73402959 1
$Sample2
G1 G2 Group
1 1.413218 0.22466456 1
2 1.413339 -0.91755436 1
3 1.415782 -0.23471118 1
4 1.434750 -0.77498973 1
5 1.436905 0.76642626 1
Another set is similar in format, specified by 2 under Group
head(df2)
$Sample1
G1 G2 Group
1 1.053269 -1.04460950 2
2 1.059461 -0.86711232 2
3 1.072446 -1.09748431 2
4 1.078763 0.26785751 2
5 1.038325 1.73818175 2
$Sample2
G1 G2 Group
1 1.438067 0.22933986 2
2 1.856085 -0.91988726 2
3 1.415782 -0.23405677 2
4 1.434750 -0.77406530 2
5 1.436905 0.76078091 2
My goal is to combine/merge them together by element names, for example Sample1 and Sample2.
$Sample1
G1 G2 Group
1 1.016673 -1.04402692 1
2 1.019958 -0.86763046 1
3 1.033050 -1.09717438 1
4 1.036969 0.26971351 1
5 1.044059 1.73402959 1
1 1.053269 -1.04460950 2
2 1.059461 -0.86711232 2
3 1.072446 -1.09748431 2
4 1.078763 0.26785751 2
5 1.038325 1.73818175 2
$Sample2
G1 G2 Group
1 1.413218 0.22466456 1
2 1.413339 -0.91755436 1
3 1.415782 -0.23471118 1
4 1.434750 -0.77498973 1
5 1.436905 0.76642626 1
1 1.438067 0.22933986 2
2 1.856085 -0.91988726 2
3 1.415782 -0.23405677 2
4 1.434750 -0.77406530 2
5 1.436905 0.76078091 2
I could not figure out how to do this. Could someone help me? Thanks!

Maybe try mapply and rbind:
a1 <- list(mtcars[1:5,],mtcars[6:10,])
a2 <- list(mtcars[11:15,],mtcars[16:20,])
> mapply(FUN = rbind,a1,a2,SIMPLIFY = FALSE)
[[1]]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
[[2]]
mpg cyl disp hp drat wt qsec vs am gear carb
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Equivalently (I think) in purrr would be map2(a1,a2,rbind).
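Note that mapply pairs the list elements by position, not by name. If the two lists might be ordered differently, a minimal sketch (assuming both lists are named as in the question; the small data frames below are made-up stand-ins for df1 and df2) is to index both lists by their shared names first:

```r
# Hypothetical stand-ins for the df1 and df2 lists in the question
df1 <- list(
  Sample1 = data.frame(G1 = c(1.02, 1.03), G2 = c(-1.04, -0.87), Group = 1),
  Sample2 = data.frame(G1 = c(1.41, 1.42), G2 = c(0.22, -0.92), Group = 1)
)
# Deliberately stored in a different order than df1
df2 <- list(
  Sample2 = data.frame(G1 = c(1.44, 1.86), G2 = c(0.23, -0.92), Group = 2),
  Sample1 = data.frame(G1 = c(1.05, 1.06), G2 = c(-1.04, -0.87), Group = 2)
)

# Element names present in both lists
common <- intersect(names(df1), names(df2))

# Index both lists by name so that rbind pairs Sample1 with Sample1, etc.
combined <- Map(rbind, df1[common], df2[common])
combined$Sample1
```

Map is mapply with SIMPLIFY = FALSE, so the result is again a named list of data frames.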

Related

apply function to dataframe's chosen column whilst grouping by another chosen column

I would like to apply the below function (cut.at.n.tile) to a data frame (some_data_frame) whilst grouping by a chosen column (e.g. SomeGroupingColumn) and choosing the target column (e.g. ChosenColumn). I tried using sapply() without success - see code below. Any input very much appreciated. Apologies for this not being fully replicable/self contained ...
cut.at.n.tile <- function(X, n = 7) {
cut(X, breaks = quantile(X, probs = (0:n)/n, na.rm = TRUE),
labels = 1:n, include.lowest = TRUE)
}
some_data_frame$SeasonTypeNumber = sapply(split(some_data_frame['ChosenColumn'], SomeGroupingColumn), cut.at.n.tile)
There are a few problems here.
some_data_frame['ChosenColumn'] always returns a single-column data.frame, not a vector which your function requires. I suggest switching to some_data_frame[['ChosenColumn']].
SomeGroupingColumn looks like it should be a column (hence the name) in the data, but it is not referenced within a frame. Perhaps some_data_frame[['SomeGroupingColumn']].
You need to ensure that the breaks= used are unique. For example,
cut.at.n.tile(subset(mtcars, cyl == 8)$disp)
# Error in cut.default(X, breaks = quantile(X, probs = (0:n)/n, na.rm = TRUE), :
# 'breaks' are not unique
If we debug that function, we see
X
# [1] 360.0 360.0 275.8 275.8 275.8 472.0 460.0 440.0 318.0 304.0 350.0 400.0 351.0 301.0
quantile(X, probs = (0:n)/n, na.rm = TRUE)
# 0% 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 100%
# 275.8000 275.8000 303.1429 336.2857 354.8571 371.4286 442.8571 472.0000
where 275.8 is repeated. This can happen based on nuances in the raw data, and you can't really predict when it will occur.
Since we'll likely have multiple groups, all of the subvectors' levels= (since cut returns a factor) must be the same length, though admittedly 1 in one group is unlikely to be the same as 1 in another group.
Since in this case we can never be certain which n-tile a number strictly belongs to (is 275.8 in the first or second n-tile?), we can only adjust one of the dupes and accept the imperfection. I suggest cumsum(duplicated(.)*1e-9): the premise is that it adds an iota to each value that is a dupe, rendering it no longer a dupe. It is possible that adding 1e-9 to one value will make it a dupe of the next ... so we can be a little OCD by repeating this until we have no duplicates.
sapply is unlikely to return a vector; it is almost certainly going to return a list (if the groups are not perfectly balanced) or a matrix (if they are perfectly balanced). We cannot simply unlist, since the order of the unlisted vectors will likely not be the order of the source data.
We can use `split<-`, or we can use a few other techniques (dplyr and/or data.table)
Updated function, and demonstration with mtcars:
cut.at.n.tile <- function(X, n = 7) {
brks <- quantile(X, probs = (0:n)/n, na.rm = TRUE)
while (any(dupes <- duplicated(brks))) brks <- brks + cumsum(1e-9*dupes)
cut(X, breaks = brks, labels = 1:n, include.lowest = TRUE)
}
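To see what the while loop does to the breaks in isolation, here is a small sketch on the same cyl == 8 subset used above (the variable names are only for illustration):

```r
X <- subset(mtcars, cyl == 8)$disp
n <- 7

# Raw quantile breaks; in the debug output above, 275.8 appears twice
brks_raw <- quantile(X, probs = (0:n)/n, na.rm = TRUE)

# Nudge every break from the first duplicate onward by 1e-9, and repeat
# until no duplicates remain
brks <- brks_raw
while (any(dupes <- duplicated(brks))) brks <- brks + cumsum(1e-9 * dupes)

anyDuplicated(brks)   # 0 means all breaks are now unique
brks - brks_raw       # size of the adjustment applied to each break
```

The adjustment is tiny relative to the data, so it should only affect values that sit exactly on a duplicated break.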
base R
ret <- lapply(split(mtcars[['disp']], mtcars[['cyl']]), cut.at.n.tile)
mtcars[["newcol"]] <- NA # create an empty column
split(mtcars[['newcol']], mtcars[['cyl']]) <- ret
mtcars
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 7
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 5
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 7
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 7
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4
# Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 4
# Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 1
# Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1
# Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 1
# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 7
# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 7
# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 6
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 2
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 5
# Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 3
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 3
# Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 4
# Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 6
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 3
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3
# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 4
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 6
Validation:
cut.at.n.tile(subset(mtcars, cyl == 8)$disp)
# [1] 5 5 1 1 1 7 7 6 3 3 4 6 4 2
# Levels: 1 2 3 4 5 6 7
subset(mtcars, cyl == 8)$newcol
# [1] 5 5 1 1 1 7 7 6 3 3 4 6 4 2
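As a possible alternative (not part of the original answer), ave wraps the same split-and-reassign pattern in a single call. A sketch, converting the factor to integer because ave writes the results back into a copy of its numeric first argument:

```r
cut.at.n.tile <- function(X, n = 7) {
  brks <- quantile(X, probs = (0:n)/n, na.rm = TRUE)
  while (any(dupes <- duplicated(brks))) brks <- brks + cumsum(1e-9 * dupes)
  cut(X, breaks = brks, labels = 1:n, include.lowest = TRUE)
}

# ave() splits disp by cyl, applies FUN to each piece, and puts the
# results back in the original row order
newcol <- ave(mtcars$disp, mtcars$cyl,
              FUN = function(x) as.integer(cut.at.n.tile(x)))

# Should agree with calling the function on the cyl == 8 rows directly
newcol[mtcars$cyl == 8]
```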
dplyr
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(newcol = cut.at.n.tile(disp)) %>%
ungroup()
# # A tibble: 32 × 12
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 2
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 7
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 5
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 6
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 5
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 7
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 7
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 4
# # … with 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
data.table
library(data.table)
as.data.table(mtcars)[, newcol := cut.at.n.tile(disp), by = .(cyl)][]
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <fctr>
# 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
# 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
# 3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
# 4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 7
# 5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
# 6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
# 7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 5
# 8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 7
# 9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 7
# 10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4
# ---
# 23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 3
# 24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 4
# 25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 6
# 26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 3
# 27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5
# 28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3
# 29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 4
# 30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
# 31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
# 32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 6

How to randomly sample multiple consecutive rows of a dataframe in R?

I've a dataframe with 100 rows and 20 columns and want to randomly sample 5 times 10 consecutive rows, e.g. 10:19, 25:34, etc. With: sample_n( df, 5 ) I'm able to extract 5 unique, randomly sampled rows, but don't know how to sample consecutive rows. Any help? Thanks!
df <- mtcars
df$row_nm <- seq(nrow(df))
set.seed(7)
sample_seq <- function(n, N) {
i <- sample(seq(N), size = 1)
ifelse(
test = i + (seq(n) - 1) <= N,
yes = i + (seq(n) - 1),
no = i + (seq(n) - 1) - N
)
}
replica <- replicate(n = 5, sample_seq(n = 10, N = nrow(df)))
# result
lapply(seq(ncol(replica)), function(x) df[replica[, x], ])
#> [[1]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 11
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 15
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 17
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 18
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#>
#> [[2]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 20
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 21
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 22
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 23
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 24
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 25
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 26
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 27
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 28
#>
#> [[3]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 31
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 32
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 7
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 8
#>
#> [[4]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 28
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 29
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 30
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 31
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 32
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
#>
#> [[5]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 7
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 8
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 9
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 11
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 15
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16
Created on 2022-01-24 by the reprex package (v2.0.1)
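The wrap-around that sample_seq builds with ifelse can also be written with modular arithmetic. This is only a sketch of the same idea; sample_seq_mod is a made-up name:

```r
# Draw a random start i in 1..N and return n consecutive indices,
# wrapping from N back to 1 when the run passes the last row
sample_seq_mod <- function(n, N) {
  i <- sample.int(N, size = 1)
  ((i - 1 + seq_len(n) - 1) %% N) + 1
}

set.seed(7)
idx <- sample_seq_mod(n = 10, N = nrow(mtcars))
mtcars[idx, ]
```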
You could do something like:
library(data.table)
# sample data
df <- data.table(value = 1:100000)
# Sample consecutive rows.
# x = data frame, rows = number of consecutive rows per sample,
# nr = number of samples to draw
sample_fun <- function(x, rows, nr) {
  # valid start positions: the last one must leave room for `rows` rows
  starts <- seq_len(nrow(x) - rows + 1)
  # randomly sample nr start positions
  sampled.starts <- sample(starts, nr)
  # expand each start into `rows` consecutive row indices
  sampled.rows <- unlist(lapply(sampled.starts, function(s) seq(s, s + rows - 1)))
  # subset and return
  x[sampled.rows, ]
}
sample_fun(x = df, rows = 5, nr = 2)
You don't mention if this can include replacement (i.e. if you sample 10:19, can you then also sample 15:24?). You also don't mention if you can sample anything above row 91, which would mean that sample of 10 gets cut off (i.e. 98,99,100 would only be 3 consecutive rows unless you want that to loop back to row 1). Assuming you can sample any value with replacement, the solution can be done in one line:
sapply(sample(1:100,5),function(x){seq(x,x+9)})
This applies the sequence function to each of 5 individually sampled numbers. The output will be a matrix, where each column is a sample of 10 consecutive rows, but as noted, these will potentially overlap, or go above 100.
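As a rough sketch of how that matrix could be used (df here is a made-up 100-row data frame), each column can index the data frame, and columns that run past the last row can be flagged first:

```r
set.seed(42)
df <- data.frame(id = 1:100)   # hypothetical 100-row data frame

# One column per sample, ten consecutive row numbers per column
idx <- sapply(sample(1:100, 5), function(x) seq(x, x + 9))
dim(idx)                       # 10 rows, 5 columns

# Columns whose indices run past the last row of df
overflow <- colSums(idx > nrow(df)) > 0

# Extract only the samples that fit entirely inside df
samples <- lapply(which(!overflow), function(j) df[idx[, j], , drop = FALSE])
```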
If you want a solution where the rows do not overlap at all and no value exceeds 100, without making values above 91 less likely to be sampled, this actually gets kind of tricky, but I think the code below should work. You can't just sample start values from 1:91 without affecting the probabilities of your random sample, because then a row like 100 can only be reached through a single start value (91), whereas rows in the middle can be reached through up to ten different start values. This solution makes it so all rows are equally likely to be sampled.
Rows=c(1:100,1:100)
SampleRows=matrix(0,nrow=10,ncol=5)
for(i in 1:ncol(SampleRows)){
SampledValue=sample(Rows,1)
RowsIndex=min(which(Rows==SampledValue))
Sequence=Rows[RowsIndex:(RowsIndex+9)]
SampleRows[,i]=Sequence
Rows=Rows[!(Rows %in% Sequence)]
}
This approach creates a vector that sequences from 1:100, repeated twice (variable Rows), you'll see why this is important in a bit. For each of 5 iterations (corresponding to 5 samples), we take a sampled value from Rows, which will be a number 1:100, we then figure out where that number is in Rows, and take all 9 values next to it. In the first sample this will always be 10 consecutive numbers (e.g. 20:29). But then we remove those sampled values from Rows. If we happen to get the next sample as a value that would lead to overlap (like 18), then instead it samples (18,19,30,31,32,33,34...) since 20:29 have been removed. We need to do 1:100 twice in Rows, so that if we sample a value like 99, it resets from 100, back to 1.
If you want your output as a vector, add this at the end:
sort(as.vector(SampleRows))
Let me know if this works for the needs of your problem.

How do I speed up this specific for loop?

I've looked at other threads and tried to apply it to my code but have had no luck.
CDR3_post_challenge_unique_clonecount$participant_per_cdr3aa <- as.numeric(CDR3_post_challenge_unique_clonecount$cdr3aa)
participant_list <- unique(CDR3_post_challenge_unique_clonecount$cdr3aa)
for (c in participant_list)
{
CDR3_post_challenge_unique_clonecount$participant_per_cdr3aa[CDR3_post_challenge_unique_clonecount$cdr3aa == c] <- length(unique(CDR3_post_challenge_unique_clonecount$PartID[CDR3_post_challenge_unique_clonecount$cdr3aa == c]))
}
Here is a bit of the dataframe:
cdr3aa clonecount PartID
CAAGRAARGGSVPHWFDPF 1 S-1
CAALADSGSQTDAFDIA 1 S-1
CAFHAAYGSQHGLDVW 1 S-1
CAGGLAWLVDDW 1 S-1
CAGRWFFPW 1 S-1
CAGVKNGRGMDVW 1 S-1
I think you can replace the for loop with
CDR3_post_challenge_unique_clonecount$per3 <-
as.integer(
ave(CDR3_post_challenge_unique_clonecount$PartID,
CDR3_post_challenge_unique_clonecount$cdr3aa,
FUN = function(z) length(unique(z)))
)
I'll demonstrate with mtcars, using the following analogs:
mtcars --> CDR3_post_challenge_unique_clonecount
cyl --> cdr3aa, the categorical variable in which we want to count PartID
drat --> PartID, the thing we want to count (uniquely) within each cdr3aa
mtcars$drat_per_cyl <- ave(mtcars$drat, mtcars$cyl, FUN = function(z) length(unique(z)))
mtcars
# mpg cyl disp hp drat wt qsec vs am gear carb drat_per_cyl
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 5
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 5
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 10
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 11
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 5
# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 11
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 10
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 5
# Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 5
# Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 11
# Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 11
# Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 11
# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 11
# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 11
# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 11
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 10
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 10
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 10
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 10
# Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 11
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 11
# Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 11
# Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 11
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 10
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 10
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 10
# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 11
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 5
# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 11
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 10
Notes:
ave is a little brain-dead in that the class of the return value is always the same as the class of the first argument. This means that one cannot count unique "character" and expect to get an integer, it is instead returned as a string. It's because of this that I wrap ave in as.integer(.).
ave returns a vector the same length as the input, with values corresponding 1-for-1 (meaning the order is relevant and preserved). In my example of mtcars, this means that it is effectively doing something like this:
ind4 <- which(mtcars$cyl == 4L)
ind4
# [1] 3 8 9 18 19 20 21 26 27 28 32
length(unique(mtcars$drat[ind4]))
# [1] 10
ind6 <- which(mtcars$cyl == 6L)
ind6
# [1] 1 2 4 6 10 11 30
length(unique(mtcars$drat[ind6]))
# [1] 5
### ...
but it will place the return value 10 in the ind4 positions of the return value. For example, because of my ind6, the return value will start with
c(5, 5, .., 5, .., 5, .., .., .., 5, 5, .., .....)
Because of ind4, it will contain
c(.., .., 10, .., .., .., .., 10, 10, .....)
(And same for cyl==8L.)
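The class issue from the first note is easier to see with character data shaped like the question's. The tiny data frame below is made up for illustration:

```r
# Hypothetical miniature of the question's data
d <- data.frame(
  cdr3aa = c("A", "A", "B", "B", "C"),
  PartID = c("S-1", "S-2", "S-1", "S-1", "S-3")
)

# Without the wrapper, ave() hands the counts back as character
raw <- ave(d$PartID, d$cdr3aa, FUN = function(z) length(unique(z)))
class(raw)

# Wrapping in as.integer() gives usable counts
d$per3 <- as.integer(raw)
d
```

Here sequence "A" is seen in two participants, while "B" and "C" are each seen in one.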

PCA analysis: getting error in dim desc(): not convenient data

I have conducted PCA on a set of data using prcomp. As a final step I am trying to use the dimdesc() function from FactoMineR to obtain p-values that identify the most significantly associated variables with my principal components.
The data frame has seven variables all of which are numerical and there are no missing values. The names are standard names such as "RCH_Home" (just in case the names could be problematic).
I write the following function:
res.desc <- dimdesc(df_PCA, axes = c(1:2), proba = 0.05)
And get the following error message:
Error in dimdesc(df_PCA, axes = c(1:2), proba = 0.05) : non convenient data
Any idea what might be going on?
Thanks!!!!
You should use the PCA function from FactoMineR instead of prcomp.
Below is an example of PCA with FactoMineR.
library(FactoMineR)
library(factoextra)
library(paran)
data(mtcars)
mtcars_pca<-cars_pca<-PCA(mtcars)
If you want to check the percentage of variance, you can do this:
mtcars_pca$eig
eigenvalue percentage of variance cumulative percentage of variance
comp 1 6.60840025 60.0763659 60.07637
comp 2 2.65046789 24.0951627 84.17153
comp 3 0.62719727 5.7017934 89.87332
comp 4 0.26959744 2.4508858 92.32421
comp 5 0.22345110 2.0313737 94.35558
comp 6 0.21159612 1.9236011 96.27918
comp 7 0.13526199 1.2296544 97.50884
comp 8 0.12290143 1.1172858 98.62612
comp 9 0.07704665 0.7004241 99.32655
comp 10 0.05203544 0.4730495 99.79960
comp 11 0.02204441 0.2004037 100.00000
Cos2 stands for squared cosine and is an index for the quality representation of both variables and individuals. The closer this value is to one, the better the quality.
mtcars_pca$var$cos2
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
mpg 0.8685312 0.0006891117 0.031962249 1.369725e-04 0.0023634487
cyl 0.9239416 0.0050717032 0.019276287 1.811054e-06 0.0007642822
disp 0.8958370 0.0064482423 0.002370993 1.775235e-02 0.0346868281
hp 0.7199031 0.1640467049 0.012295659 1.234773e-03 0.0651697911
drat 0.5717921 0.1999959326 0.016295731 1.970035e-01 0.0013361275
wt 0.7916038 0.0542284172 0.073281663 1.630161e-02 0.0012578888
qsec 0.2655437 0.5690984542 0.101947952 1.249426e-03 0.0060588455
vs 0.6208539 0.1422249798 0.115330572 1.244460e-02 0.0803189801
am 0.3647715 0.4887450097 0.026555457 2.501834e-04 0.0018011675
gear 0.2829342 0.5665806069 0.052667265 1.888829e-02 0.0005219259
carb 0.3026882 0.4533387304 0.175213444 4.333912e-03 0.0291718181
res.desc <- dimdesc(mtcars_pca, axes = c(1:2), proba = 0.05)
> head(res.desc)
$Dim.1
$quanti
correlation p.value
cyl 0.9612188 2.471950e-18
disp 0.9464866 2.804047e-16
wt 0.8897212 9.780198e-12
hp 0.8484710 8.622043e-10
carb 0.5501711 1.105272e-03
qsec -0.5153093 2.542578e-03
gear -0.5319156 1.728737e-03
am -0.6039632 2.520665e-04
drat -0.7561693 5.575736e-07
vs -0.7879428 8.658012e-08
mpg -0.9319502 9.347042e-15
attr(,"class")
[1] "condes" "list "
$Dim.2
$quanti
correlation p.value
gear 0.7527155 6.712704e-07
am 0.6991030 8.541542e-06
carb 0.6733043 2.411011e-05
drat 0.4472090 1.028069e-02
hp 0.4050268 2.147312e-02
vs -0.3771273 3.335771e-02
qsec -0.7543861 6.138696e-07
attr(,"class")
[1] "condes" "list "
$call
$call$num.var
[1] 1
$call$proba
[1] 0.05
$call$weights
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
$call$X
Dim.1 mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 -0.6572132031 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag -0.6293955058 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 -2.7793970426 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive -0.3117707086 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 1.9744889419 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant -0.0561375337 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 3.0026742880 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D -2.0553287289 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 -2.2874083842 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 -0.5263812077 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C -0.5092054932 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 2.2478104359 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 2.0478227622 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 2.1485421615 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 3.8997903717 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 3.9541231097 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 3.5929719882 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 -3.8562837567 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic -4.2540325032 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla -4.2342207436 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona -1.9041678566 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 2.1848507430 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 1.8633834347 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 2.8889945733 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 2.2459189274 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 -3.5739682964 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 -2.6512550541 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa -3.3857059882 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 1.3729574238 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino -0.0009899207 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 2.6691258658 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E -2.4205931001 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
In short, you should use functions from the same package.
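A quick way to check which kind of object you have is to look at its class. The sketch below assumes, as the error message suggests, that dimdesc only accepts objects created by FactoMineR itself; the FactoMineR part runs only if the package is installed:

```r
# prcomp() returns an object of class "prcomp", not FactoMineR's "PCA"
res_prcomp <- prcomp(mtcars, scale. = TRUE)
class(res_prcomp)             # "prcomp"
inherits(res_prcomp, "PCA")   # FALSE

if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # graph = FALSE suppresses the default plots
  res_pca <- FactoMineR::PCA(mtcars, graph = FALSE)
  res_desc <- FactoMineR::dimdesc(res_pca, axes = 1:2, proba = 0.05)
  names(res_desc)
}
```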
If you want to choose how many dimensions to keep, you can do this with the paran package:
library(paran)
cars_paran<-paran(mtcars, graph = TRUE)

Modifying dataframe within a list-column

I have a dataframe with a list-column which itself contains dataframes (see below). Essentially, I am trying to add values from another column in the parent dataframe into the smaller dataframe by creating another column.
This is a simplified example- my real application is more complex.
library(tidyverse)
# What I am trying to do: add column "a" to dataframe within the list column
add_column(mtcars, a = 1)
#> mpg cyl disp hp drat wt qsec vs am gear carb a
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 1
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 1
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 1
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 1
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 1
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 1
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 1
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 1
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 1
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 1
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 1
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 1
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 1
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 1
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 1
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 1
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 1
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 1
Then create list-column:
(df <- tibble(data = rep(list(mtcars), times = 3), a = 1:3))
#> # A tibble: 3 x 2
#> data a
#> <list> <int>
#> 1 <data.frame [32 x 11]> 1
#> 2 <data.frame [32 x 11]> 2
#> 3 <data.frame [32 x 11]> 3
But this doesn't work:
df %>%
rowwise() %>%
modify_at("data", ~ add_column(., a = a))
# Error in eval_tidy(xs[[i]], unique_output): object 'a' not found
We may use
df %>% mutate(data = data %>% map2(a, ~add_column(.x, a = .y)))
In this way we start by mutating a column as usual, but then recognising that it's a list we use map2 along with the a column.
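The element-wise pairing that map2 performs can be sketched in base R with Map, which walks two inputs in parallel. The objects below are simplified stand-ins, not the tibble from the question:

```r
# Three copies of a small data frame, plus the values to attach to them
data_list <- rep(list(head(mtcars, 3)), times = 3)
a <- 1:3

# Pair the i-th data frame with the i-th value of a and add it as a column
out <- Map(function(d, a_i) cbind(d, a = a_i), data_list, a)

out[[2]]
```

Each data frame gets its own value of a, recycled down its rows.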
