how to give right context to subset in R - r

I am wondering what is the right way of making subset understand context of each variable. For instance, consider the following function:
> f <- function(num) {
subset(mtcars, carb == num)
}
> f(2)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Now, consider the case where the name of argument of f is also coincidentally carb:
> f <- function(carb) {
subset(mtcars, carb == carb)
}
> f(2)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
..............
This obviously, doesn't work. Wondering what is the right way of fixing this. I would have expected the following to work, but it doesn't. Could someone elaborate?
> f <- function(carb, env=parent.frame()) {
+ subset(mtcars, carb == eval(substitute(carb), env))
+ }
Thanks in advance

This works for me.
carb <- 2
mtcars[mtcars$carb == carb, ]
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Or alternatively
x <- 2
with(mtcars, mtcars[carb == x, ]) # same output as above

Related

Passing a quoted function argument to a three-dots argument inside another function using base R

I would like to pass quoted variables in the group argument of geom_col_wrap to the split_group function.
# I deleted the rest of the function for readability
geom_col_wrap = function(data, mapping, group, ...) {
data |>
split_group(group)
}
# This function was based on the `tidytable` package
split_group = function(data, ...) {
by_quote = as.list(substitute(...()))
by = sapply(by_quote, deparse)
split = vctrs::vec_split(data, data[c(by)])
out = split[["val"]]
names = do.call(paste, c(split[["key"]], sep = "_"))
names(out) = names
return(out)
}
split_group use substitute to quote variables, here is the problem. How can I make split_group recognize quote variables from group argument? I know it is easy to solve using rlang, but I need a R base solution.
split_group(mtcars, vs, am)
$`0_1`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
$`1_1`
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
...
$`1_0`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
...
$`0_0`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
...
geom_col_wrap(
mtcars,
mapping = ggplot2::aes(x = cyl, y = hp, color = am),
group = c(vs, am)
)
Error in `[.data.frame`(data, c(by)) : undefined columns selected
This error comes from as.list(substitute(...())). It does not unquoted the group argument. Why?
Note: I cannot use dots arg to solve the problem.
Using the miraculous ...() chain, explanation is given here.
split_group <- \(x, ...) split(x, x[, sapply(substitute(...()), as.character)])
split_group(mtcars, vs, am)
# $`0.0`
# mpg cyl disp hp drat wt qsec vs am gear carb
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
# Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
# Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3
# ...
#
# $`1.0`
# mpg cyl disp hp drat wt qsec vs am gear carb
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# ...
#
# $`0.1`
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# Porsche 914-2 26 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# ...
#
# $`1.1`
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
You basically need the base R version of rlang's {{ group }} or !!enquo(group) workflow. Which would be using substitute() to grab your group argument, and then using .(group) inside bquote().
However bquote() helps you build the expression, we then need to use eval() to evaluate your new expression.
Another thing - you're using deparse() in split_group() which would convert c(vs, am) to "c(vs, am)". Instead we'll need to mimic tidyselect so you can use c() style selection (that also still works without c() for a single column).
Put together it looks like this.
split_group = function(data, ...) {
by_quote = as.list(substitute(...()))
# Mimic tidyselect
cols = as.list(seq_along(data))
names(cols) = names(data)
by = unlist(lapply(by_quote, eval, cols))
split = vctrs::vec_split(data, data[c(by)])
out = split[["val"]]
names = do.call(paste, c(split[["key"]], sep = "_"))
names(out) = names
return(out)
}
geom_col_wrap = function(data, mapping, group, ...) {
# Use substitute/bquote to "unquote" group arg inside split_group function
# Much like using `{{ group }}` or `!!enquo(group)` in rlang
group = substitute(group)
eval(bquote(
data |>
split_group(.(group))
))
}
geom_col_wrap(
mtcars,
mapping = ggplot2::aes(x = cyl, y = hp, color = am),
group = c(vs, am)
)
#> $`0_1`
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#>
#> $`1_1`
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
#>
#> $`1_0`
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#>
#> $`0_0`
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Any reason you can't use rlang? vctrs depends on rlang so you're already sort of using it anyway.

PCA analysis: getting error in dim desc(): not convenient data

I have conducted PCA on a set of data using prcomp. As a final step I am trying to use the dimdesc() function from FactoMineR to obtain p-values that identify the most significantly associated variables with my principal components.
The data frame has seven variables all of which are numerical and there are no missing values. The names are standard names such as "RCH_Home" (just in case the names could be problematic).
I write the following function:
res.desc <- dimdesc(df_PCA, axes = c(1:2), proba = 0.05)
And get the following error message:
Error in dimdesc(df_PCA, axes = c(1:2), proba = 0.05) : non convenient data
Any idea what might be going on?
Thanks!!!!
You should use the PCA function in sostitution of the prcomp
Below an example of PCA with FactoMineR.
library(FactoMineR)
library(factoextra)
library(paran)
data(cars)
mtcars_pca<-cars_pca<-PCA(mtcars)
If you want to check the percentage of variance, you can do this:
mtcars_pca$eig
> mtcars_pca$eig
eigenvalue percentage of variance cumulative percentage of variance
comp 1 6.60840025 60.0763659 60.07637
comp 2 2.65046789 24.0951627 84.17153
comp 3 0.62719727 5.7017934 89.87332
comp 4 0.26959744 2.4508858 92.32421
comp 5 0.22345110 2.0313737 94.35558
comp 6 0.21159612 1.9236011 96.27918
comp 7 0.13526199 1.2296544 97.50884
comp 8 0.12290143 1.1172858 98.62612
comp 9 0.07704665 0.7004241 99.32655
comp 10 0.05203544 0.4730495 99.79960
comp 11 0.02204441 0.2004037 100.00000
Cos2 stands for squared cosine and is an index for the quality representation of both variables and individuals. The closer this value is to one, the better the quality.
mtcars_pca$var$cos2
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
mpg 0.8685312 0.0006891117 0.031962249 1.369725e-04 0.0023634487
cyl 0.9239416 0.0050717032 0.019276287 1.811054e-06 0.0007642822
disp 0.8958370 0.0064482423 0.002370993 1.775235e-02 0.0346868281
hp 0.7199031 0.1640467049 0.012295659 1.234773e-03 0.0651697911
drat 0.5717921 0.1999959326 0.016295731 1.970035e-01 0.0013361275
wt 0.7916038 0.0542284172 0.073281663 1.630161e-02 0.0012578888
qsec 0.2655437 0.5690984542 0.101947952 1.249426e-03 0.0060588455
vs 0.6208539 0.1422249798 0.115330572 1.244460e-02 0.0803189801
am 0.3647715 0.4887450097 0.026555457 2.501834e-04 0.0018011675
gear 0.2829342 0.5665806069 0.052667265 1.888829e-02 0.0005219259
carb 0.3026882 0.4533387304 0.175213444 4.333912e-03 0.0291718181
res.desc <- dimdesc(mtcars_pca, axes = c(1:2), proba = 0.05)
> head(res.desc)
$Dim.1
$quanti
correlation p.value
cyl 0.9612188 2.471950e-18
disp 0.9464866 2.804047e-16
wt 0.8897212 9.780198e-12
hp 0.8484710 8.622043e-10
carb 0.5501711 1.105272e-03
qsec -0.5153093 2.542578e-03
gear -0.5319156 1.728737e-03
am -0.6039632 2.520665e-04
drat -0.7561693 5.575736e-07
vs -0.7879428 8.658012e-08
mpg -0.9319502 9.347042e-15
attr(,"class")
[1] "condes" "list "
$Dim.2
$quanti
correlation p.value
gear 0.7527155 6.712704e-07
am 0.6991030 8.541542e-06
carb 0.6733043 2.411011e-05
drat 0.4472090 1.028069e-02
hp 0.4050268 2.147312e-02
vs -0.3771273 3.335771e-02
qsec -0.7543861 6.138696e-07
attr(,"class")
[1] "condes" "list "
$call
$call$num.var
[1] 1
$call$proba
[1] 0.05
$call$weights
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
$call$X
Dim.1 mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 -0.6572132031 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag -0.6293955058 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 -2.7793970426 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive -0.3117707086 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 1.9744889419 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant -0.0561375337 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 3.0026742880 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D -2.0553287289 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 -2.2874083842 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 -0.5263812077 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C -0.5092054932 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 2.2478104359 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 2.0478227622 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 2.1485421615 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 3.8997903717 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 3.9541231097 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 3.5929719882 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 -3.8562837567 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic -4.2540325032 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla -4.2342207436 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona -1.9041678566 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 2.1848507430 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 1.8633834347 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 2.8889945733 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 2.2459189274 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 -3.5739682964 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 -2.6512550541 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa -3.3857059882 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 1.3729574238 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino -0.0009899207 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 2.6691258658 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E -2.4205931001 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
it should be use the function from the same package.
If you want to choose how many dimensions you need, you can do this by the param package
library(paran)
cars_paran<-paran(mtcars, graph = TRUE)

Unexpected result when using a vector to select rows from a dataframe in R [duplicate]

This question already has answers here:
Filter causes data missing in R [duplicate]
(1 answer)
Filter data.frame rows by a logical condition
(9 answers)
Closed 3 years ago.
I was playing around with subsetting rows from a dataframe in R. The following code selects only those rows with a cyl value of 4 or 6 from mtcars:
> mtcars[mtcars$cyl %in% c(4, 6), ]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
So far so good. Then, just for fun and because I wasn't sure what it would do (I thought it didn't make sense) I substited == for %in%:
> mtcars[mtcars$cyl == c(4, 6), ]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Note how this returns some but not all rows with a cyl value of 4 or 6. I wasn't sure what to expect in the first place, but now I'm left wondering: why does this return a subset of rows with a cyl value of 4 or 6, and what is the logic? (That is, why does it return only these specific rows?)
I'm including the full dataframe for reference.
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Apply function with nested if statement R

So I understand how the apply function should be used but I am not sure how to integrate with an IF statement. Here is my attempt and can someone please push me in the right direction:
data <- mtcars
apply(data, 1, function(x) {
if (data$mpg < 20) {
data$colour <- "blue"
} else {
data$colour <- "red"
}
})
I just want to add a column to data for each row of the data frame for values in data$mpg between certain ranges.
You can do this using vectorization, which is preferred in R due to its speed:
data <- mtcars
data$colour <- ifelse(data$mpg < 20, data$colour <- "blue", data$colour <- "red")
This yields the following data.frame:
mpg cyl disp hp drat wt qsec vs am gear carb colour
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 red
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 red
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 red
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 red
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 blue
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 blue
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 blue
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 red
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 red
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 blue
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 blue
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 blue
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 blue
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 blue
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 blue
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 blue
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 blue
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 red
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 red
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 red
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 red
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 blue
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 blue
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 blue
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 blue
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 red
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 red
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 red
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 blue
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 blue
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 blue
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 red
A base based option using within:
head(within(mtcars,{
my_col <-ifelse(mpg < 20, "blue", "red")
}),3)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
my_col
Mazda RX4 red
Mazda RX4 Wag red
Datsun 710 red
Or with sapply which in my experience is a bit faster than applying at a margin of 2:
mtcars$colour<-sapply(mtcars[,"mpg"], function(x) ifelse(x<20,"blue","red"))
#rm(mtcars)
#data(mtcars)
#restores mtcars^^

How to leave out rows with missing values when total no. of values crosses a given value in R

I have a dataset which contains 45% of Missing values:
I would like to remove the rows which has NA's values for a given period. for example, if there are rows continuously has missing values ,for almost an hour or more than 50 values missing continuously , i want to remove that rows alone.
And i don't want to leave the rows with missing values less than 15 or 25.
In short,
1) I don't want to remove all rows that has got NA value's.
2) I want to remove rows that continuously has NA values in a column
example data:
pic
Discard columnwise contiguous NAs
Try this, which uses rle(is.na...)) to determine runs of NAs. If any are > num_runs then it is discarded (Data at bottom)
myfun <- function(x, num_runs) {
# x is vector column of df
require(dplyr)
runs <- cumsum(rle(is.na(x))$lengths)
vals <- rle(is.na(x))$values
start <- dplyr::lag(runs)+1
start <- replace(start, is.na(start), 1)
M <- rbind(start[vals], runs[vals])
seqruns <- apply(M, 2, function(x) if ((x[2]-x[1]+1) > num_runs) { seq(x[1],x[2]) })
ans <- unlist(seqruns)
return(ans)
}
library(purrr)
library(dplyr)
num_runs <- 4
discard <- unlist(map(1:ncol(df), ~myfun(df[,.x, num_runs])))
df[-discard,]
Output
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 NA 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 NA 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 NA 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 NA 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Discard rowwise contiguous NAs
Try this, which uses rle(is.na...)) to determine runs of NAs. If any are > num_runs then it is discarded (Data at bottom)
library(purrr)
num_runs <- 1 # number of contiguous NAs
keep <- map_lgl(1:nrow(df), ~!any(rle(is.na(unlist(df[.x,])))$lengths[rle(is.na(unlist(df[.x,])))$values] > num_runs))
df[keep,]
Output
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 NA 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 NA 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 NA 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 NA 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 NA 4
Lincoln Continental 10.4 8 460.0 215 NA 5.424 17.82 0 0 NA 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 NA 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 NA 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 NA 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 NA 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 NA 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 NA 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Data
library(dplyr)
df <- mtcars %>% replace(.==3, NA)

Resources