R sorting data subset

I'm learning to use R (version 3.1.2), so this may come as a noob question, but I'm having problems ordering a subset of a data frame. If I use the mtcars data frame using attach(mtcars), I can easily order it using ord.cars <- mtcars[order(hp),]. The problem is, if I use a subset, let's say sub.cars <- subset(mtcars, hp > 120) and try to order it using ord.sub <- sub.cars[order(mpg),], the result is the following:
mpg cyl disp hp drat wt qsec vs am gear carb
NA NA NA NA NA NA NA NA NA NA NA NA
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
NA.1 NA NA NA NA NA NA NA NA NA NA NA
NA.2 NA NA NA NA NA NA NA NA NA NA NA
NA.3 NA NA NA NA NA NA NA NA NA NA NA
NA.4 NA NA NA NA NA NA NA NA NA NA NA
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
NA.5 NA NA NA NA NA NA NA NA NA NA NA
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
NA.6 NA NA NA NA NA NA NA NA NA NA NA
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
NA.7 NA NA NA NA NA NA NA NA NA NA NA
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
NA.8 NA NA NA NA NA NA NA NA NA NA NA
NA.9 NA NA NA NA NA NA NA NA NA NA NA
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
NA.10 NA NA NA NA NA NA NA NA NA NA NA
NA.11 NA NA NA NA NA NA NA NA NA NA NA
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
NA.12 NA NA NA NA NA NA NA NA NA NA NA
NA.13 NA NA NA NA NA NA NA NA NA NA NA
NA.14 NA NA NA NA NA NA NA NA NA NA NA
Why is R putting back as NAs all the rows that were left out of the subset?
Thanks in advance!

This problem comes from your use of attach(), which is not recommended in R for exactly this reason. With an attached data frame, a bare column name is ambiguous, or at least it refers to something other than what you expected.
How to resolve this?
Detach the data set and don't use attach() again. Instead, use [ and/or $ (and, if you like, with()) to subset your data.
Here's how you could do it for the example:
detach(mtcars)
ord.cars <- mtcars[order(mtcars$hp),]
sub.cars <- subset(mtcars, hp > 120)
#the subset could also be written as:
sub.cars <- mtcars[mtcars$hp > 120,]
ord.sub <- sub.cars[order(sub.cars$mpg),]
head(ord.sub) # only show the first 6 rows
mpg cyl disp hp drat wt qsec vs am gear carb
Cadillac Fleetwood 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
Lincoln Continental 10.4 8 460 215 3.00 5.42 17.8 0 0 3 4
Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
Chrysler Imperial 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
What exactly caused the problem in your code?
After you attach mtcars, any bare column name such as mpg refers to the column in the attached (original) mtcars data. You then subsetted the data and stored the result in a new object, sub.cars, which was not attached while mtcars still was. So in sub.cars[order(mpg),], mpg is the column from the original mtcars, which has 32 elements, more than sub.cars has rows. order(mpg) therefore returns a permutation of 1:32. Every index larger than nrow(sub.cars) has no matching row in sub.cars and comes back as a row of NAs, and the rows that do come back are not sorted by the subset's own mpg either.
Lesson: don't use attach().
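To make the mechanism concrete, here is a minimal sketch that reproduces the NA rows without attach() at all. The names idx and bad are only for illustration; mtcars$mpg stands in for what a bare mpg resolves to while mtcars is attached.

```r
# Subset first: only the cars with more than 120 hp remain
sub.cars <- subset(mtcars, hp > 120)

# This is what order(mpg) computes while mtcars is attached:
# a permutation of 1:32 based on the ORIGINAL data
idx <- order(mtcars$mpg)

# Indexing the smaller data frame with those positions:
# every position above nrow(sub.cars) has no row, so R returns an NA row
bad <- sub.cars[idx, ]
sum(idx > nrow(sub.cars))   # number of all-NA rows in 'bad'

# Ordering by the subset's own column, here via with()
ord.sub <- with(sub.cars, sub.cars[order(mpg), ])
```

with() evaluates order(mpg) inside sub.cars only for that one expression, so nothing stays on the search path afterwards.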

Related

Recode dataframe values to NA per column

How to recode some dataframe values to NA if they don't appear in a separate vector?
More specifically, how to approach such task when:
each data column to clean has its specific set of "valid" values to keep, independent of other columns
column-specific values are given in a separate table (as vectors nested in a list-column in a tibble)
Example
My data to clean up is my_mtcars
I want to clean up certain columns (cars, gear, and carb)
In each of those columns, I want to keep only certain values as they are specified in a separate table table_valid_values under valid_values. Otherwise, values not specified as "valid" should turn to NA.
For any column of my_mtcars that does not appear in table_valid_values, no cleanup is needed.
library(tibble)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
my_mtcars <- rownames_to_column(mtcars, "cars")
as_tibble(my_mtcars)
#> # A tibble: 32 x 12
#> cars mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 ~ 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 Hornet 4 D~ 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 Hornet Spo~ 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
table_valid_values <-
structure(
list(
var_name = c("cars", "gear", "carb"),
valid_values = list(
c("Valiant", "AMC Javelin", "Ferrari Dino"),
c(3, 5),
c(1, 4, 6)
)
),
row.names = c(NA, -3L),
class = c("tbl_df", "tbl", "data.frame")
)
table_valid_values
#> # A tibble: 3 x 2
#> var_name valid_values
#> <chr> <list>
#> 1 cars <chr [3]>
#> 2 gear <dbl [2]>
#> 3 carb <dbl [3]>
table_valid_values %>%
pull(valid_values)
#> [[1]]
#> [1] "Valiant" "AMC Javelin" "Ferrari Dino"
#>
#> [[2]]
#> [1] 3 5
#>
#> [[3]]
#> [1] 1 4 6
Created on 2021-01-27 by the reprex package (v0.3.0)
Desired Output
Provided with only table_valid_values, how can I clean up my_mtcars to get the following:
## cars mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA 21 6 160 110 3.9 2.62 16.5 0 1 NA 4
## 2 NA 21 6 160 110 3.9 2.88 17.0 0 1 NA 4
## 3 NA 22.8 4 108 93 3.85 2.32 18.6 1 1 NA 1
## 4 NA 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 NA 18.7 8 360 175 3.15 3.44 17.0 0 0 3 NA
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 NA 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 NA 24.4 4 147. 62 3.69 3.19 20 1 0 NA NA
## 9 NA 22.8 4 141. 95 3.92 3.15 22.9 1 0 NA NA
## 10 NA 19.2 6 168. 123 3.92 3.44 18.3 1 0 NA 4
## 11 NA 17.8 6 168. 123 3.92 3.44 18.9 1 0 NA 4
## 12 NA 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 NA
## 13 NA 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 NA
## 14 NA 15.2 8 276. 180 3.07 3.78 18 0 0 3 NA
## 15 NA 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
## 16 NA 10.4 8 460 215 3 5.42 17.8 0 0 3 4
## 17 NA 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
## 18 NA 32.4 4 78.7 66 4.08 2.2 19.5 1 1 NA 1
## 19 NA 30.4 4 75.7 52 4.93 1.62 18.5 1 1 NA NA
## 20 NA 33.9 4 71.1 65 4.22 1.84 19.9 1 1 NA 1
## 21 NA 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
## 22 NA 15.5 8 318 150 2.76 3.52 16.9 0 0 3 NA
## 23 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 NA
## 24 NA 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
## 25 NA 19.2 8 400 175 3.08 3.84 17.0 0 0 3 NA
## 26 NA 27.3 4 79 66 4.08 1.94 18.9 1 1 NA 1
## 27 NA 26 4 120. 91 4.43 2.14 16.7 0 1 5 NA
## 28 NA 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 NA
## 29 NA 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
## 30 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## 31 NA 15 8 301 335 3.54 3.57 14.6 0 1 5 NA
## 32 NA 21.4 4 121 109 4.11 2.78 18.6 1 1 NA NA
I also wonder, what if we wanted to replace invalid values with a string of choice (say, invalid) rather than NA?
You could use dplyr as follows:
library(dplyr)
my_mtcars %>%
mutate(across(all_of(table_valid_values$var_name), ~{
replace(.x, !.x %in%
table_valid_values$valid_values[match(cur_column(),
table_valid_values$var_name)][[1]], NA)
}))
Similarly, in base R:
my_mtcars[table_valid_values$var_name] <- lapply(table_valid_values$var_name,
function(x) {
replace(my_mtcars[[x]],
!my_mtcars[[x]] %in% table_valid_values$valid_values[
match(x, table_valid_values$var_name)][[1]], NA)
})
my_mtcars
# cars mpg cyl disp hp drat wt qsec vs am gear carb
#1 <NA> 21.0 6 160.0 110 3.90 2.620 16.46 0 1 NA 4
#2 <NA> 21.0 6 160.0 110 3.90 2.875 17.02 0 1 NA 4
#3 <NA> 22.8 4 108.0 93 3.85 2.320 18.61 1 1 NA 1
#4 <NA> 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#5 <NA> 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 NA
#6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#7 <NA> 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#8 <NA> 24.4 4 146.7 62 3.69 3.190 20.00 1 0 NA NA
#9 <NA> 22.8 4 140.8 95 3.92 3.150 22.90 1 0 NA NA
#10 <NA> 19.2 6 167.6 123 3.92 3.440 18.30 1 0 NA 4
#11 <NA> 17.8 6 167.6 123 3.92 3.440 18.90 1 0 NA 4
#12 <NA> 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 NA
#13 <NA> 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 NA
#14 <NA> 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 NA
#15 <NA> 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#16 <NA> 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#17 <NA> 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#18 <NA> 32.4 4 78.7 66 4.08 2.200 19.47 1 1 NA 1
#19 <NA> 30.4 4 75.7 52 4.93 1.615 18.52 1 1 NA NA
#20 <NA> 33.9 4 71.1 65 4.22 1.835 19.90 1 1 NA 1
#21 <NA> 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#22 <NA> 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 NA
#23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 NA
#24 <NA> 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#25 <NA> 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 NA
#26 <NA> 27.3 4 79.0 66 4.08 1.935 18.90 1 1 NA 1
#27 <NA> 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 NA
#28 <NA> 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 NA
#29 <NA> 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#31 <NA> 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 NA
#32 <NA> 21.4 4 121.0 109 4.11 2.780 18.60 1 1 NA NA
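To use a different replacement, such as the string "invalid", put it in place of NA in the replace() call. Note that putting a string into a numeric column converts that whole column to character.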
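As a rough sketch of the "invalid" variant in base R: the plain named list valid below is an assumption standing in for table_valid_values (names are the column names, elements are the allowed values), and cleaned is just an illustrative name. The columns are converted to character explicitly so the coercion is visible rather than implicit.

```r
# Same data as in the question, built without tibble
my_mtcars <- data.frame(cars = rownames(mtcars), mtcars,
                        row.names = NULL, stringsAsFactors = FALSE)

# Assumed stand-in for table_valid_values: one vector of valid values per column
valid <- list(
  cars = c("Valiant", "AMC Javelin", "Ferrari Dino"),
  gear = c(3, 5),
  carb = c(1, 4, 6)
)

cleaned <- my_mtcars
cleaned[names(valid)] <- Map(function(col, ok) {
  col <- as.character(col)                          # a string cannot live in a numeric column
  col[!col %in% as.character(ok)] <- "invalid"      # flag everything not listed as valid
  col
}, my_mtcars[names(valid)], valid)

head(cleaned[c("cars", "gear", "carb")])
```

Columns that are not named in valid are left untouched, as in the NA version.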

data.table SD returns as many rows as asked per group with NA fill instead of as many as exist

I am trying to mimic some dplyr code in data.table, which I rarely use. Not getting the result I want.
For example, with dplyr, when I ask for up to 5 rows per group, using slice, I get the desired result - as many rows as I ask, or as many exist per group, whichever is lower:
library(dplyr)
mtcars %>%
group_by(cyl, am) %>%
slice(1:5)
mpg cyl disp hp drat wt qsec vs am gear carb
1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
2 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
3 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
4 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
5 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
6 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
7 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
8 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
9 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
10 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
11 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
12 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
13 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
14 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
15 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
16 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
17 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
18 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
19 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
20 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
21 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
22 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
But, with data.table and .SD, I get as many rows as I ask, no matter if fewer rows exist per group, and the extra rows are filled with NA in non-grouped columns. Any idea how to get what I get from above dplyr code with data.table?
library(data.table)
as.data.table(mtcars)[, .SD[1:5], by = .(cyl, am)]
cyl am mpg disp hp drat wt qsec vs gear carb
1: 6 1 21.0 160.0 110 3.90 2.620 16.46 0 4 4
2: 6 1 21.0 160.0 110 3.90 2.875 17.02 0 4 4
3: 6 1 19.7 145.0 175 3.62 2.770 15.50 0 5 6
4: 6 1 NA NA NA NA NA NA NA NA NA
5: 6 1 NA NA NA NA NA NA NA NA NA
6: 4 1 22.8 108.0 93 3.85 2.320 18.61 1 4 1
7: 4 1 32.4 78.7 66 4.08 2.200 19.47 1 4 1
8: 4 1 30.4 75.7 52 4.93 1.615 18.52 1 4 2
9: 4 1 33.9 71.1 65 4.22 1.835 19.90 1 4 1
10: 4 1 27.3 79.0 66 4.08 1.935 18.90 1 4 1
11: 6 0 21.4 258.0 110 3.08 3.215 19.44 1 3 1
12: 6 0 18.1 225.0 105 2.76 3.460 20.22 1 3 1
13: 6 0 19.2 167.6 123 3.92 3.440 18.30 1 4 4
14: 6 0 17.8 167.6 123 3.92 3.440 18.90 1 4 4
15: 6 0 NA NA NA NA NA NA NA NA NA
16: 8 0 18.7 360.0 175 3.15 3.440 17.02 0 3 2
17: 8 0 14.3 360.0 245 3.21 3.570 15.84 0 3 4
18: 8 0 16.4 275.8 180 3.07 4.070 17.40 0 3 3
19: 8 0 17.3 275.8 180 3.07 3.730 17.60 0 3 3
20: 8 0 15.2 275.8 180 3.07 3.780 18.00 0 3 3
21: 4 0 24.4 146.7 62 3.69 3.190 20.00 1 4 2
22: 4 0 22.8 140.8 95 3.92 3.150 22.90 1 4 2
23: 4 0 21.5 120.1 97 3.70 2.465 20.01 1 3 1
24: 4 0 NA NA NA NA NA NA NA NA NA
25: 4 0 NA NA NA NA NA NA NA NA NA
26: 8 1 15.8 351.0 264 4.22 3.170 14.50 0 5 4
27: 8 1 15.0 301.0 335 3.54 3.570 14.60 0 5 8
28: 8 1 NA NA NA NA NA NA NA NA NA
29: 8 1 NA NA NA NA NA NA NA NA NA
30: 8 1 NA NA NA NA NA NA NA NA NA
cyl am mpg disp hp drat wt qsec vs gear carb
slice automatically adjusts to the number of rows in each group; with data.table we can get the same behaviour with head. When you index .SD with positions larger than .N (the number of rows in the group), data.table returns an NA row for each out-of-range position, which is why .SD[1:5] pads the smaller groups.
library(data.table)
as.data.table(mtcars)[, head(.SD, 5), by = .(cyl, am)]
# cyl am mpg disp hp drat wt qsec vs gear carb
# 1: 6 1 21.0 160.0 110 3.90 2.620 16.46 0 4 4
# 2: 6 1 21.0 160.0 110 3.90 2.875 17.02 0 4 4
# 3: 6 1 19.7 145.0 175 3.62 2.770 15.50 0 5 6
# 4: 4 1 22.8 108.0 93 3.85 2.320 18.61 1 4 1
# 5: 4 1 32.4 78.7 66 4.08 2.200 19.47 1 4 1
# 6: 4 1 30.4 75.7 52 4.93 1.615 18.52 1 4 2
# 7: 4 1 33.9 71.1 65 4.22 1.835 19.90 1 4 1
# 8: 4 1 27.3 79.0 66 4.08 1.935 18.90 1 4 1
# 9: 6 0 21.4 258.0 110 3.08 3.215 19.44 1 3 1
#10: 6 0 18.1 225.0 105 2.76 3.460 20.22 1 3 1
#11: 6 0 19.2 167.6 123 3.92 3.440 18.30 1 4 4
#12: 6 0 17.8 167.6 123 3.92 3.440 18.90 1 4 4
#13: 8 0 18.7 360.0 175 3.15 3.440 17.02 0 3 2
#14: 8 0 14.3 360.0 245 3.21 3.570 15.84 0 3 4
#15: 8 0 16.4 275.8 180 3.07 4.070 17.40 0 3 3
#16: 8 0 17.3 275.8 180 3.07 3.730 17.60 0 3 3
#17: 8 0 15.2 275.8 180 3.07 3.780 18.00 0 3 3
#18: 4 0 24.4 146.7 62 3.69 3.190 20.00 1 4 2
#19: 4 0 22.8 140.8 95 3.92 3.150 22.90 1 4 2
#20: 4 0 21.5 120.1 97 3.70 2.465 20.01 1 3 1
#21: 8 1 15.8 351.0 264 4.22 3.170 14.50 0 5 4
#22: 8 1 15.0 301.0 335 3.54 3.570 14.60 0 5 8
# cyl am mpg disp hp drat wt qsec vs gear carb
Also, dplyr's slice should work on .SD inside data.table:
library(dplyr)
as.data.table(mtcars)[, slice(.SD, 1:5), by = .(cyl, am)]
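The difference between positional indexing and head() is not specific to data.table; the following base R sketch shows the same effect on one small group and then takes up to 5 rows per group. The names small and top5 are illustrative only. In data.table itself, clamping the index with .SD[seq_len(min(.N, 5))] should behave like head(.SD, 5).

```r
# One group from the question: 6 cylinders, manual transmission (3 cars)
small <- mtcars[mtcars$cyl == 6 & mtcars$am == 1, ]

nrow(small[1:5, ])     # positions 4 and 5 do not exist, so they come back as NA rows
nrow(head(small, 5))   # head() stops at the rows that exist

# Up to 5 rows per (cyl, am) group in base R
groups <- split(mtcars, list(mtcars$cyl, mtcars$am), drop = TRUE)
top5 <- do.call(rbind, lapply(groups, head, 5))
nrow(top5)
```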

Replacing a value in df (all variables) with na [duplicate]

I am trying to change portions of my data frame in multiple variables from 8 and 9 to NA
Also, does anyone know a quick way to reverse code a vector? (likert scale where 1 is strongly agree, I want the most weight to be at 5)
Any help would be appreciated. Cheers.
naniar::replace_with_na_all(data = amer, condition = ~.x == -8)
data %>% mutate_all(.funs = function(x) replace(var, which(var == -9 | var == -8), NA))
df %>% mutate_each(funs(replace(., .>7, NA))
(mutate_each is deprecated, evidently.)
Please see the comment to understand how to make your question reproducible for future posts. It's always a good idea to include sample data; if you can't share your data, provide code to generate representative mock data or use one of the built-in datasets.
As to your question, you can use mutate_all in the following way
library(dplyr)
data %>% mutate_all(~ifelse(.x %in% c(-8, -9), NA, .x))
Or you can use replace
data %>% mutate_all(~replace(.x, which(.x %in% c(-8, -9)), NA))
Reproducible example
Let's take mtcars as sample data. To replace all 3 and 4 entries across all columns with NA we can do
mtcars %>% mutate_all(~ifelse(.x %in% c(3, 4), NA, .x))
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 NA NA
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 NA NA
#3 22.8 NA 108.0 93 3.85 2.320 18.61 1 1 NA 1
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 NA 1
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 NA 2
#6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 NA 1
#7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 NA NA
#8 24.4 NA 146.7 62 3.69 3.190 20.00 1 0 NA 2
#9 22.8 NA 140.8 95 3.92 3.150 22.90 1 0 NA 2
#10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 NA NA
#11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 NA NA
#12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 NA NA
#13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 NA NA
#14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 NA NA
#15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 NA NA
#16 10.4 8 460.0 215 NA 5.424 17.82 0 0 NA NA
#17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 NA NA
#18 32.4 NA 78.7 66 4.08 2.200 19.47 1 1 NA 1
#19 30.4 NA 75.7 52 4.93 1.615 18.52 1 1 NA 2
#20 33.9 NA 71.1 65 4.22 1.835 19.90 1 1 NA 1
#21 21.5 NA 120.1 97 3.70 2.465 20.01 1 0 NA 1
#22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 NA 2
#23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 NA 2
#24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 NA NA
#25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 NA 2
#26 27.3 NA 79.0 66 4.08 1.935 18.90 1 1 NA 1
#27 26.0 NA 120.3 91 4.43 2.140 16.70 0 1 5 2
#28 30.4 NA 95.1 113 3.77 1.513 16.90 1 1 5 2
#29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 NA
#30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#32 21.4 NA 121.0 109 4.11 2.780 18.60 1 1 NA 2
Using replace as
mtcars %>% mutate_all(~replace(.x, which(.x %in% c(3, 4)), NA))
gives the same result.
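If you prefer not to depend on dplyr, a base R sketch of the same replacement is below, followed by one possible way to reverse-code a Likert item. The 1 to 5 scale and the vector likert are assumptions taken from the question's description, not from real data.

```r
# Replace every 3 and 4 in every column with NA (same result as the mutate_all call)
df <- mtcars
df[] <- lapply(df, function(x) replace(x, x %in% c(3, 4), NA))

# Reverse-coding a 1..5 scale: (max + min) - x, so 1 <-> 5 and 2 <-> 4
likert <- c(1, 2, 5, 3, NA)      # hypothetical responses
reversed <- 6 - likert           # NA stays NA
```

Assigning into df[] keeps the data frame structure (row names and column names) while replacing the column contents.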

R Dataframe new column with rolling correlation of two other columns

I've been stuck for about one hour with something probably simple.
I have a dataframe and I would need to add a new column containing the correlation of two other columns.
It's important to note I would need the output to be a new dataframe column containing all initial rows, including those with no correlation.
Here is my code:
library(zoo)
data <- mtcars[1:50,]
data$out <- rollapply(data.frame(data$mpg, data$wt), 8 ,function(x) cor(x[,1],x[,2]), by.column=FALSE, fill=NA)
If you check the output, for some weird reason the tail has a lot of NA rows:
> tail(data)
mpg cyl disp hp drat wt qsec vs am gear carb new_col out
NA.12 NA NA NA NA NA NA NA NA NA NA NA <NA> NA
NA.13 NA NA NA NA NA NA NA NA NA NA NA <NA> NA
NA.14 NA NA NA NA NA NA NA NA NA NA NA <NA> NA
NA.15 NA NA NA NA NA NA NA NA NA NA NA <NA> NA
NA.16 NA NA NA NA NA NA NA NA NA NA NA <NA> NA
NA.17 NA NA NA NA NA NA NA NA NA NA NA <NA> NA
why is this happening?
Also, if I check the head, only the first 3 rows show NA in the "out" column but I would expect to have an NA value in the first seven rows of the "out" column.
> head(data,10)
mpg cyl disp hp drat wt qsec vs am gear carb new_col out
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 6 -- 2.62 NA
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 6 -- 2.875 NA
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 -- 2.32 NA
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 6 -- 3.215 -0.6294594
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 8 -- 3.44 -0.6231828
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6 -- 3.46 -0.6374638
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 8 -- 3.57 -0.9454915
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4 -- 3.19 -0.7318425
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4 -- 3.15 -0.7428921
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 6 -- 3.44 -0.7784487
Next, I realized if I use
data$out <- rollapply(data.frame(data$mpg, data$wt), 8 ,function(x) cor(x[,1],x[,2]), by.column=FALSE, fill=NA, align='right')
This fixed the problem of the NAs in the first 7 values, but it still gives a bunch of NA rows in the tail of the dataframe.
Any suggestion please? Thank you
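A likely cause of the NA rows in the tail, assuming data is built from the standard mtcars: that data set has only 32 rows, so mtcars[1:50,] asks for 18 rows that do not exist and R fills them with NA (they show up as NA, NA.1, ..., NA.17). The NAs in only the first 3 rows come from rollapply's default centred alignment, which splits the missing positions between head and tail. The sketch below checks the first point and computes a right-aligned rolling correlation in base R; roll_cor is a hypothetical helper written for this illustration, not a zoo function.

```r
nrow(mtcars)                      # 32
padded <- mtcars[1:50, ]
sum(is.na(padded$mpg))            # 18 rows beyond the end of mtcars, all NA

# Hypothetical base-R helper: right-aligned rolling correlation of x and y
roll_cor <- function(x, y, width) {
  n <- length(x)
  out <- rep(NA_real_, n)         # positions without a full window stay NA
  if (n >= width) {
    for (i in width:n) {
      idx <- (i - width + 1):i    # the window ending at row i
      out[i] <- cor(x[idx], y[idx])
    }
  }
  out
}

data <- mtcars                    # use the rows that exist
data$out <- roll_cor(data$mpg, data$wt, 8)
head(data[c("mpg", "wt", "out")], 10)
```

With the real row count, the only NAs left in out are the first 7, where no full window of 8 rows is available yet.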

Assigning by reference in R data.table with i expression provided in string variable

I am building a Shiny app with plotly, and need to filter data on the basis of a number of parameters. Currently I am doing this with a flag in a data.table, updated by reference. The actual data have many columns, and I would vastly prefer an extensible way of adding columns to be visualised. I am coming up short in one area: the actual filtering of the data on the basis of values.
I store the names of the columns to be filtered in a character vector, but it seems that I can't use this to define the expression by which rows are selected (i.e. the i expression). Is this possible? Or am I approaching this the wrong way?
library(data.table)
set.seed(12345)
dt = data.table(mtcars)
dt[,filtered := FALSE]
filterColumnNames = c('cyl','gear','carb')
filterValues = list(cyl = c(4,6),
gear = c(3),
carb = c(1))
for (columnName in filterColumnNames) {
dt[columnName %in% filterValues[columnName][[1]], filtered := TRUE]
}
# Working, but not loopy enough.
# dt[cyl %in% filterValues['cyl'][[1]], filtered := TRUE]
# dt[gear %in% filterValues['gear'][[1]], filtered := TRUE]
# dt[carb %in% filterValues['carb'][[1]], filtered := TRUE]
print(dt)
Another way to achieve this is to use a join to select the rows:
library(data.table)
dt <- as.data.table(mtcars)
filterValues <- list(cyl = c(4,6),
gear = c(3),
carb = c(1))
dt[do.call(CJ, filterValues), on = names(filterValues), filtered := TRUE][]
mpg cyl disp hp drat wt qsec vs am gear carb filtered
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 NA
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 NA
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 NA
4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 TRUE
5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 NA
6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 TRUE
7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 NA
8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 NA
9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 NA
10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 NA
11: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 NA
12: 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 NA
13: 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 NA
14: 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 NA
15: 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 NA
16: 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 NA
17: 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 NA
18: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 NA
19: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 NA
20: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 NA
21: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 TRUE
22: 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 NA
23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 NA
24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 NA
25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 NA
26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 NA
27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 NA
28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 NA
29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 NA
30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 NA
31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 NA
32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 NA
mpg cyl disp hp drat wt qsec vs am gear carb filtered
or
dt <- as.data.table(mtcars)
dt[do.call(CJ, filterValues), on = names(filterValues), nomatch = 0L]
mpg cyl disp hp drat wt qsec vs am gear carb
1: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
2: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
3: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
You only need to specify the list of filterValues. do.call(CJ, filterValues) (cross join) creates a data.table with all combinations to select the rows by:
cyl gear carb
1: 4 3 1
2: 6 3 1
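For readers less familiar with CJ, the same idea can be sketched in base R with expand.grid() and merge(). Note that a join on all the filter columns keeps only rows that satisfy every filter at once, which is why just three rows match above; the names combos and matched are illustrative.

```r
filterValues <- list(cyl = c(4, 6),
                     gear = c(3),
                     carb = c(1))

# All combinations of the allowed values, one row per combination
combos <- expand.grid(filterValues)

# Inner join: keep the rows of mtcars whose (cyl, gear, carb) is one of the combinations
matched <- merge(mtcars, combos, by = names(filterValues))
matched[c("cyl", "gear", "carb", "mpg")]
```

merge() drops the row names and reorders the columns, so this is only meant to show which rows are selected, not to replace the update by reference.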
Edit
The OP has asked if this could be extended to inequalities.
This can be done with data.table's non-equi joins but the setup is somewhat different. E.g.,
filterIntervals <- list(disp = c(200, 300),
mpg = c(10, 20))
mDT <- dcast(melt(filterIntervals), . ~ L1 + rowid(L1))
filterCondition <- c("disp>=disp_1", "disp<disp_2", "mpg>mpg_1", "mpg<mpg_2")
dt[mDT, on = filterCondition, filtered := TRUE][]
mpg cyl disp hp drat wt qsec vs am gear carb filtered
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 NA
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 NA
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 NA
4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 NA
5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 NA
6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 TRUE
7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 NA
8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 NA
9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 NA
10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 NA
11: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 NA
12: 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 TRUE
13: 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 TRUE
14: 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 TRUE
15: 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 NA
16: 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 NA
17: 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 NA
18: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 NA
19: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 NA
20: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 NA
21: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 NA
22: 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 NA
23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 NA
24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 NA
25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 NA
26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 NA
27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 NA
28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 NA
29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 NA
30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 NA
31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 NA
32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 NA
mpg cyl disp hp drat wt qsec vs am gear carb filtered
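The reason is that columnName before the %in% is not evaluated to get the values of that column. It is just the character string held in the loop variable (for example "cyl"), and that string is what gets compared with the filter values. We can either use get: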
for (columnName in filterColumnNames) {
dt[get(columnName) %in% filterValues[columnName][[1]], filtered := TRUE][]
}
or eval(as.name(columnName)):
for (columnName in filterColumnNames) {
dt[eval(as.name(columnName)) %in% filterValues[columnName][[1]], filtered := TRUE][]
}
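The difference between the string and the column it names can be seen without data.table. This is only a base R sketch of the same loop on a plain data frame (df and hit are illustrative names), using [[ to look the column up by name.

```r
filterValues <- list(cyl = c(4, 6),
                     gear = c(3),
                     carb = c(1))

# The original condition compares the string itself, not the column
columnName <- "cyl"
columnName %in% filterValues[[columnName]]   # FALSE: "cyl" is neither 4 nor 6

df <- mtcars
df$filtered <- FALSE
for (columnName in names(filterValues)) {
  # df[[columnName]] fetches the column whose name is stored in the variable
  hit <- df[[columnName]] %in% filterValues[[columnName]]
  df$filtered[hit] <- TRUE
}
table(df$filtered)
```

Like the loop in the question, this flags a row as soon as any one of the filters matches it.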
You can also store the filtering condition as a character string and evaluate it with eval(parse(text = ...)). See the following example:
library(data.table)
d <- mtcars
setDT(d)
filtering_condition <- "cyl==6"
d[eval(parse(text=filtering_condition))]
