Keep top 3 values in a row, change everything else to NA - r

Using mtcars for reproducibility.
(This is a row operation.) I want to keep the 3 largest values in each row and change everything else in that row to NA.
I tried converting to long format with pivot_longer and then filtering, but the problem is that I need to convert back to wide afterwards because I want to retain the structure of the data.
mtcars %>%
pivot_longer(cols = everything()) %>%
group_by(name) %>% top_n(3)
Sample Output on 3 rows of mtcars
Note: in these 3 rows of mtcars the non-NA values happen to fall in the same columns, but in my original dataset the columns would differ from row to row. (Preferably a tidyverse solution.)

I know you would like a tidyverse solution, but this is a one-liner in base R:
t(apply(mtcars, 1, function(x) {x[order(x)[1:(length(x) - 3)]] <- NA; x}))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 NA 160.0 110 NA NA NA NA NA NA NA
#> Mazda RX4 Wag 21.0 NA 160.0 110 NA NA NA NA NA NA NA
#> Datsun 710 22.8 NA 108.0 93 NA NA NA NA NA NA NA
#> Hornet 4 Drive 21.4 NA 258.0 110 NA NA NA NA NA NA NA
#> Hornet Sportabout 18.7 NA 360.0 175 NA NA NA NA NA NA NA
#> Valiant NA NA 225.0 105 NA NA 20.22 NA NA NA NA
#> Duster 360 NA NA 360.0 245 NA NA 15.84 NA NA NA NA
#> Merc 240D 24.4 NA 146.7 62 NA NA NA NA NA NA NA
#> Merc 230 NA NA 140.8 95 NA NA 22.90 NA NA NA NA
#> Merc 280 19.2 NA 167.6 123 NA NA NA NA NA NA NA
#> Merc 280C NA NA 167.6 123 NA NA 18.90 NA NA NA NA
#> Merc 450SE NA NA 275.8 180 NA NA 17.40 NA NA NA NA
#> Merc 450SL NA NA 275.8 180 NA NA 17.60 NA NA NA NA
#> Merc 450SLC NA NA 275.8 180 NA NA 18.00 NA NA NA NA
#> Cadillac Fleetwood NA NA 472.0 205 NA NA 17.98 NA NA NA NA
#> Lincoln Continental NA NA 460.0 215 NA NA 17.82 NA NA NA NA
#> Chrysler Imperial NA NA 440.0 230 NA NA 17.42 NA NA NA NA
#> Fiat 128 32.4 NA 78.7 66 NA NA NA NA NA NA NA
#> Honda Civic 30.4 NA 75.7 52 NA NA NA NA NA NA NA
#> Toyota Corolla 33.9 NA 71.1 65 NA NA NA NA NA NA NA
#> Toyota Corona 21.5 NA 120.1 97 NA NA NA NA NA NA NA
#> Dodge Challenger NA NA 318.0 150 NA NA 16.87 NA NA NA NA
#> AMC Javelin NA NA 304.0 150 NA NA 17.30 NA NA NA NA
#> Camaro Z28 NA NA 350.0 245 NA NA 15.41 NA NA NA NA
#> Pontiac Firebird 19.2 NA 400.0 175 NA NA NA NA NA NA NA
#> Fiat X1-9 27.3 NA 79.0 66 NA NA NA NA NA NA NA
#> Porsche 914-2 26.0 NA 120.3 91 NA NA NA NA NA NA NA
#> Lotus Europa 30.4 NA 95.1 113 NA NA NA NA NA NA NA
#> Ford Pantera L 15.8 NA 351.0 264 NA NA NA NA NA NA NA
#> Ferrari Dino 19.7 NA 145.0 175 NA NA NA NA NA NA NA
#> Maserati Bora 15.0 NA 301.0 335 NA NA NA NA NA NA NA
#> Volvo 142E 21.4 NA 121.0 109 NA NA NA NA NA NA NA
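For readers who find the one-liner dense, it can be unpacked into a named helper. This is only a sketch: keep_top_n is a hypothetical name, not part of the answer above, and it assumes n is not larger than the number of columns. order(x) returns the positions of the values from smallest to largest, so the first length(x) - n positions are the ones to blank out.

```r
# keep_top_n() is a hypothetical helper name used for illustration.
keep_top_n <- function(x, n = 3) {
  # Positions of all but the n largest values (order() sorts ascending).
  drop <- order(x)[seq_len(length(x) - n)]
  x[drop] <- NA
  x
}

# apply() over rows returns one column per row, hence the transpose.
res <- t(apply(mtcars, 1, keep_top_n))
res["Merc 230", ]
```

Note that the result is a numeric matrix rather than a data frame; wrap it in as.data.frame() if the original class matters.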

Your general idea was in the right direction. You can pivot to long data, group by the row number, replace everything except the three largest values with NA, and then reshape back to wide:
library(dplyr)
library(tidyr)
library(tibble)
mtcars %>%
rowid_to_column() %>%
pivot_longer(-rowid) %>%
group_by(rowid) %>%
mutate(value = replace(value, !value %in% tail(value[order(value)], 3), NA)) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <lgl> <dbl> <dbl> <lgl> <lgl> <dbl> <lgl> <lgl> <lgl> <lgl>
1 21 NA 160 110 NA NA NA NA NA NA NA
2 21 NA 160 110 NA NA NA NA NA NA NA
3 22.8 NA 108 93 NA NA NA NA NA NA NA
4 21.4 NA 258 110 NA NA NA NA NA NA NA
5 18.7 NA 360 175 NA NA NA NA NA NA NA
6 NA NA 225 105 NA NA 20.2 NA NA NA NA
7 NA NA 360 245 NA NA 15.8 NA NA NA NA
8 24.4 NA 147. 62 NA NA NA NA NA NA NA
9 NA NA 141. 95 NA NA 22.9 NA NA NA NA
10 19.2 NA 168. 123 NA NA NA NA NA NA NA
# ... with 22 more rows

Seeing that you were curious about other solutions, here is a more tidyverse-oriented one.
library(purrr)
library(dplyr)
mtcars %>% pmap_dfr(~c(...) %>% replace(rank(desc(.)) > 3, NA))
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 NA 160 110 NA NA NA NA NA NA NA
#> 2 21 NA 160 110 NA NA NA NA NA NA NA
#> 3 22.8 NA 108 93 NA NA NA NA NA NA NA
#> 4 21.4 NA 258 110 NA NA NA NA NA NA NA
#> 5 18.7 NA 360 175 NA NA NA NA NA NA NA
#> 6 NA NA 225 105 NA NA 20.2 NA NA NA NA
#> 7 NA NA 360 245 NA NA 15.8 NA NA NA NA
#> 8 24.4 NA 147. 62 NA NA NA NA NA NA NA
#> 9 NA NA 141. 95 NA NA 22.9 NA NA NA NA
#> 10 19.2 NA 168. 123 NA NA NA NA NA NA NA
#> # ... with 22 more rows
As a concept, it's similar to the base R solution, but it tries to be more "functional" and, hopefully, more readable, even though the accepted solution already looks very good.
EDIT.
To answer your comment asking for more info:
~ is a shortcut for writing more compact anonymous functions.
Instead of:
mtcars %>% pmap_dfr(~c(...) %>% replace(rank(desc(.)) > 3, NA))
you could also write:
mtcars %>% pmap_dfr(function(...) c(...) %>% replace(rank(desc(.)) > 3, NA))
The three dots gather all the inputs you provide to the function. Instead of writing a variable for each input, I use ... to include them all.
pmap takes a list of lists or a list of vectors as its first argument.
In this case, it takes a data.frame, which is actually a list of vectors of the same length.
Then, pmap calls the function with the i-th element of each vector in the list.
... captures all those i-th elements, and c() combines them into a single vector.
The function itself just replaces values in that vector with NA, in a very similar way to the accepted solution. I used rank because it seemed a bit easier to read to me, but I guess it's a matter of style.
pmap always returns a list. That's why you can use pmap_dfr to return a data frame instead. Specifically, you want to create a data frame by binding each vector of the final result as a row (that explains the r at the end).
Check out ?pmap for more info.
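To make the description of pmap concrete, here is a small sketch on a made-up two-row data frame (the data and the names rows and top3 are assumptions for illustration). It shows that each call receives one value per column, which c(...) collects into a named vector.

```r
library(purrr)

# Toy data, invented for this illustration.
df <- data.frame(a = c(1, 5), b = c(9, 2), c = c(4, 7), d = c(3, 8))

# Each call gets the i-th element of every column; c(...) combines them.
rows <- pmap(df, function(...) c(...))
rows[[1]]

# Same idea with the replacement step: blank out everything below rank 3.
top3 <- pmap(df, function(...) {
  x <- c(...)
  replace(x, rank(-x) > 3, NA)
})
top3[[1]]
```

Here rank(-x) plays the role of rank(desc(.)) from the answer, so the sketch does not need dplyr.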

A data.table solution for completeness:
library(data.table)
DT <- as.data.table(mtcars)
DT[,
{
t3 <- sort(unlist(.SD), decreasing = TRUE)[1:3]
lapply(.SD, function(x) if (x %in% t3) x else NA_real_)
},
by = seq_len(nrow(DT))]
# seq_len mpg cyl disp hp drat wt qsec vs am gear carb
# 1: 1 21.0 NA 160.0 110 NA NA NA NA NA NA NA
# 2: 2 21.0 NA 160.0 110 NA NA NA NA NA NA NA
# 3: 3 22.8 NA 108.0 93 NA NA NA NA NA NA NA
# 4: 4 21.4 NA 258.0 110 NA NA NA NA NA NA NA
# 5: 5 18.7 NA 360.0 175 NA NA NA NA NA NA NA
# 6: 6 NA NA 225.0 105 NA NA 20.22 NA NA NA NA
# ...

One dplyr option could be:
mtcars %>%
rowwise() %>%
mutate(temp = list(tail(sort(c_across(everything())), 3))) %>%
# stay rowwise here so each row is compared only with its own top 3
mutate(across(-temp, ~ replace(.x, !.x %in% unlist(temp), NA))) %>%
ungroup() %>%
select(-temp)
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 NA 160 110 NA NA NA NA NA NA NA
2 21 NA 160 110 NA NA NA NA NA NA NA
3 22.8 NA 108 93 NA NA NA NA NA NA NA
4 21.4 NA 258 110 NA NA NA NA NA NA NA
5 18.7 NA 360 175 NA NA NA NA NA NA NA
6 NA NA 225 105 NA NA 20.2 NA NA NA NA
7 NA NA 360 245 NA NA 15.8 NA NA NA NA
8 24.4 NA 147. 62 NA NA NA NA NA NA NA
9 NA NA 141. 95 NA NA 22.9 NA NA NA NA
10 19.2 NA 168. 123 NA NA NA NA NA NA NA
The same logic using purrr:
mtcars %>%
pmap_dfr(~ replace(c(...), !c(...) %in% tail(sort(c(...)), 3), NA))
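One caveat worth checking on your own data: the approaches above treat ties differently. The sketch below uses a made-up vector with four tied maxima. The %in% and rank() variants keep every tied value, so more than three survive, while the order() variant keeps exactly three.

```r
# Toy vector with a four-way tie for the largest value.
x <- c(a = 5, b = 5, c = 5, d = 5, e = 1)

# Membership test, as in the pivot_longer and dplyr answers.
by_membership <- replace(x, !x %in% tail(sort(x), 3), NA)

# Rank test, as in the purrr answer (rank(-x) stands in for rank(desc(x))).
by_rank <- replace(x, rank(-x) > 3, NA)

# Position test, as in the base R answer.
by_order <- x
by_order[order(x)[seq_len(length(x) - 3)]] <- NA

sum(!is.na(by_membership))  # 4
sum(!is.na(by_rank))        # 4
sum(!is.na(by_order))       # 3
```

Which behaviour is right depends on whether tied values should all be kept or the count should be capped at three.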

Related

Replace values with NAs based on a column condition

I have a dataframe with different columns, one of which tells me if data in other columns can be "trusted" or not, containing a "yes" or a "no" (column name: inside_calibration_range). What I would like to do is simply to replace the values in the whole row with NA every time I have a "no" in the inside_calibration_range column.
I had a look at the dplyr::na_if and replace_with_na_all() functions, but (I may be wrong) it seems they do not accept conditions and instead replace specific values in the whole dataframe.
As an example with mtcars, suppose rows where cyl equals 6 cannot be trusted. We can mutate across everything and set those rows to NA:
library(tidyverse)
data(mtcars)
as_tibble(mtcars %>% mutate(across(everything(), ~replace(., cyl == 6 , NA))))
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 NA NA NA NA NA NA NA NA NA NA NA
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 NA NA NA NA NA NA NA NA NA NA NA
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 NA NA NA NA NA NA NA NA NA NA NA
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
Select only some columns instead of all:
as_tibble(mtcars %>% mutate(across(c(mpg, disp), ~replace(., cyl == 6 , NA))))
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA 6 NA 110 3.9 2.62 16.5 0 1 4 4
2 NA 6 NA 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 NA 6 NA 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 NA 6 NA 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 NA 6 NA 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
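Applied to the situation in the question, the same pattern could look like the sketch below. The data frame and its sample and conc columns are invented for illustration; only the inside_calibration_range column name comes from the question. The flag column itself is left untouched so the reason for the NA stays visible.

```r
library(dplyr)

# Toy data; only inside_calibration_range is taken from the question.
df <- data.frame(
  sample = c("s1", "s2", "s3"),
  conc = c(1.2, 50.1, 3.4),
  inside_calibration_range = c("yes", "no", "yes")
)

# Blank out every column except the flag wherever the flag is "no".
out <- df %>%
  mutate(across(-inside_calibration_range,
                ~ replace(.x, inside_calibration_range == "no", NA)))
out
```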

Rename columns in a large list with different data frame lengths

I have used lapply to create a large list containing +14000 individual data frames with various lengths, see this question for more background.
I now have a large list with many data frames that have different numbers of columns, ranging from 84 to 315. I would like to rename every data frame columns with two consistent names and a sequence from d0 to dxx, depending on the number of columns
col1 col2 col3 col4 col5 ... colxx
ID doy d0 d1 d2 dxx
This means the first data frame will have a sequence of d0 to d312, while the last data frame will have a sequence of d0 to d81.
I tried the following:
df = lapply(df_sub, function(x) {names(x)[1:314] <- c("ID", "doy","d0","d1","d2","d3","d4","d5","d6","d7","d8","d9","d10","d11","d12","d13","d14","d15","d16","d17","d18","d19","d20","d21","d22","d23","d24","d25","d26","d27","d28","d29","d30", "d31","d32","d33","d34","d35","d36","d37","d38","d39","d40","d41","d42","d43","d44","d45","d46","d47","d48","d49","d50",d51","d52","d53","d54","d55","d56","d57","d58","d59","d60",d61","d62","d63","d64","d65","d66","d67","d68","d69","d70","d71","d72","d73","d74","d75","d76","d77","d78","d79","d80","d81","d82","d83","d84","d85","d86","d87","d88","d89","d90","d91","d92","d93","d94","d95","d96","d97","d98","d99","d100",d101","d102","d103","d104","d105","d106","d107","d108","d109","d110","d111","d112","d113","d114","d115","d116","d117","d118","d119","d120","d121","d122","d123","d124","d125","d126","d127","d128","d129","d130","d131","d132","d133","d134","d135","d136","d137","d138","d139","d140","d141","d142","d143","d144","d145","d146","d147","d148","d149","d150","d151","d152","d153","d154","d155","d156","d157","d158","d159","d160","d161","d162","d163","d164","d165","d166","d167","d168","d169","d170","d171","d172","d173","d174","d175","d176","d177","d178","d179","d180","d181","d182","d183","d184","d185","d186","d187","d188","d189","d190","d191","d192","d193","d194","d195","d196","d197","d198","d199","d200","d201","d202","d203","d204","d205","d206","d207","d208","d209","d210","d211","d212","d213","d214","d215","d216","d217","d218","d219","d220","d221","d222","d223","d224","d225","d226","d227","d228","d229","d230","d231","d232","d233","d234","d235","d236","d237","d238","d239","d240","d241","d242","d243","d244","d245","d246","d247","d248","d249","d250","d251","d252","d253","d254","d255","d256","d257","d258","d259","d260","d261","d262","d263","d264","d265","d266","d267","d268","d269","d270","d271","d272","d273","d274","d275","d276","d277","d278","d279","d280","d281","d282","d283","d284","d285","d286","d287","d288","d289","d290","d291","d
292","d293","d294","d295","d296","d297","d298","d299","d300","d301","d302","d303","d304","d305","d306","d307","d308","d309","d310","d311","d312"); x})
Error in names(x) <- `*vtmp*` :
'names' attribute [315] must be the same length as the vector [314]
How can this be done otherwise in R?
Only works if there are three or more columns, but I guess that can be assumed here.
# Sample Data
df <- lapply(list(mtcars,iris,CO2), head)
lapply(df, function(d) {setNames(d, c("ID", "doy", paste0("d", 0:(length(d) - 3))))})
#> [[1]]
#> ID doy d0 d1 d2 d3 d4 d5 d6 d7 d8
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
#>
#> [[2]]
#> ID doy d0 d1 d2
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#>
#> [[3]]
#> ID doy d0 d1 d2
#> 1 Qn1 Quebec nonchilled 95 16.0
#> 2 Qn1 Quebec nonchilled 175 30.4
#> 3 Qn1 Quebec nonchilled 250 34.8
#> 4 Qn1 Quebec nonchilled 350 37.2
#> 5 Qn1 Quebec nonchilled 500 35.3
#> 6 Qn1 Quebec nonchilled 675 39.2
Created on 2022-04-01 by the reprex package (v2.0.1)
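The three-column restriction comes from 0:(length(d) - 3), which counts backwards (0, -1) when a data frame has only two columns. If that case can occur, a variant along these lines avoids it. This is a sketch; rename_cols is a hypothetical name. It uses sprintf() rather than paste0() because paste0("d", integer(0)) returns "d" instead of an empty vector.

```r
# rename_cols() is a hypothetical helper name used for illustration.
rename_cols <- function(d) {
  n_extra <- ncol(d) - 2L      # columns after "ID" and "doy"
  stopifnot(n_extra >= 0)      # needs at least the two fixed columns
  # seq_len(0) is empty, so a two-column frame gets no "d" names at all.
  setNames(d, c("ID", "doy", sprintf("d%d", seq_len(n_extra) - 1L)))
}

dfs <- list(head(mtcars), head(iris), data.frame(a = 1, b = 2))
renamed <- lapply(dfs, rename_cols)
lapply(renamed, names)
```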
library(magrittr)
# example data
dfs <- list(
data.frame(a = 1, b = 2, c = 3),
data.frame(x = 1, y = 2),
data.frame(z = 0)
)
dfs %>%
lapply(function(.x) {
colnames(.x) <-
.x %>%
colnames() %>%
length() %>%
seq() %>%
`-`(1) %>%
paste0("d", .)
.x
})
#> [[1]]
#> d0 d1 d2
#> 1 1 2 3
#>
#> [[2]]
#> d0 d1
#> 1 1 2
#>
#> [[3]]
#> d0
#> 1 0
Created on 2022-04-01 by the reprex package (v2.0.0)
A minor modification to your existing code is to check how many columns the dataframe has, before creating the new names:
df <- list()
for(i in 1:3) df[[i]] <- as.data.frame(matrix(NA, 5, i+3))
f <- function(x){
#Names first column "ID", second "Day", and subsequent columns d0, d1...
names(x) <- c("ID","Day", paste0("d", seq(from=0, length.out=length(names(x))-2)))
x
}
lapply(df, f)
# [[1]]
# ID Day d0 d1
# 1 NA NA NA NA
# 2 NA NA NA NA
# 3 NA NA NA NA
# 4 NA NA NA NA
# 5 NA NA NA NA
#
# [[2]]
# ID Day d0 d1 d2
# 1 NA NA NA NA NA
# 2 NA NA NA NA NA
# 3 NA NA NA NA NA
# 4 NA NA NA NA NA
# 5 NA NA NA NA NA
#
# [[3]]
# ID Day d0 d1 d2 d3
# 1 NA NA NA NA NA NA
# 2 NA NA NA NA NA NA
# 3 NA NA NA NA NA NA
# 4 NA NA NA NA NA NA
# 5 NA NA NA NA NA NA

Adding multiple NA rows to a data frame

I want to calculate something using a loop that runs for more iterations than I have observations. For this reason, I first want to add NA rows to my data frame. Let's say my data frame is the mtcars data. How can I add 10 NA rows to mtcars? The add_row() function (from tibble, loaded with the tidyverse) only lets me add one NA row at a time, and rbind() didn't help me add multiple rows either. I could probably loop this too, but maybe there is an easier solution.
To add several rows of NA to the dataframe, you can try this:
n <- 2 # Number of NA rows
rbind(df, matrix(NA, nrow = n, ncol = NCOL(df),
dimnames = list(NULL, colnames(df))))
There are many ways to do that. One way with base-R:
#make a data.frame with 10 rows of NA
na_frame <- as.data.frame(matrix(NA, nrow = 10, ncol = 11))
#add the names of mtcars
names(na_frame) <- names(mtcars)
#bind together
rbind(mtcars, na_frame)
output:
#truncated
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
1 NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA
7 NA NA NA NA NA NA NA NA NA NA NA
8 NA NA NA NA NA NA NA NA NA NA NA
9 NA NA NA NA NA NA NA NA NA NA NA
10 NA NA NA NA NA NA NA NA NA NA NA
Alternatives that start with the base frame itself:
base R
mt <- mtcars[1:3,]
rbind(mt, mt[0,][rep(NA, 11),])
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# NA NA NA NA NA NA NA NA NA NA NA NA
# NA.1 NA NA NA NA NA NA NA NA NA NA NA
# NA.2 NA NA NA NA NA NA NA NA NA NA NA
# NA.3 NA NA NA NA NA NA NA NA NA NA NA
# NA.4 NA NA NA NA NA NA NA NA NA NA NA
# NA.5 NA NA NA NA NA NA NA NA NA NA NA
# NA.6 NA NA NA NA NA NA NA NA NA NA NA
# NA.7 NA NA NA NA NA NA NA NA NA NA NA
# NA.8 NA NA NA NA NA NA NA NA NA NA NA
# NA.9 NA NA NA NA NA NA NA NA NA NA NA
# NA.10 NA NA NA NA NA NA NA NA NA NA NA
dplyr
library(dplyr)
mt %>%
slice(rep(1, 11)) %>%
mutate(across(everything(), ~ NA)) %>%
bind_rows(mt, .)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# 4 NA NA NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA NA NA NA NA NA
# 6 NA NA NA NA NA NA NA NA NA NA NA
# 7 NA NA NA NA NA NA NA NA NA NA NA
# 8 NA NA NA NA NA NA NA NA NA NA NA
# 9 NA NA NA NA NA NA NA NA NA NA NA
# 10 NA NA NA NA NA NA NA NA NA NA NA
# 11 NA NA NA NA NA NA NA NA NA NA NA
# 12 NA NA NA NA NA NA NA NA NA NA NA
# 13 NA NA NA NA NA NA NA NA NA NA NA
# 14 NA NA NA NA NA NA NA NA NA NA NA
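The base R idea can be wrapped in a small helper that takes the number of rows as an argument. This is a sketch; add_na_rows is a hypothetical name. It relies on the fact that indexing a data frame with NA row indices yields all-NA rows while keeping the column types.

```r
# add_na_rows() is a hypothetical helper name used for illustration.
add_na_rows <- function(d, n) {
  # Existing row positions followed by n missing indices.
  idx <- c(seq_len(nrow(d)), rep(NA_integer_, n))
  d[idx, , drop = FALSE]
}

padded <- add_na_rows(mtcars, 10)
tail(padded, 12)
```

The appended rows get row names such as NA, NA.1 and so on; reset them with rownames(padded) <- NULL if the original row names are not needed.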

R - Delete Observations if More Than 25% of a Group

This is my first post! I started using R about a year ago and I have learned a lot from this sub over the last few months! Thanks for all of your help so far.
Here is what I am trying to do:
• Group Data by POS
• Within each POS group, no ORG should represent more than 25% of the dataset
• If the ORG represents more than 25% of the observation(column), the value furthest from the mean should be deleted. I think this would loop until the data from that ORG are less than 25% of the observation.
I am not sure how to approach this problem as I am not too familiar with R functions. I am assuming this would require a function.
Here is the sample dataset:
print(Example)
# A tibble: 18 x 13
Org Pos obv1 obv2 obv3 obv4 obv5 obv6 obv7 obv8 obv9 obv10 obv11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 34.6 26.2 43.1 NA NA NA NA NA NA NA NA
2 2 1 18.7 15.5 23.4 NA NA NA NA NA NA NA NA
3 3 1 16.2 14.4 21.7 NA NA NA NA NA NA NA 1.32
4 3 1 20.0 15.5 23.4 NA NA 1.32 2.78 1.44 NA NA 1.89
5 3 1 2.39 16.9 24.1 NA NA 1.13 1.52 1.12 NA NA 2.78
6 3 1 24.3 15.4 24.6 NA NA 1.13 1.89 1.13 NA NA 1.51
7 6 1 16.7 16.0 23.4 0.19 NA 0.83 1.3 0.94 1.78 2.15 1.51
8 6 1 18.7 16.4 25.8 0.19 NA 1.22 1.4 0.97 1.93 2.35 1.51
9 6 1 19.3 16.4 25.8 0.19 NA 1.22 1.4 0.97 1.93 2.35 1.51
10 7 1 23.8 18.6 28.6 NA NA NA NA NA NA NA NA
11 12 2 28.8 24.4 39.7 NA NA 1.13 1.89 1.32 2.46 3.21 NA
12 13 2 24.6 19.6 29.4 0.16 NA 3.23 3.23 2.27 NA NA NA
13 14 2 18.4 15.5 24.8 NA NA 2.27 3.78 1.13 3.46 4.91 2.78
14 15 2 23.8 24.4 39.7 NA NA NA NA NA NA NA NA
15 15 2 25.8 24.4 39.7 NA NA NA NA NA NA NA NA
16 16 2 18.9 17.4 26.9 0.15 NA NA 1.89 2.99 NA NA 1.51
17 16 2 22.1 17.3 26.9 NA NA NA 2.57 0.94 NA NA 1.51
18 16 2 24.3 19.6 28.5 0.15 NA NA 1.51 1.32 NA NA 2.27
The result would look something like this:
print(Result)
# A tibble: 18 x 13
Org Pos obv1 obv2 obv3 obv4 obv5 obv6 obv7 obv8 obv9 obv10 obv11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 34.6 26.2 43.1 NA NA NA NA NA NA NA NA
2 2 1 18.7 15.5 23.4 NA NA NA NA NA NA NA NA
3 3 1 NA NA NA NA NA NA NA NA NA NA NA
4 3 1 20.0 15.5 23.4 NA NA 1.32 2.78 1.44 NA NA NA
5 3 1 NA NA NA NA NA NA NA NA NA NA NA
6 3 1 NA NA NA NA NA NA NA NA NA NA 1.51
7 6 1 16.7 16.0 23.4 0.19 NA NA NA NA NA NA NA
8 6 1 NA NA NA NA NA 1.22 1.4 0.97 1.93 2.35 1.51
9 6 1 19.3 16.4 25.8 NA NA NA NA NA NA NA NA
10 7 1 23.8 18.6 28.6 NA NA NA NA NA NA NA NA
11 12 2 28.8 24.4 39.7 NA NA 1.13 1.89 1.32 2.46 3.21 NA
12 13 2 24.6 19.6 29.4 0.16 NA 3.23 3.23 2.27 NA NA NA
13 14 2 18.4 15.5 24.8 NA NA 2.27 3.78 1.13 3.46 4.91 2.78
14 15 2 NA NA NA NA NA NA NA NA NA NA NA
15 15 2 25.8 24.4 39.7 NA NA NA NA NA NA NA NA
16 16 2 NA NA NA NA NA NA 1.89 2.99 NA NA NA
17 16 2 22.1 17.3 26.9 NA NA NA 2.57 0.94 NA NA 1.51
18 16 2 NA NA NA NA NA NA NA NA NA NA NA
Any advice would be appreciated. Thanks!

R sorting data subset

I'm learning to use R (version 3.1.2), so this may come as a noob question, but I'm having problems ordering a subset of a data frame. If I use the mtcars data frame using attach(mtcars), I can easily order it using ord.cars <- mtcars[order(hp),]. The problem is, if I use a subset, let's say sub.cars <- subset(mtcars, hp > 120) and try to order it using ord.sub <- sub.cars[order(mpg),], the result is the following:
mpg cyl disp hp drat wt qsec vs am gear carb
NA NA NA NA NA NA NA NA NA NA NA NA
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
NA.1 NA NA NA NA NA NA NA NA NA NA NA
NA.2 NA NA NA NA NA NA NA NA NA NA NA
NA.3 NA NA NA NA NA NA NA NA NA NA NA
NA.4 NA NA NA NA NA NA NA NA NA NA NA
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
NA.5 NA NA NA NA NA NA NA NA NA NA NA
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
NA.6 NA NA NA NA NA NA NA NA NA NA NA
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
NA.7 NA NA NA NA NA NA NA NA NA NA NA
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
NA.8 NA NA NA NA NA NA NA NA NA NA NA
NA.9 NA NA NA NA NA NA NA NA NA NA NA
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
NA.10 NA NA NA NA NA NA NA NA NA NA NA
NA.11 NA NA NA NA NA NA NA NA NA NA NA
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
NA.12 NA NA NA NA NA NA NA NA NA NA NA
NA.13 NA NA NA NA NA NA NA NA NA NA NA
NA.14 NA NA NA NA NA NA NA NA NA NA NA
Why is R putting back as NAs all the rows that were left out of the subset?
Thanks in advance!
This problem is caused by your use of attach(), which is not recommended in R for exactly this reason. The code is ambiguous: it does something different from what you expected.
How to resolve this?
detach the data set and
don't use attach again. Instead, use [ and/or $ and if you like with() to subset your data.
Here's how you could do it for the example:
detach(mtcars)
ord.cars <- mtcars[order(mtcars$hp),]
sub.cars <- subset(mtcars, hp > 120)
#the subset could also be written as:
sub.cars <- mtcars[mtcars$hp > 120,]
ord.sub <- sub.cars[order(sub.cars$mpg),]
head(ord.sub) # only show the first 6 rows
mpg cyl disp hp drat wt qsec vs am gear carb
Cadillac Fleetwood 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
Lincoln Continental 10.4 8 460 215 3.00 5.42 17.8 0 0 3 4
Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
Chrysler Imperial 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
What exactly caused the problem in your code?
After you attached the mtcars data, whenever you call one of the column names of the attached data, like mpg, it refers to the attached data set (the original mtcars data). The problem then was that you subsetted the data and stored it in a new object (sub.cars), which was not attached, while mtcars was still attached. Then, when you tried to order the sub.cars data, you used sub.cars[order(mpg),], where mpg is interpreted by R as the column from the attached (original) mtcars data set, which has more rows than your subsetted data. order(mpg) therefore returns row positions up to 32, and every position beyond the number of rows in sub.cars shows up as a row of NAs.
Lesson: don't use attach().
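The effect can be reproduced without attach() by writing out what order(mpg) resolved to. A sketch, using only objects from the question plus the illustrative names idx_wrong, wrong, right and right2:

```r
sub.cars <- subset(mtcars, hp > 120)

# With mtcars attached, order(mpg) is really order(mtcars$mpg):
# row positions computed for the full data, not for the subset.
idx_wrong <- order(mtcars$mpg)
wrong <- sub.cars[idx_wrong, ]

# Positions larger than nrow(sub.cars) do not exist and come back as NA rows.
sum(idx_wrong > nrow(sub.cars))

# Ordering by the subset's own column gives the intended result.
right <- sub.cars[order(sub.cars$mpg), ]

# with() offers the same convenience as attach() but scoped to one call.
right2 <- with(sub.cars, sub.cars[order(mpg), ])
head(right)
```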
