How to remove variables from a dataset that are actually named NA?

I have an Excel dataset that I have to work on. The problem is that empty cells were filled with the text "NA" instead of being left empty.
I'm trying to remove these NA values from the dataset. Usually I could use is.na() to omit them, but since these are literal text rather than real missing values, I don't know how to go about this.
Any ideas to point me in the right direction?

You can try something like below:
library(dplyr)

df %>%
  mutate(across(everything(), ~ na_if(., 'NA'))) %>%
  na.omit()
# A tibble: 3 x 2
  Drinks ranked
  <chr>  <chr>
1 A      1
2 C      2
3 C      1
Data used:
df
# A tibble: 5 x 2
  Drinks ranked
  <chr>  <chr>
1 A      1
2 B      NA
3 NA     1
4 C      2
5 C      1
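Since the data comes from an Excel file, another option (a sketch, assuming you read the file with the readxl package; "data.xlsx" is a placeholder path) is to treat the literal string "NA" as missing at import time, so that is.na() and na.omit() work as usual afterwards:
library(readxl)

# the na argument tells read_excel which strings should be parsed as missing values
df <- read_excel("data.xlsx", na = c("", "NA"))
df <- na.omit(df)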

Related

How to remove missing values in summarise_all dplyr [duplicate]

I'm having trouble excluding missing values in the summarise_all function.
I have a dataset (df) as shown below and basically I'm having two problems:
excluding missing values so that the output is a single number per group
additional rows with the same IDs but only NA/TRUE values (the rows filled with 'TRUE' in the df1 output below)
df1 is the dataset I'm trying to get to.
Here's the whole enchilada:
df   # the original dataset
  ID  type of data  genes1  genes2  genes3  ...
  1   new                2      NA      NA
  1   old               NA       0      NA
  1   suggested         NA      NA       2
  2   new                1      NA      NA
  2   old               NA       1      NA
  2   suggested         NA      NA       1
  ...
df1 <- df %>% group_by(df$ID) %>% summarize_all(list, na.rm= TRUE) #my code
#output
  ID  type of data                genes1       genes2       genes3  ...
  1   c("new","old","suggested")  c(2,NA,NA)   c(0,NA,NA)   c(2,NA,NA)
  1   TRUE                        TRUE         TRUE         TRUE
  2   c("new","old","suggested")  c(1,NA,NA)   c(1,NA,NA)   c(1,NA,NA)
  2   TRUE                        TRUE         TRUE         TRUE
  ...
# my main concern is the "genes" columns and the duplicate rows with the same IDs but NA/TRUE values; I wanted something like this
df1   # dream dataset
  ID  type of data      genes1  genes2  genes3  ...
  1   # doesn't matter       2       0       2
  2   # doesn't matter       1       1       1
  ...
I also tried using na.omit in summarise_all but it didn't really fix anything.
Does anybody have any ideas on how to fix it?
You could do:
library(dplyr)

df %>%
  group_by(ID) %>%
  summarise(across(starts_with('genes'), ~ .[!is.na(.)]))
#> # A tibble: 2 × 4
#>      ID genes1 genes2 genes3
#>   <dbl>  <dbl>  <dbl>  <dbl>
#> 1     1      2      0      2
#> 2     2      1      1      1
Another way
library(dplyr)
library(tidyr)   # fill() comes from tidyr

df[-2] |>
  group_by(ID) |>
  fill(genes1:genes3, .direction = "downup") |>
  slice(1)
     ID genes1 genes2 genes3
  <int>  <int>  <int>  <int>
1     1      2      0      2
2     2      1      1      1
An alternative approach based on the coalesce() function from dplyr
In the below code, we remove the type variable since the OP indicated we don't need it in the output. We then group_by() to essentially break up our data into separate data.frames for each ID. The coalesce_by_column() function we define then converts each of these into a list whose elements are each a vector of values for each gene column.
We finally can pass this list to coalesce(). coalesce() takes a set of vectors and finds the first non-NA value across the vectors for each index of the vectors. In practice, this means it can take multiple columns with only one or zero non-NA value across all columns for each index and collapse them into a single column with as many non-NA values as possible.
Usually we would have to pass each vector as its own argument to coalesce(), but we can use the splice operator !!! (https://stackoverflow.com/questions/61180201/triple-exclamation-marks-on-r) to pass each element of our list as its own vector. See the last example in ?"!!!" for a demonstration.
library(dplyr)
library(tidyr)

# Define a function to coalesce a data frame by column
coalesce_by_column <- function(df) {
  coalesce(!!! as.list(df))
}

# Collapse each ID into a single row, keeping the non-NA value per gene column
df %>%
  select(-type) %>%
  group_by(ID) %>%
  summarise(across(.fns = coalesce_by_column))
#> # A tibble: 2 x 4
#>      ID genes1 genes2 genes3
#>   <dbl>  <dbl>  <dbl>  <dbl>
#> 1     1      2      0      2
#> 2     2      1      1      1
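As a minimal illustration of what the splice operator is doing here (the vectors below are made up for demonstration, not taken from the question's data), coalesce(!!! some_list) behaves exactly as if each list element had been passed as a separate argument:
library(dplyr)

vals <- list(c(1, NA, NA), c(NA, 2, NA), c(NA, NA, 3))
coalesce(!!!vals)   # identical to coalesce(vals[[1]], vals[[2]], vals[[3]])
#> [1] 1 2 3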
If you are not worried about the type column you can do something like this
library(tidyverse)

"ID type genes1 genes2 genes3
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1" %>%
  read_table() -> df

df %>%
  pivot_longer(-c(ID, type)) %>%
  drop_na(value) %>%
  select(-type) %>%
  pivot_wider(names_from = name, values_from = value)
# A tibble: 2 × 4
     ID genes1 genes2 genes3
  <dbl>  <dbl>  <dbl>  <dbl>
1     1      2      0      2
2     2      1      1      1
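One usage note (my addition, not part of the answer): pivot_wider() assumes at most one non-NA value per (ID, name) pair. If that ever isn't the case, it warns and produces list-columns, and you would need a values_fn to collapse the duplicates, for example:
df %>%
  pivot_longer(-c(ID, type)) %>%
  drop_na(value) %>%
  select(-type) %>%
  pivot_wider(names_from = name, values_from = value,
              values_fn = max)   # max is just one possible way to collapse duplicates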
If you want to keep the "type of data" column while using summarise, you can use the following code:
df <- read.table(text = "ID type_of_data genes1 genes2 genes3
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1", header = TRUE)

library(dplyr)
library(tidyr)

df1 <- df %>%
  group_by(ID) %>%
  summarise(across(starts_with("genes"), na.omit),
            type_of_data = type_of_data[genes1]) %>%
  ungroup()
df1
#> # A tibble: 2 × 5
#>      ID genes1 genes2 genes3 type_of_data
#>   <int>  <int>  <int>  <int> <chr>
#> 1     1      2      0      2 old
#> 2     2      1      1      1 new
Created on 2022-07-26 by the reprex package (v2.0.1)

Dense ranking of column based on order of second column

I am beating my brains out on something that is probably straightforward. I want to get a "dense" ranking (as defined for the data.table::frank function) on a column in a data frame, but not based on that column's own order; the order should be given by another column (val in my example).
I managed to get the dense ranking with @Prasad Chalasani's solution, like this:
library(dplyr)

foo_df <- data.frame(id = c(4, 1, 1, 3, 3), val = letters[1:5])
foo_df %>% arrange(val) %>% mutate(id_fac = as.integer(factor(id)))
#>   id val id_fac
#> 1  4   a      3
#> 2  1   b      1
#> 3  1   c      1
#> 4  3   d      2
#> 5  3   e      2
But I would like the factor levels to be ordered based on val. Desired output:
foo_desired <- foo_df %>%
  arrange(val) %>%
  mutate(id_fac = as.integer(factor(id, levels = c(4, 1, 3))))
foo_desired
#>   id val id_fac
#> 1  4   a      1
#> 2  1   b      2
#> 3  1   c      2
#> 4  3   d      3
#> 5  3   e      3
I tried data.table::frank
I tried both methods by @Prasad Chalasani.
I tried setting the order of id using id[rank(val)] (and sort(val), and order(val)).
Finally, I also tried to sort the levels using rank(val) etc, but this throws an error (Evaluation error: factor level [3] is duplicated.)
I know that one can specify the level order, I used this for creation of the desired output. This solution is however not great as my data has way more rows and levels
I need that for convenience, in order to produce a table with a specific order, not for computations.
Created on 2018-12-19 by the reprex package (v0.2.1)
You can do it with first():
foo_df %>%
  arrange(val) %>%
  group_by(id) %>%
  mutate(id_fac = first(val)) %>%
  ungroup() %>%
  mutate(id_fac = as.integer(factor(id_fac)))
# A tibble: 5 x 3
     id val    id_fac
  <dbl> <fctr>  <int>
1     4 a           1
2     1 b           2
3     1 c           2
4     3 d           3
5     3 e           3
Why do you even need factors? Not sure if I am missing something, but this gives your desired output.
You can use match() to get id_fac based on the order in which the ids occur.
library(dplyr)

foo_df %>%
  mutate(id_fac = match(id, unique(id)))
#  id val id_fac
#1  4   a      1
#2  1   b      2
#3  1   c      2
#4  3   d      3
#5  3   e      3
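Since the question mentions data.table::frank, a data.table flavour of the same idea is possible too (a sketch, assuming the foo_df object from the question); after ordering by val, match() already yields a dense rank in order of first appearance, so frank() itself isn't needed:
library(data.table)

foo_dt <- as.data.table(foo_df)
setorder(foo_dt, val)                        # order rows by val first
foo_dt[, id_fac := match(id, unique(id))]    # dense rank in order of appearance
foo_dt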

R add rows to grouped df using dplyr

I have a grouped df and I would like to add additional rows at the top of each group when their item_code matches an item_code already present in that group.
The additional rows do not have an id column, and they should not be duplicated within the groups of df.
Example data:
df <- as.tibble(data.frame(id = rep(1:3, each = 2),
                           item_code = c("A", "A", "B", "B", "B", "Z"),
                           score = rep(1, 6)))
additional_rows <- as.tibble(data.frame(item_code = c("A", "Z"),
                                        score = c(6, 6)))
What I tried
I found this post and tried to apply it:
Add row in each group using dplyr and add_row()
df %>%
  group_by(id) %>%
  do(add_row(additional_rows %>% filter(item_code %in% .$item_code)))
What I get:
# A tibble: 9 x 3
# Groups:   id [3]
     id item_code score
  <int> <fct>     <dbl>
1     1 A             6
2     1 Z             6
3     1 NA           NA
4     2 A             6
5     2 Z             6
6     2 NA           NA
7     3 A             6
8     3 Z             6
9     3 NA           NA
What I am looking for:
# A tibble: 8 x 3
     id item_code score
  <int> <fct>     <dbl>
1     1 A             6
2     1 A             1
3     1 A             1
4     2 B             1
5     2 B             1
6     3 B             1
7     3 Z             6
8     3 Z             1
This should do the trick:
library(plyr)
library(dplyr)   # for %>% and arrange(); join() comes from plyr

df %>%
  join(subset(df, item_code %in% additional_rows$item_code, select = c(id, item_code)) %>%
         join(additional_rows) %>%
         subset(!duplicated(.)),
       type = "full") %>%
  arrange(id, item_code, -score)
Not sure if it's the best way, but it works.
Edit: added the extra arrange terms so the scores come out in the desired order.
Edit 2: alright, no duplicated rows should now be added from the additional rows, as per your comment.
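A dplyr-only variant of the same idea (a sketch, assuming the df and additional_rows objects from the question): build the extra rows by joining additional_rows to the id/item_code combinations that actually occur, then bind and sort so the higher score lands at the top of each group.
library(dplyr)

# note: if item_code is a factor, the join/bind may coerce it to character
new_rows <- df %>%
  distinct(id, item_code) %>%                      # one row per id/item_code pair
  inner_join(additional_rows, by = "item_code")    # keep only codes that have an extra row

bind_rows(new_rows, df) %>%
  arrange(id, item_code, desc(score))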

Dynamically Normalize all rows with first element within a group

Suppose I have the following data frame:
  year subject grade study_time
1    1       a    30         20
2    2       a    60         60
3    1       b    30         10
4    2       b    90        100
What I would like to do is be able to divide grade and study_time by their first record within each subject. I do the following:
df %>%
  group_by(subject) %>%
  mutate(RN = row_number()) %>%
  mutate(study_time = study_time / study_time[RN == 1],
         grade = grade / grade[RN == 1]) %>%
  select(-RN)
I would get the following output
  year subject grade study_time
1    1       a     1          1
2    2       a     2          3
3    1       b     1          1
4    2       b     3         10
It's fairly easy to do when I know what the variable names are. However, I'm trying to write a generalized function that can act on any data.frame/data.table/tibble where I may not know the names of the variables I need to mutate; I'll only know the names of the variables not to mutate. I'm trying to get this done using tidyverse/data.table and I can't get anything to work.
Any help would be greatly appreciated.
We group by 'subject' and use mutate_at to change multiple columns, dividing each element by the first element of its group.
library(dplyr)

df %>%
  group_by(subject) %>%
  mutate_at(3:4, funs(./first(.)))
# A tibble: 4 x 4
# Groups:   subject [2]
#    year subject grade study_time
#   <int> <chr>   <dbl>      <dbl>
# 1     1 a           1          1
# 2     2 a           2          3
# 3     1 b           1          1
# 4     2 b           3         10
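Since the question asks for a version that only knows which columns not to touch, and mutate_at()/funs() are deprecated in current dplyr, here is an across()-based sketch (the protected column names are an assumption for illustration):
library(dplyr)

keep_as_is <- c("year", "subject")   # the only names we assume to know

df %>%
  group_by(subject) %>%
  mutate(across(-any_of(keep_as_is), ~ . / first(.))) %>%
  ungroup()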

apply a transformation to specific rows and columns of a df based on a second df

I have two huge data frames (especially the first one), which I simplify here.
library(tidyverse)
(thewhat <- tibble(sample = 1:10L, y = 1.0, z = 2.0))
# A tibble: 10 x 3
   sample     y     z
    <int> <dbl> <dbl>
 1      1    1.    2.
 2      2    1.    2.
 3      3    1.    2.
 4      4    1.    2.
 5      5    1.    2.
 6      6    1.    2.
 7      7    1.    2.
 8      8    1.    2.
 9      9    1.    2.
10     10    1.    2.
(thewhere <- tibble(cond = c("a", "a", "b", "c", "a"),
                    init_sample = c(1, 3, 4, 5, 7),
                    duration = c(1, 2, 2, 1, 3),
                    where = c(NA, "y", "z", "y", "z")))
# A tibble: 5 x 4
  cond  init_sample duration where
  <chr>       <dbl>    <dbl> <chr>
1 a              1.       1. <NA>
2 a              3.       2. y
3 b              4.       2. z
4 c              5.       1. y
5 a              7.       3. z
I want to "mutate" some cells of thewhat df to NAs based on information of thewhere df. Importantly, thewhat is in wide format, and I don't want to transform it to long format (because I have millions of rows).
I want to transform the samples indicated in thewhere by the init_sample until duration of the column indicated by where. (And if where is NA it means that it applies to all the columns of thewhat except sample; here y and z.)
I created a df, NAs, which indicates which are the cells that should be NA:
# table with the elements that should be replaced by NA
NAs <- filter(thewhere, cond == "a") %>%
  mutate(sample = map2(init_sample, init_sample + duration - 1, seq)) %>%
  unnest() %>%
  select(where, sample)
I tried different approaches, and this is the closest I got. In the next mutate, I did the NA transformation for one column, and I could manually add the rest of the relevant columns, but in my real scenario I have 30 columns.
# Takes into account the different columns but I need to manually add each relevant column
# and another case for mutate_all when the where is NA:
mutate(thewhat, y = if_else(sample %in% NAs$sample[NAs$where == "y"],
                            NA_real_, y))
The expected output is the following:
# A tibble: 10 x 3
   sample     y     z
    <int> <dbl> <dbl>
 1      1   NA    NA
 2      2    1.    2.
 3      3   NA     2.
 4      4   NA     2.
 5      5    1.    2.
 6      6    1.    2.
 7      7    1.   NA
 8      8    1.   NA
 9      9    1.   NA
10     10    1.    2.
Maybe mutate_at or mutate_if could work here, but I don't know how. Or some map function from purrr could save me, but I couldn't manage to make it work for this case.
(Brownie points if the solution remains in the tidyverse, but I could also live with another type of solution).
Thanks,
Bruno
Based on the description, we could use map
library(tidyverse)

lst <- NAs %>%
  split(.$where)

set_names(names(lst), names(lst)) %>%
  map_df(., ~ thewhat[[.x]] %>%
           replace(., thewhat$sample %in% lst[[.x]]$sample, NA_real_)) %>%
  bind_cols(thewhat %>%
              select(sample), .)
# A tibble: 10 x 3
#    sample     y     z
#     <int> <dbl> <dbl>
#  1      1     1     2
#  2      2     1     2
#  3      3    NA     2
#  4      4    NA     2
#  5      5     1     2
#  6      6     1     2
#  7      7     1    NA
#  8      8     1    NA
#  9      9     1    NA
# 10     10     1     2
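Note that the output above leaves sample 1 untouched because split() drops the rows where where is NA. A variant that also covers that case (a sketch, assuming dplyr >= 1.0 for across() and cur_column()): expand thewhere into one (sample, column) pair per cell that should be blanked out, reading an NA in where as "all value columns", then replace those cells in place without reshaping thewhat.
library(dplyr)
library(purrr)
library(tidyr)

value_cols <- setdiff(names(thewhat), "sample")

# one row per (sample, column) cell that should become NA
na_map <- thewhere %>%
  filter(cond == "a") %>%
  mutate(sample = map2(init_sample, init_sample + duration - 1, seq),
         where  = map(where, ~ if (is.na(.x)) value_cols else .x)) %>%
  unnest(sample) %>%
  unnest(where) %>%
  select(sample, where)

thewhat %>%
  mutate(across(all_of(value_cols),
                ~ replace(.x, sample %in% na_map$sample[na_map$where == cur_column()], NA)))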
