Collapse data frame so NAs are removed - r

I want to collapse this data frame so that the NAs are removed. How can I accomplish this? Thanks!
library(tidyr)
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id,q1)
df$row <- 1:nrow(df)
spread(df, id, q1)
row 1 2 3 4 5
1 23 NA NA NA NA
2 55 NA NA NA NA
3 7 NA NA NA NA
4 NA 88 NA NA NA
5 NA 90 NA NA NA
6 NA NA 34 NA NA
7 NA NA NA 11 NA
8 NA NA NA NA 22
9 NA NA NA NA 99
I want it to look like this:
1 2 3 4 5
23 88 34 11 22
55 90 NA NA 99
7 NA NA NA NA
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The row index should be created within each 'id' group rather than across the whole data frame, so that every id starts again at row 1. In addition, pivot_wider is a more general function than spread.
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
mutate(row = row_number()) %>%
ungroup %>%
pivot_wider(names_from = id, values_from = q1) %>%
select(-row)
-output
# A tibble: 3 × 5
`1` `2` `3` `4` `5`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 23 88 34 11 22
2 55 90 NA NA 99
3 7 NA NA NA NA
Or use dcast
library(data.table)
dcast(setDT(df), rowid(id) ~ id, value.var = 'q1')[, id := NULL][]
1 2 3 4 5
<num> <num> <num> <num> <num>
1: 23 88 34 11 22
2: 55 90 NA NA 99
3: 7 NA NA NA NA

Here's a base R solution. I sort each column so the non-NA values are at the top, find the number of non-NA values in the column with the most non-NA values (n), and return the top n rows from the data frame.
library(tidyr)
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id,q1)
df$row <- 1:nrow(df)
df <- spread(df, id, q1)
collapse_df <- function(df) {
move_na_to_bottom <- function(x) x[order(is.na(x))]
sorted <- sapply(df, move_na_to_bottom)
count_non_na <- function(x) sum(!is.na(x))
n <- max(apply(df, 2, count_non_na))
sorted[1:n, ]
}
collapse_df(df[, -1])
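A variation that skips the spread step entirely, as a minimal sketch. It assumes the original long-form id and q1 vectors are still available; the names cols, n and wide are made up for illustration. The idea is to split the values by id and pad each group with NA up to the length of the longest group.

```r
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)

# One vector of q1 values per id (list names are "1" ... "5")
cols <- split(q1, id)

# Length of the longest group decides the number of rows
n <- max(lengths(cols))

# Indexing past the end of a vector yields NA, which pads the short groups
wide <- sapply(cols, function(x) x[seq_len(n)])
wide
```

The result is a numeric matrix with one column per id; wrap it in as.data.frame() if a data frame is needed.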

Related

Count the length of each event by group

I am counting events under multiple conditions, and my final result is reported for each year separately. An example of my df:
year <- c(rep(1981,20))
k1 <- c(rep(NA,5),rep("COLD",4),rep(NA,4),"COLD",NA,"COLD",rep(NA,4))
k2 <- c(rep(NA,10),rep("COLD",2),rep(NA,8))
k3 <- c(rep(NA,3),"COLD",rep(NA,16))
k4 <- c(rep(NA,3),rep("COLD",5),rep(NA,2),rep("COLD",6),NA,rep("COLD",3))
k5 <- c(rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,8))
df <- data.frame(year,k1,k2,k3,k4,k5)
The code I use is below:
rle_col <- function(k_col) {
with(rle(is.na(k_col)), {
i1 <- values
i1[values & lengths <= 2] <- 'Invalid'
sum(!values & lengths >= 5 &
(lead(i1) != "Invalid" & lag(i1)>=1), na.rm = TRUE)
})
}
rezult <- df %>%
group_by(year) %>%
summarise(across(starts_with('k'), rle_col))
My code works well, but I need to get one more variable - the length of each event (or the number of members in each event). This means that instead of sum here sum(!values & lengths >= 5 & (lead(i1) != "Invalid" & lag(i1)>=1), na.rm = TRUE) I need to get the number of values within each event, which satisfies these conditions. Is it possible? Thank you for any help.
The result I would like to see should be as follows:
Year k4
1981 5
1981 10
There must be easier solutions to this problem. Here is a ridiculously complicated (and poorly performing) approach using tidyverse:
library(tidyr)
library(dplyr)
library(purrr)
map_df(
names(df)[grepl("^k", names(df))],
~df %>%
group_by(year) %>%
mutate(
grp = (!!sym(.x) == "COLD" & !is.na(!!sym(.x))) |
!(is.na(lead(!!sym(.x))) | is.na(lag(!!sym(.x))))
) %>%
group_by(year, grp2 = cumsum(!grp)) %>%
filter(grp) %>%
count(name = .x) %>%
filter(!!sym(.x) >= 5) %>%
ungroup() %>%
select(-grp2)
)
This returns
# A tibble: 2 x 6
year k1 k2 k3 k4 k5
<dbl> <int> <int> <int> <int> <int>
1 1981 NA NA NA 5 NA
2 1981 NA NA NA 10 NA
Issues:
It's not performing well.
If there are several groups in different columns the output looks like this:
# A tibble: 9 x 6
year k1 k2 k3 k4 k5
<dbl> <int> <int> <int> <int> <int>
1 1981 4 NA NA NA NA
2 1981 3 NA NA NA NA
3 1981 NA 2 NA NA NA
4 1981 NA NA 1 NA NA
5 1981 NA NA NA 5 NA
6 1981 NA NA NA 10 NA
7 1981 NA NA NA NA 1
8 1981 NA NA NA NA 1
9 1981 NA NA NA NA 1
At the moment I don't know how to change it into something like
# A tibble: 2 x 6
year k1 k2 k3 k4 k5
<dbl> <int> <int> <int> <int> <int>
1 1981 4 2 1 5 1
2 1981 3 NA NA 10 1

R: take different variables from a row and make column header

rainfall <- data.frame("date" = rep(1:15),"location_code" = rep(6:8,5),
"rainfall"=runif(15, min=12, max=60))
rainfall30 <- rainfall %>%
group_by(location_code) %>%
filter(rainfall>30)
I want to use the above data to make the following table. Is there a way to do it in R using dplyr?
date  location6  location7  location8
2                47.7
5                46.8
6                           32.3
7     55.3
9                           40.5
I am just starting to use R, so apologies if this has already been answered. Thanks.
I think what you are looking for is tidyr::pivot_wider, which turns this long-form data.frame into a wide form. See the tidyr pivoting vignette (vignette("pivot")) for more information on pivoting data with tidyr.
rainfall30 %>%
pivot_wider(names_from = location_code,
values_from = rainfall)
# date `6` `7` `8`
# <int> <dbl> <dbl> <dbl>
# 1 1 32.3 NA NA
# 2 2 NA 52.7 NA
# 3 3 NA NA 54.3
# 4 4 30.6 NA NA
# 5 7 52.4 NA NA
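If the column names should read location6, location7, location8 as in the question, the names_prefix argument of pivot_wider may help. A minimal sketch, assuming the rainfall data from the question; set.seed(1) is added here only to make the random values reproducible, and the name wide is made up.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: any seed works, this just fixes the values
rainfall <- data.frame(date = 1:15,
                       location_code = rep(6:8, 5),
                       rainfall = runif(15, min = 12, max = 60))

wide <- rainfall %>%
  filter(rainfall > 30) %>%
  pivot_wider(names_from = location_code,
              values_from = rainfall,
              names_prefix = "location")  # prepended to each new column name
wide
```

The new columns appear in the order in which the location codes first occur in the filtered data, so a select() may be needed afterwards to reorder them.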
Here is a base R option using reshape + subset
reshape(
subset(rainfall, rainfall > 30),
idvar = "date",
timevar = "location_code",
direction = "wide"
)
which gives something like below (using set.seed(1) to generate rainfall)
date rainfall.8 rainfall.6 rainfall.7
3 3 39.49696 NA NA
4 4 NA 55.59397 NA
6 6 55.12270 NA NA
7 7 NA 57.34441 NA
8 8 NA NA 43.71829
9 9 42.19747 NA NA
13 13 NA 44.97710 NA
14 14 NA NA 30.43698
15 15 48.95239 NA NA

Counting columns with NAs after group_by

I want to count the number of columns that have an NA value after using group_by.
Similar questions have been asked, but they count the total number of NAs rather than the number of columns with NA (group by counting non NA).
Data:
Spes <- "Year Spec.1 Spec.2 Spec.3 Spec.4
1 2016 5 NA NA 5
2 2016 1 NA NA 6
3 2016 6 NA NA 4
4 2018 NA 5 5 9
5 2018 NA 4 7 3
6 2018 NA 5 2 1
7 2019 6 NA NA NA
8 2019 4 NA NA NA
9 2019 3 NA NA NA"
Data <- read.table(text=Spes, header = TRUE)
Data$Year <- as.factor(Data$Year)
The desired output:
2016 2
2018 1
2019 3
I have tried a few things, this is my current best attempt. I would be keen for a dplyr solution.
> Data %>%
group_by(Year) %>%
summarise_each(colSums(is.na(Data, [2:5])))
Error: Can't create call to non-callable object
I have tried variations without much luck. Many thanks
One option is to group_by Year, check whether each column contains any NA values within the group, and then sum those flags across columns for each Year.
library(dplyr)
Data %>%
group_by(Year) %>%
summarise_all(~any(is.na(.))) %>%
mutate(output = rowSums(.[-1])) %>%
select(Year, output)
# A tibble: 3 x 2
# Year output
# <fct> <dbl>
#1 2016 2
#2 2018 1
#3 2019 3
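The same idea can be sketched with across(), which newer dplyr versions recommend over summarise_all. This is only a sketch; the names flags and out are made up, and the data is re-created inline so the block runs on its own.

```r
library(dplyr)

Data <- read.table(text = "Year Spec.1 Spec.2 Spec.3 Spec.4
1 2016 5 NA NA 5
2 2016 1 NA NA 6
3 2016 6 NA NA 4
4 2018 NA 5 5 9
5 2018 NA 4 7 3
6 2018 NA 5 2 1
7 2019 6 NA NA NA
8 2019 4 NA NA NA
9 2019 3 NA NA NA", header = TRUE)

# One TRUE/FALSE per Year and column: does the column contain any NA?
flags <- Data %>%
  group_by(Year) %>%
  summarise(across(everything(), ~ any(is.na(.x))))

# Count the TRUE flags in each row (every column except Year)
out <- data.frame(Year = flags$Year, output = rowSums(flags[-1]))
out
```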
Base R translation using aggregate
rowSums(aggregate(.~Year, Data, function(x)
any(is.na(x)), na.action = "na.pass")[-1], na.rm = TRUE)
#[1] 2 1 3

Spread and Gather table return duplicated rows with NA values

I have a table with categories and sub categories encoded in this format of columns name:
Date| Admissions__0 |Attendance__0 |Tri_1__0|Tri_2__0|...
Tri_1__1|Tri_2__1|...|
and I would like to change it to this column format using the spread and gather functions of tidyverse:
Date| Country code| Admissions| Attendance| Tri_1|Tri_2|...
I tried a solution that was posted, but the outcome returns multiple rows with NA rather than a single row.
My code used:
temp <- data %>% gather(key="columns",value ="dt",-Date)
temp <- temp %>% mutate(category = gsub(".*__","",columns)) %>% mutate(columns = gsub("__\\d","",columns))
temp %>% mutate(row = row_number()) %>% spread(key="columns",value="dt")
And my result is:
Date country_code row admissions attendance Tri_1 Tri_2 Tri_3 Tri_4 Tri_5
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 01-APR-2014 0 275 NA 209 NA NA NA NA NA
2 01-APR-2014 0 640 84 NA NA NA NA NA NA
3 01-APR-2014 0 1005 NA NA 5 NA NA NA NA
4 01-APR-2014 0 1370 NA NA NA 33 NA NA NA
5 01-APR-2014 0 1735 NA NA NA NA 62 NA NA
6 01-APR-2014 0 2100 NA NA NA NA NA 80 NA
7 01-APR-2014 0 2465 NA NA NA NA NA NA 29
8 01-APR-2014 1 2830 NA 138 NA NA NA NA NA
9 01-APR-2014 1 3195 66 NA NA NA NA NA NA
10 01-APR-2014 1 3560 NA NA N/A NA NA NA NA
My expected results:
Date country_code row admissions attendance Tri_1 Tri_2 Tri_3 Tri_4 Tri_5
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 01-APR-2014 0 275 84 209 5 33 62 80 29
8 01-APR-2014 1 2830 66 138 66 ... ... ... ...
We can use summarise_at with coalesce to drop the NA elements after the spread
library(tidyverse)
data %>%
gather(key = "columns", value = "dt", -Date, na.rm = TRUE) %>%
mutate(category = gsub(".*__","",columns)) %>%
mutate(columns = gsub("__\\d","",columns)) %>%
group_by(Date, dt, columns, category) %>%
mutate(rn = row_number()) %>%
spread(columns, dt) %>%
select(-V1) %>%
summarise_at(vars(Admissions:Tri_5),list(~ coalesce(!!! .))) # %>%
# filter if needed
#filter_at(vars(Admissions:Tri_5), all_vars(!is.na(.)))
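With a recent tidyr, the gather/spread round trip can probably be avoided altogether: pivot_longer can split the column names on "__" and keep the first part as the output column via the special ".value" name. A minimal sketch on a small made-up data frame (the values and the reduced set of columns are assumptions, not the asker's data):

```r
library(tidyr)

# Toy data in the same naming scheme: <measure>__<country code>
data <- data.frame(Date = "01-APR-2014",
                   Admissions__0 = 84, Attendance__0 = 209, Tri_1__0 = 5,
                   Admissions__1 = 66, Attendance__1 = 138, Tri_1__1 = 7)

long <- pivot_longer(data,
                     cols = -Date,
                     # ".value" keeps the part before "__" as a column name;
                     # the part after "__" goes into country_code
                     names_to = c(".value", "country_code"),
                     names_sep = "__")
long
```

This gives one row per Date and country_code, with Admissions, Attendance and Tri_1 as ordinary columns, so no helper row number and no NA-filled duplicate rows are created.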

Replacing NA's with a specific condition in R [duplicate]

If 2017 is NA and the 2015 and 2016 columns both have values, I want to assign their average to 2017 in the same row.
Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 73050000 NA
4 NA NA NA
5 10500000 NA 8000000
6 331000000 659000000 1040000000
7 55500000 NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288
...
Here is what I tried; it didn't work:
ind <- which(is.na(df), arr.ind=TRUE)
df[ind] <- rowMeans(df, na.rm = TRUE)[ind[,1]]
Likewise, if the 2015 and 2017 columns have values and 2016 is NA, I want to assign their average to 2016 in the same row. Any help would be appreciated!
Disclaimer: I'm not entirely clear on what your expected output is. My solution below is based on the assumption that you want to replace NA values with either the mean of all values for every year or with the mean value of all values for every Index.
Here is a tidyverse option that first gathers the data from wide to long, replaces NAs with the mean value per year, and finally spreads it back from long to wide.
library(tidyverse)
df %>%
gather(year, value, -Index) %>%
group_by(year) %>%
mutate(value = ifelse(is.na(value), mean(value, na.rm = T), value)) %>%
spread(year, value)
## A tibble: 8 x 4
# Index `2015` `2016` `2017`
# <int> <dbl> <dbl> <dbl>
#1 1 115507293. 6355698. 10107023.
#2 2 13000000. 223472356. 186197372.
#3 4 115507293. 223472356. 186197372.
#4 5 115507293. 223472356. 8000000.
#5 6 331000000. 659000000. 1040000000.
#6 7 115507293. 223472356. 32032920.
#7 8 115507293. 223472356. 20000000.
#8 9 2521880. 5061370. 7044288.
Note that here we replace NAs with mean value per year. If instead you want to replace NAs with the mean value per Index value, simply replace group_by(year) with group_by(Index):
df %>%
gather(year, value, -Index) %>%
group_by(Index) %>%
mutate(value = ifelse(is.na(value), mean(value, na.rm = T), value)) %>%
spread(year, value)
## A tibble: 8 x 4
## Groups: Index [8]
# Index `2015` `2016` `2017`
# <int> <dbl> <dbl> <dbl>
#1 1 8231360. 6355698. 10107023.
#2 2 13000000. 13000000. 13000000.
#3 4 NaN NaN NaN
#4 5 8000000. 8000000. 8000000.
#5 6 331000000. 659000000. 1040000000.
#6 7 32032920. 32032920. 32032920.
#7 8 20000000. 20000000. 20000000.
#8 9 2521880. 5061370. 7044288.
Update
To only replace NAs in column 2017 with the row average based on the 2015,2016 values you can do
df <- read_table("Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 73050000 NA
4 NA NA NA
5 10500000 NA 8000000
6 331000000 659000000 1040000000
7 55500000 NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288")
df %>%
mutate(`2017` = ifelse(is.na(`2017`), 0.5 * (`2015` + `2016`), `2017`))
## A tibble: 8 x 4
# Index `2015` `2016` `2017`
# <int> <int> <int> <dbl>
#1 1 NA 6355698 10107023.
#2 2 13000000 73050000 43025000.
#3 4 NA NA NA
#4 5 10500000 NA 8000000.
#5 6 331000000 659000000 1040000000.
#6 7 55500000 NA 32032920.
#7 8 NA NA 20000000.
#8 9 2521880 5061370 7044288.
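To handle the 2016 case from the question with the same rule, a base R sketch could fill a missing year only when the other two years in that row are present. This generalises slightly beyond the question (a missing 2015 is filled too), and the helper names years, m and fill are made up; the data frame is the one from the question, typed out so the block runs on its own.

```r
df <- data.frame(
  Index  = c(1, 2, 4, 5, 6, 7, 8, 9),
  `2015` = c(NA, 13000000, NA, 10500000, 331000000, 55500000, NA, 2521880),
  `2016` = c(6355698, 73050000, NA, NA, 659000000, NA, NA, 5061370),
  `2017` = c(10107023, NA, NA, 8000000, 1040000000, 32032920, 20000000, 7044288),
  check.names = FALSE
)

years <- c("2015", "2016", "2017")
m <- as.matrix(df[years])   # year columns only, so Index never enters the mean

# Fill a cell only if it is NA and exactly two years in its row are present
fill <- is.na(m) & rowSums(!is.na(m)) == 2
m[fill] <- rowMeans(m, na.rm = TRUE)[row(m)[fill]]

df[years] <- as.data.frame(m)
df
```

Rows with two or three missing years are left untouched. Taking the mean over the year columns only is also the likely reason the attempt in the question misbehaved, since rowMeans(df, ...) there includes the Index column.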
Sample data
df <- read_table("Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 NA NA
4 NA NA NA
5 NA NA 8000000
6 331000000 659000000 1040000000
7 NA NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288")
