I have a dataframe like this:
ID Q4_1 Q4_2 Q4_3 Q4_4 Q4_5
1 4 3 1
2 1 3
3 3 1 2
4 5
How can I rearrange the values in each row so that every value lands in the column whose suffix matches it, like this?
ID Q4_1 Q4_2 Q4_3 Q4_4 Q4_5
1 1 3 4
2 1 3
3 1 2 3
4 5
Here is a base R approach. Note that I've replaced the empty space in your input with NA.
The core of this solution is to use lapply to go through all rows in the dataframe, then use ifelse to match the row values with the header_suffix I created, which contains the suffix of colnames of the dataframe.
df <- read.table(header = TRUE,
text = "ID Q4_1 Q4_2 Q4_3 Q4_4 Q4_5
1 4 3 1 NA NA
2 1 3 NA NA NA
3 3 1 2 NA NA
4 5 NA NA NA NA")
header_suffix <- gsub("^.+_", "", grep("_", colnames(df), value = TRUE))
header_suffix
#> [1] "1" "2" "3" "4" "5"
df2 <-
cbind(df[, 1],
as.data.frame(
do.call(
rbind,
lapply(1:nrow(df), function(x)
ifelse(header_suffix %in% df[x, 2:ncol(df)], header_suffix, NA))
)
)
)
colnames(df2) <- colnames(df)
df2
#> ID Q4_1 Q4_2 Q4_3 Q4_4 Q4_5
#> 1 1 1 <NA> 3 4 <NA>
#> 2 2 1 <NA> 3 <NA> <NA>
#> 3 3 1 2 3 <NA> <NA>
#> 4 4 <NA> <NA> <NA> <NA> 5
Created on 2022-03-31 by the reprex package (v2.0.1)
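As a variation on the base R answer above, here is a minimal sketch that does the same rearrangement with matrix indexing and keeps the columns as integers (the lapply/ifelse version returns non-numeric columns, which is why its missing values print as <NA>). It assumes every non-missing value is a valid column position between 1 and 5; the names vals, out, keep and df_num are only illustrative.

```r
df <- read.table(header = TRUE,
                 text = "ID Q4_1 Q4_2 Q4_3 Q4_4 Q4_5
1 4 3 1 NA NA
2 1 3 NA NA NA
3 3 1 2 NA NA
4 5 NA NA NA NA")

# Answers as an integer matrix (all-NA columns are read in as logical)
vals <- as.matrix(df[-1])
storage.mode(vals) <- "integer"

# Empty result with the same shape and column names
out <- matrix(NA_integer_, nrow(vals), ncol(vals), dimnames = dimnames(vals))

# Each non-missing value v in row i is written to cell (i, v)
keep <- !is.na(vals)
out[cbind(row(vals)[keep], vals[keep])] <- vals[keep]

df_num <- cbind(df["ID"], out)
df_num
```

A value outside 1..5 would make the matrix assignment fail, so real data may need a range check first.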
Here's a tidyverse approach:
library(dplyr)
library(tidyr)
df %>% pivot_longer(2:6, names_to = "Q", values_to = "no") %>%
  mutate(new_Q = paste0("Q4_", no)) %>%
  filter(new_Q != "Q4_NA") %>%
  select(ID, new_Q, no) %>%
  pivot_wider(id_cols = ID, names_from = new_Q, values_from = no) %>%
  select(ID, Q4_1, Q4_2, Q4_3, Q4_4, Q4_5)
# A tibble: 4 x 6
ID Q4_1 Q4_2 Q4_3 Q4_4 Q4_5
<int> <int> <int> <int> <int> <int>
1 1 1 NA 3 4 NA
2 2 1 NA 3 NA NA
3 3 1 2 3 NA NA
4 4 NA NA NA NA 5
Let's say I have a table like so:
df <- data.frame("Group" = c("A","A","A","B","B","B","C","C","C"),
"Num" = c(1,2,3,1,2,NA,NA,NA,NA))
Group Num
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B NA
7 C NA
8 C NA
9 C NA
In this case, because group C has Num as NA for all entries, I would like to remove rows in group C from the table. Any help is appreciated!
You could group_by your Group column and filter out the groups whose values are all NA. You can use the following code:
library(dplyr)
df %>%
group_by(Group) %>%
filter(!all(is.na(Num)))
#> # A tibble: 6 × 2
#> # Groups: Group [2]
#> Group Num
#> <chr> <dbl>
#> 1 A 1
#> 2 A 2
#> 3 A 3
#> 4 B 1
#> 5 B 2
#> 6 B NA
Created on 2023-01-18 with reprex v2.0.2
In base R you could index based on all the groups that have at least one non-NA value:
idx <- df$Group %in% unique(df[!is.na(df$Num),"Group"])
idx
df[idx,]
# or in one line
df[df$Group %in% unique(df[!is.na(df$Num),"Group"]),]
Output:
Group Num
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B NA
Using ave.
df[with(df, !ave(Num, Group, FUN=\(x) all(is.na(x)))), ]
# Group Num
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 1
# 5 B 2
# 6 B NA
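For comparison, the same idea can be sketched with tapply, which makes the per-group test explicit before any rows are dropped. This is only an illustrative alternative to the answers above; all_na, keep_groups and res are made-up names.

```r
df <- data.frame(Group = rep(c("A", "B", "C"), each = 3),
                 Num = c(1, 2, 3, 1, 2, NA, NA, NA, NA),
                 stringsAsFactors = FALSE)

# One logical per group: TRUE when every Num in that group is NA
all_na <- tapply(is.na(df$Num), df$Group, all)

# Keep only the groups that have at least one non-NA value
keep_groups <- names(all_na)[!all_na]
res <- df[df$Group %in% keep_groups, ]
res
```

Printing all_na first is a quick way to check which groups will be removed.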
I want to count how many times a specific value occurs across multiple columns and put the number of occurrences in a new column. My dataset has a lot of missing values, but the count should be NA only when the entire row consists of NAs. If possible, I would prefer something that works with dplyr pipelines.
Example dataset:
df <- data.frame(c1 = sample(1:4, 20, replace = TRUE),
c2 = sample(1:4, 20, replace = TRUE),
c3 = sample(1:4, 20, replace = TRUE),
c4 = sample(1:4, 20, replace = TRUE),
c5 = sample(1:4, 20, replace = TRUE))
for (i in 1:5) {
df[sample(1:20, 1), sample(1:5, 1)] <- NA
df[sample(1:20, 1), ] <- NA
}
c1 c2 c3 c4 c5
1 1 2 4 4 1
2 2 2 1 3 4
3 2 4 4 3 3
4 4 2 3 2 1
5 4 2 4 1 3
6 NA 1 2 4 4
7 3 NA 4 NA 4
8 NA NA NA NA NA
9 1 3 3 2 2
10 NA NA NA NA NA
I have tried with rowwise() and rowSums. Some non-working examples here:
# First attempt
df <- df %>%
rowwise() %>%
mutate(count2 = sum(c_across(c1:c5, ~.x %in% 2)))
# Second attempt
df <- df %>%
rowwise() %>%
mutate(count2 = sum(c_across(select(where(c1:c5 %in% 2)))))
# With rowSums
df <- df %>%
rowwise() %>%
mutate(count4 = rowSums(select(c1:c5 %in% 4), na.rm = TRUE))
How about this:
library(dplyr)
df <- data.frame(c1 = sample(1:4, 20, replace = TRUE),
c2 = sample(1:4, 20, replace = TRUE),
c3 = sample(1:4, 20, replace = TRUE),
c4 = sample(1:4, 20, replace = TRUE),
c5 = sample(1:4, 20, replace = TRUE))
for (i in 1:5) {
df[sample(1:20, 1), sample(1:5, 1)] <- NA
df[sample(1:20, 1), ] <- NA
}
df %>%
rowwise() %>%
mutate(count2 = sum(na.omit(c_across(c1:c5)) == 2),
count2 = ifelse(all(is.na(c_across(c1:c5))), NA, count2))
#> # A tibble: 20 × 6
#> # Rowwise:
#> c1 c2 c3 c4 c5 count2
#> <int> <int> <int> <int> <int> <int>
#> 1 NA NA NA NA NA NA
#> 2 2 2 3 4 2 3
#> 3 1 1 1 4 4 0
#> 4 2 3 3 2 4 2
#> 5 NA NA NA NA NA NA
#> 6 1 1 1 2 1 1
#> 7 3 3 2 3 4 1
#> 8 1 1 4 3 4 0
#> 9 NA NA NA NA NA NA
#> 10 NA NA NA NA NA NA
#> 11 2 3 3 4 1 1
#> 12 2 1 4 2 NA 2
#> 13 4 4 2 NA 2 2
#> 14 4 2 3 3 2 2
#> 15 1 3 4 2 2 2
#> 16 1 1 3 3 2 1
#> 17 1 1 1 4 4 0
#> 18 2 4 4 NA 1 1
#> 19 NA NA NA NA NA NA
#> 20 4 1 1 NA 4 0
Created on 2022-12-08 by the reprex package (v2.0.1)
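Since the question also mentions rowSums, here is a minimal vectorised sketch of the same rule without rowwise(): count the matches with NAs ignored, then overwrite the count with NA for rows that contain no data at all. The small data frame and the names m and count2 are assumptions made for illustration.

```r
df <- data.frame(c1 = c(1, 2, NA, NA),
                 c2 = c(2, 2, NA, 3),
                 c3 = c(4, NA, NA, 2))

m <- df[c("c1", "c2", "c3")]

# Number of cells equal to 2 in each row, ignoring missing cells
count2 <- rowSums(m == 2, na.rm = TRUE)

# Rows where every cell is missing get NA instead of 0
count2[rowSums(!is.na(m)) == 0] <- NA

df$count2 <- count2
df
```

On wide data this avoids evaluating the expression once per row, which is what rowwise() does.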
I have a dataframe like this:
df <- tibble::tibble(id = c(rep('A', 10), rep('B', 10)),
                     value = c(1:3, rep(NA, 2), 1:2, rep(NA, 3), 1, rep(NA, 4), 1:3, rep(NA, 2)))
I need to count the number of consecutive NAs in the value column. The count needs to be grouped by id, and it needs to restart at 1 every time a new NA or a new series of NAs is encountered. The expected output should look like this:
df$expected_output <- c(rep(NA, 3), 1:2, rep(NA, 2), 1:3, NA, 1:4, rep(NA, 3), 1:2)
If anyone can give me a dplyr solution that would also be great :)
I've tried a few things, but nothing gives a sensible result. Thanks in advance!
A solution using dplyr and data.table.
library(dplyr)
library(data.table)
df2 <- df %>%
group_by(id) %>%
mutate(info = rleid(value)) %>%
group_by(id, info) %>%
mutate(expected_output = row_number()) %>%
ungroup() %>%
mutate(expected_output = ifelse(!is.na(value), NA, expected_output)) %>%
select(-info)
df2
# # A tibble: 20 x 3
# id value expected_output
# <chr> <dbl> <int>
# 1 A 1 NA
# 2 A 2 NA
# 3 A 3 NA
# 4 A NA 1
# 5 A NA 2
# 6 A 1 NA
# 7 A 2 NA
# 8 A NA 1
# 9 A NA 2
# 10 A NA 3
# 11 B 1 NA
# 12 B NA 1
# 13 B NA 2
# 14 B NA 3
# 15 B NA 4
# 16 B 1 NA
# 17 B 2 NA
# 18 B 3 NA
# 19 B NA 1
# 20 B NA 2
We can use rle to get the lengths of the runs that are or are not NA, then use purrr::map2 to apply seq to the NA runs (giving the growing count) and fill the other runs with NA values using rep.
library(tidyverse)
count_na <- function(x) {
r <- rle(is.na(x))
consec <- map2(r$lengths, r$values, ~ if (.y) seq(.x) else rep(NA, .x))
unlist(consec)
}
df %>%
  group_by(id) %>%   # restart the count for each id, as the question asks
  mutate(expected_output = count_na(value)) %>%
  ungroup()
#> # A tibble: 20 × 3
#> id value expected_output
#> <chr> <dbl> <int>
#> 1 A 1 NA
#> 2 A 2 NA
#> 3 A 3 NA
#> 4 A NA 1
#> 5 A NA 2
#> 6 A 1 NA
#> 7 A 2 NA
#> 8 A NA 1
#> 9 A NA 2
#> 10 A NA 3
#> 11 B 1 NA
#> 12 B NA 1
#> 13 B NA 2
#> 14 B NA 3
#> 15 B NA 4
#> 16 B 1 NA
#> 17 B 2 NA
#> 18 B 3 NA
#> 19 B NA 1
#> 20 B NA 2
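Here is a solution using rle. Note that it runs over the whole column without grouping by id, so a run of NAs that crosses from one id into the next would be counted as a single run: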
x <- rle(is.na(df$value))
df$new[is.na(df$value)] <- sequence(x$lengths[x$values])
# A tibble: 20 x 3
id value new
<chr> <dbl> <int>
1 A 1 NA
2 A 2 NA
3 A 3 NA
4 A NA 1
5 A NA 2
6 A 1 NA
7 A 2 NA
8 A NA 1
9 A NA 2
10 A NA 3
11 B 1 NA
12 B NA 1
13 B NA 2
14 B NA 3
15 B NA 4
16 B 1 NA
17 B 2 NA
18 B 3 NA
19 B NA 1
20 B NA 2
Yet another solution:
library(tidyverse)
df %>%
  mutate(aux = data.table::rleid(value)) %>%
  group_by(id, aux) %>%
  mutate(eout = ifelse(is.na(value), row_number(), NA_real_)) %>%
  ungroup() %>%
  select(-aux)
#> # A tibble: 20 × 4
#> id value expected_output eout
#> <chr> <dbl> <int> <dbl>
#> 1 A 1 NA NA
#> 2 A 2 NA NA
#> 3 A 3 NA NA
#> 4 A NA 1 1
#> 5 A NA 2 2
#> 6 A 1 NA NA
#> 7 A 2 NA NA
#> 8 A NA 1 1
#> 9 A NA 2 2
#> 10 A NA 3 3
#> 11 B 1 NA NA
#> 12 B NA 1 1
#> 13 B NA 2 2
#> 14 B NA 3 3
#> 15 B NA 4 4
#> 16 B 1 NA NA
#> 17 B 2 NA NA
#> 18 B 3 NA NA
#> 19 B NA 1 1
#> 20 B NA 2 2
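The run-length idea used in these answers can also be sketched in base R alone, with ave handling the per-id grouping. This is a hedged illustration rather than a replacement for the answers above; count_na_runs is a hypothetical helper name, and the data are the vectors from the question.

```r
# Position of each element within its run, blanked out for non-missing values
count_na_runs <- function(x) {
  r <- rle(is.na(x))
  out <- sequence(r$lengths)  # 1, 2, ... within every run
  out[!is.na(x)] <- NA        # keep the counter only on NA runs
  out
}

id <- c(rep("A", 10), rep("B", 10))
value <- c(1:3, rep(NA, 2), 1:2, rep(NA, 3), 1, rep(NA, 4), 1:3, rep(NA, 2))

# ave() applies the helper separately to the positions of each id
res <- ave(seq_along(value), id, FUN = function(i) count_na_runs(value[i]))

expected <- c(rep(NA, 3), 1:2, rep(NA, 2), 1:3, NA, 1:4, rep(NA, 3), 1:2)
isTRUE(all.equal(res, expected))
```

Because the helper only ever sees one id at a time, a run of NAs cannot carry over from the end of one id into the start of the next.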
Let's say I have a nested df, and I want to unnest the columns:
df <- tibble::tribble(
~x, ~y, ~nestdf,
1, 2, tibble::tibble(a=1:2, b=3:4),
3, 4, tibble::tibble(a=3:5, b=5:7)
)
tidyr::unnest(df, nestdf)
# x y a b
# <dbl> <dbl> <int> <int>
#1 1 2 1 3
#2 1 2 2 4
#3 3 4 3 5
#4 3 4 4 6
#5 3 4 5 7
The result has the x and y columns extended to match the dimensions of nestdf, with the new rows using the existing values. However, I want the new rows to contain NA, like so:
# x y a b
# <dbl> <dbl> <int> <int>
#1 1 2 1 3
#2 NA NA 2 4
#3 3 4 3 5
#4 NA NA 4 6
#5 NA NA 5 7
Is it possible to do this with unnest? Either the first or last row for each group can be kept as non-NA, I don't mind.
Repeat the rows of x and y, padding them with NA rows, and bind the result to an unnest of the nested list column(s):
library(tidyr)
nr <- sapply(df$nestdf, nrow) - 1
cbind(
df[rep(rbind(seq_along(nr), NA), rbind(1, nr)), c("x","y")],
unnest(df["nestdf"], cols=everything())
)
# x y a b
#1 1 2 1 3
#2 NA NA 2 4
#3 3 4 3 5
#4 NA NA 4 6
#5 NA NA 5 7
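The row index in that answer is compact, so here is a small base R sketch of how it is built. Plain data frames stand in for the nested tibbles, and nested, keys, idx_values and idx_times are illustrative names that do not appear in the answer.

```r
nested <- list(data.frame(a = 1:2, b = 3:4),
               data.frame(a = 3:5, b = 5:7))
keys <- data.frame(x = c(1, 3), y = c(2, 4))

nr <- sapply(nested, nrow) - 1          # extra rows needed per group: 1, 2

idx_values <- rbind(seq_along(nr), NA)  # read column-wise: 1, NA, 2, NA
idx_times  <- rbind(1, nr)              # read column-wise: 1, 1, 1, 2
idx <- rep(idx_values, idx_times)       # each key row once, then NA padding

# Indexing with NA yields an all-NA row, which gives the blank keys
res <- cbind(keys[idx, ], do.call(rbind, nested))
rownames(res) <- NULL
res
```

This relies on every nested table having at least one row; an empty one would produce a negative repeat count.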
One way would be to change the duplicates to NA.
df1 <- tidyr::unnest(df, nestdf)
cols <- c('x', 'y')
df1[duplicated(df1[cols]), cols] <- NA
df1
# x y a b
# <dbl> <dbl> <int> <int>
#1 1 2 1 3
#2 NA NA 2 4
#3 3 4 3 5
#4 NA NA 4 6
#5 NA NA 5 7
If the values in columns x and y can repeat, you can add a row number to identify each original row uniquely:
library(dplyr)
library(tidyr)
df1 <- df %>% mutate(row = row_number()) %>% unnest(nestdf)
cols <- c('x', 'y', 'row')
df1[duplicated(df1[cols]), cols] <- NA
df1 <- select(df1, -row)
You could convert x and y to lists first:
library(tidyverse)
df <- tibble::tribble(
~x, ~y, ~nestdf,
1, 2, tibble::tibble(a=1:2, b=3:4),
3, 4, tibble::tibble(a=3:5, b=5:7)
)
df %>%
mutate_at(vars(x:y), ~map2(., nestdf, ~.x[seq(nrow(.y))])) %>%
unnest(everything())
# A tibble: 5 x 4
#       x     y     a     b
#   <dbl> <dbl> <int> <int>
# 1     1     2     1     3
# 2    NA    NA     2     4
# 3     3     4     3     5
# 4    NA    NA     4     6
# 5    NA    NA     5     7
I am trying to merge two data frames based on a subset of common IDs. Let me demonstrate:
library(tidyverse)
set.seed(42)
df = list(id = c(1,2,3,4,1,2,2,2,1,1),
          group = c("A","A","A","A","B","B","B","B","C","C"),
          val = c(round(rnorm(10,6,6),0))
          ) %>%
  as_tibble()
df_na = list(id = c(1,1,1,2,3,3,4,5,5,5),
             group = c(rep(NA,10)),
             val = c(rep(NA,10))
             ) %>%
  as_tibble()
df contains data and ids, while df_na only contains ids and NAs. I would like to create a combined data frame that contains all the information of df and add the NAs by group and id, i.e. for each group in df find which ids are present in both df and df_na and merge.
If I was doing this manually, i.e. group for group, I would use something like this:
A_dist = df %>% filter(group=="A") %>%
distinct(id) %>%
pull()
df_A_comb = df_na %>%
filter(id %in% A_dist) %>%
bind_rows(filter(df, group=="A"))
# A tibble: 11 x 3
id group val
<dbl> <chr> <dbl>
1 1 NA NA
2 1 NA NA
3 1 NA NA
4 2 NA NA
5 3 NA NA
6 3 NA NA
7 4 NA NA
8 1 A 14
9 2 A 3
10 3 A 8
11 4 A 10
But obviously, I would rather automate this. As an emerging fan of the tidyverse, I'm trying to get my head around purrr::map. I can create a vector of ids for each group.
df_dist = df %>%
split(.$group) %>%
map(distinct, id) %>%
map("id")
> df_dist
$A
[1] 1 2 3 4
$B
[1] 1 2
$C
[1] 1
But translating my dplyr approach is more complicated and produces an error message earlier on.
###this approach doesn't work...
df_comb = df_na %>%
map(filter, id %in% df_dist)# %>%
...
Error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "c('double', 'numeric')"
Any help will be much appreciated!
The tidyverse is great, but sometimes you can overlook the power of simple indexing and the really useful base R tools of the apply family. This single lapply function will give you a list containing all the data frames you want in the specified format:
lapply(unique(df$group), function(x){
rbind(df_na[df_na$id %in% df$id[df$group == x],], df[df$group == x,])})
Result:
#> [[1]]
#> # A tibble: 11 x 3
#> id group val
#> <dbl> <chr> <dbl>
#> 1 1 <NA> NA
#> 2 1 <NA> NA
#> 3 1 <NA> NA
#> 4 2 <NA> NA
#> 5 3 <NA> NA
#> 6 3 <NA> NA
#> 7 4 <NA> NA
#> 8 1 A 14
#> 9 2 A 3
#> 10 3 A 8
#> 11 4 A 10
#>
#> [[2]]
#> # A tibble: 8 x 3
#> id group val
#> <dbl> <chr> <dbl>
#> 1 1 <NA> NA
#> 2 1 <NA> NA
#> 3 1 <NA> NA
#> 4 2 <NA> NA
#> 5 1 B 8
#> 6 2 B 5
#> 7 2 B 15
#> 8 2 B 5
#>
#> [[3]]
#> # A tibble: 5 x 3
#> id group val
#> <dbl> <chr> <dbl>
#> 1 1 <NA> NA
#> 2 1 <NA> NA
#> 3 1 <NA> NA
#> 4 1 C 18
#> 5 1 C 6
If you want to join them into a single data frame, store the result (say as x) and do this:
do.call(rbind, x)
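Putting the two steps together, a minimal self-contained sketch might look like this. The val numbers are copied from the printed result above rather than regenerated with rnorm, and naming the list elements by group is an addition of this sketch, not part of the original answer.

```r
df <- data.frame(id = c(1, 2, 3, 4, 1, 2, 2, 2, 1, 1),
                 group = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
                 val = c(14, 3, 8, 10, 8, 5, 15, 5, 18, 6),
                 stringsAsFactors = FALSE)
df_na <- data.frame(id = c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5),
                    group = NA_character_,
                    val = NA_real_,
                    stringsAsFactors = FALSE)

groups <- unique(df$group)

# For each group: the NA rows whose id occurs in that group, then the group's own rows
parts <- lapply(groups, function(g) {
  ids <- df$id[df$group == g]
  rbind(df_na[df_na$id %in% ids, ], df[df$group == g, ])
})
names(parts) <- groups  # lets you write parts$A, parts$B, ...

# Stack the per-group pieces into one data frame
combined <- do.call(rbind, parts)
sapply(parts, nrow)
```

Note that after stacking, the NA rows no longer record which group they were matched to, so keep the list if that information matters.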