I'm working with JSON data which I've converted into a tibble with some list columns. I'm trying to extract the useful information from the list columns but am running into issues. Given the following dataset:
mydf <-tibble(
x = c(1, 2, 3),
y = list(list(list(id="id1", title="title1"), list(id="id11", title="title11")),
list(id="id2",title="title2"),
NULL)
)
How can I convert it into the following:
data.frame(x=c(1:3), id = c("id1;id11", "id2", ""), title = c("title1;title11", "title2", ""))
# x id title
#1 1 id1;id11 title1;title11
#2 2 id2 title2
#3 3
Any help is appreciated. Thanks!
I think there are better ways, but this is what I can do for now. For each row, I extracted strings and concatenated them with toString(). Since unnest() creates multiple rows for each row (i.e., 1, 2, and 3 in x), I used summarize() to temporarily combine strings. Then, I separate them using separate().
mydf %>%
unnest(y, keep_empty = TRUE) %>%
rowwise %>%
mutate(y = toString(unlist(y))) %>%
group_by(x) %>%
summarize(string = paste(y, collapse = "_")) %>%
separate(col = string, into = c("id", "title"), sep = "_")
# x id title
# <dbl> <chr> <chr>
#1 1 id1, title1 id11, title11
#2 2 id2 title2
#3 3 "" NA
If the names are consistent as in the example, you can do:
mydf2 <- unlist(mydf)
x <- mydf2[grepl("x", names(mydf2))]
id <- mydf2[grepl("id", names(mydf2))]
title <- mydf2[grepl("title", names(mydf2))]
tibble(x, id, title)
# A tibble: 3 x 3
x id title
<chr> <chr> <chr>
1 1 id1 title1
2 2 id11 title11
3 3 id2 title2
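Neither output above quite matches the layout asked for (ids joined with ";" in one column, titles in another, empty strings for the NULL row). Here is a minimal sketch that gets there, assuming the leaf names are always exactly id and title as in the example. collapse_field is a hypothetical helper name, not a function from any package.

```r
library(tibble)
library(dplyr)
library(purrr)

mydf <- tibble(
  x = c(1, 2, 3),
  y = list(
    list(list(id = "id1", title = "title1"), list(id = "id11", title = "title11")),
    list(id = "id2", title = "title2"),
    NULL
  )
)

# Hypothetical helper: flatten one cell of the list column and join every
# leaf whose name equals `field` with ";". A NULL cell gives "".
collapse_field <- function(entry, field) {
  flat <- unlist(entry)  # named character vector, or NULL for an empty cell
  if (is.null(flat)) return("")
  paste(flat[names(flat) == field], collapse = ";")
}

result <- mydf %>%
  mutate(
    id    = map_chr(y, ~ collapse_field(.x, "id")),
    title = map_chr(y, ~ collapse_field(.x, "title"))
  ) %>%
  select(-y)

# result$id    is c("id1;id11", "id2", "")
# result$title is c("title1;title11", "title2", "")
```

Because unlist() flattens any depth of nesting, the same helper handles both the row with a list of records and the row with a single record.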
I have a large data frame. I'm trying to remove the character v from its variable names.
df <- tibble(q_ve5 = 1:2,
q_f_1v = 3:4,
q_vf_2 = 3:4,
q_e6 = 5:6,
q_ev8 = 5:6)
I tried this, but it seems my regular expression pattern is not correct:
df %>%
rename_all(~ str_remove(., "\\v\\d+$"))
My desired col names:
q_e5 q_f_1 q_f_2 q_e6 q_e8
If the goal is simply to remove the 'v', the one-or-more-digits pattern (\\d+) anchored at the end ($) is not needed, because the expected output also removes the 'v' from the first column, 'q_ve5', where the 'v' is not immediately followed by digits.
library(dplyr)
library(stringr)
df %>%
rename_with(~ str_remove(., "v"), everything())
-output
# A tibble: 2 × 5
q_e5 q_f_1 q_f_2 q_e6 q_e8
<int> <int> <int> <int> <int>
1 1 3 3 5 5
2 2 4 4 6 6
Or without any packages
names(df) <- sub("v", "", names(df))
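One caveat with both versions above: sub() and str_remove() drop only the first match in each name. That is enough for the example, where every name has at most one v. A small sketch of the difference, where the last name ("q_vev9") is a made-up addition and not part of the original data:

```r
# Column names from the question, plus one hypothetical name with two v's
nms <- c("q_ve5", "q_f_1v", "q_vf_2", "q_e6", "q_ev8", "q_vev9")

first_only <- sub("v", "", nms, fixed = TRUE)   # removes the first v only
all_v      <- gsub("v", "", nms, fixed = TRUE)  # removes every v

# first_only[6] is "q_ev9", all_v[6] is "q_e9"
```

With stringr, str_remove_all() plays the role of gsub() here.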
How can I remove an entire group if one of its values is NA? For example, remove category B because it contains NA.
library(dplyr)
tbl = tibble(category = c("A", "A", "B", "B"),
values = c(2, 3, 1, NA))
We can use filter after grouping by 'category'
library(dplyr)
tbl %>%
group_by(category) %>%
filter(!any(is.na(values))) %>%
ungroup
-output
# A tibble: 2 x 2
category values
<chr> <dbl>
1 A 2
2 A 3
tbl %>%
filter(!category %in% category[is.na(values)])
Output
category values
<chr> <dbl>
1 A 2
2 A 3
tbl %>%
group_by(category) %>%
filter(all(!is.na(values)))
category values
<chr> <dbl>
1 A 2
2 A 3
You can get the categories which have at least one NA value and exclude them.
subset(tbl, !category %in% unique(category[is.na(values)]))
# category values
# <chr> <dbl>
#1 A 2
#2 A 3
If you prefer dplyr::filter.
library(dplyr)
tbl %>% filter(!category %in% unique(category[is.na(values)]))
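To make the logic of the %in% approach easier to follow, here it is split into its intermediate steps. This is only an illustrative sketch in base R; the names has_na, bad and kept are mine.

```r
tbl <- data.frame(category = c("A", "A", "B", "B"),
                  values   = c(2, 3, 1, NA))

has_na <- is.na(tbl$values)            # FALSE FALSE FALSE TRUE
bad    <- unique(tbl$category[has_na]) # categories with at least one NA: "B"
kept   <- tbl[!tbl$category %in% bad, ] # drop every row of those categories

# kept contains only the two rows of category "A"
```

Note that the whole of category B is dropped, including the row whose value is 1, which is what distinguishes this from a plain filter(!is.na(values)).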
I would like to combine two variables that have only one answer each into a single variable that has both answers.
Example
IPV_YES only has answers that are 1
IPV_NO only has answers that are 2
I would like to combine them into a single variable named IPV that would have the 1 and 2 results from both individual variables.
I have tried using the ifelse command, but it only shows me the value of IPV_YES.
Dataset I have
My desired outcome
my answer
df %>% mutate(across(everything(), ~ifelse(. == "", NA, as.numeric(.)))) %>%
group_by(ID) %>%
rowwise() %>%
transmute(IPV = sum(c_across(everything()), na.rm = T))
# A tibble: 4 x 2
# Rowwise: ID
ID IPV
<dbl> <dbl>
1 1 1
2 2 2
3 3 1
4 4 2
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
We can use coalesce after converting the '' to NA
library(dplyr)
df <- df %>%
transmute(ID, IPV = coalesce(na_if(IPV_YES, ""), na_if(IPV_NO, ""))) %>%
type.convert(as.is = TRUE)
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
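For anyone unsure what the two helpers do, here is the same idea on bare vectors: na_if() turns the empty strings into NA, and coalesce() then takes, position by position, the first value that is not NA. A minimal sketch, with vector names of my own choosing:

```r
library(dplyr)

yes <- c("1", "", "1", "")
no  <- c("", "2", "", "2")

yes_na <- na_if(yes, "")        # "1" NA  "1" NA
no_na  <- na_if(no, "")         # NA  "2" NA  "2"

ipv <- coalesce(yes_na, no_na)  # first non-missing value per position
ipv_int <- as.integer(ipv)      # 1 2 1 2
```

If both columns were blank in some row, the result for that row would stay NA.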
df$IPV <- ifelse(df$IPV_YES != "", df$IPV_YES, df$IPV_NO)
Here, we specify an ifelse statement; it can be glossed thus: if the value in df$IPV_YES is not blank, then give the value in df$IPV_YES, else give the value from the same row of df$IPV_NO. Both branches are passed at full length so that the values stay aligned row by row.
If you want to remove the IPV_* columns:
df[,2:3] <- NULL
Result:
df
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
Data:
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
Maybe you can try the code below
replace(df, df == "", NA) %>%
mutate(IPV = coalesce(IPV_YES, IPV_NO)) %>%
select(ID, IPV) %>%
type.convert(as.is = TRUE)
which gives
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
I recently had to compile a data frame of student scores (one row per student, id column and several integer-valued columns, one per score component). I had to combine a "master" data frame and several "correction" data frames (containing mostly NA and some updates to the master), so that the result contains the maximum values from the master, and all corrections.
I succeeded by copy-pasting a sequence of mutate() calls, which works (see example below), but is not elegant in my opinion. What I would have wanted to do, was instead of copying and pasting, to use something along the lines of map2 and two lists of columns to compare the columns pair-wise. Something like (which obviously does not work as such):
list_of_cols1 <- list(col1.x, col2.x, col3.x)
list_of_cols2 <- list(col1.y, col2.y, col3.y)
map2(list_of_cols1, list_of_cols2, ~ column = pmax(.x, .y, na.rm=T))
I can't seem to figure out how to do it. My question is: how can I specify such lists of columns and mutate them in one map2() call in a dplyr pipe, or is it even possible? Have I gotten it all wrong?
Minimum working example
library(tidyverse)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- reduce(
# Ultimately there would several correction data frames
list(master, correction1),
function(x,y) {
x <- x %>%
left_join(
y,
by = c("id")
) %>%
# Wish I knew how to do this mutate call with map2
mutate(
col1 = pmax(col1.x, col1.y, na.rm=T),
col2 = pmax(col2.x, col2.y, na.rm=T),
col3 = pmax(col3.x, col3.y, na.rm=T)
) %>%
select(id, col1:col3)
}
)
The result is
> result
# A tibble: 3 x 4
id col1 col2 col3
<int> <dbl> <dbl> <dbl>
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3
Rather than do a left_join, just bind the rows then summarize. For example
result <- reduce(
list(master, correction1),
function(x,y) {
bind_rows(x, y) %>%
group_by(id) %>%
summarize_all(max, na.rm=T)
}
)
result
# id col1 col2 col3
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2 3
# 2 2 1 2 3
# 3 3 2 3 3
Actually, you don't even need reduce as bind_rows can take a list
Adding another table
correction2 <- tibble(id=2,col1=NA,col2=8,col3=NA)
bind_rows(master, correction1, correction2) %>%
group_by(id) %>%
summarize_all(max, na.rm=T)
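The call above passes the tables as separate arguments. When the corrections are already collected in a list, the list can be handed to bind_rows() directly. A sketch of that form, using across() since the scoped verbs such as summarize_all() are superseded in current dplyr; the name all_tables is mine.

```r
library(dplyr)

master <- tibble(id = c(1, 2, 3), col1 = c(1, 1, 1), col2 = c(2, 2, 2), col3 = c(3, 3, 3))
correction1 <- tibble(id = seq(1, 3), col1 = c(NA, NA, 2), col2 = c(1, NA, 3), col3 = c(NA, NA, NA))
correction2 <- tibble(id = 2, col1 = NA, col2 = 8, col3 = NA)

# Any number of correction tables can be appended to this list
all_tables <- list(master, correction1, correction2)

result <- bind_rows(all_tables) %>%
  group_by(id) %>%
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)))

# For id 2, col2 becomes 8; all other cells match the earlier result
```

This relies on every id appearing in master, so that each group has at least one non-missing value per column.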
Sorry, this doesn't answer your question about map2; I find it's easier to aggregate over rows than over columns in tidy R:
library(dplyr)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- list(master, correction1) %>%
bind_rows() %>%
group_by(id) %>%
summarise_all(max, na.rm = TRUE)
result
#> # A tibble: 3 x 4
#> id col1 col2 col3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 3
#> 2 2 1 2 3
#> 3 3 2 3 3
If correction tables will always have the same structure as master, you can do something like the following:
library(dplyr)
library(purrr)
update_master = function(...){
map(list(...), as.matrix) %>%
reduce(pmax, na.rm = TRUE) %>%
data.frame()
}
update_master(master, correction1)
To allow id to take character values, make the following modification:
update_master = function(x, ...){
map(list(x, ...), function(x) as.matrix(x[-1])) %>%
reduce(pmax, na.rm = TRUE) %>%
data.frame(id = x[[1]], .)
}
update_master(master, correction1)
Result:
id col1 col2 col3
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3
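Since the question specifically asked about map2(), here is one possible way to do the pairwise pmax() with it. This is a sketch rather than a recommendation over the bind_rows() answers; score_cols and apply_correction are names I made up, and it assumes every table has the same score columns.

```r
library(dplyr)
library(purrr)

master <- tibble(id = c(1, 2, 3), col1 = c(1, 1, 1), col2 = c(2, 2, 2), col3 = c(3, 3, 3))
correction1 <- tibble(id = c(1, 2, 3), col1 = c(NA, NA, 2), col2 = c(1, NA, 3), col3 = c(NA, NA, NA))

score_cols <- c("col1", "col2", "col3")

# Join one correction onto the running result, then take the pairwise
# maximum of each ".x" column and its ".y" counterpart with map2().
apply_correction <- function(x, y) {
  joined <- left_join(x, y, by = "id", suffix = c(".x", ".y"))
  merged <- map2(
    joined[paste0(score_cols, ".x")],
    joined[paste0(score_cols, ".y")],
    ~ pmax(.x, .y, na.rm = TRUE)
  )
  names(merged) <- score_cols
  bind_cols(joined["id"], as_tibble(merged))
}

result <- reduce(list(master, correction1), apply_correction)
```

A data frame is a list of columns, so subsetting with a character vector of names gives map2() the two parallel lists the question was looking for.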
Consider this simple example
> weird_df <- data_frame(col1 =c('hello', 'world', 'again'),
+ col_weird = list(list(12,23), list(23,24), NA))
>
> weird_df
# A tibble: 3 x 2
col1 col_weird
<chr> <list>
1 hello <list [2]>
2 world <list [2]>
3 again <lgl [1]>
I need to extract the values in the col_weird. How can I do that? I see how to do that in Python but not in R. Expected output is:
> good_df
# A tibble: 3 x 3
col1 tic toc
<chr> <dbl> <dbl>
1 hello 12 23
2 world 23 24
3 again NA NA
If you collapse the list column into a string you can use separate from tidyr. I used map_chr from purrr to loop through the list column and create one string per row with toString.
library(dplyr)
library(tidyr)
library(purrr)
weird_df %>%
mutate(col_weird = map_chr(col_weird, toString)) %>%
separate(col_weird, into = c("tic", "toc"), convert = TRUE)
# A tibble: 3 x 3
col1 tic toc
* <chr> <int> <int>
1 hello 12 23
2 world 23 24
3 again NA NA
You can actually use separate directly without the toString part but you end up with "list" as one of the values.
weird_df %>%
separate(col_weird, into = c("list", "tic", "toc"), convert = TRUE) %>%
select(-list)
This led me to tidyr::extract, which works fine with the right regular expression. If your list column was more complicated, though, writing out the regular expression might be a pain.
weird_df %>%
extract(col_weird, into = c("tic", "toc"), regex = "([[:digit:]]+), ([[:digit:]]+)", convert = TRUE)
You can do this with base R, thanks to I():
weird_df <- data.frame(col1 =c('hello', 'world'),
col_weird = I(list(list(12,23),list(23,24))))
weird_df
> col1 col_weird
1 hello 12, 23
2 world 23, 24
weird_df <- data_frame(col1 = c('hello', 'world'),
col_weird = list(list(12,23), list(23,24)))
library(dplyr)
weird_df %>%
dplyr::mutate(tic = sapply(col_weird, magrittr::extract2, 1), # first element of each row's list
toc = sapply(col_weird, magrittr::extract2, 2), # second element of each row's list
col_weird = NULL)
With the last changes: Note that now col_weird contains list(NA, NA)
weird_df <- data_frame(col1 = c('hello', 'world', 'again'),
col_weird = list(list(12,23), list(23,24), list(NA, NA)))
library(dplyr)
weird_df %>%
dplyr::mutate(col_weird = matrix(col_weird),
tic = sapply(col_weird, function(x) magrittr::extract2(x, 1)),
toc = sapply(col_weird, function(x) magrittr::extract2(x, 2)),
col_weird = NULL)
Here is one option to do with purrr/tidyverse/reshape2. We unlist the 'col_weird' within map to get the output as list, set the names of the list with 'col1', melt to 'long' format, grouped by 'L1', create a 'rn' column and spread it back to 'wide'
library(tidyverse)
library(reshape2)
weird_df$col_weird %>%
map(unlist) %>%
setNames(., weird_df$col1) %>%
melt %>%
group_by(L1) %>%
mutate(rn = c('tic', 'toc')[row_number()]) %>%
spread(rn, value) %>%
left_join(weird_df[-2], ., by = c(col1 = "L1"))
Well, I came up with a simple one:
> weird_df %>%
+ rowwise() %>%
+ mutate(tic = col_weird[[1]],
+ toc = ifelse(length(col_weird) == 2, col_weird[[2]], NA)) %>%
+ select(-col_weird) %>% ungroup()
# A tibble: 3 x 3
col1 tic toc
<chr> <dbl> <dbl>
1 hello 12 23
2 world 23 24
3 again NA NA
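Another option that avoids both the string round trip and rowwise() is to pull each position out with purrr::pluck(), which returns a default instead of failing when the position does not exist. A minimal sketch, assuming each cell is either a list of numbers or a bare NA as in the example:

```r
library(tibble)
library(dplyr)
library(purrr)

weird_df <- tibble(col1 = c("hello", "world", "again"),
                   col_weird = list(list(12, 23), list(23, 24), NA))

good_df <- weird_df %>%
  mutate(
    # as.numeric() turns the logical NA of the last row into a numeric NA
    tic = map_dbl(col_weird, ~ as.numeric(pluck(.x, 1, .default = NA))),
    toc = map_dbl(col_weird, ~ as.numeric(pluck(.x, 2, .default = NA)))
  ) %>%
  select(-col_weird)

# good_df$tic is c(12, 23, NA) and good_df$toc is c(23, 24, NA)
```

The .default argument is what handles the third row, where there is no second element to extract.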