aggregate command in R - r

How can I use the aggregate command to convert this table:
name ID
a 1
a 2
a 2
a NA
b NA
c NA
c NA
to this one:
name ID
a 1|2
b NA
c NA
Thanks.

In base:
> aggregate(ID ~ name, data=x, FUN=function(y) paste(unique(y),
collapse='|'),na.action=na.pass)
name ID
1 a 1|2|NA
2 b NA
3 c NA
This differs from your specification in the handling of the fourth row.
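If the fourth row should be dropped as in the requested output, one possible adjustment is to strip NA values inside the aggregating function and return NA only when nothing is left. This is a minimal sketch; the data frame x is rebuilt from the table in the question and the helper name collapse_ids is made up for illustration.

```r
# Sample data rebuilt from the question (assumed to be stored in `x`)
x <- data.frame(name = c("a", "a", "a", "a", "b", "c", "c"),
                ID   = c(1, 2, 2, NA, NA, NA, NA))

# Hypothetical helper: paste the unique non-NA values, or return NA if none remain
collapse_ids <- function(y) {
  y <- unique(y[!is.na(y)])
  if (length(y) == 0) NA_character_ else paste(y, collapse = "|")
}

# na.action = na.pass keeps the NA rows so that groups b and c survive
res <- aggregate(ID ~ name, data = x, FUN = collapse_ids, na.action = na.pass)
res
```

With this helper, group a should collapse to 1|2 while b and c stay NA.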

You can try:
library(tidyr);
df$name <- as.factor(df$name)
aggregate(ID ~ name, unique(df[complete.cases(df),]), paste, collapse = "|") %>%
complete(name)
Source: local data frame [3 x 2]
name ID
(fctr) (chr)
1 a 1|2
2 b NA
3 c NA
The logic is to first filter out all incomplete and duplicated rows, then paste the IDs together, and finally use the complete function from the tidyr package to fill in every level of the factor variable so that no group is lost.

We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)) and group by 'name'. If all elements of 'ID' in a group are NA, we return NA; otherwise we paste the unique non-NA elements of the 'ID' column.
library(data.table)
setDT(df1)[,.(ID= if(all(is.na(ID))) NA_character_ else
paste(na.omit(unique(ID)), collapse = "|")), by = name]
# name ID
#1: a 1|2
#2: b NA
#3: c NA
The same methodology can be used in dplyr
library(dplyr)
df1 %>%
group_by(name) %>%
summarise(ID = if(all(is.na(ID))) NA_character_
else paste(unique(ID[!is.na(ID)]), collapse="|"))
# name ID
# <chr> <chr>
#1 a 1|2
#2 b <NA>
#3 c <NA>

Related

Dplyr: add multiple columns with mutate/across from character vector

I want to add several columns (filled with NA) to a data.frame using dplyr. I've defined the names of the columns in a character vector. Usually, with only one new column, you can use the following pattern:
test %>%
mutate(!!new_column := NA)
However, I don't get it to work with across:
library(dplyr)
test <- data.frame(a = 1:3)
add_cols <- c("col_1", "col_2")
test %>%
mutate(across(!!add_cols, ~ NA))
#> Error: Problem with `mutate()` input `..1`.
#> x Can't subset columns that don't exist.
#> x Columns `col_1` and `col_2` don't exist.
#> ℹ Input `..1` is `across(c("col_1", "col_2"), ~NA)`.
test %>%
mutate(!!add_cols := NA)
#> Error: The LHS of `:=` must be a string or a symbol
expected_output <- data.frame(
a = 1:3,
col_1 = rep(NA, 3),
col_2 = rep(NA, 3)
)
expected_output
#> a col_1 col_2
#> 1 1 NA NA
#> 2 2 NA NA
#> 3 3 NA NA
Created on 2021-10-05 by the reprex package (v1.0.0)
With the first approach, the column names are correctly created, but then it directly tries to find it in the existing column names. In the second approach, I can't use anything other than a single string.
Is there a tidyverse solution or do I need to resort to the good old for loop?
The !! works for a single element
for(nm in add_cols) test <- test %>%
mutate(!! nm := NA)
-output
> test
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
Or another option is
test %>%
bind_cols(setNames(rep(list(NA), length(add_cols)), add_cols))
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
In base R, this is easier
test[add_cols] <- NA
Which can be used in a pipe
test %>%
`[<-`(., add_cols, value = NA)
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
across works only on columns that are already present: it loops over existing columns and either modifies them or, with the .names argument, creates new columns derived from them.
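To illustrate that point, here is a small sketch of across creating new columns from existing ones via .names. The extra column b and the "_doubled" suffix are assumptions made up for this example and are not part of the question.

```r
library(dplyr)

# Example data; column b is added here only for illustration
test2 <- data.frame(a = 1:3, b = 4:6)

# across() loops over the existing columns a and b;
# .names controls the names of the newly created columns
res <- test2 %>%
  mutate(across(c(a, b), ~ .x * 2, .names = "{.col}_doubled"))
res
```

The new columns exist only because a and b were there to loop over, which is why across cannot create col_1 and col_2 from nothing.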
We could make use of add_column from tibble
library(tibble)
library(janitor)
add_column(test, !!! add_cols) %>%
clean_names %>%
mutate(across(all_of(add_cols), ~ NA))
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
Another approach:
library(tidyverse)
# set_names() names each element after itself, so map_dfc() returns a one-row tibble with those column names
mutate(test, map_dfc(set_names(add_cols), ~ NA))
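Since !! handles only a single name, one more option that may be worth trying is to build a named list and splice it into mutate with !!!. This is a sketch under the assumption that every new column should be a plain logical NA; the name new_cols is made up here.

```r
library(dplyr)

test <- data.frame(a = 1:3)
add_cols <- c("col_1", "col_2")

# Named list with one NA per new column: list(col_1 = NA, col_2 = NA)
new_cols <- setNames(rep(list(NA), length(add_cols)), add_cols)

# !!! splices the list into mutate() as if we had typed col_1 = NA, col_2 = NA
res <- mutate(test, !!!new_cols)
res
```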

How to move dataframe variable names to first row and add new variable names to multiple dataframes in a list?

library(purrr)
library(tibble)
library(dplyr)
Starting list of dataframes
lst <- list(df1 = data.frame(X.1 = as.character(1:2),
heading = letters[1:2]),
df2 = data.frame(X.32 = as.character(3:4),
another.topic = paste("Line ", 1:2)))
lst
#> $df1
#> X.1 heading
#> 1 1 a
#> 2 2 b
#>
#> $df2
#> X.32 another.topic
#> 1 3 Line 1
#> 2 4 Line 2
Expected "combined" dataframe, with new consistent variable names, and old variable names in the first row of each constituent dataframe.
#> id h1 h2
#> 1 df1 X.1 heading
#> 2 df1 1 a
#> 3 df1 2 b
#> 4 df2 X.32 another.topic
#> 5 df2 3 Line 1
#> 6 df2 4 Line 2
add_row requires "Name-value pairs, passed on to tibble(). Values can be defined only for columns that already exist in .data and unset columns will get an NA value."
Which is what I think I have achieved with this:
df_nms <-
map(lst, names) %>%
map(set_names)
#> $df1
#> X.1 heading
#> "X.1" "heading"
#>
#> $df2
#> X.32 another.topic
#> "X.32" "another.topic"
But I cannot tie up the last bit: using a purrr function to add the names as the first row of each dataframe. I've tried numerous variations with map2 and pmap. The closest I can get at present is shown below (if I treat add_row as a formula by prefixing it with ~ and remove the .y, I get a new first row populated with NAs). I think I'm missing how to pass the name-value pairs to the add_row function.
map2(lst, df_nms, add_row(.x, .y, .before = 1)) %>%
map(set_names, c("h1", "h2")) %>%
map_dfr(bind_rows, .id = "id")
#> Error in add_row(.x, .y, .before = 1): object '.x' not found
A pointer to resolve this last step would be most appreciated.
Not quite sure how to do this via purrr map functions, but here is an alternative,
library(dplyr)
bind_rows(lapply(lst, function(i){d1 <- as.data.frame(matrix(names(i), ncol = ncol(i)));
rbind(d1, setNames(i, names(d1)))}), .id = 'id')
# id V1 V2
#1 df1 X.1 heading
#2 df1 1 a
#3 df1 2 b
#4 df2 X.32 another.topic
#5 df2 3 Line 1
#6 df2 4 Line 2
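For the purrr route the question asks about, one possible sketch is to write the map2 call as a formula and splice the named vector into add_row with !!!, so that each element becomes a name-value pair. This assumes the columns are character (stringsAsFactors = FALSE is set explicitly below), and it has not been checked against every tibble version.

```r
library(purrr)
library(tibble)
library(dplyr)

lst <- list(df1 = data.frame(X.1 = as.character(1:2),
                             heading = letters[1:2],
                             stringsAsFactors = FALSE),
            df2 = data.frame(X.32 = as.character(3:4),
                             another.topic = paste("Line ", 1:2),
                             stringsAsFactors = FALSE))

# Named character vectors: each column name is paired with itself
df_nms <- map(lst, ~ set_names(names(.x)))

combined <- map2(lst, df_nms, ~ add_row(.x, !!!.y, .before = 1)) %>%
  map(set_names, c("h1", "h2")) %>%   # consistent column names
  bind_rows(.id = "id")               # list names become the id column
combined
```

The original attempt failed because add_row(.x, .y, .before = 1) was evaluated immediately instead of being passed as a function; the ~ turns it into a lambda that map2 calls for each pair.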
Here's an approach using map, rbindlist from data.table and some base R functions:
library(purrr)
library(dplyr)
library(data.table)
map(lst, ~ as.data.frame(unname(rbind(colnames(.x),as.matrix(.x))))) %>%
rbindlist(idcol = "id")
# id V1 V2
#1: df1 X.1 heading
#2: df1 1 a
#3: df1 2 b
#4: df2 X.32 another.topic
#5: df2 3 Line 1
#6: df2 4 Line 2
Alternatively we could use map_df if we use colnames<-:
map_df(lst, ~ as.data.frame(rbind(colnames(.x),as.matrix(.x))) %>%
`colnames<-`(.,paste0("h",seq(1,dim(.)[2]))), .id = "id")
# id h1 h2
#1 df1 X.1 heading
#2 df1 1 a
#3 df1 2 b
#4 df2 X.32 another.topic
#5 df2 3 Line 1
#6 df2 4 Line 2
Key things here are:
Use as.matrix to get rid of the factor / character incompatibility.
Remove names with unname or set them with colnames<-
Use the idcol = (rbindlist) or .id = (map_df) argument to get the names of the list as a column.
I altered your sample data a bit, setting stringsAsFactors to FALSE when creating the data.frames in lst.
Here is a solution using data.table::rbindlist().
#sample data
lst <- list(df1 = data.frame(X.1 = as.character(1:2),
heading = letters[1:2],
stringsAsFactors = FALSE), # !! <--
df2 = data.frame(X.32 = as.character(3:4),
another.topic = paste("Line ", 1:2),
stringsAsFactors = FALSE) # !! <--
)
DT <- data.table::rbindlist( lapply( lst, function(x) rbind( names(x), x ) ),
use.names = FALSE, idcol = "id" )
data.table::setnames(DT, names( lst[[1]] ), c("h1", "h2") )
# id h1 h2
# 1: df1 X.1 heading
# 2: df1 1 a
# 3: df1 2 b
# 4: df2 X.32 another.topic
# 5: df2 3 Line 1
# 6: df2 4 Line 2

Performing multiple functions on every column and row of a list of dataframes using their names stored in a list

DATA
foo <- dplyr::tibble(a=c("a","b",NA),b=c("a","b","c"),colC=NA)
bar <- dplyr::tibble(a=c("a","b",NA),b=c("a","b","c"),colC=NA)
all_tibbles <- c("foo","bar")
mget(all_tibbles)
$foo
# A tibble: 3 x 3
a b colC
<chr> <chr> <lgl>
1 a a NA
2 b b NA
3 NA c NA
$bar
# A tibble: 3 x 3
a b colC
<chr> <chr> <lgl>
1 a a NA
2 b b NA
3 NA c NA
I would like to remove all columns that contain only NA from every data frame in mget(all_tibbles).
This creates the logical vector using base apply functions:
lapply(mget(all_tibbles), function(y) sapply(y, function(x) all(is.na(x))))
Then remove the row with the most missing values, i.e. the fewest non-missing ones (which.min returns only the first such row):
lapply(mget(all_tibbles),function(x)
x[-which.min(rowSums((!is.na(x)))),])
and then store the results back in the same variables foo and bar. I have a large character vector of tibble names, by the way.
Can I use a tidyverse package to simplify things? The base functions are fairly complicated, and I am trying to avoid for loops.
An option is select_if
library(dplyr)
library(purrr)
library(stringr)
out <- mget(all_tibbles) %>%
map(~ .x %>%
select_if(~ any(!is.na(.))))
out
#$foo
# A tibble: 3 x 2
# a b
# <chr> <chr>
#1 a a
#2 b b
#3 <NA> c
#$bar
# A tibble: 3 x 2
# a b
# <chr> <chr>
#1 a a
#2 b b
#3 <NA> c
names(out) <- str_c(names(out), "_edited")
If we need to update "foo", "bar" (not recommended)
list2env(out, .GlobalEnv)
Or using keep
mget(all_tibbles) %>%
map(~ keep(.x, colSums(!is.na(.)) > 0))
For the second case with rows
out2 <- mget(all_tibbles) %>%
map(~ .x %>%
slice(-which.min(rowSums(!is.na(.)))))
names(out2) <- str_c(names(out2), "_edited2")
list2env(out2, .GlobalEnv)
Or we can use Filter from base R to remove columns (OP already showed a base R option for removing rows)
lapply(mget(all_tibbles), function(x)
Filter(function(y) any(!is.na(y)), x))
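Both steps can also be chained in one pass. The sketch below uses select(where(...)), which newer dplyr versions offer in place of select_if; the helper names drop_empty_cols and drop_sparsest_row are made up for illustration.

```r
library(dplyr)
library(purrr)

foo <- tibble(a = c("a", "b", NA), b = c("a", "b", "c"), colC = NA)
bar <- tibble(a = c("a", "b", NA), b = c("a", "b", "c"), colC = NA)
all_tibbles <- c("foo", "bar")

# Hypothetical helper: keep only columns that have at least one non-NA value
drop_empty_cols <- function(df) {
  select(df, where(function(col) !all(is.na(col))))
}

# Hypothetical helper: drop the first row with the fewest non-missing values
drop_sparsest_row <- function(df) {
  slice(df, -which.min(rowSums(!is.na(df))))
}

tbls <- mget(all_tibbles)
cleaned <- tbls %>%
  map(drop_empty_cols) %>%
  map(drop_sparsest_row)
cleaned
```

Keeping the result in the list cleaned, rather than writing back to foo and bar with list2env, avoids overwriting the originals.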

Replacing values of dataframe with NA values of another dataframe

I would like to replace values of one dataframe with NA of another dataframe that have the same identifier. That is, for all values of df1 that have the same id, assign the "NA" values of df2 at the corresponding id and indices.
I have df1 and df2:
df1 =data.frame(id = c(1,1,2,2,6,6),a = c(2,4,1,7,5,3), b = c(5,3,0,3,2,5),c = c(9,3,10,33,2,5))
df2 =data.frame(id = c(1,2,6),a = c("NA",0,"NA"), b= c("NA", 9, 9),c=c(0,"NA","NA"))
What I would like is df3:
df3 = data.frame(id = c(1,1,2,2,6,6),a = c("NA","NA",1,7,"NA","NA"), b = c("NA","NA",0,3,2,5),c = c(9,3,"NA","NA","NA","NA"))
I have tried the lookup function and the "data.table" library, but I could not get the correct df3. Could anyone please help me with this?
We can do a join on 'id' and then replace the values with NA by multiplying each column of df1 by NA^is.na(y), which is NA where the matching df2 value is NA and 1 otherwise.
library(data.table)
nm1 <- names(df1)[-1]
setDT(df1)[df2, (nm1) := Map(function(x, y) x*(NA^is.na(y)), .SD,
mget(paste0('i.', nm1))), on = .(id), .SDcols = nm1]
df1
# id a b c
#1: 1 NA NA 9
#2: 1 NA NA 3
#3: 2 1 0 NA
#4: 2 7 3 NA
#5: 6 NA 2 NA
#6: 6 NA 5 NA
data
df2 =data.frame(id = c(1,2,6),a = c(NA,0,NA), b= c(NA, 9, 9),c=c(0,NA,NA))
NOTE: In the OP's post NA were "NA"
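The NA^is.na(y) step relies on R evaluating NA^0 as 1 and NA^1 as NA. A small standalone illustration, with made-up vectors standing in for one column of each table:

```r
y <- c(NA, 0, NA)   # stands in for a df2 column after the join
x <- c(2, 1, 5)     # stands in for the matching df1 column

# is.na(y) is TRUE/FALSE, i.e. 1/0 as an exponent:
# NA^1 is NA and NA^0 is 1, so the mask is NA wherever y is missing
mask <- NA^is.na(y)

# Multiplying by the mask blanks out x where y is NA and leaves it unchanged elsewhere
result <- x * mask
result
```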
Since your NA values are actually the text "NA", you will have to turn all your variables into text (with as.character). You can then join both dataframes by the id column. Since both dataframes have columns a, b, and c, the join will rename them a.x, b.x and c.x (from df1) and a.y, b.y and c.y (from df2).
After that you can create new columns a, b, and c. These then hold "NA" whenever a.y == "NA" and a.x otherwise (and so on). If your NA values were real NA, you would need to test with is.na(value) instead (see the example at the end of the code).
library(dplyr)
df1 %>%
mutate_all(as.character) %>% # all variables as text
left_join(df2 %>%
mutate_all(as.character) ## all variables as text
, by = "id") %>% ## join tables by 'id'; a.x from df1 and a.y from df2 and so on
mutate(a = case_when(a.y == "NA" ~ "NA", TRUE ~ a.x), ## if a.y == "NA" take this,else a.x
b = case_when(b.y == "NA" ~ "NA", TRUE ~ b.x),
c = case_when(c.y == "NA" ~ "NA", TRUE ~ c.x)) %>%
select(id, a, b, c) ## keep only these initial columns
id a b c
1 1 NA NA 9
2 1 NA NA 3
3 2 1 0 NA
4 2 7 3 NA
5 6 NA 2 NA
6 6 NA 5 NA
## if your dataframe had real NA values, this is how you can test for them:
missing_value <- NA
is.na(missing_value) ## TRUE
missing_value == NA ## returns NA rather than TRUE, so it cannot be used as a test
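If df2 holds real NA values, a base R alternative might look like the sketch below: match aligns each df1 row with its df2 row, and is.na on the aligned rows gives a logical mask for the assignment. The names idx, cols and df3 are chosen here for illustration, and df2 is rewritten with real NA as in the data.table answer.

```r
df1 <- data.frame(id = c(1, 1, 2, 2, 6, 6), a = c(2, 4, 1, 7, 5, 3),
                  b = c(5, 3, 0, 3, 2, 5), c = c(9, 3, 10, 33, 2, 5))
# df2 with real NA values instead of the text "NA"
df2 <- data.frame(id = c(1, 2, 6), a = c(NA, 0, NA),
                  b = c(NA, 9, 9), c = c(0, NA, NA))

cols <- c("a", "b", "c")
idx  <- match(df1$id, df2$id)   # row of df2 that belongs to each row of df1

df3 <- df1
# is.na() on the aligned df2 rows is a 6 x 3 logical matrix marking cells to blank out
df3[cols][is.na(df2[idx, cols])] <- NA
df3
```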

Replace value from other dataframe

I have a data frame (x) with a factor variable whose values are separated by commas. I have another data frame (y) with descriptions for the same values. Now I want to replace the values in data frame (x) with the descriptions from data frame (y). Any help would be highly appreciated.
Say, for example, the two data frames look like below
data frame (x)
s.no x
1 2,5,45
2 35,5
3 45
data frame (y)
s.no x description
1 2 a
2 5 b
3 45 c
4 35 d
I need the output as below
s.no x
1 a,b,c
2 d,b
3 c
With splitstackshape:
library(splitstackshape)
cSplit(x, 'x', ',', 'long')[setDT(y), on='x'][,.(x=paste(description, collapse=',')), s.no]
# s.no x
#1: 1 a,b,c
#2: 2 b,d
#3: 3 c
A solution using dplyr and tidyr:
library(dplyr)
library(tidyr)
x %>%
separate(x, paste0('x',1:3),',',convert=TRUE) %>%
gather(var, x, -1, na.rm=TRUE) %>%
left_join(., y, by='x') %>%
group_by(s.no = s.no.x) %>%
summarise(x = paste(description,collapse = ','))
the result:
s.no x
(int) (chr)
1 1 a,b,c
2 2 d,b
3 3 c
We can split the 'x' column in 'x' dataset by ',', loop over the list, match the value with the 'x' column in 'y' to get the numeric index, get the corresponding 'description' value from 'y' and paste it together.
x$x <- sapply(strsplit(x$x, ","), function(z)
toString(y$description[match(as.numeric(z), y$x)]))
x
# s.no x
#1 1 a, b, c
#2 2 d, b
#3 3 c
NOTE: If the 'x' column in 'x' is factor class, use strsplit(as.character(x$x), ",")
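A self-contained variant of the same idea, using a named lookup vector instead of match and a plain comma separator so the result matches the requested output. The data frames are rebuilt from the tables in the question (with x$x stored as character), and the name lookup is made up for this sketch.

```r
x <- data.frame(s.no = 1:3, x = c("2,5,45", "35,5", "45"),
                stringsAsFactors = FALSE)
y <- data.frame(s.no = 1:4, x = c(2, 5, 45, 35),
                description = c("a", "b", "c", "d"),
                stringsAsFactors = FALSE)

# Lookup vector: names are the codes (as text), values are the descriptions
lookup <- setNames(y$description, y$x)

# Split each entry on commas, translate every code, and paste back together
x$x <- vapply(strsplit(x$x, ","),
              function(z) paste(lookup[z], collapse = ","),
              character(1))
x
```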

Resources