Turning a 1x1 data frame into value [duplicate] - r

I'm using dplyr to transform a large data frame, and I want to store the data frame's most recent date + 1 as a value. I know there are easier ways to do this by breaking up the statements, but I'm trying to do it all in one pipe. I ran into some behaviour I don't understand. Example:
library(dplyr)
Day <- seq.Date(as.Date('2017-12-01'), as.Date('2018-02-03'), 'day')
Day <- sample(Day, length(Day))
ID <- sample(c(1:5), length(Day), replace = TRUE)
df <- data.frame(ID, Day)
foo <- df %>%
  arrange(desc(Day)) %>%
  mutate(DayPlus = as.Date(Day) + 1) %>%
  select(DayPlus) #%>%
  #slice(1)
foo <- foo[1,1]
When I run this code, foo becomes a value equal to 2018-02-04 as desired. However, when I run the code with slice uncommented:
foo <- df %>%
  arrange(desc(Day)) %>%
  mutate(DayPlus = as.Date(Day) + 1) %>%
  select(DayPlus) %>%
  slice(1)
foo <- foo[1,1]
foo stays a data frame. My main question is why foo doesn't become a value in the second example, and my second question is whether there's an easy way to get "2018-02-04" stored as a value in foo, all from one dplyr pipe.
Thanks

That's because your first snippet returns a data.frame, whereas the second one returns a tibble. Tibbles are similar to data.frames, but one major difference is subsetting. If you have a data.frame, foo[1, 1] returns the element in the first row of the first column as a vector, whereas if you have a tibble it returns that element as a 1x1 tibble.
df %>%
  arrange(desc(Day)) %>%
  mutate(DayPlus = as.Date(Day) + 1) %>%
  select(DayPlus) %>%
  class()
returns
[1] "data.frame"
whereas the second one
df %>%
  arrange(desc(Day)) %>%
  mutate(DayPlus = as.Date(Day) + 1) %>%
  select(DayPlus) %>%
  slice(1) %>%
  class()
returns
[1] "tbl_df" "tbl" "data.frame"
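The subsetting difference, and one possible answer to the second question, can be sketched as follows. This is a minimal illustration on made-up data (df_plain and df_tbl are hypothetical names, not from the question). It converts to a tibble explicitly with as_tibble(), since whether slice() itself returns a tibble may depend on the dplyr version. It relies on [[ and dplyr::pull(), which return the underlying vector for data.frames and tibbles alike.

```r
library(dplyr)

# Hypothetical toy data, not the data from the question
df_plain <- data.frame(DayPlus = as.Date(c("2018-02-03", "2018-02-04")))
df_tbl <- as_tibble(df_plain)

# [i, j] drops to a vector for a data.frame but stays a tibble for a tibble
class(df_plain[1, 1])
class(df_tbl[1, 1])

# [[i, j]] extracts the single element in both cases
df_tbl[[1, 1]]

# One pipe that ends in a plain value: pull() extracts the column as a vector
foo <- df_plain %>%
  arrange(desc(DayPlus)) %>%
  slice(1) %>%
  pull(DayPlus)
foo
```

Ending the pipe with pull() avoids the separate foo[1, 1] step, so the result does not depend on whether the intermediate object is a data.frame or a tibble.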

Related

How to add a new row in the end of data frame only if the last actual value of the column Z doesn't contain "VALUE1"?

I have a list of several data frames and, as the title states, would like to add a new row (where column Z is "VALUE1") at the end of each data frame if the last actual value (ignoring NAs) of column Z doesn't contain "VALUE1". I already have a script that adds a new row at the beginning of a data frame if the first value of column Z doesn't contain "VALUE1", but I can't quite adapt it myself.
The script I'd like to modify looks like this:
for(i in 1:length(df)){
  df[[i]] <- df[[i]] %>%
    filter(!is.na(Z)) %>%
    slice(1) %>%
    mutate(across(col1:col3, ~ 0)) %>%
    filter(!grepl("VALUE1", Z)) %>%
    mutate(Z = "VALUE1") %>%
    bind_rows(., df[[i]])
}
Also, if possible, a short comment on each line explaining what the code does would be very welcome (though not necessary) for further learning and understanding. Thank you!
It's quite a strange script.
To add a row at the end of each data frame when the last value is not VALUE1, slice by n() instead of by 1, and flip the order of the bind_rows arguments. Try this:
for (i in seq_along(df)) {
  df[[i]] <- df[[i]] %>%
    filter(!is.na(Z)) %>%              # keep only the rows where Z is not NA
    slice(n()) %>%                     # keep the last row (the original kept the first); the result is a 1-row data frame
    mutate(across(col1:col3, ~ 0)) %>% # set col1 to col3 to zero (this row will be used as the new row)
    filter(!grepl("VALUE1", Z)) %>%    # if Z contains VALUE1, drop the row, leaving a 0-row data frame
    mutate(Z = "VALUE1") %>%           # set Z to VALUE1 (only has an effect if the row survived)
    bind_rows(df[[i]], .)              # append the new row (. is the pipe placeholder) to the end of the original df[[i]]
}
# If the step filter(!grepl("VALUE1", Z)) dropped the row, bind_rows appends a 0-row
# data frame, so df[[i]] does not change at all.
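The behaviour described above can be checked on a small made-up list. This is only a sketch: dfs, new_row and the column values are hypothetical, chosen so that the first data frame needs a new row and the second does not.

```r
library(dplyr)

# Hypothetical toy list: the first data frame ends (ignoring NA) in "b",
# the second already ends in "VALUE1"
dfs <- list(
  data.frame(col1 = c(1, 2, 3), col2 = c(4, 5, 6), col3 = c(7, 8, 9),
             Z = c("a", "b", NA), stringsAsFactors = FALSE),
  data.frame(col1 = c(1, 2), col2 = c(3, 4), col3 = c(5, 6),
             Z = c("a", "VALUE1"), stringsAsFactors = FALSE)
)

for (i in seq_along(dfs)) {
  # Build the candidate row: 1 row if the last non-NA Z lacks "VALUE1", else 0 rows
  new_row <- dfs[[i]] %>%
    filter(!is.na(Z)) %>%
    slice(n()) %>%
    mutate(across(col1:col3, ~ 0)) %>%
    filter(!grepl("VALUE1", Z)) %>%
    mutate(Z = "VALUE1")
  # Appending a 0-row data frame leaves the original unchanged
  dfs[[i]] <- bind_rows(dfs[[i]], new_row)
}

dfs
```

Splitting the pipe into new_row plus a separate bind_rows() call is a stylistic choice here; it makes the intermediate 0-row or 1-row data frame easy to inspect.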

Split a data.frame by group into a list of vectors rather than a list of data.frames

I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10, 300, replace = TRUE)),
                 group = c(rep("A", 100), rep("B", 100), rep("C", 100)),
                 stringsAsFactors = FALSE)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
dplyr::group_by(group) %>%
dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only, with split. It should be faster than filtering with == for each unique group:
with(df, split(id, group))
Or, with the tidyverse, we can pull the column after group_split. group_split returns a list of tibbles and could be slower than the split-only method above. Here, though, we can save some work by dropping the group column (.keep = FALSE) and then pulling the 'id' column from each list element to create the list of vectors
library(dplyr)
library(purrr)
df %>%
  group_split(group, .keep = FALSE) %>% # `.keep` in dplyr >= 1.0.0; earlier versions call it `keep`
  map(~ .x %>%
    pull(id))
Or use {} with pipe
df %>%
{split(.$id, .$group)}
Or wrap with with
df %>%
with(., split(id, group))
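To see what the two approaches return, here is a small sketch on made-up data (the four-row df, by_split and by_tidy are hypothetical and stand in for the 300-row example). It assumes dplyr >= 1.0.0 for the .keep argument. The notable difference is that split() names the list elements by group, while the group_split() route returns an unnamed list.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(
  id = c("id1", "id2", "id3", "id1"),
  group = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Base R: a named list of character vectors, one element per group
by_split <- split(df$id, df$group)

# dplyr: split into a list of tibbles, then extract the id column from each
by_tidy <- df %>%
  group_split(group, .keep = FALSE) %>%
  lapply(function(d) pull(d, id))

by_split
by_tidy
```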

most elegant way to calculate rowSums of colums that start AND end with certain strings, using dplyr

I am working with a dataset for which I want to calculate rowSums of the columns that start with a certain string and end with another specified string, using dplyr (in my example: starts_with('c_') & ends_with('_f'))
My current code is as follows (and works fine):
df <- df %>% mutate(row.sum = rowSums(select(select(., starts_with('c_')), ends_with('_f'))))
However, as you can see, using the select() function within a select() function seems a bit messy. Is there a way to combine the starts_with and ends_with within just one select() function? Or do you have other ideas to make this line of code more elegant via using dplyr?
EDIT:
To make the example reproducible:
names <- c('c_first_f', 'c_second_o', 't_third_f', 'c_fourth_f')
values <- c(5, 3, 2, 5)
df <- as.data.frame(t(values)) # a data frame rather than a matrix, so that dplyr verbs apply
colnames(df) <- names
> df
  c_first_f c_second_o t_third_f c_fourth_f
1         5          3         2          5
Thus, here I want to sum the first and fourth column, making the summed value 10.
We could use select_at with matches
library(dplyr)
df %>% select_at(vars(matches("^c_.*_f$"))) %>% mutate(row.sum = rowSums(.))
and with base R:
df$row.sum <- rowSums(df[grep("^c_.*_f$", names(df))])
We can use tidyverse approaches
library(dplyr)
library(purrr)
df %>%
select_at(vars(matches("^c_.*_f$"))) %>%
mutate(rowSum = reduce(., `+`))
Or with new versions of tidyverse, select can take matches
df %>%
select(matches("^c_.*_f$")) %>%
mutate(rowSum = reduce(., `+`))
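As for combining starts_with and ends_with in a single select(), the selection helpers can be joined with &. This is a sketch under the assumption that tidyselect >= 1.0.0 is installed, which is where the boolean operators in selections come from. Unlike the select-then-mutate answers above, it keeps all of the original columns. The one-row data frame mirrors the question's example.

```r
library(dplyr)

# One-row data frame mirroring the example in the question
df <- data.frame(c_first_f = 5, c_second_o = 3, t_third_f = 2, c_fourth_f = 5)

# A single select() with both conditions; the result keeps every original column
result <- df %>%
  mutate(row.sum = rowSums(select(., starts_with("c_") & ends_with("_f"))))

result
```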

Frequency table with common values of 5 tables

I have 5 data frames and I have to analyze just the first column of each. From these, I must obtain a frequency table of their common words (not necessarily common to all data frames; for example, a word may appear in just two or more of them).
Then I must obtain a frequency table of the words common to ALL data frames.
I tried a for loop, but it seems very complicated. Moreover, the data frames have different dimensions. I didn't find any useful function.
Then I tried doing
lst1 <- list(a,b,c,d,e)
newdat <- stack(setNames(lapply(lst1, "[", 1), seq_along(lst1)))[2:1]
library(dplyr)
newdat %>% group_by(val) %>% filter(uniqueN(ind) > 1) %>% count(val)
but it gives me an error
> stack(setNames(lapply(lst1, "[", 1), seq_along(lst1)))
Error in stack.default(setNames(lapply(lst1, "[", 1), seq_along(lst1))):
at least one vector element is required
Thank you
Here's my solution using purrr and dplyr:
library(purrr)
library(dplyr)
lst1 <- list(mtcars=mtcars, iris=iris, chick=chickwts, cars=cars, airqual=airquality)
lst1 %>%
  map_dfr(select, value=1, .id="df") %>% # select the first column of every data frame and name it "value"
  group_by(value) %>%
  summarise(freq=n(), # frequency over all data frames
            n_df=n_distinct(df), # number of data frames in which this value occurs
            dfs = paste(unique(df), collapse=",")) %>% # names of those data frames
  filter(n_df > 1) %>% # keep values that occur in more than one data frame
  filter(n_df == 5) # keep this line only if the value has to be in all 5 data frames
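The same pipeline can be tried on word data. The sketch below uses three tiny made-up data frames (words_a, words_b, words_c and their contents are hypothetical) and keeps the two frequency tables as separate objects: one for words shared by at least two data frames, one for words present in all of them.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data frames; only the first column is analysed
words_a <- data.frame(word = c("apple", "pear", "fig"), stringsAsFactors = FALSE)
words_b <- data.frame(word = c("apple", "fig", "kiwi"), stringsAsFactors = FALSE)
words_c <- data.frame(word = c("apple", "plum"), stringsAsFactors = FALSE)
lst <- list(a = words_a, b = words_b, c = words_c)

counts <- lst %>%
  map_dfr(select, value = 1, .id = "df") %>%  # stack the first columns, tagging each row with its source
  group_by(value) %>%
  summarise(freq = n(), n_df = n_distinct(df))

common <- counts %>% filter(n_df > 1)                 # words in two or more data frames
all_common <- counts %>% filter(n_df == length(lst))  # words in every data frame

common
all_common
```

Comparing n_df with length(lst) instead of a hard-coded 5 keeps the second filter valid if the number of data frames changes.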

Extract a dataframe from a column of dataframes (tidyverse approach)

I have been able to do some nice things with purrr to work with data frame columns within a data frame, that is, a column of a data frame in which every cell itself contains a data frame.
I am trying to find out the idiomatic approach for extracting one of these dataframes back out.
Example
# Create a couple of dataframes:
df1 <- tibble::tribble(~a, ~b,
                       1, 2,
                       3, 4)
df2 <- tibble::tribble(~a, ~b,
                       11, 12,
                       13, 14)
# Make a dataframe with a dataframe column containing
# our first two dfs as cells:
meta_df <- tibble::tribble(~df_name, ~dfs,
                           "One", df1,
                           "Two", df2)
My question is, what is the tidyverse-preferred way of getting one of these dataframes back out of meta_df? Say I get the cell I want using select() and filter():
library("magrittr")
# This returns a 1x1 tibble with the only cell containing the 2x2 tibble that
# I'm actually after:
meta_df %>%
dplyr::filter(df_name == "Two") %>%
dplyr::select(dfs)
This works, but seems non-tidyverse-ish:
# To get the actual tibble that I'm after I can wrap the whole lot in brackets
# and then use position [[1, 1]] index to get it:
(meta_df %>%
dplyr::filter(df_name == "Two") %>%
dplyr::select(dfs))[[1, 1]]
# Or a pipeable version:
meta_df %>%
dplyr::filter(df_name == "Two") %>%
dplyr::select(dfs) %>%
`[[`(1, 1)
I have a feeling that this might be a situation where the answer is in purrr rather than dplyr, and that it might be a simple trick once you know it, but I'm coming up blank so far.
Better solution:
Use tidyr::unnest(), naming the list column to unnest:
meta_df %>%
  dplyr::filter(df_name == "Two") %>%
  dplyr::select(dfs) %>%
  tidyr::unnest(dfs)
Other solution:
You can use pull (the tidyverse way to extract a column, equivalent to $), but it returns a one-element list of tibbles, so you need to add %>% .[[1]] to the end.
meta_df %>%
  dplyr::filter(df_name == "Two") %>%
  dplyr::pull(dfs) %>%
  .[[1]]
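A variant of the pull approach replaces .[[1]] with purrr::pluck(1), which some may find reads better in a pipe. This is a sketch, not the answer's own method. For brevity it builds meta_df with tibble() and list() rather than tribble(), and the name two is hypothetical.

```r
library(dplyr)
library(purrr)
library(tibble)

df1 <- tribble(~a, ~b,
               1, 2,
               3, 4)
df2 <- tribble(~a, ~b,
               11, 12,
               13, 14)

# A tibble with a list column holding the two data frames
meta_df <- tibble(df_name = c("One", "Two"), dfs = list(df1, df2))

two <- meta_df %>%
  filter(df_name == "Two") %>%
  pull(dfs) %>%   # one-element list containing the nested tibble
  pluck(1)        # extract that element, like .[[1]]

two
```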
