Pop out observation/row from a data frame - r

My data looks like this:
library(tidyverse)
set.seed(1)
df <- tibble(
id = c("cat", "cat", "mouse", "dog", "fish", "fish", "fish"),
value = rnorm(7, 100, sd = 50)
)
How might I "pop out" the top value of fish, as in move fish to a new data frame and simultaneously remove it from the current data frame?
This works (but it doesn't seem all that elegant):
df_store <- df %>%
filter(id == "fish") %>%
top_n(1)
df <- anti_join(df, df_store)
Is there a better way?

You can do both actions in one single line by using the package pipeR.
library(pipeR); library(dplyr)
df <- df %>>% filter(id == "fish") %>>% top_n(1) %>>% (~ df2) %>% anti_join(df, .)
print(df2)
#### 1 fish 124.3715
print(df)
#### 1 mouse 58.21857
#### 2 dog 179.76404
#### 3 fish 58.97658
#### 4 cat 68.67731
#### 5 cat 109.18217
#### 6 fish 116.47539
I'm no expert in pipeR, so check its documentation for how this kind of assignment within a pipe actually works.
One remark: when using top_n, I recommend specifying the value column explicitly. By default it uses the last column, which is easy to forget: top_n(1, value)
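For comparison, the same "pop out" can be sketched with dplyr alone. This is not the pipeR approach from the answer, just a minimal alternative. It assumes dplyr 1.0.0 or later (for slice_max) and that the popped row is uniquely identified by id and value, since anti_join drops every matching row.

```r
library(dplyr)

set.seed(1)
df <- tibble(
  id = c("cat", "cat", "mouse", "dog", "fish", "fish", "fish"),
  value = rnorm(7, 100, sd = 50)
)

# Pull out the fish row with the largest value (ties, if any, are all kept).
df_store <- df %>%
  filter(id == "fish") %>%
  slice_max(value, n = 1)

# Remove the stored row(s) from the original data frame.
# Naming `by` explicitly avoids the "Joining, by = ..." message.
df <- anti_join(df, df_store, by = c("id", "value"))
```

Naming the column in slice_max serves the same purpose as top_n(1, value): the ranking column is stated rather than implied.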

Related

How to do a 2-step wrangling and nesting using tidyr/dplyr, using %>% pipe only?

I need to accomplish a wrangling task with tidyr/dplyr as part of a %>% pipe. That is, without assigning data to helper objects. I have the following trb tibble as a given:
library(tibble)
trb <-
tribble(~name, ~type, ~dat,
"john", "cat", mtcars,
"john", "spider", Puromycin,
"amanda", "dog", ToothGrowth,
"chris", "wolf", PlantGrowth,
"annie", "lion", women,
"richard", "frog", trees,
"liz", "horse", USArrests,
"raul", "snake", iris,
"kate" , "bear", quakes)
and I want to do a 2-step wrangling (not necessarily in the following order):
lump together john's dat data frames into a named list (in which names will come from type); and
shift john's information to leftmost while nesting the data of the others.
The desired output should therefore be:
desired_output <-
tribble(~dat_john, ~other_people,
list("cat" = mtcars, "spider" = Puromycin), trb %>% dplyr::filter(name != "john")
)
As noted above, it's important to me to get from trb to desired_output using %>% only. Any ideas?
Maybe something like this? It first categorizes the data as john or not, then nests all the data for each category into one list, then pivots those two categories wide.
library(tidyr); library(dplyr)
trb %>%
mutate(column = if_else(name == "john", "dat_john", "other_people")) %>%
nest(data = -column) %>%
pivot_wider(names_from = column, values_from = data) %>%
# from @ekoam's answer, to convert this column to a named list
mutate(dat_john = with(dat_john[[1L]], list(setNames(dat, type))))
It is possible to achieve what you want via a sequence of pipelines. But I am not sure why you want to do this. Note that you need to manually assign "john" as the first level and rearrange the dataframe. Otherwise, if "john" is not the first entry, you won't get him to the leftmost after pivot_wider.
library(dplyr)
library(tidyr)
trb %>%
group_by(id = factor(name != "john", labels = c("dat_john", "other_people"))) %>%
arrange(id) %>% # use factor and arrange to ensure that john is always the first level
nest(data = -id) %>%
pivot_wider(names_from = id, values_from = data) %>%
mutate(dat_john = with(dat_john[[1L]], list(setNames(dat, type))))
Output
# A tibble: 1 x 2
dat_john other_people
<list> <list>
1 <named list [2]> <tibble [7 x 3]>
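If the pivot step feels indirect, the target can also be built directly inside a single pipe. The following is only a sketch, assuming magrittr's brace syntax (`%>% { ... }`, where `.` is the piped tibble) counts as "using %>% only". A shortened four-row trb stands in for the full one, and `result` is a name chosen here for illustration.

```r
library(dplyr)
library(tibble)

# Shortened stand-in for the trb tibble from the question.
trb <- tribble(~name, ~type, ~dat,
  "john",   "cat",    mtcars,
  "john",   "spider", Puromycin,
  "amanda", "dog",    ToothGrowth,
  "chris",  "wolf",   PlantGrowth)

result <- trb %>% {
  tibble(
    # john's data frames as one named list, with names taken from `type`
    dat_john = list(with(filter(., name == "john"), setNames(dat, type))),
    # everyone else, kept as a nested tibble
    other_people = list(filter(., name != "john"))
  )
}
```

Because the columns are named explicitly, the order does not depend on where john appears in trb.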

Change multiple vector classes in R at once

I am trying to change the classes of multiple vectors at once, using %>% mutate_if,
in an empty dataset of logical vectors. I can change them one by one with as.factor().
My dataset looks like the following:
ID code
pc01 cat
pc02 dog
pc03 cat
pc04 horse
pc01 dog
pc02 horse
Here is my whole code, in case it helps:
library(dplyr)
G <- as.factor(levels(as.factor(id)))
dat <- as.data.frame(G)
datprep <- data.frame(matrix(vector(), length(G),
length(levels(as.factor(code)))
)
)
colnames(datprep) = levels(as.factor(code))
datD <- cbind(datprep, datD)
# columns are logical, shall be factors.
datD %>% mutate_if(is.logical, as.factor)
Any suggestions?
In the new version of dplyr, we can also use across
library(dplyr)
df <- df %>%
mutate(across(where(is.character), factor))
In base R you could do:
df <- type.convert(df)
or even
df <- rapply(df,factor,"character", how="replace")
If I understood your problem correctly (I'm not sure, since there is no logical column in your example, just character), you can pass the conversion to mutate_if (and many other functions) as a formula using ~ and .:
library(dplyr)
df <- dplyr::tibble(ID = c("pc01", "pc02", "pc03", "pc04", "pc01", "pc02"),
code = c("cat", "dog", "cat", "horse", "dog", "horse"))
df %>%
dplyr::mutate_if(is.character, ~ as.factor(.))
ID code
<fct> <fct>
1 pc01 cat
2 pc02 dog
3 pc03 cat
4 pc04 horse
5 pc01 dog
6 pc02 horse
The tilde "~" signals an anonymous function (its body is on the right-hand side), and the "." stands for the column being processed. I used is.character() because that is what your example columns are, but you can swap in any other type check.
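To make the notation concrete, here is a small sketch comparing three spellings of the same conversion: a bare function name, the formula shorthand, and across(). It assumes dplyr 1.0.0 or later for across() and where(). The object names by_name, by_formula and by_across are made up for the comparison.

```r
library(dplyr)

df <- tibble(ID = c("pc01", "pc02", "pc03"),
             code = c("cat", "dog", "cat"))

# Bare function name: applied to every column that passes the predicate.
by_name <- df %>% mutate_if(is.character, as.factor)

# Formula shorthand: `.` is the column currently being converted.
by_formula <- df %>% mutate_if(is.character, ~ as.factor(.))

# Newer across() spelling of the same idea.
by_across <- df %>% mutate(across(where(is.character), as.factor))

sapply(by_formula, class)
```

For logical columns, as in the question, the predicate would be is.logical instead of is.character.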

How to make multiple subplots from one long data frame

I am looking to make four graphs in ggplot each containing 7 data-series each of which is marked as a group in my data frame. I therefore need a way to filter my single long data frame by the 7 group keys. I want my code to work something like this:
library(tidyverse)
df <- mtcars[0:2]
df <- tibble::rownames_to_column(df, "groups")
grouped_df <- df %>% group_by(groups)
conditions = group_keys(grouped_df)[[1]]
subplot_1_data <- grouped_df %>% filter(groups == "AMC Javelin") ## this works
subplot_2_data <- grouped_df %>% filter(conditions[6:10]) ##does not work
subplot_3_data <- grouped_df %>% filter(groups == conditions[11:15]) ## does not work
I want to generate three ggplot graphs: one with subplot_1_data, another with subplot_2_data, and a third with subplot_3_data.
I am struggling to achieve this. Any hint on how to get multiple groups into one data frame for plotting would be appreciated.
What you are looking for is x %in% c("a", "b") instead of == when filtering using a vector.
library(tidyverse)
df <- mtcars[0:2] %>%
tibble::rownames_to_column("groups") %>%
group_by(groups)
conditions <- group_keys(df)$groups
subplot_1_data <- df %>% filter(groups == "AMC Javelin")
subplot_2_data <- df %>% filter(groups %in% conditions[6:10])
subplot_3_data <- df %>% filter(groups %in% conditions[11:15])
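The difference between == and %in% comes from vector recycling, which a toy base R example can illustrate. The vectors groups and keys below are invented for the illustration.

```r
groups <- c("a", "b", "c", "d")
keys <- c("b", "c")

# == compares element by element and recycles the shorter vector,
# so the pairs tested are a-b, b-c, c-b, d-c.
eq_result <- groups == keys
eq_result
# FALSE FALSE FALSE FALSE

# %in% asks, for each element of groups, whether it occurs anywhere in keys.
in_result <- groups %in% keys
in_result
# FALSE  TRUE  TRUE FALSE
```

With ==, whether a row matches depends on its position relative to the recycled vector, which is why filtering on several group keys silently returns the wrong rows.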

Add a grouping variable based on ranked data

Consider the following dataframe:
name <- c("Sally", "Dave", "Aaron", "Jane", "Michael")
rank <- c(1,2,1,2,3)
df <- data.frame(name, rank, stringsAsFactors = FALSE)
I'd like to create a grouping variable (event) based on the rank column, as such:
event <- c("Hurdles", "Hurdles", "Long Jump", "Long Jump", "Long Jump")
df_desired <- data.frame(name, rank, event, stringsAsFactors = FALSE)
There are lots of examples of going the other way (making a ranking variable based on a group) but I can't seem to find one doing what I'd like.
It's possible to use filter, full_join and then fill as shown below, but is there a simpler way?
library(tidyverse)
df <- df %>%
mutate(order = row_number())
df_1 <- df %>%
filter(rank == 1)
df_1$event <- c("Hurdles", "Long Jump")
df %>%
filter(rank != 1) %>%
mutate(event = as.character(NA)) %>%
full_join(df_1, by = c("order", "name", "rank", "event")) %>%
arrange(order) %>%
fill(event) %>%
select(-order)
We can use cumsum to create the index
library(dplyr)
df %>%
mutate(event = c("Hurdles", "Long Jump")[cumsum(rank == 1)])
# name rank event
#1 Sally 1 Hurdles
#2 Dave 2 Hurdles
#3 Aaron 1 Long Jump
#4 Jane 2 Long Jump
#5 Michael 3 Long Jump
Or in base R (just in case)
df$event <- c("Hurdles", "Long Jump")[cumsum(df$rank == 1)]
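The intermediate values show why the cumsum trick works. This sketch uses the rank vector from the question and assumes, as the question does, that every event starts with a rank of 1. The names starts and idx are illustrative only.

```r
rank <- c(1, 2, 1, 2, 3)

# TRUE wherever a new event begins.
starts <- rank == 1
starts
# TRUE FALSE  TRUE FALSE FALSE

# The running count of starts gives an event number for every row.
idx <- cumsum(starts)
idx
# 1 1 2 2 2

# Use the event number to index into the vector of event names.
event <- c("Hurdles", "Long Jump")[idx]
event
# "Hurdles" "Hurdles" "Long Jump" "Long Jump" "Long Jump"
```

The names vector needs one entry per event, in the order the events appear in the data.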

Turn rows into columns and get the latest record - using R

I'm having an issue trying to convert rows into columns and then keeping only the latest record (by timestamp). The code below generates my data set:
df <- data.frame(id = c("123||wa", "123||wa", "123||wa", "223||sa", "223||sa", "223||sa", "123||wa"),
questions = c("dish", "car", "house", "dish", "house", "car", "dish"),
answers = c("pasta", "bmw", "yes", "pizza", "yes", "audi","ravioli" ),
timestamp = c("03JUL2014:15:38:11", "07JAN2015:15:22:54", "24MAR2018:12:24:16", "24MAR2018:12:24:16",
"04AUG2014:12:40:30", "03JUL2014:15:38:11", "05FEB2018:17:23:16"))
The desired output is generated by this code:
output <- data.frame(id = c("123||wa", "223||sa"), dish = c("ravioli", "pizza"),
car = c("bmw", "audi"), house = c("yes", "yes"))
NOTE: As you can see in the original data set, there were multiple rows for the id field. More importantly, there were two rows for id '123||wa' regarding their favourite dish but only their latest answer is wanted in the final output.
Any help would be greatly appreciated. Thanks
Most likely the timestamp column should first be converted to a proper date-time type (here using ymd_hms from lubridate together with strptime), since the extracted value should correspond to the latest record by timestamp. After that, several functions from dplyr and tidyr come in handy:
library(lubridate)
library(dplyr)
library(tidyr)
df %>%
mutate(timestamp = ymd_hms(strptime(timestamp, "%d%b%Y:%H:%M:%S"))) %>%
group_by(id, questions) %>%
arrange(timestamp) %>%
summarise(last = last(answers)) %>%
spread(questions, last)
#output
# A tibble: 2 x 4
# Groups: id [2]
id car dish house
* <fct> <fct> <fct> <fct>
1 123||wa bmw ravioli yes
2 223||sa audi pizza yes
The ymd_hms(strptime(... part can be replaced with:
mutate(timestamp = parse_date_time(timestamp, orders = "%d%b%Y:%H:%M:%S"))
See ?strptime for how to construct the date-time format.
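As a quick illustration of why the conversion matters, the sketch below parses two of the question's timestamps with base R only and compares string order with chronological order. It assumes an English locale, because %b matches month abbreviations in the current locale, and the UTC time zone is an arbitrary choice for the example.

```r
ts <- c("07JAN2015:15:22:54", "05FEB2018:17:23:16")

# %d day, %b abbreviated month name, %Y four-digit year, then H:M:S.
parsed <- as.POSIXct(ts, format = "%d%b%Y:%H:%M:%S", tz = "UTC")

# Sorting the raw strings compares characters, so "05..." sorts before "07...".
string_order <- order(ts)

# Sorting the parsed values is chronological: 2015 comes before 2018.
time_order <- order(parsed)

format(parsed, "%Y-%m-%d %H:%M:%S")
```

Here the two orders disagree, so taking the "last" answer after sorting unparsed strings could pick the wrong record.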
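You can do this with the tidyr and dplyr libraries: parse the timestamp so that rows sort chronologically rather than alphabetically, summarise by taking the last answer, and then reshape the data frame:
library(dplyr)
library(tidyr)
output <- df %>%
mutate(timestamp = as.POSIXct(as.character(timestamp), format = "%d%b%Y:%H:%M:%S")) %>%
arrange(id, timestamp) %>%
group_by(id, questions) %>%
summarise(last = last(answers)) %>%
spread(questions, last)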
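The same result can be sketched with pivot_wider, which superseded spread in tidyr 1.0.0. This variant is an assumption-laden alternative rather than either answer's method: it uses slice_max (dplyr 1.0.0 or later) to keep the latest row per id and question, parses the timestamp in base R under an English locale, and fixes the time zone to UTC for the example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id = c("123||wa", "123||wa", "123||wa", "223||sa", "223||sa", "223||sa", "123||wa"),
  questions = c("dish", "car", "house", "dish", "house", "car", "dish"),
  answers = c("pasta", "bmw", "yes", "pizza", "yes", "audi", "ravioli"),
  timestamp = c("03JUL2014:15:38:11", "07JAN2015:15:22:54", "24MAR2018:12:24:16",
                "24MAR2018:12:24:16", "04AUG2014:12:40:30", "03JUL2014:15:38:11",
                "05FEB2018:17:23:16"),
  stringsAsFactors = FALSE
)

output <- df %>%
  # Parse the text timestamps so that "latest" means latest in time.
  mutate(timestamp = as.POSIXct(timestamp, format = "%d%b%Y:%H:%M:%S", tz = "UTC")) %>%
  group_by(id, questions) %>%
  # Keep only the most recent answer for each id/question pair.
  slice_max(timestamp, n = 1) %>%
  ungroup() %>%
  select(-timestamp) %>%
  # One row per id, one column per question.
  pivot_wider(names_from = questions, values_from = answers)
```

Dropping timestamp before pivoting matters: otherwise it would act as an extra identifier and split each id across several rows.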
