Add a grouping variable based on ranked data in R

Consider the following dataframe:
name <- c("Sally", "Dave", "Aaron", "Jane", "Michael")
rank <- c(1,2,1,2,3)
df <- data.frame(name, rank, stringsAsFactors = FALSE)
I'd like to create a grouping variable (event) based on the rank column, as such:
event <- c("Hurdles", "Hurdles", "Long Jump", "Long Jump", "Long Jump")
df_desired <- data.frame(name, rank, event, stringsAsFactors = FALSE)
There are lots of examples of going the other way (making a ranking variable based on a group) but I can't seem to find one doing what I'd like.
It's possible to use filter, full_join and then fill as shown below, but is there a simpler way?
library(tidyverse)
df <- df %>%
  mutate(order = row_number())
df_1 <- df %>%
  filter(rank == 1)
df_1$event <- c("Hurdles", "Long Jump")
df %>%
  filter(rank != 1) %>%
  mutate(event = as.character(NA)) %>%
  full_join(df_1, by = c("order", "name", "rank", "event")) %>%
  arrange(order) %>%
  fill(event) %>%
  select(-order)

We can use cumsum to create the index
library(dplyr)
df %>%
  mutate(event = c("Hurdles", "Long Jump")[cumsum(rank == 1)])
# name rank event
#1 Sally 1 Hurdles
#2 Dave 2 Hurdles
#3 Aaron 1 Long Jump
#4 Jane 2 Long Jump
#5 Michael 3 Long Jump
Or in base R (just in case):
df$event <- c("Hurdles", "Long Jump")[cumsum(df$rank == 1)]
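The indexing trick works because cumsum(df$rank == 1) increments every time a new rank of 1 appears, so it labels each block of rows with the group it belongs to:
cumsum(df$rank == 1)
# [1] 1 1 2 2 2
c("Hurdles", "Long Jump")[cumsum(df$rank == 1)]
# [1] "Hurdles"   "Hurdles"   "Long Jump" "Long Jump" "Long Jump"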


Finding the first row after which x rows meet some criterion in R

A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 20),  # 5 animals x 20 hourly records = n rows
                  datetime = seq(as.POSIXct("2020-12-26 00:00:00"),
                                 as.POSIXct("2020-12-30 3:00:00"), by = "hour"),
                  land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once for 3 straight days. I tried doing something like this:
library(tidyverse)  # for the pipe, dplyr verbs and drop_na()
library(lubridate)  # for ymd()
dat$ymd <- ymd(dat$datetime[1]) # make column for year-month-day
# add land points within each id group
land.pts <- dat %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  drop_na(land) %>%
  mutate(all.land = cumsum(land))
# flag days that have any land points
flag <- land.pts %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  slice(n()) %>%
  mutate(flag = if_else(all.land == 0, 0, 1))
# Combine flagged dataframe with full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x <- comb %>%
  group_by(id) %>%
  arrange(id, datetime) %>%
  mutate(time.land = ifelse(land == 0 | is.na(lag(land)) | lag(land) == 0 | flag == 0,
                            0,
                            difftime(datetime, lag(datetime), units = "days")))
But I still can't quite work out how to determine when an animal has been on land at least once a day for three straight days, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp. Summarise the data and keep only one row per id and date, showing whether the animal was on land at least once that day.
Then use zoo's rollapply function to mark a day as TRUE if the animal was on land on that day and on each of the following two days.
library(dplyr)
library(zoo)
dat <- dat %>% mutate(date = as.Date(datetime))
dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1)) %>%
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  ungroup() %>%
  # if you want all the rows of the data
  left_join(dat, by = c('id', 'date'))
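If the goal is then to pull out the single first tracking row per animal where such a three-day stretch begins, one possible follow-up (only a sketch, assuming the joined result above has been saved, e.g. as flagged) is:
# sketch: `flagged` is assumed to hold the joined result from the pipeline above
first_land <- flagged %>%
  filter(consec_three, land == 1) %>% # land points inside a qualifying 3-day window
  group_by(id) %>%
  slice_min(datetime, n = 1) %>%      # earliest such point per animal
  ungroup()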

Rowwise find most frequent term in dataframe column and count occurrences

I am trying to find the most frequent category within every row of a dataframe. A category can consist of multiple words separated by a /.
library(tidyverse)
library(DescTools)
# example data
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
                "apple,NA,apple,chocolate",
                "shoes/socks,NA,NA,NA",
                "apple,apple,chocolate,chocolate")
df <- data.frame(id, categories)
# the solution I would like to achieve
solution <- df %>%
  mutate(winner = c("apple", "apple", "shoes/socks", "apple"),
         winner_count = c(1, 2, 1, 2))
Based on these answers I have tried the following:
Write a function that finds the most common word in a string of text using R
trial <- df %>%
  rowwise() %>%
  mutate(winner = names(which.max(table(categories %>% str_split(",")))),
         winner_count = which.max(table(categories %>% str_split(",")))[[1]])
I also tried to follow this approach, but it does not give me the required results either:
How to find the most repeated word in a vector with R
trial2 <- df %>%
  mutate(winner = DescTools::Mode(str_split(categories, ","), na.rm = T))
I am mainly struggling because my most frequent category is not just one word but something like "shoes/socks", and because I also have NAs. I don't want the NAs to be the "winner".
I don't care too much about ties right now; I already have a follow-up process in place to handle the cases with winner_count = 2.
Split the categories on the comma into separate rows, count their occurrences for each id, drop the NA values, and select the top-occurring row for each id:
library(dplyr)
library(tidyr)
df %>%
  separate_rows(categories, sep = ',') %>%
  count(id, categories, name = 'winner_count') %>%
  filter(categories != 'NA') %>%
  group_by(id) %>%
  slice_max(winner_count, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  rename(winner = categories) %>%
  left_join(df, by = 'id') -> result
result
# id winner winner_count categories
# <dbl> <chr> <int> <chr>
#1 1 apple 1 apple,shoes/socks,trousers/jeans,chocolate
#2 2 apple 2 apple,NA,apple,chocolate
#3 3 shoes/socks 1 shoes/socks,NA,NA,NA
#4 4 apple 2 apple,apple,chocolate,chocolate
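For the follow-up that handles ties, a small variation: slice_max with with_ties = TRUE keeps every tied top category for an id instead of arbitrarily picking one:
# variation: keep all tied top categories per id (e.g. id 4 returns both apple and chocolate)
df %>%
  separate_rows(categories, sep = ',') %>%
  filter(categories != 'NA') %>%
  count(id, categories, name = 'winner_count') %>%
  group_by(id) %>%
  slice_max(winner_count, n = 1, with_ties = TRUE) %>%
  ungroup()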

Add multiple selects in one dataset

I have the dataset below, in which I consolidate the categories Mk_Cap, Exports and Money_Supply, but each of these series has a different Unit.
df <- data.frame(Mes = c("Jan","Fev","Mar","Abr","Mai",
                         "Jan","Fev","Mar","Abr","Mai",
                         "Jan","Fev","Mar","Abr","Mai"),
                 Ano = c(2005,2006,2007,2008,2009,
                         2005,2006,2007,2008,2009,
                         2005,2006,2007,2008,2009),
                 Mk_Cap = c(11:15, 116:120, 1111:1115),
                 Exports = c(21:25, 146:150, 1351:1355),
                 Money_Supply = c(31:35, 546:550, 2111:2115),
                 Unit = c("USD","USD","USD","USD","USD",
                          "200=10","200=10","200=10","200=10","200=10",
                          "CNY","CNY","CNY","CNY","CNY"))
Today I am consolidating as follows:
library(dplyr)
Money_Supply <- df %>% dplyr::select(Ano, Mes,Money_Supply) %>% dplyr::filter(df$Unit == "USD")
Mk_Cap <- df %>% dplyr::select(Mk_Cap) %>% dplyr::filter(df$Unit == "200=10")
Exports <- df %>% dplyr::select(Exports) %>% dplyr::filter(df$Unit == "CNY")
Consolidado <- base::cbind(Money_Supply,Mk_Cap,Exports)
I don't think this is the most correct way to do it, but it is the way I have found so far. In this example there are only a few occurrences, but in practice I do this for more than 30 variables, which is extremely costly, so an easier way would be ideal.
A solution with dplyr:
There is a pattern in the dataframe: each year has three rows.
Of the three columns of interest (Money_Supply, Mk_Cap, Exports), each variable sits in the first, second or third of those rows.
First reorder the columns, then arrange by year, then lead the columns of interest. Then group and filter by id == 1.
df1 <- df %>%
  select(Ano, Mes, Money_Supply, Mk_Cap, Exports) %>%
  arrange(Ano) %>%
  mutate(Mk_Cap = lead(Mk_Cap, order_by = Ano)) %>%
  mutate(Exports = lead(Exports, 2, order_by = Ano)) %>%
  mutate(group = rep(row_number(), each = 3, length.out = n())) %>%
  group_by(group) %>%
  mutate(id = row_number()) %>%
  filter(id == 1) %>%
  ungroup() %>%
  select(-group, -id)
Edit: to clarify my point and the simplicity of the pattern in the data:
# slightly simplified code
df1 <- df %>%
  arrange(Ano) %>%
  mutate(Mk_Cap = lead(Mk_Cap, order_by = Ano)) %>%
  mutate(Exports = lead(Exports, 2, order_by = Ano)) %>%
  group_by(Ano) %>%
  mutate(id = row_number()) %>%
  filter(id == 1) %>%
  ungroup() %>%
  select(Ano, Mes, Money_Supply, Mk_Cap, Exports, -id, -Unit)
If you look at your dataframe after arrange(Ano):
You have 5 Ano: 2005-2009.
In each Ano you have 1 Mes: in 2005 = Jan, 2006 = Fev, 2007 = Mar, 2008 = Abr, 2009 = Mai.
In each Ano and Mes you have 3 Unit rows: in 2005 & Jan = USD, 200=10, CNY; in 2006 & Fev = USD, 200=10, CNY; and so on.
In your desired output you wish to condense the 3 rows of one Ano with 3 different Unit into 1 row with Ano, Mes and the corresponding values of Money_Supply, Mk_Cap and Exports.
This can be achieved with the lead function (see the small illustration below):
Money_Supply: no code necessary, the value is already in the first row of each Ano.
Mk_Cap: mutate(Mk_Cap = lead(Mk_Cap, order_by = Ano)) pulls the value up from the second row.
Exports: mutate(Exports = lead(Exports, 2, order_by = Ano)) pulls the value up from the third row.
group_by(Ano) groups by Ano.
mutate(id = row_number()) assigns a unique id within each group.
filter(id == 1) keeps the first row of each group.
Finally, tweak the order of the columns and drop the unnecessary ones:
select(Ano, Mes, Money_Supply, Mk_Cap, Exports, -id, -Unit)
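To make the shift concrete, here is a minimal illustration using throwaway vectors holding the three 2005 values from the example data (USD, 200=10 and CNY rows, in that order):
library(dplyr)
Mk_Cap_2005 <- c(11, 116, 1111)  # USD, 200=10, CNY rows for 2005
lead(Mk_Cap_2005)                # 116 1111 NA  -> the 200=10 value moves into row 1
Exports_2005 <- c(21, 146, 1351)
lead(Exports_2005, 2)            # 1351 NA NA   -> the CNY value moves into row 1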
I think a simple way would be filtering your dataset by the Unit column before doing any other operations. Store the variations in a list by performing:
unit_variations <- lapply(unique(df$Unit), function(x) {
  return(df[df$Unit == x, ])
})
names(unit_variations) <- unique(df$Unit)
Then, to make your Consolidado dataframe, select which variables you want from which unit variations. Say:
vars <- c("Money_Supply", "Mk_Cap", "Exports")
unit <- c("USD", "200=10", "CNY")
Consolidado <- mapply(
  FUN = function(var, unit) {
    return(unit_variations[[unit]][[var]])
  },
  vars,
  unit
)
I used a list because, from what you described, I cannot assume that the number of rows for each type of Unit will always be the same, so a list allows for more flexibility. I also did not include month and year, for the same reason.
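If Ano and Mes do turn out to be needed, one possible follow-up sketch, assuming every Unit subset happens to have the same number of rows, is to attach those columns from one of the subsets:
# sketch: assumes every Unit subset has the same number of rows,
# so Ano and Mes can be taken from any one of them (here the USD subset)
Consolidado <- cbind(unit_variations[["USD"]][, c("Ano", "Mes")],
                     as.data.frame(Consolidado))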

Replace all nicknames with full names based on a different dataframe in R

I have a dataframe with names that are a mixture of full names and nicknames. I want to replace all of the nicknames in that dataframe with full names from a different dataset.
temp <- data.frame("Id" = c(1,2,3,4,5), "name" = c("abe", "bob", "tim","timothy", "Joe"))
temp2 <-data.frame("name" = c("abraham", "robert", "timothy","joseph"),"nickname1"=c("abe", "rob", "tim","joe"),"nickname2"=c("", "bob", "","joey"))
If a value in the name column of temp appears in either nickname1 or nickname2 of temp2, replace it with the corresponding value in the name column of temp2.
So it should look like this at the end:
temp3<- data.frame("Id" = c(1,2,3,4,5), "name" = c("abraham", "robert", "timothy","timothy", "Joseph"))
As mentioned by @thelatemail, you can get the data into long format and then do a join. Also, since you have data in upper as well as lower case, make the case uniform before joining. If the value is present in temp2 you can select that, or else keep the temp value, using coalesce.
library(dplyr)
temp2 %>%
  tidyr::pivot_longer(cols = -name, names_to = 'nickname') %>%
  filter(value != '') %>%
  mutate(name = tolower(name)) %>%
  right_join(temp %>% mutate(name = tolower(name)), by = c('value' = 'name')) %>%
  mutate(name = coalesce(name, value)) %>%
  select(Id, name)
# Id name
# <dbl> <chr>
#1 1 abraham
#2 2 robert
#3 3 timothy
#4 4 timothy
#5 5 joseph
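For reference, coalesce() works element-wise and returns the first non-missing value in each position, so a row whose nickname matched keeps the joined full name, while an unmatched row (name is NA after the right join) falls back to its original value:
coalesce(c("abraham", NA), c("abe", "timothy"))
# [1] "abraham" "timothy"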

Pop out observation/row from a data frame

My data looks like this:
library(tidyverse)
set.seed(1)
df <- tibble(
  id = c("cat", "cat", "mouse", "dog", "fish", "fish", "fish"),
  value = rnorm(7, 100, sd = 50)
)
How might I "pop out" the top value of fish, as in move fish to a new data frame and simultaneously remove it from the current data frame?
This works (but it doesn't seem all that elegant):
df_store <- df %>%
  filter(id == "fish") %>%
  top_n(1)
df <- anti_join(df, df_store)
Is there a better way?
You can do both actions in a single line by using the pipeR package.
library(pipeR); library(dplyr)
df <- df %>>% filter(id == "fish") %>>% top_n(1) %>>% (~ df2) %>% anti_join(df, .)
print(df2)
#### 1 fish 124.3715
print(df)
#### 1 mouse 58.21857
#### 2 dog 179.76404
#### 3 fish 58.97658
#### 4 cat 68.67731
#### 5 cat 109.18217
#### 6 fish 116.47539
I'm no expert on pipeR, so check its documentation for details on how this kind of assignment within a pipe actually works.
Just one remark: when using top_n, I recommend specifying the value column explicitly. By default it uses the last column, which is easy to forget: top_n(1, value)
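Applied to this example, the explicit form of the earlier filter-and-top_n step would look like this:
# explicit column, per the remark above
df_store <- df %>%
  filter(id == "fish") %>%
  top_n(1, value)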
