Making new variables for every group of observations in R

I have 11 variables in my dataframe. The first is a unique identifier of the observation (a plane). The second is a number from 1 to 21 representing the flight number of a given plane. The rest of the variables are time, velocity, distance, etc.
What I want to do is make new variables for every flight group (number), e.g. time_1, time_2, ..., velocity_1, velocity_2, etc., and consequently reduce the number of observations (the repeating ones).
I don't really have an idea how to start. I was thinking about a mutate call like:
mutate(df, time_1 = ifelse(n_flight == 1, time, NA))
But that would be a lot of typing, and new problems might appear.

Basically, you want to convert long to wide data for each variable. You can lapply over the variables with tidyr::spread in that case. Suppose the data looks like the following:
library(dplyr)
library(tidyr)
df <- data.frame(
  ID = c(rep("A", 3), rep("B", 3)),
  n_flight = rep(seq(3), 2),
  time = seq(19, 24),
  velocity = rev(seq(65, 60))
)
Then the following will generate your outcome of interest, as long as you get rid of the extra ID variables.
lapply(
  setdiff(names(df), c("ID", "n_flight")), function(x) {
    df %>%
      select(ID, n_flight, !!x) %>%
      tidyr::spread(., key = "n_flight", value = x) %>%
      setNames(paste(x, names(.), sep = "_"))
  }
) %>%
  bind_cols()
Let me know if this wasn't what you were going for.
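For what it's worth, newer tidyr versions can do the whole reshape in a single call. Here is a minimal sketch, assuming tidyr >= 1.0.0, which builds the time_1, time_2, ..., velocity_1, ... names directly:
library(dplyr)
library(tidyr)

df %>%
  pivot_wider(
    id_cols = ID,            # one row per plane
    names_from = n_flight,   # flight number becomes the suffix
    values_from = c(time, velocity),
    names_sep = "_"          # e.g. time_1, velocity_2
  )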

Related

How to merge rows based on conditions with character values? (Household data)

I have a data frame in which the first column indicates the work (manager, employee or worker), the second indicates whether the person works at night or not and the last is a household code (if two individuals share the same code then it means that they share the same house).
# Here is the reproducible data:
PCS <- c("worker", "manager","employee","employee","worker","worker","manager","employee","manager","employee")
work_night <- c("Yes","Yes","No", "No","No","Yes","No","Yes","No","Yes")
HHnum <- c(1,1,2,2,3,3,4,4,5,5)
df <- data.frame(PCS,work_night,HHnum)
My problem is that I would like to have a new data frame with households instead of individuals. I would like to group individuals based on HHnum and then merge their answers.
For the variable "PCS" I have new categories based on the combination of answers : Manager+work ="I" ; manager+employee="II", employee+employee=VI, worker+worker=III etc
For the variable "work_night", I would like to apply a score (is both answered Yes then score=2, if one answered YES then score =1 and if both answered No then score = 0).
To be clear, I would like my data frame to look like this :
HHnum PCS work_night
1 "I" 2
2 "VI" 0
3 "III" 1
4 "II" 1
5 "II" 1
How can I do this in R using dplyr? I know that I need group_by(), but then I don't know what to use.
Best,
Victor
Here is one way to do it (though I admit it is pretty verbose). I created a reference dataframe (i.e., combos) in case you had more than 3 categories, which is then joined with the main dataframe (i.e., df_new) to bring in the PCS Roman numerals.
library(dplyr)
library(tidyr)
# Create a dataframe with all of the combinations of PCS.
combos <- expand.grid(unique(df$PCS), unique(df$PCS))
combos <- unique(t(apply(combos, 1, sort))) %>%
  as.data.frame() %>%
  dplyr::mutate(PCS = as.roman(row_number()))
# Create another dataframe with the columns reversed (will make it easier to join to the main dataframe).
combos2 <- data.frame(V1 = c(combos$V2), V2 = c(combos$V1), PCS = c(combos$PCS)) %>%
  dplyr::mutate(PCS = as.roman(PCS))
combos <- rbind(combos, combos2)
# Get the count of "Yes" for each HHnum group.
# Then, put the PCS into 2 columns to join together with the "combos" df.
df_new <- df %>%
  dplyr::group_by(HHnum) %>%
  dplyr::mutate(work_night = sum(work_night == "Yes")) %>%
  dplyr::group_by(grp = rep(1:2, length.out = n())) %>%
  dplyr::ungroup() %>%
  tidyr::pivot_wider(names_from = grp, values_from = PCS) %>%
  dplyr::rename("V1" = 3, "V2" = 4) %>%
  dplyr::left_join(combos, by = c("V1", "V2")) %>%
  unique() %>%
  dplyr::select(HHnum, PCS, work_night)
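A more compact sketch of the same idea, assuming the category codes can be looked up from the sorted pair of jobs per household. The pcs_lookup vector below is hypothetical and simply hard-codes the mapping described in the question:
library(dplyr)

# Hypothetical lookup: sorted pair of jobs -> Roman-numeral category from the question
pcs_lookup <- c("manager worker"    = "I",
                "employee manager"  = "II",
                "worker worker"     = "III",
                "employee employee" = "VI")

df %>%
  dplyr::group_by(HHnum) %>%
  dplyr::summarise(
    PCS = unname(pcs_lookup[paste(sort(as.character(PCS)), collapse = " ")]),
    work_night = sum(work_night == "Yes")   # 2 if both Yes, 1 if one, 0 if none
  )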

Reduce duplicated entries considering more than one column

I have a long dataset in which there are duplicated entries whose data I need to merge, e.g. paste values together.
In my case, I have a database of scientific articles: the strongest unique identifiers are the DOI and the article title, but the first may be missing in one of the copies, and the second may have slight phonetic/graphic differences that are easy to spot for humans but not programmatically (e.g. one copy uses β and the other plain beta).
A "match" are two articles that share at least one of the two columns. That is, I need a way to dplyr::group_by by the DOI OR the article title (usual group_by uses an AND logic).
The only solution that comes to my mind is to repeat the aggregation twice, for each column. Not very efficient given the large number of records.
Example:
imagine an input like:
df <- data.frame(
  ID = c(1, NA, 2, 2),
  Title = c('A', 'A', 'beta', 'β'),
  to.join = 1:4
)
After (OR-)grouping and summarising:
df %>%
  group_by_OR(ID, Title) %>% # dummy function
  summarise(
    ID = na.omit(ID)[1],
    Title = Title[1],
    joined = paste(to.join, collapse = ', ')
  )
I should get something like this:
ID Title joined
1 1 A 1, 2
2 2 beta 3, 4
That is, the data was grouped by the title for the first group and by the id for the second.
I don't think you can avoid having to group the data twice, but we can do it sequentially; that way we can be as efficient as possible.
library(dplyr)
df_aggregated <- df %>%
  group_by(ID) %>%
  arrange(Title) %>%
  summarise(Title = first(Title),
            to.join = paste0(to.join, collapse = ", ")) %>%
  group_by(Title) %>%
  arrange(ID) %>%
  summarise(ID = first(ID),
            to.join = paste0(to.join, collapse = ", ")) %>%
  select(ID, Title, joined = to.join) %>%
  as.data.frame()
Now,
df_aggregated
is:
ID Title joined
1 1 A 1, 2
2 2 beta 3, 4
Eventually I found a solution, thanks also to @dario.
First I group by Title and impute the missing DOIs if at least one of the copies has one. Then I ungroup and create a new unique ID, using the DOI if present and the Title for entries none of whose copies have one.
Finally I group and summarise by this ID.
This way the computationally heavy summarising step is done only once.
records %>%
  mutate(
    # Improve matching between slightly different copies
    uID = str_to_lower(Title) %>% str_remove_all('[^\\w\\d]+')
  ) %>%
  group_by(uID) %>%
  mutate(DOI = na.omit(DOI)[1]) %>%
  ungroup() %>%
  mutate(
    uID = ifelse(is.na(DOI), uID, DOI)
  ) %>%
  group_by(uID) %>%
  summarise(...) # various stuff here.
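For reference, here is a minimal sketch of that pipeline applied to the small example above, with ID standing in for DOI; the concrete summarise body is an assumption, filled in with the paste from the question:
library(dplyr)
library(stringr)

df %>%
  mutate(uID = str_to_lower(Title) %>% str_remove_all('[^\\w\\d]+')) %>%
  group_by(uID) %>%
  mutate(ID = na.omit(ID)[1]) %>%      # impute the missing identifier within title groups
  ungroup() %>%
  mutate(uID = ifelse(is.na(ID), uID, ID)) %>%  # prefer the identifier when present
  group_by(uID) %>%
  summarise(ID = na.omit(ID)[1],
            Title = Title[1],
            joined = paste(to.join, collapse = ", ")) %>%
  select(ID, Title, joined)
This reproduces the expected output: rows 1 and 2 are merged via the Title, rows 3 and 4 via the identifier.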

Generating additional rows based on a condition within the same data frame

I have a data frame like DF below, which will be imported directly from the database (as a tibble).
library(tidyverse)
library(lubridate)
date_until <- dmy("31.05.2019")
date_val <- dmy("30.06.2018")
DF <- data.frame(
  date_bal = as.Date(c("2018-04-30", "2018-05-31", "2018-06-30", "2018-05-31", "2018-06-30")),
  department = c("A", "A", "A", "B", "B"),
  amount = c(10, 20, 30, 40, 50)
)
DF <- DF %>%
as_tibble()
DF
It represents the amount of money spent by each department in a specific month. My task is to project how much money will be spent by each department in the following months, until a specified date in the future (in this case date_until = 31.05.2019).
I would like to use the tidyverse to generate additional rows for each department, where the first column date_bal would be a sequence of dates from the last one in the "original" DF up to the predefined date_until. Then I would like to add a column called "DIFF" representing the difference between date_bal and date_val, where date_val is also predefined. My final result would look like this:
[Final result shown in the linked image]
I have managed to do this in the following way:
1. First, filter the data from DF for department A.
2. Create another DF2 by populating it with a date sequence from min(date_bal) (from step 1) to date_until.
3. Merge the data frames from steps 1 and 2, and then add the calculated columns using mutate.
Since I will have to repeat this procedure for many departments, I wonder if it's possible to add rows (create the date sequence) in the existing DF (without creating a second DF and then merging).
Thanks in advance for your help and time.
I add one day to the dates, create a sequence, and then roll back to the last day of the previous month.
seq(min(date_val + days(1)), date_until + days(1), by = 'months')[-1] %>%
  rollback() %>%
  tibble(date_bal = .) %>%
  crossing(DF %>% distinct(department)) %>%
  bind_rows(DF %>% select(date_bal, department)) %>%
  left_join(DF) %>%
  arrange(department, date_bal) %>%
  mutate(
    amount = if_else(is.na(amount), 0, amount),
    DIFF = interval(
      rollback(date_val, roll_to_first = TRUE),
      rollback(date_bal, roll_to_first = TRUE)) %/% months(1)
  )
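To see what the first two steps produce in isolation, here is a small sketch using the dates from the question (assuming lubridate is loaded):
library(lubridate)

d <- dmy("30.06.2018")   # last observed month-end
# Shift to the 1st of the next month, build a monthly sequence, drop the first
# element, then roll back to the last day of the previous month:
rollback(seq(d + days(1), dmy("31.05.2019") + days(1), by = "months")[-1])
# "2018-07-31" "2018-08-31" ... "2019-04-30" "2019-05-31"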

Time series function in dplyr

I am working with data that stops in a specific year and is NA afterwards, and I need to calculate a lot of variables based on lagged values of other variables. I would like to find a way to calculate the whole series at once, instead of one year at a time whenever one of the variables is NA. I was looking at dplyr, given that I am working with panel data and thus need to group it by ID.
I provide the example below:
set.seed(1)
df <- data.frame(
  year = c(seq(2000, 2018), seq(2000, 2018)),
  id = c(rep(1, 19), rep(2, 19)),
  varA = floor(rnorm(38) * 100),
  varB = floor(rnorm(38) * 100),
  varC = floor(rnorm(38) * 100)
)

df <- df %>%
  mutate(varA = if_else(year > 2010, as.double(NA), varA),
         varB = if_else(year > 2010, as.double(NA), varB),
         varC = if_else(year > 2010, as.double(NA), varC)) %>%
  group_by(id) %>%
  arrange(year)
What I would like is to find a way to calculate a variable that is equal to variable C when it is available, but afterwards is equal to a formula based on lagged values of variables C, B, and A. When executing the code below, varRESULT and varD are only calculated for one year, given that the lags are only available for one year:
df <- df %>%
  mutate(varD = lag(varA) * lag(varB),
         varRESULT = if_else(is.na(varC), lag(varC, 1) / lag(varD, 2) * lag(varD, 1), varC))
But I would like to find a way to calculate the whole series immediately (taking into account the panel dimension of the data) instead of having to repeat the code 7 times. Preferably a solution where you can calculate varD separately from varRESULT, given that in the final application I have multiple variables that are linked to each other.
Proposed solution:
Starting with the first NA, the "recursive" lags of varA, varB, and varC are equal to the last observed value of these variables.
Thus, starting from these initial variables, we can create new variables varA1, varB1, and varC1, where we fill the NAs with the last value, by id:
library(dplyr)
library(tidyr) # for the function `fill`
df <- df %>%
  mutate(varA1 = varA, varB1 = varB, varC1 = varC) %>%
  group_by(id) %>%
  arrange(year) %>%
  fill(varA1, varB1, varC1) # fills with the last value
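A quick sketch of what fill() does here, on a toy series (the values below are hypothetical):
library(dplyr)
library(tidyr)

tibble(year = 2009:2013, varA1 = c(5, 7, 9, NA, NA)) %>%
  fill(varA1)
# varA1 becomes 5, 7, 9, 9, 9: every year after the last observation reuses that value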
Then, we apply the formula:
df <- df %>%
  mutate(varD = lag(varA1) * lag(varB1),
         varRESULT = if_else(is.na(varC), lag(varC1, 1) / lag(varD, 2) * lag(varD, 1), varC)) %>%
  select(-varA1, -varB1, -varC1)

Replace NA Value with Previous Value

I have a column in a data frame that I created in R. After a certain month, the values become NA. I would like to replace the NAs with the record from 12 months back. Is there a function in R for me to do this, or do I have to write a loop?
So Jan-11 would then become 10, Feb-11 would become 11 and so forth.
EDIT:
I also tried:
for (i in 1:length(df$var)) {
  df$var[i] <- ifelse(is.na(df$var[i]), df$var[i - 12], df$var[i])
}
but the whole column ends up being NA.
Aha, from the last comment it sounds like you'd like a "chained" lag, where it uses the last value of that month that is available, however many years back you need to go.
"Jan-11 will show the value 10, but when it comes to Jan-12, it shows NA (when it should be 10)."
Here's an approach that relies on first grouping by month, and then using tidyr::fill() to fill in from the last valid value for that month.
First, some fake data. (BTW it would be useful to include something like this in your question so that answerers don't have to retype your numbers or generate new ones.)
# Make fake data with 1 year values, 2 yrs NAs
library(lubridate)
set.seed(42)
data <- data.frame(
  dates = seq.Date(from = ymd(20100101), to = ymd(20121201), by = "month"),
  values = c(as.integer(rnorm(12, 10, 3)), rep(NA_integer_, 24))
)
# Group by months, fill within groups, ungroup.
library(tidyverse)
data_filled <- data %>%
  group_by(month = month(dates)) %>%
  fill(values) %>%
  ungroup() %>%
  arrange(dates)
I can't think of a way to do this without a loop, but this should give you what you need:
df <- data.frame(col1 = LETTERS[1:24],
                 col2 = c(rnorm(12), rep(NA, 12)))

for (i in 1:nrow(df)) {
  if (is.na(df[i, 2])) {
    df[i, 2] <- df[i - 12, 2]
  }
}
