I have a panel dataset where the time and group variables were already converted to dummies. However, I want to reverse that transformation and get back to simple id and time variables.
Let's create comparable data:
library(plm)
library(tidyverse)
library(fastDummies)
data(EmplUK)
EmplUK %>%
  select(-sector) %>%
  dummy_cols(select_columns = c("firm", "year"),
             remove_selected_columns = TRUE,
             remove_first_dummy = TRUE) -> paneldata
head(paneldata)
So now all my dummy variables are firm_X and year_X, and I would like to have a firm and year variable again.
This is slightly complicated by the fact that Firm 1 and Year 1976 do not exist as dummies (they were dropped, since they would not be needed in a regression model).
I'm fine with this precise data being implicit: I can simply infer that the first firm is Firm 1 and that the first year is 1976, one less than the smallest year appearing as a dummy.
Any ideas how to do this nicely? Ideally using tidyverse?
After some thinking, I figured it out and created a small function:
getfactorback <- function(data,
                          groupdummyprefix,
                          timedummyprefix,
                          grouplabel,
                          timelabel,
                          firstgroup,
                          firsttime) {
  data %>%
    # Recreate the dropped baseline dummies: a row belongs to the baseline
    # level exactly when all dummies in the corresponding block are 0.
    mutate(newgroup = ifelse(rowSums(cur_data() %>% select(starts_with(groupdummyprefix))) == 1, 0, 1),
           newtime  = ifelse(rowSums(cur_data() %>% select(starts_with(timedummyprefix))) == 1, 0, 1)) %>%
    rename(!!paste0(groupdummyprefix, firstgroup) := newgroup,
           !!paste0(timedummyprefix, firsttime) := newtime) %>%
    # Collapse each dummy block back into a single long column.
    pivot_longer(cols = starts_with(groupdummyprefix),
                 names_to = grouplabel,
                 names_prefix = groupdummyprefix) %>%
    filter(value == 1) %>%
    select(-value) %>%
    pivot_longer(cols = starts_with(timedummyprefix),
                 names_to = timelabel,
                 names_prefix = timedummyprefix) %>%
    filter(value == 1) %>%
    select(-value) %>%
    mutate(across(.cols = c(all_of(grouplabel), all_of(timelabel)), factor)) %>%
    relocate(all_of(c(grouplabel, timelabel))) -> output
  return(output)
}
getfactorback(data = paneldata,
              groupdummyprefix = "firm_",
              grouplabel = "firm",
              timedummyprefix = "year_",
              timelabel = "year",
              firstgroup = "1",
              firsttime = 1976)
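For comparison, here is an alternative sketch that treats each dummy block as a 0/1 matrix; dummies_to_factor is a hypothetical helper (not from any package), and it assumes at most one 1 per row per block, which one-hot dummies guarantee:
# Hypothetical helper: rebuild a factor from a block of 0/1 dummy columns,
# mapping all-zero rows to the dropped baseline level.
dummies_to_factor <- function(data, prefix, baseline) {
  block <- as.matrix(select(data, starts_with(prefix)))
  levs <- c(baseline, sub(prefix, "", colnames(block)))
  # max.col() finds the position of the 1 in each row; rows with no 1 at all
  # belong to the baseline level (position 1).
  idx <- ifelse(rowSums(block) == 0, 1, max.col(block, ties.method = "first") + 1)
  factor(levs[idx], levels = levs)
}
paneldata %>%
  mutate(firm = dummies_to_factor(paneldata, "firm_", "1"),
         year = dummies_to_factor(paneldata, "year_", "1976")) %>%
  select(-starts_with("firm_"), -starts_with("year_")) %>%
  relocate(firm, year)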
A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 20), # 5 animals x 20 hourly fixes = 100 rows
                  datetime = seq(as.POSIXct("2020-12-26 00:00:00"),
                                 as.POSIXct("2020-12-30 03:00:00"),
                                 by = "hour"),
                  land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once for 3 straight days. I tried doing something like this:
library(dplyr)
library(tidyr)
library(lubridate)

dat$ymd <- as_date(dat$datetime) # make column for year-month-day

# add land points within each id group
land.pts <- dat %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  drop_na(land) %>%
  mutate(all.land = cumsum(land))
# flag days that have any land points
flag <- land.pts %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  slice(n()) %>%
  mutate(flag = if_else(all.land == 0, 0, 1))

# combine flagged dataframe with full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x <- comb %>%
  group_by(id) %>%
  arrange(id, datetime) %>%
  mutate(time.land = ifelse(land == 0 | is.na(lag(land)) | lag(land) == 0 | flag == 0,
                            0,
                            difftime(datetime, lag(datetime), units = "days")))
But I still can't quite wrap my head around what to do to make it so that I can figure out when the animal has been on land at least once for three days straight, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp, then summarise the data down to one row per id and date recording whether the animal was on land at least once during that day.
Then use zoo's rollapply to mark a day as TRUE when the animal was on land that day and on each of the following two days, i.e. the start of a 3-day run.
library(dplyr)
library(zoo)

dat <- dat %>% mutate(date = as.Date(datetime))

dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1)) %>%
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  ungroup() %>%
  # if you want all the rows of the data
  left_join(dat, by = c('id', 'date'))
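To finish the original ask of flagging the first land point itself, one possible follow-up, sketched under the assumption that the summarised tibble above (before the left_join) is saved as runs:
first_run_day <- runs %>%
  filter(consec_three) %>%     # days starting a 3-day land run (NAs dropped)
  group_by(id) %>%
  slice_min(date, n = 1) %>%   # earliest such day per id
  select(id, date)

dat %>%
  semi_join(first_run_day, by = c("id", "date")) %>%
  filter(land == 1) %>%
  group_by(id) %>%
  slice_min(datetime, n = 1)   # the first land point of that day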
I have two pieces of code that manipulate and filter (by date) my data.frame, and they work perfectly. Now I want to run the code not just for one day, but for every day in the vector:
seq(from = as.Date('2020-03-02'), to = Sys.Date(), by = 'days') # ... 538 days
The code I want to run for all the days between 2020-03-02 and today is:
KOKOKO <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  summarize(DATE = '2020-03-02', CZK = sum(Objem.v.CZK, na.rm = TRUE))

STAVPTF <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  group_by(CP) %>%
  summarize(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), DATE = '2020-03-02') %>%
  select(DATE, CP, mnozstvi) %>%
  rbind(KOKOKO) %>%
  drop_na()
So instead of '2020-03-02' I want to fill in every day since 2020-03-02, one after another. For each day I want to save the resulting KOKOKO and STAVPTF as separate data.frames, and store all of them in a list.
We could use map to loop over the sequence and apply the code:
library(dplyr)
library(purrr)
out <- map(s1, ~ data.frame %>%
             filter(DATE < .x) %>%
             summarize(DATE = .x, CZK = sum(Objem.v.CZK, na.rm = TRUE)))
As this is a repeated cycle, a function makes it cleaner:
f1 <- function(dat, date_col, group_col, Objem_col, aktualni_col, date_val) {
  filtered <- dat %>%
    filter({{date_col}} < date_val)

  KOKOKO <- filtered %>%
    summarize({{date_col}} := date_val,
              CZK = sum({{Objem_col}}, na.rm = TRUE))

  STAVPTF <- filtered %>%
    group_by({{group_col}}) %>%
    summarize(mnozstvi = last({{aktualni_col}}),
              {{date_col}} := date_val) %>%
    select({{date_col}}, {{group_col}}, mnozstvi) %>%
    bind_rows(KOKOKO) %>%
    drop_na()

  return(STAVPTF)
}
and call it as
map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK, AKTUALNI_MNOZSTVI_AKCIE, .x))
where
s1 <- seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days')
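If you also want the list elements labelled by day (the question asks to store all of them in a list), purrr::set_names is a convenient addition; a minimal sketch:
out <- map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK,
                    AKTUALNI_MNOZSTVI_AKCIE, .x)) %>%
  set_names(as.character(s1))

out[["2020-03-02"]] # the STAVPTF for a single day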
It would be easier to answer your question if you provided a minimal reproducible example. That is easily done with the tidyverse's reprex package.
However, your KOKOKO code can be rewritten as a simple cumulative sum:
KOKOKO <- data.frame %>%
  arrange(DATE) %>% # if necessary
  group_by(DATE) %>%
  summarise(CZK = sum(Objem.v.CZK), .groups = 'drop') %>% # summarise per DATE (if necessary)
  mutate(CZK = cumsum(CZK) - CZK) # cumulative sum excluding the current DATE
Even the STAVPTF code can probably be rewritten without iteration. First find the last value of AKTUALNI_MNOZSTVI_AKCIE per CP and DATE; then assign this value to the next DATE:
STAVPTF <- data.frame %>%
  group_by(CP, DATE) %>%
  summarise(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), .groups = 'drop_last') %>%
  arrange(DATE) %>% # if necessary
  mutate(DATE = lead(DATE))
In data analysis applied to psychology, we often want to check all results for each subject. Therefore, let's say I have this dataset:
library(tidyverse)
set.seed(123)
ds <- data.frame(subject = rep(1:4, each = 4),
                 metadata = c("congruent_1", "congruent_2", "incongruent_1", "incongruent_2"),
                 reaction_time = rnorm(16, mean = 0.1, sd = 0.02))
I can get means and standard deviations grouped by subject:
# mean
ds %>%
  group_by(subject) %>%
  filter(metadata == "congruent_1" | metadata == "congruent_2") %>%
  summarise(mean_cong = mean(reaction_time))

# sd
ds %>%
  group_by(subject) %>%
  filter(metadata == "incongruent_1" | metadata == "incongruent_2") %>%
  summarise(sd_cong_incong = sd(reaction_time))
However, now I need to compute a variable with the result of mean_cong / sd_cong_incong. I'm sure this is possible via group_by or nest, but I can't get the right code to run.
Fake code illustrating what I want would be:
ds %>%
  group_by(subject) %>%
  filter(metadata == "congruent_1" | metadata == "congruent_2") %>%
  summarise(mean_cong = mean(reaction_time)) %>%
  unfilter() %>% # <- I know this is not possible
  filter(metadata == "incongruent_1" | metadata == "incongruent_2") %>%
  summarise(sd_cong_incong = sd(reaction_time)) %>%
  mutate(pooled = mean_cong/sd_cong_incong)
The fake output would be one row per subject with the columns mean_cong, sd_cong_incong, and pooled.
I want to remain within the tidyverse.
Thank you.
You can include logic within the summarise expression like this:
ds %>%
  dplyr::group_by(subject) %>%
  dplyr::summarise(
    mean_cong = mean(reaction_time[metadata == "congruent_1" | metadata == "congruent_2"]),
    sd_cong = sd(reaction_time[metadata == "incongruent_1" | metadata == "incongruent_2"])
  ) %>%
  dplyr::mutate(
    new_var = mean_cong/sd_cong
  )
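A minor stylistic variant: the two conditions can be written with %in%, and since summarise can reference columns created earlier in the same call, the ratio fits in as well:
ds %>%
  group_by(subject) %>%
  summarise(
    mean_cong = mean(reaction_time[metadata %in% c("congruent_1", "congruent_2")]),
    sd_cong_incong = sd(reaction_time[metadata %in% c("incongruent_1", "incongruent_2")]),
    pooled = mean_cong / sd_cong_incong
  )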
I have the following dataset:
combined <- data.frame(
  client = c('aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'ccc', 'ccc', 'ddd', 'ddd', 'ddd'),
  type = c('norm', 'reg', 'opt', 'norm', 'norm', 'reg', 'opt', 'opt', 'opt', 'reg'),
  age = c('>50', '>50', '75+', '<25', '<25', '>50', '75+', '25-50', '25-50', '75+'),
  cases = c('1', '2', '2', '1', '0', '1', '2', '0', '3', '2'),
  IsActive = c('1', '0', '0', '1', '1', '0', '1', '1', '1', '0')
)
And I have identified the unique variable combinations with:
# get unique variable combinations
unique_vars <- combined %>%
  select(1:3, 5) %>%
  distinct()
I am trying to iterate on the query combined %>% anti_join(slice(unique_vars, 1)) using purrr, where the slice should step through each row of unique_vars rather than being fixed at 1. I want to save both the output of each query and a summary of cases from each output back to the unique_vars table.
I tried:
qry <- combined %>% anti_join(slice(unique_vars, 1))

map(.x = unique_vars %>%
      slice(.),
    ~ qry %>%
      summarise(CaseCnt = sum(cases)) %>%
      inner_join(.x))
My desired output would be two things:
The full output of each query
The new field CaseCnt added to the unique_vars dataframe
Is this possible?
Although I don't completely follow the intuition behind your query, it seems that for #1 you would want:
lapply(1:nrow(unique_vars), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x))
})
And for #2 you would want:
unique_vars$CaseCnt <- lapply(1:nrow(unique_vars), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x)) %>%
    summarise(CaseCnt = sum(as.numeric(cases)))
}) %>%
  do.call(what = rbind.data.frame, args = .)
Alternatively, for #2 with purrr::map_df():
unique_vars$CaseCnt <- map_df(1:nrow(unique_vars), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x)) %>%
    summarise(CaseCnt = sum(as.numeric(cases)))
})
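Note that both versions assign a one-column data.frame as the new column. If you would rather have a plain numeric vector, purrr::map_dbl is one option (a sketch; the by columns are spelled out only to silence the join message):
unique_vars$CaseCnt <- map_dbl(1:nrow(unique_vars), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x),
              by = c("client", "type", "age", "IsActive")) %>%
    summarise(CaseCnt = sum(as.numeric(cases))) %>%
    pull(CaseCnt)
})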
Just as an aside -- you could do this directly with:
combined %>%
  mutate(cases = as.numeric(cases)) %>%
  mutate(tot_cases = sum(cases)) %>% # total cases across all rows
  group_by(client, type, age, IsActive) %>%
  summarize(CaseCnt = mean(tot_cases) - sum(cases)) # total minus this group's cases
Or if what you were actually looking for is the sum of cases in that group:
combined %>%
  mutate(cases = as.numeric(cases)) %>%
  group_by(client, type, age, IsActive) %>%
  summarize(CaseCnt = sum(cases))
I have a dataset that looks like this -
dataset <- data.frame(Site = c(rep('A', 6), rep('B', 6)),
                      Date = rep(c('2019-05-31', '2019-04-30', '2019-03-31'), 4),
                      Question = c(rep('Q1', 3), rep('Q2', 3)),
                      Score = runif(12, 0.5, 1),
                      Average = runif(12, 0.5, 1))
My objective is to spread the Score and Average columns based on the Date column.
Using the tidyverse, I manipulate the data:
library(tidyverse)
dataset %>%
  nest(Score, Average, .key = 'value_col') %>%
  spread(key = Date, value = value_col) %>%
  unnest(.preserve = c("Site", "Question"), .sep = "_")
And this results in the final dataframe I am looking for:
Site Question 2019-03-31_Score 2019-03-31_Average 2019-04-30_Score 2019-04-30_Average 2019-05-31_Score 2019-05-31_Average
1 A Q1 0.5070755 0.6948877 0.8046608 0.8359777 0.7653232 0.5259696
2 A Q2 0.5255425 0.9482262 0.9796590 0.7612117 0.9819698 0.7710665
3 B Q1 0.6963277 0.5416473 0.7753426 0.6710344 0.8219699 0.5310356
4 B Q2 0.9993356 0.6293783 0.8125886 0.5007390 0.6385580 0.5238838
However when I add a new site to the original dataframe...
new_site <- data.frame(Site = rep('C', 4),
                       Date = c('2019-05-31', '2019-03-31', '2019-05-31', '2019-03-31'),
                       Question = c(rep('Q1', 2), rep('Q2', 2)),
                       Score = runif(4, 0.5, 1),
                       Average = runif(4, 0.5, 1))
new_dataset <- rbind(dataset, new_site)
and re-run the data manipulation on the new dataset, I get the following error...
library(tidyverse)
new_dataset %>%
  nest(Score, Average, .key = 'value_col') %>%
  spread(key = Date, value = value_col) %>%
  unnest(.preserve = c("Site", "Question"), .sep = "_")
Error: All nested columns must have the same number of elements.
I figured that this is because the new site has one day of no data.
I'd like to know whether there's an alternate approach to treating this new dataset, and reaching the same format of output.
Check the intermediate result:
new_dataset %>%
  nest(Score, Average, .key = 'value_col') %>%
  spread(key = Date, value = value_col)
For the new site you haven't provided any data for 2019-04-30, and therefore the unnesting fails.
Better to use something like:
new_dataset %>%
  gather(key, value, -Site, -Date, -Question) %>%
  mutate(key = str_c(Date, "_", key)) %>%
  select(-Date) %>%
  spread(key, value)
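As an aside, on tidyr 1.0.0 or later the same reshape can be written in one step with pivot_wider, which simply fills the missing Site C date with NA (a sketch of the modern equivalent; column order may differ):
library(tidyr)

new_dataset %>%
  pivot_wider(names_from = Date,
              values_from = c(Score, Average),
              names_glue = "{Date}_{.value}") # e.g. "2019-03-31_Score"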