I have a dataset that looks like this -
dataset = data.frame(Site=c(rep('A',6),rep('B',6)),
Date=c(rep(c('2019-05-31','2019-04-30','2019-03-31'),4)),
Question=c(rep('Q1',3),rep('Q2',3)),
Score=runif(12,0.5,1),
Average=runif(12,0.5,1))
My objective is to spread the Score and Average columns based on the Date column.
Using the tidyverse, I manipulate the data -
library(tidyverse)
dataset %>%
nest(Score, Average, .key = 'value_col') %>%
spread(key = Date, value = value_col) %>%
unnest(.preserve = c("Site", "Question"), .sep = "_")
And this results in the final dataframe I am looking for -
Site Question 2019-03-31_Score 2019-03-31_Average 2019-04-30_Score 2019-04-30_Average 2019-05-31_Score 2019-05-31_Average
1 A Q1 0.5070755 0.6948877 0.8046608 0.8359777 0.7653232 0.5259696
2 A Q2 0.5255425 0.9482262 0.9796590 0.7612117 0.9819698 0.7710665
3 B Q1 0.6963277 0.5416473 0.7753426 0.6710344 0.8219699 0.5310356
4 B Q2 0.9993356 0.6293783 0.8125886 0.5007390 0.6385580 0.5238838
However, when I add a new site to the original dataframe...
new_site= data.frame(Site=c(rep('C',4)),
Date=c('2019-05-31','2019-03-31','2019-05-31','2019-03-31'),
Question=c(rep('Q1',2),rep('Q2',2)),
Score=runif(4,0.5,1),
Average=runif(4,0.5,1))
new_dataset = rbind(dataset,new_site)
and re-run the data manipulation on the new dataset, I get the following error...
library(tidyverse)
new_dataset %>%
nest(Score, Average, .key = 'value_col') %>%
spread(key = Date, value = value_col) %>%
unnest(.preserve = c("Site", "Question"), .sep = "_")
Error: All nested columns must have the same number of elements.
I figured that this is because the new site is missing data for one of the dates.
I'd like to know whether there's an alternative approach to handling this new dataset that reaches the same output format.
Check the intermediate result:
new_dataset %>%
nest(Score, Average, .key = 'value_col') %>%
spread(key = Date, value = value_col)
For the new site you haven't provided any data on 2019-04-30, and therefore the unnesting fails.
It's better to use something like
new_dataset %>%
gather(key, value, -Site, -Date, -Question) %>%
mutate(key = str_c(Date, "_", key)) %>%
select(-Date) %>%
spread(key, value)
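If you are on a newer tidyr (1.0.0 or later), the same reshape can be sketched with pivot_longer()/pivot_wider(); the missing 2019-04-30 rows for site C then simply come out as NA columns instead of breaking the unnest:
library(tidyverse)

# A sketch with the newer tidyr verbs (assumes tidyr >= 1.0.0);
# missing Date/measure combinations just become NA
new_dataset %>%
  pivot_longer(c(Score, Average), names_to = "key", values_to = "value") %>%
  pivot_wider(names_from = c(Date, key), values_from = value, names_sep = "_")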
I'm trying to aggregate this df by taking the last value from each country's observations. For some reason, the last value that is added to the tibble is not correct.
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred))
aggre_data
I believe it has something to do with all of the NA values throughout the df. However, I did try:
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred(na.rm = TRUE)))
aggre_data
Update:
combined %>%
group_by(location) %>%
arrange(date, .by_group = TRUE) %>% # or whatever
summarise(Last_value_vacc = last(na.omit( people_vaccinated_per_hundred)))
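As a tiny illustration (made-up numbers) of why na.omit() matters here: a trailing NA makes a plain last() return NA, while last(na.omit(x)) returns the last non-missing value.
library(dplyr)

x <- c(10, 20, NA)
last(x)           # NA  -- the literal last element
last(na.omit(x))  # 20  -- the last non-missing value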
I have two pieces of code that manipulate and filter (by date) my data.frame, and they work perfectly. Now I want to run the code not just for one day, but for every day in the vector:
seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days') # .... 538 days
The code I want to run for all the days between 2020-03-02 and today is:
KOKOKO <- data.frame %>%
filter(DATE < '2020-03-02')%>%
summarize(DATE = '2020-03-02', CZK = sum(Objem.v.CZK, na.rm = TRUE))
STAVPTF <- data.frame %>%
filter (DATE < '2020-03-02')%>%
group_by(CP) %>%
summarize(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), DATE = '2020-03-02') %>%
select(DATE,CP,mnozstvi) %>%
rbind(KOKOKO)%>%
drop_na()
So instead of '2020-03-02' I want to fill in all the days since '2020-03-02', one after another. I want to save each day's KOKOKO and STAVPTF as a separate data.frame and store all of them in a list.
We could use map() to loop over the sequence of dates and apply the code:
library(dplyr)
library(purrr)
out <- map(s1, ~ data.frame %>%
                   filter(DATE < .x) %>%
                   summarize(DATE = .x, CZK = sum(Objem.v.CZK, na.rm = TRUE)))
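The result out is a list with one summary per day. Optionally (a small convenience, not required), you can name the elements by their date so a single day is easy to pull out later:
library(purrr)

# e.g. out[["2020-03-05"]] afterwards
out <- set_names(out, as.character(s1))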
As this is a repeated cycle, a function makes it cleaner:
f1 <- function(dat, date_col, group_col, Objem_col, aktualni_col, date_val) {
  filtered <- dat %>%
    filter({{date_col}} < date_val)

  KOKOKO <- filtered %>%
    summarize({{date_col}} := date_val,
              CZK = sum({{Objem_col}}, na.rm = TRUE))

  STAVPTF <- filtered %>%
    group_by({{group_col}}) %>%
    summarize(mnozstvi = last({{aktualni_col}}),
              {{date_col}} := date_val) %>%
    select({{date_col}}, {{group_col}}, mnozstvi) %>%
    bind_rows(KOKOKO) %>%
    drop_na()

  return(STAVPTF)
}
and call it as
map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK, AKTUALNI_MNOZSTVI_AKCIE, .x))
where
s1 <- seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days')
It would be easier to answer your question if you provided a minimal reproducible example; that's easily done with the tidyverse reprex package.
However, your KOKOKO code can be rewritten as a simple cumulative sum:
KOKOKO =
data.frame %>%
arrange(DATE) %>% # if necessary
group_by(DATE) %>%
summarise(CZK = sum(Objem.v.CZK), .groups = 'drop') %>% # summarise per DATE (if necessary)
mutate(CZK = cumsum(CZK) - CZK) # cumulative sum excluding current row (current DATE)
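A quick illustration with made-up numbers of the cumsum(CZK) - CZK trick: for per-date totals 1, 2, 3, 4 it returns 0, 1, 3, 6, i.e. the sum over all previous dates, which is exactly what filter(DATE < date) computed in the loop.
CZK <- c(1, 2, 3, 4)
cumsum(CZK) - CZK
# [1] 0 1 3 6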
Even the STAVPTF code can probably be rewritten without iteration. First find the last value of AKTUALNI_MNOZSTVI_AKCIE per CP and DATE; then this value is assigned to the next DATE:
STAVPTF <-
data.frame %>%
group_by(CP, DATE) %>%
summarise(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), .groups='drop_last') %>%
arrange(DATE) %>% # if necessary
mutate(DATE = lead(DATE))
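A minimal toy example (invented values) of what the lead() step does: each CP's last known quantity gets attached to the following DATE, and the most recent DATE ends up as NA because there is no next date yet.
library(dplyr)

tibble(CP = c("X", "X", "Y"),
       DATE = as.Date(c("2020-03-02", "2020-03-03", "2020-03-02")),
       mnozstvi = c(5, 7, 2)) %>%
  group_by(CP) %>%
  arrange(DATE, .by_group = TRUE) %>%
  mutate(DATE = lead(DATE))  # quantity now describes the state entering the next DATE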
I have a panel dataset where the time and group variables were already converted to dummies. I want to reverse the transformation though back to a simple id and time variable.
Let's create comparable data:
library(plm)
library(tidyverse)
library(fastDummies)
data(EmplUK)
EmplUK %>%
select(-sector) %>%
dummy_cols(.data = ., select_columns = c("firm", "year"),
           remove_selected_columns = TRUE, remove_first_dummy = TRUE) -> paneldata
head(paneldata)
So basically all my dummy variables are now firm_X and year_X, and I would like to have a Year and Firm variable again.
This is slightly complicated by the fact that Firm 1 and Year 1 do not exist as dummies (as they would not be needed in a regression model).
I'm fine with this precise data missing (I can simply infer that the first Firm would be Firm 1 and the year would be Year 1976, which is one less than the smallest one).
Any ideas how to do this nicely? Ideally using tidyverse?
After some thinking, I figured it out and created a small function:
getfactorback <- function(data,
                          groupdummyprefix,
                          timedummyprefix,
                          grouplabel,
                          timelabel,
                          firstgroup,
                          firsttime) {
  data %>%
    # rows where none of the dummies are 1 belong to the omitted first level
    mutate(newgroup = ifelse(rowSums(cur_data() %>% select(starts_with(groupdummyprefix))) == 1, 0, 1),
           newtime = ifelse(rowSums(cur_data() %>% select(starts_with(timedummyprefix))) == 1, 0, 1)) %>%
    rename(!!paste0(groupdummyprefix, firstgroup) := newgroup,
           !!paste0(timedummyprefix, firsttime) := newtime) %>%
    pivot_longer(cols = starts_with(groupdummyprefix), names_to = grouplabel, names_prefix = groupdummyprefix) %>%
    filter(value == 1) %>%
    select(-value) %>%
    pivot_longer(cols = starts_with(timedummyprefix), names_to = timelabel, names_prefix = timedummyprefix) %>%
    filter(value == 1) %>%
    select(-value) %>%
    mutate(across(.cols = c(all_of(grouplabel), all_of(timelabel)), factor)) %>%
    relocate(all_of(c(grouplabel, timelabel))) -> output
  return(output)
}
getfactorback(data = paneldata,
groupdummyprefix = "firm_",
grouplabel = "firm",
timedummyprefix = "year_",
timelabel = "year",
firstgroup = "1",
firsttime = 1976)
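An alternative sketch (not the function above), under the assumption that every row has at most one firm_* dummy equal to 1 (all-zero rows being the baseline firm "1") and likewise for year_* (baseline year 1976):
library(dplyr)

firm_mat <- as.matrix(select(paneldata, starts_with("firm_")))
year_mat <- as.matrix(select(paneldata, starts_with("year_")))

paneldata %>%
  select(-starts_with("firm_"), -starts_with("year_")) %>%
  mutate(
    # all-zero dummy rows are the omitted baseline level
    firm = if_else(rowSums(firm_mat) == 0, "1",
                   sub("^firm_", "", colnames(firm_mat)[max.col(firm_mat, ties.method = "first")])),
    year = if_else(rowSums(year_mat) == 0, "1976",
                   sub("^year_", "", colnames(year_mat)[max.col(year_mat, ties.method = "first")])),
    across(c(firm, year), factor)
  )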
I have two data frames that look like this:
Table1:
Gender<-c("M","F","M","M","F")
CPTCodes<-c("15777, 19328, 19342, 19366, 19370, 19371, 19380","15777, 19357","19367, 49568","15777, 19357","15777, 19357")
Df<-tibble(Gender,CPTCodes)
Table2:
Code<-c(19328,19342,15777,49568,12345)
Value<-c(0.5,7,9,35,2)
Df2<-tibble(Code,Value)
I had previously asked this question about how to summarize the "values" from Table 2 into a column in Table 1, depending on which codes were listed in the CPTCodes column of Table 1. It turned out to be a duplicate of another question, but either way, the solutions there worked great and did exactly what I asked.
The problem was that I didn't realize that, buried deep in the thousands of rows of Table 2, there were some duplicate codes, i.e. Table 2 really looks like this:
Code<-c(19357,19342,15777,49568,12345,15777,19357)
Modifier<-c("","","","","","a","a")
Value<-c(0.5,7,9,35,2,3,45)
Df2<-tibble(Code,Modifier,Value)
So when I use the suggested code:
Df %>%
mutate(id = row_number()) %>%
separate_rows(CPTCodes, sep = ", ", convert = TRUE) %>%
left_join(Df2, by = c("CPTCodes" = "Code")) %>%
group_by(id, Gender) %>%
summarize(total = sum(Value, na.rm = TRUE))
It summarizes ALL of the codes it finds that match in Table 2, and I really just want the rows that don't have anything in the Modifier column. Any ideas?
Lastly, the current code returns the summarized total in its own data frame, but it'd be great if everything from the original Table 1 were still there, with just an extra column containing the new sum.
I'm not entirely sure of your expected output. But you should be able to filter and then join the new column to the original df.
Df <- Df %>% mutate(id = row_number()) %>%
separate_rows(CPTCodes, sep = ", ", convert = TRUE) %>%
left_join(Df2, by = c("CPTCodes" = "Code")) %>%
group_by(id, Gender) %>%
filter(Modifier == "") %>%
summarize(total = sum(Value, na.rm = TRUE)) %>%
right_join(Df, by = "Gender")
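If the goal is to keep every column from the original Table 1 and just append the sum, one possible sketch is to drop the modifier rows of Df2 before joining, and then attach the per-row totals back to the untouched Df by a row id:
library(dplyr)
library(tidyr)

totals <- Df %>%
  mutate(id = row_number()) %>%
  separate_rows(CPTCodes, sep = ", ", convert = TRUE) %>%
  left_join(filter(Df2, Modifier == ""), by = c("CPTCodes" = "Code")) %>%
  group_by(id) %>%
  summarize(total = sum(Value, na.rm = TRUE))

Df %>%
  mutate(id = row_number()) %>%
  left_join(totals, by = "id") %>%
  select(-id)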
I have a huge data set with readings every 30 seconds. First I take the mean to get hourly data, then sum it for daily data, and sum that again for monthly data. I need to assign the result of the mutate pipeline to a new dataset/variable called mE_131 for plotting the monthly values. I'm new to this, please help!
library(dplyr)
library(ggplot2)
attach(data)
data %>% # filtering 131 and 132
select(time, Column3, m_Pm) %>%
filter(Column3 == "131" | Column3 == "132")
data_131<-filter(data,Column3=="131")
data_132<-filter(data,Column3=="132")
data_131%>%
mutate(datehour= format(time,"%Y-%m-%d %H"), date1= format(time,"%Y-%m-%d"), month=format(time,"%Y-%m")) %>%
group_by(datehour) %>% mutate(hourlyP=mean(m_Pm)) %>% distinct(datehour, .keep_all = TRUE) %>%
group_by(date1) %>% mutate(dailyP=sum(hourlyP)) %>% distinct(date1, .keep_all = TRUE) %>%
group_by(month) %>% summarise(monthlyP=sum(dailyP))
If your goal is to compare monthly data between Column3 == "131" and Column3 == "132", then you don't necessarily need to create a separate dataset for each of them, although I will show how to do that at the end.
First, let's create the required summary for both 131 and 132:
data <- data %>%
  filter(Column3 == "131" | Column3 == "132") %>% # keep only the required data
  mutate(datehour = format(time, "%Y-%m-%d %H"),  # helper columns for each aggregation level
         date1 = format(time, "%Y-%m-%d"),
         month = format(time, "%Y-%m")) %>%
  group_by(Column3, datehour) %>%
  mutate(hourlyP = mean(m_Pm)) %>%
  distinct(datehour, .keep_all = TRUE) %>%
  group_by(Column3, date1) %>%
  mutate(dailyP = sum(hourlyP)) %>%
  distinct(date1, .keep_all = TRUE) %>%
  group_by(Column3, month) %>%
  summarise(monthlyP = sum(dailyP))
Note: I have written each step on a separate line for readability and added Column3 to the group_by() calls so that the 131 and 132 series are summarised separately; otherwise it is basically the same as your code shown above.
Now, let's do the plotting:
data %>%
  ggplot(aes(x = month, y = monthlyP, fill = Column3)) +
  geom_col(position = "dodge") # this will produce a plot similar to your example
If you insist on having a separate dataset for each value in Column3, then you can simply use the assignment operator <- to create a new dataframe, as follows:
mE_131 <- data_131 %>%
mutate(datehour= format(time,"%Y-%m-%d %H"),
date1= format(time,"%Y-%m-%d"),
month=format(time,"%Y-%m")) %>%
group_by(datehour) %>%
mutate(hourlyP=mean(m_Pm)) %>%
distinct(datehour, .keep_all = TRUE) %>%
group_by(date1) %>%
mutate(dailyP=sum(hourlyP)) %>%
distinct(date1, .keep_all = TRUE) %>%
group_by(month) %>%
summarise(monthlyP=sum(dailyP))
Then do the same thing to create mE_132. However, I don't recommend this, because it makes the two series harder to plot together.
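If you do go that route, one way (a sketch, assuming mE_131 and mE_132 were both built as above) to still compare them in one plot is to bind them together with an id column first:
library(dplyr)
library(ggplot2)

# tag each monthly summary with its meter id, then plot side by side
bind_rows(`131` = mE_131, `132` = mE_132, .id = "Column3") %>%
  ggplot(aes(x = month, y = monthlyP, fill = Column3)) +
  geom_col(position = "dodge")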