How to mutate panel data with dplyr in R?

I have panel data (person-year combinations) for which I need to investigate the impact that your partner's characteristics (several "x") have on your outcome variable (y). Everything is given in one tibble/dataframe. Partner information is given by "pid".
paneldata = data.frame(
  id = c(1,1,1,2,2,2,3,3,3,4,4,4),
  time = rep(1:3, times = 4),
  pid = c(3,3,NA,4,4,3,1,1,2,2,2,NA),
  y = c(9,10,11,12,13,14,15,16,17,18,19,20),
  x = c(21,22,23,24,25,26,27,28,29,30,31,32),
  x_partner = c(27,28,NA,30,31,29,21,22,26,24,25,NA)
)
library(dplyr)
paneldata %>%
  group_by(id, time) %>%
  mutate(x_pid = x[pid])
I want to achieve x_partner, but what I have so far is x_pid. I'm trying to catch the index while running through group_by "id" and "time": get the "pid" (not unique!) and look up x at the combination pid-time.

You shouldn't group by id, only by time. Within each time period, look up the row whose id equals the current row's pid:
paneldata %>%
  group_by(time) %>%
  mutate(x_partner = x[match(pid, id)])
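The same lookup can also be written as a self-join, which does not depend on row positions within each time group. This is only a sketch of an alternative technique, not the answer's method; the helper table name `partners` is made up for illustration, and the example data omits the precomputed x_partner column so that the join can create it.

```r
library(dplyr)

# Example data from the question, without the precomputed x_partner column
paneldata <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  time = rep(1:3, times = 4),
  pid  = c(3, 3, NA, 4, 4, 3, 1, 1, 2, 2, 2, NA),
  x    = c(21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)
)

# Hypothetical helper table: each person's x, keyed so that it can be
# matched against the pid column of the main table
partners <- paneldata %>%
  select(pid = id, time, x_partner = x)

# For every row, fetch the x of the person named in pid at the same time;
# rows with a missing pid get NA
result <- paneldata %>%
  left_join(partners, by = c("pid", "time"))

result
```

A join makes the key explicit (pid and time), at the cost of building a second table.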

Related

Tidymodels infer chisq_test count column

I am using the infer library to run a chisq_test in a group_by with subgroup ~ answer.
I have, among others, a column with subgroup, one with answers and one with count.
Is it possible to specify the count column when running
dat <- dat %>%
group_by(Question, Group) %>%
mutate(p_value = chisq_test(cur_data(), Subgroup ~ Answer)$p_value) %>%
ungroup()
Or do I need to use uncount(Count) first?
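As a rough illustration of what uncount(Count) does, the sketch below compares a contingency table built directly from a count column with one built from the expanded rows. It uses base R's chisq.test() rather than infer's chisq_test(), and the data values are invented for the example.

```r
library(tidyr)

# Invented example data: one row per Subgroup/Answer cell with its count
dat <- data.frame(
  Subgroup = c("a", "a", "b", "b"),
  Answer   = c("yes", "no", "yes", "no"),
  Count    = c(30, 10, 20, 25)
)

# Option 1: build the contingency table straight from the count column
from_counts <- xtabs(Count ~ Subgroup + Answer, data = dat)

# Option 2: expand to one row per observation, then tabulate
expanded  <- uncount(dat, Count)
from_rows <- table(expanded$Subgroup, expanded$Answer)

# Both tables hold the same cell counts, so the test result is the same
p_counts <- chisq.test(from_counts)$p.value
p_rows   <- chisq.test(from_rows)$p.value
```

Whether infer's chisq_test() accepts a count column directly is not settled here; the sketch only shows that expanding with uncount() reproduces the weighted table.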

Group by, summarise and return the value back to the dataset in R?

I am trying to create summary statistics without losing column values. For example using the iris dataset, I want to group_by the species and find the summary statistics, such as the sd and mean.
Once I have done this, I want to add the result back to the original dataset. How can I do this? I can only do the first step.
library("tidyverse")
data <- (iris)
data<-data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
This gives one summary row per species.
I then want to add the mean and sd back to the original iris data so that I can compute the z-score of each individual row within its species.
In other words: group by species, then find the z-score of each individual plant relative to its own species.
Though there already is an accepted answer, here is a way of computing the Z scores for all numeric variables.
library(dplyr)
iris %>%
  group_by(Species) %>%
  # scale() returns a one-column matrix, so convert it back to a plain vector;
  # .names keeps the original columns and adds a *_Z column for each of them
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x)), .names = "{.col}_Z")) %>%
  ungroup() %>%
  relocate(Species, .after = last_col())
You can use something like
library("tidyverse")
data <- (iris)
df <- data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
data %>% left_join(df, by = "Species") %>%
mutate(Z = (Sepal.Length-mean.iris)/sd.iris)
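The summarise-then-join steps above can also be collapsed into a single grouped mutate(), which keeps every original row. This is a minimal sketch of that variant; the result name `iris_z` is made up for illustration.

```r
library(dplyr)

iris_z <- iris %>%
  group_by(Species) %>%
  # Inside a grouped mutate(), mean() and sd() are computed per species
  # and recycled to every row of that species
  mutate(
    mean.iris = mean(Sepal.Length),
    sd.iris   = sd(Sepal.Length),
    Z         = (Sepal.Length - mean.iris) / sd.iris
  ) %>%
  ungroup()

head(iris_z)
```

Because mutate() never collapses rows, no join back to the original data is needed.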

Getting sum of top n variables in a data frame

totdal_deaths_confirmed_cases_60_days_only = swine_flu_cases %>%
filter(Confirmed >0) %>%
group_by(Country) %>%
summarise(top_n((total_confirmed_cases = sum(Confirmed), total_deaths = sum(Deaths),60))
So I have a dataframe called swine_flu_cases, and have variables such as:
Country Date Confirmed Recovered Death
What I am trying to do is sum the Confirmed and Deaths variables per country, but only for the first 60 rows/entries of each country. I tried using the top_n function, but I am not sure how to apply it to my dataframe before I do the summary. I also tried the slice_max function, but it does not seem to be available on my PC even though I loaded the dplyr package, so I can't figure that out either.
Any suggestions on how I could accomplish this would be appreciated.
step_1 = swine_flu_cases %>%
  filter(Confirmed > 0)
# a negative n makes top_n() keep the rows with the smallest Date values
step_2 = step_1 %>%
  group_by(Country) %>%
  top_n(-60, Date)
Then for the final phase
totdal_deaths_confirmed_cases_60_days_only = step_2 %>%
  summarise(total_confirmed_cases = sum(Confirmed), total_deaths = sum(Deaths))
I don't know if anyone has a much shorter way of doing this, but I did it in steps so that it was easier for me to understand
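A shorter version is possible with slice_min(), assuming dplyr 1.0.0 or later is installed (the slice_min() and slice_max() functions are not available in older versions). The data below is invented and uses the first 2 dates per country instead of 60 so that the result is easy to check by hand.

```r
library(dplyr)

# Invented example data with the column names used in the question's code
swine_flu_cases <- data.frame(
  Country   = c("A", "A", "A", "B", "B", "B"),
  Date      = as.Date(c("2009-05-01", "2009-05-02", "2009-05-03",
                        "2009-05-01", "2009-05-02", "2009-05-03")),
  Confirmed = c(1, 2, 3, 0, 5, 6),
  Deaths    = c(0, 1, 1, 0, 0, 2)
)

totals <- swine_flu_cases %>%
  filter(Confirmed > 0) %>%
  group_by(Country) %>%
  # keep the earliest n dates per country; use n = 60 for the real data
  slice_min(Date, n = 2, with_ties = FALSE) %>%
  summarise(
    total_confirmed_cases = sum(Confirmed),
    total_deaths          = sum(Deaths)
  )

totals
```

Setting with_ties = FALSE guarantees at most n rows per country even when several rows share a date.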

Arrange a grouped_df by group variable not working

I have a data.frame that contains client names, years, and several revenue numbers from each year.
df <- data.frame(client = rep(c("Client A","Client B", "Client C"),3),
year = rep(c(2014,2013,2012), each=3),
rev = rep(c(10,20,30),3)
)
I want to end up with a data.frame that aggregates the revenue by client and year. I then want to sort the data.frame by year then by descending revenue.
library(dplyr)
df1 <- df %>%
group_by(client, year) %>%
summarise(tot = sum(rev)) %>%
arrange(year, desc(tot))
However, when using the code above the arrange() function doesn't change the order of the grouped data.frame at all. When I run the below code and coerce to a normal data.frame it works.
library(dplyr)
df1 <- df %>%
group_by(client, year) %>%
summarise(tot = sum(rev)) %>%
data.frame() %>%
arrange(year, desc(tot))
Am I missing something or will I need to do this every time when trying to arrange a grouped_df by a grouped variable?
R Version: 3.1.1
dplyr package version: 0.3.0.2
EDIT 11/13/2017:
As noted by lucacerone, beginning with dplyr 0.5, arrange once again ignores groups when sorting. So my original code now works in the way I initially expected it would.
arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as more changes will just cause more confusion.
Try switching the order of your group_by statement:
df %>%
group_by(year, client) %>%
summarise(tot = sum(rev)) %>%
arrange(year, desc(tot))
I think arrange is ordering within groups; after summarize, the last group is dropped, so this means in your first example it's arranging rows within the client group. Switching the order to group_by(year, client) seems to fix it because the client group gets dropped after summarize.
Alternatively, there is the ungroup() function
df %>%
group_by(client, year) %>%
summarise(tot = sum(rev)) %>%
ungroup() %>%
arrange(year, desc(tot))
Edit, @lucacerone: since dplyr 0.5 this does not work anymore:
Breaking changes arrange() once again ignores grouping, reverting back
to the behaviour of dplyr 0.3 and earlier. This makes arrange()
inconsistent with other dplyr verbs, but I think this behaviour is
generally more useful. Regardless, it’s not going to change again, as
more changes will just cause more confusion.
Recent versions of dplyr (at least from dplyr 0.7.4) allow arranging within groups. You just have to set .by_group = TRUE in the arrange() call. More information is available in the arrange() documentation.
In your example, try:
library(dplyr)
df %>%
group_by(client, year) %>%
summarise(tot = sum(rev)) %>%
arrange(desc(tot), .by_group = TRUE)
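The two behaviours discussed in this thread can be compared side by side. This is a sketch assuming dplyr 1.0.0 or later (for the .groups argument of summarise()); the names `totals`, `overall` and `by_client` are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  client = rep(c("Client A", "Client B", "Client C"), 3),
  year   = rep(c(2014, 2013, 2012), each = 3),
  rev    = rep(c(10, 20, 30), 3)
)

# After summarise(), the result is still grouped by client
totals <- df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev), .groups = "drop_last")

# Default: arrange() ignores the grouping and sorts the whole table
overall <- totals %>%
  arrange(year, desc(tot))

# .by_group = TRUE: sort by the grouping variable (client) first,
# then by the given columns within each client
by_client <- totals %>%
  arrange(desc(tot), .by_group = TRUE)
```

Here `overall` starts with the 2012 rows ordered by descending revenue, while `by_client` lists all Client A rows first.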

What does n=n( ) mean in R?

The other day I was reading the following lines in R and I don't understand what the %>% and summarise(n=n()) and summarise(total=n()) meant. I understand the group_by and ungroup methods though.
Can someone help out? There isn't any documentation for this either.
library(dplyr)
net.multiplicity <- group_by(net, nodeid, epoch) %>% summarise(n=n()) %>%
ungroup() %>% group_by(n) %>% summarise(total=n())
This is from the dplyr package. n() returns the number of rows (think: number of observations) in the current group, so n=n() creates a summary column named n that holds each group's row count.
The %>% is read as "and then"; it is a way of listing your functions sequentially rather than nesting them. So that command says: group the data, and then summarise each group by its number of rows, and then ungroup that result, and then group the ungrouped data by n, and then summarise that by the number of rows in each of the new groups.
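To make the reading of %>% concrete, the sketch below runs the same steps once as a pipe and once as nested calls on a small invented `net` data frame. The .groups = "drop" argument (dplyr 1.0.0 or later) stands in for the explicit ungroup() of the original line.

```r
library(dplyr)

# Invented example data: six observations of (nodeid, epoch) pairs
net <- data.frame(
  nodeid = c(1, 1, 1, 2, 2, 3),
  epoch  = c(1, 1, 2, 1, 1, 1)
)

# Piped form: count rows per (nodeid, epoch), then count how many
# groups share each row count
piped <- net %>%
  group_by(nodeid, epoch) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(n) %>%
  summarise(total = n())

# The same computation written as nested calls, read inside-out
nested <- summarise(
  group_by(
    summarise(group_by(net, nodeid, epoch), n = n(), .groups = "drop"),
    n
  ),
  total = n()
)

piped
```

In this data two (nodeid, epoch) pairs occur once and two occur twice, so both forms return the rows n = 1, total = 2 and n = 2, total = 2.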
