R: rbind two dataframes ordered by ID

I'm having a little trouble creating a new dataset by row-binding two "subsets" of an original dataset. I have implemented the following code, and it works fine. However, it adds the rows of the second "subset" below those of the first "subset" and doesn't take the IDs into account.
rbind(df %>%
group_by(ID) %>%
filter(Var1 >=
((max(Var1)/100)*95)),
V_Dem_tracker_autocracies %>%
group_by(ID) %>%
filter(first_equal_to(Var2, 1)))
So what I get is data structured like this:
ID Var2
1 0
1 0
1 0
2 0
2 0
2 0
1 1
2 1
However, I would want it like this:
ID Var2
1 0
1 0
1 0
1 1
2 0
2 0
2 0
2 1
Is there an easy way to get there? I appreciate all answers!

You can pipe the result of rbind() into arrange() to sort the combined data frame by ID:
rbind(df %>%
group_by(ID) %>%
filter(Var1 >= ((max(Var1)/100)*95)),
V_Dem_tracker_autocracies %>%
group_by(ID) %>%
filter(first_equal_to(Var2, 1))) %>%
arrange(ID)
If you need the opposite order, use arrange(desc(ID)) instead.
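The same idea can be sketched without dplyr. This is a minimal base R illustration, assuming two small made-up data frames (`first` and `second` are hypothetical stand-ins for the two filtered subsets). `order()` is a stable sort, so rows with the same ID keep the order they had after `rbind()`.

```r
# Hypothetical stand-ins for the two filtered subsets
first  <- data.frame(ID = c(1, 1, 1, 2, 2, 2), Var2 = 0)
second <- data.frame(ID = c(1, 2), Var2 = 1)

# Bind the rows, then sort the combined data frame by ID
combined <- rbind(first, second)
sorted <- combined[order(combined$ID), ]
rownames(sorted) <- NULL  # reset row names after reordering

print(sorted)
```

Within each ID, the rows from `first` still come before the row from `second`, which matches the desired layout in the question.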

Related

How to count subsets of elements in a data frame using base R or dplyr?

I would like to add a column, "grpRnk", to the data frame nCode below. It should rank each group (a group being any set of rows with Group <> 0) against the other groups in the data frame: the group with the lowest associated nmCnt gets rank 1, and the rank increases as nmCnt increases for the other groups. The desired values are shown in the manually added column ("grpRnk ADD") at the far right of the output below:
> print.data.frame(nCode)
  Name Group nmCnt seqBase subGrp   grpRnk ADD
1    B     0     1       1      0   0  since Group = 0
2    R     0     1       1      0   0  since Group = 0
3    R     1     2       2      1   2  since it is 2nd place among the Groups, with its nmCnt > the nmCnt for the highest ranking Group in row 6
4    R     1     3       2      2   2  same reason as above
5    B     0     2       2      0   0  since Group = 0
6    X     2     1       1      1   1  since it is 1st place among the Groups, with its nmCnt of 1 the lowest among all the groups
7    X     2     2       1      2   1  same reason as above
Any recommendations for how to do this in base R or dplyr?
Below is the code that generates the above (except for the column manually added on the right):
library(dplyr)
library(stringr)
library(tidyr)  # provides fill(), used below
myDF5 <-
data.frame(
Name = c("B","R","R","R","B","X","X"),
Group = c(0,0,1,1,0,2,2)
)
nCode <- myDF5 %>%
group_by(Name) %>%
mutate(nmCnt = row_number()) %>%
ungroup() %>%
mutate(seqBase = ifelse(Group == 0 | Group != lag(Group), nmCnt,0)) %>%
mutate(seqBase = na_if(seqBase, 0)) %>%
group_by(Name) %>%
fill(seqBase) %>%
mutate(seqBase = match(seqBase, unique(seqBase))) %>%
ungroup %>%
mutate(subGrp = as.integer(ifelse(Group > 0, sapply(1:n(), function(x) sum(Name[1:x]==Name[x] & Group[1:x] == Group[x])),0)))
print.data.frame(nCode)
Here's a dplyr solution. Instead of filling non-groups with 0 as in my original post, this code gives non-groups NA, which works better for what I intend to use it for. The dplyr slice() function used here was new to me and is very useful; I found out about it in the post "dplyr filter: Get rows with minimum of variable, but only the first if multiple minima".
grpRnk <- nCode %>% select(Name,Group,nmCnt) %>%
filter(Group > 0) %>%
group_by(Name) %>%
slice(which.min(Group)) %>%
arrange(nmCnt) %>%
select(-nmCnt)
grpRnk$grpRnk <- as.integer(row.names(grpRnk))
left_join(nCode,grpRnk)
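For comparison, here is a minimal base R sketch of the same ranking idea. It assumes a data frame holding only the Name, Group and nmCnt columns from the output in the question, and it ranks each non-zero Group by its smallest nmCnt. The helper names `grp_min` and `grp_rank` are made up for illustration. Like the dplyr answer, it leaves NA for rows where Group is 0.

```r
# Columns copied from the nCode output shown in the question
nCode <- data.frame(
  Name  = c("B", "R", "R", "R", "B", "X", "X"),
  Group = c(0, 0, 1, 1, 0, 2, 2),
  nmCnt = c(1, 1, 2, 3, 2, 1, 2)
)

in_group <- nCode$Group > 0

# Smallest nmCnt per non-zero group (names are the group labels)
grp_min <- tapply(nCode$nmCnt[in_group], nCode$Group[in_group], min)

# Rank the groups: lowest minimum nmCnt gets rank 1
grp_rank <- rank(grp_min, ties.method = "first")

# Look up each row's group rank; Group 0 has no entry, so it becomes NA
nCode$grpRnk <- as.integer(unname(grp_rank[as.character(nCode$Group)]))

print(nCode)
```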

How to calculate transition probabilities in R

I would like to calculate how often changes between values happen by person-year combination (panel data). This mimics Stata's command xttrans. The transition between rows 7 and 8 (the last observation of person 1 and the first observation of person 2) should not be included, since it is not a transition within one person.
df = data.frame(id=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2),
year=seq(from=2003, to=2009, by=1),
health=c(3,1,2,2,5,1,1,1,2,3,2,1,1,2))
Here is a base R solution to calculate transition counts by id groups:
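# Build one from/to table per id, then add the tables together.
# Reduce() works for any number of ids; do.call(`+`, ...) only works for exactly two.
with(df, Reduce(`+`, tapply(health, id, function(x){
x <- factor(x, levels = min(health, na.rm = TRUE):max(health, na.rm = TRUE))
table(x[-length(x)], x[-1])
})))
#     1 2 3 4 5
#   1 2 3 0 0 0
#   2 1 1 1 0 1
#   3 1 1 0 0 0
#   4 0 0 0 0 0
#   5 1 0 0 0 0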
A tidyverse alternative:
library(tidyverse)
# Calculate the last health status for each id
df <- df %>%
group_by(id) %>%
mutate(lastHealth=lag(health)) %>%
ungroup()
# Count number of existing transitions
transitions <- df %>%
group_by(health, lastHealth) %>%
summarise(N=n()) %>%
ungroup()
# Fill in the transition grid to include possible transitions that weren't observed
transitions <- transitions %>%
complete(health=1:5, lastHealth=1:5, fill=list(N=0))
# Present the transitions in the required format
transitions %>%
pivot_wider(names_from="health", values_from="N", names_prefix="health") %>%
filter(!is.na(lastHealth))
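Both answers above return transition counts, while the question asks for probabilities. As a minimal sketch, assuming row-wise probabilities are wanted (each row conditioned on the starting state), the count table can be normalised with `prop.table()`. The variable names `counts` and `probs` are made up for illustration.

```r
df <- data.frame(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
                 year = seq(from = 2003, to = 2009, by = 1),
                 health = c(3, 1, 2, 2, 5, 1, 1, 1, 2, 3, 2, 1, 1, 2))

# One from/to table per id, summed over ids (rows = from, columns = to)
counts <- with(df, Reduce(`+`, tapply(health, id, function(x) {
  x <- factor(x, levels = min(health, na.rm = TRUE):max(health, na.rm = TRUE))
  table(x[-length(x)], x[-1])
})))

# Divide each row by its total to get transition probabilities
probs <- prop.table(counts, margin = 1)

print(round(probs, 2))
```

States that are never observed as a starting point (state 4 here) have a row total of zero, so their row is NaN rather than a probability.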

How to filter for a combination of list arguments and multiple character strings in dplyr

Given a dataframe:
v1_attr1 <- c(1,0,0,0,1,0,0,0,1,1) %>% as.integer ()
v1_attr2 <- c(0,1,0,0,1,1,1,1,1,1) %>% as.integer ()
v2_attr1 <- c(0,0,1,0,0,1,1,1,0,0) %>% as.integer ()
v2_attr2 <- c(0,0,0,1,0,1,1,1,0,0) %>% as.integer ()
df <- data.frame (v1_attr1, v1_attr2, v2_attr1, v2_attr2)
How can I set a filter for the attr of each v[[x]]?
I tried the following code to get the number of rows in each data.frame filtered by attr.
library(dplyr)
# create list for vs
list_vs <- list ("v1", "v2")
# set multiple attr filter for each v[[x]] to get the respective number of rows in each filtered data.frame (presented in a list)
filtered <- lapply (list_vs, function (x){
df %>% filter (noquote(paste0(list_vs[[x]], "_attr1")) == 1 | noquote(paste0(list_vs[[x]], "_attr2")) == 1) %>%
nrow ()
})
Although this code doesn't return an error, the result for filtered[[x]] is always 0. How do I need to set the filter arguments correctly to get the desired number of rows in each data.frame? I used noquote because otherwise filtering arguments would be pasted in quotes.
One dplyr and purrr option could be:
library(dplyr)
library(purrr)  # provides map()
map(.x = list_vs,
~ df %>%
filter_at(vars(starts_with(.x)), any_vars(. == 1)))
[[1]]
  v1_attr1 v1_attr2 v2_attr1 v2_attr2
1        1        0        0        0
2        0        1        0        0
3        1        1        0        0
4        0        1        1        1
5        0        1        1        1
6        0        1        1        1
7        1        1        0        0
8        1        1        0        0
[[2]]
  v1_attr1 v1_attr2 v2_attr1 v2_attr2
1        0        0        1        0
2        0        0        0        1
3        0        1        1        1
4        0        1        1        1
5        0        1        1        1
Another option is to convert to 'long' format with pivot_longer(), picking up the group and attribute names automatically from the column names, and then apply group_by() and filter_at():
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything(), names_sep = "_",
names_to = c('group', '.value' )) %>%
group_by(group) %>%
filter_at(vars(-group_cols()), any_vars(. == 1))
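The question asks for the number of matching rows per prefix rather than the filtered data frames themselves. A minimal base R sketch of that count, assuming every column of interest is named `<prefix>_<attribute>` as in the example data (the name `counts` is made up for illustration):

```r
df <- data.frame(
  v1_attr1 = as.integer(c(1, 0, 0, 0, 1, 0, 0, 0, 1, 1)),
  v1_attr2 = as.integer(c(0, 1, 0, 0, 1, 1, 1, 1, 1, 1)),
  v2_attr1 = as.integer(c(0, 0, 1, 0, 0, 1, 1, 1, 0, 0)),
  v2_attr2 = as.integer(c(0, 0, 0, 1, 0, 1, 1, 1, 0, 0))
)

# For each prefix, count rows where at least one matching column equals 1
counts <- sapply(c(v1 = "v1", v2 = "v2"), function(prefix) {
  cols <- grep(paste0("^", prefix, "_"), names(df), value = TRUE)
  sum(rowSums(df[cols] == 1) > 0)
})

print(counts)
```

The counts agree with the number of rows in the two data frames printed in the first answer.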

Flag column creation in dplyr

This is a really frustrating silly example. Let's say I have the following data below with an ID column...
dat <- data.frame(ID = rep(c("a","b","c"), c(1,3,2)))
dat
ID
1 a
2 b
3 b
4 b
5 c
6 c
I am trying to create a new column using the dplyr package which creates a 0 for the first row by ID and then any subsequent rows will have a 1. So the resulting data should look like this:
ID Flag
1 a 0
2 b 0
3 b 1
4 b 1
5 c 0
6 c 1
I have tried the following code but just get a column of zeros:
dat %>%
group_by(ID) %>%
mutate(
Readmission = ifelse(n() == 1,0,c(0,rep(1,n()-1)))
) %>% data.frame()
Any help appreciated! Surely this is a quick fix and I didn't sleep enough last night. This is actually a pretty simple task using lapply... but it takes too bloody long to run and I'm impatient.
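n() is the number of rows in the group, but you need the row number of each row within its group. In addition, ifelse() returns a result of the same length as its test, and n() == 1 is a single value, so only the first element (always 0) is returned and then recycled over the whole group. Here is the solution: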
dat %>%
group_by(ID) %>%
mutate(
Readmission = ifelse(row_number()==1, 0, 1)
) %>%
data.frame()
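The same flag can also be sketched in base R, since `duplicated()` is FALSE for the first occurrence of each ID and TRUE for every later one. This is an alternative to the dplyr answer above, not a replacement for it, and it uses the column name Flag from the desired output in the question.

```r
dat <- data.frame(ID = rep(c("a", "b", "c"), c(1, 3, 2)))

# First row of each ID gets 0, every subsequent row gets 1
dat$Flag <- as.integer(duplicated(dat$ID))

print(dat)
```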

R - create dynamic indicator columns from values in character columns

I have data that looks like this:
library(dplyr)
d<-data.frame(ID=c(1,1,2,3,3,4), Quality=c("Good", "Bad", "Ugly", "Good", "Good", "Ugly"), Area=c("East", "North", "North", "South", "East", "North"))
What I'd like to do is create one new column for each unique value in Quality and populate it with whether the ID matches that value and then aggregate the ID's. I want to do the same for Area.
This is what I have for when Quality == Good:
d$Quality.Good <- 0
d$Quality.Good[d$Quality=="Good"] <- 1
e <- d %>%
group_by(ID) %>%
summarise(n=n(), MAX.Quality.Good = max(Quality.Good))
e
Output
A tibble: 4 x 3
ID MAX.Quality.Good
<dbl> <dbl>
1 1 1
2 2 0
3 3 1
4 4 0
Is it possible to build a function that will loop through each character column and build an indicator column for Good, Bad, Ugly, North, East, South instead of copy pasting the above many more times?
Here's where I'm stuck:
library(stringr)
#vector of each Quality
e <-d %>%
group_by(Quality) %>%
summarise(n=n()) %>%
select(Quality)
e<-as.data.frame(e)
#create new column names
f <- str_c(names(e),".",e[,1])
#initialize list of new columns
d[f] <- 0
#I'm stuck after this...
Thank you!
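We can do this in base R using table(), by replicating the 'ID' column once for each of the other columns in the dataset and pasting the column names together with the unlisted values (excluding the 'ID' column). Note that this returns counts rather than 0/1 indicators (ID 3 has QualityGood = 2 in the output below).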
table(rep(d$ID, 2), paste0(names(d)[-1][col(d[-1])], unlist(d[-1])))
# AreaEast AreaNorth AreaSouth QualityBad QualityGood QualityUgly
# 1 1 1 0 1 1 0
# 2 0 1 0 0 0 1
# 3 1 0 1 0 2 0
# 4 0 1 0 0 0 1
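Or, with tidyverse, gather() into 'long' format, unite() the 'key' and 'val' columns into a single column, keep the distinct rows, and spread() into 'wide' format after creating a column of 1s. (gather() and spread() still work, but tidyr now recommends pivot_longer() and pivot_wider() for new code.)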
library(tidyverse)
gather(d, key, val, -ID) %>%
unite(kv, key, val) %>%
distinct %>%
mutate(n = 1) %>%
spread(kv, n, fill = 0)
#ID Area_East Area_North Area_South Quality_Bad Quality_Good Quality_Ugly
#1 1 1 1 0 1 1 0
#2 2 0 1 0 0 0 1
#3 3 1 0 1 0 1 0
#4 4 0 1 0 0 0 1
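The base R table() call above returns counts, so an ID with two "Good" rows gets a 2. If 0/1 indicators are wanted, as in the tidyverse output, one possible sketch is to test the counts against zero. Here `ncol(d) - 1` replaces the hard-coded 2, and `counts` and `ind` are names made up for illustration.

```r
d <- data.frame(ID = c(1, 1, 2, 3, 3, 4),
                Quality = c("Good", "Bad", "Ugly", "Good", "Good", "Ugly"),
                Area = c("East", "North", "North", "South", "East", "North"),
                stringsAsFactors = FALSE)

# Counts of each column/value combination per ID
counts <- table(rep(d$ID, ncol(d) - 1),
                paste0(names(d)[-1][col(d[-1])], unlist(d[-1])))

# Collapse counts to 0/1 indicators
ind <- (counts > 0) * 1L

print(ind)
```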
1) Base R. Create the model matrix for each column (using the function make_mm) and bind them together as a data frame m. Finally, aggregate on ID. No packages are used.
make_mm <- function(nm, data) model.matrix(~ . - 1, data[nm])
m <- do.call("data.frame", lapply(names(d)[-1], make_mm, d))
with(d, aggregate(. ~ ID, m, max))
giving:
ID QualityBad QualityGood QualityUgly AreaEast AreaNorth AreaSouth
1 1 1 1 0 1 1 0
2 2 0 0 1 0 1 0
3 3 0 1 0 1 0 1
4 4 0 0 1 0 1 0
2) dplyr/purrr. This could alternatively be written as follows, which is close to the code in the question but generalizes to all required columns. Note that here we make model data frames using make_md rather than model matrices with make_mm. Also note that the dot in group_by(m, ID = .$ID) refers to d and not to m.
library(dplyr)
library(purrr)
make_md <- function(nm, data) {
data %>%
select(all_of(nm)) %>%
model.matrix(~ . - 1, .) %>%
as.data.frame
}
d %>% {
m <- map_dfc(names(.)[-1], make_md, .)
group_by(m, ID = .$ID) %>%
summarize_all(max) %>%
ungroup
}
