sub setting panel data based on two variables in R - r

library(dplyr)
id <- c(rep(1,4),rep(2,3),rep(3,4))
missing <- c(rep(0,4),rep(0,3),1,0,0,0)
wave <- c(seq(1:4),1,2,3,seq(1:4))
df <- as.data.frame(cbind(id,missing,wave))
df
id missing wave
1 1 0 1
2 1 0 2
3 1 0 3
4 1 0 4
5 2 0 1
6 2 0 2
7 2 0 3
8 3 1 1
9 3 0 2
10 3 0 3
11 3 0 4
I am trying to delete cases if they have missing=1 or if they are missing a wave (1:4). For example, ID=3 should be dropped because at wave=1 they have missing=1 and ID=2 should be dropped because they only have values of 1, 2, and 3 in Wave.
I tried to use dplyr's group_by and filter functions but this removes all cases. I want to only end up with cases for ID=1.
df <- df %>% group_by(id) %>% filter(missing==0, wave==1, wave==2, wave==3, wave==4)
df

Try this. We first group_by id, and then create a list column with the sorted unique values of wave for each id. Then we check to make sure this list equals 1:4. We create a missing_check variable, which is just the max of missing for each id. We filter on both missing_check and wave_check.
df %>%
group_by(id) %>%
mutate(wave_list = I(list(sort(unique(wave))))) %>%
mutate(wave_list_check = all(unlist(wave_list) == 1:4),
missing_check = max(missing)) %>%
filter(missing_check == 0, wave_list_check) %>%
select(id:wave)
id missing wave
<dbl> <dbl> <dbl>
1 1 0 1
2 1 0 2
3 1 0 3
4 1 0 4

Related

How to run Excel-like formulas using dplyr?

In the below reproducible R code, I'd like to add a column "adjust" that results from a series of calculations that in Excel would use cumulative countifs, max, and match (actually, to make this more complete the adjust column should have used the match formula since there could be more than 1 element in the list starting in row 15, but I think it's clear what I'm doing without actually using match) formulas as shown below in the illustration. The yellow shading shows what the reproducible code generates, and the blue shading shows my series of calculations in Excel that derive the desired values in the "adjust" column. Any suggestions for doing this, in dplyr if possible?
I am a long-time Excel user trying to migrate all of my work to R.
Reproducible code:
library(dplyr)
myData <-
data.frame(
Element = c("A","B","B","B","B","B","B","B"),
Group = c(0,1,1,1,2,2,3,3)
)
myDataGroups <- myData %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(ElementCnt = row_number()) %>%
ungroup() %>%
mutate(Group = factor(Group, unique(Group))) %>%
arrange(Group) %>%
mutate(groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L) %>%
as.data.frame()
myDataGroups
We may use rowid to get the sequence to update the 'Group', and then create a logical vector on 'Group' to create the binary and use cumsum on the 'excessOver2' and take the lag
library(dplyr)
library(data.table)
myDataGroups %>%
mutate(Group = rowid(Element, Group),
excessOver2 = +(Group > 2), adjust = lag(cumsum(excessOver2),
default = 0))
-output
Element Group origOrder ElementCnt groupCt excessOver2 adjust
1 A 1 1 1 -1 0 0
2 B 1 2 1 0 0 0
3 B 2 3 2 0 0 0
4 B 3 4 3 0 1 0
5 B 1 5 4 1 0 1
6 B 2 6 5 1 0 1
7 B 1 7 6 2 0 1
8 B 2 8 7 2 0 1
library(dplyr)
myData %>%
group_by(Element, Group) %>%
summarize(ElementCnt = row_number(), over2 = 1 * (ElementCnt > 2),
.groups = "drop_last") %>%
mutate(adjust = cumsum(lag(over2, default = 0))) %>%
ungroup()
Result
# A tibble: 8 × 5
Element Group ElementCnt over2 adjust
<chr> <dbl> <int> <dbl> <dbl>
1 A 0 1 0 0
2 B 1 1 0 0
3 B 1 2 0 0
4 B 1 3 1 0
5 B 2 1 0 1
6 B 2 2 0 1
7 B 3 1 0 1
8 B 3 2 0 1

In R, take sum of multiple variables if combination of values in two other columns are unique

I am trying to expand on the answer to this problem that was solved, Take Sum of a Variable if Combination of Values in Two Other Columns are Unique
but because I am new to stack overflow, I can't comment directly on that post so here is my problem:
I have a dataset like the following but with about 100 columns of binary data as shown in "ani1" and "bni2" columns.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum all the columns based on the location and season, but I want to simplify so I get a total column for column #3 and after for each unique combination of location and season.
The problem is not all the columns have a 1 value for every combination of location and season and they all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!
You can use dplyr::summarise and across after group_by.
library(dplyr)
df %>%
group_by(Locations, seasons) %>%
summarise(across(starts_with("ani"), ~sum(.x, na.rm = TRUE))) %>%
ungroup()
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons ani1 ani2
<chr> <int> <int> <int>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2

How to create binary variable for each individual based on value in other variable?

So I have a data set containing of 4 individuals. Each individual is measured for different time period. In R:
df = data.frame(cbind("id"=c(1,1,1,2,2,3,3,3,3,4,4), "t"=c(1,2,3,1,2,1,2,3,4,1,2), "x1"=c(0,1,0,1,0,0,1,0,1,0,0)))
and I want to create variable x2 indicating whether there already was 1 in variable x1 for given individual, ie it will look like this:
"x2" = c(0,1,1,1,1,0,1,1,1,0,0)
... ideally with dplyr package. So far I have came here:
new_df = df %>% dplyr::group_by(id) %>% dplyr::arrange(t)
but can not move from this point... The desired result is on picture.
Here is one approach using dplyr:
df %>%
arrange(id, t) %>%
group_by(id) %>%
mutate(x2 = ifelse(row_number() >= min(row_number()[x1 == 1]), 1, 0))
This will add a 1 if the row number is greater or equal to the first row number where x1 is 1; otherwise, it will add a 0.
Note, you will get warnings, as at least one group does not have a value of x1 which equals 1.
Also, another alternative, including if you want NA where no id has a x1 value of 1 (e.g., where id is 4):
df %>%
arrange(id, t) %>%
group_by(id) %>%
mutate(x2 = +(row_number() >= which(x1 == 1)[1]))
Output
id t x1 x2
<dbl> <dbl> <dbl> <dbl>
1 1 1 0 0
2 1 2 1 1
3 1 3 0 1
4 2 1 1 1
5 2 2 0 1
6 3 1 0 0
7 3 2 1 1
8 3 3 0 1
9 3 4 1 1
10 4 1 0 0
11 4 2 0 0

Separate a row of data on different columns with the count of each item

I have a dataset with two columns where I want to separate the second one (delimited by |) into many columns where each column has the name of the item and the observation has the count.
id column
1 a|b|a
2 a|b|c|d|e
3 a|c|c
I would like to have columns with the name of each item and its count. for example for user 1 it would be as follows:
id a b c d e
1 2 1 0 0 0
2 1 1 1 1 1
3 2 0 1 0 0
How do I get to separate this data such that the values are distributed in columns as such?
A tidyverse approach, assuming data frame named mydata:
library(dplyr)
library(tidyr)
mydata %>%
separate_rows(column, sep = "\\|") %>%
count(id, column) %>%
spread(column, n) %>%
replace(., is.na(.), 0) # or just spread(column, n, fill = 0)
Result:
# A tibble: 3 x 6
id a b c d e
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0 0 0
2 2 1 1 1 1 1
3 3 1 0 2 0 0

How do I sum recurring values according to a level in a column and output a table of counts?

I'm new to R and I have data that looks something like this:
categories <- c("A","B","C","A","A","B","C","A","B","C","A","B","B","C","C")
animals <- c("cat","cat","cat","dog","mouse","mouse","rabbit","rat","shark","shark","tiger","tiger","whale","whale","worm")
dat <- cbind(categories,animals)
Some animals repeat according to the category. For example, "cat" appears in all three categories A, B, and C.
I like my new dataframe output to look something like this:
A B C count
1 1 1 1
1 1 0 2
1 0 1 0
0 1 1 2
1 0 0 2
0 1 0 0
0 0 1 2
0 0 0 0
The number 1 under A, B, and C means that the animal appears in that category, 0 means the animal does not appear in that category. For example, the first line has 1s in all three categories. The count is 1 for the first line because "cat" is the only animal that repeats itself in each category.
Is there a function in R that will help me achieve this? Thank you in advance.
We can use table to create a cross-tabulation of categories and animals, transpose, convert to data.frame, group_by all categories and count the frequency per combination:
library(dplyr)
library(tidyr)
as.data.frame.matrix(t(table(dat))) %>%
group_by_all() %>%
summarize(Count = n())
Result:
# A tibble: 5 x 4
# Groups: A, B [?]
A B C Count
<int> <int> <int> <int>
1 0 0 1 2
2 0 1 1 2
3 1 0 0 2
4 1 1 0 2
5 1 1 1 1
Edit (thanks to #C. Braun). Here is how to also include the zero A, B, C combinations:
as.data.frame.matrix(t(table(dat))) %>%
bind_rows(expand.grid(A = c(0,1), B = c(0,1), C = c(0,1))) %>%
group_by_all() %>%
summarize(Count = n()-1)
or with complete, as suggested by #Ryan:
as.data.frame.matrix(t(table(dat))) %>%
mutate(non_missing = 1) %>%
complete(A, B, C) %>%
group_by(A, B, C) %>%
summarize(Count = sum(ifelse(is.na(non_missing), 0, 1)))
Result:
# A tibble: 8 x 4
# Groups: A, B [?]
A B C Count
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0 0 1 2
3 0 1 0 0
4 0 1 1 2
5 1 0 0 2
6 1 0 1 0
7 1 1 0 2
8 1 1 1 1
We have
xxtabs <- function(df, formula) {
xt <- xtabs(formula, df)
xxt <- xtabs( ~ . , as.data.frame.matrix(xt))
as.data.frame(xxt)
}
and
> xxtabs(dat, ~ animals + categories)
A B C Freq
1 0 0 0 0
2 1 0 0 2
3 0 1 0 0
4 1 1 0 2
5 0 0 1 2
6 1 0 1 0
7 0 1 1 2
8 1 1 1 1
(dat should really be constructed as data.frame(animals, categories)). This base approach uses xtabs() to form the first cross-tabulation
xt <- xtabs(~ animals + categories, dat)
then coerces using as.data.frame.matrix() to a second data.frame, and uses a second cross-tabulation of all columns of the computed data.frame
xxt <- xtabs(~ ., as.data.frame.matrix(xt))
coerced to the desired form
as.data.frame(xxt)
I originally said this approach was 'arcane', because it relies on knowledge of the difference between as.data.frame() and as.data.frame.matrix(); I think of xtabs() as a tool that users of base R should know. I see though that the other solutions also require this arcane knowledge, as well as knowledge of more obscure (e.g., complete(), group_by_all(), funs()) parts of the tidyverse. Also, the other answers are not (or at least not written in a way that allows) easily generalizable; xxtabs() does not actually know anything about the structure of the incoming data.frame, whereas implicit knowledge of the incoming data are present throughout the other answers.
One 'lesson learned' from the tidy approach is to place the data argument first, allowing piping
dat %>% xxtabs(~ animals + categories)
If I understood you correctly, this should do the trick.
require(tidyverse)
dat %>%
mutate(value = 1) %>%
spread(categories, value) %>%
mutate_if(is.numeric, funs(replace(., is.na(.), 0))) %>%
mutate(count = rowSums(data.frame(A, B, C), na.rm = TRUE)) %>%
group_by(A, B, C) %>%
summarize(Count = n())
# A tibble: 5 x 4
# Groups: A, B [?]
A B C Count
<dbl> <dbl> <dbl> <int>
1 0. 0. 1. 2
2 0. 1. 1. 2
3 1. 0. 0. 2
4 1. 1. 0. 2
5 1. 1. 1. 1
Adding a data.table solution. First, pivot animals against categories using dat. Then, create the combinations of A, B, C using CJ. Join that combinations with dat and count the number of occurrences for each combi.
dcast(as.data.table(dat), animals ~ categories, length)[
CJ(A=0:1, B=0:1, C=0:1), .(count=.N), on=c("A","B","C"), by=.EACHI]

Resources