Having trouble constructing a 2x2x2 contingency table in R

The file "Aspirin" contains a 2 × 2 × 2 contingency table with columns defined as follows.
Column 1: V1=Observation number. [Observations 1 to 8.]
Column 2: V2=Count. [Nonnegative integer count for each cell in the Table.]
Column 3: V3=Case/Control Factor. [Factor Level 1 (Controls) and Level 2 (Cases).]
Column 4: V4=Ulcer Type Factor. [Factor Level 1 (Gastric) and Level 2 (Duodenal).]
Column 5: V5=Aspirin Use Factor. [Factor Level 1 (Non-User) and Level 2 (User).]
> aspirin
  V1 V2 V3 V4 V5
1  1 62  1  1  1
2  2 39  2  1  1
3  3 53  1  2  1
4  4 49  2  2  1
5  5  6  1  1  2
6  6 25  2  1  2
7  7  8  1  2  2
8  8  8  2  2  2
I want to construct a 2x2x2 contingency table like the image above in R, so I typed the following code:
case_control=factor(aspirin$V3)
ulcer=factor(aspirin$V4)
use=factor(aspirin$V5)
table(case_control,ulcer,use)
But I get something like this:
, , use = 1
ulcer
case_control 1 2
1 1 1
2 1 1
, , use = 2
ulcer
case_control 1 2
1 1 1
2 1 1
I want a contingency table with counts, so obviously the result above is not what I want. Is there a way to fix this?

In your case, just use
ftable(case_control,ulcer,use)
which returns a "flat" table
                   use 1 2
case_control ulcer
1            1         1 1
             2         1 1
2            1         1 1
             2         1 1
The main problem here is that you are discarding your count column. As an alternative, here is (in my opinion) a better approach:
You could use xtabs together with ftable() (here used in a dplyr pipe):
library(dplyr)
df %>%
  transmute(ID = V1,
            Count = V2,
            Case_Control = factor(V3, labels = c("Control", "Case")),
            Ulcer_Type = factor(V4, labels = c("Gastric", "Duodenal")),
            Aspirin_Use = factor(V5, labels = c("Non-User", "User"))) %>%
  xtabs(Count ~ Ulcer_Type + Case_Control + Aspirin_Use, data = .) %>%
  ftable()
This returns
                        Aspirin_Use Non-User User
Ulcer_Type Case_Control
Gastric    Control                        62    6
           Case                           39   25
Duodenal   Control                        53    8
           Case                           49    8
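If you prefer to stay in base R, the same cross-tabulation can be built straight from the original data (a minimal sketch, assuming the question's aspirin data frame; the dimension labels will just be the raw 1/2 codes unless you first convert the columns to labelled factors):
# V2 supplies the cell counts; V4, V3 and V5 define the three dimensions
ftable(xtabs(V2 ~ V4 + V3 + V5, data = aspirin))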
Data
df <- structure(list(V1 = c(1, 2, 3, 4, 5, 6, 7, 8), V2 = c(62, 39,
53, 49, 6, 25, 8, 8), V3 = c(1, 2, 1, 2, 1, 2, 1, 2), V4 = c(1,
1, 2, 2, 1, 1, 2, 2), V5 = c(1, 1, 1, 1, 2, 2, 2, 2)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))

Related

Count how many rows have the same ID and add the number in a new column

My dataframe contains data about political careers, such as a unique identifier (called: ui) column for each politician and the electoral term(called: electoral_term) in which they were elected. Since a politician can be elected in multiple electoral terms, there are multiple rows that contain the same ui.
Now I would like to add another column to my dataframe, that counts how many times the politician got re-elected.
So, for example, the politician with ui = 1 was re-elected 2 times, since he appears in 3 electoral_terms.
I already tried
df %>% count(ui)
But that only returns a summary table, which can't be added as a column to my dataframe.
Thanks in advance!
We may use base R
df$reelected <- with(df, ave(ui, ui, FUN = length)-1)
-output
> df
ui electoral reelected
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
data
df <- structure(list(ui = c(1, 1, 1, 2, 3, 3), electoral = c(1, 2,
3, 2, 7, 9)), class = "data.frame", row.names = c(NA, -6L))
df <- tibble::tribble(~ui, ~electoral, 1, 1, 1, 2, 1, 3, 2, 2, 3, 7, 3, 9)
library(dplyr)
df |>
add_count(ui, name = "re_elected") |>
mutate(re_elected = re_elected - 1)
# A tibble: 6 × 3
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
library(tidyverse)
df %>%
group_by(ui) %>%
mutate(re_elected = n() - 1)
# A tibble: 6 × 3
# Groups: ui [3]
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
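If you want to avoid both ave() and dplyr, a base R sketch with table() works on the same df: look up each ui's frequency and subtract 1 for the first election.
# Frequency of each ui, indexed back into the original rows
tab <- table(df$ui)
df$re_elected <- as.integer(tab[as.character(df$ui)]) - 1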

Reshape from wide to long with multiple columns that have different naming patterns

I have a longitudinal data set in wide format, with > 2500 columns. Almost all columns begin with 'W1_' or 'W2_' to indicate the wave (i.e., time point) of data collection. In the real data, there are > 2 waves. They look like this:
# Populate wide format data frame
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
wide
#> person W1_resp_sex W2_resp_sex W1_edu W2_q_2_1
#> 1 1 1 1 1 0
#> 2 2 2 2 2 1
#> 3 3 1 1 3 1
#> 4 4 2 2 4 0
I want to reshape from wide to long format so that the data look like this:
# Populate long data frame (this is how we want the wide data above to look after reshaping it)
person <- c(1, 1, 2, 2, 3, 3, 4, 4)
wave <- c(1, 2, 1, 2, 1, 2, 1, 2)
sex <- c(1, 1, 2, 2, 1, 1, 2, 2)
education <- c(1, NA, 2, NA, 3, NA, 4, NA)
q_2_1 <- c(NA, 0, NA, 1, NA, 1, NA, 0)
long_goal <- as.data.frame(cbind(person, wave, sex, education, q_2_1))
long_goal
#> person wave sex education q_2_1
#> 1 1 1 1 1 NA
#> 2 1 2 1 NA 0
#> 3 2 1 2 2 NA
#> 4 2 2 2 NA 1
#> 5 3 1 1 3 NA
#> 6 3 2 1 NA 1
#> 7 4 1 2 4 NA
#> 8 4 2 2 NA 0
To reshape the data, I tried pivot_longer() (my attempt is below). How do I fix these issues? (I prefer not to use data.table.)
1. The variables have different naming patterns (how can I correctly specify names_pattern?).
2. The multiple value columns (notice how, in my attempt, all values end up under the 'sex' column).
3. Creating NA values when a variable was only collected in one wave (i.e., if it was only collected in wave 2, the wave 1 rows should contain NA for that variable).
# Re-load wide format data
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
# Load package
pacman::p_load(tidyr)
# Reshape from wide to long
long <- wide %>%
  pivot_longer(
    cols = starts_with('W'),
    names_to = 'Wave',
    names_prefix = 'W',
    names_pattern = '(.*)_',
    values_to = 'sex',
    values_drop_na = TRUE
  )
long
#> # A tibble: 16 × 3
#> person Wave sex
#> <dbl> <chr> <dbl>
#> 1 1 1_resp 1
#> 2 1 2_resp 1
#> 3 1 1 1
#> 4 1 2_q_2 0
#> 5 2 1_resp 2
#> 6 2 2_resp 2
#> 7 2 1 2
#> 8 2 2_q_2 1
#> 9 3 1_resp 1
#> 10 3 2_resp 1
#> 11 3 1 3
#> 12 3 2_q_2 1
#> 13 4 1_resp 2
#> 14 4 2_resp 2
#> 15 4 1 4
#> 16 4 2_q_2 0
Created on 2022-09-19 by the reprex package (v2.0.1)
We could reshape to 'long' with pivot_longer, specifying names_pattern to capture the substrings of the column names ((...)) in the same order as names_to: the wave column gets the digits (\\d+) after the 'W', whereas .value (the values of the columns) corresponds to the substring after the first _ in the column names. Then we rename resp_sex and edu.
library(dplyr)
library(tidyr)
pivot_longer(wide, cols = -person, names_to = c("wave", ".value"),
             names_pattern = "^W(\\d+)_(.*)$") %>%
  rename_with(~ c("sex", "education"), c("resp_sex", "edu"))
-output
# A tibble: 8 × 5
person wave sex education q_2_1
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 1 1 1 NA
2 1 2 1 NA 0
3 2 1 2 2 NA
4 2 2 2 NA 1
5 3 1 1 3 NA
6 3 2 1 NA 1
7 4 1 2 4 NA
8 4 2 2 NA 0
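If you also want wave as an integer rather than a character column (as in long_goal), recent tidyr versions accept a names_transform argument; a small sketch of the same call:
pivot_longer(wide, cols = -person, names_to = c("wave", ".value"),
             names_pattern = "^W(\\d+)_(.*)$",
             names_transform = list(wave = as.integer)) %>%
  rename_with(~ c("sex", "education"), c("resp_sex", "edu"))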
You want to reshape the variables that are measured in both waves. You can find them by tabulating the substrings of the names without the wave prefix.
v <- grep(names(which(table(substring(names(wide)[-1], 4)) == 2)), names(wide))
reshape2::melt(data=wide, id.vars=1, measure.vars=v)
# person variable value
# 1 1 W1_resp_sex 1
# 2 2 W1_resp_sex 2
# 3 3 W1_resp_sex 1
# 4 4 W1_resp_sex 2
# 5 1 W2_resp_sex 1
# 6 2 W2_resp_sex 2
# 7 3 W2_resp_sex 1
# 8 4 W2_resp_sex 2

Creating factor from multiple other factors fast

I have a data frame that looks like this:
df <- data.frame(
id = c(1, 2, 3, 4, 5),
generation = as.factor(c(3, 2, 4, 3, 4)),
income = as.factor(c(4, 3, 3, 7, 3)),
fem = as.factor(c(0, 0, 1, 0, 1))
)
where id is an identifier for individuals in the data set and generation, income and fem are categorical characteristics of the individuals. Now, I want to put the individuals into cohorts ("groups") based on the individual characteristics, where individuals with the exact same values for the individual characteristics should get the same cohort_id. Hence, I want the following result:
data.frame(
id = c(1, 2, 3, 4, 5),
generation = as.factor(c(3, 2, 4, 3, 4)),
income = as.factor(c(4, 3, 3, 7, 3)),
fem = as.factor(c(0, 0, 1, 0, 1)),
cohort_id = as.factor(c(1, 2, 3, 4, 3))
)
Note that id = 3 and id = 5 get the same cohort_id as they have the same characteristics.
My question is whether there is a fast way to create the cohort_ids without using multiple case_when or ifelse calls over and over again. This can get quite tedious if you want to build many cohorts. A solution using dplyr would be nice but is not necessary.
There are multiple ways to do this - one option is to paste the columns and match with the unique values
library(dplyr)
library(stringr)
df %>%
  mutate(cohort_id = str_c(generation, income, fem),
         cohort_id = match(cohort_id, unique(cohort_id)))
-output
id generation income fem cohort_id
1 1 3 4 0 1
2 2 2 3 0 2
3 3 4 3 1 3
4 4 3 7 0 4
5 5 4 3 1 3
The following code creates an index cohort_id whose values differ a little from the expected output shown above, but it follows the same grouping rules:
library(dplyr)
df %>%
  group_by(generation, income, fem) %>%
  mutate(cohort_id = cur_group_id()) %>%
  ungroup()
# A tibble: 5 × 5
id generation income fem cohort_id
<dbl> <fct> <fct> <fct> <int>
1 1 3 4 0 2
2 2 2 3 0 1
3 3 4 3 1 4
4 4 3 7 0 3
5 5 4 3 1 4
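A base R sketch that reproduces the first-appearance numbering of the expected output: interaction() collapses the three factors into one key, and match() against the unique keys assigns the cohort ids.
# One key per combination of characteristics, numbered in order of first appearance
key <- interaction(df$generation, df$income, df$fem, drop = TRUE)
df$cohort_id <- match(key, unique(key))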

Increase the value in the next row by 1 if two values are equal

I have data like this
structure(list(id = c(1, 1, 1, 2, 2, 2), time = c(1, 2, 2, 5,
6, 6)), class = "data.frame", row.names = c(NA, -6L))
If, for the same id, the value in the next row is equal to the value in the previous row, then increase the value of the duplicate by 1. I want to get this
structure(list(id2 = c(1, 1, 1, 2, 2, 2), time2 = c(1, 2, 3,
5, 6, 7)), class = "data.frame", row.names = c(NA, -6L))
Using base R:
ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
# [1] 1 2 3 5 6 7
(This can be reassigned back into time.)
This deals with 2 or more duplicates; for example, if we add another copy of the 6th row,
df <- rbind(df, df[6,])
df$time2 <- ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
df
# id time time2
# 1 1 1 1
# 2 1 2 2
# 3 1 2 3
# 4 2 5 5
# 5 2 6 6
# 6 2 6 7
# 61 2 6 8
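Note that the ave() call above groups by time rather than by id. If you specifically want the increment applied within each id (as the question's wording suggests), a minimal variation gives the same result for this data:
# Within each id, bump time by 1 for every repeated value
df$time2 <- with(df, ave(time, id, FUN = function(z) z + cumsum(duplicated(z))))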
You could use accumulate
library(tidyverse)
df %>%
  group_by(id) %>%
  mutate(time2 = accumulate(time, ~ if (.x >= .y) .x + 1 else .y))
# A tibble: 6 x 3
# Groups: id [2]
id time time2
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 2
3 1 2 3
4 2 5 5
5 2 6 6
6 2 6 7
This works even if a value is repeated more than twice within a group.
If the first data.frame is named df, this gives you what you need:
df$time[duplicated(df$id) & duplicated(df$time)] <- df$time[duplicated(df$id) & duplicated(df$time)] + 1
df
id time
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
It finds the rows where both id and time are duplicated, and adds 1 to time in those rows.
You can use dplyr's mutate with cumsum() and duplicated():
data %>%
  group_by(id) %>%
  mutate(time = time + cumsum(duplicated(time))) %>%
  ungroup()
# A tibble: 6 x 2
id time
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7

How to add a counting column based on values in other columns in R

I have a relatively large dataset (16,000+ rows by ~31 columns). In other words, it's large enough that I don't want to manipulate it line by line in Excel. The data is in this form:
block site day X1 X2
1 1 1 0.4 5.1
1 1 2 0.8 1.1
1 1 3 1.1 4.2
1 2 1 ... ...
1 2 2
1 2 3
2 3 1
2 3 2
2 3 3
2 4 1
2 4 2
2 4 3
As you can see, the site numbering runs continuously across blocks, but I would like a column where the site number resets with each block. For example, I would like something like this:
block site day X1 X2 site2
1 1 1 0.4 5.1 1
1 1 2 0.8 1.1 1
1 1 3 1.1 4.2 1
1 2 1 ... ... 2
1 2 2 2
1 2 3 2
2 3 1 1
2 3 2 1
2 3 3 1
2 4 1 2
2 4 2 2
2 4 3 2
I was thinking about using the R function rle but am not sure if it will work because of complications with day. Otherwise, I would try something like:
Data$site2 <- sequence(rle(Data$block)$lengths)
Does anyone have any suggestions for adding a column counting (sequence) the number of sites within each block? If it helps, there are the same number of days (263) recorded for each site but there are a different number of sites per block.
Here's a slightly clumsy solution using plyr's ddply:
ddply(df, .(block), transform,
      site1 = rep(1:length(unique(site)),
                  times = rle(site)$lengths))
Or a slightly slicker version:
ddply(df,.(block),transform,site1 = as.integer(as.factor(site)))
There may be a clever way of doing this directly, though, using the various seq, sequence and rle functions, but my brain is a bit hazy at the moment. If you leave this open for a bit someone will likely come along with a slick non-plyr solution.
Using tapply could work
# Make some fake data
dat <- data.frame(block = rep(1:3, each = 4), site = rep(1:6, each = 2), val = rnorm(12))
# For each block reset the count
dat$site2 <- unlist(tapply(dat$site, dat$block, function(x){x - min(x) + 1}))
Via ave:
df1 <- structure(list(block = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
site = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4), day = c(1,
2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)), .Names = c("block", "site",
"day"), row.names = c("2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13"), class = "data.frame")
df1$site2 <- ave(df1$site,df1$block,FUN=function(x) match(x,sort(unique(x))))
I just wanted to update with an answer using dplyr to implement the approach by @joran for people who find this now.
library(dplyr)
# create data
df <- data.frame(block = rep(1:3, each = 4),
                 site = rep(1:6, each = 2),
                 day = rep(1:2, times = 6),
                 x = rnorm(12))

df %>%
  group_by(block) %>%
  mutate(site2 = as.integer(as.factor(site)))
The resulting output is:
block site day x site2
<int> <int> <int> <dbl> <int>
1 1 1 0.762 1
1 1 2 -0.612 1
1 2 1 1.06 2
1 2 2 -0.168 2
2 3 1 1.09 1
2 3 2 1.38 1
2 4 1 1.69 2
2 4 2 0.414 2
3 5 1 0.208 1
3 5 2 -0.647 1
3 6 1 -1.01 2
3 6 2 -0.354 2
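A dplyr variation on the same idea, in case the site values are not consecutive integers: dense_rank() numbers the distinct sites within each block.
library(dplyr)
df %>%
  group_by(block) %>%
  mutate(site2 = dense_rank(site)) %>%
  ungroup()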
