Having trouble constructing a 2x2x2 contingency table in R

The file "Aspirin" contains a 2 × 2 × 2 contingency table with columns defined as follows.
Column 1: V1=Observation number. [Observations 1 to 8.]
Column 2: V2=Count. [Nonnegative integer count for each cell in the Table.]
Column 3: V3=Case/Control Factor. [Factor Level 1 (Controls) and Level 2 (Cases).]
Column 4: V4=Ulcer Type Factor. [Factor Level 1 (Gastric) and Level 2 (Duodenal).]
Column 5: V5=Aspirin Use Factor. [Factor Level 1 (Non-User) and Level 2 (User).]
> aspirin
  V1 V2 V3 V4 V5
1  1 62  1  1  1
2  2 39  2  1  1
3  3 53  1  2  1
4  4 49  2  2  1
5  5  6  1  1  2
6  6 25  2  1  2
7  7  8  1  2  2
8  8  8  2  2  2
I want to construct a 2x2x2 contingency table like the image above in R, so I typed the following code:
case_control=factor(aspirin$V3)
ulcer=factor(aspirin$V4)
use=factor(aspirin$V5)
table(case_control,ulcer,use)
But I get something like this:
, , use = 1
ulcer
case_control 1 2
1 1 1
2 1 1
, , use = 2
ulcer
case_control 1 2
1 1 1
2 1 1
I want a contingency table with counts, so obviously the result above is not what I want. Is there a way to fix this?

In your case, just use
ftable(case_control,ulcer,use)
which returns a "flat" table
                   use 1 2
case_control ulcer
1            1         1 1
             2         1 1
2            1         1 1
             2         1 1
The main problem here is that you are discarding your count column. As an alternative, here is (in my opinion) a better approach:
You could use xtabs together with ftable() (here used in a dplyr pipe):
library(dplyr)
df %>%
  transmute(ID = V1,
            Count = V2,
            Case_Control = factor(V3, labels = c("Control", "Case")),
            Ulcer_Type = factor(V4, labels = c("Gastric", "Duodenal")),
            Aspirin_Use = factor(V5, labels = c("Non-User", "User"))) %>%
  xtabs(Count ~ Ulcer_Type + Case_Control + Aspirin_Use, data = .) %>%
  ftable()
This returns
                        Aspirin_Use Non-User User
Ulcer_Type Case_Control
Gastric    Control                        62    6
           Case                           39   25
Duodenal   Control                        53    8
           Case                           49    8
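If you prefer to stay in base R, the same cross-tabulation can be built straight from the original data (a minimal sketch, assuming the question's aspirin data frame; the dimension labels will just be the raw 1/2 codes unless you first convert the columns to labelled factors):
# V2 supplies the cell counts; V4, V3 and V5 define the three dimensions
ftable(xtabs(V2 ~ V4 + V3 + V5, data = aspirin))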
Data
df <- structure(list(V1 = c(1, 2, 3, 4, 5, 6, 7, 8), V2 = c(62, 39,
53, 49, 6, 25, 8, 8), V3 = c(1, 2, 1, 2, 1, 2, 1, 2), V4 = c(1,
1, 2, 2, 1, 1, 2, 2), V5 = c(1, 1, 1, 1, 2, 2, 2, 2)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))

Related

Count how many rows have the same ID and add the number in a new column

My dataframe contains data about political careers, such as a unique identifier (called: ui) column for each politician and the electoral term(called: electoral_term) in which they were elected. Since a politician can be elected in multiple electoral terms, there are multiple rows that contain the same ui.
Now I would like to add another column to my dataframe, that counts how many times the politician got re-elected.
So, for example, the politician with ui = 1 was re-elected 2 times, since he appears in 3 electoral_terms.
I already tried
df %>% count(ui)
But that only returns a summary table, which can't be added as a column to my dataframe.
Thanks in advance!
We may use base R
df$reelected <- with(df, ave(ui, ui, FUN = length)-1)
-output
> df
ui electoral reelected
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
data
df <- structure(list(ui = c(1, 1, 1, 2, 3, 3), electoral = c(1, 2,
3, 2, 7, 9)), class = "data.frame", row.names = c(NA, -6L))
df <- tibble::tribble(~ui, ~electoral, 1, 1, 1, 2, 1, 3, 2, 2, 3, 7, 3, 9)
library(dplyr)
df |>
add_count(ui, name = "re_elected") |>
mutate(re_elected = re_elected - 1)
# A tibble: 6 × 3
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
library(tidyverse)
df %>%
group_by(ui) %>%
mutate(re_elected = n() - 1)
# A tibble: 6 × 3
# Groups: ui [3]
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
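If you want to avoid both ave() and dplyr, a base R sketch with table() works on the same df: look up each ui's frequency and subtract 1 for the first election.
# Frequency of each ui, indexed back into the original rows
tab <- table(df$ui)
df$re_elected <- as.integer(tab[as.character(df$ui)]) - 1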

Reshape from wide to long with multiple columns that have different naming patterns

I have a longitudinal data set in wide format, with > 2500 columns. Almost all columns begin with 'W1_' or 'W2_' to indicate the wave (i.e., time point) of data collection. In the real data, there are > 2 waves. They look like this:
# Populate wide format data frame
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
wide
#> person W1_resp_sex W2_resp_sex W1_edu W2_q_2_1
#> 1 1 1 1 1 0
#> 2 2 2 2 2 1
#> 3 3 1 1 3 1
#> 4 4 2 2 4 0
I want to reshape from wide to long format so that the data look like this:
# Populate long data frame (this is how we want the wide data above to look after reshaping it)
person <- c(1, 1, 2, 2, 3, 3, 4, 4)
wave <- c(1, 2, 1, 2, 1, 2, 1, 2)
sex <- c(1, 1, 2, 2, 1, 1, 2, 2)
education <- c(1, NA, 2, NA, 3, NA, 4, NA)
q_2_1 <- c(NA, 0, NA, 1, NA, 1, NA, 0)
long_goal <- as.data.frame(cbind(person, wave, sex, education, q_2_1))
long_goal
#> person wave sex education q_2_1
#> 1 1 1 1 1 NA
#> 2 1 2 1 NA 0
#> 3 2 1 2 2 NA
#> 4 2 2 2 NA 1
#> 5 3 1 1 3 NA
#> 6 3 2 1 NA 1
#> 7 4 1 2 4 NA
#> 8 4 2 2 NA 0
To reshape the data, I tried pivot_longer() (my attempt is below). How do I fix these issues? (I prefer not to use data.table.)
1. The variables have different naming patterns (how can I correctly specify names_pattern?).
2. The multiple value columns (notice how, in my attempt, all values end up under the 'sex' column).
3. Creating NA values when a variable was only collected in one wave (i.e., if it was only collected in wave 2, the wave 1 rows should contain NA for that variable).
# Re-load wide format data
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
# Load package
pacman::p_load(tidyr)
# Reshape from wide to long
long <- wide %>%
  pivot_longer(
    cols = starts_with('W'),
    names_to = 'Wave',
    names_prefix = 'W',
    names_pattern = '(.*)_',
    values_to = 'sex',
    values_drop_na = TRUE
  )
long
#> # A tibble: 16 × 3
#> person Wave sex
#> <dbl> <chr> <dbl>
#> 1 1 1_resp 1
#> 2 1 2_resp 1
#> 3 1 1 1
#> 4 1 2_q_2 0
#> 5 2 1_resp 2
#> 6 2 2_resp 2
#> 7 2 1 2
#> 8 2 2_q_2 1
#> 9 3 1_resp 1
#> 10 3 2_resp 1
#> 11 3 1 3
#> 12 3 2_q_2 1
#> 13 4 1_resp 2
#> 14 4 2_resp 2
#> 15 4 1 4
#> 16 4 2_q_2 0
Created on 2022-09-19 by the reprex package (v2.0.1)
We could reshape to 'long' with pivot_longer, specifying names_pattern to capture the substrings of the column names ((...)) in the same order as names_to: the wave column gets the digits (\\d+) after the 'W', whereas .value (the values of the columns) corresponds to the substring after the first _ in the column names. Then we rename resp_sex and edu.
library(dplyr)
library(tidyr)
pivot_longer(wide, cols = -person, names_to = c("wave", ".value"),
             names_pattern = "^W(\\d+)_(.*)$") %>%
  rename_with(~ c("sex", "education"), c("resp_sex", "edu"))
-output
# A tibble: 8 × 5
person wave sex education q_2_1
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 1 1 1 NA
2 1 2 1 NA 0
3 2 1 2 2 NA
4 2 2 2 NA 1
5 3 1 1 3 NA
6 3 2 1 NA 1
7 4 1 2 4 NA
8 4 2 2 NA 0
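If you also want wave as an integer rather than a character column (as in long_goal), recent tidyr versions accept a names_transform argument; a small sketch of the same call:
pivot_longer(wide, cols = -person, names_to = c("wave", ".value"),
             names_pattern = "^W(\\d+)_(.*)$",
             names_transform = list(wave = as.integer)) %>%
  rename_with(~ c("sex", "education"), c("resp_sex", "edu"))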
You want to reshape the variables that are measured in both waves. You can find them by tabulating the substrings of the names without the wave prefix.
v <- grep(names(which(table(substring(names(wide)[-1], 4)) == 2)), names(wide))
reshape2::melt(data=wide, id.vars=1, measure.vars=v)
# person variable value
# 1 1 W1_resp_sex 1
# 2 2 W1_resp_sex 2
# 3 3 W1_resp_sex 1
# 4 4 W1_resp_sex 2
# 5 1 W2_resp_sex 1
# 6 2 W2_resp_sex 2
# 7 3 W2_resp_sex 1
# 8 4 W2_resp_sex 2

Creating factor from multiple other factors fast

I have a data frame that looks like this:
df <- data.frame(
id = c(1, 2, 3, 4, 5),
generation = as.factor(c(3, 2, 4, 3, 4)),
income = as.factor(c(4, 3, 3, 7, 3)),
fem = as.factor(c(0, 0, 1, 0, 1))
)
where id is an identifier for individuals in the data set and generation, income and fem are categorical characteristics of the individuals. Now, I want to put the individuals into cohorts ("groups") based on the individual characteristics, where individuals with the exact same values for the individual characteristics should get the same cohort_id. Hence, I want the following result:
data.frame(
id = c(1, 2, 3, 4, 5),
generation = as.factor(c(3, 2, 4, 3, 4)),
income = as.factor(c(4, 3, 3, 7, 3)),
fem = as.factor(c(0, 0, 1, 0, 1)),
cohort_id = as.factor(c(1, 2, 3, 4, 3))
)
Note that id = 3 and id = 5 get the same cohort_id as they have the same characteristics.
My question is whether there is a fast way to create the cohort_ids without using multiple case_when or ifelse calls over and over again. This can get quite tedious if you want to build many cohorts. A solution using dplyr would be nice but is not necessary.
There are multiple ways to do this - one option is to paste the columns and match with the unique values
library(dplyr)
library(stringr)
df %>%
  mutate(cohort_id = str_c(generation, income, fem),
         cohort_id = match(cohort_id, unique(cohort_id)))
-output
id generation income fem cohort_id
1 1 3 4 0 1
2 2 2 3 0 2
3 3 4 3 1 3
4 4 3 7 0 4
5 5 4 3 1 3
The following code creates an index cohort_id whose values differ a little from the expected output shown above, but it follows the same grouping rules:
library(dplyr)
df %>%
  group_by(generation, income, fem) %>%
  mutate(cohort_id = cur_group_id()) %>%
  ungroup()
# A tibble: 5 × 5
id generation income fem cohort_id
<dbl> <fct> <fct> <fct> <int>
1 1 3 4 0 2
2 2 2 3 0 1
3 3 4 3 1 4
4 4 3 7 0 3
5 5 4 3 1 4
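A base R sketch that reproduces the first-appearance numbering of the expected output: interaction() collapses the three factors into one key, and match() against the unique keys assigns the cohort ids.
# One key per combination of characteristics, numbered in order of first appearance
key <- interaction(df$generation, df$income, df$fem, drop = TRUE)
df$cohort_id <- match(key, unique(key))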

Increase the value in the next row by 1 if two values are equal

I have data like this
structure(list(id = c(1, 1, 1, 2, 2, 2), time = c(1, 2, 2, 5,
6, 6)), class = "data.frame", row.names = c(NA, -6L))
If, for the same id, the value in the next row is equal to the value in the previous row, then increase the value of the duplicate by 1. I want to get this
structure(list(id2 = c(1, 1, 1, 2, 2, 2), time2 = c(1, 2, 3,
5, 6, 7)), class = "data.frame", row.names = c(NA, -6L))
Using base R:
ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
# [1] 1 2 3 5 6 7
(This can be reassigned back into time.)
This deals with 2 or more duplicates; for example, if we add another copy of the 6th row,
df <- rbind(df, df[6,])
df$time2 <- ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
df
# id time time2
# 1 1 1 1
# 2 1 2 2
# 3 1 2 3
# 4 2 5 5
# 5 2 6 6
# 6 2 6 7
# 61 2 6 8
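Note that the ave() call above groups by time rather than by id. If you specifically want the increment applied within each id (as the question's wording suggests), a minimal variation gives the same result for this data:
# Within each id, bump time by 1 for every repeated value
df$time2 <- with(df, ave(time, id, FUN = function(z) z + cumsum(duplicated(z))))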
You could use accumulate
library(tidyverse)
df %>%
  group_by(id) %>%
  mutate(time2 = accumulate(time, ~ if (.x >= .y) .x + 1 else .y))
# A tibble: 6 x 3
# Groups: id [2]
id time time2
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 2
3 1 2 3
4 2 5 5
5 2 6 6
6 2 6 7
This works even if a value is repeated more than twice within a group.
If the first data.frame is named df, this gives you what you need:
df$time[duplicated(df$id) & duplicated(df$time)] <- df$time[duplicated(df$id) & duplicated(df$time)] + 1
df
id time
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
It finds the rows where both id and time are duplicated, and adds 1 to time in those rows.
You can use dplyr's mutate with cumsum() and duplicated():
data %>%
  group_by(id) %>%
  mutate(time = time + cumsum(duplicated(time))) %>%
  ungroup()
# A tibble: 6 x 2
id time
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7

How to add a counting column based on values in other columns in R

I have a relatively large dataset (16,000+ rows by ~31 columns). In other words, it's large enough that I don't want to manipulate it line by line in Excel. The data is in this form:
block site day X1 X2
1 1 1 0.4 5.1
1 1 2 0.8 1.1
1 1 3 1.1 4.2
1 2 1 ... ...
1 2 2
1 2 3
2 3 1
2 3 2
2 3 3
2 4 1
2 4 2
2 4 3
As you can see, the site numbering runs continuously across blocks, but I would like a column where the site number resets with each block. For example, I would like something like this:
block site day X1 X2 site2
1 1 1 0.4 5.1 1
1 1 2 0.8 1.1 1
1 1 3 1.1 4.2 1
1 2 1 ... ... 2
1 2 2 2
1 2 3 2
2 3 1 1
2 3 2 1
2 3 3 1
2 4 1 2
2 4 2 2
2 4 3 2
I was thinking about using the R function rle but am not sure if it will work because of complications with day. Otherwise, I would try something like:
Data$site2 <- sequence(rle(Data$block)$lengths)
Does anyone have any suggestions for adding a column counting (sequence) the number of sites within each block? If it helps, there are the same number of days (263) recorded for each site but there are a different number of sites per block.
Here's a slightly clumsy solution using plyr's ddply:
ddply(df, .(block), transform,
      site1 = rep(1:length(unique(site)),
                  times = rle(site)$lengths))
Or a slightly slicker version:
ddply(df,.(block),transform,site1 = as.integer(as.factor(site)))
There may be a clever way of doing this directly, though, using the various seq, sequence and rle functions, but my brain is a bit hazy at the moment. If you leave this open for a bit someone will likely come along with a slick non-plyr solution.
Using tapply could work
# Make some fake data
dat <- data.frame(block = rep(1:3, each = 4), site = rep(1:6, each = 2), val = rnorm(12))
# For each block reset the count
dat$site2 <- unlist(tapply(dat$site, dat$block, function(x){x - min(x) + 1}))
Via ave:
df1 <- structure(list(block = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
site = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4), day = c(1,
2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)), .Names = c("block", "site",
"day"), row.names = c("2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13"), class = "data.frame")
df1$site2 <- ave(df1$site,df1$block,FUN=function(x) match(x,sort(unique(x))))
I just wanted to update with an answer using dplyr to implement the approach by @joran for people who find this now.
library(dplyr)
# create data
df <- data.frame(block = rep(1:3, each = 4),
                 site = rep(1:6, each = 2),
                 day = rep(1:2, times = 6),
                 x = rnorm(12))

df %>%
  group_by(block) %>%
  mutate(site2 = as.integer(as.factor(site)))
The resulting output is:
block site day x site2
<int> <int> <int> <dbl> <int>
1 1 1 0.762 1
1 1 2 -0.612 1
1 2 1 1.06 2
1 2 2 -0.168 2
2 3 1 1.09 1
2 3 2 1.38 1
2 4 1 1.69 2
2 4 2 0.414 2
3 5 1 0.208 1
3 5 2 -0.647 1
3 6 1 -1.01 2
3 6 2 -0.354 2
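A dplyr variation on the same idea, in case the site values are not consecutive integers: dense_rank() numbers the distinct sites within each block.
library(dplyr)
df %>%
  group_by(block) %>%
  mutate(site2 = dense_rank(site)) %>%
  ungroup()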
