I have the following data in R:
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
disease <- as.factor(disease)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status )
my_data = data.frame(gender, status, disease)
I want to make a table that shows:
What percent of male immigrants have the disease?
What percent of male non-immigrants have the disease?
What percent of female immigrants have the disease?
What percent of female non-immigrants have the disease?
I tried to do this with the following code:
t1 <- xtabs(disease ~ gender + status, data=my_data)
But I get this error:
Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :
‘sum’ not meaningful for factors
Can someone please show me what I am doing wrong and how to fix this?
Thank you!
Since all of the columns are factors, use count from dplyr and then compute the proportions:
library(dplyr)
library(tidyr)
my_data %>%
dplyr::count(across(everything())) %>%
pivot_wider(names_from = disease, values_from =n, values_fill = 0) %>%
group_by(gender) %>%
mutate(100 *across(No:Yes, proportions)) %>%
ungroup
-output
# A tibble: 4 × 4
gender status No Yes
<fct> <fct> <dbl> <dbl>
1 Female Citizen 69.4 72.4
2 Female Immigrant 30.6 27.6
3 Male Citizen 70.4 68.7
4 Male Immigrant 29.6 31.3
With xtabs, if we convert the column to integer, it could work as
apply(xtabs(n ~ disease + gender + status,
transform(my_data, n = as.integer(disease))), c(1, 2), proportions) * 100
, , gender = Female
disease
status No Yes
Citizen 69.36724 72.41993
Immigrant 30.63276 27.58007
, , gender = Male
disease
status No Yes
Citizen 70.40185 68.68687
Immigrant 29.59815 31.31313
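The error in the question arises because xtabs() tries to sum whatever is on the left-hand side of the formula, and disease is a factor. The tables above give the share of each status within a gender and disease outcome. If the goal is instead the percentage with the disease within each gender and status combination, one possible sketch is to leave the left-hand side empty so that xtabs() counts rows, then take proportions over the disease dimension. The seed and the names counts and pct are illustrative choices, not taken from the question.

```r
# Recreate data shaped like the question's (the seed is an arbitrary assumption)
set.seed(42)
n <- 5000
my_data <- data.frame(
  gender  = factor(sample(c("Male", "Female"), n, replace = TRUE, prob = c(0.45, 0.55))),
  status  = factor(sample(c("Immigrant", "Citizen"), n, replace = TRUE, prob = c(0.3, 0.7))),
  disease = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.4, 0.6)))
)

# Empty left-hand side: xtabs() counts rows per gender x status x disease cell
counts <- xtabs(~ gender + status + disease, data = my_data)

# Percentages that add up to 100 over disease within each gender x status pair
pct <- 100 * proportions(counts, margin = c(1, 2))

# Percent with the disease, one value per gender x status combination
pct[, , "Yes"]
```

Each entry of pct[, , "Yes"] answers one of the four questions, for example the Male/Immigrant entry is the percentage of male immigrants who have the disease.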
I am using this dataset: http://www.openintro.org/stat/data/cdc.R
to create a table from a subset that only contains the means and standard deviations of male participants. The table should look like this:
Mean Standard Deviation
Age: 44.27 16.715
Height: 70.25 3.009219
Weight: 189.3 36.55036
Desired Weight: 178.6 26.25121
I created a subset for males and females with this code:
mdata <- subset(cdc, cdc$gender == ("m"))
fdata <- subset(cdc, cdc$gender == ("f"))
How should I create a table that only contains means and SDs of age, height, weight, and desired weight using these subsets?
The data frame you provided used up all the memory on my laptop, and that much data is not needed to solve your problem. Here's a dplyr/tidyr solution that creates a summary table grouped by category, using the starwars dataset that ships with dplyr:
library(dplyr)
library(tidyr)
starwars |>
group_by(sex) |>
summarise(across(
where(is.numeric),
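# pass na.rm inside each function; supplying it through across()'s ... is deprecated since dplyr 1.1.0
.fns = list(Mean = \(x) mean(x, na.rm = TRUE), SD = \(x) sd(x, na.rm = TRUE)),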
.names = "{col}__{fn}"
)) |>
pivot_longer(-sex, names_to = c("var", ".value"), names_sep = "__")
# A tibble: 15 × 4
sex var Mean SD
<chr> <chr> <dbl> <dbl>
1 female height 169. 15.3
2 female mass 54.7 8.59
3 female birth_year 47.2 15.0
4 hermaphroditic height 175 NA
5 hermaphroditic mass 1358 NA
6 hermaphroditic birth_year 600 NA
7 male height 179. 36.0
8 male mass 81.0 28.2
9 male birth_year 85.5 157.
10 none height 131. 49.1
11 none mass 69.8 51.0
12 none birth_year 53.3 51.6
13 NA height 181. 2.89
14 NA mass 48 NA
15 NA birth_year 62 NA
Just make a data frame from colMeans and the column SDs. Note that subset can also select the columns you need.
fdata <- subset(cdc, gender == "f", select=c("age", "height", "weight", "wtdesire"))
data.frame(mean=colMeans(fdata), sd=apply(fdata, 2, sd))
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
You can also use by to do it for both groups at once; it's basically a combination of split and lapply. (To avoid apply when calculating column SDs, you could also use sd=matrixStats::colSds(as.matrix(fdata)), which is considerably faster.)
res <- by(cdc[c("age", "height", "weight", "wtdesire")], cdc$gender, \(x) {
data.frame(mean=colMeans(x), sd=matrixStats::colSds(as.matrix(x)))
})
res
# cdc$gender: m
# mean sd
# age 44.27307 16.719940
# height 70.25165 3.009219
# weight 189.32271 36.550355
# wtdesire 178.61657 26.251215
# ------------------------------------------------------------------------------------------
# cdc$gender: f
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
To extract only one of the data frames in the list-like object use e.g. res$m.
Usually we use aggregate for this, which you also might consider:
aggregate(cbind(age, height, weight, wtdesire) ~ gender, cdc, \(x) c(mean=mean(x), sd=sd(x))) |>
do.call(what=data.frame)
# gender age.mean age.sd height.mean height.sd weight.mean weight.sd wtdesire.mean wtdesire.sd
# 1 m 44.27307 16.71994 70.251646 3.009219 189.32271 36.55036 178.61657 26.25121
# 2 f 45.79772 17.58442 64.367750 2.787304 151.66619 34.29752 133.51500 18.96301
The pipe |> do.call(what=data.frame) is only needed to get rid of the matrix columns, which is useful if you aim to process the data further.
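To make the matrix-column point concrete, here is a small sketch on made-up data (toy is a hypothetical stand-in, not the cdc data). It shows that aggregate() stores the two statistics in one matrix column, and that do.call(data.frame, ...) splits them into ordinary columns.

```r
# Hypothetical toy data, only to illustrate the shape of the result
toy <- data.frame(gender = c("m", "m", "f", "f"),
                  age    = c(30, 40, 20, 50))

# FUN returns two values, so `age` becomes a single two-column matrix
agg <- aggregate(age ~ gender, toy, function(x) c(mean = mean(x), sd = sd(x)))
ncol(agg)           # gender plus one matrix column
is.matrix(agg$age)  # the statistics live inside that matrix

# Flatten the matrix column into separate plain columns
flat <- do.call(data.frame, agg)
names(flat)
```

After flattening, columns such as age.mean can be used directly in later steps like sorting or merging.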
Note: R >= 4.1 used.
Data:
source('https://www.openintro.org/stat/data/cdc.R')
or
cdc <- structure(list(genhlth = structure(c(3L, 3L, 1L, 5L, 3L, 3L), levels = c("excellent",
"very good", "good", "fair", "poor"), class = "factor"), exerany = c(0,
1, 0, 0, 1, 1), hlthplan = c(1, 1, 1, 1, 1, 1), smoke100 = c(1,
0, 0, 0, 0, 1), height = c(69, 66, 73, 65, 67, 69), weight = c(224L,
215L, 200L, 216L, 165L, 170L), wtdesire = c(224L, 140L, 185L,
150L, 165L, 165L), age = c(73L, 23L, 35L, 57L, 81L, 83L), gender = structure(c(1L,
2L, 1L, 2L, 2L, 1L), levels = c("m", "f"), class = "factor")), row.names = c("19995",
"19996", "19997", "19998", "19999", "20000"), class = "data.frame")
I need to create a model where the presence/absence of pox can differ between species
and between elevations while also allowing for the effect of elevation to differ between species.
This is what I have:
library(car)
library(effects)
modPox=glm(Activepox ~ Species + Elev, data = datPox, family = binomial)
summary(modPox)
Anova(modPox)
plot(allEffects(modPox))
However, what does it mean for pox to differ between the two predictors? What do I need to add so that pox can differ between species and between elevations? And what does it mean for the effect of elevation to differ between species?
Thank you.
This is what the data looks like:
Site Species Bandno Date Sex Age Oldpox Activepox Malaria Elev
1 AIN APAP 159174793 7/22/2004 U H 0 0 2 mid
2 AIN APAP 159174964 7/6/2004 M H 0 1 2 mid
3 AIN APAP 159174965 7/7/2004 F H 0 0 2 mid
Data:
datPox <- data.frame(
stringsAsFactors = FALSE,
Site = c("AIN", "AIN", "AIN"),
Species = c("APAP", "APAP", "APAP"),
Bandno = c(159174793L, 159174964L, 159174965L),
Date = c("7/22/2004", "7/6/2004", "7/7/2004"),
Sex = c("U", "M", "F"),
Age = c("H", "H", "H"),
Oldpox = c(0L, 0L, 0L),
Activepox = c(0L, 1L, 0L),
Malaria = c(2L, 2L, 2L),
Elev = c("mid", "mid", "mid")
)
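In R formula terms, "the effect of elevation differing between species" usually corresponds to an interaction term. The sketch below uses simulated data rather than the real survey; the second species code "HAAM", the elevation levels, and the object names sim, mod_add and mod_int are all assumptions made for illustration. It fits the additive model from the question next to one with Species * Elev and compares the two.

```r
# Simulated stand-in for datPox (hypothetical values, not the real data)
set.seed(1)
sim <- data.frame(
  Species = factor(sample(c("APAP", "HAAM"), 400, replace = TRUE)),
  Elev    = factor(sample(c("low", "mid", "high"), 400, replace = TRUE),
                   levels = c("low", "mid", "high"))
)
sim$Activepox <- rbinom(400, 1, 0.3)

# Main effects only: pox can differ by species and by elevation,
# but the elevation effect is the same for every species
mod_add <- glm(Activepox ~ Species + Elev, data = sim, family = binomial)

# Species * Elev expands to Species + Elev + Species:Elev,
# so each species is allowed its own elevation effect
mod_int <- glm(Activepox ~ Species * Elev, data = sim, family = binomial)

# Likelihood-ratio test of whether the interaction improves the fit
anova(mod_add, mod_int, test = "Chisq")
```

The extra coefficients in mod_int (those with a colon in their names) describe how the elevation effect for one species departs from that of the reference species.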
The dataset is a breakdown of respondents and the contacts they have had within a given time period, along with the age bracket of each participant and contact. It looks something like this:
participant participant_age contact contact_age
1 18-30 1 18-30
1 18-30 2 30-40
2 30-40 1 18-30
3 18-30 1 18-30
3 18-30 2 50-60
My aim is to calculate the mean number of contacts each age group of participant has had with each age bracket of contact. Something similar to:
age_bracket 18-30 30-40 40-50
18-30 1 3 2
30-40 1.5 4 2
40-50 3 4 1
I have been attempting to use the group_by and spread functions available in dplyr. The closest I have come is using
data%>%
group_by(participant_age, contact_age) %>%
tally() %>%
spread(key = participant_age, value = n)
But this produces the total number (n) of each contact, rather than the mean number of contacts per age bracket.
In base R use tapply.
t(with(dat, tapply(contact, list(contact_age, participant_age), mean)))
# 18-30 30-40 50-60
# 18-30 1 2 2
# 30-40 1 NA NA
Data:
dat <- structure(list(participant = c(1L, 1L, 2L, 3L, 3L), participant_age = c("18-30",
"18-30", "30-40", "18-30", "18-30"), contact = c(1L, 2L, 1L,
1L, 2L), contact_age = c("18-30", "30-40", "18-30", "18-30",
"50-60")), class = "data.frame", row.names = c(NA, -5L))
If I understood your aim correctly, you were pretty close to the right solution:
data %>%
group_by(participant_age, contact_age) %>%
summarise(mean = mean(contact), .groups = "drop") %>%
spread(key = participant_age, value = mean)
You can use pivot_wider and pass the function to apply in values_fn :
tidyr::pivot_wider(df, names_from = contact_age, values_from = contact, values_fn = mean)
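The answers above average the contact column itself, which numbers contacts within a participant. Another possible reading of the aim is the average, per participant, of how many contacts fall in each contact age bracket. A base R sketch under that assumption follows; the five-row data is repeated so the block runs on its own, and per_person and mean_contacts are illustrative names. Participants with no contacts at all do not appear in this data, so they are not counted.

```r
dat <- data.frame(
  participant     = c(1L, 1L, 2L, 3L, 3L),
  participant_age = c("18-30", "18-30", "30-40", "18-30", "18-30"),
  contact         = c(1L, 2L, 1L, 1L, 2L),
  contact_age     = c("18-30", "30-40", "18-30", "18-30", "50-60")
)

# Number of contacts per participant in each contact age bracket (zeros kept)
per_person <- as.data.frame(table(participant = dat$participant,
                                  contact_age = dat$contact_age))

# Attach each participant's own age bracket
per_person$participant_age <- dat$participant_age[
  match(per_person$participant, dat$participant)]

# Average those counts over the participants in each age bracket
mean_contacts <- with(per_person,
                      tapply(Freq, list(participant_age, contact_age), mean))
mean_contacts
```

Rows are participant age brackets and columns are contact age brackets, matching the layout requested in the question.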
This is the library I am using for creating dummies
install.packages("fastDummies")
library(fastDummies)
This is the dataset
winners <- data.frame(
city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"),
year = c(1990, 2000, 1990),
crime = 1:3)
Let's then create dummies out of these cities:
dummy_cols(winners, select_columns = c("city"))
The results are
city year crime city_SaoPaulito city_NewAmsterdam city_BeatifulCow
1 SaoPaulito 1990 1 1 0 0
2 NewAmsterdam 2000 2 0 1 0
3 BeatifulCow 1990 3 0 0 1
So the question is: how can I get back to the previous dataset? Any ideas?
Thanks in advance!
We can use dcast
library(data.table)
dcast(setDT(winners), crime ~ city, length)
If we need to get the input, it would be
subset(df1, select = 1:3)
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
Or with melt
melt(setDT(df1), measure = patterns("_"))[value == 1, .(city, year, crime)]
# city year crime
#1: SaoPaulito 1990 1
#2: NewAmsterdam 2000 2
#3: BeatifulCow 1990 3
data
df1 <- structure(list(city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"
), year = c(1990L, 2000L, 1990L), crime = 1:3, city_SaoPaulito = c(1L,
0L, 0L), city_NewAmsterdam = c(0L, 1L, 0L), city_BeatifulCow = c(0L,
0L, 1L)), class = "data.frame", row.names = c("1", "2", "3"))
If you are going to have only one city as 1 in each row, you can just skip the dummy columns
df[, 1:3]
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
If a row can have multiple cities, one way using dplyr and tidyr::gather is
library(dplyr)
df %>%
tidyr::gather(key, value, starts_with("city_")) %>%
filter(value == 1) %>%
select(-value, -key)
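If the original city column were no longer available, the label could still be rebuilt from the dummy columns themselves. A minimal base R sketch, assuming exactly one dummy equals 1 in each row and that the columns follow the city_<value> naming pattern shown in the question; dummies_only and restored are made-up names for illustration.

```r
# Hypothetical input: the dummy columns without the original city column
dummies_only <- data.frame(
  year = c(1990, 2000, 1990),
  crime = 1:3,
  city_SaoPaulito   = c(1L, 0L, 0L),
  city_NewAmsterdam = c(0L, 1L, 0L),
  city_BeatifulCow  = c(0L, 0L, 1L)
)

# Names of the dummy columns
dummy_names <- grep("^city_", names(dummies_only), value = TRUE)

# For each row, find the column holding the 1 and strip the prefix
hit <- max.col(as.matrix(dummies_only[dummy_names]), ties.method = "first")
restored <- data.frame(
  city = sub("^city_", "", dummy_names[hit]),
  dummies_only[c("year", "crime")]
)
restored
```

If a row could contain more than one 1, max.col() would keep only the first match, so the gather-based approach would be the safer choice in that case.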
I have made a function which increments the values in certain columns in a certain row. I did this by writing a function that subsets through my dataframe to find the row it needs (by looking at sex, then age, then deprivation, then number of partners) and then adds numbers to whichever column I need it to (depending on these risk factors), it then calculates the risk (my code is for STI testing).
However, this does not change my existing dataframe with the new values, but creates a new variable patientRow which holds these new values. I need help with how I can incorporate this into my existing dataframe. Thanks!
adaptRisk <- function(dataframe, sexNum, ageNum, deprivationNum,
partnerNum, testResult){
sexRisk = subset(dataframe, sex == sexNum)
ageRisk = subset(sexRisk, age == ageNum)
depRisk = subset(ageRisk, deprivation == deprivationNum)
patientRow = subset(depRisk, partners == partnerNum)
if (testResult == "positive") {
patientRow$tested <- patientRow$tested + 1
patientRow$infected <- patientRow$infected + 1
}
else if (testResult == "negative") {
patientRow$tested <- patientRow$tested + 1
}
patientRow <- transform(patientRow, risk = infected/tested)
return(patientRow)
}
This is the head of my dataframe to give you an idea:
sex age deprivation partners tested infected risk
1 Female 16-19 1-2 0-1 132 1 0.007575758
2 Female 16-19 1-2 2 25 1 0.040000000
3 Female 16-19 1-2 >=3 30 1 0.033333333
4 Female 16-19 3 0-1 80 2 0.025000000
5 Female 16-19 3 2 12 1 0.083333333
6 Female 16-19 3 >=3 18 1 0.055555556
The dput of my data is:
structure(list(sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label =
c("Female",
"Male"), class = "factor"), age = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("16-19", "20-24", "25-34", "35-44"), class =
"factor"),
deprivation = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1-2",
"3", "4-5"), class = "factor"), partners = structure(c(2L,
3L, 1L, 2L, 3L, 1L), .Label = c(">=3", "0-1", "2"), class = "factor"),
tested = c(132L, 25L, 30L, 80L, 12L, 18L), infected = c(1L,
1L, 1L, 2L, 1L, 1L), uninfected = c(131L, 24L, 29L, 78L,
11L, 17L), risk = c(0.00757575757575758, 0.04, 0.0333333333333333,
0.025, 0.0833333333333333, 0.0555555555555556)), .Names = c("sex",
"age", "deprivation", "partners", "tested", "infected", "uninfected",
"risk"), row.names = c(NA, 6L), class = "data.frame")
An example call to the function:
adaptRisk(data, "Female", "16-19", 3, 2, "positive")
sex age deprivation partners tested infected uninfected risk
5 Female 16-19 3 2 13 2 11 0.1538462
I have adjusted your function (see the bottom of this answer) using base R syntax. It does the job, but it is not the most beautiful code.
Issue:
The chained subset() calls create several intermediate data.frames that are not needed, instead of replacing the values in place where the conditions match. The function also returned only the filtered row, a different data.frame from the one passed in, so the original was never updated.
I adjusted it so that the filters are applied directly to the columns you want to change.
transform() might have unintended side effects, and you were recalculating the whole risk column. Now only the affected value is recalculated.
You might want to build in some warnings or stops in case the filters return more than one record.
To replace the values in the data.frame you supply to the function, assign the result back:
df <- adaptRisk(df, "Female", "16-19", "3", "2", "positive")
examples
# affects row 5
adaptRisk(df, "Female", "16-19", "3", "2", "positive")
sex age deprivation partners tested infected uninfected risk
1 Female 16-19 1-2 0-1 132 1 131 0.007575758
2 Female 16-19 1-2 2 25 1 24 0.040000000
3 Female 16-19 1-2 >=3 30 1 29 0.033333333
4 Female 16-19 3 0-1 80 2 78 0.025000000
5 Female 16-19 3 2 13 2 11 0.153846154
6 Female 16-19 3 >=3 18 1 17 0.055555556
# affects row 5
adaptRisk(df, "Female", "16-19", "3", "2", "negative")
sex age deprivation partners tested infected uninfected risk
1 Female 16-19 1-2 0-1 132 1 131 0.007575758
2 Female 16-19 1-2 2 25 1 24 0.040000000
3 Female 16-19 1-2 >=3 30 1 29 0.033333333
4 Female 16-19 3 0-1 80 2 78 0.025000000
5 Female 16-19 3 2 13 1 11 0.076923077
6 Female 16-19 3 >=3 18 1 17 0.055555556
function:
adaptRisk <- function(data, sexNum, ageNum, deprivationNum,
partnerNum, testResult){
if (testResult == "positive") {
data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] <- data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] + 1
data$infected[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] <- data$infected[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] + 1
data$risk[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] <- data$infected[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum]/data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum]
}
else if (testResult == "negative") {
data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] <- data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] + 1
data$risk[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] <- data$infected[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum]/data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum]
}
return(data)
}
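The same logic can be written more compactly by computing the matching row once, which also makes it easy to add the stop suggested above. This is only a sketch: adaptRisk2 is a hypothetical name, and unlike the function above it raises an error for an unknown testResult or when the filters do not match exactly one row. The small df is a cut-down copy of the question's data so that the block runs on its own.

```r
adaptRisk2 <- function(data, sexNum, ageNum, deprivationNum,
                       partnerNum, testResult) {
  stopifnot(testResult %in% c("positive", "negative"))

  # Locate the target row once instead of repeating the filter
  idx <- which(data$sex == sexNum &
               data$age == ageNum &
               data$deprivation == deprivationNum &
               data$partners == partnerNum)
  if (length(idx) != 1L) {
    stop("expected exactly one matching row, found ", length(idx))
  }

  # Every test increments tested; only positives increment infected
  data$tested[idx] <- data$tested[idx] + 1
  if (testResult == "positive") {
    data$infected[idx] <- data$infected[idx] + 1
  }

  # Recalculate risk for the affected row only
  data$risk[idx] <- data$infected[idx] / data$tested[idx]
  data
}

# Cut-down copy of the question's data
df <- data.frame(
  sex         = factor(rep("Female", 6)),
  age         = factor(rep("16-19", 6)),
  deprivation = factor(c("1-2", "1-2", "1-2", "3", "3", "3")),
  partners    = factor(c("0-1", "2", ">=3", "0-1", "2", ">=3")),
  tested      = c(132L, 25L, 30L, 80L, 12L, 18L),
  infected    = c(1L, 1L, 1L, 2L, 1L, 1L)
)
df$risk <- df$infected / df$tested

df <- adaptRisk2(df, "Female", "16-19", "3", "2", "positive")
df[5, ]
```

Like the longer version, this leaves any uninfected column untouched, so that would need its own update if it has to stay in sync.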
The function returns a single row that, apparently, you intend to use as a replacement for the original row. Because subset() keeps the original row names, you could write the row back by doing something like this:
## original data frame is named patientData
patientRow <- adaptRisk(patientData, "Female", "16-19", 3, 2, "positive")
patientData[row.names(patientRow), ] <- patientRow