For a sample dataframe:
df <- structure(list(region = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("a", "b", "c", "d"), class = "factor"),
result = c(0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L), weight = c(0.126,
0.5, 0.8, 1.5, 5.3, 2.2, 3.2, 1.1, 0.1, 1.3, 2.5)), .Names = c("region",
"result", "weight"), row.names = c(NA, 11L), class = "data.frame")
I am producing a weighted xtab:
df$region <- factor(df$region)
result <- xtabs(weight ~ result + region, data=df)
result
Which gives:
      region
result     a     b
     0 6.926 6.900
     1 1.300 3.500
How can I flip the xtab so that the region and result variables swap places (i.e. region as rows and result as columns)?
I thought this might work, but it does not:
result <- xtabs(region + (weight ~ result), data=df)
Any help would be much appreciated.
Just reverse the order of the terms on the right-hand side of the formula:
xtabs(weight ~ region + result, data=df)
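As a cross-check, transposing the original table with t() should give the same layout as reversing the formula terms. A minimal sketch, assuming the sample data is rebuilt with data.frame() instead of the dput output (the names original, flipped and transposed are only illustrative):

```r
# Sample data from the question, rebuilt with data.frame()
df <- data.frame(
  region = factor(rep(c("a", "b"), times = c(5, 6))),
  result = c(0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L),
  weight = c(0.126, 0.5, 0.8, 1.5, 5.3, 2.2, 3.2, 1.1, 0.1, 1.3, 2.5)
)

original <- xtabs(weight ~ result + region, data = df)  # result as rows
flipped  <- xtabs(weight ~ region + result, data = df)  # region as rows

# t() swaps the rows and columns of the table that already exists
transposed <- t(original)

flipped
transposed
```

Reversing the formula is usually the cleaner choice when you are building the table anyway; t() is handy when the table has already been computed.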
I have a dataset and the task: "Average number of major credit cards held for people with top 10 income".
dput(head(creditcard))
structure(list(card = structure(c(2L, 2L, 2L, 2L, 2L, 2L), levels = c("no","yes"), class = "factor"), reports = c(0L, 0L, 0L, 0L, 0L, 0L), age = c(37.66667, 33.25, 33.66667, 30.5, 32.16667, 23.25), income = c(4.52, 2.42, 4.5, 2.54, 9.7867, 2.5), share = c(0.03326991, 0.005216942, 0.004155556, 0.06521378, 0.06705059, 0.0444384), expenditure = c(124.9833, 9.854167, 15, 137.8692, 546.5033, 91.99667), owner = structure(c(2L, 1L, 2L, 1L, 2L, 1L), levels = c("no", "yes"), class = "factor"), selfemp = structure(c(1L, 1L, 1L, 1L, 1L, 1L), levels = c("no", "yes"), class = "factor"),
dependents = c(3L, 3L, 4L, 0L, 2L, 0L), days = c(54L, 34L,58L, 25L, 64L, 54L), majorcards = c(1L, 1L, 1L, 1L, 1L, 1L), active = c(12L, 13L, 5L, 7L, 5L, 1L), income_fam = c(1.13, 0.605, 0.9, 2.54, 3.26223333333333, 2.5)), row.names = c("1","2", "3", "4", "5", "6"), class = "data.frame")
I tried to do the task like this:
round(mean(creditcard[order(creditcard$income, decreasing = TRUE),]$majorcards[1:10]))
But my solution turned out to be suboptimal, and I do not understand how to correct it.
You can get the 10 observations with the highest income using slice_max, then summarise them into a one-row data frame holding the mean of majorcards:
library(dplyr)
creditcard %>%
  slice_max(income, n = 10) %>%
  summarise(mean(majorcards))
If your dataset is one row per person, then you can do this:
library(dplyr)
creditcard %>%
  arrange(desc(income)) %>%
  slice_head(n = 10) %>%
  summarize(mean_cards = mean(majorcards, na.rm = TRUE))
Maybe something like:
mean(creditcard$majorcards[which(creditcard$income%in%sort(creditcard$income, decreasing = TRUE)[1:10])])
Using base R
with(creditcard, mean(head(majorcards[order(-income)], 10)))
Or in data.table
library(data.table)
setDT(creditcard)[order(-income), mean(head(majorcards, 10))]
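One thing to watch for: the approaches above can disagree when incomes are tied at the cut-off. Taking the first n rows after ordering returns exactly n rows, while matching on the top n income values keeps every tied row (slice_max() also keeps ties by default through with_ties = TRUE). A small sketch with made-up data (toy, top_exact and top_ties are hypothetical names, not part of the question's dataset):

```r
# Hypothetical toy data: two people tie for the second-highest income
toy <- data.frame(income = c(9, 7, 7, 3), majorcards = c(1, 0, 1, 1))
n <- 2

# Exactly n rows: order() breaks the tie by row position
top_exact <- head(toy$majorcards[order(-toy$income)], n)

# Value matching keeps every row whose income is among the top n values
top_values <- sort(toy$income, decreasing = TRUE)[1:n]
top_ties <- toy$majorcards[toy$income %in% top_values]

length(top_exact)  # 2 rows
length(top_ties)   # 3 rows, because both tied incomes are kept
mean(top_exact)
mean(top_ties)
```

Which behaviour is right depends on how the task defines "top 10" when incomes are equal.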
For a sample dataframe:
df <- structure(list(region = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("a", "b", "c", "d"), class = "factor"),
result = c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L), weight = c(0.126,
0.5, 0.8, 1.5, 5.3, 2.2, 3.2, 1.1, 0.1, 1.3, 2.5)), .Names = c("region",
"result", "weight"), row.names = c(NA, 11L), class = "data.frame")
df$region <- factor(df$region)
result <- xtabs(weight ~ region + result, data=df)
result
I want to order the rows of the table by the 1 column of result. As I understand it (from here), I could use order:
result <- result[order(result[, 2], decreasing=T),]
result
      result
region   0     1
     b 6.9 3.500
     a 5.8 2.426
However, this just orders by the weighted total of 1s. I want instead to order by the proportion of 1s in each region (i.e. the percentage). How can I use order (or something else) to arrange my xtab this way?
Use prop.table:
result[order(prop.table(result,1)[,2], decreasing=TRUE),]
#       result
# region   0     1
#      b 6.9 3.500
#      a 5.8 2.426
Where prop.table(result,1) gives:
prop.table(result,1)
#       result
# region         0         1
#      a 0.7050814 0.2949186
#      b 0.6634615 0.3365385
For a sample dataframe:
df <- structure(list(area = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 4L), .Label = c("a1", "a2", "a3", "a4"), class = "factor"),
result = c(0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L),
weight = c(0.5, 0.8, 1, 3, 3.4, 1.6, 4, 1.6, 2.3, 2.1, 2,
1, 0.1, 6, 2.3, 1.6, 1.4, 1.2, 1.5, 2, 0.6, 0.4, 0.3, 0.6,
1.6, 1.8)), .Names = c("area", "result", "weight"), class = "data.frame", row.names = c(NA,
-26L))
I wish to calculate the risk difference between all combinations of areas (i.e. a1 and a2, a1 and a3, a2 and a3). Preferably this would be in a matrix form.
So far, I have only compared the risk difference (RD) between the areas with the highest and lowest results:
#Include only regions with highest or lowest percentage
df.summary <- data.table(df.summary)
incl <- df.summary[c(which.min(result), which.max(result)),area]
df.new <- df[df$area %in% incl,]
df.new$area <- factor(df.new$area)
#Produce weighted xtabs table
df.xtabs <- xtabs(weight ~ result + area, data=df.new)
df.xtabs
#Run risk difference
RD.result <- prop.test(x=df.xtabs[,2], n=rowSums(df.xtabs), correct = FALSE)
RD <- round(- diff(RD.result$estimate), 3)
But how would I change this so that the code runs through all combinations of areas without my having to specify each one in turn? (I may have up to 19 areas.)
You can do it with the combn function. For example:
# All unique pairs of areas, one pair per column
uniqueCombinations <- combn(unique(as.character(df$area)), 2)
# Two columns for the pair of areas and one for the RD value
resultDF <- data.frame(matrix(NA, nrow=ncol(uniqueCombinations), ncol=2+1))
names(resultDF) <- c(paste0("area_", 1:2), "RD")
for(i in seq_len(ncol(uniqueCombinations))){
  # Pick one unique pair of areas
  incl <- uniqueCombinations[,i]
  # Your code: keep only the two areas and drop unused factor levels
  df.new <- df[df$area %in% incl,]
  df.new$area <- factor(df.new$area)
  # Produce weighted xtabs table
  df.xtabs <- xtabs(weight ~ result + area, data=df.new)
  # Run risk difference
  RD.result <- prop.test(x=df.xtabs[,2], n=rowSums(df.xtabs), correct = FALSE)
  RD <- round(- diff(RD.result$estimate), 3)
  # Store the pair and its RD
  resultDF[i, 1:2] <- incl
  resultDF[i, 3] <- RD
}
resultDF
UPDATE: the code now creates resultDF, which stores the result from each iteration of the loop.
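Since the question asks for a matrix if possible, here is a rough sketch of one way to get there with outer(). It rests on an assumption: it takes the "risk" in each area to be the weighted proportion of result == 1 within that area, which may not be the same quantity as the prop.test() call in the loop estimates. It also returns point estimates only, with no tests or confidence intervals. The names p and rd are illustrative.

```r
# Sample data from the question, rebuilt with data.frame()
df <- data.frame(
  area = factor(rep(c("a1", "a2", "a3", "a4"), times = c(7, 9, 5, 5))),
  result = c(0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L,
             0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L),
  weight = c(0.5, 0.8, 1, 3, 3.4, 1.6, 4, 1.6, 2.3, 2.1, 2, 1, 0.1,
             6, 2.3, 1.6, 1.4, 1.2, 1.5, 2, 0.6, 0.4, 0.3, 0.6, 1.6, 1.8)
)

# Weighted proportion of result == 1 in each area (assumed definition of risk)
p <- sapply(split(df, df$area),
            function(d) weighted.mean(d$result == 1, d$weight))

# rd[i, j] is the risk in area i minus the risk in area j
rd <- round(outer(p, p, "-"), 3)
rd
```

The matrix is antisymmetric with zeros on the diagonal, so either triangle contains every pairwise comparison.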
For a sample dataframe:
df <- structure(list(area = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 4L), .Label = c("a1", "a2", "a3", "a4"), class = "factor"),
result = c(0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L),
weight = c(0.5, 0.8, 1, 3, 3.4, 1.6, 4, 1.6, 2.3, 2.1, 2,
1, 0.1, 6, 2.3, 1.6, 1.4, 1.2, 1.5, 2, 0.6, 0.4, 0.3, 0.6,
1.6, 1.8)), .Names = c("area", "result", "weight"), class = "data.frame", row.names = c(NA,
-26L))
I am trying to isolate the areas with the highest and lowest results and then produce a weighted crosstab, which is then used to calculate the risk difference.
df.summary <- setDT(df)[,.(.N, freq.1 = sum(result==1), result = weighted.mean((result==1),
w = weight)*100), by = area]
#Include only regions with highest or lowest percentage
df.summary <- data.table(df.summary)
incl <- df.summary[c(which.min(result), which.max(result)),area]
df.new <- df[df$area %in% incl,]
incl
'incl' has the two areas that I want, but still the four levels:
[1] a2 a3
Levels: a1 a2 a3 a4
How do I get rid of the levels as well? The subsequent analysis that I want to do needs just the two levels as well as the areas. Any ideas?
I found this elsewhere on the web (e.g. Problems with levels in a xtab in R)
df.new$area <- factor(df.new$area)
It works!
Hope it's useful for others.
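For what it's worth, base R also has droplevels(), which removes unused levels from a single factor or from every factor column in a data frame. A minimal sketch with a cut-down stand-in for df.new (the two-row data frame here is made up for illustration):

```r
# A factor that still carries all four levels after subsetting
incl <- factor(c("a2", "a3"), levels = c("a1", "a2", "a3", "a4"))
levels(droplevels(incl))

# The same idea applied to a whole data frame
df.new <- data.frame(area = incl, weight = c(1.6, 1.4))
df.new <- droplevels(df.new)
levels(df.new$area)
```

Calling factor() again on the column, as above, does the same job for one column at a time.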
In a sample (rows) by species (columns) matrix, that contains subsets (as assigned by column Treatment):
data <- structure(list(S1 = c(0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L), S2 = c(0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L), S3 = c(0L, 1L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), Treatment = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("S1",
"S2", "S3", "Treatment"), class = "data.frame", row.names = c(NA,
9L))
I would like to identify those species that only occur in a given treatment.
Is that possible? Thank you very much!
Edit:
I'd like to know i) the number of unique species per treatment, and ii) I would like to create vectors containing the names of the species that are unique to each treatment.
For the species names that are unique to a treatment, I would go with the following (though it could probably be optimized):
sapply(data[-4L], function(x) {
temp <- data[x == 1L, 4L]
if(length(unique(temp)) == 1) as.character(unique(temp)) else ""
})
# S1 S2 S3
# "" "B" "A"
For the number of unique species per treatment, here's a vectorized option
rowSums(!!rowsum(data[-4L], data[, 4L]))
# A B C
# 2 2 1
library(dplyr)
data %>%
group_by(Treatment) %>%
summarise(S1 = any(S1 == 1),
S2 = any(S2 == 1),
S3 = any(S3 == 1))
Gives you one row per treatment and one column per species. TRUE indicates the species was found in that treatment.
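For part ii) of the question, one possible base R sketch: build a treatment-by-species presence matrix, keep the species that are present in exactly one treatment, and list them per treatment. The names pres, only_one and unique_by_trt are illustrative, and the data is rebuilt with data.frame() from the question's dput output.

```r
# Sample data from the question, rebuilt with data.frame()
data <- data.frame(
  S1 = c(0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L),
  S2 = c(0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L),
  S3 = c(0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Treatment = factor(rep(c("A", "B", "C"), each = 3))
)

# TRUE where a species occurs at least once in a treatment
pres <- as.matrix(rowsum(data[c("S1", "S2", "S3")], data$Treatment)) > 0

# Species found in exactly one treatment
only_one <- colSums(pres) == 1

# One character vector of unique species per treatment
unique_by_trt <- lapply(
  setNames(rownames(pres), rownames(pres)),
  function(trt) colnames(pres)[pres[trt, ] & only_one]
)

unique_by_trt
lengths(unique_by_trt)  # number of species unique to each treatment
```

Here lengths() counts only species exclusive to a treatment, which is a different quantity from the number of species merely present in it.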