Can weighted data be used with the CrossTable function in R? - r

I have attached sample weights to my data using the code below;
s_w <- couple_dta$h_sw /1000000
design <-svydesign(ids =~s_unit + hh, strata =~res , weights = s_w ,data =c_dta)
I had earlier created crosstables with unweighted data using the code;
CrossTable(c_dta$varA, c_dta$varB, prop.c = FALSE ,prop.r = FALSE , prop.chisq = FALSE , format = "SPSS")
The result of CrossTable with unweighted data is something below.
Variable B
Variable A | f | mf | m | Row Total |
---------------------|-----------|-----------|-----------|-----------|
m | n1 | n2 | n3 |n1 +n2+n3 |
|n1/N x 100 | n2/N x 100|n3/N x 100 | |
---------------------|-----------|-----------|-----------|-----------|
mf | n5 | n6 | n7 | n5+n6+n7 |
|n5/N x 100 |n6/N x 100 | n7/N x 100| |
---------------------|-----------|-----------|-----------|-----------|
f | n8 | n9 | n10 |n8+n9+n10 |
|n8/N x 100 |n9/N x 100 |n10/N x 100| |
---------------------|-----------|-----------|-----------|-----------|
Column Total | n1+n5+n8 | n2+n6+n9 |n3+n7+n10 | N |
---------------------|-----------|-----------|-----------|-----------|
Is there a way of incorporating weights to the second data. I have looked at 'prop.table(svytable)' but not sure how to proceed, given that I would also like to display the number of observations in each cell and the corresponding percentage.
Thank you in advance
Edit : I have used the svytable and Crosstable functions to achieve my goal.
table2 <- svytable(~c_dta$VarA + c_dta$wrd_VarB, design=design)
CrossTable(table2 ,prop.c = FALSE ,prop.r = FALSE , prop.chisq = FALSE , format = "SPSS")

Related

Add a value from a column in one table based off finding a result in another in R

I have a data table in R:
|gene | prom_65| prom_66| amast_69| amast_70| markerID|
|:--------------|---------:|---------:|---------:|---------:|---------:|
|ABC | 24.7361| 25.2550| 31.2974|45.4209 |16:123234_T/C; 16:54352342_A/T; 16:747564_T/G|
|DFG | 107.3580| 112.9870| 77.4182| 86.3211| 16:3453453_G/A; 16:765753453_A/T; 16:65345345_T/G|
|LKP | 72.0639| 86.1486| 68.5747| 77.8383| 16:25234453_G/C; 16:876456546_A/T; 16:4535_T/G|
|KLF | 43.8766| 53.4004| 34.0255| 38.4038| 16:87484_G/A; 16:5435_A/T; 16:747564_T/G|
|PPO | 2382.8700| 1871.9300| 2013.4200| 2482.0600| 16:785_T/C; 16:5435_A/T; 16:747564_T/G|
|LWPV | 49.6488| 53.7134| 59.1175| 66.0931| 16:123_T/C; 16:54564_A/T; 16:54646_T/G|
I have another data table:
|markerid | prom_65| prom_66| amast_69| amast_70| pvalue|
|:--------------|---------:|---------:|---------:|---------:|---------:|
|16:123234_T/C |x | x | x | x | x |
|16:3453453_G/A| x | x | x x | x |
I would like to add the gene column to table two for the markerid that matches the relevant gene in table one. In table one the markerIDs are all separated by a semi-colon and a markerID will only ever appear within one gene row in table1. In this example the output should look like this:
|markerid | prom_65| prom_66| amast_69| amast_70| pvalue |gene|
|:--------------|---------:|---------:|---------:|---------:|---------:|
|16:123234_T/C |x | x | x | x | x |ABC
|16:3453453_G/A | x | x | x | x | x |DFG
Not sure how to approach this in R.
Many thanks
Without a reproducible example of your table, it is hard to be sure what looks like the last column (it seems to be a list but not sure).
You can try on the second table:
Table2$gene <- sapply(Table2$markerid, function(x) Table1$Gene[grep(x,Table1$marker_id)])
Here an example with dummy dataframes:
dataA <- data.frame(Gene = LETTERS[1:5],
marker = paste(letters[6:10],"_A"))
Gene marker
1 A f _A
2 B g _A
3 C h _A
4 D i _A
5 E j _A
dataB <- data.frame(marker = letters[6:8])
marker
1 f
2 g
3 h
And now, if you use the sapply function:
dataB$Gene <- sapply(dataB$marker, function(x) dataA$Gene[grep(x,dataA$marker)])
1 f A
2 g B
3 h C
Does it look what you are trying to get ?
If it is not working, can you provide the output of str(Table1) ?

How to combine countpct and binomCI into the same summary statistic to be used in tableby function?

I'm using the tableby function from the arsenal package to create summary tables. For most of the statistics I need to generate, this package gives me exactly the format I'm asked except for one. I need to get in the same cell something like this:
n (%) [95%CI of the percentage]
For now, I'm using the countpct function which gives me the "n (%)" and binomCI which gives me the proportion with 95%CI but it doubles the number of rows in my final table so it's not ideal...
How could I do to have everything on the same line ?
I tried to see if I could create another function from the original ones but I don't really understand their syntax...
Thanks for your help.
EDIT : Here is a reproducible example.
Code for the original functions can be found here.
So this is what I have now :
data<-NULL
data$Visit2<-c(rep("Responder",121),rep("Not Responder",29),rep("Responder",4),rep("Not Responder",47))
data$Group<-c(rep("Tx",150),rep("No Tx",51))
data<-as.data.frame(data)
library(arsenal)
my_controls <- tableby.control(test = F,total = F, cat.stats = c("countpct" ,"binomCI"), conf.level = 0.95)
summary(tableby(Group ~ Visit2,
data = data,
control = my_controls),
digits=2, digits.p=3, digits.pct=1)
# Results :
| | No Tx (N=51) | Tx (N=150) |
|:-------------------------------|:-----------------:|:-----------------:|
|**Visit2** | | |
| Not Responder | 47 (92.2%) | 29 (19.3%) |
| Responder | 4 (7.8%) | 121 (80.7%) |
| Not Responder | 0.92 (0.81, 0.98) | 0.19 (0.13, 0.27) |
| Responder | 0.08 (0.02, 0.19) | 0.81 (0.73, 0.87) |
And this is what I want :
| | No Tx (N=51) | Tx (N=150) |
|:----------------|:-------------------------:|:------------------------:|
|**Visit2** | |
| Not Responder | 47 (92.2%) [81.1, 97.8] | 29 (19.3%) [13.3, 26.6] |
| Responder | 4 (7.8%) [2.2, 18.9] | 121 (80.7%) [73.4, 86.7] |
|

Generating table from dataframe with proportions of 20 variables, for each row, for each possible combination of said variable in R

I have a dataframe with 1000 rows representing a different species, for each of these rows are 20 columns with different proportions of a single variable (amino acids).
For each row (species), I would like to calculate the proportion of each possible combination of single letter variables (amino acids).
So each species should have 10 million calculated combinations of the amino acids.
My code for generating all possible combinations of amino acids is this:
S <- c('G','A','L','M','F','W','K','Q','E','S','P','V','I','C','Y','H','R','N','D','T')
allCombs <- function(x) c(x, lapply(seq_along(x)[-1L],
function(y) combn(x, y, collapse = "")),
recursive = TRUE)
Scombi <- allCombs(S)
My dataframe looks like this:
+----------------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Species | Domain | Actual OGT | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
+----------------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Acaryochloris_marina | Bacteria | 25 | 0.089806129655016 | 0.011179368033588 | 0.052093758404379 | 0.056116688487831 | 0.033311792369428 | 0.074719969063287 | 0.021456955206517 | 0.062874293719234 | 0.046629846831622 | 0.105160548187069 | 0.023372745414207 | 0.034667218445279 | 0.050847279968411 | 0.052372091362254 | 0.054393907299958 | 0.058415776607691 | 0.059282788930956 | 0.075786041807662 | 0.012266709932789 | 0.025246090272826 |
| Acetobacter_pasteurianus | Bacteria | 26 | 0.113635842586218 | 0.009802006063102 | 0.053600553080754 | 0.058133056353357 | 0.036903783608575 | 0.085210142094237 | 0.021833316616858 | 0.053123968429941 | 0.045353753818743 | 0.096549489115246 | 0.025913145427995 | 0.027225003296464 | 0.052562918173042 | 0.033342785074972 | 0.072705595398914 | 0.049908591821467 | 0.056094207383391 | 0.079084190962059 | 0.010144168305489 | 0.018873482389179 |
| Acetobacterium_woodii | Bacteria | 30 | 0.074955804625209 | 0.011863137047001 | 0.058166310295556 | 0.071786218284636 | 0.03424697521635 | 0.075626240308253 | 0.018397399287915 | 0.087245372635541 | 0.078978610001876 | 0.087790924875632 | 0.03068806687375 | 0.046498124583435 | 0.036120348133785 | 0.031790536900726 | 0.045179171055634 | 0.050727609439901 | 0.055617806111571 | 0.069643619533744 | 0.005984048340735 | 0.028693676448754 |
| Acetohalobium_arabaticum | Bacteria | 37 | 0.07294006171749 | 0.008402092275195 | 0.063388830763099 | 0.094174357919767 | 0.032968396601359 | 0.074335444399095 | 0.014775170057021 | 0.081175614650614 | 0.068173658934912 | 0.096191143631822 | 0.023591084039018 | 0.042176390239929 | 0.036535950562554 | 0.032690297143697 | 0.045929769851454 | 0.05201834344653 | 0.049098780255464 | 0.079225589949997 | 0.004923023531168 | 0.027286000029819 |
| Acholeplasma_laidlawii | Bacteria | 37 | 0.067353087090147 | 0.002160134400001 | 0.056809775441953 | 0.065310218890485 | 0.038735792072418 | 0.069508395797039 | 0.018942086187746 | 0.081435757342441 | 0.084786245636216 | 0.096181862610799 | 0.026545056054257 | 0.045549913713558 | 0.038323250930165 | 0.033008924859672 | 0.047150659509282 | 0.054698408656138 | 0.059971572823796 | 0.072199395290938 | 0.005926270925023 | 0.03540319176793 |
| Achromobacter_xylosoxidans | Bacteria | 30 | 0.120974236639852 | 0.008469732379263 | 0.054028585828065 | 0.055476991380945 | 0.035048667997051 | 0.086814010110846 | 0.02243157894653 | 0.050520668283285 | 0.039296015271673 | 0.099074202941835 | 0.028559018986725 | 0.025845147774914 | 0.049701994138614 | 0.034808403369533 | 0.073998251525545 | 0.050072992977641 | 0.051695040348985 | 0.080314177991249 | 0.011792085285623 | 0.021078197821829 |
+----------------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
So you can see, each row has the proportion of each amino acid (A,G,I etc.) over the entire set of amino acids, (all 20 add up to 1), but I would like to generate each possible combination, over 1. so something that looks like the following:
+----------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+
| Species | Domain | Actual OGT | A | AC | AD | AE |
+----------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+
| Acaryochloris_marina | Bacteria | 25 | 0.089806129655016 | 0.191179368033588 | 0.1782093758404379 | 0.186116688487831 |
+----------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+
So for each species, 10 million columns (each representing one of the possible combinations of amino acids, without repetition, so the largest string is 20 with each one)
Apologies for being unclear, does anyone have any ideas on how to create this data-set? (Or the best way of asking/explaining what I should be looking up?)
Species <- structure(list(Species = c("Acaryochloris_marina",
"Acetobacter_pasteurianus",
"Acetobacterium_woodii", "Acetohalobium_arabaticum", "Acholeplasma_laidlawii",
"Achromobacter_xylosoxidans"), Domain = c("Bacteria", "Bacteria",
"Bacteria", "Bacteria", "Bacteria", "Bacteria"), Actual.OGT = c(25,
26, 30, 37, 37, 30), A = c(0.089806129655016, 0.113635842586218,
0.074955804625209, 0.07294006171749, 0.067353087090147, 0.120974236639852
), C = c(0.011179368033588, 0.009802006063102, 0.011863137047001,
0.008402092275195, 0.002160134400001, 0.008469732379263), D = c(0.052093758404379,
0.053600553080754, 0.058166310295556, 0.063388830763099, 0.056809775441953,
0.054028585828065), E = c(0.056116688487831, 0.058133056353357,
0.071786218284636, 0.094174357919767, 0.065310218890485, 0.055476991380945
), F = c(0.033311792369428, 0.036903783608575, 0.03424697521635,
0.032968396601359, 0.038735792072418, 0.035048667997051), G = c(0.074719969063287,
0.085210142094237, 0.075626240308253, 0.074335444399095, 0.069508395797039,
0.086814010110846), H = c(0.021456955206517, 0.021833316616858,
0.018397399287915, 0.014775170057021, 0.018942086187746, 0.02243157894653
), I = c(0.062874293719234, 0.053123968429941, 0.087245372635541,
0.081175614650614, 0.081435757342441, 0.050520668283285), K = c(0.046629846831622,
0.045353753818743, 0.078978610001876, 0.068173658934912, 0.084786245636216,
0.039296015271673), L = c(0.105160548187069, 0.096549489115246,
0.087790924875632, 0.096191143631822, 0.096181862610799, 0.099074202941835
), M = c(0.023372745414207, 0.025913145427995, 0.03068806687375,
0.023591084039018, 0.026545056054257, 0.028559018986725), N = c(0.034667218445279,
0.027225003296464, 0.046498124583435, 0.042176390239929, 0.045549913713558,
0.025845147774914), P = c(0.050847279968411, 0.052562918173042,
0.036120348133785, 0.036535950562554, 0.038323250930165, 0.049701994138614
), Q = c(0.052372091362254, 0.033342785074972, 0.031790536900726,
0.032690297143697, 0.033008924859672, 0.034808403369533), R = c(0.054393907299958,
0.072705595398914, 0.045179171055634, 0.045929769851454, 0.047150659509282,
0.073998251525545), S = c(0.058415776607691, 0.049908591821467,
0.050727609439901, 0.05201834344653, 0.054698408656138, 0.050072992977641
), T = c(0.059282788930956, 0.056094207383391, 0.055617806111571,
0.049098780255464, 0.059971572823796, 0.051695040348985), V = c(0.075786041807662,
0.079084190962059, 0.069643619533744, 0.079225589949997, 0.072199395290938,
0.080314177991249), W = c(0.012266709932789, 0.010144168305489,
0.005984048340735, 0.004923023531168, 0.005926270925023, 0.011792085285623
), Y = c(0.025246090272826, 0.018873482389179, 0.028693676448754,
0.027286000029819, 0.03540319176793, 0.021078197821829)), .Names = c("Species",
"Domain", "Actual.OGT", "A", "C", "D", "E", "F", "G", "H", "I",
"K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"), row.names = c(NA,
-6L), class = "data.frame")
I'm not entirely sure that R is the right tool for this job. It's going to take a very, very long time. You may be able to reduce that time using the parallel package if you have sufficient cores, however.
I've put together a process that will accomplish what you want. For each species, it takes my computer about eight minutes to generate the "joint proportion." If you run on a single thread, as R will do inherently, you're looking at close to an hour just to accomplish the these six species in your sample data.
I wrote my script to run in parallel, and using seven cores, it took about 11 minutes to complete all six. Extending this over all 1000 species, I wouldn't be surprised if it took as long as two days to do all this (on seven cores). If you have a large cluster, you may be able to cut it down some.
Please note that this will not give you your results as described in your question. I posted a comment that I wasn't sure what formula you were using to get the joint proportions. I am just taking the sum here for ease of demonstration. You will need to adjust your code appropriately.
library(parallel)
library(dplyr)
library(tidyr)
library(magrittr)
# Reshape data. This will make it easier to split and access proportion
# within each species.
SpeciesLong <-
Species %>%
gather(protein, proportion,
A:Y) %>%
arrange(Species)
# Get unique species
S <- unique(SpeciesLong$protein)
# Build the combination list
# Note, this is different than your code, I added FUN = paste0
Scombi <- unlist(lapply(seq_along(S),
function(x) combn(S, x, FUN = paste0, collapse = "")))
# Function to get the joint proportion
# I took the sum, for convenience. You'll need to replace this
# with whatever function you use to get the joint proportion.
# The key part is getting the correct proteins, which happens within
# the `sum` call.
joint_protein <- function(protein_combo, data){
sum(data$proportion[vapply(data$protein,
grepl,
logical(1),
protein_combo)])
}
# make a list data frames, one for each species
SplitSpecies <-
split(SpeciesLong,
SpeciesLong$Species)
# Make a cluster of processors to run on
cl <- makeCluster(detectCores() - 1)
# export Scombi and joint_protein to all processes in the cluster
clusterExport(cl, c("Scombi", "joint_protein"))
# Get the aggregate values for each species in a one-row data frame.
SpeciesAggregate <-
parLapply(cl,
X = SplitSpecies,
fun = function(data){
X <- lapply(Scombi,
joint_protein,
data)
names(X) <- Scombi
as.data.frame(X)
})
# Join the results to the Species data
# You may want to save your data before this step. I'm not entirely
# sure I did this right to match the rows correctly.
Species <- cbind(Species, SpeciesAggregate)

Creating a grouped bar plot in R

I have a data frame df in R, that looks like this:
D C B E K R
Rd 80 80 80 80 80 80
Sw 100 100 100 100 100 100
Sf 100 100 100 100 100 100
I'm trying to plot the data in a bar plot. I need the y axis to have the range 0-100, and the x axis to be the category names. Basically, I need it to look like this:
100 | _ _ _ _ _ _ _ _ _ _ _ _
|_ _ _ _ _ _ | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | |
0 |_|_|_|_|_|_|__|_|_|_|_|_|_|__|_|_|_|_|_|_|_
D C B E K R D C B E K R D C B E K R
Rd Sw Sf
With all of the Ds the same color, all of the Cs the same colour, and so on.
I'm not sure how to do this, or which libraries to use.
So far I have:
counts <- as.matrix(df$D, df$C, df$B, df$E, df$K, df$R)
barplot(counts, beside = TRUE, space = c(0, 0, 0, 0, 0, 0), xlab = "",
col = c("coral", "coral1", "coral2", "coral3", "coral4", "cornflowerblue"),
names.arg = c("D", "C", "B", "E", "K", "R"))
mtext(side = 1, text = "x label", line = 7)
But it only displays something like this:
100 | _ _ _ _
|_| | |_| | |
| | | | | | |
| | | | | | |
| | | | | | |
0 |_|_|_|_|_|_|_
D C B E K R
x label
I'm not sure why I'm getting only this.
Any help would be much appreciated.
Try this:
df2 <- t(as.matrix(df))
bp <- barplot(df2,beside=TRUE,col=1:6)
mtext(rownames(df2),1,at=bp)
If you would like to edit things to spin axis labels or exactly position the group/bar labels, then you can do something like the below code. Change the line= arguments to affect the vertical position of the labels and the las=1 argument on the barplot call to spin the labels as required.
df2 <- t(as.matrix(df))
bp <- barplot(df2,beside=TRUE,col=1:6,axisnames=FALSE,las=1)
mtext(rownames(df2),1,at=bp,line=0.6)
mtext(colnames(df2),1,at=colMeans(bp),line=2)
set.seed(1)
df <- data.frame(groups = rep(c("Rd", "Sw", "Sf"), each=6),
bars = rep(c("D", "C", "B", "E", "K", "R"), 3),
values = sample(1:100,18))
library(ggplot2)
ggplot(df, aes(x=bars, y=values, fill=bars)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~groups, ncol=3)
Result: see the figure below.
Or, if you take the data frame from the question:
df <- read.table(sep=" ", header=T, text="
D C B E K R
Rd 80 80 80 80 80 80
Sw 100 100 100 100 100 100
Sf 100 100 100 100 100 100")
df$facet <- row.names(df)
library(reshape2)
df.long <- melt(df, "facet")
ggplot(df.long, aes(x=variable, y=value, fill=variable)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~facet, ncol=3)
It looks like you don't want that as.matrix call with the column vectors from the data.frame -- it is most likely that you want: as.matrix(D), for instance:
barplot(as.matrix(data.frame(rpois(10, 2), rpois(10, 1))), beside = T)
Basically, I am giving a simple example, imagine you have:
df = data.frame(x = c(1,2,3), y = c(2,3,4))
you were doing as.matrix(df$x, df$y) - that's not what you want, you want to do: as.matrix(df), thus:
barplot(as.matrix(df), beside = T)
will give you what you want. The key is to look at what "as.matrix" does.

By group: sum of variable values under condition

Sum of var values by group with certain values excluded conditioned on the other variable.
How to do it elegantly without transposing?
So in the table below for each (fTicker, DATE_f), I seek to sum the values of wght with the value of wght conditioned on sTicker excluded from the sum.
In the table below, (excl_val,sTicker=A) |(fTicker=XLK, DATE_f = 6/20/2003) = wght_AAPL_6/20/2003_XLK + wght_AA_6/20/2003_XLK but not the wght for sTicker=A
+---------+---------+-----------+-------------+-------------+
| sTicker | fTicker | DATE_f | wght | excl_val |
+---------+---------+-----------+-------------+-------------+
| A | XLK | 6/20/2003 | 0.087600002 | 1.980834016 |
| A | XLK | 6/23/2003 | 0.08585 | 1.898560068 |
| A | XLK | 6/24/2003 | 0.085500002 | |
| AAPL | XLK | 6/20/2003 | 0.070080002 | |
| AAPL | XLK | 6/23/2003 | 0.06868 | |
| AAPL | XLK | 6/24/2003 | 0.068400002 | |
| AA | XLK | 6/20/2003 | 1.910754014 | |
| AA | XLK | 6/23/2003 | 1.829880067 | |
| AA | XLK | 6/24/2003 | 1.819775 | |
| | | | | |
| | | | | |
+---------+---------+-----------+-------------+-------------+
There are several fTicker groups with many sTicker in them (10 to 70), some sTicker may belong to several fTicker. The end result should be an excl_val for each sTicker on each DATE_f and for each fTicker.
I did it by transposing in SAS with resulting file about 6 gb but the same approach in R, blew memory up to 40 gb and it's basically unworkable.
In R, I got as far as this
weights$excl_val <- with(weights, aggregate(wght, list(fTicker, DATE_f), sum, na.rm=T))
but it's just a simple sum (without excluding the necessary observation) and there is mismatch between rows length. If i could condition the sum to exclude the sTicker obs for wght from the summation, i think it might work.
About the excl_val length: i computed it in excel, for just 2 cells, that's why it's short
Thank you!
Arsenio
When you have data in a data.frame, it is better if the rows are meaningful
(in particular, the columns should have the same length):
in this case, excl_val looks like a separate vector.
After putting the information it contains in the data.frame,
things become easier.
# Sample data
k <- 5
d <- data.frame(
sTicker = rep(LETTERS[1:k], k),
fTicker = rep(LETTERS[1:k], each=k),
DATE_f = sample( seq(Sys.Date(), length=2, by=1), k*k, replace=TRUE ),
wght = runif(k*k)
)
excl_val <- sample(d$wght, k)
# Add a "valid" column to the data.frame
d$valid <- ! d$wght %in% excl_val
# Compute the sum
library(plyr)
ddply(d, c("fTicker","DATE_f"), summarize, sum=sum(wght[valid]))

Resources