Display absolute and relative frequencies with tabular - r

Perhaps someone has some advice.
I'm working on a report with knitr and the tables package. The tabular function generates my LaTeX code.
I want a table with both absolute and relative frequencies, but I don't know how to get it.
I tried to apply my own function within tabular to generate the relative values:
relative <- function(x){
  len <- length(x)
  ll <- length(table(x))
  lev <- unique(x)
  out <- numeric(length = ll)
  for(i in 1:ll){
    pos <- which(x == lev[i])
    out[i] <- length(pos)/len*100
  }
  return(out)
}
I passed this function to tabular, with Country and Type as two factors:
tabular(Country ~ Type*relative, data = cars)
Here is the result:
         Type
         PKW      SUV
Country  relative relative
Germany  0        0
Japan    0        0
Other    0        NaN
USA      0        0
This result is just an example of how it doesn't work :) My main goal is a table that contains both absolute AND relative frequencies, produced with tabular.
To clarify the problem, here is the result without the relative function:
tabular(Country ~ Type, data = cars)
         Type
Country  PKW SUV
Germany  23  1
Japan    22  17
Other    12  0
USA      58  22
Does anyone have an idea how to get this table?
Thanks in advance!

From the vignette:
The last factor, (mean + sd)
names two R functions. These are assumed to be functions that operate on a
vector and produce a single value, as mean and sd do. The values in the table
will be the results of applying those functions to the two different variables
and the subsets of the dataset.
I had overlooked this part, which is the crucial hint as to why the function I wrote doesn't work.
"relative" returns a vector with one element per distinct value of x, whereas tabular expects a function that returns a single value. The remaining problem is that there is no way to pass the base (the total count) into the function to calculate the relative values, because the returned value is calculated from a subset of x.
The following works, but it is not the most elegant way: I wrote the function with a hard-coded number of rows (here: 155).
relative <- function(x){
round(length(x)/155, 2)
}
tabular(Country ~ Type + relative, data = cars)
         Type
Country  PKW SUV relative
Germany  23  1   0.15
Japan    22  17  0.25
Other    12  0   0.08
USA      58  22  0.52
If anyone finds a better solution, please help :)
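One way to avoid the hard-coded 155, sketched here under assumptions: build the function from the row count with a small factory. make_relative is a hypothetical helper name, not part of the tables package, and the sketch assumes tabular accepts any function that returns a single value, as the vignette passage above describes.

```r
# Hypothetical helper: returns a function that reports which share of
# `n_total` observations falls into the subset it is given.
make_relative <- function(n_total) {
  function(x) round(length(x) / n_total, 2)
}

# With the real data this would be make_relative(nrow(cars));
# 155 is the row count quoted in the post.
relative <- make_relative(155)

# A subset holding 24 of the 155 rows (the Germany row in the post).
germany_share <- relative(seq_len(24))
print(germany_share)
```

The tabular call itself would presumably stay as it was, tabular(Country ~ Type + relative, data = cars), since only the definition of relative changes.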

Related

Find gene overlap as a percentage between lists

This question is a variant of the question found here.
I also want to evaluate the percent of gene overlap, but instead of doing a pairwise comparison of all-by-all within a single list, I want to compare one list to a different list set. The original post answer gave an elegant nested sapply, which I don't think will work in my case.
Here are some example data.
>listOfGenes1 <- list("cellLine1" = c("ENSG001", "ENSG002", "ENSG003"), "cellLine2" = c("ENSG003", "ENSG004"), "cellLine3" = c("ENSG004", "ENSG005"))
>myCellLine <- list("myCellLine" = c("ENSG001", "ENSG002", "ENSG003"))
I want to compare each of the cell lines in listOfGenes1 to the single group in myCellLine, with output something like:
>overlaps
cellLine1 cellLine2 cellLine3
100 33 0
To be clear, I would like percent overlap, with "myCellLine" as the denominator. Here is what I was trying so far that didn't work out.
overlaps <- sapply(listOfGenes1, function(g1) {round(length(intersect(g1, myCellLine)) / length(myCellLine) * 100)})
You can try,
round(sapply(listOfGenes1, function(i)
100 * length(intersect(i, myCellLine[[1]])) / length(myCellLine[[1]])), 0)
#cellLine1 cellLine2 cellLine3
# 100 33 0
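NOTE: Your myCellLine object is a list, so length(myCellLine) returns the number of list elements (1) rather than the number of genes, and intersect(g1, myCellLine) compares against the list rather than the gene vector. Extract the vector with myCellLine[[1]] first.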
We can use sapply
sapply(listOfGenes1, function(x) mean(myCellLine[[1]] %in% x) * 100)
#cellLine1 cellLine2 cellLine3
#100.00000 33.33333 0.00000
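For reference, both answers can be run end to end on the example data from the question. This is only a sketch; reference, overlap_intersect and overlap_in are names introduced here for illustration.

```r
listOfGenes1 <- list(
  cellLine1 = c("ENSG001", "ENSG002", "ENSG003"),
  cellLine2 = c("ENSG003", "ENSG004"),
  cellLine3 = c("ENSG004", "ENSG005")
)
myCellLine <- list(myCellLine = c("ENSG001", "ENSG002", "ENSG003"))

# Pull the gene vector out of the one-element list once.
reference <- myCellLine[[1]]

# Approach 1: size of the intersection relative to the reference set.
overlap_intersect <- sapply(listOfGenes1, function(genes) {
  round(100 * length(intersect(genes, reference)) / length(reference))
})

# Approach 2: share of reference genes that appear in each cell line.
overlap_in <- sapply(listOfGenes1, function(genes) {
  mean(reference %in% genes) * 100
})

print(overlap_intersect)
print(overlap_in)
```

The two approaches agree here because no gene is repeated within a vector; with duplicated entries in the reference vector they could differ, since intersect() drops duplicates and %in% does not.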

Why the for loop is not using the 'i' specified in the function

I have a data frame with 25 weeks of observations per animal and 20 animals in total. I am trying to write a function that calculates a linear equation between two consecutive points, and does that for all 25 weeks and all 20 animals.
I want to use a general form of the equation so I can calculate values at any point. In the function, Week=t, Weight=d.
I can't figure out how to make this work. I don't think the loop is using each row of the data frame as the index for the function. My data frame, named growth, looks something like this:
Week Weight Animal
1    50     1
2    60     1
...  (n=25 weeks per animal)
1    80     2
2    90     2
...  (up to animal 20)
for (i in growth$Week){
  eq<- function(t){
    d = growth$BW.Kg
    t = growth$Week
    (d[i+1]-d[i])/(t[i+1]-t[i])*(t-t[i])+d[i]
    return(eq)
  }
}
eq(3)
OK, so I think there are a few points of confusion here. The first is writing a function inside a for loop: you are redefining the function on every iteration, and the function doesn't save the values of your equation anywhere. Secondly, you are passing t as your argument but then expecting t to follow the for loop through the i value. Finally, you say that you want this done for each animal, but the animal column is not used in your code.
So it's a little bit hard to see what you are trying to achieve here.
Based on your information above, I've rewritten your function into something that will provide a result for your equation.
library(tidyverse)
growth <- tibble(week = 1:5,
                 animal = 1,
                 weight = c(50,52,55,54,57))
eq <- function(d,t,i){
  z <- (d[i+1]-d[i])/(t[i+1]-t[i])*(t-t[i])+d[i]
  return(z)
}
test_result <- eq(growth$weight,growth$week,3)
Results:
[1] 57 56 55 54 53
Is that the kind of result you were expecting? Or did you want just a single result per week per animal? Could you provide a working example of a formula that would produce a single desired result (i.e. a result for animal 1 on week 1)?
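If the goal is a weight estimate at any time point between the weekly measurements, which is only one possible reading of the question, base R's approx() already performs piecewise linear interpolation between consecutive points. The sketch below uses made-up data and a hypothetical helper, weight_at; the column names follow the answer above.

```r
# Made-up data: 5 weeks for each of 2 animals.
growth <- data.frame(
  week   = rep(1:5, times = 2),
  animal = rep(1:2, each = 5),
  weight = c(50, 52, 55, 54, 57,
             80, 90, 95, 99, 104)
)

# Hypothetical helper: interpolated weight of one animal at time t
# (t may be a vector and need not be a whole week).
weight_at <- function(data, animal_id, t) {
  one <- data[data$animal == animal_id, ]
  approx(x = one$week, y = one$weight, xout = t)$y
}

# Animal 1, halfway between week 3 (55) and week 4 (54).
w_single <- weight_at(growth, animal_id = 1, t = 3.5)

# The same time point for every animal at once.
w_all <- sapply(split(growth, growth$animal), function(one) {
  approx(one$week, one$weight, xout = 2.5)$y
})

print(w_single)
print(w_all)
```

Splitting by animal keeps the interpolation from running across the boundary between the last week of one animal and the first week of the next.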

Filter 0 values and output chi square results to a data frame in R

I have data that consists of many Cases (over 600) where I have two independent assessments to compare. I want to determine, based on the relative abundance of species observed, whether the differences between the assessments are due to random variation (differing plot locations/methodologies) or due to human error. The assessments were conducted by a forest manager (FM; generally an ocular estimate) and the ministry responsible for validating the result (MNRF; intensive plot based survey). A result with a p-value <0.05 would indicate that it is either highly unlikely that the two samples were taken from the same population, or that the less intensive method is not sufficiently accurate.
Species composition has been converted to counts of trees by species based on the number of plots established by MNRF. There are several species that may be encountered, but in each case there are generally fewer than 6. Species are identified by a two-letter code (e.g. PJ = jack pine, BW = white birch). An example of a single case is:
> head(case545)
  Case Source  PJ SB BW PO BF SW PR LA MR CW PW
1  545   MNRF  68 21 17 15  1  0  0  0  0  0  0
2  545     FM 101 13 13  0  0  0  0  0  0  0  0
I can calculate the statistic I want for this case using the code:
chisq.test(rbind(c(68,21,17,15,1),c(101,13,13,0,0)))
My problem is that I have a great many cases and I can't figure out how to tell R which values to use in each case. As far as I can tell, the logic flow should be:
1. identify and eliminate species where both assessments recorded a value of 0
2. ensure that the values are organized correctly for chisq.test
3. run the test and output a new table with the X2 and p-value for each case
Any help is greatly appreciated.
This might be of use, but it may require some changes depending on the nuances of your data.
For this example I recreate two cases with the naming convention caseXXX:
case545 <- data.frame(Case="545",
Source=c("XX","X1"), PJ=c(68,21),SB=c(17,13),BW=c(1,0), SW=c(0,0))
case546 <- data.frame(Case="546",
Source=c("XX","X1"), PJ=c(100,300),SB=c(0,0),BW=c(400,0), SW=c(300,500))
We then collect the names of all the data.frames that follow that naming convention:
library(dplyr)
DF <- ls(pattern = "case")
We then apply a function to each of those data.frames and bind the resulting rows together into one single data.frame.
This function does what you are asking:
1-Get rid of columns with only 0s
2-Compute the statistical test
3-Give us the X2 statistic and the p-value as a data.frame
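Output <- bind_rows(lapply(DF, function(case_name){
  # Look up the data.frame by its name
  TMP <- get(case_name)
  # 1 - Keep only columns that contain at least one non-zero value
  TMP <- TMP %>%
    select(which(colSums(TMP != 0) > 0))
  # 2 - Run the test on the count columns (the first two are Case and Source)
  TMP <- chisq.test(rbind(TMP[1,-c(1:2)],TMP[2,-c(1:2)]))
  # 3 - Return the X2 statistic and the p-value as a one-row data.frame
  TMP <- data.frame(X2=TMP$statistic,p=TMP$p.value,case=case_name)
  return(TMP)
}))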
> Output
X2 p case
1 4.703423 9.520608e-02 case545
2 550.000000 3.706956e-120 case546
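If the cases live in one combined data frame rather than in separate caseXXX objects, the same three steps can be sketched in base R with split(). This is an assumption about the data layout, not something the question confirms; assessments and test_case are hypothetical names, and the counts for case 546 are invented.

```r
# Hypothetical combined data: two rows (MNRF, FM) per case.
assessments <- data.frame(
  Case   = c(545, 545, 546, 546),
  Source = c("MNRF", "FM", "MNRF", "FM"),
  PJ = c(68, 101, 40, 35),
  SB = c(21, 13, 0, 0),
  BW = c(17, 13, 25, 30),
  PO = c(15, 0, 10, 12),
  BF = c(1, 0, 0, 0),
  SW = c(0, 0, 0, 0)
)

# Hypothetical helper: run the test for the two rows of a single case.
test_case <- function(case_df) {
  species <- setdiff(names(case_df), c("Case", "Source"))
  counts <- as.matrix(case_df[, species])
  # Step 1: drop species that neither assessment recorded.
  counts <- counts[, colSums(counts) > 0, drop = FALSE]
  # Steps 2 and 3: counts is already a 2 x k matrix, as chisq.test expects.
  # Small expected counts trigger an approximation warning, silenced here.
  res <- suppressWarnings(chisq.test(counts))
  data.frame(Case = case_df$Case[1],
             X2 = unname(res$statistic),
             df = unname(res$parameter),
             p = res$p.value)
}

results <- do.call(rbind, lapply(split(assessments, assessments$Case), test_case))
print(results)
```

The suppressed warning is worth reviewing before relying on the p-values, because several cells in data like this have very small expected counts.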

Sorting data by type in R

I am struggling to write a function for a dataset that looks like this:
identifier  age  occupation
pers1       18   student
pers2       45   teacher
pers3       65   retired
What I am trying to do is write a function that will:
1. sort my variables into numerical vs. factor variables
2. for the numerical variables, give me the mean, min and max
3. for the factor variables, give me a frequency table
4. return points (2) and (3) in a "nice" format (dataframe, vector or table)
So far, I have tried this:
describe <- function(x) {
  if (is.numeric(x)) {
    mean <- mean(x)
    min <- min(x)
    max <- max(x)
    d <- data.frame(mean, min, max)
  } else {
    factor <- table(x)
  }
}
stats <- lapply(data, describe)
Problems:
My problem is that now, "stats" is a list that is difficult to read and to export to Excel or share. I don't know how to make the list "stats" more reader-friendly.
Alternatively, is there maybe a better way to build the function "describe"?
Any thoughts on how to solve these two problems are much appreciated!
I may be late to the party, but maybe you still need a solution. I combined the answers from some of the comments on your post into the following code. It assumes you only have numerical columns and factors, and it scales to a large number of columns, as you specified:
# Just some sample data for my example, you don't need ggplot2.
library(ggplot2)
data=diamonds
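# Find which columns are numeric, and which are not.
# is.numeric() also catches integer columns and copes with columns that have
# more than one class (e.g. ordered factors).
is_num = sapply(data,is.numeric)
numeric = which(is_num)
non_numeric = which(!is_num)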
# create the summary objects
summ_numeric = summary(data[,numeric])
summ_non_numeric = summary(data[,non_numeric])
# result is easily written to csv
write.csv(summ_non_numeric,file="test.csv")
Hope this helps.
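Another option, sketched here with the three-row example from the question, is to have the describing function return a data frame in one common "long" shape so that the pieces can be stacked into a single table. describe_column and the column names variable, statistic and value are hypothetical choices made for this illustration.

```r
people <- data.frame(
  identifier = c("pers1", "pers2", "pers3"),
  age        = c(18, 45, 65),
  occupation = factor(c("student", "teacher", "retired"))
)

# Hypothetical helper: one row per statistic (numeric columns)
# or per level (all other columns).
describe_column <- function(x, name) {
  if (is.numeric(x)) {
    data.frame(variable  = name,
               statistic = c("mean", "min", "max"),
               value     = c(mean(x), min(x), max(x)))
  } else {
    counts <- table(x)
    data.frame(variable  = name,
               statistic = names(counts),
               value     = as.numeric(counts))
  }
}

# Describe every column except the identifier and stack the results.
vars  <- c("age", "occupation")
stats <- do.call(rbind, lapply(vars, function(v) describe_column(people[[v]], v)))
print(stats)
```

Because stats is a single data frame, write.csv(stats, file = "stats.csv") would export all of it at once.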
The desired functionality is already available elsewhere, so if you are not interested in coding it yourself then you can maybe use this. The Publish package can be used to generate a table for presentation in a paper. It is not on CRAN, but you can install it from github
devtools::install_github('tagteam/Publish')
library(Publish)
library(isdals) # Get some data
data(fev)
fev$Smoke <- factor(fev$Smoke, levels=0:1, labels=c("No", "Yes"))
fev$Gender <- factor(fev$Gender, levels=0:1, labels=c("Girl", "Boy"))
The univariateTable can generate a publication-ready table presenting the data. By default, univariateTable computes the mean and standard deviation for numeric variables and the distribution of observations in categories for factors. These values can be computed and compared across groups. The main input to univariateTable is a formula where the right-hand side lists the variables to be included in the table while the left-hand side --- if present --- specifies a grouping variable.
univariateTable(Smoke ~ Age + Ht + FEV + Gender, data=fev)
This produces the following output
  Variable     Level No (n=589) Yes (n=65) Total (n=654) p-value
1      Age mean (sd)  9.5 (2.7) 13.5 (2.3)     9.9 (3.0)  <1e-04
2       Ht mean (sd) 60.6 (5.7) 66.0 (3.2)    61.1 (5.7)  <1e-04
3      FEV mean (sd)  2.6 (0.9)  3.3 (0.7)     2.6 (0.9)  <1e-04
4   Gender      Girl 279 (47.4)  39 (60.0)    318 (48.6)
5                Boy 310 (52.6)  26 (40.0)    336 (51.4)  0.0714

Creating new columns using a for loop

I am new to R (Economist with background in Stata) and I am having trouble getting a nested for loop to work for me. I know the issue is that I don't have a good understanding of how to use the loop counter as part of a variable name.
A bit of background. I have data frame with data on average rental rates for homes of different size (1 bedroom, 2 bedroom, etc) and data on annual earnings (mean, median, and various percentiles). I am trying to generate a series of new columns containing the ratio of these two things (rental rate / mean earnings).
Specifically my variables are:
beds1, beds2, beds3, beds4
mean, median, p10, p25, p75, p90
So you see I need to generate 24 new columns of cost/earnings data. I could write out 24 lines of code but I don't want to. More importantly, I want to learn an efficient way of doing this in R. In Stata I could do this very simply using a nested for loop, but I can't get it to work in R. Here is my code so far.
for (i in 1:4) {
stat <- c("median", "mean", "p10", "p25", "p75","p90")
for (x in stat) {
df$beds[i]_[x] <- round((df$beds[i]/df$[x]),digits=3)
}
}
When I run this code the error I get is
Error: unexpected input in:
" for (x in stat) {
df$beds[i]_"
> }
Error: unexpected '}' in " }"
> }
Error: unexpected '}' in "}"
I have tried to use the double brackets [[]] but that didn't change the results. If anyone has some insight into why the dynamic variable names aren't working, please let me know. Even better, since I guess loops are evil in R, if anyone knows a way to use lapply to get this done, I would love to hear that too.
EDIT
Thanks @Spacedman for the comment. I think I am getting what you're saying. So does that mean that there simply isn't any way to do what I want to do in R?
var1 <- c("beds1", "beds2")
var2 <- c("mean", "median")
for (i in 1:2) {
for (j in 1:2) {
df$var1[i]_var2[j] <- df$var1[i]/df$var2[j]
}
}
I think this should grab the elements of the lists var1 and var2 so that when i=1 and j=1, df$var1[i]/df$var2[j] should mean df$beds1/df$mean. Or would R get mad and think I was trying to divide strings?
FINAL EDIT WITH ANSWER FROM @SPACEDMAN
Thanks @Spacedman. I loved your spoiler and thank you for providing additional help. I didn't fully grasp the difference between the two ways of referring to columns after your last post, but I think I have a better idea now. I did a bit of tweaking and now I have something that works perfectly. Thanks again!
beds <- c("beds1", "beds2", "beds3", "beds4")
stat <- c("median", "mean", "p10", "p25", "p75","p90")
for(i in beds){
  for(x in stat){
    res = paste0(i,"_",x)
    df[[res]]=round(df[[i]]/df[[x]],digits=3)
  }
}
R is not a macro expansion language like other languages you might be used to.
x[i], if i=123, does not "expand" into x123. It gets the value of the 123rd element of the vector, x.
So df$beds[i] tries to get the i'th element of a vector df$beds.
You need to know two things:
How to construct strings from other strings.
For this you can use paste0:
> for(i in 1:4){
+ print(paste0("beds",i))
+ }
[1] "beds1"
[1] "beds2"
[1] "beds3"
[1] "beds4"
How to access columns by names.
For this you can use double square brackets. In a list:
> z = list()
> n = "thing"
Double squabs evaluate their index and use that. So:
> z[[n]] = 99
Will set z$thing, but dollar sign indexing is literal, so:
> z$n = 123
will set z$n:
> z
$thing
[1] 99
$n
[1] 123
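The same distinction carries over to data frame columns. A short sketch with a throwaway data frame (rents and col are names made up for this illustration):

```r
z <- list()
n <- "thing"
z[[n]] <- 99   # the index is evaluated, so this sets z$thing
z$n <- 123     # the name is taken literally, so this sets z$n

# Throwaway data frame with two bed columns.
rents <- data.frame(beds1 = 1:3, beds2 = 4:6)
col <- paste0("beds", 2)   # builds the string "beds2"

by_brackets <- rents[[col]]   # looks up the column named "beds2"
by_dollar   <- rents$col      # looks for a column literally named "col"

print(names(z))
print(by_brackets)
print(is.null(by_dollar))
```

The dollar form returns NULL here because no column is literally called col, which is why the loop in the question needs [[ ]] with a constructed name.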
Hopefully that's enough hints to get you through. It should all be covered in basic R tutorials online.
Spoiler
If you want to work out how to do it yourself, look away now...
First, let's create a sample data frame - you should include something like this in your question so we have common test data to work on. I'll just have three beds and two stats:
> df = data.frame(
beds1=c(1,2,3),
beds2=c(5,2,3),
beds3=c(6,6,6),
mean=c(8,4,3),
median=c(1,7,4))
> df
  beds1 beds2 beds3 mean median
1     1     5     6    8      1
2     2     2     6    4      7
3     3     3     6    3      4
Now the work. We loop over the bed number and the character stats. The bed column name is stored in bed by pasting "beds" to the number i. We compute the name of the result column (res) for a given bed number and stat by pasting "beds" to i and "_" and the name of the stat in x.
Then set the new resulting column to the value by dividing the beds number by the stat. We use [[z]] to get the columns by name:
> for(i in 1:3){
    stats=c("mean","median")
    for(x in stats){
      bed = paste0("beds",i)
      res = paste0("beds",i,"_",x)
      df[[res]]=round(df[[bed]]/df[[x]],digits=3)
    }
  }
Resulting in....
> df
  beds1 beds2 beds3 mean median beds1_mean beds1_median beds2_mean beds2_median
1     1     5     6    8      1      0.125        1.000      0.625        5.000
2     2     2     6    4      7      0.500        0.286      0.500        0.286
3     3     3     6    3      4      1.000        0.750      1.000        0.750
  beds3_mean beds3_median
1       0.75        6.000
2       1.50        0.857
3       2.00        1.500
>
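Since the question also asked for a version without explicit loops, here is one possible sketch using expand.grid() and Map() on the same sample data frame. It is an alternative to the loop above, not a replacement for it; combos and ratios are names introduced for this example.

```r
df <- data.frame(
  beds1 = c(1, 2, 3),
  beds2 = c(5, 2, 3),
  beds3 = c(6, 6, 6),
  mean = c(8, 4, 3),
  median = c(1, 7, 4))

beds  <- c("beds1", "beds2", "beds3")
stats <- c("mean", "median")

# Every bed/stat pair, one row per new column.
combos <- expand.grid(bed = beds, stat = stats, stringsAsFactors = FALSE)

# One ratio vector per pair, with columns looked up by name via [[ ]].
ratios <- Map(function(bed, stat) round(df[[bed]] / df[[stat]], digits = 3),
              combos$bed, combos$stat)
names(ratios) <- paste0(combos$bed, "_", combos$stat)

# Attach all new columns in one step.
df <- cbind(df, as.data.frame(ratios))
print(df)
```

The new columns come out grouped by stat rather than by bed, which differs from the loop's ordering; the values are the same.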
