Ridit scoring (https://en.wikipedia.org/wiki/Ridit_scoring) is often used to transform an ordinal categorical variable into relative frequencies (the proportion of cases below a given value, plus one half of the proportion at that value).
How would you do this in R?
FURTHER UPDATE
These and several other functions are now available in the CRAN package ridittools, maintained by yours truly.
UPDATE
Removed some rather silly code that constructed a conversion matrix; I'd forgotten about cumsum().
# Convert vector of counts to ridits
to.ridit <- function(v) {
  (cumsum(v) - .5 * v) / sum(v)
}
# Calculate mean ridit for vector of counts relative to reference group
mean.ridit <- function(v, ref) {
  sum(to.ridit(ref) * v) / sum(v)
}
# Calculate mean ridits for several groups
# x is matrix of counts
# margin is 1 for groups in rows, 2 for groups in columns
# If ref is omitted, totals across groups are used as reference group
# If ref is a vector of counts, it's used as reference group
# Otherwise, ref is the number (or name if it exists) of the group to use as reference
ridits <- function(x, margin, ref=NULL) {
  if (length(ref) > 1) {
    refgroup <- ref
  } else if (length(ref) == 1) {
    if (margin==1) {
      refgroup <- x[ref,]
    } else {
      refgroup <- x[, ref]
    }
  } else {
    refgroup <- apply(x, 3-margin, sum)
  }
  apply(x, margin, mean.ridit, refgroup)
}
Example (Fleiss, 1981: severity of car accidents):
to.ridit(c(17, 54, 60, 19, 9, 6, 14))
[1] 0.04748603 0.24581006 0.56424581 0.78491620 0.86312849 0.90502793 0.96089385
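As a sanity check on the definition, the sketch below recomputes the first ridit by hand (nothing below the first category, plus half of its own count, divided by the total) and confirms that the mean ridit of a group relative to itself is 0.5. The one-line to.ridit is repeated only so the snippet runs on its own; the names first_by_hand and self_mean are made up for this illustration.

```r
# Same definition as to.ridit() above, repeated so this snippet is self-contained
to.ridit <- function(v) (cumsum(v) - 0.5 * v) / sum(v)

counts <- c(17, 54, 60, 19, 9, 6, 14)   # Fleiss (1981) accident severity counts
n <- sum(counts)

# First category: no cases below it, plus half of its own cases
first_by_hand <- (0 + 0.5 * counts[1]) / n

r <- to.ridit(counts)

# Mean ridit of the reference group against itself (weights are its own counts)
self_mean <- sum(r * counts) / n

print(c(first_by_hand = first_by_hand, first_ridit = r[1], self_mean = self_mean))
```

A mean ridit above 0.5 for some other group would then suggest that group tends to fall in higher categories than the reference group.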
Note
Although my code is somewhat less flexible than the Ridit package (Ridit::ridit) mentioned in another answer, it seems to be quite a bit faster:
# Influenza subtypes by age as of week ending 2/24/18 (US CDC)
> flu.age
        BY  BV  BU   H3   H1
0-4    274  91  92 1808  500
5-24  1504 274 698 5090  951
25-64 1665 101 567 7538 1493
65+   1476  35 330 9541  515
# Using CRAN package
> system.time(ridit(flu.age,2))
   user  system elapsed
  3.746   0.007   3.756
# Using my code
> system.time(ridits(flu.age,2))
   user  system elapsed
  0.001   0.000   0.000
The following package may solve your problem. The function Ridit::ridit is especially useful; its documentation describes it as follows:
An extension of the Kruskal-Wallis test that allow specify arbitrary reference group. Also provide
Mean Ridit for each group. Mean Ridit of a group is an estimate of probability a random observation
from that group will be greater than or equal to a random observation from reference group.
https://cran.r-project.org/web/packages/Ridit/Ridit.pdf
An alternative approach would be to use a binary choice model such as Probit, Logit or Exact Logit and extract the predicted values of the dependent variable, i.e. 0 or 1.
Related
I'm trying to figure out how to repeat the same code 30 times without typing each one at a time... any help will be much appreciated.
SRS_1 <- sample(1:nrow(MyData_points), size=.10*nrow(MyData_points))
data_sample_1 <- MyData_points[SRS_1,]
fpc.srs <- rep(6399875, 639987)
design_SRS_1 <- svydesign(id=~1, strata=NULL, data=data_sample_1, fpc=fpc.srs)
ONStotal_SRS1 <- svytotal(~data_sample_1$V4, design=design_SRS_1)
ONSmean_SRS1 <- svymean(~data_sample_1$V4, design=design_SRS_1)
CI_SRS_1 <- confint(svytotal(~data_sample_1$V4, design=design_SRS_1))
The first line draws a simple random sample of row indices covering 10% of the data. The second extracts the sampled rows from the data. The third sets up the fpc (finite population correction), with the sample being 10% of the total data points. To estimate the population values I then need a design for the sample (drawn without replacement) that includes the fpc. Finally, the last three lines calculate a population total, a mean and a confidence interval based on that sample.
What changes is that I must repeat this for 30 different simple random samples from the data. The resulting total, mean and confidence interval will therefore come from 30 different samples. They might be close, but not equal.
How can I make this code better so I can run it 30 times each and be able to print a table with (ONStotal_SRS1, ONSmean_SRS1,CI_SRS_1)?
Usually I would use either rbindlist from the data.table package or bind_rows from dplyr in combination with an lapply to build the table a row at a time and then bind the rows together. Here is an example using bind_rows with the mtcars data set:
library(dplyr)
combined_data <- bind_rows(lapply(1:30, function(...) {
  # Take a sample
  SRS_1 <- sample(1:nrow(mtcars), size = .10 * nrow(mtcars))
  data_sample_1 <- mtcars[SRS_1, ]
  # Compute some things from the sample
  m_disp <- mean(data_sample_1$disp)
  m_hp <- mean(data_sample_1$hp)
  # Make a one row data.frame that will be returned by the function
  data.frame(m_disp, m_hp)
}))
Which gives this data.frame:
> str(combined_data)
'data.frame': 30 obs. of 2 variables:
$ m_disp: num 235 272 410 115 249 ...
$ m_hp : num 147 159 195 113 154 ...
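The same pattern can return a total, a mean and a confidence interval per replicate. The sketch below is only an illustration in base R: it uses mtcars$hp as a stand-in for the real variable, an assumed sample size of 8, and hand-written textbook formulas for simple random sampling without replacement (with the finite population correction and a normal-approximation interval) instead of the survey package calls from the question. The function name one_sample and the column names are hypothetical.

```r
set.seed(1)

pop <- mtcars        # stand-in for the full data (assumption)
N <- nrow(pop)       # population size
n <- 8               # sample size per replicate (assumption)

one_sample <- function(i) {
  idx <- sample(N, n)                    # SRS without replacement
  y <- pop$hp[idx]
  m <- mean(y)
  # Standard error of the mean with finite population correction (1 - n/N)
  se_mean <- sqrt((1 - n / N) * var(y) / n)
  total <- N * m                         # estimated population total
  se_total <- N * se_mean
  z <- qnorm(0.975)                      # 95% normal-approximation interval
  data.frame(rep = i,
             total = total,
             mean = m,
             ci_low = total - z * se_total,
             ci_high = total + z * se_total)
}

# One row per replicate, bound into a single table
results <- do.call(rbind, lapply(1:30, one_sample))
head(results)
```

With the survey package, the body of one_sample would instead build the design for that replicate and pull the numbers out of svytotal, svymean and confint; the surrounding lapply and rbind stay the same.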
I am struggling to write a function for a dataset that looks like this:
identifier age occupation
pers1 18 student
pers2 45 teacher
pers3 65 retired
What I am trying to do, is to write a function that will:
1. sort my variables into numerical vs. factor variables
2. for the numerical variables, give me the mean, min and max
3. for the factor variables, give me a frequency table
4. return points (2) and (3) in a "nice" format (dataframe, vector or table)
So far, I have tried this:
describe<- function(x)
{ if (is.numeric(x)) { mean <- mean(x)
min <- min(x)
max <- max(x)
d <- data.frame(mean, min, max)}
else { factor <- table(x) }
}
stats <- lapply(data, describe)
Problems:
My problem is that now, "stats" is a list that is difficult to read and to export to Excel or share. I don't know how to make the list "stats" more reader-friendly.
Alternatively, maybe is there a better way to build the function "describe"?
Any thoughts on how to solve these two problems are much appreciated!
I may be late to the party, but maybe you still need a solution. I combined the answers from some of the comments on your post into the following code. It assumes you only have numerical columns and factors, and scales to a large number of columns, as you specified:
# Just some sample data for my example, you don't need ggplot2.
library(ggplot2)
data=diamonds
# Find which columns are numeric, and which are not.
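# is.numeric() also catches integer columns, and copes with columns
# that have more than one class (e.g. ordered factors)
is_num = sapply(data, is.numeric)
numeric = which(is_num)
non_numeric = which(!is_num)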
# create the summary objects
summ_numeric = summary(data[,numeric])
summ_non_numeric = summary(data[,non_numeric])
# result is easily written to csv
write.csv(summ_non_numeric,file="test.csv")
Hope this helps.
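If the goal is output that exports cleanly, one option is to return two plain data frames instead of a list: one row per numeric variable, and one row per level of each non-numeric variable. The following is a minimal sketch on toy data shaped like the question's; the names num_summary and freq_tables are made up here, and it assumes every non-numeric column should be tabulated.

```r
# Toy data in the shape shown in the question
data <- data.frame(
  identifier = c("pers1", "pers2", "pers3"),
  age        = c(18, 45, 65),
  occupation = factor(c("student", "teacher", "retired"))
)

is_num <- sapply(data, is.numeric)

# One row per numeric variable, one column per statistic
num_summary <- as.data.frame(t(sapply(data[is_num], function(x) {
  c(mean = mean(x), min = min(x), max = max(x))
})))

# One row per (variable, level) pair for the non-numeric variables
freq_tables <- do.call(rbind, lapply(names(data)[!is_num], function(nm) {
  tab <- table(data[[nm]])
  data.frame(variable = nm, level = names(tab), n = as.vector(tab))
}))

print(num_summary)
print(freq_tables)
```

Both results are ordinary data frames, so each can be passed straight to write.csv for sharing or for opening in Excel.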
The desired functionality is already available elsewhere, so if you are not interested in coding it yourself you may be able to use this. The Publish package can be used to generate a table for presentation in a paper. It is not on CRAN, but you can install it from GitHub:
devtools::install_github('tagteam/Publish')
library(Publish)
library(isdals) # Get some data
data(fev)
fev$Smoke <- factor(fev$Smoke, levels=0:1, labels=c("No", "Yes"))
fev$Gender <- factor(fev$Gender, levels=0:1, labels=c("Girl", "Boy"))
The univariateTable can generate a publication-ready table presenting the data. By default, univariateTable computes the mean and standard deviation for numeric variables and the distribution of observations in categories for factors. These values can be computed and compared across groups. The main input to univariateTable is a formula where the right-hand side lists the variables to be included in the table while the left-hand side --- if present --- specifies a grouping variable.
univariateTable(Smoke ~ Age + Ht + FEV + Gender, data=fev)
This produces the following output
  Variable Level     No (n=589)  Yes (n=65)  Total (n=654)  p-value
1 Age      mean (sd) 9.5 (2.7)   13.5 (2.3)  9.9 (3.0)      <1e-04
2 Ht       mean (sd) 60.6 (5.7)  66.0 (3.2)  61.1 (5.7)     <1e-04
3 FEV      mean (sd) 2.6 (0.9)   3.3 (0.7)   2.6 (0.9)      <1e-04
4 Gender   Girl      279 (47.4)  39 (60.0)   318 (48.6)
5          Boy       310 (52.6)  26 (40.0)   336 (51.4)     0.0714
I am new to R (Economist with background in Stata) and I am having trouble getting a nested for loop to work for me. I know the issue is that I don't have a good understanding of how to use the loop counter as part of a variable name.
A bit of background. I have a data frame with data on average rental rates for homes of different sizes (1 bedroom, 2 bedroom, etc.) and data on annual earnings (mean, median, and various percentiles). I am trying to generate a series of new columns containing the ratio of these two things (rental rate / mean earnings).
Specifically my variables are:
beds1, beds2, beds3, beds4
mean, median, p10, p25, p75, p90
So you see I need to generate 24 new columns of cost/earnings data. I could write out 24 lines of code but I don't want to. More importantly, I want to learn an efficient way of doing this in R. In Stata I could do this very simply using a nested for loop, but I can't get it to work in R. Here is my code so far.
for (i in 1:4) {
stat <- c("median", "mean", "p10", "p25", "p75","p90")
for (x in stat) {
df$beds[i]_[x] <- round((df$beds[i]/df$[x]),digits=3)
}
}
When I run this code the error I get is
Error: unexpected input in:
" for (x in stat) {
df$beds[i]_"
> }
Error: unexpected '}' in " }"
> }
Error: unexpected '}' in "}"
I have tried to use the double brackets [[]] but that didn't change the results. If anyone has some insight into why the dynamic variables names aren't working please let me know. Even better, since I guess loops are evil in R, if anyone knows a way to use lapply to get this done, I would love to hear that too.
EDIT
Thanks @Spacedman for the comment. I think I am getting what you're saying. So does that mean that there simply isn't any way to do what I want to do in R?
var1 <- c("beds1", "beds2")
var2 <- c("mean", "median")
for (i in 1:2) {
for (j in 1:2) {
df$var1[i]_var2[j] <- df$var1[i]/df$var2[j]
}
}
I think this should grab the elements of the lists var1 and var2 so that when i=1 and j=1, df$var1[i]/df$var2[j] should mean df$beds1/df$mean. Or would R get mad and think I was trying to divide strings?
FINAL EDIT WITH ANSWER FROM @SPACEDMAN
Thanks @Spacedman. I loved your spoiler and thank you for providing additional help. I didn't fully grasp the difference between the two ways of referring to columns after your last post, but I think I have a better idea now. I did a bit of tweaking and now I have something that works perfectly. Thanks again!
beds <- c("beds1", "beds2", "beds3", "beds4")
stat <- c("median", "mean", "p10", "p25", "p75","p90")
for(i in beds){
  for(x in stat){
    res = paste0(i,"_",x)
    df[[res]]=round(df[[i]]/df[[x]],digits=3)
  }
}
R is not a macro expansion language like other languages you might be used to.
x[i], if i=123, does not "expand" into x123. It gets the value of the 123rd element of the vector, x.
So df$beds[i] tries to get the i'th element of a vector df$beds.
You need to know two things:
How to construct strings from other strings.
For this you can use paste0:
> for(i in 1:4){
+ print(paste0("beds",i))
+ }
[1] "beds1"
[1] "beds2"
[1] "beds3"
[1] "beds4"
How to access columns by names.
For this you can use double square brackets. In a list:
> z = list()
> n = "thing"
Double square brackets evaluate their index and use that. So:
> z[[n]] = 99
Will set z$thing, but dollar sign indexing is literal, so:
> z$n = 123
will set z$n:
> z
$thing
[1] 99
$n
[1] 123
Hopefully that's enough hints to get you through. It should all be covered in basic R tutorials online.
Spoiler
If you want to work out how to do it yourself, look away now...
First, let's create a sample data frame - you should include something like this in your question so we have common test data to work on. I'll just have three beds and two stats:
> df = data.frame(
beds1=c(1,2,3),
beds2=c(5,2,3),
beds3=c(6,6,6),
mean=c(8,4,3),
median=c(1,7,4))
> df
beds1 beds2 beds3 mean median
1 1 5 6 8 1
2 2 2 6 4 7
3 3 3 6 3 4
Now the work. We loop over the bed number and the character stats. The bed column name is stored in bed by pasting "beds" to the number i. We compute the name of the result column (res) for a given bed number and stat by pasting "beds" to i and "_" and the name of the stat in x.
Then set the new resulting column to the value by dividing the beds number by the stat. We use [[z]] to get the columns by name:
> for(i in 1:3){
stats=c("mean","median")
for(x in stats){
bed = paste0("beds",i)
res = paste0("beds",i,"_",x)
df[[res]]=round(df[[bed]]/df[[x]],digits=3)
}
}
Resulting in....
> df
beds1 beds2 beds3 mean median beds1_mean beds1_median beds2_mean beds2_median
1 1 5 6 8 1 0.125 1.000 0.625 5.000
2 2 2 6 4 7 0.500 0.286 0.500 0.286
3 3 3 6 3 4 1.000 0.750 1.000 0.750
beds3_mean beds3_median
1 0.75 6.000
2 1.50 0.857
3 2.00 1.500
>
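Since the question also asked about an lapply-style alternative, here is one possible loop-free sketch. It rebuilds the same sample data frame, lists every bed/stat pair with expand.grid, and uses Map to compute each ratio. The object names combos and ratios are made up for this illustration, and this is just one of several ways to do it.

```r
df <- data.frame(
  beds1  = c(1, 2, 3),
  beds2  = c(5, 2, 3),
  beds3  = c(6, 6, 6),
  mean   = c(8, 4, 3),
  median = c(1, 7, 4)
)

beds  <- c("beds1", "beds2", "beds3")
stats <- c("mean", "median")

# Every (bed, stat) pair as character columns
combos <- expand.grid(stat = stats, bed = beds, stringsAsFactors = FALSE)

# One ratio vector per pair; [[ ]] looks columns up by the name held in b and s
ratios <- Map(function(b, s) round(df[[b]] / df[[s]], digits = 3),
              combos$bed, combos$stat)
names(ratios) <- paste0(combos$bed, "_", combos$stat)

# Assign all new columns in one go
df[names(ratios)] <- ratios
df
```

The explicit double loop is arguably just as readable here; the main gain of this version is that the list of new column names is built once and can be inspected before anything is assigned.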
Perhaps someone has some advice.
I'm working on a report with knitr and the tables package. The function tabular generates my LaTeX code.
I want a table with absolute and relative frequencies, but I don't know how to get it.
I tried passing my own function, which computes the relative values, to tabular:
relative <- function(x){
  len <- length(x)
  ll <- length(table(x))
  lev <- unique(x)
  out <- numeric(length = ll)
  for(i in 1:ll){
    pos <- which(x == lev[i])
    out[i] <- length(pos)/len*100
  }
  return(out)
}
I then passed this function to tabular, with Country and Type as the two factors:
tabular(Country ~ Type*relative, data = cars)
Here the result:
         Type
         PKW      SUV
Country  relative relative
Germany  0        0
Japan    0        0
Other    0        NaN
USA      0        0
This result is just an example of how it doesn't work :) My main goal is a table that contains both absolute AND relative frequencies, produced with tabular.
To clarify the problem, here is the result without the function relative:
tabular(Country ~ Type, data = cars)
         Type
Country  PKW SUV
Germany   23   1
Japan     22  17
Other     12   0
USA       58  22
Does anyone have an idea how to get this table?
Thanks in advance!
From the vignette:
The last factor, (mean + sd)
names two R functions. These are assumed to be functions that operate on a
vector and produce a single value, as mean and sd do. The values in the table
will be the results of applying those functions to the two different variables
and the subsets of the dataset.
I had overlooked this part, which is the crucial hint as to why the function I wrote doesn't work.
"relative" returns a vector with one element per distinct value of x, whereas tabular expects each function to return a single value. The remaining problem is that there is no way to pass the base (the overall total) into the function to calculate the relative values, because the return value is computed from a subset of x.
The following does work, but it is not the most elegant way: I wrote the function with a hard-coded total number of rows (here: 155).
relative <- function(x){
round(length(x)/155, 2)
}
tabular(Country ~ Type + relative, data = cars)
         Type
Country  PKW SUV relative
Germany   23   1 0.15
Japan     22  17 0.25
Other     12   0 0.08
USA       58  22 0.52
If anyone finds a better solution - please help :)
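One way to avoid the hard-coded 155 is to take the total from the counts themselves. The sketch below does this in base R rather than with tabular, so it is an alternative route and not a fix inside the tables package. The counts are typed in from the table above, and the object names are made up for this illustration.

```r
# Absolute frequencies as shown in the tabular output above
counts <- matrix(c(23,  1,
                   22, 17,
                   12,  0,
                   58, 22),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(Country = c("Germany", "Japan", "Other", "USA"),
                                 Type = c("PKW", "SUV")))

# Relative frequency of each country, with the denominator taken from the data
rel_country <- round(rowSums(counts) / sum(counts), 2)

# Absolute and relative frequencies side by side
combined <- cbind(counts, relative = rel_country)
combined
```

With the raw data frame at hand, counts could come from table(cars$Country, cars$Type) instead of being typed in, and prop.table gives cell-wise proportions if those are wanted as well.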
What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a directory of Names and Salaries, how do I select 3 such that their sum does not exceed some value. I'm just using a while loop but that seems pretty inefficient.
You could face a combinatorial explosion. The code below simulates choosing 3 employees (EE's) from a set of 20 whose salaries are drawn from a normal distribution with mean 60 and sd 20. Enumerating all 1140 combinations shows that only 263 have a salary sum below 150.
> sum( apply( combn(1:20,3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 200
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20 ,
function(x){paste(sample(letters[1:20], 6) ,
collapse="")}), sals = rnorm(20, 60, 20))
> head(salry)
EEnams sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum( apply( combn(1:NROW(salry), 3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263
If you had 1000 EE's then you would have:
> choose(1000, 3) # Combination possibilities
# [1] 166,167,000 Commas added to output
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
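The sequential procedure described above could be sketched as follows. This is only an illustration: the function name sample_under_cap, its arguments and the max_tries safeguard are assumptions, and, as noted, the resulting inclusion probabilities are generally not uniform.

```r
# Pick k cases one at a time so that the running sum stays below cap.
# Restart from scratch whenever no eligible case is left.
sample_under_cap <- function(salaries, k = 3, cap = 150, max_tries = 1000) {
  for (attempt in seq_len(max_tries)) {
    chosen <- integer(0)
    budget <- cap
    for (step in seq_len(k)) {
      # Cases still affordable under the remaining budget and not yet chosen
      candidates <- setdiff(which(salaries < budget), chosen)
      if (length(candidates) == 0) break
      # sample.int avoids the sample() surprise when only one candidate is left
      pick <- candidates[sample.int(length(candidates), 1)]
      chosen <- c(chosen, pick)
      budget <- budget - salaries[pick]
    }
    if (length(chosen) == k) return(chosen)
  }
  NULL  # no valid selection found within max_tries restarts
}

set.seed(123)
toy_salaries <- c(40, 45, 50, 55, 90, 120)
idx <- sample_under_cap(toy_salaries, k = 3, cap = 150)
toy_salaries[idx]
```

The function returns row indices, so with a data frame the selected rows would be something like salry[idx, ]. Returning NULL after max_tries keeps the loop from running forever when the constraint cannot be met.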
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.
## create salary vector
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sum
sx <- colSums(x)
sxc <- sx[sx<1]
## sampling with replacement
sample(sxc, 10, replace=TRUE)
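The code above samples the sums themselves. If the members of each group are wanted, a variant is to enumerate index triples rather than values and then sample columns. The following is a sketch under the same toy setup, with a smaller N to keep the enumeration light; the names groups, ok and picked are made up here.

```r
set.seed(1)
N <- 30
salary <- rnorm(N)

# All groups of 3, stored as row indices (one group per column)
groups <- combn(N, 3)

# Sum of salaries per group; indexing drops the dim, so rebuild the 3-row matrix
group_sums <- colSums(matrix(salary[groups], nrow = 3))

# Columns that satisfy the constraint
ok <- which(group_sums < 1)

# Sample 10 admissible groups with replacement; each column holds 3 indices
picked <- groups[, ok[sample.int(length(ok), 10, replace = TRUE)], drop = FALSE]
picked
```

Because every admissible group is equally likely to be drawn, this version samples uniformly over groups, at the cost of enumerating all of them first.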