Fill a pseudo-table() with values from another column - r

How can I build a table()-style cross-tabulation of two variables, but fill the cells with values from a third column or a separate list?
Example:
library(ggplot2) # diamonds data
data(diamonds)
T.matrix <- with(diamonds, table(color, clarity))
Produces:
     clarity
color   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF
    D   42 1370 2083 1697  705  553  252   73
    E  102 1713 2426 2470 1281  991  656  158
    F  143 1609 2131 2201 1364  975  734  385
    G  150 1548 1976 2347 2148 1443  999  681
    H  162 1563 2275 1643 1169  608  585  299
    I   92  912 1424 1169  962  365  355  143
    J   50  479  750  731  542  131   74   51
I want a similar table with color by clarity except with fill = reference$value instead of table()'s count
reference <- expand.grid(clarity = c("I1", "SI2", "SI1", "VS2", "VS1","VVS2", "VVS1", "IF"),
color = c("D", "E", "F", "G", "H", "I", "J"))
reference$value <- 1:56
So: [D, I1] would have a value of 1, [D, SI2] = 2, [H, VS2] = 36, etc.

Try tapply:
tapply(diamonds$price, list(diamonds$color, diamonds$clarity), mean)
tapply takes the variable you want to summarise, groups it by the list of grouping variables, then applies the function given as the last argument to each group. The result is a matrix, which may or may not be convenient depending on what you want to do with it.
If you want your data in a more usable format, you might want to use dplyr:
library(dplyr)
diamonds %>% group_by(clarity, color) %>%
summarise(mean(price))
Edit: It works the same way with your reference data:
tapply(reference$value, list(reference$color, reference$clarity), FUN = sum)
You need to supply FUN; without it, tapply does not build the table and returns only a vector of cell indices.
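For the reference data in the question, a minimal sketch of this might look as follows. It assumes reference has exactly one row per color/clarity pair, so summing a cell simply returns its single value. The xtabs() call is an alternative that is not part of the answer above; it is included only for comparison.

```r
# Rebuild the lookup data from the question (clarity varies fastest).
reference <- expand.grid(
  clarity = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"),
  color   = c("D", "E", "F", "G", "H", "I", "J")
)
reference$value <- 1:56

# tapply: each cell holds a single value, so sum() just returns it.
# Naming the list elements labels the two dimensions, as table() does.
filled <- tapply(reference$value,
                 list(color = reference$color, clarity = reference$clarity),
                 FUN = sum)

# xtabs: a formula with a left-hand side sums that column within each cell.
filled_xt <- xtabs(value ~ color + clarity, data = reference)

filled["D", "I1"]   # 1
filled["H", "VS2"]  # 36
```

If a color/clarity pair could occur more than once in the lookup data, sum would add the values together, so a different FUN (or a check for duplicates) may be more appropriate in that case.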

Related

Divide columns by other columns and itself depending on index in dplyr

library(dplyr)
set.seed(1)
df <- data.frame(dddt_a = sample(1:1000, 1000, replace=T),
dddt_b = sample(1:1000, 1000, replace=T),
dddt_c = sample(1:1000, 1000, replace=T),
dddt_d = sample(1:1000, 1000, replace=T),
index = as.character(sample(c("a", "b"), 1000, replace=T)))
I want to divide each column by either dddt_a or dddt_b, depending on the index. If index == "a", all columns except index should be divided by dddt_a; if index == "b", they should be divided by dddt_b. The way it is set up now, only dddt_a is divided by dddt_a and the other columns are left unchanged (likewise for index == "b").
df1 <- df %>%
mutate_at(.vars = vars(starts_with("dddt")),
.funs = list(~ifelse(index=="a", ./dddt_a, ./dddt_b)))
head(df1)
dddt_a dddt_b dddt_c dddt_d index
1 1.0000000 686 474 756 a
2 0.7388466 1 681 726 b
3 1.0000000 218 570 448 a
4 2.0086393 1 830 958 b
5 1.0000000 989 590 128 a
6 1.0000000 128 978 144 a
A workaround is to store the denominator variable outside, split the data by index, divide everything and put it back together (I ran it only for index == "a" here). However, this should be possible in dplyr, I'm sure...?
ind_a <- df$dddt_a[df$index=="a"]
dfa <- df %>%
filter(index=="a")%>%
mutate_at(.vars = vars(starts_with("dddt")),
.funs = ~ ./!!ind_a)
This seems to be related to the same issue: in a next step I want to sum the values up, again depending on the index variable:
df2 <- df1 %>%
mutate(SUMS = ifelse(index=="a",
1+dddt_b+dddt_c+dddt_d,
1+dddt_a+dddt_c+dddt_d))
However, this sums all variables up...
head(df2)
dddt_a dddt_b dddt_c dddt_d index SUMS
1 1.0000000 686 474 756 a 1917.000
2 0.7388466 1 681 726 b 1408.739
3 1.0000000 218 570 448 a 1237.000
4 2.0086393 1 830 958 b 1791.009
5 1.0000000 989 590 128 a 1708.000
6 1.0000000 128 978 144 a 1251.000
But for the first row, for example, SUMS should be equal to 1916:
rowSums(df2[1,2:4]) #the result should be 1916 not 1917
1916
Thanks for the help.
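mutate_at updates the columns one after another, so once dddt_a has been divided by itself, the remaining columns are divided by its new value, 1. Create new columns after dividing instead of overwriting the originals: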
library(dplyr)
df %>%
mutate_at(vars(starts_with("dddt")),
list(new = ~ifelse(index=="a", ./dddt_a, ./dddt_b))) %>%
head
# dddt_a dddt_b dddt_c dddt_d index dddt_a_new dddt_b_new dddt_c_new dddt_d_new
#1 836 686 474 756 a 1.000 0.821 0.567 0.904
#2 679 919 681 726 b 0.739 1.000 0.741 0.790
#3 129 218 570 448 a 1.000 1.690 4.419 3.473
#4 930 463 830 958 b 2.009 1.000 1.793 2.069
#5 509 989 590 128 a 1.000 1.943 1.159 0.251
#6 471 128 978 144 a 1.000 0.272 2.076 0.306
If you want, you can then select only the "_new" columns or rename them to names of your choice.
We can also use case_when:
library(dplyr)
df %>%
mutate_at(vars(starts_with("dddt")),
list(new = ~case_when(index=="a" ~ ./dddt_a, TRUE ~ ./dddt_b)))
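Another way to sidestep the order-of-update problem is to fix the denominator once, before any column is changed. The sketch below does this in base R rather than dplyr; the names cols and denom are hypothetical helpers, not from the question. It also shows that the SUMS column from the question then reduces to a plain row sum.

```r
set.seed(1)
df <- data.frame(dddt_a = sample(1:1000, 1000, replace = TRUE),
                 dddt_b = sample(1:1000, 1000, replace = TRUE),
                 dddt_c = sample(1:1000, 1000, replace = TRUE),
                 dddt_d = sample(1:1000, 1000, replace = TRUE),
                 index  = sample(c("a", "b"), 1000, replace = TRUE))

# Columns to rescale (hypothetical helper name).
cols <- grep("^dddt", names(df), value = TRUE)

# Compute the per-row denominator while the columns are still untouched.
denom <- ifelse(df$index == "a", df$dddt_a, df$dddt_b)

# Divide every dddt column by the same stored denominator.
df1 <- df
df1[cols] <- lapply(df[cols], function(col) col / denom)

# The denominator column is now 1 in every row, so the row sum over the
# dddt columns equals 1 plus the other three ratios.
df1$SUMS <- rowSums(df1[cols])
head(df1)
```

Because denom is computed from the original df, the result does not depend on the order in which the columns are processed.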

R Making barplot

I'm trying to make a barplot with the following data
          Dept
Admit       A   B   C   D   E   F
Admitted  601 370 322 269 147  46
Rejected  332 215 596 523 437 668
and I have tried the following code:
admission_department <- barplot(biasUCB_d, main="Admit by department",
xlab="biasUCB_d[['Dept']]",
col=c("darkblue","red"),
legend = rownames(biasUCB_d[['Dept']]),
beside=TRUE)
The code used to create the dataset is:
biasUCB_d <- margin.table(UCBAdmissions, c(1,3))
What am I doing wrong?
Assuming that Dept is an element of a list this should work:
Data:
biasUCB_d <- list(Dept = read.table(header=T, text='
Admit A B C D E F
Admitted 601 370 322 269 147 46
Rejected 332 215 596 523 437 668'))
Solution:
admission_department <- barplot(as.matrix(biasUCB_d$Dept[2:7]), main="Admit by department",
xlab="biasUCB_d[['Dept']]",
col=c("darkblue","red"),
legend = biasUCB_d$Dept$Admit,
beside=TRUE)
Try:
admission_department <- barplot(biasUCB_d, main="Admit by department",
xlab="Department",
col=c("darkblue","red"),
legend.text = rownames(biasUCB_d),
beside=TRUE)
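Inspecting the object gives a hint as to why the original call fails: margin.table() returns a two-dimensional table rather than a list, so "Dept" is the name of a dimension and not an element that [[ ]] can extract. A short sketch using only the built-in UCBAdmissions data (bar_mids is a hypothetical name for the return value):

```r
biasUCB_d <- margin.table(UCBAdmissions, c(1, 3))

# "Admit" and "Dept" name the two dimensions of the table.
names(dimnames(biasUCB_d))   # "Admit" "Dept"
rownames(biasUCB_d)          # "Admitted" "Rejected"

# barplot() accepts the table directly; beside = TRUE draws grouped bars.
bar_mids <- barplot(biasUCB_d, beside = TRUE,
                    main = "Admit by department", xlab = "Department",
                    col = c("darkblue", "red"),
                    legend.text = rownames(biasUCB_d))
```

With beside = TRUE, barplot() returns a matrix of bar midpoints, which can be handy for adding labels with text() afterwards.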

R how to add a second label for a matrix

I want to build this matrix
What I tried
table <- matrix(c(163,224,312,314,303,175,119,662,933,909,871,702,522,307,1513,2400,2164,2299,1824,1204,678,1603,2337,2331,2924,2360,1428,808,2834,3903,3826,4884,3115,2093,89), nrow=5, ncol=7, byrow=T)
rownames(table) <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
colnames(table) <- c("D", "E", "F", "G", "H", "I", "J")
but the result is this:
and my question is how to add the color and cut labels
Here, dimnames(table) is a 'list'. In the original matrix 'table', the list elements are not named. We can use names to change the names of the list from 'NULL' to the preferred one.
names(dimnames(table)) <- c('cut', 'color')
table
# color
# cut D E F G H I J
# Fair 163 224 312 314 303 175 119
# Good 662 933 909 871 702 522 307
# Very Good 1513 2400 2164 2299 1824 1204 678
# Premium 1603 2337 2331 2924 2360 1428 808
# Ideal 2834 3903 3826 4884 3115 2093 89
NOTE: table is also the name of an R function, so it is better to give the object a different name.
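The dimension labels can also be supplied when the matrix is created, by passing a named list as the dimnames argument of matrix(). The sketch below uses a hypothetical object name, counts, and only the first two rows and three columns of the data from the question:

```r
# Named list elements become the labels of the row and column dimensions.
counts <- matrix(c(163, 224, 312,
                   662, 933, 909),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(cut   = c("Fair", "Good"),
                                 color = c("D", "E", "F")))

counts
names(dimnames(counts))   # "cut" "color"
```

This avoids the separate rownames(), colnames() and names(dimnames()) calls, at the cost of a longer matrix() call.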

Row wise operation on data.table

Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x=sample(1:1000,1000),
y=sample(1:1000,1000),
z=sample(1:1000,1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[1,])) # for i in 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the min and max across columns in a vectorized way, which is much better than splitting and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax,.SD) - do.call(pmin,.SD)]
        x   y   z   r
   1: 266 531 872 606
   2: 372 685 967 595
   3: 572 383 866 483
   4: 906 953 437 516
   5: 201 118 192  83
  ---
 996: 768 945 292 653
 997:  61 231 965 904
 998: 771 145  18 753
 999: 841 148 839 693
1000: 857 252 218 639
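To restrict the same pmax/pmin idea to selected rows only, the rows can be subset before the two calls. The sketch below uses a plain data.frame so that it runs without extra packages; rows and cols are hypothetical names. In data.table, the corresponding step would presumably be to put the row selection in the i position of dat[...] and name the columns via .SDcols.

```r
set.seed(1)
dat <- data.frame(x = sample(1:1000, 1000),
                  y = sample(1:1000, 1000),
                  z = sample(1:1000, 1000))

rows <- c(1:4, 15:18)     # hypothetical row selection
cols <- c("x", "y", "z")  # columns that enter the range

# A data.frame is a list of columns, so do.call() passes each column
# to pmax()/pmin() as a separate argument.
sub <- dat[rows, cols]
dat$diff_range <- NA_integer_
dat$diff_range[rows] <- do.call(pmax, sub) - do.call(pmin, sub)

dat[rows, ]
```

Rows outside the selection keep NA in diff_range, which makes it easy to see which rows were processed.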
How about this:
D[,list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I][c(1:4,15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
#actually this will be faster
D[c(1:4,15:18),list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I]
Use .I to create an index to pass to the by= parameter; the function is then run on each row. The second call pre-filters by any list of row numbers, or you can add a key and filter on that if your real table looks different.
You can do it by subsetting before/during the function. If you only want every second row for example
dat_Diffs <- apply(dat[seq(2,1000,by=2),],1,function(x) diff(range(x)))
Or for rownames 1:10 (since their names weren't specified they are just numbers counting up)
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,],1,function(x) diff(range(x)))
But why not just calculate per row then subset later?

Binning a dataframe with equal frequency of samples

I have binned my data using the cut function
breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))
My split dataframe looks like
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins look like
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
When I take the mean of the bin counts, I get an average of ~900 samples per bin
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin will contain on an average 900 samples (e.g. (0, 27] = 900, (27,28.5] = 900, and so on). I found something similar here, which deals with only one variable, not the whole dataframe.
I also tried Hmisc package, unfortunately the bins don't contain equal frequency!!
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal-sized buckets (based on your seq statement), you can use something like:
df <- data.frame(var=runif(500, 0, 100)) # make data
cut.vec <- cut(
df$var,
breaks=quantile(df$var, 0:50/50), # breaks along 1/50 quantiles
include.lowest=T
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
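A quick way to check the quantile approach is to tabulate the resulting bins. The sketch below repeats the idea with an arbitrary seed and hypothetical names (n_bins, breaks, bins, counts). One caveat: if the column has many tied values, as integer-valued data often does, some quantiles can coincide and cut() will stop with an error about non-unique breaks, so this works best on continuous data.

```r
set.seed(42)                               # arbitrary seed, for reproducibility
df <- data.frame(var = runif(500, 0, 100)) # continuous example data

n_bins <- 50
breaks <- quantile(df$var, probs = seq(0, 1, length.out = n_bins + 1))

# include.lowest = TRUE keeps the minimum value inside the first bin.
bins   <- cut(df$var, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)
range(counts)                              # smallest and largest bin size

df.split <- split(df, bins)
```

If range(counts) shows two equal (or nearly equal) numbers, the bins have the same frequency, which is what the regular-width breaks from seq() could not provide.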
This can be done with the function provided here by Joris Meys:
EqualFreq2 <- function(x,n){
nx <- length(x)
nrepl <- floor(nx/n)
nplus <- sample(1:n,nx - nrepl*n)
nrep <- rep(nrepl,n)
nrep[nplus] <- nrepl+1
x[order(x)] <- rep(seq.int(n),nrep)
x
}
data<-split(df2, EqualFreq2(df2$val, 25))
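For comparison, a deterministic variant of the same idea can be sketched with rank(). This is not the function quoted above, and equal_freq_bins, vals and the example df2 are hypothetical names and data. Breaking ties by order of appearance gives every value a distinct rank, so tied values can still be spread over neighbouring bins.

```r
# Assign each value to one of n bins of (nearly) equal size.
equal_freq_bins <- function(x, n) {
  # ties.method = "first" gives every element a distinct rank 1..length(x)
  ceiling(rank(x, ties.method = "first") * n / length(x))
}

set.seed(1)
vals <- sample(1:250, 1000, replace = TRUE)  # integer data with many ties
df2  <- data.frame(val = vals)               # stand-in for the real data

bins <- equal_freq_bins(df2$val, 25)
table(bins)                                  # 25 bins of 40 rows each

data <- split(df2, bins)
```

Note that, as with EqualFreq2, rows with identical val can end up in different bins; whether that is acceptable depends on how the bins are used afterwards.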

Resources