I divided the values of X into 5 boxes and calculated the joint probabilities of the boxed X and Y.
In the example below, since there are lots of 2s in X, in the end I only have 4 boxes.
Example:
X <-c(1,2,2,2,2,3,4,5,6,7)
Y <-c(0,1,1,1,0,1,0,1,0,1)
qX=quantile(X, 1:4/5) # find quantiles 20%,40%,60%,80%
qY=c(0,1)
dX=findInterval(X,qX,rightmost.closed=TRUE)
dY=findInterval(Y,qY+0.001,rightmost.closed=TRUE)
pXY=xtabs(~dX+dY)/10 # joint distribution
rownames(pXY) <- paste("box",1:dim(pXY)[1],sep="")
> pXY
      dY
dX       0   1
  box1 0.1 0.0
  box2 0.1 0.4
  box3 0.1 0.1
  box4 0.1 0.1
Now I want to add one more column for the range of X in each box.
The desired table will be:
box1 [1,1] 0.1 0.0
box2 [2,3] 0.1 0.4
box3 [4,5] 0.1 0.1
box4 [6,7] 0.1 0.1
The output of xtabs or table is somewhat messy to add to. I would convert to matrix:
pXY2 <- pXY; class(pXY2) <- "matrix"
data.frame(r=t(sapply(split(X,dX),range)),pXY2)
# r.1 r.2 X0 X1
# 0 1 1 0.1 0.0
# 2 2 3 0.1 0.4
# 3 4 5 0.1 0.1
# 4 6 7 0.1 0.1
Given the cutpoints used to make dX, the values of the boxes really are 0,2,3,4, not 1,2,3,4.
If you want to print the range with special formatting, consider writing a custom function:
brackem <- function(x) paste0("[",x[1],",",x[2],"]")
data.frame(r=tapply(X,dX,function(z)brackem(range(z))),pXY2)
# r X0 X1
# 0 [1,1] 0.1 0.0
# 2 [2,3] 0.1 0.4
# 3 [4,5] 0.1 0.1
# 4 [6,7] 0.1 0.1
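The steps above can be collected into one self-contained sketch. It assumes unclass() is an acceptable substitute for reassigning the class attribute, and it tabulates Y directly, because Y is already 0/1. The names ranges and res are illustrative and not from the original posts. It also prints the cut points, which shows why box 1 is empty: the 20% and 40% quantiles are both 2.

```r
X <- c(1, 2, 2, 2, 2, 3, 4, 5, 6, 7)
Y <- c(0, 1, 1, 1, 0, 1, 0, 1, 0, 1)

# Cut points at 20%, 40%, 60%, 80%; the first two coincide at 2
qX <- quantile(X, 1:4 / 5)
dX <- findInterval(X, qX, rightmost.closed = TRUE)
print(unname(qX))
print(sort(unique(dX)))  # interval 1 never occurs

# Joint distribution as a plain numeric matrix (class attribute dropped)
pXY <- unclass(xtabs(~ dX + Y)) / length(X)

# Observed range of X inside each non-empty box, formatted as "[min,max]"
ranges <- tapply(X, dX, function(z) sprintf("[%g,%g]", min(z), max(z)))

res <- data.frame(range = as.vector(ranges), pXY,
                  row.names = paste0("box", seq_along(ranges)))
print(res)
```

Relabelling the rows as box1 to box4 is a presentation choice; the underlying interval indices remain 0, 2, 3, 4.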
I have the following dataframe.
a <- c(1,2,1,3)
b <- c(0.5, 0.6, 0.7, 0.3)
d <- c(0.4, 0.3, 0.6, 0.4)
df <- data.frame(a,b,d)
df
#output
# a b d
#1 1 0.5 0.4
#2 2 0.6 0.3
#3 1 0.7 0.6
#4 3 0.3 0.4
I want to combine column b and d into a new column and decide which column value to take for each row based on column a. If a = 1 I want to take the value from column d. If a doesn't equal 1 the new column will take the value from column b for that row.
I tried the following...
df$new <- c(df[df$a != 1, "b"], df[df$a == 1, "d"])
df
#output
# a b d new
#1 1 0.5 0.4 0.6
#2 2 0.6 0.3 0.3
#3 1 0.7 0.6 0.4
#4 3 0.3 0.4 0.6
...but the rows don't line up properly. How can I combine these two columns in this manner without losing the row index to achieve the desired output shown below?
#desired output
# a b d new
#1 1 0.5 0.4 0.4
#2 2 0.6 0.3 0.6
#3 1 0.7 0.6 0.6
#4 3 0.3 0.4 0.3
This is what the ifelse function is for:
df$new <- ifelse(df$a==1, df$d, df$b)
Your version didn't work as it included all the a!=1 values first and then the a==1 values afterwards.
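As a small self-contained check, the sketch below rebuilds the data frame from the question and compares ifelse with an equivalent logical-index assignment. The names new2 and pick_d are illustrative only.

```r
df <- data.frame(a = c(1, 2, 1, 3),
                 b = c(0.5, 0.6, 0.7, 0.3),
                 d = c(0.4, 0.3, 0.6, 0.4))

# Row-wise choice: d where a == 1, otherwise b
df$new <- ifelse(df$a == 1, df$d, df$b)

# Equivalent approach: start from b, then overwrite the rows where a == 1
pick_d <- df$a == 1
new2 <- df$b
new2[pick_d] <- df$d[pick_d]

stopifnot(identical(df$new, new2))
print(df)
```

Both forms keep each value on its own row, which is what concatenating the two subsets with c() cannot do.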
I have code that creates ordered factors, and for each one I then want a summary, a proportion table, and an unalikeability calculation:
myvars <- names(Diab[c(17:33)])
Diab[myvars] <- lapply(Diab[myvars], ordered, levels = c("No","Down","Steady","Up"), labels = c("No","Down","Steady","Up"))
summary(Diab$metformin)
round(prop.table(summary(Diab$metformin)),3)
unalike(Diab$metformin)
summary(Diab$repaglinide)
round(prop.table(summary(Diab$repaglinide)),3)
unalike(Diab$repaglinide)
.....
where
myvars
[1] "metformin" "repaglinide" "nateglinide"
[4] "chlorpropamide" "glimepiride" "glipizide"
[7] "glyburide" "tolbutamide" "pioglitazone"
[10] "rosiglitazone" "acarbose" "miglitol"
[13] "tolazamide" "glyburide_metformin" "glipizide_metformin"
[16] "glimepiride_pioglitazone" "insulin"
Instead of coding summary(), round(prop.table()) and unalike() for each of myvars, how can I do this in a loop?
I know I can call summary(Diab[myvars]), but that puts the output in columns, and I want to keep the output in rows, as follows:
summary(Diab$metformin)
No Down Steady Up
22057 162 5310 275
round(prop.table(summary(Diab$metformin)),3)
No Down Steady Up
0.793 0.006 0.191 0.010
unalike(Diab$metformin)
0.3340651
Thank you in advance for your solutions.
Consider reshaping your wide data to long format and then running table (equivalent to summary.factor) and prop.table, which avoids any looping. I am unfamiliar with the definition of unalike (possibly from the ragree package), but it appears you can pass it a data frame with named arguments.
# myvars holds the 17 column names selected in the question
Diab_long <- reshape(Diab[myvars], varying = myvars, times = myvars,
v.names = "value", timevar = "metric", ids = NULL,
new.row.names = seq_len(nrow(Diab) * length(myvars)), direction = "long")
tbl <- table(Diab_long)
prop.table(tbl, margin = 1)
ragree::unalike(Diab_long, ...)
To demonstrate with seeded, random data:
Data
set.seed(22620)
lvls <- c("No","Down","Steady","Up")
# DATA FRAME OF ALL FACTORS
Diab <- setNames(data.frame(replicate(17, factor(sample(lvls, 10, replace=TRUE),
levels = c("No","Down","Steady","Up")))),
c("metformin", "repaglinide", "nateglinide",
"chlorpropamide", "glimepiride", "glipizide",
"glyburide", "tolbutamide", "pioglitazone",
"rosiglitazone", "acarbose", "miglitol",
"tolazamide", "glyburide_metformin",
"glipizide_metformin",
"glimepiride_pioglitazone", "insulin"))
# RESHAPE TO LONG
Diab_long <- reshape(Diab, varying = names(Diab), times = names(Diab),
v.names = "value", timevar = "metric", ids = NULL,
new.row.names = 1:1E4, direction = "long")
Output (does not include unalike)
tbl <- table(Diab_long)
tbl
# value
# metric Down No Steady Up
# acarbose 1 2 2 5
# chlorpropamide 4 4 1 1
# glimepiride 6 3 0 1
# glimepiride_pioglitazone 4 0 2 4
# glipizide 4 4 2 0
# glipizide_metformin 2 3 3 2
# glyburide 3 2 3 2
# glyburide_metformin 1 3 6 0
# insulin 1 1 5 3
# metformin 2 2 4 2
# miglitol 1 3 5 1
# nateglinide 6 3 1 0
# pioglitazone 1 4 3 2
# repaglinide 1 4 2 3
# rosiglitazone 1 7 1 1
# tolazamide 2 4 1 3
# tolbutamide 3 3 2 2
ptbl <- prop.table(tbl, margin = 1)
ptbl
# value
# metric Down No Steady Up
# acarbose 0.1 0.2 0.2 0.5
# chlorpropamide 0.4 0.4 0.1 0.1
# glimepiride 0.6 0.3 0.0 0.1
# glimepiride_pioglitazone 0.4 0.0 0.2 0.4
# glipizide 0.4 0.4 0.2 0.0
# glipizide_metformin 0.2 0.3 0.3 0.2
# glyburide 0.3 0.2 0.3 0.2
# glyburide_metformin 0.1 0.3 0.6 0.0
# insulin 0.1 0.1 0.5 0.3
# metformin 0.2 0.2 0.4 0.2
# miglitol 0.1 0.3 0.5 0.1
# nateglinide 0.6 0.3 0.1 0.0
# pioglitazone 0.1 0.4 0.3 0.2
# repaglinide 0.1 0.4 0.2 0.3
# rosiglitazone 0.1 0.7 0.1 0.1
# tolazamide 0.2 0.4 0.1 0.3
# tolbutamide 0.3 0.3 0.2 0.2
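If per-variable output in rows is preferred, as in the question, a loop with lapply is another option. The sketch below is hedged: unalike_coef is a hypothetical stand-in that assumes the coefficient is 1 minus the sum of squared proportions. That formula reproduces the 0.3340651 reported for metformin from the counts in the question, but it is not confirmed to be what the asker's unalike() computes. The data frame toy is made-up data with two of the column names.

```r
lvls <- c("No", "Down", "Steady", "Up")

set.seed(1)
toy <- data.frame(
  metformin   = factor(sample(lvls, 20, replace = TRUE), levels = lvls),
  repaglinide = factor(sample(lvls, 20, replace = TRUE), levels = lvls)
)

# Hypothetical helper (assumption): 1 - sum of squared category proportions
unalike_coef <- function(x) {
  p <- prop.table(table(x))
  1 - sum(p^2)
}

# One list entry per variable: counts, rounded proportions, coefficient
per_var <- lapply(toy, function(x) {
  counts <- table(x)
  list(counts  = counts,
       prop    = round(prop.table(counts), 3),
       unalike = unalike_coef(x))
})

# Check the assumed formula against the metformin counts from the question
p_q <- c(22057, 162, 5310, 275) / sum(c(22057, 162, 5310, 275))
u_q <- 1 - sum(p_q^2)

print(per_var$metformin)
print(u_q)
```

With the real data, lapply(Diab[myvars], ...) would take the place of lapply(toy, ...).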
I have two dataframes. One is a matrix with column and row titles; the other holds the metadata for the matrix. The current row and column names of the matrix are accession numbers, but the metadata has other names that I want to use as the row/column names. The issue is that they are in different orders. I want to find the row in the metadata that matches each row/column of the matrix and replace that row/column name with the value from a different column of the metadata.
Matrix:
"XP01020938" "XP3943847" "XP39583574" "XP39384739"
"XP01020938" 1 0.5 0.25 0.1
"XP3943847" 0.5 1 0.5 0.25
"XP39583574" 0.25 0.5 1 0.1
"XP39384739" 0.1 0.25 0.1 1
Metadata:
Accession Name
XP3943847 Tiger
XP39583574 Elephant
XP39384739 Monkey
XP01020938 Horse
Desired:
"Horse" "Tiger" "Elephant" "Monkey"
"Horse" 1 0.5 0.25 0.1
"Tiger" 0.5 1 0.5 0.25
"Elephant" 0.25 0.5 1 0.1
"Monkey" 0.1 0.25 0.1 1
You can do this with match:
colnames(mat) <- metadata$Name[match(colnames(mat), metadata$Accession)]
rownames(mat) <- metadata$Name[match(rownames(mat), metadata$Accession)]
mat
#         Horse Tiger Elephant Monkey
#Horse     1.00  0.50     0.25   0.10
#Tiger     0.50  1.00     0.50   0.25
#Elephant  0.25  0.50     1.00   0.10
#Monkey    0.10  0.25     0.10   1.00
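A self-contained version of this lookup, built from the matrix and metadata in the question, might look like the sketch below. The guard against unmatched accessions is an addition and not part of the original answer; idx_rows and idx_cols are illustrative names.

```r
acc <- c("XP01020938", "XP3943847", "XP39583574", "XP39384739")
mat <- matrix(c(1,    0.5,  0.25, 0.1,
                0.5,  1,    0.5,  0.25,
                0.25, 0.5,  1,    0.1,
                0.1,  0.25, 0.1,  1),
              nrow = 4, byrow = TRUE, dimnames = list(acc, acc))

metadata <- data.frame(
  Accession = c("XP3943847", "XP39583574", "XP39384739", "XP01020938"),
  Name      = c("Tiger", "Elephant", "Monkey", "Horse"),
  stringsAsFactors = FALSE
)

# Position of each matrix name within the metadata
idx_rows <- match(rownames(mat), metadata$Accession)
idx_cols <- match(colnames(mat), metadata$Accession)

# An unmatched accession would silently become an NA name, so stop early
stopifnot(!anyNA(idx_rows), !anyNA(idx_cols))

dimnames(mat) <- list(metadata$Name[idx_rows], metadata$Name[idx_cols])
print(mat)
```

Because match returns positions in the metadata, the order of the metadata rows does not matter.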
I am trying to plot a stacked bar chart with multiple facets using the code below:
dat <- read.csv(file="./fig1.csv", header=TRUE)
dat2 <- melt(dat, id.var = c("id", "col1", "label"))
ggplot(dat2, aes(x=id, y=value, fill = variable)) +
geom_bar(stat="identity") +
scale_x_discrete(limits=dat2$label) +
facet_grid(. ~ col1) +
geom_col(position = position_stack(reverse = TRUE))
Here is a minimal example of what my data looks like:
id label col1 col2 col3 col4 col5
1 3 1 0.2 0.1 0.1 0.1
2 3 1 0.2 0.1 0.2 0.1
3 4 1 0.2 0.2 0.2 0.1
4 4 1 0.1 0.1 0.2 0.1
5 7 2 0.1 0.1 0.1 0.2
6 8 2 0.2 0.1 0.1 0.1
7 9 2 0.2 0.1 0.2 0.1
8 9 2 0.2 0.2 0.2 0.1
9 9 2 0.1 0.1 0.2 0.1
The problem is that the labels do not show up as I expect. The labels for the facet where col1 is 1 are repeated in the facet where col1 is 2, which means the labels (7,8,9,9,9) are ignored. Also, when consecutive labels are the same, they appear only once. For instance, the first label, 3, is shown, but the second label, which is also 3, is dropped. Does anyone know how I can show the labels exactly as they are listed in the label column?
I am working in R and I have two datasets. One dataset contains a contribution amount, and the other includes an include/exclude flag. Below are the data:
> contr_df
asof_dt X Y
1 2014-11-03 0.3 1.2
2 2014-11-04 -0.5 2.3
3 2014-11-05 1.2 0.4
> inex_flag
asof_dt X Y
1 2014-11-03 1 0
2 2014-11-04 1 1
3 2014-11-05 0 0
I would like to create a third dataset that shows one multiplied by the other. For example, I want to see the following:
2014-11-03 0.3 * 1 1.2*0
2014-11-04 -0.5*1 2.3*1
2014-11-05 1.2*0 0.4*0
So far the only way I've been able to accomplish this is with a for loop over all the columns. However, this is complicated and inefficient. Is there an easier way to do this?
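Multiplying the data frames directly (here df1 is contr_df and df2 is inex_flag) handles the numeric columns, but the operation is not meaningful for the factor column asof_dt: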
df1 * df2
# asof_dt X Y
#1 NA 0.3 0.0
#2 NA -0.5 2.3
#3 NA 0.0 0.0
#Warning message:
#In Ops.factor(left, right) : '*' not meaningful for factors
One option is to cbind the first column with the multiplied values:
cbind(df1[1], df1[-1] * df2[-1])
# asof_dt X Y
#1 2014-11-03 0.3 0.0
#2 2014-11-04 -0.5 2.3
#3 2014-11-05 0.0 0.0
That is, multiply df1 and df2 with their first columns dropped, then bind the date column from df1 back onto the result.
The one-line answer is:
mapply(`*`, contr_df, inex_flag)
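This applies `*` to each pair of corresponding columns, multiplying them element-wise, and returns a matrix rather than a data frame. Every column is included, so drop any non-numeric column such as asof_dt first.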
d = data.frame(a=c(1,2,3), b=c(0,2,-1))
e = data.frame(a=c(.2, 2, -1), b=c(0, 2, -2))
mapply(`*`, d, e)
a b
[1,] 0.2 0
[2,] 4.0 4
[3,] -3.0 2
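Putting the pieces together for the data in the question, a sketch that keeps the date column and returns a data frame could look like this. It assumes both data frames have the same rows in the same order, which the code checks; num_cols and result are illustrative names.

```r
contr_df <- data.frame(
  asof_dt = as.Date(c("2014-11-03", "2014-11-04", "2014-11-05")),
  X = c(0.3, -0.5, 1.2),
  Y = c(1.2, 2.3, 0.4)
)
inex_flag <- data.frame(
  asof_dt = contr_df$asof_dt,
  X = c(1, 1, 0),
  Y = c(0, 1, 0)
)

# The element-wise product is only meaningful if the rows line up
stopifnot(identical(contr_df$asof_dt, inex_flag$asof_dt))

# Multiply every column except the date, matching columns by name
num_cols <- setdiff(names(contr_df), "asof_dt")
result <- contr_df
result[num_cols] <- contr_df[num_cols] * inex_flag[num_cols]
print(result)
```

Selecting columns by name rather than by position keeps the product correct even if the two data frames list X and Y in a different order.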