Indicator feature creation in R based on multiple columns


I have a dataset with 10 columns, 3 of which are of interest for creating a new indicator feature. The features are "pT", "pN", and "M", and each takes several values. Across all the values these 3 features take, there are a total of 9 unique combinations that need to be captured in the new variable.
     PATHOT PATHON PATHOM
1    pT2    pN1    M0
4    pT1    pN1    M0
13   pT3    pN1    M0
161  pT1    *pN2   M0
391  pT1    pN1    *M1
810  *pTIS  pN1    M0
948  pT3    *pN2   M0
1043 pT2    pN1    *M1
1067 *pT4   pN1    M0
For example, the new variable will have the value "1" when PATHOT=pT2, PATHON=pN1 and PATHOM=M0, and so on up to the value 9. I have completed the task, but it took almost 20 lines of code, with one vectorised assignment for each unique combination.
diag3_bs$sfd[diag3_bs$pathot=="pT2" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 1
diag3_bs$sfd[diag3_bs$pathot=="pT1" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 2
diag3_bs$sfd[diag3_bs$pathot=="pT3" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 3
# ... and so on, up to 9.
Is there a better, more automated way of getting the same result?
The dput() output of the data frame is given below:
structure(list(F_STATUS = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "Y", class = "factor"), EVENT_ID = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "BASELINE", class =
"factor"),
PAG_NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "BR2", class = "factor"), PTSIZE = c(3, 4,
2.7, 2, 0.9, 3, 3, 0.9, 3, 4.5), PTSIZE_U = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CM", class = "factor"),
PT_SYM = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("", "-", "<", ">"), class = "factor"), PATHOT = structure(c(4L,
4L, 4L, 3L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("*pT4", "*pTIS",
"pT1", "pT2", "pT3"), class = "factor"), PATHON = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("*pN2", "pN1"
), class = "factor"), PATHOM = structure(c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("*M1", "M0"), class = "factor"),
RSUBJID = 901000:901009, RUSUBJID = structure(1:10, .Label = c(
"000301-000-901-251", "000301-000-901-252", "000301-000-901-253",
"000301-000-901-254", "000301-000-901-255", "000301-000-901-256",
"000301-000-901-257", "000301-000-901-258", "000301-000-901-259",
"000301-000-901-260", "000301-000-901-261", "000301-000-901-262")
, class = "factor")), .Names = c("F_STATUS", "EVENT_ID", "PAG_NAME", "PTSIZE", "PTSIZE_U", "PT_SYM", "PATHOT",
"PATHON", "PATHOM", "RSUBJID", "RUSUBJID"), row.names = c(NA, 10L),
class = "data.frame")
Thanks.

I edited the data so that it didn't throw an error on input (it is called dat below). I also created a version of that tabulation of possible combinations:
stg_tbl <- structure(list(PATHOT = structure(c(4L, 3L, 5L, 3L, 3L, 2L, 5L,
4L, 1L), .Label = c("*pT4", "*pTIS", "pT1", "pT2", "pT3"), class = "factor"),
PATHON = structure(c(2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("*pN2",
"pN1"), class = "factor"), PATHOM = structure(c(2L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 2L), .Label = c("*M1", "M0"), class = "factor")), .Names = c("PATHOT",
"PATHON", "PATHOM"), class = "data.frame", row.names = c("1",
"4", "13", "161", "391", "810", "948", "1043", "1067"))
Make a vector of text-equivalents of the categories:
stg_lbls <- with(stg_tbl, paste(PATHOT, PATHON, PATHOM, sep="_") )
Then the as.numeric values of a factor created using those levels will be the desired result:
dat$stg <- with(dat, factor( paste(PATHOT, PATHON, PATHOM, sep="_"), levels=stg_lbls))
as.numeric(dat$stg)
#[1] 1 1 1 2 2 1 1 2 1 1
You can just assign those values in the usual way:
dat$sfd <- as.numeric(dat$stg)
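A minimal self-contained sketch of the same idea on toy data may make the mechanics easier to see. The names `toy` and `combo_levels` are hypothetical and not part of the question. The order of the levels decides which combination gets which code, and any combination missing from the levels comes out as NA rather than raising an error.

```r
# Hypothetical toy data; the last row holds a combination not listed in the levels
toy <- data.frame(
  PATHOT = c("pT2", "pT1", "pT3", "pT1"),
  PATHON = c("pN1", "pN1", "pN1", "pN9"),
  PATHOM = c("M0", "M0", "M0", "M0")
)

# The order of this vector decides the codes: first entry -> 1, second -> 2, ...
combo_levels <- c("pT2_pN1_M0", "pT1_pN1_M0", "pT3_pN1_M0")

# Build one text key per row, then read off the factor's integer codes
key <- with(toy, paste(PATHOT, PATHON, PATHOM, sep = "_"))
toy$sfd <- as.integer(factor(key, levels = combo_levels))
print(toy$sfd)
```

Checking for NA in the result is a cheap way to spot combinations that were left out of the lookup.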

I made some new data that should be useful for your problem.
k<-expand.grid(data.frame(a=letters[1:3],b=letters[4:6],c=letters[7:9]))
library(dplyr)
k %>% mutate(groups=paste0(a,b,c))->k2
k2$groups<-as.numeric(factor(k2$groups))
k2
It's crude, and you're not picking which combination gets which number, so it would take some digging afterwards, but it's quick.
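The digging can be reduced by keeping a lookup table next to the codes. The sketch below is a base R variant under assumptions of my own: the toy data and the names `toy`, `key` and `lookup` are hypothetical, and codes are assigned in order of first appearance rather than alphabetically.

```r
# Hypothetical toy data with a repeated combination
toy <- data.frame(
  a = c("x", "y", "x", "z"),
  b = c("p", "p", "p", "q"),
  stringsAsFactors = FALSE
)

key <- paste(toy$a, toy$b, sep = "_")

# Number the combinations in order of first appearance
toy$groups <- match(key, unique(key))

# One row per combination, so every code can be traced back to its values
lookup <- unique(toy[c("groups", "a", "b")])
print(lookup)
```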

Related

Why this code is not right statically in ggplot to get percentage in y-axis?

I have this data, and I want to get percentage in y-axis.
structure(list(sb_1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("0", "x"), class = "factor"),
sb_2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "0", class = "factor"), sb_3 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "b", class = "factor"),
sb_4 = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("0", "c"), class = "factor"), wave = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("h",
"j"), class = "factor")), row.names = c(NA, 12L), class = "data.frame")
This is the code I have used:
library(tidyverse)  # pivot_longer(), unite(), dplyr verbs, str_sort(), ggplot()
nn %>%
pivot_longer(cols = starts_with("sb_")) %>%
filter(value != 0) %>%
unite(sb_, name, value) %>%
group_by(wave) %>%
mutate(wave_total = n()) %>%
group_by(sb_, .add = TRUE) %>%
mutate(sb_pct = 100 * n() / wave_total) %>%
ggplot(aes(x = factor(sb_, levels = str_sort(unique(sb_), numeric = TRUE)), y = sb_pct)) +
geom_bar(aes(fill = wave), stat = "identity", position = position_dodge(preserve = "single")) +
xlab("sb") +
ylab("percent")
And this is the resulting plot: [image 1]
And the result should be different because for instance for the first column, there was no zero and all is the outcome.
sb_1 sb_2 sb_3 sb_4 wave
1 0 0 b 0 h
2 0 0 b 0 j
3 0 0 b 0 h
4 0 0 b c j
5 0 0 b c h
6 0 0 b c j
7 x 0 b c h
8 x 0 b c j
9 x 0 b c h
10 x 0 b c j
11 x 0 b c h
12 x 0 b c j
So please help me understand why it is not correct.

I can't tell why your code isn't correct, but I tried a different way and it seems to work as expected:
library(tidyr)    # pivot_longer()
library(dplyr)    # filter()
library(ggplot2)  # ggplot(), geom_bar()
n <- structure(list(sb_1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("0", "x"), class = "factor"),
sb_2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "0", class = "factor"), sb_3 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "b", class = "factor"),
sb_4 = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("0", "c"), class = "factor"), wave = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("h",
"j"), class = "factor")), row.names = c(NA, 12L), class = "data.frame")
n <- pivot_longer(n, cols = starts_with("sb_"))
n$wave_and_name <- as.factor(paste(n$wave,n$name, sep="_"))
n <- as.data.frame(table(filter(n, value != 0)$wave_and_name) / table(n$wave_and_name) * 100)
n$wave <- substr(n$Var1, 1, 1)
n$name <- substr(n$Var1, 3, 6)
ggplot(n, aes(x=name, y=Freq)) +
geom_bar(aes(fill = wave), stat="identity",position = position_dodge()) +
xlab("sb") +
ylab("percent")
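One possible reason for the discrepancy in the question's code is that filter() runs before n(), so wave_total counts only the non-zero cells of a wave across all sb_ columns instead of the number of rows in that wave. The base R sketch below computes the percentage the answer appears to target, namely the share of non-zero rows per wave for each column. It assumes that is the intended quantity, and it rebuilds the data with plain character columns instead of factors.

```r
# Same values as the question's data, rebuilt as character columns
nn <- data.frame(
  sb_1 = c(rep("0", 6), rep("x", 6)),
  sb_2 = rep("0", 12),
  sb_3 = rep("b", 12),
  sb_4 = c(rep("0", 3), rep("c", 9)),
  wave = rep(c("h", "j"), 6),
  stringsAsFactors = FALSE
)

sb_cols <- grep("^sb_", names(nn), value = TRUE)

# TRUE wherever a cell is non-zero
nonzero <- as.data.frame(nn[sb_cols] != "0")

# Percent of non-zero rows per wave, one column per sb_ variable
pct <- aggregate(nonzero, by = list(wave = nn$wave),
                 FUN = function(v) 100 * mean(v))
print(pct)
```

The denominator here is always the number of rows in the wave, so a column with no zeros reaches 100 percent.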

Transform a data frame into a table with option

I have a data frame with different variables (columns).
I want to transform this data frame into a table with a different structure to make it more readable.
For example, I have a data frame like this:
myData = structure(list(X = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "20", class = "factor"),
Y = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("20", "100"), class = "factor"),
MethodType = structure(c(2L, 2L, 4L, 4L, 1L, 1L, 3L, 3L,
2L, 2L, 4L, 4L, 1L, 1L, 3L, 3L), .Label = c("E", "Q", "R",
"W"), class = "factor"), MethodType2 = structure(c(1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), Metric1 = c(0.970017512487058, 0.969647220975651,
0.965873991040769, 0.966242788535318, 0.986725852301671,
0.98696657967457, 0.98252107117733, 0.982655296614757, 0.278826941542694,
-0.990926101696033, 0.194574672498287, 0.281916524368647,
0.152983364411985, 1.44135982835554, 0.330270447575806, -0.369627160641594
), Metric2 = c(0.987541353383459, 0.987007518796992, 0.980984962406015,
0.981646616541353, 0.984082706766917, 0.984481203007519,
0.988165413533835, 0.988375939849624, -0.109331599015822,
-0.148471161609603, 1.31331396089969, -1.34238564643737,
2.14014350779371, -0.422879539464588, -1.25706359685425,
1.09603324772565)), row.names = c(NA, -16L), class = "data.frame")
and I want to have a table like this:
What kind of manipulation can I use, and with which tool? I'm looking for something flexible that also works with more factors.
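The target layout is not shown here, so the following is only a sketch of one flexible option in base R: xtabs() to spread a metric over the factors and ftable() to flatten the result into a readable table. The toy data, the rounded values and the choice of row and column variables are all assumptions, not the asker's intended layout.

```r
# Hypothetical toy data shaped like myData (values are made up)
toy <- data.frame(
  Y = factor(rep(c("20", "100"), each = 4), levels = c("20", "100")),
  MethodType = rep(rep(c("Q", "W"), each = 2), times = 2),
  MethodType2 = rep(c("A", "B"), times = 4),
  Metric1 = c(0.97, 0.96, 0.95, 0.94, 0.27, -0.99, 0.19, 0.28)
)

# Each factor combination occurs once, so the sum computed by xtabs() is the value itself
tab <- xtabs(Metric1 ~ Y + MethodType + MethodType2, data = toy)

# Chosen factors go down the side, the remaining factor goes across the top
flat <- ftable(tab, row.vars = c("Y", "MethodType"))
print(flat)
```

Adding another factor only means adding it to the formula and deciding whether it belongs in row.vars.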

How to change box color in transition plot (Gmisc package)

I want to make a transition plot with three columns. I use the Gmisc package, but not the transitionPlot function, since it does not let me include a third column. Therefore, I used the code below. My problem is that the resulting transition plot is dark green and the boxes have a shadow. Could you please help me change the color and get rid of the shadow? Thank you. This is my first question, so I apologize if something is wrong.
Here is a sample data frame (taken from Stack Overflow, since I do not have the data):
x <- structure(list(Obs = 1:13, Seq.1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("a", "b", "c" ), class = "factor"), Seq.2 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("c", "d"), class = "factor"), Seq.3 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("", "d", "e"), class = "factor"), Seq.4 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L), .Label = c("", "f"), class = "factor")), .Names = c("Obs", "Seq.1", "Seq.2", "Seq.3", "Seq.4"), class = "data.frame", row.names = c(NA, -13L))
library(Gmisc)
library(dplyr)
transitions <- table(x$Seq.1,x$Seq.2) %>%
getRefClass("Transition")$new(label=c("1st Iteration", "2nd Iteration"))
transitions$box_width = 0.25;
transitions$box_label_cex = 0.7;
transitions$arrow_type = "simple";
transitions$arrow_rez = 300;
table(x$Seq.2,x$Seq.3) %>% transitions$addTransitions(label = '3rd Iteration')
transitions$render()

Convert an adjacency matrix to From To matrix

I am trying to understand graphs, and I am using a toy dataset as follows:
df = structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("0","1"), class = "factor"),
B= structure(c(1L, 1L,2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("0","1"), class = "factor"),
C= structure(c(1L,1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("0","1"), class = "factor"),
D= structure(c(1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0","1"), class = "factor"),
Weight = c(12, 33, 65, 72, 9, 3, 3, 10, 17, 9, 9, 25, 1, 0, 1, 4)), .Names = c("A","B", "C", "D", "Weight"),
row.names = c(NA, -16L), class = "data.frame")
I am interested in converting the adjacency matrix above to a From/To matrix, something like this:
From To Weight
A    B  72
A    C  3
C    D  1
.    .  .
.    .  .
A    D  9
And so on. Any help on how to accomplish this is much appreciated.
Correction (user20650):
df = df[-1,]  # exclude the first observation
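Judging from the expected output, each row with exactly two flags set seems to describe one edge (for example A=1, B=1 with weight 72). The sketch below builds the From/To table under that assumption. Rows with one, three or four flags are ignored, which may or may not be what is wanted, and the indicator columns are rebuilt as numeric 0/1 instead of factors for brevity.

```r
# Same values as the question's df, with numeric 0/1 indicators
df <- data.frame(
  A = rep(c(0, 1), times = 8),
  B = rep(rep(c(0, 1), each = 2), times = 4),
  C = rep(rep(c(0, 1), each = 4), times = 2),
  D = rep(c(0, 1), each = 8),
  Weight = c(12, 33, 65, 72, 9, 3, 3, 10, 17, 9, 9, 25, 1, 0, 1, 4)
)

nodes <- c("A", "B", "C", "D")
ind <- as.matrix(df[nodes]) == 1

# Keep rows where exactly two nodes are flagged (assumed to be the edges)
is_pair <- rowSums(ind) == 2
pair_ind <- ind[is_pair, , drop = FALSE]

edges <- data.frame(
  From = apply(pair_ind, 1, function(r) nodes[which(r)[1]]),
  To = apply(pair_ind, 1, function(r) nodes[which(r)[2]]),
  Weight = df$Weight[is_pair],
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(edges)
```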

Bootstrapped tree values differ from PAST

When I compute a bootstrapped tree in R, I get different values from those I get with PAST (http://folk.uio.no/ohammer/past/). How can I get the output of the two programs to match?
Here's what I'm doing in R (data below):
library("ape")
library("phytools")
library("phangorn")
library("cluster")
# compute neighbour-joined tree
f <- function(xx) nj(daisy(xx))
nj_tree <- f(tab)
nj_tree_root <- root(nj_tree, 1, r = TRUE)
## bootstrap
# bootstrap values do not match PAST output - why is that?
nj_tree_root_boot <- boot.phylo(nj_tree, FUN = f, tab, rooted = TRUE)
# Are bootstrap values stable?
for (i in 1:10){
print(boot.phylo(nj_tree, FUN = f, tab, rooted = TRUE, quiet = TRUE))
}
# yes, they seem ok
# plot tree with bootstrap values
plot(nj_tree_root, use.edge.length = FALSE)
nodelabels(nj_tree_root_boot, adj = c(1.2, 1.2), frame = "none")
Typical output for the bootstrap is [1] 100 6 39 27 23 57 53 75 71 and here's the plot (far LHS value should be 100, it was cropped somehow):
I transform the data to send it to PAST like so:
tab1 <- t(apply(tab, 1, as.numeric))
write.table(tab1, "tab.txt")
In PAST I open the tab.txt file and run multivariate -> cluster -> Neighbour Joining with Euclidean distance and 100 bootstrap replications, using an outgroup. From PAST I get this plot:
And the values are very different. What do I need to do with R to make the output match that from PAST? Is PAST wrong?
The data:
tab <- structure(list(X1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 2L, 2L), .Label = c("0", "1"), class = "factor"), X2 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
X3 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
2L), .Label = c("0", "1"), class = "factor"), X4 = structure(c(2L,
2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L), .Label = c("0",
"1"), class = "factor"), X5 = structure(c(1L, 1L, 1L, 1L,
2L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
X6 = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L,
2L), .Label = c("0", "1"), class = "factor"), X7 = structure(c(1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), X8 = structure(c(2L, 2L, 2L, 2L,
1L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
X9 = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
1L), .Label = c("0", "1"), class = "factor"), X10 = structure(c(1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), X11 = structure(c(1L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
X12 = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("0", "1"), class = "factor"), X13 = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), X14 = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
X15 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L), .Label = c("0", "1"), class = "factor"), X16 = structure(c(2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("0",
"1"), class = "factor"), X17 = structure(c(2L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
X18 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L,
1L), .Label = c("0", "1"), class = "factor"), X19 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), X20 = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
X21 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("0", "1"), class = "factor"), X22 = structure(c(2L,
2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), X23 = structure(c(1L, 1L, 2L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
X24 = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L,
2L), .Label = c("0", "1"), class = "factor"), X25 = structure(c(1L,
1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), X26 = structure(c(1L, 1L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("X1",
"X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10", "X11",
"X12", "X13", "X14", "X15", "X16", "X17", "X18", "X19", "X20",
"X21", "X22", "X23", "X24", "X25", "X26"), row.names = c("a",
"b", "c", "d", "e", "f", "g", "h", "i", "j", "k"), class = "data.frame")

After much searching around, it turns out the answer is in the ape package FAQ, Q14:
I have done a bootstrap analysis with boot.phylo but some bootstrap
values seem at the wrong place after rooting the tree. This is because
the bootstrap values are counted as the frequencies of clades, and not
as actual bipartitions. So these values are really associated to the
nodes, not to the edges. A consequence is that some of the bootstrap
values are likely to lose their meaning after (re)rooting the tree
since this will affect the definition of the clades in the tree. A
simple solution is to include the rooting process in the definition of
the function FUN that is given as argument to boot.phylo. Obviously
the estimated tree must also be rooted in the same way before doing
the bootstrap. In this situation, it is more convenient to define FUN
beforehand. An example code would be:
outgroup <- 1 # may be several tips, numeric or tip labels
foo <- function(xx) root(nj(dist.dna(xx)), outgroup)
tr <- foo(X) # X is the matrix of DNA sequences
bp <- boot.phylo(tr, X, foo)
plot(tr)
nodelabels(bp) # will have "100" at the root
In the specific case of my question:
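# root inside FUN, as the FAQ advises, and bootstrap the tree rooted the same way
f_root <- function(xx) root(nj(daisy(xx)), 1, resolve.root = TRUE)
nj_tree_root <- f_root(tab)
nj_tree_root_boot <- boot.phylo(nj_tree_root, tab, FUN = f_root, rooted = TRUE)
plot(nj_tree_root, use.edge.length = FALSE)
nodelabels(nj_tree_root_boot, adj = c(1.2, 1.2), frame = "none")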
Which matches the PAST output quite well.
