One-hot coding to numeric [duplicate] - r

I am working on a project that requires me to one-hot encode a single variable, and I cannot seem to do it correctly.
I simply want to one-hot encode the variable data$Ratings so that the values 1, 2, and 3 are separated into their own columns in the data frame, each equal to either 0 or 1. E.g., if data$Ratings = 3, then the dummy for 3 would be 1. All the other columns should not change.
structure(list(ID = c(284921427, 284926400, 284946595, 285755462,
285831220, 286210009, 286313771, 286363959, 286566987, 286682679
), AUR = c(4, 3.5, 3, 3.5, 3.5, 3, 2.5, 2.5, 2.5, 2.5), URC = c(3553,
284, 8376, 190394, 28, 47, 35, 125, 44, 184), Price = c(2.99,
1.99, 0, 0, 2.99, 0, 0, 0.99, 0, 0), AgeRating = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1), Size = c(15853568, 12328960, 674816, 21552128,
34689024, 48672768, 6328320, 64333824, 2657280, 1466515), HasSubtitle = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), InAppSum = c(0, 0, 0, 0, 0, 1.99,
0, 0, 0, 0), InAppMin = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppMax = c(0,
0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppCount = c(0, 0, 0, 0, 0,
1, 0, 0, 0, 0), InAppAvg = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0),
descriptionTermCount = c(263, 204, 97, 272, 365, 368, 113,
129, 61, 87), LanguagesCount = c(17, 1, 1, 17, 15, 1, 0,
1, 1, 1), EngSupported = c(2, 2, 2, 2, 2, 2, 1, 2, 1, 2),
GenreCount = c(2, 2, 2, 2, 3, 3, 3, 2, 3, 2), months = c(7,
7, 7, 7, 7, 7, 7, 8, 8, 8), monthsSinceUpdate = c(29, 17,
25, 29, 15, 6, 71, 12, 23, 134), GameFree = c(0, 0, 0, 0,
0, 1, 0, 0, 0, 0), Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2,
3)), row.names = c(NA, 10L), class = "data.frame")
install.packages("mlbench")
install.packages("neuralnet")
install.packages("mltools")
library(mlbench)
library(dplyr)
library(caret)
library(mltools)
library(tidyr)
data2 <- mutate_if(data, is.factor,as.numeric)
data3 <- lapply(data2, function(x) as.numeric(as.character(x)))
data <- data.frame(data3)
summary(data)
head(data)
str(data)
View(data)
#
dput(head(data, 10))
data %>% mutate(value = 1) %>% spread(data$Ratings, value, fill = 0 )

Is this what you want? I will assume your data is called data and continue with that for the data frame you supplied:
library(plm)
plm::make.dummies(data$Ratings) # returns a matrix
## 2 3
## 2 1 0
## 3 0 1
# returns the full data frame with dummies added:
plm::make.dummies(data, col = "Ratings")
## [not printed to save space]
There are some options for plm::make.dummies, e.g., you can select the base category via base and you can choose whether to include the base (add.base = TRUE) or not (add.base = FALSE).
The help page ?plm::make.dummies has more examples and explanation as well as a comparison for LSDV model estimation by a factor variable and by explicitly self-created dummies.
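For comparison, here is a minimal base R sketch of the same idea using model.matrix(). It is an illustration under assumptions, not the approach from the answer above: the small data frame and the "Ratings_" column prefix are made up for the example.

```r
# Hypothetical toy data; only the Ratings column matters here
data <- data.frame(ID = 1:5, Ratings = c(3, 3, 2, 3, 1))

# "- 1" drops the intercept, so every level gets its own 0/1 column
dummies <- model.matrix(~ factor(Ratings) - 1, data = data)

# Rename the columns to something readable (the prefix is an arbitrary choice)
colnames(dummies) <- paste0("Ratings_", levels(factor(data$Ratings)))

# Keep all original columns unchanged and append the dummies
result <- cbind(data, dummies)
result
```

Each row of the dummy block has exactly one 1, in the column matching its rating; the original columns are left untouched.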

Related

Odds ratio for 2*5 Table [duplicate]

I have the following dataframe:
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15), var1 = c(1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1), var2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1),
var3 = c(1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1), var4 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1), outcome = c(1,
1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1)), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
I would like to write a script that calculates all possible odds ratios (using chi-square), with 95% CIs and p-values, between each column and the column outcome.
How can I do that?
I installed epitools, but it seems that I need a 2x2 contingency table, and I am not able to apply the function to the columns of a data frame.
With mapply, you can use the fisher.test function, which doesn't fail when the odds ratio cannot be calculated.
mapply(fisher.test, x = df1[, grep("var", names(df1))], MoreArgs = list(y = df1$outcome))
But the output is a 7x4 matrix which cannot be tidied into a nice format. However, we can use lapply to perform Fisher's test for each column and then tidy the results with the broom package.
library(broom)
cols <- df1[,grep("var", names(df1))]
res_list <- lapply(as.list(cols), function(x) fisher.test(x, y=df1$outcome))
do.call(rbind, lapply(res_list, broom::tidy))
# A tibble: 4 x 6
  estimate p.value conf.low conf.high method                alternative
     <dbl>   <dbl>    <dbl>     <dbl> <chr>                 <chr>
1     0      1        0          77.9 Fisher's Exact Test ~ two.sided
2   Inf      0.505    0.204     Inf   Fisher's Exact Test ~ two.sided
3     2.13   0.608    0.160      37.2 Fisher's Exact Test ~ two.sided
4   Inf      0.505    0.204     Inf   Fisher's Exact Test ~ two.sided
Or using dplyr and tidyr with purrr's map, reshaping first and then splitting on the name.
library(dplyr)
library(tidyr)   # pivot_longer()
library(purrr)   # map(), map_df()
library(broom)   # tidy()
df1 %>%
  pivot_longer(cols = starts_with("var")) %>%
  split(.$name) %>%
  map(~ fisher.test(x = .$value, y = .$outcome)) %>%
  map(tidy) %>%
  map_df(~ as_tibble(.))
Data:
df1 <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15), var1 = c(1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1), var2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1),
var3 = c(1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1), var4 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1), outcome = c(1,
1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1)), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
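The following code performs the computations as described in the question, but three of the four calls to oddsratio() fail with an error.
library(epitools)
cols <- grep("var", names(df1), value = TRUE)
res_list <- lapply(cols, function(v) {
  # 2x2 contingency table of one predictor against the outcome
  tbl <- table(df1[, c(v, "outcome")])
  # capture the error object instead of stopping the loop
  tryCatch(oddsratio(x = tbl), error = function(e) e)
})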
ok <- !sapply(res_list, inherits, "error")
res_list[ok]
The errors are all this:
simpleError in uniroot(function(or) { 1 - midp(a1, a0, b1, b0, or)
- alpha/2}, interval = interval): f() values at end points not of opposite sign
which can be seen with
res_list[!ok]
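One way to see which columns are likely to cause trouble is to check each 2x2 table for an empty cell. The sketch below is an illustration only: it assumes that zero cells are what make the interval search fail, which the source does not state, and it rebuilds df1 as a plain data frame so the snippet runs on its own.

```r
# Same values as df1 above, rebuilt as a plain data.frame for self-containment
df1 <- data.frame(
  var1 = c(1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  var2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1),
  var3 = c(1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1),
  var4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1),
  outcome = c(1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1)
)

cols <- grep("var", names(df1), value = TRUE)

# TRUE when at least one cell of the predictor-by-outcome table is empty
has_zero_cell <- sapply(cols, function(v) {
  any(table(df1[[v]], df1$outcome) == 0)
})
has_zero_cell
```

For this data, var1, var2 and var4 each have an empty cell and var3 does not, which matches the three Inf or 0 estimates in the fisher.test output above.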

Optimize multiple replacement based on condition per ID

What is the fastest way to do this? I have many 'value' columns (>100) in which I have to replace values when the matching 'aux' column is zero.
The 'value1' column should be set to zero whenever 'value1aux' (in the same row) is zero, and likewise for 'value2' and 'value2aux'.
Original data:
df <- data.frame(ID = c(1,1,1,1,1,1,1,1),
value1 = c(23, 0, 4, 1, 0, 0, 8, 12),
value2 = c(0, 12, 56, 7, 8, 1, 8, 12),
value1aux = c(0, 0, 89, 65, 0, 0, 0, 1),
value2aux = c (1,1,0,0,4,15,67,12))
Desired result:
df <- data.frame(ID = c(1,1,1,1,1,1,1,1),
value1 = c(0, 0, 4, 1, 0, 0, 0, 12),
value2 = c(0, 12, 0, 0, 8, 1, 8, 12),
value1aux = c(0, 0, 89, 65, 0, 0, 0, 1),
value2aux = c (1,1,0,0,4,15,67,12))
Code to optimize:
names <- colnames(df[2:3])
names2 <- colnames(df[4:5])
for (i in 1:nrow(df)){
df[i,names] <- replace (df[i,names], df[i,names2] == 0, 0)}
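A vectorised alternative to the row loop can be sketched as follows. It assumes that every value column has a matching column named with the suffix "aux" (as in the sample data); val_cols and aux_cols are names introduced here for illustration.

```r
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1),
                 value1 = c(23, 0, 4, 1, 0, 0, 8, 12),
                 value2 = c(0, 12, 56, 7, 8, 1, 8, 12),
                 value1aux = c(0, 0, 89, 65, 0, 0, 0, 1),
                 value2aux = c(1, 1, 0, 0, 4, 15, 67, 12))

# Pair each value column with its aux column by name (assumed naming scheme)
val_cols <- c("value1", "value2")
aux_cols <- paste0(val_cols, "aux")

# df[aux_cols] == 0 is a logical matrix with the same shape as df[val_cols];
# using it as an index zeroes all matching cells in one step
df[val_cols][df[aux_cols] == 0] <- 0
df
```

Because the comparison and the assignment each run once over the whole block of columns, there is no per-row call to replace(), which is usually the main cost in the loop version.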

How to overlay survival plot

I would like to overlay two different survival curves on the same plot, for example OS and PFS (the values here are made up).
N pt.  OS  OS_Time_(years)  PFS  PFS_Time_(years)
-------------------------------------------------
1      1   12               0    12
2      0   10               1    8
3      0   14               0    14
4      0   10               0    10
5      1   11               1    8
6      1   16               1    6
7      0   11               1    4
8      0   12               1    10
9      1   9                0    9
10     1   10               1    9
First, I import my dataset:
library(readxl)
testR <- read_excel("~/test.xlsx")
View(testR)
Then, I created survfit for both OS and PFS:
OS<-survfit(Surv(OS_t,OS)~1, data=test)
PFS<-survfit(Surv(PFS_t,PFS)~1, data=test)
Finally, I can plot each one with:
plot(OS)
plot(PFS)
for example (or with ggplot2).
My question: how can I overlay the two curves on the same graph?
I tried multipleplot or
ggplot(testR, aes(x)) + # basic graphical object
geom_line(aes(y=y1), colour="red") + # first layer
geom_line(aes(y=y2), colour="green") # second layer
But it didn't work (though I'm not sure I used it correctly).
Can someone help me, please?
Thanks a lot
Here is the code for a data sample:
test <- structure(list(ID = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 8, 9),
Sex = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
Tabac = c(2, 0, 1, 1, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 2, 0, 1, 1, 1),
Bmi = c(20, 37, 37, 25, 28, 38, 16, 27, 26, 28, 15, 36, 20, 17, 28, 37, 27, 26, 18),
Age = c(75, 56, 45, 65, 76, 34, 87, 43, 67, 90, 56, 37, 84, 45, 80, 87, 90, 65, 23), c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0),
OS_times = c(2, 4, 4, 2, 3, 5, 5, 3, 2, 2, 4, 1, 3, 2, 4, 3, 4, 3, 2),
OS = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0),
PFS_time = c(1, 2, 1, 1, 3, 4, 3, 1, 2, 2, 4, 1, 2, 2, 2, 3, 4, 3, 2),
PFS = c(1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0)),
.Names = c("ID", "Sex", "Tabac", "Bmi", "Age", "LN", "OS_times", "OS", "PFS_time", "PFS"),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -19L))
You may use the ggsurv function from the GGally package in the following way. Combine both groups of variables in a data frame and add a "type" column. Later in the call to the plot, you refer to the type.
I used your data structure and named it "test". Afterwards, I transformed it to a data frame with the name "testdf".
library(survival)  # provides Surv() and survfit()
library(GGally)
testdf <- data.frame(test)
OS_PFS1 <- data.frame(life = testdf$OS, life_times = testdf$OS_times, type= "OS")
OS_PFS2 <- data.frame(life = testdf$PFS, life_times = testdf$PFS_time, type= "PFS")
OS_PFS <- rbind(OS_PFS1, OS_PFS2)
sf.OS_PFS <- survfit(Surv(life_times, life) ~ type, data = OS_PFS)
ggsurv(sf.OS_PFS)
If you want the confidence intervals shown:
ggsurv(sf.OS_PFS, CI = TRUE)
Please let me know whether this is what you want.
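If you prefer not to reshape the data, the two fits can also be overlaid with base graphics from the survival package. This is a minimal sketch, not part of the answer above; it uses only the four survival columns from the posted sample, and the colours and axis labels are arbitrary choices.

```r
library(survival)

# Survival columns copied from the data sample in the question
test <- data.frame(
  OS_times = c(2, 4, 4, 2, 3, 5, 5, 3, 2, 2, 4, 1, 3, 2, 4, 3, 4, 3, 2),
  OS       = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0),
  PFS_time = c(1, 2, 1, 1, 3, 4, 3, 1, 2, 2, 4, 1, 2, 2, 2, 3, 4, 3, 2),
  PFS      = c(1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0)
)

# One Kaplan-Meier fit per endpoint
OS_fit  <- survfit(Surv(OS_times, OS) ~ 1, data = test)
PFS_fit <- survfit(Surv(PFS_time, PFS) ~ 1, data = test)

# Draw the first curve, then add the second to the same axes with lines()
plot(OS_fit, conf.int = FALSE, col = "red",
     xlab = "Time (years)", ylab = "Survival probability")
lines(PFS_fit, conf.int = FALSE, col = "green")
legend("bottomleft", legend = c("OS", "PFS"),
       col = c("red", "green"), lty = 1)
```

The first plot() call fixes the axis limits, so it is worth plotting the endpoint with the longer follow-up first.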

How to plot a ternary graphic with different size point in ggtern?

I would like to make a ggtern graph in which I can change the size of each point. Some patients in my data have only 1 of the 3 possible components. As a result, several patients overlap at a vertex, and I don't want to jitter.
What I have so far:
library(compositions)
library(ggtern)
ds <- structure(list(`GC+` = c(1, 0, 9, 21, 2, 0, 0, 0, 4, 0, 0, 24,
0, 0, 1, 0, 0, 3, 3, 0, 5, 0, 0, 3, 0, 0, 0, 2, 11, 0, 0, 18,
13, 0, 6, 8, 0, 1, 0, 1, 23, 0, 1, 4, 5), `PC+` = c(5, 2, 8,
0, 6, 0, 0, 0, 10, 0, 0, 20, 0, 0, 2, 0, 0, 3, 3, 0, 0, 0, 10,
2, 0, 0, 0, 0, 10, 1, 0, 4, 8, 0, 1, 16, 1, 2, 0, 0, 18, 0, 0,
0, 1), `OT+` = c(0, 2, 7, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0,
0, 2, 5, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 12, 0, 0, 6, 3, 1, 0,
6, 0, 0, 0, 0, 3, 0, 0, 3, 0), size = c(1, 1, 1, 4, 1, 0, 0,
0, 1, 0, 3, 1, 0, 0, 1, 0, 3, 1, 1, 0, 4, 0, 1, 1, 0, 0, 0, 1,
1, 2, 0, 1, 1, 3, 1, 1, 2, 1, 0, 4, 1, 0, 4, 1, 1)), row.names = c(NA,
45L), class = "data.frame")
d.tern <- as.data.frame(acomp(ds))
size <- apply(d.tern, 2, function(x) {
sum(x==1)
})
ds$size <- ifelse(d.tern$`GC+` == 1, 4,
ifelse(d.tern$`PC+` == 1, 2,
ifelse(d.tern$`OT+` == 1, 3, 1)))
ds$size[is.na(ds$size)] <- 0
ggtern(data = ds, aes(`GC+`, `PC+`, `OT+`)) +
geom_mask() +
geom_point(fill="red", shape=21, size = 3) +
theme_bw() +
theme_showarrows() +
theme_clockwise() +
labs(x = "GC+", y = "PC+", z = "OT+",
title = "Composição dos Linfonodos Positivos")
I would like to pass size from ds to geom_point. But it doesn't work.
Here is a way to count the samples per unique value:
tab <- as.data.frame(table(ds[,1:3]))
# Keep only observed samples
tab <- tab[tab$Freq > 0,]
# Fix colnames to contain plus
colnames(tab) <- gsub("\\.", "+", colnames(tab))
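# as.data.frame(table(...)) returns the classifying columns as factors,
# so convert them back to numeric via character; as.numeric() applied
# directly to a factor would return the level codes, not the values
tab[, 1:3] <- lapply(tab[, 1:3], function(x) as.numeric(as.character(x)))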
And then the plotting would be as follows:
ggtern(data = tab, aes(`GC+`, `PC+`, `OT+`)) +
geom_mask() +
geom_point(aes(size = Freq), fill="red", shape=21) +
scale_size_continuous(range = c(3, 5), breaks = sort(unique(tab$Freq))) +
theme_bw() +
theme_showarrows() +
theme_clockwise() +
labs(x = "GC+", y = "PC+", z = "OT+",
title = "Composição dos Linfonodos Positivos")
You can play around with the scale_size_continuous() function until you have sizes that satisfy you.
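The counting step depends on turning the factor columns produced by as.data.frame(table(...)) back into numbers. The short sketch below, with made-up values, shows why the route through as.character() matters when doing that conversion.

```r
# Hypothetical counts, similar in spirit to one column of ds
counts <- c(0, 3, 21, 3, 0)
f <- factor(counts)

# Direct conversion returns the internal level codes
codes <- as.numeric(f)
codes    # 1 2 3 2 1

# Going through character recovers the original values
values <- as.numeric(as.character(f))
values   # 0 3 21 3 0
```

With level codes instead of values, the points would be placed at the wrong compositions in the ternary plot, so it is worth checking the converted columns against the original data.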
