I am working on a project that requires me to one-hot encode a single variable, and I cannot seem to do it correctly.
I simply want to one-hot encode the variable data$Ratings so that the values 1, 2 and 3 are separated into their own columns in the data frame, each equal to either 0 or 1. E.g., if data$Ratings = 3, then the dummy for 3 would be 1. None of the other columns should change.
structure(list(ID = c(284921427, 284926400, 284946595, 285755462,
285831220, 286210009, 286313771, 286363959, 286566987, 286682679
), AUR = c(4, 3.5, 3, 3.5, 3.5, 3, 2.5, 2.5, 2.5, 2.5), URC = c(3553,
284, 8376, 190394, 28, 47, 35, 125, 44, 184), Price = c(2.99,
1.99, 0, 0, 2.99, 0, 0, 0.99, 0, 0), AgeRating = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1), Size = c(15853568, 12328960, 674816, 21552128,
34689024, 48672768, 6328320, 64333824, 2657280, 1466515), HasSubtitle = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), InAppSum = c(0, 0, 0, 0, 0, 1.99,
0, 0, 0, 0), InAppMin = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppMax = c(0,
0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppCount = c(0, 0, 0, 0, 0,
1, 0, 0, 0, 0), InAppAvg = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0),
descriptionTermCount = c(263, 204, 97, 272, 365, 368, 113,
129, 61, 87), LanguagesCount = c(17, 1, 1, 17, 15, 1, 0,
1, 1, 1), EngSupported = c(2, 2, 2, 2, 2, 2, 1, 2, 1, 2),
GenreCount = c(2, 2, 2, 2, 3, 3, 3, 2, 3, 2), months = c(7,
7, 7, 7, 7, 7, 7, 8, 8, 8), monthsSinceUpdate = c(29, 17,
25, 29, 15, 6, 71, 12, 23, 134), GameFree = c(0, 0, 0, 0,
0, 1, 0, 0, 0, 0), Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2,
3)), row.names = c(NA, 10L), class = "data.frame")
install.packages("mlbench")
install.packages("neuralnet")
install.packages("mltools")
library(mlbench)
library(dplyr)
library(caret)
library(mltools)
library(tidyr)
data2 <- mutate_if(data, is.factor,as.numeric)
data3 <- lapply(data2, function(x) as.numeric(as.character(x)))
data <- data.frame(data3)
summary(data)
head(data)
str(data)
View(data)
#
dput(head(data, 10))
data %>% mutate(value = 1) %>% spread(data$Ratings, value, fill = 0 )
Is this what you want? I will assume your data is called data and continue with that for the data frame you supplied:
library(plm)
plm::make.dummies(data$Ratings) # returns a matrix
## 2 3
## 2 1 0
## 3 0 1
# returns the full data frame with dummies added:
plm::make.dummies(data, col = "Ratings")
## [not printed to save space]
There are some options for plm::make.dummies, e.g., you can select the base category via base and you can choose whether to include the base (add.base = TRUE) or not (add.base = FALSE).
The help page ?plm::make.dummies has more examples and explanation as well as a comparison for LSDV model estimation by a factor variable and by explicitly self-created dummies.
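As an alternative that needs no extra package, the same indicator columns can be sketched in base R with model.matrix(). This is only a minimal illustration: the small data frame and the Ratings_ column prefix are assumptions made for the example, not part of the original data.

```r
# Toy data standing in for the question's data frame (hypothetical values)
data <- data.frame(ID = 1:5, Ratings = c(3, 3, 2, 1, 3))

# "- 1" drops the intercept, so every level of Ratings gets its own 0/1 column
dummies <- model.matrix(~ factor(Ratings) - 1, data = data)

# Give the indicator columns readable names (the prefix is an arbitrary choice)
colnames(dummies) <- paste0("Ratings_", levels(factor(data$Ratings)))

# Append the indicators; the original columns are left unchanged
data_onehot <- cbind(data, dummies)
print(data_onehot)
```

Only levels that actually occur in the column get a dummy, so the ten rows shown in the question (which contain only ratings 2 and 3) would yield two indicator columns.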
I am currently struggling to remove words from a large data frame in R.
This is the df:
The first column (GeneID) contains a so-called "ensembl gene ID" (e.g. ENSG00000223972.5 in the first row), followed by a "|". After that comes the real gene name. I now want to remove the "ensembl gene ID" including the "|", so that only the real gene name remains in this column. Is there a smart way to do this? For example with the stringr package?
Cheers!
Edit:
> dput(head(data3))
structure(list(GeneID = c("ENSG00000223972.5|DDX11L1", "ENSG00000227232.5|WASH7P",
"ENSG00000278267.1|MIR6859-1", "ENSG00000243485.5|MIR1302-2HG",
"ENSG00000284332.1|MIR1302-2", "ENSG00000237613.2|FAM138A"),
`DC2-CD5pos-d1` = c(2, 47, 0, 0, 0, 0), `DC2-CD5pos-d2` = c(0,
41, 0, 0, 0, 0), `DC2-CD5pos-d3` = c(2, 31, 0, 0, 0, 0),
`DC2-CD5pos-d4` = c(0, 29, 0, 0, 0, 0), `DC3-d1` = c(1, 36,
0, 0, 0, 0), `DC3-d2` = c(0, 33, 0, 0, 0, 0), `DC3-d3` = c(0,
49, 0, 0, 0, 3), `DC3-d4` = c(0, 27, 0, 0, 0, 0), `DC2-BTLA-S-d1` = c(2,
4, 0, 1, 0, 0), `DC2-BTLA-S-d3` = c(6, 6, 1, 0, 0, 0), `DC2-BTLA-S-d4` = c(2,
1, 0, 0, 0, 0), `DC3-CD163-S-d1` = c(2, 8, 2, 0, 0, 0), `DC3-CD163-S-d3` = c(5,
9, 0, 0, 0, 0), `DC3-CD163-S-d4` = c(0, 5, 0, 0, 0, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
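One possible approach, sketched here in base R, is to delete everything up to and including the first "|" with sub(). The vector below reuses three of the IDs from the dput output; the variable names gene_ids and gene_names are made up for the example.

```r
# Three IDs taken from the dput output above
gene_ids <- c("ENSG00000223972.5|DDX11L1",
              "ENSG00000227232.5|WASH7P",
              "ENSG00000278267.1|MIR6859-1")

# "^[^|]*\\|" matches from the start of the string up to and including the first "|"
gene_names <- sub("^[^|]*\\|", "", gene_ids)
print(gene_names)
```

Applied to the data frame, this would be data3$GeneID <- sub("^[^|]*\\|", "", data3$GeneID). With stringr, str_remove(data3$GeneID, "^[^|]*\\|") should behave the same way.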
I have a dataset in which some of the variables are binary.
The first column contains names, so cluster analysis fails with an error.
kc <- kmeans(j1,4) ## j1 is the stored data frame
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
Here is the head of the data, from dput(j1[1:5, ]):
structure(list(OUTPUT_NAME = c("nonsaturation_fba268_2ch_0_out.wav",
"nonsaturation_fba268_2ch_32_out.wav", "substreaminfo_fba268_2ch_96_out.wav",
"substreaminfo_fba268_2ch_201_out.wav", "substreaminfo_fba268_2ch_93_out.wav"
), PEAK_MIPS = c(82.47, 82.5, 82.63, 82.73, 82.73), PRESENTATION = c(0,
0, 0, 0, 0), DTHD_ATMOS_PRE = c(0, 0, 0, 0, 0), FBAFBBDETECTER = c(1,
1, 1, 1, 1), DIAL_NORM = c(31, 31, 31, 31, 31), NORMAL_DRC = c(0,
0, 0, 0, 0), ANALOG_DB_GAIN_REQ = c(0, 0, 0, 0, 0), DECODER_CH_ASSIGN = c(1,
1, 1, 1, 1), DECODER_6_CH_ASSIGN = c(1, 1, 13, 1, 1), DECODER_8_CH_ASSIGN = c(1,
1, 13, 1, 1), DECODER_16_CH_ASSIGN = c(0, 0, 0, 0, 0), CH_MODIFIER = c(0,
0, 0, 0, 0), CH_ASSIGNMENT_TYPE = c(0, 0, 0, 0, 0), FILTER_ORDER = c(0,
0, 0, 0, 0), COEFF_BITS = c(9, 9, 9, 9, 9), COEFF_SHIFT = c(7,
7, 7, 7, 7), STATE_BITS = c(4, 4, 6, 6, 6), STATE_SHIFT = c(0,
0, 0, 0, 0), `31EC_PRIMITIVE_MATRIX_CNT` = c(16, 16, 8, 8, 8),
LSB_BYPASS_COUNT = c(0, 0, 0, 0, 0), DITHER_SCALE = c(1,
1, 1, 1, 1), `31EC_FRAC_BITS` = c(14, 14, 12, 12, 12), INTERPOLATION_USED = c(1,
1, 0, 0, 0), `31EA_31EB_PRIMITIVE_MATIX_CNT` = c(0, 0, 0,
0, 0), `31EA_31EB_FRAC_BITS` = c(14, 14, 12, 12, 12), LSB_BYPASS_USED = c(0,
0, 0, 0, 0), AU_LENGTH = c(937, 937, 937, 937, 937), VARIABLE_RATE = c(1,
1, 1, 1, 1), PEAK_DATA_RATE = c(6000, 6000, 6000, 6000, 6000
), SUBSTREAM_CNT = c(1, 1, 2, 2, 2), EXTENDED_SUBSTREAM_CNT = c(0,
0, 0, 0, 0), SUBSTREAM_INFO = c(20, 20, 40, 24, 24), SPEAKER_LAYOUT = c(0,
0, 0, 0, 0), CONTROL_EN_2 = c(0, 0, 0, 0, 0), CONTROL_EN_6 = c(0,
0, 0, 0, 0), CONTROL_EN_8 = c(0, 0, 0, 0, 0), MIX_LEVEL_2 = c(35,
35, 35, 35, 35), MIX_LEVEL_6 = c(35, 35, 35, 35, 35), MIX_LEVEL_8 = c(35,
35, 35, 35, 35), DIALOGUE_NORM_2 = c(31, 31, 31, 31, 31),
DIALOGUE_NORM_6 = c(31, 31, 31, 31, 31), DIALOGUE_NORM_8 = c(31,
31, 31, 31, 31), SOURCE_FORMAT_6 = c(0, 0, 0, 0, 0), SOURCE_FORMAT_8 = c(0,
0, 0, 0, 0), DRC_STARTUP_GAIN = c(0, 0, 0, 0, 0), DIALOGUE_NORM_16 = c(28,
28, 31, 31, 31), MIX_LEVEL_16 = c(35, 35, 35, 35, 35), CHANNEL_CNT_16 = c(16,
16, 16, 16, 16), DYNAMIC_OBJ_ONLY = c(1, 1, 1, 1, 1), DYNAMIC_CHANNEL_CNT_16 = c(0,
0, 0, 0, 0), LFE_PRE = c(1, 1, 0, 0, 0), CHANNEL_CONTENT_DES_16 = c(0,
0, 0, 0, 0), MIN_CHAN = c(0, 0, 0, 0, 0), MAX_CHAN = c(1,
1, 1, 1, 1), RESTART_SYNC_WORD = c(12778, 12778, 12778, 12778,
12778), MAX_MATRIX_CHAN = c(1, 1, 1, 1, 1), DITHER_SHIFT = c(0,
0, 0, 0, 0), ERROR_PROTECT = c(1, 1, 1, 1, 1), LOSSLESS_PROTECT = c(0,
0, 1, 1, 1), BLOCK_SIZE = c(32, 32, 40, 40, 40), OUTPUT_SHIFT = c(0,
0, 0, 0, 0), QUANT_STEP_SIZE = c(0, 0, 0, 0, 0), HUFF_OFFSET = c(0,
0, 0, 0, 0), HUFF_TYPE = c(1, 1, 0, 2, 2), HUFF_LSBS = c(6,
6, 8, 5, 5), SAMPLE_RATE = c(0, 3, 0, 3, 0), OUTPUT_SAMPLE_COUNT = c(40,
40, 40, 40, 40), RESTART_HEADER_EXISTS = c(0, 0, 0, 0, 0)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
Your first column is not numeric. Look at this:
class(j1[[1]])  # [[1]] extracts the column itself; on a tibble, j1[,1] would return a tibble
[1] "character"
You have to remove it for kmeans to work:
set.seed(1234)
kmeans(j1[,-1],2)
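If non-numeric columns can appear anywhere in the data frame, a slightly more general sketch is to keep only the numeric columns before clustering. The toy data frame below is invented for illustration, and nstart = 10 is an arbitrary choice to make the result less dependent on the random start.

```r
# Toy data: one character column plus two numeric ones (hypothetical values)
j1 <- data.frame(
  name = c("a.wav", "b.wav", "c.wav", "d.wav", "e.wav", "f.wav"),
  peak = c(82.4, 82.5, 90.1, 90.3, 82.6, 90.2),
  bits = c(4, 4, 6, 6, 4, 6),
  stringsAsFactors = FALSE
)

# Flag the numeric columns; kmeans() cannot handle character data
num_cols <- vapply(j1, is.numeric, logical(1))

set.seed(1234)
kc <- kmeans(j1[, num_cols], centers = 2, nstart = 10)

# Attach the cluster labels to the original rows, names included
j1$cluster <- kc$cluster
print(j1)
```

This keeps the name column in the data frame, so each cluster label can still be traced back to its file.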
I am trying to display a data frame using the kable function. The data frame has 29 columns, but only 9 distinct columns are displayed and the remaining ones are repeats.
The dataframe used is
structure(list(Name = c("Grand Total", "B", "C", "D", "E", "F"
), GrandTotal = c(3416, 297, 410, 326, 125, 29), English = c(1096,
18, 64, 0, 55, 0), Science = c(211, 5, 39, 0, 55, 0), Language = c(149,
5, 0, 0, 10, 0), Maths = c(22, 0, 0, 0, 0, 0), Social = c(0,
0, 0, 0, 0, 0), English = c(211, 5, 39, 0, 55, 0), Science = c(149,
5, 0, 0, 10, 0), Maths = c(0, 0, 0, 0, 0, 0), Social = c(22,
0, 0, 0, 0, 0), English = c(1096, 18, 64, 0, 55, 0), Science = c(211,
5, 39, 0, 55, 0), Language = c(149, 5, 0, 0, 10, 0), Maths = c(22,
0, 0, 0, 0, 0), Social = c(0, 0, 0, 0, 0, 0), English = c(211,
5, 39, 0, 55, 0), Science = c(149, 5, 0, 0, 10, 0), ACIntern = c(0,
0, 0, 0, 0, 0), PAM = c(22, 0, 0, 0, 0, 0), Maths = c(1096, 18,
64, 0, 55, 0), Social = c(211, 5, 39, 0, 55, 0), English = c(149,
5, 0, 0, 10, 0), Science = c(22, 0, 0, 0, 0, 0), Language = c(0,
0, 0, 0, 0, 0), Maths = c(211, 5, 39, 0, 55, 0), Social = c(149,
5, 0, 0, 10, 0), English = c(0, 0, 0, 0, 0, 0), Science = c(22,
0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
The code used to display the data frame as a table is as follows:
monthSelected <- c("April","May","June")
month1 <- paste0(monthSelected[1],' ',yearSelected)
month2 <- paste0(monthSelected[2],' ',yearSelected)
month3 <- paste0(monthSelected[3],' ',yearSelected)
myHeader <- c(" " = 2, month1 = 9, month2 = 9, month3 = 9)
names(myHeader) <- c(" ", month1, month2, month3)
kable(df[1:ncol(df)],"html") %>%
kable_styling(c("striped", "bordered")) %>%
add_header_above(c(" "=2, "IND" = 5, "US" = 4,"IND" = 5, "US" = 4,"IND" = 5, "US" = 4)) %>%
add_header_above(header = myHeader)
The output displayed is as follows
I can't figure out where I went wrong. Can anyone help me out with this issue?
In addition, is it possible to freeze the first two columns when the table is scrolled horizontally?
Thanks in advance!!
I would like to overlay two different survival curves on the same plot, for example OS and PFS (the values below are made up).
N pt.   OS   OS_Time_(years)   PFS   PFS_Time_(years)
_____________________________________________________
1       1    12                0     12
2       0    10                1     8
3       0    14                0     14
4       0    10                0     10
5       1    11                1     8
6       1    16                1     6
7       0    11                1     4
8       0    12                1     10
9       1    9                 0     9
10      1    10                1     9
_____________________________________________________
First, I import my dataset:
library(readxl)
testR <- read_excel("~/test.xlsx")
View(testR)
Then, I created survfit for both OS and PFS:
OS<-survfit(Surv(OS_t,OS)~1, data=test)
PFS<-survfit(Surv(PFS_t,PFS)~1, data=test)
And finally, I can plot each one thanks to:
plot(OS)
plot(PFS)
for example (or ggplot2...).
Here is my question: how can I overlay the two curves on the same graph?
I tried multipleplot or
ggplot(testR, aes(x)) + # basic graphical object
geom_line(aes(y=y1), colour="red") + # first layer
geom_line(aes(y=y2), colour="green") # second layer
But it didn't work (though I'm not sure I used it correctly).
Can someone help me, please?
Thanks a lot
Here is my code for Data sample:
test <- structure(list(ID = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 8, 9),
Sex = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
Tabac = c(2, 0, 1, 1, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 2, 0, 1, 1, 1),
Bmi = c(20, 37, 37, 25, 28, 38, 16, 27, 26, 28, 15, 36, 20, 17, 28, 37, 27, 26, 18),
Age = c(75, 56, 45, 65, 76, 34, 87, 43, 67, 90, 56, 37, 84, 45, 80, 87, 90, 65, 23), c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0),
OS_times = c(2, 4, 4, 2, 3, 5, 5, 3, 2, 2, 4, 1, 3, 2, 4, 3, 4, 3, 2),
OS = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0),
PFS_time = c(1, 2, 1, 1, 3, 4, 3, 1, 2, 2, 4, 1, 2, 2, 2, 3, 4, 3, 2),
PFS = c(1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0)),
.Names = c("ID", "Sex", "Tabac", "Bmi", "Age", "LN", "OS_times", "OS", "PFS_time", "PFS"),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -19L))
You may use the ggsurv function from the GGally package in the following way. Combine both groups of variables in a data frame and add a "type" column. Later in the call to the plot, you refer to the type.
I used your data structure and named it "test". Afterwards, I transformed it to a data frame with the name "testdf".
library(survival)  # provides Surv() and survfit()
library(GGally)    # provides ggsurv()
testdf <- data.frame(test)
OS_PFS1 <- data.frame(life = testdf$OS, life_times = testdf$OS_times, type= "OS")
OS_PFS2 <- data.frame(life = testdf$PFS, life_times = testdf$PFS_time, type= "PFS")
OS_PFS <- rbind(OS_PFS1, OS_PFS2)
sf.OS_PFS <- survfit(Surv(life_times, life) ~ type, data = OS_PFS)
ggsurv(sf.OS_PFS)
If you want the confidence intervals shown:
ggsurv(sf.OS_PFS, CI = TRUE)
Please let me know whether this is what you want.
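For comparison, here is a minimal base-graphics sketch of the same overlay that fits the two curves separately and draws the second one with lines(). It uses the ten made-up patients from the table in the question; the object names, colours and axis labels are arbitrary choices for the example.

```r
library(survival)

# The ten made-up patients from the table in the question
test <- data.frame(
  OS       = c(1, 0, 0, 0, 1, 1, 0, 0, 1, 1),
  OS_times = c(12, 10, 14, 10, 11, 16, 11, 12, 9, 10),
  PFS      = c(0, 1, 0, 0, 1, 1, 1, 1, 0, 1),
  PFS_time = c(12, 8, 14, 10, 8, 6, 4, 10, 9, 9)
)

# One Kaplan-Meier fit per endpoint
OS_fit  <- survfit(Surv(OS_times, OS) ~ 1, data = test)
PFS_fit <- survfit(Surv(PFS_time, PFS) ~ 1, data = test)

# Draw OS first, then add PFS to the same axes
plot(OS_fit, conf.int = FALSE, col = "red",
     xlab = "Time (years)", ylab = "Survival probability")
lines(PFS_fit, conf.int = FALSE, col = "blue")
legend("bottomleft", legend = c("OS", "PFS"),
       col = c("red", "blue"), lty = 1)
```

The x-axis range is set by the first curve drawn, so it is safer to plot the endpoint with the longer follow-up first (OS here) or to pass an explicit xlim.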