How to change from factored data to numeric? - r

I have this dataframe in factored form:
Data <- structure(list(ID = c("1", "2", "3", "4", "5",
"6"), V1 = structure(c(1L, 1L, 4L, 4L, 4L, 1L), .Label = c("1",
"129", "2", "3", "76"), class = "factor"), V2 = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("1", "3"), class = "factor"),
V3 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"),
V4 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"3"), class = "factor"), V5 = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"), V6 = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"), V7 = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"), V8 = structure(c(2L,
2L, 2L, 2L, 2L, 2L), .Label = c("0", "1", "3"), class = "factor"),
V9 = structure(c(2L, 2L, 3L, 2L, 2L, 2L
), .Label = c("0", "1", "3"), class = "factor"), V10 = structure(c(2L,
2L, 2L, 2L, 2L, 2L), .Label = c("0", "1", "2", "3"), class = "factor"),
V11 = structure(c(2L, 2L, 2L, 2L,
2L, 2L), .Label = c("0", "1"), class = "factor"), V12 = structure(c(1L,
1L, 1L, 1L, 1L, 3L), .Label = c("1", "2", "3"), class = "factor"),
V13 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3"), class = "factor"), V14 = structure(c(1L,
1L, 2L, 1L, 1L, 1L), .Label = c("1", "3"), class = "factor"),
V15 = structure(c(2L, 2L, 2L, 2L, 2L,
2L), .Label = c("0", "1", "3"), class = "factor"), V17 = structure(c(3L,
1L, 3L, 1L, 1L, 3L), .Label = c("1", "2", "3"), class = "factor"),
V18 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3"), class = "factor"), V19 = structure(c(1L,
1L, 2L, 1L, 1L, 1L), .Label = c("1", "3"), class = "factor"),
V20 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"3"), class = "factor"), V21 = structure(c(1L, 3L,
1L, 1L, 3L, 1L), .Label = c("1", "2", "3"), class = "factor"),
V22 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3"), class = "factor"), V23 = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("1", "3"), class = "factor"),
V24 = structure(c(1L, 1L, 1L, 1L, 1L, 1L
), .Label = "1", class = "factor"), V25 = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("1", "2", "3"), class = "factor"),
V26 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2"), class = "factor"), V27 = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"), V28 = structure(c(2L,
2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
V29 = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), V30 = structure(c(2L,
2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
V31 = structure(c(2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("0", "1"), class = "factor"), V32 = structure(c(2L,
2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
Totals = structure(c(1L, 1L, 4L, 1L, 2L, 2L), .Label = c("1",
"2", "3", "5"), class = "factor")), row.names = c(NA, 6L), class = "data.frame")
It is in factored form but I need to change it to numeric form without changing the original dataframe called Data. So, I tried this method:
Data2 <- lapply(Data[c(1:33)], numeric)
This gave me the "invalid length argument" error. So I tried this method after looking up the issue:
Data2 <- lapply(Data[c(1:33)], as.numeric)
Data2 <- as.data.frame(Data2)
I do indeed get a new dataframe, but the data doesn't match what I have in my script. Some numbers change by 1 value, for example. (Where there is a 3, it is a 4. Where there is a 4, there is a 5).
Is there another way to do this?
EDIT: earlier in my script I convert from character to factor using this method:
Data <- lapply(Data[c(2:33)], factor)
Would it be easier to instead convert to numeric and wait until I am done with all of my analyses to convert to factor?

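You need to convert to character first. Calling as.numeric() directly on a factor returns its internal integer codes, not the labels, which is why your values shift. Assigning into Data2[] keeps a data frame and leaves Data untouched:
Data2 <- Data
Data2[] <- lapply(Data, function(x) as.numeric(as.character(x)))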
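To see why the numbers shift, here is a minimal sketch on a toy factor. The vector f is made up for illustration and is not taken from Data. It compares the internal level codes with the values recovered by going through character first.

```r
# Toy factor; its levels are sorted as strings: "1", "129", "3", "76"
f <- factor(c("3", "76", "1", "129", "3"))

# as.numeric() on a factor returns the position of each value in levels(f)
codes <- as.numeric(f)

# Going through character first recovers the numbers shown in the labels
labels <- as.numeric(as.character(f))

print(levels(f))
print(codes)
print(labels)
```

The codes only happen to equal the labels when the levels are exactly "1", "2", "3", ... with none missing, which is why some columns look right and others are off.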

Try this dplyr approach:
library(dplyr)
#Code
Data2 <- Data %>% mutate(across(2:33,~as.numeric(as.character(.))))

We can use type.convert from base R
Data <- type.convert(Data, as.is = TRUE)
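As a rough illustration of what type.convert does here, the sketch below runs it on a small made-up data frame. The names toy and toy2 are hypothetical. The result is stored in a new object, so the factor version stays available.

```r
# Made-up data: one character column and two factor columns
toy <- data.frame(
  ID = c("1", "2", "3"),
  V1 = factor(c("1", "3", "129")),
  V2 = factor(c("0", "1", "1")),
  stringsAsFactors = FALSE
)

# Store the converted copy under a new name; toy itself is not modified
toy2 <- type.convert(toy, as.is = TRUE)

print(sapply(toy, class))   # classes before conversion
print(sapply(toy2, class))  # classes after conversion
print(toy2$V1)
```

Note that type.convert looks at every column, so a character column such as ID is converted as well if all its values look numeric.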

Related

filter dataframes for the first occurrence of any non-zero value per row

These are the type of data frames that I have, with two examples as to how they can differ:
- - - - - - - - - - - - - - - - - - - - - - - - Type1 992.0 4461.0 1.2 38476.0 :1..4473
Second data frame:
- - - - - - - - - - - - - - - - - - - - - - - - Type2 1.0 5131.0 0.4 44433.0 -:1998..7151
- - - - - - - - - - - - - - - - - - - - - - - - Type2 5331.0 845.0 1.3 6672.0 -:1164..2016
Type3 1945.0 91.0 18.7 426.0 -:501..597 Type3 1912.0 91.0 18.7 426.0 -:501..597 - - - - - - - - - - - - - - - - - -
Type3 2071.0 196.0 18.9 468.0 -:10..236 Type3 2038.0 196.0 18.9 468.0 -:10..236 Type3 2049.0 141.0 16.3 441.0 -:10..196 Type3 2049.0 141.0 16.3 441.0 -:10..196 Type3 8294.0 151.0 17.2 580.0 -:10..196
- - - - - - - - - - - - - - - - - - - - - - - - Type4 8604.0 1473.0 0.5 13042.0 :1..1471
- - - - - - - - - - - - - - - - - - - - - - - - Type5 9795.0 2114.0 32.0 1971.0 :1296..3439
- - - - - - - - - - - - - - - - - - - - - - - - Type6 10131.0 5684.0 0.3 49063.0 :1455..7113
What I am looking for is code that will extract the first occurrence of anything that isn't '-' without relying on the occurrence being 'Type*', just anything not '-'. So the output would look like this:
Type1
or
Type2
Type3
Type4
Type5
Type6
I can obviously subset for anything that doesn't equal '-' but I can't figure out how to only get the first occurrence, because I want the output to have the same dimensions. I see a lot of solutions for the first occurrence of any word in an entire dataframe, but this needs to be per row and I just can't seem to get it working.
dput for the first data.frame:
dput(x)
structure(list(V1 = structure(1L, .Label = "-", class = "factor"),
V2 = structure(1L, .Label = "-", class = "factor"), V3 = structure(1L, .Label = "-", class = "factor"),
V4 = structure(1L, .Label = "-", class = "factor"), V5 = structure(1L, .Label = "-", class = "factor"),
V6 = structure(1L, .Label = "-", class = "factor"), V7 = structure(1L, .Label = "-", class = "factor"),
V8 = structure(1L, .Label = "-", class = "factor"), V9 = structure(1L, .Label = "-", class = "factor"),
V10 = structure(1L, .Label = "-", class = "factor"), V11 = structure(1L, .Label = "-", class = "factor"),
V12 = structure(1L, .Label = "-", class = "factor"), V13 = structure(1L, .Label = "-", class = "factor"),
V14 = structure(1L, .Label = "-", class = "factor"), V15 = structure(1L, .Label = "-", class = "factor"),
V16 = structure(1L, .Label = "-", class = "factor"), V17 = structure(1L, .Label = "-", class = "factor"),
V18 = structure(1L, .Label = "-", class = "factor"), V19 = structure(1L, .Label = "-", class = "factor"),
V20 = structure(1L, .Label = "-", class = "factor"), V21 = structure(1L, .Label = "-", class = "factor"),
V22 = structure(1L, .Label = "-", class = "factor"), V23 = structure(1L, .Label = "-", class = "factor"),
V24 = structure(1L, .Label = "-", class = "factor"), V25 = structure(1L, .Label = "Type1", class = "factor"),
V26 = 992, V27 = 4461, V28 = 1.2, V29 = 38476, V30 = structure(1L, .Label = ":1..4473", class = "factor")), class = "data.frame", row.names = c(NA, -1L))
The second data frame example:
structure(list(V1 = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("-",
"Type2"), class = "factor"), V2 = structure(c(1L, 1L, 2L, 3L,
1L, 1L, 1L), .Label = c("-", "1945.0", "2071.0"), class = "factor"),
V3 = structure(c(1L, 1L, 3L, 2L, 1L, 1L, 1L), .Label = c("-",
"196.0", "91.0"), class = "factor"), V4 = structure(c(1L,
1L, 2L, 3L, 1L, 1L, 1L), .Label = c("-", "18.7", "18.9"), class = "factor"),
V5 = structure(c(1L, 1L, 2L, 3L, 1L, 1L, 1L), .Label = c("-",
"426.0", "468.0"), class = "factor"), V6 = structure(c(1L,
1L, 3L, 2L, 1L, 1L, 1L), .Label = c("-", "-:10..236", "-:501..597"
), class = "factor"), V7 = structure(c(1L, 1L, 2L, 2L, 1L,
1L, 1L), .Label = c("-", "Type2"), class = "factor"), V8 = structure(c(1L,
1L, 2L, 3L, 1L, 1L, 1L), .Label = c("-", "1912.0", "2038.0"
), class = "factor"), V9 = structure(c(1L, 1L, 3L, 2L, 1L,
1L, 1L), .Label = c("-", "196.0", "91.0"), class = "factor"),
V10 = structure(c(1L, 1L, 2L, 3L, 1L, 1L, 1L), .Label = c("-",
"18.7", "18.9"), class = "factor"), V11 = structure(c(1L,
1L, 2L, 3L, 1L, 1L, 1L), .Label = c("-", "426.0", "468.0"
), class = "factor"), V12 = structure(c(1L, 1L, 3L, 2L, 1L,
1L, 1L), .Label = c("-", "-:10..236", "-:501..597"), class = "factor"),
V13 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L), .Label = c("-",
"Type2"), class = "factor"), V14 = structure(c(1L, 1L, 1L,
2L, 1L, 1L, 1L), .Label = c("-", "2049.0"), class = "factor"),
V15 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L), .Label = c("-",
"141.0"), class = "factor"), V16 = structure(c(1L, 1L, 1L,
2L, 1L, 1L, 1L), .Label = c("-", "16.3"), class = "factor"),
V17 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L), .Label = c("-",
"441.0"), class = "factor"), V18 = structure(c(1L, 1L, 1L,
2L, 1L, 1L, 1L), .Label = c("-", "-:10..196"), class = "factor"),
V19 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L), .Label = c("-",
"Type2"), class = "factor"), V20 = structure(c(1L, 1L, 1L,
2L, 1L, 1L, 1L), .Label = c("-", "2049.0"), class = "factor"),
V21 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L), .Label = c("-",
"141.0"), class = "factor"), V22 = structure(c(1L, 1L, 1L,
2L, 1L, 1L, 1L), .Label = c("-", "16.3"), class = "factor"),
V23 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L), .Label = c("-",
"441.0"), class = "factor"), V24 = structure(c(1L, 1L, 1L,
2L, 1L, 1L, 1L), .Label = c("-", "-:10..196"), class = "factor"),
V25 = structure(c(4L, 4L, 1L, 3L, 5L, 2L, 5L), .Label = c("-",
"Type3", "Type2", "Type4", "Type5"), class = "factor"),
V26 = structure(c(2L, 4L, 1L, 5L, 6L, 7L, 3L), .Label = c("-",
"1.0", "10131.0", "5331.0", "8294.0", "8604.0", "9795.0"), class = "factor"),
V27 = structure(c(5L, 7L, 1L, 3L, 2L, 4L, 6L), .Label = c("-",
"1473.0", "151.0", "2114.0", "5131.0", "5684.0", "845.0"), class = "factor"),
V28 = structure(c(3L, 5L, 1L, 6L, 4L, 7L, 2L), .Label = c("-",
"0.3", "0.4", "0.5", "1.3", "17.2", "32.0"), class = "factor"),
V29 = structure(c(4L, 7L, 1L, 6L, 2L, 3L, 5L), .Label = c("-",
"13042.0", "1971.0", "44433.0", "49063.0", "580.0", "6672.0"
), class = "factor"), V30 = structure(c(4L, 3L, 1L, 2L, 5L,
6L, 7L), .Label = c("-", "-:10..196", "-:1164..2016", "-:1998..7151",
":1..1471", ":1296..3439", ":1455..7113"), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
The following works:
apply(df, MARGIN = 1, FUN = function(row) row[!grepl("^-$", row)][1])
[1] "Type2" "Type2" "Type3" "Type3" "Type4" "Type5" "Type6"
apply with MARGIN = 1 acts on rows. The function in FUN uses grepl to drop every element of the row that is exactly - (the anchors ^ and $ keep values that merely contain a dash, such as -:10..236, from being dropped too) and returns the first remaining element with [1].
We can use a vectorized option with max.col: it finds, for each row, the position of the first column that is not -. cbind that with the sequence of row numbers and extract the values by row/column index:
df1[cbind(seq_len(nrow(df1)), max.col(df1 != "-", "first"))]
[1] "Type4" "Type4" "Type2" "Type2" "Type5" "Type3" "Type5"
x[cbind(seq_len(nrow(x)), max.col(x != "-", "first"))]
[1] "Type1"
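The max.col idea can be tried on a small made-up data frame. The sketch below uses hypothetical names (toy, is_value, first_col, first_val) and adds one row that contains only "-", a case the posted examples do not cover: max.col still returns column 1 for such a row, so it is set to NA explicitly.

```r
# Made-up data in the same style: "-" marks an empty cell
toy <- data.frame(
  V1 = c("-", "Type2", "-", "-"),
  V2 = c("-", "1945.0", "Type4", "-"),
  V3 = c("Type1", "91.0", "8604.0", "-"),
  stringsAsFactors = TRUE
)

# Logical matrix: TRUE where the cell holds something other than "-"
is_value <- toy != "-"

# Column index of the first TRUE in each row
first_col <- max.col(is_value, ties.method = "first")

# Pick one cell per row using (row, column) pairs
first_val <- toy[cbind(seq_len(nrow(toy)), first_col)]

# Rows with no value at all would otherwise return "-" from column 1
first_val[rowSums(is_value) == 0] <- NA

print(first_val)
```

The result has one element per row, so it keeps the same length as the number of rows in the input.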

cut.default error in heatmap generation R

I want to generate a heatmap from a 8*6 dataframe. The last row in the dataframe has the information to annotate the columns. Structure of the dataframe is as follows:
heatmap_try <-structure(list(BGC0000041 = structure(c(1L, 2L, 1L, 1L, 1L, 3L
), .Label = c("0", "0.447458977", "a"), class = "factor"), BGC0000128 = structure(c(1L,
1L, 1L, 3L, 2L, 4L), .Label = c("0", "1.785875195", "4.093659107",
"a"), class = "factor"), BGC0000287 = structure(c(1L, 1L, 1L,
3L, 2L, 4L), .Label = c("0", "1.785875195", "4.456229186", "b"
), class = "factor"), BGC0000294 = structure(c(3L, 1L, 2L, 4L,
1L, 5L), .Label = c("0", "2.035046947", "3.230553742", "3.286304185",
"b"), class = "factor"), BGC0000295 = structure(c(1L, 1L, 1L,
2L, 1L, 3L), .Label = c("0", "2.286304185", "c"), class = "factor"),
BGC0000308 = structure(c(4L, 2L, 3L, 5L, 1L, 6L), .Label = c("6.277728291",
"6.313707588", "6.607936616", "6.622871165", "6.64385619",
"c"), class = "factor"), BGC0000323 = structure(c(1L, 2L,
1L, 1L, 1L, 3L), .Label = c("0", "0.447458977", "c"), class = "factor"),
BGC0000328 = structure(c(1L, 2L, 1L, 1L, 1L, 3L), .Label = c("0",
"0.447458977", "c"), class = "factor")), class = "data.frame", row.names = c("Gut",
"Oral", "Anterior_nares", "Retroauricular_crease", "Vagina",
"AL"))
My code for heatmap generation is as follows (I am using pheatmap library):
library(pheatmap)
heatmap_data1 <- heatmap_try[ c(1:5), c(1:8) ]
anotation_data <- as.data.frame(t(heatmap_try[6, ]))
row.names(anotation_data) <- colnames(heatmap_data1)
pheatmap(heatmap_data1, annotation_col = anotation_data, color = colorRampPalette(c("white","blue"))(n=100),cellwidth = 40,cellheight = 6,fontsize_row = 5,cluster_rows = F,cluster_cols = F)
However, I am getting the following error:
Error in cut.default(x, breaks = breaks, include.lowest = T) :
'x' must be numeric
What am I doing wrong?
Thanks!
This is because the columns of heatmap_data1 are factors; they need to be numeric. One way to convert them while keeping the row names (which as.data.frame(lapply(...)) would drop) is:
heatmap_data1_num <- heatmap_data1
heatmap_data1_num[] <- lapply(heatmap_data1,
function(x) as.numeric(as.character(x)))
# then as before
pheatmap(heatmap_data1_num, annotation_col = anotation_data, color = colorRampPalette(c("white","blue"))(n=100),cellwidth = 40,cellheight = 6,fontsize_row = 5,cluster_rows = F,cluster_cols = F)
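Before calling pheatmap it can help to confirm that every column really is numeric and that the row labels survived the conversion. The sketch below does this on a small made-up data frame. The name scores and its values are hypothetical, not the poster's data.

```r
# Made-up factor data with meaningful row names
scores <- data.frame(
  A = factor(c("0", "0.45", "0")),
  B = factor(c("3.2", "0", "2.0")),
  row.names = c("Gut", "Oral", "Vagina")
)

# Assigning into scores_num[] replaces the columns but keeps
# the data frame's attributes, including its row names
scores_num <- scores
scores_num[] <- lapply(scores, function(x) as.numeric(as.character(x)))

print(sapply(scores_num, is.numeric))  # every column should be TRUE
print(rownames(scores_num))            # row labels are unchanged
```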

I would like to create a boxplot of numerical data, but excluding cases which are marked as '0' on another column?

I have made a boxplot for a single factor as follows:
ggplot(data = dataframe2, aes(x=factor(0), y = RPSdata$Survival.One.Year)) + geom_boxplot(...)
The dataframe is simply:
dataframe2 <- data.frame(RPSdata$Survival.One.Year)
I would like to make the same boxplot, but only including cases which are coded as '1' in column RPSdata$Survival.Complete.Sense
Thank you so much! New to R so appreciate any help
Data Sample:
> dput(head(RPSdata, 5))
structure(list(ID.Rank = 1:5, ID.Participant = c("8571762481",
"7351340719", "7396795819", "3790978753", "6450996320"), Population.Risk = structure(c(1L,
2L, 3L, 2L, 2L), .Label = c("1", "2", "3", "4", "5", "6"), class = "factor"),
Personal.Risk = c(50, 60, 30, 40, 10), Comparative.Risk.Age = structure(c(2L,
NA, 3L, 4L, 3L), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
Comparative.Risk.Current = structure(c(NA, 3L, 3L, NA, NA
), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
Comparative.Risk.Ex = structure(c(2L, 3L, NA, NA, 3L), .Label = c("1",
"2", "3", "4", "5"), class = "factor"), Score.Exposure = structure(c(1L,
1L, 1L, 2L, 1L), .Label = c("1", "2", "4", "5"), class = "factor"),
RF.Age = structure(c(1L, NA, 1L, 1L, 2L), .Label = c("0",
"1", "2"), class = "factor"), RF.Pollution = structure(c(1L,
NA, 3L, 2L, 2L), .Label = c("0", "1", "2"), class = "factor"),
RF.Asbestos = structure(c(1L, NA, 1L, 1L, 1L), .Label = c("1",
"2"), class = "factor"), RF.Asthma = structure(c(2L, NA,
3L, 2L, 1L), .Label = c("0", "1", "2"), class = "factor"),
RF.BMI = structure(c(2L, NA, 1L, 2L, 3L), .Label = c("0",
"1", "2"), class = "factor"), RF.Gene = structure(c(2L, NA,
3L, 3L, 3L), .Label = c("0", "1", "2"), class = "factor"),
RF.COPD = structure(c(2L, NA, 2L, 2L, 2L), .Label = c("0",
"1", "2"), class = "factor"), RF.History = structure(c(2L,
NA, 1L, 1L, 2L), .Label = c("0", "1", "2"), class = "factor"),
RF.Diet = structure(c(3L, NA, 1L, 2L, 3L), .Label = c("0",
"1", "2"), class = "factor"), RF.Radon = structure(c(2L,
NA, 1L, 3L, 3L), .Label = c("0", "1", "2"), class = "factor"),
RF.Smoking = structure(c(2L, NA, 2L, 2L, 2L), .Label = c("0",
"1", "2"), class = "factor"), RF.Second.Smoke = structure(c(3L,
NA, 1L, 3L, 2L), .Label = c("0", "1", "2"), class = "factor"),
Survival.One.Year = c(80, 20, NA, NA, 90), Survival.Five.Year = c(60,
50, NA, 30, 50), Survival.Ten.Year = c(40, 20, NA, NA, 2),
Worry.Frequency = structure(c(1L, 3L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4"), class = "factor"), Worry.Intensity = structure(c(1L,
2L, 2L, 2L, 1L), .Label = c("1", "2", "3", "4"), class = "factor"),
Mental.Health.One = structure(c(1L, 3L, 2L, 1L, 1L), .Label = c("0",
"1", "2", "3"), class = "factor"), Mental.Health.Two = structure(c(1L,
2L, 2L, 1L, 1L), .Label = c("0", "1", "2", "3"), class = "factor"),
Mental.Health.Three = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("0",
"1", "2", "3"), class = "factor"), Mental.Health.Four = structure(c(2L,
2L, 1L, 1L, 1L), .Label = c("0", "1", "2", "3"), class = "factor"),
PHQ.4 = structure(c(2L, 5L, 3L, 1L, 1L), .Label = c("0",
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",
"12"), class = "factor"), PHQ4.Anx = structure(c(1L, 4L,
3L, 1L, 1L), .Label = c("0", "1", "2", "3", "4", "5", "6"
), class = "factor"), PHQ4.Dep = structure(c(2L, 2L, 1L,
1L, 1L), .Label = c("0", "1", "2", "3", "4", "5", "6"), class = "factor"),
PHQ4.Bin = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("0",
"1", "2", "3"), class = "factor"), Dep.Bin = structure(c(1L,
1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
Anx.Bin = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), Survival.Compelete.Sense = structure(c(2L,
1L, 1L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
Survival.Semi.Sense = c(1L, 0L, 0L, 1L, 1L)), row.names = c(NA,
5L), class = "data.frame")
Given the problem description, there is no need for a second data.frame; RPSdata alone is enough. The problem is solved by subsetting on the column that must be equal to 1. (Note that the posted sample spells that column Survival.Compelete.Sense; use whichever name your data actually has.)
library(ggplot2)
ggplot(data = subset(RPSdata, Survival.Complete.Sense == 1),
mapping = aes(x = Survival.Complete.Sense, y = Survival.One.Year)) +
geom_boxplot()
Another option, with package dplyr, is to filter first and pipe the result to ggplot. I also coerce the x axis column to factor.
library(dplyr)
library(ggplot2)
RPSdata %>%
filter(Survival.Complete.Sense == 1) %>%
mutate(Survival.Complete.Sense = factor(Survival.Complete.Sense)) %>%
ggplot(aes(Survival.Complete.Sense, Survival.One.Year)) +
geom_boxplot()
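To check which rows the filter keeps, here is a small base R sketch. The data frame toy copies the two relevant columns from the five sample rows, with the indicator column spelled Survival.Complete.Sense as in the code above; that spelling is an assumption about the real data.

```r
# Two columns taken from the five sample rows
toy <- data.frame(
  Survival.One.Year = c(80, 20, NA, NA, 90),
  Survival.Complete.Sense = factor(c("1", "0", "0", "0", "1"))
)

# Keep only the cases coded "1"; the factor is compared by label
kept <- subset(toy, Survival.Complete.Sense == "1")

print(nrow(kept))
print(kept$Survival.One.Year)

# kept can then be passed to ggplot() exactly like the subset above, e.g.
# ggplot(kept, aes(Survival.Complete.Sense, Survival.One.Year)) + geom_boxplot()
```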

Indicator feature creation in R based on multiple columns

I have a dataset with 10 columns, 3 of which are of interest for creating a new indicator feature. The features are "pT", "pN", & "M" and they all take different values. Of all the values that these 3 features take, there are a total of 9 unique combinations that need to be captured in the new variable.
PATHOT PATHON PATHOM
1 pT2 pN1 M0
4 pT1 pN1 M0
13 pT3 pN1 M0
161 pT1 *pN2 M0
391 pT1 pN1 *M1
810 *pTIS pN1 M0
948 pT3 *pN2 M0
1043 pT2 pN1 *M1
1067 *pT4 pN1 M0
For example, the new variable will have value "1" when PATHOT=pT2, PATHON=pN1 & PATHOM=M0, and so on up to value 9. I have completed the task, but it took almost 20 lines of code, with one vectorised assignment per unique combination.
diag3_bs$sfd[diag3_bs$pathot=="pT2" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 1
diag3_bs$sfd[diag3_bs$pathot=="pT1" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 2
diag3_bs$sfd[diag3_bs$pathot=="pT3" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 3... so on upto 9.
I want to ask if there is a better more automated way of getting the same result?
dput(data.frame) is given below
structure(list(F_STATUS = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "Y", class = "factor"), EVENT_ID = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "BASELINE", class =
"factor"),
PAG_NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "BR2", class = "factor"), PTSIZE = c(3, 4,
2.7, 2, 0.9, 3, 3, 0.9, 3, 4.5), PTSIZE_U = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CM", class = "factor"),
PT_SYM = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("", "-", "<", ">"), class = "factor"), PATHOT = structure(c(4L,
4L, 4L, 3L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("*pT4", "*pTIS",
"pT1", "pT2", "pT3"), class = "factor"), PATHON = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("*pN2", "pN1"
), class = "factor"), PATHOM = structure(c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("*M1", "M0"), class = "factor"),
RSUBJID = 901000:901009, RUSUBJID = structure(1:10, .Label = c(
"000301-000-901-251", "000301-000-901-252", "000301-000-901-253",
"000301-000-901-254", "000301-000-901-255", "000301-000-901-256",
"000301-000-901-257", "000301-000-901-258", "000301-000-901-259",
"000301-000-901-260", "000301-000-901-261", "000301-000-901-262")
, class = "factor")), .Names = c("F_STATUS", "EVENT_ID", "PAG_NAME", "PTSIZE", "PTSIZE_U", "PT_SYM", "PATHOT",
"PATHON", "PATHOM", "RSUBJID", "RUSUBJID"), row.names = c(NA, 10L),
class = "data.frame")
Thanks.
I tried to edit the data so it didn't throw an error on input. Also created a version of that tabulation of possible combinations:
stg_tbl <- structure(list(PATHOT = structure(c(4L, 3L, 5L, 3L, 3L, 2L, 5L,
4L, 1L), .Label = c("*pT4", "*pTIS", "pT1", "pT2", "pT3"), class = "factor"),
PATHON = structure(c(2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("*pN2",
"pN1"), class = "factor"), PATHOM = structure(c(2L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 2L), .Label = c("*M1", "M0"), class = "factor")), .Names = c("PATHOT",
"PATHON", "PATHOM"), class = "data.frame", row.names = c("1",
"4", "13", "161", "391", "810", "948", "1043", "1067"))
Make a vector of text-equivalents of the categories:
stg_lbls <- with(stg_tbl, paste(PATHOT, PATHON, PATHOM, sep="_") )
Then the as.numeric values of a factor created using those levels will be the desired result:
dat$stg <- with(dat, factor( paste(PATHOT, PATHON, PATHOM, sep="_"), levels=stg_lbls))
as.numeric(dat$stg)
#[1] 1 1 1 2 2 1 1 2 1 1
You can just assign those values in the usual way:
dat$sfd <- as.numeric(dat$stg)
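The same lookup can also be sketched with match() instead of factor levels. This is a swapped-in alternative, not the method used above, and the objects combos and patients are made up for illustration. The position of each combination in the lookup table becomes its indicator value.

```r
# Lookup table: row order defines the indicator value (1, 2, 3, ...)
combos <- data.frame(
  PATHOT = c("pT2", "pT1", "pT3"),
  PATHON = c("pN1", "pN1", "pN1"),
  PATHOM = c("M0", "M0", "M0")
)

# Made-up patient rows to classify
patients <- data.frame(
  PATHOT = c("pT1", "pT3", "pT2", "pT1"),
  PATHON = "pN1",
  PATHOM = "M0"
)

# Build one text key per row, e.g. "pT2_pN1_M0"
combo_keys   <- do.call(paste, c(combos, sep = "_"))
patient_keys <- do.call(paste, c(patients, sep = "_"))

# match() returns the row of the lookup table each patient corresponds to
patients$sfd <- match(patient_keys, combo_keys)

print(patients)
```

A combination that is missing from the lookup table gets NA, which makes unexpected values easy to spot.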
I made some new data, that should be useful for your problem.
k<-expand.grid(data.frame(a=letters[1:3],b=letters[4:6],c=letters[7:9]))
library(dplyr)
k %>% mutate(groups=paste0(a,b,c))->k2
k2$groups<-as.numeric(factor(k2$groups))
k2
It's crude, and you're not picking which combination gets which number, so it'd take some digging afterwards, but it's quick.

Bootstrapped tree values differ from PAST

When I compute a bootstrapped tree in R I get different values to when I use PAST (http://folk.uio.no/ohammer/past/). How can I get the output to match from the two programs?
Here's what I'm doing in R (data below):
library("ape")
library("phytools")
library("phangorn")
library("cluster")
# compute neighbour-joined tree
f <- function(xx) nj(daisy(xx))
nj_tree <- f(tab)
nj_tree_root <- root(nj_tree, 1, r = TRUE)
## bootstrap
# bootstrap values do not match PAST output - why is that?
nj_tree_root_boot <- boot.phylo(nj_tree, FUN = f, tab, rooted = TRUE)
# Are bootstrap values stable?
for (i in 1:10){
print(boot.phylo(nj_tree, FUN = f, tab, rooted = TRUE, quiet = TRUE))
}
# yes, they seem ok
# plot tree with bootstrap values
plot(nj_tree_root, use.edge.length = FALSE)
nodelabels(nj_tree_root_boot, adj = c(1.2, 1.2), frame = "none")
Typical output for the bootstrap is [1] 100 6 39 27 23 57 53 75 71 and here's the plot (far LHS value should be 100, it was cropped somehow):
I transform the data to send it to PAST like so:
tab1 <- t(apply(tab, 1, as.numeric))
write.table(tab1, "tab.txt")
In PAST I open the tab.txt file, do multivariate -> cluster -> Neighbour Joining with Euclidean and 100 bootstrap replications, using an outgroup. From PAST I get this plot:
And the values are very different. What do I need to do with R to make the output match that from PAST? Is PAST wrong?
The data:
tab <- structure(list(X1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 2L, 2L), .Label = c("0", "1"), class = "factor"), X2 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
X3 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
2L), .Label = c("0", "1"), class = "factor"), X4 = structure(c(2L,
2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L), .Label = c("0",
"1"), class = "factor"), X5 = structure(c(1L, 1L, 1L, 1L,
2L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
X6 = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L,
2L), .Label = c("0", "1"), class = "factor"), X7 = structure(c(1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), X8 = structure(c(2L, 2L, 2L, 2L,
1L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
X9 = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
1L), .Label = c("0", "1"), class = "factor"), X10 = structure(c(1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), X11 = structure(c(1L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
X12 = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("0", "1"), class = "factor"), X13 = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), X14 = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
X15 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L), .Label = c("0", "1"), class = "factor"), X16 = structure(c(2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("0",
"1"), class = "factor"), X17 = structure(c(2L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
X18 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L,
1L), .Label = c("0", "1"), class = "factor"), X19 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), X20 = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
X21 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("0", "1"), class = "factor"), X22 = structure(c(2L,
2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), X23 = structure(c(1L, 1L, 2L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
X24 = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L,
2L), .Label = c("0", "1"), class = "factor"), X25 = structure(c(1L,
1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), X26 = structure(c(1L, 1L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("X1",
"X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10", "X11",
"X12", "X13", "X14", "X15", "X16", "X17", "X18", "X19", "X20",
"X21", "X22", "X23", "X24", "X25", "X26"), row.names = c("a",
"b", "c", "d", "e", "f", "g", "h", "i", "j", "k"), class = "data.frame")
After much searching around, it turns out the answer is in the ape package FAQ, Q14:
I have done a bootstrap analysis with boot.phylo but some bootstrap
values seem at the wrong place after rooting the tree. This is because
the bootstrap values are counted as the frequencies of clades, and not
as actual bipartitions. So these values are really associated to the
nodes, not to the edges. A consequence is that some of the bootstrap
values are likely to lose their meaning after (re)rooting the tree
since this will affect the definition of the clades in the tree. A
simple solution is to include the rooting process in the definition of
the function FUN that is given as argument to boot.phylo. Obviously
the estimated tree must also be rooted in the same way before doing
the bootstrap. In this situation, it is more convenient to define FUN
beforehand. An example code would be:
outgroup <- 1 # may be several tips, numeric or tip labels
foo <- function(xx) root(nj(dist.dna(xx)), outgroup)
tr <- foo(X) # X is the matrix of DNA sequences
bp <- boot.phylo(tr, X, foo)
plot(tr)
nodelabels(bp) # will have "100" at the root
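In the specific case of my question, that means moving the rooting step into f and bootstrapping the rooted tree:
f <- function(xx) root(nj(daisy(xx)), 1, r = TRUE)
nj_tree_root <- f(tab)
nj_tree_root_boot <- boot.phylo(nj_tree_root, FUN = f, tab, rooted = TRUE)
plot(nj_tree_root, use.edge.length = FALSE)
nodelabels(nj_tree_root_boot, adj = c(1.2, 1.2), frame = "none")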
Which matches the PAST output quite well.
