Binary representation of breast cancer wisconsin database - r

I want to produce a binary representation of the well-known breast cancer Wisconsin database.
The initial data set has 31 numerical variables, and one categorical variable.
id_number diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean
1 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419
2 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812
3 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069
4 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597
5 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809
I want to produce a binary representation of this dataframe by:
Transforming the diagnosis column (levels M and B) into two columns, diagnosis_M and diagnosis_B, holding 1 or 0 in each row depending on the value in the original column (M or B).
Computing the mean of each numerical column and splitting that column in two, depending on whether each value is greater or lower than the mean. E.g., for the column radius_mean, create radius_mean_great, holding 1 if the value > mean and 0 otherwise, and radius_mean_low as its inverse.
library(mlbench)
library("RCurl")
library("curl")
UCI_data_URL <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
names <- c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean','concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst')
breast.cancer.fr <- read.table(textConnection(UCI_data_URL), sep = ',', col.names = names)

There are several ways to binarize the data; here is one that I hope helps:
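# Keep only the 30 numeric feature columns (drop id_number and diagnosis)
df <- breast.cancer.fr[, 3:32]
df2 <- matrix(NA, ncol = 2 * ncol(df), nrow = nrow(df))
for (i in 1:ncol(df)) {
  m <- mean(df[, i])
  df2[, 2 * i - 1] <- as.numeric(df[, i] > m)  # *_great column
  df2[, 2 * i] <- as.numeric(df[, i] <= m)     # *_low column
}
# Interleave the names: x_great, x_low, y_great, y_low, ...
colnames(df2) <- c(rbind(paste0(names(df), "_great"), paste0(names(df), "_low")))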
library(dplyr)
df3 <- select(breast.cancer.fr,id_number,diagnosis) %>% mutate(diagnosis_M = as.numeric(diagnosis == "M")) %>%
mutate(diagnosis_B = as.numeric(diagnosis == "B"))
df <- cbind(df3[,-2],df2)
df[1:10,1:7]
id_number diagnosis_M diagnosis_B radius_mean_great radius_mean_low texture_mean_great texture_mean_low
1 842302 1 0 1 0 0 1
2 842517 1 0 1 0 0 1
3 84300903 1 0 1 0 1 0
4 84348301 1 0 0 1 1 0
5 84358402 1 0 1 0 0 1
6 843786 1 0 0 1 0 1
7 844359 1 0 1 0 1 0
8 84458202 1 0 0 1 1 0
9 844981 1 0 0 1 1 0
10 84501001 1 0 0 1 1 0
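
The question mentions both the median and the mean as the cut point. As a minimal sketch (not part of the original answer), the split can be wrapped in a helper that takes the centering function as an argument. The helper name binarize_columns, its center parameter, and the toy data frame are all hypothetical and only there for illustration.

```r
# Hypothetical helper: split every numeric column into <name>_great / <name>_low
# indicator columns around a cut point computed by `center` (mean by default).
binarize_columns <- function(df, center = mean) {
  pieces <- lapply(names(df), function(nm) {
    cut_point <- center(df[[nm]])
    res <- data.frame(
      as.integer(df[[nm]] > cut_point),
      as.integer(df[[nm]] <= cut_point)
    )
    names(res) <- paste0(nm, c("_great", "_low"))
    res
  })
  do.call(cbind, pieces)
}

# Toy data (assumed values, not taken from the Wisconsin data set)
toy <- data.frame(radius_mean = c(10, 20, 30, 100),
                  texture_mean = c(1, 2, 3, 4))

by_mean <- binarize_columns(toy)                      # cut at the column mean
by_median <- binarize_columns(toy, center = median)   # cut at the column median
```

On skewed columns the two choices can give different indicators, so it is worth deciding which one is actually wanted before binarizing the full data set.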

Related

Merge columnwise from file_list

I have 96 files in file_list
file_list <- list.files(pattern = "\\.mirna$")
They all have the same columns, but the number of rows varies. Example file:
> head(test1)
seq name freq mir start end mism add t5 t3 s5 s3 DB
1 TGGAGTGTGATAATGGTGTTT seq_100003_x4 4 hsa-miR-122-5p 15 35 11TC 0 0 g GCTGTGGA TTTGTGTC miRNA
2 TGTAAACATCCCCGACCGGAAGCT seq_100045_x4 4 hsa-miR-30d-5p 6 29 17CT 0 0 CT TTGTTGTA GAAGCTGT miRNA
3 CTAGACTGAAGCTCCTTGAAAA seq_100048_x4 4 hsa-miR-151a-3p 47 65 0 I-AAA 0 gg CCTACTAG GAGGACAG miRNA
4 AGGCGGAGACTTGGGCAATTGC seq_100059_x4 4 hsa-miR-25-5p 14 35 0 0 0 C TGAGAGGC ATTGCTGG miRNA
5 AAACCGTTACCATTACTGAAT seq_100067_x4 4 hsa-miR-451a 17 35 0 I-AT 0 gtt AAGGAAAC AGTTTAGT miRNA
6 TGAGGTAGTAGCTTGTGCTGTT seq_10007_x24 24 hsa-let-7i-5p 6 27 12CT 0 0 0 TGGCTGAG TGTTGGTC miRNA
precursor ambiguity
1 hsa-mir-122 1
2 hsa-mir-30d 1
3 hsa-mir-151a 1
4 hsa-mir-25 1
5 hsa-mir-451a 1
6 hsa-let-7i 1
second file
> head(test2)
seq name freq mir start end mism add t5 t3 s5 s3 DB
1 ATTGCACTTGTCCTGGCCTGT seq_1000013_x1 1 hsa-miR-92a-3p 49 69 14TC 0 t 0 AAAGTATT CTGTGGAA miRNA
2 AAACCGTTACTATTACTGAGA seq_1000094_x1 1 hsa-miR-451a 17 36 11TC I-A 0 tt AAGGAAAC AGTTTAGT miRNA
3 TGAGGTAGCAGATTGTATAGTC seq_1000169_x1 1 hsa-let-7f-5p 8 28 9CT I-C 0 t GGGATGAG AGTTTTAG miRNA
4 TGGGTCTTTGCGGGCGAGAT seq_100019_x12 12 hsa-miR-193a-5p 21 40 0 0 0 ga GGGCTGGG ATGAGGGT miRNA
5 TGAGGTAGTAGATTGTATAGTG seq_100035_x12 12 hsa-let-7f-5p 8 28 0 I-G 0 t GGGATGAG AGTTTTAG miRNA
6 TGAAGTAGTAGGTTGTGTGGTAT seq_1000437_x1 1 hsa-let-7b-5p 6 26 4AG I-AT 0 t GGGGTGAG GGTTTCAG miRNA
precursor ambiguity
1 hsa-mir-92a-2 1
2 hsa-mir-451a 1
3 hsa-let-7f-2 1
4 hsa-mir-193a 1
5 hsa-let-7f-2 1
6 hsa-let-7b 1
I would like to create a unique ID consisting of the columns mir and seq:
hsa-miR-122-5p_TGGAGTGTGATAATGGTGTTT
Then I would like to merge all 96 files based on this ID and take the column freq from each file.
ID freq_file1 freq_file2 ...
hsa-miR-122-5p_TGGAGTGTGATAATGGTGTTT 4 12
If an ID is not present in a specific file, the freq should be NA.
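We can use Reduce with merge on a list of data.frames. The ID is built from mir and seq with "_" as the separator, and all = TRUE keeps IDs that are missing from some files (their freq becomes NA).
lst <- lapply(mget(ls(pattern="test\\d+")),
function(x) subset(transform(x, ID=paste(mir,
seq, sep="_")), select=c("ID", "freq")))
Reduce(function(...) merge(..., by = "ID", all = TRUE), lst)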
NOTE: In the above, I assumed that the "test1", "test2" objects are already created in the global environment by reading the files in 'file_list'. If not, we can directly read the files into a list instead of creating additional data.frame objects i.e.
library(data.table)
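lst <- lapply(file_list, function(x) {
d <- fread(x, select=c("mir", "seq", "freq"))[,
list(ID=paste(mir, seq, sep="_"), freq=freq)]
# one freq column per file, named after the file
setnames(d, "freq", paste0("freq_", x))
})
# full outer join so that IDs missing from a file get NA
Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), lst)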
Or instead of fread (from data.table) use read.csv/read.table and use merge as before on 'lst'
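
A small base R sketch of the same idea may help to check the expected shape of the result. It assumes two tiny in-memory tables (invented values, not the real .mirna files) and the hypothetical labels file1 and file2. Each freq column is renamed before merging so that the merged table does not end up with repeated freq.x / freq.y names.

```r
# Toy stand-ins for two of the input files (values are assumptions)
test1 <- data.frame(seq = c("AAA", "CCC"), mir = c("miR-1", "miR-2"), freq = c(4, 7))
test2 <- data.frame(seq = c("AAA", "GGG"), mir = c("miR-1", "miR-3"), freq = c(12, 1))
tables <- list(file1 = test1, file2 = test2)

# Build ID = mir_seq and give each freq column a per-file name
lst <- Map(function(x, nm) {
  out <- data.frame(ID = paste(x$mir, x$seq, sep = "_"), freq = x$freq)
  names(out)[2] <- paste0("freq_", nm)
  out
}, tables, names(tables))

# Full outer join: IDs absent from a file get NA in that file's column
merged <- Reduce(function(a, b) merge(a, b, by = "ID", all = TRUE), lst)
```

This relies on each ID occurring at most once per file; if an ID can repeat within a file, the frequencies would need to be aggregated first, otherwise merge multiplies the matching rows.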

Survdiff() output fields in R

My question is about the output structure of the survdiff() function from the 'survival' package in R. Namely, I have a data frame containing survival data
> dat
ID Time Treatment Gender Censored
1 E002 2.7597536 IND F 0
2 E003 4.2710472 Control M 0
3 E005 1.4784394 IND F 0
4 E006 6.8993840 Control F 1
5 E008 9.5934292 IND M 0
6 E009 2.9897331 Control F 0
7 E014 1.3470226 IND F 1
8 E016 2.1683778 Control F 1
9 E018 2.7597536 IND F 1
10 E022 1.3798768 IND F 0
11 E023 0.7227926 IND M 1
12 E024 5.5195072 IND F 0
13 E025 2.4640657 Control F 0
14 E028 7.4579055 Control M 1
15 E029 5.5195072 Control F 1
16 E030 2.7926078 IND M 0
17 E031 4.9938398 Control F 0
18 E032 2.7268994 IND M 0
19 E033 0.1642710 IND M 1
20 E034 4.1396304 Control F 0
and a model
> diff = survdiff(Surv(Time, Censored) ~ Treatment+Gender, data = dat)
> diff
Call:
survdiff(formula = Surv(Time, Censored) ~ Treatment + Gender,
data = dat)
N Observed Expected (O-E)^2/E (O-E)^2/V
Treatment=Control, Gender=M 2 1 1.65 0.255876 0.360905
Treatment=Control, Gender=F 7 3 2.72 0.027970 0.046119
Treatment=IND, Gender=M 5 2 2.03 0.000365 0.000519
Treatment=IND, Gender=F 6 2 1.60 0.100494 0.139041
Chisq= 0.5 on 3 degrees of freedom, p= 0.924
Which field of the output object contains the values from the rightmost column, (O-E)^2/V? I'd like to use them further but can't obtain them from diff$obs, diff$exp, diff$var, or from their combinations.
Your help would be much appreciated.
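These values are not stored in the object; they are computed when it is printed. This model has no strata() term, so diff$obs and diff$exp are plain vectors, and for (O-E)^2/V you can try something like
(diff$obs - diff$exp)^2 / diag(diff$var)
while for (O-E)^2/E try something like
(diff$obs - diff$exp)^2 / diff$exp
With a strata() term, obs and exp become matrices with one column per stratum; in that case apply rowSums() to each of them first.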
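
As a hedged sketch, the two columns can be rebuilt by a small helper that handles both the stratified and the unstratified case. The helper name logrank_components is hypothetical, and the example uses the lung data set shipped with the survival package rather than the data frame from the question.

```r
library(survival)

# Hypothetical helper: rebuild the per-group columns that print(survdiff) shows.
logrank_components <- function(fit) {
  # With strata, obs/exp are matrices (groups x strata); sum over strata first
  stratified <- is.matrix(fit$obs)
  observed <- if (stratified) rowSums(fit$obs) else fit$obs
  expected <- if (stratified) rowSums(fit$exp) else fit$exp
  data.frame(
    observed = observed,
    expected = expected,
    oe2_over_e = (observed - expected)^2 / expected,
    oe2_over_v = (observed - expected)^2 / diag(fit$var)
  )
}

# Example on a built-in data set (an assumption, not the question's data)
fit <- survdiff(Surv(time, status) ~ sex, data = lung)
components <- logrank_components(fit)
```

Comparing components against the printed table of fit is a quick way to confirm that the reconstruction matches what survdiff reports.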

Subset a data frame based on values of another column in data frame

Is it possible to take one column of numeric values, such as dup$Number, select the columns in DG whose names match those numbers, and return the result as a new data frame?
dup
Number Letter
59 Q
91 Q
19 Q
17 Q
DG
chr pos id ref alt refc altc qual cov line_21 line_26 line_28 line_31 line_32 line_38 line_40 line_41 line_42 line_45 line_48 line_49 line_57 line_59 line_69 line_73 line_75 line_83
1 2R 7006506 2R_7006506_SNP C A 169 26 999 29 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 -
Try
indx <- grep('line', names(DG))
DG[indx[as.numeric(sub('.*_', '', names(DG)[indx])) %in% dup$Number]]
# line_59
#1 0
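
The same selection can be written step by step, which may be easier to adapt. This is a minimal sketch on a cut-down DG with only a few of the line_ columns; the values are invented for illustration, and the intermediate names line_cols, line_ids and subset_DG are hypothetical.

```r
# Cut-down versions of the question's objects (values are assumptions)
dup <- data.frame(Number = c(59, 91, 19, 17), Letter = "Q")
DG <- data.frame(chr = "2R", pos = 7006506,
                 line_21 = 0, line_59 = 0, line_69 = 2)

# Names of the line_ columns and the numeric part of each name
line_cols <- grep("^line_", names(DG), value = TRUE)
line_ids <- as.numeric(sub("^line_", "", line_cols))

# Keep only the columns whose number appears in dup$Number;
# drop = FALSE keeps the result a data frame even for a single column
subset_DG <- DG[, line_cols[line_ids %in% dup$Number], drop = FALSE]
```

Here only line_59 survives, because 91, 19 and 17 have no matching line_ column in this cut-down DG.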

Automatically creating and filling data frames in R

Here is the code that I am working with.
rnumbers <- data.frame(replicate(5,runif(20000, 0, 1)))
dt <- c(.001)
A <- dt*1
B <- dt*.5
## A = 0
## B = 1
rstate <- rnumbers # copy the structure
rstate[] <- NA # preserve structure with NA's
# Init:
rstate[1, ] <- rnumbers[1, ] < .02 & rnumbers[1, ] > 0.01
step_generator <- function(col, rnum){
for (i in 2:length(col) ){
if( rnum[i] < B) { col[i] <- 0 }
else { if (rnum[i] < A) {col[i] <- 1 }
else {col[i] <- col[i-1] } }
}
return(col)
}
# Run for each column index:
for(cl in 1:5){ rstate[ , cl] <-
step_generator(rstate[,cl], rnumbers[,cl]) }
rstate1 <- transform(rstate, time = rep(dt))
rstate2 <- transform(rstate1, cumtime = cumsum(time))
This gives me a data frame with 5 columns that contain state switches over time. Time interval is in the 6th column (seconds) and cumulative time is in the 7th column (seconds). Now I want to see how long each state lasts in seconds. This is what I am doing -
1) lengths <- rle(rstate2[,1])
>Run Length Encoding
lengths: int [1:15] 366 3278 1817 451 3033 1655 1901 748 742 1780 ...
values : num [1:15] 0 1 0 1 0 1 0 1 0 1 ...
2) lengths1 <- data.frame(state = lengths$values, duration = lengths$lengths)
> lengths1
state duration
1 0 366
2 1 3278
3 0 1817
4 1 451
5 0 3033
6 1 1655
7 0 1901
8 1 748
9 0 742
10 1 1780
11 0 26
12 1 458
13 0 305
14 1 1039
15 0 2401
3) library("plyr")
lengths2 <- transform(lengths1, time = duration*dt)
lengths3 <- arrange(lengths2, desc(state))
> lengths3
state duration time
1 1 3278 3.278
2 1 451 0.451
3 1 1655 1.655
4 1 748 0.748
5 1 1780 1.780
6 1 458 0.458
7 1 1039 1.039
8 0 366 0.366
9 0 1817 1.817
10 0 3033 3.033
11 0 1901 1.901
12 0 742 0.742
13 0 26 0.026
14 0 305 0.305
15 0 2401 2.401
4) col1 <- ddply(lengths3, .(state), function(df) 1/mean(df$time))
> col1
state V1
1 0 0.7553583
2 1 0.7439685
So, col1 is showing me "1/mean(time in each state)" for column1 of rstate2. What I would like to do is iterate steps 1-4 for every column in rstate2 and generate a data frame that looks like this :
> rates
state col1 col2 col3 col4 col5
1 0 0.1 0.2 0.3 0.4 0.5
2 1 0.3 0.4 0.5 0.6 0.7
Where the numbers in each column are equal to 1/mean(df$time) for the corresponding column of rstate2.
Thank you for any and all help.
I'd do this using the development version of data.table (v 1.8.11) in this manner:
require(data.table) # 1.8.11
require(reshape2)
DT <- data.table(rstate2)
DT.m <- melt(DT, id=6, measure=1:5)
ans <- DT.m[, {dl=data.table:::duplist(list(value));
list(state=value[dl], time=c(diff(dl),
.N-dl[length(dl)]+1)*dt)
}, by=list(variable)]
ans <- ans[, 1/mean(time), by=list(variable, state)]
dcast.data.table(ans, state ~ variable)
state X1 X2 X3 X4 X5
1: 0 0.9875568 1.0777521 0.3227194 2.2371365 0.7237054
2: 1 1.0127608 0.4442799 0.2802691 0.2887169 1.0576415
Unfortunately, it's still building on R-Forge. So, probably you can install 1.8.10 from CRAN and use reshape2's melt and cast (which'll output a data.frame) and convert the result back to a data.table and do the grouping as follows:
require(data.table) # 1.8.10
require(reshape2)
DT.m <- data.table(melt(rstate2, id=6, measure=1:5))
ans <- DT.m[, {dl=data.table:::duplist(list(value));
list(state=value[dl], time=c(diff(dl),
.N-dl[length(dl)]+1)*dt)
}, by=list(variable)]
ans <- ans[, 1/mean(time), by=list(variable, state)]
dcast(ans, state ~ variable)
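
For comparison, steps 1-4 from the question can also be iterated with base R only, using rle and tapply. This is a sketch under assumptions: states is a small hand-made stand-in for the five state columns of rstate2, and the helper name rate_per_state is hypothetical. It also assumes every column visits both states; otherwise sapply returns a list instead of a matrix.

```r
dt <- 0.001

# Stand-in for rstate2[, 1:5]: two columns of 0/1 states with known run lengths
states <- data.frame(
  X1 = rep(c(0, 1, 0, 1), times = c(100, 300, 200, 100)),
  X2 = rep(c(1, 0, 1, 0), times = c(50, 250, 150, 250))
)

# Hypothetical helper: 1 / mean(duration in seconds) for each state of one column
rate_per_state <- function(col, dt) {
  runs <- rle(col)
  tapply(runs$lengths * dt, runs$values, function(t) 1 / mean(t))
}

# One column of rates per state column, plus the state label
rates <- sapply(states, rate_per_state, dt = dt)
rates <- data.frame(state = as.numeric(rownames(rates)), rates, row.names = NULL)
```

With the real data, states would be replaced by rstate2[, 1:5], and the result has the state / col1 ... col5 layout asked for in the question.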

How can I calculate an inner product with an arbitrary number of columns using ddply?

I want to perform an inner product of the first D columns for each row in a data frame with a given array, W. I am trying the following:
W = c(1,2,3);
ddply(df, .(id), transform, inner_product=c(col1, col2, col3) %*% W);
This works but I typically may have an arbitrary number of columns. Can I generalize the above expression to handle that case?
Update:
This is an updated example as asked for in the comments:
library(kernlab);
data(spam);
W = array();
W[1:3] = seq(1,3);
spamdf = head(spam);
spamdf$id = seq(1,nrow(spamdf));
df_out=ddply(spamdf, .(id), transform, inner_product=c(make, address, all) %*% W);
> W
[1] 1 2 3
> spamdf[1,]
make address all num3d our over remove internet order mail receive will
1 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64
people report addresses free business email you credit your font num000
1 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu table
1 0 0 0 0 0 0 0 0 0 0 0 0
conference charSemicolon charRoundbracket charSquarebracket charExclamation
1 0 0 0 0 0.778
charDollar charHash capitalAve capitalLong capitalTotal type id
1 0 0 3.756 61 278 spam 1
> df_out[1,]
make address all num3d our over remove internet order mail receive will
1 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64
people report addresses free business email you credit your font num000
1 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu table
1 0 0 0 0 0 0 0 0 0 0 0 0
conference charSemicolon charRoundbracket charSquarebracket charExclamation
1 0 0 0 0 0.778
charDollar charHash capitalAve capitalLong capitalTotal type id inner_product
1 0 0 3.756 61 278 spam 1 3.2
The above example performs an inner product of the first three dimensions of the spam data set (available in the kernlab package) with an array W=(1,2,3). Here I have explicitly specified the first three dimensions as c(make, address, all).
Thus df_out[1,"inner_product"] = 3.2.
Instead I want to perform the inner product over all the dimensions without having to list all the dimensions. The conversion to a matrix and back to a data frame seems to be an expensive operation?
A strategy along the lines of the following should work:
Convert each chunk to a matrix
Perform a matrix multiplication
Convert results to data.frame
The code:
library(plyr)
set.seed(1)
df <- data.frame(
id=sample(1:5, 20, replace=TRUE),
col1 = runif(20),
col2 = runif(20),
col3 = runif(20),
col4 = runif(20)
)
W <- c(1,2,3,4)
ddply(df, .(id), function(x)as.data.frame(as.matrix(x[, -1]) %*% W))
The results:
id V1
1 1 4.924994
2 1 5.076043
3 2 7.053864
4 2 5.237132
5 2 6.307620
6 2 3.413056
7 2 5.182214
8 2 7.623164
9 3 5.194714
10 3 6.733229
11 4 4.122548
12 4 3.569013
13 4 4.978939
14 4 5.513444
15 4 5.840900
16 4 6.526522
17 5 3.530220
18 5 3.549646
19 5 4.340173
20 5 3.955517
If you want to append a column of cross-products, you could do this (assuming W has the right number of elements to match the non-"id" columns):
df2 <- cbind(df, as.matrix(df[, -grep("id", names(df))]) %*% W )
It does not appear that the .(id) grouping serves any useful purpose, since you are not summing cross-products within id; if you were, you would not be using transform but some aggregating function instead.
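
A compact sketch of the ungrouped version, on a toy data frame with invented values: the feature columns are picked by name rather than listed one by one, so the same code works for any number of columns as long as W has one weight per feature column. The name feature_cols is hypothetical.

```r
# Toy data (assumed values); any number of non-id columns would work
df <- data.frame(id = 1:3,
                 col1 = c(1, 2, 3),
                 col2 = c(0, 1, 0),
                 col3 = c(2, 2, 2))
W <- c(1, 2, 3)

# Every column except id takes part in the inner product
feature_cols <- setdiff(names(df), "id")
stopifnot(length(W) == length(feature_cols))

# One matrix product gives the row-wise inner products
df$inner_product <- as.vector(as.matrix(df[, feature_cols]) %*% W)
```

Because this is a single matrix multiplication over all rows, the conversion to a matrix happens once rather than once per id group.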
