Creating multiple dummies from an existing data frame or data table - r

I am looking for a quick extension to the following solution posted here. In it Frank shows that for an example data table
test <- data.table("index"=rep(letters[1:10],100),"var1"=rnorm(1000,0,1))
You can quickly make dummies by using the following code
inds <- unique(test$index) ; test[,(inds):=lapply(inds,function(x)index==x)]
Now I want to extend this solution to a data.table that has multiple index columns, e.g.
new <- data.table("id"     = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"), 125),
                  "index1" = rep(letters[1:5], 200),
                  "index2" = rep(letters[6:15], 100),
                  "index3" = rep(letters[16:19], 250))
I need to do this for many dummies, and ideally the solution would let me get 4 things:
The total count of every index value
The mean (share of rows) at which every index value occurs
The count of every index value per id
The mean of every index value per id
In my real case the indices are named differently, so the solution would need to loop through the column names, I think.
Thanks
Simon

If you only need the four items in that list, you should just tabulate:
indcols <- paste0('index',1:3)
lapply(new[,indcols,with=FALSE],table) # counts
lapply(new[,indcols,with=FALSE],function(x)prop.table(table(x))) # means
# or...
lapply(
  new[, indcols, with=FALSE],
  function(x){
    z <- table(x)
    rbind(count=z, mean=prop.table(z))
  })
This gives
$index1
a b c d e
count 200.0 200.0 200.0 200.0 200.0
mean 0.2 0.2 0.2 0.2 0.2
$index2
f g h i j k l m n o
count 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
mean 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
$index3
p q r s
count 250.00 250.00 250.00 250.00
mean 0.25 0.25 0.25 0.25
The previous approach would work on a data.frame or a data.table, but is rather complicated. With a data.table, one can use the melt syntax:
melt(new, id="id")[, .(
  N    = .N,
  mean = .N/nrow(new)
), by=.(variable, value)]
which gives
variable value N mean
1: index1 a 200 0.20
2: index1 b 200 0.20
3: index1 c 200 0.20
4: index1 d 200 0.20
5: index1 e 200 0.20
6: index2 f 100 0.10
7: index2 g 100 0.10
8: index2 h 100 0.10
9: index2 i 100 0.10
10: index2 j 100 0.10
11: index2 k 100 0.10
12: index2 l 100 0.10
13: index2 m 100 0.10
14: index2 n 100 0.10
15: index2 o 100 0.10
16: index3 p 250 0.25
17: index3 q 250 0.25
18: index3 r 250 0.25
19: index3 s 250 0.25
This approach was mentioned by @Arun in a comment (and implemented by him as well, I think). To see how it works, first have a look at melt(new, id="id"), which transforms the original data.table into long format.
As mentioned in the comments, melting a data.table requires installing and loading reshape2 with some older versions of the data.table package; newer versions ship their own melt method.
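The same melt idea also covers items 3 and 4 of the question (per-id counts and shares) if id is added to the grouping. This is not part of the original answer, just a sketch of the extension (object names are illustrative):
molten <- as.data.table(melt(new, id = "id"))               # long format: id, variable, value
counts <- molten[, .(count = .N), by = .(id, variable, value)]
counts[, mean := count / sum(count), by = .(id, variable)]  # share of that id's rows
counts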
If you also need the dummies, they can be made in a loop as in the linked question:
newcols <- list()
for (i in indcols){
  vals <- unique(new[[i]])
  newcols[[i]] <- paste(vals, i, sep='_')
  new[, (newcols[[i]]) := lapply(vals, function(x) get(i) == x)]
}
This stores the groups of columns associated with each variable in newcols for convenience. If you wanted to do the tabulation just with these dummies (instead of with the underlying variables as in the solution above), you could do
lapply(
  indcols,
  function(i) new[, lapply(.SD, function(x){
    z <- sum(x)
    list(z, z/.N)
  }), .SDcols = newcols[[i]]])
which gives a similar result. I just wrote it this way to illustrate how data.table syntax can be used. You could again avoid square brackets and .SD here:
lapply(
  indcols,
  function(i) sapply(
    new[, newcols[[i]], with=FALSE],
    function(x){
      z <- sum(x)
      rbind(z, z/length(x))
    }))
But anyway: just use table if you can hold onto the underlying variables.
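Not in the original answer, but if you also want the per-id counts and shares (items 3 and 4 in the question) while staying with table(), a two-way tabulation is one option; a sketch:
lapply(new[, indcols, with=FALSE], function(x) table(new$id, x))                        # counts per id
lapply(new[, indcols, with=FALSE], function(x) prop.table(table(new$id, x), margin=1))  # row-wise shares per id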

Related

maximum function in R removes decimals

I have a question about the max function in R. I have a column in a data frame which has 2 decimals after the comma, and whenever I apply the max function to that column I get the highest value, but with only 1 decimal after the comma.
max(df$e)
How can I get two decimals after the comma? I tried using options() and round() but nothing works.
Reproducible example:
a = c(228, 239)
b = c(50,83)
d = c(0.27,0.24)
e = c(2.12,1.69)
df = data.frame(a,b,d,e)
max(df$e)
#[1] 2.1
df
# a b d e
# 1 228 50 0.27 2.1
# 2 239 83 0.24 1.7
Now I would like to make more calculations:
df$f = (sqrt(df[,1]/(df[,2]+0.5))/max(df$e))*100
In the end the data frame should have columns a and b without decimals, and d, e and f with two decimals after the comma.
Thank you!
tl;dr you've probably got options(digits = 2).
df
a b d e
1 228 50 0.27 2.12
2 239 83 0.24 1.69
If I set options(digits = 2), R uses this value globally to format output to two significant digits (it doesn't change the actual values). Note that this means total significant digits, not digits after the decimal point.
options(digits = 2)
df
a b d e f
1 228 50 0.27 2.1 100
2 239 83 0.24 1.7 80
To restore the value to the default, use options(digits = 7).
After restoring the default digits setting and computing df$f = (sqrt(df[,1]/(df[,2]+0.5))/max(df$e))*100 I get
a b d e f
1 228 50 0.27 2.12 100.2273
2 239 83 0.24 1.69 79.8031
You can round the last column to two decimal places:
df$f <- round(df$f, 2)
> df
a b d e f
1 228 50 0.27 2.12 100.23
2 239 83 0.24 1.69 79.80
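If only the printed output needs fixed decimals per column (keeping the stored values at full precision), format() applied per column is one option; a small sketch using the columns from the example above:
df_disp <- df
df_disp[c("d", "e", "f")] <- lapply(df[c("d", "e", "f")],
                                    function(x) format(round(x, 2), nsmall = 2))
df_disp   # a and b print without decimals, d/e/f always show two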

Comparing changes across two matrices

I'm performing some biogeographic analyses in R and the result is encoded as a pair of matrices. Columns represent geographic regions, rows indicate nodes in a phylogenetic tree and values in the matrix are the probability that the branching event occurred in the geographic region indicated by the column. A very simple example would be:
> One_node <- matrix(c(0,0.8,0.2,0),
+                    nrow=1, ncol=4,
+                    dimnames = list(c("node_1"),
+                                    c("A","B","C","D")))
> One_node
A B C D
node_1 0 0.8 0.2 0
In this case, the most probable location for node_1 is region B. In reality, the output of the analysis is encoded as two separate 79x123 matrices. The first is the probabilities of a node occupying a given region before an event and the second is the probabilities of a node occupying a given region after an event (rowSums=1). Some slightly more complicated examples:
before<-matrix(c(0,0,0,0,0.9,
0.8,0.2,0.6,0.4,0.07,
0.2,0.8,0.4,0.6,0.03,
0,0,0,0,0),
nrow=5, ncol=4,
dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
c("A","B","C","D")))
after<-matrix(c(0,0,0,0,0.9,
0.2,0.8,0.4,0.6,0.03,
0.8,0.2,0.6,0.4,0.07,
0,0,0,0,0),
nrow=5, ncol=4,
dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
c("A","B","C","D")))
> before
A B C D
node_1 0.0 0.80 0.20 0
node_2 0.0 0.20 0.80 0
node_3 0.0 0.60 0.40 0
node_4 0.0 0.40 0.60 0
node_5 0.9 0.07 0.03 0
> after
A B C D
node_1 0.0 0.20 0.80 0
node_2 0.0 0.80 0.20 0
node_3 0.0 0.40 0.60 0
node_4 0.0 0.60 0.40 0
node_5 0.9 0.03 0.07 0
Specifically, I'm only interested in extracting the row numbers where column B is the highest in before and column C is the highest in after, and vice versa, since I'm trying to extract the node numbers in the tree where taxa have moved B->C or C->B.
So the output I'm looking for would be something like:
> BC
[1] 1 3
> CB
[1] 2 4
There will be rows where B>C or C>B but where neither is the highest value in the row (e.g. node_5), and I need to ignore these. The row numbers are then used to query a separate dataframe that provides the data I want.
I hope this all makes sense. Thanks in advance for any advice!
You could do something like this...
maxBefore <- apply(before, 1, which.max) #find highest columns before (by row)
maxAfter <- apply(after, 1, which.max) #and highest columns after
BC <- which(maxBefore==2 & maxAfter==3) #rows with B highest before, C after
CB <- which(maxBefore==3 & maxAfter==2) #rows with C highest before, B after
BC
node_1 node_3
1 3
CB
node_2 node_4
2 4
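Not part of the original answer: if region letters are more convenient than column positions, the same which.max indices can be mapped to column names; a small sketch:
moves <- data.frame(node = rownames(before),
                    from = colnames(before)[maxBefore],   # most probable region before
                    to   = colnames(after)[maxAfter])     # most probable region after
moves$node[moves$from == "B" & moves$to == "C"]   # nodes that moved B -> C
moves$node[moves$from == "C" & moves$to == "B"]   # nodes that moved C -> B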

Merging dataframes with all.equal on numeric(float) keys?

I have two data frames I want to merge based on a numeric value, however I am having trouble with floating point accuracy. Example:
> df1 <- data.frame(number = 0.1 + seq(0.01,0.1,0.01), letters = letters[1:10])
> df2 <- data.frame(number = seq(0.11,0.2,0.01), LETTERS = LETTERS[1:10])
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 <NA> B
3 0.12 b <NA>
4 0.13 c C
5 0.14 d D
6 0.15 <NA> E
7 0.15 e <NA>
8 0.16 f F
9 0.17 g G
10 0.18 h H
11 0.19 i I
12 0.20 j J
Some of the values (0.12 and 0.15) don't match up due to floating point accuracy issues as discussed in this post. The solution for finding equality there was to use the all.equal function to remove floating point artifacts; however, I don't believe there is a way to do this within the merge function.
Currently I get around it by forcing one of the number columns to character and then back to numeric after the merge, but this is a little clunky; does anyone have a better solution for this problem?
> df1c <- df1
> df1c[["number"]] <- as.character(df1c[["number"]])
> merged2 <- merge(df1c, df2, by = "number", all = TRUE)
> merged2[["number"]] <- as.numeric(merged2[["number"]])
> merged2
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
EDIT: A little more about the data
I wanted to keep my question general to make it more applicable to other people's problems, but it seems I may need to be more specific to get an answer.
It is likely that all of the issues with merging will be due to floating point inaccuracy, but it may be a little hard to be sure. The data comes in as a series of time series values, a start time, and a frequency. These are then turned into a time series (ts) object, and a number of functions are called to extract features from the time series (one of which is the time value), which are returned as a data frame. Meanwhile another set of functions is called to get other features from the time series as targets. There are also potentially other series getting features generated to complement the original series. These values then have to be reunited using the time value.
Can't store as POSIXct: Each of these processes (feature extraction, target computation, merging) has to be able to occur independently and be stored in a CSV type format so it can be passed to other platforms. Storing as a POSIXct value would be difficult since the series aren't necessarily stored in calendar times.
Round to the level of precision that will allow the numbers to be equal.
> df1$number=round(df1$number,2)
> df2$number=round(df2$number,2)
>
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
If you need to choose the level of precision programmatically then you should tell us more about the data and whether we can assume that the mismatches will always be due to floating point inaccuracy. If so, then rounding to 10 decimal places should be fine. The all.equal function uses a tolerance of sqrt(.Machine$double.eps), roughly 1.5e-8, so in practice that corresponds to something like round(..., 8).
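If you do want to pick the precision programmatically under that floating-point assumption, rounding both key columns before merging is one approach; a sketch (8 decimal places roughly matches the all.equal tolerance):
key_digits <- 8
df1$number <- round(df1$number, key_digits)
df2$number <- round(df2$number, key_digits)
merged <- merge(df1, df2, by = "number", all = TRUE)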

Iterate in data table by factor and insert result in another dataframe

I have following data table
> total.dt[,list(first, sched, platform, CCR, speedup)]
first sched platform CCR speedup
1: mult static_hlfet 1 0.1 1.000000
2: mult static_mcp 1 0.1 1.000000
3: mult static_eft 1 0.1 1.000000
4: mult static_lheft 1 0.1 1.000000
5: mult greedy 1 0.1 1.000000
---
1634: gen64 static_eft 64 10.0 9.916995
1635: gen64 static_lheft 64 10.0 8.926877
1636: gen64 greedy 64 10.0 5.235970
1637: gen64 Horizon-8 64 10.0 11.523087
1638: gen64 Horizon-1 64 10.0 9.896009
I want to find out how many times each sched is better than every other sched when the fields first, platform and CCR are equal, and group these counts by sched.
First I create all combinations of groups where I do the comparison.
setkey(total.dt, first, platform, CCR)
comb <- unique(total.dt[,list(first, platform, CCR)])
Now I can get a group where I can do the comparison
d <- total.dt[comb[n,], list(first, platform, CCR, sched, speedup)]
> print (d) # if n equals 1
first platform CCR sched speedup
1: mult 1 0.1 static_hlfet 1
2: mult 1 0.1 static_mcp 1
3: mult 1 0.1 static_eft 1
4: mult 1 0.1 static_lheft 1
5: mult 1 0.1 greedy 1
6: mult 1 0.1 Horizon-8 1
7: mult 1 0.1 Horizon-1 1
And now I have to count how many times every sched wins against the others (has a bigger speedup), loses, or draws. I have to store this in a data frame with 5 columns (first, second, win, lose, draw), repeat the operation for every row in comb, and accumulate the numbers in that second data frame.
And here I'm a bit lost, because I do not understand how to do this and how to store the result.
I'll appreciate any your help and sorry if this kind of question is not appropriate for SO.
UPD.
Minimal example.
I have following data:
d <- expand.grid(first=c("heat", "lu"),
sched=c("eft", "mcp"),
CCR=c(0.1, 1), platform=c(1,2))
d$speedup <- 1:16
I want to get the following results:
res <- data.frame(first=c("eft", "mcp"),
win=c(0, 8), lose=c(8, 0), draw=c(0, 0),
second=c("mcp", "eft"))
How the calculation works: first I take the rows where first="heat", platform=1, CCR=0.1. There are two such rows. The first has sched=eft, speedup=1; the second has sched=mcp, speedup=3. This means mcp wins, so in the data.frame res we increase the win counter in the row where first=mcp, second=eft, and the lose counter in the row where first=eft, second=mcp.
Then I take the remaining groups from data frame d one by one and repeat the procedure, filling in the res data frame.
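No answer is recorded here, but one way to get that win/lose/draw table is to pair every sched with every other sched inside each (first, CCR, platform) group via a self-merge, and then tally the comparisons. A sketch under that reading of the question (object names are illustrative, not from the original thread):
library(data.table)
dt <- as.data.table(d)
# pair each sched with every other sched within the same (first, CCR, platform) group
pairs <- merge(dt, dt,
               by = c("first", "CCR", "platform"),
               suffixes = c(".a", ".b"),
               allow.cartesian = TRUE)[sched.a != sched.b]
# tally wins / losses / draws of the first sched of each pair
res <- pairs[, .(win  = sum(speedup.a > speedup.b),
                 lose = sum(speedup.a < speedup.b),
                 draw = sum(speedup.a == speedup.b)),
             by = .(sched.a, sched.b)]
setnames(res, c("sched.a", "sched.b"), c("first", "second"))
res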

Find the 2 max values for each factor in R

I have a question about finding the two largest values of column C (weight) for each unique ID in column A, then calculating the mean of column B (layer). A sample of my data is here:
ID layer weight
1 0.6843629 0.35
1 0.6360772 0.70
1 0.6392318 0.14
2 0.3848640 0.05
2 0.3882660 0.30
2 0.3877026 0.10
2 0.3964194 0.60
2 0.4273218 0.02
2 0.3869507 0.12
3 0.4748541 0.07
3 0.5853659 0.42
3 0.5383678 0.10
3 0.6060287 0.60
4 0.4859274 0.08
4 0.4720740 0.48
4 0.5126481 0.08
4 0.5280899 0.48
5 0.7492097 0.07
5 0.7220433 0.35
5 0.8750000 0.10
5 0.8302752 0.50
6 0.4306283 0.10
6 0.4890895 0.25
6 0.3790714 0.20
6 0.5139686 0.50
6 0.3885678 0.02
6 0.4706815 0.05
For each ID, I want to calculate the mean value of layer, using only the rows with the two highest weights.
I can do this with the following code in R:
library(plyr); library(data.table)
ind.max1 <- ddply(index1, "ID", function(x) x[which.max(x$weight),])
dt1 <- data.table(index1, key=c("layer"))
dt2 <- data.table(ind.max1, key=c("layer"))
index2 <- dt1[!dt2]
ind.max2 <- ddply(index2, "ID", function(x) x[which.max(x$weight),])
ind.max.all <- merge(ind.max1, ind.max2, all=TRUE)
ind.ndvi.mean <- as.data.frame(tapply(ind.max.all$layer, list(ind.max.all$ID), mean))
This uses ddply to select the highest-weight row per ID and put it, along with layer, into a data frame. Those highest-weight rows are then removed from the original data frame using data.table. I then repeat the ddply step to pick the max of what remains (the second-highest weights), merge the two data frames into one, and finally compute the mean with tapply.
There must be a more efficient way to do this. Does anyone have any insight? Cheers.
You could use data.table
library(data.table)
setDT(dat)[, .(Meanlayer = mean(layer[order(-weight)[1:2]])), by=ID]
# ID Meanlayer
#1: 1 0.6602200
#2: 2 0.3923427
#3: 3 0.5956973
#4: 4 0.5000819
#5: 5 0.7761593
#6: 6 0.5015291
Order the weight column in descending order: order(-weight)
Select the first two positions of that order within each group ID: [1:2]
Subset the corresponding layer values with that index: layer[order(-weight)[1:2]]
Take the mean
Alternatively, in 1.9.3 (current development version) or from the next version on, a function setorder is exported for reordering data.tables in any order, by reference:
require(data.table) ## 1.9.3+
setorder(setDT(dat), ID, -weight) ## dat is now reordered as we require
dat[, mean(layer[1:min(.N, 2L)]), by=ID]
By ordering first, we avoid the call to order() for each group (unique value in ID). This'll be more advantageous with more groups. And setorder() is much more efficient than order() as it doesn't need to create a copy of your data.
This actually is a question for StackOverflow... anyway!
Don't know if the version below is efficient enough for you...
s.ind <- tapply(df$weight, df$ID, function(x) order(x, decreasing=TRUE))  # per-ID order of weights, largest first
val <- tapply(df$layer, df$ID, function(x) x)                             # per-ID layer values
foo <- function(x, y) list(x[y][1:2])                                     # reorder layers by weight, keep the top two
lapply(mapply(foo, val, s.ind), mean)
I think this will do it. Assuming the data is called dat,
> sapply(split(dat, dat$ID), function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
})
# 1 2 3 4 5 6
# 0.6602200 0.3923427 0.5956973 0.5000819 0.7761593 0.5015291
You'll likely want to include na.rm = TRUE as the second argument to mean to account for any rows that contain NA values.
Alternatively, mapply is probably faster; it uses exactly the same code, just in a different order:
mapply(function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
}, split(dat, dat$ID))
