Working on grouped data.tables within a larger data.table - r

I have recently decided to switch from data.frame to data.table, but I can't seem to manipulate a data.table in the same way.
How would I write the following ddply code using data.table to accomplish the same output?
# generate the data.table
R> set.seed(42)
R> dt <- data.table(sample = rep(c('a','b','c'),times = 1, each = 5),
seq = sample(c('dd','ee','ff'),15,replace=T),
num = sample(1:100,15),
letters = sample(letters,15))
R> dt
sample seq num letters
1: a ff 95 t
2: a ff 97 u
3: a dd 12 j
4: a ff 47 p
5: a ee 54 a
6: b ee 86 r
7: b ff 14 v
8: b dd 92 d
9: b ee 88 q
10: b ff 8 k
11: c ee 99 g
12: c ff 35 w
13: c ff 80 z
14: c dd 39 m
15: c ee 72 f
With ddply, I would use the code below to get a single row for each combination of sample and seq, where the new num column contains sum(num) and letters contains the letter with the highest num within each subgroup.
Example: for the subgroup sample == 'a' and seq == 'ff', the letter is u because its num of 97 is higher than 95 and 47.
R> df_new <- ddply(dt, .(sample, seq), function(df){
order_d <- order(df$num, decreasing = TRUE)
df_new <- df[order_d[1], ]
df_new$num <- sum(df$num)
return(df_new)
})
R> df_new
sample seq num letters
1 a dd 12 j
2 a ee 54 a
3 a ff 239 u
4 b dd 92 d
5 b ee 174 q
6 b ff 22 v
7 c dd 39 m
8 c ee 171 g
9 c ff 115 z
How can I do this in a data.table way?
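One possible data.table translation, offered as a sketch rather than a definitive answer: group by sample and seq (keyby is used here so the result comes back sorted, as in the ddply output), sum num, and pick the letter at which.max(num), which returns the first maximum if there are ties. The data is typed in from the printed table above so the result does not depend on the random seed; the name res is only for illustration.

```r
library(data.table)

# Data copied from the printed dt above (no dependence on set.seed)
dt <- data.table(
  sample  = rep(c("a", "b", "c"), each = 5),
  seq     = c("ff", "ff", "dd", "ff", "ee",
              "ee", "ff", "dd", "ee", "ff",
              "ee", "ff", "ff", "dd", "ee"),
  num     = c(95, 97, 12, 47, 54, 86, 14, 92, 88, 8, 99, 35, 80, 39, 72),
  letters = c("t", "u", "j", "p", "a", "r", "v", "d", "q", "k",
              "g", "w", "z", "m", "f")
)

# One row per (sample, seq): total of num, and the letter of the largest num.
# Inside j, `letters` refers to the column, not the built-in constant.
res <- dt[, .(num = sum(num), letters = letters[which.max(num)]),
          keyby = .(sample, seq)]
res
```

For the subgroup sample == 'a' and seq == 'ff' this gives num 239 and letter u, matching the ddply result shown in the question.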

Related

R convert list with multiple string lengths to data frame

I have a list:
l1<-list(A=1:10, B=100:120, C=300:310, D=400:430)
How do I convert it to a data frame with 2 columns:
C1 C2
R1 1 A
R2 2 A
...
R10 10 A
R11 100 B
R12 101 B
....
R72 429 D
R73 430 D
I tried:
df1 <- data.frame(matrix(unlist(l1), nrow=length(l1), byrow=T))
But I'm getting an error because the vectors in my list have different lengths. Also, my actual list consists of Dates and not just integers.
Just use stack:
stack(l1)
> head(stack(l1))
values ind
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
> tail(stack(l1))
values ind
68 425 D
69 426 D
70 427 D
71 428 D
72 429 D
73 430 D
Update
stack won't work with dates. If you have actual date objects, you can do:
data.frame(ind = rep(names(l1), lengths(l1)),
val = as.Date(unlist(l1), origin = "1970-01-01"))
or
data.frame(ind = rep(names(l1), lengths(l1)), val = do.call(c, l1))
Sample data:
l1<-list(A=Sys.Date()+(1:10),
B=Sys.Date()+(100:120),
C=Sys.Date()+(300:310),
D=Sys.Date()+(400:430))
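As a small check of the second variant, here is a sketch with fixed dates instead of Sys.Date() so it is reproducible. The two-element list is hypothetical, and unname() is my addition to keep the list names out of the combined vector.

```r
# Hypothetical small list of Dates with unequal lengths
l1 <- list(A = as.Date("2020-01-01") + 0:2,
           B = as.Date("2020-02-01") + 0:1)

# c() applied to Date vectors keeps the Date class;
# unname() stops the list names from being attached to the result
df <- data.frame(ind = rep(names(l1), lengths(l1)),
                 val = do.call(c, unname(l1)))
str(df)
```

The val column should still be of class Date here, which is the property the question asks for.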
Here's one method, similar to @Duck's answer, using Map and do.call:
tmp <- Map(data.frame,N = l1,L = names(l1))
out <- do.call(rbind,tmp)
rownames(out) <- NULL
> tail(out)
N L
68 425 D
69 426 D
70 427 D
71 428 D
72 429 D
73 430 D
Maybe a long solution, but using mapply() and do.call() you can reach the expected result. First, extract the names of the list as well as the number of elements in each vector. Then, using mapply(), create a list for the first column of the desired result. After that, combine mapply(), do.call(), rbind() and cbind() to end up with df. Here is the code:
#Code
#names
v1 <- names(l1)
#length
v2 <- unlist(lapply(l1, length))
#Create values
l2 <- mapply(function(x,y) rep(x,y),v1,v2)
#Bind
df <- as.data.frame(do.call(rbind,mapply(cbind,l2,l1)))
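# cbind() turned the values into text; go through as.character() in case V2 is a factor
df$V2 <- as.numeric(as.character(df$V2))
Output (some rows):
head(df,15)
V1 V2
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 100
12 B 101
13 B 102
14 B 103
15 B 104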

R Function to write 3 calculated columns to a data.table

This may have already been answered, but I couldn't quite find the answer I am looking for. I am trying to write the output of a function that calculates 3 variables to a data.table.
Currently I am copying the function three times (with three different names), each time returning a different variable. This takes a lot more time because the calculation runs three times. I understand there may be a better way to do it, using a list or some data.table-specific command.
I would greatly appreciate any input you can provide to simplify this. Below is an example of how I am calling it one variable at a time.
Example
fn_1 <- function(a, b, c, d){
for (i in 1:b) { col_1[i] = calculation }
for (i in 1:c) { col_2[i] = calculation }
for (i in 1:d) { col_3[i] = calculation }
return(col_1)
}
data[ ,column_1 := fn_1(a,b,c,d) ,by= .(e,f) ]
fn_2 <- function(a, b, c, d){
for (i in 1:b) { col_1[i] = calculation }
for (i in 1:c) { col_2[i] = calculation }
for (i in 1:d) { col_3[i] = calculation }
return(col_2)
}
data[ ,column_2 := fn_2(a,b,c,d) ,by= .(e,f) ]
The OP has tagged the question with data.table. docendo discimus' comment points in the right direction.
Create sample data
library(data.table) # CRAN version 1.10.4 used
n <- 10L
DT <- data.table(
a = 1:n, b = (n:1)^2, c = -(1:n), d = 2 * (1:n) - n/2,
e = rep(LETTERS[1:2], length.out = n),
f = rep(LETTERS[3:4], each = n/2, length.out = n))
DT
# a b c d e f
# 1: 1 100 -1 -3 A C
# 2: 2 81 -2 -1 B C
# 3: 3 64 -3 1 A C
# 4: 4 49 -4 3 B C
# 5: 5 36 -5 5 A C
# 6: 6 25 -6 7 B D
# 7: 7 16 -7 9 A D
# 8: 8 9 -8 11 B D
# 9: 9 4 -9 13 A D
#10: 10 1 -10 15 B D
Define function
fn <- function(p, q, r, s) {
list(X1 = p + mean(q) + r + s,
Y2 = p * q + r * s,
Z3 = p * q - r * s)
}
The function takes 4 parameters and returns a list of 3 named vectors. Note that the computations inside the function don't need to use for loops in contrast to OP's approach.
Apply function to data.table
Note that the OP wants to group on columns e and f when the function is applied.
The first variant creates a new data.table. By default, the names of the list elements as defined in fn are used:
DT[, fn(a, b, c, d), .(e, f)]
# e f X1 Y2 Z3
# 1: A C 63.66667 103 97
# 2: A C 67.66667 189 195
# 3: A C 71.66667 155 205
# 4: B C 64.00000 164 160
# 5: B C 68.00000 184 208
# 6: B D 18.66667 108 192
# 7: B D 22.66667 -16 160
# 8: B D 26.66667 -140 160
# 9: A D 19.00000 49 175
#10: A D 23.00000 -81 153
The second variant updates DT by reference. The names of the new columns are explicitly stated.
DT[, c("x", "y", "z") := fn(a, b, c, d), .(e, f)]
DT
# a b c d e f x y z
# 1: 1 100 -1 -3 A C 63.66667 103 97
# 2: 2 81 -2 -1 B C 64.00000 164 160
# 3: 3 64 -3 1 A C 67.66667 189 195
# 4: 4 49 -4 3 B C 68.00000 184 208
# 5: 5 36 -5 5 A C 71.66667 155 205
# 6: 6 25 -6 7 B D 18.66667 108 192
# 7: 7 16 -7 9 A D 19.00000 49 175
# 8: 8 9 -8 11 B D 22.66667 -16 160
# 9: 9 4 -9 13 A D 23.00000 -81 153
#10: 10 1 -10 15 B D 26.66667 -140 160
You're in the second circle of hell. To solve the problem, pre-allocate what you want to add.
data <- data.table(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
Then, make a vectorized function to do the calculation, which returns the whole column to append.
calculation <- Vectorize(function(x) mean(c(x, 3)))
Write fn in terms of this new function, and return the whole block of columns to be added, then cbind it with data to add all the columns at once. It's extremely slow to do all the calculations every time, and then only return one part.
fn <- function(b, c, d) {
# pre-allocate the block of new columns, one row per row of data
toBeAdded <- data.table(matrix(nrow = nrow(data), ncol = 3))
toBeAdded[ , 1] <- calculation(b)
toBeAdded[ , 2] <- calculation(c)
toBeAdded[ , 3] <- calculation(d)
toBeAdded
}
# pass whole columns, not rows
data <- cbind(data, fn(data[[1]], data[[2]], data[[3]]))
Answering my own question: based on input from @docendodiscimus and @ConCave, I solved it like this. I appreciate everyone's input!
fn_1 <- function(a, b, c, d){
for (i in 1:b) { col_1[i] = calculation }
for (i in 1:c) { col_2[i] = calculation }
for (i in 1:d) { col_3[i] = calculation }
df = data.table(col_1, col_2, col_3)
return(df)
}
data[,c("column_1","column_2","column_3"):= fn_1(a,b,c,d) ,by= .(e,f)]
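To make the pattern above concrete, here is a minimal runnable sketch. The function fn_all, its three formulas, and the grouping column g are hypothetical stand-ins for the OP's calculations, not the actual ones.

```r
library(data.table)

# Hypothetical stand-in for the OP's three calculations
fn_all <- function(a, b) {
  list(a + mean(b),  # uses the group mean of b
       a * b,
       a - b)
}

data <- data.table(a = 1:4, b = c(10, 20, 30, 40), g = c("x", "x", "y", "y"))

# The three list elements are assigned to the three new columns, per group
data[, c("column_1", "column_2", "column_3") := fn_all(a, b), by = g]
data
```

Returning a plain list works just as well as returning a data.table here, since := accepts a list with one element per new column.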
Does it have to be a data.table? If not, you can just use mutate from dplyr:
library(dplyr) # provides mutate() and %>%
a <- c(1,2,2,1,2,3,4,2)
b <- c(3,3,2,3,5,4,3,2)
c <- c(9,9,8,7,8,9,8,7)
d <- c(0,1,1,0,1,1,0,1)
have <- data.frame(a,b,c,d)
want <-
have %>%
mutate(abc = a+ b + c,
db = d * b,
aa = 2 * a)

Getting a prop.table() in r

I've been trying to use prop.table() to get the proportions of my data but keep getting errors. My data is:
Letter Total
a 10
b 34
c 8
d 21
. .
. .
. .
z 2
I want a third column that gives the proportion of each letter.
My original data is in a data frame, so I've tried converting it to a table and then using prop.table:
testtable = table(lettersdf)
prop.table(testtable)
When I try this I keep getting the error,
Error in margin.table(x, margin) : 'x' is not an array
Any help or advice is appreciated.
:)
If the Letter column in your data does not have duplicate values, like this
Df <- data.frame(
Letter=letters,
Total=sample(1:50,26),
stringsAsFactors=F)
you can just do this instead of using prop.table:
Df$Prop <- Df$Total/sum(Df$Total)
> head(Df)
Letter Total Prop
1 a 45 0.074875208
2 b 1 0.001663894
3 c 13 0.021630616
4 d 15 0.024958403
5 e 24 0.039933444
6 f 39 0.064891847
> sum(Df[,3])
[1] 1
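As a side note, prop.table() also accepts a plain numeric vector, so it can be applied to the Total column directly. A small sketch with made-up totals (the column name Prop2 is only for illustration):

```r
# Hypothetical data with the same shape as in the question
Df <- data.frame(Letter = c("a", "b", "c", "d"),
                 Total  = c(10, 34, 8, 21))

# Manual proportion, as in the answer above
Df$Prop <- Df$Total / sum(Df$Total)

# prop.table() on the numeric column gives the same values
Df$Prop2 <- prop.table(Df$Total)

all.equal(Df$Prop, Df$Prop2)
```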
If there are duplicated values, like in this object
Df2 <- data.frame(
Letter=sample(letters,50,replace=T),
Total=sample(1:50,50),
stringsAsFactors=F)
you can make a table to sum the frequency of unique Letters,
Table <- table(rep(Df2$Letter,Df2$Total))
> Table
a b c d e f h j k l m n o p q t v w x y z
48 16 99 2 40 75 45 42 66 6 62 27 88 99 32 96 85 64 53 161 69
and then use prop.table on this table object:
> prop.table(Table)
a b c d e f h j k l m
0.037647059 0.012549020 0.077647059 0.001568627 0.031372549 0.058823529 0.035294118 0.032941176 0.051764706 0.004705882 0.048627451
n o p q t v w x y z
0.021176471 0.069019608 0.077647059 0.025098039 0.075294118 0.066666667 0.050196078 0.041568627 0.126274510 0.054117647
You could also make this into a data.frame:
Df2.table <- cbind(
data.frame(Table,stringsAsFactors=F),
Prop=as.numeric(prop.table(Table)))
> head(Df2.table)
Var1 Freq Prop
1 a 48 0.037647059
2 b 16 0.012549020
3 c 99 0.077647059
4 d 2 0.001568627
5 e 40 0.031372549
6 f 75 0.058823529
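An alternative sketch for the duplicated case, assuming the goal is one row per letter: sum Total per letter with aggregate() and then divide by the grand total. This avoids expanding the data with rep(). The data and the name agg are made up for illustration.

```r
# Hypothetical data in which the letter "a" appears twice
Df2 <- data.frame(Letter = c("a", "b", "a", "c"),
                  Total  = c(10, 5, 30, 5))

# Sum Total within each Letter, then compute each letter's share
agg <- aggregate(Total ~ Letter, data = Df2, FUN = sum)
agg$Prop <- agg$Total / sum(agg$Total)
agg
```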

replace values in a data.frame with values from another data.frame

I have two dataframes with different dimensions,
df1 <- data.frame(names= sample(LETTERS[1:10]), duration=sample(0:100, 10))
>df1
names duration
1 J 97
2 G 57
3 H 53
4 A 23
5 E 100
6 D 90
7 C 73
8 F 60
9 B 37
10 I 67
df2 <- data.frame(names= LETTERS[1:5], names_new=letters[1:5])
> df2
names names_new
1 A a
2 B b
3 C c
4 D d
5 E e
I want to replace the values in df1$names that match df2$names with the corresponding values from df2$names_new. My desired output would be:
> df1
names duration
1 J 97
2 G 57
3 H 53
4 a 23
5 e 100
6 d 90
7 c 73
8 F 60
9 b 37
10 I 67
This is the code I'm using, but I wonder if there is a cleaner way to do it without so many steps:
df2[,1] <- as.character(df2[,1])
df2[,2] <- as.character(df2[,2])
df1[,1] <- as.character(df1[,1])
match(df1[,1], df2[,1]) -> id
which(!is.na(id)==TRUE) -> idx
id[!is.na(id)] -> id
df1[idx,1] <- df2[id,2]
Many thanks
Here's an approach from qdapTools:
library(qdapTools)
df1$names <- df1$names %lc+% df2
The %l+% operator is a binary-operator version of lookup. The left side holds the terms and the right side is the lookup table. The + means that any elements without a match revert to their original value. It is a wrapper around the data.table package and is pretty speedy.
Here is the output including set.seed(1) for reproducibility:
set.seed(1)
df1 <- data.frame(names= sample(LETTERS[1:10]), duration=sample(0:100, 10),stringsAsFactors=F)
df2 <- data.frame(names= LETTERS[1:5], names_new=letters[1:5],stringsAsFactors=F)
library(qdapTools)
df1$names <- df1$names %lc+% df2
df1
## names duration
## 1 c 20
## 2 d 17
## 3 e 68
## 4 G 37
## 5 b 74
## 6 H 47
## 7 I 98
## 8 F 93
## 9 J 35
## 10 a 71
Are all names in df2 also in df1? And do you intend to keep them as a factor? If so, you might find this solution helpful.
idx <- match(levels(df2$names), levels(df1$names))
levels(df1$names)[idx] <- levels(df2$names_new)
This works but requires that names and names_new are character and not factor.
set.seed(1)
df1 <- data.frame(names= sample(LETTERS[1:10]), duration=sample(0:100, 10),stringsAsFactors=F)
df2 <- data.frame(names= LETTERS[1:5], names_new=letters[1:5],stringsAsFactors=F)
rownames(df1) <- df1$names
df1[df2$names,]$names <- df2$names_new
Another option using merge:
transform(merge(df1,df2,all.x=TRUE),
names=ifelse(is.na(names_new),as.character(names),
as.character(names_new)))
Another way using match (if the names columns in df1 and df2 are character vectors, of course):
df1[match(df2$names, df1$names), "names"] <- df2$names_new
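If some names in df2 might be missing from df1, or the other way round, a slightly more defensive base R sketch is to match in the other direction and fall back to the original value. The data is made up, and character columns are assumed.

```r
df1 <- data.frame(names = c("J", "A", "E", "F", "B"),
                  duration = c(97, 23, 100, 60, 37),
                  stringsAsFactors = FALSE)
df2 <- data.frame(names = c("A", "B", "C"),
                  names_new = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# Position of each df1 name in the lookup table (NA when there is no match)
idx <- match(df1$names, df2$names)

# Keep the original name where there is no match, otherwise use the new one
df1$names <- ifelse(is.na(idx), df1$names, df2$names_new[idx])
df1$names
```

Here "C" has no counterpart in df1 and "E" has none in df2; both cases are handled without an error or an NA.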

Change the index number of a dataframe

After some manipulation of a data frame, I got a result data frame, but its index is not numbered sequentially, as shown below.
MsgType/Cxr NoOfMsgs AvgElpsdTime(ms)
161 AM 86 30.13
171 CM 1 104
18 CO 27 1244.81
19 US 23 1369.61
20 VK 2 245
21 VS 11 1273.82
112 fqa 78 1752.22
24 SN 78 1752.22
I would like to get the result below.
MsgType/Cxr NoOfMsgs AvgElpsdTime(ms)
1 AM 86 30.13
2 CM 1 104
3 CO 27 1244.81
4 US 23 1369.61
5 VK 2 245
6 VS 11 1273.82
7 fqa 78 1752.22
8 SN 78 1752.22
Please guide me on how I can get this.
These are the row names of your data frame, which by default are 1:nrow(dfr). When you reordered the data frame, the original row names were reordered along with the rows. To number the rows sequentially in the new order, just use:
rownames(dfr) <- 1:nrow(dfr)
Or, simply
rownames(df) <- NULL
gives what you want.
> d <- data.frame(x = LETTERS[1:5], y = letters[1:5])[sample(5, 5), ]
> d
x y
5 E e
4 D d
3 C c
2 B b
1 A a
> rownames(d) <- NULL
> d
x y
1 E e
2 D d
3 C c
4 B b
5 A a
The index is actually the data frame row names. To change them, you can do something like:
rownames(dd) = 1:dim(dd)[1]
or
rownames(dd) = 1:nrow(dd)
Personally, I never use rownames.
In your example, I suspect that you don't need to worry about them either, since you are just renaming them 1 to n. In particular, when you subset your data frame the rownames will again be incorrect. For example,
##Simple data frame
R> dd = data.frame(a = rnorm(6))
R> dd$type = c("A", "B")
R> rownames(dd) = 1:nrow(dd)
R> dd
a type
1 2.1434 A
2 -1.1067 B
3 0.7451 A
4 -0.1711 B
5 1.4348 A
6 -1.3777 B
##Basic subsetting
R> dd_sub = dd[dd$type=="A",]
##Rownames are "wrong"
R> dd_sub
a type
1 2.1434 A
3 0.7451 A
5 1.4348 A
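If sequential row names are wanted after subsetting anyway, they can be reset on the subset. A short sketch with fixed numbers instead of rnorm() so it is reproducible:

```r
dd <- data.frame(a = c(2.1, -1.1, 0.7, -0.2, 1.4, -1.4),
                 type = c("A", "B"))   # "A", "B" is recycled to 6 rows

dd_sub <- dd[dd$type == "A", ]
rownames(dd_sub)           # "1" "3" "5": carried over from dd

rownames(dd_sub) <- NULL   # reset to the default sequence
rownames(dd_sub)           # "1" "2" "3"
```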
