Aggregate all columns with data.table using 2 fixed columns - r

I have a custom function I would like to apply to a data table such as the following:
DT = data.table(x = rep(c("a","b","c"), each = 2),
x2 = rep(c("h","j"), each = 3),
y = c(1,3),
v = 1:6,
z = 7:12,
w = 13:18)
DT
x x2 y v z w
1: a h 1 1 7 13
2: a h 3 2 8 14
3: b h 1 3 9 15
4: b j 3 4 10 16
5: c j 1 5 11 17
6: c j 3 6 12 18
I have a function with which I would like to score the numeric columns of DT, grouped by column x. The function keeps two columns fixed (z and w) and performs a calculation on a third column, taken in turn from the remaining numeric columns. The function is as follows (col stands for the column that is not fixed):
scoring <- function(col, z, w) {
f <- abs(w - col) / abs(w - z)
# division by zero yields Inf or NaN; score those cases as 1
f[is.infinite(f)] <- 1
f[is.nan(f)] <- 1
return(median(f))
}
The result would (in this case) have 2 new columns, y and v, both of which would be aggregated using the scoring function by x (that is, for groups "a", "b" and "c"). E.g.:
y: a: 0.9166667
y: b: 1.25
y: c: 1.583333
v: a: 1
v: b: 1
v: c: 1
My question is:
I know I can use the by functionality in data.table, but I don't know how to tell it to keep two columns fixed for my custom function and perform the calculation on the remaining columns.
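One possible approach, sketched under stated assumptions: restrict .SD to the columns that vary with .SDcols, and pass the fixed columns z and w to the function by name, since columns outside .SDcols are still visible inside j. Note that the expected values listed in the question are reproduced by abs(z - col) in the numerator, not abs(w - col), so the sketch assumes that form; the argument name col and the object names score_cols and res are made up for illustration.

```r
library(data.table)

DT <- data.table(x  = rep(c("a", "b", "c"), each = 2),
                 x2 = rep(c("h", "j"), each = 3),
                 y  = rep(c(1, 3), times = 3),
                 v  = 1:6,
                 z  = 7:12,
                 w  = 13:18)

# Assumption: the numerator uses z, which is what matches the expected output.
scoring <- function(col, z, w) {
  f <- abs(z - col) / abs(w - z)
  f[is.infinite(f) | is.nan(f)] <- 1   # division by zero is scored as 1
  median(f)
}

score_cols <- c("y", "v")              # columns to score; z and w stay fixed

# .SD holds only score_cols; z and w are taken from the current group.
res <- DT[, lapply(.SD, scoring, z = z, w = w), by = x, .SDcols = score_cols]
print(res)
# y is 0.9166667, 1.25, 1.583333 and v is 1, 1, 1 for groups a, b, c
```

To score every numeric column except the fixed ones, score_cols could instead be built from the column names, for example with setdiff(names(DT), c("x", "x2", "z", "w")).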

Related

How do I create new variable names using argument in my function in R?

This is a dataset.
require(data.table)
df <- data.table(a = c(1, 2, 3),
b = c(4, 5, 6))
a b
1: 1 4
2: 2 5
3: 3 6
I would like to create several column names with my function.
Here is an example function.
f_test <- function(x){
variableName1 <- eval(paste0("variableName1_", x))
variableName2 <- eval(paste0("variableName2_", x))
print(variableName1)
#setNames(variableName)
df_1a <- df[, `:=` (variableName1 = a * b * 1,
variableName2 = a * b * 2)]
}
For example, this is the expected outcome from f_test("AAA")
a b variableName1_AAA variableName2_AAA
1: 1 4 4 8
2: 2 5 10 20
3: 3 6 18 36
However, the function outcome is not 'variableName1_AAA', but 'variableName1'.
How do I assign the name based on the string argument in the function? I need to assign the name character to use in the future function work.
We can use paste directly to create the column names as the input is a string. The assignment (:=) can also be done with concatenating the column name objects on the lhs of :=
f_test <- function(x){
variableName1 <- paste0("variableName1_", x)
variableName2 <- paste0("variableName2_", x)
df_1a <- copy(df)
df_1a[, c(variableName1, variableName2) := .( a * b * 1,
a * b * 2)][]
}
Testing:
f_test("AAA")
# a b variableName1_AAA variableName2_AAA
#1: 1 4 4 8
#2: 2 5 10 20
#3: 3 6 18 36

Operating between columns and classifying values per group in R

I am trying to obtain percentages, grouping values by one variable.
For this I used sapply to get the percentage of each column relative to another one, but I don't know how to group these values by type (another variable).
x <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
"type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))
x
A B C type yes
1 0 0 1 x 0
2 0 1 0 x 0
3 1 0 1 x 1
4 1 1 1 y 1
5 1 0 0 y 0
6 1 1 0 y 1
7 1 1 1 x 1
I need to obtain the following percentage: (A==1 & yes==1) / (A==1). For this I use the following code:
result <- as.data.frame(sapply(x[,1:3],
function(i) (sum(i & x$yes)/sum(i))*100))
result
sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A 80
B 75
C 75
Now I need the same calculation, but taking the variable "type" into account, i.e. the same percentage broken down by type. My expected table is:
type sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A x 40
A y 40
B x 25
B y 50
C x 50
C y 25
In this example, for each letter the percentages across types add up to the value obtained in the first result; here they are just broken down by type.
Thanks a lot.
You can do the following using data.table:
Code
setDT(df)
cols = c('A', 'B', 'C')
mat = df[yes == 1, lapply(.SD, function(x){
100 * sum(x)/df[, lapply(.SD, sum), .SDcols = cols][[substitute(x)]]
# Here, the numerator is sum(x | yes == 1) for x == columns A, B, C
# If we look at the denominator, it equals sum(x) for x == columns A, B, C
# The reason why we need to apply substitute(x) is because df[, lapply(.SD, sum)]
# generates a list of column sums, i.e. list(A = sum(A), B = sum(B), ...).
# Hence, for each x in the column names we must subset the list above using [[substitute(x)]]
# Ultimately, the operation equals sum(x | yes == 1)/sum(x) for A, B, C.
}), .(type), .SDcols = cols]
# '.(type)' simply means that we apply this for each type group,
# i.e. once for x and once for y, for each ABC column.
# The dot is just shorthand for 'list()'.
# .SDcols assigns the subset that I want to apply my lapply statement onto.
Result
> mat
type A B C
1: x 40 25 50
2: y 40 50 25
Long format (your example)
> melt(mat)
type variable value
1: x A 40
2: y A 40
3: x B 25
4: y B 50
5: x C 50
6: y C 25
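As an alternative that avoids looking up the denominator by name inside the anonymous function, the same numbers can be computed in base R. This is only a sketch: the names cols, totals, hits and pct are made up for illustration, and the data frame is the one from the question.

```r
x <- data.frame(A = c(0, 0, 1, 1, 1, 1, 1),
                B = c(0, 1, 0, 1, 0, 1, 1),
                C = c(1, 0, 1, 1, 0, 0, 1),
                type = c("x", "x", "x", "y", "y", "y", "x"),
                yes = c(0, 0, 1, 1, 0, 1, 1))

cols <- c("A", "B", "C")

# Denominator: overall count of 1s in each column.
totals <- colSums(x[cols])

# Numerator: per type, count of rows where the column and yes are both 1.
hits <- rowsum(x[cols] * x$yes, group = x$type)

# Divide each column by its overall total and convert to a percentage.
pct <- sweep(as.matrix(hits), 2, totals, "/") * 100
print(pct)
# row x: 40 25 50; row y: 40 50 25
```

The result is a matrix with one row per type; as.data.frame(as.table(pct)) would give a long layout similar to the expected table.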
Data
df <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
"type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))

Break dataframe into smaller dataframe's and save them

I need help splitting one data frame dynamically into multiple smaller data frames based on a column interval, and saving them as well.
Example:
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
The above data frame x needs to be split into smaller data frames based on the value in num, in intervals of 5.
The result would be 6 dataframes
1. 0 – 5
2. 6 – 10
3. 11 – 15
4. 16 – 20
5. 21 – 25
6. 26 – 30
You can use the split function and cut function to perform the operation:
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
answer<-split(x, cut(x$num, breaks=c(0, 5, 10, 15, 20, 25, 30)))
You can then pass this list to lapply for further processing.
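Since the question also asks about saving the pieces, here is a minimal sketch of that step in base R. The output directory (a temporary one) and the part_<n>.csv file names are assumptions chosen for illustration, not something given in the question.

```r
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)

# Split into intervals (0,5], (5,10], ..., (25,30].
parts <- split(x, cut(x$num, breaks = seq(0, 30, by = 5)))

# Hypothetical output location and file names.
out_dir <- tempdir()
paths <- file.path(out_dir, paste0("part_", seq_along(parts), ".csv"))

# Write each piece to its own CSV file.
for (i in seq_along(parts)) {
  write.csv(parts[[i]], paths[i], row.names = FALSE)
}
```

The list keeps the interval labels as names, so names(parts) could be used in the file names instead of the running index if that is preferred.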
Using tidyverse
library(tidyverse)
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
##Break the data frame
y <- x %>%
mutate(group = cut_width(num,5, boundary = 0,closed = "right"))
##Put them into a list
list_1 <- lapply(1:length(unique(y$group)),
function(i)filter(y, group == unique(y$group)[i]))
Consider also tagging records by multiples of 5 then running by, the function to split a data frame by one or more factors:
df <- data.frame(num = 1:26, let = letters, LET = LETTERS)
df$grp <- ceiling(df$num / 5)
df_list <- by(df, df$grp, function(sub) transform(sub, grp=NULL))
Output
df_list
# df$grp: 1
# num let LET
# 1 1 a A
# 2 2 b B
# 3 3 c C
# 4 4 d D
# 5 5 e E
# -------------------------------------------------------------------------------------------
# df$grp: 2
# num let LET
# 6 6 f F
# 7 7 g G
# 8 8 h H
# 9 9 i I
# 10 10 j J
# -------------------------------------------------------------------------------------------
# df$grp: 3
# num let LET
# 11 11 k K
# 12 12 l L
# 13 13 m M
# 14 14 n N
# 15 15 o O
# -------------------------------------------------------------------------------------------
# df$grp: 4
# num let LET
# 16 16 p P
# 17 17 q Q
# 18 18 r R
# 19 19 s S
# 20 20 t T
# -------------------------------------------------------------------------------------------
# df$grp: 5
# num let LET
# 21 21 u U
# 22 22 v V
# 23 23 w W
# 24 24 x X
# 25 25 y Y
# -------------------------------------------------------------------------------------------
# df$grp: 6
# num let LET
# 26 26 z Z
This seems to be a neater way. You can easily adjust the names of the output files and the number of splits.
library(tidyverse)
df <- data.frame(num = 1:26, let = letters, LET = LETTERS)
# split data frame into 6 pieces
split_df <- split(df, ceiling(1:nrow(df) / nrow(df) * 6))
# save each of them in turn
split_df %>%
names(.) %>%
walk(~ write_csv(split_df[[.]], paste0("part_", ., ".csv")))

Combine vectors into data frame, using vector name as a column

library(dplyr)
I have a set of vectors:
Sp_A <- c("A",1,2,3,4,5,6,7,8)
Sp_B <- c("B",9,10,11,12,13,14,15,16)
Sp_C <- c("C",17,18,19,20,21,22,23,24)
which I have made into a list of vectors:
list <- ls(pattern = "Sp_")
I want to use this list to loop over each vector in the list and make it into a data frame. I currently do this for one vector using this:
A_df <- select(data.frame(rep(Sp_A[1], each = 4), c(Sp_A[c(2,4,6,8)]), c(Sp_A[c(3,5,7,9)])), name = 1, var1 = 2, var2 = 3)
I have tried to make this operation into a for loop like this:
for(i in list) {
test[i] <- select(A_df <- data.frame(rep(i[1], each = 4),
c(i[c(2,4,6,8)]),
c(i[c(3,5,7,9)]),
name = 1, var1 = 2, var2 = 3))
}
but to no avail.
I have heard that I might be able to use apply() for this sort of thing but I don't know how.
Maybe this:
lapply(list,function(x) data.frame(name=get(x)[1],matrix(get(x)[-1],ncol = 2)))
[[1]]
name X1 X2
1 A 1 5
2 A 2 6
3 A 3 7
4 A 4 8
[[2]]
name X1 X2
1 B 9 13
2 B 10 14
3 B 11 15
4 B 12 16
[[3]]
name X1 X2
1 C 17 21
2 C 18 22
3 C 19 23
4 C 20 24
Or a simple for loop to assign the dataframes to objects:
for (x in 1:length(list)){
assign(paste0("test",x),data.frame(name=get(list[x])[1],matrix(get(list[x])[-1],ncol = 2)))
}
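Note that matrix(..., ncol = 2) fills column by column, so the output above pairs 1 with 5, whereas the single-vector code in the question pairs elements 2,4,6,8 with 3,5,7,9 (1 with 2, 3 with 4, and so on). If that pairing is what is wanted, a sketch along these lines may help; the names vec_names, to_df, dfs and combined are hypothetical.

```r
Sp_A <- c("A", 1, 2, 3, 4, 5, 6, 7, 8)
Sp_B <- c("B", 9, 10, 11, 12, 13, 14, 15, 16)
Sp_C <- c("C", 17, 18, 19, 20, 21, 22, 23, 24)

# Names of the vectors; ls(pattern = "^Sp_") would collect them automatically.
vec_names <- c("Sp_A", "Sp_B", "Sp_C")
vecs <- mget(vec_names)          # named list of the actual vectors

# First element is the label; the rest are filled row by row into two columns.
to_df <- function(v) {
  m <- matrix(as.numeric(v[-1]), ncol = 2, byrow = TRUE)
  data.frame(name = v[1], var1 = m[, 1], var2 = m[, 2])
}

dfs <- lapply(vecs, to_df)       # one data frame per vector
combined <- do.call(rbind, dfs)  # optional: stack them into one data frame
```

Keeping the results in a named list (dfs$Sp_A, dfs$Sp_B, ...) avoids creating numbered objects with assign.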

r "slot" two columns into one (like a zip)

Given two columns (perhaps from a data frame) of equal length N, how can I produce a column of length 2N with the odd entries from the first column and the even entries from the second column?
Suppose I have the following data frame
df.1 <- data.frame(X = LETTERS[1:10], Y = 2*(1:10)-1, Z = 2*(1:10))
How can I produce this data frame df.2?
i <- 1
j <- 0
XX <- NA
while (i <= 10){
XX[i+j] <- LETTERS[i]
XX[i+j+1]<- LETTERS[i]
i <- i+1
j <- i-1
}
df.2 <- data.frame(X.X = XX, Y.Z = c(1:20))
ggplot2 has an unexported function interleave which does this.
Whilst unexported it does have a help page (?ggplot2:::interleave)
with(df.1, ggplot2:::interleave(Y,Z))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
If I understand you right, you want to create a new vector twice the length of the vectors X, Y and Z in your data frame and then want all the elements of X to occupy the odd indices of this new vector and all the elements of Y the even indices. If so, then the code below should do the trick:
foo<-vector(length=2*nrow(df.1), mode='character')
foo[seq(from = 1, to = 2*length(df.1$X), by=2)]<-as.character(df.1$X)
foo[seq(from = 2, to = 2*length(df.1$X), by=2)]<-df.1$Y
Note, I first create an empty vector foo of length 20, then fill it in with elements of df.1$X and df.1$Y.
Cheers,
Danny
You can use melt from reshape2:
library(reshape2)
foo <- melt(df.1, id.vars='X')
> foo
X variable value
1 A Y 1
2 B Y 3
3 C Y 5
4 D Y 7
5 E Y 9
6 F Y 11
7 G Y 13
8 H Y 15
9 I Y 17
10 J Y 19
11 A Z 2
12 B Z 4
13 C Z 6
14 D Z 8
15 E Z 10
16 F Z 12
17 G Z 14
18 H Z 16
19 I Z 18
20 J Z 20
Then you can sort and pick the columns you want:
foo[order(foo$X), c('X', 'value')]
Another solution using base R.
First index the character vector of the data.frame using the vector [1,1,2,2 ... 10,10] and store as X.X. Next, rbind the data.frame vectors Y & Z, effectively "zipping" them, and store in Y.Z.
> res <- data.frame(
+ X.X = df.1$X[c(rbind(1:10, 1:10))],
+ Y.Z = c(rbind(df.1$Y, df.1$Z))
+ )
> head(res)
X.X Y.Z
1 A 1
2 A 2
3 B 3
4 B 4
5 C 5
6 C 6
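The rbind trick works because rbind stacks the two vectors as the rows of a 2 x N matrix, and c() then reads that matrix column by column, which alternates between the two inputs. A small sketch of it as a reusable helper (the name interleave2 is made up here, to avoid confusion with ggplot2's internal function):

```r
# Interleave two equal-length vectors: a[1], b[1], a[2], b[2], ...
interleave2 <- function(a, b) {
  stopifnot(length(a) == length(b))
  c(rbind(a, b))   # 2 x N matrix, flattened column by column
}

odd  <- c(1, 3, 5)
even <- c(2, 4, 6)
zipped <- interleave2(odd, even)
print(zipped)
# [1] 1 2 3 4 5 6
```

This is meant for plain numeric or character vectors; factors would need as.character() first, since rbind works on their integer codes.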
A two-liner in base R:
test <- data.frame(X.X=df.1$X,Y.Z=unlist(df.1[c("Y","Z")]))
test[order(test$X.X),]
Assuming that you want what you asked for in the first paragraph, and the rest of what you posted is your attempt at solving it.
a=df.1[df.1$Y%%2>0,1:2]
b=df.1[df.1$Z%%2==0,c(1,3)]
names(a)=c("X.X","Y.Z")
names(b)=names(a)
df.2=rbind(a, b)
If you want to group them by X.X as shown in your example, you can do:
library(plyr)
arrange(df.2, X.X)
