I have the following data frame in R:
count1 count2 count3
0 12 11
12 13 44
22 32 13
For every row of the data frame I want to calculate a distance from count1, count2 and count3 as
sqrt(abs(count2-count1) + abs(count3-count2))
My desired data frame is as follows:
count1 count2 count3 distance
0 12 11 sqrt(abs(12-0)+abs(11-12))
12 13 44 sqrt(abs(13-12)+abs(44-13))
22 32 13 sqrt(abs(32-22)+abs(13-32))
The way I am doing it now is with a for loop:
for (i in 1:nrow(df)) {
  df$distance[i] <- sqrt(abs(df$count1[i] - df$count2[i]) + abs(df$count2[i] - df$count3[i]))
}
Is there any better way of doing the above?
I guess the dplyr package is the way to go for that:
library(dplyr)
df <- data.frame(count1 = sample(1:100, 10), count2 = sample(1:100, 10), count3 = sample(1:100, 10))
df %>% mutate(distance = sqrt(abs(count2 - count1) + abs(count3 - count2)))
count1 count2 count3 distance
1 79 59 54 5.000000
2 70 18 22 7.483315
3 31 13 57 7.874008
4 54 49 53 3.000000
5 94 67 77 6.082763
6 51 74 21 8.717798
7 33 4 24 7.000000
8 90 79 78 3.464102
9 6 64 98 9.591663
10 22 68 28 9.273618
# Row-wise with apply(): each row is passed to the function as a plain
# numeric vector, so the columns are addressed by position
df$distance <- apply(df, 1,
                     function(x) sqrt(abs(x[2] - x[1]) + abs(x[3] - x[2])))
df
We can just use base R:
df$distance <- with(df, sqrt(abs(count2 - count1) + abs(count3 - count2)))
Or with rowSums from base R. Here df[-1] drops the first column and df[-length(df)] drops the last, so their difference gives count2 - count1 and count3 - count2 column-wise:
df$distance <- sqrt(rowSums(abs(df[-1] - df[-length(df)])))
data
df <- structure(list(count1 = c(0L, 12L, 22L), count2 = c(12L, 13L,
32L), count3 = c(11L, 44L, 13L)), .Names = c("count1", "count2",
"count3"), class = "data.frame", row.names = c(NA, -3L))
You can also do it with the data.table package:
library(data.table)
y <- data.table(count1 = c(0,12,22), count2 = c(12,13,32), count3 = c(11,44,13))
y[, distance := sqrt(abs(count2 - count1) + abs(count3 - count2))]
Results:
> y
count1 count2 count3 distance
1: 0 12 11 3.605551
2: 12 13 44 5.656854
3: 22 32 13 5.385165
Use the dplyr package; it is pretty much the standard now. Here is a working example using the iris data (use dput(namedataset) to share your data set):
library(dplyr)
iris[1:3] %>% mutate(res=sqrt(abs(Sepal.Length-Sepal.Width)))
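For the data frame in the question, the same pattern with both absolute-difference terms would be (a sketch, assuming your df has the column names shown in the question):
df %>% mutate(distance = sqrt(abs(count2 - count1) + abs(count3 - count2)))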
I have something like this:
  A  B  C
100 24
    18
    16
    21
    14
I am trying to write a function that calculates C = A - B for each row, then adds 20 to that C to get A for the next row, and repeats the step. At the end it should look like this:
A B C
100 24 76
96 18 78
98 16 82
102 21 81
101 14 87
At the moment I am doing it manually, like
df$C[1] <- df$A[1] - df$B[1]
df$A[2] <- df$C[1] + 20
and so on for every row.
I would like to create a function instead of doing this way. Any help would be appreciated.
Here is another approach using a for loop:
data
df <- data.frame(A=NA, B = c(24L, 18L, 16L, 21L, 14L),C=NA)
Initialize first row of df
df$A[1] <- 100
df$C[1] <- df$A[1]-df$B[1]
Populate the remaining rows of df
for (i in 1:(length(df$B) - 1)) {
  df$C[i+1] <- df$C[i] - df$B[i+1] + 20
  df$A[i+1] <- df$C[i] + 20
}
Output
df
A B C
1 100 24 76
2 96 18 78
3 98 16 82
4 102 21 81
5 101 14 87
We can start with only the B column and then calculate A and C in one vectorized step. Unrolling the recursion A[k+1] = C[k] + 20 = A[k] - B[k] + 20 gives A[k+1] = start_value - cumsum(B)[k] + 20*k, which is what the code below computes:
start_value <- 100
df$A <- c(start_value, start_value - cumsum(df$B) + 20 * 1:nrow(df))[-(nrow(df) + 1)]
df$C <- df$A - df$B
df
# B A C
#1 24 100 76
#2 18 96 78
#3 16 98 82
#4 21 102 81
#5 14 101 87
data
df <- structure(list(B = c(24L, 18L, 16L, 21L, 14L)),
class = "data.frame", row.names = c(NA, -5L))
I am trying to select random rows from a data frame with 1000 rows (and six columns) where the skewness of the row is larger than a given value (say Sk > 0.3).
I've generated the following data frame
df=data.frame(replicate(6,sample(10:100,1000,rep=TRUE)))
I can get row skewness from the fBasics package; rowSkewness(df) gives:
[8] -0.2243295435 0.5306809351 0.0707122386 0.0341447417 0.3339384838 -0.3910593364 -0.6443905090
[15] 0.5603809206 0.4406091534 -0.3736108832 0.0397860038 0.9970040772 -0.7702547535 0.2065830354
But now I need to select, say, 10 rows of the df which have row skewness greater than, say, 0.1. Maybe with
for (a in 1:10) {
sample.data[a,] = sample(x=df[which(rowSkewness(df[sample(1:nrow(df),1)>0.1),], size = 1, replace = TRUE)
}
or something like this?
Any thoughts on this will be appreciated. Thanks in advance.
You can use the sample_n() or sample_frac() function; that makes your version a little shorter:
library(dplyr)
library(fBasics)
df=data.frame(replicate(6,sample(10:100,1000,rep=TRUE)))
x=df %>% dplyr::filter(rowSkewness(df)>0.1) %>% dplyr::sample_n(10)
Got it:
x <- df %>% filter(rowSkewness(df) > 0.1)
# assuming samplesize is set and sample.data is pre-allocated;
# index rows, since sample() on a data frame would pick columns
for (a in 1:samplesize) {
  sample.data[a, ] <- x[sample(nrow(x), 1), ]
}
Just do a subset:
res1 <- DF[fBasics::rowSkewness(DF) > .1, ]
head(res1)
# X1 X2 X3 X4 X5 X6
# 7 56 28 21 93 74 24
# 8 33 56 23 44 10 12
# 12 29 19 29 38 94 95
# 13 35 51 54 98 66 10
# 14 12 51 24 23 36 68
# 15 50 37 81 22 55 97
Or with e1071::skewness:
res2 <- DF[apply(as.matrix(DF), 1, e1071::skewness) > .1, ]
stopifnot(all.equal(res1, res2))
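To then draw the 10 random rows the question asks for, sample from the subset (a sketch):
set.seed(1)
res1[sample(nrow(res1), 10), ]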
Data
set.seed(42); DF <- data.frame(replicate(6, sample(10:100, 1000, rep=TRUE)))
I have a data frame (df) in R, and I am interested in two columns, df$LEFT and df$RIGHT. I would like to create two new columns such that df$BEST holds the smaller number between LEFT and RIGHT for each row. Analogously, I want to create the column df$WORST, where the larger number is stored.
ID LEFT RIGHT
1 20 70
2 65 15
3 25 65
I would like to obtain this:
ID LEFT RIGHT BEST WORST
1 20 70 20 70
2 65 15 15 65
3 25 65 25 65
How can I do that?
We can use pmin/pmax to get the corresponding minimum and maximum values of the two columns; both are vectorized, so no row-wise looping is needed:
transform(df, BEST = pmin(LEFT, RIGHT), WORST = pmax(LEFT, RIGHT))
# ID LEFT RIGHT BEST WORST
#1 1 20 70 20 70
#2 2 65 15 15 65
#3 3 25 65 25 65
data
df <- structure(list(ID = 1:3, LEFT = c(20L, 65L, 25L), RIGHT = c(70L,
15L, 65L)), class = "data.frame", row.names = c(NA, -3L))
An alternative is using apply:
> df$BEST <- apply(df[,-1], 1, min)
> df$WORST <- apply(df[,-1], 1, max)
> df
  ID LEFT RIGHT BEST WORST
1  1   20    70   20    70
2  2   65    15   15    65
3  3   25    65   25    65
Using @akrun's approach with transform:
> transform(df,
            WORST = apply(df[,-1], 1, max),
            BEST = apply(df[,-1], 1, min))
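Note that pmin()/pmax() are fully vectorized while apply() loops over the rows, so on large data the pmin/pmax answer should be considerably faster. A rough check (a sketch):
big <- data.frame(ID = 1:1e6, LEFT = runif(1e6), RIGHT = runif(1e6))
system.time(pmin(big$LEFT, big$RIGHT))   # vectorized: typically milliseconds
system.time(apply(big[, -1], 1, min))    # row-by-row: noticeably slower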
Say I have a matrix with 1000 columns. I want to create a new matrix that keeps every other block of n columns from the original matrix, starting from column i.
So let say that n=3 and i=5, then the columns I need from the old matrix are 5,6,7,11,12,13,17,18,19 and so on.
Use two seq() calls to create the start and stop bounds, then mapply() over those to build the actual column index blocks; normal bracket notation then extracts them from your matrix.
set.seed(1)
# using 67342343's test case
M <- matrix(runif(100^2), ncol = 100)
n <- 3
i <- 5
starts <- seq(i, ncol(M), n*2)
stops <- seq(i+(n-1), ncol(M), n*2)
col_index <- c(mapply(seq, starts, stops)) # thanks Jaap and Sotos
col_index
[1] 5 6 7 11 12 13 17 18 19 23 24 25 29 30 31 35 36 37 41 42 43 47 48 49 53 54 55 59 60 61 65 66 67 71 72 73 77 78
[39] 79 83 84 85 89 90 91 95 96 97
M[, col_index]
Another solution is based on the fact that R recycles logical indices when subsetting:
i <- 5; n <- 3
M <- matrix(runif(100^2), ncol = 100)
id <- seq(i, ncol(M), by = 1)[rep(c(TRUE, FALSE), each = n)]
M_sub <- M[, id]
I would write a function that determines the indices of the columns you want, and then call that function as needed.
col_indexes <- function(mat, start = 1, by = 1){
n <- ncol(mat)
inx <- seq(start, n, by = 2*by)
inx <- c(sapply(inx, function(i) i:(i + by -1)))
inx[inx <= n]
}
m <- matrix(0, nrow = 1, ncol = 20)
icol <- col_indexes(m, 5, 3)
icol
[1] 5 6 7 11 12 13 17 18 19
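To actually extract the submatrix, index with the returned positions (a sketch, reusing m from above):
m_sub <- m[, col_indexes(m, start = 5, by = 3)]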
Here is a method using outer.
c(outer(5:7, seq(0L, 95L, 6L), "+"))
[1] 5 6 7 11 12 13 17 18 19 23 24 25 29 30 31 35 36 37 41 42 43 47 48 49 53
[26] 54 55 59 60 61 65 66 67 71 72 73 77 78 79 83 84 85 89 90 91 95 96 97
To generalize this, you could do
idx <- c(outer(seq(i, i + n - 1), seq(0L, ncol(M) - i, 2 * n), "+"))
The idea is to construct the initial set of columns (5:7, i.e. seq(i, i + n - 1)), calculate the starting offsets for every subsequent set (seq(0L, 95L, 6L), i.e. seq(0L, ncol(M) - i, 2 * n)), then use outer to calculate the sum of every combination of these two vectors.
You can then subset the matrix using [, as in M[, idx].
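As a quick sanity check, the generalized expression reproduces the hard-coded one on the 100-column example (a sketch; for matrices whose last block is partial you may also want to trim with idx[idx <= ncol(M)]):
i <- 5L; n <- 3L
M <- matrix(runif(100^2), ncol = 100)
idx <- c(outer(seq(i, i + n - 1L), seq(0L, ncol(M) - i, 2L * n), "+"))
identical(idx, c(outer(5:7, seq(0L, 95L, 6L), "+")))  # TRUE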
I want to add many new columns simultaneously to a data.table based on by-group computations. A working example of my data would look something like this:
Time Stock x1 x2 x3
1: 2014-08-22 A 15 27 34
2: 2014-08-23 A 39 44 29
3: 2014-08-24 A 20 50 5
4: 2014-08-22 B 42 22 43
5: 2014-08-23 B 44 45 12
6: 2014-08-24 B 3 21 2
Now I want to scale and sum many of the variables to get an output like:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57
A brute force implementation of my problem would be:
library(data.table)
set.seed(123)
d <- data.table(Time = rep(seq.Date( Sys.Date(), length=3, by="day" )),
Stock = rep(LETTERS[1:2], each=3 ),
x1 = sample(1:50, 6),
x2 = sample(1:50, 6),
x3 = sample(1:50, 6))
d[,x2_scale:=scale(x2),by=Stock]
d[,x3_scale:=scale(x3),by=Stock]
d[,x2_sum:=sum(x2),by=Stock]
d[,x3_sum:=sum(x3),by=Stock]
Other posts describing a similar issue (Add multiple columns to R data.table in one function call? and Assign multiple columns using := in data.table, by group) suggest the following solution:
d[, c("x2_scale","x3_scale"):=list(scale(x2),scale(x3)), by=Stock]
d[, c("x2_sum","x3_sum"):=list(sum(x2),sum(x3)), by=Stock]
But again, this would get very messy with a lot of variables, and it also raises an error with scale (though not with sum), because scale returns a matrix rather than a plain vector.
Is there a more efficient way to achieve the required result (keeping in mind that my actual data set is quite large)?
I think with a small modification to your last code you can easily do both, for as many variables as you want:
vars <- c("x2", "x3") # <- Choose the variable you want to operate on
d[, paste0(vars, "_", "scale") := lapply(.SD, function(x) scale(x)[, 1]), .SDcols = vars, by = Stock]
d[, paste0(vars, "_", "sum") := lapply(.SD, sum), .SDcols = vars, by = Stock]
## Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
## 1: 2014-08-22 A 13 14 32 -1.1338934 1.1323092 87 44
## 2: 2014-08-23 A 25 39 9 0.7559289 -0.3701780 87 44
## 3: 2014-08-24 A 18 34 3 0.3779645 -0.7621312 87 44
## 4: 2014-08-22 B 44 8 6 -0.4730162 -0.7258662 59 32
## 5: 2014-08-23 B 49 3 18 -0.6757374 1.1406469 59 32
## 6: 2014-08-24 B 15 48 8 1.1487535 -0.4147807 59 32
For simple functions (that don't need special treatment like scale) you could easily do something like
vars <- c("x2", "x3") # <- Define the variable you want to operate on
funs <- c("min", "max", "mean", "sum") # <- define your function
for(i in funs){
d[, paste0(vars, "_", i) := lapply(.SD, eval(i)), .SDcols = vars, by = Stock]
}
Another variation using data.table:
vars <- c("x2", "x3")
d[, paste0(rep(vars, each=2), "_", c("scale", "sum")) := do.call(`cbind`,
lapply(.SD, function(x) list(scale(x)[,1], sum(x)))), .SDcols=vars, by=Stock]
d
# Time Stock x1 x2 x3 x2_scale x2_sum x3_scale x3_sum
#1: 2014-08-22 A 15 27 34 -1.1175975 121 0.7310560 68
#2: 2014-08-23 A 39 44 29 0.3073393 121 0.4085313 68
#3: 2014-08-24 A 20 50 5 0.8102582 121 -1.1395873 68
#4: 2014-08-22 B 42 22 43 -0.5401315 88 1.1226726 57
#5: 2014-08-23 B 44 45 12 1.1539172 88 -0.3274462 57
#6: 2014-08-24 B 3 21 2 -0.6137858 88 -0.7952265 57
Based on comments from #Arun, you could also do:
cols <- paste0(rep(vars, each=2), "_", c("scale", "sum"))
d[,(cols):= unlist(lapply(.SD, function(x) list(scale(x)[,1L], sum(x))),
rec=F), by=Stock, .SDcols=vars]
You're probably looking for a pure data.table solution, but you could also consider using dplyr here, since it works with data.tables as well (no need for conversion). From dplyr you can use the function mutate_at, as I do in this example (with the first data set you showed in your question):
library(dplyr)
d %>%
  group_by(Stock) %>%
  mutate_at(vars(x2, x3), funs(sum, scale))
#Source: local data table [6 x 9]
#Groups: Stock
#
# Time Stock x1 x2 x3 x2_sum x3_sum x2_scale x3_scale
#1 2014-08-22 A 15 27 34 121 68 -1.1175975 0.7310560
#2 2014-08-23 A 39 44 29 121 68 0.3073393 0.4085313
#3 2014-08-24 A 20 50 5 121 68 0.8102582 -1.1395873
#4 2014-08-22 B 42 22 43 88 57 -0.5401315 1.1226726
#5 2014-08-23 B 44 45 12 88 57 1.1539172 -0.3274462
#6 2014-08-24 B 3 21 2 88 57 -0.6137858 -0.7952265
You can easily add more functions to be calculated, which will create more columns for you. Note that mutate_at applies the functions only to the columns you select inside vars(); the grouping variable (Stock) is never touched. You can also deselect columns instead, e.g. vars(-c(x2, x3)) instead of vars(x2, x3).
EDIT: replaced mutate_each above with mutate_at, as mutate_each will be deprecated in the near future.
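For reference, in dplyr 1.0 and later both of these verbs are in turn superseded by across(); a sketch of the equivalent call, assuming the data is in d as above:
library(dplyr)
d %>%
  group_by(Stock) %>%
  mutate(across(c(x2, x3), list(sum = sum, scale = ~ scale(.x)[, 1]))) %>%
  ungroup()
The new column names default to {column}_{function}, giving x2_sum, x2_scale, x3_sum and x3_scale.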
EDIT: cleaner version using functional. I think this is the closest to the dplyr answer.
library(functional)
funs <- list(scale=Compose(scale, c), sum=sum) # See data.table issue #783 on github for the need for this
cols <- paste0("x", 2:3)
cols.all <- outer(cols, names(funs), paste, sep="_")
d[,
c(cols.all) := unlist(lapply(funs, Curry(lapply, X=.SD)), rec=F),
.SDcols=cols,
by=Stock
]
Produces:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57