Sub-setting by group closest to a defined value in R

I have a data frame where I would like to select, within each group, the rows where y is closest to a specific value (e.g. 5).
set.seed(1234)
df <- data.frame(x = c(rep("A", 4), rep("B", 4)),
                 y = c(rep(4, 2), rep(1, 2), rep(6, 2), rep(3, 2)),
                 z = rnorm(8))
df
## x y z
## 1 A 4 -1.2070657
## 2 A 4 0.2774292
## 3 A 1 1.0844412
## 4 A 1 -2.3456977
## 5 B 6 0.4291247
## 6 B 6 0.5060559
## 7 B 3 -0.5747400
## 8 B 3 -0.5466319
The result would be:
## x y z
## 1 A 4 -1.2070657
## 2 A 4 0.2774292
## 3 B 6 0.4291247
## 4 B 6 0.5060559
Thank you, Philippe

library(dplyr)

df %>%
  group_by(x) %>%
  mutate(delta = abs(y - 5)) %>%
  filter(delta == min(delta)) %>%
  select(-delta)
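With dplyr >= 1.0.0 the helper column is not needed, since slice_min() keeps ties by default; a minimal sketch of the same idea:
df %>%
  group_by(x) %>%
  slice_min(abs(y - 5))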

Alternatively, using base R (this relies on the rows being ordered by group, as they are here):
df[do.call(c, tapply(df$y, df$x, function(v) abs(v - 5) == min(abs(v - 5)))), ]
x y z
1 A 4 -1.2070657
2 A 4 0.2774292
5 B 6 0.4291247
6 B 6 0.5060559

Here is an option with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)); grouped by 'x', we get the absolute difference of 'y' from 5, find the rows where that difference equals the group minimum, take the row index (.I), extract the row-index column ('V1'), and use it to subset the dataset.
library(data.table)
setDT(df)[df[, {v1 <- abs(y - 5)
                .I[v1 == min(v1)]}, x]$V1]
# x y z
#1: A 4 -1.2070657
#2: A 4 0.2774292
#3: B 6 0.4291247
#4: B 6 0.5060559
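If readability matters more than raw speed, the same data.table result can also be written with .SD; a sketch (setDT(df) is a no-op if df has already been converted by the answer above):
library(data.table)
setDT(df)[, .SD[abs(y - 5) == min(abs(y - 5))], by = x]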

val <- 5
delta <- abs(val - df$y)
# keep, within each group of x, the rows whose difference equals the group minimum
df <- df[delta == ave(delta, df$x, FUN = min), ]

Related

Weight calculation for panel data set in R [duplicate]

I have the following data frame, where "x" is a grouping variable and "y" some values:
dat <- data.frame(x = c(1, 2, 3, 3, 2, 1), y = c(3, 4, 4, 5, 2, 5))
I want to create a new column where each "y" value is divided by the sum of "y" within each group defined by "x". E.g. the result for the first row is 3 / (3 + 5) = 0.375, where the denominator is the sum of "y" values for group 1 (x = 1).
There are various ways of solving this, here's one
with(dat, ave(y, x, FUN = function(x) x/sum(x)))
## [1] 0.3750000 0.6666667 0.4444444 0.5555556 0.3333333 0.6250000
Here's another possibility
library(data.table)
setDT(dat)[, z := y/sum(y), by = x]
dat
# x y z
# 1: 1 3 0.3750000
# 2: 2 4 0.6666667
# 3: 3 4 0.4444444
# 4: 3 5 0.5555556
# 5: 2 2 0.3333333
# 6: 1 5 0.6250000
Here's a third one
library(dplyr)
dat %>%
group_by(x) %>%
mutate(z = y/sum(y))
# Source: local data frame [6 x 3]
# Groups: x
#
# x y z
# 1 1 3 0.3750000
# 2 2 4 0.6666667
# 3 3 4 0.4444444
# 4 3 5 0.5555556
# 5 2 2 0.3333333
# 6 1 5 0.6250000
Here are some base R solutions:
1) prop.table Use the base prop.table function with ave like this:
transform(dat, z = ave(y, x, FUN = prop.table))
giving:
x y z
1 1 3 0.3750000
2 2 4 0.6666667
3 3 4 0.4444444
4 3 5 0.5555556
5 2 2 0.3333333
6 1 5 0.6250000
2) sum This also works:
transform(dat, z = y / ave(y, x, FUN = sum))
And of course there's a way for people thinking in SQL; it is very wordy in this case, but it generalises nicely to all sorts of other similar problems:
library(sqldf)
dat <- sqldf("
with sums as (
select
x
,sum(y) as sy
from dat
group by x
)
select
d.x
,d.y
,d.y/s.sy as z
from dat d
inner join sums s
on d.x = s.x
")

How to make this code more efficient in R?

I know this is a stupid question, but I'm somewhat frustrated with my code because it takes so much time. Here is one part of my code.
Basically, I have a matrix called "distance"...
  a b c
1 2 5 7
2 6 8 4
3 9 2 3
and then let's say I have a data frame with a column that contains values from {a, b, c} (c2 and c3 are just other columns):
c1  c2  c3
c   ... ...
a   ... ...
a   ... ...
b   ... ...
c   ... ...
So I want to do a match: I want to make another matrix with ncol = nrow(distance) and nrow = nrow(c1), where each factor value is replaced by its distance value. Here's an example of how the first column of the matrix I'm going to make is filled:
a will be replaced by 2
b will be replaced by 5
c will be replaced by 7
and for the second column I will take row number 2 from the distance matrix, and so on. So the result will be like this:
m1 m2 m3
7 4 3
2 6 9
2 6 9
5 8 2
7 4 3
That is just a simple example. I'm running this code, but when it has to do many iterations it becomes painfully slow.
for (l in 1:ncol(d.cat)) {
  get.unique = sort(unique(d.cat[, l]))
  for (j in 1:nrow(d.cat)) {
    value = as.character(d.cat[j, l])
    index = which(get.unique == value)
    d2[j, l] = d[[l]][i, index]  # i comes from enclosing code not shown here
  }
}
d.cat is the categorical data, and d[[...]] is the list of distance matrices, one for every column of d.cat.
Try to store the indices and do the updating in one go. Let's say your distance matrix is dmat, your data frame is df, and you want to create a matrix named newmat:
a.ind = which(df$c1=="a")
b.ind = which(df$c1=="b")
c.ind = which(df$c1=="c")
newmat = matrix(0,nrow=length(df$c1),ncol=3)
newmat[a.ind,] = dmat[,1]
newmat[b.ind,] = dmat[,2]
newmat[c.ind,] = dmat[,3]
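The three hard-coded blocks can also be generalised with a loop over the distance matrix's column names; a sketch, assuming colnames(dmat) match the levels appearing in df$c1:
newmat <- matrix(0, nrow = length(df$c1), ncol = ncol(dmat))
for (lvl in colnames(dmat)) {
  idx <- which(df$c1 == lvl)
  newmat[idx, ] <- rep(dmat[, lvl], each = length(idx))
}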
Here's some data
set.seed(123)
d = matrix(1:9, 3, dimnames=list(NULL, letters[1:3]))
df = data.frame(c1 = sample(letters[1:3], 10, TRUE), stringsAsFactors=FALSE)
and a solution
t(d[, match(df$c1, colnames(d))])
For example
> d
a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> df$c1
[1] "a" "c" "b" "c" "c" "a" "b" "c" "b" "b"
> t(d[,match(df$c1, colnames(d))])
[,1] [,2] [,3]
a 1 2 3
c 7 8 9
b 4 5 6
c 7 8 9
c 7 8 9
a 1 2 3
b 4 5 6
c 7 8 9
b 4 5 6
b 4 5 6
Your data
mat <- matrix(c(2,6,9,5,8,2,7,4,3), nrow=3)
rownames(mat) <- 1:3
colnames(mat) <- letters[1:3]
library(dplyr)
set.seed(1)
df <- as.data.frame(matrix(sample(letters[1:3], 12, replace=TRUE), nrow=4)) %>%
setNames(paste0("c", 1:3))
# c1 c2 c3
# 1 a a b
# 2 b c a
# 3 b c a
# 4 c b a
Using purrr::map2_df, iterate over the columns of df together with the corresponding column indices of tmat:
library(purrr)
tmat <- t(mat)
map2_df(df, seq_len(ncol(tmat)), ~tmat[,.y][.x])
# # A tibble: 4 x 3
# c1 c2 c3
# <dbl> <dbl> <dbl>
# 1 2. 6. 2.
# 2 5. 4. 9.
# 3 5. 4. 9.
# 4 7. 8. 9.
Here is my attempt using the tidyverse:
library(tidyverse)
# Let's create some example data
distance <- data_frame(a = sample(1:10, 1000, T), b = sample(1:10, 1000, T), c = sample(1:10, 1000, T))
c1 <- data_frame(c1 = sample(letters[1:3], 1000, T), c2 = sample(letters[1:3], 1000, T))
# First, rearrange your data a little to make it tidier
distance2 <- distance %>%
  mutate(i = seq_len(n())) %>%
  gather(col, value, -i)
c12 <- c1 %>%
  mutate(i = seq_len(n())) %>%
  gather(col, value, -i)
# Now just join the data and spread it again
c12 %>%
  left_join(distance2, by = c("i", "value" = "col")) %>%
  select(i, col, value.y) %>%
  spread(col, value.y)

R: Euclidean distances between objects in a group

I want to create a matrix with similarities based on two identifiers; consider the following data:
x1 <- c(2,2,2,3,1,2,4,6,4)
y1 <- c(5,4,3,3,4,2,1,6,3)
x2 <- c(8,2,7,3,1,2,2,2,6)
y2 <- c(1,3,3,3,1,2,4,3,8)
x3 <- c(4,4,1,2,4,6,3,2,9)
y3 <- c(1,2,3,3,1,2,4,6,1)
id1 <- c("a","a","a","a","b","b","b","b","b")
id2 <- c(2002,2002,2003,2003,2002,2002,2003,2003,2003)
dat <- data.frame(x1,y1,x2,y2,x3,y3,id1,id2)
For the groups marked by id1 and id2 I want to compute the Euclidean distance (sqrt((x1a-x1b)^2 + (y1a-y1b)^2 + ... + (y3a-y3b)^2)) between the rows of the dataset. Ideally there would be a new variable that indicates the distance of each row to every other row with the same id1 and id2. Please note that groups can have different numbers of members; for instance, the b group in 2003 has three cases.
Any advice would be great!!!
I think it would be a good idea first to distinguish the lines whose distances you want to calculate. For example, for id1 == b and id2 == 2003 you have 3 lines, and you want to calculate 3 different distances (between each possible pair). So let's first assign each of these a unique id.
f <- function(n) {
# Returns a vector
# 1, 2, 1, 3, ..., 1, n, 2, 3, 2, 4, ..., 2, n, ..., (n-1), n
m <- matrix(ncol = 2, nrow = n * (n-1) / 2)
m[, 1] <- rep(1:(n-1), (n-1):1)
m[, 2] <- unlist(lapply(2:n, function(x) x:n))
as.numeric(t(m))
}
# Alternatively,
# f <- function(n) {
# d <- expand.grid(a = 1:n, b = 1:n)
# d <- d[d$a < d$b, ]
# unlist(d)
# }
# but this is slower
# Using plyr...
library(plyr)
dat <- ddply(dat, .(id1, id2), function(d) {
d <- d[f(nrow(d)), ]
d$id3 <- paste0(d$id1, rep(1:(nrow(d) / 2), each = 2))
d
})
# ...or using base R
dat <- do.call(rbind,
by(dat, list(dat$id1, dat$id2), function(d) {
d <- d[f(nrow(d)), ]
d$id3 <- paste0(d$id1, rep(1:(nrow(d) / 2), each = 2))
d
}))
Now there will only be two lines for each (id3, id2) pair and you can calculate the differences as follows
# Using plyr
result <- ddply(dat, .(id3, id2), function(d) {
d <- d[paste0(rep(c("x", "y"), 3), 1:3)]
d$dist <- sqrt(sum((d[1, ] - d[2, ])^2))
d
})
# Base R
result <- do.call(rbind,
by(dat[paste0(rep(c("x", "y"), 3), 1:3)],
list(dat$id3, dat$id2),
function(d){
d$dist <- sqrt(sum((d[1, ] - d[2, ])^2))
d
}
))
result[c("id3", "id2")] <- dat[c("id3", "id2")]
result
# x1 y2 x3 y1 x2 y3 dist id3 id2
# 1 2 1 4 5 8 1 6.480741 a1 2002
# 2 2 3 4 4 2 2 6.480741 a1 2002
# 5 1 1 4 4 1 1 3.464102 b1 2002
# 6 2 2 6 2 2 2 3.464102 b1 2002
# 3 2 3 1 3 7 3 4.242641 a1 2003
# 4 3 3 2 3 3 3 4.242641 a1 2003
# 7 4 4 3 1 2 4 5.916080 b1 2003
# 8 6 3 2 6 2 6 5.916080 b1 2003
# 7.1 4 4 3 1 2 4 9.000000 b2 2003
# 9 4 8 9 3 6 1 9.000000 b2 2003
# 8.1 6 3 2 6 2 6 11.313708 b3 2003
# 9.1 4 8 9 3 6 1 11.313708 b3 2003
Maybe this could be helpful.
dist(dat[which(dat[,"id1"]=="a" & dat[,"id2"]=="2002"),], method ="euclidean")
dist(dat[which(dat[,"id1"]=="b" & dat[,"id2"]=="2003"),], method ="euclidean")

How can I melt a data.table with concatenated column names?

I'm using dcast.data.table to convert a long data.table to a wide data.table
library(data.table)
library(reshape2)
set.seed(1234)
dt.base <- data.table(A = rep(c(1:3),2), B = rep(c(1:2),3), C=c(1:4,1,2),thevalue=rnorm(6))
#from long to wide using dcast.data.table()
dt.cast <- dcast.data.table(dt.base, A ~ B + C, value.var = "thevalue", fun = sum)
#now some stuff happens e.g., please do not bother what happens between dcast and melt
setkey(dt.cast, A)
dt.cast[2, c(2,3,4):=1,with = FALSE]
Now I want to melt the data.table back to the original column layout, and here I'm stuck: how do I separate the concatenated column names from the cast data.table? This is my problem:
dt.melt <- melt(dt.cast,id.vars = c("A"), value.name = "thevalue")
I need two columns instead of one
The result I'm looking for can be produced with this code:
#update
dt.base[A==2 & B == 1 & C == 1, thevalue :=1]
dt.base[A==2 & B == 2 & C == 2, thevalue :=1]
#insert (2,1,3 was not there in the base data.table)
dt.newrow <- data.table(A=2, B=1, C=3, thevalue = 1)
dt.base <-rbindlist(list(dt.base, dt.newrow))
dt.base
As always any help is appreciated
Would that work for you?
colnames <- c("B", "C")
dt.melt[, (colnames) := (colsplit(variable, "_", colnames))][, variable := NULL]
subset(dt.melt, thevalue != 0)
# or dt.melt[thevalue != 0, ]
# A thevalue B C
#1: 1 -1.2070657 1 1
#2: 2 1.0000000 1 1
#3: 2 1.0000000 1 3
#4: 3 1.0844412 1 3
#5: 2 1.0000000 2 2
#6: 3 0.5060559 2 2
#7: 1 -2.3456977 2 4
If your data set isn't representative and there could be zeros in valid rows, here's an alternative approach:
colnames <- c("B", "C")
setkey(dt.melt[, (colnames) := (colsplit(variable, "_",colnames))][, variable := NULL], A, B, C)
setkey(dt.base, A, B, C)
dt.base <- dt.melt[rbind(dt.base, data.table(A = 2, B = 1, C = 3), fill = T)]
dt.base[, thevalue.1 := NULL]
## A B C thevalue
## 1: 1 1 1 -1.2070657
## 2: 1 2 4 -2.3456977
## 3: 2 1 1 1.0000000
## 4: 2 2 2 1.0000000
## 5: 3 1 3 1.0844412
## 6: 3 2 2 0.5060559
## 7: 2 1 3 1.0000000
Edit
As suggested by @Arun, the most efficient way would be to use @AnandaMahto's cSplit function (from the splitstackshape package), as it uses data.table too, i.e.,
cSplit(dt.melt, "variable", "_")
Second Edit
In order to avoid the manual merges, you can set fill = NA (for example) while dcasting and then do everything in one go with cSplit, e.g.
dt.cast <- dcast.data.table(dt.base, A ~ B + C, value.var = "thevalue", fun = sum, fill = NA)
setkey(dt.cast, A)
dt.cast[2, c(2,3,4):=1,with = FALSE]
dt.melt <- melt(dt.cast,id.vars = c("A"), value.name = "thevalue")
dt.cast <- cSplit(dt.melt, "variable", "_")[!is.na(thevalue)]
setnames(dt.cast, 3:4, c("B","C"))
# A thevalue B C
# 1: 1 -1.2070657 1 1
# 2: 2 1.0000000 1 1
# 3: 2 1.0000000 1 3
# 4: 3 1.0844412 1 3
# 5: 2 1.0000000 2 2
# 6: 3 0.5060559 2 2
# 7: 1 -2.3456977 2 4
