Finding paired values from a row throughout the dataframe in R - r

I am trying to polish my R skills and have sort of hit my limit.
The issue I am trying to solve is as follows.
Suppose my dataframe is as below:
n = c(2, 15, 31 , 33)
n2 = c( 10 , 9, 10 , 40)
n3 = c( 11 , 10 , 11 , 42)
df = data.frame(n , n2 , n3)
> df
n n2 n3
1 2 10 11
2 15 9 10
3 31 10 11
4 33 40 42
I would like to go through each row, generate a pair of values (e.g. 2,10), and then go through the rest of the rows to find a repeated pair, printing out the common pairs and the number of occurrences. How can I do that?
In the above example, the only pair that repeats is 10 and 11, at rows 1 and 3.
So far I have thought about the pseudo code as follows
for(each row in the dataframe)
{
for (each of the values in the row)
{
for every pair
}
find a repeated pair
if found store in a dataframe
}
and I would find the pairs using the combn function.
But I am a little lost at the part that iterates through the dataframe rows.
Please help.
Thanks a lot!

I think this is what you want. Instead of thinking about selecting every combination of two values for each row, we'll get every combination of two column numbers - which will be the same for every row. Then we use plyr::count as a convenience function to count rows with the same values for an entire data frame at once. This way we can loop over the combinations of column indices rather than over rows. I use apply, but you could write it as a for loop instead.
pairs = combn(ncol(df), m = 2)
result = apply(pairs, MAR = 2, FUN = function(p) {
plyr::count(df[p])
})
names(result) = apply(pairs, MAR = 2, FUN = paste, collapse = "_")
The result is a list where each item is a data frame with two value columns and a freq column giving the number of rows in the original data in which each value pair occurred.
result
# $`1_2`
# n n2 freq
# 1 2 10 1
# 2 15 9 1
# 3 31 10 1
# 4 33 40 1
#
# $`1_3`
# n n3 freq
# 1 2 11 1
# 2 15 10 1
# 3 31 11 1
# 4 33 42 1
#
# $`2_3`
# n2 n3 freq
# 1 9 10 1
# 2 10 11 2
# 3 40 42 1
If you want to omit the values that aren't repeated, we can just subset them out:
lapply(result, subset, freq > 1)
# $`1_2`
# [1] n n2 freq
# <0 rows> (or 0-length row.names)
#
# $`1_3`
# [1] n n3 freq
# <0 rows> (or 0-length row.names)
#
# $`2_3`
# n2 n3 freq
# 2 10 11 2
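If a single table of repeated pairs is more convenient than a list, the same column-pair idea can be sketched in base R without plyr. This is only an illustration: the names `repeated`, `key`, `cols`, `pair` are made up here, and pasting the two values into a string key assumes the values print unambiguously (true for the small numbers in the question).

```r
df <- data.frame(n  = c(2, 15, 31, 33),
                 n2 = c(10, 9, 10, 40),
                 n3 = c(11, 10, 11, 42))

# every combination of two column names, as a list of length-2 vectors
pairs <- combn(names(df), 2, simplify = FALSE)

repeated <- do.call(rbind, lapply(pairs, function(p) {
  # one string key per row for this pair of columns
  key    <- paste(df[[p[1]]], df[[p[2]]], sep = ",")
  counts <- table(key)
  nm     <- names(counts)
  fr     <- as.integer(counts)
  keep   <- fr > 1                      # pairs seen in more than one row
  data.frame(cols = rep(paste(p, collapse = "_"), sum(keep)),
             pair = nm[keep],
             freq = fr[keep],
             stringsAsFactors = FALSE)
}))

repeated
```

Column pairs without any repeated values contribute zero rows, so only the repeated pairs remain in `repeated`.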

Slightly different method
n = c(2, 15, 31 , 15) # changed the dataset to have some common pairs in n and n2 too
n2 = c( 10 , 9, 10 , 9)
n3 = c( 11 , 10 , 11 , 42)
df = data.frame(n , n2 , n3)
library(dplyr)
library(rlang)
library(utils)
cols<-colnames(df) # define the columns that you want to do the pair checking for
combinations<- as.data.frame(combn(cols,2),stringsAsFactors = FALSE)
# picks up all combinations of columns
#iterates over each pair of columns
all_combs<- lapply(names(combinations), function(x){
df %>%
group_by(!! sym( combinations[[x]][1]),!! sym( combinations[[x]][2])) %>%
filter(n()>1) # groups by the two columns and keeps only the pairs that occur more than once. You can add a distinct() call below if you
#don't want them repeated
})
all_combs_df <- do.call("rbind", all_combs)# all_combs is in a list format, use rbind to convert into a dataframe
all_combs_df
The output is:
n n2 n3
<dbl> <dbl> <dbl>
1 15. 9. 10.
2 15. 9. 42.
3 2. 10. 11.
4 31. 10. 11.

Related

Filling dataframe with loops

I have a dataframe:
Start <- data.frame("Number" = 2,"Square" = 4,"Cube" = 8)
A Vector of inputs:
Numbers <- c(3,5)
I want to apply the function SquareCube to each element of Numbers and fill the dataframe with the results:
SquareCube <- function(x){ df <- c(x^2,x^3)
df}
Desired Output:
Filled <- data.frame("Number" = c(2,3,5),"Square" = c(4,9,25),"Cube" = c(8,27,125))
Note: Already searched for this topic , but in this case the size of the vector Numbers can be different. My intent is to fill the dataframe with the results of the function.
Thanks
If I am reading your question right, you may just be having issues with structure that do.call may be able to help with. I also redefined the function slightly to accommodate the naming:
Start <- data.frame("Number" = 2,"Square" = 4, "Cube" = 8)
Number <- c(3,5)
Define your function:
SquareCube <- function(x){ list(Number=x,Square=x^2,Cube=x^3) }
Then construct the data frame with desired end results:
> rbind(Start, data.frame( do.call(cbind, SquareCube(Number)) ))
Number Square Cube
1 2 4 8
2 3 9 27
3 5 25 125
You can also make a wrapper function and just hand it the Start data and the original Number list that you want to process, which will yield a data frame:
> makeResults <- function(a, b) { rbind(a, data.frame(do.call(cbind,SquareCube(b)))) }
> makeResults(Start, Number)
Number Square Cube
1 2 4 8
2 3 9 27
3 5 25 125
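For comparison, here is a minimal sketch that keeps the asker's original vector-returning SquareCube and works for a Numbers vector of any length. The name `new_rows` is just illustrative, and vapply() assumes the function always returns exactly two numbers.

```r
Start   <- data.frame(Number = 2, Square = 4, Cube = 8)
Numbers <- c(3, 5)

# original form of the function: returns c(square, cube)
SquareCube <- function(x) c(x^2, x^3)

# vapply() gives a 2 x length(Numbers) matrix; transpose it so each input is a row
new_rows <- data.frame(Number = Numbers,
                       t(vapply(Numbers, SquareCube, numeric(2))))
names(new_rows) <- names(Start)

Filled <- rbind(Start, new_rows)
Filled
```

Because the matrix is built from Numbers itself, nothing in the code depends on how many inputs there are.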
The outer() function produces a matrix with exactly your desired output. You just need to convert it to a data frame and rename the columns.
(Filled <- outer(
c(2, 3, 5),
1:3,
FUN = "^"
))
#> [,1] [,2] [,3]
#> [1,] 2 4 8
#> [2,] 3 9 27
#> [3,] 5 25 125
For this matrix, you can use any function you know to change the class and change the column names.
Here, for instance, as_tibble() and dplyr::rename():
library(tidyverse)
Filled %>%
as_tibble() %>% # make data frame
rename(Number = V1, Square = V2, Cube = V3) # rename column names
#> # A tibble: 3 x 3
#> Number Square Cube
#> <dbl> <dbl> <dbl>
#> 1 2 4 8
#> 2 3 9 27
#> 3 5 25 125

Nested loop in R for detecting outliers

I would like to remove outliers for each cluster of a dataset. The dataset contains 3 columns with different variables and a column indicating the cluster to which each point is allocated. If any one of the 3 variables is an outlier, the entire row should be removed. Outliers are identified by determining the interval spanning the mean plus/minus three standard deviations, but I can also use the outlier function.
I am able to remove outliers without considering clusters, using:
#data: each row has 3 different variables and the allocating cluster (k)
dat <- cbind.data.frame(v1=c(sample(5:10, 100,replace=T),sample(1:5,5)),
v2=c(sample(20:25, 100,replace=T),sample(5:10,5)),
v3=c(sample(30:35, 100,replace=T),sample(10:20,5)),
k=c(rep(1:5,21)))
### find outliers without considering clusters
#(obviously only the last 5 samples in this example)
rmv<-c()
for(i in 1:3){
variable<-dat[,i]
rmv.tm<-which(variable >= (mean(variable)+sd(variable)*3)
| variable <= (mean(variable)-sd(variable)*3))
rmv<-c(rmv,rmv.tm)
}
rmv<-unique(rmv)
rmv
### remove outliers
dat_clean <- dat[-rmv,]
However, I am not able to detect outliers CONSIDERING clusters, i.e. determining the intervals inside each cluster and not over the entire population. I thought of nesting another loop, but I am finding it difficult to code.
Any help would be much appreciated.
Here's a dplyr-approach:
library(dplyr)
dat %>%
group_by(k) %>%
filter_all(all_vars((abs(mean(.) - .) < 3*sd(.))))
# # A tibble: 100 x 4
# # Groups: k [5]
# v1 v2 v3 k
# <int> <int> <int> <int>
# 1 9 20 30 1
# 2 5 24 35 2
# 3 8 20 30 3
# 4 8 23 32 4
# 5 6 23 35 5
# 6 9 24 32 1
# 7 9 22 33 2
# 8 9 23 31 3
# 9 7 21 35 4
# 10 9 23 32 5
# # ... with 90 more rows
Base R:
dat <- cbind.data.frame(v1=c(sample(5:10, 100,replace=T),sample(1:5,5)),
v2=c(sample(20:25, 100,replace=T),sample(5:10,5)),
v3=c(sample(30:35, 100,replace=T),sample(10:20,5)),
k=c(rep(1:5,21)))
get_remove <- function(x, index, a = 3) {
# group-wise mean and sd, returned in the original row order by ave()
m <- ave(as.numeric(x), index, FUN = mean)
s <- ave(as.numeric(x), index, FUN = sd)
# TRUE where the value lies outside mean +/- a * sd of its own cluster
(x < m - a * s) | (x > m + a * s)
}
mask <- apply(do.call(cbind,
lapply(dat[ , c("v1", "v2", "v3")],
get_remove, dat$k)),
MARGIN = 1, any)
dat[!mask, ]
print("removed:")
dat[mask, ]
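Since the question asked about nesting another loop, here is a rough sketch of that version: an outer loop over clusters and an inner loop over the three variables. The seed and the names `cl`, `rows` and `out` are assumptions added for illustration, and which rows end up flagged depends on the random data.

```r
set.seed(1)  # arbitrary seed, only to make the example reproducible
dat <- data.frame(v1 = c(sample(5:10, 100, replace = TRUE), sample(1:5, 5)),
                  v2 = c(sample(20:25, 100, replace = TRUE), sample(5:10, 5)),
                  v3 = c(sample(30:35, 100, replace = TRUE), sample(10:20, 5)),
                  k  = rep(1:5, 21))

rmv <- c()
for (cl in unique(dat$k)) {            # outer loop: one cluster at a time
  rows <- which(dat$k == cl)           # row numbers in the full data set
  for (v in c("v1", "v2", "v3")) {     # inner loop: each variable
    x   <- dat[rows, v]
    out <- abs(x - mean(x)) >= 3 * sd(x)
    rmv <- c(rmv, rows[out])           # store indices relative to dat
  }
}
rmv <- sort(unique(rmv))

# guard against the case where nothing is flagged (dat[-integer(0), ] is empty)
dat_clean <- if (length(rmv) > 0) dat[-rmv, ] else dat
```

The key point is that `rows[out]` maps the within-cluster positions back to row numbers of the original data frame.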

random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
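The steps above can be wrapped into a small helper. This is a sketch under a few assumptions: `pick_cols` is a hypothetical name, the seed is arbitrary, and sample.int() is used on positions instead of sample() on the indices so that a group with a single column is not misread as "sample from 1:x".

```r
set.seed(42)  # arbitrary seed for reproducibility
dframe <- as.data.frame(matrix(rnorm(60), ncol = 30))
colnames(dframe) <- rep(c("A", "B", "C"), times = c(10, 14, 6))

# returns the column indices to keep: at most `size` per group of equal names
pick_cols <- function(nms, size) {
  idx  <- split(seq_along(nms), nms)
  keep <- lapply(idx, function(i) i[sample.int(length(i), min(size, length(i)))])
  sort(unlist(keep, use.names = FALSE))
}

keep <- pick_cols(names(dframe), 7)
sub  <- dframe[, keep]

# number of sampled columns per group
table(names(dframe)[keep])
```

Sorting the indices keeps the sampled columns in their original left-to-right order.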
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
lapply(unique(colnames(dframe)),
function(x){
dframe[,if(sum(colnames(dframe) == x) <= nc) which(colnames(dframe) == x) else sample(which(colnames(dframe) == x),nc,replace = F)]
}
))
It might look complicated, but it really just takes all columns of a group if there are nc or fewer, and samples nc random columns if there are more than nc.
And to restore your original column-name scheme (subsetting makes the duplicated names unique by appending .1, .2, ...), gsub does the trick:
colnames(res) <- gsub('\\.[[:digit:]]+$','',colnames(res))

R: find data frame index of multiple conditions

Given two data frames s and q with five observations each:
set.seed(8)
s <- data.frame(id=sample(c('Z','X'), 5, T),
t0=sample(1:10, 5, T),
t1 = sample(11:30, 5, T))
q <- data.frame(id=sample(c('Z','X'), 5, T),
t0=sample(1:10, 5, T),
t1 = sample(11:30, 5, T))
> s
id t0 t1
1 Z 8 20
2 Z 3 12
3 X 10 19
4 X 8 21
5 Z 7 13
> q
id t0 t1
1 X 3 30
2 Z 5 12
3 Z 7 23
4 Z 3 21
5 X 7 27
The midpoint for the observations between the variables t0 and t1 is (e.g. for s data):
s$t0+(s$t1-s$t0)/2
To find the index of the (first) observation in s whose midpoint is closest to, say, the first observation in q I can do:
i <- which.min(abs((s$t0+(s$t1-s$t0)/2 - (q$t0[1]+(q$t1[1]-q$t0[1])/2)))
s[i,]
gives:
id t0 t1
3 X 10 19
But I cannot figure out how to find the same index in the original data s if I also want to condition on the id variable (e.g. pseudo code like: which.min(....) & s$id == q$id[1] - in this case the midpoint is sought among ids being 'X'). This SO is close but not spot on.
Again: I need an index that can be used in the original 5-row data set.
Set the which.min argument to infinity when your condition is not obeyed:
val <- abs((s$t0+(s$t1-s$t0)/2 - (q$t0[1]+(q$t1[1]-q$t0[1])/2))
val[s$id != q$id[1]] <- Inf
i <- which.min(val)
By the way, you can simplify the expression in the first line to:
val <- abs((s$t0+s$t1)/2-(q$t0[1]+q$t1[1])/2)
or even
val <- abs(s$t0+s$t1-q$t0[1]-q$t1[1])/2
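To get the matching index for every row of q rather than only the first, the same trick can be put in a function and applied over the rows. This is a sketch: `closest_idx` is a made-up name, and the data are typed in from the printed tables in the question rather than regenerated with set.seed(8), because sample() may give different draws in different R versions.

```r
s <- data.frame(id = c("Z", "Z", "X", "X", "Z"),
                t0 = c(8, 3, 10, 8, 7),
                t1 = c(20, 12, 19, 21, 13),
                stringsAsFactors = FALSE)
q <- data.frame(id = c("X", "Z", "Z", "Z", "X"),
                t0 = c(3, 5, 7, 3, 7),
                t1 = c(30, 12, 23, 21, 27),
                stringsAsFactors = FALSE)

# index in s of the row with the same id whose midpoint is closest to row j of q
closest_idx <- function(j) {
  val <- abs((s$t0 + s$t1) / 2 - (q$t0[j] + q$t1[j]) / 2)
  val[s$id != q$id[j]] <- Inf          # rule out rows with a different id
  if (all(is.infinite(val))) NA_integer_ else which.min(val)
}

idx <- vapply(seq_len(nrow(q)), closest_idx, integer(1))
idx
# [1] 3 2 1 1 3
```

The NA branch covers the case where no row of s shares the id; ties go to the first matching row, as with which.min in general.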

aggregate data frame by equal buckets

I would like to aggregate an R data.frame by equal amounts of the cumulative sum of one of the variables in the data.frame. I googled quite a lot, but probably I don't know the correct terminology to find anything useful.
Suppose I have this data.frame:
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> head(x, 20)
p v
1 10.002904 4
2 10.132200 2
3 10.026105 6
4 10.001146 2
5 9.990267 2
6 10.115907 6
7 10.199895 9
8 9.949996 8
9 10.165848 8
10 9.953283 6
11 10.072947 10
12 10.020379 2
13 10.084002 3
14 9.949108 8
15 10.065247 6
16 9.801699 3
17 10.014612 8
18 9.954638 5
19 9.958256 9
20 10.031041 7
I would like to reduce the x to a smaller data.frame where each line contains the weighted average of p, weighted by v, corresponding to an amount of n units of v. Something of this sort:
> n <- 100
> cum.v <- cumsum(x$v)
> f <- cum.v %/% n
> x.agg <- aggregate(cbind(v*p, v) ~ f, data=x, FUN=sum)
> x.agg$'v * p' <- x.agg$'v * p' / x.agg$v
> x.agg
f v * p v
1 0 10.039369 98
2 1 9.952049 94
3 2 10.015058 104
4 3 9.938271 103
5 4 9.967244 100
6 5 9.995071 69
First question: I was wondering if there is a better (more efficient) approach than the code above. The second, more important, question is how to correct the code above in order to obtain more precise bucketing. Namely, each row in x.agg should contain exactly 100 units of v, not just approximately, as is the case above. For example, the first row contains the aggregate of the first 17 rows of x, which correspond to 98 units of v. The next row (the 18th) contains 5 units of v and is fully included in the next bucket. What I would like to achieve instead is to attribute 2 units of the 18th row to the first bucket and the remaining 3 units to the following one.
Thanks in advance for any help provided.
Here's another method that does this without repeating each p v times. The way I understand it, the place where the cumulative sum crosses 100 (see below)
18 9.954638 5 98
19 9.958256 9 107
should be changed to:
18 9.954638 5 98
19.1 9.958256 2 100 # ---> 2 units will be considered with previous group
19.2 9.958256 7 107 # ----> remaining 7 units will be split for next group
The code:
n <- 100
# get cumulative sum, an id column (for retrace) and current group id
x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
# Run these two lines in R to install IRanges (biocLite has been replaced by BiocManager)
install.packages("BiocManager")
BiocManager::install("IRanges")
require(IRanges)
ir1 <- successiveIRanges(x$v)
ir2 <- IRanges(seq(n, max(x$cv), by=n), width=1)
o <- findOverlaps(ir1, ir2)
# gets position where multiple of n(=100) occurs
# (where we'll have to do something about it)
pos <- queryHits(o)
# how much do the values differ from multiple of 100?
val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
# we need "pos" new rows of "pos" indices
x1 <- x[pos, ]
x1$v <- val # corresponding values
# reduce the group by 1, so that multiples of 100 will
# belong to the previous row
x1$grp <- x1$grp - 1
# subtract val in the original data x
x$v[pos] <- x$v[pos] - val
# bind and order them
x <- rbind(x1,x)
x <- x[with(x, order(id)), ]
# remove unnecessary entries
x <- x[!(duplicated(x$id) & x$v == 0), ]
x$cv <- cumsum(x$v) # updated cumsum
x$id <- NULL
require(data.table)
x.dt <- data.table(x, key="grp")
x.dt[, list(res = sum(p*v)/sum(v), cv = tail(cv, 1)), by=grp]
Running on your data:
# grp res cv
# 1: 0 10.037747 100
# 2: 1 9.994648 114
Running on @geektrader's data:
# grp res cv
# 1: 0 9.999680 100
# 2: 1 10.040139 200
# 3: 2 9.976425 300
# 4: 3 10.026622 400
# 5: 4 10.068623 500
# 6: 5 9.982733 562
Here's a benchmark on a relatively big data:
set.seed(12345)
x <- data.frame(cbind(p=rnorm(1e5, 10, 0.1), v=round(runif(1e5, 1, 10))))
require(rbenchmark)
benchmark(out <- FN1(x), replications=10)
# test replications elapsed relative user.self
# 1 out <- FN1(x) 10 13.817 1 12.586
It takes about 1.4 seconds on 1e5 rows.
If you are looking for precise bucketing, I am assuming that the value of p is the same for both parts of a "split" v,
i.e. in your example, the value of p for the 2 units of the 18th row that go into the first bucket is 9.954638.
With the above assumption, you can do the following for datasets that are not super large:
> set.seed(12345)
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> z <- unlist(mapply(function(x,y) rep(x,y), x$p, x$v, SIMPLIFY=T))
This creates a vector with each value of p repeated v times for each row; the result is combined into a single vector using unlist.
After this, the aggregation is trivial using the aggregate function:
> aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)
Group.1 x
1 0 9.999680
2 1 10.040139
3 2 9.976425
4 3 10.026622
5 4 10.068623
6 5 9.982733
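As a middle ground between the two answers, exact bucketing can also be sketched in base R without expanding p and without extra packages, by computing how many units of each row fall into each bucket. The function name `exact_buckets` and the toy data are assumptions for illustration; the overlap matrix has one cell per row/bucket combination, so this suits moderately sized data only.

```r
exact_buckets <- function(p, v, n) {
  end   <- cumsum(v)                   # cumulative units after each row
  start <- end - v                     # cumulative units before each row
  nb    <- ceiling(sum(v) / n)         # number of buckets (last may be partial)
  lo    <- (seq_len(nb) - 1) * n       # bucket lower bounds
  hi    <- lo + n                      # bucket upper bounds
  # units of row i that fall inside bucket b (rows x buckets)
  ov <- outer(end, hi, pmin) - outer(start, lo, pmax)
  ov[ov < 0] <- 0
  data.frame(bucket = seq_len(nb) - 1,
             p = colSums(ov * p) / colSums(ov),  # weighted average of p
             v = colSums(ov))
}

# toy data: 10 units in total, split into buckets of 5
res <- exact_buckets(p = c(10, 20, 30), v = c(3, 4, 3), n = 5)
res
#   bucket  p v
# 1      0 14 5
# 2      1 26 5
```

In the toy example the second row (4 units) is split 2/2 across the two buckets, which gives the same result as averaging rep(p, v) in blocks of 5.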

Resources