I'm looking to make a set of two random numbers (e.g., [1,2], [3,12]) with the first number between 1-12, and the second between 1-4. I know how to sample the two numbers independently using:
sample(1:12, 1, replace = T)
sample(1:4, 1, replace = T)
but don't know how to create a system to determine if the pairing of the two numbers has already been rolled, and if so, roll again. Any tips!?
Thanks :)
While this doesn't scale happily (in case you need large-scale simulation), you can do this:
set.seed(42)
di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size = 1)), size = 1)
c(di1, di2)
# [1] 1 2
The inner (di1) assignment draws the first number from 1:12; so far so good.
setdiff() then removes di1 from 1:4, so the second draw only has candidates that are not equal to di1.
The outer (di2) assignment samples from what is left of 1:4 (all of 1:4 when di1 is greater than 4).
While not an authoritative proof of correctness, a quick simulation shows that the two numbers within a pair never coincide:
rand <- replicate(100000, local({ di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size=1)), size = 1); c(di1, di2); }))
dim(rand)
# [1] 2 100000
any(rand[1,] == rand[2,])
# [1] FALSE
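If the goal is instead that no complete pair is rolled twice, as the question describes, one option is to remember the pairs already seen and roll again on a repeat. The following is only a minimal sketch of that idea; the function name roll_unique_pairs and its arguments are made up for illustration.

```r
# Roll pairs (first in 1:max1, second in 1:max2) and re-roll any pair
# that has already come up. Names here are hypothetical, not from the question.
roll_unique_pairs <- function(n, max1 = 12, max2 = 4) {
  stopifnot(n <= max1 * max2)          # cannot have more unique pairs than exist
  seen <- character(0)                 # keys of pairs rolled so far
  out <- matrix(NA_integer_, nrow = n, ncol = 2)
  k <- 0
  while (k < n) {
    pair <- c(sample.int(max1, 1), sample.int(max2, 1))
    key <- paste(pair, collapse = ",")
    if (!(key %in% seen)) {            # new pair: keep it
      k <- k + 1
      seen <- c(seen, key)
      out[k, ] <- pair
    }                                  # otherwise: roll again
  }
  out
}

set.seed(42)
rolls <- roll_unique_pairs(10)
rolls
```

The re-roll loop gets slower as n approaches the 48 possible pairs, because most rolls are then repeats.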
Are you looking for something like:
library(tidyverse)
expand.grid(1:12,1:4) %>%
as.data.frame() %>%
slice_sample (n = 5, replace = FALSE)
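The same idea can be sketched in base R, assuming it is acceptable to enumerate all 48 possible pairs up front: sampling row indices without replacement guarantees that no pair repeats. The column names first and second are chosen here for illustration.

```r
set.seed(42)
# Every possible pair, one per row
pairs <- expand.grid(first = 1:12, second = 1:4)
# Draw 5 distinct rows; sample() without replacement never repeats an index
rolled <- pairs[sample(nrow(pairs), 5), ]
rolled
```

Because the whole grid is held in memory, this suits small ranges like these better than very large ones.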
I have the following data frame:
library(dplyr)
set.seed(42)
df <- tibble(x = sample(seq(0, 1, 0.1), 5, replace = TRUE), y = sample(seq(0, 1, 0.1), 5, replace = TRUE), z = sample(seq(0, 1, 0.1), 5, replace = TRUE))
For each row in df, I would like to find out whether there is a row in df2 which is close to it ("neighbor") in all columns, where "close" means that it is not different by more than 0.1 in each column.
So for instance, a proper neighbor to the row (1, 0.5, 0.5) would be (0.9, 0.6, 0.4).
The second data set is
set.seed(42)
df2 <- tibble(x = sample(seq(0, 1, 0.1), 10, replace = TRUE), y = sample(seq(0, 1, 0.1), 10, replace = TRUE), z = sample(seq(0, 1, 0.1), 10, replace = TRUE))
In this case there is no "neighbor", so I'm supposed to get "FALSE" for all rows of df.
My actual data frames are much bigger than this (dozens of columns and hundreds of thousands of rows), so the naming has to be very general rather than "x", "y" and "z".
I have a sense that this can be done using mutate and funs, for example I tried this line:
df <- df %>% mutate_all(funs(close = (. <= df2(, .)+0.1) & (. >= df2(, .)-0.1))
But got an error.
Any ideas?
You can use the fuzzyjoin package:
library(fuzzyjoin)
# adding two rows that match
df2 <- rbind(df2,df[1:2,] +0.01)
df %>%
fuzzy_left_join(df2,match_fun= function(x,y) y<x+0.1 & y> x-0.1 ) %>%
mutate(found=!is.na(x.y)) %>%
select(-4:-6)
# # A tibble: 5 x 4
# x.x y.x z.x found
# <dbl> <dbl> <dbl> <lgl>
# 1 1 0.5 0.5 TRUE
# 2 1 0.8 0.7 TRUE
# 3 0.3 0.1 1 FALSE
# 4 0.9 0.7 0.2 FALSE
# 5 0.7 0.7 0.5 FALSE
Find more info here: Joining/matching data frames in R
The machine learning approach to finding a close entry in a multi-dimensional dataset is Euclidean distance.
The general approach is to normalize all the attributes. Make the range for each column the same, zero to one or negative one to one. That equalizes the effect of the columns with large and small values. When more advanced approaches are used, one would center the adjusted column values on zero. The test criterion is scaled the same way.
The next step is to calculate the distance of each observation from its neighbors. If the data set is small or computing time is cheap, calculate the distance from every observation to every other. The Euclidean distance from observation1 (row1) to observation2 (row2) is sqrt((X1 - X2)^2 + (Y1 - Y2)^2 + ...). Choose your criterion and select.
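As a small worked example of that formula, here is the distance between the two rows the question gives as proper neighbors. This is only an illustration; the variable names are made up.

```r
# The pair of "neighbor" rows from the question
row1 <- c(x = 1.0, y = 0.5, z = 0.5)
row2 <- c(x = 0.9, y = 0.6, z = 0.4)

# Euclidean distance written out: sqrt of the sum of squared differences
euclid <- sqrt(sum((row1 - row2)^2))
euclid   # sqrt(0.03), roughly 0.173

# stats::dist() computes the same quantity (Euclidean is its default method)
d <- as.numeric(dist(rbind(row1, row2)))
```

Every column differs by about 0.1, yet the Euclidean distance is about 0.173, so a cutoff on Euclidean distance is not the same test as a per-column cutoff.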
In your case, the selection criterion is simpler. Two observations are close if no attribute is more than 0.1 from the other observation. I assume that df and df2 have the same number of columns in the same order. I also assume that close observations are relatively rare. My approach is that once we discover a pair is distant, we discontinue investigating it. If you have hundreds of thousands of rows, you will likely exhaust memory if you try to calculate all the combinations at the same time.
~~~~~
You have a big problem. If your data sets df and df2 are one hundred thousand rows each, and four dozen columns, the machine needs to do 4.8e+11 comparisons. The scorecard at the end will have 1e+10 results (close or distant). I started with some subsetting to do comparisons with tearful results. R wanted matrices of the same size. The kluge I devised was unsuccessful. Therefore I regressed to the days of FORTRAN and did it with loops. With the loop approach, you could subset the problem and finish without smoking your machine.
From the sample data, I did the comparisons by hand, all 150 of them: nrow(df) * nrow(df2) * ncol(df). There were no close observations in the sample data by the definition you gave.
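The counts quoted above follow from multiplying rows by rows by columns. A quick check of the arithmetic, with variable names chosen only for this illustration:

```r
# Toy data from the question: 5 rows in df, 10 rows in df2, 3 columns
toy_comparisons <- 5 * 10 * 3          # 150

# Full-size case described above: 1e5 rows each, four dozen (48) columns
big_comparisons <- 1e5 * 1e5 * 48      # 4.8e+11 single-column comparisons
big_results     <- 1e5 * 1e5           # 1e+10 row pairs on the scorecard
```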
Here is how I intended to present the results before transferring the results to a new column in df.
dfclose <- matrix(TRUE, nrow = nrow(df), ncol = nrow(df2))
dfclose # Have a look
This matrix describes the distance from each observation in df (rows in dfclose) to each observation in df2 (columns in dfclose). If close, the entry is TRUE.
Here is the repository of the result of the distance measures:
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
dfdist # have a look; it's the same format, but with numbers
We start with the assumption that all observations in df are close to those in df2.
The score for each pair starts at zero. For each column we take the absolute difference (the one-dimensional Manhattan distance) and add one to the score if it is not less than .1. Once the score is greater than zero, the pair is no longer close. We needn't evaluate any more.
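closeCriterion <- function(origin, dest) {
# Absolute difference between the two values in a single column
manhattanDistance <- abs(origin-dest)
#print(paste("manhattanDistance =", manhattanDistance))
# Return 0 if the values are close, 1 if they are distant
if (manhattanDistance < .1) 0 else 1
}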
convertScore <- function(x) if (x>0) FALSE else TRUE
for (j in 1:ncol(df)) {
print(paste("col =",j))
for (i in 1:nrow(df)) {
print(paste("df row =",i))
for (k in 1:nrow(df2)) {
# print(paste("df2 row (and dflist column) =", k))
distantScore <- closeCriterion(df[i,j], df2[k,j])
#print(paste("df and dfdist row =", i, " df2 row (and dflist column) =", k, " distantScore = ", distantScore))
dfdist[i,k] <- dfdist[i,k] + distantScore
}
}
}
dfdist # have a look at the numerical results
dfclose <- matrix(sapply(dfdist, convertScore), ncol = nrow(df2)) # logical matrix: TRUE where the pair is close
I wanted to see what the process would look like at scale.
set.seed(42)
df <- matrix(rnorm(3000), ncol = 30)
set.seed(42)
df2 <-matrix(rnorm(5580), ncol = 30)
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
Then I ran the code block to see what would happen.
~ ~ ~
You might consider the problem definition. I ran the model several times, changing the criterion for closeness. If the entry in each of three dozen columns in df2 has a 90% chance of matching its correspondent in df, the row only has a 2.2% chance of matching. The example data is not such a good test case for the algorithm.
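The 2.2% figure can be reproduced in one line, assuming the columns match independently of one another (an assumption made here for the arithmetic, not something the data guarantees):

```r
p_column  <- 0.9    # chance that a single column matches
n_columns <- 36     # three dozen columns
# Under independence, all columns must match at once
p_row <- p_column^n_columns
p_row               # a little over 0.022
```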
Best of luck
Here's one way to calculate that column without fuzzyjoin
library(tidyverse)
found <-
expand.grid(row.df = seq(nrow(df)),
row.df2 = seq(nrow(df2))) %>%
mutate(in.range = pmap_lgl(., ~ all(abs(df[.x,] - df2[.y,]) <= 0.1))) %>%
group_by(row.df) %>%
summarise_at('in.range', any) %>%
select(in.range)
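A base R variant of the same check is sketched below. It compares one column at a time with outer() and keeps a running logical matrix of pairs that are still close in every column seen so far. The function name has_neighbor and the small eps allowance for floating-point rounding are assumptions added for this sketch.

```r
# For each row of a, is there a row of b within tol in every column?
has_neighbor <- function(a, b, tol = 0.1, eps = 1e-9) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  stopifnot(ncol(a) == ncol(b))
  # ok[i, k] stays TRUE while row i of a is close to row k of b
  ok <- matrix(TRUE, nrow = nrow(a), ncol = nrow(b))
  for (j in seq_len(ncol(a))) {
    ok <- ok & (abs(outer(a[, j], b[, j], "-")) <= tol + eps)
  }
  rowSums(ok) > 0
}

a <- rbind(c(1, 0.5, 0.5), c(0.3, 0.1, 1))
b <- rbind(c(0.9, 0.6, 0.4))
found <- has_neighbor(a, b)
found   # first row has a neighbor, second does not
```

The ok matrix has nrow(a) * nrow(b) entries, so for hundreds of thousands of rows it would have to be computed in chunks of rows rather than all at once.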
I'm trying to compute all the pairwise dissimilarities between observations in a data set consisting of only nominal variables using some self-defined dissimilarity metric.
Data looks like
set.seed(3424)
(mydata <- data.table(paste(sample(letters[1:5], 5, replace=T),
sample(LETTERS[1:5], 5, replace=T),
sep = ","),
paste(sample(LETTERS[1:5], 5, replace=T),
sample(LETTERS[1:5], 5, replace=T),
sep = ","),
paste(sample(letters[1:5], 5, replace=T),
sample(letters[1:5], 5, replace=T),
sep = ",")))
V1 V2 V3
1: a,A E,E b,b
2: e,D C,A d,d
3: d,B B,C d,d
4: c,B A,E b,d
5: a,B C,D d,a
library(data.table)
library(dplyr)
library(stringr)
metric <- function(pair){
intersection <- 0
union <- 0
for(i in 1:ncol(mydata)){
A <- pair[[1]][[i]]
B <- pair[[2]][[i]]
if(sum(is.na(A),is.na(B))==1)
union = union + 1
if(sum(is.na(A),is.na(B))==0){
intersection <- intersection + length(intersect(A,B))/length(union(A,B))
union = union + 1
}
}
1 - intersection/union
}
diss <- matrix(nrow = nrow(mydata), ncol = nrow(mydata))
for(i in 1:(nrow(mydata)-1)){
print(i) ## to check progress ##
for(j in (i+1):nrow(mydata)){
pair <- rbind(mydata[i], mydata[j])
diss[j, i] <- apply(pair, 1, function(x) strsplit(x, split=",")) %>% metric()
}
}
These loops work, but they are really slow when mydata has 1000+ rows and 100+ columns.
The metric I used here is the Jaccard index, but a nested version, since each element in the data is not a single value. So instead of treating each pair of levels as either a match (0) or different (1), I use Jaccard when comparing levels as well.
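To make the nested metric concrete, here is a worked example on rows 1 and 5 of the printed toy data. It is a simplified sketch that ignores the NA handling in metric(); the function name nested_jaccard is made up for illustration.

```r
# Nested Jaccard dissimilarity between two rows of comma-separated cells.
# Simplified: assumes no NA cells (the original metric() handles NAs).
nested_jaccard <- function(row_a, row_b) {
  cells_a <- strsplit(row_a, ",")
  cells_b <- strsplit(row_b, ",")
  # Jaccard index within each column: |A intersect B| / |A union B|
  per_cell <- mapply(function(a, b) {
    length(intersect(a, b)) / length(union(a, b))
  }, cells_a, cells_b)
  # Average over columns, then turn similarity into dissimilarity
  1 - mean(per_cell)
}

row1 <- c("a,A", "E,E", "b,b")   # row 1 of the printed toy data
row5 <- c("a,B", "C,D", "d,a")   # row 5 of the printed toy data
nested_jaccard(row1, row5)       # 1 - (1/3 + 0 + 0)/3, about 0.889
```

Only the first column shares an element ("a" out of the three distinct values "a", "A", "B"), which gives 1/3 for that cell and 0 for the other two.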
Update:
Some context about my data, not the toy data I made up.
Each row represents a query, i.e. "SELECT ... FROM ... WHERE ...
...".
Each column contains part of the information in the query, i.e. 1st column contains everything between "SELECT" and "FROM", 2nd column contains what's between "FROM" and "WHERE", etc.
There are 100 columns and 400 rows; I don't know why there are so many columns, though.
Number of elements in one cell could be really arbitrary, some cells contain very long lists of values, while many are actually NAs. E.g.
SELECT
1: NA
2:p1.PLAYERID,f1.PLAYERNAME,p2.PLAYERID,f2.PLAYERNAME
3: PLAYER f1,PLAYER f2,PLAYS p1
4: NA
5: NA
6: c1.table_name t1,c2.table_name t2
7: NA
...
400: asd,vrht,yuetr,wxeq,yiknuy,sce,ercher
You can gain some speed pretty easily by doing less work. If you are only interested in pairwise comparisons, you only need to do N choose 2 comparisons, instead of N^2. You can implement that with f2() below.
set.seed(3424)
(mydata <- data.table(sample(letters[1:5], 50, replace = T),
sample(LETTERS[1:5], 50, replace = T),
sample(1:3, 50, replace = T)))
mydf<-data.frame(mydata)
f1<- function(){
diss <- matrix(nrow = nrow(mydata), ncol = nrow(mydata))
for(i in 1:(nrow(mydata)-1)){
print(i) ## to check progress ##
for(j in (i+1):nrow(mydata)){
pair <- rbind(mydata[i], mydata[j])
diss[j, i] <- apply(pair, 1, function(x) strsplit(x, split=",")) %>% metric()
}
}
return(diss)
}
f2<-function(){
met<-NULL
A<-NULL
B<-NULL
choices<-choose(nrow(mydf),2)
combs<-combn(nrow(mydf),2)
for(i in 1:choices) {
print(i)
pair<-rbind(mydf[combs[1,i],], mydf[combs[2,i],])
met[i]<- apply(pair, 1, function(x) strsplit(x, split=",")) %>% metric()
A[i]<-mydf[combs[1,i],1]
B[i]<-mydf[combs[2,i],2]
}
results<-data.frame(A,B, met)
return(results)
}
library(microbenchmark)
microbenchmark(f1(), f2(), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
f1() 1381 1391.2 1416.8 1417.6 1434.9 1456 10
f2() 907 923.6 942.3 946.9 948.9 1008 10
It is a little faster, but not mind-blowingly so. My guess is that some more work needs to be done on the metric function you define. I tried to look at it and determine a way to vectorize it, but I could not find a way. If that can be done this problem would be trivial. For example, I have a similar program that measures pairwise cosine similarity between ~400 vectors of length ~5000. It has to make 400 choose 2 = 79800 comparisons and the entire program takes about 6 seconds to run.
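For comparison, this is roughly what a vectorized pairwise metric looks like in the cosine case: one matrix product replaces the loop over pairs. It is a sketch on random data, not the program mentioned above, and all names are illustrative.

```r
set.seed(1)
# 20 vectors of length 50, one per row
m <- matrix(rnorm(20 * 50), nrow = 20)

# Length (Euclidean norm) of each row
norms <- sqrt(rowSums(m^2))

# tcrossprod(m) holds every pairwise dot product; dividing by the outer
# product of the norms turns them into cosine similarities
cos_sim <- tcrossprod(m) / outer(norms, norms)
dim(cos_sim)   # 20 x 20, with ones on the diagonal
```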
It's similar to the original, but I made a few changes. It runs more quickly, but I didn't bother timing it. 1000 rows with this code seems about like 100 rows with the original.
The main changes:
remove rbind by passing in variables to mapply
calculate union variable instead of adding every time (union <- union + 1)
split strings all at once outside of loops
check length intersection before calculating union and adding intersection (lenint > 0)
Hopefully something helps your case.
rownum <- 1000
(mydata <- data.table(paste(sample(letters[1:5], rownum, replace=T),
sample(LETTERS[1:5], rownum, replace=T),
sep = ","),
paste(sample(LETTERS[1:5], rownum, replace=T),
sample(LETTERS[1:5], rownum, replace=T),
sep = ","),
paste(sample(letters[1:5], rownum, replace=T),
sample(letters[1:5], rownum, replace=T),
sep = ",")))
allsplit <- lapply(mydata,strsplit,split = ',')
allsplitdf <- cbind(allsplit[['V1']],allsplit[['V2']],allsplit[['V3']])
allsplitlist <- split(allsplitdf,1:nrow(allsplitdf))
metric2 <- function(p1,p2){
# Initialise once, before the loop, so contributions from all columns accumulate
intersection <- 0
for(i in seq_along(p1)){
A <- p1[[i]]
B <- p2[[i]]
if(!any(is.na(A),is.na(B))){
lenint <- length(intersect(A,B))
if(lenint > 0){
intersection <- intersection + lenint/length(union(A,B))
}
}
}
1 - intersection/length(p1)
}
diss <- matrix(nrow = nrow(mydata), ncol = nrow(mydata))
for(i in 1:(nrow(mydata)-1)){
print(i) ## to check progress ##
for(j in (i+1):nrow(mydata)){
diss[j, i] <- mapply(metric2,p1 = allsplitlist[i],p2 = allsplitlist[j])
}
}
When constructing an algorithm it is important to keep in mind the speed/space trade off. What I mean by the speed/space trade off is that by storing your data within a different schema you can usually eliminate for loops. However, data stored within this new schema will generally occupy more space.
The reason your example is slow is because, among other things, you are looping over all the rows and the columns of your data. With a 1000x100 data.frame that is 1e5 computations. One way to eliminate the loop over your rows is to store your data a bit differently. For example, I use the expand.grid command to combine all pairwise comparisons within the same data.frame, dTMP. I then strip the comma and allow each member of the pair to occupy its own column (i.e. "a,A", which is originally contained in one variable, is now "a" and "A", which are entries in two separate variables). In general, reshaping data into different formats is quick, or at least quicker than looping over each row. This reshaping clearly, however, generates a data set which takes up more RAM. In your case the data.frame will be 1e6x4, which is very large, but not so large as to clog up all your RAM.
The reward for doing all that hard work is that now it is trivial and extremely fast to obtain the intersect and union variables. You will of course still need to loop over each column; however, we've eliminated one loop by simply rearranging your data. It is possible to remove the loop over the columns by utilizing 3D arrays; however, such an array would not fit into memory.
f3 <- function(){
intersection <- 0
for(v in names(mydata)){
dTMP <- expand.grid(mydata[[v]], mydata[[v]], stringsAsFactors = FALSE)[,c(2,1)]
#There is likely a more elegant way to do this.
dTMP <-
dTMP$Var2 %>%
str_split(., ",") %>%
unlist(.) %>%
matrix(., ncol = 2, nrow = nrow(dTMP), byrow = TRUE) %>%
cbind(., dTMP$Var1%>%
str_split(., ",") %>%
unlist(.) %>%
matrix(., ncol = 2, nrow = nrow(dTMP), byrow = TRUE)) %>%
as.data.frame(., stringsAsFactors = FALSE)
names(dTMP) <- c("v1", "v2", "v3", "v4")
intersect <- rowSums(dTMP[,c("v1", "v2")] == dTMP[,c("v3", "v4")])
intersect <- ifelse(rowSums(dTMP[,c("v1", "v2")] == dTMP[,c("v4", "v3")]) !=0, rowSums(dTMP[,c("v1", "v2")] == dTMP[,c("v4", "v3")]), intersect)
intersect <- ifelse(dTMP[, "v1"] == dTMP[, "v2"], 1, intersect)
MYunion <- sapply(as.data.frame(t(dTMP)), function(x) n_distinct(x))
intersection <- intersection + intersect/MYunion
}
union <- ncol(mydata)
return(matrix(1 - intersection/union, nrow = nrow(mydata), ncol = nrow(mydata), byrow = TRUE)) #This is the diss matrix, I think. Double check that I got the rows and columns correct
}
Update
I'm still having trouble replicating your results; however, I believe the newly updated code is very close. There is only one cell (2,1) of the dissimilarity matrix in which our results differ when set.seed(3424) is used. The problem with the current iteration, however, is that I need to use sapply to obtain MYunion. If you can think of a faster way to do this, you'll get big speed gains. Read this SO post for suggestions: Efficient Means of Identifying Number of Distinct Elements in a Row
I have a data.frame with 30 rows and many columns (1000+), but I need to average every 16 columns together. For example, the data frame will look like this (I truncate it to make it easier..):
Col1 Col2 Col3 Col4........
4.176 4.505 4.048 4.489
6.167 6.184 6.359 6.444
5.829 5.739 5.961 5.764
.
.
.
Therefore, I cannot aggregate (I do not have a list) and I tried:
a <- data.frame(rowMeans(my.df[,1:length(my.df)]) )
which gives me the average of all 1000+ columns. But is there any way to say I want to do that every 16 columns until the end? (The total number of columns is a multiple of 16.)
A secondary, less important point but would be useful to solve this as well.
The col names are in the following structure:
XXYY4ZZZ.txt
Once averaged the columns all I need is a new col name with only XXYY as the rest will be averaged out. I know I could use gsub but is there an optimal way to do the averaging and this operation in one go?
I am still relatively new to R and therefore I am not sure where and how to find the answer.
Here is an example adapted from @ben's question and @TylerRinker's answer from apply a function over groups of columns. It should be able to apply any function over a matrix or data frame by intervals of columns.
# Create sample data for reproducible example
n <- 1000
set.seed(1234)
x <- matrix(runif(30 * n), ncol = n)
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nc <- ncol(x)
split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
} else # 'by' is a vector of groups
{
nc <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nc), split.index)
# Pass index list to fun using sapply() and return object
sapply(index.list, function(i)
{
do.call(fun, list(x[, i, drop = FALSE], ...)) # drop = FALSE keeps a one-column group as a matrix
})
}
# Run function
y <- byapply(x, 16, rowMeans)
# Test to make sure it returns expected result
y.test <- rowMeans(x[, 17:32])
all.equal(y[, 2], y.test)
# TRUE
You can do other odd things with it. For example, if you needed to know the total sum of every 10 columns, being sure to remove NAs if present:
y.sums <- byapply(x, 10, sum, na.rm = T)
y.sums[1]
# 146.7756
sum(x[, 1:10], na.rm = T)
# 146.7756
Or find the standard deviations:
byapply(x, 10, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(x, rep(1:10, each = 10), rowMeans)
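This works for me on a much smaller data frame, although note that it averages only every 16th column (1, 17, 33, ...) into a single mean per row, rather than averaging each block of 16: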
rowMeans(my.df[,seq(1,length(my.df),by=16)])
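To get one mean per block of 16 columns, and to name each result by the first four characters of the column names, something along these lines may work. It is a sketch on made-up data: the column names below are hypothetical stand-ins for the XXYY4ZZZ.txt pattern, and it assumes each block of 16 columns shares one XXYY prefix.

```r
set.seed(1)
# Made-up data: 30 rows, 64 columns (4 blocks of 16)
my.df <- as.data.frame(matrix(runif(30 * 64), nrow = 30))

# Hypothetical names following the XXYY4ZZZ.txt pattern
prefixes <- c("AA01", "AA02", "BB01", "BB02")
names(my.df) <- paste0(rep(prefixes, each = 16), "4", sprintf("%03d", 1:16), ".txt")

# Block id for every column: 0 for columns 1-16, 1 for 17-32, and so on
block <- (seq_along(my.df) - 1) %/% 16

# Row means within each block; drop = FALSE keeps the subset a data frame
block.means <- sapply(split(seq_along(my.df), block), function(i) {
  rowMeans(my.df[, i, drop = FALSE])
})

# Name each result by the XXYY part of the first column in its block
colnames(block.means) <- substr(names(my.df)[!duplicated(block)], 1, 4)
head(block.means, 3)
```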
I have a data frame df with 2 variables A and B. I would like to split A in groups 1 and 2 so that mean(df$B[df$group==1]) as close as possible to mean(df$B[df$group==2])
Or just to express it otherwise, what I would like is to find a cut point (cutp) in df$A that would minimize the abs(mean(df$B[df$A<cutp])-mean(df$B[df$A>=cutp]))
Any ideas?
If you want to find a threshold on variable A that splits the data into two groups so that the means of B in those two groups are similar, you can compute these means for all possible cut-points and check where the distance between those means is minimal.
# Sample data
n <- 10
d <- data.frame(
A = rnorm(n),
B = rnorm(n)
)
# The quantity to minimize
# (You can use a loop instead of apply.)
d$differences <- apply(
d, 1,
# Compute the difference of the means for each value of A
function (u) {
i <- d$A <= u[1];
abs( mean( d$B[which(i)]) - mean(d$B[which(!i)] ) )
}
)
# The mean of an empty vector is NaN: discard those values
d$differences[ ! is.finite( d$differences ) ] <- Inf
# Take the minimum
threshold <- d$A[ which.min( d$differences ) ]
# Build the groups
d$group <- ifelse( d$A <= threshold, "group 1", "group 2" )
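The same search can be sketched without apply() by sorting on A and using cumulative sums of B, so each candidate cut costs constant time. This assumes the values of A are distinct; the function name best_cut is made up for illustration.

```r
# Find the value of A such that splitting at A <= threshold makes the two
# group means of B as close as possible. Assumes distinct values in A.
best_cut <- function(A, B) {
  o <- order(A)
  a <- A[o]
  b <- B[o]
  n <- length(b)
  k <- seq_len(n - 1)                       # size of the left group; both groups non-empty
  left  <- cumsum(b)[k] / k                 # mean of B for the k smallest A
  right <- (sum(b) - cumsum(b)[k]) / (n - k)  # mean of B for the rest
  diffs <- abs(left - right)
  best <- which.min(diffs)                  # first minimum if there are ties
  list(threshold = a[best], difference = diffs[best])
}

res <- best_cut(A = c(1, 2, 3, 4), B = c(1, 3, 2, 2))
res$threshold    # 2: groups {1, 3} and {2, 2} both have mean 2
res$difference   # 0
```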
I'm still not sure how column A factors into it. It seems you want to create a new column that has two levels which create ~= mean values for column B. Column A is obviously associated with the new column created, but does not directly factor into the calculation needed. Am I missing something?
Regardless, here's a start (note this can be made much more robust, but proof of concept should work). Define a tolerance that you find acceptable and then set up a while loop to create new groups until the condition is met, i.e.
FUN <- function(tol){
df$groups <- sample(1:2, nrow(df), TRUE)
while(abs(mean(df$B[df$groups == 1]) - mean(df$B[df$groups == 2])) > tol) {
df$groups <- sample(1:2, nrow(df), TRUE)
}
return(df)
}
set.seed(101)
df <- data.frame(A=runif(20),B=runif(20))
#Test it. Means should be less than .02 different and have roughly equivalent sample sizes.
set.seed(101)
out <- FUN(.02)
library(plyr)
> ddply(out, "groups", summarize, n = length(B), mean = mean(B))
groups n mean
1 1 11 0.5229024
2 2 9 0.5037279
I should note that you could create a runaway function if you set tol super low so don't blame me if your computer crashes.
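One way to guard against the runaway case is to cap the number of attempts. The variant below is a sketch of that idea; the function name balanced_split and the max_tries argument are additions made here for illustration, not part of FUN above.

```r
# Randomly assign groups until the means of B differ by at most tol,
# giving up after max_tries attempts instead of looping forever.
balanced_split <- function(df, tol, max_tries = 10000) {
  for (attempt in seq_len(max_tries)) {
    groups <- sample(1:2, nrow(df), replace = TRUE)
    m1 <- mean(df$B[groups == 1])
    m2 <- mean(df$B[groups == 2])
    # is.finite() skips draws where one group is empty (mean is NaN)
    if (is.finite(m1) && is.finite(m2) && abs(m1 - m2) <= tol) {
      df$groups <- groups
      return(df)
    }
  }
  stop("no split within tolerance after ", max_tries, " tries")
}

set.seed(101)
df <- data.frame(A = runif(20), B = runif(20))
out <- balanced_split(df, tol = 0.02)
abs(mean(out$B[out$groups == 1]) - mean(out$B[out$groups == 2]))
```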