I have a dataset with 50 observations on 10 variables, and I would like to apply the following function over all variable permutations.
new_fun <- function(data, x, y) {
x <- data[, x, drop = FALSE]  # keep as one-column data frames so colnames() works below
y <- data[, y, drop = FALSE]
value <- (x - y) / (x + y)
colnames(value) <- paste(names(x), "/", names(y), sep = "")
return(value)
}
Here is part of the dataset:
var1 var2 var3 var4 var5 var6 var7 var8 var9 var10
1268 1522 1268 1842 4728 5611 5544 2374 1535 5773
1286 1534 1259 1829 4834 5802 5776 2383 1538 5928
1296 1534 1266 1853 4905 5805 5916 2418 1545 5949
1296 1488 1239 1791 4963 5985 5880 2359 1524 6142
1273 1503 1228 1787 4694 5608 5608 2268 1476 5725
1290 1522 1271 1811 4799 5728 5752 2402 1555 5832
1265 1510 1247 1786 4981 6072 6172 2409 1526 6258
1289 1527 1246 1841 4876 5827 5808 2361 1522 6009
1322 1590 1351 1917 4532 5271 5264 2412 1589 5418
1334 1589 1445 1899 3680 4638 4820 2321 1638 4974
1347 1532 1370 1865 3618 4702 4852 2275 1619 4994
The idea is to have a new dataset with 50 observations on 90 columns (n=10, r=2, no repeats).
var1/var2 var1/var3 var1/var4 ...
1 . . . ...
2 . . . ...
3 . . . ...
. . . . ...
. . . . ...
. . . . ...
I have tried apply functions and loops with no success so far. Any help is greatly appreciated!
You can do this using the tidyverse and the purrr package:
library(tidyverse)
# the data you provided
varst <- as.data.frame(read_csv("var1,var2,var3,var4,var5,var6,var7,var8,var9,var10
1268,1522,1268,1842,4728,5611,5544,2374,1535,5773
1286,1534,1259,1829,4834,5802,5776,2383,1538,5928
1296,1534,1266,1853,4905,5805,5916,2418,1545,5949
1296,1488,1239,1791,4963,5985,5880,2359,1524,6142
1273,1503,1228,1787,4694,5608,5608,2268,1476,5725
1290,1522,1271,1811,4799,5728,5752,2402,1555,5832
1265,1510,1247,1786,4981,6072,6172,2409,1526,6258
1289,1527,1246,1841,4876,5827,5808,2361,1522,6009
1322,1590,1351,1917,4532,5271,5264,2412,1589,5418
1334,1589,1445,1899,3680,4638,4820,2321,1638,4974
1347,1532,1370,1865,3618,4702,4852,2275,1619,4994"))
map_dfc(names(varst), # cycle through each column as x
        function(x) {
          # pair x with every other column
          map(setdiff(names(varst), x),
              function(y) { # your function as above
                v_x <- varst[x]
                v_y <- varst[y]
                ret <- (v_x - v_y) / (v_x + v_y)
                names(ret) <- paste0(x, "/", y)
                ret # return the calculated values
              })
        })
To stop repeats
To stop comparing var2/var1 when var1/var2 has already been computed, and so on, do the following:
# if you need to put the column names in order, use this
# you'll also need two-digit numbers in the names, e.g. var01, var07, var10
# varst <- varst %>% select(order(colnames(varst)))
map_dfc(names(varst)[-length(names(varst))], # map each column except the final one as x
        function(x) {
          # pair x only with the columns that come after it
          map(names(varst)[(match(x, names(varst)) + 1):length(names(varst))],
              function(y) { # your function as above
                if (!is.na(y)) {
                  v_x <- varst[x]
                  v_y <- varst[y]
                  ret <- (v_x - v_y) / (v_x + v_y)
                  names(ret) <- paste0(x, "/", y)
                  ret # return the calculated values
                }
              })
        })
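If you prefer base R, here is a compact sketch using combn() (with varst as built above); it produces the same 45 no-repeat ratio columns:
pairs <- combn(names(varst), 2) # 2 x 45 matrix of unordered column-name pairs
ratios <- as.data.frame(apply(pairs, 2, function(p) {
  (varst[[p[1]]] - varst[[p[2]]]) / (varst[[p[1]]] + varst[[p[2]]])
}))
names(ratios) <- paste(pairs[1, ], pairs[2, ], sep = "/")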
I have the following data frame, and I'd like to replace the NAs with the first product_ids in top_products that don't already appear in that row. To give some context, these are product recommendations.
Although I have some experience with plyr and sapply, I'm struggling to figure out the right way to make this happen.
I think the code below speaks for itself.
> head(recs_with_na)
V1 V2 V3 V4
148 1227 1213 <NA> <NA>
249 1169 1221 <NA> <NA>
553 1227 1162 <NA> <NA>
732 1227 1162 <NA> <NA>
765 1227 1162 <NA> <NA>
776 1227 1162 <NA> <NA>
> top_products
product_id count
21 1162 7917
65 1213 4839
19 1160 4799
11 1152 3543
34 1175 3423
75 1227 2719
2 1143 2396
13 1154 2168
> fill_nas_with_top <- function(data, top_products) {
+ top_products_copy <- top_products
+ mydata <- data
+ #mydata <- as.data.frame(data)
+ for (i in 1:4) {
+ if (is.na(mydata[,i])) {
+ mydata[,i] <- top_products_copy[1,1]
+ top_products_copy <- top_products_copy[-1,]
+
+ }
+ else {
+ top_products_copy <- top_products_copy[top_products_copy[,1] != mydata[,i],]
+ }
+ }
+ return(mydata)
+ }
> sapply(recs_with_na, fill_nas_with_top, top_products)
Error in `[.default`(mydata, , i) : incorrect number of dimensions
R uses pass-by-value semantics (copy-on-modify): your function gets copies of data and top_products each time it is called, so there is no need for you to make defensive copies. (Incidentally, the error you saw arises because sapply walks over the columns of recs_with_na and passes each one as a plain vector, on which mydata[, i] has too many dimensions; the row-wise apply call below avoids that.)
Because pass-by-value means creating copies (and for many other reasons too), it is good practice to give your functions the smallest possible amount of information they need to accomplish their task. In this case, you don't need to pass the whole top_products data frame; a vector of product_ids will do.
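As a minimal sketch of the copy-on-modify behavior:
f <- function(v) { v[1] <- 99; v } # modifies only its local copy
x <- c(1, 2, 3)
f(x) # [1] 99  2  3
x    # [1] 1 2 3 -- the caller's vector is untouched
With that in mind, here is a simplified version of your function: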
fill_nas_with_top <- function(data, top) {
  for (i in 1:4) {
    d <- data[i]
    if (is.na(d)) {
      ## Find the first value of top not already present in the row
      for (t in top) {
        top <- top[-1] # consume t (the loop still iterates over the original vector)
        if (!t %in% data) {
          data[i] <- t
          break
        }
      }
    } else {
      ## Drop an already-used product_id from top; this no longer assumes
      ## that product_ids in top are ordered as in data
      if (d %in% top) top <- top[-which(d == top)]
    }
  }
  return(data)
}
Called like this (note that we pass top_products[, 1], a vector of product_ids, rather than the whole data frame):
as.data.frame(t(apply(recs_with_na, 1, fill_nas_with_top, top_products[,1])))
will produce:
V1 V2 V3 V4
1 1227 1213 1162 1160
2 1169 1221 1162 1213
3 1227 1162 1213 1160
4 1227 1162 1213 1160
5 1227 1162 1213 1160
6 1227 1162 1213 1160
I have a numeric matrix from which I want to retrieve the indices of specific values.
I am trying the which() function to find values in the matrix.
The problem is that some values are found and some are not.
My matrix is as follows:
x_lat <- as.double(seq(48.0 ,60.0, by=0.1))
y_long <- as.double(seq(-10.0 ,2.0, by=0.1))
xv <- as.double(rep(x_lat,each = 121))
yv <- as.double(rep(y_long, 121))
vMatrix <- as.matrix(cbind(xv,yv))
If I want to retrieve the indices where the value -2.3 appears, the function correctly returns a vector of those indices.
xx<- which(vMatrix==-2.3,arr.ind=TRUE)
> xx
[1] 78 199 320 441 562 683 804 925 1046 1167 1288 1409 1530 1651 1772 1893 2014 2135 2256 2377 2498
[22] 2619 2740 2861 2982 3103 3224 3345 3466 3587 3708 3829 3950 4071 4192 4313 4434 4555 4676 4797 4918 5039
[43] 5160 5281 5402 5523 5644 5765 5886 6007 6128 6249 6370 6491 6612 6733 6854 6975 7096 7217 7338 7459 7580
[64] 7701 7822 7943 8064 8185 8306 8427 8548 8669 8790 8911 9032 9153 9274 9395 9516 9637 9758 9879 10000 10121
[85] 10242 10363 10484 10605 10726 10847 10968 11089 11210 11331 11452 11573 11694 11815 11936 12057 12178 12299 12420 12541 12662
[106] 12783 12904 13025 13146 13267 13388 13509 13630 13751 13872 13993 14114 14235 14356 14477 14598
But for some numbers (that appear in the matrix) the function does not work, e.g.,
xx<- which(vMatrix==-2.2,arr.ind=TRUE)
> xx
integer(0)
Floating-point numbers can be misleading. Two such numbers are often not "equal" even though the console displays the same output, because the machine can only represent numbers with a certain finite accuracy.
Here's a simple example:
a <- 0.15 - 1/8
b <- 0.025
> a
[1] 0.025
> b
[1] 0.025
However, if we compare these numbers with "==", we obtain:
> a==b
[1] FALSE
That is because floating-point arithmetic introduces tiny differences that lie beyond the accuracy the console displays:
> a-b
[1] -6.938894e-18
You can probably resolve the issue by simply rounding the numbers in the matrix to the necessary number of relevant digits, e.g.,
xx<- which(round(vMatrix,3)==-2.2,arr.ind=TRUE)
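Alternatively (a sketch along the same lines), you can compare against the target with an explicit tolerance instead of rounding:
tol <- 1e-9 # larger than the rounding error, far smaller than the 0.1 grid step
xx <- which(abs(vMatrix - (-2.2)) < tol, arr.ind = TRUE)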
I am trying to obtain a vector which contains, for each bin, a sum over the elements that fit a condition.
values = runif(5000)
bin = seq(0, 0.9, by = 0.1)
sum(values < bin)
I expected sum to return 10 values: for each element of "bin", a sum over the "values" elements that fit the "<" condition.
However, it returns only one value.
How can I achieve the result without using a while loop?
I understand this to mean that you want, for each value in bin, the number of elements in values that are less than that value. So I think you want vapply() here:
vapply(bin, function(x) sum(values < x), 1L) # 1L is the FUN.VALUE template: each call returns a single integer
# [1] 0 497 1025 1501 1981 2461 2955 3446 3981 4526
If you want a little table for reference, you could add names
v <- vapply(bin, function(x) sum(values < x), 1L)
setNames(v, bin)
# 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
# 0 497 1025 1501 1981 2461 2955 3446 3981 4526
I personally prefer data.table over tapply or vapply, and findInterval over cut.
set.seed(1)
values <- runif(5000) # regenerate under the seed so the output below is reproducible
library(data.table)
dt <- data.table(values, groups = findInterval(values, bin))
setkey(dt, groups)
dt[, .(n = .N, v = sum(values)), groups][, list(cumsum(n), cumsum(v))]
# V1 V2
# 1: 537 26.43445
# 2: 1041 101.55686
# 3: 1537 226.12625
# 4: 2059 410.41487
# 5: 2564 637.18782
# 6: 3050 904.65876
# 7: 3473 1180.53342
# 8: 3951 1540.18559
# 9: 4464 1976.23067
#10: 5000 2485.44920
cbind(vapply(bin, function(x) sum(values < x), 1L)[-1],
cumsum(tapply( values, cut(values, bin), sum)))
# [,1] [,2]
#(0,0.1] 537 26.43445
#(0.1,0.2] 1041 101.55686
#(0.2,0.3] 1537 226.12625
#(0.3,0.4] 2059 410.41487
#(0.4,0.5] 2564 637.18782
#(0.5,0.6] 3050 904.65876
#(0.6,0.7] 3473 1180.53342
#(0.7,0.8] 3951 1540.18559
#(0.8,0.9] 4464 1976.23067
Using tapply with a cut()-constructed INDEX vector seems to deliver:
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.43052 71.06897 129.99698 167.56887 222.74620 277.16395
(0.6,0.7] (0.7,0.8] (0.8,0.9]
332.18292 368.49341 435.01104
Although I'm guessing you would want the cut-vector to extend to 1.0:
bin = seq(0, 1, by = 0.1)
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.48087 69.87902 129.37348 169.46013 224.81064 282.22455
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
335.43991 371.60885 425.66550 463.37312
I see that I understood the question differently than Richard did. If you want his result, you can apply cumsum to mine.
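For instance, a one-line sketch using the same values and bin as above:
cumsum(tapply(values, cut(values, bin), sum)) # running totals across the bins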
Using dplyr:
set.seed(1)
library(dplyr)
values <- runif(5000) # regenerate under the seed so the output below is reproducible
df <- data.frame(values, groups = cut(values, bin)) # the grouped data frame summarised below
df %>% group_by(groups) %>%
  summarise(count = n(), sum = sum(values)) %>%
  mutate(cumcount = cumsum(count), cumsum = cumsum(sum))
Output (values above 0.9 fall outside the bin breaks, so cut() assigns them to the NA group):
groups count sum cumcount cumsum
1 (0,0.1] 537 26.43445 537 26.43445
2 (0.1,0.2] 504 75.12241 1041 101.55686
3 (0.2,0.3] 496 124.56939 1537 226.12625
4 (0.3,0.4] 522 184.28862 2059 410.41487
5 (0.4,0.5] 505 226.77295 2564 637.18782
6 (0.5,0.6] 486 267.47094 3050 904.65876
7 (0.6,0.7] 423 275.87466 3473 1180.53342
8 (0.7,0.8] 478 359.65217 3951 1540.18559
9 (0.8,0.9] 513 436.04508 4464 1976.23067
10 NA 536 509.21853 5000 2485.44920
I have binned my data using the cut function
breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))
My split data frame looks like:
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins look like:
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
When I take the mean of the bin counts, I get on average ~900 samples:
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin contains on average 900 samples (e.g. (0, 27] = 900, (27,28.5] = 900, and so on). I found something similar here, but it deals with only one variable, not the whole data frame.
I also tried the Hmisc package, but unfortunately the bins don't contain equal frequencies!!
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal-sized buckets (based on your seq statement), you can use something like:
df <- data.frame(var=runif(500, 0, 100)) # make data
cut.vec <- cut(
  df$var,
  breaks = quantile(df$var, 0:50/50), # breaks along 1/50 quantiles
  include.lowest = TRUE
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
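For example (a sketch; g gives the number of quantile groups):
library(Hmisc)
df.split2 <- split(df, cut2(df$var, g = 50)) # ~10 rows per bucket for the 500-row df above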
This can be done with the function provided here by Joris Meys:
EqualFreq2 <- function(x, n) {
  nx <- length(x)
  nrepl <- floor(nx / n) # base size of each group
  nplus <- sample(1:n, nx - nrepl * n) # groups that receive one extra element
  nrep <- rep(nrepl, n)
  nrep[nplus] <- nrepl + 1
  x[order(x)] <- rep(seq.int(n), nrep) # assign group ids 1..n along the sorted values
  x
}
data<-split(df2, EqualFreq2(df2$val, 25))
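A quick sanity check on synthetic data (a sketch, since df2 isn't shown) confirms that the groups come out equal-sized:
table(EqualFreq2(runif(1000), 25)) # every group should contain 40 values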
I have a list of lists that contain the following 2 variables:
> dist_sub[[1]]$zip
[1] 901 902 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
[26] 929 930 931 933 934 935 936 937 938 939 940 955 961 962 963 965 966 968 969 970 975 981
> dist_sub[[1]]$hu
[1] 4990 NA 168 13224 NA 3805 NA 6096 3884 4065 NA 16538 NA 12348 10850 NA
[17] 9322 17728 NA 13969 24971 5413 47317 7893 NA NA NA NA NA 140 NA 4
[33] NA NA NA NA NA 13394 8939 NA 3848 7894 2228 17775 NA NA NA
> dist_sub[[2]]$zip
[1] 921 934 952 956 957 958 959 960 961 962 965 966 968 969 970 971
> dist_sub[[2]]$hu
[1] 17728 140 4169 32550 18275 NA 22445 0 13394 8939 3848 7894 2228 17775 NA 12895
Is there a way to remove duplicates such that, if a zip code appears in more than one list, it is removed from all but one of them according to a specific criterion?
Example: zip code 00921 is present in the two lists above. I'd like to keep it only in the list with the lowest sum of hu (housing units). In this case I would keep zip code 00921 in the 2nd list only, since the sum of hu is 162,280 in list 2 versus 256,803 in list 1.
Any help is very much appreciated.
Here is a simulated dataset for your problem so that others can use it too:
dist_sub <- list(list("zip"=1:10,
"hu"=rnorm(10)),
list("zip"=8:12,
"hu"=rnorm(5)),
list("zip"=c(1, 3, 11, 7),
"hu"=rnorm(4))
)
Here's a solution that I was able to come up with. I realized that loops were really the cleaner way to do this:
do.this <- function(x) {
  for (k in 1:(length(x) - 1)) {
    for (l in (k + 1):length(x)) {
      to.remove <- which(x[[k]][["zip"]] %in% x[[l]][["zip"]])
      if (length(to.remove) > 0) { # guard: x[-integer(0)] would drop every element
        x[[k]][["zip"]] <- x[[k]][["zip"]][-to.remove]
        x[[k]][["hu"]] <- x[[k]][["hu"]][-to.remove]
      }
    }
  }
  return(x)
}
The idea is really simple: for each set of zips we keep removing the elements that are repeated in any set after it. We only need to go up to the penultimate set, because by then the last set shares no elements with any set before it.
The criterion you have, i.e. lowest sum of hu, can then be implemented easily using the function above. What you need to do is reorder the list dist_sub by sum of hu, like so:
sum_hu <- sapply(dist_sub, function (k) sum(k[["hu"]], na.rm=TRUE))
dist_sub <- dist_sub[order(sum_hu, decreasing=TRUE)]
Now you have dist_sub sorted by sum_hu, which means that for each set, the sets that come before it have a larger sum_hu. Therefore, if the sets at positions i and j (i < j) have a value a in common, then a should be removed from the ith set. That is exactly what the function above does. Do you think that makes sense?
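Putting it together, after the reordering you just call the function:
dist_sub <- do.this(dist_sub) # each zip now survives only in its lowest-sum-of-hu list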
PS: I've called the function do.this because I usually like writing generic solutions, while this was a very specific question, albeit an interesting one.