I have a series of values that contains runs of values that are close to each other, as in the data below. I have labelled the values in V1 with a group number in V2; roughly where the label changes, the range of the values changes. All the values labelled 1 in V2 are within 20 points of each other, as are all the values labelled 2, all those labelled 3, and so on. The values within a group are not identical (they are all different); instead, they cluster around a common value.
I identified these clusters manually. How could I automate it?
V1 V2
1 399.710 1
2 403.075 1
3 405.766 1
4 407.112 1
5 408.458 1
6 409.131 1
7 410.477 1
8 411.150 1
9 412.495 1
10 332.419 2
11 330.400 2
12 329.054 2
13 327.708 2
14 326.363 2
15 325.017 2
16 322.998 2
17 319.633 2
18 314.923 2
19 288.680 3
20 285.315 3
21 283.969 3
22 281.950 3
23 279.932 3
24 276.567 3
25 273.875 3
26 272.530 3
27 271.857 3
28 272.530 3
29 273.875 3
30 274.548 3
31 275.894 3
32 275.894 3
33 276.567 3
34 277.240 3
35 278.586 3
36 279.932 3
37 281.950 3
38 284.642 3
39 288.007 3
40 291.371 3
41 294.063 4
42 295.409 4
43 296.754 4
44 297.427 4
45 298.100 4
46 299.446 4
47 300.792 4
48 303.484 4
49 306.848 4
50 327.708 5
51 309.540 6
52 310.213 6
53 309.540 6
54 306.848 6
55 304.156 6
56 302.811 6
57 302.811 6
58 304.156 6
59 305.502 6
60 306.175 6
61 306.175 6
62 304.829 6
I haven't tried anything yet because I don't know how to approach this.
You can use dist and hclust with cutree to detect clusters, then assign a new sequence number at every break, i.e. wherever the cluster label changes between consecutive values.
hc <- hclust(dist(x))
cl <- cutree(hc, k=6)
data.frame(x, seq=cumsum(c(0, diff(cl)) != 0) + 1)
# x seq
# 1 399.710 1
# 2 403.075 1
# 3 405.766 1
# 4 407.112 1
# 5 408.458 1
# 6 409.131 1
# 7 410.477 1
# 8 411.150 1
# 9 412.495 1
# 10 332.419 2
# 11 330.400 2
# 12 329.054 2
# 13 327.708 2
# 14 326.363 2
# 15 325.017 2
# 16 322.998 2
# 17 319.633 3
# 18 314.923 3
# 19 288.680 4
# 20 285.315 4
# 21 283.969 4
# 22 281.950 4
# 23 279.932 4
# 24 276.567 5
# 25 273.875 5
# 26 272.530 5
# 27 271.857 5
# 28 272.530 5
# 29 273.875 5
# 30 274.548 5
# 31 275.894 5
# 32 275.894 5
# 33 276.567 5
# 34 277.240 5
# 35 278.586 6
# 36 279.932 6
# 37 281.950 6
# 38 284.642 6
# 39 288.007 6
# 40 291.371 6
# 41 294.063 7
# 42 295.409 7
# 43 296.754 7
# 44 297.427 7
# 45 298.100 7
# 46 299.446 7
# 47 300.792 7
# 48 303.484 7
# 49 306.848 7
# 50 327.708 8
# 51 309.540 9
# 52 310.213 9
# 53 309.540 9
# 54 306.848 9
# 55 304.156 9
# 56 302.811 9
# 57 302.811 9
# 58 304.156 9
# 59 305.502 9
# 60 306.175 9
# 61 306.175 9
# 62 304.829 9
However, the dendrogram suggests k=4 clusters rather than 6, although the choice of cut height is somewhat arbitrary.
plot(hc)
abline(h=30, lty=2, col=2)
abline(h=18.5, lty=2, col=3)
abline(h=14, lty=2, col=4)
legend('topright', lty=2, col=2:4, legend=paste(c(4, 5, 7), 'cluster'), cex=.8)
Data:
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
304.156, 305.502, 306.175, 306.175, 304.829)
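Since the question states the criterion as a maximum spread of 20 within a group, another option is to cut the tree at a height instead of choosing k. This is a minimal sketch, assuming the default complete linkage of hclust, where the merge height is the largest pairwise distance inside the merged cluster, so cutting at h = 20 should keep the range of every cluster at or below 20. The names max_spread, spread and seq_id are illustrative and not part of the answer above.

```r
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
       411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
       322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
       279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
       274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
       281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
       297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
       309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
       304.156, 305.502, 306.175, 306.175, 304.829)

max_spread <- 20                      # assumed threshold, taken from the question text

hc <- hclust(dist(x))                 # method = "complete" is the default
cl <- cutree(hc, h = max_spread)      # cut by height rather than by k

# Range of the values inside each cluster; none should exceed max_spread
spread <- tapply(x, cl, function(v) diff(range(v)))

# New sequence number wherever the cluster label changes between neighbours
seq_id <- cumsum(c(TRUE, diff(cl) != 0))

res <- data.frame(x, cluster = cl, seq = seq_id)
```

The number of sequences then follows from the threshold rather than from a chosen k, which may or may not match the manual labels exactly.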
This solution iterates over every value, checks the range of all values in the group up to that point, and starts a new group if the range is greater than a threshold.
maxrange <- 18
grp_start <- 1
grp_num <- 1
V3 <- numeric(length(dat$V1))
for (i in seq_along(dat$V1)) {
grp <- dat$V1[grp_start:i]
if (max(grp) - min(grp) > maxrange) {
grp_num <- grp_num + 1
grp_start <- i
}
V3[[i]] <- grp_num
}
cbind(dat, V3)
V1 V2 V3
1 399.710 1 1
2 403.075 1 1
3 405.766 1 1
4 407.112 1 1
5 408.458 1 1
6 409.131 1 1
7 410.477 1 1
8 411.150 1 1
9 412.495 1 1
10 332.419 2 2
11 330.400 2 2
12 329.054 2 2
13 327.708 2 2
14 326.363 2 2
15 325.017 2 2
16 322.998 2 2
17 319.633 2 2
18 314.923 2 2
19 288.680 3 3
20 285.315 3 3
21 283.969 3 3
22 281.950 3 3
23 279.932 3 3
24 276.567 3 3
25 273.875 3 3
26 272.530 3 3
27 271.857 3 3
28 272.530 3 3
29 273.875 3 3
30 274.548 3 3
31 275.894 3 3
32 275.894 3 3
33 276.567 3 3
34 277.240 3 3
35 278.586 3 3
36 279.932 3 3
37 281.950 3 3
38 284.642 3 3
39 288.007 3 3
40 291.371 3 4
41 294.063 4 4
42 295.409 4 4
43 296.754 4 4
44 297.427 4 4
45 298.100 4 4
46 299.446 4 4
47 300.792 4 4
48 303.484 4 4
49 306.848 4 4
50 327.708 5 5
51 309.540 6 6
52 310.213 6 6
53 309.540 6 6
54 306.848 6 6
55 304.156 6 6
56 302.811 6 6
57 302.811 6 6
58 304.156 6 6
59 305.502 6 6
60 306.175 6 6
61 306.175 6 6
62 304.829 6 6
A threshold of 18 reproduces your groups, except that group 4 starts one row earlier. You could use a higher threshold, but then group 6 would start later than you have it.
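The same idea can be wrapped in a function that keeps a running minimum and maximum instead of recomputing the range of the whole group at every step. This is only a sketch; the function name group_by_range and the short vector v (a few values picked from the question's data) are made up for illustration.

```r
# Assign group numbers; a new group starts when adding the current value
# would make the range of the current group exceed maxrange.
group_by_range <- function(v, maxrange) {
  grp <- integer(length(v))
  if (length(v) == 0) return(grp)
  lo <- hi <- v[1]
  g <- 1L
  for (i in seq_along(v)) {
    lo <- min(lo, v[i])
    hi <- max(hi, v[i])
    if (hi - lo > maxrange) {
      g <- g + 1L          # start a new group at the current value
      lo <- hi <- v[i]
    }
    grp[i] <- g
  }
  grp
}

v <- c(399.71, 403.075, 412.495, 332.419, 319.633, 314.923,
       288.68, 271.857, 291.371)
group_by_range(v, maxrange = 18)
# [1] 1 1 1 2 2 2 3 3 4
```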
I'm looking for the optimal way to go from a numeric vector containing duplicate entries, like this one:
a=c(1,3,4,4,4,5,7,9,27,28,28,30,42,43)
to this one, avoiding the duplicates by shifting +1 if appropriate:
b=c(1,3,4,5,6,7,8,9,27,28,29,30,42,43)
Side-by-side comparison:
> data.frame(a=a, b=b)
a b
1 1 1
2 3 3
3 4 4
4 4 5
5 4 6
6 5 7
7 7 8
8 9 9
9 27 27
10 28 28
11 28 29
12 30 30
13 42 42
14 43 43
Is there an easy and quick way to do it? Thanks!
In case you want it to be done only once (there may still be duplicates):
a=c(1,3,4,4,4,5,7,9,27,28,28,30,42,43)
a <- ifelse(duplicated(a),a+1,a)
output:
> a
[1] 1 3 4 5 5 5 7 9 27 28 29 30 42 43
Loop that will lead to a state without any duplicates:
a=c(1,3,4,4,4,5,7,9,27,28,28,30,42,43)
while(length(a[duplicated(a)])) {
a <- ifelse(duplicated(a),a+1,a)
}
output:
> a
[1] 1 3 4 5 6 7 8 9 27 28 29 30 42 43
An alternative is to use a recursive function:
no_dupes <- function(x) {
if (anyDuplicated(x) == 0)
x
else
no_dupes(x + duplicated(x))
}
no_dupes(a)
[1] 1 3 4 5 6 7 8 9 27 28 29 30 42 43
A tidyverse option using purrr::accumulate.
library(dplyr)
library(purrr)
accumulate(a, ~ if_else(.y <= .x, .x+1, .y))
# [1] 1 3 4 5 6 7 8 9 27 28 29 30 42 43
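For sorted input, the rule used by accumulate above (each value becomes the larger of itself and the previous result plus one) can also be written without a loop. A sketch in base R, assuming a is in non-decreasing order: subtracting the position turns the rule into a running maximum, which cummax computes directly. The helper name i is just for illustration.

```r
a <- c(1, 3, 4, 4, 4, 5, 7, 9, 27, 28, 28, 30, 42, 43)

i <- seq_along(a)
# b[k] = max(a[k], b[k-1] + 1)  is equivalent to  b[k] - k = max(a[k] - k, b[k-1] - (k-1))
b <- cummax(a - i) + i
b
# [1]  1  3  4  5  6  7  8  9 27 28 29 30 42 43
```

If the input is not sorted, this shortcut does not give the same result as the duplicated-based loops.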
I have a huge data set. Data covers around 4000 regions.
I need to compute a weighted sum like this: first, each number in a row should be multiplied by the value of its column name (0, 1, 2, ...).
Then the resulting products should be summed and divided by the total (totan) in that row.
For example, the data is like this:
region totan 0 1 2 3 4 5 6 7 .....
1 1346 5 7 3 9 23 24 34 54 .....
2 1256 7 8 4 10 34 2 14 30 .....
3 1125 83 43 23 11 16 4 67 21 .....
4 3211 43 21 67 12 13 12 98 12 .....
5 1111 21 8 9 3 23 13 11 0 .....
.... .... .. .. .. .. .. .. .. .. .....
4000 2345 21 9 11 45 67 89 28 7 .....
The calculation should be like this:
For example in region 1:
(5*0)+(7*1)+(3*2)+(9*3)+(23*4)+(24*5)+(34*6)+(54*7)+... = the sum, and the sum/1346 = the result
I need to do such an analysis for all the regions.
I tried a couple of approaches, such as "for" and "apply", but did not get the required result.
This can be done fully vectorized:
Data:
> df
region totan 0 1 2 3 4 5 6 7
1 1 1346 5 7 3 9 23 24 34 54
2 2 1256 7 8 4 10 34 2 14 30
3 3 1125 83 43 23 11 16 4 67 21
4 4 3211 43 21 67 12 13 12 98 12
5 5 1111 21 8 9 3 23 13 11 0
6 4000 2345 21 9 11 45 67 89 28 7
as.matrix(df[3:10]) %*% as.numeric(names(df)[3:10]) / df$totan
[,1]
[1,] 0.6196137
[2,] 0.3869427
[3,] 0.6711111
[4,] 0.3036437
[5,] 0.2322232
[6,] 0.4673774
This should be significantly faster on a huge dataset than any for or *apply loop.
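To make the example reproducible, here is a sketch that builds the six rows shown above and applies the same matrix product. It assumes the data frame was created with check.names = FALSE so that the count columns keep their numeric names; the names count_cols, w and result are illustrative.

```r
df <- data.frame(
  region = c(1, 2, 3, 4, 5, 4000),
  totan  = c(1346, 1256, 1125, 3211, 1111, 2345),
  `0` = c(5, 7, 83, 43, 21, 21),
  `1` = c(7, 8, 43, 21, 8, 9),
  `2` = c(3, 4, 23, 67, 9, 11),
  `3` = c(9, 10, 11, 12, 3, 45),
  `4` = c(23, 34, 16, 13, 23, 67),
  `5` = c(24, 2, 4, 12, 13, 89),
  `6` = c(34, 14, 67, 98, 11, 28),
  `7` = c(54, 30, 21, 12, 0, 7),
  check.names = FALSE               # keep "0", "1", ... as column names
)

# Select the count columns by name instead of by position
count_cols <- setdiff(names(df), c("region", "totan"))
w <- as.numeric(count_cols)         # weights taken from the column names

# One matrix product gives the weighted sum of every row at once
df$result <- drop(as.matrix(df[count_cols]) %*% w) / df$totan
```

For region 1 the weighted sum is 834, so the result is 834/1346, which agrees with the 0.6196137 printed above.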
You could use the tidyverse:
library(tidyverse)
df %>% gather(k,v,-region,-totan) %>%
group_by(region,totan) %>% summarize(x=sum(as.numeric(k)*v)/first(totan))
## A tibble: 5 x 3
## Groups: region [?]
# region totan x
# <int> <int> <dbl>
#1 1 1346 0.620
#2 2 1256 0.387
#3 3 1125 0.671
#4 4 3211 0.304
#5 5 1111 0.232
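# Column names are character strings, so convert them to numeric weights first
cols <- 3:ncol(data)
w <- as.numeric(names(data)[cols])
res <- numeric(nrow(data))
for (i in seq_len(nrow(data))) {
  res[i] <- sum(unlist(data[i, cols]) * w) / data[i, 2]
}
alternatively
apply(data, 1, function(x) {
  sum(x[cols] * w) / x[2]
})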
I have a list of Data Frames named StatesList (it's a list of states), and I'm trying to pull out two Columns from each one, sum it, and return the sums. This is what I have so far:
StatesList <- list(Alabam, Alask, Arizon, Arkansa, Californi, Colorado, Connecticu, Delawar, District_ColUmbi, Florid, Georgi, Hawai, Idah, Illinoi, Indian, Iow, Kansa, Kentuck, Louisian, Main, Marylan, Massachusett, Michiga, Minnesot, Mississipp, Missour, Montan, Nebrask, Nevad, New_Hamp, New_Jer, New_Mex, New_York, North_Carol, North_Dak, Ohi, Oklahom, Orego, Pennsylvani, Rhode_Isl, South_Carol, South_Dak, Tennesse, Texa, Uta,Vermon, Virgini, Washingto, West_Vir, Wisconsi, Wyomin )
my_function <- function(x) {
c <- sum(x + $Clinton_Weighted)
t <- sum(x + $Trump_Weighted)
ans <- list(Clinton = c, Trump = t)
return(print(ans))
}
lapply(StatesList, my_function(x))
I know that x + $Clinton_Weighted won't work, but I'm not sure what will.
How do I pull out that specific column in the function's code? And is trying to combine the names of each list with the $ and the desired column a bad idea?
Here is a simple way to do this using a combination of lapply and apply:
# Create sample data
cols = list(Clinton = 1:10, Trump = 10:1, SomeoneElse = 21:30)
Alabama = data.frame(cols)
Alaska = data.frame(cols)
Arison = data.frame(cols)
Arkansa = data.frame(cols)
Californi = data.frame(cols)
df_list = list(Alabama, Alaska, Arison, Arkansa, Californi)
The list of dataframes looks like this:
df_list
[[1]]
Clinton Trump SomeoneElse
1 1 10 21
2 2 9 22
3 3 8 23
4 4 7 24
5 5 6 25
6 6 5 26
7 7 4 27
8 8 3 28
9 9 2 29
10 10 1 30
[[2]]
Clinton Trump SomeoneElse
1 1 10 21
2 2 9 22
3 3 8 23
4 4 7 24
5 5 6 25
6 6 5 26
7 7 4 27
8 8 3 28
9 9 2 29
10 10 1 30
[[3]]
Clinton Trump SomeoneElse
1 1 10 21
2 2 9 22
3 3 8 23
4 4 7 24
5 5 6 25
6 6 5 26
7 7 4 27
8 8 3 28
9 9 2 29
10 10 1 30
[[4]]
Clinton Trump SomeoneElse
1 1 10 21
2 2 9 22
3 3 8 23
4 4 7 24
5 5 6 25
6 6 5 26
7 7 4 27
8 8 3 28
9 9 2 29
10 10 1 30
[[5]]
Clinton Trump SomeoneElse
1 1 10 21
2 2 9 22
3 3 8 23
4 4 7 24
5 5 6 25
6 6 5 26
7 7 4 27
8 8 3 28
9 9 2 29
10 10 1 30
Now sum up the columns of the dataframe, and apply it over the list of dataframes:
# Choose the columns to extract the sum of
cols = c("Clinton", "Trump")
lapply(df_list, function(x) apply(x[cols], 2, sum))
Below is the returned list:
[[1]]
Clinton Trump
55 55
[[2]]
Clinton Trump
55 55
[[3]]
Clinton Trump
55 55
[[4]]
Clinton Trump
55 55
[[5]]
Clinton Trump
55 55
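Applied to the column names from the question, the same idea might look like the sketch below. Two points differ from the code in the question: the $ operator is applied to the function's argument (d$Clinton_Weighted), and lapply or sapply receives the function itself rather than a call such as my_function(x). The two small data frames and the names states, sum_votes and totals are invented for illustration only.

```r
# Hypothetical stand-ins for two of the state data frames
states <- list(
  Alabama = data.frame(Clinton_Weighted = c(1, 2, 3), Trump_Weighted = c(4, 5, 6)),
  Alaska  = data.frame(Clinton_Weighted = c(10, 20),  Trump_Weighted = c(30, 40))
)

# Sum the two columns of one data frame
sum_votes <- function(d) {
  c(Clinton = sum(d$Clinton_Weighted), Trump = sum(d$Trump_Weighted))
}

# Pass the function itself; sapply collects the named results into a matrix
totals <- t(sapply(states, sum_votes))
totals
#         Clinton Trump
# Alabama       6    15
# Alaska       30    70
```

Naming the list elements keeps the state names attached to the sums, which an unnamed list would lose.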
I am trying to develop a sampling function for randomization to make my day-to-day work easier:
Question:
pln <- 1:80
bcap <- cumsum(c(20, 12, 16, 16, 16))
bcap
[1] 20 32 48 64 80
For this example, I want to randomize pln within the blocks 1:20, 21:32, 33:48, 49:64 and 65:80. The block sizes might vary for different scenarios.
newpln <- c(sample(1:20), sample(21:32), sample(33:48),
sample(49:64), sample(65:80))
I want to create a general function where bcap can have any length; pln should always run from 1 to max(bcap).
Is this what you want?
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 13 19 4 16 11 2 5 20 9 14 10 3 1 7 6 8 17 12 15 18 27 24 30 32 23 25 28 21 31 26 29 22 39 41 48 36 37 45 42 47 43 38 40 34 35
[46] 44 46 33 60 52 50 58 51 54 62 55 64 61 59 49 63 53 56 57 72 74 76 78 67 69 70 66 73 79 68 80 77 71 75 65
Testing:
> pln <- 1:12
> pln
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> bcap <- cumsum(c(4, 3, 2, 3))
> bcap
[1] 4 7 9 12
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 4 2 3 1 6 5 7 8 9 12 11 10
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 4 2 3 1 6 5 7 9 8 10 12 11
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 2 3 1 4 7 6 5 8 9 11 10 12
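Two edge cases are worth guarding against if this goes into a general function. When all blocks have the same size, mapply simplifies its result to a matrix, and sapply then visits single elements instead of whole blocks. When a block has size 1, sample(n) shuffles 1:n rather than returning n. The sketch below avoids both by using Map (which never simplifies) and by shuffling positions with sample.int; the function name shuffle_blocks is made up for illustration.

```r
# Shuffle 1:max(bcap) within the blocks delimited by the cumulative sizes in bcap
shuffle_blocks <- function(bcap) {
  from <- c(1, head(bcap, -1) + 1)          # first index of each block
  blocks <- Map(function(lo, hi) {
    idx <- seq(lo, hi)
    idx[sample.int(length(idx))]            # safe even for blocks of length 1
  }, from, bcap)
  unlist(blocks, use.names = FALSE)
}

bcap <- cumsum(c(20, 12, 16, 16, 16))
newpln <- shuffle_blocks(bcap)
```

Each call returns a fresh permutation in which the first 20 positions hold the values 1 to 20, the next 12 hold 21 to 32, and so on.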
You can do this with one call to mapply. You just need an object that contains what's inside the cumsum call of your bcap object.
bvec <- c(20, 12, 16, 16, 16)
mapply(function(x,y) sample(x)+y-x, bvec, cumsum(bvec))
A small example:
bvec <- c(2,1,3,1)
set.seed(21)
unlist(mapply(function(x,y) sample(x)+y-x, bvec, cumsum(bvec)))
# [1] 2 1 3 4 5 6 7
library("plyr")
unlist(
  llply(
    mlply(
      # each block starts one past the previous boundary
      data.frame(from=c(1, bcap[-length(bcap)] + 1), to=bcap),
      seq),
    sample),
  use.names = FALSE)
Make a data.frame with the from/to bounds of each range, use it to build a list of the sequences, sample each sequence, and then combine the results.
UPDATE:
The + 1 in from matters. Without it, each block starts on the last value of the previous block, so the boundary values (4, 7 and 9 here) are drawn twice and the result has 15 values instead of 12:
> library("plyr")
> bcap <- cumsum(c(4, 3, 2, 3))
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 4 2 3 1 7 4 5 6 9 7 8 12 9 11 10
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 3 1 2 4 5 6 4 7 9 7 8 9 12 10 11
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 2 3 4 1 6 5 4 7 8 9 7 11 10 12 9