Error styles[[id]] : Indexing out of bounds in "riverplot" package - r

I am struggling to create a Sankey diagram using the package "riverplot". I did not manage to create a minimal toy example, so I have to include the riverplot object created by makeRiver() here. makeRiver() did not throw any errors, so I thought it would work, but it does not. I hope that one of you has an idea.
This is the riverplot object I am trying to plot:
$edges
ID N1 N2 Value
102 102 2 10 3
106 106 6 10 2
111 111 2 11 7
115 115 6 11 2
119 119 1 12 1
120 120 2 12 72
121 121 3 12 4
125 125 7 12 7
127 127 9 12 4
129 129 2 13 14
134 134 7 13 2
136 136 9 13 1
145 145 9 14 1
147 147 2 15 4
152 152 7 15 1
154 154 9 15 1
156 156 2 16 1
165 165 2 17 69
166 166 3 17 3
167 167 4 17 1
168 168 5 17 1
169 169 6 17 2
170 170 7 17 7
171 171 8 17 1
172 172 9 17 8
$nodes
ID labels x
1 1 Albanisch 1
2 2 Arabisch 1
3 3 Arabisch;Englisch 1
4 4 Arabisch;Türkisch 1
5 5 Englisch;Kurdisch;Arabisch 1
6 6 Kurdisch 1
7 7 Kurdisch;Arabisch 1
8 8 Syrisch;Arabisch 1
9 9 keine 1
10 10 Arabisch 2
11 11 Arabisch;Englisch 2
12 12 Englisch 2
13 13 Englisch;Französisch 2
14 14 Englisch;Französisch;Arabisch 2
15 15 Französisch 2
16 16 Französisch;Englisch 2
17 17 keine 2
$styles
list()
attr(,"class")
[1] "list" "riverplot"
Calling riverplot(river) ("river" being the name of the variable I saved the object in), I get the following output (sorry that the error message is in German, it says "Index(ing) out of bounds"):
[1] "calculating positions"
[1] 21.9
ID labels x
1 1 Albanisch 1
2 2 Arabisch 1
3 3 Arabisch;Englisch 1
4 4 Arabisch;Türkisch 1
5 5 Englisch;Kurdisch;Arabisch 1
6 6 Kurdisch 1
7 7 Kurdisch;Arabisch 1
8 8 Syrisch;Arabisch 1
9 9 keine 1
10 10 Arabisch 2
11 11 Arabisch;Englisch 2
12 12 Englisch 2
13 13 Englisch;Französisch 2
14 14 Englisch;Französisch;Arabisch 2
15 15 Französisch 2
16 16 Französisch;Englisch 2
17 17 keine 2
[1] "done"
[1] "drawing edges"
Fehler in styles[[id]] : Indizierung außerhalb der Grenzen
I THINK I traced the problem to the function riverplot:::getattr, but I am not sure about that. Any help?

In case anyone is interested in the solution to the problem I described above: I used numeric IDs for nodes (1, 2, 3, ...) and edges (101, 102, ...).
makeRiver() checks if IDs are duplicated among nodes and edges and throws an error if that happens. However, it does NOT check if the IDs are purely numeric, which is apparently the source of the error.
I now added an "E" at the beginning of the edge IDs (E1, E2, ...) and an "N" at the beginning of node IDs (N1, N2, ...). It works now!
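A minimal sketch of that renaming step, using a small made-up subset of the nodes and edges rather than the full data above. The makeRiver() and riverplot() calls are left commented out so the snippet runs without the package installed. Note that N1 and N2 in the edges have to be renamed together with the node IDs so they still match.

```r
# Illustrative toy data (not the original object): numeric node and edge IDs
nodes <- data.frame(
  ID     = 1:4,
  labels = c("Arabisch", "Kurdisch", "Englisch", "keine"),
  x      = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  ID    = 101:103,
  N1    = c(1, 2, 1),
  N2    = c(3, 3, 4),
  Value = c(72, 2, 69)
)

# Turn the purely numeric IDs into character IDs with a prefix
nodes$ID <- paste0("N", nodes$ID)
edges$N1 <- paste0("N", edges$N1)  # must still refer to the renamed node IDs
edges$N2 <- paste0("N", edges$N2)
edges$ID <- paste0("E", edges$ID)

# With the package installed, the renamed tables can then be passed on:
# river <- riverplot::makeRiver(nodes, edges)
# riverplot::riverplot(river)
```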

Finding the k-largest clusters in dbscan result

I have a data frame df with two columns: the x and y coordinates.
Each row refers to a point.
I feed it into the dbscan function to obtain the clusters of the points in df.
library("fpc")
db = fpc::dbscan(df, eps = 0.08, MinPts = 4)
plot(db, df, main = "DBSCAN", frame = FALSE)
By using print(db), I can see the result returned by dbscan.
> print(db)
dbscan Pts=13131 MinPts=4 eps=0.08
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
border 401 38 55 5 2 3 0 0 0 8 0 6 1 3 1 3 3 2 1 2 4 3
seed 0 2634 8186 35 24 561 99 7 22 26 5 75 17 9 9 54 1 2 74 21 3 15
total 401 2672 8241 40 26 564 99 7 22 34 5 81 18 12 10 57 4 4 75 23 7 18
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
border 4 1 2 6 2 1 3 7 2 1 2 3 11 1 3 1 3 2 5 5 1 4 3
seed 14 9 4 48 2 4 38 111 5 11 5 14 111 6 1 5 1 8 3 15 10 15 6
total 18 10 6 54 4 5 41 118 7 12 7 17 122 7 4 6 4 10 8 20 11 19 9
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
border 2 4 2 1 3 2 1 1 3 1 0 2 2 3 0 3 3 3 3 0 0 2 3 1
seed 15 2 9 11 4 8 12 4 6 8 7 7 3 3 4 3 3 4 2 9 4 2 1 4
total 17 6 11 12 7 10 13 5 9 9 7 9 5 6 4 6 6 7 5 9 4 4 4 5
69 70 71
border 3 3 3
seed 1 1 1
total 4 4 4
From the above summary, I can see that cluster 2 consists of 8186 seed points (core points), cluster 1 of 2634 seed points and cluster 5 of 561 seed points.
I define the largest cluster as the one that contains the largest number of seed points. So, in this case, the largest cluster is cluster 2, and the 1st, 2nd and 3rd largest clusters are 2, 1 and 5.
Is there any direct way to return the rows (points) in the largest cluster, or in the k-th largest cluster in general?
I can do it in an indirect way.
I can obtain the assigned cluster number of each point from db$cluster.
Hence, I can create a new data frame df2 with db$cluster as an additional column beside the original x and y columns.
Then, I can aggregate df2 by the cluster numbers in the third column and find the number of points in each cluster.
After that, I can find the k largest groups, which are 2, 1 and 5 again.
Finally, I can select the rows in df2 whose third column equals 2 to return the points in the largest cluster.
But the above approach re-computes many known results that are already stated in the summary from print(db).
The dbscan function doesn't appear to retain the data.
library(fpc)
set.seed(665544)
n <- 600
df <- data.frame(x=runif(10, 0, 10)+rnorm(n, sd=0.2), y=runif(10, 0, 10)+rnorm(n,sd=0.2))
(dbs <- dbscan(df, 0.2))
#dbscan Pts=600 MinPts=5 eps=0.2
# 0 1 2 3 4 5 6 7 8 9 10 11
#border 28 4 4 8 5 3 3 4 3 4 6 4
#seed 0 50 53 51 52 51 54 54 54 53 51 1
#total 28 54 57 59 57 54 57 58 57 57 57 5
attributes(dbs)
#$names
#[1] "cluster" "eps" "MinPts" "isseed"
#$class
#[1] "dbscan"
Your indirect steps are not that indirect (only two lines needed), and these commands won't recalculate the clusters. So just run those commands, or put them in a function and then call the function in one command.
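cluster_k <- function(dbs, data, k){
  # rank the cluster labels by total size (largest first) and take the k-th label;
  # note that the noise label 0 is counted like any other cluster here
  kth <- names(rev(sort(table(dbs$cluster)))[k])
  # return the rows of data assigned to that cluster
  data[dbs$cluster == kth,]
}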
cluster_k(dbs=dbs, data=df, k=1)
## x y
## 3 6.580695 8.715245
## 13 6.704379 8.528486
## 23 6.809558 8.160721
## 33 6.375842 8.756433
## 43 6.603195 8.640206
## 53 6.728533 8.425067
## a data frame with 59 rows
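The function above ranks clusters by total size and treats the noise label 0 as a cluster. The question defines "largest" by the number of seed points, so here is a hedged variant that uses the isseed component listed in attributes(dbs) and skips the noise points. The name cluster_k_seed and the small hand-made dbs list are only for illustration; a real dbscan result from fpc should be usable in the same way.

```r
# Return the rows of `data` in the cluster with the k-th largest number of seed points.
# Assumes dbs$cluster (integer labels, 0 = noise) and dbs$isseed (logical), as in fpc::dbscan.
cluster_k_seed <- function(dbs, data, k) {
  is_core <- dbs$isseed & dbs$cluster != 0
  seed_counts <- table(dbs$cluster[is_core])
  stopifnot(k >= 1, k <= length(seed_counts))
  ranked <- names(sort(seed_counts, decreasing = TRUE))
  kth <- as.integer(ranked[k])
  data[dbs$cluster == kth, , drop = FALSE]
}

# Hand-made stand-in for a dbscan result (hypothetical values, for illustration only)
dbs_toy <- list(
  cluster = c(0, 1, 1, 1, 2, 2, 2, 2, 0),
  isseed  = c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)
)
pts <- data.frame(x = 1:9, y = 9:1)

largest <- cluster_k_seed(dbs_toy, pts, k = 1)  # cluster 2 has 3 seed points
second  <- cluster_k_seed(dbs_toy, pts, k = 2)  # cluster 1 has 2 seed points
```

Border points of the selected cluster are still returned; only the ranking is based on seed points.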

Multiplication in R

I have a huge data set. Data covers around 4000 regions.
For each row, I first need to multiply each number by its column name (0, 1, 2, ...).
Then these products should be summed and divided by the total (totan) in that row.
For example, the data is like this:
region totan 0 1 2 3 4 5 6 7 .....
1 1346 5 7 3 9 23 24 34 54 .....
2 1256 7 8 4 10 34 2 14 30 .....
3 1125 83 43 23 11 16 4 67 21 .....
4 3211 43 21 67 12 13 12 98 12 .....
5 1111 21 8 9 3 23 13 11 0 .....
.... .... .. .. .. .. .. .. .. .. .....
4000 2345 21 9 11 45 67 89 28 7 .....
The calculation should be like this:
For example in region 1:
((5*0)+(7*1)+(3*2)+(9*3)+(23*4)+(24*5)+(34*6)+(54*7)+...) / 1346 = the result
I need to do such an analysis for all the regions.
I tried a couple of ways like use of "for" and "apply" but did not get the required result.
This can be done fully vectorized:
Data:
> df
region totan 0 1 2 3 4 5 6 7
1 1 1346 5 7 3 9 23 24 34 54
2 2 1256 7 8 4 10 34 2 14 30
3 3 1125 83 43 23 11 16 4 67 21
4 4 3211 43 21 67 12 13 12 98 12
5 5 1111 21 8 9 3 23 13 11 0
6 4000 2345 21 9 11 45 67 89 28 7
as.matrix(df[3:10]) %*% as.numeric(names(df)[3:10]) / df$totan
[,1]
[1,] 0.6196137
[2,] 0.3869427
[3,] 0.6711111
[4,] 0.3036437
[5,] 0.2322232
[6,] 0.4673774
This should be significantly faster on a huge dataset than any for or *apply loop.
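As a self-contained check of the matrix product, here is a sketch on the first two rows of the sample data. It selects the value columns by name instead of the hard-coded 3:10; the helper names value_cols and weights are just for illustration.

```r
# First two regions from the example; check.names = FALSE keeps "0", "1", ... as column names
df <- data.frame(
  region = c(1, 2),
  totan  = c(1346, 1256),
  `0` = c(5, 7),   `1` = c(7, 8),  `2` = c(3, 4),   `3` = c(9, 10),
  `4` = c(23, 34), `5` = c(24, 2), `6` = c(34, 14), `7` = c(54, 30),
  check.names = FALSE
)

# Every column except region and totan is a value column; its name is the weight
value_cols <- setdiff(names(df), c("region", "totan"))
weights <- as.numeric(value_cols)

# Row-wise weighted sum via one matrix product, then divide by the row total
weighted_mean <- drop(as.matrix(df[value_cols]) %*% weights) / df$totan
```

For region 1 the weighted sum is 834, so the result is 834/1346, which agrees with the 0.6196137 shown above.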
You could use the tidyverse:
library(tidyverse)
df %>% gather(k,v,-region,-totan) %>%
group_by(region,totan) %>% summarize(x=sum(as.numeric(k)*v)/first(totan))
## A tibble: 5 x 3
## Groups: region [?]
# region totan x
# <int> <int> <dbl>
#1 1 1346 0.620
#2 2 1256 0.387
#3 3 1125 0.671
#4 4 3211 0.304
#5 5 1111 0.232
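With a for loop (the column names have to be converted to numbers first, and the results stored):
w <- as.numeric(names(data)[3:ncol(data)])  # column names 0, 1, 2, ... as numeric weights
res <- numeric(nrow(data))
for (i in 1:nrow(data)) {
  res[i] <- sum(unlist(data[i, 3:ncol(data)]) * w) / data[i, 2]
}
Alternatively, with apply:
apply(data, 1, function(x){
  sum(x[3:length(x)] * as.numeric(names(x)[3:length(x)])) / x[2]
})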

Cumulative function for a specific range of values

I have a table with a column "Age" that has values from 1 to 10, and a column "Population" that gives a value for each age. I want to generate a cumulative sum of the population such that the resulting values cover ages 1 and above, 2 and above, and so on. That is, the resulting array should be (203, 180, ... and so on). Any help would be appreciated!
Age Population Withdrawn
1 23 3
2 12 2
3 32 2
4 33 3
5 15 4
6 10 1
7 19 2
8 18 3
9 19 1
10 22 5
You can use cumsum and rev:
df$sum_above <- rev(cumsum(rev(df$Population)))
The result:
> df
Age Population sum_above
1 1 23 203
2 2 12 180
3 3 32 168
4 4 33 136
5 5 15 103
6 6 10 88
7 7 19 78
8 8 18 59
9 9 19 41
10 10 22 22
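For reference, a self-contained version using the Population values from the question. It also computes the same column a second way (total minus the running sum, plus the current row) as a cross-check; the name alt is only for illustration.

```r
df <- data.frame(
  Age        = 1:10,
  Population = c(23, 12, 32, 33, 15, 10, 19, 18, 19, 22)
)

# Reverse, take the running sum, reverse back: row i holds the sum over ages >= i
df$sum_above <- rev(cumsum(rev(df$Population)))

# Equivalent formulation without rev(), as a cross-check
alt <- sum(df$Population) - cumsum(df$Population) + df$Population
```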

R: Subsetting a list by another list

This is a very trivial problem, so I hope someone can help me out here.
I have two lists, and I want to subset one list based on the values in the other.
> head(islands)
RleViewsList of length 6
names(6): chr1 chr2 chr3 chr4 chr5 chr6
> head(islands$chr1)
Views on a 249250621-length Rle subject
views:
start end width
[1] 10001 10104 104 [ 1 2 3 3 4 4 5 6 7 7 8 8 9 10 11 11 12 ...]
[2] 10109 10145 37 [ 1 2 2 3 3 4 5 6 6 7 7 8 9 10 10 11 11 ...]
[3] 10149 10176 28 [1 1 2 3 4 4 5 5 5 6 7 7 7 7 7 7 7 7 7 7 7 5 4 4 4 ...]
[4] 10178 10229 52 [ 1 1 2 3 4 4 5 5 6 7 8 8 9 9 10 11 12 ...]
[5] 10256 10286 31 [1 2 2 3 3 4 5 6 6 7 7 7 8 9 9 9 9 9 8 7 7 7 7 7 5 ...]
[6] 10332 10388 57 [ 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 ...]
> names(islandsums)
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrM" "chrX" "chrY"
> head(islandsums$chr1)
[1] 1198 259 140 472 176 298
> length(islandsums)
[1] 25
> length(islandsums$chr1)
[1] 288625
> length(islands)
[1] 25
> length(islands$chr1)
[1] 288625
If I do it manually on one list item, everything works as I would expect it:
> head(islands$chr1[islandsums$chr1>1000])
Views on a 249250621-length Rle subject
views:
start end width
[1] 10001 10104 104 [ 1 2 3 3 4 4 5 6 7 7 8 8 9 10 11 11 ...]
[2] 50482 50514 33 [ 3 14 17 28 29 39 40 49 51 59 60 64 65 66 66 66 ...]
[3] 74555 74633 79 [ 1 3 3 11 14 26 42 56 82 130 159 176 ...]
[4] 74908 74957 50 [76 76 76 76 76 76 76 76 76 76 76 76 77 77 77 77 ...]
[5] 109573 109615 43 [ 1 1 1 4 15 18 29 32 43 46 57 60 ...]
[6] 121455 121529 75 [ 1 1 1 1 1 4 4 4 4 4 4 5 5 6 11 11 ...]
But if I try to use lapply to apply it to the list, it does not work.
> head(lapply(islands, function(x) islands$x[islandsums$x>1000]))
$chr1
NULL
$chr2
NULL
$chr3
NULL
$chr4
NULL
$chr5
NULL
$chr6
NULL
Neither does this, although it gives a different result.
> head(lapply(islands, function(x) x[islandsums$x>1000]))
$chr1
Views on a 249250621-length Rle subject
views: NONE
$chr2
Views on a 243199373-length Rle subject
views: NONE
$chr3
Views on a 198022430-length Rle subject
views: NONE
$chr4
Views on a 191154276-length Rle subject
views: NONE
$chr5
Views on a 180915260-length Rle subject
views: NONE
$chr6
Views on a 171115067-length Rle subject
views: NONE
foo1 <- list(chr1= 1:25, chr2 = 5:15)
foo2 <- list(chr1= 7:31, chr2= 12:22)
lapply(seq_along(foo1), function(i) foo1[[i]][foo2[[i]]>9 & foo2[[i]] <14])
#[[1]]
#[1] 4 5 6 7
#[[2]]
#[1] 5 6
Or
Map(function(x,y) x[y>9 & y <14], foo1, foo2)
# $chr1
#[1] 4 5 6 7
#$chr2
#[1] 5 6
In your code:
lapply(foo1, function(x) x) #gives the values of list elements
#$chr1
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#$chr2
# [1] 5 6 7 8 9 10 11 12 13 14 15
lapply(foo1, function(x) foo2$x) #which is not the index for corresponding list elements in foo2
#$chr1
#NULL
#$chr2
#NULL
But,
lapply(seq_along(foo1), function(i) i ) #gives the index of corresponding list elements
#[[1]]
#[1] 1
#[[2]]
#[1] 2
lapply(seq_along(foo1), function(i) foo2[[i]] ) #gives the values of each list element in `foo2`
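The reason `islandsums$x` returns NULL is that `$` looks for an element literally named "x" and never evaluates the variable. If the two lists might not be in the same order, a sketch that iterates over the names and uses `[[` avoids relying on position. It reuses the toy lists from above, with foo2 deliberately reordered.

```r
foo1 <- list(chr1 = 1:25, chr2 = 5:15)
foo2 <- list(chr2 = 12:22, chr1 = 7:31)  # same names as foo1, different order

# setNames(nm = ...) makes lapply return a list named by chromosome
res <- lapply(setNames(nm = names(foo1)), function(nm) {
  keep <- foo2[[nm]] > 9 & foo2[[nm]] < 14  # [[nm]] looks up the element named by nm
  foo1[[nm]][keep]
})
```

The same pattern should carry over to the original objects as `islands[[nm]][islandsums[[nm]] > 1000]`, assuming both are indexed by the same chromosome names.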

adding rnorm to a column in loop

I am doing simulations and am trying to add error to a column repeatedly, specifically to the column titled Ao. In my output, the first 30 rows are correct: we have the initial data and the first year of altered data (error added to Ao). But afterwards, where I would like to have 30 years of added error, I get repeats of Year 2 for Ao up to year 30. My goal is to add error after each year of sampling, i.e. Year 2 is Year 1 Ao + error, Year 3 is Year 2 Ao + error, and so on. Any help? Cheers.
for(t in 1:30){
Error<-rnorm(1000,0,1)
m<-rep(year1data$m,30)
r<-rep(year1data$r,30)
a<-rep(year1data$a,30)
g<-rep(year1data$g,30)
Year<-rep(2:31, each=TotSpecies)
Species<-1:TotSpecies
Ao<-year1data$Ao+sample(Error,TotSpecies,replace=FALSE)
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
TotSpeciesdata<-rbind(year1data,TotSpeciesdata)
}
> TotSpeciesdata
Species Year Ao m r a g
1 1 1 25.770783 43 119.110786 3.2305180 2.6526471
2 2 1 53.908914 138 161.894541 0.7342070 0.1151602
3 3 1 2.010732 226 193.820489 2.2890904 3.6248105
4 4 1 23.742254 332 17.315335 1.4009572 2.0037931
5 5 1 4.291080 63 187.591209 0.2563995 2.1553908
6 6 1 4.691113 343 116.267867 0.3899113 3.3950085
7 7 1 604.133044 224 132.240197 3.0410743 0.7985524
8 8 1 13.332567 166 5.367118 0.7921644 1.7861011
9 9 1 3.759268 141 212.340970 2.8733737 2.7123141
10 10 1 3.647390 209 259.400858 0.1249936 0.6594659
11 11 1 23.731109 10 114.171147 2.2437372 0.9867591
12 12 1 85.116996 69 167.412993 0.8306823 2.8905148
13 13 1 31.684280 277 177.025460 2.7618332 2.9245554
14 14 1 30.657523 205 21.710438 2.7661347 1.5911379
15 15 1 12.240410 85 210.121109 2.8827455 3.0418454
16 1 2 27.038097 43 119.110786 3.2305180 2.6526471
17 2 2 54.251600 138 161.894541 0.7342070 0.1151602
18 3 2 2.010636 226 193.820489 2.2890904 3.6248105
19 4 2 22.699369 332 17.315335 1.4009572 2.0037931
20 5 2 4.542589 63 187.591209 0.2563995 2.1553908
21 6 2 3.607833 343 116.267867 0.3899113 3.3950085
22 7 2 604.480756 224 132.240197 3.0410743 0.7985524
23 8 2 13.663513 166 5.367118 0.7921644 1.7861011
24 9 2 2.138715 141 212.340970 2.8733737 2.7123141
25 10 2 3.642769 209 259.400858 0.1249936 0.6594659
26 11 2 22.897993 10 114.171147 2.2437372 0.9867591
27 12 2 85.490897 69 167.412993 0.8306823 2.8905148
28 13 2 31.689202 277 177.025460 2.7618332 2.9245554
29 14 2 30.644419 205 21.710438 2.7661347 1.5911379
30 15 2 12.050207 85 210.121109 2.8827455 3.0418454
31 1 3 27.038097 43 119.110786 3.2305180 2.6526471
32 2 3 54.251600 138 161.894541 0.7342070 0.1151602
33 3 3 2.010636 226 193.820489 2.2890904 3.6248105
34 4 3 22.699369 332 17.315335 1.4009572 2.0037931
35 5 3 4.542589 63 187.591209 0.2563995 2.1553908
36 6 3 3.607833 343 116.267867 0.3899113 3.3950085
37 7 3 604.480756 224 132.240197 3.0410743 0.7985524
38 8 3 13.663513 166 5.367118 0.7921644 1.7861011
39 9 3 2.138715 141 212.340970 2.8733737 2.7123141
40 10 3 3.642769 209 259.400858 0.1249936 0.6594659
41 11 3 22.897993 10 114.171147 2.2437372 0.9867591
42 12 3 85.490897 69 167.412993 0.8306823 2.8905148
43 13 3 31.689202 277 177.025460 2.7618332 2.9245554
44 14 3 30.644419 205 21.710438 2.7661347 1.5911379
45 15 3 12.050207 85 210.121109 2.8827455 3.0418454
The main problem you have with your approach is the line:
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
Year is a vector of length 30 * TotSpecies, but all the other columns are only TotSpecies long. In effect, every column except Year is recycled 30 times when you create the data frame, which leads to the year 2 data being repeated 30 times, among other things. If you use Year <- rep(t + 1, TotSpecies) instead, I think your logic will work fine, though you would also need to build each year's Ao from the previous year's values and append to the result rather than overwrite it on every iteration. That said, here is an alternate approach:
This will, for each species, create an incrementing random walk starting with Ao for that species for 5 years (just did that for display purposes):
set.seed(1)
year1data <- data.frame(species=1:10, year=1, Ao=runif(10, 1, 700))
TotSpeciesData <- do.call(
  rbind,
  lapply(
    split(year1data, year1data$species),
    function(data)
      with(
        data,
        # year 1 keeps the original Ao; years 2-6 add a cumulative sum of errors
        data.frame(species=species, year=1:6, Ao=c(Ao, Ao + cumsum(rnorm(5))))
) ) ) )
head(TotSpeciesData, 15)
Note I excluded columns m-g since they don't seem directly relevant to your particular question, but you can add them relatively easily. I also only did 5 years in addition to year 1 so you can see the results here, but that is also easy to change:
species year Ao
1.1 1 1 186.5906
1.2 1 2 185.7701
1.3 1 3 186.2575
1.4 1 4 186.9958
1.5 1 5 187.5716
1.6 1 6 187.2662
2.1 2 1 261.1146
2.2 2 2 262.6264
2.3 2 3 263.0162
2.4 2 4 262.3950
2.5 2 5 260.1803
2.6 2 6 261.3052
3.1 3 1 401.4245
3.2 3 2 401.3796
3.3 3 3 401.3634
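If you prefer to stay with a loop, here is a minimal sketch of the step-wise idea from the question (each year is the previous year's Ao plus fresh error). The three-species starting table and n_years are made-up values for illustration, and the m, r, a, g columns are omitted.

```r
set.seed(1)
TotSpecies <- 3   # assumed small value for display
n_years    <- 5   # assumed; the question uses 30 extra years

# Hypothetical year 1 data
year1data <- data.frame(Species = 1:TotSpecies, Year = 1, Ao = c(25.8, 53.9, 2.0))

years <- vector("list", n_years)
years[[1]] <- year1data
for (t in 2:n_years) {
  prev <- years[[t - 1]]                           # start from the previous year, not year 1
  nxt <- prev
  nxt$Year <- t
  nxt$Ao <- prev$Ao + rnorm(TotSpecies, 0, 1)      # add new error to last year's Ao
  years[[t]] <- nxt                                # store instead of overwriting the result
}
TotSpeciesdata <- do.call(rbind, years)
```

Collecting the yearly tables in a list and binding them once at the end avoids overwriting the result on each iteration.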
It has been pointed out that the code you provided above, or at least my edited version of it, repeats itself every 15 years rather than being unique from year to year in a step-wise fashion. I edited it as shown below:
TotSpeciesData <- do.call(
rbind, #bind the table by rows
lapply( #applying the function in list form
split(year1data, year1data$Species), #splits data into groups by species
function(data)
with(
data,
data.frame(Species=Species, Year=1:Community, Ao=c(Ao, Ao + cumsum(rnorm((TotSpecies-1),0,2))),m=m, r=r, a=a, g=g) #data frame is Species, Year,
) ) )
TotSpeciesData$Ao[TotSpeciesData$Ao<0]<-0 #any values less than 0 go to 0
TotSpeciesData<-TotSpeciesData[order(TotSpeciesData$Year),] #orders the data frame by Year
When I do this code:
TotSpeciesData[TotSpeciesData$Species==1 & TotSpeciesData$Year %in% c(1,2,16,17),]
I end up with an output showing that the data is repeating itself.
Species Year Ao m r a g
1.1 1 1 48.49161 239 332.9625 3.791778 2.723104
1.2 1 2 49.62851 239 332.9625 3.791778 2.723104
1.16 1 16 48.49161 239 332.9625 3.791778 2.723104
1.17 1 17 49.62851 239 332.9625 3.791778 2.723104
Any comments toward this?
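One likely cause, assuming TotSpecies is 15 and Community is 30 as the output suggests: c(Ao, Ao + cumsum(rnorm(TotSpecies - 1, 0, 2))) has only 15 elements, while Year has 30, so data.frame() recycles the 15 Ao values and year 16 repeats year 1. A sketch with the error vector sized by the number of years instead (the values of TotSpecies and Community and the year 1 table are assumptions for illustration, and m, r, a, g are omitted):

```r
set.seed(42)
TotSpecies <- 15  # assumed number of species
Community  <- 30  # assumed number of years

# Hypothetical year 1 data
year1data <- data.frame(Species = 1:TotSpecies, Ao = runif(TotSpecies, 1, 700))

TotSpeciesData <- do.call(
  rbind,
  lapply(
    split(year1data, year1data$Species),
    function(d) {
      data.frame(
        Species = d$Species,
        Year    = 1:Community,
        # Community - 1 error terms, so Ao has exactly one value per year
        Ao      = c(d$Ao, d$Ao + cumsum(rnorm(Community - 1, 0, 2)))
      )
    }
  )
)
TotSpeciesData$Ao[TotSpeciesData$Ao < 0] <- 0                    # floor negative values at 0
TotSpeciesData <- TotSpeciesData[order(TotSpeciesData$Year), ]  # order by Year

sp1 <- TotSpeciesData[TotSpeciesData$Species == 1, ]
```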
