I have a simple graph g. I need to smooth the graph by deleting the vertices whose degree is 2 while preserving the layout of the original graph. The same task has been solved in Mathematica.
library(igraph)
set.seed(1)
# preprocessing
g <- sample_gnp(40, 1/20)
V(g)$name <- seq_len(vcount(g))
components <- clusters(g, mode="weak")
biggest_cluster_id <- which.max(components$csize)
vert_ids <- V(g)[components$membership == biggest_cluster_id]
vert_ids
# input random graph
g <- induced_subgraph(g, vert_ids)
LO = layout.fruchterman.reingold(g)
plot(g, vertex.color = ifelse(degree(g)==2, "red", "green"), main ="g", layout = LO)
I have selected the chains of vertices with degree 2.
subg <- induced_subgraph(g, degree(g)==2)
subg_ids <- V(subg); subg_ids
I have read the Q&A and manually defined the mapping parameter of the contract() function.
# join nodes 3 -> 14, 15 -> 40, 13 -> 31, 29 -> 6
mapping = c(2, 3, 4, 5, 6, 7, 8, 10, 13, 3, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 6, 30, 13, 32, 33, 34, 35, 36, 38, 39, 15)
g2 <- simplify(contract(g, mapping=mapping, vertex.attr.comb=toString))
# L2 <- LO[-as.numeric(c(14, 40, 31, 6)),] # not working
plot(g2, vertex.color = ifelse(degree(g2)==2, "red", "green"), main ="g2")
Question. What is a possible way to define the mapping parameter iteratively?
Here is an option without mapping in contract (so you don't need to configure mapping manually)
g2 <- graph_from_data_frame(
  rbind(
    get.data.frame(delete.vertices(g, names(subg_ids))),
    do.call(
      rbind,
      lapply(
        decompose(subg),
        function(x) {
          nbs <- names(unlist(neighborhood(g, nodes = names(V(x))[degree(x) < 2])))
          setNames(data.frame(t(subset(nbs, !nbs %in% names(subg_ids)))), c("from", "to"))
        }
      )
    )
  ),
  directed = FALSE
)
Then plot the result, reusing the original layout coordinates for the remaining vertices:
plot(g2, main = "g2", layout = LO[match(names(V(g2)), names(V(g))), ])
This is only a partial answer, since it does not give a way to compute the contraction automatically. However, I can give some insights on the manual mapping:
Your vertices have names, so those are used for reference instead of the internal vertex number from 1 to n.
In the mapping we need to give the new IDs of the vertices after the contraction.
The original IDs are
> V(g)
+ 33/33 vertices, named, from 0af52c3:
[1] 2 3 4 5 6 7 8 10 13 14 15 16 17 18 19 20 21 22 23 25 26 27 29 30 31 32 33 34 35 36 38 39 40
The new IDs can be given as (multiple possibilities exist):
mapping <- c(6, 14, 6, 5, 6, 7, 7, 10, 31, 14, 15, 16, 17, 14, 6, 7, 31, 22, 6, 25, 26, 27, 14, 30, 31, 6, 6, 34, 35, 36, 38, 39, 15)
For better overview:
old ID: 2 3 4 5 6 7 8 10 13 14 15 16 17 18 19 20 21 22 23 25 26 27 29 30 31 32 33 34 35 36 38 39 40
new ID: 6 14 6 5 6 7 7 10 31 14 15 16 17 14 6 7 31 22 6 25 26 27 14 30 31 6 6 34 35 36 38 39 15
This results in:
g2 <- simplify(contract(g, mapping=mapping, vertex.attr.comb=toString))
plot(g2, vertex.color = ifelse(degree(g2)==2, "red", "green"), main ="g2")
To get rid of the resulting degree-0 nodes you can do:
g3 <- delete.vertices(g2, which(degree(g2) == 0))
Alternatively, and maybe even cleaner, you could delete the nameless nodes:
g3 <- delete.vertices(g2, which(names(V(g2)) == ""))
To keep the original layout you can do:
L3 <- LO[-which(mapping != as.numeric(names(V(g)))),]
plot(g3, layout = L3)
But it does not look very good in this case...
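For the automatic part that this answer leaves open, here is a minimal sketch in base R. It is not the contract() approach from the question: it assumes the graph is available as a plain undirected edge list with character vertex names, and smooth_degree2 is a hypothetical helper name. Each degree-2 vertex is removed and its two neighbours are joined directly, unless that would duplicate an existing edge.

```r
# Hypothetical helper: smooth an undirected simple graph given as an edge list.
# 'edges' is assumed to be a data.frame with character columns 'from' and 'to'.
smooth_degree2 <- function(edges) {
  deg <- table(c(edges$from, edges$to))
  for (v in names(deg)[as.vector(deg) == 2]) {
    inc <- which(edges$from == v | edges$to == v)   # rows touching v
    if (length(inc) != 2) next
    nb <- c(edges$from[inc], edges$to[inc])
    nb <- nb[nb != v]                               # the two neighbours of v
    if (length(nb) != 2 || nb[1] == nb[2]) next
    dup <- (edges$from == nb[1] & edges$to == nb[2]) |
           (edges$from == nb[2] & edges$to == nb[1])
    if (any(dup)) next                              # keep the graph simple
    edges <- rbind(edges[-inc, ],
                   data.frame(from = nb[1], to = nb[2], stringsAsFactors = FALSE))
  }
  rownames(edges) <- NULL
  edges
}

# toy graph: chain a-b-c, leaf c-d, and chain c-e-f-g
edges <- data.frame(
  from = c("a", "b", "c", "c", "e", "f"),
  to   = c("b", "c", "d", "e", "f", "g"),
  stringsAsFactors = FALSE
)
smoothed <- smooth_degree2(edges)
smoothed
```

Surviving vertices keep their names, so the rows of the original layout matrix can still be matched by name when plotting.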
I am trying to write a function that looks up values in this data frame.
This is the code I was using, but it doesn't seem to be working.
I need image_id to be the independent variable so that when I input 7 into the function, I get back 10 and 15. If I were to input 8, I would get back 11 and 13.
num = function(image_id, category_id, data = categories) {x->y}
This is the data frame that I am using.
category_id image_id cat_to_img_last_update
1 15 15 NULL
2 11 11 NULL
3 13 13 NULL
4 10 10 NULL
5 35 35 NULL
6 78 78 NULL
7 112 112 NULL
8 61 61 NULL
9 86 86 NULL
10 101 101 NULL
11 61 61 NULL
12 86 86 NULL
You probably don't need a function for this, but if you really want, here is what it would look like:
# Read in data
categories <-
data.frame(category_id = c(15,11,13,10,35,78,112,61,86,101,61,86),
image_id = c(7,8,8,7,9,9,10,10,11,11,12,12),
stringsAsFactors = FALSE)
num <- function(image_id, data = categories) {
data$category_id[data$image_id == image_id]
}
num(7) # 15 10
num(8) # 11 13
df <- data.frame(
category_id = c(15, 11, 13, 10, 35, 78, 112, 61, 86, 101, 61, 86),
image_id = c(7, 8, 8, 7, 9, 9, 10, 10, 11, 11, 12, 12)
)
myfun <- function(num) { sort(df[df$image_id == num, "category_id"]) }
myfun(7)
myfun(8)
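If a function is not strictly needed, the lookup can also be built once with split(). A minimal sketch, assuming the categories data frame from the first answer (lookup is just an illustrative name):

```r
categories <- data.frame(
  category_id = c(15, 11, 13, 10, 35, 78, 112, 61, 86, 101, 61, 86),
  image_id    = c(7, 8, 8, 7, 9, 9, 10, 10, 11, 11, 12, 12)
)
# one list element per image_id, holding its category_ids in row order
lookup <- split(categories$category_id, categories$image_id)
lookup[["7"]]   # 15 10
lookup[["8"]]   # 11 13
```

Note that the list is indexed by the image_id as a character string, not by position.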
I am working with R version i386 3.1.1 and RStudio 0.99.442.
I have large datasets of tree species that I've collected from 7 plots, each of which is divided into 5 subplots (i.e. 35 distinct subplots). I am trying to get R to run through my dataset and print the species which are present within each plot.
I thought I could use "aggregate" to apply the "levels" function to the Species data column and have it return the Species present for each Plot and Subplot, however it returns the levels of the entire data frame (for 12 species, total) rather than the 3 or 4 species that are actually present in the Subplot.
To provide a reproducible example of what I'm trying to do, we can use the "warpbreaks" dataset that comes with R.
I convert the 'breaks' variable in warpbreaks to a factor variable to recreate the problem; it thus stands in for my 'species' variable, whereas 'warpbreaks$wool' represents 'plot' and 'warpbreaks$tension' represents 'subplot'.
require(stats)
warpbreaks$breaks = as.factor(warpbreaks$breaks)
aggregate(breaks ~ wool + tension, data = warpbreaks, FUN="levels")
If we look at the warpbreaks data, then for "Plot" A (wool) and "Subplot" L (tension) - the desired script would print the species "26, 30, 54, 25, etc."
breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L
7 51 A L
8 26 A L
9 67 A L
10 18 A M
11 21 A M
12 29 A M
...
Instead, R returns something of this sort, where it is printing ALL of the levels of the factor variable for ALL of the plots:
wool tension breaks.1 breaks.2 breaks.3 breaks.4 breaks.5 breaks...
1 A L 10 12 13 14 15 ...
2 B L 10 12 13 14 15 ...
3 A M 10 12 13 14 15 ...
4 B M 10 12 13 14 15 ...
5 A H 10 12 13 14 15 ...
6 B H 10 12 13 14 15 ...
How do I get it to print only the factors that are present within that Plot/Subplot combination? Am I totally off in my use of "aggregate"? I'd imagine this is a relatively easy task for an experienced R user...
First time stackoverflow post - would appreciate any help or nudges towards the right code!
Many kind thanks.
Try FUN=unique rather than FUN=levels. levels returns every level of the factor, as you have already surmised, even levels that do not occur in the group. unique(...) returns only the values that are actually present.
y <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN=unique)
wool tension breaks
1 A L 14, 18, 29, 13, 31, 28, 27, 30
2 B L 15, 4, 17, 9, 19, 23, 10, 26
3 A M 8, 11, 17, 7, 2, 20, 18, 21
4 B M 24, 14, 9, 6, 22, 16, 11, 17
5 A H 21, 11, 12, 8, 1, 25, 16, 5, 14
6 B H 10, 11, 12, 7, 3, 5, 6, 16
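The difference between levels and unique is easy to see on a small factor. A minimal sketch with made-up species names:

```r
f <- factor(c("oak", "pine", "oak", "birch"))
sub <- f[1:3]                  # "birch" no longer occurs in the subset

levels(sub)                    # still all three levels: "birch" "oak" "pine"
as.character(unique(sub))      # only the values present: "oak" "pine"
levels(droplevels(sub))        # unused levels dropped: "oak" "pine"
```

Subsetting a factor keeps its full set of levels, which is why levels inside aggregate reports every species for every group.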
NOTE: the breaks column is a little unusual. Each row of that column holds a vector of values instead of a single value (which is what you would normally expect in a data frame), i.e. each cell of the breaks column is NOT a string; it's a vector!
> class(y$wool)
[1] "factor"
> class(y$breaks) # list !!
[1] "list"
> y$breaks[[1]] # first row in breaks
[1] 26 30 54 25 70 52 51 67
Levels: 10 12 13 14 15 16 17 18 19 20 21 24 25 26 27 28 29 30 31 35 36 39 41 42 43 44 51 52 54 67 70
Note that to access the first element of the breaks column, instead of doing y$breaks[1] (like you would with the wool or tension column) you need to do y$breaks[[1]] because of this.
Data frames are not really meant to work like this; a single cell in a dataframe is supposed to have a single value, and most functions will expect a dataframe to conform to this, so just keep this in mind when doing future processing.
If you wanted to convert to a string, use (e.g.) FUN=function (x) paste(unique(x), collapse=', '); then y$breaks will be a column of strings and behaves as normal.
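A minimal sketch of that string variant, working on a copy of warpbreaks (wb is just an illustrative name) so the built-in dataset is left untouched:

```r
wb <- warpbreaks
wb$breaks <- as.factor(wb$breaks)

# collapse the values present in each group into one comma-separated string
y <- aggregate(breaks ~ wool + tension, data = wb,
               FUN = function(x) paste(unique(x), collapse = ", "))

class(y$breaks)   # "character"
y$breaks[1]       # "26, 30, 54, 25, 70, 52, 51, 67"
```

Since every cell is now a single string, y$breaks[1] works like it does for any other column.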
I have a large xts object and want to subset it by the seconds in the time index, keeping only runs of at least 5 consecutive seconds. I have up to 8 data points per second (which shouldn't be counted as 5 consecutive points, as they are measured within the same second).
And_sub_xts is my xts object
> str(And_sub_xts)
An ‘xts’ object on 2010-04-09 20:32:56/2010-04-26 06:56:57 containing:
Data: chr [1:164421, 1:11] "0.255416" "0.168836" "0.212126" "0.229442" "0.238100" "0.212126" "0.168836" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:11] "CalSurge" "CalSway" "CalHeave" "Stat_Surge" ...
Indexed by objects of class: [POSIXct,POSIXt] TZ:
xts Attributes:
NULL
and the first 100 values of
abs(diff(.indexsec(And_sub_xts))) are
56 8 23 34 40 40 41 42 25 27 34 35 38 38 40 40 41 56 59 59 19 19 20 20 20 20 22 22 23 23 24 24 24 25 25 26 27 27 27 27 27 28 28 30 30 30 37 38 40 40 41 44 44 46 46 47 48 51 52 54 54 54 54 55 56 59 1 4 4 4 6 6 6 6 7 7 11 12 12 14 14 15 16 16 17 18 18 19 19 21 21 22 22 23 23 25 25 26 26 26
I marked the keeps in bold, so the subset should just consist of these data points.
I just realized that theoretically the data points could be distributed like this:
2010-04-09 20:32:20
2010-04-09 20:32:20
2010-04-09 20:32:21
2010-04-09 20:32:22
2010-04-09 20:32:22
2010-04-09 20:40:22
2010-04-09 22:52:23
2010-04-10 20:52:24
which wouldn't be 5 consecutive seconds, but you can't account for this with the .indexsec command. Maybe somebody knows a way around this.
Thanks for your help!
Here's one way to do it. x is sample data that contains index values with seconds equal to your first 100 values.
require(xts)
# sample data
s <- c(56, 8, 23, 34, 40, 40, 41, 42, 25, 27, 34, 35, 38, 38, 40,
40, 41, 56, 59, 59, 19, 19, 20, 20, 20, 20, 22, 22, 23, 23, 24,
24, 24, 25, 25, 26, 27, 27, 27, 27, 27, 28, 28, 30, 30, 30, 37,
38, 40, 40, 41, 44, 44, 46, 46, 47, 48, 51, 52, 54, 54, 54, 54,
55, 56, 59, 1, 4, 4, 4, 6, 6, 6, 6, 7, 7, 11, 12, 12, 14, 14,
15, 16, 16, 17, 18, 18, 19, 19, 21, 21, 22, 22, 23, 23, 25, 25,
26, 26, 26)
S <- cumsum(ifelse(c(0, diff(s)) < 0, 1, 0)) * 60 + s
x <- .xts(seq_along(S), S, tzone="UTC")
The basic idea is to aggregate your data to 1-second resolution, so you can use rle (run-length encoding) to find the consecutive 5-second observations. Then find the first and last timestamps of the sets of 5-second observations in your aggregated data, and then find the locations of those timestamps in your original data. Finally, use the locations of the timestamps in your original data to create sets of sequences you can use to subset the consecutive 5-second groups of observations.
# aggregate data to 1-second resolution
oneSec <- period.apply(x, endpoints(x, 'seconds'), identity)
# find the runs of 5 or more consecutive one-second increments
consec <- rle(diff(.index(oneSec)))
gte5s <- consec$lengths >= 5
# get the location of the first obs of the run in the 1-second data
begLoc <- cumsum(c(1,consec$lengths))[gte5s]
endLoc <- begLoc + consec$lengths[gte5s]
# get the timestamp of the first and last obs from the original data
beg <- lapply(index(oneSec)[begLoc], function(i) first(x[i, which.i=TRUE]))
end <- lapply(index(oneSec)[endLoc], function(i) last(x[i, which.i=TRUE]))
# create index vector between each value in 'beg' and 'end'
loc <- unlist(mapply(seq, beg, end))
# subset original object using index vector
X <- x[loc,]
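The same idea can be sketched without xts, which may make the logic easier to follow. This is only an illustration under stated assumptions: the timestamps are made up, they are assumed to be sorted, and absolute seconds since the epoch are used instead of .indexsec, so points that only look consecutive within a minute are not counted as a run.

```r
# made-up, sorted timestamps; several observations can share one second
start <- as.POSIXct("2010-04-09 20:32:19", tz = "UTC")
offs  <- c(0, 0, 1, 1, 3, 3, 4, 4, 5, 6, 6, 7, 8, 21, 22)
times <- start + offs

secs <- floor(as.numeric(times))          # absolute second of each observation
u    <- unique(secs)                      # one entry per distinct second

# a new run starts wherever the gap to the previous second is not exactly 1
run_id  <- cumsum(c(TRUE, diff(u) != 1))
run_len <- ave(u, run_id, FUN = length)   # length of the run each second is in

# keep every observation whose second belongs to a run of 5 or more seconds
keep <- secs %in% u[run_len >= 5]
kept <- times[keep]
kept
```

Here only the observations at offsets 3 to 8 survive, because that is the only stretch covering at least 5 consecutive seconds.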
I have this data frame :
> df
Z freq proba
1 17 1 0.0033289263
2 18 4 0.0055569026
3 19 2 0.0087878028
4 20 3 0.0132023556
5 21 16 0.0188900561
6 22 12 0.0257995234
7 23 30 0.0337042731
8 24 41 0.0421963455
9 25 56 0.0507149437
10 26 65 0.0586089198
11 27 65 0.0652230449
12 28 93 0.0699913154
13 29 82 0.0725182432
14 30 94 0.0726318551
15 31 72 0.0703990113
16 32 74 0.0661024717
17 33 58 0.0601873020
18 34 66 0.0531896431
19 35 38 0.0456625487
20 36 45 0.0381117389
21 37 27 0.0309498221
22 38 17 0.0244723502
23 39 15 0.0188543771
24 40 13 0.0141629367
25 41 4 0.0103793600
26 42 1 0.0074254435
27 43 2 0.0051886582
28 45 1 0.0023658767
29 46 1 0.0015453804
30 49 2 0.0003792308
# Here is my data:
> dput(df)
structure(list(Z = c(17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 45, 46, 49), freq = c(1, 4, 2, 3, 16, 12, 30, 41, 56, 65,
65, 93, 82, 94, 72, 74, 58, 66, 38, 45, 27, 17, 15, 13, 4, 1,
2, 1, 1, 2), proba = c(0.0033289262662263, 0.00555690264007235,
0.00878780282243439, 0.0132023555702843, 0.0188900560866825,
0.0257995234198431, 0.0337042730520012, 0.0421963455163949, 0.0507149437492447,
0.0586089198012906, 0.0652230449359029, 0.0699913153996099, 0.0725182432348992,
0.0726318551493006, 0.0703990113442269, 0.0661024716831246, 0.0601873020200862,
0.0531896430528685, 0.045662548708844, 0.0381117389181843, 0.030949822142559,
0.0244723501557229, 0.01885437705459, 0.0141629366839816, 0.0103793599644779,
0.00742544354411115, 0.00518865818999788, 0.00236587669133322,
0.00154538036835848, 0.000379230768851682)), .Names = c("Z",
"freq", "proba"), row.names = c(NA, -30L), class = "data.frame")
I want to merge each row whose "freq" value is < 5 with the next row, and keep merging as long as the next row is also < 5.
I don't know if that is clear enough, so this is the output I expect:
> df2
labels effectifs pi
1 17;20 10 0.03087599
2 21 16 0.01889006
3 22 12 0.02579952
4 23 30 0.03370427
5 24 41 0.04219635
6 25 56 0.05071494
7 26 65 0.05860892
8 27 65 0.06522304
9 28 93 0.06999132
10 29 82 0.07251824
11 30 94 0.07263186
12 31 72 0.07039901
13 32 74 0.06610247
14 33 58 0.06018730
15 34 66 0.05318964
16 35 38 0.04566255
17 36 45 0.03811174
18 37 27 0.03094982
19 38 17 0.02447235
20 39 15 0.01885438
21 40 13 0.01416294
22 41;49 11 0.02728395
I did it with nested while loops, but I find this solution painful and inefficient.
i <- 1
freqs <- c()
labels <- c()
pi <- c()
while(i < nrow(df)) {
  if (df$freq[i] >= 5) {
    freqs <- c(freqs, df$freq[i])
    labels <- c(labels, df$Z[i])
    pi <- c(pi, df$proba[i])
    i <- i + 1
  }
  else {
    count <- df$freq[i]
    countPi <- df$proba[i]
    k <- i
    j <- i
    while(df$freq[i] < 5 & i < nrow(df)) {
      if (df$freq[i+1] < 5) {
        count <- count + df$freq[i+1]
        countPi <- countPi + df$proba[i+1]
        j <- i + 1
      }
      i <- i + 1
    }
    labels <- c(labels, paste0(df$Z[k], ";", df$Z[j]))
    freqs <- c(freqs, count)
    pi <- c(pi, countPi)
  }
}
df2 <- data.frame(labels, freqs, pi)
I'm sure there is a far better way, maybe with dplyr. If you have a better solution, thanks!
We could use the "devel" version of "data.table", which introduces new functions such as rleid. Here, we convert the "data.frame" to a "data.table" (setDT(df)) and create a grouping variable ("gr") from the logical index (freq < 5) using rleid. The 'Z' column is of class 'numeric/integer', so we create a character column ("Z1") from it. Grouped by 'gr': if "freq" is less than 5 for all the elements of the group, we summarise the rows into a single row by taking the first observation of the columns (.SD[1L]), drop the unwanted column (.SD includes "Z1", which would otherwise result in duplicate columns), and append the "Z1" that we get from pasting the min and max values of "Z" for that group. Otherwise, the group is left unchanged (else .SD). Finally, we remove the columns that we don't need by assigning them to "NULL".
library(data.table) #data.table_1.9.5
res <- setDT(df)[, gr:=rleid(freq<5)][, Z1:= as.character(Z)][,
if(all(freq<5)) c(.SD[1L][,-4, with=FALSE],
list(Z1=toString(c(min(Z), max(Z)))))
else .SD, gr][,1:2 :=NULL][]
head(res,3)
# freq proba Z1
#1: 1 0.003328926 17, 20
#2: 16 0.018890056 21
#3: 12 0.025799523 22
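To see what the run-based grouping variable looks like, here is a minimal base R sketch that imitates the run ids with rle. It is not data.table's own implementation, only an illustration on the first few freq values:

```r
freq  <- c(1, 4, 2, 3, 16, 12, 30)
small <- freq < 5

# number the runs of identical values in 'small'
r  <- rle(small)
gr <- rep(seq_along(r$lengths), r$lengths)
gr   # 1 1 1 1 2 2 2
```

Rows with freq >= 5 that follow each other also end up in one run, which is why the answer only collapses a group when all of its freq values are below 5.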
Since this is a dplyr question, here is a dplyr solution. First I define a grouping function to build the groups (similar to the rleid function in data.table). After that, the summary is fairly simple.
# grouping function
grouping <- function(condition){
# calculate runs for grouping
run <- rle((!condition) * 1:length(condition))
# revalue
run$values <- seq_along(run$values)
# invert to get grouping
inverse.rle(run)
}
# load dplyr
require(dplyr)
df %>%
mutate(group = grouping(freq<5)) %>% # add groups
group_by(group) %>% # group data
summarize(freq = sum(freq), # sum freq
proba = sum(proba), # sum proba
Z = toString(unique(range(Z)))) %>% # rename Z
mutate(group=NULL) # remove groups
## Source: local data table [22 x 3]
##
## freq proba Z
## 1 10 0.03087599 17, 20
## 2 16 0.01889006 21
## 3 12 0.02579952 22
## 4 30 0.03370427 23
## 5 41 0.04219635 24
## 6 56 0.05071494 25
## 7 65 0.05860892 26
## 8 65 0.06522304 27
## 9 93 0.06999132 28
## 10 82 0.07251824 29
## .. ... ... ...
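For comparison, the same grouping and summary can be sketched in base R only. This is a minimal illustration on a shortened excerpt of the question's data (the rows for Z = 23 to 40 and the last three rows are left out, and proba is rounded); grp and res are illustrative names:

```r
df <- data.frame(
  Z     = c(17, 18, 19, 20, 21, 22, 41, 42, 43),
  freq  = c(1, 4, 2, 3, 16, 12, 4, 1, 2),
  proba = c(0.0033, 0.0056, 0.0088, 0.0132, 0.0189, 0.0258,
            0.0104, 0.0074, 0.0052)
)

small <- df$freq < 5
n     <- length(small)
# start a new group unless this row and the previous one are both small
grp <- cumsum(c(TRUE, !(small[-1] & small[-n])))

res <- data.frame(
  labels = as.vector(tapply(df$Z, grp,
                            function(z) paste(unique(range(z)), collapse = ";"))),
  freq   = as.vector(tapply(df$freq, grp, sum)),
  proba  = as.vector(tapply(df$proba, grp, sum)),
  stringsAsFactors = FALSE
)
res
```

Each run of consecutive small rows becomes one row labelled with its first and last Z, while every other row stays on its own.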
I have this data.frame:
structure(list(X0 = c(9, 13, 13, 13, 35, 36, 37, 38, 39, 40,
40, 42, 43, 44), X0.1 = c(10, 40, 45, 46, 36, 37, 38, 40, 46,
45, 46, 43, 44, 46)), .Names = c("A", "B"), row.names = c(NA,
14L), class = "data.frame")
A B
1 9 10
2 13 40
3 13 45
4 13 46
5 35 36
6 36 37
7 37 38
8 38 40
9 39 46
10 40 45
11 40 46
12 42 43
13 43 44
14 44 46
I want to create sets like this: rows 2, 3 and 4 all contain 13, so they are grouped into one set (13,40,45,46).
If any other row has even one member in common with this set, both members of that row are added to the set.
Since row 8 has 40 in common with the set above, the set grows to (13,40,45,46,38).
Row 7 now has one member (38) in common with this set, so its other member (37) is also added. The set becomes (13,40,45,46,38,37).
If neither of the 2 members of a row is in any existing set, they form their own set. For example, row 1 has 9 and 10, neither of which appears in any other row, so they form the set (9,10).
At the end I want to print out all sets.
Can I accomplish this in R? Thanks for your help.
Is this what you want?
f <- function(s, v) {
m <- which(s$A %in% v | s$B %in% v)
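# stop when no remaining row shares a member with the current set v;
# otherwise absorb the matching rows into v and recurse on the rest
if (length(m) == 0) v
else Recall(s[-m, ], sort(unique(c(v, c(unlist(s[m, ]))))))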
}
done <- c()
for(n in unique(unlist(d))) {
if (n %in% done) next
r <- f(d, n)
done <- c(done, r)
cat("(", r, ") ")
}
it outputs
( 9 10 ) ( 13 35 36 37 38 39 40 42 43 44 45 46 )
Updated
done <- c()
ret <- list()
for(n in unique(unlist(d))) {
if (n %in% done) next
r <- f(d, n)
done <- c(done, r)
cat("(", r, ") ")
ret <- c(ret, list(r))
}
then,
> ret
[[1]]
[1] 9 10
[[2]]
[1] 13 35 36 37 38 39 40 42 43 44 45 46
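These sets are the connected components of the graph whose edges are the rows of the data frame. A minimal alternative sketch in base R without recursion, assuming the data frame is called d as in the answer above (the other variable names are illustrative). Every member starts with its own label, and the smaller label is spread along each row until nothing changes:

```r
d <- data.frame(
  A = c(9, 13, 13, 13, 35, 36, 37, 38, 39, 40, 40, 42, 43, 44),
  B = c(10, 40, 45, 46, 36, 37, 38, 40, 46, 45, 46, 43, 44, 46)
)

nodes <- sort(unique(unlist(d)))
label <- setNames(seq_along(nodes), nodes)   # one label per member

repeat {
  changed <- FALSE
  for (i in seq_len(nrow(d))) {
    a <- as.character(d$A[i])
    b <- as.character(d$B[i])
    m <- min(label[a], label[b])
    if (label[a] != m || label[b] != m) {
      label[c(a, b)] <- m    # both members of a row share the smaller label
      changed <- TRUE
    }
  }
  if (!changed) break
}

# members with the same final label form one set
sets <- unname(split(nodes, label))
sets
```

For this data the loop gives the same two sets as the recursive version.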