R: rearrange data

I have a bunch of texts written by the same person, and I'm trying to estimate the templates they use for each text. The way I'm going about this is:
1. create a TermDocumentMatrix for all the texts
2. take the raw Euclidean distance of each pair
3. cut out any pair greater than X distance (10 for the sake of argument)
4. flatten the forest
5. return one example of each template with some summarized stats
I'm able to get to the point of having the distance pairs, but I am unable to convert the dist instance to something I can work with. There is a reproducible example at the bottom.
The dist instance prints as a lower-triangular matrix of pairwise distances. Its row and column names correspond to indexes in the original list of texts, which I can use to achieve step 5.
What I have been trying to get out of this is a sparse matrix with col name, row name, value.
col, row, value
1 2 14.966630
1 3 12.449900
1 4 13.490738
1 5 12.688578
1 6 12.369317
2 3 12.449900
2 4 13.564660
2 5 12.922848
2 6 12.529964
3 4 5.385165
3 5 5.830952
3 6 5.830952
4 5 7.416198
4 6 7.937254
5 6 7.615773
From this point I would be comfortable cutting out all pairs greater than my cutoff and flattening the forest, i.e. returning 3 templates in this example: a group containing only document 1, a group containing only document 2, and a third group containing documents 3, 4, 5, and 6.
I have tried a bunch of things, from creating a matrix out of this and then trying to make it sparse, to directly using the vector inside the dist class, and I just can't seem to figure it out.
Reproducible example:
tdm <- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,3,1,2,2,2,3,2,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,2,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,4,1,1,1,1,1,0,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,2,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,1,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,1,1,1,1,0,1,0,1,0,0,2,0,0,0,0,0,1,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,3,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,1,0,0,2,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,3,1,1,1,1,0,1,0,0,0,0,1,2,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,1,1,1,2,1,1,1,0,0,0,0,1,2,2,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,2,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,2,0,2,2,3,2,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,2,1,1,1,1,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,1,0,0,1,1,1,0,0,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,2,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,2,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,3,0,1,1,1,1,0,0,1,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,4,2,4,6,4,3,1,0,1,2,1,1,0,1,0,0,0,0,2,0,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,2,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,2,1,2,2,2,2,1,0,1,2,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,2,2,2,2,2,2,3,3,4,5,3,1,2,1,1,1,1,1,1,0,0,0,0,3,3,0,0,1,1,0,1,0,0,0,0), nrow=6)
rownames(tdm) <- 1:6
colnames(tdm) <- paste("term", 1:229, sep="")
tdm.dist <- dist(tdm)
# I'm stuck turning tdm.dist into what I have shown

A classic approach to turn a "matrix"-like object to a [row, col, value] "data.frame" is the as.data.frame(as.table(.)) route. Specifically here, we need:
subset(as.data.frame(as.table(as.matrix(tdm.dist))), as.numeric(Var1) < as.numeric(Var2))
But that involves too many coercions and creates a larger object only to subset it immediately.
Since dist stores its values in lower-triangle form (cf. lower.tri), we can use combn to generate the row/col indices and bind them with the "dist" values:
data.frame(do.call(rbind, combn(attr(tdm.dist, "Size"), 2, simplify = FALSE)), c(tdm.dist))
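For readability you could also name the resulting columns and apply the cutoff from the question right away (a small sketch; the names row/col/value are my choice, not part of the original answer):
res <- setNames(data.frame(do.call(rbind, combn(attr(tdm.dist, "Size"), 2, simplify = FALSE)),
                           c(tdm.dist)),
                c("row", "col", "value"))
subset(res, value <= 10)  # pairs within the distance cutoff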
Also, the "Matrix" package has some flexibility that, along with its memory efficiency in creating objects, could be used here:
library(Matrix)
tmp = combn(attr(tdm.dist, "Size"), 2)
summary(sparseMatrix(i = tmp[2, ], j = tmp[1, ], x = c(tdm.dist),
                     dims = rep_len(attr(tdm.dist, "Size"), 2), symmetric = TRUE))
Additionally, among the different functions that handle "dist" objects directly, hclust() with cutree() can do the grouping by specifying the cut height:
cutree(hclust(tdm.dist), h = 10)
#1 2 3 4 5 6
#1 2 3 3 3 3
This matches the three templates described in the question: documents 1 and 2 each on their own, and documents 3-6 together.
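If it helps, the grouping vector can also drive the question's final step directly; a minimal sketch (picking the lowest-numbered document as the representative is my own convention):
groups <- cutree(hclust(tdm.dist), h = 10)
split(seq_along(groups), groups)               # documents belonging to each template
sapply(split(seq_along(groups), groups), min)  # one representative per template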

Here's how I've done a very similar thing in the past using the dplyr and tidyr packages.
You can run the chained (%>%) script row by row to see how the dataset is updated step by step.
tdm <- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,3,1,2,2,2,3,2,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,2,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,4,1,1,1,1,1,0,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,2,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,1,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,1,1,1,1,0,1,0,1,0,0,2,0,0,0,0,0,1,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,3,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,1,0,0,2,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,3,1,1,1,1,0,1,0,0,0,0,1,2,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,1,1,1,2,1,1,1,0,0,0,0,1,2,2,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,2,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,2,0,2,2,3,2,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,2,1,1,1,1,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,1,0,0,1,1,1,0,0,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,2,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,2,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,3,0,1,1,1,1,0,0,1,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,4,2,4,6,4,3,1,0,1,2,1,1,0,1,0,0,0,0,2,0,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,2,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,2,1,2,2,2,2,1,0,1,2,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,2,2,2,2,2,2,3,3,4,5,3,1,2,1,1,1,1,1,1,0,0,0,0,3,3,0,0,1,1,0,1,0,0,0,0), nrow=6)
rownames(tdm) <- 1:6
colnames(tdm) <- paste("term", 1:229, sep="")
tdm.dist <- dist(tdm)
library(dplyr)
library(tidyr)
tdm.dist %>%
  as.matrix() %>%                    # convert the dist object to a matrix
  data.frame() %>%                   # convert the matrix to a data frame
  setNames(nm = 1:ncol(.)) %>%       # update column names
  mutate(names1 = 1:nrow(.)) %>%     # use row numbers as a variable
  gather(names2, value, -names1) %>% # reshape data to long format
  filter(names1 <= names2)           # keep each pair of values only once
# names1 names2 value
# 1 1 1 0.000000
# 2 1 2 14.966630
# 3 2 2 0.000000
# 4 1 3 12.449900
# 5 2 3 12.449900
# 6 3 3 0.000000
# 7 1 4 13.490738
# 8 2 4 13.564660
# 9 3 4 5.385165
# 10 4 4 0.000000
# 11 1 5 12.688578
# 12 2 5 12.922848
# 13 3 5 5.830952
# 14 4 5 7.416198
# 15 5 5 0.000000
# 16 1 6 12.369317
# 17 2 6 12.529964
# 18 3 6 5.830952
# 19 4 6 7.937254
# 20 5 6 7.615773
# 21 6 6 0.000000
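From here, dropping the diagonal and applying the question's cutoff is one more filter; a sketch assuming the pipeline result above is saved as pairs (my name), and remembering that gather() made names2 a character column:
pairs %>%
  filter(names1 != as.numeric(names2),  # drop the diagonal
         value <= 10)                   # keep only pairs within the cutoff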


Update dataframe B with values from dataframe A in R

I am doing social network analysis and working with two data frames. Dataframe A (or "nodes") has the information related to each node of the network (i.e. id and name). Dataframe B (or "links") has two columns, "from" and "to", which show how the nodes are connected to each other. Each row represents a link "from" one node "to" another.
I want to use the package networkD3 to visualize the network, but it has some requirements: ids should start from zero and they have to be consecutive (0, 1, 2, etc.). Because my nodes and links are a random subset from a larger database, they are not consecutive.
I sorted the "nodes" data frame based on the id and created a new column (new_id) starting from zero and with consecutive numbers. But now I don't know how to update the "links" data frame based on the new_ids.
Currently, I am converting the values in the "links" data frame to characters and then revaluing them using the plyr package, but I need to do this for a larger dataset.
I am copying a sample of the two data frames that I have now:
set.seed(10)
nodes_df <- data.frame(id = c(1,3,5,6,8,10),
                       name = c("Agriculture", "Agriculture_in_Mesoamerica",
                                "Agriculture_in_ancient_Greece", "Agriculture_in_ancient_Rome",
                                "Agriculture_in_India", "Agriculture_in_China"),
                       new_id = seq(0,5))
links_df <- data.frame(from = c(3,3,5,6,8,10),
                       to = c(1,5,6,8,10,3))
In summary, I need to update the values in the links_df to correspond to the new_id values from the nodes_df.
In base R you can use merge. One caveat: merge() sorts its result by the join column, so extracting a single column and assigning it back positionally can scramble the values. Merging the whole data frame keeps each row's values aligned:
links_df <- merge(links_df, nodes_df[, c("id", "new_id")],
                  by.x = "to", by.y = "id", all.x = TRUE)
names(links_df)[names(links_df) == "new_id"] <- "new_to"
links_df <- merge(links_df, nodes_df[, c("id", "new_id")],
                  by.x = "from", by.y = "id", all.x = TRUE)
names(links_df)[names(links_df) == "new_id"] <- "new_from"
links_df <- links_df[, c("from", "to", "new_from", "new_to")] # reorder columns
links_df
  from to new_from new_to
1    3  1        1      0
2    3  5        1      2
3    5  6        2      3
4    6  8        3      4
5    8 10        4      5
6   10  3        5      1
An alternative to merging or joining could be to use recode. A solution (based on the tidyverse) could look as follows.
library(dplyr)
library(tibble)
swap <- deframe(tibble(id = nodes_df$id, new_id = nodes_df$new_id))
links_df %>%
  mutate(new_from = recode(from, !!!swap),
         new_to = recode(to, !!!swap))
# from to new_from new_to
# 1 3 1 1 0
# 2 3 5 1 2
# 3 5 6 2 3
# 4 6 8 3 4
# 5 8 10 4 5
# 6 10 3 5 1
Technically speaking, networkD3 expects the values in the links data frame to be the (zero-based) index of the nodes they refer to in the nodes data frame. So the first row/node in the nodes data frame is 0, and so forth.
You can use match() to determine the 1-based index of each element in a vector in a target vector, and subtract 1 to get a 0-based index.
links_df$from
#> [1] 3 3 5 6 8 10
nodes_df$id
#> [1] 1 3 5 6 8 10
match(links_df$from, nodes_df$id) - 1
#> [1] 1 1 2 3 4 5
links_df$to
#> [1] 1 5 6 8 10 3
nodes_df$id
#> [1] 1 3 5 6 8 10
match(links_df$to, nodes_df$id) - 1
#> [1] 0 2 3 4 5 1
Created on 2021-03-28 by the reprex package (v1.0.0)
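To write the 0-based indices back onto the links data frame (a small follow-up sketch, not part of the reprex above):
links_df$from <- match(links_df$from, nodes_df$id) - 1
links_df$to <- match(links_df$to, nodes_df$id) - 1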

Add a column in a data frame depending of another column [duplicate]

I have the data.frame below. I want to add a column 'g' that classifies my data according to consecutive sequences in column h_no. That is, the first sequence of h_no 1, 2, 3, 4 is group 1, the second series of h_no (1 to 7) is group 2, and so on, as indicated in the last column 'g'.
h_no h_freq h_freqsq g
1 0.09091 0.008264628 1
2 0.00000 0.000000000 1
3 0.04545 0.002065702 1
4 0.00000 0.000000000 1
1 0.13636 0.018594050 2
2 0.00000 0.000000000 2
3 0.00000 0.000000000 2
4 0.04545 0.002065702 2
5 0.31818 0.101238512 2
6 0.00000 0.000000000 2
7 0.50000 0.250000000 2
1 0.13636 0.018594050 3
2 0.09091 0.008264628 3
3 0.40909 0.167354628 3
4 0.04545 0.002065702 3
You can add a column to your data using various techniques. The quotes below come from the "Details" section of the relevant help text, [[.data.frame.
Data frames can be indexed in several modes. When [ and [[ are used with a single vector index (x[i] or x[[i]]), they index the data frame as if it were a list.
my.dataframe["new.col"] <- a.vector
my.dataframe[["new.col"]] <- a.vector
The data.frame method for $, treats x as a list
my.dataframe$new.col <- a.vector
When [ and [[ are used with two indices (x[i, j] and x[[i, j]]) they act like indexing a matrix
my.dataframe[ , "new.col"] <- a.vector
The data.frame method assumes that if you don't specify whether you're working with columns or rows, you mean columns.
For your example, this should work:
# make some fake data
your.df <- data.frame(no = c(1:4, 1:7, 1:5), h_freq = runif(16), h_freqsq = runif(16))
# find where one appears and
from <- which(your.df$no == 1)
to <- c((from-1)[-1], nrow(your.df)) # up to which point the sequence runs
# generate a sequence (len) and based on its length, repeat a consecutive number len times
get.seq <- mapply(from, to, 1:length(from), FUN = function(x, y, z) {
  len <- length(seq(from = x[1], to = y[1]))
  return(rep(z, times = len))
})
# when we unlist, we get a vector
your.df$group <- unlist(get.seq)
# and append it to your original data.frame. since this is
# designating a group, it makes sense to make it a factor
your.df$group <- as.factor(your.df$group)
no h_freq h_freqsq group
1 1 0.40998238 0.06463876 1
2 2 0.98086928 0.33093795 1
3 3 0.28908651 0.74077119 1
4 4 0.10476768 0.56784786 1
5 1 0.75478995 0.60479945 2
6 2 0.26974011 0.95231761 2
7 3 0.53676266 0.74370154 2
8 4 0.99784066 0.37499294 2
9 5 0.89771767 0.83467805 2
10 6 0.05363139 0.32066178 2
11 7 0.71741529 0.84572717 2
12 1 0.10654430 0.32917711 3
13 2 0.41971959 0.87155514 3
14 3 0.32432646 0.65789294 3
15 4 0.77896780 0.27599187 3
16 5 0.06100008 0.55399326 3
More simply, if your data frame is A:
b <- A[, 1]
b <- b == 1
b <- cumsum(b)
Then b is your group column.
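The same idea as a single assignment on the data frame (a sketch, assuming the h_no column is column 1 as above):
A$g <- cumsum(A[, 1] == 1)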
If I understand the question correctly, you want to detect when the h_no doesn't increase and then increment the class. (I'm going to walk through how I solved this problem; there is a self-contained function at the end.)
Working
We only care about the h_no column for the moment, so we can extract that from the data frame:
> h_no <- data$h_no
We want to detect when h_no doesn't go up, which we can do by working out when the difference between successive elements is either negative or zero. R provides the diff function which gives us the vector of differences:
> d.h_no <- diff(h_no)
> d.h_no
[1] 1 1 1 -3 1 1 1 1 1 1 -6 1 1 1
Once we have that, it is a simple matter to find the ones that are non-positive:
> nonpos <- d.h_no <= 0
> nonpos
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[13] FALSE FALSE
In R, TRUE and FALSE are basically the same as 1 and 0, so if we get the cumulative sum of nonpos, it will increase by 1 in (almost) the appropriate spots. The cumsum function (which is basically the opposite of diff) can do this.
> cumsum(nonpos)
[1] 0 0 0 1 1 1 1 1 1 1 2 2 2 2
But, there are two problems: the numbers are one too small; and, we are missing the first element (there should be four in the first class).
The first problem is simply solved: 1+cumsum(nonpos). And the second just requires adding a 1 to the front of the vector, since the first element is always in class 1:
> classes <- c(1, 1 + cumsum(nonpos))
> classes
[1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
Now, we can attach it back onto our data frame with cbind (by using the class= syntax, we can give the column the class heading):
> data_w_classes <- cbind(data, class=classes)
And data_w_classes now contains the result.
Final result
We can compress the lines together and wrap it all up into a function to make it easier to use:
classify <- function(data) {
  cbind(data, class = c(1, 1 + cumsum(diff(data$h_no) <= 0)))
}
Or, since it makes sense for the class to be a factor:
classify <- function(data) {
  cbind(data, class = factor(c(1, 1 + cumsum(diff(data$h_no) <= 0))))
}
You use either function like:
> classified <- classify(data) # doesn't overwrite data
> data <- classify(data) # data now has the "class" column
(This method of solving the problem is good because it avoids explicit iteration, which is generally recommended for R, and avoids generating lots of intermediate vectors and lists. And also it's kinda neat how it can be written on one line :) )
In addition to Roman's answer, something like this might be even simpler. Note that I haven't tested it because I do not have access to R right now.
# Note that I use a global variable here
# normally not advisable, but I liked the
# use here to make the code shorter
index <- 0
new_column <- sapply(df$h_no, function(x) {
  if (x == 1) index <<- index + 1  # must be <<- so the global counter is updated
  return(index)
})
The function iterates over the values in h_no and always returns the category that the current value belongs to. If a value of 1 is detected, we increase the global variable index and continue.
An approach based on identifying the number of groups (x in mapply) and the length of each group (y in mapply):
mytb<-read.table(text="h_no h_freq h_freqsq group
1 0.09091 0.008264628 1
2 0.00000 0.000000000 1
3 0.04545 0.002065702 1
4 0.00000 0.000000000 1
1 0.13636 0.018594050 2
2 0.00000 0.000000000 2
3 0.00000 0.000000000 2
4 0.04545 0.002065702 2
5 0.31818 0.101238512 2
6 0.00000 0.000000000 2
7 0.50000 0.250000000 2
1 0.13636 0.018594050 3
2 0.09091 0.008264628 3
3 0.40909 0.167354628 3
4 0.04545 0.002065702 3", header=T, stringsAsFactors=F)
mytb$group<-NULL
positionsof1s<-grep(1,mytb$h_no)
mytb$newgroup <- unlist(mapply(function(x, y) rep(x, y),  # repeat x, y times
  x = 1:length(positionsof1s),          # x: 1 up to the number of groups (g1:g3)
  y = c(diff(positionsof1s),            # y: sizes of all groups but the last (4, 7)
        nrow(mytb) -                    # size of the last group (g3):
          (positionsof1s[length(positionsof1s)] - 1))))  # rows remaining after the last group starts
mytb
I believe that using "cbind" is the simplest way to add a column to a data frame in R. Below is an example:
myDf = data.frame(index=seq(1,10,1), Val=seq(1,10,1))
newCol= seq(2,20,2)
myDf = cbind(myDf,newCol)
The data.table function rleid is handy for things like this. We subtract the sequence 1:nrow(data) to transform consecutive sequences to constants, and then use rleid to create the group IDs:
data$g = data.table::rleid(data$h_no - 1:nrow(data))
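A quick check on the question's data (my reconstruction; the printed result is what rleid should return, not output copied from the original answer):
data <- data.frame(h_no = c(1:4, 1:7, 1:4))
data.table::rleid(data$h_no - 1:nrow(data))
#  [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3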
# presumably cut() was intended here; as.integer() has no breaks argument
Data.frame[, 'h_new_column'] <- as.integer(cut(Data.frame[, 'h_no'], breaks = c(0, 4, 7)))

Using two grouping designations to create one 'combined' grouping variable

Given a data.frame:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
#> df
# grp1 grp2
#1 1 1
#2 1 2
#3 1 3
#4 2 3
#5 2 4
#6 2 5
#7 3 6
#8 3 7
#9 3 8
#10 4 6
#11 4 9
#12 4 10
Both columns are grouping variables, such that all 1's in column grp1 are known to be grouped together, and so on with all 2's, etc. The same goes for grp2: all 1's are known to be the same, all 2's the same.
Thus, if we look at the 3rd and 4th row, based on column 1 we know that the first 3 rows can be grouped together and the second 3 rows can be grouped together. Then since rows 3 and 4 share the same grp2 value, we know that all 6 rows, in fact, can be grouped together.
Based off the same logic we can see that the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).
Aside from writing a fairly involved set of for() loops, is there a more straightforward approach to this? I haven't been able to think of one yet.
The final output that I'm hoping to obtain would look something like:
# > df
# grp1 grp2 combinedGrp
# 1 1 1 1
# 2 1 2 1
# 3 1 3 1
# 4 2 3 1
# 5 2 4 1
# 6 2 5 1
# 7 3 6 2
# 8 3 7 2
# 9 3 8 2
# 10 4 6 2
# 11 4 9 2
# 12 4 10 2
Thank you for any direction on this topic!
I would define a graph and label nodes according to connected components:
gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))
oldcols = unique(gmap$ind)
newcols = paste0("node_", oldcols)
df[ newcols ] = lapply(oldcols, function(i) with(gmap[gmap$ind == i, ],
  node[ match(df[[i]], values) ]
))
library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
gmap$group = components(g)$membership
df$group = gmap$group[ match(df$node_grp1, gmap$node) ]
grp1 grp2 node_grp1 node_grp2 group
1 1 1 1 5 1
2 1 2 1 6 1
3 1 3 1 7 1
4 2 3 2 7 1
5 2 4 2 8 1
6 2 5 2 9 1
7 3 6 3 10 2
8 3 7 3 11 2
9 3 8 3 12 2
10 4 6 4 10 2
11 4 9 4 13 2
12 4 10 4 14 2
Each unique element of grp1 or grp2 is a node and each row of df is an edge.
One way to do this is via a matrix that defines links between rows based on group membership.
This approach is related to @Frank's graph answer but uses an adjacency matrix rather than an edge list to define the graph. An advantage of this approach is that it immediately handles more than two grouping columns with the same code (so long as you write the function that determines links flexibly). A disadvantage is that you need to make all pairwise comparisons between rows to construct the matrix, so for very long vectors it could be slow. As is, @Frank's answer would work better for very long data, or if you only ever have two columns.
The steps are
1. compare rows based on groups and define these rows as linked (i.e., create a graph)
2. determine connected components of the graph defined by the links in 1.
You could do 2 a few ways. Below I show a brute-force way where you 2a) collapse links until reaching a stable link structure using matrix multiplication, and 2b) convert the link structure to a factor using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.
1. construct an adjacency matrix (matrix of pairwise links) between rows
(i.e., if two rows are in the same group, the matrix entry is 1; otherwise it's 0). First, make a helper function that determines whether two rows are linked:
linked_rows <- function(data){
  ## helper function
  ## returns a _function_ to compare two rows of data
  ## based on group membership.
  ## Use Vectorize so it works even on vectors of indices
  Vectorize(function(i, j) {
    ## numeric: 1 = i and j have overlapping group membership
    common <- vapply(names(data), function(name)
      data[i, name] == data[j, name],
      FUN.VALUE = FALSE)
    as.numeric(any(common))
  })
}
which I use in outer to construct a matrix,
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
2a. collapse 2-degree links to 1-degree links. That is, if rows are linked by an intermediate node but not directly linked, lump them in the same group by defining a link between them.
One iteration involves: i) matrix multiply to get the square of A, and
ii) set any non-zero entry in the squared matrix to 1 (as if it were a first degree, pairwise link)
## define as a function to use below
lump_links <- function(A) {
  A <- A %*% A
  A[A > 0] <- 1
  A
}
repeat this until the links are stable:
oldA <- 0
while (any(oldA != A)) {
  oldA <- A
  A <- lump_links(A)
}
2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this a variety of ways.
One way, is to first define a distance object, then use hclust and cutree. If you think about it, we want to define linked (A[i,j] == 1) as distance 0. So the steps are a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, c) cut the tree at zero height (i.e., zero distance):
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
In practice you can encode steps 1 - 2 in a single function that uses the helper lump_links and linked_rows:
lump <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))
  oldA <- 0
  while (any(oldA != A)) {
    oldA <- A
    A <- lump_links(A)
  }
  df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
  df
}
This works for the original df and also for the structure in @rawr's answer
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
13 5 11 1
14 5 3 1
15 6 12 3
16 7 3 1
17 8 6 2
18 9 12 3
PS
Here's a version using igraph, which makes the connection with @Frank's answer more clear:
lump2 <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))
  cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
  df$combinedGrp <- cluster_A$membership
  df
}
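As a side note, newer igraph releases prefer components() and graph_from_adjacency_matrix() over the older clusters() and graph.adjacency() names; a sketch of the same function with those names:
lump2b <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))
  df$combinedGrp <- igraph::components(igraph::graph_from_adjacency_matrix(A))$membership
  df
}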
Hope this solution helps you a bit:
Assumption: df is ordered on the basis of grp1.
## split dataset using values of grp1
split_df <- split.default(df$grp2, df$grp1)
parent <- vector('integer', length(split_df))
## find out which combinations have values of grp2 in common
for (i in seq(1, length(split_df) - 1)) {
  for (j in seq(i + 1, length(split_df))) {
    inter <- intersect(split_df[[i]], split_df[[j]])
    if (length(inter) > 0) {
      parent[j] <- i
    }
  }
}
ans <- vector('list', length(split_df))
index <- which(parent == 0)
## index contains indices of elements that have no element in common
for (i in seq_along(index)) {
  ans[[index[i]]] <- rep(i, length(split_df[[i]]))
}
rest_index <- seq(1, length(split_df))[-index]
for (i in rest_index) {
  val <- ans[[parent[i]]][1]
  ans[[i]] <- rep(val, length(split_df[[i]]))
}
df$combinedGrp <- unlist(ans)
df
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
Based on https://stackoverflow.com/a/35773701/2152245, I used a different implementation of igraph because I already had an adjacency matrix of sf polygons from st_intersects():
library(igraph)
library(sf)
library(dplyr)  # needed for group_by()/summarize() below
# Use example data
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
nc <- nc[-sample(1:nrow(nc), nrow(nc) * .75), ]  # drop some polygons
# Find intersections
b <- st_intersects(nc, sparse = F)
g <- graph.adjacency(b)
clu <- components(g)
gr <- groups(clu)
# Quick loop to assign the groups
for (i in 1:nrow(nc)) {
  for (j in 1:length(gr)) {
    if (i %in% gr[[j]]) {
      nc[i, 'group'] <- j
    }
  }
}
# Make a new sfc object
nc_un <- group_by(nc, group) %>%
  summarize(BIR74 = mean(BIR74), do_union = TRUE)
plot(nc_un['BIR74'])
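For what it's worth, components() already returns one group id per node in $membership, so the nested loop could likely be replaced by a single assignment (a sketch, untested against the sf example above):
nc$group <- clu$membership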

How to pass column name as parameter to function in dplyr?

I want to do the same as here but with dplyr and one more column.
I want to select a column via a string variable, but on top of that I also want to select a second column normally.
I need this because I have a function which selects a couple of columns based on given parameters.
I have the following code as an example:
library(dplyr)
data(cars)
x <- "speed"
cars %>% select_(x, dist)
You can use quote() for the dist column
x <- "speed"
cars %>% select_(x, quote(dist)) %>% head
# speed dist
# 1 4 2
# 2 4 10
# 3 7 4
# 4 7 22
# 5 8 16
# 6 9 10
I know I'm a little late to this one, but I figured I would add it for others.
x <- "speed"
cars %>% select(one_of(x),dist) %>% head()
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
Or this would work too:
cars %>% select(one_of(c(x,'dist')))
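A note for current dplyr (>= 1.0): select_() is deprecated, and all_of() is the recommended replacement for one_of() when the columns must exist; a sketch:
x <- "speed"
cars %>% select(all_of(x), dist) %>% head()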

Variable Length Core Name Identification

I have a data set with the following row-naming scheme:
a.X.V
where:
a is a fixed-length core ID
X is a variable-length string that subsets a, which means I should keep X
V is a variable-length ID which specifies the individual elements of a.X to be averaged
. is one of {-,_}
What I am trying to do is take column averages of all the a.X's. A sample:
sampleList <- list("a.12.1"=c(1,2,3,4,5), "b.1.23"=c(3,4,1,4,5), "a.12.21"=c(5,7,2,8,9), "b.1.555"=c(6,8,9,0,6))
sampleList
$a.12.1
[1] 1 2 3 4 5
$b.1.23
[1] 3 4 1 4 5
$a.12.21
[1] 5 7 2 8 9
$b.1.555
[1] 6 8 9 0 6
Currently I am manually gsubbing out the .Vs to get a list of the general a.X's:
sampleList <- t(as.data.frame(sampleList))
y <- rownames(sampleList)
y <- gsub("(\\w\\.\\d+)\\.\\d+", "\\1", y)
Is there a faster way to do this?
This is one half of 2 issues I've encountered in a workflow. The other half was answered here.
You can use a vector of patterns to find the locations of the columns you want to group. I included a pattern I knew wouldn't match anything in order to show that the solution is robust to that situation.
# A *named* vector of patterns you want to group by
patterns <- c(a.12="^a.12",b.12="^b.12",c.12="^c.12")
# Find the locations of those patterns in your list
inds <- lapply(patterns, grep, x=names(sampleList))
# Calculate the mean of each list element that matches the pattern
out <- lapply(inds, function(i)
  if (l <- length(i)) Reduce("+", sampleList[i]) / l else NULL)
# Set the names of the output
names(out) <- names(patterns)
Perhaps you could consider messing with your data structure to make it easier to apply some standard tools:
sampleList <- list("a.12.1" = c(1,2,3,4,5), "b.1.23" = c(3,4,1,4,5),
                   "a.12.21" = c(5,7,2,8,9), "b.1.555" = c(6,8,9,0,6))
library(reshape2)
m1 <- melt(do.call(cbind,sampleList))
m2 <- cbind(m1,colsplit(m1$Var2,"\\.",c("coreID","val1","val2")))
The results looks like this:
head(m2)
Var1 Var2 value coreID val1 val2
1 1 a.12.1 1 a 12 1
2 2 a.12.1 2 a 12 1
3 3 a.12.1 3 a 12 1
Then you can more easily do something like this:
aggregate(value~val1,mean,data=subset(m2,coreID=="a"))
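The question asks for element-wise (column) averages of each a.X, so you can also keep the position variable Var1 in the aggregation; a sketch using the m2 built above:
aggregate(value ~ Var1 + coreID + val1, data = m2, FUN = mean)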
R is well suited to this if you just move to data.frames instead of lists. Make your 'a', 'X', and 'V' into their own columns. Then you can use ave, by, aggregate, subset, etc.
data.frame(do.call(rbind, sampleList),
           do.call(rbind, strsplit(names(sampleList), '\\.')))
# X1 X2 X3 X4 X5 X1.1 X2.1 X3.1
# a.12.1 1 2 3 4 5 a 12 1
# b.1.23 3 4 1 4 5 b 1 23
# a.12.21 5 7 2 8 9 a 12 21
# b.1.555 6 8 9 0 6 b 1 555
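From that frame, element-wise means per a.X group are then one aggregate() call; a sketch assuming the result above is saved as dat (my name; X1..X5 hold the values, X1.1 and X2.1 hold 'a' and 'X'):
dat <- data.frame(do.call(rbind, sampleList),
                  do.call(rbind, strsplit(names(sampleList), '\\.')))
aggregate(dat[, 1:5], by = dat[, c("X1.1", "X2.1")], FUN = mean)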
