Update dataframe B with values from dataframe A in R

I am doing social network analysis and working with two data frames. Dataframe A (or "nodes") has the information related to each node of the network (i.e. id and name). Dataframe B (or "links") has two columns, "from" and "to", which show how the nodes are connected to each other. Each row represents a link "from" one node "to" the other.
I want to use the networkD3 package to visualize the network, but it has some requirements: ids should start from zero and be consecutive (0, 1, 2, etc.). Because my nodes and links are a random subset of a larger database, the ids are not consecutive.
I sorted the "nodes" data frame by id and created a new column (new_id) starting from zero with consecutive numbers. But now I don't know how to update the "links" data frame based on the new ids.
Currently, I am converting the values in the "links" data frame to characters and then revaluing them with the plyr package (a sketch of this is shown after the sample data below), but I need something that scales to a much larger dataset.
I am copying a sample of the two data frames that I have now:
set.seed(10)
nodes_df <- data.frame(id = c(1, 3, 5, 6, 8, 10),
                       name = c("Agriculture", "Agriculture_in_Mesoamerica",
                                "Agriculture_in_ancient_Greece", "Agriculture_in_ancient_Rome",
                                "Agriculture_in_India", "Agriculture_in_China"),
                       new_id = seq(0, 5))
links_df <- data.frame(from = c(3, 3, 5, 6, 8, 10),
                       to = c(1, 5, 6, 8, 10, 3))
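For reference, the approach described above might look roughly like this (a sketch, assuming plyr::revalue with a hand-typed replacement vector, which is what becomes unwieldy for a larger dataset):
library(plyr)
# every id has to be spelled out by hand
new_to <- revalue(as.character(links_df$to),
                  c("1" = "0", "3" = "1", "5" = "2",
                    "6" = "3", "8" = "4", "10" = "5"))
# ...and the same again for the "from" column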
In summary, I need to update the values in the links_df to correspond to the new_id values from the nodes_df.
Thank you so much in advance. I hope I was clear enough.

In base R you can use merge() and extract the required column. One caveat: merge() sorts its result by the join column, so carry along the original row order and restore it before assigning the new ids back, otherwise they end up on the wrong rows.
links_df$row <- seq_len(nrow(links_df))  # remember the original row order
m_to <- merge(links_df, nodes_df, by.x = "to", by.y = "id", all.x = TRUE)
links_df$new_to <- m_to$new_id[order(m_to$row)]
m_from <- merge(links_df, nodes_df, by.x = "from", by.y = "id", all.x = TRUE)
links_df$new_from <- m_from$new_id[order(m_from$row)]
links_df$row <- NULL
links_df <- links_df[, c("from", "to", "new_from", "new_to")]  # reordering columns
links_df
  from to new_from new_to
1    3  1        1      0
2    3  5        1      2
3    5  6        2      3
4    6  8        3      4
5    8 10        4      5
6   10  3        5      1

An alternative to merging or joining is to use recode(). A solution based on the tidyverse could look as follows.
library(dplyr)
library(tibble)
swap <- deframe(tibble(id = nodes_df$id, new_id = nodes_df$new_id))
links_df %>%
  mutate(new_from = recode(from, !!!swap),
         new_to = recode(to, !!!swap))
#   from to new_from new_to
# 1    3  1        1      0
# 2    3  5        1      2
# 3    5  6        2      3
# 4    6  8        3      4
# 5    8 10        4      5
# 6   10  3        5      1
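As a side note (not part of the original answer), deframe(tibble(...)) is just one way to build the named lookup vector; base setNames() gives the same result:
swap <- setNames(nodes_df$new_id, nodes_df$id)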

Technically speaking, networkD3 expects the values in the links data frame to be the (zero-based) index of the nodes they refer to in the nodes data frame. So the first row/node in the nodes data frame is 0, and so forth.
You can use match() to determine the 1-based position of each element of one vector within a target vector, and subtract 1 to get a 0-based index.
links_df$from
#> [1] 3 3 5 6 8 10
nodes_df$id
#> [1] 1 3 5 6 8 10
match(links_df$from, nodes_df$id) - 1
#> [1] 1 1 2 3 4 5
links_df$to
#> [1] 1 5 6 8 10 3
nodes_df$id
#> [1] 1 3 5 6 8 10
match(links_df$to, nodes_df$id) - 1
#> [1] 0 2 3 4 5 1
Created on 2021-03-28 by the reprex package (v1.0.0)
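To close the loop with networkD3, a minimal sketch of feeding these 0-based indices into forceNetwork() could look as follows (not part of the original answer; the value and group columns are dummies added only for illustration):
library(networkD3)
links_plot <- data.frame(source = match(links_df$from, nodes_df$id) - 1,
                         target = match(links_df$to, nodes_df$id) - 1,
                         value = 1)
nodes_plot <- data.frame(name = nodes_df$name,
                         group = 1)
forceNetwork(Links = links_plot, Nodes = nodes_plot,
             Source = "source", Target = "target", Value = "value",
             NodeID = "name", Group = "group", opacity = 0.8)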

Related

Using Strings to Identify Sequence of Column Names in R

I am currently trying to use pre-defined strings to identify multiple column names in R.
To be more explicit, I am using the ave function to create identification variables for subgroups of a data frame. The twist is that I want the identification variables to be flexible, so that I can just pass the column names as generic strings.
A sample code would be:
ids = with(df,ave(rep(1,nrow(df)),subcolumn1,subcolumn2,subcolumn3,FUN=seq_along))
I would like to run this code in the following fashion (code below does not work as expected):
subColumnsString = c("subcolumn1","subcolumn2","subcolumn3")
ids = with(df,ave(rep(1,nrow(df)),subColumnsString ,FUN=seq_along))
I tried something with eval, but still did not work:
subColumnsString = c("subcolumn1","subcolumn2","subcolumn3")
ids = with(df,ave(rep(1,nrow(df)),eval(parse(text=subColumnsString)),FUN=seq_along))
Any ideas?
Thanks.
EDIT: Working code example of what I want:
df = mtcars
id_names = c("vs","am")
idDF_correct = transform(df,idItem = as.numeric(interaction(vs,am)))
idDF_wrong = cbind(df,ave(rep(1,nrow(df)),df[id_names],FUN=seq_along))
Note how in idDF_correct, the unique combinations are correctly mapped into unique values of idItem. In idDF_wrong this is not the case.
I think this achieves what you requested. Here I use the mtcars dataset that ships with R:
subColumnsString <- c("cyl","gear")
ids = with(mtcars, ave(rep(1,nrow(mtcars)), mtcars[subColumnsString], FUN=seq_along))
Just index your data.frame with the sub-column names; this returns a list, which is what ave naturally works with.
EDIT
ids = ave(rep(1,nrow(mtcars)), mtcars[subColumnsString], FUN=seq_along)
You can omit the with and just call plain ol' ave, as G. Grothendieck stated, and you should also use their answer, as it is much more general.
This defines a function whose arguments are:
data, the input data frame
by, a character vector of column names in data
fun, a function to use in ave
Code:
Ave <- function(data, by, fun = seq_along) {
  do.call(function(...) ave(rep(1, nrow(data)), ..., FUN = fun), data[by])
}
# test
Ave(CO2, c("Plant", "Treatment"), seq_along)
giving:
[1] 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3
[39] 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6
[77] 7 1 2 3 4 5 6 7
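Two usage notes, not from the answers above: with the columns from the question's edit, the helper returns a running index within each vs/am group. If what the edit is really after is a single id per unique combination (what idDF_correct produces), interaction() also accepts a single list of factors, so the string-specified columns can be passed directly:
id_names <- c("vs", "am")
Ave(mtcars, id_names)                       # running index within each vs/am group
as.numeric(interaction(mtcars[id_names]))   # one id per unique vs/am combination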

Combining two columns using shared values in first column

I am trying to adjust the formatting of a data set. My current set looks like this, in two columns. The first column is a "cluster" and the second column "name" contains values within each cluster:
Cluster Name
A 1
A 2
A 3
B 4
B 5
C 2
C 6
C 7
And I'd like the output to be a single column in which the values from the second column are listed under their associated cluster from the first column:
Cluster A
1
2
3
Cluster B
4
5
Cluster C
2
6
7
I've been trying in R and Excel with no luck for the last few hours. Any ideas?
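The answer below refers to a data frame df; for reproducibility, the question's table could be entered like this (a sketch; the column types are assumptions):
df <- data.frame(Cluster = c("A", "A", "A", "B", "B", "C", "C", "C"),
                 Name = c(1, 2, 3, 4, 5, 2, 6, 7),
                 stringsAsFactors = FALSE)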
Using a trick with tidyr::nest:
library(dplyr)
library(tidyr)
df %>%
  mutate(Cluster = paste0("Cluster_", Cluster)) %>%
  nest(Name) %>%
  t %>%
  unlist %>%
  as.data.frame
# .
# 1 Cluster_A
# 2 1
# 3 2
# 4 3
# 5 Cluster_B
# 6 4
# 7 5
# 8 Cluster_C
# 9 2
# 10 6
# 11 7

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 1 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1) - 1) / 2
for (i in 1:pairs) {
  refs <- unique(c(df1[, 2*i], df1[, 2*i + 1]))
  df1[, 2*i] <- match(df1[, 2*i], refs)
  df1[, 2*i + 1] <- match(df1[, 2*i + 1], refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
df_m <- melt(df1, id.vars = "IID")
# group by the column-pair prefix ("L1.", "L2.") together with the value itself
setDT(df_m)[, id := .GRP, by = .(gsub("(.*).","\\1", df_m$variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
#   IID L1.1 L1.2 L2.1 L2.2
# 1   1    1    1    6    9
# 2   2    2    4    7   10
# 3   3    3    2    8    6
# 4   4    4    5    9    8
This should also be easily expandable to multiple groups of columns. I.e. if you have L3. it should work with that as well.

R rearrange data

I have a bunch of texts written by the same person, and I'm trying to estimate the templates they use for each text. The way I'm going about this is:
create a TermDocumentMatrix for all the texts
take the raw Euclidean distance of each pair
cut out any pair greater than X distance (10 for the sake of argument)
flatten the forest
return one example of each template with some summarized stats
I'm able to get to the point of having the distance pairs, but I am unable to convert the dist instance to something I can work with. There is a reproducible example at the bottom.
Printed, the dist instance is a lower triangle of pairwise distances between the texts. The row and column names correspond to indexes in the original list of texts, which I can use to achieve step 5.
What I have been trying to get out of this is a sparse matrix with col name, row name, value.
col, row, value
1 2 14.966630
1 3 12.449900
1 4 13.490738
1 5 12.688578
1 6 12.369317
2 3 12.449900
2 4 13.564660
2 5 12.922848
2 6 12.529964
3 4 5.385165
3 5 5.830952
3 6 5.830952
4 5 7.416198
4 6 7.937254
5 6 7.615773
From this point I would be comfortable cutting out all pairs greater than my cutoff and flattening the forest, i.e. returning 3 templates in this example: a group containing only document 1, a group containing only document 2, and a third group containing documents 3, 4, 5, and 6.
I have tried a bunch of things, from creating a matrix out of this and then trying to make it sparse, to directly using the vector inside the dist object, and I just can't seem to figure it out.
Reproducible example:
tdm <- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,3,1,2,2,2,3,2,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,2,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,4,1,1,1,1,1,0,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,2,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,1,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,1,1,1,1,0,1,0,1,0,0,2,0,0,0,0,0,1,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,3,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,1,0,0,2,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,3,1,1,1,1,0,1,0,0,0,0,1,2,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,1,1,1,2,1,1,1,0,0,0,0,1,2,2,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,2,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,2,0,2,2,3,2,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,2,1,1,1,1,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,1,0,0,1,1,1,0,0,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,2,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,2,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,3,0,1,1,1,1,0,0,1,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,4,2,4,6,4,3,1,0,1,2,1,1,0,1,0,0,0,0,2,0,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,2,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,2,1,2,2,2,2,1,0,1,2,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,2,2,2,2,2,2,3,3,4,5,3,1,2,1,1,1,1,1,1,0,0,0,0,3,3,0,0,1,1,0,1,0,0,0,0), nrow=6)
rownames(tdm) <- 1:6
colnames(tdm) <- paste("term", 1:229, sep="")
tdm.dist <- dist(tdm)
# I'm stuck turning tdm.dist into what I have shown
A classic approach to turn a "matrix"-like object to a [row, col, value] "data.frame" is the as.data.frame(as.table(.)) route. Specifically here, we need:
subset(as.data.frame(as.table(as.matrix(tdm.dist))), as.numeric(Var1) < as.numeric(Var2))
But that includes way too many coercions and creation of a larger object only to be subset immediately.
Since dist stores its values in a "lower.tri"angle form we could use combn to generate the row/col indices and cbind with the "dist" object:
data.frame(do.call(rbind, combn(attr(tdm.dist, "Size"), 2, simplify = FALSE)), c(tdm.dist))
Also, "Matrix" package has some flexibility that, along its memory efficiency in creating objects, could be used here:
library(Matrix)
tmp = combn(attr(tdm.dist, "Size"), 2)
summary(sparseMatrix(i = tmp[2, ], j = tmp[1, ], x = c(tdm.dist),
dims = rep_len(attr(tdm.dist, "Size"), 2), symmetric = TRUE))
Additionally, among different functions that handle "dist" objects,
cutree(hclust(tdm.dist), h = 10)
#1 2 3 4 5 6
#1 2 3 3 3 3
groups by specifying the cut height.
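For the last two steps of the question (flattening and returning one example of each template), the cutree groups can be used directly; a sketch, not part of the original answer:
groups <- cutree(hclust(tdm.dist), h = 10)
split(names(groups), groups)                   # which documents share a template
sapply(split(names(groups), groups), `[`, 1)   # one example document per template
table(groups)                                  # how many documents use each template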
That's how I've done a very similar thing in the past using the dplyr and tidyr packages.
You can run the chained (%>%) script row by row to see how the dataset is updated step by step.
# tdm and tdm.dist as defined in the reproducible example above
library(dplyr)
library(tidyr)
tdm.dist %>%
  as.matrix() %>%                    # update dist object to a matrix
  data.frame() %>%                   # update matrix to a data frame
  setNames(nm = 1:ncol(.)) %>%       # update column names
  mutate(names1 = 1:nrow(.)) %>%     # use rownames as a variable
  gather(names2, value, -names1) %>% # reshape data
  filter(names1 <= names2)           # keep the values only once
# names1 names2 value
# 1 1 1 0.000000
# 2 1 2 14.966630
# 3 2 2 0.000000
# 4 1 3 12.449900
# 5 2 3 12.449900
# 6 3 3 0.000000
# 7 1 4 13.490738
# 8 2 4 13.564660
# 9 3 4 5.385165
# 10 4 4 0.000000
# 11 1 5 12.688578
# 12 2 5 12.922848
# 13 3 5 5.830952
# 14 4 5 7.416198
# 15 5 5 0.000000
# 16 1 6 12.369317
# 17 2 6 12.529964
# 18 3 6 5.830952
# 19 4 6 7.937254
# 20 5 6 7.615773
# 21 6 6 0.000000

Create new variable in R data frame by conditional lookup

I want to create a new variable in an R data frame by using an existing column as a lookup value on another column in the same table. For example, in the following data frame:
df = data.frame(
pet = c("smalldog", "mediumdog", "largedog",
"smallcat", "mediumcat", "largecat"),
numPets = c(1, 2, 3, 4, 5, 6)
)
> df
pet numPets
1 smalldog 1
2 mediumdog 2
3 largedog 3
4 smallcat 4
5 mediumcat 5
6 largecat 6
I want to create a new column called numEnemies which is equal to zero for small animals, but equal to the number of animals of the same size and the other species for medium and large animals. I want to end up with this:
pet numPets numEnemies
1 smalldog 1 0
2 mediumdog 2 5
3 largedog 3 6
4 smallcat 4 0
5 mediumcat 5 2
6 largecat 6 3
The way I was attempting to do this was by using conditional logic to generate a character variable which I could then use to look up the final value I want from the same data frame, which got me to here:
calculateEnemies <- function(df) {
  ifelse(grepl('small', df$pet), 0,
         ifelse(grepl('dog', df$pet), gsub('dog', 'cat', df$pet),
                ifelse(grepl('cat', df$pet),
                       gsub('cat', 'dog', df$pet), NA)))
}
df$numEnemies <- calculateEnemies(df)
> df
pet numPets numEnemies
1 smalldog 1 0
2 mediumdog 2 mediumcat
3 largedog 3 largecat
4 smallcat 4 0
5 mediumcat 5 mediumdog
6 largecat 6 largedog
I want to modify this function to use the newly generated string to look up the values from df$numPets based on the corresponding value in df$pet. I'm also open to a better approach that also generalizes.
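One way to finish the lookup approach described above (a sketch, not one of the posted answers) is to build the enemy's name and then use match() to pull the corresponding numPets:
pet <- as.character(df$pet)          # in case pet was read in as a factor
enemy <- ifelse(grepl("small", pet), NA,
                ifelse(grepl("dog", pet), sub("dog", "cat", pet),
                                          sub("cat", "dog", pet)))
df$numEnemies <- ifelse(is.na(enemy), 0, df$numPets[match(enemy, pet)])
# gives 0, 5, 6, 0, 2, 3 -- the numEnemies column shown above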
Here's how I would approach this using the data.table package
library(data.table)
setDT(df)[, numEnemies := rev(numPets), by = sub(".*(large|medium).*", "\\1", pet)]
df[grep("^small", pet), numEnemies := 0L]
# pet numPets numEnemies
# 1: smalldog 1 0
# 2: mediumdog 2 5
# 3: largedog 3 6
# 4: smallcat 4 0
# 5: mediumcat 5 2
# 6: largecat 6 3
What I basically did is first create groups of medium and large over the whole data set and just reverse the values within each group.
Then, I've assigned 0 to numEnemies for all the rows matched by grep("^small", pet).
This should be both very efficient and robust, as it will work for any number of animals, and you don't actually need to know the animal names a priori.
