Combining two columns using shared values in first column - r

I am trying to adjust the formatting of a data set. My current set looks like this, in two columns. The first column is a "cluster" and the second column "name" contains values within each cluster:
Cluster Name
A 1
A 2
A 3
B 4
B 5
C 2
C 6
C 7
And I'd like a single column in which all the values from column 2 are listed under their associated cluster from column 1:
Cluster A
1
2
3
Cluster B
4
5
Cluster C
2
6
7
I've been trying in R and Excel with no luck for the last few hours. Any ideas?

Using a trick with tidyr::nest :
library(dplyr)
library(tidyr)
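# nest(data = Name) is the named form expected by tidyr >= 1.0; the unnamed nest(Name) still runs but warns
df %>% mutate(Cluster = paste0("Cluster_", Cluster)) %>% nest(data = Name) %>% t %>% unlist %>% as.data.frame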
# .
# 1 Cluster_A
# 2 1
# 3 2
# 4 3
# 5 Cluster_B
# 6 4
# 7 5
# 8 Cluster_C
# 9 2
# 10 6
# 11 7
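If you would rather avoid the nest/transpose trick, here is a minimal base R sketch of the same idea. It assumes the data frame is called df with columns Cluster and Name as in the question (the data below is my reconstruction of it), and the output column name value is just a placeholder I picked.

```r
# Reconstruction of the question's data (assumed, not copied from a real file)
df <- data.frame(Cluster = c("A", "A", "A", "B", "B", "C", "C", "C"),
                 Name = c(1, 2, 3, 4, 5, 2, 6, 7))

# Split the names by cluster, then put a "Cluster X" header in front of each group
groups <- split(df$Name, df$Cluster)
out <- unlist(lapply(names(groups), function(cl) {
  c(paste("Cluster", cl), groups[[cl]])
}))

# One character column: headers and values interleaved
result <- data.frame(value = out)
result
```

Note that mixing headers and numbers in one column forces everything to character, which is fine for display but not for further numeric work.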

Related

R - How to create multiple datasets based on levels of factor in multiple columns?

I'm kinda new to R and still looking for ways to make my code more elegant. I want to create multiple datasets in a more efficient way, each based on a particular value over different columns.
This is my dataset:
df<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
And I need each column to be a factor:
df[colnames(df)] <- lapply(df[colnames(df)], factor)
Now, what I want to obtain is one dataframe called "Likert_rank_yes" that contains all the observations that in the column "dummy1" have "yes", one dataframe called "Likert_rank_no" that contains all the observations that in the column "dummy1" have "no", one dataframe called "Likert_rank_high" that contains all the observations that in the column "dummy2" have "high" and so on for all my other dummies.
I want to loop or streamline the process in some way, so that there are few commands to run to get all the datasets I need.
The first two dataframes should look something like this:
Dataframe called "Likert_rank_yes" that contains all the observations that in the column "dummy1" have "yes"
Dataframe called "Likert_rank_no" that contains all the observations that in the column "dummy1" have "no"
I have to do this with several dummies with multiple levels and would like to automate/loop the process or make it more efficient, so that I don't have to subset and rename every dataframe for each dummy level. Ideally I would also need to drop the last column in each df created (the one containing the dummy considered).
I tried splitting like below but it seems it is not possible using multiple values, I just get 4 dfs (yes AND high observations, yes AND low obs, no AND high obs etc.) like so:
# Splitting with a list of columns doesn't work
list_df <- split(df[c(1:5)], list(df$dummy1,df$dummy2), sep=".")
Can you help? Thanks in advance!
You need two lapplys:
vals <- colnames(df)[1:5]
dummies <- colnames(df)[-(1:5)]
step1 <- lapply(dummies, function(x) df[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
step2
# $dummy1
# $dummy1$no
# A B C D E dummy1
# 3 2 2 3 5 2 no
# 4 3 3 3 5 3 no
# 5 4 4 4 5 4 no
# 6 5 2 2 4 2 no
# 8 1 5 1 5 1 no
#
# $dummy1$yes
# A B C D E dummy1
# 1 1 4 3 1 1 yes
# 2 2 4 3 2 4 yes
# 7 1 1 5 5 5 yes
# 9 2 2 2 2 2 yes
# 10 3 2 3 3 3 yes
#
#
# $dummy2
# $dummy2$high
# A B C D E dummy2
# 1 1 4 3 1 1 high
# 5 4 4 4 5 4 high
# 6 5 2 2 4 2 high
# 7 1 1 5 5 5 high
# 10 3 2 3 3 3 high
#
# $dummy2$low
# A B C D E dummy2
# 2 2 4 3 2 4 low
# 3 2 2 3 5 2 low
# 4 3 3 3 5 3 low
# 8 1 5 1 5 1 low
# 9 2 2 2 2 2 low
For the first data set ("dummy1" and "no") use step2$dummy1$no or step2[[1]][[1]] or step2[["dummy1"]][["no"]].
For programming purposes it is usually better to keep the list intact since it makes it simple to write code that processes all of the data frames in the list without having to specify them individually.
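If you do want separate data frames in the global environment, you are very close. Flatten the nested list one level and pass it to list2env: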
tbls <- unlist(step2, recursive=FALSE)
list2env(tbls, envir=.GlobalEnv)
ls()
# [1] "df" "dummies" "dummy1.no" "dummy1.yes" "dummy2.high" "dummy2.low" "step1" "step2" "tbls" "vals"
This will create the same set of tables as separate objects (dummy1.no, dummy1.yes, dummy2.high, dummy2.low).
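The question also asked for the dummy column to be dropped and for names like "Likert_rank_yes". A minimal sketch of that variant is below, assuming the first five columns are the values and every other column is a dummy; the list name likert is my own placeholder. It also assumes no two dummies share a level name, otherwise later entries would overwrite earlier ones.

```r
df <- data.frame(A = c(1, 2, 2, 3, 4, 5, 1, 1, 2, 3),
                 B = c(4, 4, 2, 3, 4, 2, 1, 5, 2, 2),
                 C = c(3, 3, 3, 3, 4, 2, 5, 1, 2, 3),
                 D = c(1, 2, 5, 5, 5, 4, 5, 5, 2, 3),
                 E = c(1, 4, 2, 3, 4, 2, 5, 1, 2, 3),
                 dummy1 = c("yes", "yes", "no", "no", "no", "no", "yes", "no", "yes", "yes"),
                 dummy2 = c("high", "low", "low", "low", "high", "high", "high", "low", "low", "high"))

vals <- colnames(df)[1:5]
dummies <- setdiff(colnames(df), vals)

# One entry per dummy level, keeping only the value columns
likert <- list()
for (d in dummies) {
  for (lev in as.character(unique(df[[d]]))) {
    likert[[paste0("Likert_rank_", lev)]] <- df[df[[d]] == lev, vals]
  }
}

names(likert)
likert$Likert_rank_yes
```

Keeping everything in one named list means you can still lapply over all the pieces, and list2env(likert, envir = .GlobalEnv) is available if separate objects are really needed.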

How to rearrange columns of a data frame based on values in a row

This is an R programming question. I would like to rearrange the order of columns in a data frame based on the values in one of the rows. Here is an example data frame:
df <- data.frame(A=c(1,2,3,4),B=c(3,2,4,1),C=c(2,1,4,3),
D=c(4,2,3,1),E=c(4,3,2,1))
Suppose I want to rearrange the columns in df based on the values in row 4, ascending from 1 to 4, with ties having the same rank. So the desired data frame could be:
df <- data.frame(B=c(3,2,4,1),D=c(4,2,3,1),E=c(4,3,2,1),
C=c(2,1,4,3),A=c(1,2,3,4))
although I am indifferent about the order of the first three columns, all of which have the value 1 in row 4.
I could do this with a for loop, but I am looking for a simpler approach. Thank you.
We can use select: subset row 4, unlist it, order the values, and pass the resulting column order to select
library(dplyr)
df %>%
select(order(unlist(.[4, ])))
-output
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
Or we may use slice_tail, which picks the last row (flatten_dbl comes from purrr)
library(purrr)
df %>%
select({.} %>%
slice_tail(n = 1) %>%
flatten_dbl %>%
order)
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
or in base R (note the comma: the ordering indexes columns, not rows)
df[, order(unlist(tail(df, 1)))]

Update dataframe B with values from dataframe A in R

I am doing social network analysis and working with two data frames. Dataframe A (or "nodes") has the information related to each node of the network (i.e. id and name). Dataframe B (or "links") has two columns: "from" and "to" which basically shows how the nodes are connected between them. Each row represents a link "from" one node "to" the other.
I want to use the package networkD3 to visualize the network but it has some requirements: id's should start from zero and they have to be consecutive (0,1,2, etc). Because my nodes and links are a random subset from a larger database, they are not consecutive.
I sorted the "nodes" data frame based on the id and created a new column (new_id) starting from zero and with consecutive numbers. But now, I don't know how to update the "links" data frame based on the new_id's.
Currently, I am converting the values in the "links" data frame to characters and then revaluing them using the plyr package. But I need to do this for a larger dataset.
I am copying a sample of the two data frames that I have now:
set.seed(10)
nodes_df <- data.frame(id = c(1,3,5,6,8,10),
name = c("Agriculture", "Agriculture_in_Mesoamerica", "Agriculture_in_ancient_Greece",
"Agriculture_in_ancient_Rome", "Agriculture_in_India", "Agriculture_in_China"),
new_id = seq(0,5))
links_df <- data.frame(from = c(3,3,5,6,8,10),
to = c(1,5,6,8,10,3))
In summary, I need to update the values in the links_df to correspond to the new_id values from the nodes_df.
Thank you so much in advance. I hope I was clear enough.
Best regards,
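In base R you can use merge and extract the required column. Because merge sorts its result by the key column, keep a row index so the new ids can be put back in the original row order:
links_df$row <- seq_len(nrow(links_df)) # remember the original row order
to_map <- merge(links_df, nodes_df,
by.x = "to", by.y = "id",
all.x = TRUE)
links_df$new_to <- to_map$new_id[order(to_map$row)]
from_map <- merge(links_df, nodes_df,
by.x = "from", by.y = "id",
all.x = TRUE)
links_df$new_from <- from_map$new_id[order(from_map$row)]
links_df <- links_df[, c("from", "to", "new_from", "new_to")] # Reordering columns, dropping the helper
links_df
from to new_from new_to
1 3 1 1 0
2 3 5 1 2
3 5 6 2 3
4 6 8 3 4
5 8 10 4 5
6 10 3 5 1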
An alternative to merging or joining could be to use recode. A solution (based on the tidyverse) could look as follows.
library(dplyr)
library(tibble)
swap <- deframe(tibble(id = nodes_df$id, new_id = nodes_df$new_id))
links_df %>%
mutate(new_from = recode(from, !!!swap),
new_to = recode(to, !!!swap))
# from to new_from new_to
# 1 3 1 1 0
# 2 3 5 1 2
# 3 5 6 2 3
# 4 6 8 3 4
# 5 8 10 4 5
# 6 10 3 5 1
Technically speaking, networkD3 expects the values in the links data frame to be the (zero-based) index of the nodes they refer to in the nodes data frame. So the first row/node in the nodes data frame is 0, and so forth.
You can use match() to determine the 1-based index of each element in a vector in a target vector, and subtract 1 to get a 0-based index.
links_df$from
#> [1] 3 3 5 6 8 10
nodes_df$id
#> [1] 1 3 5 6 8 10
match(links_df$from, nodes_df$id) - 1
#> [1] 1 1 2 3 4 5
links_df$to
#> [1] 1 5 6 8 10 3
nodes_df$id
#> [1] 1 3 5 6 8 10
match(links_df$to, nodes_df$id) - 1
#> [1] 0 2 3 4 5 1
Created on 2021-03-28 by the reprex package (v1.0.0)
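Putting the two match() calls together, a minimal sketch of building the zero-based links table could look like the following. The column names source and target and the object name links_zero are placeholders of mine, not something networkD3 requires by name. The check at the end is an extra I added to catch links that point at ids missing from the nodes table.

```r
nodes_df <- data.frame(id = c(1, 3, 5, 6, 8, 10),
                       name = c("Agriculture", "Agriculture_in_Mesoamerica",
                                "Agriculture_in_ancient_Greece", "Agriculture_in_ancient_Rome",
                                "Agriculture_in_India", "Agriculture_in_China"))
links_df <- data.frame(from = c(3, 3, 5, 6, 8, 10),
                       to = c(1, 5, 6, 8, 10, 3))

# Position of each link endpoint in nodes_df, shifted to start at 0
links_zero <- data.frame(
  source = match(links_df$from, nodes_df$id) - 1,
  target = match(links_df$to, nodes_df$id) - 1
)

# match() returns NA for ids that are not in nodes_df, so fail early if that happens
stopifnot(!anyNA(links_zero))
links_zero
```

Since match() works on positions, this does not need a precomputed new_id column, but it does rely on nodes_df being in the same row order that is later passed to the plotting function.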

How to merge columns in R with different levels of values

I have been given a dataset that I am attempting to perform logistic regression on. However, to do so, I need to merge some columns in R.
For instance in the carevaluations data set, I am given (BuyingPrice_low, BuyingPrice_medium, BuyingPrice_high, BuyingPrice_vhigh, MaintenancePrice_low MaintenancePrice_medium MaintenancePrice_high MaintenancePrice_vhigh)
How would I combine the columns buying price_low, medium, etc. into one column called "BuyingPrice" with the order and their respective data in each column and the same with the maintenanceprice column?
library(dplyr)
library(tidyr) # gather() comes from tidyr, not dplyr
df <- data.frame(Buy_low=rep(c(0,1), 10),
Buy_high=rep(c(0,1), 10))
one_column <- df %>%
gather(var, value)
head(one_column)
var value
1 Buy_low 0
2 Buy_low 1
3 Buy_low 0
4 Buy_low 1
5 Buy_low 0
6 Buy_low 1
It can be done with stack in base R :
df1 <- data.frame(a=1:3,b=4:6,c=7:9)
stack(df1)
# values ind
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
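If the BuyingPrice_* columns are 0/1 indicators with exactly one 1 per row (an assumption on my part, the question does not show the data), stacking is not quite what is wanted: each row should collapse to a single level. A rough base R sketch under that assumption, with made-up data and the hypothetical name cars:

```r
# Made-up indicator data following the question's column naming
cars <- data.frame(BuyingPrice_low    = c(1, 0, 0, 0),
                   BuyingPrice_medium = c(0, 1, 0, 0),
                   BuyingPrice_high   = c(0, 0, 1, 0),
                   BuyingPrice_vhigh  = c(0, 0, 0, 1))

# Columns belonging to this variable, and the level each one stands for
buy_cols <- grep("^BuyingPrice_", names(cars), value = TRUE)
buy_levels <- sub("^BuyingPrice_", "", buy_cols)

# For each row, find which indicator column holds the 1
idx <- max.col(as.matrix(cars[buy_cols]), ties.method = "first")

# Single factor column with levels in the original column order
cars$BuyingPrice <- factor(buy_levels[idx], levels = buy_levels)
cars$BuyingPrice
```

The same steps with the prefix "^MaintenancePrice_" would give the second column. Rows with no 1 or more than one 1 would be silently assigned to the first maximum, so it is worth checking rowSums(cars[buy_cols]) == 1 on real data first.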

How to sort a column from ascending order for EACH ID in R [duplicate]

This question already has answers here:
Sort (order) data frame rows by multiple columns
(19 answers)
Closed 7 years ago.
I want to sort Chrom# in ascending order (1 to 23) within each unique ID (as shown below, there are multiple rows with the same ID). How do I write the R code for it? E.g. for MB-0002, chrom would go 1, 1, 1, 2, 4, 22, etc., with one chrom per row. I am new to R so any help would be appreciated. Thanks so much!
sample dataset
If you can use dplyr::arrange then you can easily sort by two variables.
tmp <- data.frame(id=c("a","a","b","a","b","c","a","b","c"),
value=c(3,2,4,1,2,1,7,4,3))
tmp
# id value
# 1 a 3
# 2 a 2
# 3 b 4
# 4 a 1
# 5 b 2
# 6 c 1
# 7 a 7
# 8 b 4
# 9 c 3
library(dplyr)
tmp %>% arrange(id, value)
# id value
# 1 a 1
# 2 a 2
# 3 a 3
# 4 a 7
# 5 b 2
# 6 b 4
# 7 b 4
# 8 c 1
# 9 c 3
FYI, an image doesn't work as a usable sample dataset.
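For completeness, the same two-key sort also works without dplyr. This is a small base R sketch on the same toy data; the name sorted is just a placeholder.

```r
tmp <- data.frame(id = c("a", "a", "b", "a", "b", "c", "a", "b", "c"),
                  value = c(3, 2, 4, 1, 2, 1, 7, 4, 3))

# order() sorts by the first key, then breaks ties with the second
sorted <- tmp[order(tmp$id, tmp$value), ]
sorted
```

Unlike arrange, this keeps the original row names, so use rownames(sorted) <- NULL if you want them renumbered.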
