I have a data.frame containing a list of people and who they are neighbours with. However, the data suggest that Josh is a neighbour of himself, Emma, and Nick, but Emma is not a neighbour of Josh.
x <- read.table(text = "
Name ID Neighbour_ID
Josh 1 1,2,3
Emma 2 4
Nick 3 1
Mark 4 5
Claire 5
", sep = " ", header = TRUE, fill = TRUE)
x
Name ID Neighbour_ID
1 Josh 1 1,2,3
2 Emma 2 4
3 Nick 3 1
4 Mark 4 5
5 Claire 5
This of course needs to be fixed, and I am looking for a way to do that. The outcome should look like this:
Name ID Neighbour_ID
1 Josh 1 2,3
2 Emma 2 1,4
3 Nick 3 1
4 Mark 4 2,5
5 Claire 5 4
Using the igraph package, convert the data to a graph object:
library(dplyr)
library(tidyr)
library(igraph)
g <- separate_rows(x, Neighbour_ID, convert = TRUE) %>%
select(from = ID, to = Neighbour_ID) %>%
filter(!is.na(to) & from != to) %>%
graph_from_data_frame(directed = FALSE)
g
# IGRAPH 1eeedee UN-- 5 5 --
# + attr: name (v/c)
# + edges from 1eeedee (vertex names):
# [1] 1--2 1--3 2--4 1--3 4--5
plot(g)
I'd stop here, as we have our data in graph format. But if you need the output as a data.frame, get the edge list and merge it back to the original data.
gEdge <- as_edgelist(g)
left_join(x %>% select(Name, ID),
data.frame(unique(rbind(gEdge[, 1:2], gEdge[, 2:1]))) %>%
mutate(X1 = as.integer(X1), X2 = as.integer(X2)) %>%
summarise(Neighbour_ID = paste(sort(X2), collapse = ","), .by = X1),
by = c("ID" = "X1"))
# Name ID Neighbour_ID
# 1 Josh 1 2,3
# 2 Emma 2 1,4
# 3 Nick 3 1
# 4 Mark 4 2,5
# 5 Claire 5 4
x %>%
separate_rows(Neighbour_ID, convert = TRUE) %>%
select(-Name) %>%
rbind(setNames(rev(.), names(.))) %>%
filter(ID != Neighbour_ID) %>%
distinct()%>%
left_join(select(x, -Neighbour_ID), c(ID = 'ID')) %>%
summarise(Neighbour_ID = toString(sort(Neighbour_ID)), .by = c(Name, ID))
# # A tibble: 5 × 3
# Name ID Neighbour_ID
# <chr> <int> <chr>
# 1 Josh 1 2, 3
# 2 Emma 2 1, 4
# 3 Nick 3 1
# 4 Mark 4 2, 5
# 5 Claire 5 4
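For comparison, the same symmetrising step can be sketched in base R, without igraph or the tidyverse. This is only a minimal sketch: it assumes Neighbour_ID is a character column with an empty string for people who list no neighbours, and it rebuilds the data with data.frame() so the block runs on its own.

```r
# Example data rebuilt by hand (assumption: "" means "no neighbours listed")
x <- data.frame(
  Name = c("Josh", "Emma", "Nick", "Mark", "Claire"),
  ID = 1:5,
  Neighbour_ID = c("1,2,3", "4", "1", "5", ""),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a vector of integer ids
nb <- lapply(strsplit(x$Neighbour_ID, ","), as.integer)

# One row per (from, to) pair
edges <- data.frame(from = rep(x$ID, lengths(nb)), to = unlist(nb))

# Add every pair in the reverse direction, then drop duplicates and self-loops
edges <- unique(rbind(edges, data.frame(from = edges$to, to = edges$from)))
edges <- edges[edges$from != edges$to, ]

# Collapse back to one sorted, comma-separated string per person
collapsed <- tapply(edges$to, edges$from,
                    function(v) paste(sort(v), collapse = ","))
x$Neighbour_ID <- as.vector(collapsed[as.character(x$ID)])
x
```

Anyone who ends up with no neighbours at all would get NA in Neighbour_ID with this sketch.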
I have a dataframe with a column of ids, but for some rows there are multiple ids concatenated together. I want to merge this onto another dataframe using the id, and when the ids are concatenated it handles that and reflects it by having the values in the new columns added also concatenated.
For example I have dataframes
data <- data.frame(
id = c(1, 4, 3, "2,3", "1,4"),
value = c(1:5)
)
> data
id value
1 1 1
2 4 2
3 3 3
4 2,3 4
5 1,4 5
mapping <- data.frame(
id = 1:4,
name = c("one", "two", "three", "four")
)
> mapping
id name
1 1 one
2 2 two
3 3 three
4 4 four
I would like to end up with
id value name
1 1 1 one
2 4 2 four
3 3 3 three
4 2,3 4 two,three
5 1,4 5 one,four
I don't think there's a good way to do this other than to separate, join, and re-concatenate:
library(dplyr)
library(tidyr)
data %>%
mutate(true_id = row_number()) %>%
separate_rows(id, convert = TRUE) %>%
left_join(mapping, by = "id") %>%
group_by(true_id, value) %>%
summarize(id = toString(id), name = toString(name), .groups = "drop")
# # A tibble: 5 × 4
# true_id value id name
# <int> <int> <chr> <chr>
# 1 1 1 1 one
# 2 2 2 4 four
# 3 3 3 3 three
# 4 4 4 2, 3 two, three
# 5 5 5 1, 4 one, four
I wasn't sure if your value column would actually be unique, so I added a true_id just in case.
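The separate / look up / re-concatenate idea can also be sketched in base R with a named lookup vector. This is a minimal sketch, assuming the id column is stored as character (as it is in the question, because of the "2,3" entries); the name lookup is introduced here purely for illustration.

```r
data <- data.frame(id = c("1", "4", "3", "2,3", "1,4"), value = 1:5,
                   stringsAsFactors = FALSE)
mapping <- data.frame(id = 1:4, name = c("one", "two", "three", "four"),
                      stringsAsFactors = FALSE)

# Named vector: names are the ids (coerced to strings), values are the names
lookup <- setNames(mapping$name, mapping$id)

# For each row: split the id string, look up every piece, paste the results back
data$name <- vapply(strsplit(data$id, ","),
                    function(ids) paste(lookup[ids], collapse = ","),
                    character(1))
data
```

Because nothing is reshaped, the row order and the original id strings are left untouched. An id that is missing from mapping would show up as the string "NA" in the result.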
What about something like this? I could think of a few ways. One is longer but much easier to follow; the other is short but kind of a mess.
library(tidyverse)
#long and readable
data |>
mutate(tmp = row_number()) |>
mutate(id = str_split(id, ",")) |>
unnest_longer(id) |>
left_join(mapping |>
mutate(id = as.character(id)), by = "id") |>
group_by(tmp) |>
summarise(id = paste(id, collapse = ","),
value = value[1],
name = paste(name, collapse = ","))
#> # A tibble: 5 x 4
#> tmp id value name
#> <int> <chr> <int> <chr>
#> 1 1 1 1 one
#> 2 2 4 2 four
#> 3 3 3 3 three
#> 4 4 2,3 4 two,three
#> 5 5 1,4 5 one,four
#short and ugly
data |>
mutate(name = map_chr(id, \(x)paste(
mapping$name[which(as.character(mapping$id) %in% str_split(x, ",")[[1]])],
collapse = ",") ))
#> id value name
#> 1 1 1 one
#> 2 4 2 four
#> 3 3 3 three
#> 4 2,3 4 two,three
#> 5 1,4 5 one,four
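Using grep to match the ids in data$id against mapping$id. Note that the pattern built here is a character class of single digits, so this only works while every id is a single digit.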
mapply(\(x, y) toString(mapping$name[grep(sprintf('[%s]', gsub('\\D', '', x)), y)]),
data$id, list(mapping$id))
# 1 4 3 2,3 1,4
# "one" "four" "three" "two, three" "one, four"
In order not to have a space after the comma, use paste(., collapse=',') instead of toString.
I want to restructure some "multiple response" survey data from binary to nominal categories.
The survey asks the responder which ten people they most often interact with and gives a list of 50 names. The data comes back with 50 columns, one column for each name, and a name value in each cell for each name selected and blank for unselected names. I want to convert the fifty columns into ten columns (name1 to name10).
Below is an example of what I mean with (for simplicity) 5 names, where the person must select two names with five responders.
id <- 1:5
mike <- c("","mike","","","mike")
tim <- c("tim","","tim","","")
mary <- c("mary","mary","mary","","")
jane <- c("","","","jane","jane")
liz <- c("","","","liz","")
surveyData <- data.frame(id,mike,tim,mary,jane,liz)
Name1 <- c("tim","mike","tim","jane","mike")
Name2 <- c("mary","mary","mary","liz","jane")
restructuredSurveyData <- data.frame(id,Name1,Name2)
Replace '' with NA, then apply na.omit to each row.
cbind(surveyData[1], `colnames<-`(t(apply(replace(surveyData[-1],
surveyData[-1] == '', NA), 1,
na.omit)), paste0('name_', 1:2)))
# id name_1 name_2
# 1 1 tim mary
# 2 2 mike mary
# 3 3 tim mary
# 4 4 jane liz
# 5 5 mike jane
A spoiled eye may like this better these days:
replace(surveyData[-1], surveyData[-1] == '', NA) |>
apply(1, na.omit) |>
t() |>
`colnames<-`(paste0('name_', 1:2)) |>
cbind(surveyData[1]) |>
subset(select=c('id', 'name_1', 'name_2'))
# id name_1 name_2
# 1 1 tim mary
# 2 2 mike mary
# 3 3 tim mary
# 4 4 jane liz
# 5 5 mike jane
Note: R >= 4.1 used.
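One caveat with the apply/na.omit approach: if respondents select different numbers of names, apply returns a list rather than a matrix. Here is a minimal sketch that pads short rows with NA, assuming a hypothetical cap k on the number of selections (2 in this toy example, 10 in the real survey):

```r
surveyData <- data.frame(
  id   = 1:5,
  mike = c("", "mike", "", "", "mike"),
  tim  = c("tim", "", "tim", "", ""),
  mary = c("mary", "mary", "mary", "", ""),
  jane = c("", "", "", "jane", "jane"),
  liz  = c("", "", "", "liz", ""),
  stringsAsFactors = FALSE
)

k <- 2  # assumed maximum number of names a respondent can select

# For each row keep the non-empty cells, pad with NA, and cut to length k
picked <- t(apply(surveyData[-1], 1, function(r) {
  sel <- unname(r[r != ""])
  c(sel, rep(NA, k))[seq_len(k)]
}))
colnames(picked) <- paste0("Name", seq_len(k))

restructured <- cbind(surveyData["id"], picked)
restructured
```

With exactly two selections per row this gives the same table as above; with fewer, the trailing columns are NA.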
Another possible solution, based on tidyverse:
library(tidyverse)
surveyData %>%
pivot_longer(-id) %>%
filter(value != "") %>%
mutate(nam = if_else(row_number() %% 2 == 1, "names1", "names2")) %>%
pivot_wider(id_cols = id, names_from = nam, values_from = value)
#> # A tibble: 5 × 3
#> id names1 names2
#> <int> <chr> <chr>
#> 1 1 tim mary
#> 2 2 mike mary
#> 3 3 tim mary
#> 4 4 jane liz
#> 5 5 mike jane
Or using purrr::pmap_df:
library(tidyverse)
pmap_df(surveyData[-1], ~ str_c(c(...)[c(...) != ""], collapse = ",") %>%
set_names("names")) %>%
separate(names, into = str_c("names", 1:2), sep = ",") %>%
bind_cols(select(surveyData, id), .)
#> id names1 names2
#> 1 1 tim mary
#> 2 2 mike mary
#> 3 3 tim mary
#> 4 4 jane liz
#> 5 5 mike jane
I have a large data set that requires some converting but I am not sure what to do.
Let's say I have 2 participants in my study.
football_enjoyment <- c(5,3)
basketball_enjoyment <- c(5,5)
football_participation <- c(1,2)
basketball_participation <- c(1,3)
df<- data.frame(football_enjoyment,football_participation,
basketball_enjoyment,basketball_participation)
df$id <- seq.int(nrow(df))
df
## football_enjoyment football_participation basketball_enjoyment basketball_participation id
# 5 1 5 1 1
# 3 2 5 3 2
I want it to be like this
sports <- c("football","football", "basketball","basketball")
enjoyment_score <- c(5,3,5,5)
participation_score <- c(1,2,1,3)
id <- c(1,2)
df2 <- data.frame(sports, enjoyment_score,participation_score, id)
df2
## sports enjoyment_score participation_score id
# football 5 1 1
# football 3 2 2
# basketball 5 1 1
# basketball 5 3 2
I am stuck with the structure and the column/row names are just for demonstration purpose.
With tidyverse you could do:
library(tidyverse)
library(reshape2)
df %>% gather("variable", "value", - id) %>%
separate(variable, into = c("sports", "variable"), sep = "_") %>%
dcast(id + sports ~ variable) %>% arrange(desc(sports))
# id sports enjoyment participation
#1 1 football 5 1
#2 2 football 3 2
#3 1 basketball 5 1
#4 2 basketball 5 3
Or, in base you could do:
df2 <- reshape(df, varying = c("football_enjoyment", "football_participation", "basketball_enjoyment", "basketball_participation"),
direction = "long",
idvar = "id",
sep = "_",
timevar = "sports",
times = c("football", "basketball"), v.names = c('enjoyment', 'participation'))
rownames(df2) <- NULL
# id sports enjoyment participation
#1 1 football 5 1
#2 2 football 3 2
#3 1 basketball 5 1
#4 2 basketball 5 3
tidyr 1.0.0 has a pivot_longer function that can do this:
library(tidyr)
football_enjoyment <- c(5,3)
basketball_enjoyment <- c(5,5)
football_participation <- c(1,2)
basketball_participation <- c(1,3)
df<- data.frame(football_enjoyment,football_participation,
basketball_enjoyment,basketball_participation)
df$id <- seq.int(nrow(df))
df
#> football_enjoyment football_participation basketball_enjoyment
#> 1 5 1 5
#> 2 3 2 5
#> basketball_participation id
#> 1 1 1
#> 2 3 2
df %>% pivot_longer(-id, names_to = c("sports",".value"), names_sep = "_")
#> # A tibble: 4 x 4
#> id sports enjoyment participation
#> <int> <chr> <dbl> <dbl>
#> 1 1 football 5 1
#> 2 1 basketball 5 1
#> 3 2 football 3 2
#> 4 2 basketball 5 3
Created on 2019-09-20 by the reprex package (v0.3.0)
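To see what the ".value" placeholder is doing, here is a rough base R sketch of the same idea: the part of each column name before the underscore becomes a value in a sports column, and the part after it becomes the new column name. It assumes every measure column has the form sport_measure; the helper names (measure_cols, parts, long) are made up for this illustration.

```r
df <- data.frame(football_enjoyment = c(5, 3),
                 football_participation = c(1, 2),
                 basketball_enjoyment = c(5, 5),
                 basketball_participation = c(1, 3))
df$id <- seq_len(nrow(df))

measure_cols <- setdiff(names(df), "id")
# Two-column character matrix: column 1 = sport, column 2 = measure
parts <- do.call(rbind, strsplit(measure_cols, "_", fixed = TRUE))

# Build one block per sport, renaming its columns to the bare measure names
long <- do.call(rbind, lapply(unique(parts[, 1]), function(s) {
  block <- df[measure_cols[parts[, 1] == s]]
  names(block) <- parts[parts[, 1] == s, 2]
  data.frame(id = df$id, sports = s, block, stringsAsFactors = FALSE)
}))
rownames(long) <- NULL
long
```

The rows come out grouped by sport rather than by id, which matches the layout asked for in the question.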
I have a df like this
name <- c("Fred","Mark","Jen","Simon","Ed")
a_or_b <- c("a","a","b","a","b")
abc_ah_one <- c(3,5,2,4,7)
abc_bh_one <- c(5,4,1,9,8)
abc_ah_two <- c(2,1,3,7,6)
abc_bh_two <- c(3,6,8,8,5)
abc_ah_three <- c(5,4,7,6,2)
abc_bh_three <- c(9,7,2,1,4)
def_ah_one <- c(1,3,9,2,7)
def_bh_one <- c(2,8,4,6,1)
def_ah_two <- c(4,7,3,2,5)
def_bh_two <- c(5,2,9,8,3)
def_ah_three <- c(8,5,3,5,2)
def_bh_three <- c(2,7,4,3,0)
df <- data.frame(name,a_or_b,abc_ah_one,abc_bh_one,abc_ah_two,abc_bh_two,
abc_ah_three,abc_bh_three,def_ah_one,def_bh_one,
def_ah_two,def_bh_two,def_ah_three,def_bh_three)
I want to use the value in column "a_or_b" to choose the values in each of the corresponding "ah/bh" columns for each "abc" (one, two, and three), and put it into a new data frame. For example, Fred would have the values 3, 2 and 5 in his row in the new df. Those values represent the values of each of his "ah" categories for the abc columns. Jen, who has "b" in her a_or_b column, would have all of her "bh" values from her abc columns for her row in the new df. Here is what my desired output would look like:
combo_one <- c(3,5,1,4,8)
combo_two <- c(2,1,8,7,5)
combo_three <- c(5,4,2,6,4)
df2 <- data.frame(name,a_or_b,combo_one,combo_two,combo_three)
I've attempted this using sapply. The following gives me a matrix of the correct column indexes of df[grep("abc",colnames(df),fixed=TRUE)] for each row:
sapply(paste0(df$a_or_b,"h"),grep,colnames(df[grep("abc",colnames(df),fixed=TRUE)]))
First we gather your data into a tidy long format, then break out the columns into something useful. After that the filtering is simple, and if necessary we can convert back to a difficult wide format:
library(dplyr)
library(tidyr)
gather(df, key = "var", value = "val", -name, -a_or_b) %>%
separate(var, into = c("combo", "h", "ind"), sep = "_") %>%
mutate(h = substr(h, 1, 1)) %>%
filter(a_or_b == h, combo == "abc") %>%
arrange(name) -> result_long
result_long
# name a_or_b combo h ind val
# 1 Ed b abc b one 8
# 2 Ed b abc b two 5
# 3 Ed b abc b three 4
# 4 Fred a abc a one 3
# 5 Fred a abc a two 2
# 6 Fred a abc a three 5
# 7 Jen b abc b one 1
# 8 Jen b abc b two 8
# 9 Jen b abc b three 2
# 10 Mark a abc a one 5
# 11 Mark a abc a two 1
# 12 Mark a abc a three 4
# 13 Simon a abc a one 4
# 14 Simon a abc a two 7
# 15 Simon a abc a three 6
spread(result_long, key = ind, value = val) %>%
select(name, a_or_b, one, two, three)
# name a_or_b one two three
# 1 Ed b 8 5 4
# 2 Fred a 3 2 5
# 3 Jen b 1 8 2
# 4 Mark a 5 1 4
# 5 Simon a 4 7 6
A base R approach would be to use lapply: we loop through each row of the data frame, use paste0 with the a_or_b value to build a pattern that finds the matching columns, and then rbind the values for all rows together.
new_df <- do.call("rbind", lapply(seq(nrow(df)), function(x)
setNames(df[x, grepl(paste0("abc_",df[x,"a_or_b"], "h"), colnames(df))],
c("combo_one", "combo_two", "combo_three"))))
new_df
# combo_one combo_two combo_three
#1 3 2 5
#2 5 1 4
#3 1 8 2
#4 4 7 6
#5 8 5 4
We can then cbind the required columns:
cbind(df[c(1, 2)], new_df)
# name a_or_b combo_one combo_two combo_three
#1 Fred a 3 2 5
#2 Mark a 5 1 4
#3 Jen b 1 8 2
#4 Simon a 4 7 6
#5 Ed b 8 5 4
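The row-by-row loop can also be replaced by matrix indexing, where each row supplies its own column name. This is a minimal sketch under the assumption that all wanted columns follow the abc_<a|b>h_<suffix> naming pattern; only the abc columns are rebuilt here to keep the block short.

```r
df <- data.frame(
  name = c("Fred", "Mark", "Jen", "Simon", "Ed"),
  a_or_b = c("a", "a", "b", "a", "b"),
  abc_ah_one = c(3, 5, 2, 4, 7),   abc_bh_one = c(5, 4, 1, 9, 8),
  abc_ah_two = c(2, 1, 3, 7, 6),   abc_bh_two = c(3, 6, 8, 8, 5),
  abc_ah_three = c(5, 4, 7, 6, 2), abc_bh_three = c(9, 7, 2, 1, 4),
  stringsAsFactors = FALSE
)

suffixes <- c("one", "two", "three")
m <- as.matrix(df[grep("^abc_", names(df))])  # numeric matrix of the abc columns
rows <- seq_len(nrow(df))

combo <- sapply(suffixes, function(s) {
  # Column each row should read from, e.g. "abc_ah_one" or "abc_bh_one"
  cols <- paste0("abc_", df$a_or_b, "h_", s)
  # Two-column index matrix: (row number, column number) per row
  m[cbind(rows, match(cols, colnames(m)))]
})
colnames(combo) <- paste0("combo_", suffixes)

result <- cbind(df[c("name", "a_or_b")], combo)
result
```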
It's possible to do this with mutate and if_else, after stripping the abc_ prefix from the column names:
library(tidyverse)
df %>%
select(name, a_or_b, starts_with("abc")) %>%
rename_with(~ sub("abc_", "", .x), starts_with("abc")) %>%
# pick the ah_* or bh_* value row by row, depending on a_or_b
mutate(combo_one = if_else(a_or_b == "a", ah_one, bh_one),
combo_two = if_else(a_or_b == "a", ah_two, bh_two),
combo_three = if_else(a_or_b == "a", ah_three, bh_three)) %>%
select(name, a_or_b, starts_with("combo"))
Output:
name a_or_b combo_one combo_two combo_three
1 Fred a 3 2 5
2 Mark a 5 1 4
3 Jen b 1 8 2
4 Simon a 4 7 6
5 Ed b 8 5 4
I have this example: df.Journal.Conferences
venue author0 author1 author2 ... author19
A John Mary
B Peter Jacob Isabella
C Lia
B Jacob Lara John
C Mary
B Isabella
I want to know how many unique authors are in each venue
Result:
A 2
B 5
C 2
Edit:
Here is the link to my data: GoogleDrive Excel sheet.
Because your data was hard to reproduce, I generated a similar data set. This should work:
set.seed(1984)
df <- data.frame(id = sample(1:5,10, replace= T),
v1 = sample(letters[1:5],10,replace= T),
v2 = sample(letters[1:5],10,replace= T),
v3 = sample(letters[1:5],10,replace= T),
v4 = sample(letters[1:5],10,replace= T),
stringsAsFactors = F)
z <- data.frame( id = unique(df$id), n = NA )
for (i in z$id) {
z$n[z$id == i] <- length(unique(unlist(df[df$id == i,-1])))
}
z
# id n
# 1 4 4
# 2 3 4
# 3 2 4
# 4 5 4
# 5 1 3
Using @zx8754's data for testing, this code gives what you wanted (assuming you have NA for empty cells in the dataframe):
sapply(split(df1[,-1], df1$venue), function(x) length(unique(x[!is.na(x)])))
# A B C
# 2 5 2
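If the empty cells are blank strings rather than NA (as the table in the question suggests), they can be converted first. A minimal sketch, assuming blanks are stored as "" and using a hand-built copy of the example data:

```r
df1 <- data.frame(
  venue   = c("A", "B", "C", "B", "C", "B"),
  author0 = c("John", "Peter", "Lia", "Jacob", "Mary", "Isabella"),
  author1 = c("Mary", "Jacob", "", "Lara", "", ""),
  author2 = c("", "Isabella", "", "John", "", ""),
  stringsAsFactors = FALSE
)

authors <- df1[-1]
authors[authors == ""] <- NA   # treat blank cells as missing

# Per venue: flatten all author columns and count distinct non-missing names
counts <- sapply(split(authors, df1$venue), function(x) {
  v <- unlist(x)
  length(unique(v[!is.na(v)]))
})
counts
```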
Using dplyr and tidyr, reshape the data from wide to long, then group by venue and count the distinct names.
library(dplyr)
library(tidyr)
gather(df1, key = author, value = name, -venue) %>%
select(venue, name) %>%
group_by(venue) %>%
summarise(n = n_distinct(name, na.rm = TRUE))
# # A tibble: 3 × 2
# venue n
# <chr> <int>
# 1 A 2
# 2 B 5
# 3 C 2
data
df1 <- read.table(text ="
venue,author0,author1,author2
A,John,Mary,NA
B,Peter,Jacob,Isabella
C,Lia,NA,NA
B,Jacob,Lara,John
C,Mary,NA,NA
B,Isabella,NA,NA
", header = TRUE, sep = ",", stringsAsFactors = FALSE)
Edit: I saved your Excel sheet as CSV and read it in using read.csv; the code above then returns the output below:
df1 <- read.csv("Journal_Conferences_Authors.csv", na.strings = "#N/A")
# output
# # A tibble: 427 × 2
# venue n
# <fctr> <int>
# 1 AAAI 4
# 2 ACC 4
# 3 ACIS-ICIS 5
# 4 ACM SIGSOFT Software Engineering Notes 1
# 5 ACM Southeast Regional Conference 5
# 6 ACM TIST 3
# 7 ACM Trans. Comput.-Hum. Interact. 3
# 8 ACML 2
# 9 ADMA 2
# 10 Advanced Visual Interfaces 3
# # ... with 417 more rows