r creating an adjacency matrix from columns in a dataframe - r

I am interested in testing some network visualization techniques but before trying those functions I want to build an adjacency matrix (from, to) using the dataframe which is as follows.
Id  Gender  Col_Cold_1  Col_Cold_2  Col_Cold_3  Col_Hot_1  Col_Hot_2   Col_Hot_3
10  F       pain        sleep       NA          infection  medication  walking
14  F       Bump        NA          muscle      NA         twitching   flutter
17  M                   pain        hemaloma    Callus     infection
18  F       muscle                  pain                   twitching   medication
My goal is to create an adjacency matrix as follows
1) All values in columns with keyword Cold will contribute to the rows
2) All values in columns with keyword Hot will contribute to the columns
For example, pain, sleep, Bump, muscle, hemaloma are cell values under the columns with keyword Cold and they will form the rows and cell values such as infection, medication, Callus, walking, twitching, flutter are under columns with keywords Hot and this will form the columns of the association matrix.
The final desired output should appear like this:
infection medication walking twitching flutter Callus
pain 2 2 1 1 1
sleep 1 1 1
Bump 1 1
muscle 1 1
hemaloma 1 1
[pain, infection] = 2 because the association between pain and infection occurs twice in the original dataframe: once in row 1 and again in row 3.
[pain, medication] = 2 because the association between pain and medication occurs twice: once in row 1 and again in row 4.
Any suggestions or advice on producing such an association matrix is much appreciated thanks.
Reproducible Dataset
df = structure(list(id = c(10, 14, 17, 18), Gender = structure(c(1L, 1L, 2L, 1L), .Label = c("F", "M"), class = "factor"), Col_Cold_1 = structure(c(4L, 2L, 1L, 3L), .Label = c("", "Bump", "muscle", "pain"), class = "factor"), Col_Cold_2 = structure(c(4L, 2L, 3L, 1L), .Label = c("", "NA", "pain", "sleep"), class = "factor"), Col_Cold_3 = structure(c(1L, 3L, 2L, 4L), .Label = c("NA", "hemaloma", "muscle", "pain" ), class = "factor"), Col_Hot_1 = structure(c(4L, 3L, 2L, 1L), .Label = c("", "Callus", "NA", "infection"), class = "factor"), Col_Hot_2 = structure(c(2L, 3L, 1L, 3L), .Label = c("infection", "medication", "twitching"), class = "factor"), Col_Hot_3 = structure(c(4L, 2L, 1L, 3L), .Label = c("", "flutter", "medication", "walking" ), class = "factor")), .Names = c("id", "Gender", "Col_Cold_1", "Col_Cold_2", "Col_Cold_3", "Col_Hot_1", "Col_Hot_2", "Col_Hot_3" ), row.names = c(NA, -4L), class = "data.frame")

One way is to make the dataset into a "tidy" form, then use xtabs. First, some cleaning up:
df[] <- lapply(df, as.character) # Convert factors to characters
df[df == "NA" | df == "" | is.na(df)] <- NA # Make all blanks NAs
Now, tidy the dataset:
library(tidyr)
library(dplyr)
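# Pair each "cold" column with every "hot" column, then stack the pairs.
# Note: gather_() and one_of() are deprecated/superseded in current tidyr/dplyr.
out <- do.call(rbind, sapply(grep("^Col_Cold", names(df), value = TRUE), function(x) {
  vars <- c(x, grep("^Col_Hot", names(df), value = TRUE))
  setNames(gather_(select(df, one_of(vars)),
                   key_col = x,
                   value_col = "value",
                   gather_cols = vars[-1])[, c(1, 3)], c("cold", "hot"))
}, simplify = FALSE))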
The idea is to "pair" each of the "cold" columns with each of the "hot" columns to make a long dataset. out looks like this:
out
# cold hot
# 1 pain infection
# 2 Bump <NA>
# 3 <NA> Callus
# 4 muscle <NA>
# 5 pain medication
# ...
Finally, use xtabs to make the desired output:
xtabs(~ cold + hot, na.omit(out))
# hot
# cold Callus flutter infection medication twitching walking
# Bump 0 1 0 0 1 0
# hemaloma 1 0 1 0 0 0
# muscle 0 1 0 1 2 0
# pain 1 0 2 2 1 1
# sleep 0 0 1 1 0 1
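The same pairing idea can also be sketched in base R, without gather_. This is only an illustration under stated assumptions: the data frame below is a hand-typed copy of the cleaned Cold/Hot columns (blanks and "NA" strings already turned into NA), and the names cold_cols, hot_cols, combos, pairs and adj are made up for this example.

```r
# Cleaned copy of the Cold/Hot columns from the question (assumed already cleaned)
df <- data.frame(
  Col_Cold_1 = c("pain", "Bump", NA, "muscle"),
  Col_Cold_2 = c("sleep", NA, "pain", NA),
  Col_Cold_3 = c(NA, "muscle", "hemaloma", "pain"),
  Col_Hot_1  = c("infection", NA, "Callus", NA),
  Col_Hot_2  = c("medication", "twitching", "infection", "twitching"),
  Col_Hot_3  = c("walking", "flutter", NA, "medication"),
  stringsAsFactors = FALSE
)

cold_cols <- grep("^Col_Cold", names(df), value = TRUE)
hot_cols  <- grep("^Col_Hot",  names(df), value = TRUE)

# Every (cold column, hot column) combination
combos <- expand.grid(cold = cold_cols, hot = hot_cols, stringsAsFactors = FALSE)

# For each combination, line up the two columns row by row, then stack everything
pairs <- do.call(rbind, lapply(seq_len(nrow(combos)), function(i) {
  data.frame(cold = df[[combos$cold[i]]],
             hot  = df[[combos$hot[i]]],
             stringsAsFactors = FALSE)
}))

# Count co-occurrences, ignoring pairs where either side is missing
adj <- xtabs(~ cold + hot, data = na.omit(pairs))
adj
```

Each row of the original data contributes one count for every non-missing (cold, hot) combination it contains, which is why [pain, infection] comes out as 2.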

Related

How to order contingency table based on data order?

Given
Group ss
B male
B male
B female
A male
A female
X male
Then
tab <- table(res$Group, res$ss)
I want the group column to be in the order B, A, X, as it appears in the data. Currently it is in alphabetical order, which is not what I want. This is what I want:
MALE FEMALE
B 5 5
A 5 10
X 10 12
If you arrange the factor levels based on the order you want, you'll get the desired result.
res$Group <- factor(res$Group, levels = c('B', 'A', 'X'))
#If it is based on occurrence in Group column we can use
#res$Group <- factor(res$Group, levels = unique(res$Group))
table(res$Group, res$ss)
#Or just
#table(res)
# female male
# B 1 2
# A 1 1
# X 0 1
data
res <- structure(list(Group = structure(c(2L, 2L, 2L, 1L, 1L, 3L),
.Label = c("A", "B", "X"), class = "factor"), ss = structure(c(2L, 2L, 1L, 2L,
1L, 2L), .Label = c("female", "male"), class = "factor")),
class = "data.frame", row.names = c(NA, -6L))
unique returns the unique elements of a vector in the order they occur. A table can be ordered like any other structure by extracting its elements in the order you want. So if you pass the output of unique to [,] then you'll get the table sorted in the order of occurrence of the vector.
tab <- table(res$Group, res$ss)[unique(res$Group),]

Multiple aggregation with unspecified FUN in R

I have a data.frame object in R and need to:
Group by col_1
Select rows from col_3 such that the col_2 value is the second largest one (if there is only one observation for the given value of col_1, return 'NA', for instance).
How can I obtain this?
Example:
scored xg first_goal scored_mane
1 1 1.03212 Lallana 0
2 1 2.06000 Mane 1
3 2 2.38824 Robertson 1
4 2 1.64291 Mane 1
Group by "scored_mane", return values from "scored" where "xg" is the second largest. Expected output: "NA", 1
You can try the following base R solution, using aggregate + merge
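# Second-largest xg per scored_mane group (NA when a group has fewer than two rows),
# then merge back onto df to pick up the matching "scored" value.
res <- merge(aggregate(xg ~ scored_mane, df, function(v) sort(v, decreasing = TRUE)[2]),
             df, all.x = TRUE)[, "scored"]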
such that
> res
[1] NA 1
DATA
structure(list(scored = c(1L, 1L, 2L, 2L), xg = c(1.03212, 2.06,
2.38824, 1.64291), first_goal = c("Lallana", "Mane", "Robertson",
"Mane"), scored_mane = c(0L, 1L, 1L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4")) -> df
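An alternative that avoids the merge step is to split the data by group and pick the row directly. This is a sketch only; second_largest_scored is a hypothetical helper name, and the data is retyped from the example above.

```r
df <- data.frame(
  scored      = c(1L, 1L, 2L, 2L),
  xg          = c(1.03212, 2.06, 2.38824, 1.64291),
  first_goal  = c("Lallana", "Mane", "Robertson", "Mane"),
  scored_mane = c(0L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "scored" from the row with the second-largest xg,
# or NA when the group has fewer than two rows
second_largest_scored <- function(d) {
  if (nrow(d) < 2) return(NA_integer_)
  d$scored[order(d$xg, decreasing = TRUE)[2]]
}

# One result per value of scored_mane
res <- sapply(split(df, df$scored_mane), second_largest_scored)
res
```

The result is a vector named by the grouping value, so the group that each answer belongs to stays visible.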

Is there an R function to group a table by a certain variable? [duplicate]

I am trying to remove some rows of my data by adding them to a different row, in the form of another column. Is there a way I can group rows together by a certain variable?
I have tried using group_by statement in the dplyr package, but it does not seem to solve my issue.
library(dplyr)
late <- read.csv(file.choose())
late <- group_by(late, state, add = FALSE)
The data set I have (named "late") now is in this form:
ontime state count
0 AL 1
1 AL 44
null AL 3
0 AR 5
1 AR 50
...
But I would like it to be:
state count0 count1 countnull
AL 1 44 3
AR 5 50 null
...
Ultimately, I want to calculate count0/count1 for each state. So if there is a better way of going about this, I would be open to any suggestions.
You could do this with dcast() from the reshape2 package
library(reshape2)
df = data.frame(
ontime = c(0,1,NA,0,1),
state = c("AL","AL","AL","AR","AR"),
count = c(1,44,3,5,50)
)
dcast(df, state ~ ontime, value.var = "count")  # value.var names the column that fills the cells
With spread:
library(dplyr)
library(tidyr)
df %>%
mutate(ontime = paste0('count', ontime)) %>%
spread(ontime, count)
Output:
state count0 count1 countnull
1 AL 1 44 3
2 AR 5 50 NA
Data:
df <- structure(list(ontime = structure(c(1L, 2L, 3L, 1L, 2L), .Label = c("0",
"1", "null"), class = "factor"), state = structure(c(1L, 1L,
1L, 2L, 2L), .Label = c("AL", "AR"), class = "factor"), count = c(1L,
44L, 3L, 5L, 50L)), class = "data.frame", row.names = c(NA, -5L
))
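Since the stated end goal is count0/count1 per state, the ratio can also be computed without a reshaping package. The following is a minimal base R sketch, assuming the data looks like the example (retyped here as `late`, with ontime stored as text); tab and ratio are names chosen for this illustration.

```r
late <- data.frame(
  ontime = c("0", "1", "null", "0", "1"),
  state  = c("AL", "AL", "AL", "AR", "AR"),
  count  = c(1, 44, 3, 5, 50),
  stringsAsFactors = FALSE
)

# state x ontime matrix of summed counts; combinations that do not occur stay NA
tab <- tapply(late$count, list(late$state, late$ontime), sum)

# count0 / count1 for each state
ratio <- tab[, "0"] / tab[, "1"]
ratio
```

Because tapply leaves missing combinations as NA, a state with no "1" rows would get an NA ratio rather than a division by zero.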

Mapping values across a dataframe

I have a large dataset. The example below is a much abbreviated version.
There are two dataframes, df1 and df2. I would like to map to each row of df1, a derived value using conditions from df2 with arguments from df1.
Hope the example below makes more sense
year <- rep(1996:1997, each=3)
age_group <- rep(c("20-24","25-29","30-34"),2)
df1 <- as.data.frame(cbind(year,age_group))
df1 is a database with all permutations of year and age group.
df2 <- as.data.frame(rbind(c(111,1997,"20-24"),c(222,1997,"30-34")))
names(df2) <- c("id","year","age.group")
df2 is a database where each row represents an individual at a particular year
I would like to use arguments from df1 conditional on values from df2 and then to map to df1. The arguments are as follows:
each_yr <- map(df1, function(year,age_group) case_when(
as.character(df1$year) == as.character(df2$year) & as.character(df1$age_group)
== as.character(df2$age.group)~ 0,
TRUE ~ 1))
The output I get is wrong and is shown below:
structure(list(year = c(1, 1, 1, 1, 1, 0), age_group = c(1, 1,
1, 1, 1, 0)), .Names = c("year", "age_group"))
The output I would ideally like is something like this (a dataframe as an example, but I would be happy with a list):
structure(list(year = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1996",
"1997"), class = "factor"), age_group = structure(c(1L, 2L, 3L,
1L, 2L, 3L), .Label = c("20-24", "25-29", "30-34"), class = "factor"),
v1 = structure(c(2L, 2L, 2L, 1L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), v2 = structure(c(2L, 2L, 2L, 2L,
2L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("year",
"age_group", "v1", "v2"), row.names = c(NA, -6L), class = "data.frame")
I have used map before when 'df1' is a vector but in this scenario it is a dataframe where both columns are used as arguments. Can Map handle this?
In df3 the column v1 is the result of conditions based on df1 and df2 and then mapped to df1 for patient '111'. Likewise column v2 is the outcome for patient '222'.
Thanks in advance
Looks like some work for pmap instead. And a touch of tidyr to get the suggested result.
library(magrittr)  # provides %>%
purrr::pmap(list(df2$id, as.character(df2$year), as.character(df2$age.group)),
            function(id, x, y)
              data.frame(df1,
                         key = paste0("v", id),
                         value = 1 - as.integer((x == df1$year) & (y == df1$age_group)),
                         stringsAsFactors = FALSE
              )) %>%
  replyr::replyr_bind_rows() %>% tidyr::spread(key, value)
# year age_group v1 v2
#1 1996 20-24 1 1
#2 1996 25-29 1 1
#3 1996 30-34 1 1
#4 1997 20-24 0 1
#5 1997 25-29 1 1
#6 1997 30-34 1 0
Within the tidyverse you can do it this way:
library(tidyverse)
#library(dplyr)
#library(tidyr)
df2 %>%
mutate(tmp = 0) %>%
spread(id, tmp, fill = 1, sep = "_") %>%
right_join(df1, by = c("year", "age.group" = "age_group")) %>%
mutate_at(vars(-c(1, 2)), coalesce, 1)
# year age.group id_111 id_222
# 1 1996 20-24 1 1
# 2 1996 25-29 1 1
# 3 1996 30-34 1 1
# 4 1997 20-24 0 1
# 5 1997 25-29 1 1
# 6 1997 30-34 1 0
#Warning messages:
# 1: Column `year` joining factors with different levels, coercing to character vector
# 2: Column `age.group`/`age_group` joining factors with different levels, coercing to
# character vector
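For comparison, the same 0/1 flags can be sketched in base R by comparing combined year/age-group keys. This is an illustration under assumptions: df1 and df2 are retyped here as plain character/numeric columns rather than factors, and key1, key2, flags and res are made-up names.

```r
df1 <- data.frame(
  year      = rep(1996:1997, each = 3),
  age_group = rep(c("20-24", "25-29", "30-34"), 2),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id        = c(111, 222),
  year      = c(1997, 1997),
  age.group = c("20-24", "30-34"),
  stringsAsFactors = FALSE
)

# One key per row, e.g. "1997 20-24"
key1 <- paste(df1$year, df1$age_group)
key2 <- paste(df2$year, df2$age.group)

# outer() compares every df1 row with every df2 row; a match becomes 0, otherwise 1
flags <- 1L - outer(key1, key2, "==")
colnames(flags) <- paste0("v", df2$id)

res <- cbind(df1, flags)
res
```

Each individual in df2 gets one column, so the approach scales with the number of rows in df2 without any reshaping step.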

Counting where the string appears in row with caveat - [R]

I have a dataset where the first three columns (G1P1, G1P2, G1P3) indicate one grouping of three individuals (i.e. Sidney, Blake, Max on Row 1), the second three columns (G2P1, G2P2, G2P3) indicate another grouping of three individuals (i.e. David, Steve, Daniel on Row 2), etc. There are a total of 12 individuals, and the dataset contains pretty much all the possible groupings of these 12 people (approximately 300,000 rows). Each group's cumulative test score is given in the rightmost columns (G1.Sum, G2.Sum, G3.Sum, G4.Sum).
#### The dput(data) of the first five rows ####
data <- structure(list(X = 1:5, G1P1 = structure(c(4L, 4L, 4L, 4L, 4L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G1P2 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G1P3 = structure(c(4L, 4L, 4L, 4L, 4L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G2P1 = structure(c(2L, 2L, 1L, 1L, 3L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G2P2 = structure(c(4L, 4L, 3L, 3L, 2L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G2P3 = structure(c(3L, 3L, 2L, 2L, 1L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G3P1 = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G3P2 = structure(c(3L, 2L, 4L, 2L, 4L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G3P3 = structure(c(2L, 1L, 3L, 1L, 3L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G4P1 = structure(c(3L, 1L, 3L, 2L, 1L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G4P2 = structure(c(2L, 3L, 2L, 4L, 3L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G4P3 = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G1.Sum = c(63.33333333, 63.33333333, 63.33333333, 63.33333333, 63.33333333), G2.Sum = c(58.78333333, 58.78333333, 54.62333333, 54.62333333, 58.69), G3.Sum = c(54.62333333, 58.69, 58.78333333, 58.69, 58.78333333), G4.Sum = c(58.69, 54.62333333, 58.69, 58.78333333, 54.62333333)), .Names = c("X", "G1P1", "G1P2", "G1P3", "G2P1", "G2P2", "G2P3", "G3P1", "G3P2", "G3P3", "G4P1", "G4P2", "G4P3", "G1.Sum", "G2.Sum", "G3.Sum", "G4.Sum"), row.names = c(NA, 5L), class = "data.frame")
I was wondering how you would write an R function so that, for each row, you can record where each person's group score ranked. For example, on Row 1, SIDNEY was in the group with the highest score (63.3333), so SIDNEY's rank would be 1. BRANDON's group scored last (54.62333), so BRANDON's rank would be 4. I would like my final data.frame output to be something like this:
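# Rank the four group sums in each row (1 = highest) and repeat each rank for the 3 group members
ranks <- t(apply(data[grep("Sum", names(data))], 1, function(x) rep(match(x, sort(x, decreasing = TRUE)), each = 3)))
just.names <- data[grep("P", names(data))] # Subset without sums
names <- as.character(unlist(just.names[1, ])) # Name vector, taken from row 1
sapply(names, function(x) ranks[just.names == x]) # For each person, pick the rank at their position in every row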
# SIDNEY BLAKE MAX DAVID STEVE DANIEL CHRIS PATRIC BRANDON EVA NICK BEAU
# [1,] 1 1 1 2 2 2 4 4 4 3 3 3
# [2,] 1 1 1 2 2 2 4 4 4 3 3 3
# [3,] 1 1 1 2 2 2 4 4 4 3 3 3
# [4,] 1 1 1 2 2 2 4 4 4 3 3 3
# [5,] 1 1 1 2 2 2 4 4 4 3 3 3
We first rank the sums and replicate them 3 times each. Next we subset the larger data frame with the names only (take out the sum columns). We make a vector with the individual names. And lastly, we subset the ranks matrix that we created first by seeing where in the data frame the name appears.
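To make the ranking step concrete, here is a small worked example on the four group sums from row 1 of the posted data. The variable names sums, group_rank and per_person are chosen for this illustration only.

```r
# Group sums from row 1 of the question's data
sums <- c(G1 = 63.33333333, G2 = 58.78333333, G3 = 54.62333333, G4 = 58.69)

# Position of each sum in the descending sort order: 1 = highest score
group_rank <- match(sums, sort(sums, decreasing = TRUE))
group_rank

# Repeat each group's rank once per member, matching the 12 player columns
per_person <- rep(group_rank, each = 3)
per_person
```

Note that match returns the first matching position, so tied sums would share the better rank.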
Using dplyr and tidyr. First, ranking, then uniting all the rows with their rank, converting to long data, separating out the variables, then finally spreading.
It got really long, and can probably be simplified:
library(dplyr)
library(tidyr)
data[ ,14:17] <- t(apply(-data[ ,14:17], 1 , rank))
data %>% unite("g1", starts_with("G1")) %>%
unite("g2", starts_with("G2")) %>%
unite("g3", starts_with("G3")) %>%
unite("g4", starts_with("G4")) %>%
gather(Row, val, -X) %>%
select(-Row) %>%
separate(val, c("1", "2", "3", "rank")) %>%
gather(zzz, name, -X, -rank) %>%
select(-zzz) %>%
spread(name, rank)
X BEAU BLAKE BRANDON CHRIS DANIEL DAVID EVA MAX NICK PATRIC SIDNEY STEVE
1 1 3 1 4 4 2 2 3 1 3 4 1 2
2 2 3 1 4 4 2 2 3 1 3 4 1 2
3 3 3 1 4 4 2 2 3 1 3 4 1 2
4 4 3 1 4 4 2 2 3 1 3 4 1 2
5 5 3 1 4 4 2 2 3 1 3 4 1 2
Using previous answer's 'rank' matrix and library(reshape2) to convert wide data.frame to long data.frame,
library(reshape2)
test <- data  # assumption: `test` in this answer is the question's `data`
ranks <- t(apply(test[grep("Sum", names(test))], 1, function(x)
  rep(match(x, sort(x, decreasing = TRUE)), each = 3)))
colnames(ranks) <- names(test)[grep("P", names(test))]
# data subset: drop the score columns (named "Sum" in the posted data)
test_L <- test[, -grep("Sum", names(test))]
df_player <- data.frame(position = names(test)[grep("P", names(test))],
                        t(test_L[, -1]), row.names = NULL)
df_ranks <- data.frame(position = names(test)[grep("P", names(test))],
                       t(ranks), row.names = NULL)
# Combine two temporary data.frames
df_player_melted <- melt(df_player, id = 1,
                         variable.name = "rowNumber", value.name = "player")
df_ranks_melted <- melt(df_ranks, id = 1,
                        variable.name = "rowNumber", value.name = "rank")
df <- cbind(df_player_melted, rank = df_ranks_melted$rank)
# cast into the output format you want
df <- dcast(df, rowNumber ~ player + rank)[1, ]
