I am looking to add a 'description' variable to the vertices data frame which describes the cluster in which a node is found. My network represents family relationships, so clusters could be a family of two adults and two children, a single parent with three children, a couple, etc.
My data looks like
Vertices data frame
ID Date.Of.B Nationality
X1 02/05/1995 Ugandan
X2 10/10/2010 Ugandan
X3 15/12/1975 Irish
: : :
Edgelist
ID1 ID2
X1 X2
X1 X3
X2 X3
X3 X1
: :
I plan to create factor levels to describe clusters, i.e.
2 adults = 2A
2 adults 2 children = 2A2C
5 adults 0 children = 5A
After creating the graph using graph_from_data_frame() I can extract the components using components(), with components$membership giving each cluster a membership number and the IDs as the names of components$membership. I can apply a label to each vertex to record their status as an adult or child.
Basically I am looking to add another variable which classifies each ID according to the cluster it is in:
New vertices data frame
ID Date.Of.B Nationality Class
X1 02/05/1995 Ugandan 2A1C
X2 10/10/2010 Ugandan 2A1C
X3 15/12/1975 Irish 2A1C
: : :
I am thinking I am going to have to use some sort of loop that goes through each cluster and applies a level to each vertex via components$membership, roughly as sketched below.
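A minimal sketch of that loop, assuming the graph g has been created with graph_from_data_frame() and each vertex already carries a logical is_adult attribute (the attribute name is illustrative):
library(igraph)
# assumes g exists and V(g)$is_adult is a logical vertex attribute
cp <- components(g)
class_by_cluster <- sapply(seq_len(cp$no), function(i) {
  members <- which(cp$membership == i)
  n_adult <- sum(V(g)$is_adult[members])
  n_child <- length(members) - n_adult
  paste0(n_adult, "A", if (n_child > 0) paste0(n_child, "C") else "")
})
V(g)$Class <- class_by_cluster[cp$membership]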
This is one option I thought of and am currently working on.
Please let me know if you have any other ideas or better ways to do it.
Thanks
Maybe this helps:
library(igraph)
library(dplyr)
library(tidyr)
Generate example data:
set.seed(1)
vertices <- data.frame(ID = 1:20,
date = as.character(rnorm(20, -5000, 3000) + Sys.Date()),
Nationality = letters[1:20])
edgelist <- data.frame(from = sample(1:20, 15, replace = T),
to = sample(1:20, 15, replace = T))
g <- graph_from_data_frame(edgelist,
directed = F,
vertices = vertices)
cp <- components(g)
Save component-membership as new vertex attribute:
V(g)$components <- membership(cp)
Extract vertices plus additional attributes:
df <- get.data.frame(g, "vertices")
Work with the dataframe:
First generate a new coding variable based on age (in days), count the occurrences per component and coding, and paste the result into a new variable.
df <- df %>%
mutate(coding = ifelse(Sys.Date() - as.Date(date) > 6570, "A", "C")) %>%
group_by(components, coding) %>%
mutate(n = n()) %>%
ungroup() %>%
mutate(new = paste(n, coding, sep = "")) %>%
select(-coding, -n)
Then keep only the new and components columns, remove duplicates, and nest the data frame by components into a new data frame.
df2 <- df %>%
select(new, components) %>%
distinct(.keep_all = T) %>%
nest(-components)
After that you can merge the two data frames and loop (sapply) over the list-column to unlist your new class variable (in this case called data), which is also your final result.
df3 <- left_join(df, df2) %>%
select(-new)
df3$data <- sapply(df3$data, function(x) paste(unname(unlist(x)), collapse = ""))
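For what it's worth, a more compact sketch of the same idea (same assumptions as above, using count() instead of nesting), which returns the per-vertex class directly:
df_alt <- df %>%
  mutate(coding = ifelse(Sys.Date() - as.Date(date) > 6570, "A", "C")) %>%
  count(components, coding) %>%
  mutate(part = paste0(n, coding)) %>%
  group_by(components) %>%
  summarise(Class = paste(part, collapse = "")) %>%
  left_join(df, ., by = "components")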
I have a data frame in which the first column indicates the job (manager, employee or worker), the second indicates whether the person works at night or not, and the last is a household code (if two individuals share the same code then they share the same house).
# Here is the reproducible data:
PCS <- c("worker", "manager","employee","employee","worker","worker","manager","employee","manager","employee")
work_night <- c("Yes","Yes","No", "No","No","Yes","No","Yes","No","Yes")
HHnum <- c(1,1,2,2,3,3,4,4,5,5)
df <- data.frame(PCS,work_night,HHnum)
My problem is that I would like to have a new data frame with households instead of individuals. I would like to group individuals based on HHnum and then merge their answers.
For the variable "PCS" I have new categories based on the combination of answers : Manager+work ="I" ; manager+employee="II", employee+employee=VI, worker+worker=III etc
For the variable "work_night", I would like to apply a score (is both answered Yes then score=2, if one answered YES then score =1 and if both answered No then score = 0).
To be clear, I would like my data frame to look like this :
HHnum PCS work_night
1 "I" 2
2 "VI" 0
3 "III" 1
4 "II" 1
5 "II" 1
How can I do this in R using dplyr? I know that I need group_by(), but I don't know what to use after that.
Best,
Victor
Here is one way to do it (though I admit it is pretty verbose). I created a reference dataframe (i.e., combos) in case you have more than 3 categories, which is then joined with the main dataframe (i.e., df_new) to bring in the PCS roman numerals.
library(dplyr)
library(tidyr)
# Create a dataframe with all of the combinations of PCS.
combos <- expand.grid(unique(df$PCS), unique(df$PCS))
combos <- unique(t(apply(combos, 1, sort))) %>%
as.data.frame() %>%
dplyr::mutate(PCS = as.roman(row_number()))
# Create another dataframe with the columns reversed (will make it easier to join to the main dataframe).
combos2 <- data.frame(V1 = c(combos$V2), V2 = c(combos$V1), PCS = c(combos$PCS)) %>%
dplyr::mutate(PCS = as.roman(PCS))
combos <- rbind(combos, combos2)
# Get the count of "Yes" for each HHnum group.
# Then, put the PCS into 2 columns to join together with "combos" df.
df_new <- df %>%
dplyr::group_by(HHnum) %>%
dplyr::mutate(work_night = sum(work_night == "Yes")) %>%
dplyr::group_by(grp = rep(1:2, length.out = n())) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(names_from = grp, values_from = PCS) %>%
dplyr::rename("V1" = 3, "V2" = 4) %>%
dplyr::left_join(combos, by = c("V1", "V2")) %>%
unique() %>%
dplyr::select(HHnum, PCS, work_night)
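For comparison, a more compact sketch: collapse each household and map the sorted pair of PCS values through a small lookup (the lookup below just encodes the combinations listed in the question):
pcs_lookup <- c("manager worker" = "I", "employee manager" = "II",
                "worker worker" = "III", "employee employee" = "VI")
df %>%
  dplyr::group_by(HHnum) %>%
  dplyr::summarise(pair = paste(sort(as.character(PCS)), collapse = " "),
                   work_night = sum(work_night == "Yes")) %>%
  dplyr::mutate(PCS = unname(pcs_lookup[pair])) %>%
  dplyr::select(HHnum, PCS, work_night)
This reproduces the expected output shown in the question.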
I haven't been able to find an exact Q/A to match this problem, though there are several related ones. I'm trying to calculate a distance matrix for each group of points defined by an ID column, then select the two points that are farthest apart within each group, retaining the original group id. The number of points in each group is 2, 4 or 6.
My sf df:
df <- data.frame(x = runif(12), y = runif(12), id = rep(1:3,each = 4)) %>%
st_as_sf(coords = c("x","y"), crs = 27700)
I've tried code such as:
a <- df %>%
group_by(id) %>%
st_distance(.)
Though this just returns a distance matrix of all points.
The code below gives me what I want, though I fear it would be slow on large datasets:
maxMin <- do.call(rbind,lapply(unique(allInts$id), function(x) {
df <- allInts %>% filter(id == x)
d <- st_distance(df)
df %>% slice(unique(as.vector(which(d == max(d),arr.ind=T))))
}))
You can use dplyr::group_split to split your data frame into a list per group. You can then apply whatever function you want to that list using map/lapply.
The script below keeps the two points which are furthest apart in each group.
library(sf)
library(tidyverse)
# dummy data
data <- data.frame(x = runif(12), y = runif(12), id = rep(1:3,each = 4)) %>%
st_as_sf(coords = c("x","y"), crs = 27700)
# split it into a list per ID
data_group <- data %>%
group_by(id) %>%
group_split()
#apply a function to each list
distance_per_group <- map(data_group, function(x){
distance_matrix <- st_distance(x)
biggest_distance <- as.numeric(which(distance_matrix == max(distance_matrix), arr.ind = TRUE)[1,])
farthest_apart <- x[biggest_distance,]
})
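If you want the result back as a single sf object rather than a list, the pieces can then be bound together:
# combine the per-group results into one sf data frame (the id column is kept from the original rows)
farthest_pairs <- do.call(rbind, distance_per_group)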
Starting point:
I have a dataset (tibble) which contains a lot of variables of the same class (dbl). They belong to different settings. One variable (column in the tibble) is missing: the row sum of all variables belonging to one setting.
Aim:
My aim is to produce sub-datasets with the same data structure for each setting, including the row-sum variable (I call it "s1").
Problem:
Each setting has a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables, this is a typical case for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the setting of interest (this works) and
(2) compute rowSums over the variables of the setting (this does not work; why?).
Because it is a function for a specially designed dataset, the function includes two predefined variables:
day - any day of an investigation period
N - the number of cases investigated on that day
Thank you for any help.
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day,N,!!! subvars) %>%
dplyr::mutate(s1 = rowSums(!!! subvars,na.rm = TRUE))
return(dfplot)
}
We can convert the quosures to strings with as_name and use them to subset the data with [ inside rowSums:
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
v1 <- map_chr(subvars, as_name)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = rowSums( .[v1],na.rm = TRUE))
return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Or another option would be to select the quosures and then do the rowSums:
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
rowSums(na.rm = TRUE))
return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))
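As a further option, in more recent dplyr versions the rowSums can be computed with across(), which avoids the string conversion (a sketch based on the same function):
mkr.sumsetting2 <- function(..., dataset){
  subvars <- rlang::enquos(...)
  dfplot <- dataset %>%
    dplyr::select(day, N, !!! subvars) %>%
    # across() builds a data frame of just the selected columns for rowSums to consume
    dplyr::mutate(s1 = rowSums(dplyr::across(c(!!! subvars)), na.rm = TRUE))
  return(dfplot)
}
mkr.sumsetting2(col1, col2, dataset = df1)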
I'm working with a dataset that was poorly formatted and I'm trying to get it into a tidy format for statistical testing and data visualization. I'm hoping someone can provide some insight on whether I have the data in the correct tidy format and what the simplest way is to perform multiple t-tests.
Here is some sample data similar to my untidied format:
library(tidyverse)
data <- data.frame("subject_id" = 1:10, "age" = 21:30, "weight" = 150:159, "height" = 65:74,
"x_c1_avg" = c(1:9, NA), "y_c1_avg" = runif(10),"z_c1_avg" = c(9:1, NA),
"x_c2e1_avg" = c(1:9, NA), "y_c2e1_avg" = runif(10), "z_c2e1_avg" = runif(10),
"x_c2e2_avg" = runif(10), "y_c2e2_avg" = runif(10), "z_c2e2_avg" = runif(10))
glimpse(data)
The data frame contains demographic information and then three measures, collected under different conditions, with some of the measures being performed by two examiners (e.g. x_c1_avg is the average for measure x collected at condition 1 (a certain leg position), and y_c2e1_avg is the average for measure y collected by examiner 1 at condition 2).
So my first question: am I correct that the output of the code below would be considered tidy? Measure, condition and examiner are each in their own column, with the values in another column.
data2 <- data %>%
gather(key = "condition", value = "value", -c(subject_id:height)) %>%
separate(condition, into = c("measure", "condition"), sep = "_", extra = "drop") %>%
separate(condition, into = c("condition", "examiner"), sep = 2, fill = "right")
My second question: what is the most efficient way to perform a paired t-test on this data, or is there a way to do this without creating new vectors for each variable? There are 12 conditions in total but I'll only be performing t-tests for six comparisons. I'd be comparing measure x at c1 to measure x at c2 for each subject, or measure y by examiner 1 at condition 2 to measure y by examiner 2 at condition 2, and so on. My current code is:
x_c1 <- data2 %>%
filter(measure == "x", condition == "c1") %>%
select(value)
x_c2_e2 <- data2 %>%
filter(measure == "x", condition == "c2", examiner == "e2") %>%
select(value)
t.test(x_c1$value, x_c2_e2$value, paired = TRUE)
However, this seems much more complicated than it needs to be and feels like I'm reversing the work I did to get it in tidy format. It would have been much easier to run this from the start:
t.test(data$x_c1_avg, data$x_c2e2_avg, paired = TRUE)
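One option that avoids building intermediate vectors by hand is a small helper that pulls one measurement series out of the tidy frame in subject order; the helper name and arguments below are only illustrative:
# pull one measurement series from the tidy data, ordered by subject
get_series <- function(d, m, cond, exam = NULL) {
  d <- dplyr::filter(d, measure == m, condition == cond)
  if (!is.null(exam)) d <- dplyr::filter(d, examiner == exam)
  dplyr::arrange(d, subject_id)$value
}
t.test(get_series(data2, "x", "c1"),
       get_series(data2, "x", "c2", "e2"),
       paired = TRUE)
The same call can then be repeated (or lapply'd over a list of comparison definitions) for the remaining five comparisons.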
My problem:
I have two data frames, one for industries and one for occupations. They are nested by state, and show employment.
I also have a concordance matrix, which shows the weights of each of the occupations in each industry.
I would like to create a new employment number in the Occupation data frame, using the Industry employments and the concordance matrix.
I have made a dummy version of my problem, which I think makes it clear:
Update
I have solved the issue, but I would like to know if there is a more elegant solution. In reality my dimensions are 7 states * 200 industries * 350 occupations, so it becomes rather data hungry.
library(dplyr)
library(tidyr)
# create industry data frame
set.seed(12345)
ind_df <- data.frame(State = c(rep("a", len =6),rep("b", len =6),rep("c", len =6)),
industry = rep(c("Ind1","Ind2","Ind3","Ind4","Ind5","Ind6"), len = 18),
emp = rnorm(18,20,2))
# create occupation data frame
Occ_df <- data.frame(State = c(rep("a", len = 5), rep("b", len = 5), rep("c", len =5)),
occupation = rep(c("Occ1","Occ2","Occ3","Occ4","Occ5"), len = 15),
emp = rnorm(15,10,1))
# create concordance matrix
Ind_Occ_Conc <- matrix(rnorm(6*5,1,0.5),6,5) %>% as.data.frame()
# name cols in the concordance matrix
colnames(Ind_Occ_Conc) <- unique(Occ_df$occupation)
rownames(Ind_Occ_Conc) <- unique(ind_df$industry)
# solution
Ind_combined <- cbind(Ind_Occ_Conc, ind_df)
Ind_combined <- Ind_combined %>%
group_by(State) %>%
mutate(Occ1 = emp*Occ1,
Occ2 = emp*Occ2,
Occ3 = emp*Occ3,
Occ4 = emp*Occ4,
Occ5 = emp*Occ5
)
Ind_combined <- Ind_combined %>%
gather(key = "occupation",
value = "emp2",
-State,
-industry,
-emp
)
Ind_combined <- Ind_combined %>%
group_by(State, occupation) %>%
summarise(emp2 = sum(emp2))
Occ_df <- left_join(Occ_df,Ind_combined)
My solution seems pretty inefficient, is there a better / faster way to do this?
Also - I am not quite sure how to get to this - but the expected outcome would be another column added to Occ_df called emp2, derived from the ind_df emp column and Ind_Occ_Conc. I have tried to step this out for Occupation 1; essentially Ind_Occ_Conc contains weights and the result is a weighted average.
I'm not sure what you want to do with the sum(Ind$emp*Occ1_coeff) line, but maybe this is what you're looking for:
# Instead of doing the computation only for state a, get expected outcomes for all states (with dplyr):
Ind <- ind_df %>% group_by(State) %>%
summarize(rez = sum(emp))
# Then do some computations on Ind, which is a N element vector (one for each state)
# ...
# And finally, join Ind and Occ_df using merge
Occ_df <- merge(x = Occ_df, y = Ind, by = "State", all = TRUE)
The final output would then have the Ind values in a new column: one value for all rows of state a, one for b and one for c.
Hope it will help ;)
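As an aside, since Ind_Occ_Conc is just a weight matrix, the per-state weighted sums from the question can also be written as a matrix product, which should scale better to 7 states * 200 industries * 350 occupations. A sketch using the dummy objects above (it assumes the industry order within each state matches the row order of Ind_Occ_Conc, as it does in the dummy data):
emp2_by_state <- do.call(rbind, lapply(split(ind_df, ind_df$State), function(d) {
  data.frame(State = d$State[1],
             occupation = colnames(Ind_Occ_Conc),
             # t(W) %*% emp gives, for each occupation, the sum over industries of emp * weight
             emp2 = as.vector(t(as.matrix(Ind_Occ_Conc)) %*% d$emp))
}))
# join onto the original Occ_df (i.e. before emp2 was added above)
Occ_df2 <- left_join(Occ_df, emp2_by_state, by = c("State", "occupation"))
This reproduces the emp2 column from the gather/summarise approach without building the intermediate long data frame.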