SNA in R: Get alters' attributes from ego's ID

I have a data frame that's an edgelist (undirected) describing who is tied to who, and then a data frame with those actors' ethnicity. I want to get a data frame that lists the name of each ego in one column and the sum of their alters of a given type of ethnicity on the other (ex. Joe and the number of his white friends). Here's what I tried:
atts <- data.frame(Actor = letters[1:10], Ethnicity = sample(1:3, 10, replace=T)) # sample ethnicity data
df <- data.frame(actorA = letters[1:10],actorB=c("h","d","f","i","g","b","a","a","e","h")) # sample edgelist
df.split<-split(df$actorB,df$actorA) # obtain list of alters for column 1
head(df.split)
friends <- c()
n<-length(df.split)
for (i in 1:n){
alters_e <-atts[atts$Actor %in% df.split[[i]]==TRUE,] # get ethnicity for alters
friends[i] <- sum(alters_e$Ethnicity==3) # compute no. ties for one ethnicity value
}
friends
The problem with this is that using the split function doesn't work if some of your egos only show up in the actorB column.
Can anybody recommend a more graceful way for me to obtain lists of alters by ego's ID, that isn't the split function?

I hope this helps:
(atts <- data.frame(Actor = letters[1:10], Ethnicity = sample(1:3, 10, replace=T)))
(df <- data.frame(alter = letters[1:10],ego=c("h","d","f","i","g","b","a","a","e","h")))
(Merged <- merge (df, atts, by.x="alter", by.y="Actor"))
with(Merged, table(ego,Ethnicity))
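Because the edgelist is undirected, an ego that appears only in actorB is missed by any approach that groups on a single column. A minimal sketch of one workaround, assuming the question's column names (actorA, actorB, Actor, Ethnicity): stack the edgelist with its reverse so every tie is listed once per endpoint, then merge and tabulate. The names ties, counts, friends and n_eth3 are made up for illustration.

```r
set.seed(1)
atts <- data.frame(Actor = letters[1:10],
                   Ethnicity = sample(1:3, 10, replace = TRUE),
                   stringsAsFactors = FALSE)
df <- data.frame(actorA = letters[1:10],
                 actorB = c("h","d","f","i","g","b","a","a","e","h"),
                 stringsAsFactors = FALSE)

# Stack the edgelist with its reverse so each tie appears once per endpoint;
# unique() drops ties that were already listed in both directions (a-h, h-a)
ties <- unique(rbind(
  data.frame(ego = df$actorA, alter = df$actorB, stringsAsFactors = FALSE),
  data.frame(ego = df$actorB, alter = df$actorA, stringsAsFactors = FALSE)
))

# Attach each alter's ethnicity
ties <- merge(ties, atts, by.x = "alter", by.y = "Actor")

# Fixing the factor levels keeps a row for every actor and a column for every
# ethnicity code, even when a count is zero
counts <- table(ego = factor(ties$ego, levels = atts$Actor),
                Ethnicity = factor(ties$Ethnicity, levels = 1:3))

# One row per ego with the number of alters of ethnicity 3
friends <- data.frame(ego = rownames(counts),
                      n_eth3 = as.vector(counts[, "3"]))
friends
```

The full counts table holds the same information for every ethnicity at once, so the last step is only needed when a single column is wanted.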

Related

Combining rows based on conditions and saving others (in R)

I have a question regarding combining columns based on two conditions.
I have two datasets from an experiment where participants had to type in a code, answer about their gender and eyetracking data was documented. The experiment happened twice (first: random1, second: random2).
eye <- c(1000,230,250,400)
gender <- c(1,2,1,2)
code <- c("ABC","DEF","GHI","JKL")
random1 <- data.frame(code,gender,eye)
eye2 <- c(100,250,230,450)
gender2 <- c(1,1,2,2)
code2 <- c("ABC","DEF","JKL","XYZ")
random2 <- data.frame(code2,gender2,eye2)
Now I want to combine the two dataframes. For all rows where code and gender match, the rows should be combined (so columns added). Code and gender variables of those two rows should become one each (gender3 and code3) and the eyetracking data should be split up into eye_first for random1 and eye_second for random2.
For all rows where there was not found a perfect match for their code and gender values, a new dataset with all of these rows should exist.
#this is what the combined data looks like
gender3 <- c(1,2)
eye_first <- c(1000,400)
eye_second <- c(100, 230)
code3 <- c("ABC", "JKL")
random3 <- data.frame(code3,gender3,eye_first,eye_second)
#this is what the data without match should look like
gender4 <- c(2,1,2)
eye4 <- c(230,250,450)
code4 <- c("DEF","GHI","XYZ")
random4 <- data.frame(code4,gender4,eye4)
I would greatly appreciate your help! Thanks in advance.
Use the same column names for your 2 data.frames and use merge
random1 <- data.frame(code = code, gender = gender, eye = eye)
random2 <- data.frame(code = code2, gender = gender2, eye = eye2)
df <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"))
For your second request, you can use anti_join from dplyr
df2 <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"), all = TRUE) # all = TRUE : keep rows with ids that are only in one of the 2 data.frame
library(dplyr)
anti_join(df2, df, by = c("code", "gender"))
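If the unmatched rows should keep a single eye column, as in the random4 layout from the question, a base R sketch along these lines may be closer to what was asked. It assumes both data frames use the column names code, gender and eye; key1, key2 and unmatched are illustrative names. Note that it returns four rows for the sample data, because the two DEF rows have different genders and so neither has a match.

```r
random1 <- data.frame(code = c("ABC", "DEF", "GHI", "JKL"),
                      gender = c(1, 2, 1, 2),
                      eye = c(1000, 230, 250, 400))
random2 <- data.frame(code = c("ABC", "DEF", "JKL", "XYZ"),
                      gender = c(1, 1, 2, 2),
                      eye = c(100, 250, 230, 450))

# Build a combined key from the two matching columns
key1 <- paste(random1$code, random1$gender, sep = "_")
key2 <- paste(random2$code, random2$gender, sep = "_")

# Keep the rows of each data frame whose key is absent from the other one
unmatched <- rbind(random1[!(key1 %in% key2), ],
                   random2[!(key2 %in% key1), ])
unmatched
```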

R: Mapping the indicator column to what constitutes the column

I am coding in R and I have a dataframe for region such as:
data <- data.frame(Region = c("Cali", "NYC", "LA", "Vegas"),
Group = c(1,2,2,1), stringsAsFactors = F)
The regions have been clubbed to make a group. The group column tells which regions are a part of the group. How can I code, that when I have the group information, I can go and find the regions that constitute that group. Any help is really appreciated.
Most importantly and for future posts please
include sample data in a reproducible and copy&paste-able format using e.g. dput
refrain from adding superfluous statements like "This one is super urgent!"
As to your question, first I'll generate some sample data
set.seed(2018)
df <- data.frame(
Region = sample(letters, 10),
Group = sample(1:3, 10, replace = T))
I recommend summarising/aggregating data by Group, which will make it easy to extract information for specific Groups.
For example in base R you can aggregate the data based on Group and concatenate all Regions per Group
aggregate(Region ~ Group, data = df, FUN = toString)
# Group Region
#1 1 m
#2 2 i, l, g, c
#3 3 b, e, k, r, j
Or alternatively you can store all Regions per Group in a list
aggregate(Region ~ Group, data = df, FUN = list)
# Group Region
#1 1 m
#2 2 i, l, g, c
#3 3 b, e, k, r, j
Note that while the output looks identical, toString creates a character string, while list stores the Regions in a list. The latter might be a better format for downstream processing.
Similar outputs can be achieved using dplyr
library(dplyr)
df %>%
group_by(Group) %>%
summarise(Region = toString(Region))
So with a small, reproducible example,
data <- data.frame(Region = c("Cali", "NYC", "LA", "Vegas"), Group = c(1,2,2,1),stringsAsFactors=F)
we see the following results, say we want all from group 1
group.number = 1
data[data$Group == group.number,"Region"]
[1] "Cali"  "Vegas"
Or using dplyr
library(dplyr)
group.number = 1
data %>%
filter(Group == group.number)%>%
.$Region
Or from Jilber Urbina (Much more readable)
subset(data, Group==1)$Region
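If lookups for many groups are needed, it may be convenient to build the mapping once with split and then index it by group. This is only a sketch on the question's sample data; regions_by_group is an illustrative name.

```r
data <- data.frame(Region = c("Cali", "NYC", "LA", "Vegas"),
                   Group = c(1, 2, 2, 1),
                   stringsAsFactors = FALSE)

# Named list: one character vector of regions per group
regions_by_group <- split(data$Region, data$Group)

# List names are character strings, so index with "1" rather than 1
regions_by_group[["1"]]
```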

Create score based on word occurrences

I have two data frames with columns of words and associated scores for these words. I want to run comments through these frames and create an additive score based on if the words appear in the sentences.
I want to do this across many, many comments so it needs to be computationally efficient. So for example, the sentence "hi, he said. why is it okay" will get a score of .98 + .1 + .2 because the words "hi", "why", and "okay" are in data frame a. Any sentence could potentially have words from several data frames as well.
Can anyone help me create the column "add_score" with a procedure that scales well to large data frames? Thank you
a <- data.frame(words = c("hi","no","okay","why"),score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"),score = c(.5,.3,.2))
comment_df = data.frame(id = c("1","2","3"), comments = c("hi, he said. why is it okay","okay okay okay no","yes, here is it"))
comment_df$add_score = c(1.28,1.1,.5)
This solution uses functions from tidyverse and stringr.
# Load packages
library(tidyverse)
library(stringr)
# Merge a and b to create score_df
score_df <- bind_rows(a, b)
# Create a function to calculate score for one string
string_cal <- function(string, score_df){
temp <- score_df %>%
# Count how often each scored word occurs in the string
mutate(Number = str_count(string, pattern = fixed(words))) %>%
# Calculate the score
mutate(Total_Score = score * Number)
# Return the sum
return(sum(temp$Total_Score))
}
# Use map_dbl to apply the string_cal function over comments
# The results are stored in the add_score column
comment_df <- comment_df %>%
mutate(add_score = map_dbl(comments, string_cal, score_df = score_df))
Data Preparation
a <- data.frame(words = c("hi","no","okay","why"),
score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"),
score = c(.5,.3,.2))
comment_df <- data.frame(id = c("1","2","3"),
comments = c("hi, he said. why is it okay",
"okay okay okay no",
"yes, here is it"))
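One caveat with fixed-pattern counting is that it matches substrings, so "no" would also be counted inside "know". A base R sketch that splits each comment into whole words and looks them up in a named vector avoids this. It assumes lower-case ASCII words in the score tables; lookup and score_comment are illustrative names.

```r
a <- data.frame(words = c("hi", "no", "okay", "why"),
                score = c(.98, .5, .2, .1), stringsAsFactors = FALSE)
b <- data.frame(words = c("bye", "yes", "here"),
                score = c(.5, .3, .2), stringsAsFactors = FALSE)
comment_df <- data.frame(id = c("1", "2", "3"),
                         comments = c("hi, he said. why is it okay",
                                      "okay okay okay no",
                                      "yes, here is it"),
                         stringsAsFactors = FALSE)

# Named vector: word -> score
score_df <- rbind(a, b)
lookup <- setNames(score_df$score, score_df$words)

score_comment <- function(comment, lookup) {
  # Split on anything that is not a letter, so only whole words are matched
  tokens <- strsplit(tolower(comment), "[^a-z]+")[[1]]
  # Words without a score index to NA and are dropped by na.rm
  sum(lookup[tokens], na.rm = TRUE)
}

comment_df$add_score <- vapply(comment_df$comments, score_comment,
                               numeric(1), lookup = lookup,
                               USE.NAMES = FALSE)
comment_df
```

Repeated words are counted once per occurrence, which reproduces the scores given in the question for the three sample comments.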

Serial Subsetting in R

I am working with large datasets. I have to extract values from one dataset, and the identifiers for those values are stored in another dataset. So basically I am subsetting twice for each value of one category. For multiple categories, I have to combine such double-subsetted values. I am doing something similar to what is shown below, but I think there must be a better way to do it.
example datasets
set.seed(1)
df <- data.frame(number= seq(5020, 5035, 1), value =rnorm(16, 20, 5),
type = rep(c("food", "bar", "sleep", "gym"), each = 4))
df2 <- data.frame(number= seq(5020, 5035, 1), type = rep(LETTERS[1:4], 4))
extract value for grade A
asub_df2 <-subset(df2, type == "A" )
asub_df <-subset(df, number %in% asub_df2$number) # %in% tests membership; == would recycle the shorter vector
new_a <- cbind(asub_df, grade = rep(c("A"),nrow(asub_df)))
similarly extract value for grade B in new_b and combine to do any analysis.
can we use
You can split the 'df2' and use lapply
Filter(Negate(is.null),
lapply(split(df2, df2$type), function(x) {
x1 <- subset(df, number %in% x$number) # %in% tests membership; == would recycle x$number
if(nrow(x1)>0) {
transform(x1, grade=x$type[1])
}
}))
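Since both data frames share the number identifier, a single merge may replace the repeated subsetting altogether. The sketch below uses the question's sample data, except that the second data frame's type column is called grade here (an assumption, made to avoid a name clash with df$type after merging); combined and ab_only are illustrative names.

```r
set.seed(1)
df <- data.frame(number = seq(5020, 5035, 1),
                 value = rnorm(16, 20, 5),
                 type = rep(c("food", "bar", "sleep", "gym"), each = 4))
df2 <- data.frame(number = seq(5020, 5035, 1),
                  grade = rep(LETTERS[1:4], 4))

# One merge attaches the grade to every row of df
combined <- merge(df, df2, by = "number")

# Any combination of grades can then be pulled out in one step
ab_only <- subset(combined, grade %in% c("A", "B"))
ab_only
```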

Grouping data into ranges in R

Suppose I have a data frame in R that has names of students in one column and their marks in another column. These marks range from 20 to 100.
> mydata
id name marks gender
1 a1 56 female
2 a2 37 male
I want to divide the student into groups, based on the criteria of obtained marks, so that difference between marks in each group should be more than 10. I tried to use the function table, which gives the number of students in each range from say 20-30, 30-40, but I want it to pick those students that have marks in a given range and put all their information together in a group. Any help is appreciated.
I am not sure what you mean with "put all their information together in a group", but here is a way to obtain a list with dataframes split up of your original data frame where each element is a data frame of the students within a mark range of 10:
mydata <- data.frame(
id = 1:100,
name = paste0("a",1:100),
marks = sample(20:100,100,TRUE),
gender = sample(c("female","male"),100,TRUE))
split(mydata,cut(mydata$marks,seq(20,100,by=10),include.lowest=TRUE)) # include.lowest keeps marks of exactly 20
I think that @Sacha's answer should suffice for what you need to do, even if you have more than one set.
You haven't explicitly said how you want to "group" the data in your original post, and in your comment, where you've added a second dataset, you haven't explicitly said whether you plan to "merge" these first (rbind would suffice, as recommended in the comment).
So, with that, here are several options, each with different levels of detail or utility in the output. Hopefully one of them suits your needs.
First, here's some sample data.
# Two data.frames (myData1, and myData2)
set.seed(1)
myData1 <- data.frame(id = 1:20,
name = paste("a", 1:20, sep = ""),
marks = sample(20:100, 20, replace = TRUE),
gender = sample(c("F", "M"), 20, replace = TRUE))
myData2 <- data.frame(id = 1:17,
name = paste("b", 1:17, sep = ""),
marks = sample(30:100, 17, replace = TRUE),
gender = sample(c("F", "M"), 17, replace = TRUE))
Second, different options for "grouping".
Option 1: Return (in a list) the values from myData1 and myData2 which match a given condition. For this example, you'll end up with a list of two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) x[x$marks >= 30 & x$marks <= 50, ])
Option 2: Return (in a list) each dataset split into two, one for FALSE (doesn't match the stated condition) and one for TRUE (does match the stated condition). In other words, creates four groups. For this example, you'll end up with a nested list with two list items, each with two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) split(x, x$marks >= 30 & x$marks <= 50))
Option 3: More flexible than the first. This is essentially @Sacha's example extended to a list. You can set your breaks wherever you would like, making this, in my mind, a really convenient option. For this example, you'll end up with a nested list with two list items, each with multiple data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) split(x, cut(x$marks,
breaks = c(0, 30, 50, 75, 100),
include.lowest = TRUE)))
Option 4: Combine the data first and use the grouping method described in Option 1. For this example, you will end up with a single data.frame containing only values which match the given condition.
# Combine the data. Assumes the column names are the same in both sets
myDataALL <- rbind(myData1, myData2)
# Extract just the group of scores you're interested in
myDataALL[myDataALL$marks >= 30 & myDataALL$marks <= 50, ]
Option 5: Using the combined data, split the data into two groups: one group which matches the stated condition, one which doesn't. For this example, you will end up with a list with two data.frames.
split(myDataALL, myDataALL$marks >= 30 & myDataALL$marks <= 50)
I hope one of these options serves your needs!
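If a single data frame is preferred over a list, another possibility is to store the range as an extra column and sort by it, so that each student's full record sits next to the others in the same range. This is a sketch on made-up data; the column name band is an assumption.

```r
set.seed(1)
mydata <- data.frame(id = 1:20,
                     name = paste0("a", 1:20),
                     marks = sample(20:100, 20, replace = TRUE))

# Label each student with a 10-mark band; include.lowest keeps marks of 20
mydata$band <- cut(mydata$marks, breaks = seq(20, 100, by = 10),
                   include.lowest = TRUE)

# Sort so that students in the same band are listed together
mydata <- mydata[order(mydata$band, mydata$marks), ]

# Number of students per band
table(mydata$band)
```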
I had the same kind of issue, and after researching some answers on Stack Overflow I came up with the following solution:
Step 1 : Define range
Step 2 : Find the elements that fall in the range
Step 3 : Plot
A sample code is as shown below:
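# `all` is the answerer's own data frame with a numeric column `downlink`
# Step 1: define the range boundaries in steps of 2000
range <- seq(0, max(all$downlink), 2000)
# Step 2: count the elements that fall in each half-open bin [range[i], range[i+1])
counts <- numeric(length(range) - 1)
for (i in seq_along(counts)) {
  counts[i] <- sum(all$downlink >= range[i] & all$downlink < range[i + 1])
}
# Step 3: plot, rounding the y-axis limit up to the next multiple of 1000
countmax <- max(counts)
a <- ceiling(countmax / 1000) * 1000
barplot(counts, col = rainbow(16), ylim = c(0, a))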
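The counting loop can probably be replaced by cut and table, which bin and count in one pass. The sketch below uses a made-up numeric vector named downlink in place of the answerer's data, with breaks chosen to cover its whole range.

```r
set.seed(42)
# Hypothetical data standing in for the downlink column
downlink <- sample(0:9999, 200, replace = TRUE)

# Step 1: define the range boundaries
breaks <- seq(0, 10000, by = 2000)

# Step 2: assign each value to a half-open bin [lower, upper) and count
bins <- cut(downlink, breaks = breaks, right = FALSE)
counts <- table(bins)
counts

# Step 3: plot (uncomment to draw)
# barplot(counts, col = rainbow(length(counts)))
```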
