R: Mapping the indicator column to what constitutes the column

I am coding in R and I have a dataframe for region such as:
data <- data.frame(Region = c("Cali", "NYC", "LA", "Vegas"),
Group = c(1,2,2,1), stringsAsFactors = F)
The regions have been combined into groups, and the Group column tells which group each region belongs to. Given a group number, how can I find the regions that make up that group? Any help is really appreciated.

Most importantly, and for future posts, please include sample data in a reproducible and copy&paste-able format (e.g. using dput), and refrain from adding superfluous statements like "This one is super urgent!".
As to your question, first I'll generate some sample data
set.seed(2018)
df <- data.frame(
Region = sample(letters, 10),
Group = sample(1:3, 10, replace = T))
I recommend summarising/aggregating data by Group, which will make it easy to extract information for specific Groups.
For example in base R you can aggregate the data based on Group and concatenate all Regions per Group
aggregate(Region ~ Group, data = df, FUN = toString)
# Group Region
#1 1 m
#2 2 i, l, g, c
#3 3 b, e, k, r, j
Or alternatively, you can store all Regions per Group in a list
aggregate(Region ~ Group, data = df, FUN = list)
# Group Region
#1 1 m
#2 2 i, l, g, c
#3 3 b, e, k, r, j
Note that while the output looks identical, toString creates a character string, while list stores the Regions in a list. The latter might be a better format for downstream processing.
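To make that difference concrete, here is a minimal sketch using a plain character vector rather than the aggregate() output (the variable names are only illustrative):

```r
regions <- c("Cali", "Vegas")

# toString() collapses the vector into a single comma-separated string
as_string <- toString(regions)
length(as_string)        # 1

# list() keeps the individual elements, so they can be used directly
as_list <- list(regions)
length(as_list[[1]])     # 2

# Getting the elements back out of the string needs an extra splitting step
recovered <- strsplit(as_string, ", ", fixed = TRUE)[[1]]
```

So if you plan to loop over or filter the Regions later, the list form probably saves you a parsing step.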
Similar outputs can be achieved using dplyr
library(dplyr)
df %>%
group_by(Group) %>%
summarise(Region = toString(Region))

So with a small, reproducible example,
data <- data.frame(Region = c("Cali", "NYC", "LA", "Vegas"), Group = c(1,2,2,1),stringsAsFactors=F)
we get the following results if, say, we want all regions from group 1
group.number = 1
data[data$Group == group.number,"Region"]
[1] "Cali"  "Vegas"
Or using dplyr
library(dplyr)
group.number = 1
data %>%
filter(Group == group.number)%>%
.$Region
Or from Jilber Urbina (Much more readable)
subset(data, Group==1)$Region
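If you need to look up many groups, it may be easier to build the lookup once with split() from base R. This is only a sketch on the example data from the question; the name lookup is made up for illustration:

```r
data <- data.frame(Region = c("Cali", "NYC", "LA", "Vegas"),
                   Group = c(1, 2, 2, 1), stringsAsFactors = FALSE)

# split() returns a named list with one character vector of Regions per Group
lookup <- split(data$Region, data$Group)

# The list names are the group values as strings, so convert before indexing
group.number <- 1
regions_in_group <- lookup[[as.character(group.number)]]
regions_in_group
```

After that, every further lookup is just an index into the list and does not filter the data frame again.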

Related

Replacing marks with grades dynamically

I have two dataframes:
marks<-data.frame("student" =c("stud1","stud2","stud3") ,"sub1" =c(25,75,43), "sub2" = c(43,99,45),"sub3" = c(32,53,45), stringsAsFactors = FALSE)
grades<-data.frame("grade" =c("F","B","A") ,"sub1" =c(50,75,85), "sub2" =c(35,75,85),"sub3" =c(32,75,85), stringsAsFactors = FALSE)
I need to compare each mark in Marks df and get corresponding grades from Grades df. The grade definition is different for different subjects.
I have tried using the sapply function and cut:
marks<-sapply(marks, function(x) cut(x,
breaks=c(0,50,75,85),
labels=c("F","B","A"),include.lowest = TRUE, right = TRUE,na.rm = TRUE))
This works without difficulty if grade boundaries are fixed. But when they change dynamically (based on subject column) I could not do this.
The expected output is:
gradedmarks<-data.frame("student" =c("stud1","stud2","stud3") ,"sub1" =c("F","B","F"), "sub2" = c("F","A","F"),"sub3" = c("F","F","F"), stringsAsFactors = FALSE)
Any quick way to achieve this in R?
Please note this is NOT a duplicate of this question (Looping through multiple if_else statements). This is about using the cut function with breaks that depend on the column.
Using a data.table non-equi join:
library(data.table)
setDT(marks)
setDT(grades)
#reshape to long format
marks <- melt(marks, id.vars = "student")
grades <- melt(grades, id.vars = "grade")
#non-equi join
marks[grades, grade := i.grade, on = c("variable", "value >= value")]
#fill "F" for low marks
marks[is.na(grade), grade := "F"]
#reshape to wide format
dcast(marks, student ~ variable, value.var = "grade")
# student sub1 sub2 sub3
#1 stud1 F F F
#2 stud2 B A F
#3 stud3 F F F
Study the data.table documentation to understand data.table syntax. There are some excellent vignettes.
You can use Map to change the grades dynamically :
marks[-1] <- Map(function(x, y) cut(x, breaks=c(0, y),
labels=grades$grade,include.lowest = TRUE, right = TRUE),
marks[-1], grades[-1])
You might need to adjust the settings of the cut function based on your requirements. Also, this requires the subjects to be in the same order in the marks and grades dataframes.
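If the values in grades are meant as lower bounds (a mark at or above the value earns that grade, and anything below the lowest value is an "F", which is how I read the expected output in the question), a base R sketch with findInterval() might look like this. Matching the columns by name is my own addition, so the column order no longer matters:

```r
marks <- data.frame(student = c("stud1", "stud2", "stud3"),
                    sub1 = c(25, 75, 43), sub2 = c(43, 99, 45),
                    sub3 = c(32, 53, 45), stringsAsFactors = FALSE)
grades <- data.frame(grade = c("F", "B", "A"),
                     sub1 = c(50, 75, 85), sub2 = c(35, 75, 85),
                     sub3 = c(32, 75, 85), stringsAsFactors = FALSE)

subjects <- setdiff(names(marks), "student")

# findInterval() returns 0 for marks below the lowest bound,
# so prepend "F" to cover that case and shift the index by one
labels <- c("F", grades$grade)

graded <- marks
graded[subjects] <- Map(function(x, y) labels[findInterval(x, y) + 1],
                        marks[subjects], grades[subjects])
graded
```

Note that findInterval() needs the bounds in each subject column to be sorted in increasing order, which holds for the sample data.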

R equivalent to SAS "merge" "by"

If you only use "merge" and "by" in SAS to merge datasets that contain several variables with the same names (besides the ID(s) that you merge by), SAS combines these variables into one, using the value read last. This is described here: https://communities.sas.com/t5/SAS-Programming/Merge-step-overwriting-shared-vars/m-p/281542#M57117
Text from above link:
"There is a rule: whichever value was read last. But that rule is simple only when the merge is one-to-one. In that case, the value you get depends on the order in the MERGE statement:
merge a b;
by id;
The value of common variables (for a one-to-one merge) comes from data set B. SAS reads a value from data set A, then reads a value from data set B. The value from B is read last, and overwrites the value read from data set A.
If there is a mismatch, and an ID appears only in data set A but not in data set B, the value will be the one found in data set A."
How do I make R behave the same way without having to combine the rows afterwards after certain conditions? (in SAS, values are not overwritten by NAs)
library(tidyverse)
#create tibbles
df1 <- tibble(id = c(1:3), y = c("tt", "ff", "kk"))
df2 <- tibble(id = c(1,2,4), y = c(4,3,8))
df3 <- tibble(id = c(1:3), y = c(5,7,NA))
#combine the tibbles
combined_df <- list(df1, df2, df3) %>%
reduce(full_join, by = "id")
# desired output
combined_df_desired <- tibble(id = 1:4, y = c(5,7,"kk",8))
I don't know exactly what you mean by "certain conditions". There isn't a way to change the inner workings of full_join() but you can do:
list(df1, df2, df3) %>%
reduce(full_join, by = "id") %>%
mutate_all(as.character) %>%
mutate(y = coalesce(y, y.y, y.x)) %>%
select(id, y)
# A tibble: 4 x 2
#   id    y
#   <chr> <chr>
# 1 1     5
# 2 2     7
# 3 3     kk
# 4 4     8
coalesce() takes a set of columns and returns the first non-NA value for each row. You can order the columns inside the function according to your priorities.
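To illustrate the "first non-NA value wins" rule without any packages, here is a rough base R sketch. first_non_na is a hypothetical helper written for this example, not how dplyr implements coalesce(), and the three vectors just mimic the joined columns after the conversion to character:

```r
# Hypothetical helper: go through the vectors left to right and
# keep the first value that is not NA in each position
first_non_na <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

y   <- c("5", "7", NA, NA)       # from df3 (read last, highest priority)
y.y <- c("4", "3", NA, "8")      # from df2
y.x <- c("tt", "ff", "kk", NA)   # from df1

combined <- first_non_na(y, y.y, y.x)
combined
```

Putting y first gives the last dataset priority, which is the closest match I can see to the SAS behaviour described in the question.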

Access dim names of unnamed data frame for dropping columns by names

This is a follow-up on this question. @Sotos had provided a correct answer to the question, but indeed my question was meant more theoretically.
I am aware that this might all not be very practical, but it is more out of curiosity.
How can I access the names attribute of an unnamed object for 'negative selection' (dropping) by name?
'Positive selection' is neat:
data.frame(year = 1996:1998, group = letters[1:3]) ['group']
group
1 a
2 b
3 c
But following this approach it could quickly get cumbersome with 'negative selection', especially for larger data frames:
data.frame(year = 1996:1998, group = letters[1:3])[!names(data.frame(year = 1996:1998, group = letters[1:3])) %in% 'year']
group
1 a
2 b
3 c
I know that you could use subset or dplyr::select:
library(dplyr)
data.frame(year = 1996:1998, group = letters[1:3]) %>% select(-year)
# or
subset(data.frame(year = 1996:1998, group = letters[1:3]), select = -year)
group
1 a
2 b
3 c
But I wondered if there are other means, based on selection using [, such as to use the foo[!names %in% x] solution without attributing a name to foo beforehand and without the cumbersome repetition of the data frame as in my example code.
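One possible approach, offered only as a sketch: wrap the [ selection in an anonymous function, so the unnamed data frame gets a temporary name (d here, chosen arbitrarily) only inside the call and is written out once:

```r
# The anonymous function binds the unnamed data frame to d,
# so names(d) can be used inside [ without repeating the constructor
result <- (function(d) d[!names(d) %in% "year"])(
  data.frame(year = 1996:1998, group = letters[1:3])
)
names(result)
```

Whether this is any less cumbersome than subset() is a matter of taste, but it does stay with [ and avoids the repetition.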

Create score based on word occurrences

I have two data frames with columns of words and associated scores for these words. I want to run comments through these frames and create an additive score based on if the words appear in the sentences.
I want to do this across many, many comments so it needs to be computationally efficient. So for example, the sentence "hi, he said. why is it okay" will get a score of .98 + .1 + .2 because the words "hi", "why", and "okay" are in data frame a. Any sentence could potentially have words from several data frames as well.
Can anyone help me create the column "add_score" with a procedure that scales well to large data frames? Thank you
a <- data.frame(words = c("hi","no","okay","why"),score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"),score = c(.5,.3,.2))
comment_df = data.frame(id = c("1","2","3"), comments = c("hi, he said. why is it okay","okay okay okay no","yes, here is it"))
comment_df$add_score = c(1.28,1.1,.5)
This solution uses functions from tidyverse and stringr.
# Load packages
library(tidyverse)
library(stringr)
# Merge a and b to create score_df
score_df <- bind_rows(a, b)
# Create a function to calculate score for one string
string_cal <- function(string, score_df){
temp <- score_df %>%
# Count the number of words in one string
mutate(Number = str_count(string, pattern = fixed(words))) %>%
# Calculate the score
mutate(Total_Score = score * Number)
# Return the sum
return(sum(temp$Total_Score))
}
# Use map_dbl to apply the string_cal function over comments
# The results are stored in the add_score column
comment_df <- comment_df %>%
mutate(add_score = map_dbl(comments, string_cal, score_df = score_df))
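One caveat with str_count() and fixed(): it counts substrings, so "hi" would also be counted inside a word like "this". If whole-word matching is what you want, a base R sketch along these lines might work. score_comment is a hypothetical helper, and splitting on anything that is not a letter is my assumption about what counts as a word:

```r
a <- data.frame(words = c("hi", "no", "okay", "why"),
                score = c(.98, .5, .2, .1), stringsAsFactors = FALSE)
b <- data.frame(words = c("bye", "yes", "here"),
                score = c(.5, .3, .2), stringsAsFactors = FALSE)
score_df <- rbind(a, b)

comments <- c("hi, he said. why is it okay",
              "okay okay okay no",
              "yes, here is it")

# Hypothetical helper: split a comment into lower-case words and
# add up the scores of the words that are found in score_df
score_comment <- function(comment, score_df) {
  tokens <- strsplit(tolower(comment), "[^a-z]+")[[1]]
  sum(score_df$score[match(tokens, score_df$words)], na.rm = TRUE)
}

add_score <- vapply(comments, score_comment, numeric(1),
                    score_df = score_df, USE.NAMES = FALSE)
add_score
```

Repeated words are counted each time they occur, as in the str_count() version. I have not benchmarked this on large data, so treat the scaling as untested.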
Data Preparation
a <- data.frame(words = c("hi","no","okay","why"),
score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"),
score = c(.5,.3,.2))
comment_df <- data.frame(id = c("1","2","3"),
comments = c("hi, he said. why is it okay",
"okay okay okay no",
"yes, here is it"))

SNA in R: Get alters' attributes from ego's ID

I have a data frame that's an (undirected) edgelist describing who is tied to whom, and another data frame with those actors' ethnicity. I want to get a data frame that lists the name of each ego in one column and the number of their alters of a given ethnicity in the other (e.g. Joe and the number of his white friends). Here's what I tried:
atts <- data.frame(Actor = letters[1:10], Ethnicity = sample(1:3, 10, replace=T)) # sample ethnicity data
df <- data.frame(actorA = letters[1:10],actorB=c("h","d","f","i","g","b","a","a","e","h")) # sample edgelist
df.split<-split(df$actorB,df$actorA) # obtain list of alters for column 1
head(df.split)
friends <- c()
n<-length(df.split)
for (i in 1:n){
alters_e <-atts[atts$Actor %in% df.split[[i]]==TRUE,] # get ethnicity for alters
friends[i] <- sum(alters_e$Ethnicity==3) # compute no. ties for one ethnicity value
}
friends
The problem with this is that using the split function doesn't work if some of your egos only show up in the actorB column.
Can anybody recommend a more graceful way for me to obtain lists of alters by ego's ID, that isn't the split function?
I hope this helps:
(atts <- data.frame(Actor = letters[1:10], Ethnicity = sample(1:3, 10, replace=T)))
(df <- data.frame(alter = letters[1:10],ego=c("h","d","f","i","g","b","a","a","e","h")))
(Merged <- merge (df, atts, by.x="alter", by.y="Actor"))
with(Merged, table(ego,Ethnicity))
,David
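Since the edgelist is undirected, it may also help to list every tie in both directions before merging, so that actors who only show up in the second column still get counted as egos. A sketch under that assumption, with fixed (made-up) ethnicity values instead of sample() so the result is reproducible:

```r
atts <- data.frame(Actor = letters[1:10],
                   Ethnicity = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1),  # made-up values
                   stringsAsFactors = FALSE)
edges <- data.frame(actorA = letters[1:10],
                    actorB = c("h", "d", "f", "i", "g", "b", "a", "a", "e", "h"),
                    stringsAsFactors = FALSE)

# List each tie in both directions, then drop ties that are now duplicated
both <- unique(rbind(
  data.frame(ego = edges$actorA, alter = edges$actorB, stringsAsFactors = FALSE),
  data.frame(ego = edges$actorB, alter = edges$actorA, stringsAsFactors = FALSE)
))

# Attach each alter's ethnicity and count alters per ego and ethnicity
merged <- merge(both, atts, by.x = "alter", by.y = "Actor")
counts <- table(merged$ego, merged$Ethnicity)
counts
```

Each row of counts is an ego and each column an ethnicity value, so counts[, "3"] would give the number of alters with ethnicity 3 per ego.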
