I have a question regarding combining columns based on two conditions.
I have two datasets from an experiment in which participants typed in a code and reported their gender while eye-tracking data was recorded. The experiment was run twice (first: random1, second: random2).
eye <- c(1000,230,250,400)
gender <- c(1,2,1,2)
code <- c("ABC","DEF","GHI","JKL")
random1 <- data.frame(code,gender,eye)
eye2 <- c(100,250,230,450)
gender2 <- c(1,1,2,2)
code2 <- c("ABC","DEF","JKL","XYZ")
random2 <- data.frame(code2,gender2,eye2)
Now I want to combine the two dataframes. For all rows where code and gender match, the rows should be combined (so columns added). Code and gender variables of those two rows should become one each (gender3 and code3) and the eyetracking data should be split up into eye_first for random1 and eye_second for random2.
All rows without an exact match on both code and gender should go into a separate dataset.
#this is what the combined data looks like
gender3 <- c(1,2)
eye_first <- c(1000,400)
eye_second <- c(100, 230)
code3 <- c("ABC", "JKL")
random3 <- data.frame(code3,gender3,eye_first,eye_second)
#this is what the data without match should look like
gender4 <- c(2,1,2)
eye4 <- c(230,250,450)
code4 <- c("DEF","GHI","XYZ")
random4 <- data.frame(code4,gender4,eye4)
I would greatly appreciate your help! Thanks in advance.
Use the same column names in both data frames, then use merge:
random1 <- data.frame(code = code, gender = gender, eye = eye)
random2 <- data.frame(code = code2, gender = gender2, eye = eye2)
df <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"))
For your second request, you can use anti_join from dplyr
df2 <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"), all = TRUE) # all = TRUE : keep rows with ids that are only in one of the 2 data.frame
library(dplyr)
anti_join(df2, df, by = c("code", "gender"))
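For anyone who prefers to stay in base R, here is a minimal sketch of the same idea. It assumes the two data frames share the column names code, gender and eye; the names keys, key1, key2, matched and unmatched are made up for this illustration. Note that it keeps every unmatched row from both data frames.

```r
random1 <- data.frame(code = c("ABC", "DEF", "GHI", "JKL"),
                      gender = c(1, 2, 1, 2),
                      eye = c(1000, 230, 250, 400))
random2 <- data.frame(code = c("ABC", "DEF", "JKL", "XYZ"),
                      gender = c(1, 1, 2, 2),
                      eye = c(100, 250, 230, 450))

keys <- c("code", "gender")

# Inner join: only rows whose code AND gender appear in both data frames
matched <- merge(random1, random2, by = keys,
                 suffixes = c("_first", "_second"))

# Build one key string per row, then keep rows whose key is absent
# from the other data frame (a hand-rolled anti-join)
key1 <- paste(random1$code, random1$gender)
key2 <- paste(random2$code, random2$gender)
unmatched <- rbind(random1[!key1 %in% key2, ],
                   random2[!key2 %in% key1, ])

matched
unmatched
```

Pasting the key columns together is only safe when the separator cannot occur inside the values, which holds for codes like these.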
Suppose I have a data frame (DF) that looks like the following:
test <- c('Test1','Test2','Test3')
col.DF.names <- c('ID', 'year', 'car', 'age', 'year.1', 'car.1', 'age.1', 'year.2', 'car.2', 'age.2')
ID <- c('A','B','C')
year <- c(2001,2002,2003)
car <- c('acura','benz','lexus')
age <- c(55,16,20)
year.1 <- c(2011,2012,2013)
car.1 <- c('honda','gm','bmw')
age.1 <- c(43,21,34)
year.2 <- c(1961,1962,1963)
car.2 <- c('toyota','porsche','jeep')
age.2 <- c(33,56,42)
DF <- data.frame(ID, year, car, age, year.1, car.1, age.1, year.2, car.2, age.2)
I need the columns of data frame to lose the ".#" and instead have the Test# in front of it, so it looks something like this:
ID Test1.year Test1.car Test1.age Test2.year Test2.car Test2.age Test3.year Test3.car Test3.age
.... with all the data
Does anyone have a suggestion? Basically, starting at the second column, I'd like to add the test[1] name to three columns, then move to the next set of three columns and add test[2], and so on.
I know how to hard code it:
colnames(DF)[2:4] <- paste(test[1], colnames(DF)[2:4], sep = ".")
but this is a toy set, and I would like to automate it so that I'm not specifically indicating [2:4], for example.
You could try:
colnames(DF)[-1] <- paste(sapply(test, rep, 3), colnames(DF)[-1], sep = ".")
or perhaps the following would be better:
colnames(DF)[-1] <- paste(sapply(test, rep, 3), colnames(DF)[2:4], sep = ".")
or:
colnames(DF)[-1] <- paste(rep(test, each=3), colnames(DF)[2:4], sep = ".")
Thanks to @thelatemail.
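If the group size should not be hard-coded either, something along these lines might work. It is only a sketch: it assumes the first column is the ID and that the remaining columns divide evenly among the entries of test. The helper names base_names and n_per_test are made up here.

```r
test <- c("Test1", "Test2", "Test3")
DF <- data.frame(ID = c("A", "B", "C"),
                 year = c(2001, 2002, 2003), car = c("acura", "benz", "lexus"), age = c(55, 16, 20),
                 year.1 = c(2011, 2012, 2013), car.1 = c("honda", "gm", "bmw"), age.1 = c(43, 21, 34),
                 year.2 = c(1961, 1962, 1963), car.2 = c("toyota", "porsche", "jeep"), age.2 = c(33, 56, 42))

# Strip a trailing ".<digits>" suffix such as ".1" or ".2"
base_names <- sub("\\.\\d+$", "", colnames(DF)[-1])

# Number of columns belonging to each test, derived from the data
n_per_test <- (ncol(DF) - 1) / length(test)

colnames(DF)[-1] <- paste(rep(test, each = n_per_test), base_names, sep = ".")
colnames(DF)
```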
I have two data frames with columns of words and associated scores for these words. I want to run comments through these frames and create an additive score based on if the words appear in the sentences.
I want to do this across many, many comments so it needs to be computationally efficient. So for example, the sentence "hi, he said. why is it okay" will get a score of .98 + .1 + .2 because the words "hi", "why", and "okay" are in data frame a. Any sentence could potentially have words from several data frames as well.
Can anyone help me create the column "add_score" with a procedure that scales well to large data frames? Thank you
a <- data.frame(words = c("hi","no","okay","why"),score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"), score = c(.5,.3,.2))
comment_df = data.frame(id = c("1","2","3"), comments = c("hi, he said. why is it okay","okay okay okay no","yes, here is it"))
comment_df$add_score = c(1.28,1.1,.5)
This solution uses functions from tidyverse and stringr.
# Load packages
library(tidyverse)
library(stringr)
# Merge a and b to create score_df
score_df <- bind_rows(a, b)
# Create a function to calculate score for one string
string_cal <- function(string, score_df){
  temp <- score_df %>%
    # Count how many times each word occurs in the string
    mutate(Number = str_count(string, pattern = fixed(words))) %>%
    # Calculate the score
    mutate(Total_Score = score * Number)
  # Return the sum
  return(sum(temp$Total_Score))
}
# Use map_dbl to apply the string_cal function over comments
# The results are stored in the add_score column
comment_df <- comment_df %>%
  mutate(add_score = map_dbl(comments, string_cal, score_df = score_df))
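One thing to be aware of: str_count with fixed() matches substrings, so "hi" would also be counted inside a word like "this". If whole-word matching is wanted, a base-R sketch along the following lines may be an option. It assumes lower-casing and splitting on anything that is not a letter is an acceptable way to tokenise; lookup and tokens are names invented for this example.

```r
a <- data.frame(words = c("hi", "no", "okay", "why"),
                score = c(.98, .5, .2, .1), stringsAsFactors = FALSE)
b <- data.frame(words = c("bye", "yes", "here"),
                score = c(.5, .3, .2), stringsAsFactors = FALSE)
comment_df <- data.frame(id = c("1", "2", "3"),
                         comments = c("hi, he said. why is it okay",
                                      "okay okay okay no",
                                      "yes, here is it"),
                         stringsAsFactors = FALSE)

# Named vector acting as a word -> score lookup table
score_df <- rbind(a, b)
lookup <- setNames(score_df$score, score_df$words)

# Split each comment into lower-case word tokens
tokens <- strsplit(tolower(comment_df$comments), "[^a-z]+")

# Look up every token; unknown words give NA and are dropped from the sum
comment_df$add_score <- vapply(tokens,
                               function(w) sum(lookup[w], na.rm = TRUE),
                               numeric(1))
comment_df
```

The lookup is a single vectorised index per comment, so it should scale reasonably to many comments, though I have not benchmarked it.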
Data Preparation
a <- data.frame(words = c("hi","no","okay","why"),
score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"),
score = c(.5,.3,.2))
comment_df <- data.frame(id = c("1","2","3"),
comments = c("hi, he said. why is it okay",
"okay okay okay no",
"yes, here is it"))
So I have three data sets that I need to merge. These contain school data and read/math scores for grades 4 and 5. One of them is a long form data set that has a lot of missingness in some variables (yes, I do need the data in long form) and the other two have the full missing data in wide form. All of these data frames contain a column that has a unique ID number for each individual in the database.
Here is a full reproducible example that generates a small example of the types of data.frames I am working with... The three data frames that I need to use are the following: school_lf, school4 and school5. school_lf has the long form data with NAs and school4 and school5 are the dfs I need to use to populate the NA's in this long form data (by id and grade)
library(reshape2) # provides melt(), used below
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long form data...
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of merging the read scores and the math scores where there's NA's. I simply need one column with the read scores and one with the math scores, instead of six separate columns (read.x, read.y, math.x, math.y, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress (so figured I'd ask here)
You can use the coalesce function from dplyr. If a value in the first vector is NA, it will see if the value at the same position in the second vector is not NA and select it. If again NA, it goes to the third.
library(dplyr)
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
mutate(readscore = coalesce(readscore, read4, read5)) %>%
select(id:readscore)
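To make the first-non-missing behaviour concrete, here is a small base-R sketch that imitates it on toy vectors. The function coalesce_base is a hypothetical stand-in written for this example, not dplyr's implementation.

```r
x <- c(1, NA, NA, 4)
y <- c(NA, 2, NA, 40)
z <- c(10, 20, 3, NA)

# Walk through the vectors left to right, filling NAs in the running
# result with the value at the same position in the next vector
coalesce_base <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

filled <- coalesce_base(x, y, z)
filled
```

Position 3 is NA in both x and y, so its value comes from z; position 4 is already present in x, so y and z are ignored there.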
EDIT: I just tried to do this approach on my actual data and it does not work because the replacement data also has some NAs and, as a result, the dfs I try to do coalesce with have differing number of rows... Back to square one.
I was able to figure this out with the following code (although it's not the most elegant or straightforward), and @Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
# Column names follow the example data above (math4/read4 and math5/read5)
sch4r <- as.data.frame(subset(school4, select = -c(math4)))
sch4m <- as.data.frame(subset(school4, select = -c(read4)))
sch5r <- as.data.frame(subset(school5, select = -c(math5)))
sch5m <- as.data.frame(subset(school5, select = -c(read5)))
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m_lf$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m_lf$mathscore`
sch_full_5$`sch5m_lf$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL
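The code above relies on school_lf and sch_full having their rows in exactly the same order. A possibly more robust variant is to join on id and grade and then fill the gaps. The following is only a sketch on tiny made-up data: the toy values and the names full, math_full, read_full and out are assumptions for illustration.

```r
# Toy long-form data with gaps
school_lf <- data.frame(id = c(1, 2, 1, 2), grade = c(4, 4, 5, 5),
                        mathscore = c(410, NA, NA, 455),
                        readscore = c(NA, 420, 470, NA))
# Toy wide-form data holding the complete scores
school4 <- data.frame(id = c(1, 2), grade = 4, math4 = c(410, 430), read4 = c(405, 420))
school5 <- data.frame(id = c(1, 2), grade = 5, math5 = c(450, 455), read5 = c(470, 480))

# Stack both grades into one long table with common column names
full <- rbind(
  data.frame(id = school4$id, grade = 4, math_full = school4$math4, read_full = school4$read4),
  data.frame(id = school5$id, grade = 5, math_full = school5$math5, read_full = school5$read5)
)

# Join by key instead of relying on row order, then fill the NAs
out <- merge(school_lf, full, by = c("id", "grade"), all.x = TRUE)
out$mathscore <- ifelse(is.na(out$mathscore), out$math_full, out$mathscore)
out$readscore <- ifelse(is.na(out$readscore), out$read_full, out$readscore)
out <- out[, c("id", "grade", "mathscore", "readscore")]
out
```

Because the join is by key, rows that are missing from the replacement data simply keep their NA rather than shifting everything out of alignment.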
I have a question on data subset based on dynamic column class. For example:
#Coming from another source. I don't know exactly their names or the number of classes.
#But following are two demography, which will help in imagining the problem
gender <- c(1,2)
agegroup <- c(1,2,3,4,5,6,7,8)
#moredemo.................
# reproducible data
set.seed(1)
col1 <- as.data.frame(rep(gender, 100))
col2 <- as.data.frame(rep(agegroup, 25))
col3 <- runif(200)
datafile <- cbind(col1, col2, col3)
names(datafile)[1] = "gender"
names(datafile)[2] = "agegroup"
datafile <- as.data.frame(datafile)
#Subset is only for gender = 1 and agegroup = 3
#Subset is for every combination of classes in each demography
#No hardcoded names should be required, because the demography names will not be known
dat_gender_1_agegroup_3 <- datafile[datafile$gender == 1 & datafile$agegroup == 3, ]
But there can be more demographies than just gender and agegroup, such as income, education, or race, and each demography has a varying number of classes. Kindly help me subset the dataset datafile on a varying number of columns. Thanks in advance.
Use expand.grid to get the combinations, then apply to subset:
#dummy data
set.seed(123)
mydata <- data.frame(gender = sample(1:2, 100, replace = TRUE),
agegroup = sample(1:10, 100, replace = TRUE))
#groups
gender <- c(1,2)
agegroup <- c(1,2,3,4,5,6,7,8)
#get all combo
myCombo <- expand.grid(gender, agegroup)
#result is a list object
apply(myCombo, 1, function(i){
mydata[ mydata$gender == i[1] &
mydata$agegroup == i[2], ]
})
Edit: Based on the update, I think you just need the split command:
split(datafile, datafile[, 1:2])
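Since the demography column names are not known in advance, the grouping columns can also be picked up programmatically. A rough sketch, assuming every column except one known measurement column (called value here, a made-up name) is a demography:

```r
datafile <- data.frame(gender = rep(1:2, 100),
                       agegroup = rep(1:8, 25),
                       value = seq_len(200))

# Treat every column other than the measurement as a grouping column
demo_cols <- setdiff(names(datafile), "value")

# One data frame per combination that actually occurs in the data;
# drop = TRUE leaves out combinations with no rows
subsets <- split(datafile, datafile[demo_cols], drop = TRUE)

names(subsets)
head(subsets[["1.3"]])
```

The list names are the class values joined by a dot, so "1.3" holds gender 1 with agegroup 3.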
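What about (assuming the column names are "gender" and "agegroup"):
gender_values <- c(1,2)
agegroup_values <- c(1,2,3,4,5,6,7,8)
# The value vectors need names that differ from the columns; inside subset(), gender %in% gender would compare the column with itself
data_subset <- subset(full_data, gender %in% gender_values | agegroup %in% agegroup_values) # add further conditions the same way
You can add as many [column_name] %in% [values] conditions as you want.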
HTH a little!
EDIT: you can very well use & instead of |, obviously.
I have a data frame that's an edgelist (undirected) describing who is tied to whom, and then a data frame with those actors' ethnicity. I want to get a data frame that lists the name of each ego in one column and the count of their alters of a given ethnicity in the other (e.g. Joe and the number of his white friends). Here's what I tried:
atts <- data.frame(Actor = letters[1:10], Ethnicity = sample(1:3, 10, replace=T)) # sample ethnicity data
df <- data.frame(actorA = letters[1:10],actorB=c("h","d","f","i","g","b","a","a","e","h")) # sample edgelist
df.split<-split(df$actorB,df$actorA) # obtain list of alters for column 1
head(df.split)
friends <- c()
n<-length(df.split)
for (i in 1:n){
alters_e <-atts[atts$Actor %in% df.split[[i]]==TRUE,] # get ethnicity for alters
friends[i] <- sum(alters_e$Ethnicity==3) # compute no. ties for one ethnicity value
}
friends
The problem with this is that using the split function doesn't work if some of your egos only show up in the actorB column.
Can anybody recommend a more graceful way for me to obtain lists of alters by ego's ID, that isn't the split function?
I hope this helps:
(atts <- data.frame(Actor = letters[1:10], Ethnicity = sample(1:3, 10, replace=T)))
(df <- data.frame(alter = letters[1:10],ego=c("h","d","f","i","g","b","a","a","e","h")))
(Merged <- merge (df, atts, by.x="alter", by.y="Actor"))
with(Merged, table(ego,Ethnicity))
,David
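Because the edgelist is undirected, egos that appear only in the second column can be covered by listing every tie in both directions before merging. A small sketch on made-up data; edges, both, merged and counts are names chosen for this example:

```r
atts <- data.frame(Actor = c("a", "b", "c", "d"),
                   Ethnicity = c(1, 3, 3, 2))
edges <- data.frame(actorA = c("a", "a", "b"),
                    actorB = c("b", "c", "d"))

# List each undirected tie twice, once from each end
both <- rbind(data.frame(ego = edges$actorA, alter = edges$actorB),
              data.frame(ego = edges$actorB, alter = edges$actorA))

# Attach the alter's ethnicity, then count alters per ego and ethnicity
merged <- merge(both, atts, by.x = "alter", by.y = "Actor")
counts <- table(merged$ego, merged$Ethnicity)
counts
```

Here c and d never appear in actorA, yet they still get a row in the table.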