I'm sure I can figure out a straightforward solution to this problem, but I didn't see a comparable question so I thought I'd post a question.
I have a longitudinal dataset with thousands of respondents over several time intervals. Everything from the questions to the data types can differ between the waves and often requires constructing long series of bools to construct indicators or dummy variables, but each respondent has a unique ID with no additional respondents add to the surveys after the first wave, so easy enough.
The issue is that while the early wave consist of one (Stata) file each, the latter waves contain lots of addendum files, structured differently. So, for example, in constructing previous indicators for the sex of previous partners there were columns (for one wave) called partnerNum and sex and there were up to 16 rows for each unique ID (respondent). Easy enough to spread (or cast) that data to be able to create a single row for each unique ID and columns partnerNum_1 ... partnerNum_16 with the value from the sex column as the entry in partnerDF. Then it's easy to construct indicators like:
sexuality$newIndicator[mainDF$bioSex = "Male" & apply(partnerDF[1:16] == "Male", 1, any)] <- 1
For other addendum files in the last two waves the data is structured long like the partner data, with multiple rows for each unique ID, but rather than just one variable like sex there are hundreds that I need to use to test against to construct indicators, all coded with different types, so it's impractical to spread (or cast) the data wide (never mind writing those bools). There are actually several of these files for each wave and the way they are structured some respondents (unique ID) occupy just 1 row, some a few dozen. (I've left_join'ed the addendum files together for each wave.)
What I'd like to be able to do to is test something like:
newDF$indicator[any(waveIIIAdds$var1 == 1) & any(waveIIIAdds$var2 == 1)] <- 1
or
newDF$indicator[mainDF$var1 == 1 & any(waveIIIAdds$var2 == 1)] <- 1
where newDF is the same length as mainDF (one row per unique ID).
So, for example, if I had two dfs.
df1 <- data.frame(ID = c(1:4), A = rep("a"))
df2 <- data.frame(ID = rep(1:4, each=2), B = rep(1:2, 2), stringsAsFactors = FALSE)
df1$A[1] <- "b"
df1$A[3] <- "b"
df2$B[8] <- 3
> df1 > df2
ID A ID B
1 b 1 1
2 a 1 2
3 b 2 1
4 a 2 2
3 1
3 2
4 1
4 3
I'd like to test like (assuming df3 has one column, just the unique IDs from df1)
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] & df2$ID[df2$B == 2]] <- 1
So that df3 would have one unique ID per row and since there is an "a" in df1$A for all IDs but df1$A[1] and a 2 in at least one row of df2$B for all IDs except the last ID (df2$B[7:8]) the result would be:
> df3
ID new
1 0
2 1
3 1
4 0
and
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] | df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 1
4 0
This does it...
df3 <- data.frame(ID=unique(df1$ID),
new=sapply(unique(df1$ID),function(x)
as.numeric(x %in% df1$ID[df1$A == "a"] & x %in% df2$ID[df2$B == 2])))
df3
ID new
1 1 1
2 2 1
3 3 1
4 4 0
I came up with a parsimonious solution thinking about it for a few minutes after returning to the problem (rather than the wee hours of the morning of the post).
I wanted something a graduate student who will likely construct thousands of indicators or dummy variables this way and may learn R first, or even only ever learn R, could use. The following provides a solution for the example and actual data using the same schema:
if the DF was already created with the IDs and the column values for the dummy indicator initiated to zero already as assumed in the example:
df3 <- data.frame(ID = df1$ID)
df3$new <- 0
My solution was:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] & df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 0
2 1
3 0
4 1
Using | (or) instead:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] | df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 0
4 1
I need to turn these two matrices corresponding to (toy) word counts:
a hope to victory win
[1,] 2 1 1 1 1
and
a chance than win
[1,] 1 1 1 1
where the word "a" appears a combined number of 3 times, and the word "win" appears 2 times (once in each matrix), into:
a win chance hope than to victory
[1,] 3 2 1 1 1 1 1
where equally-named columns combine into a single column that contains the sum.
And,
a hope to victory win different than
[1,] 2 1 1 1 1 0 0
where first matrix is preserved, and the second matrix is attached at the end but with only unique column names and with all the row values equal to zero.
So, if you store this data in a data frame (Which is really recommended for this sort of data) the process is very simple.
(I'm including a conversion from that format, with any number of rows):
conversion:
newdf1 <- data.frame(Word = colnames(matrix1), Count = as.vector(t(matrix1)))
newdf2 <- data.frame(Word = colnames(matrix2), Count = as.vector(t(matrix2)))
now you can use rbind + dplyr (or data.table)
dplyr solution:
library(dplyr)
df <- rbind(newdf1,newdf2)
result <- df %>% group_by(Word) %>% summarise(Count = sum(Count))
the answer to your second question is related,
result2 <- rbind(newdf1,data.frame(Word = setdiff(newdf2$Word,newdf1$Word), Count = 0))
(the data.table solution is very similar, but if you're new to data frames and grouping/reshaping, I recommend dplyr)
(EDITED the second solution so that it's actually giving you the unique entries)
I understand that this is quite a simple question, but I haven't been able to find an answer to this.
I have a data frame which gives you the id of a person and his hobby. Since a person may have many hobbies, the id field may be repeated in multiple rows, each with a different hobby. I have been trying to print out only those rows which have more than one hobbies. I was able to get the frequencies using table.
But how do I apply the condition to print only when the frequency is greater than one.
Secondly, is there a better way to find frequencies without using table.
This is my attempt with table without the filter for frequency greater than one
> id=c(1,2,2,3,2,4,3,1)
> hobby = c('play','swim','play','movies','golf','basketball','playstation','gameboy')
> df = data.frame(id, hobby)
> table(df$id)
1 2 3 4
2 3 2 1
Try using data table, I find it more readable than using table() functions:
library(data.table)
id=c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies',
'golf','basketball','playstation','gameboy')
df = data.frame(id=id, hobby=hobby)
dt = as.data.table(df)
dt[,hobbies:=.N, by=id]
You will get, for your condition:
> dt[hobbies >1,]
id hobby hobbies
1: 1 play 2
2: 2 swim 3
3: 2 play 3
4: 3 movies 2
5: 2 golf 3
6: 3 playstation 2
7: 1 gameboy 2
This example assumes you are trying to filter df
id=c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies','golf','basketball',
'playstation','gameboy')
df = data.frame(id, hobby)
table(df$id)
Get all those ids that have more than one hobby
tmp <- as.data.frame(table(df$id))
tmp <- tmp[tmp$Freq > 1,]
Using that information - select their IDs in df
df1 <- df[df$id %in% tmp$Var1,]
df1
In R, I have two data frames (A and B) that share columns (1, 2 and 3). Column 1 has a unique identifier, and is the same for each data frame; columns 2 and 3 have different information. I'm trying to merge these two data frames to get 1 new data frame that has columns 1, 2, and 3, and in which the values in column 2 and 3 are concatenated: i.e. column 2 of the new data frame contains: [data frame A column 2 + data frame B column 2]
Example:
dfA <- data.frame(Name = c("John","James","Peter"),
Score = c(2,4,0),
Response = c("1,0,0,1","1,1,1,1","0,0,0,0"))
dfB <- data.frame(Name = c("John","James","Peter"),
Score = c(3,1,4),
Response = c("0,1,1,1","0,1,0,0","1,1,1,1"))
dfA:
Name Score Response
1 John 2 1,0,0,1
2 James 4 1,1,1,1
3 Peter 0 0,0,0,0
dfB:
Name Score Response
1 John 3 0,1,1,1
2 James 1 0,1,0,0
3 Peter 4 1,1,1,1
Should results in:
dfNew <- data.frame(Name = c("John","James","Peter"),
Score = c(5,5,4),
Response = c("1,0,0,1,0,1,1,1","1,1,1,1,0,1,0,0","0,0,0,0,1,1,1,1"))
dfNew:
Name Score Response
1 John 5 1,0,0,1,0,1,1,1
2 James 5 1,1,1,1,0,1,0,0
3 Peter 4 0,0,0,0,1,1,1,1
I've tried merge but that simply appends the columns (much like cbind)
Is there a way to do this, without having to cycle through all columns, like:
colnames(dfNew) <- c("Name","Score","Response")
dfNew$Score <- dfA$Score + dfB$Score
dfNew$Response <- paste(dfA$Response, dfB$Response, sep=",")
The added difficulty is, as you might have noticed, that for some columns we need to use addition, whereas others require concatenation separated by a comma (the columns requiring addition are formatted as numerical, the others as text, which might make it easier?)
Thanks in advance!
PS. The string 1,0,0,1,0,1,1,1 etc. captures the response per trial – this example has 8 trials to which participants can either respond correctly (1) or incorrectly (0); the final score is collected under Score. Just to explain why my data/example looks the way it does.
Personally, I would try to avoid concatenating 'response per trial' to a single variable ('Response') from the start, in order to make the data less static and facilitate any subsequent steps of analysis or data management. Given that the individual trials already are concatenated, as in your example, I would therefore consider splitting them up. Formatting the data frame for a final, pretty, printed output I would consider a different, later issue.
# merge data (cbind would also work if data are ordered properly)
df <- merge(x = dfA[ , c("Name", "Response")], y = dfB[ , c("Name", "Response")],
by = "Name")
# rename
names(df) <- c("Name", c("A", "B"))
# split concatenated columns
library(splitstackshape)
df2 <- concat.split.multiple(data = df, split.cols = c("A", "B"),
seps = ",", direction = "wide")
# calculate score
df2$Score <- rowSums(df2[ , -1])
df2
# Name A_1 A_2 A_3 A_4 B_1 B_2 B_3 B_4 Score
# 1 James 1 1 1 1 0 1 0 0 5
# 2 John 1 0 0 1 0 1 1 1 5
# 3 Peter 0 0 0 0 1 1 1 1 4
I would approach this with a for loop over the column names you want to merge. Given your example data:
cols <- c("Score", "Response")
dfNew <- dfA[,"Name",drop=FALSE]
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfA[[n]] + dfB[[n]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfA[[n]], dfB[[n]], sep=",")})
}
This solution is basically what you had as your idea, but with a loop. The data sets are looked at to see if they are numeric (add them numerically) or a string or factor (concatenate the strings). You could get a similar result by having two vectors of names, one for the numeric and one for the character, but this is extensible if you have other data types as well (though I don't know what they might be). The major drawback of this method is that is assumes the data frames are in the same order with regard to Name. The next solution doesn't make that assumption
dfNew <- merge(dfA, dfB, by="Name")
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfNew[[paste0(n,".x")]] + dfNew[[paste0(n,".y")]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfNew[[paste0(n,".x")]], dfNew[[paste0(n,".y")]], sep=",")})
dfNew[[paste0(n,".x")]] <- NULL
dfNew[[paste0(n,".y")]] <- NULL
}
Same general idea as previous, but uses merge to make sure that the data is correctly aligned, and then works on columns (whose names are postfixed with ".x" and ".y") with dfNew. Additional steps are included to get rid of the separate columns after joining. Also has the bonus feature of carrying along any other columns not specified for joining together in cols.