Append values from column 2 to values from column 1 - r

In R, I have two data frames (A and B) that share columns (1, 2 and 3). Column 1 has a unique identifier, and is the same for each data frame; columns 2 and 3 have different information. I'm trying to merge these two data frames to get 1 new data frame that has columns 1, 2, and 3, and in which the values in column 2 and 3 are concatenated: i.e. column 2 of the new data frame contains: [data frame A column 2 + data frame B column 2]
Example:
dfA <- data.frame(Name = c("John","James","Peter"),
Score = c(2,4,0),
Response = c("1,0,0,1","1,1,1,1","0,0,0,0"))
dfB <- data.frame(Name = c("John","James","Peter"),
Score = c(3,1,4),
Response = c("0,1,1,1","0,1,0,0","1,1,1,1"))
dfA:
Name Score Response
1 John 2 1,0,0,1
2 James 4 1,1,1,1
3 Peter 0 0,0,0,0
dfB:
Name Score Response
1 John 3 0,1,1,1
2 James 1 0,1,0,0
3 Peter 4 1,1,1,1
Should result in:
dfNew <- data.frame(Name = c("John","James","Peter"),
Score = c(5,5,4),
Response = c("1,0,0,1,0,1,1,1","1,1,1,1,0,1,0,0","0,0,0,0,1,1,1,1"))
dfNew:
Name Score Response
1 John 5 1,0,0,1,0,1,1,1
2 James 5 1,1,1,1,0,1,0,0
3 Peter 4 0,0,0,0,1,1,1,1
I've tried merge, but that simply appends the columns (much like cbind).
Is there a way to do this, without having to cycle through all columns, like:
colnames(dfNew) <- c("Name","Score","Response")
dfNew$Score <- dfA$Score + dfB$Score
dfNew$Response <- paste(dfA$Response, dfB$Response, sep=",")
The added difficulty is, as you might have noticed, that for some columns we need to use addition, whereas others require concatenation separated by a comma (the columns requiring addition are formatted as numerical, the others as text, which might make it easier?)
Thanks in advance!
PS. The string 1,0,0,1,0,1,1,1 etc. captures the response per trial – this example has 8 trials to which participants can either respond correctly (1) or incorrectly (0); the final score is collected under Score. Just to explain why my data/example looks the way it does.

Personally, I would try to avoid concatenating 'response per trial' to a single variable ('Response') from the start, in order to make the data less static and facilitate any subsequent steps of analysis or data management. Given that the individual trials already are concatenated, as in your example, I would therefore consider splitting them up. Formatting the data frame for a final, pretty, printed output I would consider a different, later issue.
# merge data (cbind would also work if data are ordered properly)
df <- merge(x = dfA[ , c("Name", "Response")], y = dfB[ , c("Name", "Response")],
by = "Name")
# rename
names(df) <- c("Name", c("A", "B"))
# split concatenated columns
library(splitstackshape)
df2 <- concat.split.multiple(data = df, split.cols = c("A", "B"),
seps = ",", direction = "wide")
# calculate score
df2$Score <- rowSums(df2[ , -1])
df2
# Name A_1 A_2 A_3 A_4 B_1 B_2 B_3 B_4 Score
# 1 James 1 1 1 1 0 1 0 0 5
# 2 John 1 0 0 1 0 1 1 1 5
# 3 Peter 0 0 0 0 1 1 1 1 4
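The same split can be sketched in base R, in case splitstackshape is not available. This is only an illustration under assumptions: split_trials() is a hypothetical helper (not part of any package), and it assumes every Response string holds the same number of trials.

```r
# Example data from the question (Score left out; it is recomputed below)
dfA <- data.frame(Name = c("John", "James", "Peter"),
                  Response = c("1,0,0,1", "1,1,1,1", "0,0,0,0"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(Name = c("John", "James", "Peter"),
                  Response = c("0,1,1,1", "0,1,0,0", "1,1,1,1"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: turn comma-separated strings into a numeric matrix,
# one column per trial (assumes equal trial counts in every string)
split_trials <- function(x, prefix) {
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  m <- do.call(rbind, lapply(parts, as.numeric))
  colnames(m) <- paste(prefix, seq_len(ncol(m)), sep = "_")
  m
}

# merge() aligns the rows by Name and suffixes the clashing Response columns
df <- merge(dfA, dfB, by = "Name", suffixes = c(".A", ".B"))
wide <- data.frame(Name = df$Name,
                   split_trials(df$Response.A, "A"),
                   split_trials(df$Response.B, "B"),
                   stringsAsFactors = FALSE)

# Score is the number of correct trials across both sets
wide$Score <- rowSums(wide[, -1])
wide
```

Keeping one column per trial means the score can always be recomputed from the raw responses instead of being stored separately.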

I would approach this with a for loop over the column names you want to merge. Given your example data:
cols <- c("Score", "Response")
dfNew <- dfA[,"Name",drop=FALSE]
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfA[[n]] + dfB[[n]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfA[[n]], dfB[[n]], sep=",")})
}
This solution is basically what you had as your idea, but with a loop. Each column is checked to see whether it is numeric (in which case the values are added) or a string or factor (in which case the strings are concatenated). You could get a similar result by having two vectors of names, one for the numeric columns and one for the character columns, but this approach is extensible if you have other data types as well (though I don't know what they might be). The major drawback of this method is that it assumes the data frames are in the same order with regard to Name. The next solution doesn't make that assumption.
dfNew <- merge(dfA, dfB, by="Name")
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfNew[[paste0(n,".x")]] + dfNew[[paste0(n,".y")]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfNew[[paste0(n,".x")]], dfNew[[paste0(n,".y")]], sep=",")})
dfNew[[paste0(n,".x")]] <- NULL
dfNew[[paste0(n,".y")]] <- NULL
}
Same general idea as the previous solution, but it uses merge to make sure that the data are correctly aligned, and then works on the columns within dfNew (whose names are suffixed with ".x" and ".y"). Additional steps remove the separate columns after combining. It also has the bonus of carrying along any other columns not listed in cols.
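The type-based dispatch described in this answer can also be sketched without an explicit loop. The names combine_pair() and combine_frames() are hypothetical helpers made up for this illustration, and the sketch assumes every key in the first data frame also appears exactly once in the second.

```r
dfA <- data.frame(Name = c("John", "James", "Peter"),
                  Score = c(2, 4, 0),
                  Response = c("1,0,0,1", "1,1,1,1", "0,0,0,0"),
                  stringsAsFactors = FALSE)
# dfB is deliberately in a different row order to show the alignment step
dfB <- data.frame(Name = c("Peter", "John", "James"),
                  Score = c(4, 3, 1),
                  Response = c("1,1,1,1", "0,1,1,1", "0,1,0,0"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: add numeric columns, paste everything else
combine_pair <- function(x, y) {
  if (is.numeric(x) && is.numeric(y)) x + y else paste(x, y, sep = ",")
}

# Hypothetical helper: combine all shared non-key columns of a and b
combine_frames <- function(a, b, key) {
  value_cols <- setdiff(intersect(names(a), names(b)), key)
  # Reorder b so its rows line up with a (assumes every key of a is in b)
  b <- b[match(a[[key]], b[[key]]), , drop = FALSE]
  out <- a[, key, drop = FALSE]
  out[value_cols] <- Map(combine_pair, a[value_cols], b[value_cols])
  out
}

dfNew <- combine_frames(dfA, dfB, key = "Name")
dfNew
```

Using is.numeric() rather than switching on class() also covers integer columns, which have class "integer" and would fall through the switch above.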

Related

R - Compare column values in data frames of differing lengths by unique ID

I'm sure I can figure out a straightforward solution to this problem, but I didn't see a comparable question so I thought I'd post a question.
I have a longitudinal dataset with thousands of respondents over several time intervals. Everything from the questions to the data types can differ between the waves and often requires constructing long series of bools to construct indicators or dummy variables, but each respondent has a unique ID and no additional respondents were added to the surveys after the first wave, so that part is easy enough.
The issue is that while the early waves consist of one (Stata) file each, the later waves contain lots of addendum files, structured differently. So, for example, in constructing previous indicators for the sex of previous partners there were columns (for one wave) called partnerNum and sex and there were up to 16 rows for each unique ID (respondent). Easy enough to spread (or cast) that data to be able to create a single row for each unique ID and columns partnerNum_1 ... partnerNum_16 with the value from the sex column as the entry in partnerDF. Then it's easy to construct indicators like:
sexuality$newIndicator[mainDF$bioSex == "Male" & apply(partnerDF[1:16] == "Male", 1, any)] <- 1
For other addendum files in the last two waves the data is structured long like the partner data, with multiple rows for each unique ID, but rather than just one variable like sex there are hundreds that I need to use to test against to construct indicators, all coded with different types, so it's impractical to spread (or cast) the data wide (never mind writing those bools). There are actually several of these files for each wave and the way they are structured some respondents (unique ID) occupy just 1 row, some a few dozen. (I've left_join'ed the addendum files together for each wave.)
What I'd like to be able to do is test something like:
newDF$indicator[any(waveIIIAdds$var1 == 1) & any(waveIIIAdds$var2 == 1)] <- 1
or
newDF$indicator[mainDF$var1 == 1 & any(waveIIIAdds$var2 == 1)] <- 1
where newDF is the same length as mainDF (one row per unique ID).
So, for example, if I had two dfs.
df1 <- data.frame(ID = c(1:4), A = rep("a"))
df2 <- data.frame(ID = rep(1:4, each=2), B = rep(1:2, 2), stringsAsFactors = FALSE)
df1$A[1] <- "b"
df1$A[3] <- "b"
df2$B[8] <- 3
> df1
ID A
1 b
2 a
3 b
4 a
> df2
ID B
1 1
1 2
2 1
2 2
3 1
3 2
4 1
4 3
I'd like to test like (assuming df3 has one column, just the unique IDs from df1)
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] & df2$ID[df2$B == 2]] <- 1
So that df3 would have one unique ID per row and since there is an "a" in df1$A for all IDs but df1$A[1] and a 2 in at least one row of df2$B for all IDs except the last ID (df2$B[7:8]) the result would be:
> df3
ID new
1 0
2 1
3 1
4 0
and
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] | df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 1
4 0
This does it...
df3 <- data.frame(ID=unique(df1$ID),
new=sapply(unique(df1$ID),function(x)
as.numeric(x %in% df1$ID[df1$A == "a"] & x %in% df2$ID[df2$B == 2])))
df3
ID new
1 1 1
2 2 1
3 3 1
4 4 0
I came up with a parsimonious solution thinking about it for a few minutes after returning to the problem (rather than the wee hours of the morning of the post).
I wanted something a graduate student who will likely construct thousands of indicators or dummy variables this way and may learn R first, or even only ever learn R, could use. The following provides a solution for the example and actual data using the same schema:
If the data frame has already been created with the IDs, and the dummy indicator column has been initialized to zero, as assumed in the example:
df3 <- data.frame(ID = df1$ID)
df3$new <- 0
My solution was:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] & df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 0
2 1
3 0
4 1
Using | (or) instead:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] | df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 0
4 1
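The %in% pattern used in both answers can be wrapped in a small function so that each test against a long table becomes one reusable line. This is only a sketch: has_any() is a hypothetical name, and the data are written out literally to match the df1 and df2 construction code in the question.

```r
# df1 and df2 as built by the construction code in the question
df1 <- data.frame(ID = 1:4, A = c("b", "a", "b", "a"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = rep(1:4, each = 2),
                  B = c(1, 2, 1, 2, 1, 2, 1, 3))

# Hypothetical helper: for each id, is there at least one row in the long
# table with that key for which `test` is TRUE? NA tests count as FALSE.
has_any <- function(ids, key, test) {
  ids %in% key[test & !is.na(test)]
}

df3 <- data.frame(ID = df1$ID)
in_a <- has_any(df3$ID, df1$ID, df1$A == "a")
in_b <- has_any(df3$ID, df2$ID, df2$B == 2)

# One row per unique ID, one indicator column per logical combination
df3$and <- as.integer(in_a & in_b)
df3$or  <- as.integer(in_a | in_b)
df3
```

Because has_any() always returns one logical value per ID, the results can be combined with & and | regardless of how many rows each respondent occupies in the addendum table.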

R: change one value every row in big dataframe

I just started working with R for my master's thesis, and up to now all my calculations have worked out because I read a lot of questions and answers here (and it's a lot of trial and error, but that's OK).
Now I need to write more sophisticated code and I can't find a way to do this.
This is the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50000 entries) I want to change only one value in every row. The new value should be the existing entry plus a few values from another sub-data-set (140000 entries) where the 'ID' variable is the same.
As this is the third day I'm trying to solve this, I already found and tested for and apply but both are running for hours (canceled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
Entry_ID <- Sub02[i,4]
SUM_Entries <- sum(Sub03$Source==Entry_ID)
Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character
Value1 <- as.numeric(Entries_w_ID$VAL1)
SUM_Value1 <- sum(Value1)
Value2 <- as.numeric(Entries_w_ID$VAL2)
SUM_Value2 <- sum(Value2)
OLD_Val1 <- Sub02[i,13]
OLD_Val <- as.numeric(OLD_Val1)
NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
Sub02[i,13] <- NEW_Val
}
I know this might be silly code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get on with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need 2 things:
sum up some values in one dataset
add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
Values is character now, we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
(Updated.)
From what I understand, you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to save memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once you have joined the data sets into one, it should be easy to create the new columns you need, e.g.: df$new <- df$old1 + df$old2
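For comparison, the same aggregate-then-look-up idea can be sketched in base R with column names closer to the question. The tiny Sub02 and Sub03 below are invented for illustration only, and the sketch assumes Source in Sub03 matches the ID column of Sub02 exactly (the loop in the question uses grepl, which matches substrings instead).

```r
# Invented miniature versions of the two sub-data-sets (assumption)
Sub02 <- data.frame(ID = c("a", "b", "c"), VAL9 = c(8, 11, 9),
                    stringsAsFactors = FALSE)
Sub03 <- data.frame(Source = c("a", "a", "b"),
                    VAL1 = c("4", "6", "5"), VAL2 = c("1", "7", "8"),
                    stringsAsFactors = FALSE)

# Convert the character columns once, outside of any loop
Sub03$VAL1 <- as.numeric(Sub03$VAL1)
Sub03$VAL2 <- as.numeric(Sub03$VAL2)

# One pass over Sub03: per Source, the row count plus the sums of VAL1, VAL2
per_source <- rowsum(cbind(n = 1, VAL1 = Sub03$VAL1, VAL2 = Sub03$VAL2),
                     group = Sub03$Source)

# Look up each Sub02 ID by name; IDs without rows in Sub03 give NA
increment <- rowSums(per_source)[Sub02$ID]
increment[is.na(increment)] <- 0

# Same formula as the loop: count + sum(VAL1) + sum(VAL2) + old value
Sub02$VAL9 <- Sub02$VAL9 + unname(increment)
Sub02
```

The grouping happens once for all IDs, so the work no longer grows with the product of the two row counts as it does in the row-by-row loop.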

Summing values in rows of matrices with same column name in R

I need to turn these two matrices corresponding to (toy) word counts:
a hope to victory win
[1,] 2 1 1 1 1
and
a chance than win
[1,] 1 1 1 1
where the word "a" appears a combined number of 3 times, and the word "win" appears 2 times (once in each matrix), into:
a win chance hope than to victory
[1,] 3 2 1 1 1 1 1
where equally-named columns combine into a single column that contains the sum.
And,
a hope to victory win different than
[1,] 2 1 1 1 1 0 0
where first matrix is preserved, and the second matrix is attached at the end but with only unique column names and with all the row values equal to zero.
If you store this data in a data frame (which is really recommended for this sort of data), the process is very simple.
(I'm including a conversion from that format, with any number of rows):
conversion:
newdf1 <- data.frame(Word = colnames(matrix1), Count = as.vector(t(matrix1)))
newdf2 <- data.frame(Word = colnames(matrix2), Count = as.vector(t(matrix2)))
now you can use rbind + dplyr (or data.table)
dplyr solution:
library(dplyr)
df <- rbind(newdf1,newdf2)
result <- df %>% group_by(Word) %>% summarise(Count = sum(Count))
the answer to your second question is related,
result2 <- rbind(newdf1,data.frame(Word = setdiff(newdf2$Word,newdf1$Word), Count = 0))
(the data.table solution is very similar, but if you're new to data frames and grouping/reshaping, I recommend dplyr)
(EDITED the second solution so that it's actually giving you the unique entries)
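If the counts should stay in matrix form, both results can also be sketched in base R without converting to data frames. The matrices below are rebuilt from the question; the names counts, summed, extra and padded are illustrative only, and the sketch assumes single-row matrices as in the example.

```r
matrix1 <- matrix(c(2, 1, 1, 1, 1), nrow = 1,
                  dimnames = list(NULL, c("a", "hope", "to", "victory", "win")))
matrix2 <- matrix(c(1, 1, 1, 1), nrow = 1,
                  dimnames = list(NULL, c("a", "chance", "than", "win")))

# First result: sum the counts of equally named columns
counts <- c(matrix1[1, ], matrix2[1, ])      # named vector, names may repeat
summed <- tapply(counts, names(counts), sum)
# Most frequent words first, ties broken alphabetically
summed <- summed[order(-summed, names(summed))]
summed

# Second result: keep matrix1 and append zero columns for words that
# occur only in matrix2
extra <- setdiff(colnames(matrix2), colnames(matrix1))
padded <- cbind(matrix1,
                matrix(0, nrow = nrow(matrix1), ncol = length(extra),
                       dimnames = list(NULL, extra)))
padded
```

For matrices with more than one row, the data frame route above generalizes more easily, since tapply() here works on a single row of counts.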

Combine two data.frames in R with differing rows

I have two tables one with more rows than the other. I would like to filter the rows out that both tables share. I tried the solutions proposed here.
The problem, however, is that it is a large data-set and computation takes quite a while. Is there any simple solution? I know how to extract the shared rows of both tables using:
rownames(x1)->k
rownames(x)->l
which(rownames(x1)%in%l)->o
Here x1 and x are my data frames. But this only provides me with the shared rows. How can I get the unique rows of each table to then exclude them respectively? So that I can just cbind both tables together?
(I edit the whole answer)
You can merge both data frames with merge() (from Andrie's comment). Also check ?merge to see all the options you can pass as the by parameter; 0 means row.names.
The code below shows an example with what could be your data frames (different number of rows and columns)
x = data.frame(a1 = c(1,1,1,1,1), a2 = c(0,1,1,0,0), a3 = c(1,0,2,0,0), row.names = c('y1','y2','y3','y4','y5'))
x1 = data.frame(a4 = c(1,1,1,1), a5 = c(0,1,0,0), row.names = c('y1','y3','y4','y5'))
Provided that row names can be used as identifier then we put them as a new column to merge by columns:
x$id <- row.names(x)
x1$id <- row.names(x1)
# merge by column names
merge(x, x1, by = intersect(names(x), names(x1)))
# result
# id a1 a2 a3 a4 a5
# 1 y1 1 0 1 1 0
# 2 y3 1 1 2 1 1
# 3 y4 1 0 0 1 0
# 4 y5 1 0 0 1 0
I hope this solves the problem.
EDIT: Ok, now I feel silly. If ALL columns have different names in both data frames then you don't need to put the row name as another column. Just use:
merge(x,x1, by=0)
If you only want the rows which are not repeated from each data set:
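k <- rownames(x1)
l <- rownames(x)
o <- k[k %in% l] # row names that occur in both data frames
x1.uniq <- x1[!(k %in% o), , drop = FALSE] # rows found only in x1
x.uniq <- x[!(l %in% o), , drop = FALSE] # rows found only in x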
And then you can join them with rbind:
x2 <- rbind(x1.uniq,x.uniq);
If you also wanted the repeated rows you can add them:
x.repeated <- x1[o, ] # the comma selects rows; x1[o] would select columns
x2 <- rbind(x2, x.repeated)
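The set operations behind both answers can be sketched with intersect() and setdiff() on the row names. The data frames are the example ones from the first answer; shared, only_x, only_x1 and both are illustrative names, and the sketch assumes row names identify the rows.

```r
x  <- data.frame(a1 = c(1, 1, 1, 1, 1), a2 = c(0, 1, 1, 0, 0),
                 a3 = c(1, 0, 2, 0, 0),
                 row.names = c("y1", "y2", "y3", "y4", "y5"))
x1 <- data.frame(a4 = c(1, 1, 1, 1), a5 = c(0, 1, 0, 0),
                 row.names = c("y1", "y3", "y4", "y5"))

shared  <- intersect(rownames(x), rownames(x1))  # rows present in both
only_x  <- setdiff(rownames(x), rownames(x1))    # rows only in x
only_x1 <- setdiff(rownames(x1), rownames(x))    # rows only in x1

# Indexing both tables by the same names puts the rows in the same order,
# so cbind() is safe here
both <- cbind(x[shared, , drop = FALSE], x1[shared, , drop = FALSE])
both
```

Indexing by name rather than by position avoids the silent misalignment that cbind() would produce if the two tables were sorted differently.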

Improving performance of updating contents of large data frame using contents of similar data frame

I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
Thanks
This loops by column, setting dt1 by reference and (hopefully) should be quick.
library(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)
for (i in 2:ncol(dt2)) {
nna = !is.na(dt2[[i]])
set(dt1,w[nna],i,dt2[[i]][nna])
}
dt1 = rbind(dt1,dt2[is.na(w)])
dt1
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2
This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(as.character(df1$K), as.character(df2$K)) #all cells of key column (works for factor or character keys)
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ],
FUN = decider, SIMPLIFY = FALSE)) #replace all NAs of df2 with df1 values if they exist; SIMPLIFY = FALSE keeps each column's type
df4$K <- df2$K[df2$K%in%dups] #restore the key column
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading, not all columns have the same name, but I am assuming the same order. This may be a more helpful approach:
all <- c(as.character(df1$K), as.character(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
colnames(LS[[i]]) <- colnames(LS[[2]])
return(LS[[i]])
}
)
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider, SIMPLIFY = FALSE)) #SIMPLIFY = FALSE keeps each column's type
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5
EDIT : Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as separate answer.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
k = K[i]
p = unlist(dt2[i,-1,with=FALSE])
p = p[!is.na(p)]
dt1[J(k),names(p):=as.list(p),with=FALSE]
}
Or, can you use a matrix instead of a data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
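The matrix idea can be sketched in base R as follows. It assumes all non-key columns are numeric and appear in the same order in both data sets; base, upd, hits and target are illustrative names, and the data frames are typed in directly instead of being read from the files.

```r
df1 <- data.frame(K = c("A", "B", "C"), P1 = c(1, 1, 1), P2 = c(1, 1, 1),
                  P3 = c(1, 1, 1), stringsAsFactors = FALSE)
df2 <- data.frame(K = c("B", "C", "D"), P1 = c(2, NA, 2), P2 = c(NA, 2, 2),
                  P3 = c(2, 2, 2), stringsAsFactors = FALSE)

# Numeric matrices with the key as row names
base <- as.matrix(df1[, -1]); rownames(base) <- df1$K
upd  <- as.matrix(df2[, -1]); rownames(upd)  <- df2$K

common   <- intersect(rownames(upd), rownames(base))  # rows to update
new_keys <- setdiff(rownames(upd), rownames(base))    # rows to insert

# (row, col) positions of the non-NA update cells among the shared keys
upd_common <- upd[common, , drop = FALSE]
hits <- which(!is.na(upd_common), arr.ind = TRUE)

# Translate those positions into positions in `base`, then assign all
# cells at once through a 2-column index matrix
target <- cbind(match(common[hits[, "row"]], rownames(base)),
                match(colnames(upd)[hits[, "col"]], colnames(base)))
base[target] <- upd_common[hits]

# Append the rows whose keys exist only in the update
result <- rbind(base, upd[new_keys, , drop = FALSE])
result
```

The update itself is a single vectorized assignment, so no loop over rows or cells is needed; whether this fits in memory for upwards of 10,000 columns would have to be checked on the real data.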
The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table
dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=T), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")
dt1s[dt2s,value:=value.new] # Update new values
dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names

Resources