I am trying to add multiple dataframes together, but not by binding them.
Is there an easy way to overlay dataframes on top of each other and add them, as shown in the picture (not reproduced here)?
The number of columns will always be the same; the row counts will differ.
I want to sum the cells by position, so Result[1,1] = Table1[1,1] + Table2[1,1] and so on. Cells present in only one table should carry over unchanged, and the result should have the dimensions of the largest table.
The tables are generated dynamically, so I'd like to avoid any hardcoding.
Consider the following two data frames:
library(dplyr)   # provides the %>% pipe used below
table1 <- replicate(4, round(runif(10, 0, 1), 2)) %>% as.data.frame %>% setNames(LETTERS[1:4])
table2 <- replicate(4, round(runif(6, 0, 1), 2)) %>% as.data.frame %>% setNames(LETTERS[1:4])
table1
A B C D
1 0.81 0.08 0.85 0.89
2 0.88 0.82 0.62 0.77
3 0.12 0.13 0.99 0.02
4 0.17 0.54 0.37 0.62
5 0.77 0.10 0.81 0.34
6 0.58 0.15 0.00 0.56
7 0.61 0.15 0.59 0.15
8 0.52 0.36 0.12 0.99
9 0.83 0.93 0.29 0.30
10 0.52 0.02 0.48 0.46
table2
A B C D
1 0.95 0.81 0.99 0.92
2 0.18 0.99 0.35 0.09
3 0.73 0.10 0.02 0.68
4 0.37 0.53 0.78 0.02
5 0.48 0.54 0.79 0.83
6 0.75 0.32 0.41 0.04
We might create a new variable called ID from their row numbers and use that to sum the values after binding the rows:
library(dplyr)
library(tibble)
bind_rows(table1 %>% rowid_to_column("ID"),
          table2 %>% rowid_to_column("ID")) %>%
  group_by(ID) %>%
  summarise(across(everything(), sum))
# A tibble: 10 x 5
ID A B C D
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1.76 0.89 1.84 1.81
2 2 1.06 1.81 0.97 0.86
3 3 0.85 0.23 1.01 0.7
4 4 0.54 1.07 1.15 0.64
5 5 1.25 0.64 1.6 1.17
6 6 1.33 0.47 0.41 0.6
7 7 0.61 0.15 0.59 0.15
8 8 0.52 0.36 0.12 0.99
9 9 0.83 0.93 0.290 0.3
10 10 0.52 0.02 0.48 0.46
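Since the tables are generated dynamically, note that bind_rows also accepts a list, so the same idea extends to any number of frames. A minimal sketch, assuming the frames have been collected into a list (the name tbls is just a placeholder):
# tbls is a hypothetical list holding however many tables were generated
tbls <- list(table1, table2)
bind_rows(lapply(tbls, rowid_to_column, var = "ID")) %>%
  group_by(ID) %>%
  summarise(across(everything(), sum))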
A potentially more dangerous base R approach, since it overwrites table1 in place and assumes table1 is the larger frame, is to subset table1 to the dimensions of table2 and add the two together:
table1[seq(1, nrow(table2)), seq(1, ncol(table2))] <-
  table1[seq(1, nrow(table2)), seq(1, ncol(table2))] + table2
table1
A B C D
1 1.76 0.89 1.84 1.81
2 1.06 1.81 0.97 0.86
3 0.85 0.23 1.01 0.70
4 0.54 1.07 1.15 0.64
5 1.25 0.64 1.60 1.17
6 1.33 0.47 0.41 0.60
7 0.61 0.15 0.59 0.15
8 0.52 0.36 0.12 0.99
9 0.83 0.93 0.29 0.30
10 0.52 0.02 0.48 0.46
# Create your data frames
df1 <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), c = c(3, 4, 5))
df2 <- data.frame(a = c(1, 2), b = c(2, 3), c = c(3, 4))
# Create a new data frame from the bigger of the two
if (nrow(df1) > nrow(df2)) {
  df3 <- df1
} else {
  df3 <- df2
}
# For each line in the smaller data frame add it to the larger
for (number in 1:min(nrow(df1), nrow(df2))) {
  df3[number, ] <- df1[number, ] + df2[number, ]
}
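For completeness, here is a hedged base R sketch that handles any number of frames by padding the shorter ones with zero rows before summing; the dfs list name and the zero-padding rule are assumptions, not part of the original answer:
dfs <- list(df1, df2)                 # could hold any number of frames
max_n <- max(sapply(dfs, nrow))
pad <- function(df) {
  extra <- max_n - nrow(df)
  if (extra > 0) {
    filler <- as.data.frame(matrix(0, extra, ncol(df),
                                   dimnames = list(NULL, names(df))))
    df <- rbind(df, filler)
  }
  df
}
Reduce(`+`, lapply(dfs, pad))         # element-wise sum of the padded frames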
I have the following dataframe (DF_A):
PARTY_ID PROBS_3001 PROBS_3002 PROBS_3003 PROBS_3004 PROBS_3005 PROBS_3006 PROBS_3007 PROBS_3008
1: 1000000 0.03 0.58 0.01 0.42 0.69 0.98 0.55 0.96
2: 1000001 0.80 0.37 0.10 0.95 0.77 0.69 0.23 0.07
3: 1000002 0.25 0.73 0.79 0.83 0.24 0.82 0.81 0.01
4: 1000003 0.10 0.96 0.53 0.59 0.96 0.10 0.98 0.76
5: 1000004 0.36 0.87 0.76 0.03 0.95 0.40 0.53 0.89
6: 1000005 0.15 0.78 0.24 0.21 0.03 0.87 0.67 0.64
And I have this other dataframe (DF_B):
V1 V2 V3 V4 PARTY_ID
1 0.58 0.69 0.96 0.98 1000000
2 0.69 0.77 0.80 0.95 1000001
3 0.79 0.81 0.82 0.83 1000002
4 0.76 0.96 0.96 0.98 1000003
5 0.76 0.87 0.89 0.95 1000004
6 0.64 0.67 0.78 0.87 1000005
For each row, I need to find the positions of the DF_B values within the corresponding row of DF_A, to get something like this:
PARTY_ID P1 P2 P3 P4
1 1000000 3 6 9 7
...
Currently I'm working with the match function, but it takes a lot of time (I have 400K rows). I'm doing this:
i <- 1
positions <- vector("list", nrow(DF_A))   # store each row's result
while (i <= nrow(DF_A)) {                 # <= so the last row is included
  positions[[i]] <- match(DF_B[i, ], DF_A[i, ])
  i <- i + 1
}
Although it works, it's very slow and I know it's not the best answer to my problem. Can anyone help me, please?
You can merge the two tables and then use Map in a by-group operation with data.table:
library(data.table)
setDT(df_a)   # both inputs need to be data.tables for the join below
df_a2 <- df_a[setDT(df_b), on = "PARTY_ID"]
df_a3 <- df_a2[, c(PARTY_ID,
Map(f = function(x,y) which(x==y),
x = list(.SD[,names(df_a), with = FALSE]),
y = .SD[, paste0("V",1:4), with = FALSE])), by = 1:nrow(df_a2)]
setnames(df_a3, paste0("V",1:5), c("PARTY_ID", paste0("P", 1:4)))[,nrow:=NULL]
df_a3
# PARTY_ID P1 P2 P3 P4
#1: 1000000 3 6 9 7
#2: 1000001 7 6 2 5
#3: 1000002 4 8 7 5
#4: 1000003 9 3 3 8
#5: 1000003 9 6 6 8
#6: 1000004 4 3 9 6
#7: 1000005 9 8 3 7
Here is an example on 1 million rows with two columns; it takes 14 ms on my computer.
library(data.table)
# create data tables with matching ids but on different positions
x <- as.data.table(data.frame(id = sample(1:1000000, 1000000, replace = FALSE),
                              y = sample(LETTERS, 1000000, replace = TRUE)))
y <- as.data.table(data.frame(id = sample(1:1000000, 1000000, replace = FALSE),
                              z = sample(LETTERS, 1000000, replace = TRUE)))
# add column to both data tables which will store the position in x and y
x$x_row_nr <- 1:nrow(x)
y$y_row_nr <- 1:nrow(y)
# set key in both data frames using matching columns name
setkey(x, "id")
setkey(y, "id")
# merge data tables into one
z <- merge(x,y)
# now you just use this to extract the position in the y data table
# of the 100th record of the x data table
z[x_row_nr==100, y_row_nr]
z will contain the matching row records from both datasets with their columns attached.
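For the original DF_A/DF_B problem, the same row-by-row match idea can be used while actually storing the result. A rough base R sketch (it assumes exact floating-point equality and counts positions within the full DF_A row, PARTY_ID included, to reproduce the desired output):
a_mat <- as.matrix(DF_A)                        # PARTY_ID stays as column 1
b_df  <- as.data.frame(DF_B)                    # plain data.frame indexing below
vals  <- as.matrix(b_df[match(DF_A$PARTY_ID, b_df$PARTY_ID), paste0("V", 1:4)])
pos   <- t(sapply(seq_len(nrow(a_mat)), function(i) match(vals[i, ], a_mat[i, ])))
result <- data.frame(PARTY_ID = DF_A$PARTY_ID,
                     setNames(as.data.frame(pos), paste0("P", 1:4)))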
I have an input file with the format below:
RN KEY MET1 MET2 MET3 MET4
1 1 0.11 0.41 0.91 0.17
2 1 0.94 0.02 0.17 0.84
3 1 0.56 0.64 0.46 0.7
4 1 0.57 0.23 0.81 0.09
5 2 0.82 0.67 0.39 0.63
6 2 0.99 0.90 0.34 0.84
7 2 0.83 0.01 0.70 0.29
I have to execute k-means in R separately for the DF with KEY=1, the DF with KEY=2, and so on.
Afterwards the final output CSV should look like:
RN KEY MET1 MET2 MET3 MET4 CLST
1 1 0.11 0.41 0.91 0.17 1
2 1 0.94 0.02 0.17 0.84 1
3 1 0.56 0.64 0.46 0.70 2
4 1 0.57 0.23 0.81 0.09 2
5 2 0.82 0.67 0.39 0.63 1
6 2 0.99 0.90 0.34 0.84 2
7 2 0.83 0.01 0.70 0.29 2
I.e., the rows with KEY=1 are to be treated as one DF, the rows with KEY=2 as another DF, and so on.
Finally, the clustering output of each DF is to be combined with the KEY column first (since KEY cannot participate in the clustering) and then with the other DFs to form the final output.
In the above example :
DF1 is
KEY MET1 MET2 MET3 MET4
1 0.11 0.41 0.91 0.17
1 0.94 0.02 0.17 0.84
1 0.56 0.64 0.46 0.70
1 0.57 0.23 0.81 0.09
DF2 is
KEY MET1 MET2 MET3 MET4
2 0.82 0.67 0.39 0.63
2 0.99 0.90 0.34 0.84
2 0.83 0.01 0.70 0.29
Please suggest how to achieve this in R.
Pseudo code:
keys <- unique(mydf$KEY)
finaldf <- NULL
for (k in keys) {
  # fetch the partial df for this value of KEY and run k-means
  dummydf <- subset(mydf, KEY == k)
  KmeansIns <- kmeans(dummydf[, paste0("MET", 1:4)], centers = 2)
  # combine with the cluster result
  dummydf <- data.frame(dummydf, CLST = KmeansIns$cluster)
  # stack each small df into the final global DF
  finaldf <- rbind(finaldf, dummydf)
}
# Now we have finaldf; it can be written to file
I think the easiest way would be to use by. Something along the lines of
by(data = DF, INDICES = DF$KEY, FUN = function(x) {
# your clustering code here
})
where x is a subset of your DF for each KEY.
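One possible way to flesh that out for this data, assuming the MET columns and 2 centers as in the question, and stacking the pieces back together with do.call(rbind, ...):
res <- by(data = DF, INDICES = DF$KEY, FUN = function(x) {
  x$CLST <- kmeans(x[, paste0("MET", 1:4)], centers = 2)$cluster
  x
})
finaldf <- do.call(rbind, res)   # stack the per-KEY pieces into one data frame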
A solution using data.table:
library(data.table)
setDT(DF)[, CLST := kmeans(.SD, centers = 2)$cluster, by = KEY, .SDcols = 3:6]
DF
# RN KEY MET1 MET2 MET3 MET4 CLST
# 1: 1 1 0.11 0.41 0.91 0.17 2
# 2: 2 1 0.94 0.02 0.17 0.84 1
# 3: 3 1 0.56 0.64 0.46 0.70 1
# 4: 4 1 0.57 0.23 0.81 0.09 2
# 5: 5 2 0.82 0.67 0.39 0.63 2
# 6: 6 2 0.99 0.90 0.34 0.84 2
# 7: 7 2 0.83 0.01 0.70 0.29 1
#Read data
mdf <- read.table("mydat.txt", header=T)
#Convert to list based on KEY column
mls <- split(mdf, f=mdf$KEY)
#Define columns to use in clustering
myv <- paste0("MET", 1:4)
#Cluster each df item in list : modify kmeans() args as appropriate
kls <- lapply(X = mls, FUN = function(x) {
  x$clust <- kmeans(x[, myv], centers = 2)$cluster
  return(x)
})
#Make final "global" df
finaldf <- do.call(rbind, kls)
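If the combined result should end up in a CSV, as the question mentions, a final write step (the file name is just a placeholder) completes it:
write.csv(finaldf, "final_output.csv", row.names = FALSE)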
X<-scan()
1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 0 0 1 1 1
1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
Z<-scan()
-0.05 0.11 -0.01 1.08 0.68 -1.79 -0.12 -0.06 0.17 -1.35 1.55 0.60
-1.42 -1.21 0.97 0.23 0.20 0.89 0.28 0.56 1.02 -0.32 0.20 -1.35
0.53 -0.52 -0.07 -1.07 0.10 0.53 0.97 0.32 -0.07 0.98 -1.23 0.72
-0.09 0.31 1.25 0.60 1.16 -0.98 1.63 0.72 0.24 -0.02 -1.13 0.56
0.78 1.75 -0.01 -0.44 0.47 -0.21 2.06 2.19 -0.94 -0.36 1.35 -1.35
1.50 0.13 -0.20 -0.57 -0.14 -1.34 -1.17 2.04 0.21 1.47 -1.20 -0.60
0.15 -0.64 -0.71 0.24 -0.86 -1.39 -0.63 -1.25 0.40 -0.76 0.73 -0.15
0.09 0.35 -0.19 0.29 0.56 0.82 -0.28 0.63 1.35 -0.04 1.99 1.12
-1.91 0.26 -1.18 -0.10
In the vector X, 0 marks the control group and 1 the case group.
I want to match these cases and controls based on the Z vector. Actually, I want to match elements of X based on Z and get the samples from the matched data.
What should I do?
The other answers seem to think that you're looking for subsetting, but I'm assuming (based on your use of the language "case" and "controls") that you're talking about matching in a statistical sense. If so, it sounds like you want something like the functionality provided by the Matching package, like the following:
library(Matching)
out <- Match(Tr=X,X=Z)
out$mdata # list of `Y` outcome vector (if applicable),
# `Tr` treatment vector, and
# `X` matrix of covariates for the matched sample
If you also have an outcome measure, you can specify that in Match and it will give you treatment effect estimates.
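For illustration only, with a purely hypothetical outcome vector Y the call might look like this, and summary() reports the estimate for the matched sample:
Y <- rnorm(length(X))              # hypothetical outcome, not part of the question
out2 <- Match(Y = Y, Tr = X, X = Z)
summary(out2)                      # estimated treatment effect on the matched data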
There are also other packages to do matching, like MatchIt, cem, and nonrandom (the last of which has apparently been removed from CRAN), depending on what particular matching procedure you're going for.
I suppose you are looking for
Z[as.logical(X)] # case
and
Z[!X] # control
I suppose your question is about subsetting; here are some examples:
# Data
X<-c(1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,0,0,1,1,0,0,1,1,1,1,1,1,0,1,1,1,1,1,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1)
Z<-c(-0.05,0.11,-0.01,1.08,0.68,-1.79,-0.12,-0.06,0.17,-1.35,1.55,0.60,-1.42,-1.21,0.97,0.23,0.20,0.89,0.28,0.56,1.02,-0.32,0.20,-1.35,0.53,-0.52,-0.07,-1.07,0.10,0.53,0.97,0.32,-0.07,0.98,-1.23,0.72,-0.09,0.31,1.25,0.60,1.16,-0.98,1.63,0.72,0.24,-0.02,-1.13,0.56,0.78,1.75,-0.01,-0.44,0.47,-0.21,2.06,2.19,-0.94,-0.36,1.35,-1.35,1.50,0.13,-0.20,-0.57,-0.14,-1.34,-1.17,2.04,0.21,1.47,-1.20,-0.60,0.15,-0.64,-0.71,0.24,-0.86,-1.39,-0.63,-1.25,0.40,-0.76,0.73,-0.15,0.09,0.35,-0.19,0.29,0.56,0.82,-0.28,0.63,1.35,-0.04,1.99,1.12,-1.91,0.26,-1.18,-0.10)
myMatrix <- cbind(X,Z)
# Subsetting
myMatrixControls <- myMatrix[ myMatrix[,1]==0,]
myMatrixCases <- myMatrix[ myMatrix[,1]==1,]
# Example: get sum per group
sumZ_Controls <- sum(myMatrix[ myMatrix[,1]==0, 2])
sumZ_Cases <- sum(myMatrix[ myMatrix[,1]==1, 2])