How to mix dataframes in R [duplicate] - r

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 7 years ago.
I have two huge data frames, X and Y (about 13 million rows and 11 columns each), and I need to merge them in a specific way.
The X dataframe example is
A 1 2 3
B 3 2 4
C 1 6 8
The Y dataframe is
A 9 1 8
B 3 1 7
D 2 9 4
I have to mix them with the following logic:
If the first element of a row in Y is present in X, I have to append the Y data to the matching X row.
If the first element of a row in Y is not present in X, I have to append zeroes first and then the Y data.
For every X row not present in Y, I have to append zeroes.
The mix result should be like this:
A 1 2 3 9 1 8 (A found in Y, so its Y data is appended)
B 3 2 4 3 1 7 (B found in Y, so its Y data is appended)
C 1 6 8 0 0 0 (C not found in Y, so zeroes are appended)
D 0 0 0 2 9 4 (D not found in X, so zeroes are added first, then D's Y data)
I tried to go row by row, but it takes ages. I need a solution that takes one or two instructions.
Thanks

Without a reproducible example I can't test this, but I think you want:
library(dplyr)
# "FirstColumn" stands for the name of the key column shared by x and y
z <- full_join(x, y, by = "FirstColumn")
z[is.na(z)] <- 0
This assumes there are no NAs in the original data.
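The same full join can also be sketched in base R with merge(), which avoids the dplyr dependency. The column names below (key, x1..x3, y1..y3) are made up for illustration, since the question does not give any, and the data are the small X and Y examples from the question rather than the real 13-million-row data.

```r
# Toy versions of X and Y from the question; column names are assumptions.
X <- data.frame(key = c("A", "B", "C"),
                x1 = c(1, 3, 1), x2 = c(2, 2, 6), x3 = c(3, 4, 8))
Y <- data.frame(key = c("A", "B", "D"),
                y1 = c(9, 3, 2), y2 = c(1, 1, 9), y3 = c(8, 7, 4))

# all = TRUE keeps keys found in either data frame (a full outer join).
Z <- merge(X, Y, by = "key", all = TRUE)

# Cells with no match come back as NA; overwrite them with 0.
Z[is.na(Z)] <- 0
Z
```

Like the dplyr version, this only works as intended if the original data contain no NAs of their own. For data of the size described in the question, a keyed join in a package such as data.table may be faster, but that is not tested here.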

Related

R merge two dataframes by rows with some similar but also different columns [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 months ago.
aa <- data.frame(a= c(1,1,2,3,4), b = c(5,3,2,6,1))
bb <- data.frame(a= c(1,1,6,7,4), c = c(4,3,7,2,1))
I want to merge aa and bb by rows. However, when I use rbind(), it gives me an error that "the numbers of columns of arguments do not match". The final format I want keeps all the columns present in either data frame and fills them with zeros where a column does not exist in the other data frame's part. A sample output for the reproducible data would be as follows:
Thank you for your time!
You can use bind_rows from dplyr. It is similar to rbind, but any missing columns are filled with NA instead of throwing an error.
cc <- dplyr::bind_rows(aa, bb)
cc <- replace(cc, is.na(cc), 0)
a b c
1 1 5 0
2 1 3 0
3 2 2 0
4 3 6 0
5 4 1 0
6 1 0 4
7 1 0 3
8 6 0 7
9 7 0 2
10 4 0 1
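If dplyr is not available, roughly the same result can be sketched in base R by padding each data frame with the columns it lacks before calling rbind(). The helper name pad_columns is hypothetical and not part of any package.

```r
aa <- data.frame(a = c(1, 1, 2, 3, 4), b = c(5, 3, 2, 6, 1))
bb <- data.frame(a = c(1, 1, 6, 7, 4), c = c(4, 3, 7, 2, 1))

# Every column name that appears in either data frame.
all_cols <- union(names(aa), names(bb))

# Hypothetical helper: add each column in `cols` that `df` lacks,
# filled with `fill`, and return the columns in a fixed order.
pad_columns <- function(df, cols, fill = 0) {
  for (col in setdiff(cols, names(df))) {
    df[[col]] <- fill
  }
  df[, cols, drop = FALSE]
}

# Both pieces now have identical columns, so rbind() no longer errors.
cc <- rbind(pad_columns(aa, all_cols), pad_columns(bb, all_cols))
cc
```

Filling with 0 directly skips the intermediate NA step, so existing NAs in the data are left untouched.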

Repeating rows in data frame by using the content of a column in R [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
I want to create a data frame by repeating rows by using content of a column in a data frame. Below is the source data frame.
data.frame(c("a","b","c"), c(4,5,6), c(2,2,3)) -> df
colnames(df) <- c("sample", "measurement", "repeat")
df
sample measurement repeat
1 a 4 2
2 b 5 2
3 c 6 3
I want to repeat the rows by using the "repeat" column and its content to get a data frame like the one below. Ideally, I would like to have a function to this.
sample measurement repeat
1 a 4 2
2 a 4 2
3 b 5 2
4 b 5 2
5 c 6 3
6 c 6 3
7 c 6 3
Thanks in advance!
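Solved. df[rep(rownames(df), df[["repeat"]]), ] did the job. Note that repeat is a reserved word in R, so the column has to be accessed as df[["repeat"]] (or with backticks) rather than as df$repeat.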
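Since the question asks for a function, the one-liner can be wrapped as follows. This is a minimal sketch: the name repeat_rows and its arguments are made up here, and it indexes by row position rather than by row name so that the result is renumbered 1..n.

```r
# Hypothetical helper: repeat each row of `df` as many times as given
# in the column named by `times_col`.
repeat_rows <- function(df, times_col) {
  times <- df[[times_col]]
  out <- df[rep(seq_len(nrow(df)), times), , drop = FALSE]
  rownames(out) <- NULL  # renumber rows 1..n
  out
}

df <- data.frame(c("a", "b", "c"), c(4, 5, 6), c(2, 2, 3))
colnames(df) <- c("sample", "measurement", "repeat")

expanded <- repeat_rows(df, "repeat")
expanded
```

Passing the column name as a string sidesteps the reserved-word problem with a column called repeat.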

Delete Duplicates when Merging DF [duplicate]

This question already has answers here:
Select only the first row when merging data frames with multiple matches
(4 answers)
Closed 5 years ago.
I know, I know... another question about merging data frames. Please hear me out: I have searched SO for an answer to this but found none.
I am merging two data frames, one smaller than the other, with a left merge that matches the longer data frame to the smaller one.
This works well except for one issue: rows get added to the left (smaller) data frame when the right (longer) one has duplicates.
An Example:
Row<-c("a","b","c","d","e")
Data<-(1:5)
df1<-data.frame(Row,Data)
Row2<-c("a","b","b","c","d","e","f","g","h")
Data2<-(1:9)
df2<-data.frame(Row2,Data2)
names(df2)<-c("Row","Data2")
DATA<-merge(x = df1, y = df2, by = "Row", all.x = TRUE)
> DATA
Row Data Data2
1 a 1 1
2 b 2 2
3 b 2 3
4 c 3 4
5 d 4 5
6 e 5 6
See the extra "b" row? That is what I want to get rid of. I want to keep the left data frame strictly as it is: if there are 5 rows in df1, the merged result should also have exactly 5 rows.
Like this...
Row Data Data2
1 a 1 1
2 b 2 2
3 c 3 4
4 d 4 5
5 e 5 6
Where it only takes the first match and moves on.
I realize the merge function is only doing its job here, so is there another way to get my expected result? Or is there a post-merge modification that should be done instead?
Thank you for your help and time.
Research:
How to join (merge) data frames (inner, outer, left, right)?
deleting duplicates
Merging two data frames with different sizes and missing values
We can use the duplicated function as follows:
DATA[!duplicated(DATA$Row),]
Row Data Data2
1 a 1 1
2 b 2 2
4 c 3 4
5 d 4 5
6 e 5 6
It's also possible to drop the duplicates from df2 before merging:
merge(x = df1, y = df2[!duplicated(df2$Row), ], by = "Row", all.x = TRUE)
# Row Data Data2
#1 a 1 1
#2 b 2 2
#3 c 3 4
#4 d 4 5
#5 e 5 6
Since you only want the first row and don't care what variables are chosen, then you can use this code (before you merge):
Row2<-c("a","b","b","c","d","e","f","g","h")
Data2<-(1:9)
df2<-data.frame(Row2,Data2)
library(dplyr)
df2 %>%
  group_by(Row2) %>%
  slice(1)
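Another option that avoids merge() altogether is a lookup with match(), which always returns the position of the first matching value. The sketch below rebuilds df1 and df2 from the question and leaves df1's row count unchanged by construction; the names first_hit and result are made up for illustration.

```r
df1 <- data.frame(Row = c("a", "b", "c", "d", "e"), Data = 1:5)
df2 <- data.frame(Row = c("a", "b", "b", "c", "d", "e", "f", "g", "h"),
                  Data2 = 1:9)

# For each df1$Row, the position of its FIRST occurrence in df2$Row
# (NA if the key is absent from df2).
first_hit <- match(df1$Row, df2$Row)

# Copy df1 and pull the matching Data2 values across.
result <- df1
result$Data2 <- df2$Data2[first_hit]
result
```

Keys in df1 that have no match in df2 get NA in Data2, which mirrors what all.x = TRUE does in merge().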

Transform multiple rows of a data frame into one row with multiple columns with R [duplicate]

This question already has answers here:
Reshape three column data frame to matrix ("long" to "wide" format) [duplicate]
(6 answers)
Closed 5 years ago.
I have a data frame with three columns:
df=data.frame( UserId=c(1,2,2,2,3,3), CatoId=c('C','A','B','C','D','E'), No=c(1,9,2,2,5,3))
UserId CatoId No
1 C 1
2 A 9
2 B 2
2 C 2
3 D 5
3 E 3
I would like to transform the structure into the following one :
UserId A B C D E
1 0 0 1 0 0
2 9 2 2 0 0
3 0 0 0 5 3
Here the columns represent all possible values of CatoId.
The first data frame has 2 million rows and CatoId has 21 different values, so I don't want to use any loops. Is there a way to do this with R? Otherwise, what is the best way to proceed?
My goal would be to apply a clustering algorithm on the last dataframe.
You can do this using dcast from the reshape2 package:
library(reshape2)
df1 <- dcast(df, UserId ~ CatoId, value.var = "No", fill = 0)
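For comparison, a base R sketch of the same long-to-wide reshape uses xtabs(), which builds a contingency table and fills absent combinations with 0. The names wide and wide_df are made up here. Note that xtabs() sums No when a UserId/CatoId pair occurs more than once, which may or may not be what is wanted.

```r
df <- data.frame(UserId = c(1, 2, 2, 2, 3, 3),
                 CatoId = c("C", "A", "B", "C", "D", "E"),
                 No = c(1, 9, 2, 2, 5, 3))

# Cross-tabulate: rows are UserId, columns are CatoId, cells are sums of No.
wide <- xtabs(No ~ UserId + CatoId, data = df)

# Convert the table to an ordinary data frame; UserId ends up in the row names.
wide_df <- as.data.frame.matrix(wide)
wide_df
```

Because UserId is stored in the row names, it would need to be copied back into a column before, for example, merging with other data.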

How do I subset a data frame in R using specific row indicies? [duplicate]

This question already has an answer here:
Subset of table in R using row numbers?
(1 answer)
Closed 9 years ago.
I have a large data frame which I would like to break down into smaller data frames. I know which rows I would like to split on (i.e. I want to separate rows 1 - 33, 34 - 60, ....). I know I have to use subset(), but I can't seem to find the specific parameters.
If you mean from the 1st to the 33rd row, just do this
df[1:33,]
as an example:
> df<-data.frame(A=LETTERS[1:10], B=c(1:10))
> df
A B
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
6 F 6
7 G 7
8 H 8
9 I 9
10 J 10
> df[1:3,]
A B
1 A 1
2 B 2
3 C 3
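To get all the pieces at once rather than one range at a time, one possible sketch uses findInterval() to label each row with its chunk and split() to break the data frame apart. The break points in starts are assumptions chosen to fit the 10-row example; for the question's ranges they would be c(1, 34, 61, ...).

```r
df <- data.frame(A = LETTERS[1:10], B = 1:10)

# Assumed first row of each chunk: rows 1-3, 4-7 and 8-10.
starts <- c(1, 4, 8)

# Map every row number to the index of the chunk it falls into.
chunk_id <- findInterval(seq_len(nrow(df)), starts)

# A named list of data frames, one per chunk.
chunks <- split(df, chunk_id)
chunks[["2"]]
```

Each element of chunks is an ordinary data frame, so chunks[["1"]] is equivalent to df[1:3, ].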
