Consolidate rows by group value [duplicate] - r

This question already has answers here:
Pivoting a large data set
(2 answers)
Closed 7 years ago.
I am trying to simplify a large, redundant dataset and would like help rearranging cells so that each row corresponds to one "group" value from column 1, with added columns holding each unique old row value that belongs to that group. See below.
What I have:
col1 col2
1 a
1 b
1 c
1 d
2 a
2 c
2 d
2 e
3 a
3 b
3 d
3 e
What I want:
col1 col2 col3 col4 col5 col6
1 a b c d N/A
2 a N/A c d e
3 a b N/A d e
I hope this isn't too vague, but I will update this question as soon as I get notification of replies.
Thanks in advance!

We could use dcast from library(reshape2) to convert from 'long' to 'wide' format. By default it takes the last column as value.var (here 'col2'); if there are more columns, we can specify value.var explicitly.
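For a reproducible run, df1 can be built from the example data above (an assumption about the original object, which was not posted):
df1 <- data.frame(col1 = rep(1:3, each = 4),
                  col2 = c('a','b','c','d', 'a','c','d','e', 'a','b','d','e'))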
library(reshape2)
# relabel the five levels of col2 (a-e) so the spread columns are named col2..col6
dcast(df1, col1 ~ factor(col2, labels = paste0('col', 2:6)))
# col1 col2 col3 col4 col5 col6
#1 1 a b c d <NA>
#2 2 a <NA> c d e
#3 3 a b <NA> d e

Here is another way that uses reshape() from the stats package:
x <- data.frame(col1 = c(1,1,1,1,2,2,2,2,3,3,3,3),
                col2 = c('a','b','c','d','a','c','d','e','a','b','d','e'))
# spread col2 to wide format, using its own values as the "time" variable
x <- reshape(x, v.names = "col2", idvar = "col1", timevar = "col2", direction = "wide")
names(x) <- c('col1', 'col2', 'col3', 'col4', 'col5', 'col6')
Output:
col1 col2 col3 col4 col5 col6
1 1 a b c d <NA>
5 2 a <NA> c d e
9 3 a b <NA> d e

Related

Create a column whose outcome is the random match of two other columns in R

I have a very long data.frame that looks like this
col1 <- c("a","b","c","d")
col2 <- c("E","F","G","H")
df <- data.frame(col1,col2)
col1 col2
1 a E
2 b F
3 c G
4 d H
I want to create a column, "col3", whose values come from a random combination
of the elements in col1 and col2.
For example, I want something like this:
col1 col2 col3
1 a E a-G
2 b F b-H
3 c G c-E
4 d H d-F
Any help or guidance is highly appreciated.
With data.table:
library(data.table)
setDT(df)
df[, col3 := paste(col1, '-', sample(col2))][]
col1 col2 col3
1: a E a - F
2: b F b - E
3: c G c - H
4: d H d - G
or base R:
df$col3 <- paste(df$col1,'-',sample(df$col2))
One dplyr option could be:
df %>%
mutate(col3 = paste(sample(col1), sample(col2), sep = "-"))
col1 col2 col3
1 a E c-E
2 b F b-H
3 c G a-F
4 d H d-G
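Note that sample(col1) shuffles col1 as well. If, as in the example output, each row should keep its own col1 value and only draw a random col2, shuffling just col2 is enough (the same idea as the data.table answer, only with sep = "-"):
df %>%
  mutate(col3 = paste(col1, sample(col2), sep = "-"))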

Finding duplicates in a dataframe and returning count of each duplicate record

I have a dataframe like
col1 col2 col3
A B C
A B C
A B B
A B B
A B C
B C A
I want to get an output in the below format:
col1 col2 col3 Count
A B C 3 Duplicates
A B B 2 Duplicates
I don't want to name any specific column in the function that finds the duplicates;
that is the reason for not using add_count from dplyr.
Using duplicated gives:
col1 col2 col3 count
2 A B C 3
3 A B B 2
5 A B C 3
So not the desired output.
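For reference, a reproducible version of the example data (my reconstruction; the original object was not posted):
df <- data.frame(col1 = c('A','A','A','A','A','B'),
                 col2 = c('B','B','B','B','B','C'),
                 col3 = c('C','C','B','B','C','A'))
df1 <- data.frame(df)  # copy for the data.table and base R answers below, which call it df1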
We can use group_by_all to group by all columns, count each group, and keep only the duplicated rows by filtering for groups with count > 1.
library(dplyr)
df %>%
group_by_all() %>%
count() %>%
filter(n > 1)
# col1 col2 col3 n
# <fct> <fct> <fct> <int>
#1 A B B 2
#2 A B C 3
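On dplyr 1.0 or later, where group_by_all is superseded, a possible equivalent (my suggestion, not part of the original answer) is:
df %>%
  count(across(everything())) %>%
  filter(n > 1)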
We can use data.table
library(data.table)
setDT(df1)[, .(n =.N), names(df1)][n > 1]
# col1 col2 col3 n
#1: A B C 3
#2: A B B 2
Or with base R
subset(aggregate(n ~ ., transform(df1, n = 1), FUN = sum), n > 1)
# col1 col2 col3 n
#2 A B B 2
#3 A B C 3

R find unduplicated rows based on other data's columns [duplicate]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I have a data table in R, called A, which has three columns Col1, Col2, and Col3. Another table, called B, has the same three columns. I want to remove all the rows in table A for which the pairs (Col1, Col2) are present in table B. I have tried, but I am not sure how to do this; I have been stuck on it for the last few days.
Thanks,
library(data.table)
A = data.table(Col1 = 1:4, Col2 = 4:1, Col3 = letters[1:4])
# Col1 Col2 Col3
#1: 1 4 a
#2: 2 3 b
#3: 3 2 c
#4: 4 1 d
B = data.table(Col1 = c(1,3,5), Col2 = c(4,2,1))
# Col1 Col2
#1: 1 4
#2: 3 2
#3: 5 1
A[!B, on = c("Col1", "Col2")]
# Col1 Col2 Col3
#1: 2 3 b
#2: 4 1 d
We can use anti_join
library(dplyr)
anti_join(A, B, by = c('Col1', 'Col2'))
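On the data above this should return the same two rows as the data.table join:
# Col1 Col2 Col3
#1: 2 3 b
#2: 4 1 d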
Here's a go, using interaction:
A <- data.frame(Col1=1:3, Col2=2:4, Col3=10:12)
B <- data.frame(Col1=1:2, Col2=2:3, Col3=10:11)
A
# Col1 Col2 Col3
#1 1 2 10
#2 2 3 11
#3 3 4 12
B
# Col1 Col2 Col3
#1 1 2 10
#2 2 3 11
byv <- c("Col1","Col2")
A[!(interaction(A[byv]) %in% interaction(B[byv])),]
# Col1 Col2 Col3
#3 3 4 12
Or create a unique id for each row, and then exclude those that merged:
A[-merge(cbind(A[byv],id=seq_len(nrow(A))), B[byv], by=byv)$id,]
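One caveat with the negative-index variant (my note, not part of the original answer): if nothing in A matches B, merge() yields a zero-length id, and A[-integer(0), ] drops every row instead of keeping them all, so a small guard is safer:
# only use negative indexing when there is something to drop
drop_id <- merge(cbind(A[byv], id = seq_len(nrow(A))), B[byv], by = byv)$id
if (length(drop_id)) A[-drop_id, ] else A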

remove duplicate rows in R based on values in all columns

I have the following dataset
col1 col2 col3
a b 1
a b 2
a b 3
unique(dataset) returns
col1 col2 col3
a b 1
dataset[!duplicated(1:3),] returns
col1 col2 col3
a b 1
a b 2
a b 3
But the same thing fails to work in following
dataset2
col1 col2 col3
a b 1
a b 1
unique(dataset2) returns
col1 col2 col3
a b 1
dataset2[!duplicated(1:3),] returns
col1 col2 col3
a b 1
a b 1
NA NA NA
Use !duplicated on the data itself, not on 1:3:
dataset[!duplicated(dataset[c("col1", "col2", "col3")]), ]
duplicated(1:3) tests the vector 1:3 (which has no duplicates), so you are indexing with three TRUE values; the third one points past dataset2's two rows, which is where the extra NA row comes from.
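Since the goal is to deduplicate on every column anyway, passing the whole data frame also works:
dataset2[!duplicated(dataset2), ]
#   col1 col2 col3
# 1    a    b    1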

