Remove duplicates based on 2 columns of data [duplicate]


This question already has an answer here:
remove duplicate values based on 2 columns
(1 answer)
Closed 4 years ago.
I have a set of data:
x <- c(rep("A", 3), rep("B", 3), rep("C",2))
y <- c(1,1,2,4,1,1,2,2)
z <- c(rep("E", 1), rep("F", 4), rep("G",3))
df <-data.frame(x,y,z)
I want to remove a row only if its values in both column x and column z duplicate those of an earlier row.
In this case, after applying the code, rows 2 and 3 should be reduced to one row, rows 4 and 5 to one row, and rows 7 and 8 to one row.
How can I do this?

You can use a simple condition to subset your data:
x <- c(rep("A", 3), rep("B", 3), rep("C",2))
y <- c(1,1,2,4,1,1,2,2)
z <- c(rep("A", 1), rep("B", 4), rep("C",3))
df <-data.frame(x,y,z)
df
df[!df$x == df$z,] # the ! excludes all rows for which x == z is TRUE
x y z
2 A 1 B
3 A 2 B
6 B 1 C
Edit: As @RonakShah commented, to exclude duplicated rows, use
df[!duplicated(df[c("x", "z")]),]
or
df[!duplicated(df[c(1, 3)]),]
x y z
1 A 1 A
2 A 1 B
4 B 4 B
6 B 1 C
7 C 2 C
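
The answer above changes z to A/B/C. As a minimal sketch, the same duplicated() call can be applied to the data exactly as given in the question (z built from E/F/G); the variable name deduped is only illustrative:

```r
# The question's original data (z uses E/F/G)
x <- c(rep("A", 3), rep("B", 3), rep("C", 2))
y <- c(1, 1, 2, 4, 1, 1, 2, 2)
z <- c(rep("E", 1), rep("F", 4), rep("G", 3))
df <- data.frame(x, y, z)

# Keep only the first row of each (x, z) combination;
# column y is ignored when looking for duplicates
deduped <- df[!duplicated(df[c("x", "z")]), ]
deduped
```

This keeps rows 1, 2, 4, 6 and 7, which matches the reduction described in the question (rows 2-3, 4-5 and 7-8 each collapse to one row).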

Related

How to rbind two dataframes in R when one has more columns than the other [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
(14 answers)
Closed 2 years ago.
I want to merge three dataframes together, appending them as rows to the bottom of the previous, but I want their columns to match. Each dataframe has a different number of columns, but they share column names. EX:
Dataframe A
A B Y Z
# # # #
Dataframe B
A B C X Y Z
# # # # # #
Dataframe C
A B C D W X Y Z
# # # # # # # #
In the end, I want them to look like:
Dataframe_Final
A B C D W X Y Z
# # NA NA NA NA # #
# # # NA NA # # #
# # # # # # # #
How can I merge these dataframes together in this way? Again, there's no ID for the rows that is unique (ascending, etc) across the dataframes.
Thanks!

A base R option might be Reduce + merge
out <- Reduce(function(x,y) merge(x,y,all = TRUE),list(dfA,dfB,dfC))
out <- out[order(names(out))]
which gives
A B C D W X Y Z
1 1 2 NA NA NA NA 3 4
2 1 2 3 NA NA 4 5 6
3 1 2 3 4 5 6 7 8
Dummy Data
dfA <- data.frame(A = 1, B = 2, Y = 3, Z = 4)
dfB <- data.frame(A = 1, B = 2, C = 3, X = 4, Y = 5, Z = 6)
dfC <- data.frame(A = 1, B = 2, C = 3, D = 4, W = 5, X = 6, Y = 7, Z = 8)
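
Note that merge() performs a join, so rows that happen to agree on all shared columns would be combined rather than stacked. A minimal base R sketch that always appends rows is to pad each data frame with NA columns and then call rbind(). The helper name rbind_fill is hypothetical, not a base R function:

```r
dfA <- data.frame(A = 1, B = 2, Y = 3, Z = 4)
dfB <- data.frame(A = 1, B = 2, C = 3, X = 4, Y = 5, Z = 6)
dfC <- data.frame(A = 1, B = 2, C = 3, D = 4, W = 5, X = 6, Y = 7, Z = 8)

# Hypothetical helper: add missing columns as NA, then stack the rows
rbind_fill <- function(...) {
  dfs <- list(...)
  # Union of all column names, sorted alphabetically
  all_cols <- sort(unique(unlist(lapply(dfs, names))))
  padded <- lapply(dfs, function(d) {
    for (nm in setdiff(all_cols, names(d))) {
      d[[nm]] <- NA  # fill columns this data frame does not have
    }
    d[all_cols]      # same column order for every data frame
  })
  do.call(rbind, padded)
}

out <- rbind_fill(dfA, dfB, dfC)
out
```

Each input contributes exactly its own rows, regardless of whether values coincide across data frames.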

Subtracting columns in a loop

I've got a data frame like this:
df:
A B C
1 1 2 3
2 2 2 4
3 2 2 3
I would like to subtract from each column the column to its left (A-0, B-A, C-B), so my results should look like this:
df:
A B C
1 1 1 1
2 2 0 2
3 2 0 1
I tried the following loop, but it didn't work.
for (i in 1:3) {
j <- data[,i+1] - data[,i]
}

Try this
df - cbind(0, df[-ncol(df)])
# A B C
# 1 1 1 1
# 2 2 0 2
# 3 2 0 1
Data
df <- data.frame(A = c(1, 2, 2), B = c(2, 2, 2), C = c(3, 4, 3))

We can also subtract the data frame without its last column from the data frame without its first column, and assign the result back to every column except the first:
df[-1] <- df[-1]-df[-length(df)]
data
df <- data.frame(A = c(1, 2, 2), B = c(2, 2, 2), C = c(3, 4, 3))
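
For comparison, a loop-based sketch of the same subtraction. It assumes the loop in the question failed because it referred to data rather than df, overwrote j on every iteration, and indexed one column past the end; the name result is illustrative:

```r
df <- data.frame(A = c(1, 2, 2), B = c(2, 2, 2), C = c(3, 4, 3))

# Work on a copy so every difference uses the original values
result <- df

# Start at the second column; the first column stays as it is (A - 0)
for (i in seq_len(ncol(df))[-1]) {
  result[[i]] <- df[[i]] - df[[i - 1]]
}
result
```

Reading from df while writing to result matters: updating df in place from left to right would subtract already-modified columns.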

Remove rows with zero-variance in R

I have a dataframe of survey responses (rows = participants, columns = question responses). Participants responded to 50 questions on a 5-point Likert scale. I would like to remove participants who answered 5 on all 50 questions, as their responses have zero variance and are likely to bias my results.
I have seen the nearZeroVar() function, but was wondering if there's a way to do this in base R?
Many thanks,
R

If you had this dataframe:
df <- data.frame(col1 = rep(1, 10),
col2 = 1:10,
col3 = rep(1:2, 5))
You could calculate the variance of each column and keep only the columns whose variance is not 0, or whose variance is at least some threshold, which is similar in spirit to what nearZeroVar() would do:
df[, sapply(df, var) != 0]
df[, sapply(df, var) >= 0.3]
If you wanted to exclude rows, you could do something similar, but loop through the rows instead and then subset:
df[apply(df, 1, var) != 0, ]
df[apply(df, 1, var) >= 0.3, ]
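
The row-wise filter computes the variance over every column, so an identifier column would need to be left out of the calculation. A minimal sketch, assuming a hypothetical id column and three question columns (the data and names here are made up for illustration):

```r
# Hypothetical survey: one id column plus three answers per participant
survey <- data.frame(id = 1:4,
                     q1 = c(5, 1, 3, 5),
                     q2 = c(5, 2, 3, 4),
                     q3 = c(5, 3, 3, 5))

# Compute the variance over the answer columns only
answers <- survey[setdiff(names(survey), "id")]
row_var <- apply(answers, 1, var)

# Keep participants whose answers are not all identical
cleaned <- survey[row_var != 0, ]
cleaned
```

Here participants 1 and 3 gave the same answer to every question and are dropped, while the id column is kept in the result.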

Assuming you have data like this.
survey <- data.frame(participants = c(1:10),
q1 = c(1,2,5,5,5,1,2,3,4,2),
q2 = c(1,2,5,5,5,1,2,3,4,3),
q3 = c(3,2,5,4,5,5,2,3,4,5))
You can do the following.
# A logical mask avoids dropping every row when no row matches
keep <- !apply(survey[,-1], 1, function(x) all(x == 5))
survey[keep,]
This will remove rows where all values equal 5.

# Dummy data:
df <- data.frame(
matrix(
sample(1:5, 100000, replace =TRUE),
ncol = 5
)
)
names(df) <- paste0("likert", 1:5)
df$id <- 1:nrow(df)
head(df)
#   likert1 likert2 likert3 likert4 likert5 id
# 1 1 2 4 4 5 1
# 2 5 4 2 2 1 2
# 3 2 1 2 1 5 3
# 4 5 1 3 3 2 4
# 5 4 3 3 5 1 5
# 6 1 3 3 2 3 6
dim(df)
# [1] 20000 6
# Clean out rows where all likert values are 5
df <- df[rowSums(df[grepl("likert", names(df))] == 5) != 5, ]
nrow(df)
# [1] 19995

Borrowing @AshOfFire's data, with a small modification since you say your columns contain only answers and no participant ID:
survey <- data.frame(q1 = c(1,2,5,5,5,1,2,3,4,2),
q2 = c(1,2,5,5,5,1,2,3,4,3),
q3 = c(3,2,5,4,5,5,2,3,4,5))
survey[!apply(survey==survey[[1]],1,all),]
# q1 q2 q3
# 1 1 1 3
# 4 5 5 4
# 6 1 1 5
# 10 2 3 5
The equality test builds a logical matrix; apply() then keeps the rows that are not TRUE in every column.

Removing duplicate rows in dataframes with multiple columns

In a dataframe like this
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)
it is possible to remove duplicates (based on the values of column b) using:
df[!duplicated(df), ]
If I add a third column c to the df and again want to remove duplicates based on the values of column b, is it right to use this?
df[!duplicated(df$b), ]
The dataframe:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c("i","i","ii","ii","iii","iii","iv","iv")
df <-data.frame(a,b,c)
Removing duplicates based on column b:
df[!duplicated(df$b), ]
gives this result:
a b c
A 1 i
A 2 ii
B 4 ii
and I would expect this
a b c
A 1 i
A 2 ii
B 4 ii
B 1 iii
C 2 iv

Input:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c("i","i","ii","ii","iii","iii","iv","iv")
df <-data.frame(a,b,c)
Described as expected output in post:
a b c
A 1 i
A 2 ii
B 4 ii
B 1 iii
C 2 iv
Using distinct on all columns seems to do what you want:
>library(dplyr)
>distinct(df)
a b c
1 A 1 i
2 A 2 ii
3 B 4 ii
4 B 1 iii
5 C 2 iv
Other variation: only allow unique b's:
> distinct(df,b, .keep_all = TRUE)
a b c
1 A 1 i
2 A 2 ii
3 B 4 ii
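
For readers who prefer not to load dplyr, a base R sketch of the same two variations using duplicated(); the names dedup_all and dedup_b are illustrative:

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c(1, 1, 2, 4, 1, 1, 2, 2)
c <- c("i", "i", "ii", "ii", "iii", "iii", "iv", "iv")
df <- data.frame(a, b, c)

# Drop rows that repeat an earlier row in every column
# (same rows as distinct(df), but original row names are kept)
dedup_all <- df[!duplicated(df), ]

# Keep only the first row for each value of b
# (same rows as distinct(df, b, .keep_all = TRUE))
dedup_b <- df[!duplicated(df$b), ]

dedup_all
dedup_b
```

Unlike distinct(), this keeps the original row names (1, 3, 4, 5, 7 for the first result) instead of renumbering them.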

How to manipulate multiple columns in R

I have the following table:
X Y
A 4 8
B 2 6
C 5 4
D 6 3
E 9 13
But I would like to re-arrange this to look like:
AX AY BX BY CX CY......
4 8 2 6 5 4
I am working in R and get the table by doing
table(db[,1],db[,2])
How can I change the command to get the desired output?

If you do not care about the names and you have numeric data then the easiest solution would be to coerce to a matrix and then to a vector like so:
as.vector( t( x ) )
# [1] 4 8 2 6 5 4 6 3 9 13
If you also want to preserve the names, use expand.grid to get the combinations...
# The data
y <- as.vector( t( x ) )
# Combinations of row and column names
nms <- expand.grid( colnames(x) , rownames(x) )
# Rename vector with desired names
names(y) <- paste0( nms[,2] , nms[,1] )
#AX AY BX BY CX CY DX DY EX EY
# 4 8 2 6 5 4 6 3 9 13
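
The answer does not show how x is built. A self-contained sketch, assuming x is a numeric matrix with row names A to E and column names X and Y, and using rep() instead of expand.grid() to build the names:

```r
# Assumed input: the table from the question as a named matrix
x <- matrix(c(4, 8, 2, 6, 5, 4, 6, 3, 9, 13),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("A", "B", "C", "D", "E"), c("X", "Y")))

# Transpose first so that as.vector() walks the table row by row
y <- as.vector(t(x))

# Repeat each row name once per column; column names are recycled
names(y) <- paste0(rep(rownames(x), each = ncol(x)), colnames(x))
y
```

The transpose is what gives the AX, AY, BX, BY ordering; without it, as.vector() would go down the X column first.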

Assuming your data is set up this way:
db <- data.frame(
c(rep("A", 4), rep("B", 2), rep("C", 5), rep("D", 6), rep("E", 9),
rep("A", 8), rep("B", 6), rep("C", 4), rep("D", 3), rep("E", 13)),
c(rep("X", 26), rep("Y", 34)),
stringsAsFactors = FALSE)
tab <- table(db[,1],db[,2])
You could do this in a one-liner:
array(tab, dimnames = list(do.call("paste0", expand.grid(dimnames(tab)))))
AX BX CX DX EX AY BY CY DY EY
4 2 5 6 9 8 6 4 3 13
