identify and count duplicate values across multiple columns


I have a dataset with multiple columns that look similar to this:
ID1 ID2 ID3 ID4
Blue Grey Fuchsia Green
Black Blue Orange Blue
Green Green Yellow Pink
Pink Yellow NA Orange
What I want to do is count how many times each value is duplicated across the four columns. For example, this is what I'd like to get back from the above:
ID Replicates
Blue 3
Black 1
Green 3
Pink 2
Grey 1
Yellow 2
Fuchsia 1
Orange 2
I'd also like to be able to ask which ID value is present in the dataset at frequency >2. So the expected result would be: Green and Blue.
Any thoughts on how to do this in R?
Thanks!

Just a regular table is all you need for a data set full of factors.
> ( tab <- table(unlist(data)) )
Black Blue Green Pink Grey Yellow Fuchsia Orange
1 3 3 2 1 2 1 2
Add deparse.level = 2 if you want the table to be named.
The table is easy to subset with [ indexing: keep the entries of tab that are greater than 2, then use names() to get the colors.
> tab[tab > 2]
Blue Green
3 3
> names(tab[tab > 2])
[1] "Blue" "Green"
There's also an as.data.frame method.
> as.data.frame(tab)
Var1 Freq
1 Black 1
2 Blue 3
3 Green 3
4 Pink 2
5 Grey 1
6 Yellow 2
7 Fuchsia 1
8 Orange 2
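For readers who want to run the answer above end to end, here is a minimal sketch. It rebuilds the question's data as character columns (an assumption; the answer refers to factors, which only changes the order of the table entries), and the names tab_sorted, frequent and tab_with_na are illustrative only.

```r
# Rebuild the example data from the question (character columns assumed)
data <- data.frame(
  ID1 = c("Blue", "Black", "Green", "Pink"),
  ID2 = c("Grey", "Blue", "Green", "Yellow"),
  ID3 = c("Fuchsia", "Orange", "Yellow", NA),
  ID4 = c("Green", "Blue", "Pink", "Orange"),
  stringsAsFactors = FALSE
)

# Flatten all columns into one vector and count each value;
# table() drops NA by default
tab <- table(unlist(data))

# Most frequent values first
tab_sorted <- sort(tab, decreasing = TRUE)

# Values that occur more than twice
frequent <- names(tab[tab > 2])

# Keep NA as its own category when that is wanted
tab_with_na <- table(unlist(data), useNA = "ifany")
```

With character columns the table entries come out in alphabetical order rather than in order of first appearance.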

A dplyr / tidyr solution:
ID1<-c("Blue", "Black", "Green", "Pink")
ID2<-c("Grey", "Blue", "Green", "Yellow")
ID3<-c("Fuchsia", "Orange", "Yellow", NA)
ID4<-c("Green", "Blue", "Pink", "Orange")
mydf<-data.frame(ID1,ID2,ID3,ID4)
library(dplyr)
library(tidyr)
mydf %>%
gather(key,value,1:4) %>%
group_by (value) %>%
tally
value n
1 Black 1
2 Blue 3
3 Fuchsia 1
4 Green 3
5 Grey 1
6 Orange 2
7 Pink 2
8 Yellow 2
9 NA 1
To return only the values that occur more than twice:
mydf %>%
gather(key,value,1:4) %>%
group_by (value) %>%
tally %>%
filter (n>2)
value n
1 Blue 3
2 Green 3
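The same pipeline can also be sketched with pivot_longer() and count(), which are the newer tidyr/dplyr equivalents of gather() and group_by() + tally. This is only a sketch, assuming tidyr 1.0.0 or later; the names counts and frequent are illustrative, and the NA row is dropped explicitly because the question's expected output does not include it.

```r
library(dplyr)
library(tidyr)

mydf <- data.frame(
  ID1 = c("Blue", "Black", "Green", "Pink"),
  ID2 = c("Grey", "Blue", "Green", "Yellow"),
  ID3 = c("Fuchsia", "Orange", "Yellow", NA),
  ID4 = c("Green", "Blue", "Pink", "Orange"),
  stringsAsFactors = FALSE
)

# Stack all columns into key/value pairs, drop missing values,
# then count how often each value occurs
counts <- mydf %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  filter(!is.na(value)) %>%
  count(value)

# Values that occur more than twice
frequent <- counts %>%
  filter(n > 2) %>%
  pull(value)
```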

Related

Ordering rows in a dataframe based on order of rows in another, with different dimensions [duplicate]

I have a categorical data set that looks similar to:
A <- data.frame(animal = c("cat","cat","cat","dog","dog","dog","elephant","elephant","elephant"),
color = c(rep(c("blue","red","green"), 3)))
animal color
1 cat blue
2 cat red
3 cat green
4 dog blue
5 dog red
6 dog green
7 elephant blue
8 elephant red
9 elephant green
I want to order it so that 'animal' is sorted as dog < elephant < cat, and then the color is sorted green < blue < red. So in the end it would look like
# animal color
# 6 dog green
# 4 dog blue
# 5 dog red
# 9 elephant green
# 7 elephant blue
# 8 elephant red
# 3 cat green
# 1 cat blue
# 2 cat red

The levels should be specified explicitly:
A$animal <- factor(A$animal, levels = c("dog", "elephant","cat"))
A$color <- factor(A$color, levels = c("green", "blue", "red"))
Then you order by the 2 columns simultaneously:
A[order(A$animal,A$color),]
# animal color
# 6 dog green
# 4 dog blue
# 5 dog red
# 9 elephant green
# 7 elephant blue
# 8 elephant red
# 3 cat green
# 1 cat blue
# 2 cat red

You can also use match, which leaves the column classes untouched and needs no conversion to factor.
animalOrder = c("dog", "elephant","cat")
colorOrder = c("green", "blue", "red")
A[ order(match(A$animal, animalOrder), match(A$color, colorOrder)), ]
animal color
6 dog green
4 dog blue
5 dog red
9 elephant green
7 elephant blue
8 elephant red
3 cat green
1 cat blue
2 cat red
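To see why the match approach works, it can help to look at the intermediate ranks. The sketch below is illustrative only: animal_rank, color_rank and sorted are names made up for this example, and the data frame is rebuilt with character columns.

```r
A <- data.frame(
  animal = rep(c("cat", "dog", "elephant"), each = 3),
  color  = rep(c("blue", "red", "green"), 3),
  stringsAsFactors = FALSE
)

animalOrder <- c("dog", "elephant", "cat")
colorOrder  <- c("green", "blue", "red")

# match() returns, for each value, its position in the ordering vector:
# "cat" maps to 3, "dog" to 1, "elephant" to 2
animal_rank <- match(A$animal, animalOrder)
color_rank  <- match(A$color, colorOrder)

# order() sorts by the first rank and breaks ties with the second
sorted <- A[order(animal_rank, color_rank), ]
```

A value that is missing from the ordering vector gets rank NA, and order() places those rows last by default.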

One other thing worth noting: you don't have to convert the column class to do this. You can order by a factor built on the fly from the variable, which keeps the column as, e.g., character within the existing data structure, if that is desired.
So, using the example above:
A[order(factor(A$animal, levels = c("dog", "elephant", "cat")), factor(A$color, levels = c("green", "blue", "red"))), ]
It depends on whether preserving the class is important. This would be a much more typical use case for me personally. HTH

In a vein similar to agstudy's answer, here is the 'tidyverse' way of doing the ordering:
A$animal <- factor(A$animal, levels = c("dog", "elephant","cat"))
A$color <- factor(A$color, levels = c("green", "blue", "red"))
Then we load dplyr or the whole tidyverse and can do
arrange(A, animal, color)
or simply
A %>% arrange(animal, color)
where %>% is the 'pipe' operator (from magrittr, re-exported by dplyr), which in RStudio can be typed with Ctrl + Shift + M

Matching values in two columns then based on that returning new value in R

I have these columns in a data frame that look like:
combination color_1 color_2
1_1 red red
1_2 red blue
1_3 red green
1_4 red yellow
2_1 blue red
2_2 blue blue
2_3 blue green
2_4 blue yellow
...
Based off matching the color_1 and color_2 values, I would like to be able to create new columns that outputs the result of the match. There are certain specifications to this. For the first row where "red" and "red" are the same, the output in the new column (e.g. "Red-Only") should be a "1", and then "2" for every other match. Then, I would repeat this code but then picking up on matches where "blue" and "blue" occur, to output "1" in a next column (e.g. "Blue-Only") and "2" everywhere else. This goes for Yellow-only matches, Green-only matches, etc. So at the end I would have 4 extra columns depending on the condition.
Thanks for the help in advance!

Let's start with your existing data:
df <- structure(list(combination = c("1_1", "1_2", "1_3", "1_4", "2_1",
"2_2", "2_3", "2_4"), color_1 = c("red", "red", "red", "red",
"blue", "blue", "blue", "blue"), color_2 = c("red", "blue", "green",
"yellow", "red", "blue", "green", "yellow")), class = "data.frame", row.names = c(NA,
-8L))
combination color_1 color_2
1 1_1 red red
2 1_2 red blue
3 1_3 red green
4 1_4 red yellow
5 2_1 blue red
6 2_2 blue blue
7 2_3 blue green
8 2_4 blue yellow
One solution would be to loop over your four color categories, checking for matches.
colors <- c('red', 'green', 'yellow', 'blue')
matches <- lapply(colors, function(x) {
out <- ifelse(with(df, color_1 == color_2 & color_1 == x), 1, 2)
out
})
And then naming the results of this operation with your intended column names.
names(matches) <- paste(colors, 'only', sep = '_')
And finally, gluing the results together with the original data:
df.new <- cbind(df, as.data.frame(matches))
combination color_1 color_2 red_only green_only yellow_only blue_only
1 1_1 red red 1 2 2 2
2 1_2 red blue 2 2 2 2
3 1_3 red green 2 2 2 2
4 1_4 red yellow 2 2 2 2
5 2_1 blue red 2 2 2 2
6 2_2 blue blue 2 2 2 1
7 2_3 blue green 2 2 2 2
8 2_4 blue yellow 2 2 2 2
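The loop above lists the four colors by hand. As a variation, the sketch below derives the colors from both columns and builds the indicator columns with vapply. The names all_colors, flags and result are illustrative only, and the data frame is rebuilt from the question's eight rows.

```r
df <- data.frame(
  combination = c("1_1", "1_2", "1_3", "1_4", "2_1", "2_2", "2_3", "2_4"),
  color_1 = rep(c("red", "blue"), each = 4),
  color_2 = rep(c("red", "blue", "green", "yellow"), 2),
  stringsAsFactors = FALSE
)

# Every color that appears in either column
all_colors <- unique(c(df$color_1, df$color_2))

# One integer column per color: 1 where both columns equal that color, else 2
flags <- vapply(
  all_colors,
  function(col) ifelse(df$color_1 == col & df$color_2 == col, 1L, 2L),
  integer(nrow(df))
)
colnames(flags) <- paste0(all_colors, "_only")

# Attach the indicator columns to the original data
result <- cbind(df, flags)
```

Colors that never match (green and yellow in this data) still get a column, filled with 2.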

You can use ifelse. If you have a lot of colors, looping over them would be a good idea.
cols <- data.frame(
color_1=c("Red","Red","Red","Red","Blue","Blue","Blue","Blue"),
color_2=c("Red","Blue","Green","Yellow","Red","Blue","Green","Yellow")
)
cols$redonly <- ifelse(cols$color_1 %in% "Red" & cols$color_2 %in% "Red", 1, 2)
cols$blueonly <- ifelse(cols$color_1 %in% "Blue" & cols$color_2 %in% "Blue", 1, 2)
cols$greenonly <- ifelse(cols$color_1 %in% "Green" & cols$color_2 %in% "Green", 1, 2)

Here is a way that doesn't depend on knowing the names of the colors.
fun <- function(color, DF, col1, col2){
2L - (color == DF[[col1]] & color == DF[[col2]])
}
cols1 <- unique(df1$color_1)
cbind(df1, sapply(cols1, fun, df1, 'color_1', 'color_2'))
# combination color_1 color_2 red blue
#1 1_1 red red 1 2
#2 1_2 red blue 2 2
#3 1_3 red green 2 2
#4 1_4 red yellow 2 2
#5 2_1 blue red 2 2
#6 2_2 blue blue 2 1
#7 2_3 blue green 2 2
#8 2_4 blue yellow 2 2
Data.
df1 <- read.table(text = "
combination color_1 color_2
1_1 red red
1_2 red blue
1_3 red green
1_4 red yellow
2_1 blue red
2_2 blue blue
2_3 blue green
2_4 blue yellow
", header = TRUE, stringsAsFactors = FALSE)

R - Flatten a data frame within a list

I have received a JSON file which could be read into R as a list using
library(jsonlite)
data <- jsonlite::fromJSON(URL)
The data is a list that contains both plain vectors and a data frame. For example:
temp = list(id = c(1, 2, 3), name = c("banana", "organge", "apple"), type = data.frame(colour=c("red", "blue", "green", "purple"), shape = c("round", "round", "square", "square")))
> temp
$id
[1] 1 2 3
$name
[1] "banana" "organge" "apple"
$type
colour shape
1 red round
2 blue round
3 green square
4 purple square
How can we convert this list to a data frame without losing information? I suppose each row of the nested data frame should be repeated for every element of the vectors in the list. The result should be
id name type.colour type.shape
1 1 banana red round
2 1 banana blue round
3 1 banana green square
4 1 banana purple square
5 2 orange red round
6 2 orange blue round
7 2 orange green square
8 2 orange purple square
9 3 apple red round
10 3 apple blue round
11 3 apple green square
12 3 apple purple square

For this specific case you can use the following code :
DFidxs <- rep(seq_len(nrow(temp$type)),times=length(temp$id))
colidxs <- rep(seq_len(length(temp$id)),each=nrow(temp$type))
DF <- cbind(id = temp$id[colidxs],
name = temp$name[colidxs],
temp$type[DFidxs,])
> DF
id name colour shape
1 1 banana red round
2 1 banana blue round
3 1 banana green square
4 1 banana purple square
1.1 2 organge red round
2.1 2 organge blue round
3.1 2 organge green square
4.1 2 organge purple square
1.2 3 apple red round
2.2 3 apple blue round
3.2 3 apple green square
4.2 3 apple purple square
Assuming that id, name (and possibly other vectors/columns) all have the same length, you can reuse this code to repeat the rows of the type data.frame for each element of those vectors and bind them together.
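As a rough generalisation of the code above, the sketch below picks out the vector elements of the list automatically instead of naming id and name by hand. It assumes the list holds equal-length vectors plus exactly one nested data frame called type; vec_part, nested, outer_idx, inner_idx and flat are illustrative names.

```r
temp <- list(
  id   = c(1, 2, 3),
  name = c("banana", "orange", "apple"),
  type = data.frame(
    colour = c("red", "blue", "green", "purple"),
    shape  = c("round", "round", "square", "square"),
    stringsAsFactors = FALSE
  )
)

# Separate the plain vectors from the nested data frame
is_df    <- vapply(temp, is.data.frame, logical(1))
vec_part <- as.data.frame(temp[!is_df], stringsAsFactors = FALSE)
nested   <- temp$type
names(nested) <- paste("type", names(nested), sep = ".")

# Repeat each outer row once per nested row, and cycle through the nested rows
outer_idx <- rep(seq_len(nrow(vec_part)), each = nrow(nested))
inner_idx <- rep(seq_len(nrow(nested)), times = nrow(vec_part))

flat <- cbind(vec_part[outer_idx, , drop = FALSE],
              nested[inner_idx, , drop = FALSE])
rownames(flat) <- NULL  # replace the 1.1, 2.1, ... row names with 1..n
```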

How to count frequency of a categorical variable (level) for each row in a dataframe

I would like to summarize the frequency of each level (category) for each ID in my data frame. For example, how could I generate the values 1,2,0 respectively for the ID 4003491503?
I tried tapply and count and I keep getting errors.
RespondentID Case.A Case.B Case.C Freq Red Freq Blue Freq Missing/NA
1 4003491503 Red Blue Blue 1 2 0
2 4003491653 Blue Red Red
3 4003491982 Red Blue Red
4 4003494862 Red Red NA
15 4003494880 Blue Blue Blue

We can melt the dataset with 'RespondentID' as the 'id.var', get the frequencies with table, convert the output to a data.frame, change the column names, and cbind the result with the original dataset.
library(reshape2)
df2 <- as.data.frame.matrix(table(melt(df1, id.var='RespondentID')[-2], useNA='ifany'))
colnames(df2) <- paste0('Freq', colnames(df2))
cbind(df1, df2)
# RespondentID Case.A Case.B Case.C FreqBlue FreqRed FreqNA
#1 4003491503 Red Blue Blue 2 1 0
#2 4003491653 Blue Red Red 1 2 0
#3 4003491982 Red Blue Red 1 2 0
#4 4003494862 Red Red <NA> 0 2 1
#15 4003494880 Blue Blue Blue 3 0 0
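If reshape2 is not available, the same per-row counts can be sketched in base R with rowSums. This is an alternative technique, not the method used above. The data frame is rebuilt from the question with character columns and the IDs stored as strings (both assumptions), and cases is an illustrative name.

```r
df1 <- data.frame(
  RespondentID = c("4003491503", "4003491653", "4003491982",
                   "4003494862", "4003494880"),
  Case.A = c("Red", "Blue", "Red", "Red", "Blue"),
  Case.B = c("Blue", "Red", "Blue", "Red", "Blue"),
  Case.C = c("Blue", "Red", "Red", NA, "Blue"),
  stringsAsFactors = FALSE
)

# Only the columns that hold the categorical answers
cases <- df1[, c("Case.A", "Case.B", "Case.C")]

# Comparing a data frame with a value gives a logical matrix;
# rowSums() then counts the TRUE entries in each row
df1$FreqRed  <- rowSums(cases == "Red", na.rm = TRUE)
df1$FreqBlue <- rowSums(cases == "Blue", na.rm = TRUE)
df1$FreqNA   <- rowSums(is.na(cases))
```

This needs one line per level, so it suits a small, known set of categories; the table-based approach above scales better when there are many levels.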

Separating a column element into 3 separate columns (R)

I have a data frame (theData) as follows which has values separated by pipes:
Col1 Col2 Col3
1 colors red|green|purple
1 colors red|pink|yellow
1 colors yellow|mauve|purple
1 colors red|green|orange
1 colors red|yellow|purple
1 colors red|green|purple
I would like to separate the Col3 into additional columns like this:
Col1 Col2 Col3 Col4 Col5
1 colors red green purple
1 colors red pink yellow
1 colors yellow mauve purple
1 colors red green orange
1 colors red yellow purple
1 colors red green purple
I have tried the following:
str_split_fixed(as.character(theData$Col3), "|", 3)
but this does not work.

My cSplit function handles this type of problem quite easily.
library(splitstackshape)  # cSplit is provided by this package
cSplit(theData, "Col3", "|")
# Col1 Col2 Col3_1 Col3_2 Col3_3
# 1: 1 colors red green purple
# 2: 1 colors red pink yellow
# 3: 1 colors yellow mauve purple
# 4: 1 colors red green orange
# 5: 1 colors red yellow purple
# 6: 1 colors red green purple
The result is a data.table since the function makes use of the "data.table" package for the efficiencies it offers, particularly with larger datasets.

You could also try colsplit from reshape
library(reshape)
cbind(theData[,1:2],
colsplit(theData$Col3, "[|]", names=c("Col3", "Col4", "Col5")))
# Col1 Col2 Col3 Col4 Col5
#1 1 colors red green purple
#2 1 colors red pink yellow
#3 1 colors yellow mauve purple
#4 1 colors red green orange
#5 1 colors red yellow purple
#6 1 colors red green purple
Or just use read.table
cbind(theData[,1:2],
setNames(read.table(text=theData$Col3,sep="|",header=F,stringsAsFactors=F),paste0("Col",3:5)))

Here is one more option, in case you like it: Hadley's tidyr package. The code is quite clean.
> library(tidyr)
> test <- data.frame(Col3 = c("red|green|purple", "red|pink|yellow"))
> test
Source: local data frame [2 x 1]
Col3
1 red|green|purple
2 red|pink|yellow
> test %>% separate(Col3, c("A", "B", "C"), sep = "\\|")
Source: local data frame [2 x 3]
A B C
1 red green purple
2 red pink yellow

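You just needed to wrap | with [], or escape it with \\|, because an unescaped | is the regex alternation operator. This looks like a job for mapply (here dat is the question's data frame, theData).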
> m <- mapply(strsplit, dat$Col3, split = "[|]", USE.NAMES = FALSE)
> setNames(cbind(dat[-3], do.call(rbind, m)), paste0("Col", 1:5))
# Col1 Col2 Col3 Col4 Col5
# 1 1 colors red green purple
# 2 1 colors red pink yellow
# 3 1 colors yellow mauve purple
# 4 1 colors red green orange
# 5 1 colors red yellow purple
# 6 1 colors red green purple
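Another way to avoid the regex issue is to tell strsplit to treat the separator literally with fixed = TRUE. The sketch below is a base R illustration on three of the question's rows. It assumes Col3 is a character column and that every row has exactly three pieces; pieces, parts and result are illustrative names.

```r
theData <- data.frame(
  Col1 = 1,
  Col2 = "colors",
  Col3 = c("red|green|purple", "red|pink|yellow", "yellow|mauve|purple"),
  stringsAsFactors = FALSE
)

# fixed = TRUE makes "|" a literal pipe rather than regex alternation
pieces <- strsplit(theData$Col3, "|", fixed = TRUE)

# Stack the per-row pieces into a character matrix, one row per input row
parts <- do.call(rbind, pieces)
colnames(parts) <- paste0("Col", 3:5)

# Combine with the untouched columns
result <- data.frame(theData[c("Col1", "Col2")], parts,
                     stringsAsFactors = FALSE)
```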
Using your attempt with str_split_fixed, it would just need a little change,
> library(stringr)
> cbind(dat[-3], str_split_fixed(dat$Col3, "[|]", 3))
