Stacking multiple columns into two columns and removing duplicates in R

I have multiple columns, but here is only a part of my data:
df<-read.table (text=" Color1 Size1 Color2 Size2 Color3 Size3
Yellow AA Gray GB Purpul MO
Blue BD Cyne CE Gray GB
Yellow AA Yellow AA Black LL
Red MD Reddark KK Reddark KK
Green MC Reddark KK Green MC
", header=TRUE)
I want to stack all the columns down into two columns and then remove duplicates to get this table:
Color Size
Yellow AA
Blue BD
Red MD
Green MC
Gray GB
Cyne CE
Reddark KK
Purpul MO
Black LL
I tried melt from reshape2, but struggled to make it work.

With no other libraries, reshape and unique can get the job done:
> unique(reshape(df, varying=1:6, direction="long", v.names=c("Color", "Size"), timevar=NULL)[1:2])
Color Size
1.1 Yellow AA
2.1 Blue BD
4.1 Red MD
5.1 Green MC
1.2 Gray GB
2.2 Cyne CE
4.2 Reddark KK
1.3 Purpul MO
3.3 Black LL
Pivoting seems like overkill to me, but what do I know. If the index bothers you (though it saves the information on how the wide table was structured) then reset the row names:
> uniq = unique(reshape(df, varying=1:6, direction="long", v.names=c("Color", "Size"), timevar=NULL)[1:2])
> rownames(uniq) = NULL

Another way, using pivot_longer() and pivot_wider():
library(dplyr)
library(tidyr)
# Code
newdf <- df %>%
  pivot_longer(everything()) %>%
  mutate(name = substr(name, 1, nchar(name) - 1)) %>%
  group_by(name) %>%
  mutate(id2 = row_number()) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-id2) %>%
  filter(!duplicated(paste(Color, Size)))
Output:
# A tibble: 9 x 2
Color Size
<fct> <fct>
1 Yellow AA
2 Gray GB
3 Purpul MO
4 Blue BD
5 Cyne CE
6 Black LL
7 Red MD
8 Reddark KK
9 Green MC

We can use pivot_longer from tidyr to reshape from 'wide' to 'long' into two columns, specifying names_sep as the boundary between a letter and a digit ((?<=[a-z])(?=\\d)) in the column names, and then take the distinct rows of the two columns.
library(dplyr)
library(tidyr)
pivot_longer(df, cols = everything(),
             names_to = c('.value', 'grp'),
             names_sep = "(?<=[a-z])(?=\\d)") %>%
  distinct(Color, Size)
Output:
# A tibble: 9 x 2
# Color Size
# <chr> <chr>
#1 Yellow AA
#2 Gray GB
#3 Purpul MO
#4 Blue BD
#5 Cyne CE
#6 Black LL
#7 Red MD
#8 Reddark KK
#9 Green MC
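An equivalent spelling uses names_pattern with two capture groups instead of the lookaround names_sep; the first group feeds .value, the second the grp column. The regex here is my own but the result is the same, assuming the df from the question:

```r
library(dplyr)
library(tidyr)

# Capture group 1 (the letters) becomes the output column name via .value,
# capture group 2 (the digit) becomes the 'grp' column
pivot_longer(df, cols = everything(),
             names_to = c(".value", "grp"),
             names_pattern = "([A-Za-z]+)(\\d)") %>%
  distinct(Color, Size)
```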
Or using data.table
library(data.table)
unique(melt(setDT(df), measure = patterns('^Color', '^Size'),
            value.name = c('Color', 'Size'))[, variable := NULL])
# Color Size
#1: Yellow AA
#2: Blue BD
#3: Red MD
#4: Green MC
#5: Gray GB
#6: Cyne CE
#7: Reddark KK
#8: Purpul MO
#9: Black LL
data
df <- structure(list(Color1 = c("Yellow", "Blue", "Yellow", "Red",
"Green"), Size1 = c("AA", "BD", "AA", "MD", "MC"), Color2 = c("Gray",
"Cyne", "Yellow", "Reddark", "Reddark"), Size2 = c("GB", "CE",
"AA", "KK", "KK"), Color3 = c("Purpul", "Gray", "Black", "Reddark",
"Green"), Size3 = c("MO", "GB", "LL", "KK", "MC")),
class = "data.frame", row.names = c(NA,
-5L))


How can I add a set of rows to each distinct observation in R

I am looking to automate a process in R that was previously done by hand and is very time consuming. I'd like to add a series of observations from one dataframe to each unique variable in another. An example using data will probably illustrate this better...
Table one contains a number of observations for each animal, this is the table where I will want to add a set of rows for each type of animal.
Animal  Colour  Temperament
Cat     Black   Calm
Dog     Beige   Anxious
Cat     White   Playful
Table two shows the rows that should be applied to each animal.
Colour  Temperament
Brown   Control
Beige   Control
White   Control
The final table should look something like:
Animal  Colour  Temperament
Cat     Black   Calm
Dog     Beige   Anxious
Cat     White   Playful
Cat     Brown   Control
Cat     Beige   Control
Cat     White   Control
Dog     Brown   Control
Dog     Beige   Control
Dog     White   Control
Would someone be able to point me in the right direction? Preferably using tidyverse over base R (but not essential :) )
1. We create a simple, reproducible example dataset:
d1 <- data.frame(an = c("c", "d", "c"),
                 cl = c("bl", "be", "wh"),
                 tm = c("cl", "an", "pl"))
d2 <- data.frame(cl = c("br", "be", "wh"),
                 tm = "cn")
2. Use tidyr::expand_grid in combination with dplyr::full_join to expand the data.frame d1 into the desired form:
library(dplyr)
library(tidyr)
d1 %>%
  full_join(expand_grid(d2, an = unique(d1$an)))
This returns:
an cl tm
1 c bl cl
2 d be an
3 c wh pl
4 c br cn
5 d br cn
6 c be cn
7 d be cn
8 c wh cn
9 d wh cn
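For comparison, a base R sketch of the same cross-then-bind idea (no packages needed; merge with no shared columns produces a Cartesian product):

```r
# Cross d2 with the distinct animals, then bind the control rows onto d1
ctrl <- merge(d2, data.frame(an = unique(d1$an)))
rbind(d1, ctrl[names(d1)])
```

Row order differs from the join-based answers, but the nine rows are the same.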
Using crossing
library(dplyr)
library(tidyr)
crossing(d2, an = d1$an) %>%
  full_join(d1)
Joining, by = c("cl", "tm", "an")
# A tibble: 9 × 3
cl tm an
<chr> <chr> <chr>
1 be cn c
2 be cn d
3 br cn c
4 br cn d
5 wh cn c
6 wh cn d
7 bl cl c
8 be an d
9 wh pl c
data
d1 <- structure(list(an = c("c", "d", "c"), cl = c("bl", "be", "wh"
), tm = c("cl", "an", "pl")), class = "data.frame", row.names = c(NA,
-3L))
d2 <- structure(list(cl = c("br", "be", "wh"), tm = c("cn", "cn", "cn"
)), class = "data.frame", row.names = c(NA, -3L))

How to filter values in a list within a dataframe in R?

I have a dataframe, df:
df <- structure(list(id = c("id1", "id2", "id3",
"id4"), type = c("blue", "blue", "brown", "blue"
), value = list(
value1 = "cat", value2 = character(0),
value3 = "dog", value4 = "fish")), row.names = 1:4, class = "data.frame")
> df
id type value
1 id1 blue cat
2 id2 blue
3 id3 brown dog
4 id4 blue fish
The third column, value, is a list. I want to filter out any rows where the entry in that column has no characters (i.e., the second row).
I've tried this:
df <- filter(df, value != "")
and this
df <- filter(df, nchar(value) != 0)
But it doesn't have any effect on the data frame. What is the correct way to do this so my data frame looks like this:
> df
id type value
1 id1 blue cat
3 id3 brown dog
4 id4 blue fish
The lengths() function is perfect here - it gives the length of each element of a list. You want all the rows where value has non-zero length:
df[lengths(df$value) > 0, ]
# id type value
# 1 id1 blue cat
# 3 id3 brown dog
# 4 id4 blue fish
Here is my approach (sapply gives a plain numeric vector, which compares cleanly):
idx <- sapply(df$value, length)
filter(df, idx > 0)
id type value
1 id1 blue cat
2 id3 brown dog
3 id4 blue fish
An option with tidyverse
library(dplyr)
library(purrr)
df %>%
  filter(map_int(value, length) > 0)
# id type value
#1 id1 blue cat
#2 id3 brown dog
#3 id4 blue fish
Try this:
df <- filter(df, !sapply(df$value, function(x) identical(x, character(0))))

Restructuring data from long to wide by removing characters

Here is a sample of my data
code group type outcome
11 A red M*P
11 N orange N*P
11 Z red R
12 AB A blue Z*P
12 AN B green Q*P
12 AA A gray AB
which can be created by:
df <- data.frame(
code = c(rep(11,3), rep(12,3)),
group = c("A", "N", "Z", "AB A", "AN B", "AA A"),
type = c("red", "orange", "red", "blue", "green", "gray"),
outcome = c("M*P", "N*P", "R", "Z*P", "Q*P", "AB"),
stringsAsFactors = FALSE
)
I want to get the following table
code group1 group2 group3 type1 type2 type3 outcome
11 A N Z red orange red MNR
12 AB A AN B AA A blue green gray ZQAB
I have used the following code, but it does not work. I also want to remove the *P suffixes in outcome. Thanks for your help.
dcast(df, formula= code +group ~ type, value.var = 'outcome')
Using data.table to hit your expected output:
library(data.table)
setDT(df)
# Clean out the *P suffixes beforehand
df[, outcome := gsub("*P", "", outcome, fixed = TRUE)]
# dcast, but let's leave outcome for later (easier)
wdf <- dcast(df, code ~ rowid(code), value.var = c('group', 'type'))
# Now build outcome separately by code and merge
merge(wdf, df[, .(outcome = paste(outcome, collapse = "")), code])
code group_1 group_2 group_3 type_1 type_2 type_3 outcome
1: 11 A N Z red orange red MNR
2: 12 AB A AN B AA A blue green gray ZQAB
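The same steps (strip the *P, widen group/type by position within each code, collapse outcome) can also be sketched with tidyr/dplyr, if you prefer the tidyverse:

```r
library(dplyr)
library(tidyr)

df %>%
  mutate(outcome = sub("*P", "", outcome, fixed = TRUE)) %>%  # drop *P
  group_by(code) %>%
  mutate(id = row_number(),                                   # position within code
         outcome = paste(outcome, collapse = "")) %>%         # MNR / ZQAB
  ungroup() %>%
  pivot_wider(names_from = id, values_from = c(group, type))
```

The column order comes out as code, outcome, group_1..group_3, type_1..type_3; reorder with select() if needed.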

Combine two data frames across multiple columns

Say I have two dataframes, each with four columns. One column is a numeric value. The other three are identifying variables. For example:
set1 <- data.frame(label1 = c("a","b", "c"), label2 = c("red", "white", "blue"), name = c("sam", "bob", "drew"), val = c(1, 10, 100))
set2 <- data.frame(label1 = c("b","c", "d"), label2 = c("white", "green", "orange"), name = c("bob", "drew", "collin"), val = c(7, 100, 15))
Which are:
> set1
label1 label2 name val
1 a red sam 1
2 b white bob 10
3 c blue drew 100
> set2
label1 label2 name val
1 b white bob 7
2 c green drew 100
3 d orange collin 15
The first three columns can be combined to form a primary key. What is the most efficient way to combine these two data frames such that all unique values (from columns label1, label2, name) are displayed along with the two val columns:
set3 <- data.frame(label1 = c("a", "b", "c", "c", "d"), label2 = c("red", "white", "blue", "green", "orange"), name = c("sam", "bob", "drew", "drew", "collin"), val.set1 = c(1, 10, 100, NA, NA), val.set2 = c(NA, 7, NA, 100, 15))
> set3
label1 label2 name val.set1 val.set2
1 a red sam 1 NA
2 b white bob 10 7
3 c blue drew 100 NA
4 c green drew NA 100
5 d orange collin NA 15
When thinking of efficiency, you should evaluate the data.table package:
library(data.table)
(merge(
  setDT(set1, key = names(set1)[1:3]),
  setDT(set2, key = names(set2)[1:3]),
  all = TRUE,
  suffixes = paste0(".set", 1:2)
) -> set3)
# label1 label2 name val.set1 val.set2
# 1: a red sam 1 NA
# 2: b white bob 10 7
# 3: c blue drew 100 NA
# 4: c green drew NA 100
# 5: d orange collin NA 15
Since they're in the same format, you could also row-bind them and keep the unique key combinations. Using dplyr:
bind_rows(set1, set2) %>% distinct(label1, label2, name, .keep_all = TRUE)
Note that this keeps only the first val per key rather than the two separate val.set1/val.set2 columns. You just want to make sure that you don't have factors in there, that everything is a character or numeric.
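A dplyr equivalent of the keyed merge, producing the two val columns directly via full_join's suffix argument:

```r
library(dplyr)

# Rows present in only one set get NA in the other set's val column
set3 <- full_join(set1, set2,
                  by = c("label1", "label2", "name"),
                  suffix = c(".set1", ".set2"))
```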

Matching columns in R and adding frequencies against them

I need to match the values in col1 with col2 and col3, and where they match, add their frequencies. The output should show the total count from freq1, freq2, and freq3 for each unique value.
col1 freq1 col2 freq2 col3 freq3
apple 3 grapes 4 apple 1
grapes 5 apple 2 orange 2
orange 4 banana 5 grapes 2
guava 3 orange 6 banana 7
I need my output like this
apple 6
grapes 11
orange 12
guava 3
banana 12
I'm a beginner. How do I code this in R?
We can use melt from data.table with patterns specified in the measure argument to convert from 'wide' to 'long' format, then, grouped by 'col', get the sum of the 'freq' column:
library(data.table)
melt(setDT(df1), measure = patterns("^col", "^freq"),
     value.name = c("col", "freq"))[, .(freq = sum(freq)), by = col]
# col freq
#1: apple 6
#2: grapes 11
#3: orange 12
#4: guava 3
#5: banana 12
If the columns alternate 'col', 'freq', we can just unlist the 'col' columns and the 'freq' columns separately to create a data.frame (using c(TRUE, FALSE) to recycle for subsetting columns), and then use aggregate from base R to get the sum grouped by 'col'.
aggregate(freq ~ col, data.frame(col = unlist(df1[c(TRUE, FALSE)]),
                                 freq = unlist(df1[c(FALSE, TRUE)])), sum)
# col freq
#1 apple 6
#2 banana 12
#3 grapes 11
#4 guava 3
#5 orange 12
I think the easiest approach for a beginner to understand would be creating 3 separate dataframes (I assume here that your dataframe is named df):
df1 <- data.frame(df$col1, df$freq1)
colnames(df1) <- c("fruit", "freq")
df2 <- data.frame(df$col2, df$freq2)
colnames(df2) <- c("fruit", "freq")
df3 <- data.frame(df$col3, df$freq3)
colnames(df3) <- c("fruit", "freq")
Then bind all dataframes by rows:
df <- rbind(df1, df2, df3)
And at the end group by fruit and sum frequencies using dplyr library.
library(dplyr)
df <- df %>%
  group_by(fruit) %>%
  summarise(freq = sum(freq))
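The three steps above can also be written as one chain (setNames is just base R renaming; this assumes the wide df from the question):

```r
library(dplyr)

# Rename each col/freq pair to common names, stack, then sum per fruit
bind_rows(
  setNames(df[c("col1", "freq1")], c("fruit", "freq")),
  setNames(df[c("col2", "freq2")], c("fruit", "freq")),
  setNames(df[c("col3", "freq3")], c("fruit", "freq"))
) %>%
  group_by(fruit) %>%
  summarise(freq = sum(freq))
```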
