R Create a variable with the levels of grouped data

I have a data frame, data, such as:
data = data.frame(ID = as.factor(c("A", "A", "B","B","C","C")),
var.color= as.factor(c("red", "blue", "green", "red", "green", "yellow")))
For each group in ID (A, B, C), I would like to collect the levels of var.color that occur in that group and create a variable that pastes them together. I attempted this by running the following:
data %>% group_by(ID) %>%
mutate(ex = paste(droplevels(var.color), sep = "_"))
That yields:
Source: local data frame [6 x 3]
Groups: ID [3]
ID var.color ex
<fctr> <fctr> <chr>
1 A red red
2 A blue blue
3 B green red
4 B red red
5 C green green
6 C yellow yellow
However, my desired data.frame should be something like:
ID var.color ex
<fctr> <fctr> <chr>
1 A red red_blue
2 A blue red_blue
3 B green green_red
4 B red green_red
5 C green green_yellow
6 C yellow green_yellow

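Basically, you need collapse instead of sep. sep only separates the arguments passed to paste(), so with a single vector it has no effect; collapse joins the elements of that vector into one string.
There is also no need to drop levels: you can just paste the text together, grouped by ID.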
library(dplyr)
data %>% group_by(ID) %>%
mutate(ex = paste(var.color, collapse = "_"))
# ID var.color ex
# <fctr> <fctr> <chr>
#1 A red red_blue
#2 A blue red_blue
#3 B green green_red
#4 B red green_red
#5 C green green_yellow
#6 C yellow green_yellow
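To see why sep had no effect in the original attempt, it may help to compare the two arguments on a plain character vector. The following is a minimal base R sketch; the vector name colors and the "col" prefix are made up for illustration.

```r
# Hypothetical example vector (not taken from the question's data frame)
colors <- c("red", "blue")

# sep is placed between the *arguments* of paste(); with a single
# vector argument there is nothing to separate, so nothing changes.
with_sep <- paste(colors, sep = "_")

# collapse joins the *elements* of the result into one string.
with_collapse <- paste(colors, collapse = "_")

# Both can be combined: sep joins the arguments element-wise,
# then collapse joins the resulting elements.
both <- paste("col", colors, sep = "-", collapse = "_")

print(with_sep)
print(with_collapse)
print(both)
```

Inside group_by(ID) %>% mutate(...), the same paste(..., collapse = "_") call is evaluated once per group, which is why every row of a group receives the same string.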

You can do the same with a loop over the unique IDs:
for(i in unique(data$ID)){
data$ex[data$ID==i] <- paste0(data$var.color[data$ID==i], collapse = "_")
}
> data
ID var.color ex
1 A red red_blue
2 A blue red_blue
3 B green green_red
4 B red green_red
5 C green green_yellow
6 C yellow green_yellow
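If an explicit loop feels heavy, base R's ave() may offer a more compact alternative, since it applies a function within groups and puts the result back in the original row order. This is only a sketch; it rebuilds the question's data so that it runs on its own.

```r
# Rebuild the data frame from the question
data <- data.frame(
  ID = factor(c("A", "A", "B", "B", "C", "C")),
  var.color = factor(c("red", "blue", "green", "red", "green", "yellow"))
)

# ave() splits its first argument by ID, applies FUN to each piece,
# and recycles the single string returned by FUN over the group's rows.
# as.character() is used so that the result is text rather than a factor.
data$ex <- ave(
  as.character(data$var.color),
  data$ID,
  FUN = function(x) paste(x, collapse = "_")
)

print(data)
```

If a color could repeat within a group and only distinct values are wanted, wrapping x in unique() inside FUN would be one way to handle that.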

Related

How to build a summary data frame

I have a data set that looks like this:
and I would like to get a summary data set that looks like this:
What should I do? Thanks. The sample data can be built with the following code:
ID<- c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18")
Group<-c("A","B","C","D","D","D","A","B","D","C","B","D","A","A","C","B","B","B")
Color<-c("Green","Yellow","Red","Red","Red","Yellow","Green","Green","Yellow","Red","Red","Yellow","Yellow","Yellow","Green","Red","Red","Green")
Realy_Love<-c("Y","N","Y","Y","N","N","Y","Y","Y","N","N","Y","N","Y","N","Y","N","Y")
Sample.data <- data.frame(ID, Group, Color, Realy_Love)
You can use dplyr and group by the following items:
Sample.data %>%
group_by(Group, Color, Realy_Love) %>%
summarise(Obs = n())
# Group Color Realy_Love Obs
# <chr> <chr> <chr> <int>
# 1 A Green Y 2
# 2 A Yellow N 1
# 3 A Yellow Y 1
# 4 B Green Y 2
# 5 B Red N 2
# 6 B Red Y 1
# 7 B Yellow N 1
# 8 C Green N 1
# 9 C Red N 1
# 10 C Red Y 1
# 11 D Red N 1
# 12 D Red Y 1
# 13 D Yellow N 1
# 14 D Yellow Y 2
Use dplyr from the tidyverse to get a summary. You can then use arrange() to sort by Color or another variable.
Sample.data %>%
group_by(Group, Color, Realy_Love) %>%
summarise(Obs = n()) %>%
arrange(Color)
With dplyr, you don't even need to group the columns yourself; the count() function does it in one step:
Sample.data %>%
count(Group, Color, Realy_Love, sort = TRUE)
The optional sort = TRUE argument sorts the result in descending order of frequency, most frequent first:
Group Color Realy_Love n
1 A Green Y 2
2 B Green Y 2
3 B Red N 2
4 D Yellow Y 2
5 A Yellow N 1
6 A Yellow Y 1
7 B Red Y 1
8 B Yellow N 1
9 C Green N 1
10 C Red N 1
11 C Red Y 1
12 D Red N 1
13 D Red Y 1
14 D Yellow N 1
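For readers who do not have dplyr available, roughly the same tally can be sketched in base R with aggregate(). The helper column name Obs is borrowed from the answers above; everything else uses the sample data from the question.

```r
# Sample data from the question
Sample.data <- data.frame(
  ID = as.character(1:18),
  Group = c("A","B","C","D","D","D","A","B","D","C","B","D","A","A","C","B","B","B"),
  Color = c("Green","Yellow","Red","Red","Red","Yellow","Green","Green","Yellow",
            "Red","Red","Yellow","Yellow","Yellow","Green","Red","Red","Green"),
  Realy_Love = c("Y","N","Y","Y","N","N","Y","Y","Y","N","N","Y","N","Y","N","Y","N","Y")
)

# Add a helper column of ones and sum it within each
# Group/Color/Realy_Love combination that occurs in the data.
counts <- aggregate(
  Obs ~ Group + Color + Realy_Love,
  data = transform(Sample.data, Obs = 1L),
  FUN = sum
)

# Sort from most to least frequent, similar in spirit to count(sort = TRUE)
counts <- counts[order(-counts$Obs), ]

print(counts)
```

The row order among equally frequent combinations may differ from what count() prints, since the two approaches order ties differently.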

R variable number of string concatenations within group_by

Let's say I have the following table of houses (or anything) and their colors:
I'm trying to:
1. group_by(Group),
2. count the rows (I assume with length(unique(ID))),
3. mutate or summarize into a new column with a count of each color in the group, as a string.
Result should be:
So I know step 3 could be done by manually entering every possible combination with something like
df <- df %>%
group_by(Group) %>%
mutate(
Summary = case_when(
all(
sum(count_green) > 0
) ~ paste(length(unique(ID)), " houses, ", count_green, " green")
)
)
but what if I have hundreds of possible combinations? Is there a way to paste into a string and append for each new color/count?
Here is one approach: count the frequency of each 'Group'/'Color' pair with add_count, unite that count with 'Color' into 'nColor', then, grouped by 'Group', create the 'Summary' column by concatenating the group size (n()) with the unique elements of 'nColor':
library(dplyr)
library(tidyr)
library(stringr)
df %>%
add_count(Group, Color) %>%
unite(nColor, n, Color, sep= ' ', remove = FALSE) %>%
group_by(Group) %>%
mutate(
Summary = str_c(n(), ' houses, ', toString(unique(nColor)))) %>%
select(-nColor)
# Groups: Group [2]
# ID Group Color n Summary
# <int> <chr> <chr> <int> <chr>
#1 1 a Green 2 3 houses, 2 Green, 1 Orange
#2 2 a Green 2 3 houses, 2 Green, 1 Orange
#3 3 a Orange 1 3 houses, 2 Green, 1 Orange
#4 4 b Blue 2 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 1 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 2 3 houses, 2 Blue, 1 Yellow
data
df <- structure(list(ID = 1:6, Group = c("a", "a", "a", "b", "b", "b"
), Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"
)), class = "data.frame", row.names = c(NA, -6L))
Here's an approach with map_chr from purrr and a lot of pasting.
library(dplyr)
library(purrr)
df %>%
group_by(Group) %>%
mutate(Summary = paste(n(),"houses,",
paste(map_chr(unique(as.character(Color)),
~paste(sum(Color == .x),.x)),
collapse = ", ")))
## A tibble: 6 x 4
## Groups: Group [2]
# ID Group Color Summary
# <int> <fct> <fct> <chr>
#1 1 a Green 3 houses, 2 Green, 1 Orange
#2 2 a Green 3 houses, 2 Green, 1 Orange
#3 3 a Orange 3 houses, 2 Green, 1 Orange
#4 4 b Blue 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 3 houses, 2 Blue, 1 Yellow
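The same string can also be assembled without any packages. The sketch below uses a hypothetical helper, summarise_colors(), that tabulates the colors of one group and pastes the counts together; ave() then applies it per Group. It assumes Color is stored as character.

```r
# Data from the first answer
df <- data.frame(
  ID = 1:6,
  Group = c("a", "a", "a", "b", "b", "b"),
  Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: builds "<n> houses, <count> <color>, ..." for one group.
summarise_colors <- function(colors) {
  # Fix the level order to first appearance so the string follows the data
  counts <- table(factor(colors, levels = unique(colors)))
  paste0(
    length(colors), " houses, ",
    paste(as.vector(counts), names(counts), collapse = ", ")
  )
}

# Apply the helper within each Group; the single string is recycled
# over all rows of the group.
df$Summary <- ave(df$Color, df$Group, FUN = summarise_colors)

print(df)
```

Because the helper works on whatever colors it is given, no color or combination has to be listed by hand.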

Select the n most frequent values in a variable

I would like to find the most common values in a column in a data frame. I assume using table would be the best way to do this? I then want to filter/subset my data frame to only include these top-n values.
An example of my data frame is as follows. Here I want to find e.g. the top 2 IDs.
ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange
I therefore want to output the following:
Top 2 values of ID are:
A and C
I will then select the rows corresponding to ID A and C:
ID col
A blue
A purple
A green
C red
C blue
C yellow
C orange
You can try a tidyverse approach. Add the count of each ID, then filter for the top two (using < 3) or the top ten (using < 11):
library(tidyverse)
d %>%
add_count(ID) %>%
filter(dense_rank(-n) < 3)
# A tibble: 7 x 3
ID col n
<fct> <fct> <int>
1 A blue 3
2 A purple 3
3 A green 3
4 C red 4
5 C blue 4
6 C yellow 4
7 C orange 4
Data
d <- read.table(text="ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange", header=T)
We can count the number of values using table, sort them in decreasing order, select the first 2 (or 10) values, get the corresponding IDs, and subset those IDs from the data frame (called df here).
df[df$ID %in% names(sort(table(df$ID), decreasing = TRUE)[1:2]), ]
# ID col
#1 A blue
#2 A purple
#3 A green
#6 C red
#7 C blue
#8 C yellow
#9 C orange
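One thing to keep in mind: taking [1:2] from the sorted table returns exactly two IDs even when several IDs tie on frequency, whereas a dense-rank filter keeps every ID whose count is among the top two distinct counts. A rough base R sketch of the dense-rank variant, with hypothetical variable names, might look like this:

```r
# Example data from the question
d <- data.frame(
  ID = rep(c("A", "B", "C"), times = c(3, 2, 4)),
  col = c("blue", "purple", "green", "green", "red",
          "red", "blue", "yellow", "orange")
)

# Frequency of each ID
counts <- table(d$ID)

# Dense rank of each count: 1 = highest count, 2 = next distinct count, ...
distinct_counts <- sort(unique(as.vector(counts)), decreasing = TRUE)
dense <- match(as.vector(counts), distinct_counts)

# Keep every ID whose count is among the top two distinct counts
top_ids <- names(counts)[dense <= 2]
top_rows <- d[d$ID %in% top_ids, ]

print(top_ids)
print(top_rows)
```

With this data there are no ties, so both approaches select A and C; they would only differ if, say, two IDs shared the second-highest count.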
With the tidyverse and its top_n():
library(tidyverse)
d %>%
group_by(ID) %>%
summarise(n()) %>%
top_n(2)
Selecting by n()
# A tibble: 2 x 2
ID `n()`
<fct> <int>
1 A 3
2 C 4
To then subset the original data:
d %>%
group_by(ID) %>%
summarise(n()) %>%
top_n(2) %>%
{ filter(d, ID %in% .$ID) }
Selecting by n()
ID col
1 A blue
2 A purple
3 A green
4 C red
5 C blue
6 C yellow
7 C orange
(The braces are needed because the left-hand side result is not passed as the first argument of filter(); it is referenced as . inside the call instead.)

Determining most/least amount of occurrences within subset row & column group in a data frame

I am trying to find the most and least amount of items within a row / column group in a larger data frame. Here is the data to make it clearer:
df <- data.frame(matrix(nrow = 8, ncol = 3))
df$X1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
df$X2 <- c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange")
df$X3 <- c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA)
names(df) <- c("group", "A", "B")
Here is what that looks like (I have NAs in the original data, so I've included them):
group A B
1 1 yellow green
2 1 green yellow
3 1 yellow <NA>
4 2 blue blue
5 2 <NA> red
6 3 orange purple
7 3 <NA> orange
8 3 orange <NA>
In the first "group", for instance, I want to determine which color occurs the most and which color occurs the least. Something that looks like this:
group A B most least
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
I am working within a dplyr chain in the original data so I can group_by "group", but I am having a hard time figuring out a method that allows me to work within a "cluster" of two columns with differing numbers of rows. I do not need this to be done with dplyr, but I figured it might be easiest given the usefulness of group_by. Additionally, I need the result to somehow remain in the original data frame as new columns. Any suggestions?
Here is a solution using dplyr and tidyr. The strategy is to find the "most" and "least" item for each group in a new data frame, then use right_join to merge it with the original data frame and produce the desired output.
Note that I used slice to pick the most and least frequent items, which guarantees exactly one "most" and one "least" per group. Ties within a group are still possible, though. If that happens, you may want to decide on a rule for which item counts as the "most" or the "least".
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, na.rm = TRUE) %>%
count(group, Value) %>%
arrange(group, desc(n)) %>%
group_by(group) %>%
slice(c(1, n())) %>%
mutate(Type = c("most", "least")) %>%
select(-n) %>%
spread(Type, Value) %>%
right_join(df, by = "group") %>%
select(c(colnames(df), "most", "least"))
df2
# A tibble: 8 x 5
group A B most least
<dbl> <chr> <chr> <chr> <chr>
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
Two options:
Reshape to long form and use summarise (or count) to aggregate, subsetting the which.max/which.min:
library(tidyverse)
df <- tibble(group = c(1, 1, 1, 2, 2, 3, 3, 3), # tibble() replaces the deprecated data_frame()
A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
B = c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA))
df %>%
gather(var, color, A:B) %>%
drop_na(color) %>%
group_by(group, color) %>%
summarise(n = n()) %>%
summarise(most = color[which.max(n)],
least = color[which.min(n)]) %>%
left_join(df, .)
#> Joining, by = "group"
#> # A tibble: 8 x 5
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
Sort a table of values and subset it:
df %>%
group_by(group) %>%
mutate(most = last(names(sort(table(c(A, B))))),
least = first(names(sort(table(c(A, B))))))
#> # A tibble: 8 x 5
#> # Groups: group [3]
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
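For comparison, here is a package-free sketch of the same idea. It pools columns A and B within each group, tabulates the colors, and looks the results up again by group. The names stats and key are hypothetical, and the tie behaviour noted in the comments is simply what table() and which.max()/which.min() happen to do, not a deliberate rule.

```r
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 3),
  A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
  B = c("green", "yellow", NA, "blue", "red", "purple", "orange", NA),
  stringsAsFactors = FALSE
)

# For each group, pool columns A and B and tabulate the colors.
# table() drops NA by default and sorts names alphabetically, so on a tie
# which.max()/which.min() return the alphabetically first color.
stats <- lapply(split(df[c("A", "B")], df$group), function(g) {
  counts <- table(unlist(g))
  c(most = names(counts)[which.max(counts)],
    least = names(counts)[which.min(counts)])
})
stats <- do.call(rbind, stats)  # character matrix, one row per group

# Look up each row's group in the matrix to add the new columns
key <- as.character(df$group)
df$most <- unname(stats[key, "most"])
df$least <- unname(stats[key, "least"])

print(df)
```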

R - Test if value is the same as the one in the cell above

I have the following df:
name color
A red
B red
C green
D red
E red
F red
And I want to test the values in the 'color' column to see if they're the same as the values in the row above and write to a new column... I can do so using the following:
> df$same <- ifelse(df$color == df$color[c(NA,1:(nrow(df)-1))], 1, 0)
To give me:
name color same
A red NA
B red 1
C green 0
D red 0
E red 1
F red 1
But is there a cleaner way to do it? (I use this all the time)...
Adding to Rafael's answer, you can use ifelse with dplyr::mutate:
> dt <- tibble(name = c('A', 'B', 'C', 'D', 'E', 'F'), color = c('red', 'red', 'green', 'red', 'red', 'red'))
> dt
# A tibble: 6 x 2
name color
<chr> <chr>
1 A red
2 B red
3 C green
4 D red
5 E red
6 F red
> dt %>% mutate(same = ifelse(color == lag(color), 1, 0))
# A tibble: 6 x 3
name color same
<chr> <chr> <dbl>
1 A red NA
2 B red 1
3 C green 0
4 D red 0
5 E red 1
6 F red 1
You can try the lag function from the dplyr package. You can create a new column with the values from the row above and then compare them:
> dt$color_above <- lag(dt$color, n=1)
> dt
name color color_above
1 A red <NA>
2 B red red
3 C green red
4 D red green
5 E red red
6 F red red
Or, to solve the issue directly, you can use the pipe operators from the magrittr package. It is still verbose, but I think it keeps the code clearer.
> dt %$%
{ color == lag(color, n=1) } %>%
as.numeric() %>%
{.} -> dt$same
> dt
name color same
1 A red NA
2 B red 1
3 C green 0
4 D red 0
5 E red 1
6 F red 1
How about this in standard R (maybe a bit more readable, but not much shorter than yours):
colour <- c("red","red","green","red","red","red")
(c(NA, colour) == c(colour, NA))[1:length(colour)]
[1] NA TRUE FALSE FALSE TRUE TRUE
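Since the question mentions doing this all the time, wrapping the comparison in a small function may be the cleanest option. The helper below, same_as_previous(), is a hypothetical name and only a sketch; it assumes the input has at least one element.

```r
# Hypothetical helper: 1 if an element equals the one before it, 0 if not,
# and NA for the first element, which has no predecessor.
# Assumes x has at least one element.
same_as_previous <- function(x) {
  n <- length(x)
  # Compare elements 2..n with elements 1..(n-1), then pad the front with NA
  as.integer(c(NA, x[-1] == x[-n]))
}

colour <- c("red", "red", "green", "red", "red", "red")
same <- same_as_previous(colour)

print(same)
```

With a data frame, this could then be used as df$same <- same_as_previous(df$color).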

Resources