Finding matching or non-matching values in R

I have a medium-sized data frame like the one I made up below (though with several more columns), where I want to find whether any "id"s have different "letter"s.
I imagine that there is a simple way to do this, maybe with tidyr?
df <- data.frame("id" = c(1, 1, 2, 3, 3, 3, 4, 4, 5, 6, 6),
                 "letter" = c("f", "f", "r", "r", "k", "k", "k", "k", "r", "f", "r"))
EDIT: I am trying to find the "id"s that have more than one "letter", i.e. in this df, ids 3 and 6. I am less interested in which "letter"s they have (though it's not bad if they're shown) than in which "id"s.

Not sure of your desired output, but you can check how many distinct letters correspond to each id:
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(n_letters = n_distinct(letter))
# A tibble: 6 x 2
     id n_letters
  <dbl>     <int>
1     1         1
2     2         1
3     3         2
4     4         1
5     5         1
6     6         2
If you just want a vector of the ids with only 1 letter:
df %>%
  group_by(id) %>%
  summarise(n_letters = n_distinct(letter)) %>%
  filter(n_letters == 1) %>%
  pull(id)
[1] 1 2 4 5
Or if you want a df of the ids with more than 1 letter:
multiple_letter_ids <- df %>%
  group_by(id) %>%
  summarise(n_letters = n_distinct(letter)) %>%
  filter(n_letters > 1) %>%
  pull(id)
df %>% filter(id %in% multiple_letter_ids)
  id letter
1  3      r
2  3      k
3  3      k
4  6      f
5  6      r
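If you only need the rows (or just the ids) with more than one letter, here is a minimal sketch that skips the intermediate summary; it assumes the same df and that dplyr is loaded:
# keep every row whose id has more than one distinct letter
df %>%
  group_by(id) %>%
  filter(n_distinct(letter) > 1) %>%
  ungroup()
# or just the offending ids
df %>%
  group_by(id) %>%
  filter(n_distinct(letter) > 1) %>%
  distinct(id) %>%
  pull(id)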

Related

Distribution of Combination of Amounts by Group

I have a table that looks like this:
Id  Types
 1  A
 1  A
 1  A
 1  B
 2  A
 2  B
 3  A
 3  B
 4  A
 4  B
 4  B
What I would like to do is: 1. count, for every Id, the amount of A's and B's it has; 2. compute the distribution of every combination of those amounts of A and B.
So at the end of step 2 I should have the table:
Amount of A   Amount of B   Number of Different IDs
1             1             2
1             2             1
3             1             1
How can this be achieved?
Thank you.
Here's a solution with dplyr and tidyr:
library(dplyr)
library(tidyr)
# ...
# Code to generate your original table: "your_table".
# ...
result <- your_table %>%
  # Count the amount of each type for each Id.
  group_by(Id) %>% count(Types) %>% ungroup() %>%
  # "Pivot" the Types column, such that each type (here "A" and "B") gets its
  # own column (here "Amount of A" and "Amount of B") to hold its amount (as
  # calculated right above).
  pivot_wider(id_cols = Id,
              names_from = Types, names_prefix = "Amount of ",
              values_from = n) %>%
  # For each combination of amounts among those pivoted columns (i.e. all the
  # columns except "Id"), count how many distinct Ids there are.
  group_by(across(-Id)) %>%
  summarize("Number of Different IDs" = n_distinct(Id)) %>% ungroup()
# Print the result.
result
Given the example of your_table that you provided
your_table <- tibble::tribble(
  ~Id, ~Types,
  1, "A",
  1, "A",
  1, "A",
  1, "B",
  2, "A",
  2, "B",
  3, "A",
  3, "B",
  4, "A",
  4, "B",
  4, "B"
)
you should get the following result:
# A tibble: 3 x 3
  `Amount of A` `Amount of B` `Number of Different IDs`
          <int>         <int>                     <int>
1             1             1                         2
2             1             2                         1
3             3             1                         1
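For reference, a more compact route to the same result, sketched under the assumption of tidyr >= 1.1 (for the scalar values_fill) and dplyr >= 1.0 (for across() and .groups):
result <- your_table %>%
  # Amount of each type per Id, in one step.
  count(Id, Types) %>%
  # One "Amount of ..." column per type; combinations that never occur get 0.
  pivot_wider(names_from = Types, names_prefix = "Amount of ",
              values_from = n, values_fill = 0) %>%
  # Count distinct Ids for each combination of amounts.
  group_by(across(-Id)) %>%
  summarize("Number of Different IDs" = n_distinct(Id), .groups = "drop")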

Interpolation of values from list

I have a dataframe containing the results of a competition. In this example competitors b and c have tied for second place. The actual dataframe is very large and could contain multiple ties.
df <- data.frame(name = letters[1:4],
place = c(1, 2, 2, 4))
I also have point values for the respective places, where first place gets 4 points, 2nd gets 3, 3rd gets 1 and 4th gets 0.
points <- c(4, 3, 1, 0)
names(points) <- 1:4
I can match points to place to get each competitor's score:
df %>%
  mutate(score = points[place])
  name place score
1    a     1     4
2    b     2     3
3    c     2     3
4    d     4     0
What I would like to do though is award points to b and c that are the mean of the point values for 2nd and 3rd, such that each receives 2 points like this:
  name place score
1    a     1     4
2    b     2     2
3    c     2     2
4    d     4     0
How can I accomplish this programmatically?
A solution using nested data frames and purrr.
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(name = letters[1:4],
                 place = c(1, 2, 2, 4))
points <- c(4, 3, 1, 0)
names(points) <- 1:4
# a function to help expand the data frame based on the number of ties
expand_all <- function(x, n) {
  x:(x + n - 1)
}
df %>%
  group_by(place) %>%
  tally() %>%
  mutate(new_place = purrr::map2(place, n, expand_all)) %>%
  unnest(new_place) %>%
  mutate(score = points[new_place]) %>%
  group_by(place) %>%
  summarize(score = mean(score)) %>%
  inner_join(df)
Robert Wilson's answer gave me an idea. Rather than mapping over nested data frames, the rank() function from base R can get to the same result:
df %>%
  mutate(new_place = rank(place, ties.method = "first")) %>%
  mutate(score = points[new_place]) %>%
  group_by(place) %>%
  summarize(score = mean(score)) %>%
  inner_join(df)
  place score name
  <dbl> <dbl> <chr>
1     1     4 a
2     2     2 b
3     2     2 c
4     4     0 d
This can be accomplished in a few lines with an ifelse() statement inside of a mutate():
df %>%
  group_by(place) %>%
  mutate(n_ties = n()) %>%
  ungroup() %>%
  mutate(score = (points[place] + ifelse(n_ties > 1, 1, 0)) / n_ties)
# A tibble: 4 x 4
  name  place n_ties score
  <chr> <dbl>  <int> <dbl>
1 a         1      1     4
2 b         2      2     2
3 c         2      2     2
4 d         4      1     0
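The ifelse() adjustment above happens to work for this particular points vector; a sketch that generalizes the idea by averaging the points of every position a tied group occupies (assuming tied competitors always take consecutive places, as in standard competition ranking):
df %>%
  group_by(place) %>%
  # a tied group of size n() occupies places place, place + 1, ..., place + n() - 1;
  # every member receives the mean of the points for those positions
  mutate(score = mean(points[place + seq_len(n()) - 1])) %>%
  ungroup()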

Get last row of each group in R [duplicate]

I have some data similar in structure to:
a <- data.frame("ID" = c("A", "A", "B", "B", "C", "C"),
                "NUM" = c(1, 2, 4, 3, 6, 9),
                "VAL" = c(1, 0, 1, 0, 1, 0))
And I am trying to sort it by ID and NUM and then get the last row of each group.
This code summarises down to one row per unique ID, but it only returns the maximum NUM rather than the full last row like I want.
a <- a %>%
  arrange(ID, NUM) %>%
  group_by(ID) %>%
  summarise(max(NUM))
I understand why this code doesn't work, but I am looking for the dplyr way of getting the last row for each unique ID.
Expected Results:
  ID      NUM   VAL
  <fct> <dbl> <dbl>
1 A         2     0
2 B         4     1
3 C         9     0
Note: I will admit that though it is nearly a duplicate of Select first and last row from grouped data, the answers on that thread were not quite what I was looking for.
You might try:
a %>%
  group_by(ID) %>%
  arrange(NUM) %>%
  slice(n())
One dplyr option could be:
a %>%
  arrange(ID, NUM) %>%
  group_by(ID) %>%
  summarise_all(last)
  ID      NUM   VAL
  <fct> <dbl> <dbl>
1 A        2.    0.
2 B        4.    1.
3 C        9.    0.
Or since dplyr 1.0.0:
a %>%
  arrange(ID, NUM) %>%
  group_by(ID) %>%
  summarise(across(everything(), last))
Or using slice_max():
a %>%
  group_by(ID) %>%
  slice_max(order_by = NUM, n = 1)
tail() returns the last n elements of a subsettable object (6 by default). When using aggregate(), additional arguments for the FUN function are passed after the function, separated by commas; here the 1 is passed through as n = 1, which tells tail() to return only the last item of each group.
aggregate(a[, c('NUM', 'VAL')], list(a$ID), tail, 1)
#   Group.1 NUM VAL
# 1       A   2   0
# 2       B   3   0
# 3       C   9   0
Note that this takes the last row in the data's original order, which is why B shows NUM = 3 here; sort first (e.g. a <- a[order(a$ID, a$NUM), ]) to get the expected NUM = 4 row.
You can use top_n(): top_n(1, NUM) keeps the row with the largest NUM in each group, so no explicit sorting is needed.
library(dplyr)
a %>%
  group_by(ID) %>%
  top_n(1, NUM)
# # A tibble: 3 x 3
# # Groups:   ID [3]
#   ID      NUM   VAL
#   <fct> <dbl> <dbl>
# 1 A         2     0
# 2 B         4     1
# 3 C         9     0
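Another small sketch, assuming dplyr >= 1.0.0 (for slice_tail()), that returns the whole last row per group after sorting:
a %>%
  arrange(ID, NUM) %>%
  group_by(ID) %>%
  slice_tail(n = 1) %>%   # last row of each ID after sorting by NUM
  ungroup()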

How can we apply tidyr::spread() to all categorical variables at once, creating new columns for each level of each categorical variable? [duplicate]

I have a data frame with 3 categorical variables (x, y, z) along with an id column:
df <- frame_data(
  ~id, ~x, ~y, ~z,
  1, "a", "c", "v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")
I want to apply spread() to each of the categorical variables, grouped by id.
The output should look like this:
id a b c d v f
 1 1 1 1 1 1 1
 2 1 1 0 2 2 0
I tried, but I was only able to do it for one variable at a time, not all of them together.
For example, here is spread applied only to the y column (similarly, it can be done for x and z separately), but not to all of them in a single step:
df %>% count(id, y) %>% spread(y, n, fill = 0)
# A tibble: 2 x 3
     id     c     d
  <dbl> <int> <int>
   1.00     1     1
   2.00     0     2
Explaining my code in three steps:
Step 1: count frequency
df %>% count(id, y)
     id y         n
  <dbl> <chr> <int>
   1.00 c         1
   1.00 d         1
   2.00 d         2
Step 2 : applying spread()
df %>% count(id, y) %>% spread(y, n)
# A tibble: 2 x 3
     id     c     d
  <dbl> <int> <int>
1  1.00     1     1
2  2.00    NA     2
Step 3: Adding fill = 0 replaces the NA, which means there were zero occurrences of c in the y column for id 2 (as you can see in df).
df %>% count(id, y) %>% spread(y, n, fill = 0)
# A tibble: 2 x 3
     id     c     d
  <dbl> <int> <int>
   1.00     1     1
   2.00     0     2
Problem: In my actual data set I have 20 such categorical variables, and I can't do this one by one for all of them. I am looking to do it all at once.
Is it possible to apply spread() in tidyr to all of the categorical variables together? If not, can you please suggest an alternative?
Note: I also tried these answers, but they were not helpful for this particular case:
R spreading multiple columns with tidyr
Is it possible to use spread on multiple columns in tidyr similar to dcast?
Can spread() in tidyr spread across multiple value?
Expanding columns associated with a categorical variable into multiple columns with dplyr/tidyr while retaining id variable
An additional related question:
It is possible that two categorical columns (e.g. in a survey dataset) have the same values, like below.
df <- frame_data(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive,
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no", "yes")
# A tibble: 4 x 3
     id Do_you_Watch_TV Do_you_Drive
  <dbl> <chr>           <chr>
1  1.00 yes             yes
2  1.00 yes             no
3  2.00 yes             no
4  2.00 no              yes
Running the code below does not differentiate the counts of yes and no between 'Do_you_Watch_TV' and 'Do_you_Drive':
df %>% gather(Key, value, -id) %>%
  group_by(id, value) %>%
  summarise(count = n()) %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()
id no yes
 1  1   3
 2  2   2
Whereas the expected output should be:
id Do_you_Watch_TV_no Do_you_Watch_TV_yes Do_you_Drive_no Do_you_Drive_yes
 1                  0                   2               1                1
 2                  1                   1               1                1
So we need to treat the no and yes values from Do_you_Watch_TV and Do_you_Drive separately by adding a prefix: Do_you_Drive_yes, Do_you_Drive_no, Do_you_Watch_TV_yes, Do_you_Watch_TV_no.
How can we achieve this?
Thanks
First you need to convert your data frame to long format before you can transform it to wide format; hence, start with tidyr::gather. Afterwards, you have a couple of options:
Option#1: Using tidyr::spread:
#data
df <- frame_data(
  ~id, ~x, ~y, ~z,
  1, "a", "c", "v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")
library(tidyverse)
df %>% gather(Key, value, -id) %>%
  group_by(id, value) %>%
  summarise(count = n()) %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()
# id a b c d f v
# 1 1 1 1 1 1 1 1
# 2 2 1 1 0 2 0 2
Option#2: Another option is to use reshape2::dcast:
library(tidyverse)
library(reshape2)
df %>% gather(Key, value, -id) %>%
  dcast(id ~ value, fun.aggregate = length)
# id a b c d f v
# 1 1 1 1 1 1 1 1
# 2 2 1 1 0 2 0 2
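For newer code, the same idea can be sketched with the current tidyr verbs, assuming tidyr >= 1.1 (for pivot_longer/pivot_wider and the scalar values_fill):
df %>%
  pivot_longer(-id, names_to = "Key", values_to = "value") %>%  # long format
  count(id, value) %>%                                          # frequency per id/value
  pivot_wider(names_from = value, values_from = n, values_fill = 0)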
Edited: to include a solution for the 2nd data frame.
#Data
df1 <- frame_data(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive,
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no", "yes")
library(tidyverse)
df1 %>% gather(Key, value, -id) %>%
  unite("value", c(Key, value)) %>%
  group_by(id, value) %>%
  summarise(count = n()) %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()
# id Do_you_Drive_no Do_you_Drive_yes Do_you_Watch_TV_no Do_you_Watch_TV_yes
# 1 1 1 1 0 2
# 2 2 1 1 1 1
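And a corresponding sketch for the second data frame, under the same tidyr >= 1.1 assumption, using unite() to build the prefixed column names:
df1 %>%
  pivot_longer(-id, names_to = "Key", values_to = "value") %>%
  unite("value", Key, value) %>%                   # e.g. "Do_you_Drive_yes"
  count(id, value) %>%
  pivot_wider(names_from = value, values_from = n, values_fill = 0)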

dplyr count number of one specific value of variable

Say I have a dataset like this:
id <- c(1, 1, 2, 2, 3, 3)
code <- c("a", "b", "a", "a", "b", "b")
dat <- data.frame(id, code)
I.e.,
id code
1 1 a
2 1 b
3 2 a
4 2 a
5 3 b
6 3 b
Using dplyr, how would I get a count of how many a's there are for each id?
i.e.,
id countA
1 1 1
2 2 2
3 3 0
I'm trying stuff like this which isn't working,
countA <- dat %>%
  group_by(id) %>%
  summarise(cip.completed = count(code == "a"))
The above gives me an error, "Error: no applicable method for 'group_by_' applied to an object of class "logical""
Thanks for your help!
Try the following instead:
library(dplyr)
dat %>% group_by(id) %>%
  summarise(cip.completed = sum(code == "a"))
Source: local data frame [3 x 2]
     id cip.completed
  (dbl)         (int)
1     1             1
2     2             2
3     3             0
This works because the logical condition code == "a" is a series of TRUE/FALSE values, which sum() treats as ones and zeros, so the sum is the number of occurrences.
Note that you would not necessarily use dplyr::count inside summarise anyway, as it is a wrapper around summarise that calls either n() or sum() itself. See ?dplyr::count. If you really want to use count, you could first filter the dataset to retain only the rows in which code == "a"; count would then give you only the strictly positive (i.e. non-zero) counts. For instance,
dat %>% filter(code == "a") %>% count(id)
Source: local data frame [2 x 2]
id n
(dbl) (int)
1 1 1
2 2 2
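If you need the zero rows but prefer count(), here is a small sketch using tidyr::complete() to restore the missing ids (this assumes tidyr is also loaded and that your dplyr provides the name argument of count(), i.e. dplyr >= 0.8):
library(dplyr)
library(tidyr)
dat %>%
  filter(code == "a") %>%
  count(id, name = "countA") %>%                           # counts only for ids that have an "a"
  complete(id = unique(dat$id), fill = list(countA = 0))   # bring back ids with zero "a"s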
