My problem is as follows. I have a small data frame like this:
test_df <- data.frame(id=c(1,1,2,2,2), ttype=c("D", "C", "D", "D", "C"), val=c(1, 5, 10, 5, 100))
test_df
  id ttype val
1  1     D   1
2  1     C   5
3  2     D  10
4  2     D   5
5  2     C 100
Now I want to make it wider to end up like this:
  id   C  D n
1  1   5  1 2
2  2 100 15 3
So I want to replace the ttype column with one column per ttype value, each holding the sum of val for that id. I also want to keep track of how many rows (C and D combined) occurred for each id, which is n in this case.
Now I found a way to do this, but it is very ugly. But this way works:
test_df %>%
group_by(id, ttype) %>%
summarise(val = sum(val), n=n()) %>%
pivot_wider(names_from = ttype, values_from=c(val, n), values_fill=0) %>%
mutate(n=n_C+n_D) %>%
select(-n_C, -n_D)
results in:
# A tibble: 2 x 4
# Groups: id [2]
id val_C val_D n
<dbl> <dbl> <dbl> <int>
1 1 5 1 2
2 2 100 15 3
So here the counts of C and D are included separately, after which I sum them and remove both count columns. But this means I have to hardcode column names, which is not really doable when there are more than 2 values in ttype.
I feel like there must be a simple way to do this, but I can't figure it out.
You can add the number of rows per id as a new column with add_count(), then reshape to wide format with pivot_wider(), summing val through values_fn.
library(dplyr)
library(tidyr)
test_df %>%
add_count(id) %>%
pivot_wider(names_from = ttype, values_from = val, values_fn = sum)
# id n D C
# <dbl> <int> <dbl> <dbl>
#1 1 2 1 5
#2 2 3 15 100
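A minimal sketch of how the same idea extends to more than two ttype values without hardcoding any column names. The data frame demo_df and its third level "E" are made up for illustration, and values_fill = 0 and relocate() are optional additions that are not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: three ttype levels, and id 1 has no "E" rows
demo_df <- data.frame(
  id    = c(1, 1, 2, 2, 2, 2),
  ttype = c("D", "C", "D", "D", "C", "E"),
  val   = c(1, 5, 10, 5, 100, 7)
)

wide <- demo_df %>%
  add_count(id) %>%                  # n = number of rows per id
  pivot_wider(
    names_from  = ttype,
    values_from = val,
    values_fn   = sum,               # sum duplicates within each id/ttype pair
    values_fill = 0                  # id/ttype pairs that never occur become 0
  ) %>%
  relocate(n, .after = last_col())   # move n to the end, as in the desired output

wide
```

Because n is computed before reshaping, it counts rows regardless of how many distinct ttype values exist.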
I have a data frame like below
d1<-c('a','b','c','d','e','f','g','h','i','j','k','l')
d2<-c(1,5,1,2,13,2,32,2,1,2,4,5)
df1<-data.frame(d1,d2)
My goal is to split the rows into consecutive blocks of 3 and keep, from each block, the row with the largest value of d2. For the data above that would be rows b, e, g and l.
Thank you!
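We may use rollmax from zoo to filter the rows. Note that rollmax slides an overlapping window of width 3 over d2 rather than using fixed blocks, so it matches the block-wise result for this data but is not guaranteed to in general.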
library(dplyr)
library(zoo)
df1 %>%
filter(d2 == na.locf0(rollmax(d2, k = 3, fill = NA)))
d1 d2
1 b 5
2 e 13
3 g 32
4 l 5
You can create a grouping variable that puts observations into groups of 3. I first create a sequence from 1 to the total number of rows in steps of 3, then repeat each number of this sequence 3 times and subset the result to the length of the data, in case the number of observations is not perfectly divisible by 3. Then simply keep the rows whose d2 equals the largest d2 in their group.
library(dplyr)
df1 %>%
mutate(group = rep(seq(1, n(), by = 3), each = 3)[1:n()]) %>%
group_by(group) %>%
filter(d2 == max(d2))
# A tibble: 4 x 3
# Groups: group [4]
# d1 d2 group
# <chr> <dbl> <dbl>
# 1 b 5 1
# 2 e 13 4
# 3 g 32 7
# 4 l 5 10
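For comparison, the same block-wise grouping can be sketched in base R without dplyr. This is an alternative formulation, not the answer's own code; the names block and block_max are introduced here only for illustration.

```r
d1 <- c('a','b','c','d','e','f','g','h','i','j','k','l')
d2 <- c(1,5,1,2,13,2,32,2,1,2,4,5)
df1 <- data.frame(d1, d2)

# Block index 0,0,0,1,1,1,...; integer division works for any number of rows
block <- (seq_len(nrow(df1)) - 1) %/% 3

# ave() returns, for every row, the maximum of d2 within that row's block
block_max <- ave(df1$d2, block, FUN = max)

# Keep the rows that hold their block's maximum (ties keep all tied rows)
result <- df1[df1$d2 == block_max, ]
result
```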
Yet another solution:
library(tidyverse)
d1<-c('a','b','c','d','e','f','g','h','i','j','k','l')
d2<-c(1,5,1,2,13,2,32,2,1,2,4,5)
df1<-data.frame(d1,d2)
df1 %>%
mutate(id = ceiling(row_number() / 3)) %>%  # block index; also works when n() is not a multiple of 3
group_by(id) %>%
slice_max(d2) %>%
ungroup() %>% select(-id)
#> # A tibble: 4 × 2
#> d1 d2
#> <chr> <dbl>
#> 1 b 5
#> 2 e 13
#> 3 g 32
#> 4 l 5
The actual data frame consists of more than a million rows.
Say for example a dataframe is:
UniqueID Code Value OtherData
1 A 5 Z01
1 B 6 Z02
1 C 7 Z03
2 A 10 Z11
2 B 11 Z24
2 C 12 Z23
3 A 10 Z21
I want to obtain ratio of A/B. For example, for UniqueID 1, its ratio of A/B = 5/6.
Thus, I transform the original dataframe to:
UniqueID  A_Value  B_Value  Ratio_A/B
1         5
2         10
3         10
The question is: how do I look up each UniqueID in the original data frame and fill in its B value? If there is no B value, it should just be 0.
Thank you.
You can first drop the columns that are not needed, keep only the rows where Code is "A" or "B", reshape the data to wide format, and then create a new column holding A/B. Note that filling a missing B with 0 makes the ratio Inf, as for UniqueID 3 below.
library(dplyr)
library(tidyr)
df %>%
select(-OtherData) %>%
filter(Code %in% c("A", "B")) %>%
pivot_wider(names_from = Code, values_from = Value, values_fill = list(Value = 0)) %>%
#OR if you want to have NA values instead of 0 use
#pivot_wider(names_from = Code, values_from = Value) %>%
mutate(Ratio_A_B = A/B)
# UniqueID A B Ratio_A_B
# <int> <int> <int> <dbl>
#1 1 5 6 0.833
#2 2 10 11 0.909
#3 3 10 0 Inf
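A self-contained sketch of the same pipeline, with the question's table typed in as df. The if_else() guard is an assumption about what might be wanted instead of Inf when B is 0; the question does not say how the ratio itself should be handled in that case.

```r
library(dplyr)
library(tidyr)

# The example table from the question
df <- data.frame(
  UniqueID  = c(1, 1, 1, 2, 2, 2, 3),
  Code      = c("A", "B", "C", "A", "B", "C", "A"),
  Value     = c(5, 6, 7, 10, 11, 12, 10),
  OtherData = c("Z01", "Z02", "Z03", "Z11", "Z24", "Z23", "Z21")
)

ratios <- df %>%
  filter(Code %in% c("A", "B")) %>%        # drop codes that are not needed
  select(UniqueID, Code, Value) %>%
  pivot_wider(names_from = Code, values_from = Value, values_fill = 0) %>%
  # Assumption: report NA rather than Inf when there is no B value
  mutate(Ratio_A_B = if_else(B == 0, NA_real_, A / B))

ratios
```

Filtering before reshaping keeps the intermediate table small, which may matter with more than a million rows.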
I have a situation where I am trying to find the number of intersections with a vector per group in another tibble.
Data example
a <- tibble(EXPERIMENT = rep(c("a","b","c"),each =4),
ECOTYPE = rep(1:12))
b <- tibble(ECOTYPE = c(1,1,5,4,8,7,6,1,4,4,2,5,6,7,1))
I want to find the number of intersections between ECOTYPE in b and ECOTYPE per EXPERIMENT in a.
I wonder if I can use dplyr to solve this, as the group_by function seems to fit this problem, but when I run:
a %>%
group_by(EXPERIMENT) %>%
summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, .$ECOTYPE)))
I only get the total number of intersections between a and b.
Am I missing something?
Edit:
Sorry for not posting my desired output. I would like something like this:
# A tibble: 3 x 2
EXPERIMENT INTERSECTIONS
<chr> <dbl>
1 a 8
2 b 7
3 c 0
Depending on how you want to count, this will give the number of rows in b matching a:
b %>% mutate(b_flag = 1) %>%
right_join(a, by = "ECOTYPE") %>%
group_by(EXPERIMENT) %>%
summarize(INTERSECTIONS = sum(b_flag, na.rm = TRUE))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <dbl>
# 1 a 8
# 2 b 7
# 3 c 0
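I think the only problem with your code is the unnecessary .$ (inside the pipe, . refers to the whole data frame rather than the current group). Without it, though, you get the number of distinct ecotypes shared with b, ignoring the fact that b has four ECOTYPE = 1 rows, for example.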
a %>%
group_by(EXPERIMENT) %>%
summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, ECOTYPE)))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <int>
# 1 a 3
# 2 b 4
# 3 c 0
This is a result of how intersect works:
intersect(c(1, 2, 3), c(1, 1, 1))
# [1] 1
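Since intersect() drops duplicates, counting every matching row of b needs a different test. One possible sketch uses %in% inside summarise(); the column names DISTINCT and ROWS are made up here to show both counts side by side.

```r
library(dplyr)

a <- tibble(EXPERIMENT = rep(c("a", "b", "c"), each = 4),
            ECOTYPE = 1:12)
b <- tibble(ECOTYPE = c(1, 1, 5, 4, 8, 7, 6, 1, 4, 4, 2, 5, 6, 7, 1))

counts <- a %>%
  group_by(EXPERIMENT) %>%
  summarise(
    # distinct ecotypes of this experiment that occur in b
    DISTINCT = length(intersect(b$ECOTYPE, ECOTYPE)),
    # rows of b whose ecotype belongs to this experiment (duplicates counted)
    ROWS = sum(b$ECOTYPE %in% ECOTYPE)
  )

counts
```

This keeps experiment c in the result with a count of 0, without needing a join.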
Join the two and count how many are left:
inner_join(a,b, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% count()
# A tibble: 2 x 2
# Groups: EXPERIMENT [2]
EXPERIMENT n
<chr> <int>
1 a 8
2 b 7
Now, if you add an indicator column to b, you can start to count absences as well:
b %>% mutate(present=TRUE) %>% right_join(a, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% summarise(n(), missing=sum(is.na(present)))
# A tibble: 3 x 3
EXPERIMENT `n()` missing
<chr> <int> <int>
1 a 9 1
2 b 7 0
3 c 4 4
I have a data frame with a lot of variables (82), many of which are used for further calculations. I've tried to convert them to numeric, but working out the distinct values of every variable and then assigning numbers by hand is a huge amount of work.
I wonder if there's a more automated way of doing it, since I don't care which number is assigned to a value as long as it is not reused for another value.
My approach so far (for the sake of clarity, dummy data):
df <- data.frame(original.var1 = c("display","memory","software","display","disk","memory"),
original.var2 = c("skeptic","believer","believer","believer","skeptic","believer"),
original.var3 = c("round","square","triangle","cube","sphere","hexagon"),
original.var4 = c(10,20,30,40,50,60))
Given that this worked fine:
library(dplyr)
library(magrittr)
df$NEW1 <- as.numeric(interaction(df$original.var1, drop=TRUE))
I've tried to adapt to dplyr and pipes this way
df %<>% mutate(VAR1= as.numeric(interaction(original.var1, drop=TRUE))) %>%
mutate(VAR2= as.numeric(interaction(original.var2, drop=TRUE))) %>%
mutate(VAR3= as.numeric(interaction(original.var2, drop=TRUE)))
but the results are wrong from the third variable onwards:
df %>% dplyr::group_by(original.var1,VAR1) %>% tally()
# A tibble: 4 x 3
# Groups: original.var1 [?]
original.var1 VAR1 n
<fctr> <dbl> <int>
1 disk 1 1
2 display 2 2
3 memory 3 2
4 software 4 1
> df %>% dplyr::group_by(original.var2,VAR2) %>% tally()
# A tibble: 2 x 3
# Groups: original.var2 [?]
original.var2 VAR2 n
<fctr> <dbl> <int>
1 believer 1 4
2 skeptic 2 2
> df %>% dplyr::group_by(original.var3,VAR3) %>% tally()
# A tibble: 6 x 3
# Groups: original.var3 [?]
original.var3 VAR3 n
<fctr> <dbl> <int>
1 cube 1 1
2 hexagon 1 1
3 round 2 1
4 sphere 2 1
5 square 1 1
6 triangle 1 1
Is there an approach or package that recodes the values without the mapping being declared beforehand?
You can use mutate_if,
library(dplyr)
mutate_if(df, is.factor, ~ as.numeric(interaction(., drop = TRUE)))  # funs() is deprecated since dplyr 0.8.0
which gives,
original.var1 original.var2 original.var3 original.var4
1 2 2 3 10
2 3 1 5 20
3 4 1 6 30
4 2 1 1 40
5 1 2 4 50
6 3 1 2 60
Alternatively, if your columns are character rather than factor (the default for data.frame() since R 4.0.0, or when you read the data with stringsAsFactors = FALSE), use is.character instead; it's the same thing.
To address your comment: if you also want to keep your original columns, then
mutate_if(df, is.factor, list(new = ~ as.numeric(interaction(., drop = TRUE))))
Using purrr: keep only the factor columns and operate on them, then merge with the numeric columns at the end.
df %>% purrr::keep(is.factor) %>% mutate_all(~ as.numeric(interaction(., drop = TRUE)))
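With current dplyr the same recoding can be sketched with across(), which supersedes the scoped mutate_if()/mutate_all() verbs. The helper to_code and the "_code" suffix are hypothetical names chosen for this example. Incidentally, in the question's snippet VAR3 is computed from original.var2, which would explain why its codes repeat those of VAR2.

```r
library(dplyr)

df <- data.frame(
  original.var1 = c("display", "memory", "software", "display", "disk", "memory"),
  original.var2 = c("skeptic", "believer", "believer", "believer", "skeptic", "believer"),
  original.var3 = c("round", "square", "triangle", "cube", "sphere", "hexagon"),
  original.var4 = c(10, 20, 30, 40, 50, 60)
)

# Hypothetical helper: map each distinct value to an integer code
# (codes follow the sorted order of the distinct values)
to_code <- function(x) as.integer(factor(x))

recoded <- df %>%
  mutate(across(
    where(function(x) is.character(x) || is.factor(x)),  # text-like columns only
    to_code,
    .names = "{.col}_code"   # keep the originals and add *_code columns
  ))

recoded
```

Selecting on both character and factor columns makes the sketch independent of the stringsAsFactors default.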
I have read a few different posts about finding the difference between two different rows in R using dplyr. However, the posts I have seen do not give me quite what I want. I would like to find the difference between the times, and place that difference between n and n+1 in a new variable, on the same row as n, kind of like the duration between n and n+1. All other posts place the elapsed time on the same row as n+1.
Here is some sample data:
df <- read.table(text = c("
id time
1 1
1 4
1 7
2 5
2 10"), header = T)
My desired output:
# id time duration
# 1 1 3
# 1 4 3
# 1 7 NA
# 2 5 5
# 2 10 NA
I have the following code at the moment:
df %>% arrange(id, time) %>% group_by(id) %>% mutate(duration = time - lag(time))
Please let me know how I should change this around. Thanks!
You can use diff(), appending an NA to each group. Just change your mutate() call to
mutate(duration = c(diff(time), NA))
Edit: To clarify, the code above is only the mutate() call at the end of the pipe shown in the question. The entire operation, based on the code in the question, would then be
df %>%
arrange(id, time) %>%
group_by(id) %>%
mutate(duration = c(diff(time), NA))
# Source: local data frame [5 x 3]
# Groups: id [2]
#
# id time duration
# <dbl> <dbl> <dbl>
# 1 1 1 3
# 2 1 4 3
# 3 1 7 NA
# 4 2 5 5
# 5 2 10 NA
We can swap lag with lead and reverse the subtraction:
df %>%
group_by(id) %>%
mutate(duration = lead(time)- time)
# id time duration
# <int> <int> <int>
#1 1 1 3
#2 1 4 3
#3 1 7 NA
#4 2 5 5
#5 2 10 NA
A corresponding option in data.table would be shift with type = "lead"
library(data.table)
setDT(df)[, duration := shift(time, type = "lead") - time, by = id]
NOTE: In the example, 'id' and 'time' were already in order. If they are not, sort the data first, as the question does with arrange(id, time).
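A small sketch of why the ordering matters, using the question's values in a deliberately shuffled order (df_shuffled is a made-up name). Sorting before lead() gives the expected durations.

```r
library(dplyr)

# Same values as the question's sample data, deliberately shuffled
df_shuffled <- data.frame(
  id   = c(2, 1, 1, 2, 1),
  time = c(10, 7, 1, 5, 4)
)

result <- df_shuffled %>%
  arrange(id, time) %>%                     # sort first so lead() sees rows in time order
  group_by(id) %>%
  mutate(duration = lead(time) - time) %>%  # next time minus current time; NA on the last row of each id
  ungroup()

result
```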