Let's say that I have a dataset which consists of multiple observations. Sometimes a single observation is actually multple ones that have been condensed into one. To keep track of how many observations were merged together an integer-valued variable exists.
What I want to do is to reverse this process.
Example code:
library(tidyverse)
# Example tibble
df_ex <- tibble(
var1 = seq(1, 3),
var2 = c('Some', 'Random', 'Text'),
var3 = c(1, 3, 2)
)
The code above produces the following tibble:
# A tibble: 3 x 3
var1 var2 var3
<int> <chr> <dbl>
1 1 Some 1
2 2 Random 3
3 3 Text 2
The desired tibble after some tidyverse magic would be:
# A tibble: 6 x 3
var1 var2 var3
<dbl> <chr> <dbl>
1 1 Some 1
2 2 Random 1
3 2 Random 1
4 2 Random 1
5 3 Text 1
6 3 Text 1
There are multiple ways to do this in tidyverse
1) Do a group by 'var1' (assuming it is unique), create a list column for 'var3' by replicating 1 with the value of 'var3' and then unnest
df_ex %>%
group_by(var1) %>%
mutate(var3 = list(rep(1, var3))) %>%
unnest
2) Use map to get the list column for 'var3' and unnest
df_ex %>%
mutate(var3 = map(var3, ~ rep(1, .x))) %>%
unnest
3) With base R, replicate the sequence of rows to expand the data and then transform the 'var3' to 1
transform(df_ex[rep(seq_len(nrow(df_ex)), df_ex$var3),], var3 = 1)
Related
I've got a dataframe such as this:
df = data.frame(col1=c(1,1,1,2,2,2,3,3,3),
col2=as.factor(c('a','b','b','a','a','a','b','a','b')))
Then I extract all the categories (levels) related to each column:
levels_df = expand.grid(unique(df$col1), unique(df$col2))
colnames(levels_df)=c('col1','col2')
My objective now is to perform for the rows belonging to each pair of levels a function. How can I do that?
sapply(levels, FUN, dataset=df)
Any other strategy to perform the same task is accepted. The function operation could be whatever you like, for example a counting function (how many rows belong to each pair of levels), in which case the output would have this aspect:
In conclusion I want to susbset rows from a dataframe using each pair of levels, so I can manipulate those rows to perform a function ( such as nrows() )
You can skip the levels part, and just use dplyr to group by col1 and col2, then count the rows. Finally, we use complete to add in any combinations that don't appear in our dataset:
library(tidyverse)
df %>%
group_by(col1, col2) %>% # group df by col1 and col2
summarise(n = n()) %>% # make a new column, n, which is the count
complete(col1, col2, fill=list(n=0)) # Fill in missing pairs with 0
The output matches what you expected:
# A tibble: 6 x 3
# Groups: col1 [3]
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2
I‘m not sure if this specific count example will help you, but here‘s what you could do in the tidyverse:
library(tidyverse)
df %>%
group_by(col1, col2) %>%
count() %>%
ungroup() %>%
complete(col1, col2, fill = list(n = 0))
which gives:
# A tibble: 6 x 3
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2
I want to create a variable using dplyr that takes in a value conditional on another variable.
See example below.
data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1))
I want to create a variable 'baseline' that takes the content of variable 'value' where time = 1 and by group. As such the desired output would be
data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1),
baseline = c(1,1,3,3)))
Tried to run the following code with indexing but am clearly going wrong somewhere
x <- data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1))
x %>% group_by(group) %>%
mutate(baseline = .[[.$time==1,.$value]])
Thanks
We can use which.min
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(baseline = value[which.min(time)])
# A tibble: 4 x 4
# Groups: group [2]
# group time value baseline
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 2 1
#3 b 1 3 3
#4 b 2 4 3
and if it is already ordered by 'time', then simply use first
df1 %>%
group_by(group) %>%
mutate(baseline = first(value))
data
df1 <- data.frame(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1))
I have a dataset with 3 columns: ID, value a, and value b. I want to group the dataset based on the values in the ID column and then identify duplicates that have identical data in the value a and b columns between the different groupings.
I know that I can use the dplyr package and data %>% group_by (ID) to group my dataset based on the ID column. I also know that I can use data[duplicated(data[,2:3]),] to return all rows with duplicate data in rows 2 (value a) and 3 (value b).
However, I would like a function that can only finds duplicates between different ID groups instead of just duplicates within the whole dataset. I've tried combining group_by and duplicated, but it doesn't return the correct results. Which function would do this?
It was a little unclear if you wanted to return:
only the distinct rows
single examples of duplicated rows
all duplicated rows
So here are some options:
library(dplyr)
library(readr)
"ID,a,b
1, 1, 1
1, 1, 1
1, 1, 2
2, 1, 1
2, 1, 2" %>%
read_csv() -> exp_dat
# return only distinct rows
exp_dat %>%
distinct(ID, a, b)
# # A tibble: 4 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 2
# 3 2 1 1
# 4 2 1 2
# return single examples of duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 1 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# return all duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
add_count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 2 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 1
I have a dataframe with groups that essentially looks like this
DF <- data.frame(state = c(rep("A", 3), rep("B",2), rep("A",2)))
DF
state
1 A
2 A
3 A
4 B
5 B
6 A
7 A
My question is how to count the number of consecutive rows where the first value is repeated in its first "block". So for DF above, the result should be 3. The first value can appear any number of times, with other values in between, or it may be the only value appearing.
The following naive attempt fails in general, as it counts all occurrences of the first value.
DF %>% mutate(is_first = as.integer(state == first(state))) %>%
summarize(count = sum(is_first))
The result in this case is 5. So, hints on a (preferably) dplyr solution to this would be appreciated.
You can try:
rle(as.character(DF$state))$lengths[1]
[1] 3
In your dplyr chain that would just be:
DF %>% summarize(count_first = rle(as.character(state))$lengths[1])
# count_first
# 1 3
Or to be overzealous with piping, using dplyr and magrittr:
library(dplyr)
library(magrittr)
DF %>% summarize(count_first = state %>%
as.character %>%
rle %$%
lengths %>%
first)
# count_first
# 1 3
Works also for grouped data:
DF <- data.frame(group = c(rep(1,4),rep(2,3)),state = c(rep("A", 3), rep("B",2), rep("A",2)))
# group state
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 2 B
# 6 2 A
# 7 2 A
DF %>% group_by(group) %>% summarize(count_first = rle(as.character(state))$lengths[1])
# # A tibble: 2 x 2
# group count_first
# <dbl> <int>
# 1 1 3
# 2 2 1
No need of dplyrhere but you can modify this example to use it with dplyr. The key is the function rle
state = c(rep("A", 3), rep("B",2), rep("A",2))
x = rle(state)
DF = data.frame(len = x$lengths, state = x$values)
DF
# get the longest run of consecutive "A"
max(DF[DF$state == "A",]$len)
I recently had to compile a data frame of student scores (one row per student, id column and several integer-valued columns, one per score component). I had to combine a "master" data frame and several "correction" data frames (containing mostly NA and some updates to the master), so that the result contains the maximum values from the master, and all corrections.
I succeeded by copy-pasting a sequence of mutate() calls, which works (see example below), but is not elegant in my opinion. What I would have wanted to do, was instead of copying and pasting, to use something along the lines of map2 and two lists of columns to compare the columns pair-wise. Something like (which obviously does not work as such):
list_of_cols1 <- list(col1.x, col2.x, col3.x)
list_of_cols2 <- list(col1.y, col2.y, col3.y
map2(list_of_cols1, list_of_cols2, ~ column = pmax(.x, .y, na.rm=T))
I can't seem to be able to figure out to do it. My question is: how to specify such lists of columns and mutate them in one map2() call in dplyr pipe, or is it even possible – have I gotten it all wrong?
Minimum working example
library(tidyverse)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- reduce(
# Ultimately there would several correction data frames
list(master, correction1),
function(x,y) {
x <- x %>%
left_join(
y,
by = c("id")
) %>%
# Wish I knew how to do this mutate call with map2
mutate(
col1 = pmax(col1.x, col1.y, na.rm=T),
col2 = pmax(col2.x, col2.y, na.rm=T),
col3 = pmax(col3.x, col3.y, na.rm=T)
) %>%
select(id, col1:col3)
}
)
The result is
> result
# A tibble: 3 x 4
id col1 col2 col3
<int> <dbl> <dbl> <dbl>
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3
Rather than do a left_join, just bind the rows then summarize. For example
result <- reduce(
list(master, master),
function(x,y) {
bind_rows(x, y) %>%
group_by(id) %>%
summarize_all(max, na.rm=T)
}
)
result
# id col1 col2 col3
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2 3
# 2 2 1 2 3
# 3 3 2 3 3
Actually, you don't even need reduce as bind_rows can take a list
Adding another table
correction2 <- tibble(id=2,col1=NA,col2=8,col3=NA)
bind_rows(master, correction1, correction2) %>%
group_by(id) %>%
summarize_all(max, na.rm=T)
Sorry this doesn't answer your question about map2, I find it's easier to aggregate over rows than it is over columns in tidy R:
library(dplyr)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- list(master, correction1) %>%
bind_rows() %>%
group_by(id) %>%
summarise_all(max, na.rm = TRUE)
result
#> # A tibble: 3 x 4
#> id col1 col2 col3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 3
#> 2 2 1 2 3
#> 3 3 2 3 3
If correction tables will always have the same structure as master, you can do something like the following:
library(dplyr)
library(purrr)
update_master = function(...){
map(list(...), as.matrix) %>%
reduce(pmax, na.rm = TRUE) %>%
data.frame()
}
update_master(master, correction1)
To allow id to take character values, make the following modification:
update_master = function(x, ...){
map(list(x, ...), function(x) as.matrix(x[-1])) %>%
reduce(pmax, na.rm = TRUE) %>%
data.frame(id = x[[1]], .)
}
update_master(master, correction1)
Result:
id col1 col2 col3
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3