How to create a new column of a repeating sequence based on another column - r

I have the following dataframe:
Participant_ID Order
1 A
1 A
2 B
2 B
3 A
3 A
4 B
4 B
5 B
5 B
6 A
6 A
Every two rows refer to the same participant. I want to create a new column based on the value in the 'Order' column: if 'Order' == A, the two rows should get the values [1, 2]; if 'Order' == B, the two rows should get [2, 1].
The preferred output would be the following:
Participant_ID Order Period
1 A 1
1 A 2
2 B 2
2 B 1
3 A 1
3 A 2
4 B 2
4 B 1
5 B 2
5 B 1
6 A 1
6 A 2
Any help would be appreciated

Here are a couple of possibilities. These assume that the Order value is the same for a given Participant_ID; if that isn't the case, you will need additional logic (a sketch of one such approach appears after the output below).
You can use if_else:
library(tidyverse)
df %>%
  group_by(Participant_ID) %>%
  mutate(Period = if_else(Order == "A", 1:2, 2:1))
Or to explicitly check for multiple different values (e.g., "A", "B", etc.), have more flexibility, and include NA for other cases, you can use case_when:
df %>%
  group_by(Participant_ID) %>%
  mutate(Period = case_when(
    Order == "A" ~ 1:2,
    Order == "B" ~ 2:1,
    TRUE ~ NA_integer_
  ))
Output
Participant_ID Order Period
<int> <chr> <int>
1 1 A 1
2 1 A 2
3 2 B 2
4 2 B 1
5 3 A 1
6 3 A 2
7 4 B 2
8 4 B 1
9 5 B 2
10 5 B 1
11 6 A 1
12 6 A 2
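If the Order value could differ within a participant, a minimal sketch of such extra logic (assuming each participant still has exactly two rows) is to compute Period from the row position within the group:
library(dplyr)

df %>%
  group_by(Participant_ID) %>%
  # row_number() is 1, 2 within each pair; 3 - row_number() reverses it
  mutate(Period = if_else(Order == "A", row_number(), 3L - row_number())) %>%
  ungroup()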

Manipulating large dataset with dcast

Apologies if this is a repeat question but I could not find the specific answer I am looking for. I have a dataframe with counts of different species caught on a given trip. A simplified example with 5 trips and 4 species is below:
trip = c(1,1,1,2,2,3,3,3,3,4,5,5)
species = c("a","b","c","b","d","a","b","c","d","c","c","d")
count = c(5,7,3,1,8,10,1,4,3,1,2,10)
dat = cbind.data.frame(trip, species, count)
dat
> dat
trip species count
1 1 a 5
2 1 b 7
3 1 c 3
4 2 b 1
5 2 d 8
6 3 a 10
7 3 b 1
8 3 c 4
9 3 d 3
10 4 c 1
11 5 c 2
12 5 d 10
I am only interested in the counts of species b for each trip. So I want to manipulate this data frame so I end up with one that looks like this:
trip2 = c(1,2,3,4,5)
species2 = c("b","b","b","b","b")
count2 = c(7,1,1,0,0)
dat2 = cbind.data.frame(trip2, species2, count2)
dat2
> dat2
trip2 species2 count2
1 1 b 7
2 2 b 1
3 3 b 1
4 4 b 0
5 5 b 0
I want to keep all trips, including trips where species b was not observed. So I can't just subset the data by species b. I know I can cast the data so species are the columns and then just remove the columns for the other species like so:
library(dplyr)
library(reshape2)
test = dcast(dat, trip ~ species, value.var = "count", fun.aggregate = sum)
test
> test
trip a b c d
1 1 5 7 3 0
2 2 0 1 0 8
3 3 10 1 4 3
4 4 0 0 1 0
5 5 0 0 2 10
However, my real dataset has several hundred species caught on thousands of trips, and if I try to cast that many species to columns R chokes. There are way too many columns. Is there a way to specify in dcast that I only want to cast species b? Or is there another way to do this that doesn't require casting the data? Thank you.
Here is a data.table approach which I suspect will be very fast for you:
library(data.table)
setDT(dat)
result <- dat[, .(species = "b", count = sum(.SD[species == "b", count])), by = trip]
result
trip species count
1: 1 b 7
2: 2 b 1
3: 3 b 1
4: 4 b 0
5: 5 b 0
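The same idea can be written a little more compactly without .SD (a sketch, equivalent on this data): count[species == "b"] is empty for trips with no species b, and sum() of an empty vector is 0, so every trip is retained.
dat[, .(species = "b", count = sum(count[species == "b"])), by = trip]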
We can use the tidyverse:
library(dplyr)
library(tidyr)
dat %>%
  filter(species == 'b') %>%
  group_by(trip, species) %>%
  summarise(count = sum(count)) %>%
  ungroup %>%
  complete(trip = unique(dat$trip), fill = list(species = 'b', count = 0))
# A tibble: 5 x 3
# trip species count
# <dbl> <chr> <dbl>
#1 1 b 7
#2 2 b 1
#3 3 b 1
#4 4 b 0
#5 5 b 0
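For completeness, a base R sketch of the same result (an addition, not part of the original answers): zero out the counts of the other species, then sum per trip.
# non-b counts become 0 instead of being dropped, so every trip survives
dat_b <- transform(dat, count = count * (species == "b"))
res <- aggregate(count ~ trip, data = dat_b, FUN = sum)
res$species <- "b"
res
#   trip count species
# 1    1     7       b
# 2    2     1       b
# 3    3     1       b
# 4    4     0       b
# 5    5     0       b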

lump factor based on another column

The example shows measurements of the production output of different factories, where the first column denotes the factory and the last column the amount produced.
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production)
df
factory production
1 A 15
2 A 2
3 B 1
4 B 1
5 B 2
6 B 1
7 B 2
8 C 20
9 D 5
Now I want to lump together the factories into fewer levels, based on their total output in this data set.
With the normal forcats::fct_lump, I can lump them by the number of rows in which they appear, e.g. to make 3 levels:
library(tidyverse)
df %>% mutate(factory = fct_lump(factory, 2))
factory production
1 A 15
2 A 2
3 B 1
4 B 1
5 B 2
6 B 1
7 B 2
8 Other 20
9 Other 5
but I want to lump them based on sum(production), retaining the top n = 2 factories (by total output) and lumping the remaining factories. Desired result:
1 A 15
2 A 2
3 Other 1
4 Other 1
5 Other 2
6 Other 1
7 Other 2
8 C 20
9 Other 5
Any suggestions?
Thanks!
The key here is to choose a specific rule for grouping factories together based on their total production. Which rule makes sense depends on the actual values in your (real) dataset.
Option 1
Here's an example that groups together factories whose total production is 15 or less. If you want a different grouping, modify the threshold (e.g. use 18 instead of 15):
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = F)
library(dplyr)
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(sum(production) > 15, factory, "Other")) %>%
  ungroup()
# # A tibble: 9 x 3
# factory production factory_new
# <chr> <dbl> <chr>
# 1 A 15 A
# 2 A 2 A
# 3 B 1 Other
# 4 B 1 Other
# 5 B 2 Other
# 6 B 1 Other
# 7 B 2 Other
# 8 C 20 C
# 9 D 5 Other
I'm creating factory_new without removing the (original) factory column.
Option 2
Here's an example where you rank the factories by their total production, pick how many top factories to keep as they are, and group the rest:
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = F)
library(dplyr)
# get the factories ranked by total production
df %>%
  group_by(factory) %>%
  summarise(SumProd = sum(production)) %>%
  arrange(desc(SumProd)) %>%
  pull(factory) -> vec_top_factories

# input how many top factories you want to keep;
# the rest will be grouped together
n = 2

# apply the grouping based on the n provided
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(factory %in% vec_top_factories[1:n], factory, "Other")) %>%
  ungroup()
# # A tibble: 9 x 3
# factory production factory_new
# <chr> <dbl> <chr>
# 1 A 15 A
# 2 A 2 A
# 3 B 1 Other
# 4 B 1 Other
# 5 B 2 Other
# 6 B 1 Other
# 7 B 2 Other
# 8 C 20 C
# 9 D 5 Other
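A compact variant of Option 2 (a sketch, not from the original answer) computes the totals and the rank inside one pipeline, avoiding the separate lookup vector:
library(dplyr)

df %>%
  group_by(factory) %>%
  mutate(total = sum(production)) %>%   # total output per factory
  ungroup() %>%
  mutate(factory_new = ifelse(dense_rank(desc(total)) <= 2, factory, "Other")) %>%
  select(-total)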
Just specify the weight argument w:
df %>%
  mutate(factory = fct_lump_n(factory, 2, w = production))
factory production
1 A 15
2 A 2
3 Other 1
4 Other 1
5 Other 2
6 Other 1
7 Other 2
8 C 20
9 Other 5
Note: use forcats::fct_lump_n because the generic fct_lump is no longer recommended.
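If a production threshold fits your data better than a top-n count, forcats also offers fct_lump_min, which accepts the same weight argument (a sketch; min = 16 is an arbitrary cutoff matching Option 1's "more than 15" rule):
library(forcats)

# lump factories whose total production (the weight) is below 16
df %>%
  mutate(factory = fct_lump_min(factory, min = 16, w = production))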
We could use base R as well, creating a logical condition with ave:
df$factory_new <- "Other"
i1 <- with(df, ave(production, factory, FUN = sum) > 15)
df$factory_new[i1] <- df$factory[i1]

delete a row where the order is wrong within a group

I have a dataset with about 1000 groups, and each group is ordered from 1-100 (the order values can be any numbers within 100).
As I was looking through the data, I found that some groups had bad orders, i.e., the order would run up toward 100 and then suddenly a 24 would show up.
How can I delete all of these erroneous rows?
I would like to find all rows that don't follow the increasing order within their group and delete them.
Any help would be great!
lag gives the previous value, so diff = order - lag(order) is the change from each row to the next; we keep only the rows where this difference is non-negative, i.e. the current value is not smaller than the previous one. Since lag leaves the first value of each group as NA, the is.na(diff) condition keeps that first row. I keep the helper column diff so you can check the result, but you can drop it with %>% select(-diff)
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(diff = order - lag(order)) %>%
  filter(is.na(diff) | diff >= 0)
# A tibble: 8 x 3
# Groups: group [2]
group order diff
<int> <int> <int>
1 1 1 NA
2 1 3 2
3 1 5 2
4 1 10 5
5 2 1 NA
6 2 4 3
7 2 4 0
8 2 8 4
Data
df1 <- read.table(text="
group order
1 1
1 3
1 5
1 10
1 2
2 1
2 4
2 4
2 8
2 3
",header=T, stringsAsFactors = F)
Assuming the order column increments by 1 every time, we can use ave and remove the rows whose difference from the previous row within the group is not 1.
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) == 1, ]
# group order
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#6 2 1
#7 2 2
#8 2 3
#9 2 4
EDIT
For the updated example, we can just change the comparison
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) >= 0, ]
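One caveat worth adding (an observation, not part of the original answer): a single pass compares each row only with its immediate predecessor, so after a bad row is dropped the remainder can still be out of order (e.g. 1, 10, 2, 3 keeps 1, 10, 3). A hedged sketch that repeats the filter until the result is stable:
clean <- df1
repeat {
  keep <- with(clean, ave(order, group, FUN = function(x) c(1, diff(x))) >= 0)
  if (all(keep)) break  # stop once every within-group step is non-negative
  clean <- clean[keep, ]
}
clean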
Playing with data.table:
library(data.table)
setDT(df1)[, diffo := c(1, diff(order)), group][diffo == 1, .(group, order)]
group order
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 2 1
6: 2 2
7: 2 3
8: 2 4
Where df1 is:
df1 <- data.frame(
  group = rep(1:2, each = 5),
  order = c(1:4, 2, 1:4, 3)
)
EDIT
If you only need increasing order, rather than steps of one, you can do:
df3 <- transform(df1, order = c(1,3,5,10,2,1,4,7,9,3))
setDT(df3)[, diffo := c(1, diff(order)), group][diffo >= 1, .(group, order)]
group order
1: 1 1
2: 1 3
3: 1 5
4: 1 10
5: 2 1
6: 2 4
7: 2 7
8: 2 9

R group by key get max value for multiple columns

I want to do something like this:
How to make a unique in R by column A and keep the row with maximum value in column B
Except my data.table has one key column, and multiple value columns. So say I have the following:
a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1
If the key is column a, I want, for each unique a, to return the row with the maximum b; if there is more than one row with the maximum b, take the one with the maximum c, and so on across the columns. So the result should be:
a b c
1: 1 2 2
2: 2 3 3
3: 3 2 1
I'd also like this to be done for an arbitrary number of columns. So if my data.table had 20 columns, I'd want the max function to be applied in order from left to right.
Here is a suggested data.table solution. You might want to consider using data.table::frankv as follows:
DT[, .SD[which.max(frankv(.SD, ties.method = "first"))], by = a]
frankv ranks the rows by all columns at once; which.max then gives the position of the highest-ranked row, and .SD[...] subsets the group to that row.
Please let me know if it fails for your larger dataset.
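A related data.table route (a sketch, assuming the original row order need not be preserved) is to sort on every column and then take the last row of each group:
library(data.table)
DT <- as.data.table(df)

# after sorting on all columns, the last row within each `a` is the maximum
setorderv(DT, names(DT))
DT[, .SD[.N], by = a]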
To make this work for any number of columns, a possible dplyr solution is to use arrange_all:
df <- data.frame(a = c(1,1,1,2,2,2,3,3),
                 b = c(1,2,2,1,2,3,1,2),
                 c = c(1,1,2,1,5,3,4,1))
df %>% group_by(a) %>% arrange_all() %>% filter(row_number() == n())
# A tibble: 3 x 3
# Groups: a [3]
# a b c
# 1 1 2 2
# 2 2 3 3
# 3 3 2 1
A generic solution for an arbitrary number of columns can be achieved using arrange_at. In the example below, c("a","b","c") are the arbitrary columns.
library(dplyr)
df %>%
  arrange_at(.vars = vars(c("a", "b", "c"))) %>%
  mutate(changed = a != lead(a)) %>%
  filter(is.na(changed) | changed) %>%
  select(-changed)
a b c
1 1 2 2
2 2 3 3
3 3 2 1
Another option could be to use max with dplyr as below: first group_by a and filter the rows with the maximum value of b, then group_by both a and b and filter the rows with the maximum value of c.
library(dplyr)
df %>%
  group_by(a) %>%
  filter(b == max(b)) %>%
  group_by(a, b) %>%
  filter(c == max(c))
# Groups: a, b [3]
# a b c
# <int> <int> <int>
#1 1 2 2
#2 2 3 3
#3 3 2 1
Data
df <- read.table(text = "a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1", header = TRUE, stringsAsFactors = FALSE)
dat <- data.frame(a = c(1,1,1,2,2,2,3,3),
                  b = c(1,2,2,1,2,3,1,2),
                  c = c(1,1,2,1,5,3,4,1))
library(sqldf)
sqldf("with d as (select * from 'dat' group by a order by b, c desc) select * from d order by a")
a b c
1 1 2 2
2 2 3 3
3 3 2 1
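Finally, a base R sketch of the same idea (an addition, not from the answers above): order the data frame by every column, then keep the last row for each value of a.
# order rows by all columns (a, then b, then c, ...)
sorted <- df[do.call(order, df), ]

# the last row within each value of `a` is the group maximum
sorted[!duplicated(sorted$a, fromLast = TRUE), ]
#   a b c
# 3 1 2 2
# 6 2 3 3
# 8 3 2 1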

Retain rows up to first occurrence of a value in a column, by group. Groups without value allowed

I have a data frame like this one:
> df
id type
1 1 a
2 1 a
3 1 b
4 1 a
5 1 b
6 2 a
7 2 a
8 2 b
9 3 a
10 3 a
I want to keep all rows for each group (id) up to the first occurrence of value 'b' in the type column. For groups without type 'b', I want to keep all their rows.
The resulting data frame should look like this:
> dfnew
id type
1 1 a
2 1 a
3 1 b
4 2 a
5 2 a
6 2 b
7 3 a
8 3 a
I tried the following code, but it retains additional rows with the value 'a' beyond the first occurrence of 'b', and only excludes additional occurrences of 'b', which is not what I want. Look at row 4 in the following output; I want to get rid of it.
> df %>% group_by(id) %>% filter(cumsum(type == 'b') <= 1)
Source: local data frame [9 x 2]
Groups: id
id type
1 1 a
2 1 a
3 1 b
4 1 a
5 2 a
6 2 a
7 2 b
8 3 a
9 3 a
You could combine match or which with slice, or (as mentioned by @Richard) use which.max:
library(dplyr)
df %>%
  group_by(id) %>%
  slice(if (any(type == "b")) 1:which.max(type == "b") else row_number())
# Source: local data table [8 x 2]
# Groups: id
#
# id type
# 1 1 a
# 2 1 a
# 3 1 b
# 4 2 a
# 5 2 a
# 6 2 b
# 7 3 a
# 8 3 a
Or you could try it with data.table
library(data.table)
setDT(df)[, if(any(type == "b")) .SD[1:which.max(type == "b")] else .SD, by = id]
# id type
# 1: 1 a
# 2: 1 a
# 3: 1 b
# 4: 2 a
# 5: 2 a
# 6: 2 b
# 7: 3 a
# 8: 3 a
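For completeness, a small dplyr sketch (an addition, not among the original answers) that stays close to the cumsum attempt in the question: lagging the cumulative count keeps rows up to and including the first 'b'.
library(dplyr)

df %>%
  group_by(id) %>%
  # lag(..., default = 0L) shifts the running count of "b"s down one row,
  # so the row holding the first "b" still sees 0 and is kept
  filter(lag(cumsum(type == "b"), default = 0L) < 1) %>%
  ungroup()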
