retain the first few duplicates based on a column value - r

I have time series data with 5 columns. The first column is the user name, and the rest are values at different points in time. Sample data is shown here - https://pastebin.com/raw/TzmhKybt
I want to retain the first 3 values for any given user and remove the rest, so that every user has at most 3 records in the data set. I have tried the following, but it does not seem to work. Please point me in the right direction, as I could not find a good way to do this.
data %>% group_by(User) %>% top_n(3)
Output of dput(data[1:10,]) is
structure(list(User = c("mmcclafl", "mmcclafl", "mmcclafl", "mmcclafl",
"mmcclafl", "mmcclafl", "gsnabwez", "gsnabwez", "gsnabwez", "gsnabwez"
), StartTime = c(584.93, 584.93, 584.93, 584.93, 584.93, 584.93,
1501.26, 1501.26, 1501.26, 1501.26), Time = c(597.94, 675.28,
774.02, 843.05, 1093.79, 1142.85, 1510.94, 1582.81, 1665.26,
1689.91), SelfReport = c("FLOW", "FLOW", "FLOW", "FRUSTRATION",
"FRUSTRATION", "FRUSTRATION", "FLOW", "FRUSTRATION", "FRUSTRATION",
"FRUSTRATION"), Affectiva = c("BOREDOM", "BOREDOM", "BOREDOM",
"BOREDOM", "BOREDOM", "BOREDOM", "BOREDOM", "BOREDOM", "OTHER",
"BOREDOM")), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L,
11L), class = "data.frame")

You could select the first 3 rows of each group as follows.
Using dplyr
library(dplyr)
data %>% group_by(User) %>% slice(1:3)
# User StartTime Time SelfReport Affectiva
# <chr> <dbl> <dbl> <chr> <chr>
#1 gsnabwez 1501. 1511. FLOW BOREDOM
#2 gsnabwez 1501. 1583. FRUSTRATION BOREDOM
#3 gsnabwez 1501. 1665. FRUSTRATION OTHER
#4 mmcclafl 585. 598. FLOW BOREDOM
#5 mmcclafl 585. 675. FLOW BOREDOM
#6 mmcclafl 585. 774. FLOW BOREDOM
In base R
subset(data, ave(StartTime, User, FUN = seq_along) <= 3)
and in data.table
library(data.table)
setDT(data)[, .SD[1:3], by=User]
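As for why the original attempt failed: top_n() picks the rows with the *largest* values of the last column (including ties), not the first rows per group. In dplyr 1.0.0 and later, slice_head() states the intent directly; a minimal sketch on made-up data:
```r
library(dplyr)

toy <- data.frame(User = rep(c("a", "b"), each = 5), Time = 1:10)

# slice_head() keeps the first n rows within each group
toy %>%
  group_by(User) %>%
  slice_head(n = 3) %>%
  ungroup()
```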

Related

Keep columns while grouping

I am a beginner in R and was looking for help online, but the examples I found under similar titles don't quite fit my needs, because they only deal with a few columns.
I have a data.frame T1 with over 100 columns, and what I am looking for is something like a summary, but I want to retain every other column alongside the summary. I thought about using aggregate, but I am not sure it fits here. The most promising approach I have found is shown below.
T2 <- T1 %>% group_by(ID) %>% summarise(AGI = paste(AGI, collapse = "; "))
The summary works the way I want, but I lose every other column.
I definitely appreciate any kind of advice! Thank you very much.
Expanding on TTS's comment: if you want to keep the other columns, you have to use mutate instead of summarise because, as the documentation says, summarise() creates a new data frame.
You should therefore use
T1 %>% group_by(ID) %>% mutate(AGI = paste(AGI, collapse = "; ")) %>% ungroup()
Data
T1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 3L, 4L), UniProt_Accession = c("P25702",
"F4HWZ6", "Q9C5M0", "Q9SR37", "Q9LKR7", "Q9FXI7"), AGI = c("AT1G54630",
"AT1G54630", "AT5G19760", "AT3G09260", "AT5G28510", "AT1G19890"
)), class = "data.frame", row.names = c(NA, -6L))
Output
# A tibble: 6 x 3
# ID UniProt_Accession AGI
# <int> <chr> <chr>
# 1 1 P25702 AT1G54630; AT1G54630
# 2 1 F4HWZ6 AT1G54630; AT1G54630
# 3 2 Q9C5M0 AT5G19760
# 4 3 Q9SR37 AT3G09260; AT5G28510
# 5 3 Q9LKR7 AT3G09260; AT5G28510
# 6 4 Q9FXI7 AT1G19890
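If instead one row per ID is wanted (rather than repeating the collapsed string on every row), the other columns can be kept explicitly inside summarise(), for instance by taking their first value per group. A sketch using the T1 data above; choosing first() is an assumption about which representative value you want:
```r
library(dplyr)

T1 %>%
  group_by(ID) %>%
  summarise(
    UniProt_Accession = first(UniProt_Accession),  # one representative value per ID
    AGI = paste(AGI, collapse = "; ")
  )
```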

How to select the members who have transacted on the last date of every month from a large dataset

I have data on members who transact every day, and I need a list of all the members who have transacted on the last date of each month for the whole year.
My output needs to be a list of members (with all the columns) who have transacted on 31-Jan-2019, 28-Feb-2019, and so on up to 31-Dec-2019.
# last calendar day of every month present in the data
library(lubridate)
month.ends <- unique(ceiling_date(df$date, "month", change_on_boundary = TRUE) - 1)
df %>% filter(date %in% month.ends)
If you want just the unique members with some other columns, you can then use the distinct() function.
Okay, here is some pretty inefficient code (I'm relatively new to R), but I think it works if you only want to do this for the year 2019.
#create a manual dataframe with the last days of the months in 2019
LastDays <- structure(list(Date = structure(c(7L, 2L, 10L, 4L, 11L, 5L,
12L, 13L, 6L, 8L, 3L, 1L, 9L), .Label = c("10-12-2019", "28-2-2019",
"30-11-2019", "30-4-2019", "30-6-2019", "30-9-2019", "31-1-2019",
"31-10-2019", "31-12-2019", "31-3-2019", "31-5-2019", "31-7-2019",
"31-8-2019"), class = "factor")), class = "data.frame", row.names = c(NA,
-13L))
#remove transactions on other dates in a new dataframe
df_subset <- df[which(df$Date %in% LastDays$Date),]
#find Members which did transactions on all the last days of the month
Members <- df_subset %>%
  distinct(Member, Date) %>%  # one row per member per month-end date
  count(Member) %>%           # number of month-end dates per member
  filter(n > 11) %>%          # keep members seen on (nearly) all of them
  pull(Member)
#The information of all the members which transacted on all last dates of the year
df[which(df$Member %in% Members),]
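If "last date" may also mean the latest transaction date actually present in each month (useful when no one transacts on the calendar month end), a grouped filter avoids building the date list by hand. A sketch assuming a Date-class column; the example data here is made up:
```r
library(dplyr)
library(lubridate)

df <- data.frame(
  Member = c("A", "B", "A", "C"),
  Date   = as.Date(c("2019-01-31", "2019-01-15", "2019-02-28", "2019-02-28"))
)

# within each calendar month, keep rows falling on the latest observed date
df %>%
  group_by(month = floor_date(Date, "month")) %>%
  filter(Date == max(Date)) %>%
  ungroup() %>%
  select(-month)
```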

find duplicates with grouped variables

I have a df that looks like this:
from to group
1 2 metro
2 4 metro
3 4 metro
4 5 train
6 1 train
8 7 train
I want to find the ids which exist in more than one group. The expected result for the sample df is: 1 and 4, because they exist in both the metro and the train groups. I guess this can be done with dplyr and duplicated(), but I don't know how to address multiple columns while distinguishing by a grouping variable.
Thank you in advance!
Using base R, we can split the values of the first two columns by group and find the values common to the groups using intersect:
Reduce(intersect, split(unlist(df1[1:2]), rep(df1$group, 2)))
#[1] 1 4
We gather the 'from' and 'to' columns into 'long' format, group by 'val', keep the groups having more than one distinct 'group', then pull the unique 'val' elements:
library(dplyr)
library(tidyr)
df1 %>%
  gather(key, val, from:to) %>%
  group_by(val) %>%
  filter(n_distinct(group) > 1) %>%
  distinct(val) %>%
  pull(val)
#[1] 1 4
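gather() is superseded in current tidyr; the same pipeline with pivot_longer(), a sketch with the df1 from the data section:
```r
library(dplyr)
library(tidyr)

df1 %>%
  pivot_longer(c(from, to), values_to = "val") %>%
  group_by(val) %>%
  filter(n_distinct(group) > 1) %>%
  distinct(val) %>%
  pull(val)
#[1] 1 4
```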
Or, using base R, we can use table to find the frequency and get the ids out of it:
out <- with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"
data
df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L,
4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train",
"train", "train")), class = "data.frame", row.names = c(NA, -6L
))
Convert the data to long format and count unique values, using data.table. melt converts to long format, and data.table allows filtering in the i part of DT[i, j, by], grouping in the by part, and computing/extracting in the j part.
library(data.table)
library(magrittr)
setDT(df1)
melt(df1, 'group') %>%
  .[, .(n = uniqueN(group)), value] %>%
  .[n > 1, unique(value)]
# [1] 1 4

Split column into intervals based on row content

I am trying to convert a single-column data frame into separate columns. The main descriptor in the data is the "item number", followed by information on the price, date, color, etc. I would just split the column by row number, but since each item has a different amount of information, that doesn't really work.
I've been playing around with this a bit but haven't found anything that comes close. I can't use regex to create a separate column (using str_which, for example) since the information differs so much from item to item. How can I use regex to create intervals that I can then split the column on? I need the information between each row containing "item" in a separate column. Sample data is below.
data
item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017
Thank you!
Here is a function to reformat your data depending on how you want the final dataset to look. You supply the data frame DF, the variable var, a vector of column names in the correct order colnames, and byitem to choose the output format (the default TRUE outputs a data frame with one row per item):
library(tidyverse)
df_transform <- function(DF, var, colnames, byitem = TRUE){
  # spread by item (one row per item) or by field (one row per field)
  if(byitem){
    ID <- sym("rowid")
  }else{
    ID <- sym("id")
  }
  DF %>%
    # start a new group at each row containing "item"
    group_by(id = paste0("item", cumsum(grepl("item", {{ var }})))) %>%
    # label the rows following each marker with the supplied column names
    mutate(rowid = c(NA, colnames[seq_len(n() - 1)])) %>%
    # drop the "item" marker rows themselves
    filter(!grepl("item", {{ var }})) %>%
    spread(!!ID, {{ var }})
}
Output:
> df_transform(df, var, c("price", "color", "date"))
# A tibble: 3 x 4
# Groups: id [3]
id color date price
<chr> <fct> <fct> <fct>
1 item1 red <NA> $600
2 item2 <NA> <NA> $70
3 item3 orange 10/11/2017 $430
> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)
# A tibble: 3 x 4
rowid item1 item2 item3
<chr> <fct> <fct> <fct>
1 color red <NA> orange
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430
Note that this would not work if you have missing values in the middle, since the column names are assigned by position.
Data:
df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L,
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1",
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA,
-9L))
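The same grouping idea works in base R if all you need is the raw pieces: split the column on the "item" markers and you get one character vector per item. A sketch using the df defined above:
```r
vals <- as.character(df$var)

# each list element holds one item's marker row plus the rows that follow it
pieces <- split(vals, cumsum(grepl("item", vals)))
```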

Dynamic Grouping in R | Grouping based on condition on applied function

In R's aggregate() function, how do I specify a stopping condition for the function applied to each group?
For example, I have data-frame like this: "df"
Input Data frame
Note: each row in the input data frame denotes a single ball played by a player in that match, so counting rows tells us the number of balls required.
And, I want my data frame like this one: Output data frame
My need is: How many balls are required to score 10 runs?
Currently, I am using this R code:
group_data <- aggregate(df$score, by=list(Category=df$player,df$match), FUN=sum,na.rm = TRUE)
Using this code, I cannot stop the aggregation where I want; it always sums over all the rows of each group, but I don't want to consider all rows.
But How to put constraint like "Stop grouping as soon as score >= 10"
By putting this constraint, my sole purpose is to count the number of rows satisfying this condition.
Thanks in advance.
Here is one option using dplyr
library(dplyr)
df1 %>%
  group_by(match, player) %>%
  # keep rows until the running total first reaches 10
  filter(!lag(cumsum(score) >= 10, default = FALSE)) %>%
  summarise(score = sum(score), Count = n())
# A tibble: 2 x 4
# Groups: match [?]
# match player score Count
# <int> <int> <dbl> <int>
#1 1 30 12 2
#2 2 31 15 3
data
df1 <- structure(list(match = c(1L, 1L, 1L, 2L, 2L, 2L), player = c(30L,
30L, 30L, 31L, 31L, 31L), score = c(6, 6, 6, 3, 6, 6)), .Names = c("match",
"player", "score"), row.names = c(NA, -6L), class = "data.frame")
