In R's aggregate() function, how can I specify a stopping condition for the function applied within each group?
For example, I have a data frame "df" like this:
Input Data frame
Note: each row in the input data frame denotes a single ball played by a player in that match, so counting rows gives the number of balls required.
And I want my output data frame to look like this: Output data frame
What I need to know is: how many balls are required to score 10 runs?
Currently, I am using this R code:
group_data <- aggregate(df$score, by=list(Category=df$player,df$match), FUN=sum,na.rm = TRUE)
With this code I cannot stop the grouping where I want; it only stops once it has grouped all rows, and I don't want every row to be considered.
How can I add a constraint like "stop grouping as soon as score >= 10"?
My sole purpose in adding this constraint is to count the number of rows needed to satisfy the condition.
Thanks in advance.
Here is one option using dplyr
library(dplyr)
df1 %>%
group_by(match, player) %>%
filter(!lag(cumsum(score) >= 10, default = FALSE)) %>%
summarise(score = sum(score), Count = n())
# A tibble: 2 x 4
# Groups: match [?]
# match player score Count
# <int> <int> <dbl> <int>
#1 1 30 12 2
#2 2 31 15 3
data
df1 <- structure(list(match = c(1L, 1L, 1L, 2L, 2L, 2L), player = c(30L,
30L, 30L, 31L, 31L, 31L), score = c(6, 6, 6, 3, 6, 6)), .Names = c("match",
"player", "score"), row.names = c(NA, -6L), class = "data.frame")
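As an alternative to filtering rows, the count can be computed directly inside aggregate(). The following is a minimal base R sketch, assuming the df1 data above; the helper name balls_to_target and its target argument are made up for illustration, and it returns NA when a player never reaches the target.

```r
# Same sample data as df1 above
df1 <- data.frame(match  = c(1L, 1L, 1L, 2L, 2L, 2L),
                  player = c(30L, 30L, 30L, 31L, 31L, 31L),
                  score  = c(6, 6, 6, 3, 6, 6))

# Hypothetical helper: position of the first ball at which the
# running total reaches the target, or NA if it never does
balls_to_target <- function(score, target = 10) {
  reached <- which(cumsum(score) >= target)
  if (length(reached) == 0) NA_integer_ else reached[1]
}

# One row per match/player; the "score" column now holds the ball count
res <- aggregate(score ~ match + player, data = df1,
                 FUN = balls_to_target, target = 10)
names(res)[names(res) == "score"] <- "balls"
res
```

For the sample data this gives 2 balls for player 30 and 3 balls for player 31, matching the Count column of the dplyr output.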
I am a beginner in R and was looking for help online, but the examples I found among similar titles don't quite fit my needs, because they only deal with a few columns.
I have a data.frame T1 with over 100 columns and what I am looking for is something like a summary, but I want to retain every other column after the summary. I thought about using aggregate but since it's not a function, I am uncertain. The most promising approach I have come up with is shown below.
T2 <- T1 %>% group_by(ID) %>% summarise(AGI = paste(AGI, collapse = "; "))
The summary works the way I want, but I lose any other column.
I definitely appreciate any kind of advice! Thank you very much.
Expanding on TTS's comment, if you want to keep any other column you have to use mutate instead of summarise because, as the documentation says, summarise() creates a new data frame.
You should therefore use
T1 %>% group_by(ID) %>% mutate(AGI = paste(AGI, collapse = "; ")) %>% ungroup()
Data
T1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 3L, 4L), UniProt_Accession = c("P25702",
"F4HWZ6", "Q9C5M0", "Q9SR37", "Q9LKR7", "Q9FXI7"), AGI = c("AT1G54630",
"AT1G54630", "AT5G19760", "AT3G09260", "AT5G28510", "AT1G19890"
)), class = "data.frame", row.names = c(NA, -6L))
Output
# A tibble: 6 x 3
# ID UniProt_Accession AGI
# <int> <chr> <chr>
# 1 1 P25702 AT1G54630; AT1G54630
# 2 1 F4HWZ6 AT1G54630; AT1G54630
# 3 2 Q9C5M0 AT5G19760
# 4 3 Q9SR37 AT3G09260; AT5G28510
# 5 3 Q9LKR7 AT3G09260; AT5G28510
# 6 4 Q9FXI7 AT1G19890
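The same "compute per group but keep every row" idea can be sketched in base R with ave(), which returns a vector as long as its input. This is only an illustration on a cut-down copy of T1; the column name AGI_all is an assumption, chosen so that the original AGI column is left untouched.

```r
# Cut-down copy of T1 from above (character columns kept as character)
T1 <- data.frame(ID  = c(1L, 1L, 2L, 3L, 3L, 4L),
                 AGI = c("AT1G54630", "AT1G54630", "AT5G19760",
                         "AT3G09260", "AT5G28510", "AT1G19890"),
                 stringsAsFactors = FALSE)

# ave() applies the function within each ID and recycles the single
# pasted string across all rows of that group, so no rows are lost
T1$AGI_all <- ave(T1$AGI, T1$ID,
                  FUN = function(x) paste(x, collapse = "; "))
T1
```

Because the row count is unchanged, any other columns in the data frame are retained automatically.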
I have time series data with 5 columns. The first column is the user name, and the rest are values at different points in time. Sample data shown here - https://pastebin.com/raw/TzmhKybt
I want to retain the first 3 values of any given user, and remove the rest. So for every given user, there will be at most 3 records in the data set. I have tried the following but it does not seem to work. Please point me in the right direction as I could not find any good way to do this.
data %>% group_by(User) %>% top_n(3)
Output of dput(data[1:10,]) is
structure(list(User = c("mmcclafl", "mmcclafl", "mmcclafl", "mmcclafl",
"mmcclafl", "mmcclafl", "gsnabwez", "gsnabwez", "gsnabwez", "gsnabwez"
), StartTime = c(584.93, 584.93, 584.93, 584.93, 584.93, 584.93,
1501.26, 1501.26, 1501.26, 1501.26), Time = c(597.94, 675.28,
774.02, 843.05, 1093.79, 1142.85, 1510.94, 1582.81, 1665.26,
1689.91), SelfReport = c("FLOW", "FLOW", "FLOW", "FRUSTRATION",
"FRUSTRATION", "FRUSTRATION", "FLOW", "FRUSTRATION", "FRUSTRATION",
"FRUSTRATION"), Affectiva = c("BOREDOM", "BOREDOM", "BOREDOM",
"BOREDOM", "BOREDOM", "BOREDOM", "BOREDOM", "BOREDOM", "OTHER",
"BOREDOM")), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L,
11L), class = "data.frame")
You can select the first 3 rows of each group as follows.
Using dplyr
library(dplyr)
data %>% group_by(User) %>% slice(1:3)
# User StartTime Time SelfReport Affectiva
# <chr> <dbl> <dbl> <chr> <chr>
#1 gsnabwez 1501. 1511. FLOW BOREDOM
#2 gsnabwez 1501. 1583. FRUSTRATION BOREDOM
#3 gsnabwez 1501. 1665. FRUSTRATION OTHER
#4 mmcclafl 585. 598. FLOW BOREDOM
#5 mmcclafl 585. 675. FLOW BOREDOM
#6 mmcclafl 585. 774. FLOW BOREDOM
In base R
subset(data, ave(StartTime, User, FUN = seq_along) <= 3)
and in data.table
library(data.table)
# head() avoids NA-padded rows when a user has fewer than 3 records
setDT(data)[, head(.SD, 3), by = User]
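To make the base R idea more explicit, here is a small sketch of the within-group row counter that the ave() call relies on. The data frame below is hypothetical (not the pastebin data) and includes a user with fewer than three rows, to show that such a group is kept as is.

```r
# Hypothetical data: user "a" has 4 rows, user "b" only 2
data <- data.frame(User = c("a", "a", "a", "a", "b", "b"),
                   Time = c(1, 2, 3, 4, 5, 6),
                   stringsAsFactors = FALSE)

# Position of each row within its user: 1, 2, 3, 4, 1, 2
pos <- ave(seq_along(data$User), data$User, FUN = seq_along)

# Keep rows whose within-user position is at most 3
first3 <- data[pos <= 3, ]
first3
```

Here user "a" contributes three rows and user "b" contributes both of its rows, so no user ends up with more than three records.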
I'm quite new to R and I'm facing a problem which I guess is quite easy to fix but I couldn't find the answer.
I have a data frame called clg with 3 columns: date, X1 and X2.
X1 and X2 contain the names of country teams, drawn from the same list of countries.
I'm simply trying to count how often each country occurs across the two columns combined.
So far, I've only been able to count the frequency in the X1 column; I haven't found a way to sum over both columns.
clt <- as_tibble(na.omit(count(clg, clg$X1)))
I would like to get a data frame with the unique countries in the first column and the sum of occurrences in X1 + X2 in the second column.
You can use unlist() and table() to get the overall counts. Wrapping it in data.frame() will give you the desired two-column output.
clg <- data.frame(date=1:3,
X1=c("nor", "swe", "alg"),
X2=c("swe", "alg", "jpn"))
data.frame(table(unlist(clg[c("X1", "X2")])))
# Var1 Freq
# 1 alg 2
# 2 nor 1
# 3 swe 2
# 4 jpn 1
With tidyverse, we can gather into 'long' format and then do the count
library(tidyverse)
gather(clg, key, Var1, -date) %>%
count(Var1)
# A tibble: 4 x 2
# Var1 n
# <chr> <int>
#1 alg 2
#2 jpn 1
#3 nor 1
#4 swe 2
data
clg <- structure(list(date = 1:3, X1 = structure(c(2L, 3L, 1L), .Label = c("alg",
"nor", "swe"), class = "factor"), X2 = structure(c(3L, 1L, 2L
), .Label = c("alg", "jpn", "swe"), class = "factor")),
class = "data.frame", row.names = c(NA,
-3L))
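You can reach your goal in two steps. In the first step, you count the occurrences of each country in each column separately. In the next step, you join the two summaries and calculate the total. A full join is needed so that countries appearing in only one of the columns are kept, and their missing count is treated as 0.
library(dplyr)
X1_sum <- clg %>%
dplyr::group_by(X1) %>%
dplyr::summarize(n_x1 = dplyr::n())
X2_sum <- clg %>%
dplyr::group_by(X2) %>%
dplyr::summarize(n_x2 = dplyr::n())
final_summary <- X1_sum %>%
# match country names in X1 against country names in X2
dplyr::full_join(X2_sum, by = c("X1" = "X2")) %>%
# a country missing from one column counts as 0 there
dplyr::mutate(n_sum = dplyr::coalesce(n_x1, 0L) + dplyr::coalesce(n_x2, 0L))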
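For comparison, the same two-step idea can be sketched in base R with table() and merge(). This assumes the small clg example from the earlier answer; the column names country, n_x1, n_x2 and n_sum are illustrative. The outer merge (all = TRUE) keeps countries that occur in only one of the two columns.

```r
clg <- data.frame(date = 1:3,
                  X1 = c("nor", "swe", "alg"),
                  X2 = c("swe", "alg", "jpn"),
                  stringsAsFactors = FALSE)

# Step 1: count each column separately
n_x1 <- as.data.frame(table(country = clg$X1),
                      responseName = "n_x1", stringsAsFactors = FALSE)
n_x2 <- as.data.frame(table(country = clg$X2),
                      responseName = "n_x2", stringsAsFactors = FALSE)

# Step 2: outer join on country, then treat missing counts as 0
both <- merge(n_x1, n_x2, by = "country", all = TRUE)
both$n_x1[is.na(both$n_x1)] <- 0L
both$n_x2[is.na(both$n_x2)] <- 0L
both$n_sum <- both$n_x1 + both$n_x2
both
```

The totals agree with the table(unlist(...)) result: alg 2, jpn 1, nor 1, swe 2.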
I have a df that looks like the one below.
I guess dplyr and some kind of duplicate detection could solve this, but I don't know how to address multiple columns while also taking a grouping variable into account.
from to group
1 2 metro
2 4 metro
3 4 metro
4 5 train
6 1 train
8 7 train
I want to find the ids which exist in more than one group.
The expected result for the sample df is 1 and 4, because they exist in both the metro and the train group.
Thank you in advance!
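Using base R we can split the values of the first two columns by group and find the values common to the groups using intersect. Note that this returns the ids present in every group, which is the same as "more than one group" only when there are exactly two groups, as in the sample.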
Reduce(intersect, split(unlist(df[1:2]), df$group))
#[1] 1 4
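With three or more groups, "present in every group" and "present in more than one group" can differ. The sketch below illustrates this on hypothetical data: the sample df with one extra row in a made-up third group, "bus". It counts the distinct groups per id with tapply().

```r
# Hypothetical data: the sample df plus one row in a third group "bus"
df3 <- data.frame(from  = c(1L, 2L, 3L, 4L, 6L, 8L, 9L),
                  to    = c(2L, 4L, 4L, 5L, 1L, 7L, 8L),
                  group = c("metro", "metro", "metro",
                            "train", "train", "train", "bus"),
                  stringsAsFactors = FALSE)

ids <- unlist(df3[c("from", "to")], use.names = FALSE)
grp <- rep(df3$group, 2)   # group label for every 'from' and 'to' value

# Number of distinct groups each id appears in
n_groups <- tapply(grp, ids, function(g) length(unique(g)))

# ids that appear in more than one group
multi <- as.integer(names(n_groups)[n_groups > 1])

# ids that appear in every group (what Reduce(intersect, ...) returns)
in_all <- Reduce(intersect, split(ids, grp))
```

In this hypothetical data, multi contains 1, 4 and 8, while in_all is empty because no id occurs in all three groups.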
We gather the 'from' and 'to' columns into 'long' format, group by 'val', keep the groups with more than one unique 'group' value, then pull the unique 'val' elements
library(dplyr)
library(tidyr)
df1 %>%
gather(key, val, from:to) %>%
group_by(val) %>%
filter(n_distinct(group) > 1) %>%
distinct(val) %>%
pull(val)
#[1] 1 4
Or, in base R, we can use table to get the frequencies and extract the ids from it
out <- with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"
data
df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L,
4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train",
"train", "train")), class = "data.frame", row.names = c(NA, -6L
))
Convert the data to long format and count unique values using data.table. melt converts to long format, and data.table allows filtering in the i part of df1[i, j, by], grouping in the by part, and pulling values in the j part.
library(data.table)
library(magrittr)
setDT(df1)
melt(df1, 'group') %>%
.[, .(n = uniqueN(group)), value] %>%
.[n > 1, unique(value)]
# [1] 1 4
I am trying to convert a single-column data frame into separate columns — the main descriptor in the data is the "item number" and then includes information on the price, date, color, etc. I would just split the column depending on row number, but since each item has a different amount of information, that doesn't really work.
I've been playing around with this a bit but haven't found anything that comes close, as I can't use regex to create a separate column (using str_which, for example) since the information differs so much from item to item. How can I use regex to create intervals that I can then split the column into? That is, I need the information between each pair of rows containing "item" in a separate column. Sample data is below.
data
item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017
Thank you!
Here is a function to reformat your data depending on how you want the final dataset to look. You supply the dataframe DF, the variable var, a vector of column names in the correct order (colnames), and byitem to choose the output format (the default is TRUE, which outputs a dataframe with one row per item):
library(tidyverse)
df_transform = function(DF, var, colnames, byitem = TRUE){
if(byitem){
ID = sym("rowid")
}else{
ID = sym("id")
}
DF %>%
group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
filter(!grepl("item", var)) %>%
spread(!!ID, var)
}
Output:
> df_transform(df, var, c("price", "color", "date"))
# A tibble: 3 x 4
# Groups: id [3]
id color date price
<chr> <fct> <fct> <fct>
1 item1 red <NA> $600
2 item2 <NA> <NA> $70
3 item3 orange 10/11/2017 $430
> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)
# A tibble: 3 x 4
rowid item1 item2 item3
<chr> <fct> <fct> <fct>
1 color red <NA> orange
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430
Note that this would not work if you have missing values in the middle, since the column names are assigned by position.
Data:
df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L,
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1",
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA,
-9L))
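The grouping step at the heart of the function, cumsum(grepl("item", var)), can be seen in isolation in the following base R sketch. It assumes the sample values from the question as a plain character vector; the names is_header, item_id and fields are illustrative. Every row that matches "item" starts a new block, and the running sum assigns the rows after it to that block.

```r
var <- c("item 1", "$600", "red",
         "item 2", "$70",
         "item 3", "$430", "orange", "10/11/2017")

is_header <- grepl("^item", var)   # TRUE on the rows that start a new item
item_id   <- cumsum(is_header)     # 1 1 1 2 2 3 3 3 3

# Split the non-header rows by item; using a factor with all item ids as
# levels keeps items that have no detail rows as empty entries
fields <- split(var[!is_header],
                factor(item_id[!is_header],
                       levels = seq_len(sum(is_header))))
names(fields) <- var[is_header]
fields
```

The result is a named list with one character vector per item, which can then be padded and assigned to columns by position, subject to the same caveat about values missing in the middle.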