Is there a #dplyr function that can filter a table once the cumulative sum of a column hits a certain value? For example, if df has 10 rows and the running sum of column1 reaches 5 by row 6, rows 7-10 are filtered out?
Do you mean something like this?
library(dplyr)

df <- data.frame(a = 1:10)

df %>%
  filter(cumsum(a) < 8)
# a
#1 1
#2 2
#3 3
Explanation: cumsum is your friend here. In the example above, you keep only the rows where the cumulative sum of a is less than 8.
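If you instead want to keep the row on which the running total first reaches the threshold (as in the question, where the sum gets to 5 by row 6 and rows 7-10 should drop), one option is to filter on the lagged cumulative sum. A sketch, assuming a threshold of 5 and made-up data where the total reaches 5 at row 6:

```r
library(dplyr)

df <- data.frame(column1 = c(1, 1, 0, 1, 1, 1, 2, 3, 4, 5))

# Keep rows until the running total of column1 has reached 5,
# including the row on which it reaches it: the lagged cumsum is
# the total *before* each row, so rows are kept while that is < 5
df %>%
  filter(lag(cumsum(column1), default = 0) < 5)
# rows 1-6 remain; rows 7-10 are dropped
```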
I have a data frame with a large number of observations and I want to remove NA values in one specific column while keeping the rest of the data frame the same. I want to do this without using na.omit(). How do I do this?
We can use is.na or complete.cases to return a logical vector for subsetting:
subset(df1, complete.cases(colnm))
where colnm is the actual column name.
This is how I would do it using dplyr:
library(dplyr)
df <- data.frame(a = c(1, 2, NA),
                 b = c(5, NA, 8))
filter(df, !is.na(a))
# output
a b
1 1 5
2 2 NA
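If you are already in the tidyverse, tidyr's drop_na() does the same thing and takes the columns to check as arguments. A small sketch, reusing the two-column df from above:

```r
library(tidyr)

df <- data.frame(a = c(1, 2, NA),
                 b = c(5, NA, 8))

# Drop rows where column a is NA; NAs in other columns are kept
drop_na(df, a)
#   a  b
# 1 1  5
# 2 2 NA
```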
I have the following data frame in R (actual data frame is millions of rows with thousands of unique Column A values):
Row  Column A  Column B
1    130077    65
2    130077    65
3    130077    65
4    200040    10
5    200040    10
How can I add up Column B values grouped by Column A values without including duplicated Column A values? Correct output would be:
130077 65
200040 10
........
I have tried using filter and group_by with no success: the output does sum Column B by Column A, but the duplicated rows are included in the sum.
An option is to get the distinct rows, then do a group by 'ColumnA' and get the sum of 'ColumnB'
library(dplyr)
df1 %>%
  distinct(ColumnA, ColumnB) %>% # The example gives the expected output here
  group_by(ColumnA) %>%
  summarise(ColumnB = sum(ColumnB))
Or in base R with unique and aggregate
aggregate(ColumnB ~ ColumnA, unique(df1[c("ColumnA", "ColumnB")]), sum)
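A quick sketch of the dplyr version run against the sample data from the question (column names simplified to ColumnA/ColumnB, as in the answer):

```r
library(dplyr)

df1 <- data.frame(ColumnA = c(130077, 130077, 130077, 200040, 200040),
                  ColumnB = c(65, 65, 65, 10, 10))

# distinct() removes the duplicated rows first,
# so each (ColumnA, ColumnB) pair is summed only once
df1 %>%
  distinct(ColumnA, ColumnB) %>%
  group_by(ColumnA) %>%
  summarise(ColumnB = sum(ColumnB))
#   ColumnA ColumnB
# 1  130077      65
# 2  200040      10
```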
I have a data frame which looks like this, where the value of b ranges from 1:31 and alpha_1, alpha_2 and alpha_3 can only take the values 0 and 1. For each b value I have 1000 observations, so 31,000 observations in total. I want to group the entire dataset by b and count the alpha columns only when their value is 1. The end result would have 31 observations (the unique b values from 1:31) and the count of 1s in each alpha column.
How do I do this in R? I have tried pipe chains in dplyr and nothing seems to be working.
We can use summarise_at; since the alpha columns only contain 0 and 1, summing them counts the 1s:
library(dplyr)
df1 %>%
  group_by(b) %>%
  summarise_at(vars(starts_with("alpha")), sum)
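In dplyr 1.0 and later, summarise_at() has been superseded by across(). A sketch of the equivalent call, using a tiny made-up stand-in for the real 31,000-row data frame:

```r
library(dplyr)

# Small stand-in for the real data: b groups with 0/1 alpha columns
df1 <- data.frame(b       = c(1, 1, 2, 2),
                  alpha_1 = c(1, 0, 1, 1),
                  alpha_2 = c(0, 0, 1, 0),
                  alpha_3 = c(1, 1, 0, 0))

# Summing a 0/1 column within each group counts its 1s
df1 %>%
  group_by(b) %>%
  summarise(across(starts_with("alpha"), sum))
#       b alpha_1 alpha_2 alpha_3
# 1     1       1       0       2
# 2     2       2       1       0
```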
I have a tibble with several columns, including an ID column and a "score" column. The ID column has some duplicated values. I want to create a tibble that has one row per unique ID, and the same number of columns as the original tibble. For any ID, the "score" value in this new tibble should be the mean of the scores for the ID in the original tibble. And for any ID, the value for the other columns should be the first value that appears for that ID in the original tibble.
When the number of columns in the original tibble is small and known, this is an easy problem. Example:
scores <- tibble(
  ID = c(1, 1, 2, 2, 3),
  score = 1:5,
  a = 6:10)

scores %>%
  group_by(ID) %>%
  summarize(score = mean(score), a = first(a))
But I often work with tibbles (or data frames) that have dozens of columns. I don't know in advance how many columns there will be or how they will be named. In these cases, I still want a function that takes, within each group, the mean of the score column and the first value of the other columns. But it isn't practical to spell out the name of each column. Is there a generic command that will let me summarize() by taking the mean of one column and the first value of all of the others?
A two-step solution would start by using mutate() to replace each score within a group with the mean of those scores. Then I could create my desired tibble by taking the first row of each group. But is there a one-step solution, perhaps using one of the select_helpers in dplyr?
Summarizing unknown number of column in R using dplyr is the closest post that I've been able to find. But I can't see that it quite speaks to this problem.
You can use mutate to get the mean values and then use slice to get the first row of each group, i.e.
library(dplyr)
scores %>%
  group_by(ID) %>%
  mutate(score = mean(score)) %>%
  slice(1L)
#Source: local data frame [3 x 3]
#Groups: ID [3]
# ID score a
# <dbl> <dbl> <int>
#1 1 1.5 6
#2 2 3.5 8
#3 3 5.0 10
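If a one-step summarise() is preferred and dplyr 1.0 or later is available, across() can take the first value of every non-score column while score gets the mean. A sketch using the scores tibble from the question:

```r
library(dplyr)
library(tibble)

scores <- tibble(
  ID = c(1, 1, 2, 2, 3),
  score = 1:5,
  a = 6:10)

# across(-score, first) applies first() to all columns except score
# (the grouping column ID is excluded automatically)
scores %>%
  group_by(ID) %>%
  summarise(across(-score, first), score = mean(score))
#      ID     a score
# 1     1     6   1.5
# 2     2     8   3.5
# 3     3    10   5.0
```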
I apologize if this question is abhorrently simple, but I'm looking for a way to just add a column of consecutive integers to a data frame (if my data frame has 200 observations, for example, starting with 1 for the first observation, and ending with 200 on the last one).
How can I do this?
For a dataframe (df) you could use
df$observation <- 1:nrow(df)
but if you have a matrix you would rather want to use
ma <- cbind(ma, "observation"=1:nrow(ma))
as using the first option will transform your data into a list.
Source: http://r.789695.n4.nabble.com/adding-column-of-ordered-numbers-to-matrix-td2250454.html
Or use dplyr.
library(dplyr)
df %>% mutate(observation = 1:n())
You might want it to be the first column of df.
df %>% mutate(observation = 1:n()) %>% select(observation, everything())
The function tibble::rowid_to_column is probably what you need if you are using the tidyverse ecosystem.
library(tidyverse)
dat <- tibble(x = c(10, 20, 30),
              y = c('alpha', 'beta', 'gamma'))

dat %>% rowid_to_column(var = 'observation')
# A tibble: 3 x 3
observation x y
<int> <dbl> <chr>
1 1 10 alpha
2 2 20 beta
3 3 30 gamma