I have a very large data frame with fish species captured as one of the columns. Here is a very shortened example:
ID = seq(1,50,1)
fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)
dat = data.frame(ID, common) # data.frame() directly; cbind() would coerce ID to character
I want to remove any species that make up less than a certain percentage of the data. For the example here say I want to remove all species that make up less than 30% of the data:
library(dplyr)
nrow(filter(dat, common == "bass")) #22 rows -> 22/50 -> 44%
nrow(filter(dat, common == "jack")) #12 rows -> 12/50 -> 24%
nrow(filter(dat, common == "snapper")) #16 rows -> 16/50 -> 32%
Here, jacks make up less than 30% of the rows, so I want to remove all the rows with jacks (or all species with less than 15 rows). This is easy to do here, but in reality I have over 700 fish species in my data frame and I want to throw out all species that make up less than 1% of the data (which in my case would be less than 18,003 rows). Is there a streamlined way to do this without having to filter out each species individually?
I imagine perhaps some kind of loop that says if the number of rows for common name = "x" is less than 18003, remove those rows...
You may also do it in one pipe:
library(dplyr)
dat %>%
  mutate(total = n()) %>%                # total row count, taken before grouping
  group_by(common) %>%
  mutate(percentage = n() / total) %>%   # each species' share of all rows
  ungroup() %>%
  filter(percentage >= 0.3) %>%          # keep species making up at least 30%
  select(-total, -percentage)
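For what it's worth, dplyr's add_count() can express the same idea a bit more compactly; a minimal sketch with the same 30% cutoff (n() on the ungrouped data is the total row count):
library(dplyr)
dat %>%
  add_count(common) %>%        # adds n = rows per species
  filter(n / n() >= 0.3) %>%   # keep species at or above the 30% cutoff
  select(-n)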
One way to approach this is to first create a summary table, then filter based on the summary stat. There are probably more direct ways to accomplish the same thing.
library(dplyr)
set.seed(914) # so you get the same results from sample()
ID = seq(1,50,1)
fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)
dat = data.frame(ID, common) # same structure as yours (data.frame() avoids cbind()'s character coercion), though I ended up with a different species mix
summ.table <- dat %>%
  group_by(common) %>%
  summarize(number = n()) %>%
  mutate(pct = number / sum(number))
summ.table
# A tibble: 3 x 3
# common number pct
# <fct> <int> <dbl>
# 1 bass 18 0.36
# 2 jack 18 0.36
# 3 snapper 14 0.28
include <- summ.table$common[summ.table$pct >= .3]
dat.selected = filter(dat, common %in% include)
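If you'd rather skip the intermediate include vector, a semi_join() against the filtered summary keeps exactly the matching rows; a sketch using the objects above:
dat.selected <- dat %>%
  semi_join(filter(summ.table, pct >= .3), by = "common")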
In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where, if x = TRUE, I want to add a variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val = c(1, 2, 2))
if (summarize_num && summarize_num_distinct) {
  summ <- data %>% summarize(n = n(), n_unique = n_distinct(val)) # n_distinct() needs a column to count
} else if (summarize_num) {
  summ <- data %>% summarize(n = n())
} else if (summarize_num_distinct) {
  summ <- data %>% summarize(n_unique = n_distinct(val))
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
# A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
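One caveat for the dbplyr part of the question: slice() has no SQL translation, so on a database backend the last step needs a different phrasing. A sketch replacing it with a row_number() filter, which dbplyr does translate to a window function (the warning about an unordered window is harmless here, since the mutated columns are constant within each group):
summ = summ %>%
  filter(row_number() == 1) %>%  # works on tbl_lazy, where slice() would fail
  ungroup() %>%
  select(-val)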
The summarise_at() function takes a list of functions as a parameter, so we can write:
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
  summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
  summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So the idea is to define a list of candidate aggregation functions and then dynamically select which of them to compute. Even the column order of the aggregate can be controlled:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
  summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
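Since the list is named, you can also select by name rather than by position, which may be easier to maintain; this sketch reproduces the columns of the positional version above:
data %>%
  summarise_at(.vars = "val", fcts[c("max", "n", "avg", "sum")])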
I have a dataset with 11 columns and 18,350 observations, which has variables for company and region. There are 9 companies (company-0) spread across 5 regions (region-0 to region-5), and not all companies are present in all regions. I want to create a separate data frame for each combination of company and region that occurs. You can see them like this:
company0-region1,
company0-region10,
company0-region7,
company1-region5,
company2-region0,
company3-region2,
company4-region3,
company5-region7,
company6-region6,
company8-region9,
company9-region8
Thus I need 11 different data frames in R; no other combinations are possible.
Any other approach would be highly appreciated.
Thanks in advance.
I used the split() function to get a list:
p <- split(tsog1, list(tsog1$company), drop = TRUE)
Now I have a list of data frames, but I can't convert each element of that list into an individual data frame.
I tried using loops too, but can't get uniquely named data frames:
v <- c(1:9)
p <- levels(tsog1$company)
for (x in v) {
  x.tsog1 <- subset(tsog1, tsog1$company == p[x]) # overwrites the same object, x.tsog1, on every pass
}
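A note on the loop above: base R's assign() can build the uniquely named data frames it is reaching for, as in this sketch, though keeping the list from split() (as in the answer below) is usually the better practice:
p <- levels(tsog1$company)
for (x in seq_along(p)) {
  assign(paste0(p[x], ".tsog1"), subset(tsog1, company == p[x]))
}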
You can create a column for the region company combination and split by that column.
For example:
library(tidyverse)
# Create a df with 9 regions, 6 companies, and some dummy observations (3 per case)
df <- expand.grid(region = 0:8, company = 0:5, dummy = 1:3) %>%
  mutate(x = round(rnorm(54 * 3), 2)) %>%
  select(-dummy) %>%
  as_tibble()
# Create the column to split by, and split.
df %>%
  mutate(region_company = paste(region, company, sep = '_')) %>%
  split(., .$region_company)
Now, what to do once you have the list of data frames depends on your next steps. If, for example, you want to save them, you can use walk() or lapply().
For saving:
df_list <- df %>%
  mutate(region_company = paste(region, company, sep = '_')) %>%
  split(., .$region_company)
iwalk(df_list, function(df, nm) {
  write_csv(df, paste0(nm, '.csv'))
})
Or if you simply want to access one of them:
> df_list$`0_4`
# A tibble: 3 x 4
region company x region_company
<int> <int> <dbl> <chr>
1 0 4 0.54 0_4
2 0 4 1.61 0_4
3 0 4 0.16 0_4
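And if you truly need each element as a standalone, individually named data frame, base R's list2env() can promote the list into the global environment; a sketch (names like 0_4 are non-syntactic, so backticks are needed to use them):
list2env(df_list, envir = .GlobalEnv)
`0_4` # now a standalone tibble in the workspace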
Edit: apologies for the more-than-minimal example. I redid this with a more parsimonious example, and it looks like aosmith's answer worked out!
This is the next step after this question, in the same process. It's been a doozy.
I have a dataset with a series of variables, each with low, medium, and high values. There are also multiple identification variables, which here I am calling "scenario" and "month" just for this example. I'm doing a calculation involving 3 different values, some of which have a low, medium, or high value that varies in each scenario, and each month.
# generating a practice dataset
library(dplyr)
library(tidyr)
set.seed(123)
pracdf <- bind_cols(expand.grid(ID = letters[1:2],
                                month = 1:2,
                                scenario = c("a", "b")),
                    data_frame(p.mid = runif(8, 100, 1000),
                               a = rep(runif(2), 4),
                               b = rep(runif(2), 4),
                               c = rep(runif(2), 4)))
pracdf <- pracdf %>%
  mutate(p.low = p.mid * 0.75,
         p.high = p.mid * 1.25) %>%
  gather(p.low, p.mid, p.high, key = "ptype", value = "p")
# all of that is just to generate the practice dataset.
# 2 IDs * 2 months * 2 scenarios * 3 different values of p = 24 total rows in this dataset
# Do the calculation
pracdf2 <- pracdf %>%
  mutate(result = p * a * b * c)
This fully "gathered" dataset has the results that I want. Let's do a spread-type operation to get this in a way that's a bit more readable, with each month, scenario, and p-type combination having its own column. An example column name would be 'month1_scenario.a_p.low'. The total with this dataset would be 2 months * 3 p-types * 2 scenarios = 12 columns.
# this fully "gathered" dataset is exactly what I want.
# Let's put it in a format that the supervisor for this project will be happy with
# ID, month, scenario, and p.type are all "key" variables
# spread() only allows one key variable at a time, so...
pracdf2.spread1 <- pracdf2 %>% spread(ptype, result, sep = ".")
# Produces NA's. Looks like it's messing up with the different values of p
pracdf2.spread2 <- pracdf2 %>% select(-p) %>% spread(ptype, result, sep = ".")
# that's better, now let's spread across scenarios
pracdf2.spread2.spread2low <- pracdf2.spread2 %>% select(-ptype.p.high, -ptype.p.mid) %>% spread(scenario, ptype.p.low, sep = ".")
pracdf2.spread2.spread2mid <- pracdf2.spread2 %>% select(-ptype.p.low, -ptype.p.high) %>% spread(scenario, ptype.p.mid, sep = ".")
pracdf2.spread2.spread2high <- pracdf2.spread2 %>% select(-ptype.p.mid, -ptype.p.low) %>% spread(scenario, ptype.p.high, sep = ".")
pracdf2.spread2.spread2 <- pracdf2.spread2.spread2low %>% left_join(pracdf2.spread2.spread2mid)
# Ok, that was rough and will clearly spiral out of control quickly
# what am I still doing with my life?
I could do the spread() to spread each key column, then redo the spread for each subsequent value column, but that will take ages and will likely be error-prone.
Is there a cleaner, tidier, and tidyr way to do this?
Thanks!
You can use unite from tidyr to combine the three columns into one prior to spreading.
Then you can spread, using the new column as the key and the "result" as value.
I also removed columns "a" through "p" prior to spreading, as it didn't seem like these were needed in the desired result.
pracdf2 %>%
  unite("allgroups", month, scenario, ptype) %>%
  select(-(a:p)) %>%
  spread(allgroups, result)
# A tibble: 2 x 13
ID `1_a_p.high` `1_a_p.low` `1_a_p.mid` `1_b_p.high` `1_b_p.low` `1_b_p.mid` `2_a_p.high` `2_a_p.low`
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 160 96.2 128 423 254 338 209 126
2 b 120 72.0 96.0 20.9 12.5 16.7 133 79.5
# ... with 4 more variables: `2_a_p.mid` <dbl>, `2_b_p.high` <dbl>, `2_b_p.low` <dbl>, `2_b_p.mid` <dbl>
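In more recent tidyr (1.0+), pivot_wider() supersedes spread() and accepts several names_from columns at once, so the unite() step isn't needed at all; a sketch of the equivalent call (the names are glued with "_" by default, matching unite()):
pracdf2 %>%
  select(-(a:p)) %>%
  pivot_wider(names_from = c(month, scenario, ptype), values_from = result)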
I searched around a lot trying to find an answer for this. It seems like a relatively simple and common question, and I'm surprised I didn't find an answer, but perhaps I'm just not searching for the right keywords.
I would like to calculate a weighted sum of some columns in three rows based on a value in another column. I think it makes more sense if you look at the dummy table below.
INDIVIDUAL <- c("A","A","A","A","A","A","B","B","B","B","B","B")
BEHAVIOR <- c("Smell", "Dig", "Eat", "Smell", "Dig", "Eat","Smell", "Dig", "Eat","Smell", "Dig", "Eat")
FOOD <- c("a", "a", "a","b","b","b", "a", "a", "a","b","b","b")
TIME <- c(2,4,7,6,1,2,9,0,4,3,7,6)
sample <- data.frame(Individual=INDIVIDUAL, Behavior=BEHAVIOR, Food=FOOD, Time=TIME)
Each individual spends a certain amount of time Smelling, Digging, and Eating each food item. I would like to weight and sum these three times to have one overall time per food item. Smelling is the lowest weight, eating is the highest. So basically I want a time interacting with each food item: Time per FoodA = (EatA) + (0.5*DigA) + (0.33*SmellA).
After extensive web browsing the best idea I could come up with was this:
sample %>%
  group_by(Individual, Food) %>%
  mutate(TIME = ((fullsum$BEHAVIOR == "EAT")
                 + (.5 * (fullsum$BEHAVIOR == "DIG"))
                 + (.33 * (fullsum$BEHAVIOR == "SMELL"))))
But it doesn't work and I get this error: Error in mutate_impl(.data, dots) : incompatible size (2195), expecting 1 (the group size) or 1.
Any advice or direction to where this question has been answered already would be greatly appreciated!
FINAL RESULT
I modified fexjoo's suggestion to account for missing values and the result matches up with the values I calculated manually in Excel, so it looks like this is the winner. There may be a tidier way to remove the NAs from each of the columns but I'm ok with this.
fullsum %>% # the real dataset (with uppercase columns), standing in for the practice `sample`
  spread(BEHAVIOR, TIME) %>%
  mutate(EAT = coalesce(EAT, 0)) %>%
  mutate(DIG = coalesce(DIG, 0)) %>%
  mutate(SMELL = coalesce(SMELL, 0)) %>%
  mutate(TIME = EAT + .5 * DIG + .33 * SMELL)
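On the "tidier way to remove the NAs": spread() has a fill argument, so the three coalesce() steps can be folded into the reshape itself; a sketch, again with fullsum standing in for the real dataset:
fullsum %>%
  spread(BEHAVIOR, TIME, fill = 0) %>%
  mutate(TIME = EAT + .5 * DIG + .33 * SMELL)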
Try this
sample %>%
  group_by(Individual, Food) %>%
  mutate(TIME = (Behavior == "Eat") + .5 * (Behavior == "Dig") + .33 * (Behavior == "Smell"))
My suggestion:
library(tidyr)
sample %>%
  spread(Behavior, Time) %>%
  mutate(TIME = Eat + .5 * Dig + .33 * Smell)
The result is:
Individual Food Dig Eat Smell TIME
1 A a 4 7 2 9.66
2 A b 1 2 6 4.48
3 B a 0 4 9 6.97
4 B b 7 6 3 10.49
You could do:
sample %>%
  mutate(weights = case_when(Behavior == "Smell" ~ 0.33,
                             Behavior == "Dig" ~ 0.5,
                             Behavior == "Eat" ~ 1)) %>%
  group_by(Food, Individual) %>%
  summarise(WeightedTime = sum(weights * Time))
Which gives:
Food Individual WeightedTime
<fctr> <fctr> <dbl>
1 a A 9.66
2 a B 6.97
3 b A 4.48
4 b B 10.49
You could create a column with the weights based on the Behavior column:
library(dplyr)
sample$weights <- case_when(
  sample$Behavior == "Smell" ~ 0.33,
  sample$Behavior == "Dig" ~ 0.5,
  sample$Behavior == "Eat" ~ 1
)
sample %>%
  group_by(Individual, Food) %>%
  summarise(time = sum(Time * weights))
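A variation on the same approach that avoids hard-coding the weights inside case_when(): keep them in a small lookup table and join it on. A sketch (weight_tbl is a hypothetical name):
library(dplyr)
weight_tbl <- data.frame(Behavior = c("Smell", "Dig", "Eat"),
                         weights = c(0.33, 0.5, 1))
sample %>%
  left_join(weight_tbl, by = "Behavior") %>%
  group_by(Individual, Food) %>%
  summarise(time = sum(Time * weights))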
I'm looking to reduce resource allocation by looping through each resource's name, looking at the accounts assigned to that person, selecting one at random, and replacing that person's name with NA.
Reproducible example:
Accts <- paste0("Acc", 1:200)
Value <- c(500, 2000, 5000, 1000)
AccountDF <- data.frame(Accts, Value)
AccountDF$Owner[1:200] <- NA
AccountDF$Owner[1:23] <- "Jeff"
AccountDF$Owner[24:37] <- "Alex"
AccountDF$Owner[38:61] <- "Steph"
AccountDF$Owner[62:111] <- "Matt"
AccountDF$Owner[112:141] <- "David"
library(dplyr)
OwnerDF <- AccountDF %>%
  group_by(Owner) %>%
  summarise(Count = n(),
            TotalValue = sum(Value)) %>%
  filter(!is.na(Owner))
This is as far as I got:
for (p in 1:nrow(OwnerDF)) {
  while (AccountDF$Count[p] > 22) { # note: Count actually lives in OwnerDF, not AccountDF
    AccountDF %>%
      filter(Owner == OwnerDF$Owner[p]) %>%
      sample_n(1)
    # the sampled row is never assigned or removed, and Count never
    # decreases, so as written this loop would never terminate
  }
}
I've heard that for loops are often unnecessary for this kind of thing. I'm sure this can be done with the purrr package and pmap or something like that. I am still learning.
I would like to iterate through the OwnerDF and look at whether that person "owns" too many accounts. If yes, look at the original account list and select a random one and replace the owner's name with NA, remove 1 from their count, and continue on.
Lastly, after figuring this out I would like to see if it can be done with multiple conditions, like while (Count > 22 & TotalValue > 40000), or maybe two while loops. The object is to reduce each person's "owned" accounts to below a certain count threshold and their total $$ to below a certain value threshold.
To select random accounts, just make a random var and sort on it, taking the first N accounts that meet your conditions:
set.seed(1)
res = AccountDF %>%
  mutate(r = runif(n())) %>%
  arrange(r) %>%
  group_by(Owner) %>%
  mutate(newOwner = replace(Owner, cumsum(Value) > 40000 | row_number() > 22, NA)) %>%
  select(-r)
# Test that it worked...
res %>%
  filter(!is.na(newOwner)) %>%
  group_by(newOwner) %>%
  summarise(Count = n(), TotalValue = sum(Value))
# A tibble: 5 x 3
# newOwner Count TotalValue
# <chr> <int> <dbl>
# 1 Alex 14 27000
# 2 David 18 37000
# 3 Jeff 18 39500
# 4 Matt 18 39500
# 5 Steph 17 36500
An extension mentioned by the OP in a comment:
Another question for you. Say I have a threshold for each value and count, and if someone has a low count but high value, I want to take a random account from their high value accounts, if they have a high count and low value, I want to take low value accounts away from them. How can I do this from a random perspective?
I'd probably assign a real-valued score to each observation, like...
s = scale(f(x))
where f is some function based on the conditions you mentioned (high count, high value or both), maybe as simple as x when you want to bias towards the low values and -x when you want to bias towards the high values.
Then, add on some noise and sort using the result as above:
r = s + rnorm(length(s))
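Putting that together, a hedged sketch of the biased removal, using f(x) = Value so that higher-value accounts are more likely to be freed up (flip the sign of s to bias toward low-value accounts instead); the 22-account cap is carried over from above:
set.seed(2)
res2 = AccountDF %>%
  group_by(Owner) %>%
  mutate(s = as.numeric(scale(Value)),  # standardized score within each owner
         r = s + rnorm(n()),            # noise keeps the selection random
         newOwner = replace(Owner, rank(-r) <= pmax(0, n() - 22), NA)) %>%
  ungroup() %>%
  select(-s, -r)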