The example shows measurements of production output from different factories,
where the first column denotes the factory
and the last column the amount produced.
factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production)
df
factory production
1 A 15
2 A 2
3 B 1
4 B 1
5 B 2
6 B 1
7 B 2
8 C 20
9 D 5
Now I want to lump together the factories into fewer levels, based on their total output in this data set.
With the normal forcats::fct_lump, I can lump them by the number of rows in which they appear, e.g. to make 3 levels:
library(tidyverse)
df %>% mutate(factory = fct_lump(factory, 2))
factory production
1 A 15
2 A 2
3 B 1
4 B 1
5 B 2
6 B 1
7 B 2
8 Other 20
9 Other 5
but I want to lump them based on sum(production), retaining the top n = 2 factories (by total output) and lumping the remaining factories. Desired result:
1 A 15
2 A 2
3 Other 1
4 Other 1
5 Other 2
6 Other 1
7 Other 2
8 C 20
9 Other 5
Any suggestions?
Thanks!
The key here is to pick a specific criterion for grouping factories together based on their summed production. Note that the right criterion depends on the actual values in your (real) dataset.
Option 1
Here's an example that groups together factories whose total production is 15 or less. If you want another grouping, you can modify the threshold (e.g. use 18 instead of 15):
factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)
library(dplyr)
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(sum(production) > 15, factory, "Other")) %>%
  ungroup()
# # A tibble: 9 x 3
# factory production factory_new
# <chr> <dbl> <chr>
# 1 A 15 A
# 2 A 2 A
# 3 B 1 Other
# 4 B 1 Other
# 5 B 2 Other
# 6 B 1 Other
# 7 B 2 Other
# 8 C 20 C
# 9 D 5 Other
I'm creating factory_new without removing the (original) factory column.
Option 2
Here's an example where you rank the factories by their total production, then keep a chosen number of top factories as they are and group the rest:
factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)
library(dplyr)
# get factories ranked by their total production
vec_top_factories <- df %>%
  group_by(factory) %>%
  summarise(SumProd = sum(production)) %>%
  arrange(desc(SumProd)) %>%
  pull(factory)
# input how many top factories you want to keep
# rest will be grouped together
n <- 2
# apply the grouping based on n provided
# apply the grouping based on the n provided
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(factory %in% vec_top_factories[1:n], factory, "Other")) %>%
  ungroup()
# # A tibble: 9 x 3
# factory production factory_new
# <chr> <dbl> <chr>
# 1 A 15 A
# 2 A 2 A
# 3 B 1 Other
# 4 B 1 Other
# 5 B 2 Other
# 6 B 1 Other
# 7 B 2 Other
# 8 C 20 C
# 9 D 5 Other
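One small robustness note on the code above: if n can exceed the number of distinct factories, vec_top_factories[1:n] pads the vector with NA; head() is a safer way to take "up to n" elements. A sketch of the same mutate with that guard:
df %>%
  mutate(factory_new = ifelse(factory %in% head(vec_top_factories, n), factory, "Other"))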
Just specify the weight argument w:
df %>%
  mutate(factory = fct_lump_n(factory, 2, w = production))
factory production
1 A 15
2 A 2
3 Other 1
4 Other 1
5 Other 2
6 Other 1
7 Other 2
8 C 20
9 Other 5
Note: use forcats::fct_lump_n(), since the generic fct_lump() is superseded and no longer recommended.
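For reference, the other weighted lumping variants accept w as well; a small sketch (assuming a reasonably current forcats version):
# keep every factory whose total production is at least 10, lump the rest
df %>% mutate(factory = fct_lump_min(factory, min = 10, w = production))
# keep factories contributing at least 20% of total production
df %>% mutate(factory = fct_lump_prop(factory, prop = 0.2, w = production))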
We could use base R as well, creating a logical condition with ave:
df$factory_new <- "Other"
i1 <- with(df, ave(production, factory, FUN = sum) > 15)
df$factory_new[i1] <- as.character(df$factory[i1]) # as.character() guards against factory being a factor
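A quick check; factory_new should match Option 1 above:
df
#   factory production factory_new
# 1       A         15           A
# 2       A          2           A
# 3       B          1       Other
# 4       B          1       Other
# 5       B          2       Other
# 6       B          1       Other
# 7       B          2       Other
# 8       C         20           C
# 9       D          5       Other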
Related
I have the following dataframe:
Participant_ID Order
1 A
1 A
2 B
2 B
3 A
3 A
4 B
4 B
5 B
5 B
6 A
6 A
Every two rows refer to the same participant. I want to create a new column based on the value in the Order column: if Order == "A", the participant's two rows should get [1, 2] in the new column; if Order == "B", they should get [2, 1].
The preferred output would be the following:
Participant_ID Order Period
1 A 1
1 A 2
2 B 2
2 B 1
3 A 1
3 A 2
4 B 2
4 B 1
5 B 2
5 B 1
6 A 1
6 A 2
Any help would be appreciated.
Here are a couple of possibilities. Both assume that the Order value is the same for a given Participant_ID; if that isn't the case, you will need additional logic.
You can use if_else:
library(tidyverse)
df %>%
  group_by(Participant_ID) %>%
  mutate(Period = if_else(Order == "A", 1:2, 2:1))
Or to explicitly check for multiple different values (e.g., "A", "B", etc.), have more flexibility, and include NA for other cases, you can use case_when:
df %>%
  group_by(Participant_ID) %>%
  mutate(Period = case_when(
    Order == "A" ~ 1:2,
    Order == "B" ~ 2:1,
    TRUE ~ NA_integer_
  ))
Output
Participant_ID Order Period
<int> <chr> <int>
1 1 A 1
2 1 A 2
3 2 B 2
4 2 B 1
5 3 A 1
6 3 A 2
7 4 B 2
8 4 B 1
9 5 B 2
10 5 B 1
11 6 A 1
12 6 A 2
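For completeness, a base R sketch of the same idea (assuming exactly two rows per participant, ordered as in the example):
# 1, 2 within each participant
seq_in_group <- with(df, ave(seq_along(Participant_ID), Participant_ID, FUN = seq_along))
# "A" keeps 1, 2; "B" flips to 2, 1
df$Period <- ifelse(df$Order == "A", seq_in_group, 3L - seq_in_group)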
I have a dataset where the first line is the header, the second line is some explanatory text, and rows 3 onward are numbers. Because of that second explanatory row, all the columns are converted to factors when I read in the data (or to character, if I set stringsAsFactors = FALSE).
What I would like to do is remove the second row and have a function that goes through all the columns, detects whether they contain just numbers, and converts them to the appropriate class. Is there something like that available, perhaps using dplyr? I have many columns, so I'd like to avoid reassigning them manually.
A simplified example below
df <- data.frame(A = c("col 1", 1, 2, 3, 4, 5), B = c("col 2", 1, 2, 3, 4, 5))
df
A B
1 col 1 col 2
2 1 1
3 2 2
4 3 3
5 4 4
6 5 5
If all the columns contain only numbers once the first row is removed, then we can do:
library(tidyverse)
df[-1, ] %>% mutate_all(as.numeric)
Depending on the task, it can also be done selectively. Here any column with at least one numeric value is converted (non-numbers become NA):
df <- tibble(A = c("col 1", 1, 2, 3, 4, 5),
             B = c("col 2", 1, 2, 3, 4, 5),
             C = c(letters[1:5], 6))
df[-1, ] %>% mutate_if(~ any(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 6
Or like this, converting only the columns where every value is numeric:
df[-1, ] %>% mutate_if(~ all(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <chr>
1 1 1 b
2 2 2 c
3 3 3 d
4 4 4 e
5 5 5 6
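Note that the as.numeric() probes above emit coercion warnings for non-numeric columns; a variant of the last call that suppresses them (same result):
df[-1, ] %>% mutate_if(~ all(!is.na(suppressWarnings(as.numeric(.)))), as.numeric)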
In base R, we can simply drop the first row and then convert every column:
df <- df[-1, ]
df[] <- lapply(df, as.numeric)
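Alternatively, base R's type.convert() can detect each column's type automatically; a sketch applied to the original df (this assumes character columns, i.e. stringsAsFactors = FALSE, and an R version where type.convert() has a data.frame method):
df2 <- type.convert(df[-1, ], as.is = TRUE) # as.is = TRUE keeps non-numeric columns as character
str(df2)                                    # numeric columns are detected automatically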
I need help with R programming. I have a data.frame B with one column:
x <- c("300", "300", "300", "400", "400", "400", "500", "500", "500", ...) # 2 million rows
and I need to create a second column with a rank within each run of equal values. The new column should look like:
y <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, ...)
I used a for loop:
B$y[1] <- 1
for (i in 2:length(B$x)) {
  B$y[i] <- ifelse(B$x[i] == B$x[i - 1], B$y[i - 1] + 1, 1)
}
The process ran for 4 hours, so I need help speeding it up (or a different approach entirely).
Thanks!
Here is a solution with base R:
B <- data.frame(x = rep(c(300, 400, 500), sample(5:10, 3)))
B
B$y <- ave(B$x, B$x, FUN=seq_along)
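If equal x values are always contiguous (as in this data), a run-length version is also fast; a sketch:
# restart the counter at the start of each run of identical values
B$y <- sequence(rle(B$x)$lengths)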
Here's an approach with dplyr that takes about 0.2 seconds on 2 million rows.
First I make sample data:
n <- 2E6 # number of rows in test
library(dplyr)
sample_data <- data.frame(
  x = round(runif(n = n, min = 1, max = 100000), digits = 0)
) %>%
  arrange(x) # optional; makes the output clearer by putting matching x values next to each other
Then I group by x and make y the occurrence number of each x value within its group.
sample_data_with_rank <- sample_data %>%
  group_by(x) %>%
  mutate(y = row_number()) %>%
  ungroup()
head(sample_data_with_rank, 20)
# A tibble: 20 x 2
x y
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 1 13
14 1 14
15 1 15
16 2 1
17 2 2
18 2 3
19 2 4
20 2 5
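For comparison, a data.table sketch of the same operation; rowid() numbers the occurrences within each x:
library(data.table)
setDT(sample_data)           # convert to data.table by reference
sample_data[, y := rowid(x)] # occurrence counter within each x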
Newbie question
I have 2 columns in a data frame that looks like
Name Size
A 1
A 1
A 1
A 2
A 2
B 3
B 5
C 7
C 17
C 17
I need a third column that runs as a continuous sequence until either Name or Size changes value:
Name Size NewCol
A 1 1
A 1 2
A 1 3
A 2 1
A 2 2
B 3 1
B 5 1
C 7 1
C 17 1
C 17 2
Basically a dummy field to reference each record separately, even if Name and Size are the same.
The index increments from k to k+1 while it keeps encountering the same Name and Size values; otherwise it resets.
So if my data set has 200 rows with A and 1, they will be indexed 1..200; when it moves to A and 2, the index resets.
We can try with data.table
library(data.table)
setDT(df1)[, NewCol := match(Size, unique(Size)), by = .(Name)]
df1
#     Name Size NewCol
#  1:    A    1      1
#  2:    A    1      1
#  3:    A    1      1
#  4:    A    2      2
#  5:    A    2      2
#  6:    B    3      1
#  7:    B    5      2
#  8:    C    7      1
#  9:    C   17      2
# 10:    C   17      2
But if the expected output is exactly as shown, i.e. a counter that restarts within each Name/Size pair, this would be it:
setDT(df1)[, NewCol := seq_len(.N), .(Name, Size)]
Or using dplyr
library(dplyr)
df1 %>%
  group_by(Name) %>%
  mutate(NewCol = match(Size, unique(Size)))
Or
df1 %>%
  group_by(Name, Size) %>%
  mutate(NewCol = row_number())
Or we can use the same approach with ave from base R, e.g. a sketch along these lines:
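df1$NewCol <- with(df1, ave(seq_along(Name), Name, Size, FUN = seq_along)) # row counter within each (Name, Size) pair, like row_number() above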
I guess this might not be the most efficient solution, but it's at least a good start:
# Reproducing the example
df <- data.frame(Name=LETTERS[c(1, 1, 1, 1, 1, 2, 2, 3, 3, 3)], Size=c(1, 1, 1, 2, 2, 3, 5, 7, 17, 17))
# Create new colum with unique id
df$NewCol <- paste0(df$Name, df$Size)
# Modify the column to hold a running count per id instead
# (relies on rows sharing an id being contiguous, as they are here)
df$NewCol <- unlist(lapply(unique(df$NewCol), function(id) seq_len(table(df$NewCol)[id])))
df
Name Size NewCol
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 1
5 A 2 2
6 B 3 1
7 B 5 1
8 C 7 1
9 C 17 1
10 C 17 2
I have a dataframe as follows. It is ordered by column time.
Input -
df <- data.frame(time = 1:20,
                 grp = sort(rep(1:5, 4)),
                 var1 = rep(c('A', 'B'), 10))
head(df, 10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable var2 which counts the number of distinct var1 values seen so far, i.e. up to that point in time, within each group grp. This is a little different from what I'd get if I were to use n_distinct.
Expected output -
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function, say cum_n_distinct, for this and use it as -
d_out <- df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))
A dplyr solution inspired by @akrun's answer. The logic is to set the first occurrence of each unique var1 value to 1 (and the rest to 0) within each group grp, and then apply cumsum:
df <- df %>%
  arrange(time) %>%
  group_by(grp, var1) %>%
  mutate(var_temp = ifelse(row_number() == 1, 1, 0)) %>%
  group_by(grp) %>%
  mutate(var2 = cumsum(var_temp)) %>%
  select(-var_temp)
head(df, 10)
Source: local data frame [10 x 4]
Groups: grp
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
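The helper column isn't strictly necessary: within each group, !duplicated(var1) already flags first occurrences, so the same logic compresses to a one-liner (a sketch):
df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cumsum(!duplicated(var1)))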
Assuming the data is already ordered by time, first define a cumulative distinct-count function:
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x)))) # O(n^2) per group, fine for modest group sizes
Then a base solution that uses ave to create groups (note: this assumes var1 is a factor) and applies our function to each group:
transform(df, var2 = ave(as.integer(var1), grp, FUN = dist_cum))
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2 := dist_cum(var1), by = grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2 = dist_cum(var1))
Update
With your new dataset, here's an approach in base R:
df$var2 <- unlist(lapply(split(df, df$grp), function(x) {
  x$var2 <- 0
  indx <- match(unique(x$var1), x$var1)
  x$var2[indx] <- 1
  cumsum(x$var2)
}))
head(df,7)
# time grp var1 var2
# 1 1 1 A 1
# 2 2 1 B 2
# 3 3 1 A 2
# 4 4 1 B 2
# 5 5 2 A 1
# 6 6 2 B 2
# 7 7 2 A 2
Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE) {
  # Given a vector x, returns a corresponding vector y
  # where the ith element of y gives the number of unique
  # elements observed up to and including index i.
  # If na.include = TRUE (default), NA is counted as an
  # additional unique element; otherwise it's ignored.
  temp <- data.table(x, idx = seq_along(x))
  firsts <- temp[temp[, .I[1L], by = x]$V1] # first index of each unique value
  if (!na.include) firsts <- firsts[!is.na(x)]
  y <- rep(0, times = length(x))
  y[firsts$idx] <- 1
  y <- cumsum(y)
  return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out <- df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))