Is there a limit of factors in `dplyr::group_by`?


I'm struggling to calculate the wear of a component using the lag of a variable. I need to calculate the wear within different groups, so I'm using group_by. The problem is that when I group by the variable I actually need, the result is a column of NAs, but when I test by grouping on another variable that has fewer levels, the calculation works.
The data frame I'm using has 4093902 rows and 52 columns. The variable I need to group by has 90183 distinct values. The other one, which I tested and which worked, had 11321.
Here's the code I'm using:
final_date = result_data %>%
  arrange(time) %>%
  group_by(id_specific) %>%
  mutate(wear = dplyr::lag(some_value, n = 1, default = NA) - some_value)
Does anyone know if there is a factor limit for grouping? Or any other tips on how I can perform this calculation?

The NAs can come from two places: lag, which by default returns NA for the first value of each group, or some_value itself, which may also contain NA. Any arithmetic (such as -) returns NA if either the left-hand or the right-hand side is NA. One option is to use a function that accepts na.rm = TRUE, such as rowSums:
library(dplyr)
final_date <- result_data %>%
  arrange(time) %>%
  group_by(id_specific) %>%
  mutate(some_value_new = dplyr::lag(some_value, n = 1, default = NA)) %>%
  ungroup() %>%
  mutate(wear = rowSums(cbind(some_value_new, -1 * some_value), na.rm = TRUE),
         some_value_new = NULL)
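NOTE: It is better to ungroup before calling rowSums; the row sums do not depend on the groups, and working on ungrouped data is more efficient. Also note that na.rm = TRUE treats a missing operand as 0, so the first row of each group gets wear = -some_value rather than NA.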
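To see where the NAs come from, it may help to count rows and NA results per group. The sketch below uses a made-up toy data frame (toy and its values are assumptions, not the asker's data) and also shows one possible alternative that is not part of the answer above: passing default = first(some_value) to lag, so that the first row of each group gets a wear of 0.

```r
library(dplyr)

# Hypothetical toy data: id "a" has three readings, id "b" has only one
toy <- data.frame(
  id_specific = c("a", "a", "a", "b"),
  time        = c(1, 2, 3, 1),
  some_value  = c(10, 8, 5, 7)
)

toy_wear <- toy %>%
  arrange(time) %>%
  group_by(id_specific) %>%
  mutate(
    # Default lag: the first row of every group has no predecessor, so NA
    wear_na   = dplyr::lag(some_value) - some_value,
    # Assumed alternative: fall back to the group's own first value, giving 0
    wear_zero = dplyr::lag(some_value, default = first(some_value)) - some_value
  ) %>%
  ungroup()

# Number of rows and of NA results in each group
na_per_group <- toy_wear %>%
  group_by(id_specific) %>%
  summarise(n_rows = n(), n_na = sum(is.na(wear_na)))
```

If n_rows turns out to be 1 for most of the 90183 groups, almost every wear value is NA by construction, which would explain the result without any limit on the number of groups.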

Related

R new column (variable) that rowSums across lists with NULL values

I have a data.frame that looks like this:
UID<-c(rep(1:25, 2), rep(26:50, 2))
Group<-c(rep(5, 25), rep(20, 25), rep(-18, 25), rep(-80, 25))
Value<-sample(100:5000, 100, replace=TRUE)
df<-data.frame(UID, Group, Value)
But I need the values separated into new columns, one per Group, so I run this:
library(tidyr)
df <- pivot_wider(df, names_from = Group,
                  values_from = Value,
                  values_fill = list(Value = 0))
Which introduces NULL into the dataset. Sorry, could not figure out a way to get an example dataset with NULL values. Note: this is now a tbl_df tbl data.frame
These aren't great variable names so I run this:
colnames(df)[which(names(df) == "20")] <- "pos20"
colnames(df)[which(names(df) == "5")] <- "pos5"
colnames(df)[which(names(df) == "-18")] <- "neg18"
colnames(df)[which(names(df) == "-80")] <- "neg80"
What I want to be able to do is create a new column (variable) that rowSums across columns. So I run this:
df <- df %>%
  replace(is.na(.), 0) %>%
  mutate(rowTot = rowSums(.[2:5]))
This of course works on the example dataset, but not on the one with NULL values. I have tried converting NULL to NA using df[df == "NULL"] <- NA, but the values do not change. I have tried converting the lists to numeric using as.numeric(as.character(unlist(df[[2]]))), but I get an error telling me I have an unequal number of rows, which I guess is to be expected.
I realize there might be a better process to get my desired end result, so any suggestions on any of this are most appreciated.
EDIT: Here is a link to the actual dataset which will introduce Null values after using pivot_wider. https://drive.google.com/file/d/1YGh-Vjmpmpo8_sFAtGedxzfCiTpYnKZ3/view?usp=sharing

It is difficult to answer with confidence without a reproducible example where the error occurs, but I am going to take a guess.
I think your pivot_wider step produces list columns (meaning some cells hold vectors), and that is why you are getting NULL values. Create a unique row identifier within each group and then use pivot_wider. Also, rowSums has an na.rm parameter, so you don't need replace.
library(dplyr)
library(tidyr)
df %>%
  group_by(temp) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = temp, values_from = numseeds) %>%
  mutate(rowTot = rowSums(.[3:6], na.rm = TRUE))
Please change the column numbers according to your data in rowSums if needed.
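A small sketch of the guess above, on made-up data (dup, grp, key and val are hypothetical names, not columns of the linked dataset): when the identifying columns do not uniquely determine a value, pivot_wider warns and returns list columns; adding a per-group row number makes each cell a single value again.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: key "x" appears twice for group g1, so it is not unique
dup <- data.frame(
  grp = c("g1", "g1", "g2"),
  key = c("x", "x", "x"),
  val = c(1, 2, 3)
)

# Without a row id, pivot_wider warns and the value columns become lists
wide_list <- suppressWarnings(
  pivot_wider(dup, names_from = grp, values_from = val)
)

# With a per-group row id, every (key, row) pair identifies one value
wide_ok <- dup %>%
  group_by(grp) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = grp, values_from = val) %>%
  mutate(rowTot = rowSums(across(c(g1, g2)), na.rm = TRUE))
```

Here wide_list$g1 is a list column, while wide_ok has ordinary numeric columns with NA where a group has no value for that row.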

call variable that has been grouped by

Some sample data:
df <- data.frame(lang = rep(c("A", "B", "C"), 3),
                 answer = rep(c("1", "2", "3"), each = 3))
I am getting an error when I try to call a variable that I recently grouped by:
df2 <- df %>%
  Total = count(lang) %>% # count is short hand for tally + group_by()
  filter(answer=='2') %>%
  mutate(prop = NROW(answer)/NROW(Total))
Error in group_vars(x) : object 'lang' not found
I would like a new column on my dataframe that says the proportion of the answer '2' to total observations in each level of lang. So how many times does '2' occur in 'A' in proportion to the total number of observations in 'A'?

Here's a solution that does what you want:
df %>%
  group_by(lang) %>%
  summarize(
    prop = length(lang[answer == 2]) / n()
  )
Here we group by the variable (or variables) that define the groups, then use summarize to count the rows in each group where answer equals 2 and divide that by the total number of rows in the group. If you want to keep the answer column alongside prop, change summarize to mutate.
You were getting the error about lang not being found because count is a verb like mutate: it takes the data frame as its first argument, so it has to be called as a step in the pipe rather than assigned to a name inside it, i.e.
df %>%
  count(lang, name = "Total")
You could achieve the same thing by adapting your code, but you should use add_count (so your answer column is preserved) or mutate(Total = n()) after group_by(lang). However, group_by was designed for problems like this and is well worth spending some time to learn.
df %>%
  add_count(lang, name = "Total") %>%
  filter(answer == 2) %>%
  add_count(lang, name = "Twos") %>%
  distinct(lang, .keep_all = TRUE) %>%
  mutate(prop = Twos / Total) %>%
  select(lang, prop)
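A more compact variant, offered as a sketch rather than as the answer's method: the mean of a logical vector is the proportion of TRUE values, so the proportion of "2" answers per group can be computed directly. The name props is an assumption introduced for illustration; the data is the sample data from the question.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(lang = rep(c("A", "B", "C"), 3),
                 answer = rep(c("1", "2", "3"), each = 3))

# mean() of a logical vector gives the share of TRUE values in each group
props <- df %>%
  group_by(lang) %>%
  summarise(prop = mean(answer == "2"))
```

Each lang has three rows, exactly one of which has answer "2", so prop is one third for every group.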

Alternate solution with data.table
Personally, I prefer data.table to data frames everywhere. Here is the same calculation with data.table. Admittedly it looks a bit more cryptic than the dplyr solution, but getting used to the syntax gives you a whole bag of tricks, and for simple queries it actually reads better.
The original error arises because lang is used as if it were a variable, when it is the name of a column.
To get the requested values (0.3333 for each lang):
library(data.table)
df <- data.table(df)
df[, nrow(.SD[answer == 2]) / nrow(.SD), by = "lang"]
   lang        V1
1:    A 0.3333333
2:    B 0.3333333
3:    C 0.3333333
(The special symbol .SD holds the subset of the data for each group defined by by, so you can manipulate every subset in turn.)

How to return a value from a variable based on a condition in another variable within a grouped data frame?

I am calculating some metrics on each of a set of variables within a grouped dataframe using the basic group_by() + summarize_at approach. Each group represents a small timeseries. One metric I would like to calculate is the initial value (in this case, day == 1) of each variable within each group. Thus, the generalized problem is to return a value of a variable based on a criterion in another variable, within groups of a grouped dataframe. Within the group_by() + summarize_at approach, I believe I need a custom function that summarize_at can then apply to each variable. I can successfully deploy other custom functions that depend only on the data variable at hand. I seem to be hung up on getting the function to go look in other columns of the dataframe.
I am not married to this approach, and welcome alternate recommendations. However, I am most comfortable with dplyr.
# a dataset
df <- data.frame(day = rep(c(1:5),3),
                 group = c(rep(1,5),rep(2,5),rep(3,5)),
                 var_a = seq(1:15),
                 var_b = seq(2,30, length.out = 15),
                 var_c = seq(3,45, length.out = 15))
# the logic of what I am going for, on a manually extracted example group:
# initial value (day == 1) of var_a for group 2
df_subset <- df %>%
  filter(group == 2)
df_subset$var_a[which(df_subset$day == 1)]
# [1] 6
# my laughable attempt at a function
initial <- function(x){
  ini <- which(.$day == 1)
  x[ini]
}
# custom function deployed in dplyr pipe (which of course doesn't work)
df %>%
  group_by(group) %>%
  summarize_at(c("var_a","var_b","var_c"),
               list(max = max, ini = initial))
Many thanks.

After the group_by step, select the columns in summarise_at with vars() and one of the select helpers (starts_with works fine here). Inside the list, give one function per statistic; ~ is the lambda shorthand, used instead of writing function(x) explicitly. For the second function, day is not among the selected columns, but it can still be referenced by its unquoted column name:
library(dplyr)
df %>%
  group_by(group) %>%
  summarise_at(vars(starts_with('var')),
               list(max = ~max(.), ini = ~ .[day == 1]))
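In current dplyr the scoped verbs such as summarise_at are superseded by across(). As a sketch of the same idea in that style (res is a name introduced here for illustration; the data is the question's own):

```r
library(dplyr)

# Dataset from the question
df <- data.frame(day = rep(1:5, 3),
                 group = rep(1:3, each = 5),
                 var_a = 1:15,
                 var_b = seq(2, 30, length.out = 15),
                 var_c = seq(3, 45, length.out = 15))

# For each var_* column: the group maximum and the value where day == 1.
# Inside the lambda, day refers to the day column of the current group.
res <- df %>%
  group_by(group) %>%
  summarise(across(starts_with("var"),
                   list(max = ~ max(.x), ini = ~ .x[day == 1])))
```

The result has one row per group, with columns named var_a_max, var_a_ini, and so on. This relies on each group containing exactly one row with day == 1, as in the example data.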

Creating summary statistics (summarise_all) for a large factor dataset, retaining factor info

I have a large dataset with observational survey data which I would like to aggregate to country-year level (also for factors), in order to use the data as country-level data in another dataset. One df that I would like to aggregate has the following classes:
character  labelled   numeric
       24       272        50
Where I am pretty sure the labelled class is the result of the Hmisc library.
I started out as follows, which worked quite well.
dfsum <- df %>%
  group_by(countryyear) %>%
  summarise_all(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))
Surprisingly, this leaves me with 244 of the 346 variables (I have no clue why it would be that number; any explanation would be great).
I would like to include as many columns as possible in dfsum. I realise that the mean would not provide any useful information for unordered factors, but it would for ordered ones. For binary variables, for example, a value between 0 and 1 would give me the size of each category, and the ordinal variables are often scales. I tried:
dfsum <- df %>%
  group_by(countryyear) %>%
  summarise_all(funs(if(is.numeric(.)|is.factor(.)) mean(., na.rm = TRUE) else first(.)))
But that did not really do anything (it did not add any extra variables).
More importantly, I would like to retain the factor information during the summarization. Is it possible to reattach that information in some other way? For example, by recording that a variable was binary (perhaps if more than 50% of the original values were either 0 or 1), or by adding the scale (taking the min and the max of the original variable)?

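By combining several other answers, I managed to solve my problem as follows. The last step uses data.table syntax, so df has to be a data.table by then:
# 1. Helper: convert via the labels rather than the underlying integer codes
as.numeric.factor <- function(x) {as.numeric(as.character(x))}
# 2. Apply it to every column (values that are not numbers become NA)
df[] = lapply(df, as.numeric.factor)
# 3. Names of the numeric columns
cols = sapply(df, is.numeric)
cols = names(cols)[cols]
# 4. Group means with data.table
library(data.table)
setDT(df)
dfsummary = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=countryyear]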
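For readers who would rather stay in dplyr, the same idea might be sketched as follows. This is an illustration on made-up data, not the asker's survey: survey, score and rating are hypothetical names, and the sketch assumes the factor levels are numbers stored as text.

```r
library(dplyr)

# Hypothetical survey data with one numeric column and one factor column
survey <- data.frame(
  countryyear = c("A2000", "A2000", "B2000"),
  score       = c(1, 3, 5),
  rating      = factor(c("1", "2", "2")),
  stringsAsFactors = FALSE
)

dfsum <- survey %>%
  # Convert factors through their labels, not their integer codes
  mutate(across(where(is.factor), ~ as.numeric(as.character(.x)))) %>%
  group_by(countryyear) %>%
  # Average every numeric column within each country-year
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```

Because only factor columns are converted, a character grouping column such as countryyear is left untouched.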

How to make a weighted mean inside a summarise_if

I have a dataframe containing a line per company, with different variables (some numeric, others not):
data <- data.frame(id=1:5,
                   CA = c(1200,1500,1550,200,0),
                   EBE = c(800,50,654,8555,0),
                   VA = c(6984,6588,633,355,84),
                   FBCF = c(35,358,358,1331,86),
                   name=c("qsdf","xdwfq","qsdf","sqdf","qsdfaz"),
                   weight = c(1, 5, 10,1 ,1))
I would like to summarise all numeric variables by a weighted sum. If I wanted a simple sum I would do:
data %>% summarise_if(is.numeric,sum)
but I don't see how to define a weighted sum.
I tried:
w.sum <- function(x) {sum(x*weight) %>% return()}
but without any success.

We can do the multiplication inside funs:
data %>%
  summarise_if(is.numeric, funs(sum(. * weight)))
Note that the above applies to every column of numeric class. In the example the id column is numeric too, although it probably does not need summarising. A better option is summarise_at, which lets you specify the columns of interest:
data %>%
  summarise_at(names(.)[2:5], funs(sum(. * weight)))
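funs() has been deprecated since dplyr 0.8.0, so in a recent version the same calculation might be written with across() and a formula lambda. The sketch below uses the question's data; w_sums and w_means are names introduced for illustration, and the weighted mean (the quantity named in the title) is added with base R's weighted.mean.

```r
library(dplyr)

# Data from the question
data <- data.frame(id = 1:5,
                   CA = c(1200, 1500, 1550, 200, 0),
                   EBE = c(800, 50, 654, 8555, 0),
                   VA = c(6984, 6588, 633, 355, 84),
                   FBCF = c(35, 358, 358, 1331, 86),
                   name = c("qsdf", "xdwfq", "qsdf", "sqdf", "qsdfaz"),
                   weight = c(1, 5, 10, 1, 1))

# Weighted sum of the selected columns; weight is looked up in the data
w_sums <- data %>%
  summarise(across(c(CA, EBE, VA, FBCF), ~ sum(.x * weight)))

# Weighted mean of the same columns
w_means <- data %>%
  summarise(across(c(CA, EBE, VA, FBCF), ~ weighted.mean(.x, weight)))
```

Naming the columns explicitly keeps id and weight themselves out of the summary.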
