Store specific index value of output in R - r

I'm basically looking for the equivalent of the following python code in R:
df.groupby('Categorical')['Count'].count()[0]
The following is what I'm doing in R:
by(df$count,df$Categorical,sum)
This accomplishes the same thing as the first code, but I'd like to know how to store a single indexed value from that output in a variable in R (I'm new to R).

Based on the by code, it seems we can use the following (assuming that 'count' is a column of 1s):
library(dplyr)
out <- df %>%
group_by(Categorical) %>%
summarise(Sum = sum(count))
If the 'count' column has other values as well, note that the Python code takes the frequency count of the 'Categorical' column. A similar option that also extracts the first value would be:
out <- df %>%
count(Categorical) %>%
slice(1) %>%
pull(n)
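For the "store an index value" part of the question, a base R sketch may also help. The data frame below is made up for illustration (only the column names come from the question). It shows that the result of by() or table() can be indexed with [[ ]] and the extracted value assigned to a variable.

```r
# Hypothetical example data; only the column names follow the question.
df <- data.frame(
  Categorical = c("a", "b", "a", "c", "a"),
  count = c(1, 1, 1, 1, 1)
)

# by() returns one result per group; [[ ]] pulls out a single element.
sums <- by(df$count, df$Categorical, sum)
first_sum <- sums[[1]]       # sum for the first group ("a")

# table() gives per-group frequencies, similar to pandas' .count().
freq <- table(df$Categorical)
first_freq <- freq[[1]]      # frequency of the first group, by position
b_freq <- freq[["b"]]        # frequency of group "b", by name
```

Note that R indexes from 1, so [[1]] here plays the role of [0] in the Python code.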

Related

How create column that lists number of occurrences of X in another column?

I've got a huge df that include the following:
subsetdf <- data_frame(Id=c(1:6),TicketNo=c(15,16,15,17,17,17))
I want to add a column, GroupSize, that tells for each Id how many other Ids share the same TicketNo value. In other words, I want output like this:
TheDream <- data_frame(Id=c(1:6),TicketNo=c(15,16,15,17,17,17),GroupSize=c(2,1,2,3,3,3))
I've unsuccessfully tried:
subsetdf <- subsetdf %>%
group_by(TicketNo) %>%
add_count(name = "GroupSize")
I'd like to use mutate() but I can't seem to get it right.
Edit
With the GroupSize column now added, I want to add a final column that looks at the values in two other columns and returns the value of whichever is higher. So I've got:
df <- data_frame(Id=c(1:6),TicketNo=c(15,16,15,17,17,17),GroupSize=c(2,1,2,3,3,3),FamilySize=c(2,2,1,1,4,4))
And I want:
df <- data_frame(Id=c(1:6),TicketNo=c(15,16,15,17,17,17),GroupSize=c(2,1,2,3,3,3),FamilySize=c(2,2,1,1,4,4),FinalSize=c(2,2,2,3,4,4))
I've unsuccessfully tried:
df <- df %>% pmax(df$GroupSize, df$FamilySize) %>% dplyr::mutate(FinalSize = n())
That attempt earns me the error:
Error: ! Subscript `i` is a matrix, the data `value` must have size 1.
Backtrace:
... %>% dplyr::mutate(Groupsize = n())
base::pmax(., train_data$Family_size, train_data$PartySize)
tibble:::`[<-.tbl_df`(`*tmp*`, change, value = <int>)
tibble:::tbl_subassign_matrix(x, j, value, j_arg, substitute(value))
If we need to use mutate, use n() to get the group size. Also, make sure that mutate comes from dplyr (there is also a plyr::mutate, which masks the dplyr function if plyr is loaded later):
library(dplyr)
subsetdf %>%
group_by(TicketNo) %>%
dplyr::mutate(GroupSize = n())
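For the follow-up in the Edit (taking the larger of two columns), one possible sketch is to call pmax() inside mutate() rather than piping the data frame into pmax(), which passes the whole data frame as pmax()'s first argument. The data below reuses the values from the question; ungroup() is added here on the assumption that the grouping is no longer needed afterwards.

```r
library(dplyr)

# Values copied from the question; FamilySize is assumed to already exist.
df <- tibble(
  Id = 1:6,
  TicketNo = c(15, 16, 15, 17, 17, 17),
  FamilySize = c(2, 2, 1, 1, 4, 4)
)

out <- df %>%
  group_by(TicketNo) %>%
  dplyr::mutate(GroupSize = n()) %>%   # number of rows sharing each TicketNo
  ungroup() %>%
  # pmax() compares the two columns element by element and keeps the larger.
  dplyr::mutate(FinalSize = pmax(GroupSize, FamilySize))
```

With this data, FinalSize matches the desired column from the question: 2, 2, 2, 3, 4, 4.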

Using the R syntax sequence operator ":" within the sum command with more than 50 columns

I would like to index by column name within the sum command using the sequence operator.
library(dbplyr)
library(tidyverse)
df=data.frame(
X=c("A","B","C"),
X.1=c(1,2,3),X.2=c(1,2,3),X.3=c(1,2,3),X.4=c(1,2,3),X.5=c(1,2,3),X.6=c(1,2,3),X.7=c(1,2,3),X.8=c(1,2,3),X.9=c(1,2,3),X.10=c(1,2,3),
X.11=c(1,2,3),X.12=c(1,2,3),X.13=c(1,2,3),X.14=c(1,2,3),X.15=c(1,2,3),X.16=c(1,2,3),X.17=c(1,2,3),X.18=c(1,2,3),X.19=c(1,2,3),X.20=c(1,2,3),
X.21=c(1,2,3),X.22=c(1,2,3),X.23=c(1,2,3),X.24=c(1,2,3),X.25=c(1,2,3),X.26=c(1,2,3),X.27=c(1,2,3),X.28=c(1,2,3),X.29=c(1,2,3),X.30=c(1,2,3),
X.31=c(1,2,3),X.32=c(1,2,3),X.33=c(1,2,3),X.34=c(1,2,3),X.35=c(1,2,3),X.36=c(1,2,3),X.37=c(1,2,3),X.38=c(1,2,3),X.39=c(1,2,3),X.40=c(1,2,3),
X.41=c(1,2,3),X.42=c(1,2,3),X.43=c(1,2,3),X.44=c(1,2,3),X.45=c(1,2,3),X.46=c(1,2,3),X.47=c(1,2,3),X.48=c(1,2,3),X.49=c(1,2,3),X.50=c(1,2,3),
X.51=c(1,2,3),X.52=c(1,2,3),X.53=c(1,2,3),X.54=c(1,2,3),X.55=c(1,2,3),X.56=c(1,2,3))
Is there a quicker way to do this? The following provides the correct result. However, for large datasets (larger than this one) it becomes very laborious to deal with, especially when pivot_wider is used and the columns are not created beforehand (as they are above).
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1,X.2,X.3,X.4,X.5)),
X=="B"~ sum(c(X.4,X.5)),
X=="C" ~ sum(c( X.3, X.4, X.5, X.6, X.7, X.8, X.9, X.10, X.11, X.12, X.13, X.14, X.15, X.16,
X.17, X.18, X.19, X.20, X.21, X.22, X.23, X.24, X.25, X.26, X.27, X.28, X.29, X.30,
X.31, X.32, X.33, X.34, X.35, X.36, X.37, X.38, X.39, X.40, X.41, X.42,X.43, X.44,
X.45, X.46, X.47, X.48, X.49, X.50, X.51, X.52, X.53, X.54, X.55, X.56)))) %>% dplyr::select(Result_column)
The following is how it would look with "select"-style syntax, which is what I would like to use. However, it does not produce the correct numerical result. It would shorten the code by ~50 entries by using the sequence operator ":".
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1:X.5)),
X=="B"~ sum(c(X.4:X.5)),
X=="C" ~ sum(c(X.3:X.56)))) %>% dplyr::select(Result_column)
Below is a related question. However, it is not the same, because what is needed is not every column that starts with "X" but rather a sequence of columns.
Using mutate rowwise over a subset of columns
EDIT:
the provided code (below) from cnbrowlie is correct.
df %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1:X.5)),
X=="B"~ sum(c(X.4:X.5)),
X=="C" ~ sum(c(X.3:X.56)))) %>% dplyr::select(Result_column)
This can be done with dplyr >= 1.0.0 using rowSums() (which computes the sum of each row across multiple columns) and across() (which superseded vars() as the way to specify columns in a data frame, and allows : to select a sequence of columns):
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ rowSums(across(X.1:X.5)),
X=="B"~ rowSums(across(X.4:X.5)),
X=="C" ~ rowSums(across(X.3:X.56))
)
) %>% dplyr::select(Result_column)
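Because rowSums() already works row by row, the rowwise() step is probably not needed with this approach. The sketch below illustrates that on a cut-down, hypothetical version of the data (six X columns instead of 56, so the "C" range is X.3:X.6 here).

```r
library(dplyr)

# Reduced example data: same shape as the question, fewer columns.
df_small <- data.frame(
  X = c("A", "B", "C"),
  X.1 = c(1, 2, 3), X.2 = c(1, 2, 3), X.3 = c(1, 2, 3),
  X.4 = c(1, 2, 3), X.5 = c(1, 2, 3), X.6 = c(1, 2, 3)
)

res <- df_small %>%
  mutate(
    Result_column = case_when(
      # across() selects the column range; rowSums() adds it up per row.
      X == "A" ~ rowSums(across(X.1:X.5)),
      X == "B" ~ rowSums(across(X.4:X.5)),
      X == "C" ~ rowSums(across(X.3:X.6))
    )
  ) %>%
  dplyr::select(Result_column)
```

Each branch is computed for all rows and case_when() then picks the matching one, so on wide data it may be worth checking performance against the rowwise() version.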

How can I filter data inside mutate() using a counting function (like NROW) in R?

I have a dataframe with the columns doc_id and feats (both character vectors). I'm trying to create a new column n_rel_prn, which has the number of total occurrences of the value 'PronType=Rel' in the feats column, for each doc_id.
I can't use filter(), because it filters out all of the other data I need (i.e. where the value for feats is not 'PronType=Rel'), but otherwise it does the trick. (Here's that code snippet:)
tcorpus %>% group_by(doc_id) %>%
filter(feats=='PronType=Rel') %>%
mutate(n_rel_prn = n())
Basically, I need something that works like the following code (except that actually works--this obviously doesn't):
tcorpus %>% group_by(doc_id) %>%
mutate(n_rel_prn = NROW(feats == 'PronType=Rel'))
Is there a way I can count the number of 'PronType=Rel' observations (grouped by doc_id) and add these totals to a new column? (I'm assuming at the very least group_by %>% mutate() is the way to go.)
You are almost there. Try this:
tcorpus %>% group_by(doc_id) %>% mutate(n_rel_prn = sum(feats == 'PronType=Rel'))
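To see why sum() works where NROW() does not: the comparison feats == 'PronType=Rel' yields a logical vector as long as the group, so NROW() just reports the group size, while sum() counts the TRUE values (TRUE is treated as 1). A small sketch with made-up data:

```r
library(dplyr)

# Hypothetical corpus; only the column names follow the question.
tcorpus <- tibble(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  feats = c("PronType=Rel", "Case=Nom", "PronType=Rel", "Case=Nom", "PronType=Rel")
)

out <- tcorpus %>%
  group_by(doc_id) %>%
  mutate(
    n_rows = NROW(feats == "PronType=Rel"),    # length of the logical vector
    n_rel_prn = sum(feats == "PronType=Rel")   # number of TRUE values
  ) %>%
  ungroup()
```

Here n_rows is 3 for d1 and 2 for d2 (the group sizes), whereas n_rel_prn is 2 for d1 and 1 for d2, and no rows are dropped.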

R version 3.6.3 (2020-02-29) | Using package dplyr_1.0.0 | Unable to perform summarise() function

Trying to use the basic summarise() function, but I keep getting the same error again and again!
I have a large number of csv files having 4 columns. I am reading them into R using lapply and rbinding them. Next I need to see the number of complete observations present for each ID.
Error:
*Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occured in group 1: ID = 1.*
Code:
library(dplyr)
merged <-do.call(rbind,lapply(list.files(),read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged),]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
Here is what the data looks like
The problem is not coming from summarise but from n.
If you look at the help ?n, you will see that n is used without any argument, like this:
new_data_count <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())
This will count the number of rows for each ID group and is independent from the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
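The difference between the three options can be checked on a tiny example. The data below is invented for illustration; only the column names ID and Date come from the question.

```r
library(dplyr)

# Hypothetical data standing in for the merged, NA-free CSV contents.
remove_na <- tibble(
  ID = c(1, 1, 1, 2, 2),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-03")
)

# n() takes no arguments and counts rows per group.
by_rows <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())

# count() is shorthand for the same thing; the result column is called n.
counted <- remove_na %>% count(ID)

# n_distinct() counts unique Date values per group.
by_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
```

For ID 1 this gives 3 rows but only 2 distinct dates, which shows that the two counts can differ.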
Of note, you could have written your code with purrr's map functions, which are more predictable than the base *apply family because the suffix specifies the return type. This could look like this:
library(purrr)
remove_na = map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr returns a data.frame by binding the results row-wise; map_dfc would instead return a data.frame by binding them column-wise.

2 Numeric Values In A Dataframe Field In R

I have a dataset in R with a little under 100 columns.
Some of the columns have numeric values written as expressions, such as 87+3 as opposed to 90.
I have been able to update each column with the following piece of code:
library(dplyr)
new_dataframe = dataframe %>%
rowwise() %>%
mutate(new_value = eval(parse(text = value)))
However, I would like to be able to update a list of 60 columns in a more efficient way than simply repeating this line for each column.
Can someone help me find a more efficient way?
We can use mutate_at
library(dplyr)
dataframe %>%
rowwise() %>%
mutate_at(1:60, list(new_value = ~eval(parse(text = .))))
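In dplyr >= 1.0.0 the same idea can also be written with across(). The sketch below uses hypothetical column names (value_a, value_b) and a small helper, eval_text(), which is not part of any package; it evaluates each string separately so rowwise() is not required. Keep in mind that eval(parse()) runs whatever code the string contains, so this is only reasonable for trusted data.

```r
library(dplyr)

# Hypothetical data: character columns holding arithmetic expressions.
dataframe <- tibble(
  id = 1:3,
  value_a = c("87+3", "90", "45+2"),
  value_b = c("10", "7+1", "3+3")
)

# Helper (illustrative): evaluate each string as an R expression, one at a time.
eval_text <- function(x) {
  vapply(x, function(s) eval(parse(text = s)), numeric(1), USE.NAMES = FALSE)
}

# across() applies the helper to every selected column, overwriting it in place.
new_dataframe <- dataframe %>%
  mutate(across(c(value_a, value_b), eval_text))
```

A column range such as value_a:value_b, or a positional selection like 1:60 as used with mutate_at above, can be substituted for the c(...) selection.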
