Store a specific index value of output in R
I'm basically looking for the equivalent of the following python code in R:
df.groupby('Categorical')['Count'].count()[0]
The following is what I'm doing in R:
by(df$count,df$Categorical,sum)
This accomplishes the same thing as the Python code, but I'd like to know how to store a specific index value of the output in a variable in R (I'm new to R).
Based on the by code, it seems like we can use the following (assuming that 'count' is a column of 1s):
library(dplyr)
out <- df %>%
group_by(Categorical) %>%
summarise(Sum = sum(count))
If the column 'count' has other values as well, the Python code is taking the frequency count of the 'Categorical' column. So, a similar option would be
out <- df %>%
count(Categorical) %>%
slice(1) %>%
pull(n)
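If you prefer to stay with the base R by() call from the question, its result can be indexed like a list, so the first group's value can be stored in a variable directly. A minimal sketch, assuming the df and columns from the question:
res <- by(df$count, df$Categorical, sum)   # same call as in the question
first_value <- res[[1]]                    # value for the first level of 'Categorical'
first_value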
How to create a column that lists the number of occurrences of X in another column?
I've got a huge df that includes the following:
subsetdf <- data_frame(Id = c(1:6), TicketNo = c(15, 16, 15, 17, 17, 17))
I want to add a column, GroupSize, that tells for each Id how many other Ids share the same TicketNo value. In other words, I want output like this:
TheDream <- data_frame(Id = c(1:6), TicketNo = c(15, 16, 15, 17, 17, 17), GroupSize = c(2, 1, 2, 3, 3, 3))
I've unsuccessfully tried:
subsetdf <- subsetdf %>% group_by(TicketNo) %>% add_count(name = "GroupSize")
I'd like to use mutate() but I can't seem to get it right.
Edit
With the GroupSize column now added, I want to add a final column that looks at the values in two other columns and returns the value of whichever is higher. So I've got:
df <- data_frame(Id = c(1:6), TicketNo = c(15, 16, 15, 17, 17, 17), GroupSize = c(2, 1, 2, 3, 3, 3), FamilySize = c(2, 2, 1, 1, 4, 4))
And I want:
df <- data_frame(Id = c(1:6), TicketNo = c(15, 16, 15, 17, 17, 17), GroupSize = c(2, 1, 2, 3, 3, 3), FamilySize = c(2, 2, 1, 1, 4, 4), FinalSize = c(2, 2, 2, 3, 4, 4))
I've unsuccessfully tried:
df <- df %>% pmax(df$GroupSize, df$FamilySize) %>% dplyr::mutate(FinalSize = n())
That attempt earns me the error:
Error: ! Subscript `i` is a matrix, the data `value` must have size 1.
Backtrace:
  ... %>% dplyr::mutate(Groupsize = n())
  base::pmax(., train_data$Family_size, train_data$PartySize)
  tibble:::`[<-.tbl_df`(`*tmp*`, change, value = <int>)
  tibble:::tbl_subassign_matrix(x, j, value, j_arg, substitute(value))
If we need to use mutate, use n() to get the group size. Also, make sure that the mutate is from dplyr (there is also a plyr::mutate, which could mask the function if plyr is loaded later):
library(dplyr)
subsetdf %>%
  group_by(TicketNo) %>%
  dplyr::mutate(GroupSize = n())
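For the edit, note that pmax() is vectorised, so it can go directly inside mutate() rather than being piped into. A minimal sketch, assuming GroupSize and FamilySize already exist in df:
df %>%
  dplyr::mutate(FinalSize = pmax(GroupSize, FamilySize))   # row-wise maximum of the two columns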
Using the R sequence operator ":" within the sum command with more than 50 columns
I would like to index by column name within the sum command using the sequence operator.
library(dbplyr)
library(tidyverse)
df = data.frame(
  X=c("A","B","C"),
  X.1=c(1,2,3),X.2=c(1,2,3),X.3=c(1,2,3),X.4=c(1,2,3),X.5=c(1,2,3),X.6=c(1,2,3),X.7=c(1,2,3),X.8=c(1,2,3),X.9=c(1,2,3),X.10=c(1,2,3),
  X.11=c(1,2,3),X.12=c(1,2,3),X.13=c(1,2,3),X.14=c(1,2,3),X.15=c(1,2,3),X.16=c(1,2,3),X.17=c(1,2,3),X.18=c(1,2,3),X.19=c(1,2,3),X.20=c(1,2,3),
  X.21=c(1,2,3),X.22=c(1,2,3),X.23=c(1,2,3),X.24=c(1,2,3),X.25=c(1,2,3),X.26=c(1,2,3),X.27=c(1,2,3),X.28=c(1,2,3),X.29=c(1,2,3),X.30=c(1,2,3),
  X.31=c(1,2,3),X.32=c(1,2,3),X.33=c(1,2,3),X.34=c(1,2,3),X.35=c(1,2,3),X.36=c(1,2,3),X.37=c(1,2,3),X.38=c(1,2,3),X.39=c(1,2,3),X.40=c(1,2,3),
  X.41=c(1,2,3),X.42=c(1,2,3),X.43=c(1,2,3),X.44=c(1,2,3),X.45=c(1,2,3),X.46=c(1,2,3),X.47=c(1,2,3),X.48=c(1,2,3),X.49=c(1,2,3),X.50=c(1,2,3),
  X.51=c(1,2,3),X.52=c(1,2,3),X.53=c(1,2,3),X.54=c(1,2,3),X.55=c(1,2,3),X.56=c(1,2,3))
Is there a quicker way to do this? The following provides the correct result. However, for large datasets (larger than this one) it becomes very laborious to deal with, especially when pivot_wider is used and the columns are not created beforehand (like above).
df %>% rowwise() %>% mutate(
  Result_column=case_when(
    X=="A"~ sum(c(X.1,X.2,X.3,X.4,X.5)),
    X=="B"~ sum(c(X.4,X.5)),
    X=="C" ~ sum(c(X.3, X.4, X.5, X.6, X.7, X.8, X.9, X.10, X.11, X.12, X.13, X.14, X.15, X.16, X.17, X.18, X.19, X.20, X.21, X.22, X.23, X.24, X.25, X.26, X.27, X.28, X.29, X.30, X.31, X.32, X.33, X.34, X.35, X.36, X.37, X.38, X.39, X.40, X.41, X.42, X.43, X.44, X.45, X.46, X.47, X.48, X.49, X.50, X.51, X.52, X.53, X.54, X.55, X.56)))) %>%
  dplyr::select(Result_column)
The following is how it would be used with "select"-style syntax, which is what I would like to use. However, it does not provide the correct numerical solution. One could shorten the code by ~50 entries by using the sequence operator ":".
df %>% rowwise() %>% mutate(
  Result_column=case_when(
    X=="A"~ sum(c(X.1:X.5)),
    X=="B"~ sum(c(X.4:X.5)),
    X=="C" ~ sum(c(X.3:X.56)))) %>%
  dplyr::select(Result_column)
Below is a related question; however, it is not the same, because what is needed is not a column that starts with "X" but rather a sequence.
Using mutate rowwise over a subset of columns
EDIT: the provided code (below) from cnbrowlie is correct.
df %>% mutate(
  Result_column=case_when(
    X=="A"~ sum(c(X.1:X.5)),
    X=="B"~ sum(c(X.4:X.5)),
    X=="C" ~ sum(c(X.3:X.56)))) %>%
  dplyr::select(Result_column)
This can be done with dplyr >= 1.0.0 using rowSums() (which computes the sum for a row across multiple columns) and across() (which superseded vars() as a method for specifying columns in a data frame, allowing the use of : to select sequences of columns):
df %>% rowwise() %>% mutate(
  Result_column=case_when(
    X=="A"~ rowSums(across(X.1:X.5)),
    X=="B"~ rowSums(across(X.4:X.5)),
    X=="C" ~ rowSums(across(X.3:X.56))
  )
) %>% dplyr::select(Result_column)
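A closely related option, offered here only as an alternative sketch: within a rowwise() pipeline, dplyr >= 1.0.0 also provides c_across(), which accepts the same ":" column ranges and returns a plain vector that sum() can consume directly:
df %>%
  rowwise() %>%
  mutate(Result_column = case_when(
    X == "A" ~ sum(c_across(X.1:X.5)),
    X == "B" ~ sum(c_across(X.4:X.5)),
    X == "C" ~ sum(c_across(X.3:X.56))
  )) %>%
  dplyr::select(Result_column)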
How can I filter data inside mutate() using a counting function (like NROW) in R?
I have a dataframe with the columns doc_id and feats (both character vectors). I'm trying to create a new column n_rel_prn, which has the number of total occurrences of the value 'PronType=Rel' in the feats column, for each doc_id.
I can't use filter(), because it filters out all of the other data I need (i.e. where the value for feats is not 'PronType=Rel'), but otherwise it does the trick. Here's that code snippet:
tcorpus %>%
  group_by(doc_id) %>%
  filter(feats == 'PronType=Rel') %>%
  mutate(n_rel_prn = n())
Basically, I need something that works like the following code (except that it actually works; this obviously doesn't):
tcorpus %>%
  group_by(doc_id) %>%
  mutate(n_rel_prn = NROW(feats == 'PronType=Rel'))
Is there a way I can count the number of 'PronType=Rel' observations (grouped by doc_id) and add these totals to a new column? I'm assuming that, at the very least, group_by() %>% mutate() is the way to go.
You are almost there. Try this:
tcorpus %>%
  group_by(doc_id) %>%
  mutate(n_rel_prn = sum(feats == 'PronType=Rel'))
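The reason this works: feats == 'PronType=Rel' is a logical vector, and sum() treats TRUE as 1 and FALSE as 0, so it counts the matching rows within each group. A tiny standalone sketch with invented values (the vector below is hypothetical, not from the question):
x <- c('PronType=Rel', 'Case=Nom', 'PronType=Rel')
sum(x == 'PronType=Rel')   # returns 2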
R version 3.6.3 (2020-02-29) | Using package dplyr_1.0.0 | Unable to perform summarise() function
Trying to perform the basic summarise() function but getting the same error again and again! I have a large number of csv files with 4 columns. I am reading them into R using lapply and rbinding them. Next I need to see the number of complete observations present for each ID.
Error:
Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occurred in group 1: ID = 1.
Code:
library(dplyr)
merged <- do.call(rbind, lapply(list.files(), read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged), ]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
The problem is not coming from summarise but from n. If you look at the help (?n), you will see that n is used without any argument, like this:
new_data_count <- remove_na %>%
  group_by(ID) %>%
  summarise(complete_cases = n())
This will count the number of rows for each ID group and is independent of the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>%
  group_by(ID) %>%
  summarise(complete_cases = n_distinct(Date))
Of note, you could have written your code with purrr::map, which has nicer variants than the *apply family because you can specify the return type with the suffix. This could look like this:
library(purrr)
remove_na <- map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr returns a data.frame by binding rows; map_dfc would return a data.frame by binding columns.
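To make the difference between n() and n_distinct() concrete, here is a small hypothetical example (the toy data frame below is invented for illustration, not taken from the question):
library(dplyr)
toy <- data.frame(ID = c(1, 1, 1, 2),
                  Date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"))
toy %>% group_by(ID) %>% summarise(rows = n(), dates = n_distinct(Date))
# ID 1 has 3 rows but only 2 distinct dates; ID 2 has 1 of each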
2 Numeric Values In A Dataframe Field In R
I have a dataset in R with a little under 100 columns. Some of the columns have numeric values such as 87+3 as opposed to 90. I have been able to update each column with the following piece of code:
library(dplyr)
new_dataframe = dataframe %>%
  rowwise() %>%
  mutate(new_value = eval(parse(text = value)))
However, I would like to be able to update a list of 60 columns in a more efficient way than simply repeating this line for each column. Can someone help me find a more efficient way?
We can use mutate_at:
library(dplyr)
dataframe %>%
  rowwise() %>%
  mutate_at(1:60, list(new_value = ~eval(parse(text = .))))
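In dplyr >= 1.0.0, mutate_at() is superseded by across(). A minimal sketch of the same idea (the column positions 1:60 and the "_new" name suffix are carried over from the answer as assumptions, not fixed requirements):
library(dplyr)
dataframe %>%
  rowwise() %>%
  mutate(across(1:60, ~ eval(parse(text = .x)), .names = "{.col}_new")) %>%  # evaluate each expression string
  ungroup()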