I am trying to make a simple pivot table in R using the dplyr or reshape2 packages, because my dataset is too large and R runs out of memory with sqldf. The two columns of my dataset that I want to build the pivot table from are "Product" and "Cust_Id". I want to count the number of customers per product. This is what I have so far.
library(reshape2)
mydata<-read.table("Book1.txt",header=TRUE,fill=TRUE)
mydata.m<-melt(mydata,id=c("Product"),measured=c(Cust_Id))
mydata.d<-dcast(mydata.m,Product~variable,count)
It returns
Error in UseMethod("group_by_"):
no applicable method for 'group_by_' applied to an object of class "c('integer','numeric')"
I have also tried dplyr with the code below (I am not sure about the last step, though, as I ran it on another laptop)
library(dplyr)
mydata.df<-tbl_df(mydata)
summarize(mydata.df,Product,Cust_Id=n())
I got no error message, but a lot of values seem to be missing from the output.
I really appreciate your input. Thanks in advance.
Try this:
library(dplyr)
mydata <- mydata %>%
group_by(Product) %>%
summarise(nCustomers = n())
Alternatively, if you only want to count unique customers, you can do:
library(dplyr)
mydata <- mydata %>%
group_by(Product) %>%
summarise(nCustomers = n_distinct(Cust_Id))
If this really is a big data set, then your best option is the data.table package
require(data.table)
mydata_data_table = data.table(mydata)
number_customer = mydata_data_table[, .(number_customers = .N), by=Product]
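To see the difference between counting rows and counting unique customers, here is a small sketch on made-up data. The toy values and the nRows column name are assumptions for illustration; only Product and Cust_Id come from the question.

```r
library(dplyr)

# Toy stand-in for the real file; customer 1 bought product A twice.
mydata <- data.frame(
  Product = c("A", "A", "A", "B", "B"),
  Cust_Id = c(1, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

counts <- mydata %>%
  group_by(Product) %>%
  summarise(
    nRows      = n(),                 # every row counts, repeats included
    nCustomers = n_distinct(Cust_Id)  # each customer counts once per product
  )

print(counts)
```

For product A the two columns differ (3 rows, 2 customers), which is usually the quickest way to check which of the two counts you actually want.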
I have the following dataframe in R:
I am trying to make an extra column called "Opposition" which I would like to be the other team that has the same Date, Venue and inverse of the Margin. My expected output is:
Does anyone know how to achieve this in R? I'm fairly new and can't quite work it out. Thanks!
We could reverse the Team A values within each group of Venue, Date and absolute value of Margin.
library(dplyr)
df %>%
group_by(Venue, Date, tmp = abs(Margin)) %>%
mutate(Opposition = rev(`Team A`)) %>%
# ungroup first: select() cannot drop a column that is still a grouping variable
ungroup() %>%
select(-tmp) -> result
result
Suppose that your dataframe is named dataset. Then
dataset %>%
left_join(dataset %>% mutate(Margin=-Margin) %>%
rename(Opposition=`Team A`),
by=c("Date", "Venue", "Margin")
)
will do the trick: each row is matched to the row with the same Date and Venue and the opposite Margin.
Please note that the dplyr package is required for the mutate(), rename() and left_join() functions. The %>% pipe operator comes from magrittr, but dplyr re-exports it, so loading dplyr is enough. Loading the tidyverse package attaches dplyr along with the other core tidyverse packages.
Using data.table
library(data.table)
setDT(df)[, Opposition := `Team A`[.N:1], .(Venue, Date, tmp = abs(Margin))]
A base R option with ave + rev may help
transform(
df,
Opposition = ave(TeamA, Date, Venue, abs(Margin),FUN = rev)
)
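Since the question's data frame is not shown, here is a minimal sketch of the group-and-reverse idea on made-up fixtures. The team names, dates and venues are assumptions; only the column names follow the question.

```r
library(dplyr)

# Made-up fixtures: two games on the same date at different venues.
df <- data.frame(
  `Team A` = c("Lions", "Tigers", "Bears", "Wolves"),
  Date     = as.Date(rep("2021-05-01", 4)),
  Venue    = c("North", "North", "South", "South"),
  Margin   = c(12, -12, -7, 7),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(Venue, Date, tmp = abs(Margin)) %>%  # one group per game
  mutate(Opposition = rev(`Team A`)) %>%        # swap the two teams in each game
  ungroup() %>%
  select(-tmp)

print(result)
```

This relies on every group holding exactly the two rows of one game. If two games at the same venue on the same date ended with the same absolute margin, they would fall into one group and the reversal would pair the wrong teams.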
I am wondering if there is a more efficient or alternative way to compute the row-wise product of a selection of columns in dplyr.
I know one way to do it (see below), but rowwise() takes a long time to run on my large data set, so I am looking for an alternative.
df = df %>%
rowwise %>%
mutate(myprod = prod(c_across(starts_with('var_xyz'))))
Here are some alternative options.
If you want to stay in tidyverse you can try pmap_dbl :
library(dplyr)
library(purrr)
df %>% mutate(myprod = pmap_dbl(select(., starts_with('var_xyz')), prod))
A base R option with Reduce or using rowProds from matrixStats.
cols <- grep('^var_xyz', names(df))
#2.
df$myprod <- Reduce(`*`, df[cols])
#3.
df$myprod <- matrixStats::rowProds(as.matrix(df[cols]))
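As a quick check that the Reduce approach gives the same answer as a per-row prod(), here is a base R sketch on made-up data. The column values are assumptions; the var_xyz prefix follows the question.

```r
# Made-up data; only the var_xyz prefix comes from the question.
df <- data.frame(
  id        = 1:3,
  var_xyz_a = c(1, 2, 3),
  var_xyz_b = c(4, 5, 6),
  var_xyz_c = c(2, 0, 1)
)

cols <- grep("^var_xyz", names(df))

# Reduce multiplies whole columns together, so the work is vectorised over rows.
df$myprod <- Reduce(`*`, df[cols])

# Slow reference for comparison: one prod() call per row.
reference <- unname(apply(df[cols], 1, prod))

print(df)
```

The speed-up comes from doing one vectorised multiplication per column rather than one function call per row, so it matters most when there are many rows and few columns.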
I'd like to pick a different number of rows from each group of my data frame. I haven't figured out an elegant way to do this with dplyr yet. To pick the same number of rows for each group, I do this:
library(dplyr)
iris %>%
group_by(Species) %>%
arrange(Sepal.Length) %>%
top_n(2)
But I would like to be able to reference another table with the number of rows I'd like for each group, a sample table like this below:
top_rows_desired <- data.frame(Species = unique(iris$Species),
n_desired = c(4,2,5))
We can do a left_join with 'iris' and 'top_rows_desired' by 'Species', grouped by 'Species', slice the sequence of first 'n_desired' and remove the 'n_desired' column with select.
left_join(iris, top_rows_desired, by = "Species") %>%
group_by(Species) %>%
arrange(desc(Sepal.Length)) %>%
slice(seq(first(n_desired))) %>%
select(-n_desired)
Just adding this answer for those folks who are unable to run the code that akrun provided. I struggled with this for a while. This answer tackles the issue #2531 mentioned on github.
You might not be able to run slice because you already have xgboost loaded in your environment. xgboost masks dplyr's slice function leading to this issue.
Attaching package: ‘xgboost’
The following object is masked from ‘package:dplyr’:
slice
Warning message:
package ‘xgboost’ was built under R version 3.4.1
So using
detach(package:xgboost)
might work for you.
I wasted an hour because of this. Hope this is helpful.
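If you would rather keep the other package attached, another option is to call the function with an explicit namespace, so that masking no longer matters. A minimal sketch, reusing iris and the top_rows_desired table from the question (the name picked is an assumption):

```r
library(dplyr)

top_rows_desired <- data.frame(Species = unique(iris$Species),
                               n_desired = c(4, 2, 5))

picked <- left_join(iris, top_rows_desired, by = "Species") %>%
  group_by(Species) %>%
  arrange(desc(Sepal.Length)) %>%
  # dplyr::slice always resolves to dplyr's version, whatever else is attached
  dplyr::slice(seq_len(first(n_desired))) %>%
  ungroup() %>%
  select(-n_desired)

print(table(picked$Species))
```

The same trick works for any masked function, for example dplyr::filter or dplyr::select.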
This is my first stackoverflow question.
I'm trying to use dplyr to process and output a summary of data grouped by a categorical variable (inj_length_cat3) in my dataset. Actually, I generate this variable (from inj_length) on the fly using mutate(). I also want to output the same summary of the data without grouping. The only way I figured out how to do that is to do the analysis twice over, once with, once without grouping, and then combine the outputs. Ugh.
I'm sure there is a more elegant solution than this and it bugs me. I wonder if anyone would be able to help.
Thanks!
library(dplyr)
df<-data.frame(year=sample(c(2005,2006),20,replace=T),inj_length=sample(1:10,20,replace=T),hiv_status=sample(0:1,20,replace=T))
tmp <- df %>%
mutate(inj_length_cat3 = cut(inj_length, breaks=c(0,3,100), labels = c('<3 years','>3 years')))%>%
group_by(year,inj_length_cat3)%>%
summarise(
r=sum(hiv_status,na.rm=T),
n=length(hiv_status),
p=prop.test(r,n)$estimate,
cilow=prop.test(r,n)$conf.int[1],
cihigh=prop.test(r,n)$conf.int[2]
) %>%
filter(inj_length_cat3%in%c('<3 years','>3 years'))
tmp_all <- df %>%
group_by(year)%>%
summarise(
r=sum(hiv_status,na.rm=T),
n=length(hiv_status),
p=prop.test(r,n)$estimate,
cilow=prop.test(r,n)$conf.int[1],
cihigh=prop.test(r,n)$conf.int[2]
)
tmp_all$inj_length_cat3=as.factor('All')
tmp<-merge(tmp_all,tmp,all=T)
I'm not sure you consider this more elegant, but you can get a solution to work if you first create a dataframe that has all your data twice: once so that you can get the subgroups and once to get the overall summary:
df1 <- rbind(df,df)
df1$inj_length_cat3 <- cut(df$inj_length, breaks=c(0,3,100,Inf),
labels = c('<3 years','>3 years','All'))
df1$inj_length_cat3[-(1:nrow(df))] <- "All"
Now you just need to run your first analysis without mutate():
tmp <- df1 %>%
group_by(year,inj_length_cat3)%>%
summarise(
r=sum(hiv_status,na.rm=T),
n=length(hiv_status),
p=prop.test(r,n)$estimate,
cilow=prop.test(r,n)$conf.int[1],
cihigh=prop.test(r,n)$conf.int[2]
) %>%
filter(inj_length_cat3%in%c('<3 years','>3 years','All'))
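Another way to avoid writing the summary twice, without duplicating the data, is to put the summarise() step in a small function and apply it to both groupings. This is only a sketch: summarise_hiv and df_cat are hypothetical names, set.seed() is added so the toy data is reproducible, and .groups = "drop" assumes dplyr 1.0.0 or later.

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  year       = sample(c(2005, 2006), 20, replace = TRUE),
  inj_length = sample(1:10, 20, replace = TRUE),
  hiv_status = sample(0:1, 20, replace = TRUE)
)

# Hypothetical helper: summarises whatever grouping the input already carries.
summarise_hiv <- function(d) {
  d %>%
    summarise(
      r      = sum(hiv_status, na.rm = TRUE),
      n      = length(hiv_status),
      p      = prop.test(r, n)$estimate,
      cilow  = prop.test(r, n)$conf.int[1],
      cihigh = prop.test(r, n)$conf.int[2],
      .groups = "drop"
    )
}

# Build the category once, as character so it binds cleanly with "All".
df_cat <- df %>%
  mutate(inj_length_cat3 = as.character(
    cut(inj_length, breaks = c(0, 3, 100), labels = c("<3 years", ">3 years"))
  ))

tmp <- bind_rows(
  df_cat %>% group_by(year, inj_length_cat3) %>% summarise_hiv(),
  df_cat %>% group_by(year) %>% summarise_hiv() %>% mutate(inj_length_cat3 = "All")
)

print(tmp)
```

prop.test() may warn about small counts on data this size. The point of the sketch is only that the summary logic lives in one place.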
I am trying to move from plyr to dplyr. However, I still can't figure out how to call my own functions in a chained dplyr call.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr function looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr seems to pass a number of different tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object where groups are annotated), thus when I cbind the Experience variable it appends a counter from 0 to the length of the entire table instead of the single groups.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
group_by(ID_variable) %>%
arrange(ID_variable,order_variable) %>%
mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass the groups, split into separate tables, to my own functions in dplyr.
For those who get here from Google: let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
Called the way the question does it,
df %>%
group_by(b) %>%
printFunction(.)
prints the entire data set at once. To get dplyr to print one table per group, you should use do()
df %>%
group_by(b) %>%
do(printFunction(.))
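In dplyr 1.0.0 and later, do() is superseded, and group_modify() is one way to hand each group to your own function as a separate table. Here is a minimal sketch of the plyr function from the question rewritten that way. The toy data and the two-argument form of f are assumptions: group_modify() passes the group's rows (without the grouping column) as the first argument and a one-row tibble of grouping values as the second.

```r
library(dplyr)

# Toy data; the column names follow the question.
data <- data.frame(
  ID_variable    = factor(c("a", "a", "a", "b", "b")),
  order_variable = c(3, 1, 2, 20, 10)
)

# Called once per group: x holds that group's rows, key holds its ID.
f <- function(x, key) {
  x <- x[order(x$order_variable), , drop = FALSE]
  x$Experience <- seq_len(nrow(x)) - 1L  # counter restarts at 0 in every group
  x
}

result <- data %>%
  group_by(ID_variable) %>%
  group_modify(f) %>%
  ungroup()

print(result)
```

The function must return a data frame, and group_modify() adds the grouping column back for you, so f should not return ID_variable itself.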