Knitr kable won't output, can someone tell me why? - r

I have two kables. Both will output correctly as a data frame, but I want them to be kables as it just looks smarter. The first one works fine as a kable:
#Load the required packages (dplyr for the pipeline, knitr for kable);
#the SDreams database connection is assumed to exist already
library(dplyr)
library(knitr)
#Build tables and convert to df
#First time = N
firstTimeNWork <- as.data.frame(tbl(SDreams, "PassengersInfo") %>%
  filter(FirstTime == 0 & WorkHoliday == 1) %>%
  select(FirstTime, WorkHoliday, Price))
firstTimeNWorkN <- as.data.frame(tbl(SDreams, "PassengersInfo") %>%
  filter(FirstTime == 0 & WorkHoliday == 0) %>%
  select(FirstTime, WorkHoliday, Price))
#First time = Y
firstTimeYWork <- as.data.frame(tbl(SDreams, "PassengersInfo") %>%
  filter(FirstTime == 1 & WorkHoliday == 1) %>%
  select(FirstTime, WorkHoliday, Price))
firstTimeYWorkN <- as.data.frame(tbl(SDreams, "PassengersInfo") %>%
  filter(FirstTime == 1 & WorkHoliday == 0) %>%
  select(FirstTime, WorkHoliday, Price))
#Set up headings for the output table
FirstTime <- c('No', ' ', 'Yes', ' ')
WorkHoliday <- c('Yes', 'No', 'Yes', 'No')
#Find the average amount paid based on the tables created above
#First time = N
firstTimeNWorkMean <- mean(firstTimeNWork$Price)
firstTimeNWorkNMean <- mean(firstTimeNWorkN$Price)
#First time = Y
firstTimeYWorkMean <- mean(firstTimeYWork$Price)
firstTimeYWorkNMean <- mean(firstTimeYWorkN$Price)
#Add values under the heading
MeanPrice <- c(firstTimeNWorkMean, firstTimeNWorkNMean, firstTimeYWorkMean, firstTimeYWorkNMean)
#Create the output df
passengerSummary1 <- data.frame(FirstTime, WorkHoliday, MeanPrice)
#Label columns & display
kable(passengerSummary1,
      col.names = c("First time travelling?", "As part of work?", "Average price paid in (£)"),
      align = c("c", "c", "c"), digits = 0)
However, the second kable won't render:
#Create tables
wouldVisitAgainNPrice <- as.data.frame(tbl(SDreams, "PassengersInfo") %>%
  filter(Propensity == 1) %>%
  select(Propensity, Price))
wouldVisitAgainYPrice <- as.data.frame(tbl(SDreams, "PassengersInfo") %>%
  filter(Propensity == 2) %>%
  select(Propensity, Price))
#Count the number of passengers matching each filter
wouldVisitAgainNPriceTab <- tabulate(wouldVisitAgainNPrice$Propensity)
wouldVisitAgainYPriceTab <- tabulate(wouldVisitAgainYPrice$Propensity)[[2]]
#Set up column
WouldVisitAgain <- c('No', 'Yes')
#Add column with values
Passengers <- c(wouldVisitAgainNPriceTab, wouldVisitAgainYPriceTab)
#Means of prices
wouldVisitAgainNPriceMean <- mean(wouldVisitAgainNPrice$Price)
wouldVisitAgainYPriceMean <- mean(wouldVisitAgainYPrice$Price)
#Add column with price values
MeanPrice2 <- c(wouldVisitAgainNPriceMean, wouldVisitAgainYPriceMean)
#Create df
passengerSummary2 <- data.frame(WouldVisitAgain, Passengers, MeanPrice2)
#Label columns & display
kable(passengerSummary2,
      col.names = c("Would visit again?", "No of Passengers", "Mean Price in (£)"),
      align = c("c", "c", "c"), digits = 0)
As you can tell, the kable code is exactly the same, so I am very confused as to why it won't work. I have tried updating all my packages, restarting RStudio, and removing objects, but none of that has made any difference.
Sorry in advance for the amount of code and how bad it is. I am new to this, so I am aware I have likely done something pretty silly, and if I'm asked why I did it that way, my only answer will be that I'm new; I'll remember the advice and try it out if I have time.
Thanks in advance if anyone can help with this :)

Related

Removing/Filtering rows based on condition in R

I would like to remove participants who scored 2 in column EXP_MAN and 1 in Ethnicity_Rescuer.
I used the following code and it worked.
mydata <- mydata %>%
  mutate(to_exclude = case_when(EXP_MAN == 2 & Ethnicity_Rescuer == 1 ~ 1)) %>%
  mutate(to_exclude = replace(to_exclude, is.na(to_exclude), 0))
mydata <- mydata %>% filter(to_exclude == 0)
However, this code seems very complicated and I am sure there should be a simpler solution.
I tried to filter out participants with the below code but it did not work. Just wondering what is the simplest code for removing participants in this case.
mydata <- mydata %>% filter(EXP_MAN != 2 & Ethnicity_Rescuer != 1)
You can use the subset() function to subset a data set with one or several conditions:
subset(mydata, EXP_MAN != 2 & Ethnicity_Rescuer != 1)
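Note that combining != with & keeps only rows where neither value matches, so both snippets above also drop participants who match just one of the two conditions. If the goal is to remove only those who have both EXP_MAN == 2 and Ethnicity_Rescuer == 1, a sketch that negates the combined condition is:
# keep all rows except those where both conditions hold at once
mydata <- subset(mydata, !(EXP_MAN == 2 & Ethnicity_Rescuer == 1))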

Dynamic sum/count condition during assignment

I have two data frames (table1 and randomdata) with the following schema:
#randomdata
randomdata$cube = {1,5,3,3,4,5,5,2,2,6,1,2,....} (1000 rows)
#table1
table1$side = {1,2,3,4,5,6} (6 rows)
table1$frequency = NULL
I want to count the occurrences of the different sides of the cube (in the first 10 rows of randomdata$cube) and assign the result to the corresponding row of table1$frequency (based on table1$side).
I can do this successfully this way:
table1$frequency[1] <- sum(randomdata$cube[1:10] == 1)
table1$frequency[2] <- sum(randomdata$cube[1:10] == 2)
table1$frequency[3] <- sum(randomdata$cube[1:10] == 3)
...
table1$frequency[6] <- sum(randomdata$cube[1:10] == 6)
This works very well, but there must be a better way.
Instead of 6 statements, I imagine something like this:
table1$frequency <- sum(randomdata$cube[1:10] == table1$side)
Can someone show me a more dynamic way to do this?
Thank you.
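For reference, a minimal way to realize the imagined one-liner (a sketch, not from the answers below) is to run the comparison once per side with sapply:
# count matches of each side within the first 10 rows of cube
table1$frequency <- sapply(table1$side, function(s) sum(randomdata$cube[1:10] == s))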
We can do this by converting the 'cube' column to a factor with levels specified as 1:6 and then calling table(). If we do it without that, missing elements get dropped from the table output; with the factor levels, a missing level shows up as 0.
table1$frequency <- as.vector(table(factor(randomdata$cube[1:10], levels = 1:6)))
Or using tidyverse
library(tidyverse)
randomdata %>%
  slice(1:10) %>%
  count(cube = factor(cube, levels = 1:6), .drop = FALSE) %>%
  pull(n) %>%
  mutate(table1, frequency = .)
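A self-contained check of the base R approach, with simulated data (the seed and the 1000-row draw are assumptions for illustration only):
set.seed(42)  # assumed seed, purely for reproducibility
randomdata <- data.frame(cube = sample(1:6, 1000, replace = TRUE))
table1 <- data.frame(side = 1:6)
# sides absent from the first 10 rolls get a count of 0 thanks to the factor levels
table1$frequency <- as.vector(table(factor(randomdata$cube[1:10], levels = 1:6)))
table1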

Using apply to replace nested for loop

My goal is to go through various signals and ignore any 1's that are not part of a series (a minimum of at least two 1's in a row). The data is an xts time series with 180K+ columns and 84 months; I've provided a small simplified data set below. I've used a nested for loop, but it's taking way too long to finish on the entire data set. It works but is horribly inefficient.
I know there's some way to use an apply function, but I can't figure it out.
Example data:
library(xts)
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
mod_sig <- xts(mod_sig, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Example code:
# fixing months where the condition is only met for one month
# creating a new data frame for modified signals
Signals_Fin <- data.frame(matrix(nrow = nrow(mod_sig), ncol = ncol(mod_sig)))
colnames(Signals_Fin) <- colnames(mod_sig)
# Loop over signals to change 1's to 0's for one-month events
for (col in 1:ncol(mod_sig)) {
  for (row in 1:nrow(mod_sig)) {
    val <- ifelse(mod_sig[row, col] == 1,
                  ifelse(mod_sig[row - 1, col] == 0,
                         ifelse(mod_sig[row + 1, col] == 0, 0, 1), 1), 0)
    Signals_Fin[row, col] <- val
  }
}
As you can see with the loop, any 1's that aren't in a sequence are changed to 0's. I know there is a better way, so I'm hoping to improve my approach. Any insights would be greatly appreciated. Thanks!
Answer from Zack and Ryan:
Zack and Ryan were spot on with dplyr; I only made slight modifications based on what was given and some help from a colleague.
Answer code:
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
Signals_fin <- mod_sig %>%
  mutate_all(funs(ifelse(. == 1 & (lag(.) == 1 | lead(.) == 1), 1, 0))) %>%
  mutate_all(funs(ifelse(is.na(.), 0, .)))
Signals_fin <- xts(Signals_fin, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Here's a stab from a dplyr perspective. I converted your row names to a column, but you can just as easily convert them back to row names with tibble::column_to_rownames():
library(dplyr)
library(tibble)
mod_sig %>%
  as.data.frame() %>%
  rownames_to_column('months') %>%
  mutate_at(vars(-months), function(x) {
    if_else(x == 1 &
              (lag(x, order_by = .$months) == 1 |
                 lead(x, order_by = .$months) == 1),
            1,
            0)
  })
As suggested by @Ryan, his mutate_at call is more elegant; it's important that everything is already sorted, though:
mod_sig %>%
  as.data.frame() %>%
  rownames_to_column('months') %>%
  mutate_at(vars(-months), ~ as.numeric(.x & (lag(.x) | lead(.x))))
And to build on his suggestion:
mod_sig %>%
  as.data.frame() %>%
  mutate_all(~ as.numeric(.x & (lag(.x) | lead(.x))))
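Since the real data is a wide xts object, a base-matrix sketch of the same idea (an alternative of mine, not from the answers above) avoids the data frame round trip by comparing the matrix against zero-padded shifted copies of itself:
library(xts)
m <- coredata(mod_sig)                          # plain numeric matrix
prev <- rbind(0, m[-nrow(m), , drop = FALSE])   # previous month, zero-padded
nxt  <- rbind(m[-1, , drop = FALSE], 0)         # next month, zero-padded
keep <- (m == 1) & (prev == 1 | nxt == 1)       # keep only 1's that touch another 1
Signals_fin2 <- xts(keep * 1, order.by = index(mod_sig))
This does a handful of whole-matrix operations instead of per-column work, which should scale better to 180K+ columns.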

Improve efficiency of lookup algorithm in R

I think this is an interesting task: optimizing a piece of R code.
I have a data frame df_red holding order details from a webshop. For each product (ean), I want to get the 12 other products most likely to be in a basket with it.
This is the sample code to generate such a data set:
library(tidyverse)
# create a vector with 1400 products (characterized by their EANs)
eans <- sample(1e5:1e6, 1400, replace = FALSE)
# create a vector with 200k orders
basket_nr <- 1:2e5
# a basket can have up to 4 items; it's most likely to have 3 items
n_prod_per_basket <- sample(x = 1:4, length(basket_nr), prob = c(0.2, 0.2, 0.5, 0.1), replace = TRUE)
# create df_red, each line of which corresponds to a product with its respective basket number
df <- data_frame(basket_nr, n_prod_per_basket)
df_red <- data_frame(basket_nr = rep(basket_nr, n_prod_per_basket))
df_red$ean <- sample(x = eans, nrow(df_red), replace = TRUE)
The code I am using to accomplish this task is below, but I am sure it's not efficient. How can I increase the speed of the program?
ean <- unique(df_red$ean)
out <- list()
for (i in 1:length(ean)) {
  ean1 <- ean[i]
  # get all basket_nr that contain the ean in question
  basket_nr <- df_red[df_red$ean == ean1, ]$basket_nr
  # get products that were together in the same basket with the ean in question
  boo <- (df_red$ean != ean1) & (df_red$basket_nr %in% basket_nr)
  prod <- df_red[boo, ]
  # get top most frequent
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:12)
  # skip products that weren't together in a basket with at least 12 different other products
  if (nrow(top12) == 12) out[[i]] <- data_frame(ean = ean1, recom = top12$ean, freq = top12$n)
  if (i %% 100 == 0) print(paste0(round(i / length(ean) * 100, 2), '% is complete'))
}
Performance improvements are of course a matter of degree. How far to go before it is improved "enough" is hard to say. However, we can reduce run time by about 25% by functionalizing your code and cleaning up the subsetting logic. Starting with your code:
# added a timer
start.time <- Sys.time()
for (i in 1:length(ean)) {
  ean1 <- ean[i]
  # get all basket_nr that contain the ean in question
  basket_nr <- df_red[df_red$ean == ean1, ]$basket_nr
  # get products that were together in the same basket with the ean in question
  boo <- (df_red$ean != ean1) & (df_red$basket_nr %in% basket_nr)
  prod <- df_red[boo, ]
  # get top most frequent
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:12)
  # skip products that weren't together in a basket with at least 12 different other products
  if (nrow(top12) == 12) out[[i]] <- data_frame(ean = ean1, recom = top12$ean, freq = top12$n)
  if (i %% 100 == 0) print(paste0(round(i / length(ean) * 100, 2), '% is complete'))
}
Sys.time() - start.time
This takes between 30 and 34 seconds on my machine. However, we can rewrite it as a function like so:
my.top12.func <- function(id, df_red) {
  # improved subsetting logic - using which() is faster, and we can remove some code by
  # removing the ean that is being iterated in the filter step below
  prod <- df_red[df_red$basket_nr %in% df_red$basket_nr[which(df_red$ean == id)], ]
  # raise the cutoff from 12 to 13 since the specific ean will always be one of the top 12
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:13 & ean != id) # additional filter required
  # skip products that weren't together in a basket with at least 12 different other products
  if (nrow(top12) == 12) return(data_frame(ean = id, recom = top12$ean, freq = top12$n))
}
Now we can test the speed and accuracy of this approach by doing:
start.time <- Sys.time()
my.out <- lapply(ean, my.top12.func, df_red = df_red)
Sys.time() - start.time
# test for equality
all.equal(out, my.out)
This runs in about 24-26 seconds, a 25%+ improvement.
Playing around with data.table, I can produce the output in under 7 seconds (which I guess is about an 80% improvement):
library(data.table)
setDT(df_red)
all_eans <- df_red[, unique(ean)]
k <- lapply(all_eans, function(x) {
  df_red[basket_nr %in% df_red[ean == x, unique(basket_nr)],
         .N,
         by = ean][order(-N)][2:13]
})
names(k) <- all_eans
k <- k[sapply(k, nrow) == 12]
I would consider not using a loop.
df_red$k <- 1
df_s <- left_join(df_red, df_red, by = "k") %>%           # cross join on the constant key
  filter(ean.x != ean.y & basket_nr.x == basket_nr.y) %>% # keep pairs sharing a basket
  group_by(ean.x, ean.y) %>%                              # count per product pair
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  filter(row_number() %in% 1:12)                          # top 12 partners per ean.x
df_s.ct <- df_s %>% filter(row_number() == 12)            # eans that reach 12 partners
df_s.fin <- df_s[df_s$ean.x %in% df_s.ct$ean.x, ]
The rate-limiting step here is the left_join, which merges the dataset with itself and creates a quadratically larger dataset (so if you have 50,000 rows, you will end up creating a new dataset of 2.5B rows). This indicates that the best way to store and manipulate the data is with data.table, which will increase the speed of this procedure, especially when combined with dplyr.
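A sketch of one way around that bottleneck (an alternative of mine, assuming dplyr >= 1.0 for slice_head()): joining on basket_nr directly only ever pairs products that actually share a basket, so the intermediate table grows with the number of same-basket pairs rather than with the square of the row count:
library(dplyr)
df_pairs <- inner_join(df_red, df_red, by = "basket_nr") %>%
  filter(ean.x != ean.y) %>%           # drop self-pairs
  count(ean.x, ean.y, sort = TRUE) %>% # co-occurrence count per product pair
  group_by(ean.x) %>%
  slice_head(n = 12) %>%               # top 12 co-products per ean
  filter(n() == 12)                    # keep only eans with at least 12 partners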

Issues with replacing a subset of a data.frame using the R Package dplyr

I am trying to replace some filtered values of a data set. So far, I have written these lines of code:
df %>%
  filter(group1 == uniq[i]) %>%
  mutate(values = ifelse(sum(values) < 1, 2, NA))
where uniq is just a vector containing the group names I want to focus on (group1 and values are column names). This actually works; however, it only outputs the altered filtered rows and does not replace anything in the data set df. Does anyone have an idea where my mistake is? Thank you so much! The following code reproduces the example:
group1 <- c("A","A","A","B","B","C")
values <- c(0.6,0.3,0.1,0.2,0.8,0.9)
df <- data.frame(group1, values)
uniq <- unique(unlist(df$group1))
for (i in 1:length(uniq)) {
  df <- df %>%
    filter(group1 == uniq[i]) %>%
    mutate(values = ifelse(sum(values) < 1, 2, NA))
}
What I would like is for it to leave all values alone except the last one, since it forms its own group (group1 == C) and 0.9 < 1. So I'd like to get the exact same data frame, except that 0.9 is replaced with NA. Moreover, would it be possible to just use if instead of ifelse?
dplyr won't create a new object unless you use an assignment operator (<-).
Compare
require(dplyr)
data(mtcars)
mtcars %>% filter(cyl == 4)
with
mtcars4 <- mtcars %>% filter(cyl == 4)
mtcars4
The data are the same, but in the second example the filtered data are stored in a new object, mtcars4.
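For the grouped replacement itself, a sketch without the loop: group by group1 and replace the whole group when its sum falls below 1. Because the condition is length one per group, a plain if works here, which also answers the if-versus-ifelse question (this assumes, per the desired output, that the replacement should be NA rather than the 2 in the original ifelse):
library(dplyr)
group1 <- c("A","A","A","B","B","C")
values <- c(0.6,0.3,0.1,0.2,0.8,0.9)
df <- data.frame(group1, values)
df <- df %>%
  group_by(group1) %>%
  # NA_real_ keeps the column numeric; the scalar condition makes plain if safe
  mutate(values = if (sum(values) < 1) NA_real_ else values) %>%
  ungroup()
df
Only group C sums to less than 1, so only its 0.9 becomes NA.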
