How to Transform Data to Find Index with Same Value - r

I want to find customers who have bought exactly the same set of products.
The data I have describes customer behavior, i.e. what each customer has bought.
The example I provide is a simplified version of my data. Customers usually buy 10 to 20 products, and there are around 50 products to choose from.
I am not sure what an easy way would be to transform my data into the output shown below.
Could you please give me some advice? Thanks.
Input:
structure(list(Customer_ID = 1:6, Products = c("Apple, Beer, Diaper",
"Beer, Apple", "Beer, Apple, Diaper, Diaper", "Apple, Diaper",
"Diaper, Apple", "Apple, Diaper, Beer, Beer")), .Names = c("Customer_ID",
"Products"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L), spec = structure(list(cols = structure(list(Customer_ID = structure(list(), class = c("collector_integer",
"collector")), Products = structure(list(), class = c("collector_character",
"collector"))), .Names = c("Customer_ID", "Products")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
Output:
structure(list(`Products Bought` = c("Apple, Beer, Diaper", "Apple, Diaper"
), Customer_ID = c("1, 3, 6", "4, 5")), .Names = c("Products Bought",
"Customer_ID"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L), spec = structure(list(cols = structure(list(`Products Bought` = structure(list(), class = c("collector_character",
"collector")), Customer_ID = structure(list(), class = c("collector_character",
"collector"))), .Names = c("Products Bought", "Customer_ID")),
default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

I suspect you may want to look at structuring your data in a way that is more usable (one row per customer and product). In any case, the tidyverse can be a helpful way of thinking through your task.
As mentioned, posting code for others to start with can save them time and get you an answer faster.
library(dplyr)
library(stringr)
library(tidyr)
## tibble() replaces the deprecated data_frame()
d <- tibble(
  id = c(1, 2, 3, 4, 5, 6),
  bought = c('Apple, Beer, Diaper', 'Apple, Beer', 'Apple, Beer, Diaper, Diaper',
             'Apple, Diaper', 'Diaper, Apple', 'Apple, Diaper, Beer, Beer')
)
d %>%
  ## Unnest the values & take care of white space
  ## - This is the better data structure to have, anyways
  mutate(buy = str_split(bought, ',')) %>%
  unnest(buy) %>% mutate(buy = str_trim(buy)) %>% select(-bought) %>%
  ## Keep distinct products per customer and sort them
  distinct(id, buy) %>% arrange(id, buy) %>%
  ## Aggregate by id
  group_by(id) %>% summarize(bought = paste(buy, collapse = ', ')) %>% ungroup %>%
  ## Collect the ids that share each product set
  group_by(bought) %>% summarize(ids = paste(id, collapse = ',')) %>% ungroup
EDIT: referencing this SO post for getting distinct combinations faster / cleaner in dplyr

Using the given input data and data.table, this can be written as a (rather convoluted) "one-liner":
library(data.table)
dcast(unique(setDT(input)[, strsplit(Products, ", "), Customer_ID])[
  order(Customer_ID, V1)],
  Customer_ID ~ ., paste, collapse = ", ")[
  , .(Customers = paste(Customer_ID, collapse = ", ")), .(Products = .)]
#               Products Customers
#1: Apple, Beer, Diaper   1, 3, 6
#2:         Apple, Beer         2
#3:       Apple, Diaper      4, 5
Note that the OP has dropped the second line with only one customer from
the expected output but hasn't mentioned any criteria for filtering the output in the question.
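If the intended filter is "product sets shared by more than one customer" (an assumption, since the question does not say), the same idea can be sketched in base R. The names basket_key, ids_by_basket and shared are only illustrative, and the input is rebuilt here as a plain data.frame so the snippet runs on its own:

```r
# Same six customers as in the question, as a plain data.frame
input <- data.frame(
  Customer_ID = 1:6,
  Products = c("Apple, Beer, Diaper", "Beer, Apple", "Beer, Apple, Diaper, Diaper",
               "Apple, Diaper", "Diaper, Apple", "Apple, Diaper, Beer, Beer"),
  stringsAsFactors = FALSE
)

# Canonical key per customer: split, trim, de-duplicate, sort, paste back
basket_key <- vapply(
  strsplit(input$Products, ","),
  function(items) paste(sort(unique(trimws(items))), collapse = ", "),
  character(1)
)

# Collect the customer ids that share each key
ids_by_basket <- split(input$Customer_ID, basket_key)
result <- data.frame(
  Products  = names(ids_by_basket),
  Customers = vapply(ids_by_basket, paste, character(1), collapse = ", "),
  n         = lengths(ids_by_basket),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Assumed filter: keep only product sets bought by at least two customers
shared <- result[result$n > 1, c("Products", "Customers")]
print(shared)
```

Sorting the de-duplicated products is what makes "Beer, Apple" and "Apple, Beer" end up with the same key.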
Input data
(As given by OP):
input <- structure(list(Customer_ID = 1:6, Products = c("Apple, Beer, Diaper",
"Beer, Apple", "Beer, Apple, Diaper, Diaper", "Apple, Diaper",
"Diaper, Apple", "Apple, Diaper, Beer, Beer")), .Names = c("Customer_ID",
"Products"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L), spec = structure(list(cols = structure(list(Customer_ID = structure(list(), class = c("collector_integer",
"collector")), Products = structure(list(), class = c("collector_character",
"collector"))), .Names = c("Customer_ID", "Products")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

Related

Issue in creating polygon in ggplot in R

Why does my polygon look like two triangles instead of a box? Can someone help explain how I can turn the polygon into a box?
data <- structure(list(AREA = c("a", "a", "b", "b"), Lat = c(43.68389835,
43.68389835, 44.3395883, 44.3395883), Long = c(-88.22909367,
-88.99888743, -88.22909367, -88.99888743)), row.names = c(NA,
-4L), spec = structure(list(cols = list(AREA = structure(list(), class = c("collector_character",
"collector")), Lat = structure(list(), class = c("collector_double",
"collector")), Long = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
Code:
library(tidyverse)
ggplot() + geom_polygon(data=data, mapping=aes(x=Long, y=Lat))
geom_polygon draws the polygon in exactly the order in which the points appear in data. To get a closed box, you need to order the points around the outline, either clockwise or anti-clockwise.
We can do this by calculating each point's angle relative to the lat/long centre and then ordering the points by that angle.
library(tidyverse)
data %>%
  mutate(angle = atan2(Lat - mean(Lat), Long - mean(Long))) %>%
  arrange(desc(angle)) %>%
  ggplot() +
  geom_polygon(aes(x = Long, y = Lat))
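To see why the angle ordering helps, here is a small base R check (a sketch, with the illustrative names pts, ring and shares_side). It orders the four corners by angle and confirms that each consecutive pair of points shares either a latitude or a longitude, i.e. that the path follows the sides of the box rather than its diagonals. This relies on the outline being convex, as it is for this rectangle:

```r
# The four corners in the same zigzag order as in the question
pts <- data.frame(
  Lat  = c(43.68389835, 43.68389835, 44.3395883, 44.3395883),
  Long = c(-88.22909367, -88.99888743, -88.22909367, -88.99888743)
)

# Angle of each corner as seen from the centre of the points
angle <- atan2(pts$Lat - mean(pts$Lat), pts$Long - mean(pts$Long))
ordered <- pts[order(angle, decreasing = TRUE), ]

# Close the ring and test every edge: a side of the box keeps Lat or Long fixed
ring <- rbind(ordered, ordered[1, ])
shares_side <- diff(ring$Lat) == 0 | diff(ring$Long) == 0
print(all(shares_side))

# The original order contains a diagonal edge (second to third point)
orig_ring <- rbind(pts, pts[1, ])
print(all(diff(orig_ring$Lat) == 0 | diff(orig_ring$Long) == 0))
```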

How to convert dataframe rows to variables in R

If I have a 2 column, 4 row data frame such as:
structure(list(var = c("url.loc", "radius", "jt", "post.age"),
value = c("london", "25", "fulltime", "7")), row.names = c(NA,
-4L), spec = structure(list(cols = list(var = structure(list(), class = c("collector_character",
"collector")), value = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
... how would I go about converting this to 4 variables so the result would be the same as:
url.loc <- "london"
radius <- "25"
jt <- "fulltime"
post.age <- "7"
I can think of using assign and a loop, but I suspect there's a more elegant way, perhaps using NSE.
Thanks
We may create a named list (deframe turns the two columns into a named vector) and use list2env. Here df1 is the data frame from the question.
library(tibble)
list2env(as.list(deframe(df1)), .GlobalEnv)
Checking:
> url.loc
[1] "london"
> radius
[1] "25"
A compact option is the %=% operator from collapse:
library(collapse)
df1$var %=% df1$value
Checking:
> url.loc
[1] "london"
> radius
[1] "25"
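If you would rather not write into the global environment, the same list2env idea also works with a dedicated environment. This is a base R sketch; settings is a hypothetical name and df1 is rebuilt as a plain data.frame so the snippet is self-contained:

```r
df1 <- data.frame(
  var   = c("url.loc", "radius", "jt", "post.age"),
  value = c("london", "25", "fulltime", "7"),
  stringsAsFactors = FALSE
)

# Named list: names come from `var`, contents from `value`
named_values <- setNames(as.list(df1$value), df1$var)

# Create the variables in their own environment instead of .GlobalEnv
settings <- new.env()
list2env(named_values, envir = settings)

print(settings$url.loc)
print(get("post.age", envir = settings))
```

Keeping the values in one environment (or simply in the named list) avoids overwriting existing objects such as a variable that happens to be called radius.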

Loop in tidyverse

I am learning the tidyverse and working with a time-series dataset from which I selected the columns that start with sec. I would like to keep the values in those columns that equal 123 and replace all other values with 0, but I don't know how to loop over sec1:sec4. Also, how can I sum() each column?
df1<-df %>%
select(starts_with("sec")) %>%
select(ifelse("sec1:sec4"==123, 1, 0))
Sample data:
structure(list(sec1 = c(1, 123, 1), sec2 = c(123, 1, 1), sec3 = c(123,
0, 0), sec4 = c(1, 123, 1)), spec = structure(list(cols = list(
sec1 = structure(list(), class = c("collector_double", "collector"
)), sec2 = structure(list(), class = c("collector_double",
"collector")), sec3 = structure(list(), class = c("collector_double",
"collector")), sec4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), row.names = c(NA,
-3L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
I think you need mutate and across for this. The code below applies the function to each column starting with sec, keeping all values that equal 123 and replacing all others with 0.
df1 <- df %>%
  select(starts_with("sec")) %>%
  mutate(across(starts_with("sec"), .fns = function(x) { ifelse(x == 123, x, 0) }))
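For the second part of the question (summing per column), summarise can be combined with across in the same way. A sketch using the sample data from the question, rebuilt with tibble(); recoded, col_sums and col_counts are illustrative names:

```r
library(dplyr)

df <- tibble(
  sec1 = c(1, 123, 1),
  sec2 = c(123, 1, 1),
  sec3 = c(123, 0, 0),
  sec4 = c(1, 123, 1)
)

# Keep 123, replace everything else with 0
recoded <- df %>%
  mutate(across(starts_with("sec"), ~ ifelse(.x == 123, .x, 0)))

# One row with the sum of each sec column
col_sums <- recoded %>%
  summarise(across(starts_with("sec"), sum))

# Alternatively, count how often 123 occurs in each column
col_counts <- df %>%
  summarise(across(starts_with("sec"), ~ sum(.x == 123)))

print(col_sums)
print(col_counts)
```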

Unexpected behavior of filter inside a function dplyr

I have a function that filters a data.frame based on the unique values of a group column that is passed to the function:
la <- function(df, grp){
  gr <- df %>% pull({{grp}}) %>% unique()
  purrr::map(gr, function(x){
    print(x)
    filter(df, {{grp}} == x)
  })
}
When I use it with this df,
x <- structure(list(mac = c("dc:a6:32:21:59:2b", "dc:a6:32:2d:8c:ca",
"dc:a6:32:2d:b8:62", "dc:a6:32:2d:ca:3f"), datetime = structure(c(1594644546,
1594645457, 1594645375, 1594645080), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Comment = c("FED2", "FED7", "FED1", "FED6")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
la(x, mac)
I get the proper prints and the subsets.
However, when I use it with this other df, which should be equivalent, it doesn't work as expected.
df <- structure(list(datetime = structure(c(1594644600, 1594644900,
1594645200, 1594645500, 1594645800, 1594646100), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), movement = c(9940.50454596681, 10779.7747307276,
7148.52826988968, 7687.54314683339, 8797.06954533588, 7524.02474093548
), x = c(606, NA, 240, NA, 504, NA), y = c(386, NA, 274, NA,
56, NA), i_x = c(606, 228, 214, 407.5, 500, 292.947368421053),
i_y = c(386, 286, 258, 49.1666666666667, 56, 234), mac = c("dc:a6:32:21:59:2b",
"dc:a6:32:21:59:2b", "dc:a6:32:21:59:2b", "dc:a6:32:21:59:2b",
"dc:a6:32:21:59:2b", "dc:a6:32:21:59:2b")), spec = structure(list(
cols = list(filename = structure(list(), class = c("collector_character",
"collector")), datetime = structure(list(format = ""), class = c("collector_datetime",
"collector")), movement = structure(list(), class = c("collector_double",
"collector")), x = structure(list(), class = c("collector_double",
"collector")), y = structure(list(), class = c("collector_double",
"collector")), i_x = structure(list(), class = c("collector_double",
"collector")), i_y = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = "\t"), class = "col_spec"), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
I get 0 rows on each type of group (my real example has the same groups as the ones for the x dataframe).
Interestingly, this works as expected.
la(select(head(df), mac, datetime), mac)
[1] "dc:a6:32:21:59:2b"
[[1]]
# A tibble: 6 x 2
mac datetime
<chr> <dttm>
1 dc:a6:32:21:59:2b 2020-07-13 12:50:00
2 dc:a6:32:21:59:2b 2020-07-13 12:55:00
3 dc:a6:32:21:59:2b 2020-07-13 13:00:00
4 dc:a6:32:21:59:2b 2020-07-13 13:05:00
5 dc:a6:32:21:59:2b 2020-07-13 13:10:00
6 dc:a6:32:21:59:2b 2020-07-13 13:15:00
What is going on?
As the comment suggests, the problem is that I use function(x) inside the map call while df also has an x column. Inside filter(), x then refers to the column df$x rather than to the function argument, so mac is compared with that column and no rows match. I chose another variable name, and now it works.
la <- function(df, grp){
  gr <- df %>% pull({{grp}}) %>% unique()
  purrr::map(gr, function(tt){
    print(tt)
    filter(df, {{grp}} == tt)
  })
}
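Renaming only helps as long as no column is called tt. A more defensive sketch is to use the .env pronoun, which tells filter() to look the name up in the calling environment rather than in the data. The names la_env and toy below are hypothetical, and lapply is swapped in for purrr::map just to keep the dependencies small:

```r
library(dplyr)

la_env <- function(df, grp) {
  gr <- df %>% pull({{ grp }}) %>% unique()
  lapply(gr, function(x) {
    # .env$x is the function argument, even if df has a column named x
    filter(df, {{ grp }} == .env$x)
  })
}

# Toy data with a column named x, like the failing data frame
toy <- tibble(mac = c("a", "a", "b"), x = c(1, NA, 3))
res <- la_env(toy, mac)
print(res)
```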

Visualize bubbles on a map, using hc_add_series_map() instead of hcmap()

I am trying to draw a bubble map using highcharter.
It works perfectly with this code:
library(highcharter)
library(tidyverse)
hcmap("custom/africa") %>%
hc_add_series(data = fake_data, type = "mapbubble", maxSize = '10%', color =
"Red", showInLegend = FALSE) %>%
hc_legend(enabled = FALSE)
My data
> dput(fake_data)
structure(list(country = c("DZ", "CD", "ZA", "TZ"), lat = c(28.033886,
-4.038333, -30.559482, -6.369028), lon = c(1.659626, 21.758664,
22.937506, 34.888822), name = c("Algeria", "Congo, Dem. Rep",
"South Africa", "Tanzania"), z = c(20, 5, 10, 1)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), spec =
structure(list(
cols = list(country = structure(list(), class = c("collector_character",
"collector")), lat = structure(list(), class = c("collector_double",
"collector")), lon = structure(list(), class = c("collector_double",
"collector")), name = structure(list(), class = c("collector_character",
"collector")), z = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
The external geo data for Africa originally comes from this source and is used by hcmap().
I converted it to RDS so that I can use it locally. It is available here.
My problem is that I cannot use my code with the external data because of corporate IT security restrictions. I cannot deploy this code with Shiny/RMarkdown on Connect; it is blocked.
So my current workaround is to:
Use the same data in RDS format
africa_map_data <- readRDS("africa_map_data.RDS")
And use the hc_add_series_map() with local data instead of hcmap().
highchart() %>%
  hc_add_series_map(
    map = africa_map_data,
    df = fake_data,
    value = "z",
    joinBy = c("hc-a2", "country"),
    type = "mapbubble",
    maxSize = '10%',
    color = "Red"
  )
But it does not work well; I get a mess.
How can I create a bubble map with hc_add_series_map() (or any other way) without hcmap() and without pulling external data?
Thanks!
