In R, how can I randomly choose two out three names 500 times, with balanced selections? - r

I know I can use complete_ra from the randomizr package to randomly and equally allocate to one of three "arms" (in this case "arms" are just names of people)
library(randomizr)
set.seed(100)
names <- complete_ra(N = 500, num_arms = 3)
#each "arm" is chosen ~167 times
#Now put the names in
library(plyr)
df <- transform(df,
names=revalue(names,c("T1"="Luis", "T2"="Conor","T3"="Dafydd")))
But what I need is to actually assign the 500 samples to a randomly chosen two of the three names. So I need my dataset to be:
ID# Name1 Name2
1 Conor Luis
2 Conor Dafydd
3 Luis Dafydd
...
500 Conor Luis
and at the end I need each of the 3 to still be chosen an equal amount.
A workaround is since there's 3 names, that means there's 3 combinations too, so I could simply replace Conor with "Conor and Luis", Luis with "Luis and Dafydd", and Dafydd with "Conor and Dafydd"...but I'm sure there's a more eloquent way that would allow for other combinations (like choosing 2 out of 4 names). Also I don't like the workaround because currently each name can show up 8 times in a row for example, which means we would have an exact pair 8 times in a row. I think a more eloquent method of randomly choosing 2 out of the 3 names would result in fewer "in a row" cases.

The canonical way to select n elements from a list (without replacement here) would be sample. Here a simple way to create 500 such samples and transform the result into a data.frame:
set.seed(100)
names <- c("Luis", "Conor", "Dafydd")
samples <- lapply(1:500, function(x) sample(names, 2))
head(as.data.frame(matrix(unlist(samples), ncol = 2, byrow = TRUE)))
#> V1 V2
#> 1 Luis Dafydd
#> 2 Conor Luis
#> 3 Conor Luis
#> 4 Dafydd Luis
#> 5 Conor Luis
#> 6 Conor Dafydd
Created on 2019-03-15 by the reprex package (v0.2.1)

Here's a fun approach with randomizr and tidyverse. It treats each person as a block of two observations, then uses pivot_wider to reshape the data
library(tidyverse)
library(randomizr)
tibble(
person_id = rep(1:500, each = 2),
name = rep(c("Name1", "Name2"), 500),
assignment = block_ra(
blocks = person_id,
conditions = c("Luis", "Conor", "Dafydd")
)
) %>%
pivot_wider(names_from = name,
values_from = assignment)
#> # A tibble: 500 x 3
#> person_id Name1 Name2
#> <int> <fct> <fct>
#> 1 1 Luis Dafydd
#> 2 2 Conor Luis
#> 3 3 Dafydd Luis
#> 4 4 Dafydd Conor
#> 5 5 Conor Dafydd
#> 6 6 Luis Dafydd
#> 7 7 Dafydd Luis
#> 8 8 Conor Luis
#> 9 9 Conor Luis
#> 10 10 Dafydd Conor
#> # … with 490 more rows
Created on 2020-01-24 by the reprex package (v0.3.0)

Related

Beginner Question : How do you remove a date from a column?

I want to remove the date part from the first column but can't do it for all the dataset?
can someone advise please?
Example of dataset:
You can use sub() function with replacing ^[^[:alpha:]]+ (regular expression, i.e. all non-alphabetic characters at the beginning of the string), with "", i.e. empty string.
sub("^[^[:alpha:]]+", "", data)
Example
data <- data.frame(
good_column = 1:4,
bad_column = c("13/1/2000pasta", "14/01/2000flour", "15/1/2000aluminium foil", "15/1/2000soap"))
data
#> good_column bad_column
#> 1 1 13/1/2000pasta
#> 2 2 14/01/2000flour
#> 3 3 15/1/2000aluminium foil
#> 4 4 15/1/2000soap
data$bad_column <- sub("^[^[:alpha:]]+", "", data$bad_column)
data
#> good_column bad_column
#> 1 1 pasta
#> 2 2 flour
#> 3 3 aluminium foil
#> 4 4 soap
Created on 2020-07-29 by the reprex package (v0.3.0)

Joining tables and applying functions to columns with the same name in R and tidyverse

I am looking to join tables with customer id (easy enough) but then I want to multiply the columns to get updated values.
Customer_Week_1<-data.frame(First_name=c("John","Mary","David","Paul"),
Last_name=c("Jackson","Smith","Williams", "Zimmerman"),
Factor_1=c(2,5,8,9),
Factor_2=c(.5,.5,.75,.75),
Factor_3=c(0,1,2,3))
Customer_Week_2<-data.frame(First_name=c("John","Mary","David","Paul"),
Last_name=c("Jackson","Smith","Williams", "Zimmerman"),
Factor_1=c(3,7,1,7),
Factor_2=c(.51,.65,.72,.4),
Factor_3=c(1,2,3,4))
Customer_week3<-Customer_Week_1%>%
left_join(Customer_Week_2, by = c("First_name","Last_name"))
The expected results can be found by in a vector by just
Customer_week3_expected<-Customer_Week_1[,3:5]*Customer_Week_2[,3:5]
And I know I can just manually type out every column. But I have dozens of columns and need to make this code as easy to follow as possible.
I also know that I can just bind the results vector to
Customer_week3<-Customer_Week_1%>%
left_join(Customer_Week_2, by = c("First_name","Last_name"))%>%
select(1:2)
But that does not look like best practice to me, and I would rather this be done with a join some way to ensure everything lines up when I am iterating over the customers(tables)
Assuming I understand the output you're trying to get, I can think of two methods. If you know that the names are in the first two columns and are the same in both data frames (this might not be the case in real life), you can use the same multiplication operation you tried above, bound to the first two columns of either of the data frames.
cbind(Customer_Week_1[1:2], Customer_Week_1[-1:-2] * Customer_Week_2[-1:-2])
#> First_name Last_name Factor_1 Factor_2 Factor_3
#> 1 John Jackson 6 0.255 0
#> 2 Mary Smith 35 0.325 2
#> 3 David Williams 8 0.540 6
#> 4 Paul Zimmerman 63 0.300 12
Or you can be more verbose but maybe more flexible, and eshape to a long data frame, then do a grouped operation to summarize products for each person and factor. Starting from the join you have above:
library(dplyr)
library(tidyr)
Customer_week3 <- Customer_Week_1 %>%
left_join(Customer_Week_2, by = c("First_name", "Last_name"))
Make long-shaped data, separate the Factor_1.x into Factor_1 and x, and make products as your summary calculation.
products <- Customer_week3 %>%
gather(key = factor, value = value, -First_name, -Last_name) %>%
separate(factor, into = c("factor", "week"), sep = "\\.") %>%
group_by(First_name, Last_name, factor) %>%
summarise(value = prod(value))
head(products)
#> # A tibble: 6 x 4
#> # Groups: First_name, Last_name [2]
#> First_name Last_name factor value
#> <fct> <fct> <chr> <dbl>
#> 1 David Williams Factor_1 8
#> 2 David Williams Factor_2 0.54
#> 3 David Williams Factor_3 6
#> 4 John Jackson Factor_1 6
#> 5 John Jackson Factor_2 0.255
#> 6 John Jackson Factor_3 0
If you need to get back to a wide format, spread back.
products %>%
spread(key = factor, value = value)
#> # A tibble: 4 x 5
#> # Groups: First_name, Last_name [16]
#> First_name Last_name Factor_1 Factor_2 Factor_3
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 David Williams 8 0.54 6
#> 2 John Jackson 6 0.255 0
#> 3 Mary Smith 35 0.325 2
#> 4 Paul Zimmerman 63 0.3 12
Similar to #camille's reshaping, but in data.table (and disregarding Customer_week3):
library(data.table)
# long format
long = rbindlist(list(Customer_Week_1, Customer_Week_2), id=TRUE)
# aggregate
long[, lapply(.SD, prod), by=.(First_name, Last_name), .SDcols=patterns("^Factor")]
First_name Last_name Factor_1 Factor_2 Factor_3
1: John Jackson 6 0.255 0
2: Mary Smith 35 0.325 2
3: David Williams 8 0.540 6
4: Paul Zimmerman 63 0.300 12
Going longer (again as seen in #camille's answer) might also make sense, so as to avoid repeatedly fiddling with names of Factor_* columns:
longer = melt(long, meas=patterns("^Factor")) # analogous to gather
longer[, .(value = prod(value)), by=.(First_name, Last_name, variable)]

Is rvest the best tool to collect information from this table?

I have used rvest package to extract a list of companies and the a.href elements in each company, which I need to proceed with the data collection process. This is the link of the website: http://www.bursamalaysia.com/market/listed-companies/list-of-companies/main-market.
I have used the following code to extract the table but nothing comes out. I used other approaches as those posted in "Scraping table of NBA stats with rvest" and similar links, but I cannot obtain what I want. Any help would be greatly appreciated.
my code:
link.main <-
"http://www.bursamalaysia.com/market/listed-companies/list-of-companies/main-market/"
web <- read_html(link.main) %>%
html_nodes("table#bm_equities_prices_table")
# it does not work even when I write html_nodes("table")
or ".table" or #bm_equities_prices_table
web <- read_html(link.main)
%>% html_nodes(".bm_center.bm_dataTable")
# no working
web <- link.main %>% read_html() %>% html_table()
# to inspect the position of table in this website
The page generates the table using JavaScript, so you either need to use RSelenium or Python's Beautiful Soup to simulate the browser session and allow javascript to run.
Another alternative is to use awesome package by #hrbrmstr called decapitated, which basically runs headless Chrome browser session in the background.
#devtools::install_github("hrbrmstr/decapitated")
library(decapitated)
library(rvest)
res <- chrome_read_html(link.main)
main_df <- res %>%
rvest::html_table() %>%
.[[1]] %>%
as_tibble()
This outputs the content of the table alright. If you want to get to the elements underlying the table (href attributes behind the table text), you will need to do a bit more of list gymnastics. Some of the elements in the table are actually missing links, extracting by css proved to be difficult.
library(dplyr)
library(purrr)
href_lst <- res %>%
html_nodes("table td") %>%
as_list() %>%
map("a") %>%
map(~attr(.x, "href"))
# we need every third element starting from second element
idx <- seq.int(from=2, by=3, length.out = nrow(main_df))
href_df <- tibble(
market_href=as.character(href_lst[idx]),
company_href=as.character(href_lst[idx+1])
)
bind_cols(main_df, href_df)
#> # A tibble: 800 x 5
#> No `Company Name` `Company Website` market_href company_href
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 7-ELEVEN MALAYS~ http://www.7elev~ /market/list~ http://www.~
#> 2 2 A-RANK BERHAD [~ http://www.arank~ /market/list~ http://www.~
#> 3 3 ABLEGROUP BERHA~ http://www.gefun~ /market/list~ http://www.~
#> 4 4 ABM FUJIYA BERH~ http://www.abmfu~ /market/list~ http://www.~
#> 5 5 ACME HOLDINGS B~ http://www.suppo~ /market/list~ http://www.~
#> 6 6 ACOUSTECH BERHA~ http://www.acous~ /market/list~ http://www.~
#> 7 7 ADVANCE SYNERGY~ http://www.asb.c~ /market/list~ http://www.~
#> 8 8 ADVANCECON HOLD~ http://www.advan~ /market/list~ http://www.~
#> 9 9 ADVANCED PACKAG~ http://www.advan~ /market/list~ http://www.~
#> 10 10 ADVENTA BERHAD ~ http://www.adven~ /market/list~ http://www.~
#> # ... with 790 more rows
Another option without using browser:
library(httr)
library(jsonlite)
library(XML)
r <- httr::GET(paste0(
"http://ws.bursamalaysia.com/market/listed-companies/list-of-companies/list_of_companies_f.html",
"?_=1532479072277",
"&callback=jQuery16206432131784246533_1532479071878",
"&alphabet=",
"&market=main_market",
"&_=1532479072277"))
l <- rawToChar(r$content)
m <- gsub("jQuery16206432131784246533_1532479071878(", "", substring(l, 1, nchar(l)-1), fixed=TRUE)
tbl <- XML::readHTMLTable(jsonlite::fromJSON(m)$html)$bm_equities_prices_table
output:
> head(tbl)
# No Company Name Company Website
#1 1 7-ELEVEN MALAYSIA HOLDINGS BERHAD http://www.7eleven.com.my
#2 2 A-RANK BERHAD [S] http://www.arank.com.my
#3 3 ABLEGROUP BERHAD [S] http://www.gefung.com.my
#4 4 ABM FUJIYA BERHAD [S] http://www.abmfujiya.com.my
#5 5 ACME HOLDINGS BERHAD [S] http://www.supportivetech.com/
#6 6 ACOUSTECH BERHAD [S] http://www.acoustech.com.my/

Tableau LOD R Equivalent

I'm using a Tableau Fixed LOD function in a report, and was looking for ways to mimic this functionality in R.
Data set looks like:
Soldto<-c("123456","122456","123456","122456","124560","125560")
Shipto<-c("123456","122555","122456","124560","122560","122456")
IssueDate<-as.Date(c("2017-01-01","2017-01-02","2017-01-01","2017-01-02","2017-01-01","2017-01-01"))
Method<-c("Ground","Ground","Ground","Air","Ground","Ground")
Delivery<-c("000123","000456","000123","000345","000456","000555")
df1<-data.frame(Soldto,Shipto,IssueDate,Method,Delivery)
What I'm looking to do is "For each Sold-to/Ship-to/Method count the number of unique delivery IDs".
The intent is to find the number of unique deliveries that could potentially be "aggregated."
In Tableau that function looks like:
{FIXED [Soldto],[Shipto],[IssueDate],[Method],:countd([Delivery])
Could this be done with aggregate or summarize as in an example below:
df.new<-ddply(df,c("Soldto","Shipto","Method"),summarise,
Deliveries = n_distinct(Delivery))
This is fairly easy with dplyr. You are looking for the number of unique delivery for each combination of soldto, shipto and method, which is just group_by and then summarise:
library(tidyverse)
tbl <- tibble(
soldto = c("123456","122456","123456","122456","124560","125560"),
shipto = c("123456","122555","122456","124560","122560","122456"),
issuedate = as.Date(c("2017-01-01","2017-01-02","2017-01-01","2017-01-02","2017-01-01","2017-01-01")),
method = c("Ground","Ground","Ground","Air","Ground","Ground"),
delivery = c("000123","000456","000123","000345","000456","000555")
)
tbl %>%
group_by(soldto, shipto, method) %>%
summarise(uniques = n_distinct(delivery))
#> # A tibble: 6 x 4
#> # Groups: soldto, shipto [?]
#> soldto shipto method uniques
#> <chr> <chr> <chr> <int>
#> 1 122456 122555 Ground 1
#> 2 122456 124560 Air 1
#> 3 123456 122456 Ground 1
#> 4 123456 123456 Ground 1
#> 5 124560 122560 Ground 1
#> 6 125560 122456 Ground 1
Created on 2018-03-02 by the reprex package (v0.2.0).

How do I infill non-adjacent rows with sample data from previous rows in R?

I have data containing a unique identifier, a category, and a description.
Below is a toy dataset.
prjnumber <- c(1,2,3,4,5,6,7,8,9,10)
category <- c("based","trill","lit","cold",NA,"epic", NA,NA,NA,NA)
description <- c("skip class",
"dunk on brayden",
"record deal",
"fame and fortune",
NA,
"female attention",
NA,NA,NA,NA)
toy.df <- data.frame(prjnumber, category, description)
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 <NA> <NA>
6 6 epic female attention
7 7 <NA> <NA>
8 8 <NA> <NA>
9 9 <NA> <NA>
10 10 <NA> <NA>
I want to randomly sample the 'category' and 'description' columns from rows that have been filled in to use as infill for rows with missing data.
The final data frame would be complete and would only rely on the initial 5 rows which contain data. The solution would preserve between-column correlation.
An expected output would be:
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 lit record deal
6 6 epic female attention
7 7 based skip class
8 8 based skip class
9 9 lit record deal
10 10 trill dunk on brayden
complete = na.omit(toy.df)
toy.df[is.na(toy.df$category), c("category", "description")] =
complete[sample(1:nrow(complete), size = sum(is.na(toy.df$category)), replace = TRUE),
c("category", "description")]
toy.df
# prjnumber category description
# 1 1 based skip class
# 2 2 trill dunk on brayden
# 3 3 lit record deal
# 4 4 cold fame and fortune
# 5 5 lit record deal
# 6 6 epic female attention
# 7 7 cold fame and fortune
# 8 8 based skip class
# 9 9 epic female attention
# 10 10 epic female attention
Though it would seem a little more straightforward if you didn't start with the unique identifiers filled out for the NA rows...
You could try
library(dplyr)
toy.df %>%
mutate_each(funs(replace(., is.na(.), sample(.[!is.na(.)]))), 2:3)
Based on new information, we may need a numeric index to use in the funs.
toy.df %>%
mutate(indx= replace(row_number(), is.na(category),
sample(row_number()[!is.na(category)], replace=TRUE))) %>%
mutate_each(funs(.[indx]), 2:3) %>%
select(-indx)
Using Base R to fill in a single field a at a time, use something like (not preserving the correlation between the fields):
fields <- c('category','description')
for(field in fields){
missings <- is.na(toy.df[[field]])
toy.df[[field]][missings] <- sample(toy.df[[field]][!missings],sum(missings),T)
}
and to fill them in simultaneously (preserving the correlation between the fields) use something like:
missings <- apply(toy.df[,fields],
1,
function(x)any(is.na(x)))
toy.df[missings,fields] <- toy.df[!missings,fields][sample(sum(!missings),
sum(missings),
T),]
and of course, to avoid the implicit for loop in the apply(x,1,fun), you could use:
rowAny <- function(x) rowSums(x) > 0
missings <- rowAny(toy.df[,fields])

Resources