Joining two dataframes on matching values of two common columns in R

I have two dataframes, A and B, that both have multiple columns. They share the common columns "week" and "store". I would like to join these two dataframes on the matching values of the common columns.
For example this is a small subset of the data that I have:
A = data.frame(retailer = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
store = c(5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6),
week = c(2021100301, 2021092601, 2021091901, 2021091201, 2021082901, 2021082201, 2021081501, 2021080801,
2021080101, 2021072501, 2021071801, 2021071101, 2021070401, 2021062701, 2021062001, 2021061301),
dollars = c(121817.9, 367566.7, 507674.5, 421257.8, 453330.3, 607551.4, 462674.8,
464329.1, 339342.3, 549271.5, 496720.1, 554858.7, 382675.5,
373210.9, 422534.2, 381668.6))
and
B = data.frame(
week = c("2020080901", "2017111101", "2017061801", "2020090701", "2020090701", "2020090701",
"2020091201","2020082301", "2019122201", "2017102901"),
store = c(14071, 11468, 2428, 17777, 14821, 10935, 5127, 14772, 14772, 14772),
fill = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)
I would like to join these two tables on the matching week AND store values in order to incorporate the "fill" column from B into A. Where the values don't match, I would like to have a label "0" in the fill column, instead of a 1. Is there a way I can do this? I am not sure which join to use as well, or if "merge" would be better for this? Essentially I am NOT trying to get rid of any rows that do not have the matching values for the two common columns. Thanks for any help!

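We can do a left_join, which keeps every row of A. The week column is numeric in A but character in B, so it is converted first so that the keys have the same type.
library(dplyr)
library(tidyr)
A %>%
  mutate(week = as.character(week)) %>%       # match the type of B$week
  left_join(B, by = c("week", "store")) %>%   # keep all rows of A
  mutate(fill = replace_na(fill, 0))          # rows without a match get 0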
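Since the question also asks about merge: here is a minimal base R sketch of the same left join. A_small and B_small are made-up stand-ins for A and B (hypothetical values, not the asker's data), and the sketch assumes the key columns are brought to the same type first.

```r
# Made-up stand-ins for A and B (hypothetical values, for illustration only)
A_small <- data.frame(store   = c(5, 5, 6),
                      week    = c(2021100301, 2021092601, 2021080101),
                      dollars = c(121817.9, 367566.7, 339342.3))
B_small <- data.frame(week  = c("2021100301", "2020080901"),
                      store = c(5, 14071),
                      fill  = c(1, 1))

# week is numeric in A but character in B, so align the types first
A_small$week <- as.character(A_small$week)

# all.x = TRUE keeps every row of A (a left join); unmatched rows get NA in fill
joined <- merge(A_small, B_small, by = c("week", "store"), all.x = TRUE)

# turn the NAs of the unmatched rows into 0
joined$fill[is.na(joined$fill)] <- 0
joined
```

Note that merge() sorts the result by the key columns by default, so the original row order of A is not preserved.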

Using map() function to apply for each element

I need to apply the above to each element with the help of the map() function.
How can I do so?
As dt is of class data.table, you can build a vector of the columns of interest (i.e. your items; below I use grepl on the names) and then apply your weighting function to each of those columns using .SD and .SDcols, grouped with by:
library(data.table)
# item columns: every column whose name starts with "q"
qs = names(dt)[grepl("^q", names(dt))]
# one new "<item>wt" column per item, computed within each group
dt[, (paste0(qs, "wt")) := lapply(.SD, \(q) 1/(sum(!is.na(q))/.N)),
   by = .(sex, education_code, age), .SDcols = qs]
As mentioned in the comments, you are missing a dt <- in front of dt[, .(ID, education_code, age, sex, item = q1_1)], so the column item is not available in the following line, dt[, no_respond := is.na(item)].
Your weighting scheme is not entirely clear to me. However, assuming you want to do what your code does, I would go with a dplyr solution to iterate over the columns.
library(dplyr)
# your data without the no_respond column and with the missing value in q2_3 corrected
dt <- data.table::data.table(
ID = c(1,2,3,4, 5, 6, 7, 8, 9, 10),
education_code = c(20,50,20,60, 20, 10,5, 12, 12, 12),
age = c(87,67,56,52, 34, 56, 67, 78, 23, 34),
sex = c("F","M","M","M", "F","M","M","M", "M","M"),
q1_1 = c(NA,1,5,3, 1, NA, 3, 4, 5,1),
q1_2 = c(NA,1,5,3, 1, 2, NA, 4, 5,1),
q1_3 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q1_text = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_1 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_2 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_3 = c(NA,1,5,3, 1, NA, NA, 4, 5,1),
q2_text = c(NA,1,5,3, 1, NA, 3, 4, 5,1))
dt %>%
group_by(sex, education_code, age) %>% #groups the df by sex, education_code, age
add_count() %>% #add a column with number of rows in each group
mutate(across(starts_with("q"), #for each column starting with "q"
~ 1/(sum(!is.na(.))/n), #create a new column following your weight calculation
.names = '{.col}_wgt')) %>% #naming the new column with suffix "_wgt" to original name
ungroup()
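For checking the weight formula by hand, here is a minimal base R sketch that swaps in ave() instead of data.table or dplyr. The data frame toy and its values are made up for illustration, and only one grouping column and one item are used.

```r
# Made-up toy data (hypothetical): one grouping column and one item column
toy <- data.frame(sex  = c("F", "F", "M", "M"),
                  q1_1 = c(NA, 1, 2, 3))

# Same formula as above, 1 / (non-missing answers / group size), per group.
# ave() recycles the single value returned by FUN to every row of the group.
toy$q1_1wt <- ave(toy$q1_1, toy$sex,
                  FUN = function(q) 1 / (sum(!is.na(q)) / length(q)))
toy
```

In group "F" one of two answers is missing, so the weight is 2; in group "M" nothing is missing, so the weight is 1. A group in which every answer is missing would get Inf, because the formula divides by zero there.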

Set bubble size according to categorical data

Keep in mind, I am very new to R.
I have a dataset from a public opinion survey, and would like to represent the answers through a bubble chart, though the data is categorical, not numeric.
From dataset "Arab4" I have question/variable "Q713" with all of the observations coded as 1, 2, 3, 4, or 5 as the response options. I would like to plot the bubbles (stacked on top of one another by "country") with the size of the bubble corresponding to the percent of the vote share that answer got. For example, if 49% of respondents in Israel voted for option 1 under question "Q", then the bubble size would represent 49% and be situated above the Israel category label with the color of the bubble corresponding to the response type (1, 2, 3, 4, or 5).
I have the following code, giving me a blank chart, and I know to eventually use the "points" command with more specifications.
What I need help with is defining the radius of the circles from the data I have.
plot(Arab4$Country, Arab4$Q713, type= "n", xlab = FALSE, ylab=FALSE)
points(Arab4$country, Arab4$q713)
Here is some dput from the data set
dput(Arab4$q713[1:50])
structure(c(3, 5, 3, 3, 1, 3, 5, 5, 5, 5, 3, 2, 2, 3, 1, 1, 4,
2, 3, 5, 5, 5, 2, 5, 4, 2, 5, 2, 5, 3, 5, 5, 2, 2, 5, 2, 1, 2,
1, 2, 5, 3, 4, 5, 1, 1, 1, 4, 5, 3), labels = structure(c(1,
2, 3, 4, 5, 98, 99), .Names = c("Promoting democracy", "Promoting economic development",
"Resolving the Arab-Israeli conflict", "Promoting women’s rights",
"The US should not get involved", "Don't know (Do not read)",
"Decline to answer (Do not read)")), class = "labelled")
Any ideas would help! Thanks!
As others have commented, this really is not a bubble chart as you only have 2 dimensions and the size of the circle does not add anything (other than perhaps visual appeal). But with that disclaimer, here is one approach to what I think you are trying to achieve. This requires the ggplot2 and reshape2 libraries.
library(ggplot2)
library(reshape2)
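# create simulated data (20 responses per country, sampled with replacement)
dat <- data.frame(Egypt=sample(1:5, 20, replace=TRUE), Libya=sample(1:5, 20, replace=TRUE))
# tabulate each country; lapply always returns a list, so melt gives Response, Count, Country
dat.tab <- lapply(dat, table)
dat.long <- melt(dat.tab)
colnames(dat.long) <- c("Response", "Count", "Country")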
ggplot(dat.long, aes(x=Country, y=Count, color=Country)) +
geom_point(aes(size=Count))
EDIT Here is another approach, using the data manipulation tools of the dplyr package to get you all the way to proportions:
library(dplyr)
# using dat from above again
dat.long <- melt(dat)
colnames(dat.long) <- c("Country", "Response")
dat.tab <- dat.long %>%
group_by(Country) %>%
count(Response) %>%
mutate(prop = prop.table(n))
ggplot(dat.tab, aes(x=Country, y=prop, color=Country)) +
geom_point(aes(size=prop))
You will need to do a little additional work to remove unwanted values (98, 99) if they are truly unwanted.
hth.
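If you would rather compute the shares without extra packages, the following is a minimal base R sketch. The data frame survey and its values are made up for illustration; the column names country and q713 mirror the question, and codes 98 and 99 are assumed to be unwanted.

```r
# Made-up survey answers (hypothetical), including one "Don't know" code (98)
survey <- data.frame(
  country = rep(c("Egypt", "Libya"), each = 5),
  q713    = c(1, 1, 2, 5, 98,   3, 3, 3, 5, 5)
)

# drop the non-substantive codes before computing shares
survey <- survey[!survey$q713 %in% c(98, 99), ]

# share of each response within each country (each row of the table sums to 1)
shares <- prop.table(table(country = survey$country, response = survey$q713),
                     margin = 1)

# long format: one row per country/response pair, with the share in column "prop"
shares_long <- as.data.frame(shares, responseName = "prop")
shares_long
```

A data frame like shares_long could then be handed to ggplot with size mapped to prop and color mapped to response, along the lines of the plots above.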

Find/replace or map, using a lookup table in R

Total R-newbie, here. Please be gentle.
I have a column in a dataframe with numerical values representing ethnicity (UK Census data).
# create example data
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
ethnicode = c(0, 1, 2, 3, 4, 5, 6, 7, 8)
df = data.frame(id, ethnicode)
I can do a mapping (or find/replace) to create a column (or edit an existing column) that contains a human-readable value:
# map values one-to-one from numeric to string (mapvalues() comes from the plyr package)
df$ethnicity <- mapvalues(df$ethnicode,
from = c(8, 7, 6, 5, 4, 3, 2, 1, 0),
to = c("Other", "Black", "Asian", "Mixed",
"WhiteOther", "WhiteIrish", "WhiteUK",
"WhiteTotal", "All"))
Of all of the things I tried this seemed to be the quickest (around 20 seconds for 9 million rows as opposed to over a minute with some approaches).
What I can’t seem to find (or understand from what I’ve read), is how to reference a lookup table instead.
# create lookup table
ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0)
ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther",
"WhiteIrish", "WhiteUK", "WhiteTotal", "All")
lookup = data.frame(ethnicode, ethnicity)
The point being, if I want to change the human readable strings, or do anything else to the process, I’d rather do it once to the look-up table, than have to do it in several places in several scripts... and if I can do it more efficiently (under 20 seconds for 9 million rows) that would be good, too.
I also want to easily make sure that “8” still equals ‘Other’ (or whatever equivalent), and “0” still equals ‘All’, etc., which is more difficult, visually, with longer lists using the above approach.
Thanks in advance.
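You could use a named vector for this. However, you would need to convert ethnicode to character: a numeric index selects by position rather than by name, so code 0 would match nothing and code 1 would return the first element instead of the one named "1".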
df = data.frame(
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
ethnicode = as.character(c(0, 1, 2, 3, 4, 5, 6, 7, 8)),
stringsAsFactors=FALSE
)
# create lookup table
ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0)
ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther",
"WhiteIrish", "WhiteUK", "WhiteTotal", "All")
lookup = setNames(ethnicity, as.character(ethnicode))
Then you can do
df <- transform(df, ethnicity=unname(lookup[ethnicode]), stringsAsFactors=FALSE)
and you are done. (unname() drops the codes that the lookup result would otherwise carry along as names.)
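If you prefer to keep the lookup as a data frame, as in the question, a minimal sketch with match() could look like this. It assumes every code occurs only once in the lookup table; codes in df that are missing from the lookup come back as NA.

```r
# example data and lookup table as in the question
df <- data.frame(id = 1:9, ethnicode = c(0, 1, 2, 3, 4, 5, 6, 7, 8))
lookup <- data.frame(
  ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0),
  ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther",
                "WhiteIrish", "WhiteUK", "WhiteTotal", "All"),
  stringsAsFactors = FALSE
)

# assumption: each code appears only once in the lookup table
stopifnot(!anyDuplicated(lookup$ethnicode))

# match() returns, for each code in df, the row position of that code in lookup
df$ethnicity <- lookup$ethnicity[match(df$ethnicode, lookup$ethnicode)]
df
```

With this layout, changing a label means editing one row of lookup, and the code/label pairs sit side by side in the table.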
For working with 9 million rows, I suggest you use a database like sqlite or monetdb. For sqlite, the following code might be helpful:
library(RSQLite)
dbname <- "big_data_mapping.db" # db to create
csvname <- "data/big_data_mapping.csv" # large dataset
ethn_codes = data.frame(
ethnicode= c(8, 7, 6, 5, 4, 3, 2, 1, 0),
ethnicity= c("Other", "Black", "Asian", "Mixed", "WhiteOther", "WhiteIrish", "WhiteUK", "WhiteTotal", "All")
)
# build db
con <- dbConnect(SQLite(), dbname)
dbWriteTable(con, name="main", value=csvname, overwrite=TRUE)
dbWriteTable(con, name="ethn_codes", ethn_codes, overwrite=TRUE)
# join the tables
dat <- dbGetQuery(con, "SELECT main.id, ethn_codes.ethnicity FROM main JOIN ethn_codes ON main.ethnicode=ethn_codes.ethnicode")
# finish
dbDisconnect(con)
#file.remove(dbname)
MonetDB is said to be better suited to the kind of tasks you usually do in R, so it is definitely worth a look.

determining age from min max dates for each item in dataset [duplicate]

This question is very similar to a question asked in another thread which can be found here. I'm trying to achieve something similar: within groups (events) subtract the first date from the last date. I'm using the dplyr package and code provided in the answers of this thread. Subtracting the first date from the last date works, however it does not provide satisfactory results; the resulting time difference is displayed in numbers, and there seems to be no distinction between different time units (e.g., minutes and hours) --> subtractions in first 2 events are correct, however in the 3rd one it is not i.e. should be minutes. How can I manipulate the output by dplyr so that the resulting subtractions are actually a correct reflection of the time difference? Below you will find a sample of my data (1 group only) and the code that I used:
df<- structure(list(time = structure(c(1428082860, 1428083340, 1428084840,
1428086820, 1428086940, 1428087120, 1428087240, 1428087360, 1428087480,
1428087720, 1428088800, 1428089160, 1428089580, 1428089700, 1428090120,
1428090240, 1428090480, 1428090660, 1428090780, 1428090960, 1428091080,
1428091200, 1428091500, 1428091620, 1428096060, 1428096420, 1428096540,
1428096600, 1428097560, 1428097860, 1428100440, 1428100560, 1428100680,
1428100740, 1428100860, 1428101040, 1428101160, 1428101400, 1428101520,
1428101760, 1428101940, 1428102240, 1428102840, 1428103080, 1428103620,
1428103980, 1428104100, 1428104160, 1428104340, 1428104520, 1428104700,
1428108540, 1428108840, 1428108960, 1428110340, 1428110460, 1428110640
), class = c("POSIXct", "POSIXt"), tzone = ""), event = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)), .Names = c("time",
"event"), class = "data.frame", row.names = c(NA, 57L))
df1 <- df %>%
group_by(event) %>%
summarize(first(time),last(time),difference = last(time)-first(time))
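Subtracting two POSIXct values with - lets R pick the unit automatically for each result, so different groups can end up in different units. We can use difftime and specify units so that all differences are expressed in the same unit.
df %>%
  group_by(event) %>%
  summarise(First = first(time),
            Last = last(time),
            difference = difftime(last(time), first(time), units = "hours"))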
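To see where the mixed units come from, here is a minimal base R sketch using only the first and last timestamps of events 1 and 3 from the sample data. The time zone "UTC" is an assumption made for reproducibility (the sample data uses the local time zone); it does not affect the differences.

```r
# first and last timestamps of events 1 and 3 (seconds since 1970-01-01)
time  <- as.POSIXct(c(1428082860, 1428091620, 1428108540, 1428110640),
                    origin = "1970-01-01", tz = "UTC")
event <- c(1, 1, 3, 3)

# with "-", the unit is chosen separately for each result
auto1 <- time[2] - time[1]   # 8760 seconds, reported in hours
auto3 <- time[4] - time[3]   # 2100 seconds, reported in minutes
units(auto1)
units(auto3)

# with an explicit unit, every group is expressed in hours
hours_by_event <- sapply(split(time, event), function(x) {
  as.numeric(difftime(max(x), min(x), units = "hours"))
})
hours_by_event
```

Converting with as.numeric() only after fixing the unit keeps the numbers comparable across groups.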
