Find/replace or map, using a lookup table in R

Total R-newbie, here. Please be gentle.
I have a column in a dataframe with numerical values representing ethnicity (UK Census data).
# create example data
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
ethnicode = c(0, 1, 2, 3, 4, 5, 6, 7, 8)
df = data.frame(id, ethnicode)
I can do a mapping (or find/replace) to create a column (or edit an existing column) that contains a human-readable value:
# map values one-to-one from numeric to string
# (mapvalues() comes from the plyr package)
library(plyr)
df$ethnicity <- mapvalues(df$ethnicode,
from = c(8, 7, 6, 5, 4, 3, 2, 1, 0),
to = c("Other", "Black", "Asian", "Mixed",
"WhiteOther", "WhiteIrish", "WhiteUK",
"WhiteTotal", "All"))
Of all of the things I tried this seemed to be the quickest (around 20 seconds for 9 million rows as opposed to over a minute with some approaches).
What I can’t seem to find (or understand from what I’ve read), is how to reference a lookup table instead.
# create lookup table
ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0)
ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther",
"WhiteIrish", "WhiteUK", "WhiteTotal", "All")
lookup = data.frame(ethnicode, ethnicity)
The point being, if I want to change the human readable strings, or do anything else to the process, I’d rather do it once to the look-up table, than have to do it in several places in several scripts... and if I can do it more efficiently (under 20 seconds for 9 million rows) that would be good, too.
I also want to easily make sure that “8” still equals ‘Other’ (or whatever equivalent), and “0” still equals ‘All’, etc., which is more difficult, visually, with longer lists using the above approach.
Thanks in advance.

You could use named vectors for this. However, you would need to convert the ethnicode to character.
df = data.frame(
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
ethnicode = as.character(c(0, 1, 2, 3, 4, 5, 6, 7, 8)),
stringsAsFactors=FALSE
)
# create lookup table
ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0)
ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther",
"WhiteIrish", "WhiteUK", "WhiteTotal", "All")
lookup = setNames(ethnicity, as.character(ethnicode))
Then you can do
df <- transform(df, ethnicity=lookup[ethnicode], stringsAsFactors=FALSE)
and you are done.
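If you would rather keep the lookup as a two-column data frame (as in the question), a minimal sketch using base R's match() could look like the following. It assumes every code in the data appears exactly once in the lookup; codes missing from the lookup would come back as NA. No conversion to character is needed here.

```r
# example data, same shape as in the question
df <- data.frame(id = 1:9, ethnicode = c(0, 1, 2, 3, 4, 5, 6, 7, 8))

# lookup table kept as a data frame: one row per code
lookup <- data.frame(
  ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0),
  ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther",
                "WhiteIrish", "WhiteUK", "WhiteTotal", "All"),
  stringsAsFactors = FALSE
)

# match() returns, for each code in df, the row of lookup holding that code
df$ethnicity <- lookup$ethnicity[match(df$ethnicode, lookup$ethnicode)]
```

Because each code sits next to its label in one row, it is also easier to check by eye that 8 is still "Other" and 0 is still "All".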
For working with 9 million rows, I suggest you use a database like sqlite or monetdb. For sqlite, the following code might be helpful:
library(RSQLite)
dbname <- "big_data_mapping.db" # db to create
csvname <- "data/big_data_mapping.csv" # large dataset
ethn_codes = data.frame(
ethnicode= c(8, 7, 6, 5, 4, 3, 2, 1, 0),
ethnicity= c("Other", "Black", "Asian", "Mixed", "WhiteOther", "WhiteIrish", "WhiteUK", "WhiteTotal", "All")
)
# build db
con <- dbConnect(SQLite(), dbname)
dbWriteTable(con, name="main", value=csvname, overwrite=TRUE)
dbWriteTable(con, name="ethn_codes", ethn_codes, overwrite=TRUE)
# join the tables
dat <- dbGetQuery(con, "SELECT main.id, ethn_codes.ethnicity FROM main JOIN ethn_codes ON main.ethnicode=ethn_codes.ethnicode")
# finish
dbDisconnect(con)
#file.remove(dbname)
MonetDB is said to be more suitable for the tasks you usually do with R, so it is definitely worth a look.

Related

joining two dataframes on matching values of two common columns R

I have two dataframes, A and B, that both have multiple columns. They share the common columns "week" and "store". I would like to join these two dataframes on the matching values of the common columns.
For example this is a small subset of the data that I have:
A = data.frame(retailer = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
store = c(5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6),
week = c(2021100301, 2021092601, 2021091901, 2021091201, 2021082901, 2021082201, 2021081501, 2021080801,
2021080101, 2021072501, 2021071801, 2021071101, 2021070401, 2021062701, 2021062001, 2021061301),
dollars = c(121817.9, 367566.7, 507674.5, 421257.8, 453330.3, 607551.4, 462674.8,
464329.1, 339342.3, 549271.5, 496720.1, 554858.7, 382675.5,
373210.9, 422534.2, 381668.6))
and
B = data.frame(
week = c("2020080901", "2017111101", "2017061801", "2020090701", "2020090701", "2020090701",
"2020091201","2020082301", "2019122201", "2017102901"),
store = c(14071, 11468, 2428, 17777, 14821, 10935, 5127, 14772, 14772, 14772),
fill = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)
I would like to join these two tables on the matching week AND store values in order to incorporate the "fill" column from B into A. Where the values don't match, I would like to have a label "0" in the fill column, instead of a 1. Is there a way I can do this? I am not sure which join to use as well, or if "merge" would be better for this? Essentially I am NOT trying to get rid of any rows that do not have the matching values for the two common columns. Thanks for any help!
We may do a left_join
library(dplyr)
library(tidyr)
A %>%
mutate(week = as.character(week)) %>%
left_join(B, by = c("week", "store")) %>%
mutate(fill = replace_na(fill, 0))
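For comparison, here is a rough base R sketch of the same idea with merge(), since the question asks about it. The tiny A and B below are made-up stand-ins (one row is chosen to match on purpose); all.x = TRUE keeps every row of A, and unmatched rows get fill = 0. Note that merge() may reorder the rows.

```r
# small hypothetical stand-ins for A and B
A <- data.frame(store = c(5, 5, 6),
                week = c(2021100301, 2021092601, 2021080101),
                dollars = c(121817.9, 367566.7, 339342.3))
B <- data.frame(week = c("2021092601", "2020080901"),
                store = c(5, 14071),
                fill = c(1, 1))

# the key columns must have the same type in both tables
A$week <- as.character(A$week)

# left join: keep all rows of A, add fill where week AND store match
joined <- merge(A, B, by = c("week", "store"), all.x = TRUE)

# rows of A without a match in B get 0 instead of NA
joined$fill[is.na(joined$fill)] <- 0
```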

Extract cluster information and combine results

I am attempting to run a clustering algorithm over a list of dissimilarity matrices for different numbers of clusters k and extract some information for each run.
This first block of code produces the list of dissimilarity matrices
library(tidyverse)
library(cluster)
library(rje)
dat=mtcars[,1:3]
v_names=names(dat)
combos=rje::powerSet(v_names)
combos=combos[lengths(combos)>1]
df_list=list()
for (i in seq_along(combos)){
df_list[[i]]=dat[combos[[i]]]
}
gower_ls=lapply(df_list,daisy,metric="gower")
Here is the section of code I am having a problem with
set.seed(4)
model_num <-c(NA)
sil_width <-c(NA)
min_sil<-c(NA)
mincluster<-c(NA)
k_clusters <-c(NA)
lowest_sil <-c(NA)
maxcluster <-c(NA)
model_vars <- c(NA)
clust_4=lapply(gower_ls,pam,diss=TRUE,k=4)
for(m in 1:length(clust_4)){
sil_width[m] <-clust_4[[m]][7]$silinfo$avg.width
min_sil[m] <- min(clust_4[[m]][7]$silinfo$clus.avg.widths)
mincluster[m] <-min(clust_4[[m]][6]$clusinfo[,1])
maxcluster[m] <-max(clust_4[[m]][6]$clusinfo[,1])
k_clusters[m]<- nrow(clust_4[[m]][6]$clusinfo)
lowest_sil[m]<-min(clust_4[[m]][7]$silinfo$widths)
model_num[m] <-m
}
colresults_4=as.data.frame(cbind( sil_width, min_sil,mincluster,maxcluster,k_clusters,model_num,lowest_sil))
How can I convert this piece of code to run for a given range of k? I've tried a nested loop but I was not able to code it correctly. Here are the desired results for k= 4:6, thanks.
structure(list(sil_width = c(0.766467312788453, 0.543226669407726,
0.765018469447229, 0.705326458357873, 0.698351173575526, 0.480565022092276,
0.753366365875066, 0.644345251543097, 0.699437672202048, 0.430310752506775,
0.678224885117295, 0.576411380463116), min_sil = c(0.539324315243191,
0.508330909368204, 0.637090842537915, 0.622120627356455, 0.539324315243191,
0.334047777245833, 0.430814518122641, 0.568591550281139, 0.539324315243191,
0.295113900268025, 0.430814518122641, 0.19040716086259), mincluster = c(5,
3, 4, 5, 2, 3, 3, 3, 2, 3, 3, 3), maxcluster = c(14, 12, 11,
14, 12, 10, 11, 11, 9, 6, 7, 7), k_clusters = c(4, 4, 4, 4, 5,
5, 5, 5, 6, 6, 6, 6), model_num = c(1, 2, 3, 4, 1, 2, 3, 4, 1,
2, 3, 4), lowest_sil = c(-0.0726256983240229, 0.0367238314801671,
0.308069836672298, 0.294247157041013, -0.0726256983240229, -0.122804288130541,
-0.317748917748917, 0.218164082936686, -0.0726256983240229, -0.224849074123824,
-0.317748917748917, -0.459909237820881)), row.names = c(NA, -12L
), class = "data.frame")
I was able to come up with a solution by writing a function clus_func that extracts the cluster information and then using cross2 and map2 from the purrr package:
library(tidyverse)
library(cluster)
library(rje)
dat=mtcars[,1:3]
v_names=names(dat)
combos=rje::powerSet(v_names)
combos=combos[lengths(combos)>1]
clus_func=function(x,k){
clust=pam(x,k,diss=TRUE)
clust_stats=as.data.frame(cbind(
avg_sil_width=clust$silinfo$avg.width,
min_clus_width=min(clust$silinfo$clus.avg.widths),
min_individual_sil=min(clust$silinfo$widths[,3]),
max_individual_sil=max(clust$silinfo$widths[,3]),
mincluster= min(clust$clusinfo[,1]),
maxcluster= max(clust$clusinfo[,1]),
num_k=max(clust$clustering) ))
}
df_list=list()
for (i in seq_along(combos)){
df_list[[i]]=dat[combos[[i]]]
}
gower_ls=lapply(df_list,daisy,metric="gower")
begin_k=4
end_k=6
cross_list=cross2(gower_ls,begin_k:end_k)
k=c(NA)
for(i in 1:length(cross_list)){ k[i]=cross_list[[i]][2]}
diss=c(NA)
for(i in 1:length(cross_list)){ diss[i]=cross_list[[i]][1]}
model_stats=map2(diss, k, clus_func)
# rbindlist() comes from the data.table package, which is not loaded above
model_stats=data.table::rbindlist(model_stats)
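An alternative sketch that avoids purrr altogether: build the grid of (model, k) pairs with expand.grid() and loop over it with Map(). This is only one way to do it. It uses combn() in place of rje::powerSet(), so the order of the variable combinations (and therefore model_num) may differ from the original; the statistics pulled out are the ones from the question.

```r
library(cluster)

dat <- mtcars[, 1:3]

# all variable subsets with at least 2 columns (base R stand-in for powerSet)
combos <- unlist(
  lapply(2:ncol(dat), function(n) combn(names(dat), n, simplify = FALSE)),
  recursive = FALSE
)
gower_ls <- lapply(combos, function(v) daisy(dat[v], metric = "gower"))

# one row per (dissimilarity matrix, k) pair
grid <- expand.grid(model_num = seq_along(gower_ls), k = 4:6)

stats <- Map(function(m, k) {
  fit <- pam(gower_ls[[m]], k = k, diss = TRUE)
  data.frame(
    sil_width  = fit$silinfo$avg.width,
    min_sil    = min(fit$silinfo$clus.avg.widths),
    mincluster = min(fit$clusinfo[, "size"]),
    maxcluster = max(fit$clusinfo[, "size"]),
    k_clusters = k,
    model_num  = m,
    lowest_sil = min(fit$silinfo$widths[, "sil_width"])
  )
}, grid$model_num, grid$k)

# stack the one-row data frames into a single result table
results <- do.call(rbind, stats)
```

Changing the range of k then only means changing 4:6 in the expand.grid() call.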

Adding rows to make a full long dataset for longitudinal data analysis

I am working with a long-format longitudinal dataset where each person has 1, 2 or 3 time points. In order to perform certain analyses I need to make sure that each person has the same number of rows, even if some rows consist of NAs because the person did not complete a certain time point.
Here is a sample of the data before adding the rows:
structure(list(Values = c(23, 24, 45, 12, 34, 23), P_ID = c(1,
1, 2, 2, 2, 3), Event_code = c(1, 2, 1, 2, 3, 1), Site_code = c(1,
1, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -6L))
This is the data I aim to get after adding the relevant rows:
structure(list(Values = c(23, 24, NA, 45, 12, 34, 23, NA, NA),
P_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Event_code = c(1, 2,
3, 1, 2, 3, 1, 2, 3), Site_code = c(1, 1, 1, 3, 3, 3, 1,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
I want to come up with code that would automatically add rows to the dataset conditionally on whether the participant has had 1, 2 or 3 visits. Ideally it would make rest of data all NAs while copying Participant_ID and site_code but if not possible I would be satisfied just with creating the right number of rows.
We could use fill after doing a complete
library(dplyr)
library(tidyr)
ExpandedDataset %>%
complete(P_ID, Event_code) %>%
fill(Site_code)
I came up with fairly long code, but you could wrap it in a function to make it easier:
Here's your dataframe:
df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
Event_code = c(1, 2, 1, 2, 3, 1),
Site_code = c(1, 1, 2, 2, 2, 1))
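How many records do you have per ID? (ID is converted to a factor first, because summary() only counts records per level for factors.)
values <- summary(factor(df$ID))
What is the maximum number of records for a single patient?
target <- max(values)
Which specific patients have fewer records than the maximum?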
uncompliant <- names(which(values<target))
And how many records do you have for those patients who have missing information?
rowcount <- values[which(values<target)]
So now, let's create the vectors of the data frame we will add to your original one. First, IDs:
IDs <- vector()
for(i in 1:length(rowcount)){
y <- rep(uncompliant[i], target - rowcount[i])
IDs <- c(IDs, y)
}
And now, the sitecodes:
SC <- vector()
for(i in 1:length(rowcount)){
y <- rep(unique(df$Site_code[which(df$ID == uncompliant[i])]), target - rowcount[i])
SC <- c(SC, y)
}
Finally, a data frame with the values we will introduce:
introduce <- data.frame(ID = IDs, Event = rep(NA, length(IDs)),
Event_code = rep(NA, length(IDs)),
Site_code = SC)
Combine the original dataframe with the new values to be added and sort it so it looks nice:
final <- as.data.frame(rbind(df, introduce))
final <- final[order(final$ID), ]
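A shorter base R sketch of the same idea, using the column names from the question's data: build every participant/event combination with expand.grid(), merge the observed values onto it, and look the site code up per participant. It assumes each participant belongs to exactly one site; the object names (long, grid, full, sites) are just illustrative.

```r
# data as given in the question
long <- data.frame(Values = c(23, 24, 45, 12, 34, 23),
                   P_ID = c(1, 1, 2, 2, 2, 3),
                   Event_code = c(1, 2, 1, 2, 3, 1),
                   Site_code = c(1, 1, 3, 3, 3, 1))

# every participant crossed with every event code that occurs in the data
grid <- expand.grid(P_ID = unique(long$P_ID),
                    Event_code = sort(unique(long$Event_code)))

# left join the observed values; missing visits become NA
full <- merge(grid, long[c("P_ID", "Event_code", "Values")],
              by = c("P_ID", "Event_code"), all.x = TRUE)

# per-participant site lookup (assumes one site per participant)
sites <- unique(long[c("P_ID", "Site_code")])
full$Site_code <- sites$Site_code[match(full$P_ID, sites$P_ID)]

full <- full[order(full$P_ID, full$Event_code), ]
```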

Creating new variable based on specific rows of other two variables in a long formatted dataset

I have a long dataset of emotional responses and I need to create a variable based on specific rows of two other variables, within subjects.
The following data frame includes data for two participants ("person"), each presented with 2 pictures (P1, P2) with 3 repetitions each (R1, R2, R3); this is the "phase" variable. The variable response holds two things: the rating for each presentation (scale -30 to 30) and the emotion experienced per picture.
person <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
block <- c(4, 4, 4, 5, 5, 5, 8, 8, 4, 4, 4, 5, 5, 5, 8, 8)
phase <- c("P1R1", "P1R2", "P1R3", "P2R1","P2R2","P2R3", "Post1", "Post2","P1R1",
"P1R2", "P1R3", "P2R1","P2R2","P2R3", "Post1", "Post2")
response <- c(30, 30, 30, -30, -30, -30, "Happy", "Sad", 28, 27, 25, -23, -24,
-22, "Excited", "Scared")
df <- data.frame(person, block, phase, response)
I need to create a new column that will be based on the block number and give me the emotion per picture.
I would like the new column to be called “postsurvey” and expect it to be as following:
postsurvey <-c ("Happy", "Happy", "Happy","Sad","Sad", "Sad", NA, NA,
"Excited", "Excited", "Excited", "Scared", "Scared", "Scared", NA, NA)
df <- data.frame(person, block, phase, response, postsurvey)
The code that I used is:
df<-df %>% group_by(person, block) %>%
mutate(postsurvey=if(block==4){response[phase=="Post1"]}
else if (block==5){response[phase=="Post2"]}
else {print("NA")})
I expect each subject to get the same response repeated within each block number. What I get instead is that the response is not grouped by subject and is not repeated within the subject by block number; it is as if there is one vector of emotions and a person gets emotions that are not theirs.
*In my original data I have 4 pictures per subject with 10 repetitions, so the "else if" code is repeated with more than two conditions.
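One possible way to get the expected column, sketched in base R under the assumption that block 4 always corresponds to "Post1" and block 5 to "Post2" (the block_to_post vector below is a hypothetical mapping that would need one entry per picture in the real data). The idea is to treat the post-survey rows as a lookup table keyed by person and phase, instead of branching with if/else inside mutate.

```r
person <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
block <- c(4, 4, 4, 5, 5, 5, 8, 8, 4, 4, 4, 5, 5, 5, 8, 8)
phase <- c("P1R1", "P1R2", "P1R3", "P2R1", "P2R2", "P2R3", "Post1", "Post2",
           "P1R1", "P1R2", "P1R3", "P2R1", "P2R2", "P2R3", "Post1", "Post2")
response <- c(30, 30, 30, -30, -30, -30, "Happy", "Sad", 28, 27, 25, -23, -24,
              -22, "Excited", "Scared")
df <- data.frame(person, block, phase, response)

# hypothetical mapping: which post-survey phase belongs to which block
block_to_post <- c("4" = "Post1", "5" = "Post2")

# the post-survey rows act as the lookup table
post <- df[df$phase %in% block_to_post, c("person", "phase", "response")]

# for each row, the post phase it should read from (NA for block 8)
wanted_phase <- block_to_post[as.character(df$block)]

# match on person AND phase so nobody gets another person's emotion
idx <- match(paste(df$person, wanted_phase), paste(post$person, post$phase))
df$postsurvey <- post$response[idx]
```

Adding more pictures then only means adding entries to block_to_post.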

Set bubble size according to categorical data

Keep in mind, I am very new to R.
I have a dataset from a public opinion survey, and would like to represent the answers through a bubble chart, though the data is categorical, not numeric.
From dataset "Arab4" I have question/variable "Q713" with all of the observations coded as 1, 2, 3, 4, or 5 as the response options. I would like to plot the bubbles (stacked on top of one another by "country") with the size of the bubble corresponding to the percent of the vote share that answer got. For example, if 49% of respondents in Israel voted for option 1 under question "Q", then the bubble size would represent 49% and be situated above the Israel category label with the color of the bubble corresponding to the response type (1, 2, 3, 4, or 5).
I have the following code, giving me a blank chart, and I know to eventually use the "points" command with more specifications.
What I need help with is defining the radius of the circles from the data I have.
plot(Arab4$country, Arab4$q713, type= "n", xlab = FALSE, ylab=FALSE)
points(Arab4$country, Arab4$q713)
Here is some dput from the data set
dput(Arab4$q713[1:50])
structure(c(3, 5, 3, 3, 1, 3, 5, 5, 5, 5, 3, 2, 2, 3, 1, 1, 4,
2, 3, 5, 5, 5, 2, 5, 4, 2, 5, 2, 5, 3, 5, 5, 2, 2, 5, 2, 1, 2,
1, 2, 5, 3, 4, 5, 1, 1, 1, 4, 5, 3), labels = structure(c(1,
2, 3, 4, 5, 98, 99), .Names = c("Promoting democracy", "Promoting economic development",
"Resolving the Arab-Israeli conflict", "Promoting women’s rights",
"The US should not get involved", "Don't know (Do not read)",
"Decline to answer (Do not read)")), class = "labelled")
Any ideas would help! Thanks!
As others have commented, this really is not a bubble chart as you only have 2 dimensions and the size of the circle does not add anything (other than perhaps visual appeal). But with that disclaimer, here is one approach to what I think you are trying to achieve. This requires the ggplot2 and reshape2 libraries.
library(ggplot2)
library(reshape2)
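# create simulated data (replace=TRUE is needed to draw 20 values from 5 options)
dat <- data.frame(Egypt=sample(1:5, 20, replace=TRUE), Libya=sample(1:5, 20, replace=TRUE))
# tabulate, keeping all five response levels so the result is always a matrix
dat.tab <- sapply(dat, function(x) table(factor(x, levels=1:5)))
dat.long <- melt(dat.tab)
# melt() on a matrix returns row name, column name, value, in that order
colnames(dat.long) <- c("Response", "Country", "Count")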
ggplot(dat.long, aes(x=Country, y=Count, color=Country)) +
geom_point(aes(size=Count))
EDIT Here is another approach, using the data manipulation tools of the dplyr package to get you all the way to proportions:
# using dat from above again; %>%, group_by(), count() and mutate() come from dplyr
library(dplyr)
dat.long <- melt(dat)
colnames(dat.long) <- c("Country", "Response")
dat.tab <- dat.long %>%
group_by(Country) %>%
count(Response) %>%
mutate(prop = prop.table(n))
ggplot(dat.tab, aes(x=Country, y=prop, color=Country)) +
geom_point(aes(size=prop))
You will need to do a little additional work to remove unwanted values (98, 99) if they are truly unwanted.
hth.
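For the last point, a small base R sketch of dropping the 98/99 codes before computing each country's vote share. The survey data frame here is simulated and its column names (country, q713) are assumed to mirror the real data; the resulting prop column is what you would map to bubble size.

```r
set.seed(1)

# simulated stand-in for the real survey data
survey <- data.frame(
  country = rep(c("Egypt", "Libya"), each = 20),
  q713 = sample(c(1:5, 98, 99), 40, replace = TRUE)
)

# drop "Don't know" (98) and "Decline to answer" (99)
valid <- survey[!survey$q713 %in% c(98, 99), ]

# share of each response within each country (each row sums to 1)
shares <- prop.table(table(valid$country, valid$q713), margin = 1)

# long format: one row per country/response pair, ready for plotting
shares_long <- as.data.frame(shares, responseName = "prop")
names(shares_long)[1:2] <- c("country", "response")
```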
