I'm seeking advice on using ggalluvial to show the distribution of preferences in Australian elections.
For context: in Australia we have preferential voting. Say I live in an area with 4 candidates contesting.
The ballot is completed by numbering the boxes 1-4 according to your party/candidate preference.
The candidate with the lowest proportion of the vote after the first count is eliminated, and their votes are distributed according to the next preference their voters indicated on the ballot paper. This process is repeated until two candidates remain, and a candidate is elected when they have more than 50% of the two-party-preferred vote.
I'd like to visualise this iterative distribution process as a flow diagram using ggalluvial.
However, I can't quite work out how to map the aesthetics so that the flows show votes feeding to candidates in the next count.
Here's what I have so far:
library(tidyverse)
library(magrittr)
library(ggalluvial)
# Load Data
house_of_reps <- read_csv("https://results.aec.gov.au/24310/Website/Downloads/HouseDopByDivisionDownload-24310.csv", skip = 1)
house_of_reps$BallotPosition %<>% as.factor()
house_of_reps$CountNumber %<>% as.factor()
cooper <- house_of_reps %>%
filter(DivisionNm == "Cooper") %>%
spread(CalculationType, CalculationValue) %>%
select(4,9,10,14)
cooper %>% ggplot(aes(x = CountNumber, alluvium = PartyNm, stratum = `Preference Percent`, y = `Preference Percent`, fill = PartyAb)) +
geom_alluvium(aes(fill = PartyAb), decreasing = TRUE) +
geom_stratum(decreasing = TRUE) +
geom_text(stat = "stratum",decreasing = TRUE, aes(label = after_stat(fill))) +
stat_stratum(decreasing = TRUE) +
stat_stratum(geom = "text", aes(label = PartyAb), decreasing = TRUE) +
scale_fill_viridis_d() +
theme_minimal()
I would appreciate any guidance on how to show which political party the votes flow to in the next stratum after each subsequent count.
Unfortunately, your dataset is not well suited to the kind of plot you have in mind. While the plotting itself is easy, achieving the desired plot involves "some" data wrangling and preparation steps.
The general issue is that your dataset as-is does not show the flow of votes from one party to another. It only shows the overall number of votes a party lost or received in each count.
However, as only one party drops out in each step, this missing information can be extracted from your data. The basic idea is to split the observations for each party that drops out in one of the later counts by the voters' secondary party preference.
I'm not sure whether each step is clear, but I added explanations as comments and a plot of the final structure of the dataset, which hopefully makes the result of all the steps clearer:
library(tidyverse)
library(magrittr)
library(ggalluvial)
# Load Data
house_of_reps <- read_csv("https://results.aec.gov.au/24310/Website/Downloads/HouseDopByDivisionDownload-24310.csv", skip = 1)
house_of_reps$BallotPosition %<>% as.factor()
house_of_reps$CountNumber %<>% as.factor()
cooper <- house_of_reps %>%
filter(DivisionNm == "Cooper") %>%
spread(CalculationType, CalculationValue) %>%
select(count = CountNumber, party = PartyAb, pref = `Preference Count`, trans = `Transfer Count`)
# Helper function to split the votes of the party dropped in a count by the party they are transferred to
make_rows <- function(x) {
# Name of party which gets dropped in this period
dropped <- filter(x, trans < 0) %>% pull(party)
if (length(dropped) > 0) {
x <- filter(x, trans >= 0)
# Replacements are added two times. Once for the period where the party drops out,
# and also for the previous period
xdrop <- mutate(x, party = dropped, pref = trans, trans = 0, is_drop = FALSE)
xdrop1 <- mutate(xdrop, count = count - 1, to = party, is_drop = FALSE)
# For the parties that are kept or that receive transferred votes, adjust the number of votes
xkeep <- mutate(x, pref = pref - trans, trans = 0)
bind_rows(xdrop1, xdrop, xkeep)
} else {
x
}
}
cooper1 <- cooper %>%
# First: Convert count to a numeric. Add a "to" variable for second
# party preference or the party where votes are transferred to. This variable
# will later on be mapped on the "fill" aes
mutate(to = party, count = as.numeric(as.character(count))) %>%
group_by(party) %>%
# Add identifier of obs. to drop. Obs. to drop are obs. of parties which
# drop out in the following count
mutate(is_drop = lead(trans, default = 0) < 0) %>%
ungroup() %>%
# Split obs. to be dropped by secondary party preference, i.e. in count 0 the
# obs for party "IND" is replaced by seven obs. reflecting the secondary preference
# for one of the other seven parties
split(.$count) %>%
map(make_rows) %>%
bind_rows() %>%
# Now drop original obs.
filter(!is_drop, pref > 0) %>%
# Add a unique identifier
group_by(count, party) %>%
mutate(id = paste0(party, row_number())) %>%
ungroup() %>%
# To make the flow chart work we have to make the dataset complete, i.e. add
# "empty" obs. for each type of voter and each count
complete(count, id, fill = list(pref = 0, trans = 0, is_drop = FALSE)) %>%
# Fill up party and "to" columns
mutate(across(c(party, to), ~ if_else(is.na(.), str_extract(id, "[^\\d]+"), .))) %>%
# Filling up the "to" column with last observed value for "to" if any
group_by(id) %>%
mutate(last_id = last(which(party != to)),
to = if_else(count >= last_id & !is.na(last_id), to[last_id], to)) %>%
ungroup()
The final structure of the dataset could be illustrated by means of a tile plot:
cooper1 %>%
add_count(count, party) %>%
ggplot(aes(count, reorder(id, n), fill = to)) +
geom_tile(color = "white")
As I said, after all the cumbersome data wrangling, making the flow chart itself is the easiest task and can be achieved like so:
cooper1 %>%
ggplot(aes(x = count, alluvium = id, stratum = to, y = pref, fill = to)) +
geom_flow(decreasing = TRUE) +
geom_stratum(decreasing = TRUE) +
scale_fill_viridis_d() +
theme_minimal()
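To make the target data structure easier to see, here is a minimal sketch on a made-up dataset (the `toy` data frame, its parties and its vote numbers are hypothetical, not AEC data). Party C drops out after count 0, so its votes are split into two rows, `C1` and `C2`, one per receiving party. Each `id` appears once per count, and only the `to` column changes between counts:

```r
library(ggplot2)
library(ggalluvial)

# Hypothetical toy data: C is eliminated after count 0.
# C1 = C voters whose next preference is A, C2 = C voters whose next preference is B.
toy <- data.frame(
  count = rep(c(0, 1), each = 4),
  id    = rep(c("A1", "B1", "C1", "C2"), times = 2),
  to    = c("A", "B", "C", "C",   # count 0: every vote still sits with its own party
            "A", "B", "A", "B"),  # count 1: C's votes have moved to A and B
  pref  = rep(c(40, 35, 15, 10), times = 2),
  stringsAsFactors = FALSE
)

# Votes per party and count, as the strata will show them
totals <- aggregate(pref ~ count + to, data = toy, FUN = sum)

# One alluvium per id, strata and fill defined by the current holder of the votes
p <- ggplot(toy, aes(x = count, alluvium = id, stratum = to, y = pref, fill = to)) +
  geom_flow(decreasing = TRUE) +
  geom_stratum(decreasing = TRUE) +
  theme_minimal()
```

Printing `p` should show the C stratum at count 0 splitting into two flows that join A and B at count 1, which is the same mechanism the wrangling above sets up for the real data.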
I am working on a music streaming project, and I am trying to get the top 15 global streams in 2020 and show them in an interactive graph.
It successfully shows the top 15 song names as a data frame, but it fails to display as a bar graph. Where did I go wrong? It did work after I flipped the bar graph to horizontal, but the data look a bit off.
It looks like this as a vertical bar graph:
The horizontal bar graph looks like this, but the data seem incorrect:
Here is the code I have:
library("dplyr")
library("ggplot2")
# load the .csv into R studio, you can do this 1 of 2 ways
#read.csv("the name of the .csv you downloaded from kaggle")
spotiify_origional <- read.csv("charts.csv")
spotiify_origional <- read.csv("https://raw.githubusercontent.com/info201a-au2022/project-group-1-section-aa/main/data/charts.csv")
View(spotiify_origional)
# filters down the data
# removes the track id, explicit, and duration columns
spotify_modify <- spotiify_origional %>%
select(name, country, date, position, streams, artists, genres = artist_genres)
# returns all the data just from 2022
# this is the data set you should use on the project
spotify_2022 <- spotify_modify %>%
filter(date >= "2022-01-01") %>%
arrange(date) %>%
group_by(date)
# use write.csv() to turn the new dataset into a .csv file
# template: write.csv(<your data frame>, "<path to export the data frame>\\<file name>.csv", row.names = FALSE)
write.csv(spotify_2022, "/Users/oliviasapp/Documents/info201/project-group-1-section-aa/data/spotify_2022.csv" , row.names = FALSE)
# then I pushed the spotify_2022.csv to the GitHub repo
View(spotiify_origional)
spotify_2022_global <- spotify_modify %>%
filter(date >= "2022-01-01") %>%
filter(country == "global") %>%
arrange(date) %>%
group_by(streams)
View(spotify_2022_global)
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
top_15 <- top_15[1:15,]
top_15$streams <- as.numeric(top_15$streams)
View(top_15)
col_chart <- ggplot(data = top_15) +
geom_col(mapping = aes(x = name, y = streams)) +
ggtitle("Top 15 Songs Daily Streamed Globally") +
theme(plot.title = element_text(hjust = 0.5))
col_chart <- col_chart + coord_cartesian(ylim = c(999000,1000000)) + coord_flip()
col_chart
Thank you so much! Any suggestions will hugely help!
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
This code sorts in decreasing order, but the streams data here is still of character type, so numbers like 999975 will be "higher" than 1M, which is why your data looks weird. One song had two weeks just under 1M which is why it shows up with ~2M.
If you use this instead you'll get more what you intended:
top_15 <- spotify_2022_global[order(as.numeric(spotify_2022_global$streams), decreasing = TRUE), ]
However, this is finding the highest song-weeks, not the highest songs, so in this case all 15 highest song-weeks were one song.
I'd suggest you group_by(name) and then summarize to get total streams by song, filter top 15, and then make name an ordered factor, e.g. with forcats::fct_reorder.
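A minimal sketch of that suggestion, using a small made-up data frame in place of `spotify_2022_global` (the `charts` data, `total_streams` and `top_songs` names are assumptions for illustration; only the `name` and `streams` columns come from the question):

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical stand-in data; streams is stored as character, as in the question
charts <- data.frame(
  name    = c("Song A", "Song A", "Song B", "Song B", "Song C"),
  streams = c("999975", "1200000", "800000", "700000", "2500000"),
  stringsAsFactors = FALSE
)

# Character comparison goes digit by digit, so "999975" sorts above "1200000"
char_compare <- "999975" > "1200000"

top_songs <- charts %>%
  mutate(streams = as.numeric(streams)) %>%                   # convert before sorting
  group_by(name) %>%
  summarise(total_streams = sum(streams), .groups = "drop") %>%
  slice_max(total_streams, n = 15, with_ties = FALSE) %>%     # top 15 songs, not song-rows
  mutate(name = fct_reorder(name, total_streams))             # order bars by total

p <- ggplot(top_songs, aes(x = name, y = total_streams)) +
  geom_col() +
  coord_flip()
```

With the real data, replacing `charts` by `spotify_2022_global` should give one bar per song rather than one bar per chart entry.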
I am at the final stages of a project where I have been comparing the appraisal price with the sold price of different properties. The complete code for data collection and tidying is below.
At this stage I am looking at different ways to visualize my data. However, I am quite new to this, so my question is whether anyone has any "new" or special ways of visualizing data that they find useful or intuitive. I have given a couple of examples of what I am able to visualize now using ggplot.
Additionally: at the moment my visualizations plot all 1275 observations every time. I would also like to visualize the data using the mean and median of the Percentage, Sold and Tax variables, which are the ones I am most interested in. For example, I would like to visualize the mean value of the Percentage column by year.
Appreciate any help!
Complete code:
#Step 1: Load needed library
library(tidyverse)
library(rvest)
library(jsonlite)
library(stringi)
library(dplyr)
library(data.table)
library(ggplot2)
#Step 2: Access the URL of where the data is located
url <- "https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/10/"
#Step 3: Direct JSON as format of data in URL
data <- jsonlite::fromJSON(url, flatten = TRUE)
#Step 4: Access all items in API
totalItems <- data$TotalNumberOfItems
#Step 5: Summarize all data from API
allData <- paste0('https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/', totalItems,'/') %>%
jsonlite::fromJSON(., flatten = TRUE) %>%
.[1] %>%
as.data.frame() %>%
rename_with(~str_replace(., "ListItems.", ""), everything())
#Step 6: Remove columns not needed
allData <- allData[, -c(1,4,8,9,11,12,13,14,15)]
#Step 7: remove whitespace and change to numeric in columns SoldAmount and Tax
#https://stackoverflow.com/questions/71440696/r-warning-argument-is-not-an-atomic-vector-when-attempting-to-remove-whites/71440806#71440806
allData[c("Tax", "SoldAmount")] <- lapply(allData[c("Tax", "SoldAmount")], function(z) as.numeric(gsub(" ", "", z)))
#Step 8: Remove rows where value is NA
#https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame
alldata <- allData %>%
filter(across(where(is.numeric),
~ !is.na(.)))
#Step 9: Remove values below 10000 NOK on SoldAmount and Tax.
alldata <- alldata %>%
filter_all(any_vars(is.numeric(.) & . > 10000))
#Step 10: Calculate percentage change between tax and sold amount and create new column with percent change
#df %>% mutate(Percentage = number/sum(number))
alldata_Percent <- alldata %>% mutate(Percentage = (SoldAmount-Tax)/Tax)
Visualization
# Plot Percentage difference based on County
ggplot(data=alldata_Percent,mapping = aes(x = Percentage, y = County)) +
geom_point(size = 1.5)
#Plot County with both Date and Percentage difference
theme_set(new = ggthemes::theme_economist())
p <- ggplot(data = alldata_Percent,
mapping = aes(x = Date, y = Percentage, colour = County)) +
geom_line(na.rm = TRUE) +
geom_point(na.rm = TRUE)
p
As a newbie to network analysis, I am struggling to transform an event-level dataset I want to plot into the correct shape. I am grateful for any hints, leads, etc. What I have done so far broadly follows this introduction.
The dataset in question contains events organized by the political party Jobbik. Each event defined by a unique id (id) has associated organizational sponsors (org_names) and their type (org). There is no hierarchy between org_1, org_2, or org_names1 and org_names2.
Originally the dataset comes in a wide format. Although I am not sure if this is what I should be doing, the first step I take is to transform the data into a long format and clean the strings a bit. This is the code for reading in the data and getting it into a long format:
jobbik <- read.csv("http://eborbath.github.io/stackoverflow/jobbik.csv")
library(tidyverse)
library(stringr)
library(igraph)
# long format
jobbik <- reshape(as.data.frame(jobbik), dir='long',
varying=list(c(3:13), c(14:24)),
v.names=c('org_names', 'org'), times = c(as.character(seq(1:11))))
jobbik$org <- str_trim(jobbik$org, side="both")
jobbik$org_names <- str_trim(jobbik$org_names, side="both")
jobbik <- jobbik %>%
filter(!(org=="no other organizer" & org_names=="")) %>%
filter(!(org=="JOBBIK" & org_names %in% c("Jobbik",
"Jobbik Magyarországért Mozgalom",
"",
"JObbik",
"jobbik",
"aktivisté Jobbiku",
"a Jobbik"))) %>%
mutate(org_names=ifelse(org_names=="", org, org_names)) %>%
distinct(.)
In the next step I want to create the network dataset. To do so, I calculate the number of times each unique organization has been involved in events with Jobbik. I then add Jobbik as one side of each edge and plot the data with igraph:
network <- jobbik %>%
select(id, org_names) %>%
group_by(org_names) %>%
summarise(weight = n()) %>%
ungroup() %>%
mutate(from=1,
org_names=as.factor(org_names)) %>%
mutate(org_id=as.numeric(factor(org_names)))
edges <- network %>% select(from, org_id, weight)
nodes <- network %>% select(org_id, org_names) %>%
mutate(org_names=as.character(org_names))
routes_igraph <- graph_from_data_frame(d = edges, vertices = nodes, directed = FALSE)
plot(routes_igraph, layout = layout_with_graphopt)
While this runs and creates the network, it only gets me the relationship between each unique organization and Jobbik, but not the relationships among these organizations that do not involve Jobbik. I realize that the error is in the data transformation and that I should use the event-level information to calculate the number of times each pair of organizations has been involved in organizing something together, then plot that data. Unfortunately, I don't know how to get there. I am grateful for any help.
I am not exactly an expert in network analysis, or in igraph in particular, but I think something like this might be helpful.
I changed the preprocessing part of your analysis because I ran into a few complications along the way:
Encoding of the Hungarian text: it took some time to find the right encoding (see locale = locale(encoding = 'cp1250') in the read_csv call);
After gathering, I renamed org_names* to org and org* to type;
I use chop to make the spread -> unnest step easier;
I tried to make the filter call shorter, but without much success;
I use stringr::str_to_title() to unify the org variable, because some names differ only in whether the nth word of the name is capitalized;
I use coalesce to fill NAs in the org variable with values from the type variable.
library(tidyverse)
library(magrittr)
library(igraph)
jobbik <- read_csv(
"http://eborbath.github.io/stackoverflow/jobbik.csv",
trim_ws = T,
locale = locale(encoding = 'cp1250')
)
jobbik %<>%
gather('key', 'val', -c('id', 'date')) %>%
mutate(
key = case_when(
grepl('^org_names\\d+$', key) ~ 'org',
grepl('^org\\d+$', key) ~ 'type',
TRUE ~ key
)
) %>%
chop(val) %>%
spread(key, val) %>%
unnest(c(org, type)) %>%
filter(
!(is.na(org) & (type == 'no other organizer')) &
!((is.na(org) | grepl('.*jobbik.*', org, T )) & (type == 'JOBBIK'))
) %>%
mutate(org = str_to_title(coalesce(org, type)))
To form the data frame of graph edges, I group by the id of the event, filter out all events that were supported by only one organization (so there is no connection with other organizations), and finally create pairs of organizations within each id with the combn function. The result is a character vector of the form Org A-Org B, which, after unnesting, I separate into two columns, from and to, using - as the separator (which is potentially dangerous if the name of an organization contains a - symbol). I also filter out all self-loops, if any. The last operation is count, which calculates how frequently each individual pair appears across the list of Jobbik events. I name the result width because, when plotting, igraph's plot method will use it as the width of the edges.
ed <- jobbik %>%
group_by(id) %>%
filter(n() > 1) %>%
summarise(edge = list(combn(org, 2, paste, collapse = '-'))) %>%
unnest(edge) %>%
separate(edge, into = c('from', 'to'), sep = '-') %>%
filter(from != to) %>%
count(from, to, name = 'width')
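The caveat about splitting on - can be sidestepped by never pasting the pair into a single string. The following is only a sketch of that variation on made-up data (the `events` data frame and its organization names are hypothetical): combn() is asked for the pairs as a matrix, which is transposed into a two-column data frame and unnested.

```r
library(dplyr)
library(tidyr)

# Hypothetical events; note the hyphen in "A-Team"
events <- data.frame(
  id  = c(1, 1, 1, 2, 2, 3),
  org = c("A-Team", "B", "C", "A-Team", "B", "C"),
  stringsAsFactors = FALSE
)

pairs <- events %>%
  distinct(id, org) %>%          # one row per organization and event
  group_by(id) %>%
  filter(n() > 1) %>%            # events with a single organizer give no pairs
  summarise(
    # combn() returns a 2 x k matrix; transposed, each row is one pair (V1, V2)
    pair = list(as.data.frame(t(combn(sort(org), 2)), stringsAsFactors = FALSE)),
    .groups = "drop"
  ) %>%
  unnest(pair) %>%
  rename(from = V1, to = V2) %>%
  count(from, to, name = "width")
```

Sorting the names inside each event keeps A/B and B/A from being counted as different pairs, and because distinct() is applied first no self-loops can arise.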
A similar analysis is performed for the vertices. Here I add extra information for the vertices, namely event id, date and organization type, which you could use later, plus color, which maps the number of times a given organization supported Jobbik, and some additional graphical parameters for the later plot.
nd <- jobbik %>%
filter(org %in% c(ed$from, ed$to)) %>%
group_by(name = org) %>%
summarise(
id = sprintf('Event ids: %s', paste(id, collapse = ', ')),
date = sprintf('Event dates: %s', paste(date, collapse = ', ')),
type = sprintf('Org. type: %s', paste(type, collapse = '; ')),
color = n()
) %>%
ungroup() %>%
mutate(
color = heat.colors(10)[cut(color, 10)],
frame.color = NA,
label.dist = 1,
label.cex = .5,
label.color = 'gray10'
)
With these data we can make undirected graph, using graph_from_data_frame() function:
g <- graph_from_data_frame(ed, F, nd)
vertex_attr(g, 'size') <- degree(g, mode = 'all')
In the second line above, I add the vertex attribute size to map the degree of each vertex to its size.
And finally, to plot the network, I can just do:
plot(
g,
edge.curved = .2,
layout = layout_with_kk,
asp = 1,
main = 'Jobbik interaction network'
)
First, I am not a developer and have little experience with R, so please forgive me. I have tried to get this done on my own, but have run out of ideas for filtering a data frame using the 'filter' command.
The data frame has about a dozen columns, one of which is Grp (meaning Group). This is a FIFA soccer dataset, so the Group in this context means the general position the player is in (Defense, Midfield, Goalkeeper, Forward).
I need to filter this data frame to provide me this exact information:
the Top 4 Defense Players
the Top 4 Midfield Players
the Top 2 Forwards
the Top 1 Goalkeeper
What do I mean by "Top"? It's ranked by the Growth column, which is just a number. So, Top 4 would be something like 22, 21, 21, 20 (or similar, because that number could in fact be repeated for different players). The Growth column is the difference between the Potential column and the Overall column, so again just a simple subtraction.
#Create a subset of the data frame
library(dplyr)
library(ggplot2)
fifa2 <- fifa %>% select(Club,Name,Position,Overall,Potential,Contract.Valid.Until2,Wage2,Value2,Release.Clause2,Grp) %>% arrange(Club)
#Add columns for determining potential
fifa2$Growth <- fifa2$Potential - fifa2$Overall
head(fifa2)
#Find Southampton Players
ClubName <- filter(fifa2, Club == "Southampton") %>%
group_by(Grp) %>% arrange(desc(Growth), .by_group=TRUE) %>%
top_n(4)
ClubName
ClubName2 <- ggplot(ClubName, aes(x=forcats::fct_reorder(Name, Grp),
y=Growth, fill = Grp)) +
geom_bar(stat = "identity", colour = "black") +
coord_flip() + xlab("Player Names") + ylab("Unfilled Growth Potential") +
ggtitle("Southampton Players, Grouped by Position")
ClubName2
That chart produces a list of players that ends up having the Top 4 players in each position (top_n(4)), but I need it further filtered per the logic I described above. How can I achieve this? I tried fooling around with dplyr and that is fairly easy to get rows by Grp name, but don't see how to filter it to the 4-4-2-1 that I need. Any help appreciated.
Sample output from fifa2 and ClubName (which shows the data sorted by top_n(4)): [image: fifa2_Dataset]
This might not be the most elegant solution, but hopefully it works :)
# create dummy data
data_test = data.frame(grp = sample(c("def", "mid", "goal", "front"), 30, replace = T), growth = rnorm(30, 100,10), stringsAsFactors = F)
# create referencetable to give the number of players needed per grp
desired_n = data.frame(grp = c("def", "mid", "goal", "front"), top_n_desired = c(4,4,1,2), stringsAsFactors = F)
# > desired_n
# grp top_n_desired
# 1 def 4
# 2 mid 4
# 3 goal 1
# 4 front 2
# group and arrange, then look up the desired number of players in the reference table and select them.
data_test %>% group_by(grp) %>% arrange(desc(growth)) %>%
slice(1:desired_n$top_n_desired[which(first(grp) == desired_n$grp)]) %>%
arrange(grp)
# A bit more readable, but you have to create an additional column in your dataframe
# create additional column with desired amount for the position written in grp of each player
data_test = merge(data_test, desired_n, by = "grp", all.x = T)
data_test %>% group_by(grp) %>% arrange(desc(growth)) %>%
slice(1:first(top_n_desired)) %>%
arrange(grp)
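As a variation on the same lookup idea, here is a sketch that keeps the desired counts in a named vector and picks the top rows per group with slice_max(). The `players` data and the `quota` vector are made up for illustration; with the real data the grouping column would be Grp and the ranking column Growth.

```r
library(dplyr)

# Hypothetical players, five per position group
players <- data.frame(
  grp    = rep(c("def", "mid", "front", "goal"), each = 5),
  growth = c( 5,  9,  7,  3,  8,
              4,  6,  2, 10,  1,
             12, 11, 15, 13, 14,
             20, 18, 19, 17, 16),
  stringsAsFactors = FALSE
)

# Desired number of players per group (the 4-4-2-1 from the question)
quota <- c(def = 4, mid = 4, front = 2, goal = 1)

top_players <- players %>%
  group_by(grp) %>%
  # .x is the data of one group, .y a one-row tibble holding its key
  group_modify(~ slice_max(.x, growth, n = quota[[.y$grp]], with_ties = FALSE)) %>%
  ungroup()
```

with_ties = FALSE forces exactly the requested number of rows per group. Dropping it would keep all players tied on the cut-off value, which may or may not be what is wanted.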
I am trying to use the 'Synth' package in R to explore the effect that certain coups had on economic growth in the countries where they occurred, but I'm hung up on an error I can't understand. When I attempt to run dataprep(), I get the following:
Error in dataprep(foo = World, predictors = c("rgdpe.pc", "population.ln", :
unit.variable not found as numeric variable in foo.
That's puzzling because my data frame, World, does include a numeric id called "idno" as specified in the call to dataprep().
Here is the script I'm using. It ingests a .csv with the requisite data from GitHub. The final step --- the call to dataprep() --- is where the error arises. I would appreciate help in figuring out why this error arises and how to avoid it so I can get on to the synth() part to follow.
library(dplyr)
library(Synth)
# DATA INGESTION AND TRANSFORMATION
World <- read.csv("https://raw.githubusercontent.com/ulfelder/coups-and-growth/master/data.raw.csv", stringsAsFactors=FALSE)
World$rgdpe.pc = World$rgdpe/World$pop # create per capita version of GDP (PPP)
World$idno = as.numeric(as.factor(World$country)) # create numeric country id
World$population.ln = log(World$population/1000) # population size in 1000s, logged
World$trade.ln = log(World$trade) # trade as % of GDP, logged
World$civtot.ln = log1p(World$civtot) # civil conflict scale, +1 and logged
World$durable.ln = log1p(World$durable) # political stability, +1 and logged
World$polscore = with(World, ifelse(polity >= -10, polity, NA)) # create version of Polity score that's missing for -66, -77, and -88
World <- World %>% # create clocks counting years since last coup (attempt) or 1950, whichever is most recent
arrange(countrycode, year) %>%
mutate(cpt.succ.d = ifelse(cpt.succ.n > 0, 1, 0),
cpt.any.d = ifelse(cpt.succ.n > 0 | cpt.fail.n > 0, 1, 0)) %>%
group_by(countrycode, idx = cumsum(cpt.succ.d == 1L)) %>%
mutate(cpt.succ.clock = row_number()) %>%
ungroup() %>%
select(-idx) %>%
group_by(countrycode, idx = cumsum(cpt.any.d == 1L)) %>%
mutate(cpt.any.clock = row_number()) %>%
ungroup() %>%
select(-idx) %>%
mutate(cpt.succ.clock.ln = log1p(cpt.succ.clock), # include +1 log versions
cpt.any.clock.ln = log1p(cpt.any.clock))
# THAILAND 2006
THI.coup.year = 2006
THI.years = seq(THI.coup.year - 5, THI.coup.year + 5)
# Get names of countries that had no coup attempts during window analysis will cover. If you wanted to restrict the comparison to a
# specific region or in any other categorical way, this would be the place to do that as well.
THI.controls <- World %>%
filter(year >= min(THI.years) & year <= max(THI.years)) %>% # filter to desired years
group_by(idno) %>% # organize by country
summarise(coup.ever = sum(cpt.any.d)) %>% # get counts by country of years with coup attempts during that period
filter(coup.ever==0) %>% # keep only the ones with 0 counts
select(idno) # cut down to country ids
THI.controls = unlist(THI.controls) # convert that data frame to a vector
names(THI.controls) = NULL # strip the vector of names
THI.synth.dat <- dataprep(
foo = World,
predictors = c("rgdpe.pc", "population.ln", "trade.ln", "fcf", "govfce", "energy.gni", "polscore", "durable.ln", "cpt.any.clock.ln", "civtot.ln"),
predictors.op = "mean",
time.predictors.prior = seq(from = min(THI.years), to = THI.coup.year - 1),
dependent = "rgdpe.pc",
unit.variable = "idno",
unit.names.variable = "country",
time.variable = "year",
treatment.identifier = unique(World$idno[World$country=="Thailand"]),
controls.identifier = THI.controls,
time.optimize.ssr = seq(from = THI.coup.year, to = max(THI.years)),
time.plot = THI.years
)
Too long for a comment.
Your dplyr statement:
World <- World %>% ...
converts World from a data.frame to a tbl_df object (read the docs on dplyr). Unfortunately, this causes mode(World[,"idno"]) to return list, not numeric and the test for numeric unit.variable fails.
You can fix this by using
World <- as.data.frame(World)
just before the call to dataprep(...).
Unfortunately (again) you now get a different error which may be due to the logic of your dplyr statement.
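The difference in subsetting behaviour can be reproduced on a tiny made-up data frame (the `df`, `tb` and `back` objects below are purely illustrative): single-column `[` subsetting drops to a vector for a data.frame but returns a one-column tibble for a tbl_df.

```r
library(tibble)

# Hypothetical two-column data frame standing in for World
df <- data.frame(idno = c(1, 2, 3), year = c(2001, 2002, 2003))
tb <- as_tibble(df)

mode_df <- mode(df[, "idno"])      # data.frame drops to a numeric vector
mode_tb <- mode(tb[, "idno"])      # tibble stays a one-column tibble, i.e. a list

# Converting back restores the data.frame behaviour
back <- as.data.frame(tb)
mode_back <- mode(back[, "idno"])
```

Here `mode_df` and `mode_back` are "numeric" while `mode_tb` is "list", which is the check that trips up dataprep().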