How to merge two datasets in R?

I am trying to merge two datasets in R. The first is a cross-national longitudinal dataset of democracy scores and inequality levels for countries over hundreds of years (15,034 observations, dat_as). The second is a cross-national longitudinal dataset of whether a given country in a given year has a legislature (27,192 observations, dat_vdem). I want to attach the legislature data to the inequality data. The goal is a final df with the same number of observations as dat_as (15,034): if there is a match, merge the data; if there is no match, insert NA for that row. Nothing I have tried in R works. For example, using this code I get a df with 2,558,975 observations:
# load data
dat_as <- read.csv("as.csv")
dat_vdem <- read.csv("vdem.csv")
# merge
test_df <- merge(dat_as, dat_vdem, by = c("code"))
Using this code, however, I get a df with 13,355 observations.
test_df <- merge(dat_as, dat_vdem, by = c("country", "year"))
What am I doing wrong? Any help would be appreciated. Below are reproducible data.
Here is the dat_as:
structure(list(X = 1:6, country = c("United States", "United States",
"United States", "United States", "United States", "United States"
), year = 1800:1805, scode = c("USA", "USA", "USA", "USA", "USA",
"USA"), code = c("USA", "USA", "USA", "USA", "USA", "USA"), democracy = c(1L,
1L, 1L, 1L, 1L, 1L), lagdemocracy = c(NA, 1L, 1L, 1L, 1L, 1L),
lbmginiint = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), lbmgdppint = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), ldemlbmginiint = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), ldemlbmgdppint = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), yearsq = c(3240000,
3243601, 3247204, 3250809, 3254416, 3258025), legislature = c(NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
Here is the dat_vdem:
structure(list(X = 1:6, year = 1800:1805, country = c("United States", "United States", "United States", "United States", "United States", "United States"), code = c("USA",
"USA", "USA", "USA", "USA", "USA"), v2lgbicam = c(0L, 0L, 0L,
0L, 0L, 0L), v2lgqstexp = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), v2lgotovst = c(-2.1, -2.1, -2.1, -2.1, -2.1,
-2.1), v2lginvstp = c(-2.05, -2.05, -2.05, -2.05, -2.05, -2.05
), legislature = c(0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA,
6L), class = "data.frame")

You're describing a left join. The way I find easiest is to use dplyr:
dplyr::left_join(dat_as, dat_vdem)
By default it guesses the key variables by matching on every column name the two data frames share. With the sample data you provided, it matched by "X", "country", "year", "code", "legislature". You can specify the keys explicitly if need be, for example by = c("country", "year").
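A minimal sketch of the left join with explicit keys, using toy stand-ins for dat_as and dat_vdem (the rows and the country "Freedonia" are invented for illustration). It assumes each country-year pair appears at most once in the right-hand table; duplicated keys there would still add rows.

```r
library(dplyr)

# Toy stand-ins for the real data; all values are made up for illustration
dat_as <- data.frame(
  country   = c("United States", "United States", "Freedonia"),
  year      = c(1800L, 1801L, 1800L),
  democracy = c(1L, 1L, 0L)
)
dat_vdem <- data.frame(
  country     = c("United States", "United States", "United States"),
  year        = c(1800L, 1801L, 1802L),
  legislature = c(0L, 0L, 0L)
)

# Keep every row of dat_as; unmatched rows get NA in the columns from dat_vdem
merged <- left_join(dat_as, dat_vdem, by = c("country", "year"))

nrow(merged) == nrow(dat_as)  # the row count of the left table is preserved
```

With the real data, shared non-key columns such as X, code and legislature would come back with .x and .y suffixes unless they are dropped from one side before joining.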

Related

Pivot_wider is giving a lot of N/A values when reshaping

I am trying to use the pivot_wider function on the PARAMETER column to spread its unique values into columns; however, when I do that, it gives me a bunch of NA values. Below is the code I am using so far, but it produces the result in the picture below. I have attempted a lot of na.omit-related functions, which just removed all rows.
pivot_wider(names_from = PARAMETER,
values_from = Month_Average)
I am trying to get it in the below format: where everything is on one row.
Year  Month  LAT   LON  Temperature  Humidity  wind_10_meters  wind_50_meters  precipitation
1990  Sep    25.5  -90  95           24        8               8               .5
1991  Oct    25.5  -90  89           20        8               4               1
These aren't accurate numbers, but I want all the information for a given year and month to show in one row. Below I have provided the data that I am working with.
Here is what dput() gave me. I ran it on head() since the full output was really long.
structure(list(PARAMETER = c("PS", "PS", "PS", "PS", "PS", "PS"
), YEAR = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L), LAT = c(35.25,
35.25, 35.25, 35.25, 35.25, 35.25), LON = c(-71.75, -71.75, -71.75,
-71.75, -71.75, -71.75), ANN = c(101.91, 101.91, 101.91, 101.91,
101.91, 101.91), MONTH = c("NOV", "JAN", "FEB", "MAR", "APR",
"MAY"), Month_Average = c(101.9, 102.01, 102.22, 102.36, 101.87,
101.63)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
dput of the data after running my pivot_wider:
structure(list(YEAR = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L
), LAT = c(35.25, 35.25, 35.25, 35.25, 35.25, 35.25), LON = c(-71.75,
-71.75, -71.75, -71.75, -71.75, -71.75), ANN = c(101.91, 101.91,
101.91, 101.91, 101.91, 101.91), MONTH = c("NOV", "JAN", "FEB",
"MAR", "APR", "MAY"), PS = c(101.9, 102.01, 102.22, 102.36, 101.87,
101.63), T2M = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), RH2M = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), WS10M = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), WS50M = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), PRECTOTCORR = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
Adding Parameter test
structure(list(YEAR = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L
), MONTH = c("APR", "APR", "APR", "APR", "APR", "APR"), LAT = c(35.25,
35.25, 35.25, 35.25, 35.25, 35.25), LON = c(-78.75, -78.75, -78.75,
-78.75, -78.75, -78.75), ANN = c(2.93, 3.42, 5.39, 16.89, 75.28,
101.13), number_of_parameters = c(1L, 1L, 1L, 1L, 1L, 1L)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
YEAR = 1990L, MONTH = "APR", LAT = 35.25, LON = -78.75, .rows = structure(list(
1:6), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L), .drop = TRUE))
Please let me know if I need to include any more information!
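One possible cause, offered as an assumption rather than a confirmed diagnosis: pivot_wider treats every column not named in names_from or values_from as an identifier, so a column like ANN that differs per PARAMETER keeps each parameter on its own row and fills the other columns with NA. A minimal sketch that names the identifier columns explicitly with id_cols; the T2M rows and all numbers except the PS values are invented for illustration.

```r
library(tidyr)

# Toy long-format data: two parameters, two months (T2M values are made up)
long <- data.frame(
  PARAMETER     = c("PS", "T2M", "PS", "T2M"),
  YEAR          = 1990L,
  LAT           = 35.25,
  LON           = -71.75,
  ANN           = c(101.91, 15.2, 101.91, 15.2),  # differs per parameter
  MONTH         = c("JAN", "JAN", "FEB", "FEB"),
  Month_Average = c(102.01, 8.1, 102.22, 9.4)
)

# Only YEAR, MONTH, LAT and LON identify a row; ANN is left out of the result
wide <- pivot_wider(
  long,
  id_cols     = c(YEAR, MONTH, LAT, LON),
  names_from  = PARAMETER,
  values_from = Month_Average
)
```

Here wide has one row per year and month, with PS and T2M side by side. If ANN is needed, it would have to be pivoted as well or joined back separately.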

Exporting to GPX file

When I import a .gpx file from my navigation software into R using the gpx package, it arrives in this format:
dput(test)
list(routes = list(structure(list(Elevation = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), Time = structure(c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Latitude = c(50.76098, 50.76327, 50.766489,
50.771325, 50.771792, 50.773814, 50.774321, 50.774669, 50.775666,
50.774327), Longitude = c(-1.322124, -1.32737, -1.324514, -1.316833,
-1.314606, -1.300727, -1.294736, -1.290568, -1.27571, -1.263494
), extensions = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-10L))), tracks = list(structure(list(Elevation = logical(0),
Time = logical(0), Latitude = logical(0), Longitude = logical(0)), class = "data.frame", row.names = integer(0))),
waypoints = list(structure(list(Elevation = logical(0), Time = logical(0),
Latitude = logical(0), Longitude = logical(0)), class = "data.frame", row.names = integer(0))))
What I want to do is export a gpx back to the same software. I have tried using the pgirmess package, but it seemingly no longer supports the writeGPX command.
Then I tried to adapt the answer given in the question "R Convert GPS to GPX with timestamp",
but I am not looking to export a track but a route for people to follow, so the timestamp is not relevant.
I also tried using the rgdal package as shown below.
my data
dput(routes)
list(structure(list(Elevation = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Latitude = c(50.768333, 50.771833,
50.7735, 50.769167, 50.77, 50.769167), Longitude = c(-1.307167,
-1.295833, -1.292667, -1.286667, -1.295833, -1.2775), Name = c("3X Donna",
"3Z Trinity House Buoy", "33 Prince Consort", "34 Cowes Corinthian",
"39 Snowden", "4K Royal London YC")), class = "data.frame", row.names = c(53L,
55L, 58L, 59L, 60L, 70L)))
Then I run the code below, but still no joy:
library(rgdal)
coordinates(routes) <- ~Latitude+Longitude
proj4string(routes) <- "+proj=longlat +datum=WGS84"
writeOGR(routes, dsn = "routes.gpx", layer = "routes", driver = "GPX",
dataset_options = "GPX_USE_EXTENSIONS=yes")
I would like ideally to export a route not a track.
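Since a GPX route is just a small XML document (an rte element containing rtept points), one option is to write it directly with base R. The sketch below is an assumption-laden illustration, not a tested replacement for writeGPX: write_gpx_route is a hypothetical helper, the creator string and route name are arbitrary, and whether the navigation software accepts the result is untested. The points are the first three rows of the routes data above.

```r
# Hypothetical helper (not from any package): write points as a GPX 1.1 route
write_gpx_route <- function(points, file, route_name = "route") {
  # Escape the characters that are not allowed in XML text
  esc <- function(s) {
    s <- gsub("&", "&amp;", s, fixed = TRUE)
    s <- gsub("<", "&lt;", s, fixed = TRUE)
    gsub(">", "&gt;", s, fixed = TRUE)
  }
  # One <rtept> element per row; sprintf() is vectorised over the columns
  rtept <- sprintf('    <rtept lat="%.6f" lon="%.6f"><name>%s</name></rtept>',
                   points$Latitude, points$Longitude, esc(points$Name))
  gpx <- c(
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<gpx version="1.1" creator="R" xmlns="http://www.topografix.com/GPX/1/1">',
    '  <rte>',
    sprintf('    <name>%s</name>', esc(route_name)),
    rtept,
    '  </rte>',
    '</gpx>'
  )
  writeLines(gpx, file)
  invisible(file)
}

# First three points of the route from the question
pts <- data.frame(
  Latitude  = c(50.768333, 50.771833, 50.7735),
  Longitude = c(-1.307167, -1.295833, -1.292667),
  Name      = c("3X Donna", "3Z Trinity House Buoy", "33 Prince Consort")
)

out_file <- file.path(tempdir(), "routes.gpx")
write_gpx_route(pts, out_file, route_name = "Solent route")
```

Note that routes in the question is a list holding one data frame, so the real call would pass routes[[1]] rather than routes.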

ggplot with tryCatch: want blank plot if there's an error during expression

Some data:
x %>% dput
structure(list(date = structure(c(18782, 18783, 18784, 18785,
18786, 18787, 18789, 18791, 18792, 18793, 18795, 18797, 18798,
18799, 18801, 18803, 18805, 18806), class = "Date"), `Expired Trials` = c(3L,
1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), `Trial Sign Ups` = c(3L, 1L, 1L, 2L, 3L, 4L, 1L, 1L, 1L,
1L, 2L, 1L, 3L, 2L, 2L, 1L, 1L, 1L), `Total Site Conversions` = c(3,
1, 1, 2, 3, 4, 1, 1, 1, 1, 2, 1, 3, 2, 2, 1, 1, 1), `Site Conversion Rate` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `Trial to Paid Conversion Rate` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_)), row.names = c(NA, -18L), class = c("tbl_df",
"tbl", "data.frame"))
The context is a shiny app where the field 'Sessions' sometimes exists and sometimes doesn't, depending on the user's selections. Rather than display the red warning message, I just want nothing or a blank plot shown instead of an error message:
x %>%
ggplot(aes(date, Sessions)) +
geom_col(na.rm = T) +
geom_line(aes(y = `Site Conversion Rate`), na.rm = T)
Error in FUN(X[[i]], ...) : object 'Sessions' not found
Tried:
tryCatch(expr = {x %>%
ggplot(aes(date, Sessions)) +
geom_col(na.rm = T) +
geom_line(aes(y = `Site Conversion Rate`), na.rm = T)
},
error = function(e) {message(''); print(e)},
finally = {ggplot() + theme_void()})
But, this still spits out the error, wanted/expected a blank plot instead.
How can I do this?

Consider using an if/else expression with all(), i.e. plot only if all the column names used in the plot are present, or else return a blank plot:
nm1 <- c("date", "Sessions", "Site Conversion Rate")
if(!all(nm1 %in% names(x))) {
message("Not all columns are found")
ggplot()
} else {x %>%
ggplot(aes(date, Sessions)) +
geom_col(na.rm = TRUE) +
geom_line(aes(y = `Site Conversion Rate`), na.rm = TRUE)}
Another option is purrr's possibly(), specifying otherwise:
library(purrr)
f1 <- function(x) {
p1 <- x %>%
ggplot(aes(date, Sessions)) +
geom_col(na.rm = TRUE) +
geom_line(aes(y = `Site Conversion Rate`), na.rm = TRUE)
print(p1)
}
f1p <- possibly(f1, otherwise = ggplot())
-testing
f1p(x)
-output
Or a modification of the OP's tryCatch
tryCatch(expr = {print(x %>%
ggplot(aes(date, Sessions)) +
geom_col(na.rm = T) +
geom_line(aes(y = `Site Conversion Rate`), na.rm = TRUE))
},
error = function(e) {message(''); print(e)},
finally = {
ggplot() +
theme_void()
})
<simpleError in FUN(X[[i]], ...): object 'Sessions' not found>
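A ggplot object is only evaluated when it is printed or built, which is why the plot is wrapped in print() above. Also, tryCatch returns the value of the expression or of the error handler, while the value of finally is discarded. A sketch of a variant in which the handler itself returns the blank plot; safe_plot is a hypothetical helper name, the data are made up, and ggplot_build() is used here only to force evaluation inside tryCatch.

```r
library(ggplot2)

# Hypothetical helper: return the real plot, or a blank one if it cannot be built
safe_plot <- function(df) {
  tryCatch({
    p <- ggplot(df, aes(date, Sessions)) +
      geom_col(na.rm = TRUE)
    ggplot_build(p)  # forces evaluation, so a missing column errors here
    p
  },
  error = function(e) ggplot() + theme_void())  # the handler's value is returned
}

# Made-up data: one frame without 'Sessions', one with it
no_sessions   <- data.frame(date = as.Date("2021-06-04") + 0:2)
with_sessions <- data.frame(date = as.Date("2021-06-04") + 0:2,
                            Sessions = c(5, 3, 8))

blank <- safe_plot(no_sessions)    # blank plot, no error shown
ok    <- safe_plot(with_sessions)  # the regular column chart
```

In a shiny app the returned object could then be handed to renderPlot() as usual.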

Save complicated plot() to object

I have a series of commands that create a vibration of effects plot. Now I want to assign the plot to an object (to later make it downloadable via Shiny). However, that does not seem possible. When I try to save the plot to an object, the object is NULL, and likewise, if I try to save it, I get an empty .png file.
See below for the function and some example data.
#some packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(MASS, tidyverse, ggplot2, dplyr, shiny, here, BayesFactor, ggpubr, effsize, DescTools, rqPen)
#plot of p value vs effect size vibration plot
#https://figshare.com/articles/Code_data_and_analysis_script_for_A_Traveler_s_Guide_to_the_Multiverse_Promises_Pitfalls_and_a_Framework_for_the_Evaluation_of_Analytic_Decisions_/12089736 main source
multiverse.vibration <- function(effsize, statistic, alpha = 0.05, threshold = 6, type = c("frequentist")){
#assign colours schemes
point.color <- rgb(0,76,153, alpha=80, maxColorValue=255)
contour.color = rgb(60,130,180, alpha=130, maxColorValue=255)
#vibrations
vibrations <- kde2d(effsize, -log10(statistic), n=50)
if (type == "frequentist"){
#do the plotting.
plot(effsize, -log10(statistic), type="n", las=1, xlab=expression(paste("Effect size")), ylab=expression(paste("-log"[10],"(",italic("p"),"-value)")), main="", cex.lab=1.35, cex.axis=1.2 ) ####the label of the y axis gets cut off by the picture for no reason whatsoever####
#add quantile lines
abline(v=as.numeric(quantile(effsize, probs=0.5)), lty=3, lwd=1.8, col="gray70")
abline(h=-log10(as.numeric(quantile(statistic, probs=0.5))), lty=3, lwd=1.8, col="gray70")
#add data points
points(effsize, -log10(statistic), pch=16, col=point.color, cex=1.5)
#add "vibrations"
contour(vibrations, drawlabels=FALSE, nlevels=5, lwd=1.7, col=contour.color, add=TRUE)
text(as.numeric(quantile(effsize, probs=0.5)), max(-log10(statistic)), "50", pos=2, col="gray40", cex=1)
text(max(effsize), -log10(as.numeric(quantile(statistic, probs=0.5))), "50", pos=3, col="gray40", cex=1)
#add alpha line and label
abline(h=-log10(alpha), lty=3, lwd=1.5, col="red")
text(min(effsize), -log10(alpha), expression(paste(alpha)), pos = 1, cex = 1, col = "red")
}
#...function simplified
}
#and below some data
df_multiverse <- structure(list(transformation = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("square",
"squareroot"), class = "factor"), datatrimming = structure(c(2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("mad",
"notrimming"), class = "factor"), fixedtrimming = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "nofixedtrimming", class = "factor"),
min = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), max = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), DispersionMeasure = c(NA,
2, 2.5, 3, 3.5, 4, 4.5, 5, NA, 2, 2.5, 3, 3.5, 4, 4.5, 5),
NumberOfTrials = c(2481, 2017, 2089, 2152, 2202, 2235, 2271,
2292, 2481, 2017, 2089, 2152, 2202, 2235, 2271, 2292), df = c(21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21
), t.value = c(0.834352731211477, -1.89143806501942, -2.06164045172582,
-2.29402139720537, -2.20170894686594, -1.30874979649765,
-1.46580636234517, -0.933033039387291, -0.381340656529586,
-2.65553835404059, -2.70367808996487, -2.88191068442976,
-2.89698876130645, -2.31203065738409, -2.40524937843272,
-1.99997820996895), p.value = c(0.413473232348569, 0.0724397922282673,
0.0518359697127152, 0.0322027617938105, 0.0390026786336539,
0.204761347160827, 0.157515139319996, 0.361407402521166,
0.706781450011369, 0.0147953018060795, 0.013300947944711,
0.00892256290108781, 0.0086233125398353, 0.0310102245266004,
0.0254623057912856, 0.0586025361696588), estimate = c(0.0513517727014905,
-0.138440596771433, -0.152826845040145, -0.172473124495872,
-0.150035258885051, -0.106059860414446, -0.0904972867538278,
-0.0636909905658258, -0.0224006885730891, -0.132591874705722,
-0.141473579509691, -0.162307800901886, -0.156924178280938,
-0.138723145332572, -0.124862443444392, -0.109932966289113
)), row.names = c("df", "df1", "df2", "df3", "df4", "df5",
"df6", "df7", "df8", "df9", "df10", "df11", "df12", "df13", "df14",
"df15"), class = "data.frame")
#and below a call
object <- multiverse.vibration(df_multiverse$estimate, df_multiverse$p.value, type = "frequentist")
#Now I try to save it
svg(file = "Figure 1.svg", width = 9, height = 9, antialias = "gray")
object
dev.off()
#empty file, does not save plot.
My goal is to save the plot to an object in a way that later allows me to download the object via some command.
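Base graphics calls such as plot() draw on the active device and return NULL, so assigning the function's result captures nothing. Two common workarounds are sketched below under stated assumptions: draw_plot is a hypothetical stand-in for multiverse.vibration(), and pdf() is used as the file device only because it is available everywhere; svg() or png() could be substituted.

```r
# Hypothetical stand-in for any function that draws with base graphics
draw_plot <- function() {
  plot(1:10, (1:10)^2, type = "b", main = "demo")
  abline(h = 50, lty = 3, col = "red")
}

# Option 1: call the plotting function while the file device is open
out_file <- file.path(tempdir(), "figure1.pdf")
pdf(out_file, width = 9, height = 9)
draw_plot()
invisible(dev.off())

# Option 2: record the drawing as an object and replay it later
pdf(NULL)                            # off-screen device, writes no file
dev.control(displaylist = "enable")  # needed so recordPlot() can capture the plot
draw_plot()
recorded <- recordPlot()
invisible(dev.off())

replay_file <- file.path(tempdir(), "figure1_replay.pdf")
pdf(replay_file, width = 9, height = 9)
replayPlot(recorded)
invisible(dev.off())
```

For a Shiny download, the first option maps naturally onto a downloadHandler whose content function opens the device on the supplied file, calls the plotting function, and closes the device.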

Why can't rbind append these two datasets?

I have these two datasets that I am trying to append:
data1 = structure(list(year = c(2017, 2018), flow = c("Export", "Export"
), EUR = c(4, 3.44), Home = c(3.09, 3.03), Not_reported = c(0.12,
0), USD = c(92.29, 93.04), country = c("Brazil", "Brazil"), Other = c(0.499999999999994,
0.489999999999994)), row.names = c(NA, -2L), vars = c("year",
"flow"), drop = TRUE, indices = list(0L, 1L), group_sizes = c(1L,
1L), biggest_group_size = 1L, labels = structure(list(year = c(2017,
2018), flow = c("Export", "Export")), row.names = c(NA, -2L), vars = c("year",
"flow"), drop = TRUE, indices = list(0L, 1L), group_sizes = c(1L,
1L), biggest_group_size = 1L, labels = structure(list(year = c(2017,
2018), flow = c("EXP", "EXP")), class = "data.frame", row.names = c(NA,
-2L), vars = c("year", "flow"), drop = TRUE), class = "data.frame"), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
data2 = structure(list(flow = c("Export", "Export", "Export", "Export",
"Export", "Import"), country = structure(c(6L, 6L, 6L, 6L, 6L,
6L), .Label = c("Algeria", "Argentina", "Australia", "Austria",
"Belgium", "Brazil", "Bulgaria", "Canada", "China", "Colombia",
"Cyprus", "Czech Republic", "Denmark", "Estonia", "Euro", "Finland",
"France", "Germany", "Greece", "Hungary", "Iceland", "India",
"Indonesia", "Ireland", "Israel", "Italy", "Japan", "Latvia",
"Lithuania", "Luxembourg", "Malaysia", "Malta", "Morocco", "Netherlands",
"Pakistan", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia",
"South Africa", "South Korea", "Spain", "Sweden", "Switzerland",
"Thailand", "Ukraine", "United Kingdom", "United States"), class = "factor"),
year = c(2007, 2008, 2009, 2010, 2011, 2007), EUR = c(4.76,
4.95, 4.51, 4.28, 3.8, 11.1), Home = c(0.13, 0.16, 1.11,
0.82, NA, 0.48), Not_reported = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), USD = c(94.7, 94.4, 93.8,
94.3, 94.5, 85.5), Other = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_)), row.names = c(NA, 6L), class = "data.frame")
When I tried:
rbind(data1, data2)
I got a list instead of a dataframe. I have checked the class of each column and they seem consistent with each other. Can someone explain why? Thanks!
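One plausible reading, not confirmed by the source, is that the extra classes on data1 (it is a grouped_df, while data2 is a plain data.frame) keep rbind() from dispatching to its data frame method. A sketch of a workaround that strips those classes first; the toy data only mimic the class structure of the real objects, and the class assignment is there purely to imitate a grouped tibble without loading dplyr.

```r
# Toy stand-ins with a subset of the columns from the question
data1 <- data.frame(year = c(2017, 2018), flow = "Export",
                    country = "Brazil", USD = c(92.29, 93.04))
class(data1) <- c("grouped_df", "tbl_df", "tbl", "data.frame")  # imitate a grouped tibble

data2 <- data.frame(flow = "Export", country = factor("Brazil"),
                    year = c(2007, 2008), USD = c(94.7, 94.4))

# Drop the extra classes and align the column types before binding
plain1 <- as.data.frame(data1)
data2$country <- as.character(data2$country)

# rbind() on two plain data frames matches columns by name
combined <- rbind(plain1, data2)
```

With dplyr loaded, dplyr::bind_rows(dplyr::ungroup(data1), data2) is a comparable alternative.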