I have two data frames that look like this:
> df1
county state code
ANDERSON Texas 1
ANDREWS Texas 2
ANGELINA Texas 3
....
> df2
county state citations year
ANDERSON Texas 124 2011
ANDREWS Texas 32 2011
ANGELINA Texas 491 2011
....
I have tried to merge the two of these a few different ways:
merge <- full_join(df1, df2, by = c("county", "state"))
merge <- merge(df1, df2, by = c("county", "state"))
In both cases, I receive the following warning:
Warning message:
Column `county` joining factor and character vector, coercing into
character vector
The resulting data frame does not have any data for df2, even after coercing the factor into a character. I tried it again after turning the county column into a character in both data frames and still have issues.
Here are the heads of the two data frames I am attempting to merge:
> dput(head(data))
structure(list(year = c(2011L, 2011L, 2011L, 2011L, 2011L, 2011L
), month = c(1L, 1L, 1L, 1L, 1L, 1L), county = c("ANDERSON COUNTY",
"ANGELINA COUNTY", "ARANSAS COUNTY", "ATASCOSA COUNTY", "BASTROP COUNTY",
"BELL COUNTY"), state = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Montana",
"Texas"), class = "factor"), citations = c(218L, 422L, 55L, 472L,
745L, 1403L), warnings = c(521L, 711L, 124L, 1173L, 819L, 2242L
), population = c(56760L, 82812L, 24721L, 43589L, 72248L, 276975L
), d_revenue = c(-736L, -6723L, 1134L, 71L, 2308L, 852L), crashes = c(73L,
133L, 18L, 71L, 95L, 422L), density = c(55, 108.8, 91.9, 36.8,
83.5, 295.2), unemp_rate = c(8, 8.3, 9.6, 8.5, 8.5, 8), stops =
c(739L, 1133L, 179L, 1645L, 1564L, 3645L), stops_per_cap = c(0.013019732,
0.013681592, 0.007240807, 0.037738879, 0.021647658, 0.013160032
), crashes_per_cap = c(0.001286117, 0.001606047, 0.000728126,
0.001628851, 0.001314915, 0.001523603)), .Names = c("year", "month",
"county", "state", "citations", "warnings", "population", "d_revenue",
"crashes", "density", "unemp_rate", "stops", "stops_per_cap",
"crashes_per_cap"), row.names = c(NA, 6L), class = "data.frame")
> dput(head(codes))
structure(list(county = c("ANDERSON COUNTY ", "ANDREWS COUNTY ",
"ANGELINA COUNTY ", "ARANSAS COUNTY ", "ARCHER COUNTY ", "ARMSTRONG COUNTY "
), state = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Montana",
"Texas"), class = "factor"), code = 1:6), .Names = c("county",
"state", "code"), row.names = c(NA, 6L), class = "data.frame")
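Judging from the dput() output above, the county values in codes end with a trailing space ("ANDERSON COUNTY ") while those in data do not, so the keys never match exactly. Below is a minimal sketch of one possible fix, assuming the trailing whitespace is the only difference. The data frames data_small and codes_small are made-up stand-ins for data and codes; trimws() and merge() are base R.

```r
# Stand-in versions of the two data frames; only the key columns matter here.
data_small <- data.frame(
  county = c("ANDERSON COUNTY", "ANGELINA COUNTY", "ARANSAS COUNTY"),
  state = factor(c("Texas", "Texas", "Texas")),
  citations = c(218L, 422L, 55L),
  stringsAsFactors = FALSE
)
codes_small <- data.frame(
  county = c("ANDERSON COUNTY ", "ANDREWS COUNTY ",
             "ANGELINA COUNTY ", "ARANSAS COUNTY "),
  state = factor(c("Texas", "Texas", "Texas", "Texas")),
  code = 1:4,
  stringsAsFactors = FALSE
)

# Strip leading/trailing whitespace from the join key in both data frames.
data_small$county <- trimws(as.character(data_small$county))
codes_small$county <- trimws(as.character(codes_small$county))

# all = TRUE keeps counties from either side, similar to full_join().
merged <- merge(codes_small, data_small, by = c("county", "state"), all = TRUE)
merged
```

Counties that appear in only one of the two data frames (ANDREWS COUNTY here) end up with NA in the columns from the other one.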
I'm trying to change the color of my statebins map. I'm super new to R and still learning a lot, and googling around hasn't really helped.
This is what the sample looks like:
fipst stab state color
1 37 NC North Carolina 2
2 1 AL Alabama 2
3 28 MS Mississippi 2
4 5 AR Arkansas 2
5 47 TN Tennessee 2
6 45 SC South Carolina 1
7 23 ME Maine 2
49 32 NV Nevada 1
50 15 HI Hawaii 2
51 11 DC District of Columbia 2
digitaltax <- structure(list(fipst = c(37L, 1L, 28L, 5L, 47L, 45L, 23L, 32L,15L, 11L), stab = c("NC", "AL", "MS", "AR", "TN", "SC", "ME","NV", "HI", "DC"), state = c("North Carolina", "Alabama", "Mississippi","Arkansas", "Tennessee", "South Carolina", "Maine", "Nevada","Hawaii", "District of Columbia"), color = c(2L, 2L, 2L, 2L,2L, 1L, 2L, 1L, 2L, 2L)), row.names = c(1L, 2L, 3L, 4L, 5L, 6L,7L, 49L, 50L, 51L), class = "data.frame")
mutate(
digitaltax,
share = cut(color, breaks = 3, labels = c("No sales tax", "Exempts digital goods", "Taxes digital goods"))
) %>%
statebins(
value_col = "share", font_size = 2.5,
ggplot2_scale_function = scale_fill_brewer,
name = ""
) +
labs(title = "Which states tax digital products?") + theme_statebins()
This produces a map with a range of blues. How can I change the color? No matter what I've tried and found on google, it always throws this error:
Error in ggplot2_scale_function(...) : could not find function "ggplot2_scale_function"
Any help at all would be super appreciated. Thank you!
One approach is to use named RColorBrewer palettes with brewer_pal =:
library(statebins)
statebins(digitaltax,
value_col = "color",
breaks = length(unique(digitaltax$share)),
labels = unique(digitaltax$share),
brewer_pal = "Dark2") +
labs(title = "Which states tax digital products?")
Run this to see all of the available palettes:
library(RColorBrewer)
display.brewer.all()
This uses your data and most of your code. I changed DC's color to 3 so that all three categories show up.
library(dplyr)
library(ggplot2)
library(statebins)
# changes DC to be color 3
digitaltax <- structure(list(fipst = c(37L, 1L, 28L, 5L, 47L, 45L, 23L, 32L,15L, 11L), stab = c("NC", "AL", "MS", "AR", "TN", "SC", "ME","NV", "HI", "DC"), state = c("North Carolina", "Alabama", "Mississippi","Arkansas", "Tennessee", "South Carolina", "Maine", "Nevada","Hawaii", "District of Columbia"), color = c(2L, 2L, 2L, 2L,2L, 1L, 2L, 1L, 2L, 3L)), row.names = c(1L, 2L, 3L, 4L, 5L, 6L,7L, 49L, 50L, 51L), class = "data.frame")
mutate(
digitaltax,
share = cut(color, breaks = 3, labels = c("No sales tax", "Exempts digital goods", "Taxes digital goods"))
) %>%
statebins(
value_col = "share", font_size = 2.5,
ggplot2_scale_function = scale_fill_brewer,
name = ""
) +
labs(title = "Which states tax digital products?") +
theme_statebins()
Created on 2020-05-12 by the reprex package (v0.3.0)
I have the df1:
Name Y_N FIPS score1 score2
1: Alabama 0 1 2633 8
2: Alaska 0 2 382 1
3: Arizona 1 4 2695 41
4: Arkansas 1 5 2039 10
5: California 1 6 27813 524
6: Colorado 0 8 8609 133
7: Connecticut 1 9 5390 111
8: Delaware 0 10 858 3
9: Florida 1 12 14172 215
10: Georgia 1 13 9847 308
11: Hawaii 0 15 720 0
12: Idaho 1 16 845 7
I would like to perform a t-test to see whether score1 differs based on Y_N, and then plot the two groups against each other. So far I have made a boxplot.
Instead of the boxplot, I want a plot that shows all of the individual points, with a horizontal line at each group's mean and 95% confidence intervals. How is this done? I would also like to add the p-value as text in a corner of the graph.
I might try:
text(x = max(df1$Y_N)+1,
y = min(df1$score1)+20000,
labels = paste0(
"\np-value = ",
round(coef_lm[2,4],5)),
pos = 4)
But I realize that coef_lm[2,4] comes from a linear model, not from a t-test. How do I access the outputs of a t-test?
I'm not sure why you added that extra point in your code. But on your original data, you might use ggplot2 and ggpubr.
Edit
Now more like your paint drawing.
library(ggplot2)
library(ggpubr)
# mean_cl_normal needs the Hmisc package to be installed
ggplot(df1,aes(x = as.factor(Y_N), y = score1)) +
geom_jitter(position = position_jitter(0.1)) +
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = 0.3) +
stat_summary(fun = "mean", geom = "errorbar", aes(ymax = ..y.., ymin = ..y..), col = "red", width = 0.5) +
stat_compare_means(method="t.test") +
xlab("Group") + ylab("Score 1")
Original Data
df1 <- structure(list(Name = structure(1:12, .Label = c("Alabama", "Alaska",
"Arizona", "Arkansas", "California", "Colorado", "Connecticut",
"Delaware", "Florida", "Georgia", "Hawaii", "Idaho"), class = "factor"),
Y_N = c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L),
FIPS = c(1L, 2L, 4L, 5L, 6L, 8L, 9L, 10L, 12L, 13L, 15L,
16L), score1 = c(2633L, 382L, 2695L, 2039L, 27813L, 8609L,
5390L, 858L, 14172L, 9847L, 720L, 845L), score2 = c(8L, 1L,
41L, 10L, 524L, 133L, 111L, 3L, 215L, 308L, 0L, 7L)), class = "data.frame", row.names = c("1:",
"2:", "3:", "4:", "5:", "6:", "7:", "8:", "9:", "10:", "11:",
"12:"))
Alternatively, without installing ggpubr, you can calculate the p-value outside of ggplot2 and use the annotate function to add it to the plot:
pval <- t.test(score1~Y_N,data = df)$p.value
library(ggplot2)
ggplot(df, aes(x = as.factor(Y_N), y = score1, fill = as.factor(Y_N), color = as.factor(Y_N)))+
geom_boxplot(alpha = 0.3, color = "black", outlier.shape = NA)+
geom_jitter(show.legend = FALSE)+
annotate(geom = "text", label = paste("p.value: ",round(pval,3)), x = 1.5, y = max(df$score1)*0.9)
EDIT: Without a boxplot
As an alternative to the boxplot, if you want individual points and a bar representing the mean, you can first calculate the mean per group in a new dataset (here I'm using the dplyr package to do it):
library(dplyr)
Mean_df <- df %>% group_by(Y_N) %>% summarise(Mean = mean(score1))
# A tibble: 2 x 2
Y_N Mean
<int> <dbl>
1 0 2640.
2 1 8972.
Then, you can plot the individual points using geom_jitter and the mean using geom_errorbar, passing it the new dataset Mean_df:
library(ggplot2)
ggplot(df, aes(x = as.factor(Y_N), y = score1))+
geom_jitter(show.legend = FALSE, width = 0.2)+
geom_errorbar(inherit.aes = FALSE, data = Mean_df,
aes(x = as.factor(Y_N),ymin = Mean, ymax = Mean),
color = "red",width = 0.2)+
annotate(geom = "text", label = paste("p.value: ",round(pval,3)),
x = 1.5, y = max(df$score1)*0.9)
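The question also asks for 95% confidence intervals around each mean. Here is a minimal sketch of how the per-group summary could be extended, assuming a t-based interval for the mean is what is wanted. The column names Lower and Upper and the name ci_df are made up for this illustration; the data are the Y_N and score1 columns from the question.

```r
df <- data.frame(
  Y_N = c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L),
  score1 = c(2633L, 382L, 2695L, 2039L, 27813L, 8609L,
             5390L, 858L, 14172L, 9847L, 720L, 845L)
)

# Split score1 by group, then build one summary row per group.
groups <- split(df$score1, df$Y_N)
ci_df <- do.call(rbind, lapply(names(groups), function(g) {
  x <- groups[[g]]
  ci <- t.test(x)$conf.int  # t-based 95% CI for the group mean
  data.frame(Y_N = g, Mean = mean(x), Lower = ci[1], Upper = ci[2])
}))
ci_df
```

A summary like this could then be handed to geom_errorbar with ymin = Lower and ymax = Upper, in the same way Mean_df is used for the mean bar.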
Reproducible example
df <- structure(list(Name = c("Alabama", "Alaska", "Arizona", "Arkansas",
"California", "Colorado", "Connecticut", "Delaware", "Florida",
"Georgia", "Hawaii", "Idaho"), Y_N = c(0L, 0L, 1L, 1L, 1L, 0L,
1L, 0L, 1L, 1L, 0L, 1L), FIPS = c(1L, 2L, 4L, 5L, 6L, 8L, 9L,
10L, 12L, 13L, 15L, 16L), score1 = c(2633L, 382L, 2695L, 2039L,
27813L, 8609L, 5390L, 858L, 14172L, 9847L, 720L, 845L), score2 = c(8L,
1L, 41L, 10L, 524L, 133L, 111L, 3L, 215L, 308L, 0L, 7L)), row.names = c(NA,
-12L), class = c("data.table", "data.frame"))
dd <- structure(list(Name = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho"), Y_N = c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L), FIPS = c(1L, 2L, 4L, 5L, 6L, 8L, 9L, 10L, 12L, 13L, 15L, 16L), score1 = c(2633L, 382L, 2695L, 2039L, 27813L, 8609L, 5390L, 858L, 14172L, 9847L, 720L, 845L), score2 = c(8L, 1L, 41L, 10L, 524L, 133L, 111L, 3L, 215L, 308L, 0L, 7L)), row.names = c(NA, -12L), class = c("data.table", "data.frame"))
## frame
boxplot(score1 ~ Y_N, dd, border = NA)
## 95% ci, medians
sp <- split(dd$score1, dd$Y_N)
sapply(seq_along(sp), function(ii) {
x <- sp[[ii]]
arrows(ii, quantile(x, 0.025), ii, quantile(x, 0.975), code = 3, angle = 90, length = 0.1)
segments(ii - 0.05, median(x), ii + 0.05, col = 'red', lwd = 2)
})
points(dd$Y_N + 1, dd$score1, col = dd$Y_N + 1)
## t-test
lbl <- sprintf('p = %s', format.pval(t.test(score1 ~ Y_N, dd)$p.value, digits = 2))
mtext(lbl, at = par('usr')[2], adj = 1)
One of your questions is how to access the t.test statistics. Here's an answer to that question. Suppose you have this kind of data:
set.seed(12)
YN <- sample(0:1, 100, replace = TRUE)
score1 <- sample(500:1500, 100, replace = TRUE)
df <- data.frame(YN, score1)
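And suppose further that you run and store the t.test like this (note that this runs a separate one-sample t.test within each level of YN, not a two-sample comparison between the levels):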
test <- tapply(df$score1, df$YN, t.test)
Then you can access the test statistics bit by bit like this, illustrated here for the factor level 0:
test$`0`$p.value # p-value
test$`0`$conf.int # confidence interval
test$`0`$estimate # estimate
test$`0`$statistic # statistic
Now obviously you will not want to do this manually bit by bit, but in a more automated and systematic way. This is how you can achieve that:
df1 <- do.call(rbind, lapply(test, function(x) c(
statistic = unname(x$statistic),
ci = unname(x$conf.int),
est = unname(x$estimate),
pval = unname(x$p.value))))
The output is this:
statistic ci1 ci2 est pval
0 22.31155 837.3901 1003.263 920.3265 5.484012e-27
1 22.91558 870.5426 1037.810 954.1765 3.543693e-28
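For the comparison the question actually asks about (does score1 differ between the two levels?), a two-sample test can be stored and queried in the same way. This is a minimal sketch using the formula interface of t.test(), which runs a Welch two-sample test by default; the object names tt and lbl are made up for this illustration.

```r
set.seed(12)
YN <- sample(0:1, 100, replace = TRUE)
score1 <- sample(500:1500, 100, replace = TRUE)
df <- data.frame(YN, score1)

# Welch two-sample t-test comparing score1 between the two levels of YN.
tt <- t.test(score1 ~ YN, data = df)

tt$p.value    # p-value of the comparison
tt$statistic  # t statistic
tt$conf.int   # 95% CI for the difference in means
tt$estimate   # the two group means

# A label that could be passed to text(), mtext() or annotate().
lbl <- paste0("p-value = ", format.pval(tt$p.value, digits = 3))
lbl
```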
Here is part of my dataset:
df=structure(list(CustomerName = structure(c(1L, 1L, 1L, 2L, 2L,
2L), .Label = c("x", "y"), class = "factor"), ItemRelation = c(11202L,
11202L, 11202L, 1L, 1L, 1L), SaleCount = c(214L, 88L, 42L, 214L,
88L, 42L), DocumentNum = c(137L, 137L, 137L, 3L, 3L, 3L), DocumentYear = c(2018L,
2018L, 2018L, 2018L, 2018L, 2018L), k = c(114.66667, 114.66667,
114.66667, 114.66667, 114.66667, 114.66667), m0 = c(31.92, 31.92,
31.92, 31.92, 31.92, 31.92), Action_Effect = c(82.74667, 82.74667,
82.74667, 82.74667, 82.74667, 82.74667)), .Names = c("CustomerName",
"ItemRelation", "SaleCount", "DocumentNum", "DocumentYear", "k",
"m0", "Action_Effect"), class = "data.frame", row.names = c(NA,
-6L))
For each group defined by CustomerName + ItemRelation + DocumentNum + DocumentYear, I need to
calculate the sum of SaleCount and then subtract the Action_Effect column from this sum.
That is, the output must be:
df2=structure(list(CustomerName = structure(c(1L, 1L, 1L, 2L, 2L,
2L), .Label = c("x", "y"), class = "factor"), ItemRelation = c(11202L,
11202L, 11202L, 1L, 1L, 1L), SaleCount = c(214L, 88L, 42L, 214L,
88L, 42L), DocumentNum = c(137L, 137L, 137L, 3L, 3L, 3L), DocumentYear = c(2018L,
2018L, 2018L, 2018L, 2018L, 2018L), X. = c(114.66667, 114.66667,
114.66667, 114.66667, 114.66667, 114.66667), m0 = c(31.92, 31.92,
31.92, 31.92, 31.92, 31.92), Action_Effect = c(82.74667, 82.74667,
82.74667, 82.74667, 82.74667, 82.74667), sum = c(344L, 344L,
344L, 344L, 344L, 344L), output = c(261.25333, 261.25333, 261.25333,
261.25333, 261.25333, 261.25333)), .Names = c("CustomerName",
"ItemRelation", "SaleCount", "DocumentNum", "DocumentYear", "X.",
"m0", "Action_Effect", "sum", "output"), class = "data.frame", row.names = c(NA,
-6L))
The table is long, so I decided to show the desired output via dput().
How can I do it?
Your data is a bit weird, as the values are the same for both groups, but this should work:
library(dplyr)
df %>%
group_by(CustomerName, ItemRelation, DocumentNum, DocumentYear) %>%
mutate(test = sum(SaleCount) - Action_Effect)
# A tibble: 6 x 9
# Groups: CustomerName, ItemRelation, DocumentNum, DocumentYear [2]
CustomerName ItemRelation SaleCount DocumentNum DocumentYear k m0 Action_Effect test
<fctr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 x 11202 214 137 2018 114.6667 31.92 82.74667 261.2533
2 x 11202 88 137 2018 114.6667 31.92 82.74667 261.2533
3 x 11202 42 137 2018 114.6667 31.92 82.74667 261.2533
4 y 1 214 3 2018 114.6667 31.92 82.74667 261.2533
5 y 1 88 3 2018 114.6667 31.92 82.74667 261.2533
6 y 1 42 3 2018 114.6667 31.92 82.74667 261.2533
To add the sum, use
df %>%
group_by(CustomerName, ItemRelation, DocumentNum, DocumentYear) %>%
mutate(sum = sum(SaleCount), output = sum(SaleCount) - Action_Effect)
For completeness, adding base and data.table syntax:
base:
df$test <- unlist(by(df,
paste(df$CustomerName, df$ItemRelation, df$DocumentNum, df$DocumentYear),
function(x) sum(x$SaleCount) - x$Action_Effect))
df
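One thing to watch with the by() version: the results come back ordered by group, so unlist() lines up with the rows only when the data are already sorted by group, as they are in this sample. Below is a sketch of an alternative with base R's ave(), which returns values in the original row order. The data frame is a cut-down copy of the sample, and the name key is made up for this illustration.

```r
df <- data.frame(
  CustomerName = c("x", "x", "x", "y", "y", "y"),
  ItemRelation = c(11202L, 11202L, 11202L, 1L, 1L, 1L),
  SaleCount = c(214L, 88L, 42L, 214L, 88L, 42L),
  DocumentNum = c(137L, 137L, 137L, 3L, 3L, 3L),
  DocumentYear = 2018L,
  Action_Effect = 82.74667
)

# One key per group, built from the four grouping columns.
key <- paste(df$CustomerName, df$ItemRelation, df$DocumentNum, df$DocumentYear)

# ave() repeats the group sum on every row of the group, in row order.
df$sum <- ave(df$SaleCount, key, FUN = sum)
df$output <- df$sum - df$Action_Effect
df
```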
data.table:
library(data.table)
setDT(df)
df[, test2:=sum(SaleCount) - Action_Effect,
by=.(CustomerName, ItemRelation, DocumentNum, DocumentYear)][]
I have a table like the following image and I'm trying to use a simple if statement to return the country name only in cases where food is "Oranges". The 3rd column is the desired outcome, the 4th column is what I get in R.
In excel the formula would be:
=IF(A2="Oranges",B2,"n/a")
I have used the following r code to generate the "oranges_country" variable:
table$oranges_country <- ifelse (Food == "Oranges", Country , "n/a")
[As per the image above] The code returns the number of the level (e.g. 6) in the levels list for 'Country' rather than the value of 'Country' itself (e.g. "Spain"). I understand where this is coming from (the position in the extract below), but it's a pain, particularly when using several nested if statements.
levels(Country)
[1] "California" "Ecuador" "France" "New Zealand" "Peru" "Spain" "UK"
There must be a simple way to change this???
As requested in a comment: dput(table) output as follows:
dput(table)
structure(list(Food = structure(c(1L, 1L, 3L, 1L, 1L, 3L, 3L,
2L, 2L), .Label = c("Apples", "Bananas", "Oranges"), class = "factor"),
Country = structure(c(3L, 7L, 6L, 4L, 7L, 6L, 1L, 5L, 2L), .Label = c("California",
"Ecuador", "France", "New Zealand", "Peru", "Spain", "UK"
), class = "factor"), Desired_If.Outcome = structure(c(2L,
2L, 3L, 2L, 2L, 3L, 1L, 2L, 2L), .Label = c("California",
"n/a", "Spain"), class = "factor"), oranges_country = c("n/a",
"n/a", "6", "n/a", "n/a", "6", "1", "n/a", "n/a"), desiredcolumn = c(NA,
NA, 6L, NA, NA, 6L, 1L, NA, NA)), .Names = c("Food", "Country",
"Desired_If.Outcome", "oranges_country", "desiredcolumn"), row.names = c(NA,
-9L), class = "data.frame")
Try ifelse, which is vectorized. First, change table$Country to character:
table$Country<-as.character(table$Country)
table$desiredcolumn<-ifelse(table$Food == "Oranges", table$Country, NA)
Here is my version:
Food<-c("Ap","Ap","Or","Ap","Ap","Or","Or","Ba","Ba")
Country<-c("Fra","UK","Sp","Nz","UK","Sp","Cal","Per","Eq")
Table<-cbind(Food,Country)
Table<-data.frame(Table)
Table$Country<-as.character(Table$Country)
Table$DC<-ifelse(Table$Food=="Or", Table$Country, NA)
Table
Food Country DC
1 Ap Fra <NA>
2 Ap UK <NA>
3 Or Sp Sp
4 Ap Nz <NA>
5 Ap UK <NA>
6 Or Sp Sp
7 Or Cal Cal
8 Ba Per <NA>
9 Ba Eq <NA>
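The level numbers show up because ifelse() drops the factor attributes of its yes argument and keeps only the underlying integer codes. Here is a minimal sketch of the difference, which converts inside the call and so leaves the column itself as a factor. The data frame name fruit and the vector name codes are made up for this illustration; the values are taken from the question.

```r
fruit <- data.frame(
  Food = factor(c("Apples", "Apples", "Oranges", "Apples", "Apples",
                  "Oranges", "Oranges", "Bananas", "Bananas")),
  Country = factor(c("France", "UK", "Spain", "New Zealand", "UK",
                     "Spain", "California", "Peru", "Ecuador"))
)

# Passing the factor directly yields its integer level codes
# (as text here, because the "n/a" fallback is a character string).
codes <- ifelse(fruit$Food == "Oranges", fruit$Country, "n/a")

# Converting to character inside the call keeps the labels.
fruit$oranges_country <- ifelse(fruit$Food == "Oranges",
                                as.character(fruit$Country), "n/a")
fruit
```

The same as.character() wrapping can be used inside nested ifelse() calls.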
Try this (if your table is called table):
table[table$Food=="Oranges", ]
I am trying to get more control over the text that appears when using add_tooltip in ggvis.
Say I want to plot 'totalinns' against 'avg' for this dataframe. Color points by 'country'.
The text I want to appear in the hovering tooltip would be: 'player', 'country', 'debutyear' 'avg'
tmp:
# player totalruns totalinns totalno totalout avg debutyear country
# 1 AG Ganteaume 112 1 0 1 112.00000 1948 WI
# 2 DG Bradman 6996 80 10 70 99.94286 1928 Aus
# 3 MN Nawaz 99 2 1 1 99.00000 2002 SL
# 4 VH Stollmeyer 96 1 0 1 96.00000 1939 WI
# 5 DM Lewis 259 5 2 3 86.33333 1971 WI
# 6 Abul Hasan 165 5 3 2 82.50000 2012 Ban
# 7 RE Redmond 163 2 0 2 81.50000 1973 NZ
# 8 BA Richards 508 7 0 7 72.57143 1970 SA
# 9 H Wood 204 4 1 3 68.00000 1888 Eng
# 10 JC Buttler 200 3 0 3 66.66667 2014 Eng
I understand that I need to make a key/id variable as ggvis only takes information supplied to it. Therefore I need to refer back to the original data. I have tried changing my text inside of my paste0() command, but still can't get it right.
tmp$id <- 1:nrow(tmp)
all_values <- function(x) {
if(is.null(x)) return(NULL)
row <- tmp[tmp$id == x$id, ]
paste0(tmp$player, tmp$country, tmp$debutyear,
tmp$avg, format(row), collapse = "<br />")
}
tmp %>% ggvis(x = ~totalinns, y = ~avg, key := ~id) %>%
layer_points(fill = ~factor(country)) %>%
add_tooltip(all_values, "hover")
Find below code to reproduce example:
tmp <- structure(list(player = c("AG Ganteaume", "DG Bradman", "MN Nawaz",
"VH Stollmeyer", "DM Lewis", "Abul Hasan", "RE Redmond", "BA Richards",
"H Wood", "JC Buttler"), totalruns = c(112L, 6996L, 99L, 96L,
259L, 165L, 163L, 508L, 204L, 200L), totalinns = c(1L, 80L, 2L,
1L, 5L, 5L, 2L, 7L, 4L, 3L), totalno = c(0L, 10L, 1L, 0L, 2L,
3L, 0L, 0L, 1L, 0L), totalout = c(1L, 70L, 1L, 1L, 3L, 2L, 2L,
7L, 3L, 3L), avg = c(112, 99.9428571428571, 99, 96, 86.3333333333333,
82.5, 81.5, 72.5714285714286, 68, 66.6666666666667), debutyear = c(1948L,
1928L, 2002L, 1939L, 1971L, 2012L, 1973L, 1970L, 1888L, 2014L
), country = c("WI", "Aus", "SL", "WI", "WI", "Ban", "NZ", "SA",
"Eng", "Eng")), .Names = c("player", "totalruns", "totalinns",
"totalno", "totalout", "avg", "debutyear", "country"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -10L))
I think this is closer:
all_values <- function(x) {
if(is.null(x)) return(NULL)
row <- tmp[tmp$id == x$id, ]
paste(tmp$player[x$id], tmp$country[x$id], tmp$debutyear[x$id],
tmp$avg[x$id], sep="<br>")
}
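As a variation, the row that was already looked up by id can be used directly, with a label in front of each field. This is a sketch only: the labels ("Player: " and so on) are made up, the data frame is a three-row extract of tmp, and it assumes, as the code above does, that the object passed to the function carries the key as x$id. The function can be checked without ggvis by calling it with a plain list.

```r
tmp <- data.frame(
  player = c("AG Ganteaume", "DG Bradman", "MN Nawaz"),
  totalinns = c(1L, 80L, 2L),
  avg = c(112, 99.9428571428571, 99),
  debutyear = c(1948L, 1928L, 2002L),
  country = c("WI", "Aus", "SL"),
  stringsAsFactors = FALSE
)
tmp$id <- seq_len(nrow(tmp))

all_values <- function(x) {
  if (is.null(x)) return(NULL)
  # Look up the hovered point by its key rather than by row position.
  row <- tmp[tmp$id == x$id, ]
  paste0(
    "Player: ", row$player, "<br />",
    "Country: ", row$country, "<br />",
    "Debut: ", row$debutyear, "<br />",
    "Average: ", round(row$avg, 2)
  )
}

# Simulate a hover over the point with id 2.
all_values(list(id = 2))
```

The function would then be passed to add_tooltip(all_values, "hover") as in the question.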