I have got this dataset so far:
julia> import Downloads
julia> using DLMReader, VegaLite, InMemoryDatasets
julia> data = Downloads.download("https://raw.githubusercontent.com/akshdfyehd/salary/main/ds_salaries.csv")
julia> ds = filereader(data, emptycolname=true)
julia> new = filter(ds, :employment_type, by = ==("FT"))
julia> select!(new, :job_title, :work_year, :experience_level, :salary_in_usd)
588×4 Dataset
 Row │ job_title                   work_year  experience_level  salary_in_usd
     │ identity                    identity   identity          identity
     │ String?                     Int64?     String?           Int64?
─────┼────────────────────────────────────────────────────────────────────────
   1 │ Data Scientist                   2020  MI                        79833
   2 │ Machine Learning Scientist       2020  SE                       260000
   3 │ Big Data Engineer                2020  SE                       109024
   4 │ Product Data Analyst             2020  MI                        20000
   5 │ Machine Learning Engineer        2020  SE                       150000
   6 │ Data Analyst                     2020  EN                        72000
   7 │ Lead Data Scientist              2020  SE                       190000
   8 │ Data Scientist                   2020  MI                        35735
   9 │ Business Data Analyst            2020  MI                       135000
  10 │ Lead Data Engineer               2020  SE                       125000
  11 │ Data Scientist                   2020  EN                        51321
   ⋮ │ ⋮                                   ⋮  ⋮                             ⋮
I want to make two graphs:
1. The first with salary on the x axis, job title on the y axis, and color by experience level.
2. The second with job title on the y axis, grouped by work year, salary on the x axis, and color by experience level. The outcome should be something like this:
For the first picture I have tried this code:
julia> palette = ["brown", "blue", "tan", "green"]
julia> plot(new, y=:job_title, x=:salary_in_usd, color=:experience_level,
            Geom.subplot_grid(Geom.bar(position=:stack, orientation=:horizontal),
                              Guide.ylabel(orientation=:vertical)),
            Scale.color_discrete_manual(palette...),
            Guide.colorkey(title="experience_level\ntype"))
but an error is raised saying that no method matches. Does anyone know where I went wrong? Many thanks!
I have got this dataset in julia:
julia> import Downloads
julia> using DLMReader, VegaLite, InMemoryDatasets
julia> data=Downloads.download("https://raw.githubusercontent.com/akshdfyehd/salary/main/ds_salaries.csv");
julia> ds=filereader(data,emptycolname=true);
julia> new=filter(ds,:employment_type,by= ==("FT"));
julia> select!(new, :job_title, :work_year, :experience_level, :salary_in_usd)
588×4 Dataset
 Row │ job_title                   work_year  experience_level  salary_in_usd
     │ identity                    identity   identity          identity
     │ String?                     Int64?     String?           Int64?
─────┼────────────────────────────────────────────────────────────────────────
   1 │ Data Scientist                   2020  MI                        79833
   2 │ Machine Learning Scientist       2020  SE                       260000
   3 │ Big Data Engineer                2020  SE                       109024
   4 │ Product Data Analyst             2020  MI                        20000
   5 │ Machine Learning Engineer        2020  SE                       150000
   6 │ Data Analyst                     2020  EN                        72000
   7 │ Lead Data Scientist              2020  SE                       190000
   8 │ Data Scientist                   2020  MI                        35735
   9 │ Business Data Analyst            2020  MI                       135000
  10 │ Lead Data Engineer               2020  SE                       125000
  11 │ Data Scientist                   2020  EN                        51321
  12 │ Data Scientist                   2020  MI                        40481
  13 │ Data Scientist                   2020  EN                        39916
  14 │ Lead Data Analyst                2020  MI                        87000
   ⋮ │ ⋮                                   ⋮  ⋮                             ⋮
 576 │ Data Analytics Manager           2022  SE                       150260
 577 │ Data Analytics Manager           2022  SE                       109280
 578 │ Data Scientist                   2022  SE                       210000
 579 │ Data Analyst                     2022  SE                       170000
 580 │ Data Scientist                   2022  MI                       160000
 581 │ Data Scientist                   2022  MI                       130000
 582 │ Data Analyst                     2022  EN                        67000
 583 │ Data Analyst                     2022  EN                        52000
 584 │ Data Engineer                    2022  SE                       154000
 585 │ Data Engineer                    2022  SE                       126000
 586 │ Data Analyst                     2022  SE                       129000
 587 │ Data Analyst                     2022  SE                       150000
 588 │ AI Scientist                     2022  MI                       200000
                                                              561 rows omitted
I have tried the following graph:
But it does not represent the information very clearly, and I don't have good ideas about how to visualize it. This graph is a good one, but I'm not sure whether my dataset can produce this kind of graph:
Can I please have any suggestions for making a better graph? Any other package is fine as long as it produces a good graph.
Thanks in advance.
I suggest you read the following to get a good idea of better dataviz practices.
https://clauswilke.com/dataviz/
In answer to your question, it really depends on what you're trying to show. In the first instance, I would switch the axes, which will make the data a lot more readable.
I have been struggling with a barplot for two days now and I really need some help.
What I need is four barplots beside each other, one for each of the four burn levels in a forest (levels 1-4). In each graph I want the distribution of tree diameters (n = 9005 trees) in classes from 1 (<10 cm) to 10 (>=50 cm), with xlab = diameter class (1-10) and ylab = number of stems in that diameter class. But I have a different number of trees for each level, so I would need to use the mean for every bar (divide the sum by the number of stems with the same number A03, ...)
or group them by the column "class", which already is the diameter class. And last but not least, I want two bars next to each other (or one after the other) for the two periods (2011 and 2020).
no.  dbh  year  level              class
A03  19   2011  unverbrannt        1
A03  19   2011  kaum verbrannt     2
A04  27   2011  kaum verbrannt     3
A04  15   2011  kaum verbrannt     4
A05  33   2011  kaum verbrannt     5
B01  21   2011  kaum verbrannt     6
.         2020  .                  7
          2020  kaum verbrannt     8
.         2020  .                  9
.    .          schwach verbrannt  10
.    .
.    .          stark verbrannt
In the end it should look like this:
https://luisdva.github.io/rstats/Diverging-bar-plots/
I hope it makes any sense.
Thanks in advance
Elsa
Yes, of course. This is my data.
I have:
10 groups of diameter distribution classes, "BHD klasse",
and 2 time periods (jahr), and I want to get a graph for each of the
4 groups of degree of burning, "grad" (maybe with ggplot facet wrap):
I couldn't reproduce or copy the code in here, so here is an example:
BHD Klasse  Jahr  grad
1           2011  stark
2           2011  schwach
3           2020  nicht
4           2020  kaum
5           2020  kaum
I tried to use ggplot:
BHD Klasse  jahr  grad
1           2011  stark
2           2020  schwach
1           2011  kaum
5           2020  stark
10          2020  nicht
ggplot() + geom_histogram(data = bhdklass, aes(x = bhdklasse, fill = jahr))
It would be awesome to hear from you.
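One possible way to get the four panels with ggplot2 is sketched below. It is only a sketch under assumptions: the data frame `bhdklass` is rebuilt here as made-up toy data with the columns `bhdklasse`, `jahr` and `grad` from the example, and `geom_bar()` is swapped in for `geom_histogram()` because the diameter classes are already discrete.

```r
library(ggplot2)

# Hypothetical toy data with the three columns from the example above
bhdklass <- data.frame(
  bhdklasse = c(1, 2, 1, 5, 10, 1, 2, 5),
  jahr      = c(2011, 2020, 2011, 2020, 2020, 2020, 2011, 2011),
  grad      = c("stark", "schwach", "kaum", "stark",
                "nicht", "stark", "schwach", "kaum")
)

# Treat class and year as categories so that bars are counted per class
bhdklass$bhdklasse <- factor(bhdklass$bhdklasse, levels = 1:10)
bhdklass$jahr      <- factor(bhdklass$jahr)

p <- ggplot(bhdklass, aes(x = bhdklasse, fill = jahr)) +
  geom_bar(position = "dodge") +      # stem counts, the two years side by side
  facet_wrap(~ grad, nrow = 1) +      # one panel per degree of burning
  scale_x_discrete(drop = FALSE) +    # keep all 10 classes on the x axis
  labs(x = "Diameter class (1-10)", y = "Number of stems", fill = "Year")

print(p)
```

This plots raw counts per class. Because the burn levels contain different numbers of trees, the counts may still need to be converted to proportions per panel before the panels can be compared directly.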
Edit: It looks like this is a known issue with the "cascade" method. Results that return NA values after the first attempt don't like being converted to doubles when subsequent methods return lat/lons.
Data: I have a list of addresses that I need to geocode. I'm using lapply() to split-apply-combine, which works, but very slowly. My thought to split (further)-apply-combine is returning errors about dim names and sizes that are confusing to me.
# example data
library(dplyr)
library(tidygeocoder)
url <- "https://www.briandunning.com/sample-data/us-500.zip"
download.file(url = url, destfile = basename(url))
adds <- readr::read_csv(basename(url)) %>%
  select(address, city,
         county, state, zip) %>%
  mutate(date = seq.Date(as.Date('2015-01-01'), to = Sys.Date(), length.out = 500)) %>%
  mutate(year = lubridate::year(date)) %>%
  # to keep it small
  sample_n(20)
This works: split the addresses by year, apply the tidygeocoder function to return lat/lons, and recombine.
adds_by_year <- adds %>% split(.$year)
geo_list <- lapply(adds_by_year, function(x) {
  geo <- geocode(.tbl = x,
                 street = address,
                 city = city,
                 county = county,
                 state = state,
                 postalcode = zip,
                 # cascade method uses all options (census, osm, etc)
                 # takes longer but may be more accurate
                 method = "cascade", timeout = 500) %>%
    filter(!is.na(lat))
  return(geo)
})
out <- bind_rows(geo_list)
Below does not:
adds <- adds %>%
  mutate(yrmn = zoo::as.yearmon(date))
adds_by_yrm <- adds %>% split(.$yrmn)
geo_list <- lapply(adds_by_yrm, function(x) {
  geo <- geocode(.tbl = x,
                 street = address,
                 city = city,
                 county = county,
                 state = state,
                 postalcode = zip,
                 # cascade method uses all options (census, osm, etc)
                 # takes longer but may be more accurate
                 method = "cascade", timeout = 500) %>%
    filter(!is.na(lat))
  return(geo)
})
out <- bind_rows(geo_list)
Returns this error:
Error: Assigned data `retry_results` must be compatible with existing data.
ℹ Error occurred for column `lat`.
x Can't convert from <double> to <logical> due to loss of precision.
* Locations: 1.
Run `rlang::last_error()` to see where the error occurred.
I did some searching and found this, but the proposed solution -- wrapping x in as.data.frame(), resulted in the same error.
Any insight is appreciated. I've looked into using purrr but I'm not sure I grok completely.
Here is the full backtrace, which I'm not familiar enough with to parse completely:
Backtrace:
█
1. ├─base::lapply(...)
2. │ └─global::FUN(X[[i]], ...)
3. │ └─tidygeocoder::geocode(...)
4. │ ├─base::do.call(geo, geo_args)
5. │ └─(function (address = NULL, street = NULL, city = NULL, county = NULL, ...
6. │ ├─base::do.call(geo_cascade, all_args[!names(all_args) %in% c("method")])
7. │ └─(function (..., cascade_order = c("census", "osm")) ...
8. │ ├─base::`[<-`(...)
9. │ └─tibble:::`[<-.tbl_df`(...)
10. │ └─tibble:::tbl_subassign(x, i, j, value, i_arg, j_arg, substitute(value))
11. │ └─tibble:::tbl_subassign_row(x, i, value, value_arg)
12. │ ├─base::withCallingHandlers(...)
13. │ └─vctrs::`vec_slice<-`(`*tmp*`, i, value = value[[j]])
14. │ └─(function () ...
15. │ └─vctrs:::vec_cast.logical.double(...)
16. │ └─vctrs::maybe_lossy_cast(out, x, to, lossy, x_arg = x_arg, to_arg = to_arg)
17. │ ├─base::withRestarts(...)
18. │ │ └─base:::withOneRestart(expr, restarts[[1L]])
19. │ │ └─base:::doWithOneRestart(return(expr), restart)
20. │ └─vctrs:::stop_lossy_cast(...)
21. │ └─vctrs:::stop_vctrs(...)
22. │ └─rlang::abort(message, class = c(class, "vctrs_error"), ...)
23. │ └─rlang:::signal_abort(cnd)
24. │ └─base::signalCondition(cnd)
25. └─(function (cnd) ...
It works with dplyr 1.0.6:
dplyr::bind_rows(geo_list)
# A tibble: 8 x 11
  address             city       county               state zip   date        year yrmn        lat  long geo_method
  <chr>               <chr>      <chr>                <chr> <chr> <date>     <dbl> <yearmon> <dbl> <dbl> <chr>
1 134 Lewis Rd        Nashville  Davidson             TN    37211 2016-11-06  2016 Nov 2016   36.2 -86.8 osm
2 6651 Municipal Rd   Houma      Terrebonne           LA    70360 2017-02-03  2017 Feb 2017   29.6 -90.7 osm
3 189 Village Park Rd Crestview  Okaloosa             FL    32536 2017-08-25  2017 Aug 2017   30.8 -86.6 osm
4 9122 Carpenter Ave  New Haven  New Haven            CT    06511 2018-01-14  2018 Jan 2018   41.5 -72.8 osm
5 5221 Bear Valley Rd Nashville  Davidson             TN    37211 2018-09-17  2018 Sep 2018   36.1 -86.8 osm
6 28 S 7th St #2824   Englewood  Bergen               NJ    07631 2020-03-31  2020 Mar 2020   40.9 -74.0 census
7 5 E Truman Rd       Abilene    Taylor               TX    79602 2021-02-25  2021 Feb 2021   32.5 -99.7 osm
8 9 Front St          Washington District of Columbia DC    20001 2021-05-16  2021 May 2021   38.9 -77.0 osm
I noticed that some list elements have 0 rows. Maybe we could remove those 0-row elements and then use bind_rows:
library(purrr)
library(dplyr)
geo_list %>%
  keep(~ NROW(.x) > 0) %>%
  bind_rows()
# A tibble: 8 x 11
  address             city       county               state zip   date        year yrmn        lat  long geo_method
  <chr>               <chr>      <chr>                <chr> <chr> <date>     <dbl> <yearmon> <dbl> <dbl> <chr>
1 134 Lewis Rd        Nashville  Davidson             TN    37211 2016-11-06  2016 Nov 2016   36.2 -86.8 osm
2 6651 Municipal Rd   Houma      Terrebonne           LA    70360 2017-02-03  2017 Feb 2017   29.6 -90.7 osm
3 189 Village Park Rd Crestview  Okaloosa             FL    32536 2017-08-25  2017 Aug 2017   30.8 -86.6 osm
4 9122 Carpenter Ave  New Haven  New Haven            CT    06511 2018-01-14  2018 Jan 2018   41.5 -72.8 osm
5 5221 Bear Valley Rd Nashville  Davidson             TN    37211 2018-09-17  2018 Sep 2018   36.1 -86.8 osm
6 28 S 7th St #2824   Englewood  Bergen               NJ    07631 2020-03-31  2020 Mar 2020   40.9 -74.0 census
7 5 E Truman Rd       Abilene    Taylor               TX    79602 2021-02-25  2021 Feb 2021   32.5 -99.7 osm
8 9 Front St          Washington District of Columbia DC    20001 2021-05-16  2021 May 2021   38.9 -77.0 osm
SOLVED:
Update dplyr (thanks to akrun).
Update tidygeocoder. It turns out the issue was binding numeric results to NA results, which was dealt with in a newer release that I didn't have yet. I'm posting my code here because there are several useful flags in the geocode() function for debugging:
adds_by_yrm <- adds %>% split(.$yrmn)
geo_list <- lapply(adds_by_yrm, function(x) {
  geo <- geocode(.tbl = as.data.frame(x),
                 street = address,
                 city = city,
                 county = county,
                 state = state,
                 postalcode = zip,
                 # cascade method uses all options (census, osm, etc)
                 # takes longer but may be more accurate
                 method = "cascade",
                 cascade_order = c("census", "osm"),
                 timeout = 500,
                 unique_only = TRUE,
                 verbose = TRUE) %>%
    filter(!is.na(lat))
  return(geo)
})
out <- geo_list %>%
  purrr::keep(~ NROW(.x) > 0) %>%
  bind_rows()
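For anyone wondering where the <double> versus <logical> mismatch in the error comes from: a bare NA in R is a logical value. The base R sketch below only illustrates that type difference. The vectors `first_pass`, `retry` and `lat` are made-up stand-ins, not tidygeocoder internals.

```r
# A bare NA is a logical value; typed NAs exist for other vector types
typeof(NA)        # "logical"
typeof(NA_real_)  # "double"

# Hypothetical stand-ins for a failed first pass and a successful retry
first_pass <- c(NA, NA)
retry      <- c(36.2, 29.6)

# Base R silently promotes the logical vector to double on assignment;
# stricter tools such as tibble/vctrs may refuse this conversion instead
first_pass[1] <- retry[1]
typeof(first_pass)  # "double"

# Starting from NA_real_ avoids relying on any coercion
lat <- rep(NA_real_, 2)
lat[2] <- retry[2]
```

So a column that holds only NA after the first geocoding attempt is logical, and filling it with coordinates later requires a cast that a strict assignment can reject.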
Below is the sample data and one manipulation. The first data set is employment specific to an industry. The second data set is overall employment and unemployment rate. I am seeking to do a left join (or at least that's what I think it should be) to achieve the desired result below. When I do it, I get a one to many issue with the row count growing. In this example, it goes from 14 to 18. In the larger data set, it goes from 228 to 4348. Primary question is if this can be done with only a properly written join script or is there more to it?
library(dplyr)
area1 <- c(000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000)
periodyear <- c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2021,2021)
month <- c(1,2,3,4,5,6,7,8,9,10,11,12,1,2)
emp1 <- c(10,11,12,13,14,15,16,17,20,21,22,24,26,28)
firstset <- data.frame(area1, periodyear, month, emp1)
area2 <- c(000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000)
periodyear1 <- c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2021,2021)
period <- c(01,02,03,04,05,06,07,08,09,10,11,12,01,02)
rate <- c(3.0,3.2,3.4,3.8,2.5,4.5,6.5,9.1,10.6,5.5,7.8,6.5,4.5,2.9)
emp2 <- c(1001,1002,1005,1105,1254,1025,1078,1106,1099,1188,1254,1250,1301,1188)
secondset <- data.frame(area2, periodyear1, period, rate, emp2)
secondset <- secondset %>% mutate(month = as.numeric(period))
secondset <- left_join(firstset, secondset, by = c("month"))
Desired result (14 rows, of which the first 3 are shown below):
area1   periodyear  month  emp1  rate  emp2
000000  2020        1      10    3.0   1001
000000  2020        2      11    3.2   1002
000000  2020        3      12    3.4   1005
We may have to add 'periodyear' (and the area) to the `by` as well:
library(dplyr)
left_join(firstset, secondset, by = c("periodyear" = "periodyear1",
                                      "area1" = "area2", "month"))
-output
  area1 periodyear month emp1 period rate emp2
1     0       2020     1   10      1  3.0 1001
2     0       2020     2   11      2  3.2 1002
3     0       2020     3   12      3  3.4 1005
...
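To see where the extra rows come from, here is a minimal base R sketch. It uses merge() rather than dplyr and rebuilds the two sample sets in compact form, so the construction is my own, not the original script. Months 1 and 2 occur in both 2020 and 2021, so a join on month alone matches each of those rows twice.

```r
# Compact rebuild of the two sample sets from the question
firstset <- data.frame(
  area1      = 0,
  periodyear = c(rep(2020, 12), 2021, 2021),
  month      = c(1:12, 1, 2),
  emp1       = c(10, 11, 12, 13, 14, 15, 16, 17, 20, 21, 22, 24, 26, 28)
)
secondset <- data.frame(
  area2       = 0,
  periodyear1 = c(rep(2020, 12), 2021, 2021),
  month       = c(1:12, 1, 2),
  rate        = c(3.0, 3.2, 3.4, 3.8, 2.5, 4.5, 6.5, 9.1, 10.6, 5.5, 7.8, 6.5, 4.5, 2.9),
  emp2        = c(1001, 1002, 1005, 1105, 1254, 1025, 1078, 1106, 1099, 1188, 1254, 1250, 1301, 1188)
)

# Joining on month only: months 1 and 2 match 2 x 2 times each
by_month <- merge(firstset, secondset, by = "month", all.x = TRUE)
nrow(by_month)  # 18

# Joining on the full key (area, year, month) keeps one match per row
by_key <- merge(firstset, secondset,
                by.x = c("area1", "periodyear", "month"),
                by.y = c("area2", "periodyear1", "month"),
                all.x = TRUE)
nrow(by_key)    # 14

# Quick check that the key is unique in the right-hand table (0 = no duplicates)
anyDuplicated(secondset[c("area2", "periodyear1", "month")])
```

Checking the key for duplicates before joining is a cheap way to predict whether a left join will grow the row count.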
I'm trying to plot a boxplot for a time series (e.g. http://www.r-graph-gallery.com/146-boxplot-for-time-series/) and can get every other example to work, bar my last one. I have averages per month for six years (2011 to 2016) and have data for 2014 and 2015 (albeit in small quantities), but for some reason, boxes aren't being shown for the 2014 and 2015 data.
My input data has three columns: year, month and residency index (a value between 0 and 1). There are multiple individuals (in this example, 37) each with an average residency index per month per year (including 2014 and 2015).
For example:
year month RI
2015 1 NA
2015 2 NA
2015 3 NA
2015 4 NA
2015 5 NA
2015 6 NA
2015 7 0.387096774
2015 8 0.580645161
2015 9 0.3
2015 10 0.225806452
2015 11 0.3
2015 12 0.161290323
2016 1 0.096774194
2016 2 0.103448276
2016 3 0.161290323
2016 4 0.366666667
2016 5 0.258064516
2016 6 0.266666667
2016 7 0.387096774
2016 8 0.129032258
2016 9 0.133333333
2016 10 0.032258065
2016 11 0.133333333
2016 12 0.129032258
which is repeated for each individual fish.
My code:
#make boxplot
boxplot(RI$RI ~ RI$month + RI$year,
        xaxt="n", xlab="", col=my_colours, pch=20, cex=0.3,
        ylab="Residency Index (RI)", ylim=c(0,1))
abline(v=seq(0,12*6,12)+0.5,col="grey")
axis(1,labels=unique(RI$year),at=seq(6,12*6,12))
The average trend line works as per the other examples.
a=aggregate(RI$RI,by=list(RI$month,RI$year),mean, na.rm=TRUE)
lines(a[,3],type="l",col="red",lwd=2)
Any help on this matter would be greatly appreciated.
Your problem seems to be the presence of missing values (NA) in your data; the other values are plotted correctly. I've simplified your code a bit.
boxplot(RI$RI ~ RI$month + RI$year,
ylab="Residency Index (RI)")
a <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)
lines(c(rep(NA, 6), a[,3]), type="l", col="red", lwd=2)
Also, I believe a boxplot may not be the best way to depict your data. You only have one value per year/month, whereas a boxplot requires more. A simple scatter plot may do better.
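The padding with rep(NA, 6) in the lines() call above is needed because the formula interface of aggregate() drops rows containing NA before grouping. A small sketch with made-up data (one hypothetical series for 2015 only, values rounded) shows the difference from the by= interface:

```r
# Hypothetical single-series data: the first six months of 2015 are missing
RI <- data.frame(
  year  = 2015,
  month = 1:12,
  RI    = c(rep(NA, 6), 0.39, 0.58, 0.30, 0.23, 0.30, 0.16)
)

# Formula interface: rows with NA are removed first (na.action = na.omit),
# so the all-NA months disappear from the result
a_formula <- aggregate(RI ~ month + year, data = RI, FUN = mean)
nrow(a_formula)  # 6

# by= interface: every month/year group is kept; all-NA groups become NaN
a_by <- aggregate(RI$RI, by = list(month = RI$month, year = RI$year),
                  FUN = mean, na.rm = TRUE)
nrow(a_by)       # 12
```

With the by= form the result already has one row per month, so the trend line lines up with the boxplot positions without manual padding.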