I have a dataset like the following:
I want to make a world map that shows which countries have a higher mean salary, perhaps represented through color density, where a higher density means a higher mean salary. I tried to do that with VegaLite, but I always got the error:
Then I realized that the country names in this data look like this:
RU means Russia, NZ means New Zealand, and so on. Is there any way I can convert these into the complete country names? And where did I go wrong in this map code?
Can someone help me with that please?
Thanks for any help:)
I just want to say thank you to everyone who offered me suggestions!
I have successfully changed my country names, but I don't know how to make a map of the countries that shows which ones have a higher mean value. Can someone give me some advice, please?
These abbreviations resemble ISO 3166 alpha-2 codes. The Julia package Countries.jl is great for converting ISO standards:
julia> using Countries, DataFrames
julia> ds = DataFrame(company_location=["RU", "US", "NZ"], Mean=[1580000, 142000, 122000])
3×2 DataFrame
Row │ company_location Mean
│ String Int64
─────┼───────────────────────────
1 │ RU 1580000
2 │ US 142000
3 │ NZ 122000
julia> ds.country_names = [x.name for x in get_country.(ds.company_location)];
julia> ds
3×3 DataFrame
Row │ company_location Mean country_names
│ String Int64 String
─────┼───────────────────────────────────────────────
1 │ RU 1580000 Russian Federation
2 │ US 142000 United States
3 │ NZ 122000 New Zealand
Alternatively, you could make your own dictionary that maps codes to full names. This could be useful if the abbreviations are non-standard, or if you're working with something else in general:
julia> using DataFrames
julia> ds = DataFrame(company_location=["RU", "US", "NZ"], Mean=[1580000, 142000, 122000])
3×2 DataFrame
Row │ company_location Mean
│ String Int64
─────┼───────────────────────────
1 │ RU 1580000
2 │ US 142000
3 │ NZ 122000
julia> country_codes = Dict("RU" => "Russia", "US" => "United States", "NZ" => "New Zealand");
julia> ds.country_names = getindex.(Ref(country_codes), ds.company_location);
julia> ds
3×3 DataFrame
Row │ company_location Mean country_names
│ String Int64 String
─────┼──────────────────────────────────────────
1 │ RU 1580000 Russia
2 │ US 142000 United States
3 │ NZ 122000 New Zealand
(I couldn't find an obvious way to pass multiple indices to a dictionary, but this post on the Julia Discourse shows a working method, which I've used in my example)
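If some codes might be missing from the dictionary, getindex will throw a KeyError. A minimal sketch of a more forgiving lookup, using only Base Julia, broadcasts get with a default of missing instead. The code "XX" and the variable name full_names are made up for illustration:

```julia
country_codes = Dict("RU" => "Russia", "US" => "United States", "NZ" => "New Zealand")

# "XX" is a hypothetical code that is not in the dictionary
codes = ["RU", "US", "XX"]

# Ref keeps the dictionary from being broadcast over;
# unknown codes fall back to `missing` instead of raising a KeyError
full_names = get.(Ref(country_codes), codes, missing)
```

The result can be assigned to a data frame column in the same way as in the example above, and the rows with missing names can then be inspected or fixed by hand.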
I have a dataset like the following:
I would like to get the average salary for each experience level within each job title. I have tried:
second=combine(groupby(new,:experience_level,),:salary_in_usd =>IMD.mean)
but then I realized that if I do the leftjoin here, every job title will get the same salary for a given experience level. My goal is a dataset with the average salary for each experience level within each job title. Does anyone know how to do that? Please use InMemoryDatasets package functions.
Thanks
You need to group by both job title and experience level:
julia> using DataFrames, Statistics
julia> salaries = DataFrame(
job_title = ["Data Analyst","Data Analyst","Data Analyst","Data Analyst","Data Scientist","Data Scientist","Data Scientist","Data Scientist","ML Specialist","ML Specialist","ML Specialist","ML Specialist"],
salary_in_usd = [1,2,3,4,5,6,7,8,9,10,11,12],
work_year = [2020,2020,2022,2022,2020,2020,2022,2022,2020,2020,2022,2022],
experience_level = ["EN","SE","EN","SE","EN","SE","EN","SE","EN","SE","EN","SE"]
);
julia> groups = groupby(salaries,["job_title","experience_level"]);
julia> avg_salaries = combine(groups, "salary_in_usd" => mean => "avg_salary")
6×3 DataFrame
Row │ job_title experience_level avg_salary
│ String String Float64
─────┼──────────────────────────────────────────────
1 │ Data Analyst EN 2.0
2 │ Data Analyst SE 3.0
3 │ Data Scientist EN 6.0
4 │ Data Scientist SE 7.0
5 │ ML Specialist EN 10.0
6 │ ML Specialist SE 11.0
julia> avg_salary_nice = unstack(avg_salaries,"experience_level","avg_salary")
3×3 DataFrame
Row │ job_title EN SE
│ String Float64? Float64?
─────┼────────────────────────────────────
1 │ Data Analyst 2.0 3.0
2 │ Data Scientist 6.0 7.0
3 │ ML Specialist 10.0 11.0
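To make the idea of grouping by two keys concrete, here is a rough sketch of the same split-apply-combine logic without any data frame package, using only the Statistics standard library. The small vectors are made-up sample data, and the tuple-keyed dictionary is just one possible way to hold the groups:

```julia
using Statistics

# Hypothetical sample data, one entry per row
job_title        = ["Data Analyst", "Data Analyst", "Data Scientist", "Data Scientist"]
experience_level = ["EN", "SE", "EN", "EN"]
salary_in_usd    = [1, 2, 5, 7]

# Split: collect salaries under a (job_title, experience_level) key
groups = Dict{Tuple{String,String},Vector{Int}}()
for (j, e, s) in zip(job_title, experience_level, salary_in_usd)
    push!(get!(groups, (j, e), Int[]), s)
end

# Apply and combine: one mean per (job_title, experience_level) pair
avg_salary = Dict(k => mean(v) for (k, v) in groups)
```

The key point is that the group key is the pair of both columns, which is what passing two column names to groupby does.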
See also my tutorial on Split-Apply-Combine (with video on top of the page)
OT: how many job "names"....
I have this dataset:
text sentiment
randomstring positive
randomstring negative
randomstring neutral
random mixed
Then if I run a countmap I have:
"mixed" -> 600
"positive" -> 2000
"negative" -> 3300
"neutral" -> 780
I want to randomly sample from this dataset so that I keep all records of the smallest class (mixed = 600) and the same number of records from each of the other classes (positive = 600, negative = 600, neutral = 600).
I know how to do this in pandas:
df_teste = [data.loc[data.sentiment==i]\
.sample(n=int(data['sentiment']
.value_counts().nsmallest(1)[0]),random_state=SEED) for i in data.sentiment.unique()]
df_teste = pd.concat(df_teste, axis=0, ignore_index=True)
But I am having a hard time doing this in Julia.
Note: I don't want to hardcode which class is the smallest, so I am looking for a solution that infers it from the countmap or freqtable, if possible.
Why do you want a countmap or freqtable solution if you seem to want to use a data frame in the end?
This is how you would do this with DataFrames.jl (but without StatsBase.jl and FreqTables.jl as they are not needed for this):
julia> using Random
julia> using DataFrames
julia> df = DataFrame(text = [randstring() for i in 1:6680],
sentiment = shuffle!([fill("mixed", 600);
fill("positive", 2000);
fill("ngative", 3300);
fill("neutral", 780)]))
6680×2 DataFrame
Row │ text sentiment
│ String String
──────┼─────────────────────
1 │ R3W1KL5b positive
2 │ uCCpNrat ngative
3 │ fwqYTCWG ngative
⋮ │ ⋮ ⋮
6678 │ UJiNrlcw ngative
6679 │ 7aiNOQ1o neutral
6680 │ mbIOIQmQ ngative
6674 rows omitted
julia> gdf = groupby(df, :sentiment);
julia> min_len = minimum(nrow, gdf)
600
julia> df_sampled = combine(gdf) do sdf
return sdf[randperm(nrow(sdf))[1:min_len], :]
end
2400×2 DataFrame
Row │ sentiment text
│ String String
──────┼─────────────────────
1 │ positive O0QsyrJZ
2 │ positive 7Vt70PSh
3 │ positive ebFd8m4o
⋮ │ ⋮ ⋮
2398 │ neutral Kq8Wi2Vv
2399 │ neutral yygOzKuC
2400 │ neutral NemZu7R3
2394 rows omitted
julia> combine(groupby(df_sampled, :sentiment), nrow)
4×2 DataFrame
Row │ sentiment nrow
│ String Int64
─────┼──────────────────
1 │ positive 600
2 │ ngative 600
3 │ mixed 600
4 │ neutral 600
If your data is very large and you need the operation to be very fast, there are more efficient ways to do it, but in most situations this should be fast enough, and the solution does not require any extra packages.
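If you also want the sample to be reproducible, like random_state=SEED in the pandas version, one option is to pass a seeded random number generator explicitly. The following is only a sketch on a plain vector of labels, using the Random standard library. The label counts are scaled down, and the seed 42 and the variable names are arbitrary choices:

```julia
using Random

# Hypothetical labels standing in for the sentiment column (scaled-down counts)
labels = [fill("mixed", 6); fill("positive", 20); fill("negative", 33); fill("neutral", 8)]

rng = MersenneTwister(42)  # fixed seed so that repeated runs give the same sample

# Row indices of each class
idx_by_class = Dict(c => findall(==(c), labels) for c in unique(labels))

# Size of the smallest class, inferred rather than hardcoded
min_len = minimum(length, values(idx_by_class))

# Draw min_len row indices per class without replacement
sampled = reduce(vcat, [shuffle(rng, idx)[1:min_len] for idx in values(idx_by_class)])
```

With a data frame, the resulting index vector could then be used as df[sampled, :].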
I have a loop related question. I have the following folder structure (excerpt):
├───Y2017
│ UDB_cSK17D.csv
│ UDB_cSK17H.csv
│ UDB_cSK17P.csv
│ UDB_cSK17R.csv
│ UDB_cUK17D.csv
│ UDB_cUK17H.csv
│ UDB_cUK17P.csv
│ UDB_cUK17R.csv
└───Y2018
│ UDB_cSK18D.csv
│ UDB_cSK18H.csv
│ UDB_cSK18P.csv
│ UDB_cSK18R.csv
│ UDB_cUK18D.csv
│ UDB_cUK18H.csv
│ UDB_cUK18P.csv
│ UDB_cUK18R.csv
All the files have the same structure. I would like to loop through them and extract data from a selected set of columns. The file names also all have the same structure. Each file name has:
a unique country identifier (e.g. UK, SK in the examples above)
a unique database type (D, H, P, ..., the last character of the file name)
I would like to construct a loop that iterates through the file names. For one country this would work like this:
library(data.table)
ldf<-list()
country_id<-"UK(.*)"
db_id<-"P.csv$"
listcsv<-dir(pattern = paste0(country_id,db_id), recursive = T, full.names = T)
for (k in 1:length(listcsv)){
ldf[[k]]<-fread(listcsv[k],select = c("PB010","PB020"))
}
uk_data<-bind_rows(as.data.frame(do.call(rbind,ldf[])))
This code extracts all the columns I need based on the country identifier I give it (UK in this example). As I have numerous countries in my data set, I would like code that iterates through the countries and updates the country identifier. I have tried the following:
ldf_new<-list()
countries <-c("SK", "UK")
for (i in 1:length(countries)) {
currcty1 <- countries[i]
listcsv<-dir(pattern = paste0(currcty1,"(.*)",db_id), recursive = T, full.names = T)
# print(listcsv)
ldf_new<-fread(listcsv[i],select = c("PB010","PB020"))
}
What happens here is that I get only the results of the last iteration in the variable ldf_new (i.e. UK in this case). Is there any way I could get the results for both SK and UK?
Many thanks in advance!
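Your loop overwrites ldf_new on every iteration. Changing the last line of the loop so that each country's result is stored as its own list element should do the trick. Note also that listcsv[i] reads only the i-th matching file, so the version below reads all the files found for the current country and stacks them:
library(data.table)
db_id <- "P.csv$"
ldf_new <- list()
countries <- c("SK", "UK")
for (currcty1 in countries) {
  listcsv <- dir(pattern = paste0(currcty1, "(.*)", db_id), recursive = TRUE, full.names = TRUE)
  # print(listcsv)
  # read every matching file for this country and stack the rows into one table
  ldf_new[[currcty1]] <- rbindlist(lapply(listcsv, fread, select = c("PB010", "PB020")))
}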
I am using CSV.jl in Julia and I am trying to read data with a DateTime column in the form 10/17/2012 12:00:00 AM. I tried:
dfmt = dateformat"mm/dd/yyyy HH:MM:SS"
data =CSV.File("./Fremont_Bridge_Bicycle_Counter.csv", dateformat=dfmt) |> DataFrame
println(first(data,8))
but I think the AM and PM keep the string from being recognized as a date. Can someone show how to parse this as a date?
You can use the p specifier, which matches AM or PM. With that, your date format would look like this:
dfmt = dateformat"mm/dd/yyyy HH:MM:SS p"
You can see that the parsing is correct:
julia> DateTime("10/17/2012 12:00:00 AM", dfmt)
2012-10-17T00:00:00
To see all the possible format characters, check out the docstring of Dates.DateFormat, which is accessible in the REPL through ?DateFormat.
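As a quick check of the edge cases around noon and midnight, here is a small sketch using only the Dates standard library. The sample strings other than the first are made up for illustration:

```julia
using Dates

dfmt = dateformat"mm/dd/yyyy HH:MM:SS p"

# 12 AM is midnight, 12 PM is noon, and other PM hours are shifted by 12
midnight  = DateTime("10/17/2012 12:00:00 AM", dfmt)
noon      = DateTime("10/17/2012 12:00:00 PM", dfmt)
afternoon = DateTime("10/17/2012 01:30:00 PM", dfmt)

println(midnight)
println(noon)
println(afternoon)
```

The same dfmt value is what gets passed to CSV.File through the dateformat keyword, as in the question.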
With the file Fremont_Bridge_Bicycle_Counter.csv
N1, N2, fecha
hola, 3, 10/03/2020 10:30:00
pepe, 5, 10/03/2020 11:40:50
juan, 5, 03/04/2020 20:10:12
And with the Julia code:
using DataFrames, Dates, CSV
dfmt = dateformat"mm/dd/yyyy HH:MM:SS p"
data =CSV.File("./Fremont_Bridge_Bicycle_Counter.csv", dateformat=dfmt) |> DataFrame
println(first(data,8))
It gives the right result:
3×3 DataFrame
│ Row │ N1 │ N2 │ fecha │
│ │ String │ Int64 │ DateTime │
├─────┼────────┼───────┼─────────────────────┤
│ 1 │ hola │ 3 │ 2020-10-03T10:30:00 │
│ 2 │ pepe │ 5 │ 2020-10-03T11:40:50 │
│ 3 │ juan │ 5 │ 2020-03-04T20:10:12 │
This may be a stupid question, but for the life of me I can't figure out how to get Julia to read a csv file with column names that start with numbers and use them in DataFrames. How does one do this?
For example, say I have the file "test.csv" which contains the following:
,1Y,2Y,3Y
1Y,11,12,13
2Y,21,22,23
If I just use readtable(), I get this:
julia> using DataFrames
julia> df = readtable("test.csv")
2x4 DataFrames.DataFrame
| Row | x | x1Y | x2Y | x3Y |
|-----|------|-----|-----|-----|
| 1 | "1Y" | 11 | 12 | 13 |
| 2 | "2Y" | 21 | 22 | 23 |
What gives? How can I get the column names to be what they're supposed to be, "1Y", "2Y", etc.?
The problem is that in DataFrames, column names are symbols, which aren't meant to (see comment below) start with a number.
You can see this by doing e.g. typeof(:2), which will return Int64, rather than (as you might expect) Symbol. Thus, to get your column names into a usable format, DataFrames has to prefix them with a letter: typeof(:x2) will return Symbol, so :x2 is a valid column name.
Unfortunately, column names in DataFrames can't start with a number.
The code that parses the names enforces this restriction.
I believe this is because of how parsing takes place in Julia: :aa names a symbol, while :2aa is a value (which makes more sense considering that 1:2aa is a range)
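The parsing behaviour described above can be checked directly in plain Julia. This is only a small sketch, and the variable names s, aa and r are arbitrary:

```julia
# A colon followed by a number literal is just that number, not a Symbol
println(typeof(:2))    # an integer type, e.g. Int64 on a 64-bit system
println(typeof(:x2))   # Symbol

# The Symbol constructor accepts any string, including one that starts with a digit
s = Symbol("1Y")
println(s isa Symbol)

# Juxtaposition means multiplication, so 2aa is 2 * aa and 1:2aa is a range
aa = 3
r = 1:2aa
println(r == 1:6)
```

So a symbol whose name starts with a digit can exist, it just cannot be written with the plain :name literal syntax.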
You could just use rename!() after the import:
df = csv"""
,1Y,2Y,3Y
1Y,11,12,13
2Y,21,22,23
"""
rename!(df, Dict(:x1Y =>Symbol("1Y"), :x2Y=>Symbol("2Y"), :x3Y=>Symbol("3Y") ))
2×4 DataFrames.DataFrame
│ Row │ x │ 1Y │ 2Y │ 3Y │
├─────┼──────┼────┼────┼────┤
│ 1 │ "1Y" │ 11 │ 12 │ 13 │
│ 2 │ "2Y" │ 21 │ 22 │ 23 │
Still, you may run into problems later in your code, so it is better to avoid column names that start with numbers.
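If there are many such columns, writing the dictionary by hand gets tedious. A rough sketch of building the mapping programmatically in plain Julia, assuming the prefixed names follow the pattern of an "x" directly followed by a digit (the old_names vector is typed in by hand here for illustration):

```julia
# Column names as they come out of the import
old_names = [:x, :x1Y, :x2Y, :x3Y]

# Strip the leading "x" only from names where it is followed by a digit,
# so a genuine column called :x is left alone
mapping = Dict(n => Symbol(chop(string(n); head = 1, tail = 0))
               for n in old_names if occursin(r"^x\d", string(n)))
```

The resulting dictionary has the same shape as the one passed to rename! above, so it should be usable as rename!(df, mapping).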