Should multiple dummy variables start from different numbers when handling multiple categorical features in a data set? [closed] - dummy-variable

Considering multiple independent categorical features in a data set, we want to encode the values in each categorical column. Should the dummy values be different in each column, or is it reasonable to start the dummies in each column from 0? Consider the following example:
Distance_Group | Airlines_with_HIGHEST_fare | dummies_1
============== | ========================== | =========
G              | Atlantic Airways           | 0
A              | Bahamas Air                | 1
B              | Bahamas Air                | 1
C              | Jet Blue                   | 2
A              | United Airline             | 3

Distance_Group | Airlines_with_LOWEST_fare  | dummies_2
============== | ========================== | =========
F              | Jet Blue                   | 0
E              | United Airline             | 1
A              | Lufthansa                  | 2
G              | Georgia Airways            | 3
Starting each category's encoding from 0, Jet Blue corresponds to dummy value 2 in the first category but to dummy value 0 in the second.
Is this the right encoding for the two categories?
In case the code is needed to clarify the example:
This Python snippet loops over all unique categories while counting up.
# Map each unique airline name to an integer label
map_dict1 = {}
for token, value in enumerate(Data['Airlines_with_HIGHEST_fare'].unique()):
    map_dict1[value] = token
# Replace the airline names in the column with their integer labels
Data['Airlines_with_HIGHEST_fare'].replace(map_dict1, inplace=True)
The same logic is applied to the Airlines_with_LOWEST_fare column to encode those airlines.
I am trying to cluster airline fares based on some numerical features such as Distance_Group, number of passengers, etc. The example above shows the two categorical features (the airline names). All of these features are input cells of a neural network, which is why they have to be numerical: neural networks do not accept categorical variables.

Related

Constrained K-means, R

I am currently using k-means to cluster my data; however, I want each cluster to appear only once in each given year. I have searched for answers for a whole night with no result. Does anyone have ideas on this problem using R, or is there a package I should look into? Thanks.
More background info:
I am trying to replicate the clusters of relationships using the reported gender, education level, and birth year. I am doing this because this is survey data whose respondents are old people, and they sometimes report inaccurate age or education information. My main challenge is that I want to have only one cluster label in each survey year. For example, I do not want to see two cluster 3s in survey year 2000. My data looks like this:
survey year | relationship         | gender | education level | birth year | k-means cluster
2000        | 41 (first daughter)  | 0      | 3               | 1997       | 1
2003        | 41 (first daughter)  | 0      | 3               | 1997       | 1
2000        | 42 (second daughter) | 0      | 4               | 1999       | 2
2003        | 42 (second daughter) | 0      | 4               | 1999       | 2
2000        | 42 (third daughter)  | 0      | 5               | 1999       | 2
2003        | 42 (third daughter)  | 0      | 5               | 2001       | 3
Thanks in advance.
--Update--
A more detailed description of the task:
The data set is a panel survey asking elders about their health status and their relationships (incl. sons, daughters, neighbors). Since these older people are sometimes imprecise about their family members' demographic information, such as birth year or education level, we might need to delete a big part of the data if it does not match across waves.
(E.g., a respondent reported that his first son was 30 years old in 1997, but said his first son was 29 years old in 1999; this record is therefore problematic.) My task is to save as much data as possible when the imprecision is not too high.
Therefore I first mutated columns to check the precision of each family member's reports (e.g., birth year error %in% c(-1, 2)). Next, I run k-means on the family members detected to be imprecise. In this way I save much of the data. Although I did not solve the above problem, it occurs rarely enough that I can almost ignore or drop these observations.
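A minimal sketch of this check-and-cluster step, in case it helps clarify what I mean (the data frame and column names below are illustrative, not my actual variable names):

library(dplyr)

# Illustrative columns: survey_year, relationship, gender,
# education_level, birth_year (as in the table above)
checked <- survey_data %>%
  group_by(relationship) %>%
  mutate(birth_year_error = birth_year - first(birth_year)) %>%
  ungroup()

# Keep only the reports whose birth year drifts too much between waves
imprecise <- filter(checked, !birth_year_error %in% c(-1, 0, 1, 2))

# Run k-means on the demographic columns of those imprecise reports
set.seed(1)
km <- kmeans(imprecise[, c("gender", "education_level", "birth_year")],
             centers = 3)
imprecise$cluster <- km$cluster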

forecasting multiples products data [closed]

I would like to predict the next 5 orders and the quantity of the 3 products in each order.
I am a beginner with R and time series. I have seen examples using ARIMA, but they are applied to a single series, not to multiple products as in my example.
Should I use ARIMA?
What should I do exactly?
Sorry for my bad English. Thank you in advance.
dateordrer,product1,product2,product3
12/01/2012,2565,3254,635
25/01/2012,2270,3254,670
01/03/2012,2000,785,0
05/05/2012,300,3254,750
26/06/2012,3340,0,540
30/06/2012,0,3254,0
21/06/2012,3360,3356,830
01/07/2012,2470,3456,884
03/07/2012,3680,3554,944
05/07/2012,2817,3854,0
09/07/2012,4210,4254,32
09/08/2012,0,3254,1108
13/09/2012,4560,5210,952
25/09/2012,4452,4256,1143
31/09/2012,5090,5469,199
25/11/2012,5100,5569,0
10/12/2012,5440,5789,1323
11/12/2012,5528,5426,1350
Your question is very broad, so it can only be answered in a broad manner. Also, the question has more to do with forecasting theory than with R.
I will give you two pointers to get you started...
It seems you have some pre-processing to do, i.e.: what are your time intervals? What is your basic time unit (week? month?)? You should aggregate the data according to that time unit. For these kinds of operations you can use the tidyr and lubridate packages. Here's an example of your data set after I arranged it a bit:
library(readr)      # read_csv
library(dplyr)      # mutate, select, group_by, summarize_all, right_join
library(tidyr)      # replace_na
library(lubridate)  # month(), year()

data.raw <- read_csv("data1.csv") %>%
  mutate(date.re = as.POSIXct(dateordrer, format = "%d/%m/%Y"))

# Build a complete monthly sequence covering the observed date range
complete.dates <- range(data.raw$date.re)
dates.seq <- seq(complete.dates[1], complete.dates[2], by = "month")
series <- data.frame(sale.month = month(dates.seq), sale.year = year(dates.seq))

# Aggregate to monthly totals and fill months with no orders with zeros
data.post <- data.raw %>%
  mutate(sale.month = month(date.re), sale.year = year(date.re)) %>%
  select(product1:product3, sale.month, sale.year) %>%
  group_by(sale.month, sale.year) %>%
  summarize_all(funs(sum(.))) %>%
  right_join(series) %>%
  replace_na(list(product1 = 0, product2 = 0, product3 = 0))
It would look like this:
sale.month sale.year product1 product2 product3
1 2012 4835 6508 1305
2 2012 0 0 0
3 2012 2000 785 0
4 2012 0 0 0
etc...
Note that for months 2 and 4 you originally had no data, so they appear as 0s.
Pre-processing is not to be taken lightly: I used months as the basic unit, but that might not be right or relevant for your goals. You might even revisit this later and check whether a different aggregation gives better results.
Only after preprocessing can you turn to forecasting. If the three products are independent, they can be predicted independently (e.g., fit ARIMA / Holt-Winters / any other model three times). However, the fact that you have three products which might be correlated with each other points us to hierarchical time series (package hts). The hts() function in that package is able to best-fit forecasting models when there is a statistical relationship between the various products, for example when a certain product is purchased together with another (complementary products) or when being out of stock leads to buying a different product (substitute product).
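As a rough illustration of the "three independent models" route, here is a minimal sketch using the forecast package on the monthly totals produced above (it assumes data.post is sorted chronologically, which you should verify):

library(forecast)

# Turn the monthly totals for product1 into a time series object
p1 <- ts(data.post$product1, start = c(2012, 1), frequency = 12)

fit <- auto.arima(p1)   # let auto.arima pick a reasonable ARIMA order
forecast(fit, h = 5)    # point forecasts and intervals for the next 5 months

# Repeat (or loop) for product2 and product3.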
Since this is far from self-contained for such a broad topic, the next best move for you is to check out the following online book:
Forecasting: Principles and Practice
by Hyndman and Athanasopoulos. I read it when I started with time series; it is a very good book. Specifically, for multiple time series you should cover chapter:
9.4 Forecasting hierarchical or grouped time series
Make sure you also read chapter 7 of that book before moving on to 9.4.

What is membership in community detection? [closed]

I am finding it hard to understand what membership and modularity return and why exactly they are used.
wc <- walktrap.community(karate)
modularity(wc)
membership(wc)
plot(wc, karate)
for the above code I get the following when I execute membership:
[1] 1 1 2 1 5 5 5 1 2 2 5 1 1 2 3 3 5 1 3 1 3 1 3 4 4 4 3 4 2 3 2 2 3
for the above code I get the following when I execute modularity:
[1] 0.3532216
I read the documentation, but it is still a bit confusing.
The result of walktrap.community is a partition of your graph into communities, which in your case are numbered with ids from 1 to 5. The membership function gives a vector with a community id for every node in your graph. So in your case node 1 belongs to community 1, and node 3 belongs to community 2.
The partition of the graph into communities is found by optimizing a so-called modularity function. When you call modularity you get the final value of that function after the optimization is complete. A high modularity value indicates a good partition of the graph into clear communities, while a low value indicates the opposite.
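As a small illustration of how the membership vector is typically used (continuing the karate example above, with plain base R applied to the returned vector):

table(membership(wc))         # how many nodes fall into each community
which(membership(wc) == 2)    # the vertex ids assigned to community 2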

Is it sensible to use R 3D graph to visualize the following [closed]

After creating a JSON data structure and iterating over it, I came up with the following data representation. Columns 1 and 2 are of string type; column 4 represents the sum for the level given in column 3.
Example:
A | B | Level1 | 10
A | C | Level2 | 1
To give a verbal explanation: in country A there are people who can speak country B's language at level 1 expertise, and their total equals 10.
I was thinking of representing this on 3 axes: X = country1, Y = country2, and Z = the levels. Is this sensible, and if so, how can I accomplish it? I have no prior experience with R 3D graphics.
Here is what my actual data looks like; it is in a CSV file.
Here is the data in a data frame:
country1 country2 langLevel frequency
1 gv ca level1 2
2 gv bg level1 1
3 zea li level1 1
4 zea li level3 1
I hope I explained the problem and what I want to accomplish clearly enough. 3D seems like the best way to represent this, but I could be wrong.
Data in CSV Format:
country1,country2,langLevel,frequency
gv,ca,level1,2
gv,bg,level1,1
zea,li,level1,1
zea,li,level3,1
zea,de,level1,26
zea,de,level3,5
zea,el,level1,1
zea,eo,level1,3
zea,en,level1,5
zea,en,level2,34
zea,en,level3,38
zea,en,level4,12
zea,es,level1,7
zea,la,level1,7
zea,zea,level1,5
zea,zea,level3,4
zea,stq,level1,1
zea,sk,level2,1
zea,nl,level4,4
zea,fr,level2,9
zea,fy,level2,1
cdo,cdo,level3,1
cdo,de,level1,23
cdo,de,level2,4
cdo,de,level3,4
cdo,eo,level1,1
cdo,eo,level2,1
cdo,eo,level3,3
cdo,en,level1,6
cdo,en,level2,31
cdo,en,level3,38
cdo,en,level4,17
cdo,es,level1,8
cdo,es,level2,6
cdo,es,level3,3
cdo,fr,level1,14
cdo,fr,level2,11
cdo,fr,level3,6
gd,als,level1,1
gd,af,level1,2
vls,de,level1,32
vls,de,level2,7
vls,de,level3,6
vls,de,level4,3
vls,eo,level1,2
vls,eo,level2,3
vls,eo,level3,3
vls,en,level1,7
vls,en,level2,38
vls,en,level3,53
vls,en,level4,16
vls,es,level1,15
vls,es,level2,4
vls,es,level3,1
vls,es,level4,1
vls,ru,level2,8
vls,ru,level3,1
vls,ja,level1,2
This is what I tried, but it is really hard to see anything clearly in this plot:
library("rgl")
plot3d(template_levels$country1, template_levels$country2,
       template_levels$frequency, col = template_levels$langLevel)
One possibility is to use bar plots with different colors. Here is a solution using the ggplot2 package, assuming your data frame is named df.
On the x axis we see the country2 values, and for each country1 we have a separate facet. Each bar is colored according to langLevel. scales = "free_y" in facet_grid() ensures that we have a different y scale in each facet (because the values differ quite a lot).
library(ggplot2)
ggplot(df, aes(country2, frequency, fill = langLevel)) +
  geom_bar(stat = "identity") +
  facet_grid(country1 ~ ., scales = "free_y")

How to retrieve/calculate citation counts and/or citation indices from a list of authors?

I have a list of authors.
I wish to automatically retrieve/calculate the (ideally yearly) citation index (h-index, m-quotient, g-index, HCP indicator, or ...) for each author.
Author Year Index
first 2000 1
first 2001 2
first 2002 3
I can calculate all of these metrics given the citation counts for each paper of each researcher.
Author Paper Year Citation_count
first 1 2000 1
first 2 2000 2
first 3 2002 3
Despite my efforts, I have not found an API/scraping method capable of this.
My institution has access to a number of services including Web of Science.
Effectively, the main problem is building the citation graph. Once you have that, you can compute any metric you want (e.g. h-index, g-index, PageRank).
Supposing you have a collection of papers (that you've retrieved in some way), you can extract the citations from each of them and build the citation graph. You might find ParsCit useful: an open-source CRF reference-string and logical document structure parsing package, which is also used by CiteSeerX and works great.
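For the calculation side, here is a minimal sketch in R of computing an h-index per author once you have per-paper citation counts like the second table in your question (the data frame name papers is illustrative; the column names follow your table):

# h-index: the largest h such that the author has h papers with >= h citations
h_index <- function(citations) {
  citations <- sort(citations, decreasing = TRUE)
  sum(citations >= seq_along(citations))
}

# One h-index per author
tapply(papers$Citation_count, papers$Author, h_index)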
