Plotting a heatmap with categorical data - R

Is a heatmap possible with categorical data like this?
I want bins on the y axis, year on the x axis, and Firm as the values. Is this possible, and if so, how can I do it in Python?
Firm year bins
0 A 1998 binA
1 A 2000 binB
2 A 1999 binA
3 B 1998 binA
4 B 2000 binE
5 B 1999 binA
6 C 1998 binA
7 C 2000 binE
8 C 1999 binA
9 D 1998 binA
10 D 2000 binA
11 D 1999 binB
12 E 1998 binB
13 E 2000 binA
14 E 1999 binB
15 F 1998 binB
16 F 2000 binC
17 F 1999 binH
18 G 1998 binB
19 G 2000 binE
20 G 1999 binF
21 H 1998 binB
22 H 2000 binA
23 H 1999 binF
24 I 1998 binB
25 I 2000 binF
26 I 1999 binF
27 J 1998 binC
28 J 2000 binA
29 J 1999 binF
30 K 1998 binD
31 K 2000 binE
32 K 1999 binA
33 L 1998 binE
34 L 2000 binH
35 L 1999 binC
36 M 1998 binE
37 M 2000 binH
38 M 1999 binH
One solution with seaborn that I tried did not work:
import pandas as pd
import seaborn as sns

df = pd.pivot(df7['Firm'], df7['year'], df7['bins'])
ax = sns.heatmap(df)
R has the following example:
Heatmap of categorical variable counts
Using R and the following code, I have tentatively been able to construct the heatmap for the above example:
library(magrittr)
library(dplyr)

m <- read.csv("~/df55testR.csv", stringsAsFactors = FALSE, header = TRUE)
m <- m %>% select(2:6)
ml <- reshape2::melt(data = m, id.vars = "Firm", variable.name = "year", value.name = "bin")
ml
ml$Test_Gr <- apply(ml[, 2:3], 1, paste0, collapse = "_")
mw <- reshape2::dcast(ml, Firm ~ bin, fun.aggregate = length)
mwm <- as.matrix(mw[, -1])
mwm
mcm <- t(mwm) %*% mwm
colnames(mcm) <- colnames(mw)[-1]
rownames(mcm) <- colnames(mw)[-1]
gplots::heatmap.2(mcm, trace = "none", col = rev(heat.colors(15)))

You can try groupby with nunique:
grouped = df.groupby(['year', 'bins'])['Firm'].nunique().reset_index()
piv_grouped = grouped.pivot(index='bins', columns='year', values='Firm')
sns.heatmap(piv_grouped, cmap='RdYlGn_r', linewidths=0.5, annot=True)
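For reference, the same count matrix can also be built directly in R with tapply(). This is a minimal sketch, assuming the sample data above sits in a data frame df7 with columns Firm, year, and bins:
# Count distinct firms per (bins, year) cell and plot the matrix.
counts <- with(df7, tapply(Firm, list(bins, year), function(x) length(unique(x))))
counts[is.na(counts)] <- 0  # combinations with no firms
gplots::heatmap.2(counts, trace = "none", dendrogram = "none",
                  Rowv = FALSE, Colv = FALSE, col = rev(heat.colors(15)))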

Related

apply hpfilter to grouped variables with NAs using dplyr

I am trying to apply the hpfilter to one of the variables in my dataset that has a panel structure (id + year) and then add the filtered series to my dataset. It works perfectly fine as long as I do not have any NAs in one of the variables, but it yields an error if one of the ids has missing values. The reason for this is that the hpfilter function does not work with NAs (it yields only NAs).
Here's a reproducible example:
df1 <- read.table(text="country year X1 X2 W
A 1990 10 20 40
A 1991 12 15 NA
A 1992 14 17 41
A 1993 17 NA 44
B 1990 20 NA 45
B 1991 NA 13 61
B 1992 12 12 67
B 1993 14 10 68
C 1990 10 20 70
C 1991 11 14 50
C 1992 12 15 NA
C 1993 14 16 NA
D 1990 20 17 80
D 1991 16 20 91
D 1992 15 21 70
D 1993 14 22 69
", header=TRUE, stringsAsFactors=FALSE)
My approach was to use the dplyr group_by function to apply the hpfilter by country to variable X1:
library(dplyr)
library(mFilter)
library(plm)
# Organizing the data as a panel
df1 <- pdata.frame(df1, index = c("country", "year"))
# Apply hpfilter to X1 and add the trend to the sample
df1 <- df1 %>%
  group_by(country) %>%
  mutate(X1_trend = mFilter::hpfilter(na.exclude(X1), type = "lambda", freq = 6.25)$trend)
However, this yields the following error:
Error in `[[<-.data.frame`(`*tmp*`, col, value = c(11.1695436493374, 12.7688604220353, :
replacement has 15 rows, data has 16
The error occurs because the filtered series is shortened after applying the hp filter (by the NAs).
Since I have a large dataset with many countries it would be really great if there was a workaround, to maybe ignore the NAs when passing the series to the hpfilter, but not removing them. Thank you!
Here is a way to drop the NAs and calculate the trend:
df2 <- df1 %>%
  group_by(country) %>%
  filter(!is.na(X1)) %>%
  pdata.frame(., index = c("country", "year")) %>%
  mutate(X1_trend = mFilter::hpfilter(X1, type = "lambda", freq = 6.25)$trend)
> df2
country year X1 X2 W X1_trend
1 A 1990 10 20 40 11.16954
2 A 1991 12 15 NA 12.76886
3 A 1992 14 17 41 14.18105
4 A 1993 17 NA 44 15.09597
5 B 1990 20 NA 45 15.17450
6 B 1992 12 12 67 14.38218
7 B 1993 14 10 68 13.45663
8 C 1990 10 20 70 12.75429
9 C 1991 11 14 50 12.71858
10 C 1992 12 15 NA 13.35221
11 C 1993 14 16 NA 14.38293
12 D 1990 20 17 80 15.32211
13 D 1991 16 20 91 15.61990
14 D 1992 15 21 70 15.47486
15 D 1993 14 22 69 15.14639
EDIT: To keep missing values in the final output, we do one more operation:
df3 <- merge(df1,df2, by = colnames(df1),all.x = T)
> df3
country year X1 X2 W X1_trend
1 A 1990 10 20 40 11.16954
2 A 1991 12 15 NA 12.76886
3 A 1992 14 17 41 14.18105
4 A 1993 17 NA 44 15.09597
5 B 1990 20 NA 45 15.17450
6 B 1991 NA 13 61 NA
7 B 1992 12 12 67 14.38218
8 B 1993 14 10 68 13.45663
9 C 1990 10 20 70 12.75429
10 C 1991 11 14 50 12.71858
11 C 1992 12 15 NA 13.35221
12 C 1993 14 16 NA 14.38293
13 D 1990 20 17 80 15.32211
14 D 1991 16 20 91 15.61990
15 D 1992 15 21 70 15.47486
16 D 1993 14 22 69 15.14639
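Alternatively, you can skip the merge step entirely with a small helper that writes the filtered values back into a vector of the original length. This is a minimal sketch; the helper name hp_trend_keep_na is my own invention, and it assumes mFilter is installed:
library(dplyr)
library(mFilter)

# Run hpfilter on the non-NA values only and return a vector of the
# original length, with NAs kept in their original positions.
hp_trend_keep_na <- function(x, freq = 6.25) {
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x)
  out[ok] <- as.numeric(mFilter::hpfilter(x[ok], type = "lambda", freq = freq)$trend)
  out
}

df1 %>% group_by(country) %>% mutate(X1_trend = hp_trend_keep_na(X1))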

Reordering a database by loop in R

I am trying to reorder a database with a loop, but it does not work for me. There is too much data to do it one by one.
fact <- rep(1:2, each = 3)
t1 <- c(2006, 2007, 2008, 2000, 2001, 2002)
t2 <- c(2007, 2008, 2009, 2001, 2002, 2004)
var1 <- c(56, 52, 44, 10, 32, 41)
var2 <- c(52, 44, 50, 32, 41, 23)
db1 <- as.data.frame(cbind(fact, t1, t2, var1, var2))
db1
fact t1 t2 var1 var2
1 1 2006 2007 56 52
2 1 2007 2008 52 44
3 1 2008 2009 44 50
4 2 2000 2001 10 32
5 2 2001 2002 32 41
6 2 2002 2004 41 23
I need it to end up like this:
factor <- rep(1:2, each = 4)
t <- c(2006, 2007, 2008, 2009, 2000, 2001, 2002, 2004)
var <- c(56, 52, 44, 50, 10, 32, 41, 23)
db2 <- as.data.frame(cbind(factor, t, var))
db2
factor t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 1 2009 50
5 2 2000 10
6 2 2001 32
7 2 2002 41
8 2 2004 23
Many thanks!
dat1 <- as.data.frame(cbind(fact, t1, var1))
names(dat1) <- c("fact", "t", "var")
dat2 <- as.data.frame(cbind(fact, t2, var2))
names(dat2) <- c("fact", "t", "var")
rbind.data.frame(dat1, dat2)
fact t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 2 2000 10
5 2 2001 32
6 2 2002 41
7 1 2007 52
8 1 2008 44
9 1 2009 50
10 2 2001 32
11 2 2002 41
12 2 2004 23
Or
dat <- db1
names(dat) <- c("fact", rep("t", 2), rep("var", 2))
rbind(dat[,c(1,2,4)], dat[,c(1,3,5)])
fact t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 2 2000 10
5 2 2001 32
6 2 2002 41
7 1 2007 52
8 1 2008 44
9 1 2009 50
10 2 2001 32
11 2 2002 41
12 2 2004 23
Or, as indicated, have a look at the reshape2 package - melt would certainly be of use, e.g.
library(reshape2)
dat <- db1
names(dat) <- c("fact", rep("t", 2), rep("var", 2))
rbind(melt(dat[, c(1, 2, 4)], id.vars = c("fact", "t"), value.name = "var"),
      melt(dat[, c(1, 3, 5)], id.vars = c("fact", "t"), value.name = "var"))
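A more scalable option, if there are many t/var column pairs, is tidyr's pivot_longer(), which reshapes all pairs in one call. A minimal sketch, assuming tidyr >= 1.0 is installed:
library(tidyr)
library(dplyr)

# ".value" keeps one output column per name stub ("t", "var"),
# while the trailing digit goes into a throwaway "set" column.
db1 %>%
  pivot_longer(cols = -fact,
               names_to = c(".value", "set"),
               names_pattern = "([a-z]+)([0-9])") %>%
  select(fact, t, var)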

Calculating quintile-based scores in R

I have a dataframe with year (2006 to 2010), 4 industry sectors, 150 firm names and the net income of these firms. In total I have 750 observations, one for each firm for each year. I want to give scores to firms for their income within each industry year based on the quintiles. So, firms with income in the top 20% within each industry-year get a score of 5, the next 20% get a score of 4 and so on. The bottom 20% get a score of 1.
The sample data is:
Year Industry Firm Income
2006 Chemicals ABC 334.50
2007 Chemicals ABC 388.98
.
.
2006 Pharma XYZ 91.45
.
.
How do I do this in R? I have tried aggregate and tapply along with quantile but am not able to arrive at the logic that should be used for this. Please help.
I tried this just to allocate a score of 1 to the lowest 20%, but it returned an error.
db10$score <- ifelse(db10$income < aggregate(income~Year+industry,db10,quantile,c(0.2)),1,0)
Try this method. First, I'll create sample data on which to test the function below:
y = c(rep(2001,15),rep(2002,15),rep(2003,15))
ind = c("A","B","C","D","E","G","H","I","J","K","L","M","N","O","P")
val = runif(45,10,100)
df = data.frame(y,ind,val)
head(df,20)
y ind val
1 2001 A 63.32011
2 2001 B 85.67976
3 2001 C 86.77527
4 2001 D 32.18319
5 2001 E 49.86626
6 2001 G 57.73214
7 2001 H 18.08216
8 2001 I 22.31012
9 2001 J 44.11174
10 2001 K 54.76902
11 2001 L 41.82495
12 2001 M 64.84514
13 2001 N 59.16529
14 2001 O 61.28870
15 2001 P 84.76561
16 2002 A 83.68185
17 2002 B 45.01354
18 2002 C 62.22964
19 2002 D 98.41717
20 2002 E 19.91548
There are 3 years and industries from A to P. The data frame is ordered by year and then by industry.
The function below takes a year value y and computes the quintile category for all df$val values whose year df$y equals y:
quintile = function(y) {
  x = df$val[df$y == y]
  qn = quantile(x, probs = (0:5)/5)
  as.numeric(cut(x, qn, include.lowest = TRUE))
}
The only thing left is to apply this function to the unique year values:
df$qn = unlist(lapply(unique(df$y), quintile))
Result:
> head(df,20)
y ind val qn
1 2001 A 63.32011 4
2 2001 B 85.67976 5
3 2001 C 86.77527 5
4 2001 D 32.18319 1
5 2001 E 49.86626 2
6 2001 G 57.73214 3
7 2001 H 18.08216 1
8 2001 I 22.31012 1
9 2001 J 44.11174 2
10 2001 K 54.76902 3
11 2001 L 41.82495 2
12 2001 M 64.84514 4
13 2001 N 59.16529 3
14 2001 O 61.28870 4
15 2001 P 84.76561 5
16 2002 A 83.68185 4
17 2002 B 45.01354 1
18 2002 C 62.22964 3
19 2002 D 98.41717 5
20 2002 E 19.91548 1
Maybe there is a much simpler way to implement this...
Grouping by two columns
If you want to calculate quintiles based on a grouping by two columns, y and grp:
y = c(rep(2001,15),rep(2002,15),rep(2003,15))
grp = c("G1","G1","G1","G1","G1","G2","G2","G2","G2","G2","G3","G3","G3","G3","G3")
ind = c("A","B","C","D","E","G","H","I","J","K","L","M","N","O","P")
val = round(runif(45,10,100))
df = data.frame(y,grp,ind,val)
> head(df,20)
y grp ind val
1 2001 G1 A 40
2 2001 G1 B 33
3 2001 G1 C 65
4 2001 G1 D 99
5 2001 G1 E 18
6 2001 G2 G 36
7 2001 G2 H 15
8 2001 G2 I 17
9 2001 G2 J 42
10 2001 G2 K 67
11 2001 G3 L 60
12 2001 G3 M 34
13 2001 G3 N 61
14 2001 G3 O 76
15 2001 G3 P 15
16 2002 G1 A 18
17 2002 G1 B 15
18 2002 G1 C 44
19 2002 G1 D 79
20 2002 G1 E 22
Then use:
quintile = function(z) {
  x = df$val[df$y == z[1] & df$grp == z[2]]
  qn = quantile(x, probs = (0:5)/5)
  as.numeric(cut(x, qn, include.lowest = TRUE))
}
df$qn = as.vector(apply(unique(df[,c("y","grp")]),1, quintile))
Result:
> head(df,20)
y grp ind val qn
1 2001 G1 A 40 3
2 2001 G1 B 33 2
3 2001 G1 C 65 4
4 2001 G1 D 99 5
5 2001 G1 E 18 1
6 2001 G2 G 36 3
7 2001 G2 H 15 1
8 2001 G2 I 17 2
9 2001 G2 J 42 4
10 2001 G2 K 67 5
11 2001 G3 L 60 3
12 2001 G3 M 34 2
13 2001 G3 N 61 4
14 2001 G3 O 76 5
15 2001 G3 P 15 1
16 2002 G1 A 18 2
17 2002 G1 B 15 1
18 2002 G1 C 44 4
19 2002 G1 D 79 5
20 2002 G1 E 22 3
In this example, y would be the year, grp the industry group, ind the firms, and val the income.
Pay attention to the order of c("y","grp") inside the apply and to the column names inside the quintile function. You'll have to replace them with the column names you want.
Be warned that if your groups are small (in this example, 5 firms per group), the quintile breakpoints may not be unique, and an error will pop up.
Using column names from question
quintile = function(z) {
  x = df$Income[df$Year == z[1] & df$Industry == z[2]]
  qn = quantile(x, probs = (0:5)/5)
  as.numeric(cut(x, qn, include.lowest = TRUE))
}
df$qn = as.vector(apply(unique(df[,c("Year","Industry")]),1, quintile))
Before applying this, the data frame df must be ordered by Year and Industry.
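If pulling in dplyr is an option, ntile() offers a much shorter route to the same kind of score. Note that ntile() assigns equal-sized rank buckets, so results can differ slightly from the quantile()/cut() approach when there are ties. A minimal sketch using the question's column names:
library(dplyr)

df %>%
  group_by(Year, Industry) %>%
  mutate(qn = ntile(Income, 5)) %>%  # 1 = bottom 20%, 5 = top 20%
  ungroup()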

Merging two different-size datasets, copying the same row conditionally from the smaller dataset to several rows in the larger dataset

I am completely new to R. I have googled for a solution to my problem for some time, but haven't found an adequate answer so far, so I hope asking here might solve it.
I need to merge two different-size data sets (one holds annual data, df_f, and the other monthly data, df_m). The smaller df_f should be merged into the larger df_m in such a way that rows of df_f are matched conditionally with rows of df_m.
Here is a descriptive example of my problem (with some very basic reproducible numbers):
first dataset
a <- c(1990)
b <- c(1980:1981)
c <- c(1994:1995)
aa <- rep("A", 1)
bb <- rep("B", 2)
cc <- rep("C", 2)
df1 <- data.frame(comp=factor(c(aa, bb, cc)))
df2 <- data.frame(year=factor(c(a, b, c)))
other.columns <- rep("other_columns", length(df1))
df_f <- cbind(df1, df2, other.columns ) # first dataset
second dataset
z <- c(10:12)
x <- c(7:12)
xx <- c(1:9)
v <- c(2:9)
w <- rep(1990, length(z))
e <- rep(1980, length(x))
ee <- rep(1981, length(xx))
r <- rep(1995, length(v))
t <- rep("A", length(z))
y <- rep("B", length(x) + length(xx))
u <- rep("C", length(v))
df3 <- data.frame(month=factor(c(z, x, xx, v)))
df4 <- data.frame(year=factor(c(w, e, ee, r)))
df5 <- data.frame(comp=factor(c(t, y, u)))
df_m <- cbind(df5, df4, df3) # second dataset
Output:
> df_m
comp year month
1 A 1990 10
2 A 1990 11
3 A 1990 12
4 B 1980 7
5 B 1980 8
6 B 1980 9
7 B 1980 10
8 B 1980 11
9 B 1980 12
10 B 1981 1
11 B 1981 2
12 B 1981 3
13 B 1981 4
14 B 1981 5
15 B 1981 6
16 B 1981 7
17 B 1981 8
18 B 1981 9
19 C 1995 2
20 C 1995 3
21 C 1995 4
22 C 1995 5
23 C 1995 6
24 C 1995 7
25 C 1995 8
26 C 1995 9
> df_f
comp year other.columns
1 A 1990 other_columns
2 B 1980 other_columns
3 B 1981 other_columns
4 C 1994 other_columns
5 C 1995 other_columns
I want the rows from df_f placed into df_m (that is, the data from df_f stored in new columns of df_m) according to the conditions on comp, year, and month. Comp (company) must always match, but the year match depends on the month: if month is >6, year is matched directly between the datasets; if month is <7, year - 1 (in df_m) is matched with year (in df_f). Note that a given row of df_f should be placed into several rows of df_m according to these conditions.
The wanted output clarifies the problem and the goal:
Wanted output:
comp year month comp year other.columns
1 A 1990 10 A 1990 other_columns
2 A 1990 11 A 1990 other_columns
3 A 1990 12 A 1990 other_columns
4 B 1980 7 B 1980 other_columns
5 B 1980 8 B 1980 other_columns
6 B 1980 9 B 1980 other_columns
7 B 1980 10 B 1980 other_columns
8 B 1980 11 B 1980 other_columns
9 B 1980 12 B 1980 other_columns
10 B 1981 1 B 1980 other_columns
11 B 1981 2 B 1980 other_columns
12 B 1981 3 B 1980 other_columns
13 B 1981 4 B 1980 other_columns
14 B 1981 5 B 1980 other_columns
15 B 1981 6 B 1980 other_columns
16 B 1981 7 B 1981 other_columns
17 B 1981 8 B 1981 other_columns
18 B 1981 9 B 1981 other_columns
19 C 1995 2 C 1994 other_columns
20 C 1995 3 C 1994 other_columns
21 C 1995 4 C 1994 other_columns
22 C 1995 5 C 1994 other_columns
23 C 1995 6 C 1994 other_columns
24 C 1995 7 C 1995 other_columns
25 C 1995 8 C 1995 other_columns
26 C 1995 9 C 1995 other_columns
Thank you very much in advance! I hope the question is clear enough; it was somewhat difficult to explain.
The basic idea for solving your problem is to add an extra column containing the year that should be used for matching. I will use the package dplyr for this and the other manipulation steps.
Before the tables can be combined, the year and month columns must be converted to numeric:
library(dplyr)
df_m <- mutate(df_m,
               year = as.numeric(as.character(year)),
               month = as.numeric(as.character(month)))
df_f <- mutate(df_f, year = as.numeric(as.character(year)))
The reason is that you want to do a numerical comparison on the month (month > 6) and subtract one from the year; you cannot do either with a factor.
Then I add the column to be used for matching:
df_m <- mutate(df_m, match_year = ifelse(month >= 7, year, year - 1))
And in the last step, I join the two tables:
df_new <- left_join(df_m, df_f, by = c("comp", "match_year" = "year"))
The argument by determines which columns of the two data frames should be matched. The output agrees with your result:
## comp year month match_year other.columns
## 1 A 1990 10 1990 other_columns
## 2 A 1990 11 1990 other_columns
## 3 A 1990 12 1990 other_columns
## 4 B 1980 7 1980 other_columns
## 5 B 1980 8 1980 other_columns
## 6 B 1980 9 1980 other_columns
## 7 B 1980 10 1980 other_columns
## 8 B 1980 11 1980 other_columns
## 9 B 1980 12 1980 other_columns
## 10 B 1981 1 1980 other_columns
## 11 B 1981 2 1980 other_columns
## 12 B 1981 3 1980 other_columns
## 13 B 1981 4 1980 other_columns
## 14 B 1981 5 1980 other_columns
## 15 B 1981 6 1980 other_columns
## 16 B 1981 7 1981 other_columns
## 17 B 1981 8 1981 other_columns
## 18 B 1981 9 1981 other_columns
## 19 C 1995 2 1994 other_columns
## 20 C 1995 3 1994 other_columns
## 21 C 1995 4 1994 other_columns
## 22 C 1995 5 1994 other_columns
## 23 C 1995 6 1994 other_columns
## 24 C 1995 7 1995 other_columns
## 25 C 1995 8 1995 other_columns
## 26 C 1995 9 1995 other_columns
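For completeness, the same conditional join can be done in base R with merge(); this is a minimal sketch, assuming the numeric conversions above have already been applied:
# Year used for matching: the current year for months >= 7,
# the previous year otherwise.
df_m$match_year <- ifelse(df_m$month >= 7, df_m$year, df_m$year - 1)

# Left join: keep every df_m row. Note that merge() may reorder rows.
df_new <- merge(df_m, df_f,
                by.x = c("comp", "match_year"), by.y = c("comp", "year"),
                all.x = TRUE)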

Removing rows of a data frame if the number of NAs in a column is larger than 3

I have a data frame (panel data): the Ctry column indicates the name of the country for each row. If, in any given column (for example, Carx), the number of NAs for a country is larger than 3, I want to drop that country from my data frame. For example,
Country A has 2 NAs
Country B has 4 NAs
Country C has 3 NAs
I want to drop country B from my data frame. My data frame looks like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) together with ave() to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs of each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data = DF$Carx, INDICES = DF$Ctry, FUN = function(x) sum(is.na(x)))
validCtry <- names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT: if you have more columns to check, you can adapt the previous code as follows:
res <- by(data = DF, INDICES = DF$Ctry,
          FUN = function(x) {
            sum(is.na(x$Carx)) <= 3 &&
              sum(is.na(x$Barx)) <= 3 &&
              sum(is.na(x$Tarx)) <= 3
          })
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
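If the list of columns keeps growing, a loop-free variant with colSums() keeps the whole condition in one place. A minimal sketch; the columns Barx and Tarx are hypothetical, as in the EDIT above:
cols <- c("Carx", "Barx", "Tarx")  # hypothetical set of columns to check
res <- by(DF, DF$Ctry, function(x) all(colSums(is.na(x[cols])) <= 3))
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]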
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the solutions in base R. If the other solutions are fast enough, just ignore this one.
require(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
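For genuinely large panels, data.table is also worth benchmarking; a minimal sketch, assuming the data.table package is installed:
library(data.table)

DT <- as.data.table(DF)
# Keep a group's rows only if it has at most 3 NAs in Carx.
DT[, if (sum(is.na(Carx)) <= 3) .SD, by = Ctry]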
