Regression for multiple countries over time in R

My data set looks as follows:
country year Var1 Var2 Var3 Var4
1 AT 2010 0.27246094 15 0 0
2 BE 2010 0.14729459 53 0 1
3 BG 2010 0.08744856 3 0 0
4 CY 2010 0.15369261 6 0 0
5 CZ 2010 0.20284360 6 0 1
6 DE 2010 0.12541694 37 0 0
7 AT 2011 0.35370741 16 0 0
8 BE 2011 0.14572864 54 0 0
9 BG 2011 0.11929461 4 0 0
10 CY 2011 0.24550898 7 0 1
11 CZ 2011 0.23333333 7 0 0
12 DE 2011 0.21943574 38 0 0
13 AT 2012 0.35073780 17 0 0
14 BE 2012 0.19700000 55 0 0
15 BG 2012 0.08472803 5 0 0
16 CY 2012 0.16949153 8 0 0
17 CZ 2012 0.26914661 8 0 0
18 DE 2012 0.22037422 39 0 0
19 AT 2013 0.34716599 18 0 1
20 BE 2013 0.28906250 56 0 0
21 BG 2013 0.14602216 6 0 1
22 CY 2013 0.44023904 9 0 0
23 CZ 2013 0.35146022 9 0 1
24 DE 2013 0.25500323 40 0 1
It covers 4 years for each of the 6 countries.
What I want to do is run the regression Var2 ~ Var1.
Since I have multiple years, I considered using time series. So, first I changed the year column from character to Date:
library(dplyr)
mutate(testdf, year = as.Date(year, format= "%Y"))
Then, I tried to run my regression and received this error:
library(plm)
reg1 <- plm(Var2 ~ Var1 + Var3 + Var4, data = df)
summary(reg1)
Error in pdim.default(index[[1]], index[[2]]) : duplicate couples (id-time)
Did I miss a step before running the regression or am I just using the wrong function?
I also tried to run the regression using the lmer function (using time and controlling for country differences):
library(lme4)
library(lmerTest)
reg2 <- lmer(Var2 ~ time(Var1) + Var3 + Var4 + (1 | country), data = df, REML = F)
summary(reg2)
Here I got a result, but I am completely unsure whether this is the way it should be done. Would this be a possibility, or is something different needed?

A Date requires a month and a day; I suggest using the beginning of the year via ISOdate.
testdf <- transform(testdf, year = as.Date(ISOdate(year, 1, 1)))  ## transform() is from base R
head(testdf, 3)
# country year Var1 Var2 Var3 Var4
# 1 AT 2010-01-01 0.27246094 15 0 0
# 2 BE 2010-01-01 0.14729459 53 0 1
# 3 BG 2010-01-01 0.08744856 3 0 0
In the plm call you probably want to define index= and select a model=; see ?plm.
library(plm)
reg1 <- plm(Var2 ~ Var1 + Var3 + Var4, data=testdf, index=c("country", "year"),
model="random")
Result:
summary(reg1)
# Oneway (individual) effect Random Effect Model
# (Swamy-Arora's transformation)
#
# Call:
# plm(formula = Var2 ~ Var1 + Var3 + Var4, data = testdf, model = "random",
# index = c("country", "year"))
#
# Balanced Panel: n = 6, T = 4, N = 24
#
# Effects:
# var std.dev share
# idiosyncratic 0.8135 0.9019 0.001
# individual 615.6029 24.8113 0.999
# theta: 0.9818
#
# Residuals:
# Min. 1st Qu. Median 3rd Qu. Max.
# -1.416570 -0.789216 -0.064901 0.728004 1.392325
#
# Coefficients:
# Estimate Std. Error z-value Pr(>|z|)
# (Intercept) 18.47629 9.76600 1.8919 0.0585 .
# Var1 12.95722 2.84290 4.5577 5.171e-06 ***
# Var4 0.32221 0.40056 0.8044 0.4212
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Total Sum of Squares: 32.753
# Residual Sum of Squares: 15.806
# R-Squared: 0.5174
# Adj. R-Squared: 0.47144
# Chisq: 22.5147 on 2 DF, p-value: 1.2912e-05
Data:
testdf <- structure(list(country = structure(c(1L, 2L, 3L, 4L, 5L, 6L,
1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L,
5L, 6L), .Label = c("AT", "BE", "BG", "CY", "CZ", "DE"), class = "factor"),
year = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2011L,
2011L, 2011L, 2011L, 2011L, 2011L, 2012L, 2012L, 2012L, 2012L,
2012L, 2012L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L),
Var1 = c(0.27246094, 0.14729459, 0.08744856, 0.15369261,
0.2028436, 0.12541694, 0.35370741, 0.14572864, 0.11929461,
0.24550898, 0.23333333, 0.21943574, 0.3507378, 0.197, 0.08472803,
0.16949153, 0.26914661, 0.22037422, 0.34716599, 0.2890625,
0.14602216, 0.44023904, 0.35146022, 0.25500323), Var2 = c(15L,
53L, 3L, 6L, 6L, 37L, 16L, 54L, 4L, 7L, 7L, 38L, 17L, 55L,
5L, 8L, 8L, 39L, 18L, 56L, 6L, 9L, 9L, 40L), Var3 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Var4 = c(0L, 1L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 1L, 0L, 1L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24"
))
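As for the lmer attempt in the question: a sketch of the random-intercept specification without the time() wrapper (my own assumption, not a recommendation over plm) could look like this, using the testdf object defined above:

```r
library(lme4)

# Var3 is constant (all zeros) in this sample, so it is omitted here;
# lme4 would otherwise flag it as rank-deficient and drop it.
# Var1 enters as an ordinary fixed effect; country is a random intercept.
reg2 <- lmer(Var2 ~ Var1 + Var4 + (1 | country), data = testdf, REML = FALSE)
summary(reg2)
```

Whether a mixed model or a panel model is preferable depends on the research question; the plm approach above treats the country effects explicitly as panel effects.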


Create categorical variable from mutually exclusive dummy variables [duplicate]

How can I create a categorical variable from mutually exclusive dummy variables (taking values 0/1)?
Basically I am looking for the exact opposite of this solution: (https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781787124479/1/01lvl1sec22/creating-dummies-for-categorical-variables).
Would appreciate a base R solution.
For example, I have the following data:
dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L),
.Dim = c(10L, 4L),
.Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))
State.NJ State.NY State.TX State.VA
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 1 0 0 0
[4,] 0 0 0 1
[5,] 0 1 0 0
[6,] 0 0 1 0
[7,] 1 0 0 0
[8,] 0 0 0 1
[9,] 0 0 1 0
[10,] 0 0 0 1
I would like to get the following results
state
1 NJ
2 NY
3 NJ
4 VA
5 NY
6 TX
7 NJ
8 VA
9 TX
10 VA
cat.var <- structure(list(state = structure(c(1L, 2L, 1L, 4L, 2L, 3L, 1L,
4L, 3L, 4L), .Label = c("NJ", "NY", "TX", "VA"), class = "factor")),
class = "data.frame", row.names = c(NA, -10L))
# toy data
df <- data.frame(a = c(1,0,0,0,0), b = c(0,1,0,1,0), c = c(0,0,1,0,1))
df$cat <- apply(df, 1, function(i) names(df)[which(i == 1)])
Result:
> df
a b c cat
1 1 0 0 a
2 0 1 0 b
3 0 0 1 c
4 0 1 0 b
5 0 0 1 c
To generalize, you'll need to play with the df and names(df) part, but you get the drift. One option would be to make a function, e.g.,
catmaker <- function(data, varnames, catname) {
data[,catname] <- apply(data[,varnames], 1, function(i) varnames[which(i == 1)])
return(data)
}
newdf <- catmaker(data = df, varnames = c("a", "b", "c"), catname = "newcat")
One nice aspect of the functional approach is that it is robust to variations in the order of names in the vector of column names you feed into it. I.e., varnames = c("c", "a", "b") produces the same result as varnames = c("a", "b", "c").
P.S. You added some example data after I posted this. The function works on your example, as long as you convert dummy.df to a data frame first, e.g., catmaker(data = as.data.frame(dummy.df), varnames = colnames(dummy.df), "State") does the job.
You can use tidyr::pivot_longer:
library(dplyr)
library(tidyr)
as_tibble(dummy.df) %>%
  mutate(id = 1:n()) %>%
  pivot_longer(-id, values_to = "Value",
               names_to = c("txt", "State"), names_sep = "\\.") %>%
  filter(Value == 1) %>%
  select(State)
#> # A tibble: 10 x 1
#> State
#> <chr>
#> 1 NJ
#> 2 NY
#> 3 NJ
#> 4 VA
#> 5 NY
#> 6 TX
#> 7 NJ
#> 8 VA
#> 9 TX
#> 10 VA
You can do:
states <- names(dummy.df)[max.col(dummy.df)]
Or if, as in your example, it's a matrix, you'd need to use colnames():
states <- colnames(dummy.df)[max.col(dummy.df)]
Then just clean it up with sub():
sub(".*\\.", "", states)
# [1] "NJ" "NY" "NJ" "VA" "NY" "TX" "NJ" "VA" "TX" "VA"
EDIT: with your data
One way with model.matrix for dummy creation and matrix multiplication:
dummy.df<-structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L), .Dim = c(10L, 4L
), .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX",
"State.VA")))
level_names <- colnames(dummy.df)
# use matrix multiplication to extract wanted level
res <- dummy.df%*%1:ncol(dummy.df)
# clean up
res <- as.numeric(res)
factor(res, labels = level_names)
#> [1] State.NJ State.NY State.NJ State.VA State.NY State.TX State.NJ
#> [8] State.VA State.TX State.VA
#> Levels: State.NJ State.NY State.TX State.VA
General reprex :
# create factor and dummy target y
dfr <- data.frame(vec = gl(n = 3, k = 3, labels = letters[1:3]),
y = 1:9)
dfr
#> vec y
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 4
#> 5 b 5
#> 6 b 6
#> 7 c 7
#> 8 c 8
#> 9 c 9
# dummies creation
dfr_dummy <- model.matrix(y ~ 0 + vec, data = dfr)
# use matrix multiplication to extract wanted level
res <- dfr_dummy%*%c(1,2,3)
# clean up
res <- as.numeric(res)
factor(res, labels = letters[1:3])
#> [1] a a a b b b c c c
#> Levels: a b c

How to take an Average of + or - SD

I have data where [1] the dependent variable is derived from [2] a controlled independent variable. The mean and SD are taken from [1].
(a) and this is the result of SD:
Year Species Pop_Index
1 1994 Corn Bunting 2.082483
5 1998 Corn Bunting 2.048155
10 2004 Corn Bunting 2.061617
15 2009 Corn Bunting 2.497792
20 1994 Goldfinch 1.961236
25 1999 Goldfinch 1.995600
30 2005 Goldfinch 2.101403
35 2010 Goldfinch 2.138496
40 1995 Grey Partridge 2.162136
(b) And the result of mean:
Year Species Pop_Index
1 1994 Corn Bunting 2.821668
5 1998 Corn Bunting 2.916975
10 2004 Corn Bunting 2.662797
15 2009 Corn Bunting 4.171538
20 1994 Goldfinch 3.226108
25 1999 Goldfinch 2.452807
30 2005 Goldfinch 2.954816
35 2010 Goldfinch 3.386772
40 1995 Grey Partridge 2.207708
(c) This is the Code for SD:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.0824833420524, 2.04815530904537,
2.06161673349657, 2.49779159320587, 1.96123572400404, 1.99559986715288,
2.10140285528351, 2.13849611018009, 2.1621364896722)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(d) This is the code for mean:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.82166841455814, 2.91697463618566,
2.66279663056763, 4.17153795031277, 3.22610845074252, 2.45280743991572,
2.95481600904799, 3.38677188055508, 2.20770835158744)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(e) And this is the code used to take the mean of mean Pop_Index over the years:
df2 <- aggregate(Pop_Index ~ Year, df1, mean)
(f) And this is the result:
Year Pop_Index
1 1994 3.023888
2 1995 2.207708
3 1998 2.916975
4 1999 2.452807
5 2004 2.662797
6 2005 2.954816
7 2009 4.171538
8 2010 3.386772
Now it wouldn't make sense for me to take the average of the SDs by repeating the same procedure with mean().
I have looked online and found someone in a similar predicament with this data:
Month: January
Week 1 Mean: 67.3 Std. Dev: 0.8
Week 2 Mean: 80.5 Std. Dev: 0.6
Week 3 Mean: 82.4 Std. Dev: 0.8
And the response:
"With equal samples size, which is what you have, the standard deviation you are looking for is:
Sqrt [ (.64 + .36 + .64) / 3 ] = 0.739369"
How would I do this in R, or is there another way of doing this? Because I want to plot error bars and the dataset plotted is like that of (f), and it would be absurd to plot the SD of (a) against this because the vector lengths would differ.
Sample from original data.frame with a few columns and many rows not included:
structure(list(GRIDREF = structure(c(1L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L), .Label = c("SP8816", "SP9212", "SP9322",
"SP9326", "SP9440", "SP9513", "SP9632", "SP9939", "TF7133", "TF9437"
), class = "factor"), Lat = c(51.83568688, 51.83568688, 51.79908899,
51.88880822, 51.92476157, 52.05042795, 51.80757645, 51.97818159,
52.04057068, 52.86730817, 52.89542895), Long = c(-0.724233561,
-0.724233561, -0.667258035, -0.650074995, -0.648996758, -0.630626734,
-0.62349292, -0.603710436, -0.558026241, 0.538966197, 0.882597783
), Year = c(2006L, 2007L, 1999L, 2004L, 1995L, 2009L, 2011L,
2007L, 2011L, 1996L, 2007L), Species = structure(c(4L, 7L, 5L,
10L, 4L, 6L, 8L, 3L, 2L, 9L, 1L), .Label = c("Blue Tit", "Buzzard",
"Canada Goose", "Collared Dove", "Greenfinch", "Jackdaw", "Linnet",
"Meadow Pipit", "Robin", "Willow Warbler"), class = "factor"),
Pop_Index = c(0L, 0L, 2L, 0L, 1L, 0L, 1L, 4L, 0L, 0L, 8L)), row.names = c(1L,
100L, 1000L, 2000L, 3000L, 4000L, 5000L, 6000L, 10000L, 20213L,
30213L), class = "data.frame")
A look into this data.frame:
GRIDREF Lat Long Year Species Pop_Index TempJanuary
1 SP8816 51.83569 -0.7242336 2006 Collared Dove 0 2.128387
100 SP8816 51.83569 -0.7242336 2007 Linnet 0 4.233226
1000 SP9212 51.79909 -0.6672580 1999 Greenfinch 2 5.270968
2000 SP9322 51.88881 -0.6500750 2004 Willow Warbler 0 4.826452
3000 SP9326 51.92476 -0.6489968 1995 Collared Dove 1 4.390322
4000 SP9440 52.05043 -0.6306267 2009 Jackdaw 0 2.934516
5000 SP9513 51.80758 -0.6234929 2011 Meadow Pipit 1 3.841290
6000 SP9632 51.97818 -0.6037104 2007 Canada Goose 4 7.082580
10000 SP9939 52.04057 -0.5580262 2011 Buzzard 0 3.981290
20213 TF7133 52.86731 0.5389662 1996 Robin 0 3.532903
30213 TF9437 52.89543 0.8825978 2007 Blue Tit 8 7.028710
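The quoted formula (square root of the mean of the variances, valid under the equal-sample-size assumption) can be sketched in R as follows; the data frame below is rebuilt from the SD table in (a)/(c), and the names are my own:

```r
# Per-year/species SDs, as in table (a) above
df_sd <- data.frame(
  Year = c(1994, 1998, 2004, 2009, 1994, 1999, 2005, 2010, 1995),
  Species = c(rep("Corn Bunting", 4), rep("Goldfinch", 4), "Grey Partridge"),
  Pop_Index = c(2.082483, 2.048155, 2.061617, 2.497792,
                1.961236, 1.995600, 2.101403, 2.138496, 2.162136)
)

# Pooled SD per year: square the SDs to get variances, average them,
# then take the square root -- the same arithmetic as the quoted
# sqrt[(.64 + .36 + .64) / 3] example.
pooled_sd <- aggregate(Pop_Index ~ Year, df_sd, function(s) sqrt(mean(s^2)))
pooled_sd
```

This yields one pooled SD per year, matching the length of the mean table in (f), so the two can be combined for error bars.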

How to subtract different columns from each other in a dataframe based on a condition

In my dataframe for my university project I have to subtract yr_renovated from the year in true_date; if the house was never renovated, then yr_built should be subtracted from it instead.
Here is a small extract
yr_built yr_renovated true_date
1 1995 0 2014-12-30
2 2006 0 2014-09-12
3 2008 0 2014-07-24
4 1973 0 2014-08-01
5 1998 0 2015-02-25
6 1971 0 2014-10-09
7 2004 0 2014-08-18
8 1954 0 2014-06-20
9 1977 0 2014-08-22
10 1968 0 2015-01-08
11 1924 0 2015-03-03
12 1953 0 2015-04-28
13 1941 1998 2014-08-14
I first converted true_date into a Date using as.Date, then extracted the years into an object called yr_sold.
House_Data$true_date <- as.Date(House_Data$true_date,na.mr=TRUE)
yr_sold <- as.numeric(format(House_Data$true_date, "%Y"))
The part I'm stuck on is how to write a condition so that yr_renovated is subtracted from yr_sold when yr_renovated > 0, and yr_built is subtracted when yr_renovated <= 0.
for(i in 1:yr_sold) {
if(House_Data$yr_renovated <= 0) {
yr_since_renovations <- yr_sold - House_Data$yr_built
} else {
yr_since_renovations <- yr_sold-House_Data$yr_renovated
}
}
This is what I wrote for it, and it seems to work, but R warns me that there were 50 or more warnings.
Are there any better suggestions on how to do this?
summary(yr_since_renovations)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 19.00 42.00 43.11 62.00 115.00
We also had to create a summary of yr_since_renovations, and it was strongly hinted that something would be odd about it, but I do not see anything, so I'm wondering if my code has a mistake in it.
EDIT
After fixing the code I found 2 negative results that need to be removed: there were negative values in years_since_renovation, and the rows that caused them need to be removed from my dataset.
House_Data <- House_Data[!(years_since_renovation < 0),]
This is what I tried, and it does seem to remove the 2 negative values, but when I re-run years_since_renovation <- with(House_Data, true_date_year - ifelse(yr_renovated == 0, yr_built, yr_renovated)) to update the variable, I receive the error "longer object length is not a multiple of shorter object length".
Get the year from true_date and subtract the values using ifelse
df$true_date <- as.Date(df$true_date)
df$true_date_year <- as.integer(format(df$true_date, "%Y"))
with(df, true_date_year - ifelse(yr_renovated == 0, yr_built, yr_renovated))
#[1] 19 8 6 41 17 43 10 60 37 47 91 62 16
data
df <- structure(list(yr_built = c(1995L, 2006L, 2008L, 1973L, 1998L,
1971L, 2004L, 1954L, 1977L, 1968L, 1924L, 1953L, 1941L), yr_renovated = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1998L), true_date = structure(c(9L,
7L, 2L, 3L, 11L, 8L, 5L, 1L, 6L, 10L, 12L, 13L, 4L), .Label = c("2014-06-20",
"2014-07-24", "2014-08-01", "2014-08-14", "2014-08-18", "2014-08-22",
"2014-09-12", "2014-10-09", "2014-12-30", "2015-01-08", "2015-02-25",
"2015-03-03", "2015-04-28"), class = "factor")), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"))
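Regarding the "longer object length" error in the EDIT: it typically means the filtering vector was computed from an earlier, differently-sized version of the data frame. One sketch that avoids it (names hypothetical, toy data of my own) is to store the computed vector as a column, so it always shrinks along with the rows it describes:

```r
# Toy data: the second row produces a negative "years since" value
df2 <- data.frame(yr_built = c(1995, 2016),
                  yr_renovated = c(0, 0),
                  true_date_year = c(2014, 2014))

# Keep the derived values inside the data frame instead of a free-standing vector
df2$years_since_renovation <- with(df2, true_date_year -
                                   ifelse(yr_renovated == 0, yr_built, yr_renovated))

# Filtering on the column keeps lengths consistent on every re-run
df2 <- df2[df2$years_since_renovation >= 0, ]
```

Because the column is recomputed from the current data frame, re-running the ifelse line after filtering no longer mixes objects of different lengths.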

Apply function across multiple columns

Please find here a very small subset of a long data.table I am working with
dput(dt)
structure(list(id = 1:15, pnum = c(4298390L, 4298390L, 4298390L,
4298558L, 4298558L, 4298559L, 4298559L, 4299026L, 4299026L, 4299026L,
4299026L, 4300436L, 4300436L, 4303566L, 4303566L), invid = c(15L,
101L, 102L, 103L, 104L, 103L, 104L, 106L, 107L, 108L, 109L, 87L,
111L, 2L, 60L), fid = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L), .Label = c("CORN", "DowCor",
"KIM", "Texas"), class = "factor"), dom_kn = c(1L, 0L, 0L, 0L,
1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), prim_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), pat_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), net_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), age_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), legclaims = c(5L,
0L, 0L, 2L, 5L, 2L, 5L, 0L, 0L, 0L, 0L, 5L, 0L, 5L, 2L), n_inv = c(3L,
3L, 3L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L)), .Names = c("id",
"pnum", "invid", "fid", "dom_kn", "prim_kn", "pat_kn", "net_kn",
"age_kn", "legclaims", "n_inv"), class = "data.frame", row.names = c(NA,
-15L))
I am looking to apply a tweaked greater than comparison in 5 different columns.
Within each pnum (patent), there are multiple invid (inventors). I want to compare the values of the columns dom_kn, prim_kn, pat_kn, net_kn, and age_kn per row, to the values in the other rows with the same pnum. The comparison is simply > and if the value is indeed bigger than the other, one "point" should be attributed.
So for the first row pnum == 4298390 and invid == 15, you can see the values in the five columns are all 1, while the values for invid == 101 | 102 are all zero. This means that if we individually compare (is greater than?) each value in the first row to each cell in the second and third row, the total sum would be 10 points. In every single comparison, the value in the first row is bigger and there are 10 comparisons.
The number of comparisons is by design 5 * (n_inv -1).
The result I am looking for for row 1 should then be 10 / 10 = 1.
For pnum == 4298558 the columns net_kn and age_kn both have values 1 in the two rows (for invid 103 and 104), so each should get 0.5 points (if there were three inventors with value 1, everyone should get 0.33 points). The same goes for pnum == 4298559.
For the next pnum == 4299026 all values are zero so every comparison should result in 0 points.
Thus note the difference: There are three different dyadic comparisons
1 > 0 --> assign 1
1 = 1 --> assign 1 / number of positive values in column subset
0 = 0 --> assign 0
Desired result
An extra column result in the data.table with values 1 0 0 0.2 0.8 0.2 0.8 0 0 0 0 1 0 0.8 0.2
Any suggestions on how to compute this efficiently?
Thanks!
library(data.table)
setDT(dt)  # the dput above creates a data.frame; convert to data.table first
vars = grep('_kn', names(dt), value = T)
# all you need to do is simply assign the correct weight and sum the numbers up
dt[, res := 0]
for (var in vars)
dt[, res := res + get(var) / .N, by = c('pnum', var)]
# normalize
dt[, res := res/sum(res), by = pnum]
# id pnum invid fid dom_kn prim_kn pat_kn net_kn age_kn legclaims n_inv res
# 1: 1 4298390 15 CORN 1 1 1 1 1 5 3 1.0
# 2: 2 4298390 101 CORN 0 0 0 0 0 0 3 0.0
# 3: 3 4298390 102 CORN 0 0 0 0 0 0 3 0.0
# 4: 4 4298558 103 DowCor 0 0 0 1 1 2 2 0.2
# 5: 5 4298558 104 DowCor 1 1 1 1 1 5 2 0.8
# 6: 6 4298559 103 DowCor 0 0 0 1 1 2 2 0.2
# 7: 7 4298559 104 DowCor 1 1 1 1 1 5 2 0.8
# 8: 8 4299026 106 Texas 0 0 0 0 0 0 4 NaN
# 9: 9 4299026 107 Texas 0 0 0 0 0 0 4 NaN
#10: 10 4299026 108 Texas 0 0 0 0 0 0 4 NaN
#11: 11 4299026 109 Texas 0 0 0 0 0 0 4 NaN
#12: 12 4300436 87 KIM 1 1 1 1 1 5 2 1.0
#13: 13 4300436 111 KIM 0 0 0 0 0 0 2 0.0
#14: 14 4303566 2 DowCor 1 1 1 1 1 5 2 0.8
#15: 15 4303566 60 DowCor 1 0 0 1 0 2 2 0.2
Dealing with the above NaN case (arguably the correct answer), is left to the reader.
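One possible cleanup for that NaN case (my own sketch, not part of the answer) uses is.nan() with data.table's update-by-reference; the toy column below mirrors the all-zero groups:

```r
library(data.table)

# Toy res column mimicking the output above: one normal group, one NaN group
dt2 <- data.table(pnum = c(4298390, 4298390, 4299026, 4299026),
                  res  = c(1.0, 0.0, NaN, NaN))

# is.na(NaN) is also TRUE in R, but is.nan() states the intent explicitly
dt2[is.nan(res), res := 0]
```

Whether 0 is the right replacement depends on how the "no winners" groups should be scored.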
Here's a fastish solution using dplyr:
library(dplyr)
dt %>%
  group_by(pnum) %>%  # group by pnum
  # give a 1 if the value is the max and not 0, only for the 'kn' columns
  mutate_each(funs(. == max(.) & max(.) != 0), ends_with('kn')) %>%
  # correct for multiple maximums
  mutate_each(funs(. / sum(.)), ends_with('kn')) %>%
  # remove all non-'kn' columns
  select(ends_with('kn')) %>%
  # make a new data frame with x = the row sums for each individual
  # and y = the column sums
  do(data.frame(x = rowSums(.[-1]), y = sum(.[-1]))) %>%
  # divide by y (we could just use /5 if we always have five columns)
  mutate(out = x / y)
giving your desired output in the column out:
Source: local data frame [15 x 4]
Groups: pnum [6]
pnum x y out
(int) (dbl) (dbl) (dbl)
1 4298390 5 5 1.0
2 4298390 0 5 0.0
3 4298390 0 5 0.0
4 4298558 1 5 0.2
5 4298558 4 5 0.8
6 4298559 1 5 0.2
7 4298559 4 5 0.8
8 4299026 NaN NaN NaN
9 4299026 NaN NaN NaN
10 4299026 NaN NaN NaN
11 4299026 NaN NaN NaN
12 4300436 5 5 1.0
13 4300436 0 5 0.0
14 4303566 4 5 0.8
15 4303566 1 5 0.2
The NaNs come from the groups with no winners; convert them back using, e.g.:
x[is.na(x)] <- 0

R: Assign colors to values/color gradient palette

I have a sample dataframe which looks like this:
reg1 <- structure(list(REGION = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("REG1", "REG2"), class = "factor"),STARTYEAR = c(1959L, 1960L, 1961L, 1962L, 1963L, 1964L, 1965L, 1966L, 1967L, 1945L, 1946L, 1947L, 1948L, 1949L), ENDYEAR = c(1960L, 1961L, 1962L, 1963L, 1964L, 1965L, 1966L, 1967L, 1968L, 1946L, 1947L, 1948L, 1949L, 1950L), Y_START = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 2L, 2L, 2L, 2L), Y_END = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), COLOR_VALUE = c(-969L, -712L, -574L, -312L, -12L, 1L, 0L, -782L, -999L, -100L, 23L, 45L, NA, 999L)), .Names = c("REGION", "STARTYEAR", "ENDYEAR", "Y_START", "Y_END", "COLOR_VALUE"), class = "data.frame", row.names = c(NA, -14L))
REGION STARTYEAR ENDYEAR Y_START Y_END COLOR_VALUE
1 REG1 1959 1960 0 1 -969
2 REG1 1960 1961 0 1 -712
3 REG1 1961 1962 0 1 -574
4 REG1 1962 1963 0 1 -312
5 REG1 1963 1964 0 1 -12
6 REG1 1964 1965 0 1 1
7 REG1 1965 1966 0 1 0
8 REG1 1966 1967 0 1 -782
9 REG1 1967 1968 0 1 -999
10 REG2 1945 1946 2 3 -100
11 REG2 1946 1947 2 3 23
12 REG2 1947 1948 2 3 45
13 REG2 1948 1949 2 3 NA
14 REG2 1949 1950 2 3 999
I am creating a plot with the rect() function which works fine.
xx = unlist(reg1[, c(2, 3)])
yy = unlist(reg1[, c(4, 5)])
png(width=1679, height=1165, res=150)
if(any(xx < 1946)) {my_x_lim <- c(min(xx), 2014)} else {my_x_lim <- c(1946, 2014)}
plot(xx, yy, type='n', xlim = my_x_lim)
apply(reg1, 1, function(y)
rect(y[2], y[4], y[3], y[5]))
dev.off()
In my reg1 data I have a 6th column which contains values between +1000 and -1000. What I was wondering is if there is a method that I could colour the rectangles in my plot according to my color values. Low values should be blue, values around 0 should result in white and high values in red (if no value is present or NA, then grey should be plotted).
My question: How could I create a color palette that ranges from values 1000 to -1000 (from red over white to blue) and apply it to my plot so that each rectangle gets coloured according to the color value?
Here is how you get a color ramp and match it in the data frame.
# colorRampPalette() creates a function that interpolates n colors between blue, white and red
my.colors <- colorRampPalette(c("blue", "white", "red"))
# generate 2001 colors from the ramp, one for each integer from -1000 to 1000
color.df <- data.frame(COLOR_VALUE = seq(-1000, 1000, 1), color.name = my.colors(2001))
reg1.with.color <- merge(reg1, color.df, by = "COLOR_VALUE")
I can't help you with the rect() plotting; I've never used it.
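For the rect() part, a small sketch of my own (an assumption, not from the answer above): index the ramp directly with the shifted COLOR_VALUE and pass the resulting colours to rect()'s col argument, substituting grey for NA:

```r
# Ramp from blue through white to red, one colour per integer in -1000..1000
my.colors <- colorRampPalette(c("blue", "white", "red"))
pal <- my.colors(2001)

# Example COLOR_VALUEs (hypothetical subset, including an NA)
vals <- c(-969, -12, 0, 23, NA, 999)

# Shift -1000..1000 onto 1..2001 to index the palette; grey for NA
cols <- ifelse(is.na(vals), "grey", pal[vals + 1001])

# Empty canvas, then one coloured rectangle per value
plot(NULL, xlim = c(0, 6), ylim = c(0, 1), xlab = "", ylab = "")
rect(0:5, 0, 1:6, 1, col = cols)
```

The same cols vector could be built row-by-row from reg1$COLOR_VALUE and passed into the apply/rect call in the question.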
