How can I manipulate a data frame based on strings in several columns - R

I want to make all the strings in col1, col2, col3 and col4 unique and then bring their res values alongside them, so that the output looks like this:
output <- structure(list(col1 = structure(c(13L, 14L, 16L, 17L, 27L, 18L,
26L, 25L, 24L, 4L, 7L, 9L, 11L, 21L, 22L, 23L, 5L, 8L, 10L, 12L,
15L, 1L, 2L, 3L, 6L, 19L, 20L), .Label = c("A8WFJ8", "A8WFK2",
"A8WHR6", "A8WHS3", "A8WIT0", "A8XQE0", "A9D0C6", "A9D4S6", "A9D649",
"A9D8E6", "A9UJN4", "A9Z1L6", "ADliba1", "ADNIL2", "B0M0N9",
"DFGH2", "GDH76", "ML2IS5", "Q9XXL6", "Q9XXN0", "Q9XXN2", "Q9XXQ4",
"Q9XXQ6", "QSEA12", "RR2JDG", "T2HDY3", "TR5421"), class = "factor"),
res1 = c(3.59e-08, 2.15e-08, 1.52e-07, 1.24e-07, 4.53e-08,
3.11e-08, 7.08e-08, 1.98e-08, 1.46e-08, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), res2 = c(8.11e-07, 7.21e-08,
0, 4.02e-08, 0, 0, 2.32e-08, 0, 1.46e-08, 3.86e-08, 2.68e-08,
2.7e-08, 7.76e-08, 7.76e-08, 7.76e-08, 7.76e-08, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), res3 = c(8.76e-08, 1.4e-07, 0, 2.8e-08,
0, 0, 0, 0, 0, 0, 7.85e-08, 0, 0, 0, 0, 0, 2.13e-08, 3.57e-08,
1.46e-07, 5.23e-08, 6.44e-08, 0, 0, 0, 0, 0, 0), res4 = c(1.42e-07,
8.66e-08, 0, 7.64e-08, 0, 0, 6.28e-07, 0, 0, 0, 7.25e-07,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.26e-05, 8.58e-08, 2.83e-08,
3.7e-08, 1.26e-05, 8.58e-08)), .Names = c("col1", "res1",
"res2", "res3", "res4"), class = "data.frame", row.names = c(NA,
-27L))

Begin by cleaning up the data
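(The question does not show the input df; the steps below assume it has alternating identifier/result pairs — col1/res1, col2/res2, and so on — padded with NA where one list is shorter than the others. A hypothetical example, with illustrative values only:)
# hypothetical input layout (illustrative, not the asker's actual data)
df <- data.frame(
  col1 = c("ADliba1", "ADNIL2", NA), res1 = c(3.59e-08, 2.15e-08, NA),
  col2 = c("ADliba1", "A8WHS3", NA), res2 = c(8.11e-07, 3.86e-08, NA),
  col3 = c("ADliba1", "A8WIT0", NA), res3 = c(8.76e-08, 2.13e-08, NA),
  col4 = c("ADliba1", "Q9XXL6", NA), res4 = c(1.42e-07, 1.26e-05, NA),
  stringsAsFactors = FALSE
)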
# organizes your "col" and "res" values into different lists
splitDF <- lapply(seq(1, ncol(df), by = 2),
function(x) df[x:(x+1)])
# renames the first column of each piece to make the merge easier
# (the anonymous function must return the modified data frame, not just the new name)
splitDF <- lapply(splitDF, function(x) {names(x)[1] <- "col1"; x})
# removes blank lines
splitDF <- lapply(splitDF, function(x) x[complete.cases(x), ])
Then you can use the great merge solution found here to gather into one data frame.
output <- Reduce(function(...) merge(..., all=T), splitDF)
Finally you can set all of the NA values to zero and reorder the rows.
output[is.na(output)] <- 0
varOrder <- c("ADliba1", "ADNIL2", "DFGH2", "GDH76", "TR5421", "ML2IS5",
"T2HDY3", "RR2JDG", "QSEA12", "A8WHS3", "A9D0C6", "A9D649",
"A9UJN4", "Q9XXN2", "Q9XXQ4", "Q9XXQ6", "A8WIT0", "A9D4S6",
"A9D8E6", "A9Z1L6", "B0M0N9", "A8WFJ8", "A8WFK2", "A8WHR6",
"A8XQE0", "Q9XXL6", "Q9XXN0")
output <- output[match(varOrder, output[["col1"]]), ]

Related

Creating a function to find precision by group

I have the following dataframe for which I am trying to calculate the precision of observations by group.
df<- structure(list(BLG = c(77.634011090573, 119.341563786008, 12.0603015075377,
0, 155.275381552754, 117.391304347826, 81.1332904056665, 3.96563119629874,
91.566265060241), GSF = c(11.090573012939, 4.11522633744856,
0, 0, 0, 0, 0, 0, 0), LMB = c(73.9371534195933, 28.8065843621399,
24.1206030150754, 20.2360876897133, 59.721300597213, 13.0434782608696,
38.6349001931745, 31.7250495703899, 28.9156626506024), YLB = c(14.7874306839187,
4.11522633744856, 0, 0, 0, 0, 0, 0, 0), BLC = c(7.39371534195933,
0, 0, 20.2360876897133, 3.9814200398142, 0, 0, 7.93126239259749,
9.63855421686747), WHC = c(0, 0, 0, 0, 3.9814200398142, 0, 0,
0, 0), RSF = c(0, 0, 0, 0, 11.9442601194426, 0, 0, 0, 4.81927710843374
), CCF = c(0, 0, 0, 0, 0, 0, 0, 0, 0), BLB = c(0, 0, 0, 0, 0,
0, 0, 0, 0), group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)), row.names = c(NA,
-9L), class = c("data.table", "data.frame"))
I am trying to find the precision with this formula:
Y_estimated = the value in each cell of df
Y_true = y_true <- c(83, 10, 47, 8, 9, 6, 12, 5, 8) # the true value for each column in df
R = the number of observations in each group (in this case, 3)
After applying the formula, I should have 3 measures of precision for each column, but I am unsure how to turn this formula into a function that will do this, specifically how to apply the epsilon by group and how to define R.
I've been working on the following:
estimate = function(df, y_true) {
  R = 3
  y_estimated = (df, .SD)
  (sum((sqrt((y_estimated - y_true)^2 / 3))) / y_true) * 100
}
Apart from this throwing errors (I think from the .SD in y_estimated), I have to put in the value of R manually, which I hope to avoid given that this will be applied to data frames with different group sizes.
Any help would be greatly appreciated.
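One way to avoid hard-coding R, sketched on the assumption that the intended formula is the one in the attempt above (the helper name precision_by_group and the use of Map() are my own choices, not from the question): data.table's .N gives the number of observations in the current group, and Map() pairs each column of .SD with its true value.
library(data.table)
y_true <- c(83, 10, 47, 8, 9, 6, 12, 5, 8) # the true value for each column
# hypothetical helper: per group, apply
# (sum(sqrt((y_estimated - y_true)^2 / R)) / y_true) * 100 to every column,
# where R = .N is the number of observations in that group
precision_by_group <- function(dt, y_true) {
  dt[, {
    R <- .N
    Map(function(col, truth) (sum(sqrt((col - truth)^2 / R)) / truth) * 100,
        .SD, as.list(y_true))
  }, by = group]
}
precision_by_group(as.data.table(df), y_true)
# returns one row per group with one precision value per species column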

Find the second largest value rowwise with dplyr R

Problem:
I am working with wages data and I want to flag outliers as possible measurement errors. To do so, I am combining two criteria:
Receiving more than twice the value of the 99th percentile of wages within a given year, relative to the whole distribution of wages in my dataset (a between-person, within-year comparison).
Receiving more than twice the value of the second highest wage of the same person across years, i.e. an intra-individual criterion (a within-person, between-year comparison).
I managed to code the first criterion, but I am having trouble coding the second one.
My data are in wide format. Perhaps my problem could be solved more easily by reshaping to long format, but as I am working on a multi-author project, if I use that solution I would need to reshape the data back to wide format afterwards.
Data example:
Below, I provide some rows of my data with cases that already met the first criteria:
df <- structure(list(
wage_2010 = c(120408.54, 11234.67, 19918.64, NA, 66006.32, 40581.36, 344587.84, 331970.28, NA, 161351.45, NA, 115310.68, 323336.27, 9681.69, NA, 682324.53, 43764.76, 134023.61, 78195.16, 141231.5, 48163.23, 71259.66, 73858.65, 57737.6, NA, 182837.23), wage_2011 = c(413419.86, 24343.04, 36349.02, NA, 99238.53, 18890.34, 129921.58, 108714.29, NA, 169289.89, 36158.73, 129543.51, 130791.99, 13872.76, 4479.58, 222327.52, 826239.14, 48892.78, 78506.06, 111569.8, 653239.41, 813158.54, 72960.17, 80193.15, NA, 209796.19), wage_2012 = c(136750.86, 77386.62, 177528.17, 86512.48, 375958.76, 20302.29, 145373.42, 91071.64, 95612.23, 176866.72, 85244.44, 225698.7, 181093.52, 162585.23, 147918.83, 254057.11, 72845.46, 86001.31, 80958.22, 105629.12, 77723.77, 115217.74, 68959.04, 111843.87, 85180.26, 261942.95 ),
wage_2013 = c(137993.48, 104584.84, 239822.37, 95688.8, 251573.14, 21361.93, 142771.58, 92244.51, 111058.93, 208013.94, 111326.07, 254276.36, 193663.33, 225404.84, 84135.55, 259772.16, 100031.38, 100231.81, 824271.38, 107336.19, 95292.2, 217071.19, 125665.58, 74513.66, 116227.01, 245161.73), wage_2014 = c(134914.8, 527180.87, 284218.4, 112332.41, 189337.74, 23246.46, 144070.09, 92805.77, 114123.3, 251389.07, 235863.98, 285511.12, 192950.23, 205364.45, 292988.3, 318408.56, 86255.91, 497960.18, 85467.13, 152987.99, 145663.31, 242682.93, 184123.01, 107423.03, 132046.43, 248928.89), wage_2015 = c(168812.65, 145961.09, 280556.86, 256268.69, 144549.45, 23997.1, 130253.75, NA, 115522.88, 241031.91, 243697.87, 424135.76, 15927.33, 213203.96, 225118.19, 298042.59, 77749.09, 151336.85, 88596.38, 121741.45, 34054.26, 206284.71, 335127.7, 201891.17, 189409.04, 246440.69),
wage_2016 = c(160742.14, 129892.09, 251333.29, 137192.73, 166127.1, 537611.12, 139350.84, NA, 115395.21, 243154.02, 234685.36, 903334.7, NA, 205664.08, 695079.91, 33771.37, 100938.19, 138864.28, 58658.4, 98576.95, NA, 144613.53, 430393.04, 217989.1, 229369.56, 600079.86), wage_2017 = c(175932.3, 138128.41, 584536.47, 143506.22, 61674.63, 1442.8, 126084.46, NA, 575771.83, 586909.69, 372954.89, 701815.37, NA, 402347.33, 93873.2, NA, 96792.96, 172908.08, 89006.92, 631645.41, NA, 72183.55, 579455.71, 294539.56, 353615.43, 151327.43), wage_2018 = c(146111.42, 149313.9, 627679.77, 850182.4, 72654.62, 9129.35, 41544.24, NA, 248020.12, 334280.68, 611781.99, 597465.2, NA, 535628.5, 63369.44, NA, 93710.71, 146769.63, 100736.71, 108022.87, NA, 79019.43, 772012.47, 549097.81, 504183.59, 99129.6),
outlier_2010 = c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), outlier_2011 = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0), outlier_2012 = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), outlier_2013 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0), outlier_2014 = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0), outlier_2015 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), outlier_2016 = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1), outlier_2017 = c(0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), outlier_2018 = c(0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0)),
groups = structure(list(.rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", "list"))), row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame")), row.names = c(NA, -26L), class = c("rowwise_df", "tbl_df", "tbl", "data.frame"))
I have average annual wages from 2010 to 2018, that is, 9 points in time. However, it seems hard to use a solution based on the quantile function, because of possible missing values for some individuals in some years.
What I have tried:
So far I am using a median-based approach with dplyr. I flag a wage as an outlier (possible error) if, in a given year, the individual receives more than twice the median of what they received across the years:
library(dplyr)
df1 <- df %>%
rowwise %>%
mutate(
median_wage = median(c(wage_2010, wage_2011, wage_2012, wage_2013, wage_2014, wage_2015, wage_2016, wage_2017, wage_2018), na.rm=T)) %>%
mutate(
individual_threshold = median_wage * 2,
) %>%
mutate(
outlier_2010 = case_when (wage_2010 > individual_threshold ~ 1, TRUE ~ 0),
outlier_2011 = case_when (wage_2011 > individual_threshold ~ 1, TRUE ~ 0),
outlier_2012 = case_when (wage_2012 > individual_threshold ~ 1, TRUE ~ 0),
outlier_2013 = case_when (wage_2013 > individual_threshold ~ 1, TRUE ~ 0),
outlier_2014 = case_when (wage_2014 > individual_threshold ~ 1, TRUE ~ 0),
outlier_2015 = case_when (wage_2015 > individual_threshold ~ 1, TRUE ~ 0),
outlier_2016 = case_when (wage_2016 > individual_threshold ~ 1, TRUE ~ 0),
outlier_2017 = case_when (wage_2017 > individual_threshold ~ 1, TRUE ~ 0),
outlier_2018 = case_when (wage_2018 > individual_threshold ~ 1, TRUE ~ 0))
However, when I inspect the data, I see that I am flagging possibly legitimate wages as outliers. For example, for the third row/person of my data, I am flagging the 2017 and 2018 wages as outliers. Yet there is a clear pattern of increase in this person's wage: although they receive more than twice their median wage in these years, that is probably not a mistake, since the increase was recorded in two consecutive years.
In the fourth row, however, the 2018 wage is more likely to be wrongly reported, since there is no similar wage for the same person: in 2018 that person's wage grew to more than 4 times what it had ever been before (and also became more than twice the 99th percentile of the whole distribution).
Summing up:
I want to write code that, for every individual (i.e. rowwise), analyses the 9 variables wage_2010 to wage_2018 and compares the highest value to the second highest. If the highest value is more than twice the second highest, I flag it as a possible measurement error. Preferably within dplyr.
Here's a way to do this with a helper function.
library(dplyr)
compare_2nd_highest <- function(x) {
#Sort the wages in descending order
x1 <- sort(x, decreasing = TRUE)
#Is the highest value more than double of second highest value
x1[1] > (x1[2] * 2)
}
df %>%
rowwise() %>%
mutate(is_outlier = compare_2nd_highest(c_across(starts_with('wage')))) %>%
ungroup
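Note that sort() drops NA values by default, so persons with missing wages in some years are handled without extra work; if a person has only one non-missing wage, x1[2] is NA and the flag will be NA rather than TRUE/FALSE. Wrap the result in as.integer() if you want a 0/1 flag consistent with the existing outlier_* columns.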

Multiply each value in a vector to all values in a df in R

I have a df of binary variables as seen here:
df <- structure(list(Incident = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 1), WorkZone = c(0,
0, 0, 0, 0, 0, 1, 1, 1, 0), Weather = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0), SpecialEvents = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), RecurringCongestion = c(1,
0, 1, 1, 0, 0, 0, 1, 1, 1), MultipleCauses = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0), Unclassified = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0
)), row.names = c(NA, -10L), groups = structure(list(.rows = structure(list(
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame")), class = c("rowwise_df", "tbl_df", "tbl",
"data.frame"))
I would like to multiply each row of this 10-row df by the corresponding value in a vector, also of length 10.
vect <- structure(list(UDC = c(102484.184937655, 126.379057607441, 132753.66551244,
1042.40780236563, 2438.05671857084, 29124.7628066832, 6406.8910421133,
141757.747682935, 95303.0160407684, 0)), row.names = c(NA, 10L
), class = "data.frame")
So every entry in row 1 of df is multiplied by row 1 of vect, and so on. I try to do so below, but I get this error:
df <- df %>%
mutate_all(.,function(col){vect$UDC*col})
Error: Problem with `mutate()` input `Incident`.
x Input `Incident` can't be recycled to size 1.
i Input `Incident` is `(function (col) ...`.
i Input `Incident` must be size 1, not 136.
i Did you mean: `Incident = list((function (col) ...)` ?
i The error occured in row 1.
There is a rowwise grouping attribute; if we ungroup, it should work. In dplyr 1.0.0, we can use across:
library(dplyr)
df %>%
ungroup %>%
mutate(across(everything(), ~ . * vect$UDC))
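An equivalent base R sketch (my addition, not part of the answer above): arithmetic between a plain data frame and a vector recycles the vector down each column, so a length-10 vector multiplies row i of every column by vect$UDC[i].
# drop the rowwise class first, then rely on column-wise recycling
as.data.frame(df) * vect$UDC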

How can I plot ggplot2 columns in a loop?

I have a dataframe with lat, long and dozens of other columns. Basically, I want to write a loop where x = long and y = lat remain constant, but the color changes in every iteration. The color comes from the other columns, one for every plot. How can I do this?
library(maps)
library(ggplot2)
library(RColorBrewer)
library(reshape)
usamap <- ggplot2::map_data("state")
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
simplefun<-function(colname){
ggplot()+
geom_polygon( data=usamap, aes(x=long, y=lat, group=group),colour="black",fill="white")+
geom_point(data=stat,aes_string(x=stat$longitude,y=stat$latitude,color=colname))+
scale_colour_gradientn(name="name",colours = myPalette(10))+
xlab('Longitude')+
ylab('Latitude')+
coord_map(projection = "mercator")+
theme_bw()+
theme(line = element_blank())+
theme(legend.position = c(.93,.20),panel.grid.major = element_line(colour = "#808080"))+
ggsave(paste0(colname,".png"),width=10, height=8,dpi=300)
}
colname<-names(stat[4:16])
lapply(colname,simplefun)
dput(droplevels(stat))
structure(list(siteId = structure(1:16, .Label = c("US1NYAB0001",
"US1NYAB0006", "US1NYAB0010", "US1NYAB0021", "US1NYAB0023", "US1NYAB0028",
"US1NYAB0032", "US1NYAL0002", "US1NYBM0004", "US1NYBM0007", "US1NYBM0011",
"US1NYBM0014", "US1NYBM0021", "US1NYBM0024", "US1NYBM0032", "US1NYBM0034"
), class = "factor"), latitude = c(42.667, 42.7198, 42.5455,
42.6918, 42.6602, 42.7243, 42.5754, 42.2705, 42.0296, 42.0493,
42.0735, 42.3084, 42.0099, 42.1098, 42.1415, 42.0826), longitude = c(-74.0509,
-73.9304, -74.1475, -73.8311, -73.8103, -73.757, -73.7995, -77.9419,
-76.0213, -76.0288, -75.9296, -75.9569, -75.5142, -75.8858, -75.889,
-75.9912), no = c(2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 1L), min_obs = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), min_mod = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), avg_obs = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0.15,
0, 0, 0, 0, 0, 0), avg_mod = c(3136.8388671875, 2997.28173828125,
3258.61840820312, 2970.74340820312, 2992.9765625, 0, 3075.54443359375,
2701.03662109375, 2974.23413085938, 2967.5029296875, 3004.57861328125,
2965.07470703125, 3260.25463867188, 3028.55590820312, 2981.8876953125,
0), max_obs = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0.3, 0, 0, 0, 0, 0,
0), max_mod = c(6273.677734375, 5994.5634765625, 6517.23681640625,
5941.48681640625, 5985.953125, 0, 6151.0888671875, 5402.0732421875,
5948.46826171875, 5935.005859375, 6009.1572265625, 5930.1494140625,
6520.50927734375, 6057.11181640625, 5963.775390625, 0), mean_bias = c(0,
0, 0, 0, 0, NaN, 0, 0, 0, 5.05475490855863e-05, 0, 0, 0, 0, 0,
NaN), corr_coef = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA,
NA, NA, NA, NA, NA), additive_bias = c(-6273.677734375, -5994.5634765625,
-6517.23681640625, -5941.48681640625, -5985.953125, 0, -6151.0888671875,
-5402.0732421875, -5948.46826171875, -5934.705859375, -6009.1572265625,
-5930.1494140625, -6520.50927734375, -6057.11181640625, -5963.775390625,
0), mean_error = c(-3136.8388671875, -2997.28173828125, -3258.61840820312,
-2970.74340820312, -2992.9765625, 0, -3075.54443359375, -2701.03662109375,
-2974.23413085938, -2967.3529296875, -3004.57861328125, -2965.07470703125,
-3260.25463867188, -3028.55590820312, -2981.8876953125, 0), mean_abs_error = c(3136.8388671875,
2997.28173828125, 3258.61840820312, 2970.74340820312, 2992.9765625,
0, 3075.54443359375, 2701.03662109375, 2974.23413085938, 2967.3529296875,
3004.57861328125, 2965.07470703125, 3260.25463867188, 3028.55590820312,
2981.8876953125, 0), rmse = c(4436.16006895562, 4238.79648453055,
4608.38234747949, 4201.26561821133, 4232.7080465523, 0, 4349.47664966936,
3819.84262201718, 4206.20224553428, 4196.4707575116, 4249.11582411849,
4193.24886413303, 4610.69632679956, 4283.02483978603, 4217.02602018439,
0)), .Names = c("siteId", "latitude", "longitude", "no", "min_obs",
"min_mod", "avg_obs", "avg_mod", "max_obs", "max_mod", "mean_bias",
"corr_coef", "additive_bias", "mean_error", "mean_abs_error",
"rmse"), row.names = c(NA, -16L), class = "data.frame")
I had the same problem and I solved it like this:
Let's assume ggplot_data in my code is like your data frame (with more than two columns) from the dput(droplevels(stat)) output above.
library(reshape) # package for melting data
shapes <- 1:ncol(ggplot_data) # number of shapes
ggplot_data <- melt(ggplot_data, id = "X1") # melt data together
p1 <- ggplot(ggplot_data, aes(X1,value))
p1 <- p1 +aes(shape = factor(variable))+ # different shapes
geom_point(aes(colour = factor(variable)))+ # different colors
scale_shape_manual(labels=colname, values = shapes)+ # same for the legend
scale_color_manual(labels=colname, values = mypalette) # same for legend
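If you prefer to keep the loop structure from the question rather than melting, here is a minimal sketch (assuming stat is the data frame from the dput above): passing column names as strings to aes_string() is what lets the mapped color change on each iteration while x and y stay fixed.
library(maps)
library(ggplot2)
library(RColorBrewer)
usamap <- ggplot2::map_data("state")
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
plot_column <- function(colname) {
  p <- ggplot() +
    geom_polygon(data = usamap, aes(x = long, y = lat, group = group),
                 colour = "black", fill = "white") +
    # column names as strings, so only colname changes between iterations
    geom_point(data = stat,
               aes_string(x = "longitude", y = "latitude", color = colname)) +
    scale_colour_gradientn(name = colname, colours = myPalette(10)) +
    coord_map(projection = "mercator") +
    theme_bw()
  ggsave(paste0(colname, ".png"), plot = p, width = 10, height = 8, dpi = 300)
  p
}
lapply(names(stat)[4:16], plot_column)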

Why PLM creates massive objects and fails to open them

I am working on a large (but not enormous) database of 1.1 million observations x 41 variables. The data are arranged as an unbalanced panel. Using these variables I specified three different models and ran each of them as 1) a fixed effects, 2) a random effects and 3) a pooled OLS regression.
The original .RData file containing only the database is about 15 MB. The .RData file containing the database and the regression results (a total of 9 regressions) weighs about 650 MB. I do realize that (from the package documentation):
An object of class c("plm","panelmodel").
A "plm" object has the following elements :
coefficients the vector of coefficients,
vcov the covariance matrix of the coefficients,
residuals the vector of residuals,
df.residual degrees of freedom of the residuals,
formula an object of class 'pFormula' describing the model,
model a data.frame of class 'pdata.frame' containing the variables used for the estimation: the response is in the first position and the two indexes in the last positions,
ercomp an object of class 'ercomp' providing the estimation of the components of the errors (for random effects models only),
call the call
even so, I am not able to understand why those files should be so massive.
To avoid overloading the memory while working with the plm objects, I saved them in three different files (each of which now weighs around 200 MB).
I called summary() one hour ago to see the fixed-effects model results, but it hasn't shown me any results yet. My question is pretty straightforward: is this normal behaviour? Is there something I can do to reduce the size of the plm objects and speed up retrieval of the results?
Here are some things you might want to know:
The data base I am using is in data.table format
The formulas in the regressions are pre-assembled and are wrapped in as.formula() inside the plm calls, as suggested here. Example:
form<-y~x1+x2+x3+...+xn
mod.fe<-plm(as.formula(form), regr, effect="individual", model="within", index=c("id", "year"))
Please, let me know if there is any other info I can provide and that you might need to answer the question.
EDIT
I managed to make up a small-scale database with characteristics similar to the one I am working on. Here it is:
structure(list(id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L,
10L, 10L, 11L, 11L), year = structure(c(1L, 2L, 1L, 2L, 3L, 4L,
1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 3L, 4L, 1L, 2L), .Label = c("2000", "2001", "2002",
"2003"), class = "factor"), study = c(3.37354618925767, 4.18364332422208,
5.32950777181536, 4.17953161588198, 5.48742905242849, 5.73832470512922,
6.57578135165349, 5.69461161284364, 6.3787594194582, 4.7853001128225,
7.98380973690105, 8.9438362106853, 9.07456498336519, 7.01064830413663,
10.6198257478947, 9.943871260471, 9.84420449329467, 8.52924761610073,
3.52184994489138, 4.4179415601997, 5.35867955152904, 3.897212272657,
5.38767161155937, 4.9461949594171, 3.62294044317139, 4.58500543670032,
7.10002537198388, 6.76317574845754, 6.83547640374641, 6.74663831986349
), ethnic = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L,
2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
1L, 1L, 2L, 2L), .Label = c("hispanic", "black", "chinese"), class = "factor"),
sport = c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0), health = structure(c(1L,
1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("none",
"drink", "both", "smoke"), class = "factor"), gradec = c(2.72806403942929,
3.10067738633308, 4.04728186632456, 2.19701362539883, 1.73115878111307,
5.35879931359977, 5.79613840739381, 5.07050219214859, 4.26224490644077,
3.53554192927934, 6.10515669475491, 7.18032957183198, 6.73191149590581,
6.49512764543435, 6.4783689354808, 6.19974636196512, 5.54014977312232,
6.72545652880344, 1.00223129492982, 1.08994269214495, 3.06702680106689,
1.70103126320561, 4.82973481729635, 3.14010240687364, 3.8068435242348,
5.01254268106181, 5.66497772013949, 4.16303452633342, 4.2751229553617,
3.05652055248093), event = c(1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0,
0), evm3 = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0), evm2 = c(0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 1, 0, 0, 0, 0), evm1 = c(0, 1, 0, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
1, 0, 0, 0, 0), evp1 = c(0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1),
evp2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1), evp3 = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0), ndm3 = c(1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,
1, 1, 1, 1), ndm2 = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1), ndm1 = c(1,
0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
0, 0, 1, 0, 0, 0, 1, 0, 1, 0), ndp1 = c(0, 1, 0, 0, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0), ndp2 = c(1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0),
ndp3 = c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1)), .Names = c("id",
"year", "study", "ethnic", "sport", "health", "gradec", "event",
"evm3", "evm2", "evm1", "evp1", "evp2", "evp3", "ndm3", "ndm2",
"ndm1", "ndp1", "ndp2", "ndp3"), class = "data.frame", row.names = c(NA,
30L))
The formula and the plm call I used are:
form<-gradec~year+study+ethnic+sport+health+event+evm3+evm2+evm1+evp1+evp2+evp3+ndm3+ndm2+ndm1+ndp1+ndp2+ndp3
plm.f<-plm(as.formula(form), data, effect="individual", model="within", index=c("id", "year"))
Using object.size() as suggested by @BenBolker, I found out that the call generated a plm object weighing 64.5 Kb, while the original data frame has size 6.9 Kb, which means the results are about 10 times larger than the input matrix. I then set the options suggested by @zx8754 below, but unfortunately they had no effect.
When I finally called summary(plm.f) I got the error message:
Error in crossprod(t(X), beta) : non-conformable arguments
which I eventually also got with my large database, but only after hours of computing.
Here it is suggested that the problem might be due to the coefficient matrix being singular. However, testing for singularity with is.matrix.singular() from the matrixcalc package shows that this is not the case.
Another couple of things you might want to know:
year, ethnic and health are factors
The variables in the formula are more or less self-explanatory except for the last ones. event is a supposed traumatic event that happened at a certain time; it is coded 1 if an event occurred in a certain year and 0 otherwise. The variable evm1 equals 1 if one of these events happened in the previous year (minus 1) and 0 otherwise. Similarly, evp1 is 1 if the event happens in the following year (plus 1) and 0 otherwise. The variables ndm. and ndp. work in the same way, but they are coded 1 when that distance is not observable (because the time period for a certain individual is too short) and 0 otherwise. The presence of such deeply connected variables raises the suspicion of perfect collinearity. As mentioned above, however, a test revealed that the matrix is non-singular.
Let me say once again that I would be very thankful if someone could answer the question.
About the error message Error in crossprod(t(X), beta) : non-conformable arguments:
This is likely due to a singularity in the model matrix, just as suggested. Please keep in mind that a model matrix for fixed effects models is the transformed data (transformed data frame).
Thus, you will need to check for singularity of the transformed data. The fixed effects transformation can result in linear dependence (singularity) even if the original data are not linearly dependent! The plm package has quite good documentation about this issue in ?detect.lindep, which I am going to partly repeat here (only one example):
### Example 1 ###
# prepare the data
data(Cigar)
Cigar[ , "fact1"] <- c(0,1)
Cigar[ , "fact2"] <- c(1,0)
Cigar.p <- pdata.frame(Cigar)
# setup a pFormula and a model frame
pform <- pFormula(price ~ 0 + cpi + fact1 + fact2)
mf <- model.frame(pform, data = Cigar.p)
# no linear dependence in the pooling model's model matrix
# (with intercept in the formula, there would be linear dependence)
detect.lindep(model.matrix(pform, data = mf, model = "pooling"))
# linear dependence present in the FE transformed model matrix
modmat_FE <- model.matrix(pform, data = mf, model = "within")
detect.lindep(modmat_FE)
mod_FE <- plm(pform, data = Cigar.p, model = "within")
detect.lindep(mod_FE)
alias(mod_FE) # => fact1 == -1*fact2
plm(pform, data = mf, model = "within")$aliased # "fact2" indicated as aliased
So you should run your linear-dependence check on the transformed data of the model, which you get with model.matrix(your_model). You can use the functions supplied by plm (detect.lindep, alias) or any other function that works on a matrix.
You could also look at your plm model object: your_model$aliased shows whether some variables were dropped in the estimation.
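Applied to the model from the question (a sketch reusing the object names form, data and plm.f from above):
library(plm)
plm.f <- plm(as.formula(form), data, effect = "individual",
             model = "within", index = c("id", "year"))
# check the fixed-effects-transformed model matrix for linear dependence
detect.lindep(model.matrix(plm.f))
# TRUE entries mark coefficients dropped as aliased in the estimation
plm.f$aliased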
