Import .csv avoiding conversion to double data in R

I have a problem using the descdist() function of the fitdistrplus package on my data frame. I think the problem comes from the data type, double. I would like to avoid the conversion to double when importing my csv while keeping the data numeric (apparently I cannot convert them back using as.numeric; they remain double afterwards).
Here is my code to import the dataset:
setwd("[directory]")
data=read.csv('data_BehCoor.csv', header=T, sep=";", dec=".", fill=T)
require("fitdistrplus")
descdist(data$stateTSp)
This returns the following error:
Error in plot.window(...) : 'xlim' needs finite values
An idea of the data:
dput(head(data))
structure(list(day = c(2L, 2L, 2L, 2L, 2L, 2L),
trial = c(1L, 1L, 1L, 1L, 1L, 1L), ID = structure(c(2L, 2L, 3L, 3L, 4L, 4L),
.Label = c("", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"),
class = "factor"),
condition = structure(c(3L, 3L, 3L, 3L, 2L, 2L), .Label = c("",
"C", "T"), class = "factor"), gender = structure(c(3L, 2L,
3L, 2L, 3L, 2L), .Label = c("", "F", "M"), class = "factor"),
TSp = c(5, 3, 1, 5, 0, 6), AGR = c(3, 0, 0, 0, 0, 0), beAGR = c(0,
3, 0, 0, 0, 0), FOR = c(27.729, 24.51, 51.459, 37.645, 34.489,
34.281), FOR_noTSp = c(22.729, 21.51, 50.459, 32.645, 34.489,
28.281), NI = c(39.857, 82.421, 76.922, 9.277, 265.484, 249.692
), stateTSp = c(55.858, 21.607, 0, 79.961, 0, 2.001), TSpFOR = c(20.345,
8.408, 0, 0, 0, 0), tot_duration = c(136.967, 136.967, 128.395,
128.395, 300, 300), OL_FOR = c(3.746, 3.746, 5.002, 5.002,
10.081, 10.081), OL_FOR_stateTSp = c(4.563, 10.907, 41.703,
0, 0, 0), OL_FOR_TSpFOR = c(3.372, 1.113, 0, 0, 0, 0), OL_FOR_NI = c(11.041,
8.496, 2.748, 27.639, 19.655, 18.191), OL_stateTSp_FOR = c(10.907,
4.563, 0, 41.703, 0, 0), OL_stateTSp = c(3.249, 3.249, 0,
0, 0, 0), OL_stateTSp_TSpFOR = c(4.034, 0, 0, 0, 0, 0),
OL_stateTSp_NI = c(36.66,
11.79, 0, 37.249, 0, 2.001), OL_TSpFOR_FOR = c(1.113, 3.372,
0, 0, 0, 0), OL_TSpFOR_stateTSp = c(0, 4.034, 0, 0, 0, 0),
OL_TSpFOR = c(0, 0, 0, 0, 0, 0), OL_TSpFOR_NI = c(18.23,
2.499, 0, 0, 0, 0), overlap_NI_FOR = c(8.496, 11.041, 27.639,
2.748, 18.191, 19.655), OL_NI_stateTSp = c(11.79, 36.66,
37.249, 0, 2.001, 0), OL_NI_TSpFOR = c(2.499, 18.23, 0, 0,
0, 0), OL_NI = c(16.065, 16.065, 6.528, 6.528, 230.021, 230.021
), AGR_in_FOR = c(0, 0, 0, 0, 0, 0), AGR_in_stateTSp = c(0,
0, 0, 0, 0, 0), AGR_in_TSpFOR = c(0, 0, 0, 0, 0, 0), AGR_in_NI = c(3,
0, 0, 0, 0, 0), beAGR_in_FOR = c(0, 0, 0, 0, 0, 0), beAGR_in_stateTSp = c(0,
0, 0, 0, 0, 0), beAGR_in_TSpFOR = c(0, 0, 0, 0, 0, 0), beAGR_in_NI = c(0,
3, 0, 0, 0, 0), comment = structure(c(1L, 1L, 1L, 1L, 1L,
1L),
.Label = c("", "moved the plate too fast"), class = "factor")),
row.names = c(NA, 6L), class = "data.frame")
Thanks in advance

fitdistrplus::descdist works fine with type double, see below:
foo <- runif(50, min = 1, max = 100)
typeof(foo)
fitdistrplus::descdist(foo)
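In R, double is simply the storage type of ordinary numeric vectors, so there is nothing to convert back. A minimal sketch of how one might look for a more likely culprit, assuming (this is a guess based on the empty "" factor levels in the dput output) that blank rows in the csv left missing values in the column. The vector x below is made up for illustration:

```r
# Hypothetical column with one missing value, standing in for data$stateTSp
x <- c(55.858, 21.607, 0, 79.961, 0, 2.001, NA)

typeof(x)      # storage type of the vector
is.numeric(x)  # a double vector already counts as numeric

# Count values that are NA, NaN or infinite
n_bad <- sum(!is.finite(x))

# Keep only the finite values before passing them on
x_clean <- x[is.finite(x)]

# With fitdistrplus installed, one could then try:
# fitdistrplus::descdist(x_clean)
```

If n_bad is greater than zero for the real column, the non-finite values may be what plot.window() complains about.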

Related

Creating a function to find precision by group

I have the following dataframe for which I am trying to calculate the precision of observations by group.
df<- structure(list(BLG = c(77.634011090573, 119.341563786008, 12.0603015075377,
0, 155.275381552754, 117.391304347826, 81.1332904056665, 3.96563119629874,
91.566265060241), GSF = c(11.090573012939, 4.11522633744856,
0, 0, 0, 0, 0, 0, 0), LMB = c(73.9371534195933, 28.8065843621399,
24.1206030150754, 20.2360876897133, 59.721300597213, 13.0434782608696,
38.6349001931745, 31.7250495703899, 28.9156626506024), YLB = c(14.7874306839187,
4.11522633744856, 0, 0, 0, 0, 0, 0, 0), BLC = c(7.39371534195933,
0, 0, 20.2360876897133, 3.9814200398142, 0, 0, 7.93126239259749,
9.63855421686747), WHC = c(0, 0, 0, 0, 3.9814200398142, 0, 0,
0, 0), RSF = c(0, 0, 0, 0, 11.9442601194426, 0, 0, 0, 4.81927710843374
), CCF = c(0, 0, 0, 0, 0, 0, 0, 0, 0), BLB = c(0, 0, 0, 0, 0,
0, 0, 0, 0), group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)), row.names = c(NA,
-9L), class = c("data.table", "data.frame"))
I am trying to find the precision with this formula:
Y_estimated = the value in each cell of df
Y_true = the true value for each column in df: y_true <- c(83, 10, 47, 8, 9, 6, 12, 5, 8)
R = the number of observations in each group (in this case, 3)
After applying the formula, I should have 3 measures of precision for each column. But I am unsure how to turn this formula into a function that does this, specifically how to apply the epsilon by group and how to define R.
I've been working on the following:
estimate = function(df, y_true) {
R = 3
y_estimated = (df, .SD)
(sum((sqrt( (y_estimated - y_true)^2 / 3))) / y_true) * 100
}
But apart from this throwing errors (I think from the .SD in the y_estimated), I have to manually put in the value of R which I hope to not have to do given that this will be applied on data frames with multiple group sizes.
Any help would be greatly appreciated.
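The formula itself did not survive in the post, so the following is only a sketch under a stated assumption: it reads the attempted code as a relative root-mean-square error, sqrt(sum((y_estimated - y_true)^2) / R) / y_true * 100, computed per column within each group. The function name precision_by_group and the toy data are hypothetical. R is taken from the number of rows in each group, so it does not need to be typed in by hand:

```r
# Assumed formula (the original image is missing):
#   precision = sqrt(sum((y_est - y_true)^2) / R) / y_true * 100
precision_by_group <- function(df, y_true, group_col = "group") {
  df <- as.data.frame(df)                      # also accepts a data.table
  value_cols <- setdiff(names(df), group_col)
  stopifnot(length(y_true) == length(value_cols))

  # One data frame of value columns per group
  pieces <- split(df[value_cols], df[[group_col]])

  per_group <- lapply(pieces, function(g) {
    R <- nrow(g)                               # group size comes from the data
    mapply(function(y_est, y_t) sqrt(sum((y_est - y_t)^2) / R) / y_t * 100,
           g, y_true)
  })

  # Rows are groups, columns are the value columns
  do.call(rbind, per_group)
}

# Toy data (made up) with two value columns and two groups
demo <- data.frame(a = c(1, 3, 2, 2), b = c(4, 4, 6, 2),
                   group = c(1L, 1L, 2L, 2L))
prec <- precision_by_group(demo, y_true = c(2, 4))
```

For the df in the question, the call would be precision_by_group(df, y_true), giving a 3 x 9 matrix. If the intended formula differs, only the one-line expression inside mapply() needs to change.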

Multiply each value in a vector to all values in a df in R

I have a df of binary variables as seen here:
df <- structure(list(Incident = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 1), WorkZone = c(0,
0, 0, 0, 0, 0, 1, 1, 1, 0), Weather = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0), SpecialEvents = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), RecurringCongestion = c(1,
0, 1, 1, 0, 0, 0, 1, 1, 1), MultipleCauses = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0), Unclassified = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0
)), row.names = c(NA, -10L), groups = structure(list(.rows = structure(list(
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame")), class = c("rowwise_df", "tbl_df", "tbl",
"data.frame"))
I would like to multiply each of the 10 rows of the df by the corresponding value in a vector, also of size 10.
vect <- structure(list(UDC = c(102484.184937655, 126.379057607441, 132753.66551244,
1042.40780236563, 2438.05671857084, 29124.7628066832, 6406.8910421133,
141757.747682935, 95303.0160407684, 0)), row.names = c(NA, 10L
), class = "data.frame")
So every entry in Row 1 of df is multiplied by Row 1 in vect, and so on.
I try to do so below, but I get this error:
df <- df %>%
mutate_all(.,function(col){vect$UDC*col})
Error: Problem with `mutate()` input `Incident`.
x Input `Incident` can't be recycled to size 1.
i Input `Incident` is `(function (col) ...`.
i Input `Incident` must be size 1, not 136.
i Did you mean: `Incident = list((function (col) ...)` ?
i The error occured in row 1.
The df carries a rowwise grouping attribute; if we ungroup first, it should work. In dplyr 1.0.0, we can use across:
library(dplyr)
df %>%
ungroup %>%
mutate(across(everything(), ~ . * vect$UDC))
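Under rowwise grouping each column is evaluated one row at a time, so a result as long as vect$UDC cannot be recycled to size 1, which is what the error reports. As a sketch of the intended arithmetic without dplyr, base R multiplies a plain data frame by a vector of length nrow() column by column, which scales row i by element i. The small data frame and the vector udc below are made up for illustration:

```r
# Toy stand-ins for df and vect$UDC (hypothetical values)
df_small <- data.frame(Incident = c(1, 1, 0), WorkZone = c(0, 0, 1))
udc <- c(10, 20, 30)

# The vector is applied down each column, so row i is scaled by udc[i]
scaled <- df_small * udc
scaled
```

For the rowwise tibble in the question, converting with as.data.frame(df) first would be one way to get a plain data frame.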

Multiple side-by-side boxplots: changing the size of the axis

I apologise for asking a simple question, but I have tried to find a solution and I am fairly new to producing complicated plots. I have a data frame (below) containing four variables: (1) Sampling.Station; (2) Bat.Species; (3) Light.Intensity; and (4) Bat.Passes.
My aim is to produce two plots grouped by the variable Bat.Species, a factor with three levels: (1) Pipistrellus pygmaeus; (2) Pipistrellus pipistrellus; and (3) Nyctalus noctula:
Plot 1:
x-axis - Sampling.Station (numbered 1:8);
y-axis - Light.Intensity
Plot 2:
x-axis - Sampling.Station (numbered 1:8);
y-axis - Bat.Passes
I have successfully produced plot 1 (below) from a data frame called Format. The data frame below was produced using the function cbind in order to add the column called Bat.Passes, from which I am attempting to produce plot 2. However, I keep getting the warning message below, and the code produces the box-and-whisker plots but not the axis, axis labels or legend.
Warning message:
In bxp(list(stats = c(2, 2, 3.5, 5, 5, 2, 3, 4, 6, 6, 1, 6, 10.5, :
some notches went outside hinges ('box'): maybe set notch=FALSE
Any suggestions on how to fix this? I searched but could not solve the issue, and I cannot open a window using the function windows() in RStudio on a Mac in order to adjust the plot size. Any help will be appreciated!
windows(width=10, height=9)
Error: could not find function "windows"
Example of plot 1
Code for Plot 1:
Sampling.Station.labels=c("1","2","3","4","5","6","7","8")
bat.labels <- rep(c("Pipistrellus pygmaeus", "Pipestrellus pipestrellus", "Nyctalus noctula"), 8)
data_long <- gather(bats1, x, Mean.Value, Saparano.Pipestrelle:Noctule)
head(data_long)
stacked.data.1<-melt(data_long, id=c('Sampling.Station', 'x'))
head(stacked.data.1)
str(stacked.data.1)
stacked.data.1=stacked.data.1[, -3]
head(stacked.data.1)
colnames(stacked.data.1)<-c("Sampling.Station", "Bat.Species", "Light.Intensity")
head(stacked.data.1)
par(mfrow = c(1,1))
boxplots.double.1=boxplot(Light.Intensity~Sampling.Station + Bat.Species,
data = stacked.data.1,
at = c(1:24),
ylim = c(min(0, min(0)),
max(30, na.rm = T)),
xaxt = "n",
notch=TRUE,
col = c("red", "blue", "green"),
cex.axis=0.7,
cex.labels=0.7,
ylab="Light Intensity (Lux)",
xlab="Sampling Stations",
space=1)
axis(side = 1, at = seq(3, 24, by = 1), labels = FALSE)
text(seq(3, 24, by=3), par("usr")[3] - 0.2, labels=unique(Sampling.Station.labels), srt = 45, pos = 1, xpd = TRUE, cex=0.8)
par(oma = c(4, 1, 1, 1))
par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE)
plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n")
legend("top",
legend=c("Pipistrellus pygmaeus","Pipestrellus pipestrellus","Nyctalus noctula"),
fill=c("Blue", "Red", "Green"),
xpd = TRUE, horiz = TRUE,
inset = c(0,0),
bty = "n",
col = 1:4,
cex = 0.8,
title = "Bat Species",
lty = c(1,1))
Code for plot 2
par(mfrow = c(1,1))
Sampling.Station.labels=c("1","2","3","4","5","6","7","8")
bat.labels <- rep(c("Pipistrellus pygmaeus", "Pipestrellus pipestrellus", "Nyctalus noctula"), 8)
data_long <- gather(bats1, x, Mean.Value, Saparano.Pipestrelle:Noctule)
head(data_long)
head(bats1)
stacked.data.2<-melt(data_long, id=c('Number.of.bat.passes', 'x'))
head(stacked.data.2)
str(stacked.data.2)
stacked.data.2$x<- as.factor(stacked.data.2$x)
stacked.data.2$Number.of.bat.passes <-as.numeric(stacked.data.2$Number.of.bat.passes)
stacked.data.2=stacked.data.2[,-3]
head(stacked.data.2)
colnames(stacked.data.2)<-c("Bat.Passes", "Bat.Species", "Sampling.Station")
head(stacked.data.2)
str(stacked.data.2)
par(mfrow = c(1,1))
boxplots.double.2=boxplot(Bat.Passes~Sampling.Station + Bat.Species,
data = stacked.data.2,
at = c(1:24),
ylim = c(min(0, min(0)),
max(30, na.rm = T)),
xaxt = "n",
notch=TRUE,
col = c("red", "blue", "green"),
cex.axis=0.7,
cex.labels=0.7,
ylab="Bat Passes",
xlab="Sampling Stations",
space=1)
axis(side = 1, at = seq(3, 24, by = 1), labels = FALSE)
text(seq(3, 24, by=3), par("usr")[3] - 0.2, labels=unique(Sampling.Station.labels), srt = 45, pos = 1, xpd = TRUE, cex=0.8)
par(oma = c(4, 1, 1, 1))
par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE)
plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n")
legend("top",
legend=c("Pipistrellus pygmaeus","Pipestrellus pipestrellus","Nyctalus noctula"),
fill=c("Blue", "Red", "Green"),
xpd = TRUE, horiz = TRUE,
inset = c(0,0),
bty = "n",
col = 1:4,
cex = 0.8,
title = "Bat Species",
lty = c(1,1))
If anyone can help to produce the axis, axis labels and legend, then many thanks in advance.
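Two hedged observations, followed by a sketch. First, windows() exists only on Windows; dev.new(width = 10, height = 9) is the portable way to open a device. Second, the legend overlay for plot 1 sets mar = c(0, 0, 0, 0), and par(mfrow = c(1, 1)) does not undo that, so one possible reason plot 2 has no axes or labels is that it is drawn with zero margins. The example below uses made-up toy data (not bats1) and an off-screen device so that it runs anywhere:

```r
# Toy data (hypothetical): 4 stations x 3 species, 2 observations each
set.seed(1)
toy <- data.frame(
  Sampling.Station = factor(rep(1:4, each = 6)),
  Bat.Species = factor(rep(c("P. pygmaeus", "P. pipistrellus", "N. noctula"),
                           times = 8)),
  Bat.Passes = rpois(24, lambda = 5)
)
species_cols <- c("red", "blue", "green")

pdf(NULL)                       # off-screen device; interactively use dev.new()
op <- par(no.readonly = TRUE)   # remember the default settings

par(mar = c(0, 0, 0, 0))        # what the legend overlay of plot 1 leaves behind
par(op)                         # restore margins BEFORE drawing the next plot
mar_used <- par("mar")

# Species varies fastest, so each station owns three adjacent boxes
bp <- boxplot(Bat.Passes ~ Bat.Species + Sampling.Station, data = toy,
              xaxt = "n", col = species_cols,
              xlab = "Sampling Stations", ylab = "Bat Passes")

# One tick per station, centred under its group of three boxes
axis(side = 1, at = seq(2, 11, by = 3), labels = levels(toy$Sampling.Station))
legend("topright", legend = levels(toy$Bat.Species), fill = species_cols,
       bty = "n", cex = 0.8, title = "Bat Species")
dev.off()
```

The notch warning is separate from the missing axes: it only says that some notches extend past the box, and notch = FALSE silences it.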
Dataframe
bats1<-structure(list(Sampling.Station = c(1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L,
8L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L,
7L, 7L, 7L, 8L, 8L, 8L, 1L, 1L, 1L), Light.Intensity.S = c(26.9,
25.16, 39, 29.81, 21.83, 20.22, 2.9, 2.1, 0.85, 0.62, 0.39, 0.26,
24.7, 21.99, 20.46, 26.32, 0, 0, 0.43, 0.02, 0.02, 0.03, 0.02,
0.03, 293.56, 167.79, 114.06, 17.22, 16.26, 4.76, 0.63, 0.56,
0.56, 86.63, 87.97, 88.59, 0.31, 0.04, 0.05, 0, 0.01, 0.01, 0.02,
2.6, 2.68, 2.62, 0.43, 0.44), Number.of.bat.passes = c(3L, 2L,
5L, 6L, 15L, 2L, 10L, 12L, 17L, 2L, 0L, 0L, 15L, 7L, 17L, 0L,
1L, 0L, 14L, 10L, 12L, 7L, 4L, 1L, 3L, 5L, 3L, 1L, 6L, 11L, 3L,
0L, 0L, 12L, 11L, 9L, 1L, 2L, 1L, 12L, 14L, 10L, 3L, 2L, 1L,
5L, 4L, 2L), Saparano.Pipestrelle = c(0.444444444, 0, 0, 0.027777778,
0, 0, 0.25, 0, 0.08650519, 0, 0, 0, 0.111111111, 0, 0.124567474,
0, 0, 0, 0.25, 0.01, 0.111111111, 0.081632653, 0, 0, 0, 0.04,
0, 1, 0.027777778, 0.033057851, 0.111111111, 0, 0, 0.027777778,
0.074380165, 0.012345679, 0, 0, 1, 0.173611111, 0.081632653,
0.16, 1, 0.25, 0, 0.04, 0.25, 0.25), Common.Pipestrelle = c(0.25,
0, 0, 7, 0, 0, 0, 0, 0.08, 0, 0, 0, 0.6, 0, 0.222222222, 0, 0,
0, 0.142857143, 9, 0.5, 1.25, 0, 0, 0, 4, 0, 0, 5, 2, 1, 0, 0,
1.25, 0.888888889, 5, 0, 0, 2, 0.28, 0.625, 0.375, 0, 1, 0, 4,
0.5, 1), Noctule = c(0, 0, 1, 0, 0.109375, 0, 0, 0, 0.75, 0,
0, 0, 0, 0.08, 0.046875, 0, 0, 0, 0, 0, 0.046875, 0, 0, 0, 0,
0.0625, 0, 0, 0, 0.015625, 0, 0, 0, 0.28, 0, 0.12, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Sampling.Station", "Light.Intensity.S",
"Number.of.bat.passes", "Saparano.Pipestrelle", "Common.Pipestrelle",
"Noctule"), row.names = c(NA, -48L), class = "data.frame")

How can I plot ggplot2 columns in a loop?

I have a dataframe with lat, long and dozens of other columns. Basically, I want to write a loop where X=Long,Y=Lat remains constant, but the color changes in every loop. The color is basically the other columns, one for every plot. How can I do this?
library(maps)
library(ggplot2)
library(RColorBrewer)
library(reshape)
usamap <- ggplot2::map_data("state")
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
simplefun<-function(colname){
ggplot()+
geom_polygon( data=usamap, aes(x=long, y=lat, group=group),colour="black",fill="white")+
geom_point(data=stat,aes_string(x=stat$longitude,y=stat$latitude,color=colname))+
scale_colour_gradientn(name="name",colours = myPalette(10))+
xlab('Longitude')+
ylab('Latitude')+
coord_map(projection = "mercator")+
theme_bw()+
theme(line = element_blank())+
theme(legend.position = c(.93,.20),panel.grid.major = element_line(colour = "#808080"))+
ggsave(paste0(colname,".png"),width=10, height=8,dpi=300)
}
colname<-names(stat[4:16])
lapply(colname,simplefun)
dput(droplevels(stat))
structure(list(siteId = structure(1:16, .Label = c("US1NYAB0001",
"US1NYAB0006", "US1NYAB0010", "US1NYAB0021", "US1NYAB0023", "US1NYAB0028",
"US1NYAB0032", "US1NYAL0002", "US1NYBM0004", "US1NYBM0007", "US1NYBM0011",
"US1NYBM0014", "US1NYBM0021", "US1NYBM0024", "US1NYBM0032", "US1NYBM0034"
), class = "factor"), latitude = c(42.667, 42.7198, 42.5455,
42.6918, 42.6602, 42.7243, 42.5754, 42.2705, 42.0296, 42.0493,
42.0735, 42.3084, 42.0099, 42.1098, 42.1415, 42.0826), longitude = c(-74.0509,
-73.9304, -74.1475, -73.8311, -73.8103, -73.757, -73.7995, -77.9419,
-76.0213, -76.0288, -75.9296, -75.9569, -75.5142, -75.8858, -75.889,
-75.9912), no = c(2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 1L), min_obs = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), min_mod = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), avg_obs = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0.15,
0, 0, 0, 0, 0, 0), avg_mod = c(3136.8388671875, 2997.28173828125,
3258.61840820312, 2970.74340820312, 2992.9765625, 0, 3075.54443359375,
2701.03662109375, 2974.23413085938, 2967.5029296875, 3004.57861328125,
2965.07470703125, 3260.25463867188, 3028.55590820312, 2981.8876953125,
0), max_obs = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0.3, 0, 0, 0, 0, 0,
0), max_mod = c(6273.677734375, 5994.5634765625, 6517.23681640625,
5941.48681640625, 5985.953125, 0, 6151.0888671875, 5402.0732421875,
5948.46826171875, 5935.005859375, 6009.1572265625, 5930.1494140625,
6520.50927734375, 6057.11181640625, 5963.775390625, 0), mean_bias = c(0,
0, 0, 0, 0, NaN, 0, 0, 0, 5.05475490855863e-05, 0, 0, 0, 0, 0,
NaN), corr_coef = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA,
NA, NA, NA, NA, NA), additive_bias = c(-6273.677734375, -5994.5634765625,
-6517.23681640625, -5941.48681640625, -5985.953125, 0, -6151.0888671875,
-5402.0732421875, -5948.46826171875, -5934.705859375, -6009.1572265625,
-5930.1494140625, -6520.50927734375, -6057.11181640625, -5963.775390625,
0), mean_error = c(-3136.8388671875, -2997.28173828125, -3258.61840820312,
-2970.74340820312, -2992.9765625, 0, -3075.54443359375, -2701.03662109375,
-2974.23413085938, -2967.3529296875, -3004.57861328125, -2965.07470703125,
-3260.25463867188, -3028.55590820312, -2981.8876953125, 0), mean_abs_error = c(3136.8388671875,
2997.28173828125, 3258.61840820312, 2970.74340820312, 2992.9765625,
0, 3075.54443359375, 2701.03662109375, 2974.23413085938, 2967.3529296875,
3004.57861328125, 2965.07470703125, 3260.25463867188, 3028.55590820312,
2981.8876953125, 0), rmse = c(4436.16006895562, 4238.79648453055,
4608.38234747949, 4201.26561821133, 4232.7080465523, 0, 4349.47664966936,
3819.84262201718, 4206.20224553428, 4196.4707575116, 4249.11582411849,
4193.24886413303, 4610.69632679956, 4283.02483978603, 4217.02602018439,
0)), .Names = c("siteId", "latitude", "longitude", "no", "min_obs",
"min_mod", "avg_obs", "avg_mod", "max_obs", "max_mod", "mean_bias",
"corr_coef", "additive_bias", "mean_error", "mean_abs_error",
"rmse"), row.names = c(NA, -16L), class = "data.frame")
I had the same problem and solved it like this:
Let's assume ggplot_data in my code is like your data frame (with more than two columns) in the second post (the one from dput(droplevels(stat))?).
library(reshape) # package for melting data
shapes <- 1:ncol(ggplot_data) # number of shapes
ggplot_data <- melt(ggplot_data, id = "X1") # melt data together
p1 <- ggplot(ggplot_data, aes(X1,value))
p1 <- p1 +aes(shape = factor(variable))+ # different shapes
geom_point(aes(colour = factor(variable)))+ # different colors
scale_shape_manual(labels=colname, values = shapes)+ # same for the legend
scale_color_manual(labels=colname, values = mypalette) # same for legend
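For the original goal of one plot per column, a minimal sketch follows. It assumes ggplot2 version 3.0.0 or later for the .data pronoun, and it differs from the question's code in two ways: the column is referenced by name through .data[[colname]] rather than aes_string(), and ggsave() is called as a separate step instead of being added to the plot with +. The data frame stat_demo, the function plot_column and the argument out_dir are hypothetical, and the map background is left out to keep the example short:

```r
library(ggplot2)

# Small made-up stand-in for the stat data frame
stat_demo <- data.frame(
  longitude = c(-74.05, -73.93, -77.94, -76.02),
  latitude  = c(42.67, 42.72, 42.27, 42.03),
  avg_mod   = c(3136.8, 2997.3, 2701.0, 2974.2),
  rmse      = c(4436.2, 4238.8, 3819.8, 4206.2)
)

# Build one plot whose point colour follows the column named by colname
plot_column <- function(data, colname, out_dir = NULL) {
  p <- ggplot(data, aes(x = longitude, y = latitude)) +
    geom_point(aes(colour = .data[[colname]])) +
    labs(x = "Longitude", y = "Latitude", colour = colname) +
    theme_bw()

  # Saving is optional and separate from building the plot
  if (!is.null(out_dir)) {
    ggsave(file.path(out_dir, paste0(colname, ".png")), plot = p,
           width = 10, height = 8, dpi = 300)
  }
  p
}

# Loop over the columns of interest; the result is a named list of plots
cols <- c("avg_mod", "rmse")
plots <- lapply(setNames(cols, cols), plot_column, data = stat_demo)
```

With the real data, cols could be names(stat)[4:16], and passing out_dir = "." would write one PNG per column.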

Why PLM creates massive objects and fails to open them

I am working on a large (but not enormous) data base of 1.1 million observations x 41 variables. Data are arranged as an unbalanced panel. Using these variables I specified three different models and I ran each of them as a 1) fixed effects, 2) random effects and 3) pooled OLS regression.
The original .RData file containing only the data base is about 15Mb. The .RData containing the data base and the regression results (a total of 9 regressions) weighs about 650Mb. I do realize that (from the documentation)
An object of class c("plm","panelmodel").
A "plm" object has the following elements :
coefficients the vector of coefficients,
vcov the covariance matrix of the coefficients,
residuals the vector of residuals,
df.residual degrees of freedom of the residuals,
formula an object of class ’pFormula’ describing the model,
model a data.frame of class ’pdata.frame’ containing the variables used for the estimation: the response is in first position and the two indexes in the last positions,
ercomp an object of class ’ercomp’ providing the estimation of the components of the errors (for random effects models only),
call the call
Even so, I am not able to understand why those files should be so massive.
To avoid overloading the memory while working with the plm objects, I saved them in three different files (each of which now weighs around 200Mb).
I called summary one hour ago to see the fixed-effects model results but it hasn't shown me any results yet. My question now is pretty straightforward. Do you find this normal behavior? Is there something I can do to reduce the size of the plm objects and speed up the retrieval of results?
Here are some things you might want to know:
The data base I am using is in data.table format
formulas in the regressions are pre-assembled and are included in the plm calls preceded by as.formula(), as suggested here. Example:
form<-y~x1+x2+x3+...+xn
mod.fe<-plm(as.formula(form), regr, effect="individual", model="within", index=c("id", "year"))
Please, let me know if there is any other info I can provide and that you might need to answer the question.
EDIT
I managed to make up a small scale data base with similar characteristics as the one I am working on. Here it is:
structure(list(id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L,
10L, 10L, 11L, 11L), year = structure(c(1L, 2L, 1L, 2L, 3L, 4L,
1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 3L, 4L, 1L, 2L), .Label = c("2000", "2001", "2002",
"2003"), class = "factor"), study = c(3.37354618925767, 4.18364332422208,
5.32950777181536, 4.17953161588198, 5.48742905242849, 5.73832470512922,
6.57578135165349, 5.69461161284364, 6.3787594194582, 4.7853001128225,
7.98380973690105, 8.9438362106853, 9.07456498336519, 7.01064830413663,
10.6198257478947, 9.943871260471, 9.84420449329467, 8.52924761610073,
3.52184994489138, 4.4179415601997, 5.35867955152904, 3.897212272657,
5.38767161155937, 4.9461949594171, 3.62294044317139, 4.58500543670032,
7.10002537198388, 6.76317574845754, 6.83547640374641, 6.74663831986349
), ethnic = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L,
2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
1L, 1L, 2L, 2L), .Label = c("hispanic", "black", "chinese"), class = "factor"),
sport = c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0), health = structure(c(1L,
1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("none",
"drink", "both", "smoke"), class = "factor"), gradec = c(2.72806403942929,
3.10067738633308, 4.04728186632456, 2.19701362539883, 1.73115878111307,
5.35879931359977, 5.79613840739381, 5.07050219214859, 4.26224490644077,
3.53554192927934, 6.10515669475491, 7.18032957183198, 6.73191149590581,
6.49512764543435, 6.4783689354808, 6.19974636196512, 5.54014977312232,
6.72545652880344, 1.00223129492982, 1.08994269214495, 3.06702680106689,
1.70103126320561, 4.82973481729635, 3.14010240687364, 3.8068435242348,
5.01254268106181, 5.66497772013949, 4.16303452633342, 4.2751229553617,
3.05652055248093), event = c(1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0,
0), evm3 = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0), evm2 = c(0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 1, 0, 0, 0, 0), evm1 = c(0, 1, 0, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
1, 0, 0, 0, 0), evp1 = c(0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1),
evp2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1), evp3 = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0), ndm3 = c(1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,
1, 1, 1, 1), ndm2 = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1), ndm1 = c(1,
0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
0, 0, 1, 0, 0, 0, 1, 0, 1, 0), ndp1 = c(0, 1, 0, 0, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0), ndp2 = c(1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0),
ndp3 = c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1)), .Names = c("id",
"year", "study", "ethnic", "sport", "health", "gradec", "event",
"evm3", "evm2", "evm1", "evp1", "evp2", "evp3", "ndm3", "ndm2",
"ndm1", "ndp1", "ndp2", "ndp3"), class = "data.frame", row.names = c(NA,
30L))
The formula and the plm call I used are:
form<-gradec~year+study+ethnic+sport+health+event+evm3+evm2+evm1+evp1+evp2+evp3+ndm3+ndm2+ndm1+ndp1+ndp2+ndp3
plm.f<-plm(as.formula(form), data, effect="individual", model="within", index=c("id", "year"))
Using object.size(), as suggested by @BenBolker, I found that the call generated a plm object weighing 64.5Kb, while the original data frame has size 6.9Kb, which means that the results are about 10 times larger than the input data. I then set the options suggested by @zx8754 below, but unfortunately they had no effect.
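To see which element of a fitted model accounts for its size, object.size() can be applied to each component rather than to the whole object. The sketch below uses lm() on made-up data purely as a stand-in, so that it runs without plm installed; the same sapply() idiom should apply to a plm object, which is also a list, though that is an assumption and not something tested here:

```r
# Made-up data and a stand-in model (lm instead of plm)
set.seed(1)
d <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
fit <- lm(y ~ x1 + x2, data = d)

# Size in bytes of each component of the fitted object
sizes <- sapply(fit, object.size)
sort(sizes, decreasing = TRUE)

# Size of the whole object, for comparison
total <- object.size(fit)
print(total, units = "Kb")
```

Components that hold one value per observation (residuals, the stored model frame) grow with the number of rows, which is why fitted objects are typically much larger than their coefficient tables.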
When I finally called summary(plm.f) I got the error message:
Error in crossprod(t(X), beta) : non-conformable arguments
which I eventually got also with my large data base, but only after hours of computing.
Here it is suggested that the problem might be due to the coefficient matrix being singular. However, testing for singularity with is.singular.matrix() from the matrixcalc package shows that this is not the case.
Another couple of things you might want to know:
year, ethnic and health are factors
Variables in the formula are more or less self-explanatory except for the last ones. event is a supposed traumatic event that happened at a certain time. It is coded 1 in case of an event in a certain year and 0 otherwise. The variable evm1 is equal to 1 if one of these events happened in the year before (minus 1) and 0 otherwise. Similarly, evp1 is 1 if the event happens in the following year (plus 1) and 0 otherwise. Variables ndm. and ndp. work in the same way but they are coded 1 when that distance is not observable (because the time period for a certain individual is too short) and 0 otherwise. The presence of such closely connected variables raises the suspicion of perfect collinearity. As said above, however, a test revealed that the matrix is non-singular.
Let me say once again that I would be very thankful if someone could answer the question.
About the error message Error in crossprod(t(X), beta) : non-conformable arguments:
This is likely due to a singularity in the model matrix, just as suggested. Please keep in mind that the model matrix for fixed effects models is the transformed data (transformed data frame).
Thus, you will need to check for singularity of the transformed data. The fixed effects transformation can result in linear dependence (singularity) even if the original data are not linearly dependent! The plm package has quite good documentation about that issue in ?detect.lindep, which I am going to repeat here partly (only one example):
### Example 1 ###
# prepare the data
data(Cigar)
Cigar[ , "fact1"] <- c(0,1)
Cigar[ , "fact2"] <- c(1,0)
Cigar.p <- pdata.frame(Cigar)
# setup a pFormula and a model frame
pform <- pFormula(price ~ 0 + cpi + fact1 + fact2)
mf <- model.frame(pform, data = Cigar.p)
# no linear dependence in the pooling model's model matrix
# (with intercept in the formula, there would be linear dependence)
detect.lindep(model.matrix(pform, data = mf, model = "pooling"))
# linear dependence present in the FE transformed model matrix
modmat_FE <- model.matrix(pform, data = mf, model = "within")
detect.lindep(modmat_FE)
mod_FE <- plm(pform, data = Cigar.p, model = "within")
detect.lindep(mod_FE)
alias(mod_FE) # => fact1 == -1*fact2
plm(pform, data = mf, model = "within")$aliased # "fact2" indicated as aliased
So you should run your function to detect linear dependence on the transformed data of the model, which you get by model.matrix(your_model). You can use the functions supplied by plm (detect.lindep, alias) or any function that works on a matrix.
You could also look at your plm model object:
your_model$aliased to see if some variables have been dropped in the estimation.
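The effect can also be reproduced without plm. The following base R sketch mimics the Cigar example on made-up data: it applies the within (group-demeaning) transformation by hand and compares matrix ranks before and after. The helper within_transform is hypothetical and only illustrates the idea; it is not how plm implements the transformation:

```r
# Made-up panel: 5 individuals observed over 4 periods
set.seed(42)
n_id <- 5
n_t <- 4
id <- rep(seq_len(n_id), each = n_t)

# fact1 and fact2 alternate and always sum to 1, as in the Cigar example
X <- cbind(
  x     = rnorm(n_id * n_t),
  fact1 = rep(c(0, 1), length.out = n_id * n_t),
  fact2 = rep(c(1, 0), length.out = n_id * n_t)
)

# Within transformation: subtract each individual's mean from every column
within_transform <- function(X, id) {
  apply(X, 2, function(col) col - ave(col, id))
}
X_within <- within_transform(X, id)

rank_raw    <- qr(X)$rank         # full column rank without an intercept
rank_within <- qr(X_within)$rank  # one column is lost after demeaning
```

After demeaning, fact1 becomes exactly -1 times fact2, so the transformed matrix loses a rank even though the raw columns were not linearly dependent. This is why a singularity test on the untransformed data can come back clean.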
