geeglm in R: Error message about variable lengths differing

I am trying to run a GEE model with a logit outcome, using the following code.
mod.gee <- geeglm(general_elec ~ activity_outside_home + econ_scale_6_pt,
                  data = D_work_small, id = "pidlink",
                  family = binomial(link = "logit"), corstr = "ar1")
But I keep getting this error:
Error in model.frame.default(formula = general_elec ~ activity_outside_home+ :
variable lengths differ (found for '(id)')
My data are longitudinal, from two waves of a survey. I have tried omitting responses with NAs and changing the data types of the variables, but nothing has worked. Any suggestions, or has anyone run into this problem before?
My data is structured as follows:
> head(D_work_small)
pidlink econ_scale_6_pt activity_outside_home general_elec wave age female java educ_level pid_unit
1 001220001 3 1 1 3 48 0 0 1 0012200013
10 001220002 3 1 1 3 47 1 0 1 0012200023
19 001220003 2 1 1 3 27 0 0 4 0012200033
77 001250003 2 1 1 3 27 0 0 1 0012500033
79 001290001 2 1 1 3 52 0 0 1 0012900013
88 001290002 2 1 1 3 49 1 0 1 0012900023
> summary(D_work_small)
pidlink econ_scale_6_pt activity_outside_home general_elec wave
Length:44106 Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :3.000
Class :character 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:3.000
Mode :character Median :3.000 Median :1.0000 Median :1.0000 Median :4.000
Mean :2.894 Mean :0.7048 Mean :0.8304 Mean :3.608
3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000
Max. :6.000 Max. :1.0000 Max. :1.0000 Max. :4.000
age female java educ_level pid_unit
Min. : 14.00 Min. :0.0000 Min. :0.0000 Min. :1.0 Length:44106
1st Qu.: 26.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0 Class :character
Median : 35.00 Median :1.0000 Median :0.0000 Median :2.0 Mode :character
Mean : 37.35 Mean :0.5118 Mean :0.4171 Mean :2.2
3rd Qu.: 47.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:3.0
Max. :999.00 Max. :1.0000 Max. :1.0000 Max. :5.0
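A likely culprit (an assumption, since the full data aren't shown): in geepack, id should be an unquoted variable name that is looked up inside data. Passing the string "pidlink" gives model.frame() a length-1 character vector alongside 44,106-row variables, which is exactly what the error reports. The mechanism can be reproduced in base R with a tiny invented data frame:

```r
# Hypothetical 3-row stand-in for D_work_small
d <- data.frame(y = c(1, 0, 1), x = c(2, 3, 4), pidlink = c("a", "a", "b"))

# A quoted id is a length-1 string, so variable lengths differ:
bad <- try(model.frame(y ~ x, data = d, id = "pidlink"), silent = TRUE)

# An unquoted id is evaluated as a column of `data` and has the right length:
ok <- model.frame(y ~ x, data = d, id = pidlink)
```

If that is the cause here, writing id = pidlink (no quotes) in the geeglm() call should fix it.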


Finding the right cluster methods based on data distribution

My data set has 821,050 rows and 18 columns. The rows represent different online users, the columns the browsing behavior of the users in an online shop. The column variables include shopping cart cancellations, number of items in the shopping cart, detailed view of items, product list/multi-item view, detailed search view, etc. Half of the variables are discrete, half are continuous; 8 of the variables are dummy variables. Based on this data set, I want to apply different hard and soft clustering methods and analyze the shopping cart abandonment more precisely. With the help of descriptive statistics I have analyzed my data set and obtained the following results.
WKA_ohneJB <- read.csv("WKA_ohneJB_PCA.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE)
summary(WKA_ohneJB)
X BASKETS_NZ LOGONS PIS PIS_AP PIS_DV
Min. : 1 Min. : 0.000 Min. :0.0000 Min. : 1.00 Min. : 0.000 Min. : 0.000
1st Qu.:205263 1st Qu.: 1.000 1st Qu.:1.0000 1st Qu.: 9.00 1st Qu.: 0.000 1st Qu.: 0.000
Median :410525 Median : 1.000 Median :1.0000 Median : 20.00 Median : 1.000 Median : 1.000
Mean :410525 Mean : 1.023 Mean :0.9471 Mean : 31.11 Mean : 1.783 Mean : 4.554
3rd Qu.:615786 3rd Qu.: 1.000 3rd Qu.:1.0000 3rd Qu.: 41.00 3rd Qu.: 2.000 3rd Qu.: 5.000
Max. :821048 Max. :49.000 Max. :1.0000 Max. :593.00 Max. :71.000 Max. :203.000
PIS_PL PIS_SDV PIS_SHOPS PIS_SR QUANTITY WKA
Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 1.00 Min. :0.0000
1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.:0.0000
Median : 0.000 Median : 0.00 Median : 2.00 Median : 0.000 Median : 2.00 Median :1.0000
Mean : 5.729 Mean : 2.03 Mean : 10.67 Mean : 3.873 Mean : 3.14 Mean :0.6341
3rd Qu.: 4.000 3rd Qu.: 2.00 3rd Qu.: 11.00 3rd Qu.: 4.000 3rd Qu.: 4.00 3rd Qu.:1.0000
Max. :315.000 Max. :142.00 Max. :405.00 Max. :222.000 Max. :143.00 Max. :1.0000
NEW_CUST EXIST_CUST WEB_CUST MOBILE_CUST TABLET_CUST LOGON_CUST_STEP2
Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.00000 Median :1.0000 Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.07822 Mean :0.9218 Mean :0.4704 Mean :0.3935 Mean :0.1361 Mean :0.1743
3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
With the non-dummy variables it is noticeable that they have a right-skewed distribution. For the dummy variables, 5 have a right-skewed distribution and 3 have a left-skewed distribution.
I have also listed the range and quartiles for the 9 non-dummy variables:
# BASKETS_NZ
range(WKA_ohneJB$BASKETS_NZ) # 0 49
quantile(WKA_ohneJB$BASKETS_NZ, 0.5) # 1
quantile(WKA_ohneJB$BASKETS_NZ, 0.25) # 1
quantile(WKA_ohneJB$BASKETS_NZ, 0.75) # 1
# PIS
range(WKA_ohneJB$PIS) # 1 593
quantile(WKA_ohneJB$PIS, 0.25) # 9
quantile(WKA_ohneJB$PIS, 0.5) # 20
quantile(WKA_ohneJB$PIS, 0.75) # 41
# PIS_AP
range(WKA_ohneJB$PIS_AP) # 0 71
quantile(WKA_ohneJB$PIS_AP, 0.25) # 0
quantile(WKA_ohneJB$PIS_AP, 0.5) # 1
quantile(WKA_ohneJB$PIS_AP, 0.75) # 2
# PIS_DV
range(WKA_ohneJB$PIS_DV) # 0 203
quantile(WKA_ohneJB$PIS_DV, 0.25) # 0
quantile(WKA_ohneJB$PIS_DV, 0.5) # 1
quantile(WKA_ohneJB$PIS_DV, 0.75) # 5
#PIS_PL
range(WKA_ohneJB$PIS_PL) # 0 315
quantile(WKA_ohneJB$PIS_PL, 0.25) # 0
quantile(WKA_ohneJB$PIS_PL, 0.5) # 0
quantile(WKA_ohneJB$PIS_PL, 0.75) # 4
#PIS_SDV
range(WKA_ohneJB$PIS_SDV) # 0 142
quantile(WKA_ohneJB$PIS_SDV, 0.25) # 0
quantile(WKA_ohneJB$PIS_SDV, 0.5) # 0
quantile(WKA_ohneJB$PIS_SDV, 0.75) # 2
# PIS_SHOPS
range(WKA_ohneJB$PIS_SHOPS) # 0 405
quantile(WKA_ohneJB$PIS_SHOPS, 0.25) # 0
quantile(WKA_ohneJB$PIS_SHOPS, 0.5) # 2
quantile(WKA_ohneJB$PIS_SHOPS, 0.75) # 11
# PIS_SR
range(WKA_ohneJB$PIS_SR) # 0 222
quantile(WKA_ohneJB$PIS_SR, 0.25) # 0
quantile(WKA_ohneJB$PIS_SR, 0.5) # 0
quantile(WKA_ohneJB$PIS_SR, 0.75) # 4
# QUANTITY
range(WKA_ohneJB$QUANTITY) # 1 143
quantile(WKA_ohneJB$QUANTITY, 0.25) # 1
quantile(WKA_ohneJB$QUANTITY, 0.5) # 2
quantile(WKA_ohneJB$QUANTITY, 0.75) # 4
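The per-variable calls above can be collapsed into a single expression. A small sketch on made-up data (the real argument would be the nine non-dummy columns of WKA_ohneJB):

```r
# Synthetic stand-in: two right-skewed count variables, like BASKETS_NZ / PIS
set.seed(1)
d <- data.frame(BASKETS_NZ = rpois(1000, 1), PIS = rpois(1000, 30))

# Min, quartiles and max for every column at once, one row per variable
res <- t(sapply(d, quantile, probs = c(0, 0.25, 0.5, 0.75, 1)))
res
```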
How can I recognize from the distribution of my data which cluster methods are suitable for mixed type clickstream data?

Error in eval(predvars, data, env) : object 'Sewer' not found

I have a set of data containing species names and numbers (spp_data), and I am trying to test how the species are influenced by different parameters such as pH, conductivity, and the Sewer position (Upstream/Downstream) (env_data1).
When I'm trying to run the lm() I get the following error:
lm1 <- lm(specnumber ~ Sewer + pH + Conductivity, data=spp_data,env_data1)
Error in eval(predvars, data, env) : object 'Sewer' not found
Is it because the column Sewer is non-numeric?
I also tried to exclude that column and run the lm() but it did not work.
species data
summary(spp_data)
Pisidium G_pulex C_pseudo A_aquatic V_pisc
Min. :0.000 Min. : 0.00 Min. : 0.000 Min. :0.0000 Min. :0.00000
1st Qu.:0.000 1st Qu.: 3.00 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.000 Median : 8.00 Median : 3.000 Median :0.0000 Median :0.00000
Mean :1.429 Mean :16.86 Mean : 4.476 Mean :0.5714 Mean :0.04762
3rd Qu.:2.000 3rd Qu.:20.00 3rd Qu.:10.000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :7.000 Max. :68.00 Max. :16.000 Max. :4.0000 Max. :1.00000
Taeniopt Rhyacoph Hydropsy Lepidost Glossos
Min. :0.00000 Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 0.00
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.: 0.00
Median :0.00000 Median :0.0000 Median :0.000 Median :0.000 Median : 0.00
Mean :0.09524 Mean :0.2381 Mean :1.286 Mean :1.238 Mean : 1.81
3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 1.00
Max. :2.00000 Max. :2.0000 Max. :5.000 Max. :7.000 Max. :14.00
Agapetus Hydroptil Limneph S_person Tipula
Min. : 0.0000 Min. :0.00000 Min. :0.000 Min. :0.00000 Min. :0
1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0
Median : 0.0000 Median :0.00000 Median :0.000 Median :0.00000 Median :0
Mean : 0.5714 Mean :0.04762 Mean :0.381 Mean :0.09524 Mean :0
3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0
Max. :12.0000 Max. :1.00000 Max. :2.000 Max. :2.00000 Max. :0
Culicida Ceratopo Simuliid Chrinomi Chrnomus
Min. :0.0000 Min. : 0 Min. : 0.0000 Min. : 0.000 Min. : 0.000
1st Qu.:0.0000 1st Qu.: 0 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 1.000
Median :0.0000 Median : 1 Median : 0.0000 Median : 2.000 Median : 3.000
Mean :0.5714 Mean : 7 Mean : 0.5238 Mean : 7.286 Mean : 6.095
3rd Qu.:0.0000 3rd Qu.: 8 3rd Qu.: 0.0000 3rd Qu.: 8.000 3rd Qu.: 6.000
Max. :5.0000 Max. :31 Max. :10.0000 Max. :67.000 Max. :41.000
environmental data
summary(env_data)
Sample Sewer pH Conductivity
Length:21 Length:21 Min. :7.780 Length:21
Class :character Class :character 1st Qu.:7.850 Class :character
Mode :character Mode :character Median :8.100 Mode :character
Mean :8.044
3rd Qu.:8.270
Max. :8.280
Depth %rock %mud %sand
Min. : 7.00 Min. :10.00 Min. : 0 Length:21
1st Qu.: 8.00 1st Qu.:10.00 1st Qu.:20 Class :character
Median :11.00 Median :70.00 Median :30 Mode :character
Mean :17.14 Mean :57.14 Mean :40
3rd Qu.:28.00 3rd Qu.:80.00 3rd Qu.:90
Max. :40.00 Max. :90.00 Max. :90
Assuming that the rows of your spp_data match the rows of your environmental data ... I think if you do
lm1 <- lm(as.matrix(spp_data) ~ Sewer + pH + Conductivity, data = env_data1)
you will get the results of running 44 separate linear models, one for each species. (Be careful: with 44 regressions and only 21 observations, you may need to do some multiple comparisons corrections to avoid overstating your conclusions.)
There are R packages for more sophisticated multi-species analyses such as mvabund or gllvm, but they might not apply to a data set this size ...
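For concreteness, here is the matrix-response idea on small invented data (all names and values below are made up, not the asker's real columns): lm() with a matrix on the left-hand side fits one regression per response column.

```r
set.seed(42)
# Invented environmental predictors for 20 sites
env <- data.frame(Sewer = rep(c("Upstream", "Downstream"), each = 10),
                  pH = runif(20, 7.8, 8.3),
                  Conductivity = runif(20, 200, 600))
# Invented counts for 3 species at the same 20 sites
spp <- matrix(rpois(20 * 3, 5), nrow = 20,
              dimnames = list(NULL, c("Pisidium", "G_pulex", "C_pseudo")))

# One linear model per species column; fit has class "mlm"
fit <- lm(spp ~ Sewer + pH + Conductivity, data = env)
coef(fit)  # one column of coefficients per species
```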

Error in using by()

Pretty new to R... I'm following an example from The R Book by Crawley. I downloaded a data set from the book's website and loaded it into a data frame:
worms <- read.table("worms.txt", header=T)
attach(worms)
head(worms)
Field.Name Area Slope Vegetation Soil.pH Damp Worm.density
1 Nashs.Field 3.6 11 Grassland 4.1 FALSE 4
2 Silwood.Bottom 5.1 2 Arable 5.2 FALSE 7
3 Nursery.Field 2.8 3 Grassland 4.3 FALSE 2
4 Rush.Meadow 2.4 5 Meadow 4.9 TRUE 5
5 Gunness.Thicket 3.8 0 Scrub 4.2 FALSE 6
6 Oak.Mead 3.1 2 Grassland 3.9 FALSE 2
Now, according to the book I should be able to type
by(worms,Vegetation,mean)
and get output that is grouped by vegetation type, with means for the numerical variables (TRUE/FALSE coerced to 0/1) and NA for the character-string variables.
See page 163 in the R book
But instead I get this (apologies for the image but blockquote wasn't preserving the formatting)
Am I doing something wrong? Is this a version issue between what the book was using and what I now have? I'm thoroughly confused...
Update: here are some of the requested outputs
> str(worms)
'data.frame': 20 obs. of 7 variables:
$ Field.Name : Factor w/ 20 levels "Ashurst","Cheapside",..: 8 17 10 16 7 11 3 1 19 15 ...
$ Area : num 3.6 5.1 2.8 2.4 3.8 3.1 3.5 2.1 1.9 1.5 ...
$ Slope : num 11 2 3 5 0 2 3 0 0 4 ...
$ Vegetation : Factor w/ 5 levels "Arable","Grassland",..: 2 1 2 3 5 2 2 1 4 2 ...
$ Soil.pH : num 4.1 5.2 4.3 4.9 4.2 3.9 4.2 4.8 5.7 5 ...
$ Damp : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
$ Worm.density: int 4 7 2 5 6 2 3 4 9 7 ...
> by(worms, Vegetation, summary)
Vegetation: Arable
Field.Name Area Slope Vegetation Soil.pH Damp
Ashurst :1 Min. :2.100 Min. :0.000 Arable :3 Min. :4.500 Mode :logical
Pound.Hill :1 1st Qu.:3.250 1st Qu.:1.000 Grassland:0 1st Qu.:4.650 FALSE:3
Silwood.Bottom:1 Median :4.400 Median :2.000 Meadow :0 Median :4.800 NA's :0
Cheapside :0 Mean :3.867 Mean :1.333 Orchard :0 Mean :4.833
Church.Field :0 3rd Qu.:4.750 3rd Qu.:2.000 Scrub :0 3rd Qu.:5.000
Farm.Wood :0 Max. :5.100 Max. :2.000 Max. :5.200
(Other) :0
Worm.density
Min. :4.000
1st Qu.:4.500
Median :5.000
Mean :5.333
3rd Qu.:6.000
Max. :7.000
---------------------------------------------------------------------------
Vegetation: Grassland
Field.Name Area Slope Vegetation Soil.pH Damp
Church.Field :1 Min. :1.500 Min. : 1.000 Arable :0 Min. :3.5 Mode :logical
Gravel.Pit :1 1st Qu.:2.800 1st Qu.: 2.000 Grassland:9 1st Qu.:3.9 FALSE:8
Nashs.Field :1 Median :3.100 Median : 3.000 Meadow :0 Median :4.1 TRUE :1
North.Gravel :1 Mean :2.911 Mean : 3.667 Orchard :0 Mean :4.1 NA's :0
Nursery.Field:1 3rd Qu.:3.500 3rd Qu.: 4.000 Scrub :0 3rd Qu.:4.2
Oak.Mead :1 Max. :3.700 Max. :11.000 Max. :5.0
(Other) :3
Worm.density
Min. :0.000
1st Qu.:1.000
Median :2.000
Mean :2.444
3rd Qu.:3.000
Max. :7.000
---------------------------------------------------------------------------
Vegetation: Meadow
Field.Name Area Slope Vegetation Soil.pH Damp
Pond.Field :1 Min. :2.400 Min. :0.000 Arable :0 Min. :4.900 Mode:logical
Rush.Meadow :1 1st Qu.:3.150 1st Qu.:0.000 Grassland:0 1st Qu.:4.900 TRUE:3
Water.Meadow:1 Median :3.900 Median :0.000 Meadow :3 Median :4.900 NA's:0
Ashurst :0 Mean :3.467 Mean :1.667 Orchard :0 Mean :4.933
Cheapside :0 3rd Qu.:4.000 3rd Qu.:2.500 Scrub :0 3rd Qu.:4.950
Church.Field:0 Max. :4.100 Max. :5.000 Max. :5.000
(Other) :0
Worm.density
Min. :5.000
1st Qu.:5.500
Median :6.000
Mean :6.333
3rd Qu.:7.000
Max. :8.000
---------------------------------------------------------------------------
Vegetation: Orchard
Field.Name Area Slope Vegetation Soil.pH Damp
The.Orchard :1 Min. :1.9 Min. :0 Arable :0 Min. :5.7 Mode :logical
Ashurst :0 1st Qu.:1.9 1st Qu.:0 Grassland:0 1st Qu.:5.7 FALSE:1
Cheapside :0 Median :1.9 Median :0 Meadow :0 Median :5.7 NA's :0
Church.Field:0 Mean :1.9 Mean :0 Orchard :1 Mean :5.7
Farm.Wood :0 3rd Qu.:1.9 3rd Qu.:0 Scrub :0 3rd Qu.:5.7
Garden.Wood :0 Max. :1.9 Max. :0 Max. :5.7
(Other) :0
Worm.density
Min. :9
1st Qu.:9
Median :9
Mean :9
3rd Qu.:9
Max. :9
---------------------------------------------------------------------------
Vegetation: Scrub
Field.Name Area Slope Vegetation Soil.pH Damp
Cheapside :1 Min. :0.800 Min. : 0 Arable :0 Min. :4.200 Mode :logical
Farm.Wood :1 1st Qu.:1.850 1st Qu.: 6 Grassland:0 1st Qu.:4.575 FALSE:2
Garden.Wood :1 Median :2.550 Median : 9 Meadow :0 Median :4.900 TRUE :2
Gunness.Thicket:1 Mean :2.425 Mean : 7 Orchard :0 Mean :4.800 NA's :0
Ashurst :0 3rd Qu.:3.125 3rd Qu.:10 Scrub :4 3rd Qu.:5.125
Church.Field :0 Max. :3.800 Max. :10 Max. :5.200
(Other) :0
Worm.density
Min. :3.00
1st Qu.:3.75
Median :5.00
Mean :5.25
3rd Qu.:6.50
Max. :8.00
According to the book I should be seeing the table of group means shown on page 163.
First, the page prior to the one you show has an attach(worms) command, which is why you are able to use Vegetation and not worms$Vegetation.
A reproducible question would have included
worms <- read.table("http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/worms.txt", header = TRUE)
attach(worms)
It's possible that the code for by.data.frame has changed since the book was written: I can reproduce your output (by() no longer produces a table of column-wise means per Vegetation type), but I can reproduce what's in the book using
by(worms, Vegetation, function(x) sapply(x, mean))
#> Vegetation: Arable
#> Field.Name Area Slope Vegetation Soil.pH
#> NA 3.866667 1.333333 NA 4.833333
#> Damp Worm.density
#> 0.000000 5.333333
#> --------------------------------------------------------
#> Vegetation: Grassland
#> Field.Name Area Slope Vegetation Soil.pH
#> NA 2.9111111 3.6666667 NA 4.1000000
#> Damp Worm.density
#> 0.1111111 2.4444444
#> --------------------------------------------------------
#> Vegetation: Meadow
#> Field.Name Area Slope Vegetation Soil.pH
#> NA 3.466667 1.666667 NA 4.933333
#> Damp Worm.density
#> 1.000000 6.333333
#> --------------------------------------------------------
#> Vegetation: Orchard
#> Field.Name Area Slope Vegetation Soil.pH
#> NA 1.9 0.0 NA 5.7
#> Damp Worm.density
#> 0.0 9.0
#> --------------------------------------------------------
#> Vegetation: Scrub
#> Field.Name Area Slope Vegetation Soil.pH
#> NA 2.425 7.000 NA 4.800
#> Damp Worm.density
#> 0.500 5.250
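The NA entries come from the factor/character columns, for which mean() returns NA (with a warning). A tiny sketch with invented data shows the same behaviour:

```r
# Invented miniature of the worms layout: one character column, one numeric,
# one logical
df <- data.frame(Vegetation = c("Arable", "Arable", "Scrub"),
                 Area = c(2.1, 5.1, 3.8),
                 Damp = c(FALSE, FALSE, TRUE))

# Numeric and logical columns get means; the character column yields NA
res <- by(df, df$Vegetation, function(x) sapply(x, mean))
```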

Convert summary to data.frame

I have this admission_table containing ADMIT, GRE, GPA and RANK.
> head(admission_table)
ADMIT GRE GPA RANK
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
6 1 760 3.00 2
I'm trying to convert the summary of this table into a data.frame, with ADMIT, GRE, GPA and RANK as the column headers.
> summary(admission_table)
ADMIT GRE GPA RANK
Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
Median :0.0000 Median :580.0 Median :3.395 Median :2.000
Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
> as.data.frame(summary(admission_table))
Var1 Var2 Freq
1 ADMIT Min. :0.0000
2 ADMIT 1st Qu.:0.0000
3 ADMIT Median :0.0000
4 ADMIT Mean :0.3175
5 ADMIT 3rd Qu.:1.0000
6 ADMIT Max. :1.0000
7 GRE Min. :220.0
8 GRE 1st Qu.:520.0
9 GRE Median :580.0
10 GRE Mean :587.7
11 GRE 3rd Qu.:660.0
12 GRE Max. :800.0
13 GPA Min. :2.260
14 GPA 1st Qu.:3.130
15 GPA Median :3.395
16 GPA Mean :3.390
17 GPA 3rd Qu.:3.670
18 GPA Max. :4.000
19 RANK Min. :1.000
20 RANK 1st Qu.:2.000
21 RANK Median :2.000
22 RANK Mean :2.485
23 RANK 3rd Qu.:3.000
24 RANK Max. :4.000
As I'm trying to convert it into a data.frame, this is the only result I get. I want the data frame to have the exact output of the summary table, because afterwards I want to insert it into an Oracle database using this line of code:
dbWriteTable(connection,name="SUM_ADMISSION_TABLE",value=as.data.frame(summary(admission_table)),row.names = FALSE, overwrite = TRUE ,append = FALSE)
Is there any way to do so?
You can consider unclass, I suppose:
data.frame(unclass(summary(mydf)), check.names = FALSE, stringsAsFactors = FALSE)
# ADMIT GRE GPA RANK
# 1 Min. :0.0000 Min. :380.0 Min. :2.930 Min. :1.000
# 2 1st Qu.:0.2500 1st Qu.:550.0 1st Qu.:3.047 1st Qu.:2.250
# 3 Median :1.0000 Median :650.0 Median :3.400 Median :3.000
# 4 Mean :0.6667 Mean :626.7 Mean :3.400 Mean :2.833
# 5 3rd Qu.:1.0000 3rd Qu.:735.0 3rd Qu.:3.655 3rd Qu.:3.750
# 6 Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
str(.Last.value)
# 'data.frame': 6 obs. of 4 variables:
# $ ADMIT: chr "Min. :0.0000 " "1st Qu.:0.2500 " "Median :1.0000 " "Mean :0.6667 " ...
# $ GRE : chr "Min. :380.0 " "1st Qu.:550.0 " "Median :650.0 " "Mean :626.7 " ...
# $ GPA : chr "Min. :2.930 " "1st Qu.:3.047 " "Median :3.400 " "Mean :3.400 " ...
# $ RANK: chr "Min. :1.000 " "1st Qu.:2.250 " "Median :3.000 " "Mean :2.833 " ...
Note that there is a lot of excessive whitespace there, in both the names and the values.
However, it might be sufficient to do something like:
do.call(cbind, lapply(mydf, summary))
# ADMIT GRE GPA RANK
# Min. 0.0000 380.0 2.930 1.000
# 1st Qu. 0.2500 550.0 3.048 2.250
# Median 1.0000 650.0 3.400 3.000
# Mean 0.6667 626.7 3.400 2.833
# 3rd Qu. 1.0000 735.0 3.655 3.750
# Max. 1.0000 800.0 4.000 4.000
Another way to output a dataframe is:
as.data.frame(apply(mydf, 2, summary))
This works only if all the selected columns are numeric. It may also throw an Error in dimnames(x) if there are columns with NA's, since summary() then returns vectors of different lengths; it's worth checking the result without the as.data.frame() call first.
None of these solutions actually capture the printed output of the summary function. The tidy() function (from the broom package) extracts the elements from a summary object into a plain data.frame, so it does not preserve other features or formatting either.
If you want the exact output of the summary function in a data frame, you can do:
output <- capture.output(summary(thisModel))
output_df <- as.data.frame(output)
This retains all of the new lines and is suitable for writing to XLSX, etc., which will result in the output appropriately spaced across rows.
If you want this output collapsed into a single cell, you can do:
output_collapsed <- paste(output, collapse = "\n")
output_df <- as.data.frame(output_collapsed)
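As a self-contained check of the capture.output() route (the data frame here is invented, standing in for whatever object is being summarised):

```r
# Small invented table in place of the real model or data
df <- data.frame(GRE = c(380, 660, 800), GPA = c(3.61, 3.67, 4.00))

# Each printed line of summary() becomes one row of the data frame
output <- capture.output(summary(df))
output_df <- as.data.frame(output)

# Or collapse the whole printout into a single string (one cell)
output_collapsed <- paste(output, collapse = "\n")
```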
