Combine data frames after using rvest - r

My task is to grab baseball data from all 30 teams and combine it all into one table. However, I keep getting integer(0) as a return. Here are my data frames:
install.packages("rvest")
library(rvest)
# Store web url
baseball1 <- read_html("http://www.baseball-reference.com/teams/ARI/")
#Scrape the website for the franchise table
franch1 <- baseball1 %>%
html_nodes("#franchise_years") %>%
html_table()
franch1
# Store web url
baseball2 <- read_html("http://www.baseball-reference.com/teams/ATL/")
#Scrape the website for the franchise table
franch2 <- baseball2 %>%
html_nodes("#franchise_years") %>%
html_table()
franch2
Here is the structure of the data frame: str(franch1)
List of 1
$ :'data.frame': 18 obs. of 21 variables:
..$ Rk : int [1:18] 1 2 3 4 5 6 7 8 9 10 ...
..$ Year : int [1:18] 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
..$ Tm : chr [1:18] "Arizona Diamondbacks" "Arizona Diamondbacks" "Arizona Diamondbacks" "Arizona Diamondbacks" ...
..$ Lg : chr [1:18] "NL West" "NL West" "NL West" "NL West" ...
..$ G : int [1:18] 162 162 162 162 162 162 162 162 162 162 ...
..$ W : int [1:18] 79 64 81 81 94 65 70 82 90 76 ...
..$ L : int [1:18] 83 98 81 81 68 97 92 80 72 86 ...
..$ Ties : int [1:18] 0 0 0 0 0 0 0 0 0 0 ...
..$ W-L% : num [1:18] 0.488 0.395 0.5 0.5 0.58 0.401 0.432 0.506 0.556 0.469 ...
..$ pythW-L% : num [1:18] 0.504 0.415 0.493 0.53 0.545 0.428 0.462 0.509 0.487 0.491 ...
..$ Finish : chr [1:18] "3rd of 5" "5th of 5" "2nd of 5" "3rd of 5" ...
..$ GB : chr [1:18] "13.0" "30.0" "11.0" "13.0" ...
..$ Playoffs : chr [1:18] "" "" "" "" ...
..$ R : int [1:18] 720 615 685 734 731 713 720 720 712 773 ...
..$ RA : int [1:18] 713 742 695 688 662 836 782 706 732 788 ...
..$ BatAge : num [1:18] 26.6 27.6 28.1 28.3 28.2 26.8 26.5 26.7 26.6 29.6 ...
..$ PAge : num [1:18] 27.1 28 27.6 27.4 27.4 27.9 27.7 29.4 28.2 28.8 ...
..$ #Bat : int [1:18] 50 52 44 48 51 48 45 41 47 45 ...
..$ #P : int [1:18] 27 25 23 23 25 28 24 20 26 25 ...
..$ Top Player: chr [1:18] "P.Goldschmidt (8.8)" "P.Goldschmidt (4.5)" "P.Goldschmidt (7.1)" "A.Hill (5.0)" ...
..$ Managers : chr [1:18] "C.Hale (79-83)" "K.Gibson (63-96) and A.Trammell (1-2)" "K.Gibson (81-81)" "K.Gibson (81-81)" ...
What function do I use to combine these data frames? Your help is much appreciated and let me know if I need to provide additional info.

It's because html_table() returns each franchise table inside a list, so the results still need to be converted into data frames before you can combine them. Also, "read_html" didn't work for me, so I used "html" instead.
Try this:
# Store web url using "html" not "read_html"
baseball1 <- html("http://www.baseball-reference.com/teams/ARI/")
#Scrape the website for the franchise table
franch1 <- baseball1 %>%
html_nodes("#franchise_years") %>%
html_table()
franch1
# Store web url
baseball2 <- html("http://www.baseball-reference.com/teams/ATL/")
#Scrape the website for the franchise table
franch2 <- baseball2 %>%
html_nodes("#franchise_years") %>%
html_table()
franch2
franch1 <- as.data.frame(franch1)
franch2 <- as.data.frame(franch2)
franchMerged <- rbind(franch1, franch2)
Let me know if that works for you.
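To scale this up to all 30 teams, one option is to loop over the team codes and bind the results. The sketch below is only an outline: it assumes every team page exposes the same #franchise_years table with identical columns, and only two of the 30 three-letter codes are filled in. Note that in current rvest versions read_html() is the supported reader (html() has since been deprecated), so the sketch uses read_html():
library(rvest)
# Three-letter codes as used in the baseball-reference URLs;
# only two are shown here -- fill in the remaining codes for all 30 clubs
teams <- c("ARI", "ATL")
scrape_franchise <- function(code) {
  url <- paste0("http://www.baseball-reference.com/teams/", code, "/")
  tbl <- read_html(url) %>%
    html_nodes("#franchise_years") %>%
    html_table()
  as.data.frame(tbl)  # html_table() returns a list holding one data frame
}
franchAll <- do.call(rbind, lapply(teams, scrape_franchise))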

Related

Lapply over subset of dataframes without splitting further

I am at a loss on how to do this without addressing each individual part. I have an initial timeseries dataset that I split into a list of 12 dataframes representing each month. Within each month, I want to run calculations and ggplot on each unique site without having to call each individual site. The structure currently is as follows:
$ April :'data.frame': 9360 obs. of 15 variables:
..$ site_id : int [1:9360] 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 ...
..$ UTC_date.1 : Date[1:9360], format: "2005-04-01" "2005-04-02" "2005-04-03" "2005-04-04" ...
..$ POSIXct : POSIXct[1:9360], format: "2005-04-01 06:00:00" "2005-04-02 06:00:00" "2005-04-03 06:00:00" "2005-04-04 06:00:00" ...
..$ swe_mm : num [1:9360] 45.9 44.6 43.5 42.4 41.2 ...
..$ fsca : num [1:9360] 1 1 1 1 0.997 ...
..$ snoht_m : num [1:9360] 0.303 0.239 0.21 0.186 0.165 ...
..$ swe_mm.1 : num [1:9360] 45.9 44.6 43.5 42.4 41.2 ...
..$ fsca.1 : num [1:9360] 1 1 1 1 0.997 ...
..$ snoht_m.1 : num [1:9360] 0.303 0.239 0.21 0.186 0.165 ...
..$ actSWE_mm : num [1:9360] 279 282 282 282 282 284 292 295 295 295 ...
..$ actSD_cm : num [1:9360] 79 79 NA 79 79 81 185 81 81 81 ...
..$ swe_Res_mm : num [1:9360] 233 237 238 240 241 ...
..$ snoht_Res_m : num [1:9360] 0.487 0.551 NA 0.604 0.625 ...
..$ swe_Res1_mm : num [1:9360] 233 237 238 240 241 ...
..$ snoht_Res1_m: num [1:9360] 0.487 0.551 NA 0.604 0.625 ...
I can use lapply to calculate the standardized rmse without issue if I apply it to each dataframe entirely:
stdres.fun <- function(data,x,out) {data[out] <- data[[x]] / ((sum(data[[x]]^2, na.rm = TRUE)/NROW(data))^.5); data}
monthSplit <- lapply(monthSplit, stdres.fun, x = "swe_Res_mm", out="stdSWE_res")
However, I am having trouble figuring out how to run this calculation on each unique site_id. What I mean to say is there are 32 different sites. They are the same sites in each dataframe, however I want to calculate the rmse for each site within each dataframe in the list. So if I had sites 946 and 1003, the calculation would run on each of those separately rather than together.
I'm assuming I can split the data further into different lists but I feel like this would be messier than it already is. Is there another way I can go about doing this?
We could modify the function and use tidyverse methods
library(purrr)
library(dplyr)
monthSplit2 <- map(monthSplit, ~
  .x %>%
    group_by(site_id) %>%
    mutate(stdSWE_res = swe_Res_mm / ((sum(swe_Res_mm^2,
                                           na.rm = TRUE) / n())^.5)))
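If you prefer to stay in base R and reuse your existing stdres.fun, a sketch of the same idea is to split each month's data frame by site_id, apply the function per site, and bind the pieces back together (column names as in your str() output; note the rows come back ordered by site):
monthSplit2 <- lapply(monthSplit, function(m) {
  bySite <- lapply(split(m, m$site_id), stdres.fun,
                   x = "swe_Res_mm", out = "stdSWE_res")
  do.call(rbind, bySite)
})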

How to concatenate all values in a column which is a list of data frames in R

structure of data frame
> str(df)
'data.frame': 459 obs. of 6 variables:
$ Source : chr "Mumbai" "Mumbai" "Bangalore" "Bangalore" ...
$ Destination: chr "Bangalore" "Bangalore" "Chennai" "Cochin" ...
$ src_loc :'data.frame': 459 obs. of 2 variables:
..$ lon: num 72.9 72.9 77.6 77.6 73.9 ...
..$ lat: num 19.1 19.1 13 13 18.5 ...
$ dest_loc :'data.frame': 459 obs. of 2 variables:
..$ lon: num 77.6 77.6 80.3 76.3 78.5 ...
..$ lat: num 12.97 12.97 13.08 9.93 17.39 ...
$ route_line:List of 459
..$ :'data.frame': 219 obs. of 2 variables:
.. ..$ lat: num 19.1 19.1 19.1 19.1 19.1 ...
.. ..$ lon: num 72.9 72.9 72.9 72.9 73 ...
..$ :'data.frame': 219 obs. of 2 variables:
.. ..$ lat: num 19.1 19.1 19.1 19.1 19.1 ...
.. ..$ lon: num 72.9 72.9 72.9 72.9 73 ...
..$ :'data.frame': 244 obs. of 2 variables:
.. ..$ lat: num 13 13 13 13 13 ...
.. ..$ lon: num 77.6 77.6 77.6 77.6 77.6 ...
..$ :'data.frame': 228 obs. of 2 variables:
.. ..$ lat: num 13 13 13 12.9 12.9 ...
.. ..$ lon: num 77.6 77.6 77.6 77.6 77.6 ...
..$ :'data.frame': 232 obs. of 2 variables:
.. ..$ lat: num 18.5 18.5 18.5 18.5 18.5 ...
.. ..$ lon: num 73.9 73.9 73.9 73.9 73.9 ...
..$ :'data.frame': 234 obs. of 2 variables:
.. ..$ lat: num 15.4 15.4 15.4 15.4 15.4 ...
.. ..$ lon: num 75.1 75.1 75.1 75.1 75.1 ...
..$ :'data.frame': 218 obs. of 2 variables:
.. ..$ lat: num 17.4 17.4 17.4 17.5 17.5 ...
.. ..$ lon: num 78.5 78.5 78.5 78.5 78.5 ...
so on..
> df$route_line[[1]] #gives a data frame
lat lon
1 19.07597 72.87765
2 19.06575 72.89918
3 19.06331 72.91443
4 19.05159 72.93661
5 19.06758 72.98437
6 19.06653 73.02000
7 19.04099 73.02868
8 19.02309 73.04452
9 19.03844 73.07676
10 18.99688 73.13215
11 18.98191 73.14718
12 18.96049 73.15789
13 18.94201 73.15694
14 18.92484 73.16662
15 18.89439 73.20433
16 18.84075 73.24026
17 18.81434 73.27669
18 18.79409 73.29148
19 18.77373 73.32182
20 18.77023 73.33760
21 18.76414 73.34698
22 18.77114 73.36076
23 18.76580 73.35765
24 18.77090 73.36348
25 18.75822 73.37283
26 18.76368 73.38653
27 18.76939 73.40145
28 18.76301 73.41848
29 18.75766 73.42920
30 18.73973 73.42921
I want to create a new column (named route_str) that, for every row in df, contains the string obtained by concatenating all latitudes and longitudes from that row's route_line data frame (like the one obtained above).
For example,
> df$route_str[1] #should give
[1] "19.07597 72.87765, 19.06575 72.89918, 19.06331 72.91443,19.05159 72.93661..." so on till 30
I tried this
> fun <- function(ip)
+ {
+ a <- ip[[1]]
+ a[3] <- paste(a[1],a[2]," ")
+ op <- paste(a[3],collapse = ",")
+ return(op)
+ }
> df$route_str <- lapply(df$route_line,fun)
But the output I get is
> unique_routes$route_str[1]
[[1]]
[1] "19.0759696960449 19.0657501220703 "
I tried to create reproducible data using the following code, but the structure isn't the same:
df <- data.frame(src=c("chennai","Mumbai","Bangalore"),dest=c("Mumbai","Bangalore","Mumbai"),route=list(list(lat=c(19,20,21),lon=c(72,73,74)),data.frame(lat=c(19,20,21),lon=c(72,73,74)),data.frame(lat=c(19,20,21),lon=c(72,73,74))))
But the structure of the data created above is as follows:
> str(df)
'data.frame': 3 obs. of 8 variables:
$ src : Factor w/ 3 levels "Bangalore","chennai",..: 2 3 1
$ dest : Factor w/ 2 levels "Bangalore","Mumbai": 2 1 2
$ route.lat : num 19 20 21
$ route.lon : num 72 73 74
$ route.lat.1: num 19 20 21
$ route.lon.1: num 72 73 74
$ route.lat.2: num 19 20 21
$ route.lon.2: num 72 73 74
I am using R version 3.3.1 on Windows 10. Please help!
EDIT:
This is how I ended up with that complicated data frame
Initial Data frame was like this
> df <- data.frame(source=c("chennai","Mumbai","Bangalore"),destination=c("Mumbai","Bangalore","Mumbai"))
> df
source destination
1 chennai Mumbai
2 Mumbai Bangalore
3 Bangalore Mumbai
I want to have a column containing a single string with all the way-points (lat lon) between source and destination, separated by commas.
I used the googleway package to get the waypoints:
> library(googleway)
> res <- function(src,dest,key) #key is google maps API key
+ {
+ polylinex <- google_directions(origin = src,destination = dest,key = key)
+ return(polylinex$routes$overview_polyline$points)
+ }
> df$source <- as.character(df$source)
> df$destination <- as.character(df$destination)
> df$x <- mapply(res,df$source,df$destination,key)
> df$route_line <- lapply(df$x,function(y) googleway::decode_pl(y))
> df <- df[,!(names(df)=="x")]
> str(df)
'data.frame': 3 obs. of 3 variables:
$ source : chr "chennai" "Mumbai" "Bangalore"
$ destination: chr "Mumbai" "Bangalore" "Mumbai"
$ route_line :List of 3
..$ :'data.frame': 219 obs. of 2 variables:
.. ..$ lat: num 13.1 13.1 13.1 13.1 13.1 ...
.. ..$ lon: num 80.3 80.2 80.2 80.2 80.2 ...
..$ :'data.frame': 219 obs. of 2 variables:
.. ..$ lat: num 19.1 19.1 19.1 19.1 19.1 ...
.. ..$ lon: num 72.9 72.9 72.9 72.9 73 ...
..$ :'data.frame': 218 obs. of 2 variables:
.. ..$ lat: num 13 13 13 13 13 ...
.. ..$ lon: num 77.6 77.6 77.6 77.6 77.5 ...
A slight modification of your lapply into an sapply, and altering the paste sequence slightly, will get you what you want:
df$route_str <- sapply(df$x, function(y){
df_coords <- decode_pl(y)
paste0(t(sapply(df_coords, paste0)), collapse = ",")
})
str(df)
'data.frame': 3 obs. of 4 variables:
$ source : chr "chennai" "Mumbai" "Bangalore"
$ destination: chr "Mumbai" "Bangalore" "Mumbai"
$ x : chr "weznA{z|hNjrAlkDue#vsDtVnhD|dAnkErSbdI~kGzmRtmLjrNldI|iWnjBbuDf^duJgPzqNsiCtaIyLpnOyXzrKe{AvaG|JxpF~VpkCga#tkG_sBp|Cev#fvDpI|gF"| __truncated__ "ywlsBi|x{Lz~#qeCfNi~AfhAsiC}bBoiHpEu}Er~Cgu#znB_bB}~AohEvbGeyIp|A}|AzdC}aAnrB|DhjBo{#h}DujFfnIq_F`dDubFp}Bm{Af~Bs|DzTsaB`e#uy#w"| __truncated__ "oodnA}drxMkcAhKggApm#s}A|uAey#|rAi~BdjF{fDpaLgxB||F}`DvxE{sDdmDgkGthKmlK|vJmgIbzJa`BrjCssC|aBw`Dvw#osBrkCutNpbIigD|sCk`Ft_C}iPv"| __truncated__
$ route_str : chr "13.0826797485352,80.2706985473633,13.0693397521973,80.2431106567383,13.0755300521851,80.2141876220703,13.0717391967773,80.18706"| __truncated__ "19.0759696960449,72.8776473999023,19.0657501220703,72.8991775512695,19.0633087158203,72.9144287109375,19.0515899658203,72.93660"| __truncated__ "12.9715995788574,77.5945510864258,12.9825401306152,77.5925750732422,12.9940996170044,77.5851287841797,13.0092391967773,77.57122"| __truncated__
Note: I'm the googleway author, thanks for using the package
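If the intermediate x column has already been dropped, a variant of the same idea can work directly from the route_line list column instead. This is only a sketch, assuming route_line is structured as in the question; it formats each pair as "lat lon" with commas between pairs, matching the output asked for:
df$route_str <- sapply(df$route_line, function(d) {
  # d is one data frame of lat/lon points; join them as "lat lon, lat lon, ..."
  paste(paste(d$lat, d$lon), collapse = ", ")
})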

How to deal with " rank-deficient fit may be misleading" in R?

I'm trying to predict the values of a test data set based on a train data set. It predicts values without errors, but the predictions deviate a lot from the original values, even predicting values around -356 although none of the original values exceeds 200 (and there are no negative values). The warning is bugging me, as I suspect the predictions deviate so much because of whatever is causing it.
Warning message:
In predict.lm(fit2, data_test) :
prediction from a rank-deficient fit may be misleading
Is there any way I can get rid of this warning? The code is simple:
fit2 <- lm(runs~., data=train_data)
prediction<-predict(fit2, data_test)
prediction
I searched a lot but, to be honest, I couldn't understand much about this warning.
Here is the str of the train and test data sets in case someone needs them:
> str(train_data)
'data.frame': 36 obs. of 28 variables:
$ matchid : int 57 58 55 56 53 54 51 52 45 46 ...
$ TeamName : chr "South Africa" "West Indies" "South Africa" "West Indies" ...
$ Opp_TeamName : chr "West Indies" "South Africa" "West Indies" "South Africa" ...
$ TeamRank : int 4 3 4 3 4 3 10 7 5 1 ...
$ Opp_TeamRank : int 3 4 3 4 3 4 7 10 1 5 ...
$ Team_Top10RankingBatsman : int 0 1 0 1 0 1 0 0 2 2 ...
$ Team_Top50RankingBatsman : int 4 6 4 6 4 6 3 5 4 3 ...
$ Team_Top100RankingBatsman: int 6 8 6 8 6 8 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 0 1 0 1 0 0 0 2 2 ...
$ Opp_Top50RankingBatsman : int 6 4 6 4 6 4 5 3 3 4 ...
$ Opp_Top100RankingBatsman : int 8 6 8 6 8 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "1st innings" "2nd innings" ...
$ Runs_OverAll : num 361 705 348 630 347 ...
$ AVG_Overall : num 27.2 20 23.3 19.1 24 ...
$ SR_Overall : num 128 121 120 118 118 ...
$ Runs_Last10Matches : num 118.5 71 102.1 71 78.6 ...
$ AVG_Last10Matches : num 23.7 20.4 20.9 20.4 23.2 ...
$ SR_Last10Matches : num 120 106 114 106 116 ...
$ Runs_BatingFirst : num 236 459 230 394 203 ...
$ AVG_BatingFirst : num 30.6 23.2 24 21.2 27.1 ...
$ SR_BatingFirst : num 127 136 123 125 118 ...
$ Runs_BatingSecond : num 124 262 119 232 144 ...
$ AVG_BatingSecond : num 25.5 18.3 22.8 17.8 22.8 ...
$ SR_BatingSecond : num 125 118 112 117 114 ...
$ Runs_AgainstTeam2 : num 88.3 118.3 76.3 103.9 49.3 ...
$ AVG_AgainstTeam2 : num 28.2 23 24.7 22.1 16.4 ...
$ SR_AgainstTeam2 : num 139 127 131 128 111 ...
$ runs : int 165 168 231 236 195 126 143 141 191 135 ...
> str(data_test)
'data.frame': 34 obs. of 28 variables:
$ matchid : int 59 60 61 62 63 64 65 66 69 70 ...
$ TeamName : chr "India" "West Indies" "England" "New Zealand" ...
$ Opp_TeamName : chr "West Indies" "India" "New Zealand" "England" ...
$ TeamRank : int 2 3 5 1 4 8 6 2 10 1 ...
$ Opp_TeamRank : int 3 2 1 5 8 4 2 6 1 10 ...
$ Team_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 0 2 ...
$ Team_Top50RankingBatsman : int 5 6 4 3 4 2 5 5 3 3 ...
$ Team_Top100RankingBatsman: int 7 8 7 6 6 5 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 2 0 ...
$ Opp_Top50RankingBatsman : int 6 5 3 4 2 4 5 5 3 3 ...
$ Opp_Top100RankingBatsman : int 8 7 6 7 5 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "2nd innings" "1st innings" ...
$ Runs_OverAll : num 582 618 470 602 509 ...
$ AVG_Overall : num 25 21.8 20.3 20.7 19.6 ...
$ SR_Overall : num 113 120 123 120 112 ...
$ Runs_Last10Matches : num 182 107 117 167 140 ...
$ AVG_Last10Matches : num 37.1 43.8 21 24.9 27.3 ...
$ SR_Last10Matches : num 111 153 122 141 120 ...
$ Runs_BatingFirst : num 319 314 271 345 294 ...
$ AVG_BatingFirst : num 23.6 17.8 20.6 20.3 19.5 ...
$ SR_BatingFirst : num 116.9 98.5 118 124.3 115.8 ...
$ Runs_BatingSecond : num 264 282 304 256 186 ...
$ AVG_BatingSecond : num 28 23.7 31.9 21.6 16.5 ...
$ SR_BatingSecond : num 96.5 133.9 129.4 112 99.5 ...
$ Runs_AgainstTeam2 : num 98.2 95.2 106.9 75.4 88.5 ...
$ AVG_AgainstTeam2 : num 45.3 42.7 38.1 17.7 27.1 ...
$ SR_AgainstTeam2 : num 125 138 152 110 122 ...
$ runs : int 192 196 159 153 122 120 160 161 70 145 ...
In simple words, how can I get rid of this warning so that it doesn't affect my predictions? Here are the coefficients of fit2 (note the NA values):
(Intercept) matchid TeamNameBangladesh
1699.98232628 -0.06793787 59.29445330
TeamNameEngland TeamNameIndia TeamNameNew Zealand
347.33030177 -499.40074338 -179.19192936
TeamNamePakistan TeamNameSouth Africa TeamNameSri Lanka
-272.71610614 -3.54867488 -45.27920191
TeamNameWest Indies Opp_TeamNameBangladesh Opp_TeamNameEngland
-345.54349798 135.05901017 108.04227770
Opp_TeamNameIndia Opp_TeamNameNew Zealand Opp_TeamNamePakistan
-162.24418387 -60.55364436 -114.74599364
Opp_TeamNameSouth Africa Opp_TeamNameSri Lanka Opp_TeamNameWest Indies
196.90856999 150.70170068 -6.88997714
TeamRank Opp_TeamRank Team_Top10RankingBatsman
NA NA NA
Team_Top50RankingBatsman Team_Top100RankingBatsman Opp_Top10RankingBatsman
NA NA NA
Opp_Top50RankingBatsman Opp_Top100RankingBatsman InningType2nd innings
NA NA 24.24029455
Runs_OverAll AVG_Overall SR_Overall
-0.59935875 20.12721378 -13.60151334
Runs_Last10Matches AVG_Last10Matches SR_Last10Matches
-1.92526750 9.24182916 1.23914363
Runs_BatingFirst AVG_BatingFirst SR_BatingFirst
1.41001672 -9.88582744 -6.69780509
Runs_BatingSecond AVG_BatingSecond SR_BatingSecond
-0.90038727 -7.11580086 3.20915976
Runs_AgainstTeam2 AVG_AgainstTeam2 SR_AgainstTeam2
3.35936312 -5.90267210 2.36899131
You can have a look at this detailed discussion:
predict.lm() in a loop. warning: prediction from a rank-deficient fit may be misleading
In general, multicollinearity among the predictors leads to a rank-deficient model matrix, whether in a linear regression like this one or in logistic regression.
You can try applying PCA to the predictors to tackle the multicollinearity issue and then fit the regression afterwards.
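Another practical step, sketched here rather than taken from the linked discussion, is to look at which coefficients came back NA. In the output above those are the rank and ranking-count columns, which are presumably linear combinations of one another, so refitting without them should remove the rank deficiency and the warning:
fit2 <- lm(runs ~ ., data = train_data)
## predictors whose coefficients are NA were dropped for rank deficiency
dropped <- names(coef(fit2))[is.na(coef(fit2))]
dropped
## refit without the offending columns (names as shown in str(train_data))
keep <- setdiff(names(train_data),
                c("TeamRank", "Opp_TeamRank",
                  "Team_Top10RankingBatsman", "Team_Top50RankingBatsman",
                  "Team_Top100RankingBatsman", "Opp_Top10RankingBatsman",
                  "Opp_Top50RankingBatsman", "Opp_Top100RankingBatsman"))
fit3 <- lm(runs ~ ., data = train_data[, keep])
prediction <- predict(fit3, data_test)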

How can I split a multiply imputed dataset created in Amelia?

I have imputed missing values using Amelia, thereby creating 5 multiply imputed datasets. Now, I would like to split this multi-dataset, e.g. one set for year >= 1990 and one set for year <= 1990. Any ideas how I can do so? Many thanks!
data(freetrade)
freetrade$year #splitting variable
#Imputation of missing data
a.out <- amelia(freetrade, m=5, ts="year", cs="country")
#split of created dataset?
Amelia returns an object that contains a list of data frames (one for each imputation). You can see the structure of this object with str().
> library(Amelia)
> data(freetrade)
>
> a.out <- amelia(freetrade, m=5, ts="year", cs="country")
-- Imputation 1 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-- Imputation 2 --
1 2 3 4 5 6 7 8 9 10 11 12 13
-- Imputation 3 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
-- Imputation 4 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-- Imputation 5 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> str(a.out)
List of 12
$ imputations:List of 5
..$ imp1:'data.frame': 171 obs. of 10 variables:
.. ..$ year : int [1:171] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 ...
.. ..$ country : chr [1:171] "SriLanka" "SriLanka" "SriLanka" "SriLanka" ...
.. ..$ tariff : num [1:171] 30.6 22.4 41.3 26.8 31 ...
.. ..$ polity : num [1:171] 6 5 5 5 5 5 5 5 5 5 ...
.. ..$ pop : num [1:171] 14988000 15189000 15417000 15599000 15837000 ...
.. ..$ gdp.pc : num [1:171] 461 474 489 508 526 ...
.. ..$ intresmi: num [1:171] 1.94 1.96 1.66 2.8 2.26 ...
.. ..$ signed : num [1:171] 0 0 1 0 0 0 0 1 0 0 ...
.. ..$ fiveop : num [1:171] 12.4 12.5 12.3 12.3 12.3 ...
.. ..$ usheg : num [1:171] 0.259 0.256 0.266 0.299 0.295 ...
..$ imp2:'data.frame': 171 obs. of 10 variables:
.. ..$ year : int [1:171] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 ...
.. ..$ country : chr [1:171] "SriLanka" "SriLanka" "SriLanka" "SriLanka" ...
.. ..$ tariff : num [1:171] 33.6 59.7 41.3 18.2 31 ...
.. ..$ polity : num [1:171] 6 5 5 5 5 5 5 5 5 5 ...
.. ..$ pop : num [1:171] 14988000 15189000 15417000 15599000 15837000 ...
.. ..$ gdp.pc : num [1:171] 461 474 489 508 526 ...
.. ..$ intresmi: num [1:171] 1.94 1.96 1.66 2.8 2.26 ...
.. ..$ signed : num [1:171] 0 0 1 0 0 0 0 1 0 0 ...
.. ..$ fiveop : num [1:171] 12.4 12.5 12.3 12.3 12.3 ...
.. ..$ usheg : num [1:171] 0.259 0.256 0.266 0.299 0.295 ...
..$ imp3:'data.frame': 171 obs. of 10 variables:
.. ..$ year : int [1:171] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 ...
.. ..$ country : chr [1:171] "SriLanka" "SriLanka" "SriLanka" "SriLanka" ...
.. ..$ tariff : num [1:171] 48.5 32.9 41.3 47.2 31 ...
.. ..$ polity : num [1:171] 6 5 5 5 5 5 5 5 5 5 ...
.. ..$ pop : num [1:171] 14988000 15189000 15417000 15599000 15837000 ...
.. ..$ gdp.pc : num [1:171] 461 474 489 508 526 ...
.. ..$ intresmi: num [1:171] 1.94 1.96 1.66 2.8 2.26 ...
.. ..$ signed : num [1:171] 0 0 1 0 0 0 0 1 0 0 ...
.. ..$ fiveop : num [1:171] 12.4 12.5 12.3 12.3 12.3 ...
.. ..$ usheg : num [1:171] 0.259 0.256 0.266 0.299 0.295 ...
..$ imp4:'data.frame': 171 obs. of 10 variables:
.. ..$ year : int [1:171] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 ...
.. ..$ country : chr [1:171] "SriLanka" "SriLanka" "SriLanka" "SriLanka" ...
.. ..$ tariff : num [1:171] 18.4 45.5 41.3 16.9 31 ...
.. ..$ polity : num [1:171] 6 5 5 5 5 5 5 5 5 5 ...
.. ..$ pop : num [1:171] 14988000 15189000 15417000 15599000 15837000 ...
.. ..$ gdp.pc : num [1:171] 461 474 489 508 526 ...
.. ..$ intresmi: num [1:171] 1.94 1.96 1.66 2.8 2.26 ...
.. ..$ signed : num [1:171] 0 0 1 0 0 0 0 1 0 0 ...
.. ..$ fiveop : num [1:171] 12.4 12.5 12.3 12.3 12.3 ...
.. ..$ usheg : num [1:171] 0.259 0.256 0.266 0.299 0.295 ...
..$ imp5:'data.frame': 171 obs. of 10 variables:
.. ..$ year : int [1:171] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 ...
.. ..$ country : chr [1:171] "SriLanka" "SriLanka" "SriLanka" "SriLanka" ...
.. ..$ tariff : num [1:171] 15.3 44.4 41.3 40.1 31 ...
.. ..$ polity : num [1:171] 6 5 5 5 5 5 5 5 5 5 ...
.. ..$ pop : num [1:171] 14988000 15189000 15417000 15599000 15837000 ...
.. ..$ gdp.pc : num [1:171] 461 474 489 508 526 ...
.. ..$ intresmi: num [1:171] 1.94 1.96 1.66 2.8 2.26 ...
.. ..$ signed : num [1:171] 0 0 1 0 0 0 0 1 0 0 ...
.. ..$ fiveop : num [1:171] 12.4 12.5 12.3 12.3 12.3 ...
.. ..$ usheg : num [1:171] 0.259 0.256 0.266 0.299 0.295 ...
..- attr(*, "class")= chr [1:2] "mi" "list"
$ m : num 5
$ missMatrix : logi [1:171, 1:10] FALSE FALSE FALSE FALSE FALSE FALSE ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:10] "year" "country" "tariff" "polity" ...
$ overvalues : NULL
$ theta : num [1:9, 1:9, 1:5] -1 -0.08456 -0.03404 -0.00193 0.06483 ...
$ mu : num [1:8, 1:5] -0.08456 -0.03404 -0.00193 0.06483 -0.11178 ...
$ covMatrices: num [1:8, 1:8, 1:5] 0.7881 -0.1869 -0.0531 0.2121 -0.0819 ...
$ code : num 1
$ message : chr "Normal EM convergence."
$ iterHist :List of 5
..$ : num [1:15, 1:3] 44 34 25 28 26 25 24 22 20 14 ...
..$ : num [1:13, 1:3] 44 27 24 22 22 21 18 17 14 11 ...
..$ : num [1:19, 1:3] 44 34 29 27 26 26 25 24 23 21 ...
..$ : num [1:15, 1:3] 44 34 27 28 23 24 23 23 19 19 ...
..$ : num [1:20, 1:3] 44 32 30 27 24 23 23 23 23 21 ...
$ arguments :List of 22
..$ idvars : NULL
..$ logs : NULL
..$ ts : num 1
..$ cs : num 2
..$ empri : NULL
..$ tolerance : num 1e-04
..$ polytime : NULL
..$ splinetime : NULL
..$ lags : NULL
..$ leads : NULL
..$ intercs : logi FALSE
..$ sqrts : NULL
..$ lgstc : NULL
..$ noms : NULL
..$ ords : NULL
..$ priors : NULL
..$ autopri : num 0.05
..$ bounds : NULL
..$ max.resample: num 100
..$ startvals : num 0
..$ overimp : NULL
..$ emburn : num [1:2] 0 0
..- attr(*, "class")= chr [1:2] "ameliaArgs" "list"
$ orig.vars : chr [1:10] "year" "country" "tariff" "polity" ...
- attr(*, "class")= chr "amelia"
From here you can see that the "imputations" element of your a.out object contains your data frames, so you can reference each of your imputations from there. For example, a.out$imputations[[1]]$year will give you the years from your first imputation. If you'd like to do that across each imputation, then you can do so using an apply function or a loop. To illustrate this, consider:
> sapply(a.out$imputations,function(x) head(x$year))
imp1 imp2 imp3 imp4 imp5
[1,] 1981 1981 1981 1981 1981
[2,] 1982 1982 1982 1982 1982
[3,] 1983 1983 1983 1983 1983
[4,] 1984 1984 1984 1984 1984
[5,] 1985 1985 1985 1985 1985
[6,] 1986 1986 1986 1986 1986
EDIT: I just re-read your question and I saw that you're actually looking for something more specific. You can take what's above and apply it to make subsets of each data frame, doing something like lapply(a.out$imputations,function(x) x[x$year > 1990,]). I'm not sure how you would like to combine these imputed datasets (split by years greater than/less than 1990), but if you just want to append all rows together, rbind() will do the trick (if not, let me know how you'd like to and I can probably recommend a solution):
> df1 <- do.call(rbind,lapply(a.out$imputations,function(x) x[x$year > 1990,]))
> df2 <- do.call(rbind,lapply(a.out$imputations,function(x) x[x$year < 1990,]))
> head(df1)
year country tariff polity pop gdp.pc intresmi signed fiveop usheg
imp1.11 1991 SriLanka 26.9000 5 17247000 597.6987 2.285213 1.000000 12.8 0.2589872
imp1.12 1992 SriLanka 25.0000 5 17405000 618.3329 2.877877 0.515665 13.1 0.2623017
imp1.13 1993 SriLanka 24.2000 5 17628420 652.6205 4.280361 0.000000 13.2 0.2812928
imp1.14 1994 SriLanka 26.0000 5 17865000 680.0408 4.389912 0.000000 13.2 0.2783585
imp1.15 1995 SriLanka 20.0000 5 18112000 707.6591 3.995919 0.000000 13.2 0.2627195
imp1.16 1996 SriLanka 20.5646 5 18300000 727.0039 3.676763 0.000000 13.2 0.2681700
> head(df2)
year country tariff polity pop gdp.pc intresmi signed fiveop usheg
imp1.1 1981 SriLanka 30.56693 6 14988000 461.0236 1.937347 0 12.4 0.2593112
imp1.2 1982 SriLanka 22.39382 5 15189000 473.7634 1.964430 0 12.5 0.2558008
imp1.3 1983 SriLanka 41.30000 5 15417000 489.2266 1.663936 1 12.3 0.2655022
imp1.4 1984 SriLanka 26.81580 5 15599000 508.1739 2.797462 0 12.3 0.2988009
imp1.5 1985 SriLanka 31.00000 5 15837000 525.5609 2.259116 0 12.3 0.2952431
imp1.6 1986 SriLanka 17.76314 5 16117000 538.9237 1.832549 0 12.5 0.2886563
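If instead you want to keep each period as its own set of five imputed data frames (so later analyses can still be run per imputation and pooled), a small variant of the same lapply() call works. As a sketch, this puts 1990 itself in the first set, which is an assumption since the question lists both cut-offs as inclusive:
## two lists of 5 data frames each, one per imputation
imps_pre1990  <- lapply(a.out$imputations, function(x) x[x$year <= 1990, ])
imps_post1990 <- lapply(a.out$imputations, function(x) x[x$year >  1990, ])
## e.g. rows per imputation in each period
sapply(imps_pre1990, nrow)
sapply(imps_post1990, nrow)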

Split and unsplit a dataframe in four parts

I'd like to split a dataframe into 4 equal parts, because I'd like to use the 4 cores of my computer.
I did this:
df2 <- split(df, 1:4)
unsplit(df2, f=1:4)
and that
df2 <- split(df, 1:4)
unsplit(df2, f=c('1','2','3','4'))
But the unsplit function did not work; I got these warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Do you have an idea of the reason?
How many rows in df? You will get that warning if the number of rows in your table is not divisible by 4. I think you are using the split factor f incorrectly, unless what you want to do is put each subsequent row into a different split data.frame.
If you really want to split your data into 4 data frames, one row after the other, then make your splitting factor the same size as the number of rows in your data frame using rep_len, like this:
## Split like this:
split(df , f = rep_len(1:4, nrow(df) ) )
## Unsplit like this:
unsplit( split(df , f = rep_len(1:4, nrow(df) ) ) , f = rep_len(1:4,nrow(df) ) )
Hopefully this example illustrates why the warning occurs and how to avoid it (i.e. use a proper splitting factor!).
## Want to split our data.frame into two halves, but rows not divisible by 2
df <- data.frame( x = runif(5) )
df
## Splitting still works but...
## We get a warning because the number of rows is not a multiple of the length of the split factor 'f'
split( df , f = 1:2 )
#$`1`
# x
#1 0.6970968
#3 0.5614762
#5 0.5910995
#$`2`
# x
#2 0.6206521
#4 0.1798006
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
## Instead let's use the same split levels (1:2)...
## but make it equal to the length of the rows in the table:
splt <- rep_len( 1:2 , nrow(df) )
splt
#[1] 1 2 1 2 1
## Split works, and f is not recycled because there are
## the same number of values in 'f' as rows in the table
split( df , f = splt )
#$`1`
# x
#1 0.6970968
#3 0.5614762
#5 0.5910995
#$`2`
# x
#2 0.6206521
#4 0.1798006
## And unsplitting then works as expected and reconstructs our original data.frame
unsplit( split( df , f = splt ) , f = splt )
# x
#1 0.6970968
#2 0.6206521
#3 0.5614762
#4 0.1798006
#5 0.5910995
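Since the stated goal was to use the computer's 4 cores, a minimal sketch of feeding the split list to the parallel package could look like the following; colMeans() is just a placeholder for whatever per-chunk work is actually needed:
library(parallel)
splt <- rep_len(1:4, nrow(df))   # splitting factor, as above
chunks <- split(df, f = splt)    # list of 4 data frames
cl <- makeCluster(4)             # one worker per core
res <- parLapply(cl, chunks, function(chunk) colMeans(chunk, na.rm = TRUE))
stopCluster(cl)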
In the R language 'split' example . . .
aq <- airquality
g <- aq$Month
l <- split(aq,g)
After the 'scale' function is executed
l <- lapply(l, transform, Ozone = scale(Ozone))
I am guessing that at one time in R history the function 'scale' did not add extra attributes to the column it is modifying.
..$ Ozone : num ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
As seen in here . . .
> str(l)
List of 5
$ 5:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] 0.782 0.557 -0.523 -0.253 NA ...
.. ..- attr(*, "scaled:center")= num 23.6
.. ..- attr(*, "scaled:scale")= num 22.2
..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 6:'data.frame': 30 obs. of 6 variables:
..$ Ozone : num [1:30, 1] NA NA NA NA NA ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ 7:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] 2.399 -0.32 -0.857 NA 0.154 ...
.. ..- attr(*, "scaled:center")= num 59.1
.. ..- attr(*, "scaled:scale")= num 31.6
..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 8:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] -0.528 -1.284 -1.108 0.455 -0.629 ...
.. ..- attr(*, "scaled:center")= num 60
.. ..- attr(*, "scaled:scale")= num 39.7
..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 9:'data.frame': 30 obs. of 6 variables:
..$ Ozone : num [1:30, 1] 2.674 1.928 1.721 2.467 0.644 ...
.. ..- attr(*, "scaled:center")= num 31.4
.. ..- attr(*, "scaled:scale")= num 24.1
..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
But now it does add those attributes
..$ Ozone : num ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
and the very simple 'unsplit' function is not programmed to handle those attributes.
> unsplit(l,g)
Error in xj[i, , drop = FALSE] : (subscript) logical subscript too long
The (direct and simple) solution is to get rid of those attributes.
attributes(l[[1]]$Ozone) <- NULL
attributes(l[[2]]$Ozone) <- NULL
attributes(l[[3]]$Ozone) <- NULL
attributes(l[[4]]$Ozone) <- NULL
attributes(l[[5]]$Ozone) <- NULL
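Equivalently, a compact sketch (assuming the same list l) strips the attributes from every month's Ozone column in one pass:
## drop the "scaled:*" attributes from each month's Ozone column
l <- lapply(l, function(d) { attributes(d$Ozone) <- NULL; d })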
Then try to unsplit again.
str( unsplit(l,g) )
> str( unsplit(l,g) )
'data.frame': 153 obs. of 6 variables:
$ Ozone : num 0.782 0.557 -0.523 -0.253 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
So, now it works.
Andre Mikulec
