Change the layout of a list of data frames in R

I have a list with more than 5000 elements, and I want to save them to a single .csv file as one data frame with a specific layout.
library(XML)
url <- "http://www.omie.es/aplicaciones/datosftp/datosftp.jsp?path=/marginalpdbc/"
doc <- htmlParse(url)
# extract the href attribute of every link on the page
links <- xpathSApply(doc, "//a/@href")
free(doc)
head(links)
# keep only the absolute links
wanted <- links[grepl("^http", links)]
head(wanted)
GetMe <- paste("", wanted, sep = "")
# read every file into one list element, skipping the first line
datos <- lapply(seq_along(GetMe),
function(x) read.csv(GetMe[x], header = FALSE, sep = ";", as.is = TRUE, skip = 1))
This gives me 7 variables with 25 rows in each list element:
V1 V2 V3 V4 V5 V6 V7
1 1999 1 1 1 3.350 0.02030303 NA
2 1999 1 1 2 3.595 0.02178788 NA
3 1999 1 1 3 3.293 0.01995758 NA
4 1999 1 1 4 2.800 0.01696970 NA
5 1999 1 1 5 2.516 0.01524848 NA
6 1999 1 1 6 2.516 0.01524848 NA
7 1999 1 1 7 2.516 0.01524848 NA
8 1999 1 1 8 2.516 0.01524848 NA
9 1999 1 1 9 2.516 0.01524848 NA
10 1999 1 1 10 2.516 0.01524848 NA
11 1999 1 1 11 2.516 0.01524848 NA
12 1999 1 1 12 2.840 0.01721212 NA
13 1999 1 1 13 2.840 0.01721212 NA
14 1999 1 1 14 3.595 0.02178788 NA
15 1999 1 1 15 3.586 0.02173333 NA
16 1999 1 1 16 2.840 0.01721212 NA
17 1999 1 1 17 2.840 0.01721212 NA
18 1999 1 1 18 2.840 0.01721212 NA
19 1999 1 1 19 4.172 0.02528485 NA
20 1999 1 1 20 3.639 0.02205455 NA
21 1999 1 1 21 3.661 0.02218788 NA
22 1999 1 1 22 3.661 0.02218788 NA
23 1999 1 1 23 3.661 0.02218788 NA
24 1999 1 1 24 3.638 0.02204848 NA
25 * NA NA NA NA NA NA
I want to have them all in the same data frame with the following layout:
FECHA AÑO MES DIASEM DIA H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H11 H12 H13 H14 H15
01/01/2003 2003 1 M 1 15 10.97 8.22 5.24 2.65 2.13 2.06 0.02 0 0 0.77 2.1 3.5 5.33 6.33
02/01/2003 2003 1 J 2 8.33 4.2 2.87 2.63 2.56 2.56 3.51 5.15 10 17.17 20 21.02 21.02 20 17.62
03/01/2003 2003 1 V 3 14.27 9.47 5.08 3.57 3.01 3.01 4.61 9.41 12.83 16.27 17.62 19.66 19.6 17.62 16.2
Where V1 is the year, V2 is the month, V3 is the day, V4 is the hour, and V6 holds the values that go in each row.
In the final data frame each hour has to be one column.
Thanks for your help!
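One possible approach, sketched under assumptions: stack the list elements, drop the "*" terminator rows, and reshape from long to wide so that each hour becomes a column. The `make_day()` helper and the small `datos` list below are hypothetical stand-ins for the downloaded files (only three hours per day), and the output column names YEAR, MONTH, DAY are illustrative.

```r
# Hypothetical stand-in for `datos`: two small daily files with the same
# column layout as in the question (V1 = year, V2 = month, V3 = day,
# V4 = hour, V6 = value), each ending in a "*" terminator row.
make_day <- function(day) {
  data.frame(
    V1 = c(rep("1999", 3), "*"),
    V2 = c(1, 1, 1, NA),
    V3 = c(day, day, day, NA),
    V4 = c(1:3, NA),
    V5 = c(3.350, 3.595, 3.293, NA),
    V6 = c(0.0203, 0.0218, 0.0200, NA),
    V7 = NA,
    stringsAsFactors = FALSE
  )
}
datos <- list(make_day(1), make_day(2))

# Stack all list elements and drop the terminator rows (they have no hour).
long <- do.call(rbind, datos)
long <- long[!is.na(long$V4), c("V1", "V2", "V3", "V4", "V6")]
long$V1 <- as.integer(long$V1)

# One row per date, one column per hour: V6 becomes V61, V62, ...
wide <- reshape(long, idvar = c("V1", "V2", "V3"), timevar = "V4",
                direction = "wide", sep = "")
names(wide) <- sub("^V6", "H", names(wide))
names(wide)[1:3] <- c("YEAR", "MONTH", "DAY")

# Build the dd/mm/yyyy date column.
wide$FECHA <- sprintf("%02d/%02d/%04d", wide$DAY, wide$MONTH, wide$YEAR)
wide

# To save the result (file name is illustrative):
# write.csv(wide, "datos_wide.csv", row.names = FALSE)
```

If the values you need are in V5 rather than V6, swap the column name in the subsetting step and in the `sub()` pattern.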

Related

Calculating the compound annual growth rate

I'm trying to calculate the compound annual growth rate of my data (snippet shown below). Does anyone know the best way to do this, or whether there is a function that does part of the job?
Data: (only worried about the preds column here, others can be ignored)
year month timestep ymin ymax preds date
1 1998 1 1 17.84037 18.58553 18.21295 1998-01-01
2 1998 2 2 17.05009 17.70642 17.37826 1998-02-01
3 1998 3 3 16.97067 17.61320 17.29193 1998-03-01
4 1998 4 4 18.38551 19.00838 18.69695 1998-04-01
5 1998 5 5 21.39082 21.97338 21.68210 1998-05-01
6 1998 6 6 24.77679 25.35464 25.06571 1998-06-01
7 1998 7 7 27.27057 27.82818 27.54938 1998-07-01
8 1998 8 8 28.24703 28.76702 28.50702 1998-08-01
9 1998 9 9 27.72370 28.24619 27.98494 1998-09-01
10 1998 10 10 25.83783 26.33969 26.08876 1998-10-01
11 1998 11 11 22.94968 23.42268 23.18618 1998-11-01
12 1998 12 12 19.50499 20.05466 19.77982 1998-12-01
13 1999 1 13 17.98323 18.50530 18.24426 1999-01-01
14 1999 2 14 17.20124 17.61746 17.40935 1999-02-01
15 1999 3 15 17.11064 17.53492 17.32278 1999-03-01
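A minimal sketch, assuming the usual definition CAGR = (end / start)^(1 / years) - 1. The `cagr()` helper is a hypothetical name, not a function from a package, and the 365.25 days-per-year conversion is an approximation.

```r
# Compound annual growth rate between a start and an end value
# separated by `years` years (may be fractional).
cagr <- function(start_value, end_value, years) {
  (end_value / start_value)^(1 / years) - 1
}

# The preds column from the snippet in the question, with monthly dates.
preds <- c(18.21295, 17.37826, 17.29193, 18.69695, 21.68210, 25.06571,
           27.54938, 28.50702, 27.98494, 26.08876, 23.18618, 19.77982,
           18.24426, 17.40935, 17.32278)
dates <- seq(as.Date("1998-01-01"), by = "month", length.out = length(preds))

# Elapsed time in years between the first and the last observation.
years <- as.numeric(difftime(dates[length(dates)], dates[1], units = "days")) / 365.25

# CAGR over the whole snippet.
cagr(preds[1], preds[length(preds)], years)

# Comparing the same month one year apart avoids mixing in the seasonal cycle.
cagr(preds[1], preds[13], 1)
```

Because the series is strongly seasonal, the choice of start and end month changes the result a lot, so the second call is probably closer to what is wanted.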

Is there a way to make `lm` report the number of observations used (instead of omitted)?

Simple question: I would like to immediately get the number of observations used by the lm model when I subset the data. Here is a reproducible example:
library(data.table)
df <- fread(
"ID DEP C fac H I clvl iso year matchcode
1 1 1 NA 9 1 1 NLD 2009 NLD2009
2 1 1 NA 8 1 1 NLD 2009 NLD2009
3 7 0 NA 3 0 2 NLD 2014 NLD2014
4 8 0 NA 4 0 2 NLD 2014 NLD2014
5 1 0 B 6 0 2 AUS 2011 AUS2011
6 2 0 B 7 0 2 AUS 2011 AUS2011
7 4 1 B 8 1 2 AUS 2007 AUS2007
8 5 1 B 7 7 2 AUS 2007 AUS2007
9 6 0 NA 5 1 1 USA 2007 USA2007
10 1 0 NA 5 1 1 USA 2007 USA2007
11 0 1 NA 0 0 2 USA 2011 USA2010
12 2 1 NA 1 0 2 USA 2011 USA2010
13 2 0 NA 6 NA 3 USA 2013 USA2013
14 9 0 NA 4 0 3 USA 2013 USA2013
15 8 1 A 5 1 2 BLG 2007 BLG2007
16 2 0 A 6 0 4 BEL 2009 BEL2009
17 NA 0 A 1 0 4 BEL 2009 BEL2009
18 9 1 A 0 1 4 BEL 2012 BEL2012",
header = TRUE
)
ols <- lm(DEP ~ H + I + iso, data=df, subset=(ID != 15))
summary(ols)
Is there a way to make lm report the number of observations used (instead of omitted)? I cannot find this anywhere. It is important because I do not know the number of observations in each subset by heart.
Call:
lm(formula = DEP ~ H + I + iso, data = df, subset = (ID != 15))
Residuals:
Min 1Q Median 3Q Max
-4.930 -1.907 -0.167 1.855 6.065
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.658 3.589 1.58 0.15
H -0.499 0.436 -1.14 0.28
I 0.417 0.594 0.70 0.50
isoBEL 1.130 3.592 0.31 0.76
isoNLD 1.376 2.685 0.51 0.62
isoUSA -0.728 3.030 -0.24 0.82
Residual standard error: 3.6 on 9 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.206, Adjusted R-squared: -0.235
F-statistic: 0.468 on 5 and 9 DF, p-value: 0.791
If not, what would be the easiest way to deduce it from the lm output?
The nobs() function tells you how many observations were used:
nobs(ols)
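To deduce it from the summary output instead: the residual degrees of freedom plus the number of estimated coefficients equals the number of observations used (9 + 6 = 15 in the output above). A short sketch on the built-in airquality data, chosen here only for illustration, shows how these counts fit together:

```r
# Fit on a subset of a built-in data set that contains missing values.
fit <- lm(Ozone ~ Solar.R + Wind, data = airquality, subset = (Month != 5))

n_used    <- nobs(fit)                   # rows that entered the fit
n_omitted <- length(fit$na.action)       # rows dropped because of NA
n_subset  <- sum(airquality$Month != 5)  # rows selected by `subset`

# The rows selected by `subset` split into used and omitted rows.
n_used + n_omitted == n_subset

# Same count recovered from the pieces printed by summary():
# residual degrees of freedom + number of estimated coefficients.
fit$df.residual + fit$rank == n_used
```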

How can I drop observations within a group following the occurrence of NA?

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of a variable "assets", but I have some NAs. However, I cannot simply delete the NA observations, but need to delete all subsequent observations following the NA event.
Here is an example:
productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to apply functions by group using plyr, and I have also been able to create a 0-1 column, where 0 indicates that assets has a valid entry and 1 marks a missing value (NA).
mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
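To see why this works: within a group, cumsum(is.na(assets)) stays at 0 until the first NA and is at least 1 from that row onwards, so keeping the rows where it equals 0 drops the NA and everything after it. A small sketch on a single vector, followed by the same idea in base R using ave() as one possible grouping helper:

```r
# One group: the running count of NAs is 0 only before the first NA.
assets <- c(2, 3, NA, 2)
is.na(assets)          # FALSE FALSE  TRUE FALSE
cumsum(is.na(assets))  # 0 0 1 1
keep <- cumsum(is.na(assets)) == 0
assets[keep]           # 2 3

# The data from the question.
mydf <- data.frame(
  productreference = c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5),
  Year = c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,
           1998,1999,2000,2000,2001,2002,2003),
  assets = c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
)

# ave() applies cumsum() separately within each productreference.
na_count <- ave(as.integer(is.na(mydf$assets)), mydf$productreference,
                FUN = cumsum)
kept <- mydf[na_count == 0, ]
kept
```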
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
Here is a base R option:
mydf[unsplit(lapply(split(mydf, mydf$productreference),
function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or another option with data.table:
library(data.table)
# seq_len() returns an empty index when a group starts with NA
setDT(mydf)[, if(any(is.na(assets))) .SD[seq_len(which(is.na(assets))[1]-1)]
else .SD, by = productreference]
You can also do it using base R and a for loop. This code is a bit longer than the code in the other answers. In the loop we subset mydf by productreference, and for every subset we look for the first NA in assets and exclude that row and all following rows.
mydf2 <- NULL
for (i in 1:max(mydf$productreference)){
s1 <- mydf[mydf$productreference==i,]
s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets)==T))-1),]
mydf2 <- rbind(mydf2, s2)
mydf2 <- mydf2[!is.na(mydf2$assets),]
}
mydf2
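One detail worth knowing about the loop: when a group starts with NA, the index 1:0 counts down to c(1, 0) instead of being empty, so the NA row slips through, which is presumably why the loop filters out NA rows again after each rbind(). A small sketch of the edge case, using a hypothetical one-group data frame `s1`, with seq_len() as a safer index:

```r
# A group whose very first row is NA.
s1 <- data.frame(Year = 2000:2002, assets = c(NA, 5, 6))

first_na <- which(is.na(s1$assets))[1]                     # 1
n_keep <- if (is.na(first_na)) nrow(s1) else first_na - 1  # 0

idx_colon <- 1:n_keep        # c(1, 0): counts down, not empty
idx_safe  <- seq_len(n_keep) # integer(0): truly empty

s1[idx_colon, ]  # still returns the NA row
s1[idx_safe, ]   # returns zero rows, as intended
```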

Merging data frames with different number of rows and different columns

I have two data frames with different numbers of columns and rows. I want to combine them into one data frame.
> month.saf
Name NCDC Year Month Day HrMn Temp Q
244 AP 99999 2014 2 1 0 12 1
245 AP 99999 2014 2 1 300 12.2 1
246 AP 99999 2014 2 1 600 14.4 1
247 AP 99999 2014 2 1 900 18.6 1
248 AP 99999 2014 2 1 1200 18 1
249 AP 99999 2014 2 1 1500 13.6 1
250 AP 99999 2014 2 1 1800 11.8 1
251 AP 99999 2014 2 1 2100 10.8 1
252 AP 99999 2014 2 2 0 8.4 1
253 AP 99999 2014 2 2 300 8.6 1
254 AP 99999 2014 2 2 600 19.8 2
255 AP 99999 2014 2 2 900 22.8 1
256 AP 99999 2014 2 2 1200 20.8 1
257 AP 99999 2014 2 2 1500 16.4 1
258 AP 99999 2014 2 2 1800 13.4 1
259 AP 99999 2014 2 2 2100 12.4 1
> T2Mdf
V1 V2
0 293.494262695312 291.642639160156
300 294.003479003906 292.375091552734
600 296.809997558594 295.207885742188
900 298.287811279297 297.181549072266
1200 298.317565917969 297.725708007813
1500 298.134002685547 296.226165771484
1800 296.006805419922 293.354248046875
2100 293.785491943359 293.547210693359
0.1 294.638732910156 293.019866943359
300.1 292.179992675781 291.256958007812
The output that I want is like this:
Name NCDC Year Month Day HrMn Temp Q V1 V2
244 AP 99999 2014 2 1 0 12 1 293.4942627 291.6426392
245 AP 99999 2014 2 1 300 12.2 1 294.003479 292.3750916
246 AP 99999 2014 2 1 600 14.4 1 296.8099976 295.2078857
247 AP 99999 2014 2 1 900 18.6 1 298.2878113 297.1815491
248 AP 99999 2014 2 1 1200 18 1 298.3175659 297.725708
249 AP 99999 2014 2 1 1500 13.6 1 298.1340027 296.2261658
250 AP 99999 2014 2 1 1800 11.8 1 296.0068054 293.354248
251 AP 99999 2014 2 1 2100 10.8 1 293.7854919 293.5472107
252 AP 99999 2014 2 2 0 8.4 1 294.6387329 293.0198669
253 AP 99999 2014 2 2 300 8.6 1 292.1799927 291.256958
254 AP 99999 2014 2 2 600 19.8 2 292.2477417 291.3471069
255 AP 99999 2014 2 2 900 22.8 1 294.2276306 294.2766418
256 AP 99999 2014 2 2 1200 20.8 1 NA NA
257 AP 99999 2014 2 2 1500 16.4 1 NA NA
258 AP 99999 2014 2 2 1800 13.4 1 NA NA
259 AP 99999 2014 2 2 2100 12.4 1 NA NA
I tried cbind, but it gives me an error:
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 216, 220
I also tried rbind.fill(), but it gives me something like:
V1 V2 Name USAF NCDC Year Month Day HrMn I Type QCP Temp Q
1 293.494262695312 291.642639160156 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
2 294.003479003906 292.375091552734 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
3 296.809997558594 295.207885742188 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
4 298.287811279297 297.181549072266 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
5 298.317565917969 297.725708007813 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
6 <NA> <NA> AP 421820 99999 2014 2 1 0 4 FM-12 NA 12 1
7 <NA> <NA> AP 421820 99999 2014 2 1 300 4 FM-12 NA 12.2 1
8 <NA> <NA> AP 421820 99999 2014 2 1 600 4 FM-12 NA 14.4 1
9 <NA> <NA> AP 421820 99999 2014 2 1 900 4 FM-12 NA 18.6 1
10 <NA> <NA> AP 421820 99999 2014 2 1 1200 4 FM-12 NA 18 1
How is it possible to do this in R?
If A and B are the two input data frames, here are some solutions:
1) merge This solution works regardless of whether A or B has more rows.
merge(data.frame(A, row.names=NULL), data.frame(B, row.names=NULL),
by = 0, all = TRUE)[-1]
The first two arguments could be replaced with just A and B respectively if A and B have default rownames, i.e. 1, 2, ..., or if they have consistent rownames. That is, merge(A, B, by = 0, all = TRUE)[-1] .
For example, if we have this input:
# test inputs
A <- data.frame(BOD, row.names = letters[1:6])
B <- setNames(2 * BOD[1:2, ], c("X", "Y"))
then:
merge(data.frame(A, row.names=NULL), data.frame(B, row.names=NULL),
by = 0, all = TRUE)[-1]
gives:
Time demand X Y
1 1 8.3 2 16.6
2 2 10.3 4 20.6
3 3 19.0 NA NA
4 4 16.0 NA NA
5 5 15.6 NA NA
6 7 19.8 NA NA
1a) An equivalent variation is:
do.call("merge", c(lapply(list(A, B), data.frame, row.names=NULL),
by = 0, all = TRUE))[-1]
2) cbind.zoo This solution assumes that A has more rows and that B's entries are all of the same type, e.g. all numeric. A is not restricted. These conditions hold in the data of the question.
library(zoo)
data.frame(A, cbind(zoo(, 1:nrow(A)), as.zoo(B)))
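A further base R sketch, not part of the numbered solutions above, and assuming A has at least as many rows as B: indexing a data frame beyond its last row yields rows of NA, so B can be padded to the length of A and then bound column-wise. The name `B_padded` is illustrative.

```r
# Same test inputs as above.
A <- data.frame(BOD, row.names = letters[1:6])
B <- setNames(2 * BOD[1:2, ], c("X", "Y"))

# Row indices past nrow(B) produce all-NA rows.
B_padded <- B[seq_len(nrow(A)), , drop = FALSE]

# Bind side by side; row.names = NULL resets to default row names.
res <- data.frame(A, B_padded, row.names = NULL)
res
```

This pairs rows purely by position, exactly like cbind would, so it is only appropriate when the rows of the two data frames are already in matching order.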

Calculating yearly growth-rates from quarterly, long form data in r

My data takes the following form:
df <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("USA", 16)),
Quarter=rep(1:8,2),Income=20:35)
df2 <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("UK", 16)),
Quarter=rep(1:8,2),Income=32:47)
df <- rbind(df, df2)
What I want to do is to calculate the growth rate from the first quarter of each year to the first quarter of the next year, within country and sector. In the example above it would be the growth rate from quarter 1 to quarter 5. So for Sector A in the USA, it would be (24/20)-1=0.2.
I then want to append this data to the dataframe as a new column.
I looked at the solutions in:
How calculate growth rate in long format data frame?
But I didn't have the R skills to get it to work if the lag is more than one time unit. Any suggestions?
ADDITION
So what I want is the growth rate, that is (24/20)-1=0.2 in the example below, not 1-(24/20), which I first wrote. The desired output should look something like this:
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2
6 A USA 6 25 0.1904
7 A USA 7 26 0.1818
I think you need something like this:
library(dplyr)
df %>%
#group by sector and country
group_by(Sector, Country) %>%
#calculate growth as (quarter / 4-period-lagged quarter) - 1
mutate(growth = Income / lag(Income, 4) - 1)
Output
Source: local data frame [32 x 5]
Groups: Sector, Country [4]
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2000000
6 A USA 6 25 0.1904762
7 A USA 7 26 0.1818182
8 A USA 8 27 0.1739130
9 B USA 1 28 NA
10 B USA 2 29 NA
.. ... ... ... ... ...
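For comparison, here is a minimal base R sketch of the same lag-by-four idea without dplyr. The `lag_by()` helper is a hypothetical name, and the code assumes the rows are sorted by Quarter within each Sector and Country (the explicit order() call enforces that). Only the USA half of the example data is rebuilt here to keep it short.

```r
# USA part of the example data from the question.
df <- data.frame(Sector = rep(c("A", "B"), each = 8), Country = "USA",
                 Quarter = rep(1:8, 2), Income = 20:35)

# Shift a vector down by n positions, padding the start with NA.
lag_by <- function(x, n) c(rep(NA, n), x)[seq_along(x)]

# Make sure quarters are in order within each group.
df <- df[order(df$Sector, df$Country, df$Quarter), ]

# Growth relative to the value four quarters earlier, per Sector and Country.
df$growth <- ave(as.numeric(df$Income), df$Sector, df$Country,
                 FUN = function(x) x / lag_by(x, 4) - 1)
df
```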
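library(data.table)  # provides copy()
df3 = copy(df)
# shift Quarter by 4 so that merge() pairs each row with the row four quarters apart
df3$Quarter = df3$Quarter - 4
df = merge(df,df3,c('Sector','Country','Quarter'), suffixes = c('','_prev'), all.x = T)
df$growth = 1 - (df$Income_prev/df$Income)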
> df
Sector Country Quarter Income Income_prev growth
1 A USA 1 20 24 -4
2 A USA 2 21 25 -4
3 A USA 3 22 26 -4
4 A USA 4 23 27 -4
5 A USA 5 24 NA NA
6 A USA 6 25 NA NA
7 A USA 7 26 NA NA
8 A USA 8 27 NA NA
9 A UK 1 32 36 -4
10 A UK 2 33 37 -4
11 A UK 3 34 38 -4
12 A UK 4 35 39 -4
13 A UK 5 36 NA NA
14 A UK 6 37 NA NA
15 A UK 7 38 NA NA
16 A UK 8 39 NA NA
17 B USA 1 28 32 -4
18 B USA 2 29 33 -4
19 B USA 3 30 34 -4
20 B USA 4 31 35 -4
21 B USA 5 32 NA NA
22 B USA 6 33 NA NA
23 B USA 7 34 NA NA
24 B USA 8 35 NA NA
25 B UK 1 40 44 -4
26 B UK 2 41 45 -4
27 B UK 3 42 46 -4
28 B UK 4 43 47 -4
29 B UK 5 44 NA NA
30 B UK 6 45 NA NA
31 B UK 7 46 NA NA
32 B UK 8 47 NA NA
>
