converting a dataframe in given format - r

Given data frame values are
Group year Value
A 2010 17
A 2011 18
F 2010 8
F 2011 9
I want to convert it into:
Year A F
2010 17 8
2011 18 9
Is there a simple way to do this?

library('reshape2')
df <- read.table(text=" Group year Value
A 2010 17
A 2011 18
F 2010 8
F 2011 9", header = TRUE)
# name the value column explicitly so dcast does not have to guess it
dfc <- dcast(df, year ~ Group, value.var = "Value")

Although the syntax can be confusing, I still find reshape in base R useful to know. Using the df provided by gauden:
reshape_df <- reshape(df, direction = "wide", idvar = "year", timevar = "Group")
colnames(reshape_df) <- c("year","A","F")
This converts the data from "long" format to "wide". Usually, the time variable supplies the new column names, but in this case we want "A" and "F" as the columns. Therefore, timevar is set to "Group".
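As a small sketch of the same base R approach, the manual colnames() step can be replaced by stripping the "Value." prefix that reshape adds to the new columns. The object name wide is only illustrative, and the data frame is rebuilt here so the snippet runs on its own.

```r
# Same data as in the question, rebuilt so the snippet is self-contained
df <- data.frame(Group = c("A", "A", "F", "F"),
                 year  = c(2010, 2011, 2010, 2011),
                 Value = c(17, 18, 8, 9))

# Long to wide: one row per year, one column per Group
wide <- reshape(df, direction = "wide", idvar = "year", timevar = "Group")

# reshape names the new columns "Value.A" and "Value.F"; drop the prefix
names(wide) <- sub("^Value\\.", "", names(wide))
wide
```

This avoids hard-coding the group names, which helps when there are more than two groups.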


How to get unique values from table() function in R

I have a data frame with 31 columns. In the year column (named "Anos"), years are repeated across rows, and table(df$Anos) gives me the frequency of each year. I need only the years with 12 observations (12 months).
Example:
freq_years <- table(df$Anos)
freq_years
Result:
2009 2010 2011 2012 2013 2014 2015 2017 2018 2019 2020
10 12 12 3 11 6 8 12 12 12 5
How can I automatically get, in a new variable, only the years with freq = 12? (maybe like 2010,2011,2018,2019)
Here is a tidyverse version. Depending on your use with the other 30 columns in your data frame, keeping the data as df2 might be useful.
install.packages("dplyr")
install.packages("magrittr")
library("magrittr")
library("dplyr")
#create example dataset
df <- data.frame("Anos" = c(rep(2009,10),
rep(2010,12),
rep(2011,12),
rep(2012,3),
rep(2013,11),
rep(2014,6),
rep(2015,8),
rep(2016,12),
rep(2017,12)))
head(df)
# count the rows per year and keep only the years with exactly 12
df2 <- df %>% group_by(Anos) %>% count() %>% filter(n == 12)
head(df2)
# create variable with list of years that have exactly 12 rows
variable <- df2$Anos
variable
We can create a logical vector and subset the names of the table output
names(freq_years)[freq_years == 12]
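A minimal sketch of how that one-liner might be used in practice. Note that names() returns character strings, so the years are converted back to numbers before filtering. The shortened example data and the names full_years and df_full are assumptions for illustration only.

```r
# Hypothetical, shortened example data with an "Anos" column
df <- data.frame(Anos = c(rep(2009, 10), rep(2010, 12),
                          rep(2011, 12), rep(2012, 3)))

freq_years <- table(df$Anos)

# names() gives character labels; convert them back to numeric years
full_years <- as.numeric(names(freq_years)[freq_years == 12])

# keep only the rows that belong to a complete (12-observation) year
df_full <- df[df$Anos %in% full_years, , drop = FALSE]
```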

Using custom order to arrange rows after previous sorting with arrange

I know this has already been asked, but I think my issue is a bit different (never mind that it is in Portuguese).
I have this dataset:
df <- cbind(c(rep(2012,6),rep(2016,6)),
rep(c('Emp.total',
'Fisicas.total',
'Outros.total',
'Politicos.total',
'Receitas.total',
'Proprio.total'),2),
runif(12,0,1))
colnames(df) <- c('Year','Variable','Value')
I want to order the rows to group first everything that has the same year. Afterwards, I want the Variable column to be ordered like this:
Receitas.total
Fisicas.total
Emp.total
Politicos.total
Proprio.total
Outros.total
I know I could use arrange() from dplyr to sort by the year. However, I do not know how to combine this with any routine using factor and order without messing up the previous ordering by year.
Any help? Thank you
We create a custom order by converting 'Variable' into a factor whose levels are given in the desired order:
library(dplyr)
df %>%
arrange(Year, factor(Variable, levels = c('Receitas.total',
'Fisicas.total', 'Emp.total', 'Politicos.total',
'Proprio.total', 'Outros.total')))
# A tibble: 12 x 3
# Year Variable Value
# <dbl> <chr> <dbl>
# 1 2012 Receitas.total 0.6626196
# 2 2012 Fisicas.total 0.2248911
# 3 2012 Emp.total 0.2925740
# 4 2012 Politicos.total 0.5188971
# 5 2012 Proprio.total 0.9204438
# 6 2012 Outros.total 0.7042230
# 7 2016 Receitas.total 0.6048889
# 8 2016 Fisicas.total 0.7638205
# 9 2016 Emp.total 0.2797356
#10 2016 Politicos.total 0.2547251
#11 2016 Proprio.total 0.3707349
#12 2016 Outros.total 0.8016306
data
set.seed(24)
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
df <- tibble(Year =c(rep(2012,6),rep(2016,6)),
Variable = rep(c('Emp.total',
'Fisicas.total',
'Outros.total',
'Politicos.total',
'Receitas.total',
'Proprio.total'),2),
Value = runif(12,0,1))
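For readers who prefer base R, the same idea can be sketched with order() and match(): match() turns each label into its position in the custom order, and order() then sorts by year first and by that position second. The names vars, custom_order and sorted are illustrative, and the data is built with data.frame() rather than cbind() so that Year and Value stay numeric.

```r
set.seed(24)
vars <- c("Emp.total", "Fisicas.total", "Outros.total",
          "Politicos.total", "Receitas.total", "Proprio.total")

df <- data.frame(Year = rep(c(2012, 2016), each = 6),
                 Variable = rep(vars, 2),
                 Value = runif(12),
                 stringsAsFactors = FALSE)

custom_order <- c("Receitas.total", "Fisicas.total", "Emp.total",
                  "Politicos.total", "Proprio.total", "Outros.total")

# sort by Year, then by the position of Variable in custom_order
sorted <- df[order(df$Year, match(df$Variable, custom_order)), ]
```

A label that does not appear in custom_order gets NA from match() and is placed last within its year, which is worth checking for when labels contain typos.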

Reshaping issues in R: my reshaped dataframe changes 3 variables into 1

I'm a relative newbie to R and trying to reshape my data into long format from wide format and having problems. I'm thinking that my problem may be due to having made the data.frame from a data.frame that I have created in R, getting mean values of the large data.frame into another data.frame.
What I have done is this: I created an empty data.frame (ndf):
ndf <- data.frame(matrix(ncol = 0, nrow = 3))
Then used lapply to get the means from the large data.frame (ldf) into separate columns in the new data.frame, with the year being used from the large data.frame:
ndf$Year <- names(ldf)
ndf$col1 <- lapply(ldf, function(i) {mean(i$col1)})
ndf$col2 <- lapply(ldf, function(i) {mean(i$col2)})
etc.
The melt function in reshape2 does not work, apparently because there are non-atomic 'measure' columns.
For using the reshape base function I have used the code:
reshape.ndf <- reshape(ndf,
varying = list(names(ndf)[2:7]),
v.names = "cover",
timevar = "species",
times = names(ndf[2:7]),
new.row.names = 1:1000,
direction = "long")
My output is then essentially just using the first row for the variables. So my wide data.frame looks like this (sorry for the strange names):
Year Cladonia.portentosa Erica.tetralix Eriophorum.vaginatum
1 2014 11.75 35 55
2 2015 15.75 25.75 70
3 2016 22.75 5 37.5
And the long data.frame looks like this:
Year species cover id
1 2014 Cladonia.portentosa 11.75 1
2 2015 Cladonia.portentosa 11.75 2
3 2016 Cladonia.portentosa 11.75 3
4 2014 Erica.tetralix 35.00 1
5 2015 Erica.tetralix 35.00 2
6 2016 Erica.tetralix 35.00 3
Where the "cover" column should have the value from each year put into the cell with the corresponding year.
Please could someone tell me where I've gone wrong!?
Here is an example of 'melting' in tidyr.
You'll need tidyr but I also like dplyr and am including it here to encourage its use along with the rest of the tidyverse. You'll find endless great tutorials on the web...
library(dplyr)
library(tidyr)
Let's use iris as an example. I want a long form where Species, variable and value are the columns.
data(iris)
Here it is with gather(). We specify that variable and value are the column names for the new 'melted' columns. We also specify that we do not want to melt the column Species, which should remain its own column.
iris_long <- iris %>%
gather(variable, value, -Species)
Inspect the iris_long object to make sure it worked.
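In current versions of tidyr, gather() is superseded by pivot_longer(). A roughly equivalent sketch, assuming tidyr 1.0.0 or later is installed, is shown here. The object name iris_long2 is illustrative, and the row order differs from the gather() result, since pivot_longer() works through the data row by row.

```r
library(tidyr)

# keep Species as its own column and stack the four measurement columns
iris_long2 <- pivot_longer(iris,
                           cols = -Species,
                           names_to = "variable",
                           values_to = "value")
head(iris_long2)
```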
In addition to roman's answer, I thought I would share exactly what I did with my data set.
My initial "wide" data.frame ndf looked like this:
Year Cladonia.portentosa Erica.tetralix Eriophorum.vaginatum
1 2014 11.75 35 55
2 2015 15.75 25.75 70
3 2016 22.75 5 37.5
I downloaded tidyr:
install.packages("tidyr")
Then loaded the package:
library(tidyr)
I then used the gather() function in the tidyr package to gather the species columns Cladonia.portentosa Erica.tetralix and Eriophorum.vaginatum together into one column, with a cover column in the new "long" data.frame.
long.ndf <- ndf %>% gather(species, cover, Cladonia.portentosa:Eriophorum.vaginatum)
Easy peasy!
Thanks again to roman for the suggestion!
I'm answering your question in case it may help someone using reshape function.
Please could someone tell me where I've gone wrong!?
You have not specified the idvar parameter, so reshape has created one for you named id. To avoid that, just add the line idvar = "Year" to your code:
ndf <- read.table(text =
"Year Cladonia.portentosa Erica.tetralix Eriophorum.vaginatum
1 2014 11.75 35 55
2 2015 15.75 25.75 70
3 2016 22.75 5 37.5",
header = TRUE, stringsAsFactors = FALSE)
reshape.ndf <- reshape(ndf,
varying = list(names(ndf)[2:4]),
v.names = "cover",
idvar = "Year",
timevar = "species",
times = names(ndf[2:4]),
new.row.names = 1:9,
direction = "long")
The result looks as you were expecting
reshape.ndf
Year species cover
1 2014 Cladonia.portentosa 11.75
2 2015 Cladonia.portentosa 15.75
3 2016 Cladonia.portentosa 22.75
4 2014 Erica.tetralix 35.00
5 2015 Erica.tetralix 25.75
6 2016 Erica.tetralix 5.00
7 2014 Eriophorum.vaginatum 55.00
8 2015 Eriophorum.vaginatum 70.00
9 2016 Eriophorum.vaginatum 37.50
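A further point about the question itself: lapply() always returns a list, so ndf$col1 <- lapply(...) creates a list column, which is what the "non-atomic 'measure' columns" complaint refers to. A minimal sketch of building ndf with ordinary numeric columns by using vapply() instead is shown here. The contents of ldf are invented for illustration, assuming it is a named list of data frames with one entry per year.

```r
# Hypothetical stand-in for ldf: a named list of data frames, one per year
ldf <- list(
  "2014" = data.frame(col1 = c(10, 13.5), col2 = c(30, 40)),
  "2015" = data.frame(col1 = c(15, 16.5), col2 = c(25, 26.5)),
  "2016" = data.frame(col1 = c(22, 23.5), col2 = c(4, 6))
)

ndf <- data.frame(Year = names(ldf), stringsAsFactors = FALSE)

# vapply() returns a plain numeric vector, one mean per list element
ndf$col1 <- vapply(ldf, function(d) mean(d$col1), numeric(1))
ndf$col2 <- vapply(ldf, function(d) mean(d$col2), numeric(1))

# for comparison: the lapply() result is a list, not a numeric vector
is.list(lapply(ldf, function(d) mean(d$col1)))
```

With atomic columns like these, both melt and reshape can treat the means as ordinary measure variables.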

How to make all the months to have an equal number of days (for example 22 days) for a MIDAS regression in R

This is a follow-up question to these two posts.
How to deal with impossible dates for midasr package
https://stats.stackexchange.com/questions/77495/what-can-i-do-with-these-two-time-series
I need to use the mls function in the midasr package in R to align the high-frequency (daily) financial data with the low-frequency (quarterly) macroeconomic data.
The author @mpiktas mentioned
You must make all the months to have an equal number of days. And then
set frequency to that number. You can achieve that by discarding data,
padding NAs or extrapolating.
and
You could use zoo objects to make the padding easier, but in the end
simple numeric vector should be passed.
I searched in various ways and did not find an easy way to implement this.
I used dplyr to give each month 31 days, with 7-11 NAs.
# generate the date vector
library(midasr)
library(dplyr)
library(quantmod)
tsxdate <- as.Date( paste(1979, rep(1:12, each=31), 1:31, sep="-") )
for (year in 1980:2015){
tsxdate <- c(tsxdate,as.Date( paste(year, rep(1:12, each=31), 1:31, sep="-") ))
}
# transform to dataframe
tsxdate.df <- as.data.frame(tsxdate)
# get the stock market index from yahoo
tsxindex <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# merge two data frame to get each month with 31 days
tsx.df <- left_join(tsxdate.df, tsxindex)
I suspect this causes a problem because of too many NAs.
I put the new daily data into a MIDAS regression in R. It did not work. None of the weight functions work.
# since each month has 31 days, one quarter of yy corresponds to 93 days of data
midas_r(midas_r(yy~trend+fmls(zz,30,93,nealmon) ,start=list(zz=rep(0,4))), Ofunction="nls")
Could you tell me how to make all the months have an equal number of days?
update:
Finally, I found a way using the zoo package with the aggregate and first functions. It is not perfect, but it works and is fast. first will add NAs according to the parameter.
I still need to figure out how to fit it into a MIDAS regression.
# get data
tsx <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# subset
# generate a zoo object
library(zoo)
tsx.zoo <- zoo(tsx$GSPTSE.Adjusted)
# group by year-month and take the first 22 days of data
days <- aggregate(tsx.zoo, as.yearmon, first, 22)
It looks like this: each row is one month with 22 days of data.
Jun 1979 1614.29 NA NA NA NA NA NA NA NA NA
Jul 1979 1614.29 1598.73 1579.88 1582.57 1582.27 1576.19 1559.23 1529.81 1533.50 1547.66
Aug 1979 1554.14 1556.94 1553.84 1553.84 1551.95 1561.23 1562.52 1571.00 1578.08 1580.28
Sep 1979 1685.11 1657.58 1690.10 1720.92 1716.53 1711.34 1722.71 1714.63 1727.50 1724.51
Oct 1979 1749.05 1767.40 1775.98 1786.35 1800.12 1800.12 1735.88 1685.21 1681.52 1670.65
Nov 1979 1599.33 1606.81 1596.54 1592.94 1574.49 1569.20 1583.97 1608.70 1611.00 1619.78
Jun 1979 NA NA NA NA NA NA NA NA NA NA
Jul 1979 1556.94 1546.86 1548.46 1553.54 1542.07 1543.17 1552.85 1566.01 1573.99 1564.12
Aug 1979 1596.64 1602.82 1615.09 1636.53 1653.09 1660.97 1657.78 1665.46 1674.44 1674.64
Sep 1979 1714.73 1717.53 1732.59 1736.48 1731.19 1732.49 1746.75 1754.33 1747.45 NA
Oct 1979 1639.03 1613.19 1616.29 1635.34 1593.44 1533.40 1522.12 1534.49 1517.24 1523.92
Nov 1979 1628.55 1621.57 1624.36 1627.56 1620.27 1647.51 1677.93 1683.81 1690.70 1698.97
Jun 1979 NA NA
Jul 1979 1554.14 NA
Aug 1979 1674.24 1675.43
Sep 1979 NA NA
Oct 1979 1538.68 1552.25
update again:
@mpiktas gives a better and correct way to do it.
1 NAs should be padded at the beginning of each period.
2 Data should be gathered at the frequency of the response variable. In my case, it is quarterly.
His function can be used with the aggregate function in zoo. I guess it does the same job as group_by plus do in dplyr: split, operate, and give back a list of results. I tried this:
tsxdaily <- aggregate(tsx.zoo, as.yearqtr, padd_nas, 66)
as.yearqtr groups the data by quarter, which is the frequency of the response variable.
Here is one possible way of how to add NAs.
First, note that MIDAS regression puts the emphasis on the last values of the period, so you need to put NAs in front, not in the back.
Suppose that we have the following dummy data:
> dt <- data.frame(Day=1:10,Quarter=c(rep(1,6),rep(2,4)),value=1:10)
> dt
Day Quarter value
1 1 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 5 1 5
6 6 1 6
7 7 2 7
8 8 2 8
9 9 2 9
10 10 2 10
In this example there are two quarters, the first one has 6 days, the second one 4. Suppose we want to harmonize the data, so that the quarter has 7 days (for example).
Define a simple function which adds NAs at the beginning of the data:
padd_nas <- function(x, desired_length) {
n <- length(x)
if(n < desired_length) {
c(rep(NA,desired_length-n),x)
} else {
tail(x,desired_length)
}
}
Here is an example illustrating how this function works:
> padd_nas(1:4,7)
[1] NA NA NA 1 2 3 4
>
Now add NAs for each quarter and make sure that the data is ordered by day:
library(dplyr)
pdt <- dt %>% arrange(Day) %>% group_by(Quarter) %>% do(pv = padd_nas(.$value, 7))
> pdt
Source: local data frame [2 x 2]
Groups: <by row>
Quarter pv
1 1 <int[7]>
2 2 <int[7]>
To get the padded result simply use unlist on column pv:
> pv <- pdt$pv %>% unlist
> pv
[1] NA 1 2 3 4 5 6 NA NA NA 7 8 9 10
Now we can prepare this for MIDAS regression with mls. Suppose that only the last 3 days are relevant for each quarter:
> library(midasr)
> mls(pv, 0:2, 7)
X.0/m X.1/m X.2/m
[1,] 6 5 4
[2,] 10 9 8
Compare this with original data dt.
This approach can be generalized for any low and high frequency data configuration.
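The padding step can also be sketched in base R only, using split() and lapply() in place of group_by() and do(). The final lines are a hand-rolled check of what the mls call above returns for this toy data, not the midasr implementation. The name last3 is illustrative.

```r
padd_nas <- function(x, desired_length) {
  n <- length(x)
  if (n < desired_length) {
    c(rep(NA, desired_length - n), x)   # pad NAs in front
  } else {
    tail(x, desired_length)             # keep only the most recent values
  }
}

dt <- data.frame(Day = 1:10, Quarter = c(rep(1, 6), rep(2, 4)), value = 1:10)
dt <- dt[order(dt$Day), ]

# split the values by quarter, pad each piece to 7, and glue them together
pv <- unlist(lapply(split(dt$value, dt$Quarter), padd_nas, desired_length = 7),
             use.names = FALSE)

# every quarter must now contribute exactly 7 values
stopifnot(length(pv) %% 7 == 0)

# one row per quarter; columns are the last, second-to-last and third-to-last day
last3 <- t(matrix(pv, nrow = 7))[, 7:5]
last3
```

Checking that length(pv) is a multiple of the chosen frequency before calling mls catches most padding mistakes early.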

Sum column values that match year in another column in R

I have the following dataframe
y<-data.frame(c(2007,2008,2009,2009,2010,2010),c(10,13,10,11,9,10),c(5,6,5,7,4,7))
colnames(y)<-c("year","a","b")
I want a final data.frame that sums, within each year, the values of "y$a" into a new "a" column and the values of "y$b" into a new "b" column, so that it looks like this:
year a b
2007 10 5
2008 13 6
2009 21 12
2010 19 11
The following loop has done it for me,
years<- as.numeric(levels(factor(y$year)))
add.a<- numeric(length(y[,1]))
add.b<- numeric(length(y[,1]))
for(i in years){
ind<- which(y$year==i)
add.a[ind]<- sum(as.numeric(as.character(y[ind,"a"])))
add.b[ind]<- sum(as.numeric(as.character(y[ind,"b"])))
}
y.final<-data.frame(y$year,add.a,add.b)
colnames(y.final)<-c("year","a","b")
y.final<-subset(y.final,!duplicated(y.final$year))
but I just think there must be a faster command. Any ideas?
Kindest regards,
Marco
The aggregate function is a good choice for this sort of operation. Type ?aggregate for more information about it.
aggregate(cbind(a,b) ~ year, data = y, sum)
# year a b
#1 2007 10 5
#2 2008 13 6
#3 2009 21 12
#4 2010 19 11
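As an alternative sketch in base R, rowsum() computes grouped column sums directly and returns the groups as row names, so the year column has to be rebuilt afterwards. The object name sums is illustrative, and the data is recreated so the snippet runs on its own.

```r
y <- data.frame(year = c(2007, 2008, 2009, 2009, 2010, 2010),
                a = c(10, 13, 10, 11, 9, 10),
                b = c(5, 6, 5, 7, 4, 7))

# grouped sums of columns a and b; the years end up as row names
sums <- rowsum(y[, c("a", "b")], group = y$year)

# turn the row names back into a regular year column
y.final <- data.frame(year = as.numeric(rownames(sums)), sums, row.names = NULL)
y.final
```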
