Reshaping panel data - r

I need to reshape my data for panel data analysis. Searching online, I only found how to get the desired result in Stata; however, I am supposed to use R and Excel.
My initial and final data (the desired result) look very similar to the data on the first page of this example of reshaping data with Stata:
http://spot.colorado.edu/~moonhawk/technical/C1912567120/E220703361/Media/reshape.pdf
Is this attainable with R, or only with Excel? I tried the melt function from the reshape2 library, but I get:
CountryName ProductName Unit Years value
1 Belarus databaseHouseholds '000 Y1977 2942.702
2 Belarus databasePopulation '000 Y1977 9434.200
3 Belarus databaseUrbanPopulation '000 Y1977 4946.882
4 Belarus databaseRuralPopulation '000 Y1977 4487.318
5 Belarus originalHouseholds '000 Y1977 NA
6 Belarus originalUrban households '000 Y1977 NA
7 Poland ..............................................
...........................................................
when I would like to get something like this:
CountryName Years Unit databaseHouseholds databasePopulation databaseUrbanPopulation databaseRuralPopulation originalHouseholds originalUrban households
Belarus
The columns databaseHouseholds, databasePopulation, ... should hold their respective values, so that I can use the dataset for panel modeling.
Thank you very much.

Try:
library(reshape2)
dcast(dat, CountryName+Years+Unit~ProductName, value.var="value")
# CountryName Years Unit databaseHouseholds databasePopulation
#1 Belarus Y1977 0 2942.702 9434.2
# databaseRuralPopulation databaseUrbanPopulation originalHouseholds
#1 4487.318 4946.882 NA
# originalUrban households
# 1 NA
data
dat <- structure(list(CountryName = c("Belarus", "Belarus", "Belarus",
"Belarus", "Belarus", "Belarus"), ProductName = c("databaseHouseholds",
"databasePopulation", "databaseUrbanPopulation", "databaseRuralPopulation",
"originalHouseholds", "originalUrban households"), Unit = c(0L,
0L, 0L, 0L, 0L, 0L), Years = c("Y1977", "Y1977", "Y1977", "Y1977",
"Y1977", "Y1977"), value = c(2942.702, 9434.2, 4946.882, 4487.318,
NA, NA)), .Names = c("CountryName", "ProductName", "Unit", "Years",
"value"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6"))
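If reshape2 is not available, the same long-to-wide step can be sketched with base R's reshape(). This is an alternative to the dcast call above, not the answerer's method. The small data frame is a cut-down copy of the example data, and the "value." prefix that reshape() adds to the new columns is stripped afterwards.

```r
# Cut-down copy of the example data (long format)
dat <- data.frame(
  CountryName = "Belarus",
  ProductName = c("databaseHouseholds", "databasePopulation",
                  "databaseUrbanPopulation", "databaseRuralPopulation"),
  Unit  = "'000",
  Years = "Y1977",
  value = c(2942.702, 9434.2, 4946.882, 4487.318),
  stringsAsFactors = FALSE
)

# One row per CountryName/Years/Unit, one column per ProductName
wide <- reshape(dat,
                idvar     = c("CountryName", "Years", "Unit"),
                timevar   = "ProductName",
                direction = "wide")

# reshape() names the new columns "value.<ProductName>"; drop the prefix
names(wide) <- sub("^value\\.", "", names(wide))
wide
```

Each CountryName/Years/Unit combination ends up on a single row, which is the layout panel-model functions usually expect.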

Related

Dividing each row by the previous one in R

I have R dataframe:
city hour value
0 NY 0 12
1 NY 12 24
2 LA 0 3
3 LA 12 9
I want, for each city, to divide each row by the previous one and write the result into a new dataframe. The desired output is:
city ratio
NY 2
LA 3
You can try aggregate like below
aggregate(value ~ city, df, function(x) x[-1] / x[-length(x)])
which gives
city value
1 LA 3
2 NY 2
Data
> dput(df)
structure(list(city = c("NY", "NY", "LA", "LA"), hour = c(0L,
12L, 0L, 12L), value = c(12L, 24L, 3L, 9L)), class = "data.frame", row.names = c("0",
"1", "2", "3"))
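You can use lag to get the previous value, divide each value by its previous value within each city, and drop the NA rows.
library(dplyr)
# In dplyr >= 1.1.0, use reframe() instead of summarise() here,
# because this returns more than one row per group.
df %>%
  arrange(city, hour) %>%
  group_by(city) %>%
  summarise(value = value/lag(value)) %>%
  na.omit()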
# city value
# <chr> <dbl>
#1 LA 3
#2 NY 2
In data.table we can do this via shift:
library(data.table)
setDT(df)[order(city, hour), value := value/shift(value), city]
na.omit(df)
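The same "divide by the previous row within each city" idea can also be sketched in base R with ave(). This is an additional variant, not one of the posted answers; the ratio column name is taken from the desired output in the question.

```r
# Example data from the question
df <- data.frame(city  = c("NY", "NY", "LA", "LA"),
                 hour  = c(0, 12, 0, 12),
                 value = c(12, 24, 3, 9),
                 stringsAsFactors = FALSE)

# Sort so that "previous row" means "previous hour within the same city"
df <- df[order(df$city, df$hour), ]

# Within each city: NA for the first row, then current value / previous value
df$ratio <- ave(df$value, df$city,
                FUN = function(x) c(NA, x[-1] / x[-length(x)]))

# Keep only the rows that have a previous row
res <- na.omit(df[, c("city", "ratio")])
res
```

Unlike dividing by the first element of each group, this keeps working when a city has more than two rows.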

How to replace values in specific rows of some columns in R tibble with transformed values conditional on row values?

I have a tibble in R in which I want to change the values in some columns, conditional on the value of another column. In the tibble df below, I want to multiply all values in the columns agr, man and ser by 1000 where the variable column equals va, and by 100 where it equals emp, and replace the values in those columns with the results. There must be a simple solution, but I am at a loss.
df
country variable year agr man ser
chn va 1980 345 124 62
chn emp 1980 34 65 58
chn va 1981 345 243 670
ind emp 1980 54 34 40
ind va 1980 456 345 760
I have tried using ifelse, mutate_at and sweep functions but it does not work out.
Assuming that there may also be other values in the 'variable' column, an option is to use case_when with mutate_at:
library(dplyr)
# In dplyr >= 1.0.0, mutate(across(agr:ser, ...)) is the preferred spelling.
df %>%
  mutate_at(vars(agr:ser), ~ case_when(variable == 'va' ~ . * 1000,
                                       variable == 'emp' ~ . * 100,
                                       TRUE ~ as.numeric(.)))
data
df <- structure(list(country = c("chn", "chn", "chn", "ind", "ind"),
variable = c("va", "emp", "va", "emp", "va"), year = c(1980L,
1980L, 1981L, 1980L, 1980L), agr = c(345L, 34L, 345L, 54L,
456L), man = c(124L, 65L, 243L, 34L, 345L), ser = c(62L,
58L, 670L, 40L, 760L)), class = "data.frame", row.names = c(NA,
-5L))
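For comparison, the same conditional scaling can be sketched without dplyr by building one multiplier per row and applying it to the three columns. This is an alternative formulation, not the answerer's code; mult and cols are names introduced here for illustration.

```r
# Example data from the question
df <- data.frame(country  = c("chn", "chn", "chn", "ind", "ind"),
                 variable = c("va", "emp", "va", "emp", "va"),
                 year     = c(1980, 1980, 1981, 1980, 1980),
                 agr      = c(345, 34, 345, 54, 456),
                 man      = c(124, 65, 243, 34, 345),
                 ser      = c(62, 58, 670, 40, 760),
                 stringsAsFactors = FALSE)

cols <- c("agr", "man", "ser")

# One multiplier per row: 1000 for "va", 100 for "emp", 1 for anything else
mult <- ifelse(df$variable == "va", 1000,
               ifelse(df$variable == "emp", 100, 1))

# The row-length vector is recycled down each selected column
df[cols] <- df[cols] * mult
df
```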

Merging two Dataframes in R by ID, One is the subset of the other

I have 2 dataframes in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 dataframes are matched by a primary key ('pid'). dfnew is a subset of dfold: all variables in dfnew are also in dfold, but with updated, imputed values (no NAs anymore). At the same time, dfold has more variables, and I will need them in the analysis phase. I would like to merge the 2 dataframes into dfmerge so that the common variables are updated from dfnew --> dfold while the pre-existing variables in dfold are retained. I have tried merge(), match(), dplyr, and sqldf, but I obtain either a dfmerge with only the 75 updated variables (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I have found is an elegant but fairly long (10-line) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one is available?
Thank you in advance
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the dataframes:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although the issue here is limited to just 2 variables, please keep in mind that the real scenario will involve 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[match(dfold$pid, dfnew$pid)], dfold$Salary)
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[match(dfold$pid, dfnew$pid)], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
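If the goal is to overwrite every common column for the matching pid values (as the question describes) rather than only fill NAs, a minimal sketch with match() could look like the following. It assumes pid is unique in dfnew; the data are a reduced version of the example, with dfnew deliberately holding fewer rows in a different order, and idx and hit are names introduced here.

```r
# Reduced example: dfold has extra columns and NAs; dfnew is a subset
dfold <- data.frame(pid     = 1:5,
                    Country = c("France", "Spain", "Germany", "Spain", "Germany"),
                    Age     = c(NA, 27, 30, 38, 40),
                    Salary  = c(72000, 48000, 54000, 61000, NA),
                    stringsAsFactors = FALSE)
dfnew <- data.frame(pid    = c(3, 1, 5),
                    Age    = c(30, 44, 40),
                    Salary = c(54000, 72000, 70000))

# Columns present in both data frames, excluding the key
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")

# For each row of dfold, the matching row of dfnew (NA if there is none)
idx <- match(dfold$pid, dfnew$pid)
hit <- !is.na(idx)

# Overwrite the common columns only where a match exists
dfold[hit, cols] <- dfnew[idx[hit], cols]
dfold
```

Rows of dfold without a counterpart in dfnew are left untouched, and columns that exist only in dfold (Country here) are never modified.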

Populating new variable from ddply within old data frame in R

I have a data.frame which looks like this (in reality 1M rows):
> df
R.DMA.NAMES quarter daypart allpersons.imp rate station spot.id
1 Wilkes.Barre.Scranton.Hztn Q22014 afternoon 0.0 30 WSWB 13048713
2 Nashville Q12014 primetime 0.0 50 COM NASHVILLE 11969260
3 Seattle.Tacoma Q12014 primetime 6.1 51 ESPN SEATTLE, EVERETT ZONE 11898905
4 Jacksonville Q42013 late fringe 2.3 130 Jacksonville WAWS 11617447
5 Detroit Q22014 overnight 0.0 0 WKBD 12571421
6 South.Bend.Elkhart Q42013 primetime 11.5 325 WBND 11741171
dput(df)
structure(list(R.DMA.NAMES = c("Wilkes.Barre.Scranton.Hztn",
"Nashville", "Seattle.Tacoma", "Jacksonville", "Detroit", "South.Bend.Elkhart"
), quarter = structure(c(3L, 1L, 1L, 6L, 3L, 6L), .Label = c("Q12014",
"Q22013", "Q22014", "Q32013", "Q32014", "Q42013"), class = "factor"),
daypart = c("afternoon", "primetime", "primetime", "late fringe",
"overnight", "primetime"), allpersons.imp = c(0, 0, 6.1,
2.3, 0, 11.5), rate = c(30, 50, 51, 130, 0, 325), station = c("WSWB",
"COM NASHVILLE", "ESPN SEATTLE, EVERETT ZONE", "Jacksonville WAWS",
"WKBD", "WBND"), spot.id = c(13048713L, 11969260L, 11898905L,
11617447L, 12571421L, 11741171L)), .Names = c("R.DMA.NAMES",
"quarter", "daypart", "allpersons.imp", "rate", "station", "spot.id"
), row.names = c(NA, -6L), class = "data.frame")
I am using a ddply function to perform a calculation:
ddply(df, .(R.DMA.NAMES, station, quarter), function (x) {
  # x is the subset of df for one group; using df here would ignore the grouping
  cpi = sum(x$rate) / sum(x$allpersons.imp)
})
This creates a new data.frame which looks like this:
R.DMA.NAMES station quarter V1
1 Detroit WKBD Q22014 NaN
2 Jacksonville Jacksonville WAWS Q42013 56.521739
3 Nashville COM NASHVILLE Q12014 Inf
4 Seattle.Tacoma ESPN SEATTLE, EVERETT ZONE Q12014 8.360656
5 South.Bend.Elkhart WBND Q42013 28.260870
6 Wilkes.Barre.Scranton.Hztn WSWB Q22014 Inf
What I'd like to do is create a new column called "cpi" in my original df i.e. the applicable "cpi" value should appear against the particular row. Of course, the same value will repeat many times i.e. 8.36 will appear for every row which contains "Seattle.Tacoma" for R.DMA.NAMES, "ESPN SEATTLE, EVERETT ZONE" for station and Q12014 for quarter. I tried several things including:
transform(df, cpi = ddply(df, .(R.DMA.NAMES, station, quarter), function (x) {
cpi = sum(df$rate) / sum(df$allpersons.imp)
})
But this didn't work. Can someone explain why?
Use transform within ddply:
ddply(df, .(R.DMA.NAMES, station, quarter),
transform, cpi = sum(rate) / sum(allpersons.imp))
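The same per-group value can also be attached to every original row without plyr, using base R's ave(). The sketch below is an alternative under stated assumptions, not the answerer's method: the three-row data frame is made up for illustration (only the column names come from the question), and g is a helper name introduced here.

```r
# Made-up rows using the question's column names
df <- data.frame(R.DMA.NAMES    = c("Seattle.Tacoma", "Seattle.Tacoma", "Detroit"),
                 station        = c("ESPN SEATTLE, EVERETT ZONE",
                                    "ESPN SEATTLE, EVERETT ZONE", "WKBD"),
                 quarter        = c("Q12014", "Q12014", "Q22014"),
                 rate           = c(51, 49, 20),
                 allpersons.imp = c(6.1, 3.9, 4),
                 stringsAsFactors = FALSE)

# One grouping factor for the market/station/quarter combination
g <- interaction(df$R.DMA.NAMES, df$station, df$quarter, drop = TRUE)

# ave() returns the group sum repeated on every row of that group
df$cpi <- ave(df$rate, g, FUN = sum) / ave(df$allpersons.imp, g, FUN = sum)
df
```

Because ave() returns a vector as long as its input, the row order of df is preserved, whereas ddply returns the rows sorted by group.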

Sum by months of the year with decades of data in R

I have a dataframe with some monthly data for 2 decades:
year month value
1960 January 925
1960 February 903
1960 March 1006
...
1969 December 892
1970 January 990
1970 February 866
...
1979 December 120
I would like to create a dataframe where I sum up the totals, for each decade, by month, as follows:
year month value
decade_60s January 4012
decade_60s February 8678
decade_60s March 9317
...
decade_60s December 3995
decade_70s January 8005
decade_70s February 9112
...
decade_70s December 325
I have been looking at the aggregate function, but this doesn't appear to be the right option.
I looked instead at some careful subsetting using the which function but this quickly became too messy.
For this kind of problem, what would be the correct approach? Will I need to use apply at some point, and if so, how?
I feel the temptation to use a for loop growing, but I don't think this would be the best way to improve my skills in R.
Thanks for the advice.
PS: The month value is an ordinal factor, if this matters.
aggregate is one way to go in base R.
First define the decade:
yourdata$decade <- cut(yourdata$year, breaks=c(1960,1970,1980), labels=c(60,70),
include.lowest=TRUE, right=FALSE)
Then aggregate the data
aggregate(value ~ decade + month, data=yourdata , sum)
Then order the result by decade and month to get the required output.
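Putting the three steps together, a minimal end-to-end sketch might look like this. The five-row data frame and the decade_60s/decade_70s labels are illustrative assumptions (the labels follow the desired output in the question), and month is built as an ordered factor, as the question states.

```r
# Illustrative data; month is an ordered factor in calendar order
yourdata <- data.frame(
  year  = c(1960, 1960, 1969, 1970, 1979),
  month = factor(c("January", "February", "January", "January", "January"),
                 levels = month.name, ordered = TRUE),
  value = c(925, 903, 75, 990, 10)
)

# Step 1: decade, with intervals closed on the left: [1960,1970), [1970,1980)
yourdata$decade <- cut(yourdata$year, breaks = c(1960, 1970, 1980),
                       labels = c("decade_60s", "decade_70s"), right = FALSE)

# Step 2: sum the values per decade and month
res <- aggregate(value ~ decade + month, data = yourdata, FUN = sum)

# Step 3: order by decade, then by calendar month
res <- res[order(res$decade, res$month), ]
res
```

Ordering by month gives calendar order here only because month is an ordered factor; with a character column it would be alphabetical.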
plyr's count + gsub are definitely your friends here:
library(plyr)
dat <- structure(list(year = c(1960L, 1960L, 1960L, 1969L, 1970L, 1970L, 1979L),
month = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 1L),
.Label = c("December", "February", "January", "March"),
class = "factor"),
value = c(925L, 903L, 1006L, 892L, 990L, 866L, 120L)),
.Names = c("year", "month", "value"),
class = "data.frame", row.names = c(NA, -7L))
dat$decade <- gsub("[0-9]$", "0", dat$year)
count(dat, .(decade, month), wt_var=.(value))
## decade month freq
## 1 1960 December 892
## 2 1960 February 903
## 3 1960 January 925
## 4 1960 March 1006
## 5 1970 December 120
## 6 1970 February 866
## 7 1970 January 990

Resources