Panel data, from wide to long with multiple variables [duplicate] - r

I'm struggling with a sizeable panel data set in wide format with multiple variables. It looks like this:
set.seed(42)
dat_0=
data.frame(
c(rep('AFG',2),rep('UK',2)),
c(rep(c('GDP','pop'),2)),
runif(4),
runif(4))
colnames(dat_0)<-c('country','variable','2010','2011')
Producing a data frame like this:
country variable 2010 2011
1 AFG GDP 0.535761290 0.7515226
2 AFG pop 0.002272966 0.4527316
3 UK GDP 0.608937453 0.5357900
4 UK pop 0.836801559 0.5373767
And I am trying/struggling to coerce it to this structure
country year GDP pop
1 AFG 2010 0.5357612 0.0022729
2 AFG 2011 0.7515226 0.4527316
3 UK 2010 0.6089374 0.8368015
4 UK 2011 0.5357900 0.5373767
Apologies if repeated, I seem to be struggling with reshape/tidyr/dplyr

You need to gather and then spread:
library(tidyverse)
set.seed(42)
dat_0 <- data.frame(c(rep("AFG", 2), rep("UK", 2)), c(rep(c("GDP", "pop"), 2)), runif(4), runif(4))
colnames(dat_0) <- c("country", "variable", "2010", "2011")
dat_0 %>%
gather(year, value, `2010`, `2011`) %>%
spread(variable, value)
#> country year GDP pop
#> 1 AFG 2010 0.9148060 0.9370754
#> 2 AFG 2011 0.6417455 0.5190959
#> 3 UK 2010 0.2861395 0.8304476
#> 4 UK 2011 0.7365883 0.1346666
Created on 2019-02-20 by the reprex package (v0.2.1)
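For readers on a newer tidyr, the same gather-then-spread idea can be sketched with pivot_longer() and pivot_wider(), which tidyr 1.0.0 introduced as successors to gather() and spread(). This is only an illustration: it assumes tidyr >= 1.0.0 is installed, and the names long and panel are made up for the example.

```r
library(tidyr)

set.seed(42)
dat_0 <- data.frame(
  country  = rep(c("AFG", "UK"), each = 2),
  variable = rep(c("GDP", "pop"), times = 2),
  `2010`   = runif(4),
  `2011`   = runif(4),
  check.names = FALSE  # keep "2010"/"2011" as column names instead of X2010/X2011
)

# Step 1: stack the year columns into (year, value) pairs.
long <- pivot_longer(dat_0, cols = c(`2010`, `2011`),
                     names_to = "year", values_to = "value")

# Step 2: give each entry of `variable` its own column.
panel <- pivot_wider(long, names_from = variable, values_from = value)
panel
```

Note that year comes out as a character column here; wrap it in as.integer() if numeric years are needed.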

It looks like you could solve your problem with a combination of the gather and spread functions from tidyr, which is part of the tidyverse.
You can solve this problem in two steps.
First: gather by year and values, creating a new column called "measurement"
> dat_1 <- dat_0 %>% gather(key="year",value="measurement",`2010`:`2011`)
> dat_1
country variable year measurement
1 AFG GDP 2010 0.9148060
2 AFG pop 2010 0.9370754
3 UK GDP 2010 0.2861395
4 UK pop 2010 0.8304476
5 AFG GDP 2011 0.6417455
6 AFG pop 2011 0.5190959
7 UK GDP 2011 0.7365883
8 UK pop 2011 0.1346666
Second: spread by "variable", filling the new columns from "measurement"
> dat_2 <- dat_1 %>% spread(key="variable",value="measurement")
> dat_2
country year GDP pop
1 AFG 2010 0.9148060 0.9370754
2 AFG 2011 0.6417455 0.5190959
3 UK 2010 0.2861395 0.8304476
4 UK 2011 0.7365883 0.1346666
I sincerely hope this solves your problem.

Related

R transform data with year begin and year end into time series data

I have a problem that is very similar to this one:
R transform data frame with start and end year into time series
However, none of the solutions there have worked for me.
This is the original df:
df <- data.frame(country = c("Albania", "Albania", "Albania"), leader = c("Sali Berisha", "Sali Berisha", "Sali Berisha"), term = c(2, 2, 2), yearbegin = c(2009,2009, 2009), yearend = c(2013, 2013, 2013))
And it currently looks like this:
#> country leader term yearbegin yearend
#> 1 Albania Sali Berisha 2 2009 2013
#> 2 Albania Sali Berisha 2 2009 2013
#> 3 Albania Sali Berisha 2 2009 2013
And I'm trying to get it to look like this:
#> 1 Albania Sali Berisha 2 2009
#> 2 Albania Sali Berisha 2 2010
#> 3 Albania Sali Berisha 2 2011
#> 4 Albania Sali Berisha 2 2012
#> 5 Albania Sali Berisha 2 2013
When using this code:
library(tidyverse)
gpd_df<- gpd_df %>%
mutate(year = map2(yearbegin, yearend, `:`)) %>%
select(-yearbegin, -yearend) %>%
unnest
I get a column that looks like this:
year
2009:2013
2009:2013
2009:2013
I'm trying to transform the data into time-series form using year begin/year end, but so far I have only run into errors.
Many thanks in advance for your help!
Use distinct first:
library(dplyr)
library(tidyr)
library(purrr) # provides map2()
gpd_df %>%
distinct() %>%
mutate(year = map2(yearbegin, yearend, `:`), .keep = "unused") %>%
unnest_longer(year)
country leader term year
1 Albania Sali Berisha 2 2009
2 Albania Sali Berisha 2 2010
3 Albania Sali Berisha 2 2011
4 Albania Sali Berisha 2 2012
5 Albania Sali Berisha 2 2013
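For comparison, the same two steps (drop the repeated rows, then expand each begin/end pair into one row per year) can be sketched in base R. This is a hedged illustration rather than the answerer's method; the names spells, years, and expanded are invented for the example.

```r
df <- data.frame(
  country   = rep("Albania", 3),
  leader    = rep("Sali Berisha", 3),
  term      = 2,
  yearbegin = 2009,
  yearend   = 2013
)

# Drop the repeated rows first, as distinct() does in the answer above.
spells <- unique(df)

# Build one sequence of years per remaining row.
years <- Map(seq, spells$yearbegin, spells$yearend)

# Repeat each row once per year in its sequence, then attach the years.
expanded <- spells[rep(seq_len(nrow(spells)), lengths(years)),
                   c("country", "leader", "term")]
expanded$year <- unlist(years)
rownames(expanded) <- NULL
expanded
```

Without the unique() step, each of the three identical input rows would be expanded separately, giving 15 rows instead of 5.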

Revaluing many observations with a for loop in R

I have a data set where I am looking at longitudinal data for countries.
master.set <- data.frame(
Country = c(rep("Afghanistan", 3), rep("Albania", 3)),
Country.ID = c(rep("Afghanistan", 3), rep("Albania", 3)),
Year = c(2015, 2016, 2017, 2015, 2016, 2017),
Happiness.Score = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
GDP.PPP = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
GINI = NA,
Status = 2,
stringsAsFactors = FALSE
)
> head(master.set)
Country Country.ID Year Happiness.Score GDP.PPP GINI Status
1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
4 Albania Albania 2015 4.959 10971.044 NA 2
5 Albania Albania 2016 4.655 11356.717 NA 2
6 Albania Albania 2017 4.644 11803.282 NA 2
I created that Country.ID variable with the intent of turning them into numerical values 1:159.
I am hoping to avoid doing something like this to replace the value at each individual observation:
master.set$Country.ID[master.set$Country.ID == "Afghanistan"] <- 1
As I implied, there are 159 countries listed in the data set. Because it's longitudinal, there are 460 observations.
Is there any way to use a for loop to save me a lot of time? Here is what I attempted. I made a couple of lists and attempted to use an ifelse command to tell R to label each country the next number.
Here is what I have:
#List of country names
N.Countries <- length(unique(master.set$Country))
Country <- unique(master.set$Country)
Country.ID <- unique(master.set$Country.ID)
CountryList <- unique(master.set$Country)
#For Loop to make Country ID numerically match Country
for (i in 1:460){
for (j in N.Countries){
master.set[[Country.ID[i]]] <- ifelse(master.set[[Country[i]]] == CountryList[j], j, master.set$Country)
}
}
I received this error:
Error in `[[<-.data.frame`(`*tmp*`, Country.ID[i], value = logical(0)) :
replacement has 0 rows, data has 460
Does anyone know how I can accomplish this task? Or will I be stuck using the ifelse command 159 times?
Thanks!
Maybe something like
master.set$Country.ID <- as.numeric(as.factor(master.set$Country.ID))
Or alternatively, using dplyr
library(tidyverse)
master.set <- master.set %>% mutate(Country.ID = as.numeric(as.factor(Country.ID)))
Or this, which creates a new variable Country.ID2 based on a key-value pair between Country and 1:length(unique(Country)).
library(tidyverse)
master.set <- left_join(master.set,
data.frame( Country = unique(master.set$Country),
Country.ID2 = 1:length(unique(master.set$Country))))
master.set
#> Country Country.ID Year Happiness.Score GDP.PPP GINI Status
#> 1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
#> 2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
#> 3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
#> 4 Albania Albania 2015 4.959 10971.044 NA 2
#> 5 Albania Albania 2016 4.655 11356.717 NA 2
#> 6 Albania Albania 2017 4.644 11803.282 NA 2
#> Country.ID2
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 2
#> 6 2
library(dplyr)
df<-data.frame("Country"=c("Afghanistan","Afghanistan","Afghanistan","Albania","Albania","Albania"),
"Year"=c(2015,2016,2017,2015,2016,2017),
"Happiness.Score"=c(3.575,3.360,3.794,4.959,4.655,4.644),
"GDP.PPP"=c(1766.593,1757.023,1758.466,10971.044,11356.717,11803.282),
"GINI"=NA,
"Status"=rep(2,6))
# group_indices_() is deprecated; cur_group_id() (dplyr >= 1.0.0) numbers the groups instead
df1<-df %>% group_by(Country) %>% mutate(Country_id = cur_group_id()) %>% ungroup()
View(df1)
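All of the answers above boil down to mapping each distinct country name to an integer, which needs no loop. A minimal base-R sketch of that lookup with match() follows; the cut-down master.set and the countries vector are assumptions made for the example.

```r
master.set <- data.frame(
  Country = c(rep("Afghanistan", 3), rep("Albania", 3)),
  Year    = rep(2015:2017, times = 2),
  stringsAsFactors = FALSE
)

# Lookup vector: one entry per country, in order of first appearance.
countries <- unique(master.set$Country)

# match() returns, for each row, the position of its country in the lookup vector.
master.set$Country.ID <- match(master.set$Country, countries)
master.set
```

match() numbers countries in order of first appearance, whereas as.numeric(as.factor(...)) numbers them in sorted order; the two agree when the data is already sorted by country.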

revising the values of a variable in a data frame [duplicate]

I want to revise the values of a variable. The values are a series of years, starting at 1960 and ending at 2017. Each year appears multiple times (several 1960s, several 1961s, and so on up to 2017), once for each country; the countries are stored in another column. However, each year is prefixed with an X, e.g. 1960 appears as X1960, and so on up to X2017. I want to remove the X from all years.
The data frame looks like this:
Country Year GDP
Afghanistan X1960
England X1960
Sudan X1960
.
.
.
Afghanistan X2017
England X2017
Sudan X2017
.
.
Hi, you can apply the gsub function to the column of your data frame:
ABC <- data.frame(country = c("Afghanistan", "England"), year = c("X1960","X1960"))
print(ABC)
country year
1 Afghanistan X1960
2 England X1960
ABC$year <- gsub("X","",ABC$year)
print(ABC)
country year
1 Afghanistan 1960
2 England 1960
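One caveat with the gsub approach is that the column stays character after the X is removed. If numeric years are wanted, a small variation can convert it in the same step. This is a sketch on made-up sample rows; sub() with the anchored pattern "^X" is used so that only a leading X is stripped.

```r
ABC <- data.frame(country = c("Afghanistan", "England"),
                  year = c("X1960", "X2017"),
                  stringsAsFactors = FALSE)

# Strip the leading X, then convert the remaining digits to integer.
ABC$year <- as.integer(sub("^X", "", ABC$year))
str(ABC$year)
```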
Here's a tidyverse solution.
# Load libraries
library(dplyr)
library(readr)
# Dummy data frame
df <- data.frame(country = c("Afghanistan", "England", "Sudan"),
year = rep("X1960", 3),
stringsAsFactors = FALSE)
# Quick peek
print(df)
#> country year
#> 1 Afghanistan X1960
#> 2 England X1960
#> 3 Sudan X1960
# Strip all non-numerics from strings
df %>% mutate(year = parse_number(year))
#> country year
#> 1 Afghanistan 1960
#> 2 England 1960
#> 3 Sudan 1960
Created on 2019-05-23 by the reprex package (v0.2.1)

Using a conditional in a for loop to create a unique panel id

I have a dataset which looks as follows:
# A tibble: 5,458 x 539
# Groups: country, id1 [2,729]
idstd id2 xxx id1 country year
<dbl+> <dbl> <dbl+lbl> <dbl+lbl> <chr> <dbl>
1 445801 NA NA 7 Albania 2009
2 542384 4616555 1163 7 Albania 2013
3 445802 NA NA 8 Albania 2009
4 542386 4616355 1162 8 Albania 2013
5 445803 NA NA 25 Albania 2009
6 542371 4616545 1161 25 Albania 2013
7 445804 NA NA 30 Albania 2009
8 542152 4616556 475 30 Albania 2013
9 445805 NA NA 31 Albania 2009
10 542392 4616542 1160 31 Albania 2013
The data is panel data, but there is no unique panel id. The first two observations, for example, are respondent number 7 from Albania, but number 7 is used again for other countries. id2, however, is unique. My plan is therefore to copy id2 into the NA entry of the corresponding respondent.
I wrote the following code:
for (i in 1:nrow(df)) {
if (df$id1[i]== df$id1[i+1] & df$country[i] == df$country[i+1]) {
df$id2[i] <- df$id2[i+1]
}}
Which gives the following error:
Error in if (df$id1[i] == df1$id1[i + 1] & : missing value where TRUE/FALSE needed
It does however seem to work. As my dataset is quite large and I am not very skilled, I am reluctant to accept the solution I came up with, especially when it gives an error.
Could anyone help explain the error to me?
In addition, is there a more efficient (for example data.table) and maybe error free way to deal with this?
Could you not do something along these lines:
library(tidyverse)
df %>%
group_by(country, id1) %>%
mutate(uniqueId = id2 %>% discard(is.na) %>% unique) %>%
ungroup()
Also, judging from your loop, the NA always sits one row above the matching unique ID, so you could also take the value from the next row with lead():
df %>%
mutate(id2Lead = lead(id2),
uniqueId = ifelse(is.na(id2), id2Lead, id2)) %>%
select(-id2Lead)
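As for the error itself, a likely explanation is that on the last iteration i + 1 points one row past the end of the data, so the comparison yields NA and if() cannot decide. That would also explain why the loop seems to work: every earlier row has already been processed by the time it stops. The sketch below reproduces this on a four-row stand-in for the data (an assumption for illustration) and stops the loop one row early.

```r
df <- data.frame(
  id2     = c(NA, 4616555, NA, 4616355),
  id1     = c(7, 7, 8, 8),
  country = "Albania",
  stringsAsFactors = FALSE
)

n <- nrow(df)
df$id1[n] == df$id1[n + 1]  # NA, because row n + 1 does not exist

# Stopping at n - 1 keeps i + 1 inside the data; && is the scalar form of &.
for (i in seq_len(n - 1)) {
  if (df$id1[i] == df$id1[i + 1] && df$country[i] == df$country[i + 1]) {
    df$id2[i] <- df$id2[i + 1]
  }
}
df
```

This only covers the out-of-range index; if id1 or country themselves contained NA, the condition would need an explicit is.na() check as well.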

How to reshape this complicated data frame?

Here are the first 4 rows of my data:
X...Country.Name Country.Code Indicator.Name
1 Turkey TUR Inflation, GDP deflator (annual %)
2 Turkey TUR Unemployment, total (% of total labor force)
3 Afghanistan AFG Inflation, GDP deflator (annual %)
4 Afghanistan AFG Unemployment, total (% of total labor force)
Indicator.Code X2010
1 NY.GDP.DEFL.KD.ZG 5.675740
2 SL.UEM.TOTL.ZS 11.900000
3 NY.GDP.DEFL.KD.ZG 9.437322
4 SL.UEM.TOTL.ZS NA
I want my data reshaped into two columns, one for each indicator code, with each row corresponding to a country, something like this:
Country Name NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
Turkey 5.6 11.9
Afghanistan 9.43 NA
I think I could do this with Excel, but I want to learn the R way, so that I don't need to rely on Excel every time I have a problem. Here is dput of data if you need it.
Edit: I actually want 3 columns, one for each indicator and one for the country's name.
Sticking with base R, use reshape. I took the liberty of cleaning up the column names. Here, I'm only showing you a few rows of the output. Remove head to see the full output. This assumes your data.frame is named "mydata".
names(mydata) <- c("CountryName", "CountryCode",
"IndicatorName", "IndicatorCode", "X2010")
head(reshape(mydata[-c(2:3)],
direction = "wide",
idvar = "CountryName",
timevar = "IndicatorCode"))
# CountryName X2010.NY.GDP.DEFL.KD.ZG X2010.SL.UEM.TOTL.ZS
# 1 Turkey 5.675740 11.9
# 3 Afghanistan 9.437322 NA
# 5 Albania 3.459343 NA
# 7 Algeria 16.245617 11.4
# 9 American Samoa NA NA
# 11 Andorra NA NA
Another option in base R is xtabs, but NA gets replaced with 0:
head(xtabs(X2010 ~ CountryName + IndicatorCode, mydata))
# IndicatorCode
# CountryName NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
# Afghanistan 9.437322 0.0
# Albania 3.459343 0.0
# Algeria 16.245617 11.4
# American Samoa 0.000000 0.0
# Andorra 0.000000 0.0
# Angola 22.393924 0.0
The result of xtabs is a matrix, so if you want a data.frame, wrap the output with as.data.frame.matrix.
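To try the reshape approach without the full data set, here is a self-contained sketch built from the four rows shown in the question. The trimmed-down mydata (only the three columns that reshape needs) and the final renaming step are assumptions added for illustration.

```r
mydata <- data.frame(
  CountryName   = c("Turkey", "Turkey", "Afghanistan", "Afghanistan"),
  IndicatorCode = rep(c("NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS"), times = 2),
  X2010         = c(5.675740, 11.9, 9.437322, NA),
  stringsAsFactors = FALSE
)

# One row per country, one X2010.<code> column per indicator.
wide <- reshape(mydata, direction = "wide",
                idvar = "CountryName", timevar = "IndicatorCode")

# Drop the "X2010." prefix so the columns are named by indicator code alone.
names(wide) <- sub("^X2010\\.", "", names(wide))
rownames(wide) <- NULL
wide
```

Unlike xtabs, reshape keeps the missing unemployment value for Afghanistan as NA.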
