reshape data from wide to long with multiple rows - r

I have a dataset dfs that I would like to reshape:
dfs
# country.name indicator.name x1990 x1991 x1992
# 507 andorra GDP at market prices (current US$) 1.028989e+09 1.106891e+09 1.209993e+09
# 510 andorra GDP growth (annual %) 3.781393e+00 2.546001e+00 9.292154e-01
# 1347 albania GDP at market prices (current US$) 2.101625e+09 1.139167e+09 7.094526e+08
# 1350 albania GDP growth (annual %) -9.575640e+00 -2.958900e+01 -7.200000e+00
# 3587 austria GDP at market prices (current US$) 1.660624e+11 1.733755e+11 1.946082e+11
I would like the indicator names to become columns and the years to be stacked into a single time column:
# country time gdp_market gdp_growth
# 1 andorra 1990 1028989394 3.7813935
# 2 andorra 1991 1106891025 2.5460006
# 3 andorra 1992 1209992650 0.9292154
# 4 albania 1990 2101624963 -9.5756402
# 5 albania 1991 1139166646 -29.5889977
# 6 albania 1992 709452584 -7.2000000
# 7 austria 1990 166062376740 NA
# 8 austria 1991 173375508073 NA
# 9 austria 1992 194608183696 NA
I can melt the data into long format, but I can't separate the values into two columns:
library(reshape2)
melt.dfs <- melt(dfs, id=1:2)
I could do a split and cbind, but I'd prefer to do it with reshape. Thanks.
dfs = structure(list(country.name = c("andorra", "andorra", "albania",
"albania", "austria"), indicator.name = c("GDP at market prices (current US$)",
"GDP growth (annual %)", "GDP at market prices (current US$)",
"GDP growth (annual %)", "GDP at market prices (current US$)"
), x1990 = c(1028989393.70295, 3.78139347786568, 2101624962.5,
-9.57564018741695, 166062376739.683), x1991 = c(1106891024.78653,
2.54600064090229, 1139166645.83333, -29.5889976817695, 173375508073.07
), x1992 = c(1209992649.56688, 0.929215382801402, 709452583.880319,
-7.19999998650893, 194608183696.469)), .Names = c("country.name",
"indicator.name", "x1990", "x1991", "x1992"), row.names = c(507L,
510L, 1347L, 1350L, 3587L), class = "data.frame")

We can use
library(dplyr)
library(tidyr)
gather(dfs, time, Val, x1990:x1992) %>%
spread(indicator.name, Val)
EDIT: Based on comments from @docendo discimus
Or using recast
library(reshape2)
recast(dfs, measure = 3:5, ...~indicator.name, value.var='value')
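In current tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same reshape with those functions, assuming tidyr 1.0.0 or later is installed; the names long and wide are only illustrative:

```r
library(tidyr)

# Same data as the question's dput, rebuilt with data.frame()
dfs <- data.frame(
  country.name = c("andorra", "andorra", "albania", "albania", "austria"),
  indicator.name = c("GDP at market prices (current US$)", "GDP growth (annual %)",
                     "GDP at market prices (current US$)", "GDP growth (annual %)",
                     "GDP at market prices (current US$)"),
  x1990 = c(1028989393.70295, 3.78139347786568, 2101624962.5,
            -9.57564018741695, 166062376739.683),
  x1991 = c(1106891024.78653, 2.54600064090229, 1139166645.83333,
            -29.5889976817695, 173375508073.07),
  x1992 = c(1209992649.56688, 0.929215382801402, 709452583.880319,
            -7.19999998650893, 194608183696.469),
  stringsAsFactors = FALSE
)

# Stack the year columns, then spread the indicators into their own columns
long <- pivot_longer(dfs, cols = x1990:x1992, names_to = "time", values_to = "value")
wide <- pivot_wider(long, names_from = indicator.name, values_from = value)

# Drop the "x" prefix so that time is a plain integer year
wide$time <- as.integer(sub("^x", "", wide$time))
wide
```

Austria has no growth rows in the input, so its growth column is filled with NA.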

Related

Identifying matching observations in dyadic data in R

Hello everyone,
I am struggling with the following issue. Currently, I have a dataset looking like this:
living_in from Year stock
Austria Australia 2014 2513
Austria Australia 2013 2000
Germany Austria 2010 6000
Australia Austria 2014 3000
Austria Australia 1993 NA
Now I would like to identify all observations that fulfill the following criteria:
Should be from same year
Should contain the same country pairs in that year
Should not contain NA
For instance, I want to find all observations for combinations of two countries like Austria-Australia and Australia-Austria within the same year that contain values. This is due to the fact that some combinations in a given year in the dataset have only one value for stock not two. I want to remove those.
What is the best way to proceed here? Many thanks in advance!
P.S. I have about 14 country pairs in my dataset that need this kind of identification
A helpful output might be something like this.
living_in from Year stock dummy
Austria Australia 2014 2513 1
Austria Australia 2013 2000 0
Germany Austria 2010 6000 0
Australia Austria 2014 3000 1
Austria Australia 1993 NA 0
For each pair of countries, irrespective of order (A-B is the same as B-A), assign 1 to the dummy column if the pair has more than one row for the same Year and all of its stock values are non-NA; otherwise assign 0.
library(dplyr)
df %>%
group_by(col1 = pmin(living_in, from), col2 = pmax(living_in, from), Year) %>%
mutate(dummy = as.integer(n() > 1 && all(!is.na(stock)))) %>%
ungroup %>%
select(-col1, -col2)
# living_in from Year stock dummy
# <chr> <chr> <int> <int> <int>
#1 Austria Australia 2014 2513 1
#2 Austria Australia 2013 2000 0
#3 Germany Austria 2010 6000 0
#4 Australia Austria 2014 3000 1
#5 Austria Australia 1993 NA 0
data
df <- structure(list(living_in = c("Austria", "Austria", "Germany",
"Australia", "Austria"), from = c("Australia", "Australia", "Austria",
"Austria", "Australia"), Year = c(2014L, 2013L, 2010L, 2014L,
1993L), stock = c(2513L, 2000L, 6000L, 3000L, NA)),
class = "data.frame", row.names = c(NA, -5L))
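The same flag can be sketched in base R without dplyr. This is only an illustration of the rule described in the answer above; key is a hypothetical helper name, and the final subsetting step assumes the goal is to drop the incomplete pairs.

```r
df <- data.frame(
  living_in = c("Austria", "Austria", "Germany", "Australia", "Austria"),
  from = c("Australia", "Australia", "Austria", "Austria", "Australia"),
  Year = c(2014L, 2013L, 2010L, 2014L, 1993L),
  stock = c(2513L, 2000L, 6000L, 3000L, NA),
  stringsAsFactors = FALSE
)

# Order-independent pair key: alphabetically smaller country first, then the year
key <- paste(pmin(df$living_in, df$from), pmax(df$living_in, df$from),
             df$Year, sep = "|")

# 1 when the pair-year group has more than one row and no missing stock, else 0
df$dummy <- ave(df$stock, key,
                FUN = function(x) as.integer(length(x) > 1 && !anyNA(x)))

# Keep only the complete pairs
complete_pairs <- df[df$dummy == 1L, ]
complete_pairs
```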

revising the values of a variable in a data frame [duplicate]

I want to revise the values of a variable. The values are a series of years running from 1960 to 2017. Each year appears multiple times, once per country; the countries are stored in another column. However, each year is prefixed with an X, e.g. 1960 appears as X1960, and so on up to X2017. I want to remove the X from all years.
The data looks like this:
Country Year GDP
Afghanistan X1960
England X1960
Sudan X1960
.
.
.
Afghanistan X2017
England X2017
Sudan X2017
.
.
Hi, you can apply the gsub function to the year column of your data frame:
ABC <- data.frame(country = c("Afghanistan", "England"), year = c("X1960","X1960"))
print(ABC)
# country year
# 1 Afghanistan X1960
# 2 England X1960
ABC$year <- gsub("X","",ABC$year)
print(ABC)
# country year
# 1 Afghanistan 1960
# 2 England 1960
Here's a tidyverse solution.
# Load libraries
library(dplyr)
library(readr)
# Dummy data frame
df <- data.frame(country = c("Afghanistan", "England", "Sudan"),
year = rep("X1960", 3),
stringsAsFactors = FALSE)
# Quick peek
print(df)
#> country year
#> 1 Afghanistan X1960
#> 2 England X1960
#> 3 Sudan X1960
# Strip all non-numerics from strings
df %>% mutate(year = parse_number(year))
#> country year
#> 1 Afghanistan 1960
#> 2 England 1960
#> 3 Sudan 1960
Created on 2019-05-23 by the reprex package (v0.2.1)
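Note that gsub() leaves the column as character, while parse_number() returns a number. A base R sketch that anchors the pattern to the leading X and converts the result to integer, assuming every value has the form X followed by a four-digit year:

```r
df <- data.frame(country = c("Afghanistan", "England", "Sudan"),
                 year = c("X1960", "X1960", "X2017"),
                 stringsAsFactors = FALSE)

# "^X" matches only an X at the start of the string;
# as.integer() then turns "1960" into the number 1960
df$year <- as.integer(sub("^X", "", df$year))
str(df)
```

Storing the year as an integer makes later filtering and sorting by year behave numerically rather than alphabetically.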

Using mutate and group_by to roll an operation over rows

I have the following data:
country year sales
--------------------------
Afghanistan 1950 30
Afghanistan 1951 35
Albania 1950 0
Albania 1951 5
total 1950 30
total 1951 40
I want to generate a new column, ratio, which is the ratio of sales for any given country-year combination to the total for that year. So the output should be:
country year sales ratio
---------------------------------
Afghanistan 1950 30 1
Afghanistan 1951 35 0.875
Albania 1950 0 0
Albania 1951 5 0.125
total 1950 30 1
total 1951 40 1
I'd like to use tidyverse (which I am somewhat new to) to accomplish this, but I'm still somewhat confused about how to use mutate and group_by to accomplish this (or even if that is the best way to go about this task in general).
I tried unsuccessfully to use the advice given in this thread. What I have tried is:
library(tidyverse)
df <- df %>%
group_by(year) %>%
mutate(ratio = sales[country]/sales[country == "total"])
But this generates a column called ratio full of NAs. Do I need to use a loop or something else? I'm somewhat new to R and I will admit I have avoided loops up until now. Looking over documentation on loops, I couldn't quite think of how I would use one to run over each country-year combination and generate a new column.
You can group by year and then divide sales by the maximum of sales within each year, which I suppose is the total row.
library(dplyr)
df %>%
group_by(year) %>%
mutate(ratio = sales / max(sales))
# A tibble: 6 x 4
# Groups: year [2]
# country year sales ratio
# <chr> <int> <int> <dbl>
#1 Afghanistan 1950 30 1
#2 Afghanistan 1951 35 0.875
#3 Albania 1950 0 0
#4 Albania 1951 5 0.125
#5 total 1950 30 1
#6 total 1951 40 1
In base R
transform(df, ratio = ave(sales, year, FUN = function(x) x / max(x)))
Or with data.table
library(data.table)
setDT(df)[, ratio := sales / max(sales), by = year][]
data
df <- structure(list(country = c("Afghanistan", "Afghanistan", "Albania",
"Albania", "total", "total"), year = c(1950L, 1951L, 1950L, 1951L,
1950L, 1951L), sales = c(30L, 35L, 0L, 5L, 30L, 40L)), .Names = c("country",
"year", "sales"), class = "data.frame", row.names = c(NA, -6L
))
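Dividing by max(sales) relies on the total row being the largest value in each year, which holds as long as sales are non-negative. A sketch that instead looks up the total row explicitly, written in base R; is_total and totals are hypothetical helper names:

```r
df <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Albania", "Albania", "total", "total"),
  year = c(1950L, 1951L, 1950L, 1951L, 1950L, 1951L),
  sales = c(30L, 35L, 0L, 5L, 30L, 40L),
  stringsAsFactors = FALSE
)

# Named lookup vector: one total per year, named by the year
is_total <- df$country == "total"
totals <- setNames(df$sales[is_total], df$year[is_total])

# Divide each row by the total of its own year
df$ratio <- df$sales / unname(totals[as.character(df$year)])
df
```

With dplyr, the same idea is the question's attempt with the indexing corrected: inside group_by(year), use mutate(ratio = sales / sales[country == "total"]).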

Reshaping Dataframe in R (melt?)

So, I currently have a dataframe that looks like:
country continent year lifeExp pop gdpPercap
<fctr> <fctr> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
There are 140+ countries. The years are in 5-year intervals from 1952 to 2007. I want to reshape my dataframe such that I get:
Country gdpPercap(1952) gdpPercap(1957) ... gdpPercap(2007)
<fctr> <dbl>
1 Afghanistan 974.5803 .... ...
2 Albania 5937.0295 ... ...
3 Algeria 6223.3675 ... ...
4 Angola 4797.2313
5 Argentina 12779.3796
6 Australia 34435.3674
7 Austria 36126.4927
8 Bahrain 29796.0483
9 Bangladesh 1391.2538
10 Belgium 33692.6051
My attempt is this:
gapminder %>% #my dataframe
filter(year >= 1952) %>%
group_by(country) %>%
summarise(gdpPercap = mean(gdpPercap))
OUTPUT:
country gdpPercap <- but this takes the mean of gdpPercap from 1952-2007
<fctr> <dbl>
1 Afghanistan 802.6746
2 Albania 3255.3666
3 Algeria 4426.0260
4 Angola 3607.1005
5 Argentina 8955.5538
6 Australia 19980.5956
7 Austria 20411.9163
8 Bahrain 18077.6639
9 Bangladesh 817.5588
10 Belgium 19900.7581
# ... with 132 more rows
Any ideas? PS: I'm new to R. I'm also looking at melt(). Any help will be appreciated!
tidyr::spread() would solve your problem
library(dplyr); library(tidyr)
gapminder %>%
select(country, year, gdpPercap) %>%
spread(year, gdpPercap)
You should also include year in group_by, and after summarise, just reshape the data the way you want using dcast or reshape.
Here is a sample solution :
library(dplyr)
library(reshape2)
# Simulated stand-in for gapminder: 300 random draws per country-year combination
gapminder <- data.frame(gdpPercap = runif(10800),
                        year = rep(seq(from = 1952L, to = 2007L, by = 5L), each = 3),
                        country = c("India", "US", "UK"))
gapminder %>% #my dataframe
filter(year >= 1952) %>%
group_by(country, year) %>%
summarise(gdpPercap = mean(gdpPercap)) %>%
dcast(country ~ year, value.var="gdpPercap")
I had to generate new data because your example is not reproducible. Have a look at "How to make a great R reproducible example?". A reproducible example makes the problem easier to understand and gets you quicker answers.
Built-in reshape can do this.
foo.data.frame <- data.frame(
Country=rep(c("Here", "There"), each=3),
year=rep(c(1952, 1957, 1962),2),
gdpPercap=779:784
# ... other variables
)
reshape(foo.data.frame[, c("Country", "year", "gdpPercap")],
timevar="year", idvar="Country", direction="wide", sep=" ")
# Country gdpPercap 1952 gdpPercap 1957 gdpPercap 1962
# 1 Here 779 780 781
# 4 There 782 783 784
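For comparison with the spread() and reshape() answers above, here is a sketch using tidyr's pivot_wider(), which supersedes spread(). It assumes tidyr 1.0.0 or later and reuses the toy values from the reshape() example; the column prefix gdpPercap_ is an arbitrary choice.

```r
library(tidyr)

toy <- data.frame(
  Country = rep(c("Here", "There"), each = 3),
  year = rep(c(1952, 1957, 1962), 2),
  gdpPercap = 779:784
)

# One row per country, one gdpPercap_<year> column per year
wide <- pivot_wider(toy,
                    names_from = year,
                    values_from = gdpPercap,
                    names_prefix = "gdpPercap_")
wide
```

The prefix keeps the new column names syntactically valid, so they can be used without backticks.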

How to reshape this complicated data frame?

Here are the first 4 rows of my data:
X...Country.Name Country.Code Indicator.Name
1 Turkey TUR Inflation, GDP deflator (annual %)
2 Turkey TUR Unemployment, total (% of total labor force)
3 Afghanistan AFG Inflation, GDP deflator (annual %)
4 Afghanistan AFG Unemployment, total (% of total labor force)
Indicator.Code X2010
1 NY.GDP.DEFL.KD.ZG 5.675740
2 SL.UEM.TOTL.ZS 11.900000
3 NY.GDP.DEFL.KD.ZG 9.437322
4 SL.UEM.TOTL.ZS NA
I want my data reshaped into two columns, one for each Indicator code, and I want each row to correspond to a country, something like this:
Country Name NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
Turkey 5.6 11.9
Afghanistan 9.43 NA
I think I could do this with Excel, but I want to learn the R way, so that I don't need to rely on Excel every time I have a problem. Here is dput of data if you need it.
Edit: I actually want 3 columns, one for each indicator and one for the country's name.
Sticking with base R, use reshape. I took the liberty of cleaning up the column names. Here, I'm only showing you a few rows of the output. Remove head to see the full output. This assumes your data.frame is named "mydata".
names(mydata) <- c("CountryName", "CountryCode",
"IndicatorName", "IndicatorCode", "X2010")
head(reshape(mydata[-c(2:3)],
direction = "wide",
idvar = "CountryName",
timevar = "IndicatorCode"))
# CountryName X2010.NY.GDP.DEFL.KD.ZG X2010.SL.UEM.TOTL.ZS
# 1 Turkey 5.675740 11.9
# 3 Afghanistan 9.437322 NA
# 5 Albania 3.459343 NA
# 7 Algeria 16.245617 11.4
# 9 American Samoa NA NA
# 11 Andorra NA NA
Another option in base R is xtabs, but NA gets replaced with 0:
head(xtabs(X2010 ~ CountryName + IndicatorCode, mydata))
# IndicatorCode
# CountryName NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
# Afghanistan 9.437322 0.0
# Albania 3.459343 0.0
# Algeria 16.245617 11.4
# American Samoa 0.000000 0.0
# Andorra 0.000000 0.0
# Angola 22.393924 0.0
The result of xtabs is a matrix, so if you want a data.frame, wrap the output with as.data.frame.matrix.
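If the zeros produced by xtabs are a problem, tapply is one base R alternative that keeps NA. The sketch below uses only the four rows shown in the question and assumes each country-indicator pair occurs at most once, so taking the first element of each cell is safe:

```r
mydata <- data.frame(
  CountryName = c("Turkey", "Turkey", "Afghanistan", "Afghanistan"),
  IndicatorCode = c("NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS",
                    "NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS"),
  X2010 = c(5.675740, 11.9, 9.437322, NA),
  stringsAsFactors = FALSE
)

# Cross-tabulate without summing: missing values stay NA instead of becoming 0
m <- with(mydata, tapply(X2010, list(CountryName, IndicatorCode),
                         function(x) x[1]))

# Convert the matrix to a data.frame; country names become row names
as.data.frame(m)
```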

Resources