R: read.table misreads special symbols

I am supposed to use read.table (not other functions) to import my data.
The data looks like the following:
country year pop continent lifeExp gdpPercap
Afghanistan 1952 8425333 Asia 28.801 779.4453145
Afghanistan 1957 9240934 Asia 30.332 820.8530296
Afghanistan 1962 10267083 Asia 31.997 853.10071
...
Cote d'Ivoire 1987 10761098 Africa 54.655 2156.956069
Cote d'Ivoire 1992 12772596 Africa 52.044 1648.073791
Cote d'Ivoire 1997 14625967 Africa 47.991 1786.265407
Cote d'Ivoire 2002 16252726 Africa 46.832 1648.800823
Cote d'Ivoire 2007 18013409 Africa 48.328 1544.750112
...
read.table cannot properly read "Cote d'Ivoire" because the value contains an apostrophe. How do I fix that by changing the parameters of the read.table function?

You will have to set the quote argument when you call read.table so that the apostrophe in Cote d'Ivoire is not treated as a quoting character.
df.1 <- read.table("your/file.txt", quote = "", header = TRUE, sep = "\t")
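As a quick follow-up check (a sketch, assuming the file really is tab-delimited and the columns are named as in the snippet above), you can confirm the apostrophe no longer breaks the import:
# Rows for Cote d'Ivoire should now come through intact,
# rather than being split or merged with neighbouring fields.
df.1[df.1$country == "Cote d'Ivoire", ]
nrow(df.1)  # should match the number of data lines in the file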


Why does lag() in this data frame repeat and reverse calculations?

This is a data frame I've called CA_less, which I want to use to calculate GDP change over five decades:
CountryName  Year  GDP
Costa Rica   1960  507513829.9949
Costa Rica   2010  37268635287.0856
Guatemala    1960  1043599900
Guatemala    2010  41338595380.8159
Honduras     1960  335650000
Honduras     2010  15839344591.9842
Panama       1960  537147100
Panama       2010  28917200000
I used this code:
CA_GDP_decade <- mutate(CA_less,
                        Year2 = lag(Year, 1),
                        GDP2 = lag(GDP, 1),
                        CHANGE_PERC = ((GDP - GDP2) / GDP2) * 100) %>%
  mutate_if(is.numeric, round, digits = 0)
CA_GDP_decade
I was expecting this:
CountryName  Year  GDP          Year2  GDP2        Change_perc
Costa Rica   1960  507513830    NA     NA          NA
Costa Rica   2010  37268635287  1960   507513830   7243
Guatemala    2010  41338595381  1960   1043599900  3861
Honduras     2010  15839344592  1960   335650000   4619
Panama       2010  28917200000  1960   537147100   5283
However, I got this instead:
CountryName  Year  GDP          Year2  GDP2         Change_perc
Costa Rica   1960  507513830    NA     NA           NA
Costa Rica   2010  37268635287  1960   507513830    7243
Guatemala    1960  1043599900   2010   37268635287  -97
Guatemala    2010  41338595381  1960   1043599900   3861
Honduras     1960  335650000    2010   41338595381  -99
Honduras     2010  15839344592  1960   335650000    4619
Panama       1960  537147100    2010   15839344592  -97
Panama       2010  28917200000  1960   537147100    5283
How could I use lag() in such a way that I avoid the duplication and reversal of the operations?
Taking your original code, you could do something like this:
library(dplyr)
df <- data.frame(
  CountryName = c("Costa Rica", "Costa Rica", "Guatemala", "Guatemala",
                  "Honduras", "Honduras", "Panama", "Panama"),
  Year = c(1960, 2010, 1960, 2010, 1960, 2010, 1960, 2010),
  GDP = c(507513829.9949, 37268635287.0856, 1043599900, 41338595380.8159,
          335650000, 15839344591.9842, 537147100, 28917200000)
)
df %>%
  group_by(CountryName) %>%
  mutate(Year2 = lag(Year, 1),
         GDP2 = lag(GDP, 1),
         CHANGE_PERC = ((GDP - GDP2) / GDP2) * 100) %>%
  mutate_if(is.numeric, round, digits = 0) %>%
  na.omit()
# A tibble: 4 × 6
CountryName Year GDP Year2 GDP2 CHANGE_PERC
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Costa Rica 2010 37268635287 1960 507513830 7243
2 Guatemala 2010 41338595381 1960 1043599900 3861
3 Honduras 2010 15839344592 1960 335650000 4619
4 Panama 2010 28917200000 1960 537147100 5283
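One caveat, not part of the original answer: lag() works purely on row order, so if the data are not already sorted by year within each country it is safer to arrange them first. A sketch using the same toy data:
df %>%
  group_by(CountryName) %>%
  arrange(Year, .by_group = TRUE) %>%   # make sure lag() sees the years in order
  mutate(Year2 = lag(Year),
         GDP2 = lag(GDP),
         CHANGE_PERC = (GDP - GDP2) / GDP2 * 100) %>%
  ungroup() %>%
  na.omit()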

Creating predictions from a model

I am trying to create predictions from my model, but I wish to keep the Country_Name in the predictions if possible. Is there any way this can be done? I'm having no luck using the standard predict() function.
My model is:
mod = gam(gdp_per_capita ~ s(fisheries_production_pc, k = 10, bs = 'cs') + s(food_yield_pc, k = 10, bs = 'cs') +
s(freshwaster_production_pc, k = 5, bs = 'cs') + s(co2, k = 5, bs = 'cs') + Country_Name,
data = economy_df,
family = gaussian(link = "log"))
Data snippet:
economy_df
Country_Name year gdp_per_capita Agriculture_GDP_per fisheries_production_pc food_yield_pc freshwaster_production_pc co2
Albania 2018 5287.6637 18.4294792 0.0052701739 1.688718e-03 3.342199e-07 1.782739
Albania 2019 5396.2159 18.3893474 0.0053295312 1.765194e-03 3.342199e-07 1.692248
Albania 2020 5332.1605 19.2644408 0.0059591472 1.835616e-03 3.342199e-07 3.926145
Algeria 2018 4142.0186 11.8742008 0.0028456292 4.622480e-05 2.321186e-07 3.920109
Algeria 2019 3989.6683 12.3362121 0.0024478768 4.105168e-05 2.321186e-07 3.977650
Algeria 2020 3306.8582 14.1347926 0.0019817330 3.467192e-05 2.321186e-07 2.448906
Bosnia 2018 6070.3530 5.8854355 0.0011864874 1.651028e-03 1.206103e-07 6.799183
Bosnia 2019 6119.7624 5.6030922 0.0012912459 1.622146e-03 1.206103e-07 6.382918
Bosnia 2020 6082.3667 6.0844855 0.0012438373 1.844267e-03 1.206103e-07 4.962175
Croatia 2018 15227.5601 2.9570919 0.0220747984 1.725996e-03 1.646345e-07 4.019235
Croatia 2019 15311.7669 2.8687641 0.0209151509 1.760604e-03 1.646345e-07 4.063708
Croatia 2020 14132.4866 3.2165075 0.0230609534 1.727508e-03 1.646345e-07 8.057848
Cyprus 2018 29334.1113 1.7335399 0.0074306923 8.853390e-04 1.740575e-07 6.054175
Cyprus 2019 29206.0762 1.8086052 0.0079922641 2.216217e-03 1.740575e-07 5.998795
Cyprus 2020 27681.5664 1.9308417 0.0071299388 1.961717e-03 1.740575e-07 5.614297
Egypt 2018 2537.1252 11.2250002 0.0199902966 6.887169e-05 7.874128e-07 2.518806
Egypt 2019 3019.0923 11.0489759 0.0203110909 6.022130e-05 7.874128e-07 2.484060
Egypt 2020 3569.2068 11.5676091 0.0196471464 6.046745e-05 7.874128e-07 5.295201
What I'm looking for would look something like this, I imagine:
Country_Name prediction
Albania <value>
Albania <value>
Albania <value>
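The answer below refers to a preds object, presumably created with predict() on the fitted gam. A minimal sketch (assuming mgcv, since bs = 'cs' smooths are mgcv-specific; se.fit = TRUE makes the result a list with a fit component, and type = "response" returns predictions on the GDP scale rather than the log-link scale):
library(mgcv)
preds <- predict(mod, newdata = economy_df, type = "response", se.fit = TRUE)
str(preds$fit)  # one prediction per row of economy_df, in the same order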
To check that the predictions line up with the original rows, you can compute the correlation between observed and predicted values:
cor(economy_df$gdp_per_capita,preds[["fit"]])
Then build a new data frame with your desired columns:
mypred <- data.frame(Country_Name = economy_df$Country_Name,
                     prediction = preds[["fit"]])
head(mypred)
Country_Name prediction
1 Albania 5259.356
2 Albania 5382.758
3 Albania 5373.350
4 Algeria 4099.978
5 Algeria 3951.231
6 Algeria 3402.162

Using mutate in custom function with mutation condition as argument

Is it possible to construct a function, say my_mut(df, condition) such that df is a dataframe, condition is a string describing a mutation, and somewhere in the function, the mutation of df according to condition is used?
For example, if df has a foo column and you run my_mut(df, "foo = 2*foo"), then somewhere within my_mut() there would be a row that produces the same dataframe as df %>% mutate(foo = 2*foo).
I managed to do something similar with filter using eval and parse.
update_filt <- function(df, filt, col){
  sub <- df %>%
    filter(eval(parse(text = filt))) %>%
    mutate("{{col}}" := 2 * {{ col }})
  remain <- df %>%
    filter(eval(parse(text = paste0("!(", filt, ")"))))
  return(rbind(sub, remain))
}
I am not sure the update_filt function is foolproof, but it works in some cases at least; e.g., with library(gapminder), update_filt(gapminder, "year == 1952", pop) returns the expected outcome.
The same trick does not seem to work with mutate though. For example,
update_mut <- function(df, mutation){
  # Evaluate mutation expression
  df %>% mutate(eval(parse(text = mutation)))
}
produces outcomes like
library(gapminder)
update_mut(gapminder, "year = 2*year")
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap `eval(parse(text = mutation))`
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 3904
2 Afghanistan Asia 1957 30.3 9240934 821. 3914
3 Afghanistan Asia 1962 32.0 10267083 853. 3924
4 Afghanistan Asia 1967 34.0 11537966 836. 3934
5 Afghanistan Asia 1972 36.1 13079460 740. 3944
6 Afghanistan Asia 1977 38.4 14880372 786. 3954
7 Afghanistan Asia 1982 39.9 12881816 978. 3964
8 Afghanistan Asia 1987 40.8 13867957 852. 3974
9 Afghanistan Asia 1992 41.7 16317921 649. 3984
10 Afghanistan Asia 1997 41.8 22227415 635. 3994
# … with 1,694 more rows
Instead of the expected
gapminder %>% mutate(year = 2*year)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <dbl> <dbl> <int> <dbl>
1 Afghanistan Asia 3904 28.8 8425333 779.
2 Afghanistan Asia 3914 30.3 9240934 821.
3 Afghanistan Asia 3924 32.0 10267083 853.
4 Afghanistan Asia 3934 34.0 11537966 836.
5 Afghanistan Asia 3944 36.1 13079460 740.
6 Afghanistan Asia 3954 38.4 14880372 786.
7 Afghanistan Asia 3964 39.9 12881816 978.
8 Afghanistan Asia 3974 40.8 13867957 852.
9 Afghanistan Asia 3984 41.7 16317921 649.
10 Afghanistan Asia 3994 41.8 22227415 635.
# … with 1,694 more rows
You can pass the mutation through the function's dots (...) and let mutate() evaluate it:
library(dplyr, warn.conflicts = FALSE)
my_mut <- function(df, df_filter, ...){
  df %>%
    filter({{ df_filter }}) %>%
    mutate(newvar = 'other function stuff',
           ...)
}
example_df <- data.frame(a = c('zebra', 'some value'),
b = 1:2)
example_df %>%
my_mut(df_filter = a == 'some value',
b = b*5)
#> a b newvar
#> 1 some value 10 other function stuff
Created on 2021-11-11 by the reprex package (v2.0.1)
If you can't use ... because you're already using it in the function for something else, you could wrap the mutation argument in tibble when calling the function.
library(dplyr, warn.conflicts = FALSE)
my_mut <- function(df, df_filter, mutation){
  df %>%
    filter({{ df_filter }}) %>%
    mutate(newvar = 'other function stuff',
           {{ mutation }})
}
example_df <- data.frame(a = c('zebra', 'some value'),
b = 1:2)
example_df %>%
my_mut(df_filter = a == 'some value',
mutation = tibble(b = b*5))
#> a b newvar
#> 1 some value 10 other function stuff
Created on 2021-11-11 by the reprex package (v2.0.1)
If your formula always has the form original = do_something(original), this may help (for dplyr version >= 1.0).
library(dplyr)
library(stringr)
update_mut <- function(df, mutation){
  xx <- word(mutation, 1)
  df %>%
    mutate("{xx}" := eval(parse(text = mutation)))
}
update_mut(gapminder, "year = 2*year")
country continent year lifeExp pop gdpPercap
<fct> <fct> <dbl> <dbl> <int> <dbl>
1 Afghanistan Asia 3904 28.8 8425333 779.
2 Afghanistan Asia 3914 30.3 9240934 821.
3 Afghanistan Asia 3924 32.0 10267083 853.
4 Afghanistan Asia 3934 34.0 11537966 836.
5 Afghanistan Asia 3944 36.1 13079460 740.
6 Afghanistan Asia 3954 38.4 14880372 786.
7 Afghanistan Asia 3964 39.9 12881816 978.
8 Afghanistan Asia 3974 40.8 13867957 852.
9 Afghanistan Asia 3984 41.7 16317921 649.
10 Afghanistan Asia 3994 41.8 22227415 635.
The problem is that mutate doesn't see the assignment, because the whole expression, including the year = part, is evaluated inside eval(). mutate therefore treats it as an unnamed expression and uses the full text of the expression as the new column's name.
One way to circumvent this would be to eval the whole thing, including the mutate verb, as below.
update_mut <- function(df, mutation) {
  # Build the complete mutate() call as text and evaluate it
  eval(parse(text = paste0("mutate(df, ", mutation, ")")))
}
Another way would be, inside the update_mut function, to split the mutation parameter at the = character, thereby obtaining the name of the variable and the expression, and then use dynamic variable assignment in mutate. However, that would only be more work, since the code above already solves the problem.
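For completeness, here is what that splitting approach could look like (a sketch, assuming a single name = expression mutation with no other = signs in it; update_mut_split is a hypothetical name):
library(dplyr)
library(gapminder)
update_mut_split <- function(df, mutation) {
  # Split "name = expression" into its two halves
  parts <- strsplit(mutation, "=", fixed = TRUE)[[1]]
  name <- trimws(parts[1])
  expr <- trimws(parts[2])
  # Evaluate the right-hand side in the data mask and assign it
  # back to the column named on the left-hand side
  df %>% mutate("{name}" := eval(parse(text = expr)))
}
update_mut_split(gapminder, "year = 2*year")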

Comparing two dataframes in R with different number of rows

I have two data frames, that have the same setup as below
Country Name Country Code Region Year Fertility Rate
Aruba ABW The Americas 1960 4.82
Afghanistan AFG Asia 1960 7.45
Angola AGO Africa 1960 7.379
Albania ALB Europe 1960 6.186
United Arab Emirates ARE Middle East 1960 6.928
Argentina ARG The Americas 1960 3.109
Armenia ARM Asia 1960 4.55
Antigua and Barbuda ATG The Americas 1960 4.425
Australia AUS Oceania 1960 3.453
Austria AUT Europe 1960 2.69
Azerbaijan AZE Asia 1960 5.571
Burundi BDI Africa 1960 6.953
Belgium BEL Europe 1960 2.54
I would like to create a data frame where I list out which countries are missing from the "merged" data frame as compared with the "merged2013" data frame. (Not my naming conventions)
I have tried numerous things I found on the internet; only the code below works, but not in the way I would like.
newmerged1 <- (paste(merged$Country.Name) %in% paste(merged2013$Country.Name))+1
newmerged1
This returns a "1" for countries that aren't found in the merged2013 data frame. I'm assuming there is a way to list the country names instead of a 1 or 2, or simply get a list of the countries not found in the merged2013 data frame, without everything else.
You could use dplyr's anti_join(); it is designed for exactly this.
require(dplyr)
missing_data <- anti_join(merged2013, merged, by = "Country.Name")
This will return all the rows in merged2013 not in merged.
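If you only want the country names rather than the full rows, a small follow-up (plus a base R equivalent, for reference):
missing_countries <- missing_data$Country.Name
# Base R equivalent, without dplyr:
merged2013[!(merged2013$Country.Name %in% merged$Country.Name), ]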

R: Calculating 5 year averages in panel data

I have a balanced panel by country from 1951 to 2007 in a data frame. I'd like to transform it into a new data frame of five-year averages of my other variables. When I sat down to do this, I realized the only way I could think of involved a for loop, and decided it was time to come to Stack Overflow for help.
So, is there an easy way to turn data that looks like this:
country country.isocode year POP ci grgdpch
Argentina ARG 1951 17517.34 18.445022145 3.4602044759
Argentina ARG 1952 17876.96 17.76066507 -7.887407586
Argentina ARG 1953 18230.82 18.365255769 2.3118720688
Argentina ARG 1954 18580.56 16.982113434 1.5693778844
Argentina ARG 1955 18927.82 17.488907008 5.3690276523
Argentina ARG 1956 19271.51 15.907756547 0.3125559183
Argentina ARG 1957 19610.54 17.028450999 2.4896639667
Argentina ARG 1958 19946.54 17.541597134 5.0025894968
Argentina ARG 1959 20281.15 16.137310492 -6.763501447
Argentina ARG 1960 20616.01 20.519539628 8.481742144
...
Venezuela VEN 1997 22361.80 21.923577413 5.603872759
Venezuela VEN 1998 22751.36 24.451736863 -0.781844721
Venezuela VEN 1999 23128.64 21.585034168 -8.728234466
Venezuela VEN 2000 23492.75 20.224310777 2.6828641218
Venezuela VEN 2001 23843.87 23.480311721 0.2476965412
Venezuela VEN 2002 24191.77 16.290691319 -8.02535946
Venezuela VEN 2003 24545.43 10.972153646 -8.341989049
Venezuela VEN 2004 24904.62 17.147693312 14.644028806
Venezuela VEN 2005 25269.18 18.805970212 7.3156977879
Venezuela VEN 2006 25641.46 22.191098769 5.2737381326
Venezuela VEN 2007 26023.53 26.518210052 4.1367897561
into something like this:
country country.isocode period AvPOP Avci Avgrgdpch
Argentina ARG 1 18230 17.38474 1.423454
...
Venezuela VEN 12 25274 21.45343 5.454334
Do I need to transform this data frame using a specific panel data package? Or is there another easy way to do this that I'm missing?
This is the stuff aggregate() is made for:
Df <- data.frame(
  year = rep(1951:1970, 2),
  country = rep(c("Arg", "Ven"), each = 20),
  var1 = c(1:20, 51:70),
  var2 = c(20:1, 70:51)
)
Level <- cut(Df$year, seq(1951, 1971, by = 5), right = FALSE)
id <- c("var1", "var2")
> aggregate(Df[id],list(Df$country,Level),mean)
Group.1 Group.2 var1 var2
1 Arg [1951,1956) 3 18
2 Ven [1951,1956) 53 68
3 Arg [1956,1961) 8 13
4 Ven [1956,1961) 58 63
5 Arg [1961,1966) 13 8
6 Ven [1961,1966) 63 58
7 Arg [1966,1971) 18 3
8 Ven [1966,1971) 68 53
The only thing you might want to do is rename the categories and the variable names.
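For example, a sketch of that renaming (the grouping variables are named directly in the aggregate() call, and the Av prefix is just illustrative):
res <- aggregate(Df[id], list(country = Df$country, period = Level), mean)
names(res)[names(res) %in% id] <- paste0("Av", id)
res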
For this type of problem, the plyr package is truly phenomenal. Here is some code that gives you what you want in essentially a single line of code plus a small helper function.
library(plyr)
library(zoo)
library(pwt)
# First recreate dataset, using package pwt
data(pwt6.3)
pwt <- pwt6.3[
pwt6.3$country %in% c("Argentina", "Venezuela"),
c("country", "isocode", "year", "pop", "ci", "rgdpch")
]
# Use rollmean() from zoo to define a 5-period rolling mean
rollmean5 <- function(x){
  rollmean(x, 5)
}
# Use ddply() in plyr package to create rolling average per country
pwt.ma <- ddply(pwt, .(country), numcolwise(rollmean5))
Here is the output from this:
> head(pwt, 10)
country isocode year pop ci rgdpch
ARG-1950 Argentina ARG 1950 17150.34 13.29214 7736.338
ARG-1951 Argentina ARG 1951 17517.34 18.44502 8004.031
ARG-1952 Argentina ARG 1952 17876.96 17.76067 7372.721
ARG-1953 Argentina ARG 1953 18230.82 18.36526 7543.169
ARG-1954 Argentina ARG 1954 18580.56 16.98211 7661.550
ARG-1955 Argentina ARG 1955 18927.82 17.48891 8072.900
ARG-1956 Argentina ARG 1956 19271.51 15.90776 8098.133
ARG-1957 Argentina ARG 1957 19610.54 17.02845 8299.749
ARG-1958 Argentina ARG 1958 19946.54 17.54160 8714.951
ARG-1959 Argentina ARG 1959 20281.15 16.13731 8125.515
> head(pwt.ma)
country year pop ci rgdpch
1 Argentina 1952 17871.20 16.96904 7663.562
2 Argentina 1953 18226.70 17.80839 7730.874
3 Argentina 1954 18577.53 17.30094 7749.694
4 Argentina 1955 18924.25 17.15450 7935.100
5 Argentina 1956 19267.39 16.98977 8169.456
6 Argentina 1957 19607.51 16.82080 8262.250
Note that rollmean(), by default, calculates the centred moving mean. You can get a left- or right-aligned moving mean instead by passing the align argument through the helper function.
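For instance (a sketch; align is a standard argument of zoo::rollmean, and rollmean5_right is a hypothetical helper name):
# Trailing (right-aligned) 5-period mean instead of the centred default
rollmean5_right <- function(x){
  rollmean(x, 5, align = "right")
}
pwt.ma.right <- ddply(pwt, .(country), numcolwise(rollmean5_right))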
EDIT:
@Joris Meys gently pointed out that you might in fact be after the average for five-year periods.
Here is the modified code to do this:
pwt$period <- cut(pwt$year, seq(1900, 2100, 5))
pwt.ma <- ddply(pwt, .(country, period), numcolwise(mean))
pwt.ma
And the output:
> pwt.ma
country period year pop ci rgdpch
1 Argentina (1945,1950] 1950.0 17150.336 13.29214 7736.338
2 Argentina (1950,1955] 1953.0 18226.699 17.80839 7730.874
3 Argentina (1955,1960] 1958.0 19945.149 17.42693 8410.610
4 Argentina (1960,1965] 1963.0 21616.623 19.09067 9000.918
5 Argentina (1965,1970] 1968.0 23273.736 18.89005 10202.665
6 Argentina (1970,1975] 1973.0 25216.339 19.70203 11348.321
7 Argentina (1975,1980] 1978.0 27445.430 23.34439 11907.939
8 Argentina (1980,1985] 1983.0 29774.778 17.58909 10987.538
9 Argentina (1985,1990] 1988.0 32095.227 15.17531 10313.375
10 Argentina (1990,1995] 1993.0 34399.829 17.96758 11221.807
11 Argentina (1995,2000] 1998.0 36512.422 19.03551 12652.849
12 Argentina (2000,2005] 2003.0 38390.719 15.22084 12308.493
13 Argentina (2005,2010] 2006.5 39831.625 21.11783 14885.227
14 Venezuela (1945,1950] 1950.0 5009.006 41.07972 7067.947
15 Venezuela (1950,1955] 1953.0 5684.009 44.60849 8132.041
16 Venezuela (1955,1960] 1958.0 6988.078 37.87946 9468.001
17 Venezuela (1960,1965] 1963.0 8451.073 26.93877 9958.935
18 Venezuela (1965,1970] 1968.0 10056.910 28.66512 11083.242
19 Venezuela (1970,1975] 1973.0 11903.185 32.02671 12862.966
20 Venezuela (1975,1980] 1978.0 13927.882 36.35687 13530.556
21 Venezuela (1980,1985] 1983.0 16082.694 22.21093 10762.718
22 Venezuela (1985,1990] 1988.0 18382.964 19.48447 10376.123
23 Venezuela (1990,1995] 1993.0 20680.645 19.82371 10988.096
24 Venezuela (1995,2000] 1998.0 22739.062 20.93509 10837.580
25 Venezuela (2000,2005] 2003.0 24550.973 17.33936 10085.322
26 Venezuela (2005,2010] 2006.5 25832.495 24.35465 11790.497
Use cut on your year variable to make the period variable, then use melt and cast from the reshape package to get the averages. There are a lot of other answers that can show you how; see https://stackoverflow.com/questions/tagged/r+reshape
There are already base stats and plyr answers, so for completeness, here is a dplyr-based answer. Using the toy data given by Joris, we have
Df <- data.frame(
  year = rep(1951:1970, 2),
  country = rep(c("Arg", "Ven"), each = 20),
  var1 = c(1:20, 51:70),
  var2 = c(20:1, 70:51)
)
Now, using cut to create the periods, we can then group on them and get the means:
Df %>%
  mutate(period = cut(year, seq(1951, 1971, by = 5), right = FALSE)) %>%
  group_by(country, period) %>%
  summarise(V1 = mean(var1), V2 = mean(var2))
Source: local data frame [8 x 4]
Groups: country
country period V1 V2
1 Arg [1951,1956) 3 18
2 Arg [1956,1961) 8 13
3 Arg [1961,1966) 13 8
4 Arg [1966,1971) 18 3
5 Ven [1951,1956) 53 68
6 Ven [1956,1961) 58 63
7 Ven [1961,1966) 63 58
8 Ven [1966,1971) 68 53
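With dplyr 1.0 or later, the same summary can also be written with across(), which scales to any number of variables (a sketch, not part of the original answer):
library(dplyr)
Df %>%
  mutate(period = cut(year, seq(1951, 1971, by = 5), right = FALSE)) %>%
  group_by(country, period) %>%
  summarise(across(c(var1, var2), mean), .groups = "drop")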
