What's the difference between these two classes, and why must I use the as_tibble() function in my code? - r

Here is my code. I have two questions about it. First, what is the difference between these two classes? Second, why must I call as_tibble() before I can use pivot_wider()?
head(global_economy)
write.table(global_economy,"global_economy.csv",sep=",",row.names=FALSE)
class(global_economy)
Country Code Year GDP Growth CPI Imports Exports Population
Afghanistan AFG 1960 537777811 NA NA 7.024793 4.132233 8996351
Afghanistan AFG 1961 548888896 NA NA 8.097166 4.453443 9166764
Afghanistan AFG 1962 546666678 NA NA 9.349593 4.878051 9345868
Afghanistan AFG 1963 751111191 NA NA 16.863910 9.171601 9533954
Afghanistan AFG 1964 800000044 NA NA 18.055555 8.888893 9731361
Afghanistan AFG 1965 1006666638 NA NA 21.412803 11.258279 9938414
'tbl_ts' 'tbl_df' 'tbl' 'data.frame'
wider_tibble <- global_economy %>%
  as_tibble() %>%
  pivot_wider(names_from = Country, values_from = GDP)
class(wider_tibble)
'tbl_df' 'tbl' 'data.frame'

My guess: the first one is a time series table (class 'tbl_ts', a tsibble). as_tibble() converts it back to an ordinary tibble without the time series structure, so that pivot_wider() works.
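To illustrate the guess above, here is a minimal sketch. It assumes the tsibble, tibble, and tidyr packages are installed, and it uses a small made-up table (toy_economy, with invented GDP figures) in place of global_economy.

```r
library(tsibble)
library(tidyr)

# Hypothetical toy data standing in for global_economy (values are made up).
toy_economy <- tsibble(
  Country = rep(c("Afghanistan", "Albania"), each = 3),
  Year    = rep(2015:2017, times = 2),
  GDP     = c(10, 11, 12, 20, 21, 22),
  key     = Country,
  index   = Year
)

class(toy_economy)   # includes "tbl_ts", the tsibble class

# Drop the tsibble structure (key and index), then reshape.
plain <- tibble::as_tibble(toy_economy)
wide  <- pivot_wider(plain, names_from = Country, values_from = GDP)

class(wide)          # no "tbl_ts" any more
```

After the conversion, wide has one row per Year and one column per country.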

Related

Revaluing many observations with a for loop in R

I have a data set where I am looking at longitudinal data for countries.
master.set <- data.frame(
  Country = c(rep("Afghanistan", 3), rep("Albania", 3)),
  Country.ID = c(rep("Afghanistan", 3), rep("Albania", 3)),
  Year = c(2015, 2016, 2017, 2015, 2016, 2017),
  Happiness.Score = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
  GDP.PPP = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
  GINI = NA,
  Status = 2,
  stringsAsFactors = FALSE
)
> head(master.set)
Country Country.ID Year Happiness.Score GDP.PPP GINI Status
1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
4 Albania Albania 2015 4.959 10971.044 NA 2
5 Albania Albania 2016 4.655 11356.717 NA 2
6 Albania Albania 2017 4.644 11803.282 NA 2
I created that Country.ID variable with the intent of turning them into numerical values 1:159.
I am hoping to avoid doing something like this to replace the value at each individual observation:
master.set$Country.ID[master.set$Country.ID == "Afghanistan"] <- 1
As I implied, there are 159 countries listed in the data set. Because it's longitudinal, there are 460 observations.
Is there any way to use a for loop to save me a lot of time? Here is what I attempted. I made a couple of lists and attempted to use an ifelse command to tell R to label each country with the next number.
Here is what I have:
#List of country names
N.Countries <- length(unique(master.set$Country))
Country <- unique(master.set$Country)
Country.ID <- unique(master.set$Country.ID)
CountryList <- unique(master.set$Country)
# For loop to make Country ID numerically match Country
for (i in 1:460){
  for (j in N.Countries){
    master.set[[Country.ID[i]]] <- ifelse(master.set[[Country[i]]] == CountryList[j], j, master.set$Country)
  }
}
I received this error:
Error in `[[<-.data.frame`(`*tmp*`, Country.ID[i], value = logical(0)) :
replacement has 0 rows, data has 460
Does anyone know how I can accomplish this task? Or will I be stuck using the ifelse command 159 times?
Thanks!
Maybe something like
master.set$Country.ID <- as.numeric(as.factor(master.set$Country.ID))
Or alternatively, using dplyr
library(tidyverse)
master.set <- master.set %>% mutate(Country.ID = as.numeric(as.factor(Country.ID)))
Or this, which creates a new variable Country.ID2 from a lookup table that maps each unique Country to an index in 1:length(unique(Country)).
library(tidyverse)
master.set <- left_join(master.set,
                        data.frame(Country = unique(master.set$Country),
                                   Country.ID2 = seq_along(unique(master.set$Country))),
                        by = "Country")
master.set
#> Country Country.ID Year Happiness.Score GDP.PPP GINI Status
#> 1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
#> 2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
#> 3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
#> 4 Albania Albania 2015 4.959 10971.044 NA 2
#> 5 Albania Albania 2016 4.655 11356.717 NA 2
#> 6 Albania Albania 2017 4.644 11803.282 NA 2
#> Country.ID2
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 2
#> 6 2
library(dplyr)
df <- data.frame("Country" = c("Afghanistan", "Afghanistan", "Afghanistan", "Albania", "Albania", "Albania"),
                 "Year" = c(2015, 2016, 2017, 2015, 2016, 2017),
                 "Happiness.Score" = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
                 "GDP.PPP" = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
                 "GINI" = NA,
                 "Status" = rep(2, 6))
# group_indices_() is deprecated in current dplyr; cur_group_id() (dplyr >= 1.0.0)
# gives the same per-group index.
df1 <- df %>%
  arrange(Country) %>%
  group_by(Country) %>%
  mutate(Country_id = cur_group_id()) %>%
  ungroup()
View(df1)
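As a further base R option, here is a minimal sketch that numbers the countries in order of first appearance by calling match() against the unique country names. The as.factor() approach numbers them by sorted factor level instead, so the two can differ when the data are not sorted by country. The data frame below is a cut-down stand-in for master.set, and country_names is a name chosen for illustration.

```r
# Cut-down stand-in for master.set (only the columns needed here).
master.set <- data.frame(
  Country = c(rep("Afghanistan", 3), rep("Albania", 3)),
  Year    = rep(2015:2017, times = 2),
  stringsAsFactors = FALSE
)

# Position of each row's country in the vector of unique names.
country_names <- unique(master.set$Country)
master.set$Country.ID <- match(master.set$Country, country_names)

master.set
```

No loop is needed because match() is vectorised over all rows.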

Assign values to a name within a function

Here is my code:
get_test <- function(name){
  data <- filter(data_all_country, country == name)
  # transform the data to a time series using `ts` in `stats`
  data <- ts(data$investment, start = 1950)
  data <- log(data)
  rule <- substitute(name)
  assign(rule, data)
}
As the code shows, I am trying to build a function that takes a country's name as a character string and automatically creates a variable named after that country. The code runs, but the variable I want is not created. For example, I want a variable called Albania in the environment after I run get_test("Albania").
I wonder why?
P.S. The dataset data_all_country looks like this:
year country investment
1 1950 Albania NA
2 1951 Albania NA
3 1952 Albania NA
4 1953 Albania NA
5 1954 Albania NA
6 1955 Albania NA
Note that the dataset is OK; only some of its values are NA.
I think you have to specify the environment for assign; otherwise it uses the current environment (in this case, the function's own environment).
You could use
assign(name, data, envir = .GlobalEnv)
or
assign(name, data, pos = 1)
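The difference between the two environments can be seen in a minimal sketch. The helper names assign_local and assign_global, the variable names, and the investment values are hypothetical, chosen only for illustration.

```r
# assign() without envir creates the variable inside the function's own
# environment, so it disappears when the function returns.
assign_local <- function(name, value) {
  assign(name, value)
  invisible(NULL)
}

# With envir = .GlobalEnv the variable is created in the global environment.
assign_global <- function(name, value) {
  assign(name, value, envir = .GlobalEnv)
  invisible(NULL)
}

investment <- log(ts(c(10, 12, 15), start = 1950))  # made-up values

assign_local("Albania_local", investment)
exists("Albania_local")    # FALSE: nothing was created outside the function

assign_global("Albania_global", investment)
exists("Albania_global")   # TRUE: the variable now lives in .GlobalEnv
```

Returning the series from the function and assigning it at the call site (for example, into a named list) is often easier to follow than creating global variables as a side effect.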

Removing certain rows whose column value does not match another column (all within the same data frame)

I'm attempting to remove all of the rows (cases) within a data frame in which a certain column's value does not match another column value.
The data frame bilat_total contains these 10 columns/variables:
bilat_total[,c("year", "importer1", "importer2", "flow1",
"flow2", "country", "imports", "exports", "bi_tot",
"mother")]
Thus the table's head is:
year importer1 importer2 flow1 flow2 country
6 2009 Afghanistan Bhutan NA NA Afghanistan
11 2009 Afghanistan Solomon Islands NA NA Afghanistan
12 2009 Afghanistan India 516.13 120.70 Afghanistan
13 2009 Afghanistan Japan 124.21 0.46 Afghanistan
15 2009 Afghanistan Maldives NA NA Afghanistan
19 2009 Afghanistan Bangladesh 4.56 1.09 Afghanistan
imports exports bi_tot mother
6 6689.35 448.25 NA United Kingdom
11 6689.35 448.25 NA United Kingdom
12 6689.35 448.25 1.804361e-02 United Kingdom
13 6689.35 448.25 6.876602e-05 United Kingdom
15 6689.35 448.25 NA United Kingdom
19 6689.35 448.25 1.629456e-04 United Kingdom
I've attempted to remove all the cases in which importer2 does not match mother by making a subset:
subset(bilat_total, importer2 == mother)
But each time I do, I get the error:
Error in Ops.factor(importer2, mother) : level sets of factors are different
How would I go about dropping all the rows/cases in which importer2 and mother don't match?
The error probably occurs because both columns are factors with different sets of levels. We can convert the columns to character and then compare them to subset the rows.
subset(bilat_total, as.character(importer2) == as.character(mother))
With the example data shown, the original call reproduces the error:
subset(bilat_total, importer2 == mother)
# Error in Ops.factor(importer2, mother) :
# level sets of factors are different
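A minimal sketch of the situation, using a small made-up data frame (bilat_toy) rather than the real bilat_total: two factor columns with different level sets cannot be compared with ==, while their character values can.

```r
# Two factor columns whose level sets differ (toy values for illustration).
bilat_toy <- data.frame(
  importer2 = factor(c("India", "Japan", "United Kingdom")),
  mother    = factor(rep("United Kingdom", 3))
)

# Comparing the factors directly fails; capture the message instead of stopping.
msg <- tryCatch(
  bilat_toy$importer2 == bilat_toy$mother,
  error = function(e) conditionMessage(e)
)
msg

# Comparing the underlying strings works and keeps only the matching row.
matched <- subset(bilat_toy, as.character(importer2) == as.character(mother))
matched
```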

How to reshape this complicated data frame?

Here are the first 4 rows of my data:
X...Country.Name Country.Code Indicator.Name
1 Turkey TUR Inflation, GDP deflator (annual %)
2 Turkey TUR Unemployment, total (% of total labor force)
3 Afghanistan AFG Inflation, GDP deflator (annual %)
4 Afghanistan AFG Unemployment, total (% of total labor force)
Indicator.Code X2010
1 NY.GDP.DEFL.KD.ZG 5.675740
2 SL.UEM.TOTL.ZS 11.900000
3 NY.GDP.DEFL.KD.ZG 9.437322
4 SL.UEM.TOTL.ZS NA
I want my data reshaped into two columns, one for each indicator code, and I want each row to correspond to a country, something like this:
Country Name NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
Turkey 5.6 11.9
Afghanistan 9.43 NA
I think I could do this with Excel, but I want to learn the R way, so that I don't need to rely on Excel every time I have a problem. Here is dput of data if you need it.
Edit: I actually want 3 columns, one for each indicator and one for the country's name.
Sticking with base R, use reshape. I took the liberty of cleaning up the column names. Here, I'm only showing you a few rows of the output. Remove head to see the full output. This assumes your data.frame is named "mydata".
names(mydata) <- c("CountryName", "CountryCode",
                   "IndicatorName", "IndicatorCode", "X2010")
head(reshape(mydata[-c(2:3)],
             direction = "wide",
             idvar = "CountryName",
             timevar = "IndicatorCode"))
# CountryName X2010.NY.GDP.DEFL.KD.ZG X2010.SL.UEM.TOTL.ZS
# 1 Turkey 5.675740 11.9
# 3 Afghanistan 9.437322 NA
# 5 Albania 3.459343 NA
# 7 Algeria 16.245617 11.4
# 9 American Samoa NA NA
# 11 Andorra NA NA
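For a version that can be run without the original file, here is a minimal sketch that rebuilds the four rows shown in the question (with the column names already cleaned up) and then strips the "X2010." prefix that reshape adds, so the column names match the layout asked for. The renaming step is an addition for illustration, not part of the answer above.

```r
# The four rows from the question, with cleaned-up column names.
mydata <- data.frame(
  CountryName   = c("Turkey", "Turkey", "Afghanistan", "Afghanistan"),
  CountryCode   = c("TUR", "TUR", "AFG", "AFG"),
  IndicatorName = rep(c("Inflation, GDP deflator (annual %)",
                        "Unemployment, total (% of total labor force)"),
                      times = 2),
  IndicatorCode = rep(c("NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS"), times = 2),
  X2010         = c(5.675740, 11.9, 9.437322, NA)
)

# One row per country, one column per indicator code.
wide <- reshape(mydata[c("CountryName", "IndicatorCode", "X2010")],
                direction = "wide",
                idvar = "CountryName",
                timevar = "IndicatorCode")

# Drop the "X2010." prefix so the columns carry the bare indicator codes.
names(wide) <- sub("^X2010\\.", "", names(wide))
wide
```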
Another option in base R is xtabs, but NA gets replaced with 0:
head(xtabs(X2010 ~ CountryName + IndicatorCode, mydata))
# IndicatorCode
# CountryName NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
# Afghanistan 9.437322 0.0
# Albania 3.459343 0.0
# Algeria 16.245617 11.4
# American Samoa 0.000000 0.0
# Andorra 0.000000 0.0
# Angola 22.393924 0.0
The result of xtabs is a contingency table (class "xtabs" and "table") rather than a data.frame, so if you want a data.frame, wrap the output with as.data.frame.matrix.
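As a minimal sketch of that last step, the following rebuilds the four rows from the question (only the columns needed) and converts the xtabs result to a data.frame. Moving the row names into a CountryName column is an extra step added here for illustration.

```r
# The four rows from the question, reduced to the columns used by xtabs.
mydata <- data.frame(
  CountryName   = c("Turkey", "Turkey", "Afghanistan", "Afghanistan"),
  IndicatorCode = rep(c("NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS"), times = 2),
  X2010         = c(5.675740, 11.9, 9.437322, NA)
)

# Cross-tabulate: countries in rows, indicator codes in columns.
tab <- xtabs(X2010 ~ CountryName + IndicatorCode, mydata)

# Convert the table to a data.frame, keeping the wide layout.
wide <- as.data.frame.matrix(tab)

# The countries end up as row names; move them into a regular column.
wide$CountryName <- rownames(wide)
rownames(wide) <- NULL
wide
```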

Creating new variable and new data rows for country-conflict-year observations

I'm very new to R, still learning the very basics, and I haven't yet figured out how to perform this particular operation, but it would save me lots and lots of labor and time.
I have a dataset of international conflicts with columns for country and dates that looks something like this:
country dates
Angola 1951-1953
Belize 1970-1972
I would like to reorganize the data to create variables for start year and end year, as well as create a year-observed (call it 'yrobs') column, so the set looks more like this:
country yrobs yrstart yrend
Angola 1951 1951 1953
Angola 1952 1951 1953
Angola 1953 1951 1953
Belize 1970 1970 1972
Belize 1971 1970 1972
Belize 1972 1970 1972
Someone suggested using data frames and a double for-loop, but I got a little confused trying that. Any help would be greatly appreciated, and feel free to use dummy language, as I'm still pretty green to the programming here. Thanks much.
No need for any for loops here. Use the power of R and its contributed packages, particularly plyr and reshape2.
library(reshape2)
library(plyr)
Create some data:
df <- data.frame(
  country = c("Angola", "Belize"),
  dates = c("1951-1953", "1970-1972")
)
Use colsplit in the reshape2 package to split your dates column into two, and cbind this to the original data frame.
df <- cbind(df, colsplit(df$dates, "-", c("start", "end")))
Now for the fun bit. Use ddply in package plyr to split, apply and combine (SAC). This will take df and apply a function to each change in country. The anonymous function inside ddply creates a small data.frame with country and observations, and the key bit is to use seq() to generate a sequence from start to end date. The power of ddply is that it does all of this splitting, combining and applying in one step. Think of it as a loop in other languages, but you don't need to keep track of your indexing variables.
ddply(df, .(country), function(x){
  data.frame(
    country = x$country,
    yrobs = seq(x$start, x$end),
    yrstart = x$start,
    yrend = x$end
  )
})
And the results:
country yrobs yrstart yrend
1 Angola 1951 1951 1953
2 Angola 1952 1951 1953
3 Angola 1953 1951 1953
4 Belize 1970 1970 1972
5 Belize 1971 1970 1972
6 Belize 1972 1970 1972
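The same split-apply-combine idea can also be sketched in base R without extra packages. This is an alternative to the plyr answer above, not a reproduction of it; the names conflicts, bounds, and expanded are chosen for illustration.

```r
conflicts <- data.frame(
  country = c("Angola", "Belize"),
  dates   = c("1951-1953", "1970-1972"),
  stringsAsFactors = FALSE
)

# Split "start-end" strings into a two-column character matrix.
bounds <- do.call(rbind, strsplit(conflicts$dates, "-", fixed = TRUE))
conflicts$yrstart <- as.integer(bounds[, 1])
conflicts$yrend   <- as.integer(bounds[, 2])

# Build one small data frame per conflict, with one row per observed year,
# then stack them.
expanded <- do.call(rbind, lapply(seq_len(nrow(conflicts)), function(i) {
  data.frame(
    country = conflicts$country[i],
    yrobs   = seq(conflicts$yrstart[i], conflicts$yrend[i]),
    yrstart = conflicts$yrstart[i],
    yrend   = conflicts$yrend[i]
  )
}))

expanded
```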
