Creating new variable and new data rows for country-conflict-year observations - r

I'm very new to R, still learning the very basics, and I haven't yet figured out how to perform this particular operation, but it would save me lots and lots of labor and time.
I have a dataset of international conflicts with columns for country and dates that looks something like this:
country dates
Angola 1951-1953
Belize 1970-1972
I would like to reorganize the data to create variables for start year and end year, as well as create a year-observed (call it 'yrobs') column, so the set looks more like this:
country yrobs yrstart yrend
Angola 1951 1951 1953
Angola 1952 1951 1953
Angola 1953 1951 1953
Belize 1970 1970 1972
Belize 1971 1970 1972
Belize 1972 1970 1972
Someone suggested using data frames and a double for-loop, but I got a little confused trying that. Any help would be greatly appreciated, and feel free to use dummy language, as I'm still pretty green to the programming here. Thanks much.

No need for any for loops here. Use the power of R and its contributed packages, particularly plyr and reshape2.
library(reshape2)
library(plyr)
Create some data:
df <- data.frame(
  country = c("Angola", "Belize"),
  dates = c("1951-1953", "1970-1972")
)
Use colsplit in the reshape2 package to split your dates column into two, and cbind this to the original data frame.
df <- cbind(df, colsplit(df$dates, "-", c("start", "end")))
Now for the fun bit. Use ddply in package plyr to split, apply and combine (SAC). This takes df, splits it by country, and applies a function to each piece. The anonymous function inside ddply creates a small data.frame with country and observations, and the key bit is to use seq() to generate a sequence from the start year to the end year. The power of ddply is that it does all of this splitting, applying and combining in one step. Think of it as a loop in other languages, but you don't need to keep track of your indexing variables.
ddply(df, .(country), function(x){
  data.frame(
    country = x$country,
    yrobs = seq(x$start, x$end),
    yrstart = x$start,
    yrend = x$end
  )
})
And the results:
country yrobs yrstart yrend
1 Angola 1951 1951 1953
2 Angola 1952 1951 1953
3 Angola 1953 1951 1953
4 Belize 1970 1970 1972
5 Belize 1971 1970 1972
6 Belize 1972 1970 1972
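For readers who do not have plyr installed, the same split-apply-combine idea can be sketched in base R. This is only a minimal sketch, not the method used above: the helper name expand_years is made up for illustration, and it works one input row at a time, so it should also cope with a country that appears in more than one conflict.

```r
# Toy input in the same shape as the question's data.
df <- data.frame(
  country = c("Angola", "Belize"),
  dates = c("1951-1953", "1970-1972"),
  stringsAsFactors = FALSE
)

# Split "1951-1953" into integer start and end years.
parts <- strsplit(df$dates, "-", fixed = TRUE)
df$yrstart <- as.integer(sapply(parts, `[`, 1))
df$yrend <- as.integer(sapply(parts, `[`, 2))

# expand_years is a hypothetical helper: one input row in, one row per year out.
expand_years <- function(i) {
  data.frame(
    country = df$country[i],
    yrobs = seq(df$yrstart[i], df$yrend[i]),
    yrstart = df$yrstart[i],
    yrend = df$yrend[i]
  )
}

# Apply the helper to every row and stack the pieces.
result <- do.call(rbind, lapply(seq_len(nrow(df)), expand_years))
result
```

The lapply() call plays the role of the loop, and do.call(rbind, ...) is the "combine" step that ddply performs for you.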

Related

Sum rows in R and store the new value in a new dataframe

I have a dataframe that looks like this:
Year Month total_volume us_assigned
1953 October 55154 18384.667
1953 November 22783 7594.333
1953 December 20996 6998.667
1954 January 22096 7365.333
1954 February 18869 6289.667
1954 March 11598 3866.000
1954 April 37051 12350.333
1954 May 105856 35285.333
1954 June 61320 20440.000
1954 July 44084 14694.667
1954 August 175152 58384.000
1954 September 80071 26690.333
The dataframe goes to the year 2021 with monthly observations as shown in the table above. I am trying to sum up 12 months (i.e., rows) at a time (from Oct. to Sept.) for the column "us_assigned" and save this value in a new dataframe, which would look like this:
Year us_assigned
1 218343
2
3
Year 2 would have the sum of the next 12 months (i.e., the next Oct.-Sept.) and so on and so forth. I have thought of simply summing the rows by specifying them, like below, but this seems too tedious.
sum(us_volume[1:12,4])
I am sure there is a much easier way to do this. I am not too proficient with R so I appreciate any help.
This is relatively straightforward using group_by() and summarise() from the dplyr package, e.g.
library(dplyr)
df <- read.table(text = "Year Month total_volume us_assigned
1953 October 55154 18384.667
1953 November 22783 7594.333
1953 December 20996 6998.667
1954 January 22096 7365.333
1954 February 18869 6289.667
1954 March 11598 3866.000
1954 April 37051 12350.333
1954 May 105856 35285.333
1954 June 61320 20440.000
1954 July 44084 14694.667
1954 August 175152 58384.000
1954 September 80071 26690.333", header = TRUE)
df2 <- df %>%
  mutate(Year_int = cumsum(Month == "October")) %>% # every Oct add 1 to Year_int
  group_by(Year_int) %>%
  summarise(us_assigned = sum(us_assigned))
df2
#> # A tibble: 1 × 2
#> Year_int us_assigned
#> <int> <dbl>
#> 1 1 218343.
Created on 2023-01-23 with reprex v2.0.2
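The asker's own idea of summing 12 rows at a time can also be sketched in base R. This assumes, as in the question, that the rows are in chronological order, start in October, and have no missing months. The toy values and the column name block are introduced here for illustration only.

```r
# Toy data: 24 months starting in October, with made-up values.
months <- c("October", "November", "December", "January", "February", "March",
            "April", "May", "June", "July", "August", "September")
us_volume <- data.frame(
  Month = rep(months, times = 2),
  us_assigned = c(rep(1, 12), rep(2, 12))
)

# Rows 1-12 fall in block 1, rows 13-24 in block 2, and so on.
us_volume$block <- (seq_len(nrow(us_volume)) - 1) %/% 12 + 1

# Sum us_assigned within each block of 12 rows.
yearly <- aggregate(us_assigned ~ block, data = us_volume, FUN = sum)
names(yearly) <- c("Year", "us_assigned")
yearly
```

If some months could be missing, counting Octobers with cumsum() is the safer choice, because fixed blocks of 12 rows would drift out of step.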

What's the difference between these two classes, and why must I use the as_tibble function in my code?

Here is my code. I have two questions about it. First, what is the difference between these two classes? Second, why must I call as_tibble() before I can use pivot_wider()?
head(global_economy)
write.table(global_economy,"global_economy.csv",sep=",",row.names=FALSE)
class(global_economy)
Country Code Year GDP Growth CPI Imports Exports Population
Afghanistan AFG 1960 537777811 NA NA 7.024793 4.132233 8996351
Afghanistan AFG 1961 548888896 NA NA 8.097166 4.453443 9166764
Afghanistan AFG 1962 546666678 NA NA 9.349593 4.878051 9345868
Afghanistan AFG 1963 751111191 NA NA 16.863910 9.171601 9533954
Afghanistan AFG 1964 800000044 NA NA 18.055555 8.888893 9731361
Afghanistan AFG 1965 1006666638 NA NA 21.412803 11.258279 9938414
'tbl_ts' 'tbl_df' 'tbl' 'data.frame'
wider_tibble <- global_economy %>%
as_tibble()%>%
pivot_wider(names_from=Country,values_from=GDP)
class(wider_tibble)
'tbl_df' 'tbl' 'data.frame'
My guess: the first one is a tsibble, i.e. a time series tibble (class 'tbl_ts'). as_tibble() converts it back to a plain tibble without the time series elements, so that pivot_wider() works.
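The class vectors printed above are read from left to right: R looks for a method for the first class and falls back to the next one if none exists. Here is a toy sketch of that mechanism in base R only. It does not use tsibble or tidyr, and the generic describe and the class toy_ts are made up for illustration.

```r
# A hypothetical generic with a specific and a general method.
describe <- function(x) UseMethod("describe")
describe.toy_ts <- function(x) "time series method"
describe.data.frame <- function(x) "plain data frame method"

# An object whose class vector starts with the more specific class.
d <- data.frame(year = 1960:1962, gdp = c(1, 2, 3))
ts_like <- structure(d, class = c("toy_ts", "data.frame"))

class(ts_like)                   # "toy_ts" "data.frame"
inherits(ts_like, "data.frame")  # TRUE: it is still a data frame
describe(ts_like)                # dispatches on "toy_ts" first

# Dropping the leading class (roughly what as_tibble() does to 'tbl_ts')
# makes dispatch fall through to the more general method.
plain <- structure(ts_like, class = "data.frame")
describe(plain)
```

So an object can be a data frame and still behave differently from one, because the more specific class gets the first chance to handle each call.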

Assign values to a name within a function

Here is my code:
get_test <- function(name){
data <- filter(data_all_country,country == name)
# transform the data to a time series using `ts` in `stats`
data <- ts(data$investment, start = 1950)
data <- log(data)
rule <- substitute(name)
assign(rule,data)
}
As the code shows, I am trying to build a function that takes a country's name as a character string and automatically creates a variable named after that country. The code runs without error, but the variable I want is not created. For example, I want a variable called Albania in the environment after I call get_test("Albania").
I wonder why?
Ps: And the dataset of data_all_country is as following:
year country investment
1 1950 Albania NA
2 1951 Albania NA
3 1952 Albania NA
4 1953 Albania NA
5 1954 Albania NA
6 1955 Albania NA
Note that the dataset is OK, just some of it is NA
I think you have to specify the environment for assign, else it will use the current environment (in this case within the function).
You could use
assign(name, data, envir = .GlobalEnv)
or
assign(name, data, pos = 1)
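To see the effect, here is a minimal sketch in base R. The toy data_all_country and its investment values are made up, and base subsetting stands in for filter() so that the example needs no packages.

```r
# Toy stand-in for data_all_country, with made-up investment values.
data_all_country <- data.frame(
  year = rep(1950:1952, times = 2),
  country = rep(c("Albania", "Austria"), each = 3),
  investment = c(1, 2, 4, 10, 20, 40)
)

get_test <- function(name) {
  rows <- data_all_country[data_all_country$country == name, ]
  series <- log(ts(rows$investment, start = 1950))
  # Without envir, assign() would create the variable inside the function only.
  assign(name, series, envir = .GlobalEnv)
  invisible(series)
}

get_test("Albania")
exists("Albania", envir = .GlobalEnv)

# Alternative that avoids writing into the global environment:
# keep one series per country in a named list.
all_series <- lapply(
  split(data_all_country$investment, data_all_country$country),
  function(v) log(ts(v, start = 1950))
)
all_series[["Austria"]]
```

The named list is often easier to work with than many separate global variables, since you can loop over it or look a country up by name.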

Pass a string argument to a function as dataframe column name in dplyr

I am trying to pass a string variable to a function, to be used as the column name after some data alteration.
Here is the function:
cleandata <- function(df,name){
df <- df %>%
gather(key = 'Year',value = name,X1960:X2015)
df <- df %>%
select(-c(X,Indicator.Name,Indicator.Code))
df$Year <- substr(df$Year,start = 2,stop = 5)
df$Year <- as.factor(df$Year)
return(df)
}
I want to pass a string variable to 'name', and have it as the column name.
The current output of the function is:
> cleandata(lifeexp,'LifeExp')
Source: local data frame [13,888 x 4]
Country.Name Country.Code Year name
(fctr) (fctr) (fctr) (dbl)
1 Aruba ABW 1960 65.56937
2 Andorra AND 1960 NA
3 Afghanistan AFG 1960 32.32851
4 Angola AGO 1960 32.98483
5 Albania ALB 1960 62.25437
6 Arab World ARB 1960 46.84706
7 United Arab Emirates ARE 1960 52.24322
8 Argentina ARG 1960 65.21554
9 Armenia ARM 1960 65.86346
10 American Samoa ASM 1960 NA
.. ... ... ... ...
>
The last column should be 'LifeExp', not name. What am I missing?
Thanks in advance,
Rahul
You want to use gather_ here. See vignette('nse') for an explanation why.
year_cols <- names(df)[grepl('^X\\d{4}$', names(df))]
df %>% gather_('Year', name, year_cols)
The issue is that gather takes an unquoted name for its key and value columns, so you can't pass in a variable name. It's just going to interpret whatever variable name you put in there as the unquoted name you want for the value column. This is consistent with the principle that the tidyr functions without underscores are meant for interactive use and those with underscores should be used when your effort is more programmatic.
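Note that gather_() has since been deprecated in tidyr. In current versions, pivot_longer() takes the new column names as strings, so a variable can be passed directly. The following is a minimal sketch of that alternative, not the answer's original method. The toy lifeexp table and its values are made up, and it omits the select() step from the question.

```r
library(tidyr)

# Toy stand-in for the wide life expectancy table (made-up values).
lifeexp <- data.frame(
  Country.Name = c("Aruba", "Angola"),
  X1960 = c(65.6, 33.0),
  X1961 = c(66.1, 33.4)
)

cleandata <- function(df, name) {
  # Year columns look like X1960, X1961, ...
  year_cols <- grep("^X[0-9]{4}$", names(df), value = TRUE)
  long <- pivot_longer(df, cols = all_of(year_cols),
                       names_to = "Year", values_to = name)
  # Strip the leading "X" and store the year as a factor, as in the question.
  long$Year <- as.factor(substr(long$Year, start = 2, stop = 5))
  long
}

names(cleandata(lifeexp, "LifeExp"))
```

Because values_to is an ordinary string argument, the value column takes whatever name the caller passes in.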

How to get column mean for specific rows only?

I need to get the mean of one column (here: score) for specific rows (here: years). Specifically, I would like to know the average score for three periods:
period 1: year <= 1983
period 2: year >= 1984 & year <= 1990
period 3: year >= 1991
This is the structure of my data:
country year score
Algeria 1980 -1.1201501
Algeria 1981 -1.0526943
Algeria 1982 -1.0561565
Algeria 1983 -1.1274560
Algeria 1984 -1.1353926
Algeria 1985 -1.1734330
Algeria 1986 -1.1327666
Algeria 1987 -1.1263586
Algeria 1988 -0.8529455
Algeria 1989 -0.2930265
Algeria 1990 -0.1564207
Algeria 1991 -0.1526328
Algeria 1992 -0.9757842
Algeria 1993 -0.9714060
Algeria 1994 -1.1422258
Algeria 1995 -0.3675797
...
The calculated mean values should be added to the df in an additional column ("mean"), i.e. same mean value for years of period 1, for those of period 2 etc.
This is how it should look like:
country year score mean
Algeria 1980 -1.1201501 -1.089
Algeria 1981 -1.0526943 -1.089
Algeria 1982 -1.0561565 -1.089
Algeria 1983 -1.1274560 -1.089
Algeria 1984 -1.1353926 -0.839
Algeria 1985 -1.1734330 -0.839
Algeria 1986 -1.1327666 -0.839
Algeria 1987 -1.1263586 -0.839
Algeria 1988 -0.8529455 -0.839
Algeria 1989 -0.2930265 -0.839
Algeria 1990 -0.1564207 -0.839
...
Every possible path I tried got easily super complicated - and I have to calculate the mean scores for different periods of time for over 90 countries ...
Many many thanks for your help!
datfrm$mean <-
with (datfrm, ave( score, findInterval(year, c(-Inf, 1984, 1991, Inf)), FUN= mean) )
The title question is a bit different than the real question and would be answered by using logical indexing. If one wanted only the mean for a particular subset say year >= 1984 & year <= 1990 it would be done via:
mn84_90 <- with(datfrm, mean(score[year >= 1984 & year <= 1990]) )
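Since the question mentions over 90 countries, the period means presumably need to be computed within each country as well. Here is a minimal sketch of that extension, using made-up scores for two countries. ave() accepts more than one grouping variable, so the country column can simply be added.

```r
# Toy data: two countries, made-up scores, rows deliberately not sorted by year.
datfrm <- data.frame(
  country = rep(c("Algeria", "Angola"), each = 4),
  year = c(1991, 1980, 1984, 1983, 1980, 1983, 1984, 1991),
  score = c(4, 1, 3, 2, 10, 20, 30, 40)
)

# 1 = up to 1983, 2 = 1984 to 1990, 3 = 1991 onwards.
period <- findInterval(datfrm$year, c(-Inf, 1984, 1991, Inf))

# The mean is taken within each country-period combination
# and repeated on every matching row.
datfrm$mean <- ave(datfrm$score, datfrm$country, period, FUN = mean)
datfrm
```

FUN has to be passed by name, because every unnamed argument after the first is treated as a grouping variable.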
Since findInterval requires year to be sorted (as it is in your example) I'd be tempted to use cut in case it isn't sorted [proved wrong, thanks @DWin]. For completeness, the data.table equivalent (which scales to large data) is:
require(data.table)
DT = as.data.table(DF) # or just start with a data.table in the first place
# right=FALSE so that 1984 and 1991 start a new period, matching findInterval
DT[, mean:=mean(score), by=cut(year,c(-Inf,1984,1991,Inf),right=FALSE)]
or findInterval is likely faster as DWin used :
DT[, mean:=mean(score), by=findInterval(year,c(-Inf,1984,1991,Inf))]
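One detail to watch when swapping between the two: cut() closes its intervals on the right by default, whereas findInterval() closes them on the left. With breaks at 1984 and 1991, cut() therefore needs right = FALSE to give the periods asked for in the question. A quick check on the boundary years:

```r
years <- c(1983, 1984, 1990, 1991)
breaks <- c(-Inf, 1984, 1991, Inf)

# labels = FALSE returns the group number instead of an interval label.
cut(years, breaks, labels = FALSE)                 # 1 1 2 2
cut(years, breaks, labels = FALSE, right = FALSE)  # 1 2 2 3
findInterval(years, breaks)                        # 1 2 2 3
```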
If the rows are ordered by year, I think the easiest way to accomplish this would be:
m80_83 <- mean(dataframe[1:4,3]) #Finds the mean of the values of column 3 for rows 1 through 4
m84_90 <- mean(dataframe[5:11,3]) #Rows 5 through 11 hold the years 1984 to 1990
#etc.
If the rows are not ordered by year, I would use tapply like this.
list.of.means <- tapply(dataframe$score, cut(dataframe$year, c(0, 1983.5, 1990.5, 3000)), mean)
Here, tapply takes three parameters:
First, the data you want to do stuff with (in this case, dataframe$score).
Second, a grouping factor, here built by cut(), that divides that data into groups. In this case, it will cut the data into three groups based on the dataframe$year values. Group 1 will include all rows with dataframe$year values from 0 to 1983.5, Group 2 will include all rows with dataframe$year values from 1983.5 to 1990.5, and Group 3 will include all rows with dataframe$year values from 1990.5 to 3000.
Third, a function that is applied to each group. This function will apply to the data you selected as your first parameter.
So, list.of.means should be a named vector of the 3 values you are looking for.
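The question also asks for the means to be added back to the data frame as a column. Here is a minimal sketch of one way to do that with the tapply result, using made-up scores. The period labels are names chosen here for illustration.

```r
# Toy data, deliberately not sorted by year.
dataframe <- data.frame(
  year = c(1991, 1980, 1984, 1983, 1990),
  score = c(5, 1, 3, 2, 4)
)

# Grouping factor with readable labels; the half-year breaks avoid any
# ambiguity about which side of a boundary a year falls on.
period <- cut(dataframe$year, c(0, 1983.5, 1990.5, 3000),
              labels = c("to1983", "y1984_1990", "from1991"))

# One mean per period, named after the period labels.
means <- tapply(dataframe$score, period, mean)
means

# Look up each row's period mean by name to build the new column.
dataframe$mean <- as.vector(means[as.character(period)])
dataframe
```

Indexing means by the period label repeats each group's mean on every row of that group, whatever order the rows are in.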
