Pass a string argument to a function as dataframe column name in dplyr - r

I am trying to pass a string variable to a function, to be used as the column name after some data alteration.
Here is the function:
cleandata <- function(df,name){
df <- df %>%
gather(key = 'Year',value = name,X1960:X2015)
df <- df %>%
select(-c(X,Indicator.Name,Indicator.Code))
df$Year <- substr(df$Year,start = 2,stop = 5)
df$Year <- as.factor(df$Year)
return(df)
}
I want to pass a string variable to 'name', and have it as the column name.
The current output of the function is:
> cleandata(lifeexp,'LifeExp')
Source: local data frame [13,888 x 4]
Country.Name Country.Code Year name
(fctr) (fctr) (fctr) (dbl)
1 Aruba ABW 1960 65.56937
2 Andorra AND 1960 NA
3 Afghanistan AFG 1960 32.32851
4 Angola AGO 1960 32.98483
5 Albania ALB 1960 62.25437
6 Arab World ARB 1960 46.84706
7 United Arab Emirates ARE 1960 52.24322
8 Argentina ARG 1960 65.21554
9 Armenia ARM 1960 65.86346
10 American Samoa ASM 1960 NA
.. ... ... ... ...
>
The last column should be 'LifeExp', not name. What am I missing?
Thanks in advance,
Rahul

You want to use gather_ here. See vignette('nse') for an explanation why.
year_cols <- names(df)[grepl('^X\\d{4}$', names(df))]
df %>% gather_('Year', name, year_cols)
The issue is gather takes an unquoted name for its key and value columns, so you can't pass in a variable name. It's just going to interpret what ever variable name you put in there as the the unquoted name you want for the value column. This is consistent with the principle that the tidyr functions without underscores are meant for interactive use and those with underscores should be used when your effort is more programmatic.

Related

Scatter plot with variables that have multiple different years

I'm currently trying to make a scatter plot of child mortality rate and child labor. My problem is, I don't actually have a lot of data, and some countries may only get values for some years, and some other countries may only have data for some other years, so I can't plot all the data together, nor the data in any year is big enough to limit to that only year. I was wondering if there is a function that takes the last value available in the dataset for any given specified variable. So, for instance, if my last data for child labor from Germany is from 2015 and my last data from Italy is from 2014, and so forth with the rest of the countries, is there a way I can plot the last values for each country?
Code goes like this:
head(data2)
# A tibble: 6 x 5
Entity Code Year mortality labor
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan AFG 1962 34.5 NA
2 Afghanistan AFG 1963 33.9 NA
3 Afghanistan AFG 1964 33.3 NA
4 Afghanistan AFG 1965 32.8 NA
5 Afghanistan AFG 1966 32.2 NA
6 Afghanistan AFG 1967 31.7 NA
Never mind about those NA's. Labor data just doesn't go back there. But I do have it in the dataset, for more recent years. Child mortality data, on the other hand, is actually pretty complete.
Thanks.
I cannot find which variable to plot, but following code can select only last of each country.
data2 %>%
group_by(Entity) %>%
filter(Year == max(Year)) %>%
ungroup
result is like
Entity Code Year mortality labor
<chr> <chr> <dbl> <dbl> <lgl>
1 Afghanistan AFG 1967 31.7 NA
No you can plot some variable.
You might want to define what you mean by 'last' value per group - as in most recent, last occurrence in the data or something else?
dplyr::last picks out the last occurrence in the data, so you could use it along with arrange to order your data. In this example we sort the data by Year (ascending order by default), so the last observation will be the most recent. Assuming you don't want to include NA values, we also use filter to remove them from the data.
data2 %>%
# first remove NAs from the data
filter(
!is.na(labor)
) %>%
# then sort the data by Year
arrange(Year) %>%
# then extract the last observation per country
group_by(Entity) %>%
summarise(
last_record = last(labor)
)

revising the values of a variable in a data frame [duplicate]

This question already has answers here:
Remove part of a string in dataframe column (R)
(3 answers)
removing particular character in a column in r
(3 answers)
Closed 3 years ago.
I want to revise the values of a variable. The values are for a series of years. They start from 1960 and end at 2017. There are multiple 1960s, 1961s and so on till 2017. The multiple values for each year correspond to different countries. Countries are another variable in another column. However, each year is tagged with an X. eg. each 1960 has X1960 and so on till X2017. I want to remove the X for all years.
database is as shown below
Country Year GDP
Afghanistan X1960
England X1960
Sudan X1960
.
.
.
Afghanistan X2017
England X2017
Sudan X2017
.
.
Hi You can you gsub function to your data frame
ABC <- data.frame(country = c("Afghanistan", "England"), year = c("X1960","X1960"))
print(ABC)
country year
1 Afghanistan X1960
2 England X1960
ABC$year <- gsub("X","",ABC$year)
> print(ABC)
country year
1 Afghanistan 1960
2 England 1960
Here's a tidyverse solution.
# Load libraries
library(dplyr)
library(readr)
# Dummy data frame
df <- data.frame(country = c("Afghanistan", "England", "Sudan"),
year = rep("X1960", 3),
stringsAsFactors = FALSE)
# Quick peak
print(df)
#> country year
#> 1 Afghanistan X1960
#> 2 England X1960
#> 3 Sudan X1960
# Strip all non-numerics from strings
df %>% mutate(year = parse_number(year))
#> country year
#> 1 Afghanistan 1960
#> 2 England 1960
#> 3 Sudan 1960
Created on 2019-05-23 by the reprex package (v0.2.1)

Lagging a variable by adding up the previous 5 years?

I am working with data that look like this:
Country Year Aid
Angola 1995 416420000
Angola 1996 459310000
Angola 1997 354660000
Angola 1998 335270000
Angola 1999 387540000
Angola 2000 302210000
I want to create a lagged variable by adding up the previous five years in the data
So that the observation for 2000 looks like this:
Country Year Aid Lagged5
Angola 2000 416420000 1953200000
Which was derived by adding the Aid observations from 1995 to 1999 together:
416420000 + 459310000 + 354660000 + 335270000 + 387540000 = 1953200000
Also, I will need to group by country as well.
Thank You!
You could do:
library(dplyr)
df %>%
group_by(Country) %>%
mutate(Lagged5 = sapply(Year, function(x) sum(Aid[between(Year, x - 5, x - 1)])))
Output:
# A tibble: 6 x 4
# Groups: Country [1]
Country Year Aid Lagged5
<chr> <int> <int> <int>
1 Angola 1995 416420000 0
2 Angola 1996 459310000 416420000
3 Angola 1997 354660000 875730000
4 Angola 1998 335270000 1230390000
5 Angola 1999 387540000 1565660000
6 Angola 2000 302210000 1953200000
Using the input DF shown reproducibly in the Note at the end define a roll function which sums the prior 5 rows and use ave to run it for each Country. The width argument list(-seq(5)) to rollapplyr means use offsets -1, -2, -3, -4, -5 in summing, i.e. the values in the prior 5 rows.
The question did not discuss what to do with the initial rows in each country so we put in NA values but if you want partial sums add the partial = TRUE argument to rollapplyr. You can also change the fill=NA to some other value if you wish so it is quite flexible.
library(zoo)
roll <- function(x) rollapplyr(x, list(-seq(5)), sum, fill = NA)
transform(DF, Lag5 = ave(Aid, Country, FUN = roll))
Note
The input was assumed to be the following. We added a second country.
Lines <- "Country Year Aid
Angola 1995 416420000
Angola 1996 459310000
Angola 1997 354660000
Angola 1998 335270000
Angola 1999 387540000
Angola 2000 302210000"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE,
colClasses = c("character", "integer", "numeric"))
DF <- rbind(DF, transform(DF, Country = "Belize"))

Assign values to a name within a function

Here is my code:
get_test <- function(name){
data <- filter(data_all_country,country == name)
# transform the data to a time series using `ts` in `stats`
data <- ts(data$investment, start = 1950)
data <- log(data)
rule <- substitute(name)
assign(rule,data)
}
As in the code, I try to build a function by which I could input a country's name given in character string, and then the variable named by the country would be generated automatically. However, I run this code, and it runs but with no exact variable generated as I want. For example, I want to have a variable called Albania in the environment after I code get_test("Albania").
I wonder why?
Ps: And the dataset of data_all_country is as following:
year country investment
1 1950 Albania NA
2 1951 Albania NA
3 1952 Albania NA
4 1953 Albania NA
5 1954 Albania NA
6 1955 Albania NA
Note that the dataset is OK, just some of it is NA
I think you have to specify the environment for assign, else it will use the current environment (in this case within the function).
You could use
assign(name, data, envir = .GlobalEnv)
or
assign(name, data, pos = 1)

How to get column mean for specific rows only?

I need to get the mean of one column (here: score) for specific rows (here: years). Specifically, I would like to know the average score for three periods:
period 1: year <= 1983
period 2: year >= 1984 & year <= 1990
period 3: year >= 1991
This is the structure of my data:
country year score
Algeria 1980 -1.1201501
Algeria 1981 -1.0526943
Algeria 1982 -1.0561565
Algeria 1983 -1.1274560
Algeria 1984 -1.1353926
Algeria 1985 -1.1734330
Algeria 1986 -1.1327666
Algeria 1987 -1.1263586
Algeria 1988 -0.8529455
Algeria 1989 -0.2930265
Algeria 1990 -0.1564207
Algeria 1991 -0.1526328
Algeria 1992 -0.9757842
Algeria 1993 -0.9714060
Algeria 1994 -1.1422258
Algeria 1995 -0.3675797
...
The calculated mean values should be added to the df in an additional column ("mean"), i.e. same mean value for years of period 1, for those of period 2 etc.
This is how it should look like:
country year score mean
Algeria 1980 -1.1201501 -1.089
Algeria 1981 -1.0526943 -1.089
Algeria 1982 -1.0561565 -1.089
Algeria 1983 -1.1274560 -1.089
Algeria 1984 -1.1353926 -0.839
Algeria 1985 -1.1734330 -0.839
Algeria 1986 -1.1327666 -0.839
Algeria 1987 -1.1263586 -0.839
Algeria 1988 -0.8529455 -0.839
Algeria 1989 -0.2930265 -0.839
Algeria 1990 -0.1564207 -0.839
...
Every possible path I tried got easily super complicated - and I have to calculate the mean scores for different periods of time for over 90 countries ...
Many many thanks for your help!
datfrm$mean <-
with (datfrm, ave( score, findInterval(year, c(-Inf, 1984, 1991, Inf)), FUN= mean) )
The title question is a bit different than the real question and would be answered by using logical indexing. If one wanted only the mean for a particular subset say year >= 1984 & year <= 1990 it would be done via:
mn84_90 <- with(datfrm, mean(score[year >= 1984 & year <= 1990]) )
Since findInterval requires year to be sorted (as it is in your example) I'd be tempted to use cut in case it isn't sorted [proved wrong, thanks #DWin]. For completeness the data.table equivalent (scales for large data) is :
require(data.table)
DT = as.data.table(DF) # or just start with a data.table in the first place
DT[, mean:=mean(score), by=cut(year,c(-Inf,1984,1991,Inf))]
or findInterval is likely faster as DWin used :
DT[, mean:=mean(score), by=findInterval(year,c(-Inf,1984,1991,Inf))]
If the rows are ordered by year, I think the easiest way to accomplish this would be:
m80_83 <- mean(dataframe[1:4,3]) #Finds the mean of the values of column 3 for rows 1 through 4
m84_90 <- mean(dataframe[5:10,3])
#etc.
If the rows are not ordered by year, I would use tapply like this.
list.of.means <- c(tapply(dataframe$score, cut(dataframe$year, c(0,1983.5, 1990.5, 3000)), mean)
Here, tapply takes three parameters:
First, the data you want to do stuff with (in this case, datafram$score).
Second, a function that cuts that data up into groups. In this case, it will cut the data into three groups based on the dataframe$year values. Group 1 will include all rows with dataframe$year values from 0 to 1983.5, Group 2 will include all rows with dataframe$year values from 1983.5 to 1990.5, and Group 3 will include all rows with dataframe$year values from 1983.5 to 3000.
Third, a function that is applied to each group. This function will apply to the data you selected as your first parameter.
So, list.of.means should be a list of the 3 values you are looking for.

Resources