How to get first values from different columns at the same time? - r

I'm trying to get the first values from different columns to build a data frame, but I got stuck at one point and don't know how to solve it. Imagine you're using gapminder and want to get the three highest gdpPercap values for each region/year. How would you do it with dplyr?
Thanks.

I'm inferring that region is continent; if it were country, then this filter would return all rows, since each country/year combination occurs only once (so "top 3" means nothing special).
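library(dplyr)
# Caution: desc() reverses the ordering, so slice_max(desc(gdpPercap), n = 3)
# keeps the three LOWEST gdpPercap values per group (see the output below).
# For the three highest, use slice_max(gdpPercap, n = 3) instead.
gapminder::gapminder %>%
  group_by(continent, year) %>%
  slice_max(desc(gdpPercap), n = 3) %>%
  ungroup()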
# # A tibble: 168 x 6
# country continent year lifeExp pop gdpPercap
# <fct> <fct> <int> <dbl> <int> <dbl>
# 1 Lesotho Africa 1952 42.1 748747 299.
# 2 Guinea-Bissau Africa 1952 32.5 580653 300.
# 3 Eritrea Africa 1952 35.9 1438760 329.
# 4 Lesotho Africa 1957 45.0 813338 336.
# 5 Eritrea Africa 1957 38.0 1542611 344.
# 6 Ethiopia Africa 1957 36.7 22815614 379.
# 7 Burundi Africa 1962 42.0 2961915 355.
# 8 Eritrea Africa 1962 40.2 1666618 381.
# 9 Lesotho Africa 1962 47.7 893143 412.
# 10 Burundi Africa 1967 43.5 3330989 413.
# # ... with 158 more rows
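For the three highest values per group that the question asks for, the same selection can be sketched in base R. The data frame toy below is made up for illustration and only stands in for gapminder; the idea is to sort within groups by descending gdpPercap and keep the first three rows of each group. Unlike slice_max with its default tie handling, this sketch keeps exactly three rows even when values are tied.

```r
# Toy data with made-up values, standing in for gapminder
toy <- data.frame(
  continent = c("A", "A", "A", "A", "B", "B"),
  year      = c(1952, 1952, 1952, 1952, 1952, 1952),
  country   = c("a1", "a2", "a3", "a4", "b1", "b2"),
  gdpPercap = c(100, 400, 300, 200, 50, 70),
  stringsAsFactors = FALSE
)

# Sort by group, then by descending gdpPercap within each group
sorted <- toy[order(toy$continent, toy$year, -toy$gdpPercap), ]

# Number the rows 1, 2, 3, ... inside each continent/year group
rank_in_group <- ave(sorted$gdpPercap, sorted$continent, sorted$year,
                     FUN = seq_along)

# Keep at most three rows per group (group B only has two)
top3 <- sorted[rank_in_group <= 3, ]
top3
```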


Name the list after looping in R

The for loop below creates a separate dataset for each continent. I would like to rename each [[i]] element to the corresponding value of the vector, i.e. the name of the i-th continent. Many thanks in advance.
library(gapminder)
cont <- unique(gapminder$continent)
df <- NULL
for(i in 1:(length(cont))) {
temp <- gapminder[gapminder$continent == cont[i], ]
colnames(temp) <- paste0(paste(cont[i]))
df[[i]] <- temp
}
df
Expected answer:
[[5]] -> here I would like to see Oceania
# A tibble: 24 x 6
Oceania `` `` `` `` ``
<fct> <fct> <int> <dbl> <int> <dbl>
1 Australia Oceania 1952 69.1 8691212 10040.
2 Australia Oceania 1957 70.3 9712569 10950.
3 Australia Oceania 1962 70.9 10794968 12217.
4 Australia Oceania 1967 71.1 11872264 14526.
5 Australia Oceania 1972 71.9 13177000 16789.
6 Australia Oceania 1977 73.5 14074100 18334.
7 Australia Oceania 1982 74.7 15184200 19477.
8 Australia Oceania 1987 76.3 16257249 21889.
9 Australia Oceania 1992 77.6 17481977 23425.
10 Australia Oceania 1997 78.8 18565243 26998.
# ... with 14 more rows
Use split instead; it returns a list whose elements are named after the factor levels.
res <- split(gapminder, gapminder$continent)
names(res)
# [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
res$Africa
# # A tibble: 624 × 6
# country continent year lifeExp pop gdpPercap
# <fct> <fct> <int> <dbl> <int> <dbl>
# 1 Algeria Africa 1952 43.1 9279525 2449.
# 2 Algeria Africa 1957 45.7 10270856 3014.
# 3 Algeria Africa 1962 48.3 11000948 2551.
# 4 Algeria Africa 1967 51.4 12760499 3247.
# 5 Algeria Africa 1972 54.5 14760787 4183.
# 6 Algeria Africa 1977 58.0 17152804 4910.
# 7 Algeria Africa 1982 61.4 20033753 5745.
# 8 Algeria Africa 1987 65.8 23254956 5681.
# 9 Algeria Africa 1992 67.7 26298373 5023.
# 10 Algeria Africa 1997 69.2 29072015 4797.
# # … with 614 more rows
R is a vectorised language, so no loop is needed here; split does the same job in one call:
# split() returns a list of data.frames named after the levels of continent,
# so there is no need to allocate the list beforehand.
gpm_list <- with(gapminder, split(gapminder, continent))
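If the loop is kept anyway, the list elements can be named after the loop has run. This is a minimal sketch on a made-up data frame (toy, pieces and by_split are illustrative names, not from the question); it also shows that split gives the same names directly.

```r
# Toy data with made-up values, standing in for gapminder
toy <- data.frame(
  continent = factor(c("Asia", "Europe", "Asia", "Oceania")),
  value     = 1:4
)

cont <- unique(toy$continent)

# Fill the list by position, as in the question
pieces <- vector("list", length(cont))
for (i in seq_along(cont)) {
  pieces[[i]] <- toy[toy$continent == cont[i], ]
}

# Name the elements in one step once the loop is done
names(pieces) <- as.character(cont)
names(pieces)

# split() produces the named list without a loop
by_split <- split(toy, toy$continent)
names(by_split)
```

Leaving the column names untouched (instead of overwriting them with the continent name, as the loop in the question does) keeps each piece usable as a regular data frame.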

Compare life expectancy in an initial year (1952) with all later years, for all countries, in R using dplyr

In my R class, we are currently learning how to manipulate tibbles. I have a homework problem where I need to take each country's life expectancy from 1952 and compare it with that country's life expectancy in every other year the tibble has data for. This should be done for all countries in the table, in one line using pipes.
Background: this table is called gap
I have used the line:
gap %>% group_by(year, lifeExp) %>% filter(year == 1952)
To filter out the lifeExp for all countries during 1952, but from there I have no idea how to pipe back into the table and compare those initial values to the other specific country values. I know what all the basic dplyr functions do, just having trouble seeing the bigger picture with all the pipes.
If this wasn't enough to understand, I will edit! Thank you for any kind of support!
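You can solve it with mutate and match. Within each country, match(1952, year) gives the position of the 1952 row, so lifeExp[match(1952, year)] is that country's baseline value.
library(dplyr)
gap <- gapminder::gapminder %>%
  group_by(country) %>%
  mutate(difference = lifeExp - lifeExp[match(1952, year)]) %>%
  ungroup()
gap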
# country continent year lifeExp pop gdpPercap difference
# <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
# 1 Afghanistan Asia 1952 28.8 8425333 779. 0
# 2 Afghanistan Asia 1957 30.3 9240934 821. 1.53
# 3 Afghanistan Asia 1962 32.0 10267083 853. 3.20
# 4 Afghanistan Asia 1967 34.0 11537966 836. 5.22
# 5 Afghanistan Asia 1972 36.1 13079460 740. 7.29
# 6 Afghanistan Asia 1977 38.4 14880372 786. 9.64
# 7 Afghanistan Asia 1982 39.9 12881816 978. 11.1
# 8 Afghanistan Asia 1987 40.8 13867957 852. 12.0
# 9 Afghanistan Asia 1992 41.7 16317921 649. 12.9
#10 Afghanistan Asia 1997 41.8 22227415 635. 13.0
# … with 1,694 more rows
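To see what match contributes, here is a small base R sketch of the same baseline subtraction. The data frame toy and its values are invented for illustration; the lookup vector baseline plays the role that lifeExp[match(1952, year)] plays inside each group.

```r
# Toy data with made-up values, standing in for gapminder
toy <- data.frame(
  country = c("X", "X", "X", "Y", "Y", "Y"),
  year    = c(1952, 1957, 1962, 1952, 1957, 1962),
  lifeExp = c(30, 32, 35, 60, 61, 63),
  stringsAsFactors = FALSE
)

# match() returns the position of the first element equal to 1952
pos <- match(1952, c(1957, 1952, 1962))
pos

# One 1952 value per country, named by country
is_base  <- toy$year == 1952
baseline <- setNames(toy$lifeExp[is_base], toy$country[is_base])

# Subtract each country's own baseline from all of its rows
toy$difference <- toy$lifeExp - unname(baseline[toy$country])
toy
```

If a country had no 1952 row, match would return NA in the grouped version and the whole difference column for that country would be NA, which is usually the behaviour you want.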

Using geom_boxplot yields different result than base boxplot()

I'm using the gapminder dataset to practice some basic data analysis on the data frame.
I want to create a subset of this data with only Argentina and New Zealand, in order to compare their values.
install.packages("gapminder")
library(gapminder)
data("gapminder")
> gapminder
# A tibble: 1,704 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ... with 1,694 more rows
I'm subsetting the information I want like so:
df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
> df
# A tibble: 24 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Argentina Americas 1952 62.5 17876956 5911.
2 Argentina Americas 1957 64.4 19610538 6857.
3 Argentina Americas 1962 65.1 21283783 7133.
4 Argentina Americas 1967 65.6 22934225 8053.
5 Argentina Americas 1972 67.1 24779799 9443.
6 Argentina Americas 1977 68.5 26983828 10079.
7 Argentina Americas 1982 69.9 29341374 8998.
8 Argentina Americas 1987 70.8 31620918 9140.
9 Argentina Americas 1992 71.9 33958947 9308.
10 Argentina Americas 1997 73.3 36203463 10967.
# ... with 14 more rows
This seems to work, as you can see.
Now I would like to create a simple boxplot to quickly inspect some values, but boxplot() and geom_boxplot() give me two different results:
boxplot(lifeExp ~ country)
This is what I want, but the x axis also includes all the other countries I did not select. Their data is clearly empty, but it makes the plot unreadable.
If I instead use the same data with ggplot, it works perfectly:
ggplot(data = df, mapping = aes(x=country, y=lifeExp)) + geom_boxplot()
Am I doing something wrong when defining the subset? boxplot() gives me the impression that the subset keeps everything but sets the values I don't want to NULL.
Start with the code posted in the question.
library(gapminder)
data("gapminder")
df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
boxplot(lifeExp ~ country, df)
The plot reserves space for all countries because country is a factor and subsetting keeps its original levels. str shows the structure of df:
str(df)
#tibble [24 × 6] (S3: tbl_df/tbl/data.frame)
# $ country : Factor w/ 142 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ continent: Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ year : int [1:24] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp : num [1:24] 62.5 64.4 65.1 65.6 67.1 ...
# $ pop : int [1:24] 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463 ...
# $ gdpPercap: num [1:24] 5911 6857 7133 8053 9443 ...
The factor country still has 142 levels, although only two of them occur in df.
The solution is to drop the unused levels.
df2 <- df
df2$country <- droplevels(df2$country)
boxplot(lifeExp ~ country, df2)
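The effect is easy to reproduce without gapminder. The following sketch uses a small made-up factor (country, kept and trimmed are illustrative names) to show that subsetting keeps unused levels and that droplevels removes them.

```r
# A small factor with three levels
country <- factor(c("Argentina", "Brazil", "New Zealand", "Argentina"))

# Subsetting removes the Brazil values but not the Brazil level
kept <- country[country != "Brazil"]
levels(kept)
table(kept)   # Brazil is still listed, with a count of 0

# droplevels() discards levels that no longer occur
trimmed <- droplevels(kept)
levels(trimmed)
```

Base boxplot draws one slot per factor level, which is why the empty levels show up on the x axis until they are dropped.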

R: How can I calculate averages for each nth interval in a data frame?

I'm trying to find the average for a column for each 5 year interval by a group using tidyverse functions (namely dplyr and/or tidyr).
So for example, if I was using the existing gapminder data in R, how would I be able to calculate average life expectancy for each 5 year interval for each continent?
I can try something like this, but it doesn't give me exactly what I want because I'm not sure how to include the 5 year intervals in the code:
library(gapminder)
gapminder <- gapminder
gapminder.avglife <- gapminder %>% group_by(continent) %>%
summarize(lifeavg = mean(lifeExp))
Create a grouping column in group_by that bins the years into 5-year intervals, then calculate the mean of lifeExp for each bin.
library(gapminder)
library(dplyr)
gapminder %>%
  group_by(continent, year = ceiling(year/5) * 5) %>%
  summarize(year = paste(first(year) - 5, first(year), sep = '-'),
            lifeavg = mean(lifeExp)) %>%
  ungroup
# continent year lifeavg
# <fct> <chr> <dbl>
# 1 Africa 1950-1955 39.1
# 2 Africa 1955-1960 41.3
# 3 Africa 1960-1965 43.3
# 4 Africa 1965-1970 45.3
# 5 Africa 1970-1975 47.5
# 6 Africa 1975-1980 49.6
# 7 Africa 1980-1985 51.6
# 8 Africa 1985-1990 53.3
# 9 Africa 1990-1995 53.6
#10 Africa 1995-2000 53.6
# … with 50 more rows
My answer would go like this:
gapminder %>%
  group_by(continent) %>%
  mutate(FiveYrInterval = ((year - min(year)) %/% 5) + 1) %>%
  group_by(continent, FiveYrInterval) %>%
  summarise(mean(lifeExp))
# A tibble: 60 x 3
# Groups: continent [5]
continent FiveYrInterval `mean(lifeExp)`
<fct> <dbl> <dbl>
1 Africa 1 39.1
2 Africa 2 41.3
3 Africa 3 43.3
4 Africa 4 45.3
5 Africa 5 47.5
6 Africa 6 49.6
7 Africa 7 51.6
8 Africa 8 53.3
9 Africa 9 53.6
10 Africa 10 53.6
# ... with 50 more rows
Indeed, the answer by Ronak is far better.
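The binning arithmetic in the two answers above can be checked on the survey years alone. This sketch assumes the years run from 1952 to 2007 in steps of five, as in gapminder; upper, label and index are illustrative names.

```r
# The twelve survey years in gapminder
years <- seq(1952, 2007, by = 5)

# First approach: round each year up to the next multiple of 5
upper <- ceiling(years / 5) * 5
label <- paste(upper - 5, upper, sep = "-")

# Second approach: integer-divide the offset from the first year
index <- (years - min(years)) %/% 5 + 1

data.frame(years, label, index)
```

Both approaches put every survey year into its own bin here, which is why both results have 60 rows (5 continents times 12 bins).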
You can try using cut_interval from ggplot2 to get 5-year intervals for each continent:
gapminder %>%
  mutate(interval = cut_interval(year,
                                 n = (max(year) - min(year)) / 5)) %>%
  group_by(continent, interval) %>%
  summarise(avg = mean(lifeExp))
# A tibble: 55 x 3
# Groups: continent [5]
continent interval avg
<fct> <fct> <dbl>
1 Africa [1952,1957] 40.2
2 Africa (1957,1962] 43.3
3 Africa (1962,1967] 45.3
4 Africa (1967,1972] 47.5
5 Africa (1972,1977] 49.6
6 Africa (1977,1982] 51.6
7 Africa (1982,1987] 53.3
8 Africa (1987,1992] 53.6
9 Africa (1992,1997] 53.6
10 Africa (1997,2002] 53.3
# ... with 45 more rows
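This result has 55 rows rather than 60 because the first interval, [1952,1957], is closed on both ends and therefore holds two survey years. The base R sketch below reproduces that with cut; it assumes cut_interval builds equal-width breaks and includes the lowest value, which is consistent with the output shown above.

```r
# The twelve survey years in gapminder
years <- seq(1952, 2007, by = 5)

# 11 equal-width intervals need 12 break points
breaks <- seq(min(years), max(years), length.out = 12)

# Intervals are closed on the right; include.lowest closes the first on the left too
bins <- cut(years, breaks = breaks, include.lowest = TRUE)

# The first bin holds 1952 and 1957, every other bin holds one year
table(bins)
```

Whether merging the first two survey years is acceptable depends on the analysis; the arithmetic approaches keep them separate.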
Try using cut2 from the Hmisc package:
library(Hmisc)
gapminder %>%
  mutate(interval = cut2(year, seq(1952, 2007, 5))) %>%
  group_by(continent, interval) %>%
  summarise(avg = mean(lifeExp))
# A tibble: 55 x 3
# Groups: continent [5]
continent interval avg
<fct> <fct> <dbl>
1 Africa 1952 39.1
2 Africa 1957 41.3
3 Africa 1962 43.3
4 Africa 1967 45.3
5 Africa 1972 47.5
6 Africa 1977 49.6
7 Africa 1982 51.6
8 Africa 1987 53.3
9 Africa 1992 53.6
10 Africa 1997 53.6
# ... with 45 more rows

Transforming a variable using the if- else if function

I have a data set for which I want to calculate z-scores by year.
Example:
Year Score
1999 120
1999 132
1998 120
1997 132
2000 120
2002 132
1998 160
1997 142
....etc
What I want is:
Year Score Z-Score
1999 120 1.2
1999 132 .01
1998 120 -.6
1997 132 1.1
2000 120 -.6
2002 132 0.5
1998 160 2.1
1997 142 .01
I have used the following code:
DF$ZScore<-if (DR$Year== 1997){
((DF$Score-220)/20)
} else if ((DR$Year== 1998){
((DF$Score-222)/19)
}...
}else{
((DF$Score-219)/21)
}
This is not working and I cannot figure out why. Any help is appreciated.
I'm using the gapminder data for simplicity, along with the built-in scale function. You might want to write your own function instead, depending on exactly how you want to scale.
This is a little clunky, but because you want per-year scaling, you can group by year and build a nested data frame.
Then, using purrr, you can go into the data frame for each year and scale the variables you want.
Finally, unnest the data again, and the variables will be scaled within each year.
library(tidyverse)
library(gapminder)
gapminder::gapminder %>%
  group_by(year) %>%
  nest() %>%
  mutate(data = map(data,
                    ~ mutate_at(.x, vars(lifeExp, pop),
                                list(scale = scale)))) %>%
  unnest(data)
#> # A tibble: 1,704 x 8
#> # Groups: year [12]
#> year country continent lifeExp pop gdpPercap lifeExp_scale[,…
#> <int> <fct> <fct> <dbl> <int> <dbl> <dbl>
#> 1 1952 Afghan… Asia 28.8 8.43e6 779. -1.66
#> 2 1952 Albania Europe 55.2 1.28e6 1601. 0.505
#> 3 1952 Algeria Africa 43.1 9.28e6 2449. -0.489
#> 4 1952 Angola Africa 30.0 4.23e6 3521. -1.56
#> 5 1952 Argent… Americas 62.5 1.79e7 5911. 1.10
#> 6 1952 Austra… Oceania 69.1 8.69e6 10040. 1.64
#> 7 1952 Austria Europe 66.8 6.93e6 6137. 1.45
#> 8 1952 Bahrain Asia 50.9 1.20e5 9867. 0.154
#> 9 1952 Bangla… Asia 37.5 4.69e7 684. -0.947
#> 10 1952 Belgium Europe 68 8.73e6 8343. 1.55
#> # … with 1,694 more rows, and 1 more variable: pop_scale[,1] <dbl>
Created on 2020-06-25 by the reprex package (v0.3.0)
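As for why the code in the question fails: if expects a single TRUE or FALSE, not a whole column, so it cannot pick a different formula per row (the code also mixes DF and DR and has unbalanced parentheses). A vectorised function such as ifelse, or a grouped computation, avoids this. Below is a minimal base R sketch of the grouped version using the sample rows from the question; the helper z is an illustrative name, and it standardises with each year's own mean and standard deviation rather than the fixed constants in the question.

```r
# Sample rows from the question
DF <- data.frame(
  Year  = c(1999, 1999, 1998, 1997, 2000, 2002, 1998, 1997),
  Score = c(120, 132, 120, 132, 120, 132, 160, 142)
)

# z-score of a vector relative to its own mean and standard deviation
z <- function(x) (x - mean(x)) / sd(x)

# ave() applies z() within each year and returns values in the original row order
DF$ZScore <- ave(DF$Score, DF$Year, FUN = z)
DF
```

Years with a single row (2000 and 2002 in this sample) get NA, because the standard deviation of one value is undefined.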
