How to extract variables from a column - R

country continent year lifeExp
<fct> <fct> <int> <dbl>
1 Afghanistan Asia 1952 28.8
2 Afghanistan Asia 1957 30.3
3 Afghanistan Asia 1962 32.0
4 Afghanistan Asia 1967 34.0
5 Afghanistan Asia 1972 36.1
6 Afghanistan Asia 1977 38.4
7 Afghanistan Asia 1982 39.9
8 Afghanistan Asia 1987 40.8
9 Afghanistan Asia 1992 41.7
10 Afghanistan Asia 1997 41.8
I want to print the observations for Afghanistan where the corresponding value of lifeExp is 41, using dplyr and the tidyverse.
I have tried subsetting using $, pull, filter and select.

You were on the right track with filter(). Filtering for lifeExp == 41 prints no observations because no value of lifeExp is exactly equal to 41:
library(gapminder)
library(dplyr)
data(gapminder)
filter(gapminder, country == "Afghanistan" & lifeExp == 41)
#> # A tibble: 0 x 6
#> # ... with 6 variables: country <fct>, continent <fct>, year <int>,
#> # lifeExp <dbl>, pop <int>, gdpPercap <dbl>
You either need to specify a range, or round the values before filtering:
filter(gapminder, country == "Afghanistan" & lifeExp > 39 & lifeExp < 42)
#> # A tibble: 4 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1982 39.9 12881816 978.
#> 2 Afghanistan Asia 1987 40.8 13867957 852.
#> 3 Afghanistan Asia 1992 41.7 16317921 649.
#> 4 Afghanistan Asia 1997 41.8 22227415 635.
gapminder %>%
  mutate(lifeExp = round(lifeExp)) %>%
  filter(country == "Afghanistan" & lifeExp == 41)
#> # A tibble: 1 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1987 41 13867957 852.
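If the rounded column is not wanted in the result, the same idea can be expressed inside filter() itself. The following is only a sketch: the small data frame afg is typed in by hand from the rounded values printed above (it is not the real gapminder data), and the two conditions are just two possible readings of "lifeExp is 41".

```r
library(dplyr)

# Hand-typed stand-in for the Afghanistan rows printed above (rounded values)
afg <- data.frame(
  country = "Afghanistan",
  year    = c(1982L, 1987L, 1992L, 1997L),
  lifeExp = c(39.9, 40.8, 41.7, 41.8)
)

# Reading 1: values whose integer part is 41, i.e. 41 <= lifeExp < 42
starts_with_41 <- filter(afg, country == "Afghanistan", floor(lifeExp) == 41)

# Reading 2: values within 0.5 of 41, without overwriting the lifeExp column
close_to_41 <- filter(afg, country == "Afghanistan", abs(lifeExp - 41) < 0.5)

starts_with_41
close_to_41
```

Which condition is appropriate depends on what "41" is meant to cover, so the tolerance should be chosen deliberately.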

Related

save the list objects created in forloop with different names in R

I would like to save each of the list objects created by the for loop as a separate dataset in the environment, with a proper name such as gapminder_Asia, gapminder_Europe, etc. Many thanks in advance.
library(gapminder)
cont <- unique(gapminder$continent)
df <- NULL
for(i in 1:(length(cont))) {
  temp <- gapminder[gapminder$continent == cont[i], ]
  colnames(temp) <- paste0(paste(cont[i]))
  df[[i]] <- temp
}
df
Expected output:
> unique(gapminder$continent)
[1] Asia Europe Africa Americas Oceania
head(gapminder_Asia)
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
Personally, I would prefer to keep the datasets inside a list, e.g. by using split. But if your desired result is to have separately named objects, you can create them via assign:
library(gapminder)
df <- split(gapminder, gapminder$continent)
for(i in names(df)) {
  assign(paste("gapminder", i, sep = "_"), df[[i]])
}
gapminder_Africa
#> # A tibble: 624 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Algeria Africa 1952 43.1 9279525 2449.
#> 2 Algeria Africa 1957 45.7 10270856 3014.
#> 3 Algeria Africa 1962 48.3 11000948 2551.
#> 4 Algeria Africa 1967 51.4 12760499 3247.
#> 5 Algeria Africa 1972 54.5 14760787 4183.
#> 6 Algeria Africa 1977 58.0 17152804 4910.
#> 7 Algeria Africa 1982 61.4 20033753 5745.
#> 8 Algeria Africa 1987 65.8 23254956 5681.
#> 9 Algeria Africa 1992 67.7 26298373 5023.
#> 10 Algeria Africa 1997 69.2 29072015 4797.
#> # … with 614 more rows
Created on 2021-10-16 by the reprex package (v2.0.1)
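To illustrate why keeping the pieces in a list can be convenient, here is a minimal sketch on a hand-typed toy data frame (the rows reuse a few 1952 values shown elsewhere on this page; the name toy is made up for the example). It also shows list2env(), a base R function that can create the named objects in one call instead of a loop.

```r
# Toy stand-in for gapminder, typed in by hand for illustration
toy <- data.frame(
  country   = c("Afghanistan", "Albania", "Algeria", "Angola", "Argentina"),
  continent = c("Asia", "Europe", "Africa", "Africa", "Americas"),
  lifeExp   = c(28.8, 55.2, 43.1, 30.0, 62.5)
)

# One data frame per continent, named after the continent
by_continent <- split(toy, toy$continent)

# Work on every piece at once, without creating separate objects
mean_life_exp <- sapply(by_continent, function(d) mean(d$lifeExp))

# If separate objects are still wanted: rename the list elements,
# then copy them into the current environment in a single call
names(by_continent) <- paste0("toy_", names(by_continent))
list2env(by_continent, envir = environment())

toy_Africa
```

With the list, any later step (summaries, plots, models) can be applied to all continents with sapply or lapply, whereas separately named objects have to be handled one by one.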

Compare Life Expectancy from an initial year 1952, and compare that expectancy to all further years, for all countries in R using Dplyr

In my R class, we are currently learning how to manipulate tibbles. I have a homework problem where I need to take the life expectancy from 1952 for a country and compare it to that country's life expectancy in every other year the tibble has data for. This should be done for all countries in the table, in one line using pipes.
Background: this table is called gap
I have used the line:
gap %>% group_by(year, lifeExp) %>% filter(year == 1952)
to get the lifeExp for all countries in 1952, but from there I have no idea how to pipe back into the table and compare those initial values to each country's other values. I know what all the basic dplyr functions do; I'm just having trouble seeing the bigger picture with all the pipes.
If this isn't enough to understand the problem, I will edit! Thank you for any kind of support!
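You can solve it with the help of mutate and match. Within each country group, match(1952, year) gives the position of the 1952 row, so lifeExp[match(1952, year)] is that country's 1952 life expectancy.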
library(dplyr)
gapminder::gapminder %>%
  group_by(country) %>%
  mutate(difference = lifeExp - lifeExp[match(1952, year)]) %>%
  ungroup() -> gap
gap
# country continent year lifeExp pop gdpPercap difference
# <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
# 1 Afghanistan Asia 1952 28.8 8425333 779. 0
# 2 Afghanistan Asia 1957 30.3 9240934 821. 1.53
# 3 Afghanistan Asia 1962 32.0 10267083 853. 3.20
# 4 Afghanistan Asia 1967 34.0 11537966 836. 5.22
# 5 Afghanistan Asia 1972 36.1 13079460 740. 7.29
# 6 Afghanistan Asia 1977 38.4 14880372 786. 9.64
# 7 Afghanistan Asia 1982 39.9 12881816 978. 11.1
# 8 Afghanistan Asia 1987 40.8 13867957 852. 12.0
# 9 Afghanistan Asia 1992 41.7 16317921 649. 12.9
#10 Afghanistan Asia 1997 41.8 22227415 635. 13.0
# … with 1,694 more rows
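The same grouped comparison can also be written with a logical index, lifeExp[year == 1952]. This is only a sketch on a hand-typed toy table (the values are the first three Afghanistan and Argentina rows printed on this page), and it assumes every country has exactly one 1952 row; match() is the safer choice when that is not guaranteed, because it always returns a single position (or NA).

```r
library(dplyr)

# Hand-typed toy table: first three years for two countries
toy <- data.frame(
  country = rep(c("Afghanistan", "Argentina"), each = 3),
  year    = rep(c(1952L, 1957L, 1962L), times = 2),
  lifeExp = c(28.8, 30.3, 32.0, 62.5, 64.4, 65.1)
)

result <- toy %>%
  group_by(country) %>%
  # Inside mutate(), year and lifeExp refer to the current country's rows only
  mutate(difference = lifeExp - lifeExp[year == 1952]) %>%
  ungroup()

result
```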

Using geom_boxplot yields different result than base boxplot()

I'm using the gapminder dataset to practice some basic data analysis on the data frame.
I want to create a subset of this data with only Argentina and New Zealand, in order to compare their values.
install.packages("gapminder")
library(gapminder)
data("gapminder")
> gapminder
# A tibble: 1,704 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ... with 1,694 more rows
I'm subsetting the information I want like so :
df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
> df
# A tibble: 24 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Argentina Americas 1952 62.5 17876956 5911.
2 Argentina Americas 1957 64.4 19610538 6857.
3 Argentina Americas 1962 65.1 21283783 7133.
4 Argentina Americas 1967 65.6 22934225 8053.
5 Argentina Americas 1972 67.1 24779799 9443.
6 Argentina Americas 1977 68.5 26983828 10079.
7 Argentina Americas 1982 69.9 29341374 8998.
8 Argentina Americas 1987 70.8 31620918 9140.
9 Argentina Americas 1992 71.9 33958947 9308.
10 Argentina Americas 1997 73.3 36203463 10967.
# ... with 14 more rows
This works great as you can see (or that's what it seems)
Now I would like to create a simple boxplot to quickly analyze some values, but when I plot this with boxplot() and geom_boxplot I get two different results:
boxplot(lifeExp ~ country)
This is what I want, but the x axis is also taking into account all the other countries I did not select. Clearly their data is null but it makes the plot unreadable.
If I instead plot the same data with ggplot, it works perfectly:
ggplot(data = df, mapping = aes(x=country, y=lifeExp)) + geom_boxplot()
Am I doing something wrong when defining the subset? Using boxplot() gives me the impression that the subset keeps everything but sets the values for the countries I don't want to NULL.
Start with the code posted in the question.
library(gapminder)
data("gapminder")
df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
boxplot(lifeExp ~ country, df)
The plot shows space for all countries because country is a factor and subsetting keeps its original levels. str() shows the structure of df:
str(df)
#tibble [24 × 6] (S3: tbl_df/tbl/data.frame)
# $ country : Factor w/ 142 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ continent: Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ year : int [1:24] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp : num [1:24] 62.5 64.4 65.1 65.6 67.1 ...
# $ pop : int [1:24] 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463 ...
# $ gdpPercap: num [1:24] 5911 6857 7133 8053 9443 ...
The factor country still has 142 levels, although only two of them occur in df.
The solution is to drop the unused levels.
df2 <- df
df2$country <- droplevels(df2$country)
boxplot(lifeExp ~ country, df2)
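droplevels() also accepts a whole data frame, in which case it drops unused levels from every factor column at once. A minimal sketch on a hand-typed toy data frame (not the real gapminder data):

```r
# Toy data with a three-level factor, typed in by hand
toy <- data.frame(
  country = factor(c("Afghanistan", "Afghanistan", "Algeria",
                     "Argentina", "Argentina")),
  lifeExp = c(28.8, 30.3, 43.1, 62.5, 64.4)
)

# Subsetting keeps all original factor levels
sub <- subset(toy, country == "Argentina")
nlevels(sub$country)

# Applying droplevels() to the data frame removes the unused ones
sub_clean <- droplevels(sub)
nlevels(sub_clean$country)
levels(sub_clean$country)
```

After this step, boxplot(lifeExp ~ country, sub_clean) only has the remaining levels to draw.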

Transforming a variable using the if- else if function

I have a data set for which I want to calculate z-scores by year.
Example:
Year Score
1999 120
1999 132
1998 120
1997 132
2000 120
2002 132
1998 160
1997 142
....etc
What I want is:
Year Score Z-Score
1999 120 1.2
1999 132 .01
1998 120 -.6
1997 132 1.1
2000 120 -.6
2002 132 0.5
1998 160 2.1
1997 142 .01
I have used the following code:
DF$ZScore<-if (DR$Year== 1997){
((DF$Score-220)/20)
} else if ((DR$Year== 1998){
((DF$Score-222)/19)
}...
}else{
((DF$Score-219)/21)
}
This is not working and I cannot figure out why. Any help is appreciated.
I'm using the gapminder data for simplicity, and also the built-in scale function. You might want to write your own function instead, depending on exactly how you want to scale the values.
This is a little clunky, but because you want per-year scaling, you could group by year and make a nested data frame.
Then, using purrr, you could go into each data frame within a year and scale the variables you want.
Then you would unnest the data again, and the variables would be scaled within each year.
library(tidyverse)
library(gapminder)
gapminder::gapminder %>%
  group_by(year) %>%
  nest() %>%
  mutate(data = map(data,
                    ~ mutate_at(.x, vars(lifeExp, pop),
                                list(scale = scale)))) %>%
  unnest(data)
#> # A tibble: 1,704 x 8
#> # Groups: year [12]
#> year country continent lifeExp pop gdpPercap lifeExp_scale[,…
#> <int> <fct> <fct> <dbl> <int> <dbl> <dbl>
#> 1 1952 Afghan… Asia 28.8 8.43e6 779. -1.66
#> 2 1952 Albania Europe 55.2 1.28e6 1601. 0.505
#> 3 1952 Algeria Africa 43.1 9.28e6 2449. -0.489
#> 4 1952 Angola Africa 30.0 4.23e6 3521. -1.56
#> 5 1952 Argent… Americas 62.5 1.79e7 5911. 1.10
#> 6 1952 Austra… Oceania 69.1 8.69e6 10040. 1.64
#> 7 1952 Austria Europe 66.8 6.93e6 6137. 1.45
#> 8 1952 Bahrain Asia 50.9 1.20e5 9867. 0.154
#> 9 1952 Bangla… Asia 37.5 4.69e7 684. -0.947
#> 10 1952 Belgium Europe 68 8.73e6 8343. 1.55
#> # … with 1,694 more rows, and 1 more variable: pop_scale[,1] <dbl>
Created on 2020-06-25 by the reprex package (v0.3.0)
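A shorter variant that may be enough here is a grouped mutate without nesting. The sketch below uses the Year/Score rows from the question and computes each year's mean and standard deviation from the data itself, which is an assumption: the question's code used fixed per-year constants instead. As for why the original attempt fails, if () takes a single TRUE/FALSE condition, so it cannot be used to test a whole column at once.

```r
library(dplyr)

# The example rows from the question
scores <- data.frame(
  Year  = c(1999, 1999, 1998, 1997, 2000, 2002, 1998, 1997),
  Score = c(120, 132, 120, 132, 120, 132, 160, 142)
)

z <- scores %>%
  group_by(Year) %>%
  # mean() and sd() are evaluated separately within each year
  mutate(ZScore = (Score - mean(Score)) / sd(Score)) %>%
  ungroup()

z
```

Years with a single observation (2000 and 2002 in this sample) get NA, because the standard deviation of one value is undefined.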

Question with Joining Database code issue

When I join two tables without specifying the key, it works perfectly. But when I provide the key, it gives me weird results.
Please help me understand what I am missing.
library(gapminder)
library(dplyr)  # provides left_join()
A <- gapminder[gapminder$country=="India" & gapminder$year %in% 1952:1987, 1:4]
B <- gapminder[gapminder$country=="India" & gapminder$year %in% 1977:2007, c(1:3, 5, 6)]
left_join(A, B)
left_join(A, B, by = "country")
Without the key, I get:
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 India Asia 1952 37.4 NA NA
2 India Asia 1957 40.2 NA NA
3 India Asia 1962 43.6 NA NA
4 India Asia 1967 47.2 NA NA
5 India Asia 1972 50.7 NA NA
6 India Asia 1977 54.2 634000000 813.
7 India Asia 1982 56.6 708000000 856.
8 India Asia 1987 58.6 788000000 977.
But, when I use the Key, it gives me some 56 rows:
# A tibble: 56 x 7
country continent year.x lifeExp year.y pop
<fct> <fct> <int> <dbl> <int> <dbl>
1 India Asia 1952 37.4 1977 6.34e8
2 India Asia 1952 37.4 1982 7.08e8
3 India Asia 1952 37.4 1987 7.88e8
4 India Asia 1952 37.4 1992 8.72e8
5 India Asia 1952 37.4 1997 9.59e8
6 India Asia 1952 37.4 2002 1.03e9
7 India Asia 1952 37.4 2007 1.11e9
8 India Asia 1957 40.2 1977 6.34e8
9 India Asia 1957 40.2 1982 7.08e8
10 India Asia 1957 40.2 1987 7.88e8
# ... with 46 more rows, and 1 more variable:
# gdpPercap <dbl>
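This is effectively a Cartesian product (cross join) of the rows.
With country as the only key, every row of A is matched with every row of B that has the same country, so the 8 rows of A multiplied by the 7 rows of B give 56 rows. It is a multiplication of the rows rather than a straight intersect. Without by, left_join() joins on all columns the two tables have in common, which here include year, so each row of A matches at most one row of B.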
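The effect can be reproduced on a small scale. The sketch below uses two hand-typed tables (three rows each, with values taken from the output shown in the question) rather than the real gapminder subsets, and compares a join on both country and year with a join on country alone.

```r
library(dplyr)

# Hand-typed miniature versions of A and B from the question
A <- data.frame(
  country = "India",
  year    = c(1972L, 1977L, 1982L),
  lifeExp = c(50.7, 54.2, 56.6)
)
B <- data.frame(
  country = "India",
  year    = c(1977L, 1982L, 1987L),
  pop     = c(634000000, 708000000, 788000000)
)

# Key identifies one row per table: 3 rows, NA where B has no matching year
by_both <- left_join(A, B, by = c("country", "year"))

# Key is the same in every row: each row of A pairs with each row of B,
# giving 3 * 3 = 9 rows and the columns year.x and year.y
by_country <- left_join(A, B, by = "country")

nrow(by_both)
nrow(by_country)
```

Recent dplyr versions may print a warning about a many-to-many relationship for the second join, which is a useful hint that the key does not uniquely identify rows.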
