When I try to join two tables without the KEY, it works perfectly. But when I am providing the Key, it is giving me weird results:
Pls. help me understand what am I missing out.
library(gapminder)
A <- gapminder[gapminder$country=="India" & gapminder$year %in% 1952:1987, 1:4]
B <- gapminder[gapminder$country=="India" & gapminder$year %in% 1977:2007, c(1:3, 5, 6)]
left_join(A, B)
left_join(A, B, by = "country")
For without key: I am getting
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 India Asia 1952 37.4 NA NA
2 India Asia 1957 40.2 NA NA
3 India Asia 1962 43.6 NA NA
4 India Asia 1967 47.2 NA NA
5 India Asia 1972 50.7 NA NA
6 India Asia 1977 54.2 634000000 813.
7 India Asia 1982 56.6 708000000 856.
8 India Asia 1987 58.6 788000000 977.
But, when I use the Key, it gives me some 56 rows:
# A tibble: 56 x 7
country continent year.x lifeExp year.y pop
<fct> <fct> <int> <dbl> <int> <dbl>
1 India Asia 1952 37.4 1977 6.34e8
2 India Asia 1952 37.4 1982 7.08e8
3 India Asia 1952 37.4 1987 7.88e8
4 India Asia 1952 37.4 1992 8.72e8
5 India Asia 1952 37.4 1997 9.59e8
6 India Asia 1952 37.4 2002 1.03e9
7 India Asia 1952 37.4 2007 1.11e9
8 India Asia 1957 40.2 1977 6.34e8
9 India Asia 1957 40.2 1982 7.08e8
10 India Asia 1957 40.2 1987 7.88e8
# ... with 46 more rows, and 1 more variable:
# gdpPercap <dbl>
Its called a Cartesian Product / Cross-Join
Cross Joins
Its basically a multiplication of the rows, rather than a straight intersect.
Related
Good afternoon,
Is there any way, to create a XLS document from printing data on the console?
I am using the package gapminder, and I will like to know if there is a way, when i print some information on the screen to select the first rows using head for example.
> head(gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
I will like to create a XLS file with this content.
Thank you in advanced.
Im trying to create a XLS document with the content from the output of a function.
I would like to save all the list objects which are created by for loop as different datasets in the environment with their proper name for example gapminder_Asia, gapminder_Europe,..etc. Many thanks in advance.
library(gapminder)
cont <- unique(gapminder$continent)
df <- NULL
for(i in 1:(length(cont))) {
temp <- gapminder[gapminder$continent == cont[i], ]
colnames(temp) <- paste0(paste(cont[i]))
df[[i]] <- temp
}
df
Expected Answer,
> unique(gapminder$continent)
[1] Asia Europe Africa Americas Oceania
head(gapminder_Asia)
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
Personally I would prefer to keep the dataset inside a list using e.g. split but if your desired result is to have different named objects then you could do so via assign:
library(gapminder)
df <- split(gapminder, gapminder$continent)
for(i in names(df)) {
assign(paste("gapminder", i, sep = "_"), df[[i]])
}
gapminder_Africa
#> # A tibble: 624 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Algeria Africa 1952 43.1 9279525 2449.
#> 2 Algeria Africa 1957 45.7 10270856 3014.
#> 3 Algeria Africa 1962 48.3 11000948 2551.
#> 4 Algeria Africa 1967 51.4 12760499 3247.
#> 5 Algeria Africa 1972 54.5 14760787 4183.
#> 6 Algeria Africa 1977 58.0 17152804 4910.
#> 7 Algeria Africa 1982 61.4 20033753 5745.
#> 8 Algeria Africa 1987 65.8 23254956 5681.
#> 9 Algeria Africa 1992 67.7 26298373 5023.
#> 10 Algeria Africa 1997 69.2 29072015 4797.
#> # … with 614 more rows
Created on 2021-10-16 by the reprex package (v2.0.1)
In my R class, we are currently learning how to manipulate Tibbles. I have a homework problem where I need to grab the life expectancy from 1952 for a country and compare it to all its other expectancies for however many years of data the tibble has. For all countries within the table, in one line using pipes.
Background: this table is called gap
I have used the line:
gap %>% group_by(year, lifeExp) %>% filter(year == 1952)
To filter out the lifeExp for all countries during 1952, but from there I have no idea how to pipe back into the table and compare those initial values to the other specific country values. I know what all the basic dplyr functions do, just having trouble seeing the bigger picture with all the pipes.
If this wasn't enough to understand, I will edit! Thank you for any kind of support!
You can solve it with the help of mutate and match.
library(dplyr)
gapminder::gapminder %>%
group_by(country) %>%
mutate(difference = lifeExp - lifeExp[match(1952, year)]) %>%
ungroup -> gap
gap
# country continent year lifeExp pop gdpPercap difference
# <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
# 1 Afghanistan Asia 1952 28.8 8425333 779. 0
# 2 Afghanistan Asia 1957 30.3 9240934 821. 1.53
# 3 Afghanistan Asia 1962 32.0 10267083 853. 3.20
# 4 Afghanistan Asia 1967 34.0 11537966 836. 5.22
# 5 Afghanistan Asia 1972 36.1 13079460 740. 7.29
# 6 Afghanistan Asia 1977 38.4 14880372 786. 9.64
# 7 Afghanistan Asia 1982 39.9 12881816 978. 11.1
# 8 Afghanistan Asia 1987 40.8 13867957 852. 12.0
# 9 Afghanistan Asia 1992 41.7 16317921 649. 12.9
#10 Afghanistan Asia 1997 41.8 22227415 635. 13.0
# … with 1,694 more rows
I'm using the gapminder dataset to practice some basic data analysis on the data frame.
I want to create a subset of this data with only Argentina and New Zealand, in order to compare their values.
install.packages("gapminder")
library(gapminder)
data("gapminder")
> gapminder
# A tibble: 1,704 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ... with 1,694 more rows
I'm subsetting the information I want like so :
df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
> df
# A tibble: 24 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Argentina Americas 1952 62.5 17876956 5911.
2 Argentina Americas 1957 64.4 19610538 6857.
3 Argentina Americas 1962 65.1 21283783 7133.
4 Argentina Americas 1967 65.6 22934225 8053.
5 Argentina Americas 1972 67.1 24779799 9443.
6 Argentina Americas 1977 68.5 26983828 10079.
7 Argentina Americas 1982 69.9 29341374 8998.
8 Argentina Americas 1987 70.8 31620918 9140.
9 Argentina Americas 1992 71.9 33958947 9308.
10 Argentina Americas 1997 73.3 36203463 10967.
# ... with 14 more rows
This works great as you can see (or that's what it seems)
Now I would like to create a simple boxplot to quickly analyze some values, but when I plot this with boxplot() and geom_boxplot I get two different results:
boxplot(lifeExp ~ country)
This is what I want, but the x axis is also taking into account all the other countries I did not select. Clearly their data is null but it makes the plot unreadable.
Instead if I use the same data and everything on ggplot, then it works perfectly:
ggplot(data = df, mapping = aes(x=country, y=lifeExp)) + geom_boxplot()
Is there something wrong I'm doing while defining the subset? Using boxplot() gives me the impression that the subset is keeping everything but putting the values for the things I don't want to NULL.
Start with the code posted in the question.
library(gapminder)
data("gapminder")
df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
boxplot(lifeExp ~ country, df)
The plot shows space for all countries because country is a factor and subsetting keeps its original levels. With str, it can be seen what df is:
str(df)
#tibble [24 × 6] (S3: tbl_df/tbl/data.frame)
# $ country : Factor w/ 142 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ continent: Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ year : int [1:24] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp : num [1:24] 62.5 64.4 65.1 65.6 67.1 ...
# $ pop : int [1:24] 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463 ...
# $ gdpPercap: num [1:24] 5911 6857 7133 8053 9443 ...
The factor country has 142 levels.
The solution is to drop the extra levels.
df2 <- df
df2$country <- droplevels(df2$country)
boxplot(lifeExp ~ country, df2)
I'm trying to get firsts values from diferent columns to make a data frame, but I I get stranded at one point and don't know how to solve it. Imagine you're using gapminder and want to get three higer gdppercap values for each region/year. How would you do it with dplyr?
Thanks.
I'm inferring that region is continent; if it were country, then this filter would return all rows, since each country/year combination occurs only once (so "top 3" means nothing special).
library(dplyr)
gapminder::gapminder %>%
group_by(continent, year) %>%
slice_max(desc(gdpPercap), n = 3) %>%
ungroup()
# # A tibble: 168 x 6
# country continent year lifeExp pop gdpPercap
# <fct> <fct> <int> <dbl> <int> <dbl>
# 1 Lesotho Africa 1952 42.1 748747 299.
# 2 Guinea-Bissau Africa 1952 32.5 580653 300.
# 3 Eritrea Africa 1952 35.9 1438760 329.
# 4 Lesotho Africa 1957 45.0 813338 336.
# 5 Eritrea Africa 1957 38.0 1542611 344.
# 6 Ethiopia Africa 1957 36.7 22815614 379.
# 7 Burundi Africa 1962 42.0 2961915 355.
# 8 Eritrea Africa 1962 40.2 1666618 381.
# 9 Lesotho Africa 1962 47.7 893143 412.
# 10 Burundi Africa 1967 43.5 3330989 413.
# # ... with 158 more rows