The for loop below creates a separate dataset for each continent. I would like to rename each list element [[i]] to the corresponding value of cont, i.e. the name of the i-th continent. Many thanks in advance.
library(gapminder)
cont <- unique(gapminder$continent)
df <- NULL
for(i in 1:(length(cont))) {
  temp <- gapminder[gapminder$continent == cont[i], ]
  colnames(temp) <- paste0(paste(cont[i]))
  df[[i]] <- temp
}
df
Expected Answer
[[5]] -> I would like to see here Oceania
# A tibble: 24 x 6
Oceania `` `` `` `` ``
<fct> <fct> <int> <dbl> <int> <dbl>
1 Australia Oceania 1952 69.1 8691212 10040.
2 Australia Oceania 1957 70.3 9712569 10950.
3 Australia Oceania 1962 70.9 10794968 12217.
4 Australia Oceania 1967 71.1 11872264 14526.
5 Australia Oceania 1972 71.9 13177000 16789.
6 Australia Oceania 1977 73.5 14074100 18334.
7 Australia Oceania 1982 74.7 15184200 19477.
8 Australia Oceania 1987 76.3 16257249 21889.
9 Australia Oceania 1992 77.6 17481977 23425.
10 Australia Oceania 1997 78.8 18565243 26998.
# ... with 14 more rows
Use split instead: it returns a list whose elements are already named after the continents.
res <- split(gapminder, gapminder$continent)
names(res)
# [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
res$Africa
# # A tibble: 624 × 6
# country continent year lifeExp pop gdpPercap
# <fct> <fct> <int> <dbl> <int> <dbl>
# 1 Algeria Africa 1952 43.1 9279525 2449.
# 2 Algeria Africa 1957 45.7 10270856 3014.
# 3 Algeria Africa 1962 48.3 11000948 2551.
# 4 Algeria Africa 1967 51.4 12760499 3247.
# 5 Algeria Africa 1972 54.5 14760787 4183.
# 6 Algeria Africa 1977 58.0 17152804 4910.
# 7 Algeria Africa 1982 61.4 20033753 5745.
# 8 Algeria Africa 1987 65.8 23254956 5681.
# 9 Algeria Africa 1992 67.7 26298373 5023.
# 10 Algeria Africa 1997 69.2 29072015 4797.
# # … with 614 more rows
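As a minimal sketch of why this works, the example below uses a small made-up data frame (`toy`, a hypothetical stand-in for gapminder) so that it runs without any packages. It shows that split names each list element after a factor level, and that the loop from the question can produce the same names if names() is set explicitly afterwards.

```r
# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  country   = c("Algeria", "Angola", "Australia", "Austria"),
  continent = factor(c("Africa", "Africa", "Oceania", "Europe")),
  value     = c(1, 2, 3, 4)
)

# split() returns a list whose names are the factor levels
res <- split(toy, toy$continent)
names(res)

# The loop from the question can be kept if the names are set explicitly
cont <- unique(toy$continent)
df <- vector("list", length(cont))
for (i in seq_along(cont)) {
  df[[i]] <- toy[toy$continent == cont[i], ]
}
names(df) <- as.character(cont)
df$Oceania
```

Note that split orders the list by factor level, while the loop follows the order in which the values first appear.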
R is a vectorised language, so no loop is needed; you can get the same result as follows:
# Split the data.frame into a named list of data.frames, one per continent.
# No preallocation is needed because split() builds and returns the list itself.
gpm_list <- with(gapminder, split(gapminder, continent))
Related
I would like to save all the list objects created by the for loop as separate datasets in the environment, each with a proper name, for example gapminder_Asia, gapminder_Europe, etc. Many thanks in advance.
library(gapminder)
cont <- unique(gapminder$continent)
df <- NULL
for(i in 1:(length(cont))) {
  temp <- gapminder[gapminder$continent == cont[i], ]
  colnames(temp) <- paste0(paste(cont[i]))
  df[[i]] <- temp
}
df
Expected Answer,
> unique(gapminder$continent)
[1] Asia Europe Africa Americas Oceania
head(gapminder_Asia)
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
Personally I would prefer to keep the datasets inside a list, e.g. using split, but if your desired result is separately named objects, you can create them via assign:
library(gapminder)
df <- split(gapminder, gapminder$continent)
for(i in names(df)) {
  assign(paste("gapminder", i, sep = "_"), df[[i]])
}
gapminder_Africa
#> # A tibble: 624 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Algeria Africa 1952 43.1 9279525 2449.
#> 2 Algeria Africa 1957 45.7 10270856 3014.
#> 3 Algeria Africa 1962 48.3 11000948 2551.
#> 4 Algeria Africa 1967 51.4 12760499 3247.
#> 5 Algeria Africa 1972 54.5 14760787 4183.
#> 6 Algeria Africa 1977 58.0 17152804 4910.
#> 7 Algeria Africa 1982 61.4 20033753 5745.
#> 8 Algeria Africa 1987 65.8 23254956 5681.
#> 9 Algeria Africa 1992 67.7 26298373 5023.
#> 10 Algeria Africa 1997 69.2 29072015 4797.
#> # … with 614 more rows
Created on 2021-10-16 by the reprex package (v2.0.1)
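As a possible alternative to the assign loop, base R's list2env can copy every element of a named list into an environment in one call. The sketch below uses a made-up data frame (`toy`, hypothetical) and a fresh environment so that it has no side effects; it is an illustration under those assumptions, not the approach of the answer above.

```r
# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  continent = factor(c("Asia", "Asia", "Europe")),
  value     = c(1, 2, 3)
)
parts <- split(toy, toy$continent)

# Prefix the list names, then copy each element into an environment.
# A fresh environment keeps this sketch side-effect free; pass
# envir = globalenv() to create the objects in the workspace instead.
names(parts) <- paste("gapminder", names(parts), sep = "_")
env <- list2env(parts, envir = new.env())

ls(env)
env$gapminder_Asia
```

Keeping the data in the list is usually still easier to work with, since lapply and friends can then process all pieces at once.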
I'm using the gapminder dataset to practice some basic data analysis on the data frame.
I want to create a subset of this data with only Argentina and New Zealand, in order to compare their values.
install.packages("gapminder")
library(gapminder)
data("gapminder")
> gapminder
# A tibble: 1,704 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ... with 1,694 more rows
I'm subsetting the information I want like so:
df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
> df
# A tibble: 24 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Argentina Americas 1952 62.5 17876956 5911.
2 Argentina Americas 1957 64.4 19610538 6857.
3 Argentina Americas 1962 65.1 21283783 7133.
4 Argentina Americas 1967 65.6 22934225 8053.
5 Argentina Americas 1972 67.1 24779799 9443.
6 Argentina Americas 1977 68.5 26983828 10079.
7 Argentina Americas 1982 69.9 29341374 8998.
8 Argentina Americas 1987 70.8 31620918 9140.
9 Argentina Americas 1992 71.9 33958947 9308.
10 Argentina Americas 1997 73.3 36203463 10967.
# ... with 14 more rows
This seems to work well, as you can see.
Now I would like to create a simple boxplot to quickly analyze some values, but when I plot this with boxplot() and geom_boxplot I get two different results:
boxplot(lifeExp ~ country, data = df)
This is what I want, but the x axis also includes all the other countries I did not select. They have no data, but they make the plot unreadable.
If I instead plot the same data with ggplot, it works perfectly:
ggplot(data = df, mapping = aes(x=country, y=lifeExp)) + geom_boxplot()
Am I doing something wrong when defining the subset? Using boxplot() gives me the impression that the subset keeps everything but sets the values for the things I don't want to NULL.
Start with the code posted in the question.
library(gapminder)
data("gapminder")
df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
boxplot(lifeExp ~ country, df)
The plot shows space for all countries because country is a factor and subsetting keeps its original levels. str shows the structure of df:
str(df)
#tibble [24 × 6] (S3: tbl_df/tbl/data.frame)
# $ country : Factor w/ 142 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ continent: Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ year : int [1:24] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp : num [1:24] 62.5 64.4 65.1 65.6 67.1 ...
# $ pop : int [1:24] 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463 ...
# $ gdpPercap: num [1:24] 5911 6857 7133 8053 9443 ...
The factor country has 142 levels.
The solution is to drop the extra levels.
df2 <- df
df2$country <- droplevels(df2$country)
boxplot(lifeExp ~ country, df2)
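The same behaviour can be reproduced on a tiny made-up factor (the names `f`, `sub`, `d` and `d2` are hypothetical), which may make it easier to see what subsetting does to the levels. This is a minimal sketch in base R:

```r
# Hypothetical toy factor with three levels
f <- factor(c("Argentina", "Brazil", "New Zealand"))

# Subsetting keeps all original levels, even the unused ones
sub <- f[f != "Brazil"]
levels(sub)

# droplevels() removes the unused levels
levels(droplevels(sub))

# It also works on a whole data frame, dropping unused levels
# in every factor column at once
d <- data.frame(country = f, x = 1:3)
d2 <- droplevels(d[d$country != "Brazil", ])
nlevels(d2$country)
```

Calling droplevels on the data frame avoids having to treat each factor column separately.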
I'm trying to get the first values from different columns to make a data frame, but I get stuck at one point and don't know how to solve it. Imagine you're using gapminder and want to get the three highest gdpPercap values for each region/year. How would you do it with dplyr?
Thanks.
I'm inferring that region is continent; if it were country, then this filter would return all rows, since each country/year combination occurs only once (so "top 3" means nothing special).
library(dplyr)
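gapminder::gapminder %>%
  group_by(continent, year) %>%
  # Note: desc() reverses the ordering, so this keeps the three LOWEST gdpPercap
  # values per group, as the output below shows. Use slice_max(gdpPercap, n = 3)
  # for the three highest.
  slice_max(desc(gdpPercap), n = 3) %>%
  ungroup()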
# # A tibble: 168 x 6
# country continent year lifeExp pop gdpPercap
# <fct> <fct> <int> <dbl> <int> <dbl>
# 1 Lesotho Africa 1952 42.1 748747 299.
# 2 Guinea-Bissau Africa 1952 32.5 580653 300.
# 3 Eritrea Africa 1952 35.9 1438760 329.
# 4 Lesotho Africa 1957 45.0 813338 336.
# 5 Eritrea Africa 1957 38.0 1542611 344.
# 6 Ethiopia Africa 1957 36.7 22815614 379.
# 7 Burundi Africa 1962 42.0 2961915 355.
# 8 Eritrea Africa 1962 40.2 1666618 381.
# 9 Lesotho Africa 1962 47.7 893143 412.
# 10 Burundi Africa 1967 43.5 3330989 413.
# # ... with 158 more rows
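For comparison, here is a minimal sketch on made-up data (`toy`, `grp` and `val` are hypothetical names) showing how slice_max and slice_min pick the highest and lowest rows per group. It assumes dplyr 1.0.0 or later, where these functions are available.

```r
library(dplyr)

# Hypothetical toy data: two groups with four values each
toy <- data.frame(
  grp = rep(c("a", "b"), each = 4),
  val = c(10, 40, 30, 20, 5, 8, 7, 6)
)

# Two highest values per group
top2 <- toy %>%
  group_by(grp) %>%
  slice_max(val, n = 2) %>%
  ungroup()

# Two lowest values per group; same rows as slice_max(desc(val), n = 2)
bottom2 <- toy %>%
  group_by(grp) %>%
  slice_min(val, n = 2) %>%
  ungroup()

top2
bottom2
```

Both functions keep ties by default, so a group can return more than n rows; set with_ties = FALSE if exactly n rows are required.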
I have a data set for which I want to calculate z-scores within each year.
Example:
Year Score
1999 120
1999 132
1998 120
1997 132
2000 120
2002 132
1998 160
1997 142
....etc
What I want is:
Year Score Z-Score
1999 120 1.2
1999 132 .01
1998 120 -.6
1997 132 1.1
2000 120 -.6
2002 132 0.5
1998 160 2.1
1997 142 .01
I have used the following code:
DF$ZScore<-if (DR$Year== 1997){
((DF$Score-220)/20)
} else if ((DR$Year== 1998){
((DF$Score-222)/19)
}...
}else{
((DF$Score-219)/21)
}
This is not working and I cannot figure out why. Any help is appreciated.
I'm using the gapminder data for simplicity, and also the built-in scale function. You might want to build your own function to apply, depending on exactly how you want to scale it.
This is a little clunky, but because you want per-year scaling, you could group by year and make a nested data frame.
Then, using purrr, you could go into each data.frame within a year and scale the variables you want.
Then you would unnest the data again, and the variables would be scaled within each year.
library(tidyverse)
library(gapminder)
gapminder::gapminder %>%
  group_by(year) %>%
  nest() %>%
  mutate(data = map(data,
                    ~ mutate_at(.x, vars(lifeExp, pop),
                                list(scale = scale)))) %>%
  unnest(data)
#> # A tibble: 1,704 x 8
#> # Groups: year [12]
#> year country continent lifeExp pop gdpPercap lifeExp_scale[,…
#> <int> <fct> <fct> <dbl> <int> <dbl> <dbl>
#> 1 1952 Afghan… Asia 28.8 8.43e6 779. -1.66
#> 2 1952 Albania Europe 55.2 1.28e6 1601. 0.505
#> 3 1952 Algeria Africa 43.1 9.28e6 2449. -0.489
#> 4 1952 Angola Africa 30.0 4.23e6 3521. -1.56
#> 5 1952 Argent… Americas 62.5 1.79e7 5911. 1.10
#> 6 1952 Austra… Oceania 69.1 8.69e6 10040. 1.64
#> 7 1952 Austria Europe 66.8 6.93e6 6137. 1.45
#> 8 1952 Bahrain Asia 50.9 1.20e5 9867. 0.154
#> 9 1952 Bangla… Asia 37.5 4.69e7 684. -0.947
#> 10 1952 Belgium Europe 68 8.73e6 8343. 1.55
#> # … with 1,694 more rows, and 1 more variable: pop_scale[,1] <dbl>
Created on 2020-06-25 by the reprex package (v0.3.0)
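One reason the if/else attempt in the question fails is that if tests a single condition, not a whole column. As a minimal base R sketch applied to the example rows from the question, ave can compute the z-score within each year directly. It takes each year's mean and standard deviation from the data rather than hard-coding them, which is an assumption about what is wanted.

```r
# Example rows from the question
DF <- data.frame(
  Year  = c(1999, 1999, 1998, 1997, 2000, 2002, 1998, 1997),
  Score = c(120, 132, 120, 132, 120, 132, 160, 142)
)

# ave() applies FUN within each Year group and returns a vector
# aligned with the original rows
DF$ZScore <- ave(DF$Score, DF$Year,
                 FUN = function(x) (x - mean(x)) / sd(x))
DF
```

Years with a single observation (2000 and 2002 in these rows) get NA, because the standard deviation of one value is undefined.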
When I join two tables without specifying a key, it works perfectly. But when I provide the key, I get strange results.
Please help me understand what I am missing.
library(dplyr)      # provides left_join
library(gapminder)
A <- gapminder[gapminder$country=="India" & gapminder$year %in% 1952:1987, 1:4]
B <- gapminder[gapminder$country=="India" & gapminder$year %in% 1977:2007, c(1:3, 5, 6)]
left_join(A, B)
left_join(A, B, by = "country")
Without the key, I get:
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 India Asia 1952 37.4 NA NA
2 India Asia 1957 40.2 NA NA
3 India Asia 1962 43.6 NA NA
4 India Asia 1967 47.2 NA NA
5 India Asia 1972 50.7 NA NA
6 India Asia 1977 54.2 634000000 813.
7 India Asia 1982 56.6 708000000 856.
8 India Asia 1987 58.6 788000000 977.
But when I use the key, it gives me 56 rows:
# A tibble: 56 x 7
country continent year.x lifeExp year.y pop
<fct> <fct> <int> <dbl> <int> <dbl>
1 India Asia 1952 37.4 1977 6.34e8
2 India Asia 1952 37.4 1982 7.08e8
3 India Asia 1952 37.4 1987 7.88e8
4 India Asia 1952 37.4 1992 8.72e8
5 India Asia 1952 37.4 1997 9.59e8
6 India Asia 1952 37.4 2002 1.03e9
7 India Asia 1952 37.4 2007 1.11e9
8 India Asia 1957 40.2 1977 6.34e8
9 India Asia 1957 40.2 1982 7.08e8
10 India Asia 1957 40.2 1987 7.88e8
# ... with 46 more rows, and 1 more variable:
# gdpPercap <dbl>
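This is effectively a Cartesian product (cross join).
It's basically a multiplication of the rows rather than a straight intersect. With by = "country", every row of A matches every row of B because all rows are for India, so the 8 rows of A times the 7 rows of B give 56 rows. Without by, left_join matches on all columns the two tables share (country, continent and year), so each row of A pairs with at most one row of B.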
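The effect can be sketched on two small made-up tables (hypothetical data; only the table names `A` and `B` mirror the question). Joining on country alone multiplies the rows, while joining on country and year pairs them one to one.

```r
library(dplyr)

# Hypothetical toy tables: one country, partly overlapping years
A <- data.frame(country = "India",
                year    = c(1972, 1977, 1982),
                lifeExp = c(50.7, 54.2, 56.6))
B <- data.frame(country = "India",
                year    = c(1977, 1982, 1987),
                pop     = c(634, 708, 788))

# Keyed on country only: every row of A matches every row of B (3 x 3 = 9 rows).
# Recent dplyr versions may also warn about a many-to-many relationship here.
by_country <- left_join(A, B, by = "country")
nrow(by_country)

# Keyed on country and year: each row of A matches at most one row of B
by_both <- left_join(A, B, by = c("country", "year"))
nrow(by_both)
```

In the second join the 1972 row has no partner in B, so its pop is NA, which mirrors the NA rows in the question's output without a key.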