I have some data which I am trying to use tidy R and pivot longer function in R to get the out put as mentioned below. But I am not able to do it, I am getting Data
I have data in this format. ( with many other column names )
Country State Year 1 Population 1 Year 2 Population2
U.S.A IL 2009 20000 2010 30000
U.S.A VA 2009 30000 2010 40000
I want to get data in this format.
Country State Year Population
U.S.A IL 2009 20000
U.S.A IL 2010 30000
U.S.A VA 2009 30000
U.S.A VA 2010 40000
I am able to do it only for on column, but not able to pass other column likes like population
My code is below.
file1<-file %>%
pivot_longer(
cols = contains("Year"),
names_sep = "_",
names_to = c(".value", "repeat"),
)
I was able to make it work on Tidyverse.
library(tidyverse)
file<-read_excel("peps300.xlsx")
names(file)<-str_replace_all(names(file), c("Year " = "Year_" , "Num " = "Num_", "DRate " = "DRate_" , "PRate " = "PRate_", "Denom " = "Denom_"))
file<-file %>%
pivot_longer(
cols = c(contains("Year"),contains("Num"),contains("DRate"),contains("PRate"),contains("Denom")),
names_sep = "_",
names_to = c(".value", "repeat")
)
An option would be to specify the cols that starts_with "Population" or "Year"
library(dplyr)
df1 %>%
pivot_longer(cols = c(starts_with("Population"), starts_with("Year")),
names_to = c(".value", "group"), names_pattern = "(.*)_(.*)")
# A tibble: 4 x 5
# Country State group Population Year
# <chr> <chr> <chr> <int> <int>
#1 U.S.A IL 1 20000 2009
#2 U.S.A IL 2 30000 2010
#3 U.S.A VA 1 30000 2009
#4 U.S.A VA 2 40000 2010
data
df1 <- structure(list(Country = c("U.S.A", "U.S.A"), State = c("IL",
"VA"), Year_1 = c(2009L, 2009L), Population_1 = c(20000L, 30000L
), Year_2 = c(2010L, 2010L), Population_2 = c(30000L, 40000L)),
class = "data.frame", row.names = c(NA,
-2L))
df %>%
pivot_longer(
-c(Country,State),
names_to = c(".value","group"),
names_pattern = "(.+)_(.+)"
)
# A tibble: 4 x 5
Country State group Year Population
<chr> <chr> <chr> <chr> <chr>
1 U.S.A IL 1 2009 20000
2 U.S.A IL 2 2010 30000
3 U.S.A VA 1 2009 30000
4 U.S.A VA 2 2010 40000
You can then drop the group if you don't need it.
And, to do this, you will need to clean your column names first. Make sure they all follow the same pattern and words are connected with a single space or a single underscore.
df <- structure(list(Country = c("U.S.A", "U.S.A"), State = c("IL",
"VA"), Year_1 = c("2009", "2009"), Population_1 = c("20000",
"30000"), Year_2 = c("2010", "2010"), Population_2 = c("30000",
"40000")), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L))
Related
I have a large DF with certain columns that have a vector of character values as below. The number of columns varies from dataset to dataset as well as the number of character vectors it holds also varies.
ID Country1 Country2 Country3
1 1 Argentina, Japan,USA,Poland, Argentina,USA Pakistan
2 2 Colombia, Mexico,Uruguay,Dutch Mexico,Uruguay Afganisthan
3 3 Argentina, Japan,USA,NA Japan Khazagistan
4 4 Colombia, Mexico,Uruguay,Dutch Colombia, Dutch North Korea
5 5 India, China China Iran
Would like to match them one-to-one with another string vector as below
vals_to_find <-c("Argentina","USA","Mexico")
If, a column/row matches to anyone of the strings passed would like to retain that column and row. Remove duplicates, and finally remove those values that do not match.
the desired output is as follows
ID Countries.found
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
data
dput(df)
structure(list(ID = 1:5, Country1 = c("Argentina, Japan,USA,Poland,",
"Colombia, Mexico,Uruguay,Dutch", "Argentina, Japan,USA,NA",
"Colombia, Mexico,Uruguay,Dutch", "India, China"), Country2 = c("Argentina,USA",
"Mexico,Uruguay", "Japan", "Colombia, Dutch", "China"), Country3 = c("Pakistan",
"Afganisthan", "Khazagistan", "North Korea", "Iran")), class = "data.frame", row.names = c(NA,
-5L))
dput(df_out)
structure(list(ID = 1:4, Countries.found = c("Argentina, USA",
"Mexico", "Argentina, USA", "Mexico")), class = "data.frame", row.names = c(NA,
-4L))
Instead of a each column as a vector, if the file is read as one value per column. Then, was able do it as below
dput(df_out)
structure(list(ID = 1:5, X1 = c("Argentina", "Colombia", "Argentina",
"Colombia", "India"), X2 = c("Japan", "Mexico", "Japan", "Mexico",
"China"), X3 = c("USA", "Uruguay", "USA", "Uruguay", NA), X4 = c("Poland",
"Dutch", NA, "Dutch", NA), X5 = c("Argentina", "Mexico", "Japan",
"Colombia", "China"), X6 = c("USA", "Uruguay", NA, "Dutch", NA
), X7 = c("Pakistan", "Afganisthan", "Khazagistan", "North Korea",
"Iran")), class = "data.frame", row.names = c(NA, -5L))
df_out %>%
dplyr::select(
where(~ !all(is.na(.x)))
) %>%
dplyr::select(c(1, where(~ any(.x %in% vals_to_find)))) %>%
dplyr::mutate(dplyr::across(
tidyselect::starts_with("X"),
~ vals_to_find[match(., vals_to_find)]
)) %>%
tidyr::unite("countries_found", tidyselect::starts_with("X"),
sep = " | ", remove = TRUE, na.rm = TRUE
)
Output
ID countries_found
1 1 Argentina | USA | Argentina | USA
2 2 Mexico | Mexico
3 3 Argentina | USA
4 4 Mexico
unite the "Country" columns, then create a long vector by separating the values into rows, get all distinct values per ID, filter only those who are in vals_to_find, and summarise each countries.found toString.
library(tidyr)
library(dplyr)
df %>%
unite("Country", starts_with("Country"), sep = ",") %>%
separate_rows(Country) %>%
distinct(ID, Country) %>%
filter(Country %in% vals_to_find) %>%
group_by(ID) %>%
summarise(Countries.found = toString(Country))
output
# A tibble: 4 × 2
ID Countries.found
<int> <chr>
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
We may use
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(across(starts_with("Country"),
~ str_extract_all(.x, str_c(vals_to_find, collapse = "|")))) %>%
pivot_longer(cols = -ID, names_to = NULL,
values_to = 'Countries.found') %>%
unnest(Countries.found) %>%
distinct %>%
group_by(ID) %>%
summarise(Countries.found = toString(Countries.found))
-output
# A tibble: 4 × 2
ID Countries.found
<int> <chr>
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
Say you have a database like gapminder with the population per country. Even though the current year is 2021, you also have predictions for the following years to come.
location 2020.0 2021.0 2022.0
Canada 5 7 9
China 23 34 54
Congo 1 2 3
and another database like this, vaccins
location date amount_of_vaccins
Canada 2020-01-02 50
China 2021-05-03 59
Congo 2022-03-05 34
How can I merge the population of each country into the second database, but following the dates in the second database.
I managed to merge them by country like this:
merge(gapminder,vaccins, by = "location")
but I'm getting this
location date amount_of_vaccins 2020.0 2021.0 2022.0
Canada 2020-01-02 50 5 7 9
China 2021-05-03 59 23 34 54
Congo 2022-03-05 34 1 2 3
I'd like to have only a new variable giving the population of the country according to the year. Thank you.
You could do something like this with tidyverse.
library(tidyverse)
df1 <- df1 %>%
pivot_longer(!location, names_to = "date", values_to = "population") %>%
dplyr::mutate(year = str_sub(date, 1, 4))
df2 %>%
dplyr::mutate(year = str_sub(date, end = 4)) %>%
dplyr::left_join(., df1, by = c("location", "year")) %>%
dplyr::select(-c(date.y, year)) %>%
dplyr::rename(date = date.x)
Output
location date amount_of_vaccins population
1 Canada 2020-01-02 50 5
2 China 2021-05-03 59 34
3 Congo 2022-03-05 54 3
Data
df1 <-
structure(
list(
location = c("Canada", "China", "Congo"),
`2020.0` = c(5, 23, 1),
`2021.0` = c(7, 34, 2),
`2022.0` = c(9, 54, 3)
),
class = "data.frame",
row.names = c(NA,-3L)
)
df2 <-
structure(
list(
location = c("Canada", "China", "Congo"),
date = c("2020-01-02",
"2021-05-03", "2022-03-05"),
amount_of_vaccins = c(50, 59, 54)
),
class = "data.frame",
row.names = c(NA,-3L)
)
I'm sorry if this question has already been answered, but I don't really know how to phrase my question.
I have a data frame structured in this way:
country
year
score
France
2020
10
France
2019
9
Germany
2020
15
Germany
2019
14
I would like to have a new column called previous_year_score that would look into the data frame looking for the "score" of a country for the "year - 1". In this case France 2020 would have a previous_year_score of 9, while France 2019 would have a NA.
You can use match() for this. I imagine there are plenty of other solutions too.
Data:
df <- structure(list(country = c("France", "France", "Germany", "Germany"
), year = c(2020L, 2019L, 2020L, 2019L), score = c(10L, 9L, 15L,
14L), prev_score = c(9L, NA, 14L, NA)), row.names = c(NA, -4L
), class = "data.frame")
Solution:
i <- match(paste(df[[1]],df[[2]]-1),paste(df[[1]],df[[2]]))
df$prev_score <- df[i,3]
You can use the following solution:
library(dplyr)
df %>%
group_by(country) %>%
arrange(year) %>%
mutate(prev_val = ifelse(year - lag(year) == 1, lag(score), NA))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 Germany 2019 14 NA
3 France 2020 10 9
4 Germany 2020 15 14
Using case_when
library(dplyr)
df1 %>%
arrange(country, year) %>%
group_by(country) %>%
mutate(prev_val = case_when(year - lag(year) == 1 ~ lag(score)))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 France 2020 10 9
3 Germany 2019 14 NA
4 Germany 2020 15 14
I would like to transform my data from long format to wide by the values in two columns. How can I do this using tidyverse?
Updated dput
structure(list(Country = c("Algeria", "Benin", "Ghana", "Algeria",
"Benin", "Ghana", "Algeria", "Benin", "Ghana"
), Indicator = c("Indicator 1",
"Indicator 1",
"Indicator 1",
"Indicator 2",
"Indicator 2",
"Indicator 2",
"Indicator 3",
"Indicator 3",
"Indicator 3"
), Status = c("Actual", "Forecast", "Target", "Actual", "Forecast",
"Target", "Actual", "Forecast", "Target"), Value = c(34, 15, 5,
28, 5, 2, 43, 5,
1)), row.names
= c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))
Country Indicator Status Value
<chr> <chr> <chr> <dbl>
1 Algeria Indicator 1 Actual 34
2 Benin Indicator 1 Forecast 15
3 Ghana Indicator 1 Target 5
4 Algeria Indicator 2 Actual 28
5 Benin Indicator 2 Forecast 5
6 Ghana Indicator 2 Target 2
7 Algeria Indicator 3 Actual 43
8 Benin Indicator 3 Forecast 5
9 Ghana Indicator 3 Target 1
Expected output
Country Indicator1_Actual Indicator1_Forecast Indicator1_Target Indicator2_Actual
Algeria 34 15 5 28
etc
Appreciate any tips!
foo <- data %>% pivot_wider(names_from = c("Indicator","Status"), values_from = "Value")
works perfectly!
I think the mistake is in your pivot_wider() command
data %>% pivot_wider(names_from = Indicator, values_from = c(Indicator, Status))
I bet you can't use the same column for both names and values.
Try this code
data %>% pivot_wider(names_from = c(Indicator, Status), values_from = Value))
Explanation: Since you want the column names to be Indicator 1_Actual, you need both columns indicator and status going into your names_from
It would be helpful if you provided example data and expected output. But I tested this on my dummy data and it gives the expected output -
Data:
# A tibble: 4 x 4
a1 a2 a3 a4
<int> <int> <chr> <dbl>
1 1 5 s 10
2 2 4 s 20
3 3 3 n 30
4 4 2 n 40
Call : a %>% pivot_wider(names_from = c(a2, a3), values_from = a4)
Output :
# A tibble: 4 x 5
a1 `5_s` `4_s` `3_n` `2_n`
<int> <dbl> <dbl> <dbl> <dbl>
1 1 10 NA NA NA
2 2 NA 20 NA NA
3 3 NA NA 30 NA
4 4 NA NA NA 40
Data here if you want to reproduce
structure(list(a1 = 1:4, a2 = 5:2, a3 = c("s", "s", "n", "n"),
a4 = c(10, 20, 30, 40)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Edit : For the edited question after trying out the correct pivot_wider() command - It looks like your data could actually have duplicates, in which case the output you are seeing would make sense - I would suggest you try to figure out if your data actually has duplicates by using filter(Country == .., Indicator == .., Status == ..)
This can be achieved by calling both your columns to pivot wider in the names_from argument in pivot_wider().
data %>%
pivot_wider(names_from = c("Indicator","Status"),
values_from = "Value")
Result
Country `Indicator 1_Ac… `Indicator 1_Fo… `Indicator 1_Ta… `Indicator 2_Ac… `Indicator 2_Fo…
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Algeria 34 15 5 28 5
This is the libraryI am using for creating dummies
install.packages("fastDummies")
library(fastDummies)
This is the dataset
winners <- data.frame(
city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"),
year = c(1990, 2000, 1990),
crime = 1:3)
Let's them create super dummies out of these cities:
dummy_cols(winners, select_columns = c("city"))
The results are
city year crime city_SaoPaulito city_NewAmsterdam city_BeatifulCow
1 SaoPaulito 1990 1 1 0 0
2 NewAmsterdam 2000 2 0 1 0
3 BeatifulCow 1990 3 0 0 1
So the question if that I want to return to the previous dataset, any ideas?
Thanks in advance!
We can use dcast
library(data.table)
dcast(setDT(winners), crime ~ city, length)
If we need to get the input, it would be
subset(df1, select = 1:3)
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
Or with melt
melt(setDT(df1), measure = patterns("_"))[value == 1, .(city, year, crime)]
# city year crime
#1: SaoPaulito 1990 1
#2: NewAmsterdam 2000 2
#3: BeatifulCow 1990 3
data
df1 <- structure(list(city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"
), year = c(1990L, 2000L, 1990L), crime = 1:3, city_SaoPaulito = c(1L,
0L, 0L), city_NewAmsterdam = c(0L, 1L, 0L), city_BeatifulCow = c(0L,
0L, 1L)), class = "data.frame", row.names = c("1", "2", "3"))
If you are going to have only one city as 1 in each row, you can just skip the dummy columns
df[, 1:3]
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
If you can have multiple cities one way using dplyr and tidyr::gather is
library(dplyr)
df %>%
tidyr::gather(key, value, starts_with("city_")) %>%
filter(value == 1) %>%
select(-value, -key)