Changing an observation's name using dplyr - r

Suppose I have this dataset:
Variable <- c("GDP")
Country <- c("Brazil", "Chile")
df <- data.frame(Variable, Country)
I want to change "GDP" to the country name followed by "GDP", i.e., "Brazil GDP" and "Chile GDP".
I have a much larger dataset and I've been trying to do this by using
df %>% mutate(Variable = replace(Variable, Variable == "GDP", paste(Country, "GDP")))
However, this uses the first observation of "Country" for every observation in "Variable" that meets the condition. Is there any way to make paste() use the value of Country on the row it is applied to?
I've tried to use rowwise() and it did not work. I've tried the following code as well and encountered the same problem
df %>% mutate(Country = ifelse(Country == "Chile", replace(Variable, Variable == "GDP",
paste(Country, "GDP")), Variable))
Thanks to everyone!
EDIT
I can't simply use unite because I still need the variable Country. So a workaround that I found was the following (I also had several other observations whose names needed changing):
df %>%
  mutate(Variable2 = ifelse(Variable == "GDP", paste0(Country, " ", Variable), Variable)) %>%
  mutate(Variable2 = replace(Variable2, Variable2 == "CR", "Country Risk")) %>%
  mutate(Variable2 = replace(Variable2, Variable2 == "EXR", "Exchange Rate")) %>%
  mutate(Variable2 = replace(Variable2, Variable2 == "INTR", "Interest Rate")) %>%
  select(-Variable) %>%
  select(Horizon, Variable = Variable2, Response, Low, Up, Shock, Country, Status)
EDIT 2
My desired output was
Horizon  Variable    Response  Shock  Country
1        Brazil GDP  0.0037    PCOM   Brazil
2        Brazil GDP  0.0060    PCOM   Brazil
3        Brazil GDP  0.0053    PCOM   Brazil
4        Brazil GDP  0.0033    PCOM   Brazil
5        Brazil GDP  0.0021    PCOM   Brazil
6        Brazil GDP  0.0020    PCOM   Brazil

This example should help:
library(tidyr)
library(dplyr)
Variable <- c("GDP")
Country <- c("Brazil", "Chile")
value = c(5,10)
df <- data.frame(Variable, Country, value)
# original data
df
#   Variable Country value
# 1      GDP  Brazil     5
# 2      GDP   Chile    10
# update
df %>% unite(NewGDP, Variable, Country)
#       NewGDP value
# 1 GDP_Brazil     5
# 2  GDP_Chile    10
If you want to use paste you can do:
df %>% mutate(NewGDP = paste0(Country,"_",Variable))
#   Variable Country value     NewGDP
# 1      GDP  Brazil     5 Brazil_GDP
# 2      GDP   Chile    10  Chile_GDP
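The problem described in the question comes from replace(), which takes its replacement values from the start of the vector it is given rather than from the matching rows. A row-wise conditional such as if_else() avoids this. The following is a minimal sketch, assuming a small made-up data frame with one extra non-GDP row ("CR") to show that other values pass through unchanged; it overwrites Variable in place and leaves Country untouched:

```r
library(dplyr)

# Hypothetical example data: the "CR" row is added purely for illustration.
df <- data.frame(
  Variable = c("GDP", "CR", "GDP"),
  Country  = c("Brazil", "Brazil", "Chile"),
  stringsAsFactors = FALSE
)

# if_else() is vectorised, so each row pairs its own Country with its own Variable.
result <- df %>%
  mutate(Variable = if_else(Variable == "GDP", paste(Country, Variable), Variable))

result$Variable
# [1] "Brazil GDP" "CR"         "Chile GDP"
```

If unite() is preferred, its remove = FALSE argument keeps the input columns, which may also address the need to retain Country.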

Related

R: Filtering rows based on a group criterion

I have a data frame with over 100,000 rows and with about 40 columns. The schools column has about 100 distinct schools. I have data from 1980 to 2023.
I want to keep all data from schools that have at least 10 rows for each of the years 2018 through 2022. Schools that do not meet that criterion should have all rows deleted.
In my minimal example, Schools, I have three schools.
Computing a table makes it apparent that only Washington should be retained. Adams only has 5 rows for 2018 and Jefferson has 0 for 2018.
Schools2 is what the result should look like.
How do I use the table computation or a dplyr computation to perform the filter?
Schools =
data.frame(school = c(rep('Washington', 60),
rep('Adams',70),
rep('Jefferson', 100)),
year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5),
rep(2017, 25), rep(2018, 5), rep(2019:2022, each = 10),
rep(2019:2023, each = 20)),
stuff = rnorm(230)
)
Schools2 =
data.frame(school = c(rep('Washington', 60)),
year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5)),
stuff = rnorm(60)
)
table(Schools$school, Schools$year)
Schools |> group_by(school, year) |> summarize(counts = n())
Within each school, look only at the years 2018 through 2022, count the rows per year, and keep the school (with all of its rows) only if every one of those counts is at least 10 and all five years are present:
library(dplyr) # version >= 1.1.0, needed for the .by argument
Schools %>%
  filter(all(table(year[year %in% 2018:2022]) >= 10) &
           all(2018:2022 %in% year), .by = c("school")) %>%
  as_tibble()
-output
# A tibble: 60 × 3
   school      year   stuff
   <chr>      <dbl>   <dbl>
 1 Washington  2016  0.680
 2 Washington  2016 -1.14
 3 Washington  2016  0.0420
 4 Washington  2016 -0.603
 5 Washington  2016  2.05
 6 Washington  2018 -0.810
 7 Washington  2018  0.692
 8 Washington  2018 -0.502
 9 Washington  2018  0.464
10 Washington  2018  0.397
# … with 50 more rows
Or using count
library(magrittr)
Schools %>%
filter(tibble(year) %>%
filter(year %in% 2018:2022) %>%
count(year) %>%
pull(n) %>%
is_weakly_greater_than(10) %>%
all, all(2018:2022 %in% year) , .by = "school")
As it turns out, a friend just helped me come up with a base R solution.
# form 2-way table, school against year
sdTable = table(Schools$school, Schools$year)
# say want years 2018-2022 having lots of rows in school data;
# select the columns by name so the code does not depend on column positions
sdTable = sdTable[, as.character(2018:2022)]
# which have >= 10 rows in all years 2018-2022
allGtEq = function(oneRow) all(oneRow >= 10)
whichToKeep = which(apply(sdTable,1,allGtEq))
# now whichToKeep is row numbers from the table; get the school names
whichToKeep = names(whichToKeep)
# back to school data
whichOrigRowsToKeep = which(Schools$school %in% whichToKeep)
newHousing = Schools[whichOrigRowsToKeep,]
newHousing
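The same logic can also be split into two explicit steps, which some readers may find easier to check: first build a small table of qualifying schools, then keep only the rows belonging to them. This is a sketch under the question's example data; the names target_years, qualifying, rows, and ok are introduced here for illustration only.

```r
library(dplyr)

set.seed(1)
Schools <- data.frame(
  school = c(rep("Washington", 60), rep("Adams", 70), rep("Jefferson", 100)),
  year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5),
           rep(2017, 25), rep(2018, 5), rep(2019:2022, each = 10),
           rep(2019:2023, each = 20)),
  stuff = rnorm(230)
)

target_years <- 2018:2022

# Step 1: one row per school/year inside the window, then test each school.
qualifying <- Schools %>%
  filter(year %in% target_years) %>%
  count(school, year, name = "rows") %>%
  group_by(school) %>%
  summarise(ok = n_distinct(year) == length(target_years) && all(rows >= 10)) %>%
  filter(ok)

# Step 2: keep every row (all years) of the schools that qualified.
Schools2 <- semi_join(Schools, qualifying, by = "school")

unique(Schools2$school)
# [1] "Washington"
```

semi_join() only filters; it adds no columns, so Schools2 has the same structure as Schools and retains the 2016 and 2023 rows for Washington.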

Create several columns from a complex column in R

Imagine dataset:
df1 <- tibble::tribble(~City, ~Population,
"United Kingdom > Leeds", 1500000,
"Spain > Las Palmas de Gran Canaria", 200000,
"Canada > Nanaimo, BC", 150000,
"Canada > Montreal", 250000,
"United States > Minneapolis, MN", 700000,
"United States > Milwaukee, WI", NA,
"United States > Milwaukee", 400000)
I would like to:
Split column City into three columns: City, Country, State (if available, NA otherwise)
Make sure Milwaukee ends up with both a state and a population (the NA population for Milwaukee should take the value 400000), and then split into City, State, and Country.
Could you, please, suggest the easiest method to do so :)
Here's another solution that uses extract to pull out Country, City, and State in a single step, with State captured by an optional group (the remainder of the task is done as in @Allen's code):
library(tidyr)
library(dplyr)
df1 %>%
  extract(City,
          into = c("Country", "City", "State"),
          regex = "([^>]+) > ([^,]+),? ?([A-Z]+)?"
  ) %>%
  # as in @Allen Cameron's answer:
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
            Population = Population[!is.na(Population)])
You can use separate twice to get the country and state, then group_by Country and City to summarize away the NA values where appropriate:
library(tidyverse)
df1 %>%
  separate(City, sep = " > ", into = c("Country", "City")) %>%
  separate(City, sep = ', ', into = c('City', 'State')) %>%
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
            Population = Population[!is.na(Population)])
#> # A tibble: 6 x 4
#> # Groups:   Country [4]
#>   Country        City                       State Population
#>   <chr>          <chr>                      <chr>      <dbl>
#> 1 Canada         Montreal                   <NA>      250000
#> 2 Canada         Nanaimo                    BC        150000
#> 3 Spain          Las Palmas de Gran Canaria <NA>      200000
#> 4 United Kingdom Leeds                      <NA>     1500000
#> 5 United States  Milwaukee                  WI        400000
#> 6 United States  Minneapolis                MN        700000

Can I get a ggplot2 bar chart to display the actual values on the Y axis?

When I plot my bar chart, the Y axis shows values I don't understand. How can I get the bar chart to use the actual values?
#Here is the code for my graph
stock %>%
  #Tidy data to be handled correctly
  group_by(year) %>%
  filter(year == "2017") %>%
  pivot_longer(bio_sus:bio_notsus) %>%
  mutate(value2 = ifelse(name=="bio_sus",-1*value, value)) %>%
  #make the graph
  ggplot(aes(ocean_whole, value2/100, fill=name)) +
  geom_bar(stat = "identity")
The bar chart shows values between 2.5 and -2.5, while my value2 values range between 100 and -100.
  ocean_sub                code   year ocean_whole name       value value2
  <chr>                    <chr> <dbl> <chr>       <chr>      <dbl>  <dbl>
1 Eastern Central Atlantic NA     2017 atlantic    bio_sus     57.1  -57.1
2 Eastern Central Atlantic NA     2017 atlantic    bio_notsus  42.9   42.9
3 Eastern Central Pacific  NA     2017 pacific     bio_sus     86.7  -86.7
4 Eastern Central Pacific  NA     2017 pacific     bio_notsus  13.3   13.3
5 Eastern Indian Ocean     NA     2017 indian      bio_sus     68.6  -68.6
How can I get the chart to display the actual values?
#My code is from TidyTuesdays Global seafood:
stock <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-12/fish-stocks-within-sustainable-levels.csv')
#transformed in the following way
oceans <- c("pacific", "atlantic", "indian", "mediterranean")
lu <- stack(sapply(oceans, grep, x = stock$entity, ignore.case = TRUE))
stock$oceans <- stock$entity
stock$oceans[lu$values] <- as.character(lu$ind)
stock %>%
group_by(oceans) %>%
summarise(across(matches("^share"), sum))
colnames(stock) <- (c("ocean_sub", "code", "year", "bio_sus", "bio_notsus", "ocean_whole"))
Your tibble contains multiple rows for each value of ocean_whole when you hand it to ggplot(), and geom_bar(stat = "identity") stacks them, so each bar shows the sum of those values (divided by 100) rather than a single value. Check:
library(dplyr)
library(tidyr) # for pivot_longer()
stock %>%
  group_by(year) %>%
  filter(year == "2017") %>%
  pivot_longer(bio_sus:bio_notsus) %>%
  mutate(value2 = ifelse(name=="bio_sus",-1*value, value)) %>%
  group_by(ocean_whole, name) %>%
  summarise(sum(value2))
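One possible way to get bars on the original scale is to reduce the data to a single row per ocean_whole and name before plotting. The sketch below uses made-up numbers rather than the TidyTuesday data, and it assumes that averaging the sub-regions is the aggregation you want, which is a judgment call the question does not settle.

```r
library(dplyr)
library(ggplot2)

# Made-up values: two sub-regions per ocean, so each ocean/name pair has two rows.
toy <- tibble::tribble(
  ~ocean_sub,     ~ocean_whole, ~name,        ~value2,
  "Sub-region A", "atlantic",   "bio_sus",    -60,
  "Sub-region A", "atlantic",   "bio_notsus",  40,
  "Sub-region B", "atlantic",   "bio_sus",    -40,
  "Sub-region B", "atlantic",   "bio_notsus",  60,
  "Sub-region C", "pacific",    "bio_sus",    -90,
  "Sub-region C", "pacific",    "bio_notsus",  10,
  "Sub-region D", "pacific",    "bio_sus",    -80,
  "Sub-region D", "pacific",    "bio_notsus",  20
)

# Collapse to one value per bar segment (assumption: the mean is the right summary).
per_ocean <- toy %>%
  group_by(ocean_whole, name) %>%
  summarise(value2 = mean(value2), .groups = "drop")

# geom_col() is equivalent to geom_bar(stat = "identity").
p <- ggplot(per_ocean, aes(ocean_whole, value2, fill = name)) +
  geom_col()
```

With one row per segment nothing is summed within a fill group, so the axis stays within the range of the summarised values.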

finding shared column information - a least common ancestor question

I have a data.frame object consisting of columns of information that is tree-like. For instance, I have performed a search of a set of features (query_name) and returned a set of potential matches (match_name). Every match has an associated location that is split into continent, country, region, and town.
The problem I'd like to resolve is finding, for a given query_name, the location information that all potential matches have in common.
For example, with this bit of example data:
query_name <- c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4))
match_name <- paste0("match", seq(1:9))
continent <- c(rep("NorthAmerica", 3), rep("NorthAmerica", 2), rep("Europe", 4))
country <- c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4))
region <- c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2))
town <- c("Manhattan", "Albany", "Buffalo", "Toronto", NA, "Munich", "Nuremberg", "Berlin", "Frankfurt")
data <- data.frame(query_name, match_name, continent, country, region, town)
We'd generate this data.frame object:
  query_name match_name    continent      country  region      town
1   feature1     match1 NorthAmerica UnitedStates NewYork Manhattan
2   feature1     match2 NorthAmerica UnitedStates NewYork    Albany
3   feature1     match3 NorthAmerica UnitedStates NewYork   Buffalo
4   feature2     match4 NorthAmerica       Canada Ontario   Toronto
5   feature2     match5 NorthAmerica       Canada    <NA>      <NA>
6   feature3     match6       Europe      Germany  Bayern    Munich
7   feature3     match7       Europe      Germany  Bayern Nuremberg
8   feature3     match8       Europe      Germany  Berlin    Berlin
9   feature3     match9       Europe      Germany  Berlin Frankfurt
I'm hoping to get advice on how to construct a function that will produce the result below. Note that shared location information is now concatenated and separated with a ; delimiter.
Feature1 differs only at the town information, thus the returned string contains the continent through region information.
Feature2 doesn't differ at region or town in the two matches here because one of the two matches contains no information. Nevertheless, lack of information is considered distinct from values with information, so the only thing shared in common for feature2 matches are continent and country.
Feature3 contains shared continent and country information, but distinct region and town, so just continent and country are retained.
Hoping for an output file that looks like this:
query_name  location_output
feature1    NorthAmerica;UnitedStates;NewYork;
feature2    NorthAmerica;Canada;;
feature3    Europe;Germany;;
Thanks for any advice you can spare.
Cheers!
Here is an option
library(tidyverse)
data %>%
  gather(key, val, -query_name, -match_name) %>%
  select(-match_name, -key) %>%
  group_by(query_name, val) %>%
  add_count() %>%
  group_by(query_name) %>%
  filter(n == max(n)) %>%
  summarise(location_output = paste0(unique(val[!is.na(val)]), collapse = ";"))
## A tibble: 3 x 2
#  query_name location_output
#  <fct>      <chr>
#1 feature1   NorthAmerica;UnitedStates;NewYork
#2 feature2   NorthAmerica;Canada
#3 feature3   Europe;Germany
This is less elegant than @MauritsEvers' solution (it doesn't automatically take care of an arbitrary number of levels), but it ensures that every location_output keeps all four fields, and hence all three ; delimiters.
library(dplyr)
data %>%
  group_by(query_name) %>%
  summarize(continent = ifelse(n_distinct(continent) == 1, first(continent), ""),
            country = ifelse(n_distinct(country) == 1, first(country), ""),
            region = ifelse(n_distinct(region) == 1, first(region), ""),
            town = ifelse(n_distinct(town) == 1, first(town), "")) %>%
  mutate(location_output = paste(continent, country, region, town, sep = ";")) %>%
  select(query_name, location_output)
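A base R alternative: for each query, count the unique values in every location column and keep the leading run of columns that share the same count (here, 1):
lapply(split(data, data$query_name), function(x){
  x = x[,-(1:2)]
  r = rle(sapply(x, function(d) length(unique(d))))
  x[1, seq(r$lengths[1])]
})
#$feature1
#     continent      country  region
#1 NorthAmerica UnitedStates NewYork
#
#$feature2
#     continent country
#4 NorthAmerica  Canada
#
#$feature3
#  continent country
#6    Europe Germany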
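For completeness, here is a sketch of a small base R helper that produces the exact string format requested in the question, works for any number of levels, and stops at the first level that is not shared. The names shared_prefix and level_cols are hypothetical, and the sketch assumes that a level containing any NA counts as "not shared", in line with the question's treatment of feature2.

```r
query_name <- c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4))
match_name <- paste0("match", 1:9)
continent <- c(rep("NorthAmerica", 5), rep("Europe", 4))
country <- c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4))
region <- c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2))
town <- c("Manhattan", "Albany", "Buffalo", "Toronto", NA,
          "Munich", "Nuremberg", "Berlin", "Frankfurt")
data <- data.frame(query_name, match_name, continent, country, region, town,
                   stringsAsFactors = FALSE)

# Hypothetical helper: build the "a;b;c;d" string for one query's rows.
shared_prefix <- function(rows, level_cols) {
  # A level is shared only if every match has the same non-missing value.
  shared <- vapply(level_cols, function(col) {
    !anyNA(rows[[col]]) && length(unique(rows[[col]])) == 1
  }, logical(1))
  # Once one level differs, all deeper levels are blanked as well.
  keep <- cumprod(shared) == 1
  first_values <- vapply(level_cols, function(col) as.character(rows[[col]][1]),
                         character(1))
  paste(ifelse(keep, first_values, ""), collapse = ";")
}

level_cols <- c("continent", "country", "region", "town")
groups <- split(data, data$query_name)
out <- data.frame(
  query_name = names(groups),
  location_output = unname(vapply(groups, shared_prefix, character(1),
                                  level_cols = level_cols)),
  stringsAsFactors = FALSE
)
out$location_output
# [1] "NorthAmerica;UnitedStates;NewYork;" "NorthAmerica;Canada;;"
# [3] "Europe;Germany;;"
```

The cumprod() step is what makes this a common-ancestor computation rather than a per-column comparison: a deeper level is never reported when a shallower one already differs.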

Calculating proportion by year using dplyr

I'm attempting to calculate the frequency a variable (in this case country) shows up in any given year. For example:
name <- c('AJ Griffin','Steve Bacon','Kevin Potatoe','Jose Hernandez','Kent Brockman',
'Sal Fasno','Kirk Kelly','Wes United','Livan Domingo','Mike Fast')
country <- c('USA', 'USA', 'Canada', 'Dominican Republic', 'Panama', 'Dominican Republic', 'Canada', 'USA', 'Dominican Republic', 'Mexico')
year <- c('2016', '2016', '2016', '2016', '2016', '2015', '2015', '2015', '2015', '2015')
country_analysis <-data.frame(name, country, year)
When I use the following code I get the proportion of countries for the entire dataset, but I would like to pare this down even further to specific years.
P <- country_analysis %>%
group_by(country) %>%
summarise(n=n())%>%
mutate(freq = round(n / sum(n), 1))
Ideally the end result would have country, year, frequency column (i.e. 2016, USA, 0.4). Any input would be appreciated.
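First collapse by year and country, then by just year: summarize() drops the last grouping variable (country), so the mutate() that follows computes sum(count) within each year. For example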
country_analysis %>%
  group_by(year, country) %>%
  summarize(count=n()) %>%
  mutate(proportion=count/sum(count))
#     year            country count proportion
#   <fctr>             <fctr> <int>      <dbl>
# 1   2015             Canada     1        0.2
# 2   2015 Dominican Republic     2        0.4
# 3   2015             Mexico     1        0.2
# 4   2015                USA     1        0.2
# 5   2016             Canada     1        0.2
# 6   2016 Dominican Republic     1        0.2
# 7   2016             Panama     1        0.2
# 8   2016                USA     2        0.4
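The answer above relies on summarize() silently leaving the data grouped by year. A variant that makes the grouping explicit may be easier to read later; this is a sketch using the question's data, with count() standing in for group_by() plus summarize(), and result as an illustrative name.

```r
library(dplyr)

name <- c('AJ Griffin','Steve Bacon','Kevin Potatoe','Jose Hernandez','Kent Brockman',
          'Sal Fasno','Kirk Kelly','Wes United','Livan Domingo','Mike Fast')
country <- c('USA', 'USA', 'Canada', 'Dominican Republic', 'Panama',
             'Dominican Republic', 'Canada', 'USA', 'Dominican Republic', 'Mexico')
year <- c('2016', '2016', '2016', '2016', '2016', '2015', '2015', '2015', '2015', '2015')
country_analysis <- data.frame(name, country, year)

result <- country_analysis %>%
  count(year, country, name = "count") %>%   # one row per year/country pair
  group_by(year) %>%                         # proportions are taken within each year
  mutate(proportion = count / sum(count)) %>%
  ungroup()
```

Because the final ungroup() removes the grouping, later steps on result do not inherit a per-year grouping by accident.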
