Combine rows together that are duplicates [duplicate] - r

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 4 years ago.
I have a data frame with 4 columns and several thousand rows. The first two columns are geographical identifiers, the third one is a date, and the last one is the number of shipments on that date.
For example:
London UK 4/4/2018 1
London UK 4/4/2018 1
London UK 4/5/2018 3
London UK 4/5/2018 2
I would like to combine the rows so as to have only one row per city, country, and date.
For example, the above data would become:
London UK 4/4/2018 2
London UK 4/5/2018 5
Thank you for all help in advance.

Here is your solution:
# Load dplyr for %>%, group_by() and summarise()
library(dplyr)
# 1. Data set
df <- data.frame(
  country = c("UK", "UK", "UK", "UK"),
  city = c("London", "London", "London", "London"),
  date = c("4/4/2018", "4/4/2018", "4/5/2018", "4/5/2018"),
  shipment = c(1, 1, 3, 2))
# 2. Group by 'country', 'city', and 'date', then sum the shipments
df %>%
  group_by(country, city, date) %>%
  summarise(shipment = sum(shipment))
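For comparison, the same grouped sum can be written without dplyr. This is a minimal base R sketch using `aggregate()`. The data frame is rebuilt from the question's example, and the name `totals` is only illustrative.

```r
# Example data copied from the question
df <- data.frame(
  city = rep("London", 4),
  country = rep("UK", 4),
  date = c("4/4/2018", "4/4/2018", "4/5/2018", "4/5/2018"),
  shipment = c(1, 1, 3, 2),
  stringsAsFactors = FALSE
)

# Sum shipments within each city/country/date combination
totals <- aggregate(shipment ~ city + country + date, data = df, FUN = sum)
print(totals)
```

The formula interface puts the column to summarise on the left and the grouping columns on the right, so adding another identifier only means adding another term.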


How can I add the country name to a dataset based on city name and population? [duplicate]

This question already has answers here:
extracting country name from city name in R
(3 answers)
Closed 7 months ago.
I have a dataset containing information on a range of cities, but there is no column which says what country the city is located in. In order to perform the analysis, I need to add an extra column which has the name of the country.
population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin
I expect the output to look like this:
population city country
500,000 Oslo Norway
750,000 Bristol England
500,000 Liverpool England
1,000,000 Dublin Ireland
How can I add a column of country names based on the city and population to a large dataset in R?
I am adapting Tom Hoel's answer, as suggested by Ian Campbell. If this is selected I am happy to mark it as community wiki.
library(maps)
library(dplyr)
data("world.cities")
df <- readr::read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df |>
inner_join(
select(world.cities, name, country.etc, pop),
by = c("city" = "name")
) |> group_by(city) |>
filter(
abs(pop - population) == min(abs(pop - population))
)
# A tibble: 4 x 4
# Groups: city [4]
# population city country.etc pop
# <dbl> <chr> <chr> <int>
# 1 500000 Oslo Norway 821445
# 2 750000 Bristol UK 432967
# 3 500000 Liverpool UK 468584
# 4 1000000 Dublin Ireland 1030431
As stated by others, cities with these names exist in other countries as well.
library(tidyverse)
library(maps)
data("world.cities")
df <- read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df %>%
merge(., world.cities %>%
select(name, country.etc),
by.x = "city",
by.y = "name")
# A tibble: 7 × 3
city population country.etc
<chr> <dbl> <chr>
1 Bristol 750000 UK
2 Bristol 750000 USA
3 Dublin 1000000 USA
4 Dublin 1000000 Ireland
5 Liverpool 500000 UK
6 Liverpool 500000 Canada
7 Oslo 500000 Norway
I think your best bet would be to add a new column called country to your dataset and fill it out by hand. This is part of the data preparation phase of the CRISP-DM process, so it is not uncommon. If that does not answer your question, please let me know and I will do my best to help.
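Filling the column by hand can still be done in code, which keeps the mapping reproducible. The sketch below assumes a small hand-maintained lookup vector. The name `country_lookup` and its contents are hypothetical and follow the expected output in the question.

```r
# Data from the question, with populations already parsed as numbers
cities <- data.frame(
  population = c(500000, 750000, 500000, 1000000),
  city = c("Oslo", "Bristol", "Liverpool", "Dublin"),
  stringsAsFactors = FALSE
)

# Hypothetical hand-maintained lookup: names are cities, values are countries
country_lookup <- c(
  Oslo = "Norway",
  Bristol = "England",
  Liverpool = "England",
  Dublin = "Ireland"
)

# Index the lookup by city name; cities missing from the lookup become NA
cities$country <- unname(country_lookup[cities$city])

# List any cities that still need a country assigned
missing <- cities$city[is.na(cities$country)]
print(cities)
```

Because unmatched cities come back as `NA`, the `missing` vector gives a quick list of what still has to be filled in for a larger dataset.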

Categorizing Data Based on Text Value in New Column [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
I'm trying to take an existing data frame that has a column for state and add a new column called Region depending on what the row's state is. So for example any row that has "CA" should be categorized "West" and any row that has "IL" should be Midwest. There are 4 regions: West, South, Midwest, and Northeast.
I had tried doing this separately in 4 code chunks like this:
south <- c("FL", "KY", "GA", "TX", "MS", "SC", "NC", "AL", "LA", "AR", "TN", "VA", "DC", "MD", "DE", "WV") #16 states
south.mdata <- mdata %>% filter(state %in% south) #1832 locations
south.byyear <- south.mdata %>% group_by(Year) %>% summarize(s.total = n())
south.total <- data %>% filter(state %in% south) %>% group_by(Year) %>% summarize(yearly.total = n())
But this seems repetitive and not the most efficient way to do this. Plus I'd like to be able to group_by both Year and Region so I can compare across regions.
I'm having trouble implementing this and the first thing that comes to mind is to do some sort of if/else loop using filter but I know loops aren't really R's style.
The original data looks like this:
Field.1 ID title description streetaddress city state
1 74 DE074 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
2 75 DE075 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
3 23 DE023 Dog House 1200 DuPont Hwy. Wilmington DE
4 19 DE019 Dog House 1200 DuPont Hwy Wilmington DE
5 26 DE026 Dog House 1200 Dupont Wilmington DE
6 65 DE065 Henlopen Hotel Bar Boardwalk & Surf Rehoboth Beach DE
amenityfeatures type Year notes lon lat
1 (M),(R) Restaurant 1977 <NA> -75.07601 38.72095
2 (M),(R) Restaurant 1976 <NA> -75.07601 38.72095
3 (M),(R) Restaurant 1975 <NA> -75.58243 39.68839
4 (M),(R) Restaurant 1976 <NA> -75.58243 39.68839
5 (M),(R) Restaurant 1974 <NA> -75.58723 39.76705
6 (M) Bars/Clubs,Hotel 1972 <NA> -75.07712 38.72280
status
1 Location could not be verified. General city or location coordinates used.
2 Location could not be verified. General city or location coordinates used.
3 Google Verified Location
4 Google Verified Location
5 Google Verified Location
6 Verified Location
I want to add a new column called "Region" that would loop through each row, look at the state, and then add a value to Region.
Any suggestions on the right syntax to do something like this would be so appreciated! Thanks so much!
This is a snippet of what the solution suggested by Gregor’s comment could look like.
library(tidyverse)
orig_data <-
tribble(~ID, ~state,
1, "CA",
2, "FL",
3, "DE")
region_lookup <-
tribble(~state, ~region,
"CA", "west",
"FL", "south",
"DE", "south")
left_join(orig_data, region_lookup)
#> Joining, by = "state"
#> # A tibble: 3 x 3
#> ID state region
#> <dbl> <chr> <chr>
#> 1 1 CA west
#> 2 2 FL south
#> 3 3 DE south
Created on 2020-11-02 by the reprex package (v0.3.0)
The simplest solution would be a join. For that you need a data.frame/tibble that has all the states and regions. Fortunately, the data is already in base R:
library(dplyr)
# build the tibble of state abbreviation and region from base R data
state_region <- dplyr::tibble(state.abb, state.region)
# join it on your data.frame/tibble
ORIGINAL_DATA %>%
dplyr::left_join(state_region, by = c("state" = "state.abb"))
Now you should have a new column called "state.region" that you can group by. Be aware that the states have to be in upper case.
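Two details of the built-in data are worth checking against the question: `state.abb` covers the 50 states only, so "DC" is absent, and `state.region` labels the Midwest as "North Central". The base R sketch below adjusts both and then counts rows by Year and Region. The small `mdata` frame is a hypothetical stand-in for the asker's data, and placing DC in the South follows the asker's own list.

```r
# Lookup built from the datasets that ship with base R
state_region <- data.frame(
  state = state.abb,
  Region = as.character(state.region),
  stringsAsFactors = FALSE
)

# Rename to match the question's labels and add DC, which is not in state.abb
state_region$Region[state_region$Region == "North Central"] <- "Midwest"
state_region <- rbind(
  state_region,
  data.frame(state = "DC", Region = "South", stringsAsFactors = FALSE)
)

# Hypothetical stand-in for the question's data
mdata <- data.frame(
  state = c("CA", "IL", "DC", "DE", "DE"),
  Year = c(1975, 1975, 1976, 1976, 1976),
  stringsAsFactors = FALSE
)

# Left join: keep every row of mdata, attach its Region
merged <- merge(mdata, state_region, by = "state", all.x = TRUE)

# Count locations per Year and Region
counts <- aggregate(state ~ Year + Region, data = merged, FUN = length)
names(counts)[names(counts) == "state"] <- "n"
print(counts)
```

Keeping the mapping in one lookup table replaces the four separate filter chunks and makes grouping by both Year and Region a single step.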

Keep specific rows of a data frame based on word sequence in R

I have a dataframe (df) like this. What I want to do is to go through the values for each ID and if there are two strings starting with the same word, I want to compare them to keep distinct values.
df <- data.frame(id = c(1,1,2,3,3,4,4,4,4,5),
value = c('australia', 'australia sydney', 'brazil',
'australia', 'usa', 'australia sydney', 'australia sydney randwick', 'australia', 'australia sydney circular quay', 'australia sydney'))
I want to get the first words to compare them and if they are different keep both but if they are the same go to the second words to compare them and so on...
so like for ID 1 I want to keep the row with the value 'australia sydney' and for Id 4 I want to keep both 'australia sydney circular quay', 'australia sydney randwick'.
For this example I need to get rows 2:5, 7, 9,10
Based on your edit, you can check within groups if any entry matches the start of any other entry and remove entries that do:
library(tidyverse)
df %>%
group_by(id) %>%
  filter(!map_lgl(
    seq_along(value),
    ~ any(if (length(value) == 1) FALSE
          else str_detect(value[-.x], paste0("^", value[.x])))
  ))
# A tibble: 7 x 2
# Groups: id, value [7]
id value
<dbl> <chr>
1 1 australia sydney
2 2 brazil
3 3 australia
4 3 usa
5 4 australia sydney randwick
6 4 australia sydney circular quay
7 5 australia sydney
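The same idea can be sketched in base R with `startsWith()`, which compares literal text and so avoids treating characters in `value` as regular expression syntax. This is an illustrative variant, not the answer's code: the helper `is_prefix_of_other` is a hypothetical name, and it matches whole leading words by appending a space to each candidate prefix.

```r
df <- data.frame(
  id = c(1, 1, 2, 3, 3, 4, 4, 4, 4, 5),
  value = c("australia", "australia sydney", "brazil",
            "australia", "usa", "australia sydney",
            "australia sydney randwick", "australia",
            "australia sydney circular quay", "australia sydney"),
  stringsAsFactors = FALSE
)

# TRUE for each value that is a whole-word prefix of another value in the group
is_prefix_of_other <- function(values) {
  vapply(seq_along(values), function(i) {
    any(startsWith(values[-i], paste0(values[i], " ")))
  }, logical(1))
}

# Apply the helper within each id; ave() returns 1 for rows to drop, 0 to keep
drop <- ave(seq_len(nrow(df)), df$id,
            FUN = function(idx) is_prefix_of_other(df$value[idx]))

result <- df[drop == 0, ]
print(result)
```

On the example data this keeps rows 2 to 5, 7, 9 and 10, the rows the question asks for.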

Aggregating and counting elements in the variables of a dataset

I may not have searched with the right terms, so apologies if this has been asked before.
I have a multiple columns dataset:
helena <-
Year US$ Euros Country Regions
2001 12 13 US America
2000 13 15 UK Europe
2003 14 19 China Asia
I want to group the dataset in a way that I have for each region the total per year of the earnings plus a column showing how many countries have communicated their data per region every year
helena <-
Year US$ Euros Regions Number of countries per region per Year
2000 150 135 America 2
2001 135 151 Europe 15
2002 142 1900 Asia 18
Yet, I have tried
count(helena, c("Regions", "Year"))
but it does not work properly since it includes only the columns indicated.
Here is the data.table way, I have added a row for Canada for year 2000 to test the code:
library(data.table)
df <- data.frame(Year = c(2000, 2001, 2003,2000),
US = c(13, 12, 14,13),
Euros = c(15, 13, 19,15),
Country = c('US', 'UK', 'China','Canada'),
Regions = c('America', 'Europe', 'Asia','America'))
df <- data.table(df)
df[,
.(sum_US = sum(US),
sum_Euros = sum(Euros),
number_of_countries = uniqueN(Country)),
.(Regions, Year)]
Regions Year sum_US sum_Euros number_of_countries
1: America 2000 26 30 2
2: Europe 2001 12 13 1
3: Asia 2003 14 19 1
With dplyr:
library(dplyr)
your_data %>%
group_by(Regions, Year) %>%
summarize(
US = sum(US),
Euros = sum(Euros),
N_countries = n_distinct(Country)
)
Using dplyr, with backticks around the non-syntactic `US$` column name:
library(dplyr)
df %>% group_by(Regions, Year) %>%
  summarise(Earnings_US = sum(`US$`),
            Earnings_Euros = sum(Euros),
            N_Countries = length(Country))
This aggregates the data set by region and year, summing the earnings columns and taking the length of the Country column (which assumes each country appears only once per region and year).
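Counting rows and counting distinct countries only agree when no country is repeated within a group. The base R sketch below uses a small made-up data frame in which "US" reports twice in the same year to show the difference. The names `n_rows` and `n_unique` are illustrative.

```r
# Made-up data: the US appears twice for the same region and year
df <- data.frame(
  Year = c(2000, 2000, 2000),
  Country = c("US", "US", "Canada"),
  Regions = c("America", "America", "America"),
  stringsAsFactors = FALSE
)

# Number of rows per region and year
n_rows <- aggregate(Country ~ Regions + Year, data = df, FUN = length)

# Number of distinct countries per region and year
n_unique <- aggregate(Country ~ Regions + Year, data = df,
                      FUN = function(x) length(unique(x)))

print(n_rows$Country)
print(n_unique$Country)
```

If the data can contain repeated countries, a distinct count such as `n_distinct()` or `uniqueN()` is the safer choice.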
Using tidyverse and building the example
library(tidyverse)
df <- tibble(Year = c(2000, 2001, 2003,2000),
US = c(13, 12, 14,13),
Euros = c(15, 13, 19,15),
Country = c('US', 'UK', 'China','Canada'),
Regions = c('America', 'Europe', 'Asia','America'))
df %>%
group_by(Regions, Year) %>%
summarise(US = sum(US),
Euros = sum(Euros),
Countries = n_distinct(Country))
Updated to reflect the data in the original question.

Create a ggplot with grouped factor levels

This is variation on a question asked here: Group factor levels in ggplot.
I have a dataframe:
df <- data.frame(respondent = factor(c(1, 2, 3, 4, 5, 6, 7)),
location = factor(c("California", "Oregon", "Mexico",
"Texas", "Canada", "Mexico", "Canada")))
There are three separate levels related to the US. I don't want to collapse them as the distinction between states is useful for data analysis. I would like to have, however, a basic barplot that combines the three US states and stacks them on top of one another, so that there are three bars in the barplot (Canada, Mexico, and US), with the US bar divided into the three states.
If the state factor levels had the "US" in their names, e.g. "US: California", I could use
library(tidyverse)
with_states <- df %>%
separate(location, into = c("Country", "State"), sep = ": ") %>%
replace_na(list(State = "Other")) %>%
mutate(State = as.factor(State)
%>% fct_relevel("Other", after = Inf))
to achieve the desired outcome. But how can this be done when R doesn't know that the three states are in the US?
If you look at the previous example, all the separate and replace_na functions do is separate the location variable into a country and state variable:
df
respondent location
1 1 US: California
2 2 US: Oregon
3 3 Mexico
...
df %>%
separate(location, into = c("Country", "State"), sep = ": ") %>%
replace_na(list(State = "Other"))
respondent Country State
1 1 US California
2 2 US Oregon
3 3 Mexico Other
...
So really all you need to do is get your data into this format: a column for country and a column for state/province.
There are many ways to do this yourself. Many times your data will already be in this format. If it isn't, the easiest way to fix it is to do a join to a table which maps location to country:
df
respondent location
1 1 California
2 2 Oregon
3 3 Mexico
4 4 Texas
5 5 Canada
6 6 Mexico
7 7 Canada
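state_mapping <- data.frame(state = c("California", "Oregon", "Texas"),
                            country = c('US', 'US', 'US'),
                            stringsAsFactors = FALSE)
df %>%
  # keep every respondent; locations without a match get NA for country
  left_join(state_mapping, by = c('location' = 'state')) %>%
  # locations that are not US states are already countries
  mutate(country = if_else(is.na(country),
                           location,
                           country))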
respondent location country
1 1 California US
2 2 Oregon US
3 3 Mexico Mexico
4 4 Texas US
5 5 Canada Canada
6 6 Mexico Mexico
7 7 Canada Canada
Once you've got it in this format, you can just do what the other question suggested.
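As a rough check that the reshaped data has what a stacked bar chart needs, the counts can be tabulated in base R. This sketch rebuilds the example data, derives `country` and `state` columns with `ifelse()` instead of a join, and cross-tabulates them. The vector `us_states` is an assumption standing in for whatever mapping table is available.

```r
df <- data.frame(
  respondent = factor(1:7),
  location = c("California", "Oregon", "Mexico",
               "Texas", "Canada", "Mexico", "Canada"),
  stringsAsFactors = FALSE
)

# Assumed list of locations that should be grouped under "US"
us_states <- c("California", "Oregon", "Texas")

# One column for the bar (country) and one for the stacked segments (state)
df$country <- ifelse(df$location %in% us_states, "US", df$location)
df$state <- ifelse(df$location %in% us_states, df$location, "Other")

# Rows are states, columns are countries; each column is one stacked bar
counts <- table(df$state, df$country)
print(counts)
```

With ggplot2, the same two columns would typically be mapped as `aes(x = country, fill = state)` with `geom_bar()`, which stacks by default.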
