How do I collapse rows on key id but keep unique measurements in new variables?

I have a dataset that looks like this:
  Federal.Area                    State Total_Miles
1 Allentown, PA--NJ               NJ        1094508
2 Allentown, PA--NJ               PA        9957805
3 Augusta-Richmond County, GA--SC GA        6221747
4 Augusta-Richmond County, GA--SC SC        2101823
5 Beloit, WI--IL                  IL         324238
6 Beloit, WI--IL                  WI         542491
I'd like to collapse the rows by Federal.Area but create and keep new variables which contain the unique State and unique Total_Miles such that it looks like this:
  Federal.Area          State Total_Miles State1 State2 Total_Miles_state1 Total_Miles_state2
  <fct>                 <fct>       <dbl> <fct>  <fct>               <dbl>              <dbl>
1 Allentown, PA--NJ     NJ        1094508 NJ     PA                1094508            9957805
2 Augusta-Richmond Cou… GA        6221747 GA     SC                6221747            2101823
3 Beloit, WI--IL        IL         324238 IL     WI                 324238             542491
I don't know how to collapse the variables State and Total_Miles into the same row, but as new variables keyed on Federal.Area.

Perhaps you could use pivot_wider from the tidyverse to put your data into a wide format.
First number the rows within each Federal.Area as 1 and 2, then call pivot_wider, which will append that row number to the State and Total_Miles column names.
library(tidyverse)
df %>%
  group_by(Federal.Area) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(id_cols = Federal.Area, values_from = c(State, Total_Miles), names_from = rn)
Output
# A tibble: 3 x 5
# Groups: Federal.Area [3]
  Federal.Area                  State_1 State_2 Total_Miles_1 Total_Miles_2
  <chr>                         <chr>   <chr>           <int>         <int>
1 Allentown,PA--NJ              NJ      PA            1094508       9957805
2 Augusta-RichmondCounty,GA--SC GA      SC            6221747       2101823
3 Beloit,WI--IL                 IL      WI             324238        542491
Data
df <- structure(list(Federal.Area = c("Allentown,PA--NJ", "Allentown,PA--NJ",
"Augusta-RichmondCounty,GA--SC", "Augusta-RichmondCounty,GA--SC",
"Beloit,WI--IL", "Beloit,WI--IL"), State = c("NJ", "PA", "GA",
"SC", "IL", "WI"), Total_Miles = c(1094508L, 9957805L, 6221747L,
2101823L, 324238L, 542491L)), class = "data.frame", row.names = c(NA,
-6L))
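If you also want to keep the original State and Total_Miles columns alongside the numbered ones, as in your example output, one option (a sketch only, assuming at most two rows per Federal.Area as in the sample data) is to copy the _1 columns back after widening:
library(tidyverse)
# Widen as above, then rebuild State/Total_Miles from the first pair
df %>%
  group_by(Federal.Area) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(id_cols = Federal.Area,
              values_from = c(State, Total_Miles),
              names_from = rn) %>%
  mutate(State = State_1, Total_Miles = Total_Miles_1) %>%
  relocate(State, Total_Miles, .after = Federal.Area)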

I lose the constant variables (including id) when using pivot_longer with multiple variables

I am trying to reshape the following

country region    abc2001 abc2002 xyz2001 xyz2002
Japan   East Asia       1       2     4.5     5.5

to the following:

country region    year  abc  xyz
Japan   East Asia 2001    1  4.5
Japan   East Asia 2002    2  5.5
Actually, there are five more variables structured in the same way.
I use the following code:
long <- data %>%
  pivot_longer(cols = c(-country, -region), names_to = c(".value", "year"),
               names_pattern = "([^\\.]*)\\.*(\\d{4})")
The result is the long version of the data, except that I lose the country and region variables. What am I doing wrong? Or how else can I do this better?
Thank you in advance.
We may change the regex pattern to match one or more non-digits (\\D+) as the first capture group and one or more digits (\\d+) as the second one:
library(tidyr)
pivot_longer(data, cols = c(-country, -region),
             names_to = c(".value", "year"), names_pattern = "(\\D+)(\\d+)")
Output
# A tibble: 2 × 5
country region year abc xyz
<chr> <chr> <chr> <int> <dbl>
1 Japan East Asia 2001 1 4.5
2 Japan East Asia 2002 2 5.5
data
data <- structure(list(country = "Japan", region = "East Asia", abc2001 = 1L,
abc2002 = 2L, xyz2001 = 4.5, xyz2002 = 5.5),
class = "data.frame", row.names = c(NA,
-1L))
Update: as @akrun noted in the comments, here is a better regex with a lookaround, which inserts an underscore at each letter-digit boundary:
library(stringr)
data %>%
  rename_with(~ str_replace(.x, "(?<=\\D)(?=\\d)", "_"))
First answer:
Here is a version with names_sep. The challenge was to add an underscore to the column names; the preferred approach is that of @akrun:
(.*) - Group 1: any characters, zero or more, as many as possible
(\\d{4}$) - Group 2: four digits at the end
library(dplyr)
library(tidyr)
data %>%
  rename_with(~ sub("(.*)(\\d{4}$)", "\\1_\\2", .x)) %>%
  pivot_longer(-c(country, region),
               names_to = c(".value", "Year"),
               names_sep = "_")
country region Year abc xyz
<chr> <chr> <chr> <int> <dbl>
1 Japan East Asia 2001 1 4.5
2 Japan East Asia 2002 2 5.5
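As a sketch, the lookaround rename and pivot_longer with names_sep can also be combined into one pipe; this should reproduce the same output as above:
library(dplyr)
library(stringr)
library(tidyr)
# Insert "_" at each letter-digit boundary, then split the names on it
data %>%
  rename_with(~ str_replace(.x, "(?<=\\D)(?=\\d)", "_")) %>%
  pivot_longer(-c(country, region),
               names_to = c(".value", "year"),
               names_sep = "_")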

Get nearest n matching strings

Hi, I am trying to match strings in one dataframe against strings in another dataframe and get the nearest n matches based on a score.
For example: each value in the string_2 column (df_2) needs to be matched against string_1 (df_1), returning the nearest 3 matches within each ID group.
ID = c(100, 100, 100, 100, 103, 103, 103, 103, 104, 104, 104, 104)
string_1 = c("Jack Daniel", "Jac", "JackDan", "steve", "Mark", "Dukes", "Allan", "Duke", "Puma Nike", "Puma", "Nike", "Addidas")
df_1 = data.frame(ID, string_1)
ID = c(100, 100, 185, 103, 103, 104, 104, 104)
string_2 = c("Jack Daniel", "Mark", "Order", "Steve", "Mark 2", "Nike", "Addidas", "Reebok")
df_2 = data.frame(ID, string_2)
My desired output dataframe df_out looks like the one below:
ID = c(100, 100, 185, 103, 103, 104, 104, 104)
string_2 = c("Jack Daniel", "Mark", "Order", "Steve", "Mark 2", "Nike", "Addidas", "Reebok")
nearest_str_match_1 = c("Jack Daniel", "JackDan", "NA", "Duke", "Mark", "Nike", "Addidas", "Nike")
nearest_str_match_2 = c("JackDan", "Jack Daniel", "NA", "Dukes", "Duke", "Addidas", "Nike", "Puma Nike")
nearest_str_match_3 = c("Jac", "Jac", "NA", "Allan", "Allan", "Puma", "Puma", "Addidas")
df_out = data.frame(ID, string_2, nearest_str_match_1, nearest_str_match_2, nearest_str_match_3)
I have tried manually with the package "stringdist" and the 'jw' method to get the nearest values:
stringdist::stringdist("Jack Daniel","Jack Daniel","jw")
stringdist::stringdist("Jack Daniel","Jac","jw")
stringdist::stringdist("Jack Daniel","JackDan","jw")
Thanks in advance
library(dplyr)
library(tidyr)
merge(df_1, df_2, by = 'ID') %>%
  group_by(string_2) %>%
  mutate(dist = stringdist::stringdist(string_2, string_1, 'jw') %>%
           rank(ties.method = 'last')) %>%
  slice_min(dist, n = 3) %>%
  pivot_wider(names_from = dist, names_prefix = 'nearest_str_match_',
              values_from = string_1)
# A tibble: 7 x 5
# Groups: string_2 [7]
ID string_2 nearest_str_match_1 nearest_str_match_2 nearest_str_match_3
<dbl> <chr> <chr> <chr> <chr>
1 104 Addidas Addidas Nike Puma
2 100 Jack Daniel Jack Daniel JackDan Jac
3 100 Mark JackDan Jack Daniel Jac
4 103 Mark 2 Mark Duke Allan
5 104 Nike Nike Addidas Puma
6 104 Reebok Nike Puma Nike Addidas
7 103 Steve Duke Dukes Allan
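Note that merge() performs an inner join here, so ID 185 ("Order"), which has no counterpart in df_1, is dropped; that is why this output has 7 rows while your df_out has 8. A sketch of a variant that keeps such rows, with NA in the nearest-match columns, is to right-join instead:
library(dplyr)
library(tidyr)
# all.y = TRUE keeps df_2 rows with no ID match in df_1 (e.g. ID 185)
merge(df_1, df_2, by = 'ID', all.y = TRUE) %>%
  group_by(string_2) %>%
  mutate(dist = stringdist::stringdist(string_2, string_1, 'jw') %>%
           rank(ties.method = 'last')) %>%
  slice_min(dist, n = 3) %>%
  pivot_wider(names_from = dist, names_prefix = 'nearest_str_match_',
              values_from = string_1)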

How to exact match two column values in entire Dataset using R

I have the below-mentioned dataframe in R, and I have tried various methods but couldn't achieve the required output yet.
DF:
ID Date city code uid
I-1 2020-01-01 10:12:15 New York 123 K-1
I-1 2020-01-01 10:12:15 Utha 103 K-1
I-2 2020-01-02 10:12:15 Washington 122 K-1
I-3 2020-02-01 10:12:15 Tokyo 123 K-2
I-3 2020-02-01 10:12:15 Osaka 193 K-2
I-4 2020-02-02 10:12:15 London 144 K-3
I-5 2020-02-04 10:12:15 Dubai 101 K-4
I-6 2019-11-01 10:12:15 Dubai 101 K-4
I-7 2019-11-01 10:12:15 London 144 K-3
I-8 2018-12-13 10:12:15 Tokyo 143 K-5
I-9 2019-05-17 10:12:15 Dubai 101 K-4
I-19 2020-03-11 10:12:15 Dubai 150 K-7
Dput:
structure(list(ID = c("I-1", "I-1",
"I-2", "I-3", "I-3", "I-4",
"I-5", "I-6", "I-7", "I-8", "I-9","I-19"
), DATE = c("2020-01-01 11:49:40.842", "2020-01-01 09:35:33.607",
"2020-01-02 06:14:58.731", "2020-02-01 16:51:27.190", "2020-02-01 05:35:46.952",
"2020-02-02 05:48:49.443", "2020-02-04 10:00:41.616", "2019-11-01 09:10:46.536",
"2019-11-01 11:54:05.655", "2018-12-13 14:24:31.617", "2019-05-17 14:24:31.617", "2020-03-11 14:24:31.617"), CITY = c("New York",
"UTAH", "Washington", "Tokyo",
"Osaka", "London", "Dubai",
"Dubai", "London", "Tokyo", "Dubai",
"Dubai"), CODE = c("221010",
"411017", "638007", "583101", "560029", "643102", "363001", "452001",
"560024", "509208"), UID = c("K-1",
"K-1", "K-1", "K-2", "K-2",
"K-3", "K-4", "K-4", "K-3",
"K-5","K-4","K-7")), .Names = c("ID", "DATE",
"CITY", "CODE", "UID"), row.names = c(NA,
10L), class = "data.fram)
Using the above-mentioned dataframe, I want to fetch records between 1st Jan 2020 and 29th Feb 2020, compare those IDs against the entire database to check whether both city and code together match another ID, and categorize matches further to check how many have the same uid and how many have a different one.
Where,
Match - whether the combination of city and code matches another ID in the database
Same_uid - classification of matched IDs: how many IDs have the same uid
different_uid - classification of matched IDs: how many IDs do not have the same uid
uid_count - count of that particular ID's uid across the entire database
Note - I have more than 10M records in the dataframe.
Required Output
ID Date city code uid Match Same_uid different_uid uid_count
I-1 2020-01-01 10:12:15 New York 123 K-1 No 0 0 2
I-2 2020-01-02 10:12:15 Washington 122 K-1 No 0 0 2
I-3 2020-02-01 10:12:15 Tokyo 123 K-2 No 0 0 1
I-4 2020-02-02 10:12:15 London 144 K-3 Yes 1 0 2
I-5 2020-02-04 10:12:15 Dubai 101 K-4 Yes 2 0 3
An approach:
Load in the dataset
library(tidyverse)
library(lubridate)
mydata <- tibble(
  ID = c("I-1", "I-1", "I-2", "I-3", "I-3", "I-4",
         "I-5", "I-6", "I-7", "I-8", "I-9", "I-19"),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-02-01",
           "2020-02-01", "2020-02-02", "2020-02-04", "2019-11-01",
           "2019-11-01", "2018-12-13", "2019-05-17", "2020-03-11"),
  city = c("New York", "Utha", "Washington", "Tokyo",
           "Osaka", "London", "Dubai", "Dubai",
           "London", "Tokyo", "Dubai", "Dubai"),
  code = c("123", "103", "122", "123", "193", "144",
           "101", "101", "144", "143", "101", "150"),
  uid = c("K-1", "K-1", "K-1", "K-2", "K-2", "K-3",
          "K-4", "K-4", "K-3", "K-5", "K-4", "K-7"))
mydata <- mydata %>%
  mutate(Date = ymd(str_remove(Date, " .*")),
         code = as.character(code))
Where clause number 1
I use count from dplyr to count the codes by city, then case_when to flag each combination with a "Yes" or "No" as requested.
# This counts city and code, and fulfills your "Match" column requirement
startdate <- "2017-01-01"
enddate <- "2020-03-29"
mydata %>%
  filter(Date >= startdate,
         Date <= enddate) %>%
  count(city, code, name = "count_samecode") %>%
  mutate(Match = case_when(
    count_samecode > 1 ~ "Yes",
    TRUE ~ "No")) %>%
  head()
# # A tibble: 6 x 4
# city code count_samecode Match
# <chr> <chr> <int> <chr>
# 1 Dubai 101 3 Yes
# 2 Dubai 150 1 No
# 3 London 144 2 Yes
# 4 New York 123 1 No
# 5 Osaka 193 1 No
# 6 Tokyo 123 1 No
Where clause number 2
I will do the same with UID
mydata %>%
  filter(Date >= startdate,
         Date <= enddate) %>%
  count(city, uid, name = "UIDs_#_filtered") %>%
  head()
# # A tibble: 6 x 3
# city uid `UIDs_#_filtered`
# <chr> <chr> <int>
# 1 Dubai K-4 3
# 2 Dubai K-7 1
# 3 London K-3 2
# 4 New York K-1 1
# 5 Osaka K-2 1
# 6 Tokyo K-2 1
Where clause number 3
I can repeat the count from clause number 2 to find how many distinct UIDs each city has; a value greater than 1 signals differing UIDs.
mydata %>%
  filter(Date >= startdate,
         Date <= enddate) %>%
  count(city, uid, name = "UIDs_#_filtered") %>%
  count(city, name = "UIDs_#_different") %>%
  head()
# # A tibble: 6 x 2
# city `UIDs_#_different`
# <chr> <int>
# 1 Dubai 2
# 2 London 1
# 3 New York 1
# 4 Osaka 1
# 5 Tokyo 2
# 6 Utha 1
Where clause number 4
Taking the same code from #2, I can remove the filter to count over the entire dataset.
mydata %>%
  count(city, uid, name = "UIDs_#_all") %>%
  head()
Putting it all together
Using several left_joins, we can get close to your desired output.
EDIT: this now brings along the first instance of the ID from the first city/code combination.
check_duplicates_filterview.f <- function(df, startdate, enddate) {
  # df should be a tibble
  # startdate should be a string "yyyy-mm-dd"
  # enddate should be a string "yyyy-mm-dd"
  cityfilter <- df %>%
    filter(Date >= startdate, Date <= enddate) %>%
    distinct(city) %>%
    pull(1)
  df <- df %>%
    filter(city %in% cityfilter) %>%
    mutate(Date = ymd(str_remove(Date, " .*")),
           code = as.character(code))
  entire.db.countcodes <- df %>%  # Finds count of code in entire DB
    count(city, code)
  where.1 <- df %>%
    filter(Date >= startdate, Date <= enddate) %>%
    distinct(city, code, .keep_all = TRUE) %>%
    left_join(entire.db.countcodes) %>%
    rename("count_samecode" = n) %>%
    mutate(Match = case_when(
      count_samecode > 1 ~ "Yes",
      TRUE ~ "No"))
  where.2 <- df %>%
    filter(Date >= startdate, Date <= enddate) %>%
    count(city, uid, name = "UIDs_#_filtered")
  where.3 <- df %>%
    filter(Date >= startdate, Date <= enddate) %>%
    distinct(city, uid) %>%
    count(city, name = "UIDs_#_distinct")
  where.4 <- df %>%
    filter(city %in% cityfilter) %>%
    count(city, uid, name = "UIDs_#_all")
  first_half <- left_join(where.1, where.2)
  second_half <- left_join(where.4, where.3)
  full <- left_join(first_half, second_half)
  return(full)
}
# > check_duplicates_filterview.f(mydata, "2018-01-01", "2020-01-01")
# Joining, by = "city"
# Joining, by = "city"
# Joining, by = c("city", "uid")
# # A tibble: 5 x 8
# city code count_samecode Match uid `UIDs_#_filtered` `UIDs_#_all` `UIDs_#_distinct`
# <chr> <chr> <int> <chr> <chr> <int> <int> <int>
# 1 Dubai 101 2 Yes K-4 2 3 1
# 2 London 144 1 No K-3 1 2 1
# 3 New York 123 1 No K-1 1 1 1
# 4 Tokyo 143 1 No K-5 1 1 1
# 5 Utha 103 1 No K-1 1 1 1
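Since you mention more than 10M records, here is a hedged data.table sketch of the first count (the Match flag), which may scale better on a table that size; it assumes the same mydata built above:
library(data.table)
# Flag city/code combinations that occur more than once, by reference
dt <- as.data.table(mydata)
dt[, count_samecode := .N, by = .(city, code)]
dt[, Match := fifelse(count_samecode > 1, "Yes", "No")]
head(dt)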

How can I transpose data in each variable from long to wide using group_by in R?

I have a dataframe with the id variable name. I'm trying to figure out a way to transpose each variable in the dataframe by name.
My current df is below:
name jobtitle companyname datesemployed empduration joblocation jobdescrip
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati…
2 David… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
3 David… Data An… NA Jan 2018 – J… 6 mos Belfast, U… Working wi…
However, I'd like a dataframe in which there is only one row per name, and every observation for name becomes its own column, like below:
name jobtitle_1 companyname_1 datesemployed_1 empduration_1 joblocation_1 jobdescrip_1 job_title2 companyname_2 datesemployed_2 empduration_2 joblocation_2 jobdescrip_2
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
I have used commands like gather_by and melt in the past to reshape from long to wide, but in this case I'm not sure how to apply them, since every observation for the id variable will need to become its own column.
It sounds like you are looking for gather and pivot_wider.
I used my own sample data with two names:
df <- tibble(name = c('David', 'David', 'David', 'Bill', 'Bill'),
             jobtitle = c('PM', 'TPM', 'Analyst', 'Dev', 'Eng'),
             companyname = c('EOS', 'Options', NA, 'Microsoft', 'Nintendo'))
First add an index column to distinguish the different positions for each name.
indexed <- df %>%
  group_by(name) %>%
  mutate(.index = row_number())
indexed
# name jobtitle companyname .index
# <chr> <chr> <chr> <int>
# 1 David PM EOS 1
# 2 David TPM Options 2
# 3 David Analyst NA 3
# 4 Bill Dev Microsoft 1
# 5 Bill Eng Nintendo 2
Then it is possible to use gather to get a long form, with one value per row.
gathered <- indexed %>% gather('var', 'val', -c(name, .index))
gathered
# name .index var val
# <chr> <int> <chr> <chr>
# 1 David 1 jobtitle PM
# 2 David 2 jobtitle TPM
# 3 David 3 jobtitle Analyst
# 4 Bill 1 jobtitle Dev
# 5 Bill 2 jobtitle Eng
# 6 David 1 companyname EOS
# 7 David 2 companyname Options
# 8 David 3 companyname NA
# 9 Bill 1 companyname Microsoft
# 10 Bill 2 companyname Nintendo
Now pivot_wider can be used to create a column for each variable and index.
gathered %>% pivot_wider(names_from = c(var, .index), values_from = val)
# name jobtitle_1 jobtitle_2 jobtitle_3 companyname_1 companyname_2 companyname_3
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 David PM TPM Analyst EOS Options NA
# 2 Bill Dev Eng NA Microsoft Nintendo NA
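As an aside, gather() is superseded in current tidyr; a sketch of the equivalent long step with pivot_longer() would be:
# Same result as the gather() step above
gathered <- indexed %>%
  pivot_longer(-c(name, .index), names_to = "var", values_to = "val")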
Get the data in long format, create a unique column identifier and get it back to wide format.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -name, names_to = 'col') %>%
  group_by(name, col) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = c(col, row), values_from = value)

Aggregate from Data Frame

I have a dataframe in R of the following form:
City Province Poupulation
1 Bandung JABAR 500,000
2 Surabaya JATIM 600,000
3 Malang JATIM 350,000
4 Bogor JABAR 400,000
5 Semarang JATENG 550,000
6 Cirebon JABAR 300,000
7 Madiun JATIM 200,000
8 Solo JATENG 275,000
9 Tegal JATENG 290,000
What code is necessary to compute the overall population of the cities in the JATENG province only?
Here is a dplyr solution:
library(dplyr)
df %>%
  group_by(Province) %>%
  summarise(sum = sum(Poupulation))
# Province sum
# <fctr> <dbl>
#1 JABAR 700000
#2 JATENG 1115000
#3 JATIM 1150000
When you are only interested in the province JATENG, then this will do the job:
df %>%
  filter(Province == "JATENG") %>%
  summarise(sum = sum(Poupulation))
# sum
#1 1115000
Since Poupulation contains commas, it was likely read in as a factor or character column, so plain as.numeric() will not give the right values; you may have to change the summarise call to summarise(sum = sum(as.numeric(gsub(",", "", Poupulation)))).
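A fuller sketch of that comma-safe version, assuming Poupulation came in as text:
library(dplyr)
# Strip the thousands separators, convert to numeric, then sum
df %>%
  filter(Province == "JATENG") %>%
  summarise(sum = sum(as.numeric(gsub(",", "", Poupulation))))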
