RStudio - Multiple name for one id problem - r

I have a data frame that look like this (called df1)
trip_id
station_id
station_name
id123
s01
A Street
id385
s02
B Street
id332
s01
C Street
id423
s01
A Street
The problem is there is an inconsistency with the station name column (multiple names for one id) and I want to correct it based on the most popular name used with the same station id. For example, in the table above, all rows with station id = "s01" must have station name = "A Street" (since A Street occurred 2 times and C Street occurred only once).
The result should look like this:
trip_id
station_id
station_name
id123
s01
A Street
id385
s02
B Street
id332
s01
A Street
id423
s01
A Street
All I'm able to do so far is to extract a list of station id with more than 1 name:
dupl_list <- unique(df1[,c("station_id","station_name")]) %>% group_by (station_id) %>% count() %>% filter(n>1)
Thx for reading

Using base R you can do this:
# data
df1 <- data.frame(trip_id=c('id123', 'id332', 'id385', 'id423'),
station_id=c('s01', 's02', 's01', 's01'),
station_name=c('A Street', 'B Street', 'C Street', 'A Street'),
date_of_trip=c(1, 1, 3, 2))
# most common name for each id (alphabetically lowest in case of ties)
id.name <- tapply(df1$station_name, df1$station_id, function(x) {
tab <- table(x)
names(which.max(tab))
})
df1$station_name <- id.name[df1$station_id]
dplyr way:
df1 <- df1 %>%
group_by(station_id) %>%
mutate(station_name = names(which.max(table(station_name))))
According to the most recent trip:
df1 <- df1 %>%
group_by(station_id) %>%
mutate(station_name = station_name[which.max(date_of_trip)])

Related

extracting data from a data frame row, performing internal lookup, and restructuring into long format

I've been beating my head against this for awhile and was hoping for some suggestions. I'm trying to extract semicolon delimited text from a row in a data frame, performing an internal lookup on a string in that row based on the extracted values, and then outputting that (along with another extracted variable) into a long format...and then repeating for every row in the data frame. I can do the first and last manipulations with str_split, and I think i could just loop everything with apply, but the internal lookup (join?) has me tied in knots. I'd like to imagine that i could do this w/ dplyr but
Starting with a data frame:
name<-"Adam, B.C.; Dave, E.F.; Gerald, H."
school<-"[Adam, B.C.; Gerald, H.]U.Penn; [Dave, E.F.]U.Georgia"
index<-12345
foo<-data.frame(name,school,index)
foo
name school index
1 Adam, B.C.; Dave, E.F.; Gerald, H. [Adam, B.C.; Gerald, H.]U.Penn; [Dave, E.F.]U.Georgia 12345
Desired output:
name school index
Adam, B.C. U.Penn 12345
Dave, E.F. U.Georgia 12345
Gerald, H. U.Penn 12345
etc. etc. etc.
thanks!
A mixture of tidyr::separate() and tidyr::seperate_rows() could do the trick:
library(tidyverse)
foo |>
tidyr::separate_rows(school, sep = "\\[", convert = T) |>
tidyr::separate(col = school, into = c("name", "school"), sep = "]") |>
tidyr::separate_rows(name, sep = ";", convert = T) |>
slice(-1) |>
mutate(across(everything(), trimws)) |>
mutate(across(everything(), str_remove, ";" ))
Output:
# A tibble: 3 x 3
index name school
<chr> <chr> <chr>
1 12345 Adam, B.C. U.Penn
2 12345 Gerald, H. U.Penn
3 12345 Dave, E.F. U.Georgia

How can I modify this function to return a Unique ID generated from specific fields?

I am writing a Shiny app that writes data from a mock google form into a google Sheet. I want to create a function that uses String modifications to insert a Unique ID based on the user input. This function will be called locally upon downloading the file and reproducibility is very important..so I think it should be dependent on the input and not randomly generated like in the "ids" package...
Here is the sample code for the function:
#Unique_id produces a 9 digit code by calling phone number, timestamp, and two letters - first and last name
unique_id <- function(f, l, y, z){
##where f is the column of FIRST NAME
f %>% str_replace(" ", "") %>% toupper() %>% str_sub(1, 2)
##where l is LAST NAME
l %>% str_replace(" ", "") %>% toupper() %>% str_sub(1, 2)
##Where y is TIMESTAMP
y %>% paste0() %>% str_extract("[:digit:][:digit:]-[:digit:][:digit:]") %>% str_replace_all("-", "")
##Where z is phone number formatted NNN-NNN-NNNN
z %>% str_extract("-[:digit:][:digit:][:digit:]-") %>% str_replace_all("-", "")
##Where UID is the UserID generated...
UID <- c(f, l , y, z)
UID <- str_replace("-", "")
return(UID)
}
I suspect that the last block is wrong... but it could all be wrong, I really am just winging this.
Here is a sample of the info that might be in a dataframe I want to call it on...
Phone Address FirstName LastName Timestamp
1 951-349-8967 2134 Road Road John jerrod 2018-09-14 20:09:38
2 342-651-3752 154 House St. Dora angela 2018-09-14 20:09:38
The ideal outcome of calling this function on row 1 would be a UID of JJ0914349. I would then insert it into a new column, the "UID" field.
Right now, I'm getting
Error: Empty `pattern` not supported
Any help is appreciated.
Since the required data are all in fixed positions:
DF %>% mutate(id = paste0(toupper(substr(FirstName, 1, 1)),
toupper(substr(LastName, 1, 1)),
substr(Timestamp, 6, 7),
substr(Timestamp, 9, 10),
substr(Phone, 5, 7)))
giving:
Phone Address FirstName LastName Timestamp id
1 951-349-8967 2134 Road Road John jerrod 2018-09-14 20:09:38 JJ0914349
2 342-651-3752 154 House St. Dora angela 2018-09-14 20:09:38 DA0914651
Note
We used this data:
Lines <- "Phone, Address, FirstName, LastName, Timestamp
1,951-349-8967, 2134 Road Road, John, jerrod, 2018-09-14 20:09:38
2,342-651-3752, 154 House St., Dora, angela, 2018-09-14 20:09:38"
DF <- read.csv(text = Lines, as.is = TRUE, strip.white = TRUE)

Deleting duplicates in R, changing remainder

I have a fairly straightforward question, but very new to R and struggling a little. Basically I need to delete duplicate rows and then change the remaining unique row based on the number of duplicates that were deleted.
In the original file I have directors and the company boards they sit on, with directors appearing as a new row for each company. I want to have each director appear only once, but with column that lists the number of their board seats (so 1 + the number of duplicates that were removed) and a column that lists the names of all companies on which they sit.
So I want to go from this:
To this
Bonus if I can also get the code to list the directors "home company" as the company on which she/he is an executive rather than outsider.
Thanks so very much in advance!
N
You could use the ddply function from plyr package
#First I will enter a part of your original data frame
Name <- c('Abbot, F', 'Abdool-Samad, T', 'Abedian, I', 'Abrahams, F', 'Abrahams, F', 'Abrahams, F')
Position <- c('Executive Director', 'Outsider', 'Outsider', 'Executive Director','Outsider', 'Outsider')
Companies <- c('ARM', 'R', 'FREIT', 'FG', 'CG', 'LG')
NoBoards <- c(1,1,1,1,1,1)
df <- data.frame(Name, Position, Companies, NoBoards)
# Then you could concatenate the Positions and Companies for each Name
library(plyr)
sumPosition <- ddply(df, .(Name), summarize, Position = paste(Position, collapse=", "))
sumCompanies <- ddply(df, .(Name), summarize, Companies = paste(Companies, collapse=", "))
# Merge the results into a one data frame usin the name to join them
df2 <- merge(sumPosition, sumCompanies, by = 'Name')
# Summarize the number of oBoards of each Name
names_NoBoards <- aggregate(df$NoBoards, by = list(df$Name), sum)
names(names_NoBoards) <- c('Name', 'NoBoards')
# Merge the result whit df2
df3 <- merge(df2, names_NoBoards, by = 'Name')
You get something like this
Name Position Companies NoBoards
1 Abbot, F Executive Director ARM 1
2 Abdool-Samad, T Outsider R 1
3 Abedian, I Outsider FREIT 1
4 Abrahams, F Executive Director, Outsider, Outsider FG, CG, LG 3
In order to get a list the directors "home company" as the company on which she/he is an executive rather than outsider. You could use the next code
ExecutiveDirector <- df[Position == 'Executive Director', c(1,3)]
df4 <- merge(df3, ExecutiveDirector, by = 'Name', all.x = TRUE)
You get the next data frame
Name Position Companies.x NoBoards Companies.y
1 Abbot, F Executive Director ARM 1 ARM
2 Abdool-Samad, T Outsider R 1 <NA>
3 Abedian, I Outsider FREIT 1 <NA>
4 Abrahams, F Executive Director, Outsider, Outsider FG, CG, LG 3 FG

Merge dataframes based on regex condition

This problem involves R. I have two dataframes, represented by this minimal reproducible example:
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"), county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"), jobs = c("4", "204"))
An example to help communicate the very specific operation I am trying to perform: the geocode_selector column in dataframe a contains the FIPS county codes of the five boroughs of NY. The geocode column in dataframe b is the 15-digit ID of a specific Census block. The first five digits of a geocode match a more general geocode_selector, indicating which county the Census block is located in. I want to add a column to b specifying which county each census block falls under, based on which geocode_selector each geocode in b matches with.
Generally, I'm trying to merge dataframes based on a regex condition. Ideally, I'd like to perform a full merge carrying all of the columns of a over to b and not just the county_name.
I tried something along the lines of:
b[, "county_name"] <- NA
for (i in 1:nrow(b)) {
for (j in 1:nrow(a)) {.
if (grepl(data.a$geocode_selector[j], b$geocode[i]) == TRUE) {
b$county_name[i] <- a$county_name[j]
}
}
}
but it took an extremely long time for the large datasets I am actually processing and the finished product was not what I wanted.
Any insight on how to merge dataframes conditionally based on a regex condition would be much appreciated.
You could do this...
b$geocode_selector <- substr(b$geocode,1,5)
b2 <- merge(b, a, all.x=TRUE) #by default it will merge on common column names
b2
geocode_selector geocode jobs county_name
1 36005 360050002001002 4 Bronx
2 36085 360850323001019 204 Richmond
If you wish, you can delete the geocode_selector column from b2 with b2[,1] <- NULL
We can use sub to create the 'geocode_selector' and then do the join
library(data.table)
setDT(a)[as.data.table(b)[, geocode_selector := sub('^(.{5}).*', '\\1', geocode)],
on = .(geocode_selector)]
# geocode_selector county_name geocode jobs
#1: 36005 Bronx 360050002001002 4
#2: 36085 Richmond 360850323001019 204
This is a great opportunity to use dplyr. I also tend to like the string handling functions in stringr, such as str_sub.
library(dplyr)
library(stringr)
a <- data_frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data_frame(geocode = c("360050002001002", "360850323001019"),
jobs = c("4", "204"))
b %>%
mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
inner_join(a, by = "geocode_selector")
#> # A tibble: 2 x 4
#> geocode jobs geocode_selector county_name
#> <chr> <chr> <chr> <chr>
#> 1 360050002001002 4 36005 Bronx
#> 2 360850323001019 204 36085 Richmond

How to search part of string that contain in a list of string, and return the matched one in R

The following data frame contain a "Campaign" column, the value of column contain information about season, name, and position, however, the order of these information are quiet different in each row. Lucky, these information is a fixed list, so we could create a vector to match the string inside the "Campaign_name" column.
Date Campaign
1 Jan-15 Summer|Peter|Up
2 Feb-15 David|Winter|Down
3 Mar-15 Up|Peter|Spring
Here is what I want to do, I want to create 3 columns as Name, Season, Position. So these column can search the string inside the campaign column and return the matched value from the list below.
Name <- c("Peter, David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
So my desired result would be following
Temp
Date Campaign Name Season Position
1 15-Jan Summer|Peter|Up Peter Summer Up
2 15-Feb David|Winter|Down David Winter Down
3 15-Mar Up|Peter|Spring Peter Spring Up
Another way:
L <- strsplit(df$Campaign,split = '\\|')
df$Name <- sapply(L,intersect,Name)
df$Season <- sapply(L,intersect,Season)
df$Position <- sapply(L,intersect,Position)
Do the following:
Date = c("Jan-15","Feb-15","Mar-15")
Campaign = c("Summer|Peter|Up","David|Winter|Down","Up|Peter|Spring")
df = data.frame(Date,Campaign)
Name <- c("Peter", "David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
for(k in Name){
df$Name[grepl(pattern = k, x = df$Campaign)] <- k
}
for(k in Season){
df$Season[grepl(pattern = k, x = df$Campaign)] <- k
}
for(k in Position){
df$Position[grepl(pattern = k, x = df$Campaign)] <- k
}
This gives:
> df
Date Campaign Name Season Position
1 Jan-15 Summer|Peter|Up Peter Summer Up
2 Feb-15 David|Winter|Down David Winter Down
3 Mar-15 Up|Peter|Spring Peter Spring Up
I had the same idea as Marat Talipov; here's a data.table option:
library(data.table)
Name <- c("Peter", "David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
dat <- data.table(Date=c("Jan-15", "Feb-15", "Mar-15"),
Campaign=c("Summer|Peter|Up", "David|Winter|Down", "Up|Peter|Spring"))
Gives
> dat
Date Campaign
1: Jan-15 Summer|Peter|Up
2: Feb-15 David|Winter|Down
3: Mar-15 Up|Peter|Spring
Processing is then
dat[ , `:=`(Name = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Name),
Season = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Season),
Position = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Position))
]
Result:
> dat
Date Campaign Name Season Position
1: Jan-15 Summer|Peter|Up Peter Summer Up
2: Feb-15 David|Winter|Down David Winter Down
3: Mar-15 Up|Peter|Spring Peter Spring Up
Maybe there's some benefit if you're doing this to a lot of columns or need to modify in place (by reference).
I'm interested if anyone can show me how to update all three columns at once.
EDIT: Never mind, figured it out;
for (icol in c("Name", "Season", "Position"))
dat[, (icol):=sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, get(icol))]

Resources