Combine every two rows of data in R

I have a CSV file that I have read in, and I now need to combine every two rows. There are 2000 rows in total, which I need to reduce to 1000. Each pair of rows has the same account number in one column and an address split across the two rows in another. In other words, each observation takes up two rows, and I want to combine the two address rows into one. For example, rows 1 and 2 are both Acct# 1234 and contain 123 Hollywood Blvd and LA California 90028, respectively.

Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)
df %>%
  group_by(Acct) %>%
  summarise(Address = str_c(Address, collapse = " "))
# A tibble: 2 × 2
   Acct Address
  <dbl> <chr>
1  1234 123 Hollywood Blvd LA California 90028
2  4321 55 Park Avenue NY New York State 6666
Data:
df <- data.frame(
  Acct = c(1234, 1234, 4321, 4321),
  Address = c("123 Hollywood Blvd", "LA California 90028",
              "55 Park Avenue", "NY New York State 6666")
)

It can be fairly simple with the data.table package:
# assuming `dataset` is the name of your dataset, the account number column is called 'actN', and the address column is called 'adr'
library(data.table)
# group by the account number (not the address) so both address rows are pasted together
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
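For readers who prefer to avoid extra packages, here is a minimal base R sketch of the same idea. It assumes the column names `Acct` and `Address` from the sample data above, and the second option additionally assumes that the rows really do come in consecutive pairs, as the question describes.

```r
# Sample data mirroring the question: two rows per account (illustrative values).
df <- data.frame(
  Acct = c(1234, 1234, 4321, 4321),
  Address = c("123 Hollywood Blvd", "LA California 90028",
              "55 Park Avenue", "NY New York State 6666"),
  stringsAsFactors = FALSE
)

# Option 1: collapse the address rows that share an account number.
by_acct <- aggregate(Address ~ Acct, data = df, FUN = paste, collapse = " ")

# Option 2: pair rows purely by position (each odd row with the row after it).
# This only makes sense if every observation occupies exactly two rows.
odd <- seq(1, nrow(df), by = 2)
by_pos <- data.frame(
  Acct = df$Acct[odd],
  Address = paste(df$Address[odd], df$Address[odd + 1]),
  stringsAsFactors = FALSE
)

by_acct
by_pos
```

Grouping by account number is the safer choice when an account might occasionally have one or three rows; pairing by position is only appropriate when the two-row layout is guaranteed.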

Related

Combine two rows in R that are separated

I am trying to clean the dataset so that all data is in its appropriate cell by combining rows since they are oddly separated. There is an obvious nuance to the dataset in that there are some rows that are correctly coded and there are some that are not.
Here is an example of the data:
Rank  Store  City           Address                      Transactions  Avg. Value  Dollar Sales Amt.
40    1404   State College  Hamilton Square Shop Center
                            230 W Hamilton Ave           155548        52.86       8263499
41    2310   Springfield    149 Baltimore Pike           300258        27.24       8211137
42    2514   Erie           Yorktown Centre
                            2501 West 12th Street        190305        41.17       7862624
Here is an example of how I want the data:
Rank  Store  City           Address                                          Transactions  Avg. Value  Dollar Sales Amt.
40    1404   State College  Hamilton Square Shop Center, 230 W Hamilton Ave  155548        52.86       8263499
41    2310   Springfield    149 Baltimore Pike                               300258        27.24       8211137
42    2514   Erie           Yorktown Centre, 2501 West 12th Street           190305        41.17       7862624
Is there an Excel or R function to fix this, or does anyone know how to write an R function to correct this?
I read up on the CONCATENATE function in Excel and realized it was not going to accomplish anything. I figured an R function would be the only way to fix this.
The CONCATENATE function will work here. Alternatively, in Excel, select the columns and use the merge formula under the formula option to complete the task.
I recommend checking how the file is being parsed.
From the example data you provided, it looks like the address column is being
split on ", " and going to the next line.
Based on this assumption alone, below is a potential solution using the
tidyverse:
library(tidyverse)
original_data <- tibble(Rank = c(40, NA, 41, 42, NA),
                        Store = c(1404, NA, 2310, 2514, NA),
                        City = c("State College", NA, "Springfield", "Erie", NA),
                        Address = c("Hamilton Square Shop Center",
                                    "230 W Hamilton Ave", "149 Baltimore Pike",
                                    "Yorktown Centre", "2501 West 12th Street"),
                        Transactions = c(NA, 155548, 300258, NA, 190305),
                        `Avg. Value` = c(NA, 52.86, 27.24, NA, 41.17),
                        `Dollar Sales Amt.` = c(NA, 8263499, 8211137, NA, 7862624))
new_data <- original_data %>%
  fill(Rank:City) %>%                         # carry the keys down onto the continuation rows
  group_by(across(Rank:City)) %>%
  mutate(Address1 = lag(Address)) %>%         # first address fragment, if any
  slice(n()) %>%                              # keep the last row of each record
  ungroup() %>%
  mutate(Address = if_else(is.na(Address1), Address,
                           str_c(Address1, Address, sep = ", "))) %>%
  select(Rank:`Dollar Sales Amt.`)
new_data
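The same "fill down, then collapse" idea can be sketched in base R. This is only an illustration under the same assumption as above (a continuation row has a missing Rank); the reduced column set and the helper `first_non_na` are made up for the example.

```r
# Reduced version of the sample data: a missing Rank marks a continuation row.
original_data <- data.frame(
  Rank = c(40, NA, 41, 42, NA),
  Address = c("Hamilton Square Shop Center", "230 W Hamilton Ave",
              "149 Baltimore Pike",
              "Yorktown Centre", "2501 West 12th Street"),
  Transactions = c(NA, 155548, 300258, NA, 190305),
  stringsAsFactors = FALSE
)

# A new record starts wherever Rank is present; cumsum() turns that into a record id.
record_id <- cumsum(!is.na(original_data$Rank))   # 1 1 2 3 3

# Hypothetical helper: the first non-missing value within a record.
first_non_na <- function(x) x[!is.na(x)][1]

combined <- data.frame(
  Rank = as.vector(tapply(original_data$Rank, record_id, first_non_na)),
  Address = as.vector(tapply(original_data$Address, record_id,
                             paste, collapse = ", ")),
  Transactions = as.vector(tapply(original_data$Transactions, record_id,
                                  first_non_na)),
  stringsAsFactors = FALSE
)
combined
```

Because the record id is derived from the Rank column alone, this sketch also copes with addresses that spill over more than two rows.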

how to find top highest number in R

I'm new to R. I want to find code for this question: display the city name and the total attendance of the five top-attendance stadiums. I have a dataframe called worldcupmatches. Please, if anyone can help me out.
Since you have not provided us a subset of your data (which is strongly recommended), I will create a tiny dataset with city names and attendance like so:
df <- data.frame(city = c("London", "Liverpool", "Manchester", "Birmingham"),
                 attendance = c(2390, 1290, 8734, 5433))
Then your problem can easily be solved. For example, one of the base R approaches is:
df[order(df$attendance, decreasing = TRUE), ]
You could also use dplyr which makes things look a little tidier:
library(dplyr)
df %>% arrange(desc(attendance))
The output of both methods is your original data, ordered from the highest to the lowest attendance:
city attendance
3 Manchester 8734
4 Birmingham 5433
1 London 2390
2 Liverpool 1290
If you specifically want to display a certain number of cities (or stadiums) with top highest attendance, you could do:
df[order(df$attendance, decreasing = TRUE), ][1:3, ] # 1:3 takes the top 3 stadiums
city attendance
3 Manchester 8734
4 Birmingham 5433
1 London 2390
Again, the dplyr approach looks much nicer:
df %>% slice_max(n = 3, order_by = attendance)
city attendance
1 Manchester 8734
2 Birmingham 5433
3 London 2390
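The question asks for the total attendance, which suggests that a city can appear in several rows. A minimal sketch of summing first and ranking afterwards follows; the column names `city` and `attendance` and the numbers are invented for illustration, since the structure of worldcupmatches was not shown.

```r
# Made-up match data: the same city can host several matches.
matches <- data.frame(
  city = c("London", "Liverpool", "London", "Manchester", "Liverpool", "Leeds"),
  attendance = c(100, 50, 200, 120, 30, 10),
  stringsAsFactors = FALSE
)

# Step 1: total attendance per city.
totals <- aggregate(attendance ~ city, data = matches, FUN = sum)

# Step 2: sort by the total and keep the top n rows (n = 2 here, 5 in the question).
n <- 2
top <- head(totals[order(totals$attendance, decreasing = TRUE), ], n)
top
# London (300) comes first, followed by Manchester (120)
```

Using `head()` rather than `[1:n, ]` avoids rows of NA when the data has fewer than n cities.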

Separate a string of multiple dates and names in R

I have a dataframe with 2 columns, where the first column lists companies and the second column are strings of multiple dates and company names as follows:
data=data.frame('Company'=(c("A","B","C")),
'Bank'=c("1/13/2020 Bank A 5/12/2020 Bank H C 11/9/2020 HelloBank",
"2/14/2020 HopeBank 1/9/2020 Liberty Bank SA",
"10/18/2020 Securities"))
I would like to separate column "Bank" into multiple columns of Dates and Bank Names, such that:
data=data.frame('Company'=(c("A","B","C")),
"Date1"=(c("1/13/2020","2/14/2020","10/18/2020")),
'Bank1'=c("Bank A", "HopeBank","Securities"),
"Date2"=(c("5/12/2020","1/9/2020",NA)),
'Bank2'=c("Bank H C", "Liberty Bank SA",NA),
"Date3"=(c("11/9/2020 ",NA,NA)),
'Bank3'=c("HelloBank", NA,NA))
I have tried using library(stringr) but the formats of the dates are not consistent. Also, I do not know how many variables I will need in the final dataframe, and some of the strings in the "Bank" column are very long (up to 824 nchar).
I have also tried using separate from tidyr but without success.
Here is a base R option using strsplit:
v <- strsplit(data$Bank, "\\s(?=(\\d+\\/))|(?<=\\d)\\s", perl = TRUE)
data <- cbind(
  data[1],
  `colnames<-`(
    do.call(rbind, lapply(v, `length<-`, max(lengths(v)))),
    paste0(c("Date", "Bank"), rep(1:(max(lengths(v)) / 2), each = 2))
  )
)
which gives
> data
Company Date1 Bank1 Date2 Bank2 Date3 Bank3
1 A 1/13/2020 Bank A 5/12/2020 Bank H C 11/9/2020 HelloBank
2 B 2/14/2020 HopeBank 1/9/2020 Liberty Bank SA <NA> <NA>
3 C 10/18/2020 Securities <NA> <NA> <NA> <NA>
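To see what the regular expression is doing, it can help to run the split on a single string. The pattern has two alternatives: whitespace that is followed by digits and a slash (the start of a date), or whitespace that is preceded by a digit (the end of a date). The sketch below uses the first sample string from the question.

```r
x <- "1/13/2020 Bank A 5/12/2020 Bank H C 11/9/2020 HelloBank"

# Split before each date (lookahead) and after each date (lookbehind).
parts <- strsplit(x, "\\s(?=(\\d+\\/))|(?<=\\d)\\s", perl = TRUE)[[1]]
parts

# Dates land in the odd positions and bank names in the even ones.
dates <- parts[c(TRUE, FALSE)]
banks <- parts[c(FALSE, TRUE)]
```

Note the assumption built into the pattern: a bank name that contains a digit followed by a space (for example a hypothetical "Bank 24 Plus") would be split in the wrong place.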
If you don't know how many banks there might be in each row, you are better off creating a dataframe in long format. Something like this will do it, using the tidyverse...
library(tidyverse)
data_long <- data %>%
  mutate(Bank = str_replace_all(Bank, "( \\d+/)", "#\\1"), # add markers between banks
         Bank = str_split(Bank, "#")) %>%                  # split at markers
  unnest(Bank) %>%                                         # convert to one row per entry
  mutate(Bank = str_squish(Bank)) %>%                      # trim white space
  separate(Bank, into = c("Date", "BankName"), sep = " ", extra = "merge")
data_long
Company Date BankName
<chr> <chr> <chr>
1 A 1/13/2020 Bank A
2 A 5/12/2020 Bank H C
3 A 11/9/2020 HelloBank
4 B 2/14/2020 HopeBank
5 B 1/9/2020 Liberty Bank SA
6 C 10/18/2020 Securities
You might then want to convert Date into date format.
If you really want it in wide format, use pivot_wider.
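A minimal sketch of the date conversion in base R, assuming every date follows the month/day/year pattern seen in the sample. The question mentions that the real formats are not consistent, so any value that does not fit the pattern will come back as NA and should be inspected.

```r
# Date strings as they appear in the long-format Date column.
date_strings <- c("1/13/2020", "5/12/2020", "11/9/2020", "2/14/2020")

# %m/%d/%Y = month/day/four-digit year; non-matching strings become NA.
parsed <- as.Date(date_strings, format = "%m/%d/%Y")
parsed

# Quick check for entries that failed to parse.
failed <- date_strings[is.na(parsed)]
```

Once the column is of class Date, sorting and date arithmetic behave as expected.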

R observation strs split - multiple value in columns

I have a dataframe in R concerning houses. This is a small sample:
Address Type Rent
Glasgow;Scotland House 1500
High Street;Edinburgh;Scotland Apartment 1000
Dundee;Scotland Apartment 800
South Street;Dundee;Scotland House 900
I would like to just pull out the last two instances of the Address column into a City and County column in my dataframe.
I have used mutate and strsplit to split this column by:
data <- mutate(dataframe, split_add = strsplit(dataframe$Address, ";"))
I now have a new column in my dataframe which resembles the following:
split_add
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
How do I extract the last 2 instances of each of these vector observations into columns "City" and "County"?
I attempted:
data <- mutate(data, city = split_add[-2])
thinking it would take the second instance from the end of the vectors, but this did not work.
Using tidyr::separate() with the fill = "left" option is probably your best bet:
dataframe <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Address Type Rent
Glasgow;Scotland House 1500
'High Street;Edinburgh;Scotland' Apartment 1000
Dundee;Scotland Apartment 800
'South Street;Dundee;Scotland' House 900
")
library(tidyr)
separate(dataframe, Address, into = c("Street", "City", "County"),
sep = ";", fill = "left")
# Street City County Type Rent
# 1 <NA> Glasgow Scotland House 1500
# 2 High Street Edinburgh Scotland Apartment 1000
# 3 <NA> Dundee Scotland Apartment 800
# 4 South Street Dundee Scotland House 900
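If you would rather keep the strsplit() approach from the question, here is a hedged base R sketch of pulling the last two elements out of each vector. It assumes every address has at least two parts; a one-part address would make vapply() stop with an error, which is arguably preferable to silently misaligned columns.

```r
address <- c("Glasgow;Scotland",
             "High Street;Edinburgh;Scotland",
             "Dundee;Scotland",
             "South Street;Dundee;Scotland")

# One character vector per row.
parts <- strsplit(address, ";", fixed = TRUE)

# tail(p, 2) keeps the last two elements of each vector;
# vapply() returns a 2 x n matrix, so transpose it to n x 2.
last_two <- t(vapply(parts, function(p) tail(p, 2), character(2)))

result <- data.frame(City = last_two[, 1], County = last_two[, 2],
                     stringsAsFactors = FALSE)
result
```

The attempt in the question, `split_add[-2]`, drops the second row of the list column rather than indexing from the end of each vector, which is why the per-row function is needed.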
Here is another way of dealing with this problem.
1. Create a dataframe from the split_add data:
test_data <- data.frame(address = c("Glasgow, Scotland",
                                    "High Street, Edinburgh, Scotland",
                                    "Dundee, Scotland",
                                    "South Street, Dundee, Scotland"),
                        stringsAsFactors = FALSE)
2. Use separate() from tidyr to split the column. Rows with only two parts get NA in c3 (with a warning):
library(tidyr)
new_test <- test_data %>% separate(address, c("c1", "c2", "c3"), sep = ", ")
3. Use dplyr and ifelse() to keep only the last two parts:
library(dplyr)
new_test %>%
  mutate(city = ifelse(is.na(c3), c1, c2), county = ifelse(is.na(c3), c2, c3)) %>%
  select(city, county)
The final data contains only the city and county columns.
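Assuming that you're using dplyr, extract the elements row by row, since tail() applied directly to the list column would act on the column as a whole rather than on each vector:
data <- mutate(dataframe,
               split_add = strsplit(Address, ";"),
               City = sapply(split_add, function(x) tail(x, 2)[1]),
               County = sapply(split_add, function(x) tail(x, 1)))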

Merge dataframes based on regex condition

This problem involves R. I have two dataframes, represented by this minimal reproducible example:
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"), county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"), jobs = c("4", "204"))
An example to help communicate the very specific operation I am trying to perform: the geocode_selector column in dataframe a contains the FIPS county codes of the five boroughs of NY. The geocode column in dataframe b is the 15-digit ID of a specific Census block. The first five digits of a geocode match a more general geocode_selector, indicating which county the Census block is located in. I want to add a column to b specifying which county each census block falls under, based on which geocode_selector each geocode in b matches with.
Generally, I'm trying to merge dataframes based on a regex condition. Ideally, I'd like to perform a full merge carrying all of the columns of a over to b and not just the county_name.
I tried something along the lines of:
b[, "county_name"] <- NA
for (i in 1:nrow(b)) {
  for (j in 1:nrow(a)) {
    if (grepl(a$geocode_selector[j], b$geocode[i]) == TRUE) {
      b$county_name[i] <- a$county_name[j]
    }
  }
}
but it took an extremely long time for the large datasets I am actually processing and the finished product was not what I wanted.
Any insight on how to merge dataframes conditionally based on a regex condition would be much appreciated.
You could do this...
b$geocode_selector <- substr(b$geocode,1,5)
b2 <- merge(b, a, all.x=TRUE) #by default it will merge on common column names
b2
geocode_selector geocode jobs county_name
1 36005 360050002001002 4 Bronx
2 36085 360850323001019 204 Richmond
If you wish, you can delete the geocode_selector column from b2 with b2[,1] <- NULL
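If only a lookup is needed rather than a full merge, a vectorised match() avoids both the nested loop and the join. The following is a minimal base R sketch using the sample data from the question; it assumes, as the question states, that the selector is always the first five characters of the geocode.

```r
a <- data.frame(
  geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
  county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  geocode = c("360050002001002", "360850323001019"),
  jobs = c("4", "204"),
  stringsAsFactors = FALSE
)

# Position in `a` of each block's five-character county prefix (NA if absent).
idx <- match(substr(b$geocode, 1, 5), a$geocode_selector)

# Pull the county name across; any other column of `a` can be added the same way.
b$county_name <- a$county_name[idx]
b
```

Both substr() and match() work on whole vectors at once, so this does a single pass over `b` instead of comparing every row of `b` against every row of `a`.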
We can use sub to create the 'geocode_selector' and then do the join
library(data.table)
setDT(a)[as.data.table(b)[, geocode_selector := sub('^(.{5}).*', '\\1', geocode)],
on = .(geocode_selector)]
# geocode_selector county_name geocode jobs
#1: 36005 Bronx 360050002001002 4
#2: 36085 Richmond 360850323001019 204
This is a great opportunity to use dplyr. I also tend to like the string handling functions in stringr, such as str_sub.
library(dplyr)
library(stringr)
# tibble() replaces the deprecated data_frame()
a <- tibble(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
            county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- tibble(geocode = c("360050002001002", "360850323001019"),
            jobs = c("4", "204"))
b %>%
  mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
  inner_join(a, by = "geocode_selector")
#> # A tibble: 2 x 4
#> geocode jobs geocode_selector county_name
#> <chr> <chr> <chr> <chr>
#> 1 360050002001002 4 36005 Bronx
#> 2 360850323001019 204 36085 Richmond
