I am trying to clean the dataset so that all data is in its appropriate cell by combining rows since they are oddly separated. There is an obvious nuance to the dataset in that there are some rows that are correctly coded and there are some that are not.
Here is an example of the data:
Rank
Store
City
Address
Transactions
Avg. Value
Dollar Sales Amt.
40
1404
State College
Hamilton Square Shop Center
230 W Hamilton Ave
155548
52.86
8263499
41
2310
Springfield
149 Baltimore Pike
300258
27.24
8211137
42
2514
Erie
Yorktown Centre
2501 West 12th Street
190305
41.17
7862624
Here is an example of how I want the data:
Rank
Store
City
Address
Transactions
Avg. Value
Dollar Sales Amt.
40
1404
State College
Hamilton Square Shop Center, 230 W Hamilton Ave
155548
52.86
8263499
41
2310
Springfield
149 Baltimore Pike
300258
27.24
8211137
42
2514
Erie
Yorktown Centre, 2501 West 12th Street
190305
41.17
7862624
Is there an Excel or R function to fix this, or does anyone know how to write an R function to correct this?
I read up on the CONCATENATE function in Excel and realized it was not going to accomplish anything. I figured an R function would be the only way to fix this.
The CONCATENATE function will work here, or in Excel you can select the columns and complete the task using the merge formula in the Formulas option.
I recommend checking how the file is being parsed. From the example data you provided, it looks like the address column is being split on ", " with the remainder pushed to the next line. Based on this assumption alone, below is a potential solution using the tidyverse:
library(tidyverse)
original_data <- tibble(
  Rank = c(40, NA, 41, 42, NA),
  Store = c(1404, NA, 2310, 2514, NA),
  City = c("State College", NA, "Springfield", "Erie", NA),
  Address = c("Hamilton Square Shop Center", "230 W Hamilton Ave",
              "149 Baltimore Pike", "Yorktown Centre", "2501 West 12th Street"),
  Transactions = c(NA, 155548, 300258, NA, 190305),
  `Avg. Value` = c(NA, 52.86, 27.24, NA, 41.17),
  `Dollar Sales Amt.` = c(NA, 8263499, 8211137, NA, 7862624)
)

new_data <- original_data %>%
  fill(Rank:City) %>%
  group_by_at(vars(Rank:City)) %>%
  mutate(Address1 = lag(Address)) %>%
  slice(n()) %>%
  ungroup() %>%
  mutate(Address = if_else(is.na(Address1), Address,
                           str_c(Address1, Address, sep = ", "))) %>%
  select(Rank:`Dollar Sales Amt.`)

new_data
Related
I have a csv file that I have read in, but I now need to combine every two rows together. There are 2000 rows in total, which I need to reduce to 1000. Every two rows share the same account number in one column and have the address split across two rows in another. Two rows are taken up by each observation, and I want to combine the two address rows into one. For example, rows 1 and 2 are Acct# 1234 and have "123 Hollywood Blvd" and "LA California 90028" on their own lines, respectively.
Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)
df %>%
  group_by(Acct) %>%
  summarise(Address = str_c(Address, collapse = " "))
# A tibble: 2 × 2
Acct Address
<dbl> <chr>
1 1234 123 Hollywood Blvd LA California 90028
2 4321 55 Park Avenue NY New York State 6666
Data:
df <- data.frame(
  Acct = c(1234, 1234, 4321, 4321),
  Address = c("123 Hollywood Blvd", "LA California 90028",
              "55 Park Avenue", "NY New York State 6666")
)
It can be fairly simple with the data.table package:
# assuming `dataset` is the name of your dataset, the account-number column is 'actN' and the address column is 'adr'
library(data.table)
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
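For instance, on a hypothetical two-rows-per-account dataset like the one in the previous answer (the `actN`/`adr` column names are just placeholders for whatever your file actually uses):

```r
library(data.table)

# Hypothetical input: two rows per account, with the address split across them
dataset <- data.frame(
  actN = c(1234, 1234, 4321, 4321),
  adr = c("123 Hollywood Blvd", "LA California 90028",
          "55 Park Avenue", "NY New York State 6666")
)

# Collapse the two address rows into one row per account
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
dataset2
```

This returns one row per account, with `whole` holding the combined address string.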
I'm new to R coding and am looking for code to answer this question: display the city name and the total attendance of the five top-attendance stadiums. I have a dataframe, worldcupmatches. Please, if anyone can help me out.
Since you have not provided us with a subset of your data (which is strongly recommended), I will create a tiny dataset with city names and attendance figures like so:
df = data.frame(city = c("London", "Liverpool", "Manchester", "Birmingham"),
attendance = c(2390, 1290, 8734, 5433))
Then your problem can easily be solved. For example, one of the base R approaches is:
df[order(df$attendance, decreasing = TRUE), ]
You could also use dplyr, which makes things look a little tidier:
library(dplyr)
df %>% arrange(desc(attendance))
Both methods output your original data, ordered from the highest to the lowest attendance:
city attendance
3 Manchester 8734
4 Birmingham 5433
1 London 2390
2 Liverpool 1290
If you specifically want to display a certain number of cities (or stadiums) with the highest attendance, you could do:
df[order(df$attendance, decreasing = TRUE), ][1:3, ] # 1:3 takes the top 3 stadiums
city attendance
3 Manchester 8734
4 Birmingham 5433
1 London 2390
Again, the dplyr approach looks much nicer:
df %>% slice_max(n = 3, order_by = attendance)
city attendance
1 Manchester 8734
2 Birmingham 5433
3 London 2390
Learning to webscrape in R from a list of contacts on this webpage: https://ern-euro-nmd.eu/board-members/
There are 65 rows (contacts) and should be 3 columns of associated details (Name, institution, and location). Here is a copy/paste of one row of data from the webpage:
Adriano Chio
Azienda Ospedaliero Universitaria Città della Salute e della Scienza
Italy
My current approach lumps all the details into one column. How can I split the data into 3 columns? There appears to be only white space between these details on the webpage, and I'm not sure what to do.
# Below is my R code:
library(rvest)
EURONMD_KOLs <- read_html("https://ern-euro-nmd.eu/board-members/") %>%
  html_nodes(".detailsListing") %>%
  html_text()
EURONMD_KOLs
EURONMD_KOLs_table <- data.frame(EURONMD_KOLs)
#end of R code
My resulting table lumps everything into one column. I need to separate each row of data into 3 columns. Any help is appreciated.
Remove the leading and trailing newline characters from the text, split on '\n', and create a 3-column dataframe.
library(rvest)
read_html("https://ern-euro-nmd.eu/board-members/") %>%
  html_nodes(".detailsListing") %>%
  html_text() %>%
  trimws() %>%
  strsplit('\n+') %>%
  do.call(rbind, .) %>%
  as.data.frame() %>%
  setNames(c('Name', 'institution', 'location')) -> result
head(result)
# Name institution location
#1 Adriano Chio Azienda Ospedaliero Universitaria Città della Salute e della Scienza Italy
#2 Alessandra Ferlini University Hospital St. Anna Italy
#3 Andrés Nascimento Hospital Sant Juan de Déu Universidad de Barcelona Spain
#4 Angela Kaindl Charité - Universitätsmedizin Berlin Germany
#5 Anna Kostera-Pruszczyk SPCSK, Medical University of Warsaw Poland
#6 Anneke van der Kooi Academic Medical Centre Netherlands
I have a dataframe in R concerning houses. This is a small sample:
Address Type Rent
Glasgow;Scotland House 1500
High Street;Edinburgh;Scotland Apartment 1000
Dundee;Scotland Apartment 800
South Street;Dundee;Scotland House 900
I would like to just pull out the last two instances of the Address column into a City and County column in my dataframe.
I have used mutate and strsplit to split this column by:
data <- mutate(dataframe, split_add = strsplit(dataframe$Address, ";"))
I now have a new column in my dataframe which resembles the following:
split_add
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
How do I extract the last 2 instances of each of these vector observations into "City" and "County" columns?
I attempted:
data <- mutate(data, city = split_add[-2])
thinking it would take the second instance from the end of each vector, but this did not work.
Using tidyr::separate() with the fill = "left" option is probably your best bet:
dataframe <- read.table(header = T, stringsAsFactors = F, text = "
Address Type Rent
Glasgow;Scotland House 1500
'High Street;Edinburgh;Scotland' Apartment 1000
Dundee;Scotland Apartment 800
'South Street;Dundee;Scotland' House 900
")
library(tidyr)
separate(dataframe, Address, into = c("Street", "City", "County"),
sep = ";", fill = "left")
# Street City County Type Rent
# 1 <NA> Glasgow Scotland House 1500
# 2 High Street Edinburgh Scotland Apartment 1000
# 3 <NA> Dundee Scotland Apartment 800
# 4 South Street Dundee Scotland House 900
I'm thinking about another way of dealing with this problem.
1. Create a dataframe with the split_add column data
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
test_data <- data.frame(
  address = c("Glasgow, Scotland",
              "High Street, Edinburgh, Scotland",
              "Dundee, Scotland",
              "South Street, Dundee, Scotland"),
  stringsAsFactors = FALSE
)
2. Use separate() from tidyr to split the column
library(tidyr)
new_test <- test_data %>% separate(address, c("c1", "c2", "c3"), sep = ", ")
3. Use dplyr and ifelse() to keep only the last two columns
library(dplyr)
new_test %>%
  mutate(city = ifelse(is.na(c3), c1, c2),
         county = ifelse(is.na(c3), c2, c3)) %>%
  select(city, county)
The final data contains just the city and county columns.
Assuming that you're using dplyr (note that split_add is a list column, so you need to apply tail() element-wise):
data <- mutate(dataframe, split_add = strsplit(Address, ";"),
               City = sapply(split_add, function(x) tail(x, 2)[1]),
               County = sapply(split_add, function(x) tail(x, 1)))
I'm looking for a way to match two different address data frames. Both contain a string of text (the 'Line' column in my example), a postcode/zip-code-type identifier (the 'PC' column) and a unique Ref or ID code. I would need the resulting matches to be in a new data frame with a format along the lines of: DF1$Line, DF1$PC, DF2$Line, DF2$PC, Ref, ID, plus some sort of numeric detailing the strength of the match (this is based on the example code below).
My actual dataset contains several thousand records, and I have been playing with the idea of using the 'PC' column to subset both datasets and then performing some sort of matching along these lines, but the resulting matches I get are completely wrong.
Here is a made-up dataset that resembles my data (in these examples the rows in each dataset correspond to each other, my real data is not formatted like this unfortunately).
DF1 <- data.frame(
  Line = c("64 London Street, Jasper", "46 London Road, Flat 2, Jasper",
           "99 York Parade, Yorkie", "99 Parade Road, Placename",
           "29 Road Street, Townplace", "92 Parade Street, Yorky"),
  PC = c("ZZ1 4TY", "ZZ1 4TY", "PP1 9TR", "ZZ1 4TY", "PP1 9TR", "PP1 9RT"),
  Ref = c("123451", "567348", "23412", "98734", "43223", "32453")
)
and
DF2 <- data.frame(
  Line = c("64 London St, Jasper", "Flat 2, 46 Road, London, Jasper",
           "99 York Parade, Yorky", "99 Parade Road, Placenames",
           "Flat 3, 29 Road Street, Townplace, Townplace", "92 Street, Parade, Yorkie"),
  PC = c("ZZ1 4TY", "ZZ1 4TY", "PP1 9TR", "ZZ1 4TY", "PP1 9TR", "PP1 9RT"),
  ID = c("ABGED", "GGFRW", "UYTER", "RTERF", "WERWE", "OYUIY")
)
Any help resolving this would be very much appreciated, as would any metric that helps me quantify how precise the matches are. Thanks.
Here is my base R solution; let me know if I got it right.
DF3 <- merge(DF1, DF2, by = "PC")
DF3[!duplicated(DF3$Ref) , ]
PC Line.x Ref Line.y ID
1 PP1 9RT 92 Parade Street, Yorky 32453 92 Street, Parade, Yorkie OYUIY
2 PP1 9TR 99 York Parade, Yorkie 23412 99 York Parade, Yorky UYTER
4 PP1 9TR 29 Road Street, Townplace 43223 99 York Parade, Yorky UYTER
6 ZZ1 4TY 64 London Street, Jasper 123451 64 London St, Jasper ABGED
9 ZZ1 4TY 46 London Road, Flat 2, Jasper 567348 64 London St, Jasper ABGED
12 ZZ1 4TY 99 Parade Road, Placename 98734 64 London St, Jasper ABGED
I would consider first evaluating potential matches for each DF1 line using agrep, collecting the results in a list:
matchDF1 <- vector("list", length(DF1$Line))
for (i in seq_along(DF1$Line)) {
  matchDF1[[i]] <- agrep(pattern = DF1$Line[i], x = DF2$Line,
                         max.distance = 0.5, value = TRUE)
}
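As for a numeric that quantifies match strength, one rough option is base R's adist(), which computes the edit (Levenshtein) distance between strings; dividing by the length of the longer string gives a 0-1 dissimilarity score (0 = identical). A sketch, assuming the DF1 and DF2 defined in the question:

```r
# Pairwise edit distance between every DF1 line and every DF2 line
d <- adist(as.character(DF1$Line), as.character(DF2$Line))

# Normalise by the length of the longer string: 0 = identical, 1 = nothing shared
norm <- d / outer(nchar(as.character(DF1$Line)),
                  nchar(as.character(DF2$Line)), pmax)

# For each DF1 row, pick the closest DF2 candidate and report its score
best <- apply(norm, 1, which.min)
matches <- data.frame(Line1 = DF1$Line, Line2 = DF2$Line[best],
                      Ref = DF1$Ref, ID = DF2$ID[best],
                      dissimilarity = round(norm[cbind(seq_len(nrow(DF1)), best)], 2))
matches
```

If plain edit distance proves too crude for your real addresses, the stringdist package offers further metrics (Jaro-Winkler, cosine, etc.) with the same matrix-style interface.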