Merge tables in R on specified fields

I have 2 csv files I want to merge on a linked key:
results.csv column headings:
schoolID, schoolName, Easting, Northing
123933, Mark College, 338371, 147812
139335, Hemsworth Arts and Community Academy, 442859, 413420
107563, Sowerby Bridge High School, 406122, 424146
137706, Willenhall E-ACT Academy, 398288, 300042
schools.csv column headings:
URN, LA (code), LA (name), EstablishmentNumber, EstablishmentName
123933, 201, City of London, 3614, Mark College
100001, 202, Camden, 6005, City of London School for Girls
139335, 201, City of London, 6006, Hemsworth Arts and Community Academy
100003, 201, City of London, 6007, City of London School
URN == schoolID and I want a final file showing data under column headings:
schoolID, schoolName, Easting, Northing, LA (name)
I've tried the following code:
res_data <- read.csv("C:/results.csv", header = TRUE, sep = ",")
school_data <- read.csv("C:/schools.csv", header = TRUE, sep = ",")
merge_data <- merge(x = res_data , y = school_data[c(1,3),], by.x = "schoolID", by.y = "URN" )
head(merge_data, 3)
But the result just contains all the headings merged together, with no data rows:
schoolID, schoolName, Easting, Northing, URN, LA (code), LA (name), EstablishmentNumber, EstablishmentName

Tested with the supplied test data:
merge_data <- merge(x = res_data, y = school_data[, c(1, 3)], by.x = "schoolID", by.y = "URN")
Note the two changes: you've cut the first and third rows instead of the first and third columns from school_data, and the column subset needs to include the merge column (URN) as well as LA (name).
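If you also want the merged result written back to a file, a minimal sketch building on the fix above (the output path is just an example; note that read.csv() will have renamed "LA (name)" to LA..name. unless check.names = FALSE was used):
head(merge_data, 3)
write.csv(merge_data, "C:/final.csv", row.names = FALSE)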

Related

Combine every two rows of data in R

I have a csv file that I have read in, but I now need to combine every two rows together. There is a total of 2000 rows and I need to reduce it to 1000 rows. Every two rows share the same account number in one column, with the address split across those two rows in another column. In other words, two rows are taken up for each observation, and I want to combine the two address rows into one. For example, rows 1 and 2 are Acct# 1234 and have 123 Hollywood Blvd and LA California 90028 on their own lines respectively.
Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)
df %>%
  group_by(Acct) %>%
  summarise(Address = str_c(Address, collapse = " "))
# A tibble: 2 × 2
Acct Address
<dbl> <chr>
1 1234 123 Hollywood Blvd LA California 90028
2 4321 55 Park Avenue NY New York State 6666
Data:
df <- data.frame(
Acct = c(1234, 1234, 4321, 4321),
Address = c("123 Hollywood Blvd", "LA California 90028",
"55 Park Avenue", "NY New York State 6666")
)
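A base R alternative, not part of the original answer but the same idea as a one-liner, is aggregate() on the sample df above:
aggregate(Address ~ Acct, data = df, FUN = paste, collapse = " ")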
It can be fairly simple with the data.table package:
# assuming `dataset` is the name of your dataset, the account number column is called 'actN' and the address column is 'adr'
library(data.table)
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
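Applied to the sample df from the previous answer (so the columns are Acct and Address rather than actN and adr), that would look something like:
library(data.table)
# collapse the two address rows of each account into a single string
data.table(df)[, .(Address = paste0(Address, collapse = ", ")), by = .(Acct)]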

Getting a few tables together

I have three tables read into R as dataframes.
Table1:
Student ID School_Name Campus Area
4356791 BCCS Northwest Springdale
03127. BZS South Vernon
12437. BCCS. South Vernon
Table 2:
ProctorID. Date. Score. Student ID Form#
0211 10/05/16 75.57 55612 25432178
0211 10/17/16 83.04 55612 47135671
5134 10/17/16 63.28 02613 2371245
Table 3:
ProctorID First. Last. Campus Area
O211. Simone Lewis. Northwest Springdale
5134. Mona. Yashamito Northwest Springdale
0712. Steven. Lewis. South Vernon
I want to combine the data frames and create a table with the scores next to each other for each area, by school name. I want an output like the following:
School_Name Form# Northwest Springdale Southvernon
BCCS. 2543127. 83.04. 63.25
BCCS. 35674. 75.14. *
BZS. 5321567. 65.2. 62.3
A particular form for a particular school may not have a score for a certain area. Any ideas? I have been playing with the sqldf package. Also, is it possible to manipulate this in R without using any SQL?
To cast, something like this:
library(reshape2)
casted_df <- dcast(df, ... ~ `Campus Area`, value.var = "Score.")
An example that seems to work for me:
df1 <- data.frame("StudentID" = 1:3, "SchoolName" = c("School1", "School2", "School3"), "Area" = c("Area1", "Area2", "Area3"))
df2 <- data.frame("StudentID" = 1:3, "Score" = 100:102, "Proctor" = 4:6)
df3 <- data.frame("Proctor" = 4:6, "Area" = c("Area1", "Area2", "Area3"), "Name" = c("John", "Jane", "Jim"))
combined <- merge(df1, df2, by = "StudentID")
combined2 <- merge(combined, df3, by = "Proctor")
library(reshape2)
final <- dcast(combined2, ... ~ Area.x, value.var="Score")
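If you would rather avoid reshape2, a roughly equivalent tidyverse sketch (using the same combined2 built above) is tidyr::pivot_wider():
library(tidyr)
# spread the scores into one column per area
final_tidy <- pivot_wider(combined2, names_from = Area.x, values_from = Score)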

R observation strsplit - multiple values in columns

I have a dataframe in R concerning houses. This is a small sample:
Address Type Rent
Glasgow;Scotland House 1500
High Street;Edinburgh;Scotland Apartment 1000
Dundee;Scotland Apartment 800
South Street;Dundee;Scotland House 900
I would like to just pull out the last two instances of the Address column into a City and County column in my dataframe.
I have used mutate and strsplit to split this column:
data <- mutate(dataframe, split_add = strsplit(dataframe$Address, ";"))
I now have a new column in my dataframe which resembles the following:
split_add
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
How do I extract the last 2 elements of each of these vectors into columns "City" and "County"?
I attempted:
data <- mutate(data, city = split_add[-2])
thinking it would take the second element from the end of each vector, but this did not work.
Using tidyr::separate() with the fill = "left" option is probably your best bet:
dataframe <- read.table(header = T, stringsAsFactors = F, text = "
Address Type Rent
Glasgow;Scotland House 1500
'High Street;Edinburgh;Scotland' Apartment 1000
Dundee;Scotland Apartment 800
'South Street;Dundee;Scotland' House 900
")
library(tidyr)
separate(dataframe, Address, into = c("Street", "City", "County"),
sep = ";", fill = "left")
# Street City County Type Rent
# 1 <NA> Glasgow Scotland House 1500
# 2 High Street Edinburgh Scotland Apartment 1000
# 3 <NA> Dundee Scotland Apartment 800
# 4 South Street Dundee Scotland House 900
I'm thinking about another way of dealing with this problem.
1. Create a data frame with the split_add column data:
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
test_data <- data.frame(split_add = c("Glasgow, Scotland",
                                      "High Street, Edinburgh, Scotland",
                                      "Dundee, Scotland",
                                      "South Street, Dundee, Scotland"),
                        stringsAsFactors = FALSE)
names(test_data) <- "address"
2. Use separate() from tidyr to split the column:
library(tidyr)
new_test <- test_data %>% separate(address,c("c1","c2","c3"), sep=",")
3. Use dplyr and ifelse() to keep only the last two columns:
library(dplyr)
new_test %>%
mutate(city = ifelse(is.na(c3),c1,c2),county = ifelse(is.na(c3),c2,c3)) %>%
select(city,county)
The final data then contains just the city and county columns (splitting on "," leaves a leading space, which trimws() can remove).
Assuming that you're using dplyr:
data <- mutate(dataframe,
               split_add = strsplit(Address, ";"),
               City = sapply(split_add, function(x) tail(x, 2)[1]),
               County = sapply(split_add, function(x) tail(x, 1)))
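As a quick check against the sample dataframe built in the tidyr answer above, the new columns come out as expected:
data[, c("Address", "City", "County")]
#                          Address      City   County
# 1               Glasgow;Scotland   Glasgow Scotland
# 2 High Street;Edinburgh;Scotland Edinburgh Scotland
# 3                Dundee;Scotland    Dundee Scotland
# 4   South Street;Dundee;Scotland    Dundee Scotland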

Is it possible to relate two dataframes (eg: States and cities)? [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
I have three dataframes:
cities_df, which contains the name of a city amongst other fields
cities_df <- data.frame(
city_name = c("London", "Newcastle Upon Tyne", "Gateshead"),
city_population = c(8673713L, 289835L, 120046L),
city_area = c(1572L, 114L, NA)
)
states_df, which contains the name of a state amongst other fields
states_df <- data.frame(
state_name = c("Greater London", "Tyne and Wear"),
state_population = c(123, 456)
)
dictionary_df, which contains the whole list of cities and their corresponding state.
dictionary_df <- data.frame(
city_name = c("London", "Newcastle Upon Tyne", "Gateshead"),
state = c("Greater London", "Tyne and Wear", "Tyne and Wear")
)
Is there any way to relate/link cities_df and states_df dataframes so I can have an easy way to get all the cities' fields that belong to a certain state?
Using merge, see linked post for more options:
# tidy up column name to match with other column names
colnames(dictionary_df)[2] <- "state_name"
# merge to get state names
x <- merge(cities_df, dictionary_df, by = "city_name")
# merge to get city names
y <- merge(states_df, dictionary_df, by = "state_name")
# merge by city and state
result <- merge(x, y, by = c("state_name", "city_name"))
result
# state_name city_name city_population city_area state_population
# 1 Greater London London 8673713 1572 123
# 2 Tyne and Wear Gateshead 120046 NA 456
# 3 Tyne and Wear Newcastle Upon Tyne 289835 114 456
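With result in hand, listing every city that belongs to a given state is just a subset (the state name below is only an example):
subset(result, state_name == "Tyne and Wear")
#      state_name           city_name city_population city_area state_population
# 2 Tyne and Wear           Gateshead          120046        NA              456
# 3 Tyne and Wear Newcastle Upon Tyne          289835       114              456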

Extract cities from each row in excel and export to its respective row using R

I have extracted tweets in .csv format and the data looks like this:
(row 1) The latest The Admin Resources Daily! Thanks to #officerenegade #roberthalf #elliottdotorg #airfare #jobsearch
(row 2) RT #airfarewatchdog: Los Angeles #LAX to Cabo #SJD $312 nonstop on #AmericanAir for summer travel. #airfare
(row 3) RT #TheFlightDeal: #Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t. Details:
(row 4) The latest The Nasir Muhammad Daily! Thanks to #Matt_Revel #Roddee #JaeKay #lefforum #airfare
(row 5) RT #BarefootNomads: So cool! <U+2708> <U+2764><U+FE0F> #airfare deals w #Skyscanner Facebook Messenger Bot #traveldeals #cheapflights ht…
(row 6) Flights to #Oranjestad #Aruba are £169 for a 15 day trip departing Tue, Jun 7th. #airfare via #hitlist_app"
I have used an NLP technique (named entity recognition) to extract city names from the tweets, but the output is one flat list of cities, with each city occupying its own row. It simply identifies all the city names across every tweet and lists them.
Output:
1 Los Angeles
2 New York
3 Mexico City
4 Mexico
5 Tue
6 London
7 New York
8 Fort Lauderdale
9 Los Angeles
10 Paris
I want the output to be something like:
1 Los Angeles Cabo (from the first tweet in row 2)
2 New York Mexico City Mexico (from the second tweet in row 3)
Code:
#Named Entity Recognition (NER)
bio <- readLines("C:\\xyz\\tweets.csv")
print(bio)
install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
install.packages("openNLPmodels.en",
repos = "http://datacube.wu.ac.at/",
type = "source")
library(NLP)
library(openNLP)
library(RWeka)
library(qdap)
library(openNLPmodels.en)
library(magrittr)
bio <- as.String(bio)
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
sents(bio_doc) %>% head(2)
words(bio_doc) %>% head(10)
location_ann <- Maxent_Entity_Annotator(kind = "location")
pipeline <- list(sent_ann,
word_ann,
location_ann)
bio_annotations <- annotate(bio, pipeline)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
entities <- function(doc, kind) {
  s <- doc$content
  a <- annotations(doc)[[1]]
  if (hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}
entities(bio_doc, kind = "location")
cities <- entities(bio_doc, kind = "location")
library(xlsx)
write.xlsx(cities, "C:\\xyz\\xyz.xlsx")
Also, is there a way that I can further separate the cities into origin and destination, i.e. by classifying cities before 'to' or '-' as origin cities and the rest as destination cities?
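One way to keep the extracted cities attached to the tweet they came from is to annotate each line separately instead of collapsing the whole file with as.String(). A rough sketch reusing the annotators created above (not the original code; tweet_locations is a made-up name):
# annotate one tweet at a time so the locations stay grouped by row
tweets <- readLines("C:\\xyz\\tweets.csv")
tweet_locations <- lapply(tweets, function(txt) {
  txt <- as.String(txt)
  anns <- annotate(txt, list(sent_ann, word_ann, location_ann))
  ents <- anns[anns$type == "entity"]
  kinds <- sapply(ents$features, `[[`, "kind")
  txt[ents[kinds == "location"]]
})
# one element per tweet, each a character vector of the locations found in that tweet
tweet_locations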
