Is it possible to relate two dataframes (eg: States and cities)? [duplicate] - r

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
I have three dataframes:
cities_df, which contains the name of a city amongst other fields
cities_df <- data.frame(
city_name = c("London", "Newcastle Upon Tyne", "Gateshead"),
city_population = c(8673713L, 289835L, 120046L),
city_area = c(1572L, 114L, NA)
)
states_df, which contains the name of a state amongst other fields
states_df <- data.frame(
state_name = c("Greater London", "Tyne and Wear"),
state_population = c(123, 456)
)
dictionary_df, which contains the whole list of cities and their corresponding state.
dictionary_df <- data.frame(
city_name = c("London", "Newcastle Upon Tyne", "Gateshead"),
state = c("Greater London", "Tyne and Wear", "Tyne and Wear")
)
Is there any way to relate/link cities_df and states_df dataframes so I can have an easy way to get all the cities' fields that belong to a certain state?

Using merge, see linked post for more options:
# tidy up column name to match with other column names
colnames(dictionary_df)[2] <- "state_name"
# merge to get state names
x <- merge(cities_df, dictionary_df, by = "city_name")
# merge to get city names
y <- merge(states_df, dictionary_df, by = "state_name")
# merge by city and state
result <- merge(x, y, by = c("state_name", "city_name"))
result
# state_name city_name city_population city_area state_population
# 1 Greater London London 8673713 1572 123
# 2 Tyne and Wear Gateshead 120046 NA 456
# 3 Tyne and Wear Newcastle Upon Tyne 289835 114 456
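If all you need is the cities of one state, a full three-way merge is not strictly required. The following is a minimal base R sketch, assuming the same example data; the helper name cities_in_state is hypothetical and not part of any package.

```r
# Example data from the question (state column already named state_name)
cities_df <- data.frame(
  city_name = c("London", "Newcastle Upon Tyne", "Gateshead"),
  city_population = c(8673713L, 289835L, 120046L),
  city_area = c(1572L, 114L, NA),
  stringsAsFactors = FALSE
)
dictionary_df <- data.frame(
  city_name = c("London", "Newcastle Upon Tyne", "Gateshead"),
  state_name = c("Greater London", "Tyne and Wear", "Tyne and Wear"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the city names for a state in the dictionary,
# then keep the matching rows of cities_df
cities_in_state <- function(state) {
  wanted <- dictionary_df$city_name[dictionary_df$state_name == state]
  cities_df[cities_df$city_name %in% wanted, ]
}

tyne_and_wear <- cities_in_state("Tyne and Wear")
tyne_and_wear
```

This keeps the three data frames separate and only filters on demand, which may be convenient when the tables are large and only a few states are ever queried.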

Related

How do I extract part of the data from a column and paste it in another column using R?

I want to extract a part of data from a column and paste it in another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use the separate function from tidyr, which is part of the tidyverse.
Note that I use NA rather than NULL inside the mobile vector, because c() silently drops NULL elements.
Also, I set stringsAsFactors = F when creating the data frame to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
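For comparison, here is a base R sketch of the same idea that uses regular expressions instead of separate(). It assumes that a phone number, when present, is the last space-separated token and consists of a plus sign followed by digits; that pattern is an assumption about the data, not something stated in the question.

```r
names <- c("Sia", "Ryan", "J", "Ricky")
country <- c("London +1234567890", "Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names, country, mobile, stringsAsFactors = FALSE)

# Assumed pattern: a trailing "+digits" token preceded by a space
has_number <- grepl(" \\+[0-9]+$", data$country)

# Pull out the trailing number where there is one, NA otherwise
number <- ifelse(has_number,
                 sub("^.* (\\+[0-9]+)$", "\\1", data$country),
                 NA_character_)

# Fill missing mobiles from the extracted number, then strip it from country
data$mobile <- ifelse(is.na(data$mobile), number, data$mobile)
data$country <- sub(" \\+[0-9]+$", "", data$country)
data
```

Because only the trailing token is removed, a place name that itself contains a space would stay intact, which splitting on every space would not guarantee.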

R observation strs split - multiple value in columns

I have a dataframe in R concerning houses. This is a small sample:
Address Type Rent
Glasgow;Scotland House 1500
High Street;Edinburgh;Scotland Apartment 1000
Dundee;Scotland Apartment 800
South Street;Dundee;Scotland House 900
I would like to just pull out the last two instances of the Address column into a City and County column in my dataframe.
I have used mutate and strsplit to split this column by:
data<-mutate(dataframe, split_add = strsplit(dataframe$Address, ";"))
I now have a new column in my dataframe which resembles the following:
split_add
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
How do I extract the last 2 instances of each of these vector observations into columns "City" and "County"?
I attempted:
data<-mutate(data, city=split_add[-2])
thinking it would take the second instance from the end of the vectors- but this did not work.
Using tidyr::separate() with the fill = "left" option is probably your best bet:
dataframe <- read.table(header = T, stringsAsFactors = F, text = "
Address Type Rent
Glasgow;Scotland House 1500
'High Street;Edinburgh;Scotland' Apartment 1000
Dundee;Scotland Apartment 800
'South Street;Dundee;Scotland' House 900
")
library(tidyr)
separate(dataframe, Address, into = c("Street", "City", "County"),
sep = ";", fill = "left")
# Street City County Type Rent
# 1 <NA> Glasgow Scotland House 1500
# 2 High Street Edinburgh Scotland Apartment 1000
# 3 <NA> Dundee Scotland Apartment 800
# 4 South Street Dundee Scotland House 900
I was thinking about another way of dealing with this problem.
1. Create a data frame with the split_add column data
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
test_data <- data.frame(address = c("Glasgow, Scotland",
"High Street, Edinburgh, Scotland",
"Dundee, Scotland",
"South Street, Dundee, Scotland"), stringsAsFactors = F)
2. Use separate() from tidyr to split the column
library(tidyr)
# sep = ", " avoids leading spaces; fill = "right" pads short addresses with NA
new_test <- test_data %>% separate(address, c("c1","c2","c3"), sep = ", ", fill = "right")
3. Use dplyr and ifelse() to keep only the last two parts
library(dplyr)
new_test %>%
mutate(city = ifelse(is.na(c3),c1,c2),county = ifelse(is.na(c3),c2,c3)) %>%
select(city,county)
The final data looks like this.
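Assuming that you're using dplyr, apply tail() to each split vector (calling tail() on the whole list column would return its last rows rather than the last parts of each address):
data <- mutate(dataframe, split_add = strsplit(Address, ';'), City = sapply(split_add, function(x) tail(x, 2)[1]), County = sapply(split_add, function(x) tail(x, 1)))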
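The per-row extraction can also be sketched in base R without dplyr. This assumes every address has at least two ";"-separated parts and that Address is a character column; the variable names parts and last_two are only illustrative.

```r
dataframe <- data.frame(
  Address = c("Glasgow;Scotland",
              "High Street;Edinburgh;Scotland",
              "Dundee;Scotland",
              "South Street;Dundee;Scotland"),
  Type = c("House", "Apartment", "Apartment", "House"),
  Rent = c(1500, 1000, 800, 900),
  stringsAsFactors = FALSE
)

# One character vector per row
parts <- strsplit(dataframe$Address, ";", fixed = TRUE)

# Take the last two parts of each vector; vapply returns a 2 x n matrix,
# so transpose it to get one row per address
last_two <- t(vapply(parts, function(p) tail(p, 2), character(2)))

dataframe$City <- last_two[, 1]
dataframe$County <- last_two[, 2]
dataframe
```

vapply() is used instead of sapply() so that an address with fewer than two parts raises an error rather than silently producing a differently shaped result.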

r aggregate function -- how to display additional columns [duplicate]

This question already has answers here:
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Select the row with the maximum value in each group
(19 answers)
Closed 5 years ago.
I have two data frames: City and Country. I am trying to find out the most populous city per country. City and Country have a common field, City.CountryCode and Country.Code. These two data frames were merged to one called CityCountry. I have tried the aggregate command like so:
aggregate(Population.x~CountryCode, CityCountry, max)
This aggregate command only shows the CountryCode and Population.X columns. How would I show the name of the Country and the name of the City? Is aggregate the wrong command to use here?
Could also use dplyr to group by Country, then filter by max(Population.x).
library(dplyr)
set.seed(123)
CityCountry <- data.frame(Population.x = sample(1000:2000, 10, replace = TRUE),
CountryCode = rep(LETTERS[1:5], 2),
Country = rep(letters[1:5], 2),
City = letters[11:20],
stringsAsFactors = FALSE)
CityCountry %>%
group_by(Country) %>%
filter(Population.x == max(Population.x)) %>%
ungroup()
# A tibble: 5 x 4
Population.x CountryCode Country City
<int> <chr> <chr> <chr>
1 1287 A a k
2 1789 B b l
3 1883 D d n
4 1941 E e o
5 1893 C c r
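aggregate is not necessarily the wrong tool either: its result can be merged back onto the original data to recover the other columns. Below is a small base R sketch with made-up population figures (the numbers and city names are illustrative only, not from the question).

```r
# Made-up example data with the same column names as in the question
CityCountry <- data.frame(
  Population.x = c(100L, 300L, 250L, 50L),
  CountryCode = c("A", "A", "B", "B"),
  Country = c("a", "a", "b", "b"),
  City = c("k", "l", "m", "n"),
  stringsAsFactors = FALSE
)

# Maximum population per country code, as in the question
max_pop <- aggregate(Population.x ~ CountryCode, CityCountry, max)

# merge() joins on the shared columns (CountryCode and Population.x),
# which brings back Country and City for the rows holding the maximum
top_cities <- merge(max_pop, CityCountry)
top_cities
```

If two cities in the same country tie for the maximum, both rows are returned.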

Merge dataframes based on regex condition

This problem involves R. I have two dataframes, represented by this minimal reproducible example:
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"), county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"), jobs = c("4", "204"))
An example to help communicate the very specific operation I am trying to perform: the geocode_selector column in dataframe a contains the FIPS county codes of the five boroughs of NY. The geocode column in dataframe b is the 15-digit ID of a specific Census block. The first five digits of a geocode match a more general geocode_selector, indicating which county the Census block is located in. I want to add a column to b specifying which county each census block falls under, based on which geocode_selector each geocode in b matches with.
Generally, I'm trying to merge dataframes based on a regex condition. Ideally, I'd like to perform a full merge carrying all of the columns of a over to b and not just the county_name.
I tried something along the lines of:
b[, "county_name"] <- NA
for (i in 1:nrow(b)) {
for (j in 1:nrow(a)) {
if (grepl(a$geocode_selector[j], b$geocode[i]) == TRUE) {
b$county_name[i] <- a$county_name[j]
}
}
}
but it took an extremely long time for the large datasets I am actually processing and the finished product was not what I wanted.
Any insight on how to merge dataframes conditionally based on a regex condition would be much appreciated.
You could do this...
b$geocode_selector <- substr(b$geocode,1,5)
b2 <- merge(b, a, all.x=TRUE) #by default it will merge on common column names
b2
geocode_selector geocode jobs county_name
1 36005 360050002001002 4 Bronx
2 36085 360850323001019 204 Richmond
If you wish, you can delete the geocode_selector column from b2 with b2[,1] <- NULL
We can use sub to create the 'geocode_selector' and then do the join
library(data.table)
setDT(a)[as.data.table(b)[, geocode_selector := sub('^(.{5}).*', '\\1', geocode)],
on = .(geocode_selector)]
# geocode_selector county_name geocode jobs
#1: 36005 Bronx 360050002001002 4
#2: 36085 Richmond 360850323001019 204
This is a great opportunity to use dplyr. I also tend to like the string handling functions in stringr, such as str_sub.
library(dplyr)
library(stringr)
# tibble() replaces the deprecated data_frame()
a <- tibble(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- tibble(geocode = c("360050002001002", "360850323001019"),
jobs = c("4", "204"))
b %>%
mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
inner_join(a, by = "geocode_selector")
#> # A tibble: 2 x 4
#> geocode jobs geocode_selector county_name
#> <chr> <chr> <chr> <chr>
#> 1 360050002001002 4 36005 Bronx
#> 2 360850323001019 204 36085 Richmond
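If only one or two columns need to be carried over, a lookup with match() avoids the join altogether. This is a base R sketch under the same assumption as the answers above, namely that the selector is always the first five characters of the geocode.

```r
a <- data.frame(
  geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
  county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  geocode = c("360050002001002", "360850323001019"),
  jobs = c("4", "204"),
  stringsAsFactors = FALSE
)

# Row of a that matches each row of b (NA where no county matches)
idx <- match(substr(b$geocode, 1, 5), a$geocode_selector)

# Vectorised lookup replaces the nested for loops from the question
b$county_name <- a$county_name[idx]
b
```

Since match() and substr() are vectorised, this avoids the row-by-row grepl() calls that made the original loop slow.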

A fast way to merge named vectors of different length into a data frame (preserving name information as column name) in R

I have a list L of named vectors. For example, 1st element:
> L[[1]]
$event
[1] "EventA"
$time
[1] "1416355303"
$city
[1] "Los Angeles"
$region
[1] "California"
$Locale
[1] "en-GB"
When I unlist each element of the list, the resulting vectors look like this (for the first 3 elements):
> unlist(L[[1]])
event time city region Locale
"EventA" "1416355303" "Los Angeles" "California" "en-GB"
> unlist(L[[2]])
event time Locale
"EventB" "1416417567" "en-GB"
> unlist(L[[3]])
event properties.time
"EventM" "1416417569"
I have over 0.5 million elements in the list and each one has up to 42 of these features/names. I have to merge them into a data frame, taking into account their names and the fact that not all of them have the same number of features or names (in the example above, V2 has no information for region and city). At the moment, I loop through the whole list:
df1 <- merge(stack(unlist(L[[1]])), stack(unlist(L[[2]])),
by = "ind", all = TRUE)
suppressWarnings(for (i in 3:length(L)){
df1 <- merge(df1, stack(unlist(L[[i]])), by = "ind", all = TRUE)
})
df1 <- as.data.frame(t(df1))
For the example above this returns:
V1 V2 V3 V4 V5
ind city event Locale region time
values.x Los Angeles EventA en-GB California 1416355303
values.y <NA> EventB en-GB <NA> 1416417567
values <NA> EventM <NA> <NA> 1416417569
which is what I want. However, bearing in mind the length of the list and the fact that every time that the command:
df1 <- merge(df1, stack(unlist(L[[i]])), by = "ind", all = TRUE)
runs, it loads the entire data frame (df1), the loop takes a very long time. Therefore, I was wondering if anyone knows a better/faster way to code this. In other words, given a long list of named vectors with different lengths, is there a fast way to merge them into a data frame like the one described above?
For example, is there a way of doing this using foreach and %dopar%? In any case, any faster approach is welcome.
I've heard the data.table package is pretty fast. And rbindlist is perfect for this list.
library(data.table)
rbindlist(L, fill=TRUE)
# event time city region Locale
# 1: EventA 1416355303 Los Angeles California en-GB
# 2: EventB 1416417567 NA NA en-GB
# 3: EventM 1416417569 NA NA NA
I'm not sure why you use merge. It seems to me like you should simply rbind.
L <- list(list(event = "EventA", time = 1416355303,
city = "Los Angeles", region = "California",
Locale = "en-GB"),
list(event = "EventB", time = 1416417567,
Locale = "en-GB"),
list(event = "EventM", time = 1416417569))
library(plyr)
do.call(rbind.fill, lapply(L, as.data.frame))
# event time city region Locale
#1 EventA 1416355303 Los Angeles California en-GB
#2 EventB 1416417567 <NA> <NA> en-GB
#3 EventM 1416417569 <NA> <NA> <NA>
Here's a compact solution to consider:
library(reshape2)
dcast(melt(L), L1 ~ L2, value.var = "value")
# L1 city event Locale region time
# 1 1 Los Angeles EventA en-GB California 1416355303
# 2 2 <NA> EventB en-GB <NA> 1416417567
# 3 3 <NA> EventM <NA> <NA> 1416417569
The original post is about merging named vectors. Define the first two given in the example above as vectors:
C1 <- c(event = "EventA", time = 1416355303,
city = "Los Angeles", region = "California",
Locale = "en-GB")
C2 <- c(event = "EventB", time = 1416417567,
Locale = "en-GB")
If you want to merge them and are OK with giving up the extra data in the longer vector, then you can index the longer vector by the names in the shorter vector:
C1 <- C1[names(C2)]
Then just use rbind or cbind. Example with rbind:
C1_C2 <- rbind(C1,C2)
C1_C2
event time Locale
C1 "EventA" "1416355303" "en-GB"
C2 "EventB" "1416417567" "en-GB"
You can combine the final two steps, but you will lose the name of the first vector if you do that.
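If dropping the extra fields is not acceptable, the vectors can instead be padded to the union of all names before binding. The following is a base R sketch of that variation; the names vecs, all_names, and rows are illustrative, and the time values are written as strings because c() would coerce them to character anyway.

```r
vecs <- list(
  C1 = c(event = "EventA", time = "1416355303",
         city = "Los Angeles", region = "California",
         Locale = "en-GB"),
  C2 = c(event = "EventB", time = "1416417567",
         Locale = "en-GB")
)

# Every name that occurs in any vector, in order of first appearance
all_names <- unique(unlist(lapply(vecs, names)))

# Indexing by a missing name yields NA, so each vector is padded to full width
rows <- lapply(vecs, function(v) setNames(v[all_names], all_names))

# One row per vector, one column per name
result <- do.call(rbind, rows)
result
```

The result is a character matrix; as.data.frame(result) converts it to a data frame if needed. For a list with hundreds of thousands of elements, the rbindlist approach above is likely to be faster.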
