"Compare a variable by state abbreviations - r

How can I compare a variable by state abbreviations?
My data set has 5 variables currently. One of them is Location, and it is written as: "Raleigh, NC"
I need to create a variable that contains the two-character state abbreviation for each observation, and afterwards another to group them by state. Each observation is a college, including its classification (private/public), in-state/out-of-state tuition, and location.

This should do it for you, if I understood your issue correctly.
Note: Please always share sample data using dput(your_dataset) or dput(head(your_dataset))
library(tidyverse)
d <- tibble(id = 1:3,
            Location = c("Newyork, NY", "Raleigh, NC", "Delhi, IN"))
d %>%
  separate(Location, into = c("city", "state")) %>%
  mutate_at(vars("city", "state"), str_trim)
# A tibble: 3 x 3
     id city    state
  <int> <chr>   <chr>
1     1 Newyork NY
2     2 Raleigh NC
3     3 Delhi   IN
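If you then want to group or summarise the colleges by the new state column, you can carry on from that result. A small sketch of that step (the Tuition column here is hypothetical, standing in for your real variables):
library(tidyverse)

colleges <- tibble(id = 1:3,
                   Location = c("Raleigh, NC", "Durham, NC", "New York, NY"),
                   Tuition = c(9000, 55000, 60000))  # Tuition is a made-up column for illustration

colleges %>%
  separate(Location, into = c("city", "state"), sep = ",\\s*") %>%  # split at the comma, drop the space
  group_by(state) %>%                                               # one group per state abbreviation
  summarise(n_colleges = n(), mean_tuition = mean(Tuition))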

Related

Can't figure out how to remove column name from dummy variable heading

I wrote this code and used
library('fastDummies'):
New_Data <- dummy_cols(New_Curve_Data, select_columns = 'CountyName')
I just want the actual county name, e.g. Banks, to be displayed, and not CountyName_Banks etc.
There are about 100 dummy variables that I created, so I can't manually change the names.
The prefix can be removed with sub by matching 'CountyName_' as the pattern, replacing it with blank ("") in the names of 'New_Data', and assigning the result back:
names(New_Data) <- sub("CountyName_", "", names(New_Data))
This can also be done in base R with table:
as.data.frame.matrix(table(seq_len(nrow(New_Curve_Data)),
                           New_Curve_Data$CountyName))
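That gives just the table of dummy columns; if you want them alongside the original variables, you can bind the result back, for example:
dummies <- as.data.frame.matrix(table(seq_len(nrow(New_Curve_Data)),
                                      New_Curve_Data$CountyName))
New_Data <- cbind(New_Curve_Data, dummies)  # original columns plus one 0/1 column per county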
You can use pivot_wider. Since you did not share example data, I am taking the example from the ?fastDummies::dummy_cols help page.
crime <- data.frame(city = c("SF", "SF", "NYC"),
                    year = c(1990, 2000, 1990),
                    crime = 1:3)
tidyr::pivot_wider(crime, names_from = city, values_from = city,
                   values_fn = length, values_fill = 0)
# year crime SF NYC
# <dbl> <int> <int> <int>
#1 1990 1 1 0
#2 2000 2 1 0
#3 1990 3 0 1
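Applied to the question's data it would look roughly like the following. This is a sketch: I am assuming New_Curve_Data has one row per observation, and row_id is a hypothetical helper column added only to stop pivot_wider from collapsing rows:
library(dplyr)
library(tidyr)

New_Data <- New_Curve_Data %>%
  mutate(row_id = row_number()) %>%  # hypothetical helper so every row stays distinct
  pivot_wider(names_from = CountyName, values_from = CountyName,
              values_fn = length, values_fill = 0) %>%
  select(-row_id)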

Cleaning Origin and Destination data with duplicates but different factor level

I have some GIS data with origins and destinations (OD) and information about the time of day of each OD. I intend to make a map of this and to color the ODs by the time-of-day information.
One issue is that some ODs are in the data set with both day and night, possibly in a different order. I would like to mark those differently, e.g. "Day/Night".
Is there an easy way to do this? My MWE is just one OD, but I would need to identify it among several others. I can manage to find the duplicates regardless of the order, but I don't know how to find out whether or not both time cases are there and how to replace them with "Day/Night".
library(data.table)
Origin<-c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London")
Destination<-c("Paris", "London", "Berlin","Lisbon", "Lisbon", "Paris")
Time=factor(c("Day", "Night", "Day", "Day/Night","Day", "Day/Night"))
dt<-data.table(Origin=Origin, Destination=Destination, Time=Time)
#duplicates regardless of order
dat.sort = t(apply(dt[,.(Origin,Destination)], 1, sort))
dt[duplicated(dat.sort) | duplicated(dat.sort, fromLast=TRUE),]
You can do that using the dplyr package as follows.
Feel free to change the conditions to what fits your need.
library(data.table)
library(dplyr)
# Creating data
dt <- data.table(
  Origin = c("London", "Paris", "Italy", "Spain", "Portugal", "Poland"),
  Destination = c("Paris", "London", "Norway", "Portugal", "Spain", "Spain"),
  Time = c("Day", "Night", "Day", NA_character_, NA_character_, NA_character_)
)
dt
# Origin Destination Time
# London Paris Day
# Paris London Night
# Italy Norway Day
# Spain Portugal <NA>
# Portugal Spain <NA>
# Poland Spain <NA>
dt %>%
  # pmin and pmax are used to sort the two columns
  # in order to group by them regardless of their order
  group_by(Origin2 = pmin(Origin, Destination),
           Destination2 = pmax(Origin, Destination)) %>%
  mutate(count = n(),        # to check whether Origin/Destination are repeated or not
         row = row_number(), # placeholder to know if it was the first or second occurrence
         # if not repeated, then Time = Day
         # if repeated and first occurrence, then Time = Day
         # if repeated and second occurrence, then Time = Night
         Time = case_when(count == 1 ~ "Day",
                          count == 2 & row == 1 ~ "Day",
                          count == 2 & row == 2 ~ "Night")) %>%
  ungroup() %>%
  select(Origin, Destination, Time)
# Origin Destination Time
# <chr> <chr> <chr>
# 1 London Paris Day
# 2 Paris London Night
# 3 Italy Norway Day
# 4 Spain Portugal Day
# 5 Portugal Spain Night
# 6 Poland Spain Day
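A variant of the same grouping idea that instead labels a pair "Day/Night" whenever it occurs at both times (a sketch, applied to the dt from the question):
library(dplyr)

dt %>%
  group_by(o = pmin(Origin, Destination),   # order-independent city pair
           d = pmax(Origin, Destination)) %>%
  mutate(Time = if (all(c("Day", "Night") %in% Time) || "Day/Night" %in% Time)
                  "Day/Night" else as.character(Time)) %>%
  ungroup() %>%
  select(Origin, Destination, Time)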
Thanks for the dplyr solution by @Nareman Darwisch, which gave me the inspiration for my solution with data.table.
I am creating a new variable as a unique ID for each Origin-Destination pair:
dat.sort = t(apply(dt[,.(Origin,Destination)], 1, sort))
dt.temp<-data.table(dat.sort)
dt.temp[,unique.name:=paste(V1,V2)]
dt$unique.name<-factor(dt.temp$unique.name)
Then I can either calculate the number of unique Time values per group, or check whether a group matches more than one of the 3 levels. Based on this I can recode the labels with the "Day/Night" level whenever the length is > 1 or the other condition is TRUE:
dt[,No.levels:=length(unique(c(Time))), by=unique.name]
dt[,No.levels.logi:=sum(c(Time) %in% c(1:3))>1 , by=unique.name]
What I would like to understand is how I could use a logical condition that looks at the levels by group and compares those with the cases I want, e.g.
dt[,No.levels.logi:=sum(levels(Time) %in% c("Day", "Night"))>1 , by=unique.name]
But I guess the levels command always gives me all three levels.
If I understand correctly, the OP wants to
- identify city pairs regardless of the order of origin and destination, e.g. London-Paris belongs to the same city pair as Paris-London,
- collapse separate rows if a city pair is operated Day and Night or Day/Night, or update the original dataset accordingly.
This is what I would do:
library(data.table)
dt <- data.table(Origin, Destination, Time)
# add city pair as unique grouping variable
dt[, Pair := paste(pmin(Origin, Destination), pmax(Origin, Destination), sep = "-")][]
# identify city pairs which are operated day and night
pairs_DN <- dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair][(V1), .(Pair)]
# update original dataset by an update join
dt[pairs_DN, on = "Pair", Time := "Day/Night"][]
Origin Destination Time Pair
1: London Paris Day/Night London-Paris
2: Paris London Day/Night London-Paris
3: Lisbon Berlin Day Berlin-Lisbon
4: Madrid Lisbon Day/Night Lisbon-Madrid
5: Berlin Lisbon Day Berlin-Lisbon
6: London Paris Day/Night London-Paris
The key point is to identify the city pairs which fulfil the second requirement:
dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair]
Pair V1
1: London-Paris TRUE
2: Berlin-Lisbon FALSE
3: Lisbon-Madrid TRUE
So, there is no need to deal with factor levels. BTW, factor levels are an attribute of the whole column and do not change when subsetting or grouping. What does change is which of the levels are used in a subset or group.
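A quick illustration of that distinction, using the Time factor from the question's example:
Time <- factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))
levels(Time[1:2])               # still all three levels, even for a subset
# [1] "Day"       "Day/Night" "Night"
unique(as.character(Time[1:2])) # only the values actually used in the subset
# [1] "Day"   "Night"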
pairs_DN contains the unique key of those city pairs
Pair
1: London-Paris
2: Lisbon-Madrid

How do I extract part of the data from one column and paste it in another column using R?

I want to extract part of the data from one column and paste it in another column using R.
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use the tidyverse, which has a separate function.
Note that I use NA rather than NULL inside the mobile vector (NULL elements are simply dropped by c()).
Also, I use the option stringsAsFactors = F when creating the data frame to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>%
  as_tibble() %>%
  separate(country, c("country", "number"), sep = " ", fill = "right") %>%
  mutate(mobile = coalesce(mobile, number)) %>%
  select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)

Complex Data Frame calculations in R

I am currently importing two tables (in their most basic form) that appear as such:
Table 1
State Month Account            Value
NY    Jan   Expected Sales      1.04
NY    Jan   Expected Expenses   1.02
Table 2
State Month Account    Value
NY    Jan   Sales      1,000
NY    Jan   Customers    500
NY    Jan   F Expenses 1,000
NY    Jan   V Expenses   100
And my end goal is to create a 3rd data frame that combines the values of the first two tables and calculates a 4th column based on these formulas:
NextYearExpenses = (t2 F Expenses + t2 V Expenses) * t1 Expected Expenses
NextYearSales = t2 Sales * t1 Expected Sales
So my desired output is as follows:
State Month New Account Value
NY    Jan   Sales       1,040
NY    Jan   Expenses    1,122
I am relatively new to R and I think ifelse statements might be my best bet. I have tried merging the tables and calculating with simple column functions but with no real progress.
Any suggestions?
You may need to do some data wrangling, but nothing out of the ordinary.
require(dplyr)
Table1<-tibble(State=c("NY","NY"), Month=c("Jan","Jan"), Account=c("Expected Sales", "Expected Expenses"), Value=c(1.04,1.02))
Table2<-tibble(State=c("NY","NY","NY","NY"), Month=c("Jan","Jan","Jan","Jan"), Account=c("Sales", "Customers", "F Expenses","V Expenses"), Value=c(1000,500,1000,100))
The first thing I do is rename the accounts to have a common name, i.e. "Expenses"; this is going to help me merge with Table1 later on.
Table2$Account[Table2$Account=="F Expenses"]<-"Expenses"
Table2$Account[Table2$Account=="V Expenses"]<-"Expenses"
Then I use the group_by function to group by State, Month and Account, and do the sum:
Table2 <- Table2 %>% group_by(State, Month,Account) %>%
summarise(Tot_Value=sum(Value)) %>% ungroup()
head(Table2)
## State Month Account Tot_Value
## <chr> <chr> <chr> <dbl>
## 1 NY Jan Customers 500
## 2 NY Jan Expenses 1100
## 3 NY Jan Sales 1000
Then something similar with the renaming for the accounts in Table 1:
Table1$Account[Table1$Account=="Expected Sales"]<-"Sales"
Table1$Account[Table1$Account=="Expected Expenses"]<-"Expenses"
Merge into a third table, Table3:
Table3<- left_join(Table1,Table2)
Use mutate to do the needed operation:
Table3 <- Table3 %>% mutate(Value2=Value*Tot_Value)
head(Table3)
## # A tibble: 2 x 6
## State Month Account Value Tot_Value Value2
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 NY Jan Sales 1.04 1000 1040
## 2 NY Jan Expenses 1.02 1100 1122
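If you want exactly the layout shown in the question, one more select/rename step on Table3 does it:
Table3 %>%
  select(State, Month, `New Account` = Account, Value = Value2)  # NY Jan Sales 1040 / NY Jan Expenses 1122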
Here's what I did with dplyr and tidyr.
First I combined your initial tables with rbind into a single long-format table; since you have unique identifiers for each of the Account values, they don't need to be separate tables. Next I group_by State and Month, assuming that eventually you'll have a variety of states/months. Then I summarise based on the Account values you specified and create two new columns. Finally, to get the long format that you want, I use gather from tidyr to go from wide to long. You can run the pipeline one step at a time (stopping after any %>%) to get a better idea of what each step does.
library(dplyr)
library(tidyr)
rbind(df, df2) %>%
  group_by(State, Month) %>%
  summarise(Expenses = (Value[which(Account == "F Expenses")] + Value[which(Account == "V Expenses")]) * Value[which(Account == "Expected Expenses")],
            Sales = Value[which(Account == "Sales")] * Value[which(Account == "Expected Sales")]) %>%
  gather(New_Account, Value, c(Expenses, Sales))
# A tibble: 2 x 4
# Groups: State [1]
# State Month New_Account Value
# <chr> <chr> <chr> <dbl>
#1 NY Jan Expenses 1122
#2 NY Jan Sales 1040
I'd recommend checking out the concept of "tidy data", as there are some real challenges with working on data with the structure you currently have. E.g. creating t3 should only take 2-3 lines of code; all of the following is just to work around your data architecture:
library(tidyverse)
t1 <- data.frame(State = rep("NY", 2),
                 Month = rep(as.Date("2018-01-01"), 2),
                 Account = c("Expected Sales", "Expected Expenses"),
                 Value = c(1.04, 1.02),
                 stringsAsFactors = FALSE)
t2 <- data.frame(State = rep("NY", 4),
                 Month = rep(as.Date("2018-01-01"), 4),
                 Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
                 Value = c(1000, 500, 1000, 100),
                 stringsAsFactors = FALSE)
t3 <- t2 %>%
  spread(Account, Value) %>%
  inner_join({
    t1 %>%
      spread(Account, Value)
  }, by = c("State" = "State", "Month" = "Month")) %>%
  mutate(NewExpenses = (`F Expenses` + `V Expenses`) * `Expected Expenses`,
         NewSales = Sales * `Expected Sales`) %>%
  select(State, Month, Sales = NewSales, Expenses = NewExpenses) %>%
  gather(Sales, Expenses, key = `New Account`, value = Value)

Merge dataframes based on regex condition

This problem involves R. I have two dataframes, represented by this minimal reproducible example:
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"), county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"), jobs = c("4", "204"))
An example to help communicate the very specific operation I am trying to perform: the geocode_selector column in dataframe a contains the FIPS county codes of the five boroughs of NY. The geocode column in dataframe b is the 15-digit ID of a specific Census block. The first five digits of a geocode match a more general geocode_selector, indicating which county the Census block is located in. I want to add a column to b specifying which county each census block falls under, based on which geocode_selector each geocode in b matches with.
Generally, I'm trying to merge dataframes based on a regex condition. Ideally, I'd like to perform a full merge carrying all of the columns of a over to b and not just the county_name.
I tried something along the lines of:
b[, "county_name"] <- NA
for (i in 1:nrow(b)) {
for (j in 1:nrow(a)) {.
if (grepl(data.a$geocode_selector[j], b$geocode[i]) == TRUE) {
b$county_name[i] <- a$county_name[j]
}
}
}
but it took an extremely long time for the large datasets I am actually processing and the finished product was not what I wanted.
Any insight on how to merge dataframes conditionally based on a regex condition would be much appreciated.
You could do this...
b$geocode_selector <- substr(b$geocode,1,5)
b2 <- merge(b, a, all.x=TRUE) #by default it will merge on common column names
b2
geocode_selector geocode jobs county_name
1 36005 360050002001002 4 Bronx
2 36085 360850323001019 204 Richmond
If you wish, you can delete the geocode_selector column from b2 with b2[,1] <- NULL
We can use sub to create the 'geocode_selector' and then do the join
library(data.table)
setDT(a)[as.data.table(b)[, geocode_selector := sub('^(.{5}).*', '\\1', geocode)],
on = .(geocode_selector)]
# geocode_selector county_name geocode jobs
#1: 36005 Bronx 360050002001002 4
#2: 36085 Richmond 360850323001019 204
This is a great opportunity to use dplyr. I also tend to like the string handling functions in stringr, such as str_sub.
library(dplyr)
library(stringr)
a <- data_frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
                county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data_frame(geocode = c("360050002001002", "360850323001019"),
                jobs = c("4", "204"))
b %>%
  mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
  inner_join(a, by = "geocode_selector")
#> # A tibble: 2 x 4
#> geocode jobs geocode_selector county_name
#> <chr> <chr> <chr> <chr>
#> 1 360050002001002 4 36005 Bronx
#> 2 360850323001019 204 36085 Richmond
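If b can contain geocodes whose first five digits have no match in a and you want to keep those rows (with an NA county_name), the same pipeline with left_join instead of inner_join would do it:
b %>%
  mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
  left_join(a, by = "geocode_selector")  # unmatched blocks keep NA for the columns of a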
