Duplicate Columns when using the merge statement - r

When I try to merge some codes to the code descriptions I get 2 duplicate columns. I start out with this:
Table Name: Test
ID State
1 5
2 2
3 5
and want to merge it with this:
Table Name: statecode
StateID State
5 Mass
2 NY
to make a table like this:
ID State
1 Mass
2 NY
3 Mass
However, I get a table like this:
ID State State
1 5 Mass
2 2 NY
3 5 Mass
I used the merge command like this:
test = merge(x = test, y = statecode, by.x = "State", by.y = "StateID", all.x = T)
Is there a better function other than merge to use in this case? Maybe one to just replace the state code with the state name?
Thank you very much for the help!

You do have to say which column you want to drop, but you can express it concisely using dplyr, for example.
Generating sample data based on yours (but correcting the column names):
test <- read.table(text =
"ID StateID
1 5
2 2
3 5", header = TRUE)
statecode <- read.table(text =
"
StateID State
5 Mass
2 NY", header = TRUE)
Using dplyr:
library(dplyr)
test %>% left_join(statecode, by = "StateID") %>% select(-StateID)
ID State
1 1 Mass
2 2 NY
3 3 Mass

Another way with base R:
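# Assumes the question's column names: Test has ID and State (the code), statecode has StateID and State (the name)
Pmerge <- function(df1, df2) {
  # merge on the code, then drop the first column (the merged code) so only ID and the state name remain
  res <- suppressWarnings(merge(df1, df2, by.x = "State", by.y = "StateID", all.x = TRUE)[, -1])
  # merge() sorts by the key, so restore the original ID order and reset the row names
  newdf <- res[order(res$ID), ]
  row.names(newdf) <- seq_len(nrow(newdf))
  newdf
}
Pmerge(Test, statecode)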
ID State
1 1 Mass
2 2 NY
3 3 Mass
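If the goal is only to swap each code for its name, a lookup with match() may be simpler than a merge. A minimal sketch in base R, assuming the question's column names (ID and State in test, StateID and State in statecode):

```r
# Data as described in the question.
test <- data.frame(ID = 1:3, State = c(5, 2, 5))
statecode <- data.frame(StateID = c(5, 2), State = c("Mass", "NY"),
                        stringsAsFactors = FALSE)

# match() returns, for each code in test$State, its row position in
# statecode$StateID; indexing statecode$State by those positions gives the names.
test$State <- statecode$State[match(test$State, statecode$StateID)]
test
```

Codes with no entry in statecode come back as NA, which is comparable to what all.x = TRUE does in merge.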

Related

Cleaning Data: Multiple Misspelled Strings in R

I have over 100 strings that I want to change. For example:
Scheduled Caste, Schdeduled Caste, and Schedulded Caste all need to be changed to SC.
I have been doing it like this: Haryana3$Category[Haryana3$Category %in% "Scheduled Caste"] <- "SC"
Is there anything I can do that's more efficient?
Use gsub
Haryana3$Category <- gsub("Scheduled Caste", "SC", Haryana3$Category)
You can use data.table and try the following:
library(data.table)
setDT(Haryana3)
Haryana3[, Category := gsub("Scheduled Caste", "SC", Category)]
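When the variants don't share a single pattern, a named lookup vector may scale better than one gsub call per spelling. A sketch in base R; the sample rows and the name `lookup` are made up for illustration, and the misspellings are the ones listed in the question:

```r
# Hypothetical sample data using the spellings mentioned in the question.
Haryana3 <- data.frame(
  ID = 1:4,
  Category = c("Scheduled Caste", "Schdeduled Caste",
               "Schedulded Caste", "Forward Caste"),
  stringsAsFactors = FALSE
)

# Names are the spellings found in the data, values are the replacements.
lookup <- c(
  "Scheduled Caste"  = "SC",
  "Schdeduled Caste" = "SC",
  "Schedulded Caste" = "SC"
)

# match() gives NA for values without an entry, so those rows keep their text.
idx <- match(Haryana3$Category, names(lookup))
Haryana3$Category <- ifelse(is.na(idx), Haryana3$Category, unname(lookup[idx]))
Haryana3
```

The lookup could also live in a CSV with two columns, which makes a list of 100+ corrections easier to maintain than code.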
I guess the rule is combining the first letter of each word. If that is true, here is one idea.
library(tidyverse)
Haryana3 <- Haryana3 %>%
mutate(Category = strsplit(Category, split = " ")) %>%
mutate(Category = map_chr(Category, ~paste0(str_sub(.x, start = 1L, end = 1L), collapse = "")))
Haryana3
# ID Category
# 1 1 SC
# 2 2 SC
# 3 3 ST
# 4 4 ST
# 5 5 FC
DATA
Haryana3 <- read.table(text = "ID Category
1 'Scheduled Caste'
2 'Scheduled Caste'
3 'Scheduled Tribes'
4 'Scheduled Tribes'
5 'Forward Caste'", header = TRUE)

How to create a dataframe with rows of open and close as columns

From this example of dataframe:
dframe <- data.frame(status = c("open","close","open","close"), name = c("Google","Google","Amazon","Amazon"), id = c(1,1,2,2), volume1 = c(2,3,1,2), othercol = c(5.3,1,3,7))
How can I create a new data frame in which the open and close rows become columns? Here is an example of the expected output:
data.frame(name = c("Google", "Amazon"), id = c(1,2), volume1_open = c(2,1), volume1_close = c(3,2), othercol_open = c(5.3,3), othercol_close = c(1,7))
> name id volume1_open volume1_close othercol_open othercol_close
> Google 1 2 3 5.3 1
> Amazon 2 1 2 3.0 7
Using data.table, you can use dcast in order to reshape your data to wide format:
Code
library(data.table)
setDT(dframe)
dcast(dframe, name + id ~ status, value.var = c('volume1', 'othercol'))
Result
name id volume1_close volume1_open othercol_close othercol_open
1: Amazon 2 2 1 7 3.0
2: Google 1 3 2 1 5.3
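For comparison, base R's reshape() can produce a similar wide table without extra packages. A sketch using the question's dframe; `wide` is just an illustrative name, and the column order differs from the dcast result:

```r
dframe <- data.frame(
  status = c("open", "close", "open", "close"),
  name = c("Google", "Google", "Amazon", "Amazon"),
  id = c(1, 1, 2, 2),
  volume1 = c(2, 3, 1, 2),
  othercol = c(5.3, 1, 3, 7),
  stringsAsFactors = FALSE
)

# idvar identifies one output row; timevar supplies the suffix that is
# appended (with sep) to every remaining column.
wide <- reshape(dframe, idvar = c("name", "id"), timevar = "status",
                direction = "wide", sep = "_")
wide
```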

Create a dummy indicating presence of string fragment in any of multiple variables

df <- data.frame (address.1.line = c("apartment 5", "25 spring street", "nice house"), address.2.line = c("london", "new york", "apartment 2"), address.3.line = c("", "", "paris"))
I'm trying to make a function that returns a new column in a data frame. The column should be a dummy variable attached to the original data frame indicating whether any of 3 address-line variables contain a string (or selection of strings).
E.g., in the example above, I want df to have a new variable called "Apartment_dummy" indicating the presence of the string fragment "apartment" in any of the three address lines, so it will take 1 in rows 1 and 3, and zero in row 2. The function therefore needs to take 2 arguments: the name of the new dummy variable to be created, and the corresponding string fragment that needs to be detected in the address variables.
I'd tried the following. It will return a dummy, but won't give the new variable the right name. Also, I feel like there must be a way to do it in a single step. Any ideas? Many thanks!
library(tidyverse)
premises_dummy <- function(varname = NULL, strings = NULL) {
df %<>% mutate_at(.funs = funs(flagA = str_detect(., strings)), .vars = vars(ends_with(".line"))) %>%
mutate(varname = ifelse(rowSums(select(., contains("flagA"))) > 0, 1, 0))
return(df)
}
df <- premises_dummy(varname = 'Apartment_dummy', strings = 'apartment')
A tidyverse option using tidyr::unite and stringr::str_detect
library(tidyverse)
df %>%
unite(tmp, remove = F) %>%
mutate(Apartment_dummy = +str_detect(tmp, "apartment")) %>%
select(-tmp)
# address.1.line address.2.line address.3.line Apartment_dummy
#1 apartment 5 london 1
#2 25 spring street new york 0
#3 nice house apartment 2 paris 1
A quick data.table solution to it:
library(data.table)
dt <- data.table(df)
search_string <- "apartment"
dt[like(address.1.line, search_string)|
like(address.2.line, search_string)|
like(address.3.line, search_string), paste0(search_string,".Dummy") := 1]
dt[is.na(get(paste0(search_string,".Dummy"))), paste0(search_string,".Dummy") := 0]
A base R solution :
cols = endsWith(names(df),"line")
df['Apartment_dummy'] = as.integer(grepl('apartment',do.call(paste,df[cols])))
Now we can wrap this in a function. As written it reads df from the calling environment; the data could also be passed in as an argument.
premises_dummy=function(varname,strings){
cols = endsWith(names(df),"line")
df[varname]= as.integer(grepl(strings,do.call(paste,df[cols])))
df
}
premises_dummy(varname = 'Apartment_dummy', strings = 'apartment')
address.1.line address.2.line address.3.line Apartment_dummy
1 apartment 5 london 1
2 25 spring street new york 0
3 nice house apartment 2 paris 1
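A variant of the base R function that takes the data frame explicitly might look like the sketch below. The name `add_dummy` and its argument names are hypothetical; the logic is the same paste-then-grepl idea as above.

```r
df <- data.frame(
  address.1.line = c("apartment 5", "25 spring street", "nice house"),
  address.2.line = c("london", "new york", "apartment 2"),
  address.3.line = c("", "", "paris"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: data is passed in rather than read from the global env.
add_dummy <- function(data, varname, pattern) {
  cols <- endsWith(names(data), ".line")
  # Paste the address lines of each row into one string, then search it once.
  combined <- do.call(paste, unname(data[cols]))
  data[[varname]] <- as.integer(grepl(pattern, combined))
  data
}

out <- add_dummy(df, varname = "Apartment_dummy", pattern = "apartment")
out
```

Because the data frame is an argument, the same function can be applied to other data frames with .line columns, and the original df is left unchanged unless you reassign it.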

Match part of a string in a dataframe and replace it by entry of another dataframe

I'm fairly new to R and I'm running into the following problem.
Let's say I have the following data frames:
sale_df <- data.frame("Cheese" = c("cheese-01", "cheese-02", "cheese-03"), "Number_of_sales" = c(4, 8, 23))
id_df <- data.frame("ID" = c(1, 2, 3), "Name" = c("Leerdammer", "Gouda", "Mozerella"))
What I want to do is match the numbers of the first column of id_df to the numbers in the string of the first column of sale_df.
Then I want to replace the value in sale_df by the value in the second column of id_df, i.e. I want cheese-01 to become "Leerdammer".
Does anyone have any idea how I could solve this?
With tidyverse:
library(tidyverse)
sale_df %>% mutate(ID=as.numeric(str_extract(Cheese,"(?<=cheese-).*"))) %>% inner_join(id_df,by="ID")
# Cheese Number_of_sales ID Name
#1 cheese-01 4 1 Leerdammer
#2 cheese-02 8 2 Gouda
#3 cheese-03 23 3 Mozerella
Assuming that all entries for Cheese in sale_df will start with cheese-, here is a simple solution.
sale_df$CheeseID <- as.numeric(substring(sale_df$Cheese, 8))
merge(sale_df, id_df, by.x = "CheeseID", by.y = "ID", all.x = TRUE)
Using match in base R to replace the Cheese values directly:
sale_df$Cheese <- id_df$Name[match(as.numeric(gsub("\\D", "", sale_df$Cheese)), id_df$ID)]
> sale_df
Cheese Number_of_sales
1 Leerdammer 4
2 Gouda 8
3 Mozerella 23

Selecting specific rows from a dataframe based on column match in another dataframe in R

I have a question. Please help.
I have two data frames, data1 and data2.
data1 has the following data:
HHID blockid serial_number name
100 1 1 xxx
100 2 2 yyy
100 1 3 zzz
200 1 1 sss
200 1 2 ddd
data2 is as below:
HHID serial hospital
100 3 Delhi
200 2 paris
Now, I want to select rows in data1 based on HHID and serial in data2. For example, data2 has a row with HHID 100 and serial 3, so I want to select only the row from data1 where HHID is 100 and serial_number is 3, and similarly for HHID 200 and serial 2. Also, when I select rows from data1, I don't want any extra columns from data2. All I care about is whether HHID and serial in data2 match in data1. If they do, I need the complete row from data1. So the output should be as follows:
HHID blockid serial name
100 1 3 zzz
200 1 2 ddd
Can somebody help?
Thank you
You can create a unique ID for each data frame like so:
#data frame definitions
data1 <- data.frame(HHID = c(100,100,100,200,200), blockid = c(1,2,1,1,1), serial_number = c(1,2,3,1,2), name = c('xxx', 'yyy', 'zzz', 'sss', 'ddd'))
data2 <- data.frame(HHID = c(100,200), serial = c(3,2), hospital = c('Delhi', 'paris'))
#unique identifier (the separator keeps e.g. HHID 1, serial 11 distinct from HHID 11, serial 1)
data1$unique <- paste(data1$HHID, data1$serial_number, sep = '_')
data2$unique <- paste(data2$HHID, data2$serial, sep = '_')
Then, you can use the subset function to isolate rows in data1, like so:
result <- subset(data1, unique %in% data2$unique)
I would suggest:
Here I recreate the data:
library(tidyverse)
data1 <- read.table(text="HHID blockid serial_number name
100 1 1 xxx
100 2 2 yyy
100 1 3 zzz
200 1 1 sss
200 1 2 ddd", sep = " ", stringsAsFactors = F, header = T)
data2 <- read.table(text="HHID serial hospital
100 3 Delhi
200 2 paris", sep = " ", stringsAsFactors = F, header = T)
That's my suggestion
results <- data1 %>%
rename(serial=serial_number) %>%
right_join(data2, by=c("HHID", "serial")) %>%
select(-hospital) # get rid of the hospital column
results
If you are not familiar with the tidyverse, you can run the pipeline one step at a time, stopping before each %>%, to inspect the intermediate results. That's the output:
HHID blockid serial name
1 100 1 3 zzz
2 200 1 2 ddd
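The same filter can also be written in base R with merge(), keeping only the key columns of data2 so that no extra columns come along. A sketch under the question's column names; `keys` and `matched` are illustrative names:

```r
data1 <- data.frame(
  HHID = c(100, 100, 100, 200, 200),
  blockid = c(1, 2, 1, 1, 1),
  serial_number = c(1, 2, 3, 1, 2),
  name = c("xxx", "yyy", "zzz", "sss", "ddd"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  HHID = c(100, 200),
  serial = c(3, 2),
  hospital = c("Delhi", "paris"),
  stringsAsFactors = FALSE
)

# Keep only the two key columns, de-duplicated, so the join cannot add
# columns or multiply rows of data1.
keys <- unique(data2[c("HHID", "serial")])

# Default merge() is an inner join: only rows of data1 with a matching key remain.
matched <- merge(data1, keys,
                 by.x = c("HHID", "serial_number"),
                 by.y = c("HHID", "serial"))
matched
```

Note that merge() moves the key columns to the front and sorts by them, so reorder the columns afterwards if the original layout matters.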
