Combining two columns in R

I have two columns in R:
Origin Dest
ALB ATL
ALB LAG
ALB LAX
I need them to look like this (in one column):
Origin-Dest
ALB-ATL
ALB-LAG
ALB-LAX
Does anyone know a simple way to combine the two columns into one?
This is the code I have so far:
air <- read.table(delta, header=T, sep=",")
aircolSQL <- sqldf("select Origin, Dest, ActualElapsedTime from air")
airsortSQL <- sqldf("select * from aircolSQL order by Origin asc, Dest")
airsortSQL$ActualTimeHours = round((airsortSQL$ActualElapsedTime/60),1)
airsortSQL$ActualElapsedTime <- NULL
Thanks!

If df is your data frame:
df <- data.frame(Origin = rep('ALB', 3), Dest = c('ATL', 'LAG', 'LAX'))
library(tidyr)
unite(df, "Origin-Dest", Origin, Dest, sep = "-")
Origin-Dest
1 ALB-ATL
2 ALB-LAG
3 ALB-LAX

In SQLite || is used for string concatenation:
library(sqldf)
sqldf("select Origin || '-' || Dest OD from air")
giving:
OD
1 ALB-ATL
2 ALB-LAG
3 ALB-LAX
Data. We used this as air:
air <- structure(list(Origin = structure(c(1L, 1L, 1L), .Label = "ALB",
class = "factor"), Dest = structure(1:3, .Label = c("ATL", "LAG", "LAX"),
class = "factor")), .Names = c("Origin", "Dest"), class = "data.frame",
row.names = c(NA, -3L))

I often use paste:
DataFrame$OriginDest <- paste(DataFrame$Origin, DataFrame$Dest, sep = '-')
I find it the least hassle approach.
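As a minimal sketch of the paste approach on the question's sample data (the data frame below is rebuilt by hand for illustration, and the name route is an assumption, not something from the original post):

```r
# Hypothetical sample data mirroring the question's two columns
air <- data.frame(
  Origin = c("ALB", "ALB", "ALB"),
  Dest   = c("ATL", "LAG", "LAX"),
  stringsAsFactors = FALSE
)

# paste() is vectorised, so it joins the two columns row by row
air$OriginDest <- paste(air$Origin, air$Dest, sep = "-")

# Keep only the combined column if the originals are no longer needed
route <- air["OriginDest"]
print(route)
```

Because paste() converts its arguments to character, the same line should also work when Origin and Dest are factors.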

Related

How do I lock the first digits of the 'by' column in a stringdist join?

I am trying to use stringdist_join to merge two tables. I have built my 'by' variable as the concatenation of three variables which are named as such:
UAI : a serial number
nom : surname
prenom : first name
The code below works well, however I'd like to have a perfect match on the UAI part which is always the first 8 characters of the variable UAInomprenom. How can I do that?
stringdist_join(Ech_final_nom, BSA_affect_nom,
by = "UAInomprenom",
mode = "left",
ignore_case = FALSE,
method = "jw",
max_dist = 0.1117,
distance_col = "dist")
Thank you for your help!

I am taking the following two datasets as an example:
df1 <- structure(list(V1 = c("abcNum1Num1Num1Num1", "abc1Num1Num1Num1Num",
"accArv", "accbrf"), V2 = c(1L, 4L, 5L, 2L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(V1 = c("abcNun1Nun1Nun1Nun1", "abc1Nun1Nun1Nun1Nun",
"accArv", "accNun1Nun1Nun1Nun1"), V2 = c(2L, 5L, 4L, 1L)), class = "data.frame", row.names = c(NA,
-4L))
In these two data frames, V1 is the join field. Its first 3 characters must match exactly (in your case, it is the first 8 characters).
First, split V1 so that those first 3 characters sit in a column of their own:
library(fuzzyjoin)
library(tidyverse)
df1 <- df1 %>%
extract(V1, into = c("V1A","V1B"), "(.{3})(.*)")
df2 <- df2 %>%
extract(V1, into = c("V1A","V1B"), "(.{3})(.*)")
Finally, apply the fuzzy join to the remaining part of the key and keep only the rows where the 3-character prefix is identical in both tables:
stringdist_join(df1, df2,
by = "V1B",
mode = "left",
ignore_case = FALSE,
method = "jw",
max_dist = 0.5) %>%
filter(V1A.x == V1A.y) %>%
unite("V1",c("V1A.x","V1B.x"),sep="") %>%
select(V1,V2=V2.x,V3=V2.y)
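For the 8-character UAI prefix from the question, the splitting step could also be done in base R with substr() and substring(). This is only a sketch: the helper split_key(), the new column names UAI and nomprenom, and the sample keys are all made up for illustration.

```r
# Hypothetical helper: split a combined key into an exact-match prefix
# and a remainder that can be matched fuzzily
split_key <- function(data, key = "UAInomprenom", n_exact = 8) {
  data$UAI       <- substr(data[[key]], 1, n_exact)       # first 8 characters
  data$nomprenom <- substring(data[[key]], n_exact + 1)   # everything after them
  data
}

# Made-up sample keys: 8-character serial number followed by surname and first name
ech <- data.frame(
  UAInomprenom = c("0750001ADUPONTJEAN", "0750002BMARTINLUC"),
  stringsAsFactors = FALSE
)

ech_split <- split_key(ech)
print(ech_split)
```

The fuzzy join can then be run on nomprenom alone, with a filter on UAI.x == UAI.y afterwards, in the same way as the V1A/V1B example above.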

How to check if values in one dataframe exist in another dataframe in R?

Suppose we have a data frame like this:
id reply user_name
1 NA John
2 NA Amazon
3 NA Bob
And another data frame like this:
name organisation
John Amazon
Pat Apple
Is there a way to fill in the reply column in the first data frame with 'True' or 'False' if the values in column 3 match either columns 1 or 2 in the second data frame? So for example, since John and Amazon from the second data frame exist in the first data frame, I want the first data frame to update as so:
id reply user_name
1 True John
2 True Amazon
3 False Bob

Try this using %in% and a vector for all values:
#Code
df1$reply <- df1$user_name %in% c(df2$name,df2$organisation)
Output:
df1
id reply user_name
1 1 TRUE John
2 2 TRUE Amazon
3 3 FALSE Bob
Some data used:
#Data1
df1 <- structure(list(id = 1:3, reply = c(NA, NA, NA), user_name = c("John",
"Amazon", "Bob")), class = "data.frame", row.names = c(NA, -3L
))
#Data2
df2 <- structure(list(name = c("John", "Pat"), organisation = c("Amazon",
"Apple")), class = "data.frame", row.names = c(NA, -2L))

We can use %in% in base R:
df1$reply <- df1$user_name %in% unlist(df2)
If we want to convert the logical values to title-case character strings:
df1$reply <- sub("^(.)(.*)", "\\1\\L\\2", df1$reply, perl = TRUE)
df1$reply
#[1] "True" "True" "False"
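If the regular expression feels opaque, the same 'True'/'False' labels can probably be produced more directly with ifelse(). A small sketch, with the two data frames rebuilt by hand and lookup as an illustrative name:

```r
# Sample data rebuilt from the question
df1 <- data.frame(id = 1:3, reply = NA,
                  user_name = c("John", "Amazon", "Bob"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("John", "Pat"),
                  organisation = c("Amazon", "Apple"),
                  stringsAsFactors = FALSE)

# All values to search in, taken from both columns of df2
lookup <- c(df2$name, df2$organisation)

# %in% returns TRUE/FALSE per row; ifelse() maps those to the wanted labels
df1$reply <- ifelse(df1$user_name %in% lookup, "True", "False")
print(df1)
```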
data
df1 <- structure(list(id = 1:3, reply = c(NA, NA, NA), user_name = c("John",
"Amazon", "Bob")), class = "data.frame", row.names = c(NA, -3L
))
df2 <- structure(list(name = c("John", "Pat"), organisation = c("Amazon",
"Apple")), class = "data.frame", row.names = c(NA, -2L))

Here's how you can get the exact output you're looking for with 3 lines of code!
df1 <- data.frame(id = 1:3, reply = NA, user.name = c("John", "Amazon", "Bob"), stringsAsFactors = F)
df2 <- data.frame(id = 1:2, name = c("John", "Pat"), organisation = c("Amazon", "Apple"), stringsAsFactors = F)
df1$reply <- df1$user.name %in% unlist(df2) %>% as.character() %>% str_to_title()
Output
id reply user.name
1 True John
2 True Amazon
3 False Bob
You will need magrittr (or dplyr, which also provides %>%) and stringr, which I highly recommend for data wrangling of all kinds.

Building on the first answer, you can also solve this in a tidy way.
#Building your dataframes
df1 <- data.frame(id = 1:3, reply = NA, user.name = c("John", "Amazon", "Bob"), stringsAsFactors = F)
df2 <- data.frame(id = 1:2, name = c("John", "Pat"), organisation = c("Amazon", "Apple"), stringsAsFactors = F)
df1 %>%
mutate(reply = user.name %in% c(df2$name, df2$organisation))
Personally, I like the tidy solution because you can easily pipe the result onward to get more insights. For instance, if you want to know how many people replied, that takes just one more line:
df1 %>%
mutate(reply = user.name %in% c(df2$name, df2$organisation)) %>%
summarize(reply_sum = sum(reply))

Converting empty values to NULL in R - Handling date column

I have a simple data frame; this is the output of dput(emp):
structure(list(name = structure(1L, .Label = "Alex", class = "factor"),
job = structure(1L, .Label = "", class = "factor"), Mgr = structure(1L, .Label = "", class = "factor"),
update = structure(18498, class = "Date")), class = "data.frame", row.names = c(NA,
-1L))
I want to convert all empty values to NULL (NA in R).
The simplest way to achieve this is:
emp[emp==""] <- NA
This would of course have worked, but I get an error for the date column:
Error in charToDate(x) :
character string is not in a standard unambiguous format
How can I convert all other empty rows to NULL without having to deal with the date column? Please note that the actual data frame has 30000+ rows.

Try formatting the date variable as character, make the change, and then convert it back to a date:
#Format date
emp$update <- as.character(emp$update)
#Replace
emp[emp=='']<-NA
#Reformat date
emp$update <- as.Date(emp$update)
Output:
name job Mgr update
1 Alex <NA> <NA> 2020-08-24
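Another option, sketched here under the assumption that only character and factor columns can hold empty strings, is to leave the date column alone and replace values only in the text columns. The name is_text is illustrative, and emp is rebuilt by hand to resemble the question's data:

```r
# Rebuild a data frame like the one in the question (factor columns plus a Date)
emp <- data.frame(name = "Alex", job = "", Mgr = "",
                  update = as.Date("2020-08-24"),
                  stringsAsFactors = TRUE)

# Flag the columns that can actually contain "" (character or factor)
is_text <- vapply(emp, function(col) is.character(col) || is.factor(col),
                  logical(1))

# Replace "" with NA in those columns only; the Date column is never touched
emp[is_text] <- lapply(emp[is_text], function(col) {
  col[!is.na(col) & col == ""] <- NA
  col
})
print(emp)
```

This avoids the round trip through as.character() and as.Date(), which may matter when there are several date columns.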

You can try type.convert like below
type.convert(emp,as.is = TRUE)
such that
name job Mgr update
1 Alex NA NA 2020-08-24

You may try this using dplyr:
library(dplyr)
df %>%
mutate_at(vars(update),as.character) %>%
na_if(.,"")
As mentioned by @Duck, you have to convert the date variable to character first.
Afterwards you can convert it back to a date if you need to:
library(dplyr)
df %>%
mutate_at(vars(update),as.character) %>%
na_if(.,"") %>%
mutate_at(vars(update),as.Date)
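na_if() is designed for vectors, and recent dplyr versions may reject a whole data frame, so a column-wise variant could be safer. The following is a sketch rather than a drop-in replacement. It assumes a dplyr version that provides across(), that the text columns are character rather than factor, and emp_clean is an illustrative name:

```r
library(dplyr)

# Sample data resembling the question, with character text columns
emp <- data.frame(name = "Alex", job = "", Mgr = "",
                  update = as.Date("2020-08-24"),
                  stringsAsFactors = FALSE)

# Apply na_if() to each character column; the Date column is not selected,
# so it needs no conversion to character and back
emp_clean <- emp %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))

print(emp_clean)
```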

See if this works:
> library(dplyr)
> library(purrr)
> emp <- structure(list(name = structure(1L, .Label = "Alex", class = "factor"),
+ job = structure(1L, .Label = "", class = "factor"), Mgr = structure(1L, .Label = "", class = "factor"),
+ update = structure(18498, class = "Date")), class = "data.frame", row.names = c(NA,
+ -1L))
> emp
name job Mgr update
1 Alex 2020-08-24
> emp %>% mutate(update = as.character(update)) %>% map_df(~gsub('^$',NA, .x)) %>% mutate(update = as.Date(update)) %>% mutate(across(1:3, as.factor))
# A tibble: 1 x 4
name job Mgr update
<fct> <fct> <fct> <date>
1 Alex NA NA 2020-08-24
>

Reshaping untidy and unbalanced dataset from wide to long [duplicate]

I have a dataset (data) that looks like this:
ID,ABC.BC,ABC.PL,DEF.BC,DEF.M,GHI.PL
SB0005,C01,D20,C01a,C01b,D20
BC0013,C05,D5,C05a,NA,D5
I want to reshape it from wide-to-long format to get something like this:
ID,FC,Type,Var
SB0005,ABC,BC,C01
SB0005,ABC,PL,D20
SB0005,DEF,BC,C01a
SB0005,DEF,M,C01b
SB0005,GHI,PL,D20
BC0013,ABC,BC,C05
BC0013,ABC,PL,D5
BC0013,DEF,BC,C05a
# BC0013,DEF,M,NA (This row need not be in the dataset as I will remove it later)
BC0013,GHI,PL,D5
The usual reshape package does not work as the dataset is unbalanced. I also tried Reshape from splitstackshape but it does not give me what I want.
library(splitstackshape)
vary <- grep("\\.BC$|\\.PL$|\\.M$", names(data))
stubs <- unique(sub("\\..*$", "", names(data[vary])))
Reshape(data, id.vars=c("ID"), var.stubs=stubs, sep=".")
ID,time,ABC,DEF,GHI
SB0005,1,C01,C01a,D20
BC0013,1,C05,C05a,D5
SB0005,2,D20,C01b,NA
BC0013,2,D5,NA,NA
SB0005,3,NA,NA,NA
BC0013,3,NA,NA,NA
Appreciate any suggestions, thanks!
Providing the output of dput(data) as requested
structure(list(ID = structure(c(2L, 1L), .Label = c("BC0013",
"SB0005"), class = "factor"), ABC.BC = structure(1:2, .Label = c("C01",
"C05"), class = "factor"), ABC.PL = structure(1:2, .Label = c("D20",
"D5"), class = "factor"), DEF.BC = structure(1:2, .Label = c("C01a",
"C05a"), class = "factor"), DEF.M = structure(1:2, .Label = c("C01b",
"NA"), class = "factor"), GHI.PL = structure(1:2, .Label = c("D20",
"D5"), class = "factor")), .Names = c("ID", "ABC.BC", "ABC.PL",
"DEF.BC", "DEF.M", "GHI.PL"), row.names = c(NA, -2L), class = "data.frame")

You need to reshape your data into long format first, and then you can split the variable column into two columns. With splitstackshape you could do:
library(splitstackshape) # this will also load 'data.table' from which the 'melt' function is used
cSplit(melt(mydf, id.vars = 'ID'),
'variable',
sep = '.',
direction = 'wide')[!is.na(value)]
which results in:
ID value variable_1 variable_2
1: SB0005 C01 ABC BC
2: BC0013 C05 ABC BC
3: SB0005 D20 ABC PL
4: BC0013 D5 ABC PL
5: SB0005 C01a DEF BC
6: BC0013 C05a DEF BC
7: SB0005 C01b DEF M
8: SB0005 D20 GHI PL
9: BC0013 D5 GHI PL
An alternative with tidyr:
library(tidyr)
library(dplyr) # provides %>% and filter()
mydf %>%
gather(var, val, -ID) %>%
separate(var, c('FC','Type')) %>%
filter(!is.na(val))
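With a tidyr version that provides pivot_longer(), the melt-and-split steps can probably be combined into one call. A sketch, assuming the missing DEF.M value is a real NA rather than the string "NA", with mydf rebuilt by hand and long as an illustrative name:

```r
library(tidyr)

# Sample data rebuilt from the question, with a genuine NA in DEF.M
mydf <- data.frame(
  ID     = c("SB0005", "BC0013"),
  ABC.BC = c("C01", "C05"),
  ABC.PL = c("D20", "D5"),
  DEF.BC = c("C01a", "C05a"),
  DEF.M  = c("C01b", NA),
  GHI.PL = c("D20", "D5"),
  stringsAsFactors = FALSE
)

# Split each column name at the dot into FC and Type while reshaping,
# and drop the rows whose value is NA
long <- pivot_longer(
  mydf,
  cols = -ID,
  names_to = c("FC", "Type"),
  names_sep = "\\.",
  values_to = "Var",
  values_drop_na = TRUE
)
print(long)
```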

text cleaning in R

I have a single column in R that looks like this:
Path Column
ag.1.4->ao.5.5->iv.9.12->ag.4.35
ao.11.234->iv.345.455.1.2->ag.9.531
I want to transform this into:
Path Column
ag->ao->iv->ag
ao->iv->ag
How can I do this?
Thank you
Here is my full dput from my data:
structure(list(Rank = c(10394749L, 36749879L), Count = c(1L,
1L), Percent = c(0.001011122, 0.001011122), Path = c("ao.legacy payment.not_completed->ao.legacy payment.not_completed->ao.legacy payment.completed",
"ao.legacy payment.not_completed->agent.payment.completed")), .Names = c("Rank",
"Count", "Percent", "Path"), class = "data.frame", row.names = c(NA,
-2L))

You could use gsub to match the . and numbers following the . (\\.[0-9]+) and replace it with ''.
df1$Path.Column <- gsub('\\.[0-9]+', '', df1$Path.Column)
df1
# Path.Column
#1 ag -> ao -> iv -> ag
#2 ao -> iv -> ag
Update
For the new dataset df2
gsub('\\.[^->]+(?=(->|\\b))', '', df2$Path, perl=TRUE)
#[1] "ao->ao->ao" "ao->agent"
and for the strings shown in the OP's post
str2 <- c('ag.1.4->ao.5.5->iv.9.12->ag.4.35',
'ao.11.234->iv.345.455.1.2->ag.9.531')
gsub('\\.[^->]+(?=(->|\\b))', '', str2, perl=TRUE)
#[1] "ag->ao->iv->ag" "ao->iv->ag"
data
df1 <- structure(list(Path.Column = c("ag.1 -> ao.5 -> iv.9 -> ag.4",
"ao.11 -> iv.345 -> ag.9")), .Names = "Path.Column",
class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(Rank = c(10394749L, 36749879L), Count = c(1L,
1L), Percent = c(0.001011122, 0.001011122),
Path = c("ao.legacy payment.not_completed->ao.legacy payment.not_completed->ao.legacy payment.completed",
"ao.legacy payment.not_completed->agent.payment.completed")),
.Names = c("Rank", "Count", "Percent", "Path"), class = "data.frame",
row.names = c(NA, -2L))

It may be easier to split the strings on '->' and process the substrings separately:
# split the strings into parts
subStrings <- strsplit(df$Path,'->')
# remove everything after the **first** dot
subStrings<- lapply(subStrings,
function(x)gsub('\\..*','',x))
# paste them back together.
sapply(subStrings,paste0,collapse="->")
#> "ao->ao->ao" "ao->agent"
or
# split the strings into parts
subStrings <- strsplit(df$Path,'->')
# remove the parts of the identifiers after the dot
subStrings<- lapply(subStrings,
function(x)gsub('\\.[^ \t]*','',x))
# paste them back together.
sapply(subStrings,paste0,collapse="->")
#> "ao payment->ao payment->ao payment" "ao payment->agent"
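The first variant of the split-and-rejoin idea can be wrapped in a small function so that it works on any character vector of paths. This is a sketch; clean_path() is a hypothetical name, and it assumes each step should be cut at its first dot:

```r
# Hypothetical helper: keep only the part of each step before its first dot
clean_path <- function(path) {
  parts <- strsplit(path, "->", fixed = TRUE)        # one vector of steps per path
  vapply(parts, function(p) {
    steps <- sub("\\..*$", "", trimws(p))            # drop everything from the first dot on
    paste(steps, collapse = "->")                    # glue the steps back together
  }, character(1))
}

# The strings shown in the question
str2 <- c("ag.1.4->ao.5.5->iv.9.12->ag.4.35",
          "ao.11.234->iv.345.455.1.2->ag.9.531")
cleaned <- clean_path(str2)
print(cleaned)
```

Using fixed = TRUE treats '->' as a literal separator rather than a regular expression.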
