new column with paste0 in R - r

I am looking for a way to add a new column that holds the ID corresponding to each string. That is:
I have a list of words with their IDs:
car = 9112
red = 9512
employee = 6117
sky = 2324
words<- c("car", "sky", "red", "employee", "domestic")
match<- c("car", "red", "domestic", "employee", "sky")
The comparison is made against an Excel file that I read in: if a value equals an entry in my vector words, the word is replaced with its ID; otherwise the original word is left unchanged.
x10<- c(words)# string
words.corpus <- c(L4$`match`) # pattern
idwords.corpus <- c(L4$`ID`) # replace
words.corpus <- paste0("\\A",idwords.corpus, "\\z|\\A", words.corpus,"\\z")
vect.corpus <- idwords.corpus
names(vect.corpus) <- words.corpus
data15 <- str_replace_all(x10, vect.corpus)
result:
data15:
" 9112", "2324", "9512", "6117", "employee"
What I'm looking for is to add a new column with the ID, instead of replacing the word with the ID
words ID
car 9112
red 9512
employee 6117
sky 2324
domestic domestic

I'd use data.table for fast lookup based on the fixed words value. While it's not 100% clear what you are asking for, it sounds like you want to replace words with an index value if there is a match, or leave the word as a word if not. This code will do that:
library("data.table")
# associate your ids with fixed word matches in a keyed lookup table
ids <- data.table(
word = c("car", "red", "employee", "sky"),
ID = c(9112, 9512, 6117, 2324)
)
setkey(ids, word)
# this is what you would read in
data <- data.table(
word = c("car", "sky", "red", "employee", "domestic", "sky")
)
setkey(data, word)
data <- ids[data]
# replace NAs from no match with word
data[, ID := ifelse(is.na(ID), word, ID)]
data
## word ID
## 1: car 9112
## 2: domestic domestic
## 3: employee 6117
## 4: red 9512
## 5: sky 2324
## 6: sky 2324
Here "domestic" is not matched, so it remains as the word in the ID column. I also repeated "sky" to show that this works for every instance of a word.
If you want to preserve the original sort order, you could create an index variable before the merge, and then reorder the output by that index variable.
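A minimal base R sketch of an order-preserving lookup, for comparison. It swaps the keyed data.table join for indexing a named vector, which returns results in input order, so no separate index variable is needed. The names `ids`, `words` and `result` are illustrative, and the IDs are stored as character (an assumption) so that unmatched words can share the column.

```r
# Lookup table as a named character vector (the names are the words)
ids <- c(car = "9112", red = "9512", employee = "6117", sky = "2324")

# Words in the order they were read in
words <- c("car", "sky", "red", "employee", "domestic")

# Indexing by name keeps the input order; unmatched words give NA
ID <- unname(ids[words])

# Fall back to the word itself where there was no match
no_match <- is.na(ID)
ID[no_match] <- words[no_match]

result <- data.frame(words = words, ID = ID, stringsAsFactors = FALSE)
result
```

For large tables the data.table join is likely the faster choice; this version is only meant to show that the original row order can be kept.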

Related

how to dynamically intercalate columns with pattern in R?

This is a follow-up question. I want to know how I can dynamically intercalate the columns in the bigger data set.
Rationale: I've conducted a for-loop to import 16 dataframes. After that, I did this to merge all dataframes:
### Merge all dataframes: (ps: I got this code here in SO :)
mergefun <- function(x, y) merge(x, y, by= "ID", all = T)
merged_DF <- Reduce(mergefun, dataList)
Each dataframe has an "ID" column (which is the same in every one), but the other column names differ (I created them based on the answer to the other post). Hence,
I have, in total (the head() of each dataframe):
ID NARR_G1_50_AAA NARR_G1_50_AAC NARR_G1_50_AC NARR_G1_50_AB
ID NARR_G1_100_AAA NARR_G1_100_AAC NARR_G1_100_AC NARR_G1_100_AB
ID NARR_G1_150_AAA NARR_G1_150_AAC NARR_G1_150_AC NARR_G1_150_AB
ID NARR_G1_200_AAA NARR_G1_200_AAC NARR_G1_200_AC NARR_G1_200_AB
ID NARR_G2_50_AAA NARR_G2_50_AAC NARR_G2_50_AC NARR_G2_50_AB
ID NARR_G2_100_AAA NARR_G2_100_AAC NARR_G2_100_AC NARR_G2_100_AB
ID NARR_G2_150_AAA NARR_G2_150_AAC NARR_G2_150_AC NARR_G2_150_AB
ID NARR_G2_200_AAA NARR_G2_200_AAC NARR_G2_200_AC NARR_G2_200_AB
ID ARG_G1_50_AAA ARG_G1_50_AAC ARG_G1_50_AC ARG_G1_50_AB
ID ARG_G1_100_AAA ARG_G1_100_AAC ARG_G1_100_AC ARG_G1_100_AB
ID ARG_G1_150_AAA ARG_G1_150_AAC ARG_G1_150_AC ARG_G1_150_AB
ID ARG_G1_200_AAA ARG_G1_200_AAC ARG_G1_200_AC ARG_G1_200_AB
ID ARG_G2_50_AAA ARG_G2_50_AAC ARG_G2_50_AC ARG_G2_50_AB
ID ARG_G2_100_AAA ARG_G2_100_AAC ARG_G2_100_AC ARG_G2_100_AB
ID ARG_G2_150_AAA ARG_G2_150_AAC ARG_G2_150_AC ARG_G2_150_AB
ID ARG_G2_200_AAA ARG_G2_200_AAC ARG_G2_200_AC ARG_G2_200_AB
I need to arrange the joined dataframe columns in these two orders:
SET 1 :
###Desired output 1:
NARR_G1_50_AAA, NARR_G2_50_AAA,
NARR_G1_50_AAC, NARR_G2_50_AAC,
NARR_G1_50_AC, NARR_G2_50_AC,
NARR_G1_50_AB, NARR_G2_50_AB,
ARG_G1_50_AAA, ARG_G2_50_AAA,
ARG_G1_50_AAC, ARG_G2_50_AAC,
ARG_G1_50_AC, ARG_G2_50_AC,
ARG_G1_50_AB, ARG_G2_50_AB........then with 100,150 and 200
SET 2 :
###Desired output 2:
NARR_G1_50_AAA, ARG_G1_50_AAA,
NARR_G2_50_AAA, ARG_G2_50_AAA,
NARR_G1_50_AAC, ARG_G1_50_AAC,
NARR_G2_50_AAC, ARG_G2_50_AAC,
NARR_G1_50_AC, ARG_G1_50_AC,
NARR_G2_50_AC, ARG_G2_50_AC,
NARR_G1_50_AB, ARG_G1_50_AB,
NARR_G2_50_AB, ARG_G2_50_AB,........then with 100,150 and 200
I've tried many things, but I can't get the desired orders. The closest I got was this:
dfPaired <- merged_DF %>% ###still doesn't produce the desired output
# dplyr::select(sort(names(.))) %>%
dplyr::select(order(gsub("G1", "G2", names(.))))
Question:
How can I get the desired orders (set 1 and set 2) without manually intercalating the columns in select() ?
Further notes:
SET 1:
I need to intercalate (in increasing order 50, then 100, then 150, then 200) "G1" and "G2" within each variable. Ex: NARR_G1_50_AAA, NARR_G2_50_AAA... There are 4 per number (AAA, AAC, AC and AB)
SET 2:
I need to intercalate (in increasing order 50, then 100, then 150, then 200) "NARR" and "ARG" within each of G1 and G2. Such as: NARR_G1_50_AAA, ARG_G1_50_AAA... Thanks in advance :)
If a custom order is needed, one option is to split the column names at _, then convert the pieces to factors with the levels specified in the desired order
lvls1 <- c("NARR", "ARG")
lvls2 <- c("G1", "G2")
lvls3 <- c("AAA", "AAC", "AC", "AB")
#v1 <- names(merged_DF)[-1] # assuming 'ID' is the first column
d1 <- read.table(text = v1, header = FALSE, sep = "_")
i1 <- !sapply(d1, is.numeric)
d1[i1] <- Map(factor, d1[i1], levels = list(lvls1, lvls2, lvls3))
v2 <- v1[do.call(order, d1[c(3, 1,4, 2)])]
library(dplyr)
merged_DF %>%
select(ID, all_of(v2))
where v2 is
> v2
[1] "NARR_G1_50_AAA" "NARR_G2_50_AAA" "NARR_G1_50_AAC" "NARR_G2_50_AAC" "NARR_G1_50_AC" "NARR_G2_50_AC" "NARR_G1_50_AB" "NARR_G2_50_AB"
[9] "ARG_G1_50_AAA" "ARG_G2_50_AAA" "ARG_G1_50_AAC" "ARG_G2_50_AAC" "ARG_G1_50_AC" "ARG_G2_50_AC" "ARG_G1_50_AB" "ARG_G2_50_AB"
[17] "NARR_G1_100_AAA" "NARR_G2_100_AAA" "NARR_G1_100_AAC" "NARR_G2_100_AAC" "NARR_G1_100_AC" "NARR_G2_100_AC" "NARR_G1_100_AB" "NARR_G2_100_AB"
[25] "ARG_G1_100_AAA" "ARG_G2_100_AAA" "ARG_G1_100_AAC" "ARG_G2_100_AAC" "ARG_G1_100_AC" "ARG_G2_100_AC" "ARG_G1_100_AB" "ARG_G2_100_AB"
[33] "NARR_G1_150_AAA" "NARR_G2_150_AAA" "NARR_G1_150_AAC" "NARR_G2_150_AAC" "NARR_G1_150_AC" "NARR_G2_150_AC" "NARR_G1_150_AB" "NARR_G2_150_AB"
[41] "ARG_G1_150_AAA" "ARG_G2_150_AAA" "ARG_G1_150_AAC" "ARG_G2_150_AAC" "ARG_G1_150_AC" "ARG_G2_150_AC" "ARG_G1_150_AB" "ARG_G2_150_AB"
data
# the column names in random order; the code above sorts them
v1 <- c("NARR_G1_100_AB", "NARR_G1_150_AAC", "NARR_G2_50_AB", "NARR_G1_150_AB",
"NARR_G2_100_AAA", "NARR_G1_100_AAC", "ARG_G1_150_AC", "ARG_G2_50_AAA",
"ARG_G2_150_AAA", "ARG_G1_50_AAA", "ARG_G2_100_AC", "NARR_G1_150_AAA",
"NARR_G2_100_AC", "ARG_G1_50_AC", "NARR_G1_100_AAA", "ARG_G2_50_AB",
"NARR_G1_150_AC", "ARG_G2_50_AAC", "ARG_G2_150_AB", "NARR_G2_100_AAC",
"NARR_G2_150_AAA", "NARR_G1_100_AC", "ARG_G1_150_AB", "ARG_G1_50_AAC",
"NARR_G1_50_AC", "ARG_G2_150_AAC", "NARR_G1_50_AAA", "NARR_G2_150_AB",
"NARR_G2_150_AAC", "ARG_G1_150_AAA", "ARG_G2_50_AC", "NARR_G2_50_AC",
"ARG_G1_150_AAC", "ARG_G1_100_AC", "ARG_G1_100_AAA", "NARR_G1_50_AAC",
"NARR_G2_150_AC", "ARG_G1_100_AAC", "ARG_G2_100_AAA", "ARG_G2_100_AAC",
"NARR_G1_50_AB", "NARR_G2_100_AB", "ARG_G2_100_AB", "ARG_G1_50_AB",
"NARR_G2_50_AAA", "ARG_G1_100_AB", "ARG_G2_150_AC", "NARR_G2_50_AAC"
)
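The answer above produces the SET 1 order by sorting on the split pieces in the sequence number, prefix, suffix, group. A sketch of how SET 2 might be obtained with the same idea, changing only the sort sequence to number, suffix, group, prefix. It uses base R and a small illustrative subset of the column names; the object names `parts`, `d1` and `v2_set2` are assumptions made for this example.

```r
# A shuffled subset of the column names (the 50-level AAA and AAC columns only)
v1 <- c("ARG_G2_50_AAA", "NARR_G1_50_AAC", "NARR_G2_50_AAA", "ARG_G1_50_AAA",
        "NARR_G1_50_AAA", "ARG_G1_50_AAC", "NARR_G2_50_AAC", "ARG_G2_50_AAC")

# Split each name at "_" into its four pieces
parts <- do.call(rbind, strsplit(v1, "_", fixed = TRUE))

# Factors carry the custom order; the number is sorted numerically
d1 <- data.frame(
  prefix = factor(parts[, 1], levels = c("NARR", "ARG")),
  group  = factor(parts[, 2], levels = c("G1", "G2")),
  number = as.integer(parts[, 3]),
  suffix = factor(parts[, 4], levels = c("AAA", "AAC", "AC", "AB"))
)

# SET 2: number first, then suffix, then group, with the prefix varying fastest
v2_set2 <- v1[order(d1$number, d1$suffix, d1$group, d1$prefix)]
v2_set2
```

The resulting vector can then be passed to select(ID, all_of(v2_set2)) in the same way as v2.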

Using R - exchange values in column conditionally based on other column

(1) I have a data frame called COPY that looks like this
COPY <- data.frame (Year = c(values_here),
Ceremony = c(values_here),
Award = c(values_here),
Winner = c(values_here),
Name = c(values_here),
Film = c(values_here)
)
(2) Some of the entries in the Name and Film columns are mixed up in some rows.
(3) I created a vector of all the names that are in the wrong place using this code.
COPY$Film[COPY['Award']=='Director' & COPY['Year']>1930]->name
For the entries where Award is Director and the year is greater than 1930, the Name and Film columns are mixed up.
(4) Now I would like to replace COPY$Name, under the conditions stated, with my new name object. I tried this code.
replace(COPY$Name,COPY$Award =='Director' && COPY$Year>1930,name)
So basically I'm trying to swap the Name and Film columns where the Award column is Director and the Year column is greater than 1930.
Lacking data, try this:
COPY <- data.frame (year = 2000:2002,
Ceremony = NA,
Award = c("A", "Director", "B"),
Winner = NA,
Name = c("A","B","C"),
Film = c("1","2","3")
)
swap <- COPY$Award == "Director"
COPY <- transform(COPY, Name = ifelse(swap, Film, Name), Film = ifelse(swap, Name, Film))
COPY
# year Ceremony Award Winner Name Film
# 1 2000 NA A NA A 1
# 2 2001 NA Director NA 2 B
# 3 2002 NA B NA C 3
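The example above swaps on the Award condition only. A sketch that also applies the year condition from the question follows, using made-up placeholder data and assuming the year column is named Year. Note the element-wise `&`: the `&&` in the question's attempt is meant for a single TRUE/FALSE value, not for a whole column.

```r
# Placeholder data; only the columns needed for the swap are included
COPY <- data.frame(
  Year  = c(1929, 1935, 1940),
  Award = c("Director", "Director", "Other"),
  Name  = c("A", "B", "C"),
  Film  = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Element-wise condition: one TRUE/FALSE per row
swap <- COPY$Award == "Director" & COPY$Year > 1930

# The right-hand side is evaluated first and assigned by position,
# so Name receives the Film values and Film receives the Name values
COPY[swap, c("Name", "Film")] <- COPY[swap, c("Film", "Name")]
COPY
```

Only the second row satisfies both conditions here, so it is the only one whose Name and Film are exchanged.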

R - Creating New Column Based off of a Partial String

I have a large dataset (Dataset "A") with a column Description which contains something along the lines
"1952 Rolls Royce Silver Wraith" or "1966 Holden".
I also have a separate dataset (Dataset "B") with a list of every Car Brand that I need (eg "Holden", "Rolls Royce", "Porsche").
How can I create a new column in dataset "A" that assigns the Partial strings of the Description with the correct Car Brand?
(This column would only hold the correct Car Brand with the appropriate matching cell).
Thank you.
Description New Column
1971 Austin 1300 Austin
A solution from the tidyverse
A <- data.frame (Description = c("1970 Austin"),
stringsAsFactors = FALSE)
B <- data.frame (Car_Brand = c("Austin"),
stringsAsFactors = FALSE)
library(tidyverse)
A %>% mutate( New_Column= str_match( Description, B$Car_Brand)[,1] )
# Description New_Column
# 1 1970 Austin Austin
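The example above uses a single brand. With several brands in B, one possible approach is to join them into one alternation pattern and extract the first hit. The sketch below uses base R rather than the tidyverse so that it runs without extra packages; it assumes the brand names contain no regex metacharacters, and the extra rows are invented for illustration.

```r
A <- data.frame(Description = c("1952 Rolls Royce Silver Wraith", "1966 Holden",
                                "1971 Austin 1300", "1980 Unlisted Make"),
                stringsAsFactors = FALSE)
B <- data.frame(Car_Brand = c("Holden", "Rolls Royce", "Porsche", "Austin"),
                stringsAsFactors = FALSE)

# One pattern that matches any of the brands
pattern <- paste(B$Car_Brand, collapse = "|")

# regexpr() returns -1 where nothing matched
m <- regexpr(pattern, A$Description)

# regmatches() returns only the matched pieces, so fill them in by position
A$New_Column <- NA_character_
A$New_Column[m > 0] <- regmatches(A$Description, m)
A
```

Descriptions without a known brand get NA. In the tidyverse, stringr's str_extract() with the same joined pattern should behave similarly.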

Split a dataframe in multiple columns in R

My data frame is as follows:
User
JohnLenon03041965
RogerFederer12021954
RickLandsman01041975
and I am trying to get the output as
Name Lastname Birthdate
John Lenon 03041965
Roger Federer 12021954
Rick Landsman 01041975
I tried the following code:
a = gsub('([[:upper:]])', ' \\1', df$User)
a <- as.data.frame(a)
library(tidyr)
a <-separate(a, a, into = c("Name", "Last"), sep = " (?=[^ ]+$)")
I get the following:
Name Last
John Lenon03041965
Roger Federer12021954
Rick Landsman01041975
I am trying to use the separate condition like (?=[0-9]) but getting error like this:
c <-separate(c, c, into = c("last", "date"), sep = '(?=[0-9])')
Error in if (!after) c(values, x) else if (after >= lengx) c(x, values) else c(x[1L:after], : argument is of length zero
We can use a regex lookaround as sep by specifying either to split between a lower case letter and an upper case ((?<=[a-z])(?=[A-Z])) or (|) between a lower case letter and a number ((?<=[a-z])(?=[0-9]+))
library(dplyr)
library(tidyr)
df1 %>%
separate(User, into = c("Name", "LastName", "Birthdate"),
sep = "(?<=[a-z])(?=[A-Z])|(?<=[a-z])(?=[0-9]+)")
# Name LastName Birthdate
#1 John Lenon 03041965
#2 Roger Federer 12021954
#3 Rick Landsman 01041975
Another option is extract, which captures characters as groups by placing them inside parentheses ((...)). Here, the 1st capture group matches an upper case letter followed by one or more lower case letters (([A-Z][a-z]+)) from the start (^) of the string, the 2nd captures one or more characters that are not digits (([^0-9]+)), and the 3rd takes the rest of the characters ((.*))
df1 %>%
extract(User, into = c("Name", "LastName", "Birthdate"),
"^([A-Z][a-z]+)([^0-9]+)(.*)")
# Name LastName Birthdate
#1 John Lenon 03041965
#2 Roger Federer 12021954
#3 Rick Landsman 01041975
data
df1 <- structure(list(User = c("JohnLenon03041965", "RogerFederer12021954",
"RickLandsman01041975")), .Names = "User", class = "data.frame", row.names = c(NA,
-3L))
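The same three capture groups can be tried without tidyr. A sketch using base R's strcapture() (in the utils package since R 3.4.0), which fills a prototype data frame with one column per group; the object names `proto` and `out` are illustrative.

```r
df1 <- data.frame(User = c("JohnLenon03041965", "RogerFederer12021954",
                           "RickLandsman01041975"),
                  stringsAsFactors = FALSE)

# Prototype: one column per capture group, all kept as character
proto <- data.frame(Name = character(), LastName = character(),
                    Birthdate = character(), stringsAsFactors = FALSE)

# Group 1: capital + lower-case letters; group 2: non-digits; group 3: the rest
out <- strcapture("^([A-Z][a-z]+)([^0-9]+)(.*)$", df1$User, proto)
out
```

Birthdate stays a character column so the leading zeros are kept. As with the extract version, group 2 takes everything up to the first digit, so a user name with more than two capitalised parts would put all remaining parts in LastName.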

How to search part of string that contain in a list of string, and return the matched one in R

The following data frame contains a "Campaign" column whose values hold information about season, name, and position; however, the order of these pieces is quite different in each row. Luckily, each piece comes from a fixed list, so we could create vectors to match against the strings in the "Campaign" column.
Date Campaign
1 Jan-15 Summer|Peter|Up
2 Feb-15 David|Winter|Down
3 Mar-15 Up|Peter|Spring
Here is what I want to do: I want to create 3 columns, Name, Season and Position, that search the string in the Campaign column and return the matched value from the lists below.
Name <- c("Peter", "David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
So my desired result would be following
Temp
Date Campaign Name Season Position
1 15-Jan Summer|Peter|Up Peter Summer Up
2 15-Feb David|Winter|Down David Winter Down
3 15-Mar Up|Peter|Spring Peter Spring Up
Another way:
L <- strsplit(df$Campaign,split = '\\|')
df$Name <- sapply(L,intersect,Name)
df$Season <- sapply(L,intersect,Season)
df$Position <- sapply(L,intersect,Position)
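The intersect approach above relies on every row containing exactly one value from each list; if a row had none, sapply would return a list rather than a character vector. A hedged, self-contained sketch that guards against this with vapply follows. The helper `pick_one` and the fourth row (which has no name) are hypothetical additions for illustration.

```r
df <- data.frame(Date = c("Jan-15", "Feb-15", "Mar-15", "Apr-15"),
                 Campaign = c("Summer|Peter|Up", "David|Winter|Down",
                              "Up|Peter|Spring", "Autumn|Down"),
                 stringsAsFactors = FALSE)
Name <- c("Peter", "David")
Season <- c("Summer", "Spring", "Autumn", "Winter")
Position <- c("Up", "Down")

# Return the first token found in `choices`, or NA when there is none
pick_one <- function(tokens, choices) {
  hit <- intersect(tokens, choices)
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Split each campaign string on the literal "|"
L <- strsplit(df$Campaign, "|", fixed = TRUE)

# vapply guarantees exactly one character value per row
df$Name <- vapply(L, pick_one, character(1), choices = Name)
df$Season <- vapply(L, pick_one, character(1), choices = Season)
df$Position <- vapply(L, pick_one, character(1), choices = Position)
df
```

Because whole tokens are compared, a value such as "Up" cannot accidentally match inside a longer word, which a plain substring search might do.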
Do the following:
Date = c("Jan-15","Feb-15","Mar-15")
Campaign = c("Summer|Peter|Up","David|Winter|Down","Up|Peter|Spring")
df = data.frame(Date,Campaign)
Name <- c("Peter", "David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
for(k in Name){
df$Name[grepl(pattern = k, x = df$Campaign)] <- k
}
for(k in Season){
df$Season[grepl(pattern = k, x = df$Campaign)] <- k
}
for(k in Position){
df$Position[grepl(pattern = k, x = df$Campaign)] <- k
}
This gives:
> df
Date Campaign Name Season Position
1 Jan-15 Summer|Peter|Up Peter Summer Up
2 Feb-15 David|Winter|Down David Winter Down
3 Mar-15 Up|Peter|Spring Peter Spring Up
I had the same idea as Marat Talipov; here's a data.table option:
library(data.table)
Name <- c("Peter", "David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
dat <- data.table(Date=c("Jan-15", "Feb-15", "Mar-15"),
Campaign=c("Summer|Peter|Up", "David|Winter|Down", "Up|Peter|Spring"))
Gives
> dat
Date Campaign
1: Jan-15 Summer|Peter|Up
2: Feb-15 David|Winter|Down
3: Mar-15 Up|Peter|Spring
Processing is then
dat[ , `:=`(Name = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Name),
Season = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Season),
Position = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Position))
]
Result:
> dat
Date Campaign Name Season Position
1: Jan-15 Summer|Peter|Up Peter Summer Up
2: Feb-15 David|Winter|Down David Winter Down
3: Mar-15 Up|Peter|Spring Peter Spring Up
Maybe there's some benefit if you're doing this to a lot of columns or need to modify in place (by reference).
I'm interested if anyone can show me how to update all three columns at once.
EDIT: Never mind, figured it out;
for (icol in c("Name", "Season", "Position"))
dat[, (icol):=sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, get(icol))]
