I have this data:
COL
AABC1
AAAABD2
AAAAAABF3
I would like to create a new column from it, like this:
COL NEW_COL
AABC1 T1
AAAABD2 T2
AAAAAABF3 T3
If COL contains 'BC', NEW_COL should be T1;
if it contains 'BD', it should be T2;
and if it contains 'BF', it should be T3.
I would like to use mutate and the grepl function, but I have 80 such conditions (like BC > T1), so writing them all out by hand does not work in R.
Given a reference table like:
CLASS NEW_COL
BC T1
BD T2
BF T3
could I use mutate to create the new column from this reference table?
Here's your data:
DF <- data.frame(COL = c("AABC1",
                         "AAAABD2",
                         "AAAAAABF3"),
                 stringsAsFactors = FALSE)
lookup_tbl <- data.frame(CLASS = c("BC", "BD", "BF"),
                         NEW_COL = c("T1", "T2", "T3"),
                         stringsAsFactors = FALSE)
Your problem is solved by a join, after some initial preparation.
To prepare DF, you need to add a column that extracts from COL any instance of CLASS that appears in the lookup table. Then you can join normally. In R:
library(dplyr)

DF %>%
  mutate(CLASS = gsub(paste0("^.*(",
                             paste0(lookup_tbl[["CLASS"]], collapse = "|"),
                             ").*$"),
                      "\\1",
                      COL)) %>%
  # or inner_join as required
  left_join(lookup_tbl, by = "CLASS")
How the solution should behave when COL matches zero, or more than one, of the CLASS values will need to be specified. The above handles both cases, but maybe not how you'd like.
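As an aside, if stringr is available, a shorter sketch does the same preparation step with str_extract(), which returns the first matching CLASS and NA when nothing matches (so unmatched rows survive the left join with NA in NEW_COL):

library(dplyr)
library(stringr)

DF %>%
  # extract the first CLASS value found anywhere in COL, or NA if none
  mutate(CLASS = str_extract(COL, paste(lookup_tbl[["CLASS"]], collapse = "|"))) %>%
  left_join(lookup_tbl, by = "CLASS")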
You can create a lookup table with your 80 conditions and write a little function to match against it. Here's an example (normally you'd read lookup_table in from a file, I'm guessing):
library(tidyverse)

lookup_table <- data.frame(
  row.names = c('BC', 'BD', 'BF'),
  new_col = c('T1', 'T2', 'T3'),
  stringsAsFactors = FALSE)

lookup <- function(x, table) {
  for (class in rownames(table)) {
    if (grepl(class, x)) {
      return(table[class, 'new_col'])
    }
  }
  NA_character_  # returned when no rule matches
}

tibble(col = c('AABC1', 'AAAABD2', 'AAAAAABF3')) %>%
  rowwise() %>%
  mutate(new_col = lookup(col, lookup_table))
Note that this will take the first match it finds, so be sure your lookup table is ordered properly with respect to the priority you want to give the matching rules.
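If rowwise() turns out to be slow on a big table, one possible sketch (reusing the same lookup_table and lookup() from above) vectorizes the call with sapply() instead:

df <- data.frame(col = c('AABC1', 'AAAABD2', 'AAAAAABF3'))
# apply lookup() to each value of col; returns a character vector
df$new_col <- sapply(df$col, lookup, table = lookup_table)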
I have a dataframe (df) that contains Start times and End times for observations of different IDs:
df <- structure(list(ID = 1:4,
                     Start = c("2021-05-12 13:22:00", "2021-05-12 13:25:00",
                               "2021-05-12 13:30:00", "2021-05-12 13:42:00"),
                     End = c("2021-05-13 8:15:00", "2021-05-13 8:17:00",
                             "2021-05-13 8:19:00", "2021-05-13 8:12:00")),
                class = "data.frame", row.names = c(NA, -4L))
I want to create a new dataframe that shows the latest Start time and the earliest End time for each possible pairwise comparison between the levels of ID.
I was able to accomplish this by making a duplicate column of ID called ID2, using tidyr::expand to expand them, and saving the result in an object called Pairs:
library(dplyr)
library(tidyr)

df$ID2 <- df$ID
Pairs <- df %>%
  expand(ID, ID2)
I then made two new objects, a and b, that store the Start and End times for each comparison separately, and combined them into df2:
a <- left_join(df, Pairs, by = 'ID') %>%
  rename(StartID1 = Start, EndID1 = End, ID2 = ID2.y) %>%
  select(-ID2.x)
b <- left_join(Pairs, df, by = "ID2") %>%
  rename(StartID2 = Start, EndID2 = End) %>%
  select(ID2, StartID2, EndID2)
df2 <- cbind(a, b)
df2 <- df2[, -4]
Finally, I used dplyr::if_else to find the LatestStart time and the EarliestEnd time for each comparison:
df2 <- df2 %>%
  mutate(LatestStart = if_else(StartID1 > StartID2, StartID1, StartID2),
         EarliestEnd = if_else(EndID1 > EndID2, EndID2, EndID1))
This seems like such a simple task to perform. Is there a more concise way to achieve this from df without creating all of these extra objects?
For such computations, outer usually comes in handy:
df %>%
  mutate(across(c("Start", "End"), lubridate::ymd_hms)) %>%
  {
    data.frame(
      ID1 = rep(.$ID, each = nrow(.)),
      ID2 = rep(.$ID, nrow(.)),
      LatestStart = outer(.$Start, .$Start, pmax),
      EarliestEnd = outer(.$End, .$End, pmin)
    )
  }
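If you'd rather stay inside dplyr, a sketch with cross_join() does the same pairwise expansion (this assumes dplyr >= 1.1.0; the result name pairwise is just illustrative):

library(dplyr)
library(lubridate)

pairwise <- df %>%
  mutate(across(c(Start, End), ymd_hms)) %>%
  cross_join(x = ., y = .) %>%   # all ID pairs; overlapping names get .x/.y suffixes
  transmute(ID1 = ID.x, ID2 = ID.y,
            LatestStart = pmax(Start.x, Start.y),
            EarliestEnd = pmin(End.x, End.y))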
Goal: to filter rows in a dataset so that only distinct words remain. At the moment, I have used inner_join to retain the rows present in both of 2 datasets, which has left duplicate rows in this dataset.
Attempt 1: I have tried to use distinct to retain only those rows which are unique, but this has not worked; I may be using it incorrectly.
This is my code so far:
# join warriner emotion lemmas by `word` column in collocations data frame to see how many word matches there are
warriner2 <- dplyr::inner_join(warriner, coll, by = "word") # join data; retain only rows in both sets (works both ways)
warriner2 <- distinct(warriner2)
warriner2
coll2 <- dplyr::semi_join(coll, warriner, by = "word") # join all rows in a that have a match in b
# There are 8166 lemma matches (including double-ups)
# There are XXX unique lemma matches
You can try:
library(dplyr)
warriner2 <- inner_join(warriner, coll, by = "word") %>%
distinct(word, .keep_all = TRUE)
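To fill in the XXX from your comments (total vs. unique lemma matches), a quick sketch using the same objects:

library(dplyr)
joined <- inner_join(warriner, coll, by = "word")
nrow(joined)             # total lemma matches, including double-ups
n_distinct(joined$word)  # unique lemma matches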
To further clarify Ronak's answer, here is an example with some mock data. Note that you can just use distinct() at the end of the pipe to keep distinct rows if that's what you want. Your error might well have occurred because you performed two operations and assigned the result to the same name both times (warriner2).
library(dplyr)
# Here are a couple of sample tibbles
name <- c("cat", "dog", "parakeet")
df1 <- tibble(
  x = sample(5, 99, replace = TRUE),
  y = sample(5, 99, replace = TRUE),
  name = rep(name, times = 33))
df2 <- tibble(
  x = sample(5, 99, replace = TRUE),
  y = sample(5, 99, replace = TRUE),
  name = rep(name, times = 33))
# It's much less confusing if you do this in one pipe
p <- df1 %>%
inner_join(df2, by = "name") %>%
distinct()
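If you only want one row per join key rather than fully distinct rows, a variant sketch (same mock data as above) keeps the first match per name:

p_one_per_name <- df1 %>%
  inner_join(df2, by = "name") %>%
  distinct(name, .keep_all = TRUE)  # one row per name, keeping the first occurrence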
When using the various join functions from dplyr, you can either join all variables with the same name (the default) or specify particular pairs using by = c("a" = "b"). Is there a way to join by exclusion? For example, I have 1000 variables in two data frames and I want to join on 999 of them, leaving one out. I don't want to write by = c("a1" = "b1", ..., "a999" = "b999"). Is there a way to join while excluding the one variable that is not used?
Ok, using this example from one answer:
library(dplyr)

set.seed(24)
df1 <- tibble(alala = LETTERS[1:3], skks = letters[1:3],
              sskjs = letters[1:3], val = rnorm(3))
df2 <- tibble(alala = LETTERS[1:3], skks = letters[1:3],
              sskjs = letters[1:3], val = rnorm(3))
I want to join them using all variables except val. I'm looking for a general solution: assume there are 1000 variables and I only remember the name of the one I want to exclude, not its index. How can I perform the join knowing only the variable name to exclude? I understand I can find the column index first, but is there a simple way to specify exclusions in by =?
We can create a named vector to do this:
library(dplyr)
grps <- setNames(paste0("b", 1:999), paste0("a", 1:999))
Note that the grps vector is created with paste because the OP's post suggested a pattern. If there is no pattern, but we know the column that should not take part in the join:
nogroupColumn <- "someColumn"
grps <- setNames(setdiff(names(df1), nogroupColumn),
setdiff(names(df2), nogroupColumn))
inner_join(df1, df2, by = grps)
Using a reproducible example
set.seed(24)
df1 <- tibble(a1 = LETTERS[1:3], a2 = letters[1:3], val = rnorm(3))
df2 <- tibble(b1 = LETTERS[3:4], b2 = letters[3:4], valn = rnorm(2))
grps <- setNames(paste0("b", 1:2), paste0("a", 1:2))
inner_join(df1, df2, by = grps)
# A tibble: 1 x 4
# a1 a2 val valn
# <chr> <chr> <dbl> <dbl>
#1 C c 0.420 -0.584
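As an aside, when the join columns share the same names in both tables (as in the earlier alala/skks/sskjs frames), no named vector is needed at all; a simpler sketch excludes val directly:

library(dplyr)
# by accepts a plain character vector when the names match on both sides
inner_join(df1, df2, by = setdiff(names(df1), "val"))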
To exclude a certain field (or fields), you need to identify the indices of the columns you want to keep. Here's one way:
which(!names(df1) %in% "sskjs")  # excludes the column "sskjs"
[1] 1 2 4                        # only the desired column indices remain
Then use unite to create a join_id in each dataframe, and join by it:
library(dplyr)
library(tidyr)

df1 <- df1 %>%
  unite(join_id, which(!names(.) %in% "sskjs"), remove = FALSE)
df2 <- df2 %>%
  unite(join_id, which(!names(.) %in% "sskjs"), remove = FALSE)
left_join(df1, df2, by = "join_id")
Note that the original columns kept by remove = FALSE will come through with .x/.y suffixes in the joined result.
I have a DataFrame with Person data and also about 20 more DataFrames that share a common key, Person_Id. I want to join all of them to the Person DataFrame so that all my data is in the same DataFrame.
I tried both join and merge like this:
merge(df_person, df_1, by="Person_Id", all.x=TRUE)
and
join(df_person, df_1, df_person$Person_Id == df_1$Person_Id, "left")
In both cases I run into the same issue: both functions join the datasets correctly, but the field Person_Id gets duplicated. Is there any way to tell those functions not to duplicate the Person_Id field?
Also, does anyone know a more efficient way to join all those DataFrames together?
Thank you so much for your help in advance.
Other Spark-supported languages have a simplified equi-join syntax, but it looks like it is not implemented in SparkR, so you have to do it the old way (rename and drop):
library(magrittr)

withColumnRenamed(df_1, "Person_Id", "Person_Id_") %>%
  join(df_2, column("Person_Id") == column("Person_Id_")) %>%
  drop("Person_Id_")
If you're doing a lot of joins in SparkR, it is worthwhile to make your own function that renames, joins, and then removes the renamed column:
DFJoin <- function(left_df, right_df, key = "key", join_type = "left"){
left_df <- withColumnRenamed(left_df, key, "left_key")
right_df <- withColumnRenamed(right_df, key, "right_key")
result <- join(
left_df, right_df,
left_df$left_key == right_df$right_key,
joinType = join_type)
result <- withColumnRenamed(result, "left_key", key)
result$right_key <- NULL
return(result)
}
df1 <- as.DataFrame(data.frame(Person_Id = c("1", "2", "3"),
                               value_1 = c(2, 4, 6)))
df2 <- as.DataFrame(data.frame(Person_Id = c("1", "2"),
                               value_2 = c(3, 6)))
df3 <- DFJoin(df1, df2, key = "Person_Id", join_type = "left")
head(df3)
Person_Id value_1 value_2
1 3 6 NA
2 1 2 3
3 2 4 6
I used the following code to scrape a table into R:
player.offense.201702050atl <- comments.201702050atl[31] %>%
  html_text() %>%
  read_html() %>%
  html_node("#player_offense") %>%
  html_table()
Then changed the column labels using:
colnames(player.offense.201702050atl) <- c(
  "Player", "Tm", "Cmp.Passing", "Att.Passing", "Yds.Passing", "TD.Passing",
  "Int.Passing", "Sk.Passing", "Yds.Sk.Passing", "Lng.Passing", "Rate.Passing",
  "Att.Rushing", "Yds.Rushing", "TD.Rushing", "Lng.Rushing", "Tgt.Receiving",
  "Rec.Receiving", "Yds.Receiving", "TD.Receiving", "Lng.Receiving",
  "Fmb.Fumbles", "FL.Fumbles")
Next I need to eliminate rows 1, 11, and 12.
I could use:
player.offense.201702050atl.a = player.offense.201702050atl[2:10, ]
player.offense.201702050atl.b = player.offense.201702050atl[13:20, ]
player.offense.201702050atl.c = rbind(player.offense.201702050atl.a, player.offense.201702050atl.b)
However, I have multiple tables in need of similar manipulation, and the rows I intend to eliminate vary with each one. The criterion for eliminating a row is:
All rows for which the value in column 3 is either "Cmp" or "Passing".
Is there a way to run a function that will parse the table, identify the rows that meet the above criterion, and eliminate them?
You can find the rows you want to drop with which() and remove them by negative indexing. For example:
df <- data.frame(x = c('a', 'b', 'c'), y = c('ca', 'cb', 'cc'), z = c('da', 'db', 'dc'))
  x  y  z
1 a ca da
2 b cb db
3 c cc dc
df[-union(which(df$y == 'cc'), which(df$y == 'ca')), ]
Result:
x y z
2 b cb db
Regarding the actual criterion (all rows for which the value in column 3 is either "Cmp" or "Passing"), you can use %in% directly:
df <- data.frame(col1 = 1:3, col2 = c('Cmp', 'Passing', 'other'))
df[!df$col2 %in% c('Cmp', 'Passing'), ]
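Since you have multiple tables needing the same cleanup, one option is to wrap the filter in a small helper. Here is a sketch (the name drop_label_rows is just illustrative, and it assumes the unwanted labels always sit in column 3, as in your tables):

drop_label_rows <- function(tbl, col = 3, labels = c("Cmp", "Passing")) {
  # keep only rows whose value in column `col` is not one of the header labels
  tbl[!tbl[[col]] %in% labels, ]
}

player.offense.201702050atl <- drop_label_rows(player.offense.201702050atl)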