Create new column based on partial match with another column

Create new column based on partial match with another column - r

I want to create a new column for a dataframe, using partial matches in another column. The problem is that my values are only partial matches, the suffix _3p, or _5p at the end of the names only exist in the original dataframe but not in the other column I am using to test to.
The code I am using should work, but due to the partial match thing is not and I am stuck.
> head(df)
# A tibble: 6 x 2
microRNAs `number of targets`
<chr> <int>
1 bantam|LQNS02278082.1_33125_3p 128
2 bantam|LQNS02278082.1_33125_5p 8
3 Dpu-Mir-10-P2_LQNS02277998.1_30984_3p 44
4 Dpu-Mir-10-P2_LQNS02277998.1_30984_5p 78
5 Dpu-Mir-10-P3_LQNS02277998.1_30988_3p 1076
6 Dpu-Mir-10-P3_LQNS02277998.1_30988_5p 309
> dput(head(df))
structure(list(microRNAs = c("bantam|LQNS02278082.1_33125_3p",
"bantam|LQNS02278082.1_33125_5p", "Dpu-Mir-10-P2_LQNS02277998.1_30984_3p",
"Dpu-Mir-10-P2_LQNS02277998.1_30984_5p", "Dpu-Mir-10-P3_LQNS02277998.1_30988_3p",
"Dpu-Mir-10-P3_LQNS02277998.1_30988_5p"), `number of targets` = c(128L,
8L, 44L, 78L, 1076L, 309L)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
#matches to look for
unique
1 miR-9|LQNS02278094.1_36129
2 LQNS02278139.1_39527
3 LQNS02278139.1_39523
4 LQNS02278075.1_32386
5 Dpu-Mir-10-P3_LQNS02277998.1_30988
> dput(head(unique))
structure(list(unique = c("miR-9|LQNS02278094.1_36129",
"LQNS02278139.1_39527", "LQNS02278139.1_39523", "LQNS02278075.1_32386",
"Dpu-Mir-10-P3_LQNS02277998.1_30988")), row.names = c(NA,
6L), class = "data.frame")
#Create new column with Yes, No
df$new <- ifelse(df$microRNAs %in% unique$unique, 'Yes', 'No')
##But it all appears like No due to the partial match.

A fast solution using data.table.
library(data.table)
# convert data.frame to data.table
setDT(df)
# create temporary column dropping the last 3 characters
df[, microRNAs_short := substr(microRNAs ,1, nchar(microRNAs)-3) ]
# check values in common
df[, new := fifelse( microRNAs_short %in% df2$unique, 'Yes', 'No')]

We could use regex_left_join from fuzzyjoin
library(fuzzyjoin)
regex_left_join(df, unique, by = c("microRNAs" = "unique"))

Related

Sort dataframe by row index value without changing values

I have a dataframe that I have to sort in decreasing order of absolute row value without changing the actual values (some of which are negative).
To give you an example, e.g. for the 1st row, I would like to go from
-0.01189179 0.03687456 -0.12202753 to
-0.12202753 0.03687456 -0.01189179.
For the 2nd row from
-0.04220260 0.04129326 -0.07178175 to
-0.07178175 -0.04220260 0.04129326 etc.
How can I do this in R?
Many thanks!

Try this
lst <- lapply(df , \(x) order(-abs(x)))
ans <- data.frame(Map(\(x,y) x[y] , df ,lst))
output
a b
1 -0.01189179 -0.07178175
2 0.03687456 -0.04220260
3 -0.12202753 0.04129326
data
df <- structure(list(a = c(-0.12202753, 0.03687456, -0.01189179), b = c(-0.0422026,
0.04129326, -0.07178175)), row.names = c(NA, -3L), class = "data.frame")

Here is a simple approach (using #Mohamed Desouky's Data)
df <- df[nrow(df):1,]
> df
a b
3 -0.01189179 -0.07178175
2 0.03687456 0.04129326
1 -0.12202753 -0.04220260

Pre-processing data in R: filtering and replacing using wildcards

Good day!
I have a dataset in which I have values like "Invalid", "Invalid(N/A)", "Invalid(1.23456)", lots of them in different columns and they are different from file to file.
Goal is to make script file to process different CSVs.
I tried read.csv and read_csv, but faced errors with data types or no errors, but no action either.
All columns are col_character except one - col_double.
Tried this:
is.na(df) <- startsWith(as.character(df, "Inval")
no luck
Tried this:
is.na(df) <- startsWith(df, "Inval")
no luck, some error about non char object
Tried this:
df %>%
mutate(across(everything(), .fns = ~str_replace(., "invalid", NA_character_)))
no luck
And other google stuff - no luck, again, errors with data types or no errors, but no action either.
So R is incapable of simple find and replace in data frame, huh?
data frame exampl
Output of dput(dtype_Result[1:20, 1:4])
structure(list(Location = c("1(1,A1)", "2(1,B1)", "3(1,C1)",
"4(1,D1)", "5(1,E1)", "6(1,F1)", "7(1,G1)", "8(1,H1)", "9(1,A2)",
"10(1,B2)", "11(1,C2)", "12(1,D2)", "13(1,E2)", "14(1,F2)", "15(1,G2)",
"16(1,H2)", "17(1,A3)", "18(1,B3)", "19(1,C3)", "20(1,D3)"),
Sample = c("Background0", "Background0", "Standard1", "Standard1",
"Standard2", "Standard2", "Standard3", "Standard3", "Standard4",
"Standard4", "Standard5", "Standard5", "Standard6", "Standard6",
"Control1", "Control1", "Control2", "Control2", "Unknown1",
"Unknown1"), EGF = c(NA, NA, "6.71743640129069", "2.66183193679533",
"16.1289784536322", "16.1289784536322", "78.2706654825781",
"78.6376213069722", "382.004087907716", "447.193928257862",
"Invalid(N/A)", "1920.90297258996", "7574.57784103579", "29864.0308009592",
"167.830723655146", "109.746615928611", "868.821939675054",
"971.158518683179", "9.59119569511596", "4.95543581398464"
), `FGF-2` = c(NA, NA, "25.5436745776637", NA, "44.3280630362038",
NA, "91.991708192168", "81.9459159768959", "363.563899234418",
"425.754478700876", "Invalid(2002.97340881547)", "2027.71958119836",
"9159.40221389147", "11138.8722428849", "215.58494072476",
"70.9775438699825", "759.798876479002", "830.582605561901",
"58.7007261370257", "70.9775438699825")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))

The error is in the use of startsWith. The following grepl solution is simpler and works.
is.na(df) <- sapply(df, function(x) grepl("^Invalid", x))

The str_replace function will attempt to edit the content of a character string, inserting a partial replacement, rather than replacing it entirely. Also, the across function is targeting all of the columns including the numeric id. The following code works, building on the tidyverse example you provided.
To fix it, use where to identify the columns of interest, then use if_else to overwrite the data with NA values when there is a partial string match, using str_detect to spot the target text.
Example data
library(tiyverse)
df <- tibble(
id = 1:3,
x = c("a", "invalid", "c"),
y = c("d", "e", "Invalid/NA")
)
df
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 invalid e
3 3 c Invalid/NA
Solution
df <- df %>%
mutate(
across(where(is.character),
.fns = ~if_else(str_detect(tolower(.x), "invalid"), NA_character_, .x))
)
print(df)
Result
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 NA e
3 3 c NA

Calculate ratio every two rows with partial string matches

I am trying to calculate a ratio using this formula: log2(_5p/3p).
I have a dataframe in R and the entries have the same name except their last part that will be either _3p or _5p. I want to do this operation log2(_5p/_3p) for each specific name.
For instance for the first two rows the result will be like this:
LQNS02277998.1_30988 log2(40/148)= -1.887525
Ideally I want to create a new data frame with the results where only the common part of the name is kept.
LQNS02277998.1_30988 -1.887525
How can I do this in R?
> head(dup_res_LC1_b_2)
# A tibble: 6 x 2
microRNAs n
<chr> <int>
1 LQNS02277998.1_30988_3p 148
2 LQNS02277998.1_30988_5p 40
3 Dpu-Mir-279-o6_LQNS02278070.1_31942_3p 4
4 Dpu-Mir-279-o6_LQNS02278070.1_31942_5p 4
5 LQNS02000138.1_777_3p 73
6 LQNS02000138.1_777_5p 12
structure(list(microRNAs = c("LQNS02277998.1_30988_3p",
"LQNS02277998.1_30988_5p", "Dpu-Mir-279-o6_LQNS02278070.1_31942_3p",
"Dpu-Mir-279-o6_LQNS02278070.1_31942_5p", "LQNS02000138.1_777_3p",
"LQNS02000138.1_777_5p"), n = c(148L, 40L, 4L, 4L, 73L, 12L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

We can use a group by operation by removing the substring at the end i.e. _3p or _5p with str_remove, then use the log division of the pair of 'n'
library(dplyr)
library(stringr)
df1 %>%
group_by(grp = str_remove(microRNAs, "_[^_]+$")) %>%
mutate(new = log2(last(n)/first(n)))

Comparing pairs of rows in a list of data frames

I have a list that's 1314 element long. Each element is a data frame consisting of two rows and four columns.
Game.ID Team Points Victory
1 201210300CLE CLE 94 0
2 201210300CLE WAS 84 0
I would like to use the lapply function to compare points for each team in each game, and change Victory to 1 for the winning team.
I'm trying to use this function:
test_vic <- lapply(all_games, function(x) {if (x[1,3] > x[2,3]) {x[1,4] = 1}})
But the result it produces is a list 1314 elements long with just the Game ID and either a 1 or a null, a la:
$`201306200MIA`
[1] 1
$`201306160SAS`
NULL
How can I fix my code so that each data frame maintains its shape. (I'm guessing solving the null part involves if-else, but I need to figure out the right syntax.)
Thanks.

Try
lapply(all_games, function(x) {x$Victory[which.max(x$Points)] <- 1; x})
Or another option would be to convert the list to data.table by using rbindlist and then do the conversion
library(data.table)
rbindlist(all_games)[,Victory:= +(Points==max(Points)) ,Game.ID][]
data
all_games <- list(structure(list(Game.ID = c("201210300CLE",
"201210300CLE"
), Team = c("CLE", "WAS"), Points = c(94L, 84L), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
class = "data.frame", row.names = c("1",
"2")), structure(list(Game.ID = c("201210300CME", "201210300CME"
), Team = c("CLE", "WAS"), Points = c(90, 92), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
row.names = c("1", "2"), class = "data.frame"))

You could try dplyr:
library(dplyr)
all_games %>%
bind_rows() %>%
group_by(Game.ID) %>%
mutate(Victory = row_number(Points)-1)
Which gives:
#Source: local data frame [4 x 4]
#Groups: Game.ID
#
# Game.ID Team Points Victory
#1 201210300CLE CLE 94 1
#2 201210300CLE WAS 84 0
#3 201210300CME CLE 90 0
#4 201210300CME WAS 92 1

Get the time between consecutive dates stored in a single column

I am trying to figure out how to get the time between consecutive events when events are stored as a column of dates in a dataframe.
sampledf=structure(list(cust = c(1L, 1L, 1L, 1L), date = structure(c(9862,
9879, 10075, 10207), class = "Date")), .Names = c("cust", "date"
), row.names = c(NA, -4L), class = "data.frame")
I can get an answer with
as.numeric(rev(rev(difftime(c(sampledf$date[-1],0),sampledf$date))[-1]))
# [1] 17 196 132
but it is really ugly. Among other things, I only know how to exclude the first item in a vector, but not the last so I have to rev() twice to drop the last value.
Is there a better way?
By the way, I will use ddply to do this to a larger set of data for each cust id, so the solution would need to work with ddply.
library(plyr)
ddply(sampledf,
c("cust"),
summarize,
daysBetween = as.numeric(rev(rev(difftime(c(date[-1],0),date))[-1]))
)
Thank you!

Are you looking for this?
as.numeric(diff(sampledf$date))
# [1] 17 196 132
To remove the last element, use head:
head(as.numeric(diff(sampledf$date)), -1)
# [1] 17 196
require(plyr)
ddply(sampledf, .(cust), summarise, daysBetween = as.numeric(diff(date)))
# cust daysBetween
# 1 1 17
# 2 1 196
# 3 1 132

You can just use diff.
as.numeric(diff(sampledf$date))
To leave off the last, element, you can do:
[-length(vec)] #where `vec` is your vector
In this case I don't think you need to leave anything off though, because diff is already one element shorter:
test <- ddply(sampledf,
c("cust"),
summarize,
daysBetween = as.numeric(diff(sampledf$date)
))
test
# cust daysBetween
#1 1 17
#2 1 196
#3 1 132

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Create new column based on partial match with another column - r

We could use regex_left_join from fuzzyjoin library(fuzzyjoin) regex_left_join(df, unique, by = c("microRNAs" = "unique"))

Related

Sort dataframe by row index value without changing values

Pre-processing data in R: filtering and replacing using wildcards

Calculate ratio every two rows with partial string matches

Comparing pairs of rows in a list of data frames

Get the time between consecutive dates stored in a single column

Categories

Resources