How to extract the first 3 digits of a numeric variable? - r

My numeric variable looks like this:
u$a <- c(1234, 1432, 1456, 13467)
How do I create a new variable a1 which is the first three characters of the variable a such that it would look like this:
u$a1 <- c(123, 143, 145, 134)
Thank you.

Use integer division:
u$a1 <- u$a%/% 10^(nchar(u$a)-3)
u
#> a a1
#> 1 1234 123
#> 2 1432 143
#> 3 1456 145
#> 4 13467 134
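One caveat with the integer-division approach: for values with fewer than three digits, nchar(u$a) - 3 is negative, so the divisor drops below 1 and the result is no longer a prefix of the number. Below is a minimal sketch of a guard, assuming non-negative whole numbers that R prints without scientific notation. The helper name first3 is hypothetical and not part of the answer above.

```r
# Hypothetical helper: keep at most the first three digits of each number.
first3 <- function(x) {
  # Count the digits beyond the third; clamp at 0 so short numbers are untouched
  extra <- pmax(nchar(as.character(x)) - 3, 0)
  # Integer division drops the trailing digits
  x %/% 10^extra
}

a <- c(1234, 1432, 1456, 13467, 12)
a1 <- first3(a)
a1
# 123 143 145 134 12
```

The value 12 is returned unchanged because the exponent is clamped at zero.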

You could first convert it to character, use substr to take the first through third characters, and then convert the result back to numeric, like this:
u$a1 <- as.numeric(substr(as.character(u$a), 1, 3))
u
#> a a1
#> 1 1234 123
#> 2 1432 143
#> 3 1456 145
#> 4 13467 134
Created on 2023-01-26 with reprex v2.0.2
Data used:
u <- data.frame(a = c(1234, 1432, 1456, 13467))

Using sub
u$a1 <- as.numeric(sub("^(...).*", "\\1", u$a))

Related

Can I identify the same values within a range between 2 columns?

I am trying to compare values between two different columns but I need it to accept values within a range of ±3.
I created these two tibbles:
example_tp1 <- tibble(Object_centre = c(84, 149, 489, 534, 680.5))
example_tp2 <- tibble(Object_centre = c(84.5, 149.5, 489, 528.5, 542, 680.5))
And I want the program to link the ones that are the same within a ±3 range.
So for example, I want it to identify that 84 and 84.5 are the same, also 149 and 149.5; 489 and 489; 680.5 and 680.5. But I want it to also tell me that 534, 528.5 and 542 do not have a match.
Is there any way to do this?
This could be achieved via the fuzzyjoin package like so:
library(dplyr)
library(fuzzyjoin)
example_tp1 <- tibble(Object_centre = c(84, 149, 489, 534, 680.5))
example_tp2 <- tibble(Object_centre = c(84.5, 149.5, 489, 528.5, 542, 680.5))
match_fun1 <- function(x, y) {
# (x >= y - 3) & (x <= y + 3)
# or following the suggestion by @DarrenTsai
abs(x - y) <= 3
}
fuzzy_full_join(example_tp1, example_tp2,
by = c("Object_centre"),
match_fun = match_fun1)
#> # A tibble: 7 x 2
#> Object_centre.x Object_centre.y
#> <dbl> <dbl>
#> 1 84 84.5
#> 2 149 150.
#> 3 489 489
#> 4 680. 680.
#> 5 534 NA
#> 6 NA 528.
#> 7 NA 542
Created on 2020-08-22 by the reprex package (v0.3.0)
You could look at all combinations of values and see which ones match.
library(dplyr)
# Data frame of all combinations
example <- expand.grid(c(84, 149, 489, 534, 680.5), c(84.5, 149.5, 489, 528.5, 542, 680.5))
# Flag a match when the two values are within 3 of each other
example %>%
  mutate(match = ifelse(abs(Var1 - Var2) <= 3, "Match", "No Match"))
Var1 Var2 match
1 84.0 84.5 Match
2 149.0 84.5 No Match
3 489.0 84.5 No Match
4 534.0 84.5 No Match
5 680.5 84.5 No Match
6 84.0 149.5 No Match
7 149.0 149.5 Match
8 489.0 149.5 No Match
9 ..... ..... ........
10 ..... ..... ........
and so on
You could then filter out only the matches or see which values have no match.
Similar to @Jumble's answer, using tidyverse functions:
tidyr::crossing(example_tp1, example_tp2, .name_repair = ~c('col1', 'col2')) %>%
dplyr::filter(abs(col1 - col2) <= 3)
# col1 col2
# <dbl> <dbl>
#1 84 84.5
#2 149 150.
#3 489 489
#4 680. 680.
crossing generates all combinations of example_tp1 and example_tp2, and we keep only those rows where the absolute difference is less than or equal to 3.
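The question also asks which values have no match at all. A minimal base R sketch of that step, assuming the same ±3 tolerance and the two vectors from the question; the names tp1, tp2 and within_tol are illustrative only:

```r
tp1 <- c(84, 149, 489, 534, 680.5)
tp2 <- c(84.5, 149.5, 489, 528.5, 542, 680.5)

# Logical matrix: rows are tp1 values, columns are tp2 values,
# TRUE where the pair differs by at most 3
within_tol <- abs(outer(tp1, tp2, "-")) <= 3

# Values with no partner inside the tolerance
unmatched_tp1 <- tp1[rowSums(within_tol) == 0]
unmatched_tp2 <- tp2[colSums(within_tol) == 0]

unmatched_tp1  # 534
unmatched_tp2  # 528.5 542
```

Like the expand.grid and crossing approaches, this compares every pair, so memory use grows with the product of the two lengths.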

How can I divide several whole numbers separated by a comma in one column by numbers in another column

I want to divide numbers separated by commas in one column by the numbers in another column.
Here is the input I have
> df = data.frame (SAMPLE1.DP=c("555","651","641","717"), SAMPLE1.AD=c("555", "68,583","2,639","358,359"), SAMPLE2.DP=c("1023","930","683","1179"), SAMPLE2.AD=c("1023","0,930","683","585,594"))
> df
SAMPLE1.DP SAMPLE1.AD SAMPLE2.DP SAMPLE2.AD
1 555 555 1023 1023
2 651 68,583 930 0,930
3 641 2,639 683 683
4 717 358,359 1179 585,594
In the end I want to add two new columns (AD/DP) that divide the values of SAMPLE1.AD by SAMPLE1.DP and SAMPLE2.AD by SAMPLE2.DP, giving the proportion for the number on each side of the comma, like this:
> end = data.frame(SAMPLE1.DP=c("555","651","641","717"),
+ SAMPLE1.AD=c("555", "68,583","204,437","358,359"),
+ SAMPLE1.AD_DP=c("1.00","0.10,0.90","0.32,0.68","0.50,0.50"),
+ SAMPLE2.DP=c("1023","930","683","1179"),
+ SAMPLE2.AD=c("1023","0,930","683","585,594"),
+ SAMPLE2.AD_DP=c("1.00","0.00,1.00","1.00","0.49,0.51"))
>end
SAMPLE1.DP SAMPLE1.AD SAMPLE1.AD_DP SAMPLE2.DP SAMPLE2.AD SAMPLE2.AD_DP
1 555 555 1.00 1023 1023 1.00
2 651 68,583 0.10,0.90 930 0,930 0.00,1.00
3 641 204,437 0.32,0.68 683 683 1.00
4 717 358,359 0.50,0.50 1179 585,594 0.49,0.51
it means :
XX YY,ZZ YY/XX,ZZ/XX AA BB,CC BB/AA,CC/AA
If I convert the values in the table with as.numeric, it does not work, since the values are separated by commas.
Do you have any idea how to do this?
Thanks in advance for your help.
First thing you need to do is replace the , with . and cast to numeric. Then split based on your required condition and divide, i.e.
df[] <- lapply(df, function(i)as.numeric(gsub(',', '.', i)))
do.call(cbind, lapply(split.default(df, gsub('\\D+', '', names(df))), function(i) i[2] / i[1]))
# SAMPLE1.AD SAMPLE2.AD
#1 1.000000000 1.000000
#2 0.004066052 0.001000
#3 0.004117005 1.000000
#4 0.499803347 0.496687
If there are commas in your numbers, then the column has most likely been read in as character. What you need to do is convert your columns to numeric and then divide each column by its counterpart.
library(tidyverse)
dat <- tribble(~"SAMPLE1.DP", ~"SAMPLE1.AD", ~"SAMPLE2.DP", ~"SAMPLE2.AD",
555, 555, 1023, 1023,
651, "2,647", 930, ",93",
641, "2,639", 683, 683,
717, "358,359", 1179, "585,594")
dat %>%
mutate_at(c(2,4), list(~str_replace(., ",", "."))) %>%
mutate_all(as.numeric) %>%
mutate(addp1 = SAMPLE1.AD / SAMPLE1.DP,
addp2 = SAMPLE2.AD / SAMPLE2.DP)
#> # A tibble: 4 x 6
#> SAMPLE1.DP SAMPLE1.AD SAMPLE2.DP SAMPLE2.AD addp1 addp2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 555 555 1023 1023 1 1
#> 2 651 2.65 930 0.93 0.00407 0.001
#> 3 641 2.64 683 683 0.00412 1
#> 4 717 358. 1179 586. 0.500 0.497
Created on 2019-05-20 by the reprex package (v0.2.1)
Thanks everyone, but I was not very clear in my question, sorry.
In my input example, I have only whole numbers separated by commas, no decimals.
For example, on line 3 of my example:
2,647 means 2 AND 647, and I want to divide both numbers by 651 to get 2/651 and 647/651 as the result, i.e. 0.01 and 0.99 (or 1% and 99%).
They are whole numbers (integers), separated by commas.
I hope that is clearer, thanks.
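Following that clarification, each AD entry has to be split on the comma and every piece divided by the DP value of the same row. Below is a minimal base R sketch under that reading, using the SAMPLE1 columns from the question. The helper name ratio_by_depth and the two-decimal formatting are assumptions made for illustration.

```r
df <- data.frame(
  SAMPLE1.DP = c("555", "651", "641", "717"),
  SAMPLE1.AD = c("555", "68,583", "2,639", "358,359"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each AD string on commas, divide every piece
# by the matching DP value, and paste the ratios back together.
ratio_by_depth <- function(ad, dp) {
  mapply(function(a, d) {
    parts <- as.numeric(strsplit(a, ",", fixed = TRUE)[[1]])
    paste(sprintf("%.2f", parts / as.numeric(d)), collapse = ",")
  }, ad, dp, USE.NAMES = FALSE)
}

df$SAMPLE1.AD_DP <- ratio_by_depth(df$SAMPLE1.AD, df$SAMPLE1.DP)
df$SAMPLE1.AD_DP
# "1.00" "0.10,0.90" "0.00,1.00" "0.50,0.50"
```

The same call can be repeated for the SAMPLE2 columns; the result stays a character column because each cell may hold more than one ratio.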

How to write a loop that looks for a condition in two columns then adds the value in the third of a data frame?

I have a data frame with four columns, and I need to sum the values in the third column, but only for rows where the numbers in the first two columns are different. The only way I can think of is an if statement inside a loop. Can that be done, or is there a better way?
Genotype summary
Dnov1a Dnov1b Freq rel_geno_freq
1 220 220 1 0.003367003
7 220 224 4 0.013468013
8 224 224 8 0.026936027
13 220 228 14 0.047138047
This is a portion of the data as an example. I need to sum the third column, Freq, for rows 7 and 13, because their first two columns are different.
Here's a tidyverse way of doing it:
library(tidyverse)
data <- tribble(
~Dnov1a, ~Dnov1b, ~Freq, ~rel_geno_freq,
220, 220, 1, 0.003367003,
220, 224, 4, 0.013468013,
224, 224, 8, 0.026936027,
220, 228, 14, 0.047138047)
data %>%
mutate(filter_column = if_else(Dnov1a != Dnov1b, TRUE, FALSE)) %>%
filter(filter_column == TRUE) %>%
summarise(Total = sum(Freq))
# A tibble: 1 x 1
Total
<dbl>
1 18
data$new = data$Dnov1a!=data$Dnov1b
data
Dnov1a Dnov1b Freq rel_geno_freq new
<int> <int> <int> <dbl> <lgl>
1 220 220 1 0.00337 FALSE
2 220 224 4 0.0135 TRUE
3 224 224 8 0.0269 FALSE
4 220 228 14 0.0471 TRUE
sum(data$Freq[data$new])
[1] 18
Is this what you are looking for?
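No explicit loop is needed, because the comparison is vectorised: it yields one TRUE/FALSE per row, which can index Freq directly. A minimal base R sketch using the four example rows from the question; the names differs and total are illustrative:

```r
data <- data.frame(
  Dnov1a = c(220, 220, 224, 220),
  Dnov1b = c(220, 224, 224, 228),
  Freq   = c(1, 4, 8, 14)
)

# TRUE for rows whose two genotype columns differ
differs <- data$Dnov1a != data$Dnov1b

# Sum Freq over those rows only
total <- sum(data$Freq[differs])
total
# 18
```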

How to remove a character (asterisk) in column values in r?

so I have a dataframe that looks like this but has 6k rows:
AWC, LocationID
333, *Yukon
485, *Lewis Rich
76, *Kodiak
666, Kodiak
54, *Rays
I would like to remove the asterisks from the LocationID values if that's possible and just keep the original name, so *Yukon -> Yukon. If that's not possible, could you help me with a way to rename a column value? I'm new to R.
The stringr package has some very handy functions for vectorized string manipulation.
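In the following code I replace the * with ''. Note that * is a regex metacharacter, so to match it literally it has to be escaped, and inside an R string the escape is written with a double backslash \\ instead of the usual single backslash \. Also note that str_replace only replaces the first match in each string, which is enough here because each value has at most one asterisk.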
library(stringr)
LocationID <- c('*Yukon','*Lewis Rich', '*Kodiak', 'Kodiak', '*Rays')
AWC <- c(333, 485, 76, 666, 54)
df <- data.frame(LocationID, AWC)
df$location_clean <- stringr::str_replace(df$LocationID, '\\*', '')
Resulting in:
LocationID AWC location_clean
1 *Yukon 333 Yukon
2 *Lewis Rich 485 Lewis Rich
3 *Kodiak 76 Kodiak
4 Kodiak 666 Kodiak
5 *Rays 54 Rays
This can be achieved using the mutate verb from dplyr (part of the tidyverse), which in my opinion is more readable. To illustrate, I create a dataset called DT that mimics the LocationID column from the question.
library(tidyverse)
DT <- data.frame('AWC'= c(333, 485, 76, 666, 54),
'LocationID'= c('*Yukon','*Lewis Rich', '*Kodiak', 'Kodiak', '*Rays'))
head(DT)
AWC LocationID
1 333 *Yukon
2 485 *Lewis Rich
3 76 *Kodiak
4 666 Kodiak
5 54 *Rays
In what follows, mutate alters the column content and gsub does the substitution (of * with ""), which keeps the data-cleaning steps easy to follow.
DT <- DT %>% mutate(LocationID = gsub("\\*", "", LocationID))
head(DT)
AWC LocationID
1 333 Yukon
2 485 Lewis Rich
3 76 Kodiak
4 666 Kodiak
5 54 Rays
NOTE that \\ is placed before * as the escape sequence.
Use gsub with the escape sequence \\ (because * is a special character in regular expressions) to replace * with the empty string "", thus deleting it.
> so
AWC LocationID
1 333 *Yukon
2 485 *Lewis Rich
3 76 *Kodiak
4 666 Kodiak
5 54 *Rays
> so$LocationID=gsub("\\*","",so$LocationID)
> so
AWC LocationID
1 333 Yukon
2 485 Lewis Rich
3 76 Kodiak
4 666 Kodiak
5 54 Rays
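The escaping can be avoided altogether by asking gsub to treat the pattern as a literal string with fixed = TRUE. A minimal sketch on the same example data; the data frame name so follows the answer above:

```r
so <- data.frame(
  AWC = c(333, 485, 76, 666, 54),
  LocationID = c("*Yukon", "*Lewis Rich", "*Kodiak", "Kodiak", "*Rays"),
  stringsAsFactors = FALSE
)

# fixed = TRUE matches "*" literally, so no backslashes are needed
so$LocationID <- gsub("*", "", so$LocationID, fixed = TRUE)
so$LocationID
# "Yukon" "Lewis Rich" "Kodiak" "Kodiak" "Rays"
```

If only a leading asterisk should be removed and asterisks elsewhere in the name kept, sub("^\\*", "", x) is an alternative.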

Subset Columns based on partial matching of column names in the same data frame

I would like to understand how to subset multiple columns from same data frame by matching the first 5 letters of the column names with each other and if they are equal then subset it and store it in a new variable.
Here is a small explanation of my required output. It is described below,
Let's say the data frame is eatable:
fruits_area fruits_production vegetables_area vegetable_production
12 100 26 324
33 250 40 580
660 510 43 581
eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
"vegetable_production")
I was trying to write a function which will match the strings in a loop and will store the subset columns after matching first 5 letters from the column names.
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
checkExpression(eatable,"your_string")
The above function checks the string correctly but I am confused how to do matching among the column names in the dataset.
Edit:- I think regular expressions would work here.
You could try:
# Unique 5-character prefixes of the column names
v <- unique(substr(names(eatable), 1, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select() (select_() is deprecated in current dplyr):
library(tidyverse)
map(v, ~select(eatable, matches(.x)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
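checkExpression <- function(df, l = 5) {
  # Unique prefixes of length l taken from the column names
  v <- unique(substr(names(df), 1, l))
  # One data frame per prefix, holding the columns whose names contain it
  lapply(v, function(x) df[grepl(x, names(df))])
}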
Then simply use:
checkExpression(eatable, 5)
I believe this may address your needs:
checkExpression <- function(dataset,str){
cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
subset(dataset,select=colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want but there was a small error:
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
Change the name of the object you're subsetting from obje to dataset.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
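For the original goal of grouping columns by their first five letters without naming the prefixes by hand, the columns can also be split directly on the prefix. A minimal base R sketch, assuming the eatable data frame from the question; the variable name groups is illustrative:

```r
eatable <- data.frame(c(12, 33, 660), c(100, 250, 510), c(26, 40, 43), c(324, 580, 581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
                    "vegetable_production")

# split.default() treats the data frame as a list of columns and groups
# them by the first five characters of each column name
groups <- split.default(eatable, substr(names(eatable), 1, 5))

names(groups)        # "fruit" "veget"
names(groups$fruit)  # "fruits_area" "fruits_production"
```

Each element of groups is itself a data frame, so groups$veget can be stored in a new variable or processed further.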
