How to remove a character (asterisk) from column values in R?

I have a data frame that looks like this, but with 6k rows:
AWC, LocationID
333, *Yukon
485, *Lewis Rich
76, *Kodiak
666, Kodiak
54, *Rays
I would like to remove the asterisks from the LocationID values, if that's possible, and keep just the original name, so *Yukon -> Yukon. If that's not possible, could you help me with a way to rename a column value? I'm new to R.

The stringr package has some very handy functions for vectorized string manipulation.
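In the following code I replace the * with ''. Note that * is a regex metacharacter, so it has to be escaped to match it literally. Inside an R string the backslash must itself be escaped, so the pattern is written with a double backslash, \\*, instead of the single backslash \* used in a bare regex. Also note that str_replace replaces only the first match in each string, which is enough here; str_replace_all replaces every match.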
library(stringr)
LocationID <- c('*Yukon','*Lewis Rich', '*Kodiak', 'Kodiak', '*Rays')
AWC <- c(333, 485, 76, 666, 54)
df <- data.frame(LocationID, AWC)
df$location_clean <- stringr::str_replace(df$LocationID, '\\*', '')
Resulting in:
LocationID AWC location_clean
1 *Yukon 333 Yukon
2 *Lewis Rich 485 Lewis Rich
3 *Kodiak 76 Kodiak
4 Kodiak 666 Kodiak
5 *Rays 54 Rays
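If the escaping feels error-prone, base R can also treat the pattern as a plain string. A minimal sketch, assuming the same LocationID values as above; fixed = TRUE is a standard argument of sub and gsub that turns off regex interpretation, so no backslashes are needed:

```r
# Same example values as in the answer above
LocationID <- c('*Yukon', '*Lewis Rich', '*Kodiak', 'Kodiak', '*Rays')

# fixed = TRUE matches "*" as a literal character, not as a regex quantifier
clean_fixed <- sub("*", "", LocationID, fixed = TRUE)

# Equivalent regex form with the escaped asterisk
clean_regex <- sub("\\*", "", LocationID)

print(clean_fixed)
```

Both forms remove only the first asterisk in each string; swapping sub for gsub would remove all of them.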

This can also be achieved using the mutate verb from dplyr (loaded as part of the tidyverse), which in my opinion is more readable. To illustrate, I create a dataset called DT that mimics the LocationID column from the question.
library(tidyverse)
DT <- data.frame('AWC'= c(333, 485, 76, 666, 54),
'LocationID'= c('*Yukon','*Lewis Rich', '*Kodiak', 'Kodiak', '*Rays'))
head(DT)
AWC LocationID
1 333 *Yukon
2 485 *Lewis Rich
3 76 *Kodiak
4 666 Kodiak
5 54 *Rays
In what follows, mutate rewrites the column content and gsub does the desired substitution (of * with ""), which keeps the data-cleaning flow easy to follow.
DT <- DT %>% mutate(LocationID = gsub("\\*", "", LocationID))
head(DT)
AWC LocationID
1 333 Yukon
2 485 Lewis Rich
3 76 Kodiak
4 666 Kodiak
5 54 Rays
Note that \\ is placed before * to escape it, because * is a regex metacharacter.

Use gsub and escape the asterisk (written \\* in the pattern, because * is a special character in a regex) to replace * with the empty string "", thus deleting it.
> so
AWC LocationID
1 333 *Yukon
2 485 *Lewis Rich
3 76 *Kodiak
4 666 Kodiak
5 54 *Rays
> so$LocationID=gsub("\\*","",so$LocationID)
> so
AWC LocationID
1 333 Yukon
2 485 Lewis Rich
3 76 Kodiak
4 666 Kodiak
5 54 Rays
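Whether sub or gsub is the right call depends on where asterisks can appear. A small sketch with made-up values (the strings below are illustrative and not taken from the question's data) showing the difference, plus an anchored pattern that strips only a leading asterisk:

```r
# Hypothetical values with asterisks in different positions
x <- c("*Yukon", "*Lewis*Rich", "Ko*diak")

first_only <- sub("\\*", "", x)    # removes the first * in each string
all_stars  <- gsub("\\*", "", x)   # removes every *
leading    <- sub("^\\*", "", x)   # removes a * only at the start of the string

print(data.frame(x, first_only, all_stars, leading))
```

For the data in the question, where the asterisk only ever leads the name, all three give the same result.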

Related

How to extract the first 3 digits of a variable?

My numeric variable looks like this:
u$a <- c(1234, 1432, 1456, 13467)
How do I create a new variable a1 which is the first three characters of the variable a such that it would look like this:
u$a1 <- c(123, 143, 145, 134)
Thank you.
Use integer division:
u$a1 <- u$a %/% 10^(nchar(u$a) - 3)
u
#> a a1
#> 1 1234 123
#> 2 1432 143
#> 3 1456 145
#> 4 13467 134
You could first convert it to character, use substr to take the first through third characters, and convert the result back to numeric like this:
u$a1 <- as.numeric(substr(as.character(u$a), 1, 3))
u
#> a a1
#> 1 1234 123
#> 2 1432 143
#> 3 1456 145
#> 4 13467 134
Created on 2023-01-26 with reprex v2.0.2
Data used:
u <- data.frame(a = c(1234, 1432, 1456, 13467))
Using sub
u$a1 <- as.numeric(sub("^(...).*", "\\1", u$a))
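One caveat with the approaches above: they all rely on the character form of the number, and R may print large round numbers in scientific notation (for example, as.character(100000) gives "1e+05"), which throws off the digit count. A purely numeric sketch, assuming positive whole numbers; first_three is a hypothetical helper name, and the last two values in the data are added here only to exercise the edge cases:

```r
# Hypothetical helper: first three digits of positive whole numbers
first_three <- function(x) {
  digits <- floor(log10(x)) + 1    # number of digits before the decimal point
  x %/% 10^pmax(digits - 3, 0)     # drop everything after the third digit
}

# The question's values plus two edge cases (100000 and a two-digit number)
u <- data.frame(a = c(1234, 1432, 1456, 13467, 100000, 12))
u$a1 <- first_three(u$a)
print(u)
```

Numbers with fewer than three digits are returned unchanged because of the pmax guard.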

dplyr mutate column with nearest value in external list

I'm trying to mutate a column and populate it with exact matches from a list if those occur, and if not, the closest match possible.
My data frame looks like this:
index <- seq(1, 10, 1)
blockID <- c(100, 120, 132, 133, 201, 207, 210, 238, 240, 256)
df <- as.data.frame(cbind(index, blockID))
index blockID
1 1 100
2 2 120
3 3 132
4 4 133
5 5 201
6 6 207
7 7 210
8 8 238
9 9 240
10 10 256
I want to mutate a new column that checks whether blockID is in a list. If yes, it should just keep the value of blockID. If not, it should return the nearest value in blocklist:
blocklist <- c(100, 120, 130, 150, 201, 205, 210, 238, 240, 256)
so the additional column should contain
100 (match),
120 (match),
130 (no match for 132--nearest value is 130),
130 (no match for 133--nearest value is 130),
201,
205 (no match for 207--nearest value is 205),
210,
238,
240,
256
Here's what I've tried:
df2 <- df %>% mutate(blockmatch = ifelse(blockID %in% blocklist, blockID, ifelse(match.closest(blockID, blocklist, tolerance = Inf), "missing")))
I just put in "missing" to complete the ifelse() statements, but it shouldn't actually be returned anywhere since the preceding cases will be fulfilled for every value of blockID. However, the resulting df2 just has "missing" in all the cells where it should have substituted the nearest number. I know there are base R alternatives to match.closest but I'm not sure that's the problem. Any ideas?
You don't need if..else. Your rule can be simplified by saying that we always take the blocklist element with the least absolute difference from blockID. If the values match, the absolute difference is 0 (which will always be the least).
With that here's a simple base R solution -
df$blockmatch <- sapply(df$blockID, function(x) blocklist[order(abs(x - blocklist))][1])
index blockID blockmatch
1 1 100 100
2 2 120 120
3 3 132 130
4 4 133 130
5 5 201 201
6 6 207 205
7 7 210 210
8 8 238 238
9 9 240 240
10 10 256 256
Here are a couple of ways with dplyr -
df %>%
rowwise() %>%
mutate(
blockmatch = blocklist[order(abs(blockID - blocklist))][1]
)
df %>%
mutate(
blockmatch = sapply(blockID, function(x) blocklist[order(abs(x - blocklist))][1])
)
Thanks to @Onyambu, here's a faster way -
df$blockmatch <- blocklist[max.col(-abs(sapply(blocklist, '-', df$blockID)))]
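The same rule can be written with which.min, which some may find easier to read. This is a sketch rather than a benchmarked alternative; nearest is a hypothetical helper name. One difference worth knowing: max.col breaks ties at random by default (ties.method = "random"), whereas which.min always returns the first minimum.

```r
blockID   <- c(100, 120, 132, 133, 201, 207, 210, 238, 240, 256)
blocklist <- c(100, 120, 130, 150, 201, 205, 210, 238, 240, 256)
df <- data.frame(index = seq_along(blockID), blockID = blockID)

# For one value, pick the candidate with the smallest absolute difference.
# which.min returns the first minimum, so ties go to the earlier candidate.
nearest <- function(x, candidates) candidates[which.min(abs(x - candidates))]

# vapply checks that every result is a single numeric value
df$blockmatch <- vapply(df$blockID, nearest, numeric(1), candidates = blocklist)
print(df)
```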

gsub not working while implementing in a loop

Here I have the following data frame df in R.
kyid industry amount
112 Apparel 345436
234 APPEARELS 234567
213 apparels 345678
345 Airlines 235678
123 IT 456789
124 IT 897685
I want to replace the incorrectly written values in industry (Apparel, APPEARELS, apparels) with Apparels.
I tried creating a list and running it through a loop.
l<-c('Apparel ','APPEARELS','apparels')
for(i in range(1:3)){
df$industry<-gsub(pattern=l[i],"Apparels",df$industry)
}
It is not working: only one element changes.
But when I run the statement individually, it does not raise an error and it works.
df$industry<-gsub(pattern=","Apparels",df$industry)
But this is a large dataset, so I need this to work in R. Please help.
Use sub without a loop, joining the patterns with |:
l <- c("Apparel" , "APPEARELS", "apparels")
# Using OPs data
sub(paste(l, collapse = "|"), "Apparels", df$industry)
# [1] "Apparels" "Apparels" "Apparels" "Airlines" "IT" "IT"
I'm using sub instead of gsub as there's only one occurrence of pattern in a string (at least in example).
While range returns a sequence in Python, it returns the minimum and maximum of a vector in R:
range(1:3)
# [1] 1 3
Instead, you could use 1:3 or seq(1,3) or seq_along(l), which all return
# [1] 1 2 3
Also note the difference between 'Apparel' and 'Apparel '.
So
df<-read.table(header=T, text="kyid industry amount
112 Apparel 345436
234 APPEARELS 234567
213 apparels 345678
345 Airlines 235678
123 IT 456789
124 IT 897685")
l<-c('Apparel','APPEARELS','apparels')
for(i in seq_along(l)){
df$industry<-gsub(pattern=l[i],"Apparels",df$industry)
}
df
# kyid industry amount
# 1 112 Apparels 345436
# 2 234 Apparels 234567
# 3 213 Apparels 345678
# 4 345 Airlines 235678
# 5 123 IT 456789
# 6 124 IT 897685
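One thing to watch with both approaches in a large dataset: the pattern "Apparel" also matches inside a value that is already the correct "Apparels". A sketch of an anchored pattern that only matches whole values; the industry vector below is the question's column plus one already-correct entry, added here as an assumption to show the effect:

```r
# Question's values plus an already-correct "Apparels" (added for illustration)
industry <- c("Apparel", "APPEARELS", "apparels", "Apparels", "Airlines", "IT")
l <- c("Apparel", "APPEARELS", "apparels")

# Unanchored: "Apparel" also matches the start of "Apparels"
unanchored <- sub(paste(l, collapse = "|"), "Apparels", industry)

# Anchored: ^( ... )$ only matches when the whole value is one of the variants
pattern  <- paste0("^(", paste(l, collapse = "|"), ")$")
anchored <- sub(pattern, "Apparels", industry)

print(data.frame(industry, unanchored, anchored))
```

If the real data carries stray whitespace (as the 'Apparel ' entry in the question suggests), wrapping the column in trimws() before matching is one way to handle it.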

Rotate Row Cell Values to a Single Column in R

I can't seem to find anything relevant to my specific problem, so I am asking here.
I have my original dataframe here:
Sample#, Fert_A_Mean, Fert_B_Mean, Fert_C_Mean, Fert_D_Mean
1 987, 384, 672, 364
2 567, 845, 398, 243
And I'd like to be able to restructure it like this:
Sample#, Fert_Mean
1 987
1 384
1 672
1 364
2 567
2 845
2 398
2 243
I've found some similar topics on Stack Exchange, but using t() in this case doesn't seem to work... or I am doing something wrong. Hopefully one of you folks can help me out. Thanks so much. Using R 3.4.1 through RStudio. Any packages you recommend for your methods are fine.
You could use something like gather from the tidyr package:
library(tidyr)
df2 <- gather(df, key=Fert_Mean, value=value, -Sample)
df2
Sample Fert_Mean value
1 1 Fert_A_Mean 987
2 2 Fert_A_Mean 567
3 1 Fert_B_Mean 384
4 2 Fert_B_Mean 845
5 1 Fert_C_Mean 672
6 2 Fert_C_Mean 398
7 1 Fert_D_Mean 364
8 2 Fert_D_Mean 243
You can remove the Fert_Mean column if you don't want it, and sort by Sample to get the ordering in your example.
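If only the two-column layout from the question is wanted, base R can produce it directly, and t() does turn out to be useful as one step. A sketch that rebuilds the question's data by hand and assumes the first column is named Sample, as in the answer above:

```r
# The question's data, typed in by hand
df <- data.frame(
  Sample      = c(1, 2),
  Fert_A_Mean = c(987, 567),
  Fert_B_Mean = c(384, 845),
  Fert_C_Mean = c(672, 398),
  Fert_D_Mean = c(364, 243)
)

value_cols <- setdiff(names(df), "Sample")

# t() puts one sample per column; as.vector() then reads column by column,
# so each sample's four values end up next to each other
long <- data.frame(
  Sample    = rep(df$Sample, each = length(value_cols)),
  Fert_Mean = as.vector(t(df[value_cols]))
)
print(long)
```

In current versions of tidyr, pivot_longer is the successor to gather and covers the same reshaping.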

R: Convert consensus output into a data frame

I'm currently performing a multiple sequence alignment using the 'msa' package from Bioconductor. I'm using this to calculate the consensus sequence (msaConsensusSequence) and conservation score (msaConservationScore). This gives me outputs that are values ...
e.g.
ConsensusSequence:
i.llE etc (str = chr)
(lower case = 20%+ conservation, uppercase = 80%+ conservation, . = <20% conservation)
ConservationScore:
221 -296 579 71 423 etc (str = named num)
I would like to convert these into a table where the first row contains columns where each is a different letter in the consensus sequence and the second row is the corresponding conservation score.
e.g.
i . l l E
221 -296 579 71 423
Could people please advise on the best way to go about this?
Thanks
Natalie
From what you have said in the comments, you can get a data frame like this:
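library(msa)
data(BLOSUM62)
# mySequences holds your own input sequences
alignment <- msa(mySequences)
conservation <- msaConservationScore(alignment, BLOSUM62)
# Now create the data frame; the names of the score vector hold the consensus characters
df <- data.frame(consensus = names(conservation), conservation = conservation)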
head(df)
consensus conservation
1 T 141
2 E 160
3 E 165
4 E 325
5 ? 179
6 ? 71
7 T 216
8 W 891
9 ? 38
10 T 405
11 L 204
If you prefer to transpose it you can:
df <- t(df)
colnames(df) <- 1:ncol(df)
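To see the shape of the result without running an alignment, here is a base R sketch that uses the five example values from the question as a stand-in for the msaConservationScore() output. The named vector is built by hand and is only an assumption about what the real result looks like:

```r
# Stand-in for the msaConservationScore() result, using the question's values
consensus <- "i.llE"
scores <- c(221, -296, 579, 71, 423)
conservation <- setNames(scores, strsplit(consensus, "")[[1]])

df <- data.frame(consensus = names(conservation),
                 conservation = unname(conservation))

# Transposing yields a character matrix: row 1 letters, row 2 scores (as text)
wide <- t(df)
colnames(wide) <- seq_len(ncol(wide))
print(wide)
```

Because a matrix holds a single type, the scores become text after t(); keep the long data frame if the scores are needed as numbers.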
