Suppose I have a data frame like this:
df = data.frame ( a = c(1,14,15,11) , b= c("xxxchrxxx","xxxchryy","zzchrzz","aachraa") )
a b
1 1 xxxchrxxx
2 14 xxxchryy
3 15 zzchrzz
4 11 aachraa
What I want is to replace chr in column b with chr followed by the value from column a:
a b
1 1 xxxchr1xxx
2 14 xxxchr14yy
3 15 zzchr15zz
4 11 aachr11aa
However, I can't get gsub to work, since it expects a single replacement element:
df$b = gsub ( "chr",paste0("chr",df$a), df$b)
Is there any way to do this?
The reason is that the replacement argument of gsub accepts only a vector of length 1. According to ?gsub:
replacement - if a character vector of length 2 or more is supplied, the first element is used with a warning.
If you need a vectorized replacement, use str_replace:
library(stringr)
str_replace(df$b, "chr", paste0("chr", df$a))
#[1] "xxxchr1xxx" "xxxchr14yy" "zzchr15zz" "aachr11aa"
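If adding stringr is not an option, the same row-wise replacement can be sketched in base R by pairing each string with its own replacement through mapply. This is only an illustration, assuming "chr" is a literal string that should be replaced once per row; the data frame is the one from the question.

```r
df <- data.frame(
  a = c(1, 14, 15, 11),
  b = c("xxxchrxxx", "xxxchryy", "zzchrzz", "aachraa"),
  stringsAsFactors = FALSE
)

# sub() is called once per row, so each call receives a length-1 replacement
df$b <- mapply(
  function(text, value) sub("chr", paste0("chr", value), text, fixed = TRUE),
  df$b, df$a,
  USE.NAMES = FALSE
)
df$b
```

Using sub() rather than gsub() replaces only the first occurrence in each string, which mirrors str_replace; switch to gsub() if every occurrence should change.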
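If column b contains only "chr" in every row, a simple paste is enough (note that this appends a to the end of b, so it does not fit strings such as "xxxchrxxx"):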
df$b <- with(df, paste0(b, a))
EDIT:: With stringr:
stringr::str_replace_all(df$b,"chr",paste0("chr",df$a))
Continuing with paste0:
df$b<-paste0(df$b,df$a)
a b
1 1 chr1
2 14 chr14
3 15 chr15
4 11 chr11
df = data.frame ( a = c(1,14,15,11) , b= c("chr","chr","chr","chr") )
df$b <- paste0(df$b, df$a)
df
#> a b
#> 1 1 chr1
#> 2 14 chr14
#> 3 15 chr15
#> 4 11 chr11
Created on 2019-02-22 by the reprex package (v0.2.1)
I want to append a number (or character, it doesn't matter) as a superscript to an existing string.
This is what my initial dataframe looks like:
testframe = as.data.frame(c("A34", "B21", "C64", "D83", "E92", "F24"))
testframe$V2 = c(1,3,2,2,3,NA)
colnames(testframe)[1] = "V1"
V1 V2
1 A34 1
2 B21 3
3 C64 2
4 D83 2
5 E92 3
6 F24 NA
What I want to do now is to use V2 as a kind of "footnote", so any superscript or subscript (it doesn't matter which one). When there is no entry in V2, I just want to keep V1 as it is.
I found a similar question where I saw this answer:
> paste0("H", "\u2081", "O")
[1] "H₁O"
This is what I want, but my problem is that it has to be created automatically since I have way too many rows in my real dataframe.
I tried to add an extra column "V3" to enter the Superscripts and Subscripts Codes:
testframe$V3 = c("u2081", "u2083", "u2082", "u2082", "u2083", NA)
V1 V2 V3
1 A34 1 u2081
2 B21 3 u2083
3 C64 2 u2082
4 D83 2 u2082
5 E92 3 u2083
6 F24 NA <NA>
But when I try paste(testframe$V1, testframe$V3, sep = "\") it gives me an error. How can I use the \ in this case?
If the subscripts will always be digits 0 through 9, you can index into a vector of Unicode subscript digits:
subscripts <- c(
"\u2080",
"\u2081",
"\u2082",
"\u2083",
"\u2084",
"\u2085",
"\u2086",
"\u2087",
"\u2088",
"\u2089"
)
testframe$V1 <- paste0(
testframe$V1,
ifelse(
is.na(testframe$V2),
"",
subscripts[testframe$V2 + 1]
)
)
testframe
V1 V2
1 A34₁ 1
2 B21₃ 3
3 C64₂ 2
4 D83₂ 2
5 E92₃ 3
6 F24 NA
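The indexing approach above covers single digits only. If V2 may hold larger footnote numbers, one possible extension is to translate each digit separately. The helper name to_subscript is hypothetical, and the sketch assumes non-negative whole numbers.

```r
# Unicode subscript digits 0 through 9
subscripts <- c(
  "\u2080", "\u2081", "\u2082", "\u2083", "\u2084",
  "\u2085", "\u2086", "\u2087", "\u2088", "\u2089"
)

# Hypothetical helper: convert each number to a string of subscript digits,
# returning "" for NA so the original label is left unchanged
to_subscript <- function(n) {
  vapply(n, function(value) {
    if (is.na(value)) return("")
    digits <- as.integer(strsplit(sprintf("%d", as.integer(value)), "")[[1]])
    paste(subscripts[digits + 1], collapse = "")
  }, character(1))
}

testframe <- data.frame(
  V1 = c("A34", "B21", "C64", "D83", "E92", "F24"),
  V2 = c(1, 3, 12, 2, 3, NA),
  stringsAsFactors = FALSE
)
testframe$V1 <- paste0(testframe$V1, to_subscript(testframe$V2))
testframe
```

The third row deliberately uses 12 instead of the question's 2 to exercise the multi-digit case.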
(This question is based on a previous question Convert letters with duplicates to numbers)
I have series of events and non-events in column aoi, with events expressed as capital letters and non-events expressed as "*":
df <- data.frame(
Partcpt = c("B","A","B","C","A","B"),
aoi = c("B*B*B","*A*C*A*C","*B*B","A*C","*A*","*")
)
I need to convert the letters to consecutive numbers unless they are duplicates, in which case the previous number should be repeated. This conversion is accomplished by this:
df$aoi_0 <- sapply(strsplit(df$aoi, split = ""), function(x) paste(match(x[x!="*"], unique(x[x!="*"])), collapse = ""))
df
Partcpt aoi aoi_0
1 B B*B*B 111
2 A *A*C*A*C 1212
3 B *B*B 11
4 C A*C 12
5 A *A* 1
6 B *
But now the information on the non-events is lost. How can I reinstate that information in the strings themselves, by re-inserting the "*" character where appropriate, like so:
df
Partcpt aoi aoi_0
1 B B*B*B 1*1*1
2 A *A*C*A*C *1*2*1*2
3 B *B*B *1*1
4 C A*C 1*2
5 A *A* *1*
6 B * *
You can modify the anonymous function with an ifelse() to return * if the input is * but otherwise to follow the logic of your previous code, i.e. match the input to the vector of unique values.
df$aoi_1 <- sapply(
strsplit(df$aoi, split = ""),
\(x) paste0(
ifelse(
x=="*",
"*",
match(x, unique(x[x!="*"]))
), collapse = ""
)
)
df
# Partcpt aoi aoi_0 aoi_1
# 1 B B*B*B 111 1*1*1
# 2 A *A*C*A*C 1212 *1*2*1*2
# 3 B *B*B 11 *1*1
# 4 C A*C 12 1*2
# 5 A *A* 1 *1*
# 6 B * *
Another possible solution, which is based on the following ideas:
Try to match * against unique(x[x!="*"]).
This yields no match for *.
Set nomatch = 0, so that every * becomes 0.
Use gsub to replace 0 with *.
df$aoi_0 <- sapply(strsplit(df$aoi, split = ""),
function(x) gsub("0", "*", paste(match(x, unique(x[x!="*"]), nomatch = 0),
collapse = "")))
df
#> Partcpt aoi aoi_0
#> 1 B B*B*B 1*1*1
#> 2 A *A*C*A*C *1*2*1*2
#> 3 B *B*B *1*1
#> 4 C A*C 1*2
#> 5 A *A* *1*
#> 6 B * *
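One caveat with replacing "0" after pasting: if a string ever contains ten or more distinct letters, indices such as 10 contain a literal 0 that gsub will also turn into "*". The sample data never reaches that point, so the following is only a sketch with a made-up input string to show the difference from the ifelse() variant.

```r
# Made-up input with ten distinct events (not from the question's data)
x <- strsplit("A*B*C*D*E*F*G*H*I*J", split = "")[[1]]
events <- unique(x[x != "*"])

# nomatch = 0 followed by gsub: the "0" inside "10" is replaced as well
with_gsub <- gsub("0", "*", paste(match(x, events, nomatch = 0), collapse = ""))

# ifelse decides per character before pasting, so "10" survives intact
with_ifelse <- paste(ifelse(x == "*", "*", match(x, events)), collapse = "")

with_gsub
with_ifelse
```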
I have code that outputs a pair of integers as a string, such as "(1, 21)". The integers are always between 1 and 99.
I want to extract the integers into an array as numeric. How can I do this? I've done some research and it seems regex is the way to go, but I'm unsure exactly how to do this here.
Here are several base R one-liners. These each produce a data frame. Use as.matrix(...) on that if you want a matrix/array. (2) seems particularly compact.
1) trimws/read.table trim non-digits off the ends using trimws and then use read.table to read it in giving the data frame shown.
x <- c("(1, 21)", "(2, 22)", "(3, 33)") # input
read.table(text = trimws(x, white = "\\D"), sep = ",")
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
2) gsub/read.table Another approach is to convert each non-digit to a space and then use read.table:
read.table(text = gsub("\\D", " ", x))
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
3) strcapture Define a regular expression with captures to use with strcapture.
strcapture("(\\d+), (\\d+)", x, data.frame(V1 = integer(0), V2 = integer(0)))
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
4) chartr/read.table Use chartr to replace ( with a space and then use read.table defining the comment character as ).
read.table(text = chartr("(", " ", x), sep = ",", comment.char = ")")
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
You can use gsub to remove ( and ) using [()] and then use strsplit to split at ", ". unlist the returned list, convert it with as.integer and create a matrix or array.
matrix(as.integer(unlist(strsplit(gsub("[()]", "", x), ", ", TRUE))), 2)
# [,1] [,2]
#[1,] 1 3
#[2,] 21 31
Data:
x <- c("(1, 21)", "(3, 31)")
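Since the question mentions regular expressions, another base R route is to extract every run of digits directly with gregexpr and regmatches. This is a rough sketch that assumes each string holds exactly two integers; byrow = TRUE gives one row per input string rather than one column.

```r
x <- c("(1, 21)", "(2, 22)", "(3, 33)")

# Pull out every run of digits from each string
pairs <- regmatches(x, gregexpr("[0-9]+", x))

# One row per input string, two integer columns
result <- matrix(as.integer(unlist(pairs)), ncol = 2, byrow = TRUE)
result
```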
Tidyverse way
x <- c("(1, 21)", "(33, 99)", "(1, 7)")
library(tidyverse)
map_dfr(str_split(str_replace(x, '\\((\\d+)\\,\\s(\\d+)\\)', '\\1 \\2'), ' '), ~ set_names(.x, c('A', 'B')))
#> # A tibble: 3 x 2
#> A B
#> <chr> <chr>
#> 1 1 21
#> 2 33 99
#> 3 1 7
Created on 2021-06-02 by the reprex package (v2.0.0)
I have a regular expression that parses a bunch of text, and when doing regmatches(myText,myRegex) it returns a list which looks like:
[[1]]
[1] "a=1" "b=3" "a=9" "c=2" "b=4"
...
I'd like to build a data.frame or table - whatever suits best - to finally have something like:
a b c
1 3 2
9 4 ...
Is it possible to make this in a simple fashion? What are your suggestions?
Thanks in advance.
It's not entirely clear what the general case is here, but this works on the data provided.
Assuming this input:
x <- c("a=1", "b=3", "a=9", "c=2", "b=4")
split the values by the names producing s and massage into a data.frame:
s <- split(as.numeric(sub(".*=", "", x)), sub("=.*", "", x))
as.data.frame(do.call(cbind, lapply(s, ts)))
giving:
a b c
1 1 3 2
2 9 4 NA
No packages needed.
You can either use base R methods
d1 <- read.table(text=gsub("[[:punct:]]", " " , unlist(lst)))
d2 <- transform(d1, indx=ave(seq_along(V1), V1, FUN=seq_along))
res <- reshape(d2, timevar='V1', idvar='indx', direction='wide')[,-1]
colnames(res) <- gsub(".*\\.", "", colnames(res))
res
# a b c
#1 1 3 2
#3 9 4 2
#6 4 5 NA
#9 9 NA NA
Or using dcast from reshape2 on d2
library(reshape2)
dcast(d2,indx~V1, value.var='V2')[,-1]
# a b c
#1 1 3 2
#2 9 4 2
#3 4 5 NA
#4 9 NA NA
data
lst <- list(c('a=1', 'b=3', 'a=9', 'c=2', 'b=4'),
c('a=4', 'c=2', 'b=5', 'a=9'))
Using rex may make this type of extraction task a little simpler.
x <- c("a=1", "b=3", "a=9", "c=2", "b=4", "a=2")
First extract the names and values from the strings.
library(rex)
matches <- re_matches(x,
rex(
capture(name="name", letter),
"=",
capture(name="value", digit)
))
#> name value
#>1 a 1
#>2 b 3
#>3 a 9
#>4 c 2
#>5 b 4
#>6 a 2
Then tally the groups using split().
groups <- split(as.numeric(matches$value), matches$name)
#>$a
#>[1] 1 9 2
#>
#>$b
#>[1] 3 4
#>
#>$c
#>[1] 2
If we try to convert the output of split() directly to a data.frame, the groups with fewer members are recycled when the lengths divide evenly (and an error is raised when they do not) rather than padded with NA, so instead explicitly fill with NA.
largest_group <- max(sapply(groups, length))
#>[1] 3
groups <- lapply(groups, function(group) {
if (length(group) < largest_group) {
group[largest_group] <- NA
}
group
})
#>$a
#>[1] 1 9 2
#>
#>$b
#>[1] 3 4 NA
#>
#>$c
#>[1] 2 NA NA
Finally we can create the data.frame
do.call('data.frame', groups)
#> a b c
#>1 1 3 2
#>2 9 4 NA
#>3 2 NA NA
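The padding step can also be written more compactly in base R, because assigning a longer length to a vector fills the new positions with NA. The following is a sketch using sub() in place of rex for the extraction, with the same x as above.

```r
x <- c("a=1", "b=3", "a=9", "c=2", "b=4", "a=2")

# Split the values (text after "=") by their names (text before "=")
groups <- split(as.numeric(sub(".*=", "", x)), sub("=.*", "", x))

# Extending each vector to the longest length pads it with NA
largest_group <- max(lengths(groups))
padded <- lapply(groups, `length<-`, largest_group)

result <- as.data.frame(padded)
result
```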
Here's an approach using tools from my "splitstackshape" package:
library(splitstackshape)
dcast.data.table( ## Makes the long data wide
getanID( ## Adds an ID variable for dcast
## create a single column data.table and split it by the "="
cSplit(as.data.table(unlist(lst)), "V1", "="), "V1_1"),
.id ~ V1_1, value.var = "V1_2")
# .id a b c
# 1: 1 1 3 2
# 2: 2 9 4 2
# 3: 3 4 5 NA
# 4: 4 9 NA NA
This uses @akrun's sample data:
lst <- list(c('a=1', 'b=3', 'a=9', 'c=2', 'b=4'),
c('a=4', 'c=2', 'b=5', 'a=9'))
I have a dataframe where some of the values are NA. I would like to remove these columns.
My data.frame looks like this
v1 v2
1 1 NA
2 1 1
3 2 2
4 1 1
5 2 2
6 1 NA
I tried to estimate the col mean and select the column means !=NA. I tried this statement, it does not work.
data=subset(Itun, select=c(is.na(colMeans(Itun))))
I got an error,
error : 'x' must be an array of at least two dimensions
Can anyone give me some help?
The data:
Itun <- data.frame(v1 = c(1,1,2,1,2,1), v2 = c(NA, 1, 2, 1, 2, NA))
This will remove all columns containing at least one NA:
Itun[ , colSums(is.na(Itun)) == 0]
An alternative way is to use apply:
Itun[ , apply(Itun, 2, function(x) !any(is.na(x)))]
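One detail to watch with the bracket-based answers: when only a single column survives, as with the example data, the default indexing drops the result to a plain vector. A small sketch of the difference, using the Itun data from above (the names keep, as_vector and as_frame are only illustrative):

```r
Itun <- data.frame(v1 = c(1, 1, 2, 1, 2, 1), v2 = c(NA, 1, 2, 1, 2, NA))

# TRUE for columns without any NA
keep <- colSums(is.na(Itun)) == 0

as_vector <- Itun[, keep]                # a single column is dropped to a vector
as_frame  <- Itun[, keep, drop = FALSE]  # stays a one-column data frame

is.data.frame(as_vector)
is.data.frame(as_frame)
```

Passing drop = FALSE is usually the safer choice when later code expects a data frame.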
Here's a convenient way to do it using the dplyr function select_if(). Combine not (!), any() and is.na(), which is equivalent to selecting all columns that don't contain any NA values.
library(dplyr)
Itun %>%
select_if(~ !any(is.na(.)))
Alternatively, select(where(~FUNCTION)) can be used:
library(dplyr)
(df <- data.frame(x = letters[1:5], y = NA, z = c(1:4, NA)))
#> x y z
#> 1 a NA 1
#> 2 b NA 2
#> 3 c NA 3
#> 4 d NA 4
#> 5 e NA NA
# Remove columns where all values are NA
df %>%
select(where(~!all(is.na(.))))
#> x z
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e NA
# Remove columns with at least one NA
df %>%
select(where(~!any(is.na(.))))
#> x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
You can use transpose twice (note that t() converts the data frame to a matrix, so the result is a matrix too):
newdf <- t(na.omit(t(df)))
data[,!apply(is.na(data), 2, any)]
A base R method related to the apply answers is
Itun[!unlist(vapply(Itun, anyNA, logical(1)))]
v1
1 1
2 1
3 2
4 1
5 2
6 1
Here, vapply is used because we are operating on a list and, unlike apply, it does not coerce the object into a matrix. Also, since we know that the output for each column will be a logical vector of length 1, we can tell vapply so and potentially get a little speed boost. For the same reason, I used anyNA instead of any(is.na()).
Another alternative to the dplyr package is the base R Filter function:
Filter(function(x) !any(is.na(x)), Itun)
With data.table it would be a little more cumbersome:
library(data.table)
setDT(Itun)[,.SD,.SDcols=setdiff((1:ncol(Itun)),
which(colSums(is.na(Itun))>0))]
You can also try the following, which removes only the columns that are entirely NA:
df <- df[,colSums(is.na(df))<nrow(df)]