I want to append a superscript number (or character, it doesn't matter) to an existing string.
This is what my initial dataframe looks like:
testframe = as.data.frame(c("A34", "B21", "C64", "D83", "E92", "F24"))
testframe$V2 = c(1,3,2,2,3,NA)
colnames(testframe)[1] = "V1"
V1 V2
1 A34 1
2 B21 3
3 C64 2
4 D83 2
5 E92 3
6 F24 NA
What I want to do now is to use V2 as a kind of "footnote", i.e. as a superscript or subscript (it doesn't matter which one). When there is no entry in V2, I just want to keep V1 as it is.
I found a similar question where I saw this answer:
> paste0("H", "\u2081", "O")
[1] "H₁O"
This is what I want, but it has to be generated automatically, since my real dataframe has far too many rows.
I tried adding an extra column "V3" to hold the superscript/subscript codes:
testframe$V3 = c("u2081", "u2083", "u2082", "u2082", "u2083", NA)
V1 V2 V3
1 A34 1 u2081
2 B21 3 u2083
3 C64 2 u2082
4 D83 2 u2082
5 E92 3 u2083
6 F24 NA <NA>
But when I try paste(testframe$V1, testframe$V3, sep = "\") it gives me an error. How can I use the \ in this case?
If the subscripts will always be digits 0 through 9, you can index into a vector of Unicode subscript digits:
# Unicode subscript digits 0-9; subscripts[i + 1] is the subscript for digit i
subscripts <- c(
  "\u2080", "\u2081", "\u2082", "\u2083", "\u2084",
  "\u2085", "\u2086", "\u2087", "\u2088", "\u2089"
)
testframe$V1 <- paste0(
  testframe$V1,
  ifelse(
    is.na(testframe$V2),
    "",                               # keep V1 unchanged when V2 is NA
    subscripts[testframe$V2 + 1]      # +1 because subscripts[1] is the digit 0
  )
)
testframe
V1 V2
1 A34₁ 1
2 B21₃ 3
3 C64₂ 2
4 D83₂ 2
5 E92₃ 3
6 F24 NA
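If you do want to keep the codes in a V3 column as in the question, note that escapes like "\u2081" are resolved by the R parser in source code, not at run time, so you cannot build them by pasting a backslash onto "u2081". A sketch (starting from the original testframe, base R only) that converts the stored codes with strtoi() and intToUtf8():
# Codes like "u2081" are stored without the backslash; parse the hex part and
# convert it to the actual Unicode character at run time.
hex <- sub("^u", "", testframe$V3)                            # "2081", "2083", ..., NA
chars <- intToUtf8(strtoi(hex, base = 16L), multiple = TRUE)  # "₁", "₃", ..., NA
testframe$V1 <- paste0(testframe$V1, ifelse(is.na(chars), "", chars))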
Related
Suppose I have a dataframe as such,
df = data.frame ( a = c(1,14,15,11) , b= c("xxxchrxxx","xxxchryy","zzchrzz","aachraa") )
a b
1 1 xxxchrxxx
2 14 xxxchryy
3 15 zzchrzz
4 11 aachraa
What I want is to replace chr in column b with chrx, where x is derived from column a:
a b
1 1 xxxchr1xxx
2 14 xxxchr14yy
3 15 zzchr15zz
4 11 aachr11aa
However, I can't get gsub to work since it expects a single replacement element:
df$b = gsub ( "chr",paste0("chr",df$a), df$b)
Any way to do this?
The reason is that gsub's replacement argument takes only a vector of length 1. According to ?gsub:
replacement - if a character vector of length 2 or more is supplied, the first element is used with a warning.
If you need a vectorized replacement, use str_replace:
library(stringr)
str_replace(df$b, "chr", paste0("chr", df$a))
#[1] "xxxchr1xxx" "xxxchr14yy" "zzchr15zz" "aachr11aa"
Based on the example, it is just a simple paste:
df$b <- with(df, paste0(b, a))
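If you prefer to stay in base R and replace chr where it occurs (rather than pasting at the end), a vectorized sketch is to map sub() over the rows with mapply; this is an illustration, not part of the original answers:
# apply sub() row by row so each row gets its own replacement string
df$b <- mapply(function(string, repl) sub("chr", repl, string),
               df$b, paste0("chr", df$a),
               USE.NAMES = FALSE)
df$b
#[1] "xxxchr1xxx" "xxxchr14yy" "zzchr15zz" "aachr11aa"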
EDIT: With stringr:
stringr::str_replace_all(df$b,"chr",paste0("chr",df$a))
Continuing with paste0:
df$b <- paste0(df$b, df$a)
a b
1 1 chr1
2 14 chr14
3 15 chr15
4 11 chr11
df = data.frame ( a = c(1,14,15,11) , b= c("chr","chr","chr","chr") )
df$b <- paste0(df$b, df$a)
df
#> a b
#> 1 1 chr1
#> 2 14 chr14
#> 3 15 chr15
#> 4 11 chr11
Created on 2019-02-22 by the reprex package (v0.2.1)
I have a text file in which data is stored as given below:
{{2,3,4},{1,3},{4},{1,2} .....}
I want to remove the brackets and convert it to a two-column format, where the first column is the bracket (group) number followed by the term:
1 2
1 3
1 4
2 1
2 3
3 4
4 1
4 2
So far I have read the file:
tab <- read.table("test.txt",header=FALSE,sep="}")
This gives a dataframe
V1 V2 V3 V4
1 {{2,3,4 {1,3 {4 {1,2 .....
How do I proceed?
We read the file with readLines, remove the {} with strsplit, convert the result to a two-column dataframe with an index, and reshape it to 'long' format with separate_rows:
library(tidyverse)
v1 <- setdiff(unlist(strsplit(lines, "[{}]")), c("", ","))
tibble(index = seq_along(v1), Col = v1) %>%
  separate_rows(Col, convert = TRUE)
# A tibble: 8 x 2
# index Col
# <int> <int>
#1 1 2
#2 1 3
#3 1 4
#4 2 1
#5 2 3
#6 3 4
#7 4 1
#8 4 2
Or, a base R method would be to replace the , after each } with another delimiter, split by , into a list, and stack it into a two-column data.frame:
v1 <- scan(text=gsub("[{}]", "", gsub("},", ";", lines)), what = "", sep=";", quiet = TRUE)
stack(setNames(lapply(strsplit(v1, ","), as.integer), seq_along(v1)))[2:1]
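For reference, running the base R version on the sample line defined under data below should print something like this (the ind and values column names come from stack()):
  ind values
1   1      2
2   1      3
3   1      4
4   2      1
5   2      3
6   3      4
7   4      1
8   4      2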
data
lines <- readLines(textConnection("{{2,3,4},{1,3},{4},{1,2}}"))
#reading from file
lines <- readLines("yourfile.txt")
Data:
tab <- read.table(text=' V1 V2 V3 V4
1 {{2,3,4 {1,3 {4 {1,2
2 {{2,3,4 {1,3 {4 {1,2 ')
Code: using gsub, remove { and split the string by , to make a data frame; the column names are removed. Finally, the list of data frames in df1 is combined with rbindlist.
df1 <- lapply(seq_along(tab), function(x) {
  temp <- data.frame(x, strsplit(gsub("{", "", tab[[x]], fixed = TRUE), split = ","),
                     stringsAsFactors = FALSE)
  colnames(temp) <- NULL
  temp
})
Output:
data.table::rbindlist(df1)
# V1 V2 V3
# 1: 1 2 2
# 2: 1 3 3
# 3: 1 4 4
# 4: 2 1 1
# 5: 2 3 3
# 6: 3 4 4
# 7: 4 1 1
# 8: 4 2 2
Let's say I have this txt file:
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
"ZZ",2
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
With read.csv I can:
> read.csv("linktofile.txt", fill=T, header=F)
V1 V2 V3 V4 V5 V6 V7
1 AA 3 3 3 3 NA NA
2 CC ad 2 2 2 2 2
3 ZZ 2 NA NA NA NA NA
4 AA 3 3 3 3 NA NA
5 CC ad 2 2 2 2 2
However fread gives
> library(data.table)
> fread("linktofile.txt")
V1 V2 V3 V4 V5 V6 V7
1: CC ad 2 2 2 2 2
Can I get the same result with fread?
Major update
It looks like development plans for fread changed and fread has now gained a fill argument.
Using the same sample data from the end of this answer, here's what I get:
library(data.table)
packageVersion("data.table")
# [1] ‘1.9.7’
fread(x, fill = TRUE)
# V1 V2 V3 V4 V5 V6 V7
# 1: AA 3 3 3 3 NA NA
# 2: CC ad 2 2 2 2 2
# 3: ZZ 2 NA NA NA NA NA
# 4: AA 3 3 3 3 NA NA
# 5: CC ad 2 2 2 2 2
Install the development version of "data.table" with:
install.packages("data.table",
repos = "https://Rdatatable.github.io/data.table",
type = "source")
Original answer
This doesn't answer your question about fread; that question has already been addressed by @Matt.
It does, however, give you an alternative to consider that should give you good speed improvements over base R's read.csv.
Unlike with fread, you will have to help these functions out a little by providing some information about the data you are trying to read.
You can use the input.file function from "iotools". By specifying the column types, you can tell the formatter function how many columns to expect.
library(iotools)
input.file(x, formatter = dstrsplit, sep = ",",
           col_types = rep("character", max(count.fields(x, ","))))
Sample data
x <- tempfile()
myvec <- c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2', '"AA",3,3,3,3', '"CC","ad",2,2,2,2,2')
cat(myvec, file = x, sep = "\n")
## Uncomment for bigger sample data
## cat(rep(myvec, 200000), file = x, sep = "\n")
Not currently; I wasn't aware of read.csv's fill feature. The plan was to add the ability to read dual-delimited files (sep2 as well as sep, as mentioned in ?fread). Then variable-length vectors could be read into a list column where each cell is itself a vector, but not padded with NA.
Could you add it to the list please? That way you'll get notified when its status changes.
Are there many irregular data formats like this out there? I only recall ever seeing regular files, where the incomplete lines would be considered an error.
UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) when sep2 is implemented, not filled into separate columns as read.csv can do.
I would like to merge multiple data.frame in R using row.names, doing a full outer join. For this I was hoping to do the following:
x = as.data.frame(t(data.frame(a=10, b=13, c=14)))
y = as.data.frame(t(data.frame(a=1, b=2)))
z = as.data.frame(t(data.frame(a=3, b=4, c=3, d=11)))
res = Reduce(function(a,b) merge(a,b,by="row.names",all=T), list(x,y,z))
Warning message:
In merge.data.frame(a, b, by = "row.names", all = T) :
column name ‘Row.names’ is duplicated in the result
> res
Row.names Row.names V1.x V1.y V1
1 1 a 10 1 NA
2 2 b 13 2 NA
3 3 c 14 NA NA
4 a <NA> NA NA 3
5 b <NA> NA NA 4
6 c <NA> NA NA 3
7 d <NA> NA NA 11
What I was hoping to get would be:
V1 V2 V3
a 10 1 3
b 13 2 4
c 14 NA 3
d NA NA 11
The following works (up to some final column renaming):
res <- Reduce(function(a, b) {
  ans <- merge(a, b, by = "row.names", all = TRUE)
  row.names(ans) <- ans[, "Row.names"]
  ans[, !names(ans) %in% "Row.names"]
}, list(x, y, z))
Indeed:
> res
V1.x V1.y V1
a 10 1 3
b 13 2 4
c 14 NA 3
d NA NA 11
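To also get the V1/V2/V3 names shown in the desired output, one final renaming step could be (assuming you simply want sequential V names):
# rename the merged columns sequentially to match the desired layout
names(res) <- paste0("V", seq_along(res))
res
#   V1 V2 V3
# a 10  1  3
# b 13  2  4
# c 14 NA  3
# d NA NA 11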
What happens with a join on row names is that an extra column (Row.names) holding the original row names is added to the result, which itself does not keep those row names:
> merge(x,y,by="row.names",all=T)
Row.names V1.x V1.y
1 a 10 1
2 b 13 2
3 c 14 NA
This behavior is documented in ?merge (under Value)
If the matching involved row names, an extra character column called
Row.names is added at the left, and in all cases the result has
‘automatic’ row names.
When Reduce tries to merge again, it doesn't find any match unless the names are cleaned up manually.
For continuity: this is not a clean solution but a workaround; I transform the list argument of Reduce using sapply.
Reduce(function(a,b) merge(a,b,by=0,all=T),
sapply(list(x,y,z),rbind))[,-c(1,2)]
x y.x y.y
1 10 1 3
2 13 2 4
3 14 NA 3
4 NA NA 11
Warning message:
In merge.data.frame(a, b, by = 0, all = T) :
column name ‘Row.names’ is duplicated in the result
For some reason I did not have much success with Reduce. Given a list of data.frames (df.lst) and a list of suffixes (suff.lst) to rename identical columns, this is my solution (it's a loop, I know it's ugly by R standards, but it works):
df.merg <- as.data.frame(df.lst[[1]])
colnames(df.merg)[-1] <- paste0(colnames(df.merg)[-1], suff.lst[[1]])
for (i in 2:length(df.lst)) {
  df.i <- as.data.frame(df.lst[[i]])
  colnames(df.i)[-1] <- paste0(colnames(df.i)[-1], suff.lst[[i]])
  # merge on the first column, assuming it holds the shared key
  # (by.x = "" / by.y = "" in the original is not a valid merge specification)
  df.merg <- merge(df.merg, df.i, by = 1, all = TRUE)
}