Add leading space to certain values in a data frame - r

I have the following data frame, and for each positive number (yes, they need to be stored as strings) I want to add a leading space.
d <- data.frame(c1 = c("4", "-1.5", "5", "-3"))
> d
    c1
1    4
2 -1.5
3    5
4   -3
So far I used grep and invert to return only the positive numbers which I want to add a leading space to:
d$c1[grep("-", d$c1, invert = TRUE)]
However, I am not sure how to proceed. I think I have to work with indices rather than with the actual numbers, and probably incorporate gsub? Is that right?

Here is an approach using formatC(). Similar results could be achieved using sprintf(). Note that I don't just add a single space; instead, this approach pads each string to a common maximum width.
d <- data.frame(c1 = c("4", "-1.5", "5", "-3"), stringsAsFactors = FALSE)
d <- transform(d, d2 = formatC(c1, width = 4), stringsAsFactors = FALSE)
R> d
    c1   d2
1    4    4
2 -1.5 -1.5
3    5    5
4   -3   -3
R> str(d)
'data.frame': 4 obs. of 2 variables:
 $ c1: chr "4" "-1.5" "5" "-3"
 $ d2: chr "   4" "-1.5" "   5" "  -3"
If you don't know ahead of time what the width argument should be, compute it from d$c1:
R> with(d, max(nchar(as.character(c1))))
[1] 4
Or use it directly inline
d <- transform(d, d2 = formatC(c1, width = max(nchar(as.character(c1)))),
               stringsAsFactors = FALSE)
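A minimal sketch of the sprintf() route mentioned above; the d3 column name is just for illustration. The format "%4s" right-justifies each string in a field of width 4 (the width of the longest value, "-1.5"), and the width can also be computed from the data:
d$d3 <- sprintf("%4s", d$c1)
d$d3
# [1] "   4" "-1.5" "   5" "  -3"
# build the format string from the data when the width is not known in advance
w <- max(nchar(d$c1))
d$d3 <- sprintf(paste0("%", w, "s"), d$c1)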

Does paste(' ', d[d[, 1] > 0, ]) look like what you want?

The print method for data.frames has nice automated padding features. In general, the strings are padded on the left with spaces to ensure right alignment (by default). You can take advantage of this by capturing the print output. For example, using your d:
> print(d, print.gap = 0, row.names = FALSE)
  c1
   4
-1.5
   5
  -3
The argument print.gap = 0 ensures that there are no extra padding spaces in front of the longest string. row.names = FALSE prevents row names from being printed.
This case is special in a couple of ways: the column name is shorter than the longest character string in the data, and the data.frame is only one column. To generalize, you could subset the data and unname it:
myChar <- unname(d[, 1, drop = FALSE])
Then, you can capture the printed object using capture.output:
> (dStr <- capture.output(print(myChar, print.gap = 0, row.names = FALSE)))
[1] "  NA" "   4" "-1.5" "   5" "  -3"
Since the column name is also printed, you could subset the object thusly:
> dStr[-1]
[1] "   4" "-1.5" "   5" "  -3"
This way, you don't have to know how long the longest character string is, and this can handle most data types, not just character.
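If the goal is to store the padded values back in the data frame, the captured strings (minus the header element) can simply be assigned to a new column; a small usage sketch, with c1_padded as an illustrative name:
d$c1_padded <- dStr[-1]
str(d$c1_padded)
# chr [1:4] "   4" "-1.5" "   5" "  -3"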


How to order from highest to lowest in R

My goal is to order multiple variables in a list in R:
"+5x^{5}" "-2x^{3}" "5x^{7}" "0" "1"
I want to get this order:
"5x^{7}" "+5x^{5}" "-2x^{3}" "1" "0"
So exponents from highest to lowest, then numerical order for the numbers.
How can I achieve this? Decreasing order for the plain numbers alone is clear. For the exponents, it would be necessary to detect whether there is an x in the string, extract the exponent, and order based on it. But I don't know how to do it.
A more verbose option is to extract the exponents and the multipliers and then use arrange. This has the advantage of having these numbers ready if you need to use them.
library(stringr)
library(dplyr)
dat <- data.frame(x = c("+5x^{5}", "-2x^{3}", "5x^{7}", "0", "1"))
dat |> mutate(m = as.numeric(str_match(x, "([+-]*\\d+)x\\^\\{(\\d)\\}")[, 2]),
              exp = as.numeric(str_match(x, "([+-]*\\d+)x\\^\\{(\\d)\\}")[, 3]),
              n = as.numeric(x)) |>
  arrange(desc(exp), desc(n))
Output
#>         x  m exp  n
#> 1  5x^{7}  5   7 NA
#> 2 +5x^{5}  5   5 NA
#> 3 -2x^{3} -2   3 NA
#> 4       1 NA  NA  1
#> 5       0 NA  NA  0
Created on 2022-06-13 by the reprex package (v2.0.1)
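If only the reordered terms are needed as a character vector (rather than the whole data frame), pull() can be appended to the pipeline; a small sketch:
dat |>
  mutate(exp = as.numeric(str_match(x, "([+-]*\\d+)x\\^\\{(\\d)\\}")[, 3]),
         n = as.numeric(x)) |>
  arrange(desc(exp), desc(n)) |>
  pull(x)
#> [1] "5x^{7}"  "+5x^{5}" "-2x^{3}" "1"       "0"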
Base R solution:
x[
  order(
    gsub(
      ".*\\{(\\d+)\\}.*",
      "\\1",
      x
    ),
    decreasing = TRUE
  )
]
Input data:
x <- c(
  "+5x^{5}",
  "-2x^{3}",
  "5x^{7}",
  "0",
  "1"
)
Well... it works
> x=c("+5x^{5}","-2x^{3}","5x^{7}","0","1")
> x[order(gsub("(.*\\^\\{)(.+)(\\}.*)","\\2",x),decreasing=T)]
[1] "5x^{7}" "+5x^{5}" "-2x^{3}" "1" "0"
the regex string (.*\\^\\{)(.+)(\\}.*) looks for three things:
(.*\\^\\{) searches for anything before ^{, this is the first split,
(.+) searches for anything inside curly brackets, second split,
(\\}.*) searches for anything after }, third split,
in the end it returns only \\2, the contents of the second split,
which is what we use to order the elements of the string vector.
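One caveat worth noting for both base R answers (this is not from the original posts): the extracted exponents are compared as character strings, which happens to work here but breaks once an exponent has more than one digit ("9" sorts after "10" lexicographically). Converting the extracted key to numeric avoids this; a sketch, with an extra 3x^{10} term added purely for illustration:
x <- c("+5x^{5}", "-2x^{3}", "5x^{7}", "3x^{10}", "0", "1")
key <- as.numeric(gsub(".*\\{(\\d+)\\}.*", "\\1", x))  # plain numbers pass through unchanged
x[order(key, decreasing = TRUE)]
# [1] "3x^{10}" "5x^{7}"  "+5x^{5}" "-2x^{3}" "1"       "0"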

Change a char value in a data column into zero?

I have a simple problem: I have a very long data frame in which 0 is recorded as the character string "nothing" in one of the columns. How would I replace all of these with a numeric 0? A sample data frame is below:
Group Candy
A     5
B     nothing
And this is what I want to change it into
Group Candy
A     5
B     0
Keeping in mind my actual dataset is 100s of rows long.
My own attempt was to use is.na, but apparently that only works for NA (and can convert those into zeros with ease); I wasn't sure if there's a solution for actual character values.
Thanks
The best way is to read the data in correctly, rather than with "nothing" for missing values. This can be done with the na.strings argument of read.table or read.csv. Then change the NAs to zero.
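A minimal sketch of that approach, assuming the data live in a file; "candy.txt" is a placeholder file name:
# treat "nothing" as NA while reading, then set the NAs to zero
df1 <- read.table("candy.txt", header = TRUE, na.strings = "nothing")
df1$Candy[is.na(df1$Candy)] <- 0
str(df1)
#'data.frame': 2 obs. of 2 variables:
# $ Group: chr "A" "B"
# $ Candy: num 5 0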
If re-reading the raw file is not an option, the following function (probably slow for large data frames) replaces the "nothing" values with zeros.
nothing_zero <- function(x){
  tc <- textConnection("nothing", "w")
  sink(tc)    # divert printed output to the tc connection
  print(x)    # print into the character object "nothing" instead of the console
  sink()      # set the output back to the console
  close(tc)   # close the write connection
  tc <- textConnection(nothing, "r")
  y <- read.table(tc, na.strings = "nothing", header = TRUE)
  close(tc)   # close the read connection
  y[is.na(y)] <- 0
  y
}
nothing_zero(df1)
# Group Candy
#1 A 5
#2 B 0
The main advantage is that the numeric data end up being read as numeric.
str(nothing_zero(df1))
#'data.frame': 2 obs. of 2 variables:
# $ Group: chr "A" "B"
# $ Candy: num 5 0
Data
df1 <- read.table(text = "
Group Candy
A 5
B nothing", header = TRUE)
sapply(df,function(x) {x <- gsub("nothing",0,x)})
Output
a
[1,] "0"
[2,] "5"
[3,] "6"
[4,] "0"
Data
df <- structure(list(a = c("nothing", "5", "6", "nothing")),
class = "data.frame",
row.names = c(NA,-4L))
Another option
df[] <- lapply(df, gsub, pattern = "nothing", replacement = "0", fixed = TRUE)
If you only want to apply this to one column:
library(tidyverse)
df$a <- str_replace(df$a,"nothing","0")
Or applying to one column in base R
df$a <- gsub("nothing","0",df$a)
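Since the question asks for a numeric 0 rather than the string "0", a follow-up that can be bolted onto any of the gsub/str_replace answers is to coerce the cleaned column afterwards; a small sketch using the df defined above:
df$a <- as.numeric(gsub("nothing", "0", df$a))
str(df$a)
# num [1:4] 0 5 6 0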

R version of Vlookup with multiple matches per cell

I have a vector with numbers and a lookup table. I want the numbers replaced by the description from the lookup table.
This is easy when the vectors are straightforward, as in this example:
> variable <- sample(1:5, 10, replace=T)
> variable
[1] 5 4 5 3 2 3 2 3 5 2
>
> lookup <- data.frame(var = 1:5, description=LETTERS[1:5])
> lookup
var description
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
>
> with(lookup, description[match(variable, var)])
[1] E D E C B C B C E B
Levels: A B C D E
However, when single elements of the vector contain multiple outcomes, I get into trouble:
variable <- c("1", "2^3", "1^5", "4", "4")
I would like the vector returned to give:
c("A", "B^C", "A^E", "D", "D")
If every match and replacement is a single character, you can use chartr:
chartr(paste0(lookup$var, collapse = ""),
       paste0(lookup$description, collapse = ""), variable)
#[1] "A" "B^C" "A^E" "D" "D"
chartr basically says: replace
paste0(lookup$var, collapse = "")
#[1] "12345"
with
paste0(lookup$description, collapse = "")
#[1] "ABCDE"
It is also useful since it leaves characters which do not match unchanged, rather than returning NA for them.
As mentioned in the comments, there are a couple of steps needed to achieve the desired output. The following splits your variable, indexes the results against the description variable and then uses paste to collapse multiple elements.
sapply(strsplit(variable, "\\^"), function(x) paste0(lookup$description[as.numeric(x)], collapse = "^"))
[1] "A" "B^C" "A^E" "D" "D"
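A variant of the same idea (not from the answer above) that does not assume the codes in lookup$var equal their row positions; it uses match, as in the question's own example:
sapply(strsplit(variable, "\\^"), function(p) {
  paste0(lookup$description[match(as.numeric(p), lookup$var)], collapse = "^")
})
# [1] "A"   "B^C" "A^E" "D"   "D"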
You can use scan to parse the text into numbers, which can then be used as indices to pick items that are then collapsed back together. Add quiet = TRUE to suppress the "Read" messages.
sapply(variable, function(t) {
  paste(lookup$description[scan(text = t, sep = "^")], collapse = "^")
})
Read 1 item
Read 2 items
Read 2 items
Read 1 item
Read 1 item
1 2^3 1^5 4 4
"A" "B^C" "A^E" "D" "D"

Splitting a Large Data File in R using Strsplit and R Connection

Hi, I am trying to read a large data file into R. It is a tab-delimited file, but the first two columns are filled with multiple pieces of data separated by a "|". The file looks like:
A|1 B|2 0.5 0.4
C|3 D|4 0.9 1
I only care about the first values in both the first and second columns, as well as the third and fourth columns. In the end I want a vector for each line that looks like:
A B 0.5 0.4
I am using a connection to read in the file:
con <- file("inputfile.txt", open = "r")
lines <- readLines(con)
which gives me:
lines[1]
[1] "A|1\tB|2\t0.5\t0.4"
then I am using strsplit to split the tab delimited file:
linessplit <- strsplit(lines, split="\t")
which gives me:
linessplit[1]
[[1]]
[1] "A|1" "B|2" "0.5" "0.4"
When I try the following to split "A|1" into "A" "1":
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
I get:
"Error in strsplit(line1[1], split = "|") : non-character argument"
Does anyone have a way in which I can fix this?
Thanks!
Since you provided an approach, I'll explain the errors in the code, even though you may want to consider another approach for your problem.
Putting aside personal tastes about code, the problems are:
you have to extract the first element of the list with double brackets
line1[[1]]
the split argument accepts regular expressions. If you supply |, which is a metacharacter, it won't be
read as is. You must escape it with \\| or (as suggested by @nongkrong) use the fixed = T argument, which matches strings exactly as they are (that is, without their meaning as metacharacters).
The final code is l1 <- strsplit(line1[[1]], split = "\\|")
As a final personal consideration, you might take into consideration an lapply solution:
lapply(linessplit, strsplit, split = "|", fixed = T)
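Building on the fixed = TRUE idea, a hedged sketch of the whole task: split each line on tabs, then keep only the text before "|" in every field (fields without a "|", such as the numeric ones, are returned unchanged by this step). The two example lines come from the question:
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1")
fields <- strsplit(lines, split = "\t", fixed = TRUE)
cleaned <- lapply(fields, function(f) {
  sapply(strsplit(f, split = "|", fixed = TRUE), `[`, 1)
})
cleaned[[1]]
# [1] "A"   "B"   "0.5" "0.4"
cleaned[[2]]
# [1] "C"   "D"   "0.9" "1"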
Here is my solution to your original problem, which is to split the lines
"A|1\tB|2\t0.5\t0.4"
"C|3\tD|4\t0.9\t1"
into
A B 0.5 0.4
C D 0.9 1
Below is my code:
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1:4))
linessplit
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
  tmp <- t(colsplit(x, pattern=pat, names=nam))
  tmp[sel,]
}
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Let's break it down:
Read original data into lines
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
Results:
[1] "A|1\tB|2\t0.5\t0.4" "C|3\tD|4\t0.9\t1" "E|5\tF|6\t0.7\t0.2"
Load the reshape2 library to get the colsplit function, then use it with pattern "\t" to split the lines into 4 columns named 1, 2, 3, 4.
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1,2,3,4))
linessplit
Results:
1 2 3 4
1 A|1 B|2 0.5 0.4
2 C|3 D|4 0.9 1.0
3 E|5 F|6 0.7 0.2
Let's make a function that takes a row, splits it into rows, and selects the row we want.
Pass the first row of linessplit to colsplit:
tmp <- colsplit(linessplit[1,], pattern="\\|", names=c(1:2))
tmp
Results:
1 2
1 A 1
2 B 2
3 0.5 NA
4 0.4 NA
Take transpose
tmp <- t(colsplit(linessplit[1,], pattern="\\|", names=c(1:2)))
tmp
Results:
[,1] [,2] [,3] [,4]
1 "A" "B" "0.5" "0.4"
2 " 1" " 2" NA NA
Select first row:
tmp[1,]
Results:
[1] "A" "B" "0.5" "0.4"
Make the above steps into a function, split_n_select:
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
  tmp <- t(colsplit(x, pattern=pat, names=nam))
  tmp[sel,]
}
Use sapply to apply the function split_n_select to each column of linessplit (colsplit handles all the rows of a column at once):
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Results:
1 2 3 4
[1,] "A" "B" "0.5" "0.4"
[2,] "C" "D" "0.9" "1"
[3,] "E" "F" "0.7" "0.2"
You can also select the second row by adding sel=c(2)
linessplit2 <- sapply(linessplit, split_n_select, sel=c(2))
linessplit2
Results:
1 2 3 4
[1,] "1" "2" NA NA
[2,] "3" "4" NA NA
[3,] "5" "6" NA NA
Change
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
to
line1 <- linessplit[[1]]               # use [[ ]] so line1 is a character vector, not a list
l1 <- strsplit(line1, split = "[|]")   # square brackets added: inside [ ], "|" is literal

read.table with variables and numbers in same column

Is it possible to use read.table to read a data set in which variables and numbers appear in the same column? Here is an example that does not work, followed by a matrix call that does work:
aa <- 10
bb <- 20
my.data1 <- read.table(text = '
aa 0
0 bb
', header = FALSE, stringsAsFactors = FALSE)
my.data1
# matrix returns desired output
my.data2 <- matrix(c( aa, 0,
0, bb ), nrow = 2, ncol = 2, byrow = TRUE)
my.data2
# [,1] [,2]
# [1,] 10 0
# [2,] 0 20
My guess is that text converts aa and bb to text, but can this be avoided while still using read.table?
str(my.data1)
'data.frame': 2 obs. of 2 variables:
$ V1: chr "aa" "0"
$ V2: chr "0" "bb"
The text argument to the read.table function doesn't convert your input to a character string, it's the (single) quotes that create a string literal or string constant as the R documentation for quotes (do ?Quotes in R) explains:
Single and double quotes delimit character constants. They can be used interchangeably but double quotes are preferred (and character constants are printed using double quotes), so single quotes are normally only used to delimit character constants containing double quotes.
The aa and bb in the matrix command are symbols in R as they are not quoted and the interpreter will find their values in the global environment.
So to answer your question: no, it's not possible to do this unless you use a function like sprintf to substitute the values of aa and bb into the string. For example:
my.data1 <- read.table(text = sprintf('
%f 0
0 %f', aa, bb), header = FALSE, stringsAsFactors = FALSE)
But this is not something that would be very common to do!
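A hedged workaround along the same lines (not from the answer above): read everything as character first, then resolve any entries that name an object in the global environment and coerce the rest to numeric. The resolve helper below is purely illustrative:
aa <- 10
bb <- 20
raw <- read.table(text = '
aa 0
0 bb
', header = FALSE, stringsAsFactors = FALSE)

resolve <- function(s) {
  if (exists(s, envir = globalenv(), inherits = FALSE)) {
    get(s, envir = globalenv())   # an entry such as "aa" names a variable: use its value
  } else {
    as.numeric(s)                 # anything else is treated as a plain number
  }
}
resolved <- as.data.frame(lapply(raw, function(col) {
  vapply(col, resolve, numeric(1), USE.NAMES = FALSE)
}))
resolved
#   V1 V2
# 1 10  0
# 2  0 20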
