read.table with variables and numbers in same column - r

Is it possible to use read.table to read a data set in which variable names and numbers appear in the same column? Here is an example that does not work, followed by a matrix call that does work:
aa <- 10
bb <- 20
my.data1 <- read.table(text = '
aa 0
0 bb
', header = FALSE, stringsAsFactors = FALSE)
my.data1
# matrix returns desired output
my.data2 <- matrix(c( aa, 0,
0, bb ), nrow = 2, ncol = 2, byrow = TRUE)
my.data2
# [,1] [,2]
# [1,] 10 0
# [2,] 0 20
My guess is that text converts aa and bb to text, but can this be avoided while still using read.table?
str(my.data1)
'data.frame': 2 obs. of 2 variables:
$ V1: chr "aa" "0"
$ V2: chr "0" "bb"

The text argument to the read.table function doesn't convert your input to a character string; it's the (single) quotes that create a string literal, or string constant, as the R documentation for quotes (see ?Quotes in R) explains:
Single and double quotes delimit character constants. They can be used interchangeably but double quotes are preferred (and character constants are printed using double quotes), so single quotes are normally only used to delimit character constants containing double quotes.
The aa and bb in the matrix command are symbols in R, as they are not quoted, and the interpreter will find their values in the global environment.
So to answer your question: no, it's not possible to do this unless you use a function like sprintf to substitute the values of aa and bb into the string. For example:
my.data1 <- read.table(text = sprintf('
%f 0
0 %f', aa, bb), header = FALSE, stringsAsFactors = FALSE)
But this is not something that would be very common to do!
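If building the string with sprintf feels awkward, another route is to read the table as text and resolve the names afterwards. The following is only a sketch: resolve_cell is a hypothetical helper (it does not exist in base R), and it assumes that every non-numeric cell is the name of an existing numeric variable.

```r
aa <- 10
bb <- 20

# Read everything as text first; the cells stay character strings.
raw <- read.table(text = "
aa 0
0 bb
", header = FALSE, colClasses = "character")

# resolve_cell is a hypothetical helper (not part of base R):
# numeric-looking cells are converted, any other cell is treated
# as the name of a numeric variable and looked up in `envir`.
resolve_cell <- function(cell, envir) {
  value <- suppressWarnings(as.numeric(cell))
  if (is.na(value)) get(cell, envir = envir, mode = "numeric") else value
}

env <- environment()  # the environment in which aa and bb are defined
my.data1 <- raw
my.data1[] <- lapply(raw, function(col) {
  vapply(col, resolve_cell, numeric(1), envir = env, USE.NAMES = FALSE)
})
str(my.data1)
```

A cell that is neither a number nor the name of a numeric variable makes get() stop with an error, which is probably the safest behaviour for a lookup like this.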

Related

Change a char value in a data column into zero?

I have a simple problem: I have a very long data frame which reports 0 as the character string "nothing" in one of its columns. How would I replace all of these with a numeric 0? A sample data frame is below:
Group  Candy
A      5
B      nothing
And this is what I want to change it into
Group  Candy
A      5
B      0
Keeping in mind my actual dataset is 100s of rows long.
My own attempt was to use is.na, but apparently it only works for NA values. It can convert those into zeros with ease, but I wasn't sure whether there is a solution for actual character values.
Thanks
The best way is to read the data in right, not with "nothing" for missing values. This can be done with argument na.strings of functions read.table or read.csv. Then change the NA's to zero.
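A minimal sketch of that advice, assuming the data can be re-read from its source (here the source is simulated with the text argument, using the sample data from the question):

```r
# Treat the string "nothing" as a missing value while reading,
# so that Candy is parsed as a number instead of as text.
df1 <- read.table(text = "
Group Candy
A 5
B nothing", header = TRUE, na.strings = "nothing")

# Replace the resulting NA values with zero.
df1[is.na(df1)] <- 0
str(df1)
```

With a file on disk the same na.strings argument can be passed to read.csv.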
The following function is probably slow for large data.frames but replaces the "nothing" values by zeros.
nothing_zero <- function(x){
  tc <- textConnection("nothing", "w")
  sink(tc) # divert output to tc connection
  print(x) # print into the character vector "nothing" instead of the console
  sink() # set the output back to console
  close(tc) # close connection
  tc <- textConnection(nothing, "r")
  y <- read.table(tc, na.strings = "nothing", header = TRUE)
  close(tc) # close connection
  y[is.na(y)] <- 0
  y
}
nothing_zero(df1)
#  Group Candy
#1     A     5
#2     B     0
The main advantage is to read numeric data as numeric.
str(nothing_zero(df1))
#'data.frame': 2 obs. of 2 variables:
# $ Group: chr "A" "B"
# $ Candy: num 5 0
Data
df1 <- read.table(text = "
Group Candy
A 5
B nothing", header = TRUE)
sapply(df,function(x) {x <- gsub("nothing",0,x)})
Output
a
[1,] "0"
[2,] "5"
[3,] "6"
[4,] "0"
Data
df <- structure(list(a = c("nothing", "5", "6", "nothing")),
class = "data.frame",
row.names = c(NA,-4L))
Another option
df[] <- lapply(df, gsub, pattern = "nothing", replacement = "0", fixed = TRUE)
If you are only wanting to apply to one column
library(tidyverse)
df$a <- str_replace(df$a,"nothing","0")
Or applying to one column in base R
df$a <- gsub("nothing","0",df$a)
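Note that gsub and str_replace return character vectors, so the column still holds the string "0" rather than a number. If a numeric column is wanted, as the question asks, a conversion step is needed afterwards. A short sketch using the same sample data:

```r
df <- data.frame(a = c("nothing", "5", "6", "nothing"),
                 stringsAsFactors = FALSE)

# Replace the placeholder by exact comparison, then convert the column.
df$a[df$a == "nothing"] <- "0"
df$a <- as.numeric(df$a)
str(df)
```

Any other non-numeric string left in the column would become NA with a warning, which makes leftover placeholders easy to spot.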

Using a list of characters in a function? [duplicate]

This question already has answers here:
Convert comma separated string to integer in R
(3 answers)
Closed 1 year ago.
I am using a function where Timepoints need to be defined as
Timepoints = c(x,y,z)
Now I have a chr list
List
$ chr: "1,2,3,4,5,6,7"
with the timepoints I need to use, already separated by commas.
I want to use this list in the function and lose the quotation marks, so the function can read my timepoints as
Timepoints= c(1,2,3,4,5,6,7)
I tried using noquote(List), but this is not accepted.
Am I missing something? Printing the list with noquote() results in the desired line of characters 1,2,3,4,5,6,7
1) Base R - scan. Assuming that you have a list containing a single character string, as shown in L below, use scan as shown.
L <- list("1,2,3,4,5,6")
scan(text = L[[1]], sep = ",", quiet = TRUE)
## [1] 1 2 3 4 5 6
2) gsubfn::strapply. Another possibility is to use strapply to match each string of digits, convert them to numeric and return the result as a vector. (We assume that the numbers have no signs or decimal points, but that could readily be added if needed.)
library(gsubfn)
strapply(L[[1]], "\\d+", as.numeric, simplify = unlist)
## [1] 1 2 3 4 5 6
Added
In a comment the poster indicated an interest in having a list of character strings as input. The output was not specified but if we assume we want a list of numeric vectors then
L2 <- list(A = "1,2,3,4,5,6", B = "1,2")
Scan <- function(x) scan(text = x, sep = ",", quiet = TRUE)
lapply(L2, Scan)
## $A
## [1] 1 2 3 4 5 6
##
## $B
## [1] 1 2
library(gsubfn)
strapply(L2, "\\d+", as.numeric)
## $A
## [1] 1 2 3 4 5 6
##
## $B
## [1] 1 2
Here is an option with strsplit.
as.integer(unlist(strsplit(L[[1]], ",")))
#[1] 1 2 3 4 5 6
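To tie this back to the question: noquote() only changes how a character vector is printed, so the value is still a single string. A short sketch of the full round trip, assuming the list looks like the one described in the question (the element name chr is taken from the printed structure there):

```r
List <- list(chr = "1,2,3,4,5,6,7")

# noquote() affects printing only; the underlying value is still text.
is.character(noquote(List$chr))

# Split on the commas and convert the pieces to numbers.
Timepoints <- as.numeric(strsplit(List$chr, ",", fixed = TRUE)[[1]])
Timepoints
```

Timepoints is now an ordinary numeric vector and can be passed wherever c(1,2,3,4,5,6,7) would be accepted.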

Subset R list by partial string match of sublist element against character vector, using base R

My actual case is a list of combined header strings and corresponding data as sub-lists. I wish to subset the list to return a list of sub-lists, i.e. the same structure, that contains only the sub-lists whose header strings contain strings that match the strings in a character vector.
Test Data:
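lets <- letters
x <- c(1,4,8,11,13,14,18,22,24)
# rnd is not defined in the original post; 1:9 is an assumption,
# chosen because it is consistent with the output shown in the answers
rnd <- 1:9
ls <- list()
for (i in 1:9) {
  ls[[i]] <- list(hdr = paste(lets[x[i]:(x[i]+2)], collapse=""),
                  data = seq(1,rnd[i]))
}
filt <- c("bc", "lm", "rs", "xy")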
To produce a result list, as returned by:
logical_match <- c(T, F, F, T, F, F, T, F, T)
ls_result <- ls[logical_match]
So the function I seek is: ls_result <- fn(ls, filt)
I've looked at: subset list by dataframe; partial match with %in%; nested sublist by condition; subset list by logical condition; and, my favorite, extract sublist elements to array - this uses some neat purrr and dplyr solutions, but unfortunately these aren't viable, as I'm looking for a base R solution to make deployment more straightforward (I'd welcome extended R solutions, for interest, of course).
I'm guessing some variation of logical_match <- lapply(ls, fn, '$hdr', filt) is where I'm heading; I started with pmatch(), and wondered how to incorporate grep, but I'm struggling to see how to generate the logical_match vector.
Can someone set me on the right track, please?
EDIT:
When agrepl() is applied to the real data, this becomes trickier: the header string, hdr, may typically be 255 characters long, whilst a string element of the filter vector, filt, is of the order of 16 characters. The default agrepl() max.distance argument of 0.1 needs to be adjusted to somewhere between 0.94 and 0.96 for the example below, which is pretty tight. Even if I use the lower end of this range and apply it to the ~360 list elements, the function returns a load of total non-matches.
> hdr <- "#CCHANNELSDI12-PTx|*|CCHANNELNO2|*|CDASA1570|*|CDASANAMEShenachieBU_1570|*|CTAGSODATSID|*|CTAGKEYWISKI_LIVE,ShenachieBU_1570,SDI12-PTx,Highres|*|LAYOUT(timestamp,value)|*|RINVAL-777|*|RSTATEW6|*|RTIMELVLhigh-resolution|*|TZEtc/GMT|*|ZDATE20210110130805|*|"
> filt <- c("ShenachieBU_1570", "Pitlochry_4056")
> agrepl(hdr, filt, max.distance = 0.94)
[1] TRUE FALSE
You could do:
Filter(function(x)any(agrepl(x$hdr,filt)), ls)
You could reduce the code to:
Filter(function(x)grepl(paste0(filt, collapse = "|"), x$hdr), ls)
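If the filter strings should be matched literally rather than approximately or as regular expressions, the logical_match vector from the question can be built explicitly. The sketch below is one possible way, not the only one: filter_by_hdr is a hypothetical helper name, the test list is rebuilt with seq_len(i) as an assumed stand-in for the data element, and fixed = TRUE is used so that characters such as "|" or "*" in real headers are not interpreted as regex syntax. Note that grepl() and agrepl() both take the pattern as the first argument and the text to search as the second.

```r
# Rebuild test data similar to the question (data = seq_len(i) is an assumption).
x <- c(1, 4, 8, 11, 13, 14, 18, 22, 24)
ls_in <- lapply(seq_along(x), function(i) {
  list(hdr = paste(letters[x[i]:(x[i] + 2)], collapse = ""),
       data = seq_len(i))
})
filt <- c("bc", "lm", "rs", "xy")

# filter_by_hdr is a hypothetical helper: keep the sub-lists whose hdr
# contains at least one of the filter strings as a literal substring.
filter_by_hdr <- function(lst, patterns) {
  logical_match <- vapply(lst, function(el) {
    any(vapply(patterns, grepl, logical(1), x = el$hdr, fixed = TRUE))
  }, logical(1))
  lst[logical_match]
}

ls_result <- filter_by_hdr(ls_in, filt)
vapply(ls_result, function(el) el$hdr, character(1))
```

Because the match is an exact substring test, no max.distance tuning is needed for long header strings.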
We can also do
library(purrr)
library(stringr)
keep(ls, ~ str_detect(.x$hdr, str_c(filt, collapse = "|")))
-output
#[[1]]
#[[1]]$hdr
#[1] "abc"
#[[1]]$data
#[1] 1
#[[2]]
#[[2]]$hdr
#[1] "klm"
#[[2]]$data
#[1] 1 2 3 4
#[[3]]
#[[3]]$hdr
#[1] "rst"
#[[3]]$data
#[1] 1 2 3 4 5 6 7
#[[4]]
#[[4]]$hdr
#[1] "xyz"
#[[4]]$data
#[1] 1 2 3 4 5 6 7 8 9

How to find patterns between sets of strings in R?

I'm trying to find patterns in a set of strings as the following example:
"2100780D001378FF01E1000000040000--------01A456000000------------"
"3100782D001378FF03E1008100040000--------01A445800000------------"
If I use the standard get_pattern from the bpa library, since it looks individually to every string I will get
"9999999A999999AA99A9999999999999--------99A999999999------------"
But my idea would be to find something like:
"X10078XD001378FF0XE100XX00040000--------01A4XXX00000------------"
The main objective is to find the set of strings with the most similar "pattern"
My first idea was to calculating the hamming distance between them and then analyzing the groups resulting from this distance but it gets tedious. Is there any "automatic" approach?
Any idea of how I can accomplish this mission?
For your sample data, the code below works; I have no idea how it scales to production.
library( data.table )
#sample data
data <- data.table( name = c("2100780D001378FF01E1000000040000--------01A456000000------------",
"3100782D001378FF03E1008100040000--------01A445800000------------"))
# name
# 1: 2100780D001378FF01E1000000040000--------01A456000000------------
# 2: 3100782D001378FF03E1008100040000--------01A445800000------------
#use data.table::tstrsplit() to split the string to individual characters
l <- lapply( data.table::tstrsplit( data$name, ""), function(x) {
  #if the same character appears in all strings at the same position, return the character, else return 'X'
  if ( length( unique( x ) ) == 1 ) as.character(x[1]) else "X"
})
#paste it all together
paste0(l, collapse = "")
# [1] "X10078XD001378FF0XE100XX00040000--------01A4XXX00000------------"
small explanation
data.table::tstrsplit( data$name, "") returns the following list
[[1]]
[1] "2" "3"
[[2]]
[1] "1" "1"
[[3]]
[1] "0" "0"
etc...
Using lapply(), you can loop over this list, determining the length of the vector of unique elements. If this length == 1, then the same character exists in all strings at this position, so return the character.
If the length > 1, then multiple characters appear at this position in different strings, so return "X".
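The same position-by-position idea can be sketched in base R without data.table. This assumes that all strings have the same length; the variable names are chosen for illustration only.

```r
strings <- c("2100780D001378FF01E1000000040000--------01A456000000------------",
             "3100782D001378FF03E1008100040000--------01A445800000------------")

# One row per string, one column per character position.
chars <- do.call(rbind, strsplit(strings, ""))

# TRUE where every string has the same character at that position.
same <- apply(chars, 2, function(col) length(unique(col)) == 1)

# Keep shared characters, mark differing positions with "X".
pattern <- paste(ifelse(same, chars[1, ], "X"), collapse = "")
pattern
```

For strings of unequal length rbind would recycle characters, so an explicit length check is advisable before using this on real data.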
Update
If you are after the Hamming distances, use the stringdist package
library(stringdist)
m <- stringdist::stringdistmatrix(a = data$name, b = data$name, method = "hamming")
# [,1] [,2]
# [1,] 0 8
# [2,] 8 0
#to get the minimum value for each row, exclude the diagonal first (by making it NA)
# and then find the position with the minimum value
diag(m) <- NA
apply( m, 1, which.min )
# [1] 2 1
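For reference, the Hamming distance itself is simply the number of positions at which two equal-length strings differ. A base R sketch, with hamming as a hypothetical helper name (it is not a stringdist function):

```r
# hamming is a hypothetical helper: count differing positions
# between two strings of equal length.
hamming <- function(a, b) {
  ca <- utf8ToInt(a)
  cb <- utf8ToInt(b)
  stopifnot(length(ca) == length(cb))
  sum(ca != cb)
}

s1 <- "2100780D001378FF01E1000000040000--------01A456000000------------"
s2 <- "3100782D001378FF03E1008100040000--------01A445800000------------"
hamming(s1, s2)
```

For the two sample strings this gives the same value as the off-diagonal entries of the matrix above.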
Here is a base R solution, where a custom function findPat is defined and Reduce is applied to find the common pattern among a set of strings, i.e.,
findPat <- function(s1,s2){
  r1 <- utf8ToInt(s1)
  r2 <- utf8ToInt(s2)
  # positions where the two strings differ are overwritten with "X"
  r1[bitwXor(r1,r2)!=0] <- utf8ToInt("X")
  intToUtf8(r1)
}
pat <- Reduce(findPat,list(s1,s2,s3))
such that
> pat
[1] "X10078XDX0X378FF0XE100XX00040000--------01AXXXXX0000------------"
DATA
s1 <- "2100780D001378FF01E1000000040000--------01A456000000------------"
s2 <- "3100782D001378FF03E1008100040000--------01A445800000------------"
s3 <- "4100781D109378FF03E1008100040000--------01A784580000------------"

Add leading space to certain values in a data frame

I have the following data frame, and for each positive number (yes, they need to be stored as strings) I want to add a leading space.
d <- data.frame(c1 = c("4", "-1.5", "5", "-3"))
> d
    c1
1    4
2 -1.5
3    5
4   -3
So far I used grep and invert to return only the positive numbers which I want to add a leading space to:
d$c1[grep("-", d$c1, invert = TRUE)]
However, I am not sure how to proceed. I think I rather have to work with indices than with the actual number. And probably incorporate gsub? Is that right?
Here is an approach using formatC(). Similar results could be achieved using sprintf(). Note that I don't just add a single space; instead, this approach pads each string to a common width.
d <- data.frame(c1 = c("4", "-1.5", "5", "-3"), stringsAsFactors = FALSE)
d <- transform(d, d2 = formatC(c1, width = 4), stringsAsFactors = FALSE)
R> d
    c1   d2
1    4    4
2 -1.5 -1.5
3    5    5
4   -3   -3
R> str(d)
'data.frame': 4 obs. of 2 variables:
$ c1: chr "4" "-1.5" "5" "-3"
$ d2: chr "   4" "-1.5" "   5" "  -3"
If you don't know ahead of time what the width argument should be, compute it from d$c1:
R> with(d, max(nchar(as.character(c1))))
[1] 4
Or use it directly inline
d <- transform(d, d2 = formatC(c1, width = max(nchar(as.character(c1)))),
stringsAsFactors = FALSE)
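Two further sketches, offered as possibilities rather than as the recommended way. The first adds exactly one space to values that do not start with a minus sign, which is the literal request in the question. The second is the sprintf() variant mentioned above, padding to the width of the longest string. The column names spaced and padded are made up for this illustration.

```r
d <- data.frame(c1 = c("4", "-1.5", "5", "-3"), stringsAsFactors = FALSE)

# Literal request: one leading space for every value without a minus sign.
d$spaced <- ifelse(startsWith(d$c1, "-"), d$c1, paste0(" ", d$c1))

# sprintf() variant: right-align every value to the longest string's width.
width <- max(nchar(d$c1))
d$padded <- sprintf(paste0("%", width, "s"), d$c1)

str(d)
```

The first form keeps negative values untouched, while the second lines up all values on the right, so the choice depends on how the strings are displayed later.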
Does paste(' ',d[d[,1] > 0,]) look like what you want?
The print method for data.frames has nice automated padding features. In general, the strings are padded on the left with spaces to ensure right alignment (by default). You can take advantage of this by capturing the print output. For example, using your d:
> print(d, print.gap = 0, row.names = FALSE)
  c1
   4
-1.5
   5
  -3
The argument print.gap = 0 ensures that there are no extra padding spaces in front of the longest string. row.names = FALSE prevents row names from being printed.
This case is special in a couple ways: The column name is shorter than the longest character string in the data, and the data.frame is only one column. To generalize, you could subset the data and unname it:
myChar <- unname(d[, 1, drop = FALSE])
Then, you can capture the printed object using capture.output:
> (dStr <- capture.output(print(myChar, print.gap = 0, row.names = FALSE)))
[1] "  NA" "   4" "-1.5" "   5" "  -3"
Since the column name is also printed, you could subset the object thusly:
> dStr[-1]
[1] "   4" "-1.5" "   5" "  -3"
This way, you don't have to know how long the longest character string is, and this can handle most data types, not just character.

Resources