I need to process some data that are mostly csv. The problem is that R ignores the comma if it comes at the end of a line (e.g., the one that comes after 3 in the example below).
> strsplit("1,2,3,", ",")
[[1]]
[1] "1" "2" "3"
I'd like it to be read in as [1] "1" "2" "3" NA instead. How can I do this? Thanks.
Here are a couple ideas
scan(text="1,2,3,", sep=",", quiet=TRUE)
#[1] 1 2 3 NA
unlist(read.csv(text="1,2,3,", header=FALSE), use.names=FALSE)
#[1] 1 2 3 NA
Both return numeric vectors (scan gives doubles; the read.csv version gives integers). You can wrap as.character around either of them to get the exact output you show in the question:
as.character(scan(text="1,2,3,", sep=",", quiet=TRUE))
#[1] "1" "2" "3" NA
Or, you could specify what="character" in scan, or colClasses="character" in read.csv for slightly different output
scan(text="1,2,3,", sep=",", quiet=TRUE, what="character")
#[1] "1" "2" "3" ""
unlist(read.csv(text="1,2,3,", header=FALSE, colClasses="character"), use.names=FALSE)
#[1] "1" "2" "3" ""
You could also specify na.strings="" along with colClasses="character"
unlist(read.csv(text="1,2,3,", header=FALSE, colClasses="character", na.strings=""),
use.names=FALSE)
#[1] "1" "2" "3" NA
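If you would rather stay with strsplit, one possible workaround (a sketch, not taken from the answer above) is to append one extra separator before splitting. strsplit drops only the final empty field, so the padded string yields the expected number of fields. The helper name split_keep_trailing is made up for illustration, and it assumes a single input string and a fixed (non-regex) separator.

```r
# Hypothetical helper: split one string on a fixed separator and
# return empty fields (including a trailing one) as NA.
split_keep_trailing <- function(x, sep = ",") {
  # Pad with one separator so that the empty field strsplit drops
  # is the padding, not the real trailing field.
  parts <- strsplit(paste0(x, sep), sep, fixed = TRUE)[[1]]
  parts[parts == ""] <- NA_character_  # every empty field becomes NA
  parts
}

split_keep_trailing("1,2,3,")
split_keep_trailing("1,2,3")
```

Note that empty fields in the middle of the string (as in "1,,3") are turned into NA as well, which may or may not be what you want.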
The stringr package (by Hadley Wickham) and the stringi package it is built on are a big improvement on the base string functions (fully vectorized, consistent function interface):
require(stringr)
str_split("1,2,3,", ",")
[[1]]
[1] "1" "2" "3" ""
as.integer(unlist(str_split("1,2,3,", ",")))
[1] 1 2 3 NA
Using stringi package:
require(stringi)
> stri_split_fixed("1,2,3,",",")
[[1]]
[1] "1" "2" "3" ""
## you can directly specify whether you want to omit these empty elements
> stri_split_fixed("1,2,3,",",",omit_empty = TRUE)
[[1]]
[1] "1" "2" "3"
> foo <- as.character(c(0, 2))
> foo
[1] "0" "2"
> foo[1]
[1] "0"
> foo[2]
[1] "2"
> as.character("0-2")
[1] "0-2" #this is the output I want from the command below:
> as.character("foo[1]-foo[2]")
[1] "foo[1]-foo[2]" # ... was hoping to get "0-2"
I tried some variations of eval(parse()), but same problem. I also tried these simple examples:
> as.character("as.name(foo[1])")
[1] "as.name(foo[1])"
> as.character(as.name("foo[1]"))
[1] "foo[1]"
Any chance of getting something simple like as.character("foo[1]-foo[2]") to display "0-2"?
UPDATE
Similar example (with a much longer string):
> lol <- as.character(seq(0, 20, 2))
> lol
[1] "0" "2" "4" "6" "8" "10" "12" "14" "16" "18" "20"
> c(as.character("0-2"), as.character("2-4"), as.character("4-6"), as.character("6-8"), as.character("8-10"), as.character("10-12"), as.character("12-14"),as.character("14-16"),as.character("16-18"),as.character("18-20"))
[1] "0-2" "2-4" "4-6" "6-8" "8-10" "10-12" "12-14" "14-16" "16-18" "18-20"
I would like to be able to actually call the object lol from within my character string.
We can use paste with the collapse argument
paste(foo, collapse='-')
#[1] "0-2"
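If we need to paste adjacent elements together, pair each element with its successor: drop the last element of 'lol' for the left-hand side, drop the first element for the right-hand side, and paste the two vectors together with the sep argument.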
paste(lol[-length(lol)], lol[-1], sep='-')
#[1] "0-2" "2-4" "4-6" "6-8" "8-10" "10-12" "12-14" "14-16" "16-18"
#[10] "18-20"
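The same pairing can be written with head and tail, which some readers find easier to scan than negative indices. This is only a sketch of an equivalent spelling; the variable name ranges is made up here. For the original two-element case, sprintf is one way to build the string from the values stored in foo rather than from the literal text "foo[1]".

```r
lol <- as.character(seq(0, 20, 2))

# head(lol, -1) drops the last element; tail(lol, -1) drops the first.
ranges <- paste(head(lol, -1), tail(lol, -1), sep = "-")
ranges

# Build "0-2" from the contents of foo, not from the text "foo[1]".
foo <- as.character(c(0, 2))
sprintf("%s-%s", foo[1], foo[2])
```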
toString seems to convert a whole vector to a single string -
toString(c(1,2))
[1] "1, 2"
how does one map the string conversion over each element; i.e. for the above example, to obtain ("1", "2") ?
> as.character(c(1,2))
[1] "1" "2"
Is the output I get from the R-console.
If you already have the output of toString, it is a character vector with a single element, so calling as.character on it has no effect. You need to split it again, for example with scan:
> scan(text = toString(0:11), sep="," )
Read 12 items
[1] 0 1 2 3 4 5 6 7 8 9 10 11
Then you can use as.character if that is needed:
> res <- scan(text = toString(0:11), sep="," )
Read 12 items
> as.character(res)
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
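If the goal is only to get the pieces back as strings, splitting on the ", " separator that toString inserts should also work and skips the numeric round trip. A minimal sketch, assuming the individual elements do not themselves contain ", ":

```r
s <- toString(0:11)

# toString joins elements with ", ", so split on exactly that (no regex).
parts <- strsplit(s, ", ", fixed = TRUE)[[1]]
parts
```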
I prefer paste0 since it's shorter and (from what I can tell) accomplishes the same thing as as.character:
> paste0(1:2)
[1] "1" "2"
> identical(paste0(1:2),as.character(1:2))
[1] TRUE
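One caveat worth checking before swapping as.character for paste0: the two differ on missing values. The small sketch below (the vector x is just an example) shows that as.character keeps NA as a missing value, while paste0 turns it into the literal string "NA".

```r
x <- c(1, NA)

as.character(x)  # second element stays a missing value
paste0(x)        # second element becomes the string "NA"

is.na(as.character(x))
is.na(paste0(x))
```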
I'm wondering if there is any way to remove blanks from the list.
As far as I've searched, I found out that there are many Q&As for removing
the whole element from the list, but couldn't find the one regarding
a specific component of the element.
To be specific, the list now I'm working with looks like this:
[[1]]
[1] "1" "" "" "2" "" "" "3"
[[2]]
[1] "weak"
[[3]]
[1] "22" "33"
[[4]]
[1] "44" "34p" "45"
From above, you can find " ", which should be removed.
I've tried different commands like
text.words.bl <- text.words.ll[-which(text.words.ll==" ")]
text.words.bl <- text.words.ll[!sapply(text.words.ll, is.null)]
etc., but the " "s in [[1]] of the list still remain.
Is it impossible to apply commands to small pieces in each element of the list?
(e.g. 1, 2, weak, 22, 33... respectively)
I've used "lapply" function to run specific commands to each elements,
and it seemed like those lapply commands all worked....
JY
Use %in%, but negate it with !:
## Sample data:
L <- list(c(1, 2, "", "", 4), c(1, "", "", 2), c("", "", 3))
L
# [[1]]
# [1] "1" "2" "" "" "4"
#
# [[2]]
# [1] "1" "" "" "2"
#
# [[3]]
# [1] "" "" "3"
The replacement:
lapply(L, function(x) x[!x %in% ""])
# [[1]]
# [1] "1" "2" "4"
#
# [[2]]
# [1] "1" "2"
#
# [[3]]
# [1] "3"
Obviously, assign the output to "L" if you want to overwrite the original dataset:
L[] <- lapply(L, function(x) x[!x %in% ""])
Another way would be to use nchar(). I borrowed L from @Ananda Mahto.
lapply(L, function(x) x[nchar(x) >= 1])
#[[1]]
#[1] "1" "2" "4"
#
#[[2]]
#[1] "1" "2"
#
#[[3]]
#[1] "3"
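The question mentions both "" and " ", so it may be that some of the blanks are whitespace-only strings rather than truly empty ones. A hedged variant that handles both cases trims whitespace first and then keeps the non-empty strings with nzchar. The sample list below is invented for illustration.

```r
# Invented sample data containing empty and whitespace-only strings.
L <- list(c("1", "", " ", "2"), c("weak"), c("", "  ", "3"))

# trimws() strips leading/trailing whitespace; nzchar() is TRUE for non-empty strings.
cleaned <- lapply(L, function(x) x[nzchar(trimws(x))])
cleaned
```

The original strings are kept as they are; trimws is only used to decide which elements to drop.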
Could somebody explain to me why this does not print all the numbers separately in R?
numberstring <- "0123456789"
for (number in numberstring) {
  print(number)
}
Aren't strings just arrays of chars? What's the way to do it in R?
In R "0123456789" is a character vector of length 1.
If you want to iterate over the characters, you have to split the string into
a vector of single characters using strsplit.
numberstring <- "0123456789"
numberstring_split <- strsplit(numberstring, "")[[1]]
for (number in numberstring_split) {
  print(number)
}
# [1] "0"
# [1] "1"
# [1] "2"
# [1] "3"
# [1] "4"
# [1] "5"
# [1] "6"
# [1] "7"
# [1] "8"
# [1] "9"
Just for fun, here are a few other ways to split a string at each character.
x <- "0123456789"
substring(x, 1:nchar(x), 1:nchar(x))
# [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
regmatches(x, gregexpr(".", x))[[1]]
# [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
scan(text = gsub("(.)", "\\1 ", x), what = character())
# [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
Possible with stringr::str_split (stringr is part of the tidyverse):
library(stringr)
numberstring <- "0123456789"
str_split(numberstring, boundary("character"))
[[1]]
 [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
Here's a naive approach for iterating a string using a for loop and substring. This isn't any better than existing answers for the common case, but it might be useful if you want to break out of the loop early instead of always traversing the entire string once up front, as str_split/scan/substring(x, 1:nchar(x), 1:nchar(x))/regmatches requires.
s <- "0123456789"
if (s != "") {
  for (i in 1:nchar(s)) {
    print(substring(s, i, i))
  }
}
The if is needed to avoid looping backwards from 1 to 0, inclusive of both ends.
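A variant of the same loop uses seq_len, which yields an empty sequence for an empty string, so the guard is no longer needed. The sketch below also shows the early exit mentioned above; the function name chars_until and its stop_at parameter are made up for illustration.

```r
# Hypothetical helper: collect characters of s until stop_at is seen.
chars_until <- function(s, stop_at) {
  out <- character(0)
  # seq_len(0) is empty, so an empty string simply skips the loop.
  for (i in seq_len(nchar(s))) {
    ch <- substring(s, i, i)
    if (ch == stop_at) break  # stop early without splitting the rest
    out <- c(out, ch)
  }
  out
}

chars_until("0123456789", "5")
chars_until("", "5")
```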
Your question is not 100% clear as to the desired outcome (print each character individually from a string, or store each number in a way that the given print loop will result in each number being produced on its own line).
To store numberstring such that it prints using the loop you included:
numberstring<-c(0,1,2,3,4,5,6,7,8,9)
for(number in numberstring){print(number);}
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
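If the digits start out as a string, as in the question, a possible way to get to the same numeric vector without typing it by hand is to split the string and convert the pieces. This is a sketch that assumes the string contains only digit characters; the name digits is illustrative.

```r
numberstring <- "0123456789"

# Split into single characters, then convert each one to an integer.
digits <- as.integer(strsplit(numberstring, "")[[1]])

for (number in digits) {
  print(number)
}
```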
Obviously I don't get the way grep works in R. If I use grep in my OS X terminal, I can use the -o option, which makes grep return only the matching part. In R, I can't find how to do the corresponding thing. Reading the manual, I thought value = TRUE was the right approach. It is better inasmuch as it returns characters rather than indices, but it still returns the whole string.
# some string fasdjlk465öfsdj123
# R
test <- "fasdjlk465öfsdj123"
grep("[0-9]",test,value=TRUE) # returns "fasdjlk465öfsdj123"
# shell
echo 'fasdjlk465öfsdj123' | grep -o '[0-9]'
# returns 4 6 5 1 2 3
What's the parameter I am missing in R ?
EDIT: Joris Meys' suggestion comes really close to what I am trying to do. I get a vector as a result of readLines, and I'd like to check every element of the vector for numbers and return these numbers. I am really surprised there's no standard solution for that. I thought of using some regexp function that works on a string and returns the match like grep -o, and then using lapply on that vector. grep.custom comes closest, so I'll try to make that work for me.
Spacedman said it already. If you really want to simulate grep in the shell, you have to work on the characters themselves, using strsplit():
> chartest <- unlist(strsplit(test,""))
> chartest
[1] "f" "a" "s" "d" "j" "l" "k" "4" "6" "5" "ö" "f" "s" "d" "j" "1" "2" "3"
> grep("[0-9]",chartest,value=T)
[1] "4" "6" "5" "1" "2" "3"
EDIT :
As Nico said, if you want to do this for complete regular expressions, you need to use gregexpr() and substr(). I'd make a custom function like this one:
grep.custom <- function(x, pattern){
  # start positions of all matches in x, with their lengths as an attribute
  strt <- gregexpr(pattern, x)[[1]]
  lngth <- attributes(strt)$match.length
  stp <- strt + lngth - 1
  # cut out each match by its start and stop position
  apply(cbind(strt, stp), 1, function(i){substr(x, i[1], i[2])})
}
Then :
> grep.custom(test,"sd")
[1] "sd" "sd"
> grep.custom(test,"[0-9]")
[1] "4" "6" "5" "1" "2" "3"
> grep.custom(test,"[a-z]s[a-z]")
[1] "asd" "fsd"
EDIT2 :
For vectors, use the function Vectorize(), e.g.:
> X <- c("sq25dfgj","sqd265jfm","qs55d26fjm" )
> v.grep.custom <- Vectorize(grep.custom)
> v.grep.custom(X,"[0-9]+")
$sq25dfgj
[1] "25"
$sqd265jfm
[1] "265"
$qs55d26fjm
[1] "55" "26"
and if you want to call grep from the shell, see ?system
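Base R also provides regmatches, which pairs with gregexpr and does the cutting for you, so it can serve as a vectorized stand-in for the custom function. A short sketch using the same input vector; unlike the Vectorize version, the resulting list is unnamed.

```r
X <- c("sq25dfgj", "sqd265jfm", "qs55d26fjm")

# gregexpr finds every match; regmatches extracts the matched substrings.
matches <- regmatches(X, gregexpr("[0-9]+", X))
matches

# Elements without any match come back as character(0).
regmatches("nonumbers", gregexpr("[0-9]+", "nonumbers"))
```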
That's because grep in R works on vectors: it searches every element and returns the indices of the elements that match. It answers the question 'which elements in this vector match this pattern?' For example, here we make a vector of 3 elements and then ask 'which elements in this vector contain a digit?'
> test = c("fasdjlk465öfsdj123","nonumbers","123")
> grep("[0-9]",test)
[1] 1 3
Elements 1 and 3 - not 2, which is only characters.
You probably want gsub - substitute anything that doesn't match digits with nothing:
> gsub("[^0-9]","",test)
[1] "465123" "" "123"
All this dancing around with strings is the problem the stringr package was designed to solve.
library(stringr)
str_extract_all('fasdjlk465fsdj123', '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
# It is vectorized too
str_extract_all(rep('fasdjlk465fsdj123',3), '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
[[2]]
[1] "4" "6" "5" "1" "2" "3"
[[3]]
[1] "4" "6" "5" "1" "2" "3"
The motivation behind stringr is to unify string operations in R under two principles:
Use a sane and consistent naming scheme for functions (str_do_something).
Make it so that all the string operations that take one step in other programming languages, yet fifty steps in R, take only one step in R.
grep will only tell you whether the string matches or not.
For instance if you have:
values <- c("abcde", "12345", "abc123", "123abc")
Then
> grep("[0-9]", values)
[1] 2 3 4
This tells you that elements 2, 3 and 4 of the vector match the regexp. You can pass value=TRUE to return the strings rather than the indices.
If you want to check where the match is happening you can use regexpr instead
> regexpr("[0-9]", values)
[1] -1 1 4 1
attr(,"match.length")
[1] -1 1 1 1
which tells you where the first match is happening.
Even better, you can use gregexpr for multiple matches
> gregexpr("[0-9]", values)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1 2 3 4 5
attr(,"match.length")
[1] 1 1 1 1 1
[[3]]
[1] 4 5 6
attr(,"match.length")
[1] 1 1 1
[[4]]
[1] 1 2 3
attr(,"match.length")
[1] 1 1 1
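To see how these positions relate to the matched text, the start position and the match.length attribute can be fed to substring. The sketch below does this for the first run of digits in each element, using regexpr with the pattern "[0-9]+" (an illustrative choice, not the pattern used above); elements with no match are returned as NA. The variable names are made up for the example.

```r
values <- c("abcde", "12345", "abc123", "123abc")

m <- regexpr("[0-9]+", values)
start <- as.integer(m)          # -1 where there is no match
len <- attr(m, "match.length")  # length of each match, -1 if none

# Cut out the match by position; use NA where nothing matched.
first_match <- ifelse(start > 0,
                      substring(values, start, start + len - 1),
                      NA_character_)
first_match
```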
No idea where you get the impression that
> test <- "fasdjlk465öfsdj123"
> grep("[0-9]",test)
[1] 1
returns "fasdjlk465öfsdj123"
If you want to return the matches, you need to break test into its component parts, grep on those, and then use the indices returned by grep to index test.
> test <- strsplit("fasdjlk465öfsdj123", "")[[1]]
> matched <- grep("[0-9]", test)
> test[matched]
[1] "4" "6" "5" "1" "2" "3"
Or just return the matched strings directly, depends what you want:
> grep("[0-9]", test, value = TRUE)
[1] "4" "6" "5" "1" "2" "3"
strapply in the gsubfn package can do such extraction:
> library(gsubfn)
> strapply(c("ab34de123", "55x65"), "\\d+", as.numeric, simplify = TRUE)
[,1] [,2]
[1,] 34 55
[2,] 123 65
It's based on the apply paradigm, where the first argument is the object, the second is the modifier (the margin for apply, the regular expression for strapply) and the third argument is the function to apply to the matches.
str_extract_all(obj, re) in the stringr package is similar to strapply specialized to use c as the function, i.e. it's similar to strapply(obj, re, c).
strapply supports the sets of regular expressions supported by R and also supports tcl regular expressions.
See the gsubfn home page at http://gsubfn.googlecode.com