How to separate a factor without whitespace in R?

I want to split a factor with 14 rows; each row looks like "cg17205324 (Adolescence)".
I tried strsplit(), but always ended up with "cg17205324 ".
I googled various methods to clean the trailing whitespace, but none of them worked because the data is a factor rather than a string.
Any tips?

We can use scan
scan(text=str1, what ="", quiet=TRUE)
#[1] "cg17205324" "(Adolescence)"
data
str1 <- "cg17205324 (Adolescence)"

You can try the following:
"cg17205324 (Adolescence)" -> outp
strsplit(outp," ") # " " serves as space and separate the two strings
[[1]]
[1] "cg17205324" "(Adolescence)"

a <- "cg17205324 (Adolescence)"
b <- strsplit(a, " ")
b
#[[1]]
#[1] "cg17205324" "(Adolescence)"
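For a factor column like the one in the question, a minimal sketch might convert to character first and then split every element at once. The factor f and its second value are invented for illustration; only the first value comes from the question.

```r
# Hypothetical stand-in for the asker's 14-row factor; the second value is invented.
f <- factor(c("cg17205324 (Adolescence)", "cg00000029 (Adulthood)"))

# strsplit() expects a character vector, so convert the factor first.
parts <- strsplit(as.character(f), " ", fixed = TRUE)

# Take the first and the second piece of every element.
ids    <- vapply(parts, `[`, character(1), 1)
stages <- vapply(parts, `[`, character(1), 2)

# Optionally drop the surrounding parentheses.
stages <- gsub("[()]", "", stages)
```

Splitting on a fixed single space leaves no trailing whitespace on either piece, so no extra trimming step is needed here.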

Related

concat a SPLIT variable in R

I've been trying to split a string in R and then join it back together, but none of the tricks have worked for what I need.
Important: my question is not a duplicate.
Saving a split result into a variable and then pasting or collapsing it is not the same as simply pasting a vector like this:
paste(c("bla", "bla"), collapse = " ")
> paste(c("The","birch", "canoe"), collapse = ' ')
[1] "The birch canoe"
> paste(s, collapse=" ")
[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
Here's the code:
I take pre-saved sentences in R
sentences[1]
and split it
s <- str_split(sentences[1], " ")
this is what I get:
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
Now when I try to join this back together I get backslashes
toString(s)
"c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
paste produces the same result:
> paste(s)
[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
I tried using str_split_fixed and wrapping the result in a vector, but it joins the sentence back together with commas, even if I ask it not to.
v <- as.vector(str_split_fixed(sentences[1], " ", 5))
toString(v, sep="")
[1] "The, birch, canoe, slid, on the smooth planks."
I thought str_split_i or str_split_1 might solve it, since according to the documentation they should in theory, but this is what I get when I try to use them:
could not find function "str_split_1"
Are there any other ways to join a string back together after splitting it, without producing commas or backslashes?
See the difference between:
s <- list(c("The" , "birch" , "canoe" , "slid" , "on" , "the" , "smooth" , "planks."))
paste(s[1], collapse = " ")
#[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
and
paste(s[[1]], collapse = " ")
#[1] "The birch canoe slid on the smooth planks."
This is because [[ extracts the vector itself, whereas [ keeps the output as a list.
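The same behaviour can be reproduced without stringr. This is a small sketch using base strsplit, which also returns a list; the sentence is typed in directly instead of being taken from the sentences dataset.

```r
sentence <- "The birch canoe slid on the smooth planks."

# Base strsplit() returns a list with one element per input string.
s <- strsplit(sentence, " ", fixed = TRUE)
is.list(s)

# Extract the character vector with [[ ]] (unlist(s) also works) before pasting.
words    <- s[[1]]
rejoined <- paste(words, collapse = " ")
rejoined
```

Once the vector has been pulled out of the list, paste with collapse joins the words without commas or escaped quotes.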

R tuple as factor (specifically longitude latitude as factor)

I am having problems accessing factors in R. I have a factor whose values are tuples:
test1
#[1] (34.0467, -118.2470) (34.0637, -118.2440) (34.0438, -118.2547)
#[4] (34.0523, -118.2676) (34.0584, -118.2810) (34.0583, -118.2616)
#39497 Levels: (0, 0) (0.0000, 0.0000) ... (34.6837, -118.1853)
How do I access just the first number of each tuple?
Thanks!
dput(test1)
...
"(34.3256, -118.4307)", "(34.3256, -118.4798)", "(34.3256, -118.5033)",
"(34.3257, -118.4244)", "(34.3258, -118.4343)", "(34.3262, -118.4104)",
"(34.3262, -118.4112)", "(34.3266, -118.4234)", "(34.3266, -118.4269)",
"(34.3266, -118.4323)", "(34.3269, -118.4278)", "(34.3272, -118.4365)",
"(34.3273, -118.4342)", "(34.3274, -118.4321)", "(34.3274, -118.4331)",
"(34.3275, -118.4247)", "(34.3275, -118.4298)", "(34.3276, -118.4115)",
"(34.3277, -118.4071)", "(34.3285, -118.4266)", "(34.3286, -118.4277)",
"(34.3287, -118.4286)", "(34.3292, -118.5048)", "(34.3293, -118.4246)",
"(34.3298, -118.4300)", "(34.3327, -118.5062)", "(34.3374, -118.5042)",
"(34.3760, -118.5254)", "(34.3767, -118.5263)", "(34.3775, -118.5270)",
"(34.3805, -118.5293)", "(34.4638, -118.1995)", "(34.5095, -117.9273)",
"(34.5304, -118.1418)", "(34.5453, -118.0405)", "(34.5650, -118.0856)",
"(34.5693, -118.0228)", "(34.5957, -118.1784)", "(34.6818, -118.0954)",
"(34.6837, -118.1853)"), class = "factor")
Can't get the beginning of that anyhow.
test1 <- factor(c("(34.3242, -118.4494)", "(34.3242, -118.4914)", "(34.3243, -118.4167)"))
First, convert the factor vector to a character vector.
test1 <- as.character(test1)
Then, remove all (s and )s, and split the strings by ,.
test1 <- gsub("\\(|\\)", "", test1)
test1 <- strsplit(test1, ",")
After that, change the digits from character format to numeric format.
test1 <- lapply(test1, as.numeric)
Finally, get the first coordinate of each point (change 1 to 2, if you want the second one).
test1 <- unlist(lapply(test1, '[[', 1))
Here is the output.
> test1
[1] 34.3242 34.3242 34.3243
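If both coordinates are needed, the same steps can be arranged to keep them side by side. The following is only a sketch of that variation; the column names lat and lon are assumptions based on the question title, not something the answer states.

```r
test1 <- factor(c("(34.3242, -118.4494)", "(34.3242, -118.4914)", "(34.3243, -118.4167)"))

# Factor -> character, strip parentheses, split at the comma.
parts <- strsplit(gsub("[()]", "", as.character(test1)), ",", fixed = TRUE)

# as.numeric() tolerates the leading space left in front of the second number.
coords <- do.call(rbind, lapply(parts, as.numeric))
colnames(coords) <- c("lat", "lon")  # assumed column names

coords[, "lat"]  # first number of each tuple
coords[, "lon"]  # second number of each tuple
```

Binding the pairs into a matrix keeps each point on its own row, which avoids overwriting test1 several times.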
Just index again
x[1][1]
x[2][1]
Try this
as.numeric(unlist(strsplit(gsub("[\\(\\)]", "",as.character(test1)),","))[c(T,F)])
Explanation
gsub works on character vectors, so as.character(test1) first converts test1 from a factor to character. Then I remove "(" and ")" from the strings like this:
gsub("[\\(\\)]", "",as.character(test1))
#[1] "34.5693, -118.0228" "34.5957, -118.1784" "34.6818, -118.0954" "34.6837, -118.1853"
Next, I split each string into two parts at the separator , like this:
strsplit(gsub("[\\(\\)]", "",as.character(test1)),",")
#[[1]]
#[1] "34.5693" " -118.0228"
#[[2]]
#[1] "34.5957" " -118.1784"
#[[3]]
#[1] "34.6818" " -118.0954"
#[[4]]
#[1] "34.6837" " -118.1853"
The previous output is a list; unlist turns it into a vector.
unlist(strsplit(gsub("[\\(\\)]", "",as.character(test1)),","))
#[1] "34.5693" " -118.0228" "34.5957" " -118.1784" "34.6818" " -118.0954"
#[7] "34.6837" " -118.1853"
The logical index [c(T,F)] is recycled along the vector as an alternating sequence of TRUE and FALSE, so it selects the first element of each pair.
Finally, I convert the output to numeric using as.numeric.
Output
#[1] 34.5693 34.5957 34.6818 34.6837
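The recycling of the logical index can be seen in isolation. This sketch starts from a shortened copy of the unlisted vector shown above and spells out TRUE and FALSE, since T and F are ordinary variables that can be reassigned.

```r
# Shortened copy of the unlisted output above.
flat <- c("34.5693", " -118.0228", "34.5957", " -118.1784")

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE: positions 1 and 3.
first  <- as.numeric(flat[c(TRUE, FALSE)])

# c(FALSE, TRUE) picks positions 2 and 4 instead.
second <- as.numeric(flat[c(FALSE, TRUE)])
```

Swapping the two logical values is therefore enough to get the second number of each tuple.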

Removing punctuation between two words

I have a data frame (df) and I would like to remove punctuation.
However, there is an issue with a dot between two words and at the end of a word, like this:
test.
test1.test2
I use this to remove the punctuation:
library(tm)
removePunctuation(df)
and the result I get is this:
test
test1test2
but I would like to get this result:
test
test1 test2
How is it possible to have a space between two words in the removing process?
You can use chartr for single character substitution:
chartr(".", " ", c("test1.test2"))
# [1] "test1 test2"
@akrun suggested trimws to remove the space at the end of your test string:
str <- c("test.", "test1.test2")
trimws(chartr(".", " ", str))
# [1] "test" "test1 test2"
We can use gsub to replace the . with a white space and remove the trailing/leading spaces (if any) with trimws.
trimws(gsub('[.]', ' ', str1))
#[1] "test" "test1 test2"
NOTE: In regex, . by itself matches any character. So we should either put it inside square brackets ([.]), escape it (\\.), or use the option fixed=TRUE:
trimws(gsub('.', ' ', str1, fixed=TRUE))
data
str1 <- c("test.", "test1.test2")
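To make the note about the unescaped dot concrete, here is a short sketch comparing the variants on the same data. The variable names are chosen only for this illustration.

```r
str1 <- c("test.", "test1.test2")

# An unescaped "." matches every character, so each string becomes all spaces.
all_blank <- gsub(".", " ", str1)

# Three equivalent ways to match a literal dot.
bracketed <- gsub("[.]", " ", str1)
escaped   <- gsub("\\.", " ", str1)
literal   <- gsub(".", " ", str1, fixed = TRUE)

trimws(bracketed)
```

All three literal-dot variants give the same result; only the unescaped pattern wipes out the text.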
You can also use strsplit:
a <- "test."
b <- "test1.test2"
do.call(paste, as.list(strsplit(a, "\\.")[[1]]))
[1] "test"
do.call(paste, as.list(strsplit(b, "\\.")[[1]]))
[1] "test1 test2"
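Since the question is about a data frame and punctuation in general, the idea might be extended as below. This is a base R sketch, not the tm approach from the question; the column name text and the third value are made up for illustration.

```r
# Hypothetical data frame; the column name and the third value are invented.
df <- data.frame(text = c("test.", "test1.test2", "a,b;c"),
                 stringsAsFactors = FALSE)

# Replace each run of punctuation with one space, then trim the ends.
df$clean <- trimws(gsub("[[:punct:]]+", " ", df$text))
df$clean
```

Using the + quantifier means consecutive punctuation marks collapse into a single space instead of several.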

Replacing strings in R

I am trying to replace strings in R in a large number of texts.
Essentially, this reproduces the format of the data from which I try to delete the '\n' parts.
document <- as.list(c("This is \\na try-out", "And it \\nfails"))
I can do this with a loop and gsub, but it takes forever. I looked at this post for a solution, so I tried:
temp <- apply(document, 2, function(x) gsub("\\n", " ", fixed=TRUE))
I also used lapply, but it also gives an error message. I can't figure this out, help!
Use lapply if you want to return a list:
document <- as.list(c("This is \\na try-out", "And it \\nfails"))
temp <- lapply(document, function(x) gsub("\\n", " ", x, fixed=TRUE))
##[[1]]
##[1] "This is a try-out"
##[[2]]
##[1] "And it fails"
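Because gsub is already vectorised, a loop or lapply may not be needed when a plain character vector is acceptable. The sketch below assumes each list element holds a single string, as in the example data; apply is not suitable here because a list has no dimensions for the margin 2 to refer to. Note that this variant deletes the marker instead of replacing it with a space, to avoid a doubled space.

```r
document <- as.list(c("This is \\na try-out", "And it \\nfails"))

# Flatten the list to a character vector (assumes one string per element).
texts <- unlist(document)

# With fixed = TRUE the pattern is the literal two characters backslash and n.
cleaned <- gsub("\\n", "", texts, fixed = TRUE)
cleaned
```

Wrapping the result in as.list() would restore the list structure if later code expects one.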

Concatenating strings with

I have a data frame with several variables. What I want is to create a string by concatenating the variable names, with something else in between them.
Here is a simplified example (number of variables reduced to only 3 whereas I have actually many)
Making up some data frame
df1 <- data.frame(1,2,3) # A one row data frame
names(df1) <- c('Location1','Location2','Location3')
Actual code...
len1 <- ncol(df1)
string1 <- 'The locations that we are considering are'
for(i in 1:(len1-1)) string1 <- c(string1,paste(names(df1[i]),sep=','))
string1 <- c(string1,'and',paste(names(df1[len1]),'.'))
string1
This gives...
[1] "The locations that we are considering are"
[2] "Location1"
[3] "Location2"
[4] "Location3 ."
But I want
The locations that we are considering are Location1, Location2 and Location3.
I am sure there is a much simpler method which some of you would know...
Thank you for your time.
Are you looking for the collapse argument of paste?
> paste (letters [1:3], collapse = " and ")
[1] "a and b and c"
The fact that these are names of a data.frame does not really matter, so I've pulled that part out and assigned them to a variable strs.
strs <- names(df1)
len1 <- length(strs)
string1 <- paste("The locations that we are considering are ",
paste(strs[-len1], collapse=", ", sep=""),
" and ",
strs[len1],
".\n",
sep="")
This gives
> cat(string1)
The locations that we are considering are Location1, Location2 and Location3.
Note that this will not give sensible English if there is only 1 element in strs.
The idea is to collapse all but the last string with comma-space between them, and then paste that together with the boilerplate text and the last string.
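The single-element caveat could be handled with a small wrapper. join_with_and is a hypothetical helper name introduced here, not part of base R or of the answer above; it is a sketch of one way to cover the edge cases.

```r
# Hypothetical helper: joins items as "a, b and c".
join_with_and <- function(x) {
  n <- length(x)
  if (n == 0) return("")
  if (n == 1) return(x)
  # Collapse all but the last item with ", ", then attach the last with " and ".
  paste(paste(x[-n], collapse = ", "), x[n], sep = " and ")
}

strs <- c("Location1", "Location2", "Location3")
string1 <- paste0("The locations that we are considering are ",
                  join_with_and(strs), ".")
string1
```

With a single name the helper returns it unchanged, so the caller can decide how to word the surrounding sentence.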
If your main goal is to print the results to the screen (or other output) then use the cat function (whose name derives from concatenate):
> cat(names(iris), sep=' and '); cat('\n')
Sepal.Length and Sepal.Width and Petal.Length and Petal.Width and Species
If you need a variable with the string, then you can use paste with the collapse argument. The sprintf function can also be useful for inserting strings into other strings (or numbers into strings).
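As a rough illustration of the sprintf route mentioned above, the sketch below builds the sentence from the question. The vector locs stands in for names(df1) and assumes at least two names.

```r
# Stand-in for names(df1); assumes two or more names.
locs <- c("Location1", "Location2", "Location3")
n <- length(locs)

# Each %s is filled in order: the collapsed leading names, then the last name.
msg <- sprintf("The locations that we are considering are %s and %s.",
               paste(locs[-n], collapse = ", "),
               locs[n])
msg
```

Keeping the fixed wording in a single format string makes the sentence easier to read than a long chain of paste arguments.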
Another option would be:
library(stringr)
str_c("The locations that we are considering are ", str_c(str_c(names(df1)[-ncol(df1)], collapse=", "), names(df1)[ncol(df1)], sep=" and "), ".")

Resources