I want to strip part of the row names that are created automatically after using split and cut2. See the following code:
#Mock data
date_time <- as.factor(c('8/24/07 17:30','8/24/07 18:00','8/24/07 18:30',
'8/24/07 19:00','8/24/07 19:30','8/24/07 20:00',
'8/24/07 20:30','8/24/07 21:00','8/24/07 21:30',
'8/24/07 22:00','8/24/07 22:30','8/24/07 23:00',
'8/24/07 23:30','8/25/07 00:00','8/25/07 00:30'))
U. <- as.numeric(c('0.2355','0.2602','0.2039','0.2571','0.1419','0.0778','0.3557',
'0.3065','0.1559','0.0943','0.1519','0.1498','0.1574','0.1929'
,'0.1407'))
#Mock data frame
test_data <- data.frame(date_time,U.)
#To use cut2
library(Hmisc)
#Splitting the data into categories
sub_data <- split(test_data,cut2(test_data$U.,c(0,0.1,0.2)))
new_data <- do.call("rbind",sub_data)
test_data <- new_data
You will see that "test_data" would have an extra column "row.names" with values such as "[0.000,0.100).6", "[0.000,0.100).10", etc.
How do I remove "[0.000,0.100)" and keep the number after the "." such as 6 and 10 so that I can reference these rows by their original row number later?
Is there a better way to do this?
You could also set the names of sub_data to NULL.
names(sub_data) <- NULL
test_data <- do.call('rbind', sub_data)
row.names(test_data)
#[1] "6" "10" "5" "9" "11" "12" "13" "14" "15" "1" "2" "3" "4" "7" "8"
You could use a Regular Expression (Regex), as follows:
rownames(test_data) = gsub(".*[]\\)]\\.", "", rownames(test_data))
It's cryptic if you're not familiar with Regular Expressions, but it basically says: match any sequence of characters (.*) that is followed by either a closing square bracket or a closing parenthesis ([]\\)]) and then by a period (\\.), and remove all of it.
The double backslashes are "escapes" indicating that the character following the double-backslash should be interpreted literally, rather than in its special Regex meaning (e.g., . means "match any single character", but \\. means "this is really just a period").
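To see the pattern at work without loading Hmisc, here is a minimal sketch on a hand-written vector of row names in the same shape as those in the question. The vector `rn` is made up for illustration.

```r
# Hypothetical row names in the "<interval>.<original row>" shape
rn <- c("[0.000,0.100).6", "[0.000,0.100).10", "[0.200,0.356].1")

# Drop everything up to and including the "]." or ")." that closes the interval
cleaned <- gsub(".*[]\\)]\\.", "", rn)
cleaned
# "6" "10" "1"

# Convert to integer if the values will be used as row indices later
as.integer(cleaned)
```

The periods inside the interval itself (as in 0.000) are left alone by the pattern because they are not preceded by a closing bracket or parenthesis.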
Just for fun, you can also use regmatches
> Names <- rownames(test_data)
> ( rownames(test_data) <- regmatches(Names, regexpr("[0-9]+$", Names)) )
[1] "6" "10" "5" "9" "11" "12" "13" "14" "15" "1" "2" "3" "4" "7" "8"
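On the question of a better method: one option, sketched here under some assumptions, is to store the original row number in an ordinary column before splitting, so nothing has to be parsed back out of the row names. The data are made up, base R's `cut()` stands in for `Hmisc::cut2`, and the column name `orig_row` is hypothetical.

```r
# Small made-up data set; "u" plays the role of U. in the question
df <- data.frame(u = c(0.24, 0.08, 0.14, 0.31, 0.09, 0.12))

# Record the original position in a regular column (hypothetical name)
df$orig_row <- seq_len(nrow(df))

# Base-R stand-in for cut2: left-closed intervals [0,0.1), [0.1,0.2), [0.2,Inf)
groups <- cut(df$u, breaks = c(0, 0.1, 0.2, Inf), right = FALSE)

# Split, drop the list names so rbind adds no prefix, and recombine
combined <- do.call(rbind, unname(split(df, groups)))
rownames(combined) <- NULL

combined$orig_row
# 2 5 3 6 1 4
```

Because the position lives in a column, it survives later subsetting, sorting, and merging, which row names do not always do.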
Related
I deleted rows from my R dataframe and now the index numbers are out of order. For example, the row-index was 1,2,3,4,5 before but now it is 2,3,4 because I deleted rows 1 and 5.
Do I want to change the index labels from 2,3,4 to 1,2,3 on my new dataframe?
If so, how do I do this?
If not, why not?
library(rvest)
url <- "https://en.wikipedia.org/wiki/Mid-American_Conference"
pg <- read_html(url) # Download webpage
pg
tb <- html_table(pg, fill = TRUE) # Extract HTML tables as data frames
tb
macdf <- tb[[2]]
macdf <- subset(macdf, select=c(1,2,5))
colnames(macdf) <- c("School","Location","NumStudent")
macdf <- macdf[-c(1,8),]
You can change the labels from "2" "3" "4" "5" "6" "7" "9" "10" "11" "12" "13" "14" to "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" using:
row.names(macdf) <- 1:nrow(macdf)
You can do something like this:
> library(data.table)
> subset(setDT(macdf, keep.rownames = TRUE), select = -rn)
OR
rownames(macdf) <- NULL
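The same reset can be tried without downloading anything. This is a minimal sketch on a made-up data frame (`df` and its columns are illustrative only), showing the row names before and after:

```r
# Made-up data frame with five rows
df <- data.frame(id = c("a", "b", "c", "d", "e"), n = 1:5)

# Deleting rows 1 and 5 leaves the old labels behind
df <- df[-c(1, 5), ]
rownames(df)
# "2" "3" "4"

# Setting the row names to NULL restores the default 1..n sequence
rownames(df) <- NULL
rownames(df)
# "1" "2" "3"
```

Whether to reset is a judgment call: positional indexing such as df[2, ] works either way, so the old labels mainly matter if you still need to trace rows back to the original table.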
When I apply the seqdef function from the TraMineR package to a list of vectors and then take a look at the levels obtained, I get two unwanted levels. I can't figure out how to erase those levels. Here is my code:
> require(TraMineR)
> seqW <- lapply(X = myListOfVectors, FUN = function(s){
seqdef(s, alphabet = 1:9)
})
After verification, there are only numbers from 1 to 9 in my sequences, but then I get
> levels(s$T1)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "*" "%"
Where do these "*" and "%" come from? How can I avoid their creation?
I am trying to match the last digit in a character vector and replace it with the matched digit - 1. I believe gsub is what I need to use but I cannot figure out what to use as the 'replace' argument. I can match the last number using:
gsub('[0-9]$', ???, chrvector)
But I am not sure how to replace the matched number with itself - 1.
Any help would be much appreciated.
Thank you.
We can do this easily with gsubfn
library(gsubfn)
gsubfn("([0-9]+)", ~as.numeric(x)-1, chrvector)
#[1] "str97" "v197exdf"
Or for the last digit
gsubfn("([0-9])([^0-9]*)$", ~paste0(as.numeric(x)-1, y), chrvector2)
#[1] "str97" "v197exdf" "v33chr138d"
data
chrvector <- c("str98", "v198exdf")
chrvector2 <- c("str98", "v198exdf", "v33chr139d")
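If adding a package is not an option, a base-R sketch of the "last digit minus one" case can be built on regexpr and the replacement form of regmatches. It assumes every string contains at least one digit and, like the other answers, does nothing special when the last digit is 0 (it would become -1).

```r
chrvector2 <- c("str98", "v198exdf", "v33chr139d")

# Locate the last digit: a digit followed only by non-digits up to the end
m <- regexpr("[0-9](?=[^0-9]*$)", chrvector2, perl = TRUE)

# Work on a copy, replacing each matched digit with itself minus one
result <- chrvector2
regmatches(result, m) <- as.character(as.integer(regmatches(result, m)) - 1L)
result
# "str97" "v197exdf" "v33chr138d"
```

The lookahead keeps the trailing non-digit characters out of the match, so only the single digit is replaced.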
Assuming the last digit is not zero,
chrvector <- as.character(1:5)
chrvector
#[1] "1" "2" "3" "4" "5"
chrvector <- paste(chrvector, collapse='') # convert to character string
chrvector <- paste0(substring(chrvector,1, nchar(chrvector)-1), as.integer(gsub('.*([0-9])$', '\\1', chrvector))-1)
unlist(strsplit(chrvector, split=''))
# [1] "1" "2" "3" "4" "4"
This works even if you have the last digit zero:
chrvector <- c(as.character(1:4), '0') # [1] "1" "2" "3" "4" "0"
chrvector <- paste(chrvector, collapse='')
chrvector <- as.character(as.integer(chrvector)-1)
unlist(strsplit(chrvector, split=''))
# [1] "1" "2" "3" "3" "9"
I want to get all characters before the first "." if there is one. Otherwise, I want to get the same string back ("8" -> "8").
Example:
v<-c("7.7.4","8","12.6","11.5.2.1")
I want to get something like this:
[1] "7" "8" "12" "11"
My idea was to split each element at "." and then only take the first split. I found no solution that worked...
You can use sub
sub("\\..*", "", v)
#[1] "7" "8" "12" "11"
or a few stringi options:
library(stringi)
stri_replace_first_regex(v, "\\..*", "")
#[1] "7" "8" "12" "11"
# extract vs. replace
stri_extract_first_regex(v, "[^\\.]+")
#[1] "7" "8" "12" "11"
If you want to use a splitting approach, these will work:
unlist(strsplit(v, "\\..*"))
#[1] "7" "8" "12" "11"
# stringi option
unlist(stri_split_regex(v, "\\..*", omit_empty=TRUE))
#[1] "7" "8" "12" "11"
unlist(stri_split_fixed(v, ".", n=1, tokens_only=TRUE))
unlist(stri_split_regex(v, "[^\\w]", n=1, tokens_only=TRUE))
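The idea from the question (split at "." and keep the first piece) can also be written in base R alone. A minimal sketch, using fixed = TRUE so that "." is treated as a literal character rather than a regex wildcard:

```r
v <- c("7.7.4", "8", "12.6", "11.5.2.1")

# Split each element on the literal "."; elements without a "." stay whole
pieces <- strsplit(v, ".", fixed = TRUE)

# Take the first piece of every element; vapply guarantees a character vector
first <- vapply(pieces, `[`, character(1), 1)
first
# "7" "8" "12" "11"
```

One caveat: an element that starts with "." would yield an empty string as its first piece.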
Other sub variations that use a capture group to target the leading characters specifically:
sub("(\\w+).+", "\\1", v) # \w matches [[:alnum:]_] (i.e. alphanumerics and underscores)
sub("([[:alnum:]]+).+", "\\1", v) # exclude underscores
# variations on a theme
sub("(\\w+)\\..*", "\\1", v)
sub("(\\d+)\\..*", "\\1", v) # narrower: \d for digits specifically
sub("([^.]+)\\..*", "\\1", v) # broader: any run of characters other than "." (a greedy ".+" would run to the last ".")
# stringi variation just for fun:
stri_extract_first_regex(v, "\\w+")
scan() would actually work well for this. Since we want everything before the first ., we can use that as a comment character and scan() will remove everything after and including that character, for each element in v.
scan(text = v, comment.char = ".")
# [1] 7 8 12 11
The above returns a numeric vector, which might be where you are headed. If you need to stick with characters, add the what argument to denote we want a character vector returned.
scan(text = v, comment.char = ".", what = "")
# [1] "7" "8" "12" "11"
Data:
v <- c("7.7.4", "8", "12.6", "11.5.2.1")
I have this character vector:
variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5", "vix.l6", "slope.l11", "slope.l12", "us2yy.l2")
Desired output:
> suffixes(variables)
[1] 1 1 4 4 5 6 11 12 2
In other words, I need a function that will return a numeric vector showing the suffixes (each of which will be 1 or 2 digits long). Note, I need something that can work with a much larger number of strings, which may or may not have numbers somewhere in the middle. The numerical suffixes range from 1 to 99.
Many thanks
Just use gsub:
> gsub(".*?([0-9]+)$", "\\1", variables)
[1] "1" "1" "4" "4" "5" "6" "11" "12" "2"
Wrap it in as.numeric if you want the result as a number.
You could use the sub function.
> variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5" ,"vix.l6", "slope.l11", "slope.l12", "us2yy.l2")
> sub(".*\\D", "", variables)
[1] "1" "1" "4" "4" "5" "6" "11" "12" "2"
.*\\D matches all the characters from the start up to the last non-digit character. Replacing those matched characters with an empty string will give you the desired output.
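For completeness, here is one way the `suffixes()` function named in the question might be written. It is a sketch, not the only option: it extracts the trailing digits with regexpr and regmatches, and returning NA for strings that have no numeric suffix is an assumption, since the question does not say what should happen in that case.

```r
# Return the trailing run of digits of each string as a number
suffixes <- function(x) {
  m <- regexpr("[0-9]+$", x)           # position of the trailing digits, -1 if none
  out <- rep(NA_real_, length(x))      # assumption: NA when there is no suffix
  out[m > 0] <- as.numeric(regmatches(x, m))
  out
}

variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5",
               "vix.l6", "slope.l11", "slope.l12", "us2yy.l2")
suffixes(variables)
# 1 1 4 4 5 6 11 12 2
```

Digits in the middle of a name, such as the 2 in "us2yy.l2", are ignored because the pattern is anchored to the end of the string.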