I am currently using R to convert data from an experiment into a high quality dataset. One of the features of my code is to detect repetitions of the experiment and label them accordingly. I have written the following code for this:-
DAYREP<-function(a){
  CAPS<-c("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P",
          "Q","R","S","T","U","V","W","X","Y","Z")
  if (unique(table(a))==1 && length(unique(table(a)))==1){
    return(a)
  }
  else{
    for (i in a){
      if (table(a)[[i]]>=2){
        CAPS.sum<-CAPS[1:as.vector(table(a)[[i]])-1]
        val<-c(i,paste0(i,CAPS.sum))
        del<-a[!a %in% i]
        vec<-append(del,val,after=i-1)
        return(vec)
      }
    }
  }
}
I have used the following vectors of day numbers for testing and they highlight every possible outcome known so far.
a<-c(1,2,3,4,5,6,7,8,9)
b<-c(1,2,3,4,5,6,7,8,8)
c<-c(1,2,3,3,4,5,6)
d<-c(1,1,1,1,1,1)
e<-c(1,2,2,3,4,5,6,6,7)
f<-c(2,7,8,10,11,11,14)
It produces the following output:-
> DAYREP(a)
[1] 1 2 3 4 5 6 7 8 9
> DAYREP(b)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "8A"
> DAYREP(c)
[1] "1" "2" "3" "3A" "4" "5" "6"
> DAYREP(d)
[1] "1" "1A" "1B" "1C" "1D" "1E"
> DAYREP(e)
[1] "1" "2" "2A" "3" "4" "5" "6" "6" "7"
> DAYREP(f)
Error in table(a)[[i]] : subscript out of bounds
The function works on all the tests but e and f. With e it only converts the first set of repeated values, and with f it returns an error message.
I am aware that the problem is being caused by the table(a)[[i]] element calling the frequency value from the table, however I am unsure as to whether or not there is a method to call the values being tabulated from the table. E.g.
> table(e)
e
1 2 3 4 5 6 7
1 2 1 1 1 2 1
The method I am using is calling the bottom line, however I wish to call the top line. Does anybody know of a solution to this?
@cr1msonB1ade has kindly suggested the make.unique function, which does what the above function does with a slightly different suffix format. It expects a character vector, so the day numbers are converted first.
> make.unique(as.character(e))
[1] "1" "2" "2.1" "3" "4" "5" "6" "6.1" "7"
Thank you!
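As stated in my comment, I think what you want is the built-in function make.unique, but there are also some issues with how you are using the table, so I would like to address those as well. When you want to access the values in a table via the name of the variable (i in your for loop), you want to index with single brackets [ not double brackets [[. The other issue is that table converts the values to factors, so its entries are named by character strings and you have to index with as.character(i). A numeric i is treated as a position instead, which is why DAYREP(f) fails with "subscript out of bounds": the table of f has only six entries, but i reaches 7. The tabulated values themselves (the "top line") are available as names(table(a)). I don't think this completely fixes your script, but it might get you close enough.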
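For completeness, here is a minimal sketch of the name-based lookup and of a loop-free alternative to DAYREP. The helper name day_rep is made up for illustration, and it assumes no day occurs more than 27 times (the first occurrence plus the 26 letters). Unlike DAYREP, it always returns a character vector, even when nothing repeats.

```r
e <- c(1, 2, 2, 3, 4, 5, 6, 6, 7)
f <- c(2, 7, 8, 10, 11, 11, 14)

counts <- table(f)
names(counts)               # the "top line": the tabulated values, as strings
counts[as.character(11)]    # count of day 11, looked up by name rather than by position

# day_rep is a hypothetical helper, not part of base R
day_rep <- function(days) {
  # running occurrence number of each value: 1 the first time it is seen, 2 the second, ...
  occurrence <- ave(seq_along(days), days, FUN = seq_along)
  # the first occurrence gets no suffix, later ones get "A", "B", ...
  paste0(days, c("", LETTERS)[occurrence])
}

day_rep(e)
day_rep(f)
```

Because the suffix is computed per element, every repeated day is handled in one pass, so the cases e and f no longer need special treatment.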
When I apply the seqdef function from the TraMineR package to a list of vectors and then take a look at the levels obtained, I get two unwanted levels. I can't figure out how to erase those levels. Here is my code:
> require(TraMineR)
> seqW <- lapply(X = myListOfVectors, FUN = function(s){
seqdef(s, alphabet = 1:9)
})
I have verified that my sequences contain only the numbers 1 to 9, but I still get
> levels(s$T1)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "*" "%"
Where do these "*" and "%" come from? How can I avoid their creation?
toString seems to convert a whole vector to a single string -
toString(c(1,2))
[1] "1, 2"
how does one map the string conversion over each element; i.e. for the above example, to obtain ("1", "2") ?
> as.character(c(1,2))
[1] "1" "2"
Is the output I get from the R-console.
If you already have the result of toString, it is a character vector with a single element, so applying as.character to it will have no effect. You need to split it back up with scan:
> scan(text = toString(0:11), sep="," )
Read 12 items
[1] 0 1 2 3 4 5 6 7 8 9 10 11
Then you can use as.character if that is needed:
> res <- scan(text = toString(0:11), sep="," )
Read 12 items
> as.character(res)
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
I prefer paste0 since it's shorter and (from what I can tell) accomplishes the same thing as as.character:
> paste0(1:2)
[1] "1" "2"
> identical(paste0(1:2),as.character(1:2))
[1] TRUE
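One caveat worth checking before swapping the two: as far as I know they differ on missing values. The short sketch below (with a made-up input vector x) shows that as.character keeps NA as a missing value, while paste0 turns it into the literal string "NA".

```r
x <- c(1, NA)

as_chr <- as.character(x)   # second element stays a missing value
pasted <- paste0(x)         # second element becomes the string "NA"

is.na(as_chr)
is.na(pasted)
identical(as_chr, pasted)
```

For vectors without missing values the two give the same result, as in the comparison above.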
I have this character vector:
variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5", "vix.l6", "slope.l11", "slope.l12", "us2yy.l2")
Desired output:
> suffixes(variables)
[1] 1 1 4 4 5 6 11 12 2
In other words, I need a function that will return a numeric vector showing the suffixes (each of which can be 1 or 2 digits long). Note that I need something that can work with a much larger number of strings, which may or may not have numbers somewhere in the middle. The numerical suffixes range from 1 to 99.
Many thanks
Just use gsub:
> gsub(".*?([0-9]+)$", "\\1", variables)
[1] "1" "1" "4" "4" "5" "6" "11" "12" "2"
Wrap it in as.numeric if you want the result as a number.
You could use sub function.
> variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5" ,"vix.l6", "slope.l11", "slope.l12", "us2yy.l2")
> sub(".*\\D", "", variables)
[1] "1" "1" "4" "4" "5" "6" "11" "12" "2"
.*\\D matches all the characters from the start up to the last non-digit character. Replacing those matched characters with an empty string will give you the desired output.
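Wrapped up as the suffixes function from the question, this could look like the sketch below. It assumes every string ends in digits; a string without a trailing number would come back as NA.

```r
variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5",
               "vix.l6", "slope.l11", "slope.l12", "us2yy.l2")

# strip everything up to and including the last non-digit, then convert to integer
suffixes <- function(x) as.integer(sub(".*\\D", "", x))

suffixes(variables)
```

Digits in the middle of a name, as in "us2yy.l2", are removed along with the rest of the prefix because the match is greedy.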
I need to extract the length and strength (amount) of rain showers from a data set. The data is a matrix where each row includes the data for one day. The data is split in 5min intervals, so each column is for one 5 min interval (288 columns). I now want to find the beginning of a rain shower and sum the amount and length until it stops. Because rain-showers can extend into the next day, I need to be able to keep summing in the next row. For my loops to work, I added the last column to the front of the matrix, but moved it one row down (basically adding the last cell of the previous row to the front of the next):
# create an example matrix
Dat=matrix(1:25, 5, 5)
a=c(0,0,0,0,1) # last column added to the front
b=c(0,0,0,1,0) # last column
Dat=cbind(a,Dat,b,b)
e=c(0,0,0,0,0,0,0,0) # just another row
Dat=rbind(Dat,e)
Matrix looks like this
0 1 6 11 16 21 0 0
0 2 7 12 17 22 0 0
0 3 8 13 18 23 0 0
0 4 9 14 19 24 1 1
1 5 10 15 20 25 0 0
0 0 0 0 0 0 0 0
now I run my code:
Rain=0
Length=0
results=data.frame()
i=1
j=2
for (i in 1:nrow(Dat)) { # rows
  for (j in 2:ncol(Dat)) { # cols
    if(Dat[i,j]==0){ # if there is no rain
      print(c(i,j,"if"))
      j=j+1 # move on to next cell
      if(j==(ncol(Dat)+1)){ # at the end of the line, move to the next row
        i=i+1
        j=2
      }
    }
    else {
      print(c(i,j,"else")) # if there is rain
      if (Dat[i,j-1]==0) { # check if there was no rain before (=> start of rain)
        Rain=0
        Length=0
        while(Dat[i,j]>0){ # while it is raining, add up
          print(c(i,j,"while"))
          Rain=Rain+Dat[i,j]
          Length=Length+5
          j=j+1 # move to next cell
          if(j==(ncol(Dat)+1)){ # at the end of a row, move to the beginning of the next
            i=i+1
            j=2
          }
        }
        results_vector=c(Rain,Length) # save the results
        results=rbind(results, results_vector)
      }
    }
  }
}
This works quite well (meaning the added up results are ok), however, the indexes don't seem to get handed over from the while loop to the for loops, and I couldn't find out why. So when the while loop jumps to the next line, the for loop repeats checking this line where there is no rain, see output:
[1] "1" "2" "else"
[1] "1" "2" "while" # enters while loop
[1] "1" "3" "while"
[1] "1" "4" "while"
[1] "1" "5" "while"
[1] "1" "6" "while"
[1] "1" "3" "else" # exits while loop, but runs in if-else loop
[1] "1" "4" "else"
[1] "1" "5" "else"
[1] "1" "6" "else"
[1] "1" "7" "if"
[1] "1" "8" "if"
[1] "2" "2" "else" # next line
[1] "2" "2" "while"
[1] "2" "3" "while"
.....
[1] "4" "5" "while" # in while loop
[1] "4" "6" "while"
[1] "4" "7" "while"
[1] "4" "8" "while"
[1] "5" "2" "while" # jumps to next row correctly
[1] "5" "3" "while"
[1] "5" "4" "while"
[1] "5" "5" "while"
[1] "5" "6" "while"
[1] "5" "3" "else" # repeats in if-else loop...
[1] "5" "4" "else"
[1] "5" "5" "else"
[1] "5" "6" "else"
[1] "5" "7" "if"
[1] "5" "8" "if"
[1] "5" "2" "else" # repeats row 5 in if-else loop!!! Why?
[1] "5" "3" "else"
[1] "5" "4" "else"
[1] "5" "5" "else"
[1] "5" "6" "else"
[1] "5" "7" "if"
[1] "5" "8" "if"
[1] "6" "2" "if" # back on track...
[1] "6" "3" "if"
Thanks for reading to the bottom! Any help or suggestions to improve/fix this would be highly appreciated because the data sets are very large (60 years in 5min intervals for several stations).
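On the "Why?" in the trace: in R, a for loop walks over a sequence that is fixed when the loop starts, and the loop variable is reassigned from that sequence at the top of every iteration. Changing i or j inside the body therefore only lasts until the current iteration ends. A minimal sketch, with made-up variable names, comparing this with a while loop that does keep manual index changes:

```r
visited <- integer(0)
for (j in 1:5) {
  visited <- c(visited, j)
  j <- j + 3   # only affects the rest of this iteration
}
visited        # every value of 1:5 was still visited

kept <- numeric(0)
j <- 1
while (j <= 5) {
  kept <- c(kept, j)
  j <- j + 3   # a while loop keeps the manually advanced index
}
kept           # only 1 and 4 were visited
```

So skipping ahead from inside a for loop would need while loops for both indexes, or an approach that avoids explicit indexes altogether.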
If you're concerned about the dates and times the rain events started, I would follow the advice I gave in my comment about converting to a data.frame. If not, what follows is a vectorized solution that should be fast on large data.
First convert your data to a single vector if you don't care about days.
dat.t <- c(t(Dat))
Now you can get sequences with a boolean check:
rain_events <- dat.t != 0
Using rle we can get the duration of each event:
rain_rled <- rle(rain_events)
# I'm using the fact that my values are logical.
# this is equivalent to rain_rled$values == TRUE.
rain_duration <- rain_rled$lengths[rain_rled$values]
Then we just need to determine amount.
total_rain <- cumsum(dat.t)
total_rain_shift <- c(total_rain[-1], 0)
sums <- unique(total_rain[total_rain == total_rain_shift])
rain_fall <- c(sums[1], sums[2:length(sums)] - sums[1:(length(sums)-1)])
out <- data.frame(rain_duration = rain_duration,
rain_fall = rain_fall)
There are other ways, but this is one. You can also use info from your rle to determine the position where the rain started and stopped if you desire, but I'll leave that as an exercise for the reader...
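As one of those other ways, here is a hedged sketch that derives start, duration and amount from the same rle result, so the three columns always have matching lengths, including when the series starts dry or ends in the middle of a shower. It assumes the helper first column from the question is left out (it would double-count the last cell of each day) and that every cell is a 5-minute interval. The names rain, runs, run_id and showers are made up for illustration.

```r
# example data as in the question, but without the prepended helper column
b   <- c(0, 0, 0, 1, 0)
Dat <- rbind(cbind(matrix(1:25, 5, 5), b, b), 0)

rain <- c(t(Dat))                  # flatten row by row so showers can cross midnight
runs <- rle(rain > 0)              # alternating dry and wet runs
run_id <- rep(seq_along(runs$lengths), runs$lengths)  # run number of every cell
wet <- runs$values                 # TRUE for the runs that are showers

showers <- data.frame(
  start    = (cumsum(runs$lengths) - runs$lengths + 1)[wet],  # index into rain
  duration = runs$lengths[wet] * 5,                           # minutes
  amount   = as.vector(tapply(rain, run_id, sum))[wet]
)
showers
```

The start index can be converted back to a day and an interval with integer division and remainder on the number of columns.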
Obviously I don't get the way grep works in R. If I use grep in my OS X terminal, I can use the -o parameter, which makes grep return only the matching part. In R, I can't find how to do the corresponding thing. Reading the manual, I thought value=TRUE was the right approach. It is better in that it returns characters rather than indexes, but it still returns the whole string.
# some string fasdjlk465öfsdj123
# R
test <- "fasdjlk465öfsdj123"
grep("[0-9]",test,value=TRUE) # returns "fasdjlk465öfsdj123"
# shell
echo 'fasdjlk465öfsdj123' | grep -o '[0-9]'
# returns 4 6 5 1 2 3
What's the parameter I am missing in R ?
EDIT: Joris Meys' suggestion comes really close to what I am trying to do. I get a vector as a result of readLines, and I'd like to check every element of the vector for numbers and return these numbers. I am really surprised there's no standard solution for that. I thought of using some regexp function that works on a string and returns the match like grep -o, and then using lapply on that vector. grep.custom comes closest. I'll try to make that work for me.
Spacedman said it already. If you really want to simulate grep in the shell, you have to work on the characters themselves, using strsplit():
> chartest <- unlist(strsplit(test,""))
> chartest
[1] "f" "a" "s" "d" "j" "l" "k" "4" "6" "5" "ö" "f" "s" "d" "j" "1" "2" "3"
> grep("[0-9]",chartest,value=T)
[1] "4" "6" "5" "1" "2" "3"
EDIT :
As Nico said, if you want to do this for complete regular expressions, you need to use gregexpr() and substr(). I'd make a custom function like this one:
grep.custom <- function(x,pattern){
  strt <- gregexpr(pattern,x)[[1]]            # start positions of all matches
  lngth <- attributes(strt)$match.length      # length of each match
  stp <- strt + lngth - 1                     # end positions
  apply(cbind(strt,stp),1,function(i){substr(x,i[1],i[2])})
}
Then :
> grep.custom(test,"sd")
[1] "sd" "sd"
> grep.custom(test,"[0-9]")
[1] "4" "6" "5" "1" "2" "3"
> grep.custom(test,"[a-z]s[a-z]")
[1] "asd" "fsd"
EDIT2 :
for vectors, use the function Vectorize(), eg:
> X <- c("sq25dfgj","sqd265jfm","qs55d26fjm" )
> v.grep.custom <- Vectorize(grep.custom)
> v.grep.custom(X,"[0-9]+")
$sq25dfgj
[1] "25"
$sqd265jfm
[1] "265"
$qs55d26fjm
[1] "55" "26"
and if you want to call grep from the shell, see ?system
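Base R also provides regmatches(), which pairs with gregexpr() and returns only the matched parts, much like grep -o. A minimal sketch on the same vector X, offered as an alternative to grep.custom rather than a replacement for it:

```r
X <- c("sq25dfgj", "sqd265jfm", "qs55d26fjm")

# one list element per input string, holding all matches found in it
matches <- regmatches(X, gregexpr("[0-9]+", X))
matches

# flatten if a single vector of all matches is enough
unlist(matches)
```

Strings without a match yield an empty character vector in the list, so the result always has one element per input string.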
That's because 'grep' for R works on vectors: it does the search on every element and returns the indices of the elements that match. It answers 'which elements in this vector match this pattern?' For example, here we make a vector of 3 elements and then ask 'which elements in this vector contain a digit?'
> test = c("fasdjlk465öfsdj123","nonumbers","123")
> grep("[0-9]",test)
[1] 1 3
Elements 1 and 3 - not 2, which is only characters.
You probably want gsub - substitute anything that doesn't match digits with nothing:
> gsub("[^0-9]","",test)
[1] "465123" "" "123"
All this dancing around with strings is the problem the stringr package was designed to solve.
library(stringr)
str_extract_all('fasdjlk465fsdj123', '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
# It is vectorized too
str_extract_all(rep('fasdjlk465fsdj123',3), '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
[[2]]
[1] "4" "6" "5" "1" "2" "3"
[[3]]
[1] "4" "6" "5" "1" "2" "3"
The motivation behind stringr is to unify string operations in R under two principles:
Use a sane and consistent naming scheme for functions (str_do_something).
Make it so that all the string operations that take one step in other programming languages, yet fifty steps in R, take only one step in R.
grep will only tell you whether the string matches or not.
For instance if you have:
values <- c("abcde", "12345", "abc123", "123abc")
Then
> grep("[0-9]", values)
[1] 2 3 4
This tells you that elements 2, 3 and 4 of the vector match the regexp. You can pass value=TRUE to return the strings rather than the indices.
If you want to check where the match is happening you can use regexpr instead
> regexpr("[0-9]", values)
[1] -1 1 4 1
attr(,"match.length")
[1] -1 1 1 1
which tells you where the first match is happening.
Even better, you can use gregexpr for multiple matches
> gregexpr("[0-9]", values)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1 2 3 4 5
attr(,"match.length")
[1] 1 1 1 1 1
[[3]]
[1] 4 5 6
attr(,"match.length")
[1] 1 1 1
[[4]]
[1] 1 2 3
attr(,"match.length")
[1] 1 1 1
No idea where you get the impression that
> test <- "fasdjlk465öfsdj123"
> grep("[0-9]",test)
[1] 1
returns "fasdjlk465öfsdj123"
If you want to return the matches, you need to break test into its component parts, grep on those and then use the thing returned from grep to index test.
> test <- strsplit("fasdjlk465öfsdj123", "")[[1]]
> matched <- grep("[0-9]", test)
> test[matched]
[1] "4" "6" "5" "1" "2" "3"
Or just return the matched strings directly, depends what you want:
> grep("[0-9]", test, value = TRUE)
[1] "4" "6" "5" "1" "2" "3"
strapply in the gsubfn package can do such extraction:
> library(gsubfn)
> strapply(c("ab34de123", "55x65"), "\\d+", as.numeric, simplify = TRUE)
[,1] [,2]
[1,] 34 55
[2,] 123 65
It's based on the apply paradigm, where the first argument is the object, the second is the modifier (margin for apply, regular expression for strapply) and the third argument is the function to apply to the matches.
str_extract_all(obj, re) in the stringr package is similar to strapply specialized to use c for the function, i.e. it's similar to strapply(obj, re, c).
strapply supports the sets of regular expressions supported by R and also supports tcl regular expressions.
See the gsubfn home page at http://gsubfn.googlecode.com