What I need to do is very similar to what the function below does:
library(data.table)  # tstrsplit() comes from data.table
x = c("abcde", "ghij", "klmnopq")
tstrsplit(x, "", fixed=TRUE, keep=c(1,3,5), names=c('first','second','third'))
However, I would like to be able to return strings using ranges of values. For example, I would like to specify that in first I want to have the first two letters for each element.
Thus instead of having:
$first
[1] "a" "g" "k"
$second
[1] "c" "i" "m"
$third
[1] "e" NA "o"
The output should look like
$first
[1] "ab" "gh" "kl"
$second
[1] "c" "i" "m"
$third
[1] "e" NA "o"
Background:
I have a large .txt file of records and a lookup table that gives, for each attribute, its start position and its expected maximum width. The txt file looks like:
James Brown M 01-01-1970
And then in a separate file I have a lookup table that says:
Field Start width
Name 1 7
FamilyN 9 7
Gender 11 1
Incidentally, I would appreciate any feedback on the best way to import this type of large .txt file. I feel like read.table is inappropriate since it tries to reduce to a dataframe format which is not what these files really are.
Something like this maybe:
x = c("abcde", "ghij", "klmnopq")
library(tidyverse)
list(c(1,3,5), c(2,1,1)) %>%
pmap(~ substr(x, .x, .x + .y - 1) %>% replace(., .=="", NA))
[[1]]
[1] "ab" "gh" "kl"
[[2]]
[1] "c" "i" "m"
[[3]]
[1] "e" NA "o"
I've hardcoded the positions. Per @MrFlick's comment, if you have a large number of strings, you'll need some strategy for deciding on the character positions so that you can automate it, rather than hardcoding it.
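One way to avoid hardcoding is to drive the extraction from the lookup table itself. The following is only a sketch in base R: the `lookup` data frame, its column names (`Field`, `Start`, `Width`) and the helper `extract_fields` are assumptions made for illustration, not part of the original question.

```r
# Hypothetical lookup table: one row per field, with start position and width
lookup <- data.frame(
  Field = c("first", "second", "third"),
  Start = c(1, 3, 5),
  Width = c(2, 1, 1),
  stringsAsFactors = FALSE
)

x <- c("abcde", "ghij", "klmnopq")

# Hypothetical helper: cut every string at each (Start, Width) pair
extract_fields <- function(x, lookup) {
  out <- Map(function(start, width) {
    piece <- substr(x, start, start + width - 1)
    piece[!nzchar(piece)] <- NA  # positions past the end of a string become NA
    piece
  }, lookup$Start, lookup$Width)
  setNames(out, lookup$Field)
}

res <- extract_fields(x, lookup)
res
```

For a real fixed-width file, the lookup table could be read from disk first and the widths could also be passed to `utils::read.fwf()`, which is designed for this kind of input.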
Could someone please enlighten me why subtracting an empty vector in R results in the whole content of a data frame being deleted? Just to give an example
WhichInstances2 <- which(JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
JointProcedures3 <-JointProcedures3[-WhichInstances2,]
This gives me all blanks in JointProcedures3 if WhichInstances2 is empty (that is, no JointID matched), but it should simply give me what JointProcedures3 was before those lines of code.
This is not the first time it has happened to me. I have asked my supervisor, it has happened to him as well, and he just thinks it is a quirk of R.
Rewriting the code as
WhichInstances2 <- which(JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
if(length(WhichInstances2)>0)
{
JointProcedures3 <-JointProcedures3[-WhichInstances2,]
}
fixes the issue. But in principle it should not have made a scooby of a difference whether that conditional was there or not, since if length(WhichInstances2) was equal to 0, I would simply be subtracting nothing from the original JointProcedures3...
Thanks all for your input.
Let's try a simpler example to see what's happening.
x <- 1:5
y <- LETTERS[1:5]
which(x>4)
## [1] 5
y[which(x>4)]
## [1] "E"
So far so good ...
which(x>5)
## integer(0)
y[which(x>5)]
## character(0)
This is also fine. Now what if we negate? The problem is that integer(0) is a zero-length vector, so -integer(0) is also a zero-length vector, so y[-which(x>5)] is also a zero-length vector.
What can you do about it? Don't use which(); instead use logical indexing directly, and use ! to negate the condition:
y[!(x>5)]
## [1] "A" "B" "C" "D" "E"
In your case:
JointID_OK <- (JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
JointProcedures3 <-JointProcedures3[!JointID_OK,]
For what it's worth, this is section 8.1.13 in the R Inferno, "negative nothing is something".
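The behaviour described above can be checked in a few lines. This is a small sketch that reuses the toy vectors from this answer; the result variable names are made up for illustration.

```r
x <- 1:5
y <- LETTERS[1:5]

idx <- which(x > 5)            # integer(0): no element satisfies the condition
neg_idx <- -idx                # still integer(0), negation changes nothing

dropped_wrong <- y[neg_idx]    # selects nothing at all
kept_logical  <- y[!(x > 5)]   # logical negation keeps every element

length(dropped_wrong)
kept_logical
```

Indexing with a zero-length vector always means "select nothing", regardless of sign, which is why the logical form is the safer habit.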
It seems you are checking for ids in a vector and you intend to remove them from another; probably setdiff is what you are looking for.
Consider a vector of the lowercase letters of the alphabet (it's an R built-in) from which we want to remove any entry that matches something that is not in there ("ab"). As programmers we would wish for nothing to be removed, keeping our 26 letters.
# won't work: which() returns integer(0), and a negated empty index selects nothing
letters[ - which(letters=="ab")]
# works: setdiff() compares values, so a value that is absent removes nothing
setdiff(letters, "ab")
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
[22] "v" "w" "x" "y" "z"
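If you need to keep working with row indices, as in the question, setdiff can also be applied to the index positions rather than to the values. The sketch below assumes a small made-up data frame `df` standing in for JointProcedures3, and a hypothetical helper `keep_rows`.

```r
# Hypothetical stand-in for JointProcedures3
df <- data.frame(JointID = c(10, 11, 12), Side = c("L", "R", "L"))

drop_none <- which(df$JointID %in% c(99))  # integer(0): nothing to drop
drop_one  <- which(df$JointID %in% c(11))  # row 2

# Keep every row whose position is not listed in drop_idx;
# an empty drop_idx leaves the data frame unchanged
keep_rows <- function(df, drop_idx) {
  df[setdiff(seq_len(nrow(df)), drop_idx), , drop = FALSE]
}

res_none <- keep_rows(df, drop_none)
res_one  <- keep_rows(df, drop_one)
```

Because the positive set of rows to keep is computed explicitly, the empty case needs no special `if` guard.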
I have a number of simple files with a single entry per line, and I want to read those files into a list, with the content of each file as a vector.
> list(file_one = c(1,2,3,4), file_two = c(9,99,999))
$file_one
[1] 1 2 3 4
$file_two
[1] 9 99 999
...
This is basically the resulting format I want.
What I have so far is a similar result, but not correct:
> list.files("/home/x/y/z", pattern="^rep.*List$", full.names=TRUE) %>% lapply(read.table)
[[1]]
V1
1 a
2 b
3 c
4 d
How can I read the data in the correct format or transform it from here? - preferably I would have a "pipeline" to read the data:
list files
read files in correct format or
format the read data into a list of named vectors
Perhaps you need something like this
library(tidyverse)
list.files("xyz/", full.names = TRUE) %>%
set_names(basename(.)) %>%
map(read_lines)
#> $`rep1List`
#> [1] "a" "b" "c" "d" "e" "f"
#>
#> $rep2List
#> [1] "e" "f" "g" "h" "i" "j" "k"
#>
#> $rep3List
#> [1] "l" "m" "m" "o" "p" "q" "r" "s"
where each of the files look like this:
Based on the information you gave, I would try something like the code below, using the purrr package:
list.files("/home/x/y/z", pattern="^rep.*List$", full.names = TRUE) %>%
purrr::map_df(., read.table, ADD YOUR ARGUMENTS HERE)
This works for a real-life example of mine. It fails with your made-up file. I would have just commented, but my reputation is too low. ^^
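The same pipeline (list files, read each one, name the results) can also be written in base R. This is a sketch only: the temporary directory and the two demo files are created here purely so the example is self-contained, and their contents are made up.

```r
# Create two small demo files in a temporary directory (illustration only)
tmp <- file.path(tempdir(), "rep_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("a", "b", "c", "d"), file.path(tmp, "rep1List"))
writeLines(c("e", "f", "g"), file.path(tmp, "rep2List"))

# 1. list files, 2. name each path by its file name, 3. read one vector per file
files  <- list.files(tmp, pattern = "^rep.*List$", full.names = TRUE)
result <- lapply(setNames(files, basename(files)), readLines)

result
```

`readLines()` returns character vectors; if the files hold numbers, `scan(f, quiet = TRUE)` could be used as the reading function instead.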
myfunction <- function(x) {
  if (x == "g") {
    g_var <- x
    g_nvar <- length(g_var)
  }
  return(g_nvar)
}
I have written the above script to obtain specific elements out of a list. The argument x will be a list when I will call upon this function but R does not consider x as a list. How can I write a function such that when I provide a list, my output are the elements that I have specified in the function?
> m
[[1]]
[[1]][[1]]
[1] "g" "g" "h" "g" "g" "g" "k" "l"
[[2]]
[[2]][[1]]
[1] "g" "h" "k" "k" "l" "g"
Expected result
[[1]] 5 # No. of g
[[2]] 2 # No. of g
Similarly I would like to obtain numbers for h,k and l also. I am putting m as x while calling the function.
For example: myfunction(m)
Your case is somewhat complicated by the fact that your m is not simply a list of character vectors, but, in your example, a list of 2 lists of 1 vector of characters, as would be generated by
m = list(strsplit("gghgggkl", ""), strsplit("ghkklg", ""))
If we want myfunction to operate on this data structure, we have to refer to the component of the length-1-lists with the operation [[1]] (see x[[1]] below), and, as loki suggested, we can use lapply to work on all components of the outer list, and sum with a logical expression to obtain the desired count:
myfunction = function(m) lapply(m, function(x) sum(x[[1]]=='g'))
myfunction(m)
result:
[[1]]
[1] 5
[[2]]
[1] 2
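To get the counts for h, k and l as well, one option is to tabulate every letter at once instead of testing for a single one. This is a sketch using the same `m` as above; the function name `count_letters` is made up for illustration.

```r
m <- list(strsplit("gghgggkl", ""), strsplit("ghkklg", ""))

# table() counts every distinct letter in the inner character vector
count_letters <- function(m) lapply(m, function(x) table(x[[1]]))

counts <- count_letters(m)
counts

# Pull out a single count, e.g. the number of "g" in the first element
g_first <- as.vector(counts[[1]]["g"])
```

Each element of `counts` is a named table, so individual letters can be looked up by name.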
I need to return the position of an element in a vector based on the element's name. Let's say I have a vector of letters:
myLetters=letters[1:26]
> myLetters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
and what I intend to do is create or find a function that returns the position of the element when called, for example:
myFunction(myLetters["b"])
[1] 2
myFunction(myLetters["z"])
[1] 26
In summary, I need a way to refer to Excel columns by writing the letters of a column (A, B, C, later maybe even AA or further) and to get the column number.
If you want to refer to Excel column names, you could create a reference vector with all possible Excel column names:
eg1 <- expand.grid(LETTERS, LETTERS)
eg2 <- expand.grid(LETTERS, LETTERS, LETTERS)
excelcols <- c(LETTERS, paste0(eg1[[2]], eg1[[1]]), paste0(eg2[[3]], eg2[[2]], eg2[[1]]))
After which you can use which:
> which(excelcols == 'A')
[1] 1
> which(excelcols == 'AB')
[1] 28
> which(excelcols == 'ABC')
[1] 731
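An alternative that does not need a precomputed reference vector is to treat the column name as a base-26 number whose digits are A = 1 to Z = 26. The function name `excel_col_to_num` below is made up for illustration; this is a sketch, not a tested library routine.

```r
# Hypothetical helper: convert a single Excel column name to its number
excel_col_to_num <- function(col) {
  chars  <- strsplit(toupper(col), "")[[1]]   # "ABC" -> "A" "B" "C"
  digits <- match(chars, LETTERS)             # 1 2 3
  powers <- rev(seq_along(digits)) - 1        # 2 1 0
  sum(digits * 26^powers)                     # 1*676 + 2*26 + 3*1
}

excel_col_to_num("A")
excel_col_to_num("AB")
excel_col_to_num("ABC")
```

For single letters this reduces to `match("b", letters)`, which is the position lookup asked for in the question.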
If you need to find the number of times a specific letter occurs, then the following should work:
myLetters = c("a","a", "b")
myFunction = function(myLetters, findLetter){
length(which(myLetters==findLetter))
}
Let's find how many times "a" occurs in myLetters:
myFunction(myLetters, "a")
# [1] 2
Say I have the following dataframe consisting of two vectors containing character strings:
df <- data.frame(
"ID"= c("1a", "1b", "1c", "1d"),
"Codes" = c("BX.MX|GX.WX", "MX.RX|BX.YX", "MX.OX|GX.GX", "MX.OX|YX.OX"),
stringsAsFactors = FALSE)
I'd like a simple way to determine which characters have been used in a given vector. In other words, the output of such a function would reveal:
find.characters(df$Codes) # hypothetical function
[1] "B" "G" "M" "W" "X" "R" "Y" "O" "|" "."
find.characters(df$ID) # hypothetical function
[1] "1" "a" "b" "c" "d"
You can create a custom function to do this. The idea is to split the strings into individual characters (strsplit(v1, '')); the output will be a list. We can unlist it to make it a vector, then get the unique elements. But this is not sorted yet. Based on the example shown, you may want to sort the letters and the other characters separately. So, we use grepl to index the uppercase letters, use this index to sort the two subsets of the vector separately, and concatenate (c) them together.
find.characters <- function(v1){
x1 <- unique(unlist(strsplit(v1, '')))
indx <- grepl('[A-Z]', x1)
c(sort(x1[indx]), sort(x1[!indx]))
}
find.characters(df$Codes)
#[1] "B" "G" "M" "O" "R" "W" "X" "Y" "|" "."
find.characters(df$ID)
#[1] "1" "a" "b" "c" "d"
NOTE: Generally, I would use grepl('[A-Za-z]', x1), but I didn't do that because the expected result for the 'ID' column is different.
find.characters <- function(x) {
  # c(..., recursive = TRUE) flattens the list returned by strsplit()
  unique(c(strsplit(x, split = ""), recursive = TRUE))
}