I've been trying to understand how to deal with the output of strsplit a bit better. I often have data such as this that I wish to split:
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")
#[1] "144/4/5" "154/2" "146/3/5" "142" "143/4" "DNB" "90"
After splitting that the results are as follows:
strsplit(mydata, "/")
#[[1]]
#[1] "144" "4" "5"
#[[2]]
#[1] "154" "2"
#[[3]]
#[1] "146" "3" "5"
#[[4]]
#[1] "142"
#[[5]]
#[1] "143" "4"
#[[6]]
#[1] "DNB"
#[[7]]
#[1] "90"
I know from the strsplit help page that trailing empty strings are not produced. Each of my results will therefore have 1, 2 or 3 elements, depending on the number of "/" to split by.
Getting the first element is trivial:
sapply(strsplit(mydata, "/"), "[[", 1)
#[1] "144" "154" "146" "142" "143" "DNB" "90"
But I am not sure how to get the 2nd, 3rd, and so on when the results have unequal numbers of elements:
sapply(strsplit(mydata, "/"), "[[", 2)
# Error in FUN(X[[4L]], ...) : subscript out of bounds
A working solution would return the following:
#[1] "4" "2" "3" "NA" "4" "NA" "NA"
This is a relatively small example. I could do some for loop very easily on these data, but for real data with 1000s of observations to run the strsplit on and dozens of elements produced from that, I was hoping to find a more generalizable solution.
At least for 1D vectors, [ seems to return NA when "i > length(x)", whereas [[ raises an error.
x = runif(5)
x[6]
#[1] NA
x[[6]]
#Error in x[[6]] : subscript out of bounds
Digging a bit, do_subset_dflt (i.e. [) calls ExtractSubset where we notice that when a wanted index ("ii") is "> length(x)" NA is returned (a bit modified to be clean):
if(0 <= ii && ii < nx && ii != NA_INTEGER)
result[i] = x[ii];
else
result[i] = NA_INTEGER;
On the other hand do_subset2_dflt (i.e. [[) returns an error if the wanted index ("offset") is "> length(x)" (modified a bit to be clean):
if(offset < 0 || offset >= xlength(x)) {
if(offset < 0 && (isNewList(x)) ...
else errorcall(call, R_MSG_subs_o_b);
}
where #define R_MSG_subs_o_b _("subscript out of bounds")
(I'm not sure about the above code snippets but they do seem relevant based on their returns)
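Applied to the question, the difference between [ and [[ described above can be sketched as follows. This is only a minimal illustration; the names parts, second and third are made up for the example.

```r
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")
parts <- strsplit(mydata, "/")

# `[` pads with NA when the index is past the end of a piece,
# whereas `[[` would stop with "subscript out of bounds".
second <- sapply(parts, "[", 2)
third  <- sapply(parts, "[", 3)

second
#[1] "4" "2" "3" NA "4" NA NA
```

The NA values here are real missing values (NA_character_), not the string "NA".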
Try this:
> read.table(text = mydata, sep = "/", as.is = TRUE, fill = TRUE)
V1 V2 V3
1 144 4 5
2 154 2 NA
3 146 3 5
4 142 NA NA
5 143 4 NA
6 DNB NA NA
7 90 NA NA
If you want to treat DNB as an NA, add the argument na.strings = "DNB".
If you really want to use strsplit then try this:
> do.call(rbind, lapply(strsplit(mydata, "/"), function(x) head(c(x,NA,NA), 3)))
[,1] [,2] [,3]
[1,] "144" "4" "5"
[2,] "154" "2" NA
[3,] "146" "3" "5"
[4,] "142" NA NA
[5,] "143" "4" NA
[6,] "DNB" NA NA
[7,] "90" NA NA
Note: Using alexis_laz's observation that x[i] returns NA if i is not in 1:length(x) the last line of code above could be simplified to:
t(sapply(strsplit(mydata, "/"), "[", 1:3))
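When the number of pieces is not known in advance, the hard-coded 1:3 can be replaced by the longest length found in the data. The following is a sketch under that assumption; n_col and padded are illustrative names, and lengths() requires R 3.2.0 or later.

```r
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")
parts <- strsplit(mydata, "/")

# The widest result decides the number of columns.
n_col <- max(lengths(parts))

# Indexing each piece with 1:n_col right-pads the short ones with NA;
# rbind then stacks the equal-length rows into a character matrix.
padded <- do.call(rbind, lapply(parts, "[", seq_len(n_col)))

dim(padded)
#[1] 7 3
```

Using do.call(rbind, ...) rather than t(sapply(...)) keeps one row per input string even if every string happens to yield a single piece.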
You could use regex (if it is allowed)
library(stringr)
# perl() is deprecated in stringr; a plain pattern string already supports lookbehind
str_extract(mydata, "(?<=\\d/)\\d+")
#[1] "4" "2" "3" NA "4" NA NA
str_extract(mydata, "(?<=/\\d/)\\d+")
#[1] "5" NA "5" NA NA NA NA
You can assign the length inside sapply, resulting in NA where the current length is shorter than the assigned length.
s <- strsplit(mydata, "/")
sapply(s, function(x) { length(x) <- 3; x[2] })
# [1] "4" "2" "3" NA "4" NA NA
Then you can add a second indexing argument with mapply
m <- max(sapply(s, length))
mapply(function(x, y, z) { length(x) <- z; x[y] }, s, 2, m)
# [1] "4" "2" "3" NA "4" NA NA
Struggling with string handling in R...
I've got a column of strings in an R data frame. Each one contains the "=" character once and only once. I'd like to know the position of the "=" character in each element of the column, as a step to splitting the column into two separate columns (one for the bit before the "=" and one for the bit after the "="). Can anyone help please? I'm sure it's simple but I'm struggling to find the answer.
For example, if I have:
x <- data.frame(string = c("aa=1", "aa=2", "aa=3", "b=1", "b=2", "abc=5"))
I'd like a bit of code to return
(3, 3, 3, 2, 2, 4)
Thank you.
To get the position of "=" you can use the regexpr function:
regexpr("=", x$string)
#[1] 3 3 3 2 2 4
#attr(,"match.length")
#[1] 1 1 1 1 1 1
#attr(,"useBytes")
#[1] TRUE
However, as @Michael stated, if your goal is to split the string you can use strsplit (wrap the column in as.character() first if it is a factor):
strsplit(x$string, "=")
#[[1]]
#[1] "aa" "1"
#
#[[2]]
#[1] "aa" "2"
#
#[[3]]
#[1] "aa" "3"
#
#[[4]]
#[1] "b" "1"
#
#[[5]]
#[1] "b" "2"
#
#[[6]]
#[1] "abc" "5"
Or combine it with do.call and rbind to create a two-column character matrix:
do.call(rbind, strsplit(x$string, "="))
# [,1] [,2]
#[1,] "aa" "1"
#[2,] "aa" "2"
#[3,] "aa" "3"
#[4,] "b" "1"
#[5,] "b" "2"
#[6,] "abc" "5"
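If the positions themselves are wanted as a plain integer vector and then used for the split, something along these lines should work. It is a sketch only: s stands in for the string column, and pos, res, key and value are names invented for the example.

```r
s <- c("aa=1", "aa=2", "aa=3", "b=1", "b=2", "abc=5")

# as.integer() drops the match.length/useBytes attributes;
# fixed = TRUE treats "=" as a literal rather than a regex.
pos <- as.integer(regexpr("=", s, fixed = TRUE))

# Cut each string on either side of the "=" position.
res <- data.frame(
  key   = substr(s, 1, pos - 1),
  value = substr(s, pos + 1, nchar(s)),
  stringsAsFactors = FALSE
)

pos
#[1] 3 3 3 2 2 4
```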
Here's a way to do it with stringr:
library(stringr)
str_locate(x$string, "=")[,1]
You can use gregexpr:
unlist(lapply(gregexpr(pattern = '=', x$string), min))
[1] 3 3 3 2 2 4
In Base R you can do:
as.numeric(lapply(strsplit(as.character(x$string), ""), function(x) which(x == "=")))
[1] 3 3 3 2 2 4
Here is another way to obtain a two-column result, with the first column containing the characters before = and the second the characters after =. It does not require finding the position of the = character. Note that t() returns a character matrix rather than a data frame.
library(stringr)
t(as.data.frame(strsplit(x$string, "=")))
# [,1] [,2]
#c..aa....1.. "aa" "1"
#c..aa....2.. "aa" "2"
#c..aa....3.. "aa" "3"
#c..b....1.. "b" "1"
#c..b....2.. "b" "2"
#c..abc....5.. "abc" "5"
Some may find this more readable
library(tidyverse)
x %>%
mutate(
number = string %>% str_extract('[:digit:]+'),
text = string %>% str_extract('[:alpha:]+')
) %>%
as_tibble()
# A tibble: 6 x 3
string number text
<fct> <chr> <chr>
1 aa=1 1 aa
2 aa=2 2 aa
3 aa=3 3 aa
4 b=1 1 b
5 b=2 2 b
6 abc=5 5 abc
I am using recode and it works fine in that it replaces every matching value, but I don't want it to replace the non-matching values. Can someone please help?
Sample dataset:
x <- c(1:5, NA)
[1] 1 2 3 4 5 NA
Now using recode:
recode(x, '1' = "Hello", .default = "World")
[1] "Hello" "World" "World" "World" "World" NA
But this is not what I need. I want it to change only '1' and leave the remaining values as they are, like this:
[1] "Hello" 2 3 4 5 NA
We can use the assignment
x[x==1] <- "Hello"
x
#[1] "Hello" "2" "3" "4" "5" NA
The types need to be the same, so it would be
recode(as.character(x), "1" = "Hello")
# [1] "Hello" "2" "3" "4" "5" NA
But you could also just use base R's replace().
replace(x, x == 1, "Hello")
# [1] "Hello" "2" "3" "4" "5" NA
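The reason every answer ends up with "2", "3", ... as strings is that an atomic vector holds a single type. The sketch below makes that conversion explicit and uses %in% so the NA cannot leak into the logical index; y and z are illustrative names, and the list variant is just one possible way to keep the numbers numeric.

```r
x <- c(1:5, NA)

# An atomic vector has one type, so adding "Hello" means
# every remaining value has to become a string as well.
y <- as.character(x)

# %in% never returns NA, so the missing value is left untouched.
y[x %in% 1] <- "Hello"
y
# [1] "Hello" "2" "3" "4" "5" NA

# A list can hold mixed types if the numbers must stay numeric.
z <- as.list(x)
z[x %in% 1] <- list("Hello")
```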
I have a sequence of events, coded as A,B, and C. For each element I need to count how many times this element was repeated before but if it is not repeated, the counter should decrease by one for each row. On the first encounter of each item the counter for it is zero. For example:
x<-c('A','A','A','B','C','C','A','B','A','C')
y<-c(0,1,2,0,0,1,-2,-4,-4,-3)
cbind(x,y)
x y
[1,] "A" "0"
[2,] "A" "1"
[3,] "A" "2"
[4,] "B" "0"
[5,] "C" "0"
[6,] "C" "1"
[7,] "A" "-2"
[8,] "B" "-4"
[9,] "A" "-4"
[10,] "C" "-3"
I need to generate column y from x. I know that I can use rle for run length, but I don't know how to get time since the last encounter of specific event to make counter decrease.
I think this is sort of an R way to solve the problem. We can calculate the index of all different elements in x the same way, offset it by its initial position and then combine them together.
Calculate the index separately for each unique element in x:
library(data.table)
sepIndex <- lapply(unique(x), function(i) {
s = cumsum(ifelse(duplicated(rleid(x == i)) & x == i, 1, -1)) + min(which(x == i));
# use `rleid` with `duplicated` to find out the duplicated elements in each block.
# and assign `1` to each duplicated element and `-1` otherwise and use cumsum for cumulative index
# offset the index by the initial position of the element `min(which(x == i))`
replace(s, x != i, NA)
})
Which gives us a list of index for each unique element:
sepIndex
# [[1]]
# [1] 0 1 2 NA NA NA -2 NA -4 NA
# [[2]]
# [1] NA NA NA 0 NA NA NA -4 NA NA
# [[3]]
# [1] NA NA NA NA 0 1 NA NA NA -3
Combining the list into one vector with the Reduce function should give you what you need:
Reduce(function(x, y) ifelse(is.na(x), y, x), sepIndex)
# [1] 0 1 2 0 0 1 -2 -4 -4 -3
There is another way using base R
positions <- sapply(unique(x),function(t) which(x %in% t))
values <- sapply(sapply(positions,diff),function(s) c(0,cumsum(ifelse(s>1,-s,s))))
df <- data.frame(positions=unlist(positions),values=unlist(values))
df[with(df,order(positions)),2]
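For cross-checking the vectorised answers, the rule in the question can also be written as a plain loop. This is a sketch under my reading of the rule (add 1 when the same event was on the previous row, otherwise subtract the number of rows since it was last seen); event_counter, last_pos and last_val are hypothetical names.

```r
x <- c('A','A','A','B','C','C','A','B','A','C')

event_counter <- function(x) {
  y <- integer(length(x))
  last_pos <- list()   # last row at which each event was seen
  last_val <- list()   # counter value recorded at that row
  for (i in seq_along(x)) {
    key <- x[i]
    if (is.null(last_pos[[key]])) {
      y[i] <- 0L                       # first encounter starts at zero
    } else {
      gap <- i - last_pos[[key]]
      y[i] <- if (gap == 1L) last_val[[key]] + 1L else last_val[[key]] - gap
    }
    last_pos[[key]] <- i
    last_val[[key]] <- y[i]
  }
  y
}

event_counter(x)
# [1] 0 1 2 0 0 1 -2 -4 -4 -3
```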
I have a data frame which has a column:
> head(df$lengths,5)
[[1]]
[1] "28"
[[2]]
[1] "33"
[[3]]
[1] "47" "37" "42" "41"
[[4]]
[1] "41" "39" "64" "54"
[[5]]
[1] "45" "22" "23"
I would like to operate on the elements in the vectors, to obtain the ratios of the element(i) to the element(i-k) in each vector. Where a ratio cannot be obtained because element(i-k) has invalid index, the result should be NA. The desired output is like this, where I specified k=1:
[[1]]
[1] NA
[[2]]
[1] NA
[[3]]
[1] NA (37/47) (42/37) (41/42)
[[4]]
[1] NA (39/41) (64/39) (54/64)
[[5]]
[1] NA (22/45) (23/22)
as for k=2:
[[1]]
[1] NA
[[2]]
[1] NA
[[3]]
[1] NA NA (42/47) (41/37)
[[4]]
[1] NA NA (64/41) (54/39)
[[5]]
[1] NA NA (23/45)
I have little clue how to approach this. I would think of writing some loops, but that seems complicated in R. Please advise.
We loop through the list elements with lapply(..). If the length of a list element is 1, we return NA; otherwise we divide each value by the one before it and prepend NA. We convert to numeric because the original list elements are of class character.
lapply(df$lengths, function(x) if(length(x)==1) NA
else c(NA, as.numeric(x[-1])/as.numeric(x[-length(x)])))
Update
We could use the lag/lead function in dplyr/data.table for k values greater than 1.
library(dplyr)
k <- 2
lapply(df$lengths, function(x) {x <- as.numeric(x)
if(length(x)==1) NA
# divide first, then pad with k leading NAs (avoids relying on recycling)
else c(rep(NA,k), na.omit(lead(x,k))/na.omit(lag(x,k)))})
#[[1]]
#[1] NA
#[[2]]
#[1] NA
#[[3]]
#[1] NA NA 0.893617 1.108108
#[[4]]
#[1] NA NA 1.560976 1.384615
#[[5]]
#[1] NA NA 0.5111111
Or without using any packages, we can do with head/tail functions
lapply(df$lengths, function(x) {x <- as.numeric(x)
if(length(x)==1) NA
else c(rep(NA, k), tail(x, -k)/head(x,-k))})
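The same idea can be wrapped in a small helper that also covers vectors no longer than k. This is a sketch rather than the answerer's code: lag_ratio is a hypothetical name, and lengths_col is a stand-in for df$lengths built from the values shown in the question.

```r
lengths_col <- list("28", "33", c("47", "37", "42", "41"),
                    c("41", "39", "64", "54"), c("45", "22", "23"))

lag_ratio <- function(v, k = 1) {
  v <- as.numeric(v)
  n <- length(v)
  # Not enough elements to look k steps back: everything is NA.
  if (n <= k) return(rep(NA_real_, n))
  # element i divided by element i - k, padded with k leading NAs
  c(rep(NA_real_, k), v[(k + 1):n] / v[1:(n - k)])
}

res1 <- lapply(lengths_col, lag_ratio, k = 1)
res2 <- lapply(lengths_col, lag_ratio, k = 2)
```

Each result has the same length as its input, so it can be stored back alongside the original list column.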
I have a long list, whose elements are lists of length one containing a character vector. These vectors can have different lengths.
The elements of the vectors are of type character, but I would like to convert them to numeric, as they actually represent numbers.
I would like to create a matrix, or a data frame, whose rows are the vectors above, converted into numeric. Since they have different lengths, the "right ends" of each row could be filled with NA.
I am trying to use the function rbind.fill.matrix from the library {plyr}, but the only thing I could get is a long numeric 1-d array with all the numbers inside, instead of a matrix.
This is the best I could do to get a list of numeric (dat here is my original list):
dat<-sapply(sapply(dat,unlist),as.numeric)
How can I create the matrix now?
Thank you!
I would do something like:
library(stringi)
temp <- stri_list2matrix(dat, byrow = TRUE)
final <- `dim<-`(as.numeric(temp), dim(temp))
The basic idea is that stri_list2matrix will convert the list to a matrix, but it would still be a character matrix. as.numeric would remove the dimensional attributes of the matrix, so we add those back in with:
`dim<-` ## Yes, the backticks are required -- or at least quotes
POC:
dat <- list(1:2, 1:3, 1:2, 1:5, 1:6)
dat <- lapply(dat, as.character)
dat
# [[1]]
# [1] "1" "2"
#
# [[2]]
# [1] "1" "2" "3"
#
# [[3]]
# [1] "1" "2"
#
# [[4]]
# [1] "1" "2" "3" "4" "5"
#
# [[5]]
# [1] "1" "2" "3" "4" "5" "6"
library(stringi)
temp <- stri_list2matrix(dat, byrow = TRUE)
final <- `dim<-`(as.numeric(temp), dim(temp))
final
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 2 NA NA NA NA
# [2,] 1 2 3 NA NA NA
# [3,] 1 2 NA NA NA NA
# [4,] 1 2 3 4 5 NA
# [5,] 1 2 3 4 5 6
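If adding stringi is not an option, a base R sketch of the same padding idea could look like this. The toy dat below mimics the structure described in the question (each element a length-one list holding a character vector); nums, n_col and final are illustrative names.

```r
dat <- list(list(c("1", "2")), list(c("1", "2", "3")), list("7"))

# Flatten each length-one inner list and convert to numeric.
nums <- lapply(dat, function(el) as.numeric(unlist(el)))

n_col <- max(lengths(nums))

# Indexing past the end yields NA, which right-pads the short rows;
# rbind then stacks the equal-length rows into a numeric matrix.
final <- do.call(rbind, lapply(nums, function(v) v[seq_len(n_col)]))

dim(final)
# [1] 3 3
```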