Replacing an element in a character string by the previous value - r

I have a character string looking like this:
string <- c("1","2","3","","5","6","")
I would like to replace the gaps by the previous value, obtaining a string similar to this:
string <- c("1","2","3","3","5","6","6")
I have adjusted this solution (Replace NA with previous and next rows mean in R) and I do get the correct result:
string <- as.data.frame(string)
ind <- which(string == "")
string$string[ind] <- sapply(ind, function(i) with(string, string[i-1]))
This way is however quite cumbersome and there must be an easier way that does not require me to transform the string to a data frame first. Thanks for your help!

We can use na.locf from zoo after changing the blanks ("") to NA, so that each NA is replaced by the nearest preceding non-NA value:
library(zoo)
na.locf(replace(string, string =="", NA))
#[1] "1" "2" "3" "3" "5" "6" "6"
If there is at most one blank between the elements (no consecutive blanks, and the first element is not blank), then create an index as in the OP's post and replace each blank with the element at that index minus 1:
i1 <- which(string == "")
string[i1] <- string[i1-1]
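
If several blanks can follow each other, the index trick above copies a blank into a blank. A minimal base R sketch that also handles runs of blanks, without zoo, is shown below. It assumes the vector contains no NA values; fill_blanks is a hypothetical helper name used only for illustration.

```r
# Hypothetical helper: carry the last non-blank value forward (base R only).
fill_blanks <- function(x) {
  # Position of each non-blank element, 0 where the element is blank
  idx <- seq_along(x) * (x != "")
  # Running maximum = position of the most recent non-blank element
  idx <- cummax(idx)
  # Leading blanks have nothing before them; return NA for those
  idx[idx == 0] <- NA
  x[idx]
}

string <- c("1", "2", "3", "", "5", "6", "", "")
filled <- fill_blanks(string)
filled
```

Leading blanks come back as NA here, since there is no previous value to copy; whether that is the desired behaviour depends on your data.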

Related

Row names disappear after as.matrix

I notice that if the row names of a data frame are the sequence of numbers from 1 to the number of rows, they disappear after using as.matrix. But the row names are kept if they are not such a sequence.
Here is a reproducible example:
test <- as.data.frame(list(x=c(0.1, 0.1, 1), y=c(0.1, 0.2, 0.3)))
rownames(test)
# [1] "1" "2" "3"
rownames(as.matrix(test))
# NULL
rownames(as.matrix(test[c(1, 3), ]))
# [1] "1" "3"
Does anyone have an idea on what is going on?
Thanks a lot
You can pass rownames.force = TRUE when you call as.matrix (the shorter rownames = TRUE relies on partial argument matching):
> as.matrix(test, rownames.force = TRUE)
x y
1 0.1 0.1
2 0.1 0.2
3 1.0 0.3
First and foremost, we always have a numerical index for sub-setting that won't disappear and that we should not confuse with row names.
as.matrix(test)[c(1, 3), ]
# x y
# [1,] 0.1 0.1
# [2,] 1.0 0.3
What is going on when you use rownames() can be seen from how dimnames is used in the source code of base::rownames:
function (x, do.NULL = TRUE, prefix = "row")
{
    dn <- dimnames(x)
    if (!is.null(dn[[1L]]))
        dn[[1L]]
    else {
        nr <- NROW(x)
        if (do.NULL)
            NULL
        else if (nr > 0L)
            paste0(prefix, seq_len(nr))
        else character()
    }
}
which yields NULL for dimnames(as.matrix(test))[[1]] but yields "1" "3" in the case of dimnames(as.matrix(test[c(1, 3), ]))[[1]].
Note that the method base:::row.names.data.frame is applied in the case of data frames, e.g. rownames(test).
That should explain what happens. Fortunately you did not ask why, which would be rather opinion-based.
There is a difference between 'automatic' and non-'automatic' row names.
Here is a motivating example:
automatic
test <- as.data.frame(list(x = c(0.1,0.1,1), y = c(0.1,0.2,0.3)))
rownames(test)
# [1] "1" "2" "3"
rownames(as.matrix(test))
# NULL
non-'automatic'
test1 <- test
rownames(test1) <- as.character(1:3)
rownames(test1)
# [1] "1" "2" "3"
rownames(as.matrix(test1))
# [1] "1" "2" "3"
You can read about this in e.g. ?data.frame, which mentions the behavior you discovered at the end:
If row.names was supplied as NULL or no suitable component was found the row names are the integer sequence starting at one (and such row names are considered to be ‘automatic’, and not preserved by as.matrix).
When you call test[c(1, 3), ] then you create non-'automatic' rownames implicitly, which is kinda documented in ?Extract.data.frame:
If `[` returns a data frame it will have unique (and non-missing) row names.
(type `[.data.frame` into your console if you want to go deeper here.)
Others showed what this means for your case already, see the argument rownames.force in ?matrix:
rownames.force: ... The default, NA, uses NULL rownames if the data frame has ‘automatic’ row.names or for a zero-row data frame.
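
To check whether a data frame currently has 'automatic' row names, one option is base R's .row_names_info(), which returns a negative count for automatic row names. The snippet below is a small sketch using the same test data as in the question; it only illustrates the distinction described above.

```r
test <- data.frame(x = c(0.1, 0.1, 1), y = c(0.1, 0.2, 0.3))

# Negative value: the row names are 'automatic'
auto_info <- .row_names_info(test)

# Positive value: subsetting rows 1 and 3 made the row names non-automatic
subset_info <- .row_names_info(test[c(1, 3), ])

# The default (rownames.force = NA) drops automatic row names ...
default_names <- rownames(as.matrix(test))

# ... while rownames.force = TRUE keeps them as character row names
forced_names <- rownames(as.matrix(test, rownames.force = TRUE))
```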
The difference dataframe vs. matrix:
?rownames
rownames(x, do.NULL = TRUE, prefix = "row")
The important part is do.NULL, whose default is TRUE. The documentation says:
If do.NULL is FALSE, a character vector (of length NROW(x) or NCOL(x)) is returned in any case,
If the replacement versions are called on a matrix without any existing dimnames, they will add suitable dimnames. But constructions such as
rownames(x)[3] <- "c"
may not work unless x already has dimnames, since this will create a length-3 value from the NULL value of rownames(x).
For me that means (maybe not correct or professional) that if you apply rownames() to a matrix, the row names must have been set beforehand; otherwise you get NULL, because that is the default behaviour of rownames().
In your example you experience this kind of behaviour:
Here you declare row 1 and 3 and get 1 and 3
rownames(as.matrix(test[c(1, 3), ]))
[1] "1" "3"
Here you declare nothing and get NULL because NULL is the default.
rownames(as.matrix(test))
NULL
You can overcome this by declaring before:
rownames(test) <- 1:3
rownames(as.matrix(test))
[1] "1" "2" "3"
or you could do:
rownames(as.matrix(test), do.NULL = FALSE)
[1] "row1" "row2" "row3"
rownames(as.matrix(test), do.NULL = FALSE, prefix="")
[1] "1" "2" "3"
Similar effect with rownames.force:
rownames.force
logical indicating if the resulting matrix should have character (rather than NULL) rownames. The default, NA, uses NULL rownames if the data frame has ‘automatic’ row.names or for a zero-row data frame.
dimnames(matrix_test)
I don't know exactly why it happens, but one way to fix it is to include the argument rownames.force = TRUE in as.matrix:
rownames(as.matrix(test, rownames.force = TRUE))

Extract numbers after a pattern in vector of characters

I'm trying to extract values from a vector of strings. Each string in the vector (there are about 2300 of them) follows the pattern of the example below:
"733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
What I'd like is to extract the numbers following the pattern "Sent. " and place them into a separate vector. For the example, I'd like to extract "1311531".
I'm having trouble using gsub to accomplish this.
library(tidyverse)
Data <- c("PASTE YOUR WHOLE STRING")
str_locate(Data, "Sent. ")
Reference <- str_locate_all(Data, "Sent. ") %>% as.data.frame()
Reference %>% names() #Returns [1] "start" "end"
Reference <- Reference %>% mutate(end = end +1)
YourNumbers <- substr(Data,start = Reference$end[1], stop = Reference$end[1])
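for (i in 2:dim(Reference)[1]){
  Temp <- substr(Data,start = Reference$end[i], stop = Reference$end[i])
  YourNumbers <- paste(YourNumbers, Temp, sep = "")
}
YourNumbers #Returns "1234567", i.e. the character right after each "Sent. " (the sentence number), not the rating between the underscores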
We can use str_match_all from stringr to get all the numbers that follow "Sent".
str_match_all(text, "Sent.*?_+(\\d+)")[[1]][, 2]
#[1] "1" "3" "1" "1" "5" "3" "1"
A base R option using strsplit and sub
lapply(strsplit(ss, "\\|"), function(x)
    sub("Sent.+: _+(\\d+)_+", "\\1", x[grepl("^Sent", x)]))
#[[1]]
#[1] "1" "3" "1" "1" "5" "3" "1"
Sample data
ss <- "733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
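
The question asks for one collapsed value such as "1311531" per string, across a whole vector. A base R sketch of that last step is given below, using regmatches() and gregexpr() with a Perl-style \K to keep only the digits. The two input strings are shortened, made-up stand-ins for the real data, and the pattern assumes every rating sits after ": " and a run of underscores within a "Sent" field.

```r
# Shortened, hypothetical stand-ins for the real strings
ss <- c(
  "733|Overall (-2 to 2): ___2___|Sent. 1 (ANALYSIS BY...): __1__|Sent. 2 (Bail is...): _3___|NA|",
  "734|Overall (-2 to 2): ___0___|Sent. 1 (In 2006,...): __5__|Sent. 2 (That legislation...): _2___|NA|"
)

# Within each "Sent..." field (no "|" crossed), skip to ": ___" and
# keep only the digits that follow (\K discards what was matched before it)
m <- regmatches(ss, gregexpr("Sent[^|]*: _+\\K\\d+", ss, perl = TRUE))

# One collapsed string of ratings per input string
ratings <- vapply(m, paste, character(1), collapse = "")
ratings
```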

How to remove constant parts of a string in R

I would like to remove constant (shared) parts of a string automatically and retain the variable parts.
E.g. I have a column with the following values:
D20181116_Basel-Take1_digital
D20181116_Basel-Take2_digital
D20181116_Basel-Take3_digital
D20181116_Basel-Take4_digital
D20181116_Basel-Take5_digital
D20181116_Basel-Take5a_digital
How can I automatically get the following for any similar column (here by removing "D20181116_Basel-Take" and "_digital")? The code should find the constant parts itself and remove them.
1
2
3
4
5
5a
I hope this is clear. Thank you very much.
You can do it with a regex: it will remove everything before 'Take' and after the underscore character:
vec <- c("D20181116_Basel-Take1_digital",
         "D20181116_Basel-Take2_digital",
         "D20181116_Basel-Take3_digital",
         "D20181116_Basel-Take4_digital",
         "D20181116_Basel-Take5_digital",
         "D20181116_Basel-Take5a_digital")
sub(".*?Take(.*?)_.*", "\\1", vec)
# [1] "1" "2" "3" "4" "5" "5a"
With gsub(), assuming you have a data frame df and want to change the column named column:
df$column <- gsub("^D20181116_Basel-Take","",df$column)
df$column <- gsub("_digital$","",df$column)
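
Both answers above hard-code the constant parts. If the code should discover them itself, one possible base R sketch is to strip the longest prefix and the longest suffix shared by all strings. The helper names common_prefix_len and strip_constant are hypothetical, and the approach only handles shared text at the start and end, not constant pieces in the middle.

```r
# Hypothetical helper: length of the longest prefix shared by all strings
common_prefix_len <- function(x) {
  chars <- strsplit(x, "")
  n <- min(lengths(chars))
  if (n == 0) return(0)
  # TRUE at position i if every string has the same i-th character
  same <- vapply(seq_len(n), function(i) {
    length(unique(vapply(chars, `[`, character(1), i))) == 1
  }, logical(1))
  # Count the leading run of TRUEs
  sum(cumprod(same))
}

# Hypothetical helper: drop the shared prefix and the shared suffix
strip_constant <- function(x) {
  x <- substring(x, common_prefix_len(x) + 1)
  reversed <- vapply(strsplit(x, ""),
                     function(ch) paste(rev(ch), collapse = ""),
                     character(1))
  suffix_len <- common_prefix_len(reversed)
  substring(x, 1, nchar(x) - suffix_len)
}

vec <- c("D20181116_Basel-Take1_digital", "D20181116_Basel-Take2_digital",
         "D20181116_Basel-Take3_digital", "D20181116_Basel-Take4_digital",
         "D20181116_Basel-Take5_digital", "D20181116_Basel-Take5a_digital")
variable_part <- strip_constant(vec)
variable_part
```

Note that if the variable parts themselves happen to share leading or trailing characters (for example all takes being 10 to 19), those characters would be treated as constant too.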

Match unlist output against set of column names

Here is sample data:
main.data <- c("id","num","open","close","char","gene","valid")
data.step.1 <- list(id="12",num="00",open="01-01-2015",char="yes",gene="1234",valid="NA")
match.step.1 <- unlist(data.step.1)
The main.data are the column names of all possible column data.
I have a loop that streams data step-by-step, which could have missing column (list name).
I would like to match each step (data.step.n) against the master column names (main.data).
Desired output:
id num open close char gene valid
"12" "00" "01-01-2015" "" "yes" "1234" "NA"
How can I unlist the data and match it against the names so that a missing entry (close in this case) is filled with an empty string?
Try
v1 <- setNames(rep('', length(main.data)), main.data)
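# note: this fills positions in order, so it assumes the names in match.step.1 appear in the same order as in main.data
v1[main.data %in% names(match.step.1)] <- match.step.1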
Or use match
v1[match(names(match.step.1), main.data)] <- match.step.1
Or just use [
v2 <- setNames(match.step.1[main.data], main.data)
v2[is.na(v2)] <- ''
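
Since the question describes a loop over many steps, the name-indexing approach can be wrapped in a small function and applied to each step. The sketch below assumes every step is a non-empty list of character values; align_step and the second step's contents are made up for illustration.

```r
main.data <- c("id", "num", "open", "close", "char", "gene", "valid")

# Hypothetical helper: align one step to the master column names
align_step <- function(step, cols) {
  v <- unlist(step)[cols]   # missing names come back as NA
  v[is.na(v)] <- ""         # fill real NAs only; the string "NA" is kept
  setNames(v, cols)
}

steps <- list(
  list(id = "12", num = "00", open = "01-01-2015", char = "yes",
       gene = "1234", valid = "NA"),
  list(id = "13", close = "02-01-2015")   # made-up second step
)

# One row per step, one column per master name
out <- do.call(rbind, lapply(steps, align_step, cols = main.data))
out
```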

Delete a row in a dataframe and get a dataframe back

I want to "subset" this dataframe and remove the second row using the rowname
myDataFrame <- as.data.frame(rnorm(5))
rownames(myDataFrame)
#"1" "2" "3" "4" "5"
myDataFrame[-2,]
# 0.2706859 0.9708845 0.7559821 -0.2063368
I want to be able to get the results above, but in a data frame form (with the original row names). I looked around and it seems the way to select by rowname is to use the which function, but I'm not sure how it would work in this context.
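With a single-column data frame, [ simplifies the result to a vector by default. You can add the argument drop = FALSE to keep a data frame:
> myDataFrame[-2, , drop = FALSE]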
rnorm(5)
1 1.9602780
3 0.1078827
4 -0.8517422
5 -0.8300695
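
Since the question asks about removing the row by its row name rather than by position, a short sketch of that variant follows. It assumes the row to drop is named "2"; set.seed() is only there to make the example reproducible.

```r
set.seed(1)
myDataFrame <- as.data.frame(rnorm(5))

# Without drop = FALSE a one-column data frame collapses to a plain vector
as_vector <- myDataFrame[-2, ]

# Select by row name and keep the data frame structure
kept <- myDataFrame[rownames(myDataFrame) != "2", , drop = FALSE]
kept
```

Comparing row names directly avoids which(): the logical vector can be used as the row index as it is.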

Resources