I have a table with two columns, both of type character:
"FTGS" "JKLP"
"CVVA" "CVVA"
"HGFF" "CVVD"
"CVVD" "HGFF"
"OPSF" "WQSR"
...
Can somebody tell me how to write a function that returns the index (row number) of a specific combination of strings in columns 1 and 2? If I call the function with (HGFF, CVVD), it should return 3 and 4 (it does not matter whether HGFF or CVVD is in column 1 or column 2). If I call it with (CVVA, CVVA), it should return 2. The difficulty is that it has to check across both columns. Is there a solution in R? Otherwise bash would also be fine.
A function like the following should work for you:
myFun <- function(v1, v2, indf) {
x <- sort(c(v1, v2))
which(apply(indf, 1, function(z) all(sort(z) == x)))
}
The usage would be like this (assuming your data are in a data.frame called "mydf"):
myFun("CVVA", "CVVD", indf = mydf)
myFun("HGFF", "CVVD", indf = mydf)
The R function it sounds like you are looking for is which, but it won't do what you want directly on its own.
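This also seems to work, assuming mat is a character matrix in which each searched value occurs exactly once in each column: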
fun1 <- function(v1, v2, mat) {
ind <- c(0, -nrow(mat))
indx1 <- which(mat == v1) + ind
indx2 <- which(mat == v2) + ind
if (all(sort(indx1) == sort(indx2))) {
indx1
} else NULL
}
fun1("HGFF","CVVD", mat) #mat is the matrix
#[1] 3 4
fun1("CVVA","CVVD", mat)
#NULL
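For comparison, here is a minimal vectorized sketch of the same idea that avoids sorting each row. The name find_pair and the sample data frame mydf are hypothetical and only mirror the rows shown in the question; the function assumes the two columns of interest are the first two columns of indf.

```r
# Hypothetical helper: find rows where the pair (v1, v2) appears
# in the first two columns, in either order.
find_pair <- function(v1, v2, indf) {
  a <- as.character(indf[[1]])
  b <- as.character(indf[[2]])
  which((a == v1 & b == v2) | (a == v2 & b == v1))
}

# Sample data copied from the rows shown in the question
mydf <- data.frame(
  col1 = c("FTGS", "CVVA", "HGFF", "CVVD", "OPSF"),
  col2 = c("JKLP", "CVVA", "CVVD", "HGFF", "WQSR"),
  stringsAsFactors = FALSE
)

find_pair("HGFF", "CVVD", mydf)  # rows 3 and 4
find_pair("CVVA", "CVVA", mydf)  # row 2
find_pair("CVVA", "CVVD", mydf)  # no match: integer(0)
```

Unlike fun1, this returns an empty integer vector rather than NULL when there is no match, and it places no restriction on how often a value occurs.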
Consider the following data.frame:
df <- setNames(data.frame(rep("text_2010"),rep(1,5)), c("id", "value"))
I only want to keep the 4 last characters of the cells in the column "id". Therefore, I can use the following code:
df$id <- substr(df$id,nchar(df$id)-3,nchar(df$id))
However, I want to create a function that does the same. Therefore, I create the following function and apply it:
testfunction <- function(x) {
x$id <- substr(x$id,nchar(x$id)-3,nchar(x$id))
}
df <- testfunction(df)
But I do not get the same result. Why is that?
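Add return(x) to your function so that it returns the changed object. As written, the last expression in the function is an assignment, whose value is the assigned character vector, so the function returns only the substrings rather than the whole data frame.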
testfunction <- function(x) {
x$id <- substr(x$id,nchar(x$id)-3,nchar(x$id))
return(x)
}
df <- testfunction(df)
However, you don't always need an explicit return statement (although it is clearer to have one). By default, R returns the value of the last expression evaluated in the function, so here you can also do
testfunction <- function(x) {
transform(x, id = substring(id, nchar(id)-3))
}
df <- testfunction(df)
which should work the same.
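To see the difference concretely, here is a small sketch comparing the two behaviours. The function names no_return and with_return are made up for this illustration; the data frame is the one from the question.

```r
df <- data.frame(id = rep("text_2010", 5), value = 1, stringsAsFactors = FALSE)

# Last expression is an assignment: the function's value is the
# assigned character vector, not the modified data frame.
no_return <- function(x) {
  x$id <- substr(x$id, nchar(x$id) - 3, nchar(x$id))
}

# Last expression is x itself: the modified data frame is returned.
with_return <- function(x) {
  x$id <- substr(x$id, nchar(x$id) - 3, nchar(x$id))
  x
}

res1 <- no_return(df)
res2 <- with_return(df)

is.data.frame(res1)  # FALSE: res1 is a character vector of length 5
is.data.frame(res2)  # TRUE
```

This is why df <- testfunction(df) in the question replaced the data frame with a plain character vector.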
We can also write a function that takes the number of characters n as an argument (instead of hard-coding it, which would make the function useful for only one length) and builds a regex pattern to be used with sub:
testfunction <- function(x, n) {
pat <- sprintf(".*(%s)$", strrep(".", n))
x$id <- sub(pat, "\\1", x$id)
return(x)
}
Testing:
testfunction(df, n = 4)
#    id value
#1 2010     1
#2 2010     1
#3 2010     1
#4 2010     1
#5 2010     1
Base R solution attempting to mirror Excel's RIGHT() function:
# Function to extract the right n characters from each element of a provided vector:
right <- function(char_vec, n = 1){
# Check if vector provided isn't of type character:
if(!is.character(char_vec)){
# Coerce it, if not: char_vec => character vector
char_vec <- vapply(char_vec, as.character, "character")
}
# Store the number of characters in each element of the provided vector:
# num_chars => integer vector
num_chars <- nchar(char_vec)
# Return the right hand n characters of the string: character vector => Global Env()
return(substr(char_vec, (num_chars + 1) - n, num_chars))
}
# Application:
right(df$id, 4)
Data:
df <- setNames(data.frame(rep("text_2010"),rep(1,5)), c("id", "value"))
Problem
I'm writing a function that describes how the state of a time series changes over time. It should report whether the value in a given column is greater than, less than, or equal to the value in the previous column, and print the result.
The result could go in the same data frame or in a different object. I'm doing this to transform the data into a form suitable for survival analysis.
What has been done
I already wrote an if-else ladder that looks like the code below, where x is column i of a data frame and y is the column just before it (i-1). However, I have no idea how to write the function so that it performs this operation on each column of the data frame (starting from the second one) without failing on the last column.
func_name <- function (x, columns) {
if (x == NA) {
print("gone")
} else if (x < y) {
print("less")
} else if (x > y) {
print("more")
} else if (x = y) {
print("same")
} else {
print ("")
}
}
What is being expected
Ideally, it would transform something like this:
Id <- c(1,2,3)
Time1 <- c(3,3,4)
Time2 <- c(2,5,4)
Time3 <- c(1,5,8)
df <- data.frame(Id,Time1,Time2,Time3)
df
Into something like this:
Id <- c(1,2,3)
Time1 <- c(3,3,4)
Time2 <- c("Less","More","Same")
Time3 <- c("Less","Same","More")
df2 <- data.frame(Id,Time1,Time2,Time3)
df2
Any help is highly appreciated!
Solutions: Both @Andrew's and @Cole's solutions solve the problem!
This sounds like what you are looking for. It is not a custom function, but it can be adapted if you need one. Hope this helps!
# Select the columns you need. NOTE: [-1] drops the first (starting) time column
cols <- grep("Time", names(df), fixed = TRUE)[-1]
# Use case_when with your conditions; df[[i]] extracts each column as a plain vector
df[cols] <- lapply(cols, function(i) dplyr::case_when(
  is.na(df[[i]]) ~ "Gone",
  df[[i]] > df[[i-1]] ~ "More",
  df[[i]] < df[[i-1]] ~ "Less",
  df[[i]] == df[[i-1]] ~ "Same"
))
df
  Id Time1 Time2 Time3
1  1     3  Less  Less
2  2     3  More  Same
3  3     4  Same  More
Here's the use of mapply with an anonymous function inside:
df <- data.frame(Id,Time1,Time2,Time3)
df[, 3:4] <- mapply(function(x, y) ifelse(y < x , 'Less', ifelse(y > x, 'More', 'Same'))
, df[, 2:3]
, df[, 3:4])
df
mapply walks along the columns of the two data frames in parallel and applies the function to each pair. In other words, I am comparing df[, 2] with df[, 3], and then df[, 3] with df[, 4]. I could have also done something like:
fx_select <- function(x, y) {
ifelse(y < x, 'Less', ifelse(y > x, 'More', 'Same'))
}
df[, 3:4] <- mapply(fx_select, df[, 2:3], df[, 3:4])
And here's one more approach:
df[3:4] <- lapply(sign(df[2:3] - df[3:4]) + 2,
function(x) c('More', 'Same', 'Less')[x]
)
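Since the question asked for a reusable function that also handles missing values, here is a minimal base R sketch along the same lines. The name describe_changes and its cols argument are hypothetical, and the example data add one NA that is not in the original question, purely to show the "Gone" branch. When the previous value is missing but the current one is not, this sketch leaves the result as NA.

```r
# Hypothetical helper: label each column in cols (after the first)
# relative to the column listed just before it.
describe_changes <- function(data, cols) {
  out <- data
  for (k in seq_along(cols)[-1]) {
    cur  <- data[[cols[k]]]      # always compare the original values
    prev <- data[[cols[k - 1]]]
    out[[cols[k]]] <- ifelse(is.na(cur), "Gone",
                      ifelse(cur > prev, "More",
                      ifelse(cur < prev, "Less", "Same")))
  }
  out
}

# Example data: the question's data with one value replaced by NA (assumption)
df <- data.frame(Id = c(1, 2, 3),
                 Time1 = c(3, 3, 4),
                 Time2 = c(2, 5, NA),
                 Time3 = c(1, 5, 8))

res <- describe_changes(df, c("Time1", "Time2", "Time3"))
res$Time2  # "Less" "More" "Gone"
res$Time3  # "Less" "Same" NA
```

Reading from data while writing to out keeps every comparison on the original numbers, so the labels written for one column never leak into the comparison for the next.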
R has problems when reading .csv files with column names that begin with a number; it changes these names by putting an "X" as the first character.
I am trying to write a function which simply solves this problem (although: is this the easiest way?)
As an example file, I simply created two new (non-sensical) columns in iris:
iris$X12.0 <- iris$Sepal.Length
iris$X18.0 <- iris$Petal.Length
remv.X <- function(x){
if(substr(colnames(x), 1, 1) == "X"){
colnames(x) <- substr(colnames(x), 2, 100)
}
else{
colnames(x) <- substr(colnames(x), 1, 100)
}
}
remv.X(iris)
When printing, I get a warning, and nothing changes.
What am I doing wrong?
check.names=FALSE
Use the read.table/read.csv argument check.names = FALSE to turn off column name mangling.
For example,
read.csv(text = "1x,2x\n10,20", check.names = FALSE)
giving:
1x 2x
1 10 20
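The renaming comes from make.names, which read.csv applies to the header when check.names is TRUE (the default). A short sketch of the effect, using the same inline CSV text as above (the variable name csv_text is just for illustration):

```r
# make.names() is what turns invalid names into syntactically valid ones
make.names(c("12.0", "1x", "Sepal.Length"))
# "X12.0" "X1x" "Sepal.Length"

csv_text <- "1x,2x\n10,20"

# Default behaviour: names are passed through make.names()
default_names <- names(read.csv(text = csv_text))

# With check.names = FALSE the header is kept as written
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

default_names  # "X1x" "X2x"
raw_names      # "1x" "2x"
```

Columns with such names then have to be quoted with backticks when accessed with $, for example d$`1x`.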
Removing X using sub
If for some reason you did have an unwanted X character at the beginning of some column names, it could be removed like this. This only removes an X at the beginning of column names for which the next character is a digit. If the next character is not a digit, or if there is no next character, the column name is left unchanged.
names(iris) <- sub("^X(\\d.*)", "\\1", names(iris))
or as a function:
rmX <- function(data) setNames(data, sub("^X(\\d.*)", "\\1", names(data)))
# test
iris <- rmX(iris)
Problem with code in question
There are two problems with the code in the question.
1. In if (condition) ..., the condition is a vector but must be a scalar.
2. The data frame is never returned.
Here it is fixed up. We have also factored out the LHS of the two legs of the if.
remv.X2 <- function(x) {
for (i in seq_along(x)) {
colnames(x)[i] <- if (substr(colnames(x)[i], 1, 1) == "X") {
substr(colnames(x)[i], 2, 100)
} else {
substr(colnames(x)[i], 1, 100)
}
}
x
}
iris <- remv.X2(iris)
or maybe even:
remv.X3 <- function(x) {
setNames(x, substr(colnames(x), (substr(colnames(x), 1, 1) == "X") + 1, 100))
}
iris <- remv.X3(iris)
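As a side-by-side illustration, the sketch below contrasts the vectorized ifelse form of the question's logic with the regex from the sub approach. The small data frame d is made up for this example and includes a column name that starts with X followed by a letter.

```r
# Hypothetical demo data with names like those produced by read.csv
d <- data.frame(X12.0 = 1:2, X18.0 = 3:4, Xylem = 5:6, Species = c("a", "b"))
nm <- names(d)

# substr() on a vector of names gives a logical vector, one element per
# column, which is why it cannot be used directly as an if() condition.
has_x <- substr(nm, 1, 1) == "X"

# Vectorized equivalent of the question's if/else: strips every leading X
stripped_all <- ifelse(has_x, substring(nm, 2), nm)

# Regex version: strips a leading X only when a digit follows
stripped_digits <- sub("^X(\\d.*)", "\\1", nm)

stripped_all     # "12.0" "18.0" "ylem"  "Species"
stripped_digits  # "12.0" "18.0" "Xylem" "Species"
```

The difference on Xylem is the reason the regex version is the safer choice when some genuine column names begin with X.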
I want to search through a vector for the sequence of strings "hello" "world". When I find this sequence, I want to copy it, including the 10 elements before and after, as a row in a data.frame to which I'll apply further analysis.
My problem: I get an error "new column would leave holes after existing columns". I'm new to coding, so I'm not sure how to manipulate data.frames. Maybe I need to create rows in the loop?
This is what I have:
df = data.frame()
i <- 1
for(n in 1:length(v))
{
if(v[n] == 'hello' & v[n+1] == 'world')
{
df[i,n-11:n+11] <- v[n-10:n+11]
i <- i+1
}
}
Thanks!
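Maybe this helps:
indx <- which(v1[-length(v1)] == 'hello' & v1[-1] == 'world')
# For each match, take the window from 10 before to 11 after, clipped to the vector's bounds
lst <- Map(function(x, y) {
  s1 <- seq(x, y)
  v1[s1[s1 > 0 & s1 <= length(v1)]]
}, indx - 10, indx + 11)
# Pad shorter windows with NA so that every row has the same length
len <- max(sapply(lst, length))
d1 <- as.data.frame(do.call(rbind, lapply(lst, `length<-`, len)))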
data
set.seed(496)
v1 <- sample(c(letters[1:3], 'hello', 'world'), 100, replace=TRUE)
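One likely contributor to the original error is operator precedence: in R, the : operator binds more tightly than + and -, so n-10:n+11 is parsed as n - (10:n) + 11 rather than as a range around n. The sketch below shows the difference and a clipped window extraction on a small made-up vector v (the vector and variable names are assumptions for illustration only).

```r
n <- 15
wrong <- n - 10:n + 11       # parsed as n - (10:n) + 11
right <- (n - 10):(n + 11)   # the intended window of indices

wrong  # 16 15 14 13 12 11
right  # 5 6 ... 26

# Made-up vector with one "hello" "world" pair at positions 13 and 14
v <- c(letters[1:12], "hello", "world", letters[13:26])

# Positions where "hello" is immediately followed by "world"
hits <- which(head(v, -1) == "hello" & tail(v, -1) == "world")

# Extract 10 elements before and after each pair, clipped to valid indices
windows <- lapply(hits, function(i) {
  idx <- (i - 10):(i + 11)
  v[idx[idx >= 1 & idx <= length(v)]]
})

hits                  # 13
length(windows[[1]])  # 22
```

Collecting the windows in a list first and binding them into a data frame afterwards also avoids growing a data frame cell by cell inside the loop.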
I have the following function taken from R: iterative outliers detection (this is an updated version):
dropout<-function(x) {
outliers <- NULL
res <- NULL
if(length(x)<2) return (1)
vals <- rep.int(1, length(x))
r <- chisq.out.test(x)
while (r$p.value<.05 & sum(vals==1)>2) {
if (grepl("highest",r$alternative)) {
d <- which.max(ifelse(vals==1,x, NA))
res <- rbind(list(as.numeric(strsplit(r$alternative," ")[[1]][3]),as.numeric(r$p.value)),fill=TRUE)
}
else {
d <- which.min(ifelse(vals==1, x, NA))
}
vals[d] <- r$p.value
r <- chisq.out.test(x[vals==1])
}
return(res)
}
The problem is that in some rounds there are missing rows when filling the data.frame.
I want to fill res, but in some iterations it contains missing values.
I tried everything I could think of, e.g. rbindlist, rbind.fill, and rbind (with fill=TRUE), but nothing works.
When I do something like:
res <- c(res,as.numeric(strsplit(r$alternative," ")[[1]][3]),as.numeric(r$p.value))
it works, but it creates two rows for each (V1, V2) group: one with r$alternative in the last column, and a second with the same first two columns but with the p-value in the last column instead.
This is how I call the function, on data similar to that in the question mentioned above:
outliers <- d[, dropout(V3), list(V1, V2)]
and I always get this error: j doesn't evaluate to the same number of columns for each group