Obtain previous level of ordinal factor in R - r

I must be missing something in this question, because it looks simple and I have already spent too much time on it.
Let's say we have an ordinal factor column in a data frame. We want a new column obtained by shifting the original column down or up by one level (category). What is the fastest way?
Data
col_string <- as.character(c(1:5))
col_factor <- factor(col_string, levels = as.character(c(1:8)), ordered = TRUE)
Desired solution:
col_solution <- c(8,1,2,3,4)
df <- cbind(col_string, col_factor, col_solution)
df
col_string col_factor col_solution
[1,] "1" "1" "8"
[2,] "2" "2" "1"
[3,] "3" "3" "2"
[4,] "4" "4" "3"
[5,] "5" "5" "4"
How, in code, can I tell R to:
col_solution <- shift down one level of the element in col_factor
edit for clarification:
The col_factor column has 8 categories, although only 5 of them are represented in the data. The categories are ordered as 1-2-3-4-5-6-7-8. If an element is in category 1 and we want to go down by one category, we wrap around to category 8.

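# shift_levels is a name chosen here; the original function was anonymous.
shift_levels <- function(myFactor, shift = -1) {
  # Wrap the integer codes around the number of levels, then map back to the level labels.
  myFactor[] <- levels(myFactor)[(as.numeric(myFactor) - 1 + shift) %% length(levels(myFactor)) + 1]
  return(myFactor)
}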
The indexing takes a moment to get your head around.
((x-1) %% y) + 1 gives the remainder of x/y, except that when the remainder is 0 it returns y.
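As a rough illustration of the wrap-around arithmetic, here is a small sketch applied to the data from the question. The intermediate names (n, codes, new_codes) are introduced only for this example; it relies on R's %% returning a non-negative result for a positive modulus.

```r
# Data from the question: 5 observed values, 8 ordered levels.
col_factor <- factor(as.character(1:5), levels = as.character(1:8), ordered = TRUE)

n <- nlevels(col_factor)         # number of levels (8)
codes <- as.integer(col_factor)  # integer codes of the observed values

# Shift down by one level; the code for level 1 wraps around to level n.
new_codes <- ((codes - 1 - 1) %% n) + 1

# Map the shifted codes back to level labels, keeping the same level set.
col_solution <- factor(levels(col_factor)[new_codes],
                       levels = levels(col_factor), ordered = TRUE)
as.character(col_solution)
```

Indexing into levels() rather than assigning the numbers directly keeps the approach usable when the level labels are not themselves numbers.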


Subset rows based on "start and stop" strings

I am looking to write an R script that will search a column for a specific value and begin subsetting rows until a specific text value is reached.
Example:
X1 X2
[1,] "a" "1"
[2,] "b" "2"
[3,] "c" "3"
[4,] "d" "4"
[5,] "e" "5"
[6,] "f" "6"
[7,] "c" "7"
[8,] "k" "8"
What I'd like to do is search through X1 until the letter 'c' is found, and begin to subset rows until another letter 'c' is found, at which point the subset procedure would stop. Using the above example, the result should be a vector containing c(3,4,5,6,7).
Assume there will be no more than 2 rows where X1 equals 'c'
Any help is greatly appreciated.
You can look up where a value occurs with the function which, and use the result as an index to get the values you are looking for. If you want everything from the first to the second "c", it would look like this:
indices <- which(df$X1=='c')
range <- indices[1]:indices[2]
df$X2[range]
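For completeness, a self-contained sketch of the same idea using the example data from the question. The data frame construction, the stopifnot guard, and the variable names rows and subset_values are assumptions added for illustration.

```r
# Rebuild the example from the question as a data frame.
df <- data.frame(X1 = c("a", "b", "c", "d", "e", "f", "c", "k"),
                 X2 = as.character(1:8),
                 stringsAsFactors = FALSE)

# Positions of the start and stop markers.
indices <- which(df$X1 == "c")
stopifnot(length(indices) == 2)  # the question assumes exactly two markers

# Every row from the first marker to the second, inclusive.
rows <- indices[1]:indices[2]
subset_values <- as.numeric(df$X2[rows])
subset_values
```

Storing the range under a name other than range avoids masking the base function of that name.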

sapply not applying a function created to all rows in R dataframe

I have the following data frame in R and am trying to apply a string-splitting function to it to produce a different data frame
DF
A B C
"1,2,3" "1,2"
"2" "1"
The cells of the dataframe are filled with characters. The empty spaces are blank values. I have created the following function
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
The function works neatly when I use it on a single column
sapply(DF$A, sepfunc)
[1] "1" "2"
However, the following command yields only a single row
sapply(DF, sepfunc)
A B C
"1" NA "1"
The second row is not displayed. I know I must be missing something rudimentary. I request someone to help.
The expected output is
A B C
"1" NA "1"
"2" "1" "NA"
strsplit returns a list of vectors, one list element per input string. If we subset only the first list element with [[1]], the remaining elements are dropped; here the first element corresponds to the first row. When the function is applied to a single column with sapply, sapply loops over each element of the column, so every strsplit call returns a list of length 1 and taking [[1]] does no harm. With sapply(DF, sepfunc) the case is different: each column is passed to the function as a whole, so the list has as many elements as the column has rows. We therefore need to loop over that list as well, with either sapply or lapply (the former simplifies to a vector or matrix where it can, while the latter always returns a list).
sapply(DF, function(x) sapply(strsplit(as.character(x), ","), `[`, 1))
# A B C
#[1,] "1" NA "1"
#[2,] "2" "1" NA
Let's look at this more closely by splitting the code into chunks. For each column, the output is a list of split vectors
lapply(DF, function(x) strsplit(as.character(x), ","))
#$A
#$A[[1]]
#[1] "1" "2" "3"
#$A[[2]]
#[1] "2"
#$B
#$B[[1]]
#[1] NA
#$B[[2]]
#[1] "1"
#$C
#$C[[1]]
#[1] "1" "2"
#$C[[2]]
#character(0)
When we do [[1]], the first element is extracted i.e. the first row of 'A', 'B', 'C'
lapply(DF, function(x) strsplit(as.character(x), ",")[[1]])
#$A
#[1] "1" "2" "3"
#$B
#[1] NA
#$C
#[1] "1" "2"
If we subset the above again with [1], i.e. take the first element of each, the output will be 1 NA 1.
Instead, we want to loop through the list and get the first element of each list element.
As you only want to extract the first part before the comma, you can also do
sapply(DF, function(x) gsub("^([^,]*),.*$", "\\1", x))
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
This extracts the first capture group (\\1), which is the part of the pattern enclosed in parentheses: ([^,]*)
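To see what the pattern does on its own, here is a minimal sketch on a plain character vector. The vector cells is made up for illustration, and sub is used instead of gsub because the anchored pattern can match at most once per string anyway.

```r
# Hypothetical cell values: several fields, a single field, and a missing value.
cells <- c("1,2,3", "2", NA)

# ^([^,]*)  captures everything up to the first comma
# ,.*$      matches the comma and the rest of the string
# "\\1"     replaces the whole match with the captured part
first_field <- sub("^([^,]*),.*$", "\\1", cells)
first_field
```

Strings without a comma do not match the pattern and are returned unchanged, and NA stays NA.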
Or with stringr:
library(stringr)
sapply(DF, function(x) str_extract(x, "^([^,]*)"))
Here is another version of this
lapply(X = DF, FUN = function(x) sapply(strsplit(x = as.character(x), split = ","), FUN = head, n=1))
First of all, notice that your sepfunc should always give an error:
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
split should go with strsplit, not as.character, so what you meant is probably:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
Second, the question of data sanity. You have character variables stored as factors, and missing data stored as empty strings. I would recommend dealing with these issues before trying to do anything else. (Why do I say NA is more sensible here than an empty string? Because you told me so. You want NA's in the output, so I guess this means that if there are no numbers in the string, it means that something is missing. Missing = NA. There is also a technical reason which would take a bit longer to explain.)
So in the following, I'm just using an altered version of your DF:
DF <- data.frame(A=c("1,2,3", "2"), B=c(NA, "1"), C=c("1,2", NA), stringsAsFactors=FALSE)
(If DF comes from a file, then you could use read.csv("file", as.is=TRUE). And then DF[DF==""] <- NA.)
The output of strsplit is a list so you'll need sapply to get something useful out from it. And another sapply to apply it to all columns in a data frame.
sapply(DF, function(x) sapply(strsplit(x, ","), head, 1))
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
Or step by step. Before you can sapply a function over all columns of a data frame, you need it to give meaningful results for all the columns. Let's try:
sf <- function(x) sapply(strsplit(x, ","), head, 1)
# and sepfunc as defined above:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
sf(DF$A)
# [1] "1" "2"
# as expected
sepfunc(DF$A)
# [1] "1"
Notice that sepfunc uses only the first element (as you told it to!) of each column, and the rest is discarded. You need sapply or something similar to use all elements. So as a consequence, you get this:
sapply(DF, sepfunc)
# A B C
# "1" NA "1"
(It works, because we've redefined empty strings as NA. But you get the results only for the first row of each variable.)
sapply(DF, sf)
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA

How can I compare two lists and output "hits" into a dataframe

I've tried to find answers here and on Google but had no luck; I have been struggling with this issue for some days, so I would really appreciate help. I'm analyzing a network to see if cycles tend to be within discrete communities or between them, or show no pattern. My data are a list of cycles (three nodes forming a loop) and a list of communities (variable number of nodes). I have two questions: 1) how to compare two lists, and 2) how to output the comparison results in a way which is readable:
Question 1
I have two lists (both igraph objects), one containing 678 items (each of 3 elements, all characters) and another containing 11 items each with a differing number of elements. Example:
x1 <- as.character(c(1,3,5))
x2 <- as.character(c(2,4,6))
x3 <- as.character(c(7,8,9))
x4 <- as.character(c(10,11,12))
x <- list(x1, x2, x3, x4)
y1 <- as.character(c(1,2,3,4,5))
y2 <- as.character(c(2,3,4,5))
y3 <- as.character(c(1,2,3,4,5,7,8,9))
y <- list(y1, y2, y3)
Giving:
> x
[[1]]
[1] "1" "3" "5"
[[2]]
[1] "2" "4" "6"
[[3]]
[1] "7" "8" "9"
[[4]]
[1] "10" "11" "12"
> y
[[1]]
[1] "1" "2" "3" "4" "5"
[[2]]
[1] "2" "3" "4" "5"
[[3]]
[1] "1" "2" "3" "4" "5" "7" "8" "9"
I want to compare every component in x against every component in y and add every hit (i.e. when all the elements from x[[i]] are also found in y[[j]]) to a new dataframe. I tried a loop using all() and %in% but this didn't work:
for (i in 1:length(x)) {
for (j in 1:length(y)) {
hits <- all(y[[j]] %in% x[[i]]) == TRUE
print(hits)
}
}
This returns 12 FALSE hits. Checking individual components, it should have worked, because:
all(x[[1]] %in% y[[1]])
Returns TRUE as it should, and:
all(x[[1]] %in% y[[2]])
Returns FALSE as it should. Where am I going wrong here?
Question 2
I have seen some solutions for outputting loop results into a df, but that's not exactly what I need. What I want as an output is a dataframe telling me which community every cycle is in. Since there are only 11 communities, it could just refer me to the list component's index, but I haven't found a way to do this. I could also just use paste() to concatenate the node names of a community into a title. Either way, here is the output I need:
cycle community
1 1_3_5 1_2_3_4_5
2 1_3_5 1_2_3_4_5_7_8_9
3 7_8_9 1_2_3_4_5_7_8_9
I'm guessing some kind of an if statement. I feel this should be fairly simple to execute and that I should have been able to work it out myself. Nevertheless, thank you for your time and sorry about the essay.
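You swapped the operands of %in%. The test should be whether all elements of x[[i]] are in y[[j]], not the other way around: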
for (i in 1:length(x)) {
for (j in 1:length(y)) {
# hits <- all(y[[j]] %in% x[[i]]) == TRUE
hits <- all(x[[i]] %in% y[[j]]) == TRUE
print(hits)
}
}
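The direction matters because a %in% b asks, for each element of a, whether it occurs in b. A small sketch using the first cycle and the first community from the question (the variable names here are chosen for illustration):

```r
cycle <- c("1", "3", "5")
community <- c("1", "2", "3", "4", "5")

# One logical value per element of the left-hand operand.
cycle %in% community
community %in% cycle

# Only the first direction tests "is the cycle contained in the community?"
cycle_in_community <- all(cycle %in% community)
community_in_cycle <- all(community %in% cycle)
c(cycle_in_community, community_in_cycle)
```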
For the second part you can store the indexes that have a hit and use them for later.
a <- list()
for (i in 1:length(x)) {
for (j in 1:length(y)) {
# hits <- all(y[[j]] %in% x[[i]]) == TRUE
hits <- all(x[[i]] %in% y[[j]]) == TRUE
if(hits == TRUE){
a[[length(a)+1]] <- c(i,j)
}
}
}
The final part of the question, creation of cycle and community tags, can be accomplished with stringi::stri_join() (or paste() as pointed out in the comments). The final step to wrangle the list created in Jt Miclat's answer is as follows, using the indexes in the list a to extract the appropriate strings for cycle and community, generate data frames, and rbind() the result to a single data frame.
# combine with cycle & community tags
cycles <- sapply(x,paste,collapse="_")
communities <- sapply(y,paste,collapse="_")
b <- lapply(a,function(x){
cycle <- cycles[x[1]]
community <- communities[x[2]]
data.frame(x=x[1],y=x[2],cycle=cycle,community=community,
stringsAsFactors=FALSE)
})
df <- do.call(rbind,b)
df
...and the output:
> df <- do.call(rbind,b)
> df
x y cycle community
1 1 1 1_3_5 1_2_3_4_5
2 1 3 1_3_5 1_2_3_4_5_7_8_9
3 3 3 7_8_9 1_2_3_4_5_7_8_9
>
Well you can make use of outer:
outer(x,y,function(w,z)Map(function(i,j)all(i%in%j),w,z))->results
[,1] [,2] [,3]
[1,] TRUE FALSE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE TRUE
[4,] FALSE FALSE FALSE
x indexes the rows and y the columns, so to check all(x[[1]] %in% y[[2]]), look at row 1, column 2, i.e. element [1,2], and so on.
Then you can use apply with your own function:
fun <- function(i) c(paste(x[[i[1]]], collapse = "_"), paste(y[[i[2]]], collapse = "_"))
t(apply(which(results == TRUE, arr.ind = TRUE), 1, fun))
[,1] [,2]
[1,] "1_3_5" "1_2_3_4_5"
[2,] "1_3_5" "1_2_3_4_5_7_8_9"
[3,] "7_8_9" "1_2_3_4_5_7_8_9"
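A variation on the same idea, sketched with nested sapply calls so that the hit table is a plain logical matrix rather than a list with dimensions. This is an alternative arrangement, not the code from the answers above; hits, idx and out are names introduced for the example.

```r
x <- list(as.character(c(1, 3, 5)), as.character(c(2, 4, 6)),
          as.character(c(7, 8, 9)), as.character(c(10, 11, 12)))
y <- list(as.character(c(1, 2, 3, 4, 5)), as.character(c(2, 3, 4, 5)),
          as.character(c(1, 2, 3, 4, 5, 7, 8, 9)))

# Rows correspond to cycles (x), columns to communities (y).
hits <- sapply(y, function(comm) sapply(x, function(cyc) all(cyc %in% comm)))

# Row/column positions of every TRUE entry.
idx <- which(hits, arr.ind = TRUE)

# One output row per hit, with node names pasted into tags.
out <- data.frame(cycle = sapply(x[idx[, "row"]], paste, collapse = "_"),
                  community = sapply(y[idx[, "col"]], paste, collapse = "_"),
                  stringsAsFactors = FALSE)
out
```

Because which() walks the matrix column by column, the rows of out are ordered by community first and by cycle second.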

Subsetting Identical Observations in R [duplicate]

I am trying to look at protein sequence homology using R, and I'd like to go through a data frame looking for identical pairs of Position and Letter. The data look similar to the frame below:
Letter <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
data.set <- cbind(Position, Letter)
Which yields:
Position Letter
[1,] "1" "A"
[2,] "2" "B"
[3,] "3" "C"
[4,] "4" "D"
[5,] "4" "D"
[6,] "5" "E"
[7,] "6" "G"
[8,] "7" "L"
I'd like to loop through and find all identical observations (in this case, observations 4 and 5), but I'm having difficulty in discovering the best way to do it.
I'd like the resultant data frame to look like:
Position Letter
[1,] "4" "D"
[2,] "4" "D"
The ways I've tried to do this ended up yielding this code, but unfortunately it returns one value of TRUE because I realized that I am comparing two identical data frames:
> identical(data.set[1:nrow(data.set),1:2], data.set[1:nrow(data.set),1:2])
[1] TRUE
I'm not sure if looping through using the identical() function would be the best way? I'm sure there's a more elegant solution that I am missing.
Thanks for any help!
Try the unique function:
unique(data.set)
...
You can combine duplicated with its fromLast argument to flag duplicates in both directions, so that the first occurrence is kept as well as the later ones:
data.set[duplicated(data.set) | duplicated(data.set, fromLast = TRUE), ]
# Position Letter
#[1,] "4" "D"
#[2,] "4" "D"
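A self-contained sketch of why both directions are needed, using the data from the question. The names forward, backward and dup are introduced for illustration only.

```r
Letter <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
data.set <- cbind(Position, Letter)

# Scanning from the top flags only the later copy of a repeated row (row 5).
forward <- duplicated(data.set)
# Scanning from the bottom flags only the earlier copy (row 4).
backward <- duplicated(data.set, fromLast = TRUE)

# A row belongs to a duplicated group if either scan flags it.
dup <- forward | backward
which(dup)
data.set[dup, , drop = FALSE]
```

drop = FALSE keeps the result a matrix even if only one row were selected.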

r - pairwise combinations of rows from table?

Assume a table as below:
X =
col1 col2 col3
row1 "A" "0" "1"
row2 "B" "2" "NA"
row3 "C" "1" "2"
I select combinations of two rows, using the code below:
pair <- apply(X, 2, combn, m=2)
This returns a matrix of the form:
pair =
[,1] [,2] [,3]
[1,] "A" "0" "1"
[2,] "B" "2" NA
[3,] "A" "0" "1"
[4,] "C" "1" "2"
[5,] "B" "2" NA
[6,] "C" "1" "2"
I wish to iterate over pair, taking two rows at a time, i.e. first isolate [1,] and [2,], then [3,] and [4,], and finally [5,] and [6,]. These rows will then be passed as arguments to regression models, i.e. lm(Y ~ row[i]*row[j]).
I am dealing with a large dataset. Can anybody advise how to iterate over a matrix two rows at a time, assign those rows to variables and pass as arguments to a function?
Thanks,
S ;-)
It is unnecessary to replicate the rows of your matrix like that, and with a large data set it might become problematic. Instead, just pick out the relevant rows for each pair. It is convenient to create the selection of row indices beforehand, something like this perhaps:
xselect <- combn(1:nrow(X),2)
To illustrate with your data (assuming you only use columns 2 and 3):
X <- matrix(c("A", "B", "C", 0,2,1,1,NA,2),3,3)
Y <- rnorm(2, 4, 2)
for (i in 1:ncol(xselect))
{
x1 <- as.numeric(X[xselect[1,i], c(2,3)])
x2 <- as.numeric(X[xselect[2,i], c(2,3)])
print(lm(Y ~ x1 * x2))
}
I'm not sure exactly what you're trying to do with the linear models, but to iterate over X a pair of rows at a time, make a factor that labels each pair, and then use by:
fac <- as.factor(sort(rep(1:(nrow(X)/2), 2)))
by(X, fac, FUN)
where FUN is whatever function you want to apply over the pairs of rows in X.
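As a minimal sketch of passing each pair of rows to a function, the example below iterates over the columns of the combn index matrix. process_pair is a hypothetical stand-in for whatever model-fitting function is actually needed; here it only labels the pair so the result is easy to check.

```r
X <- matrix(c("A", "B", "C", 0, 2, 1, 1, NA, 2), 3, 3)

# Each column of pair_index holds the two row numbers of one pair.
pair_index <- combn(nrow(X), 2)

# Hypothetical placeholder: receives two full rows, returns a label.
process_pair <- function(row_a, row_b) paste(row_a[1], row_b[1], sep = "-")

# Apply the function to every pair without copying rows into a bigger matrix.
pair_labels <- vapply(seq_len(ncol(pair_index)), function(k) {
  process_pair(X[pair_index[1, k], ], X[pair_index[2, k], ])
}, character(1))
pair_labels
```

Only the small index matrix is stored, which is why this scales better than duplicating the rows themselves.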
