Change values from categorical to nominal in R - r

I want to change all the values in categorical columns by rank. Rank can be decided using the index of the sorted unique elements in the column.
For instance,
> data[1:5,1]
[1] "B2" "C4" "C5" "C1" "B5"
then I want these entries in the column replacing categorical values
> data[1:5,1]
[1] "1" "4" "5" "3" "2"
Another column:
> data[1:5,3]
[1] "Verified" "Source Verified" "Not Verified" "Source Verified" "Source Verified"
Then the updated column:
> data[1:5,3]
[1] "3" "2" "1" "2" "2"
I used this code for this task but it is taking a lot of time.
for(i in 1:ncol(data)){
if(is.character(data[,i])){
temp <- sort(unique(data[,i]))
for(j in 1:nrow(data)){
for(k in 1:length(temp)){
if(data[j,i] == temp[k]){
data[j,i] <- k}
}
}
}
}
Please suggest me the efficient way to do this, if possible.
Thanks.

Here a solution in base R. I create a helper function that convert each column to a factor using its unique sorted values as levels. This is similar to what you did except I use as.integer to get the ranking values.
rank_fac <- function(col1)
as.integer(factor(col1,levels = unique(col1)))
Some data example:
dx <- data.frame(
col1= c("B2" ,"C4" ,"C5", "C1", "B5"),
col2=c("Verified" , "Source Verified", "Not Verified" , "Source Verified", "Source Verified")
)
Applying it without using a for loop. Better to use lapply here to avoid side-effect.
data.frame(lapply(dx,rank_fac)
Results:
# col1 col2
# [1,] 1 3
# [2,] 4 2
# [3,] 5 1
# [4,] 3 2
# [5,] 2 2
using data.table syntax-sugar
library(data.table)
setDT(dx)[,lapply(.SD,rank_fac)]
# col1 col2
# 1: 1 3
# 2: 4 2
# 3: 5 1
# 4: 3 2
# 5: 2 2
simpler solution:
Using only as.integer :
setDT(dx)[,lapply(.SD,as.integer)]

Using match:
# df is your data.frame
df[] <- lapply(df, function(x) match(x, sort(unique(x))))

Related

Specifying column names with a defined structure

Assume I've a (used defined covariance) matrix and I want define the column names like this:
y <- matrix(rnorm(15*10),ncol=15)
colnames(y) <- c("Var1", "Cov12", "Cov13","Cov14", "Cov15",
"Var2", "Cov23", "Cov24", "Cov25",
"Var3", "Cov34" , "Cov35"
"Var4", "Cov45",
"Var5")
where each row contained the variance or co variance for an date t. I want to find a more general way to assign the column names as above, because I'll not always have 5 different variances. I tried something with rep and seq but I didn't find a solution.
Maybe not the most optimal way but we can do
n <- 5
paste0("Var", rep(1:n, n:1), unlist(sapply(2:n, function(x) c("",seq(x, n)))))
[1] "Var1" "Var12" "Var13" "Var14" "Var15" "Var2" "Var23" "Var24" "Var25" "Var3"
"Var34" "Var35" "Var4" "Var45" "Var5"
Breaking it down for better understanding
rep(1:n, n:1)
#[1] 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
unlist(sapply(2:n, function(x) c("",seq(x, n))))
#[1] "" "2" "3" "4" "5" "" "3" "4" "5" "" "4" "5" "" "5"
We take these outputs and paste them parallely with "Var" to get the desired column names.

How can I compare two lists and output "hits" into a dataframe

I've tried to find answers here and on google but no luck, been struggling with this issue for some days so would really appreciate help. I'm analyzing a network to see if cycles tend to be within discreet communities or between them, or no pattern. My data are a list of cycles (three nodes forming a loop) and a list of communities (variable amount of nodes). I have two questions, 1) how to compare two lists, and 2) how to output the comparison results in a way which is readable:
Question 1
I have two lists (both igraph objects), one containing 678 items (each of 3 elements, all characters) and another containing 11 items each with a differing number of elements. Example:
x1 <- as.character(c(1,3,5))
x2 <- as.character(c(2,4,6))
x3 <- as.character(c(7,8,9))
x4 <- as.character(c(10,11,12))
x <- list(x1, x2, x3, x4)
y1 <- as.character(c(1,2,3,4,5))
y2 <- as.character(c(2,3,4,5))
y3 <- as.character(c(1,2,3,4,5,7,8,9))
y <- list(y1, y2, y3)
Giving:
> x
[[1]]
[1] "1" "3" "5"
[[2]]
[1] "2" "4" "6"
[[3]]
[1] "7" "8" "9"
[[4]]
[1] "10" "11" "12"
> y
[[1]]
[1] "1" "2" "3" "4" "5"
[[2]]
[1] "2" "3" "4" "5"
[[3]]
[1] "1" "2" "3" "4" "5" "7" "8" "9"
I want to compare every component in x against every component in y and add every hit (i.e. when all the elements from x[[i]] are also found in y[[i]]) to a new dataframe. I tried a loop using all() and %in% but this didn't work:
for (i in 1:length(x)) {
for (j in 1:length(y)) {
hits <- all(y[[j]] %in% x[[i]]) == TRUE
print(hits)
}
}
This returns 12 FALSE hits. Checking individual components, it should have worked, because:
all(x[[1]] %in% y[[1]])
Returns TRUE as it should, and:
all(x[[1]] %in% y[[2]])
Returns FALSE as it should. Where am I going wrong here?
Question 2
I have seen some solutions for outputting loop results into a df, but that's not exactly what I need. What I want as an output is a dataframe telling me which community every cycle is in. Since there's only 11 communities, it could just refer me to the list component's index, but I haven't found a way to do this. I could also just use paste() to concatenate the node names of a community into a title. Either way, here is the output I need:
cycle community
1 1_3_5 1_2_3_4_5
2 1_3_5 1_2_3_4_5_7_8_9
3 7_8_9 1_2_3_4_5_7_8_9
I'm guessing some kind of an if statement. I feel this should be fairly simple to execute and that I should have been able to work it out myself. Nevertheless, thank you for your time and sorry about the essay.
You made a mistake
for (i in 1:length(x)) {
for (j in 1:length(y)) {
# hits <- all(y[[j]] %in% x[[i]]) == TRUE
hits <- all(x[[i]] %in% y[[j]]) == TRUE
print(hits)
}
}
For the second part you can store the indexes that have a hit and use them for later.
a <- list()
for (i in 1:length(x)) {
for (j in 1:length(y)) {
# hits <- all(y[[j]] %in% x[[i]]) == TRUE
hits <- all(x[[i]] %in% y[[j]]) == TRUE
if(hits == TRUE){
a[[length(a)+1]] <- c(i,j)
}
}
}
The final part of the question, creation of cycle and community tags, can be accomplished with stringi::stri_join() (or paste() as pointed out in the comments). The final step to wrangle the list created in Jt Miclat's answer is as follows, using the indexes in the list a to extract the appropriate strings for cycle and community, generate data frames, and rbind() the result to a single data frame.
# combine with cycle & community tags
cycles <- sapply(x,paste,collapse="_")
communities <- sapply(y,paste,collapse="_")
b <- lapply(a,function(x){
cycle <- cycles[x[1]]
community <- communities[x[2]]
data.frame(x=x[1],y=x[2],cycle=cycle,community=community,
stringsAsFactors=FALSE)
})
df <- do.call(rbind,b)
df
...and the output:
> df <- do.call(rbind,b)
> df
x y cycle community
1 1 1 1_3_5 1_2_3_4_5
2 1 3 1_3_5 1_2_3_4_5_7_8_9
3 3 3 7_8_9 1_2_3_4_5_7_8_9
>
Well you can make use of outer:
outer(x,y,function(w,z)Map(function(i,j)all(i%in%j),w,z))->results
[,1] [,2] [,3]
[1,] TRUE FALSE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE TRUE
[4,] FALSE FALSE FALSE
x is the rows while y is the columns, so to check all(x[[1]]%in%y[[2]]),just check row 1 column 2 ie element [1,2] etc..
Then you can use apply with a own created function:
fun<-function(i)c(paste(x[[i[1]]],collapse ="_"), paste(y[[i[2]]],collapse ="_"))
t(apply(which(result==T,T),1,fun))
[,1] [,2]
[1,] "1_3_5" "1_2_3_4_5"
[2,] "1_3_5" "1_2_3_4_5_7_8_9"
[3,] "7_8_9" "1_2_3_4_5_7_8_9"

Using lapply to apply function to each row in a tibble

This is my code that attempts apply a function to each row in a tibble , mytib :
> mytib
# A tibble: 3 x 1
value
<chr>
1 1
2 2
3 3
Here is my code where I'm attempting to apply a function to each line in the tibble :
mytib = as_tibble(c("1" , "2" ,"3"))
procLine <- function(f) {
print('here')
print(f)
}
lapply(mytib , procLine)
Using lapply :
> lapply(mytib , procLine)
[1] "here"
[1] "1" "2" "3"
$value
[1] "1" "2" "3"
This output suggests the function is not invoked once per line as I expect the output to be :
here
1
here
2
here
3
How to apply function to each row in tibble ?
Update : I appreciate the supplied answers that allow my expected result but what have I done incorrectly with my implementation ? lapply should apply a function to each element ?
invisible is used to avoid displaying the output. Also you have to loop through elements of the column named 'value', instead of the column as a whole.
invisible( lapply(mytib$value , procLine) )
# [1] "here"
# [1] "1"
# [1] "here"
# [1] "2"
# [1] "here"
# [1] "3"
lapply loops through columns of a data frame by default. See the example below. The values of two columns are printed as a whole in each iteration.
mydf <- data.frame(a = letters[1:3], b = 1:3, stringsAsFactors = FALSE )
invisible(lapply( mydf, print))
# [1] "a" "b" "c"
# [1] 1 2 3
To iterate through each element of a column in a data frame, you have to loop twice like below.
invisible(lapply( mydf, function(x) lapply(x, print)))
# [1] "a"
# [1] "b"
# [1] "c"
# [1] 1
# [1] 2
# [1] 3

How to ignore a "Level" in R?

Not sure how to do the following. Please refer to the picture in the link below:
https://i.stack.imgur.com/Kx79x.png
I have some blank spaces, and they are the missing values. I do not want this level to be read. I want R to ignore this level. I want to write a regression so that this empty category is not part of the model.
The data was read from a csv file. The variable is "I", "II"...."IV", but there is an extra "" factor because of missing data. I want R to ignore this factor. My question is how?
you can do the following:
df <- data.frame(letters=letters[1:5], numbers=c(1,2,3,"",5)) # my data frame
# letters numbers
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d
# 5 e 5
levels(df$numbers)
# "" "1" "2" "3" "5"
subdf <- subset(df, numbers != "") # data subset
subdf$numbers <- factor(subdf$numbers)
levels(subdf$numbers)
# "1" "2" "3" "5"
change the "" data to missing:
# generate sample data
df <- data.frame(x = sample(c("","I","II","III"),100, replace = T), stringsAsFactors = T)
option 1
df[df$x=="",'x'] <- NA
option 2
df$x <- factor(ifelse(df$x == "",NA,as.character(df$x)))

Dataframe within dataframe?

Consider this example:
df <- data.frame(id=1:10,var1=LETTERS[1:10],var2=LETTERS[6:15])
fun.split <- function(x) tolower(as.character(x))
df$new.letters <- apply(df[ ,2:3],2,fun.split)
df$new.letters.var1
#NULL
colnames(df)
# [1] "id" "var1" "var2" "new.letters"
df$new.letters
# var1 var2
# [1,] "a" "f"
# [2,] "b" "g"
# [3,] "c" "h"
# [4,] "d" "i"
# [5,] "e" "j"
# [6,] "f" "k"
# [7,] "g" "l"
# [8,] "h" "m"
# [9,] "i" "n"
# [10,] "j" "o"
Would be someone so kind and explain what is going on here? A new dataframe within dataframe?
I expected this:
colnames(df)
# id var1 var2 new.letters.var1 new.letters.var2
The reason is because you assigned a single new column to a 2 column matrix output by apply. So, the result will be a matrix in a single column. You can convert it back to normal data.frame with
do.call(data.frame, df)
A more straightforward method will be to assign 2 columns and I use lapply instead of apply as there can be cases where the columns are of different classes. apply returns a matrix and with mixed class, the columns will be 'character' class. But, lapply gets the output in a list and preserves the class
df[paste0('new.letters', names(df)[2:3])] <- lapply(df[2:3], fun.split)
#akrun solved 90% of my problem. But I had data.frames buried within data.frames, buried within data.frames and so on, without knowing the depth to which this was happening.
In this case, I thought sharing my recursive solution might be helpful to others searching this thread as I was:
unnest_dataframes <- function(x) {
y <- do.call(data.frame, x)
if("data.frame" %in% sapply(y, class)) unnest_dataframes(y)
y
}
new_data <- unnest_dataframes(df)
Although this itself sometimes has problems and it can be helpful to separate all columns of class "data.frame" from the original data set then cbind() it back together like so:
# Find all columns that are data.frame
# Assuming your data frame is stored in variable 'y'
data.frame.cols <- unname(sapply(y, function(x) class(x) == "data.frame"))
z <- y[, !data.frame.cols]
# All columns of class "data.frame"
dfs <- y[, data.frame.cols]
# Recursively unnest each of these columns
unnest_dataframes <- function(x) {
y <- do.call(data.frame, x)
if("data.frame" %in% sapply(y, class)) {
unnest_dataframes(y)
} else {
cat('Nested data.frames successfully unpacked\n')
}
y
}
df2 <- unnest_dataframes(dfs)
# Combine with original data
all_columns <- cbind(z, df2)
In this case R doesn't behave like one would expect but maybe if we dig deeper we can solve it. What is a data frame? as Norman Matloff says in his book (chapter 5):
a data frame is a list, with the components of that list being
equal-length vectors
The following code might be useful to understand.
class(df$new.letters)
[1] "matrix"
str(df)
'data.frame': 10 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10
$ var1 : Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
$ var2 : Factor w/ 10 levels "F","G","H","I",..: 1 2 3 4 5 6 7 8 9 10
$ new.letters: chr [1:10, 1:2] "a" "b" "c" "d" ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "var1" "var2"
Maybe the reason why it looks strange is in the print methods. Consider this:
colnames(df$new.letters)
[1] "var1" "var2"
maybe there must something in the print methods that combine the sub-names of objects and display them all.
For example here the vectors that constitute the df are:
names(df)
[1] "id" "var1" "var2" "new.letters"
but in this case the vector new.letters also has a dim attributes (in fact it is a matrix) were dimensions have names var1 and var1 too. See this code:
attributes(df$new.letters)
$dim
[1] 10 2
$dimnames
$dimnames[[1]]
NULL
$dimnames[[2]]
[1] "var1" "var2"
but when we print we see all of them like they were separated vectors (and so columns of the data.frame!).
Edit: Print methods
Just for curiosity in order to improve this question I looked inside the methods of the print functions:
methods(print)
The previous code produces a very long list of methods for the generic function print but there is no one for data.frame. The one that looks for data frame (but I am sure there is a more technically way to find out that) is listof.
getS3method("print", "listof")
function (x, ...)
{
nn <- names(x)
ll <- length(x)
if (length(nn) != ll)
nn <- paste("Component", seq.int(ll))
for (i in seq_len(ll)) {
cat(nn[i], ":\n")
print(x[[i]], ...)
cat("\n")
}
invisible(x)
}
<bytecode: 0x101afe1c8>
<environment: namespace:base>
Maybe I am wrong but It seems to me that in this code there might be useful informations about why that happens, specifically when the if (length(nn) != ll) is stated.

Resources