I want to change all the values in categorical columns by rank. Rank can be decided using the index of the sorted unique elements in the column.
For instance,
> data[1:5,1]
[1] "B2" "C4" "C5" "C1" "B5"
then I want these entries in the column replacing categorical values
> data[1:5,1]
[1] "1" "4" "5" "3" "2"
Another column:
> data[1:5,3]
[1] "Verified" "Source Verified" "Not Verified" "Source Verified" "Source Verified"
Then the updated column:
> data[1:5,3]
[1] "3" "2" "1" "2" "2"
I used this code for this task but it is taking a lot of time.
for(i in 1:ncol(data)){
  if(is.character(data[,i])){
    temp <- sort(unique(data[,i]))
    for(j in 1:nrow(data)){
      for(k in 1:length(temp)){
        if(data[j,i] == temp[k]){
          data[j,i] <- k}
      }
    }
  }
}
Please suggest an efficient way to do this, if possible.
Thanks.
Here is a solution in base R. I create a helper function that converts each column to a factor, using its sorted unique values as levels. This is similar to what you did, except that I use as.integer to get the ranking values.
rank_fac <- function(col1)
  as.integer(factor(col1, levels = sort(unique(col1))))
Some data example:
dx <- data.frame(
col1= c("B2" ,"C4" ,"C5", "C1", "B5"),
col2=c("Verified" , "Source Verified", "Not Verified" , "Source Verified", "Source Verified")
)
Apply it without a for loop. It is better to use lapply here to avoid side effects.
data.frame(lapply(dx, rank_fac))
Results:
# col1 col2
# [1,] 1 3
# [2,] 4 2
# [3,] 5 1
# [4,] 3 2
# [5,] 2 2
Using data.table syntactic sugar:
library(data.table)
setDT(dx)[,lapply(.SD,rank_fac)]
# col1 col2
# 1: 1 3
# 2: 4 2
# 3: 5 1
# 4: 3 2
# 5: 2 2
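A simpler solution, using only as.integer. Note that this works only if the columns are already factors (for example, created with stringsAsFactors = TRUE); as.integer then returns the level codes, and the default levels are the sorted unique values:
setDT(dx)[,lapply(.SD,as.integer)]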
Using match:
# df is your data.frame
df[] <- lapply(df, function(x) match(x, sort(unique(x))))
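As a quick check, the match() approach can be run on the sample values from the question. This is only a sketch: the data frame name dx, the column names, and the helper rank_by_sorted_unique are illustrative and not part of the original post.

```r
# Sample data from the question (names are illustrative)
dx <- data.frame(
  col1 = c("B2", "C4", "C5", "C1", "B5"),
  col2 = c("Verified", "Source Verified", "Not Verified",
           "Source Verified", "Source Verified"),
  stringsAsFactors = FALSE
)

# Rank of each value = its position among the sorted unique values
rank_by_sorted_unique <- function(x) match(x, sort(unique(x)))

# Only touch character columns; leave any other column unchanged
is_chr <- vapply(dx, is.character, logical(1))
dx[is_chr] <- lapply(dx[is_chr], rank_by_sorted_unique)

dx$col1  # 1 4 5 3 2
dx$col2  # 3 2 1 2 2
```

Because match() is vectorised, each column is processed in one call instead of one comparison per row and per level, which is where the nested loops lose their time.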
Assume I have a (user-defined covariance) matrix and I want to define the column names like this:
y <- matrix(rnorm(15*10),ncol=15)
colnames(y) <- c("Var1", "Cov12", "Cov13","Cov14", "Cov15",
"Var2", "Cov23", "Cov24", "Cov25",
"Var3", "Cov34" , "Cov35",
"Var4", "Cov45",
"Var5")
where each row contains the variances and covariances for a date t. I want to find a more general way to assign the column names as above, because I will not always have 5 different variances. I tried something with rep and seq but didn't find a solution.
Maybe not the most optimal way but we can do
n <- 5
paste0("Var", rep(1:n, n:1), unlist(sapply(2:n, function(x) c("",seq(x, n)))))
[1] "Var1" "Var12" "Var13" "Var14" "Var15" "Var2" "Var23" "Var24" "Var25" "Var3"
"Var34" "Var35" "Var4" "Var45" "Var5"
Breaking it down for better understanding
rep(1:n, n:1)
#[1] 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
unlist(sapply(2:n, function(x) c("",seq(x, n))))
#[1] "" "2" "3" "4" "5" "" "3" "4" "5" "" "4" "5" "" "5"
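We take these outputs and paste them element-wise with "Var" to get the column names. Note that the second vector has only 14 elements while the first has 15, so paste0 recycles it and the final name, "Var5", gets the empty suffix from its first element. Every name also gets the "Var" prefix here, so the covariance columns come out as "Var12" rather than "Cov12".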
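If the "Var"/"Cov" distinction from the question matters, one possible variant builds both indices explicitly and picks the prefix per element. This is a sketch, not the answer's original code; the names first, second and col_names are made up for illustration.

```r
n <- 5

# Row index of each upper-triangle entry: 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
first <- rep(1:n, n:1)

# Column index of each entry: 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5
second <- unlist(lapply(1:n, function(i) i:n))

# Diagonal entries are variances, off-diagonal entries are covariances
col_names <- ifelse(first == second,
                    paste0("Var", first),
                    paste0("Cov", first, second))
col_names
```

Both index vectors have length n * (n + 1) / 2, so nothing relies on recycling. For n of 10 or more the concatenated indices become ambiguous (for example "Cov112"), so a separator would be needed in that case.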
I've tried to find answers here and on Google but had no luck. I've been struggling with this issue for some days, so I would really appreciate help. I'm analyzing a network to see if cycles tend to be within discrete communities or between them, or show no pattern. My data are a list of cycles (three nodes forming a loop) and a list of communities (variable number of nodes). I have two questions: 1) how to compare two lists, and 2) how to output the comparison results in a readable way:
Question 1
I have two lists (both igraph objects), one containing 678 items (each of 3 elements, all characters) and another containing 11 items each with a differing number of elements. Example:
x1 <- as.character(c(1,3,5))
x2 <- as.character(c(2,4,6))
x3 <- as.character(c(7,8,9))
x4 <- as.character(c(10,11,12))
x <- list(x1, x2, x3, x4)
y1 <- as.character(c(1,2,3,4,5))
y2 <- as.character(c(2,3,4,5))
y3 <- as.character(c(1,2,3,4,5,7,8,9))
y <- list(y1, y2, y3)
Giving:
> x
[[1]]
[1] "1" "3" "5"
[[2]]
[1] "2" "4" "6"
[[3]]
[1] "7" "8" "9"
[[4]]
[1] "10" "11" "12"
> y
[[1]]
[1] "1" "2" "3" "4" "5"
[[2]]
[1] "2" "3" "4" "5"
[[3]]
[1] "1" "2" "3" "4" "5" "7" "8" "9"
I want to compare every component in x against every component in y and add every hit (i.e. when all the elements from x[[i]] are also found in y[[j]]) to a new dataframe. I tried a loop using all() and %in% but this didn't work:
for (i in 1:length(x)) {
for (j in 1:length(y)) {
hits <- all(y[[j]] %in% x[[i]]) == TRUE
print(hits)
}
}
This returns 12 FALSE hits. Checking individual components, it should have worked, because:
all(x[[1]] %in% y[[1]])
Returns TRUE as it should, and:
all(x[[1]] %in% y[[2]])
Returns FALSE as it should. Where am I going wrong here?
Question 2
I have seen some solutions for outputting loop results into a df, but that's not exactly what I need. What I want as an output is a dataframe telling me which community every cycle is in. Since there's only 11 communities, it could just refer me to the list component's index, but I haven't found a way to do this. I could also just use paste() to concatenate the node names of a community into a title. Either way, here is the output I need:
cycle community
1 1_3_5 1_2_3_4_5
2 1_3_5 1_2_3_4_5_7_8_9
3 7_8_9 1_2_3_4_5_7_8_9
I'm guessing some kind of an if statement. I feel this should be fairly simple to execute and that I should have been able to work it out myself. Nevertheless, thank you for your time and sorry about the essay.
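You have the arguments to %in% the wrong way round: you want to test whether all elements of x[[i]] are in y[[j]], not the reverse.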
for (i in 1:length(x)) {
for (j in 1:length(y)) {
# hits <- all(y[[j]] %in% x[[i]]) == TRUE
hits <- all(x[[i]] %in% y[[j]]) == TRUE
print(hits)
}
}
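The direction of %in% matters because the test is a subset check, which is not symmetric. A minimal illustration, using the first cycle and first community from the question (the variable names here are only for illustration):

```r
cycle     <- c("1", "3", "5")
community <- c("1", "2", "3", "4", "5")

# Is every node of the cycle inside the community? (the intended test)
cycle_in_community <- all(cycle %in% community)   # TRUE

# The reversed test asks whether the whole community fits inside the cycle
community_in_cycle <- all(community %in% cycle)   # FALSE
```

Since every community here has more than three nodes, the reversed test can never succeed, which explains the 12 FALSE results.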
For the second part you can store the indexes that have a hit and use them for later.
a <- list()
for (i in 1:length(x)) {
for (j in 1:length(y)) {
# hits <- all(y[[j]] %in% x[[i]]) == TRUE
hits <- all(x[[i]] %in% y[[j]]) == TRUE
if(hits == TRUE){
a[[length(a)+1]] <- c(i,j)
}
}
}
The final part of the question, creation of cycle and community tags, can be accomplished with stringi::stri_join() (or paste() as pointed out in the comments). The final step to wrangle the list created in Jt Miclat's answer is as follows, using the indexes in the list a to extract the appropriate strings for cycle and community, generate data frames, and rbind() the result to a single data frame.
# combine with cycle & community tags
cycles <- sapply(x,paste,collapse="_")
communities <- sapply(y,paste,collapse="_")
b <- lapply(a,function(x){
cycle <- cycles[x[1]]
community <- communities[x[2]]
data.frame(x=x[1],y=x[2],cycle=cycle,community=community,
stringsAsFactors=FALSE)
})
df <- do.call(rbind,b)
df
...and the output:
> df <- do.call(rbind,b)
> df
x y cycle community
1 1 1 1_3_5 1_2_3_4_5
2 1 3 1_3_5 1_2_3_4_5_7_8_9
3 3 3 7_8_9 1_2_3_4_5_7_8_9
>
Well you can make use of outer:
results <- outer(x,y,function(w,z)Map(function(i,j)all(i%in%j),w,z))
results
[,1] [,2] [,3]
[1,] TRUE FALSE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE TRUE
[4,] FALSE FALSE FALSE
x indexes the rows and y the columns, so to check all(x[[1]]%in%y[[2]]), just look at row 1, column 2, i.e. element [1,2], and so on.
Then you can use apply with your own function:
fun<-function(i)c(paste(x[[i[1]]],collapse ="_"), paste(y[[i[2]]],collapse ="_"))
t(apply(which(results==TRUE,arr.ind=TRUE),1,fun))
[,1] [,2]
[1,] "1_3_5" "1_2_3_4_5"
[2,] "1_3_5" "1_2_3_4_5_7_8_9"
[3,] "7_8_9" "1_2_3_4_5_7_8_9"
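The same comparison can also be written over an explicit grid of index pairs, which yields the requested data frame directly. This is a sketch under the question's example data; the names pairs, hit, hits and res are illustrative.

```r
x <- list(c("1", "3", "5"), c("2", "4", "6"),
          c("7", "8", "9"), c("10", "11", "12"))
y <- list(c("1", "2", "3", "4", "5"),
          c("2", "3", "4", "5"),
          c("1", "2", "3", "4", "5", "7", "8", "9"))

# Every (cycle, community) index pair
pairs <- expand.grid(i = seq_along(x), j = seq_along(y))

# TRUE where every node of cycle i is contained in community j
hit <- mapply(function(i, j) all(x[[i]] %in% y[[j]]), pairs$i, pairs$j)
hits <- pairs[hit, ]

# Build readable labels for the matching pairs
res <- data.frame(
  cycle     = vapply(x[hits$i], paste, character(1), collapse = "_"),
  community = vapply(y[hits$j], paste, character(1), collapse = "_"),
  stringsAsFactors = FALSE
)
res
```

Keeping hits$i and hits$j around also gives the community index for each cycle, which may be more convenient than the pasted labels when there are only 11 communities.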
This is my code that attempts to apply a function to each row in a tibble, mytib:
> mytib
# A tibble: 3 x 1
value
<chr>
1 1
2 2
3 3
Here is my code where I'm attempting to apply a function to each line in the tibble :
mytib = as_tibble(c("1" , "2" ,"3"))
procLine <- function(f) {
print('here')
print(f)
}
lapply(mytib , procLine)
Using lapply :
> lapply(mytib , procLine)
[1] "here"
[1] "1" "2" "3"
$value
[1] "1" "2" "3"
This output suggests the function is not invoked once per line as I expect the output to be :
here
1
here
2
here
3
How to apply function to each row in tibble ?
Update : I appreciate the supplied answers that allow my expected result but what have I done incorrectly with my implementation ? lapply should apply a function to each element ?
invisible is used to avoid displaying the output. Also you have to loop through elements of the column named 'value', instead of the column as a whole.
invisible( lapply(mytib$value , procLine) )
# [1] "here"
# [1] "1"
# [1] "here"
# [1] "2"
# [1] "here"
# [1] "3"
lapply loops through columns of a data frame by default. See the example below. The values of two columns are printed as a whole in each iteration.
mydf <- data.frame(a = letters[1:3], b = 1:3, stringsAsFactors = FALSE )
invisible(lapply( mydf, print))
# [1] "a" "b" "c"
# [1] 1 2 3
To iterate through each element of a column in a data frame, you have to loop twice like below.
invisible(lapply( mydf, function(x) lapply(x, print)))
# [1] "a"
# [1] "b"
# [1] "c"
# [1] 1
# [1] 2
# [1] 3
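When the function needs all values of a row at once rather than one cell at a time, one option is to walk the columns in parallel with Map(). A small sketch; describe_row is a hypothetical per-row function, not something from the question.

```r
mydf <- data.frame(a = letters[1:3], b = 1:3, stringsAsFactors = FALSE)

# Hypothetical per-row function: receives the values of one row
describe_row <- function(a, b) paste0(a, "=", b)

# Map() steps through both columns together, so each call sees one row
row_labels <- unlist(Map(describe_row, mydf$a, mydf$b), use.names = FALSE)
row_labels
# [1] "a=1" "b=2" "c=3"
```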
Not sure how to do the following. Please refer to the picture in the link below:
https://i.stack.imgur.com/Kx79x.png
I have some blank spaces, and they are the missing values. I do not want this level to be read. I want R to ignore this level. I want to write a regression so that this empty category is not part of the model.
The data was read from a csv file. The variable is "I", "II"...."IV", but there is an extra "" factor because of missing data. I want R to ignore this factor. My question is how?
You can do the following:
df <- data.frame(letters=letters[1:5], numbers=c(1,2,3,"",5), stringsAsFactors=TRUE) # my data frame
# letters numbers
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d
# 5 e 5
levels(df$numbers)
# "" "1" "2" "3" "5"
subdf <- subset(df, numbers != "") # data subset
subdf$numbers <- factor(subdf$numbers)
levels(subdf$numbers)
# "1" "2" "3" "5"
Change the "" data to missing:
# generate sample data
df <- data.frame(x = sample(c("","I","II","III"),100, replace = TRUE), stringsAsFactors = TRUE)
option 1
df[df$x=="",'x'] <- NA
option 2
df$x <- factor(ifelse(df$x == "",NA,as.character(df$x)))
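Since the data come from a csv file, the blanks can also be turned into NA at read time with the na.strings argument of read.csv, so the "" level never appears. A sketch with made-up data; the column names grade and score are assumptions, not taken from the question.

```r
# Made-up csv content with one blank category
csv_text <- "grade,score\nI,10\n,12\nII,15\nIV,11\nI,9\nII,14"

# Treat empty fields as NA while reading
dat <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)

levels(dat$grade)   # "I" "II" "IV", no "" level

# lm() drops rows with NA by default, so the blank row is not in the model
fit <- lm(score ~ grade, data = dat)
nobs(fit)           # 5 of the 6 rows are used
```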
Consider this example:
df <- data.frame(id=1:10,var1=LETTERS[1:10],var2=LETTERS[6:15])
fun.split <- function(x) tolower(as.character(x))
df$new.letters <- apply(df[ ,2:3],2,fun.split)
df$new.letters.var1
#NULL
colnames(df)
# [1] "id" "var1" "var2" "new.letters"
df$new.letters
# var1 var2
# [1,] "a" "f"
# [2,] "b" "g"
# [3,] "c" "h"
# [4,] "d" "i"
# [5,] "e" "j"
# [6,] "f" "k"
# [7,] "g" "l"
# [8,] "h" "m"
# [9,] "i" "n"
# [10,] "j" "o"
Would be someone so kind and explain what is going on here? A new dataframe within dataframe?
I expected this:
colnames(df)
# id var1 var2 new.letters.var1 new.letters.var2
The reason is that you assigned a two-column matrix (the output of apply) to a single new column. So the result is a matrix stored in a single column. You can convert it back to a normal data.frame with
do.call(data.frame, df)
A more straightforward method is to assign 2 columns. I use lapply instead of apply because the columns can be of different classes: apply returns a matrix, and with mixed classes the columns will all be of class 'character', whereas lapply returns a list and preserves the class of each column.
df[paste0('new.letters.', names(df)[2:3])] <- lapply(df[2:3], fun.split)
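To see the matrix column and the effect of do.call(data.frame, df) side by side, here is a shortened version of the question's example (three rows instead of ten; the name flat is illustrative):

```r
df <- data.frame(id = 1:3, var1 = LETTERS[1:3], var2 = LETTERS[4:6],
                 stringsAsFactors = FALSE)

# apply() returns a 3 x 2 character matrix, stored as ONE column
df$new.letters <- apply(df[, 2:3], 2, tolower)
ncol(df)                   # 4
is.matrix(df$new.letters)  # TRUE

# Rebuilding the data frame splits the matrix into separate columns
flat <- do.call(data.frame, df)
names(flat)
# [1] "id" "var1" "var2" "new.letters.var1" "new.letters.var2"
```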
@akrun solved 90% of my problem. But I had data.frames buried within data.frames, buried within data.frames and so on, without knowing how deep the nesting went.
In this case, I thought sharing my recursive solution might be helpful to others searching this thread as I was:
unnest_dataframes <- function(x) {
  y <- do.call(data.frame, x)
  # Recurse while any column is still a data.frame, and keep the result
  if(any(sapply(y, is.data.frame))) y <- unnest_dataframes(y)
  y
}
new_data <- unnest_dataframes(df)
Although this itself sometimes has problems and it can be helpful to separate all columns of class "data.frame" from the original data set then cbind() it back together like so:
# Find all columns that are data.frame
# Assuming your data frame is stored in variable 'y'
data.frame.cols <- unname(sapply(y, function(x) class(x) == "data.frame"))
z <- y[, !data.frame.cols]
# All columns of class "data.frame"
dfs <- y[, data.frame.cols]
# Recursively unnest each of these columns
unnest_dataframes <- function(x) {
  y <- do.call(data.frame, x)
  if(any(sapply(y, is.data.frame))) {
    # Keep the result of the recursive call instead of discarding it
    y <- unnest_dataframes(y)
  } else {
    cat('Nested data.frames successfully unpacked\n')
  }
  y
}
df2 <- unnest_dataframes(dfs)
# Combine with original data
all_columns <- cbind(z, df2)
In this case R doesn't behave like one would expect but maybe if we dig deeper we can solve it. What is a data frame? as Norman Matloff says in his book (chapter 5):
a data frame is a list, with the components of that list being
equal-length vectors
The following code might be useful to understand.
class(df$new.letters)
[1] "matrix"
str(df)
'data.frame': 10 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10
$ var1 : Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
$ var2 : Factor w/ 10 levels "F","G","H","I",..: 1 2 3 4 5 6 7 8 9 10
$ new.letters: chr [1:10, 1:2] "a" "b" "c" "d" ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "var1" "var2"
Maybe the reason why it looks strange is in the print methods. Consider this:
colnames(df$new.letters)
[1] "var1" "var2"
Maybe there is something in the print methods that combines the sub-names of objects and displays them all.
For example here the vectors that constitute the df are:
names(df)
[1] "id" "var1" "var2" "new.letters"
but in this case the vector new.letters also has a dim attribute (in fact it is a matrix) whose dimensions have the names var1 and var2 too. See this code:
attributes(df$new.letters)
$dim
[1] 10 2
$dimnames
$dimnames[[1]]
NULL
$dimnames[[2]]
[1] "var1" "var2"
but when we print we see all of them as if they were separate vectors (and so separate columns of the data.frame!).
Edit: Print methods
Just for curiosity in order to improve this question I looked inside the methods of the print functions:
methods(print)
The previous code produces a very long list of methods for the generic function print. Note that the list does include print.data.frame, which is the method actually dispatched for a data frame. For comparison, here is the method for plain lists of class "listof":
getS3method("print", "listof")
function (x, ...)
{
nn <- names(x)
ll <- length(x)
if (length(nn) != ll)
nn <- paste("Component", seq.int(ll))
for (i in seq_len(ll)) {
cat(nn[i], ":\n")
print(x[[i]], ...)
cat("\n")
}
invisible(x)
}
<bytecode: 0x101afe1c8>
<environment: namespace:base>
Maybe I am wrong, but it seems to me that this code might contain useful information about why that happens, specifically at the if (length(nn) != ll) check.