Reordering columns in a large dataframe - r

Using the following example dataframe:
a <- c(1:5)
b <- c("Cat", "Dog", "Rabbit", "Cat", "Dog")
c <- c("Dog", "Rabbit", "Cat", "Dog", "Dog")
d <- c("Rabbit", "Cat", "Dog", "Dog", "Rabbit")
e <- c("Cat", "Dog", "Dog", "Rabbit", "Cat")
f <- c("Cat", "Dog", "Dog", "Rabbit", "Cat")
df <- data.frame(a,b,c,d,e,f)
I want to know how to reorder the columns WITHOUT having to type out all the column names, i.e., df[, c("a","d","e","f","b","c")].
How would I just say that I want columns b and c AFTER column f, referencing only the columns (or range of columns) that I want to move?
Many thanks in advance for your help.

To move specific columns to the beginning or end of a data.frame, use select() from the dplyr package together with its everything() helper. In this example we are sending b and c to the end:
library(dplyr)
df %>%
select(-b, -c, everything())
a d e f b c
1 1 Rabbit Cat Cat Cat Dog
2 2 Cat Dog Dog Dog Rabbit
3 3 Dog Dog Dog Rabbit Cat
4 4 Dog Rabbit Rabbit Cat Dog
5 5 Rabbit Cat Cat Dog Dog
Without the negation, the columns would be sent to the front.
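For completeness, the unnegated form (a quick sketch) sends the named columns to the front instead:
df %>%
select(b, c, everything())
# b c a d e f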

If you're just moving certain columns to the end, you can create a little helper-function like the following:
movetolast <- function(data, move) {
data[c(setdiff(names(data), move), move)]
}
movetolast(df, c("b", "c"))
# a d e f b c
# 1 1 Rabbit Cat Cat Cat Dog
# 2 2 Cat Dog Dog Dog Rabbit
# 3 3 Dog Dog Dog Rabbit Cat
# 4 4 Dog Rabbit Rabbit Cat Dog
# 5 5 Rabbit Cat Cat Dog Dog
I would not recommend getting into the habit of using column positions, especially from a programmatic standpoint, since those positions might change.
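For example, if a column is later added at the front, position-based code silently picks up different columns, while name-based code keeps working (a small illustration, not part of the original answer):
df2 <- cbind(z = 0, df)
df2[, 2:3]          # now returns a and b, not b and c
df2[, c("b", "c")]  # still returns b and c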
"For fun" update
Here's an extended interpretation of the above function. It allows you to move columns to either the first or last position, or to be before or after another column.
moveMe <- function(data, tomove, where = "last", ba = NULL) {
temp <- setdiff(names(data), tomove)
x <- switch(
where,
first = data[c(tomove, temp)],
last = data[c(temp, tomove)],
before = {
if (is.null(ba)) stop("must specify ba column")
if (length(ba) > 1) stop("ba must be a single character string")
data[append(temp, values = tomove, after = (match(ba, temp)-1))]
},
after = {
if (is.null(ba)) stop("must specify ba column")
if (length(ba) > 1) stop("ba must be a single character string")
data[append(temp, values = tomove, after = (match(ba, temp)))]
})
x
}
Try it with the following.
moveMe(df, c("b", "c"))
moveMe(df, c("b", "c"), "first")
moveMe(df, c("b", "c"), "before", "e")
moveMe(df, c("b", "c"), "after", "e")
You'll need to adapt it to add some error checking: for instance, if you try to move columns "b" and "c" to "before c", you'll (obviously) get an error.
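A minimal guard along those lines (just a sketch) could go near the top of the function:
if (!all(tomove %in% names(data))) stop("all 'tomove' columns must exist in 'data'")
if (!is.null(ba) && ba %in% tomove) stop("'ba' cannot be one of the columns being moved")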

You can refer to columns by position, e.g.:
df <- df[, c(1, 4:6, 2:3)]
> df
a d e f b c
1 1 Rabbit Cat Cat Cat Dog
2 2 Cat Dog Dog Dog Rabbit
3 3 Dog Dog Dog Rabbit Cat
4 4 Dog Rabbit Rabbit Cat Dog
5 5 Rabbit Cat Cat Dog Dog

The dplyr function relocate(), a new verb introduced in dplyr 1.0.0, does exactly what you are looking for, with highly readable syntax:
df %>% dplyr::relocate(b, c, .after = f)
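relocate() also accepts a .before argument, and with neither .before nor .after the columns are simply moved to the front; a couple of quick variations:
df %>% dplyr::relocate(b, c, .before = a)  # b and c ahead of a
df %>% dplyr::relocate(b, c)               # default: moved to the front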

To generalize reshuffling the columns into any order using dplyr, for example to go from
df <- data.frame(a,b,c,d,e,f)
to
df[,c("a","d","e","f","b","c")]
use column ranges inside select():
df %>% select(a, d:f, b:c)
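The same ranges also work with negation, so the columns to move can be pushed to the end without naming the ones that stay (a small variation on the select() answer above):
df %>% select(-(b:c), everything())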

Use the subset function:
> df <- data.frame(a,b,c,d,e,f)
> df <- subset(df, select = c(a, d:f, b:c))
> df
a d e f b c
1 1 Rabbit Cat Cat Cat Dog
2 2 Cat Dog Dog Dog Rabbit
3 3 Dog Dog Dog Rabbit Cat
4 4 Dog Rabbit Rabbit Cat Dog
5 5 Rabbit Cat Cat Dog Dog

I changed the previous function to work with data.table, using the setcolorder function from the data.table package.
moveMeDataTable <-function(data, tomove, where = "last", ba = NULL) {
temp <- setdiff(names(data), tomove)
x <- switch(
where,
first = setcolorder(data,c(tomove, temp)),
last = setcolorder(data,c(temp, tomove)),
before = {
if (is.null(ba)) stop("must specify ba column")
if (length(ba) > 1) stop("ba must be a single character string")
order = append(temp, values = tomove, after = (match(ba, temp)-1))
setcolorder(data,order)
},
after = {
if (is.null(ba)) stop("must specify ba column")
if (length(ba) > 1) stop("ba must be a single character string")
order = append(temp, values = tomove, after = (match(ba, temp)))
setcolorder(data,order)
})
x
}
library(data.table)
DT <- data.table(A=sample(3, 10, TRUE),
B=sample(letters[1:3], 10, TRUE), C=sample(10))
DT <- moveMeDataTable(DT, "C", "after", "A")
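Note that setcolorder() reorders the data.table by reference, so moveMeDataTable() also modifies its input in place. If the original column order should be kept, work on a copy (a small sketch):
DT2 <- moveMeDataTable(copy(DT), "C", "first")  # DT itself is left untouched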

Here is another option:
df <- cbind( df[, -(2:3)], df[, 2:3] )
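The same idea works with names instead of positions, in case the column numbers are likely to change (a small sketch):
move <- c("b", "c")
df <- cbind(df[setdiff(names(df), move)], df[move])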

Related

Display a table of strings and their variations per row (R)

For a large database, I would like to predefine the strings to be searched and then get a table containing the frequency of these strings, and their possible variations, per row.
strings <- c("dog", "cat", "mouse")
var1 <- c("black dog", "white dog", "angry dog", "dogs and cats are nice", "dog")
var2 <- c("white cat", "black cat", "tiny cat", NA, "cow")
var3 <- c("little mouse", "big mouse", NA, NA, "mouse")
data <- data.frame(var1, var2, var3)
The result should look like this when I am looking for dog, cat and mouse:
dog&cat 4
mouse 3
We may try
v1 <- do.call(paste, data)
stack(setNames(lapply(c( "\\bdog.*\\bcat|\\bcat.*\\bdog", "mouse"),
\(pat) sum(grepl(pat, v1))), c("dog&cat", "mouse")))[2:1]
ind values
1 dog&cat 4
2 mouse 3
Or if we need all the combinations
lst1 <- lapply(c(strings, combn(strings, 2, FUN = \(x)
sprintf("\\b%1$s.*\\b%2$s|\\b%2$s.*\\b%1$s", x[1], x[2]))),
\(pat) sum(grepl(pat, v1)))
names(lst1) <- c(strings, combn(strings, 2, FUN = paste, collapse = "&"))
stack(lst1)[2:1]
ind values
1 dog 5
2 cat 4
3 mouse 3
4 dog&cat 4
5 dog&mouse 3
6 cat&mouse 2
For more combinations, it may be better to apply grepl individually and combine the results with Reduce:
lst1 <- lapply(1:3, \(n) {
vals <- colSums(combn(strings, n,
FUN = \(pats) Reduce(`&`, lapply(pats, \(pat) grepl(pat, v1)))))
nms <- combn(strings, n, FUN = paste, collapse = "&")
setNames(vals, nms)
})
stack(unlist(lst1))[2:1]
ind values
1 dog 5
2 cat 4
3 mouse 3
4 dog&cat 4
5 dog&mouse 3
6 cat&mouse 2
7 dog&cat&mouse 2
Or with tidyverse
library(dplyr)
library(stringr)
library(tidyr)
data %>%
unite(var, everything(), na.rm = TRUE, sep = " ") %>%
summarise(`dog&cat` = sum(str_detect(var,
"\\bdog.*\\bcat|\\bcat.*\\bdog")),
mouse = sum(str_detect(var, 'mouse'))) %>%
pivot_longer(everything())
-output
# A tibble: 2 × 2
name value
<chr> <int>
1 dog&cat 4
2 mouse 3
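A compact base R alternative (a rough sketch, using plain grepl() without the word-boundary patterns above, which gives the same counts on this example) is to build one logical hit matrix and read the single and pairwise counts off colSums() and crossprod():
m <- sapply(strings, grepl, x = v1)  # one TRUE/FALSE column per search string
colSums(m)                           # individual counts: dog 5, cat 4, mouse 3
crossprod(m)                         # pairwise co-occurrences, e.g. the dog/cat cell is 4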

More efficient way to purrr::map2 for a large dataframe

Is there a faster way to do the following, where in the real application, df has many rows (and therefore list_of_colnames has the same number of elements):
list_of_colnames <- list(c("A", "B"), c("A"))
some_vector <- c("fish", "cat")
map2(split(df, seq(nrow(df))), list_of_colnames, function(row, colnames) {
row$indicator <- ifelse(any(row[, colnames] %in% some_vector), 1, 0)
return(row)
})
While this current implementation works, it takes centuries for the big df. In fact I think split() is a major bottleneck.
Thank you!
One option may be to make use of row/column indexing
rowind <- rep(seq_len(nrow(df)), lengths(list_of_colnames) * nrow(df))
df$indicator <- +(tapply(c(t(df[unlist(list_of_colnames)])) %in% some_vector,
rowind, FUN = any))
-output
> df
A B indicator
1 fish A 1
2 hello cat 1
data
df <- data.frame(A = c('fish', 'hello'), B = c('A', 'cat'))
You can avoid splitting your data frame into a list altogether and instead apply your condition across the rows using rowwise() and c_across() from dplyr:
library(dplyr)
library(purrr)
list_of_colnames <- list(c("A", "B"), c("A"))
some_vector <- c("fish", "cat")
map(list_of_colnames, ~
df %>%
rowwise() %>%
mutate(indicator = as.numeric(any(c_across(all_of(.x)) %in% some_vector))) %>%
ungroup()
)
Output
Since we are still mapping over list_of_colnames, this returns a list:
[[1]]
# A tibble: 3 x 4
A B C indicator
<chr> <chr> <chr> <lgl>
1 fish dog bird TRUE
2 dog cat bird TRUE
3 bird lion cat FALSE
[[2]]
# A tibble: 3 x 4
A B C indicator
<chr> <chr> <chr> <lgl>
1 fish dog bird TRUE
2 dog cat bird FALSE
3 bird lion cat FALSE
Data
structure(list(A = c("fish", "dog", "bird"), B = c("dog", "cat",
"lion"), C = c("bird", "bird", "cat")), class = "data.frame", row.names = c(NA,
-3L))
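If only the indicator is needed (not the list of data frames) and list_of_colnames has one entry per row of df as in the question, a rough sketch that avoids both split() and rowwise() is to test membership once for the whole data frame and then index per row:
# logical matrix: TRUE where a cell is in some_vector
hits <- sapply(df, `%in%`, table = some_vector)
# per-row indicator, using each row's own set of columns
df$indicator <- mapply(\(i, cols) as.integer(any(hits[i, cols])),
seq_len(nrow(df)), list_of_colnames)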

Add a different value (by groups) to another value in dataframe in R

I have two dataframes, and I want to subtract the values of a column in one dataframe from a column in the other whenever the rows match on the value of another column.
I have two dataframes A and B that are similar to the following:
[A]
Col1 Col2
1 cat
2 dog
3 bird
4 cat
5 dog
[B]
Col1 Col2
[cat] 1
[dog] 2
[bird] 3
I want to be able to add the values A$Col1 + B$Col2 whenever A$Col2 matches the tag of [B], and create a list with the results that has the same length as the number of rows in [A].
I have tried this code
(A$Col1-B$Col2)[A$Col2==B$Col1]
which seems to work, but the following warning shows up:
longer object length is not a multiple of shorter object length
Use a left join (merge() with all.x = TRUE) and then create a new column for the difference:
library(dplyr)
merge(A, B, by.x = "Col2", by.y = "Col1", all.x = TRUE) %>% mutate(Difference = Col1 - Col2.y)
Col2 Col1 Col2.y Difference
1 bird 3 3 0
2 cat 1 1 0
3 cat 4 1 3
4 dog 2 2 0
5 dog 5 2 3
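The same lookup can be done in base R with match(), which keeps the rows of A in their original order (a small sketch using the A and B from the question):
A$Difference <- A$Col1 - B$Col2[match(A$Col2, B$Col1)]
A$Difference
# 0 0 0 3 3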
Here is a tidyverse-style example to join the two dataframes and subtract the columns. You can then take the new column and convert it to a list or whatever you need.
library(tidyverse)
A <- tibble(
Col1 = 1:5,
Col2 = c("cat", "dog", "bird", "cat", "dog")
)
B <- tibble(
Col1 = c("cat", "dog", "bird"),
Col2 = 1:3
)
A %>%
left_join(B, by = c("Col2" = "Col1")) %>%
mutate(Col3 = Col1 - Col2.y)

Most Efficient Way to Combine String Columns and Skip Particular Fields

I will try to simplify my df:
Animal1 Animal2 Animal3
dog cat mouse
dog 0 mouse
0 cat 0
with just 3 records.
I wish to combine all 3 animal columns into a single field that would look like the following column:
Animals
dog + cat + mouse
dog + mouse
cat
I think paste(), or some variation of it, would be best, but I cannot find my exact solution; I am sure it is easy. Maybe substituting the 0s with NAs would be a good first step?
Please note that it needs to be done for about 10 million rows.
You could use nested sub() calls to get the desired result:
df <- data.frame(Animal1 = c("dog", "dog", "0"),
Animal2 = c("cat", "0", "cat"),
Animal3 = c("mouse", "mouse", "0"))
df$Animals <- sub("\\+ 0", "", sub("0 \\+", "", paste(df$Animal1, df$Animal2, df$Animal3, sep = " + ")))
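Depending on where the dropped 0 sat, the nested sub() calls can leave a doubled space or stray leading/trailing blanks; a possible cleanup step (just a sketch):
df$Animals <- trimws(gsub("\\s+", " ", df$Animals))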
1) Using DF, shown reproducibly in the Note at the end, define a Collapse function which takes a character vector, removes the "0" elements and collapses the rest into a string separated with plus signs. Use apply to apply that to each row.
Collapse = function(x) paste(x[x != 0], collapse = "+")
transform(DF, Animals = apply(DF, 1, Collapse))
giving:
Animal1 Animal2 Animal3 Animals
1 dog cat mouse dog+cat+mouse
2 dog 0 mouse dog+mouse
3 0 cat 0 cat
2) Alternatively, if a comma followed by a space is OK as the separator, then use this for Collapse:
Collapse <- function(x) toString(x[x != 0])
which when used with the transform statement in (1) gives:
Animal1 Animal2 Animal3 Animals
1 dog cat mouse dog, cat, mouse
2 dog 0 mouse dog, mouse
3 0 cat 0 cat
3) Another possibility is to make the Animals column a list of vectors:
DF2 <- DF
DF2$Animals <- lapply(split(DF, 1:nrow(DF)), function(x) x[x != 0])
giving:
> DF2
Animal1 Animal2 Animal3 Animals
1 dog cat mouse dog, cat, mouse
2 dog 0 mouse dog, mouse
3 0 cat 0 cat
> str(DF2)
'data.frame': 3 obs. of 4 variables:
$ Animal1: chr "dog" "dog" "0"
$ Animal2: chr "cat" "0" "cat"
$ Animal3: chr "mouse" "mouse" "0"
$ Animals:List of 3
..$ 1: chr "dog" "cat" "mouse"
..$ 2: chr "dog" "mouse"
..$ 3: chr "cat"
Note
Lines <- "Animal1 Animal2 Animal3
dog cat mouse
dog 0 mouse
0 cat 0"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
Another idea:
library(tidyverse)
df2 %>%
na_if(0) %>%
mutate(Animals = pmap_chr(., .f = ~stringi::stri_flatten(
c(...), collapse = " + ",
na_empty = TRUE, omit_empty = TRUE)))
Which gives:
# Animal1 Animal2 Animal3 Animals
#1 <NA> <NA> mouse mouse
#2 dog cat mouse dog + cat + mouse
#3 dog <NA> mouse dog + mouse
#4 <NA> cat <NA> cat
#5 <NA> <NA> <NA>
Data
df2 <- data.frame(
Animal1 = c("0", "dog", "dog", "0", "0"),
Animal2 = c("0", "cat", "0", "cat","0"),
Animal3 = c("mouse", "mouse", "mouse", "0","0"),
stringsAsFactors = FALSE)

r programming: align two sequences of words

I want to align two datasets that mostly intersect on one column -- but each dataset is missing some rows. For example:
df1 <- data.frame(word = c("my", "dog", "ran", "with", "your", "dog"),
freq = c(5, 2, 2, 6, 5, 10))
df2 <- data.frame(word = c("my", "brown", "dog", "ran", "your", "dog"),
pos = c("a", "b", "c", "d", "a", "e"))
What I want as output is to have gaps inserted wherever there's a missing item. Thus in the output, the new form of df1 will have NAs where df1 was missing a word match that was in df2, and the new form of df2 will have NAs where df2 was missing a word-instance that was in df1.
As in my example, the sequence matters and elements do repeat (so this isn't a generic "merge" situation). I suspect DTW could figure into the solution but I'm not sure. For present purposes it's fair to stipulate that only exact matches do match.
For the above case the desired output would be a data frame with these columns:
$word1 my NA dog ran with your dog
$freq 5 NA 2 2 6 5 10
$word2 my brown dog ran NA your dog
$pos a b c d NA a e
Thus, the sequence in each original data frame is maintained; nothing is deleted; word tokens remain tokens (it's a corpus, not a dictionary); all that's really happened is spaces (NAs) have been inserted where data are missing.
# number each repeated word (1st "dog", 2nd "dog", ...) so tokens pair up one-to-one
df1$count = ave(seq_along(df1$word), df1$word, FUN = seq_along)
df2$count = ave(seq_along(df2$word), df2$word, FUN = seq_along)
# build an occurrence-qualified key and do a full outer merge on it
df1$merge = paste(df1$count, df1$word)
df2$merge = paste(df2$count, df2$word)
output = merge(x = df1, y = df2, by = "merge", all.x = TRUE, all.y = TRUE)
output[c(2, 3, 5, 6)]
# word.x freq word.y pos
#1 <NA> NA brown b
#2 dog 2 dog c
#3 my 5 my a
#4 ran 2 ran d
#5 with 6 <NA> <NA>
#6 your 5 your a
#7 dog 10 dog e
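One caveat: merge() sorts the result by the key, so the original token sequence is lost. A possible follow-up sketch (idx is a new helper column, not part of the answer above) carries the row positions through the same merge and reorders by them:
df1$idx <- seq_len(nrow(df1))
df2$idx <- seq_len(nrow(df2))
output <- merge(x = df1, y = df2, by = "merge", all.x = TRUE, all.y = TRUE)
output <- output[order(pmin(output$idx.x, output$idx.y, na.rm = TRUE), output$idx.y), ]
output[c("word.x", "freq", "word.y", "pos")]
# word.x: my NA dog ran with your dog  (the original sequence)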
