Multiply rows of R dataframe by matching named numeric - r

How can you selectively multiply each row of a dataframe by the corresponding number in a named numeric?
## Create DataFrame
d <- data.frame(a=c(1,1,2), b=c(2,2,1), c=c(3,0,1))
rownames(d) = c("X", "Y", "Z")
## Create Named Numeric
named_numeric = c(2,1,3)
names(named_numeric) = c("X", "Y", "Z")
named_numeric
All values in the first row of the dataframe ("X": 1,2,3) would be multiplied by the value of "X" in the named numeric (2 in this case). The result for the first row would be "X": 2,4,6.
expected_output = data.frame(a=c(2,1,6), b=c(4,2,3), c=c(6,0,3))
rownames(expected_output) = c("X", "Y", "Z")

We can just do
out <- d * named_numeric
identical(expected_output, out)
#[1] TRUE
If the names of named_numeric are not aligned with the row names, reorder one of them first and then multiply:
d * named_numeric[row.names(d)]
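Why this works: a data frame multiplied by a plain vector is evaluated column by column, so a vector with one element per row ends up scaling each row. A minimal sketch, assuming a deliberately shuffled named_numeric (not part of the original question), that also cross-checks the result against base R's sweep():

```r
d <- data.frame(a = c(1, 1, 2), b = c(2, 2, 1), c = c(3, 0, 1),
                row.names = c("X", "Y", "Z"))
# Shuffled on purpose (assumption), to show why reordering by row name matters
named_numeric <- c(Z = 3, X = 2, Y = 1)

w <- named_numeric[rownames(d)]    # reorder to match the rows: X, Y, Z
out <- d * w                       # w is recycled down each column
out_sweep <- sweep(d, 1, w, `*`)   # same operation with the row margin spelled out
```

sweep() is more verbose, but it makes the intent (operate along rows) explicit and does not rely on the recycling rule.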

Related

Mutate a new column that is the list of the row_id of minimally different rows from .row

tibble(
A= c("x","x","y","y"),
B= c("y","y","y","y"),
C= c("x","y","z","y")
) %>%
mutate(
id = row_number(),
.before = "A"
) %>%
mutate(
neighs_id = list(
c("2"),
c("1,4"),
c("4"),
c("2,3")
)
) %>% View()
neighs_id holds, for each row, the ids of the rows that differ from it in exactly one of the columns A, B, or C (comparing values within the same column).
I want code that replaces the second mutate, for example with map, and returns a list column (the operation is row-wise) holding, for a given selection of columns, every row that differs from the current row in exactly one column.
In theory, I could set up a square id x id matrix, count for each pair of rows the columns where column[id] != column[.id], and keep the pairs where that count == 1, but I think there should be a more straightforward way to vectorise a filter for these "minimally different rows", given a column selector.
In base R:
cols = LETTERS[1:3]
tib$neighs_id <- lapply(seq(nrow(tib)),
function(i) which(sapply(seq(nrow(tib)),
function(x) sum(tib[x, cols] != tib[i, cols])) == 1))
#> pull(tib, neighs_id)
[[1]]
[1] 2
[[2]]
[1] 1 4
[[3]]
[1] 4
[[4]]
[1] 2 3
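The square-matrix idea from the question can also be written directly with outer(). This is only a sketch under the assumption that an n x n matrix fits in memory (so it suits small inputs, not millions of rows); the names mismatch and neighs_id are illustrative:

```r
tib <- data.frame(A = c("x", "x", "y", "y"),
                  B = c("y", "y", "y", "y"),
                  C = c("x", "y", "z", "y"),
                  stringsAsFactors = FALSE)
cols <- c("A", "B", "C")

# For each column, an n x n logical matrix of "row i differs from row j";
# adding them up gives the number of differing columns per pair of rows.
mismatch <- Reduce(`+`, lapply(tib[cols], function(v) outer(v, v, `!=`)))

# Neighbours of row i are the rows that differ in exactly one column.
neighs_id <- lapply(seq_len(nrow(tib)), function(i) which(mismatch[i, ] == 1))
```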
One way to speed this up is not to work with tibbles but with a matrix instead. I guess this is because tibbles (or data frames) are lists of columns so repeated extraction of rows is expensive compared to working with a matrix.
Another significant improvement can be achieved by changing the character matrix to a numeric one so that some operations can be vectorized. This way the inner sapply from Maël's answer can be replaced with subtraction and summing over matrix columns.
n.rep <- 1
tib <- tibble(
A=rep(c("x", "x", "y", "y"), n.rep),
B=rep(c("y", "y", "y", "y"), n.rep),
C=rep(c("x", "y", "z", "y"), n.rep)
)
cols <- LETTERS[1:3]
# change tibble to a matrix
tib.m <- as.matrix(tib[, cols])
# named vector used to translate values to their order
val.ord <- unique(c(tib.m))
val.ord <- setNames(seq_along(val.ord), val.ord)
# create numeric representation using the orders
tib.m[] <- val.ord[tib.m]
mode(tib.m) <- 'numeric'
tib$neighs_id <- apply(tib.m, 1, function(row)
which(colSums(t(tib.m) - row != 0) == 1))
This finishes in about a second when n.rep is 1000 (i.e., tib has 4000 rows). Scaling it up to 1M rows, however, might still be problematic, I'm afraid. For that, using Rcpp might help.
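One caveat worth checking, as a general property of apply() rather than something stated in the answer: apply() simplifies its result to a vector or matrix when every row returns the same number of indices, so the result is not always a list. A sketch of the same comparison with lapply(), which always returns a list; the small numeric matrix m is a stand-in for the encoded tib.m:

```r
# Stand-in for the numerically encoded matrix (x = 1, y = 2, z = 3)
m <- matrix(c(1, 1, 2, 2,    # column A
              2, 2, 2, 2,    # column B
              1, 2, 3, 2),   # column C
            ncol = 3)
tm <- t(m)   # transpose once, so each row of m is a column of tm

# m[i, ] is recycled down the columns of tm, so colSums() counts,
# for every row of m, how many columns differ from row i.
neighs <- lapply(seq_len(nrow(m)),
                 function(i) which(colSums(tm != m[i, ]) == 1))
```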

In R, compare vectors of different length to match and replace values

Thanks for your help.
I have two data frames. The data frames are of differing lengths. One is a data set that often includes mistakes. Another is a set of corrections. I'm trying to do two things at once with these two data sets. First, I would like to compare three columns of df1 with three columns in df2. This means reading the first row of data in df1 and seeing if those three variables match any of the rows in df2 for those three variables, then moving on to row 2, and so on. If a match is found in a row for all three variables, then replace the value in one of the columns in df1 with a replacement in df2. I have included an example below.
df1 <- data.frame("FIRM" = c("A", "A", "B", "B", "C", "C"), "LOCATION" = c("N", "S", "N", "S", "N", "S"), "NAME" = c("Apple", "Blooberry", "Cucumber", "Date", "Egplant", "Fig"))
df2 <- data.frame("FIRM" = c("A", "C"), "LOCATION" = c("S", "N"), "NAME" = c("Blooberry", "Egplant"), "NEW_NAME" = c("Blueberry", "Eggplant"))
df1[] <- lapply(df1, as.character)
df2[] <- lapply(df2, as.character)
If there is a row in df1 that matches against "FIRM", "LOCATION" and "NAME" in df2, then I would like to replace the "NAME" in df1 with "NEW_NAME" in df2, such that "Blooberry" and "Egplant" change to "Blueberry" and "Eggplant".
I can do the final replacements using*:
df1$NAME[match(df2$NAME, df1$NAME)] <- df2$NEW_NAME[match(df1$NAME[match(df2$NAME, df1$NAME)], df2$NAME)]
But this does not include the constraint of the three matches. Also, my code seems unnecessarily complex with the nested match functions. I think I could accomplish this task by subsetting df2 and using a for loop to match rows one by one but I would think that there is a better vectorized method out there.
*I'm aware that inside the brackets of df2$NEW_NAME[], the function calls both elements in that column, but I'm trying to generalize.
Consider an all.x merge (i.e., LEFT JOIN in SQL speak) with an ifelse conditional comparing NAME and NEW_NAME.
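Below, transform assigns the column in the same call, and the trailing [1:3] keeps only the first three columns. By default merge joins on all column names the two data frames share (here FIRM, LOCATION and NAME), which gives the three-way match.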
mdf <- transform(merge(df1,df2,all.x=TRUE),NAME=ifelse(is.na(NEW_NAME),NAME,NEW_NAME))[1:3]
mdf
# FIRM LOCATION NAME
# 1 A N Apple
# 2 A S Blueberry
# 3 B N Cucumber
# 4 B S Date
# 5 C N Eggplant
# 6 C S Fig
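Because merge() sorts its result by the join columns, the rows may come back in a different order when df1 is not already sorted. An alternative sketch that updates df1 in place with a single match() on a composite key; the "\r" separator is an assumption and must not occur in the data:

```r
df1 <- data.frame(FIRM = c("A", "A", "B", "B", "C", "C"),
                  LOCATION = c("N", "S", "N", "S", "N", "S"),
                  NAME = c("Apple", "Blooberry", "Cucumber", "Date", "Egplant", "Fig"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(FIRM = c("A", "C"), LOCATION = c("S", "N"),
                  NAME = c("Blooberry", "Egplant"),
                  NEW_NAME = c("Blueberry", "Eggplant"),
                  stringsAsFactors = FALSE)

keys <- c("FIRM", "LOCATION", "NAME")
# One string per row that combines the three key columns
key1 <- do.call(paste, c(df1[keys], sep = "\r"))
key2 <- do.call(paste, c(df2[keys], sep = "\r"))

idx <- match(key1, key2)   # row of df2 matching each row of df1, or NA
hit <- !is.na(idx)
df1$NAME[hit] <- df2$NEW_NAME[idx[hit]]
```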

Extract rows from a single column to form two new columns

Update:
I realized that the dummy data frame I created originally does not reflect the structure of the data frame that I am working with. Allow me to rephrase my question here.
Data frame that I'm starting with:
StudentAndClass <- c("Anthropology College_Name","x","y",
"Geology College_Name","z","History College_Name", "x","y","z")
df <- data.frame(StudentAndClass)
Students ("x","y","z") are enrolled in classes that they are listed under. e.g. "x" and "y" are in Anthropology, while "x", "y", "z" are in History.
How can I create the desired data frame below?
Student <- c("x", "y", "z", "x", "y","z")
Class <- c("Anthropology College_Name", "Anthropology College_Name",
"Geology College_Name", "History College_Name",
"History College_Name", "History College_Name")
df_tidy <- data.frame(Student, Class)
Original post:
I have a data frame with observations of two variables merged in a single column like so:
StudentAndClass <- c("A","x","y","A","B","z","B","C","x","y","z","C")
df <- data.frame(StudentAndClass)
where "A", "B", "C" represent classes, and "x", "y", "z" students who are taking these classes. Notice that observations of students are wedged between observations of classes.
I'm wondering how I can create a new data frame with the following format:
Student <- c("x", "y", "z", "x", "y","z")
Class <- c("A", "A", "B", "C", "C", "C")
df_tidy <- data.frame(Student, Class)
I want to extract the rows containing observations of students and put them in a new column, while making sure that each Student observation is paired with the corresponding Class observation in the Class column.
One option is to create a vector
v1 <- c('x', 'y', 'z')
Then split the data based on a logical vector and cbind the two pieces
setNames(do.call(cbind, split(df, !df[,1] %in% v1)), c('Student', 'Class'))
# Student Class
#2 x A
#3 y A
#6 z B
#9 x B
#10 y C
#11 z C
Or with tidyverse
library(tidyverse)
df %>%
group_by(grp = c('Class', 'Student')[(StudentAndClass %in% v1) + 1]) %>%
mutate(n = row_number()) %>%
spread(grp, StudentAndClass) %>%
select(-n)
# A tibble: 6 x 2
# Class Student
#* <fctr> <fctr>
#1 A x
#2 A y
#3 B z
#4 B x
#5 C y
#6 C z
Update
If we need this based on elements between each pair of same 'LETTERS'
grp <- with(df, cummax(match(StudentAndClass, LETTERS[1:3], nomatch = 0)))
do.call(rbind, lapply(split(df, grp), function(x)
data.frame(Student = x[[1]][2:(nrow(x)-1)], Class = x[[1]][1], stringsAsFactors=FALSE)))
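To see what the cummax() grouping does, here is a minimal sketch on the original-post data. It assumes the class labels are exactly "A", "B", "C" and appear in that order; is_student and classes are illustrative names:

```r
StudentAndClass <- c("A", "x", "y", "A", "B", "z", "B", "C", "x", "y", "z", "C")
classes <- LETTERS[1:3]

# Position of each class label (0 for students); cummax() carries the
# most recent class index forward over the student rows that follow it.
grp <- cummax(match(StudentAndClass, classes, nomatch = 0))
is_student <- !StudentAndClass %in% classes

df_tidy <- data.frame(Student = StudentAndClass[is_student],
                      Class = classes[grp[is_student]],
                      stringsAsFactors = FALSE)
```

Here grp is 1 1 1 1 2 2 2 3 3 3 3 3, so each student row inherits the class that opened its block.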
Updated
In essence, you just need to find which indexes have college names, use those to get the range of students in each college, then subset the main vector by those ranges. Since students aren't guaranteed to be nested between two similar values, you have to be careful about any "empty" colleges.
college_indices <- which(endsWith(StudentAndClass, 'College_Name'))
colleges <- StudentAndClass[college_indices]
bounds_mat <- rbind(
start = college_indices,
end = c(college_indices[-1], length(StudentAndClass) + 1)
)
colnames(bounds_mat) <- colleges
bounds_mat['start', ] <- bounds_mat['start', ] + 1
bounds_mat['end', ] <- bounds_mat['end', ] - 1
# This prevents any problems if a college has no listed students
empty_college <- bounds_mat['start', ] > bounds_mat['end', ]
bounds_mat <- bounds_mat[, !empty_college, drop = FALSE]
class_listing <- apply(
bounds_mat,
2,
function(bounds) {
StudentAndClass[bounds[1]:bounds[2]]
}
)
df_tidy <- data.frame(
Student = unlist(class_listing),
Class = rep(names(class_listing), lengths(class_listing)),
row.names = NULL
)
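The same pairing can be sketched more compactly with cumsum(), which labels every element with the number of college headers seen so far. This is an alternative, not the method above, and it assumes the first element of the vector is a college name; is_college and class_of_row are illustrative names:

```r
StudentAndClass <- c("Anthropology College_Name", "x", "y",
                     "Geology College_Name", "z",
                     "History College_Name", "x", "y", "z")

is_college <- endsWith(StudentAndClass, "College_Name")
# cumsum() counts headers seen so far, so it indexes the current college
class_of_row <- StudentAndClass[is_college][cumsum(is_college)]

df_tidy <- data.frame(Student = StudentAndClass[!is_college],
                      Class = class_of_row[!is_college],
                      stringsAsFactors = FALSE)
```

A college with no listed students simply contributes no rows, so no separate empty-college check is needed in this variant.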

R - reshape dataframe from duplicated column names but unique values

Hi I have a dataframe that looks like the following
I want to apply a function to it so that it reshapes it like this
How would I do that?
Here is one option that could work. We loop through the unique names of the dataset, create a logical index with ==, extract the matching columns, unlist them, wrap the result in a data.frame, and then cbind the pieces together or just call data.frame on the list (the assumption is that each name is duplicated the same number of times).
data.frame(lapply(unique(names(df1)), function(x)
setNames(data.frame(unlist(df1[names(df1)==x], use.names = FALSE)), x)))
# type model make
#1 a b c
#2 d e f
data
df1 <- data.frame(type = "a", model = "b", make = "c", type = "d",
model = "e",
make = "f", check.names=FALSE, stringsAsFactors=FALSE)
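A variation on the same idea, sketched with split.default(), which splits a data frame's columns (rather than its rows) by a grouping vector. It makes the same assumption that every name is repeated equally often; out and nm are illustrative names:

```r
df1 <- data.frame(type = "a", model = "b", make = "c", type = "d",
                  model = "e", make = "f",
                  check.names = FALSE, stringsAsFactors = FALSE)
nm <- unique(names(df1))

# Group the columns by name, flatten each group into one vector,
# then restore the original column order (split() sorts the groups).
out <- data.frame(lapply(split.default(df1, names(df1)), unlist, use.names = FALSE),
                  stringsAsFactors = FALSE)[nm]
```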

sample data.table rows with different conditions

I have a data.table with multiple columns. One of these columns currently works as a 'key' (keyb in the example). Another column (let's say A) may or may not have data in it. I would like to supply a vector of keys and, for each key that appears in that vector, randomly sample two rows: one row where A contains data and one where it does not.
MRE:
library(data.table)
# data.table
trys <- structure(list(keyb = c("x", "x", "x", "x", "x", "y", "y", "y",
"y", "y"), A = c("1", "", "1", "", "", "1", "", "", "1", "")), .Names = c("keyb",
"A"), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
setkey(trys,keyb)
#list with keys
list_try <- structure(list(a = "x", b = c("r", "y","x")), .Names = c("a", "b"))
I could, for instance subset the data.table based on the elements that appear in list_try:
trys[keyb %in% list_try[[2]]]
My original (and probably inefficient) idea was to chain a sample of two rows per key, split by whether the A column has data, and then merge. But it does not work:
#here I was trying to sample rows based on whether A has data or not
#here for rows where A has no data
trys[keyb %in% list_try[[2]]][nchar(A)==0][sample(.N, 2), ,by = keyb]
#here for rows where A has data
trys[keyb %in% list_try[[2]]][nchar(A)==1][sample(.N, 2), ,by = keyb]
In this case, my expected output would be two data.tables (one for a and one for b in list_try), of two rows per appearing element: So the data.table from a would have two rows (one with and without data in A), and the one from b, four rows (two with and two without data in A).
Please let me know if I can make this post any clearer
You could add A to the by statement too, converting it to a logical vector with A != "". Combine this with a binary join on the key (adding nomatch = 0L to drop keys that have no match), sample one row index from .I within each keyb / A != "" group, and then use those indices to subset the original data set.
For a single subset case
trys[trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1]
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x
For a more general case, when you want to create separate data sets according to a list of keys, you could easily embed this into lapply
lapply(list_try,
function(x) trys[trys[x, nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1])
# $a
# keyb A
# 1: x 1
# 2: x
#
# $b
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x
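One thing to watch, as a general property of base R's sample() rather than something the answer states: when its first argument is a single number n, sample() draws from 1:n instead of returning n. So sample(.I, 1L) could pick the wrong row for a group that contains exactly one row. A base R sketch of a guard; pick_one is a hypothetical helper name:

```r
# Always returns an element of idx, even when idx has length one
pick_one <- function(idx) idx[sample.int(length(idx), 1L)]

set.seed(1)
single <- 7L   # pretend this is the only row index in a group

draws_plain <- replicate(20, sample(single, 1L))   # drawn from 1:7
draws_safe  <- replicate(20, pick_one(single))     # always 7
```

Inside the data.table call, the corresponding change would be to write .I[sample(.N, 1L)] in place of sample(.I, 1L). In the example data every group has at least two rows, so the answer's output is unaffected.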
