Aggregate values in a data frame by partly matching row names in R

I'm fumbling around with the following problem, but to no avail:
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
value
abc 1
abcd 2
ef 3
gh 4
l <- nrow(d)
wordmat <- matrix(rep(NA, l^2), l, l, dimnames = list(row.names(d), row.names(d)))
for (i in 1:ncol(wordmat)) {
rid <- agrep(colnames(wordmat)[i], rownames(wordmat), max = 0)
d$matchid[i] <- paste(rid, collapse = ";")
}
# desired output:
(d_agg <- data.frame(value = c(3, 3, 4), row.names = c("abc;abcd", "ef", "gh")))
value
abc;abcd 3
ef 3
gh 4
Is there a function for this?

Here's a possible solution that you might be able to modify to suit your needs.
Some notes:
I couldn't figure out how to deal with rownames() directly, particularly in the last stage, so this depends on you being happy with copying your row names as a new variable.
The function below "hard-codes" the variable names, functions, and so on. That is to say, it is not by any means a generalized function, but one which might be useful as you look further into this problem.
Here's the function.
matches <- function(data, ...) {
temp = vector("list", nrow(data))
for (i in 1:nrow(data)) {
temp1 = agrep(data$RowNames[i], data$RowNames, value = TRUE, ...)
temp[[i]] = data.frame(RowNames = paste(temp1, collapse = "; "),
value = sum(data[temp1, "value"]))
}
temp = do.call(rbind, temp)
temp[!duplicated(temp$RowNames), ]
}
Note that the function needs a column called RowNames, so we'll create that, and then test the function.
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
d$RowNames <- rownames(d)
matches(d)
# RowNames value
# 1 abc; abcd 3
# 3 ef 3
# 4 gh 4
matches(d, max.distance = 2)
# RowNames value
# 1 abc; abcd 3
# 3 abc; abcd; ef; gh 10
matches(d, max.distance = 4)
# RowNames value
# 1 abc; abcd; ef; gh 10

This works for your example but may need tweaking for the real thing:
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
rowclust <- hclust(as.dist(adist(rownames(d))), method="single")
rowgroups <- cutree(rowclust, h=1.5)
rowagg <- aggregate(d, list(rowgroups), sum)
rowname <- unclass(by(rownames(d), rowgroups, paste, collapse=";"))
rownames(rowagg) <- rowname
rowagg
Group.1 value
abc;abcd 1 3
ef 2 3
gh 3 4
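To see why cutting at h = 1.5 groups "abc" with "abcd" but keeps "ef" and "gh" apart, it can help to look at the edit distances directly. The following is only a sketch on the example data; the names nm, dm, sums and labels are made up for illustration, and tapply is swapped in for aggregate/by so that no Group.1 column appears.

```r
# Same example data as in the question.
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
nm <- rownames(d)

# Pairwise Levenshtein (edit) distances between the row names.
dm <- adist(nm)
dimnames(dm) <- list(nm, nm)
dm["abc", "abcd"]  # one insertion turns "abc" into "abcd"
dm["ef", "gh"]     # two substitutions are needed

# Single-linkage clustering, cut between distance 1 and distance 2.
groups <- cutree(hclust(as.dist(dm), method = "single"), h = 1.5)

# Sum the values and paste the names within each group.
sums   <- tapply(d$value, groups, sum)
labels <- tapply(nm, groups, paste, collapse = ";")
d_agg  <- data.frame(value = as.vector(sums), row.names = as.vector(labels))
d_agg
```

With real row names the cut height would probably need tuning, since a threshold of 1.5 merges any two names that differ by a single edit.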

Related

R lapply function with two arguments that are not fixed

I found a similar question asked before, but mine is a bit more complex: the Y argument is not fixed.
In function(X, Y) {SOME FUNCTION}, X is a character vector and Y is a list of data frames. I want the function to work on the elements of X and Y pairwise, in sequence, and return the results as one list: the first element of X with the first element of Y, the second element of X with the second element of Y, and so on.
Example of X, Y
X <- c("1", "2")
y1 <- data.frame("person.1" = "Amy", "bestfood..1" = "fish", "bestthing..1" = "book",
"person.2" = "Mike", "bestfood..2" = "fish", "bestthing..2" = "book")
y2 <- data.frame("person.1" = "Amy","bestfood..1" = "carrot", "bestthing..1" = "cloth",
"person.2" = "Mike","bestfood..2" = "carrot", "bestthing..2" = "cloth")
Y <- list(y1,y2)
The function:
addID <- function(X, Y) {
rowlength <- length(Y)
df <- as.data.frame(matrix(NA, nrow = rowlength, ncol = 3))
colnames(df) <- c("ID", "Person", "Food")
df[1:nrow(df), 1] <- X
# name
namecols <-grep("person",colnames(Y))
for (i in 1:length(namecols)) {
name <- Y[1, namecols[i]]
df[i, 2] <- as.character(name)
}
# food
foodcols <-
grep("bestfood",colnames(Y))
for (i in 1:length(foodcols)) {
food <- Y[1, foodcols[i]]
df[i, 3] <- as.character(food)
}
return(df)
}
I tried to use lapply but can't figure out the way to include the X list. When I try this:
lapply(Y, function, X=X)
The function doesn't work properly. I wonder if there is another way to pass X in. (I tried the function on a single character value and a single data frame, and it works just fine.)
I hope this is clear. If not, please point it out, I will try my best to clarify. Thanks in advance.
UPDATE:
I tried Map as suggested in the comments. It returns: incorrect number of dimensions. I added some details to the function. It seems that R gets stuck on the last line.
outcome <- Map(addID, Y, X)
I get
error in Y[1, namecols[i]] : incorrect number of dimensions
In addition: Warning message:
In `[<-.data.frame`(`*tmp*`, 1:nrow(df), 1, value = list(person.1 = 1L, :
provided 6 variables to replace 1 variables
The outcome should look like:
z1 <- data.frame(ID = c(1,2), Person = c("Amy","Mike"), Food = c("fish", "fish"))
z2 <- data.frame(ID = c(1,2), Person = c("Amy","Mike"), Food = c("carrot", "carrot"))
outcome <- list(z1,z2)
We could do this easily in tidyverse
library(dplyr)
library(tidyr)
bind_rows(Y, .id = 'ID') %>%
select(ID, starts_with('person'), contains('food')) %>%
pivot_longer(cols = -ID, names_to = c(".value"),
names_pattern = "([^.]+)\\.+\\d+")
-output
# A tibble: 4 x 3
ID person bestfood
<chr> <chr> <chr>
1 1 Amy fish
2 1 Mike fish
3 2 Amy carrot
4 2 Mike carrot
With the OP's function, if we modify, it would work
addID <- function(X, Y) {
rowlength <- length(Y)
df <- as.data.frame(matrix(NA, nrow = rowlength, ncol = 3))
colnames(df) <- c("ID", "Person", "Food")
df[1:nrow(df), 1] <- X
namecols <- grep("person",colnames(Y))
df[, 2] <- unlist(Y[namecols])
foodcols <- grep("bestfood", colnames(Y))
df[,3] <- unlist(Y[foodcols])
return(unique(df))
}
-testing
Map(addID, X, Y)
$`1`
ID Person Food
1 1 Amy fish
2 1 Mike fish
$`2`
ID Person Food
1 2 Amy carrot
2 2 Mike carrot
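A note on why Map(addID, Y, X) in the question failed while Map(addID, X, Y) works: Map hands its arguments to the function positionally, so swapping them puts each data frame into X and each ID string into Y. The sketch below uses a made-up helper, describe, and cut-down data purely to illustrate the matching; naming the arguments is one way to make the order irrelevant.

```r
X <- c("1", "2")
Y <- list(
  data.frame(person.1 = "Amy", person.2 = "Mike"),
  data.frame(person.1 = "Amy", person.2 = "Mike")
)

# Hypothetical helper: reports what it received for each pair.
describe <- function(id, dat) {
  paste(id, class(dat)[1], ncol(dat))
}

# Positional matching: the first vector goes to `id`, the second to `dat`.
res_positional <- Map(describe, X, Y)

# Named matching: the order no longer matters.
res_named <- Map(describe, dat = Y, id = X)

res_positional
res_named
```

Both calls pair the i-th ID with the i-th data frame; only the names of the result differ, because Map takes them from its first argument.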

How does the table and $freq function work in R

I want a function for the mode of a vector. Abhiroop Sarkar's answer to this question works, but I want to understand why.
Here is the code
Mode <- function(x){
y <- data.frame(table(x))
y[y$Freq == max(y$Freq),1]
}
1) Why do we need to put the table in a data frame,
2) in this line
y[y$Freq == max(y$Freq),1]
what does y$Freq do? Is Freq a default column in the table?
When we convert a table output to data.frame, it creates a two column data.frame
set.seed(24)
v1 <- table(sample(1:5, 100, replace = TRUE))
y <- data.frame(v1)
y
# Var1 Freq
#1 1 19
#2 2 24
#3 3 22
#4 4 16
#5 5 19
The first column 'Var1' is the names of the frequency output from table and the 'Freq' is the actual frequency of those names
y[y$Freq == max(y$Freq), 1]
#[1] 2
#Levels: 1 2 3 4 5
Now we are subsetting the first column 'Var1' by the rows where 'Freq' equals its maximum. The result is a vector because [ uses drop = TRUE by default when a single column is selected.
If we want a data.frame with a single column instead, add drop = FALSE at the end
y[y$Freq == max(y$Freq), 1, drop = FALSE]
# Var1
#2 2
Regarding the default name Freq, it is created from the as.data.frame.table method
as.data.frame.table
function (x, row.names = NULL, ..., responseName = "Freq", stringsAsFactors = TRUE,
sep = "", base = list(LETTERS))
{
ex <- quote(data.frame(do.call("expand.grid", c(dimnames(provideDimnames(x,
sep = sep, base = base)), KEEP.OUT.ATTRS = FALSE, stringsAsFactors = stringsAsFactors)),
Freq = c(x), row.names = row.names))
names(ex)[3L] <- responseName
eval(ex)
}
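As for whether the data frame is needed at all: probably not. A table carries its categories as names, so it can be indexed directly. This is only a sketch of an alternative, not the code from the linked answer; mode_values is a made-up name, and note that it returns the modes as character strings.

```r
# Hypothetical alternative to Mode(): work on the table itself.
mode_values <- function(x) {
  tab <- table(x)
  # names(tab) holds the distinct values, tab holds their counts.
  names(tab)[as.vector(tab) == max(tab)]
}

mode_values(c(3, 1, 2, 3, 2, 3))
mode_values(c("a", "b", "b", "a", "c"))  # ties return every mode

# The "Freq" column name can be changed through responseName.
counts_df <- as.data.frame(table(c("a", "b", "b")), responseName = "n")
names(counts_df)
```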

the number of variables that are repeated 2 or more times in R

I have two forms of data: a list (i.e., r) and a data.frame (i.e., df). For each form of data, how can I know the number of variables that are repeated 2 or more times (in the example below, my desired output is: AA 3 times, BB 2 times, CC 2 times)?
NOTE: the answer regardless of the form of data, should be the same.
r <- list( data.frame( AA = c(2,2,1,1,NA, NA), BB = c(1,1,1,2,2,NA), CC = c(1:5, NA)), # LIST
data.frame( AA = c(1,NA,3,1,NA,NA), DD = c(1,1,1,2,NA,NA)),
data.frame( AA = c(1,NA,3,1,NA,NA), BB = c(1,1,1,2,2,NA), CC = c(0:4, NA)) )
df <- do.call(cbind, r) ## DATA.FRAME
We can tabulate the column names of the dataset and keep those that occur at least twice:
tbl <- table(names(df))
tbl1 <- tbl[tbl >=2]
tbl1
# AA BB CC
# 3 2 2
lapply(r, function(x) table(names(x)[names(x) %in% names(tbl1)]))
If we need to start from the list r instead of df
vec <- names(unlist(r, recursive = FALSE))
nm1 <- unique(vec[duplicated(vec)])
lapply(r, function(x) table(names(x)[names(x) %in% nm1]))
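To get the same overall counts (AA 3, BB 2, CC 2) straight from the list without building df first, one option is to collect the names of every element and tabulate them. A minimal sketch on a cut-down version of r; all_names and repeated are made-up names.

```r
# Reduced version of the list from the question (one row per data frame).
r <- list(
  data.frame(AA = 1, BB = 2, CC = 3),
  data.frame(AA = 1, DD = 2),
  data.frame(AA = 1, BB = 2, CC = 3)
)

# Gather the column names of every data frame, then count them.
all_names <- unlist(lapply(r, names))
counts    <- table(all_names)

# Keep only the names that occur two or more times.
repeated <- counts[counts >= 2]
repeated
```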

Parse unexpected symbol error in function applied over list

I'm trying to check the "pin" numbers of cases with missing data for each variable of interest in my dataset.
Here are some fake data:
c <- data.frame(pin = c(1, 2, 3, 4), type = c(1, 1, 2, 2), v1 = c(1, NA, NA,
NA), v2 = c(NA, NA, 1, 1))
I wrote a function "m.pin" to do this:
m.pin <- function(x, data = "c", return = "$pin") {
sect <- gsub("^.*\\[", "\\[", deparse(substitute(x)))
vect <- eval(parse(text = paste(data, return, sect, sep = "")))
return(vect[is.na(x)])
}
And I use it like so:
m.pin(c$v1[c$type == 1])
[1] 2
I wrote a function to apply "m.pin" over a list of variables to only return pins with missing data:
return.m.pin <- function(x, fun = m.pin) {
val.list <- lapply(x, fun)
condition <- lapply(val.list, function(x) length(x) > 0)
val.list[unlist(condition)]
}
But when I apply it, I get this error:
l <- lst(c$v1[c$type == 1], c$v2[c$type == 2])
return.m.pin(l)
Error in parse(text = paste(data, return, sect, sep = "")) :
<text>:1:9: unexpected ']'
1: c$pin[i]]
^
How can I rewrite my function(s) to avoid this issue?
Many thanks!
Please see Gregor's comment for the most critical issues with your code (to add: don't use return as a variable name as it is the name of a base R function).
It's not clear to me why you want to define a specific function m.pin, nor what you ultimately are trying to do, but I am assuming this is a critical design component.
Rewriting m.pin as
m.pin <- function(df, type, vcol) which(df[, "type"] == type & is.na(df[, vcol]))
we get
m.pin(df, 1, "v1")
#[1] 2
Or to identify rows with NA in "v1" for all types
lapply(unique(df$type), function(x) m.pin(df, x, "v1"))
#[[1]]
#[1] 2
#
#[[2]]
#[1] 3 4
Update
In response to Gregor's comment, perhaps this is what you're after?
by(df, df$type, function(x)
list(v1 = x$pin[which(is.na(x$v1))], v2 = x$pin[which(is.na(x$v2))]))
# df$type: 1
# $v1
# [1] 2
#
# $v2
# [1] 1 2
#
# ------------------------------------------------------------
# df$type: 2
# $v1
# [1] 3 4
#
# $v2
# numeric(0)
This returns a list of the pin numbers for every type and NA entries in v1/v2.
Sample data
df <- data.frame(
pin = c(1, 2, 3, 4),
type = c(1, 1, 2, 2),
v1 = c(1, NA, NA, NA),
v2 = c(NA, NA, 1, 1))
I would suggest rewriting like this (if this approach is to be taken at all). I call your data d because c is already the name of an extremely common function.
# string column names, pass in the data frame as an object
# means no need for eval, parse, substitute, etc.
foo = function(data, na_col, return_col = "pin", filter_col, filter_val) {
if (!missing(filter_col) && !missing(filter_val)) {
data = data[data[, filter_col] == filter_val, ]
}
data[is.na(data[, na_col]), return_col]
}
# working on the whole data frame
foo(d, na_col = "v1", return_col = "pin")
# [1] 2 3 4
# passing in a subset of the data
foo(d[d$type == 1, ], "v1", "pin")
# [1] 2
# using function arguments to subset the data
foo(d, "v1", "pin", filter_col = "type", filter_val = 1)
# [1] 2
# calling it with changing arguments:
# you could use `Map` or `mapply` to be fancy, but this for loop is nice and clear
inputs = data.frame(na_col = c("v1", "v2"), filter_val = c(1, 2), stringsAsFactors = FALSE)
result = list()
for (i in 1:nrow(inputs)) {
result[[i]] = foo(d, na_col = inputs$na_col[i], return_col = "pin",
filter_col = "type", filter_val = inputs$filter_val[i])
}
result
# [[1]]
# [1] 2
#
# [[2]]
# numeric(0)
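For completeness, the Map variant mentioned in the comment above could look roughly like this. It is a sketch that repeats the data and foo so it runs on its own; the anonymous wrapper and its argument names col and val are made up.

```r
d <- data.frame(
  pin  = c(1, 2, 3, 4),
  type = c(1, 1, 2, 2),
  v1   = c(1, NA, NA, NA),
  v2   = c(NA, NA, 1, 1)
)

foo <- function(data, na_col, return_col = "pin", filter_col, filter_val) {
  if (!missing(filter_col) && !missing(filter_val)) {
    data <- data[data[, filter_col] == filter_val, ]
  }
  data[is.na(data[, na_col]), return_col]
}

# Pair each column with the type value it should be filtered on.
result <- Map(
  function(col, val) foo(d, na_col = col, filter_col = "type", filter_val = val),
  c("v1", "v2"),
  c(1, 2)
)
result
```

Because the first vector is character, Map uses it to name the result, so the pins are available as result$v1 and result$v2.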
A different approach I would suggest is melting your data into a long format, and simply taking a subset of the NA values, hence getting all combinations of type and the v* columns that have NA values at once. Do this once, and no function is needed to look up individual combinations.
d_long = reshape2::melt(d, id.vars = c("pin", "type"))
library(dplyr)
d_long %>% filter(is.na(value)) %>%
arrange(variable, type)
# pin type variable value
# 1 2 1 v1 NA
# 2 3 2 v1 NA
# 3 4 2 v1 NA
# 4 1 1 v2 NA
# 5 2 1 v2 NA

How to Add Dataframe name to Columns from Multiple Dataframes

The goal is to rename the columns of a list of data frames, adding each data frame's name to its new column names.
ex: from x to a_x and b_x.
Why? Because I plan to later merge the sets and would like clean ids for the columns.
a = data.frame(x = c(1,2))
b = data.frame(x = c(3,4))
frameList = list(a = a, b = b)
newName = c(*frameName+'_'+'x')
names = lapply(names, setNames, nm = newName)
list2env(names,.GlobalEnv)
Here is one way for you. I looped over the data frames in frameList by index. For each data frame, I took its name (i.e., names(frameList)) and pasted it onto the column names of that data frame.
a = data.frame(x = c(1,2), y = 1:2)
b = data.frame(x = c(3,4), y = 1:2)
frameList = list(a = a, b = b)
lapply(1:length(names(frameList)), function(x) {
names(frameList[[x]]) <- paste(names(frameList)[x], names(frameList[[x]]), sep = "_")
return(frameList[[x]])
})
[[1]]
a_x a_y
1 1 1
2 2 2
[[2]]
b_x b_y
1 3 1
2 4 2
Or another option is Map
Map(function(x, y) setNames(x, paste(y, names(x), sep="_")), frameList, names(frameList))
#$a
# a_x a_y
#1 1 1
#2 2 2
#$b
# b_x b_y
#1 3 1
#2 4 2
Or with tidyverse
library(purrr)
library(dplyr)
f1 <- function(x, y) paste(y, x, sep="_")
map2(frameList, names(frameList), ~ .x %>%
rename_all(f1, .y))
If we need the reverse order (column name first, e.g. x_a), this is simpler
map2(frameList, names(frameList), ~ .x %>%
rename_all(paste, sep="_", .y))
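Since the stated reason for renaming is to combine the data frames later, here is a base-R sketch of how the prefixed columns might be brought together. It assumes the data frames have the same number of rows and can simply be bound side by side; renamed and combined are made-up names.

```r
a <- data.frame(x = c(1, 2))
b <- data.frame(x = c(3, 4))
frameList <- list(a = a, b = b)

# Prefix every column with the name of its data frame.
renamed <- Map(
  function(df, nm) setNames(df, paste(nm, names(df), sep = "_")),
  frameList,
  names(frameList)
)

# unname() stops cbind from adding the list names to the column names again.
combined <- do.call(cbind, unname(renamed))
names(combined)
```

If the data frames need to be joined on a key rather than bound by position, merge would be the more likely tool, and the key column would have to be left out of the renaming.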
