I'm fumbling around with the following problem, but to no avail:
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
value
abc 1
abcd 2
ef 3
gh 4
l <- nrow(d)
wordmat <- matrix(rep(NA, l^2), l, l, dimnames = list(row.names(d), row.names(d)))
for (i in 1:ncol(wordmat)) {
rid <- agrep(colnames(wordmat)[i], rownames(wordmat), max = 0)
d$matchid[i] <- paste(rid, collapse = ";")
}
# desired output:
(d_agg <- data.frame(value = c(3, 3, 4), row.names = c("abc;abcd", "ef", "gh")))
value
abc;abcd 3
ef 3
gh 4
Is there a function for this?
Here's a possible solution that you might be able to modify to suit your needs.
Some notes:
I couldn't figure out how to deal with rownames() directly, particularly in the last stage, so this depends on you being happy with copying your row names as a new variable.
The function below "hard-codes" the variable names, functions, and so on. That is to say, it is not by any means a generalized function, but one which might be useful as you look further into this problem.
Here's the function.
matches <- function(data, ...) {
temp = vector("list", nrow(data))
for (i in 1:nrow(data)) {
temp1 = agrep(data$RowNames[i], data$RowNames, value = TRUE, ...)
temp[[i]] = data.frame(RowNames = paste(temp1, collapse = "; "),
value = sum(data[temp1, "value"]))
}
temp = do.call(rbind, temp)
temp[!duplicated(temp$RowNames), ]
}
Note that the function needs a column called RowNames, so we'll create that, and then test the function.
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
d$RowNames <- rownames(d)
matches(d)
# RowNames value
# 1 abc; abcd 3
# 3 ef 3
# 4 gh 4
matches(d, max.distance = 2)
# RowNames value
# 1 abc; abcd 3
# 3 abc; abcd; ef; gh 10
matches(d, max.distance = 4)
# RowNames value
# 1 abc; abcd; ef; gh 10
This works for your example but may need tweaking for the real thing:
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
rowclust <- hclust(as.dist(adist(rownames(d))), method="single")
rowgroups <- cutree(rowclust, h=1.5)
rowagg <- aggregate(d, list(rowgroups), sum)
rowname <- unclass(by(rownames(d), rowgroups, paste, collapse=";"))
rownames(rowagg) <- rowname
rowagg
Group.1 value
abc;abcd 1 3
ef 2 3
gh 3 4
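To see why a cut height of 1.5 groups only abc and abcd, it can help to look at the edit distances directly. This is a minimal sketch using the same four row names; nm, dm, and groups are names introduced here for illustration.

```r
nm <- c("abc", "abcd", "ef", "gh")

# Pairwise generalized Levenshtein (edit) distances between the names
dm <- adist(nm)
dimnames(dm) <- list(nm, nm)
dm

# Single-linkage clustering, cut between distance 1 and distance 2
groups <- cutree(hclust(as.dist(dm), method = "single"), h = 1.5)
groups
```

Here abc and abcd are one edit apart, while ef and gh are two edits apart, so any h between 1 and 2 merges only the first pair. For real data the threshold would probably need tuning.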
I found a similar question asked before. My question is a bit more complex than the previous one. For my question, the y parameter is not fixed.
In function(X, Y) {SOME FUNCTION}, X is a character vector and Y is a list of data frames. I want the function to work on the elements of X and Y pairwise, in sequence, and return the results as one list: the first element of X with the first element of Y, the second element of X with the second element of Y, and so on.
Example of X, Y
X <- c("1", "2")
y1 <- data.frame("person.1" = "Amy", "bestfood..1" = "fish", "bestthing..1" = "book",
"person.2" = "Mike", "bestfood..2" = "fish", "bestthing..2" = "book")
y2 <- data.frame("person.1" = "Amy","bestfood..1" = "carrot", "bestthing..1" = "cloth",
"person.2" = "Mike","bestfood..2" = "carrot", "bestthing..2" = "cloth")
Y <- list(y1,y2)
The function:
addID <- function(X, Y) {
rowlength <- length(Y)
df <- as.data.frame(matrix(NA, nrow = rowlength, ncol = 3))
colnames(df) <- c("ID", "Person", "Food")
df[1:nrow(df), 1] <- X
# name
namecols <-grep("person",colnames(Y))
for (i in 1:length(namecols)) {
name <- Y[1, namecols[i]]
df[i, 2] <- as.character(name)
}
# food
foodcols <-
grep("bestfood",colnames(Y))
for (i in 1:length(foodcols)) {
food <- Y[1, foodcols[i]]
df[i, 3] <- as.character(food)
}
return(df)
}
I tried to use lapply but can't figure out the way to include the X list. When I try this:
lapply(Y, addID, X = X)
The function doesn't work properly. I wonder if there are other ways to include X in it. (I tried the function on an individual character and data frame, and it works just fine.)
I hope this is clear. If not, please point it out, I will try my best to clarify. Thanks in advance.
UPDATE:
I tried Map as suggested in the comments. It returns "incorrect number of dimensions". I added some details to the function. It seems R gets stuck on the last line.
outcome <- Map(addID, Y, X)
I get
error in Y[1, namecols[i]] : incorrect number of dimensions
In addition: Warning message:
In `[<-.data.frame`(`*tmp*`, 1:nrow(df), 1, value = list(person.1 = 1L, :
provided 6 variables to replace 1 variables
The outcome should look like:
z1 <- data.frame(ID = c(1,2), Person = c("Amy","Mike"), Food = c("fish", "fish"))
z2 <- data.frame(ID = c(1,2), Person = c("Amy","Mike"), Food = c("carrot", "carrot"))
outcome <- list(z1,z2)
We could do this easily in tidyverse
library(dplyr)
library(tidyr)
bind_rows(Y, .id = 'ID') %>%
select(ID, starts_with('person'), contains('food')) %>%
pivot_longer(cols = -ID, names_to = c(".value"),
names_pattern = "([^.]+)\\.+\\d+")
-output
# A tibble: 4 x 3
ID person bestfood
<chr> <chr> <chr>
1 1 Amy fish
2 1 Mike fish
3 2 Amy carrot
4 2 Mike carrot
If we modify the OP's function as follows, it works:
addID <- function(X, Y) {
rowlength <- length(Y)
df <- as.data.frame(matrix(NA, nrow = rowlength, ncol = 3))
colnames(df) <- c("ID", "Person", "Food")
df[1:nrow(df), 1] <- X
namecols <- grep("person",colnames(Y))
df[, 2] <- unlist(Y[namecols])
foodcols <- grep("bestfood", colnames(Y))
df[,3] <- unlist(Y[foodcols])
return(unique(df))
}
-testing
Map(addID, X, Y)
$`1`
ID Person Food
1 1 Amy fish
2 1 Mike fish
$`2`
ID Person Food
1 2 Amy carrot
2 2 Mike carrot
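The "incorrect number of dimensions" error in the question is consistent with the arguments being passed in the wrong order. Map pairs its inputs element by element and hands them to the function positionally, so Map(addID, Y, X) sends each data frame in as X and each ID in as Y. Here is a minimal sketch with a toy function (describe, ids, and frames are made up for illustration):

```r
ids <- c("1", "2")
frames <- list(data.frame(a = 10), data.frame(a = 20))

# Toy function: first argument is the ID, second is the data frame
describe <- function(id, df) paste0("id=", id, ", a=", df$a)

# Arguments are matched positionally, one pair at a time
res <- Map(describe, ids, frames)
res

# If the inputs come in the other order, name the arguments explicitly
res2 <- Map(describe, df = frames, id = ids)
res2
```

Naming the arguments, as in the second call, is one way to keep the pairing right regardless of the order in which the inputs are listed.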
I want a function for the mode of a vector. Abhiroop Sarkar's answer to this question works, but I want to understand why.
Here is the code
Mode <- function(x){
y <- data.frame(table(x))
y[y$Freq == max(y$Freq),1]
}
1) Why do we need to put the table in a data frame?
2) In this line
y[y$Freq == max(y$Freq),1]
what does y$Freq do? Is Freq a default column in the table?
When we convert a table output to data.frame, it creates a two column data.frame
set.seed(24)
v1 <- table(sample(1:5, 100, replace = TRUE))
y <- data.frame(v1)
y
# Var1 Freq
#1 1 19
#2 2 24
#3 3 22
#4 4 16
#5 5 19
The first column 'Var1' is the names of the frequency output from table and the 'Freq' is the actual frequency of those names
y[y$Freq == max(y$Freq), 1]
#[1] 2
#Levels: 1 2 3 4 5
Now, we are subsetting the first column 'Var1' based on the max value of 'Freq', and it returns a vector because of the drop = TRUE in [ when there is a single column
If we want to return a data.frame with a single column, add drop = FALSE at the end
y[y$Freq == max(y$Freq), 1, drop = FALSE]
# Var1
#2 2
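The effect of drop can be checked directly. This is a small sketch on a hand-built data frame shaped like the one above (the counts are copied from the first three rows; top, as_vector, and as_frame are illustrative names):

```r
y <- data.frame(Var1 = factor(c("1", "2", "3")), Freq = c(19, 24, 22))
top <- y$Freq == max(y$Freq)

as_vector <- y[top, 1]                # drop = TRUE by default: a factor
as_frame  <- y[top, 1, drop = FALSE]  # stays a one-column data frame

is.data.frame(as_vector)
is.data.frame(as_frame)

# The result is a factor, so convert it if a plain value is wanted
as.character(as_vector)
```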
Regarding the default name Freq, it is created from the as.data.frame.table method
as.data.frame.table
function (x, row.names = NULL, ..., responseName = "Freq", stringsAsFactors = TRUE,
sep = "", base = list(LETTERS))
{
ex <- quote(data.frame(do.call("expand.grid", c(dimnames(provideDimnames(x,
sep = sep, base = base)), KEEP.OUT.ATTRS = FALSE, stringsAsFactors = stringsAsFactors)),
Freq = c(x), row.names = row.names))
names(ex)[3L] <- responseName
eval(ex)
}
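Since the name comes from the responseName argument, it can be changed, and the data frame step can be skipped altogether if only the mode is needed. A small sketch, with x, tbl, and mode_val as illustrative names:

```r
x <- c("a", "b", "b", "c", "b", "a")
tbl <- table(x)

# Same conversion, but with a custom name for the count column
y <- as.data.frame(tbl, responseName = "n")
names(y)

# The mode taken straight from the table, without a data frame
mode_val <- names(tbl)[tbl == max(tbl)]
mode_val
```

Note that this version returns the mode as a character string, and returns every tied value if there is more than one mode.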
I have two forms of data: a list (i.e., r) and a data.frame (i.e., df). For each form of data, how can I know the number of variables that are repeated 2 or more times (in the example below, my desired output is: AA 3 times, BB 2 times, CC 2 times)?
NOTE: the answer regardless of the form of data, should be the same.
r <- list( data.frame( AA = c(2,2,1,1,NA, NA), BB = c(1,1,1,2,2,NA), CC = c(1:5, NA)), # LIST
data.frame( AA = c(1,NA,3,1,NA,NA), DD = c(1,1,1,2,NA,NA)),
data.frame( AA = c(1,NA,3,1,NA,NA), BB = c(1,1,1,2,2,NA), CC = c(0:4, NA)) )
df <- do.call(cbind, r) ## DATA.FRAME
We can tabulate the column names of the dataset and keep those that occur at least twice:
tbl <- table(names(df))
tbl1 <- tbl[tbl >=2]
tbl1
# AA BB CC
# 3 2 2
lapply(r, function(x) table(names(x)[names(x) %in% names(tbl1)]))
If we need to get it directly from the list r:
vec <- names(unlist(r, recursive = FALSE))
nm1 <- unique(vec[duplicated(vec)])
lapply(r, function(x) table(names(x)[names(x) %in% nm1]))
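Both approaches come down to counting repeated names. Here is a minimal sketch on a plain character vector standing in for the column names (vec is typed out by hand here rather than computed from r or df):

```r
vec <- c("AA", "BB", "CC", "AA", "DD", "AA", "BB", "CC")

# Approach 1: tabulate, then keep counts of two or more
tbl <- table(vec)
repeated <- tbl[tbl >= 2]
repeated

# Approach 2: names that show up again after their first occurrence
nm1 <- unique(vec[duplicated(vec)])
nm1
```

The first approach keeps the counts, while the second only gives the names, which is enough for the %in% filter.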
I'm trying to check the "pin" numbers of cases with missing data for each variable of interest in my dataset.
Here are some fake data:
c <- data.frame(pin = c(1, 2, 3, 4), type = c(1, 1, 2, 2), v1 = c(1, NA, NA,
NA), v2 = c(NA, NA, 1, 1))
I wrote a function "m.pin" to do this:
m.pin <- function(x, data = "c", return = "$pin") {
sect <- gsub("^.*\\[", "\\[", deparse(substitute(x)))
vect <- eval(parse(text = paste(data, return, sect, sep = "")))
return(vect[is.na(x)])
}
And I use it like so:
m.pin(c$v1[c$type == 1])
[1] 2
I wrote a function to apply "m.pin" over a list of variables to only return pins with missing data:
return.m.pin <- function(x, fun = m.pin) {
val.list <- lapply(x, fun)
condition <- lapply(val.list, function(x) length(x) > 0)
val.list[unlist(condition)]
}
But when I apply it, I get this error:
l <- lst(c$v1[c$type == 1], c$v2[c$type == 2])
return.m.pin(l)
Error in parse(text = paste(data, return, sect, sep = "")) :
<text>:1:9: unexpected ']'
1: c$pin[i]]
^
How can I rewrite my function(s) to avoid this issue?
Many thanks!
Please see Gregor's comment for the most critical issues with your code (to add: don't use return as a variable name as it is the name of a base R function).
It's not clear to me why you want to define a specific function m.pin, nor what you ultimately are trying to do, but I am assuming this is a critical design component.
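The parse error in the question is consistent with how substitute() behaves inside lapply(): the function no longer sees the expression you typed, only lapply's internal placeholder. A minimal sketch (show_expr is a made-up helper, and d stands in for the data so as not to mask c()):

```r
d <- data.frame(pin = 1:4, type = c(1, 1, 2, 2), v1 = c(1, NA, NA, NA))

# Returns the text of the expression passed as x
show_expr <- function(x) deparse(substitute(x))

# Called directly, the original expression is recovered
direct <- show_expr(d$v1[d$type == 1])
direct

# Called through lapply, the expression is lapply's own placeholder
via_lapply <- lapply(list(d$v1[d$type == 1]), show_expr)
via_lapply
```

Applying the gsub() from m.pin to the placeholder text X[[i]] leaves [i]], which matches the c$pin[i]] shown in the error message.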
Rewriting m.pin as
m.pin <- function(df, type, vcol) which(df[, "type"] == type & is.na(df[, vcol]))
we get
m.pin(df, 1, "v1")
#[1] 2
Or to identify rows with NA in "v1" for all types
lapply(unique(df$type), function(x) m.pin(df, x, "v1"))
#[[1]]
#[1] 2
#
#[[2]]
#[1] 3 4
Update
In response to Gregor's comment, perhaps this is what you're after?
by(df, df$type, function(x)
list(v1 = x$pin[which(is.na(x$v1))], v2 = x$pin[which(is.na(x$v2))]))
# df$type: 1
# $v1
# [1] 2
#
# $v2
# [1] 1 2
#
# ------------------------------------------------------------
# df$type: 2
# $v1
# [1] 3 4
#
# $v2
# integer(0)
This returns a list of the pin numbers for every type and NA entries in v1/v2.
Sample data
df <- data.frame(
pin = c(1, 2, 3, 4),
type = c(1, 1, 2, 2),
v1 = c(1, NA, NA, NA),
v2 = c(NA, NA, 1, 1))
I would suggest rewriting like this (if this approach is to be taken at all). I call your data d because c is already the name of an extremely common function.
# string column names, pass in the data frame as an object
# means no need for eval, parse, substitute, etc.
foo = function(data, na_col, return_col = "pin", filter_col, filter_val) {
if (!missing(filter_col) && !missing(filter_val)) {
data = data[data[, filter_col] == filter_val, ]
}
data[is.na(data[, na_col]), return_col]
}
# working on the whole data frame
foo(d, na_col = "v1", return_col = "pin")
# [1] 2 3 4
# passing in a subset of the data
foo(d[d$type == 1, ], "v1", "pin")
# [1] 2
# using function arguments to subset the data
foo(d, "v1", "pin", filter_col = "type", filter_val = 1)
# [1] 2
# calling it with changing arguments:
# you could use `Map` or `mapply` to be fancy, but this for loop is nice and clear
inputs = data.frame(na_col = c("v1", "v2"), filter_val = c(1, 2), stringsAsFactors = FALSE)
result = list()
for (i in 1:nrow(inputs)) {
result[[i]] = foo(d, na_col = inputs$na_col[i], return_col = "pin",
filter_col = "type", filter_val = inputs$filter_val[i])
}
result
# [[1]]
# [1] 2
#
# [[2]]
# numeric(0)
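For comparison, here is roughly what the Map version of that loop could look like. It is only a sketch: foo is repeated so the snippet runs on its own, d is rebuilt from the sample data in the question, and col and val are illustrative argument names.

```r
d <- data.frame(pin = c(1, 2, 3, 4), type = c(1, 1, 2, 2),
                v1 = c(1, NA, NA, NA), v2 = c(NA, NA, 1, 1))

# Same foo as above, repeated so the snippet is self-contained
foo <- function(data, na_col, return_col = "pin", filter_col, filter_val) {
  if (!missing(filter_col) && !missing(filter_val)) {
    data <- data[data[, filter_col] == filter_val, ]
  }
  data[is.na(data[, na_col]), return_col]
}

# One call per (na_col, filter_val) pair; names come from the first vector
result <- Map(function(col, val) foo(d, col, "pin", "type", val),
              c("v1", "v2"), c(1, 2))
result
```

The result is the same as the loop's, except that the list elements are named v1 and v2.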
A different approach I would suggest is melting your data into a long format, and simply taking a subset of the NA values, hence getting all combinations of type and the v* columns that have NA values at once. Do this once, and no function is needed to look up individual combinations.
d_long = reshape2::melt(d, id.vars = c("pin", "type"))
library(dplyr)
d_long %>% filter(is.na(value)) %>%
arrange(variable, type)
# pin type variable value
# 1 2 1 v1 NA
# 2 3 2 v1 NA
# 3 4 2 v1 NA
# 4 1 1 v2 NA
# 5 2 1 v2 NA
The goal is to rename the columns of a list of data frames, adding each data frame's name to its new column names.
Example: from x to a_x and b_x.
Why? Because I plan to later merge the sets and would like clean ids for the columns.
a = data.frame(x = c(1,2))
b = data.frame(x = c(3,4))
frameList = list(a = a, b = b)
newName = c(*frameName+'_'+'x')
names = lapply(names, setNames, nm = newName)
list2env(names,.GlobalEnv)
Here is one way for you. I looped through each data frame in frameList using the length of frameList. For the column names in each data frame, I took the name of the data frame (i.e., names(frameList)) and pasted it onto the column names in that data frame.
a = data.frame(x = c(1,2), y = 1:2)
b = data.frame(x = c(3,4), y = 1:2)
frameList = list(a = a, b = b)
lapply(1:length(names(frameList)), function(x) {
names(frameList[[x]]) <- paste(names(frameList)[x], names(frameList[[x]]), sep = "_")
return(frameList[[x]])
})
[[1]]
a_x a_y
1 1 1
2 2 2
[[2]]
b_x b_y
1 3 1
2 4 2
Or another option is Map
Map(function(x, y) setNames(x, paste(y, names(x), sep="_")), frameList, names(frameList))
#$a
# a_x a_y
#1 1 1
#2 2 2
#$b
# b_x b_y
#1 3 1
#2 4 2
Or with tidyverse
library(purrr)
library(dplyr)
f1 <- function(x, y) paste(y, x, sep="_")
map2(frameList, names(frameList), ~ .x %>%
rename_all(f1, .y))
If we need it in the reverse order, this is simpler
map2(frameList, names(frameList), ~ .x %>%
rename_all(paste, sep="_", .y))
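Since the stated goal is to combine the data frames afterwards, here is a base R sketch of both naming orders followed by a column bind. It assumes the frames have the same number of rows; prefixed, suffixed, and combined are names introduced for illustration.

```r
frameList <- list(a = data.frame(x = c(1, 2), y = 1:2),
                  b = data.frame(x = c(3, 4), y = 1:2))

# Data frame name first: a_x, a_y, b_x, b_y
prefixed <- Map(function(df, nm) setNames(df, paste(nm, names(df), sep = "_")),
                frameList, names(frameList))

# Column name first: x_a, y_a, x_b, y_b
suffixed <- Map(function(df, nm) setNames(df, paste(names(df), nm, sep = "_")),
                frameList, names(frameList))

# unname() keeps cbind from adding its own "a." and "b." prefixes
combined <- do.call(cbind, unname(prefixed))
combined
```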