Parse unexpected symbol error in function applied over list - r

I'm trying to check the "pin" numbers of cases with missing data for each variable of interest in my dataset.
Here are some fake data:
c <- data.frame(pin = c(1, 2, 3, 4), type = c(1, 1, 2, 2),
                v1 = c(1, NA, NA, NA), v2 = c(NA, NA, 1, 1))
I wrote a function "m.pin" to do this:
m.pin <- function(x, data = "c", return = "$pin") {
  sect <- gsub("^.*\\[", "\\[", deparse(substitute(x)))
  vect <- eval(parse(text = paste(data, return, sect, sep = "")))
  return(vect[is.na(x)])
}
And I use it like so:
m.pin(c$v1[c$type == 1])
[1] 2
I wrote a function to apply "m.pin" over a list of variables to only return pins with missing data:
return.m.pin <- function(x, fun = m.pin) {
  val.list <- lapply(x, fun)
  condition <- lapply(val.list, function(x) length(x) > 0)
  val.list[unlist(condition)]
}
But when I apply it, I get this error:
l <- lst(c$v1[c$type == 1], c$v2[c$type == 2])
return.m.pin(l)
Error in parse(text = paste(data, return, sect, sep = "")) :
<text>:1:9: unexpected ']'
1: c$pin[i]]
^
How can I rewrite my function(s) to avoid this issue?
Many thanks!

Please see Gregor's comment for the most critical issues with your code (one thing to add: don't use return as a variable name, since it is the name of a base R function).
It's not clear to me why you want to define a specific function m.pin, nor what you ultimately are trying to do, but I am assuming this is a critical design component.
Rewriting m.pin as
m.pin <- function(df, type, vcol) which(df[, "type"] == type & is.na(df[, vcol]))
we get
m.pin(df, 1, "v1")
#[1] 2
Or to identify rows with NA in "v1" for all types
lapply(unique(df$type), function(x) m.pin(df, x, "v1"))
#[[1]]
#[1] 2
#
#[[2]]
#[1] 3 4
Update
In response to Gregor's comment, perhaps this is what you're after?
by(df, df$type, function(x)
list(v1 = x$pin[which(is.na(x$v1))], v2 = x$pin[which(is.na(x$v2))]))
# df$type: 1
# $v1
# [1] 2
#
# $v2
# [1] 1 2
#
# ------------------------------------------------------------
# df$type: 2
# $v1
# [1] 3 4
#
# $v2
# integer(0)
This returns a list of the pin numbers for every type and NA entries in v1/v2.
Sample data
df <- data.frame(
  pin = c(1, 2, 3, 4),
  type = c(1, 1, 2, 2),
  v1 = c(1, NA, NA, NA),
  v2 = c(NA, NA, 1, 1))

I would suggest rewriting like this (if this approach is to be taken at all). I call your data d because c is already the name of an extremely common function.
# string column names, pass in the data frame as an object
# means no need for eval, parse, substitute, etc.
foo = function(data, na_col, return_col = "pin", filter_col, filter_val) {
  if (!missing(filter_col) & !missing(filter_val)) {
    data = data[data[, filter_col] == filter_val, ]
  }
  data[is.na(data[, na_col]), return_col]
}
# working on the whole data frame
foo(d, na_col = "v1", return_col = "pin")
# [1] 2 3 4
# passing in a subset of the data
foo(d[d$type == 1, ], "v1", "pin")
# [1] 2
# using function arguments to subset the data
foo(d, "v1", "pin", filter_col = "type", filter_val = 1)
# [1] 2
# calling it with changing arguments:
# you could use `Map` or `mapply` to be fancy, but this for loop is nice and clear
inputs = data.frame(na_col = c("v1", "v2"), filter_val = c(1, 2), stringsAsFactors = FALSE)
result = list()
for (i in 1:nrow(inputs)) {
  result[[i]] = foo(d, na_col = inputs$na_col[i], return_col = "pin",
                    filter_col = "type", filter_val = inputs$filter_val[i])
}
result
# [[1]]
# [1] 2
#
# [[2]]
# numeric(0)
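The Map alternative mentioned in the comment above would look roughly like this (just a sketch, reusing the same foo and inputs as the loop):
Map(function(col, val) foo(d, na_col = col, return_col = "pin",
                           filter_col = "type", filter_val = val),
    inputs$na_col, inputs$filter_val)
# same values as result above, returned as a list named by na_col ("v1", "v2")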
A different approach I would suggest is melting your data into a long format, and simply taking a subset of the NA values, hence getting all combinations of type and the v* columns that have NA values at once. Do this once, and no function is needed to look up individual combinations.
d_long = reshape2::melt(d, id.vars = c("pin", "type"))
library(dplyr)
d_long %>% filter(is.na(value)) %>%
arrange(variable, type)
# pin type variable value
# 1 2 1 v1 NA
# 2 3 2 v1 NA
# 3 4 2 v1 NA
# 4 1 1 v2 NA
# 5 2 1 v2 NA
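reshape2 is retired these days; if you prefer to stay in the tidyverse, the same long format can be built with tidyr (a sketch, assuming tidyr >= 1.0 is available):
library(tidyr)
d_long = pivot_longer(d, cols = c(v1, v2), names_to = "variable", values_to = "value")
d_long %>% filter(is.na(value)) %>%
  arrange(variable, type)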

Related

R: if statement inside function (lapply)

I have a large list of dataframes with environmental variables from different localities. For each of the dataframes in the list, I want to summarize the values across locality (= group measurements of the same locality into one), using the name of the dataframes as a condition for which variables need to be summarized. For example, for a dataframe with the name 'salinity' I want to only summarize across salinity, and not the other environmental variables. Note that the different dataframes contain data from different localities, so I cannot simply merge them into one dataframe.
Let's do this with a dummy dataset:
#create list of dataframes
df1 = data.frame(locality = c(1, 2, 2, 5, 7, 7, 9),
Temp = c(14, 15, 16, 18, 20, 18, 21),
Sal = c(16, NA, NA, 12, NA, NA, 9))
df2 = data.frame(locality = c(1, 1, 3, 6, 8, 9, 9),
Temp = c(1, 2, 4, 5, 0, 2, -1),
Sal = c(18, NA, NA, NA, 36, NA, NA))
df3 = data.frame(locality = c(1, 3, 4, 4, 5, 5, 9),
Temp = c(14, NA, NA, NA, 17, 18, 21),
Sal = c(16, 8, 24, 23, 11, 12, 9))
df4 = data.frame(locality = c(1, 1, 1, 4, 7, 8, 10),
Temp = c(1, NA, NA, NA, NA, 0, 2),
Sal = c(18, 17, 13, 16, 20, 36, 30))
df_list = list(df1, df2, df3, df4)
names(df_list) = c("Summer_temperature", "Winter_temperature",
"Summer_salinity", "Winter_salinity")
Next, I used lapply to summarize environmental variables:
#select only those dataframes in the list that have either 'salinity' or 'temperature' in the dataframe names
df_sal = df_list[grep("salinity", names(df_list))]
df_temp = df_list[grep("temperature", names(df_list))]
#use apply to summarize salinity or temperature values in each dataframe
##salinity
df_sal2 = lapply(df_sal, function(x) {
x %>%
group_by(locality) %>%
summarise(Sal = mean(Sal, na.rm = TRUE))
})
##temperature
df_temp2 = lapply(df_temp, function(x) {
x %>%
group_by(locality) %>%
summarise(Temp = mean(Temp, na.rm = TRUE))
})
Now, this code is repetitive, so I want to downsize this by combining everything into one function. This is what I tried:
df_env = lapply(df_list, function(x) {
if (grepl("salinity", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Sal = mean(Sal, na.rm = TRUE))}
if (grepl("temperature", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Temp = mean(Temp, na.rm = TRUE))}
})
But I am getting the following output:
$Summer_temperature
NULL
$Winter_temperature
NULL
$Summer_salinity
NULL
$Winter_salinity
NULL
And the following warning messages:
Warning messages:
1: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
2: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
3: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
4: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
5: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
6: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
7: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
8: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
Now, I read here that this warning message can potentially be solved by using ifelse. However, in the final dataset I will have more than two environmental variables, so I will have to add many more if statements - for this reason I believe ifelse is not a solution here. Does anyone have an elegant solution to my problem? I am new to using both functions and lapply, and would appreciate any help you can give me.
EDIT:
I tried using the else if option suggested in one of the answers, but this still returns NULL values. I also tried using return() and assigning the output to x, but both have the same problem as the code below - any ideas?
#else if
df_env = lapply(df_list, function(x) {
if (grepl("salinity", names(x)) == TRUE) {
x %>% group_by(locality) %>%
summarise(Sal = mean(Sal, na.rm = TRUE))}
else if (grepl("temperature", names(x)) == TRUE) {
x %>% group_by(locality) %>%
summarise(Temp = mean(Temp, na.rm = TRUE))}
})
df_env
What I think is happening is that my if argument does not get passed to the summarize function, so nothing is being summarized.
Several things going on here, including
as akrun said, if statements must have a condition with a length of 1. Yours are not.
grepl("locality", names(df1))
# [1] TRUE FALSE FALSE
That must be reduced so that it is always exactly length 1. Frankly, grepl is the wrong tool here, since technically a column named notlocality would match and then it would error. I suggest you change to
"locality" %in% names(df1)
# [1] TRUE
You need to return something. Always. You shifted from if ...; if ...; to if ... else if ..., which is a good start, but really if you meet neither condition, then nothing is returned. I suggest one of the following: either add one more } else x, or reassign as if (..) { x <- x %>% ...; } else if (..) { x <- x %>% ... ; } and then end the anon-func with just x (to return it).
However, I think ultimately the problem is that you are looking for "temperature" or "salinity" which are in the names of the list-objects, not in the frames themselves. For instance, your reference to names(x) is returning c("locality", "Temp", "Sal"), the names of the frame x itself.
I think this is what you want?
Map(function(x, nm) {
  if (grepl("salinity", nm)) {
    x %>%
      group_by(locality) %>%
      summarize(Sal = mean(Sal, na.rm = TRUE))
  } else if (grepl("temperature", nm)) {
    x %>%
      group_by(locality) %>%
      summarize(Temp = mean(Temp, na.rm = TRUE))
  } else x
}, df_list, names(df_list))
# $Summer_temperature
# # A tibble: 5 x 2
# locality Temp
# <dbl> <dbl>
# 1 1 14
# 2 2 15.5
# 3 5 18
# 4 7 19
# 5 9 21
# $Winter_temperature
# # A tibble: 5 x 2
# locality Temp
# <dbl> <dbl>
# 1 1 1.5
# 2 3 4
# 3 6 5
# 4 8 0
# 5 9 0.5
# $Summer_salinity
# # A tibble: 5 x 2
# locality Sal
# <dbl> <dbl>
# 1 1 16
# 2 3 8
# 3 4 23.5
# 4 5 11.5
# 5 9 9
# $Winter_salinity
# # A tibble: 5 x 2
# locality Sal
# <dbl> <dbl>
# 1 1 16
# 2 4 16
# 3 7 20
# 4 8 36
# 5 10 30
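If you are already in the tidyverse, purrr::imap() passes each element together with its name, so the Map() call above can be written equivalently as (a sketch; dplyr is assumed to be loaded, as in the code above):
library(purrr)
df_env <- imap(df_list, function(x, nm) {
  if (grepl("salinity", nm)) {
    x %>% group_by(locality) %>% summarise(Sal = mean(Sal, na.rm = TRUE))
  } else if (grepl("temperature", nm)) {
    x %>% group_by(locality) %>% summarise(Temp = mean(Temp, na.rm = TRUE))
  } else x
})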

Conversion of dataframe into specially formatted list for use in function

I have a data frame like the following:
df = data.frame(name = c("chr", "test"), ncol = c(2, 3))
However, for input into a function (ComplexHeatmap), I need a list like the following:
list(chr = list(ncol = 2), test = list(ncol = 3))
What is the easiest way to convert from this data frame into a list of this format? Doing as.list does not get the right format.
Thanks!
Jack
One approach would involve using plyr. I suspect you may also have more than those two columns, so let
df <- data.frame(name = c("chr", "test"), ncol = c(2, 3), a = 1:2)
# name ncol a
# 1 chr 2 1
# 2 test 3 2
Then
library(plyr)
dlply(df, .(name), function(r) as.list(r[-1]))
# $chr
# $chr$ncol
# [1] 2
#
# $chr$a
# [1] 1
#
#
# $test
# $test$ncol
# [1] 3
#
# $test$a
# [1] 2
#
#
# attr(,"split_type")
# [1] "data.frame"
# attr(,"split_labels")
# name
# 1 chr
# 2 test
One option would be split from base R
split(setNames(as.list(df$ncol), rep('ncol', nrow(df))), df$name)
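Another base-R route (my own sketch, assuming one row per name) is to build the nested list directly and then name it; like the plyr version, it keeps any extra columns:
out <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, -1, drop = FALSE]))
names(out) <- df$name
out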

how to handle vector as element in data frame? [R]

How do I store a few numbers in one element of data frame?
For example I want a summary of my data, including the class and values in each column.
dat = data.frame(STATE = 1:5,
MONTH = 1:5)
should yield:
var class values
STATE numeric c(1,2,3,4,5)
MONTH numeric c(1,2,3,4,5)
Now I try:
dat = data.frame(STATE = 1:5,
MONTH = 1:5)
vars = data.frame(var = colnames(dat), class = NA, values = NA,
stringsAsFactors = F)
vars$class = sapply(dat, class)
vars
# var class values
# 1 STATE integer NA
# 2 MONTH integer NA
vars$values = sapply(dat, function(x) unique(x))
# Error in `$<-.data.frame`(`*tmp*`, "values", value = c(1L, 2L, 3L, 4L, :
# replacement has 5 rows, data has 2
# UPDATE: #jMathew 's answer:
vars$values = sapply(dat, function(x) list(unique(x)))
vars
# var class values
# 1 STATE integer 1, 2, 3, 4, 5
# 2 MONTH integer 1, 2, 3, 4, 5
It doesn't work because unique(dat$STATE) is c(1, 2, 3, 4, 5), so R tries to insert 5 elements into the data frame and can't fit them into one element.
But the above code works for many data sets I work with, e.g.:
library(foreign)
dat = read.xport('LLCP2013.XPT')
# download from http://www.cdc.gov/brfss/annual_data/2013/files/LLCP2013XPT.ZIP
dat = dat[1:5, 1:3]
dat
# X_STATE FMONTH IDATE
# 1 1 1 01092013
# 2 1 1 01192013
# 3 1 1 01192013
# 4 1 1 01112013
# 5 1 2 02062013
vars = data.frame(var = colnames(dat), class = NA, values = NA,
stringsAsFactors = F)
vars$class = sapply(dat, class)
vars$values = sapply(dat, function(x) unique(x))
vars
# var class values
# 1 X_STATE numeric 1
# 2 FMONTH numeric 1, 2
# 3 IDATE factor 16, 36, 20, 70
# UPDATE:
class(vars[3,3])
# [1] "list"
# #jMathew was right, it was somehow coerced to list
Can somebody tell me why this works in the second case but not in the first? Thanks
I suspect that in your second case, the vector is being coerced to a list.
Try this on your first example:
vars$values = sapply(dat, function(x) list(unique(x)))
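A slightly more direct route (my addition, not from the answers above): assigning a plain list of the right length also creates a list-column, without relying on sapply's simplification rules.
vars$values = lapply(dat, unique)  # one list element per row of vars
vars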
We could try
do.call(rbind,lapply(seq_along(dat), function(i)
data.frame(var=names(dat)[i], class=class(dat[,i]),
values= sprintf('c(%s)', toString(unique(dat[,i]))))))
# var class values
#1 STATE integer c(1, 2, 3, 4, 5)
#2 MONTH integer c(1, 2, 3, 4, 5)

Rename multiple columns by names

Someone should have asked this already, but I couldn't find an answer. Say I have:
x = data.frame(q=1,w=2,e=3, ...and many many columns...)
what is the most elegant way to rename an arbitrary subset of columns, whose position I don't necessarily know, into some other arbitrary names?
e.g. Say I want to rename "q" and "e" into "A" and "B", what is the most elegant code to do this?
Obviously, I can do a loop:
oldnames = c("q","e")
newnames = c("A","B")
for(i in 1:2) names(x)[names(x) == oldnames[i]] = newnames[i]
But I wonder if there is a better way? Maybe using some of the packages? (plyr::rename etc.)
With dplyr you would do:
library(dplyr)
df = data.frame(q = 1, w = 2, e = 3)
df %>% rename(A = q, B = e)
# A w B
#1 1 2 3
Or if you want to use vectors, as suggested by #Jelena-bioinf:
library(dplyr)
df = data.frame(q = 1, w = 2, e = 3)
oldnames = c("q","e")
newnames = c("A","B")
df %>% rename_at(vars(oldnames), ~ newnames)
# A w B
#1 1 2 3
L. D. Nicolas May suggested a change, given that rename_at is being superseded by rename_with:
df %>%
rename_with(~ newnames[which(oldnames == .x)], .cols = oldnames)
# A w B
#1 1 2 3
setnames from the data.table package will work on data.frames or data.tables
library(data.table)
d <- data.frame(a=1:2,b=2:3,d=4:5)
setnames(d, old = c('a','d'), new = c('anew','dnew'))
d
# anew b dnew
# 1 1 2 4
# 2 2 3 5
Note that changes are made by reference, so no copying (even for data.frames!)
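A small illustration of that by-reference behaviour (my addition): if you need to keep the original data untouched, work on an explicit copy.
d2 <- copy(d)                          # data.table::copy() makes a real copy
setnames(d2, old = "anew", new = "a")  # d2 is renamed, d still has "anew"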
Another solution for dataframes which are not too large is (building on @thelatemail's answer):
x <- data.frame(q=1,w=2,e=3)
> x
q w e
1 1 2 3
colnames(x) <- c("A","w","B")
> x
A w B
1 1 2 3
Alternatively, you can also use:
names(x) <- c("C","w","D")
> x
C w D
1 1 2 3
Furthermore, you can also rename a subset of the columnnames:
names(x)[2:3] <- c("E","F")
> x
C E F
1 1 2 3
Here is the most efficient way I have found to rename multiple columns using a combination of purrr::set_names() and a few stringr operations.
library(tidyverse)
# Make a tibble with bad names
data <- tibble(
`Bad NameS 1` = letters[1:10],
`bAd NameS 2` = rnorm(10)
)
data
# A tibble: 10 x 2
`Bad NameS 1` `bAd NameS 2`
<chr> <dbl>
1 a -0.840
2 b -1.56
3 c -0.625
4 d 0.506
5 e -1.52
6 f -0.212
7 g -1.50
8 h -1.53
9 i 0.420
10 j 0.957
# Use purrr::set_names() with an anonymous function of stringr operations
data %>%
set_names(~ str_to_lower(.) %>%
str_replace_all(" ", "_") %>%
str_replace_all("bad", "good"))
# A tibble: 10 x 2
good_names_1 good_names_2
<chr> <dbl>
1 a -0.840
2 b -1.56
3 c -0.625
4 d 0.506
5 e -1.52
6 f -0.212
7 g -1.50
8 h -1.53
9 i 0.420
10 j 0.957
Update dplyr 1.0.0
The newest dplyr version became more flexible by adding rename_with(), where the _with suffix refers to taking a function as input. The trick is to reformulate the character vector newnames as a formula (with ~), so it is equivalent to function(x) return(newnames).
In my subjective opinion, that is the most elegant dplyr expression.
Update: thanks to #desval, the oldnames vector must be wrapped by all_of to include all its elements:
# shortest & most elegant expression
df %>% rename_with(~ newnames, all_of(oldnames))
A w B
1 1 2 3
Side note:
If you reverse the order of the arguments, .fn must be named explicitly, because .fn is expected before the .cols argument:
df %>% rename_with(oldnames, .fn = ~ newnames)
A w B
1 1 2 3
or specify the .cols argument:
df %>% rename_with(.cols = oldnames, ~ newnames)
A w B
1 1 2 3
I recently ran into this myself. If you're not sure whether the columns exist and only want to rename those that do:
existing <- match(oldNames,names(x))
names(x)[na.omit(existing)] <- newNames[which(!is.na(existing))]
Building on #user3114046's answer:
x <- data.frame(q=1,w=2,e=3)
x
# q w e
#1 1 2 3
names(x)[match(oldnames, names(x))] <- newnames
x
# A w B
#1 1 2 3
This won't be reliant on a specific ordering of columns in the x dataset.
You can use a named vector. Below are two options (with base R and dplyr).
base R, via subsetting:
x = data.frame(q = 1, w = 2, e = 3)
rename_vec <- c(q = "A", e = "B")
## vector of same length as names(x) which returns NA if there is no match to names(x)
which_rename <- rename_vec[names(x)]
## simple ifelse where names(x) will be renamed for every non-NA
names(x) <- ifelse(is.na(which_rename), names(x), which_rename)
x
#> A w B
#> 1 1 2 3
Or a dplyr option with !!!:
library(dplyr)
rename_vec <- c(A = "q", B = "e") # the names are just the other way round than in the base R way!
x %>% rename(!!!rename_vec)
#> A w B
#> 1 1 2 3
The latter works because the 'big-bang' operator !!! is forcing evaluation of a list or a vector.
?`!!`
!!! forces-splice a list of objects. The elements of the list are
spliced in place, meaning that they each become one single argument.
names(x)[names(x) %in% c("q","e")]<-c("A","B")
This would change all the occurrences of those letters in all names:
names(x) <- gsub("q", "A", gsub("e", "B", names(x) ) )
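If you only want to rename exact matches rather than every occurrence of those letters, anchor the patterns (a small addition to the answer above):
names(x) <- gsub("^q$", "A", gsub("^e$", "B", names(x)))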
There are a few answers mentioning the functions dplyr::rename_with and rlang::set_names already, but they are separate; this answer illustrates the differences between the two and the use of functions and formulas to rename columns.
rename_with from the dplyr package can use either a function or a formula
to rename a selection of columns given as the .cols argument. For example passing the function name toupper:
library(dplyr)
rename_with(head(iris), toupper, starts_with("Petal"))
Is equivalent to passing the formula ~ toupper(.x):
rename_with(head(iris), ~ toupper(.x), starts_with("Petal"))
When renaming all columns, you can also use set_names from the rlang package. To make a different example, let's use paste0 as a renaming function. paste0 takes 2 arguments; as a result, there are different ways to pass the second argument depending on whether we use a function or a formula.
rlang::set_names(head(iris), paste0, "_hi")
rlang::set_names(head(iris), ~ paste0(.x, "_hi"))
The same can be achieved with rename_with by passing the data frame as first
argument .data, the function as second argument .fn, all columns as third
argument .cols=everything() and the function parameters as the fourth
argument .... Alternatively you can place the second, third and fourth
arguments in a formula given as the second argument.
rename_with(head(iris), paste0, everything(), "_hi")
rename_with(head(iris), ~ paste0(.x, "_hi"))
rename_with only works with data frames. set_names is more generic and can also rename a plain vector:
rlang::set_names(1:4, c("a", "b", "c", "d"))
If the table contains two columns with the same base name (for example oldname.x and oldname.y after a join), the code goes like this:
rename(df, newname_x = oldname.x, newname_y = oldname.y)
You can get the name set, save it as a vector, and then do your bulk renaming on the strings. A good example of this is when you are doing a long to wide transition on a dataset:
names(labWide)
Lab1 Lab10 Lab11 Lab12 Lab13 Lab14 Lab15 Lab16
1 35.75366 22.79493 30.32075 34.25637 30.66477 32.04059 24.46663 22.53063
nameVec <- names(labWide)
nameVec <- gsub("Lab","LabLat",nameVec)
names(labWide) <- nameVec
"LabLat1" "LabLat10" "LabLat11" "LabLat12" "LabLat13" "LabLat14""LabLat15" "LabLat16" "
Side note: if you want to prepend one string to all of the column names, you can just use this simple code.
colnames(df) <- paste("renamed_",colnames(df),sep="")
Lots of sort-of-answers, so I just wrote the function so you can copy/paste.
rename <- function(x, old_names, new_names) {
  stopifnot(length(old_names) == length(new_names))
  # pull out the names that are actually in x
  old_nms <- old_names[old_names %in% names(x)]
  new_nms <- new_names[old_names %in% names(x)]
  # call out the column names that don't exist
  not_nms <- setdiff(old_names, old_nms)
  if (length(not_nms) > 0) {
    msg <- paste(paste(not_nms, collapse = ", "),
                 "are not columns in the dataframe, so won't be renamed.")
    warning(msg)
  }
  # rename (match() keeps the old/new pairing even if the columns appear in a different order)
  names(x)[match(old_nms, names(x))] <- new_nms
  x
}
x = data.frame(q = 1, w = 2, e = 3)
rename(x, c("q", "e"), c("Q", "E"))
Q w E
1 1 2 3
If one row of the data contains the names you want to use for all columns, you can do
names(data) <- data[row, ]
where data is your data frame and row is the row number containing the new names.
Then you can remove the row containing the names with
data <- data[-row, ]
This is the function that you need: just pass your x to rename(x) and it will rename every column that appears in oldNames; anything not found is skipped without an error.
rename <- function(x) {
  oldNames = c("a", "b", "c")
  newNames = c("d", "e", "f")
  existing <- match(oldNames, names(x))
  names(x)[na.omit(existing)] <- newNames[which(!is.na(existing))]
  return(x)
}
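A quick usage sketch (my own example data, given the hard-coded oldNames/newNames above):
x <- data.frame(a = 1, b = 2, z = 3)
rename(x)
#   d e z
# 1 1 2 3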
Many good answers above use specialized packages. Here is a simple way of doing it with little beyond base R; note that transpose() below comes from data.table (purrr has a similar transpose()), so it is not strictly base R.
library(data.table)  # for transpose()
df.rename.cols <- function(df, col2.list) {
  tlist <- transpose(col2.list)
  names(df)[which(names(df) %in% tlist[[1]])] <- tlist[[2]]
  df
}
Here is an example:
df1 <- data.frame(A = c(1, 2), B = c(3, 4), C = c(5, 6), D = c(7, 8))
col.list <- list(c("A", "NewA"), c("C", "NewC"))
df.rename.cols(df1, col.list)
NewA B NewC D
1 1 3 5 7
2 2 4 6 8
I recently built off of #agile bean's answer (using rename_with, formerly rename_at) to build a function which changes column names if they exist in the data frame, such that one can make the column names of heterogeneous data frames match each other when applicable.
The looping can surely be improved, but figured I'd share for posterity.
create example data frame:
x= structure(list(observation_date = structure(c(18526L, 18784L,
17601L), class = c("IDate", "Date")), year = c(2020L, 2021L,
2018L)), sf_column = "geometry", agr = structure(c(id = NA_integer_,
common_name = NA_integer_, scientific_name = NA_integer_, observation_count = NA_integer_,
country = NA_integer_, country_code = NA_integer_, state = NA_integer_,
state_code = NA_integer_, county = NA_integer_, county_code = NA_integer_,
observation_date = NA_integer_, time_observations_started = NA_integer_,
observer_id = NA_integer_, sampling_event_identifier = NA_integer_,
protocol_type = NA_integer_, protocol_code = NA_integer_, duration_minutes = NA_integer_,
effort_distance_km = NA_integer_, effort_area_ha = NA_integer_,
number_observers = NA_integer_, all_species_reported = NA_integer_,
group_identifier = NA_integer_, year = NA_integer_, checklist_id = NA_integer_,
yday = NA_integer_), class = "factor", .Label = c("constant",
"aggregate", "identity")), row.names = c("3", "3.1", "3.2"), class = "data.frame")
function:
match_col_names <- function(x) {
  col_names <- list(date = c("observation_date", "date"),
                    C = c("observation_count", "count", "routetotal"),
                    yday = c("dayofyear"),
                    latitude = c("lat"),
                    longitude = c("lon", "long"))
  for (i in seq_along(col_names)) {
    newname = names(col_names)[i]
    oldnames = col_names[[i]]
    toreplace = names(x)[which(names(x) %in% oldnames)]
    x <- x %>%
      rename_with(~newname, toreplace)
  }
  return(x)
}
apply function:
x <- match_col_names(x)
For execution time purposes, I would suggest using a data.table structure:
> df = data.table(x = 1:10, y = 3:12, z = 4:13)
> oldnames = c("x","y","z")
> newnames = c("X","Y","Z")
> library(microbenchmark)
> library(data.table)
> library(dplyr)
> microbenchmark(dplyr_1 = df %>% rename_at(vars(oldnames), ~ newnames) ,
+ dplyr_2 = df %>% rename(X=x,Y=y,Z=z) ,
+ data_tabl1= setnames(copy(df), old = c("x","y","z") , new = c("X","Y","Z")),
+ times = 100)
Unit: microseconds
expr min lq mean median uq max neval
dplyr_1 5760.3 6523.00 7092.538 6864.35 7210.45 17935.9 100
dplyr_2 2536.4 2788.40 3078.609 3010.65 3282.05 4689.8 100
data_tabl1 170.0 218.45 368.261 243.85 274.40 12351.7 100

aggregate values in dataframe by partly matching rownames in R

I'm fumbling around with the following problem, but to no avail:
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
value
abc 1
abcd 2
ef 3
gh 4
l <- nrow(d)
wordmat <- matrix(rep(NA, l^2), l, l, dimnames = list(row.names(d), row.names(d)))
for (i in 1:ncol(wordmat)) {
rid <- agrep(colnames(wordmat)[i], rownames(wordmat), max = 0)
d$matchid[i] <- paste(rid, collapse = ";")
}
# desired output:
(d_agg <- data.frame(value = c(3, 3, 4), row.names = c("abc;abcd", "ef", "gh")))
value
abc;abcd 3
ef 3
gh 4
is there a function for this?
Here's a possible solution that you might be able to modify to suit your needs.
Some notes:
I couldn't figure out how to deal with rownames() directly, particularly in the last stage, so this depends on you being happy with copying your row names as a new variable.
The function below "hard-codes" the variable names, functions, and so on. That is to say, it is not by any means a generalized function, but one which might be useful as you look further into this problem.
Here's the function.
matches <- function(data, ...) {
  temp = vector("list", nrow(data))
  for (i in 1:nrow(data)) {
    temp1 = agrep(data$RowNames[i], data$RowNames, value = TRUE, ...)
    temp[[i]] = data.frame(RowNames = paste(temp1, collapse = "; "),
                           value = sum(data[temp1, "value"]))
  }
  temp = do.call(rbind, temp)
  temp[!duplicated(temp$RowNames), ]
}
Note that the function needs a column called RowNames, so we'll create that, and then test the function.
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
d$RowNames <- rownames(d)
matches(d)
# RowNames value
# 1 abc; abcd 3
# 3 ef 3
# 4 gh 4
matches(d, max.distance = 2)
# RowNames value
# 1 abc; abcd 3
# 3 abc; abcd; ef; gh 10
matches(d, max.distance = 4)
# RowNames value
# 1 abc; abcd; ef; gh 10
This works for your example but may need tweaking for the real thing:
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
rowclust <- hclust(as.dist(adist(rownames(d))), method="single")
rowgroups <- cutree(rowclust, h=1.5)
rowagg <- aggregate(d, list(rowgroups), sum)
rowname <- unclass(by(rownames(d), rowgroups, paste, collapse=";"))
rownames(rowagg) <- rowname
rowagg
Group.1 value
abc;abcd 1 3
ef 2 3
gh 3 4
