I need to give a data.frame duplicate column names. Inside data.frame() you can use check.names = FALSE to do the naughty name deed, but if you then index the result, you lose the naughty names. I want to retain those names. Below is an example, the output I get, and the output I'd like to get:
x <- data.frame(b = 4:6, a = 6:8, a = 6:8, check.names = FALSE)
x[, -1]
I get:
a a.1
1 6 6
2 7 7
3 8 8
I'd like:
a a
1 6 6
2 7 7
3 8 8
How about this:
subdf <- function(df, ii) {
  do.call("data.frame", c(as.list(df)[ii], check.names = FALSE))
}
subdf(x, -1)
# a a
# 1 6 6
# 2 7 7
# 3 8 8
subdf(x, 2:3)
# a a
# 1 6 6
# 2 7 7
# 3 8 8
Here's an ugly solution
> tmp <- data.frame(b=4:6, a=6:8, a=6:8, check.names=FALSE)
> setNames(tmp[, -1], names(tmp)[-1])
a a
1 6 6
2 7 7
3 8 8
Looking at the code for [.data.frame, this appears as part of the function:
if (anyDuplicated(cols))
names(y) <- make.unique(cols)
and I couldn't see anything in the code that would allow one to skip that check. So it looks like we'll just have to write our own function. It's not very safe though and I'm sure a much better version could be created...
dropCols <- function(x, cols) {
  nm <- colnames(x)
  x <- x[, -cols]
  colnames(x) <- nm[-cols]
  x
}
x <- data.frame(b = 4:6, a = 6:8, a = 6:8, check.names = FALSE)
#x[, -1]
dropCols(x, 1)
# a a
#1 6 6
#2 7 7
#3 8 8
Per Dirk's tongue-in-cheek comment:
safe.data.frame <- function(dat, index) {
  colnam <- colnames(dat)[index]
  dat2 <- dat[, index]
  colnames(dat2) <- colnam
  dat2
}
safe.data.frame(x, -1)
I was hoping for something better :)
Related
My data frame:
set.seed(1)
df <- data_frame(col1 = c(1:49), col2 = sample(c(0:20), 49, replace = T))
My list:
fields <- list(A = c(2:4, 12:16, 24:28, 36:40, 48:49),
               B = c(6:10, 18:22, 30:34, 42:46))
I would like to create a new column that contains the name of the vector in fields that contains the number in df$col1.
I have created a conditional for loop over fields:
col1 <- df$col1
for (i in col1) {
  if (col1[i] %in% fields[[1]] == T) {
    col1[i] <- names(fields)[1]
  } else if (col1[i] %in% fields[[2]] == T) {
    col1[i] <- names(fields)[2]
  }
}
Although this works, and I can then assign the resulting new vector col1 to my data frame, this doesn't seem very efficient to me, especially because I also have lists with more objects.
The reason I want to do this: I would like to use ggplot and dplyr to group and summarise the observations according to their position in my lists (fields, but also other lists). I hope it is clear from my question what I intend to do. Thanks!
EDIT
I have created a more generalised function that contains a nested for loop:
find_object <- function(x, list) {
  for (j in 1:length(list)) {
    for (i in 1:length(x)) {
      if (x[i] %in% list[[j]] == TRUE) {
        x[i] <- names(list)[j]
      }
    }
  }
  x
}
find_object(col1, fields)
That is more or less what I want, but this is a nested for loop, and I have heard that this is bad... Does anyone have a better solution?
Thanks
A better way is to transform the list into a data.frame and then do a join/merge:
library(dplyr)
fields.df <- stack(fields) %>% mutate(ind = as.character(ind))
df %>% left_join(fields.df, by = c('col1' = 'values'))
# col1 col2 ind
# <int> <int> <chr>
# 1 1 5 <NA>
# 2 2 7 A
# 3 3 12 A
# 4 4 19 A
# 5 5 4 <NA>
# 6 6 18 B
# 7 7 19 B
# 8 8 13 B
# 9 9 13 B
# 10 10 1 B
Note: I use left_join from dplyr because you are using data_frame. The base R merge should also work.
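For completeness, a base R sketch of the same join (note that merge() does not necessarily preserve the original row order):
merge(df, stack(fields), by.x = "col1", by.y = "values", all.x = TRUE)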
Another way would be to use match() after creating a data frame with stack().
library(dplyr)
foo <- stack(fields)
mutate(df, whatever = foo$ind[match(df$col1, foo$values)])
col1 col2 whatever
<int> <int> <fctr>
1 1 5 <NA>
2 2 7 A
3 3 12 A
4 4 19 A
5 5 4 <NA>
6 6 18 B
7 7 19 B
8 8 13 B
9 9 13 B
10 10 1 B
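Once the group label is attached (by either answer), the grouping and summarising mentioned in the question is straightforward. A sketch, using fields.df from the join answer above and an illustrative statistic (mean of col2):
library(dplyr)
df %>%
  left_join(fields.df, by = c('col1' = 'values')) %>%
  group_by(ind) %>%
  summarise(mean_col2 = mean(col2), n = n())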
I have several frames called y2010, y2011, y2012, ... and a frame called z.
"firstcolumn" contains matching names in each of them.
I want to join the content of each frame (y2010, y2011, y2012, ...) to z with left_join inside a loop.
for (i in 2010:2017) {
  z <- left_join(z, paste0("y", i), by = "firstcolumn")
}
But I cannot select the frames y2010, y2011, y2012, ... via paste0 like this; it only produces the name as a string, not the frame itself.
How can I proceed?
Use get:
z <- left_join(z, get(paste0("y", i)), by = "firstcolumn")
To avoid a for loop, you can use mget to put them in a list and lapply to merge:
library(dplyr)
lapply(mget(ls(pattern = 'y[0-9]+')), function(i) left_join(z, i, by = 'firstcolumn'))
It sounds like you might also want to look at Reduce:
Reduce(function(x, y) left_join(x, y, by = "firstcolumn"),
       mget(c("z", paste0("y", 2010:2017))))
It's always better to provide some sample data along with expected output. Here's some sample data:
ls() ## Just to show I'm starting with nothing in my workspace
# character(0)
set.seed(1)
list2env(
  setNames(
    replicate(9,
              data.frame(firstcolumn = sample(letters[1:5], 3),
                         data = sample(10, 3, TRUE),
                         stringsAsFactors = FALSE),
              simplify = FALSE),
    c("z", paste0("y", 2010:2017))),
  .GlobalEnv)
ls()
# [1] "y2010" "y2011" "y2012" "y2013" "y2014" "y2015" "y2016" "y2017" "z"
Here's a comparison of using Reduce with using your for loop:
library(dplyr)
Reduce(function(x, y) left_join(x, y, by = "firstcolumn"),
       mget(c("z", paste0("y", 2010:2017))))
# firstcolumn data.x data.y data.x.x data.y.y data.x.x.x data.y.y.y data.x.x.x.x
# 1 b 10 2 8 3 4 7 NA
# 2 e 3 1 NA NA 9 9 NA
# 3 d 9 NA 5 7 NA NA 5
# data.y.y.y.y data
# 1 5 3
# 2 NA NA
# 3 8 9
usingGet <- function() {
  for (i in 2010:2017) {
    z <- left_join(z, get(paste0("y", i)), by = "firstcolumn")
  }
  z
}
usingGet()
# firstcolumn data.x data.y data.x.x data.y.y data.x.x.x data.y.y.y data.x.x.x.x
# 1 b 10 2 8 3 4 7 NA
# 2 e 3 1 NA NA 9 9 NA
# 3 d 9 NA 5 7 NA NA 5
# data.y.y.y.y data
# 1 5 3
# 2 NA NA
# 3 8 9
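If the data.x / data.y.y suffixes above are a nuisance, one possible refinement (a sketch only, assuming every frame has exactly two columns, firstcolumn plus one data column, as in the sample data) is to rename each frame's data column to its year before joining:
frames <- mget(paste0("y", 2010:2017))
frames <- Map(function(d, nm) setNames(d, c("firstcolumn", nm)), frames, names(frames))
Reduce(function(x, y) left_join(x, y, by = "firstcolumn"), frames, z)
# columns are now firstcolumn, data (from z), y2010, ..., y2017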
DF <- data.frame(x1=c(NA,7,7,8,NA), x2=c(1,4,NA,NA,4)) # a data frame with NA
WhereAreMissingValues <- which(is.na(DF), arr.ind=TRUE) # find the position of the missing values
Modes <- apply(DF, 2, function(x) {which(tabulate(x) == max(tabulate(x)))}) # find the modes of each column
DF
WhereAreMissingValues
Modes
I would like to replace the NAs in each column of DF with that column's mode.
Any help would be appreciated.
Map provides a one-line solution here:
data.frame(Map(function(u, v) {u[is.na(u)] <- v; u}, DF, Modes))
# x1 x2
#1 7 1
#2 7 4
#3 7 4
#4 8 4
#5 7 4
Here's how I would do this.
First I'll define a helper function:
Myfunc <- function(x) as.numeric(names(sort(-table(x)))[1L])
Then just use lapply over the data set:
DF[] <- lapply(DF, function(x){x[is.na(x)] <- Myfunc(x) ; x})
DF
# x1 x2
# 1 7 1
# 2 7 4
# 3 7 4
# 4 8 4
# 5 7 4
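As a small side note on the design choice: the tabulate()-based Modes above can return more than one value when a column has a tie, while Myfunc() always picks the first most frequent value. A quick illustration on a made-up tied vector:
y <- c(1, 1, 2, 2, NA)
which(tabulate(y) == max(tabulate(y)))  # both modes: 1 2
Myfunc(y)                               # first mode only: 1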
The idea is to extract the positions of the characters in df with reference to another vector (Compare). Example:
L<-LETTERS[1:25]
A<-c(1:25)
df<-data.frame(L,A)
Compare<-c(LETTERS[sample(1:25, 25)])
df[] <- lapply(df, as.character)
for (i in 1:nrow(df)) {
  df[i, 1] <- which(df[i, 1] == Compare)
}
head(df)
L A
1 14 1
2 12 2
3 2 3
This works fine but scales very badly, like all for loops. Any ideas with apply or dplyr?
Thanks
Just use match
Your data (use set.seed when providing data using sample)
df <- data.frame(L = LETTERS[1:25], A = 1:25)
set.seed(1)
Compare <- LETTERS[sample(1:25, 25)]
Solution
df$L <- match(df$L, Compare)
head(df)
# L A
# 1 10 1
# 2 23 2
# 3 12 3
# 4 11 4
# 5 5 5
# 6 21 6
Why doesn't the following work? I get the same value in every row, and a warning as well:
data <- data.frame(id = 1:10)
slowCall <- function(id) data.frame(b = rep(id, 3), c = runif(3))
data[, c("d", "e")] <- sapply(data$id, function(id) {
  tmp <- slowCall(id)
  list(sum(tmp$b), min(tmp$c))
})
Warning message:
In `[<-.data.frame`(`*tmp*`, , c("d", "e"), value = list(3L, 0.104784948984161, :
provided 20 variables to replace 2 variables
print(data)
id d e
1 1 3 0.1047849
2 2 3 0.1047849
3 3 3 0.1047849
4 4 3 0.1047849
5 5 3 0.1047849
6 6 3 0.1047849
7 7 3 0.1047849
8 8 3 0.1047849
9 9 3 0.1047849
10 10 3 0.1047849
You could try something like this. First, vectorize the assign function (per #Joran's answer here), then modify your code slightly.
# vectorize
assignVec <- Vectorize("assign", c("x", "value"))
library(plyr)
set.seed(1) # this is just here for reproducibility
data <- data.frame(id = 1:10)
slowCall <- function(id) data.frame(b = rep(id, 3), c = runif(3))
# I store this as `tmp` just to make the code below look cleaner
tmp <- mlply(sapply(data$id, function(id) {
  tmp <- slowCall(id)
  list(sum(tmp$b), min(tmp$c))
}), c)
# here's the key part:
data <- within(data, assignVec(c('d','e'), tmp, envir=environment()))
Output:
> data
id e d
1 1 0.26550866 3
2 2 0.20168193 6
3 3 0.62911404 9
4 4 0.06178627 12
5 5 0.38410372 15
6 6 0.49769924 18
7 7 0.38003518 21
8 8 0.12555510 24
9 9 0.01339033 27
10 10 0.34034900 30
Note: I invoke plyr::mlply to get your sapply output into a list.
The simpler answer, though, is to change the right-hand side of your operation into:
data[, c("d", "e")] <- as.data.frame(t(sapply(data$id, function(id) {
  tmp <- slowCall(id)
  list(sum(tmp$b), min(tmp$c))
})))
which would give you the same result.
The problem here is that the matrix returned by your sapply contains one-element lists instead of numeric values. Change the list() to c() and transpose the output; then it will work:
data[, c("d", "e")] <- t(sapply(data$id, function(id) {
  tmp <- slowCall(id)
  c(sum(tmp$b), min(tmp$c))
}))
Here's a generic method to add two columns of different data types (e.g. character and numeric). It uses lists and transposes lists (via this answer).
Here, this answer would preserve the integer and numeric types of the two outputs.
rowwise <- lapply(data$id, function(id) {
  tmp <- slowCall(id)
  list(sum(tmp$b), min(tmp$c))
})
colwise <- lapply(seq_along(rowwise[[1]]), function(i) lapply(rowwise, "[[", i))
data[,c("d", "e")] <- colwise