rename columns of dataframe - r

I have one dataframe that basically looks like this (contains data):
t <- data.frame(x1 = 1:5, x2 = 1:5, stingsAsFactors = FALSE)
I have another dataframe that contains the original column names and a replacement for each
n <- data.frame(abb = c("x1", "x2"), erf = c("XX1", "XX2"), stringsAsFactors = FALSE)
What I would like to do is rename columns in dataframe t according to the specification in dataframe n. My problem is I can't figure out how to do that with map. Why is the following wrong:
map2_dfr(n$abb, n$erf, function(x, y) rename(t, !!y := x))

We can use rename_at
library(dplyr)
t %>%
rename_at(n$abb, ~ n$erf)

Here is a one-liner in base R using match,
names(t) <- n$erf[match(names(t), n$abb)]
t
# XX1 XX2
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5

Related

R - assign element names to column names inside a list of lists of dataframes in tidyverse

I have a list of lists of dataframes, each having one column, like this:
list(list(A = data.frame(X = 1:5),
B = data.frame(Y = 6:10),
C = data.frame(Z = 11:15)),
list(A = data.frame(X = 16:20),
B = data.frame(Y = 21:25),
C = data.frame(Z = 26:30)),
list(A = data.frame(X = 31:35),
B = data.frame(Y = 36:40),
C = data.frame(Z = 41:45))) -> dflist
I need to make it so that the column names X, Y and Z inside of each dataframe are changed to A, B and C. An important thing to add is that the names A, B and C are not known beforehand, but must be extracted from the list element names. I have a simple script that can accomplish this:
for(i in 1:3){
for(j in 1:3){
colnames(dflist[[i]][[j]]) <- names(dflist[[i]])[[j]]
}
}
However, I need to do this in tidyverse style. I have found similar questions on here, however, they only deal with lists of dataframes and not with lists of lists of dataframes and I can't find a way to make it work.
Using combination of map and imap -
library(dplyr)
library(purrr)
map(dflist, function(x)
imap(x, function(data, name)
data %>% rename_with(function(y) name)))
#[[1]]
#[[1]]$A
# A
#1 1
#2 2
#3 3
#4 4
#5 5
#[[1]]$B
# B
#1 6
#2 7
#3 8
#4 9
#5 10
#[[1]]$C
# C
#1 11
#2 12
#3 13
#4 14
#5 15
#...
#...
Also possible without purrr, using lapply and mapply (the latter with SIMPLIFY=FALSE). If dflist is your list of lists:
lapply(dflist, function(x){
mapply(function(y,z){
`colnames<-`(y, z)
}, y=x, z=names(x), SIMPLIFY=F)
})
#or on one line:
lapply(dflist, function(x) mapply(function(y,z) `colnames<-`(y, z), y=x, z=names(x), SIMPLIFY=F))
A solution with purrr:walk:
library(tidyverse)
walk(1:length(dflist),
function(x)
walk(names(dflist[[x]]), ~ {names(dflist[[x]][[.x]]) <<- .x}))

the number of variables that are repeated 2 or more times in R

I have two forms of data: a list (i.e., r) and a data.frame (i.e., df). For each form of data, how can I know the number of variables that are repeated 2 or more times (in the example below, my desired output is: AA 3 times, BB 2 times, CC 2 times)?
NOTE: the answer regardless of the form of data, should be the same.
r <- list( data.frame( AA = c(2,2,1,1,NA, NA), BB = c(1,1,1,2,2,NA), CC = c(1:5, NA)), # LIST
data.frame( AA = c(1,NA,3,1,NA,NA), DD = c(1,1,1,2,NA,NA)),
data.frame( AA = c(1,NA,3,1,NA,NA), BB = c(1,1,1,2,2,NA), CC = c(0:4, NA)) )
df <- do.call(cbind, r) ## DATA.FRAME
We can create a frequency count with >= 2 on the names of the dataset,
tbl <- table(names(df))
tbl1 <- tbl[tbl >=2]
tbl1
# AA BB CC
# 3 2 2
lapply(r, function(x) table(names(x)[names(x) %in% names(tbl1)]))
If we need it from another answer
vec <- names(unlist(r, recursive = FALSE))
nm1 <- unique(vec[duplicated(vec)])
lapply(r, function(x) table(names(x)[names(x) %in% nm1]))

R - co-locate columns with the same name after merge

Situation
I have two data frames, df1 and df2with the same column headings
x <- c(1,2,3)
y <- c(3,2,1)
z <- c(3,2,1)
names <- c("id","val1","val2")
df1 <- data.frame(x, y, z)
names(df1) <- names
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(3, 2, 1)
df2 <- data.frame(a, b, c)
names(df2) <- names
And am performing a merge
#library(dplyr) # not needed for merge
joined_df <- merge(x=df1, y=df2, c("id"),all=TRUE)
This gives me the columns in the joined_df as id, val1.x, val2.x, val1.y, val2.y
Question
Is there a way to co-locate the columns that had the same heading in the original data frames, to give the column order in the joined data frame as id, val1.x, val1.y, val2.x, val2.y?
Note that in my actual data frame I have 115 columns, so I'd like to stay clear of using joned_df <- joined_df[, c(1, 2, 4, 3, 5)] if possible.
Update/Edit: also, I would like to maintain the original order of column headings, so sorting alphabetically is not an option (-on my actual data, I realise it would work with the example I have given).
My desired output is
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1
Update with solution for general case
The accepted answer solves my issue nicely.
I've adapted the code slightly here to use the original column names, without having to hard-code them in the rep function.
#specify columns used in merge
merge_cols <- c("id")
# identify duplicate columns and remove those used in the 'merge'
dup_cols <- names(df1)
dup_cols <- dup_cols [! dup_cols %in% merge_cols]
# replicate each duplicate column name and append an 'x' and 'y'
dup_cols <- rep(dup_cols, each=2)
var <- c("x", "y")
newnames <- paste(dup_cols, ".", var, sep = "")
#create new column names and sort the joined df by those names
newnames <- c(merge_cols, newnames)
joined_df <- joined_df[newnames]
How about something like this
numrep <- rep(1:2, each = 2)
numrep
var <- c("x", "y")
var
newnames <- paste("val", numrep, ".", var, sep = "")
newdf <- cbind(joined_df$id, joined_df[newnames])
names(newdf)[1] <- "id"
Which should give you the dataframe like this
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1

Boxplot of pre-aggregated/grouped data in R

In R I want to create a boxplot over count data instead of raw data. So my table schema looks like
Value | Count
1 | 2
2 | 1
...
Instead of
Value
1
1
2
...
Where in the second case I could simply do boxplot(x)
I'm sure there's a way to do what you want with the already summarized data, but if not, you can abuse the fact that rep takes vectors:
> dat <- data.frame(Value = 1:5, Count = sample.int(5))
> dat
Value Count
1 1 1
2 2 3
3 3 4
4 4 2
5 5 5
> rep(dat$Value, dat$Count)
[1] 1 2 2 2 3 3 3 3 4 4 5 5 5 5 5
Simply wrap boxplot around that and you should get what you want. I'm sure there's a more efficient / better way to do that, but this should work for you.
I solved a similar issue recently by using the 'apply' function on each column of counts with the 'rep' function:
> datablock <- apply(countblock[-1], 2, function(x){rep(countblock$value, x)})
> boxplot(datablock)
...The above assumes that your values are in the first column and subsequent columns contain count data.
A combination of rep and data.frame can be used as an approach if another variable is needed for classification
Eg.
with(data.frame(v1=rep(data$v1,data$count),v2=(data$v2,data$count)),
boxplot(v1 ~ v2)
)
Toy data:
(besides Value and Count, I add a categorical variable Group)
set.seed(12345)
df <- data.frame(Value = sample(1:100, 100, replace = T),
Count = sample(1:10, 100, replace = T),
Group = sample(c("A", "B", "C"), 100, replace = T),
stringsAsFactors = F)
Use purrr::pmap and purrr::reduce to manipulate the data frame:
library(purrr)
data <- pmap(df, function(Value, Count, Group){
data.frame(x = rep(Value, Count),
y = rep(Group, Count))
}) %>% reduce(rbind)
boxplot(x ~ y, data = data)

Conditional merge/replacement in R

I have two data frames:
df1
x1 x2
1 a
2 b
3 c
4 d
and
df2
x1 x2
2 zz
3 qq
I want to replace some of the values in df1$x2 with values in df2$x2 based on the conditional match between df1$x1 and df2$x2 to produce:
df1
x1 x2
1 a
2 zz
3 qq
4 d
use match(), assuming values in df1 are unique.
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)
df1$x2[match(df2$x1,df1$x1)] <- df2$x2
> df1
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
If the values aren't unique, use :
for(id in 1:nrow(df2)){
df1$x2[df1$x1 %in% df2$x1[id]] <- df2$x2[id]
}
We can use {powerjoin}, and handle the conflicting columns with coalesce_yx
library(powerjoin)
df1 <- data.frame(x1 = 1:4, x2 = letters[1:4], stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)
power_left_join(df1, df2, by = "x1", conflict = coalesce_yx)
#> x1 x2
#> 1 1 a
#> 2 2 zz
#> 3 3 qq
#> 4 4 d
The first part of Joris' answer is good, but in the case of non-unique values in df1, the row-wise for-loop will not scale well on large data.frames.
You could use a data.table "update join" to modify in place, which will be quite fast:
library(data.table)
setDT(df1); setDT(df2)
df1[df2, on = .(x1), x2 := i.x2]
Or, assuming you don't care about maintaining row order, you could use SQL-inspired dplyr:
library(dplyr)
union_all(
inner_join( df1["x1"], df2 ), # x1 from df1 with matches in df2, x2 from df2
anti_join( df1, df2["x1"] ) # rows of df1 with no match in df2
) # %>% arrange(x1) # optional, won't maintain an arbitrary row order
Either of these will scale much better than the row-wise for-loop.
I see that Joris and Aaron have both chosen to build examples without factors. I can certainly understand that choice. For the reader with columns that are already factors there would also be to option of coercion to "character". There is a strategy that avoids that constraint and which also allows for the possibility that there may be indices in df2 that are not in df1 which I believe would invalidate Joris Meys' but not Aaron's solutions posted so far:
df1 <- data.frame(x1=1:4,x2=letters[1:4])
df2 <- data.frame(x1=c(2,3,5), x2=c("zz", "qq", "xx") )
It requires that the levels be expanded to include the intersection of both factor variables and then also the need to drop non-matching columns (= NA values) in match(df1$x1, df2$x1)
df1$x2 <- factor(df1$x2 , levels=c(levels(df1$x2), levels(df2$x2)) )
df1$x2[na.omit(match(df2$x1,df1$x1))] <- df2$x2[which(df2$x1 %in% df1$x1)]
df1
#-----------
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
(Note that recent versions of R do not have stringsAsFactors set to TRUE in the data.frame function defaults, unlike it was for most of the history of R.)
You can do it by matching the other way too but it's more complicated. Joris's solution is better but I'm putting this here also as a reminder to think about which way you want to match.
df1 <- data.frame(x1=1:4, x2=letters[1:4], stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3, x2=c("zz", "qq"), stringsAsFactors=FALSE)
swap <- df2$x2[match(df1$x1, df2$x1)]
ok <- !is.na(swap)
df1$x2[ok] <- swap[ok]
> df1
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
It can be done with dplyr.
library(dplyr)
full_join(df1,df2,by = c("x1" = "x1")) %>%
transmute(x1 = x1,x2 = coalesce(x2.y,x2.x))
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
new here, but using the following dplyr approach seems to work as well
similar but slightly different to one of the answers above
df3 <- anti_join(df1, df2, by = "x1")
df3 <- rbind(df3, df2)
df3
As of dplyr 1.0.0 there is a function specifically for this:
library(dplyr)
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)
rows_update(df1, df2, by = "x1")
See https://stackoverflow.com/a/65254214/2738526

Resources