R - "find and replace" integers in a column with character labels - r

I have two data frames. The first (DF1) is similar to this:
Ba Ram You Sheep
30 1 33.2 120.9
27 3 22.1 121.2
22 4 39.1 99.1
11 1 20.0 101.6
9 3 9.8 784.3
The second (DF2) contains titles for column "Ram":
V1 V2
1 RED
2 GRN
3 YLW
4 BLU
I need to replace the values in DF1$Ram with the corresponding character strings from DF2$V2:
Ba Ram You Sheep
30 RED 33.2 120.9
27 YLW 22.1 121.2
22 BLU 39.1 99.1
11 RED 20.0 101.6
9 YLW 9.8 784.3
I can do this with a nested for loop, but it feels REALLY inefficient:
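x <- seq_len(nrow(DF1)) # row indices of DF1
y <- seq_len(nrow(DF2)) # row indices of the lookup table DF2
for (i in x) {
  for (j in y) {
    # compare the code in row i with the j-th code of the lookup table
    if (DF1$Ram[i] == DF2$V1[j]) {
      DF1$Ram[i] <- DF2$V2[j]
    }
  }
}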
Is there a way to do this more efficiently??!?! I know there is. I'm a noob.

Use merge:
> result <- merge(DF1, DF2, by.x="Ram", by.y="V1")[,-1] # merging data.frames
> colnames(result)[4] <- "Ram" # setting name
The following is just for getting the output in the order you showed us
> result[order(result$Ba, decreasing = TRUE), c("Ba", "Ram", "You", "Sheep")]
Ba Ram You Sheep
1 30 RED 33.2 120.9
3 27 YLW 22.1 121.2
5 22 BLU 39.1 99.1
2 11 RED 20.0 101.6
4 9 YLW 9.8 784.3
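If the original row order matters, a lookup with match() avoids the re-sorting step. This is a different technique from the merge above, shown as a minimal sketch; DF1 and DF2 are rebuilt here from the tables in the question, and stringsAsFactors = FALSE is set only so the labels stay character on older R versions.

```r
# Rebuild the two data frames from the question
DF1 <- data.frame(Ba    = c(30, 27, 22, 11, 9),
                  Ram   = c(1, 3, 4, 1, 3),
                  You   = c(33.2, 22.1, 39.1, 20.0, 9.8),
                  Sheep = c(120.9, 121.2, 99.1, 101.6, 784.3))
DF2 <- data.frame(V1 = 1:4,
                  V2 = c("RED", "GRN", "YLW", "BLU"),
                  stringsAsFactors = FALSE)

# match() returns, for each DF1$Ram value, its position in DF2$V1;
# indexing DF2$V2 with those positions yields the labels in DF1's row order
DF1$Ram <- DF2$V2[match(DF1$Ram, DF2$V1)]
DF1
```

Codes that do not appear in DF2$V1 would come back as NA, which makes unmatched values easy to spot.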

Usually, when you encode character strings as integers, you probably want a factor. Factors offer some benefits you can read about in the fine manual.
df1 <- data.frame(V2 = c(3,3,2,3,1))
df2 <- data.frame(V1=1:4, V2=c('a','b','c','d'))
df1 <- within(df1, {
f <- factor(df1$V2, levels=df2$V1, labels=df2$V2)
aschar <- as.character(f)
asnum <- as.numeric(f)
})
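Applied to the data from the question, the same idea might look like this. It is only a sketch: the Ram codes and the lookup table are copied from the question, and f is a name made up for illustration.

```r
# Codes from DF1$Ram and the lookup table from the question
DF1 <- data.frame(Ram = c(1, 3, 4, 1, 3))
DF2 <- data.frame(V1 = 1:4, V2 = c("RED", "GRN", "YLW", "BLU"))

# levels = the integer codes, labels = the strings they stand for
f <- factor(DF1$Ram, levels = DF2$V1, labels = DF2$V2)

as.character(f)  # the labels as plain strings
as.integer(f)    # the position of each label within levels(f)
levels(f)        # all four labels, including the unused "GRN"
```

Keeping the column as a factor preserves the full set of levels even when some labels never occur in the data.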


Follow-up: Extracting a data.frame from two lists in BASE R

I'm following up on this answer. The output in that answer works great (see below):
tlist.mpre tlist.sdpre tlist.n clist.mpre clist.sdpre clist.n
Dlsk_Krlr.102 81.6 10.8 73 80.5 11.2 80
Dlsk_Krlr.103 85.7 13.7 66 90.3 6.6 74
Dlsk_Krlr.104 81.4 10.9 72 80.5 11.2 80
Dlsk_Krlr.105 90.4 8.2 61 90.3 6.6 74
However, I want to extract and add two columns named nms = c('time_wk','treats') from data to the above output. But the new output (see end of R code below) gets scrambled when I add nms to it.
Given my reproducible R code below is there a fix?
data <- read.csv("https://raw.githubusercontent.com/rnorouzian/m2/main/q.csv")
nms = c('time_wk','treats')
m = split(data, data$study.name)
(mm = m["Dlsk_Krlr"])
(input <- lapply(mm, function(i)
rev(expand.grid(post = unique(i$post),outcome = unique(i$outcome)))))
res <- setNames(lapply(1:0, function(i) lapply(input, function(inp) Map(function(o, p)
do.call(rbind, lapply(mm, function(x)
x[x$control == i & x$post == p & x$outcome == o, , drop = FALSE])),
inp$outcome, inp$post))), c("clist", "tlist"))
(aa = setNames(lapply(seq_along(res), function(i) Filter(NROW, res[[i]][[1]])), names(res)))
b <- lapply(aa, function(x) {
y <- do.call(rbind, x);
y[order(y$group), c("mpre", "sdpre", "n", nms)] }) ## I'm adding `nms` HERE but that scrambles the output below
cc = do.call(cbind,rev(b))
cc_1 = cc[!duplicated(cc),]
names(cc_1)[1:6] = c('mT','sdT','nT','mC','sdC','nC')
### NEW SCRAMBLED OUTPUT AFTER ADDING `nms`:
# mT sdT nT mC sdC nC clist.sdpre clist.n clist.time_wk clist.treats
#Dlsk_Krlr.102 81.6 10.8 73 1 2 80.5 11.2 80 1 2
#Dlsk_Krlr.103 85.7 13.7 66 1 2 90.3 6.6 74 1 2
#Dlsk_Krlr.104 81.4 10.9 72 1 2 80.5 11.2 80 1 2
#Dlsk_Krlr.105 90.4 8.2 61 1 2 90.3 6.6 74 1 2
Here is the base R solution:
cc[,grep(sprintf("clist.(%s)", paste0(nms, collapse="|")), names(cc), invert = TRUE)]
EDIT:
To ensure the nms columns come last:
d<-cc[grep(sprintf("clist.(%s)", paste0(nms, collapse="|")), names(cc), invert = TRUE)]
i1 <- grepl(sprintf("(%s)", paste0(nms, collapse="|")), names(d))
cbind(d[!i1], d[i1])
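To see what the pattern selects without downloading the data, here is a small sketch on a plain character vector. The column names in cols are an assumption, inferred from the outputs shown in the question.

```r
nms <- c("time_wk", "treats")

# Assumed column names of cc after adding nms (inferred from the question's output)
cols <- c("tlist.mpre", "tlist.sdpre", "tlist.n", "tlist.time_wk", "tlist.treats",
          "clist.mpre", "clist.sdpre", "clist.n", "clist.time_wk", "clist.treats")

# Pattern that matches the duplicated clist copies of the nms columns
pattern <- sprintf("clist.(%s)", paste0(nms, collapse = "|"))

# invert = TRUE keeps everything that does NOT match the pattern
keep <- grep(pattern, cols, invert = TRUE, value = TRUE)

# Move the remaining nms columns to the end
is_nms <- grepl(sprintf("(%s)", paste0(nms, collapse = "|")), keep)
reordered <- c(keep[!is_nms], keep[is_nms])
reordered
```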

groupLabels not shown when using dendextend colour_branches

The workflow I want to implement is:
dm <- dist(data)
dend <- hclust(dm)
k <- stats::cutree(dend, k = 10)
data$clusters <- k
plot(hclust, colorBranchees = k) #???? What I can use here.
So I searched for color dendrogram branches using cutree output. All I found is dendextend.
Problem is that I am failing to implement the workflow with dendextend.
This is what I came up with, but I would now like to have clusterLabels shown
library(dendextend)
hc <- hclust(dist(USArrests))
dend <- as.dendrogram(hc)
kcl <- dendextend::cutree(dend, k = 4)
dend1 <- color_branches(dend, clusters = kcl[order.dendrogram(dend)], groupLabels = TRUE)%>% set("labels_cex", 1)
plot(dend1, main = "Dendrogram dist JK")
Also, trying something like groupLabels = 1:4 does not help.
Specifying the number of clusters with the parameter k does make groupLabels work. Unfortunately, the labels then differ from those generated by dendextend's own cutree method.
Note that here cluster 4 has 2 members.
> table(kcl)
kcl
1 2 3 4
14 14 20 2
This post suggests using dendextend::cutree(dend, k = nrCluster, order_clusters_as_data = FALSE):
r dendrogram - groupLabels not match real labels (package dendextend)
But then I cannot use the output of dendextend::cutree to group the data, since the ordering does not match.
I would be happy to use a different dendrogram plotting library in R but so far my Web searches for "coloring dendrogram branches by cutree output" point to the dendextend package.
I'm sorry but I'm not sure I fully understand your question.
It seems like you want to align cutree's output with your original data.
If that's the case, then you need to use dendextend::cutree(dend,k = nrCluster, order_clusters_as_data = TRUE) e.g.:
require(dendextend)
d1 <- USArrests[1:10,]
hc <- hclust(dist(d1))
dend <- as.dendrogram(hc)
k <- dendextend::cutree(dend, k = 3, order_clusters_as_data = TRUE)
d2 <- cbind(d1, k)
plot(color_branches(dend, 3))
d2
# an easier way to see the clusters is by ordering the rows of the data based on the order of the dendrogram
d2[order.dendrogram(dend),]
The plot is fine:
And the clusters are mapped correctly to the data (see outputs)
> require(dendextend)
> d1 <- USArrests[1:10,]
> hc <- hclust(dist(d1))
> dend <- as.dendrogram(hc)
> k <- dendextend::cutree(dend, k = 3, order_clusters_as_data = TRUE)
> d2 <- cbind(d1, k)
> plot(color_branches(dend, 3))
> d2
Murder Assault UrbanPop Rape k
Alabama 13.2 236 58 21.2 1
Alaska 10.0 263 48 44.5 1
Arizona 8.1 294 80 31.0 2
Arkansas 8.8 190 50 19.5 1
California 9.0 276 91 40.6 2
Colorado 7.9 204 78 38.7 1
Connecticut 3.3 110 77 11.1 3
Delaware 5.9 238 72 15.8 1
Florida 15.4 335 80 31.9 2
Georgia 17.4 211 60 25.8 1
> # an easier way to see the clusters is by ordering the rows of the data based on the order of the dendrogram
> d2[order.dendrogram(dend),]
Murder Assault UrbanPop Rape k
Connecticut 3.3 110 77 11.1 3
Florida 15.4 335 80 31.9 2
Arizona 8.1 294 80 31.0 2
California 9.0 276 91 40.6 2
Arkansas 8.8 190 50 19.5 1
Colorado 7.9 204 78 38.7 1
Georgia 17.4 211 60 25.8 1
Alaska 10.0 263 48 44.5 1
Alabama 13.2 236 58 21.2 1
Delaware 5.9 238 72 15.8 1
Please LMK if this answers your question or if you have followup questions here.
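The data-alignment part can also be checked with base R alone. The sketch below uses stats::cutree on the hclust object rather than dendextend, so it says nothing about the branch colours or group labels; it only shows that the memberships come back in data order and can be reordered to follow the plot.

```r
hc <- hclust(dist(USArrests))

# stats::cutree returns one named membership per observation, in data order
k <- stats::cutree(hc, k = 4)

# Because the order matches the rows, the memberships can be attached directly
clustered <- cbind(USArrests, cluster = k)

# hc$order gives the left-to-right leaf order of the plotted tree
in_plot_order <- clustered[hc$order, ]
head(in_plot_order)
```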

Select rows based on multi-column attributes in R

I have to merge two datasets (spatial) for which I need to keep the row (polyline) with the most information (i.e. the longest line).
I can select the rows with the same ID as the other dataframe or not select rows with the same ID (see sample below). And reverse that operation. But I can't figure out how to choose the row of the two dataframes that has the bigger length value.
#set up sample data
sample = data.frame(x=c(1:5), length=c(1.2,1.3,1.5,7.2,36.1), ID_obj=c("a3", "4a", "5b", "8b", "a7"))
sample2 = data.frame(x=c(1:5), length=c(15.1,1.3,1.5,17.2,6.1), ID_obj=c("a3", "k6", "9c", "8b", "a7"))
#select the rows with identical values
sample[sample$ID_obj %in% sample2$ID_obj,]
sample2[sample2$ID_obj %in% sample$ID_obj,]
#select rows without duplicates in ID
'%!in%' <- function(x,y)!('%in%'(x,y))
sample[sample$ID_obj %!in% sample2$ID_obj,]
sample2[sample2$ID_obj %!in% sample$ID_obj,]
error<-rbind(sample, sample2[sample2$ID_obj %!in% sample$ID_obj,])
# x length ID_obj
#1 1 1.2 a3
#2 2 1.3 4a
#3 3 1.5 5b
#4 4 7.2 8b#keep 8b from the first set should not have been kept because length is shorter
#5 5 36.1 a7
#21 2 1.3 k6
#31 3 1.5 9c
#this is the result I want to get automatically
final<-rbind(sample[c(2, 3, 5),], sample2[c(1, 2, 3, 4),])#
# x length ID_obj
#2 2 1.3 4a
#3 3 1.5 5b
#5 5 36.1 a7#keep a7 from the first set because length is longer
#1 1 15.1 a3
#21 2 1.3 k6
#31 3 1.5 9c
#4 4 17.2 8b#keep 8b from the second set because length is longer
Use the data.table package for a simplified syntax (and better performance than data.frame):
sample = data.frame(x=c(1:5), length=c(1.2,1.3,1.5,7.2,36.1), ID_obj=c("a3", "4a", "5b", "8b", "a7"))
sample2 = data.frame(x=c(1:5), length=c(15.1,1.3,1.5,17.2,6.1), ID_obj=c("a3", "k6", "9c", "8b", "a7"))
library(data.table)
setDT(sample) # convert data.frame to data.table "in-place"
setDT(sample2)
x <- rbind(sample, sample2) # combine rows vertically
setorder(x, -length) # order by length descending
x[, head(.SD, 1), by = ID_obj] # output the first row ("head") per ID_obj group
To get the result (in a different order than your expected result):
ID_obj x length
1: a7 5 36.1
2: 8b 4 17.2
3: a3 1 15.1
4: 5b 3 1.5
5: 9c 3 1.5
6: 4a 2 1.3
7: k6 2 1.3
A bit more cryptic with base functions, but just as an exercise:
x <- rbind(sample, sample2)
x <- x[order(x$length), ]
x <- do.call(rbind, lapply(split(x, x$ID_obj), tail, n=1))
x
# x length ID_obj
# 4a 2 1.3 4a
# 5b 3 1.5 5b
# 8b 4 17.2 8b
# a3 1 15.1 a3
# a7 5 36.1 a7
# 9c 3 1.5 9c
# k6 2 1.3 k6
Add rownames(x) <- NULL if you don't want to use ID_obj as row names.
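Another base R variant, sketched here as an alternative to the split/lapply version, sorts by length in decreasing order and keeps the first row seen for each ID with duplicated(). The names s1, s2, both and longest are made up for this example (s1 and s2 hold the question's sample data, renamed so that base::sample is not masked).

```r
# Sample data from the question
s1 <- data.frame(x = 1:5, length = c(1.2, 1.3, 1.5, 7.2, 36.1),
                 ID_obj = c("a3", "4a", "5b", "8b", "a7"))
s2 <- data.frame(x = 1:5, length = c(15.1, 1.3, 1.5, 17.2, 6.1),
                 ID_obj = c("a3", "k6", "9c", "8b", "a7"))

both <- rbind(s1, s2)

# Longest lines first, so the first occurrence of each ID is the longest one
both <- both[order(both$length, decreasing = TRUE), ]

# duplicated() flags every later occurrence of an ID; drop those rows
longest <- both[!duplicated(both$ID_obj), ]
rownames(longest) <- NULL
longest
```

When two rows with the same ID have equal length, this keeps whichever one order() places first.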

Outputting percentiles by filtering a data frame

Note that, as requested in the comments, this question has been revised.
Consider the following example:
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
I would like to, for each value of FILTER, create a data frame which contains the 1st, 2nd, ..., 99th percentiles of VALUE. The final product should be
PERCENTILE df_1 df_2 ... df_10
1 [first percentiles]
2 [second percentiles]
etc., where df_i is based on FILTER == i.
Note that FILTER, although it contains numbers, is actually categorical.
The way I have been doing this is by using dplyr:
nums <- 1:10
library(dplyr)
for (i in nums){
df_temp <- filter(df, FILTER == i)$VALUE
assign(paste0("df_", i), quantile(df_temp, probs = (1:99)/100))
}
and then I would have to cbind these (with 1:99 in the first column), but I would rather not type in every single df name. I have considered using a loop on the names of these data frames, but this would involve using eval(parse()).
Here's a basic outline of a possibly smoother approach. I have not included every single aspect of your desired output, but the modification should be fairly straightforward.
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
df_s <- lapply(split(df,df$FILTER),
FUN = function(x) quantile(x$VALUE,probs = c(0.25,0.5,0.75)))
out <- do.call(cbind,df_s)
colnames(out) <- paste0("df_",colnames(out))
> out
df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
25% 3.25 13.25 23.25 33.25 43.25 53.25 63.25 73.25 83.25 93.25
50% 5.50 15.50 25.50 35.50 45.50 55.50 65.50 75.50 85.50 95.50
75% 7.75 17.75 27.75 37.75 47.75 57.75 67.75 77.75 87.75 97.75
I did this for just 3 quantiles to keep things simple, but it obviously extends. And you can add the 1:99 column afterwards as well.
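Extending this to all 99 percentiles with the PERCENTILE column in front could look roughly like the following. It is a sketch along the same lines, using sapply on the split VALUE vectors; probs, out and result are names chosen for illustration.

```r
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)

probs <- (1:99) / 100

# One column of 99 percentiles per FILTER level
out <- sapply(split(df$VALUE, df$FILTER), quantile, probs = probs)
colnames(out) <- paste0("df_", colnames(out))

# Put the percentile index in the first column
result <- data.frame(PERCENTILE = 1:99, out, row.names = NULL)
head(result, 3)
```

Because split() treats FILTER as a grouping variable, it does not matter that the values happen to be numeric.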
I suggest that you use a list.
list_of_dfs <- list()
nums <- 1:10
for (i in nums){
list_of_dfs[[i]] <- nums*i
}
df <- data.frame(list_of_dfs[[1]])
df <- do.call("cbind",args=list(df,list_of_dfs))
colnames(df) <- paste0("df_",1:10)
You'll get the result you want:
df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
1 1 2 3 4 5 6 7 8 9 10
2 2 4 6 8 10 12 14 16 18 20
3 3 6 9 12 15 18 21 24 27 30
4 4 8 12 16 20 24 28 32 36 40
5 5 10 15 20 25 30 35 40 45 50
6 6 12 18 24 30 36 42 48 54 60
7 7 14 21 28 35 42 49 56 63 70
8 8 16 24 32 40 48 56 64 72 80
9 9 18 27 36 45 54 63 72 81 90
10 10 20 30 40 50 60 70 80 90 100
How about using get?
df <- data.frame(1:10)
for (i in nums) {
df <- cbind(df, get(paste0("df_", i)))
}
# get rid of first useless column
df <- df[, -1]
# get names
names(df) <- paste0("df_", nums)
df

R equivalent of Stata's for-loop over local macro list of stubnames

I'm a Stata user that's transitioning to R and there's one Stata crutch that I find hard to give up. This is because I don't know how to do the equivalent with R's "apply" functions.
In Stata, I often generate a local macro list of stubnames and then loop over that list, calling on variables whose names are built off of those stubnames.
For a simple example, imagine that I have the following dataset:
study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3
and so on...
I want to generate two new variables, varX and varY that take on the values of varX06 and varY06 respectively when year is 6, varX07 and varY07 respectively when year is 7, and varX08 and varY08 respectively when year is 8.
The final dataset should look like this:
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
1 6 50 40 30 20.5 19.8 17.4 50 20.5
1 7 50 40 30 20.5 19.8 17.4 40 19.8
1 8 50 40 30 20.5 19.8 17.4 30 17.4
2 6 60 55 44 25.1 25.2 25.3 60 25.1
2 7 60 55 44 25.1 25.2 25.3 55 25.2
2 8 60 55 44 25.1 25.2 25.3 44 25.3
and so on...
To clarify, I know that I can do this with melt and reshape commands - essentially converting this data from wide to long format, but I don't want to resort to that. That's not the intent of my question.
My question is about how to loop over a local macro list of stubnames in R and I'm just using this simple example to illustrate a more generic dilemma.
In Stata, I could generate a local macro list of stubnames:
local stub varX varY
And then loop over the macro list. I can generate a new variable varX or varY and replace the new variable value with the value of varX06 or varY06 (respectively) if year is 6 and so on.
foreach i of local stub {
display "`i'"
gen `i'=.
replace `i'=`i'06 if year==6
replace `i'=`i'07 if year==7
replace `i'=`i'08 if year==8
}
The last section is the section that I find hardest to replicate in R. When I write `i'06, Stata takes the string "varX", concatenates it with the string "06" and then returns the value of the variable varX06. Additionally, when I write `i', Stata returns the string "varX" and not the string "`i'".
How do I do these things with R?
I've searched through Muenchen's "R for Stata Users", googled the web, and searched through previous posts here at StackOverflow but haven't been able to find an R solution.
I apologize if this question is elementary. If it's been answered before, please direct me to the response.
Thanks in advance,
Tara
Well, here's one way. Columns in R data frames can be accessed using their character names, so this will work:
# create sample dataset
set.seed(1) # for reproducible example
df <- data.frame(year=as.factor(rep(6:8,each=100)), #categorical variable
varX06 = rnorm(300), varX07=rnorm(300), varX08=rnorm(100),
varY06 = rnorm(300), varY07=rnorm(300), varY08=rnorm(100))
# you start here...
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
print(head(df),digits=4)
# year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
# 1 6 -0.6265 0.8937 -0.3411 -0.70757 1.1350 0.3412 -0.6265 -0.70757
# 2 6 0.1836 -1.0473 1.5024 1.97157 1.1119 1.3162 0.1836 1.97157
# 3 6 -0.8356 1.9713 0.5283 -0.09000 -0.8708 -0.9598 -0.8356 -0.09000
# 4 6 1.5953 -0.3836 0.5422 -0.01402 0.2107 -1.2056 1.5953 -0.01402
# 5 6 0.3295 1.6541 -0.1367 -1.12346 0.0694 1.5676 0.3295 -1.12346
# 6 6 -0.8205 1.5122 -1.1367 -1.34413 -1.6626 0.2253 -0.8205 -1.34413
For a given yr, the anonymous function extracts the rows with that yr and the column named "varX0" + yr (the result of paste0(...)). Then lapply(...) "applies" this function for each year, and unlist(...) converts the returned list into a vector.
Maybe a more transparent way:
sub <- c("varX", "varY")
for (i in sub) {
df[[i]] <- NA
df[[i]] <- ifelse(df[["year"]] == 6, df[[paste0(i, "06")]], df[[i]])
df[[i]] <- ifelse(df[["year"]] == 7, df[[paste0(i, "07")]], df[[i]])
df[[i]] <- ifelse(df[["year"]] == 8, df[[paste0(i, "08")]], df[[i]])
}
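If the year suffix should not be hard-coded either, the source column name can be built per row from the stub and the year. This is only a sketch of that idea, run on the dataset from the question; stubs, rows and src are names invented for the example, and it assumes year holds whole numbers that map to two-digit suffixes.

```r
d <- read.table(header = TRUE, text = "study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3")

stubs <- c("varX", "varY")   # plays the role of the Stata local macro list
rows  <- seq_len(nrow(d))

for (s in stubs) {
  # Name of the source column for every row, e.g. "varX06" where year is 6
  src <- sprintf("%s%02d", s, d$year)
  # For each row, pick the value from that row's own source column
  d[[s]] <- vapply(rows, function(r) d[[src[r]]][r], numeric(1))
}

d[, c("study_id", "year", stubs)]
```

The double brackets in d[[s]] are what let a character string stand in for a column name, much as `i' does in the Stata loop.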
This method reorders your data, but involves a one-liner, which may or may not be better for you (assume d is your dataframe):
> do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varY varX
6.1 1 6 50 40 30 20.5 19.8 17.4 20.5 50
6.4 2 6 60 55 44 25.1 25.2 25.3 25.1 60
7.2 1 7 50 40 30 20.5 19.8 17.4 19.8 40
7.5 2 7 60 55 44 25.1 25.2 25.3 25.2 55
8.3 1 8 50 40 30 20.5 19.8 17.4 17.4 30
8.6 2 8 60 55 44 25.1 25.2 25.3 25.3 44
Essentially, it splits the data based on year, then uses within to create the varX and varY variables within each subset, and then rbind's the subsets back together.
A direct translation of your Stata code, however, would be something like the following:
u <- unique(d$year)
for(i in seq_along(u)){
d$varX <- ifelse(d$year == 6, d$varX06, ifelse(d$year == 7, d$varX07, ifelse(d$year == 8, d$varX08, NA)))
d$varY <- ifelse(d$year == 6, d$varY06, ifelse(d$year == 7, d$varY07, ifelse(d$year == 8, d$varY08, NA)))
}
Here's another option.
Create a 'column selection matrix' based on year, then use that to grab the values you want from any block of columns.
# indexing matrix based on the 'year' column
col_select_mat <-
t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat])
This gives the desired result (which you can cbind to your_df if you like)
varX varY
[1,] 50 20.5
[2,] 60 25.1
[3,] 40 19.8
[4,] 55 25.2
[5,] 30 17.4
[6,] 44 25.3
OP's dataset:
your_df <- read.table(header=T, text=
'study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3')
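Benchmarking: I compared the three posted solutions. Note that the microbenchmark call shown passes the function names without parentheses, so it times only the lookup of each name (hence the nanosecond figures) rather than the solutions themselves; the timings therefore do not rank the approaches. To time the actual work, call the functions: microbenchmark(arvi1000(), jlhoward(), Thomas()).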
df <- your_df
d <- your_df
arvi1000 <- function() {
col_select_mat <- t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
cbind(your_df,
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat]))
}
jlhoward <- function() {
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
}
Thomas <- function() {
do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
}
> microbenchmark(arvi1000, jlhoward, Thomas)
Unit: nanoseconds
expr min lq mean median uq max neval
arvi1000 37 39 43.73 40 42 380 100
jlhoward 38 40 46.35 41 42 377 100
Thomas 37 40 56.99 41 42 1590 100
