Changing multiple column names, pasting at the beginning/end of column names - r

I have a very easy question, but I am struggling a bit with it as I am not good with string manipulation.
I have a dataset that looks something like this:
df <- data.frame(id= c(1,1,1,2,2,2,3,3,3), time=c(1,2,3,1,2,3,1,2,3),y = rnorm(9), x1 = rnorm(9), x2 = c(0,0,0,0,1,0,1,1,1),c2 = rnorm(9))
df
# id time y x1 x2 c2
# 1 1 1 0.2849573 -2.0675484 0 -0.07262881
# 2 1 2 0.7790181 -0.7575962 0 -0.58792408
# 3 1 3 1.5612293 0.6249859 0 1.19410761
# 4 2 1 0.5001897 3.4156129 0 -0.03577452
# 5 2 2 0.7155184 -0.5672982 1 -1.22208675
# 6 2 3 0.5086272 -0.7848763 0 -0.41084467
# 7 3 1 -0.4707959 0.1159467 1 0.77233201
# 8 3 2 0.8641184 0.2498162 1 0.49336869
# 9 3 3 1.3348043 -0.6803672 1 -0.33189217
I would simply like to change all the column names from x1 onwards by adding an "_o" suffix. The final dataset should look like this:
final
# id time y x1_o x2_o c2_o
# 1 1 1 1.1251762 -0.7191008 0 -0.07478527
# 2 1 2 0.7585758 1.8694635 0 -0.42652822
# 3 1 3 -1.3180201 -0.4336776 0 0.38417779
# 4 2 1 1.7335904 2.2968254 0 -0.35639828
# 5 2 2 0.1506950 -0.5481873 1 -0.38523601
# 6 2 3 -1.9475207 -0.5302951 0 0.21721675
# 7 3 1 -0.1024133 -0.2872962 1 -0.06347213
# 8 3 2 0.1316069 0.1463118 1 -0.19518602
# 9 3 3 -1.1037682 -0.1129085 1 -0.24011278
I am able to change the column names one by one, but I would like to find a one-liner command.
I have tried this, but it only pastes at the beginning:
dp_o <- df %>% rename_at(3:5, ~paste("_o", .))
It is probably just a variation of the code above, but I am struggling to see which one, given that I do not understand string manipulation well.
Thanks in advance.

We need the _o at the end, since paste0() concatenates its arguments from left to right and not the reverse. Note also that x1 is the 4th column of df, so the positions to rename are 4:6:
library(dplyr)
df %>%
  rename_at(4:6, ~ paste0(., "_o"))
# id time y x1_o x2_o c2_o
#1 1 1 0.62714872 -0.70259726 0 0.4386072
#2 1 2 -0.53052052 -0.37854004 0 1.8857944
#3 1 3 -0.97729791 0.70909984 0 0.3611839
#4 2 1 -0.31016711 -1.12787900 0 0.9684549
#5 2 2 -1.91335148 -1.84690443 1 -0.1196826
#6 2 3 -0.03967186 0.21916880 0 0.6295054
#7 3 1 1.18847857 -0.75449457 1 -1.4622606
#8 3 2 0.81352527 -0.44126036 1 0.8604688
#9 3 3 1.92443154 -0.04599181 1 -0.9240210
Or if we need to match the column name:
df %>%
  rename_at(vars(match('x1', names(.)):ncol(.)), ~ paste0(., '_o'))
Or with str_c:
library(stringr)
df %>%
  rename_at(vars(x1:c2), ~ str_c(., '_o'))
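With dplyr 1.0.0 or later, where the scoped rename_at() is superseded, the same thing can be written with rename_with() (a minimal sketch, assuming the columns to rename run contiguously from x1 to c2):
library(dplyr)
df %>%
  rename_with(~ paste0(.x, "_o"), x1:c2)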

If you don't mind using the data.table package, the following would work
library(data.table)
setDT(df)
old <- colnames(df)[which(colnames(df) == "x1"):ncol(df)]
new <- paste(old, "o", sep = "_")
setnames(df, old, new)
df[]
## id time y x1_o x2_o c2_o
## 1 1 1 -1.5612344 0.9711583 0 -1.08198269
## 2 1 2 0.8090729 -0.9474716 0 -0.21020803
## 3 1 3 0.8070253 0.9765167 0 2.13507943
## 4 2 1 0.7446732 -0.2459540 0 0.64870743
## 5 2 2 -1.1853776 -0.3828339 1 -0.09298909
## 6 2 3 0.5057534 0.5822639 0 0.79730587
## 7 3 1 -0.3655794 -0.1628970 1 -0.57866153
## 8 3 2 -1.3465086 1.1107107 1 1.11290979
## 9 3 3 -0.8271092 -0.4105378 1 0.88522610

With base R, you can do it with the following code:
names(df)[-(1:3)] <- paste0(names(df)[-(1:3)],"_o")
which gives:
> df
id time y x1_o x2_o c2_o
1 1 1 -1.1861828 -0.97027842 0 1.8556257
2 1 2 1.1964478 0.48936940 0 -0.2144602
3 1 3 -1.1164802 0.03258791 0 -1.7737551
4 2 1 0.4940969 -1.31300219 0 0.1865097
5 2 2 -0.8735071 -1.01195060 1 0.6515702
6 2 3 0.1749421 0.27409115 0 -1.2432389
7 3 1 1.8849013 0.92642054 1 0.9861089
8 3 2 -0.3765072 -1.15343868 1 0.8451167
9 3 3 -0.2033892 1.66717960 1 -0.1480590
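If the column positions might change, a base R variant can locate the block by name instead of hard-coding -(1:3) (a small sketch, assuming a fresh copy of the original df and that everything from x1 to the last column should get the suffix):
cols <- which(names(df) == "x1"):ncol(df)   # positions of x1 through the last column
names(df)[cols] <- paste0(names(df)[cols], "_o")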

Related

nested for loop in R, where the second index counts inside the first one

I have, for example, a dataset like this:
data <- data.frame(matrix(c(1,2,2,3,4,5,5,"a","a","b","a","a","a","b"), nrow = 7, ncol = 2, byrow = F))
X1 X2
1 a
2 a
2 b
3 a
4 a
5 a
5 b
Then I add another variable, "tag", initially set to 0.
data$tag <- 0
X1 X2 tag
1 a 0
2 a 0
2 b 0
3 a 0
4 a 0
5 a 0
5 b 0
I'd like "tag" to equal 1 for each row whose X1 value is repeated, like:
X1 X2 tag
1 a 0
2 a 1
2 b 1
3 a 0
4 a 0
5 a 1
5 b 1
I used the following code:
for (i in data$X1) {
  for (j in 1:length(data$X1)) {
    if (j==2) {data$tag[j] <- 1}
  }
}
but it doesn't work the way I would like. I'd like the second loop (j) to run inside the first one, with j starting from 1 every time X1 changes, in order to obtain what I want.
How can I manage it?
Thanks a lot
Maybe you can try ave:
within(
  data,
  tag <- +(ave(X1, X1, FUN = length) > 1)
)
which gives
X1 X2 tag
1 1 a 0
2 2 a 1
3 2 b 1
4 3 a 0
5 4 a 0
6 5 a 1
7 5 b 1
You can use duplicated from both ends in base R:
data$tag <- as.integer(duplicated(data$X1) |
                         duplicated(data$X1, fromLast = TRUE))
data
# X1 X2 tag
#1 1 a 0
#2 2 a 1
#3 2 b 1
#4 3 a 0
#5 4 a 0
#6 5 a 1
#7 5 b 1
An option with add_count:
library(dplyr)
data %>%
  add_count(X1) %>%
  mutate(n = +(n > 1))
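Equivalently, the grouped count behind add_count can be computed directly with group_by()/n() (a minimal sketch of the same idea, writing the result into tag):
library(dplyr)
data %>%
  group_by(X1) %>%
  mutate(tag = +(n() > 1)) %>%   # 1 if the X1 value occurs more than once
  ungroup()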

sum of each pair of columns in R

I have a dataframe like this:
dd<-data.frame(col1=c(1,0,1),col2=c(1,1,1),col3=c(1,0,0),col4=c(1,0,1,0,1,0,1,0))
And I would like to have the sum of each pair of columns, like:
col1+col2 col1+col3 col1+col4 col2+col3 col2+col4 col3+col4
2 2 2 2 2 2
1 1 1 1 1 0
1 1 2 1 2 1
2 1 1 1 1 0
I didn't find any function that does that.
Please help me.
One base R option might be combn + rowSums:
setNames(
  as.data.frame(combn(dd, 2, rowSums)),
  combn(names(dd), 2, paste0, collapse = "+")
)
which gives
col1+col2 col1+col3 col1+col4 col2+col3 col2+col4 col3+col4
1 2 2 2 2 2 2
2 1 0 0 1 1 0
3 2 1 2 1 2 1
Data
dd<-data.frame(col1=c(1,0,1),col2=c(1,1,1),col3=c(1,0,0),col4=c(1,0,1))
One dplyr and purrr possibility could be:
library(dplyr)
library(purrr)
map_dfc(.x = combn(names(dd), 2, simplify = FALSE),
        ~ dd %>%
            rowwise() %>%
            transmute(!!paste(.x, collapse = "+") := sum(c_across(all_of(.x)))))
`col1+col2` `col1+col3` `col1+col4` `col2+col3` `col2+col4` `col3+col4`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 2 2 2 2 2
2 1 0 0 1 1 0
3 2 1 2 1 2 1
Uglier, and slower than the base R above:
do.call("cbind", setNames(Map(function(i) dd[, i] + dd[, -(1:i)],
                              seq_along(dd)[1:(ncol(dd) - 1)]),
                          names(dd)[1:(ncol(dd) - 1)]))
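For what it's worth, a vectorised base R sketch along the same lines indexes both columns of every pair at once and adds the two sub-data-frames element-wise (using the 3-row dd from the Data block above):
idx <- combn(seq_along(dd), 2)            # 2 x 6 matrix of column index pairs
out <- dd[, idx[1, ]] + dd[, idx[2, ]]    # element-wise sums, one column per pair
names(out) <- paste(names(dd)[idx[1, ]], names(dd)[idx[2, ]], sep = "+")
out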

Cumulative sum for positive numbers only [duplicate]

This question already has answers here:
Create counter within consecutive runs of certain values
(6 answers)
Closed 1 year ago.
I have this vector :
x = c(1,1,1,1,1,0,1,0,0,0,1,1)
And I want to do a cumulative sum for the positive numbers only. I should have the following vector in return:
xc = c(1,2,3,4,5,0,1,0,0,0,1,2)
How could I do it?
I've tried cumsum(x), but that does the cumulative sum over all values and gives:
cumsum(x)
[1] 1 2 3 4 5 5 6 6 6 6 7 8
One option is:
x1 <- inverse.rle(within.list(rle(x), values[!!values] <-
                                (cumsum(values))[!!values]))
x[x1 != 0] <- ave(x[x1 != 0], x1[x1 != 0], FUN = seq_along)
x
#[1] 1 2 3 4 5 0 1 0 0 0 1 2
Or a one-liner would be:
x[x>0] <- with(rle(x), sequence(lengths[!!values]))
x
#[1] 1 2 3 4 5 0 1 0 0 0 1 2
Here's a possible solution using data.table v >= 1.9.5 and its new rleid function:
library(data.table)
as.data.table(x)[, cumsum(x), rleid(x)]$V1
## [1] 1 2 3 4 5 0 1 0 0 0 1 2
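For reference, rleid() simply numbers consecutive runs of equal values, which is why the grouped cumsum restarts at every block of zeros:
rleid(x)
## [1] 1 1 1 1 1 2 3 4 4 4 5 5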
Base R, one-line solution with Map and Reduce:
> Reduce('c', Map(function(u,v) if(v==0) rep(0,u) else 1:u, rle(x)$lengths, rle(x)$values))
[1] 1 2 3 4 5 0 1 0 0 0 1 2
Or:
unlist(Map(function(u,v) if(v==0) rep(0,u) else 1:u, rle(x)$lengths, rle(x)$values))
x <- c(1,1,1,1,1,0,1,0,0,0,1,1)
cumsum_ <- function(x) {
  r <- rle(x)
  s <- split(x, rep(seq_along(r$values), r$lengths))
  return(unlist(sapply(s, cumsum), use.names = FALSE))
}
(xc <- cumsum_(x))
# [1] 1 2 3 4 5 0 1 0 0 0 1 2
I don't know much R, but I have written a small piece of code in Python. The logic remains the same in any language. Hope this will help you.
x = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1]
tot = 0
for i in range(0, len(x)):
    if x[i] != 0:
        tot = tot + x[i]
        x[i] = tot
    else:
        tot = 0
print(x)
x <- c(1,1,1,1,1,0,1,0,0,0,1,1)
skumulowana <- function(x) {
  dl <- length(x)
  xx <- numeric(dl + 1)
  for (i in 1:dl) {
    ifelse(x[i] == 0, xx[i + 1] <- 0, xx[i + 1] <- xx[i] + x[i])
  }
  wynik <<- xx[1:dl + 1]
  return(wynik)
}
skumulowana(x)
## [1] 1 2 3 4 5 0 1 0 0 0 1 2
Try this one-liner, which resets the running total whenever it meets a zero:
Reduce(function(x, y) (x + y) * (y != 0), x, accumulate = TRUE)
split and lapply version:
x <- c(1,1,1,1,1,0,1,0,0,0,1,1)
unlist(lapply(split(x, cumsum(x==0)), cumsum))
step by step:
a <- split(x, cumsum(x==0)) # divides x into pieces where each 0 starts a new piece
b <- lapply(a, cumsum) # calculates cumsum in each piece
unlist(b) # rejoins the pieces
Result has useless names but is otherwise what you wanted:
# 01 02 03 04 05 11 12 2 3 41 42 43
# 1 2 3 4 5 0 1 0 0 0 1 2
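The same split-at-every-zero idea can be written as a one-liner with ave, grouping on cumsum(x == 0) so that every block starting at a zero gets its own cumulative sum (a sketch; it also avoids the leftover names):
ave(x, cumsum(x == 0), FUN = cumsum)
## [1] 1 2 3 4 5 0 1 0 0 0 1 2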
Here is another base R solution using aggregate. The idea is to make a data frame with x and a new column named x.1 by which we can apply aggregate functions (cumsum in this case):
x <- c(1,1,1,1,1,0,1,0,0,0,1,1)
r <- rle(x)
df <- data.frame(x,
                 x.1 = unlist(sapply(1:length(r$lengths), function(i) rep(i, r$lengths[i]))))
# df
# x x.1
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 1
# 5 1 1
# 6 0 2
# 7 1 3
# 8 0 4
# 9 0 4
# 10 0 4
# 11 1 5
# 12 1 5
agg <- aggregate(df$x~df$x.1, df, cumsum)
as.vector(unlist(agg$`df$x`))
# [1] 1 2 3 4 5 0 1 0 0 0 1 2

Complex long to wide data transformation (with time-varying variable)

I am currently working on a Multistate Analysis dataset in "long" form (one row for each individual's observation; each individual is repeatedly measured up to 5 times).
The idea is that each individual can recurrently transition across the levels of the time-varying state variable s = 1, 2, 3, 4. All the other variables that I have (here cohort) are fixed within any given id.
After some analyses, I need to reshape the dataset in "wide" form, according to the specific sequence of visited states. Here is an example of the initial long data:
dat <- read.table(text = "
id cohort s
1 1 2
1 1 2
1 1 1
1 1 4
2 3 1
2 3 1
2 3 3
3 2 1
3 2 2
3 2 3
3 2 3
3 2 4",
header=TRUE)
The final "wide" dataset should take into account the specific individual sequence of visited states, recorded into the newly created variables s1, s2, s3, s4, s5, where s1 is the first state visited by the individual and so on.
According to the above example, the wide dataset looks like:
id cohort s1 s2 s3 s4 s5
1 1 2 2 1 4 0
2 3 1 1 3 0 0
3 2 1 2 3 3 4
I tried to use reshape(), and also to focus on transposing s, but without the intended result. Actually, my knowledge of the R functions is quite limited. Can you give any suggestions? Thanks.
EDIT: obtaining a different kind of wide dataset
Thank you all for your help; I have a related question, if I may. Especially when each individual is observed for a long time and there are few transitions across states, it is very useful to reshape the initial sample dat in this alternative way:
id cohort s1 s2 s3 s4 s5 dur1 dur2 dur3 dur4 dur5
1 1 2 1 4 0 0 2 1 1 0 0
2 3 1 3 0 0 0 2 1 0 0 0
3 2 1 2 3 4 0 1 1 2 1 0
In practice now s1-s5 are the distinct visited states, and dur1-dur5 the time spent in each respective distinct visited state.
Can you please give me a hand reaching this data structure? I believe it is necessary to create all the dur- and s- variables in an intermediate dataset before using reshape(). Or maybe it is possible to use reshape2 directly?
dat <- read.table(text = "
id cohort s
1 1 2
1 1 2
1 1 1
1 1 4
2 3 1
2 3 1
2 3 3
3 2 1
3 2 2
3 2 3
3 2 3
3 2 4",
header=TRUE)
df <- data.frame(
  dat,
  period = sequence(rle(dat$id)$lengths)
)
wide <- reshape(df, v.names = "s", idvar = c("id", "cohort"),
                timevar = "period", direction = "wide")
wide[is.na(wide)] <- 0
wide
Gives:
id cohort s.1 s.2 s.3 s.4 s.5
1 1 1 2 2 1 4 0
5 2 3 1 1 3 0 0
8 3 2 1 2 3 3 4
Then using the following line gives the desired names:
names(wide) <- c('id','cohort', paste('s', seq_along(1:5), sep=''))
# id cohort s1 s2 s3 s4 s5
# 1 1 1 2 2 1 4 0
# 5 2 3 1 1 3 0 0
# 8 3 2 1 2 3 3 4
If you use sep='' in the reshape() call, you do not have to rename the variables:
wide <- reshape(df, v.names = "s", idvar = c("id", "cohort"),
                timevar = "period", direction = "wide", sep='')
I suspect there are ways to avoid creating the period variable and avoid replacing NA directly in the wide statement, but I have not figured those out yet.
ok...
library(plyr)
library(reshape2)
dat2 <- ddply(dat, .(id, cohort), function(x)
  data.frame(s = x$s, name = paste0("s", seq_along(x$s))))
dat2 <- ddply(dat2, .(id, cohort), function(x)
  dcast(x, id + cohort ~ name, value.var = "s", fill = 0)
)
dat2[is.na(dat2)] <- 0
dat2
# id cohort s1 s2 s3 s4 s5
# 1 1 1 2 2 1 4 0
# 2 2 3 1 1 3 0 0
# 3 3 2 1 2 3 3 4
Does this seem right? I admit the first ddply is hardly elegant.
Try this:
library(reshape2)
dat$seq <- ave(dat$id, dat$id, FUN = function(x) paste0("s", seq_along(x)))
dat.s <- dcast(dat, id + cohort ~ seq, value.var = "s", fill = 0)
which gives this:
> dat.s
id cohort s1 s2 s3 s4 s5
1 1 1 2 2 1 4 0
2 2 3 1 1 3 0 0
3 3 2 1 2 3 3 4
If you did not mind using just 1, 2, ..., 5 as column names then you could shorten the ave line to just:
dat$seq <- ave(dat$id, dat$id, FUN = seq_along)
Regarding the second question that was added later, try this:
library(plyr)
dur.fn <- function(x) {
  r <- rle(x$s)$lengths
  data.frame(id = x$id[1], dur.value = r, dur.seq = paste0("dur", seq_along(r)))
}
dat.dur.long <- ddply(dat, .(id), dur.fn)
dat.dur <- dcast(dat.dur.long, id ~ dur.seq, c, value.var = "dur.value", fill = 0)
cbind(dat.s, dat.dur[-1])
which gives:
id cohort s1 s2 s3 s4 s5 dur1 dur2 dur3 dur4
1 1 1 2 2 1 4 0 2 1 1 0
2 2 3 1 1 3 0 0 2 1 0 0
3 3 2 1 2 3 3 4 1 1 2 1
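For completeness, the first reshaping can also be written with the tidyverse (a sketch, assuming dplyr and a reasonably recent tidyr, >= 1.1.0, for pivot_wider() with a scalar values_fill):
library(dplyr)
library(tidyr)
dat %>%
  group_by(id) %>%
  mutate(seq = paste0("s", row_number())) %>%   # s1, s2, ... within each id
  ungroup() %>%
  pivot_wider(names_from = seq, values_from = s, values_fill = 0)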

Reshaping from wide to long and vice versa (multistate/survival analysis dataset)

I am trying to reshape the following dataset with reshape(), without much success.
The starting dataset is in "wide" form, with each id described by one row. The dataset is intended to be used for carrying out Multistate analyses (a generalization of Survival Analysis).
Each person is recorded for a given overall time span. During this period the subject can experience a number of transitions among states (for simplicity let us fix to two the maximum number of distinct states that can be visited). The first visited state is s1 = 1, 2, 3, 4. The person stays within the state for dur1 time periods, and the same applies for the second visited state s2:
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3
The dataset in long format which I would like to obtain is:
id cohort s
1 1 3
1 1 3
1 1 3
1 1 3
1 1 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 0 1
2 0 1
2 0 1
2 0 4
2 0 4
2 0 4
In practice, each id has dur1 + dur2 rows, and s1 and s2 are melted into a single variable s.
How would you do this transformation? Also, how would you come back to the original "wide" form of the dataset?
Many thanks!
dat <- cbind(id=c(1,2), cohort=c(1, 0), s1=c(3, 1), dur1=c(4, 4), s2=c(2, 4), dur2=c(5, 3))
You can use reshape() for the first step, but then you need to do some more work. Also, reshape() needs a data.frame() as its input, but your sample data is a matrix.
Here's how to proceed:
reshape() your data from wide to long:
dat2 <- reshape(data.frame(dat), direction = "long",
                idvar = c("id", "cohort"),
                varying = 3:ncol(dat), sep = "")
dat2
# id cohort time s dur
# 1.1.1 1 1 1 3 4
# 2.0.1 2 0 1 1 4
# 1.1.2 1 1 2 2 5
# 2.0.2 2 0 2 4 3
"Expand" the resulting data.frame using rep()
dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), c("id", "cohort", "s")]
dat3[order(dat3$id), ]
# id cohort s
# 1.1.1 1 1 3
# 1.1.1.1 1 1 3
# 1.1.1.2 1 1 3
# 1.1.1.3 1 1 3
# 1.1.2 1 1 2
# 1.1.2.1 1 1 2
# 1.1.2.2 1 1 2
# 1.1.2.3 1 1 2
# 1.1.2.4 1 1 2
# 2.0.1 2 0 1
# 2.0.1.1 2 0 1
# 2.0.1.2 2 0 1
# 2.0.1.3 2 0 1
# 2.0.2 2 0 4
# 2.0.2.1 2 0 4
# 2.0.2.2 2 0 4
You can get rid of the funky row names too by using rownames(dat3) <- NULL.
Update: Retaining the ability to revert to the original form
In the example above, since we dropped the "time" and "dur" variables, it isn't possible to directly revert to the original dataset. If you feel this is something you would need to do, I suggest keeping those columns in and creating another data.frame with the subset of the columns that you need if required.
Here's how:
Use aggregate() to get back to "dat2":
aggregate(cbind(s, dur) ~ ., dat3, unique)
# id cohort time s dur
# 1 2 0 1 1 4
# 2 1 1 1 3 4
# 3 2 0 2 4 3
# 4 1 1 2 2 5
Wrap reshape() around that to get back to "dat1". Here, in one step:
reshape(aggregate(cbind(s, dur) ~ ., dat3, unique),
        direction = "wide", idvar = c("id", "cohort"))
# id cohort s.1 dur.1 s.2 dur.2
# 1 2 0 1 4 4 3
# 2 1 1 3 4 2 5
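For completeness, the wide-to-long expansion can also be done with tidyr, turning the s*/dur* pairs into one row per spell and then repeating each row dur times (a sketch, assuming tidyr >= 1.0.0 for pivot_longer() and uncount()):
library(dplyr)
library(tidyr)
as.data.frame(dat) %>%
  pivot_longer(cols = -c(id, cohort),
               names_to = c(".value", "spell"),
               names_pattern = "([a-z]+)([0-9]+)") %>%  # s1/dur1 -> s, dur + spell number
  uncount(dur) %>%                                      # one row per period spent in the state
  select(id, cohort, s)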
There are probably better ways, but this might work.
df <- read.table(text = '
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3',
header=TRUE)
hist <- matrix(0, nrow=2, ncol=9)
hist
for (i in 1:nrow(df)) {
  hist[i, ] <- c(rep(df[i, 3], df[i, 4]), rep(df[i, 5], df[i, 6]),
                 rep(0, 9 - df[i, 4] - df[i, 6]))
}
hist
hist2 <- cbind(df[,1:2], hist)
colnames(hist2) <- c('id', 'cohort', paste('x', seq_along(1:9), sep=''))
library(reshape2)
hist3 <- melt(hist2, id.vars=c('id', 'cohort'), variable.name='x', value.name='state')
hist4 <- hist3[order(hist3$id, hist3$cohort),]
hist4
hist4 <- hist4[ , !names(hist4) %in% c("x")]
hist4 <- hist4[!(hist4[,2]==0 & hist4[,3]==0),]
Gives:
id cohort state
1 1 1 3
3 1 1 3
5 1 1 3
7 1 1 3
9 1 1 2
11 1 1 2
13 1 1 2
15 1 1 2
17 1 1 2
2 2 0 1
4 2 0 1
6 2 0 1
8 2 0 1
10 2 0 4
12 2 0 4
14 2 0 4
Of course, if you have more than two states per id then this would have to be modified (and it might have to be modified if you have more than two cohorts). For example, I suppose with 9 sample periods one person could be in the following sequence of states:
1 3 2 4 3 4 1 1 2
