Sum of each pair of columns in R

I have a dataframe like this:
dd <- data.frame(col1=c(1,0,1), col2=c(1,1,1), col3=c(1,0,0), col4=c(1,0,1))
and I would like the sum of each pair of columns, like:
col1+col2 col1+col3 col1+col4 col2+col3 col2+col4 col3+col4
2 2 2 2 2 2
1 0 0 1 1 0
2 1 2 1 2 1
I didn't find any function that does this.
Please help me.

One base R option might be combn + rowSums:
setNames(
  as.data.frame(combn(dd, 2, rowSums)),
  combn(names(dd), 2, paste0, collapse = "+")
)
which gives
col1+col2 col1+col3 col1+col4 col2+col3 col2+col4 col3+col4
1 2 2 2 2 2 2
2 1 0 0 1 1 0
3 2 1 2 1 2 1
Data
dd<-data.frame(col1=c(1,0,1),col2=c(1,1,1),col3=c(1,0,0),col4=c(1,0,1))

One dplyr and purrr possibility could be:
library(dplyr)
library(purrr)
map_dfc(.x = combn(names(dd), 2, simplify = FALSE),
        ~ dd %>%
            rowwise() %>%
            transmute(!!paste(.x, collapse = "+") := sum(c_across(all_of(.x)))))
`col1+col2` `col1+col3` `col1+col4` `col2+col3` `col2+col4` `col3+col4`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 2 2 2 2 2
2 1 0 0 1 1 0
3 2 1 2 1 2 1

Uglier, and slower than the base R above:
do.call("cbind", setNames(Map(function(i) dd[, i] + dd[, -(1:i)],
                              seq_len(ncol(dd) - 1)),
                          names(dd)[1:(ncol(dd) - 1)]))


Check condition row-wise for a number of columns

Data example:
df <- data.frame("a" = c(1,2,3,4), "b" = c(4,3,2,1), "x_ind" = c(1,0,1,1), "y_ind" = c(0,0,1,1), "z_ind" = c(0,1,1,1) )
> df
a b x_ind y_ind z_ind
1 1 4 1 0 0
2 2 3 0 0 1
3 3 2 1 1 1
4 4 1 1 1 1
I want to add a new column which checks whether, for the columns ending in "_ind", the whole row has all values equal to 1. If it does, the new column gets 1, else 0. The resulting dataframe should look like:
a b x_ind y_ind z_ind keep
1 1 4 1 0 0 0
2 2 3 0 0 1 0
3 3 2 1 1 1 1
4 4 1 1 1 1 1
I can select the columns using df %>% select(contains("_ind")), but I am not sure how to do a row-wise operation that checks whether every value in the row is 1, and then append the result back to the original dataframe.
Any help would be appreciated! I am working with dplyr but would appreciate any solution.
You can use rowSums on a comparison of your df with 1, i.e.
rowSums(df[grepl('_ind', names(df))] == 1) == ncol(df[grepl('_ind', names(df))])
#[1] FALSE FALSE TRUE TRUE
Continuing your dplyr attempt, you can do
df %>%
  select(contains("_ind")) %>%
  mutate(new = rowSums(. == 1) == ncol(.))
# x_ind y_ind z_ind new
#1 1 0 0 FALSE
#2 0 0 1 FALSE
#3 1 1 1 TRUE
#4 1 1 1 TRUE
# Or you can filter directly
df %>%
  select(contains("_ind")) %>%
  filter(rowSums(. == 1) == ncol(.))
# x_ind y_ind z_ind
#1 1 1 1
#2 1 1 1
If you also want to keep the original columns, you can use
df %>%
  filter_at(vars(ends_with('_ind')), all_vars(. == 1))
# a b x_ind y_ind z_ind
#1 3 2 1 1 1
#2 4 1 1 1 1
NOTE: the meaning of . depends on context. After a pipe (as in rowSums(. == 1) above) it refers to the data frame coming down the pipe; inside all_vars() it refers to each of the columns matched by the condition (i.e. the columns that end with _ind).
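For instance, a quick way to see what the dot refers to in the piped case (using the df above):
df %>% select(contains("_ind")) %>% {ncol(.)}   # 3: `.` is the three selected columns
ncol(df)                                        # 5: the full data frame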
Similarly in base R,
df[rowSums(df[grepl('_ind', names(df))] == 1) == ncol(df[grepl('_ind', names(df))]),]
# a b x_ind y_ind z_ind
#3 3 2 1 1 1
#4 4 1 1 1 1
You can use rowwise with c_across in newer versions of dplyr (1.0.0 and later):
library(dplyr)
df %>% rowwise() %>% mutate(keep = +all(c_across(ends_with('ind')) == 1))
# a b x_ind y_ind z_ind keep
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#1 1 4 1 0 0 0
#2 2 3 0 0 1 0
#3 3 2 1 1 1 1
#4 4 1 1 1 1 1
You can use apply with all, using endsWith to select the columns ending with _ind and testing whether they are == 1.
df$keep <- +(apply(df[,endsWith(colnames(df), "_ind")]==1, 1, all))
df
# a b x_ind y_ind z_ind keep
#1 1 4 1 0 0 0
#2 2 3 0 0 1 0
#3 3 2 1 1 1 1
#4 4 1 1 1 1 1
or using rowSums
df$keep <- +(rowSums(df[,endsWith(colnames(df), "_ind")]!=1) == 0)

Changing multiple column names, pasting at the beginning/end of the column name

I have a very easy question, but I am struggling with it a bit as I am not good with string manipulation.
I have a dataset that looks something like this
df <- data.frame(id= c(1,1,1,2,2,2,3,3,3), time=c(1,2,3,1,2,3,1,2,3),y = rnorm(9), x1 = rnorm(9), x2 = c(0,0,0,0,1,0,1,1,1),c2 = rnorm(9))
df
# id time y x1 x2 c2
# 1 1 1 0.2849573 -2.0675484 0 -0.07262881
# 2 1 2 0.7790181 -0.7575962 0 -0.58792408
# 3 1 3 1.5612293 0.6249859 0 1.19410761
# 4 2 1 0.5001897 3.4156129 0 -0.03577452
# 5 2 2 0.7155184 -0.5672982 1 -1.22208675
# 6 2 3 0.5086272 -0.7848763 0 -0.41084467
# 7 3 1 -0.4707959 0.1159467 1 0.77233201
# 8 3 2 0.8641184 0.2498162 1 0.49336869
# 9 3 3 1.3348043 -0.6803672 1 -0.33189217
I would simply like to change all the column names from x1 onwards by adding an "_o" suffix. The final dataset should look like this:
final
# id time y x1_o x2_o c2_o
# 1 1 1 1.1251762 -0.7191008 0 -0.07478527
# 2 1 2 0.7585758 1.8694635 0 -0.42652822
# 3 1 3 -1.3180201 -0.4336776 0 0.38417779
# 4 2 1 1.7335904 2.2968254 0 -0.35639828
# 5 2 2 0.1506950 -0.5481873 1 -0.38523601
# 6 2 3 -1.9475207 -0.5302951 0 0.21721675
# 7 3 1 -0.1024133 -0.2872962 1 -0.06347213
# 8 3 2 0.1316069 0.1463118 1 -0.19518602
# 9 3 3 -1.1037682 -0.1129085 1 -0.24011278
I am able to change column names one by one, but I would like a one-liner.
I have tried this, but it only pastes at the beginning:
dp_o <- df %>% rename_at(3:5, ~ paste("_o", .))
Probably it is just a variation of the code above, but I am struggling to see which, given that I do not understand string manipulation well.
Thanks in advance!
We need the _o at the end because paste concatenates its arguments from left to right, not the reverse (and paste0 drops the space that paste would insert). Note also that the columns to rename are 4:6 (x1 through c2), not 3:5.
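A quick illustration of the argument order:
paste("_o", "x1")    # "_o x1"  - prefix, plus an unwanted space
paste0("x1", "_o")   # "x1_o"   - suffix, no separator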
Applied to your data:
library(dplyr)
df %>%
  rename_at(4:6, ~ paste0(., "_o"))
# id time y x1_o x2_o c2_o
#1 1 1 0.62714872 -0.70259726 0 0.4386072
#2 1 2 -0.53052052 -0.37854004 0 1.8857944
#3 1 3 -0.97729791 0.70909984 0 0.3611839
#4 2 1 -0.31016711 -1.12787900 0 0.9684549
#5 2 2 -1.91335148 -1.84690443 1 -0.1196826
#6 2 3 -0.03967186 0.21916880 0 0.6295054
#7 3 1 1.18847857 -0.75449457 1 -1.4622606
#8 3 2 0.81352527 -0.44126036 1 0.8604688
#9 3 3 1.92443154 -0.04599181 1 -0.9240210
Or if we need to match the column name
df %>%
  rename_at(vars(match('x1', names(.)):ncol(.)), ~ paste0(., '_o'))
Or with str_c
library(stringr)
df %>%
  rename_at(vars(x1:c2), ~ str_c(., '_o'))
If you don't mind using the data.table package, the following would work
library(data.table)
setDT(df)
old <- colnames(df)[which(colnames(df) == "x1"):ncol(df)]
new <- paste(old, "o", sep = "_")
setnames(df, old, new)
df[]
## id time y x1_o x2_o c2_o
## 1 1 1 -1.5612344 0.9711583 0 -1.08198269
## 2 1 2 0.8090729 -0.9474716 0 -0.21020803
## 3 1 3 0.8070253 0.9765167 0 2.13507943
## 4 2 1 0.7446732 -0.2459540 0 0.64870743
## 5 2 2 -1.1853776 -0.3828339 1 -0.09298909
## 6 2 3 0.5057534 0.5822639 0 0.79730587
## 7 3 1 -0.3655794 -0.1628970 1 -0.57866153
## 8 3 2 -1.3465086 1.1107107 1 1.11290979
## 9 3 3 -0.8271092 -0.4105378 1 0.88522610
With base R, you can do it via the following code:
names(df)[-(1:3)] <- paste0(names(df)[-(1:3)],"_o")
which gives:
> df
id time y x1_o x2_o c2_o
1 1 1 -1.1861828 -0.97027842 0 1.8556257
2 1 2 1.1964478 0.48936940 0 -0.2144602
3 1 3 -1.1164802 0.03258791 0 -1.7737551
4 2 1 0.4940969 -1.31300219 0 0.1865097
5 2 2 -0.8735071 -1.01195060 1 0.6515702
6 2 3 0.1749421 0.27409115 0 -1.2432389
7 3 1 1.8849013 0.92642054 1 0.9861089
8 3 2 -0.3765072 -1.15343868 1 0.8451167
9 3 3 -0.2033892 1.66717960 1 -0.1480590

Identifying Duplicates in `data.frame` Using `dplyr`

I want to identify (not eliminate) duplicates in a data frame and add a 0/1 variable accordingly (whether a row is a duplicate or not), using the R dplyr package.
Example:
| A B C D
1 | 1 0 1 1
2 | 1 0 1 1
3 | 0 1 1 1
4 | 0 1 1 1
5 | 1 1 1 1
Clearly, rows 1 and 2 are duplicates, so I want to create a new variable (with mutate?), say E, that equals 1 in rows 1, 2, 3 and 4, since rows 3 and 4 are also identical.
Moreover, I want to add another variable, F, that equals 1 if there is a duplicate differing by only one column. That is, F in rows 1, 2 and 5 would equal 1, since they differ only in the B column.
I hope it is clear what I want to do, and I hope that dplyr offers a smooth solution to this problem. This is of course possible in "base" R, but I believe (hope) that a smoother solution exists.
You can use dist() to compute the differences, and then a search in the resulting distance object can give the needed answers (E, F, etc.). Here is an example code, where X is the original data.frame:
W <- as.matrix(dist(X, method = "manhattan"))
X$E <- as.integer(sapply(1:ncol(W), function(i, D) any(W[-i, i] == D), D = 0))
X$F <- as.integer(sapply(1:ncol(W), function(i, D) any(W[-i, i] == D), D = 1))
Just change D= to the number of differing columns needed.
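For example, following the same pattern, a hypothetical G column flagging rows that have a near-duplicate differing in exactly two columns:
X$G <- as.integer(sapply(1:ncol(W), function(i, D) any(W[-i, i] == D), D = 2))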
It's all base R though. Using plyr::laply instead of sapply has the same effect; dplyr looks like overkill here.
Here is a data.table solution that is extendable to an arbitrary case (1..n columns the same); perhaps someone can convert it to dplyr for you. I had to change your dataset a bit to show your desired F column: in your example, all rows would get a 1, because rows 3 and 4 are one column different from row 5 as well.
library(data.table)
DT <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1), E = c(1,1,0,0,0))
DT
A B C D E
1 1 0 1 1 1
2 1 0 1 1 1
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 1 1 0
setDT(DT)
DT_ncols <- length(DT)
base <- data.table(t(combn(1:nrow(DT), 2)))
setnames(base, c("V1","V2"),c("ind_x","ind_y"))
DT[, ind := .I]
DT_melt <- melt(DT, id.var = "ind", variable.name = "column")
base <- merge(base, DT_melt, by.x = "ind_x", by.y = "ind", allow.cartesian = TRUE)
base <- merge(base, DT_melt, by.x = c("ind_y", "column"), by.y = c("ind", "column"))
base <- base[, .(common_cols = sum(value.x == value.y)), by = .(ind_x, ind_y)]
This gives us a data.table that looks like this:
base
ind_x ind_y common_cols
1: 1 2 5
2: 1 3 2
3: 2 3 2
4: 1 4 2
5: 2 4 2
6: 3 4 5
7: 1 5 3
8: 2 5 3
9: 3 5 4
10: 4 5 4
This says that rows 1 and 2 have 5 columns in common (duplicates). Rows 3 and 5 have 4 columns in common, as do rows 4 and 5. We can now use a fairly extendable format to flag any combination we want:
base <- melt(base, id.vars = "common_cols")
# Unique - common_cols == DT_ncols
DT[, F := ifelse(ind %in% unique(base[common_cols == DT_ncols, value]), 1, 0)]
# Same save 1 - common_cols == DT_ncols - 1
DT[, G := ifelse(ind %in% unique(base[common_cols == DT_ncols - 1, value]), 1, 0)]
# Same save 2 - common_cols == DT_ncols - 2
DT[, H := ifelse(ind %in% unique(base[common_cols == DT_ncols - 2, value]), 1, 0)]
This gives:
A B C D E ind F G H
1: 1 0 1 1 1 1 1 0 1
2: 1 0 1 1 1 2 1 0 1
3: 0 1 1 1 0 3 1 1 0
4: 0 1 1 1 0 4 1 1 0
5: 1 1 1 1 0 5 0 1 1
Instead of manually selecting, you can append all combinations like so:
# run after base <- melt(base, id.vars = "common_cols")
base <- unique(base[,.(ind = value, common_cols)])
base[, common_cols := factor(common_cols, 1:DT_ncols)]
merge(DT, dcast(base, ind ~ common_cols, fun.aggregate = length, drop = FALSE), by = "ind")
ind A B C D E 1 2 3 4 5
1: 1 1 0 1 1 1 0 1 1 0 1
2: 2 1 0 1 1 1 0 1 1 0 1
3: 3 0 1 1 1 0 0 1 0 1 1
4: 4 0 1 1 1 0 0 1 0 1 1
5: 5 1 1 1 1 0 0 0 1 1 0
Here is a dplyr solution (test is the example data frame above):
test %>%
  mutate(flag = (A == lag(A) &
                 B == lag(B) &
                 C == lag(C) &
                 D == lag(D))) %>%
  mutate(twice = lead(flag) == TRUE) %>%
  mutate(E = ifelse(flag == TRUE | twice == TRUE, 1, 0)) %>%
  mutate(E = ifelse(is.na(E), 0, 1)) %>%
  mutate(FF = ifelse(((A + lag(A)) + (B + lag(B)) + (C + lag(C)) + (D + lag(D))) == 7, 1, 0)) %>%
  mutate(FF = ifelse(is.na(FF) | FF == 0, 0, 1)) %>%
  select(A, B, C, D, E, FF)
Result:
A B C D E FF
1 1 0 1 1 1 0
2 1 0 1 1 1 0
3 0 1 1 1 1 0
4 0 1 1 1 1 0
5 1 1 1 1 0 1

Transforming dataframe to contain counts of values

I have created a dataframe that contains ids and string values:
ids <- c(1,1,2,3)
stringvalues <- c('a','a','b','c')
mydf <- data.frame(ids, stringvalues)
mydf contains:
ids stringvalues
1 1 a
2 1 a
3 2 b
4 3 c
I'm attempting to produce a new dataframe that contains the id and
corresponding counts for each string :
id, a , b , c
1 , 2 , 0 , 0
2 , 0 , 1 , 0
3 , 0 , 0 , 1
I'm trying to create it with multiple summarise calls:
g1 <- group_by(mydf , ids)
s1 <- summarise(g1 , a = count('a'))
s2 <- summarise(g1 , b = count('b'))
s3 <- summarise(g1 , c = count('c'))
But this returns the error: Evaluation error: no applicable method for 'groups' applied to an object of class "character".
How do I create new columns that count the number of string entries in the column?
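As background, the error arises because count() expects a data frame as its first argument, not a character string like 'a'. One sketch that stays close to the summarise approach counts matches with sum():
library(dplyr)
g1 <- group_by(mydf, ids)
summarise(g1,
          a = sum(stringvalues == 'a'),  # count of 'a' per id
          b = sum(stringvalues == 'b'),
          c = sum(stringvalues == 'c'))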
Does doing a dplyr::count followed by tidyr::spread work for you? (I'm only posting this because you mentioned wanting to create a dataframe of this sort; otherwise it's much simpler to use table(mydf), as the other comments/answers suggest.)
library(dplyr)
library(tidyr)
mydf %>% count(ids, stringvalues) %>% spread(stringvalues, n, fill = 0)
#> # A tibble: 3 x 4
#> ids a b c
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 0
#> 2 2 0 1 0
#> 3 3 0 0 1
You can use count directly. First,
count(mydf, ids,stringvalues)
gives
# A tibble: 3 x 3
ids stringvalues n
<dbl> <fctr> <int>
1 1 a 2
2 2 b 1
3 3 c 1
then reshape,
count(mydf, ids,stringvalues) %>% tidyr::spread(stringvalues, n)
gives
# A tibble: 3 x 4
ids a b c
* <dbl> <int> <int> <int>
1 1 2 NA NA
2 2 NA 1 NA
3 3 NA NA 1
then replace the NAs with something like res[is.na(res)] <- 0, where res is the object constructed above.
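Putting those steps together, a minimal sketch:
library(dplyr)
res <- count(mydf, ids, stringvalues) %>% tidyr::spread(stringvalues, n)
res[is.na(res)] <- 0   # replace the NAs produced by spread
res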
Here's a base-R solution:
data.frame(cbind(table(mydf)))
Output option 1 (row # = ID):
a b c
1 2 0 0
2 0 1 0
3 0 0 1
Output option 2 (with ID as column):
data.frame(cbind(id=unique(mydf$ids),table(mydf)))
id a b c
1 1 2 0 0
2 2 0 1 0
3 3 0 0 1

Reshaping from wide to long and vice versa (multistate/survival analysis dataset)

I am trying to reshape the following dataset with reshape(), without much success.
The starting dataset is in "wide" form, with each id described by one row. The dataset is intended to be used for multistate analyses (a generalization of survival analysis).
Each person is recorded over a given overall time span. During this period the subject can experience a number of transitions among states (for simplicity, let us fix at two the maximum number of distinct states that can be visited). The first visited state is s1 = 1, 2, 3, 4. The person stays within that state for dur1 time periods, and the same applies to the second visited state s2:
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3
The dataset in long format which I would like to obtain is:
id cohort s
1 1 3
1 1 3
1 1 3
1 1 3
1 1 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 0 1
2 0 1
2 0 1
2 0 4
2 0 4
2 0 4
In practice, each id has dur1 + dur2 rows, and s1 and s2 are melted into a single variable s.
How would you do this transformation? Also, how would you come back to the original "wide" form?
Many thanks!
dat <- cbind(id=c(1,2), cohort=c(1, 0), s1=c(3, 1), dur1=c(4, 4), s2=c(2, 4), dur2=c(5, 3))
You can use reshape() for the first step, but then you need to do some more work. Also, reshape() needs a data.frame() as its input, but your sample data is a matrix.
Here's how to proceed:
reshape() your data from wide to long:
dat2 <- reshape(data.frame(dat), direction = "long",
                idvar = c("id", "cohort"),
                varying = 3:ncol(dat), sep = "")
dat2
# id cohort time s dur
# 1.1.1 1 1 1 3 4
# 2.0.1 2 0 1 1 4
# 1.1.2 1 1 2 2 5
# 2.0.2 2 0 2 4 3
"Expand" the resulting data.frame using rep()
dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), c("id", "cohort", "s")]
dat3[order(dat3$id), ]
# id cohort s
# 1.1.1 1 1 3
# 1.1.1.1 1 1 3
# 1.1.1.2 1 1 3
# 1.1.1.3 1 1 3
# 1.1.2 1 1 2
# 1.1.2.1 1 1 2
# 1.1.2.2 1 1 2
# 1.1.2.3 1 1 2
# 1.1.2.4 1 1 2
# 2.0.1 2 0 1
# 2.0.1.1 2 0 1
# 2.0.1.2 2 0 1
# 2.0.1.3 2 0 1
# 2.0.2 2 0 4
# 2.0.2.1 2 0 4
# 2.0.2.2 2 0 4
You can get rid of the funky row names too by using rownames(dat3) <- NULL.
Update: Retaining the ability to revert to the original form
In the example above, since we dropped the "time" and "dur" variables, it isn't possible to directly revert to the original dataset. If you think you will need to do that, keep those columns in and create another data.frame with the subset of columns you need, as required.
Here's how:
Use aggregate() to get back to "dat2":
aggregate(cbind(s, dur) ~ ., dat3, unique)
# id cohort time s dur
# 1 2 0 1 1 4
# 2 1 1 1 3 4
# 3 2 0 2 4 3
# 4 1 1 2 2 5
Wrap reshape() around that to get back to "dat1". Here, in one step:
reshape(aggregate(cbind(s, dur) ~ ., dat3, unique),
direction = "wide", idvar = c("id", "cohort"))
# id cohort s.1 dur.1 s.2 dur.2
# 1 2 0 1 4 4 3
# 2 1 1 3 4 2 5
There are probably better ways, but this might work.
df <- read.table(text = '
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3',
header=TRUE)
hist <- matrix(0, nrow = 2, ncol = 9)   # 9 = total periods observed (4 + 5)
for (i in 1:nrow(df)) {
  hist[i, ] <- c(rep(df[i, 3], df[i, 4]),   # first state, repeated dur1 times
                 rep(df[i, 5], df[i, 6]),   # second state, repeated dur2 times
                 rep(0, 9 - df[i, 4] - df[i, 6]))
}
hist2 <- cbind(df[, 1:2], hist)
colnames(hist2) <- c('id', 'cohort', paste0('x', 1:9))
library(reshape2)
hist3 <- melt(hist2, id.vars = c('id', 'cohort'), variable.name = 'x', value.name = 'state')
hist4 <- hist3[order(hist3$id, hist3$cohort), ]
hist4 <- hist4[, !names(hist4) %in% c("x")]
hist4 <- hist4[!(hist4[, 2] == 0 & hist4[, 3] == 0), ]   # drop the zero-padding rows
hist4
Gives:
id cohort state
1 1 1 3
3 1 1 3
5 1 1 3
7 1 1 3
9 1 1 2
11 1 1 2
13 1 1 2
15 1 1 2
17 1 1 2
2 2 0 1
4 2 0 1
6 2 0 1
8 2 0 1
10 2 0 4
12 2 0 4
14 2 0 4
Of course, if you have more than two states per id then this would have to be modified (as would the hard-coded dimensions if you have more ids or periods). For example, with 9 sample periods one person could be in the following sequence of states:
1 3 2 4 3 4 1 1 2
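If the wide data always comes as s1, dur1, ..., sk, durk pairs, a small helper along these lines (a sketch, not part of the answer above) would generalize the expansion to any number of visited states:
# Assumes columns id, cohort, s1, dur1, ..., sk, durk
wide_to_long <- function(df, k) {
  do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
    s   <- unlist(df[i, paste0("s",   seq_len(k))])   # states visited
    dur <- unlist(df[i, paste0("dur", seq_len(k))])   # time spent in each
    data.frame(id = df$id[i], cohort = df$cohort[i],
               s = rep(s, dur), row.names = NULL)
  }))
}
wide_to_long(df, 2)   # reproduces the long format shown earlier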
