Conditional average or replace value if zero in dataframe - r

I am struggling to find the right code for the following problem.
Here is a simplified and short version of my dataframe df :
Line Id Amount
1 1 10
2 2 12
3 2 13
4 2 0
5 3 11
6 4 12
7 4 14
8 5 0
9 6 11
10 6 0
I would like to create another column, Amount_Avrg, under the following conditions:
- if several lines have the same Id and an Amount different from zero (the case for lines 2 and 3, and for lines 6 and 7), calculate the average of those amounts
- if a line has an Amount equal to 0, then:
A/ remove it if it is alone, i.e. there is no other line with the same Id and a value different from 0 (the case of line 8)
B/ if there is one other line with the same Id and a value different from 0 (the case for lines 9 and 10), replace 0 with the value of that line
C/ if there are two or more lines with the same Id and a value different from zero (the case for lines 2 and 3), replace 0 with the average of those amounts
The final dataframe I am expecting would then look like this one:
Line Id Amount Amount_Avrg
1 1 10 10
2 2 12 12.5
3 2 13 12.5
4 2 0 12.5
5 3 11 11
6 4 12 13
7 4 14 13
9 6 11 11
10 6 0 11
I have read in many answers that loops are not efficient in R, so if you could help me with another solution, that would be fantastic :-)

Using dplyr, we can group_by Id, take the mean of the non-zero Amount values, and remove the rows where that mean is NA (the Ids that only have zeros).
library(dplyr)
df %>%
group_by(Id) %>%
mutate(mn = mean(Amount[Amount > 0])) %>%
filter(!is.na(mn))
# Line Id Amount mn
# <int> <int> <int> <dbl>
#1 1 1 10 10
#2 2 2 12 12.5
#3 3 2 13 12.5
#4 4 2 0 12.5
#5 5 3 11 11
#6 6 4 12 13
#7 7 4 14 13
#8 9 6 11 11
#9 10 6 0 11
Or with data.table
library(data.table)
setDT(df)[, mn := mean(Amount[Amount > 0]), by = Id][!is.na(mn)]
data
df <- structure(list(Line = 1:10, Id = c(1L, 2L, 2L, 2L, 3L, 4L, 4L,
5L, 6L, 6L), Amount = c(10L, 12L, 13L, 0L, 11L, 12L, 14L, 0L,
11L, 0L)), class = "data.frame", row.names = c(NA, -10L))
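A note on why the !is.na(mn) filter drops Id 5 in both snippets: when every Amount in a group is 0, Amount[Amount > 0] is empty, the mean of an empty vector is NaN, and is.na() treats NaN as missing. A minimal base R check, using a made-up one-row group that mimics Id 5:

```r
# Hypothetical single group whose only Amount is zero (like Id 5)
amount <- c(0L)

# Nothing survives the "> 0" filter, so the vector is empty
nonzero <- amount[amount > 0]

# The mean of an empty vector is NaN, not an error
m <- mean(nonzero)

# is.na() is TRUE for NaN, so filter(!is.na(mn)) removes the group
is.na(m)
```

So no separate "all zeros" test is needed; the NaN produced by the empty mean doubles as the flag for rows to drop.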

You can use ave to calculate the mean per Id and then subset with !is.na to remove the rows where you have only 0 per Id.
x$Amount_Avrg <- ave(x$Amount, x$Id, FUN=function(x) mean(x[x>0]))
x <- x[!is.na(x$Amount_Avrg),]
x
# Line Id Amount Amount_Avrg
#1 1 1 10 10.0
#2 2 2 12 12.5
#3 3 2 13 12.5
#4 4 2 0 12.5
#5 5 3 11 11.0
#6 6 4 12 13.0
#7 7 4 14 13.0
#9 9 6 11 11.0
#10 10 6 0 11.0
Or with within and na.omit:
na.omit(within(x, Amount_Avrg <- ave(Amount, Id, FUN=function(x) mean(x[x>0]))))
Or using aggregate and merge:
merge(x, aggregate(cbind(Amount_Avrg = Amount) ~ Id, data=x[x$Amount>0,], mean))
Data:
x <- read.table(header=TRUE, text="Line Id Amount
1 1 10
2 2 12
3 2 13
4 2 0
5 3 11
6 4 12
7 4 14
8 5 0
9 6 11
10 6 0")

If you create a summary table of all the nonzero-means, you can right-join that to the original table to get the result displayed in the question.
library(data.table)
setDT(df)
nonzero_means <- df[Amount > 0, .(Amount_Avg = mean(Amount)), Id]
df[nonzero_means, on = .(Id)]
# Line Id Amount Amount_Avg
# 1: 1 1 10 10.0
# 2: 2 2 12 12.5
# 3: 3 2 13 12.5
# 4: 4 2 0 12.5
# 5: 5 3 11 11.0
# 6: 6 4 12 13.0
# 7: 7 4 14 13.0
# 8: 9 6 11 11.0
# 9: 10 6 0 11.0

Related

How to combine two or more variables into one in R?

I'm currently trying to do a t-test with my data. I have three variables (or let's say groups): people that have cats, dogs, or no pets. Now I want to put the cat and dog people into one group called "pets" and then compare this group with the "no-pet" group. How can I do this?
> mytable <- read.csv2("versuch.csv")
> mytable
cats dogs none
1 3 1 3
2 5 2 2
3 3 6 5
4 8 8 9
5 5 5 8
6 6 9 2
I want it to look like this:
> mytable <- read.csv2("versuch.csv")
> mytable
cats dogs none pets
1 3 1 3 3
2 5 2 2 5
3 3 6 5 3
4 8 8 9 8
5 5 5 8 5
6 6 9 2 6
7 1
8 2
9 6
10 8
... ....
So basically I want to have one extra variable that consists both of the values of the cats and dog variable. Is there a possibility to achieve that?
We could use add_row from the tibble package:
library(tidyverse)
df %>%
mutate(pets = cats) %>%
add_row(pets = df$dogs)
Output:
cats dogs none pets
<dbl> <dbl> <dbl> <dbl>
1 3 1 3 3
2 5 2 2 5
3 3 6 5 3
4 8 8 9 8
5 5 5 8 5
6 6 9 2 6
7 NA NA NA 1
8 NA NA NA 2
9 NA NA NA 6
10 NA NA NA 8
11 NA NA NA 5
12 NA NA NA 9
data:
df <- tibble::tribble(
~cats, ~dogs, ~none,
3, 1, 3,
5, 2, 2,
3, 6, 5,
8, 8, 9,
5, 5, 8,
6, 9, 2)
You cannot have an unequal number of rows for different columns in a data frame. You can pad the other columns with NA instead.
vec <- unlist(mytable[c('cats', 'dogs')], use.names = FALSE)
mytable <- cbind(mytable[1:length(vec), ], pets = vec)
rownames(mytable) <- NULL
mytable
# cats dogs none pets
#1 3 1 3 3
#2 5 2 2 5
#3 3 6 5 3
#4 8 8 9 8
#5 5 5 8 5
#6 6 9 2 6
#7 NA NA NA 1
#8 NA NA NA 2
#9 NA NA NA 6
#10 NA NA NA 8
#11 NA NA NA 5
#12 NA NA NA 9
data
mytable <- structure(list(cats = c(3L, 5L, 3L, 8L, 5L, 6L), dogs = c(1L,
2L, 6L, 8L, 5L, 9L), none = c(3L, 2L, 5L, 9L, 8L, 2L)),
class = "data.frame", row.names = c(NA, -6L))
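If the pooled group is only needed for the t-test, the extra column may not be required at all, since t.test accepts two vectors of different lengths. A rough sketch, assuming the cats and dogs values can simply be pooled into one sample (whether the default Welch two-sample test suits the study design is a separate question):

```r
# Same values as the example table in the question
mytable <- data.frame(cats = c(3, 5, 3, 8, 5, 6),
                      dogs = c(1, 2, 6, 8, 5, 9),
                      none = c(3, 2, 5, 9, 8, 2))

# Pool cat and dog owners into one vector (12 values)
pets <- c(mytable$cats, mytable$dogs)

# Two-sample t-test; the two vectors may differ in length
res <- t.test(pets, mytable$none)
res$p.value
```

This avoids the NA padding entirely; the padded data frame is only needed if other code expects everything in one table.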

Data transformation: I am looking for an efficient way in R to recode/expand many-to-one for survival analysis

I am looking at graft patency after surgery (CABG)
In a CABG procedure, a single patient will typically get more than one graft (bypass), and we are looking at time-to-failure. This is indicated in the raw data by a variable indicating number of failed grafts, and the time at which diagnosed.
My raw data is currently one-line-per-patient and I believe I need to make it one-line-per-graft in order to continue to KM and Cox analyses. I am considering assorted if/then loops, but wonder if there is a more-efficient way to recode here.
Example data:
Patient VeinGrafts VeinsOccluded Months
1 2 0 36
2 4 1 34
3 3 2 38
4 4 0 33
In order to look at this "per vein" I need to recode such that each #VeinGraft gets its own row, and VeinsOccluded becomes 1/0
I need each row replicated (VeinGrafts) times, such that patient 2 will have 4 rows, but one of them has the VeinsOccluded indicator and the other 3 do not
This is what I would need the above data to look like for my next analytic move.
Patient VeinGrafts VeinsOccluded Months
1 2 0 36
1 2 0 36
2 4 1 34
2 4 0 34
2 4 0 34
2 4 0 34
3 3 1 38
3 3 1 38
3 3 0 38
4 4 0 33
4 4 0 33
4 4 0 33
4 4 0 33
This community has been so incredibly helpful to this point, but I have not been able to find a similar question answered - if I have overlooked I apologize, but most certainly appreciate any ideas you may have!
We can use uncount to expand the data, then, grouped by 'Patient', overwrite 'VeinsOccluded' with a logical expression that compares row_number() with the first value of 'VeinsOccluded', coerced to binary with +
library(dplyr)
library(tidyr)
df1 %>%
uncount(VeinGrafts, .remove = FALSE) %>%
group_by(Patient) %>%
mutate(VeinsOccluded = +(row_number() <= first(VeinsOccluded))) %>%
ungroup %>%
select(names(df1))
-output
# A tibble: 13 x 4
# Patient VeinGrafts VeinsOccluded Months
# <int> <int> <int> <int>
# 1 1 2 0 36
# 2 1 2 0 36
# 3 2 4 1 34
# 4 2 4 0 34
# 5 2 4 0 34
# 6 2 4 0 34
# 7 3 3 1 38
# 8 3 3 1 38
# 9 3 3 0 38
#10 4 4 0 33
#11 4 4 0 33
#12 4 4 0 33
#13 4 4 0 33
Or this can be done with data.table (probably in a more efficient way)
library(data.table)
setDT(df1)[rep(seq_len(.N), VeinGrafts)][,
VeinsOccluded := +(seq_len(.N) <= first(VeinsOccluded)), Patient][]
-output
# Patient VeinGrafts VeinsOccluded Months
# 1: 1 2 0 36
# 2: 1 2 0 36
# 3: 2 4 1 34
# 4: 2 4 0 34
# 5: 2 4 0 34
# 6: 2 4 0 34
# 7: 3 3 1 38
# 8: 3 3 1 38
# 9: 3 3 0 38
#10: 4 4 0 33
#11: 4 4 0 33
#12: 4 4 0 33
#13: 4 4 0 33
data
df1 <- structure(list(Patient = 1:4, VeinGrafts = c(2L, 4L, 3L, 4L),
VeinsOccluded = c(0L, 1L, 2L, 0L), Months = c(36L, 34L, 38L,
33L)), class = "data.frame", row.names = c(NA, -4L))
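For comparison, the same expansion can be sketched in base R without extra packages. This is only an illustration of the idea above (repeat each row VeinGrafts times, then flag the first VeinsOccluded copies); the name long is made up here:

```r
df1 <- data.frame(Patient = 1:4,
                  VeinGrafts = c(2L, 4L, 3L, 4L),
                  VeinsOccluded = c(0L, 1L, 2L, 0L),
                  Months = c(36L, 34L, 38L, 33L))

# Repeat each patient row once per graft
long <- df1[rep(seq_len(nrow(df1)), df1$VeinGrafts), ]

# sequence() numbers the grafts 1..VeinGrafts within each patient;
# the first VeinsOccluded grafts get 1, the remaining ones get 0
long$VeinsOccluded <- as.integer(sequence(df1$VeinGrafts) <= long$VeinsOccluded)

rownames(long) <- NULL
long
```

Which graft within a patient carries the event is arbitrary here, exactly as in the dplyr and data.table versions.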

Fill Missing Values

data=data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,3))
library(dplyr);library(tidyverse)
data$timeWANTattempt=data$timeHAVE
data <- data %>%
group_by(student) %>%
fill(timeWANTattempt)+3
I have 'timeHAVE' and I want to replace missing times with the previous time +3. I show my dplyr attempt but it does not work. I seek a data.table solution. Thank you.
You can try:
data %>%
group_by(student) %>%
mutate(n_na = cumsum(is.na(timeHAVE))) %>%
mutate(timeHAVE = ifelse(is.na(timeHAVE), timeHAVE[n_na == 0 & lead(n_na) == 1] + 3*n_na, timeHAVE))
student timeHAVE timeWANT n_na
<dbl> <dbl> <dbl> <int>
1 1 1 1 0
2 1 4 4 0
3 1 7 7 0
4 1 10 10 0
5 2 2 2 0
6 2 5 5 0
7 2 8 8 1
8 2 11 11 1
9 3 6 6 0
10 3 9 9 1
11 3 12 12 2
12 3 15 15 3
13 4 3 3 0
I included the little helper n_na, which counts the NAs within each student. The second mutate then multiplies the number of NAs by three and adds this to the last non-NA element before the NAs.
Here's an approach using 'locf' filling
setDT(data)
data[ , by = student, timeWANT := {
# carry previous observations forward whenever missing
locf_fill = nafill(timeHAVE, 'locf')
# every next NA, the amount shifted goes up by another 3
na_shift = cumsum(idx <- is.na(timeHAVE))
# add the shift, but only where the original data was missing
locf_fill[idx] = locf_fill[idx] + 3*na_shift[idx]
# return the full vector
locf_fill
}]
Warning: this won't work if a given student can have more than one non-consecutive set of NA values in timeHAVE, because the shift keeps accumulating across the separate runs.
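If several separate runs of NA per student are possible, one way around that limitation is to restart the count at every observed value. A base R sketch of that idea (fill_plus3 and the column filled are made-up names, and the toy data with two NA runs is invented for illustration, not taken from the question):

```r
# Replace each NA with the last observed value plus `step` times the
# number of NAs since that value; leading NAs stay NA.
fill_plus3 <- function(x, step = 3) {
  seen <- !is.na(x)
  run  <- cumsum(seen)   # increases by one at every observed value
  # position inside the current run: 0 for the observed value, 1, 2, ... after
  k <- ave(seq_along(x), run, FUN = function(i) seq_along(i) - 1L)
  # last observed value for each position (NA before the first observation)
  last <- x[seen][replace(run, run == 0, NA)]
  last + step * k
}

# Invented data: student 1 has two separate NA runs, student 2 starts with NA
d <- data.frame(student  = c(1, 1, 1, 1, 1, 2, 2),
                timeHAVE = c(2, NA, 8, NA, NA, NA, 3))

d$filled <- ave(d$timeHAVE, d$student, FUN = fill_plus3)
d
```

Because the counter k resets at each non-NA value, the second run for student 1 is filled from 8 rather than from the accumulated shift.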
Another data.table option without grouping:
setDT(data)[, w := fifelse(is.na(timeHAVE) & student==shift(student),
nafill(timeHAVE, "locf") + 3L * rowid(rleid(timeHAVE)),
timeHAVE)]
output:
student timeHAVE timeWANT w
1: 1 1 1 1
2: 1 4 4 4
3: 1 7 7 7
4: 1 10 10 10
5: 2 2 2 2
6: 2 5 5 5
7: 2 NA 8 8
8: 2 11 11 11
9: 3 6 6 6
10: 3 NA 9 9
11: 3 NA 12 12
12: 3 NA 15 15
13: 4 NA NA NA
14: 4 3 3 3
data with student=4 having NA for the first timeHAVE:
data = data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,NA,3))

sum for each ID depending on another variable

I would like to sum a column (by ID) depending on another variable (group). If we take for instance:
ID t group
1 12 1
1 14 1
1 2 6
2 0.5 7
2 12 1
3 3 1
4 2 4
I'd like to sum values of column t separately for each ID only if group==1, and obtain:
ID t group sum
1 12 1 26
1 14 1 26
1 2 6 NA
2 0.5 7 NA
2 12 1 12
3 3 1 3
4 2 4 NA
Using dplyr,
df %>%
group_by(ID) %>%
mutate(new = sum(t[group == 1]),
new = replace(new, group != 1, NA))
which gives,
# A tibble: 7 x 4
# Groups: ID [4]
ID t group new
<int> <dbl> <int> <dbl>
1 1 12 1 26
2 1 14 1 26
3 1 2 6 NA
4 2 0.5 7 NA
5 2 12 1 12
6 3 3 1 3
7 4 2 4 NA
Consider base R with ifelse and ave() for conditional inline aggregation.
df$sum <- with(df, ifelse(group == 1, ave(t, ID, group, FUN=sum), NA))
df
# ID t group sum
# 1 1 12.0 1 26
# 2 1 14.0 1 26
# 3 1 2.0 6 NA
# 4 2 0.5 7 NA
# 5 2 12.0 1 12
# 6 3 3.0 1 3
# 7 4 2.0 4 NA
We can use data.table methods. Convert the 'data.frame' to a 'data.table' (setDT(df)), specify i with the logical expression group == 1, and, grouped by 'ID', get the sum of 't' and assign (:=) it to 'new'. Rows not selected by i are left as NA by default.
library(data.table)
setDT(df)[group == 1, new := sum(t), ID]
df
# ID t group new
#1: 1 12.0 1 26
#2: 1 14.0 1 26
#3: 1 2.0 6 NA
#4: 2 0.5 7 NA
#5: 2 12.0 1 12
#6: 3 3.0 1 3
#7: 4 2.0 4 NA
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 4L), t = c(12,
14, 2, 0.5, 12, 3, 2), group = c(1L, 1L, 6L, 7L, 1L, 1L, 4L)),
class = "data.frame", row.names = c(NA,
-7L))

Create edgelist for all interactions from data.frame

I am trying to do network analysis in igraph but am having trouble transforming my dataset into an edge list (with weights), given the varying number of columns.
The data set (df1) looks as follows, though the real one is much larger. The first column is the main operator Id. A main operator can also be a partner and vice versa, so the Ids stay the same in the edge list. The challenge is that the number of partners varies (from 0 to 40) and every interaction has to be considered, not just "IdMain to IdPartnerX".
IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4 .....
1 4 3 7 6
2 3 1 NA NA
3 1 4 2 NA
4 9 6 3 NA
.
.
I already got the helpful tip to use reshape to do this, like:
data_melt <- reshape2::melt(data, id.vars = "IdMain")
edgelist <- data_melt[!is.na(data_melt$value), c("IdMain", "value")]
However, this only creates a 'directed' edgelist (from Main to Partners). What I need is something like below, where every interaction is recorded.
Id1 Id2
1 4
1 3
1 7
1 6
4 3
4 7
4 6
3 7
etc
Does anyone have a tip what the best way to go is? I also looked into the igraph library and couldn't find the function to do this.
There is no need for reshape(2), melting, etc. You just need to grab every combination of column pairs and then bind them together.
x <- read.table(text="IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4
1 4 3 7 6
2 3 1 NA NA
3 1 4 2 NA
4 9 6 3 NA", header=TRUE)
idx <- t(combn(seq_along(x), 2))
edgelist <- lapply(1:nrow(idx), function(i) x[, c(idx[i, 1], idx[i, 2])])
edgelist <- lapply(edgelist, setNames, c("ID1","ID2"))
edgelist <- do.call(rbind, edgelist)
edgelist <- edgelist[rowSums(is.na(edgelist))==0, ]
edgelist
# ID1 ID2
# 1 1 4
# 2 2 3
# 3 3 1
# 4 4 9
# 5 1 3
# 6 2 1
# 7 3 4
# 8 4 6
# 9 1 7
# 11 3 2
# 12 4 3
# 13 1 6
# 17 4 3
# 18 3 1
# 19 1 4
# 20 9 6
# 21 4 7
# 23 1 2
# 24 9 3
# 25 4 6
# 29 3 7 <--
# 31 4 2
# 32 6 3
# 33 3 6 <--
# 37 7 6 <--
Using the data below, you can achieve what looks to be your goal with apply and combn. This returns a list of matrices, one per row of your data.frame, each holding the pairwise combinations of that row's non-NA elements.
myPairs <- apply(t(dat), 2, function(x) t(combn(x[!is.na(x)], 2)))
Note that the output of apply can be finicky and it is necessary here to have at least one row with an NA so that apply will return a list rather than a matrix.
If you want a data.frame at the end, use do.call and rbind to put the matrices together and then data.frame and setNames for the object coercion and to add names.
setNames(data.frame(do.call(rbind, myPairs)), c("Id1", "Id2"))
Id1 Id2
1 1 4
2 1 3
3 1 7
4 1 6
5 4 3
6 4 7
7 4 6
8 3 7
9 3 6
10 7 6
11 2 3
12 2 1
13 3 1
14 3 1
15 3 4
16 3 2
17 1 4
18 1 2
19 4 2
20 4 9
21 4 6
22 4 3
23 9 6
24 9 3
25 6 3
data
dat <-
structure(list(IdMain = 1:4, IdPartner1 = c(4L, 3L, 1L, 9L),
IdPartner2 = c(3L, 1L, 4L, 6L), IdPartner3 = c(7L, NA, 2L,
3L), IdPartner4 = c(6L, NA, NA, NA)), .Names = c("IdMain",
"IdPartner1", "IdPartner2", "IdPartner3", "IdPartner4"),
class = "data.frame", row.names = c(NA, -4L))
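To sidestep the apply caveat mentioned above, the rows can be looped over with lapply, which always returns a list. The sketch below also adds a weight column, assuming the edges are undirected and that the weight is simply the number of times a pair co-occurs; both are assumptions, since the question does not define the weights. The names pairs_by_row, edges, und and weighted are made up:

```r
dat <- data.frame(IdMain     = 1:4,
                  IdPartner1 = c(4L, 3L, 1L, 9L),
                  IdPartner2 = c(3L, 1L, 4L, 6L),
                  IdPartner3 = c(7L, NA, 2L, 3L),
                  IdPartner4 = c(6L, NA, NA, NA))

# One matrix of pairs per row; lapply returns a list regardless of NAs
pairs_by_row <- lapply(seq_len(nrow(dat)), function(i) {
  ids <- unlist(dat[i, ], use.names = FALSE)
  ids <- ids[!is.na(ids)]
  if (length(ids) < 2) return(NULL)  # a main operator without partners has no pairs
  t(combn(ids, 2))
})

edges <- setNames(as.data.frame(do.call(rbind, pairs_by_row)), c("Id1", "Id2"))

# Assumption: undirected edges, so order each pair before counting
und <- data.frame(Id1 = pmin(edges$Id1, edges$Id2),
                  Id2 = pmax(edges$Id1, edges$Id2),
                  weight = 1L)
weighted <- aggregate(weight ~ Id1 + Id2, data = und, FUN = sum)
weighted
```

The length check also guards against combn treating a single positive integer as the sequence 1..n, which would otherwise invent pairs for rows with no partners.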
