R - conditional cumsum using multiple columns

I'm new to Stack Overflow, so I hope I post my question in the right format. I have a test dataset with three columns, where rank is the ranking of a cell, Esvalue is the value of a cell and zoneID is an area identifier (note: in the real dataset I have up to 40,000 zoneIDs).
rank <- seq(0.1, 1, 0.1)
Esvalue <- seq(10, 1)
zoneID <- rep(seq.int(1, 2), times = 5)
rank Esvalue zoneID
0.1 10 1
0.2 9 2
0.3 8 1
0.4 7 2
0.5 6 1
0.6 5 2
0.7 4 1
0.8 3 2
0.9 2 1
1.0 1 2
I want to calculate the following:
% ES value <- for each rank, including all lower ranks, the cumulative % share of the ES value relative to the total ES value of all zones:
cumsum(df$Esvalue)/sum(df$Esvalue)
% ES value zone <- for each rank, including all lower ranks, the cumulative % share of the Esvalue relative to the total Esvalue of that zoneID, for each zone. I tried this using mutate from dplyr, but so far it only gives me the cumulative sum, not the share. In the end this will generate a variable for each zoneID:
df %>%
  mutate(cA = cumsum(ifelse(!is.na(zoneID) & zoneID == 1, Esvalue, 0))) %>%
  mutate(cB = cumsum(ifelse(!is.na(zoneID) & zoneID == 2, Esvalue, 0)))
These two variables I want to combine by:
1) calculating the absolute difference between the two for all zoneIDs,
2) for each rank, calculating the mean of the absolute difference over all zoneIDs.
In the end the final output should look like:
rank Esvalue zoneID mean_abs_diff
0.1 10 1 0.16666667
0.2 9 2 0.01333333
0.3 8 1 0.12000000
0.4 7 2 0.02000000
0.5 6 1 0.08000000
0.6 5 2 0.02000000
0.7 4 1 0.04666667
0.8 3 2 0.01333333
0.9 2 1 0.02000000
1.0 1 2 0.00000000
Now I created the last column using some intermediate steps in Excel, but my final dataset will be way too big to be handled by Excel. Any advice on how to proceed would be appreciated.
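One possible dplyr/tidyr approach (a sketch, not an optimized solution: it evaluates each zone's cumulative share at every rank and averages the absolute deviation from the overall share; on the small example above it reproduces the mean_abs_diff column, though the rowwise step may be slow with 40,000 zoneIDs):
library(dplyr)
library(tidyr)

df <- data.frame(rank, Esvalue, zoneID)

# cumulative % share of the total ES value over all zones, per rank
totals <- df %>%
  arrange(rank) %>%
  mutate(share_total = cumsum(Esvalue) / sum(Esvalue))

# each zone's cumulative share evaluated at every rank (r and z are helper names)
per_zone <- crossing(r = unique(df$rank), z = unique(df$zoneID)) %>%
  rowwise() %>%
  mutate(share_zone = sum(df$Esvalue[df$zoneID == z & df$rank <= r]) /
                      sum(df$Esvalue[df$zoneID == z])) %>%
  ungroup()

# mean absolute difference between zone share and overall share, per rank
result <- per_zone %>%
  left_join(totals, by = c("r" = "rank")) %>%
  group_by(r) %>%
  summarise(mean_abs_diff = mean(abs(share_zone - share_total)))
Joining result back onto df by rank would then give the final table shown above.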

Related

Create list and generate descriptives for each variable

I want to generate descriptive statistics for multiple variables at a time (close to 50), rather than writing out the code several times.
Here is a very basic example of data:
id var1 var2
1 1 3
2 2 3
3 1 4
4 2 4
I typically write out each line of code to get a frequency count and descriptives, like so:
library(psych)
table(df1$var1)
table(df1$var2)
describe(df1$var1)
describe(df1$var2)
I would like to create a list and get the output from these analyses, rather than writing out 100 lines of code. I tried this, but it is not working:
variable_list <- list(df1$var1, df1$var2)
for (variable in variable_list){
  table(df1$variable_list)
  describe(df1$variable_list)
}
Does anyone have advice on getting this to work?
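For what it's worth, if you do want a loop, one working sketch (assuming the data frame is df1, iterating over column names rather than extracted vectors; inside a for loop the results must be printed explicitly) is:
library(psych)

for (v in c("var1", "var2")) {
  print(table(df1[[v]]))     # frequency count for this column
  print(describe(df1[[v]]))  # descriptive statistics for this column
}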
The describe() function from psych can also take a whole data.frame and returns the descriptive statistics for each column:
library(psych)
describe(df1)
# vars n mean sd median trimmed mad min max range skew kurtosis se
#id 1 4 2.5 1.29 2.5 2.5 1.48 1 4 3 0 -2.08 0.65
#var1 2 4 1.5 0.58 1.5 1.5 0.74 1 2 1 0 -2.44 0.29
#var2 3 4 3.5 0.58 3.5 3.5 0.74 3 4 1 0 -2.44 0.29
If it is only a subset of columns, specify either the column indices or the column names to subset the dataset:
describe(df1[2:3])
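Or equivalently by name (assuming the column names shown above):
describe(df1[c("var1", "var2")])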
Another option is descr from collapse
library(collapse)
descr(slt(df1, 2:3))
Or to select numeric columns
descr(num_vars(df1))
Or for factors
descr(fact_vars(df1))

How to reference multiple dataframe columns to calculate a new column of weighted averages in R

I am currently calculating the weighted average column for my dataframe by manually referencing each column name. Is there a way to shorten the code by multiplying sets of columns,
e.g.:
df[, c("A","B","C")] and df[, c("PerA","PerB","PerC")], to obtain the weighted average, like SUMPRODUCT in Excel? This matters especially when there are many input columns in the weighted average calculation.
df$WtAvg = df$A*df$PerA + df$B*df$PerB + df$C*df$PerC
Without transforming your dataframe, and assuming that the first half of the columns holds the values and the second half holds the matching weights, you can use the weighted.mean function inside apply:
df$WtAvg = apply(df, 1, function(x){
  weighted.mean(x[1:(ncol(df)/2)], x[(ncol(df)/2 + 1):ncol(df)])
})
And you get the following output:
> df
A B C PerA PerB PerC WtAvg
1 1 2 3 0.1 0.2 0.7 2.6
2 4 5 6 0.5 0.3 0.2 4.7
3 7 8 9 0.6 0.1 0.3 7.7
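Another way to mimic Excel's SUMPRODUCT row by row, if the value and weight columns are known by name (a sketch, assuming columns A, B, C and PerA, PerB, PerC as in the example), is to multiply the two column blocks and use rowSums:
vals <- c("A", "B", "C")           # value columns
wts  <- c("PerA", "PerB", "PerC")  # matching weight columns

# row-wise sum of value*weight, divided by the row-wise sum of weights
df$WtAvg <- rowSums(df[, vals] * df[, wts]) / rowSums(df[, wts])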

Calculate win loss ratio

I want to create a win/loss ratio from a dataset that has a series of winning and losing team IDs. The dataset looks something like this:
WTeamID LTeamID
11 12
12 13
11 13
I'm trying to get a dataset that would look like this:
TeamID WLRatio
11 1.0
12 0.5
13 0.0
A straightforward way is to divide each team's count in the first column by its count across both columns:
lv <- sort(unique(unlist(df)))
res <- table(factor(df$WTeamID, levels = lv)) / table(factor(unlist(df), levels = lv))
as.data.frame(res)
Var1 Freq
1 11 1.0
2 12 0.5
3 13 0.0
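Note: the input df in reproducible form (assumed from the example table above) is:
df <- data.frame(WTeamID = c(11, 12, 11),
                 LTeamID = c(12, 13, 13))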

Create data frame from values in every two continuous rows from an existing data frame

I have data frame z1:
z1 <- data.frame(time = as.factor(rep(0.5:9.5, times = rep(c(9, 10, 8, 11, 12), 2))),
                 roi = rep(c(1:9, 1:10, 1:8, 1:11, 1:12), 2),
                 area = runif(100, 5.0, 7.5))
I want to create a new data frame z2 that has 10*nrow(z1) rows, with the condition:
at each time value, every pair of consecutive rows (z1$roi[i:(i+1)] and z1$area[i:(i+1)]), for i in 1:(nrow(z1)-1), is used to build the roi and area columns in z2, like
z2$roi <- seq(z1$roi[i], z1$roi[i+1], length.out = 10)
z2$area <- seq(z1$area[i], z1$area[i+1], length.out = 10)
If the data frame z1 looks like:
time roi area
1 0.5 1 6.181150 #=z1$roi[1]
2 0.5 2 5.469366 #=z1$roi[2]
3 0.5 3 6.742525
.
.
.
98 9.5 10 6.063234
99 9.5 11 6.824393 #=z1$roi[99]
100 9.5 12 7.346298 #=z1$roi[100]
the data frame z2 would be:
time roi area
1 0.5 1.000000 6.181150 #=z1$roi[1]
2 0.5 1.111111 6.102063
.
.
.
9 0.5 1.888889 5.548453
10 0.5 2.000000 5.469366 #=z1$roi[2]
.
.
.
991 9.5 11.00000 6.824393 #=z1$roi[99]
992 9.5 11.11111 6.882383
.
.
.
999 9.5 11.88889 7.288309
1000 9.5 12.00000 7.346298 #=z1$roi[100]
Can anyone help me? Thank you!
With tidyverse, shrinking your data a bit so the output is easier to inspect (replace 5 by 10 to get your full result):
z1 <- head(z1,3)
library(tidyverse)
z1 %>%
  mutate_at(vars(roi, area), ~ map2(., c(.[-1], last(.)), ~ seq(.x, .y, length.out = 5))) %>%
  unnest() %>%
  head(-5)
# time roi area
# 1 0.5 1.00 6.302351
# 2 0.5 1.25 6.151644
# 3 0.5 1.50 6.000938
# 4 0.5 1.75 5.850231
# 5 0.5 2.00 5.699525
# 6 0.5 2.00 5.699525
# 7 0.5 2.25 5.687045
# 8 0.5 2.50 5.674566
# 9 0.5 2.75 5.662087
# 10 0.5 3.00 5.649608
We will apply the same transformation to the columns roi and area, so we use mutate_at on those.
We want to transform them into list columns containing vectors, so that we can unnest afterwards and get a long data.frame (you may need to get acquainted with tidyr::unnest to understand this step; basically it makes a 'regular' data.frame out of a data.frame whose elements are vectors, lists, or nested data.frames).
The map family will return such a list output, but each value depends on the current AND the next value, so we use purrr::map2 to get both inputs.
. is the current value and c(.[-1], last(.)) is the next value (for the last element there is no next value, so we reuse the last value).
We unnest to create a long data.frame.
The repeated last value created duplicated rows, so we remove them with head(-5).
You could do this as a linear interpolation problem using approx():
s1 <- seq_len(nrow(z1) - 1)
s2 <- rep(s1, each = 9)
out <- approx(
  x = seq_along(z1$area),
  y = z1$area,
  xout = c(s2 + head(seq(0, 1, length.out = 10), -1), nrow(z1))
)$y
z1
# time roi area
#1 0.5 1 6.413124
#2 0.5 2 6.837422
#3 0.5 3 6.656612
And then just join the results back together using row indexing:
cbind(z1[c(s2,nrow(z1)),], out)
# time roi area out
#1 0.5 1 6.413124 6.413124
#1.1 0.5 1 6.413124 6.460268
#1.2 0.5 1 6.413124 6.507413
#1.3 0.5 1 6.413124 6.554557
#1.4 0.5 1 6.413124 6.601701
#1.5 0.5 1 6.413124 6.648845
#1.6 0.5 1 6.413124 6.695989
#1.7 0.5 1 6.413124 6.743134
#1.8 0.5 1 6.413124 6.790278
#2 0.5 2 6.837422 6.837422
#2.1 0.5 2 6.837422 6.817332
#2.2 0.5 2 6.837422 6.797242
#2.3 0.5 2 6.837422 6.777152
#2.4 0.5 2 6.837422 6.757062
#2.5 0.5 2 6.837422 6.736972
#2.6 0.5 2 6.837422 6.716882
#2.7 0.5 2 6.837422 6.696792
#2.8 0.5 2 6.837422 6.676702
#3 0.5 3 6.656612 6.656612
This sort of logic should scale much better than calculating a sequence for each row: on the order of 10 seconds vs 1 minute for 1 million rows in a quick and dirty test.
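Since the question asks for interpolated roi as well as area, the same idea could be applied to both columns; a sketch along those lines (assuming z1 as defined above, and not benchmarked):
n <- nrow(z1)
s1 <- seq_len(n - 1)
s2 <- rep(s1, each = 9)
xout <- c(s2 + head(seq(0, 1, length.out = 10), -1), n)

z2 <- data.frame(
  time = z1$time[c(s2, n)],                         # carry time from the left row of each pair
  roi  = approx(seq_len(n), z1$roi, xout = xout)$y,
  area = approx(seq_len(n), z1$area, xout = xout)$y
)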

How to match two columns with nearest time points?

I have the following dataframe. It is a time series in which each subject has values over days 1-4. There is an additional column that shows at which time (in hours) the test was made.
dt
Name values Days Test
a 0.2 1 20
a 0.3 2 20
a 0.6 3 20
a 0.2 4 20
b 0.3 1 44
b 0.4 2 44
b 0.8 3 44
b 0.7 4 44
c 0.2 1 24
c 0.7 2 24
I have to make a time-series plot in which each line represents a subject.
First I made a plot with days and values, with subjects as colors.
This gave me a line plot for each subject, plotted against days and values. I am happy with it.
However, I have to incorporate when the test was taken into the line plot. I could place it separately at the top or bottom of the plot, but not exactly on the line.
Could someone please help me?
Thanks in advance!
Use the directlabels package to add the times:
library(ggplot2)
library(directlabels)
ggplot(DF, aes(Days, values, color = Name)) +
  geom_line() +
  geom_dl(aes(label = Test), method = "last.points")
Note
The input DF in reproducible form is:
Lines <- "
Name values Days Test
a 0.2 1 20
a 0.3 2 20
a 0.6 3 20
a 0.2 4 20
b 0.3 1 44
b 0.4 2 44
b 0.8 3 44
b 0.7 4 44
c 0.2 1 24
c 0.7 2 24"
DF <- read.table(text = Lines, header = TRUE)
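If you prefer to avoid an extra package, a rough alternative (a sketch, assuming DF as above) is to label only the last point of each line with geom_text:
library(dplyr)
library(ggplot2)

last_pts <- DF %>%
  group_by(Name) %>%
  slice_max(Days, n = 1)   # one row per subject: its latest day

ggplot(DF, aes(Days, values, color = Name)) +
  geom_line() +
  geom_text(data = last_pts, aes(label = Test), hjust = -0.2, show.legend = FALSE)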
