I would like to add one new row for each of the subjects in my dataframe, which looks something like this:
Subject = c("1","5","10")
time = c("2", "2.25", "2.5")
value = c("3", "17", "9")
DF <- data.frame(Subject, time, value)
Subject time value
1 1 2 3
2 5 2.25 17
3 10 2.5 9
I want to add a new row for each subject with a time = 0 and value = 0, giving this:
Subject = c("1","1","5","5","10","10")
time = c("0","2","0", "2.25","0", "2.5")
value = c("0","3","0", "17","0", "9")
DF2 <- data.frame(Subject, time, value)
Subject time value
1 1 0 0
2 1 2 3
3 5 0 0
4 5 2.25 17
5 10 0 0
6 10 2.5 9
I have a lot of subjects, with a lot of gaps in their subject numbers, and want to do this for all of them in a reasonable way. Any suggestions?
Thank you in advance.
Sincerely,
ykl
I would just rbind in the new values. (Not sure why you specified all your values as character; here I changed them to numeric.)
DF <- data.frame(
Subject = c(1,5,10),
time = c(2, 2.25, 2.5),
value = c(3, 17, 9)
)
DF2 <- rbind(
DF,
data.frame(Subject = unique(DF$Subject), time = 0, value = 0)
)
This puts the new rows at the bottom, but you can re-sort if you like:
DF2[order(DF2$Subject, DF2$time), ]
You can also use interleave from the "gdata" package:
library(gdata)
interleave(DF, data.frame(Subject = 0, time = 0, value = 0))
# Subject time value
# 1 1 2.00 3
# 11 0 0.00 0
# 2 5 2.25 17
# 1.1 0 0.00 0
# 3 10 2.50 9
# 1.2 0 0.00 0
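Note that with the arguments in this order the filler rows don't carry the subject IDs. If you want output matching DF2 above (the zero rows first, with the real IDs), you could swap the arguments; a sketch:
# interleave one zero row per subject ahead of each data row
interleave(data.frame(Subject = DF$Subject, time = 0, value = 0), DF)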
This uses @MrFlick's sample data:
DF <- data.frame(
Subject = c(1,5,10),
time = c(2, 2.25, 2.5),
value = c(3, 17, 9)
)
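For completeness, the same result is available in one pipeline; a minimal sketch, assuming dplyr is available:
library(dplyr)
# build one baseline row per subject, bind it on, and sort
baseline <- DF %>% distinct(Subject) %>% mutate(time = 0, value = 0)
DF2 <- bind_rows(baseline, DF) %>% arrange(Subject, time)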
My dataframe consists of three columns:
customerID: shows a unique ID for each customer.
buy: "1" if the customer bought a certain product and "0" if they didn't.
loy: indicates the customer's grade of "loyalty". The value changes depending on whether or not the customer has previously bought the product.
The same customer can buy the product multiple times.
The construction of the variable "loy" goes as follows for each observation [i]:
loy[i] = 0.9*buy[i-1] + 0.1*loy[i-1]
This is the piece of code I have so far:
dta <- data.frame(
customerID = c("10","10","10","11","11","11","12","12","12","13","13", "13", "14",
"14", "14", "15", "15", "15", "16", "16", "16"),
buy = c(1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1,
0, 0, 1)
)
dta$loy <- 0
n=nrow(dta)
for(i in 2:n){
dta$loy[i] <- dta$buy[i-1]*0.9 + 0.1*dta$loy[i-1]
}
(The first value of "loy" for each customer is supposed to be fixed at 0.)
I need to construct the variable "loy" for each customer. That means the loop has to reset every time it reaches a new customerID, but I don't know how to do that.
With the code I have so far, the loop just keeps going across all of the observations.
It's worth mentioning that this dataframe was created from scratch just for the ease of this question. I have another data set with over 2000 customerIDs, to which I plan to apply the solution of this simplified problem.
You can run a loop inside another loop. This solution depends on your data being sorted by customerID and time.
dta$loy <- 0  # reset, so the first row of each customer keeps the fixed baseline
ids <- unique(dta$customerID)
for (i in ids) {
  # locations of the ith customer's records
  index <- which(dta$customerID == i)
  # calculate the running loyalty score, skipping the customer's first row
  for (j in index[-1]) {
    dta$loy[j] <- dta$buy[j - 1] * 0.9 + 0.1 * dta$loy[j - 1]
  }
}
dta
# customerID buy loy
# 1 10 1 0.00
# 2 10 1 0.90
# 3 10 1 0.99
# 4 11 1 0.00
# 5 11 0 0.90
# 6 11 0 0.09
# ...
You could use dplyr:
library(dplyr)
dta %>%
group_by(customerID) %>%
mutate(tmp = 0.9 * lag(buy, default = 0),
loy = tmp + 0.1 * lag(tmp, default = 0)) %>%
select(-tmp) %>%
ungroup()
which returns
# A tibble: 21 x 3
customerID buy loy
<chr> <dbl> <dbl>
1 10 1 0
2 10 1 0.9
3 10 1 0.99
4 11 1 0
5 11 0 0.9
6 11 0 0.09
7 12 0 0
8 12 1 0
9 12 1 0.9
10 13 1 0
# ... with 11 more rows
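One caveat: this closed form keeps only two lags, which reproduces the recursion exactly here because each customer has just three rows; with longer histories the 0.1^k tail terms would be dropped. A sketch of the full per-group recursion, assuming purrr is available alongside dplyr:
library(dplyr)
library(purrr)
dta %>%
  group_by(customerID) %>%
  # accumulate() carries 0.1 * the previous loy forward through the whole group
  mutate(loy = accumulate(0.9 * lag(buy, default = 0),
                          ~ .y + 0.1 * .x)) %>%
  ungroup()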
I have previously used the following code to sum values across a row:
subset$EBIT <- rowSums(subset[c("rorresul", "resand", "rteinknc",
"rteinext", "rteinov")], na.rm = TRUE)
However, I actually need to include the condition that "resand" should only be included if it is positive. The other values can be either positive or negative; it does not matter. I have used rowSums because otherwise my total ended up as a missing value whenever one of the variables was missing.
If you need a sample of the data, here is some:
rorresul resand rteinknc rteinext rteinov
40 30 2 2 2
50 -40 5 5 5
30 0 1 1 1
Super appreciative of any help! Thanks!
I would just sum everything, and then subtract resand afterwards where it is negative:
library(dplyr)
df %>%
mutate(
EBIT = rowSums(across(everything())),
EBIT = ifelse(resand < 0, EBIT - resand, EBIT)
)
# rorresul resand rteinknc rteinext rteinov EBIT
# 1 40 30 2 2 2 76
# 2 50 -40 5 5 5 65
# 3 30 0 1 1 1 33
Here is the data:
df <- data.frame(
rorresul = c(40, 50, 30),
resand = c(30, -40, 0),
rteinknc = c(2, 5, 1),
rteinext = c(2, 5, 1),
rteinov = c(2, 5, 1),
stringsAsFactors = FALSE
)
Edit
In case you have variables that shouldn't be included in the rowSums, you can prespecify the ones to sum:
sumVars <- c("rorresul", "resand", "rteinknc", "rteinext", "rteinov")
df %>%
mutate(
EBIT = rowSums(across(all_of(sumVars))),
EBIT = ifelse(resand < 0, EBIT - resand, EBIT)
)
You can use pmax to turn the negative resand values into 0 and then calculate rowSums.
cols <- c("rorresul", "resand", "rteinknc", "rteinext", "rteinov")
df$EBIT <- rowSums(transform(df, resand = pmax(resand, 0))[cols])
df
# rorresul resand rteinknc rteinext rteinov EBIT
#1 40 30 2 2 2 76
#2 50 -40 5 5 5 65
#3 30 0 1 1 1 33
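Since the question relied on na.rm = TRUE to survive missing values, note that rowSums can take it in either approach; for example, a sketch of the pmax() variant:
# na.rm = TRUE drops NAs from the row sums, as in the question's original code
df$EBIT <- rowSums(transform(df, resand = pmax(resand, 0))[cols], na.rm = TRUE)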
I want to calculate a moving sum of one column (populated with ones and zeroes), but only when the value in a corresponding time column ('seconds') is within a (moving) range of values.
My data looks like this:
values <- c(1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0)
seconds <- c(0.0, 1.0, 2.5, 3.0, 5.5, 6.0, 6.5, 7.0, 8.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.5, 16.0, 17.0, 18.0, 19.0, 20.0)
data <- data.frame(values, seconds)
Say I want to sum every 5 seconds' worth of data in the 'values' column.
Then my first 5-second sum (seconds >= 0 & seconds <= 5) would be:
1 + 0 + 0 + 0 = 1
(these are the values corresponding to 'seconds' entries 0.0, 1.0, 2.5 and 3.0; the sum STOPs there because the next value, 1, corresponds to 5.5 seconds, outside the interval).
The next 5-second interval (seconds >= 1 & seconds <= 6) would equal:
0 + 0 + 0 + 1 + 1 = 2
3rd interval (seconds >= 2.5 & seconds <= 7.5):
0 + 0 + 1 + 1 + 0 + 1 = 3
and so on.
I'm an R noob, so this is the method I'm using to calculate it (and it is super slow, so I know there must be a better way):
movsum <- numeric(nrow(data))
for(i in 1:nrow(data)){movsum[i] <- sum(subset(data, seconds >= seconds[i] & seconds <= seconds[i] + 5.0)$values)}
Thanks for your help. Let me know if there's anything I should clarify.
Here's a possible data.table::foverlaps solution. The idea is to create a lookup table of 5-second intervals and then look up which values in data fall within each interval.
Choose an interval
int <- 5 ## 5 seconds
Then load the package, add an additional (identical) seconds column to data and key it in order to enable the binary join, create a lookup table holding the desired 5-second interval per row, run foverlaps, pull the corresponding entries from data$values, and sum them per interval. Something like the following seems to work:
library(data.table)
setkey(setDT(data)[, seconds2 := seconds], seconds, seconds2)
lookup <- data[, .(seconds, seconds2 = seconds + int)]
res <- foverlaps(lookup, data, which = TRUE)[, values := data$values[yid]]
res[, .(SumValues = sum(values)), by = .(SecInterval = xid)]
# SecInterval SumValues
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 3
# 5: 5 3
# 6: 6 2
# 7: 7 1
# 8: 8 2
# 9: 9 1
# 10: 10 2
# 11: 11 3
# 12: 12 3
# 13: 13 2
# 14: 14 2
# 15: 15 1
# 16: 16 0
# 17: 17 0
# 18: 18 0
# 19: 19 0
# 20: 20 0
You may try some functions from the zoo package:
library(zoo)
# convert your data to a zoo time series
z <- read.zoo(data, index = "seconds")
# create an empty, regular time series,
# which contains the full time range, in steps of 0.5 sec
z0 <- zoo(, seq(from = start(z), to = end(z), by = 0.5))
# 'expand' the irregular, original data to a regular series, by merging it with z0
z2 <- merge(z, z0)
# apply the desired function (sum) to a rolling window of width 11
# (number of observations in each window)
# move the time frame in steps of 2 (by = 2) which correspond to 1 sec
# use partial = TRUE, to allow the window to pass outside the time range
rollapply(z2, width = 11, by = 2, FUN = sum, na.rm = TRUE,
align = "left", partial = TRUE)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 1 2 3 3 3 3 2 2 1 2 2 3 3 2 2 1 0 0 0 0 0
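For comparison, the scan from the question can also be vectorized over timestamps without subset(); a minimal sketch using the same leading 5-second window, often fast enough for moderate data:
# for each timestamp s, sum the values observed within [s, s + 5]
movsum <- sapply(data$seconds, function(s)
  sum(data$values[data$seconds >= s & data$seconds <= s + 5]))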
I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and to return NA where this is not possible.
So for instance, in this example the variance for ID 1 would be var(20, 25);
for ID 2 it would return NA, because you can't calculate a variance on just one entry; for ID 3 it would be var(15, 32, 16); and for ID 4 it would again return NA, because there are no non-zero values to calculate a variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len <- nrow(d)
variances <- numeric(len)
for (i in 1:len){
  # get all non-zero values in the ith row (dropping the ID column)
  nonzerodat <- as.numeric(d[i, -1])
  nonzerodat <- nonzerodat[nonzerodat != 0]
  variances[i] <- var(nonzerodat)
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
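If you would rather keep the result attached to the data frame, a dplyr (>= 1.0) sketch along the same lines, reusing var_nonzero from above:
library(dplyr)
Data %>%
  rowwise() %>%
  # c_across() gathers the value columns within each row
  mutate(variance = var_nonzero(c_across(-ID))) %>%
  ungroup()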
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
melt(., id.var = "id") %>%
group_by(id) %>%
summarise(variance = var(value[value!=0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA
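The same pipeline also works with tidyr::pivot_longer, the successor to reshape2::melt; a sketch:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-id) %>%   # one row per id/column pair
  group_by(id) %>%
  summarise(variance = var(value[value != 0]))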
I have the following sample code that makes one data frame containing information for more than one ID.
I want to sort the observations into defined categories: for a specific, given time (e.g. here t = 10), I want to see the percentage change with respect to each ID's baseline value and return the matching category as the output.
I have explained the detailed steps of my calculation below.
a=c(100,105,126,130,150,100,90,76,51,40)
t=c(0,5,10,20,30)
t=rep(t,2)
ID=c(1,1,1,1,1,2,2,2,2,2)
data=data.frame(ID,t,a)
My desired calculation:
1) For each ID, the "a" value at t = 0 is the baseline.
2) Computation: at a given t (e.g. t = 10, which I have to provide), take the corresponding "a" value, then
%Change (answer) = (taken a value - baseline) / baseline
3) Compare the answer against the following defined categories:
#category
1 - if answer > 0.25
2 - if -0.30 < answer < 0.25
3 - if -1.0 < answer < -0.30
4 - if answer == -1.0
4) Return the number of the matching category.
Desired Output
ID My_Answer
1 1
2 2
Can anybody help me with this? I do understand the flow of my computation, but I am not aware of an efficient way of doing it, as I have so many IDs in that data frame.
Thank you
It's easier to do math with columns than with rows. So the first step is to move baseline numbers into their own columns, then use cut to define these groups:
library(dplyr)
library(tidyr)
foo <- data %>%
filter(t == 0) %>%
left_join(data %>%
filter(t != 0),
by = "ID") %>%
mutate(percentchange = (a.y - a.x) / a.x,
My_Answer = cut(percentchange, breaks = c(-1, -0.3, 0.25, Inf),
right = FALSE, include.lowest = TRUE, labels = c("g3","g2","g1")),
My_Answer = as.character(My_Answer),
My_Answer = ifelse(percentchange == -1, "g4", My_Answer)) %>%
select(ID, t = t.y, My_Answer)
foo
  ID  t My_Answer
1  1  5        g2
2  1 10        g1
3  1 20        g1
4  1 30        g1
5  2  5        g2
6  2 10        g2
7  2 20        g3
8  2 30        g3
You can see that this lets us calculate My_Answer for all values at once. If you want to find the values for t == 10, you can just pull out those rows:
foo %>%
filter(t == 10)
ID t My_Answer
1 1 10 g1
2 2 10 g2
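For comparison, a grouped sketch that avoids the self-join, assuming every ID has a t == 0 baseline row:
library(dplyr)
data %>%
  group_by(ID) %>%
  # percentage change relative to each ID's own baseline (t == 0)
  mutate(percentchange = (a - a[t == 0]) / a[t == 0]) %>%
  ungroup() %>%
  filter(t == 10)
The cut() step from above can then be applied to percentchange in the same way.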