I have the following example.
I want to create a new column with the absolute difference in Age between each row and the Treat == 1 row in the same PairID.
Desired output should be as shown below.
Complete data:
Treat <- c(1,0,0,1,0,0,1,0)
PairID <- c(1,1,1,2,2,2,3,3)
Age <- c(30,60,31,20,20,40,50,52)
D <- data.frame(Treat,PairID,Age)
D
I have tried using dplyr with:
D %>%
  group_by(PairID) %>%
  abs(Age - Age[Treat == 1])
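In base R (this relies on the rows already being sorted by PairID, because unlist(split(...)) returns the values in group order):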
D$absD <- unlist(lapply(split(D,D$PairID), function(x) abs(x$Age - x$Age[x$Treat==1])))
> D
Treat PairID Age absD
1 1 1 30 0
2 0 1 60 30
3 0 1 31 1
4 1 2 20 0
5 0 2 20 0
6 0 2 40 20
7 1 3 50 0
8 0 3 52 2
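If the rows are not sorted by PairID, a variant based on ave() may be safer, since ave() writes each group's result back to the original row positions. The following is only a sketch: it assumes exactly one Treat == 1 row per PairID, and the shuffled row order and the name ref_age are illustrative, not from the original post.

```r
# Same values as in the question, deliberately shuffled so that
# the rows are NOT sorted by PairID
D <- data.frame(
  Treat  = c(0, 1, 0, 0, 1, 0, 1, 0),
  PairID = c(1, 1, 2, 1, 2, 3, 3, 2),
  Age    = c(60, 30, 20, 31, 20, 52, 50, 40)
)

# Age of the Treat == 1 row, broadcast to every row of the same PairID
# (assumption: exactly one treated row per pair)
ref_age <- ave(ifelse(D$Treat == 1, D$Age, NA), D$PairID,
               FUN = function(a) a[!is.na(a)][1])

D$absD <- abs(D$Age - ref_age)
D
```

With dplyr loaded, the attempt from the question would presumably work once the expression is wrapped in mutate(), i.e. mutate(absD = abs(Age - Age[Treat == 1])) after group_by(PairID).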
I have the following data set:
District Type DaysBtwn Start_Day End_Day Start_Vol End_Vol
1 A 0 3 0 31 28 23
2 A 1 3 0 31 24 0
3 B 0 3 0 31 17700 10526
4 B 1 3 0 31 44000 35800
5 C 0 3 0 31 5700 0
6 C 1 3 0 31 35000 500
For each combination of District and Type, I want to do a simple linear interpolation: with x = days (Start_Day and End_Day) and y = volumes (Start_Vol and End_Vol), I want the estimated volume returned for xout = DaysBtwn.
I have tried many things, and I think I am having issues because of the way my data is set up. Can someone point me in the right direction for how to use the approx function in R to get the desired output? I don't mind rearranging my data set to get the correct format for approx.
Example of desired output:
District Type EstimatedVol
1 0 25
2 1 15
3 0 13000
4 1 39000
5 0 2500
6 1 25000
dt <- data.table(input)
interpolation <- dt[, approx(x,y,xout=z), by=list(input$District,input$Type)]
Why not simply calculate it directly?
# with() lets the formula refer to the columns of dt by name
dt$EstimatedVol <- with(dt, (End_Vol - Start_Vol) / (End_Day - Start_Day) * (DaysBtwn - Start_Day) + Start_Vol)
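If you do want to use approx() itself, one possible sketch is to call it once per row with mapply(), passing the two days as x and the two volumes as y. The data frame below is rebuilt by hand from the table in the question, and the argument names x0, x1, y0, y1 are made up for illustration. The result should agree with the direct formula.

```r
# Data from the question, rebuilt as a plain data frame
input <- data.frame(
  District  = c("A", "A", "B", "B", "C", "C"),
  Type      = c(0, 1, 0, 1, 0, 1),
  DaysBtwn  = 3,
  Start_Day = 0,
  End_Day   = 31,
  Start_Vol = c(28, 24, 17700, 44000, 5700, 35000),
  End_Vol   = c(23, 0, 10526, 35800, 0, 500)
)

# One approx() call per row: x = the two days, y = the two volumes
input$EstimatedVol <- mapply(
  function(x0, x1, y0, y1, xout) approx(c(x0, x1), c(y0, y1), xout = xout)$y,
  input$Start_Day, input$End_Day, input$Start_Vol, input$End_Vol, input$DaysBtwn
)

input[, c("District", "Type", "EstimatedVol")]
```

Because each row has only two points, this is equivalent to the closed-form line equation, so the direct calculation is likely the simpler choice unless you later have more than two points per group.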
I'm really new to R but I haven't been able to find a simple solution to this. As an example, I have the following dataframe:
case <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
areas <- c(1,2,1,1,1,2,2,2,2,1,1,2,2,2,1,1,1,2,2,2)
A <- c(1,2,11,12,20,21,26,43,43,47,48,59,63,64,65,66,67,83,90,91)
var <- c(1,1,0,0,0,1,1,0,0,1,0,1,0,1,1,0,0,0,0,0)
outcome <- c(1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0)
df <- data.frame(case,areas,A,var,outcome)
case areas A var outcome
1 1 1 1 1 1
2 2 2 2 1 0
3 3 1 11 0 0
4 4 1 12 0 0
5 5 1 20 0 0
6 6 2 21 1 0
7 7 2 26 1 0
8 8 2 43 0 0
9 9 2 43 0 0
10 10 1 47 1 1
11 11 1 48 0 0
12 12 2 59 1 1
13 13 2 63 0 0
14 14 2 64 1 0
15 15 1 65 1 0
16 16 1 66 0 0
17 17 1 67 0 0
18 18 2 83 0 1
19 19 2 90 0 0
20 20 2 91 0 0
In the 'A' column we have a wide range of integers, and I'd like to create an extra column that assigns each case to one of the following categories:
<5; 5 - 19; 20 - 49; 50 - 79; 80+
So the first 3 rows of the column should be a string value that says "<5", "<5", "5 - 19"... and so on, and the last value in the column will be "80+".
I could write out something like this, but it seems very sloppy:
A_groups = ifelse(df$A<5, "<5", df$A)
A_groups = ifelse(df$A>4 & df$A<20, "5-19", A_groups)
A_groups = ifelse(df$A>19 & df$A<50, "20-49", A_groups)
What is the best alternative to this?
You're looking for the cut() function. You want to create a factor based on interval, which is what this function provides.
df$new_factor <- cut(df$A, breaks = c(-Inf, 5, 20, 50, 80, Inf),
labels = c('<5', '5-19', '20-49', '50-79', '80+'),
right = FALSE)
See the help page (?cut) for why I included right = FALSE: it makes the intervals closed on the left and open on the right, so a value of 20 falls into '20-49' rather than '5-19'. To double-check that the result is what you want, it helps to test boundary cases you are unsure about. For example, look at case == 5 (where A is 20) with and without right = FALSE and see what happens to new_factor.
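As a quick illustration of that boundary check, the sketch below runs cut() on a handful of values that sit on or next to a break, once with right = FALSE and once with the default. The vector edge and the data frame cmp are made-up names used only for this comparison.

```r
breaks <- c(-Inf, 5, 20, 50, 80, Inf)
labels <- c("<5", "5-19", "20-49", "50-79", "80+")

# Values sitting on or right next to a break
edge <- c(4, 5, 19, 20, 49, 50, 79, 80)

cmp <- data.frame(
  A           = edge,
  left_closed = cut(edge, breaks = breaks, labels = labels, right = FALSE),
  default     = cut(edge, breaks = breaks, labels = labels)  # right = TRUE
)
cmp
```

In the default column the values 5, 20, 50 and 80 end up one category too low for these labels, which is why right = FALSE is needed here.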
You can use cut() or findInterval().
breaks = c(0,5,20,50,80,Inf)
labels = c("<5", "5-19", "20-49", "50-79", "80+")
# Using cut()
df$A_groups = cut(df$A, breaks = breaks, right = FALSE, labels = labels)
# Using findInterval()
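# levels is given explicitly so the labels still line up if some interval is empty
df$B_groups = factor(findInterval(df$A, breaks), levels = seq_along(labels), labels = labels)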
I am trying to write a function to apply to a variable in a dataframe that, for a window of 3 days forward from the current observation, determines whether the current price decreases and then returns to the original price. The dataframe looks like this:
VarA VarB Date Price Diff VarD
1 1 2007-04-09 50 NA 0
1 1 2007-04-10 50 0 0
1 1 2007-04-11 48 -2 1
1 1 2007-04-12 48 0 1
1 1 2007-04-13 50 2 0
1 1 2007-04-14 50 0 0
1 1 2007-04-15 45 -5 1
1 1 2007-04-16 50 5 0
1 1 2007-04-17 45 -5 0
1 1 2007-04-18 48 3 0
1 1 2007-04-19 48 0 0
1 1 2007-04-20 50 2 0
Where VarA and VarB are grouping variables (in this example, they do not change), Price is the variable in which I wish to detect a decrease followed by an increase back to the starting level, and Diff is the lagged price difference (in case it is of any help).
VarD shows the result of applying the function I am trying to work out. There are two conditions for VarD to take the value 1: 1) the price decreases from a level and then, within the following two-day window, returns to the original level (i.e., 50 to 48 and back to 50 in rows 2 to 5, or 50 to 45 and back to 50 in rows 6 to 8); 2) the price has a maximum of two days to increase back to the starting level. Otherwise, VarD should take the value 0.
I do not have any clue how to start.
The dataframe db is:
db <- read.table(header = TRUE, sep = ",", text = "VarA,VarB,Date,Price,Diff
1,1,2007-04-09,50,NA
1,1,2007-04-10,50,0
1,1,2007-04-11,48,-2
1,1,2007-04-12,48,0
1,1,2007-04-13,50,2
1,1,2007-04-14,50,0
1,1,2007-04-15,45,-5
1,1,2007-04-16,50,5
1,1,2007-04-17,45,-5
1,1,2007-04-18,48,3
1,1,2007-04-19,48,0
1,1,2007-04-20,50,2")
Thanks in advance.
Hope I understood your requirements correctly:
library(dplyr)
db %>%
  # create Diff.2 as helper variable: increase in price from current day to 2 days later
  mutate(Diff.2 = diff(c(Price, NA, NA), lag = 2)) %>%
  mutate(Var.D = ifelse(
    Diff.2 + lag(Diff.2, 2) == 0 &  # condition 1: price increase from current day to 2 days later
                                    # is cancelled out by price decrease from 2 days ago to current day
      Diff.2 > 0,                   # condition 2: price increases from current day to 2 days later
    1, 0)) %>%
  mutate(Var.D = ifelse(is.na(Var.D), 0, Var.D)) %>%
  select(-Diff.2)
VarA VarB Date Price Diff Var.D
1 1 1 2007-04-09 50 NA 0
2 1 1 2007-04-10 50 0 0
3 1 1 2007-04-11 48 -2 1
4 1 1 2007-04-12 48 0 1
5 1 1 2007-04-13 50 2 0
6 1 1 2007-04-14 50 0 0
7 1 1 2007-04-15 48 -2 0
8 1 1 2007-04-16 49 1 0
9 1 1 2007-04-17 45 -4 0
10 1 1 2007-04-18 45 0 0
11 1 1 2007-04-19 45 0 0
12 1 1 2007-04-20 50 0 0
I think I found the solution, in case it is of interest. I used input from @G. Grothendieck, so he deserves most of the credit (but not the blame for any errors). The solution has four steps:
Step 1: create a dummy variable equal to 1 if prices decrease and for each month it stays low.
db$Tmp1 <- 0
for (n in seq_len(nrow(db))) {
  d <- db$Diff[n]
  fell <- !is.na(d) && d < 0                                   # price decreased today
  held <- !is.na(d) && d == 0 && n > 1 && db$Tmp1[n - 1] == 1  # unchanged after a flagged day
  db$Tmp1[n] <- as.numeric(fell || held)
}
That is: if the price at date [n] decreases, or if the previous value of Tmp1 is equal to 1 and the price does not change, assign the value 1; otherwise assign 0.
Step 2: restrict the number of days the price can stay lower in Step 1 to two days (thanks to @G. Grothendieck).
library(zoo)  # provides rollapply()
loop <- function(x) if (all(x[-1] == 1)) 0 else x[1]
roll <- rollapply(db$Tmp1, 3, loop, partial = TRUE, align = "left")
db$Tmp2 <- ifelse(!is.na(db$Diff) & db$Diff < 0, roll, 0)
# carry a flag forward over days on which the price is unchanged
for (n in seq_len(nrow(db))[-1])
  if (db$Diff[n] %in% 0 && db$Tmp2[n - 1] == 1) db$Tmp2[n] <- 1
loop is a function that returns 0 if all values except the current date's are equal to 1, and otherwise returns the current value of Tmp1. If the price decreases (db$Diff < 0), loop is applied to the 3 values of Tmp1 from the current date forward; if the price does not change and the previous value of Tmp2 is 1, a value of 1 is assigned. Otherwise 0 is assigned.
Step 3: check whether the price before the decrease is repeated within a three-day window from the original price.
loop2 <- function(x) if (any(x[-1] == x[1])) 1 else 0
db$Tmp3 <- rollapply(db$Price, 4, loop2, partial = TRUE, align = "left")
The function loop2 checks whether the current price is repeated within the next 3 days (hence the width of 4 in the rollapply call). Tmp3 is then the result of applying loop2 to the Price vector (following this thread: Ifelse statement with dataframe subset using date).
Step 4: Multiply Tmp2 and Tmp3 to obtain the result (and delete the auxiliary variables).
db$Sale <- db$Tmp2 * db$Tmp3
db$Tmp1 <- db$Tmp2 <- db$Tmp3 <- NULL
Sale is simply the product of Tmp2 and Tmp3: the first restricts sales to a 3-day window, and the second shows whether the original price at the start of the decrease reappears within the 3-day window.
Hope it's useful to someone. If anyone has corrections or suggestions, they are very welcome. Lastly, each step should be applied within each VarA and VarB group, so each step should go inside the following code:
db <-
db %>% group_by(VarA, VarB) %>%
mutate(
code
)
The output is:
VarA VarB Date Price Diff Tmp1 Tmp2 Tmp3 Sale
<int> <int> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2007-04-09 50 0 0 NA 1 NA
2 1 1 2007-04-10 50 0 0 0 1 0
3 1 1 2007-04-11 48 -2 1 1 1 1
4 1 1 2007-04-12 48 0 1 1 1 1
5 1 1 2007-04-13 50 2 0 0 1 0
6 1 1 2007-04-14 50 0 0 0 0 0
7 1 1 2007-04-15 48 -2 1 1 0 0
8 1 1 2007-04-16 49 1 0 0 0 0
9 1 1 2007-04-17 45 -4 1 0 1 0
10 1 1 2007-04-18 45 0 1 0 1 0
11 1 1 2007-04-19 45 0 1 0 0 0
12 1 1 2007-04-20 50 5 0 0 0 0
Thanks a lot.
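For comparison, the rule stated in the question (a drop that returns to the pre-drop price within two days, flagging the low days) can also be sketched as a plain base R loop. This is not code from the thread: flag_dips and max_days are hypothetical names, and the sketch assumes there are no missing prices.

```r
# Hypothetical helper (not from the thread): flag the low days of a dip that
# recovers to the pre-drop price within `max_days` days after the drop.
# Assumes `price` contains no NA values.
flag_dips <- function(price, max_days = 2) {
  n <- length(price)
  out <- integer(n)
  for (i in seq_len(n)[-1]) {
    if (i == n || !(price[i] < price[i - 1])) next  # not a drop, or nothing ahead
    base  <- price[i - 1]                            # level the price must return to
    ahead <- (i + 1):min(i + max_days, n)            # days allowed for the recovery
    hit   <- which(price[ahead] == base)
    # flag every day from the drop up to the day before the recovery
    if (length(hit) > 0) out[i:(ahead[hit[1]] - 1)] <- 1L
  }
  out
}

# Prices from the db example in the question
price <- c(50, 50, 48, 48, 50, 50, 45, 50, 45, 48, 48, 50)
flag_dips(price)
# 0 0 1 1 0 0 1 0 0 0 0 0
```

On the example prices this reproduces the VarD column from the question. To respect the grouping variables, it could be applied per group with something like ave(db$Price, db$VarA, db$VarB, FUN = flag_dips).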
I have a dataframe with person_id, study_id columns like below:
person_id study_id
10 1
11 2
10 3
10 4
11 5
I want to get the count of persons (unique by person_id) with 1 study, 2 studies, and so on; that is, not those with a particular value of study_id, but:
2 persons with 1 study
3 persons with 2 studies
1 person with 3 studies
etc
How can I do this? I think maybe a count through loop but I wonder if there is a package that makes it easier?
To get a sample data set that better matches your expected output, I'll use this:
dd <- data.frame(
person_id = c(10, 11, 15, 12, 10, 13, 10, 11, 12, 14, 15),
study_id = 1:11
)
Now I can count the number of people with a given number of studies with:
table(rowSums(with(dd, table(person_id, study_id))>0))
# 1 2 3
# 2 3 1
where the top line is the number of studies, and the bottom line is the number of people with that number of studies.
This works because
with(dd, table(person_id, study_id))
returns
study_id
person_id 1 2 3 4 5 6 7 8 9 10 11
10 1 0 0 0 1 0 1 0 0 0 0
11 0 1 0 0 0 0 0 1 0 0 0
12 0 0 0 1 0 0 0 0 1 0 0
13 0 0 0 0 0 1 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 1 0
15 0 0 1 0 0 0 0 0 0 0 1
and then we use >0 and rowSums to get a count of unique studies for each person. Then we use table again to summarize the results.
If creating the table for your data takes up too much RAM, you can try
table(with(dd, tapply(study_id, person_id, function(x) length(unique(x)))))
which is a slightly different way to get at the same thing.
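Another compact variant, offered only as a sketch, is to apply table() twice: once to count rows per person, and once more to count how many persons share each count. It uses the same dd as above, and unique() is there to drop repeated (person, study) pairs. The names studies_per_person and persons_per_count are illustrative.

```r
dd <- data.frame(
  person_id = c(10, 11, 15, 12, 10, 13, 10, 11, 12, 14, 15),
  study_id  = 1:11
)

# Drop repeated (person, study) pairs, count studies per person,
# then count how many persons have each number of studies
studies_per_person <- table(unique(dd)$person_id)
persons_per_count  <- table(studies_per_person)
persons_per_count
```

This avoids building the full person-by-study table, so it should need far less memory on large data.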
You can use the aggregate function to get counts per user, then use it again to get counts of those counts.
For example, assume your data is called "test":
person_id study_id
10 1
11 2
10 3
10 4
11 5
12 NA
You can set your NA values to a number such as zero so that they are not ignored, i.e.
test$study_id[is.na(test$study_id)] = 0
Then you can run the same function but with a condition that the study_id has to be greater than zero
stg=setNames(
aggregate(
study_id~person_id,
data=test,function(x){sum(x>0)}),
c("person_id","num_studies"))
Output:
stg
person_id num_studies
10 3
11 2
12 0
Then do the same to get counts of counts
setNames(
aggregate(
person_id~num_studies,
data=stg,length),
c("num_studies","num_users"))
Output:
num_studies num_users
0 1
2 1
3 1
Here's a solution using dplyr
library(dplyr)
tmp <- df %>%
  group_by(person_id) %>%
  summarise(num.studies = n()) %>%
  group_by(num.studies) %>%
  summarise(num.persons = n())
> dat <- read.table(h=T, text = "person_id study_id
10 1
11 2
10 3
10 4
11 5
12 6")
I think you can just use xtabs for this. I may have misunderstood the question, but it seems like that's what you want.
> table(xtabs(dat))
# 10 11 12
# 3 2 1
df <- data.frame(
person_id = c(10,11,10,10,11,11,11),
study_id = c(1,2,3,4,5,5,1))
# remove replicated rows
df <- unique(df)
# number of studies each person has been in:
summary(as.factor(df$person_id))
#10 11
# 3 3
# number of people in each study
summary(as.factor(df$study_id))
# 1 2 3 4 5
# 2 1 1 1 1