I have a dataframe with person_id, study_id columns like below:
person_id study_id
10 1
11 2
10 3
10 4
11 5
I want to get the count for number of persons (unique by person_id) with 1 study or 2 studies - so not those with particular value for study_id but:
2 persons with 1 study
3 persons with 2 studies
1 person with 3 studies
etc
How can I do this? I think maybe a count through loop but I wonder if there is a package that makes it easier?
To get a sample data set that better matches your expected output, I'll use this:
dd <- data.frame(
person_id = c(10, 11, 15, 12, 10, 13, 10, 11, 12, 14, 15),
study_id = 1:11
)
Now I can count the number of people with a given number of studies with:
table(rowSums(with(dd, table(person_id, study_id))>0))
# 1 2 3
# 2 3 1
Where the top line is the number of studies, and the bottom line is the number of people with that number of studies.
This works because
with(dd, table(person_id, study_id))
returns
study_id
person_id 1 2 3 4 5 6 7 8 9 10 11
10 1 0 0 0 1 0 1 0 0 0 0
11 0 1 0 0 0 0 0 1 0 0 0
12 0 0 0 1 0 0 0 0 1 0 0
13 0 0 0 0 0 1 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 1 0
15 0 0 1 0 0 0 0 0 0 0 1
and then we use >0 and rowSums to get a count of unique studies for each person. Then we use table again to summarize the results.
If creating the table for your data takes up too much RAM, you can try
table(with(dd, tapply(study_id, person_id, function(x) length(unique(x)))))
which is a slightly different way to get at the same thing.
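As a quick check, here is a minimal self-contained sketch that runs both approaches on the sample data dd from above and compares them. The names per_person, counts and via_table are only illustrative.

```r
# Sample data from the answer above
dd <- data.frame(
  person_id = c(10, 11, 15, 12, 10, 13, 10, 11, 12, 14, 15),
  study_id = 1:11
)

# Number of distinct studies for each person
per_person <- tapply(dd$study_id, dd$person_id, function(x) length(unique(x)))

# Number of persons for each study count
counts <- table(per_person)

# Same summary via the full person-by-study contingency table
via_table <- table(rowSums(table(dd$person_id, dd$study_id) > 0))

print(counts)
```

The tapply version never builds the person-by-study table, so its memory use grows with the number of rows rather than with persons times studies.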
You can use the aggregate function to get counts per user, then use it again to get counts of those counts.
For example, assume your data is called "test":
person_id study_id
10 1
11 2
10 3
10 4
11 5
12 NA
You can set the NA values to a number such as zero so that those rows are not dropped, i.e.
test$study_id[is.na(test$study_id)] = 0
Then you can run the same function but with a condition that the study_id has to be greater than zero
stg=setNames(
aggregate(
study_id~person_id,
data=test,function(x){sum(x>0)}),
c("person_id","num_studies"))
Output:
stg
person_id num_studies
10 3
11 2
12 0
Then do the same to get counts of counts
setNames(
aggregate(
person_id~num_studies,
data=stg,length),
c("num_studies","num_users"))
Output:
num_studies num_users
0 1
2 1
3 1
Here's a solution using dplyr
library(dplyr)
tmp <- df %>%
group_by(person_id) %>%
summarise(num.studies = n()) %>%
group_by(num.studies) %>%
summarise(num.persons = n())
> dat <- read.table(h=T, text = "person_id study_id
10 1
11 2
10 3
10 4
11 5
12 6")
I think you can just use xtabs for this. I may have misunderstood the question, but it seems like that's what you want.
> table(xtabs(dat))
# 10 11 12
# 3 2 1
df <- data.frame(
person_id = c(10,11,10,10,11,11,11),
study_id = c(1,2,3,4,5,5,1))
# remove replicated rows
df <- unique(df)
# number of studies each person has been in:
summary(as.factor(df$person_id))
#10 11
# 3  3
# number of people in each study
summary(as.factor(df$study_id))
# 1 2 3 4 5
# 2 1 1 1 1
I'm struggling to figure out how to do the following. I want to count how many consecutive days of high sales I have immediately before a price change. For example, I have a price change on day 10, and the high sales indicator flags any day with sales greater than or equal to 10. I need my algorithm to count the number of consecutive high-sales days.
In this case it should return 5 (day 5 to 9)
For example purposes, the dataframe is called df. Code:
#trying to create a while loop that will check if lag(high_sales) is 1, if yes it will count until
#there's a lag(high_sales) ==0
#loop is just my dummy variable that will take me out of the while loop
count_sales<-0
loop<-0
df<- df %>% mutate(consec_high_days= ifelse(price_change > 0, while(loop==0){
if(lag(High_sales_ind)==1){
count_sales<-count_sales +1}
else{loop<-0}
count_sales},0))
day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0
This is my error message:
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i the condition has length > 1 and only the first element will be used
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i 'x' is NULL so the result will be NULL
Error: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
x replacement has length zero
Any help would be greatly appreciated.
This is a very inelegant brute-force answer, though hopefully someone better than me can provide a more elegant answer - but to get the desired dataset, you can try:
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)
library(dplyr)
# number each row within its run of consecutive equal values of (sales >= 10)
df$seq <- sequence(rle(as.character(df$sales >= 10))$lengths)
# run length reached on the day before each row
df <- df %>% mutate(lseq = lag(seq))
# length of the run ending just before the price change, and the row where it ends
keepz <- df[df$price_change != 0, "lseq"]
end <- as.numeric(rownames(df[df$price_change != 0,]))-1
# keep the last keepz rows up to end (assumes a single price change), dropping the helper columns
df_want <- df[(end - keepz + 1):end,-c(6:7)]
Output:
# day price price_change sales High_sales_ind
# 5 5 5 0 10 1
# 6 6 5 0 10 1
# 7 7 5 0 10 1
# 8 8 5 0 12 1
# 9 9 5 0 14 1
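If only the count is needed (5 in the example) rather than the rows themselves, a base R sketch along the following lines may work. It assumes High_sales_ind is already 0/1 and rows are ordered by day. The helper names streak and prev_streak are made up for illustration.

```r
df <- data.frame(
  day = 1:11,
  price_change = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0),
  High_sales_ind = c(1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0)
)

# Running length of the current streak of high-sales days.
# Every 0 starts a new group, so the cumulative sum restarts there.
streak <- ave(df$High_sales_ind, cumsum(df$High_sales_ind == 0), FUN = cumsum)

# Streak length as of the previous day (0 for the first row)
prev_streak <- c(0, head(streak, -1))

# Report the streak only on days where the price changed
df$consec_high_days <- ifelse(df$price_change > 0, prev_streak, 0)

print(df[df$price_change > 0, c("day", "consec_high_days")])
```

This avoids the while loop inside mutate(), which cannot work because lag() and ifelse() operate on whole columns rather than one row at a time.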
I have the following example.
I want to create a new column with the absolute difference in AGE compared to each Treat==1 in the same PairID.
Desired output should be as shown below.
Complete data:
Treat <- c(1,0,0,1,0,0,1,0)
PairID <- c(1,1,1,2,2,2,3,3)
Age <- c(30,60,31,20,20,40,50,52)
D <- data.frame(Treat,PairID,Age)
D
I have tried using dplyr with:
D %>%
group_by(PairID) %>%
abs(Age - Age[Treat == 1])
In base R:
D$absD <- unlist(lapply(split(D,D$PairID), function(x) abs(x$Age - x$Age[x$Treat==1])))
> D
Treat PairID Age absD
1 1 1 30 0
2 0 1 60 30
3 0 1 31 1
4 1 2 20 0
5 0 2 20 0
6 0 2 40 20
7 1 3 50 0
8 0 3 52 2
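The split/unlist approach lines up with the original rows only when D is already sorted by PairID. As a hedged alternative, the sketch below looks up each row's reference age with match(), so it does not depend on row order. It assumes exactly one Treat == 1 row per PairID; treated and ref_age are illustrative names.

```r
D <- data.frame(
  Treat  = c(1, 0, 0, 1, 0, 0, 1, 0),
  PairID = c(1, 1, 1, 2, 2, 2, 3, 3),
  Age    = c(30, 60, 31, 20, 20, 40, 50, 52)
)

# One row per pair: the treated member
treated <- D[D$Treat == 1, ]

# For every row, the Age of the treated member of the same pair
ref_age <- treated$Age[match(D$PairID, treated$PairID)]

D$absD <- abs(D$Age - ref_age)
print(D)
```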
I have a data set with 2 VPs and 350 interval values for each. I am writing a loop with an if statement to flag when a minimum value of VP1 overlaps with the maximum value of VP2.
The data usually sorts by VP, but I arranged to sort by minimum since it is a timeframe.
I ran the following code that worked to assign 0 or 1 when the values overlap the previous item, but it does not account for what the previous item is (ie. whether the previous item is VP1 or VP2).
for (i in 2:length(df$newvariable)) {
if (df$minimum[i] < df$maximum[i-1]){
df$newvariable[i] <- 0
} else {
df$newvariable[i] <- 1
}
}
I want to say if df$minimum[i] of VP1 < df$maximum[i] of VP2, then df$newvariable = 0. Otherwise, df$newvariable = 1.
I have not been able to find how to make it conditional per each row and loop again. Does anyone have any recommendations?
Many thanks.
Sample Data:
VP xmin xmax
1 0 6
2 0 2
2 6 14
1 14 24
2 20 30
1 30 36
... And so on for 600 or so rows.
Desired Output:
VP xmin xmax newvariable
1 0 6 -
2 0 2 0
2 6 14 1
1 14 24 1
2 20 30 0
1 30 36 1
Suppose the dataframe has another variable and I subsetted it to look at only one value of that variable. For example, the variable is talking, with values 1 (yes) or 0 (no). I originally subsetted to just the 0 rows and created new variables, like quiet_together. However, I now want to put these dataframes back together, and the separate dataframes have different added columns. If I want exactly the same thing as described above, but on the combined dataframe (instead of 2 separate ones), how would I specify the condition for each value of the variable? I want to end up with two new columns based on the xmin and xmax values while accounting for the value of the talking variable. The new columns would be talk_together (for the 1 value of the talking variable) and quiet_together (for the 0 value of the talking variable), set when xmin <= xmax for the previous line.
For example:
Sample Data:
VP xmin xmax talking
1 0 6 0
2 0 2 0
2 2 6 1
2 6 14 0
1 6 14 1
2 14 24 1
1 14 20 0
1 20 30 1
2 24 32 0
1 30 32 0
... And so on for 600 or so rows.
Desired Output:
VP xmin xmax talking talk_together quiet_together
1 0 6 0 0 0
2 0 2 0 0 0
2 2 6 1 0 0
2 6 14 0 0 0
1 6 14 1 0 0
1 14 20 0 0 0
2 14 24 1 1 0
1 20 30 1 1 0
2 24 32 0 0 1
1 30 32 0 0 1
You could use lag from dplyr to compare with previous xmax value.
library(dplyr)
df %>% mutate(newvariable = as.integer(xmin >= lag(xmax)))
# VP xmin xmax newvariable
#1 1 0 6 NA
#2 2 0 2 0
#3 2 6 14 1
#4 1 14 24 1
#5 2 20 30 0
#6 1 30 36 1
Or shift with data.table
library(data.table)
setDT(df)[, newvariable := +(xmin >= shift(xmax))]
Base R alternatives are:
df$newvariable <- as.integer(c(NA, df$xmin[-1] >= df$xmax[-nrow(df)]))
and
df$newvariable <- +c(NA, tail(df$xmin, -1) >= head(df$xmax, -1))
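To try this on the sample data from the question, here is a minimal base R sketch. The column prev_other_vp is a hypothetical addition, not part of the answer. It records whether the previous row belongs to the other VP, in case the comparison should be restricted to those rows.

```r
df <- data.frame(
  VP   = c(1, 2, 2, 1, 2, 1),
  xmin = c(0, 0, 6, 14, 20, 30),
  xmax = c(6, 2, 14, 24, 30, 36)
)
n <- nrow(df)

# 1 when this row starts at or after the previous row's xmax, else 0.
# The first row has no previous row, so it gets NA.
df$newvariable <- as.integer(c(NA, df$xmin[-1] >= df$xmax[-n]))

# Hypothetical helper: TRUE when the previous row is the other VP
df$prev_other_vp <- c(NA, df$VP[-1] != df$VP[-n])

print(df)
```

If only cross-VP comparisons should count, one option is to combine the two columns afterwards, for example with ifelse(df$prev_other_vp, df$newvariable, NA).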
With data.table, we can do
library(data.table)
setDT(df)[, newvariable := as.integer(xmin >= shift(xmax))]
My use case involves filtering a dataframe on some condition. Once I get the subset dataframe, I want to traverse the subset one row at a time, check a certain condition, and update a value in that particular row.
Here is my implementation:
> sales_data[sales_data$month == 1 & sales_data$dept_name == 1,]
emp_name month dept_name revenue status n_points x_partition y_partition x y
1 Sam 1 1 100 Low 9 3 3 0 0
7 Kenneth 1 1 500 Very High 9 3 3 0 0
11 Jonathan 1 1 500 Low 9 3 3 0 0
12 Sam 1 1 100 Low 9 3 3 0 0
18 Kenneth 1 1 500 Very High 9 3 3 0 0
22 Jonathan 1 1 500 Low 9 3 3 0 0
23 Sam 1 1 100 Low 9 3 3 0 0
29 Kenneth 1 1 500 Very High 9 3 3 0 0
33 Jonathan 1 1 500 Low 9 3 3 0 0
Now, my subset dataframe has 9 rows. So, a for loop:
for(i in 1:nrow(sales_data[sales_data$month == 1 & sales_data$dept_name == 1, ] )) {
#Here I want to update the value of column named x with i
sales_data[sales_data$month == month_item & sales_data$dept_name == dept_item, ][i]$x <- x_vector_data[i] ##NOT CORRECT APPROACH
}
Why loop? Maybe:
sales_data[sales_data$month == 1 & sales_data$dept_name == 1, "x"] <- x_vector_data
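A small self-contained sketch of that vectorised assignment follows. The data here are made up for illustration (only the column names come from the question), and x_vector_data is assumed to have one value per matching row.

```r
# Toy stand-in for sales_data; the values are invented
sales_data <- data.frame(
  emp_name  = c("Sam", "Ann", "Kenneth", "Jonathan"),
  month     = c(1, 2, 1, 1),
  dept_name = c(1, 1, 1, 2),
  x         = 0
)

# Logical index of the rows in the subset
idx <- sales_data$month == 1 & sales_data$dept_name == 1

# One replacement value per matching row (here simply 1, 2, ...)
x_vector_data <- seq_len(sum(idx))

# Assign into the original data frame, not into a copy of the subset
sales_data[idx, "x"] <- x_vector_data
print(sales_data)
```

Storing the condition in idx keeps the filter in one place, and the assignment modifies sales_data itself rather than a temporary subset.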
My data frame looks like this
personID t1 t2 t3
1 0 11 0
1 0 11 0
2 0 11 13
2 0 11 13
3 0 0 0
3 0 0 0
I need to make sure that each person has at least one test score above 10. If they do not, they have to be removed from the data frame. I also want to keep track of the lowest score above 10, and add it to a new column.
Thus, the result would look like this:
personID t1 t2 t3 new
1 0 11 0 11
1 0 11 0 11
2 0 11 13 11
2 0 11 13 11
If I was to go the data.table route, I think you could do it with a melt and join:
library(data.table)
setDT(dat)
dat[
melt(dat, id.vars="personID")[value > 10, .(new=min(value)), by=personID],
on="personID"
]
# personID t1 t2 t3 new
#1: 1 0 11 0 11
#2: 1 0 11 0 11
#3: 2 0 11 13 11
#4: 2 0 11 13 11
Using data.table:
library(data.table)
#convert your data (named DF here) to use data.table syntax
setDT(DF)
DF[ , {
# all scores for this person, flattened into one vector
v = unlist(.SD)
# confirm acceptance condition: at least one score above 10
if (any(v > 10))
# add new column (lowest score above 10) by appending it to current data
c(.SD, list(new = min(v[v > 10])))
}, by = personID]
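For comparison, the same filter-and-annotate step can be sketched in base R without data.table. This is only one possible approach. It assumes the score columns are t1, t2 and t3 as in the question; row_min and result are illustrative names.

```r
dat <- data.frame(
  personID = c(1, 1, 2, 2, 3, 3),
  t1 = c(0, 0, 0, 0, 0, 0),
  t2 = c(11, 11, 11, 11, 0, 0),
  t3 = c(0, 0, 13, 13, 0, 0)
)

# Blank out every score that is not above 10
scores <- as.matrix(dat[, c("t1", "t2", "t3")])
scores[scores <= 10] <- NA

# Lowest remaining score; NA when nothing is left
min_or_na <- function(v) if (all(is.na(v))) NA_real_ else min(v, na.rm = TRUE)

# Lowest qualifying score per row, then per person
row_min <- apply(scores, 1, min_or_na)
dat$new <- ave(row_min, dat$personID, FUN = min_or_na)

# Drop persons without any score above 10
result <- dat[!is.na(dat$new), ]
print(result)
```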