I want to create a loop with a large dataframe in R

Problem
I want to create a loop over the data in df1; it's important that the data are processed one ID value at a time.
I'm unsure how this can be done in R.
#original dataset
id=c(1,1,1,2,2,2,3,3,3)
dob=c("11-08","12-04","04-03","10-04","03-07","06-02","12-09","01-01","03-08")
count=c(1,6,3,2,5,6,8,6,4)
outcome=rep(1:0,length.out=9)
df1=data.frame(id,dob,count,outcome)
# the change below needs to be applied separately for each id value
df2<-df1[df1$id==1,]
df2<-df2[,-4]
addition<-df2$count+45
df2<-cbind(df2,addition)
df3<-df1[df1$id==2,]
df3<-df3[,-4]
addition<-df3$count+45
df3<-cbind(df3,addition)
df4<-df1[df1$id==3,]
df4<-df4[,-4]
addition<-df4$count+45
df4<-cbind(df4,addition)
df5<-rbind(df2,df3,df4)
Expected Output
df5<-rbind(df2,df3,df4)
id dob count addition
1 1 11-08 1 46
2 1 12-04 6 51
3 1 04-03 3 48
4 2 10-04 2 47
5 2 03-07 5 50
6 2 06-02 6 51
7 3 12-09 8 53
8 3 01-01 6 51
9 3 03-08 4 49

In the present context (which may be a simplified example) there is no need to loop at all, as we can add the number directly to 'count':
df1$addition <- df1$count + 45
However, if the operation is more complicated and needs to treat each 'id' separately, then do a group_by operation:
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(addition = count + 45)
# A tibble: 9 x 5
# Groups: id [3]
# id dob count outcome addition
# <dbl> <fct> <dbl> <int> <dbl>
#1 1 11-08 1 1 46
#2 1 12-04 6 0 51
#3 1 04-03 3 1 48
#4 2 10-04 2 0 47
#5 2 03-07 5 1 50
#6 2 06-02 6 0 51
#7 3 12-09 8 1 53
#8 3 01-01 6 0 51
#9 3 03-08 4 1 49
Also, data.table syntax would be
library(data.table)
setDT(df1)[, addition := count + 45, by = id]
or simply
setDT(df1)[, addition := count + 45]
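If you specifically want the explicit per-id loop the question asks about (useful when the per-id step is more involved than adding a constant), a minimal base R sketch could split df1 by id, transform each piece, and bind the pieces back together. This assumes df1 is still the original data.frame (i.e. before setDT was applied):
# split df1 into a list of data frames, one per id
pieces <- split(df1, df1$id)

# apply the per-id transformation to each piece, then stack the results
df5 <- do.call(rbind, lapply(pieces, function(d) {
  d <- d[, -4]                # drop the 'outcome' column, as in the question
  d$addition <- d$count + 45  # the per-id operation
  d
}))
df5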

Related

I want to delete redundant lines in my table in R

I have a huge table with information on 2 professionals in each line, which goes like this:
df1 <- data.frame("Date" = c(1,2,3,4), "prof1" = c(25,59,10,5), "prof2" = c(5,7,8,25))
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
#4 4 5 25
... ... ...
I want to delete line 4 because it's the same as line 1, just with the values swapped.
So I created a copy of that table with the values of the prof1 and prof2 columns switched, like this:
df2 <- data.frame("Date" = c(1,2,3,4), "prof2" = c(5,7,8,25), "prof1" = c(25,59,10,5))
# Date prof2 prof1
#1 1 5 25
#2 2 7 59
#3 3 8 10
#4 4 25 5
... ... ...
And executed the code:
df1<- df1[!do.call(paste, df1[2:3]) %in% do.call(paste, df2[2:3]), ]
But it ended up deleting line 1 as well, giving me this table:
# Date prof1 prof2
#2 2 59 7
#3 3 10 8
... ... ...
when what I wanted was this:
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
... ... ...
How can I delete only one of the lines that are similar to another?
If you don't care about which one of the duplicates you keep, you can just make sure that
prof2 > prof1 and then remove duplicates.
SWAP <- which(df2$prof2 < df2$prof1)   # rows where the pair is out of order
temp <- df2$prof2                      # copy the column before overwriting it
df2$prof2[SWAP] <- df2$prof1[SWAP]     # put the larger value in prof2
df2$prof1[SWAP] <- temp[SWAP]          # and the smaller value in prof1
df2 <- df2[!duplicated(df2[, 2:3]), ]  # drop repeated (prof2, prof1) pairs
df2
Date prof2 prof1
1 1 25 5
2 2 59 7
3 3 10 8
We can do this with apply to loop over the rows of the dataset, sort them, take the transpose, apply duplicated on it to get a logical vector, and use that to subset:
df1[!duplicated(t(apply(df1[-1], 1, sort))),]
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
Or another option is pmin/pmax
subset(df1, !duplicated(cbind(pmin(prof1, prof2), pmax(prof1, prof2))))
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
Or using filter from dplyr
library(dplyr)
df1 %>%
  filter(!duplicated(cbind(pmin(prof1, prof2), pmax(prof1, prof2))))
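For completeness, a data.table analogue of the pmin/pmax idea might look like the sketch below (assuming df1 is the original data.frame from the question):
library(data.table)
# keep only the first occurrence of each unordered (prof1, prof2) pair
setDT(df1)[!duplicated(cbind(pmin(prof1, prof2), pmax(prof1, prof2)))]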

How to reshape data frame from a row level to person level in R

I have the following code for a Netflix experiment to reduce the price of Netflix and see if people watch more or less TV. Each time someone uses Netflix, it shows what they watched and how long they watched it for.
library(tidyverse)
sample_size <- 10000
set.seed(853)
viewing_data <-
  tibble(unique_person_id = sample(x = c(1:100),
                                   size = sample_size,
                                   replace = TRUE),
         tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
                          size = sample_size,
                          replace = TRUE))
I then want to write some code that randomly assigns people to one of two groups, treatment and control. However, the dataset is at the row level, with 10,000 observations. I want to change it to the person level in R, so that I can assign each person to be either treated or not. A person should not be both treated and not treated; however, each person has many tv_show rows. Does anyone know how to reshape the dataset in this case?
library(dplyr)
treatment <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(treated = sample(c("yes", "no"), size = 100, replace = TRUE))

viewing_data %>%
  left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
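For instance, a hedged sketch of one alternative: if you wanted roughly 30% of people treated rather than an even split, the prob argument of sample could be used (the 0.3/0.7 split is only an assumed example):
treatment <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(treated = sample(c("yes", "no"), size = n(), replace = TRUE,
                          prob = c(0.3, 0.7)))  # roughly 30% yes, 70% no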
You can do the following, which groups your observations by person id and assigns a single "treated"/"control" value per group:
library(dplyr)
viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))
# A tibble: 10,000 x 3
# Groups: unique_person_id [100]
unique_person_id tv_show group
<int> <chr> <chr>
1 9 Drive to Survive control
2 64 Shetland treated
3 90 The Crown treated
4 93 Drive to Survive treated
5 17 Duty-Shame treated
6 29 The Crown control
7 84 Broadchurch control
8 83 The Crown treated
9 3 The Crown control
10 33 Broadchurch control
# … with 9,990 more rows
We can check our results; each of the ids has only 1 distinct group (treated or control):
newdata <- viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))

tapply(newdata$group, newdata$unique_person_id, n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.
library(dplyr)
Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = sample(100),                   # in case the ids are not truly random
         group = ifelse(group %% 2 == 0, 0, 1)) # works if there are only two groups
Persons
# A tibble: 100 x 2
unique_person_id group
<int> <dbl>
1 1 0
2 2 0
3 3 1
4 4 0
5 5 1
6 6 1
7 7 1
8 8 0
9 9 1
10 10 0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
group n
<dbl> <int>
1 0 50
2 1 50
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)
Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = complete_ra(N = 100, m = 50))
Persons %>% count(group) # Check
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
unique_person_id tv_show group
<int> <chr> <int>
1 10 Shetland 1
2 95 Broadchurch 0
3 7 Duty-Shame 1
4 68 Drive to Survive 0
5 17 Drive to Survive 1
6 70 Shetland 0
7 78 Drive to Survive 0
8 21 Broadchurch 1
9 80 The Crown 0
10 70 Shetland 0
# ... with 9,990 more rows

Better way of binning data in a group in a data frame by equal intervals

I have a dataframe which is characterized by many different IDs. For every ID there are multiple events, characterized by the cumulative time between events (hours) and the duration of each event (seconds). It would look something like:
Id <- c(1,1,1,1,1,1,2,2,2,2,2)
cumulative_time<-c(0,3.58,8.88,11.19,21.86,29.54,0,5,14,19,23)
duration<-c(188,124,706,53,669,1506.2,335,349,395,385,175)
test = data.frame(Id,cumulative_time,duration)
> test
Id cumulative_time duration
1 1 0.00 188.0
2 1 3.58 124.0
3 1 8.88 706.0
4 1 11.19 53.0
5 1 21.86 669.0
6 1 29.54 1506.2
7 2 0.00 335.0
8 2 5.00 349.0
9 2 14.00 395.0
10 2 19.00 385.0
11 2 23.00 175.0
I would like to group by ID and then restructure each group by binning the cumulative time into intervals of, say, 10 hours, and within each 10-hour bin sum the durations that occurred in that interval. The bins should span, say, 0 to 30 hours, so there would be 3 bins.
I looked at the cut function and managed to hack something together within a dataframe; even as a new R user I know it isn't pretty:
test_cut <- test %>%
  mutate(bin_durations = cut(test$cumulative_time, breaks = c(0, 10, 20, 30),
                             labels = c("10", "20", "30"), include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  mutate(total_duration = sum(duration)) %>%
  select(Id, bin_durations, total_duration) %>%
  distinct()
which gives the output:
test_cut
  Id bin_durations total_duration
1  1            10         1018.0
2  1            20           53.0
3  1            30         2175.2
4  2            10          684.0
5  2            20          780.0
6  2            30          175.0
Ultimately I want the interval window and number of bins to be arbitrary. If I have a span of 5000 hours and I want to bin in 1-hour samples, I would use breaks = seq(0, 5000, 1) and, for the bin labels, labels = as.character(seq(1, 5000, 1)).
This will also be applied to a very large data frame, so computational speed is somewhat desired.
A dplyr solution would be great since I am applying the binning per group.
My guess is there is a nice interaction between cut and perhaps split to generate the desired output.
Thanks in advance.
Update
After testing, I find that even my current implementation isn't quite what I'd like. If I say:
n <- 3
test_cut <- test %>%
  mutate(bin_durations = cut(test$cumulative_time, breaks = seq(0, 30, n),
                             labels = as.character(seq(n, 30, n)), include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  mutate(total_duration = sum(duration)) %>%
  select(Id, bin_durations, total_duration) %>%
  distinct()
I get
test_cut
# A tibble: 11 x 3
# Groups: Id, bin_durations [11]
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 3 188
2 1 6 124
3 1 9 706
4 1 12 53
5 1 24 669
6 1 30 1506.
7 2 3 335
8 2 6 349
9 2 15 395
10 2 21 385
11 2 24 175
Where there are no occurrences in the bin sequence, I should get 0 in the duration column rather than an omission.
Thus, it should look like:
test_cut
# A tibble: 20 x 3
# Groups: Id, bin_durations [20]
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 3 188
2 1 6 124
3 1 9 706
4 1 12 53
5 1 15 0
6 1 18 0
7 1 21 0
8 1 24 669
9 1 27 0
10 1 30 1506.
11 2 3 335
12 2 6 349
13 2 9 0
14 2 12 0
15 2 15 395
16 2 18 0
17 2 21 385
18 2 24 175
19 2 27 0
20 2 30 0
Here is one idea via integer division (%/%)
library(tidyverse)

test %>%
  group_by(Id, grp = cumulative_time %/% 10) %>%
  summarise(total_duration = sum(duration))
which gives,
# A tibble: 6 x 3
# Groups: Id [?]
Id grp total_duration
<dbl> <dbl> <dbl>
1 1 0 1018
2 1 1 53
3 1 2 2175.
4 2 0 684
5 2 1 780
6 2 2 175
To address your updated issue, we can use complete in order to add the missing rows. So, for the same example, binning in hours of 3,
test %>%
  group_by(Id, grp = cumulative_time %/% 3) %>%
  summarise(total_duration = sum(duration)) %>%
  ungroup() %>%
  complete(Id, grp = seq(min(grp), max(grp)), fill = list(total_duration = 0))
which gives,
# A tibble: 20 x 3
Id grp total_duration
<dbl> <dbl> <dbl>
1 1 0 188
2 1 1 124
3 1 2 706
4 1 3 53
5 1 4 0
6 1 5 0
7 1 6 0
8 1 7 669
9 1 8 0
10 1 9 1506.
11 2 0 335
12 2 1 349
13 2 2 0
14 2 3 0
15 2 4 395
16 2 5 0
17 2 6 385
18 2 7 175
19 2 8 0
20 2 9 0
We could make these changes:
test$cumulative_time inside the mutate can be simply cumulative_time
breaks could be factored out and then used in the cut as shown
the second mutate could be changed to summarize in which case the select and distinct are not needed
it is always a good idea to close any group_by with a matching ungroup, or in the case of summarize we can use .groups = "drop"
add complete to insert 0 for levels not present
Implementing these changes we have:
library(dplyr)
library(tidyr)

breaks <- seq(0, 40, 10)

test %>%
  mutate(bin_durations = cut(cumulative_time, breaks = breaks,
                             labels = breaks[-1], include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  summarize(total_duration = sum(duration), .groups = "drop") %>%
  complete(Id, bin_durations, fill = list(total_duration = 0))
giving:
# A tibble: 8 x 3
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 10 1018
2 1 20 53
3 1 30 2175.
4 1 40 0
5 2 10 684
6 2 20 780
7 2 30 175
8 2 40 0
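To make the interval width and span fully arbitrary, as the question ultimately wants, the same pipeline could be wrapped in a small helper; bin_summary below is a made-up name and the function is only a sketch of the idea:
library(dplyr)
library(tidyr)

# Sum 'duration' per Id within fixed-width bins of 'cumulative_time',
# filling bins with no events with 0.
bin_summary <- function(data, width, span) {
  breaks <- seq(0, span, by = width)
  data %>%
    mutate(bin_durations = cut(cumulative_time, breaks = breaks,
                               labels = breaks[-1], include.lowest = TRUE)) %>%
    group_by(Id, bin_durations) %>%
    summarize(total_duration = sum(duration), .groups = "drop") %>%
    complete(Id, bin_durations, fill = list(total_duration = 0))
}

bin_summary(test, width = 3, span = 30)   # 3-hour bins over 0-30 hours
bin_summary(test, width = 10, span = 30)  # the original 10-hour bins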

Select row meeting condition and all subsequent rows by group

Let's assume I have a data frame consisting of a categorical variable and a numerical one.
df <- data.frame(group=c(1,1,1,1,1,2,2,2,2,2),days=floor(runif(10, min=0, max=101)))
df
group days
1 1 54
2 1 61
3 1 31
4 1 52
5 1 21
6 2 22
7 2 18
8 2 50
9 2 46
10 2 35
I would like to select the row corresponding to the maximum number of days by group as well as all the following/subsequent group rows. For the example above, my subset df2 should look as follows:
df2
group days
2 1 61
3 1 31
4 1 52
5 1 21
8 2 50
9 2 46
10 2 35
Please note that the groups could have different lengths.
For a base R solution, aggregate days by group using a function that keeps the elements with index greater than or equal to that of the maximum, and then reshape as a long data.frame:
df0 <- aggregate(days ~ group, df, function(x) x[seq_along(x) >= which.max(x)])
data.frame(group = rep(df0$group, lengths(df0$days)),
           days = unlist(df0$days, use.names = FALSE))
leading to
group days
1 1 84
2 1 31
3 1 65
4 1 23
5 2 94
6 2 69
7 2 45
You can use which.max to find the index of the maximum of days, and then use slice from dplyr to select that row and all the rows after it, where n() gives the number of rows in each group:
library(dplyr)
df %>% group_by(group) %>% slice(which.max(days):n())
#Source: local data frame [7 x 2]
#Groups: group [2]
# group days
# <int> <int>
#1 1 61
#2 1 31
#3 1 52
#4 1 21
#5 2 50
#6 2 46
#7 2 35
The data.table syntax would be similar; .N is analogous to n() in dplyr and gives the number of rows in each group:
library(data.table)
setDT(df)[, .SD[which.max(days):.N], group]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
We can use a faster option with data.table where we find the row index (.I) and then subset the rows based on that.
library(data.table)
setDT(df)[df[ , .I[which.max(days):.N], by = group]$V1]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
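A base R alternative (not from the original answers, just a sketch assuming days is numeric and df is still a data.frame) uses ave to flag, within each group, the row holding the maximum and every row after it:
# 1 marks rows at or after the per-group maximum of 'days'; 0 otherwise
keep <- ave(df$days, df$group, FUN = function(x) seq_along(x) >= which.max(x))
df[keep == 1, ]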

Remove duplicate observations based on set of rules

I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal of observations to be based on the following rules. The variables below are id, the sex of household head (1-male, 2-female) and the age of the household head. The rules are as follows. If a household has both male and female household heads, remove the female household head observation. If a household as either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
You can do this by first ordering the data.frame so that the desired entry for each id comes first, and then removing the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
With data.table, this is easy with "compound queries". Set the "key" to "id,sex" when creating the data.table so the rows are ordered (required in case any female rows would come before male rows for a given ID).
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45
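For comparison, a dplyr version of the same order-then-deduplicate idea could be sketched as follows (not from the original answers):
library(dplyr)
# order so the preferred head comes first within each id
# (male before female, then oldest first), then keep one row per id
data %>%
  arrange(id, sex, desc(age)) %>%
  distinct(id, .keep_all = TRUE)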
