Below is the sample data. The task is to sum quarter1 and quarter2 for ownership code 30, excluding indcode 115, and then append a new row that contains this sum. In Excel this is very simple, but I am hoping to automate it a bit using R. The bottom half of the desired result is below. First question: would I pivot_wider so that I am summing columns, not rows?
area <- c(000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000)
indcode <- c(110,111,112,113,114,115,110,111,112,113,114,115)
quarter1 <- c(NA,2,4,6,16,3,NA,1,2,3,8,2)
quarter2 <- c(2,3,5,7,22,1,9,1,2,4,11,1)
ownership <- c(00,00,00,00,00,00,30,30,30,30,30,30)
employment <- data.frame(area,indcode,quarter1,quarter2,ownership)
area indcode quarter1 quarter2 ownership
000000 111 1 1 30
000000 112 2 2 30
000000 113 3 4 30
000000 114 8 11 30
000000 115 2 1 30
000000 993 14 18 30
I've assumed you want this done for area groups, but if not you can delete the group_by(area) line.
employment %>%
group_by(area) %>%
summarize(
across(quarter1:quarter2, ~sum(.x[ownership == 30 & indcode != 115], na.rm = TRUE)),
indcode = 993,
ownership = 30
) %>%
bind_rows(employment, .)
# area indcode quarter1 quarter2 ownership
# 1 0 111 2 3 0
# 2 0 112 4 5 0
# 3 0 113 6 7 0
# 4 0 114 16 22 0
# 5 0 115 3 1 0
# 6 0 111 1 1 30
# 7 0 112 2 2 30
# 8 0 113 3 4 30
# 9 0 114 8 11 30
# 10 0 115 2 1 30
# 11 0 993 14 18 30
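To the pivot question: no reshaping is needed, because the sums run down columns within a filtered set of rows. A quick base R cross-check of the same filtering logic (a sketch, rebuilding the sample frame inline so it runs standalone):

```r
# Rebuild the sample data as defined in the question
employment <- data.frame(
  indcode   = c(110, 111, 112, 113, 114, 115, 110, 111, 112, 113, 114, 115),
  quarter1  = c(NA, 2, 4, 6, 16, 3, NA, 1, 2, 3, 8, 2),
  quarter2  = c(2, 3, 5, 7, 22, 1, 9, 1, 2, 4, 11, 1),
  ownership = c(0, 0, 0, 0, 0, 0, 30, 30, 30, 30, 30, 30)
)

# Same condition as in the summarize(): ownership 30, indcode 115 excluded
keep <- employment$ownership == 30 & employment$indcode != 115
colSums(employment[keep, c("quarter1", "quarter2")], na.rm = TRUE)
```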
How can I transpose specific columns in a data.frame, as follows:
id<- c(1,2,3)
t0<- c(0,0,0)
bp0<- c(88,95,79)
t1<- c(15,12,12)
bp1<- c(92,110,82)
t2<- c(25,30,20)
bp2<- c(75,99,88)
df1<- data.frame(id, t0, bp0, t1, bp1, t2, bp2)
df1
> df1
id t0 bp0 t1 bp1 t2 bp2
1 1 0 88 15 92 25 75
2 2 0 95 12 110 30 99
3 3 0 79 12 82 20 88
In order to obtain:
> df2
id t bp
1 1 0 88
2 2 0 95
3 3 0 79
4 1 15 92
5 2 12 110
6 3 12 82
7 1 25 75
8 2 30 99
9 3 20 88
In df2, t collects (t0, t1, t2) and bp collects (bp0, bp1, bp2) for the corresponding id.
Using base R, you can do:
Code
df2 <- cbind(df1[1],
             stack(df1, startsWith(names(df1), "t"))[1],
             stack(df1, startsWith(names(df1), "bp"))[1])
names(df2)[2:3] <- c("t", "bp")
Output
df2
#> id t bp
#> 1 1 0 88
#> 2 2 0 95
#> 3 3 0 79
#> 4 1 15 92
#> 5 2 12 110
#> 6 3 12 82
#> 7 1 25 75
#> 8 2 30 99
#> 9 3 20 88
Created on 2022-02-14 by the reprex package (v2.0.1)
Here is a solution with pivot_longer using names_pattern:
\\w+ = one or more word characters
\\d+ = one or more digits
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer (
-id,
names_to = c(".value", "name"),
names_pattern = "(\\w+)(\\d+)"
) %>%
select(-name)
id t bp
<dbl> <dbl> <dbl>
1 1 0 88
2 1 15 92
3 1 25 75
4 2 0 95
5 2 12 110
6 2 30 99
7 3 0 79
8 3 12 82
9 3 20 88
A base R option using reshape
reshape(
setNames(df1, sub("(\\d+)", ".\\1", names(df1))),
direction = "long",
idvar = "id",
varying = -1
)
gives
id time t bp
1.0 1 0 0 88
2.0 2 0 0 95
3.0 3 0 0 79
1.1 1 1 15 92
2.1 2 1 12 110
3.1 3 1 12 82
1.2 1 2 25 75
2.2 2 2 30 99
3.2 3 2 20 88
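A small simplification worth checking against ?reshape: with sep = "", reshape splits the varying names at the letter/digit border itself, so the setNames/sub renaming step can be dropped:

```r
df1 <- data.frame(
  id  = c(1, 2, 3),
  t0  = c(0, 0, 0),    bp0 = c(88, 95, 79),
  t1  = c(15, 12, 12), bp1 = c(92, 110, 82),
  t2  = c(25, 30, 20), bp2 = c(75, 99, 88)
)

# sep = "" tells reshape to split varying names where letters meet digits
df2 <- reshape(df1, direction = "long", idvar = "id", varying = -1, sep = "")
```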
I have the following code for a Netflix experiment to reduce the price of Netflix and see if people watch more or less TV. Each time someone uses Netflix, it shows what they watched and how long they watched it for.
library(tidyverse)
sample_size <- 10000
set.seed(853)
viewing_data <-
tibble(unique_person_id = sample(x = c(1:100),
size = sample_size,
replace = TRUE),
tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
size = sample_size,
replace = TRUE),
)
I then want to write some code that randomly assigns people to one of two groups: treatment and control. However, the dataset is at the row level, with 10,000 observations. I want to change it to the person level in R so I can assign each person to be either treated or not. A person should not be both treated and not treated. However, tv_show appears many times for one person. Does anyone know how to reshape the dataset in this case?
library(dplyr)
treatment <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(treated = sample(c("yes", "no"), size = 100, replace = TRUE))
viewing_data %>%
left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
You can do the below; it groups your observations by person id and assigns a single "treated"/"control" value per group:
library(dplyr)
viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
# A tibble: 10,000 x 3
# Groups: unique_person_id [100]
unique_person_id tv_show group
<int> <chr> <chr>
1 9 Drive to Survive control
2 64 Shetland treated
3 90 The Crown treated
4 93 Drive to Survive treated
5 17 Duty-Shame treated
6 29 The Crown control
7 84 Broadchurch control
8 83 The Crown treated
9 3 The Crown control
10 33 Broadchurch control
# … with 9,990 more rows
We can check our results: every id has only one group, treated or control:
newdata <- viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
tapply(newdata$group,newdata$unique_person_id,n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
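The same uniqueness check can be written with dplyr instead of tapply. A sketch (the data are rebuilt inline so it runs standalone; the seed value is arbitrary):

```r
library(dplyr)

set.seed(853)
viewing_data <- tibble(
  unique_person_id = sample(1:100, 10000, replace = TRUE),
  tv_show = sample(c("Broadchurch", "Shetland", "The Crown"), 10000, replace = TRUE)
)

newdata <- viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1)) %>%
  ungroup()

# Any id mapped to more than one group would show up here; expect zero rows
newdata %>%
  group_by(unique_person_id) %>%
  summarise(n_groups = n_distinct(group)) %>%
  filter(n_groups != 1)
```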
In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.
library(dplyr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=sample(100), # in case the ids are not truly random
group=ifelse(group %% 2 == 0, 0, 1)) # works if only two groups
Persons
# A tibble: 100 x 2
unique_person_id group
<int> <dbl>
1 1 0
2 2 0
3 3 1
4 4 0
5 5 1
6 6 1
7 7 1
8 8 0
9 9 1
10 10 0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
group n
<dbl> <int>
1 0 50
2 1 50
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=complete_ra(N=100, m=50))
Persons %>% count(group) # Check
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
unique_person_id tv_show group
<int> <chr> <int>
1 10 Shetland 1
2 95 Broadchurch 0
3 7 Duty-Shame 1
4 68 Drive to Survive 0
5 17 Drive to Survive 1
6 70 Shetland 0
7 78 Drive to Survive 0
8 21 Broadchurch 1
9 80 The Crown 0
10 70 Shetland 0
# ... with 9,990 more rows
I am trying to rename grouped unique id sequentially using dplyr in R.
There are five columns in the data frame as below.
## Load package if necessary
library(tidyverse)
## Set data frame
df <- data.frame(
hid=c(10001,10001,10001,10001,10002,10002,10002,10002,10002,
10003,10003,10003,10003,10003,10003,10004,10004,10004,10004,10004),
mid=c(1,2,3,4,1,2,3,4,5,1,2,3,4,5,6,1,2,3,4,5),
tmc=c(010,01010,0,01020,010,010,010,010,010,010,010,010,0,010,010,010,0,01010,010,01010),
thc=c(010,01010,0,02030,010,020,020,020,030,010,010,010,0,020,030,010,0,02020,030,04040),
mdc=c(000,01010,0,02020,000,010,010,010,010,000,000,010,0,010,020,000,0,02020,010,01010),
itc=c(010,01010,0,02020,020,020,020,020,020,010,010,010,0,020,020,010,0,02020,020,02020)
)
Unique ids are given to each row, grouped by the columns tmc, thc, mdc and itc.
## Add unique id grouped by tmc, thc, mdc and itc
df.id <- df %>% mutate(id=as.numeric(interaction(tmc,thc,mdc,itc)))
As it does not give sequential ids, I need to rename them. However, I could not find a solution. The conditions are:
If tmc, thc, mdc and itc are all 0, id is set to 0 (I do not know the reason, but interaction gives 1 for such records in my data frame)
Other ids should be renamed sequentially but keep their grouping (if ids are set as 4,8,2,2,8, they should be renamed as 1,2,3,3,2)
The following script shows what I am doing currently. id is the temporary id obtained from the interaction function, but I need the sequential id shown in the id.desired column.
## Replace unique id sequentially
## IT DOES NOT GIVE DESIRED OUTPUT
# df.id %>% group_by(id) %>% mutate(id2=seq_along(id))
## Desired id is shown in `id.desired`
## `id` is the ones obtained from `interaction` function, which are not set sequentially
hid mid tmc thc mdc itc id id.desired
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 10001 1 10 10 0 10 166 1
2 10001 2 1010 1010 1010 1010 595 2
3 10001 3 0 0 0 0 1 0
4 10001 4 1020 2030 2020 2020 796 3
5 10002 1 10 10 0 20 326 4
6 10002 2 10 20 10 20 362 5
7 10002 3 10 20 10 20 362 5
8 10002 4 10 20 10 20 362 5
9 10002 5 10 30 10 20 366 6
10 10003 1 10 10 0 10 166 1
11 10003 2 10 10 0 10 166 1
12 10003 3 10 10 10 10 198 7
13 10003 4 0 0 0 0 1 0
14 10003 5 10 20 10 20 362 5
15 10003 6 10 30 20 20 398 8
16 10004 1 10 10 0 10 166 1
17 10004 2 0 0 0 0 1 0
18 10004 3 1010 2020 2020 2020 791 9
19 10004 4 10 30 10 20 366 6
20 10004 5 1010 4040 1010 2020 767 10
Any suggestions?
I prefer to use dplyr in this operation.
I received some suggestions in a previous question; however, the structure is not the same in this case (the dummy field does not exist in the current data frame).
How to renumber result of intersection/group_indices in R?
A solution using the tidyverse. Notice that I did not use the interaction function. Instead, I used the group_indices function from dplyr to create the group index, then converted it to a factor whose levels follow the order of occurrence in the column. df2 is the final output.
library(tidyverse)
df2 <- df %>%
filter_at(vars(tmc, thc, mdc, itc), any_vars(. != 0)) %>%
mutate(id = group_indices(., tmc, thc, mdc, itc)) %>%
mutate(id = as.numeric(factor(id, levels = unique(id)))) %>%
left_join(df, ., by = names(df)) %>%
replace_na(list(id = 0))
df2
# hid mid tmc thc mdc itc id
# 1 10001 1 10 10 0 10 1
# 2 10001 2 1010 1010 1010 1010 2
# 3 10001 3 0 0 0 0 0
# 4 10001 4 1020 2030 2020 2020 3
# 5 10002 1 10 10 0 20 4
# 6 10002 2 10 20 10 20 5
# 7 10002 3 10 20 10 20 5
# 8 10002 4 10 20 10 20 5
# 9 10002 5 10 30 10 20 6
# 10 10003 1 10 10 0 10 1
# 11 10003 2 10 10 0 10 1
# 12 10003 3 10 10 10 10 7
# 13 10003 4 0 0 0 0 0
# 14 10003 5 10 20 10 20 5
# 15 10003 6 10 30 20 20 8
# 16 10004 1 10 10 0 10 1
# 17 10004 2 0 0 0 0 0
# 18 10004 3 1010 2020 2020 2020 9
# 19 10004 4 10 30 10 20 6
# 20 10004 5 1010 4040 1010 2020 10
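group_indices() has since been superseded in dplyr; an id ordered by first appearance can also be built without it, using match() on a pasted key. A sketch on a small toy frame with the same column names:

```r
# Toy frame with the question's column names
df <- data.frame(
  tmc = c(10, 1010, 0, 10, 10),
  thc = c(10, 1010, 0, 20, 10),
  mdc = c(0,  1010, 0, 10, 0),
  itc = c(10, 1010, 0, 20, 10)
)

cols <- c("tmc", "thc", "mdc", "itc")
zero <- rowSums(df[cols] != 0) == 0   # rows where all four columns are 0
key  <- do.call(paste, df[cols])      # one key per combination
# All-zero rows get 0; others are numbered by first appearance of their key
df$id <- ifelse(zero, 0L, match(key, unique(key[!zero])))
df$id
#> [1] 1 2 0 3 1
```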
Here is an example based on the two conditions, using data.table. .GRP assigns one id per distinct combination, numbered in order of first appearance; only the all-zero rows are excluded and then set to 0:
require(data.table)
df = data.table(df)
df[!(tmc == 0 & thc == 0 & mdc == 0 & itc == 0), ID := .GRP, by = .(tmc, thc, mdc, itc)]
df[is.na(ID), ID := 0]
(edited mutate based on your clarification in comments)
Here are the two things I tried to do:
To ensure that id = 0 when the four variables are all 0, I used if_else in mutate with the condition you specified.
To get id.desired, I used the dense_rank() function.
Here is the code based on the dataset you shared:
df %>%
mutate(id = if_else(tmc == 0 & thc == 0 & mdc == 0 & itc == 0, 0,
as.numeric(interaction(tmc, thc, mdc, itc, lex.order = TRUE)))) %>%
mutate(id.desired = dense_rank(id) - 1)
The output looks like this
hid mid tmc thc mdc itc id id.desired
1 10001 1 10 10 0 10 227 1
2 10001 2 1010 1010 1010 1010 519 7
3 10001 3 0 0 0 0 0 0
4 10001 4 1020 2030 2020 2020 775 10
5 10002 1 10 10 0 20 228 2
6 10002 2 10 20 10 20 258 4
7 10002 3 10 20 10 20 258 4
8 10002 4 10 20 10 20 258 4
9 10002 5 10 30 10 20 283 5
10 10003 1 10 10 0 10 227 1
11 10003 2 10 10 0 10 227 1
12 10003 3 10 10 10 10 232 3
13 10003 4 0 0 0 0 0 0
14 10003 5 10 20 10 20 258 4
15 10003 6 10 30 20 20 288 6
16 10004 1 10 10 0 10 227 1
17 10004 2 0 0 0 0 0 0
18 10004 3 1010 2020 2020 2020 550 8
19 10004 4 10 30 10 20 283 5
20 10004 5 1010 4040 1010 2020 595 9
I have a data frame with daily average temperature data in it, structured like so:
'data.frame': 4666 obs. of 6 variables:
$ Site : chr "EB" "FFCE" "IB" "FFCE" ...
$ Date : Date, format: "2013-01-01" "2013-01-01" "2013-01-01" "2014-01-01" ...
$ Day : int 1 1 1 1 1 1 1 1 1 1 ...
$ Year : int 2013 2013 2013 2014 2014 2014 2014 2015 2015 2015 ...
$ Month: int 1 1 1 1 1 1 1 1 1 1 ...
$ Temp : num 28.5 28.3 28.3 27 27.8 ...
I am attempting to produce a summary table that totals the number of days per year per site above certain temperature thresholds, e.g. 25 °C, 26 °C.
I can achieve this manually using dplyr like so:
Days_above <- Site_Daily_average %>%
  group_by(Year, Site) %>%
  summarise("23" = sum(Temp > 23), "24" = sum(Temp > 24), "25" = sum(Temp > 25),
            "26" = sum(Temp > 26), "27" = sum(Temp > 27), "28" = sum(Temp > 28),
            "29" = sum(Temp > 29), "30" = sum(Temp > 30), "31" = sum(Temp > 31),
            "ABOVE THRESHOLD" = sum(Temp > maxthreshold)) %>%
  as.data.frame()
Which produces a table like so :
Year Site 23 24 25 26 27 28 29 30 31 ABOVE THRESHOLD
1 2012 EB 142 142 142 91 64 22 0 0 0 0
2 2012 FFCE 238 238 238 210 119 64 0 0 0 0
3 2012 IB 238 238 238 218 138 87 1 0 0 0
4 2013 EB 115 115 115 115 115 109 44 0 0 0
5 2013 FFCE 223 223 216 197 148 114 94 0 0 0
6 2013 IB 365 365 365 348 299 194 135 3 0 0
...
However, as you can see, the code is fairly verbose. The problem I am having is producing this same output for a sequence of temperature thresholds, i.e. seq(16, 32, 0.25).
That would take a long time to type out manually. I feel like this is a simple calculation and there should be a way to get dplyr to take each value in the sequence, apply this function, and produce the output as a complete table. Sorry if that was unclear, as I am relatively new to R.
Any suggestions would be welcome. Thank you.
Here's a tidyverse approach, likewise using mtcars for illustration:
library(tidyverse)
mtcars %>%
mutate(threshold = cut(mpg,
breaks=seq(10, max(mtcars$mpg)+10, 5),
labels=seq(10, max(mtcars$mpg)+5, 5))) %>%
group_by(cyl, threshold) %>%
tally %>%
ungroup %>%
complete(threshold, nesting(cyl), fill=list(n=0)) %>%
arrange(desc(threshold)) %>%
group_by(cyl) %>%
mutate(N_above = cumsum(n)) %>%
select(-n) %>%
arrange(cyl, threshold)
threshold cyl N_above
1 10 4 11
2 15 4 11
3 20 4 11
4 25 4 6
5 30 4 4
6 35 4 0
7 10 6 7
8 15 6 7
9 20 6 3
10 25 6 0
11 30 6 0
12 35 6 0
13 10 8 14
14 15 8 8
15 20 8 0
16 25 8 0
17 30 8 0
18 35 8 0
If you want the final data in wide format, add a spread at the end and remove the arrange:
... %>%
select(-n) %>%
spread(threshold, N_above)
cyl 10 15 20 25 30 35
1 4 11 11 11 6 4 0
2 6 7 7 3 0 0 0
3 8 14 8 0 0 0 0
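A side note: spread() is retired in current tidyr, so the widening step would now be written with pivot_wider. A minimal sketch on a tiny stand-in for the grouped result above:

```r
library(tidyr)

# Tiny stand-in for the cyl/threshold/N_above result
d <- data.frame(cyl = c(4, 4, 6, 6),
                threshold = c(10, 15, 10, 15),
                N_above = c(11, 11, 7, 7))

pivot_wider(d, names_from = threshold, values_from = N_above)
```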
As #dww commented, we can use cut to get the required format. I have tried this on the mtcars dataset, creating a range from 10 to 35 in steps of 5 for the mpg column.
df <- mtcars
df$group <- cut(df$mpg, seq(10, 35, 5))
and then we group by cyl and use table to count how many fall into the respective buckets.
table(df$cyl, df$group)
# (10,15] (15,20] (20,25] (25,30] (30,35]
#4 0 0 5 2 4
#6 0 4 3 0 0
#8 6 8 0 0 0
Now, the buckets are disjoint: a day counted in (10,15] is not counted in (15,20]. To make the counts cumulative, the (15,20] entry should also include the (10,15] count, and the (20,25] entry should include both of the previous ones. Hence we need a row-wise cumsum of this table.
t(apply(table(df$cyl, df$group), 1, cumsum))
# (10,15] (15,20] (20,25] (25,30] (30,35]
# 4 0 0 5 7 11
# 6 0 4 7 7 7
# 8 6 14 14 14 14
For your case, the code would go:
Site_Daily_average$group <- cut(Site_Daily_average$Temp, seq(16, 32, 0.25))
# then tabulate per Year/Site combination and take the row-wise cumsum
t(apply(table(paste(Site_Daily_average$Year, Site_Daily_average$Site),
              Site_Daily_average$group), 1, cumsum))
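For completeness, the threshold sequence can also be looped over directly, keeping the questioner's column names. A sketch with made-up data (swap in seq(16, 32, 0.25) and the real Site_Daily_average):

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for Site_Daily_average
Site_Daily_average <- data.frame(
  Year = rep(c(2012, 2013), each = 4),
  Site = rep(c("EB", "FFCE"), times = 4),
  Temp = c(24.1, 25.3, 26.7, 23.9, 25.1, 27.2, 28.4, 24.6)
)

thresholds <- seq(23, 28, 1)  # use seq(16, 32, 0.25) for the real data

# One summarise per threshold, then bind the pieces and widen
Days_above <- lapply(thresholds, function(th)
  Site_Daily_average %>%
    group_by(Year, Site) %>%
    summarise(n = sum(Temp > th), .groups = "drop") %>%
    mutate(threshold = th)) %>%
  bind_rows() %>%
  pivot_wider(names_from = threshold, values_from = n)
```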
I am trying to rank multiple numeric variables (around 700+) in my data and am not sure exactly how to do this, as I am still pretty new to R.
I do not want to overwrite the ranked values in the same variable, so I need to create a new rank variable for each of these numeric variables.
From reading other posts, I believe the assign and transform functions along with rank may be able to solve this. I tried implementing it as below (sample data and code) and am struggling to get it to work.
The output dataset, in addition to the variables xcount, xvisit and ysales, needs to be populated with the variables xcount_rank, xvisit_rank and ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as intervals like (101, 230], (230, 450], etc., whereas I would like the values in the rank variable populated as 1, 2, etc., up to the 10 categories per the splits I did. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
I would like to see which group each record falls under when ranking the interval values.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
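mutate_each() has long been deprecated; with current dplyr the same step would use across(). A sketch on the first sample input (swap in cut(.x, breaks = 3, labels = FALSE) for the binned version):

```r
library(dplyr)

input <- read.table(header = FALSE, text = "101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id", "xcount", "xvisit", "ysales")

# .names builds the rank_* columns alongside the originals
input <- input %>%
  mutate(across(xcount:ysales,
                ~ rank(.x, ties.method = "first"),
                .names = "rank_{.col}"))
```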