I am very, very new to any type of coding language. I am used to pivot tables in Excel and am trying to replicate a pivot I have built in Excel in R. I have spent a long time searching the internet and YouTube, but I just can't get it to work.
I am looking to produce a table in which the left-hand column shows a number of locations, and the columns across the top show the different pages that have been viewed. I want the table to show the number of views per location for each of these pages.
The data frame 'specificreports' shows all views over the past year for different pages on an online platform. I want to filter for the month of October, and then pivot the different Employee Teams against the number of views for different pages.
specificreports <- readxl::read_excel("Multi-Tab File - Dashboard Usage.xlsx", sheet = "Specific Reports")
specificreportsLocal <- tbl_df(specificreports)
specificreportsLocal %>% filter(Month == "October") %>%
group_by("Employee Team") %>%
This bit works, in that it groups the different team names and filters entries for the month of October. After this I have tried using the summarise function to summarise the number of hits but can't get it to work at all. I keep getting errors regarding data type. I keep getting confused because solutions I look up keep using different packages.
I would appreciate any help, using the simplest way of doing this as I am a total newbie!
Thanks in advance,
Holly
Let's see if I can help a bit. It's hard to know what your data looks like from the info you gave us, so I'm going to guess and make some fake data for us to play with. It's worth noting that having field names with spaces in them is going to make your life really hard; you should start by renaming your fields to something more manageable. Since I'm just making data up, I'll give my fields names without spaces:
library(tidyverse)
## this makes some fake data
## a data frame with 3 fields: month, team, value
n <- 100
specificreportsLocal <-
data.frame(
month = sample(1:12, size = n, replace = TRUE),
team = letters[1:5],
value = sample(1:100, size = n, replace = TRUE)
)
That's just a data frame called specificreportsLocal with three fields: month, team, value
Let's do some things with it:
# This will give us total values by team when month = 10
specificreportsLocal %>%
filter(month == 10) %>%
group_by(team) %>%
summarize(total_value = sum(value))
#> # A tibble: 4 x 2
#> team total_value
#> <fct> <int>
#> 1 a 119
#> 2 b 172
#> 3 c 67
#> 4 d 229
I think that's sort of like what you already did, except I added the summarize to show how it works.
Now let's use all months and reshape it from 'long' to 'wide'
# if I want to see all months I leave out the filter and
# add a group_by month
specificreportsLocal %>%
group_by(team, month) %>%
summarize(total_value = sum(value)) %>%
head(5) # this just shows the first 5 values
#> # A tibble: 5 x 3
#> # Groups: team [1]
#> team month total_value
#> <fct> <int> <int>
#> 1 a 1 17
#> 2 a 2 46
#> 3 a 3 91
#> 4 a 4 69
#> 5 a 5 83
# to make this 'long' data 'wide', we can use the `spread` function
specificreportsLocal %>%
group_by(team, month) %>%
summarize(total_value = sum(value)) %>%
spread(team, total_value)
#> # A tibble: 12 x 6
#> month a b c d e
#> <int> <int> <int> <int> <int> <int>
#> 1 1 17 122 136 NA 167
#> 2 2 46 104 158 94 197
#> 3 3 91 NA NA NA 11
#> 4 4 69 120 159 76 98
#> 5 5 83 186 158 19 208
#> 6 6 103 NA 118 105 84
#> 7 7 NA NA 73 127 107
#> 8 8 NA 130 NA 166 99
#> 9 9 125 72 118 135 71
#> 10 10 119 172 67 229 NA
#> 11 11 107 81 NA 131 49
#> 12 12 174 87 39 NA 41
Created on 2018-12-01 by the reprex package (v0.2.1)
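A side note: in newer versions of tidyr (1.0.0 and later), `spread` has been superseded by `pivot_wider`. If your tidyr is recent enough, the equivalent reshape would look like this:
specificreportsLocal %>%
  group_by(team, month) %>%
  summarize(total_value = sum(value)) %>%
  pivot_wider(names_from = team, values_from = total_value)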
Now I'm not really sure if that's what you want. So feel free to make a comment on this answer if you need any of this clarified.
Welcome to Stack Overflow!
I'm not sure I correctly understand your need without a data sample, but this may work for you:
library(rpivotTable)
specificreportsLocal %>%
  filter(Month == "October") %>%
  rpivotTable(rows = "Employee Team", cols = "page", vals = "views", aggregatorName = "Sum")
Otherwise, if you do not need it interactive (as the Pivot Tables in Excel), this may work as well:
specificreportsLocal %>% filter(Month == "October") %>%
group_by_at(c("Employee Team", "page")) %>%
summarise(nr_views = sum(views, na.rm=TRUE))
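If you then want the pages laid out across the top like in an Excel pivot table, you could reshape that summary to wide format with `spread` from tidyr. This is only a sketch: the column names `Employee Team`, `page`, and `views` come from the code above and may well differ in your actual data.
library(tidyr)
specificreportsLocal %>%
  filter(Month == "October") %>%
  group_by_at(c("Employee Team", "page")) %>%
  summarise(nr_views = sum(views, na.rm = TRUE)) %>%
  spread(page, nr_views)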
I want to filter my data. Below you can see what my data looks like.
df<-data.frame(
Description=c("15","11","12","NA","Total","NA","9","18","NA","Total"),
Value=c(158,196,NA,156,140,693,854,NA,904,925))
df
Now I want to filter and assign some text in an additional column. The desired output needs to look like the table shown below. Namely, I want to introduce an additional column titled Sales. In this column, using an if-else statement, I want to introduce two categorical values: the first is Sold and the second is Unsold. The rows up to and including the first 'Total' row need to have the value 'Sold', and the values below that need to have 'Unsold'.
I tried to do this with the command below, but unfortunately it does not work as I expected.
df1$Sales <- ifelse(df$Description==c('Total'),'Sold','Unsold')
So can anybody help me solve this?
df$Sales <- ifelse(cumsum(dplyr::lag(df$Description, default = "") == "Total") > 0,
"Unsold",
"Sold")
df
#> Description Value Sales
#> 1 15 158 Sold
#> 2 11 196 Sold
#> 3 12 NA Sold
#> 4 NA 156 Sold
#> 5 Total 140 Sold
#> 6 NA 693 Unsold
#> 7 9 854 Unsold
#> 8 18 NA Unsold
#> 9 NA 904 Unsold
#> 10 Total 925 Unsold
To break down the logic:
dplyr::lag checks whether the previous entry was "Total". Setting a default of any string other than "Total" prevents creating NA as the first entry, because that would carry over an unwanted NA into the next step.
cumsum returns how many times "Total" has been seen as the previous entry.
Checking that the result of cumsum is greater than 0 turns step 2 into a binary result: "Total" has either been found, or it hasn't.
If "Total" has been found, it's unsold; otherwise it's sold.
You could also rearrange things:
dplyr::lag(cumsum(df$Description == "Total") < 1, default = TRUE)
gets the same result, though `TRUE` now marks the rows up to and including the first "Total", so the "Sold" and "Unsold" labels trade places in the `ifelse()`.
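As a quick sketch, plugging that rearranged condition into the same kind of `ifelse()` (same df as above):
df$Sales <- ifelse(dplyr::lag(cumsum(df$Description == "Total") < 1, default = TRUE),
                   "Sold",
                   "Unsold")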
If you know there are as many sold as unsold you can use the first solution.
If you want to allow for uneven and unknown numbers of each you could use the second solution.
library(tidyverse)
# FIRST SOLUTION
df |>
mutate(Sales = ifelse(row_number() <= nrow(df) / 2, "Sold", "Unsold"))
# SECOND SOLUTION
df |>
mutate(o = Description == "Total") |>
mutate(Sales = ifelse(row_number() > match(TRUE, o), "Unsold", "Sold")) |>
select(-o)
#> Description Value Sales
#> 1 15 158 Sold
#> 2 11 196 Sold
#> 3 12 NA Sold
#> 4 NA 156 Sold
#> 5 Total 140 Sold
#> 6 NA 693 Unsold
#> 7 9 854 Unsold
#> 8 18 NA Unsold
#> 9 NA 904 Unsold
#> 10 Total 925 Unsold
I am currently working on a task that requires me to query a list of stocks from an SQL database.
The problem is that it is a list where there are 1:n stocks traded per date. I want to calculate the share of each stock in the portfolio on a given day (see example) and pass it to a new data frame. In other words, date x occurs 2 times (once for stock A and once for stock B), and I then want to pull it together so that date x occurs only one time, with the new values.
'data.frame': 1010 obs. of 5 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Date : Date, format: "2019-11-22" "2019-11-21" "2019-11-20" "2019-11-19" ...
$ Close: num 52 51 50.1 50.2 50.2 ...
$ Volume : num 5415 6196 3800 4784 6189 ...
$ Stock_ID : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
RawInput<-data.frame(Date=c("2017-22-11","2017-22-12","2017-22-13","2017-22-11","2017-22-12","2017-22-13","2017-22-11"), Close=c(50,55,56,10,11,12,200),Volume=c(100,110,150,60,70,80,30),Stock_ID=c(1,1,1,2,2,2,3))
RawInput$Stock_ID<-as.factor(RawInput$Stock_ID)
*I cannot convert Date to a proper date variable in this example.
I would like to have a new dataframe that generates the Value traded per day, the weight of each stock, and the daily returns per day, while keeping the number of stocks variable.
I hope I translated the issue properly so that I can receive help.
Thank you!
I think the easiest way to do this would be to use the dplyr package. You may need to read some documentation, but the mutate and group_by functions should be able to do what you want. They allow you to modify the current data frame by either adding a new column or changing the existing data.
Let's start with a reproducible dataset:
RawInput<-data.frame(Date=c("2017-22-11","2017-22-12","2017-22-13","2017-22-11","2017-22-12","2017-22-13","2017-22-11"),
Close=c(50,55,56,10,11,12,200),
Volume=c(100,110,150,60,70,80,30),
Stock_ID=c(1,1,1,2,2,2,3))
RawInput$Stock_ID<-as.factor(RawInput$Stock_ID)
library(magrittr)
library(dplyr)
dat2 <- RawInput %>%
group_by(Date, Stock_ID) %>% # in this example each date/stock combination appears only once, but I imagine you want to group by stock
mutate(CloseMean=mean(Close),
CloseSum=sum(Close),
VolumeMean=mean(Volume),
VolumeSum=sum(Volume)) # whatever computation you need to do with
# multiple stock values for a given date goes here
# dat2 still has one row per row of RawInput, so use distinct() to reduce it to unique combinations
dat2 %>% select(Stock_ID, Date, CloseMean, CloseSum, VolumeMean, VolumeSum) %>% distinct()
# A tibble: 7 x 6
# Groups: Date, Stock_ID [7]
Stock_ID Date CloseMean CloseSum VolumeMean VolumeSum
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 2017-22-11 50 50 100 100
2 1 2017-22-12 55 55 110 110
3 1 2017-22-13 56 56 150 150
4 2 2017-22-11 10 10 60 60
5 2 2017-22-12 11 11 70 70
6 2 2017-22-13 12 12 80 80
7 3 2017-22-11 200 200 30 30
The data set you provided actually has only one row per unique Stock_ID and Date combination, so nothing was really aggregated here. However, if you remove Stock_ID from the grouping you can see how this works:
dat2 <- RawInput %>%
group_by(Date) %>%
mutate(CloseMean=mean(Close),
CloseSum=sum(Close),
VolumeMean=mean(Volume),
VolumeSum=sum(Volume))
dat2 %>% select(Date, CloseMean, CloseSum, VolumeMean,VolumeSum) %>% distinct()
# A tibble: 3 x 5
# Groups: Date [3]
Date CloseMean CloseSum VolumeMean VolumeSum
<fct> <dbl> <dbl> <dbl> <dbl>
1 2017-22-11 86.7 260 63.3 190
2 2017-22-12 33 66 90 180
3 2017-22-13 34 68 115 230
After reading your first reply: you will have to be specific about how you are trying to calculate the weight, and also define your end result.
I'm going to assume weight is just each stock's share of the day's total Close, and that the end result is, for each date, the weight per stock; in other words, a matrix of dates and stock IDs.
library(tidyr)
RawInput %>%
group_by(Date) %>%
mutate(weight=Close/sum(Close)) %>%
select(Date, weight, Stock_ID) %>%
spread(key = "Stock_ID", value = "weight", fill = 0)
# A tibble: 3 x 4
# Groups: Date [3]
Date `1` `2` `3`
<fct> <dbl> <dbl> <dbl>
1 2017-22-11 0.192 0.0385 0.769
2 2017-22-12 0.833 0.167 0
3 2017-22-13 0.824 0.176 0
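For the daily-returns part of the question, which the summaries above don't cover, here is a rough sketch. It assumes "return" means the percentage change in Close from the previous observation of the same stock, and that in your real data Date has been parsed to an actual Date so the rows sort correctly:
RawInput %>%
  group_by(Stock_ID) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(daily_return = Close / lag(Close) - 1) %>% # NA for each stock's first observation
  ungroup()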
I'm attempting to use dplyr to analyze experiment data. My current data set represents five patients. For each patient, two samples are non-treated and there are four treated samples. I want to average the non-treated samples and then normalize all the observations for each patient to the average of the non-treated samples.
I'm easily able to get the baseline for each patient:
library(dplyr)
library(magrittr)
baselines <-main_table %>%
filter(Treatment == "N/A") %>%
group_by(PATIENT.ID) %>%
summarize(mean_CD4 = mean(CD3pos.CD8neg))
What is an efficient way to reference these values when I go back to mutate in the main table? Ideally being able to use PATIENT.ID to filter/select somehow rather than having to specify the actual patient IDs, which change from one experiment to the next?
What I've been doing is saving the values out of the summarized table and then using those inside mutate, but this solution is UGLY. I really do not like having the patient IDs hard coded in like this because they change from experiment to experiment and manually changing them introduces errors that are hard to catch.
patient_1_baseline <- baselines[[1, 2]]
patient_2_baseline <- baselines[[2, 2]]
main_table %>%
mutate(percent_of_baseline = ifelse(
PATIENT.ID == "108", CD3pos.CD8neg / patient_1_baseline * 100,
ifelse(PATIENT.ID == "patient_2", ......
Another way to approach this would be to try to group by patient ID, summarize to get the baseline, and then mutate, but I cannot quite figure out how to do that either.
This is ultimately a symptom of a larger problem. I have the tidyverse basics down ok but I am struggling to move to the next level where I can handle more complex situations like this one. Any advice about this specific scenario or the big picture problem are deeply appreciated.
Edited to add:
Sample data set
PATIENT.ID Dose.Day Single.Live.Lymphs CD3pos.CD8neg
1 108 Day 1 42570 24324
2 108 Day 2 36026 20842
3 108 Day 3 40449 22882
4 108 Day 4 52831 32034
5 108 N/A 71348 38340
6 108 N/A 60113 34294
Use left_join() to merge the baselines that you calculated back into the main_table:
main_table %>%
left_join(baselines, by = "PATIENT.ID")
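With the baselines joined on, the normalization becomes a single `mutate()`. A sketch, reusing the `mean_CD4` name from your `summarize()` above:
main_table %>%
  left_join(baselines, by = "PATIENT.ID") %>%
  mutate(percent_of_baseline = CD3pos.CD8neg / mean_CD4 * 100)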
See the dplyr documentation on joins for more about merging data in R.
Another approach in this case could also avoid the need for a separate baseline dataset entirely by just adding the baseline with a grouped mutate():
library(tidyverse)
main_table %>%
group_by(PATIENT.ID) %>%
mutate(baseline = mean(CD3pos.CD8neg[Dose.Day == "N/A"])) %>%
mutate(pctbl = CD3pos.CD8neg / baseline * 100)
#> # A tibble: 6 x 6
#> # Groups: PATIENT.ID [1]
#> PATIENT.ID Dose.Day Single.Live.Lymphs CD3pos.CD8neg baseline pctbl
#> <int> <chr> <int> <int> <dbl> <dbl>
#> 1 108 Day1 42570 24324 36317 67.0
#> 2 108 Day2 36026 20842 36317 57.4
#> 3 108 Day3 40449 22882 36317 63.0
#> 4 108 Day4 52831 32034 36317 88.2
#> 5 108 N/A 71348 38340 36317 106.
#> 6 108 N/A 60113 34294 36317 94.4
Data:
txt <- "
PATIENT.ID Dose.Day Single.Live.Lymphs CD3pos.CD8neg
1 108 Day1 42570 24324
2 108 Day2 36026 20842
3 108 Day3 40449 22882
4 108 Day4 52831 32034
5 108 N/A 71348 38340
6 108 N/A 60113 34294"
main_table <- read.table(text = txt, header = TRUE,
stringsAsFactors = FALSE)
Created on 2018-07-11 by the reprex package (v0.2.0.9000).
Someone named Tarqon on Reddit's /r/Rlanguage solved the problem: `1 + cumsum(days_between >= 45)` instead of the if_else.
group_by(DMHID) %>%
arrange(DMHID, DateOfService) %>%
mutate(days_between = as.numeric(DateOfService - lag(DateOfService, default = DateOfService[1]))) %>%
mutate(eoc_45dco = 1 + cumsum(days_between >= 45)) %>%
mutate(id_eoc = as.integer(paste0(DMHID, eoc_45dco))) %>%
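To see why the `cumsum()` version fixes it, here is a minimal illustration on made-up gaps (hypothetical data, not the real visits): every row at or after the first 45+ day gap lands in episode 2, not just the single row immediately following the gap.
library(dplyr)
visits <- tibble(days_between = c(0, 10, 20, 59, 5, 7))
visits %>% mutate(eoc_45dco = 1 + cumsum(days_between >= 45))
#> eoc_45dco is 1, 1, 1, 2, 2, 2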
ORIGINAL QUESTION
So I am trying to split cases based on the number of days (> 45) between one visit and the next. It works for the individual instance when there are more than 45 days between one visit and the next, but I need each visit after that to be part of the second group. For example, Participant 1234 has 362 visits, but between visit 105 (2016-12-26) and visit 106 (2017-02-23) there was a 59-day gap, so I want all cases after that to be labeled 2. Rather, all cases leading up to and including 105 should be 12341 and everything after that 12342, so I can group by this variable for later analyses. The problem is that I can only seem to get the 106th visit labeled 12342; everything before and after it is 12341. I created a stripped-down dataset and script that reproduces the problem.
https://www.dropbox.com/s/k6gvo8igvbhpgti/reprex.zip?dl=0
EDIT: I just thought of another way to say it. I basically need to figure out how to group/subset the data for each person, with the dividing line being the first time there is a gap of 45 days or more. I might be going down the wrong road with my current implementation, so if you can suggest alternative ways to split the data the way I want, let me know. In the example I only have one person's visits; the full dataset has a few thousand people with similar issues.
barometer <- df_pdencs_orig %>%
select(-EncID, -SiteName, -EOCKey, -ProgramLevel, -ProgramLevelCode, -ProcedureDesc, -MedicationValue, -CheckDate, -PdAmount, -PayerType) %>%
mutate_at(vars(contains("Date")), funs(ymd)) %>%
filter(DMHID %in% valid_diag$DMHID & DateOfService >= ymd(open_date)) %>%
group_by(DMHID) %>%
arrange(DMHID, DateOfService) %>%
mutate(days_between = DateOfService - lag(DateOfService, n = 1, default = DateOfService[1])) %>%
mutate(eoc_45dco = 1) %>%
mutate(eoc_45dco = if_else(days_between >= 45, lag(eoc_45dco) + 1, eoc_45dco)) %>%
mutate(eoc_45dco2 = if_else(lag(eoc_45dco) > 1, eoc_45dco + 1, eoc_45dco)) %>%
mutate(id_eoc = as.integer(paste0(DMHID, eoc_45dco))) %>%
...
The reprex below works just fine so I don't think that helps.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
df <- data.frame(
date = sample(seq(as.Date('2016/06/01'), as.Date('2017/03/01'), by="day"), 11),
days = as.difftime(c(40:50), units = "days")
)
df %>%
mutate(id = 1234) %>%
arrange(days) %>%
mutate(Z = 1) %>%
mutate(Z = if_else(days >= 45, lag(Z) + 1, Z)) %>%
mutate(id_eoc = as.integer(paste0(id, Z)))
#> date days id Z id_eoc
#> 1 2016-06-30 40 days 1234 1 12341
#> 2 2016-11-25 41 days 1234 1 12341
#> 3 2016-09-09 42 days 1234 1 12341
#> 4 2017-01-16 43 days 1234 1 12341
#> 5 2016-08-16 44 days 1234 1 12341
#> 6 2016-09-23 45 days 1234 2 12342
#> 7 2016-09-05 46 days 1234 2 12342
#> 8 2016-08-29 47 days 1234 2 12342
#> 9 2016-07-08 48 days 1234 2 12342
#> 10 2017-01-11 49 days 1234 2 12342
#> 11 2017-02-22 50 days 1234 2 12342
Created on 2018-04-17 by the reprex package (v0.2.0).
As such, I think the issue may be with the dates, since subtracting dates gives a difftime rather than an integer.
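That hunch is easy to check: subtracting two Dates does give a difftime rather than a plain number, which may be part of why the accepted fix above wraps the subtraction in `as.numeric()`:
d <- as.Date(c("2016-12-26", "2017-02-23"))
d[2] - d[1]             # Time difference of 59 days (class "difftime")
as.numeric(d[2] - d[1]) # 59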
library(dplyr)
library(forcats)
Using the simple data frame and code below, I want to create a table with total rows and sub-rows. For example, the first row would be "Region1" from the NEW column and 70 from the TotNumber column; below that would be three rows for "Town1", "Town2", and "Town3" with their associated numbers from the Number column, and the same for "Region2" and "Region3". I attached a pic of the desired table...
I'm also looking for a solution using dplyr and Tidyverse.
Number<-c(10,30,30,10,56,30,40,50,33,10)
Town<-c("Town1","Town2","Town3","Town4","Town5","Town6","Town7","Town8","Town9","Town10")
DF<-data_frame(Town,Number)
DF<-DF%>%mutate_at(vars(Town),funs(as.factor))
To create the Region variable...
DF<-DF%>%mutate(NEW=fct_collapse(Town,
Region1=c("Town1","Town2","Town3"),
Region2=c("Town4","Town5","Town6"),
Region3=c("Town7","Town8","Town9","Town10")))%>%
group_by(NEW)%>%
summarise(TotNumber=sum(Number))
Modifying your last pipe and adding some additional steps:
library(dplyr)
library(forcats)
DF%>%mutate(NEW=fct_collapse(Town,
Region1=c("Town1","Town2","Town3"),
Region2=c("Town4","Town5","Town6"),
Region3=c("Town7","Town8","Town9","Town10")),
NEW = as.character(NEW)) %>%
group_by(NEW) %>%
mutate(TotNumber=sum(Number)) %>%
ungroup() %>%
split(.$NEW) %>% # one data frame per region
# for each region: take its name and total (columns 3:4 of the first row),
# rename them to Town/Number, and stack them on top of the town rows (columns 1:2)
lapply(function(x) rbind(setNames(x[1, 3:4], names(x)[1:2]), x[1:2])) %>%
do.call(rbind, .) # bind the regions back into one table
Results:
# A tibble: 13 × 2
Town Number
* <chr> <dbl>
1 Region1 70
2 Town1 10
3 Town2 30
4 Town3 30
5 Region2 96
6 Town4 10
7 Town5 56
8 Town6 30
9 Region3 133
10 Town7 40
11 Town8 50
12 Town9 33
13 Town10 10
Data:
Number<-c(10,30,30,10,56,30,40,50,33,10)
Town<-c("Town1","Town2","Town3","Town4","Town5","Town6","Town7","Town8","Town9","Town10")
DF<-data_frame(Town,Number) %>%
mutate_at(vars(Town),funs(as.factor))
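If you would rather stay with dplyr verbs instead of split()/lapply(), here is a sketch of an alternative using `bind_rows()`; the `is_total` column is a temporary helper of my own, used only to order each region's total row above its towns:
library(dplyr)
library(forcats)
DF2 <- DF %>%
  mutate(NEW = fct_collapse(Town,
                            Region1 = c("Town1", "Town2", "Town3"),
                            Region2 = c("Town4", "Town5", "Town6"),
                            Region3 = c("Town7", "Town8", "Town9", "Town10")),
         NEW = as.character(NEW))
totals <- DF2 %>%
  group_by(NEW) %>%
  summarise(Number = sum(Number)) %>%
  mutate(Town = NEW, is_total = TRUE)
DF2 %>%
  mutate(Town = as.character(Town), is_total = FALSE) %>%
  bind_rows(totals) %>%
  arrange(NEW, desc(is_total)) %>% # region total first, then its towns in original order
  select(Town, Number)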