Restructure data.frame - r

I have survey data structured as follows:
df <- data.frame(userid = c(1, 2, 3),
pos1 = c("itemA_1", "itemB_1", "itemA_2"),
pos2 = c("itemB_1", "itemC_2", "itemC_1"),
pos3 = c("itemC_5", "itemA_4", "itemB_3")
)
df
> df
  userid    pos1    pos2    pos3
1      1 itemA_1 itemB_1 itemC_5
2      2 itemB_1 itemC_2 itemA_4
3      3 itemA_2 itemC_1 itemB_3
In the survey, several items (itemA, itemB, itemC ...) were rated on a five-point Likert scale ranging from 1 to 5. The order in which the items were answered was also saved.
For example, in the above data.frame user 1 rated itemA first and the rating was 1. Then he rated itemB and the rating was 1. Finally he rated itemC and the rating was 5.
User 2 started with itemB and the rating was 1, etc.
Obviously, that structure is not very useful for analysing the data, so I'd rather have it in a form like this:
userid itemA itemB itemC ...
     1     1     1     5
     2     4     1     2
     3     2     3     1
But how can I get there? Thanks for help!

Reshape the data to long format, separate the rating value from the item name, and then reshape the result to wide format.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with('pos'), names_to = NULL) %>%
separate(value, c('item', 'rating'), sep = '_', convert = TRUE) %>%
pivot_wider(names_from = item, values_from = rating)
# userid itemA itemB itemC
# <dbl> <int> <int> <int>
#1 1 1 1 5
#2 2 4 1 2
#3 3 2 3 1
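To see what each step contributes, here is a minimal sketch that keeps the intermediate long table around. It assumes the df defined in the question; the names long and wide and the position column are only illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(userid = c(1, 2, 3),
                 pos1 = c("itemA_1", "itemB_1", "itemA_2"),
                 pos2 = c("itemB_1", "itemC_2", "itemC_1"),
                 pos3 = c("itemC_5", "itemA_4", "itemB_3"))

# Step 1: one row per (userid, position); this time the position is kept
long <- df %>%
  pivot_longer(cols = starts_with("pos"), names_to = "position") %>%
  separate(value, c("item", "rating"), sep = "_", convert = TRUE)

# Step 2: drop the position and spread the items into columns
wide <- long %>%
  select(-position) %>%
  pivot_wider(names_from = item, values_from = rating)
```

pivot_wider() expects each (userid, item) pair to occur once; if a user could rate the same item twice, some aggregation via values_fn would presumably be needed.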

Related

How to add rows to a table based on a condition?

I would like to create a table in which I can give access to a certain page of a report for a certain user
Imagine I have a table like this:
I have another table in which I have the name of every report page:
I want to get a table that lists each user together with all the pages they have access to, depending on their group. Group 1 can see all pages, but group 2 can only see the Team page:
The best option would be doing it in DAX code but I think it could be easier doing it using R. Thanks in advance!
The tidyverse gives you easy tools for manipulating data frames. You can create three variables (Orders, Sales, Team) that record the access rights for each page (1 or 0, for example) using case_when with a condition on Group, then apply pivot_longer to these variables, and finally keep only the rows where access is granted with filter.
library(tidyverse)
Group <- c(1,2,1,2,2)
User <- c("alex","pablo","carlos","pepe","paula") %>% paste0("#gmail.com")
df <- data.frame(Group, User)
df2 <- df %>%
mutate(Orders = case_when(Group==1 ~ 1,
Group==2 ~ 0),
Sales = Orders,
Team = 1) %>%
pivot_longer(cols = c(Orders, Sales, Team), names_to = "Page") %>%
filter(value == 1) %>%
select(-value)
Output
> df
Group User
1 1 alex#gmail.com
2 2 pablo#gmail.com
3 1 carlos#gmail.com
4 2 pepe#gmail.com
5 2 paula#gmail.com
> df2
# A tibble: 9 x 3
Group User Page
<dbl> <chr> <chr>
1 1 alex#gmail.com Orders
2 1 alex#gmail.com Sales
3 1 alex#gmail.com Team
4 2 pablo#gmail.com Team
5 1 carlos#gmail.com Orders
6 1 carlos#gmail.com Sales
7 1 carlos#gmail.com Team
8 2 pepe#gmail.com Team
9 2 paula#gmail.com Team
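One idea is to attach every page name to every user, split the names into rows, and then drop the pages that group 2 may not see: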
library(dplyr)
library(tidyr)
df %>%
mutate(page = toString(df1$page_name)) %>%
separate_rows(page, sep = ', ') %>%
mutate(page = replace(page, Group == 2 & page != 'team', NA)) %>%
na.omit()
# A tibble: 9 x 3
Group User page
<dbl> <chr> <chr>
1 1 A orders
2 1 A sales
3 1 A team
4 2 B team
5 1 C orders
6 1 C sales
7 1 C team
8 2 D team
9 2 E team
DATA
dput(df)
structure(list(Group = c(1, 2, 1, 2, 2), User = c("A", "B", "C",
"D", "E")), class = "data.frame", row.names = c(NA, -5L))
dput(df1)
structure(list(page_name = c("orders", "sales", "team")), class = "data.frame", row.names = c(NA,
-3L))
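The same "pair everything, then drop what is not allowed" idea can also be sketched in base R with a cross join. This is only an illustration under the rule stated in the question (group 1 sees every page, group 2 only the team page); all_pairs and access are hypothetical names.

```r
df  <- data.frame(Group = c(1, 2, 1, 2, 2),
                  User = c("A", "B", "C", "D", "E"),
                  stringsAsFactors = FALSE)
df1 <- data.frame(page_name = c("orders", "sales", "team"),
                  stringsAsFactors = FALSE)

# by = NULL asks merge() for the Cartesian product: 5 users x 3 pages = 15 rows
all_pairs <- merge(df, df1, by = NULL)

# Keep only the pairs the rule allows
access <- all_pairs[all_pairs$Group == 1 | all_pairs$page_name == "team", ]
```

The row order of the result differs from the dplyr output above, so sort it afterwards if the order matters.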
Another idea using fuzzyjoin package:
Data
users <- data.frame(
Group = c("1","2","1","2","2"),
User = c("alex","pablo","carlos","pepe","paula")
)
Group User
1 1 alex
2 2 pablo
3 1 carlos
4 2 pepe
5 2 paula
You can then add a column to the Page data frame that lists the groups allowed to access each page:
pagename <- data.frame(
Page = c("Order","Sales","Team"),
Allowed = c("1","1","1|2")
)
Page Allowed
1 Order 1
2 Sales 1
3 Team 1|2
And finally using fuzzyjoin::regex_left_join:
users |>
fuzzyjoin::regex_left_join(pagename,
by = c(Group = "Allowed")) |>
dplyr::select(-Allowed)
Output
Group User Page
1 1 alex Order
2 1 alex Sales
3 1 alex Team
4 2 pablo Team
5 1 carlos Order
6 1 carlos Sales
7 1 carlos Team
8 2 pepe Team
9 2 paula Team
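One caveat with the regex approach: an unanchored pattern matches anywhere in the string, so it would presumably also match longer group labels that merely contain a 1 or a 2. The sketch below illustrates this with base grepl() and a hypothetical extra group "12"; anchoring the patterns in Allowed (for example "^(1|2)$") is one way to guard against it.

```r
# Hypothetical group labels, including one that merely contains "1" and "2"
groups <- c("1", "2", "12")

# Unanchored: "1|2" matches anywhere in the string, so "12" matches too
loose <- grepl("1|2", groups)

# Anchored: only the exact labels "1" or "2" match
strict <- grepl("^(1|2)$", groups)
```

With only single-digit groups, as in the example data, both patterns give the same result.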

How would you loop this in R?

I have this fairly simple task that I'm having a hard time looping.
So, let's assume I have this tibble:
library(tidyverse)
dat <- tibble(player1 = c("aa","bb","cc"), player2 = c("cc","aa","bb"))
My goal is to make three new columns (one for each unique "player") and assign a value of 1 to the column if the player is "player1", -1 if the player is "player2", and 0 otherwise.
Previously, I have been doing it like this:
dat %>% mutate( aa = ifelse(player1 == "aa",1,ifelse(player2 == "aa",-1,0)),
bb = ifelse(player1 == "bb",1,ifelse(player2 == "bb",-1,0)),
cc = ifelse(player1 == "cc",1,ifelse(player2 == "cc",-1,0)))
This works, but now I have hundreds of different "players", so it would seem silly to do this manually like that. I have tried and read about loops in R, but I just can't get this one right.
Using model.matrix() from base R:
dat[sort(unique(dat$player1))] <-  # model.matrix() orders its columns by sorted level
model.matrix(~0+ player1, data = dat) - model.matrix(~0+ player2, data = dat)
dat
player1 player2 aa bb cc
<chr> <chr> <dbl> <dbl> <dbl>
1 aa cc 1 0 -1
2 bb aa -1 1 0
3 cc bb 0 -1 1
This assumes every player appears in both columns. Otherwise, convert both columns to factors with the same full set of levels and take the column names from levels() rather than unique().
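A minimal sketch of that factor-based variant, assuming hypothetical data in which "dd" only ever appears as player2. The names players, p1, p2 and m are illustrative.

```r
# Hypothetical data: "dd" never appears as player1, "cc" never as player2
dat <- data.frame(player1 = c("aa", "bb", "cc"),
                  player2 = c("dd", "aa", "bb"),
                  stringsAsFactors = FALSE)

# One shared, sorted set of levels for both columns
players <- sort(unique(c(dat$player1, dat$player2)))
p1 <- factor(dat$player1, levels = players)
p2 <- factor(dat$player2, levels = players)

# Both indicator matrices now have one column per level, in the same order
m <- model.matrix(~ 0 + p1) - model.matrix(~ 0 + p2)
colnames(m) <- players

dat <- cbind(dat, m)
```

Because both factors share the same levels, the two matrices line up column by column even when a player is missing from one side.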
We could go from the initial structure to a "long"(-er) format with one row per (game, player), recode the roles to 1/-1, and then go wide again to get the desired output:
dat %>%
mutate(game_id = row_number()) %>%
gather("role", "player", -game_id) %>%
mutate(role = recode(role, "player1" = 1L, "player2" = -1L)) %>%
spread(player, role, fill = 0L)
#> # A tibble: 3 x 4
#> game_id aa bb cc
#> <int> <int> <int> <int>
#> 1 1 1 0 -1
#> 2 2 -1 1 0
#> 3 3 0 -1 1
You can use pivot_longer() to stack those columns starting with "player" and then pivot it to wide. The advantage is that you can do recoding within pivot_wider() by the argument values_fn.
library(tidyverse)
dat %>%
rowid_to_column("id") %>%
pivot_longer(starts_with("player")) %>%
pivot_wider(names_from = value, names_sort = TRUE,
values_from = name, values_fill = 0,
values_fn = function(x) c(1, -1)[match(x, c("player1", "player2"))])
# # A tibble: 3 x 4
# id aa bb cc
# <int> <dbl> <dbl> <dbl>
# 1 1 1 0 -1
# 2 2 -1 1 0
# 3 3 0 -1 1
Note: Development on gather()/spread() is complete, and for new code we recommend switching to pivot_longer()/_wider(), which is easier to use, more featureful, and still under active development.
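Since the question asks for a loop, here is a base R sketch that iterates over the players with sapply(). It is an alternative to the answers above, not a replacement; players, scores and result are illustrative names.

```r
dat <- data.frame(player1 = c("aa", "bb", "cc"),
                  player2 = c("cc", "aa", "bb"),
                  stringsAsFactors = FALSE)

# Every player that occurs in either column
players <- sort(unique(c(dat$player1, dat$player2)))

# For each player: (is player1) minus (is player2) gives 1, -1 or 0 per row
scores <- sapply(players, function(p) (dat$player1 == p) - (dat$player2 == p))

result <- cbind(dat, scores)
```

sapply() returns a matrix with one named column per player here because dat has more than one row; with a single row it would simplify to a plain vector, so that case would need separate handling.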

Use a specific value in summarise (dplyr) without filtering it out

I am trying to compare a new algorithm's result against an old one. I need to know approximately how many days of difference the new algorithm has in predicting a "D" versus the old one.
I can't figure out how to refer to the first row (day) that contains a 'D' (min(day) where new == 'D') without filtering. (I was able to grab that row using a double filter because of the grouping, but not to use it.) I want to use it in summarise with dplyr, which is why I have included pseudocode similar to where I currently am with my own dataset.
In my data there are groups of varying length (number of days) for each ID, which is why I made groups of different lengths in the example.
library(dplyr)
id = c(123,123,123,123,123,456,456,456,456)
old = c('S','S','S','S','D','S','S','D','D')
new = c('S','S','D','D','D','S','D','D','D')
day = c(1,2,3,4,5,1,2,3,4)
data = data.frame(id,old,new,day)
data
#> id old new day
#> 1 123 S S 1
#> 2 123 S S 2
#> 3 123 S D 3
#> 4 123 S D 4
#> 5 123 D D 5
#> 6 456 S S 1
#> 7 456 S D 2
#> 8 456 D D 3
#> 9 456 D D 4
d = data %>%
group_by(id)%>%
arrange(day,.by_group=T)%>%
add_tally(new=='S',name='S')%>%
add_tally(new=='D',name='D')%>%
group_by(id,S,D)
# summarise(diff = (day of 1st old D) - (day of 1st new D) )
#Expected Outcome
ido = c(123,456)
S = c(2,1)
D = c(3,3)
diff = c(2,1)
outcome = data.frame(ido,S,D,diff)
outcome
#> ido S D diff
#> 1 123 2 3 2
#> 2 456 1 3 1
Created on 2019-12-26 by the reprex package (v0.3.0)
We can group_by id, count the occurrences of 'S' and 'D' in new, and take the difference between the row positions of the first 'D' in old and in new.
library(dplyr)
data %>%
group_by(id) %>%
summarise(S = sum(new == 'S'),
D = sum(new == 'D'),
diff = which.max(old == 'D') - which.max(new == 'D'))
#OR if there could be id without D use
#diff = which(old == 'D')[1] - which(new == 'D')[1])
# A tibble: 2 x 4
# id S D diff
# <dbl> <int> <int> <int>
#1 123 2 3 2
#2 456 1 3 1
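Note that which.max() returns a row position within the group, so the difference above is counted in rows. That equals days only when the days are consecutive and sorted. A sketch that indexes day directly, assuming the data from the question (res is an illustrative name):

```r
library(dplyr)

data <- data.frame(id  = c(123, 123, 123, 123, 123, 456, 456, 456, 456),
                   old = c("S", "S", "S", "S", "D", "S", "S", "D", "D"),
                   new = c("S", "S", "D", "D", "D", "S", "D", "D", "D"),
                   day = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
                   stringsAsFactors = FALSE)

res <- data %>%
  group_by(id) %>%
  arrange(day, .by_group = TRUE) %>%
  summarise(S = sum(new == "S"),
            D = sum(new == "D"),
            # match() gives the position of the first "D" (NA if there is none);
            # indexing day with it yields the actual day value
            diff = day[match("D", old)] - day[match("D", new)])
```

For an id without any 'D', match() returns NA and diff becomes NA instead of a misleading number.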
Alternatively, add the difference between the 'day' of the first occurrence of 'D' in the 'old' and 'new' columns as a grouping column, count the rows per value of 'new' with summarise, and then use pivot_wider to spread the counts into columns.
library(dplyr)
library(tidyr)
data %>%
group_by(id) %>%
  group_by(diff = day[match("D", old)] - day[match("D", new)],
           new, .add = TRUE) %>%  # `.add` replaces the deprecated `add` (dplyr >= 1.0.0)
  summarise(n = n()) %>%
  ungroup() %>%
pivot_wider(names_from = new, values_from = n)
# A tibble: 2 x 4
# id diff D S
# <dbl> <dbl> <int> <int>
#1 123 2 3 2
#2 456 1 3 1

R dplyr count observations within groups

I have a data frame with yes/no values for different days and hours. For each day, I want to get a total number of hours where I have data, as well as the total number of hours where there is a value of Y.
df <- data.frame(day = c(1,1,1,2,2,3,3,3,3,4),
hour = c(1,2,3,1,2,1,2,3,4,1),
YN = c("Y","Y","Y","Y","Y","Y","N","N","N","N"))
df %>%
group_by(day) %>%
summarise(tot.hour = n(),
totY = WHAT DO I PUT HERE?)
Compare YN with 'Y' to get a logical vector, then sum it (each TRUE counts as 1):
df %>%
group_by(day) %>%
dplyr::summarise(tot.hour = n(),
totY = sum(YN=='Y'))
# A tibble: 4 x 3
day tot.hour totY
<dbl> <int> <int>
1 1 3 3
2 2 2 2
3 3 4 1
4 4 1 0
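The trick works because arithmetic on a logical vector treats TRUE as 1 and FALSE as 0. As a small sketch of the same idea, mean() on the comparison gives the share of hours with a Y; the propY column is an illustrative addition, not something the question asked for.

```r
library(dplyr)

df <- data.frame(day  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
                 hour = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1),
                 YN   = c("Y", "Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N"),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(day) %>%
  summarise(tot.hour = n(),
            totY  = sum(YN == "Y"),    # number of TRUE values
            propY = mean(YN == "Y"))   # fraction of TRUE values
```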

Flag dates based on multiple columns

I have a df that provides information about the create_date and delete_date (if any) for a given ID.
Structure:
ID create_date1 create_date2 delete_date1 delete_date2
1 01-01-2014 NA NA NA
2 01-04-2014 01-08-2014 01-05-2014 NA
The create_date and delete_date columns extend up to 10, i.e. create_date10 and delete_date10 columns are present.
Rules/Logic:
We charge a user on a monthly basis. Even if a user was created on the 30th of a month, the user is treated as active for that whole month (very low cost).
If a user has a delete date (irrespective of the day) in a given month, the user is not charged from the next month on.
If a user has only a create_date and no delete_date, then all months including the create month are charged.
Output expected:
ID 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 2014-08
1 1 1 1 1 1 1 1 1
2 0 0 0 1 1 0 0 1
so on till current date
1 indicates the user is charged/active for that month
Problem:
I have been struggling with this and can't work out how to approach it. My earlier method is a bit too slow.
Previous Solution:
Reshape the dataset into tall (long) format
Insert a sequence of dates for each ID as a new column
Use a for loop to check the status:
for each ID, status is 1 if create_date is equal to sequence,
0 if lag(delete_date) is equal to sequence,
otherwise the same as lag(status)
ID create_date delete_date sequence status?
1 01-01-2014 NA 2014-01 1
1 01-01-2014 NA 2014-02 1
1 01-01-2014 NA 2014-03 1
This may not be that efficient, and it assumes the data covers a single year (it could be extended easily):
# convert all dates to Date format
df[,colnames(df[-1])] = lapply(colnames(df[-1]), function(x) as.Date(df[[x]], format = "%d-%m-%Y"))
# extract the month
library(lubridate)
df[,colnames(df[-1])] = lapply(colnames(df[-1]), function(x) month(df[[x]]))
# df
# ID create_date1 create_date2 delete_date1 delete_date2
#1 1 1 NA NA NA
#2 2 4 8 5 NA
# get the current month
current.month <- month(Sys.Date())
# assume for now current month is 9
current.month <- 9
flags <- rep(FALSE, current.month)
func <- function(x){
x[is.na(x)] <- current.month # replacing all NA with current month(9)
create.columns.indices <- x[grepl("create_date", colnames(df[-1]))] # extract the create_months
delete.columns.indices <- x[grepl("delete_date", colnames(df[-1]))] # extract the delete_months
flags <- pmin(1,colSums(t(sapply(seq_along(create.columns.indices),
function(x){
flags[create.columns.indices[x]:delete.columns.indices[x]] = TRUE;
flags
}))))
flags
}
df1 = cbind(df$ID , t(apply(df[-1], 1, func)))
colnames(df1) = c("ID", paste0("month",1:current.month))
# df1
# ID month1 month2 month3 month4 month5 month6 month7 month8 month9
#[1,] 1 1 1 1 1 1 1 1 1 1
#[2,] 2 0 0 0 1 1 0 0 1 1
Here's a still-pretty-long tidyverse approach:
library(tidyverse)
df %>% gather(var, date, -ID) %>% # reshape to long form
# separate date type from column set number
separate(var, c('action', 'number'), sep = '_date', convert = TRUE) %>%
mutate(date = as.Date(date, '%d-%m-%Y')) %>% # parse dates
spread(action, date) %>% # spread create and delete to two columns
mutate(min_date = min(create, delete, na.rm = TRUE), # add helper columns; use outside
max_date = max(create, delete, na.rm = TRUE)) %>% # variable to save memory if an issue
group_by(ID, number) %>%
mutate(month = list(seq(min_date, max_date, by = 'month')), # add month sequence list column
# boolean vector of whether range of months in whole range
active = ifelse(is.na(create),
list(rep(FALSE, length(month[[1]]))),
lapply(month, `%in%`,
seq.Date(create,
min(delete, max_date, na.rm = TRUE),
by = 'month')))) %>%
unnest() %>% # unnest list columns to long form
group_by(ID, month = format(month, '%Y-%m')) %>%
summarise(active = any(active) * 1L) %>% # combine multiple rows for one ID
spread(month, active) # reshape to wide form
## Source: local data frame [2 x 9]
## Groups: ID [2]
##
## ID `2014-01` `2014-02` `2014-03` `2014-04` `2014-05` `2014-06` `2014-07` `2014-08`
## * <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1 1 1 1 1 1 1 1 1
## 2 2 0 0 0 1 1 0 0 1
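The charging rules from the question can also be sketched for a single create/delete pair in base R. This is a minimal illustration, not a full solution: active_flags() and parse_dmy() are hypothetical helpers, the month range is fixed to the eight months of the expected output, and the pairs for ID 2 are combined by hand.

```r
# Month starts covered by the expected output in the question
months <- seq(as.Date("2014-01-01"), as.Date("2014-08-01"), by = "month")

# Hypothetical helper: 1 for every month from the create month through the
# delete month (inclusive), or through the last month if there is no delete
active_flags <- function(create, delete, months) {
  first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))
  start <- first_of_month(create)
  end <- if (is.na(delete)) max(months) else first_of_month(delete)
  as.integer(months >= start & months <= end)
}

# Hypothetical helper: parse the dd-mm-yyyy strings used in the question
parse_dmy <- function(x) as.Date(x, format = "%d-%m-%Y")

# ID 1: created in January 2014, never deleted
id1 <- active_flags(parse_dmy("01-01-2014"), as.Date(NA), months)

# ID 2: two create/delete pairs, combined with pmax() so that a month
# counts as active if any pair covers it
id2 <- pmax(active_flags(parse_dmy("01-04-2014"), parse_dmy("01-05-2014"), months),
            active_flags(parse_dmy("01-08-2014"), as.Date(NA), months))
```

For the full table, the helper would have to be applied to each non-missing create/delete pair of every ID and the results combined per ID, which is essentially what the answers above do.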

Resources