I have a dataframe with the following structure:
Timestamp Value1 Value2 Problem1 Problem2
00:00 32 40 No No
00:05 12 55 No No
00:10 14 42 Yes No
00:15 50 33 Yes No
00:20 78 47 No No
Here Problem1 indicates whether there is a problem with Value1, and Problem2 whether there is a problem with Value2. If Problem1 is Yes, I need to replace Value1 with Value2. If both have problems, the values should remain unchanged.
The difficulty is that I don't know in advance how many Value and Problem columns there will be. With more than two, I need to replace each problematic value with the average of the values that have no problem.
So, in another example:
Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
00:00 32 40 45 No No No
00:05 12 55 48 No No No
00:10 14 42 55 Yes No No
00:15 50 33 13 Yes No Yes
00:20 78 47 75 No No No
Here I'd need to replace Value1 at 00:10 by the average of Value2 and Value3. Also, I'd need to replace Value1 and Value3 at 00:15 by Value2.
I bet there is a more elegant solution.
library(tidyr)
library(dplyr)
df %>%
mutate(across(starts_with("Problem"), ~ .x == "Yes")) %>%
pivot_longer(-Timestamp, names_to = c("name", "id"), names_pattern = "(\\D+)(\\d+)") %>% # \\D+ keeps multi-digit ids such as Value10 intact
pivot_wider() %>%
group_by(Timestamp) %>%
mutate(Value = case_when(sum(Problem) == 0 | sum(Problem) == n() | !Problem ~ Value,
TRUE~ sum(Value * (1 - Problem))/sum(1-Problem))) %>%
pivot_longer(cols=c("Value", "Problem")) %>%
mutate(name = paste0(name,id), .keep="unused") %>%
pivot_wider() %>%
ungroup() %>%
mutate(across(starts_with("Problem"), ~ ifelse(.x == 1, "Yes", "No")))
returns
# A tibble: 5 x 7
Timestamp Value1 Problem1 Value2 Problem2 Value3 Problem3
<time> <dbl> <chr> <dbl> <chr> <dbl> <chr>
1 00'00" 32 No 40 No 45 No
2 05'00" 12 No 55 No 48 No
3 10'00" 48.5 Yes 42 No 55 No
4 15'00" 33 Yes 33 No 33 Yes
5 20'00" 78 No 47 No 75 No
What approach did I use?
Transform your Problem variables into booleans. R can use booleans in calculations; they are coerced to numeric (TRUE is 1, FALSE is 0).
Turn the number in each Value/Problem column name into an id, so that every timestamp has one row per id with a Value and a Problem column.
Calculate the new value based on the number of problems and if the value is problematic.
Restore the shape of your data.frame.
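The calculation in the third step can also be sketched in base R on plain matrices, which may make the rule easier to check by hand. This is a minimal sketch, not the pipeline above: the helper name replace_problem_values and the matrix layout (one column per Value/Problem pair, in the same column order) are assumptions made for illustration.

```r
# Hypothetical helper: `values` is a numeric matrix, `problems` a logical
# matrix of the same shape (TRUE where the value is flagged).
replace_problem_values <- function(values, problems) {
  ok         <- !problems
  good_sum   <- rowSums(values * ok)     # sum of the unflagged values per row
  good_count <- rowSums(ok)              # number of unflagged values per row
  row_mean   <- good_sum / good_count    # NaN when every value is flagged
  # Only touch flagged cells in rows that have at least one good value,
  # so rows where everything is flagged stay unchanged.
  fix <- problems & (good_count > 0)
  values[fix] <- row_mean[row(values)[fix]]
  values
}

values <- cbind(Value1 = c(32, 12, 14, 50, 78),
                Value2 = c(40, 55, 42, 33, 47),
                Value3 = c(45, 48, 55, 13, 75))
problems <- cbind(c("No", "No", "Yes", "Yes", "No"),
                  c("No", "No", "No",  "No",  "No"),
                  c("No", "No", "No",  "Yes", "No")) == "Yes"

out <- replace_problem_values(values, problems)
out
```

On the three-column example this gives 48.5 for Value1 at 00:10 and 33 for both Value1 and Value3 at 00:15, and a row in which every value is flagged is left as it is.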
Data
df <- structure(list(Timestamp = structure(c(0, 300, 600, 900, 1200
), class = c("hms", "difftime"), units = "secs"), Value1 = c(32,
12, 14, 50, 78), Value2 = c(40, 55, 42, 33, 47), Value3 = c(45,
48, 55, 13, 75), Problem1 = c("No", "No", "Yes", "Yes", "No"),
Problem2 = c("No", "No", "No", "No", "No"), Problem3 = c("No",
"No", "No", "Yes", "No")), problems = structure(list(row = 5L,
col = "Problem3", expected = "", actual = "embedded null",
file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(Timestamp = structure(list(format = ""), class = c("collector_time",
"collector")), Value1 = structure(list(), class = c("collector_double",
"collector")), Value2 = structure(list(), class = c("collector_double",
"collector")), Value3 = structure(list(), class = c("collector_double",
"collector")), Problem1 = structure(list(), class = c("collector_character",
"collector")), Problem2 = structure(list(), class = c("collector_character",
"collector")), Problem3 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
I use the data.table package. I call your data given in the second example "din".
I hope this code helps you:
#I use the library data.table; get data in data.table format
library(data.table)
din <- data.table(din)
din[,Value1:=as.numeric(Value1)]
din[,Value2:=as.numeric(Value2)]
din[,Value3:=as.numeric(Value3)]
#set Values to NA if there is a Problem
din[Problem1=="Yes", Value1:=NA]
din[Problem2=="Yes", Value2:=NA]
din[Problem3=="Yes", Value3:=NA]
#print table with NA replaced if we have a Problem
#print(din)
# Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
#1: 00:00 32 40 45 No No No
#2: 00:05 12 55 48 No No No
#3: 00:10 NA 42 55 Yes No No
#4: 00:15 NA 33 NA Yes No Yes
#5: 00:20 78 47 75 No No No
#use the mean function to replace NAs in the table (this only works if Timestamp is a unique id; otherwise generate one and use it in the by argument)
din[is.na(Value1), Value1:=mean(c(Value2,Value3), na.rm=TRUE), by=Timestamp]
din[is.na(Value2), Value2:=mean(c(Value1,Value3), na.rm=TRUE), by=Timestamp]
din[is.na(Value3), Value3:=mean(c(Value1,Value2), na.rm=TRUE), by=Timestamp]
#print final table
#print(din)
# Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
#1: 00:00 32.0 40 45 No No No
#2: 00:05 12.0 55 48 No No No
#3: 00:10 48.5 42 55 Yes No No
#4: 00:15 33.0 33 33 Yes No Yes
#5: 00:20 78.0 47 75 No No No
I made a sample for the data:
df <- data.frame(value1 = runif(10, min = 0, max = 100),
value2 = runif(10, min = 0, max = 100),
value3 = runif(10, min = 0, max = 100))
df_problem <- data.frame(problem1 = sample(c('yes','no'), 10, replace = T),
problem2 = sample(c('yes','no'), 10, replace = T),
problem3 = sample(c('yes','no'), 10, replace = T))
See that I separated the values from the problems. Then:
# 'no' becomes 1 (keep the value), 'yes' becomes NA (value to be replaced)
df_problem[df_problem == 'no'] <- 1
df_problem[df_problem == 'yes'] <- NA
df_problem <- matrix(as.numeric(unlist(df_problem)), nrow = nrow(df)) #rebuild matrix
Finally:
df <- df * df_problem
for (i in 1:nrow(df)){
if (T %in% is.na(df[i,])){
df[i,c(which(is.na(df[i,])))] <- mean(unlist(df[i,]), na.rm = T)
}
}
df
Related
I want to pick up rows of which time data is between multiple intervals.
The data frame is like this:
dputs
structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B"), score_time = c("2022/09/01 9:00:00", "2022/09/02 18:00:00",
"2022/09/03 12:00:00", NA, NA, "2022/09/15 18:00:00", "2022/09/18 20:00:00",
NA, NA, NA), score = c(243, 232, 319, NA, NA, 436, 310, NA, NA,
NA), treatment_start = c(NA, NA, NA, "2022/09/02 8:00:00", "2022/09/03 11:00:00",
NA, NA, "2022/09/15 8:00:00", "2022/09/16 14:00:00", "2022/09/16 23:00:00"
), treatment_end = c(NA, NA, NA, "2022/09/02 22:00:00", "2022/09/09 12:00:00",
NA, NA, "2022/09/16 2:00:00", "2022/09/16 22:00:00", "2022/09/17 0:00:00"
)), row.names = c(NA, -10L), spec = structure(list(cols = list(
ID = structure(list(), class = c("collector_character", "collector"
)), score_time = structure(list(), class = c("collector_character",
"collector")), score = structure(list(), class = c("collector_double",
"collector")), treatment_start = structure(list(), class = c("collector_character",
"collector")), treatment_end = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
ID score_time score treatment_start treatment_end
<chr> <chr> <dbl> <chr> <chr>
1 A 2022/09/01 9:00:00 243 NA NA
2 A 2022/09/02 18:00:00 232 NA NA
3 A 2022/09/03 12:00:00 319 NA NA
4 A NA NA 2022/09/02 8:00:00 2022/09/02 22:00:00
5 A NA NA 2022/09/03 11:00:00 2022/09/09 12:00:00
6 B 2022/09/15 18:00:00 436 NA NA
7 B 2022/09/18 20:00:00 310 NA NA
8 B NA NA 2022/09/15 8:00:00 2022/09/16 2:00:00
9 B NA NA 2022/09/16 14:00:00 2022/09/16 22:00:00
10 B NA NA 2022/09/16 23:00:00 2022/09/17 0:00:00
Each ID has multiple score values, each with its measurement time.
Each ID also has more than one treatment period, given by a start and an end time.
I want the score values that were measured during a treatment period.
I tried using lubridate and the tidyverse to create intervals with mutate, but could not work out how to apply an "%in%"-style test.
Here is my attempt, which gets as far as putting the intervals in the same rows as the score values.
data %>%
mutate(trt_interval = interval(start = treatment_start, end = treatment_end)) %>%
group_by(ID) %>%
mutate(num = row_number()) %>%
pivot_wider(names_from = num, names_prefix = "intvl", values_from = trt_interval) %>%
fill(c(intvl1:last_col()), .direction = "up")
Desired output is like this.
(The first score of A and the last score of B dismissed because their score_time are out of interval.)
ID score
<chr> <dbl>
1 A 232
2 A 319
3 B 436
I would like to know a smarter way to put the data in one row and how to apply "%in%" to multiple intervals.
Sorry that the question is not well formed and includes multiple steps, but any advice would be a great help.
Hi, I would first create two separate data frames, one for the scores and one for the intervals. Then I would join them and keep the scores that fall within a treatment interval.
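library(dplyr)
library(lubridate)
# Scores: keep the rows that carry a measurement and parse the timestamp
data_score <- data %>%
  filter(!is.na(score_time)) %>%
  select(-starts_with("treat")) %>%
  mutate(score_time = ymd_hms(score_time))
data_score
# Treatment periods: keep the remaining rows and build one interval per row
data_interval <- data %>%
  filter(is.na(score_time)) %>%
  select(ID, starts_with("treat")) %>%
  mutate(trt_interval = interval(start = ymd_hms(treatment_start),
                                 end = ymd_hms(treatment_end)))
# Pair every score with every interval of the same ID, then keep the
# scores whose time lies inside the paired interval
data_score %>%
  inner_join(data_interval, by = "ID") %>%
  filter(score_time %within% trt_interval)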
Hope this helps!!
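The same join-then-filter idea can be sketched in base R, which may help to see what the test does without lubridate. This is only an illustration under stated assumptions: the object names scores and intervals are hypothetical, the timestamps are retyped by hand from the question in zero-padded form, and everything is treated as UTC.

```r
# Toy data mirroring the question (hypothetical object names)
scores <- data.frame(
  ID = c("A", "A", "A", "B", "B"),
  score_time = as.POSIXct(c("2022-09-01 09:00:00", "2022-09-02 18:00:00",
                            "2022-09-03 12:00:00", "2022-09-15 18:00:00",
                            "2022-09-18 20:00:00"), tz = "UTC"),
  score = c(243, 232, 319, 436, 310)
)
intervals <- data.frame(
  ID = c("A", "A", "B", "B", "B"),
  start = as.POSIXct(c("2022-09-02 08:00:00", "2022-09-03 11:00:00",
                       "2022-09-15 08:00:00", "2022-09-16 14:00:00",
                       "2022-09-16 23:00:00"), tz = "UTC"),
  end = as.POSIXct(c("2022-09-02 22:00:00", "2022-09-09 12:00:00",
                     "2022-09-16 02:00:00", "2022-09-16 22:00:00",
                     "2022-09-17 00:00:00"), tz = "UTC")
)

# Every score is paired with every interval of the same ID
pairs <- merge(scores, intervals, by = "ID")

# A score qualifies if its time lies inside the paired interval
inside <- pairs$score_time >= pairs$start & pairs$score_time <= pairs$end

# unique() guards against a score that falls into two overlapping intervals
result <- unique(pairs[inside, c("ID", "score_time", "score")])
result
```

With this data the result holds the scores 232 and 319 for A and 436 for B, which matches the desired output in the question.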
Problem: I have 2 datasets with no matching identifiers (like ID) and need to find the closest match in df1$time to df2$tstart. df1 (with time column) has 660,000 rows with time stamps approximately every 0.00125 s.
Whatever the closest match is to df2$tstart, I would like a new column made (df1$trial_start) that says "yes", otherwise "no".
I've tried findInterval, but it only seems to match in ascending order, and doesn't check values in both directions. In the below code, it looks good for most of the outputs, but there are some indices where the value after the returned index is closer to $tstart
#my actual code:
index_closest <- findInterval(iti_summaries_2183[["24"]]$tstart, poke_1s$time)
poke_1s$trial_start <- ifelse(seq_len(nrow(poke_1s)) %in% index_closest, "yes", "no")
I've also tried which.min, which doesn't work since the lists lengths don't match.
Additionally, I've fought with roll = "nearest" like here but the functions return values and I'm not sure how to create a new column and assign y/n.
Code to replicate problem:
n <- 773
df1 <- structure(list(initiate = sample(c(0,1), replace=TRUE, size=n),
left = sample(c(0,1), replace=TRUE, size=n),
right = sample(c(0,1), replace=TRUE, size=n),
time = seq(from = 2267.2, to = 2363.75, by = 0.125)))
df1 <- data.frame(df1)
df2 <- structure(list(trial = c(156:162),
control = c(0, 0, 0, 0, 3, 0, 3),
t_start = c(2267.231583, 2289.036355, 2298.046849, 2318.933635, 2328.334036, 2347.870449, 2363.748095),
t_end = c(2268.76760, 2290.83370, 2299.38547, 2320.71400, 2329.93985, 2349.15464, 2365.12455)),
class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L), spec = structure(list(
cols = list(trial = structure(list(), class = c("collector_double",
"collector")), control = structure(list(), class = c("collector_double",
"collector")), t_start = structure(list(), class = c("collector_double",
"collector")), t_end = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
If I understand your question correctly :
library(data.table)
setDT(df1)
setDT(df2)
df1[df2,.(initiate,left,right,x.time,trial,control,t_start,t_end,
trial_start=fifelse(x.time>t_start&x.time<t_end,'Y','N')),
on=.(time=t_start),roll='nearest']
initiate left right x.time trial control t_start t_end trial_start
<num> <num> <num> <num> <int> <num> <num> <num> <char>
1: 0 0 1 2267.200 156 0 2267.232 2268.768 N
2: 0 0 1 2289.075 157 0 2289.036 2290.834 Y
3: 0 0 1 2298.075 158 0 2298.047 2299.385 Y
4: 1 1 1 2318.950 159 0 2318.934 2320.714 Y
5: 1 1 1 2328.325 160 3 2328.334 2329.940 N
6: 0 0 1 2347.825 161 0 2347.870 2349.155 N
7: 1 1 0 2363.700 162 3 2363.748 2365.125 N
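If the goal is instead a yes/no column on df1 itself, one option is to keep findInterval but compare the returned index with its right-hand neighbour, since findInterval only ever returns the last element at or below the target. The following is a minimal base R sketch on toy data; the helper name nearest_index is hypothetical and the time vector is assumed to be sorted in ascending order.

```r
# Hypothetical helper: for each x, return the index of the closest element
# in `sorted` (an ascending numeric vector). Ties go to the lower index.
nearest_index <- function(x, sorted) {
  lo <- pmax(findInterval(x, sorted), 1L)   # last element <= x, at least 1
  hi <- pmin(lo + 1L, length(sorted))       # its right-hand neighbour
  ifelse(abs(sorted[hi] - x) < abs(x - sorted[lo]), hi, lo)
}

time    <- seq(0, 1, by = 0.125)   # stands in for df1$time
t_start <- c(0.05, 0.36, 0.99)     # stands in for df2$t_start

idx <- nearest_index(t_start, time)

# Flag the rows of the time vector that are the closest match to some t_start
trial_start <- ifelse(seq_along(time) %in% idx, "yes", "no")
```

Here 0.36 is matched to 0.375 (index 4) rather than 0.25, which is the case a plain findInterval call gets wrong.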
Hi, I am trying to figure out how to match the column names of one data frame against another data frame and rename them. If a column name has no match, I want to drop that column instead.
For example, I would use this main dataset, call it DF1:
Name Reference Good Fair Bad Great Poor
George Hill 34 21 33 21 32
Frank Stairs 29 28 29 30 29
Bertha Trail 25 25 24 21 26
Then another DF, call this DF2, that allows me to replace the names of the columns of DF1
Name   Adjusted_Name
Good   good_run
Great  very_great_work
Bad    bad run
Fair   fair run decent
The replacement names follow no particular pattern. I want to match each DF1 column name against DF2$Name; where there is a match, the column is renamed to the DF2$Adjusted_Name value from the same row. Columns with no match (other than Name and Reference) are dropped from DF1.
So the final goal would be to achieve:
Name    Reference  good_run  fair run decent  bad run  very_great_work
George  Hill       34        21               33       21
Frank   Stairs     29        28               29       30
Bertha  Trail      25        25               24       21
In this case, "Poor" was dropped because it has no match in DF2$Name.
How should I go about this? Would anything change in how I code it if there were thousands of columns? I am fairly new to R and would appreciate any tips. Thank you!
If you are open to a tidyverse solution, you could use
library(dplyr)
library(tibble)
df %>%
rename_with(~deframe(df2)[.x], .cols = any_of(df2$Name)) %>% # any_of() tolerates names in df2 that are absent from df
select(Name, Reference, any_of(df2$Adjusted_Name))
This returns
# A tibble: 3 x 6
Name Reference good_run very_great_work bad_run fair_run_decent
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 George Hill 34 21 33 21
2 Frank Stairs 29 30 29 28
3 Bertha Trail 25 21 24 25
Data
df <- structure(list(Name = c("George", "Frank", "Bertha"), Reference = c("Hill",
"Stairs", "Trail"), Good = c(34, 29, 25), Fair = c(21, 28, 25
), Bad = c(33, 29, 24), Great = c(21, 30, 21), Poor = c(32, 29,
26)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L), spec = structure(list(cols = list(Name = structure(list(), class = c("collector_character",
"collector")), Reference = structure(list(), class = c("collector_character",
"collector")), Good = structure(list(), class = c("collector_double",
"collector")), Fair = structure(list(), class = c("collector_double",
"collector")), Bad = structure(list(), class = c("collector_double",
"collector")), Great = structure(list(), class = c("collector_double",
"collector")), Poor = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df2 <- structure(list(Name = c("Good", "Great", "Bad", "Fair"), Adjusted_Name = c("good_run",
"very_great_work", "bad_run", "fair_run_decent")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), spec = structure(list(
cols = list(Name = structure(list(), class = c("collector_character",
"collector")), Adjusted_Name = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Try the following - using the list of adjusted names, you can grep the list of desired words against column names and subset the data frame on it:
Data
df <- read.table(header = TRUE, text = "Name Reference Good Fair Bad Great Poor
George Hill 34 21 33 21 32
Frank Stairs 29 28 29 30 29
Bertha Trail 25 25 24 21 26")
adj_name <- c("good_run","very_great_work","bad run","fair run decent")
Index the columns based on grep from the string of desired names (note the tolower() on the column names as well)
desired_words <- paste(unlist(strsplit(adj_name, "_| ")), collapse = "|")
df[,c(1:2,grep(desired_words, tolower(names(df))))]
Output
# Name Reference Good Fair Bad Great
#1 George Hill 34 21 33 21
#2 Frank Stairs 29 28 29 30
#3 Bertha Trail 25 25 24 21
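For completeness, the lookup-and-rename step can also be written in base R with match(), which does an exact name lookup instead of pattern matching and should scale to thousands of columns. This is a minimal sketch; the object names lookup and keep_always are assumptions for illustration, and the adjusted names are taken from the question (spaces included).

```r
df <- data.frame(Name = c("George", "Frank", "Bertha"),
                 Reference = c("Hill", "Stairs", "Trail"),
                 Good = c(34, 29, 25), Fair = c(21, 28, 25),
                 Bad = c(33, 29, 24), Great = c(21, 30, 21),
                 Poor = c(32, 29, 26),
                 stringsAsFactors = FALSE)
lookup <- data.frame(Name = c("Good", "Great", "Bad", "Fair"),
                     Adjusted_Name = c("good_run", "very_great_work",
                                       "bad run", "fair run decent"),
                     stringsAsFactors = FALSE)
keep_always <- c("Name", "Reference")   # identifier columns, never renamed

# Position of each df column name in the lookup table, NA where there is none
idx <- match(names(df), lookup$Name)

# Keep identifier columns and every column that has a lookup entry
keep <- names(df) %in% keep_always | !is.na(idx)
out <- df[, keep, drop = FALSE]

# Renamed where a lookup entry exists, original name otherwise
new_names <- ifelse(is.na(idx), names(df), lookup$Adjusted_Name[idx])
names(out) <- new_names[keep]
names(out)
```

Poor has no entry in the lookup table, so it is dropped, and the remaining columns keep their original order in df.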
I need a chart of accounts to stay in order when accounts are added or dropped in future years. In accounting, accounts are sorted by type (for example Asset, Liability, Equity), but the type is not explicit in the dataset. Below is the code I'm using, which puts new accounts from Year2 and Year3 at the bottom.
XYZCompany_Consolidated <- XYZCompany_Year1 %>%
full_join(XYZCompany_Year2, by = "Account") %>%
full_join(XYZCompany_Year3, by = "Account")
Example: this picture is a simplified illustration. The orange highlight shows where the new accounts currently end up, the code I'm using is on the right, and the green shows what I'm trying to achieve.
Perhaps I'm overthinking this problem but I find it hard to solve. Let's define some data first:
df_year1 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"JaneDoe"), Year_1 = c(100, 1000, 20, 300, 500)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_1 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df_year2 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"Laptop", "JaneDoe"), Year_2 = c(80, 1200, 50, 300, 500, 0)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_2 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df_year3 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"Rent", "JaneDoe"), Year_3 = c(80, 1200, 50, 300, 1000, 0)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_3 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Those are similar to the data shown in the OP's picture, e.g. df_year1 looks like
# A tibble: 5 x 2
Account Year_1
<chr> <dbl>
1 Cash 100
2 Accounts 1000
3 Loan1 20
4 Auto 300
5 JaneDoe 500
Next we transform those data a little bit, namely
library(dplyr)
library(tidyr)
df_y1 <- df_year1 %>%
mutate(Year = 1,
no = row_number()) %>%
rename(value = Year_1)
which returns
# A tibble: 5 x 4
Account value Year no
<chr> <dbl> <dbl> <int>
1 Cash 100 1 1
2 Accounts 1000 1 2
3 Loan1 20 1 3
4 Auto 300 1 4
5 JaneDoe 500 1 5
The new column no stores the account's original position, column Year stores the chart's year. All three data.frames are processed like this, so we get df_y1, df_y2, df_y3.
Finally we bind them together
bind_rows(df_y1, df_y2, df_y3) %>%
mutate(num_years = max(Year)) %>%
group_by(Account) %>%
mutate(rank = sum((num_years - n() + 1) * no), .keep = "unused") %>%
pivot_wider(names_from = Year, names_prefix = "Year_", values_from = value) %>%
arrange(rank) %>%
select(-rank) %>%
ungroup()
and calculate a rank for each account. The accounts are ordered by this rank. As a result, we get
# A tibble: 7 x 4
Account Year_1 Year_2 Year_3
<chr> <dbl> <dbl> <dbl>
1 Cash 100 80 80
2 Accounts 1000 1200 1200
3 Loan1 20 50 50
4 Auto 300 300 300
5 Laptop NA 500 NA
6 Rent NA NA 1000
7 JaneDoe 500 0 0
Note
I believe there are better approaches, but at least this works for the example data.
I'm not sure the calculated rank is stable for other inputs, so take care.
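Since the rank is a heuristic, a different rule may be easier to reason about: keep the Year 1 order and insert each new account directly after the account that precedes it in the year where it first appears. The sketch below is an alternative to the rank calculation, not a rewrite of it; the helper name merge_order is hypothetical, and accounts that are new in different years but share a predecessor come out in a different relative order than with the rank.

```r
# Hypothetical helper: extend the running order `master` with accounts
# from `new`, placing each unseen account right after its predecessor.
merge_order <- function(master, new) {
  for (i in seq_along(new)) {
    acct <- new[i]
    if (acct %in% master) next
    if (i == 1) {
      master <- c(acct, master)          # new first account goes to the top
    } else {
      # the predecessor is already in `master` at this point
      pos <- match(new[i - 1], master)
      master <- append(master, acct, after = pos)
    }
  }
  master
}

y1 <- c("Cash", "Accounts", "Loan1", "Auto", "JaneDoe")
y2 <- c("Cash", "Accounts", "Loan1", "Auto", "Laptop", "JaneDoe")
y3 <- c("Cash", "Accounts", "Loan1", "Auto", "Rent", "JaneDoe")

ord <- Reduce(merge_order, list(y1, y2, y3))
ord
```

The resulting vector can then be used to sort a joined table, for example with order(match(consolidated$Account, ord)), where consolidated stands for the output of the full joins.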
I am using src_postgres to connect and dplyr::tbl function to fetch data from redshift database. I have applied some filters and top function to it using the dplyr itself. Now my data looks as below:
riid day hour
<dbl> <chr> <chr>
1 5542. "THURSDAY " 12
2 5862. "FRIDAY " 15
3 5982. "TUESDAY " 15
4 6022. WEDNESDAY 16
My final output should be as below:
riid  MON  TUES  WED  THUR  FRI  SAT  SUN
5542                  12
5862                        15
5982       15
6022             16
I have tried spread. It throws the below error because of the class type:
Error in UseMethod("spread_") : no applicable method for 'spread_'
applied to an object of class "c('tbl_dbi', 'tbl_sql', 'tbl_lazy',
'tbl')"
Since this is a really big table, I do not want to use dataframe as it takes a longer time.
I was able to use as below:
df_mon <- df2 %>% filter(day == 'MONDAY') %>% mutate(MONDAY = hour) %>% select(riid,MONDAY)
df_tue <- df2 %>% filter(day == 'TUESDAY') %>% mutate(TUESDAY = hour) %>% select(riid,TUESDAY)
df_wed <- df2 %>% filter(day == 'WEDNESDAY') %>% mutate(WEDNESDAY = hour) %>% select(riid,WEDNESDAY)
df_thu <- df2 %>% filter(day == 'THURSDAY') %>% mutate(THURSDAY = hour) %>% select(riid,THURSDAY)
df_fri <- df2 %>% filter(day == 'FRIDAY') %>% mutate(FRIDAY = hour) %>% select(riid,FRIDAY)
Is it possible to write all above in one statement?
Any help to transpose this in a faster manner is really appreciated.
EDIT
Adding the dput of the tbl object:
structure(list(src = structure(list(con = <S4 object of class structure("PostgreSQLConnection", package = "RPostgreSQL")>,
disco = <environment>), .Names = c("con", "disco"), class = c("src_dbi",
"src_sql", "src")), ops = structure(list(name = "select", x = structure(list(
name = "filter", x = structure(list(name = "filter", x = structure(list(
name = "group_by", x = structure(list(x = structure("SELECT riid,day,hour,sum(weightage) AS score FROM\n (SELECT riid,day,hour,\n POWER(2,(cast(datediff (seconds,convert_timezone('UTC','PKT',SYSDATE),TO_DATE(TO_CHAR(event_captured_dt,'mm/dd/yyyy hh24:mi:ss'),'mm/dd/yyyy hh24:mi:ss')) as decimal) / cast(7862400 as decimal))) AS weightage\n FROM (\n SELECT riid,convert_timezone('GMT','PKT',event_captured_dt) AS EVENT_CAPTURED_DT,\n TO_CHAR(convert_timezone('GMT','PKT',event_captured_dt),'DAY') AS day,\n TO_CHAR(convert_timezone('GMT','PKT',event_captured_dt),'HH24') AS hour\n FROM Zameen_STO_DATA WHERE EVENT_CAPTURED_DT >= TO_DATE((sysdate -30),'yyyy-mm-dd') and LIST_ID = 4282\n )) group by riid,day,hour", class = c("sql",
"character")), vars = c("riid", "day", "hour", "score"
)), .Names = c("x", "vars"), class = c("op_base_remote",
"op_base", "op")), dots = structure(list(riid = riid,
day = day), .Names = c("riid", "day")), args = structure(list(
add = FALSE), .Names = "add")), .Names = c("name",
"x", "dots", "args"), class = c("op_group_by", "op_single",
"op")), dots = structure(list(~min_rank(desc(~score)) <=
1), .Names = ""), args = list()), .Names = c("name",
"x", "dots", "args"), class = c("op_filter", "op_single",
"op")), dots = structure(list(~row_number() == 1), .Names = ""),
args = list()), .Names = c("name", "x", "dots", "args"), class = c("op_filter",
"op_single", "op")), dots = structure(list(~riid, ~day, ~hour), class = "quosures", .Names = c("",
"", "")), args = list()), .Names = c("name", "x", "dots", "args"
), class = c("op_select", "op_single", "op"))), .Names = c("src",
"ops"), class = c("tbl_dbi", "tbl_sql", "tbl_lazy", "tbl"))
I think what you're looking for is the ability to run the tidyr::spread() function against a remote source, or database. I have a PR for dbplyr that attempts to implement that here: https://github.com/tidyverse/dbplyr/pull/72, you can try it out by using: devtools::install_github("tidyverse/dbplyr", ref = devtools::github_pull(72)).
Use dcast from reshape2 package
> data
# A tibble: 4 x 3
riid day hour
<dbl> <chr> <dbl>
1 1.00 TH 12.0
2 2.00 FR 15.0
3 3.00 TU 15.0
4 4.00 WE 16.0
> dcast(data, riid~day, value.var = "hour")
riid FR TH TU WE
1 1 NA 12 NA NA
2 2 15 NA NA NA
3 3 NA NA 15 NA
4 4 NA NA NA 16
Further if you want to remove NA, then
> z <- dcast(data, riid~day, value.var = "hour")
> z[is.na(z)] <- ""
> z
riid FR TH TU WE
1 1 12
2 2 15
3 3 15
4 4 16
I tried to combine your multiple line attempts into one. Can you try this and let us know the outcome?
library(dplyr)
df %>%
mutate(Mon = ifelse(day=='MONDAY', hour, NA),
Tue = ifelse(day=='TUESDAY', hour, NA),
Wed = ifelse(day=='WEDNESDAY', hour, NA),
Thu = ifelse(day=='THURSDAY', hour, NA),
Fri = ifelse(day=='FRIDAY', hour, NA),
Sat = ifelse(day=='SATURDAY', hour, NA),
Sun = ifelse(day=='SUNDAY', hour, NA)) %>%
select(-day, -hour)
Output is:
riid Mon Tue Wed Thu Fri Sat Sun
1 5542 NA NA NA 12 NA NA NA
2 5862 NA NA NA NA 15 NA NA
3 5982 NA 15 NA NA NA NA NA
4 6022 NA NA 16 NA NA NA NA
Sample data:
# A tibble: 4 x 3
riid day hour
* <dbl> <chr> <int>
1 5542 THURSDAY 12
2 5862 FRIDAY 15
3 5982 TUESDAY 15
4 6022 WEDNESDAY 16
Update:
Can you try below approach using data.table?
library(data.table)
dt <- setDT(df)[, c("Mon","Tue","Wed","Thu","Fri","Sat","Sun") :=
list(ifelse(day=='MONDAY', hour, NA),
ifelse(day=='TUESDAY', hour, NA),
ifelse(day=='WEDNESDAY', hour, NA),
ifelse(day=='THURSDAY', hour, NA),
ifelse(day=='FRIDAY', hour, NA),
ifelse(day=='SATURDAY', hour, NA),
ifelse(day=='SUNDAY', hour, NA))][, !c("day","hour"), with=FALSE]