Force the application of the right as_tibble function on an object in a purrr::map call in R

I am trying to use the fact that as_tibble is a generic function so that I can process a column containing one out of two types of objects in the same way.
The list column temp can store either a list or an object of type AsspDataObj.
I can define an as_tibble method for that class:
library(dplyr)

as_tibble.AsspDataObj <- function(x, field = 1, prefix = NULL, na.zeros = TRUE) {
  df <- data.frame(x[[field]])
  if (is.null(prefix)) {
    if (is.numeric(field)) {
      prefix <- names(x)[field]
    } else {
      prefix <- field
    }
  }
  colnames(df) <- paste(prefix, seq(1, ncol(df), 1), sep = "_")
  times <- seq(from = attr(x, "startTime"),
               by = 1 / attr(x, "sampleRate"),
               length.out = nrow(df))
  out <-
    tibble(times_orig = times,
           times_rel = seq(from = 0,
                           to = (attr(x, "endRecord") - 1) * 1000 / attr(x, "sampleRate"),
                           by = 1000 / attr(x, "sampleRate")),
           times_norm = times_rel / (max(times_rel) - min(times_rel))
    ) %>%
    dplyr::bind_cols(df)
  if (na.zeros) {
    out <- out %>%
      dplyr::mutate(across(!times_orig & !times_rel & !times_norm, ~ na_if(., 0)))
  }
  return(out)
}
and then apply the function to one of the stored objects and get the expected result.
> class(pluck(illustration, "temp",1))
[1] "AsspDataObj"
> as_tibble(pluck(illustration, "temp",1))
# A tibble: 581 × 7
times_orig times_rel times_norm fm_1 fm_2 fm_3 fm_4
<dbl> <dbl> <dbl> <int> <int> <int> <int>
1 0.0025 0 0 NA 1062 2073 3156
2 0.0075 5 0.00172 1239 2109 3113 4247
3 0.0125 10 0.00345 NA 1352 2316 3310
4 0.0175 15 0.00517 NA 1448 2555 3870
5 0.0225 20 0.00690 NA 1438 2564 3958
[...]
Now, I want to apply the function to each object and expand (unnest) the output, so that the result contains the as_tibble output columns for each stored AsspDataObj joined with the other columns in the illustration data set.
Now, if I do this (simplifying the output a bit by removing some columns for now)
> illustration %>% mutate(out = map(.x = temp,.f=as_tibble)) %>% select(sl_rowIdx,out) %>% unnest(out)
I get this output
# A tibble: 1,162 × 10
sl_rowIdx frame_time fm1 fm2 fm3 fm4 bw1 bw2 bw3 bw4
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2.5 0 1062 2073 3156 0 800 369 890
2 1 7.5 1239 2109 3113 4247 644 493 792 859
3 1 12.5 0 1352 2316 3310 0 486 762 933
4 1 17.5 0 1448 2555 3870 0 577 716 1442
5 1 22.5 0 1438 2564 3958 0 548 552 1062
6 1 27.5 0 1343 2518 4001 0 637 617 1096
which is not the output I would expect if my as_tibble method above had been applied. Instead, it seems that some other as_tibble method has been applied to the object.
Why? How do I find out what as_tibble.* function has been applied here? How do I force the application of my definition?
If you need a data excerpt, one is available here: https://umeauniversity-my.sharepoint.com/:u:/g/personal/frkkan96_ad_umu_se/ET_56GqikHxAuFc60y_ce1UBL0lOOJcqWaMDjDwLOxjuOw?e=IsEZPf

Sorry, I found the answer. The problem was (I think) in how R resolves S3 methods, with a (not exported) definition of as_tibble.AsspDataObj in the package where AsspDataObj is defined
https://github.com/IPS-LMU/wrassp/blob/cbbc6e9fe100f5f32f7b30510f3008c5a5553440/R/AsspDataObj.R#L357
being called instead of my function.
Confusing.
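For anyone else debugging this kind of surprise, here is a short sketch of how one might inspect which method S3 dispatch actually picks and how to force a particular definition. It assumes the illustration object from the question; sloop is an optional helper package that was not used in the original post.

library(dplyr)
library(purrr)
library(tidyr)

x <- pluck(illustration, "temp", 1)

# Which as_tibble method is actually selected for this object?
sloop::s3_dispatch(tibble::as_tibble(x))

# Look at the method another package may have registered for the class,
# even if that method is not exported
getS3method("as_tibble", "AsspDataObj")

# Bypass dispatch entirely by calling your own definition explicitly
illustration %>%
  mutate(out = map(temp, as_tibble.AsspDataObj)) %>%
  select(sl_rowIdx, out) %>%
  unnest(out)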

Related

How to use loop to generate the data in a table in Shiny?

I just started to learn Shiny a few days ago, and I have been troubled by this problem for a while.
I need to generate a two-column table, and the data in the table needs to be calculated based on the input (I can then use this table to generate a scatter plot with ggplot()).
I want to use a for loop to keep the code readable and replace potentially hundreds of lines of highly repetitive code; otherwise it would look like ((input$meansy1)-1)^2, ((input$meansy1)-2)^2, ..., ((input$meansy1)-100)^2.
I don't know why it can't be used correctly inside data.frame().
This is part of the code:
shinyUI(fluidPage(
  numericInput("y1", "y1:", sample(1:100, 1), min = 1, max = 100),
  tableOutput("tb")
))

shinyServer(function(input, output, session) {
  list <- c()
  for (i in 1:100) {
    local({
      list[[i]] <- reactive(((input$y1) - i)^2)
    })
  }
  dt = data.frame(y_roof = 1:100, B = list)
  output$tb <- renderTable({
    dt
  })
})
When developing a feature for a shiny app it makes sense to look at the underlying operation separately from the shiny context. That way you can figure out if you have a shiny specific issue or not.
Let's look at the operation you want to do first: Iteratively subtracting the values 1 to 100 from x and squaring the result.
You can do this in base R, like this:
x <- 1
dt1 <- data.frame(y_roof = 1:100)
(x - dt1$y_roof)^2
#> [1] 0 1 4 9 16 25 36 49 64 81 100 121 144 169 196
#> [16] 225 256 289 324 361 400 441 484 529 576 625 676 729 784 841
#> [31] 900 961 1024 1089 1156 1225 1296 1369 1444 1521 1600 1681 1764 1849 1936
#> [46] 2025 2116 2209 2304 2401 2500 2601 2704 2809 2916 3025 3136 3249 3364 3481
#> [61] 3600 3721 3844 3969 4096 4225 4356 4489 4624 4761 4900 5041 5184 5329 5476
#> [76] 5625 5776 5929 6084 6241 6400 6561 6724 6889 7056 7225 7396 7569 7744 7921
#> [91] 8100 8281 8464 8649 8836 9025 9216 9409 9604 9801
To store the results in a dataframe change the last line to:
dt1$col2 <- (x - dt1$y_roof)^2
head(dt1)
#> y_roof col2
#> 1 1 0
#> 2 2 1
#> 3 3 4
#> 4 4 9
#> 5 5 16
#> 6 6 25
Doing the same in the tidyverse would look like this:
library(dplyr)
dt2 <-
data.frame(y_roof = 1:100) %>%
mutate(col2 = (x - y_roof)^2)
head(dt2)
#> y_roof col2
#> 1 1 0
#> 2 2 1
#> 3 3 4
#> 4 4 9
#> 5 5 16
#> 6 6 25
Now we can work this into the shiny app:
library(shiny)
library(dplyr)
ui <-
shinyUI(fluidPage(
numericInput("y1", "y1:", sample(1:100, 1), min = 1, max = 100),
tableOutput("tb")
))
server <-
shinyServer(function(input, output, session) {
output$tb <- renderTable({
data.frame(y_roof = 1:100) %>%
mutate(col2 = (input$y1 - y_roof) ^ 2)
})
})
shinyApp(ui, server, options = list(launch.browser = TRUE))
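If the computed table is needed in more than one place (for example for the scatter plot mentioned in the question), it may be worth wrapping it in a reactive expression. A minimal sketch, assuming the same ui and loaded packages as in the app above:

server <-
  shinyServer(function(input, output, session) {
    # Reactive data frame: recomputed whenever input$y1 changes and
    # reusable by several outputs (a table, a ggplot, ...)
    dt <- reactive({
      data.frame(y_roof = 1:100) %>%
        mutate(col2 = (input$y1 - y_roof)^2)
    })

    output$tb <- renderTable({
      dt()
    })
  })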

Create a sequence of values by group between a min and max interval using dplyr

This is surely a basic question, but I couldn't find a way to solve it.
I need to create a sequence of values from a minimum (dds_min) to a maximum (dds_max) per group (fs).
This is my data:
fs <- c("early", "late")
dds_min <- as.numeric(c("47.2", "40"))
dds_max <- as.numeric(c("122", "105"))
dds_min.max <- as.data.frame(cbind(fs, dds_min, dds_max))
And this is what I did....
dss_levels <- dds_min.max %>%
  group_by(fs) %>%
  mutate(dds = seq(dds_min, dds_max, length.out = 100))
I intended to create a new variable (dds) of length 100 per group, starting and ending at different values depending on fs. My expectation was to end up with another data frame (dss_levels) with two columns (fs and dds) and 200 values in it.
But I am getting this error.
Error: Column `dds` must be length 1 (the group size), not 100
In addition: Warning messages:
1: In Ops.factor(to, from) : ‘-’ not meaningful for factors
2: In Ops.factor(from, seq_len(length.out - 2L) * by) :
‘+’ not meaningful for factors
Any help would be really appreciated.
Thanks!
I make the sequence length 5 for illustrative purposes, you can change it to 100.
library(purrr)
library(tidyr)
dds_min.max %>%
mutate(dds= map2(dds_min, dds_max, seq, length.out = 5)) %>%
unnest(cols = dds)
# # A tibble: 10 x 4
# fs dds_min dds_max dds
# <fct> <dbl> <dbl> <dbl>
# 1 early 47.2 122 47.2
# 2 early 47.2 122 65.9
# 3 early 47.2 122 84.6
# 4 early 47.2 122 103.
# 5 early 47.2 122 122
# 6 late 40 105 40
# 7 late 40 105 56.2
# 8 late 40 105 72.5
# 9 late 40 105 88.8
# 10 late 40 105 105
Using this data (make sure your numeric columns are numeric! Don't use cbind!)
fs <- c("early", "late")
dds_min <-c(47.2, 40)
dds_max <-c(122, 105)
dds_min.max <-data.frame(fs,dds_min, dds_max)
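If you only want the two columns fs and dds (200 rows in total for length.out = 100), a sketch using a newer dplyr (>= 1.1.0, an assumption about your setup) could use reframe(), which, unlike mutate(), may return several rows per group:

library(dplyr)

# Using the corrected data frame from just above
dds_levels <- dds_min.max %>%
  group_by(fs) %>%
  reframe(dds = seq(dds_min, dds_max, length.out = 100))
# 200 rows, two columns: fs and dds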

How can I divide whole numbers separated by a comma in one column by numbers in another column

I wanted to divide numbers separated by commas in a column
by other numbers.
Here is the input I have
> df = data.frame (SAMPLE1.DP=c("555","651","641","717"), SAMPLE1.AD=c("555", "68,583","2,639","358,359"), SAMPLE2.DP=c("1023","930","683","1179"), SAMPLE2.AD=c("1023","0,930","683","585,594"))
> df
SAMPLE1.DP SAMPLE1.AD SAMPLE2.DP SAMPLE2.AD
1 555 555 1023 1023
2 651 68,583 930 0,930
3 641 2,639 683 683
4 717 358,359 1179 585,594
In the end I want to add two new columns (AD/DP) that divide SAMPLE1.AD by SAMPLE1.DP and SAMPLE2.AD by SAMPLE2.DP; the results represent the proportion (percentage) for the number on each side of the comma, like this:
> end = data.frame(SAMPLE1.DP=c("555","651","641","717"),
+ SAMPLE1.AD=c("555", "68,583","204,437","358,359"),
+ SAMPLE1.AD_DP=c("1.00","0.10,0.90","0.32,0.68","0.50,0.50"),
+ SAMPLE2.DP=c("1023","930","683","1179"),
+ SAMPLE2.AD=c("1023","0,930","683","585,594"),
+ SAMPLE2.AD_DP=c("1.00","0.00,1.00","1.00","0.49,0,51"))
>end
SAMPLE1.DP SAMPLE1.AD SAMPLE1.AD_DP SAMPLE2.DP SAMPLE2.AD SAMPLE2.AD_DP
1 555 555 1.00 1023 1023 1.00
2 651 68,583 0.10,0.90 930 0,930 0.00,1.00
3 641 204,437 0.32,0.68 683 683 1.00
4 717 358,359 0.50,0.50 1179 585,594 0.49,0,51
it means :
XX YY,ZZ YY/XX,ZZ/XX AA BB,CC BB/AA,CC/AA
If I treat the values inside the table as numeric with as.numeric, it does not work, since the values are separated by commas...
Do you have any idea how to do this?
Thanks in advance for your help
First thing you need to do is replace the , with . and cast to numeric. Then split based on your required condition and divide, i.e.
df[] <- lapply(df, function(i)as.numeric(gsub(',', '.', i)))
do.call(cbind, lapply(split.default(df, gsub('\\D+', '', names(df))), function(i) i[2] / i[1]))
# SAMPLE1.AD SAMPLE2.AD
#1 1.000000000 1.000000
#2 0.004066052 0.001000
#3 0.004117005 1.000000
#4 0.499803347 0.496687
If there are commas in your numbers then the column has most likely been poisoned and is cast as character. What you need to do is convert your columns to numeric and then divide each column respectively.
library(tidyverse)
dat <- tribble(~"SAMPLE1.DP", ~"SAMPLE1.AD", ~"SAMPLE2.DP", ~"SAMPLE2.AD",
555, 555, 1023, 1023,
651, "2,647", 930, ",93",
641, "2,639", 683, 683,
717, "358,359", 1179, "585,594")
dat %>%
mutate_at(c(2,4), list(~str_replace(., ",", "."))) %>%
mutate_all(as.numeric) %>%
mutate(addp1 = SAMPLE1.AD / SAMPLE1.DP,
addp2 = SAMPLE2.AD / SAMPLE2.DP)
#> # A tibble: 4 x 6
#> SAMPLE1.DP SAMPLE1.AD SAMPLE2.DP SAMPLE2.AD addp1 addp2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 555 555 1023 1023 1 1
#> 2 651 2.65 930 0.93 0.00407 0.001
#> 3 641 2.64 683 683 0.00412 1
#> 4 717 358. 1179 586. 0.500 0.497
Created on 2019-05-20 by the reprex package (v0.2.1)
Thanks everyone, but I was not very clear in my question, very sorry.
In my input example, I have only whole numbers separated by commas, no decimals.
For example, on line 3 of my example:
2,647 means 2 AND 647, and I want to divide both numbers by 651 in order to get 2/651 and 647/651 as the result, so it will be 0.01 and 0.99 (or 1% and 99%).
They are whole numbers (integers), separated by commas.
Hope I am clearer... thanks.
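Given that clarification, here is a possible sketch (not from the answers above) that splits each AD entry on the comma, divides every part by the matching DP value, and pastes the shares back together. The helper function divide_parts is hypothetical and works on the df defined in the question.

library(dplyr)
library(purrr)

# Hypothetical helper: "68,583" with DP 651 becomes "0.10,0.90"
divide_parts <- function(ad, dp) {
  map2_chr(strsplit(as.character(ad), ","), as.numeric(as.character(dp)),
           ~ paste(sprintf("%.2f", as.numeric(.x) / .y), collapse = ","))
}

df %>%
  mutate(SAMPLE1.AD_DP = divide_parts(SAMPLE1.AD, SAMPLE1.DP),
         SAMPLE2.AD_DP = divide_parts(SAMPLE2.AD, SAMPLE2.DP))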

Reading fixed width format data into R with entries exceeding column width

I need to use the Annual Building Permits by Metropolitan Area Data distributed by the US Census Bureau, which are downloadable here as fixed width format text files. Here is an excerpt of the file (I've stripped the column names as they aren't in a nice format and can be replaced after reading the file into a data frame):
999 10180 Abilene, TX 306 298 8 0 0 0
184 10420 Akron, OH 909 905 0 4 0 0
999 13980 Blacksburg-Christiansburg-Radford,
VA 543 455 0 4 84 3
145 14010 Bloomington, IL 342 214 4 0 124 7
160 15380 Buffalo-Cheektowaga-Niagara Falls,*
NY 1964 931 14 14 1005 68
268 15500 Burlington, NC 1353 938 12 16 387 20
As seen in the above excerpt, many of the entries in the Name column exceed the width of the column (which looks to be 36 characters). I've experimented with the various fwf reading functions of both the utils package and readr but can't find a solution that takes these entries into account. Any tips would be much appreciated.
Edit: The original file excerpt was edited by a mod for formatting and in the process the example entries where the third column width was exceeded were deleted. I've since updated the excerpt to reinclude them and have stripped the column names.
I ran @markdly's code, which was submitted before this edit; it works for all the entries that don't have this issue. I exported the result to a csv, and included an excerpt below to show what happens with these entries:
"38","999",NA,"13980",NA,"Blacksburg-Christiansburg-Radford,",NA,NA,NA,NA,NA,NA
"39","V","A",NA,NA,NA,"543",455,0,4,84,3
"40","145",NA,"14010",NA,"Bloomington, IL","342",214,4,0,124,7
"51","160",NA,"15380",NA,"Buffalo-Cheektowaga-Niagara Falls,*",NA,NA,NA,NA,NA,NA
"52","N","Y",NA,NA,NA,"1964",931,14,14,1005,68
"53","268",NA,"15500",NA,"Burlington, NC","1353",938,12,16,387,20
Edit 2: Most of the major metro areas I'm actually looking at don't fall into this problem category, so while it would be nice to have the data for the ones that do, if there is no workable solution, would there be a way to remove these entries from the data set altogether?
Edit:
Based on the updated information, the files are not fixed width for some records. In this situation, I think readr::read_table is more useful than read_fwf. The following example is a tidyverse approach to importing and processing one of the source files (tb3u2016.txt). A base approach might involve using something like readLines.
Step 1 Read the file in and assign the split records a common record id
library(tidyverse)
df <- read_table("tb3u2016.txt", col_names = FALSE, skip = 11) %>%
rownames_to_column() %>%
mutate(record = if_else(lag(is.na(X2) & rowname > 1), lag(rowname), rowname))
df[37:40, ]
#> # A tibble: 4 x 8
#> rowname X1 X2
#> <chr> <chr> <int>
#> 1 37 999 13900 Bismarck, ND 856 629
#> 2 38 999 13980 Blacksburg-Christiansburg-Radford, NA
#> 3 39 VA 543 455
#> 4 40 145 14010 Bloomington, IL 342 214
#> # ... with 5 more variables: X3 <int>, X4 <int>, X5 <int>, X6 <int>,
#> # record <chr>
Step 2 Combine the split record text then put the contents into separate variables using tidyr::extract. Trim whitespace and remove the redundant records.
df <- df %>%
mutate(new_X1 = if_else(rowname != record, paste0(lag(X1), X1), X1)) %>%
extract(new_X1, c("CSA", "CBSA", "Name", "Total"), "([0-9]+) ([0-9]+) (.+) ([0-9]+)") %>%
mutate(Name = trimws(Name)) %>%
filter((lead(record) != record) | rowname == 1) %>%
select(CSA, CBSA, Name, Total, X2, X3, X4, X5, X6)
df[37:39, ]
#> # A tibble: 3 x 9
#> CSA CBSA Name Total X2 X3 X4
#> <chr> <chr> <chr> <chr> <int> <int> <int>
#> 1 999 13900 Bismarck, ND 856 629 16 6
#> 2 999 13980 Blacksburg-Christiansburg-Radford,VA 543 455 0 4
#> 3 145 14010 Bloomington, IL 342 214 4 0
#> # ... with 2 more variables: X5 <int>, X6 <int>
Below is a condensed version of the solution provided to an earlier version of the question using readr::read_fwf.
Example data
library(readr)
# example data
txt <- " Num of
Struc-
tures
With
3 and 4 5 Units 5 Units
CSA CBSA Name Total 1 Unit 2 Units Units or more or more
999 10180 Abilene, TX 306 298 8 0 0 0
184 10420 Akron, OH 909 905 0 4 0 0"
write_file(txt, "example.txt")
Solution
col_widths <- c(3, 1, 5, 1, 36, 8, 8, 8, 8, 8, NA)
col_names <- c("CSA", "blank_1", "CBSA", "blank_2", "Name", "Total", "units_1", "units_2",
"units_3_and_4", "units_5_or_more", "num_struc_5_or_more")
df <- read_fwf("example.txt", fwf_widths(col_widths, col_names), skip = 7)
df
#> # A tibble: 2 x 11
#> CSA blank_1 CBSA blank_2 Name Total units_1 units_2
#> <int> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 999 <NA> 10180 <NA> Abilene, TX 306 298 8
#> 2 184 <NA> 10420 <NA> Akron, OH 909 905 0
#> # ... with 3 more variables: units_3_and_4 <int>, units_5_or_more <int>,
#> # num_struc_5_or_more <int>
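The base approach mentioned at the start of the answer (readLines) could look roughly like the following: read the raw lines, glue each wrapped name line onto the line that follows it, and only then parse the fields. This is a rough sketch, untested against the full Census file.

# Rough base-R sketch: join the wrapped records before parsing
lines <- readLines("tb3u2016.txt")[-(1:11)]   # drop the header block
lines <- lines[nzchar(trimws(lines))]         # drop blank lines

# Proper record lines end in a number; wrapped name lines do not
wrapped <- !grepl("[0-9]\\s*$", lines)
fixed <- character(0)
i <- 1
while (i <= length(lines)) {
  if (wrapped[i] && i < length(lines)) {
    fixed <- c(fixed, paste(trimws(lines[i]), trimws(lines[i + 1])))
    i <- i + 2
  } else {
    fixed <- c(fixed, lines[i])
    i <- i + 1
  }
}
# `fixed` now holds one record per line; the fields can then be pulled out
# with a regex, much like Step 2 above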

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only those first rows where the values of 'X_POSITION' are increasing. I only want to sum the first run within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr to generate a new data frame that can be merged with my original data frame.
However, the code doesn't work, and I'm also not sure how to make sure it only adds the first rows per trial that have increasing values of X_POSITION.
FirstPassRT = dat %>%
  group_by(TRIAL_INDEX) %>%
  filter(dplyr::lag(dat$X_POSITION, 1) > dat$X_POSITION) %>%
  summarise(FIRST_PASS_TIME = sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
Here is something you can try with dplyr package:
library(dplyr);
dat %>% group_by(TRIAL_INDEX) %>%
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)
df <- data_frame(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
DURATION = c(204,172,186,670, 186,134,182,806, 323),
X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
group_by(TRIAL_INDEX) %>%
mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
filter(x.increasing == TRUE) %>%
summarize(FIRST_PASS_TIME = sum(X_POSITION))
res
#Source: local data frame [2 x 2]
#
# TRIAL_INDEX FIRST_PASS_TIME
# (dbl) (dbl)
#1 1 1356.5
#2 2 1365.8
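A variant (not in the original answers) that keeps every row and sums only the first run of increasing rows, stopping at the first decrease, can use dplyr's cumall(). A sketch assuming the df constructed in the previous answer; it reproduces the 562 / 1122 values the question asked for:

library(dplyr)

df %>%
  group_by(TRIAL_INDEX) %>%
  # A row counts as "increasing" if it is the first row or X went up
  mutate(increasing = row_number() == 1 | X_POSITION > lag(X_POSITION),
         # cumall() is TRUE only until the first non-increasing row,
         # so only the first increasing run contributes to the sum
         FIRST_PASS_TIME = sum(DURATION[cumall(increasing)])) %>%
  select(-increasing) %>%
  ungroup()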
