locating specific columns in a pdf table from R - r

I was wondering how I could locate the 2nd and 3rd columns from the left in the table on the last page (page 18) of this PDF document.
I'm using the pdftools package. Is there a way to extract just those two columns, which contain only numeric data?
library(pdftools)
df <- pdf_data("https://github.com/rnorouzian/m/raw/master/Kang_et_al%20(2015).pdf")[[18]]

Using some tidyverse packages, this can be achieved like so:
Filter for the values in the 2nd and 3rd columns. The 2nd column's values start at position x = 189, the 3rd column's at x = 252.
Additionally, to make sure that we only get the values, I first convert to numeric, whereby all text gets converted to NA. Note: one of the values uses a comma as the decimal mark, which I first replace with a dot.
After getting the values, I reshape the dataset using pivot_wider, for which I add a row id.
Finally, I rename the columns.
library(pdftools)
#> Using poppler version 0.73.0
df <- pdf_data("https://github.com/rnorouzian/m/raw/master/Kang_et_al%20(2015).pdf")[[18]]
library(dplyr)
library(tidyr)
library(stringr)
col_x <- c(189, 252)
df %>%
  mutate(value = str_replace(text, "(\\d+),(\\d+)", "\\1.\\2"),
         value = as.numeric(value)) %>%
  filter(x %in% col_x, !is.na(value)) %>%
  select(x, value) %>%
  group_by(x) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = x, values_from = value) %>%
  rename(row = 1, g = 2, se = 3)
#> Warning: Problem with `mutate()` input `value`.
#> ℹ NAs introduced by coercion
#> ℹ Input `value` is `as.numeric(value)`.
#> Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
#> # A tibble: 20 x 3
#> row g se
#> <int> <dbl> <dbl>
#> 1 1 0.089 0.179
#> 2 2 0.383 0.257
#> 3 3 0.481 0.355
#> 4 4 0.496 0.356
#> 5 5 0.103 0.335
#> 6 6 0.104 0.257
#> 7 7 0.068 0.289
#> 8 8 0.43 0.359
#> 9 9 1.48 0.351
#> 10 10 1.38 0.257
#> 11 11 0.888 0.388
#> 12 12 0.570 0.314
#> 13 13 0.642 0.39
#> 14 14 1.16 0.364
#> 15 15 0.341 0.432
#> 16 16 0.607 0.299
#> 17 17 0.473 0.361
#> 18 18 0.472 0.423
#> 19 19 0.902 0.368
#> 20 20 0.245 0.363
Created on 2020-12-31 by the reprex package (v0.3.0)
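The x offsets 189 and 252 above are specific to this page. As a minimal sketch of how such offsets might be found, the snippet below tabulates the x position of every numeric-looking token; x values that recur once per table row are likely column starts. The page tibble here is a made-up stand-in for one element of pdf_data() output (the real one also carries width, height and space columns), so its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for one page of pdf_data() output.
page <- tibble(
  x    = c(60, 189, 252, 60, 189, 252, 60, 189, 252),
  y    = c(100, 100, 100, 112, 112, 112, 124, 124, 124),
  text = c("Study1", "0.089", "0.179",
           "Study2", "0,383", "0.257",
           "Study3", "0.481", "0.355")
)

# Keep tokens that look like numbers (dot or comma as decimal mark),
# then count how many of them start at each x offset.
x_counts <- page %>%
  filter(grepl("^-?[0-9]+[.,]?[0-9]*$", text)) %>%
  count(x, sort = TRUE)

x_counts
```

On a real page, the offsets with the highest counts would be the candidates to put into col_x.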

The accepted answer is complicated, and after spending quite a bit of time fiddling with pdf_data() output, I thought it might help to show how to extract and manipulate the vectors directly.
library(pdftools)
library(stringr)
df <- pdf_data("https://github.com/rnorouzian/m/raw/master/Kang_et_al%20(2015).pdf")[[18]]
df <- df[stringr::str_detect(df$text, "\\d"),]
# x = 189 is the 2nd column (g) and x = 252 the 3rd (se), as in the accepted answer
data.frame(g = df$text[df$x == 189], se = df$text[df$x == 252])
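Note that df$text is character, so columns built this way hold strings rather than numbers. A small sketch of the conversion, assuming (as mentioned in the accepted answer) that a value may use a comma as its decimal mark; the txt vector is made up for illustration:

```r
# Made-up example tokens, one of them with a decimal comma.
txt <- c("0.089", "0,383", "1.48")

# Replace the decimal comma with a dot, then convert to numeric.
vals <- as.numeric(sub(",", ".", txt, fixed = TRUE))
vals
```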

Related

How does a dataset have to be structured to perform an ANOVA test in R?

I have three columns, one per group, with numeric values. I want to analyze them with an ANOVA test, but the examples I found all have the groups in one column and the corresponding values in a second column. Is it necessary to reshape the data like that, or is there a method that works with the columns as I currently have them? I have attached a screenshot of the data.
Thanks!
You can convert a wide table having many columns into another table having only two columns for key (group) and value (response) by pivoting the data:
library(tidyverse)
# create example data
set.seed(1337)
data <- tibble(
  VIH = runif(100),
  VIH2 = runif(100),
  VIH3 = runif(100)
)
data
#> # A tibble: 100 × 3
#> VIH VIH2 VIH3
#> <dbl> <dbl> <dbl>
#> 1 0.576 0.485 0.583
#> 2 0.565 0.495 0.108
#> 3 0.0740 0.868 0.350
#> 4 0.454 0.833 0.324
#> 5 0.373 0.242 0.915
#> 6 0.331 0.0694 0.0790
#> 7 0.948 0.130 0.563
#> 8 0.281 0.122 0.287
#> 9 0.245 0.270 0.419
#> 10 0.146 0.488 0.838
#> # … with 90 more rows
data %>%
  pivot_longer(everything()) %>%
  aov(value ~ name, data = .)
#> Call:
#> aov(formula = value ~ name, data = .)
#>
#> Terms:
#> name Residuals
#> Sum of Squares 0.124558 25.171730
#> Deg. of Freedom 2 297
#>
#> Residual standard error: 0.2911242
#> Estimated effects may be unbalanced
Created on 2022-05-10 by the reprex package (v2.0.0)
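If you would rather not load the tidyverse, the same wide-to-long reshape can be sketched in base R with stack(), which returns a values column and an ind factor holding the original column names. The data here is simulated in the same way as above, so it is only a stand-in for the real measurements:

```r
set.seed(1337)
wide <- data.frame(
  VIH  = runif(100),
  VIH2 = runif(100),
  VIH3 = runif(100)
)

# stack() turns the three columns into two: values (response) and ind (group).
long <- stack(wide)

fit <- aov(values ~ ind, data = long)
summary(fit)  # prints the ANOVA table with the F statistic and p-value
```

Calling summary() on the fit is what shows the F test; printing the aov object alone only lists the sums of squares.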

R: paired t-test on multiple columns

I am trying to run a t-test on multiple columns. Basically, I am trying to find the change from baseline to year 1 for a number of joint angles, and I only want to do this for the study side. Below is an image with the first few rows and columns of the data.
I have tried using both of these functions without success:
Code 1:
res <- FAI_SLS %>%
  filter(study_side == "Study") %>%
  select(-id, -subject, -activity, -side, -study_side, -year) %>%
  map_df(~ broom::tidy(t.test(. ~ year)), .id = 'var')
I get the following error:
Error in eval(predvars, data, env) : object 'year' not found
I tried taking out -year but I still have the same issue.
Code 2:
t(sapply(FAI_SLS%>%filter(study_side == "Study")%>%select(-id,-subject,-activity,-side,-study_side,-year), function(x)
unlist(t.test(x~FAI_SLS$year)[c("estimate","p.value","statistic","conf.int")])))
I get the following error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 't': variable lengths differ (found for 'FAI_SLS$year')
Again I tried taking -year out without success.
Any suggestions on how I can fix this? Thanks
Try fitting the t-test within summarise() on all the columns you want to test (selected in across()). Here's an example with a different dataset:
library(dplyr)
library(tidyr)
data("storms")
storms %>%
  filter(year %in% c(2019, 2020)) %>%
  summarise(across(-c(name, year, status, category),
                   ~broom::tidy(t.test(. ~ year)))) %>%
  pivot_longer(everything(), names_to = "variable") %>%
  unnest(value)
#> # A tibble: 9 × 11
#> variable estimate estimate1 estimate2 statistic p.value parameter conf.low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 month 0.0917 8.93 8.84 1.15 2.52e- 1 892. -0.0654
#> 2 day 4.29 18.2 13.9 7.49 2.34e-13 641. 3.17
#> 3 hour -0.0596 9.13 9.19 -0.128 8.99e- 1 687. -0.978
#> 4 lat 2.14 25.9 23.7 3.75 1.94e- 4 668. 1.02
#> 5 long 6.06 -60.7 -66.8 4.27 2.25e- 5 736. 3.27
#> 6 wind 8.42 58.8 50.4 4.42 1.18e- 5 529. 4.68
#> 7 pressure -4.46 989. 993. -3.03 2.59e- 3 537. -7.35
#> 8 tropicalst… 7.39 153. 145. 0.810 4.18e- 1 701. -10.5
#> 9 hurricane_… 10.9 24.1 13.2 3.92 1.02e- 4 508. 5.45
#> # … with 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
Created on 2022-06-02 by the reprex package (v2.0.1)
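One caveat: t.test() with a formula runs an unpaired Welch test by default, whereas the question asks for a paired test. A minimal base R sketch of a paired version follows. The toy data frame and its column names (hip_flexion, knee_flexion) are hypothetical, and it assumes every subject has exactly one row per year, so that sorting both halves by subject lines up the pairs.

```r
set.seed(42)
# Hypothetical data: 8 subjects measured at baseline and at year 1.
toy <- data.frame(
  subject      = rep(1:8, times = 2),
  year         = rep(c("baseline", "year1"), each = 8),
  hip_flexion  = rnorm(16, mean = 30, sd = 5),
  knee_flexion = rnorm(16, mean = 60, sd = 8),
  stringsAsFactors = FALSE
)

# Split by time point and sort by subject so that row i is the same person.
base  <- toy[toy$year == "baseline", ]
year1 <- toy[toy$year == "year1", ]
base  <- base[order(base$subject), ]
year1 <- year1[order(year1$subject), ]
stopifnot(identical(base$subject, year1$subject))

angle_cols <- c("hip_flexion", "knee_flexion")

# One paired t-test per angle column, collected into a data frame.
res <- do.call(rbind, lapply(angle_cols, function(v) {
  tt <- t.test(year1[[v]], base[[v]], paired = TRUE)
  data.frame(
    variable  = v,
    mean_diff = unname(tt$estimate),
    statistic = unname(tt$statistic),
    p.value   = tt$p.value,
    stringsAsFactors = FALSE
  )
}))
res
```

Keeping the two time points in separate, subject-aligned data frames also avoids the "object 'year' not found" error from the question, which arises because year is no longer available once the columns are passed to the test one at a time.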

Implementing mutate within a function called with variables

I'd like to call a function several times with different variables, each time setting a value to a new variable in the data frame. Here is my failed attempt. I appreciate any help!
dat <- tibble(score1 = runif(10), score2 = score1 * 2)
call_mutate_with_vars <- function(df, var1, var2, var3) {
  df <-
    df %>%
    mutate({{var3}} := ifelse({{var1}} >= {{var2}},0,{{var2}} - {{var1}}))
  df
}
call_mutate_with_vars(dat,"score1","score2","newscore")
I receive this error:
Error: Problem with `mutate()` column `newscore`.
i `newscore = ifelse("score1" >= "score2", 0, "score2" - "score1")`.
x non-numeric argument to binary operator
Run `rlang::last_error()` to see where the error occurred.
The embracing operator {{ is meant for variables passed as symbols (i.e., fx(var), not fx("var")). If you need to pass your variables as characters, you can instead use the .data pronoun.
So you can either pass symbols to your current function:
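library(dplyr)
set.seed(1)
# recreate the example data from the question
dat <- tibble(score1 = runif(10), score2 = score1 * 2)
call_mutate_with_vars <- function(df, var1, var2, var3) {
  df %>%
    mutate(
      {{var3}} := ifelse(
        {{var1}} >= {{var2}},
        0,
        {{var2}} - {{var1}}
      )
    )
}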
call_mutate_with_vars(dat, score1, score2, newscore)
#> # A tibble: 10 x 3
#> score1 score2 newscore
#> <dbl> <dbl> <dbl>
#> 1 0.266 0.531 0.266
#> 2 0.372 0.744 0.372
#> 3 0.573 1.15 0.573
#> 4 0.908 1.82 0.908
#> 5 0.202 0.403 0.202
#> 6 0.898 1.80 0.898
#> 7 0.945 1.89 0.945
#> 8 0.661 1.32 0.661
#> 9 0.629 1.26 0.629
#> 10 0.0618 0.124 0.0618
Or rewrite the function to handle characters:
call_mutate_with_chr_vars <- function(df, var1, var2, var3) {
  df %>%
    mutate(
      !!var3 := ifelse(                  # note use of !! unquote operator
        .data[[var1]] >= .data[[var2]],  # to use character as name
        0,
        .data[[var2]] - .data[[var1]]
      )
    )
}
call_mutate_with_chr_vars(dat, "score1", "score2", "newscore")
#> # A tibble: 10 x 3
#> score1 score2 newscore
#> <dbl> <dbl> <dbl>
#> 1 0.266 0.531 0.266
#> 2 0.372 0.744 0.372
#> 3 0.573 1.15 0.573
#> 4 0.908 1.82 0.908
#> 5 0.202 0.403 0.202
#> 6 0.898 1.80 0.898
#> 7 0.945 1.89 0.945
#> 8 0.661 1.32 0.661
#> 9 0.629 1.26 0.629
#> 10 0.0618 0.124 0.0618
Created on 2022-03-07 by the reprex package (v2.0.1)
The "Programming with dplyr" vignette is a nice reference for these sorts of problems.
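As a further variation on the character-based version, the new column name can also be given with glue-style syntax on the left of :=, which some find easier to read than !!. This is only a sketch: add_gap is a hypothetical helper name, and pmax(..., 0) is swapped in for the ifelse() above because it gives the same result for numeric columns.

```r
library(dplyr)

# Hypothetical helper: all three column names are passed as strings.
add_gap <- function(df, var1, var2, new_name) {
  df %>%
    mutate("{new_name}" := pmax(.data[[var2]] - .data[[var1]], 0))
}

dat <- tibble(score1 = c(1, 5), score2 = c(3, 2))
out <- add_gap(dat, "score1", "score2", "newscore")
out
```

Because the arguments are plain strings, the function can be called in a loop over vectors of column names without any further quoting tricks.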

row bind list columns using dplyr

I would like to find a better way to bind together the results of any number of regressions after adding an identifier for each model. The code below is my current solution but is too manual for a large number of regressions. This is part of a larger tidy workflow so a solution inside of the tidyverse is preferred but whatever works is fine. Thanks
library(tidyverse)
library(broom)
model_dat=mtcars %>%
  do(lm_1 = tidy(lm(disp~ wt*vs, data = .),conf.int=T),
     lm_2=tidy(lm(cyl ~ wt*vs, data = .),conf.int=T ),
     lm_3=tidy(lm(mpg ~ wt*vs, data = .),conf.int=T ))
df=model_dat %>%
  select(lm_1) %>%
  unnest(c(lm_1)) %>%
  mutate(model="one") %>%
  select(model,term,estimate,p.value:conf.high) %>%
  bind_rows(
    model_dat %>%
      select(lm_2) %>%
      unnest(c(lm_2)) %>%
      mutate(model="two") %>%
      select(model,term,estimate,p.value:conf.high)) %>%
  bind_rows(
    model_dat %>%
      select(lm_3) %>%
      unnest(c(lm_3)) %>%
      mutate(model="three") %>%
      select(model,term,estimate,p.value:conf.high))
It may be easier with map2: loop across the columns and the corresponding English word for each column's position, pluck the list element, create the 'model' column from the second argument (the English word, .y), and select the columns of interest. The _dfr suffix row-binds the results into a single dataset.
library(purrr)
library(english)
library(dplyr)
library(broom)
map2_dfr(model_dat, as.character(english(seq_along(model_dat))),
         ~ .x %>%
           pluck(1) %>%
           mutate(model = .y) %>%
           select(model, term, estimate, p.value:conf.high))
-output
# A tibble: 12 x 6
# model term estimate p.value conf.low conf.high
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one (Intercept) -70.0 1.55e- 1 -168. 28.2
# 2 one wt 102. 8.20e- 9 76.4 128.
# 3 one vs 31.2 6.54e- 1 -110. 172.
# 4 one wt:vs -36.7 1.10e- 1 -82.2 8.82
# 5 two (Intercept) 4.31 1.28e- 5 2.64 5.99
# 6 two wt 0.849 4.90e- 4 0.408 1.29
# 7 two vs -2.19 7.28e- 2 -4.59 0.216
# 8 two wt:vs 0.0869 8.20e- 1 -0.689 0.862
# 9 three (Intercept) 29.5 6.55e-12 24.2 34.9
#10 three wt -3.50 2.33e- 5 -4.92 -2.08
#11 three vs 11.8 4.10e- 3 4.06 19.5
#12 three wt:vs -2.91 2.36e- 2 -5.40 -0.419
Or use summarise with across, unclass and then bind with bind_rows
model_dat %>%
  summarise(across(everything(), ~ {
    # // get the column name
    nm1 <- cur_column()
    # // extract the list element (.[[1]])
    list(.[[1]] %>%
      # // create new column by extracting the numeric part
      mutate(model = english(readr::parse_number(nm1))) %>%
      # // select the subset of columns, wrap in a list
      select(model, term, estimate, p.value:conf.high))
  }
  )) %>%
  # // unclass to list
  unclass %>%
  # // bind the list elements
  bind_rows
-output
# A tibble: 12 x 6
# model term estimate p.value conf.low conf.high
# <english> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one (Intercept) -70.0 1.55e- 1 -168. 28.2
# 2 one wt 102. 8.20e- 9 76.4 128.
# 3 one vs 31.2 6.54e- 1 -110. 172.
# 4 one wt:vs -36.7 1.10e- 1 -82.2 8.82
# 5 two (Intercept) 4.31 1.28e- 5 2.64 5.99
# 6 two wt 0.849 4.90e- 4 0.408 1.29
# 7 two vs -2.19 7.28e- 2 -4.59 0.216
# 8 two wt:vs 0.0869 8.20e- 1 -0.689 0.862
# 9 three (Intercept) 29.5 6.55e-12 24.2 34.9
#10 three wt -3.50 2.33e- 5 -4.92 -2.08
#11 three vs 11.8 4.10e- 3 4.06 19.5
#12 three wt:vs -2.91 2.36e- 2 -5.40 -0.419
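If the models can be described up front, another option is to skip do() and start from a named list of formulas, letting the list names become the model identifier. This is a sketch rather than a drop-in replacement for the code above; the formulas list and its names are illustrative choices.

```r
library(dplyr)
library(purrr)
library(broom)

# Named list of formulas; the names end up in the `model` column.
formulas <- list(
  one   = disp ~ wt * vs,
  two   = cyl ~ wt * vs,
  three = mpg ~ wt * vs
)

models <- formulas %>%
  map(~ tidy(lm(.x, data = mtcars), conf.int = TRUE)) %>%
  bind_rows(.id = "model") %>%
  select(model, term, estimate, p.value:conf.high)

models
```

Adding another regression then only means adding one more entry to the list.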

R find top n results of column operation on aggregate operation per column over dataframe

Say I have a dataframe called RaM that holds cumulative return values. In this case, they literally are just a single row of cumulative return values along with column headers, but I would like to apply the logic to not just single row dataframes.
Say I want to sort by the max cumulative return value of each column, or even the average, or the sum of each column.
Each column would be reordered so that the maximum cumulative return of each column is compared, with the highest return becoming the first column and the lowest the last.
Then say I want to derive either the top 10 (the first 10 columns after they are rearranged), or even the top 10%.
I know how to derive the column averages, but I don't know how to effectively do the remaining operations. There is an order function, but when I used it, it stripped my column names, which I need. I could easily then cut the 1st say 10 columns, but is there a way that preserves the names? I don't think I can easily extract the names from the unordered original dataframe and apply it with my sorted by aggregate dataframe. My goal is to extract the column names of the top n columns (in dataframe RaM) in terms of a column aggregate function over the entire dataframe.
Something like
top10 <- getTop10ColumnNames(colSums(RaM))
that would then output a dataframe of the top 10 columns of RaM in terms of their sum.
Here's the output of RaM:
> head(RaM,2)
ABMD ACAD ALGN ALNY ANIP ASCMA AVGO CALD CLVS CORT
2013-01-31 0.03794643 0.296774194 0.13009009 0.32219178 0.13008130 0.02857604 0.13014640 -0.07929515 0.23375000 0.5174825
2013-02-28 0.14982079 0.006633499 0.00255102 -0.01823456 -0.05755396 0.07659708 -0.04333138 0.04066986 -0.04457953 -0.2465438
CPST EA EGY EXEL FCSC FOLD GNC GTT HEAR HK HZNP
2013-01-31 -0.05269663 0.08333333 -0.01849711 0.01969365 0 0.4179104 0.07992677 0.250000000 0.2017417 0.10404624 -0.085836910
2013-02-28 0.15051595 0.11443102 -0.04475854 -0.02145923 0 -0.2947368 0.14079036 0.002857143 0.4239130 -0.07068063 -0.009389671
ICON IMI IMMU INFI INSY KEG LGND LQDT MCF MU
2013-01-31 0.07750896 0.05393258 -0.01027397 -0.01571429 -0.05806459 0.16978417 -0.03085824 -0.22001958 0.01345609 0.1924290
2013-02-28 -0.01746362 0.03091684 -0.20415225 0.19854862 0.36849503 0.05535055 0.02189055 0.06840289 -0.09713487 0.1078042
NBIX NFLX NVDA OREX PFPT PQ PRTA PTX RAS REXX RTRX
2013-01-31 0.2112299 0.7846467 0.00000000 0.08950306 0.06823721 0.03838384 -0.1800819 0.04387097 0.23852335 0.008448541 0.34328358
2013-02-28 0.1677704 0.1382251 0.03888981 0.04020979 0.06311787 -0.25291829 0.0266223 -0.26328801 0.05079882 0.026656512 -0.02222222
SDRL SHOS SSI STMP TAL TREE TSLA TTWO UVE VICL
2013-01-31 0.07826093 0.2023956 -0.07788381 0.07103175 -0.14166875 -0.030504714 0.10746974 0.1053588 0.0365299 0.2302405
2013-02-28 -0.07585546 0.1384419 0.08052150 -0.09633197 0.08009728 -0.002860412 -0.07144761 0.2029581 -0.0330408 -0.1061453
VSI VVUS WLB
2013-01-31 0.06485356 -0.0976155 0.07494647
2013-02-28 -0.13965291 -0.1156069 0.04581673
Here's one way using the first section of your sample data to illustrate. You can gather up all the columns so that we can do summary calculations more easily, calculate all the summaries by group that you want, and then sort with arrange. Here I ordered with the highest sums first, but you could do whatever order you wanted.
library(tidyverse)
ram <- read_table2(
"ABMD ACAD ALGN ALNY ANIP ASCMA AVGO CALD CLVS CORT
0.03794643 0.296774194 0.13009009 0.32219178 0.13008130 0.02857604 0.13014640 -0.07929515 0.23375000 0.5174825
0.14982079 0.006633499 0.00255102 -0.01823456 -0.05755396 0.07659708 -0.04333138 0.04066986 -0.04457953 -0.2465438"
)
summary <- ram %>%
  gather(colname, value) %>%
  group_by(colname) %>%
  # funs() is deprecated since dplyr 0.8; a named list of functions does the same job
  summarise_at(.vars = vars(value), .funs = list(mean = mean, sum = sum, max = max)) %>%
  arrange(desc(sum))
summary
#> # A tibble: 10 x 4
#> colname mean sum max
#> <chr> <dbl> <dbl> <dbl>
#> 1 ALNY 0.152 0.304 0.322
#> 2 ACAD 0.152 0.303 0.297
#> 3 CORT 0.135 0.271 0.517
#> 4 CLVS 0.0946 0.189 0.234
#> 5 ABMD 0.0939 0.188 0.150
#> 6 ALGN 0.0663 0.133 0.130
#> 7 ASCMA 0.0526 0.105 0.0766
#> 8 AVGO 0.0434 0.0868 0.130
#> 9 ANIP 0.0363 0.0725 0.130
#> 10 CALD -0.0193 -0.0386 0.0407
If you then want to reorder your original data frame, you can get the order from this summary output and index with it:
ram[summary$colname]
#> # A tibble: 2 x 10
#> ALNY ACAD CORT CLVS ABMD ALGN ASCMA AVGO ANIP
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.322 0.297 0.517 0.234 0.0379 0.130 0.0286 0.130 0.130
#> 2 -0.0182 0.00663 -0.247 -0.0446 0.150 0.00255 0.0766 -0.0433 -0.0576
#> # ... with 1 more variable: CALD <dbl>
Created on 2018-08-01 by the reprex package (v0.2.0).
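If only the names of the top n columns are needed, a base R sketch along the lines of the getTop10ColumnNames() idea in the question could look like this. top_n_cols is a hypothetical helper name, and the example assumes a plain data frame with numeric columns (an xts object would first need as.data.frame()). The values are the first four columns of the sample data.

```r
# Hypothetical helper: names of the n columns with the largest aggregate value.
top_n_cols <- function(df, n = 10, fun = sum) {
  scores <- vapply(df, fun, numeric(1))  # one aggregate per column, names kept
  names(head(sort(scores, decreasing = TRUE), n))
}

ram <- data.frame(
  ABMD = c(0.03794643, 0.14982079),
  ACAD = c(0.296774194, 0.006633499),
  ALGN = c(0.13009009, 0.00255102),
  ALNY = c(0.32219178, -0.01823456)
)

top2 <- top_n_cols(ram, n = 2)  # top 2 columns by sum
top2
ram[top2]                       # original data, restricted to those columns
```

For the top 10% instead of a fixed count, n could be set to ceiling(0.1 * ncol(ram)), and passing fun = max or fun = mean switches the aggregate.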

Resources