How can I divide several whole numbers separated by commas in one column by numbers in another column - r

I want to divide numbers separated by commas in one column
by the numbers in another column.
Here is my input:
> df = data.frame (SAMPLE1.DP=c("555","651","641","717"), SAMPLE1.AD=c("555", "68,583","2,639","358,359"), SAMPLE2.DP=c("1023","930","683","1179"), SAMPLE2.AD=c("1023","0,930","683","585,594"))
> df
SAMPLE1.DP SAMPLE1.AD SAMPLE2.DP SAMPLE2.AD
1 555 555 1023 1023
2 651 68,583 930 0,930
3 641 2,639 683 683
4 717 358,359 1179 585,594
In the end I want to add two new columns (AD/DP) that divide the values of SAMPLE1.AD by SAMPLE1.DP and the values of SAMPLE2.AD by SAMPLE2.DP. The results represent the percentage for the number on each side of the comma, like this:
> end = data.frame(SAMPLE1.DP=c("555","651","641","717"),
+ SAMPLE1.AD=c("555", "68,583","204,437","358,359"),
+ SAMPLE1.AD_DP=c("1.00","0.10,0.90","0.32,0.68","0.50,0.50"),
+ SAMPLE2.DP=c("1023","930","683","1179"),
+ SAMPLE2.AD=c("1023","0,930","683","585,594"),
+ SAMPLE2.AD_DP=c("1.00","0.00,1.00","1.00","0.49,0.51"))
> end
SAMPLE1.DP SAMPLE1.AD SAMPLE1.AD_DP SAMPLE2.DP SAMPLE2.AD SAMPLE2.AD_DP
1 555 555 1.00 1023 1023 1.00
2 651 68,583 0.10,0.90 930 0,930 0.00,1.00
3 641 204,437 0.32,0.68 683 683 1.00
4 717 358,359 0.50,0.50 1179 585,594 0.49,0.51
That is:
XX YY,ZZ YY/XX,ZZ/XX AA BB,CC BB/AA,CC/AA
Converting the values in the table with as.numeric does not work, because the values are separated by commas.
Do you have any idea how to do this?
Thanks in advance for your help.

The first thing you need to do is replace the , with . and convert to numeric. Then split the columns by sample number and divide AD by DP, i.e.
df[] <- lapply(df, function(i)as.numeric(gsub(',', '.', i)))
do.call(cbind, lapply(split.default(df, gsub('\\D+', '', names(df))), function(i) i[2] / i[1]))
# SAMPLE1.AD SAMPLE2.AD
#1 1.000000000 1.000000
#2 0.004066052 0.001000
#3 0.004117005 1.000000
#4 0.499803347 0.496687

If there are commas in your numbers, then the column has most likely been read in as character. What you need to do is convert your columns to numeric and then divide each AD column by its DP column.
library(tidyverse)
dat <- tribble(~"SAMPLE1.DP", ~"SAMPLE1.AD", ~"SAMPLE2.DP", ~"SAMPLE2.AD",
555, 555, 1023, 1023,
651, "2,647", 930, ",93",
641, "2,639", 683, 683,
717, "358,359", 1179, "585,594")
dat %>%
mutate_at(c(2,4), list(~str_replace(., ",", "."))) %>%
mutate_all(as.numeric) %>%
mutate(addp1 = SAMPLE1.AD / SAMPLE1.DP,
addp2 = SAMPLE2.AD / SAMPLE2.DP)
#> # A tibble: 4 x 6
#> SAMPLE1.DP SAMPLE1.AD SAMPLE2.DP SAMPLE2.AD addp1 addp2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 555 555 1023 1023 1 1
#> 2 651 2.65 930 0.93 0.00407 0.001
#> 3 641 2.64 683 683 0.00412 1
#> 4 717 358. 1179 586. 0.500 0.497
Created on 2019-05-20 by the reprex package (v0.2.1)

Thanks everyone, but I was not very clear in my question, sorry.
In my input example, I have only whole numbers separated by commas, no decimals.
For example, on line 3 of my example:
2,647 means 2 AND 647, and I want to divide both numbers by 651 to get 2/651 , 647/651, so the result will be 0.01 and 0.99 (or 1% and 99%)
They are whole numbers (integers), separated by commas.
I hope this is clearer. Thanks.
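Following the clarification above, here is a minimal base R sketch of one way to do it: split each AD string on the comma, divide every piece by the matching DP value, and paste the rounded results back together. The helper name ratio_string is made up for illustration, and the third row of SAMPLE1.AD is assumed to be "204,437" as in the expected output rather than "2,639" as in the input.

```r
# Input as in the question; row 3 of SAMPLE1.AD is taken from the expected
# output ("204,437"), which is an assumption about what the asker intended.
df <- data.frame(
  SAMPLE1.DP = c("555", "651", "641", "717"),
  SAMPLE1.AD = c("555", "68,583", "204,437", "358,359"),
  SAMPLE2.DP = c("1023", "930", "683", "1179"),
  SAMPLE2.AD = c("1023", "0,930", "683", "585,594"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row, split AD on "," and divide each
# integer by that row's DP, then rebuild a comma-separated string.
ratio_string <- function(ad, dp) {
  parts <- strsplit(ad, ",", fixed = TRUE)
  mapply(function(p, d) {
    paste(sprintf("%.2f", as.numeric(p) / as.numeric(d)), collapse = ",")
  }, parts, dp, USE.NAMES = FALSE)
}

# Add one AD_DP column per sample prefix.
for (s in c("SAMPLE1", "SAMPLE2")) {
  df[[paste0(s, ".AD_DP")]] <- ratio_string(df[[paste0(s, ".AD")]],
                                            df[[paste0(s, ".DP")]])
}
df
```

Because the values are rounded to two decimals, 585/1179 and 594/1179 both print as 0.50 here. The columns stay character, so they would need splitting again before any further arithmetic.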

Force the application of the right as_tibble function on an object in a purrr::map call in R

I am trying to use the fact that as_tibble is a generic function so that I can process a column containing one out of two types of objects in the same way.
The list column temp can store either a list or an object of type AsspDataObj.
I can define an as_tibble function for that class
library(dplyr)
as_tibble.AsspDataObj <- function(x,field=1, prefix=NULL,na.zeros=TRUE){
df <- data.frame(x[[field]])
if(is.null(prefix)){
if(is.numeric(field)){
prefix <- names(x)[field]
}else{
prefix <- field
}
}
colnames(df) <- paste(prefix,seq(1,ncol(df),1),sep="_")
times <- seq(from=attr(x,"startTime"),
by=1/attr(x,"sampleRate"),
length.out=nrow(df))
out <-
tibble(times_orig=times,
times_rel=seq(from=0,to=(attr(x,"endRecord")-1)* 1000/attr(x,"sampleRate") ,by=1000/attr(x,"sampleRate")),
times_norm=times_rel / (max(times_rel) - min(times_rel))
) %>%
dplyr::bind_cols(df)
if(na.zeros){
out <- out %>%
dplyr::mutate(across(!times_orig & !times_rel & !times_norm, ~ na_if(.,0)))
}
return(out)
}
and then apply the function to one of the stored objects and get the expected result.
> class(pluck(illustration, "temp",1))
[1] "AsspDataObj"
> as_tibble(pluck(illustration, "temp",1))
# A tibble: 581 × 7
times_orig times_rel times_norm fm_1 fm_2 fm_3 fm_4
<dbl> <dbl> <dbl> <int> <int> <int> <int>
1 0.0025 0 0 NA 1062 2073 3156
2 0.0075 5 0.00172 1239 2109 3113 4247
3 0.0125 10 0.00345 NA 1352 2316 3310
4 0.0175 15 0.00517 NA 1448 2555 3870
5 0.0225 20 0.00690 NA 1438 2564 3958
[...]
Now, I want to apply the function to each object and expand (unnest) the output, so that the result is the as_tibble output columns for each stored AsspDataObj, joined with the other columns in the illustration data set.
If I do this (and simplify the output a bit by removing some columns just for now)
> illustration %>% mutate(out = map(.x = temp,.f=as_tibble)) %>% select(sl_rowIdx,out) %>% unnest(out)
I get this output
# A tibble: 1,162 × 10
sl_rowIdx frame_time fm1 fm2 fm3 fm4 bw1 bw2 bw3 bw4
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2.5 0 1062 2073 3156 0 800 369 890
2 1 7.5 1239 2109 3113 4247 644 493 792 859
3 1 12.5 0 1352 2316 3310 0 486 762 933
4 1 17.5 0 1448 2555 3870 0 577 716 1442
5 1 22.5 0 1438 2564 3958 0 548 552 1062
6 1 27.5 0 1343 2518 4001 0 637 617 1096
which is not the output I would expect to see if the as_tibble generic above had been applied. Instead, it seems that some other as_tibble function has been applied to the object.
Why? How do I find out what as_tibble.* function has been applied here? How do I force the application of my definition?
If you need a data excerpt: https://umeauniversity-my.sharepoint.com/:u:/g/personal/frkkan96_ad_umu_se/ET_56GqikHxAuFc60y_ce1UBL0lOOJcqWaMDjDwLOxjuOw?e=IsEZPf
Sorry, I found the answer. The problem was in the lexical scoping of R (I think): a (not exported) definition of as_tibble.AsspDataObj in the package where AsspDataObj is defined
https://github.com/IPS-LMU/wrassp/blob/cbbc6e9fe100f5f32f7b30510f3008c5a5553440/R/AsspDataObj.R#L357
was being called instead of my function.
Confusing.
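For anyone hitting the same thing, here is a small sketch of how one might inspect which S3 method is in play and how to sidestep dispatch. It uses a toy generic and class (describe_obj and toy_obj are made-up names standing in for as_tibble and AsspDataObj), so it is only an illustration of the tools, not a diagnosis of the wrassp case.

```r
# Toy generic and class standing in for as_tibble() and AsspDataObj.
describe_obj <- function(x, ...) UseMethod("describe_obj")
describe_obj.default <- function(x, ...) "default method"
describe_obj.toy_obj <- function(x, ...) "my method"

obj <- structure(list(), class = "toy_obj")

# 1. List the methods R can see for the generic.
methods("describe_obj")

# 2. Retrieve the method function for a given class so it can be printed
#    or inspected (environment(picked) shows where it was defined).
picked <- getS3method("describe_obj", "toy_obj")

# 3. Search attached packages and loaded namespaces, including
#    definitions that are not exported.
hits <- getAnywhere("describe_obj.toy_obj")
hits$where

# 4. Bypass dispatch entirely by calling your own function directly.
out <- lapply(list(obj, obj), describe_obj.toy_obj)
```

In the question's pipeline, the equivalent of step 4 would be passing as_tibble.AsspDataObj itself as .f to map instead of the generic as_tibble.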

Can I identify the same values within a range between 2 columns?

I am trying to compare values between two different columns but I need it to accept values within a range of ±3.
I created these two tibbles:
example_tp1 <- tibble(Object_centre = c(84, 149, 489, 534, 680.5))
example_tp2 <- tibble(Object_centre = c(84.5, 149.5, 489, 528.5, 542, 680.5))
And I want the program to link the ones that are the same within a ±3 range.
So for example, I want it to identify that 84 and 84.5 are the same, also 149 and 149.5; 489 and 489; 680.5 and 680.5. But I want it to also tell me that 534, 528.5 and 542 do not have a match.
Is there any way to do this?
This could be achieved via the fuzzyjoin package like so:
library(dplyr)
library(fuzzyjoin)
example_tp1 <- tibble(Object_centre = c(84, 149, 489, 534, 680.5))
example_tp2 <- tibble(Object_centre = c(84.5, 149.5, 489, 528.5, 542, 680.5))
match_fun1 <- function(x, y) {
# (x >= y - 3) & (x <= y + 3)
# or following the suggestion by @DarrenTsai
abs(x - y) <= 3
}
fuzzy_full_join(example_tp1, example_tp2,
by = c("Object_centre"),
match_fun = match_fun1)
#> # A tibble: 7 x 2
#> Object_centre.x Object_centre.y
#> <dbl> <dbl>
#> 1 84 84.5
#> 2 149 150.
#> 3 489 489
#> 4 680. 680.
#> 5 534 NA
#> 6 NA 528.
#> 7 NA 542
Created on 2020-08-22 by the reprex package (v0.3.0)
You could look at all combinations of values and see which ones match.
library(dplyr)
# Data frame of all combinations
example <- expand.grid(c(84, 149, 489, 534, 680.5), c(84.5, 149.5, 489, 528.5, 542, 680.5))
# Assigns a Match if the values are within a range of 3
example %>%
mutate(match = ifelse(abs(Var1-Var2) <= 3, "Match", "No Match"))
Var1 Var2 match
1 84.0 84.5 Match
2 149.0 84.5 No Match
3 489.0 84.5 No Match
4 534.0 84.5 No Match
5 680.5 84.5 No Match
6 84.0 149.5 No Match
7 149.0 149.5 Match
8 489.0 149.5 No Match
9 ..... ..... ........
10 ..... ..... ........
and so on
You could then filter out only the matches or see which values have no match.
Similar to @Jumble's answer, using tidyverse functions:
tidyr::crossing(example_tp1, example_tp2, .name_repair = ~c('col1', 'col2')) %>%
dplyr::filter(abs(col1 - col2) <= 3)
# col1 col2
# <dbl> <dbl>
#1 84 84.5
#2 149 150.
#3 489 489
#4 680. 680.
crossing generates all combinations of example_tp1 and example_tp2, and we keep only those rows where the absolute difference is less than or equal to 3.
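The question also asks which values have no match. A base R sketch of the same all-combinations idea, using outer() to build a logical matrix, could report both the matches and the unmatched values. The variable names are made up for illustration.

```r
tp1 <- c(84, 149, 489, 534, 680.5)
tp2 <- c(84.5, 149.5, 489, 528.5, 542, 680.5)

# Logical matrix: rows index tp1, columns index tp2,
# TRUE where the two values differ by at most 3.
within_tol <- abs(outer(tp1, tp2, "-")) <= 3

# Row/column positions of every TRUE cell.
pairs <- which(within_tol, arr.ind = TRUE)
matched <- data.frame(tp1 = tp1[pairs[, "row"]],
                      tp2 = tp2[pairs[, "col"]])

# Values with no partner on the other side.
unmatched_tp1 <- tp1[rowSums(within_tol) == 0]
unmatched_tp2 <- tp2[colSums(within_tol) == 0]

matched
unmatched_tp1
unmatched_tp2
```

For the example data this pairs four values and leaves 534 in the first tibble and 528.5 and 542 in the second without a match. Like the other all-combinations answers, it builds a full length(tp1) by length(tp2) matrix, so it is best suited to moderately sized inputs.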

Divide columns by other columns and itself depending on index in dplyr

library(dplyr)
set.seed(1)
df <- data.frame(dddt_a = sample(1:1000, 1000, replace=T),
dddt_b = sample(1:1000, 1000, replace=T),
dddt_c = sample(1:1000, 1000, replace=T),
dddt_d = sample(1:1000, 1000, replace=T),
index = as.character(sample(c("a", "b"), 1000, replace=T)))
I want to divide each column by either dddt_a or dddt_b depending on the index. If index is a, divide all columns except index by dddt_a; if index is b, divide all columns except index by dddt_b. The way it is set up now, only dddt_a is divided by dddt_a while the other columns stay unchanged (and likewise for index==b).
df1 <- df %>%
mutate_at(.vars = vars(starts_with("dddt")),
.funs = list(~ifelse(index=="a", ./dddt_a, ./dddt_b)))
head(df1)
dddt_a dddt_b dddt_c dddt_d index
1 1.0000000 686 474 756 a
2 0.7388466 1 681 726 b
3 1.0000000 218 570 448 a
4 2.0086393 1 830 958 b
5 1.0000000 989 590 128 a
6 1.0000000 128 978 144 a
A workaround is to store the denominator variable outside, split the data by index, divide everything, and put it back together (I ran it only for index==a here). However, this should be possible in dplyr, I'm sure...?
ind_a <- df$dddt_a[df$index=="a"]
dfa <- df %>%
filter(index=="a")%>%
mutate_at(.vars = vars(starts_with("dddt")),
.funs = ~ ./!!ind_a)
This is related to what seems to be the same issue. In a next step I want to sum the values up, again depending on the index variable:
df2 <- df1 %>%
mutate(SUMS = ifelse(index=="a",
1+dddt_b+dddt_c+dddt_d,
1+dddt_a+dddt_c+dddt_d))
However, this sums all variables up...
head(df2)
dddt_a dddt_b dddt_c dddt_d index SUMS
1 1.0000000 686 474 756 a 1917.000
2 0.7388466 1 681 726 b 1408.739
3 1.0000000 218 570 448 a 1237.000
4 2.0086393 1 830 958 b 1791.009
5 1.0000000 989 590 128 a 1708.000
6 1.0000000 128 978 144 a 1251.000
But for the first row, for example, SUMS should be equal to 1916:
rowSums(df2[1,2:4]) #the result should be 1916 not 1917
1916
Thanks for the help.
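Create new columns for the results instead of overwriting the originals. mutate_at replaces the columns one after another, so once the denominator column has been overwritten (dddt_a becomes 1 where index is a), the later columns are divided by the new value rather than the original one.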
library(dplyr)
df %>%
mutate_at(vars(starts_with("dddt")),
list(new = ~ifelse(index=="a", ./dddt_a, ./dddt_b))) %>%
head
# dddt_a dddt_b dddt_c dddt_d index dddt_a_new dddt_b_new dddt_c_new dddt_d_new
#1 836 686 474 756 a 1.000 0.821 0.567 0.904
#2 679 919 681 726 b 0.739 1.000 0.741 0.790
#3 129 218 570 448 a 1.000 1.690 4.419 3.473
#4 930 463 830 958 b 2.009 1.000 1.793 2.069
#5 509 989 590 128 a 1.000 1.943 1.159 0.251
#6 471 128 978 144 a 1.000 0.272 2.076 0.306
If you want you can then select only "_new" columns or rename the "_new" column to names of your choice.
We can also use case_when
library(dplyr)
df %>%
mutate_at(vars(starts_with("dddt")),
list(new = ~case_when(index=="a" ~ ./dddt_a, TRUE ~ ./dddt_b)))
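If the original columns should be overwritten rather than kept alongside new ones, one option is to compute the denominator once, before any column changes, and divide everything by it. The following is a base R sketch on a small made-up data frame (the values and the name ratios are illustrative only).

```r
# Small made-up data frame in the same shape as the question's df.
df <- data.frame(
  dddt_a = c(10, 20, 40),
  dddt_b = c(5, 10, 8),
  dddt_c = c(20, 30, 16),
  index  = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

value_cols <- grep("^dddt", names(df), value = TRUE)

# Pick the denominator per row from the ORIGINAL values.
denom <- ifelse(df$index == "a", df$dddt_a, df$dddt_b)

# Dividing a data frame by a vector of length nrow(df) divides
# every column element-wise by that vector.
ratios <- df
ratios[value_cols] <- df[value_cols] / denom
ratios
```

Because denom is fixed before the division, every column sees the same denominator, which avoids the ordering effect of in-place updates.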

Reading fixed width format data into R with entries exceeding column width

I need to use the Annual Building Permits by Metropolitan Area Data distributed by the US Census Bureau, which are downloadable here as fixed width format text files. Here is an excerpt of the file (I've stripped the column names as they aren't in a nice format and can be replaced after reading the file into a data frame):
999 10180 Abilene, TX 306 298 8 0 0 0
184 10420 Akron, OH 909 905 0 4 0 0
999 13980 Blacksburg-Christiansburg-Radford,
VA 543 455 0 4 84 3
145 14010 Bloomington, IL 342 214 4 0 124 7
160 15380 Buffalo-Cheektowaga-Niagara Falls,*
NY 1964 931 14 14 1005 68
268 15500 Burlington, NC 1353 938 12 16 387 20
As seen in the above excerpt, many of the entries in the Name column exceed the width of the column (which looks to be 36 characters). I've experimented with the various fwf reading functions of both the utils package and readr but can't find a solution that takes these entries into account. Any tips would be much appreciated.
Edit: The original file excerpt was edited by a mod for formatting and in the process the example entries where the third column width was exceeded were deleted. I've since updated the excerpt to reinclude them and have stripped the column names.
I ran @markdly's code, which was submitted before this edit; it works for all the entries that don't have this issue. I exported the result to a csv and included an excerpt below to show what happens with the problem entries:
"38","999",NA,"13980",NA,"Blacksburg-Christiansburg-Radford,",NA,NA,NA,NA,NA,NA
"39","V","A",NA,NA,NA,"543",455,0,4,84,3
"40","145",NA,"14010",NA,"Bloomington, IL","342",214,4,0,124,7
"51","160",NA,"15380",NA,"Buffalo-Cheektowaga-Niagara Falls,*",NA,NA,NA,NA,NA,NA
"52","N","Y",NA,NA,NA,"1964",931,14,14,1005,68
"53","268",NA,"15500",NA,"Burlington, NC","1353",938,12,16,387,20
Edit 2: Most of the major metro areas I'm actually looking at don't fall into this problem category, so while it would be nice to have the data for the ones that do, if there is no workable solution, would there be a way to remove these entries from the data set altogether?
Edit:
Based on the updated information, the files are not fixed width for some records. In this situation, I think readr::read_table is more useful than read_fwf. The following example is a tidyverse approach to importing and processing one of the source files (tb3u2016.txt). A base approach might involve using something like readLines.
Step 1: Read the file in and assign the split records a common record id
library(tidyverse)
df <- read_table("tb3u2016.txt", col_names = FALSE, skip = 11) %>%
rownames_to_column() %>%
mutate(record = if_else(lag(is.na(X2) & rowname > 1), lag(rowname), rowname))
df[37:40, ]
#> # A tibble: 4 x 8
#> rowname X1 X2
#> <chr> <chr> <int>
#> 1 37 999 13900 Bismarck, ND 856 629
#> 2 38 999 13980 Blacksburg-Christiansburg-Radford, NA
#> 3 39 VA 543 455
#> 4 40 145 14010 Bloomington, IL 342 214
#> # ... with 5 more variables: X3 <int>, X4 <int>, X5 <int>, X6 <int>,
#> # record <chr>
Step 2: Combine the split record text, then put the contents into separate variables using tidyr::extract. Trim whitespace and remove the redundant records.
df <- df %>%
mutate(new_X1 = if_else(rowname != record, paste0(lag(X1), X1), X1)) %>%
extract(new_X1, c("CSA", "CBSA", "Name", "Total"), "([0-9]+) ([0-9]+) (.+) ([0-9]+)") %>%
mutate(Name = trimws(Name)) %>%
filter((lead(record) != record) | rowname == 1) %>%
select(CSA, CBSA, Name, Total, X2, X3, X4, X5, X6)
df[37:39, ]
#> # A tibble: 3 x 9
#> CSA CBSA Name Total X2 X3 X4
#> <chr> <chr> <chr> <chr> <int> <int> <int>
#> 1 999 13900 Bismarck, ND 856 629 16 6
#> 2 999 13980 Blacksburg-Christiansburg-Radford,VA 543 455 0 4
#> 3 145 14010 Bloomington, IL 342 214 4 0
#> # ... with 2 more variables: X5 <int>, X6 <int>
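For the base approach with readLines mentioned above, a rough sketch might look like the following. It assumes that every real record starts with two numeric codes (CSA and CBSA) and that any other line is the continuation of the previous record. The lines are typed in here as a character vector standing in for readLines output, and the column names are borrowed from the read_fwf version of this answer.

```r
# Stand-in for readLines("tb3u2016.txt") after skipping the header lines.
raw <- c(
  "999  10180 Abilene, TX                          306      298        8        0        0        0",
  "999  13980 Blacksburg-Christiansburg-Radford,",
  "           VA                                   543      455        0        4       84        3",
  "145  14010 Bloomington, IL                      342      214        4        0      124        7"
)

# A record starts with two runs of digits; anything else is a continuation.
is_start <- grepl("^\\s*[0-9]+\\s+[0-9]+\\s", raw)
record_id <- cumsum(is_start)

# Glue each continuation line onto the line that started its record.
merged <- vapply(split(raw, record_id),
                 function(x) paste(trimws(x), collapse = " "),
                 character(1))

# Two codes, a lazily matched name, then six integer columns.
pattern <- paste0("^([0-9]+)\\s+([0-9]+)\\s+(.*?)",
                  paste(rep("\\s+([0-9]+)", 6), collapse = ""), "$")
m <- regmatches(merged, regexec(pattern, merged, perl = TRUE))

# Drop the full-match element and bind the capture groups into columns.
permits <- data.frame(do.call(rbind, lapply(m, function(x) x[-1])),
                      stringsAsFactors = FALSE)
names(permits) <- c("CSA", "CBSA", "Name", "Total", "units_1", "units_2",
                    "units_3_and_4", "units_5_or_more", "num_struc_5_or_more")
num_cols <- setdiff(names(permits), "Name")
permits[num_cols] <- lapply(permits[num_cols], as.integer)
permits
```

Lines that do not fit the pattern at all would break the rbind step, so a real script would want to check for failed matches first.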
Below is a condensed version of the solution provided to an earlier version of the question using readr::read_fwf.
Example data
library(readr)
# example data
txt <- " Num of
Struc-
tures
With
3 and 4 5 Units 5 Units
CSA CBSA Name Total 1 Unit 2 Units Units or more or more
999 10180 Abilene, TX 306 298 8 0 0 0
184 10420 Akron, OH 909 905 0 4 0 0"
write_file(txt, "example.txt")
Solution
col_widths <- c(3, 1, 5, 1, 36, 8, 8, 8, 8, 8, NA)
col_names <- c("CSA", "blank_1", "CBSA", "blank_2", "Name", "Total", "units_1", "units_2",
"units_3_and_4", "units_5_or_more", "num_struc_5_or_more")
df <- read_fwf("example.txt", fwf_widths(col_widths, col_names), skip = 7)
df
#> # A tibble: 2 x 11
#> CSA blank_1 CBSA blank_2 Name Total units_1 units_2
#> <int> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 999 <NA> 10180 <NA> Abilene, TX 306 298 8
#> 2 184 <NA> 10420 <NA> Akron, OH 909 905 0
#> # ... with 3 more variables: units_3_and_4 <int>, units_5_or_more <int>,
#> # num_struc_5_or_more <int>

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only the first rows where the values of 'X_POSITION' are increasing. That is, I only want to sum the first round within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and also I'm not sure on how to make sure it's only adding the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
group_by(TRIAL_INDEX) %>%
filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>%
summarise(FIRST_PASS_TIME=sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
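Here is something you can try with the dplyr package. Note that it sums DURATION over every row where X_POSITION is larger than in the previous row, which coincides with the first pass for the example data: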
library(dplyr);
dat %>% group_by(TRIAL_INDEX) %>%
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)
df <- tibble(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
DURATION = c(204,172,186,670, 186,134,182,806, 323),
X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
group_by(TRIAL_INDEX) %>%
mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
filter(x.increasing == TRUE) %>%
# sum DURATION (not X_POSITION) over the rows that were kept
summarize(FIRST_PASS_TIME = sum(DURATION))
res
#Source: local data frame [2 x 2]
#
# TRIAL_INDEX FIRST_PASS_TIME
# (dbl) (dbl)
#1 1 562
#2 2 1122
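If X_POSITION can start rising again later in a trial, filtering on "larger than the previous row" would also pick up those later rows. A base R sketch that stops at the first drop might look like this (first_pass is a made-up helper name, and dat is rebuilt here from the question's table):

```r
dat <- data.frame(
  TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  DURATION    = c(204, 172, 186, 670, 186, 134, 182, 806, 323),
  X_POSITION  = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0)
)

# Hypothetical helper: sum duration up to (not including) the first row
# where x fails to increase.
first_pass <- function(duration, x) {
  rising <- c(TRUE, diff(x) > 0)          # first row always counts
  in_first_pass <- cumsum(!rising) == 0   # TRUE until the first drop
  sum(duration[in_first_pass])
}

# One total per trial, then mapped back onto every row of that trial.
totals <- sapply(split(dat, dat$TRIAL_INDEX),
                 function(g) first_pass(g$DURATION, g$X_POSITION))
dat$FIRST_PASS_TIME <- unname(totals[as.character(dat$TRIAL_INDEX)])
dat
```

The cumsum(!rising) == 0 test is what restricts the sum to the initial run: once a single drop has been seen, all later rows are excluded even if X increases again.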
