I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data I use data [row, column]. But when I want to use only the rows in which Quarter equals (for example) 4. I use data [Quarter = 4,] Why is it not data [, Quarter = 4] since Quarter is a column and not a row?
I did not find an appropriate answer in the internet which really explains the why.
Thank you.
You have used 'data.table' function to import your data, so you could write :
sales[Customer != 891,]
The data[Quarter = 4, ], ensures that all columns should be returned for the rows where Quarter is equal to 4. The comma(,) is necessary to only select the rows, and not the column Quarter = 4.
When you use indexing, ie, data[row, column] you are telling R to look for either a specific row or column index.
Row: sales[sales$Customer %in% c(192,964),] translates to "search the specific column Customer in the data frame (or table) for any rows that have values that contain 192 or 964 and isolate them. Note that data.table will allow for sales[Customer %in% c(192, 964),] but data frames cant (use sales[sales$Customer %in% c(192,964),])
Customer Producttype Price Quarter
1: 192 1 469 2
2: 964 2 721 3
Columns sales[, "Customer"] translates to "search the data frame (or table) for columns named "Customer" and isolate all its rows
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note this returns a data table with one column. If you use sales[,Customer] (data table) or sales$Customer (data frame), it will return a vector:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
You can of course combine - if you did, sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] you would isolate all values of Customer and Producttype which were in quarters 1 and 2:
Customer Producttype
1: 192 1
2: 477 3
3: 780 1
4: 611 1
5: 420 2
...
I have a data frame reporting the count of answers per question (this is just a part of it), and I'd like to obtain the answer percentage for each question. I've found adorn_percentages, but it computes the percentage by dividing the values for the whole data frame, meanwhile, I just want the percentage for each column. Each column has a total of 2230 answers.
I was thinking to use something like (x/2230)*100 but I don't know how to go on.
df<-data.frame(q1=c(159,139,1048,571,93), q2=c(106,284,1043,672,125), q3=c(99,222,981,843,94))
q1 q2 q3
1 159 106 99
2 139 284 222
3 1048 1043 981
4 571 672 843
5 93 125 94
We may use colSums to do the division after making the lengths same
100 * df/colSums(df)[col(df)]
or use sweep
100 * sweep(df, 2, colSums(df), `/`)
Or use proportions
df[paste0(names(df), "_prop")] <- 100 * proportions(as.matrix(df), 2)
-output
> df
q1 q2 q3 q1_prop q2_prop q3_prop
1 159 106 99 7.910448 4.753363 4.421617
2 139 284 222 6.915423 12.735426 9.915141
3 1048 1043 981 52.139303 46.771300 43.814203
4 571 672 843 28.407960 30.134529 37.650737
5 93 125 94 4.626866 5.605381 4.198303
You can apply prop.table for each column -
library(dplyr)
df %>% mutate(across(.fns = prop.table, .names = '{col}_prop') * 100)
# q1 q2 q3 q1_prop q2_prop q3_prop
#1 159 106 99 7.910448 4.753363 4.421617
#2 139 284 222 6.915423 12.735426 9.915141
#3 1048 1043 981 52.139303 46.771300 43.814203
#4 571 672 843 28.407960 30.134529 37.650737
#5 93 125 94 4.626866 5.605381 4.198303
Consider the following code.
set.seed(56)
library(dplyr)
df <- data.frame(
NUM_1 = sample.int(500, replace = TRUE),
DENOM_1 = sample.int(500, replace = TRUE),
NUM_2 = sample.int(500, replace = TRUE),
DENOM_2 = sample.int(500, replace = TRUE)
)
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2
1 417 379 154 173
2 160 437 239 154
3 243 315 106 361
4 291 169 393 340
5 170 450 429 421
6 422 131 75 64
Without having to manually specify each of the column names (the actual problem has about 40 of these I need to create), I would like to create columns FRAC_1 and FRAC_2 for which FRAC_X = NUM_X/DENOM_X.
So, this would be what I'm looking for with regard to output, but since I'm dealing with about 40 of these, I don't want to have to manually type out each column:
df_frac <- df %>%
mutate(FRAC_1 = NUM_1 / DENOM_1,
FRAC_2 = NUM_2 / DENOM_2)
head(df_frac)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750
I would strongly prefer a dplyr solution to this. I thought maybe I could use mutate() with across(), but it isn't clear to me how to tell across() to pair the NUM_x with the corresponding DENOM_x columns.
Here is one in tidyverse
Loop across the columns with names starts_with 'NUM'
Extract the column name cur_column(), replace the substring from 'NUM' to 'DENOM' in str_replace
get the column value, divide by the NUM column, and change the column name in .names to create the 'FRAC' columns
library(dplyr)
library(stringr)
df <- df %>%
mutate(across(starts_with("NUM"), ~
./get(str_replace(cur_column(), 'NUM', 'DENOM')),
.names = "{str_replace(.col, 'NUM', 'FRAC')}"))
-output
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750
I have two bedfiles as dataframes in R, for which I want to map all overlapping regions to each other (similar to what bedtools closest would be able to do).
BedA:
chr start end
2 100 500
2 200 250
3 275 300
BedB:
chr start end
2 210 265
2 99 106
8 275 290
BedOut:
chr start.A end.A start.B end.B
2 100 500 210 265
2 100 500 99 106
2 200 250 210 265
Now, I found this very similar question, which suggest to use iRanges. Using the proposed way seems works, but I have no idea how to turn the output into a data frame like "BedOut".
Another data.table option using foverlaps:
setkeyv(BedA, names(BedA))
setkeyv(BedB, names(BedB))
ans <- foverlaps(BedB, BedA, nomatch=0L)
setnames(ans, c("start","end","i.start","i.end"), c("start.A","end.A","start.B","end.B"))
output:
chr start.A end.A start.B end.B
1: 2 100 500 99 106
2: 2 100 500 210 265
3: 2 200 250 210 265
data:
library(data.table)
BedA <- fread("chr start end
2 100 500
2 200 250
3 275 300")
BedB <- fread("chr start end
2 210 265
2 99 106
8 275 290")
Here is a solution using the data.table package.
library(data.table)
chr = c(2,2,3)
start.A = c(100, 200, 275)
end.A = c(500, 250, 300)
df_A = data.table(chr, start.A, end.A)
chr = c(2,2,8)
start.B = c(210, 99, 275)
end.B = c(265, 106, 290)
df_B = data.table(chr, start.B, end.B)
First, inner join the data-tables on the key chr:
df_out = df_B[df_A, on="chr", nomatch=0]
Then filter the overlapping interval:
df_out = df_out[(start.A>=start.B & start.A<=end.B) | (start.B>=start.A & start.B<=end.A)]
setcolorder(df_out, c("chr", "start.A", "end.A", "start.B", "end.B"))
chr start.A end.A start.B end.B
1: 2 100 500 210 265
2: 2 100 500 99 106
3: 2 200 250 210 265
Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x=sample(1:1000,1000),
y=sample(1:1000,1000),
z=sample(1:1000,1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[1,]) # for i 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the min and max across columns in a vectorized way, which is much better than splitting and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax,.SD) - do.call(pmin,.SD)]
x y z r
1: 266 531 872 606
2: 372 685 967 595
3: 572 383 866 483
4: 906 953 437 516
5: 201 118 192 83
---
996: 768 945 292 653
997: 61 231 965 904
998: 771 145 18 753
999: 841 148 839 693
1000: 857 252 218 639
How about this:
D[,list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I][c(1:4,15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
#actually this will be faster
D[c(1:4,15:18),list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I]
use .I to give you an index to call with the by= parameter, then you can run the function on each row. The second call pre-filters by any list of row numbers, or you can add a key and filter on that if your real table looks different.
You can do it by subsetting before/during the function. If you only want every second row for example
dat_Diffs <- apply(dat[seq(2,1000,by=2),],1,function(x) diff(range(x)))
Or for rownames 1:10 (since their names weren't specified they are just numbers counting up)
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,],1,function(x) diff(range(x)))
But why not just calculate per row then subset later?