How to subset a data frame if all the conditions are met? [duplicate] - r

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 5 years ago.
I have two message ids, say 197 and 198. I want to subset the data frame for those users who have received messages from both of these ids. I only want the rows for userids that appear with both message ids.
The data frame is m
I have used this code:
a = c(197,198)
n = subset(m$userid,m$mid %in% a)
I also tried:
n = m[m$mid == 197 & m$mid == 198]
Both of these give an OR-style result, whereas I want an AND result.
Here is a sample of the data frame:
mid userid opened
197 1022 Y
197 1036 N
197 1100 Y
198 1000 Y
198 1022 N
198 1036 Y
I want the output to contain only the records for userids that have both mid 197 and 198:
mid userid opened
197 1022 Y
197 1036 N
198 1022 N
198 1036 Y

Using sqldf, one solution is:
# data
m <- read.table(text = "mid userid opened
197 1022 Y
197 1036 N
197 1100 Y
198 1000 Y
198 1022 N
198 1036 Y", header = T, stringsAsFactors = F)
library(sqldf)
result <- sqldf("SELECT * FROM m
WHERE userid in (SELECT userid FROM m WHERE mid == 197) AND
userid in (SELECT userid FROM m WHERE mid == 198)")
result
# mid userid opened
# 1 197 1022 Y
# 2 197 1036 N
# 3 198 1022 N
# 4 198 1036 Y
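If a can hold more than two ids, the same idea can be written once with a grouped subquery. This is only a sketch built on the table m defined above (ids and query are just illustrative helper names; sqldf is already loaded):
a <- c(197, 198)
ids <- paste(a, collapse = ", ")
query <- sprintf(
  "SELECT * FROM m
   WHERE mid IN (%s)
     AND userid IN (SELECT userid FROM m
                    WHERE mid IN (%s)
                    GROUP BY userid
                    HAVING COUNT(DISTINCT mid) = %d)",
  ids, ids, length(a))
sqldf(query)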

Using duplicated:
m[duplicated(m$userid) | duplicated(m$userid,fromLast = T), ]
# mid userid opened
# 1 197 1022 Y
# 2 197 1036 N
# 5 198 1022 N
# 6 198 1036 Y
With your real data you may first need m2 <- subset(m, mid %in% a) to make sure the table contains only mids from a, and then apply the duplicated() step to m2, as sketched below.
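A sketch combining the two steps (it assumes each userid/mid pair occurs at most once, as in the sample data):
a <- c(197, 198)
m2 <- subset(m, mid %in% a)
# keep userids that appear more than once, i.e. received both mids in a
m2[duplicated(m2$userid) | duplicated(m2$userid, fromLast = TRUE), ]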

For the sake of completeness, here are two data.table approaches. Both can handle a vector a of arbitrary length, i.e., with more than just two selected mids.
Join
library(data.table)
setDT(m)[m[mid %in% a][, uniqueN(mid), by = .(userid)][V1 == uniqueN(a)],
on = "userid"]
mid userid opened V1
1: 197 1022 Y 2
2: 198 1022 N 2
3: 197 1036 N 2
4: 198 1036 Y 2
The expression
m[mid %in% a][, uniqueN(mid), by = .(userid)][V1 == uniqueN(a)]
userid V1
1: 1022 2
2: 1036 2
filters m, counts the number of unique mids per userid, and returns the userids that match every entry in a. (Instead of uniqueN(a), length(a) could be used, but the former is safer if a contains duplicated values.)
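The joined result above carries the helper count column V1. If you want only the original columns, it can be dropped at the end, e.g. (a small sketch of the same join):
setDT(m)[m[mid %in% a][, uniqueN(mid), by = .(userid)][V1 == uniqueN(a)],
on = "userid"][, !"V1"]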
Subsetting by row indices
An alternative approach returns the row indices of m, which are then used for subsetting:
m[mid %in% a][, .I[uniqueN(mid) == uniqueN(a)], by = .(userid)]
userid V1
1: 1022 1
2: 1022 5
3: 1036 2
4: 1036 6
m[m[mid %in% a][, .I[uniqueN(mid) == uniqueN(a)], by = .(userid)]$V1]
mid userid opened
1: 197 1022 Y
2: 198 1022 N
3: 197 1036 N
4: 198 1036 Y

Related

Remove row with specific value

I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data I use data[row, column]. But when I want to use only the rows in which Quarter equals (for example) 4, I use data[Quarter == 4, ]. Why is it not data[, Quarter == 4], since Quarter is a column and not a row?
I did not find an answer on the internet that really explains the why.
Thank you.
You created your data with data.table, so you can write:
sales[Customer != 891,]
data[Quarter == 4, ] returns all columns for the rows where Quarter equals 4. The condition goes in the row (first) position because it selects rows; the comma separates it from the column (second) position, which is left empty to mean "all columns".
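A quick illustration with the sales table above (a sketch; note the double == for a comparison):
# rows where Quarter equals 4, all columns
sales[Quarter == 4, ]
# putting the condition in the column (second) position evaluates it as an
# expression and returns a logical vector instead of filtering rows
sales[, Quarter == 4]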
When you use indexing, i.e., data[row, column], you are telling R to look up specific row and/or column indices.
Rows: sales[sales$Customer %in% c(192,964),] translates to "search the Customer column and keep the rows whose value is 192 or 964". Note that data.table also allows sales[Customer %in% c(192, 964),], but plain data frames don't (there you need sales$Customer %in% c(192,964)).
Customer Producttype Price Quarter
1: 192 1 469 2
2: 964 2 721 3
Columns: sales[, "Customer"] translates to "take the column named Customer, with all its rows":
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note that this returns a data.table with one column. If you use sales[, Customer] (data.table syntax) or sales$Customer (which works for both data frames and data.tables), you get a vector instead:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
You can of course combine the two: sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] isolates the Customer and Producttype values for the rows in quarters 1 and 2:
Customer Producttype
1: 192 1
2: 477 3
3: 780 1
4: 611 1
5: 420 2
...

grouping overlapping regions based on a clustering factor in R

Using the foverlaps function from the data.table package, I get overlapping regions (only 25 rows are shown here, but there are more than 50 thousand), and I would like to group the overlapping regions for each ID according to the following criteria:
If rows have the same ID and their regions overlap (whether they belong to the same group or different groups), then:
1) group them all, 2) extend the range (i.e. start = min of the overlapping set and end = max of the overlapping set), and 3) assign the group name of the row with the maximum score.
For example, given the data set:
dt <- data.table::data.table(
ID=c("1015_4_1_1","1015_4_1_1","1015_4_1_1","103335_0_1_2","103335_0_1_2",
"103335_0_1_2","11099_0_1_1","11099_0_1_1","11099_0_1_1","11099_0_1_1","11099_0_1_1",
"11702_0_1_1","11702_0_1_1","11702_0_1_1","11702_0_1_5","11702_0_1_5","11702_0_1_5",
"140331_0_1_1","140331_0_1_1","140331_0_1_1","14115_0_1_7","14115_0_1_7",
"14115_0_1_7","14115_0_1_8","14115_0_1_8"),
start=c(193,219,269,149,149,163,51,85,314,331,410,6193,6269,6278,6161,6238,6246,303,304,316,1525,1526,1546,1542,1543),
end=c(307,273,399,222,235,230,158,128,401,428,507,6355,6337,6356,6323,6305,6324,432,396,406,1603,1688,1612,1620,1705),
group=c("R7","R5","R5","R4","R5","R6","R7","R5","R4","R5","R5","R5","R6","R4","R5","R6","R4","R5","R4","R6","R4","R5","R6","R4","R5"),
score=c(394,291,409,296,319,271,318,252,292,329,252,524,326,360,464,340,335,515,506,386,332,501,307,308,443)
)
The expected result is:
# 1015_4_1_1 193 399 R5 409
# 103335_0_1_2 149 235 R5 319
# 11099_0_1_1 51 158 R7 318
# 11099_0_1_1 314 507 R5 329
# 11702_0_1_1 6193 6356 R5 524
# 11702_0_1_5 6161 6324 R5 464
# 140331_0_1_1 303 432 R5 515
# 14115_0_1_7 1525 1705 R5 501
Note that for each ID there may be subgroups of regions that do not overlap each other; for example, in "11099_0_1_1" rows 7 and 8 form one subgroup and the rest form another.
I have no experience with GenomicRanges or IRanges, and I read in another comment that data.table is usually faster. So, since I was expecting a lot of overlapping regions, I started with foverlaps from data.table, but I don't know how to proceed. I hope you can help me, and thank you very much in advance.
If your grouping key is the full ID, then you could do:
dt <- dt[
,IDy := cumsum(fcoalesce(+(start > (shift(cummax(end), type = 'lag') + 1L)), 0L)), by = ID][
, .(start = min(start), end = max(end),
group = group[which.max(score)],
score = max(score)),
by = .(ID, IDy)][, IDy := NULL]
Output (an extra row with score 443 appears because 14115_0_1_8 is treated as its own ID here):
ID start end group score
1: 1015_4_1_1 193 399 R5 409
2: 103335_0_1_2 149 235 R5 319
3: 11099_0_1_1 51 158 R7 318
4: 11099_0_1_1 314 507 R5 329
5: 11702_0_1_1 6193 6356 R5 524
6: 11702_0_1_5 6161 6324 R5 464
7: 140331_0_1_1 303 432 R5 515
8: 14115_0_1_7 1525 1688 R5 501
9: 14115_0_1_8 1541 1705 R5 443
If your grouping key is actually only the number before the first underscore, then:
library(data.table)
dt <- dt[, IDx := sub('_.*', '', ID)][
, IDy := cumsum(fcoalesce(+(start > (shift(cummax(end), type = 'lag') + 1L)), 0L)), by = IDx][
, .(ID = ID[which.max(score)],
start = min(start), end = max(end),
group = group[which.max(score)],
score = max(score)),
by = .(IDx, IDy)
][, c('IDx', 'IDy') := NULL]
Output (the row with score 464 from your expected result is gone because 11702_0_1_1 and 11702_0_1_5 now share the same key and their overlapping regions merge):
dt
ID start end group score
1: 1015_4_1_1 193 399 R5 409
2: 103335_0_1_2 149 235 R5 319
3: 11099_0_1_1 51 158 R7 318
4: 11099_0_1_1 314 507 R5 329
5: 11702_0_1_1 6161 6356 R5 524
6: 140331_0_1_1 303 432 R5 515
7: 14115_0_1_7 1525 1705 R5 501
The above assumes that start is already ordered from lowest to highest. If that is not the case, run setorder(dt, start) before executing the code above.
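To see what the helper index does before the rows are collapsed, you can compute IDy on its own on the original dt from the question (the same expression used in the answer above, shown here only as an illustration):
library(data.table)
# within each ID (rows ordered by start), flag rows whose start lies beyond the
# running maximum end seen so far, then cumulative-sum the flags into a sub-group index
dt[, IDy := cumsum(fcoalesce(+(start > (shift(cummax(end), type = "lag") + 1L)), 0L)),
   by = ID]
dt[, .(ID, start, end, group, score, IDy)]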

Summing columns in a data frame and adding those values to a new data frame in R [duplicate]

This question already has answers here:
How to sum data.frame column values?
(5 answers)
Closed 2 years ago.
I am trying to sum the columns of a data frame and add these sums to a new output data frame. When I run the following script, I get an error stating that the replacement has two rows and the data has 3.
a <-data.frame(replicate(3,sample(1:100,10,rep=TRUE)))
colnames(a) <- c("name1", "name2","name3")
for (i in 1:ncol(a)) {
b <-as.data.frame(names(a))
c <- sum(a[i])
b$d[i] <- c[i]
}
I am looking for the output as a data frame such as:
name1 sum1
name2 sum2
name3 sum3
Your solution was already pretty close. I made some slight modifications for you and it works:
a <-data.frame(replicate(3,sample(1:100,10,rep=TRUE)))
colnames(a) <- c("name1", "name2","name3")
b <-as.data.frame(names(a))
for (i in 1:ncol(a)) {
b$sum[i] <- sum(a[i])
}
Output:
names(a) sum
1 name1 470
2 name2 616
3 name3 495
I would suggest a dplyr approach:
library(dplyr)
#Data
a <-data.frame(replicate(3,sample(1:100,10,rep=TRUE)))
colnames(a) <- c("name1", "name2","name3")
#Code
a %>%
mutate(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) ))
Output:
name1 name2 name3 name1_sum name2_sum name3_sum
1 98 31 79 599 489 506
2 8 71 4 599 489 506
3 59 23 48 599 489 506
4 65 76 64 599 489 506
5 47 53 57 599 489 506
6 80 84 55 599 489 506
7 40 19 28 599 489 506
8 39 2 47 599 489 506
9 65 36 40 599 489 506
10 98 94 84 599 489 506
If only one dataframe is desired you can use this:
a %>%
summarise(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) ))
Output:
name1_sum name2_sum name3_sum
1 599 489 506
The first code block should be used when you want to add those variables to the same data frame.
And if you want one variable for the names and another for the results, you can combine the previous code with pivot_longer() from tidyr (loaded with tidyverse) to produce this:
library(tidyverse)
#Code
a %>%
summarise(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) )) %>%
pivot_longer(cols = everything())
Output:
# A tibble: 3 x 2
name value
<chr> <int>
1 name1_sum 599
2 name2_sum 489
3 name3_sum 506
It can be vectorized with colSums in base R
as.data.frame.list(colSums(a))
Or for a two column summary
stack(colSums(a))
If we need to create new columns in 'a'
a[paste0(names(a), "_sum")] <- colSums(a)
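For exactly the two-column layout asked for (one column of names, one of sums), a plain base R construction is another option (a sketch using the same a as above):
data.frame(name = names(a), sum = colSums(a), row.names = NULL)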

Adding values of two columns on the same row to get a new value

Sorry for asking a very basic question, but I am new to R and really stuck on a rather simple matter. I have the data frame below (the first 2 rows are shown; there is a subject column plus 7 condition columns):
Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166
These values are time durations (secs) for seven test conditions
col$names <- c(sup_b, hdt, sup_2, lbnp, sup_3, hut, sup_4)
and there are 17 rows (each row is one study subject; only the first two rows are shown above).
I am trying to add the value in row 1, col sup_b (175) to the value in row 1, col hdt (434) to get the combined duration of the first two conditions, i.e. 609 secs. I then add that value (609) to the next column, sup_2, to get the running total (609 + 596), and so on until the last condition, sup_4.
I have tried the method below for subject 6 (row 1), which works, but I want to tidy this up and make it easier since I have 17 subjects (rows) and have been told there is an easier way:
sup_b <- 175
hdt <- (sup_b + 434)
sup_2 <- (hdt + 596)
lbnp <- (sup_2 + 585)
sup_3 <- (hdt_lbnp + 601)
hut <- (sup_3 + 593)
sup_4 <- (hut + 211)
I want to be able to just change the row number and have the data pulled from the data frame rather than entering each individual time period; for instance:
line <- 1 ### the row I want which corresponds to the subject
sup_b <- df[line, 2]
hdt <-df[line, 2] + df[line, 3]
but I keep getting this warning message:
In Ops.factor(df[line, 2], df[line, 3]) : ‘+’ not meaningful for factor
I have even tried colSums(df[,c(2:3)]), but get the following error:
Error in colSums(df[, c(2:3)]) : 'x' must be numeric.
also tried: st$sum <- apply(df[,c(2:3)], 1, sum), which doesn't work either.
Row-wise cumulative sums across the condition columns do this in one step:
df1[-1] <- t(apply(df1[-1], 1, cumsum))
# Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
# 1 6 175 609 1205 1790 2391 2984 3195
# 2 7 130 722 1314 1907 2507 2891 3057
data
df1 <- read.table(text = "Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166", header = TRUE, stringsAsFactors = FALSE)
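Regarding the original "'+' not meaningful for factor" warning: it means the duration columns were read in as factors (the old stringsAsFactors = TRUE default). If re-reading the data is not an option, converting the columns first also works (a sketch only):
# convert the factor columns (everything except Sub) to numeric before summing
df1[-1] <- lapply(df1[-1], function(x) as.numeric(as.character(x)))
df1[-1] <- t(apply(df1[-1], 1, cumsum))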

R One sample test for set of columns for each row

I have a data set with the Levels and Trends for, say, 50 cities across 3 scenarios. Below is sample data:
City <- paste0("City",1:50)
L1 <- sample(100:500,50,replace = T)
L2 <- sample(100:500,50,replace = T)
L3 <- sample(100:500,50,replace = T)
T1 <- runif(50,0,3)
T2 <- runif(50,0,3)
T3 <- runif(50,0,3)
df <- data.frame(City,L1,L2,L3,T1,T2,T3)
Now, across the 3 scenarios I find the minimum Level and minimum Trend using the code below:
df$L_min <- apply(df[,2:4],1,min)
df$T_min <- apply(df[,5:7],1,min)
Now I want to check whether these minimum values are significantly different from the levels and trends respectively, i.e. compare L_min with columns 2-4 and T_min with columns 5-7. This needs to be done for each city (row), and if the difference is significant, return which column it is significantly different from.
It would help if someone could explain how this can be done.
Thank you!
I'll put my idea here; nevertheless, I'm looking forward to ideas from others.
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min
1 City1 251 176 263 1.162313 0.07196579 2.0925715 176 0.07196579
2 City2 385 406 264 0.353124 0.66089524 2.5613980 264 0.35312402
3 City3 437 333 426 2.625795 1.43547766 1.7667891 333 1.43547766
4 City4 431 405 493 2.042905 0.93041254 1.3872058 405 0.93041254
5 City5 101 429 100 1.731004 2.89794314 0.3535423 100 0.35354230
6 City6 374 394 465 1.854794 0.57909775 2.7485841 374 0.57909775
> df$FC <- rowMeans(df[,2:4])/df[,8]
> df <- df[order(-df$FC), ]
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min FC
18 City18 461 425 117 2.7786757 2.6577894 0.75974121 117 0.75974121 2.857550
38 City38 370 117 445 0.1103141 2.6890014 2.26174542 117 0.11031411 2.655271
44 City44 101 473 222 1.2754675 0.8667007 0.04057544 101 0.04057544 2.627063
10 City10 459 361 132 0.1529519 2.4678493 2.23373484 132 0.15295194 2.404040
16 City16 232 393 110 0.8628494 1.3995549 1.01689217 110 0.86284938 2.227273
15 City15 499 475 182 0.3679611 0.2519497 2.82647041 182 0.25194969 2.117216
Now the rows that are most different based on columns 2:4 are at the top. Columns 5:7 can be handled in an analogous way, as sketched below.
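For example, the same ranking can be computed for the Trend columns (a sketch; FC_T is just an illustrative name for the new helper column):
# ratio of the mean trend to the minimum trend per row, largest ratios first
df$FC_T <- rowMeans(df[, 5:7]) / df$T_min
head(df[order(-df$FC_T), ])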
And some tips for statistical tests:
Prefer the t-test (parametric, based on the mean) over the Wilcoxon/Mann-Whitney U test (non-parametric, based on the median), because it has more power; HOWEVER:
- The samples should be reasonably large. Example hypothesis: Montreal has taller citizens than Quebec; a t-test works fine when you take 100 people from each city, giving height measurements of 200 people, 100 vs 100.
- The distributions should be close to normal in all samples, or at least both samples should have a similarly shaped non-normal distribution (e.g. binomial). You should not use the test when one sample is roughly normal and the other is not.
- The samples should be of comparable size: 100 vs 100 is fine, but with something like 87 vs 234 the p-value can be misleading.
If your data does not meet the above conditions, I prefer the non-parametric test: less power, but more robust.
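If you do want a formal per-row test despite the tiny sample size (only three values per row), one option, sketched here purely as an illustration and not part of the ranking approach above, is a one-sample t-test of the level columns against that row's minimum (L_pval is just an illustrative name):
# p-value of a one-sample t-test of L1:L3 against the row minimum
# (errors if all three values are identical; with n = 3, interpret with great caution)
df$L_pval <- apply(df[, 2:4], 1, function(x) t.test(x, mu = min(x))$p.value)
head(df[, c("City", "L1", "L2", "L3", "L_min", "L_pval")])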
