How to subtract values of a first column from all columns by function in R

I wonder how to make a function to subtract values present in column A01 from columns A02, A03 etc.
example data frame:
A01 A02 A03 A04 A05 (...)
1 158 297 326 354 357
2 252 131 341 424 244
3 ...
4 ...
I can manually subtract each column for example:
sampledata[1]-sampledata[1]
sampledata[2]-sampledata[1]
sampledata[3]-sampledata[1]
sampledata[4]-sampledata[1] ... etc.
But how can I write a function that does this calculation for every column? As a result, I expect to get this:
A01 A02 A03 A04 A05 (...)
1 0 139 168 196 199
2 0 -121 89 172 -8
3 ...
4 ...
After the subtraction, any negative value should be converted to zero.
I assume that my problem is easy to solve, but I'm a newbie in R.

Thank you all for different solutions.
It seems that the simplest solution that still works perfectly is the one suggested by @DavidArenburg:
new_sample_data = (sampledata - sampledata[,1]) * (sampledata > sampledata[,1])
It makes two transformations in one formula (subtracting first column, and converting negatives to zeroes).
Thank you!

Here's how:
# Your data
A01 <- c(158, 252)
A02 <- c(297, 131)
A03 <- c(326, 341)
A04 <- c(354, 424)
A05 <- c(357, 244)
df <- data.frame(A01, A02, A03, A04, A05, stringsAsFactors = FALSE)
df
# Define the function
f_minus <- function(first_col, other_col) {
other_col - first_col
}
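# Pre-allocate an output data frame with the same dimensions as df
df_output <- as.data.frame(matrix(ncol = ncol(df), nrow = nrow(df)))
# Subtract the first column from every column (including itself)
for (i in seq_len(ncol(df))) {
  df_output[, i] <- f_minus(df[, 1], df[, i])
}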
df_output
# V1 V2 V3 V4 V5
# 1 0 139 168 196 199
# 2 0 -121 89 172 -8
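The loop above leaves negative differences in place, while the question also asks for them to be set to zero. A minimal vectorised sketch of both steps follows. The helper name subtract_first is made up for illustration, and the sketch assumes every column is numeric.

```r
# Example data copied from the question
sampledata <- data.frame(
  A01 = c(158, 252), A02 = c(297, 131), A03 = c(326, 341),
  A04 = c(354, 424), A05 = c(357, 244)
)

# Hypothetical helper: subtract the first column from every column,
# then replace negative results with zero.
subtract_first <- function(d) {
  out <- d - d[[1]]   # the first column is recycled down each column
  out[out < 0] <- 0   # clamp negative differences to zero
  out
}

result <- subtract_first(sampledata)
result
#   A01 A02 A03 A04 A05
# 1   0 139 168 196 199
# 2   0   0  89 172   0
```

Because the helper keeps the original data frame, the column names A01 to A05 are preserved instead of becoming V1 to V5.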

Related

grouping overlapping regions based on a clustering factor in R

Using the foverlaps function from the data.table package I get overlapping regions (it shows only 25 lines but it's more than 50 thousand) and I would like to group the overlapping regions for each id taking into account the following criteria:
If they have the same ID and overlapping regions belonging to the same or different group, then:
1) group them all, 2) extend the range (i.e. start = min(overlapping item set) and end = max(overlapping item set)), and 3) assign the group name of the row with the maximum score.
For example, given the data set:
dt <- data.table::data.table(
ID=c("1015_4_1_1","1015_4_1_1","1015_4_1_1","103335_0_1_2","103335_0_1_2",
"103335_0_1_2","11099_0_1_1","11099_0_1_1","11099_0_1_1","11099_0_1_1","11099_0_1_1",
"11702_0_1_1","11702_0_1_1","11702_0_1_1","11702_0_1_5","11702_0_1_5","11702_0_1_5",
"140331_0_1_1","140331_0_1_1","140331_0_1_1","14115_0_1_7","14115_0_1_7",
"14115_0_1_7","14115_0_1_8","14115_0_1_8"),
start=c(193,219,269,149,149,163,51,85,314,331,410,6193,6269,6278,6161,6238,6246,303,304,316,1525,1526,1546,1542,1543),
end=c(307,273,399,222,235,230,158,128,401,428,507,6355,6337,6356,6323,6305,6324,432,396,406,1603,1688,1612,1620,1705),
group=c("R7","R5","R5","R4","R5","R6","R7","R5","R4","R5","R5","R5","R6","R4","R5","R6","R4","R5","R4","R6","R4","R5","R6","R4","R5"),
score=c(394,291,409,296,319,271,318,252,292,329,252,524,326,360,464,340,335,515,506,386,332,501,307,308,443)
)
The expected result is:
# 1015_4_1_1 193 399 R5 409
# 103335_0_1_2 149 235 R5 319
# 11099_0_1_1 51 158 R7 318
# 11099_0_1_1 314 507 R5 329
# 11702_0_1_1 6193 6356 R5 524
# 11702_0_1_5 6161 6324 R5 464
# 140331_0_1_1 303 432 R5 515
# 14115_0_1_7 1525 1705 R5 501
Note that for each ID there may be subgroups of regions that do not overlap each other. For example, in "11099_0_1_1" rows 7 and 8 are grouped in one subgroup and the remaining rows in another.
I have no experience with GenomicRanges or IRanges, and read in another comment that data.table is usually faster. So, since I was expecting a lot of overlapping regions, I started with foverlaps from data.table, but I don't know how to proceed. I hope you can help me, and thank you very much in advance
If your group is the full ID, then you could do:
library(data.table)
dt <- dt[
,IDy := cumsum(fcoalesce(+(start > (shift(cummax(end), type = 'lag') + 1L)), 0L)), by = ID][
, .(start = min(start), end = max(end),
group = group[which.max(score)],
score = max(score)),
by = .(ID, IDy)][, IDy := NULL]
Output (the extra row with score 443 appears because 14115_0_1_8 is a separate ID):
ID start end group score
1: 1015_4_1_1 193 399 R5 409
2: 103335_0_1_2 149 235 R5 319
3: 11099_0_1_1 51 158 R7 318
4: 11099_0_1_1 314 507 R5 329
5: 11702_0_1_1 6193 6356 R5 524
6: 11702_0_1_5 6161 6324 R5 464
7: 140331_0_1_1 303 432 R5 515
8: 14115_0_1_7 1525 1688 R5 501
9: 14115_0_1_8 1542 1705 R5 443
If your ID groups are actually defined only by the digits before the first underscore, then:
library(data.table)
dt <- dt[, IDx := sub('_.*', '', ID)][
, IDy := cumsum(fcoalesce(+(start > (shift(cummax(end), type = 'lag') + 1L)), 0L)), by = IDx][
, .(ID = ID[which.max(score)],
start = min(start), end = max(end),
group = group[which.max(score)],
score = max(score)),
by = .(IDx, IDy)
][, c('IDx', 'IDy') := NULL]
Output (the row with score 464 from your example is absent, because 11702_0_1_5 is merged into 11702_0_1_1):
dt
ID start end group score
1: 1015_4_1_1 193 399 R5 409
2: 103335_0_1_2 149 235 R5 319
3: 11099_0_1_1 51 158 R7 318
4: 11099_0_1_1 314 507 R5 329
5: 11702_0_1_1 6161 6356 R5 524
6: 140331_0_1_1 303 432 R5 515
7: 14115_0_1_7 1525 1705 R5 501
The above assumes that your start variable is already ordered from lowest to highest. If it is not, run setorder(dt, start) before executing the code above.
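The grouping step in both snippets relies on one idea: a new cluster begins whenever a start lies beyond the largest end seen so far (plus 1, so directly adjacent regions are merged too). Here is a minimal base R sketch of that idea on made-up intervals; the vectors starts and ends are hypothetical, and head() on the running maximum stands in for data.table's shift().

```r
# Hypothetical intervals, already sorted by start
starts <- c(1, 3, 10, 12, 30)
ends   <- c(5, 8, 11, 20, 35)

# Largest end among all earlier rows (NA for the first row)
prev_max_end <- c(NA, head(cummax(ends), -1))

# TRUE where a row cannot be merged with anything before it
new_cluster <- starts > prev_max_end + 1
new_cluster[is.na(new_cluster)] <- FALSE  # the first row opens cluster 0

# The running count of breaks gives a cluster id per row
cluster <- cumsum(new_cluster)
cluster
# 0 0 1 1 2

# Collapse each cluster to its overall range
merged <- data.frame(
  start = as.vector(tapply(starts, cluster, min)),
  end   = as.vector(tapply(ends, cluster, max))
)
merged
#   start end
# 1     1   8
# 2    10  20
# 3    30  35
```

The row starting at 12 joins the cluster of the row ending at 11 only because of the "+ 1"; dropping it would require a true overlap.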

Adding columns by splitting number, and removing duplicates

I have a dataframe like the following (this is a reduced example, I have many more rows and columns):
CH1 CH2 CH3
1 3434 282 7622
2 4442 6968 8430
3 4128 6947 478
4 6718 6716 3017
5 3735 9171 1128
6 65 4876 4875
7 9305 6944 3309
8 4283 6060 650
9 5588 2285 203
10 205 2345 9225
11 8634 4840 780
12 6383 0 1257
13 4533 7692 3760
14 9363 9846 4697
15 3892 79 4372
16 6130 5312 9651
17 7880 7386 6239
18 8515 8021 2295
19 1356 74 8467
20 9024 8626 4136
I need to create additional columns by splitting the values. For example, value 1356 would have to be split into 6, 56, and 356. I do this in a for loop, splitting the values as strings so that leading zeros are kept. So far, so good.
# CREATE ADDITIONAL COLUMNS
for(col in 1:3) {
# Create a temporary variable
temp <- as.character(data[,col] )
# Save the new column
for(mod in c(-1, -2, -3)) {
# Create the column
temp <- cbind(temp, str_sub(as.character(data[,col]), mod))
}
# Merge to the row
data <- cbind(data, temp)
}
My problem is that not all cells have 4 digits: some may have 1, 2 or 3 digits. Therefore, I get repeated values when I split. For example, for 79 I get: 79 (original), 9, 79, 79.
Problem: I need to remove the repeated values. Of course, I could do unique, but that gives me rows of uneven number of columns. I need to fill those missing (i.e. the removed repeated values) with NA. I can only compare this by row.
I checked CJ Yetman's answer here, but they only replace consecutive numbers. I only need to keep unique values.
Reproducible Example: Here is a fiddle with my code working: http://rextester.com/IKMP73407
Expected outcome: For example, for rows 11 & 12 of the example (see the link for the reproducible example), if this is my original:
8634 4 34 634 4840 0 40 840 780 0 80 780
6383 3 83 383 0 0 0 0 1257 7 57 257
I'd like to get this:
8634 4 34 634 4840 0 40 840 780 NA 80 NA
6383 3 83 383 0 NA NA NA 1257 7 57 257
You can use apply():
The data:
data <- structure(list(CH1 = c(3434L, 4442L, 4128L, 6718L, 3735L, 65L,
9305L, 4283L, 5588L, 205L, 8634L, 6383L, 4533L, 9363L, 3892L,
6130L, 7880L, 8515L, 1356L, 9024L), CH2 = c(282L, 6968L, 6947L,
6716L, 9171L, 4876L, 6944L, 6060L, 2285L, 2345L, 4840L, 0L, 7692L,
9846L, 79L, 5312L, 7386L, 8021L, 74L, 8626L), CH3 = c(7622L,
8430L, 478L, 3017L, 1128L, 4875L, 3309L, 650L, 203L, 9225L, 780L,
1257L, 3760L, 4697L, 4372L, 9651L, 6239L, 2295L, 8467L, 4136L
)), .Names = c("CH1", "CH2", "CH3"), row.names = c(NA, 20L), class = "data.frame")
Select row 11 and 12:
data <- data[11:12, ]
Using your code:
# CREATE ADDITIONAL COLUMNS
for(col in 1:3) {
# Create a temporary variable
temp <- data[,col]
# Save the new column
for(mod in c(10, 100, 1000)) {
# Create the column
temp <- cbind(temp, data[, col] %% mod)
}
data <- cbind(data, temp)
}
data[,1:3] <- NULL
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 0 80 780
12 6383 3 83 383 0 0 0 0 1257 7 57 257
Then go through the data row by row and remove duplicates and transpose the outcome:
t(apply(data, 1, function(row) {
row[duplicated(row)] <- NA
return(row)
}))
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 NA 80 NA
12 6383 3 83 383 0 NA NA NA 1257 7 57 257

Subset Columns based on partial matching of column names in the same data frame

I would like to understand how to subset multiple columns from same data frame by matching the first 5 letters of the column names with each other and if they are equal then subset it and store it in a new variable.
Here is a small explanation of my required output. It is described below,
Lets say the data frame is eatable
fruits_area fruits_production vegetables_area vegetable_production
12 100 26 324
33 250 40 580
660 510 43 581
eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
"vegetable_production")
I was trying to write a function which will match the strings in a loop and will store the subset columns after matching first 5 letters from the column names.
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
checkExpression(eatable,"your_string")
The above function checks the string correctly but I am confused how to do matching among the column names in the dataset.
Edit:- I think regular expressions would work here.
You could try:
v <- unique(substr(names(eatable), 1, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select() (select_() is deprecated in current dplyr):
library(tidyverse)
map(v, ~ select(eatable, matches(.x)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
v <- unique(substr(names(df), 1, l))
lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
I believe this may address your needs:
checkExpression <- function(dataset,str){
cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
subset(dataset,select=colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want but there was a small error:
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
Change the name of the object you are subsetting from obje to dataset.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
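As a compact alternative to the lapply() and grepl() answers above, base R's split.default() can split the columns of a data frame by a grouping vector. The sketch below groups by the first five letters of each column name and returns a named list; the variable names prefix and by_prefix are made up for illustration.

```r
eatable <- data.frame(c(12, 33, 660), c(100, 250, 510), c(26, 40, 43), c(324, 580, 581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
                    "vegetable_production")

# Grouping key: the first five letters of every column name
prefix <- substr(names(eatable), 1, 5)

# split.default() treats the data frame as a list of columns and
# returns one data frame per distinct prefix, named by that prefix
by_prefix <- split.default(eatable, prefix)

names(by_prefix)
# "fruit" "veget"
by_prefix$fruit
#   fruits_area fruits_production
# 1          12               100
# 2          33               250
# 3         660               510
```

Since the prefixes are compared for equality rather than used as regular expressions, a prefix that happens to occur in the middle of another column name cannot produce a false match.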

Multiple scatterplot figure in R

I have a slightly complicated plotting task. I am halfway there, but not quite sure how to finish it. I have a dataset of the form below, with multiple subjects, each in either Treatgroup 0 or Treatgroup 1, each subject contributing several rows of data. Each row corresponds to a single timepoint at which there are values in the columns "count1", "count2", "weirdname3", etc.
Task 1. I need to calculate "Days", which is just the visitdate - the startdate, for each row. Should be an apply type function, I guess.
Task 2. I have to make a multiplot figure with one scatterplot for each of the count variables (a plot for count1, one for count2, etc). In each scatterplot, I need to plot the value of the count (y axis) against "Days" (x-axis) and connect the dots for each subject. Subjects in Treatgroup 0 are one color, subjects in treatgroup 1 are another color. Each scatterplot should be labeled with count1, count2 etc as appropriate.
I am trying to use the base plotting function, and have taken the approach of writing a plotting function to call later. I think this can work but need some help with syntax.
#Enter example data
tC <- textConnection("
ID StartDate VisitDate Treatstarted count1 count2 count3 Treatgroup
C0098 13-Jan-07 12-Feb-10 NA 457 343 957 0
C0098 13-Jan-06 2-Jul-10 NA 467 345 56 0
C0098 13-Jan-06 7-Oct-10 NA 420 234 435 0
C0098 13-Jan-05 3-Feb-11 NA 357 243 345 0
C0098 14-Jan-06 8-Jun-11 NA 209 567 254 0
C0098 13-Jan-06 9-Jul-11 NA 223 235 54 0
C0098 13-Jan-06 12-Oct-11 NA 309 245 642 0
C0110 13-Jan-06 23-Jun-10 30-Oct-10 629 2436 45 1
C0110 13-Jan-07 30-Sep-10 30-Oct-10 461 467 453 1
C0110 13-Jan-06 15-Feb-11 30-Oct-10 270 365 234 1
C0110 13-Jan-06 22-Jun-11 30-Oct-10 236 245 23 1
C0151 13-Jan-08 2-Feb-10 30-Oct-10 199 653 456 1
C0151 13-Jan-06 24-Mar-10 3-Apr-10 936 25 654 1
C0151 13-Jan-06 7-Jul-10 3-Apr-10 1147 254 666 1
C0151 13-Jan-06 9-Mar-11 3-Apr-10 1192 254 777 1
")
data1 <- read.table(header=TRUE, tC)
close.connection(tC)
# format date
data1$VisitDate <- with(data1,as.Date(VisitDate,format="%d-%b-%y"))
# stuck: need to define days as VisitDate - StartDate for each row of dataframe (I know I need an apply family fxn here)
data1$Days <- [applyfunction of some kind ](VisitDate,ID,function(x){x-data1$StartDate})))
# Unsure here. Need to define plot function
plot_one <- function(d){
with(d, plot(Days, Count, t="n", tck=1, cex.main = 0.8, ylab = "", yaxt = 'n', xlab = "", xaxt="n", xlim=c(0,1000), ylim=c(0,1200))) # set limits
grid(lwd = 0.3, lty = 7)
with(d[d$Treatgroup == 0,], points(Days, Count1, col = 1))
with(d[d$Treatgroup == 1,], points(Days, Count1, col = 2))
}
#Create multiple plot figure
par(mfrow=c(2,2), oma = c(0.5,0.5,0.5,0.5), mar = c(0.5,0.5,0.5,0.5))
#trouble here. I need to call the column names somehow, with; plyr::d_ply(data1, ???, plot_one)
Task 1:
data1$days <- floor(as.numeric(as.POSIXlt(data1$VisitDate,format="%d-%b-%y")
-as.POSIXlt(data1$StartDate,format="%d-%b-%y")))
Task 2:
par(mfrow=c(3,1), oma = c(2,0.5,1,0.5), mar = c(2,0.5,1,0.5))
plot(data1$days, data1$count1, col=as.factor(data1$Treatgroup), main="count1")
plot(data1$days, data1$count2, col=as.factor(data1$Treatgroup), main="count2")
plot(data1$days, data1$count3, col=as.factor(data1$Treatgroup), main="count3")
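The three plot() calls above draw the points but do not connect them per subject, which the question also asks for. The following is a rough base-graphics sketch of that part, not a definitive solution. It uses a small stand-in data frame called demo (a few rows copied from the example), assumes an English locale so that "%d-%b-%y" can parse the month abbreviations, and the function plot_one with its count_col argument is written here for illustration.

```r
# Small stand-in for data1; values taken from the example data
demo <- data.frame(
  ID         = c("C0098", "C0098", "C0110", "C0110"),
  StartDate  = c("13-Jan-06", "13-Jan-06", "13-Jan-06", "13-Jan-06"),
  VisitDate  = c("2-Jul-10", "7-Oct-10", "15-Feb-11", "22-Jun-11"),
  count1     = c(467, 420, 270, 236),
  count2     = c(345, 234, 365, 245),
  Treatgroup = c(0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Task 1: Date subtraction is vectorised, so no apply function is needed
demo$Days <- as.numeric(difftime(as.Date(demo$VisitDate, format = "%d-%b-%y"),
                                 as.Date(demo$StartDate, format = "%d-%b-%y"),
                                 units = "days"))

# Task 2: one panel per count column, one connected line per subject
plot_one <- function(d, count_col) {
  # Empty frame first, so the axes cover all subjects
  plot(d$Days, d[[count_col]], type = "n",
       main = count_col, xlab = "Days", ylab = count_col)
  for (id in unique(d$ID)) {
    s <- d[d$ID == id, ]
    s <- s[order(s$Days), ]               # connect visits in time order
    lines(s$Days, s[[count_col]], type = "b",
          col = s$Treatgroup[1] + 1)      # group 0 is black, group 1 is red
  }
}

count_cols <- grep("^count", names(demo), value = TRUE)
old_par <- par(mfrow = c(1, length(count_cols)))
for (cc in count_cols) plot_one(demo, cc)
par(old_par)
```

Selecting the panels with grep("^count", ...) only works when the columns share that prefix; for names such as "weirdname3", count_cols would have to be listed by hand.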

Count number of occurrences of a string in R under different conditions

I have a dataframe, with multiple columns called "data" which looks like this:
Preferences Status Gender
8a 8b 9a Employed Female
10b 11c 9b Unemployed Male
11a 11c 8e Student Female
That is, each customer selected 3 preferences and specified other information such as Status and Gender. Each preference is given by a [number][letter] combination, and there are c. 30 possible preferences. The possible preferences are:
8[a - c]
9[a - k]
10[a - d]
11[a - c]
12[a - i]
I want to count the number of occurrences of each preference, under certain conditions for the other columns - eg. for all women.
The output will ideally be a dataframe that looks like this:
Preference Female Male Employed Unemployed Student
8a 1034 934 234 495 203
8b 539 239 609 394 235
8c 124 395 684 94 283
9a 120 999 895 945 345
9b 978 385 596 923 986
etc.
What's the most efficient way to achieve this?
Thanks.
I am assuming you are starting with something that looks like this:
mydf <- structure(list(
Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
Status = c("Employed", "Unemployed", "Student"),
Gender = c("Female", "Male", "Female")),
.Names = c("Preferences", "Status", "Gender"),
class = c("data.frame"), row.names = c(NA, -3L))
mydf
# Preferences Status Gender
# 1 8a 8b 9a Employed Female
# 2 10b 11c 9b Unemployed Male
# 3 11a 11c 8e Student Female
If that's the case, you need to "split" the "Preferences" column (by spaces), transform the data into a "long" form, and then reshape it to a wide form, tabulating while you do so.
With the right tools, this is pretty straightforward.
library(devtools)
library(data.table)
library(reshape2)
source_gist(11380733) # for `cSplit`
dcast.data.table( # Step 3--aggregate to wide form
melt( # Step 2--convert to long form
cSplit(mydf, "Preferences", " ", "long"), # Step 1--split "Preferences"
id.vars = "Preferences"),
Preferences ~ value, fun.aggregate = length)
# Preferences Employed Female Male Student Unemployed
# 1: 10b 0 0 1 0 1
# 2: 11a 0 1 0 1 0
# 3: 11c 0 1 1 1 1
# 4: 8a 1 1 0 0 0
# 5: 8b 1 1 0 0 0
# 6: 8e 0 1 0 1 0
# 7: 9a 1 1 0 0 0
# 8: 9b 0 0 1 0 1
I also tried a dplyr + tidyr approach, which looks like the following:
library(dplyr)
library(tidyr)
mydf %>%
separate(Preferences, c("P_1", "P_2", "P_3")) %>% ## splitting things
gather(Pref, Pvals, P_1:P_3) %>% # stack the preference columns
gather(Var, Val, Status:Gender) %>% # stack the status/gender columns
group_by(Pvals, Val) %>% # group by these new columns
summarise(count = n()) %>% # aggregate the numbers of each
spread(Val, count) # spread the values out
# Source: local data table [8 x 6]
# Groups:
#
# Pvals Employed Female Male Student Unemployed
# 1 10b NA NA 1 NA 1
# 2 11a NA 1 NA 1 NA
# 3 11c NA 1 1 1 1
# 4 8a 1 1 NA NA NA
# 5 8b 1 1 NA NA NA
# 6 8e NA 1 NA 1 NA
# 7 9a 1 1 NA NA NA
# 8 9b NA NA 1 NA 1
Both approaches are actually pretty quick. Test it with some better sample data than what you shared, like this:
preferences <- c(paste0(8, letters[1:3]),
paste0(9, letters[1:11]),
paste0(10, letters[1:4]),
paste0(11, letters[1:3]),
paste0(12, letters[1:9]))
set.seed(1)
nrow <- 10000
mydf <- data.frame(
Preferences = vapply(replicate(nrow,
sample(preferences, 3, FALSE),
FALSE),
function(x) paste(x, collapse = " "),
character(1L)),
Status = sample(c("Employed", "Unemployed", "Student"), nrow, TRUE),
Gender = sample(c("Male", "Female"), nrow, TRUE)
)
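For readers who prefer to avoid extra packages, the same split, reshape to long, and tabulate idea can be sketched in base R. This is only an illustration on the three-row example from the question; the names prefs, long and counts are made up here.

```r
mydf <- data.frame(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Step 1: split each "Preferences" string on spaces
prefs <- strsplit(mydf$Preferences, " ", fixed = TRUE)

# Step 2: long form, one row per (customer, preference) pair
n_each <- lengths(prefs)
long <- data.frame(
  Preference = unlist(prefs),
  Status = rep(mydf$Status, n_each),
  Gender = rep(mydf$Gender, n_each),
  stringsAsFactors = FALSE
)

# Step 3: cross-tabulate against each condition and bind the tables
counts <- cbind(table(long$Preference, long$Gender),
                table(long$Preference, long$Status))
counts
```

Unlike the dplyr + tidyr output above, combinations that never occur show up as 0 rather than NA, because table() counts every pairing of the observed levels.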

Resources