R data.table get maximum value per row for multiple columns

I've got a data.table in R which looks like this:
dat <- structure(list(de = c(1470L, 8511L, 3527L, 2846L, 2652L, 831L
), fr = c(14L, 81L, 36L, 16L, 30L, 6L), it = c(9L, 514L, 73L,
37L, 91L, 2L), ro = c(1L, 14L, 11L, 1L, 9L, 0L)), .Names = c("de",
"fr", "it", "ro"), class = c("data.table", "data.frame"), row.names = c(NA,
-6L))
I now want to create a new data.table (having exactly the same columns) but holding only the maximum value per row. The values in the other columns should simply be NA.
The data.table could have any number of columns (the data.table above is just an example).
The desired output table would look like this:
     de fr it ro
1: 1470 NA NA NA
2: 8511 NA NA NA
3: 3527 NA NA NA
4: 2846 NA NA NA
5: 2652 NA NA NA
6:  831 NA NA NA

There are several issues with what the OP is attempting here: (1) this really looks like a case where data should be kept in a matrix rather than a data.frame or data.table; (2) there's no reason to want this sort of output that I can think of; and (3) doing any standard operations with the output will be a hassle.
With that said...
dat2 = dat
is.na(dat2)[-( 1:nrow(dat) + (max.col(dat)-1)*nrow(dat) )] <- TRUE
# or, as @PierreLafortune suggested
is.na(dat2)[col(dat) != max.col(dat)] <- TRUE
# or using the data.table package
dat2 = dat[rep(NA_integer_, nrow(dat)), ]
mc = max.col(dat)
# dat[[mc[i]]][i] picks the value in row i of the column holding that row's maximum
for (i in seq_along(mc)) set(dat2, i = i, j = mc[i], value = dat[[mc[i]]][i])
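As a small self-contained check of the max.col() idea above, here is a sketch on a plain matrix. The three-row data set is just the first rows of the question's data, and the use of ties.method = "first" is my own addition: by default max.col() breaks ties at random, so results can differ between runs when a row contains its maximum more than once.

```r
# First three rows of the question's data, as a plain data.frame
dat <- data.frame(de = c(1470L, 8511L, 3527L),
                  fr = c(14L, 81L, 36L),
                  it = c(9L, 514L, 73L),
                  ro = c(1L, 14L, 11L))

m  <- as.matrix(dat)
# Column index of the maximum in each row; "first" makes ties deterministic
mc <- max.col(m, ties.method = "first")

out <- m
# col(m) holds each cell's column index; mc is recycled down the rows,
# so every cell that is not its row's maximum becomes NA
out[col(m) != mc] <- NA
out
```

Working on a matrix also fits the remark above that this kind of data is arguably better kept in a matrix than in a data.frame or data.table.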

It's not clear to me whether you want to use the data.table package or would be satisfied with a data.frame built using only base functions. The latter is certainly possible.
Here is one solution, which uses only which.max() and starts from an all-NA copy of the data, so every cell that is not a row maximum simply stays NA.
df <- as.data.frame(dat)
maxdat <- df
maxdat[] <- NA
for (r in seq_len(nrow(df))) {
  j <- which.max(unlist(df[r, ]))  # column holding the maximum of row r
  maxdat[r, j] <- df[r, j]
}

Related

R function to count number of times when values changes

I am new to R. I have a dataset with 3 columns named A1, A2 and ChangeInA that looks like this:
A1  A2  ChangeInA
10  20  10
24  30  24
22  35  35
54  65  65
15  29  15
Each value in the column 'ChangeInA' is taken from either A1 or A2.
I want to determine the number of times the 3rd column ('ChangeInA') changes between the two.
Is there any function in R to do that?
Let me explain:
From the table, we can see that the 'ChangeInA' column switched twice,
first at row 3 and again at row 5 (note that 'ChangeInA' can only have values of A1 or A2), so I want an R function to print how many times the switch happened. I can see the change in the dataset but I need to prove it in R.
Below is the code I tried, based on previous answers
change<- rleid(rawData$ChangeInA == rawData$A1)
This showed me all the ChangeInA
change<- max(rleid(rawData$ChangeInA == rawData$A1))
This showed me the maximum number in ChangeInA
One option is to use rleid from data.table on the condition ChangeInA == A1: the run id increases by one every time that condition flips. Since the ids start at 1, subtracting 1 and taking the max gives the total number of changes.
library(data.table)
max(rleid(df$ChangeInA == df$A1) - 1)
# 2
Or we could use dplyr together with rleid (which still comes from data.table):
library(dplyr)
library(data.table)
df %>%
  mutate(rlid = rleid(A1 == ChangeInA) - 1) %>%
  pull(rlid) %>%
  last()
Data
df <- structure(list(A1 = c(10L, 24L, 22L, 54L, 15L), A2 = c(20L, 30L,
35L, 65L, 29L), ChangeInA = c(10L, 24L, 35L, 65L, 15L)), class = "data.frame", row.names = c(NA,
-5L))
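If you would rather avoid extra packages, the same count can be sketched in base R. This is only an illustration under the question's assumption that every ChangeInA value equals either A1 or A2; if A1 and A2 are ever equal in a row, the source of the value is ambiguous and the count may be off.

```r
# Same data as in the answer above
df <- data.frame(A1 = c(10L, 24L, 22L, 54L, 15L),
                 A2 = c(20L, 30L, 35L, 65L, 29L),
                 ChangeInA = c(10L, 24L, 35L, 65L, 15L))

# TRUE where the value comes from A1, FALSE where it comes from A2
from_a1 <- df$ChangeInA == df$A1

# A switch happens wherever two consecutive rows disagree
n_switches <- sum(diff(from_a1) != 0)
n_switches
# 2
```

length(rle(from_a1)$lengths) - 1 gives the same number, since each run after the first corresponds to one switch.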

Paste value from for loop into data frame R

I have two dataframes in R, recurrent and L1HS. I am trying to find a way to do this:
If a sequence in recurrent matches sequence in L1HS, paste a value from a column in recurrent into new column in L1HS.
The recurrent dataframe looks like this:
> head(recurrent)
chr start end X Y level unique
1: chr4 56707846 56708347 0 38 03 chr4_56707846_56708347
2: chr1 20252181 20252682 0 37 03 chr1_20252181_20252682
3: chr2 224560903 224561404 0 37 03 chr2_224560903_224561404
4: chr5 131849595 131850096 0 36 03 chr5_131849595_131850096
5: chr7 46361610 46362111 0 36 03 chr7_46361610_46362111
6: chr1 20251169 20251670 0 36 03 chr1_20251169_20251670
The L1HS dataset contains many columns containing genetic sequence basepairs and a column "Sequence" that should hopefully have some matches with "unique" in the recurrent data frame, like so:
> head(L1HS$Sequence)
"chr1_35031657_35037706"
"chr1_67544575_67550598"
"chr1_81404889_81410942"
"chr1_84518073_84524089"
"chr1_87144764_87150794"
I know how to search for matches using
test <- recurrent$unique %in% L1HS$Sequence
to get the Booleans:
> head(test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
But I have a couple of problems from here. If the sequence is found, I want to copy the "level" value from the recurrent dataset to the L1HS dataset in a new column. For example, if the sequence "chr4_56707846_56708347" from the recurrent data was found in the full-length data, I'd like the full-length data frame to look like:
Sequence level other_columns
chr4_56707846_56708347 03 gggtttcatgaccc....
I was thinking of trying something like:
for (i in L1HS){
if (recurrent$unique %in% L1HS$Sequence{
L1HS$level <- paste(recurrent$level[i])}
}
but of course this isn't working and I can't figure it out.
I am wondering what the best approach is here! I'm wondering if merge/intersect/apply might be easier/better, or just what best practice might look like for a somewhat simple question like this. I've found some similar examples for Python/pandas, but am stuck here.
Thanks in advance!
You can do a simple left_join to add level to L1HS with dplyr.
library(dplyr)
L1HS %>%
left_join(., recurrent %>% select(unique, level), by = c("Sequence" = "unique"))
Or with merge:
merge(x=L1HS,y=recurrent[, c("unique", "level")], by.x = "Sequence", by.y = "unique",all.x=TRUE)
Output
Sequence level
1 chr1_35031657_35037706 4
2 chr1_67544575_67550598 2
3 chr1_81404889_81410942 NA
4 chr1_84518073_84524089 3
5 chr1_87144764_87150794 NA
*Note: This will still retain all the columns in L1HS. I just didn't create any additional columns in the example data below.
Data
recurrent <- structure(list(chr = c("chr4", "chr1", "chr2", "chr5", "chr7",
"chr1"), start = c(56707846L, 20252181L, 224560903L, 131849595L,
46361610L, 20251169L), end = c(56708347L, 20252682L, 224561404L,
131850096L, 46362111L, 20251670L), X = c(0L, 0L, 0L, 0L, 0L,
0L), Y = c(38L, 37L, 37L, 36L, 36L, 36L), level = c(3L, 2L, 3L,
3L, 3L, 4L), unique = c("chr4_56707846_56708347", "chr1_67544575_67550598",
"chr2_224560903_224561404", "chr5_131849595_131850096", "chr1_84518073_84524089",
"chr1_35031657_35037706")), class = "data.frame", row.names = c(NA,
-6L))
L1HS <- structure(list(Sequence = c("chr1_35031657_35037706", "chr1_67544575_67550598",
"chr1_81404889_81410942", "chr1_84518073_84524089", "chr1_87144764_87150794"
)), class = "data.frame", row.names = c(NA, -5L))
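Since the question started from %in%, it may help to see the closely related match(), which returns the position of each match instead of TRUE/FALSE. The following is a minimal base R sketch on a cut-down version of the data above (only three rows per table, picked by me for illustration); it assumes the values in recurrent$unique are not duplicated, because match() only reports the first hit.

```r
# Cut-down versions of the two data frames (illustrative subset)
recurrent <- data.frame(
  unique = c("chr4_56707846_56708347", "chr1_67544575_67550598",
             "chr1_35031657_35037706"),
  level  = c(3L, 2L, 4L),
  stringsAsFactors = FALSE
)
L1HS <- data.frame(
  Sequence = c("chr1_35031657_35037706", "chr1_67544575_67550598",
               "chr1_81404889_81410942"),
  stringsAsFactors = FALSE
)

# For each Sequence, the row of `recurrent` it matches (NA if none)
idx <- match(L1HS$Sequence, recurrent$unique)

# Indexing with NA yields NA, so unmatched sequences get level NA
L1HS$level <- recurrent$level[idx]
L1HS
```

Unlike merge(), this keeps the rows of L1HS in their original order and never adds rows.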

Data manipulations in R

As part of a project, I am currently using R to analyze some data. I am stuck retrieving a few values from an existing dataset, which I imported from a CSV file.
The file looks like:
For my analysis, I want to create another column which is the current value of x minus its previous value. For the first row of each unique i, the new column should simply hold the current value of x. I am new to R and I have been trying various ways for some time now, but I still can't figure out how to do this. I would appreciate suggestions on an approach I can follow to achieve this.
Mydata structure
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=TRUE), i=rep(1:2, each=5))
You can use the diff() function. Note that diff() returns a vector one element shorter than its input, so if you want to add the result as a new column to your existing data frame you have to pad it. In your case you can try this:
# if your data frame is called MyData
MyData$newX = c(NA,diff(MyData$x))
That puts an NA value in the first entry of your new column, and the remaining values will be the differences between consecutive values in your "x" column.
UPDATE:
You can create a simple loop by subsetting through every unique instance of "i" and then calculating the difference between your x values
# initialize a new dataframe
newdf = NULL
values = unique(MyData$i)
for(i in seq_along(values)){
  # keep only the rows belonging to the current value of "i"
  data1 = MyData[MyData$i == values[i],]
  data1$newX = c(NA,diff(data1$x))
  newdf = rbind(newdf,data1)
}
# and then if you want to overwrite your original dataframe with newdf
MyData = newdf
# remove some variables
rm(data1,newdf,values)
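The per-group difference can also be sketched without a loop or extra packages using base R's ave(), which applies a function within each group and returns the results in the original row order. The small data set below is made up for illustration, and the first value of each group is kept as is, as the question asks.

```r
# Made-up example data: two groups of three rows each
MyData <- data.frame(t = 1:6,
                     x = c(10L, 15L, 12L, 100L, 130L, 125L),
                     i = rep(1:2, each = 3))

# Within each value of i: first entry is x itself, the rest are differences
MyData$x_diff <- ave(MyData$x, MyData$i,
                     FUN = function(v) c(v[1], diff(v)))
MyData
```

Replace v[1] with NA if you prefer a missing value at the start of each group, as in the loop above.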

How can you loop this higher-order function in R?

This question relates to the reply I received here with a nice little function from thelatemail.
The dataframe I'm using is not optimal, but it's what I've got and I'm simply trying to loop this function across all rows.
This is my df
dput(SO_Example_v1)
structure(list(Type = structure(c(3L, 1L, 2L), .Label = c("Community",
"Contaminant", "Healthcare"), class = "factor"), hosp1_WoundAssocType = c(464L,
285L, 24L), hosp1_BloodAssocType = c(73L, 40L, 26L), hosp1_UrineAssocType = c(75L,
37L, 18L), hosp1_RespAssocType = c(137L, 77L, 2L), hosp1_CathAssocType = c(80L,
34L, 24L), hosp2_WoundAssocType = c(171L, 115L, 17L), hosp2_BloodAssocType = c(127L,
62L, 12L), hosp2_UrineAssocType = c(50L, 29L, 6L), hosp2_RespAssocType = c(135L,
142L, 6L), hosp2_CathAssocType = c(95L, 24L, 12L)), .Names = c("Type",
"hosp1_WoundAssocType", "hosp1_BloodAssocType", "hosp1_UrineAssocType",
"hosp1_RespAssocType", "hosp1_CathAssocType", "hosp2_WoundAssocType",
"hosp2_BloodAssocType", "hosp2_UrineAssocType", "hosp2_RespAssocType",
"hosp2_CathAssocType"), class = "data.frame", row.names = c(NA,
-3L))
####################
#what it looks like#
####################
require(dplyr)
df <- tbl_df(SO_Example_v1)
head(df)
Type hosp1_WoundAssocType hosp1_BloodAssocType hosp1_UrineAssocType
1 Healthcare 464 73 75
2 Community 285 40 37
3 Contaminant 24 26 18
Variables not shown: hosp1_RespAssocType (int), hosp1_CathAssocType (int), hosp2_WoundAssocType
(int), hosp2_BloodAssocType (int), hosp2_UrineAssocType (int), hosp2_RespAssocType (int),
hosp2_CathAssocType (int)
The function I have is to perform a chisq.test across all categories in df$Type. Ideally the function should switch to a fisher.test() if the cell count is <5, but that's a separate issue (extra brownie points for the person who comes up with how to do that though).
This is the function I'm using to go row by row
func <- Map(
function(x,y) {
out <- cbind(x,y)
final <- rbind(out[1,],colSums(out[2:3,]))
chisq <- chisq.test(final,correct=FALSE)
chisq$p.value
},
SO_Example_v1[grepl("^hosp1",names(SO_Example_v1))],
SO_Example_v1[grepl("^hosp2",names(SO_Example_v1))]
)
func
But ideally, i'd want it to be something like this
for(i in 1:nrow(df)){func}
But that doesn't work. A further hook is, that when for example, row two is taken, the final call looks like this
func <- Map(
function(x,y) {
out <- cbind(x,y)
final <- rbind(out[2,],colSums(out[c(1,3),]))
chisq <- chisq.test(final,correct=FALSE)
chisq$p.value
},
SO_Example_v1[grepl("^hosp1",names(SO_Example_v1))],
SO_Example_v1[grepl("^hosp2",names(SO_Example_v1))]
)
func
so the function should understand that the cell count its taking for out[x,] has to be excluded from colSums(). This data.frame only has 3 rows, so it's easy, but I've tried applying this function to a separate data.frame I have that consists >200 rows, so it would be nice to be able to loop this somehow.
Any help appreciated.
Cheers
You were missing two things:
To select line i and to select everything but this line, you want to use
u[i] and u[-i]
If an argument given to Map is shorter than the others, it is recycled, which is a very general property of the language. So you just have to add an argument to the function giving the line you want to compare against the others; it will be recycled for all the items of the vectors passed.
The following does what you asked for
# the function doing the stats
FisherOrChisq <- function(x,y,lineComp) {
out <- cbind(x,y)
final <- rbind(out[lineComp,],colSums(out[-lineComp,]))
test <- chisq.test(final,correct=FALSE)
return(test$p.value)
}
# test of the stat function
FisherOrChisq(SO_Example_v1[grep("^hosp1",names(SO_Example_v1))[1]],
SO_Example_v1[grep("^hosp2",names(SO_Example_v1))[1]],2)
# making the loop
result <- c()
for(type in SO_Example_v1$Type){
line <- which(SO_Example_v1$Type==type)
res <- Map(FisherOrChisq,
SO_Example_v1[grepl("^hosp1",names(SO_Example_v1))],
SO_Example_v1[grepl("^hosp2",names(SO_Example_v1))],
line
)
result <- rbind(result,res)
}
colnames(result) <- gsub("^hosp[0-9]+","",colnames(result))
rownames(result) <- SO_Example_v1$Type
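The function above always runs chisq.test(), even though the question also asked about switching to fisher.test() for small counts. One way this could be sketched is shown below. The helper name chisq_or_fisher and the two example tables are made up, and the rule used (fall back when any expected count is below 5) is a common rule of thumb that I am assuming here; the question itself only says "cell count is <5".

```r
# Hypothetical helper: p-value from a chi-squared test, or from
# Fisher's exact test when any expected cell count is below 5
chisq_or_fisher <- function(tab) {
  # expected counts under independence: row total * column total / grand total
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  if (any(expected < 5)) {
    fisher.test(tab)$p.value
  } else {
    chisq.test(tab, correct = FALSE)$p.value
  }
}

# Made-up 2 x 2 tables: one with large counts, one with small counts
big   <- matrix(c(464, 309, 171, 132), nrow = 2)
small <- matrix(c(3, 1, 2, 6), nrow = 2)

p_big   <- chisq_or_fisher(big)    # uses chisq.test
p_small <- chisq_or_fisher(small)  # uses fisher.test
```

Inside FisherOrChisq, the table called final could be passed to such a helper in place of the direct chisq.test() call.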
That said, what you are doing is very heavy multiple testing. I would be extremely cautious with the use of the corresponding p-values; at the very least you need to use a multiple testing correction such as what is suggested here.
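As an illustration of what such a correction can look like, base R's p.adjust() implements several standard methods. The p-values below are made up and only stand in for the entries of result; which method is appropriate depends on your study, so treat this as a sketch rather than a recommendation.

```r
# Made-up raw p-values standing in for the test results
p_raw <- c(0.001, 0.02, 0.04, 0.30)

# Holm controls the family-wise error rate
p_holm <- p.adjust(p_raw, method = "holm")

# Benjamini-Hochberg controls the false discovery rate
p_bh <- p.adjust(p_raw, method = "BH")

cbind(p_raw, p_holm, p_bh)
```

With the list-matrix built in the loop above, something like p.adjust(unlist(result), method = "holm") would adjust all tests at once.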

Extracting values from R table within grouped values

I have the following table, ordered and grouped by first, second and Name.
myData <- structure(list(first = c(120L, 120L, 126L, 126L, 126L, 132L, 132L), second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
Name = structure(c(5L, 5L, 3L, 3L, 4L, 1L, 2L), .Label = c("Benzene",
"Ethene._trichloro-", "Heptene", "Methylamine", "Pentanone"
), class = "factor"), Area = c(699468L, 153744L, 32913L,
4948619L, 83528L, 536339L, 105598L), Sample = structure(c(3L,
2L, 3L, 3L, 3L, 1L, 1L), .Label = c("PO1:1", "PO2:1", "PO4:1"
), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -7L))
Within each group I want to extract the area that corresponds to each specific sample. Several groups don't have areas for all samples, so if a sample isn't detected it should return "NA". Ideally, the final output should have one column per sample.
I have tried the ifelse function to create one column to each sample:
PO1<-ifelse(myData$Sample=="PO1:1",myData$Area, "NA")
However, this doesn't take the group distribution into account. I want to do this within each group (a group has equal values for the first, second and Name columns): if Sample is PO1:1, return Area, else NA.
For the first group:
structure(list(first = c(120L, 120L), second = c(1.33, 1.33),
Name = structure(c(1L, 1L), .Label = "Pentanone", class = "factor"),
Area = c(699468L, 153744L), Sample = structure(c(2L, 1L), .Label = c("PO2:1",
"PO4:1"), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -2L))
The output should be:
structure(list(PO1.1 = NA, PO2.1 = 153744L, PO3.1 = NA, PO4.1 = 699468L), .Names =c("PO1.1", "PO2.1", "PO3.1", "PO4.1"), class = "data.frame", row.names = c(NA, -1L))
Any suggestion?
As in the example in the question, I am assuming Sample is a factor. If this is not the case, consider making it one.
First, let's clean up the column Sample so that its levels are legal names, or else it might cause errors
levels(myData$Sample) <- make.names(levels(myData$Sample))
## DEFINE THE CUTS##
# Adjust these as necessary
#--------------------------
max.second <- 3 # max & min range of myData$second
min.second <- 0 #
sprd <- 0.15 # with spread for each group
#--------------------------
# we will cut the myData$second according to intervals, cut(myData$second, intervals)
intervals <- seq(min.second, max.second, sprd*2)
# Next, lets create a group column to split our data frame by
myData$group <- paste(myData$first, cut(myData$second, intervals), myData$Name, sep='-')
groups <- split(myData, myData$group)
samples <- levels(myData$Sample) ## I'm assuming not all samples are present in the example. Manually adjusting with: samples <- sort(c(samples, "PO3.1"))
# Apply over each group, then apply over each sample
myOutput <-
t(sapply(groups, function(g) {
#-------------------------------
# NOTE: If it's possible that within a group there is more than one Area per Sample, then we have to somehow allow for this. Hence the "paste(...)"
res <- sapply(samples, function(s) paste0(g$Area[g$Sample==s], collapse=" - ")) # allowing for multiple values
unlist(ifelse(res=="", NA, res))
## If there is (or should be) only one Area per Sample, then remove the two lines above and uncomment the two below:
# res <- sapply(samples, function(s) g$Area[g$Sample==s]) # <~~ This line will work when only one value per sample
# unlist(ifelse(res==0, NA, res))
#-------------------------------
}))
# Cleanup names
rownames(myOutput) <- paste("Group", 1:nrow(myOutput), sep="-") ## or whichever proper group name
# remove dummy column
myData$group <- NULL
Results
myOutput
        PO1.1    PO2.1    PO3.1 PO4.1
Group-1 NA       "153744" NA    "699468"
Group-2 NA       NA       NA    "32913 - 4948619"
Group-3 NA       NA       NA    "83528"
Group-4 "536339" NA       NA    NA
Group-5 "105598" NA       NA    NA
You cannot really expect R to intuit that there is a fourth factor level between PO2 and PO4, now can you?
> reshape(myData, direction="wide", idvar=c('first','second','Name'), timevar="Sample")
  first second               Name Area.PO4:1 Area.PO2:1 Area.PO1:1
1   120    1.3          Pentanone     699468     153744         NA
3   126    0.4            Heptene      32913         NA         NA
4   126    0.4            Heptene    4948619         NA         NA
5   126    0.3        Methylamine      83528         NA         NA
6   132    0.5            Benzene         NA         NA     536339
7   132    0.5 Ethene._trichloro-         NA         NA     105598
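If you do want a column for a sample that never occurs in the data, one option is to tell R about it by declaring all expected levels explicitly. The sketch below uses a cut-down, made-up subset of the data and tapply(), which keeps unused factor levels and fills empty cells with NA. Both the level list and the use of sum to combine several areas for the same cell are assumptions of mine.

```r
# Cut-down illustrative subset of the data
myData <- data.frame(
  Name   = c("Pentanone", "Pentanone", "Benzene"),
  Area   = c(699468L, 153744L, 536339L),
  Sample = c("PO4:1", "PO2:1", "PO1:1"),
  stringsAsFactors = FALSE
)

# Declare every expected sample, including PO3:1, which never occurs here
myData$Sample <- factor(myData$Sample,
                        levels = c("PO1:1", "PO2:1", "PO3:1", "PO4:1"))

# One row per Name, one column per sample level; empty cells are NA
wide <- tapply(myData$Area, list(myData$Name, myData$Sample), FUN = sum)
wide
```

The PO3:1 column is present and entirely NA, because the level was declared even though no row uses it.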
