I'm trying to do something with a table in R.
The table comes into the script like this
M P
Position1 34 56
Position2 45 23
Position3 89 78
Position1 56 45
Position3 54 35
Position2 56 89
And after analyzing this script, ideally, I'd like a final output to be this:
M P
Position1 90 101
Position2 101 112
Position3 143 113
Basically I sum the total number across the positions for M and P. I was wondering if there was an easier way to do this. The positions will be at random. Is there a way to potentially split the data table by the position?
You can use summarise_each from dplyr if you have multiple columns and a big dataset, provided the data is a data.frame (it is not clear from the post whether you have a matrix or a data.frame).
library(dplyr)
dat %>%
group_by(Pos) %>%
summarise_each(funs(sum=sum(., na.rm=TRUE)))
# Pos M P
#1 Position1 90 101
#2 Position2 101 112
#3 Position3 143 113
Another option I would use for bigger datasets is data.table. From the benchmarks by @Ananda Mahto, it is the clear winner in speed.
library(data.table)
setDT(dat)[, lapply(.SD, sum, na.rm=TRUE), by=Pos]
# Pos M P
#1: Position1 90 101
#2: Position2 101 112
#3: Position3 143 113
If you are using a matrix and do not want to convert it to a data.frame by creating a new column from the row names (though that option might still be efficient):
do.call(rbind, by(m1, list(rownames(m1)), colSums, na.rm=TRUE))
# M P
#Position1 90 101
#Position2 101 112
#Position3 143 113
Or a slightly more efficient method when dealing with matrices
library(reshape2)
acast(melt(m1), Var1~Var2, value.var="value", sum, na.rm=TRUE)
# M P
#Position1 90 101
#Position2 101 112
#Position3 143 113
data
The rownames are added as a column as data.frame won't allow duplicate rownames.
dat <- structure(list(Pos = c("Position1", "Position2", "Position3",
"Position1", "Position3", "Position2"), M = c(34L, 45L, 89L,
56L, 54L, 56L), P = c(56L, 23L, 78L, 45L, 35L, 89L)), .Names = c("Pos",
"M", "P"), class = "data.frame", row.names = c(NA, -6L))
m1 <- structure(c(34, 45, 89, 56, 54, 56, 56, 23, 78, 45, 35, 89), .Dim = c(6L,
2L), .Dimnames = list(c("Position1", "Position2", "Position3",
"Position1", "Position3", "Position2"), c("M", "P")))
One more, just for fun. This one produces the structure you show in the post.
t(sapply(split(dat[-1], dat$Pos), colSums))
# M P
# Position1 90 101
# Position2 101 112
# Position3 143 113
This answer only applies if you are dealing with a matrix (like the "m1" dataset shared in @akrun's answer):
xtabs(Freq ~ Var1 + Var2, data.frame(as.table(m1)))
# Var2
# Var1 M P
# Position1 90 101
# Position2 101 112
# Position3 143 113
You can also use aggregate, as follows:
> ddf
V1 V2 V3
1 Position1 34 56
2 Position2 45 23
3 Position3 89 78
4 Position1 56 45
5 Position3 54 35
6 Position2 56 89
> a1 = aggregate(V2~V1, ddf, sum)
> a2 = aggregate(V3~V1, ddf, sum)
> merge(a1, a2)
V1 V2 V3
1 Position1 90 101
2 Position2 101 112
3 Position3 143 113
First get your row names:
rows <- unique(rownames(yourMatrix))
Make sure unique is there, or we'll get a lot of duplicates. (A data.frame won't allow duplicate row names, so this assumes the data is a matrix.)
Then you can do a couple of different things here. The package plyr would come in handy, but just using base R you can use lapply to calculate your sums:
result <- lapply(rows, function(rname) {
  # keep every row with this name; indexing by name alone returns only the first match
  subsetM <- yourMatrix[rownames(yourMatrix) == rname, , drop = FALSE]
  apply(subsetM, 2, sum)
})
To break it down, you take all your unique row names, and in lapply subset your matrix to just the rows with that name. Next, you apply sum over the columns of that subset, and collect the results in a list. You could then do something like do.call(rbind, result), using rows as the row names, to get your resulting matrix.
Definitely not the most efficient way to do it, but it's the first thing I thought of
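For the matrix case, base R also has rowsum(), which adds up all rows that share a group label. This is a different technique from the lapply loop above, shown here as a minimal sketch on the m1 matrix from the earlier answer (totals is just a name chosen for illustration):

```r
# Matrix with duplicated row names, as in the m1 example above
m1 <- matrix(c(34, 45, 89, 56, 54, 56,
               56, 23, 78, 45, 35, 89),
             ncol = 2,
             dimnames = list(c("Position1", "Position2", "Position3",
                               "Position1", "Position3", "Position2"),
                             c("M", "P")))

# rowsum() sums all rows that share the same group label;
# by default the result is ordered by the labels
totals <- rowsum(m1, group = rownames(m1))
print(totals)
```

The result is again a matrix with one row per position, so no reshaping step is needed afterwards.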
What you want is the aggregate function.
Say you have your table stored as data, with the positions in a column called Pos and the values in columns M and P (adjust the names to your data), then try
condensedData <- aggregate(data[c("M", "P")], by=list(Pos=data$Pos), FUN=sum, na.rm=TRUE)
If that doesn't do exactly what you want, try experimenting with the aggregate function. The important inputs are by and FUN. by tells aggregate which grouping values the rows of the result should be uniquely identified by, while FUN tells aggregate how to combine the numbers that share the same by value. FUN can be "sum", "mean", etc.
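To see the roles of by and FUN concretely, here is a small sketch on the example table. The data.frame dat and its column names Pos, M and P are taken from the other answer, not from the question itself, so treat them as assumptions:

```r
dat <- data.frame(
  Pos = c("Position1", "Position2", "Position3",
          "Position1", "Position3", "Position2"),
  M = c(34, 45, 89, 56, 54, 56),
  P = c(56, 23, 78, 45, 35, 89)
)

# by: a (named) list of grouping vectors; FUN: how rows in a group are combined
sums  <- aggregate(dat[c("M", "P")], by = list(Pos = dat$Pos), FUN = sum)
means <- aggregate(dat[c("M", "P")], by = list(Pos = dat$Pos), FUN = mean)

print(sums)
print(means)
```

Only the value columns are passed as the first argument; including the Pos column there would make sum fail on text.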
I have the following data table:
library(data.table)
set.seed(1)
DT <- data.table(ind=1:100,x=sample(100),y=sample(100),group=c(rep("A",50),rep("B",50)))
Now the problem I have is that I need to take every value in column "x" (that is, each given ID) and add each of the existing values in column "y" to it. I also need to do this separately per column "group". Let's assume we start with ID = 1. This element has the values x_1 = 68 and y_1 = 76. We also see y_2 = 39, y_3 = 24, etc. So what I want to compute is the sums x_1 + y_1, x_1 + y_2, x_1 + y_3, etc. But not only for x_1, also for x_2, x_3, etc. So for x_2 it would look like: x_2 + y_1, x_2 + y_2, x_2 + y_3, etc. This should also be done separately per column "group" (in this regard the dataset should simply be split by group).
Edit: Exemplary code to do this only for X_1 and group A:
current_X <- DT[1,x] # not needed, just to illustrate
vector_current_X <- rep(DT[1,x],nrow(DT[group == "A"]))
DT[group == "A",copy_current_X := vector_current_X]
DT[,sum_current_X_Y := copy_current_X + y]
DT
One apparent issue with this approach is that if it were applied to all x, then a lot of columns would be added to the final DT. So I am not sure if it is the best approach. In the end, I am just looking for the lowest sum (per element x) with each element y, and per group.
I know how to do operations per group, and I also know the lapply functions. The issue is that from my understanding, I need to include a row-wise loop. And next, the structure of the result will be different from the original data table, because we have many additional observations. I have seen before that you can save lists inside a data.table, but I am unsure if that is the best approach. My dataset is much larger, so efficiency is important.
Thanks for any hints how to approach this.
You can do this:
DT[, .(.BY$x+DT[group==.BY$group,y]), by=.(x,group)]
This returns N rows per x, where N is the size of x's group. We leverage the special symbol .BY, which is available in j when using by. Basically, .BY is a named list containing the values of the grouping variables. Here, I'm adding the value of x (.BY$x) to the vector of y values from the subset of DT where the group is equal to the current group value (.BY$group).
Output:
x group V1
<int> <char> <int>
1: 68 A 144
2: 68 A 107
3: 68 A 92
4: 68 A 121
5: 68 A 160
---
4996: 4 B 25
4997: 4 B 66
4998: 4 B 83
4999: 4 B 27
5000: 4 B 68
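If a matrix of sums is more convenient than a long table, base R's outer() builds all pairwise sums in one call. A small sketch with short vectors: only x_1 = 68 and the first three y values come from the question, the remaining x values are made up for illustration.

```r
x <- c(68, 39, 1)    # 68 is x_1 from the question; 39 and 1 are hypothetical
y <- c(76, 39, 24)   # y_1, y_2, y_3 as quoted in the question

# sums[i, j] holds x[i] + y[j]
sums <- outer(x, y, `+`)

sums[1, ]   # all sums for x_1: 144 107 92, matching the first rows above
```

Note that the matrix has length(x) * length(y) cells, so for large groups it uses as much memory as the long result.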
You can also accomplish this via a join:
DT[,!c("y")][DT[, .(y,group)], on=.(group), allow.cartesian=TRUE][, total:=x+y][order(ind)]
Output:
ind x group y total
<int> <int> <char> <int> <int>
1: 1 68 A 76 144
2: 1 68 A 39 107
3: 1 68 A 24 92
4: 1 68 A 53 121
5: 1 68 A 92 160
---
4996: 100 4 B 21 25
4997: 100 4 B 62 66
4998: 100 4 B 79 83
4999: 100 4 B 23 27
5000: 100 4 B 64 68
If I understand correctly, the requested result requires a cross join where each element of x is combined with each element of y (within each group).
This can be accomplished easily using the CJ() function:
DT[, CJ(x, y, sorted = FALSE), by = group][, sum_x_y := x + y][]
group x y sum_x_y
1: A 68 76 144
2: A 68 39 107
3: A 68 24 92
4: A 68 53 121
5: A 68 92 160
---
4996: B 4 21 25
4997: B 4 62 66
4998: B 4 79 83
4999: B 4 23 27
5000: B 4 64 68
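Since the question says that only the lowest sum per x is needed in the end, the cross join may be avoidable: the minimum of x + y over all y in a group is simply x plus that group's smallest y. A base R sketch on a data.frame copy of the example data (df and min_sum are names chosen here; in data.table the same idea would presumably be DT[, min_sum := x + min(y), by = group]):

```r
set.seed(1)
df <- data.frame(ind = 1:100, x = sample(100), y = sample(100),
                 group = rep(c("A", "B"), each = 50))

# ave() returns, for every row, the minimum y within that row's group
df$min_sum <- df$x + ave(df$y, df$group, FUN = min)

# brute-force check for the first row: smallest of all x_1 + y_j in its group
brute <- min(df$x[1] + df$y[df$group == df$group[1]])
stopifnot(df$min_sum[1] == brute)

head(df)
```

This keeps the result at one row per x instead of N rows per x, which matters for the larger dataset mentioned in the question.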
I need to extract blood pressure values from text notes. They are typically reported as a larger number, a "/", and a smaller number, with the units mm Hg (it is not a fraction, only written like one). In the 4 examples below, I want to extract only 114/46, 135/67, 109/50 and 188/98, without spaces before or after, and place the top number in a column called SBP and the bottom number in a column called DBP.
Thank you in advance for your assistance.
bb <- c("PATIENT/TEST INFORMATION (m2): 1.61 m2\n BP (mm Hg): 114/46 HR 60 (bpm)", "PATIENT/TEST INFORMATION:\n 63\n Weight (lb): 100\nBSA (m2): 1.44 m2\nBP (mm Hg): 135/67 HR 75 (bpm)", "PATIENT/TEST INFORMATION:\nIndication: Coronary artery disease. Hypertension. Myocardial infarction.\nWeight (lb): 146\nBP (mm Hg): 109/50 HR (bpm)", "PATIENT/TEST INFORMATION:\nIndication: Aortic stenosis. Congestive heart failure. Shortness of breath.\nHeight: (in) 64\nWeight (lb): 165\nBSA (m2): 1.80 m2\nBP (mm Hg): 188/98 HR 140 (bpm) ")
BP <- head(bb,4)
dput(bb)
Base R solution:
setNames(data.frame(do.call("rbind", strsplit(trimws(gsub("[[:alpha:]]|[[:punct:]][^0-9]+", "",
gsub("HR.*", "", paste0("BP", lapply(strsplit(bb, "BP"), '[', 2)))), "both"), "/"))),
c("SBP", "DBP"))
We can use regmatches/regexpr from base R to extract the required values, and then with read.table, create a two column data.frame
read.table(text = regmatches(bb, regexpr('\\d+/\\d+', bb)),
sep="/", header = FALSE, stringsAsFactors = FALSE)
# V1 V2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
Or using strcapture from base R
strcapture( "(\\d+)\\/(\\d+)", bb, data.frame(X1 = integer(), X2 = integer()))
# X1 X2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
To add these as new columns in the original data.frame, either use cbind to bind the output to the original dataset
cbind(data, read.table(text = ...))
Or
data[c("V1", "V2")] <- read.table(text = ...)
Or using extract from tidyr
library(dplyr)
library(tidyr)
tibble(bb) %>%
extract(bb, into = c("X1", "X2"), ".*\\b(\\d+)/(\\d+).*", convert = TRUE)
# A tibble: 4 x 2
# X1 X2
# <int> <int>
#1 114 46
#2 135 67
#3 109 50
#4 188 98
If we don't want to remove the original column, use remove = FALSE in extract
You could use str_match and select the numbers that have a / in between
as.data.frame(stringr::str_match(bb, "(\\d+)/(\\d+)")[, 2:3])
# X1 X2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
In base R, we can extract the numbers that follow the pattern a/b, split them on '/' and form two columns.
as.data.frame(do.call(rbind, strsplit(sub(".*?(\\d+/\\d+).*", "\\1", bb), "/")))
You can give them the column names as per your choice using setNames or any other method.
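As a sketch of the naming step, strcapture() can produce the SBP and DBP columns directly as integers. The notes below are shortened, hypothetical versions of the question's strings, and anchoring the pattern on the "BP (mm Hg):" label is an extra assumption, meant to avoid picking up some other number/number pair in the note:

```r
# Shortened, made-up notes in the same layout as the question's bb vector
notes <- c("PATIENT/TEST INFORMATION:\nBP (mm Hg): 114/46 HR 60 (bpm)",
           "BSA (m2): 1.44 m2\nBP (mm Hg): 135/67 HR 75 (bpm)")

# The prototype data.frame sets both the column names and the column types
bp <- strcapture("BP \\(mm Hg\\): *(\\d+)/(\\d+)", notes,
                 proto = data.frame(SBP = integer(), DBP = integer()))
print(bp)
```

Notes without a matching reading come back as NA in both columns, so the result stays aligned with the input rows.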
I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). The problem I have is as follows. My data looks like this (it is held in CSV):
RecNo Var1 Var2 Var3
41 800 201.8 Y
43 140 39 N
47 60 20.24 N
49 687 77 Y
54 570 135 Y
58 1250 467 N
61 211 52 N
64 96 117.3 N
68 687 77 Y
Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.
I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
sweep(x, 2, apply(x, 2, max), "/")
}
However, when I apply it to my above data by calling
myResult <- normalize(myData)
it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.
So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?
I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine:
1. akrun's suggestion above:
myData2 <- myData1 %>% mutate_at(2:3, funs((.-min(.))/max(.-min(.))))
This produces the following:
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
2. Alternatively, there is the package BBmisc, which allowed me the following after converting my record numbers to factors:
> myData <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData, method="range", range = c(0,1), margin = 1)
> myNorm
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
EDIT: For completeness I include TUSHAr's solution as well, showing as always that there are many ways around a single problem:
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)
Thank you for your help!
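The original question asked for a function that takes the names of the columns to transform and returns the full dataset. A minimal base R sketch of that idea follows; normalize_cols and its cols argument are hypothetical names, not part of any package:

```r
# Rescale only the named columns to [0, 1]; all other columns pass through
normalize_cols <- function(df, cols) {
  df[cols] <- lapply(df[cols], function(v) {
    rng <- range(v, na.rm = TRUE)
    (v - rng[1]) / (rng[2] - rng[1])
  })
  df
}

myData <- data.frame(
  RecNo = c(41, 43, 47, 49, 54, 58, 61, 64, 68),
  Var1  = c(800, 140, 60, 687, 570, 1250, 211, 96, 687),
  Var2  = c(201.8, 39, 20.24, 77, 135, 467, 52, 117.3, 77),
  Var3  = c("Y", "N", "N", "Y", "Y", "N", "N", "N", "Y")
)

res <- normalize_cols(myData, c("Var1", "Var2"))
print(res)
```

A column in which every value is identical would divide by zero here, so that case would need separate handling.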
In R I'm trying to figure out how to select multiple values from a predefined vector of sequences (e.g. indices = c(1:3, 4:6, 10:12, ...)). In other words, if I want a new vector with the 3rd, 5th, and 7th entries in "indices", what syntax should I use to get back a vector with just those sequences intact, e.g. c(10:12, ...)?
If I understand correctly, you want the 3rd, 5th, and 7th entries in c(1:3, 4:6, 10:12, ...), which means you want to extract specific sets of indices from a vector.
When you do something like c(1:3, 4:6, ...), the resulting vector isn't what it sounds like you want. Instead, use list(1:3, 4:6, ...). Then you can do this:
indices <- list(1:3, 4:6, 10:12, 14:16, 18:20)
x <- rnorm(100)
x[c(indices[[3]], indices[[5]])]
This is equivalent to:
x[c(10:12, 18:20)]
That is in turn equivalent to:
x[c(10, 11, 12, 18, 19, 20)]
Please let me know if I've misinterpreted your question.
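To pick several sequences at once, such as the 3rd, 5th and 7th, the list can be subset with single brackets and flattened with unlist(). A short sketch; the sixth and seventh sequences and the vector x are made up for illustration:

```r
# Each list element is one intact sequence of positions
indices <- list(1:3, 4:6, 10:12, 14:16, 18:20, 22:24, 26:28)

# Single brackets keep a list of the chosen sequences; unlist() flattens them
picked <- unlist(indices[c(3, 5, 7)])
picked

x <- 101:200   # hypothetical data vector
x[picked]
```

Single brackets return a sub-list, whereas double brackets return one element, which is why indices[c(3, 5, 7)] works here but indices[[c(3, 5, 7)]] would not.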
What you are looking for is how to subset data. Most commonly it is done using square bracket notation:
sample data:
my_vector <- c(100:120)
my_vector
# 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
values you want taken out:
indices <- c(1:3, 4:6, 10:12)
indices
# 1 2 3 4 5 6 10 11 12
subsetting using bracket notation
my_vector[indices]
# 100 101 102 103 104 105 109 110 111
There is also a function called subset that can do this as well.
In order to calculate the highest contribution of a row per ID I have a beautiful script which works when the IDs are numeric. Today, however, I found out that IDs can also contain characters (for instance ABC10101). For the function to work, the dataset is converted to a matrix. However, data.matrix(df) does not support characters. Can the code be altered so that the function works with all kinds of IDs (character, numeric, etc.)? Currently I have a quick workaround which converts the IDs to numeric when they are character, but that will slow the process down for large datasets.
Example with code (function: extract the first entry with the highest contribution, so if 2 entries have the same contribution it selects the first):
Note: in this example ID is interpreted as a factor and data.matrix() converts it to a numeric value. In the code below the type of the ID column should be character and the output should be as shown at the bottom. Order IDs must remain the same.
tc <- textConnection('
ID contribution uniqID
ABCUD022221 40 101
ABCUD022221 40 102
ABCUD022222 20 103
ABCUD022222 10 104
ABCUD022222 90 105
ABCUD022223 75 106
ABCUD022223 15 107
ABCUD022223 10 108 ')
df <- read.table(tc,header=TRUE)
#Function that needs to be altered
uniqueMaxContr <- function(m, ID = 1, contribution = 2) {
t(
vapply(
split(1:nrow(m), m[,ID]),
function(i, x, contribution) x[i, , drop=FALSE]
[which.max(x[i,contribution]),], m[1,], x=m, contribution=contribution
)
)
}
df<-data.matrix(df) #only works when ID is numeric
highestdf<-uniqueMaxContr(df)
highestdf<-as.data.frame(highestdf)
In this case the outcome should be:
ID contribution uniqID
ABCUD022221 40 101
ABCUD022222 90 105
ABCUD022223 75 106
Others might be able to make it more concise, but this is my attempt at a data.table solution:
tc <- textConnection('
ID contribution uniqID
ABCUD022221 40 101
ABCUD022221 40 102
ABCUD022222 20 103
ABCUD022222 10 104
ABCUD022222 90 105
ABCUD022223 75 106
ABCUD022223 15 107
ABCUD022223 10 108 ')
df <- read.table(tc,header=TRUE)
library(data.table)
dt <- as.data.table(df)
setkey(dt,uniqID)
dt2 <- dt[,list(contribution=max(contribution)),by=ID]
setkeyv(dt2,c("ID","contribution"))
setkeyv(dt,c("ID","contribution"))
dt[dt2,mult="first"]
## ID contribution uniqID
## [1,] ABCUD022221 40 101
## [2,] ABCUD022222 90 105
## [3,] ABCUD022223 75 106
EDIT -- more concise solution
You can use .SD which is the subset of the data.table for the grouping, and then use which.max to extract a single row.
in one line
dt[,.SD[which.max(contribution)],by=ID]
## ID contribution uniqID
## [1,] ABCUD022221 40 101
## [2,] ABCUD022222 90 105
## [3,] ABCUD022223 75 106
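The same "first row with the highest contribution per ID" logic can also be sketched in base R without converting to a matrix, so character IDs stay as they are. This is not the original uniqueMaxContr function; the names groups, keep and highest are chosen here for illustration:

```r
df <- data.frame(
  ID = c("ABCUD022221", "ABCUD022221", "ABCUD022222", "ABCUD022222",
         "ABCUD022222", "ABCUD022223", "ABCUD022223", "ABCUD022223"),
  contribution = c(40, 40, 20, 10, 90, 75, 15, 10),
  uniqID = 101:108,
  stringsAsFactors = FALSE
)

# Row numbers per ID, with IDs kept in order of first appearance
groups <- split(seq_len(nrow(df)), factor(df$ID, levels = unique(df$ID)))

# which.max() returns the first maximum, so ties resolve to the earliest row
keep <- vapply(groups, function(i) i[which.max(df$contribution[i])], integer(1))

highest <- df[keep, ]
print(highest)
```

Setting the factor levels to unique(df$ID) is what preserves the original ID order; a plain split on the character column would sort the IDs alphabetically.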