Find the 2 max values for each factor in R

I have a question about finding the rows with the two largest values of one column (weight) for each unique ID, then calculating the mean of another column (layer) over those rows. A sample of my data is here:
ID layer weight
1 0.6843629 0.35
1 0.6360772 0.70
1 0.6392318 0.14
2 0.3848640 0.05
2 0.3882660 0.30
2 0.3877026 0.10
2 0.3964194 0.60
2 0.4273218 0.02
2 0.3869507 0.12
3 0.4748541 0.07
3 0.5853659 0.42
3 0.5383678 0.10
3 0.6060287 0.60
4 0.4859274 0.08
4 0.4720740 0.48
4 0.5126481 0.08
4 0.5280899 0.48
5 0.7492097 0.07
5 0.7220433 0.35
5 0.8750000 0.10
5 0.8302752 0.50
6 0.4306283 0.10
6 0.4890895 0.25
6 0.3790714 0.20
6 0.5139686 0.50
6 0.3885678 0.02
6 0.4706815 0.05
For each ID, I want to calculate the mean value of layer, using only the rows with the two highest weights.
I can do this with the following code in R:
library(plyr)
library(data.table)
ind.max1 <- ddply(index1, "ID", function(x) x[which.max(x$weight),])
dt1 <- data.table(index1, key=c("layer"))
dt2 <- data.table(ind.max1, key=c("layer"))
index2 <- dt1[!dt2]
ind.max2 <- ddply(index2, "ID", function(x) x[which.max(x$weight),])
ind.max.all <- merge(ind.max1, ind.max2, all=TRUE)
ind.ndvi.mean <- as.data.frame(tapply(ind.max.all$layer, list(ind.max.all$ID), mean))
This uses ddply to select the row with the highest weight per ID and puts it into a data frame together with layer. It then removes these rows from the original data frame using data.table. I repeat the ddply step to select the next highest weight per ID, merge the two resulting data frames into one, and finally compute the mean with tapply.
There must be a more efficient way to do this. Does anyone have any insight? Cheers.

You could use data.table
library(data.table)
setDT(dat)[, .(Meanlayer = mean(layer[order(-weight)[1:2]])), by=ID]
# ID Meanlayer
#1: 1 0.6602200
#2: 2 0.3923427
#3: 3 0.5956973
#4: 4 0.5000819
#5: 5 0.7761593
#6: 6 0.5015291
How it works, for each ID group:
1. order(-weight) returns the row positions sorted by descending weight.
2. [1:2] keeps the first two positions, i.e. the rows with the two highest weights.
3. layer[...] extracts the layer values at those positions.
4. mean() averages them.
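The steps above can be traced on a single group in base R. This is only a sketch for the ID 1 rows of the sample data; the names idx and top2 are illustrative and not part of the answer.

```r
# Rows for ID 1 from the sample data in the question
layer  <- c(0.6843629, 0.6360772, 0.6392318)
weight <- c(0.35, 0.70, 0.14)

idx  <- order(-weight)  # row positions sorted by descending weight: 2 1 3
top2 <- idx[1:2]        # positions of the two largest weights: 2 1
layer[top2]             # matching layer values: 0.6360772 0.6843629
mean(layer[top2])       # about 0.66022, the value reported for ID 1
```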
Alternatively, data.table 1.9.4 and later export a function setorder for reordering data.tables in any order, by reference:
require(data.table) ## 1.9.4+
setorder(setDT(dat), ID, -weight) ## dat is now reordered as we require
dat[, mean(layer[1:min(.N, 2L)]), by=ID]
By ordering first, we avoid calling order() once per group (per unique value in ID), which pays off as the number of groups grows. setorder() is also more efficient than order() because it reorders the data by reference instead of creating a copy.

This actually is a question for StackOverflow... anyway!
Don't know if the version below is efficient enough for you...
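# Per-ID row order by descending weight
s.ind <- tapply(df$weight, df$ID, function(x) order(x, decreasing=TRUE))
# Per-ID layer values
val <- tapply(df$layer, df$ID, function(x) x)
# Mean of the two layer values with the highest weights
foo <- function(x, y) list(x[y][1:2])
lapply(mapply(foo, val, s.ind), mean)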

I think this will do it. Assuming the data is called dat,
sapply(split(dat, dat$ID), function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
})
# 1 2 3 4 5 6
# 0.6602200 0.3923427 0.5956973 0.5000819 0.7761593 0.5015291
You'll likely want to include na.rm = TRUE as the second argument to mean to account for any rows that contain NA values.
Alternatively, mapply is probably faster, and has the exact same code just in a different order,
mapply(function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
}, split(dat, dat$ID))
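One caveat with matching on weight values via %in%: if the second-largest weight is tied, more than two rows are selected. The sketch below uses made-up numbers (not from the question) to compare value matching with position-based selection using order().

```r
# Hypothetical group in which the second-largest weight is tied
layer  <- c(1, 2, 3)
weight <- c(0.5, 0.3, 0.3)

# Value matching keeps every row whose weight equals one of the top two values
by_value <- mean(layer[weight %in% rev(sort(weight))[1:2]])  # averages 3 rows

# Position-based selection keeps exactly two rows
by_position <- mean(layer[order(-weight)[1:2]])              # averages 2 rows

by_value     # 2
by_position  # 1.5
```

For the sample data in the question both approaches agree, since no group has a tie at the second-largest weight that extends to a third row.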


create list and generate descriptives for each variable

I want to generate descriptive statistics for multiple variables at a time (close to 50), rather than writing out the code several times.
Here is a very basic example of data:
id var1 var2
1 1 3
2 2 3
3 1 4
4 2 4
I typically write out each line of code to get a frequency count and descriptives, like so:
library(psych)
table(df1$var1)
table(df1$var2)
describe(df1$var1)
describe(df1$var2)
I would like to create a list and get the output from these analyses, rather than writing out 100 lines of code. I tried this, but it is not working:
variable_list<-list(df1$var, df2$var)
for (variable in variable_list){
table(df$variable_list))
describe(df$variable_list))}
Does anyone have advice on getting this to work?
describe() from psych accepts a data.frame and returns the descriptive statistics for each column:
library(psych)
describe(df1)
# vars n mean sd median trimmed mad min max range skew kurtosis se
#id 1 4 2.5 1.29 2.5 2.5 1.48 1 4 3 0 -2.08 0.65
#var1 2 4 1.5 0.58 1.5 1.5 0.74 1 2 1 0 -2.44 0.29
#var2 3 4 3.5 0.58 3.5 3.5 0.74 3 4 1 0 -2.44 0.29
For a subset of columns, select them by column index or column name before calling describe():
describe(df1[2:3])
Another option is descr from collapse
library(collapse)
descr(slt(df1, 2:3))
Or to select numeric columns
descr(num_vars(df1))
Or for factors
descr(fact_vars(df1))
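The frequency counts from the question can be produced the same way, without a loop over names. A minimal base R sketch, assuming the sample data is stored in df1 and that vars (an illustrative name) lists the columns of interest:

```r
# Sample data from the question
df1 <- data.frame(id = 1:4, var1 = c(1, 2, 1, 2), var2 = c(3, 3, 4, 4))

# Names of the columns to tabulate (extend to all ~50 variables as needed)
vars <- c("var1", "var2")

# One frequency table per column, returned as a named list
freq <- lapply(df1[vars], table)
freq$var1
```

Because df1[vars] is itself a data frame, lapply() visits one column at a time, which is what the for loop in the question was trying to do.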

How to calculate the average of experimental values of replicates of the same sample, without knowing the number of replicates ahead?

I have a csv file with a data set of experimental values for many samples, some of which have replicates. For replicates, I only want the mean value of the replicates belonging to the same sample. The problem is that the number of replicates varies: it can be 2, 3, 4, etc.
My code isn't right, because at best it would only work when the number of replicates is 2 (I am using a loop to compare each sample ID to the previous one). On top of that, it doesn't work at all: it assigns the same average value to all my samples. I think there is also a problem at the start of the loop, because when x=1, x-1=0, which doesn't correspond to any value. Could that be why the code fails?
I am a beginner in R with no courses or training; I am trying to learn it by myself, so thank you in advance for your help.
My dataset:
Expected output:
PS: in this example the number of replicates is 2. However, it varies between samples: sometimes it's 2, sometimes 3, 4, etc.
for (x in length(dat$Sample)){
if (dat$Sample[x]==dat$Sample[x-1]){
dat$Average.OD[x-1] <- mean(dat$OD[x], dat$OD[x-1])
dat$Average.OD[x] <- NA
}
}
Here is a possible solution using data.table.
# Data
library(data.table)
data <- data.frame(Sample = c('Blank','Blank','STD1','STD1'),
OD = c(0.07,0.08,0.09,0.10))
# Code
# Convert the data to a data.table by reference.
setDT(data)
# Compute the mean OD per Sample. To group by more columns (e.g. Sample and replicate), add them to by.
data[, AverageOD := mean(OD), by = "Sample"]
# Set the mean to NA on every row after the first within each Sample.
data[duplicated(data, by = "Sample"), AverageOD := NA]
# Rename AverageOD to "Average OD".
setnames(data, "AverageOD", "Average OD")
Let me know if you have any questions.
You can do this without any looping using aggregate and merge. Since you do not provide any data, I illustrate with a simple example.
## Example data
set.seed(123)
Sample = round(runif(10), 1)
OD = sample(4, 10, replace=T)
dat = data.frame(OD, Sample)
Means = aggregate(dat$Sample, list(dat$OD), mean, na.rm=T)
names(Means) = c("OD", "mean")
Means
OD mean
1 1 0.9000000
2 2 0.7000000
3 3 0.3666667
4 4 0.4000000
merge(dat, Means, "OD")
OD Sample mean
1 1 0.9 0.9000000
2 1 0.9 0.9000000
3 2 0.8 0.7000000
4 2 0.9 0.7000000
5 2 0.4 0.7000000
6 3 0.0 0.3666667
7 3 0.6 0.3666667
8 3 0.5 0.3666667
9 4 0.3 0.4000000
10 4 0.5 0.4000000
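For reference, the loop in the question fails for two reasons: for (x in length(dat$Sample)) visits only the last index, and mean(a, b) treats b as the trim argument rather than as a second value. A base R sketch that handles any number of replicates is shown below; the data is made up for illustration, and only the column names Sample, OD and Average.OD come from the question.

```r
# Hypothetical data with a varying number of replicates per sample
dat <- data.frame(
  Sample = c("Blank", "Blank", "STD1", "STD1", "STD1", "S1"),
  OD     = c(0.07, 0.08, 0.09, 0.10, 0.11, 0.20)
)

# ave() returns the group mean for every row, whatever the group size
dat$Average.OD <- ave(dat$OD, dat$Sample, FUN = mean)

# Keep the mean on the first row of each sample only
dat$Average.OD[duplicated(dat$Sample)] <- NA
dat
```

This assumes the mean should appear on the first replicate and NA on the rest, as in the loop's intent.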

Adding NA's to a vector

Let's say I have a vector of prices:
foo <- c(102.25,102.87,102.25,100.87,103.44,103.87,103.00)
I want to get the percent change from x periods ago and, say, store it into another vector that I'll call log_returns. I can't bind vectors foo and log_returns into a data.frame because the vectors are not the same length. So I want to be able to append NA's to log_returns so I can put them in a data.frame. I figured out one way to append an NA at the end of the vector:
log_returns <- append((diff(log(foo), lag = 1)),NA,after=length(foo))
But that only helps if I'm looking at percent change 1 period before. I'm looking for a way to fill in NA's no matter how many lags I throw in so that the percent change vector is equal in length to the foo vector
Any help would be much appreciated!
You could write your own wrapper around diff:
mydiff <- function(data, lag){
c(diff(data, lag = lag), rep(NA, lag))
}
mydiff(foo, 1)
[1] 0.62 -0.62 -1.38 2.57 0.43 -0.87 NA
data.frame(foo = foo, diff = mydiff(foo, 3))
foo diff
1 102.25 -1.38
2 102.87 0.57
3 102.25 1.62
4 100.87 2.13
5 103.44 NA
6 103.87 NA
7 103.00 NA
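The example above pads plain differences at the end. If the goal is log returns aligned with the later observation, the NAs can go in front instead. A minimal sketch, assuming lag is smaller than the length of the vector; lagged_log_return is a hypothetical helper name:

```r
foo <- c(102.25, 102.87, 102.25, 100.87, 103.44, 103.87, 103.00)

# Hypothetical helper: log return over `lag` periods, padded with leading NAs
# so the result has the same length as the input
lagged_log_return <- function(x, lag = 1) {
  c(rep(NA, lag), diff(log(x), lag = lag))
}

data.frame(foo = foo, log_return = lagged_log_return(foo, 3))
```

With this layout, row i holds the return from period i - lag to period i, so the first lag rows are NA.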
You can also assign NA directly to elements of an array. Suppose the numbers 1 to 10 are arranged in 5 rows and 2 columns, and the second column should become NA.
# Build a 5 x 2 x 1 array from the elements 1:10
Array_test=array(c(1:10),dim=c(5,2,1))
Array_test
Array_test[ ,2, ]=c(NA) # Set the whole 2nd column to NA
Array_test
# Similarly, to set a single element to NA,
# for example the 4th row of the 2nd column:
Array_test[4 ,2, ]=c(NA)

Replacing a string with a matched number in a column in R

I have a data frame in R with 10,000 columns and roughly 4,000 rows. The data are IDs. For example the IDs look like (rs100987, rs1803920, etc). Each rsID# has a corresponding iHS score between 0-3. I have a separate data frame where all the possible rs#'s in existence are in one column and their corresponding iHS scores are in the next column. I want to replace my 10,000 by 4,000 data frame with rsIDs to a 10,000 by 4,000 data frame with the corresponding iHS scores. How do I do this?
This is what my file looks like now:
input ID match 1 match 2 match 3 ......
rs6708 rs10089 rs100098 rs10567
rs8902 rs18079 rs234058 rs123098
rs9076 rs77890 rs445067 rs105023
This is what my iHS score file looks like (it has matching scores for every ID in the above file):
snpID iHS
rs6708 1.23
rs105023 0.92
rs234058 2.31
rs77890 0.31
I would like my output to look like
match 1 match 2 match 3
0.89 0.34 2.45
1.18 2.31 0.67
0.31 1.54 0.92
Let's consider a small example:
(dat <- data.frame(id1 = c("rs100987", "rs1803920"), id2=c("rs123", "rs456"), stringsAsFactors=FALSE))
# id1 id2
# 1 rs100987 rs123
# 2 rs1803920 rs456
(dat2 <- data.frame(id=c("rs123", "rs456", "rs100987", "rs1803920", "rs123456"),
score=5:1, stringsAsFactors=FALSE))
# id score
# 1 rs123 5
# 2 rs456 4
# 3 rs100987 3
# 4 rs1803920 2
# 5 rs123456 1
Then you can do this operation with:
apply(dat, 2, function(x) dat2$score[match(x, dat2$id)])
# id1 id2
# [1,] 3 5
# [2,] 2 4
The call to match figures out the row in dat2 corresponding to each id in your column.
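Note that apply() returns a matrix. If the result should stay a data frame with the original column names, a named lookup vector is one alternative. The sketch below reuses the small example; lookup and scores are illustrative names, and IDs missing from the score table come back as NA.

```r
dat  <- data.frame(id1 = c("rs100987", "rs1803920"),
                   id2 = c("rs123", "rs456"),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(id = c("rs123", "rs456", "rs100987", "rs1803920", "rs123456"),
                   score = 5:1, stringsAsFactors = FALSE)

# Named lookup vector: names are IDs, values are scores
lookup <- setNames(dat2$score, dat2$id)

# Replace every column; assigning into scores[] keeps the data.frame structure
scores <- dat
scores[] <- lapply(dat, function(col) unname(lookup[col]))
scores
```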

Generating equations from factors in R

I am fairly new to R, and I am trying to create a new column, which is one column minus another column. For example:
price <- c("$10.00", "$7.15", "$8.75", "12.00", "9.20")
quantity <- c(5, 6, 7, 8, 9)
price <- as.factor(price)
quantity <- as.factor(quantity)
df <- data.frame(price, quantity)
In my actual data set, all the columns imported as factors. When I try to create the new column I get this:
diff <- price - quantity
In Ops.factor(price, quantity): - not meaningful for factors
I have tried to coerce the data to numeric using as.numeric(df), as.numeric(levels(df)), as.numeric(levels(df))[df], and setting stringsAsFactors to false, but the data gets converted to NAs. Data.matrix changes the values. Is there another way to get the above equation to work? Thanks!
If you want to do arithmetic on these columns, store price as plain numbers (no quotes and no $ sign) and don't convert either column to a factor:
price <- c(10.00, 7.15, 8.75, 12.00, 9.20)
quantity <- c(5, 6, 7, 8, 9)
df <- data.frame(price, quantity)
df$diff <- price - quantity
df
price quantity diff
1 10.00 5 5.00
2 7.15 6 1.15
3 8.75 7 1.75
4 12.00 8 4.00
5 9.20 9 0.20
Try:
as.numeric(gsub("^\\$","", price))-as.numeric(as.character(quantity))
#[1] 5.00 1.15 1.75 4.00 0.20
Or from df
df$diff <- Reduce(`-`,lapply(df, function(x) as.numeric(gsub("^\\$","",x))))
df$diff
#[1] 5.00 1.15 1.75 4.00 0.20
If you're stuck with factor columns, you could add a new diff column with within() and some type coercion
within(df, {
diff <- as.numeric(gsub("[$]", "", price)) -
as.numeric(as.character(quantity))
})
# price quantity diff
# 1 $10.00 5 5.00
# 2 $7.15 6 1.15
# 3 $8.75 7 1.75
# 4 12.00 8 4.00
# 5 9.20 9 0.20
You may also consider going back and re-reading the data into R. It's simple, and will make things a little easier. Here's how you could do it and get the desired result that way.
Create a data file: This won't be necessary for you, since you can just read the original file again.
write.table(df, "df.txt")
Read the data into R, remove the $ sign, and calculate the difference:
df2 <- read.table("df.txt", stringsAsFactors = FALSE)
df2$price <- as.numeric(gsub("[$]", "", df2$price))
with(df2, { price - quantity })
# [1] 5.00 1.15 1.75 4.00 0.20
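The reason a direct as.numeric() call on a factor changes the values is that it returns the internal level codes rather than the labels. The sketch below illustrates this and wraps the cleanup in a small helper; to_number is a hypothetical name, and stripping commas as well as $ is an assumption for data exported with thousands separators.

```r
price <- factor(c("$10.00", "$7.15", "$8.75", "12.00", "9.20"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(price)

# Hypothetical helper: strip "$" and "," from the labels, then convert
to_number <- function(x) as.numeric(gsub("[$,]", "", as.character(x)))
values <- to_number(price)
values
```

The helper can be applied to every column of a data frame with lapply() when all columns were imported as factors.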
