Replacing a string with a matched number in a column in R

I have a data frame in R with 10,000 columns and roughly 4,000 rows. The data are IDs, for example rs100987, rs1803920, etc. Each rsID has a corresponding iHS score between 0 and 3. I have a separate data frame where all possible rsIDs are in one column and their corresponding iHS scores are in the next column. I want to replace my data frame of rsIDs with a data frame of the same dimensions that holds the corresponding iHS scores. How do I do this?
This is what my file looks like now:
input ID match 1 match 2 match 3 ......
rs6708 rs10089 rs100098 rs10567
rs8902 rs18079 rs234058 rs123098
rs9076 rs77890 rs445067 rs105023
This is what my iHS score file looks like (it has matching scores for every ID in the above file):
snpID iHS
rs6708 1.23
rs105023 0.92
rs234058 2.31
rs77890 0.31
I would like my output to look like
match 1 match 2 match 3
0.89 0.34 2.45
1.18 2.31 0.67
0.31 1.54 0.92

Let's consider a small example:
(dat <- data.frame(id1 = c("rs100987", "rs1803920"), id2=c("rs123", "rs456"), stringsAsFactors=FALSE))
# id1 id2
# 1 rs100987 rs123
# 2 rs1803920 rs456
(dat2 <- data.frame(id=c("rs123", "rs456", "rs100987", "rs1803920", "rs123456"),
score=5:1, stringsAsFactors=FALSE))
# id score
# 1 rs123 5
# 2 rs456 4
# 3 rs100987 3
# 4 rs1803920 2
# 5 rs123456 1
Then you can do this operation with:
apply(dat, 2, function(x) dat2$score[match(x, dat2$id)])
# id1 id2
# [1,] 3 5
# [2,] 2 4
The call to match figures out the row in dat2 corresponding to each id in your column.
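Note that apply returns a matrix rather than a data frame. If a data frame with the original column names is preferred, one possible variant (a sketch on the same toy data, where the name scores is only illustrative) is to assign the result of lapply back into a copy:

```r
# Same toy data as above.
dat <- data.frame(id1 = c("rs100987", "rs1803920"),
                  id2 = c("rs123", "rs456"),
                  stringsAsFactors = FALSE)
dat2 <- data.frame(id = c("rs123", "rs456", "rs100987", "rs1803920", "rs123456"),
                   score = 5:1, stringsAsFactors = FALSE)

# Copy dat so the shape and column names are kept, then replace every
# column with the looked-up scores. IDs missing from dat2 become NA.
scores <- dat
scores[] <- lapply(dat, function(x) dat2$score[match(x, dat2$id)])
scores
```

Because match returns NA for an ID that is not in dat2$id, checking anyNA(scores) afterwards is a quick way to spot unmatched IDs.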

Related

Dividing a number in a column dfA with a number in a row dfB, based on the column name and the row name in R?

I want to do something similar to index match match in Excel, but depending on the column name in dfA and the row name in dfB.
A subset example: dfA (my data) imported from excel and dfB is always the same (molar weight):
dfA <- data.frame(Name=c("A", "B", "C", "D"), #usually imported df from Excel
Asp=c(2,4,6,8),
Glu=c(1,2,3,4),
Arg=c(5,6,7,8))
> dfA
Name Asp Glu Arg
1 A 2 1 5
2 B 4 2 6
3 C 6 3 7
4 D 8 4 8
X <- c("Arg","Asp","Glu")
Y <- c(174,133,147)
dfB <- data.frame(X,Y)
> dfB
X Y
1 Arg 174
2 Asp 133
3 Glu 147
I would like to divide the matching numbers from dfA with dfB, meaning R would "look up" and "take" the value from dfA and divide it with the value that "matches" in dfB.
So for example take the value from sample named A under column "Arg" = 5, and divide it by the row "Arg" in dfB = 174
5 / 174 = 0.029 and make a new data frame called dfC. Looking like below:
#How R would calculate:
Name Asp Glu Arg
1 A 2/133 1/147 5/174
2 B 4/133 2/147 6/174
3 C 6/133 3/147 7/174
4 D 8/133 4/147 8/174
>dfC
Name Asp Glu Arg
1 A 0.015 0.007 0.029
2 B 0.030 0.014 0.034
3 C 0.045 0.020 0.040
4 D 0.060 0.027 0.046
I hope it makes sense :) I am really stuck and have no clear idea how I can do this easily. I can only think of some weird workarounds that take much longer than Excel. But I would like to standardize it, so I can use the R script every time I get data from the lab.
Here is a way. Match the names of dfA, excluding the first, against column dfB$X. Then divide each column of dfA[-1] by the matching element of dfB$Y. Finally, bind the result with the Name column of dfA.
i <- match(names(dfA)[-1], dfB$X)
tmp <- mapply(\(x, y) x/y, dfA[-1], dfB$Y[i])
cbind(dfA[1], tmp)
#> Name Asp Glu Arg
#> 1 A 0.01503759 0.006802721 0.02873563
#> 2 B 0.03007519 0.013605442 0.03448276
#> 3 C 0.04511278 0.020408163 0.04022989
#> 4 D 0.06015038 0.027210884 0.04597701
Created on 2022-09-12 with reprex v2.0.2
Simpler, note the backticks:
tmp <- mapply(`/`, dfA[-1], dfB$Y[i])
Even simpler, do not create the temp matrix.
cbind(dfA[1], mapply(`/`, dfA[-1], dfB$Y[i]))
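Another way to express the same column-wise division is base R's sweep. The following is only a sketch on the example data; the guard against unmatched column names is an added assumption about what you would want when a column has no molar weight in dfB:

```r
dfA <- data.frame(Name = c("A", "B", "C", "D"),
                  Asp = c(2, 4, 6, 8),
                  Glu = c(1, 2, 3, 4),
                  Arg = c(5, 6, 7, 8))
dfB <- data.frame(X = c("Arg", "Asp", "Glu"), Y = c(174, 133, 147))

# Position in dfB of each measurement column of dfA.
i <- match(names(dfA)[-1], dfB$X)
stopifnot(!anyNA(i))  # stop early if a column has no matching row in dfB

# Divide every column (MARGIN = 2) by its matching molar weight.
ratios <- sweep(as.matrix(dfA[-1]), 2, dfB$Y[i], "/")
dfC <- cbind(dfA[1], ratios)
dfC
```

The order of the columns in dfA does not matter here, since match lines the weights up by name before dividing.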

create list and generate descriptives for each variable

I want to generate descriptive statistics for multiple variables at a time (close to 50), rather than writing out the code several times.
Here is a very basic example of data:
id var1 var2
1 1 3
2 2 3
3 1 4
4 2 4
I typically write out each line of code to get a frequency count and descriptives, like so:
library(psych)
table(df1$var1)
table(df1$var2)
describe(df1$var1)
describe(df1$var2)
I would like to create a list and get the output from these analyses, rather than writing out 100 lines of code. I tried this, but it is not working:
variable_list<-list(df1$var, df2$var)
for (variable in variable_list){
table(df$variable_list))
describe(df$variable_list))}
Does anyone have advice on getting this to work?
The describe function from psych accepts a data.frame and returns the descriptive statistics for each column:
library(psych)
describe(df1)
# vars n mean sd median trimmed mad min max range skew kurtosis se
#id 1 4 2.5 1.29 2.5 2.5 1.48 1 4 3 0 -2.08 0.65
#var1 2 4 1.5 0.58 1.5 1.5 0.74 1 2 1 0 -2.44 0.29
#var2 3 4 3.5 0.58 3.5 3.5 0.74 3 4 1 0 -2.44 0.29
If you only need a subset of the columns, select them by column index or column name before calling describe:
describe(df1[2:3])
Another option is descr from collapse
library(collapse)
descr(slt(df1, 2:3))
Or to select numeric columns
descr(num_vars(df1))
Or for factors
descr(fact_vars(df1))
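For the frequency counts, one base R option is to apply table over the columns of interest with lapply. This is a sketch that rebuilds the small example as df1; the vector vars is a hypothetical name for the list of variables to summarise:

```r
# The example data from the question.
df1 <- data.frame(id = 1:4,
                  var1 = c(1, 2, 1, 2),
                  var2 = c(3, 3, 4, 4))

# Names of the variables to summarise (extend to all ~50 as needed).
vars <- c("var1", "var2")

# One frequency table per variable, returned as a named list.
freqs <- lapply(df1[vars], table)
freqs
```

Each element can then be retrieved by name, for example freqs$var1.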

Adding NA's to a vector

Let's say I have a vector of prices:
foo <- c(102.25,102.87,102.25,100.87,103.44,103.87,103.00)
I want to get the percent change from x periods ago and, say, store it into another vector that I'll call log_returns. I can't bind vectors foo and log_returns into a data.frame because the vectors are not the same length. So I want to be able to append NA's to log_returns so I can put them in a data.frame. I figured out one way to append an NA at the end of the vector:
log_returns <- append((diff(log(foo), lag = 1)),NA,after=length(foo))
But that only helps if I'm looking at percent change 1 period before. I'm looking for a way to fill in NA's no matter how many lags I throw in so that the percent change vector is equal in length to the foo vector
Any help would be much appreciated!
You could use your own modification of diff:
mydiff <- function(data, diff){
c(diff(data, lag = diff), rep(NA, diff))
}
mydiff(foo, 1)
[1] 0.62 -0.62 -1.38 2.57 0.43 -0.87 NA
data.frame(foo = foo, diff = mydiff(foo, 3))
foo diff
1 102.25 -1.38
2 102.87 0.57
3 102.25 1.62
4 100.87 2.13
5 103.44 NA
6 103.87 NA
7 103.00 NA
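The function above pads with NA at the end. If each change should instead sit on the row of the later observation (the change from lag periods ago), the NAs go at the front. A minimal sketch for log returns, where lag_log_return is a hypothetical helper name:

```r
foo <- c(102.25, 102.87, 102.25, 100.87, 103.44, 103.87, 103.00)

# Log return relative to `lag` periods ago, padded with leading NAs so
# the result has the same length as the input.
lag_log_return <- function(x, lag = 1L) {
  stopifnot(lag >= 1, lag < length(x))
  c(rep(NA_real_, lag), diff(log(x), lag = lag))
}

returns <- data.frame(foo = foo, log_return = lag_log_return(foo, 3))
returns
```

With lag = 3 the first three rows are NA, and row 4 holds log(100.87 / 102.25).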
Suppose you have an array holding the numbers 1 to 10 arranged as a matrix with 5 rows and 2 columns, and you want to set the 2nd column to NA.
First create the 5*2 array of elements 1:10:
Array_test <- array(1:10, dim = c(5, 2, 1))
Array_test
Array_test[, 2, ] <- NA # set the whole 2nd column to NA
Array_test
# Similarly, to set only one element of the matrix to NA,
# say the 4th row of the 2nd column:
Array_test[4, 2, ] <- NA

Find the 2 max values for each factor in R

I have a question about finding the two largest values of column C, for each unique ID in column A, then calculating the mean of column B. A sample of my data is here:
ID layer weight
1 0.6843629 0.35
1 0.6360772 0.70
1 0.6392318 0.14
2 0.3848640 0.05
2 0.3882660 0.30
2 0.3877026 0.10
2 0.3964194 0.60
2 0.4273218 0.02
2 0.3869507 0.12
3 0.4748541 0.07
3 0.5853659 0.42
3 0.5383678 0.10
3 0.6060287 0.60
4 0.4859274 0.08
4 0.4720740 0.48
4 0.5126481 0.08
4 0.5280899 0.48
5 0.7492097 0.07
5 0.7220433 0.35
5 0.8750000 0.10
5 0.8302752 0.50
6 0.4306283 0.10
6 0.4890895 0.25
6 0.3790714 0.20
6 0.5139686 0.50
6 0.3885678 0.02
6 0.4706815 0.05
For each ID, I want to calculate the mean value of layer, using only the rows with the two highest weights.
I can do this with the following code in R:
ind.max1 <- ddply(index1, "ID", function(x) x[which.max(x$weight),])
dt1 <- data.table(index1, key=c("layer"))
dt2 <- data.table(ind.max1, key=c("layer"))
index2 <- dt1[!dt2]
ind.max2 <- ddply(index2, "ID", function(x) x[which.max(x$weight),])
ind.max.all <- merge(ind.max1, ind.max2, all=TRUE)
ind.ndvi.mean <- as.data.frame(tapply(ind.max.all$layer, list(ind.max.all$ID), mean))
This uses ddply to select the first highest weight value per ID and put into a dataframe with layer. Then remove these highest weight values from the original dataframe using data.table. I then repeat the ddply select max value, and merge the two max weight value dataframes into one. Finally, computing mean with tapply.
There must be a more efficient way to do this. Does anyone have any insight? Cheers.
You could use data.table
library(data.table)
setDT(dat)[, list(Meanlayer = mean(layer[order(-weight)[1:2]])), by=ID]
# ID Meanlayer
#1: 1 0.6602200
#2: 2 0.3923427
#3: 3 0.5956973
#4: 4 0.5000819
#5: 5 0.7761593
#6: 6 0.5015291
Order the weight column in descending order: order(-weight)
Select the first two positions from that ordering, [1:2], within each ID group
Subset the corresponding layer values using those positions: layer[order(..)[1:2]]
Take the mean
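The steps above can be traced by hand on a single group. This sketch uses the three rows of ID 1 from the question in plain base R, with no data.table involved:

```r
# layer and weight for ID 1, copied from the sample data.
layer  <- c(0.6843629, 0.6360772, 0.6392318)
weight <- c(0.35, 0.70, 0.14)

idx  <- order(-weight)  # positions sorted by descending weight
top2 <- idx[1:2]        # positions of the two largest weights
mean(layer[top2])       # mean of layer over those two rows
```

Here idx is 2 1 3, so the mean is taken over the second and first layer values, which matches the 0.6602200 shown for ID 1.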
Alternatively, in 1.9.3 (current development version) or from the next version on, a function setorder is exported for reordering data.tables in any order, by reference:
require(data.table) ## 1.9.3+
setorder(setDT(dat), ID, -weight) ## dat is now reordered as we require
dat[, mean(layer[1:min(.N, 2L)]), by=ID]
By ordering first, we avoid the call to order() for each group (unique value in ID). This'll be more advantageous with more groups. And setorder() is much more efficient than order() as it doesn't need to create a copy of your data.
This actually is a question for StackOverflow... anyway!
Don't know if the version below is efficient enough for you...
s.ind<-tapply(df$weight,df$ID,function(x) order(x,decreasing=T))
val<-tapply(df$layer,df$ID,function(x) x)
foo<-function(x,y) list(x[y][1:2])
lapply(mapply(foo,val,s.ind),mean)
I think this will do it. Assuming the data is called dat,
> sapply(split(dat, dat$ID), function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
})
# 1 2 3 4 5 6
# 0.6602200 0.3923427 0.5956973 0.5000819 0.7761593 0.5015291
You'll likely want to include na.rm = TRUE as the second argument to mean to account for any rows that contain NA values.
Alternatively, mapply is probably faster, and has the exact same code just in a different order,
mapply(function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
}, split(dat, dat$ID))
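One caveat with the %in% approach: it selects by value, so when weights are tied it can pick up more than two rows, whereas indexing with order always returns exactly two. A small sketch with made-up values (not taken from the question's data) shows the difference:

```r
# Hypothetical group with a three-way tie on the largest weight.
weight <- c(0.5, 0.5, 0.5, 0.1)
layer  <- c(1, 2, 3, 4)

# Selecting by value keeps every row whose weight equals a top-two value.
by_value <- layer[weight %in% rev(sort(weight))[1:2]]

# Selecting by rank keeps exactly the first two positions of the ordering.
by_rank <- layer[order(-weight)[1:2]]

length(by_value)  # 3 rows selected
length(by_rank)   # 2 rows selected
```

Which behaviour is right depends on how ties should count in the analysis.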

Integrating Data

I have a large data frame as follows which is a subset of a larger data frame.
tree=data.frame(INVYR=tree$INVYR,
DIA=tree$DIA,PLOT=tree$PLOT,SPCD=tree$SPCD,
D.2=tree$D.2, BA.T=tree$BA.T)
What I am attempting to do is calculate the total BA.T per Plot per Year (plots are remeasured in subsequent years). I do this by ...
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x<- x[with(x, order(Group.1,Group.2)), ]
This gives me the data frame...
x=data.frame(Group.1,Group.2,x,PLOT)
Where Group.1 is the INVYR, Group.2 is the PLOT, and x is total BA.T per plot per year. So far this works great. Here is where my problem begins. I then want to integrate this back into my original tree data.frame. If I merge the data by plot it doesn't account for year and quadruples the data set because of the four remeasurements. I can't run an if statement because the two data sets are not of equal length. The data.frame I wish to end up with is
tree=data.frame(INVYR, DIA, PLOT, SPCD, D.2, BA.T, x)
where x is the total BA.T for the given INVYR and PLOT of that record.
Any thoughts would be greatly appreciated. Thanks.
Edit
INVYR=rbind(1982,1982,1982,1982,1982,1995,1995,1995,1995,1995,2000,2000,2000,2000,2000)
PLOT=rbind(1,1,2,2,3,1,1,2,2,3,1,1,2,2,3)
BA.T=rbind(.1,.2,.3,.4,.2,.3,.5,.8,.3,.6,.7,.2,.1,1,1.02)
tree=data.frame(INVYR,PLOT,BA.T)
head(tree)
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x$INVYR<-x$Group.1
x<- x[with(x, order(Group.1,Group.2)), ]
head(x)
One solution is to use the reshape2 package.
library(reshape2)
tree.m <- melt(data=tree,id.vars=c('INVYR','PLOT')) ## Notice the choice of the id variables: these are the keys
dcast(tree.m,formula=...~variable,fun.aggregate=sum)
INVYR PLOT BA.T
1 1982 1 0.30
2 1982 2 0.70
3 1982 3 0.20
4 1995 1 0.80
5 1995 2 1.10
6 1995 3 0.60
7 2000 1 0.90
8 2000 2 1.10
9 2000 3 1.02
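To get the totals back onto every row of tree, two base R routes are possible. This is a sketch on the edited example data: merging on both keys, or using ave to compute the group sum in place. The column name x follows the question, while totals and merged are illustrative names:

```r
INVYR <- c(1982, 1982, 1982, 1982, 1982, 1995, 1995, 1995, 1995, 1995,
           2000, 2000, 2000, 2000, 2000)
PLOT  <- c(1, 1, 2, 2, 3, 1, 1, 2, 2, 3, 1, 1, 2, 2, 3)
BA.T  <- c(.1, .2, .3, .4, .2, .3, .5, .8, .3, .6, .7, .2, .1, 1, 1.02)
tree  <- data.frame(INVYR, PLOT, BA.T)

# Route 1: aggregate, then merge on BOTH keys so rows are not duplicated.
totals <- aggregate(BA.T ~ INVYR + PLOT, data = tree, FUN = sum)
names(totals)[3] <- "x"
merged <- merge(tree, totals, by = c("INVYR", "PLOT"))

# Route 2: ave() returns the group sum aligned with the original rows,
# so no merge is needed and the row order is unchanged.
tree$x <- ave(tree$BA.T, tree$INVYR, tree$PLOT, FUN = sum)
head(tree)
```

Merging by INVYR and PLOT together is what avoids the fourfold duplication described in the question; note that merge may reorder the rows, whereas ave keeps them as they were.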
