R: Add value in new column of data frame depending on value in another column - r

I have 2 data frames in R, df1 and df2.
df1 represents in each row one subject in an experiment. It has 3 columns. The first two columns specify a combination of groups the subject is in. The third column contains the experimental result.
df2 contains values for each combination of groups that can be used for normalization. Thus, it has three columns, two for the groups and a third for the normalization constant.
Now I want to create a fourth column in df1 with the experimental results from the third column, divided by the matching normalization constant in df2. How can I accomplish this?
Here's an example:
df1 <- data.frame(c(1,1,1,1),c(1,2,1,2),c(10,11,12,13))
df2 <- data.frame(c(1,1,2,2),c(1,2,1,2),c(30,40,50,60))
names(df1)<-c("Group1","Group2","Result")
names(df2)<-c("Group1","Group2","NormalizationConstant")
As result, I need a new column in df1 with c(10/30,11/40,12/30,13/40).
My first attempt is the following code, which fails on my real data with the message "In is.na(e1) | is.na(e2) : Length of the longer object is not a multiple of the length of the shorter object". However, when I replace the comparisons ==df1[,1] and ==df1[,2] with fixed values, it works. Do df1[,1] and df1[,2] in this expression really return only the value of the column for the current row?
df1$NormalizedResult<- df1$Result / df2[df2[,1]==df1[,1] & df2[,2]==df1[,2],]$NormalizationConstant
Thanks for your help!

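If the rows of df1 and df2 were aligned one-to-one, it would be as simple as the line below. In the example they are not aligned (rows 3 and 4 of df1 belong to groups (1,1) and (1,2), while rows 3 and 4 of df2 hold the constants 50 and 60), so rows 3 and 4 of this output do not match the requested 12/30 and 13/40:
> df1$expnormed <- df1$Result/df2$NormalizationConstant
> df1
  Group1 Group2 Result expnormed
1      1      1     10 0.3333333
2      1      2     11 0.2750000
3      1      1     12 0.2400000
4      1      2     13 0.2166667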
If they were not exactly aligned you would use merge:
> dfm <-merge(df1,df2)
> dfm
  Group1 Group2 Result NormalizationConstant
1      1      1     10                    30
2      1      1     12                    30
3      1      2     11                    40
4      1      2     13                    40
> dfm$expnormed <- with(dfm, Result/NormalizationConstant)
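One caveat with merge is that it returns the rows sorted by the key columns, so the result no longer lines up with the original df1. A minimal sketch, assuming you want the new column in df1's original row order; the helper column row_id is a hypothetical name introduced only for this illustration:

```r
df1 <- data.frame(Group1 = c(1, 1, 1, 1),
                  Group2 = c(1, 2, 1, 2),
                  Result = c(10, 11, 12, 13))
df2 <- data.frame(Group1 = c(1, 1, 2, 2),
                  Group2 = c(1, 2, 1, 2),
                  NormalizationConstant = c(30, 40, 50, 60))

# row_id is a hypothetical helper column that remembers the original order
df1$row_id <- seq_len(nrow(df1))

# all.x = TRUE keeps df1 rows without a match in df2 (their constant becomes NA)
dfm <- merge(df1, df2, by = c("Group1", "Group2"), all.x = TRUE)

# restore the original order of df1, then normalize
dfm <- dfm[order(dfm$row_id), ]
dfm$NormalizedResult <- dfm$Result / dfm$NormalizationConstant

# drop the helper column and reset the row names
dfm$row_id <- NULL
rownames(dfm) <- NULL
dfm
```

Naming the key columns explicitly in by also guards against merge silently joining on any other column names the two data frames happen to share.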

A possibility:
df1$res <- df1$Result/df2$NormalizationConstant[match(do.call("paste", df1[1:2]), do.call("paste", df2[1:2]))]
  Group1 Group2 Result       res
1      1      1     10 0.3333333
2      1      2     11 0.2750000
3      1      1     12 0.4000000
4      1      2     13 0.3250000
Hth
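On why the attempt in the question misbehaves: == compares two vectors position by position (recycling the shorter one); it does not look up each row of df1 in df2. A small sketch of the difference, assuming the example data from the question. The "_" separator and the names mask, idx and normalized are arbitrary choices for this illustration:

```r
df1 <- data.frame(Group1 = c(1, 1, 1, 1),
                  Group2 = c(1, 2, 1, 2),
                  Result = c(10, 11, 12, 13))
df2 <- data.frame(Group1 = c(1, 1, 2, 2),
                  Group2 = c(1, 2, 1, 2),
                  NormalizationConstant = c(30, 40, 50, 60))

# Position-by-position comparison: row i of df2 against row i of df1
mask <- df2$Group1 == df1$Group1 & df2$Group2 == df1$Group2
mask

# With a different number of rows the shorter vector is recycled;
# R warns when the lengths are not multiples of each other
mask_short <- suppressWarnings(df2$Group1 == df1$Group1[1:3])
length(mask_short)

# Per-row lookup: build one key per row, then find each df1 key in df2
idx <- match(paste(df1$Group1, df1$Group2, sep = "_"),
             paste(df2$Group1, df2$Group2, sep = "_"))
normalized <- df1$Result / df2$NormalizationConstant[idx]
normalized
```

On the example data the original one-liner happens to give the requested numbers, but only because the two selected constants (30 and 40) are recycled in a pattern that matches df1 by coincidence.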

Related

How to create new dataframe by repeating 1 column and sequentially repeating a 2nd column of original dataframe?

I'm trying to create a new data frame by repeating the values in column 1 of the original data frame and pairing them with repeated values from column 2. However, the values should repeat in a different manner for each column. For example, values from column 1 should repeat as 1,2,3,1,2,3,1,2,3, whereas values from column 2 should repeat as 1,1,1,2,2,2,3,3,3.
#here is original df
df1<-data.frame(x=1:3, y=10:12)
#I've tried the following:
data.frame(x=df1$x,y=df1[,2])->df2
range<-1:3
data.frame(x=df1$x,y=df1$y[range,2])->df3
#I then tried this:
rep(df1$x,df1$y[1,2])->df4
#output either looks like this:
x y
1 1 10
2 2 11
3 3 12
#Or I receive an error message:
Error in df1$y[1, 2] : incorrect number of dimensions
#I expect data output to look like this:
x y
1 10
2 10
3 10
1 11
2 11
3 11
1 12
2 12
3 12
An option would be expand
library(tidyr)
expand(df1, x, y)
Or with expand.grid from base R
do.call(expand.grid, df1)
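For reference, expand.grid varies its first argument fastest, which is exactly the requested order. A minimal sketch that also spells out the same result with rep, assuming the df1 from the question; res and res_rep are names made up for this illustration:

```r
df1 <- data.frame(x = 1:3, y = 10:12)

# expand.grid: the first column cycles fastest, the second slowest
res <- expand.grid(x = df1$x, y = df1$y)

# The same thing by hand: repeat x as a whole block, repeat each y value in turn
res_rep <- data.frame(x = rep(df1$x, times = nrow(df1)),
                      y = rep(df1$y, each = nrow(df1)))
res
```

The rep version makes the two repetition patterns explicit: times repeats the whole vector (1,2,3,1,2,3,...), while each repeats every element in place (10,10,10,11,...).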

How to find first occurrence of a vector of numeric elements within a data frame column?

I have a data frame (min_set_obs) which contains two columns: the first containing numeric values, called Treatment, and the second an id column called seq:
min_set_obs
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
Let's say I have a vector of numeric values, called key:
key
[1] 1 1 1 2 2 3
I.e. a vector of three 1s, two 2s, and one 3.
How would I go about identifying which rows from my min_set_obs data frame contain the first occurrence of values from the key vector?
I'd like my output to look like this:
Treatment seq
1 29
1 23
3 60
1 6
2 41
2 44
I.e. the sixth row from min_set_obs was 'extra' (it was the fourth 1 when there should only be three 1s), so it would be removed.
I'm familiar with the %in% operator, but I don't think it can tell me the position of the first occurrence of the key vector in the first column of the min_set_obs data frame.
Thanks
Here is an option with base R, where we split 'min_set_obs' by 'Treatment' into a list, take the head of each list element using the corresponding frequency in 'key', and rbind the list elements into a single data.frame. Note that the result is ordered by 'Treatment' rather than by the original row order:
res <- do.call(rbind, Map(head, split(min_set_obs, min_set_obs$Treatment), n = table(key)))
row.names(res) <- NULL
res
# Treatment seq
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
Using dplyr, you can first count the keys with table and then take the corresponding number of top rows from each group:
library(dplyr)
m <- table(key)
min_set_obs %>% group_by(Treatment) %>% do({
# as.character(.$Treatment[1]) returns the treatment for the current group
# use coalesce to get the default number of rows (0) if the treatment doesn't exist in key
head(., coalesce(m[as.character(.$Treatment[1])], 0L))
})
# A tibble: 6 x 2
# Groups: Treatment [3]
# Treatment seq
# <int> <int>
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
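Both answers above return the rows grouped by Treatment. If the original row order matters (as in the desired output in the question), one possible base R sketch is to number the occurrences of each Treatment value and keep a row only while that number does not exceed the count in key. The names occurrence, allowed and res are made up for this illustration:

```r
min_set_obs <- data.frame(Treatment = c(1, 1, 3, 1, 2, 1, 2),
                          seq = c(29, 23, 60, 6, 41, 5, 44))
key <- c(1, 1, 1, 2, 2, 3)

# Running count of each Treatment value, in row order (1st, 2nd, 3rd ... occurrence)
occurrence <- ave(seq_len(nrow(min_set_obs)), min_set_obs$Treatment,
                  FUN = seq_along)

# How often each row's Treatment value appears in key (0 if it is absent)
allowed <- vapply(min_set_obs$Treatment,
                  function(trt) sum(key == trt), integer(1))

# Keep a row only if it is among the first 'allowed' occurrences
res <- min_set_obs[occurrence <= allowed, ]
rownames(res) <- NULL
res
```

Here the fourth 1 (seq 5) is dropped and the remaining rows stay in their original positions.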

Perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, I have a data frame that is the result of a group_by and summarize, and I want to do some further manipulation on it by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
  run group sum
1   1     a   1
2   1     b   8
3   1     c  34
4   2     a   2
5   2     b   7
6   2     c  33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
  run group sum percent
1   1     a   1    1/45
2   1     b   8    8/45
3   1     c  34   34/45
4   2     a   2    2/47
5   2     b   7    7/47
6   2     c  33   33/47
Here I wrote the fractions in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table where you get a total for each run of df :
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
  run group sum    percent
1   1     a   1 0.02222222
2   1     b   8 0.17777778
3   1     c  34 0.75555556
4   2     a   2 0.04255319
5   2     b   7 0.14893617
6   2     c  33 0.70212766
If your "run" values are 1, 2, ..., n, then this will work:
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
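Both lookups above index by the integer codes of the run factor, so they rely on the divisors being stored in the same order as the factor levels. A sketch of a value-based lookup with match, which does not depend on that order; the rows of total are deliberately reversed here to show the difference:

```r
df <- data.frame(run = as.factor(c(rep(1, 3), rep(2, 3))),
                 group = as.factor(rep(c("a", "b", "c"), 2)),
                 sum = c(1, 8, 34, 2, 7, 33))

# Same totals as in the question, but with the rows in reverse order
total <- data.frame(run = as.factor(c(2, 1)),
                    total = c(47, 45))

# match() compares the run labels, not the row positions or factor codes
df$percent <- df$sum / total$total[match(df$run, total$run)]
df
```

With the reversed table, total[df$run, 'total'] would pick 47 for run 1 and 45 for run 2, whereas match still pairs each run with its own total.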
First, merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate (the %<>% assignment pipe comes from magrittr):
library(dplyr)
library(magrittr)
df2 %<>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
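If your data is truly a matrix and not a data.frame, you can work around this too. Since $ does not work on a matrix, and a matrix without row names cannot be indexed by character, select the ID column by name and convert the sampled positions back to numbers:
dat[as.numeric(tapply(as.character(seq(nrow(dat))),dat[,"ID"],FUN=sample,1)),]
Don't be tempted to remove the as.character, as sample will give unintended results when a single number is passed to it: sample(4,1) draws from 1:4 instead of always returning 4. E.g.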
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
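An alternative to the as.character trick is to sample a position within each group rather than the values themselves. A minimal sketch, assuming the example data from the question stored in a data.frame called dat; pick_one is a hypothetical helper written for this illustration:

```r
dat <- data.frame(ID = c(1, 2, 2, 3, 4, 4),
                  Value = c(10, 5, 8, 15, 7, 9))

# pick_one (hypothetical helper): choose one element of idx by position,
# so a group with a single row index is never treated as sample(n, 1)
pick_one <- function(idx) idx[sample.int(length(idx), 1)]

# One randomly chosen row index per ID group
rows <- as.vector(tapply(seq_len(nrow(dat)), dat$ID, FUN = pick_one))
res <- dat[rows, ]
res
```

Because sample.int always draws from 1..length(idx), groups with one row simply return that row.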
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

Match dataframe rows according to two variables (Indexing)

I am essentially trying to get disorganized data into long form for linear modeling.
I have 2 data.frames "rec" and "book"
Each row in "book" needs to be pasted onto the end of several of the rows of "rec" according to two variables in the row: "MRN" and "COURSE" which match.
I have tried the following and variations thereon to no avail:
i=1
newlist=list()
colnames(newlist)=colnames(book)
for ( i in 1:dim(rec)[1]) {
mrn=as.numeric(as.vector(rec$MRN[i]));
course=as.character(rec$COURSE[i]);
get.vector<-as.vector(((as.numeric(as.vector(book$MRN))==mrn) & (as.character(book$COURSE)==course)))
newlist[i]<-book[get.vector,]
i=i+1;
}
I would appreciate any suggestions on
1) getting this to work
2) making it more elegant (or perhaps just less clumsy)
If I have been unclear in any way, I beg your pardon.
I do understand I haven't combined any data above; I think if I can generate a long-format data.frame, I can combine them all on my own.
Sounds like you need to merge the two data-frames. Try this:
merge(rec, book, by = c('MRN', 'COURSE'))
and do read the help for merge (by doing ?merge at the R console) for more options on how to merge these.
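To make that concrete, here is a small sketch with made-up data. The contents of rec and book below (including the visit and dose columns) are invented for illustration, since the question does not show the real data frames; only the MRN and COURSE column names come from the question:

```r
# Hypothetical stand-ins for the real data frames
rec <- data.frame(MRN = c(101, 101, 102, 102, 103),
                  COURSE = c("A", "A", "A", "B", "A"),
                  visit = 1:5)
book <- data.frame(MRN = c(101, 102, 102),
                   COURSE = c("A", "A", "B"),
                   dose = c(50, 60, 70))

# Each book row is attached to every rec row with the same MRN and COURSE;
# all.x = TRUE keeps rec rows that have no match in book (dose becomes NA)
long <- merge(rec, book, by = c("MRN", "COURSE"), all.x = TRUE)
long
```

Without all.x = TRUE, rows of rec that have no partner in book (MRN 103 here) would be dropped from the result.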
I've created a simple example that may help you. In my case I wanted to paste the 'value' column from df1 onto each row of df2, according to the variables x1 and x2:
df1 <- read.table(textConnection("
x1 x2 value
1 2 12
1 3 56
2 1 35
2 2 68
"),header=T)
df2 <- read.table(textConnection("
test x1 x2
1 1 2
2 1 3
3 2 1
4 2 2
5 1 2
6 1 3
7 2 1
"),header=T)
library(sqldf)
sqldf("select df2.*, df1.value from df2 join df1 using(x1,x2)")
  test x1 x2 value
1    1  1  2    12
2    2  1  3    56
3    3  2  1    35
4    4  2  2    68
5    5  1  2    12
6    6  1  3    56
7    7  2  1    35
