I have searched long and hard for a solution using apply but I am unable to find exactly what I need. I'm a new R user coming over from Excel and need to calculate the percent difference between an observation and a control. A realistic sample data frame looks like this:
site <- c(rep(1, 10), rep(2, 10), rep(3, 10))
element <- rep(c("ca", "Mg", "K"), 10)
control <- seq(from = 1, to = 60, by = 2)
BA01 <- seq(from = 31, to = 90, by = 2)
BA02 <- seq(from = 21, to = 80, by = 2)
BA03 <- seq(from = 101, to = 160, by = 2)
mydf <- data.frame(site, element, control, BA01, BA02, BA03)
where BA01 to BA03 are different tests that will be compared to the control.
All I would like to do is apply a formula like this:
((BA01-control)/control)*100
and have it calculated for every test column (BA01-BA03) and every row in the data frame. In Excel I could just copy and paste the site and element columns plus the headers BA01-BA03, then type the formula in cell C2 and drag it to the right as far as needed, then down as far as needed, and have my results. In R I'm having difficulty getting the same results. I've tried apply already but cannot get it to work. Basically, I would like to have site and element as columns 1 and 2, followed by the results of the formula, with BA01, BA02 and BA03 as the column names. It probably wouldn't make a difference, but my real data frame will have upwards of 130 columns and several thousand rows.
Does anyone have some tips for me?
Thank you very much in advance for your help.
Dan
If I understand correctly:
cbind(mydf[1:2], sapply(mydf[-(1:3)], function(x) 100 * (x - mydf[[3]]) / mydf[[3]]))
site element BA01 BA02 BA03
1 1 ca 3000.00000 2000.00000 10000.0000
2 1 Mg 1000.00000 666.66667 3333.3333
3 1 K 600.00000 400.00000 2000.0000
4 1 ca 428.57143 285.71429 1428.5714
5 1 Mg 333.33333 222.22222 1111.1111
...
Try this:
cbind(mydf[1:2], 100 * mydf[4:6] / mydf$control - 100)
This works because 100 * (x - control) / control is algebraically 100 * x / control - 100, and mydf$control is recycled across each of the selected columns. The first 5 lines of output are:
site element BA01 BA02 BA03
1 1 ca 3000.00000 2000.00000 10000.0000
2 1 Mg 1000.00000 666.66667 3333.3333
3 1 K 600.00000 400.00000 2000.0000
4 1 ca 428.57143 285.71429 1428.5714
5 1 Mg 333.33333 222.22222 1111.1111
How about:
pdiff <- function(x, y) (x - y) / y * 100
BAcols <- subset(mydf, select = c(BA01, BA02, BA03))
This subset() call is readable for a small data frame, but if you really have lots of columns to normalize you will want to select them by numeric range instead, i.e. mydf[,-(1:3)] (drop the first three columns) or mydf[,4:ncol(mydf)] (keep columns 4 through the end).
cbind(mydf[,1:2], sweep(BAcols, 1, mydf$control, pdiff))
or
with(mydf, data.frame(site, element, sweep(BAcols, 1, control, pdiff)))
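As a quick sanity check of what sweep() does here (this snippet is my addition, reusing the sample mydf from the question): MARGIN = 1 pairs each row of BAcols with the corresponding element of control before calling pdiff.
sweep(BAcols, 1, mydf$control, pdiff)[1, ]
#   BA01 BA02  BA03
# 1 3000 2000 10000   <- matches the first row of the outputs above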
I have certain data in a list, extracted from Bayesian processing of certain electrodes, and I want to populate a data frame from a loop. I have a list of 729 processing outcomes and an object elecs, which is basically a list of 729 pairs of electrodes (27*27), as you can see:
> head(elecs)
X Elec1 Elec2
1 1 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 5 1 5
6 6 1 6
The thing is, I would like to fill dataf1 with the outcome of this loop, where each iteration produces a data frame of 4000 rows.
dataf1 <- data.frame('Elec1'=rep(NA,4000*729),'Elec2'=rep(NA,4000*729),'int'=rep(NA,4000*729))
for (i in nrow(elecs)){
  Elec1 <- as.data.frame(rep(elecs[i,]$Elec1, 4000))
  Elec2 <- as.data.frame(rep(elecs[i,]$Elec2, 4000))
  post <- posterior_samples(bayeslist[[i]])
  int <- as.data.frame(post$b_Intercept)
  df <- cbind(Elec1, Elec2, int)
  colnames(df) <- c('Elec1', 'Elec2', 'int')
  dataf1[(1+(i-1)*4000):((1+(i-1)*4000)+3999), c('Elec1','Elec2','int')] <- df
}
Everything works perfectly fine until the last line in the loop:
dataf1[(1+(i-1)*4000):((1+(i-1)*4000)+3999),c('Elec1','Elec2','int')] <- df
And I don't know exactly why this is not working as expected and populating the preinitialised dataf1 data frame.
Any insight, as always, will be highly appreciated.
I realised I was missing the sequence in the for header, so it's kind of a newbie typo. Apart from that, the code works, in case anyone is wondering. The offending line and its fix:
for (i in nrow(elecs)){    # wrong: nrow(elecs) is a single number, so the body runs exactly once
for (i in 1:nrow(elecs)){  # correct: iterates over every row of elecs
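A slightly safer idiom (my suggestion, not from the original thread) is seq_len(), which also behaves correctly if the table ever has zero rows:
for (i in seq_len(nrow(elecs))){
  # ... same loop body as above ...
}
# 1:nrow(elecs) evaluates to c(1, 0) when nrow(elecs) == 0, so the body would run twice;
# seq_len(0) returns integer(0), so the body is skipped entirely.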
I have a large dataset, and I'm trying to drop some of my variables based on how many observations each has. For instance, I would like to drop any variable in my data frame where n < 3 (the total number of observations for that variable is less than 3). Since R can count observations for each variable using describe(), can't I use that number to subset the data instead of having to type in each variable name every time I pull in a new version? (Each version has different variables with low n's, and there are over 40 variables.) Thanks so much for your help!
For instance, my data looks like this:
ID Runaway Aggressive Emergency Hospitalization Injury
1 3 NA 4 1 NA
2 NA NA 2 1 NA
3 4 NA 6 2 3
4 1 NA 1 1 NA
I want to be able to drop "Aggressive" and "Injury" based on their n's being 0 and 1, respectively. However, instead of telling R to drop them by variable name, it would be much more convenient to tell R to drop any variable where n < 3 (or whatever number I choose), as I'll be using this code for multiple versions of this dataset. I have tried using column numbers (which is better than writing them out), but it's still pretty tedious to describe() the data, figure out which variables have low n's, and then drop 28 variables or subset() around them.
This works but it's cumbersome...
UIRCorrelation <- UIRKidUnique61[c(28, 30, 32, 34:38, 42, 54:74)]
For some reason, my example looks different when I'm editing versus when I save, so I also included an image of it. Sorry, this is the first time I've ever used Stack Overflow to ask a question. I actually spent a lot of time googling this but couldn't find an answer relating to n.
DF being your dataframe:
DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
That line did not work for me, however.
This function did the trick:
valid <- function(x) { sum(!is.na(x)) }   # count the non-NA observations in a column
N <- apply(UIRCorrelation, 2, valid)      # apply it column-wise
UIRCorrelation2 <- UIRCorrelation[N > 3]  # keep only columns with n > 3
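For reference, the same filter fits in one line with colSums() (a sketch, assuming the same UIRCorrelation data frame):
# keep columns with more than 3 non-NA observations
UIRCorrelation2 <- UIRCorrelation[, colSums(!is.na(UIRCorrelation)) > 3]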
I am trying to include a mean calculation as part of a larger code. The idea is to calculate the mean from a series of values within a column, but not all the column.
For example, from column_x (10 entries) in yFile, calculate the mean of the last 4 values:
column_x
1
5
8
3
0
3
3
7
9
9
Result = 7
This is what I've got:
avg_subx <- mean(yFile$column_x, 7:10, trim = 0, na.rm = FALSE)
But for some reason, the result I am getting back is not the correct value.
Could you help me find out where I'm going wrong?
Thanks!
Have you tried the tail() function? With tail() you can select the last n values of a data frame or a vector.
Example:
avg_subx <- mean(tail(yFile$column_x, 4))
In this case you're selecting the last 4 values.
Hope this can help you!
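For what it's worth, the original call most likely failed because the extra 7:10 argument is absorbed by mean()'s ... argument and silently ignored (trim and na.rm were both passed by name), so the whole column gets averaged. An explicit-index version (a sketch, assuming column_x always has exactly 10 entries):
avg_subx <- mean(yFile$column_x[7:10])   # index positions 7 through 10 directly
# tail(yFile$column_x, 4) is more robust, since it doesn't depend on the column length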
I am new to Stackoverflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R script that allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I run into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
1 A 1
1 B 2
2 A 3
3 A 2
3 B 3
3 C 1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
A B C
1 1 2 NA
2 3 NA NA
3 2 3 1
My approach
First I took out the names of the samples and detectors and saved them in character vectors:
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
Then I subsetted DF into a new data frame for each detector, named after the detector:
for (i in 1:length(detectors)){
  assign(detectors[i], DF[which(DF$Detector == detectors[i]),])
}
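As an aside (my addition, not part of the original question): split(DF, DF$Detector) would produce the same subsets as a named list in one call, avoiding the assign()/get() pattern:
detector_list <- split(DF, DF$Detector)   # e.g. detector_list$A, detector_list$B, detector_list$C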
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the problem: I have to get the values from the detector subsets into the result data frame, and each value has to find its way to the right position. The issue is that the subsets differ in length, since some samples lack some detectors.
I tried the following: iterate through the detector subsets, compare the sample name against the row name of result, and if they match, write the value into the new data frame; if they do not match, write an NA.
for (i in 1:length(detectors)){
  for (j in 1:length(get(detectors[i])$Sample)){
    result[j,i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j,]), get(detectors[i])$Value[j], NA)
  }
}
The trouble is that this stops iterating through the detector's Sample column and switches to the next detector. My understanding is that the sample comparisons get out of sync, so all following ifelse calls yield NA.
I tried to circumvent this by replacing the no argument of ifelse(test, yes, no) with j = j + 1 to get things back in sync, but unfortunately that didn't work.
I hope I could make my problem understandable to you!
Looking forward to hearing any suggestions or comments (also on how to improve my code in general ;)
We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column.
library(tidyr)
spread(DF, Detector, Value)
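For reference, the dcast() version mentioned above would look like this (a sketch, reusing the example DF):
library(reshape2)
dcast(DF, Sample ~ Detector, value.var = 'Value')
#   Sample  A  B  C
# 1      1  1  2 NA
# 2      2  3 NA NA
# 3      3  2  3  1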
I have a smallish (2k) data set that contains questionnaire answers filled out by students who were sampled twice a year. Not all the students who were present for the first wave were there for the second wave, and vice versa. For each student, a unique ID was created that consists of the school code, the class code, the student number, and the wave as a decimal. For example, 100612.1 is a student from school 10, grade 6, number 12 on the names list, in the first wave. The idea behind the decimal was to have a way to identify the same student again in the data set (the only value that differs by less than abs(1) from a given ID is the same student in the other wave). At least that was the idea.
I was thinking of a script that would do the following:
- find the rows whose unique IDs are less than abs(1) apart from one another
- for those rows, generate a new row (in a new table) consisting of the student ID and the delta of the measured variables (i.e. the value in wave 2 minus the value in wave 1).
I am new to R but have a tiny bit of background in other OOP languages. I thought about creating a for loop that runs from 1 to length(df) and just looks for each row's "brother". My gut feeling tells me that this is not the way things are done in R. Any ideas?
All I need is a quick way of sifting through the data looking for the second-wave row. I think the rest should be straightforward from there.
Thank you for helping.
PS: Since this is my first post here, I apologize beforehand for any wrongdoings in this post... :)
The question alludes to data.table, so here is a way to adapt @jed's answer using that package.
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
Example data as before; now, instead of data.frame and tapply, you can do this:
library(data.table)
surveyDT <- data.table(ids, answers)
surveyDT[, `:=` (child = substr(ids, 1, 6), wave = substr(ids, 8, 8))] # split ID's
# note multiple assign-by-reference := syntax above
setkey(surveyDT, child, wave) # order data
# calculate delta on keyed data, grouping by child
surveyDT[, delta := diff(answers), by = child]
unique(surveyDT[, delta, by = child]) # list results
child delta
1: 100612 -1
2: 100613 1
3: 110714 NA
4: 201802 NA
To remove rows with NA values for delta:
unique(surveyDT[, .SD[(!is.na(delta))], by = child])
child ids answers wave delta
1: 100612 100612.1 5 1 -1
2: 100613 100613.1 3 1 1
Use .SDcols to output only specific columns (in addition to the by columns), for example,
unique(surveyDT[, .SD[(!is.na(delta))], by = child, .SDcols = 'delta'])
child delta
1: 100612 -1
2: 100613 1
It took me some time to get acquainted with data.table syntax, but now I find it more intuitive, and it's fast for big data.
There are two ways that come to mind. The easiest is to use the function floor(), which returns the integer part of a number. For example:
floor(100612.1)
#[1] 100612
floor(9.9)
#[1] 9
Alternatively, you could write a fairly simple regular expression to get rid of the decimal part too. Then you can use unique() to find the rows that are or are not duplicated entries.
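For instance, a sketch of the regex approach (the exact pattern is my assumption, not spelled out in the original answer):
ids_chr <- sub("\\.[0-9]+$", "", as.character(ids))   # "100612.1" -> "100612"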
Let's make some fake data so we can see our problem easily:
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
survey <- data.frame(ids,answers)
Now let's split our ids into two different columns:
survey$child_id <- substr(survey$ids,1,6)
survey$wave_id <- substr(survey$ids,8,8)
Then we'll order by child and wave, and compute differences based on child:
survey <- survey[order(survey$child_id, survey$wave_id),]   # assign the ordered rows back, so tapply lines up
survey$delta <- unlist(tapply(survey$answers, survey$child_id, function(x) c(NA, diff(x))))
Output:
ids answers child_id wave_id delta
1 100612.1 5 100612 1 NA
2 100612.2 4 100612 2 -1
3 100613.1 3 100613 1 NA
4 100613.2 4 100613 2 1
5 110714.1 1 110714 1 NA
6 201802.2 0 201802 2 NA
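To then keep only the rows that actually carry a delta (a sketch, mirroring the NA-filtering step in the data.table answer above):
survey[!is.na(survey$delta), c("child_id", "delta")]
#   child_id delta
# 2   100612    -1
# 4   100613     1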