How is this pmin() function working when indexing a column? - r

I struggled to find a good title for this post, so apologies if this isn't the best to describe what I am asking here. Let's say I have a data frame with 50 columns, and I want to split it up by 2 columns, resulting in a list of 25 data frames.
col_one col_two col_three col_four col_five ... col_fifty
1 1 1 1 1 1
2 2 2 2 2 2
One way I've been able to solve this is through map function and the pmin() function as such:
purrr::map(seq(1, ncol(data), by = 2), ~ data[.x:pmin((.x + 1), ncol(data))])
However, the more I thought about this, the more I thought that I could simply drop the pmin() function and do the same task like this:
purrr::map(seq(1, ncol(data), by = 2), ~ data[, .x:.x+1])
This does create a list of 25 data frames, but each one contains only one column rather than two.
Could anybody explain why? Thanks!
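A likely explanation, sketched with base R only (lapply stands in for purrr::map, and the 2 x 50 data frame is made up for illustration): the colon operator binds more tightly than +, so .x:.x+1 is read as (.x:.x) + 1, which is the single number .x + 1. Parentheses restore the intended range. pmin() is not what produces the second column; it only caps the upper index at ncol(data), which matters when the number of columns is odd.

```r
# Operator precedence: ":" is evaluated before "+"
x <- 3
single <- x:x + 1      # parsed as (x:x) + 1, a single index
pair   <- x:(x + 1)    # the intended two-element range

# Hypothetical data frame with 2 rows and 50 columns (V1 ... V50)
data <- as.data.frame(matrix(1:100, nrow = 2))

# With parentheses, each list element holds two columns
pairs <- lapply(seq(1, ncol(data), by = 2),
                function(i) data[, i:(i + 1)])

length(pairs)        # number of data frames in the list
ncol(pairs[[1]])     # columns in the first data frame
```

With an even number of columns the parenthesised version is enough; the pmin() call in the first version only guards the last chunk when ncol(data) is odd.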

Related

Apply autocorrelation function acf() to elements of set of vectors by group in a data frame

I have a data frame DF which looks like this:
ID Area time
1 1 182.685 1
2 2 182.714 1
3 3 182.275 1
4 4 211.928 1
5 5 218.804 1
6 6 183.445 1
...
1 1 184.334 2
2 2 196.765 2
3 3 186.435 2
4 4 213.322 2
5 5 214.766 2
6 6 172.667 2
... and so on, up to ID = 6. I want to apply an autocorrelation function to each ID, i.e. compare ID = 1 at time 1 with ID = 1 at time 2, and so on.
What is the most straightforward way to apply e.g. acf() to my data frame?
When I try to use
autocorr = aggregate(x = DF$Area, by = list(DF$ID), FUN = acf)
I get a weird object.
Thanks in advance!
"I want to apply an autocorrelation function on each ID"
OK, good, so you don't want any cross-correlation, which makes things much easier.
"I get a weird object"
acf returns a bunch of things, i.e., it returns a list of things. I think you will be only interested in ACF values, so you need:
FUN = function (u) c(acf(u, plot = FALSE)$acf)
Also, using aggregate is not a good idea. You may want split and sapply:
## so your data frame is called `x`
oo <- sapply(split(x$Area, x$ID), FUN = function (u) c(acf(u, plot = FALSE)$acf) )
If you have balanced data, i.e., if you have equal number of observations for each ID, oo will be simplified into a matrix for sure. If you do not have balanced data, you may want to explicitly control the lag.max argument in acf. By default, acf will auto-decide on this value based on the number of observations.
Now suppose we want lag 0 to lag 7, we can set:
oo <- sapply(split(x$Area, x$ID),
FUN = function (u) c(acf(u, plot = FALSE, lag.max = 7)$acf) )
Thus the result oo is a matrix with 8 rows (one row per lag, one column per ID). I don't see any benefit in using a data frame to hold this result, but if you want one, simply do:
data.frame(oo)
With data either in a matrix or a data frame, it is easy for you to do further analysis.
-----------
For a complete description of acf, please read Produce a boxplot for multiple ACFs
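As a minimal sketch of the split and sapply approach above, here it is run on made-up balanced data. The data frame below is random and only mimics the shape of the question's DF (6 IDs, and an assumed 20 time points each); it is not the asker's data.

```r
set.seed(1)

# Hypothetical balanced data: 6 IDs observed at 20 time points each
DF <- data.frame(ID   = rep(1:6, times = 20),
                 time = rep(1:20, each = 6),
                 Area = rnorm(120, mean = 190, sd = 15))

# One ACF vector (lags 0 to 7) per ID; rows stay in time order within each ID
acf_by_id <- sapply(split(DF$Area, DF$ID),
                    function(u) c(acf(u, plot = FALSE, lag.max = 7)$acf))

dim(acf_by_id)       # lags in rows, IDs in columns
acf_by_id[1, ]       # lag 0 is always 1
```

Note that split keeps the original row order, so the series for each ID should already be sorted by time before calling it.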

Data handling: 2 independent factors, which decide the position of a numeric value in a new data frame

I am new to Stack Overflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R-script, which allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I run into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
1 A 1
1 B 2
2 A 3
3 A 2
3 B 3
3 C 1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
A B C
1 1 2 NA
2 3 NA NA
3 2 3 1
My approach
First I took out the names of samples and detectors and saved them in vectors as factors.
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
Then I subsetted the detectors into a new dataframe based on the name of the detector in the dataframe.
for (i in 1:length(detectors)){
assign(detectors[i], DF[which(DF$Detector == detectors[i]),])
}
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the problem: I have to get the values from the detector subsets into the result dataframe. It is important that each value finds its way to the right position in the dataframe. The issue is that the number of values differs, since some samples lack some detectors.
I tried to do the following: iterate through the detector subsets, compare the row name (= sample name) with each other, and if they are the same, write the value into the new dataframe. If they are not the same, it should write an NA.
for (i in 1:length(detectors)){
for (j in 1:length(get(detectors[i])$Sample)){
result[j,i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j,]), get(detectors[i])$Ct.Mean[j], NA)
}
}
The trouble is that this stops the iteration through the detector$Sample column and switches to the next detector. My understanding is that the samples being compared get out of sync, so that all following ifelse calls yield NA.
I tried to circumvent it somehow by editing the ifelse(test, yes, no) NO with j=j+1 to get it back in sync, but this unfortunately didn't work.
I hope I could make my problem understandable to you!
Looking forward to hearing any suggestions or comments (also on how to generally improve my code ;)
We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column.
library(tidyr)
spread(DF, Detector, Value)
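For readers without reshape2 or tidyr installed, the same long-to-wide step can be sketched in base R with tapply. This is a swapped-in alternative, not the approach from the answer above, and it assumes each Sample/Detector pair occurs at most once (with repeats, mean would average them). The data frame is rebuilt from the example in the question. In newer tidyr versions, spread is superseded by pivot_wider.

```r
# Example data from the question, in long format
DF <- data.frame(Sample   = c(1, 1, 2, 3, 3, 3),
                 Detector = c("A", "B", "A", "A", "B", "C"),
                 Value    = c(1, 2, 3, 2, 3, 1),
                 stringsAsFactors = FALSE)

# Rows = samples, columns = detectors; missing combinations become NA
wide <- tapply(DF$Value,
               list(Sample = DF$Sample, Detector = DF$Detector),
               FUN = mean)

wide                                    # a numeric matrix
result <- as.data.frame.matrix(wide)    # data frame, if one is needed
```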

r - Using l_ply to add results to an existing data frame

Is l_ply or some other apply-like function capable of inserting results to an existing data frame?
Here's a simple example...
Suppose I have the following data frame:
mydata <- data.frame(input1=1:3, input2=4:6, result1=NA, result2=NA)
input1 input2 result1 result2
1 1 4 NA NA
2 2 5 NA NA
3 3 6 NA NA
I want to loop through the rows, perform operations, then insert the answer in the columns result1 and result2. I tried:
l_ply(1:nrow(mydata), function(i) {
mydata[i,"result1"] <- mydata[i,"input1"] + mydata[i,"input2"]
mydata[i,"result2"] <- mydata[i,"input1"] * mydata[i,"input2"]})
but I get back the original data frame with NA's in the result columns.
P.S. I've already read this post, but it doesn't quite answer my question. I have several result columns, and the operations I want to perform are more complicated than what I have above so I'd prefer not to compute the columns separately then add them to the data frame after as the post suggests.
I suppose there might be a plyr approach but this seems very easy and clear to do in base R:
> mydata[3:4] <- with(mydata, list( input1+input2, input1*input2) )
> mydata
input1 input2 result1 result2
1 1 4 5 4
2 2 5 7 10
3 3 6 9 18
Even if you got that plyr code to deliver something useful, you are still not assigning the results to anything, so they would have evaporated under the glaring sun of garbage collection. And do note that if you followed the advice of @Vlo you would have seen a result at the console that might have led you to think that 'mydata' was updated, but the 'mydata' object would have remained untouched. You need to assign values back to the original object. For dplyr operations you are generally going to be assigning back entire objects.
You don't need to use apply or variations thereof. Instead, you can exploit that R is vectorized:
mydata$result1 <- mydata$input1 + mydata$input2
mydata$result2 <- mydata$input1 * mydata$input2
#> mydata
# input1 input2 result1 result2
#1 1 4 5 4
#2 2 5 7 10
#3 3 6 9 18
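The point about assignment can be illustrated with a small base R sketch (lapply stands in for l_ply here, and the variable names unchanged and row_results are made up for this example). An assignment to mydata inside the function only modifies a local copy, so the outer data frame stays as it was. If row-by-row computation is really needed, one option is to return the values from the function and assign them back in one step.

```r
mydata <- data.frame(input1 = 1:3, input2 = 4:6, result1 = NA, result2 = NA)

# Assigning inside the function changes a local copy only
invisible(lapply(seq_len(nrow(mydata)), function(i) {
  mydata[i, "result1"] <- mydata[i, "input1"] + mydata[i, "input2"]
}))
unchanged <- all(is.na(mydata$result1))   # the outer mydata was not modified

# Alternative: compute several results per row, then assign them back
row_results <- t(sapply(seq_len(nrow(mydata)), function(i) {
  c(result1 = mydata$input1[i] + mydata$input2[i],
    result2 = mydata$input1[i] * mydata$input2[i])
}))
mydata[, c("result1", "result2")] <- row_results
mydata
```

For operations as simple as these, the vectorized versions above remain the clearer choice; the row-wise form is only worth it when each row needs logic that cannot be vectorized.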

create new dataframe based on 2 columns

I have a large dataset "totaldata" containing multiple rows relating to each animal. Some of them are LactationNo 1 readings, and others are LactationNo 2 readings. I want to extract all animals that have readings from both LactationNo 1 and LactationNo 2 and store them in another dataframe "lactboth"
There are 16 other columns of variables of varying types in each row that I need to preserve in the new dataframe.
I have tried merge, aggregate and %in%, but perhaps I'm using them incorrectly, e.g.:
(lactboth <- totaldata[totaldata$LactationNo %in% c(1,2), ])
Animal Id is column 1, and lactationno is column 2. I can't figure out how to select only those AnimalId with LactationNo=1&2
Have also tried
lactboth <- totaldata[ which(totaldata$LactationNo==1 & totaldata$LactationNo ==2), ]
I feel like this should be simple, but couldn't find an example to follow quite the same. Help appreciated!!
If I understand your question correctly, then your dataset looks something like this:
AnimalId LactationNo
1 A 1
2 B 2
3 E 2
4 A 2
5 E 2
and you'd like to select animals that happen to have both lactation numbers 1 & 2 (like A in this particular example). If that's the case, then you can simply use merge:
lactboth <- merge(totaldata[totaldata$LactationNo == 1,],
totaldata[totaldata$LactationNo == 2,],
by.x="AnimalId",
by.y="AnimalId")[,"AnimalId"]
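Note that the merge call above returns only the matching AnimalId values, not the full rows. The second attempt in the question returns nothing because no single row can have LactationNo equal to 1 and 2 at the same time; the condition has to be checked per animal, not per row. A minimal sketch that keeps every column, using the small example data above (the column Yield is a made-up stand-in for the 16 other variables):

```r
# Small example data; Yield is a hypothetical extra column
totaldata <- data.frame(AnimalId    = c("A", "B", "E", "A", "E"),
                        LactationNo = c(1, 2, 2, 2, 2),
                        Yield       = c(10, 12, 9, 11, 8),
                        stringsAsFactors = FALSE)

# Animals that appear with lactation 1 and with lactation 2
ids1 <- totaldata$AnimalId[totaldata$LactationNo == 1]
ids2 <- totaldata$AnimalId[totaldata$LactationNo == 2]
both <- intersect(ids1, ids2)

# Keep all rows (and all columns) for those animals
lactboth <- totaldata[totaldata$AnimalId %in% both, ]
lactboth
```

If the data also contains other lactation numbers that should be excluded, the row filter could be extended with & totaldata$LactationNo %in% c(1, 2).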

paste several column values into one value in R

I have a really simple question that I cannot find a straightforward answer for. I have a data.frame that looks like this:
df3 <- data.frame(x=c(1:10),y=c(5:14),z=c(25:34))
ID x y z
1 1 5 25
2 2 6 26
3 3 7 27
etc.
And I want to 'paste' together the different values in each column so that they form a single, combined value, as in:
ID x+y+z
1 1525
2 2626
3 3727
I'm sure that this is very easy to do, but I just don't know how!
Yep, paste() is exactly what you want to do:
df3$xyz <- with(df3, paste(x,y,z, sep=""))
# Or, if you want the result to be numeric, rather than character
df3$xyz <- as.numeric(with(df3, paste(x,y,z, sep="")))
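As a small runnable check of the same idea, paste0() is shorthand for paste(..., sep = ""), and do.call can apply it across all columns without naming them. The names xyz_all and xyz_num below are made up for this sketch.

```r
df3 <- data.frame(x = 1:10, y = 5:14, z = 25:34)

# paste0() is equivalent to paste(..., sep = "")
df3$xyz <- with(df3, paste0(x, y, z))

# Same result without naming the columns individually
xyz_all <- do.call(paste0, unname(as.list(df3[c("x", "y", "z")])))

# Numeric version, if needed
xyz_num <- as.numeric(df3$xyz)

head(df3, 3)
```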
