How to apply a function to a ragged dataframe in R

I am working with 45 large dataframes, of 60 each, of extremely messy timeseries data, but it basically boils down to the two data frames below. I have managed to wrangle it down to these two dataframe types. The first holds the positions on the y-axis:
Group1 Group2 Group3
0      0      0
1      1      1
3      2      4
5      3      7
       5      8
              9
The second holds the sampling information for each group:
names  samples t0 XIncrement
Group1 4       0  2
Group2 5       1  2
Group3 7       2  3
I have to create the time for each group. When I write the function for a single group it works great and I get exactly what I want.
1.) t0 is the starting time
2.) XIncrement is the time between each sample taken
3.) samples is the number of samples in the set
4.) Time would be the time for each sample taken
Time_function <- function(samples, t0, XIncrement) {
  samples_2 <- c(1:samples)
  Time <- ((samples_2 * t0) + XIncrement)
  return(Time)
}
test_function <- Time_function(100, 200e-9, 500e-12)
I need it to be "applied" to each group. I am really not sure how to either create a new data frame with the information or add it to the rows of the df.
I have tried making an empty data frame and populating it that way, and that did not work.
When I transpose the data and get it into list form, it seems the most usable form, but I still cannot get the created Time into the list.
Added for greater clarity: each group has its own unique time; some groups were sampled more than others.
So, at the end I need to merge Group_1 with Group1_Time, Group_2 with Group2_Time. The data frame will be ragged.
Thanks for any help or guidance you can provide. I have googled and searched stackoverflow for information, but I have come up empty-handed. If there is a question like this, great, I have yet to find it.

Assuming the input data shown reproducibly in the Note below, this code combines the list L and the data frame DF, and uses a more appropriate data structure for the output, the ts class:
make_ts <- function(g, t0, xincr) ts(g, start = t0, deltat = xincr)
tt <- Map(make_ts, L, DF$t0, DF$XIncrement)
tt
giving this list of ts objects:
$Group1
Time Series:
Start = 0
End = 6
Frequency = 0.5
[1] 0 1 3 5
$Group2
Time Series:
Start = 1
End = 9
Frequency = 0.5
[1] 0 1 2 3 5
$Group3
Time Series:
Start = 2
End = 17
Frequency = 0.333333333333333
[1] 0 1 4 7 8 9
Note that we can recover the data, times, start time, deltat and frequency (= 1/deltat) and convert to a list of data frames like this:
lapply(tt, c)
lapply(tt, time)
sapply(tt, start)
sapply(tt, deltat)
sapply(tt, frequency)
library(zoo)
lapply(tt, fortify.zoo)
Note
Since it is not possible to have a data frame with differing column lengths we assume that the first data structure is a list as shown and the second is a data frame.
L <- list(Group1 = c(0, 1, 3, 5), Group2 = c(0, 1, 2, 3, 5),
Group3 = c(0, 1, 4, 7, 8, 9))
DF <- structure(list(names = c("Group1", "Group2", "Group3"), samples = c(4L,
5L, 7L), t0 = 0:2, XIncrement = c(2L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-3L))
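If a plain list of data frames is preferred and zoo is not available, the same pairing of L with the rows of DF can be sketched in base R. This assumes, as the ts approach does, that sample i of a group is taken at t0 + (i - 1) * XIncrement. The names make_df and group_frames are hypothetical and not part of the original answer.

```r
# Input as in the Note above (the unused samples column is omitted here).
L <- list(Group1 = c(0, 1, 3, 5), Group2 = c(0, 1, 2, 3, 5),
          Group3 = c(0, 1, 4, 7, 8, 9))
DF <- data.frame(names = c("Group1", "Group2", "Group3"),
                 t0 = 0:2, XIncrement = c(2, 2, 3))

# Hypothetical helper: pair each value with its sampling time,
# assuming sample i is taken at t0 + (i - 1) * xincr.
make_df <- function(g, t0, xincr) {
  data.frame(Time = t0 + (seq_along(g) - 1) * xincr, Value = g)
}

# Map walks the list and the two DF columns in parallel and keeps the
# names of L, so the result is a named list of data frames of differing length.
group_frames <- Map(make_df, L, DF$t0, DF$XIncrement)
group_frames$Group3
```

Keeping the result as a list avoids forcing the ragged groups into one rectangular data frame.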

Related

Index and assign multiple sets of rows at once

I have an imported dataframe Measurements that contains many observations from an experiment.
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
X Data
1 90
2 85
3 100
4 105
I want to add another column Condition that specifies the treatment group for each datapoint. I know which observation ranges are from which condition (e.g. observations 1:2 are from the control and observations 3:4 are from the experimental group).
I have devised two solutions already that give the desired output but neither are ideal. First:
Measurements["Condition"] <- c(rep("Cont", 2), rep("Exp", 2))
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
The benefit of this is that it is one line of code/one command. But it is not ideal, since I need to do the math separately (e.g. 3:4 = 2 obs, etc.), which can be tricky/unclear/indirect with larger datasets and more conditions (e.g. 47:83 = ? obs, etc.). It is also liable to propagate errors, since a small error in length for an early assignment would shift the assignment of later groups (e.g. if rep of Cont is mistakenly 1, then Exp gets mistakenly assigned to 2:3 too).
I also thought of assigning like this, which gives the desired output too:
Measurements[1:2, "Condition"] <- "Cont"
Measurements[3:4, "Condition"] <- "Exp"
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
This makes it more clear/simple/direct which rows will receive which assignment, but this requires separate assignments and repetition. I feel like there should be a way to "vectorize" this assignment, which is the solution I'm looking for.
I'm having trouble finding rules for complex indexing online. Here is my first intuitive guess of how to achieve this:
Measurements[c(1:2, 3:4), "Condition"] <- list("Cont", "Exp")
X Data Condition
1 90 Cont
2 85 Cont
3 100 Cont
4 105 Cont
But this doesn't work. It seems to combine 1:2 and 3:4 into a single equivalent range (1:4) and assigns only the first condition to this range, which suggests I also need to specify the column again. When I try to specify the column again:
Measurements[c(1:2, 3:4), c("Condition", "Condition")] <- list("Cont", "Exp")
X Data Condition Condition.1
1 90 Cont Exp
2 85 Cont Exp
3 100 Cont Exp
4 105 Cont Exp
For some reason this creates a second new column (??), and it again seems to combine 1:2 and 3:4 into essentially 1:4. So I think I need to index the two row ranges in a way that keeps them separate and only specify the column once, but I'm stuck on how to do this. I assume the solution is simple but I can't seem to find an example of what I'm trying to do. Maybe to keep them separate I do have to assign them separately, but I'm hoping there is a way.
Can anyone help? Thank you a ton in advance from an R noobie!
If you already have a list of observations which belong to each condition you could use dplyr::case_when to do a conditional mutate. Depending on how you have this information stored you could use something like the following:
library(dplyr)
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
# set which observations belong to each condition
Cont <- 1:2
Exp <- 3:4
Measurements %>%
mutate(Condition = case_when(
X %in% Cont ~ "Cont",
X %in% Exp ~ "Exp"
))
# X Data Condition
# 1 90 Cont
# 2 85 Cont
# 3 100 Exp
# 4 105 Exp
Note that this does not require the observations to be in consecutive rows.
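For readers who want the single vectorized assignment the question asks for without dplyr, a base R sketch follows. It assumes the conditions are kept in a named list of row positions (the list name groups is hypothetical), so it indexes by row number rather than by the value of X.

```r
Measurements <- data.frame(X = 1:4, Data = c(90, 85, 100, 105))

# Hypothetical lookup: condition name -> row positions belonging to it.
groups <- list(Cont = 1:2, Exp = 3:4)

# Start with NA so any row not listed in groups stays visibly unassigned.
Measurements$Condition <- NA_character_

# unlist() flattens the row positions; rep() repeats each name once per
# position, so the two vectors line up element by element.
Measurements$Condition[unlist(groups)] <- rep(names(groups), lengths(groups))
Measurements
```

Because each range is written once next to its label, a mistake in one range does not shift the labels of the other groups.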
I normally see this done with a merge operation. The trick is getting your conditions data into a nice shape.
composeConditions <- function(...) {
conditions <- list(...)
data.frame(
X = unname(unlist(conditions)),
condition = unlist(unname(lapply(
names(conditions),
function(x) rep(x, times = length(conditions[x][[1]]))
)))
)
}
conditions <- composeConditions(Cont = 1:2, Exp = 3:4)
> conditions
X condition
1 1 Cont
2 2 Cont
3 3 Exp
4 4 Exp
merge(Measurements, conditions, by = "X")
X Data condition
1 1 90 Cont
2 2 85 Cont
3 3 100 Exp
4 4 105 Exp
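The lookup table can probably be built more compactly with stack() from the utils package, which turns a named list into a two-column data frame. This is a sketch of an alternative to composeConditions, not the answerer's own method; the result name merged is hypothetical.

```r
Measurements <- data.frame(X = 1:4, Data = c(90, 85, 100, 105))

# stack() returns a data frame with columns "values" (the list elements)
# and "ind" (a factor holding the list name each element came from).
conditions <- stack(list(Cont = 1:2, Exp = 3:4))
names(conditions) <- c("X", "condition")
conditions$condition <- as.character(conditions$condition)

# Join on X exactly as in the merge call above.
merged <- merge(Measurements, conditions, by = "X")
merged
```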
For larger datasets, an efficient approach is to keep a vector of condition labels and a pattern of indices into it.
Measurements <- data.frame(X = 1:4, Data = c(90, 85, 100, 105))
dat <- c("Cont","Exp")
pattern <- c(1,1,2,2)
Or derive the pattern from the data, e.g. conditionally on Measurements$Data:
pattern <- ifelse(Measurements$Data >= 100, 2, 1)
# [1] 1 1 2 2
Then you can add the data simply by doing:
Measurements$Condition <- dat[pattern]
# X Data Condition
#1 1 90 Cont
#2 2 85 Cont
#3 3 100 Exp
#4 4 105 Exp

R: get corresponding value from another data frame

I'm new to R and here and I need some help to structure my data.
I have two data sets:
One of them is a long format within subjects data set which is large and looks a little bit like this:
long.format <- data.frame(subject.no = c(1, 1, 1, 1, 2, 2, 2, 2), condition = c("prime", "prime", "prime", "prime", "control", "control","control","control"), response = c(1,1,1,0,1,1,1,0))
  subject.no condition response
1          1     prime        1
2          1     prime        1
3          1     prime        1
4          1     prime        0
5          2   control        1
6          2   control        1
7          2   control        1
8          2   control        0
The other one is already in wide format and looks like this
wide.format <- data.frame(subject = c(1, 2), age = c(26,27), gender = c("m","f"))
  subject age gender
1       1  26      m
2       2  27      f
The only thing I want to do now is to get the value in "condition" (and only this!) from the long format data frame to the corresponding subject in the wide data frame by adding a new column in the wide data frame (by using the columns subject.no and subject, respectively).
So the final data frame should look like this:
wide.format.aim <- data.frame(subject = c(1, 2), age = c(26,27), gender = c("m","f"), condition = c("prime","control"))
  subject age gender condition
1       1  26      m     prime
2       2  27      f   control
I've tried merging but this ended up with a long format data frame added with the information from the wide format data frame... but I want it the other way around...
This is what I've tried:
test.it <- merge(x=wide.format, y=long.format[,c("subject.no", "condition")], all.x=T, by.x="subject", by.y="subject.no")
Any suggestions?
Thanks in advance!
You are interested in merging the unique values from long.format[,c("subject.no", "condition")]:
unique(long.format[,c("subject.no", "condition")])
# subject.no condition
#1 1 prime
#5 2 control
You can then merge using those values:
merge(x = wide.format,
y = unique(long.format[,c("subject.no", "condition")]),
by.x = "subject",
by.y = "subject.no")
# subject age gender condition
#1 1 26 m prime
#2 2 27 f control
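When only one column is being brought across, a lookup with match() is a possible alternative to merge. The sketch below assumes every subject has a single condition, so the first matching row of long.format is representative; subjects missing from long.format would get NA. The variable first_row is a hypothetical name.

```r
long.format <- data.frame(subject.no = c(1, 1, 1, 1, 2, 2, 2, 2),
                          condition = rep(c("prime", "control"), each = 4),
                          response = c(1, 1, 1, 0, 1, 1, 1, 0),
                          stringsAsFactors = FALSE)
wide.format <- data.frame(subject = c(1, 2), age = c(26, 27),
                          gender = c("m", "f"), stringsAsFactors = FALSE)

# match() gives, for each subject in wide.format, the position of the
# first row in long.format with the same subject number.
first_row <- match(wide.format$subject, long.format$subject.no)
wide.format$condition <- long.format$condition[first_row]
wide.format
```

Unlike merge, this keeps the row order of wide.format unchanged.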

How to construct and add to a data frame with named columns?

I cannot figure out how to do this without throwing errors. I have a set of column names for my data frame I want to create and add to that looks like this:
x <- c("A", "B", "C")
So, I go down through the loop and I calculate some numerical values in a vector, say:
z <- c(1, 5, 7, 8, 34, 5)
z is the same dimension each time through the loop.
The first time through (or even outside the loop) I want to initialize a data frame by doing something like:
df$x[1] <- z
so I have a data frame that looks like:
A
1 1
2 5
3 7
4 8
5 34
6 5
The next time through the loop I want to add another column to df with a column heading being the second element of x, and a set of new z values. If the data frame has to be completely dimensioned ahead of time, I could calculate variables outside the loop to do this, say, M and N, but these may change from one run to the next.
I cannot seem to figure out how to do this. Suggestions much appreciated.
Try this:
set.seed(1)
#set the column names
x <- c("A", "B", "C")
#create the list that later we will convert to a data.frame
df<-setNames(vector("list",length(x)),x)
#loop to produce the various z
for (i in seq_along(x)) {
#do some stuff to evaluate z
z<-sample(5)
#assign to an element of df
df[[i]]<-z
}
#coerce to a data.frame
df<-as.data.frame(df)
# A B C
#1 2 5 2
#2 5 4 1
#3 4 2 3
#4 3 3 4
#5 1 1 5
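The same result can be sketched without an explicit loop by generating one list element per column name and converting at the end. Here compute_z is a hypothetical stand-in for whatever the loop body calculates on iteration i; it is not from the question.

```r
x <- c("A", "B", "C")

# Hypothetical stand-in for the per-iteration calculation of z.
compute_z <- function(i) c(1, 5, 7, 8, 34, 5) * i

# One list element per column name, then name the elements and convert.
cols <- lapply(seq_along(x), compute_z)
df <- as.data.frame(setNames(cols, x))
df
```

Building the list first and converting once avoids repeatedly growing a data frame inside the loop.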

Append values from column 2 to values from column 1

In R, I have two data frames (A and B) that share columns (1, 2 and 3). Column 1 has a unique identifier, and is the same for each data frame; columns 2 and 3 have different information. I'm trying to merge these two data frames to get 1 new data frame that has columns 1, 2, and 3, and in which the values in column 2 and 3 are concatenated: i.e. column 2 of the new data frame contains: [data frame A column 2 + data frame B column 2]
Example:
dfA <- data.frame(Name = c("John","James","Peter"),
Score = c(2,4,0),
Response = c("1,0,0,1","1,1,1,1","0,0,0,0"))
dfB <- data.frame(Name = c("John","James","Peter"),
Score = c(3,1,4),
Response = c("0,1,1,1","0,1,0,0","1,1,1,1"))
dfA:
Name Score Response
1 John 2 1,0,0,1
2 James 4 1,1,1,1
3 Peter 0 0,0,0,0
dfB:
Name Score Response
1 John 3 0,1,1,1
2 James 1 0,1,0,0
3 Peter 4 1,1,1,1
Should result in:
dfNew <- data.frame(Name = c("John","James","Peter"),
Score = c(5,5,4),
Response = c("1,0,0,1,0,1,1,1","1,1,1,1,0,1,0,0","0,0,0,0,1,1,1,1"))
dfNew:
Name Score Response
1 John 5 1,0,0,1,0,1,1,1
2 James 5 1,1,1,1,0,1,0,0
3 Peter 4 0,0,0,0,1,1,1,1
I've tried merge, but that simply appends the columns (much like cbind).
Is there a way to do this without having to cycle through all columns, like:
colnames(dfNew) <- c("Name","Score","Response")
dfNew$Score <- dfA$Score + dfB$Score
dfNew$Response <- paste(dfA$Response, dfB$Response, sep=",")
The added difficulty is, as you might have noticed, that for some columns we need to use addition, whereas others require concatenation separated by a comma (the columns requiring addition are formatted as numerical, the others as text, which might make it easier?)
Thanks in advance!
PS. The string 1,0,0,1,0,1,1,1 etc. captures the response per trial – this example has 8 trials to which participants can either respond correctly (1) or incorrectly (0); the final score is collected under Score. Just to explain why my data/example looks the way it does.
Personally, I would try to avoid concatenating 'response per trial' to a single variable ('Response') from the start, in order to make the data less static and facilitate any subsequent steps of analysis or data management. Given that the individual trials already are concatenated, as in your example, I would therefore consider splitting them up. Formatting the data frame for a final, pretty, printed output I would consider a different, later issue.
# merge data (cbind would also work if data are ordered properly)
df <- merge(x = dfA[ , c("Name", "Response")], y = dfB[ , c("Name", "Response")],
by = "Name")
# rename
names(df) <- c("Name", c("A", "B"))
# split concatenated columns
library(splitstackshape)
df2 <- concat.split.multiple(data = df, split.cols = c("A", "B"),
seps = ",", direction = "wide")
# calculate score
df2$Score <- rowSums(df2[ , -1])
df2
# Name A_1 A_2 A_3 A_4 B_1 B_2 B_3 B_4 Score
# 1 James 1 1 1 1 0 1 0 0 5
# 2 John 1 0 0 1 0 1 1 1 5
# 3 Peter 0 0 0 0 1 1 1 1 4
I would approach this with a for loop over the column names you want to merge. Given your example data:
cols <- c("Score", "Response")
dfNew <- dfA[,"Name",drop=FALSE]
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfA[[n]] + dfB[[n]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfA[[n]], dfB[[n]], sep=",")})
}
This solution is basically what you had as your idea, but with a loop. The columns are checked to see whether they are numeric (add them numerically) or a string or factor (concatenate the strings). You could get a similar result by having two vectors of names, one for the numeric and one for the character columns, but this is extensible if you have other data types as well (though I don't know what they might be). The major drawback of this method is that it assumes the data frames are in the same order with regard to Name. The next solution doesn't make that assumption:
dfNew <- merge(dfA, dfB, by="Name")
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfNew[[paste0(n,".x")]] + dfNew[[paste0(n,".y")]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfNew[[paste0(n,".x")]], dfNew[[paste0(n,".y")]], sep=",")})
dfNew[[paste0(n,".x")]] <- NULL
dfNew[[paste0(n,".y")]] <- NULL
}
Same general idea as the previous solution, but it uses merge to make sure that the data are correctly aligned, and then works on columns (whose names are suffixed with ".x" and ".y") within dfNew. Additional steps are included to get rid of the separate columns after joining. It also has the bonus feature of carrying along any other columns not specified for joining together in cols.
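One caveat with switching on class(): an integer column has class "integer", which matches none of the branches, so that column would silently be left out. A variant that tests with is.numeric() covers both integer and double columns. The sketch below is an illustration under that assumption; the helper combine and the suffixes ".A" and ".B" are hypothetical choices, not from the answer.

```r
dfA <- data.frame(Name = c("John", "James", "Peter"), Score = c(2, 4, 0),
                  Response = c("1,0,0,1", "1,1,1,1", "0,0,0,0"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(Name = c("John", "James", "Peter"), Score = c(3, 1, 4),
                  Response = c("0,1,1,1", "0,1,0,0", "1,1,1,1"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: add numeric columns, paste everything else.
combine <- function(a, b) {
  if (is.numeric(a)) a + b else paste(a, b, sep = ",")
}

cols <- c("Score", "Response")

# merge aligns rows by Name (and sorts by it); the suffixes mark the source.
m <- merge(dfA, dfB, by = "Name", suffixes = c(".A", ".B"))
dfNew <- m["Name"]
for (n in cols) {
  dfNew[[n]] <- combine(m[[paste0(n, ".A")]], m[[paste0(n, ".B")]])
}
dfNew
```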

Selection of observations by combining criteria in R

This topic has probably been brought up before, and I guess the solution is quite simple. However, I haven't managed it so far.
Let's say I have a data.frame (called "data") which contains 10 individuals (id) on which I collected observations at 3 time points (T):
> data <- data.frame(id = rep(c(1:10), 3),
T = gl(3, 10),
X = sample(1:30),
Y = sample(c("yes", "no"), 30, replace = TRUE),
Z = sample(1:40, 30),
Z2 = rnorm(30, mean = 5, sd = 0.5))
> head(data)
id T X Y Z Z2
1 1 1 10 yes 15 5.993605
2 2 1 18 no 22 6.096566
3 3 1 5 no 24 5.101393
4 4 1 15 yes 18 4.944108
5 5 1 23 no 34 4.634176
6 6 1 13 no 27 5.576015
I would like to create a subset of this data.frame (a new data.frame called data2) by selecting only individuals that have "yes" (variable Y) at each of the three time points (variable T), that is, Y="yes" for T=1 and T=2 and T=3.
I know that conditions can be combined with the "&" operator, and that this can relate the conditions for the 3 time points. However, my problem is writing the condition for each time point: how do I tell R that I want subjects for which Y="yes" at T="1", for example?
Thank you very much in advance to all.
Have a great day,
Denis
You can do:
keep.ids <- tapply(data$Y, data$id, FUN = function(x)all(x == "yes"))
subset(data, keep.ids[factor(id)])
Or use the plyr package:
library(plyr)
ddply(data, "id", function(x) if(all(x$Y == "yes")) x else NULL)
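A base R alternative that needs no extra package is ave(), which applies a function within each group and returns a vector aligned with the original rows. The sketch below uses a small fixed data set rather than the random one from the question, so the outcome is predictable; the name all_yes is hypothetical.

```r
# Three individuals observed at three time points; id 2 has one "no".
data <- data.frame(id = rep(1:3, 3), T = gl(3, 3),
                   Y = c("yes", "yes", "yes",
                         "yes", "no",  "yes",
                         "yes", "yes", "yes"),
                   stringsAsFactors = FALSE)

# For every row, TRUE if all rows sharing its id have Y == "yes".
all_yes <- ave(data$Y == "yes", data$id, FUN = all)

# Keep only the individuals that answered "yes" at every time point.
data2 <- data[all_yes, ]
data2
```

Here data2 keeps the rows of ids 1 and 3 and drops every row of id 2.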
