I'm new to R and here and I need some help to structure my data.
I have two data sets:
One of them is a long format within subjects data set which is large and looks a little bit like this:
long.format <- data.frame(subject.no = c(1, 1, 1, 1, 2, 2, 2, 2), condition = c("prime", "prime", "prime", "prime", "control", "control","control","control"), response = c(1,1,1,0,1,1,1,0))
subject.no condition response
>1 1 prime 1
>2 1 prime 1
>3 1 prime 1
>4 1 prime 0
>5 2 control 1
>6 2 control 1
>7 2 control 1
>8 2 control 0
The other one is already in wide format and looks like this
wide.format <- data.frame(subject = c(1, 2), age = c(26,27), gender = c("m","f"))
subject age gender
>1 1 26 m
>2 2 27 f
The only thing I want to do now is to get the value in "condition" (and only this!) from the long format data frame to the corresponding subject in the wide data frame by adding a new column in the wide data frame (by using the columns subject.no and subject, respectively).
So the final data frame should look like this:
wide.format.aim <- data.frame(subject = c(1, 2), age = c(26,27), gender = c("m","f"), condition = c("prime","control"))
subject age gender condition
>1 1 26 m prime
>2 2 27 f control
I've tried merging but this ended up with a long format data frame added with the information from the wide format data frame... but I want it the other way around...
This is what I've tried:
test.it <- merge(x=wide.format, y=long.format[,c("subject.no", "condition")], all.x=T, by.x="subject", by.y="subject.no")
Any suggestions?
Thanks in advance!
You are interested in merging only the unique values from long.format[,c("subject.no", "condition")]:
unique(long.format[,c("subject.no", "condition")])
# subject.no condition
#1 1 prime
#5 2 control
You can merge using those values
merge(x = wide.format,
y = unique(long.format[,c("subject.no", "condition")]),
by.x = "subject",
by.y = "subject.no")
# subject age gender condition
#1 1 26 m prime
#2 2 27 f control
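If a full merge feels heavy for pulling across a single column, a lookup with match() is another option. This is only a sketch, and it assumes each subject has exactly one condition in the long data (match() uses the first matching row and ignores the rest); the helper name idx is made up for illustration.

```r
long.format <- data.frame(
  subject.no = c(1, 1, 1, 1, 2, 2, 2, 2),
  condition  = rep(c("prime", "control"), each = 4),
  response   = c(1, 1, 1, 0, 1, 1, 1, 0)
)
wide.format <- data.frame(subject = c(1, 2), age = c(26, 27), gender = c("m", "f"))

# For each subject in the wide data, find the first row of the long data
# with the same subject number (assumption: one condition per subject).
idx <- match(wide.format$subject, long.format$subject.no)

# Copy that row's condition into a new column of the wide data.
wide.format$condition <- long.format$condition[idx]
wide.format
```

Subjects that do not appear in the long data would get NA in the new column, which is comparable to what all.x = TRUE does in merge().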
I am working with 45 large data frames, of 60 each, of extremely messy time series data. It basically boils down to the two data frames below, which are the two types I have managed to wrangle everything down to. The first holds the positions on the y-axis:
Group1 Group2 Group3
0      0      0
1      1      1
3      2      4
5      3      7
       5      8
              9
names samples t0 XIncrement
Group1 4 0 2
Group2 5 1 2
Group3 7 2 3
I have to create the time for each group. When I write the function for a single group it works great, I get exactly what I want.
1.) t0 is the starting time
2.) XIncrement is the time between each sample taken
3.) samples is the number of samples in the set.
4.) Time would be the time for each sample taken
Time_function <- function(samples, t0, XIncrement) {
  samples_2 <- seq_len(samples)
  Time <- (samples_2 * t0) + XIncrement
  return(Time)
}
test_function <- Time_function(100, 200e-9, 500e-12)
I need it to be "applied" each group. I really am not sure how to either create a new data frame with the information or add it to the rows of the df.
I have tried making an empty data frame and populating it that way and that did not work.
When I transpose the data and get it into list form, it seems to be the most usable form, but I still cannot get the computed Time into the list.
Added for greater clarity: each group has its own unique time, and some groups were sampled more than others.
So, at the end I need to merge Group_1 with Group1_Time, Group_2 with Group2_Time. The data frame will be ragged.
Thanks for any help or guidance you can provide. I have googled and searched Stack Overflow for information, but I came up empty-handed. If there is already a question like this, great, but I have yet to find it.
Assuming the input data shown reproducibly in the Note below, this code combines the list L and the data frame DF, and uses a more appropriate data structure (the ts class) for the output:
make_ts <- function(g, t0, xincr) ts(g, start = t0, deltat = xincr)
tt <- Map(make_ts, L, DF$t0, DF$XIncrement)
tt
giving this list of ts objects:
$Group1
Time Series:
Start = 0
End = 6
Frequency = 0.5
[1] 0 1 3 5
$Group2
Time Series:
Start = 1
End = 9
Frequency = 0.5
[1] 0 1 2 3 5
$Group3
Time Series:
Start = 2
End = 17
Frequency = 0.333333333333333
[1] 0 1 4 7 8 9
Note that we can recover the data, times, start time, deltat and frequency (= 1/deltat) and convert to a list of data frames like this:
lapply(tt, c)
lapply(tt, time)
sapply(tt, start)
sapply(tt, deltat)
sapply(tt, frequency)
library(zoo)
lapply(tt, fortify.zoo)
Note
Since it is not possible to have a data frame with differing column lengths we assume that the first data structure is a list as shown and the second is a data frame.
L <- list(Group1 = c(0, 1, 3, 5), Group2 = c(0, 1, 2, 3, 5),
Group3 = c(0, 1, 4, 7, 8, 9))
DF <- structure(list(names = c("Group1", "Group2", "Group3"), samples = c(4L,
5L, 7L), t0 = 0:2, XIncrement = c(2L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-3L))
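If plain data frames are preferred over ts objects, the same Map() pattern can build the time column directly. This is a sketch under the assumption that the time of sample i is t0 + (i - 1) * XIncrement, which matches how ts() interprets start and deltat but is not the formula in the question's Time_function. The names make_df and out are made up for illustration.

```r
L <- list(Group1 = c(0, 1, 3, 5),
          Group2 = c(0, 1, 2, 3, 5),
          Group3 = c(0, 1, 4, 7, 8, 9))
DF <- data.frame(names = c("Group1", "Group2", "Group3"),
                 t0 = c(0, 1, 2),
                 XIncrement = c(2, 2, 3))

# Build one data frame per group: a Time column next to the y values.
# Assumption: the first sample is taken at t0, later ones every xincr.
make_df <- function(y, t0, xincr) {
  data.frame(Time = t0 + (seq_along(y) - 1) * xincr, y = y)
}

# Map() walks the list and the two DF columns in parallel,
# so each group gets its own start time and increment.
out <- Map(make_df, L, DF$t0, DF$XIncrement)
out$Group3
```

Because each group keeps its own data frame, the ragged lengths are not a problem; the list stays named by group.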
I have two columns in my data frame, value and num_leads. I'd like to create a third column that stores the value's value from n rows below - where n is whatever number is stored in num_leads. Here's an example:
df1 <- data.frame(value = c(1:5),
num_leads = c(2, 3, 1, 1, 0))
Desired output:
value num_leads result
1 1 2 3
2 2 3 5
3 3 1 4
4 4 1 5
5 5 0 5
I have tried using the lead function in dplyr but unfortunately it seems all the leads must have the same number.
Using indexing:
with(df1, value[seq_along(value) + num_leads])
where seq_along(value) gives the row number, and adding num_leads to it gives the index of the row to pull the value from.
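To see the row-number indexing at work on values that are not simply 1 to 5, here is a small sketch with made-up data (df2 is hypothetical, not from the question):

```r
# Hypothetical data: the values no longer coincide with the row numbers.
df2 <- data.frame(value = c(10, 20, 30, 40, 50),
                  num_leads = c(2, 3, 1, 1, 0))

# Row index plus the per-row lead gives the position to read from.
df2$result <- with(df2, value[seq_along(value) + num_leads])
df2$result
# 30 50 40 50 50
```

If a lead points past the last row, the index is out of range and the result for that row is NA.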
This is what I came up with (note that it works here only because value happens to equal the row number):
df1$result <- df1$value[df1$value + df1$num_leads]
In R, I have two data frames (A and B) that share columns (1, 2 and 3). Column 1 has a unique identifier, and is the same for each data frame; columns 2 and 3 have different information. I'm trying to merge these two data frames to get 1 new data frame that has columns 1, 2, and 3, and in which the values in column 2 and 3 are concatenated: i.e. column 2 of the new data frame contains: [data frame A column 2 + data frame B column 2]
Example:
dfA <- data.frame(Name = c("John","James","Peter"),
Score = c(2,4,0),
Response = c("1,0,0,1","1,1,1,1","0,0,0,0"))
dfB <- data.frame(Name = c("John","James","Peter"),
Score = c(3,1,4),
Response = c("0,1,1,1","0,1,0,0","1,1,1,1"))
dfA:
Name Score Response
1 John 2 1,0,0,1
2 James 4 1,1,1,1
3 Peter 0 0,0,0,0
dfB:
Name Score Response
1 John 3 0,1,1,1
2 James 1 0,1,0,0
3 Peter 4 1,1,1,1
Should result in:
dfNew <- data.frame(Name = c("John","James","Peter"),
Score = c(5,5,4),
Response = c("1,0,0,1,0,1,1,1","1,1,1,1,0,1,0,0","0,0,0,0,1,1,1,1"))
dfNew:
Name Score Response
1 John 5 1,0,0,1,0,1,1,1
2 James 5 1,1,1,1,0,1,0,0
3 Peter 4 0,0,0,0,1,1,1,1
I've tried merge but that simply appends the columns (much like cbind)
Is there a way to do this, without having to cycle through all columns, like:
colnames(dfNew) <- c("Name","Score","Response")
dfNew$Score <- dfA$Score + dfB$Score
dfNew$Response <- paste(dfA$Response, dfB$Response, sep=",")
The added difficulty is, as you might have noticed, that for some columns we need to use addition, whereas others require concatenation separated by a comma (the columns requiring addition are formatted as numerical, the others as text, which might make it easier?)
Thanks in advance!
PS. The string 1,0,0,1,0,1,1,1 etc. captures the response per trial – this example has 8 trials to which participants can either respond correctly (1) or incorrectly (0); the final score is collected under Score. Just to explain why my data/example looks the way it does.
Personally, I would try to avoid concatenating 'response per trial' to a single variable ('Response') from the start, in order to make the data less static and facilitate any subsequent steps of analysis or data management. Given that the individual trials already are concatenated, as in your example, I would therefore consider splitting them up. Formatting the data frame for a final, pretty, printed output I would consider a different, later issue.
# merge data (cbind would also work if data are ordered properly)
df <- merge(x = dfA[ , c("Name", "Response")], y = dfB[ , c("Name", "Response")],
by = "Name")
# rename
names(df) <- c("Name", c("A", "B"))
# split concatenated columns
library(splitstackshape)
df2 <- concat.split.multiple(data = df, split.cols = c("A", "B"),
seps = ",", direction = "wide")
# calculate score
df2$Score <- rowSums(df2[ , -1])
df2
# Name A_1 A_2 A_3 A_4 B_1 B_2 B_3 B_4 Score
# 1 James 1 1 1 1 0 1 0 0 5
# 2 John 1 0 0 1 0 1 1 1 5
# 3 Peter 0 0 0 0 1 1 1 1 4
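If installing a package is not an option, the splitting step can also be sketched in base R with strsplit(). This is only an illustration on the concatenated strings from the question; the names resp, trials and scores are made up.

```r
# Concatenated responses per participant, as in the desired output.
resp <- c(John  = "1,0,0,1,0,1,1,1",
          James = "1,1,1,1,0,1,0,0",
          Peter = "0,0,0,0,1,1,1,1")

# Split each string on commas and convert the pieces to integers,
# giving one integer vector of trial results per participant.
trials <- lapply(strsplit(resp, ","), as.integer)

# The score is the number of correct (1) trials.
scores <- sapply(trials, sum)
scores
```

Keeping the trials as vectors rather than strings makes it straightforward to recompute the score or analyse individual trials later.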
I would approach this with a for loop over the column names you want to merge. Given your example data:
cols <- c("Score", "Response")
dfNew <- dfA[,"Name",drop=FALSE]
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfA[[n]] + dfB[[n]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfA[[n]], dfB[[n]], sep=",")})
}
This solution is basically what you had as your idea, but with a loop. Each column is checked to see whether it is numeric (add the values) or a string or factor (concatenate the strings). You could get a similar result by having two vectors of names, one for the numeric columns and one for the character columns, but this is extensible if you have other data types as well (though I don't know what they might be). The major drawback of this method is that it assumes the data frames are in the same order with regard to Name. The next solution doesn't make that assumption:
dfNew <- merge(dfA, dfB, by="Name")
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfNew[[paste0(n,".x")]] + dfNew[[paste0(n,".y")]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfNew[[paste0(n,".x")]], dfNew[[paste0(n,".y")]], sep=",")})
dfNew[[paste0(n,".x")]] <- NULL
dfNew[[paste0(n,".y")]] <- NULL
}
Same general idea as the previous one, but it uses merge to make sure that the data is correctly aligned, and then works on the columns within dfNew (whose names are suffixed with ".x" and ".y"). Additional steps are included to get rid of the separate columns after joining. It also has the bonus feature of carrying along any other columns not listed in cols.
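One thing to watch with switch(class(...)) is that a column of class "integer" matches none of the branches and is silently skipped. A variation that tests is.numeric() instead covers both integer and double columns. This is a sketch only; combine_cols and the ".A"/".B" suffixes are made-up names for illustration.

```r
dfA <- data.frame(Name = c("John", "James", "Peter"),
                  Score = c(2, 4, 0),
                  Response = c("1,0,0,1", "1,1,1,1", "0,0,0,0"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(Name = c("John", "James", "Peter"),
                  Score = c(3, 1, 4),
                  Response = c("0,1,1,1", "0,1,0,0", "1,1,1,1"),
                  stringsAsFactors = FALSE)

# Add numeric columns; comma-join everything else (paste() also
# handles factors, because it converts them to character first).
combine_cols <- function(a, b) {
  if (is.numeric(a)) a + b else paste(a, b, sep = ",")
}

# merge() aligns the rows by Name regardless of their original order.
m <- merge(dfA, dfB, by = "Name", suffixes = c(".A", ".B"))

cols <- c("Score", "Response")
dfNew <- data.frame(Name = m$Name, stringsAsFactors = FALSE)
for (n in cols) {
  dfNew[[n]] <- combine_cols(m[[paste0(n, ".A")]], m[[paste0(n, ".B")]])
}
dfNew
```

Note that merge() sorts the result by Name, so the rows come back as James, John, Peter rather than in the original order.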
This topic has probably been brought up before, and I guess the solution is quite simple. However, I have not managed to work it out so far.
Let's say I have a data.frame (called "data") which contains 10 individuals (id) on which I collected observations at 3 time points (T)
> data <- data.frame(id = rep(c(1:10), 3),
T = gl(3, 10),
X = sample(1:30),
Y = sample(c("yes", "no"), 30, replace = TRUE),
Z = sample(1:40, 30),
Z2 = rnorm(30, mean = 5, sd = 0.5))
> head(data)
id T X Y Z Z2
1 1 1 10 yes 15 5.993605
2 2 1 18 no 22 6.096566
3 3 1 5 no 24 5.101393
4 4 1 15 yes 18 4.944108
5 5 1 23 no 34 4.634176
6 6 1 13 no 27 5.576015
I would like to create a subset of this data.frame (a new data.frame called data2) by selecting only individuals that have "yes" (variable Y) at each of the three time points (variable T), that is, Y="yes" for T=1 and T=2 and T=3.
I know that conditions can be combined using the "&" sign, and that this can be used to relate the conditions for the 3 time points. However, my problem is writing the condition for each time point: how do I tell R that I want subjects for which Y="yes" at T="1", for example?
Thank you very much in advance to all.
Have a great day,
Denis
You can do:
keep.ids <- tapply(data$Y, data$id, FUN = function(x)all(x == "yes"))
subset(data, keep.ids[factor(id)])
Or use the plyr package:
library(plyr)
ddply(data, "id", function(x) if(all(x$Y == "yes")) x else NULL)
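A base R alternative is ave(), which computes a per-group result and returns it aligned with the original rows. The sketch below uses a small deterministic data set (dat, made up for illustration) rather than the random one from the question, so the outcome is easy to check by eye.

```r
# Hypothetical data: 3 individuals observed at 3 time points.
# Only individual 1 answers "yes" at every time point.
dat <- data.frame(id = rep(1:3, 3),
                  T  = gl(3, 3),
                  Y  = c("yes", "no",  "yes",
                         "yes", "yes", "yes",
                         "yes", "yes", "no"))

# For each id, check whether all of its Y values are "yes";
# ave() repeats that single TRUE/FALSE on every row of the id.
all_yes <- ave(dat$Y == "yes", dat$id, FUN = all)

# Keep only the rows of individuals that passed the check.
data2 <- dat[all_yes, ]
data2
```

This keeps all three rows of individual 1 and drops the others, without writing a separate condition per time point.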
I have a dataset that looks like the one below, and I would like to create a new variable based on these variables, which can be used with the other variables in the dataset.
The first variable, ID, is a respondent identification number. The med variable takes the values 1 and 2, indicating different treatments. Var1_v1 and Var1_v2 have four real options (1, 2, 3, or 9), and these options are only given to respondents with med == 1. If med == 2, NA appears in the Var1 variables. Var2 is NA when med == 1 and has real values ranging from 1 to 3 when med == 2.
ID <- c(1,2,3,4,5,6,7,8,9,10,11)
med <- c(1,1,1,1,1,1,2,2,2,2,2)
Var1_v1 <- c(2,2,3,9,9,9,NA,NA,NA,NA,NA) #ranges from 1-3, and 9
Var1_v2 <- c(9,9,9,1,3,2,NA,NA,NA,NA,NA) #ranges from 1-3, and 9
Var2 <- c(NA,NA,NA,NA,NA,NA,3,3,1,3,2)
#tables to show you what data looks like relative to med var
table(Var1_v1, med)
table(Var1_v2, med)
table(Var2, med)
I've been looking around for a while to figure out a recoding/new variable creation code, but I have had no luck.
Ultimately, I would like to create a new variable, say Var3, based on three conditions:
Uses the values from Var1_v1 if the value = 1, 2, or 3
Uses the values from Var1_v2 if the value = 1, 2, or 3
uses the values from Var2 if the values = 1, 2, or 3
And this variable should be able to match up with the ID number, so that it can be used within the dataset.
So the final variable should look like:
Var3 <- c(2,2,3,1,3,2,3,3,1,3,2)
Thanks!
Something like
v <- Var1_v1
v[Var1_v2 %in% 1:3] <- Var1_v2[Var1_v2 %in% 1:3]
v[Var2 %in% 1:3] <- Var2[Var2 %in% 1:3]
v
[1] 2 2 3 1 3 2 3 3 1 3 2
which uses one of them as a base (you could also use a pure NA vector) and simply fills in only parts that match.
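The same fill-in idea can be sketched starting from a vector that only contains valid codes, so that the 9s never appear in the result even if no other variable covers them. The helper name keep_valid and the data frame dat are made up for illustration; unlike the version above, earlier variables take priority here instead of later ones overwriting.

```r
ID      <- 1:11
med     <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
Var1_v1 <- c(2, 2, 3, 9, 9, 9, NA, NA, NA, NA, NA)
Var1_v2 <- c(9, 9, 9, 1, 3, 2, NA, NA, NA, NA, NA)
Var2    <- c(NA, NA, NA, NA, NA, NA, 3, 3, 1, 3, 2)

# Keep only the valid codes 1 to 3; everything else (9 or NA) becomes NA.
keep_valid <- function(x) ifelse(x %in% 1:3, x, NA)

# Start from the first variable, then fill the remaining gaps
# from each later variable in turn.
Var3 <- keep_valid(Var1_v1)
for (x in list(Var1_v2, Var2)) {
  fill <- is.na(Var3)
  Var3[fill] <- keep_valid(x)[fill]
}

# Storing the result next to ID keeps it matched to the respondents.
dat <- data.frame(ID, med, Var3)
dat$Var3
```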