I am trying to scale data from multiple subjects onto the same time-scale. The current data files have 3 months of data for each subject, but the time-stamps for each event for each subject reflect different begin-end dates.
df <- data.frame(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2))
df$Time <- c("2:34:00", "2:55:13", "5:23:23", "7:23:04", "9:18:18", "3:22:12", "4:23:02", "5:23:22", "9:30:02")
df$Date <- c("7/13/16", "7/13/16", "7/13/16", "7/14/16", "7/14/16", "1/02/14", "1/02/14", "1/03/14", "1/05/14")
df$widgets <- c(4, 6, 9, 18, 3, 3, 7, 9, 12)
I want to change the df to have a common time scale so that I have a date index that allows me to keep the same format like below:
df$ScaleDate <- c(1,1,1,2,2,1,1,2,4) #time scale is within-ID
First, for reference, here's the data I used:
df <- data.frame(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
Time = c("2:34:00", "2:55:13", "5:23:23", "7:23:04", "9:18:18", "3:22:12", "4:23:02", "5:23:22", "9:30:02"),
Date = c("7/13/16", "7/13/16", "7/13/16", "7/14/16", "7/14/16", "1/02/14", "1/02/14", "1/03/14", "1/05/14"),
widgets = c(4, 6, 9, 18, 3, 3, 7, 9, 12))
We can use dplyr syntax (not mandatory, but legible and simple) together with lubridate functions to do this very simply:
library(dplyr)
library(lubridate)
df %>%
mutate(Date = mdy(Date)) %>%
group_by(ID) %>%
mutate(ScaleDate = as.numeric((Date - Date[1]) + 1))
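mdy converts the values to Date objects, so we can do arithmetic with them. Within each ID, subtracting the first date gives the number of days elapsed; rows on the first date give "0 days". Hence, we add 1 and convert the result to numeric to get the index. Note that Date[1] is the first row of each group, so this assumes the rows are sorted by date within each ID.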
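As a cross-check, the same index can be sketched in base R without dplyr or lubridate. This is only an illustration: the name day_num is made up for this sketch, and using each ID's earliest date (rather than its first row) is an assumption that makes the result independent of row order.

```r
df <- data.frame(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                 Date = c("7/13/16", "7/13/16", "7/13/16", "7/14/16", "7/14/16",
                          "1/02/14", "1/02/14", "1/03/14", "1/05/14"))

# Parse the m/d/yy strings and express each date as a day count
day_num <- as.numeric(as.Date(df$Date, format = "%m/%d/%y"))

# Subtract each ID's earliest day, then add 1 so the first day is index 1
df$ScaleDate <- day_num - ave(day_num, df$ID, FUN = min) + 1
```

On the sample data this should reproduce the ScaleDate column requested in the question.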
I have a large data frame in R with over 200 participants who have answered 152 questions. Now, I want to insert a column based on a conditional query after each "Answer" column. As an example, I have the following data frame:
data <- data.frame(Participants = 1:5,
Answer1 = c(4, 6, 2, 2, 3),
Answer2 = c(5, 1, 3, 5, 4))
I now want to insert a new conditional column of "Confidence" after each "Answer" column. For the column of "Answer1", the query would look like this:
data$Confidence1 <- ifelse(data$Answer1 == 1 | data$Answer1 == 6, 2, ifelse(data$Answer1 == 2 | data$Answer1 == 5, 1, 0))
In the end, I want the data frame to look like this:
data <- data.frame(Participants = 1:5,
Answer1 = c(4, 6, 2, 2, 3),
Confidence1 = c(0, 2, 1, 1, 0),
Answer2 = c(5, 1, 3, 5, 4),
Confidence2 = c(1, 2, 0, 1, 0))
Does anyone have an idea on how to achieve this for all "Answer" columns at once? Thanks!
Here is one solution using the modify() function from the purrr package and the response by @r2evans to "merge two data table into one, with alternating columns in R":
library(purrr)
# create data, adding one more answer to your example
data <- data.frame(Participants = 1:5,
Answer1 = c(4, 6, 2, 2, 3),
Answer2 = c(5, 1, 3, 5, 4),
Answer3 = c(3, 2, 3, 4, 5))
# make a new df containing only the answer columns from data
answers <- data[,2:4]
# make confidence df and give it correct names
conf <- modify(answers, function(x) ifelse(x == 1 | x == 6, 2, ifelse(x == 2 | x == 5, 1, 0)))
names(conf) <- paste0("Confidence",1:ncol(answers))
# set order
neworder <- order(c(2*(seq_along(answers) - 1) + 1,
2*seq_along(conf)))
# put it together
cbind(Participants = data[,1], cbind(answers, conf)[,neworder])
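The "set order" step above may be the least obvious part, so here is a small base R sketch of it in isolation. The variable names (n, target_pos, interleave) are hypothetical and only used for illustration.

```r
n <- 3  # number of Answer columns (and of Confidence columns)

# Target positions in the final layout: answers go to the odd slots,
# confidences to the even slots
target_pos <- c(2 * (seq_len(n) - 1) + 1,  # 1, 3, 5
                2 * seq_len(n))            # 2, 4, 6

# order() returns the column indices of cbind(answers, conf)
# sorted by their target slot
interleave <- order(target_pos)
interleave
```

Indexing cbind(answers, conf) with this vector picks Answer1, Confidence1, Answer2, Confidence2, and so on.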
If you can forgive my interest in loops, I'd like to know how to loop through a vector of variable names (must be strings in my use case) and mutate the original columns. In this toy example, I want to calculate the mean of the column i plus z.
df_have <- data.frame(x=c(1, 1, 2, 3, 3),
y=c(2, 2, 3, 4, 4),
z=c(0, 1, 2, 3, 4))
for (i in c("x", "y")) {
df_test <-
df_have %>%
mutate(!!i := mean(i)+z)
}
df_want <- data.frame(x=c(2, 3, 4, 5, 6), # mean 2 + z
y=c(3, 4, 5, 6, 7), # mean 3 + z
z=c(0, 1, 2, 3, 4))
Well, if you want to do a loop, then
df_test <- df_have
for (i in c("x", "y")) {
df_test <-
df_test %>%
mutate(!!i := mean((!!as.name(i)))+z)
}
Note that you need to turn those strings into symbols in order to use them in the expression for mutate. An easier trick in this case would be
df_have %>% mutate_at(c("x","y"), funs(mean(.)+z))
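If the column names must stay as strings and a loop is preferred, a plain base R version avoids the quoting question altogether, because [[ ]] accepts a string directly. This is a minimal sketch using the toy data from the question, not a replacement for the dplyr approaches above.

```r
df_have <- data.frame(x = c(1, 1, 2, 3, 3),
                      y = c(2, 2, 3, 4, 4),
                      z = c(0, 1, 2, 3, 4))

df_test <- df_have
for (i in c("x", "y")) {
  # df_test[[i]] looks the column up by its string name
  df_test[[i]] <- mean(df_test[[i]]) + df_test$z
}
```

Since z is never overwritten, each iteration only depends on the original values of its own column.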
I have a list of 40 numeric vectors. My question is how can I do one box plot for each vector and group every 4 box plots to a single position on the x-axis.
That is the x-axis will have 10 ticks and every tick has 4 box plots. Additionally I would like to colour the box plots by their order in the group.
Here is a small sample of the data (two groups):
list(
c(1, 3, 3, 4, 2, 1, 1, 2),
c(2, 1, 1, 1, 1, 2, 33, 12),
c(5, 3, 2, 1, 2, 3, 4, 5),
c(4, 4, 2, 2, 2, 1, 3, 4),
c(2, 2, 21, 1, 4),
c(3, 3, 1, 2, 3),
c(2, 2, 3, 1, 1),
c(6, 5, 3, 4, 5)) ->
ll
To create boxplots from a list of numeric vectors and group them in sets of four, you need to do something like this:
library(tidyverse)
# ll is the list of vectors
map_df(ll,
function(x) data.frame(val = x),
.id = "row") %>%
mutate(
row = as.numeric(row) - 1,
grp = floor(row / 4) %>% as.factor,
method = (row %% 4) %>% as.factor) %>%
ggplot() +
geom_boxplot(aes(grp, val, fill=method), position = "dodge")
In brief: map_df converts each vector to a data frame with one column and row-binds them all, adding a column named row that identifies which vector each value came from. mutate converts row to a 0-based number and adds the grp and method columns based on the original vector index (every four vectors form one group, and the position within the group comes from the modulo). ggplot initializes the plot, and geom_boxplot adds a box plot layer that takes its data from the table.
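The index arithmetic can be checked on its own. A small sketch, assuming eight vectors as in the sample data (0-based indices 0 to 7):

```r
row <- 0:7                # 0-based index of each vector in the list

grp    <- floor(row / 4)  # which tick on the x-axis the vector belongs to
method <- row %% 4        # position within its group, used for the fill colour

data.frame(row, grp, method)
```

The first four vectors land in group 0 and the next four in group 1, with method cycling through 0 to 3 inside each group.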
Output for the sample data (a little ugly because of the outliers).
I have a data frame that looks like this. The names and number of columns will NOT be consistent (sometimes 'C' will not be present, other times 'D', 'E', 'F' may be present, etc.)
# name and number of columns varies...so need flexible process
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9 , 67, 6)
ABC <- data.frame(A, B, C)
I want to loop through each variable and collect various information. This is a simple example, but what I am doing will be more complicated. I say that so that somebody doesn't just recommend some sort of summary() type solution.
maximum_value <- max(A)
mean_value <- mean(A)
# lots of other calculations for A
ID = 'A'
tempA <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(B)
mean_value <- mean(B)
# lots of other calculations for B
ID = 'B'
tempB <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(C)
mean_value <- mean(C)
# lots of other calculations for C
ID = 'C'
tempC <- data.frame(ID, maximum_value, mean_value)
output <- rbind(tempA, tempB, tempC)
Here is my attempt at creating a loop to go through the variables one by one and aggregate output. I can't figure out how to get [i] to point at an individual column of the data frame ABC.
# initialize data frame
data__ <- data.frame(ID__ = as.character(),
max__ = as.numeric(),
mean__ = as.numeric())
# loop through A, then B, then C
for(i in A:C) {
ID__ <- '[i]'
max__ <- maximum[i]
mean__ <- mean[i]
data__temp <- (ID__, max__, mean__)
data__ <- rbind(data__, data__temp)
}
If I were doing this in SAS, I would use a select into within proc sql to create a list of the variable names, then write an array, and then I could loop through them that way, but there's something I'm missing here.
How would I tell R to do this process for each variable in the data frame?
If you use the tidyverse dplyr and tidyr packages, you can do
library(dplyr)
library(tidyr)
ABC %>% gather(ID, value) %>% group_by(ID) %>% summarize_all(list(mean = mean, max = max))
or
ABC %>% gather(ID, value) %>% group_by(ID) %>%
summarize(maximum_value = max(value), mean_value=mean(value))
If you'd rather write the calculations with ordinary functions, and there are a lot of "weird" ones, you can use purrr's map2_df function
library(purrr)
# a is the column's values, n is the column's name
map2_df(ABC, names(ABC), function(a, n) {
data.frame(ID=n, max_val=max(a), mean_val=mean(a))
})
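For the original question of how to point at one column at a time, a base R sketch may also help: iterate over names(ABC) and pull each column out with [[ ]]. The helper name summarise_column is made up for this illustration; any further per-column calculations would go inside it.

```r
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9, 67, 6)
ABC <- data.frame(A, B, C)

# Build a one-row summary for the column called nm
summarise_column <- function(nm) {
  col <- ABC[[nm]]  # extract the column by its name
  data.frame(ID = nm,
             maximum_value = max(col),
             mean_value = mean(col))
}

# Works for whatever columns happen to be present
output <- do.call(rbind, lapply(names(ABC), summarise_column))
```

Because the loop runs over names(ABC), it adapts automatically when columns are added or missing.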
I want to use the filter() function to find the types that have an x value less than or equal to 4, OR a y value greater than 5. I think this might be a simple fix, but I can't find much info in ?filter. I think I almost have it:
x = c(1, 2, 3, 4, 5, 6)
y = c(3, 6, 1, 9, 1, 1)
type = c("cars", "bikes", "trains")
df = data.frame(x, y, type)
df2 = df %>%
filter(x<=4)
Try
df %>%
filter(x <= 4 | y > 5)
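To see what the OR condition selects, here is a base R sketch on the same data, contrasting | (either condition) with & (both conditions). The names either and both are only for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(3, 6, 1, 9, 1, 1)
type <- c("cars", "bikes", "trains")  # recycled to length 6 by data.frame
df <- data.frame(x, y, type)

# Rows where x is at most 4 OR y is greater than 5
either <- df[df$x <= 4 | df$y > 5, ]

# Rows where both conditions hold at once
both <- df[df$x <= 4 & df$y > 5, ]
```

In dplyr, separating conditions with a comma inside filter() combines them with AND, so the | operator is needed to get the OR behaviour.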