I'm trying to move a large list with >200000 character from this:
startTime 1
max 3
min 1
EndTime 2
avg 2
startTime 2
max ..
min ..
EndTime ..
avg ..
..
to a dataframe like this:
startTime max min EndTime avg
1 3 1 2 2
2 .. .. .. ..
I managed it by looping it through a for-loop. It takes to much time. Is there a more sufficient way by not looping it through a for-loop?
Expanding your input data a bit you could use unstack from base R.
Input:
dat
# V1 V2
#1 startTime 1
#2 max 3
#3 min 1
#4 EndTime 2
#5 avg 2
#6 startTime 2
#7 max 3
#8 min 4
#9 EndTime 5
#10 avg 6
Result:
out <- unstack(dat, V2 ~ V1)
out
# avg EndTime max min startTime
#1 2 2 3 1 1
#2 6 5 3 4 2
If you want the column names in the same order as the they appear in dat$V1 do
out <- out[unique(dat$V1)]
data
dat <- structure(list(V1 = c("startTime", "max", "min", "EndTime", "avg",
"startTime", "max", "min", "EndTime", "avg"), V2 = c(1L, 3L,
1L, 2L, 2L, 2L, 3L, 4L, 5L, 6L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-10L))
simply tranform it
library( data.table )
dt <- data.table::fread(" startTime 1
max 3
min 1
EndTime 2
avg 2
startTime 2", header = FALSE)
as.data.table( t( dt ) )
# V1 V2 V3 V4 V5 V6
# 1: startTime max min EndTime avg startTime
# 2: 1 3 1 2 2 2
This is not an exact duplicate of How to reshape data from long to wide format? so I will answer.
First create a new column ID and then use one of the solutions in the duplicate. I will use the solution based on package reshape2.
pattern <- as.character(df1[1, 1])
ipat <- grep(pattern, df1[[1]])
df1$ID <- rep(seq_along(ipat), nrow(df1)/length(ipat))
library(reshape2)
result <- dcast(df1, ID ~ V1, value.var = "V2")[-1]
# avg EndTime max min startTime
#1 2 3 4 1 1
#2 1 2 3 2 2
Final clean up, put the input dataset df1 back as it were.
df1 <- df1[-ncol(df1)]
Data.
df1 <- read.table(text = "
startTime 1
max 3
min 1
EndTime 2
avg 2
startTime 2
max 4
min 2
EndTime 3
avg 1
")
Here are some alternatives. They do not use any packages.
Assume the input DF shown reproducibly in the Note at the end.
1) xtabs The first line of code converts the first column to character in case it is factor. We do not need this with the data shown in the Note but it doesn't hurt and might be useful if the column were factor so that it is in a known state.
Then convert the V1 column to a factor having levels in the order that appear so that they don't get rearranged upon output. Also define nicer names and create a Group number vector which numbers the first group of 5 rows as 1, the second group 2 and so on.
Finally use xtabs to create the desired table. If you prefer a data frame as the output rather than a table then use as.data.frame(xt).
DF2 <- transform(DF, V1 = as.character(V1))
DF2 <- transform(DF2, Stat = factor(V1, levels = V1[1:5]),
Value = V2,
Group = cumsum(V1== "startTime"))
xt <- xtabs(Value ~ Group + Stat, DF2)
xt
giving:
Stat
Group startTime max min EndTime avg
1 1 3 1 2 2
2 2 4 1 3 2
2) matrix Even shorter is this one-liner. It gives a matrix. Use as.data.frame(m) if you want a data frame.
m <- matrix(DF$V2,, 5, byrow = TRUE, list(NULL, DF$V1[1:5]))
m
giving:
startTime max min EndTime avg
[1,] 1 3 1 2 2
[2,] 2 4 1 3 2
Note
The input in reproducible form. I have added a few rows.
Lines <- "
startTime 1
max 3
min 1
EndTime 2
avg 2
startTime 2
max 4
min 1
EndTime 3
avg 2"
DF <- read.table(text = Lines, as.is = TRUE)
A tidyverse solution using #markus' data would be :
library(tidyverse)
dat %>%
group_by(tmp = cumsum(V1=="startTime")) %>%
spread(V1,V2) %>%
ungroup %>%
select(-tmp)
# # A tibble: 2 x 5
# avg EndTime max min startTime
# <int> <int> <int> <int> <int>
# 1 2 2 3 1 1
# 2 6 5 3 4 2
Related
I am trying to obtain counts of each combination of levels of two variables, "week" and "id". I'd like the result to have "id" as rows, and "week" as columns, and the counts as the values.
Example of what I've tried so far (tried a bunch of other things, including adding a dummy variable = 1 and then fun.aggregate = sum over that):
library(plyr)
ddply(data, .(id), dcast, id ~ week, value_var = "id",
fun.aggregate = length, fill = 0, .parallel = TRUE)
However, I must be doing something wrong because this function is not finishing. Is there a better way to do this?
Input:
id week
1 1
1 2
1 3
1 1
2 3
Output:
1 2 3
1 2 1 1
2 0 0 1
You could just use the table command:
table(data$id,data$week)
1 2 3
1 2 1 1
2 0 0 1
If "id" and "week" are the only columns in your data frame, you can simply use:
table(data)
# week
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
You don't need ddply for this. The dcast from reshape2 is sufficient:
dat <- data.frame(
id = c(rep(1, 4), 2),
week = c(1:3, 1, 3)
)
library(reshape2)
dcast(dat, id~week, fun.aggregate=length)
id 1 2 3
1 1 2 1 1
2 2 0 0 1
Edit : For a base R solution (other than table - as posted by Joshua Uhlrich), try xtabs:
xtabs(~id+week, data=dat)
week
id 1 2 3
1 2 1 1
2 0 0 1
The reason ddply is taking so long is that the splitting by group is not run in parallel (only the computations on the 'splits'), therefore with a large number of groups it will be slow (and .parallel = T) will not help.
An approach using data.table::dcast (data.table version >= 1.9.2) should be extremely efficient in time and memory. In this case, we can rely on default argument values and simply use:
library(data.table)
dcast(setDT(data), id ~ week)
# Using 'week' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
Or setting the arguments explicitly:
dcast(setDT(data), id ~ week, value.var = "week", fun = length)
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
For pre-data.table 1.9.2 alternatives, see edits.
A tidyverse option could be :
library(dplyr)
library(tidyr)
df %>%
count(id, week) %>%
pivot_wider(names_from = week, values_from = n, values_fill = list(n = 0))
#spread(week, n, fill = 0) #In older version of tidyr
# id `1` `2` `3`
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 1 1
#2 2 0 0 1
Using only pivot_wider -
tidyr::pivot_wider(df, names_from = week,
values_from = week, values_fn = length, values_fill = 0)
Or using tabyl from janitor :
janitor::tabyl(df, id, week)
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L), week = c(1L, 2L, 3L,
1L, 3L)), class = "data.frame", row.names = c(NA, -5L))
I am trying to obtain counts of each combination of levels of two variables, "week" and "id". I'd like the result to have "id" as rows, and "week" as columns, and the counts as the values.
Example of what I've tried so far (tried a bunch of other things, including adding a dummy variable = 1 and then fun.aggregate = sum over that):
library(plyr)
ddply(data, .(id), dcast, id ~ week, value_var = "id",
fun.aggregate = length, fill = 0, .parallel = TRUE)
However, I must be doing something wrong because this function is not finishing. Is there a better way to do this?
Input:
id week
1 1
1 2
1 3
1 1
2 3
Output:
1 2 3
1 2 1 1
2 0 0 1
You could just use the table command:
table(data$id,data$week)
1 2 3
1 2 1 1
2 0 0 1
If "id" and "week" are the only columns in your data frame, you can simply use:
table(data)
# week
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
You don't need ddply for this. The dcast from reshape2 is sufficient:
dat <- data.frame(
id = c(rep(1, 4), 2),
week = c(1:3, 1, 3)
)
library(reshape2)
dcast(dat, id~week, fun.aggregate=length)
id 1 2 3
1 1 2 1 1
2 2 0 0 1
Edit : For a base R solution (other than table - as posted by Joshua Uhlrich), try xtabs:
xtabs(~id+week, data=dat)
week
id 1 2 3
1 2 1 1
2 0 0 1
The reason ddply is taking so long is that the splitting by group is not run in parallel (only the computations on the 'splits'), therefore with a large number of groups it will be slow (and .parallel = T) will not help.
An approach using data.table::dcast (data.table version >= 1.9.2) should be extremely efficient in time and memory. In this case, we can rely on default argument values and simply use:
library(data.table)
dcast(setDT(data), id ~ week)
# Using 'week' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
Or setting the arguments explicitly:
dcast(setDT(data), id ~ week, value.var = "week", fun = length)
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
For pre-data.table 1.9.2 alternatives, see edits.
A tidyverse option could be :
library(dplyr)
library(tidyr)
df %>%
count(id, week) %>%
pivot_wider(names_from = week, values_from = n, values_fill = list(n = 0))
#spread(week, n, fill = 0) #In older version of tidyr
# id `1` `2` `3`
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 1 1
#2 2 0 0 1
Using only pivot_wider -
tidyr::pivot_wider(df, names_from = week,
values_from = week, values_fn = length, values_fill = 0)
Or using tabyl from janitor :
janitor::tabyl(df, id, week)
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L), week = c(1L, 2L, 3L,
1L, 3L)), class = "data.frame", row.names = c(NA, -5L))
I am trying to obtain counts of each combination of levels of two variables, "week" and "id". I'd like the result to have "id" as rows, and "week" as columns, and the counts as the values.
Example of what I've tried so far (tried a bunch of other things, including adding a dummy variable = 1 and then fun.aggregate = sum over that):
library(plyr)
ddply(data, .(id), dcast, id ~ week, value_var = "id",
fun.aggregate = length, fill = 0, .parallel = TRUE)
However, I must be doing something wrong because this function is not finishing. Is there a better way to do this?
Input:
id week
1 1
1 2
1 3
1 1
2 3
Output:
1 2 3
1 2 1 1
2 0 0 1
You could just use the table command:
table(data$id,data$week)
1 2 3
1 2 1 1
2 0 0 1
If "id" and "week" are the only columns in your data frame, you can simply use:
table(data)
# week
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
You don't need ddply for this. The dcast from reshape2 is sufficient:
dat <- data.frame(
id = c(rep(1, 4), 2),
week = c(1:3, 1, 3)
)
library(reshape2)
dcast(dat, id~week, fun.aggregate=length)
id 1 2 3
1 1 2 1 1
2 2 0 0 1
Edit : For a base R solution (other than table - as posted by Joshua Uhlrich), try xtabs:
xtabs(~id+week, data=dat)
week
id 1 2 3
1 2 1 1
2 0 0 1
The reason ddply is taking so long is that the splitting by group is not run in parallel (only the computations on the 'splits'), therefore with a large number of groups it will be slow (and .parallel = T) will not help.
An approach using data.table::dcast (data.table version >= 1.9.2) should be extremely efficient in time and memory. In this case, we can rely on default argument values and simply use:
library(data.table)
dcast(setDT(data), id ~ week)
# Using 'week' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
Or setting the arguments explicitly:
dcast(setDT(data), id ~ week, value.var = "week", fun = length)
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
For pre-data.table 1.9.2 alternatives, see edits.
A tidyverse option could be :
library(dplyr)
library(tidyr)
df %>%
count(id, week) %>%
pivot_wider(names_from = week, values_from = n, values_fill = list(n = 0))
#spread(week, n, fill = 0) #In older version of tidyr
# id `1` `2` `3`
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 1 1
#2 2 0 0 1
Using only pivot_wider -
tidyr::pivot_wider(df, names_from = week,
values_from = week, values_fn = length, values_fill = 0)
Or using tabyl from janitor :
janitor::tabyl(df, id, week)
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L), week = c(1L, 2L, 3L,
1L, 3L)), class = "data.frame", row.names = c(NA, -5L))
I have data like
subject date number
1 1/2/01 4
1 3/2/01 6
1 10/2/01 7
2 1/1/01 2
2 4/1/01 3
I want to get R to work out the number of days since the first sample for each subject. eg:
Subject days
1 0
1 2
1 9
2 0
2 3
How can I do this? I have converted the dates using lubridate.
SOmething like:
for(i in 1:nrow(data)){
if(data$date[i] != data$date[i -1]) {
data$timeline <- data$date[i] - data$date[i-1]
}
}
I get the error:
argument is of length 0 - I think the problem is the first line where there is no preceeding row..?
I would use dplyr to do some grouping and data manipulation. Note that we first have to convert your date into something R will recognize as a date.
library(dplyr)
dat$Date <- as.Date(dat$date, '%d/%m/%y')
dat %>%
group_by(subject) %>%
mutate(days = Date - min(Date))
# subject date number Date days
# <int> <chr> <int> <date> <time>
# 1 1 1/2/01 4 2001-02-01 0
# 2 1 3/2/01 6 2001-02-03 2
# 3 1 10/2/01 7 2001-02-10 9
# 4 2 1/1/01 2 2001-01-01 0
# 5 2 4/3/01 3 2001-03-04 62
here's the data:
dat <- structure(list(subject = c(1L, 1L, 1L, 2L, 2L), date = c("1/2/01",
"3/2/01", "10/2/01", "1/1/01", "4/3/01"), number = c(4L, 6L,
7L, 2L, 3L), Date = structure(c(11354, 11356, 11363, 11323, 11385
), class = "Date")), .Names = c("subject", "date", "number",
"Date"), row.names = c(NA, -5L), class = "data.frame")
Using the input shown in the note convert the date column to Date class (assuming that it is in the form dd/mm/yy) and then use ave to subtract the least date from all the dates for each subject. If the input is sorted as in the question we could optionally use x[1] instead of min(x). No packages are used.
data$date <- as.Date(data$date, "%d/%m/%y")
diff1 <- function(x) x - min(x)
with(data, data.frame(subject, days = ave(as.numeric(date), subject, FUN = diff1)))
giving:
subject days
1 1 0
2 1 2
3 1 9
4 2 0
5 2 62
Note
The input used, in reproducible form, is:
Lines <- "
subject date number
1 1/2/01 4
1 3/2/01 6
1 10/2/01 7
2 1/1/01 2
2 4/3/01 3"
data <- read.table(text = Lines, header = TRUE)
I am trying to obtain counts of each combination of levels of two variables, "week" and "id". I'd like the result to have "id" as rows, and "week" as columns, and the counts as the values.
Example of what I've tried so far (tried a bunch of other things, including adding a dummy variable = 1 and then fun.aggregate = sum over that):
library(plyr)
ddply(data, .(id), dcast, id ~ week, value_var = "id",
fun.aggregate = length, fill = 0, .parallel = TRUE)
However, I must be doing something wrong because this function is not finishing. Is there a better way to do this?
Input:
id week
1 1
1 2
1 3
1 1
2 3
Output:
1 2 3
1 2 1 1
2 0 0 1
You could just use the table command:
table(data$id,data$week)
1 2 3
1 2 1 1
2 0 0 1
If "id" and "week" are the only columns in your data frame, you can simply use:
table(data)
# week
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
You don't need ddply for this. The dcast from reshape2 is sufficient:
dat <- data.frame(
id = c(rep(1, 4), 2),
week = c(1:3, 1, 3)
)
library(reshape2)
dcast(dat, id~week, fun.aggregate=length)
id 1 2 3
1 1 2 1 1
2 2 0 0 1
Edit : For a base R solution (other than table - as posted by Joshua Uhlrich), try xtabs:
xtabs(~id+week, data=dat)
week
id 1 2 3
1 2 1 1
2 0 0 1
The reason ddply is taking so long is that the splitting by group is not run in parallel (only the computations on the 'splits'), therefore with a large number of groups it will be slow (and .parallel = T) will not help.
An approach using data.table::dcast (data.table version >= 1.9.2) should be extremely efficient in time and memory. In this case, we can rely on default argument values and simply use:
library(data.table)
dcast(setDT(data), id ~ week)
# Using 'week' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
Or setting the arguments explicitly:
dcast(setDT(data), id ~ week, value.var = "week", fun = length)
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
For pre-data.table 1.9.2 alternatives, see edits.
A tidyverse option could be :
library(dplyr)
library(tidyr)
df %>%
count(id, week) %>%
pivot_wider(names_from = week, values_from = n, values_fill = list(n = 0))
#spread(week, n, fill = 0) #In older version of tidyr
# id `1` `2` `3`
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 1 1
#2 2 0 0 1
Using only pivot_wider -
tidyr::pivot_wider(df, names_from = week,
values_from = week, values_fn = length, values_fill = 0)
Or using tabyl from janitor :
janitor::tabyl(df, id, week)
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L), week = c(1L, 2L, 3L,
1L, 3L)), class = "data.frame", row.names = c(NA, -5L))