What finally worked was:
a <- cast(we, year ~ region, mean, value='response')
However, I only have one observation per region and site, so mean is just a workaround; I couldn't get c to work as the aggregation function.
Output for suggested answer (by Justin)
> DT
     response year
  1:       15 2000
  2:        6 2000
  3:       23 2000
  4:       23 2000
 ---
794:        3 2010
795:        5 2010
796:        1 2010
Update: desired output should look like:
Year x1 x2 x3 x4
2000  4  5 16 22
2001  6 11  2 18
2002  1  0 21 10
...
I am struggling to find a way to transpose my data based on factor levels. I have data with 2 columns, a factor and a response. I have many rows for each factor, so I want to transpose the table such that each factor is on one row, with the different responses as a column in that row. I cannot seem to subset within a loop based on levels of that factor. I would appreciate any insight.
example of data:
response year
       5 2001
      10 2001
       8 2001
       1 2002
       7 2010
> levels(data$year)
[1] "2000" "2001" "2002" "2003" "2004" "2005" ...
w <- matrix(0,54,15)
for(i in 1:levels(data$year)){
w[i] <- levels(data$year)==i
}
This syntax is obviously not correct, but it is the idea of what I'm trying to accomplish.
Thank you.
Using the data.table package this is trivial:
library(data.table)
DT <- data.table(data)
DT[, as.list(value), by=year]
However, this will fall apart if you have different numbers of observations per year. Instead:
DT[, list(values = list(value)), by=year]
Or using base R:
tapply(data$value, data$year, c)
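For example, with a small toy dataset (values are made up for illustration), tapply returns a list of response vectors, one per year, even when the counts per year differ:

```r
# toy data with unequal numbers of observations per year
data <- data.frame(year = factor(c(2000, 2000, 2000, 2001, 2001)),
                   value = c(4, 5, 16, 6, 11))
# c() as the aggregation function just collects the values per group
res <- tapply(data$value, data$year, c)
res[["2000"]]  # 4 5 16
res[["2001"]]  # 6 11
```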
Here's another way, using aggregate:
> set.seed(1)
> data <- data.frame(year = rep(2000:2010, each=10), value = sample(3:30, 110, TRUE))
> aggregate(value~year, data=data, FUN=c)
year value.1 value.2 value.3 value.4 value.5 value.6 value.7 value.8 value.9 value.10
1 2000 10 13 19 28 8 28 29 21 20 4
2 2001 8 7 22 13 24 16 23 30 13 24
3 2002 29 8 21 6 10 13 3 13 27 12
4 2003 16 19 16 8 26 21 25 6 23 14
5 2004 25 21 24 18 17 25 3 16 23 22
6 2005 16 27 15 9 4 5 11 17 21 14
7 2006 28 11 15 12 21 10 16 24 5 27
8 2007 12 26 12 12 16 27 27 13 24 29
9 2008 15 22 14 12 24 8 22 6 9 7
10 2009 9 4 20 27 24 25 15 14 25 19
11 2010 21 12 10 30 20 8 6 16 28 19
If I had a different number of responses per year, I would probably come at this problem by first making a new variable to represent the response in each year and then casting that dataset out using dcast. By default dcast fills in missing values with NA, although you can change that if needed.
set.seed(1)
data = data.frame(year = c(rep(2000:2010, each=10), 2011), value = sample(3:30, 111, TRUE))
require(reshape2)
require(plyr)
# Create a new variable representing the number of responses per year and add to dataset
dat2 = ddply(data, .(year), transform,
             response = interaction("x", 1:length(value), sep = ""))
dcast(dat2, year ~ response, value.var = "value")
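The same per-year counter can also be built without plyr using base R's ave; this is a sketch under the same assumptions (the dcast call itself is unchanged):

```r
set.seed(1)
data <- data.frame(year = c(rep(2000:2010, each = 10), 2011),
                   value = sample(3:30, 111, TRUE))
# within-year counter x1, x2, ... playing the role of the interaction() call
data$response <- paste0("x", ave(seq_along(data$value), data$year, FUN = seq_along))
# then, as above: dcast(data, year ~ response, value.var = "value")
```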
I have a data frame called df that looks like:
> df
Date A B C
1 2001 1 12 14
2 2002 2 13 15
3 2003 3 14 16
4 2004 4 15 17
5 2005 5 16 18
6 2006 6 17 19
7 2007 7 18 20
8 2008 8 19 21
9 2009 9 20 22
10 2010 10 21 23
and a matrix called index that looks like:
> index
Resample01 Resample02 Resample03 Resample04 Resample05
[1,] 1 7 1 2 7
[2,] 3 9 2 3 8
[3,] 5 1 3 8 1
[4,] 8 3 4 9 4
[5,] 10 4 5 10 9
The numbers in each column stand for the row numbers to be selected.
The aim is to split the data frame into two exclusive groups, "train" and "test", according to the row numbers in each column of the matrix "index". For example, for "Resample01" the result should look like:
> train
Date A B C
1 2001 1 12 14
3 2003 3 14 16
5 2005 5 16 18
8 2008 8 19 21
10 2010 10 21 23
and
> test
Date A B C
2 2002 2 13 15
4 2004 4 15 17
6 2006 6 17 19
7 2007 7 18 20
9 2009 9 20 22
and this process should be done for each column in "index", and the results should be saved in two lists, "train" and "test", in which "train" is like:
$train1
Date A B C
1 2001 1 12 14
3 2003 3 14 16
5 2005 5 16 18
8 2008 8 19 21
10 2010 10 21 23
$train2
:
:
$train5
and "test" should be in the same format.
Note that my df actually contains 43,000 observations and the index matrix has 2,000 columns and more than 20,000 rows. I know that subsetting for one column is easy:
test = df[-c(index[,1]),]
but I don't know how to do (or loop) it for multiple columns, and saving the results into lists also seems difficult.
You could try something like this. The result will be of length ncol(index), and each element will hold two list elements: a training and a testing dataset.
apply(index, MARGIN = 2, FUN = function(x, data) {
  # x is "demoted" from a matrix column to a plain vector
  list(train = data[x, ], test = data[-x, ])
}, data = df)
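A minimal runnable check with toy data (values invented for illustration) shows the shape of the result: a list with one element per resample, each holding disjoint train and test frames.

```r
df <- data.frame(Date = 2001:2010, A = 1:10)
# two resample columns of row numbers, as in the question
index <- matrix(c(1, 3, 5, 8, 10,
                  7, 9, 1, 3, 4), ncol = 2)
res <- apply(index, MARGIN = 2, FUN = function(x, data) {
  list(train = data[x, ], test = data[-x, ])
}, data = df)
length(res)           # 2, one element per resample column
nrow(res[[1]]$train)  # 5
nrow(res[[1]]$test)   # 5
```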
The solution from akrun solves my problem. Using @Roman Luštrik's code:
listofsample = apply(index, MARGIN = 2, FUN = function(x, data) {
  list(train = df[x, ], test = df[-x, ])
}, data = df)
followed by this code from akrun:
train = sapply(listofsample, `[`,1)
test = sapply(listofsample, `[`,2)
This produces the two lists that I wanted.
I need to find the first two maximum values in each row of a table and print their column names in a new column. Below is the data set:
ID Fail1 Fail2 Fail3 Fail4
43324 10 5 4 9
42059 12 7 6 11
43321 14 9 8 13
43414 16 11 10 15
41517 18 13 12 17
43711 20 15 14 19
55675 22 17 16 21
55769 24 19 18 23
55631 26 21 20 25
Now, for every ID, I need the names of the first and second highest Fail causes concatenated into a new column added to the same table.
Here is an approach which reshapes from wide to long format, picks the two max values for each ID and appends the respective column names as a new column to the original data.frame (using join):
library(data.table)
DF[melt(DF, id.var = "ID")[order(-value), .(top = toString(variable[1:2])), by = ID],
on = "ID"]
ID Fail1 Fail2 Fail3 Fail4 top
1: 55631 26 21 20 25 Fail1, Fail4
2: 55769 24 19 18 23 Fail1, Fail4
3: 55675 22 17 16 21 Fail1, Fail4
4: 43711 20 15 14 19 Fail1, Fail4
5: 41517 18 13 12 17 Fail1, Fail4
6: 43414 16 11 10 15 Fail1, Fail4
7: 43321 14 9 8 13 Fail1, Fail4
8: 42059 12 7 6 11 Fail1, Fail4
9: 43324 10 5 4 9 Fail1, Fail4
Data
library(data.table)
DF <- fread(
"ID Fail1 Fail2 Fail3 Fail4
43324 10 5 4 9
42059 12 7 6 11
43321 14 9 8 13
43414 16 11 10 15
41517 18 13 12 17
43711 20 15 14 19
55675 22 17 16 21
55769 24 19 18 23
55631 26 21 20 25"
)
Something like this, assuming your data frame is called dat:
dat$top2 = apply(dat[ , grepl("Fail", names(dat))], 1, function(r) {
paste(names(r)[which(rank(-r, ties.method="first") %in% c(1:2))], collapse=", ")
})
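Applied to a couple of rows from the sample data, this gives the same result as the data.table answer; a quick self-contained check:

```r
dat <- data.frame(ID = c(43324, 42059),
                  Fail1 = c(10, 12), Fail2 = c(5, 7),
                  Fail3 = c(4, 6), Fail4 = c(9, 11))
dat$top2 <- apply(dat[, grepl("Fail", names(dat))], 1, function(r) {
  # pick the names of the two largest values in the row
  paste(names(r)[which(rank(-r, ties.method = "first") %in% 1:2)], collapse = ", ")
})
dat$top2  # "Fail1, Fail4" "Fail1, Fail4"
```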
Because ties.method = "first" breaks ties by column position, exactly two column names are always returned; when values tie for first or second place, the leftmost of the tied columns wins. If you instead want all columns that tie for first, use ties.method = "min".
Here's a tidyverse version of @Uwe's answer:
library(tidyverse)
dat = dat %>% left_join(
  dat %>%
    gather(key, value, -ID) %>%
    arrange(desc(value)) %>%
    group_by(ID) %>%
    slice(1:2) %>%
    summarise(top2 = paste(key, collapse = ", "))
)
I want to calculate the time elapsed between two events. The first five columns give the date-time of the incident; the remaining five give the date-time of the death.
dat <- read.table(header=TRUE, text="
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32")
dat
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32
I want to calculate the difference in time (in minutes). The following code is not working; the parsed timestamp should look like 2013-04-06 04:08.
library(lubridate)
dat$tstamp1 <- mdy(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE,sep = "-"))
dat$tstamp2 <- mdy(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"))
dat$diff <- dat$tstamp2 -dat$tstamp2 ### want the difference in minutes
In order to parse a date/time string of the "-"-separated format you're creating, you'll need to give a custom format, and pass it to parse_date_time. For example:
parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
Your new code would therefore look like:
library(lubridate)
dat$tstamp1 <- parse_date_time(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
dat$tstamp2 <- parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
Then the following will get you the time difference in minutes (using difftime with an explicit units argument, so the result does not depend on the units R picks automatically):
dat$diff <- as.numeric(difftime(dat$tstamp2, dat$tstamp1, units = "mins"))
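The units of a plain POSIXct subtraction depend on the size of the gap, which is why asking for minutes explicitly matters; a self-contained check on the second row of the sample data:

```r
# incident 2013-02-03 21:24, death 2013-02-04 23:14
t1 <- as.POSIXct("2013-02-03 21:24", format = "%Y-%m-%d %H:%M", tz = "UTC")
t2 <- as.POSIXct("2013-02-04 23:14", format = "%Y-%m-%d %H:%M", tz = "UTC")
as.numeric(difftime(t2, t1, units = "mins"))  # 1550 (25 h 50 min)
```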
You can try this, using only base R (no lubridate needed):
dat$tstamp1 <- strptime(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, sep = "-"), "%Y-%m-%d-%H-%M")
dat$tstamp2 <- strptime(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"), "%Y-%m-%d-%H-%M")
dat$diff <- as.numeric(difftime(as.POSIXct(dat$tstamp2), as.POSIXct(dat$tstamp1), units = "mins"))
Using strptime is fast and a bit safer against unexpected data, since entries that fail to parse become NA instead of raising an error; see ?strptime for details.
I want to reshape a wide format dataset that has multiple tests which are measured at 3 time points:
ID Test Year Fall Spring Winter
1 1 2008 15 16 19
1 1 2009 12 13 27
1 2 2008 22 22 24
1 2 2009 10 14 20
2 1 2008 12 13 25
2 1 2009 16 14 21
2 2 2008 13 11 29
2 2 2009 23 20 26
3 1 2008 11 12 22
3 1 2009 13 11 27
3 2 2008 17 12 23
3 2 2009 14 9 31
into a data set that separates the tests by column but converts the measurement time into long format, for each of the new columns like this:
ID Year Time Test1 Test2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20
2 2008 Fall 12 13
2 2008 Spring 13 11
2 2008 Winter 25 29
2 2009 Fall 16 23
2 2009 Spring 14 20
2 2009 Winter 21 26
3 2008 Fall 11 17
3 2008 Spring 12 12
3 2008 Winter 22 23
3 2009 Fall 13 14
3 2009 Spring 11 9
3 2009 Winter 27 31
I have tried reshape and melt without success; existing posts only address transforming to a single-column outcome.
Using reshape2:
# Thanks to Ista for helping with direct naming using "variable.name"
df.m <- melt(df, id.var = c("ID", "Test", "Year"), variable.name = "Time")
df.m <- transform(df.m, Test = paste0("Test", Test))
dcast(df.m, ID + Year + Time ~ Test, value.var = "value")
Update: Using data.table melt/cast from versions >= 1.9.0:
From version 1.9.0, data.table imports the reshape2 package and implements fast melt and dcast methods in C for data.tables. A comparison of speed on bigger data is shown below.
For more details, see the package NEWS file.
require(data.table) ## ver. >=1.9.0
require(reshape2)
dt <- as.data.table(df, key=c("ID", "Test", "Year"))
dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
dt.m[, Test := paste0("Test", Test)]
dcast.data.table(dt.m, ID + Year + Time ~ Test, value.var = "value")
At the moment, you'll have to write dcast.data.table explicitly, as dcast is not an S3 generic in reshape2 yet.
Benchmarking on bigger data:
# generate data:
set.seed(45L)
DT <- data.table(ID = sample(1e2, 1e7, TRUE),
Test = sample(1e3, 1e7, TRUE),
Year = sample(2008:2014, 1e7,TRUE),
Fall = sample(50, 1e7, TRUE),
Spring = sample(50, 1e7,TRUE),
Winter = sample(50, 1e7, TRUE))
DF <- as.data.frame(DT)
reshape2 timings:
reshape2_melt <- function(df) {
df.m <- melt(df, id.var = c("ID", "Test", "Year"), variable.name = "Time")
}
# min. of three consecutive runs
system.time(df.m <- reshape2_melt(DF))
# user system elapsed
# 43.319 4.909 48.932
df.m <- transform(df.m, Test = paste0("Test", Test))
reshape2_cast <- function(df) {
dcast(df.m, ID + Year + Time ~ Test, value.var = "value")
}
# min. of three consecutive runs
system.time(reshape2_cast(df.m))
# user system elapsed
# 57.728 9.712 69.573
data.table timings:
DT_melt <- function(dt) {
dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
}
# min. of three consecutive runs
system.time(dt.m <- DT_melt(DT))
# user system elapsed
# 0.276 0.001 0.279
dt.m[, Test := paste0("Test", Test)]
DT_cast <- function(dt) {
dcast.data.table(dt.m, ID + Year + Time ~ Test, value.var = "value")
}
# min. of three consecutive runs
system.time(DT_cast(dt.m))
# user system elapsed
# 12.732 0.825 14.006
melt.data.table is ~175x faster than reshape2:::melt, and dcast.data.table is ~5x faster than reshape2:::dcast.
Sticking with base R, this is another good candidate for the "stack + reshape" routine. Assuming our dataset is called "mydf":
mydf.temp <- data.frame(mydf[1:3], stack(mydf[4:6]))
mydf2 <- reshape(mydf.temp, direction = "wide",
idvar=c("ID", "Year", "ind"),
timevar="Test")
names(mydf2) <- c("ID", "Year", "Time", "Test1", "Test2")
mydf2
# ID Year Time Test1 Test2
# 1 1 2008 Fall 15 22
# 2 1 2009 Fall 12 10
# 5 2 2008 Fall 12 13
# 6 2 2009 Fall 16 23
# 9 3 2008 Fall 11 17
# 10 3 2009 Fall 13 14
# 13 1 2008 Spring 16 22
# 14 1 2009 Spring 13 14
# 17 2 2008 Spring 13 11
# 18 2 2009 Spring 14 20
# 21 3 2008 Spring 12 12
# 22 3 2009 Spring 11 9
# 25 1 2008 Winter 19 24
# 26 1 2009 Winter 27 20
# 29 2 2008 Winter 25 29
# 30 2 2009 Winter 21 26
# 33 3 2008 Winter 22 23
# 34 3 2009 Winter 27 31
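As the output shows, the rows come out grouped by season rather than by ID and Year; if the ordering of the desired output matters, a final sort fixes it. Here is a self-contained sketch using just the ID = 1 rows of the example data:

```r
mydf <- data.frame(ID = 1, Test = rep(1:2, each = 2), Year = rep(2008:2009, 2),
                   Fall = c(15, 12, 22, 10), Spring = c(16, 13, 22, 14),
                   Winter = c(19, 27, 24, 20))
# stack the three season columns, then widen by Test
mydf.temp <- data.frame(mydf[1:3], stack(mydf[4:6]))
mydf2 <- reshape(mydf.temp, direction = "wide",
                 idvar = c("ID", "Year", "ind"), timevar = "Test")
names(mydf2) <- c("ID", "Year", "Time", "Test1", "Test2")
# restore the ID/Year ordering of the desired output
mydf2 <- mydf2[order(mydf2$ID, mydf2$Year), ]
```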
An alternative using the base reshape function is below. Though this requires calling reshape twice, there might be a simpler way.
Assuming your dataset is called df1
tmp <- reshape(df1,idvar=c("ID","Year"),timevar="Test",direction="wide")
result <- reshape(
tmp,
idvar=c("ID","Year"),
varying=list(3:5,6:8),
v.names=c("Test1","Test2"),
times=c("Fall","Spring","Winter"),
direction="long"
)
Which gives:
> result
ID Year time Test1 Test2
1.2008.Fall 1 2008 Fall 15 22
1.2009.Fall 1 2009 Fall 12 10
2.2008.Fall 2 2008 Fall 12 13
2.2009.Fall 2 2009 Fall 16 23
3.2008.Fall 3 2008 Fall 11 17
3.2009.Fall 3 2009 Fall 13 14
1.2008.Spring 1 2008 Spring 16 22
1.2009.Spring 1 2009 Spring 13 14
2.2008.Spring 2 2008 Spring 13 11
2.2009.Spring 2 2009 Spring 14 20
3.2008.Spring 3 2008 Spring 12 12
3.2009.Spring 3 2009 Spring 11 9
1.2008.Winter 1 2008 Winter 19 24
1.2009.Winter 1 2009 Winter 27 20
2.2008.Winter 2 2008 Winter 25 29
2.2009.Winter 2 2009 Winter 21 26
3.2008.Winter 3 2008 Winter 22 23
3.2009.Winter 3 2009 Winter 27 31
tidyverse/tidyr solution:
library(dplyr)
library(tidyr)
df %>%
gather("Time", "Value", Fall, Spring, Winter) %>%
spread(Test, Value, sep = "")
I have a data frame to which I am adding columns, and I would like one column to sum them all up. I will not know the names of the columns ahead of time, so I need something that counts the columns and sums them regardless of their names.
If my data is like this:
w=1:10
x=11:20
z=data.frame(w,x)
I would like the total for z$w and z$x. But then if I were to add z$y, I would like to have that incorporated into the sum as well.
You should consider not adding a column for the sum, and just call rowSums(z) whenever you need it. That removes the hassle of having to update the column whenever you modify your data.frame.
Now if that's really what you want, here is a little function that will update the sum and always keep it as the last column. You'll have to run it every time you make a change to your data.frame:
> refresh.total <- function(df) {
+ df$total <- NULL
+ df$total <- rowSums(df)
+ return(df)
+ }
>
> z <- refresh.total(z)
> z
w x total
1 1 11 12
2 2 12 14
3 3 13 16
4 4 14 18
5 5 15 20
6 6 16 22
7 7 17 24
8 8 18 26
9 9 19 28
10 10 20 30
>
> z$y <- 2:11
> z <- refresh.total(z)
> z
w x y total
1 1 11 2 14
2 2 12 3 17
3 3 13 4 20
4 4 14 5 23
5 5 15 6 26
6 6 16 7 29
7 7 17 8 32
8 8 18 9 35
9 9 19 10 38
10 10 20 11 41
After you've finished adding in all the columns, you can do:
z$total <- rowSums(z)
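One caveat with that approach: if a total column already exists, rowSums would count it into the new sum. Excluding it by name avoids this (the same idea as setting df$total to NULL in refresh.total above); a small sketch:

```r
z <- data.frame(w = 1:3, x = 11:13)
z$total <- rowSums(z)
z$y <- 2:4
# recompute the total, excluding the stale total column itself
z$total <- rowSums(z[, names(z) != "total"])
z$total  # 14 17 20
```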