Use plyr to make a 2d summary table [duplicate] - r

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 7 years ago.
I have a dataset with columns ID, Score and Average age, using ddply(), I get the following table
ddply(data, .(id, score), summarize, group_mean = round(mean(avg_age), 1))
id score group_mean
1 101 0 61.8
2 101 5 70.3
3 101 10 62.2
4 2 0 41.0
5 2 5 40.4
6 2 10 44.5
7 23 0 52.0
8 23 5 52.6
9 25 0 74.5
10 25 5 55.2
11 25 10 48.0
12 28 0 53.4
13 28 5 49.5
14 3 0 41.3
15 3 5 47.8
16 3 10 46.6
17 4 0 53.3
18 4 5 54.2
19 4 10 55.3
20 X 0 72.0
21 X 5 57.1
22 X 10 53.4
What shoud I do if I want the table to look like a pivot table with the rows as id and columns as score? Namely,
0 5 10
101 61.8 70.3 62.2
2 41.0 40.4 44.5
...

We can use spread from tidyr. If out is the result obtained from summarize output from ddply
library(tidyr)
spread(out, score, group_mean)
Or acast from reshape2
library(reshape2)
acast(out, id~score, value.var='group_mean')
Or using base R
xtabs(group_mean~id+score, out)

Related

How to transfer a column from a dataset sharing the same one with another one

I have two versions of datasets sharing the same columns (more or less). Let's take as an example
db = airquality
db1 = airquality[,-c(6)]
db1$Ozone[db1$Ozone < 30] <- 24
db1$Month[db1$Month == 5] <- 24
db
db1
If I would like to transfer two columns 'Ozone' and 'Wind' from the dataset 'db1' to the 'db' dataset by writing a code using the pipe operator %>% or another iterative method to achieve this result, which code you may possibly suggest?
Thanks
You csn do:
library(dplyr)
db1 %>%
select(Ozone, Wind) %>%
bind_cols(db)
Note that in this example, since some column names will be duplicated in the final result, dplyr will automatically rename the duplicates by appending numbers to the end of the column names.
Base R:
cbind(db, db1[,c(1,3)])
Ozone Solar.R Wind Temp Month Day Ozone Wind
1 41 190 7.4 67 5 1 41 7.4
2 36 118 8.0 72 5 2 36 8.0
3 12 149 12.6 74 5 3 24 12.6
4 18 313 11.5 62 5 4 24 11.5
5 NA NA 14.3 56 5 5 NA 14.3
6 28 NA 14.9 66 5 6 24 14.9
7 23 299 8.6 65 5 7 24 8.6
8 19 99 13.8 59 5 8 24 13.8
9 8 19 20.1 61 5 9 24 20.1
10 NA 194 8.6 69 5 10 NA 8.6
11 7 NA 6.9 74 5 11 24 6.9
12 16 256 9.7 69 5 12 24 9.7
.
.
.

Rolling Row Subtraction [duplicate]

This question already has answers here:
Calculate difference between values in consecutive rows by group
(4 answers)
Closed 7 years ago.
I am looking to perform a row subtraction, where I have a group of individuals and I want to subtract the more recent row from the row above it (like a rolling row subtraction). Does anyone know a simple way to do this?
The data would look something like this:
Name Day variable.1
1 Bob 1 43.4
2 Bob 2 32.0
3 Bob 3 18.1
4 Bob 4 41.2
5 Bob 5 85.2
6 Jeff 1 17.4
7 Jeff 2 55.6
8 Jeff 3 58.7
9 Jeff 4 40.6
10 Jeff 5 77.3
11 Carl 1 52.9
12 Carl 2 71.7
13 Carl 3 84.3
14 Carl 4 54.8
15 Carl 5 69.7
For example, for Bob, I would like it to come out as:
Name Day variable.1
1 Bob 1 NA
2 Bob 2 -11.4
3 Bob 3 -13.9
4 Bob 4 23.1
5 Bob 5 44
And then it would go to the next name and perform the same task.
You could try
library(data.table)#v1.9.5+
setDT(df1)[,variable.1:=c(NA,diff(variable.1)) , Name]
Or using shift from the devel version of data.table (as suggested by #Jan Gorecki). Instructions to install are here
setDT(df1)[, variable.1 := variable.1- shift(variable.1), Name]
You can use the base ave() function. For example, if your data is in a data.frame named dd,
dd$newcol <-ave(dd$variable.1, dd$Name, FUN=function(x) c(NA,diff(x)))
You can also try:
library(dplyr)
df %>% group_by(Name) %>% mutate(diff = variable.1-lag(variable.1))
Source: local data frame [15 x 4]
Groups: Name
Name Day variable.1 diff
1 Bob 1 43.4 NA
2 Bob 2 32.0 -11.4
3 Bob 3 18.1 -13.9
4 Bob 4 41.2 23.1
5 Bob 5 85.2 44.0
6 Jeff 1 17.4 NA
7 Jeff 2 55.6 38.2
8 Jeff 3 58.7 3.1
9 Jeff 4 40.6 -18.1
10 Jeff 5 77.3 36.7
11 Carl 1 52.9 NA
12 Carl 2 71.7 18.8
13 Carl 3 84.3 12.6
14 Carl 4 54.8 -29.5
15 Carl 5 69.7 14.9

R Creating new data.table with specified rows of a single column from an old data.table

I have the following data.table:
Month Day Lat Long Temperature
1: 10 01 80.0 180 -6.383330333333309
2: 10 01 77.5 180 -6.193327999999976
3: 10 01 75.0 180 -6.263328333333312
4: 10 01 72.5 180 -5.759997333333306
5: 10 01 70.0 180 -4.838330999999976
---
117020: 12 31 32.5 310 11.840003833333355
117021: 12 31 30.0 310 13.065001833333357
117022: 12 31 27.5 310 14.685003333333356
117023: 12 31 25.0 310 15.946669666666690
117024: 12 31 22.5 310 16.578336333333358
For every location (given by Lat and Long), I have a temperature for each day from 1 October to 31 December.
There are 1,272 locations consisting of each pairwise combination of Lat:
Lat
1 80.0
2 77.5
3 75.0
4 72.5
5 70.0
--------
21 30.0
22 27.5
23 25.0
24 22.5
and Long:
Long
1 180.0
2 182.5
3 185.0
4 187.5
5 190.0
---------
49 300.0
50 302.5
51 305.0
52 307.5
53 310.0
I'm trying to create a data.table that consists of 1,272 rows (one per location) and 92 columns (one per day). Each element of that data.table will then contain the temperature at that location on that day.
Any advice about how to accomplish that goal without using a for loop?
Here we use ChickWeights as the data, where we use "Chick-Diet" as the equivalent of your "lat-lon", and "Time" as your "Date":
dcast.data.table(data.table(ChickWeight), Chick + Diet ~ Time)
Produces:
Chick Diet 0 2 4 6 8 10 12 14 16 18 20 21
1: 18 1 1 1 NA NA NA NA NA NA NA NA NA NA
2: 16 1 1 1 1 1 1 1 1 NA NA NA NA NA
3: 15 1 1 1 1 1 1 1 1 1 NA NA NA NA
4: 13 1 1 1 1 1 1 1 1 1 1 1 1 1
5: ... 46 rows omitted
You will likely need to lat + lon ~ Month + Day or some such for your formula.
In the future, please make your question reproducible as I did here by using a built-in data set.
First create a date value using the lubridate package (I assumed year = 2014, adjust as necessary):
library(lubridate)
df$datetext <- paste(df$Month,df$Day,"2014",sep="-")
df$date <- mdy(df$datetext)
Then one option is to use the tidyr package to spread the columns:
library(tidyr)
spread(df[,-c(1:2,6)],date,Temperature)
Lat Long 2014-10-01 2014-12-31
1 22.5 310 NA 16.57834
2 25.0 310 NA 15.94667
3 27.5 310 NA 14.68500
4 30.0 310 NA 13.06500
5 32.5 310 NA 11.84000
6 70.0 180 -4.838331 NA
7 72.5 180 -5.759997 NA
8 75.0 180 -6.263328 NA
9 77.5 180 -6.193328 NA
10 80.0 180 -6.383330 NA

R program - getting particular values depending on another column

So I have data regarding Id number and time
Id number Time(hr)
1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4
I want this output
Time Id number
10 5
20 10
30 16
40 22
So I want the time to be in 10 hour intervals and get the ID that corresponds to that particular hour...I decided to use this code data <- data2[seq(0, nrow(data2), by=5), ] but instead of the Time being in 10 hr intervals...the ID number is at 10 intervals....but I dont want that output..so far I'm getting this output
Id.number Time..s.
10 19.3
20 36.9
You can use %% (mod) operator.
data[data$Time %% 10 == 0, ]
I use cut() and cumsum(table()) but I don't quite get the answer you are expecting. How exactly are you calculating this?
# first load the data
v.txt <- '1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4'
# load in the data... awkwardly...
v <- as.data.frame(matrix(as.numeric(unlist(strsplit(strsplit(v.txt, '\n')[[1]], ' +'))), byrow=T, ncol=2))
tens <- seq(from=0, by=10, to=100)
v$cut <- cut(v$Time, tens, labels=tens[-1])
v2 <- as.data.frame(cumsum(table(v$cut)))
names(v2) <- 'Time'
v2$Id <- rownames(v2)
rownames(v2) <- 1:nrow(v2)
v2 <- v2[,c(2,1)]
rm(v, v.txt, tens) # not needed anymore
v2 # the answer... but doesn't quite match your expected answer...
Id Time
1 10 5
2 20 10
3 30 15
4 40 21
5 50 25

Reshaping a data frame with more than one measure variable

I'm using a data frame similar to this one:
df<-data.frame(student=c(rep(1,5),rep(2,5)), month=c(1:5,1:5),
quiz1p1=seq(20,20.9,0.1),quiz1p2=seq(30,30.9,0.1),
quiz2p1=seq(80,80.9,0.1),quiz2p2=seq(90,90.9,0.1))
print(df)
student month quiz1p1 quiz1p2 quiz2p1 quiz2p2
1 1 1 20.0 30.0 80.0 90.0
2 1 2 20.1 30.1 80.1 90.1
3 1 3 20.2 30.2 80.2 90.2
4 1 4 20.3 30.3 80.3 90.3
5 1 5 20.4 30.4 80.4 90.4
6 2 1 20.5 30.5 80.5 90.5
7 2 2 20.6 30.6 80.6 90.6
8 2 3 20.7 30.7 80.7 90.7
9 2 4 20.8 30.8 80.8 90.8
10 2 5 20.9 30.9 80.9 90.9
Describing grades received by students during five months – in two quizzes divided into two parts each.
I need to get the two quizzes into separate rows – so that each student in each month will have two rows, one for each quiz, and two columns – for each part of the quiz.
When I melt the table:
melt.data.frame(df, c("student", "month"))
I get the two parts of the quiz in separate lines too.
dcast(dfL,student+month~variable)
of course gets me right back where I started, and I can't find a way to cast the table back in to the required form.
Is there a way to make the melt command function something like:
melt.data.frame(df, measure.var1=c("quiz1p1","quiz2p1"),
measure.var2=c("quiz1p2","quiz2p2"))
Here's how you could do this with reshape(), from base R:
df2 <- reshape(df, direction="long",
idvar = 1:2, varying = list(c(3,5), c(4,6)),
v.names = c("p1", "p2"), times = c("quiz1", "quiz2"))
## Checking the output
rbind(head(df2, 3), tail(df2, 3))
# student month time p1 p2
# 1.1.quiz1 1 1 quiz1 20.0 30.0
# 1.2.quiz1 1 2 quiz1 20.1 30.1
# 1.3.quiz1 1 3 quiz1 20.2 30.2
# 2.3.quiz2 2 3 quiz2 80.7 90.7
# 2.4.quiz2 2 4 quiz2 80.8 90.8
# 2.5.quiz2 2 5 quiz2 80.9 90.9
You can also use column names (instead of column numbers) for idvar and varying. It's more verbose, but seems like better practice to me:
## The same operation as above, using just column *names*
df2 <- reshape(df, direction="long", idvar=c("student", "month"),
varying = list(c("quiz1p1", "quiz2p1"),
c("quiz1p2", "quiz2p2")),
v.names = c("p1", "p2"), times = c("quiz1", "quiz2"))
I think this does what you want:
#Break variable into two columns, one for the quiz and one for the part of the quiz
dfL <- transform(dfL, quiz = substr(variable, 1,5),
part = substr(variable, 6,7))
#Adjust your dcast call:
dcast(dfL, student + month + quiz ~ part)
#-----
student month quiz p1 p2
1 1 1 quiz1 20.0 30.0
2 1 1 quiz2 80.0 90.0
3 1 2 quiz1 20.1 30.1
...
18 2 4 quiz2 80.8 90.8
19 2 5 quiz1 20.9 30.9
20 2 5 quiz2 80.9 90.9
There was a very similar question asked about half a year ago, in which I wrote the following function:
melt.wide = function(data, id.vars, new.names) {
require(reshape2)
require(stringr)
data.melt = melt(data, id.vars=id.vars)
new.vars = data.frame(do.call(
rbind, str_extract_all(data.melt$variable, "[0-9]+")))
names(new.vars) = new.names
cbind(data.melt, new.vars)
}
You can use the function to "melt" your data as follows:
dfL <-melt.wide(df, id.vars=1:2, new.names=c("Quiz", "Part"))
head(dfL)
# student month variable value Quiz Part
# 1 1 1 quiz1p1 20.0 1 1
# 2 1 2 quiz1p1 20.1 1 1
# 3 1 3 quiz1p1 20.2 1 1
# 4 1 4 quiz1p1 20.3 1 1
# 5 1 5 quiz1p1 20.4 1 1
# 6 2 1 quiz1p1 20.5 1 1
tail(dfL)
# student month variable value Quiz Part
# 35 1 5 quiz2p2 90.4 2 2
# 36 2 1 quiz2p2 90.5 2 2
# 37 2 2 quiz2p2 90.6 2 2
# 38 2 3 quiz2p2 90.7 2 2
# 39 2 4 quiz2p2 90.8 2 2
# 40 2 5 quiz2p2 90.9 2 2
Once the data are in this form, you can much more easily use dcast() to get whatever form you desire. For example
head(dcast(dfL, student + month + Quiz ~ Part))
# student month Quiz 1 2
# 1 1 1 1 20.0 30.0
# 2 1 1 2 80.0 90.0
# 3 1 2 1 20.1 30.1
# 4 1 2 2 80.1 90.1
# 5 1 3 1 20.2 30.2
# 6 1 3 2 80.2 90.2

Resources