I have a dataset with 2 calendar variables (Week & Hour) and 1 Amount variable:
Week Hour Amount
35 1 367
35 2 912
36 1 813
36 2 482
37 1 112
37 2 155
35 1 182
35 2 912
36 1 551
36 2 928
37 1 125
37 2 676
I wish to replace each value of Amount with the mean over all observations sharing the same Week/Hour pair. For instance, there are 2 observations for (Week=35, Hour=1), with Amount values of 367 and 182. Hence, for this example, the 2 rows with (Week=35, Hour=1) should have Amount replaced with mean(c(367, 182)) = 274.5. The final output should be:
Week Hour Amount
35 1 274.5
35 2 912.0
36 1 682.0
36 2 705.0
37 1 118.5
37 2 415.5
35 1 274.5
35 2 912.0
36 1 682.0
36 2 705.0
37 1 118.5
37 2 415.5
I have the following code that solves this issue. However, for the complete dataset with thousands of rows, it is very slow. Is there a faster way to compute these paired means and put them back into the data?
dataset = data.frame(Week=c(35,35,36,36,37,37,35,35,36,36,37,37),
Hour = c(1,2,1,2,1,2,1,2,1,2,1,2),
Amount = c(367,912,813,482,112,155,182,912,551,928,125,676))
# wide table of means: one row per Week, one column per Hour
means <- reshape2::dcast(dataset, Week ~ Hour, value.var = "Amount", mean)
# row-by-row lookup of the matching mean -- this loop is the bottleneck
for (i in 1:nrow(dataset)) {
  dataset$Amount[i] <- means[means$Week == dataset$Week[i],
                             which(colnames(means) == dataset$Hour[i])]
}
Possible solution with dplyr:
dataset %>%
group_by(Week, Hour) %>%
summarise(mean_amount = mean(Amount))
You group by Week and Hour and calculate the mean based on this condition.
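This should return one row per Week/Hour pair:
  Week  Hour mean_amount
    35     1       274.5
    35     2       912.0
    36     1       682.0
    36     2       705.0
    37     1       118.5
    37     2       415.5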
EDIT
To keep the original structure (number of rows), alter the code to:
dataset %>%
group_by(Week, Hour) %>%
mutate(Amount = mean(Amount))
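If speed is the main concern, a data.table equivalent of the same grouped replacement may be worth trying (a minimal sketch, assuming dataset as defined in the question):
library(data.table)
setDT(dataset)[, Amount := mean(Amount), by = .(Week, Hour)]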
If the idea is just to get the mean Amount by Week and Hour, this would work:
aggregate(Amount ~ ., dataset, mean)
Week Hour Amount
1 35 1 274.5
2 36 1 682.0
3 37 1 118.5
4 35 2 912.0
5 36 2 705.0
6 37 2 415.5
EDIT:
If, however, the idea is to put the averages back into the dataset, then this should work:
x <- aggregate(Amount ~ ., dataset, mean)
dataset$Amount <- x$Amount[match(apply(dataset[,1:2], 1, paste0, collapse = " "),
apply(x[,1:2], 1, paste0, collapse = " "))]
dataset
Week Hour Amount
1 35 1 274.5
2 35 2 912.0
3 36 1 682.0
4 36 2 705.0
5 37 1 118.5
6 37 2 415.5
7 35 1 274.5
8 35 2 912.0
9 36 1 682.0
10 36 2 705.0
11 37 1 118.5
12 37 2 415.5
Explanation:
This pastes the rows of the first two columns into strings, both in the means data frame x and in dataset, using apply; it then uses match on these strings to assign the mean values to the corresponding rows in dataset.
EDIT 2:
Alternatively, you can use interaction with match for this transformation:
dataset$Amount <- x$Amount[match(interaction(dataset[,1:2]), interaction(x[,1:2]))]
(A variant using %in% is sometimes suggested, but it relies on recycling and on both data frames listing the groups in the same order, so the match form is the safer choice.)
Base R solution:
dataset$Amount <- with(dataset, ave(Amount, Week, Hour, FUN = mean))
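ave computes the mean once per group and broadcasts it back over the matching rows, so it avoids the row-by-row loop entirely.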
Data:
dataset = data.frame(Week=c(35,35,36,36,37,37,35,35,36,36,37,37),
Hour = c(1,2,1,2,1,2,1,2,1,2,1,2),
Amount = c(367,912,813,482,112,155,182,912,551,928,125,676))
I normally find an answer in previous questions posted here, but I can't seem to find this one, so here is my maiden question:
I have a dataframe where one column holds repeating values. I would like to reshape the other columns so that each value in the first column appears on only one row, which gives more columns than in the original dataframe.
Example:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
The original dataframe has 3 columns and 15 rows.
And it would turn into a dataframe with 5 rows, with the columns split into 7 columns: 'test', 'time1', 'time2', 'time3', 'score1', 'score2', 'score3'.
Does anyone have an idea how this could be done?
I think using dcast with rowid from the data.table-package is well suited for this task:
library(data.table)
dcast(setDT(df), test ~ rowid(test), value.var = c('time','score'), sep = '')
The result:
test time1 time2 time3 score1 score2 score3
1: 1 52 3 29 21 131 45
2: 2 79 44 6 119 1 186
3: 3 67 95 39 18 459 121
4: 4 83 50 40 493 466 497
5: 5 46 14 4 465 9 24
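If you prefer the tidyverse, a roughly equivalent sketch with tidyr::pivot_wider (assuming the same df) would be:
library(dplyr)
library(tidyr)
df %>%
  group_by(test) %>%
  mutate(rep = row_number()) %>%  # occurrence number within each test
  ungroup() %>%
  pivot_wider(names_from = rep, values_from = c(time, score), names_sep = "")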
Please try this:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
df$class <- c(rep('a', 5), rep('b', 5), rep('c', 5))  # tag each block of 5 rows
df <- split(x = df, f = df$class)                     # one data frame per block
binded <- cbind(df[[1]], df[[2]], df[[3]])            # bind the blocks side by side
binded <- binded[, -c(5, 9)]                          # drop the duplicated 'test' columns
> binded
test time score class time.1 score.1 class.1 time.2 score.2 class.2
1 1 40 404 a 57 409 b 70 32 c
2 2 5 119 a 32 336 b 93 177 c
3 3 20 345 a 44 91 b 100 42 c
4 4 47 468 a 60 265 b 24 478 c
5 5 16 52 a 38 219 b 3 92 c
Let me know if it works for you!
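For completeness, a base R sketch using reshape() (assuming the original df) gives the same information, though the column order differs:
df$rep <- ave(df$test, df$test, FUN = seq_along)  # occurrence number within each test
reshape(df, idvar = "test", timevar = "rep", direction = "wide", sep = "")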
Let's assume I have a data frame consisting of a categorical variable and a numerical one.
df <- data.frame(group=c(1,1,1,1,1,2,2,2,2,2),days=floor(runif(10, min=0, max=101)))
df
group days
1 1 54
2 1 61
3 1 31
4 1 52
5 1 21
6 2 22
7 2 18
8 2 50
9 2 46
10 2 35
I would like to select the row corresponding to the maximum number of days by group as well as all the following/subsequent group rows. For the example above, my subset df2 should look as follows:
df2
group days
2 1 61
3 1 31
4 1 52
5 1 21
8 2 50
9 2 46
10 2 35
Please note that the groups could have different lengths.
For a base R solution, aggregate days by group using a function that keeps the elements with index greater than or equal to the index of the maximum, and then reshape back into a long data.frame:
df0 <- aggregate(days ~ group, df, function(x) x[seq_along(x) >= which.max(x)])
data.frame(group = rep(df0$group, lengths(df0$days)),
           days = unlist(df0$days, use.names = FALSE))
leading to
  group days
1     1   61
2     1   31
3     1   52
4     1   21
5     2   50
6     2   46
7     2   35
You can use which.max to find the index of the maximum of days and then use slice from dplyr to select that row and all the rows after it, where n() gives the number of rows in each group:
library(dplyr)
df %>% group_by(group) %>% slice(which.max(days):n())
#Source: local data frame [7 x 2]
#Groups: group [2]
# group days
# <int> <int>
#1 1 61
#2 1 31
#3 1 52
#4 1 21
#5 2 50
#6 2 46
#7 2 35
The data.table syntax is similar; .N is the analogue of n() in dplyr and gives the number of rows in each group:
library(data.table)
setDT(df)[, .SD[which.max(days):.N], group]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
We can use a faster option with data.table where we find the row index (.I) and then subset the rows based on that.
library(data.table)
setDT(df)[df[ , .I[which.max(days):.N], by = group]$V1]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
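A compact base R alternative (a sketch) builds a per-group flag for the rows from the maximum onward and subsets on it:
keep <- as.logical(ave(df$days, df$group,
                       FUN = function(x) seq_along(x) >= which.max(x)))
df[keep, ]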
I have some data in 'untidy' format - with 'age' embedded in the variable name.
Using dplyr, I want to create a 'tidy' format dataset in which the keys are datazone, year, and age group, and also where the lower and upper ages within the age group are separate variables.
All of this is fine, except the final step takes much longer than I'd like it to. Is there a faster way of doing this that's still about as 'readable'?
Full reproducible example (using repmis to pull the file)
require(repmis)
require(stringr)
require(tidyr)
require(plyr)
require(dplyr)
persons <- source_DropboxData(
file="persons.csv",
key="vcz7qngb44vbynq"
) %>%
tbl_df() %>%
select(datazone, year,
contains("hspeop")
)
names(persons) <- names(persons) %>% str_replace_all("GR.hspeop", "count_both_")
persons <- persons %>% gather(age_group, count, -datazone, -year)
persons <- persons %>% mutate(gender="both", age_group=str_replace_all(age_group, "count_both_", ""))
persons$age_group <- persons$age_group %>% revalue(
c(
"1619" = "16_19",
"2024" = "20_24",
"2529" = "25_29",
"3034" = "30_34",
"3539" = "35_39",
"4044" = "40_44",
"4549" = "45_49",
"5054" = "50_54",
"5559" = "55_59",
"6064" = "60_64",
"6569" = "65_69",
"7074" = "70_74",
"7579" = "75_79",
"8084" = "80_84",
"85over" = "85_100"
)
)
# deal with "" separately as revalue can't cope
persons$age_group[nchar(persons$age_group)==0] <- "all"
persons_by_age <- persons %>% filter(grepl("_", age_group)) # this is how to filter by contents of age_group
persons_by_age <- persons_by_age %>%
group_by(age_group) %>%
mutate(
lower_age = str_split(age_group, "_")[[1]][1] %>% as.numeric(),
upper_age = str_split(age_group, "_")[[1]][2] %>% as.numeric()
)
Obviously I'm creating the same split twice in mutate, so there's potential to double the speed there. I also thought that group_by would mean the operation only has to be done once per age group, but it seems to run for every row. Would summarising by age group, mutating, and then joining back be a faster approach, for example? (A rough sketch of what I mean follows.)
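Something like this, assuming persons_by_age as created above (untested on the full data): build a lookup of the distinct age groups, split each one only once, then join the bounds back on.
lookup <- persons_by_age %>%
  ungroup() %>%
  distinct(age_group) %>%
  mutate(lower_age = as.numeric(sub("_.*", "", age_group)),
         upper_age = as.numeric(sub(".*_", "", age_group)))
persons_by_age <- persons_by_age %>% left_join(lookup, by = "age_group")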
Edit
The code above already creates the output, but much slower than I'd like.
A couple of examples of the final output:
> persons_by_age
Source: local data frame [5,854,500 x 7]
datazone year age_group count gender lower_age upper_age
1 S01000001 1996 0 8 both 0 0
2 S01000002 1996 0 4 both 0 0
3 S01000003 1996 0 18 both 0 0
4 S01000004 1996 0 4 both 0 0
5 S01000005 1996 0 17 both 0 0
6 S01000006 1996 0 1 both 0 0
7 S01000007 1996 0 9 both 0 0
8 S01000008 1996 0 10 both 0 0
9 S01000009 1996 0 8 both 0 0
10 S01000010 1996 0 9 both 0 0
.. ... ... ... ... ... ... ...
> persons_by_age %>% filter(year==2000 & gender=="male" & lower_age > 30)
Source: local data frame [71,555 x 7]
datazone year age_group count gender lower_age upper_age
1 S01000001 2000 35_39 34 male 35 39
2 S01000002 2000 35_39 41 male 35 39
3 S01000003 2000 35_39 61 male 35 39
4 S01000004 2000 35_39 43 male 35 39
5 S01000005 2000 35_39 43 male 35 39
6 S01000006 2000 35_39 24 male 35 39
7 S01000007 2000 35_39 34 male 35 39
8 S01000008 2000 35_39 23 male 35 39
9 S01000009 2000 35_39 30 male 35 39
10 S01000010 2000 35_39 37 male 35 39
.. ... ... ... ... ... ... ...
> persons_by_age %>% filter(year==2000 & gender=="female" & lower_age > 30)
Source: local data frame [71,555 x 7]
datazone year age_group count gender lower_age upper_age
1 S01000001 2000 35_39 37 female 35 39
2 S01000002 2000 35_39 30 female 35 39
3 S01000003 2000 35_39 58 female 35 39
4 S01000004 2000 35_39 46 female 35 39
5 S01000005 2000 35_39 28 female 35 39
6 S01000006 2000 35_39 29 female 35 39
7 S01000007 2000 35_39 33 female 35 39
8 S01000008 2000 35_39 25 female 35 39
9 S01000009 2000 35_39 36 female 35 39
10 S01000010 2000 35_39 38 female 35 39
.. ... ... ... ... ... ... ...
You can try this:
persons_by_age <- persons_by_age %>%
  group_by(age_group) %>%
  do(cbind(., matrix(rep(unlist(strsplit(as.character(.[1, 3]), "_")), nrow(.)),
                     ncol = 2, byrow = TRUE)))
The . gives you access to the current group's data inside do.
For each group, the age_group value in the first row (.[1, 3]) is split at the underscore into a lower/upper vector, which is repeated for as many rows as there are in the group.
The resulting matrix is then bound to the group.
It ran in a few seconds.
separate as suggested by @jazzurro is much easier though:
separate(persons_by_age, age_group, c("lower", "upper"), sep = "_", remove = FALSE)
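Note that separate returns character columns by default; if you need numeric bounds as in the question, adding convert = TRUE should take care of the conversion (a small untested variation):
separate(persons_by_age, age_group, c("lower_age", "upper_age"), sep = "_", remove = FALSE, convert = TRUE)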
I have two dataframes that are structured as follows:
Dataframe A:
id sqft traf month
1 1030 16 35 1
1 1030 15 32 2
2 1027 1 31 1
2 1027 2 31 2
Dataframe B:
id price frequency month day
1 1030 8 196 1 1
2 1030 9 101 1 15
3 1030 10 156 1 30
4 1030 3 137 2 1
5 1030 7 190 2 15
6 1027 6 188 1 1
7 1027 1 198 1 15
8 1027 2 123 1 30
9 1027 4 185 2 1
10 1027 5 122 2 15
I want to output certain types of summary statistics (centered around each unique ID) from both of these dataframes. This would be easy with ddply if, say, I wanted the mean price for each ID for each month (split by id and month) from Dataframe B, or if I wanted the average ratio of sqft to traf for each id (split by id).
But what would be a potential solution if I wanted to make combined variables from both dataframes. For instance, how would I get the average price for each id/month (Dataframe B) divided by sqft for each id/month?
The different frequencies at which the two dataframes are measured make combining them not easily doable. The only solution I've found so far is to ddply the first dataframe to extract the average sqft per id/month and then pass that value into a second ddply call on the second dataframe.
Is there a more efficient/less convoluted way to do this? I would be splitting both dataframes on the same variables (id and month).
Thanks in advance for any suggestions!
In the case of the sample data, you could merge the two data sets like this (by specifying all.y = TRUE you can make sure that all rows of dfb are kept and, in this case, corresponding entries of dfa are repeated accordingly)
dfall <- merge(dfa, dfb, by = c("id", "month"), all.y=TRUE)
# id month sqft traf price frequency day
#1 1027 1 1 31 6 188 1
#2 1027 1 1 31 1 198 15
#3 1027 1 1 31 2 123 30
#4 1027 2 2 31 4 185 1
#5 1027 2 2 31 5 122 15
#6 1030 1 16 35 8 196 1
#7 1030 1 16 35 9 101 15
#8 1030 1 16 35 10 156 30
#9 1030 2 15 32 3 137 1
#10 1030 2 15 32 7 190 15
Then, you can use ddply as usual:
ddply(dfall, .(id, month), mutate, newcol = mean(price)/sqft)
# id month sqft traf price frequency day newcol
#1 1027 1 1 31 6 188 1 3.0000000
#2 1027 1 1 31 1 198 15 3.0000000
#3 1027 1 1 31 2 123 30 3.0000000
#4 1027 2 2 31 4 185 1 2.2500000
#5 1027 2 2 31 5 122 15 2.2500000
#6 1030 1 16 35 8 196 1 0.5625000
#7 1030 1 16 35 9 101 15 0.5625000
#8 1030 1 16 35 10 156 30 0.5625000
#9 1030 2 15 32 3 137 1 0.3333333
#10 1030 2 15 32 7 190 15 0.3333333
Edit: if you're looking for better performance, consider using dplyr instead of plyr. The equivalent dplyr code (including the merge) is:
library(dplyr)
dfall <- dfb %>%
left_join(., dfa, by = c("id", "month")) %>%
group_by(id, month) %>%
dplyr::mutate(newcol = mean(price)/sqft) # I added dplyr:: to avoid confusion with plyr::mutate
Of course, you could also check out data.table which is also very efficient.
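For reference, a minimal data.table sketch of the same merge-then-group computation (assuming dfa and dfb as above):
library(data.table)
dtall <- merge(as.data.table(dfa), as.data.table(dfb), by = c("id", "month"), all.y = TRUE)
dtall[, newcol := mean(price) / sqft, by = .(id, month)]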
AFAIK ddply is not designed to be used with different data frames at the same time.
dplyr does well here. This code merges the data frames, gets price and sqft means by unique id/month combination, then creates a new variable pricePerSqft.
require(dplyr)
dfa %>%
left_join(dfb, by = c("id", "month")) %>%
group_by(id, month) %>%
summarize(
avgPrice = mean(price),
avgSqft = mean(sqft)) %>%
mutate(pricePerSqft = round(avgPrice / avgSqft, 2))
Here's the result:
id month avgPrice avgSqft pricePerSqft
1 1027 1 3.0 1 3.00
2 1027 2 4.5 2 2.25
3 1030 1 9.0 16 0.56
4 1030 2 5.0 15 0.33
Combining 2 columns into 1 column many times in a very large dataset in R
The clumsy solutions I am working on are not going to be very fast even if I can get them to work, and the true dataset is ~1500 x 45000, so they need to be fast. I am definitely at a loss for 1) at this point, although I have some code for 2) and 3).
Here is a toy example of the data structure:
n <- 10  # number of rows in the toy example
pop <- data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
                  age = round(rnorm(n, mean = 40, 10)), disType = rbinom(n, 1, .2),
                  rs123 = c(1,3,1,3,3,1,1,1,3,1), rs123.1 = rep(1, n),
                  rs157 = c(2,4,2,2,2,4,4,4,2,2), rs157.1 = c(4,4,4,2,4,4,4,4,2,2),
                  rs132 = c(4,4,4,4,4,4,4,4,2,2), rs132.1 = c(4,4,4,4,4,4,4,4,4,4))
Thus, there are a few columns of basic demographic info and then the rest of the columns are biallelic SNP info. Ex: rs123 is allele 1 of rs123 and rs123.1 is the second allele of rs123.
1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example: rs123 and rs123.1 into one column (but within the dataset):
11
31
11
31
31
11
11
11
31
11
2) I need to identify the least frequent SNP value (in the above example it is 31).
3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.
Do you mean 'merge' or 'rearrange' or simply concatenate? If it is the latter then
R> pop2 <- data.frame(pop[,1:4], rs123=paste(pop[,5],pop[,6],sep=""),
+ rs157=paste(pop[,7],pop[,8],sep=""),
+ rs132=paste(pop[,9],pop[,10], sep=""))
R> pop2
status sex age disType rs123 rs157 rs132
1 0 0 42 0 11 24 44
2 1 1 37 0 31 44 44
3 1 0 38 0 11 24 44
4 0 1 45 0 31 22 44
5 1 1 25 0 31 24 44
6 0 1 31 0 11 44 44
7 1 0 43 0 11 44 44
8 0 0 41 0 11 44 44
9 1 1 57 0 31 22 24
10 1 1 40 0 11 22 24
and now you can do counts and whatnot on pop2:
R> sapply(pop2[,5:7], table)
$rs123
11 31
6 4
$rs157
22 24 44
3 3 4
$rs132
24 44
2 8
R>
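A possible sketch for steps 2) and 3), assuming pop2 as above: for each combined SNP column, tabulate the genotypes, find the least frequent one, and recode it to 1 with everything else set to 0.
pop2[, 5:7] <- lapply(pop2[, 5:7], function(x) {
  tab <- table(x)
  rare <- names(tab)[which.min(tab)]  # least frequent genotype, e.g. "31" for rs123
  as.integer(x == rare)               # 1 = least frequent genotype, 0 = others
})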