dcast - error: Aggregate function missing - r

A little background information regarding my question: I ran a trial with 2 different materials, using 2x2 settings. Each treatment was performed in duplicate, resulting in a total of 2x2x2x2 = 16 runs in my dataset. The dataset has the following headings, in which Repetition is either 1 or 2 (as each treatment was run in duplicate):
| Run | Repetition | Material | Air speed | Class. speed | Parameter of interest |
I would like to transform this into a dataframe/table with the following headings:
| Run | Material | Air speed | Class. speed | Parameter of interest from repetition 1 | Parameter of interest from repetition 2 |
This means that each treatment (combination of material, setting 1 and setting 2) is only shown once, and the parameter of interest is shown twice.
I have a dataset which looks as follows:
code rep material airspeed classifier_speed fine_fraction
1 L17 1 lupine 50 600 1
2 L19 2 lupine 50 600 6
3 L16 1 lupine 60 600 9
4 L22 2 lupine 60 600 12
5 L18 1 lupine 50 1200 4
6 L21 2 lupine 50 1200 6
I have melted it as follows:
library(reshape2)  # melt()/dcast() are also provided by data.table
melt1 <- melt(duplo_selection,
              id.vars = c("material", "airspeed", "classifier_speed", "rep"),
              measure.vars = "fine_fraction")
and then tried to cast it as follows:
cast <- dcast(melt1, material + airspeed + classifier_speed ~ variable, value.var = "value")
This gives the following message:
Aggregate function missing, defaulting to 'length'
and this dataframe, in which the parameter of interest is counted rather than both values being presented.

Thanks for your effort and time trying to help me out; after a little puzzling I found out what I had to do.
I added a replicate column to each observation, either 1 or 2, as the trial was performed in duplicate.
Via the code
cast <- dcast(duplo_selection, material + airspeed + classifier_speed ~ replicate, value.var = "fine_fraction")
I came to the 5 x 8 table I was looking for.
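For anyone landing here: the key point is that the repetition identifier must appear on the right-hand side of the dcast formula, so it is spread into one column per repetition instead of being aggregated. A minimal, self-contained sketch using the sample rows from the question (shown here with data.table; reshape2::dcast behaves the same):

```r
library(data.table)

# Sample rows from the question
duplo_selection <- data.table(
  code = c("L17", "L19", "L16", "L22", "L18", "L21"),
  rep = c(1, 2, 1, 2, 1, 2),
  material = "lupine",
  airspeed = c(50, 50, 60, 60, 50, 50),
  classifier_speed = c(600, 600, 600, 600, 1200, 1200),
  fine_fraction = c(1, 6, 9, 12, 4, 6)
)

# `rep` on the right-hand side creates one fine_fraction column per repetition,
# so no aggregation is triggered
wide <- dcast(duplo_selection,
              material + airspeed + classifier_speed ~ rep,
              value.var = "fine_fraction")
wide
#    material airspeed classifier_speed 1  2
# 1:   lupine       50              600 1  6
# 2:   lupine       50             1200 4  6
# 3:   lupine       60              600 9 12
```

The resulting columns named `1` and `2` can be renamed afterwards, e.g. with `setnames(wide, c("1", "2"), c("rep1", "rep2"))`.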

How to count number of instances above a value within a given range in R?

I have a rather large dataset looking at SNPs across an entire genome. I am trying to generate a heatmap that scales based on how many SNPs have a BF (bayes factor) value over 50 within a sliding window of x base pairs across the genome. For example, there might be 5 SNPs of interest within the first 1,000,000 base pairs, and then 3 in the next 1,000,000, and so on until I reach the end of the genome, which would be used to generate a single row heatmap. Currently, my data are set out like so:
SNP BF BP
0001_107388 11.62814713 107388
0001_193069 2.333472447 193069
0001_278038 51.34452334 278038
0001_328786 5.321968927 328786
0001_523879 50.03245434 523879
0001_804477 -0.51777189 804477
0001_990357 6.235452787 990357
0001_1033297 3.08206707 1033297
0001_1167609 -2.427835577 1167609
0001_1222410 52.96447989 1222410
0001_1490205 10.98099565 1490205
0001_1689133 3.75363951 1689133
0001_1746080 3.519987207 1746080
0001_1746450 -2.86666016 1746450
0001_1777011 0.166999413 1777011
0001_2114817 3.266942137 2114817
0001_2232084 50.43561123 2232084
0001_2332903 -0.15022324 2332903
0001_2347062 -1.209000033 2347062
0001_2426273 1.230915683 2426273
where SNP = the SNP ID, BF = the bayes factor, and BP = the position on the genome (I've fudged a couple of > 50 values in there for the data to be suitable for this example).
The issue is that I don't have a SNP for each genome position, otherwise I could simply split the windows of interest based on line count and then count however many lines in the BF column are over 50. Is there any way I can count the number of SNPs of interest within different windows of the genome positions? Preferably in R, but no issues with using other languages like Python or Bash if it gets the job done.
Thanks!
library(slider); library(dplyr)
my_data %>%
  mutate(count = slide_index_int(BF, BP, ~sum(.x > 50), .before = 999999))
This counts, for each SNP, how many BF values exceed 50 within the window covering the preceding 1M base pairs of BP.
SNP BF BP count
1 0001_107388 11.6281471 107388 0
2 0001_193069 2.3334724 193069 0
3 0001_278038 51.3445233 278038 1
4 0001_328786 5.3219689 328786 1
5 0001_523879 50.0324543 523879 2
6 0001_804477 -0.5177719 804477 2
7 0001_990357 6.2354528 990357 2
8 0001_1033297 3.0820671 1033297 2
9 0001_1167609 -2.4278356 1167609 2
10 0001_1222410 52.9644799 1222410 3
11 0001_1490205 10.9809957 1490205 2
12 0001_1689133 3.7536395 1689133 1
13 0001_1746080 3.5199872 1746080 1
14 0001_1746450 -2.8666602 1746450 1
15 0001_1777011 0.1669994 1777011 1
16 0001_2114817 3.2669421 2114817 1
17 0001_2232084 50.4356112 2232084 1
18 0001_2332903 -0.1502232 2332903 1
19 0001_2347062 -1.2090000 2347062 1
20 0001_2426273 1.2309157 2426273 1
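If the goal is fixed, non-overlapping 1 Mb bins rather than a window trailing each SNP (the question's "first 1,000,000 base pairs, and then 3 in the next 1,000,000"), a base-R sketch also works; `my_data` below is a hypothetical stand-in holding a subset of the BF/BP values from the question:

```r
# Hypothetical subset of the question's data
my_data <- data.frame(
  BF = c(11.63, 2.33, 51.34, 5.32, 50.03, -0.52, 6.24, 52.96, 10.98, 50.44),
  BP = c(107388, 193069, 278038, 328786, 523879, 804477, 990357,
         1222410, 1490205, 2232084)
)

bin_width <- 1e6
bin <- ceiling(my_data$BP / bin_width)   # 1 = first Mb, 2 = second Mb, ...

# factor() keeps empty bins so the heatmap row has one cell per window
counts <- tapply(my_data$BF > 50, factor(bin, levels = 1:max(bin)), sum)
counts[is.na(counts)] <- 0
counts   # 2 SNPs over 50 in the first Mb, then 1 and 1
```

The `counts` vector can then feed a one-row heatmap directly, e.g. via `image(matrix(counts, ncol = 1))` or a heatmap package of your choice.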

Unable to store loop output in R

I am trying to store the loop output. However, my dataset is quite big and it crashes RStudio whenever I try to View it. I have tried different techniques, such as the functions in library(iterators) and library(foreach), but they do not do what I want. I am trying to take a row from my main table (Table A, 54,000 rows) and then a row from another, smaller table (Table B, 6 rows). I have also taken a look at Storing loop output in a dataframe in R but it doesn't really allow me to view my results.
The code takes the first row from Table A, iterates it 6 times through Table B, outputs the result of each iteration, and then moves to Table A's second row. As such my final dataset should contain 324,000 (54,000 * 6) observations.
Below is the code that gives me the correct observations (but I am unable to view it to check that the values are being calculated correctly), along with a snippet of Table A and Table B.
output_ratios <- NULL
for (yr in seq_along(yrs)) {
  if (is.na(yr)) {
    numerator   <- 0
    numerator1  <- 0
    numerator2  <- 0
    denominator <- 0
  } else {
    numerator   <- Table.B[Table.B$PERIOD == paste0("PY_", yr), "1"]
    denominator <- Table.B[Table.B$PERIOD == paste0("PY_", yr), "2"]
    denom <- Table.A[, "1"] + (abs(Table.A[, "1"]) * denominator)
    num   <- Table.A[, "2"] + (abs(Table.A[, "2"]) * numerator)
    new.data$`1` <- num
    new.data$`2` <- denom
    NI <- num / denom
    NI_ratios$NI <- NI
    output_ratios <- rbind(output_ratios, NI)
  }
}
TABLE B:
PERIOD 1 2 3 4 5
1 PY_1 0.21935312 -0.32989691 0.12587413 -0.28323699 -0.04605116
2 PY_2 0.21328526 0.42051282 -0.10559006 0.41330645 0.26585064
3 PY_3 -0.01338112 -0.03971119 -0.06641667 -0.08238231 -0.05323772
4 PY_4 0.11625091 0.01127819 0.07114166 0.08501516 0.55676498
5 PY_5 -0.01269256 -0.02379182 0.39115278 -0.03716100 0.63530682
6 PY_6 0.69041864 0.51034273 0.59290357 0.78571429 -0.48683736
TABLE A:
1 2 3 4
1 25 3657 2258
2 23 361361 250
3 24 35 000
4 25 362 502
5 25 1039 502
I would greatly appreciate any help.
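Since every row of Table A is combined with every row of Table B, one way to avoid growing output_ratios inside a loop (repeated rbind copies are what make this slow) is a single cross join. A sketch, assuming the ratio only needs columns "1" and "2" of each table; the small stand-in frames below are hypothetical:

```r
# Hypothetical small stand-ins for Table A (rows to score) and
# Table B (one adjustment factor per period); names follow the question
Table.A <- data.frame(`1` = c(25, 23, 24), `2` = c(3657, 361361, 35),
                      check.names = FALSE)
Table.B <- data.frame(PERIOD = paste0("PY_", 1:2),
                      `1` = c(0.219, 0.213), `2` = c(-0.330, 0.421),
                      check.names = FALSE)

# merge() with by = NULL produces the full cross join:
# nrow(Table.A) * nrow(Table.B) rows, computed in one vectorized step
xj <- merge(Table.A, Table.B, by = NULL, suffixes = c(".A", ".B"))

# Same ratio as in the loop: numerator from Table B col "1",
# denominator adjustment from Table B col "2"
xj$NI <- (xj$`2.A` + abs(xj$`2.A`) * xj$`1.B`) /
         (xj$`1.A` + abs(xj$`1.A`) * xj$`2.B`)
nrow(xj)  # 3 * 2 = 6 rows
```

To inspect a big result without crashing the RStudio viewer, prefer head(xj), str(xj), or writing it out with write.csv() over View().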

How to check for skipped values in a series in a R dataframe column?

I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea of the structure. Each car name should have 54 observations corresponding to 54 weeks. But there are some weeks for which no observation exists (e.g., Weeks 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing week for each Car name in price1, capture the row corresponding to that Car name in price2, and insert the row in price1. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my search in SO is leading me to answers regarding handling missing values, which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Weeks 3 and 4, which I should fetch from price2. The final output should contain 54 observations for each of the 10 car names, so 540 rows in total.
try this, good luck
library(data.table)
carNames <- paste('Car', 1:10)
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
df <- merge(df, price2, by = 'Name', all.x = TRUE)
df[, `:=`(Price  = ifelse(is.na(Price),  AveragePrice,  Price),
          Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
So if I understand your problem correctly, you basically have 2 dataframes and you want to make sure the dataframe "price1" has the correct row names (the names of the cars) in the 'Name' column?
Here's what I would do, but it probably isn't the optimal way:
# create a loop with length = number of rows in your frame
for (i in 1:nrow(price1)) {
  # check if the value is NA
  if (is.na(price1[i, 1])) {
    # if it is NA, replace it with the corresponding value in price2
    price1[i, 1] <- price2[i, 1]
  }
}
Hope this helps (:
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <- price2 %>%
  anti_join(price1)
To expand your first table to cover all 54 weeks use complete() or you can even fudge it and right_join a table that you will purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column.
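The complete() route mentioned above can be sketched like this, with tiny hypothetical stand-ins for price1 and price2 (column names as in the question):

```r
library(dplyr)
library(tidyr)

# Hypothetical one-car stand-ins for the question's data
price1 <- data.frame(Name = "Car 1", Week = c(1, 2, 5),
                     Price = c(20000, 20000, 20000),
                     Rebate = c(500, 400, 400))
price2 <- data.frame(Name = "Car 1",
                     AveragePrice = 20100, AverageRebate = 450)

filled <- price1 %>%
  complete(Name, Week = 1:54) %>%          # one row per car per week, NAs in the gaps
  left_join(price2, by = "Name") %>%
  mutate(Price  = coalesce(Price,  AveragePrice),
         Rebate = coalesce(Rebate, AverageRebate)) %>%
  select(Name, Week, Price, Rebate)

nrow(filled)  # 54 rows per car
```

The missing Weeks 3 and 4 now carry the per-car averages from price2; with all ten cars present this yields the 540-row result the question asks for.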

Missing values in truncate data frame

I have the following dataframe (t) with 317,000 obs.:
date | page | rank
2015-10-10 | url1 | 1
2015-10-10 | url2 | 2
2015-10-10 | url2 | 3
.
.
.
2015-10-10 | url1000 | 1000
2015-10-11 | url1 | 1
I'm trying to reshape this data, because I want to know for how many days a particular URL has stayed at rank 50 or better.
piv = reshape(t,direction = "wide", idvar = "page", timevar = "date")
If I do that, I obtain a table with 27,447 obs and 318 columns, but it generates a lot of NAs. Example below (only 20 columns shown):
page id.2015-12-07 id.2015-12-08 id.2015-12-09 id.2015-12-10 id.2015-12-11 id.2015-12-12 id.2015-12-13
1 url1 1 1 1 1 1 2 2
id.2015-12-14 id.2015-12-15 id.2015-12-16 id.2015-12-17 id.2015-12-18 id.2015-12-19 id.2015-12-20 id.2015-12-21
1 1 1 1 1 106 534 NA 282
id.2015-12-22 id.2015-12-23 id.2015-12-24 id.2015-12-26
1 270 445 NA NA
Also, using cast I got the following error:
pivoted = cast(t, page ~ rank + date)
Using id as value column.  Use the value argument to cast to override this choice
Error in `[.data.frame`(data, , variables, drop = FALSE) :
  undefined columns selected
I have 317 unique dates and 27,447 unique pages or URLs.
I suggest you use the dplyr package for this kind of task, if that is possible for you:
library(dplyr)
df %>%
  filter(rank <= 50) %>%
  group_by(page) %>%
  summarize(days_in_top_50 = n())
will give you the result you are looking for.
You have one row per page and day. The first line (filter) keeps only rows where the rank was in the top 50. The second line (group_by) means you want results per page, and in the third line the n() function counts the rows that pass the filter for each page.
For more information you can check out https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
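One caveat, sketched on a tiny hypothetical frame: if a page can appear more than once on the same date (as url2 does in the question's sample), n() counts rows rather than days, so n_distinct(date) is the safer aggregate:

```r
library(dplyr)

# Hypothetical frame mirroring the question's sample, with url2 listed
# twice on the same date
df <- data.frame(
  date = c("2015-10-10", "2015-10-10", "2015-10-10", "2015-10-11"),
  page = c("url1", "url2", "url2", "url1"),
  rank = c(1, 2, 3, 1)
)

result <- df %>%
  filter(rank <= 50) %>%
  group_by(page) %>%
  summarize(days_in_top_50 = n_distinct(date))
result   # url1 appears on 2 distinct days, url2 on 1
```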

Executing two separate codes on the basis of conditions

I am having a problem with the following code (I'm a beginner, so please go easy on me):
COW$id<- (COW$tcode1*1000 + COW$tcode2)
COW$id<- (COW$tcode2*1000 + COW$tcode1)
I want the first line of code to be executed on the condition that the value of tcode1 (a variable in COW dataframe) is less than tcode2 (tcode1 < tcode2), and I want the second line of code to be executed if tcode1 is greater than tcode2 (tcode1 > tcode2). The end result I am looking for is a single column "ID" in my dataframe, on the basis of the conditions above. Does anyone know how to achieve this?
COW = data.frame(tcode1=c(5,7,18,9),tcode2=c(4,15,8,10))
head(COW)
tcode1 tcode2
5 4
7 15
18 8
9 10
id <- ifelse(COW$tcode1 < COW$tcode2,
             COW$tcode1*1000 + COW$tcode2,
             COW$tcode2*1000 + COW$tcode1)
COW <- data.frame(id = id, COW)
head(COW)
id tcode1 tcode2
4005 5 4
7015 7 15
8018 18 8
9010 9 10
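As an alternative sketch (not part of the original answer): if the intent is an order-independent pair ID with the smaller code always in the thousands place, pmin()/pmax() does it in one vectorized line without any branching:

```r
COW <- data.frame(tcode1 = c(5, 7, 18, 9), tcode2 = c(4, 15, 8, 10))

# smaller code * 1000 + larger code, element-wise over the whole column
COW$id <- pmin(COW$tcode1, COW$tcode2) * 1000 + pmax(COW$tcode1, COW$tcode2)
COW$id
# [1] 4005 7015 8018 9010
```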
