I am having a problem with the following code (I'm a beginner, so please go easy on me):
COW$id<- (COW$tcode1*1000 + COW$tcode2)
COW$id<- (COW$tcode2*1000 + COW$tcode1)
I want the first line of code to be executed on the condition that the value of tcode1 (a variable in the COW data frame) is less than tcode2 (tcode1 < tcode2), and the second line to be executed if tcode1 is greater than tcode2 (tcode1 > tcode2). The end result I am looking for is a single "id" column in my data frame, built according to the conditions above. Does anyone know how to achieve this?
COW = data.frame(tcode1=c(5,7,18,9),tcode2=c(4,15,8,10))
head(COW)
  tcode1 tcode2
1      5      4
2      7     15
3     18      8
4      9     10
# Put the smaller code in the thousands position and the larger in the remainder
id = ifelse(COW$tcode1 < COW$tcode2,
            COW$tcode1*1000 + COW$tcode2,
            COW$tcode2*1000 + COW$tcode1)
COW = data.frame(id = id, COW)
head(COW)
    id tcode1 tcode2
1 4005      5      4
2 7015      7     15
3 8018     18      8
4 9010      9     10
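If you'd rather avoid the ifelse() branch, an equivalent vectorised sketch on the same COW data uses pmin()/pmax() to put the smaller code in the thousands position:
# same result: smaller code * 1000 + larger code
COW$id = pmin(COW$tcode1, COW$tcode2) * 1000 + pmax(COW$tcode1, COW$tcode2)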
A little background information regarding my question: I have run a trial with 2 different materials, using 2x2 settings. Each treatment was performed in duplicate, resulting in a total of 2x2x2x2 = 16 runs in my dataset. The dataset has the following headings, in which Repetition is either 1 or 2 (as each treatment was performed in duplicate).
| Run | Repetition | Material | Air speed | Class. speed | Parameter of interest |
I would like to transform this into a dataframe/table which has the following headings, resulting in 8 columns:
| Run | Material | Air speed | Class. speed | Parameter of interest from repetition 1 | Parameter of interest from repetition 2 |
This means that each treatment (combination of material, setting 1 and setting 2) is only shown once, and the parameter of interest is shown twice.
I have a dataset which looks as follows:
code rep material airspeed classifier_speed fine_fraction
1 L17 1 lupine 50 600 1
2 L19 2 lupine 50 600 6
3 L16 1 lupine 60 600 9
4 L22 2 lupine 60 600 12
5 L18 1 lupine 50 1200 4
6 L21 2 lupine 50 1200 6
I have melted it as follows:
melt1 <- melt(duplo_selection, id.vars = c("material", "airspeed", "classifier_speed", "rep"),
measure.vars=c("fine_fraction"))
and then tried to cast it as follows:
cast <- dcast(melt1, material + airspeed + classifier_speed ~ variable, value.var = "value")
This gives the following message:
Aggregate function missing, defaulting to 'length'
and a data frame in which the parameter of interest is counted rather than both values being shown.
Thanks for your effort and time trying to help me out; after a little puzzling I found out what I had to do.
I added a replicate column to each observation, with the value 1 or 2, as the trial was performed in duplicate.
Via the code
cast <- dcast(duplo_selection, material + airspeed + classifier_speed ~ replicate, value.var = "fine_fraction")
I came to the 5 x 8 table I was looking for.
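For anyone following along, here is a minimal reproducible sketch of that idea, using only the six sample rows shown above and the existing rep column (1 or 2) in place of the added replicate column; reshape2's dcast() is assumed here, and data.table's dcast() behaves the same way:
library(reshape2)

duplo_selection <- data.frame(
  code = c("L17", "L19", "L16", "L22", "L18", "L21"),
  rep = c(1, 2, 1, 2, 1, 2),
  material = "lupine",
  airspeed = c(50, 50, 60, 60, 50, 50),
  classifier_speed = c(600, 600, 600, 600, 1200, 1200),
  fine_fraction = c(1, 6, 9, 12, 4, 6)
)

dcast(duplo_selection, material + airspeed + classifier_speed ~ rep,
      value.var = "fine_fraction")
##   material airspeed classifier_speed 1  2
## 1   lupine       50              600 1  6
## 2   lupine       50             1200 4  6
## 3   lupine       60              600 9 12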
I have a rather large dataset looking at SNPs across an entire genome. I am trying to generate a heatmap that scales based on how many SNPs have a BF (Bayes factor) value over 50 within a sliding window of x base pairs across the genome. For example, there might be 5 SNPs of interest within the first 1,000,000 base pairs, then 3 in the next 1,000,000, and so on until I reach the end of the genome; these counts would be used to generate a single-row heatmap. Currently, my data are set out like so:
SNP BF BP
0001_107388 11.62814713 107388
0001_193069 2.333472447 193069
0001_278038 51.34452334 278038
0001_328786 5.321968927 328786
0001_523879 50.03245434 523879
0001_804477 -0.51777189 804477
0001_990357 6.235452787 990357
0001_1033297 3.08206707 1033297
0001_1167609 -2.427835577 1167609
0001_1222410 52.96447989 1222410
0001_1490205 10.98099565 1490205
0001_1689133 3.75363951 1689133
0001_1746080 3.519987207 1746080
0001_1746450 -2.86666016 1746450
0001_1777011 0.166999413 1777011
0001_2114817 3.266942137 2114817
0001_2232084 50.43561123 2232084
0001_2332903 -0.15022324 2332903
0001_2347062 -1.209000033 2347062
0001_2426273 1.230915683 2426273
where SNP = the SNP ID, BF = the Bayes factor, and BP = the position on the genome (I've fudged a couple of > 50 values in there for the data to be suitable for this example).
The issue is that I don't have a SNP for each genome position, otherwise I could simply split the windows of interest based on line count and then count how many lines in the BF column are over 50. Is there any way I can count the number of SNPs of interest within different windows of the genome positions? Preferably in R, but I have no issues with using other languages like Python or Bash if it gets the job done.
Thanks!
library(slider)
library(dplyr)

my_data %>%
  # for each SNP, count how many BF values exceed 50 among the SNPs located in
  # the 1,000,000 bp window ending at that SNP's position
  mutate(count = slide_index(BF, BP, ~ sum(.x > 50), .before = 999999))
This counts, for each SNP, how many BF values exceed 50 within the 1,000,000 bp window of BP ending at that SNP (the SNP itself plus everything up to 999,999 bp before it).
SNP BF BP count
1 0001_107388 11.6281471 107388 0
2 0001_193069 2.3334724 193069 0
3 0001_278038 51.3445233 278038 1
4 0001_328786 5.3219689 328786 1
5 0001_523879 50.0324543 523879 2
6 0001_804477 -0.5177719 804477 2
7 0001_990357 6.2354528 990357 2
8 0001_1033297 3.0820671 1033297 2
9 0001_1167609 -2.4278356 1167609 2
10 0001_1222410 52.9644799 1222410 3
11 0001_1490205 10.9809957 1490205 2
12 0001_1689133 3.7536395 1689133 1
13 0001_1746080 3.5199872 1746080 1
14 0001_1746450 -2.8666602 1746450 1
15 0001_1777011 0.1669994 1777011 1
16 0001_2114817 3.2669421 2114817 1
17 0001_2232084 50.4356112 2232084 1
18 0001_2332903 -0.1502232 2332903 1
19 0001_2347062 -1.2090000 2347062 1
20 0001_2426273 1.2309157 2426273 1
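If fixed, non-overlapping 1,000,000 bp windows are wanted instead of a trailing window (as in the "5 in the first 1,000,000, then 3 in the next" description), a small sketch assuming the same my_data columns is to bin BP first and count per bin:
library(dplyr)

my_data %>%
  mutate(window = floor(BP / 1e6)) %>%   # 0 = first Mb, 1 = second Mb, ...
  group_by(window) %>%
  summarise(n_high = sum(BF > 50))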
I have a group of individuals that I am distributing items to in an effort to move toward even distribution of total items across individuals.
Each individual can receive only certain item types.
The starting distribution of items is not equal.
The number of available items of each type is known, and must fully be exhausted.
df contains an example format for the person data. Note that Chuck has 14 items total, not 14 bats and 14 gloves.
df<-data.frame(person=c("Chuck","Walter","Mickey","Vince","Walter","Mickey","Vince","Chuck"),alloweditem=c("bat","bat","bat","bat","ball","ball","glove","glove"),startingtotalitemspossessed=c(14,9,7,12,9,7,12,14))
otherdf contains an example format for the items and the number of each needing assignment:
otherdf<-data.frame(item=c("bat","ball","glove"),numberneedingassignment=c(3,4,7))
Is there a best method for coding this form of item distribution? I imagine the steps to be:
1. Check which person that can receive the given item has the lowest total items assigned; break ties at random.
2. Assign 1 of the given item to that person.
3. Update startingtotalitemspossessed for the person receiving the item.
4. Update the remaining number of that item left to assign.
5. Stop the loop for the given item once the total remaining is 0, and move on to the next item.
Below is a partial representation of something like how I'd imagine this working, as a view inside the loop, left to right.
Note: The number of items and people is very large. If possible, a method that would scale to any given number of people or items would be ideal!
Thank you in advance for your help!
I'm sure there are better ways, but here is an example:
df<-data.frame(person=c("Chuck","Walter","Mickey","Vince","Walter","Mickey","Vince","Chuck"),
alloweditem=c("bat","bat","bat","bat","ball","ball","glove","glove"),
total=c(14,9,7,12,9,7,12,14))
print(df)
## person alloweditem total
## 1 Chuck bat 14
## 2 Walter bat 9
## 3 Mickey bat 7
## 4 Vince bat 12
## 5 Walter ball 9
## 6 Mickey ball 7
## 7 Vince glove 12
## 8 Chuck glove 14
otherdf<-data.frame(item=c("bat","ball","glove"),
numberneedingassignment=c(3,4,7))
# Items in queue, one entry per item that still needs to be assigned
queue <- rep(otherdf$item, otherdf$numberneedingassignment)

for (i in seq_along(queue)) {
  # Find the eligible person with the lowest current total (ties go to the first match)
  personToBeAssigned <- df[df$alloweditem == queue[i] &
                             df$total == min(df[df$alloweditem == queue[i], 3]), 1][1]
  # Hand that person one of this item
  df[df$person == personToBeAssigned & df$alloweditem == queue[i], 3] <-
    df[df$person == personToBeAssigned & df$alloweditem == queue[i], 3] + 1
}
print(df)
## person alloweditem total
## 1 Chuck bat 14
## 2 Walter bat 10
## 3 Mickey bat 9
## 4 Vince bat 12
## 5 Walter ball 10
## 6 Mickey ball 10
## 7 Vince glove 17
## 8 Chuck glove 16
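One limitation of the example above is that a person who appears under several item types (e.g. Chuck with bat and glove) has each of those rows tracked separately. A hypothetical variation that keeps a single running total per person, and breaks ties at random as described in the question, could look like this (using the df and otherdf defined above):
totals <- with(df, tapply(total, person, max))  # one running total per person
queue  <- rep(otherdf$item, otherdf$numberneedingassignment)

set.seed(1)  # only so the random tie-breaks are reproducible
for (it in queue) {
  eligible   <- unique(df$person[df$alloweditem == it])            # who may receive this item
  candidates <- eligible[totals[eligible] == min(totals[eligible])]
  chosen     <- if (length(candidates) == 1) candidates else sample(candidates, 1)
  totals[chosen] <- totals[chosen] + 1                             # hand over one item
}
totals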
I am trying to store the loop output. However, my dataset is quite big and it crashes RStudio whenever I try to View it. I have tried different techniques such as the functions in library(iterators) and library(foreach), but they are not doing what I want. I am trying to take a row from my main table (Table A, 54,000 rows) and then a row from another, smaller table (Table B, 6 rows). I have also taken a look at Storing loop output in a dataframe in R, but it doesn't really allow me to view my results.
The code takes the first row from Table A, iterates it 6 times through Table B, outputs the result of each iteration, and then moves on to Table A's second row. As such, my final dataset should contain 324,000 (54,000 * 6) observations.
Below is the code that gives me the correct number of observations (but I am unable to view it to check whether the values are being calculated correctly), along with a snippet of Table A and Table B.
output_ratios <- NULL

for (yr in seq(yrs)) {
  if (is.na(yr)) {
    numerator   <- 0
    numerator1  <- 0
    numerator2  <- 0
    denominator <- 0
  } else {
    # Look up this period's factors in Table B (columns are named "1" and "2")
    numerator   <- Table.B[Table.B$PERIOD == paste0("PY_", yr), "1"]
    denominator <- Table.B[Table.B$PERIOD == paste0("PY_", yr), "2"]
    # Apply them to every row of Table A
    denom <- Table.A[, "1"] + abs(Table.A[, "1"]) * denominator
    num   <- Table.A[, "2"] + abs(Table.A[, "2"]) * numerator
    new.data$`1` <- num
    new.data$`2` <- denom
    NI <- num / denom
    NI_ratios$NI <- c(NI)
    output_ratios <- rbind(output_ratios, NI)
  }
}
TABLE B:
PERIOD 1 2 3 4 5
1 PY_1 0.21935312 -0.32989691 0.12587413 -0.28323699 -0.04605116
2 PY_2 0.21328526 0.42051282 -0.10559006 0.41330645 0.26585064
3 PY_3 -0.01338112 -0.03971119 -0.06641667 -0.08238231 -0.05323772
4 PY_4 0.11625091 0.01127819 0.07114166 0.08501516 0.55676498
5 PY_5 -0.01269256 -0.02379182 0.39115278 -0.03716100 0.63530682
6 PY_6 0.69041864 0.51034273 0.59290357 0.78571429 -0.48683736
TABLE A:
1 2 3 4
1 25 3657 2258
2 23 361361 250
3 24 35 000
4 25 362 502
5 25 1039 502
I would greatly appreciate any help.
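Not a full answer, but one common pattern for this kind of loop is to collect each period's result in a pre-allocated list and bind once at the end, rather than growing output_ratios with rbind() inside the loop. A sketch, taking the column names of Table.A and Table.B from the snippets above and assuming yrs holds the period numbers 1 to 6:
results <- vector("list", length(yrs))

for (i in seq_along(yrs)) {
  period      <- paste0("PY_", yrs[i])
  numerator   <- Table.B[Table.B$PERIOD == period, "1"]
  denominator <- Table.B[Table.B$PERIOD == period, "2"]
  num   <- Table.A[, "2"] + abs(Table.A[, "2"]) * numerator
  denom <- Table.A[, "1"] + abs(Table.A[, "1"]) * denominator
  results[[i]] <- data.frame(PERIOD = period, NI = num / denom)
}

output_ratios <- do.call(rbind, results)  # 54,000 rows x 6 periods = 324,000 rows
nrow(output_ratios)                       # check the size rather than View()ing it all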
I made a table like this (table name: a).
variable relative_importance scaled_importance percentage
1 x005 68046.078125 1.000000 0.195396
2 x004 63890.796875 0.938934 0.183464
3 x007 48253.820312 0.709134 0.138562
4 x012 43492.117188 0.639157 0.124889
5 x008 43132.035156 0.633865 0.123855
6 x013 32495.070312 0.477545 0.093310
7 x009 18466.910156 0.271388 0.053028
8 x015 10625.453125 0.156151 0.030511
9 x010 8893.750977 0.130702 0.025539
10 x014 4904.361816 0.072074 0.014083
11 x002 1812.269531 0.026633 0.005204
12 x001 1704.574585 0.025050 0.004895
13 x006 1438.692139 0.021143 0.004131
14 x011 1080.584106 0.015880 0.003103
15 x003 10.152302 0.000149 0.000029
and used this code to order the table:
setorder(a,variable)
and want to get only the second column:
a[2]
relative_importance
12 380.4296
11 645.4594
15 10.1440
4 8599.7715
2 10749.5752
13 263.7065
5 8434.3760
6 7443.8530
7 3602.8850
10 935.6713
14 256.7183
3 9160.4062
1 12071.1826
9 1173.0701
8 1698.0955
I want to copy "relative_importance" and paste it into Excel. But I couldn't remove the row names (12, 11, 15, ..., 9, 8).
Is there any way to print only "relative_importance", without the row names (or with them hidden)?
Thank you :)
You could simply use writeClipboard(as.character(a$relative_importance)) and paste it into Excel. Note that writeClipboard() is only available on Windows.
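If you also want the column header to come along when pasting, write.table() can target the clipboard directly (again Windows-only); a sketch assuming a is a plain data frame, as the a[2] output above suggests:
# tab-separated so Excel pastes it as a single clean column
write.table(a["relative_importance"], "clipboard",
            sep = "\t", row.names = FALSE, quote = FALSE)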
You could create a csv file, which you can open with Excel.
write.csv(a[2], "myfile.csv", row.names = FALSE)
Note that the file will be created in your current working directory, which you can find by running the following code: getwd().
On a different note, are you trying to get the column into Excel for further analysis? If you are, I encourage you to learn how to do that analysis in R.