Change order of conditions when plotting normalised counts for single gene - r

I have a df of 17 variables (my samples) with the condition location which I would like to plot based on a single gene "photosystem II protein D1 1"
View(metadata)
sample location
<chr> <chr>
1 X1344 West
2 X1345 West
3 X1365 West
4 X1366 West
5 X1367 West
6 X1419 West
7 X1420 West
8 X1421 West
9 X1473 Mid
10 X1475 Mid
11 X1528 Mid
12 X1584 East
13 X1585 East
14 X1586 East
15 X1678 East
16 X1679 East
17 X1680 East
View(countdata)
func X1344 X1345 X1365 X1366 X1367 X1419 X1420 X1421 X1473 X1475 X1528 X1584 X1585 X1586 X1678 X1679 X1680
photosystem II protein D1 1 11208 6807 3483 4091 12198 7229 7404 5606 6059 7456 4007 2514 5709 2424 2346 4447 5567
countdata contains thousands of genes but I am only showing the headers and gene of interest
ddsMat has been created like this:
ddsMat <- DESeqDataSetFromMatrix(countData = countdata,
colData = metadata,
design = ~ location)
When plotting:
library(DeSeq2)
plotCounts(ddsMat, "photosystem II protein D1 1", intgroup=c("location"))
By default, the function plots the "conditions" alphabetically eg: East-Mid-West. But I would like to order them so I can see them on the graph West-Mid-East.
Check plotCountsIMAGEhere
Is there a way of doing this?
Thanks,

I have found that you can manually change the order like this:
ddsMat$location <- factor(ddsMat$location, levels=c("West", "Mid", "East"))

Related

put the resulting values from for loop into a table in r [duplicate]

This question already has an answer here:
Using Reshape from wide to long in R [closed]
(1 answer)
Closed 2 years ago.
I'm trying to calculate the total number of matches played by each team in the year 2019 and put them in a table along with the corresponding team names
teams<-c("Sunrisers Hyderabad", "Mumbai Indians", "Gujarat Lions", "Rising Pune Supergiants",
"Royal Challengers Bangalore","Kolkata Knight Riders","Delhi Daredevils",
"Kings XI Punjab", "Deccan Chargers","Rajasthan Royals", "Chennai Super Kings",
"Kochi Tuskers Kerala", "Pune Warriors", "Delhi Capitals", " Gujarat Lions")
for (j in teams) {
print(j)
ipl_table %>%
filter(season==2019 & (team1==j | team2 ==j)) %>%
summarise(match_count=n())->kl
print(kl)
match_played<-data.frame(Teams=teams,Match_count=kl)
}
The match played by last team (i.e Gujarat Lions is 0 and its filling 0's for all other teams as well.
The output match_played can be found on the link given below.
I'd be really glad if someone could help me regarding this error as I'm very new to R.
filter for the particular season, get data in long format and then count number of matches.
library(dplyr)
matches %>%
filter(season == 2019) %>%
tidyr::pivot_longer(cols = c(team1, team2), values_to = 'team_name') %>%
count(team_name) -> result
result
# team_name n
# <chr> <int>
#1 Chennai Super Kings 17
#2 Delhi Capitals 16
#3 Kings XI Punjab 14
#4 Kolkata Knight Riders 14
#5 Mumbai Indians 16
#6 Rajasthan Royals 14
#7 Royal Challengers Bangalore 14
#8 Sunrisers Hyderabad 15
Here is an example
library(tidyr)
df_2019 <- matches[matches$season == 2019, ] # get the season you need
df_long <- gather(df_2019, Team_id, Team_Name, team1:team2) # Make it long format
final_count <- data.frame(t(table(df_long$Team_Name)))[-1] # count the number of matches
names(final_count) <- c("Team", "Matches")
Team Matches
1 Chennai Super Kings 17
2 Delhi Capitals 16
3 Kings XI Punjab 14
4 Kolkata Knight Riders 14
5 Mumbai Indians 16
6 Rajasthan Royals 14
7 Royal Challengers Bangalore 14
8 Sunrisers Hyderabad 15
Or by using base R
final_count <- data.frame(t(table(c(df_2019$team1, df_2019$team2))))[-1]
names(final_count) <- c("Team", "Matches")
final_count

"for" loop in R and checking previous value from a column

I'm working on a data frame which looks like this
Here's how it looks like:
shape id day hour week id footfall category area name
22496 22/3/14 3 12 634 Work cluster CBD area 1
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1
23287 22/3/14 3 12 723 Airport Changi Airport 2
16430 22/3/14 4 12 947 Work cluster CBD area 2
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2
and so on... until 635 rows return.
and the other dataset that I want to compare with can be found here
Here's how it looks like:
category Foreigners Locals
Work cluster 1600000 3623900
Shopping cluster 1800000 3646666.667
Airport 15095152 8902705
Residential area 527700 280000
They both share the same attribute, i.e. category
I want to check if I can compare the previous hour from the column hour in the first dataset so I can compare it with the value from the second dataset.
Here's, what I ideally want to find in R:
#for n in 1: number of rows{
# check the previous hour from IDA dataset !!!!
# calculate hourSum - previousHour = newHourSum and store it as newHourSum
# calculate hour/(newHourSum-previousHour) * Foreigners and store it as footfallHour
# add to the empty dataframe }
I'm not sure how to do that and here's what i tried:
tbl1 <- secondDataset
tbl2 <- firstDataset
mergetbl <- function(tbl1, tbl2)
{
newtbl = data.frame(hour=numeric(),forgHour=numeric(),locHour=numeric())
ntbl1rows<-nrow(tbl1) # get the number of rows
for(n in 1:ntbl1rows)
{
#get the previousHour
newHourSum <- tbl1$hour - previousHour
footfallHour <- (tbl1$hour/(newHourSum-previousHour)) * tbl2$Foreigners
#add to newtbl
}
}
This would what i expected:
shape id day hour week id footfall category area name forgHour locHour
22496 22/3/14 3 12 634 Work cluster CBD area 1 1 12
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1 21 25
23287 22/3/14 3 12 723 Airport Changi Airport 2 31 34
16430 22/3/14 4 12 947 Work cluster CBD area 2 41 23
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2 51 23
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3 61 45
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2 72 54

Replacing values in one data frame with values in a second data frame conditional on a logic statement

I have two data frames: "unit_test" with unique descriptions of survey units (one row per survey unit) and "data_test" with field data (multiple rows per survey unit). If it is a ground survey (data_test$type='ground'), I want to replace data_test$easting with the value in unit_test$east for the corresponding code (unit_test$code must match data_test$code1). If it is an air survey (data_test$type=='air'), I want to keep the original values in data_test$easting.
# Create units table
code <- c('pondA','pondB','pondC','pondD','transect1','transect2','transect3','transect4')
east <- c(12345,23456,34567,45678,NA,NA,NA,NA)
north <- c(99876,98765,87654,76543,NA,NA,NA,NA)
unit_test <- data.frame(cbind(code,east,north))
unit_test
# Create data table
code1 <- c('pondA','pondA','transect1','pondB','pondB','transect2','pondC','transect3','pondD','transect4')
type <- c('ground','ground','air','ground','ground','air','ground','air','ground','air')
easting <- c(NA,NA,18264,NA,NA,46378,NA,86025,NA,46295)
northing <-c(NA,NA,96022,NA,NA,85766,NA,21233,NA,23090)
species <- c('NOPI','NOPI','SCAU','GWTE','GWTE','RUDU','NOPI','GADW','NOPI','MALL')
count <- c(10,23,50,1,2,43,12,3,7,9)
data_test <- data.frame(cbind(code1,type,easting,northing,species,count))
data_test
I have tried using the match function:
if(data_test$type=="ground") {
data_test$easting <- unit_test$east[match(data_test$code1, unit_test$code)]
}
However it replaces the easting values if data_test$type=='air' with NAs. Any help would be much appreciated.
I want my final output to look like this:
code1 type easting northing species count
1 pondA ground 12345 99876 NOPI 10
2 pondA ground 12345 99876 NOPI 23
3 transect1 air 18264 96022 SCAU 50
4 pondB ground 23456 98765 GWTE 1
5 pondB ground 23456 98765 GWTE 2
6 transect2 air 46378 85766 RUDU 43
7 pondC ground 34567 87654 NOPI 12
8 transect3 air 86025 21233 GADW 3
9 pondD ground 45678 76543 NOPI 7
10 transect4 air 46295 23090 MALL 9
I think data.table package is really useful for this task:
install.packages("data.table")
library(data.table)
unit_test = data.table(unit_test)
data_test = data.table(data_test)
Add a column to unit_test specifying it refers to "ground":
unit_test$type = "ground"
Set keys to table in order to cross reference
setkey(data_test, code1, type, species)
setkey(unit_test, code, type)
Every time you have "ground" for type in data_test, lookup appropriate data in unit_test and replace easting with east
data_test[unit_test, easting:= east]
data_test[unit_test,northing:= north]
Results:
> data_test
code1 type easting northing species count
1: pondA ground 12345 99876 NOPI 10
2: pondA ground 12345 99876 NOPI 23
3: pondB ground 23456 98765 GWTE 1
4: pondB ground 23456 98765 GWTE 2
5: pondC ground 34567 87654 NOPI 12
6: pondD ground 45678 76543 NOPI 7
7: transect1 air 18264 96022 SCAU 50
8: transect2 air 46378 85766 RUDU 43
9: transect3 air 86025 21233 GADW 3
10: transect4 air 46295 23090 MALL 9
Base R:
data_test[data_test$type == 'ground',c('easting','northing')] <- unit_test[match(data_test[data_test$type == 'ground','code1'],unit_test$code),c('east','north')]
Find the spots you want to fill, and make an index with match like you mentioned. This is after a change in your sample data. I used stringsAsFactors = F when creating both data frames so I didn't have to deal with factors.

Calculate rows with same title

Since my other question got closed, here is the required data.
What I'm trying to do is have R calculate the last column 'count' towards the column city so I can map the data. Therefore I would need some kind of code to match this. Since I want to show how many participants (in count) are in the state of e.g Hawaii (HI)
zip city state latitude longitude count
96860 Pearl Harbor HI 24.859832 -168.021815 36
96863 Kaneohe Bay HI 21.439867 -157.74772 39
99501 Anchorage AK 61.216799 -149.87828 12
99502 Anchorage AK 61.153693 -149.95932 17
99506 Elmendorf AFB AK 61.224384 -149.77461 2
what I've tried is
match<- c(match(datazip$state, datazip$number))>$
but I'm really helpless trying to find a solution since I don't even know how to describe this in short. My plan afterwards is to make choropleth map with the data and believe me by now I've seen almost all the pages that try to give advice. so your help is pretty much appreciated. Thanks
# I read your sample data to a data frame
> df
zip city state latitude longitude count
1 96860 Pearl_Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe_Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf_AFB AK 61.22438 -149.7746 2
# If you want to sum the number of counts by state
library(plyr)
> ddply(df, .(state), transform, count2 = sum(count))
zip city state latitude longitude count count2
1 99501 Anchorage AK 61.21680 -149.8783 12 31
2 99502 Anchorage AK 61.15369 -149.9593 17 31
3 99506 Elmendorf_AFB AK 61.22438 -149.7746 2 31
4 96860 Pearl_Harbor HI 24.85983 -168.0218 36 75
5 96863 Kaneohe_Bay HI 21.43987 -157.7477 39 75
Maybe aggregate would be a nice and simple solution for you:
df
zip city state latitude longitude count
1 96860 Pearl Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf AFB AK 61.22438 -149.7746 2
aggregate(df$count,by=list(df$state),sum)
Group.1 x
1 AK 31
2 HI 75
aggregate(df$count,by=list(df$city),sum)
Group.1 x
1 Anchorage 29
2 Elmendorf AFB 2
3 Kaneohe Bay 39
4 Pearl Harbor 36

Clustering list for hclust function

Using plot(hclust(dist(x))) method, I was able to draw a cluster tree map. It works. Yet I would like to get a list of all clusters, not a tree diagram, because I have huge amount of data (like 150K nodes) and the plot gets messy.
In other words, lets say if a b c is a cluster and if d e f g is a cluster then I would like to get something like this:
1 a,b,c
2 d,e,f,g
Please note that this is not exactly what I want to get as an "output". It is just an example. I just would like to be able to get a list of clusters instead of a tree plot It could be vector, matrix or just simple numbers that show which groups elements belong to.
How is this possible?
I will use the dataset available in R to demonstrate how to cut a tree into desired number of pieces. Result is a table.
Construct a hclust object.
hc <- hclust(dist(USArrests), "ave")
#plot(hc)
You can now cut the tree into as many branches as you want. For my next trick, I will split the tree into two groups. You set the number of cuts with the k parameter. See ?cutree and the use of paramter h which may be more useful to you (see cutree(hc, k = 2) == cutree(hc, h = 110)).
cutree(hc, k = 2)
Alabama Alaska Arizona Arkansas California
1 1 1 2 1
Colorado Connecticut Delaware Florida Georgia
2 2 1 1 2
Hawaii Idaho Illinois Indiana Iowa
2 2 1 2 2
Kansas Kentucky Louisiana Maine Maryland
2 2 1 2 1
Massachusetts Michigan Minnesota Mississippi Missouri
2 1 2 1 2
Montana Nebraska Nevada New Hampshire New Jersey
2 2 1 2 2
New Mexico New York North Carolina North Dakota Ohio
1 1 1 2 2
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
2 2 2 2 1
South Dakota Tennessee Texas Utah Vermont
2 2 2 2 2
Virginia Washington West Virginia Wisconsin Wyoming
2 2 2 2 2
lets say,
y<-dist(x)
clust<-hclust(y)
groups<-cutree(clust, k=3)
x<-cbind(x,groups)
now you will get for each record, the cluster group.
You can subset the dataset as well:
x1<- subset(x, groups==1)
x2<- subset(x, groups==2)
x3<- subset(x, groups==3)

Resources