"for" loop in R and checking previous value from a column - r

I'm working on a data frame that looks like this:
shape id day hour week id footfall category area name
22496 22/3/14 3 12 634 Work cluster CBD area 1
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1
23287 22/3/14 3 12 723 Airport Changi Airport 2
16430 22/3/14 4 12 947 Work cluster CBD area 2
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2
and so on, for 635 rows in total.
The other dataset that I want to compare it against looks like this:
category Foreigners Locals
Work cluster 1600000 3623900
Shopping cluster 1800000 3646666.667
Airport 15095152 8902705
Residential area 527700 280000
Both datasets share the same attribute, category.
I want to look up the previous hour from the hour column in the first dataset so that I can compare it with the values from the second dataset.
Here's what I ideally want to do in R:
# for (n in 1:number_of_rows) {
#   check the previous hour from the IDA dataset !!!!
#   calculate hourSum - previousHour = newHourSum and store it as newHourSum
#   calculate hour / (newHourSum - previousHour) * Foreigners and store it as footfallHour
#   add to the empty data frame
# }
I'm not sure how to do that; here's what I tried:
tbl1 <- secondDataset
tbl2 <- firstDataset

mergetbl <- function(tbl1, tbl2)
{
  newtbl <- data.frame(hour = numeric(), forgHour = numeric(), locHour = numeric())
  ntbl1rows <- nrow(tbl1)  # get the number of rows
  for (n in 1:ntbl1rows)
  {
    # get the previousHour
    newHourSum <- tbl1$hour - previousHour
    footfallHour <- (tbl1$hour / (newHourSum - previousHour)) * tbl2$Foreigners
    # add to newtbl
  }
}
This is what I expected:
shape id day hour week id footfall category area name forgHour locHour
22496 22/3/14 3 12 634 Work cluster CBD area 1 1 12
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1 21 25
23287 22/3/14 3 12 723 Airport Changi Airport 2 31 34
16430 22/3/14 4 12 947 Work cluster CBD area 2 41 23
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2 51 23
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3 61 45
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2 72 54
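For what it's worth, here is a minimal sketch of one way this is commonly approached (not a definitive answer, since previousHour and hourSum are never defined in the pseudocode above): join the two tables on the shared category column, then use dplyr::lag() (or hour[n - 1] inside the loop) to reference the previous row's hour. The arithmetic below follows the pseudocode loosely and would need adjusting to the intended formula.
library(dplyr)

result <- firstDataset %>%
  left_join(secondDataset, by = "category") %>%  # both tables share "category"
  mutate(previousHour = lag(hour),               # hour from the previous row (NA in row 1)
         newHourSum   = hour - previousHour,
         forgHour     = hour / newHourSum * Foreigners,
         locHour      = hour / newHourSum * Locals)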

Related

How can I extract the unique variables in one column conditional to a variable in another and make a new data frame with the output?

I would like to extract the number of camera trap nights (CTN) (one column in the df) per camera trap station (another column in the df) so I can work out relative abundance indices for each camera station. For example, Station 1 has had 5 triggers/events (of the same species) and 30 CTN. It is listed in my database 5 times (has 5 rows). I want to extract the unique CTN for Station 1 and subsequently for all the other stations in the df.
Data frame:
EventID CameraStation CTN
001 Station 1 30
002 Station 1 30
003 Station 1 30
004 Station 1 30
005 Station 2 29
006 Station 2 29
007 Station 2 29
008 Station 2 29
009 Station 2 29
010 Station 3 31
011 Station 3 31
I have tried to use 'unique' and 'with' but do not get the result I want.
with(unique(rai.PS[c("CameraStation", "CTN")]), table(CameraStation))
I expect to get the following result:
CameraStation CTN
Station 1 30
Station 2 29
Station 3 31
I.e. Station 1 is listed only once, with its CTN, in a new data frame.
But instead I get:
CameraStation
Station 1 Station 2 Station 3
        1         1         1
I assume it is giving me each unique station once, but without the CTN.
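For reference, a minimal sketch of what should give the table asked for (assuming the data frame is called rai.PS, as in the attempt above): take unique() of just the two columns, or use dplyr::distinct().
# base R: one row per CameraStation/CTN combination
unique(rai.PS[c("CameraStation", "CTN")])

# dplyr equivalent
library(dplyr)
distinct(rai.PS, CameraStation, CTN)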

Change order of conditions when plotting normalised counts for single gene

I have a df of 17 samples with the condition location, and I would like to plot normalised counts for a single gene, "photosystem II protein D1 1".
View(metadata)
sample location
<chr> <chr>
1 X1344 West
2 X1345 West
3 X1365 West
4 X1366 West
5 X1367 West
6 X1419 West
7 X1420 West
8 X1421 West
9 X1473 Mid
10 X1475 Mid
11 X1528 Mid
12 X1584 East
13 X1585 East
14 X1586 East
15 X1678 East
16 X1679 East
17 X1680 East
View(countdata)
func X1344 X1345 X1365 X1366 X1367 X1419 X1420 X1421 X1473 X1475 X1528 X1584 X1585 X1586 X1678 X1679 X1680
photosystem II protein D1 1 11208 6807 3483 4091 12198 7229 7404 5606 6059 7456 4007 2514 5709 2424 2346 4447 5567
countdata contains thousands of genes, but I am only showing the header and the gene of interest.
ddsMat has been created like this:
ddsMat <- DESeqDataSetFromMatrix(countData = countdata,
                                 colData = metadata,
                                 design = ~ location)
When plotting:
library(DESeq2)
plotCounts(ddsMat, "photosystem II protein D1 1", intgroup=c("location"))
By default, the function orders the "conditions" alphabetically, e.g. East-Mid-West. But I would like to order them West-Mid-East on the graph.
Is there a way of doing this?
Thanks,
I have found that you can manually change the order like this:
ddsMat$location <- factor(ddsMat$location, levels=c("West", "Mid", "East"))
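If you would rather control the ordering (and the rest of the plot) in ggplot2, plotCounts() can also return the underlying data instead of drawing a base-R plot via returnData = TRUE. A sketch, reusing the objects above:
library(ggplot2)

d <- plotCounts(ddsMat, "photosystem II protein D1 1",
                intgroup = "location", returnData = TRUE)
d$location <- factor(d$location, levels = c("West", "Mid", "East"))

ggplot(d, aes(x = location, y = count)) +
  geom_point() +
  scale_y_log10()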

Make a new column and categorize data from 2 dataset

I have two datasets. The first one is called Buildings; it consists of each Building ID with its respective characteristics.
Building_ID Address Year BCR
1 Machida, TY 1994 80
2 Ueno, TY 1972 50
3 Asakusa, TY 1990 70
4 Machida, TY 1982 60
.
.
.
54634 Chiyoda, TY 2002 70
The second dataset is called Residential ID. It has only a single column, containing the Building IDs (the same IDs as in the Buildings dataset) of the buildings with residential usage.
Building_ID
2
3
14
23
39
44
45
133
393
423
.
.
or something like that. What I want to do is add a new column to my first dataset based on my second dataset: I want to categorise which buildings are residential and which are not (basically, select all the Building IDs listed in my second dataset and categorise them as Residential in my first dataset). If a building is residential we can label it 'Residential', otherwise 'NR', so it would look something like this:
Building_ID Address Year BCR Category
1 Machida, TY 1994 80 NR
2 Ueno, TY 1972 50 Residential
3 Asakusa, TY 1990 70 Residential
4 Machida, TY 1982 60 NR
.
.
.
54634 Chiyoda, TY 2002 70 NR
I was thinking it has something to do with ifelse or grepl but so far my code doesn't work.
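A minimal sketch of the ifelse idea (using the data frame names Buildings and Residential from the description above; %in% gives exact ID matches, which is usually a better fit than grepl here):
Buildings$Category <- ifelse(Buildings$Building_ID %in% Residential$Building_ID,
                             "Residential", "NR")
head(Buildings)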

Looping over a data frame and adding a new column in R with certain logic

I have a data frame which contains information about sales branches, customers and sales.
branch <- c("Chicago","Chicago","Chicago","Chicago","Chicago","Chicago","LA","LA","LA","LA","LA","LA","LA","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa")
customer <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21)
sales <- c(33816,24534,47735,1467,39389,30659,21074,20195,45165,37606,38967,41681,47465,3061,23412,22993,34738,19408,11637,36234,23809)
data <- data.frame(branch, customer, sales)
What I need to accomplish is to iterate over each branch, take each customer in the branch, and divide that customer's sales by the branch total. I need to do this to find out how much each customer contributes towards the total sales of the corresponding branch. E.g. for customer 1 I would like to divide 33816/177600 and store this value in a new column (177600 is the total for the Chicago branch).
I have tried to write a function to iterate over each row in a for loop but I am not sure how to do it at a branch level. Any guidance is appreciated.
Consider base R's ave for a new inline aggregate column, which also handles the same customer having multiple records within the same branch:
data$customer_contribution <- ave(data$sales, data$customer, FUN=sum) /
ave(data$sales, data$branch, FUN=sum)
data
# branch customer sales customer_contribution
# 1 Chicago 1 33816 0.190405405
# 2 Chicago 2 24534 0.138141892
# 3 Chicago 3 47735 0.268778153
# 4 Chicago 4 1467 0.008260135
# 5 Chicago 5 39389 0.221784910
# 6 Chicago 6 30659 0.172629505
# 7 LA 7 21074 0.083576241
# 8 LA 8 20195 0.080090263
# 9 LA 9 45165 0.179117441
# 10 LA 10 37606 0.149139610
# 11 LA 11 38967 0.154537126
# 12 LA 12 41681 0.165300433
# 13 LA 13 47465 0.188238887
# 14 Tampa 14 3061 0.017462291
# 15 Tampa 15 23412 0.133560003
# 16 Tampa 16 22993 0.131169705
# 17 Tampa 17 34738 0.198172193
# 18 Tampa 18 19408 0.110718116
# 19 Tampa 19 11637 0.066386372
# 20 Tampa 20 36234 0.206706524
# 21 Tampa 21 23809 0.135824795
Or less wordy:
data$customer_contribution <- with(data, ave(sales, customer, FUN=sum) /
ave(sales, branch, FUN=sum))
We can use dplyr::group_by and dplyr::mutate to calculate fractional sales of total by branch.
library(dplyr)
library(magrittr)

data %>%
  group_by(branch) %>%
  mutate(sales.norm = sales / sum(sales))
## A tibble: 21 x 4
## Groups: branch [3]
# branch customer sales sales.norm
# <fct> <dbl> <dbl> <dbl>
# 1 Chicago 1. 33816. 0.190
# 2 Chicago 2. 24534. 0.138
# 3 Chicago 3. 47735. 0.269
# 4 Chicago 4. 1467. 0.00826
# 5 Chicago 5. 39389. 0.222
# 6 Chicago 6. 30659. 0.173
# 7 LA 7. 21074. 0.0836
# 8 LA 8. 20195. 0.0801
# 9 LA 9. 45165. 0.179
#10 LA 10. 37606. 0.149

Calculate rows with same title

Since my other question got closed, here is the required data.
What I'm trying to do is have R sum the last column, count, per city (or state) so I can map the data. Therefore I need some kind of code to match these up, since I want to show how many participants (in count) are in a given state, e.g. Hawaii (HI).
zip city state latitude longitude count
96860 Pearl Harbor HI 24.859832 -168.021815 36
96863 Kaneohe Bay HI 21.439867 -157.74772 39
99501 Anchorage AK 61.216799 -149.87828 12
99502 Anchorage AK 61.153693 -149.95932 17
99506 Elmendorf AFB AK 61.224384 -149.77461 2
What I've tried is:
match <- c(match(datazip$state, datazip$number))
but I'm really struggling to find a solution, since I don't even know how to describe this concisely. My plan afterwards is to make a choropleth map with the data, and believe me, by now I've seen almost all the pages that try to give advice. So your help is much appreciated. Thanks.
# I read your sample data to a data frame
> df
zip city state latitude longitude count
1 96860 Pearl_Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe_Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf_AFB AK 61.22438 -149.7746 2
# If you want to sum the number of counts by state
library(plyr)
> ddply(df, .(state), transform, count2 = sum(count))
zip city state latitude longitude count count2
1 99501 Anchorage AK 61.21680 -149.8783 12 31
2 99502 Anchorage AK 61.15369 -149.9593 17 31
3 99506 Elmendorf_AFB AK 61.22438 -149.7746 2 31
4 96860 Pearl_Harbor HI 24.85983 -168.0218 36 75
5 96863 Kaneohe_Bay HI 21.43987 -157.7477 39 75
Maybe aggregate would be a nice and simple solution for you:
df
zip city state latitude longitude count
1 96860 Pearl Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf AFB AK 61.22438 -149.7746 2
aggregate(df$count,by=list(df$state),sum)
Group.1 x
1 AK 31
2 HI 75
aggregate(df$count,by=list(df$city),sum)
Group.1 x
1 Anchorage 29
2 Elmendorf AFB 2
3 Kaneohe Bay 39
4 Pearl Harbor 36
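For completeness, a dplyr version of the same per-state sum (a sketch, assuming the data frame is called df as above):
library(dplyr)

df %>%
  group_by(state) %>%
  summarise(count = sum(count))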
