Cumulative values for columns based on previous row [duplicate] - r

This question already has an answer here:
Sum of previous rows in a column R
(1 answer)
Closed 3 years ago.
Assume I need to calculate a cumulative value based on another column in the same row and on the value of the same column in the previous row. Example: obtaining cumulative time from time intervals.
> data <- data.frame(interval=runif(10),time=0)
> data
interval time
1 0.95197753 0
2 0.73623490 0
3 0.63938696 0
4 0.32085833 0
5 0.92621764 0
6 0.02801951 0
7 0.09071334 0
8 0.60624511 0
9 0.35364178 0
10 0.79759991 0
I can generate the cumulative value of time using the (ugly) code below:
for (i in 1:nrow(data)) {
  data[i, "time"] <- data[i, "interval"] + ifelse(i == 1, 0, data[i - 1, "time"])
}
> data
interval time
1 0.95197753 0.9519775
2 0.73623490 1.6882124
3 0.63938696 2.3275994
4 0.32085833 2.6484577
5 0.92621764 3.5746754
6 0.02801951 3.6026949
7 0.09071334 3.6934082
8 0.60624511 4.2996533
9 0.35364178 4.6532951
10 0.79759991 5.4508950
Is it possible to do this without the for loop, using a single command?

Maybe what you are looking for is cumsum():
library(tidyverse)
data <- data %>%
  mutate(time = cumsum(interval))

As Ronak says, you can also do this using dplyr and the pipe:
library(dplyr)
data <- data %>%
  mutate(time = cumsum(interval))
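For comparison, here is a minimal base R sketch of the same idea, with no packages loaded. The `set.seed(1)` call is an assumption added only to make the example reproducible; it is not part of the question.

```r
# Assumed seed, only so the random intervals are reproducible.
set.seed(1)
data <- data.frame(interval = runif(10), time = 0)

# cumsum() returns the running total of a vector, so a single
# assignment replaces the whole for loop.
data$time <- cumsum(data$interval)

# Re-run the question's loop on a copy to confirm both give the same result.
check <- data.frame(interval = data$interval, time = 0)
for (i in 1:nrow(check)) {
  check[i, "time"] <- check[i, "interval"] + ifelse(i == 1, 0, check[i - 1, "time"])
}
same_result <- isTRUE(all.equal(data$time, check$time))
```

Because `cumsum()` is vectorised, it also avoids the row-by-row data frame assignment that makes the loop slow on large inputs.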

Related

R: Dataframe summary into a new dataframe [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
how can I make stacked barplot with ggplot2
(1 answer)
Closed 11 months ago.
I'm really new to R and I'm working on a solution that needs to handle data frames and charts.
I have an original DF like the one below:
Product_ID Sales_2016 Sales_2017 Import
1 1162.91 235.19 1
2 1191.11 944.87 1
3 1214.96 737.06 0
4 1336.07 986.15 0
5 1343.29 871.33 1
6 1208.78 1168.82 0
7 1205.49 900.39 0
8 1675.69 2255.11 0
9 1412.55 1021.05 1
10 1546.06 1893.82 0
11 1393.98 1629.42 0
12 1264.76 1014.74 0
13 1399 2163.64 0
14 1712.75 1845.59 0
15 1310.67 1400.31 1
16 1727.8 1574.98 0
17 1987.14 3359.82 0
18 1711.99 2065.15 0
19 1570.97 2444.15 0
20 1546.2 1912.32 0
I want to build a stacked bar chart from this data, with the year (2016/2017) on the X axis and the sales sum on the Y axis. Each bar would be split into two groups, imported and not imported, stacked on top of each other as in the image below:
To achieve this, my solution was to manually create another DF with the summed values:
Import TotalSales Year
1 123456 2016
0 654321 2016
1 456789 2017
0 987654 2017
I created this DF totally manually. My question is, is there a faster and more reliable way to create this chart and this kind of dataframe view?
Regards,
Marcus
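One possible way to build that summary table without typing it by hand is sketched below in base R. It uses only the first four rows of the data above, and the names `long` and `totals` are made up for this illustration.

```r
# First four rows of the question's data, enough to show the idea.
df <- data.frame(
  Product_ID = 1:4,
  Sales_2016 = c(1162.91, 1191.11, 1214.96, 1336.07),
  Sales_2017 = c(235.19, 944.87, 737.06, 986.15),
  Import     = c(1, 1, 0, 0)
)

# Wide to long: one row per product and year.
long <- data.frame(
  Import = rep(df$Import, times = 2),
  Year   = rep(c(2016, 2017), each = nrow(df)),
  Sales  = c(df$Sales_2016, df$Sales_2017)
)

# Sum the sales within each Import/Year combination.
totals <- aggregate(Sales ~ Import + Year, data = long, FUN = sum)
names(totals)[names(totals) == "Sales"] <- "TotalSales"
totals
```

With ggplot2 loaded, a table shaped like `totals` could then be plotted with something like `ggplot(totals, aes(factor(Year), TotalSales, fill = factor(Import))) + geom_col()`, since `geom_col()` stacks bars by default.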

group_by() with two counts, one per variable [duplicate]

This question already has answers here:
Frequency count of two column in R
(8 answers)
count number of rows in a data frame in R based on group [duplicate]
(8 answers)
Closed 3 years ago.
My data looks like this
PDB.ID Chain.ID Space.Group Uniprot.Recommended.Name
101M A P 6 Myoglobin
102L A P 32 2 1 Endolysin
102M A P 6 Myoglobin
103L A P 32 2 1 Endolysin
103M A P 6 Myoglobin
104L A H 3 2 Endolysin
After reading the data and loading the required package
df <- read.delim("~/Downloads/dummy2.tsv")
library(dplyr)
I can count the number of entries for a specific variable with code like this
df %>% count(Uniprot.Recommended.Name)
Or alternatively
df %>%
  group_by(Uniprot.Recommended.Name) %>%
  summarise(
    count = n()
  )
I get two columns, with a count for every value of Uniprot.Recommended.Name.
My question: is it possible to get a table with two levels of counting, that is, the number of entries for every Space.Group per Uniprot.Recommended.Name?
The expected table should look something like this
Myoglobin P 6 123
Myoglobin P 32 2 1 124
Endolysin P 32 2 1 125
Endolysin H 3 2 126
Thanks
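A sketch of one way to get there: `count()` in dplyr accepts several columns and counts each combination. The data frame below is typed in from the six sample rows above rather than read from the TSV file, so the resulting counts are tiny and do not match the illustrative numbers in the expected table.

```r
library(dplyr)

# The six sample rows from the question, typed in by hand.
df <- data.frame(
  PDB.ID = c("101M", "102L", "102M", "103L", "103M", "104L"),
  Chain.ID = "A",
  Space.Group = c("P 6", "P 32 2 1", "P 6", "P 32 2 1", "P 6", "H 3 2"),
  Uniprot.Recommended.Name = c("Myoglobin", "Endolysin", "Myoglobin",
                               "Endolysin", "Myoglobin", "Endolysin")
)

# One row per (name, space group) pair, with the count in column n.
counts <- df %>% count(Uniprot.Recommended.Name, Space.Group)
counts
```

The same result should come from `group_by(Uniprot.Recommended.Name, Space.Group)` followed by `summarise(count = n())`, with the count column named `count` instead of `n`.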

Change or keep the value of a specific column considering the value in the same row and another column

I want to know how to change or keep the value of a specific column considering the value in the same row and another column.
Here is my dataset named (df):
BLUP_pop BLUPISM_rate
1 1.94693747 1.00000000
2 1.33774978 0.68710465
3 1.04724481 0.78284058
4 0.95897119 0.91570871
5 0.75524367 0.78755616
6 0.44728346 0.59223728
7 0.35502008 0.79372504
8 0.29392675 0.82791585
9 0.26649710 0.90667862
10 0.15827465 0.59390759
11 -0.00630328 -0.03982495
12 -0.21526737 34.15164327
I'd like to state the following rule:
If df$BLUP_pop <= 0, then put 0 in df$BLUPISM_rate;
If df$BLUP_pop > 0, then keep the value.
i.e.
BLUP_pop BLUPISM_rate
1 1.94693747 1.00000000
2 1.33774978 0.68710465
3 1.04724481 0.78284058
4 0.95897119 0.91570871
5 0.75524367 0.78755616
6 0.44728346 0.59223728
7 0.35502008 0.79372504
8 0.29392675 0.82791585
9 0.26649710 0.90667862
10 0.15827465 0.59390759
11 -0.00630328 0
12 -0.21526737 0
Thanks.
BLUPISM_rate is an existing column in the data frame df. It can be modified with ifelse, using a condition on the other column, BLUP_pop.
mutate is a function in the dplyr package that creates new columns or modifies existing ones in a data frame.
# ifelse(condition, value_if_true, value_if_false)
library(dplyr)
df <- df %>%
  mutate(
    BLUPISM_rate = ifelse(BLUP_pop <= 0, 0, BLUPISM_rate)
  )
print(df)
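If loading dplyr is not desirable, the same rule can be sketched in base R with logical indexing. The data frame below holds only the last four rows of the question's data, to keep the example short.

```r
# Last four rows of the question's data (two positive, two negative BLUP_pop).
df <- data.frame(
  BLUP_pop     = c(0.26649710, 0.15827465, -0.00630328, -0.21526737),
  BLUPISM_rate = c(0.90667862, 0.59390759, -0.03982495, 34.15164327)
)

# Overwrite BLUPISM_rate with 0 wherever BLUP_pop is zero or negative;
# all other rows are left untouched.
df$BLUPISM_rate[df$BLUP_pop <= 0] <- 0
print(df)
```

Note that this version assumes BLUP_pop has no missing values; an NA in the logical index would make the assignment fail, whereas ifelse would return NA for that row.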

how to flatten values in this column to separate columns in R [duplicate]

This question already has answers here:
How do I get a contingency table?
(6 answers)
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Closed 4 years ago.
I have some data which looks as follows
"ID","PROD"
"1001658",6619
"100288",11843
"100288",20106
"1004303",921
I need to convert it into a format like
"ID","PROD_6619","PROD_11843","PROD_20106","PROD_921"
"1001658",1,0,0,0
"100288",0,1,1,0
"1004303",0,0,0,1
Basically, each value in the PROD column of the original data set should get a separate column of its own. Note that the above data set is only a sample, so I cannot hard-code the column names "PROD_6619","PROD_11843","PROD_20106","PROD_921". There could be many more.
I have tried writing this iteratively using a for loop, and it's very slow for my huge data set.
Can you suggest an alternative in R?
You can just use table for something like this.
Example:
table(mydf)
## PROD
## ID 921 6619 11843 20106
## 100288 0 0 1 1
## 1001658 0 1 0 0
## 1004303 1 0 0 0
Here is another approach, using dcast from the reshape2 package.
library(reshape2)
dcast(dat, ID ~ PROD, length )
Using PROD as value column: use value.var to override.
ID 921 6619 11843 20106
1 100288 0 0 1 1
2 1001658 0 1 0 0
3 1004303 1 0 0 0
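To get from the table output to the exact layout asked for (an ID column plus PROD_-prefixed columns), something along these lines might work in base R. The data frame `mydf` is rebuilt here from the four sample rows, and `wide` is a name chosen for this sketch.

```r
# The four sample rows from the question.
mydf <- data.frame(
  ID   = c("1001658", "100288", "100288", "1004303"),
  PROD = c(6619, 11843, 20106, 921)
)

# Cross-tabulate: one row per ID, one column per distinct PROD value.
tab <- table(mydf)

# Convert the 2-D table to a data frame while keeping its wide shape.
wide <- as.data.frame.matrix(tab)

# Prefix the generated columns and move the IDs from row names to a column.
names(wide) <- paste0("PROD_", names(wide))
wide <- cbind(ID = rownames(wide), wide)
rownames(wide) <- NULL
wide
```

The column names are derived from the data, so nothing is hard-coded. The cells hold counts, so an ID/PROD pair that appears more than once would give a value above 1.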

drawing multiple boxplots from imputed data in R

I have an imputed dataset that I'm analysing, and I'm trying to draw boxplots, but I can't wrap my head around the proper procedure.
my data (a sample, original has 20 observations per imputation and 13 vars per group, all values range from 0 to 25):
.imp .id FTE_RM FTE_PD OMZ_RM OMZ_PD
1 1 25 25 24 24
1 2 4 0 2 6
1 3 11 5 3 2
1 4 12 3 3 3
2 1 20 15 15 15
2 2 4 1 2 3
2 3 0 0 0 6
2 4 20 0 0 0
.imp signifies the imputation round, .id the identifier for each observation.
I want to draw all the FTE_* variables in a single plot (and the OMZ_* variables in another), but I wonder what to do with all the imputations. Can I just include all values? The imputed data now has 500 observations. With an ANOVA, for instance, I'd need to average the ANOVA results by 5 to get back to 20 observations. But is this needed for a boxplot as well, since I only deal with medians, means, max. and min.?
Such as:
data_melt <- melt(df[grep("^FTE_", colnames(df))])
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()
I've played a couple of times with ggplot, but consider myself a complete newbie.
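I assume you want to keep the .imp and .id identifiers after melting, so use this instead:
# melt() is assumed to come from reshape2, ggplot() from ggplot2
library(reshape2)
library(ggplot2)
data_melt <- melt(df, id.vars = c(".imp", ".id"))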
For completeness of the dataframe it probably helps to introduce a column that identifies the type - FTE vs. OMZ:
data_melt$type <- ifelse(grepl("FTE",data_melt$variable),"FTE","OMZ")
Having this data.frame you can, for example, facet on the type (alternatively you can just use a simple filter statement on data_melt to restrict to one type):
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()+facet_wrap(~type,scales="free_x")
This would look like this.
EDIT: fixed the data mess-up
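For readers without reshape2 or ggplot2, a rough base R sketch of the same reshaping and plot is given below, using the eight sample rows from the question. It simply pools all imputations into one boxplot, as the snippets above do; whether pooling is statistically appropriate is a separate question that this sketch does not settle.

```r
# The eight sample rows from the question.
df <- data.frame(
  .imp   = rep(1:2, each = 4),
  .id    = rep(1:4, times = 2),
  FTE_RM = c(25, 4, 11, 12, 20, 4, 0, 20),
  FTE_PD = c(25, 0, 5, 3, 15, 1, 0, 0),
  OMZ_RM = c(24, 2, 3, 3, 15, 2, 0, 0),
  OMZ_PD = c(24, 6, 2, 3, 15, 3, 6, 0)
)

# Stack the measurement columns into long format, repeating the identifiers.
measure_cols <- grep("^(FTE|OMZ)_", names(df), value = TRUE)
long <- cbind(
  df[rep(seq_len(nrow(df)), times = length(measure_cols)), c(".imp", ".id")],
  stack(df[measure_cols])
)
names(long) <- c(".imp", ".id", "value", "variable")
rownames(long) <- NULL

# Tag each row with its group (the part of the name before the underscore).
long$type <- sub("_.*$", "", as.character(long$variable))

# Boxplot of the FTE variables only, all imputations pooled.
fte <- droplevels(long[long$type == "FTE", ])
boxplot(value ~ variable, data = fte, main = "FTE variables, all imputations pooled")
```

Swapping `"FTE"` for `"OMZ"` in the filter gives the second plot.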
