R: Dataframe summary into a new dataframe [duplicate] - r

I'm really new to R and I'm working on a solution that needs to handle dataframes and charts.
I have an original DF like the one below:
Product_ID Sales_2016 Sales_2017 Import
1 1162.91 235.19 1
2 1191.11 944.87 1
3 1214.96 737.06 0
4 1336.07 986.15 0
5 1343.29 871.33 1
6 1208.78 1168.82 0
7 1205.49 900.39 0
8 1675.69 2255.11 0
9 1412.55 1021.05 1
10 1546.06 1893.82 0
11 1393.98 1629.42 0
12 1264.76 1014.74 0
13 1399 2163.64 0
14 1712.75 1845.59 0
15 1310.67 1400.31 1
16 1727.8 1574.98 0
17 1987.14 3359.82 0
18 1711.99 2065.15 0
19 1570.97 2444.15 0
20 1546.2 1912.32 0
I want to build a stacked bar chart with this data, showing the year (2016/2017) on the X axis and the sum of sales on the Y axis. Each bar would be split into two stacked groups, imported and not imported.
To achieve this, my solution was to create another DF manually with the summed values:
Import TotalSales Year
1 123456 2016
0 654321 2016
1 456789 2017
0 987654 2017
I created this DF entirely by hand. My question is: is there a faster and more reliable way to create this kind of summary dataframe and the chart?
Regards,
Marcus
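One possible approach, sketched in base R under a few assumptions: only the first five rows of the data above are used, and the names long, summary_df and totals are illustrative rather than anything from the question. The idea is to reshape the two sales columns into long format, sum per Import group and year, and pass the resulting 2 x 2 table to barplot(), which stacks by default.

```r
# First five rows of the data from the question (subset chosen for brevity)
df <- data.frame(
  Product_ID = 1:5,
  Sales_2016 = c(1162.91, 1191.11, 1214.96, 1336.07, 1343.29),
  Sales_2017 = c(235.19, 944.87, 737.06, 986.15, 871.33),
  Import     = c(1, 1, 0, 0, 1)
)

# Wide -> long: one row per product and year
long <- reshape(
  df,
  direction = "long",
  varying   = c("Sales_2016", "Sales_2017"),
  v.names   = "Sales",
  timevar   = "Year",
  times     = c(2016, 2017),
  idvar     = "Product_ID"
)

# Sum sales per Import group and year
summary_df <- aggregate(Sales ~ Import + Year, data = long, FUN = sum)
names(summary_df)[names(summary_df) == "Sales"] <- "TotalSales"
print(summary_df)

# Stacked bars: rows are the Import groups (0, 1), columns are the years
totals <- xtabs(TotalSales ~ Import + Year, data = summary_df)
barplot(totals, xlab = "Year", ylab = "Total sales",
        legend.text = c("Not imported", "Imported"))
```

The same summary could also be fed to ggplot2 if that is the charting library in use; the base barplot() call is only there to keep the sketch free of extra packages.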

Related

Episode splitting in survival analysis by the timing of an event in R

Is it possible to split episode by a given variable in survival analysis in R, similar to in STATA using stsplit in the following way: stsplit var, at(0) after(time=time)?
I am aware that the survival package allows one to split episode by given cut points such as c(0,5,10,15) in survSplit, but if a variable, say time of divorce, differs by each individual, then providing cutpoints for each individual would be impossible, and the split would have to be based on the value of a variable (say graduation, or divorce, or job termination).
Is anyone aware of a package or know a resource I might be able to tap into?
Perhaps the Epi package is what you are looking for. It offers multiple ways to cut/split the follow-up time using Lexis objects; see the documentation of cutLexis().
After some poking around, I think tmerge() in the survival package can achieve what stsplit var can do, which is to split episodes not just by given cut points (the same for all observations), but by when an event occurs for each individual.
This is the only way I knew how to split the data:
library(survival)
id <- c(1, 2, 3)
age <- c(19, 20, 29)
job <- c(1, 1, 0)
time <- age - 16 ## create time since age 16 ##
data <- data.frame(id, age, job, time)
id age job time
1 1 19 1 3
2 2 20 1 4
3 3 29 0 13
## simple split by time ##
## 0 up to 2 years, 2-5 years, 5+ years ##
data2 <- survSplit(data, cut = c(0, 2, 5), end = "time", start = "start",
                   event = "job")
id age start time job
1 1 19 0 2 0
2 1 19 2 3 1
3 2 20 0 2 0
4 2 20 2 4 1
5 3 29 0 2 0
6 3 29 2 5 0
7 3 29 5 13 0
However, if I want to split by a certain variable, such as when each individual finished school, each person might have a different cut point (finished school at different ages).
## split by time dependent variable (age finished school) ##
d1 <- data.frame(id, age, time, job)
scend <- c(17, 21, 24) - 16
d2 <- data.frame(id, scend)
## create start/stop time ##
base <- tmerge(d1, d1, id = id, tstop = time)
## create time-dependent covariate ##
s1 <- tmerge(base, d2, id = id, finish = tdc(scend))
id age time job tstart tstop finish
1 1 19 3 1 0 1 0
2 1 19 3 1 1 3 1
3 2 20 4 1 0 4 0
4 3 29 13 0 0 8 0
5 3 29 13 0 8 13 1
I think tmerge() is more or less comparable with the stsplit function in Stata.
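To make the per-individual split concrete without any package, here is a minimal by-hand sketch in base R. It is not how tmerge() is implemented; split_at() is a hypothetical helper, and the sketch assumes the event indicator (job) should sit only on the last episode, as survSplit() does in the earlier output, whereas the tmerge() output above carries job along as a fixed covariate.

```r
# Values from the example above; scend = years after age 16 when school ended
d <- data.frame(
  id    = c(1, 2, 3),
  time  = c(3, 4, 13),
  job   = c(1, 1, 0),
  scend = c(1, 5, 8)
)

# Hypothetical helper: split one subject's follow-up at their own cut point
split_at <- function(row) {
  if (row$scend > 0 && row$scend < row$time) {
    # Cut point falls inside follow-up: two episodes, event only on the last
    data.frame(
      id     = row$id,
      tstart = c(0, row$scend),
      tstop  = c(row$scend, row$time),
      job    = c(0, row$job),
      finish = c(0, 1)
    )
  } else {
    # Cut point is outside follow-up: keep a single episode
    data.frame(id = row$id, tstart = 0, tstop = row$time,
               job = row$job, finish = 0)
  }
}

episodes <- do.call(rbind, lapply(seq_len(nrow(d)), function(i) split_at(d[i, ])))
print(episodes)
```

The tstart, tstop and finish columns produced this way line up with the tmerge() output shown above; subject 2 stays in one episode because school ended after the end of follow-up.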

Creating columns for each aggregation bucket with group_by()and dplyr [duplicate]

I have a dataset where each row represents a citation to an article, along with the difference between the publication date of the article and each reference, like so:
EID ref delta
1 2-s2 r1 0
2 2-s2 r2 3
3 2-s2 r3 22
4 2-s2 r4 100
5 2-s2 r5 7
6 3-s2 r6 1
7 3-s2 r7 0
8 3-s2 r8 1
I want to determine, for each distinct EID, how many references fall in different ranges of year deltas (i.e., for a given article, how many references are 1 year old, 2 years old, 4 years old, etc.). I attempted to create buckets for each:
library(dplyr)
buckets <- c(0, 1, 2, 4, 8, 16, 32, 64, 9999)
# assign each delta to a bucket
bt <- bt %>%
  mutate(delta = as.numeric(delta)) %>%
  mutate(bucket = cut(delta, breaks = buckets))
# count references per article and bucket
group <- bt %>%
  group_by(EID, bucket) %>%
  summarise(count = n())
The resulting grouped data is:
EID bucket count
1 2-s2 (1,2] 6
2 2-s2 (2,4] 8
3 2-s2 (4,8] 16
4 2-s2 (8,16] 18
5 2-s2 (16,32] 10
6 3-s2 (1,2] 1
7 3-s2 (2,4] 13
8 3-s2 (4,8] 1
9 4-s1 (4,8] 3
I would like to create a column for each bucket I have, and then group by EID, placing the appropriate count in the appropriate bucket for each EID, where the result looks something like this:
EID (1,2] (2,4] (4,8] (8,16] (16,32]
1 2-s2 6 8 16 18 10
2 3-s2 1 13 1 0 0
3 4-s1 0 0 3 0 0
Looking at the code I used to generate the first table, it seems like I should be able to use unstack(group, bucket~count) somehow, or just directly automate the creation of these bucket columns using summarise() but I'm not clear on exactly how to do so. Ideally, I would not have to hard-code in each column; I would like to be able to reference the bucketing list, so if I decide to change the bucketing scheme, it will update accordingly. Thank you!
We can use pivot_wider to reshape to 'wide' format:
library(dplyr)
library(tidyr)
group %>%
  pivot_wider(names_from = bucket, values_from = count)
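Note that pivot_wider() leaves NA where an EID has no rows for a bucket, while the desired output shows zeros; its values_fill argument (for example values_fill = 0) can cover that. As an alternative sketch in base R, table() on the bucket factor gives zero-filled counts and creates one column per bucket level, so the columns follow the buckets vector without hard-coding. The rows below are the eight sample rows from the question, and include.lowest = TRUE is an assumption added here so that a delta of 0 lands in the first bucket instead of becoming NA.

```r
buckets <- c(0, 1, 2, 4, 8, 16, 32, 64, 9999)

# The eight sample rows shown in the question
bt <- data.frame(
  EID   = c("2-s2", "2-s2", "2-s2", "2-s2", "2-s2", "3-s2", "3-s2", "3-s2"),
  delta = c(0, 3, 22, 100, 7, 1, 0, 1)
)

# Bucket the deltas; include.lowest = TRUE keeps delta == 0 in the first bucket
bt$bucket <- cut(bt$delta, breaks = buckets, include.lowest = TRUE)

# Cross-tabulate: one row per EID, one column per bucket, zeros where empty
wide <- as.data.frame.matrix(table(bt$EID, bt$bucket))
print(wide)
```

Changing the buckets vector changes the columns automatically, since they come from the factor levels that cut() creates.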

convert table to data frame with the first column as a variable - R [duplicate]

I have a table object in R. It looks a bit like this:
2422 2581 3363
16566 0 1 0
16568 0 2 0
16598 0 1 0
16627 0 1 0
16683 0 1 0
16701 0 1 0
16740 0 1 0
16741 0 1 0
I'd like to convert it to a data frame, whereby this data frame should have 4 variables, not 3. In other words, the first column, 16566, 16568, etc. should be a variable - let's call it ID. The other variables should be the 2422, 2581, 3363 columns.
I've tried
as.data.frame() and
as.data.frame.matrix()
but both functions somehow swallow the first column.
Thanks in advance for your help!
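# tab is the table object (placeholder name, to avoid masking base::table)
# as.data.frame() on a table returns long format, so use the matrix method
df <- as.data.frame.matrix(tab)
# the IDs end up as row names; copy them into a regular column
df$ID <- rownames(df)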
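A small self-contained sketch of the same idea, using made-up input vectors (ids and codes are hypothetical; only the shape of the resulting table mirrors the question). It converts the two-way table through as.data.frame.matrix() and then moves the row names into a leading ID column.

```r
# Hypothetical raw data that yields a table shaped like the one in the question
ids   <- c(16566, 16568, 16568, 16598)
codes <- factor(c(2581, 2581, 2581, 2422), levels = c(2422, 2581, 3363))
tab   <- table(ids, codes)

# Wide data frame: the table's row names are kept as row names at this point
df <- as.data.frame.matrix(tab)

# Promote the row names to a real first column and drop them as row names
df <- cbind(ID = rownames(df), df)
rownames(df) <- NULL
print(df)
```

With ID as an ordinary column the data frame has four variables, which is what the question asks for.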

Manipulating a dataframe in R [duplicate]

I have the following dataframe in R:
id year count
1 2013 2
1 2014 20
2 2013 6
2 2014 7
2 2015 8
3 2011 13
...
999 2016 109
Each id is associated with at least 1 year, and each year has a count. The number of years associated with each id is pretty much random.
I wish to rearrange it into this format:
id 2011_count 2012_count 2013_count 2014_count ...
1 0 0 2 20 ...
2 0 0 6 7 ...
...
999 ... ... ...
I'm pretty sure someone else has asked a similar question, but I don't know how/what to search for.
Thanks!
Something like:
result <- reshape(aggregate(count~id+year, df, FUN=sum), idvar="id", timevar="year", direction="wide")
result[is.na(result)] <- 0
names(result) <- gsub("count\\.(.*)", "\\1_count", colnames(result))
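One thing to be aware of: reshape() only creates columns for years that actually occur in the data, so a year such as 2012 with no rows would be missing from the result. A possible variant, sketched in base R with the first six rows from the question, turns year into a factor spanning the full range so that xtabs() emits a zero-filled column for every year. The names all_years and wide are illustrative.

```r
# First six rows from the question
df <- data.frame(
  id    = c(1, 1, 2, 2, 2, 3),
  year  = c(2013, 2014, 2013, 2014, 2015, 2011),
  count = c(2, 20, 6, 7, 8, 13)
)

# Make every year in the range a factor level, including years with no rows
all_years <- seq(min(df$year), max(df$year))
df$year   <- factor(df$year, levels = all_years)

# Sum counts per id and year; empty combinations come out as 0
wide <- as.data.frame.matrix(xtabs(count ~ id + year, data = df))
names(wide) <- paste0(names(wide), "_count")

# Move the ids from row names into a regular column
wide <- cbind(id = as.numeric(rownames(wide)), wide)
rownames(wide) <- NULL
print(wide)
```

Like the aggregate() call in the answer above, xtabs() sums counts when an id and year pair appears more than once.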

how to flatten values in this column to separate columns in R [duplicate]

I have some data which looks as follows
"ID","PROD"
"1001658",6619
"100288",11843
"100288",20106
"1004303",921
I need to convert it into a format like
"ID","PROD_6619","PROD_11843","PROD_20106","PROD_921"
"1001658",1,0,0,0
"100288",0,1,1,0
"1004303",0,0,0,1
Basically, each value in the column PROD from the original data set should end up in a separate column of its own. Note that the above dataset is only a sample, and I cannot hard-code the columns as "PROD_6619","PROD_11843","PROD_20106","PROD_921". There could be many more.
I have tried writing this iteratively using a for loop, and it's very slow for my huge data set.
Can you suggest an alternative in R?
You can just use table for something like this.
Example:
table(mydf)
## PROD
## ID 921 6619 11843 20106
## 100288 0 0 1 1
## 1001658 0 1 0 0
## 1004303 1 0 0 0
Here is another approach using dcast() from the reshape2 package.
library(reshape2)
dcast(dat, ID ~ PROD, length)
Using PROD as value column: use value.var to override.
ID 921 6619 11843 20106
1 100288 0 0 1 1
2 1001658 0 1 0 0
3 1004303 1 0 0 0
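Both answers return counts with bare product numbers as column names. To get closer to the requested layout (an ID column, PROD_ prefixes, and 0/1 indicators even if an ID/PROD pair repeats), a base R sketch along these lines may help. It uses the four sample rows from the question; capping counts at 1 is an assumption about what the asker wants, and the names m and wide are illustrative.

```r
# The four sample rows from the question
mydf <- data.frame(
  ID   = c("1001658", "100288", "100288", "1004303"),
  PROD = c(6619, 11843, 20106, 921)
)

# Cross-tabulate, then cap counts at 1 so the cells are presence indicators
m <- unclass(table(mydf))
m[m > 1] <- 1L

# Convert to a data frame and prefix the product columns
wide <- as.data.frame(m)
names(wide) <- paste0("PROD_", names(wide))

# Move the IDs from row names into a regular column
wide <- cbind(ID = rownames(wide), wide)
rownames(wide) <- NULL
print(wide)
```

The product columns are derived from the values present in PROD, so nothing is hard-coded.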
