Manipulating a dataframe in R [duplicate] - r

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 6 years ago.
I have the following dataframe in R:
id year count
1 2013 2
1 2014 20
2 2013 6
2 2014 7
2 2015 8
3 2011 13
...
999 2016 109
Each id is associated with at least 1 year, and each year has a count. The number of years associated with each id is pretty much random.
I wish to rearrange it into this format:
id 2011_count 2012_count 2013_count 2014_count ...
1 0 0 2 20 ...
2 0 0 6 7 ...
...
999 ... ... ...
I'm pretty sure someone else has asked a similar question, but I don't know how/what to search for.
Thanks!

Something like:
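# Sum any duplicate id/year rows, then spread the years into columns (base R)
result <- reshape(aggregate(count ~ id + year, df, FUN = sum),
                  idvar = "id", timevar = "year", direction = "wide")
result[is.na(result)] <- 0  # id/year pairs with no data become 0
# Rename columns such as "count.2013" to "2013_count"
names(result) <- gsub("count\\.(.*)", "\\1_count", names(result))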
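An alternative sketch in base R, assuming the data frame is called df as above: xtabs() sums count for each id/year pair and fills missing combinations with 0 on its own. The small df below is only the sample rows from the question, and the ids end up as row names rather than as a column.

```r
# Sample rows from the question
df <- data.frame(
  id    = c(1, 1, 2, 2, 2, 3),
  year  = c(2013, 2014, 2013, 2014, 2015, 2011),
  count = c(2, 20, 6, 7, 8, 13)
)

# Cross-tabulate: one row per id, one column per year, cells hold summed counts
wide <- as.data.frame.matrix(xtabs(count ~ id + year, data = df))

# Turn column names such as "2013" into "2013_count"
names(wide) <- paste0(names(wide), "_count")
```

Only years that occur somewhere in the data get a column, so a year like 2012 would have to be added separately if it never appears.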

Related

R: Dataframe summary into a new dataframe [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
how can I make stacked barplot with ggplot2
(1 answer)
Closed 11 months ago.
I'm really new to R and I'm working on a solution that needs to handle data frames and charts.
I have an original DF like the one below:
Product_ID Sales_2016 Sales_2017 Import
1 1162.91 235.19 1
2 1191.11 944.87 1
3 1214.96 737.06 0
4 1336.07 986.15 0
5 1343.29 871.33 1
6 1208.78 1168.82 0
7 1205.49 900.39 0
8 1675.69 2255.11 0
9 1412.55 1021.05 1
10 1546.06 1893.82 0
11 1393.98 1629.42 0
12 1264.76 1014.74 0
13 1399 2163.64 0
14 1712.75 1845.59 0
15 1310.67 1400.31 1
16 1727.8 1574.98 0
17 1987.14 3359.82 0
18 1711.99 2065.15 0
19 1570.97 2444.15 0
20 1546.2 1912.32 0
I want to build a stacked bar chart from this data, with the year (2016/2017) on the X axis and the sum of sales on the Y axis. Each bar would be split into two stacked groups: imported and not imported.
To achieve this, my solution was to create another DF manually with the summed values:
Import TotalSales Year
1 123456 2016
0 654321 2016
1 456789 2017
0 987654 2017
I created this DF entirely by hand. Is there a faster and more reliable way to create this chart and this kind of summary dataframe?
Regards,
Marcus
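
One possible approach, sketched in base R: stack the two sales columns into long form, then sum by Import and Year. The df below holds only the first four products from the question, and the names long and summary_df are made up for this illustration.

```r
# First four products from the question, for illustration only
df <- data.frame(
  Product_ID = 1:4,
  Sales_2016 = c(1162.91, 1191.11, 1214.96, 1336.07),
  Sales_2017 = c(235.19, 944.87, 737.06, 986.15),
  Import     = c(1, 1, 0, 0)
)

# Long form: one row per product and year
long <- data.frame(
  Import = rep(df$Import, times = 2),
  Year   = rep(c(2016, 2017), each = nrow(df)),
  Sales  = c(df$Sales_2016, df$Sales_2017)
)

# Sum sales within each Import/Year combination
summary_df <- aggregate(Sales ~ Import + Year, data = long, FUN = sum)
names(summary_df)[names(summary_df) == "Sales"] <- "TotalSales"
```

If ggplot2 is available, summary_df can then be plotted with something like ggplot(summary_df, aes(x = factor(Year), y = TotalSales, fill = factor(Import))) + geom_col(), which stacks the fill groups by default.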

Creating columns for each aggregation bucket with group_by() and dplyr [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 2 years ago.
I have a dataset where each row represents a citation to an article, along with the difference between the publication date of the article and each reference, like so:
EID ref delta
1 2-s2 r1 0
2 2-s2 r2 3
3 2-s2 r3 22
4 2-s2 r4 100
5 2-s2 r5 7
6 3-s2 r6 1
7 3-s2 r7 0
8 3-s2 r8 1
I want to determine, for each distinct EID, how many references fall in different ranges of year deltas (i.e., for a given article, how many references are 1 year old, 2 years old, 4 years old, etc.). I attempted to create buckets for each:
buckets = c(0, 1, 2, 4, 8, 16, 32, 64, 9999)
bt = bt %>%
  mutate(delta = as.numeric(delta)) %>%
  mutate(bucket = cut(delta, breaks = buckets))
group = bt %>%
  group_by(EID, bucket) %>%
  summarise(count = n())
The resulting grouped data is:
EID bucket count
1 2-s2 (1,2] 6
2 2-s2 (2,4] 8
3 2-s2 (4,8] 16
4 2-s2 (8,16] 18
5 2-s2 (16,32] 10
6 3-s2 (1,2] 1
7 3-s2 (2,4] 13
8 3-s2 (4,8] 1
9 4-s1 (4,8] 3
I would like to create a column for each bucket I have, and then group by EID, placing the appropriate count in the appropriate bucket for each EID, where the result looks something like this:
EID (1,2] (2,4] (4,8] (8,16] (16,32]
1 2-s2 6 8 16 18 10
2 3-s2 1 13 1 0 0
3 4-s1 0 0 3 0 0
Looking at the code I used to generate the first table, it seems like I should be able to use unstack(group, bucket~count) somehow, or just directly automate the creation of these bucket columns using summarise() but I'm not clear on exactly how to do so. Ideally, I would not have to hard-code in each column; I would like to be able to reference the bucketing list, so if I decide to change the bucketing scheme, it will update accordingly. Thank you!
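We can use pivot_wider to reshape to 'wide' format:
library(dplyr)
library(tidyr)
group %>%
  # values_fill = 0 puts 0 instead of NA where an EID has no rows in a bucket
  # (the scalar form assumes tidyr >= 1.1.0)
  pivot_wider(names_from = bucket, values_from = count, values_fill = 0)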
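As a base R alternative (a sketch, not the answer's method): table() counts rows for every EID/bucket pair and keeps a column for every level of the bucket factor, so the columns follow the buckets vector with nothing hard-coded. The bt below is a small made-up subset of the question's data.

```r
buckets <- c(0, 1, 2, 4, 8, 16, 32, 64, 9999)

# Made-up subset of the citation data
bt <- data.frame(
  EID   = c("2-s2", "2-s2", "2-s2", "3-s2", "3-s2"),
  delta = c(3, 22, 7, 1, 1)
)

# Assign each delta to a bucket; the factor keeps all bucket levels
bt$bucket <- cut(bt$delta, breaks = buckets)

# Count rows per EID and bucket, including empty buckets, then convert
# the contingency table to a data frame with EIDs as row names
wide <- as.data.frame.matrix(table(bt$EID, bt$bucket))
```

Note that with the default settings of cut() a delta of exactly 0 falls outside the first interval (0,1] and becomes NA; passing include.lowest = TRUE would put it in the first bucket.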

iterate through 2 big dataframes with different length (if, else) [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 3 years ago.
I have worked on this for a while and still have not found an efficient way to handle it.
I have two data frames, and both are very large.
df1 looks like this (one ID can have multiple prices):
ID Price
1 20
1 9
2 12
3 587
3 59
3 7
4 78
5 12
6 290
6 191
...
1000000 485
df2 looks like this (each ID has only one location):
ID Location
1 USA
2 CAN
3 TWN
4 USA
5 CAN
6 AUS
...
1000000 JAP
I want to create a new data frame that looks like this (adding Location to df1 based on ID):
ID Price Location
1 20 USA
1 9 USA
2 12 CAN
3 587 TWN
3 59 TWN
3 7 TWN
4 78 USA
5 12 CAN
6 290 AUS
6 191 AUS
...
1000000 485 JAP
I tried merge, but R gave me the error "negative length vectors are not allowed". Both data frames are huge: one has over 2M rows and the other over 0.6M rows.
I also tried lapply inside a loop, but that failed. I cannot figure out how to handle this other than with two nested loops, which would take a long time.
We can do a join with data.table to create the 'Location' column efficiently:
library(data.table)
# Update join: look up each df1 ID in df2 and add Location to df1 by reference
setDT(df1)[df2, Location := Location, on = .(ID)]
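If data.table is not an option, a vectorized lookup with match() in base R is one alternative sketch. It assumes, as the question states, that each ID occurs only once in df2; the small data frames below are the first rows from the question.

```r
# First rows from the question
df1 <- data.frame(ID = c(1, 1, 2, 3, 3, 3), Price = c(20, 9, 12, 587, 59, 7))
df2 <- data.frame(ID = c(1, 2, 3), Location = c("USA", "CAN", "TWN"))

# match() returns, for each df1 ID, the row of df2 holding that ID;
# indexing df2$Location with those positions copies the location across
df1$Location <- df2$Location[match(df1$ID, df2$ID)]
```

IDs in df1 with no counterpart in df2 get NA as their Location, and no loop is needed.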

R Concatenate column in data frame with one value/string [duplicate]

This question already has answers here:
How to add leading zeros?
(8 answers)
Closed 4 years ago.
I am trying to prepend "0000" to the values in a column of a df.
I tried to use paste() in a loop, but it is very slow, as I have more than 2,000,000 rows. It takes forever.
Is there a smarter, faster way to do it?
#DF:
CUSTID VALUE
103 12
104 10
105 15
106 12
... ...
#Desired result:
#DF:
CUSTID VALUE
0000103 12
0000104 10
0000105 15
0000106 12
... ...
How can this be achieved?
paste is vectorized, so it works on a whole vector of values (i.e. a column in a data frame) without a loop. The following should work:
DF <- data.frame(
  CUSTID = 103:107,
  VALUE = 13:17
)
DF$CUSTID <- paste0('0000', DF$CUSTID)
This should give you:
CUSTID VALUE
1 0000103 13
2 0000104 14
3 0000105 15
4 0000106 16
5 0000107 17
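
If the real goal is a fixed-width ID rather than always prepending four zeros (an assumption, since the question only shows three-digit IDs), sprintf() is also vectorized and pads to a given width. This is a different technique from the paste0() call above, and the width of 7 is chosen only to reproduce the sample output.

```r
DF <- data.frame(
  CUSTID = 103:107,
  VALUE = 13:17
)

# "%07d" formats each integer with leading zeros to a total width of 7
DF$CUSTID <- sprintf("%07d", DF$CUSTID)
```

With this variant an ID such as 1030 would become "0001030", whereas paste0('0000', 1030) gives "00001030".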

Adding all values of a variable in R [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 7 years ago.
I don't know how to word the title exactly, so I will just do my best to explain below... Sorry in advance for the .csv format.
I have the following example dataset:
print(data)
ID Tag Flowers
1 1 6871 1
2 2 6750 1
3 3 6859 1
4 4 6767 1
5 5 6747 1
6 6 6261 1
7 7 6750 1
8 8 6767 1
9 9 6812 1
10 10 6746 1
11 11 6496 4
12 12 6497 1
13 13 6495 4
14 14 6481 1
15 15 6485 1
Notice that in Lines 2 and 7, the tag 6750 appears twice. I observed one flower on plant number 6750 on two separate days, equaling two flowers in its lifetime. Basically, I want to add every flower that occurs for tag 6750, tag 6767, etc throughout ~100 rows. Each tag appears more than once, usually around 4 or 5 times.
I feel like I need to apply the unlist function here, but I'm a little bit lost as to how I should do so.
Without any extra packages, you can use the function aggregate():
res <- aggregate(data$Flowers, list(data$Tag), sum)
This calculates the sum of the values in the Flowers column for every value in the Tag column. The result has one row per tag, with columns named Group.1 and x.
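The formula interface of aggregate() does the same thing and keeps the original column names. A small sketch, using the 15 rows printed in the question:

```r
# The 15 rows shown in the question
data <- data.frame(
  ID      = 1:15,
  Tag     = c(6871, 6750, 6859, 6767, 6747, 6261, 6750, 6767,
              6812, 6746, 6496, 6497, 6495, 6481, 6485),
  Flowers = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 4, 1, 1)
)

# Sum Flowers within each Tag; the result has columns Tag and Flowers
res <- aggregate(Flowers ~ Tag, data = data, FUN = sum)
```

Here tags 6750 and 6767 each appear twice in the input, so each ends up with a total of 2 flowers.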
