Turn column into row in R [duplicate] - r

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 2 years ago.
I have a data frame with the columns "Auto", "ClasPri", and "Total". Each "Auto" can have up to 4 different "ClasPri" values, each with its own "Total". I want each "Auto" to appear only once, with columns "0", "1", "2", "3" holding the respective "Total" values. Can someone help me?
The first image shows the data as it is; the second shows how it should look.

I think you're just trying to pivot your data on ClasPri. Here's a dplyr / tidyr way to do that:
library(tidyr)
library(dplyr)
arrange(df, ClasPri) %>%
  pivot_wider(names_from = ClasPri, values_from = Total, values_fill = 0) %>%
  arrange(Auto)
#> # A tibble: 4 x 5
#>    Auto   `0`      `1`     `2`     `3`
#>   <int> <int>    <int>   <int>   <int>
#> 1   343     0   688160 8260000       0
#> 2   453  6000 29168829 7663334 2275200
#> 3   469     0  7888857       0  540000
#> 4   609     0        0       0       0
Data used
df <- data.frame(Auto = c(343L, 343L, 453L, 453L, 453L, 453L, 469L, 469L, 609L),
                 ClasPri = c(1L, 2L, 0L, 1L, 2L, 3L, 1L, 3L, 1L),
                 Total = c(688160L, 8260000L, 6000L, 29168829L,
                           7663334L, 2275200L, 7888857L, 540000L, 0L))
df
#>   Auto ClasPri    Total
#> 1  343       1   688160
#> 2  343       2  8260000
#> 3  453       0     6000
#> 4  453       1 29168829
#> 5  453       2  7663334
#> 6  453       3  2275200
#> 7  469       1  7888857
#> 8  469       3   540000
#> 9  609       1        0
Created on 2020-07-19 by the reprex package (v0.3.0)
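For comparison, here is a hedged base R sketch of the same pivot: xtabs() sums Total within each Auto/ClasPri cell and fills absent combinations with 0 (this assumes the df defined above; note that Auto ends up in the row names rather than in a column):
# Cross-tabulate summed Totals; missing Auto/ClasPri combinations become 0
wide <- as.data.frame.matrix(xtabs(Total ~ Auto + ClasPri, data = df))
wide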

Related

Creating new columns in dataset as a lookup function in R?

Let's say I have a data table that consists of stock monthly returns:
Company  Year  return  next years return
1        1     5
1        2     6
1        3     2
1        4     4
For a large dataset of multiple companies and years, how can I get a new column that consists of next year's returns? For example, in the first row there would be the second year's return of 6%, and so on. In Excel I could simply use INDEX/MATCH, but I have no idea how it's done in R. The reason for not using Excel is that it takes over 20 hours to compute all the functions, as INDEX/MATCH is extremely slow. The code needs to do this for all companies, so it has to find the correct company for the correct year and then put the value into the new column.
You could group by the company and use lead() to get the next value:
library(dplyr)
df <- data.frame(
  company = c(1L, 1L, 1L, 1L, 2L, 2L),
  year = c(1L, 2L, 3L, 4L, 1L, 2L),
  return_ = c(5L, 6L, 2L, 4L, 2L, 4L))
df
#>   company year return_
#> 1       1    1       5
#> 2       1    2       6
#> 3       1    3       2
#> 4       1    4       4
#> 5       2    1       2
#> 6       2    2       4
df %>%
  group_by(company) %>%
  mutate(next.years.return = lead(return_, order_by = year))
#> # A tibble: 6 × 4
#> # Groups:   company [2]
#>   company  year return_ next.years.return
#>     <int> <int>   <int>             <int>
#> 1       1     1       5                 6
#> 2       1     2       6                 2
#> 3       1     3       2                 4
#> 4       1     4       4                NA
#> 5       2     1       2                 4
#> 6       2     2       4                NA
Created on 2023-02-10 with reprex v2.0.2
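If speed matters at 100,000+ rows, a data.table sketch may help (a hedged alternative, not the answer's method; it reuses the same df, and shift(type = "lead") is data.table's counterpart to lead()):
library(data.table)
# := adds the column by reference, computed per company in year order
setDT(df)[order(year), next.years.return := shift(return_, type = "lead"), by = company]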
Getting the next year's return only if it really is the next year:
library(dplyr)
df %>%
  group_by(Company) %>%
  arrange(Company, Year) %>%
  mutate("next years return" =
           if_else(lead(Year) - Year == 1, lead(`return`), NA)) %>%
  ungroup()
# A tibble: 8 × 4
  Company  Year return `next years return`
    <dbl> <dbl>  <int>               <int>
1       1     1      5                  NA
2       1     3      2                   4
3       1     4      4                   6
4       1     5      6                  NA
5       2     1      5                   6
6       2     2      6                   2
7       2     3      2                   4
8       2     4      4                  NA
Data
df <- structure(list(Company = c(1, 1, 1, 1, 2, 2, 2, 2),
                     Year = c(1, 5, 3, 4, 4, 3, 2, 1),
                     return = c(5L, 6L, 2L, 4L, 4L, 2L, 6L, 5L)),
                row.names = c("1", "2", "3", "4", "41", "31", "21", "11"),
                class = "data.frame")
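A hedged alternative that mirrors Excel's INDEX/MATCH more literally is a self-join on Company and Year + 1 (a sketch assuming the df above; rows only match when the following year is actually present):
library(dplyr)
df %>%
  left_join(
    # Shift each row's Year back by one so it lines up with the previous year
    df %>% transmute(Company, Year = Year - 1, `next years return` = `return`),
    by = c("Company", "Year")
  )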

R - Distinct count across columns

I have a large and complex dataset (a bit too complex to share here, and probably not necessary to share the whole thing) but here's an example of what it looks like. This is just one day and the full sample spans hundreds of days:
What I want to do is devise a way to count variation of the Genre within each Row. To put it more simply (I hope): each Row has 12 Columns, and I want to measure the variation of Genre across those 12 Columns (it's the BBC iPlayer, which many of you might be familiar with). E.g. if a Row comprises 4 "sport", 4 "drama", and 4 "documentary", there would be a distinct count of 3 genres.
I'm thinking that a simple distinct count would be a good way to measure variation within each row (the more distinct the count, the higher the variation), but it's not a very nuanced approach. I.e. if a row comprises 11 "sport" and 1 "documentary", it's a distinct count of 2. If it comprises 6 "sport" and 6 "documentary", it's still a distinct count of 2 - so distinct count doesn't really help in that sense.
I guess I'm asking for advice on two things here:
Firstly, what would be the most appropriate way to measure variation of Genre within each Row?
Secondly, how would I go about doing that? I.e., what code / packages would I need?
I hope that's all clear, but if not, I'd be happy to elaborate on anything. It's perhaps worth noting (as I mentioned above) that I want to determine variation on a specific date, and the sample data shared here is just one date (but I have hundreds).
Thanks in advance :)
*** Update ***
Thanks for the comments below - especially about sharing a snapshot of the real data (which you'll find below). My apologies - I'm a bit of a novice in this area and not really familiar with the proper conventions!
Here's a sample of the data - I hope it's right and I hope it helps:
structure(list(Row = c(0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L), Genre = c("", "Sport", "Drama", "Documentary",
"Entertainment", "Drama", "Comedy", "Crime Drama", "Entertainment",
"Documentary", "Entertainment", "History", "Crime Drama", "",
"", "", "", "", "", "", "", "", "", "", "", "Drama", "Drama",
"Documentary", "Entertainment", "Period Drama"), Column = c(1L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L)), row.names = c(NA,
30L), class = "data.frame")
First create some reproducible data. All we need is Row and Genre:
set.seed(42)
Row <- rep(1:10, each=10)
Genre <- sample(c("Sport", "Drama", "Documentary", "Entertainment", "History", "Crime Drama", "Period Drama", "Film - Comedy", "Film-Thriller"), 100, replace=TRUE)
example <- data.frame(Row, Genre)
str(example)
# 'data.frame': 100 obs. of 2 variables:
# $ Row : int 1 1 1 1 1 1 1 1 1 1 ...
# $ Genre: chr "Sport" "History" "Sport" "Film-Thriller" ...
Now to get the number of different genres in each row:
Count <- tapply(example$Genre, example$Row, function(x) length(unique(x)))
Count
# 1 2 3 4 5 6 7 8 9 10
# 7 5 6 7 6 8 7 7 7 6
There are 7 genres in row 1 and only 5 in row 2. For more detail:
xtabs(~ Genre + Row, example)
#                Row
# Genre           1 2 3 4 5 6 7 8 9 10
#   Crime Drama   0 0 1 1 3 1 1 0 2  1
#   Documentary   0 1 1 1 1 0 1 0 1  0
#   Drama         1 1 1 3 2 1 2 1 1  0
#   Entertainment 2 2 3 1 1 1 1 2 0  0
#   Film - Comedy 1 0 3 2 0 1 2 2 0  2
#   Film-Thriller 1 3 0 0 0 1 1 1 2  2
#   History       1 3 0 1 2 1 2 2 2  1
#   Period Drama  1 0 0 1 0 2 0 1 1  2
#   Sport         3 0 1 0 1 2 0 1 1  2
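Distinct counts ignore how evenly the genres are spread, which the question calls out. One hedged, more nuanced option is Shannon entropy per Row (a sketch reusing the example data above): 11 "sport" plus 1 "documentary" scores about 0.29, while 6 and 6 scores log(2), about 0.69.
# Shannon entropy per Row: 0 when one genre dominates,
# log(k) when k genres appear equally often
entropy <- tapply(example$Genre, example$Row, function(x) {
  p <- table(x) / length(x)
  -sum(p * log(p))
})
round(entropy, 2)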
Reproducible sample data:
set.seed(42)
sampdata <- transform(
  expand.grid(Date = Sys.Date() + 0:2, Row = 0:3, Column = 1:12),
  Genre = sample(c("Crime Drama", "Documentary", "Drama", "Entertainment"),
                 size = 48, replace = TRUE)
)
head(sampdata)
#         Date Row Column         Genre
# 1 2022-02-18   0      1   Crime Drama
# 2 2022-02-19   0      1   Crime Drama
# 3 2022-02-20   0      1   Crime Drama
# 4 2022-02-18   1      1   Crime Drama
# 5 2022-02-19   1      1   Documentary
# 6 2022-02-20   1      1 Entertainment
nrow(sampdata)
# [1] 144
Using dplyr and tidyr, we can group, summarize, then pivot:
library(dplyr)
# library(tidyr) # pivot_wider
sampdata %>%
  group_by(Date, Row) %>%
  summarize(
    Uniq = n_distinct(Genre),
    Var = var(table(Genre))
  ) %>%
  tidyr::pivot_wider(
    id_cols = Date, names_from = Row, values_from = c(Uniq, Var)
  ) %>%
  ungroup()
# # A tibble: 3 x 9
#   Date       Uniq_0 Uniq_1 Uniq_2 Uniq_3 Var_0 Var_1 Var_2 Var_3
#   <date>      <int>  <int>  <int>  <int> <dbl> <dbl> <dbl> <dbl>
# 1 2022-02-18      2      3      2      2     0     3    18    18
# 2 2022-02-19      3      3      1      3     3     3    NA     3
# 3 2022-02-20      2      3      3      3    18     3     3     3
Two things: Uniq_# is the per-Row count of distinct Genre values, and Var_# is the variance of those genre counts. For instance, in your example, two genres with counts 6 and 6 will have a variance of 0, but counts of 11 and 1 will have a variance of 50 (var(c(11, 1))), indicating more variation for that Date/Row combination.
Because we use group_by, if you have even more grouping variables it is straightforward to extend this, both in the grouping and in what aggregation we can do in addition to n_distinct(.) and var(.).
BTW: depending on your other calculations, analysis, and reporting/plotting, it might be useful to keep this in the long format, removing the pivot_wider.
sampdata %>%
  group_by(Date, Row) %>%
  summarize(
    Uniq = n_distinct(Genre),
    Var = var(table(Genre))
  ) %>%
  ungroup()
# # A tibble: 12 x 4
#    Date         Row  Uniq   Var
#    <date>     <int> <int> <dbl>
#  1 2022-02-18     0     2     0
#  2 2022-02-18     1     3     3
#  3 2022-02-18     2     2    18
#  4 2022-02-18     3     2    18
#  5 2022-02-19     0     3     3
#  6 2022-02-19     1     3     3
#  7 2022-02-19     2     1    NA
#  8 2022-02-19     3     3     3
#  9 2022-02-20     0     2    18
# 10 2022-02-20     1     3     3
# 11 2022-02-20     2     3     3
# 12 2022-02-20     3     3     3
Good examples of when to keep it long include further aggregation by Date/Row and plotting with ggplot2 (which really rewards long-data).
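As one hedged illustration of why long data pays off, plotting the per-Row distinct counts over time is a single ggplot2 call (a sketch reusing sampdata from above):
library(dplyr)
library(ggplot2)
sampdata %>%
  group_by(Date, Row) %>%
  summarize(Uniq = n_distinct(Genre), .groups = "drop") %>%
  ggplot(aes(Date, Uniq, colour = factor(Row))) +
  geom_line() +
  labs(y = "Distinct genres", colour = "Row")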

Is there an R function to extract repeating rows of numbers?

I am looking to extract timepoints from a table.
The output should be the starting point in seconds (from column 2) and the duration of the series, but only if the stage lasts for at least 3 minutes (judging by the seconds column), i.e. a repetition of stage 0, 1, 2, 3, or 5 for more than 6 consecutive lines of the stage column.
So in this case the 0-series does not qualify, while the following 1-series does.
The desired output would be: 150, 8
(starting at timepoint 150 and lasting for 8 rows).
I was experimenting with rle(), but haven't been successful yet.
Stage  Seconds
0      0
0      30
0      60
0      90
0      120
1      150
1      180
1      210
1      240
1      270
1      300
1      330
1      360
1      390
0      420
Not sure how representative of your data this might be, but this may be an option using dplyr:
library(dplyr)
df %>%
  mutate(grp = c(0, cumsum(abs(diff(stage))))) %>%
  filter(stage == 1) %>%
  group_by(grp) %>%
  mutate(count = n() - 1) %>%
  filter(row_number() == 1, count >= 6) %>%
  ungroup() %>%
  select(-c(grp, stage))
#> # A tibble: 4 x 2
#>   seconds count
#>     <dbl> <dbl>
#> 1     960    16
#> 2    1500     7
#> 3    2040    17
#> 4    2670    10
Created on 2021-09-23 by the reprex package (v2.0.0)
data
set.seed(123)
df <- data.frame(stage = sample(c(0, 1), 100, replace = TRUE, prob = c(0.2, 0.8)),
                 seconds = seq(0, by = 30, length.out = 100))
Similar to this answer, you can use data.table::rleid() with dplyr:
df <- structure(list(Stage = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L,
                               1L, 1L, 1L, 1L, 1L, 0L),
                     Seconds = c(0L, 30L, 60L, 90L, 120L, 150L, 180L, 210L,
                                 240L, 270L, 300L, 330L, 360L, 390L, 420L)),
                class = "data.frame", row.names = c(NA, -15L))
library(dplyr)
library(data.table)
df %>%
  filter(Seconds > 0) %>%
  group_by(grp = rleid(Stage)) %>%
  filter(n() > 6)
#> # A tibble: 9 x 3
#> # Groups:   grp [1]
#>   Stage Seconds   grp
#>   <int>   <int> <int>
#> 1     1     150     2
#> 2     1     180     2
#> 3     1     210     2
#> 4     1     240     2
#> 5     1     270     2
#> 6     1     300     2
#> 7     1     330     2
#> 8     1     360     2
#> 9     1     390     2
Created on 2021-09-23 by the reprex package (v2.0.0)
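Since you were already experimenting with rle(), here is a hedged base R sketch that reports the start time and length of every qualifying run (assuming the df from this answer; it counts 9 rows for the 1-run, whereas the question's "8" may be counting the intervals between rows):
r <- rle(df$Stage)
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1   # first row index of each run
keep   <- r$lengths > 6          # runs longer than 6 consecutive rows
data.frame(start_seconds = df$Seconds[starts[keep]],
           rows = r$lengths[keep])
#>   start_seconds rows
#> 1           150    9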

How do I remove duplicates based on three columns, but keep the row with the highest number in a specific column, using R?

I have a dataset that looks like this:
Unique Id  Class Id  Version Id
501        1         1
602        3         1
602        3         1
405        2         1
305        2         3
305        2         2
305        1         1
305        2         1
509        1         1
501        2         1
501        3         1
501        3         2
602        2         1
602        1         1
405        1         1
If I were to run the script, the remaining entries should be:
Unique Id  Class Id  Version Id
501        1         1
602        3         1
405        2         1
305        2         3
305        1         1
509        1         1
501        2         1
501        3         2
602        2         1
602        1         1
405        1         1
Note that the row with Unique Id 501, Class Id 3, Version Id 2 was kept because it has the highest Version Id. One of the two identical rows with Unique Id 602, Class Id 3, Version Id 1 is deleted because it matches the other from beginning to end.
Basically, I want the script to delete all duplicates based on the three columns and keep the row with the highest Version Id.
We can use rleid() on the 'Unique Id' column and then slice_max() after grouping by that run-length id and 'Class Id':
library(dplyr)
library(data.table)
data %>%
  group_by(grp = rleid(`Unique Id`), `Class Id`) %>%
  slice_max(`Version Id`) %>%
  ungroup() %>%
  select(-grp) %>%
  distinct()
-output
# A tibble: 11 x 3
#    `Unique Id` `Class Id` `Version Id`
#          <int>      <int>        <int>
# 1          501          1            1
# 2          602          3            1
# 3          405          2            1
# 4          305          1            1
# 5          305          2            3
# 6          509          1            1
# 7          501          2            1
# 8          501          3            2
# 9          602          1            1
#10          602          2            1
#11          405          1            1
Or, if we don't have to treat adjacent blocks of the same Unique Id as one group:
data %>%
  group_by(`Unique Id`, `Class Id`) %>%
  slice_max(`Version Id`) %>%
  ungroup() %>%
  distinct()
Or using base R
ind <- with(rle(data$`Unique Id`), rep(seq_along(values), lengths))
data1 <- data[order(ind, -data$`Version Id`),]
data1[!duplicated(cbind(ind, data1$`Class Id`)),]
data
data <- structure(list(`Unique Id` = c(501L, 602L, 602L, 405L, 305L,
                                       305L, 305L, 305L, 509L, 501L,
                                       501L, 501L, 602L, 602L, 405L),
                       `Class Id` = c(1L, 3L, 3L, 2L, 2L, 2L, 1L, 2L,
                                      1L, 2L, 3L, 3L, 2L, 1L, 1L),
                       `Version Id` = c(1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L,
                                        1L, 1L, 1L, 2L, 1L, 1L, 1L)),
                  class = "data.frame", row.names = c(NA, -15L))
If the order doesn't matter, we can reorder the data so that higher version IDs are on top, and then remove duplicated entries.
df <- df[order(df[,1], df[,2], -df[,3]),]
df <- df[!duplicated(df[,-3]),]
df
   Unique Id Class Id Version Id
7        305        1          1
5        305        2          3
15       405        1          1
4        405        2          1
1        501        1          1
10       501        2          1
12       501        3          2
9        509        1          1
14       602        1          1
13       602        2          1
2        602        3          1
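For completeness, a hedged data.table sketch of the same idea (assuming the data object defined above; after sorting Version Id descending, unique() keeps the first, i.e. highest-version, row per Unique Id / Class Id pair):
library(data.table)
setDT(data)
unique(data[order(`Unique Id`, `Class Id`, -`Version Id`)],
       by = c("Unique Id", "Class Id"))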

How to find the average of several lines with the same id in a big R dataframe? [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 2 years ago.
I have a big data frame (more than 100,000 entries) that looks something like this:
ID   Pre  temp  day
134  10   6     1
134  20   7     1
134  10   8     1
234  5    1     2
234  10   4     2
234  15   10    3
I want to reduce my data frame by finding the mean value of Pre, temp, and day for identical ID values.
At the end, my data frame would look something like this:
ID   Pre   temp  day
134  13.3  7     1
234  10    5     2.3
I'm not sure how to do it. Thank you in advance!
With the dplyr package you can group_by your ID value and then use summarise to take the mean:
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(Pre = mean(Pre),
            temp = mean(temp),
            day = mean(day))
# A tibble: 2 x 4
     ID   Pre  temp   day
  <dbl> <dbl> <dbl> <dbl>
1   134  13.3     7  1
2   234  10       5  2.33
With dplyr, a solution looks like this:
textFile <- "ID Pre temp day
134 10 6 1
134 20 7 1
134 10 8 1
234 5 1 2
234 10 4 2
234 15 10 3"
data <- read.table(text = textFile, header = TRUE)
library(dplyr)
data %>%
  group_by(ID) %>%
  summarise(Pre = mean(Pre), temp = mean(temp), day = mean(day))
...and the output:
# A tibble: 2 x 4
     ID   Pre  temp   day
  <int> <dbl> <dbl> <dbl>
1   134  13.3     7  1
2   234  10       5  2.33
You can try the following:
library(dplyr)
#Data
df <- structure(list(ID = c(134L, 134L, 134L, 234L, 234L, 234L),
                     Pre = c(10L, 20L, 10L, 5L, 10L, 15L),
                     temp = c(6L, 7L, 8L, 1L, 4L, 10L),
                     day = c(1L, 1L, 1L, 2L, 2L, 3L)),
                class = "data.frame", row.names = c(NA, -6L))
#Code
df %>% group_by(ID) %>% summarise_all(mean, na.rm = TRUE)
# A tibble: 2 x 4
     ID   Pre  temp   day
  <int> <dbl> <dbl> <dbl>
1   134  13.3     7  1
2   234  10       5  2.33
There is no need to set each variable individually.
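Note that summarise_all() is superseded in current dplyr; a hedged equivalent using across() (available from dplyr 1.0) is:
# Take the mean of every non-grouping column per ID
df %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))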
