I am working in R, but I don't know how to split a number into a series of fields. For example, from the number 20102168056, I want to extract the following:
2010 -> year
2 -> semester
168 -> university career
056 -> unique number
I tried to do it with an if statement, but I kept getting more errors. I am new to this and would like to know if you can help. (By the way, it has to work for any number of this form, such as 20211888070, so I dropped the if approach.)
You can use tidyr::separate.
library(tidyverse)
df <- tibble(original = c(20102168056, 20141152013, 20182008006))
df %>%
  # numeric `sep` gives split positions; it needs one fewer entry than `into`
  separate(original, into = c("year", "semester", "university_career", "unique_number"), sep = c(4, 5, 8))
# A tibble: 3 × 4
year semester university_career unique_number
<chr> <chr> <chr> <chr>
1 2010 2 168 056
2 2014 1 152 013
3 2018 2 008 006
You may want to convert some of the columns to an integer:
df %>%
  separate(original, into = c("year", "semester", "university_career", "unique_number"), sep = c(4, 5, 8)) %>%
  mutate(across(year:unique_number, as.integer))
# A tibble: 3 × 4
year semester university_career unique_number
<int> <int> <int> <int>
1 2010 2 168 56
2 2014 1 152 13
3 2018 2 8 6
We can use stringr::str_match().
library(tidyverse)
data <- c(20102168056, 20102168356)
str_match(data, '^(\\d{4})(\\d{1})(\\d{3})(\\d{3})') %>%
as.data.frame() %>%
set_names(c('value', 'year', 'semester', 'university_career', 'unique_number'))
#> value year semester university_career unique_number
#> 1 20102168056 2010 2 168 056
#> 2 20102168356 2010 2 168 356
Created on 2021-12-08 by the reprex package (v2.0.1)
You can use the substr() function if you first make the number into a character with as.character().
test <- '20102168056'
data <- list()
data$year <- substr(test, 1, 4)
data$semester <- substr(test, 5, 5)
data$uni_career <- substr(test, 6, 8)
data$unique_num <- substr(test, 9, 11)
print(data)
#> $year
#> [1] "2010"
#>
#> $semester
#> [1] "2"
#>
#> $uni_career
#> [1] "168"
#>
#> $unique_num
#> [1] "056"
Created on 2021-12-08 by the reprex package (v2.0.1)
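For many IDs at once, the substr() idea above can be wrapped in a small vectorised helper. This is only a sketch: the name split_code and the column names are made up for illustration, and it assumes every ID has exactly 11 digits laid out as in the question.

```r
# Hypothetical helper (not from the answers above): split 11-digit IDs
# into their four fixed-width fields.
split_code <- function(x) {
  # Numbers are formatted without scientific notation; strings are kept as-is
  s <- if (is.numeric(x)) sprintf("%.0f", x) else as.character(x)
  # Assumption: every ID has exactly 11 digits
  stopifnot(all(nchar(s) == 11))
  data.frame(
    year              = substr(s, 1, 4),
    semester          = substr(s, 5, 5),
    university_career = substr(s, 6, 8),
    unique_number     = substr(s, 9, 11),
    stringsAsFactors  = FALSE
  )
}

codes <- split_code(c(20102168056, 20211888070))
print(codes)
```

The fields are kept as character so that leading zeros (as in "056") survive; wrap a column in as.integer() if you need numbers instead.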
New to R, my apologies if there is an easy answer that I don't know of.
I have a dataframe with 127.124 observations and 5 variables
head(SortedDF)
number Retention.time..min. Charge m.z Group
102864 6947 12.58028 5 375.0021 Pro
68971 60641 23.36693 2 375.1373 Pro
75001 104156 24.54187 3 375.1540 Pro
87435 146322 22.69630 3 375.1540 Pro
82658 88256 22.32042 3 375.1541 Pro
113553 97971 14.54600 3 375.1566 Pro
...
I want to compare every row with the row underneath it (so basically rownumber vs rownumber +1) and see if they match. After reading the For and if-else functions, I came up with this code:
for (i in 1:dim(SortedDF))
if(abs(m.z[i]-m.z[i+1])<0.01 | abs(Retention.time..min.[i]-Retention.time..min.[i+1])<1 | (Charge[i]=Charge[i+1]) | Group[i]!=Group[i+1])
print("Match")
else
print("No match")
However, this code does not work, as it only prints out the first function [1], and I'm not sure whether i+1 is valid here. Is there any way to solve this without using i+1?
library(tidyverse)
data <- tibble(x = c(1, 1, 2), y = "a")
data
#> # A tibble: 3 × 2
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 1 a
#> 3 2 a
same_rows <-
data %>%
# consider all columns
unite(col = "all") %>%
transmute(same_as_next_row = all == lead(all))
data %>%
bind_cols(same_rows)
#> # A tibble: 3 × 3
#> x y same_as_next_row
#> <dbl> <chr> <lgl>
#> 1 1 a TRUE
#> 2 1 a FALSE
#> 3 2 a NA
Created on 2022-03-30 by the reprex package (v2.0.0)
library(tidyverse)
data <- tibble::tribble(
~id, ~number, ~Retention.time..min., ~Charge, ~m.z, ~Group,
102864, 6947, 12.58028, 5, 375.0021, "Pro",
68971, 60641, 23.36693, 2, 375.1373, "Pro",
75001, 104156, 24.54187, 3, 375.1540, "Pro",
87435, 146322, 22.69630, 3, 375.1540, "Pro",
82658, 88256, 22.32042, 3, 375.1541, "Pro",
113553, 97971, 14.54600, 3, 375.1566, "Pro"
)
data %>%
mutate(
matches_with_next_row = (abs(m.z - lead(m.z)) < 0.01) |
(abs(Retention.time..min. - lead(Retention.time..min.)) < 1)
)
#> # A tibble: 6 × 7
#> id number Retention.time..min. Charge m.z Group matches_with_next_row
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 102864 6947 12.6 5 375. Pro FALSE
#> 2 68971 60641 23.4 2 375. Pro FALSE
#> 3 75001 104156 24.5 3 375. Pro TRUE
#> 4 87435 146322 22.7 3 375. Pro TRUE
#> 5 82658 88256 22.3 3 375. Pro TRUE
#> 6 113553 97971 14.5 3 375. Pro NA
Created on 2022-03-30 by the reprex package (v2.0.0)
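The same comparison can also be written in base R without a loop, by indexing each column with a "next row" index vector. The sketch below is an assumption about what the question intends: it requires all four conditions to hold (combined with &) and treats equal Group as a match; swap in | or != if your definition differs. The names nxt, is_match and label are invented for this example.

```r
# Sample rows copied from the question
SortedDF <- data.frame(
  number = c(6947, 60641, 104156, 146322, 88256, 97971),
  Retention.time..min. = c(12.58028, 23.36693, 24.54187, 22.69630, 22.32042, 14.54600),
  Charge = c(5, 2, 3, 3, 3, 3),
  m.z = c(375.0021, 375.1373, 375.1540, 375.1540, 375.1541, 375.1566),
  Group = "Pro",
  stringsAsFactors = FALSE
)

n <- nrow(SortedDF)
# Index of the row underneath each row; the last row has no successor
nxt <- c(seq_len(n)[-1], NA)

# Assumption: a match needs ALL four conditions to be true
is_match <- with(SortedDF,
  abs(m.z - m.z[nxt]) < 0.01 &
    abs(Retention.time..min. - Retention.time..min.[nxt]) < 1 &
    Charge == Charge[nxt] &
    Group == Group[nxt]
)

SortedDF$label <- ifelse(is_match, "Match", "No match")
print(SortedDF)
```

Because the whole column is compared at once, there is no i+1 that can run past the end of the data; the last row simply gets NA.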
I am having trouble getting the desired number of decimal places from summarise. Here is a simple example:
test2 <- data.frame(c("a","a","b","b"), c(245,246,247,248))
library(dplyr)
colnames(test2) <- c("V1","V2")
group_by(test2,V1) %>% summarise(mean(V2))
The dataframe is:
V1 V2
1 a 245
2 a 246
3 b 247
4 b 248
The output is:
V1 `mean(V2)`
<fctr> <dbl>
1 a 246
2 b 248
I would like it to give me the means including the decimal place (i.e. 245.5 and 247.5)
Because you are using dplyr tools, the resulting output is actually a tibble, which by default prints numbers with 3 significant digits (see option pillar.sigfig). This is not the same as the number of digits after the decimal point. To see the latter, simply convert the result to a data.frame with as.data.frame().
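As a quick check of that claim, here is a minimal sketch applied to the data from the question. The column name mean_V2 is introduced here for convenience; the stored values are unchanged by the conversion, only the printing differs.

```r
library(dplyr)

test2 <- data.frame(V1 = c("a", "a", "b", "b"), V2 = c(245, 246, 247, 248))

means <- test2 %>%
  group_by(V1) %>%
  summarise(mean_V2 = mean(V2))

# The tibble may print rounded values, but the exact means are stored
means$mean_V2

# Printing as a plain data.frame uses base R formatting instead of pillar's
plain <- as.data.frame(means)
print(plain)
```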
Note that tibble's concept of significant digits is somewhat complicated: it does not indicate how many digits are shown after the decimal point, but the minimum number of digits needed to represent the number to a given accuracy (I think 99.9%, see discussion here).
This means the number of digits printed depends on the "size" of your number:
library(tibble)
packageVersion("tibble")
#> [1] '2.1.3'
packageVersion("pillar")
#> [1] '1.4.2'
tab <- tibble(x = c(0.1234, 1.1234, 10.1234, 100.1234, 1000.1234))
options(pillar.sigfig=3)
tab
#> # A tibble: 5 x 1
#> x
#> <dbl>
#> 1 0.123
#> 2 1.12
#> 3 10.1
#> 4 100.
#> 5 1000.
options(pillar.sigfig=4)
tab
#> # A tibble: 5 x 1
#> x
#> <dbl>
#> 1 0.1234
#> 2 1.123
#> 3 10.12
#> 4 100.1
#> 5 1000.
as.data.frame(tab)
#> x
#> 1 0.1234
#> 2 1.1234
#> 3 10.1234
#> 4 100.1234
#> 5 1000.1234
Created on 2019-08-21 by the reprex package (v0.3.0)
This is one solution:
test2 <- data.frame(c("a", "a", "b", "b"), c(245, 246, 247, 248))
library(dplyr)
colnames(test2) <- c("V1", "V2")
group_by(test2, V1) %>%
dplyr::summarise(mean(V2)) %>%
dplyr::mutate_if(is.numeric, format, 1)
#> # A tibble: 2 x 2
#> V1 `mean(V2)`
#> <fct> <chr>
#> 1 a 245.5
#> 2 b 247.5
Created on 2018-01-20 by the reprex package (v0.1.1.9000).
EDIT :
If you want to keep it as numeric :
test2 <- data.frame(c("a", "a", "b", "b"), c(245, 246, 247, 248))
library(dplyr)
colnames(test2) <- c("V1", "V2")
group_by(test2, V1) %>%
dplyr::summarise(mean(V2)) %>%
as.data.frame(.) %>%
dplyr::mutate_if(is.numeric, round, 1)
Gives
V1 mean(V2)
1 a 245.5
2 b 247.5
And with another example (from @Matifou):
tab <- tibble(x = c(0.1234, 1.1234, 10.1234, 100.1234, 1000.1234))
tab %>%
as.data.frame(.) %>%
dplyr::mutate_if(is.numeric, round, 2)
Gives :
x
1 0.12
2 1.12
3 10.12
4 100.12
5 1000.12
I think the simplest solution is the following:
test2 <- data.frame(c("a","a","b","b"), c(245,246,247,248))
library(dplyr)
colnames(test2) <- c("V1","V2")
group_by(test2,V1) %>% summarise(`mean(V2)` = sprintf("%0.1f",mean(V2)))
# A tibble: 2 x 2
V1 `mean(V2)`
<fct> <chr>
1 a 245.5
2 b 247.5
I currently work with multiple large datasets that have the same number of rows but different numbers of columns. I need to calculate the rate of change between columns and add it either to a new object or to the existing one so I can continue my analysis.
When searching the web, I mostly found people calculating the rate of change within a column, not between columns. Is the easiest way to just transpose all my data?
I am sorry for the vague description of my problem; neither R nor English is my first language.
I hope you can still point me in the right direction to further my understanding of R.
Thank you in advance for any tips you might have!
I recommend joining all the data together and then converting it into a normalized (3NF) long-format table:
library(tidyverse)
data1 <- tibble(
country = c("A", "B", "C"),
gdp_2020 = c(1, 8, 10),
gdp_2021 = c(1, 8, 10),
population_2010 = c(5e3, 6e3, 6e3),
population_2020 = c(5.5e3, 6.8e3, 6e3)
)
data1
#> # A tibble: 3 x 5
#> country gdp_2020 gdp_2021 population_2010 population_2020
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 1 5000 5500
#> 2 B 8 8 6000 6800
#> 3 C 10 10 6000 6000
data2 <- tibble(
country = c("A", "B", "C"),
population_2021 = c(7e3, 8e3, 7e3),
population_2022 = c(7e3, 7e3, 10e3)
)
data2
#> # A tibble: 3 x 3
#> country population_2021 population_2022
#> <chr> <dbl> <dbl>
#> 1 A 7000 7000
#> 2 B 8000 7000
#> 3 C 7000 10000
list(
data1,
data2
) %>%
reduce(full_join) %>%
pivot_longer(matches("^(gdp|population)")) %>%
separate(name, into = c("variable", "year"), sep = "_") %>%
type_convert() %>%
arrange(country, variable, year) %>%
group_by(variable, country) %>%
mutate(
# NA for the first value because it does not have a precursor to calculate change
change_rate = (value - lag(value)) / (year - lag(year))
)
#> Joining, by = "country"
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> country = col_character(),
#> variable = col_character(),
#> year = col_double()
#> )
#> # A tibble: 18 x 5
#> # Groups: variable, country [6]
#> country variable year value change_rate
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 A gdp 2020 1 NA
#> 2 A gdp 2021 1 0
#> 3 A population 2010 5000 NA
#> 4 A population 2020 5500 50
#> 5 A population 2021 7000 1500
#> 6 A population 2022 7000 0
#> 7 B gdp 2020 8 NA
#> 8 B gdp 2021 8 0
#> 9 B population 2010 6000 NA
#> 10 B population 2020 6800 80
#> 11 B population 2021 8000 1200
#> 12 B population 2022 7000 -1000
#> 13 C gdp 2020 10 NA
#> 14 C gdp 2021 10 0
#> 15 C population 2010 6000 NA
#> 16 C population 2020 6000 0
#> 17 C population 2021 7000 1000
#> 18 C population 2022 10000 3000
Created on 2021-12-16 by the reprex package (v2.0.1)
Example: rate of change in the second row (gdp of country A) is 0 because it was the same in both 2021 and 2020.
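If you would rather stay in the wide format, the change between neighbouring columns can also be computed directly with matrix arithmetic. This is a rough base R sketch, not part of the answer above: it assumes the value columns are already in chronological order and one time step apart, and the names wide, vals, abs_change and rel_change are invented for the example.

```r
wide <- data.frame(
  country = c("A", "B", "C"),
  population_2020 = c(5500, 6800, 6000),
  population_2021 = c(7000, 8000, 7000),
  population_2022 = c(7000, 7000, 10000)
)

# Keep only the numeric columns, in time order (assumption)
vals <- as.matrix(wide[, -1])
k <- ncol(vals)

# Each column minus the column to its left
abs_change <- vals[, -1, drop = FALSE] - vals[, -k, drop = FALSE]
# The same change relative to the earlier value
rel_change <- abs_change / vals[, -k, drop = FALSE]

colnames(abs_change) <- paste0("change_", colnames(vals)[-1])
result <- cbind(wide, abs_change)
print(result)
```

This avoids reshaping, but it breaks down as soon as the columns are unevenly spaced in time or mix different variables, which is where the long format is more robust.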
I have a large data.frame that I am trying to spread. A toy example looks like this.
data = data.frame(date = rep(c("2019", "2020"), 2), ticker = c("SPY", "SPY", "MSFT", "MSFT"), value = c(1, 2, 3, 4))
head(data)
date ticker value
1 2019 SPY 1
2 2020 SPY 2
3 2019 MSFT 3
4 2020 MSFT 4
I would like to spread it so the data.frame looks like this.
spread(data, key = ticker, value = value)
date MSFT SPY
1 2019 3 1
2 2020 4 2
However, when I do this on my actual data.frame, I get an error.
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 18204 rows:
* 30341, 166871
* 30342, 166872
* 30343, 166873
* 30344, 166874
* 30345, 166875
* 30346, 166876
* 30347, 166877
* 30348, 166878
* 30349, 166879
* 30350, 166880
* 30351, 166881
* 30352, 166882
Below is a head and tail of my data.frame
head(df)
ref.date ticker weeklyReturn
<date> <chr> <dbl>
1 2008-02-01 SPY NA
2 2008-02-04 SPY NA
3 2008-02-05 SPY NA
4 2008-02-06 SPY NA
5 2008-02-07 SPY NA
6 2008-02-08 SPY -0.0478
tail(df)
ref.date ticker weeklyReturn
<date> <chr> <dbl>
1 2020-02-12 MDYV 0.00293
2 2020-02-13 MDYV 0.00917
3 2020-02-14 MDYV 0.0179
4 2020-02-18 MDYV 0.0107
5 2020-02-19 MDYV 0.00422
6 2020-02-20 MDYV 0.00347
You can use the dplyr and tidyr packages. To get rid of that error, you first have to aggregate (here: sum) the values for each date-ticker group.
library(dplyr)
library(tidyr)
data %>%
  group_by(date, ticker) %>%
  summarise(value = sum(value)) %>%
  pivot_wider(names_from = ticker, values_from = value)
# date MSFT SPY
# <fct> <dbl> <dbl>
# 1 2019 3 1
# 2 2020 4 2
As said in the comments, you have multiple values for same combination of date-ticker. You need to define what to do with it.
Here with a reprex:
library(tidyr)
library(dplyr)
# your data is more like:
data = data.frame(
date = c(2019, rep(c("2019", "2020"), 2)),
ticker = c("SPY", "SPY", "SPY", "MSFT", "MSFT"),
value = c(8, 1, 2, 3, 4))
# With two values for same date-ticker combination
data
#> date ticker value
#> 1 2019 SPY 8
#> 2 2019 SPY 1
#> 3 2020 SPY 2
#> 4 2019 MSFT 3
#> 5 2020 MSFT 4
# Results in error
data %>%
spread(ticker, value)
#> Error: Each row of output must be identified by a unique combination of keys.
#> Keys are shared for 2 rows:
#> * 1, 2
# The newer pivot_wider() creates list-columns for duplicates
data %>%
  pivot_wider(names_from = ticker, values_from = value)
#> Warning: Values in `value` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(value = list)` to suppress this warning.
#> * Use `values_fn = list(value = length)` to identify where the duplicates arise
#> * Use `values_fn = list(value = summary_fun)` to summarise duplicates
#> # A tibble: 2 x 3
#> date SPY MSFT
#> <fct> <list> <list>
#> 1 2019 <dbl [2]> <dbl [1]>
#> 2 2020 <dbl [1]> <dbl [1]>
# Otherwise, decide yourself how to summarise duplicates with mean() for instance
data %>%
group_by(date, ticker) %>%
summarise(value = mean(value, na.rm = TRUE)) %>%
spread(ticker, value)
#> # A tibble: 2 x 3
#> # Groups: date [2]
#> date MSFT SPY
#> <fct> <dbl> <dbl>
#> 1 2019 3 4.5
#> 2 2020 4 2
Created on 2020-02-22 by the reprex package (v0.3.0)
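The warning shown above also hints at a shortcut: the summary function can be handed to pivot_wider() itself through values_fn, so the separate group_by()/summarise() step is not needed. A minimal sketch using the list form quoted in that warning; whether mean() is the right way to collapse duplicates depends on your data.

```r
library(tidyr)

# Same toy data as above, with two values for the 2019/SPY combination
data <- data.frame(
  date = c("2019", "2019", "2020", "2019", "2020"),
  ticker = c("SPY", "SPY", "SPY", "MSFT", "MSFT"),
  value = c(8, 1, 2, 3, 4)
)

# Duplicated date-ticker combinations are collapsed with mean()
wide <- pivot_wider(
  data,
  names_from = ticker,
  values_from = value,
  values_fn = list(value = mean)
)
print(wide)
```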