I have a data frame with this structure:
var
A1sometext_r2
BXother_r11
A1sometext_r4
C7sometext_r8
And would like a new column that stores the number that follows the "r"
var new
A1sometext_r2 2
BXother_r11 11
A1sometext_r4 4
C7sometext_r8 8
I'm trying to incorporate this into a pipe, so a tidyverse solution would be better
Thx!
You can do it like this:
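library(dplyr)  # provides tibble(), mutate() and %>%

tibble(var = paste0('lala_r', sample(1:20, 15))) %>%
  dplyr::mutate(
    # keep only the digits captured after "_r" at the end of the string
    new = stringr::str_replace_all(var, '.*_r([0-9]*)$', '\\1'),
    new = as.integer(new)
  )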
Output:
# A tibble: 15 x 2
var new
<chr> <int>
1 lala_r8 8
2 lala_r11 11
3 lala_r16 16
4 lala_r7 7
5 lala_r1 1
6 lala_r10 10
7 lala_r12 12
8 lala_r9 9
9 lala_r18 18
10 lala_r6 6
11 lala_r3 3
12 lala_r20 20
13 lala_r4 4
14 lala_r14 14
15 lala_r15 15
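The same idea can be applied to the four values from the question with str_extract() instead of a replacement. This is only a sketch: the lookbehind pattern `(?<=_r)` assumes the number always sits at the very end of the string, directly after "_r", and `df_new` is a name I made up.

```r
library(dplyr)
library(stringr)

# The four example values from the question
df <- tibble(var = c("A1sometext_r2", "BXother_r11", "A1sometext_r4", "C7sometext_r8"))

df_new <- df %>%
  # grab the run of digits that directly follows "_r" at the end of the string
  mutate(new = as.integer(str_extract(var, "(?<=_r)[0-9]+$")))

df_new$new
#> [1]  2 11  4  8
```

One difference from the replacement approach: when a value does not match, str_extract() returns NA, whereas str_replace_all() returns the string unchanged and as.integer() then produces NA with a coercion warning.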
I have a data set with 1000 variables. The naming fashion of the variable is as shown in the figure below.
Now I want to use a loop function to standardize each of these 1000 variables and keep their original names. That is, I want the new "SCORE.1" to be the standardized "SCORE.1", new "SCORE.2" is the standardized "SCORE.2".
How can I do this? Many thanks!
Perhaps it would be better to keep the 'original' data (e.g. "df_1") and create a new dataframe (e.g. "df_2") with the transformed values, i.e.
library(tidyverse)
# Create some fake data
set.seed(123)
names <- paste("SCORE", 1:1000, sep = ".")
IDs <- 1:100
# 100 rows x 1000 columns needs 100000 values (fewer would be silently recycled)
m <- matrix(sample(1:20, 100000, replace = TRUE), ncol = 1000, nrow = 100,
            dimnames = list(IDs, names))
df_1 <- as.data.frame(m)
head(df_1)
#> SCORE.1 SCORE.2 SCORE.3 SCORE.4 SCORE.5 SCORE.6 SCORE.7 SCORE.8 SCORE.9
#> 1 15 6 9 15 11 7 9 8 6
#> 2 19 16 16 19 15 4 16 20 4
#> 3 14 11 17 6 20 10 9 11 3
#> 4 3 4 13 16 2 17 2 18 14
#> 5 10 12 8 15 16 16 9 14 19
#> 6 18 14 7 19 19 8 11 3 14
# Transform the 'original' fake data into 'new' fake data
df_2 <- df_1 %>%
  # Subtract the mean first, then divide by the standard deviation.
  # Without the inner parentheses, R computes .x - (mean(.x) / sd(.x)),
  # which only shifts each column and does not standardize it.
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))
# Every column of df_2 now has mean 0 and standard deviation 1
head(df_2)
Does this answer your question?
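If you want to confirm that a transformation like this really standardizes every column and keeps the names, a quick check on a small stand-in data set could look like the sketch below. `df_small` and `df_std` are hypothetical names, and the 5-column data is simulated just for the check.

```r
library(dplyr)

set.seed(123)
# Small simulated stand-in for the 1000-column data: 100 rows x 5 columns
df_small <- as.data.frame(
  matrix(sample(1:20, 500, replace = TRUE), ncol = 5,
         dimnames = list(NULL, paste("SCORE", 1:5, sep = ".")))
)

# z-score each column: (value - column mean) / column sd
df_std <- df_small %>%
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))

# Column names are unchanged
identical(names(df_std), names(df_small))
# Column means are zero up to floating-point error
all(abs(colMeans(df_std)) < 1e-12)
# Column standard deviations are one up to floating-point error
all(abs(sapply(df_std, sd) - 1) < 1e-12)
```

All three checks should return TRUE. If the means come back far from zero, the usual culprit is operator precedence in the formula.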
I have a data frame that includes information about schools. The code below produces a toy example.
df <- tibble(grade_range = c('1-3','2-5','5-12'),
school = c('AAA', 'BBB', 'CCC'),
score = c(100, 110, 150))
The current data has one row per school, with a single character variable indicating the range of grade levels. I'd like to have a longer dataset, with one row per school-by-grade combination. The code below does the job, but it feels like a clumsy workaround, and I'm wondering if there's a more efficient way to produce the same output.
df_long <- df %>%
mutate(low_grade = as.numeric(str_remove(str_extract(grade_range, '[[:digit:]]+-'),'-')),
high_grade = as.numeric(str_remove(str_extract(grade_range, '-[[:digit:]]+'),'-')),
fake_join_var = 1) %>%
left_join(data.frame(grade_level = c(1:12), fake_join_var = rep(1,12))) %>%
select(-fake_join_var) %>%
filter(grade_level >= low_grade &
grade_level <= high_grade)
(To be clear, df_long is exactly the output I want, I'm just wondering if there's a simpler way of producing it, maybe with purrr somehow?)
Since your code is based on the difference between low_grade and high_grade, you still have to extract the numerical value from the string.
However, after that, you can simply unnest() the sequence between the two.
Here is the code:
library(tidyverse)
df <- tibble(grade_range = c('1-3','2-5','5-12'),
school = c('AAA', 'BBB', 'CCC'),
score = c(100, 110, 150))
x = df %>%
mutate(
low_grade = as.numeric(str_remove(str_extract(grade_range, '\\d+-'),'-')),
high_grade = as.numeric(str_remove(str_extract(grade_range, '-\\d+'),'-')),
grade_level = map2(low_grade, high_grade, seq)
) %>%
unnest(grade_level)
x
#> # A tibble: 15 x 6
#> grade_range school score low_grade high_grade grade_level
#> <chr> <chr> <dbl> <dbl> <dbl> <int>
#> 1 1-3 AAA 100 1 3 1
#> 2 1-3 AAA 100 1 3 2
#> 3 1-3 AAA 100 1 3 3
#> 4 2-5 BBB 110 2 5 2
#> 5 2-5 BBB 110 2 5 3
#> 6 2-5 BBB 110 2 5 4
#> 7 2-5 BBB 110 2 5 5
#> 8 5-12 CCC 150 5 12 5
#> 9 5-12 CCC 150 5 12 6
#> 10 5-12 CCC 150 5 12 7
#> 11 5-12 CCC 150 5 12 8
#> 12 5-12 CCC 150 5 12 9
#> 13 5-12 CCC 150 5 12 10
#> 14 5-12 CCC 150 5 12 11
#> 15 5-12 CCC 150 5 12 12
waldo::compare(df_long, x)
#> v No differences
Created on 2021-10-01 by the reprex package (v2.0.0)
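The two str_extract()/str_remove() calls can also be replaced by a single split on the hyphen. The sketch below uses tidyr::separate() with convert = TRUE, assuming every grade_range has exactly the form "low-high"; `df_long2` is a hypothetical name. Note that separate() is marked as superseded in recent tidyr releases but is still available.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- tibble(grade_range = c("1-3", "2-5", "5-12"),
             school = c("AAA", "BBB", "CCC"),
             score = c(100, 110, 150))

df_long2 <- df %>%
  # split "1-3" into two integer columns, keeping the original string
  separate(grade_range, into = c("low_grade", "high_grade"),
           sep = "-", remove = FALSE, convert = TRUE) %>%
  # one integer sequence per school, then one row per grade
  mutate(grade_level = map2(low_grade, high_grade, seq)) %>%
  unnest(grade_level)

nrow(df_long2)
#> [1] 15
```

The rows are the same school-by-grade combinations as before, but the column order and column types (integer instead of double for the bounds) differ from df_long, so an exact comparison would report those differences.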
I have a table with hundreds of columns. Their names end either with .a or .b
What I need is to rename every column ending in .a to column.a_new and every column ending in .b to column.b_new, all at once.
I can do it only one pattern at a time but I don't know how to do it at once for all columns.
rename_at_example <- my_table %>% rename_at(vars(ends_with(".a")),
funs(str_replace(., ".a", ".a_new")))
Any idea how to write it in a compact way for all columns?
Thank you
One dplyr option could be:
df %>%
rename_at(vars(matches("[ab]$")), ~ paste0(., "_new"))
col1a_new col2a_new col1b_new col2b_new col1c col2c
1 1 11 1 11 1 11
2 2 12 2 12 2 12
3 3 13 3 13 3 13
4 4 14 4 14 4 14
5 5 15 5 15 5 15
6 6 16 6 16 6 16
7 7 17 7 17 7 17
8 8 18 8 18 8 18
9 9 19 9 19 9 19
10 10 20 10 20 10 20
Sample data:
df <- data.frame(col1a = 1:10,
col2a = 11:20,
col1b = 1:10,
col2b = 11:20,
col1c = 1:10,
col2c = 11:20,
stringsAsFactors = FALSE)
If the '.a' names and the '.b' names need different replacements (rather than one shared action such as adding '_new' to the end), you could use reduce2:
library(tidyverse) # dplyr + purrr for reduce2
df <- data.frame(one.a = 1, one.d = 2, twoa = 3, two.b = 4, three.a = 5)
df
# one.a one.d twoa two.b three.a
# 1 1 2 3 4 5
df %>%
rename_all(~ reduce2(c('\\.a$', '\\.b$'), c('.a_new1', '.b_new2'),
str_replace, .init = .x))
# one.a_new1 one.d twoa two.b_new2 three.a_new1
# 1 1 2 3 4 5
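In dplyr 1.0.0 and later, rename_at() and rename_all() are superseded by rename_with(). For the simple case where both suffixes get the same addition, a sketch could look like this. It assumes the suffix is a literal ".a" or ".b" at the end of the name, and `renamed` is a hypothetical name.

```r
library(dplyr)

df <- data.frame(one.a = 1, one.d = 2, twoa = 3, two.b = 4, three.a = 5)

renamed <- df %>%
  # "\\.[ab]$" matches a literal dot followed by a or b at the end of the name
  rename_with(~ paste0(.x, "_new"), .cols = matches("\\.[ab]$"))

names(renamed)
#> [1] "one.a_new"   "one.d"       "twoa"        "two.b_new"   "three.a_new"
```

Escaping the dot matters here: an unescaped "." matches any character, so a column such as "twoa" would be renamed as well.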
I am working with gait-cycle data. I have 8 events marked for each id and gait trial. The values "LFCH" and "RFCH" occur twice in each trial, as these represent the beginning and the end of the gait cycles from left and right leg.
Sample Data Frame:
df <- data.frame(ID = rep(1:5, each = 16),
                 Gait_nr = rep(1:2, each = 8, times = 5),
                 Frame = rep(c(1, 5, 7, 9, 10, 15, 22, 25), times = 10),
                 Marks = rep(c("LFCH", "LHL", "RFCH", "LTO", "RHL", "LFCH", "RTO", "RFCH"), times = 10))
head(df,8)
ID Gait_nr Frame Marks
1 1 1 1 LFCH
2 1 1 5 LHL
3 1 1 7 RFCH
4 1 1 9 LTO
5 1 1 10 RHL
6 1 1 15 LFCH
7 1 1 22 RTO
8 1 1 25 RFCH
I would like to create something like
Total_gait_left = Frame[The last time Marks == "LFCH"] - Frame[The first time Marks == "LFCH"]
My current code solves the problem, but depends on the position of the Frame values rather than actual values in Marks. Any individual not following the normal gait pattern will have wrong values produced by the code.
library(tidyverse)
l <- df %>% group_by(ID, Gait_nr) %>% filter(grepl("L.+", Marks)) %>%
summarize(Total_gait = Frame[4] - Frame[1],
Side = "left")
r <- df %>% group_by(ID, Gait_nr) %>% filter(grepl("R.+", Marks)) %>%
summarize(Total_gait = Frame[4] - Frame[1],
Side = "right")
val <- union(l,r, by=c("ID", "Gait_nr", "Side")) %>% arrange(ID, Gait_nr, Side)
Can you help me make my code more stable by helping me change e.g. Frame[4] to something like Frame[Marks=="LFCH" the last time ]?
If both LFCH and RFCH happen exactly twice, you can filter and then use diff in summarize:
df %>%
group_by(ID, Gait_nr) %>%
summarise(
left = diff(Frame[Marks == 'LFCH']),
right = diff(Frame[Marks == 'RFCH'])
)
# A tibble: 10 x 4
# Groups: ID [?]
# ID Gait_nr left right
# <int> <int> <dbl> <dbl>
# 1 1 1 14 18
# 2 1 2 14 18
# 3 2 1 14 18
# 4 2 2 14 18
# 5 3 1 14 18
# 6 3 2 14 18
# 7 4 1 14 18
# 8 4 2 14 18
# 9 5 1 14 18
#10 5 2 14 18
We can use first and last from the dplyr package.
library(dplyr)
df2 <- df %>%
filter(Marks %in% "LFCH") %>%
group_by(ID, Gait_nr) %>%
summarise(Total_gait = last(Frame) - first(Frame)) %>%
ungroup()
df2
# # A tibble: 10 x 3
# ID Gait_nr Total_gait
# <int> <int> <dbl>
# 1 1 1 14
# 2 1 2 14
# 3 2 1 14
# 4 2 2 14
# 5 3 1 14
# 6 3 2 14
# 7 4 1 14
# 8 4 2 14
# 9 5 1 14
# 10 5 2 14
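To get both legs in one long table, like the `val` object in the question, the first()/last() idea can be combined with a Side column. This is a sketch under two assumptions: the rows are already ordered by Frame within each trial, and the cycle markers are exactly "LFCH" and "RFCH".

```r
library(dplyr)

df <- data.frame(ID = rep(1:5, each = 16),
                 Gait_nr = rep(1:2, each = 8, times = 5),
                 Frame = rep(c(1, 5, 7, 9, 10, 15, 22, 25), times = 10),
                 Marks = rep(c("LFCH", "LHL", "RFCH", "LTO", "RHL", "LFCH", "RTO", "RFCH"),
                             times = 10))

val <- df %>%
  # keep only the events that open and close a gait cycle
  filter(Marks %in% c("LFCH", "RFCH")) %>%
  mutate(Side = if_else(Marks == "LFCH", "left", "right")) %>%
  group_by(ID, Gait_nr, Side) %>%
  # last occurrence minus first occurrence, looked up by value, not position
  summarise(Total_gait = last(Frame) - first(Frame), .groups = "drop")

head(val, 2)
```

With the sample data this gives 14 for every left cycle and 18 for every right cycle, matching the answers above. If the rows might not be sorted by Frame, max(Frame) - min(Frame) is a safer choice than last() and first().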
I get a freq table, but can I save this table in a csv file or - better - sort it or extract the biggest values?
library(plyr)
count(birthdaysExample, 'month')
I'm guessing at what the relevant part of your data looks like, but in any case this should get you a frequency table sorted by values:
library(plyr)
birthdaysExample <- data.frame(month = round(runif(200, 1, 12)))
freq_df <- count(birthdaysExample, 'month')
freq_df[order(freq_df$freq, decreasing = TRUE), ]
This gives you:
month freq
5 5 29
9 9 24
3 3 22
4 4 18
6 6 17
7 7 15
2 2 14
10 10 14
11 11 14
8 8 13
1 1 10
12 12 10
To get the highest 3 values:
library(magrittr)
freq_df[order(freq_df$freq, decreasing = TRUE), ] %>% head(., 3)
month freq
5 5 29
9 9 24
3 3 22
Or, with just base R:
head(freq_df[order(freq_df$freq, decreasing = TRUE), ], 3)
With dplyr
dplyr is a newer approach to many routine data manipulations in R that is a bit more intuitive:
library(dplyr)
library(magrittr)
freq_df2 <- birthdaysExample %>%
group_by(month) %>%
summarize(freq = n()) %>%
arrange(desc(freq))
freq_df2
This returns:
Source: local data frame [12 x 2]
month freq
1 5 29
2 9 24
3 3 22
4 4 18
5 6 17
6 7 15
7 2 14
8 10 14
9 11 14
10 8 13
11 1 10
12 12 10
The object it returns is a tibble (tbl_df) rather than a plain data frame, so if you want to use base R functions with it, it might be easier to convert it back, with something like:
my_df <- as.data.frame(freq_df2)
And if you really want, you can write this to a CSV file with:
write.csv(my_df, file="foo.csv")
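With dplyr 1.0.0 or later, the counting, sorting, top-3 and CSV steps can be written compactly. The sketch below uses simulated data, and `freq_tbl`, `top3` and `out_file` are hypothetical names. I draw months with sample() because round(runif(200, 1, 12)) makes months 1 and 12 only half as likely as the others.

```r
library(dplyr)

set.seed(1)  # seeded only so the simulated data is repeatable
birthdaysExample <- data.frame(month = sample(1:12, 200, replace = TRUE))

# Frequency table, largest counts first
freq_tbl <- birthdaysExample %>%
  count(month, name = "freq", sort = TRUE)

# The three most frequent months (ties broken by row order)
top3 <- freq_tbl %>%
  slice_max(freq, n = 3, with_ties = FALSE)

# Write the full table without the row-number column
out_file <- file.path(tempdir(), "month_freq.csv")
write.csv(freq_tbl, out_file, row.names = FALSE)
```

If plyr is attached after dplyr, its count() masks the dplyr version, so call dplyr::count() explicitly in that case.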