I have an R dataframe with the numeric id as first column, I would like to group the data frame rows according to the first digit of id
id=c(33211,5966,4478,5589,10003,633,4411,99874,3641) ...
Maybe you want this where you can use substr to extract the first digit of your id like this:
df <- data.frame(id=c(33211,5966,4478,5589,10003,633,4411,99874,3641))
df$new_id <- with(df, as.numeric(substr(id, 1, 1)))
df
#> id new_id
#> 1 33211 3
#> 2 5966 5
#> 3 4478 4
#> 4 5589 5
#> 5 10003 1
#> 6 633 6
#> 7 4411 4
#> 8 99874 9
#> 9 3641 3
Created on 2022-07-27 by the reprex package (v2.0.1)
Similar to #Quinten's answer but just arranging the numbers in order
df<-data.frame(id =c(33211,5966,4478,5589,10003,633,4411,99874,3641))
df %>%
mutate(new_id = substr(df$id, 1,1)) %>%
arrange(by = new_id)
id new_id
1 10003 1
2 33211 3
3 3641 3
4 4478 4
5 4411 4
6 5966 5
7 5589 5
8 633 6
9 99874 9
Related
I am trying to extract the date from multiple PDF's to create a date column in a dataset.
I have a folder holding all the pdf's and am trying to do a topic modelling over a time period, hence I need to extract the dates.
Below is the dataset I have just containing the filenames.
# A tibble: 260 x 1
filename
<chr>
1 ./2012.01.18.pdf
2 ./2012.02.07.pdf
3 ./2012.03.12.pdf
4 ./2012.03.26.pdf
5 ./2012.04.02.pdf
6 ./2012.04.04.pdf
7 ./2012.04.19.pdf
8 ./2012.05.01.pdf
9 ./2012.05.07.pdf
10 ./2012.06.14.pdf
Tried "as.Date" with no luck, as I am unable to extract the dates from a file holding the all the PDFs
In the format, we could specify the extra characters along with the custom format for year (%Y), month (%m) and day (%d)
df$V2 <- as.Date(df$V2, format = "./%Y.%m.%d.pdf")
-output
> df
V1 V2
1 1 2012-01-18
2 2 2012-02-07
3 3 2012-03-12
4 4 2012-03-26
5 5 2012-04-02
6 6 2012-04-04
7 7 2012-04-19
8 8 2012-05-01
9 9 2012-05-07
10 10 2012-06-14
data
df <- structure(list(V1 = 1:10, V2 = c("./2012.01.18.pdf", "./2012.02.07.pdf",
"./2012.03.12.pdf", "./2012.03.26.pdf", "./2012.04.02.pdf", "./2012.04.04.pdf",
"./2012.04.19.pdf", "./2012.05.01.pdf", "./2012.05.07.pdf", "./2012.06.14.pdf"
)), class = "data.frame", row.names = c(NA, -10L))
You must first extract the date string from the name, then coerce to class "Date".
df1 <-'1 ./2012.01.18.pdf
2 ./2012.02.07.pdf
3 ./2012.03.12.pdf
4 ./2012.03.26.pdf
5 ./2012.04.02.pdf
6 ./2012.04.04.pdf
7 ./2012.04.19.pdf
8 ./2012.05.01.pdf
9 ./2012.05.07.pdf
10 ./2012.06.14.pdf'
df1 <- read.table(textConnection(df1))
df1$V2 <- sub(".*(\\d{4}.\\d{2}.\\d{2}).*", "\\1", df1$V2)
df1$V2 <- as.Date(df1$V2, "%Y.%m.%d")
df1
#> V1 V2
#> 1 1 2012-01-18
#> 2 2 2012-02-07
#> 3 3 2012-03-12
#> 4 4 2012-03-26
#> 5 5 2012-04-02
#> 6 6 2012-04-04
#> 7 7 2012-04-19
#> 8 8 2012-05-01
#> 9 9 2012-05-07
#> 10 10 2012-06-14
Created on 2022-11-27 with reprex v2.0.2
lets say we have following data:
df1 = data.frame(cm= c('10129', '21120', '123456','345239'),
num=c(6,6,6,6))
> df1
cm num
10129 6
21120 6
123456 6
345 4
as you see the length of some boxes in the cm column is 6 digits and some of 5 digits. I want to code the following: if the number of digits in the cm column is less than number in num column, add 0 value in the front to get the given output:
cm num
010129 6
021120 6
123456 6
0345 4
You can use str_pad
library(tidyverse)
df1 %>% mutate(cm = str_pad(cm, num, "left", "0"))
#> cm num
#> 1 010129 6
#> 2 021120 6
#> 3 123456 6
#> 4 0345 4
Created on 2022-04-13 by the reprex package (v2.0.1)
Input Data
df1 <- data.frame(cm = c('10129', '21120', '123456','345'), num = c(6,6,6,4))
df1
#> cm num
#> 1 10129 6
#> 2 21120 6
#> 3 123456 6
#> 4 345 4
Perhaps simplest using dplyr and nchar:
library(dplyr)
df1 %>% mutate(cm = if_else(nchar(cm) < num, paste0(0, cm), cm))
cm num
1 10129 6
2 21120 6
3 123456 6
4 345239 6
The other tidyverse/dplyr answers are nicer, but if you want to stick to base R for some reason:
df1$cm <- ifelse(nchar(df1$cm) < df1$num, paste0('0', df1$cm), df1$cm)
df1
#> cm num
#> 1 010129 6
#> 2 021120 6
#> 3 123456 6
#> 4 345239 6
I have a data frame as follows:
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6
4 4 25 7
5 5 3 7
6 6 34 7
For each row with the sample_id, I would like to add a unique identifier as follows:
index val sample_id
1 1 14 5
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
Any suggestion? Thank you for your help.
Base R
dat$id2 <- ave(dat$sample_id, dat$sample_id,
FUN = function(z) if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else as.character(z))
dat
# index val sample_id id2
# 1 1 14 5 5
# 2 2 22 6 6-A
# 3 3 1 6 6-B
# 4 4 25 7 7-A
# 5 5 3 7 7-B
# 6 6 34 7 7-C
tidyverse
library(dplyr)
dat %>%
group_by(sample_id) %>%
mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-") else as.character(sample_id)) %>%
ungroup()
Minor note: it might be tempting to drop the as.character(z) from either or both of the code blocks. In the first, nothing will change (here): base R allows you to be a little sloppy; if we rely on that and need the new field to always be character, then in that one rare circumstance where all rows have unique sample_id, then the column will remain integer. dplyr is much more careful in guarding against this; if you run the tidyverse code without as.character, you'll see the error.
Using dplyr:
library(dplyr)
dplyr::group_by(df, sample_id) %>%
dplyr::mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-"))
index val sample_id
<int> <dbl> <chr>
1 1 14 5-A
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
If you just want to create unique tags for the same sample_id, maybe you can try make.unique like below
transform(
df,
sample_id = ave(as.character(sample_id),sample_id,FUN = function(x) make.unique(x,sep = "_"))
)
which gives
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6_1
4 4 25 7
5 5 3 7_1
6 6 34 7_2
Ciao,
Here is a replicate able example.
df <- data.frame("STUDENT"=c(1,2,3,4,5),
"TEST1A"=c(NA,5,5,6,7),
"TEST2A"=c(NA,8,4,6,9),
"TEST3A"=c(NA,10,5,4,6),
"TEST1B"=c(5,6,7,4,1),
"TEST2B"=c(10,10,9,3,1),
"TEST3B"=c(0,5,6,9,NA),
"TEST1TOTAL"=c(NA,23,14,16,22),
"TEST2TOTAL"=c(10,16,15,12,NA))
I have columns STUDENT through TEST3B and want to create TEST1TOTAL TEST2TOTAL. TEST1TOTAL=TEST1A+TEST2A+TEST3A and so on for TEST2TOTAL. If there is any missing score in TEST1A TEST2A TEST3A then TEST1TOTAL is NA.
here is my attempt but is there a solution with less lines of coding? Because here I will need to write this line out many times as there are up to TEST A through O.
TEST1TOTAL=rowSums(df[,c('TEST1A', 'TEST2A', 'TEST3A')], na.rm=TRUE)
Using just R base functions:
output <- data.frame(df1, do.call(cbind, lapply(c("A$", "B$"), function(x) rowSums(df1[, grep(x, names(df1))]))))
Customizing colnames:
> colnames(output)[(ncol(output)-1):ncol(output)] <- c("TEST1TOTAL", "TEST2TOTAL")
> output
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL
1 1 NA NA NA 5 10 0 NA 15
2 2 5 8 10 6 10 5 23 21
3 3 5 4 5 7 9 6 14 22
4 4 6 6 4 4 3 9 16 16
5 5 7 9 6 1 1 NA 22 NA
Try:
library(dplyr)
df %>%
mutate(TEST1TOTAL = TEST1A+TEST2A+TEST3A,
TEST2TOTAL = TEST1B+TEST2B+TEST3B)
or
df %>%
mutate(TEST1TOTAL = rowSums(select(df, ends_with("A"))),
TEST2TOTAL = rowSums(select(df, ends_with("B"))))
I think for what you want, Jilber Urbina's solution is the way to go. For completeness sake (and because I learned something figuring it out) here's a tidyverse way to get the score totals by test number for any number of tests.
The advantage is you don't need to specify the identifiers for the tests (beyond that they're numbered or have a trailing letter) and the same code will work for any number of tests.
library(tidyverse)
df_totals <- df %>%
gather(test, score, -STUDENT) %>% # Convert from wide to long format
mutate(test_num = paste0('TEST', ('[^0-9]', '', test),
'TOTAL'), # Extract test_number from variable
test_let = gsub('TEST[0-9]*', '', test)) %>% # Extract test_letter (optional)
group_by(STUDENT, test_num) %>% # group by student + test
summarize(score_tot = sum(score)) %>% # Sum score by student/test
spread(test_num, score_tot) # Spread back to wide format
df_totals
# A tibble: 5 x 4
# Groups: STUDENT [5]
STUDENT TEST1TOTAL TEST2TOTAL TEST3TOTAL
<dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA
2 2 11 18 15
3 3 12 13 11
4 4 10 9 13
5 5 8 10 NA
If you want the individual scores too, just join the totals together with the original:
left_join(df, df_totals, by = 'STUDENT')
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL TEST3TOTAL
1 1 NA NA NA 5 10 0 NA NA NA
2 2 5 8 10 6 10 5 11 18 15
3 3 5 4 5 7 9 6 12 13 11
4 4 6 6 4 4 3 9 10 9 13
5 5 7 9 6 1 1 NA 8 10 NA
To manipulate/summarize data over time, I usually use SQL ROW_NUMBER() OVER(PARTITION by ...). I'm new to R, so I'm trying to recreate tables I otherwise would create in SQL. The package sqldf does not allow OVER clauses. Example table:
ID Day Person Cost
1 1 A 50
2 1 B 25
3 2 A 30
4 3 B 75
5 4 A 35
6 4 B 100
7 6 B 65
8 7 A 20
I want my final table to include the average of the previous 2 instances for each day after their 2nd instance (day 4 for both):
ID Day Person Cost Prev2
5 4 A 35 40
6 4 B 100 50
7 6 B 65 90
8 7 A 20 35
I've been trying to play around with aggregate, but I'm not really sure how to partition or qualify the function. Ideally, I'd prefer not to use the fact that id is sequential with the date to form my answer (i.e. original table could be rearranged with random date order and code would still work). Let me know if you need more details, thanks for your help!
You could lag zoo::rollapplyr with a width of 2. In dplyr,
library(dplyr)
df %>% arrange(Day) %>% # sort
group_by(Person) %>% # set grouping
mutate(Prev2 = lag(zoo::rollapplyr(Cost, width = 2, FUN = mean, fill = NA)))
#> Source: local data frame [8 x 5]
#> Groups: Person [2]
#>
#> ID Day Person Cost Prev2
#> <int> <int> <fctr> <int> <dbl>
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or all in dplyr,
df %>% arrange(Day) %>% group_by(Person) %>% mutate(Prev2 = (lag(Cost) + lag(Cost, 2)) / 2)
which returns the same thing. In base,
df <- df[order(df$Day), ]
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
c(NA, zoo::rollapplyr(x, width = 2, FUN = mean, fill = NA)[-length(x)])
})
df
#> ID Day Person Cost Prev2
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or without zoo,
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
(c(NA, x[-length(x)]) + c(NA, NA, x[-(length(x) - 1):-length(x)])) / 2
})
which does the same thing. If you want to remove the NA rows, tack on tidyr::drop_na(Prev2) or na.omit.