I have some data which looks like:
col
1 €€€€€
2 ££
3 €£
4 €€
5 €€€€€
6 €€€€
7 €€
8 €€
9 €€
10 €€
11 €€
12 €€
13 €€€€
14 €€€
15 €€€€
16 €€
17 €€
18 €€€€
19 $$
20 €€€CHF
It contains a collapsed set of currency symbols of different lengths. What I would like to do is create a new column and extract the unique currencies. In most cases the currencies within a row are all the same; however, rows 3 and 20 contain mixed currencies: €£ and €€€CHF respectively.
Expected output:
col colCur1 colCur2
1 €€€€€ €
2 ££ £
3 €£ € £
4 €€ ...
5 €€€€€
6 €€€€
7 €€
8 €€
9 €€
10 €€
11 €€
12 €€
13 €€€€
14 €€€
15 €€€€
16 €€
17 €€
18 €€€€ ...
19 $$ $
20 €€€CHF € CHF
Data (the dput output below had its € characters mangled into "\200" escapes; repaired here, and assigned to df1 so the code below runs):
df1 <- structure(list(col = c("€€€€€", "££", "€£", "€€", "€€€€€", "€€€€",
  "€€", "€€", "€€", "€€", "€€", "€€", "€€€€", "€€€", "€€€€", "€€",
  "€€", "€€€€", "$$", "€€€CHF")), class = "data.frame", row.names = c(NA, -20L))
Here is an option with dplyr, tidyr, and stringr:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
  mutate(col2 = str_replace_all(col, "(.)\\1+", "\\1"),
         col2 = str_replace_all(col2, "([^A-Z])([^A-Z])", "\\1,\\2"),
         col2 = str_replace_all(col2, "(?<=[^A-Z])(?=[A-Z])", ","),
         col2 = strsplit(col2, ",")) %>%
  unnest_wider(col2) %>%
  rename_at(-1, ~ str_c("colCur", seq_along(.)))
Output:
# A tibble: 20 x 3
# col colCur1 colCur2
# <chr> <chr> <chr>
# 1 €€€€€ € <NA>
# 2 ££ £ <NA>
# 3 €£ € £
# 4 €€ € <NA>
# 5 €€€€€ € <NA>
# 6 €€€€ € <NA>
# 7 €€ € <NA>
# 8 €€ € <NA>
# 9 €€ € <NA>
#10 €€ € <NA>
#11 €€ € <NA>
#12 €€ € <NA>
#13 €€€€ € <NA>
#14 €€€ € <NA>
#15 €€€€ € <NA>
#16 €€ € <NA>
#17 €€ € <NA>
#18 €€€€ € <NA>
#19 $$ $ <NA>
#20 €€€CHF € CHF
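For comparison, the same extraction can be sketched in base R with gregexpr()/regmatches(). The token pattern below (single €, £, or $ symbols, or runs of capital letters) is an assumption tailored to the sample data, not a general currency matcher:

```r
# Base R sketch: extract each currency token, then keep the unique ones.
# The pattern "€|£|\\$|[A-Z]+" is an assumption based on the sample data.
x <- c("€€€€€", "€£", "$$", "€€€CHF")
tokens <- regmatches(x, gregexpr("€|£|\\$|[A-Z]+", x))
out <- sapply(tokens, function(m) paste(unique(m), collapse = " "))
out
# "€"  "€ £"  "$"  "€ CHF"
```

The result could then be split into colCur1/colCur2 with strsplit() if separate columns are needed.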
I have a data frame like:
df
group group_name value
1 1 <NA> VV0001
2 1 <NA> VV_RS00280
3 2 <NA> VV0002
4 2 <NA> VV_RS00285
5 3 <NA> VV0003
6 3 <NA> VV_RS00290
7 5 <NA> VV0004
8 5 <NA> VV_RS00295
9 6 <NA> VV0005
10 6 <NA> VV_RS00300
11 7 <NA> VV0006
12 7 <NA> VV_RS00305
13 8 <NA> VV0007
14 8 <NA> VV_RS00310
15 9 <NA> VV0009
16 9 <NA> VV_RS00315
17 10 <NA> VV0011
18 10 <NA> VV_RS00320
19 11 <NA> VV0012
20 11 <NA> VV_RS00325
21 12 <NA> VV0013
22 12 <NA> VV_RS00330
so I want to construct another data frame using the columns "group" and "value": all the rows of group 1 (df[df$group == 1, ]) contribute their "value" entries (VV0001, VV_RS00280) to one row of the new data frame, like:
group value
1 VV0001 VV_RS00280
and then the next group (df[df$group == 2, ]), and so on; at the end it will be:
group value
1 VV0001 VV_RS00280
2 VV0002 VV_RS00285
3 VV0003 VV_RS00290
4 VV0004 VV_RS00295
I tried to do it manually, but nrow(df) is large (> 3000). Thanks.
You may try:
library(dplyr)
library(tidyr)
df %>%
  rename(idv = group) %>%
  mutate(group_name = rep(c("group", "value"), n() / 2)) %>%
  group_by(idv) %>%
  pivot_wider(names_from = group_name, values_from = value) %>%
  ungroup() %>%
  select(-idv)
group value
<chr> <chr>
1 VV0001 VV_RS00280
2 VV0002 VV_RS00285
3 VV0003 VV_RS00290
4 VV0004 VV_RS00295
5 VV0005 VV_RS00300
6 VV0006 VV_RS00305
7 VV0007 VV_RS00310
8 VV0009 VV_RS00315
9 VV0011 VV_RS00320
10 VV0012 VV_RS00325
11 VV0013 VV_RS00330
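If you prefer to stay in base R, the same reshape can be sketched with split() plus do.call(rbind, ...). This is a sketch that assumes every group has exactly two rows, already in the desired order:

```r
# Base R sketch: one output row per group, assuming each group
# contributes exactly two values in order.
df <- data.frame(group = rep(1:3, each = 2),
                 value = c("VV0001", "VV_RS00280",
                           "VV0002", "VV_RS00285",
                           "VV0003", "VV_RS00290"))
res <- do.call(rbind, lapply(split(df$value, df$group),
                             function(v) data.frame(group = v[1], value = v[2])))
res
#    group      value
# 1 VV0001 VV_RS00280
# 2 VV0002 VV_RS00285
# 3 VV0003 VV_RS00290
```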
I have 30 columns and I want to reshape them into a wide format. It seems multiple packages are required, but I am struggling to do it.
Here is a sample of data:
df <- read.table(text = " Id Name Class bc1 M1 G1
23 Smith A1 13 13 12
19 John Z1 12 14 12
33 Rose OG1 14 13 14
66 Javid MO1 12 14 13
66 Javid MO2 12 13 14
23 Smith A11 12 13 12
33 Rose OG2 14 14 13
19 John Z11 12 12 12
", header=TRUE)
And I want to get these data:
Id Name Class1 bc1 M1 G1 Class2 bc1 M1 G1
23 Smith A1 13 13 12 A11 12 13 12
19 John Z1 12 14 12 Z11 12 12 12
33 Rose OG1 14 13 14 OG2 14 14 13
66 Javid MO1 12 14 13 MO2 12 13 14
The logic is that each Id has different values and I want to generate columns for them.
Thank you for your help.
We can create a sequence column and then use pivot_wider
library(tidyr)
library(dplyr)
library(data.table)
library(stringr)
df %>%
  mutate(nm1 = rowid(Id, Name)) %>%
  pivot_wider(names_from = nm1, values_from = Class:G1) %>%
  select(Id, Name, order(as.integer(str_remove(names(.)[-(1:2)], ".*_"))) + 2)
Output:
# A tibble: 4 x 10
# Id Name Class_1 bc1_1 M1_1 G1_1 Class_2 bc1_2 M1_2 G1_2
# <int> <chr> <chr> <int> <int> <int> <chr> <int> <int> <int>
#1 23 Smith A1 13 13 12 A11 12 13 12
#2 19 John Z1 12 14 12 Z11 12 12 12
#3 33 Rose OG1 14 13 14 OG2 14 14 13
#4 66 Javid MO1 12 14 13 MO2 12 13 14
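The same wide reshape is also possible in base R, using ave() to number rows within each Id and stats::reshape() to spread them. A sketch on a cut-down version of the data:

```r
# Base R sketch: number rows within each Id, then reshape to wide.
df <- read.table(text = "Id Name Class bc1 M1 G1
23 Smith A1 13 13 12
23 Smith A11 12 13 12
19 John Z1 12 14 12
19 John Z11 12 12 12", header = TRUE)
df$seq <- ave(seq_len(nrow(df)), df$Id, FUN = seq_along)
wide <- reshape(df, idvar = c("Id", "Name"), timevar = "seq",
                direction = "wide")
wide  # columns Class.1, bc1.1, ..., Class.2, bc1.2, ...
```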
I have a two-column table (tibble) made up of a date object and a numeric variable. There is at most one entry per day, but not every day has an entry (i.e. date is a natural primary key). I am attempting to compute a running sum of the numeric column that resets when the month turns (the data is sorted by ascending date). I have replicated the result I want below.
Date score monthly.running.sum
10/2/2019 7 7
10/9/2019 6 13
10/16/2019 12 25
10/23/2019 2 27
10/30/2019 13 40
11/6/2019 2 2
11/13/2019 4 6
11/20/2019 15 21
11/27/2019 16 37
12/4/2019 4 4
12/11/2019 24 28
12/18/2019 28 56
12/25/2019 8 64
1/1/2020 1 1
1/8/2020 15 16
1/15/2020 9 25
1/22/2020 8 33
It looks like the package "runner" is possibly suited to this, but I don't really understand how to instruct it. I know I could use a join plus a group_by in dplyr, but the data set is very large and doing so would be wildly inefficient. I could also iterate through the list with a loop, but that seems inelegant. The last option I can think of is selecting a unique vector of yearmon objects, cutting the original list into many shorter lists, and running a plain cumsum on each, but that also feels suboptimal. I am sure this is not the first time someone has had to do this, and given how many tools there are in the tidyverse, I think I just need help finding the right one. The reason I am looking for a tool instead of using one of the methods described above (which would take less time than writing this post) is that this code needs to be very readable by an audience less comfortable with code.
We can also use data.table
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%m/%d/%Y")
        ][, monthly.running.sum := cumsum(score), by = format(Date, "%Y-%m")][]
# Date score monthly.running.sum
# 1: 2019-10-02 7 7
# 2: 2019-10-09 6 13
# 3: 2019-10-16 12 25
# 4: 2019-10-23 2 27
# 5: 2019-10-30 13 40
# 6: 2019-11-06 2 2
# 7: 2019-11-13 4 6
# 8: 2019-11-20 15 21
# 9: 2019-11-27 16 37
#10: 2019-12-04 4 4
#11: 2019-12-11 24 28
#12: 2019-12-18 28 56
#13: 2019-12-25 8 64
#14: 2020-01-01 1 1
#15: 2020-01-08 15 16
#16: 2020-01-15 9 25
#17: 2020-01-22 8 33
data
df <- structure(list(Date = c("10/2/2019", "10/9/2019", "10/16/2019",
"10/23/2019", "10/30/2019", "11/6/2019", "11/13/2019", "11/20/2019",
"11/27/2019", "12/4/2019", "12/11/2019", "12/18/2019", "12/25/2019",
"1/1/2020", "1/8/2020", "1/15/2020", "1/22/2020"), score = c(7L,
6L, 12L, 2L, 13L, 2L, 4L, 15L, 16L, 4L, 24L, 28L, 8L, 1L, 15L,
9L, 8L)), row.names = c(NA, -17L), class = "data.frame")
Using lubridate, you can extract the month and year values from the date, group_by those values, and then perform the cumulative sum as follows:
library(lubridate)
library(dplyr)
df %>%
  mutate(Month = month(mdy(Date)),
         Year = year(mdy(Date))) %>%
  group_by(Month, Year) %>%
  mutate(SUM = cumsum(score))
# A tibble: 17 x 6
# Groups: Month, Year [4]
Date score monthly.running.sum Month Year SUM
<chr> <int> <int> <int> <int> <int>
1 10/2/2019 7 7 10 2019 7
2 10/9/2019 6 13 10 2019 13
3 10/16/2019 12 25 10 2019 25
4 10/23/2019 2 27 10 2019 27
5 10/30/2019 13 40 10 2019 40
6 11/6/2019 2 2 11 2019 2
7 11/13/2019 4 6 11 2019 6
8 11/20/2019 15 21 11 2019 21
9 11/27/2019 16 37 11 2019 37
10 12/4/2019 4 4 12 2019 4
11 12/11/2019 24 28 12 2019 28
12 12/18/2019 28 56 12 2019 56
13 12/25/2019 8 64 12 2019 64
14 1/1/2020 1 1 1 2020 1
15 1/8/2020 15 16 1 2020 16
16 1/15/2020 9 25 1 2020 25
17 1/22/2020 8 33 1 2020 33
An alternative is to use the floor_date function to convert each date to the first day of its month and then calculate the cumulative sum:
library(lubridate)
library(dplyr)
df %>%
  mutate(Floor = floor_date(mdy(Date), unit = "month")) %>%
  group_by(Floor) %>%
  mutate(SUM = cumsum(score))
# A tibble: 17 x 5
# Groups: Floor [4]
Date score monthly.running.sum Floor SUM
<chr> <int> <int> <date> <int>
1 10/2/2019 7 7 2019-10-01 7
2 10/9/2019 6 13 2019-10-01 13
3 10/16/2019 12 25 2019-10-01 25
4 10/23/2019 2 27 2019-10-01 27
5 10/30/2019 13 40 2019-10-01 40
6 11/6/2019 2 2 2019-11-01 2
7 11/13/2019 4 6 2019-11-01 6
8 11/20/2019 15 21 2019-11-01 21
9 11/27/2019 16 37 2019-11-01 37
10 12/4/2019 4 4 2019-12-01 4
11 12/11/2019 24 28 2019-12-01 28
12 12/18/2019 28 56 2019-12-01 56
13 12/25/2019 8 64 2019-12-01 64
14 1/1/2020 1 1 2020-01-01 1
15 1/8/2020 15 16 2020-01-01 16
16 1/15/2020 9 25 2020-01-01 25
17 1/22/2020 8 33 2020-01-01 33
A base R alternative:
df$Date <- as.Date(df$Date, "%m/%d/%Y")
df$monthly.running.sum <- with(df, ave(score, format(Date, "%Y-%m"), FUN = cumsum))
df
# Date score monthly.running.sum
#1 2019-10-02 7 7
#2 2019-10-09 6 13
#3 2019-10-16 12 25
#4 2019-10-23 2 27
#5 2019-10-30 13 40
#6 2019-11-06 2 2
#7 2019-11-13 4 6
#8 2019-11-20 15 21
#9 2019-11-27 16 37
#10 2019-12-04 4 4
#11 2019-12-11 24 28
#12 2019-12-18 28 56
#13 2019-12-25 8 64
#14 2020-01-01 1 1
#15 2020-01-08 15 16
#16 2020-01-15 9 25
#17 2020-01-22 8 33
The yearmon class represents year/month objects so just convert the dates to yearmon and accumulate by them using this one-liner:
library(zoo)
transform(df, run.sum = ave(score, as.yearmon(Date, "%m/%d/%Y"), FUN = cumsum))
giving:
Date score run.sum
1 10/2/2019 7 7
2 10/9/2019 6 13
3 10/16/2019 12 25
4 10/23/2019 2 27
5 10/30/2019 13 40
6 11/6/2019 2 2
7 11/13/2019 4 6
8 11/20/2019 15 21
9 11/27/2019 16 37
10 12/4/2019 4 4
11 12/11/2019 24 28
12 12/18/2019 28 56
13 12/25/2019 8 64
14 1/1/2020 1 1
15 1/8/2020 15 16
16 1/15/2020 9 25
17 1/22/2020 8 33
I would like to split strings in my dataframe using stringr.
The following is my dataframe:
df <- data.frame(
  ID = 1:26,
  DRUG_STRENGTH = c("50 MG", "1250 MG", "20 MG", "200 MG", "2MG", "60MG", NA,
                    "300IU", NA, "600 MG", "500MG", "625MG", NA, NA, "50MG/ML",
                    "40MG", "200MG", "200MG", "200MG", "5 MG", "5 MG", "200MG",
                    "300IU/3ML", "0.05%", "112.5 BILLION", "10.8MG"))
My desired dataframe is:
# > df
# ID DRUG_STRENGTH DRUG_STRENGTH_NO DRUG_STRENGTH_UNIT
# 1 1 50 MG 50 MG
# 2 2 1250 MG 1250 MG
# 3 3 20 MG 20 MG
# 4 4 200 MG 200 MG
# 5 5 2MG 2 MG
# 6 6 60MG 60 MG
# 7 7 <NA> <NA> <NA>
# 8 8 300IU 300 IU
# 9 9 <NA> <NA> <NA>
# 10 10 600 MG 600 MG
# 11 11 500MG 500 MG
# 12 12 625MG 625 MG
# 13 13 <NA> <NA> <NA>
# 14 14 <NA> <NA> <NA>
# 15 15 50MG/ML 50 MG/ML
# 16 16 40MG 40 MG
# 17 17 200MG 200 MG
# 18 18 200MG 200 MG
# 19 19 200MG 200 MG
# 20 20 5 MG 5 MG
# 21 21 5 MG 5 MG
# 22 22 200MG 200 MG
# 23 23 300IU/3ML 300 IU/3ML
# 24 24 0.05% 0.05 %
# 25 25 112.5 BILLION 112.5 BILLION
# 26 26 10.8MG 10.8 MG
My code gives me my desired df but I would like to ask if there is a nicer way to write the regular expressions.
library(dplyr)
library(stringr)
df <- df %>%
  mutate(DRUG_STRENGTH_NO = str_extract(DRUG_STRENGTH,
           pattern = "^\\d\\.?\\d?\\.?\\d?\\.?\\d*"),
         DRUG_STRENGTH_UNIT = str_trim(str_replace(DRUG_STRENGTH,
           pattern = "^\\d\\.?\\d?\\.?\\d?\\.?\\d*", replacement = "")))
I'd use extract for this:
library(tidyverse)
df %>%
extract(DRUG_STRENGTH, into = c("No", "Unit"), "([0-9.]+)(.*)", remove = FALSE)
## ID DRUG_STRENGTH No Unit
## 1 1 50 MG 50 MG
## 2 2 1250 MG 1250 MG
## 3 3 20 MG 20 MG
## 4 4 200 MG 200 MG
## 5 5 2MG 2 MG
## 6 6 60MG 60 MG
## 7 7 <NA> <NA> <NA>
## 8 8 300IU 300 IU
## 9 9 <NA> <NA> <NA>
## 10 10 600 MG 600 MG
## 11 11 500MG 500 MG
## 12 12 625MG 625 MG
## 13 13 <NA> <NA> <NA>
## 14 14 <NA> <NA> <NA>
## 15 15 50MG/ML 50 MG/ML
## 16 16 40MG 40 MG
## 17 17 200MG 200 MG
## 18 18 200MG 200 MG
## 19 19 200MG 200 MG
## 20 20 5 MG 5 MG
## 21 21 5 MG 5 MG
## 22 22 200MG 200 MG
## 23 23 300IU/3ML 300 IU/3ML
## 24 24 0.05% 0.05 %
## 25 25 112.5 BILLION 112.5 BILLION
## 26 26 10.8MG 10.8 MG
You may need to go back through and check for any whitespace afterwards. Or, if you make sure the number and the remainder are separated by, say, a space, you could use strsplit or str_split (with or without simplify = TRUE). Regular expressions might prove more flexible, but they can also turn messy in more complicated situations.
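The same split can also be sketched in base R with sub(), reusing the answer's two capture groups (the example vector below is a subset of the question's data):

```r
# Base R sketch: capture the leading number and the remainder separately,
# then trim any whitespace left behind. NA inputs stay NA.
x <- c("50 MG", "2MG", "300IU/3ML", "0.05%", "112.5 BILLION", NA)
no   <- trimws(sub("^([0-9.]+)(.*)$", "\\1", x))
unit <- trimws(sub("^([0-9.]+)(.*)$", "\\2", x))
no
# "50" "2" "300" "0.05" "112.5" NA
unit
# "MG" "MG" "IU/3ML" "%" "BILLION" NA
```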
I am having trouble reading a text file that has data formatted in a matrix format as follows:
Location Product Day1 Day2 Day3 Day4 ... Day1 Day2 Day3
Jan Jan Jan ... Feb Feb Feb
123 23 8 9 3
234 25 2 4 9
254 87 3
213 56 7 5
It is essentially a time series that has quantities of products by location by day. I want to eventually convert this into a "sql" table format.
My trouble is that when I try to skip row 2 and import the rest of the data with fill = TRUE, I don't get the desired result: the actual counts get shifted to the right and don't align with the first "header" row. I want to combine rows 1 and 2 into a date field starting from Day1 in row 1, and leave empty fields as NA. Then I would eventually pivot this data into the following format:
Location Product Period Count
123 23 Jan 1
234 25 Jan 1 5
234 25 Feb 3 9
How can I accomplish this?
This demonstrates fwf_empty, the auto-position-guessing function from the readr package. I could not get the read_fwf function to accept a text connection as its file argument, so I saved the text as a slightly edited file that looks like:
Location Product Day1 Day2 Day3 Day4 Day1 Day2 Day3
Jan Jan Jan Jan Feb Feb Feb
123 23 8 9 3
234 25 2 4 9
254 87 3
213 56 7 5
The R code:
require(readr)
fwf_empty(file="~/Untitled 4 copy.txt")
$begin
[1] 0 9 17 22 27 32 40
$end
[1] 8 16 21 26 31 36 55
$col_names
[1] "X1" "X2" "X3" "X4" "X5" "X6" "X7"
> inp <- read_fwf(file="~/Untitled 4 copy.txt", fwf_empty(file="~/Untitled 4 copy.txt"))
> inp
Warning: 8 parsing failures.
row col expected actual
2 X9 4 chars 3
3 X8 4 chars 2
3 -- 9 columns 8 columns
4 X9 4 chars 3
5 X3 4 chars 2
... ... ......... .........
See problems(...) for more details.
X1 X2 X3 X4 X5 X6 X7 X8 X9
1 Location Product Day1 Day2 Day3 Day4 Day1 Day2 Day3
2 <NA> <NA> Jan Jan Jan Jan Feb Feb Feb
3 123 23 <NA> <NA> 8 <NA> 9 3 <NA>
4 234 25 2 4 <NA> <NA> <NA> <NA> 9
5 254 87 3 <NA> <NA> <NA> <NA> <NA> <NA>
6 213 56 <NA> 7 <NA> <NA> 5 <NA> <NA>
Then build new column names by pasting the two header rows together, and remove the first two lines (the NA in row 2 of the first two columns produces the "NA" suffix seen below):
> colnm <- paste0(inp[1, ], inp[2, ])
> colnm
[1] "LocationNA" "ProductNA" "Day1Jan" "Day2Jan" "Day3Jan"
[6] "Day4Jan" "Day1Feb" "Day2Feb" "Day3Feb"
> colnames(inp) <- colnm
> inp[-(1:2), ]
LocationNA ProductNA Day1Jan Day2Jan Day3Jan Day4Jan Day1Feb Day2Feb
3 123 23 <NA> <NA> 8 <NA> 9 3
4 234 25 2 4 <NA> <NA> <NA> <NA>
5 254 87 3 <NA> <NA> <NA> <NA> <NA>
6 213 56 <NA> 7 <NA> <NA> 5 <NA>
Day3Feb
3 <NA>
4 9
5 <NA>
6 <NA>
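Once the headers are merged, the wide table can be pivoted into the Location/Product/Period/Count layout the question asks for. A base R sketch with stats::reshape(), using a small hypothetical cleaned-up wide data frame (the real one would have all the DayX-month columns):

```r
# Base R sketch: melt the DayX-month columns into Period/Count rows,
# dropping the empty cells. `wide` here is a hypothetical cleaned table.
wide <- data.frame(Location = c(123, 234), Product = c(23, 25),
                   Day1Jan = c(NA, 2), Day3Jan = c(8, NA), Day1Feb = c(9, NA))
long <- reshape(wide, varying = names(wide)[-(1:2)], v.names = "Count",
                times = names(wide)[-(1:2)], timevar = "Period",
                direction = "long")
long <- long[!is.na(long$Count), c("Location", "Product", "Period", "Count")]
long  # one row per non-empty Location/Product/Period cell
```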