Extracting many variables from a single column in R

I'm stuck on a data cleaning problem. I've started receiving CSV files in the format shown below, and I need to clean them before I can do any analysis. There are several such columns, and each cell can hold a couple hundred variables that need to be extracted.
Original <- structure(list(CustNum = c(0, 1),
Sales = c("[1000, 345, Zero, 56]", "[987, 879, 325, 4568]"),
Amounts = c("[10, 2, 0, 98]", "[57, 25, 52, 75]"),
Number = c("['1', '2', '3', '4']", "['4', '3', '2', '1']"),
Identifier = c("A", "B")),
row.names = c(NA, -2L),
class = c("tbl_df", "tbl", "data.frame"))
What I'm trying to do is wrangle it into this format.
Desired <- tibble(CustNum = c(0, 0, 0, 0, 1, 1, 1, 1),
Sales = c(1000, 345, "Zero", 56, 987, 879, 325, 4568),
Amounts = c(10, 2, 0, 98, 57, 25, 52, 75),
Number = c(1, 2, 3, 4, 4, 3, 2, 1),
Identifier = c("A", "A", "A", "A", "B", "B", "B", "B"))
I've tried a number of different variations of the following type but can't get anywhere.
Original$Sales %>%
str_replace("\\[", "") %>%
str_replace("\\]", "") %>%
str_replace("'", "")
It's easy to do the cleaning in Power Query in Excel, but I would like to find a way to do it in R so I don't have to use several different tools. Can anyone show me how to do this?

Try with this:
library(dplyr) # must be version >= 1.0.0
library(stringr)
# Lambdas are used because passing extra arguments through across() is deprecated in dplyr >= 1.1.0
Original %>%
  mutate(across(everything(), ~ str_remove_all(.x, "\\[|\\]|'"))) %>%
  mutate(across(everything(), ~ str_split(.x, pattern = ","))) %>%
  tidyr::unnest(everything()) %>%
  mutate(across(everything(), str_trim)) %>%
  mutate(across(c(CustNum, Amounts, Number), as.numeric))
# A tibble: 8 x 5
CustNum Sales Amounts Number Identifier
<dbl> <chr> <dbl> <dbl> <chr>
1 0 1000 10 1 A
2 0 345 2 2 A
3 0 Zero 0 3 A
4 0 56 98 4 A
5 1 987 57 4 B
6 1 879 25 3 B
7 1 325 52 2 B
8 1 4568 75 1 B
Basically:
Remove [ ] '
Split by ,
Unnest the lists
Trim out unnecessary spaces
Set to numeric where necessary
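On the first step, it may help to see why the attempt in the question stalls: str_replace() only replaces the first match in each string, whereas the _all variants handle every match. A minimal sketch on one of the question's values (the object names x, first_only, all_matches and pieces are illustrative only):

```r
library(stringr)

x <- "['1', '2', '3', '4']"

# str_replace() stops after the first match in the string
first_only <- str_replace(x, "'", "")

# str_remove_all() removes every bracket and quote
all_matches <- str_remove_all(x, "\\[|\\]|'")

# split on commas and trim the leftover spaces
pieces <- str_trim(str_split(all_matches, ",")[[1]])
pieces
```

Only one quote disappears in first_only, while pieces holds the four separate values as character strings.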

You can try this approach
library(tidyverse)
library(stringr)
Original2 <- Original %>%
mutate_at(vars(Sales, Amounts, Number), ~str_replace_all(., "\\[|\\'|\\]|\\s", "")) %>%
separate_rows(c("Sales", "Amounts", "Number"), sep = ",")
# CustNum Sales Amounts Number Identifier
# <dbl> <chr> <chr> <chr> <chr>
# 1 0 1000 10 1 A
# 2 0 345 2 2 A
# 3 0 Zero 0 3 A
# 4 0 56 98 4 A
# 5 1 987 57 4 B
# 6 1 879 25 3 B
# 7 1 325 52 2 B
# 8 1 4568 75 1 B
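Here we remove [, ], ' and whitespace from the three columns, then use separate_rows() from the tidyr package to give each value its own row. It takes 2 steps to achieve our goal. Note that Sales, Amounts and Number are still character columns afterwards.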
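If numeric columns are needed, as in Desired, a conversion step can be appended. The following is a sketch rather than part of the answer above: it rebuilds the question's data, uses across() in place of mutate_at(), and stores the result in a hypothetical object named cleaned. Sales is left as character because it contains "Zero".

```r
library(dplyr)
library(tidyr)
library(stringr)

Original <- tibble(
  CustNum = c(0, 1),
  Sales = c("[1000, 345, Zero, 56]", "[987, 879, 325, 4568]"),
  Amounts = c("[10, 2, 0, 98]", "[57, 25, 52, 75]"),
  Number = c("['1', '2', '3', '4']", "['4', '3', '2', '1']"),
  Identifier = c("A", "B")
)

cleaned <- Original %>%
  # strip brackets, quotes and whitespace from the packed columns
  mutate(across(c(Sales, Amounts, Number),
                ~ str_remove_all(.x, "\\[|\\]|'|\\s"))) %>%
  # one row per comma-separated value
  separate_rows(Sales, Amounts, Number, sep = ",") %>%
  # convert only the columns that are purely numeric
  mutate(across(c(Amounts, Number), as.numeric))

cleaned
```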

I would suggest reshaping your Original data to long format first and then separating the rows with sep = ','. After that, clean the value column to remove the special characters. Finally, create an id variable by group so that the data can be pivoted back to wide, as in Desired:
library(tidyverse)
#Reshape
Original %>%
pivot_longer(cols = -c(CustNum,Identifier)) %>%
separate_rows(value,sep = ',') %>%
mutate(value=trimws(gsub("[[:punct:]]", " ", value))) %>%
group_by(name) %>% mutate(id=1:n()) %>%
pivot_wider(names_from = name,values_from=value) %>%
ungroup() %>%
select(-id)
Output:
# A tibble: 8 x 5
CustNum Identifier Sales Amounts Number
<dbl> <chr> <chr> <chr> <chr>
1 0 A 1000 10 1
2 0 A 345 2 2
3 0 A Zero 0 3
4 0 A 56 98 4
5 1 B 987 57 4
6 1 B 879 25 3
7 1 B 325 52 2
8 1 B 4568 75 1

Related

R: creating a longitudinal dataset using tidyr

I am looking to generate a longitudinal dataset. I have generated my pat numbers and treatment groups:
library(dplyr)
set.seed(420)
Pat_TNO <- 1001:1618
data.frame(Pat_TNO = Pat_TNO) %>%
rowwise() %>%
mutate(
trt = rbinom(1, 1, 0.5)
)
My timepoints (in days) are:
timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks
How can I pivot this dataset using the vector timepoint_days so that I have 10 rows per participant and the column names Pat_TNO, trt and timepoint_days?
You can use the unnest function from tidyr to achieve what you want.
Here is the code
library(dplyr)
library(tidyr)
set.seed(420)
Pat_TNO <- 1001:1618
x <- data.frame(Pat_TNO = Pat_TNO) %>%
rowwise() %>%
mutate(
trt = rbinom(1, 1, 0.5)
)
timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks
x %>%
mutate(timepoint_days = list(timepoint_days)) %>%
unnest(timepoint_days) # name the list-column; calling unnest() without columns is deprecated
Output
# A tibble: 6,180 × 3
Pat_TNO trt timepoint_days
<int> <int> <dbl>
1 1001 1 14
2 1001 1 28
3 1001 1 42
4 1001 1 56
5 1001 1 70
6 1001 1 84
7 1001 1 112
8 1001 1 140
9 1001 1 168
10 1001 1 364
# … with 6,170 more rows
Here I used the mutate function to add a list-column that holds the full timepoint_days vector in every row. unnest then expands each row into 10 rows, one per time point.
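The same shape can also be reached without a list-column. As a sketch of an alternative (not the method used above), tidyr::expand_grid() pairs every row of a data frame with every element of a vector. The three-row patients data frame below is a made-up stand-in for the full patient table:

```r
library(tidyr)

# hypothetical small stand-in for the patient table
patients <- data.frame(Pat_TNO = 1001:1003, trt = c(0L, 1L, 0L))

timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks

# every patient row combined with every time point
long <- expand_grid(patients, timepoint_days = timepoint_days)
long
```

With 3 patients and 10 time points this gives 30 rows, and the patient columns are repeated for each time point.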

"N-1" Cumulative Averages

I have the following data:
library(dplyr)
my_data = data.frame(patient_id = c(1,1,1,1, 2,2,2),
age = c(43, 43, 44, 44, 21, 21, 21),
gender = c("M", "M", "M", "M", "F", "F", "F"),
appointment_number = c(1,2,3,4,1,2,3),
missed = c(0, 0, 1, 1, 1, 1, 1))
My Question: Grouped by each ID, I want to create two variables:
The first variable takes the value of the previous appointment value
The second variable takes the "n-1" cumulative average of the previous appointment values (e.g. If patient_id = 1 has 8 rows, the cumulative average at this row would be the cumulative average of the first 7 rows)
Here is my attempt to do this:
my_data_final <- my_data %>%
group_by(patient_id) %>%
mutate(cummean = cumsum(missed)/(row_number() - 1)) %>%
mutate(previous_apt = lag(missed))
This results in the cummean variable being greater than 1, even though the variable in question can only be 1 or 0:
# A tibble: 7 x 7
# Groups: patient_id [2]
patient_id age gender appointment_number missed cummean previous_apt
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 43 M 1 0 NaN NA
2 1 43 M 2 0 0 0
3 1 44 M 3 1 0.5 0
4 1 44 M 4 1 0.667 1
5 2 21 F 1 1 Inf NA
6 2 21 F 2 1 2 1
7 2 21 F 3 1 1.5 1
Can someone please show me how to fix this?
Thanks!
Note: I tried to resolve this - is this correct?
my_data %>%
group_by(patient_id) %>%
mutate(previous_apt = lag(missed)) %>%
mutate(cummean = (cumsum(missed) - missed) / (row_number() - 1)) %>%
mutate(previous_apt_2 = lag(missed, 2))
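For what it is worth, the first attempt exceeds 1 because cumsum(missed) includes the current row while the divisor counts only the previous rows. The formula in the note subtracts the current value, so it appears to compute the intended "n-1" average, with NaN (0/0) in the first row of each group. One possible alternative, sketched here on a trimmed copy of the data, is to lag dplyr's cummean(), which gives NA instead of NaN in the first row. The column name prev_cummean is an assumption chosen for illustration:

```r
library(dplyr)

my_data <- data.frame(
  patient_id = c(1, 1, 1, 1, 2, 2, 2),
  appointment_number = c(1, 2, 3, 4, 1, 2, 3),
  missed = c(0, 0, 1, 1, 1, 1, 1)
)

result <- my_data %>%
  group_by(patient_id) %>%
  mutate(
    previous_apt = lag(missed),          # value at the previous appointment
    prev_cummean = lag(cummean(missed))  # mean of all earlier appointments
  ) %>%
  ungroup()

result
```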

Group cases based on id and other variables

I'm having trouble restricting a dataset based on an ifelse condition.
This is an example of my dataframe:
structure(list(id = c(111, 111, 111, 112, 112, 112), se = c(1,
2, 3, 1, 2, 3), pe = c(1, 1, 2, 1, 1, 1)), class = "data.frame", row.names = c(NA,
-6L))
I need to select cases that have the same id and pe
End table should be this:
id se pe
112 1 1
112 2 1
112 3 1
I would suggest the following approach using dplyr. You can compute flags that count the number of unique values and then filter on them. The flags are nid and npe. Here is the code, with df being your dput() data:
library(dplyr)
#Code
df %>% group_by(id) %>% mutate(nid = n_distinct(id),npe = n_distinct(pe)) %>%
filter(nid==1 & npe==1) %>% select(-c(nid,npe))
Output:
# A tibble: 3 x 3
# Groups: id [1]
id se pe
<dbl> <dbl> <dbl>
1 112 1 1
2 112 2 1
3 112 3 1
We could also do this without creating and deleting helper columns:
library(dplyr)
# keep only the ids whose pe takes a single value
df %>%
  group_by(id) %>%
  filter(n_distinct(pe) == 1)
# A tibble: 3 x 3
# Groups: id [1]
# id se pe
# <dbl> <dbl> <dbl>
#1 112 1 1
#2 112 2 1
#3 112 3 1

Dplyr transformation based on string filtering and conditions

I would like to transform a messy dataset in R.
However, I am having trouble figuring out how to do so. I have provided an example dataset and the result that I need to achieve:
dataset <- tribble(
~ID, ~DESC,
1, "3+1Â 81Â mÂ",
2, "2+1Â 90Â mÂ",
3, "3+KK 28Â mÂ",
4, "3+1 120 m (Mezone)")
dataset
dataset_tranformed <- tribble(
~ID, ~Rooms, ~Meters, ~Mezone, ~KK,
1, 4, 81,0, 0,
2, 3, 90,0,0,
3, 3, 28,0,1,
4, 4, 120,1, 0)
dataset_tranformed
The columns first need to be separated; however, using dataset %>% separate(DESC, c("size", "meters_squared", "Mezone"), sep = " ") does not work because (Mezone) is thrown away.
We can do this by extracting the components individually and evaluating the room expression
library(dplyr)
library(stringr)
library(tidyr)
library(purrr) # provides map_dbl()
dataset %>%
mutate(Rooms = map_dbl(DESC, ~
str_extract(.x, "^\\d+\\+\\d*") %>%
str_replace("\\+$", "+0") %>%
rlang::parse_expr(.) %>%
eval ),
Meters = str_extract(DESC, "(?<=\\s)\\d+(?=Â)"),
Mezone = +(str_detect(DESC, "Mezone")),
KK = +(str_detect(DESC, "KK"))) %>%
select(-DESC)
# A tibble: 4 x 5
# ID Rooms Meters Mezone KK
# <dbl> <dbl> <chr> <int> <int>
#1 1 4 81 0 0
#2 2 3 90 0 0
#3 3 3 28 0 1
#4 4 4 120 1 0
Another option is to use extract() and then str_detect()
dataset %>%
extract(DESC, into = c("Rooms1", "Rooms2", "Meters"),
"^(\\d+)\\+(\\d*)[^0-9]+(\\d+)", convert = TRUE, remove = FALSE) %>%
transmute(ID, Mezone = +(str_detect(DESC, "Mezone")),
KK = +(is.na(Rooms2)), Rooms = Rooms1 + replace_na(Rooms2, 0), Meters )
# A tibble: 4 x 5
# ID Mezone KK Rooms Meters
# <dbl> <int> <int> <dbl> <int>
#1 1 0 0 4 81
#2 2 0 0 3 90
#3 3 0 1 3 28
#4 4 1 0 4 120

substitute value in dataframe based on conditional

I have the following data set
library(dplyr)
df<- data.frame(c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
c(1, 1, 2, 2, 2, 3, 1, 2, 2, 2, 3, 3),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60))
colnames(df)<-c("name", "year", "val")
We summarize this by grouping df by name and year and then computing the average and the number of entries in each group
asd <- (df %>%
group_by(name,year) %>%
summarize(average = mean(val), `ave_number` = n()))
This gives the following desired output
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 50 1
4 b 1 20 1
5 b 2 20 3
6 b 3 50 2
Now, for all entries of asd$average where asd$ave_number<2, I would like to substitute a value from the following data frame, matched on year
replacer<- data.frame(c(1,2,3),
c(100,200,300))
colnames(replacer)<-c("year", "average")
In other words, I would like to end up with
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1 #substituted
4 b 1 100 1 #substituted
5 b 2 20 3
6 b 3 50 2
Is there a way to achieve this with dplyr? I guess I have to use the %>%-operator, something like this (not working code)
asd %>%
group_by(name, year) %>%
summarize(average = ifelse(n() < 2, #SOMETHING#, mean(val)))
Here's what I would do:
colnames(replacer) <- c("year", "average_replacer") #To avoid duplicate of variable name
asd <- left_join(asd, replacer, by = "year") %>%
mutate(average = ifelse(ave_number < 2, average_replacer, average)) %>%
select(-average_replacer)
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
Regarding the following:
I guess I have to use the %>%-operator
You don't ever have to use the pipe operator. It is there for convenience because you can string (or "pipe") functions one after another, as you would with a train of thought. It's kind of like having a flow in your code.
You can do this easily by using a named vector of replacement values by year instead of a data frame. If you're set on a data frame, you'd be using joins.
replacer <- setNames(c(100,200,300),c(1,2,3))
asd <- df %>%
group_by(name,year) %>%
summarize(average = mean(val),
ave_number = n()) %>%
mutate(average = if_else(ave_number < 2, replacer[year], average))
Source: local data frame [6 x 4]
Groups: name [2]
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
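One caveat with the named-vector approach: indexing with a numeric value selects by position, which only coincides with the names when the years happen to be 1, 2, 3. A small sketch with hypothetical year values (2019 to 2021 are invented for illustration) shows the difference and a name-based lookup:

```r
# hypothetical replacement values keyed by calendar year
replacer <- setNames(c(100, 200, 300), c(2019, 2020, 2021))
year <- c(2021, 2019)

# numeric index = position, so these are out of range and return NA
by_position <- replacer[year]

# character index = lookup by name
by_name <- unname(replacer[as.character(year)])

by_position
by_name
```

Converting the key with as.character() keeps the lookup correct regardless of the actual year values.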
