Split a column into 3 columns with R

I'm trying to separate a column into 3 columns.
My code:
library(dplyr)
library(tidyr)
table1 <- read.csv("tablepartipants.csv")
table2 <- tidyr::separate(table1, col = unique_participant, into = c("uID", "gender", "employment"), sep='.')
I always get this warning: Expected 3 pieces. Additional pieces discarded in 80 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
This is how the column dataset looks like
All 3 "new" columns are empty...

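Remove the sep argument from your command. The default separator is the regular expression "[^[:alnum:]]+", which splits on any run of non-alphanumeric characters, including a period. In your call, sep = '.' is read as a regular expression in which . matches any character, so every character is treated as a delimiter and the new columns come out empty.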
# Example data.frame
table1 <- data.frame(unique_participant = paste0(30:33, c('.male.Student', '.female.Student')))
One option
separate(table1, unique_participant, into = c("uID", "gender", "employment"))
Or using \\. to specify a period.
separate(table1, unique_participant, into = c("uID", "gender", "employment"), sep = '\\.')
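To see why the unescaped period misbehaves, here is a minimal sketch using base R's strsplit(), which interprets its split pattern as a regular expression in the same way. The vector x is made-up illustration data, and strsplit() only stands in for separate() here.

```r
x <- c("30.male.Student", "31.female.Student")

# Unescaped "." is a regex wildcard: every character acts as a delimiter,
# so only empty pieces are left over.
wildcard <- strsplit(x, ".")

# Escaping the period splits on literal dots only.
escaped <- strsplit(x, "\\.")

# fixed = TRUE treats the pattern as a plain string, no escaping needed.
literal <- strsplit(x, ".", fixed = TRUE)

print(wildcard[[1]])
print(escaped[[1]])
```

The escaped and fixed variants give the same three pieces per row, which is what separate() needs to fill uID, gender and employment.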

Related

Using map() function to apply for each element

I need to apply the above to each element with the help of the map() function.
How can I do so?
As dt is of class data.table, you can make a vector of columns of interest (i.e. your items; below I use grepl on the names), and then apply your weighting function to each of those columns using .SD and .SDcols, with by
qs = names(dt)[grepl("^q", names(dt))]
dt[, (paste0(qs,"wt")):=lapply(.SD, \(q) 1/(sum(!is.na(q))/.N)),
.(sex, education_code, age), .SDcols = qs]
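As a quick check of what the weight expression computes, here is a small sketch on a single made-up group (the vectors q and all_missing are hypothetical, not taken from the question). It assumes the weight is meant to be the inverse of the response rate within a group.

```r
# One hypothetical group of four respondents, one of whom did not answer
q <- c(NA, 1, 5, 3)

n_total    <- length(q)        # plays the role of .N inside data.table
n_answered <- sum(!is.na(q))   # number of non-missing answers

# Inverse response rate: 1 / (3 / 4)
weight <- 1 / (n_answered / n_total)

# Edge case: a group where nobody answered gives a division by zero
all_missing  <- c(NA, NA)
weight_empty <- 1 / (sum(!is.na(all_missing)) / length(all_missing))

print(weight)
print(weight_empty)
```

In the sample data each combination of sex, education_code and age appears to occur only once, so every group has a single row and the weights can only be 1 or Inf. Coarser grouping variables may be needed for the weights to be informative.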
As mentioned in the comments, you miss a dt <- in your dt[, .(ID, education_code, age, sex, item = q1_1)] which makes the column item unavailable in the following line dt[, no_respond := is.na(item)].
Your weighting scheme is not entirely clear to me. However, assuming you want to do what your code does here, I would go with a dplyr solution to iterate over the columns.
# your data without no_respond column and correcting missing value in q2_3
dt <- data.table::data.table(
ID = c(1,2,3,4, 5, 6, 7, 8, 9, 10),
education_code = c(20,50,20,60, 20, 10,5, 12, 12, 12),
age = c(87,67,56,52, 34, 56, 67, 78, 23, 34),
sex = c("F","M","M","M", "F","M","M","M", "M","M"),
q1_1 = c(NA,1,5,3, 1, NA, 3, 4, 5,1),
q1_2 = c(NA,1,5,3, 1, 2, NA, 4, 5,1),
q1_3 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q1_text = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_1 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_2 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_3 = c(NA,1,5,3, 1, NA, NA, 4, 5,1),
q2_text = c(NA,1,5,3, 1, NA, 3, 4, 5,1))
library(dplyr) # for group_by(), add_count(), mutate(), across()
dt %>%
group_by(sex, education_code, age) %>% #groups the df by sex, education_code, age
add_count() %>% #add a column with number of rows in each group
mutate(across(starts_with("q"), #for each column starting with "q"
~ 1/(sum(!is.na(.))/n), #create a new column following your weight calculation
.names = '{.col}_wgt')) %>% #naming the new column with suffix "_wgt" to original name
ungroup()

keep the original value when using ifelse in dplyr after using cut

There is a dataset in which I have to create labels conditionally using the cut function; after that, only one of the labels shall be changed. Here is a similar code:
data.frame(x = c(12, 1, 25, 12, 65, 2, 6, 17)) %>%
mutate(rank = cut(x, breaks = c(0, 3, 12, 15, 20, 80),
labels = c("First", "Second", "Third", "Fourth", "Fifth")))
The output will be as follows:
However, when I want to change that rank relevant to the x value of 17 only to "Seventeen" and keep the rest as original using the following code, all other values in the rank column will change as well:
data.frame(x = c(12, 1, 25, 12, 65, 2, 6, 17)) %>%
mutate(rank = cut(x, breaks = c(0, 3, 12, 15, 20, 80),
labels = c("First", "Second", "Third", "Fourth", "Fifth"))) %>%
mutate(rank = ifelse(x == 17, "Seventeen",rank))
and the output will look like:
How can I prevent this happening?
We have to wrap rank in as.character to avoid class incompatibilities (factor vs. character):
Until 'Seventeen' is added, rank is a factor. ifelse() drops the factor attributes, so the untouched values come back as the factor's underlying integer codes, and those codes are then coerced to character alongside 'Seventeen'. Converting rank with as.character() first keeps the labels.
library(dplyr)
# data from the question
df <- data.frame(x = c(12, 1, 25, 12, 65, 2, 6, 17))
df %>%
mutate(rank = cut(x, breaks = c(0, 3, 12, 15, 20, 80),
labels = c("First", "Second", "Third", "Fourth", "Fifth"))) %>%
mutate(rank = ifelse(x == 17, "Seventeen",as.character(rank)))
x rank
1 12 Second
2 1 First
3 25 Fifth
4 12 Second
5 65 Fifth
6 2 First
7 6 Second
8 17 Seventeen
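The coercion can be reproduced without dplyr. The following is a minimal base R sketch on a shortened, made-up version of the x vector; it only illustrates the behaviour of ifelse() with a factor and is not meant as a replacement for the answer above.

```r
x <- c(12, 1, 17)
rank <- cut(x, breaks = c(0, 3, 12, 15, 20, 80),
            labels = c("First", "Second", "Third", "Fourth", "Fifth"))

# Passing the factor directly: the labels are lost and the
# integer codes show up as text.
codes <- ifelse(x == 17, "Seventeen", rank)

# Converting to character first keeps the labels.
labels_kept <- ifelse(x == 17, "Seventeen", as.character(rank))

print(codes)
print(labels_kept)
```

Note that the result is a character vector, not a factor. If a factor is needed afterwards, it would have to be rebuilt with factor() and an explicit set of levels.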

How to properly sum rows based on a specific date column range?

The idea is to get the sum based on the column names that are
between 01/01/2021 and 01/08/2021:
library(lubridate) # provides %m-%
# define range parameters {start-end}
first_date <- format(Sys.Date(), "01/01/%Y")
actual_date <- format(Sys.Date() %m-% months(1), "01/%m/%Y")
# get the sum of the rows between first_date and actual_date
df$ytd<- rowSums(df[as.character(seq(first_date,
actual_date))])
However, when I run this, the following error arises:
Error in seq.default(first_date, to_date) :
'from' must be a finite number
Expected output is a new column that takes the sum of the rows between the specified range.
data
df <- structure(list(country = c("Mexico", "Mexico", "Mexico", "Mexico"
), `01/01/2021` = c(12, 23, 13, 12), `01/02/2021` = c(12, 23,
13, 12), `01/03/2021` = c(12, 23, 13, 12), `01/04/2021` = c(12,
23, 13, 12), `01/05/2021` = c(12, 23, 13, 12), `01/06/2021` = c(12,
23, 13, 12), `01/07/2021` = c(12, 23, 13, 12), `01/08/2021` = c(12,
23, 13, 12), `01/09/2021` = c(12, 23, 13, 12), `01/10/2021` = c(12,
23, 13, 12), `01/11/2021` = c(12, 23, 13, 12), `01/12/2021` = c(12,
23, 13, 12)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
How could I properly apply a function to get this output?
seq() does not work on the output of format(): seq() expects Date (or numeric) endpoints, whereas format() returns character strings. Instead, make use of the range operator in across or select:
library(dplyr)
out <- df %>%
mutate(ytd = rowSums(across(all_of(first_date):all_of(actual_date))))
-output
> out$ytd
[1] 96 184 104 96
A base R approach using match -
df$ytd <- rowSums(df[match(first_date, names(df)):match(actual_date, names(df))])
df$ytd
#[1] 96 184 104 96
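If you would rather keep the seq() idea from the question, one possible sketch is to build the sequence on real Date values and format it only afterwards. The toy data frame and the fixed endpoints below are assumptions made for reproducibility; the question derives its endpoints from Sys.Date().

```r
# Toy data with the same column naming scheme as the question (day/month/year)
all_months <- format(seq(as.Date("2021-01-01"), as.Date("2021-12-01"), by = "month"),
                     "%d/%m/%Y")
df <- data.frame(country = c("Mexico", "Mexico"))
for (m in all_months) df[[m]] <- c(12, 23)

# Fixed endpoints, kept as Date objects so that seq() can step by month
first_date  <- as.Date("2021-01-01")
actual_date <- as.Date("2021-08-01")

# Generate the monthly dates first, then turn them into column names
cols <- format(seq(first_date, actual_date, by = "month"), "%d/%m/%Y")

df$ytd <- rowSums(df[cols])
print(df$ytd)
```

Unlike the range-based solutions, this selects columns by name, so it does not depend on the columns being stored in chronological order.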

Separate a column with uneven/unequal strings and with no delimiters

How would I separate a column like this, where the date has delimiters but the rest does not, and where some of the strings are of unequal length?
Input:
id
142 TM500A2013PISA8/22/17BG
143 TM500CAGE2012QUDO8/22/1720+
Output:
category site garden plot year species date portion
142 TM 500 A 2013 PISA 8/22/17 BG
143 TM 500 CAGE 2012 QUDO 8/22/17 20+
I poked around other questions and tried something that may work if it was an equal string ie:
>df <- avgmass %>% separate(id, c("site", "garden", "plot", "year",
"species", "sampledate", "portion"),sep=cumsum(c(2,3,3,4,4,5)))
But since the plot id is either A, B or CAGE, and the date contains "/", I am not sure how to approach it.
As I am relatively new to R, I tried searching for more details on how to use the sep argument but to no avail... Thank you for your help.
The code below may work for you, assuming that the "site", "garden" and "species" columns are of a fixed width.
library(dplyr)
library(tidyr)
df <- df %>%
mutate(site = substr(id, 1, 2),
garden = substr(id, 3, 5),
plot = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 6, 9), substr(id, 6, 6)),
year = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 10, 13), substr(id, 7, 10)),
species = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 14, 17), substr(id, 11, 14)),
sampledate = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 18, nchar(id)), substr(id, 15, nchar(id)))) %>%
separate(sampledate, into = c("m","d","y"), sep = "/") %>%
mutate(portion = substr(y, 3, nchar(y)),
sampledate = as.Date(paste(m, d, substr(y, 1, 2), sep = "-"), format = "%m-%d-%y"),
m = NULL,
d = NULL,
y = NULL)
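An alternative sketch uses a single regular expression with capture groups instead of position arithmetic. It assumes, based only on the two sample rows, that the site is two uppercase letters, the garden three digits, the plot one or more uppercase letters, the species four uppercase letters, and the year in the date two digits. The pattern would need adjusting if the real data deviates from that.

```r
id <- c("TM500A2013PISA8/22/17BG", "TM500CAGE2012QUDO8/22/1720+")

# One capture group per target column (assumed widths, see above)
pattern <- "^([A-Z]{2})([0-9]{3})([A-Z]+)([0-9]{4})([A-Z]{4})([0-9]+/[0-9]+/[0-9]{2})(.*)$"

# regexec() returns the full match plus each group; drop the full match
m <- regmatches(id, regexec(pattern, id))
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("site", "garden", "plot", "year",
                     "species", "sampledate", "portion")

out <- as.data.frame(parts, stringsAsFactors = FALSE)
out$sampledate <- as.Date(out$sampledate, format = "%m/%d/%y")
print(out)
```

Because the plot group is followed by a four-digit year, it matches "A" and "CAGE" alike without an explicit ifelse() on the string positions.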

Lookup value in one dataframe based on column name stored as a value in another dataframe

Please see the reproducible (cut + paste) example below. The actual data set has over 4000 serial observations on 11000 people. I need to create columns A, B, C, etc. showing the NUMBER of the "Drug" variables X,Y, Z etc. that corresponds to the first occurrence of a particular value of a "Disease" variable. The numbers refer to actions that were taken with particular drugs (start, stop, increase dose etc.) The "disease" variable refers to whether the disease flared or not in a disease that has many stages including flares and remissions.
For example:
Animal <- c("aardvark", "1", "cheetah", "dromedary", "eel", "1", "bison", "cheetah", "dromedary",
"eel")
Plant <- c("apple_tree", "blossom", "cactus", "1", "bronze", "apple_tree", "bronze", "cactus",
"dragonplant", "1")
Mineral <- c("amber", "bronze", "1", "bronze", "emerald", "1", "bronze", "bronze", "diamond",
"emerald")
Bacteria <- c("acinetobacter", "1", "1", "d-strep", "bronze", "acinetobacter", "bacillus",
"chlamydia", "bronze", "enterobacter" )
AnimalDrugA <- c(1, 11, 12, 13, 14, 15, 16, 17, 18, 19)
AnimalDrugB <- c(20, 1, 22, 23, 24, 25, 26, 27, 28, 29)
PlantDrugA <- c(301, 302, 1, 304, 305, 306, 307, 308, 309, 310)
PlantDrugB <- c(401, 402, 1, 404, 405, 406, 407, 408, 409, 410)
MineralDrugA <- c(1, 2, 3, 4, 1, 6, 7, 8, 9, 10)
MineralDrugB <- c(11, 12, 13, 1, 15, 16, 17, 18, 19, 20)
BacteriaDrugA <- c(1, 2, 3, 4, 5, 6 , 7, 8, 9, 1)
BacteriaDrugB <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
dummy_id <- c(1001, 2002, 3003, 4004, 5005, 6006, 7007, 8008, 9009, 10101)
Elements <- data.frame(dummy_id, Animal, Plant, Mineral, Bacteria, AnimalDrugA, AnimalDrugB,
PlantDrugA, PlantDrugB, MineralDrugA, MineralDrugB, BacteriaDrugA, BacteriaDrugB)
ds <- Elements[,order(names(Elements))]
ds #Got it in alphabetical order... The real data set will be re-ordered chronologically
#Now I want the first occurrence of the word "bronze" for each id
# for each subject 1 through 10. (That is, "bronze" corresponds to start of disease flare.)
first.bronze <- colnames(ds)[apply(ds,1,match,x="bronze")]
first.bronze
#Now, I want to find the number in the DrugA, DrugB variable that corresponds to the first
#occurrence of bronze.
#Using the alphabetically ordered data set, the answer should be:
#dummy_id DrugA DrugB
#1... NA NA
#2... 2 12
#3... NA NA
#4... 4 1
#5... 5 6
#6... NA NA
#7... 7 17
#8... 8 18
#9... 9 2
#10... NA NA
#Note that all first occurrences of "bronze"
# are in Mineral or Bacteria.
#As a first step, join first.bronze to the ds
ds$first.bronze <- first.bronze
ds
#Make a new ds where those who have an NA for first.bronze are excluded:
ds2 <- ds[complete.cases(ds$first.bronze),]
ds2
# Create a template data frame
out <- data.frame(matrix(nr = 1, nc = 3))
colnames(out) <- c("Form Number", "DrugA", "DrugB") # Gives correct column names
out
#Then grow the data frame...yes I realize potential slowness of computation
test <- for(i in ds2$first.bronze){
data <- rbind(colnames(ds2)[grep(i, names(ds2), ignore.case = FALSE, fixed = TRUE)])
colnames(data) <- c("Form Number", "DrugA", "DrugB") # Gives correct column names
out <- rbind(out, data)
}
out
#Then delete the first row of NAs
out <- na.omit(out)
out
#Then add the appropriate dummy_ids
dummy_id <- ds2$dummy_id
out_with_ids <- as.data.frame(cbind(dummy_id, out))
out_with_ids
Now I am stuck. I have the name of the column from ds2 listed as a value of Drug A, Drug B in the out_with_ids dataset. I have searched through Stack Overflow thoroughly, but solutions based on match, merge, replace, and the data.table package don't seem to work.
Thank you!
I think the problem here is the data format. May I suggest you store it in a "long" table, like this:
library(data.table)
dt <- data.table(dummy_id = rep(dummy_id, 4),
type = rep(c("Animal", "Bacteria", "Mineral", "Plant"), each = 10),
name = c(Animal, Bacteria, Mineral, Plant),
drugA = c(AnimalDrugA, BacteriaDrugA, MineralDrugA, PlantDrugA),
drugB = c(AnimalDrugB, BacteriaDrugB, MineralDrugB, PlantDrugB))
Then it is much easier to filter and do other operations. For example,
dt[name == "bronze"][order(dummy_id)]
Frankly I'm not sure I understand what you want to achieve in the end.
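If the goal is the drug values at the first occurrence of "bronze" per id, the long format makes that a filter, a sort and a de-duplication. The sketch below uses base R and a small made-up table (the values in long are hypothetical, not the question's data), and it assumes "first" means first in alphabetical order of type, as in the question's alphabetically ordered data set.

```r
# Toy long-format table (hypothetical values)
long <- data.frame(
  dummy_id = rep(c(1, 2, 3), times = 2),
  type     = rep(c("Bacteria", "Mineral"), each = 3),
  name     = c("bacillus", "bronze", "1", "bronze", "bronze", "amber"),
  drugA    = c(1, 2, 3, 4, 5, 6),
  drugB    = c(10, 20, 30, 40, 50, 60)
)

# Keep only the "bronze" rows and sort them so the first type comes first
hits <- long[long$name == "bronze", ]
hits <- hits[order(hits$dummy_id, hits$type), ]

# The first remaining row per id is the first occurrence
first_hit <- hits[!duplicated(hits$dummy_id), ]

# Left join back to all ids so that ids without "bronze" get NA
result <- merge(data.frame(dummy_id = unique(long$dummy_id)),
                first_hit[, c("dummy_id", "drugA", "drugB")],
                all.x = TRUE)
print(result)
```

With the real data, the sort key would be whatever defines the chronological order of the observations rather than the type name.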

Resources