how to split a long string into multiple variables

I have a data frame that contains long strings. How can I separate each string into different variables?
Sample data:
df <- structure(list(tx = c(" [1] Timepoint EGTMPT Categorical select one (nominal) 51 Screening",
" [2] N/A : O ff-Study EGTNA Categorical yes/no (dichotomous) 3",
" [3] Check if Not Done EGTMPTND Categorical yes/no (dichotomous) 3",
" [4] Date Performed ECGDT Date 11",
" [5] Time (24-hour format) ECGTM Time 5",
" [6] O verall ECG Interpretation ECGRES Categorical select one (nominal) 37 Normal"
)), row.names = c(NA, 6L), class = "data.frame")

It seems that each variable occupies a fixed-width field, so we first need to find where those fields start and end.
Manually separate one line:
vars = c(" [1] ", "Timepoint ", "EGTMPT ",
"Categorical select one (nominal) ", "51 ", "Screening")
Count the number of characters in each variable (nchar is vectorized, so no loop is needed):
sizes = nchar(vars)
Cumulatively sum those values and add a 1 (the starting point) at the beginning:
sizes = c(1, cumsum(sizes))
The result is:
> sizes
[1] 1 14 62 74 107 118 127
So the first variable goes from the 1st to the 14th position, etc. Now we just need to cut each line in those places:
df2 = character()
for(i in 2:length(sizes)){
df2 = cbind(df2, apply(df, 1, function(x){substr(x, sizes[i-1], sizes[i])}))}
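And lastly trim the padding around each value (trimws removes only leading and trailing whitespace, so spaces inside a value such as "Categorical select one (nominal)" are kept):
df2 = trimws(df2)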
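
The same fixed-width idea can be sketched in a vectorized form. This is only a minimal illustration under stated assumptions: the two input lines and the widths 5, 10 and 6 are made up for the example (they are not the widths of the data in the question), and the column names are hypothetical.

```r
# Made-up fixed-width lines; sprintf() pads each field to its assumed width
lines <- sprintf("%-5s%-10s%-6s",
                 c("[1]", "[2]"),
                 c("Timepoint", "Date"),
                 c("EGTMPT", "ECGDT"))

widths <- c(5, 10, 6)          # assumed field widths
ends   <- cumsum(widths)       # last position of each field
starts <- ends - widths + 1    # first position of each field

# substring() is vectorized over start/end, so each line is cut in one call;
# trimws() then drops the padding but keeps spaces inside a value
fields <- t(sapply(lines,
                   function(x) trimws(substring(x, starts, ends)),
                   USE.NAMES = FALSE))
colnames(fields) <- c("id", "label", "code")
fields
```

Computing starts as ends - widths + 1 keeps neighbouring fields from sharing a boundary character. For text stored in a file, read.fwf from the utils package takes a similar vector of widths.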

Related

Encoding of a column from another column when a specific value is observed r

An example dataframe and a list that has the information of which column to encode is given below.
# Dataframe
DF <- data.frame("genres" = c("pop", "pop","jazz","rock","jazz","blues","rock","pop","blues","pop"),
"colors" = c("orange","red","red","orange","green","blue","orange","red","blue","green"),
"values" = c(12, 15, 24, 33 ,47, 2 , 9 ,6, 89, 75),
"genres number 12" = c("r","r","?","l","?","r","l","r","r","r"),
"genres number 17" = c("l","l","?","r","?","l","r","l","l","l"),
"colors number 3" = c("r","l","l","r","?","r","r","l","r","?"),
"colors number 10" = c("r","l","l","r","l","r","r","l","r","l"),
check.names = FALSE
)
# Encoding list
EncodingList <- list("genres number 17", "colors number 3")
names(EncodingList) <- c("colors number 3", "genres number 12")
I want to fill in a column from another column wherever a specific value is observed. For example, the first element in EncodingList is "genres number 17" and its corresponding name is "colors number 3". When a value is ? in the "genres number 17" column of DF, that row should be filled with whatever value "colors number 3" has in the same row ("r", "l" or "?"). The expected output is given below. EncodingList is very long, so it is preferable to iterate through it with a loop.
expectedDF <- data.frame("genres" = c("pop", "pop","jazz","rock","jazz","blues","rock","pop","blues","pop"),
"colors" = c("orange","red","red","orange","green","blue","orange","red","blue","green"),
"values" = c(12, 15, 24, 33 ,47, 2 , 9 ,6, 89, 75),
"genres number 12" = c("r","r","?","l","?","r","l","r","r","r"),
"genres number 17" = c("l","l","l","r","?","l","r","l","l","l"),
"colors number 3" = c("r","l","l","r","?","r","r","l","r","r"),
"colors number 10" = c("r","l","l","r","l","r","r","l","r","l"),
check.names = FALSE
)

You can use a for loop: for each column listed in EncodingList, find the rows holding ? and overwrite them with the values from the column given by the corresponding name.
for(i in seq_along(EncodingList)) {
  j <- which(DF[,EncodingList[[i]]] == "?")   # rows to fill in the target column
  DF[j,EncodingList[[i]]] <- DF[j,names(EncodingList)[i]]   # copy from the source column
}
identical(DF, expectedDF)
#[1] TRUE
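
The loop above can also be wrapped in a small function so that it does not modify DF in place. This is a sketch only: fill_unknown, its arguments and the toy data frame are hypothetical names introduced for illustration, and the mapping follows the same convention as EncodingList (each element is the column to fill, its name is the column to copy from).

```r
# Hypothetical helper: element = target column, name = source column
fill_unknown <- function(d, mapping, unknown = "?") {
  for (i in seq_along(mapping)) {
    target <- mapping[[i]]
    src    <- names(mapping)[i]
    rows   <- which(d[[target]] == unknown)   # positions to overwrite
    d[[target]][rows] <- d[[src]][rows]       # copy values from the source
  }
  d
}

# Toy data, not the DF from the question
small <- data.frame(a = c("r", "?", "l"),
                    b = c("l", "r", "?"),
                    stringsAsFactors = FALSE)

filled <- fill_unknown(small, list(b = "a"))  # fill column a from column b
filled
```

As in the original loop, a ? is copied over unchanged when the source column also holds ? in that row.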

How to get the MIN and MAX from a dataframe SUBSET which consists of factors

I scraped a table and need to retrieve the Minimum and the Maximum from a subset column in the dataframe.
The table looks like this:
  Date   Year Title     Budget       Gross
2 Jun 22 2018 Project 5 $170,000,000 $1,308,334,005
3 Jun 12 2015 Project 4 $215,000,000 $1,669,943,967
4 Jul 18 2001 Project 3 $93,000,000  $365,900,000
5 May 22 1997 Project 2 $75,000,000  $618,638,999
6 Jun 11 1993 Project 1 $63,000,000  $1,045,573,035
I need to find the minimum and maximum in column Gross.
This doesn't currently work because the columns are factors, and when I use gsub to remove the commas, the values get messed up.
I don't understand
(1) how to convert the factors to real numbers
(2) how to find the MIN and MAX in this subset
(if this line of thought is correct?)
Thanks for any leads

A tidyverse solution:
library(tidyverse)
df %>%
  mutate(across(Gross, parse_number)) %>%
  summarise(across(Gross, list(min = min, max = max)))
#   Gross_min  Gross_max
# 1 365900000 1669943967

We need to remove both the , and the $. Note that $ is a regex metacharacter that denotes the end of the string, so we can either escape it or place both characters inside square brackets ([$,]+ matches one or more characters that are either a $ or a ,) and replace them with an empty string (""). Then we convert the column to numeric (as.numeric):
df1$Gross <- as.numeric(gsub("[$,]+", "", df1$Gross))
Now, we can get the min and max
min(df1$Gross, na.rm = TRUE)
#[1] 365900000
max(df1$Gross, na.rm = TRUE)
#[1] 1669943967
Or use the range function
range(df1$Gross, na.rm = TRUE)
#[1] 365900000 1669943967
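
The difference between the bracket form, the escaped form and an unescaped $ can be checked on a small vector. This is a minimal sketch; the vector gross and the other variable names are made up for the illustration.

```r
gross <- c("$1,308,334,005", "$365,900,000")

# Bracket form: inside [...] the dollar sign is a literal character
bracket <- as.numeric(gsub("[$,]", "", gross))

# Escaped form: "\\$" matches a literal dollar sign, "|" adds the comma
escaped <- as.numeric(gsub("\\$|,", "", gross))

# Unescaped "$" only anchors the end of the string, so nothing is removed
unescaped <- gsub("$", "", gross)

# gsub() converts a factor to character first, so a factor column works too
from_factor <- as.numeric(gsub("[$,]", "", factor(gross)))

range(bracket)
```

Calling as.numeric directly on a factor returns the internal level codes rather than the printed values, which is one way a cleaned column can end up looking "messed up".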
data
df1 <- structure(list(Date = c("Jun 22", "Jun 12", "Jul 18", "May 22",
"Jun 11"), Year = c(2018L, 2015L, 2001L, 1997L, 1993L), Title = c("Project 5",
"Project 4", "Project 3", "Project 2", "Project 1"), Budget = c("$170,000,000",
"$215,000,000", "$93,000,000", "$75,000,000", "$63,000,000"),
Gross = c("$1,308,334,005", "$1,669,943,967", "$365,900,000",
"$618,638,999", "$1,045,573,035")), class = "data.frame",
row.names = c("2",
"3", "4", "5", "6"))

Display means (or # observations) by two variables

I would like to show a statistic (for this example, means and number of observations would be very helpful) by two other variables, with one variable on one side of the table and the other variable on the other side.
I don't know if there is a way for R to figure out how to make it so that the value labels, which would be in string for this example, are rearranged for "optimal" viewing in the resulting table, but that would be ideal. What I have in mind is where the string value labels might be really long, so that in a resulting table everything just gets squished. If there really isn't a smart way but instead just the judicious use of \n, that would be fine too.
An example below for where means could be applied, along group and type.
# Example data frame
df <- data.frame(
  num = c(0.225802, 0.384, 0.583, 0.868, 0.3859, 0.58582, 0.9485802, 0.085802),
  type = c("This is a description of type 1", "This is a description of type 2", "This is a description of type 3", "This is a description of type 4", "This is a description of type 1", "This is a description of type 2", "This is a description of type 3", "This is a description of type 4"),
  group = c("This is a really long description for group A", "This is a really long description for group A", "This is a really long description for group A", "This is a really long description for group A", "This is a really long description for group B", "This is a really long description for group B", "This is a really long description for group B", "This is a really long description for group B")
)
colnames(df) <- c("num", "type", "group")
Thanks!

You can do the following using data.table. In this case, I create a summary table containing the Means and No. Obs for num across Type/Group pairs.
Code
require(data.table)
setDT(df)
untypes = df[, unique(as.character(type))] # Unique type descr
ungroups = df[, unique(as.character(group))] # Unique group descr
types = c(1,2,3,4) # Short types in the order they appear in `untypes` (1 to 4)
groups = c('A', 'B') # Short groups in the order they appear in `ungroups` (A to B)
df[, stype := sapply(type, function(x) types[which(untypes == x)])] # Assign short notation type ID
df[, sgroup := sapply(group, function(x) groups[which(ungroups == x)])] # Assign short notation group ID
dcast(df[, .(Mean = mean(num), No = length(num)), .(stype, sgroup)], stype ~ sgroup, value.var = c('Mean', 'No')) # Create summary matrix
Result
   stype   Mean_A    Mean_B No_A No_B
1:     1 0.225802 0.3859000    1    1
2:     2 0.384000 0.5858200    1    1
3:     3 0.583000 0.9485802    1    1
4:     4 0.868000 0.0858020    1    1
It is important that types and groups are declared such that their orders coincide with the corresponding orders of untypes and ungroups, respectively. For instance, if the long description of type 2 enters as the second observation in untypes, then types[2] must equal 2.
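
The ordering requirement above can be sidestepped by deriving the short labels from the data. The following is a base R sketch rather than the data.table approach of the answer; the small data frame and the names by_vars, means and counts are assumptions made for the illustration.

```r
# Toy data with long labels (not the data from the question)
df <- data.frame(
  num   = c(0.2, 0.4, 0.6, 0.8),
  type  = c("long type one", "long type two", "long type one", "long type two"),
  group = c("long group A", "long group A", "long group B", "long group B"),
  stringsAsFactors = FALSE
)

# match() against unique() numbers each label by first appearance,
# so the short IDs cannot get out of step with the long descriptions
df$stype  <- match(df$type, unique(df$type))
df$sgroup <- LETTERS[match(df$group, unique(df$group))]

# tapply() with two grouping vectors returns a type-by-group matrix
by_vars <- list(stype = df$stype, sgroup = df$sgroup)
means   <- tapply(df$num, by_vars, mean)
counts  <- tapply(df$num, by_vars, length)

means
counts
```

Here unique(df$type) and unique(df$group) serve as the lookup from short IDs back to the long descriptions.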

Calculate pairwise difference within groups and still keep the grouping factor

I want to pick, within each group, the 3 values that are closest to one another. How do I fix my code?
This data has 11 rows (the full data is much larger, but I just want a reproducible example). The number of values in the Total column that share the same JobID ranges from 2 to 9. I group the values by their JobID and then calculate the within-group pairwise differences, called pdiff. Once all the pairwise differences are calculated, I will sort the pdiff column in ascending order and then use head() to pick my top 3 values.
The dat1 lines gave me only the list of pairwise differences. The dat2 lines raised an error:
"Error in `[[<-.data.frame`(`*tmp*`, col, value = c(46, 99891, 99577, 99746, :
replacement has 55 rows, data has 11"
dput(dat)
structure(list(JobID = c("11 W 29", "11 W 29", "11 W 50", "11 W 50",
"11 W 50", "23 E 27", "23 E 27", "23 E 27", "11 W 50", "11 W 50",
"11 W 50"), Total = c(145501, 145547, 45610, 45924, 45755, 56425,
56383, 56262, 756185, 750110, 751467)), row.names = c(NA, -11L
), class = c("tbl_df", "tbl", "data.frame"))
dat %>%
  group_by(JobID) %>%
  summarize(dist(Total))
dat1 <- dat %>%
  group_by(JobID) %>%
  summarize(dist(Total))
dat2 <- dat %>%
  group_by(JobID) %>%
  mutate(pdiff = dist(Total))
The expected result should be a data frame of 55 rows and 3 columns (JobID, Total and a new column (called "pdiff" for example)).
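
One likely source of the length mismatch is that dist() returns n(n-1)/2 pairwise distances for n values, which generally does not equal the number of rows that mutate() has to fill. A possible way to keep the grouping factor next to each pairwise difference is sketched below in base R. It is only one reading of the goal: the data are made up, pairwise_diffs is a hypothetical helper, and every group is assumed to hold at least two values.

```r
# Toy data (not the data from the question)
dat <- data.frame(
  JobID = c("a", "a", "a", "a", "b", "b"),
  Total = c(10, 13, 11, 20, 5, 9)
)

# Hypothetical helper: one row per pair of values within a group
pairwise_diffs <- function(total) {
  idx <- combn(length(total), 2)          # column k holds the k-th pair of positions
  first  <- total[idx[1, ]]
  second <- total[idx[2, ]]
  data.frame(first = first, second = second, pdiff = abs(first - second))
}

groups <- split(dat$Total, dat$JobID)
result <- do.call(rbind, lapply(names(groups), function(id) {
  cbind(JobID = id, pairwise_diffs(groups[[id]]))
}))

# Smallest differences first within each JobID
closest <- result[order(result$JobID, result$pdiff), ]
closest
```

Each row keeps both values of the pair, so the values behind the smallest differences can be read off directly.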

How to replace specific values in a dataset with randomized numbers?

I have a data column that contains a bunch of ranges as strings (e.g. "2 to 4", "5 to 6", "7 to 8" etc.). I'm trying to create a new column that converts each of these values to a random number within the given range. How can I leverage conditional logic within my function to solve this problem?
I think the function should be something along the lines of:
df<-mutate(df, c2=ifelse(df$c=="2 to 4", sample(2:4, 1, replace=TRUE), "NA"))
This should produce a new column that replaces every "2 to 4" with a random integer between 2 and 4. However, it does not work: every value is replaced with "NA".
Ideally, I am trying to do something where the dataset:
df<-c("2 to 4","2 to 4","5 to 6")
Would add a new column:
df<-c2("3","2","5")
Does anyone have any idea how to do this?

We can split the string on "to", convert the two pieces to integers, build the range between them, and then use sample to pick one number from that range.
df$c2 <- sapply(strsplit(df$c1, "\\s+to\\s+"), function(x) {
vals <- as.integer(x)
sample(vals[1]:vals[2], 1)
})
df
# c1 c2
#1 2 to 4 2
#2 2 to 4 3
#3 5 to 6 5
data
df<- data.frame(c1 = c("2 to 4","2 to 4","5 to 6"), stringsAsFactors = FALSE)
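
One caveat with sample(vals[1]:vals[2], 1): when the range collapses to a single number of 1 or more (for example "5 to 5"), sample() treats that number as an upper bound and draws from 1 up to it. The sketch below avoids this with sample.int; sample_between is a hypothetical helper and the "5 to 5" row is added only to show the edge case.

```r
# Hypothetical helper: one random integer in [lo, hi], safe when lo == hi
sample_between <- function(lo, hi) {
  lo + sample.int(hi - lo + 1L, 1L) - 1L
}

df <- data.frame(c1 = c("2 to 4", "5 to 5", "7 to 8"),
                 stringsAsFactors = FALSE)

bounds <- strsplit(df$c1, "\\s+to\\s+")
lower  <- as.integer(vapply(bounds, `[`, character(1), 1))
upper  <- as.integer(vapply(bounds, `[`, character(1), 2))

set.seed(1)  # only to make the illustration repeatable
df$c2 <- mapply(sample_between, lower, upper)
df
```
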

We can also do this with sub: replace " to " with ":", evaluate the resulting expression to get the sequence, and then draw a sample of size 1 from it.
df$c2 <- sapply(sub(" to ", ":", df$c1), function(x)
sample(eval(parse(text = x)), 1))
df
# c1 c2
#1 2 to 4 4
#2 2 to 4 3
#3 5 to 6 5
Or with gsubfn
library(gsubfn)
as.numeric(gsubfn("(\\d+) to (\\d+)", ~ sample(seq(as.numeric(x),
as.numeric(y), by = 1), 1), df$c1))
Or with read.csv/Map from base R
sapply(do.call(Map, c(f = `:`, read.csv(text = sub(" to ", ",", df$c1),
header = FALSE))), sample, 1)
data
df <- structure(list(c1 = c("2 to 4", "2 to 4", "5 to 6")),
class = "data.frame", row.names = c(NA, -3L))
