I have a data column that contains a bunch of ranges as strings (e.g. "2 to 4", "5 to 6", "7 to 8" etc.). I'm trying to create a new column that converts each of these values to a random number within the given range. How can I leverage conditional logic within my function to solve this problem?
I think the function should be something along the lines of:
df<-mutate(df, c2=ifelse(df$c=="2 to 4", sample(2:4, 1, replace=TRUE), "NA"))
This should produce a new column in my dataset that replaces each "2 to 4" with a random integer between 2 and 4. However, it is not working and replaces every value with "NA".
Ideally, I am trying to do something where the dataset:
df<-c("2 to 4","2 to 4","5 to 6")
Would add a new column:
df<-c2("3","2","5")
Does anyone have any idea how to do this?
We can split the string on "to", convert the two pieces to integers, build the range between them, and then use sample to pick one number from that range.
df$c2 <- sapply(strsplit(df$c1, "\\s+to\\s+"), function(x) {
vals <- as.integer(x)
sample(vals[1]:vals[2], 1)
})
df
# c1 c2
#1 2 to 4 2
#2 2 to 4 3
#3 5 to 6 5
data
df<- data.frame(c1 = c("2 to 4","2 to 4","5 to 6"), stringsAsFactors = FALSE)
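One caveat with sample(vals[1]:vals[2], 1): when the range collapses to a single number, such as "7 to 7", sample() treats that lone number n as shorthand for 1:n and may return something outside the range. Below is a minimal sketch that avoids this by sampling an index instead. The helper name sample_range and the extra "7 to 7" row are hypothetical, added only for illustration.

```r
# Hypothetical helper: draw one integer from a string such as "2 to 4".
sample_range <- function(rng) {
  vals <- as.integer(strsplit(rng, "\\s+to\\s+")[[1]])
  candidates <- seq(vals[1], vals[2])
  # Sample a position rather than a value, so a one-element range is safe
  # (sample(7, 1) would draw from 1:7 instead of returning 7).
  candidates[sample.int(length(candidates), 1)]
}

# The "7 to 7" row is an assumed edge case, not part of the original data.
df <- data.frame(c1 = c("2 to 4", "2 to 4", "5 to 6", "7 to 7"),
                 stringsAsFactors = FALSE)
df$c2 <- vapply(df$c1, sample_range, integer(1), USE.NAMES = FALSE)
df
```

The values in c2 change from run to run; call set.seed() first if you need reproducible draws.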
We can do this easily with sub. Replace the " to " with ":" and evaluate the result to get the sequence, then draw a sample of 1 from it.
df$c2 <- sapply(sub(" to ", ":", df$c1), function(x)
sample(eval(parse(text = x)), 1))
df
# c1 c2
#1 2 to 4 4
#2 2 to 4 3
#3 5 to 6 5
Or with gsubfn
library(gsubfn)
as.numeric(gsubfn("(\\d+) to (\\d+)", ~ sample(seq(as.numeric(x),
as.numeric(y), by = 1), 1), df$c1))
Or with read.csv/Map from base R
sapply(do.call(Map, c(f = `:`, read.csv(text = sub(" to ", ",", df$c1),
header = FALSE))), sample, 1)
data
df <- structure(list(c1 = c("2 to 4", "2 to 4", "5 to 6")),
class = "data.frame", row.names = c(NA, -3L))
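If the column is long, a fully vectorized variant may be worth considering. The sketch below is not taken from the answers above; it assumes every entry has the exact form "a to b" with a <= b, and it draws all the numbers in one call to runif() instead of looping with sapply.

```r
df <- data.frame(c1 = c("2 to 4", "2 to 4", "5 to 6"),
                 stringsAsFactors = FALSE)

# Split every string once and collect the bounds in a two-column matrix.
bounds <- do.call(rbind, strsplit(df$c1, " to ", fixed = TRUE))
lower <- as.integer(bounds[, 1])
upper <- as.integer(bounds[, 2])

# floor(u * k) with u in (0, 1) gives an integer in 0..(k - 1),
# so adding it to lower lands uniformly in lower..upper.
df$c2 <- lower + floor(runif(nrow(df)) * (upper - lower + 1))
df
```

This also handles ranges of width one without any special casing, because no call to sample() is involved.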
Currently trying to use sample R code and implement it into my own
Sample code goes like this:
syn_data <- syn_data %>%
dplyr::mutate(gender = factor(gender,
labels = c("female", "male")))
My code goes:
data <- data %>%
dplyr::mutate(condition = factor(condition,
labels = c("Fixed Ratio 6", "Variable Ratio 6", "Fixed Interval 8", "Variable Interval 8")))
Getting this error:
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "character"
Edit:
categorical. Reinforcement schedule the rat has been assigned to: 0 = 'Fixed Ratio 6'; 1 = 'Variable Ratio 6'; 2 = 'Fixed Interval 8'; 3 = 'Variable Interval 8'.
Data (sample right, mine left)
The cause of your problem is that data is not a data.frame, which is the required class for the first argument of mutate. If you change it to a data.frame, your code works.
For example:
library(dplyr) # provides mutate() and %>%
tap_data <- data.frame(rat_id = 1:4, condition = c(0,1,2,3))
tap_data <- tap_data %>% mutate(condition = factor(condition,
labels = c("Fixed Ratio 6", "Variable Ratio 6",
"Fixed Interval 8", "Variable Interval 8")))
tap_data
# rat_id condition
# 1 1 Fixed Ratio 6
# 2 2 Variable Ratio 6
# 3 3 Fixed Interval 8
# 4 4 Variable Interval 8
To check if an object is a data.frame, you can use is.data.frame(). You can check for some other classes with similar syntax, such as is.factor().
is.data.frame(tap_data)
#[1] TRUE
is.data.frame(tap_data$condition)
# [1] FALSE
is.factor(tap_data$condition)
#[1] TRUE
In addition to the answer above, you can convert the matrix or array to a data frame as follows:
data <- data %>%
as.data.frame(.) %>%
dplyr::mutate(condition = factor(condition,
labels = c("Fixed Ratio 6", "Variable Ratio 6", "Fixed Interval 8", "Variable Interval 8")))
These dplyr functions are designed to work on data frames, so you need to check that the data structure you are working on is of class data.frame.
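A side note on the factor() call used in this thread: with labels alone, R pairs the labels with the sorted unique values that happen to be present, so the call fails if one of the codes 0 to 3 never occurs in the data. A minimal sketch in base R, assuming the 0 to 3 coding described in the question and a hypothetical data frame in which code 2 is absent:

```r
# Hypothetical data: no rat was assigned to schedule 2.
tap_data <- data.frame(rat_id = 1:3, condition = c(0, 1, 3))

# Stating the levels explicitly ties each code to its label,
# regardless of which codes appear in the column.
tap_data$condition <- factor(
  tap_data$condition,
  levels = 0:3,
  labels = c("Fixed Ratio 6", "Variable Ratio 6",
             "Fixed Interval 8", "Variable Interval 8")
)
tap_data
levels(tap_data$condition)
```

The same levels = 0:3 argument can be added inside the mutate() calls shown above.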
I have a df that contains long strings. How should I separate them into different variables?
Sample data is here:
df <- structure(list(tx = c(" [1] Timepoint EGTMPT Categorical select one (nominal) 51 Screening",
" [2] N/A : O ff-Study EGTNA Categorical yes/no (dichotomous) 3",
" [3] Check if Not Done EGTMPTND Categorical yes/no (dichotomous) 3",
" [4] Date Performed ECGDT Date 11",
" [5] Time (24-hour format) ECGTM Time 5",
" [6] O verall ECG Interpretation ECGRES Categorical select one (nominal) 37 Normal"
)), row.names = c(NA, 6L), class = "data.frame")
It seems that the variables occupy a fixed space, so to find those spaces we do the following:
Manually separate one line:
vars = c(" [1] ", "Timepoint ", "EGTMPT ",
"Categorical select one (nominal) ", "51 ", "Screening")
Count the number of characters in each variable:
sizes = nchar(vars) # nchar() is vectorized, so no loop is needed
Cumulatively sum those values and add a 1 (the starting point) at the beginning:
sizes = c(1, cumsum(sizes))
The result is:
> sizes
[1] 1 14 62 74 107 118 127
So the first variable goes from the 1st to the 14th position, etc. Now we just need to cut each line in those places:
df2 = character()
for(i in 2:length(sizes)){
df2 = cbind(df2, apply(df, 1, function(x){substr(x, sizes[i-1], sizes[i])}))}
And lastly trim the leading and trailing spaces (gsub(" ", "", df2) would also delete the spaces inside values such as "Date Performed"):
df2 = trimws(df2)
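The cutting step can also be written without the explicit loop, since substring() accepts vectors of start and end positions. The sketch below uses made-up lines and widths (5, 10 and 8 characters), because the exact spacing of the original strings is not recoverable here; only the technique carries over.

```r
# Hypothetical fixed-width lines: 5 + 10 + 8 characters per record.
widths <- c(5, 10, 8)
lines <- c(sprintf("%-5s%-10s%-8s", "[1]", "Timepoint", "EGTMPT"),
           sprintf("%-5s%-10s%-8s", "[2]", "Date Done", "ECGDT"))

# First and last character position of every field.
ends <- cumsum(widths)
starts <- ends - widths + 1

# Cut each line at those positions and trim only the outer padding,
# so spaces inside a value ("Date Done") survive.
fields <- t(sapply(lines,
                   function(x) trimws(substring(x, starts, ends)),
                   USE.NAMES = FALSE))
fields
```

For data that lives in a file, base R's read.fwf() takes the same widths vector and does the cutting while reading.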
The basic width is xxxx.xxxxxx (4 digits before the "." and 6 digits after it).
I have to add "0" when either side of the "." does not have enough digits.
Using regexpr to find the location of "[.]" in combination with str_pad can fix the first 4 digits, but I don't know how to pad after a specific character to a fixed number of digits.
(I cannot find a library that can count the location starting from a specified character.)
Data like this
> df
Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402
Desired data
> df
Category
1 0300.030340
2 3400.040290
3 0700.070110
4 1700.090100
5 0700.070114
6 0700.079100
7 3600.050590
8 4400.040200
I am a beginner at coding and sometimes can't understand regex such as "[" etc., so some explanation of them would be super helpful.
I also have a combination like this:
df$Category<-ifelse(regexpr("[.]",df$Category)==4,
paste("0",df$Category,sep = ""),df$Category)
df$Category<-str_pad(df$Category,11,side = c("right"),pad="0")
I would like to know whether there is a better way to do this, especially one that counts the position from the END of the string back to a specific character.
Using formatC:
df$Category <- formatC(as.numeric(df$Category), format = 'f', width = 11, flag = '0', digits = 6)
# > df
# Category
# 1 0300.030340
# 2 3400.040290
# 3 0700.070110
# 4 1700.090100
# 5 0700.070114
# 6 0700.079100
# 7 3600.050590
# 8 4400.040200
format = 'f': formatting doubles;
width = 11: 4 digits before . + 1 . + 6 digits after .;
flag = '0': pads leading zeros;
digits = 6: the desired number of digits after the decimal point (format = "f");
Input df seems to be a character data.frame:
structure(list(Category = c("300.030340", "3400.040290", "700.07011",
"1700.0901", "700.070114", "700.0791", "3600.05059", "4400.0402"
)), .Names = "Category", row.names = c(NA, -8L), class = "data.frame")
We can use sprintf
df$Category <- sprintf("%011.6f", df$Category)
df
# Category
#1 0300.030340
#2 3400.040290
#3 0700.070110
#4 1700.090100
#5 0700.070114
#6 0700.079100
#7 3600.050590
#8 4400.040200
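The format string "%011.6f" reads as: pad with zeros (0), to a total width of 11 characters, with 6 digits after the decimal point, in fixed notation (f). Note that sprintf() rejects a character vector for %f, so if the column is stored as text (as the other answer assumes), it needs as.numeric() first. A short sketch using three of the values from the question:

```r
# Values stored as character strings, as in the question's printout.
category_chr <- c("300.030340", "700.07011", "4400.0402")

# Convert to numeric first; "%011.6f" then pads both sides of the dot.
padded <- sprintf("%011.6f", as.numeric(category_chr))
padded
# [1] "0300.030340" "0700.070110" "4400.040200"
```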
data
df <- structure(list(Category = c(300.03034, 3400.04029, 700.07011,
1700.0901, 700.070114, 700.0791, 3600.05059, 4400.0402)),
.Names = "Category", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
There are plenty of great tricks, functions, and shortcuts to be learned, and I would encourage you to explore them all! For example, if you're trying to win code golf, you will want to use @akrun's sprintf() approach. Since you stated you're a beginner, it might be more helpful to break down the problem into its component parts. One approach that is, in my opinion, transparent and easy to follow is to use the stringr package:
library(stringr)
location_of_dot <- str_locate(df$Category, "\\.")[, 1]
substring_left_of_dot <- str_sub(df$Category, end = location_of_dot - 1)
substring_right_of_dot <- str_sub(df$Category, start = location_of_dot + 1)
pad_left <- str_pad(substring_left_of_dot, 4, side = "left", pad = "0")
pad_right <- str_pad(substring_right_of_dot, 6, side = "right", pad = "0")
result <- paste0(pad_left, ".", pad_right)
result
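On the regex part of the question: an unescaped "." in a pattern means "any character", so it has to be neutralized to match a literal dot. "[.]" does that with a character class, "\\." with an escape, and fixed = TRUE switches regex interpretation off altogether. The base R sketch below compares them on two values from the question and also counts the digits after the dot, which is one way to measure from the end of the string.

```r
x <- c("300.030340", "1700.0901")  # two values from the question

# An unescaped dot matches any character, so the first match is position 1.
dot_any <- as.integer(regexpr(".", x))

# Three equivalent ways to match a literal dot.
dot_class   <- as.integer(regexpr("[.]", x))
dot_escaped <- as.integer(regexpr("\\.", x))
dot_fixed   <- as.integer(regexpr(".", x, fixed = TRUE))

# Characters between the dot and the end of each string.
digits_after_dot <- nchar(x) - dot_class

dot_any           # 1 1
dot_class         # 4 5
digits_after_dot  # 6 4
```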
Use separate in tidyr to split Category at the decimal point. Use str_pad from stringr to add zeros at the front or back, and paste the pieces together.
library(tidyr) # to separate columns on decimal
library(dplyr) # to mutate and pipes
library(stringr) # for str_pad
input_data <- read.table(text =" Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402", header = TRUE, stringsAsFactors = FALSE) %>%
separate(Category, into = c("col1", "col2")) %>%
mutate(col1 = str_pad(col1, width = 4, side= "left", pad ="0"),
col2 = str_pad(col2, width = 6, side= "right", pad ="0"),
Category = paste(col1, col2, sep = ".")) %>%
select(-col1, -col2)
I have the following data frame
Loci p-value chromosome start end geneDescription
A 2.046584849E-2 1 98542 98699 tyrosine kinase
B 5.67849483E-20 2 8958437 8958437 endocytosis
...
However, when I want to print the data frame with the following code:
write.table(table,"~/Desktop/genes.txt", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE, append = FALSE)
I get the following:
Loci p-value chromosome start end geneDescription
A 2.046584849E-20 1 98542 98699 tyrosine kinase
B 5.67849483E-20 2 8958437 8958437 endocytosis
I know that it has to do with the "\t", but can R adjust automatically the width of the columns when printing to get the original data frame above?
Thank you.
No, since this is a tab formatting issue; it can be partially solved by increasing the tab width in your editor. Try normalizing the length of the column names.
max.name <- max(sapply(colnames(table), nchar))
colnames(table) <- sapply(colnames(table), function(name) paste0(c(name, rep(" ", max.name - nchar(name))), collapse = ''))
Perhaps you're just looking for capture.output or sink.
In the following examples, replace x with an actual file name. This is just done for illustrative purposes.
x <- tempfile()
capture.output(mydf, file=x)
readLines(x)
# [1] " Loci p.value chromosome start end geneDescription"
# [2] "1 A 2.046585e-02 1 98542 98699 tyrosinekinase"
# [3] "2 B 5.678495e-20 2 8958437 8958437 endocytosis"
x <- tempfile()
sink(file = x)
mydf
sink()
readLines(x)
# [1] " Loci p.value chromosome start end geneDescription"
# [2] "1 A 2.046585e-02 1 98542 98699 tyrosinekinase"
# [3] "2 B 5.678495e-20 2 8958437 8958437 endocytosis"
The readLines step is just to show you what was written to your "file".
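If the row numbers in that output are unwanted, print() for data frames accepts row.names = FALSE, and the captured lines can be written with writeLines(). A minimal sketch, using a small hypothetical stand-in for the data frame in the question:

```r
# Hypothetical stand-in for the data frame in the question.
mydf <- data.frame(
  Loci = c("A", "B"),
  p.value = c(2.046584849e-02, 5.67849483e-20),
  start = c(98542, 8958437),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".txt")  # replace with a real path

# Capture the space-aligned printout without the row-number column.
aligned <- capture.output(print(mydf, row.names = FALSE))
writeLines(aligned, out_file)

readLines(out_file)
```

As with capture.output(mydf, file = x), the file holds space-padded text rather than tab-separated values, so it is meant for reading by eye, not for re-importing with sep = "\t".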
Very simple question. I am using an excel sheet that has two rows for the column headings; how can I convert these two row headings into one? Further, these headings don't start at the top of the sheet.
Thus, I have DF1
Temp Press Reagent Yield A Conversion etc
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
and I want,
Temp degC Press bar Reagent /g Yield A % Conversion etc
1 2 3 4 5
6 7 8 9 10
Using colnames(DF1) returns the upper names, but getting the second line to merge with the upper one keeps eluding me.
Using your data, modified to quote text fields that contain the separator (get whatever tool you used to generate the file to quote text fields for you!)
txt <- "Temp Press Reagent 'Yield A' 'Conversion etc'
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
"
The snippet of code below reads the file in two steps.
First we read the data; skip = 2 means skip the first 2 lines.
Next we read the data again, but only the first two lines. This output is then processed by sapply(), where we paste(x, collapse = " ") the strings in each column of the labs data frame. The results are assigned to the names of dat.
Here is the code:
dat <- read.table(text = txt, skip = 2)
labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
names(dat) <- sapply(labs, paste, collapse = " ")
dat
names(dat)
When run, the code produces:
> dat <- read.table(text = txt, skip = 2)
> labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
> names(dat) <- sapply(labs, paste, collapse = " ")
>
> dat
Temp degC Press bar Reagent /g Yield A % Conversion etc %
1 1 2 3 4 5
2 6 7 8 9 10
> names(dat)
[1] "Temp degC" "Press bar" "Reagent /g"
[4] "Yield A %" "Conversion etc %"
In your case, you'll want to modify the read.table() calls to point at the file on your file system, so use file = "foo.txt" in place of text = txt in the code chunk, where "foo.txt" is the name of your file.
Also, if these headings don't start at the top of the file, then increase skip to 2+n where n is the number of lines before the two header rows. You'll also need to add skip = n to the second read.table() call which generates labs, where n is again the number of lines before the header lines.
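The adjustment for leading lines can be sketched as follows. The two "junk" lines and the shortened two-column table are made up for illustration; n is the number of lines that precede the header rows.

```r
# Hypothetical input: two unrelated lines sit above the two header rows.
txt <- "junk line one
junk line two
Temp Press
degC bar
1 2
6 7
"
n <- 2  # number of lines before the header rows

# Data rows start after the n leading lines plus the 2 header lines.
dat <- read.table(text = txt, skip = n + 2)

# Header rows: skip only the n leading lines, then read 2 lines.
labs <- read.table(text = txt, skip = n, nrows = 2, stringsAsFactors = FALSE)

names(dat) <- sapply(labs, paste, collapse = " ")
dat
```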
This should work. You only need to set stringsAsFactors=FALSE when reading the data.
data <- structure(list(Temp = c("degC", "1", "6"), Press = c("bar", "2",
"7"), Reagent = c("/g", "3", "8"), Yield.A = c("%", "4", "9"),
Conversion = c("%", "5", "10")), .Names = c("Temp", "Press",
"Reagent", "Yield.A", "Conversion"), class = "data.frame", row.names = c(NA,
-3L)) # Your data
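colnames(data) <- paste(colnames(data), unlist(data[1, ])) # Set new names (header plus units row)
data <- data[-1, ] # Remove the units row
data <- data.frame(apply(data, 2, as.numeric), check.names = FALSE) # Convert to numeric (works only if all columns are numbers); check.names = FALSE keeps the spaces in the new names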
Just load your file with read.table(file, header = FALSE, stringsAsFactors = F). Then you can use grep to find the rows where the header labels occur.
df <- data.frame(V1=c(sample(10), "Temp", "degC"),
V2=c(sample(10), "Press", "bar"),
V3 = c(sample(10), "Reagent", "/g"),
V4 = c(sample(10), "Yield_A", "%"),
V5 = c(sample(10), "Conversion", "%"),
stringsAsFactors=F)
idx <- unique(c(grep("Temp", df$V1), grep("degC", df$V1)))
df2 <- df[-(idx), ]
names(df2) <- sapply(df[idx, ], function(x) paste(x, collapse=" "))
Here, if you want, you can then convert all the columns to numeric as follows:
df2 <- as.data.frame(sapply(df2, as.numeric))