I am working with a data set where I need to flag all tree codes that start with "C13.xxx". Other tree codes appear in the same column, separated by a pipe as in "C13.xxx|B12.xxx", and every tree code contains a period. However, the column also holds other values that cause my stringr call to flag strings that are not tree codes. Example:
library(tidyverse)
# test data
test <- tribble(
~id, ~treecode, ~contains_c13_xxx,
#--|--|----
1, "B12.123|C13.234.432|A11.123", "yes",
2, "C12.123|C13039|", "no"
)
# what I tried
test %>% mutate(contains_C13_error = ifelse(str_detect(treecode, "C13."), 1, 0))
# code above is flagging both id's as containing C13.xxx
In id 2, there is a value that begins with C13, but it is not a tree code (all tree codes contain a period). The contains_c13_xxx column shows what I would like the code to produce. I included the period in the str_detect pattern, so I'm not sure what is going wrong here.
The tricky part is that multiple tree codes share one column, joined by a separator, which makes them difficult to flag. We can use separate_rows from tidyr to bring each tree code into its own row and then check for the code that we need.
library(dplyr)
test %>%
tidyr::separate_rows(treecode, sep = "\\|") %>%
group_by(id) %>%
summarise(contains_C13_error = any(startsWith(treecode, "C13.")),
treecode = paste(treecode, collapse = "|"))
# A tibble: 2 x 3
# id contains_C13_error treecode
# <dbl> <lgl> <chr>
#1 1 TRUE B12.123|C13.234.432|A11.123
#2 2 FALSE C12.123|C13039|
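This approach checks each code individually, so a value such as "C13039" (no dot) is not flagged. The original attempt failed because an unescaped . in a regex matches any character, so "C13." also matches "C130". If you only need to detect "C13" followed by a literal dot, then simply escaping the dot in your regex ("C13\\.") would work.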
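To illustrate the point about escaping, here is a small base R sketch using the two example strings from the question. The anchored pattern in the last step is an illustrative assumption, not part of the original answer; it requires "C13." to sit at the start of a code.

```r
treecode <- c("B12.123|C13.234.432|A11.123", "C12.123|C13039|")

# Unescaped dot: "." matches any character, so "C130" in the second string matches
unescaped <- grepl("C13.", treecode)

# Escaped dot: only a literal "." after "C13" matches
escaped <- grepl("C13\\.", treecode)

# Anchored variant (assumption): "C13." must begin the string or follow a "|"
anchored <- grepl("(^|\\|)C13\\.", treecode)

print(data.frame(treecode, unescaped, escaped, anchored))
```

The anchored form may be worth considering if a longer code could ever contain "C13." in the middle, since the plain escaped pattern would match that too.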
Base R solution:
# Split on the | delim:
split_treecode <- strsplit(df$treecode, "[|]")
# Roll out the ids the number of times of each relevant treecode:
rolled_out_df <- data.frame(id = rep(df$id, sapply(split_treecode, length)), tc = unlist(split_treecode))
# Test whether or not each code contains the literal string "C13."
rolled_out_df$contains_c13_xxx <- grepl("C13.", rolled_out_df$tc, fixed = TRUE)
# Does the id have an element containing "C13" ?
rolled_out_df$contains_c13_xxx <- ifelse(ave(rolled_out_df$contains_c13_xxx,
rolled_out_df$id,
FUN = function(x){as.logical(sum(x))}), "yes", "no")
# Build back original df:
df <- merge(df[,c("id", "treecode")], unique(rolled_out_df[,c("id", "contains_c13_xxx")]), by = "id")
Data:
df <-
structure(
list(
id = c(1, 2),
treecode = c("B12.123|C13.234.432|A11.123",
"C12.123|C13039|"),
contains_c13_xxx = c("yes", "no")
),
row.names = c(NA,-2L),
class = "data.frame"
)
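A more compact base R variant is also possible. The following is only a sketch under the same assumptions as above (codes separated by "|", tree codes of interest starting with "C13."); the names codes, has_c13 and flag are illustrative and not from the original answers.

```r
treecode <- c("B12.123|C13.234.432|A11.123", "C12.123|C13039|")

# Split each string on the literal "|" (fixed = TRUE avoids regex interpretation)
codes <- strsplit(treecode, "|", fixed = TRUE)

# For each row, check whether any individual code starts with "C13."
has_c13 <- vapply(codes, function(x) any(startsWith(x, "C13.")), logical(1))

# Convert to the yes/no labels used in the question
flag <- ifelse(has_c13, "yes", "no")
print(flag)
```

This avoids reshaping and merging, at the cost of keeping the split codes in a list rather than in rows.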
I'm trying to remove all rows that contain a ? in any column of a data frame. I have 950 rows by 11 columns.
I've tried this to do it all at once.
dataNew <- data %>% filter_all(all_vars(!grepl("?",.)))
and this to see if I could even get it to work for one column.
dataNew <- data[!grepl('?',data$column),]
Both of these attempts resulted in an empty dataframe. Any help is appreciated, thank you.
We can use fixed = TRUE, because ? is a regex metacharacter. Alternatively, escape it (\\?) or wrap it inside square brackets ([?]) when using grep's default mode, fixed = FALSE.
library(dplyr)
data %>%
filter_all(all_vars(!grepl("?",., fixed = TRUE)))
# col1 col2
#1 1 2
Or using across (available in dplyr 1.0.0 and later; newer versions also provide if_all() for use inside filter())
data %>%
filter(across(everything(), ~ !grepl("?", ., fixed = TRUE)))
# col1 col2
#1 1 2
Or using base R
data[!Reduce(`|`, lapply(data, grepl, pattern = '?', fixed = TRUE)),]
data
data <- data.frame(col1 = c("?", 1, 3, "?"), col2 = c(1, 2, "?", "?"),
stringsAsFactors = FALSE)
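As a quick check of the three ways to match a literal ? described in this answer, the sketch below applies each of them to one column of the example data. The variable names are illustrative.

```r
data <- data.frame(col1 = c("?", 1, 3, "?"), col2 = c(1, 2, "?", "?"),
                   stringsAsFactors = FALSE)
x <- data$col1

# 1. Treat the pattern as a plain string rather than a regex
by_fixed <- grepl("?", x, fixed = TRUE)

# 2. Escape the metacharacter (the backslash itself is doubled in an R string)
by_escape <- grepl("\\?", x)

# 3. Put it inside a character class, where ? has no special meaning
by_class <- grepl("[?]", x)

print(cbind(x, by_fixed, by_escape, by_class))
```

All three should flag the same elements, so the choice is mostly a matter of readability.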
I have a rather wide dataset to read in with over 1000 missing values at the top, but all the variable names follow the same pattern. Is there a way to use starts_with() to force certain variables to be parsed correctly?
MWE:
library(tidyverse)
library(readr)
mwe.csv <- data.frame(id = c("a", "b"), #not where I actually get the data from
amount1 = c(NA, 20),
currency1 = c(NA, "USD")
)
mwe <- readr::read_csv("mwe.csv", guess_max = 1) #guess_max() for example purposes
I'd like to be able to do
mwe <- read_csv("mwe.csv", guess_max = 1,
col_types = cols(starts_with("amount") = "d",
starts_with("currency") = "c")
)
> mwe
# A tibble: 2 x 3
id amount1 currency1
<chr> <dbl> <chr>
1 a NA NA
2 b 20 USD
But I get the error "unexpected '=' in: read_csv". Any thoughts? I cannot hard-code the column types because the number of columns will change regularly, but the pattern (amountN) will be constant. There will also be other columns that are not id or amount/currency. I would prefer not to increase guess_max, for speed reasons.
The answer is to cheat!
mwe <- read_csv("mwe.csv", n_max = 0) # read zero rows; only the col_names are needed
cnames <- attr(mwe, "spec") # the column specification holds the col_names
ctype <- rep("?", ncol(mwe)) # one col_types abbreviation per column; "?" means guess
currency <- grepl("currency", names(cnames$cols)) # which ones are currency?
# or use base::startsWith(names(cnames$cols), "currency")
ctype[currency] <- "c" # do not guess on currency ones, use character
# repeat the previous two steps for other prefixes as needed (e.g. "amount" with "d")
mwe <- read_csv("mwe.csv", col_types = paste(ctype, collapse = ""))
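The string-building step can be wrapped in a small function so that several prefixes are handled at once. This is only a sketch: build_col_types is a hypothetical helper (not part of readr), and the column names below are made up for illustration. It uses base R only and produces the compact col_types string that the answer passes to read_csv.

```r
# Hypothetical helper: map column-name prefixes to readr's compact type codes.
# prefix_types is a named character vector, e.g. c(amount = "d", currency = "c").
build_col_types <- function(col_names, prefix_types, default = "?") {
  types <- rep(default, length(col_names))   # "?" leaves the column to be guessed
  for (prefix in names(prefix_types)) {
    types[startsWith(col_names, prefix)] <- prefix_types[[prefix]]
  }
  paste(types, collapse = "")
}

# Illustrative column names; in practice these would come from a zero-row read
col_names <- c("id", "amount1", "currency1", "amount2", "currency2", "notes")
spec_string <- build_col_types(col_names, c(amount = "d", currency = "c"))
print(spec_string)
```

Columns that match no prefix keep "?", so they are still guessed, which fits the requirement that other, unrelated columns may be present.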
Does anyone know how I can create a format of a variable in R and apply it to any other variable I want?
More specifically, I am trying to translate a SAS script to R script.
In SAS I can create a format of a variable like this:
PROC FORMAT;
VALUE bool
1 = "Yes"
2 = "No"
3 = "NA"
;
(so the variable bool has the levels 1, 2, 3, where 1 will be replaced with "Yes", 2 with "No", etc)
Then I can indicate that for a specific variable of my data set (myVariable) - which also has the levels 1, 2, 3 - I want to have the same format:
FORMAT myVariable bool.;
so all the 1s will become "Yes", etc. Obviously, the order of the levels is not the same between the two variables; I just want to apply the same labels.
I cannot find how to do this with R, has anyone already done it?
You can also create a function if you want to reuse the format (and not deal with factors if that is a problem).
library(dplyr)
lvl <- function(y){ifelse(y == 1, "Yes",
ifelse(y == 2, "No","NA"))}
df <- data.frame(
answers = c(1,2,3)
)
df2 <- df %>% mutate(var2 = lvl(answers))
Try a look-up vector. For example:
v <- setNames(c("yes", "no", "na"), 1:3)
v[c(1,2,2,3,1,1)]
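To make the look-up vector reusable across variables, much like a named SAS format, it can be wrapped in a small function. This is a sketch only: apply_format and bool_fmt are hypothetical names, and the look-up is done by name (via as.character) rather than by position, so the codes do not have to be 1, 2, 3 in order.

```r
# The "format": names are the raw codes, values are the labels
bool_fmt <- c("1" = "Yes", "2" = "No", "3" = "NA")

# Hypothetical helper: look each value up by name; unmatched codes become NA
apply_format <- function(x, fmt) unname(fmt[as.character(x)])

my_variable <- c(1, 2, 2, 3, 1, 1)
labelled <- apply_format(my_variable, bool_fmt)
print(labelled)
```

The same bool_fmt can then be applied to any other variable that uses the same codes.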
In vanilla R, you can do this:
# create data
df <- data.frame(
'answers' = c('1','2','3')
)
# make 'answers' into a factor
df$answers <- as.factor(df$answers)
#rename factor levels
levels(df$answers)
# [1] "1" "2" "3"
levels(df$answers) <- c('Yes','No','NA')
In Tidyverse, this is slightly less clunky.
# you can also do this within tidyverse
library(tidyverse)
# create data
df <- data.frame(
'answers' = c('1','2','3')
)
df <- df %>%
mutate(answers = recode(as.factor(answers), '1' = 'Yes', '2' = 'No', '3' = 'NA'))
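Another option that stays close to the SAS idea of defining a format once and applying it to several variables is to store the levels and labels separately and pass them to factor(). The sketch below assumes numeric codes 1 to 3; the column name other and the variables bool_levels and bool_labels are illustrative.

```r
# Define the "format" once
bool_levels <- c(1, 2, 3)
bool_labels <- c("Yes", "No", "NA")

df <- data.frame(answers = c(1, 2, 3, 2), other = c(3, 3, 1, 2))

# Apply the same levels/labels to any variable that uses these codes
df$answers_f <- factor(df$answers, levels = bool_levels, labels = bool_labels)
df$other_f <- factor(df$other, levels = bool_levels, labels = bool_labels)

print(df)
```

Note that the label "NA" here is an ordinary string level, not a missing value.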
I have two data frames with columns of words and associated scores for these words. I want to run comments through these frames and create an additive score based on if the words appear in the sentences.
I want to do this across many, many comments so it needs to be computationally efficient. So for example, the sentence "hi, he said. why is it okay" will get a score of .98 + .1 + .2 because the words "hi", "why", and "okay" are in data frame a. Any sentence could potentially have words from several data frames as well.
Can anyone help me create the column "add_score" with a procedure that scales well to large data frames? Thank you
a <- data.frame(words = c("hi","no","okay","why"),score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"),score = c(.5,.3,.2))
comment_df = data.frame(id = c("1","2","3"), comments = c("hi, he said. why is it okay","okay okay okay no","yes, here is it"))
comment_df$add_score = c(1.28,1.1,.5)
This solution uses functions from tidyverse and stringr.
# Load packages
library(tidyverse)
library(stringr)
# Merge a and b to create score_df
score_df <- bind_rows(a, b)
# Create a function to calculate score for one string
string_cal <- function(string, score_df){
temp <- score_df %>%
# Count the number of words in one string
mutate(Number = str_count(string, pattern = fixed(words))) %>%
# Calculate the score
mutate(Total_Score = score * Number)
# Return the sum
return(sum(temp$Total_Score))
}
# Use map_dbl to apply the string_cal function over comments
# The results are stored in the add_score column
comment_df <- comment_df %>%
mutate(add_score = map_dbl(comments, string_cal, score_df = score_df))
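Because str_count with fixed() counts substrings, a word such as "no" would also be counted inside "know". If whole-word matching is wanted, one possible alternative is to split each comment into words and look them up in a named score vector. This is a base R sketch, not the answer's method; it assumes lower-case ASCII words in the score tables, and the names lookup, tokens and add_score are illustrative.

```r
a <- data.frame(words = c("hi", "no", "okay", "why"), score = c(.98, .5, .2, .1))
b <- data.frame(words = c("bye", "yes", "here"), score = c(.5, .3, .2))
comments <- c("hi, he said. why is it okay", "okay okay okay no", "yes, here is it")

# Combine the score tables into one named look-up vector: word -> score
score_df <- rbind(a, b)
lookup <- setNames(score_df$score, score_df$words)

# Split each comment into lower-case words on any run of non-letters
tokens <- strsplit(tolower(comments), "[^a-z]+")

# Sum the scores of the words found; words without a score give NA and are dropped
add_score <- vapply(tokens, function(w) sum(lookup[w], na.rm = TRUE), numeric(1))
print(add_score)
```

Each comment is tokenized once and every word is resolved by a single named look-up, so the work grows with the number of words in the comments rather than with comments times score words.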
Data Preparation
a <- data.frame(words = c("hi","no","okay","why"),
score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"),
score = c(.5,.3,.2))
comment_df <- data.frame(id = c("1","2","3"),
comments = c("hi, he said. why is it okay",
"okay okay okay no",
"yes, here is it"))
Apologies if the example is not formatted properly.
I have a data set with one sample per row, the data contain two columns with reference numbers of the start value and end value.
cell A1 = Sample #1
cell B1 = 101-263 (start value)
cell C1 = 101-266 (end value)
cell A2 = Sample #2
cell B2 = 162-186 (start value)
cell C2 = 162-187 (end value)
The range of values has a different length for each row of data, with a maximum range of 8 values. I need to fill in the values in the range, with each value in its own cell along the row.
So for sample #1 above I need to create the cell values: D1 = 101-264, and E1 = 101-265
While for sample #2 there will be no extra cells needed.
Is there a formula (using Vlookup and If perhaps?) that I can create and drag across all rows and over the 8 needed columns to fill in this data? (I don't mind if there are N/A in the shorter-range rows)
If there is an easier way using R, that is also fine with me.
Thanks for any advice
Please try in D1 copied across eight columns and then D1:K1 copied down to suit:
=IF(1*RIGHT($C1,3)>RIGHT($B1,3)+COLUMN()-3,LEFT($B1,4)&RIGHT($B1,3)+COLUMN()-3,"")
The condition (IF) checks whether or not to display a result: the cell is left blank ("", for neater presentation) once the incremented value equals or exceeds the upper limit specified in column C.
There is some text manipulation (RIGHT and LEFT) to get at the part that is to be incremented as an integer and to add back the static part.
COLUMN() returns the column number (A is 1, B is 2, etc.), so it is useful as a kind of stepping function. In D1, COLUMN()-3 is 4-3 or 1, so 1 is added to the start of the range (the number on the right of B1). When copied across to column E, COLUMN()-3 becomes 5-3, so 2 is added to the start of the range.
The following code:
library(magrittr)
library(plyr)
library(reshape2)
# Create input example
dat = data.frame(
sample = c("Sample #1", "Sample #2"),
start = c("101-263", "162-186"),
end = c("101-266", "162-187"),
stringsAsFactors = FALSE
)
# Extract 'start' and 'end' values
dat$num1 = dat$start %>% strsplit("-") %>% sapply("[", 1)
dat$start2 = dat$start %>% strsplit("-") %>% sapply("[", 2) %>% as.numeric
dat$end2 = dat$end %>% strsplit("-") %>% sapply("[", 2) %>% as.numeric
dat$start = NULL
dat$end = NULL
# For each row
for(i in 1:nrow(dat)) {
# Check if there is any need to add entries
if((dat$end2[i] - dat$start2[i]) > 1) {
# For each entry
for(j in seq(dat$start2[i], dat$end2[i] -1)) {
# Create entry
new_entry = data.frame(
sample = dat$sample[i],
num1 = dat$num1[i],
start2 = dat$start2[i],
end2 = j,
stringsAsFactors = FALSE
)
# Add to table
dat = rbind(dat, new_entry)
}
}
}
# Calculate all values
dat$value = paste0(dat$num1, "-", dat$end2)
dat = dat[, c("sample", "value")]
# Create column labels
dat = ddply(
dat,
"sample",
transform,
var = paste0("col", rank(value))
)
# Reshape to required format
dat = dcast(dat, sample ~ var, value.var = "value")
Does what you asked on the provided example.
It transforms this table -
sample start end
1 Sample #1 101-263 101-266
2 Sample #2 162-186 162-187
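Into this one (note that Sample #2 keeps only its end value, because the loop adds rows only when end and start differ by more than 1) -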
sample col1 col2 col3 col4
1 Sample #1 101-263 101-264 101-265 101-266
2 Sample #2 162-187 <NA> <NA> <NA>
If there is a larger example for testing, I will be happy to try it :)
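For comparison, the same expansion can be sketched in base R without plyr or reshape2. This is only an illustration under stated assumptions: every value has the form prefix-number, the end is never smaller than the start, and the numeric part has no leading zeros that need preserving. The names prefix, from, to, rows and result are illustrative. Unlike the code above, it keeps both the start and the end value for every sample.

```r
dat <- data.frame(
  sample = c("Sample #1", "Sample #2"),
  start = c("101-263", "162-186"),
  end = c("101-266", "162-187"),
  stringsAsFactors = FALSE
)

# Split each value into its static prefix and its numeric part
prefix <- sub("-.*$", "", dat$start)
from <- as.numeric(sub("^.*-", "", dat$start))
to <- as.numeric(sub("^.*-", "", dat$end))

# Widest range determines how many columns are needed
max_len <- max(to - from + 1)

# Build one padded character vector per row, then stack them into a matrix
rows <- lapply(seq_len(nrow(dat)), function(i) {
  vals <- paste0(prefix[i], "-", seq(from[i], to[i]))
  c(vals, rep(NA_character_, max_len - length(vals)))
})
filled <- do.call(rbind, rows)
colnames(filled) <- paste0("col", seq_len(max_len))

result <- data.frame(sample = dat$sample, filled, stringsAsFactors = FALSE)
print(result)
```

Shorter ranges are padded with NA on the right, which matches the question's note that N/A in the shorter-range rows is acceptable.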