Creating a new variable based on string matching in R

I have the following dataframe:
df <- data.frame(Sample_name = c("01_00H_NA_DNA", "01_00H_NA_RNA", "01_00H_NA_S", "01_00H_NW_DNA", "01_00H_NW_RNA", "01_00H_NW_S", "01_00H_OM_DNA", "01_00H_OM_RNA", "01_00H_OM_S", "01_00H_RL_DNA", "01_00H_RL_RNA", "01_00H_RL_S"),
Pair = c("","", "S1","","","S2","","","S3","", "","S5"))
I am trying to create a new variable, treatment, based on Sample_name. I used the following code:
df$treatment <- ifelse(grep("_NA_", df$sample_name, ignore.case = T), "nat",
ifelse(grep("_NW_", df$sample_name, ignore.case = T), "natH2",
ifelse(grep("_RL_", df$sample_name, ignore.case = T), "RNALat",
ifelse(grep("_OM_", df$sample_name, ignore.case = T ), "Om"))))
I don't understand what I am doing wrong here. I get this error:
Error in $<-.data.frame(*tmp*, "treatment", value = logical(0)) :
replacement has 0 rows, data has 12
Any suggestions?

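Got the answer: replace each grep with grepl. grep returns the indices of the matching elements, while grepl returns a logical vector with one value per element, which is what ifelse needs. The column name also has to be spelled Sample_name, because R is case-sensitive and df$sample_name is NULL (which is why the replacement had 0 rows):
df$treatment <- ifelse(grepl("_NA_", df$Sample_name, ignore.case = TRUE), "nat",
ifelse(grepl("_NW_", df$Sample_name, ignore.case = TRUE), "natH2",
ifelse(grepl("_RL_", df$Sample_name, ignore.case = TRUE), "RNALat",
# the final "NA" is a literal string; use NA_character_ for a true missing value
ifelse(grepl("_OM_", df$Sample_name, ignore.case = TRUE), "Om", "NA"))))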
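A minimal sketch of the difference between grep() and grepl(), followed by a lookup-table alternative to the nested ifelse() calls. The lookup approach is a swapped-in technique, not the method used in the answer. It assumes the treatment code is always the third underscore-separated field of Sample_name, and the "XX" sample is invented to show what happens to an unmatched code.

```r
# Small example data; the last sample name is invented for illustration
df <- data.frame(Sample_name = c("01_00H_NA_DNA", "01_00H_NW_RNA",
                                 "01_00H_OM_S", "01_00H_RL_DNA",
                                 "01_00H_XX_S"))

# grep() returns the positions of the matches (here a single index)
matches_idx <- grep("_NA_", df$Sample_name)

# grepl() returns one TRUE/FALSE per element, so it lines up with the rows
matches_lgl <- grepl("_NA_", df$Sample_name)

# Lookup table: field code -> treatment label (names must be quoted,
# because NA is a reserved word in R)
codes <- c("NA" = "nat", "NW" = "natH2", "RL" = "RNALat", "OM" = "Om")

# Take the third underscore-separated field of each sample name
key <- vapply(strsplit(df$Sample_name, "_", fixed = TRUE),
              function(parts) parts[3], character(1))

# Codes that are not in the lookup table become NA
df$treatment <- unname(codes[key])
```

Adding a new treatment then only means adding one entry to codes rather than another nested ifelse().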


R - extract minimum and maximum values

I have a list something like this:
my_data<- list(c(dummy= 300), structure(123.7, .Names = ""),
structure(143, .Names = ""), structure(113.675, .Names = ""),
structure(163.75, .Names = ""), structure(656, .Names = ""),
structure(5642, .Names = ""), structure(1232, .Names = ""))
I want the minimum and maximum values from this list.
I have tried using
min(my_data)
max(my_data)
But I get an error: Error in min(weighted_mae) : invalid 'type' (list) of argument
typeof(my_data) #[1] "list"
class(my_data) #[1] "list"
What is the right way for getting the minimum and maximum from my_data?
You could do:
my_data |>
unlist(use.names = FALSE) |>
range()
The following is the same, without piping:
range(unlist(my_data, use.names = FALSE))
If you want to get minimum and maximum values separately, then you could do:
min(unlist(my_data, use.names = FALSE))
max(unlist(my_data, use.names = FALSE))
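As a self-contained sketch of the above, the following reproduces the failure on the list and then the unlist() approach. The list is a simplified copy of my_data (the empty names are dropped, which does not change the values).

```r
# Simplified copy of the list from the question
my_data <- list(c(dummy = 300), 123.7, 143, 113.675, 163.75, 656, 5642, 1232)

# min() and max() need an atomic vector, so calling them on the list fails;
# tryCatch() captures the error instead of stopping the script
list_error <- tryCatch(min(my_data), error = function(e) e)

# Flatten the list into a plain numeric vector, dropping element names
values <- unlist(my_data, use.names = FALSE)

# range() returns c(minimum, maximum) in one call
rng <- range(values)
smallest <- rng[1]
largest <- rng[2]
```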

How to fix Error in FUN(left, right) : non-numeric argument to binary operator

When I run the code below in R, I get the error 'FUN(left, right) : non-numeric argument to binary operator'. I tried to fix this by converting the variables of type 'character' to numeric with as.numeric(). However, NAs are introduced by coercion when I do that, and as a result the whole column in my data frame shows only NA for every row. Does anyone know how to fix this error? Thank you in advance!
library(tidyverse)
library(readr)
sessie_03 <- read_csv("~/Downloads/sessie_03.csv")
View(sessie_03)
sessie_03 <- read.csv("~/Downloads/sessie_03.csv", header = TRUE, sep = ",")
stars_master <- read.csv("~/Downloads/stars_master.csv", header = TRUE, sep =";")
stars_numbers <- read.csv("~/Downloads/stars_numbers.csv", header = TRUE, sep = ";", dec = ",")
new_stars_master <- tibble(Title.id = sessie_03$title_id,
Title.year = sessie_03$Year,
Star1.name = sessie_03$imdb.com_star1_name,
Star1.id = sessie_03$imdb.com_star1_id,
Star2.name = sessie_03$imdb.com_star2_name,
Star2.id = sessie_03$imdb.com_star2_id,
Star3.name = sessie_03$imdb.com_star3_name,
Star3.id = sessie_03$imdb.com_star3_id,
)
new_stars_numbers <- tibble(Star.id = stars_numbers$imdb_com_star_id,
Title.year = stars_numbers$ï..year,
"Title.year+1" = stars_numbers$ï..year + 1,
Star.rank = stars_numbers$the_numbers_com_starpower_rank
)
STP <- tibble(Title.id = new_stars_master$Title.id,
Star1.id = new_stars_master$Star1.id,
Star1.name = new_stars_master$Star1.name,
Star1.rank = new_stars_master %>% left_join(new_stars_numbers, by = c("Star1.id" = "Star.id",
"Title.year" = "Title.year+1"))
%>% select(Star.rank),
Star2.id = new_stars_master$Star2.id,
Star2.name = new_stars_master$Star2.name,
Star2.rank = new_stars_master %>% left_join(new_stars_numbers, by = c("Star2.id" = "Star.id",
"Title.year" = "Title.year+1"))
%>% select(Star.rank),
Star3.id = new_stars_master$Star3.id,
Star3.name = new_stars_master$Star3.name,
Star3.rank = new_stars_master %>% left_join(new_stars_numbers, by = c("Star3.id" = "Star.id",
"Title.year" = "Title.year+1"))
%>% select(Star.rank),
Star.power = (Star1.rank + Star2.rank + Star3.rank ) /3
)
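One way to investigate the NAs, sketched under assumptions: the question does not show the data, so this assumes the rank column was read as character because some entries are not plain numbers (decimal commas, padding spaces or placeholder text). The sample values below are invented for illustration.

```r
# Invented examples of values that can force a column to be read as character
rank_raw <- c("12", "3,5", "n/a", " 7 ")

# Trim whitespace and turn a decimal comma into a decimal point before converting;
# suppressWarnings() hides the "NAs introduced by coercion" warning
rank_clean <- sub(",", ".", trimws(rank_raw), fixed = TRUE)
rank_num <- suppressWarnings(as.numeric(rank_clean))

# List the original entries that still could not be converted
not_numeric <- rank_raw[is.na(rank_num)]
```

Looking at not_numeric on the real column shows which entries block the conversion, which is usually more informative than the coercion warning alone.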

Merging .txt files from 2002-2020 for bulk financial data into one table

I am very new to R and RStudio and am currently running code from Machine Learning with R Quick Start Guide to review bulk financial data. I am running the following code chunk in R:
library(readr)
t <- proc.time()
for (i in 1:length(myfiles)){
tables<-list()
myfiles <- list.files(path = "~/MachineLearning/Banks_model", pattern = "20", full.names = TRUE)
filelist <- list.files(path = myfiles[i], pattern = "*", full.names = TRUE)
filelist<-filelist[1:(length(filelist)-1)]
for (h in 1:length(filelist)){
#assuming tab separated values with a header
aux = as.data.frame(read_delim(filelist[h], "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE, skip = 2))
variables<-colnames(as.data.frame(read_delim(filelist[h], "\t", escape_double = FALSE, col_names = TRUE, trim_ws = TRUE, skip = 0)))
colnames(aux)<-variables
dataset_name<-paste("aux",h,sep='')
tables[[h]]<-assign(dataset_name,aux)
}
final_data_name<-paste("year",i+2001,sep='')
union <- Reduce(function(x,y) merge(x, y, all=TRUE,
by=c("ID RSSD","Reporting Period")), tables, accumulate=FALSE)
assign(final_data_name,union)
rm(list=ls()[! ls() %in% c(ls(pattern="year*"),"tables","t")])
}
proc.time() - t
and received the following error:
Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns
The traceback in RStudio is as follows:
7. stop(ngettext(sum(bad), "'by' must specify a uniquely valid column", "'by' must specify uniquely valid columns"), domain = NA)
6. fix.by(by.x, x)
5. merge.data.frame(x, y, all = TRUE, by = c("ID RSSD", "Reporting Period"))
4. merge(x, y, all = TRUE, by = c("ID RSSD", "Reporting Period"))
3. merge(x, y, all = TRUE, by = c("ID RSSD", "Reporting Period"))
2. f(init, x[[i]])
1. Reduce(function(x, y) merge(x, y, all = TRUE, by = c("ID RSSD", "Reporting Period")), tables, accumulate = FALSE)
Any idea what the fix is please?
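merge() raises this error when a table passed to it does not contain every column named in by. A hedged sketch of how to find the offending tables before merging: the three small tables below are invented stand-ins for the files read in the loop, and key_cols and has_keys are hypothetical helper names.

```r
key_cols <- c("ID RSSD", "Reporting Period")

# Invented stand-ins for the per-file tables; the second lacks one key column
tables <- list(
  data.frame(`ID RSSD` = 1:2, `Reporting Period` = "2002", a = 1:2,
             check.names = FALSE),
  data.frame(`ID RSSD` = 1:2, b = 3:4, check.names = FALSE),
  data.frame(`ID RSSD` = 1:2, `Reporting Period` = "2002", c = 5:6,
             check.names = FALSE)
)

# TRUE for each table that contains every key column
has_keys <- vapply(tables, function(tb) all(key_cols %in% names(tb)),
                   logical(1))

# Positions of the tables that would make merge() fail
problem_tables <- which(!has_keys)

# Merge only the tables that have both keys
merged <- Reduce(function(x, y) merge(x, y, all = TRUE, by = key_cols),
                 tables[has_keys])
```

Inspecting names(tables[[k]]) for each k in problem_tables shows whether the file has a different header, for example a renamed or missing key column.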

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 0, 588

I'm trying to work with an API from the NS (Dutch train company). I want to have it in a dataframe format but I get this error when I run the following code:
NSspoorkaart <- GET("https://gateway.apiportal.ns.nl/Spoorkaart-API/api/v1/spoorkaart",
add_headers("Ocp-Apim-Subscription-Key" = "f354d5839ec5454fbaf1bc44304b1845"))
JSON <- fromJSON(content(NSspoorkaart, "text"), flatten = TRUE)
Data_NS <- as.data.frame(JSON)
Can someone explain what I'm doing wrong?
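Would this work? It pulls the features out of the parsed list instead of converting the whole list with as.data.frame():
library(httr)
# Request the data; the subscription key is sent as a header
NSspoorkaart <- GET("https://gateway.apiportal.ns.nl/Spoorkaart-API/api/v1/spoorkaart", add_headers("Ocp-Apim-Subscription-Key" = "f354d5839ec5454fbaf1bc44304b1845"))
# Extract the response body as a UTF-8 string, then parse the JSON
NSspoorkaart.string <- content(NSspoorkaart, as = "text", encoding = "UTF-8")
NSspoorkaart.list <- jsonlite::fromJSON(NSspoorkaart.string)
# Take the features element of the payload
NSspoorkaart.df <- NSspoorkaart.list$payload$features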
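The "differing number of rows" error can be reproduced without the API. In the sketch below, parsed is a hypothetical stand-in for the parsed JSON (the real response structure is an assumption and may differ): its top-level elements have different lengths, so as.data.frame() cannot line them up as columns, while the nested element is already a data frame.

```r
# Hypothetical stand-in for the parsed JSON; the real response may differ
parsed <- list(
  links = character(0),
  payload = list(
    type = "FeatureCollection",
    features = data.frame(id = 1:3, name = c("a", "b", "c"))
  )
)

# The top-level elements have different lengths
element_lengths <- lengths(parsed)

# Converting the whole list fails; tryCatch() captures the error object
conversion <- tryCatch(as.data.frame(parsed), error = function(e) e)

# Selecting the nested element gives a usable data frame directly
features <- parsed$payload$features
```

Running str(NSspoorkaart.list, max.level = 2) on the real response is a quick way to see where the tabular part actually sits.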

Simplify R code to import big data as character

I often use the code below to import a big dataset into R, forcing it to treat everything as character in order to avoid the truncation of rows. The code seems to work well, but I was wondering whether any of you knows how it could be simplified or improved so it doesn't get so repetitive each time I need to do it.
library(readr)
library(stringr)
dataset.path <- choose.files(caption = "Select dataset", multi = FALSE)
data.columns <- read_delim(dataset.path, delim = '\t', col_names = TRUE, n_max = 0)
data.coltypes <- c(rep("c", ncol(data.columns)))
data.coltypes <- str_c(data.coltypes, collapse = "")
dataset <- read_delim(dataset.path, delim = '\t', col_names = TRUE, col_types = data.coltypes)
As @Roland suggested, you should write a function. Here is one possibility:
foo <- function(){
require(readr)
# Let the user pick the file interactively (choose.files() is Windows-only)
dataset.path <- choose.files(caption = "Select dataset", multi = FALSE)
# Read only the header row to count the columns
data.columns <- read_delim(dataset.path, delim = '\t', col_names = TRUE, n_max = 0)
# One "c" (character) per column, e.g. "ccc" for three columns
data.coltypes <- paste(rep("c", ncol(data.columns)), collapse = "")
# Return the full dataset with every column read as character
read_delim(dataset.path, delim = '\t', col_names = TRUE, col_types = data.coltypes)
}
You can then call dataset <- foo() whenever you need to read a dataset in this way.
Your two lines
data.coltypes <- c(rep("c", ncol(data.columns)))
data.coltypes <- str_c(data.coltypes, collapse = "")
can be collapsed into one, as in the function above, using base R's paste() instead of str_c() from the stringr package.
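If reading the header first is only there to build the column-type string, it can be skipped. readr accepts col_types = cols(.default = col_character()), which applies character to every column. The sketch below shows the same idea in base R so that it runs without extra packages. It is a swapped-in alternative, not the method from the answer; read_all_character is a hypothetical helper name and the temporary file stands in for the real dataset.

```r
# Hypothetical helper: read a delimited file with every column as character
read_all_character <- function(path, sep = "\t") {
  read.delim(path, sep = sep, colClasses = "character",
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Small temporary file standing in for the real dataset
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "001\t1.50", "002\t2.25"), tmp)

dataset <- read_all_character(tmp)

# Leading zeros and trailing zeros survive because nothing is parsed as a number
col_classes <- vapply(dataset, class, character(1))
```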
