i am using Expss pakage .
df<-read_spss("test.SAV")
I shows the following:
Warning message: In foreign::read.spss(enc2native(file),
use.value.labels = FALSE, : Tally.SAV: Very long string record(s)
found (record type 7, subtype 14), each will be imported in
consecutive separate variables
It shows 4174 Variables in environment Panel.Actual Number of Variables in the Data file around 400.
Can anyone among you please help me on this.
As mentioned in the comment foreign::read.spss split SPSS long (>255 chars) characters variables into the several columns. If the such columns are empty you can drop them without any issues.
Convenience function for this:
remove_empty_characters_after_foreign = function(data){
empty_chars = vapply(data, FUN = function(column) is.character(column) & all(is.na(column)), FUN.VALUE = logical(1))
additional_chars = grepl("00\\d$", colnames(data), perl = TRUE)
to_remove = empty_chars & additional_chars
if(any(to_remove)){
message(paste0("Removing ", paste(colnames(data)[to_remove], collapse = ", "),"..."))
}
data[,!to_remove, drop = FALSE]
}
df = remove_empty_characters_after_foreign(df)
Related
Using R in Databricks.
I have the following sample list of possible text entries.
extract <- c("codeine", "tramadol", "fentanyl", "morphine")
I want check if any of these appear more than once in a string (example below) and return a binary output in a new column.
Example = ("codeine with fentanyl oral")
The output for this example would be 1.
I have tried the following with only partial success:
df$testvar1 <- +(str_count(df$medname, fixed(extract))> 1)
also tried
df$testvar2 <- cSplit_e(df$medname, split.col = "String", sep = " ", type = "factor", mode = "binary", fixed = TRUE, fill = 0)
and also
df$testvar3 <- str_extract_all(df$medname, paste(extract, collapse = " "))
Combine your extract with |.
+(stringr::str_count(Example, paste(extract, collapse = "|"))> 1)
# [1] 1
I tried the following and it worked for my code
df$testvar <- sapply(df$medname, function(x) str_extract(x, paste(extract, collapse="|")))
I am attempting to split out a flags column into multiple new columns in r using mutate_at and then separate functions. I have simplified and cleaned my solution as seen below, however I am getting an error that indicates that the entire column of data is being passed into my function rather than each row individually. Is this normal behaviour which just requires me to loop over each element of x inside my function? or am I calling the mutate_at function incorrectly?
example data:
dataVariable <- data.frame(c_flags = c(".q.q.q","y..i.o","0x5a",".lll.."))
functions:
dataVariable <- read_csv("...",
col_types = cols(
c_date = col_datetime(format = ""),
c_dbl = col_double(),
c_flags = col_character(),
c_class = col_factor(c("a", "b", "c")),
c_skip = col_skip()
))
funTranslateXForNewColumn <- function(x){
binary = ""
if(startsWith(x, "0x")){
binary=hex2bin(x)
} else {
binary = c(0,0,0,0,0,0)
splitFlag = strsplit(x, "")[[1]]
for(i in splitFlag){
flagVal = 1
if(i=="."){
flagVal = 0
}
binary=append(binary, flagVal)
}
}
return(paste(binary[4:12], collapse='' ))
}
mutate_at(dataVariable, vars(c_flags), funs(funTranslateXForNewColumn(.)))
separate(dataVariable, c_flags, c(NA, "flag_1","flag_2","flag_3","flag_4","flag_5","flag_6","flag_7","flag_8","flag_9"), sep="")
The error I am receiving is:
Warning messages:
1: Problem with `mutate()` input `c_flags`.
i the condition has length > 1 and only the first element will be used
After translating the string into an appropriate binary representation of the flags, I will then use the seperate function to split it into new columns.
Similar to OP's logic but maybe shorter :
dataVariable$binFlags <- sapply(strsplit(dataVariable$c_flags, ''), function(x)
paste(as.integer(x != '.'), collapse = ''))
If you want to do this using dplyr we can implement the same logic as :
library(dplyr)
dataVariable %>%
mutate(binFlags = purrr::map_chr(strsplit(c_flags, ''),
~paste(as.integer(. != '.'), collapse = '')))
# c_flags binFlags
#1 .q.q.q 010101
#2 y..i.o 100101
#3 .lll.. 011100
mutate_at/across is used when you want to apply a function to multiple columns. Moreover, I don't see here that you are creating only one new binary column and not multiple new columns as mentioned in your post.
I was able to get the outcome I desired by replacing the mutate_at function with:
dataVariable$binFlags <- mapply(funTranslateXForNewColumn, dataVariable$c_flags)
However I want to know how to use the mutate_at function correctly.
credit to: https://datascience.stackexchange.com/questions/41964/mutate-with-custom-function-in-r-does-not-work
The above link also includes the solution to get this function to work which is to vectorize the function:
v_funTranslateXForNewColumn <- Vectorize(funTranslateXForNewColumn)
mutate_at(dataVariable, vars(c_flags), funs(v_funTranslateXForNewColumn(.)))
I have written a function that "cleans up" taxonomic data from NGS taxonomic files. The problem is that I am unable to replace NA cells with a string like "undefined". I know that it has something to do with variables being made into factors and not characters (Warning message: In `...` : invalid factor level, NA generated), however even when importing data with stringsAsFactors = FALSE I still get this error in some cells.
Here is how I import the data:
raw_data_1 <- taxon_import(read.delim("taxonomy_site_1/*/*/*/taxonomy.tsv", stringsAsFactors = FALSE))
The taxon_import function is used to split the taxa and assign variable names:
taxon_import <- function(data) {
data <- as.data.frame(str_split_fixed(data$Taxon, ";", 7))
colnames(data) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
return(data)
}
Now the following function is used to "clean" the data and this is where I would like to replace certain strings with "Undefined", however I keep getting the error: In[<-.factor(tmp, thisvar, value = "Undefined") : invalid factor level, NA generated
Here follows the data_cleanup function:
data_cleanup <- function(data) {
strip_1 = list("D_0__", "D_1__", "D_2__", "D_3__", "D_4__", "D_5__", "D_6__")
for (i in strip_1) {
data <- as.data.frame(sapply(data, gsub, pattern = i, replacement = ""))
}
data[data==""] <- "Undefined"
strip_2 = list("__", "unidentified", "Ambiguous_taxa", "uncultured", "Unknown", "uncultured .*", "Unassigned .*", "wastewater Unassigned", "metagenome")
for (j in strip_2) {
data <- as.data.frame(sapply(data, gsub, pattern = j, replacement = "Undefined"))
}
return(data)
}
The function is simply applied like: test <- data_cleanup(raw_data_1)
I am appending the data from a cloud, since it is very lengthy data. Here is the link to a data file https://drive.google.com/open?id=1GBkV_sp3A0M6uvrx4gm9Woaan7QinNCn
I hope you will forgive my ignorance, however I tried many solutions before posting here.
We start by using the tidyverse library. Let me give a twist to your question, as it's about replacing NAs, but I think with this code you should avoid that problem.
As I read your code, you erase the strings "D_0__", "D_1__", ... from the observation strings. Then you replace the strings "Ambiguous_taxa", "unidentified", ... with the string "Undefined".
According to your data, I replaced the functions with regex, which makes a little easy to clean your data:
library(tidyverse)
taxon_import <- function(data) {
data <- as.data.frame(str_split_fixed(data$Taxon, ";", 7))
colnames(data) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
return(data)
}
raw_data_1 <- taxon_import(read.delim("taxonomy.tsv", stringsAsFactors = FALSE))
raw_data_1 <- data.frame(lapply(raw_data_1,as.character),stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(raw_data_1,function(x) sub("^D_[0-6]__","",x)), stringAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("__|unidentified|Ambiguous_taxa|uncultured","Undefined",x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("Unknown|uncultured\\s.\\*|Unassigned\\s.\\*","Undefined",x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("wastewater\\sUnassigned|metagenome","Undefined",x)), stringsAsFactors = FALSE)
depured[depured ==""] <- "Undefined"
Let me explain my code. First, I read in many websites that it's better to avoid loops, as "for". So how you replace text that starts with "D_0__"?
The answer is regex (regular expression). It seems complicated at first but with practice it'll be helpful. See this expression:
"^D_[0-6]__"
It means: "Take the start of the string which begins with "D_" and follows a number between 0 and 6 and follows "__"
Aha. So you can use the function sub
sub("^D_[0-6]__","",string)
which reads: replace the regular expression with a blank space "" in the string.
Now you see another regex:
"__|unidentified|Ambiguous_taxa|uncultured"
It means: select the string "__" or "unidentified" or "Ambiguous_taxa" ...
Be careful with this regex
"Unknown|uncultured\\s.\\*|Unassigned\\s.\\*"
it means: select the string "Unknown" or "uncultured .*" or...
the blank space it's represented by \s and the asterisk is \*
Now what about the as.data.frame function? Every time I use it I have to make it "stringsAsFactors = FALSE" because the function tries to use the characters, as factors.
With this code no NA are created.
Hope it helps, please don't hesitate to ask if needed.
Regards,
Alexis
I want to perform a set of operations (in R) on a number of data frames located within a list. In particular, for each of one I create a "library" column, which is then used to determine which kind of filtering operation to perform. This is the actual code:
sampleList <- list(RNA1 = "data/not_processed/dedup.Bp1R4T2_S2.txt",
RNA2 = "data/not_processed/dedup.Bp1R4T3_S4.txt",
RNA3 = "data/not_processed/dedup.Bp1R5T2_S1.txt",
RNA4 = "data/not_processed/dedup.Bp1R5T3_S2.txt",
RNA5 = "data/not_processed/dedup.Bp1R14T5_S1.txt",
RNA6 = "data/not_processed/dedup.Bp1R14T6_S1.txt",
RNA7 = "data/not_processed/dedup.Bp1R14T6_S2.txt",
RNA8 = "data/not_processed/dedup.Bp1R14T7_S2.txt",
RNA9 = "data/not_processed/dedup.Bp1R14T8_S3.txt",
RNA10 = "data/not_processed/dedup.Bp1R14T9_S3.txt",
RNA11 = "data/not_processed/dedup.Bp1R14T9_S4.txt",
DNA1 = "data/not_processed/dedup.dna10_1_S4.txt",
DNA2 = "data/not_processed/dedup.dna10_2_S5.txt",
DNA3 = "data/not_processed/dedup.dna10_3_S6.txt",
DNA4 = "data/not_processed/dedup.dna50_1_S1.txt",
DNA5 = "data/not_processed/dedup.dna50_2_S2.txt",
DNA6 = "data/not_processed/dedup.dna50_3_S3.txt",
DNA7 = "data/not_processed/dedup.dna50_pcrcocktail_S7.txt")
batch <- lapply(names(sampleList),function(mysample){
aux <- read.table(sampleList[[mysample]], col.names=c(column1, column2, ..., ID, library, column4, etc...))
aux %>% mutate(library = mysample, R = Fw_ref + Rv_ref, A = Fw_alt + Rv_alt) %>% distinct(ID, .keep_all=T)
if (grepl("DNA", aux$library)){
aux %>% filter(aux$R>1 & aux$A>1)
} else {
aux %>% filter((aux$R+aux$A)>7 & aux$Fw_ref>=1 & aux$Rv_ref>=1 & aux$Fw_alt>=1 & aux$Rv_alt>=1)
}
aux
})
batch_file <- do.call(rbind, batch)
write.table(batch_file, "data/batch_file.txt", col.names = T, sep = "\t")
The possible values of the library column are DNA1 to DNA7, and RNA1 to 11. I tried also with "char" %in%, but it gives the same problem:
Error in if (grepl("DNA", aux$library)) { : argument is of length zero
Seems like the if condition is not able to identify the value in library. However, when I tried to apply the if/else condition on the batch_file (not filtered, basically obtained with this code without the if/else part) it worked perfectly.
Many thanks in advance.
This question is related to a previous topic:
How to use custom function to create new binary variables within existing dataframe?
I would like to use a similar function but be able to use a vector to specify ICD9 diagnosis variables within the dataframe to search for (e.g., "diag_1", "diag_2","diag_1", etc )
I tried
y<-c("diag_1","diag_2","diag_1")
diagnosis_func(patient_db, y, "2851", "Anemia")
but I get the following error:
Error in `[[<-`(`*tmp*`, i, value = value) :
recursive indexing failed at level 2
Below is the working function by Benjamin from the referenced post. However, it works only from 1 diagnosis variable at a time. Ultimately I need to create a new binary variable that indicates if a patient has a specific diagnosis by querying the 25 diagnosis variables of the dataframe.
*targetcolumn is the icd9 diagnosis variables "diag_1"..."diag_20" is the one I would like to input as vector
diagnosis_func <- function(data, target_col, icd, new_col){
pattern <- sprintf("^(%s)",
paste0(icd, collapse = "|"))
data[[new_col]] <- grepl(pattern = pattern,
x = data[[target_col]]) + 0L
data
}
diagnosis_func(patient_db, "diag_1", "2851", "Anemia")
This non-function version works for multiple diagnosis. However I have not figured out how to use it in a function version as above.
pattern = paste("^(", paste0("2851", collapse = "|"), ")", sep = "")
df$anemia<-ifelse(rowSums(sapply(df[c("diag_1","diag_2","diag_3")], grepl, pattern = pattern)) != 0,"1","0")
Any help or guidance on how to get this function to work would be greatly appreciated.
Best,
Albit
Try this modified version of Benjamin's function:
diagnosis_func <- function(data, target_col, icd, new_col){
pattern <- sprintf("^(%s)",
paste0(icd, collapse = "|"))
new <- apply(data[target_col], 2, function(x) grepl(pattern=pattern, x)) + 0L
data[[new_col]] <- ifelse(rowSums(new)>0, 1,0)
data
}