I have a data frame containing # as a missing value in multiple columns. How can I convert all such #s to NAs?
is.na(dat) <- dat == "#"
will do the trick (where dat is the name of your data frame).
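For example, on a toy data frame (the column names and values here are made up for illustration):

```r
# toy data frame using "#" as a missing-value marker
dat <- data.frame(x = c("1", "#", "3"),
                  y = c("#", "b", "c"),
                  stringsAsFactors = FALSE)

# assign NA wherever the comparison is TRUE
is.na(dat) <- dat == "#"

dat$x  # now c("1", NA, "3")
dat$y  # now c(NA, "b", "c")
```

The comparison `dat == "#"` produces a logical matrix, and the `is.na<-` replacement form sets NA at every TRUE position.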
You can do this a few ways. One is to re-read the file with the na.strings argument set to "#":
read.table(file, na.strings = "#")
Another would be to just change the values in the data frame df with
df[df == "#"] <- NA
I have written a function makemeNA that is part of my "SOfun" package.
The function looks like this (in case you don't want to get the package just for this function):
makemeNA <- function(mydf, NAStrings, fixed = TRUE) {
  if (!isTRUE(fixed)) {
    mydf[] <- lapply(mydf, function(x) gsub(NAStrings, "", x))
    NAStrings <- ""
  }
  mydf[] <- lapply(mydf, function(x) type.convert(
    as.character(x), na.strings = NAStrings))
  mydf
}
Usage would be:
makemeNA(df, "#")
Get the package with:
library(devtools)
install_github("mrdwab/SOfun")
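The heavy lifting in the function above is done by base R's type.convert, which both substitutes NA and re-types the column in one step; a quick sketch of it on its own:

```r
# type.convert re-parses a character vector, treating "#" as NA,
# and returns an integer/numeric vector when the remaining
# values all parse as numbers
x <- c("1", "#", "3")
y <- type.convert(x, na.strings = "#", as.is = TRUE)
y         # c(1L, NA, 3L)
class(y)  # "integer"
```

This is why makemeNA converts the classes of the columns as a side effect of replacing the NA strings.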
I have a function like this:
do.call(cbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, header = FALSE,
                                                  col.names = )))
Unfortunately, the files I want to read don't have headers. First, I'd like to read only the second column, but it seems read.csv doesn't have that option. More importantly, I'd like the second column to be named after the file name. How can this be done?
If we use read.csv, we can read each file, extract the second column by indexing, and name that column after the file name:
out <- do.call(cbind, lapply(files, function(x) {
  dat <- read.csv(x, stringsAsFactors = FALSE, header = FALSE)[2]
  names(dat) <- x
  dat
}))
Or, using fread:
library(data.table)
do.call(cbind, lapply(files, function(x) setnames(fread(x, select = 2), x)))
I have created a nested list of data frames with read.csv and lapply. Each data frame in the list has product as its first column and data on various countries in the remaining 239 columns.
All the numbers are stored as character, and I wish to convert them to numeric for each data frame in the list.
I have used the following code, but it removes the product column (column 1) from each data frame and keeps only columns 2:240. How can I prevent the product column from being dropped?
files <- list.files(path = "D:\\R34\\casia3\\data_kaz\\export\\", pattern = "*.csv")
myfiles <- lapply(files, function(x) {
  df <- read.csv(x, strip.white = TRUE, stringsAsFactors = FALSE, sep = ",")
  df$ID <- as.character(x)
  return(df)
})
myfiles <- lapply(myfiles, function(x) lapply(x[2:240], as.numeric))
We can use type.convert to automatically convert the class of each column:
lstdat <- lapply(lstdat, function(x) {
  x[] <- lapply(x, type.convert, as.is = TRUE)
  x
})
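A minimal sketch of why this keeps the product column intact (toy data standing in for one of the data frames in the list, with made-up column names):

```r
df <- data.frame(product = c("wheat", "oil"),
                 y2019   = c("10", "20"),
                 stringsAsFactors = FALSE)

# type.convert re-types each column: numbers become integer/numeric,
# non-numeric text stays character (with as.is = TRUE)
df[] <- lapply(df, type.convert, as.is = TRUE)

sapply(df, class)  # product: "character", y2019: "integer"
```

Because we assign into `df[]` rather than replacing `df`, all columns are preserved; type.convert simply leaves the non-numeric product column as character.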
Try doing
myfiles <- lapply(myfiles, function(x) {x[2:240] <- lapply(x[2:240], as.numeric); x})
Since you apply as.numeric only to columns 2:240, only those columns are returned. Instead, assign the converted values back into those columns and return the entire data frame from the inner function.
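To see the difference concretely, here is a sketch on a toy list (column counts shrunk from 240 to 3; the names are invented):

```r
lst <- list(data.frame(product = "a", v1 = "1", v2 = "2",
                       stringsAsFactors = FALSE))

# returns only columns 2:3 -- the product column is dropped
bad  <- lapply(lst, function(x) lapply(x[2:3], as.numeric))

# assigns into columns 2:3 and returns the whole data frame
good <- lapply(lst, function(x) {x[2:3] <- lapply(x[2:3], as.numeric); x})

names(good[[1]])  # "product" "v1" "v2"
```

In the first version the inner lapply's return value *is* the result, so everything outside columns 2:3 is lost; in the second, the modified data frame is returned as a whole.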
If interested, you might also consider this tidyverse alternative:
library(tidyverse)
myfiles <- map(myfiles, . %>% mutate_at(2:240, as.numeric))
I am reading in a bunch of CSVs that have stuff like "sales - thousands" in the title and come into R as "sales...thousands". I'd like to use a regular expression (or other simple method) to clean these up.
I can't figure out why this doesn't work:
#mock data
a <- data.frame(this.is.fine = letters[1:5],
this...one...isnt = LETTERS[1:5])
#column names
colnames(a)
# [1] "this.is.fine" "this...one...isnt"
#function to collapse repeated periods
colClean <- function(x){
  colnames(x) <- gsub("\\.\\.+", ".", colnames(x))
}
#run function
colClean(a)
#names go unaffected
colnames(a)
# [1] "this.is.fine" "this...one...isnt"
but this code does:
#direct change to names
colnames(a) <- gsub("\\.\\.+", ".", colnames(a))
#new names
colnames(a)
# [1] "this.is.fine" "this.one.isnt"
Note that I'm fine leaving one period between words when that occurs.
Thank you.
You can use the gsub function to replace . with another special character such as #:
names(a) <- gsub(x = names(a), pattern = "\\.", replacement = "#")
Rich Scriven had the answer: the original function modified a local copy of x and never returned it, so a was left unchanged. Define
colClean <- function(x){ colnames(x) <- gsub("\\.\\.+", ".", colnames(x)); x }
and then do
a <- colClean(a)
to update a.
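A quick demonstration of the copy-on-modify behavior behind this fix (toy data, names made up):

```r
a <- data.frame(this.is.fine = 1, this...one...isnt = 2)

# modifies a local copy of x only; the caller's 'a' is untouched
colCleanNoReturn <- function(x) {
  colnames(x) <- gsub("\\.\\.+", ".", colnames(x))
}
colCleanNoReturn(a)
names(a)  # still "this.is.fine" "this...one...isnt"

# returns the modified copy, which we reassign to 'a'
colClean <- function(x) {
  colnames(x) <- gsub("\\.\\.+", ".", colnames(x))
  x
}
a <- colClean(a)
names(a)  # "this.is.fine" "this.one.isnt"
```

R functions receive copies of their arguments, so a function that renames columns must return the object and the caller must reassign it.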
I need to read a list of csv files and, from each of the resulting data frames, extract the values of a specific column (the second one) from the 13th row on.
Here's my try:
temp <- list.files(FILEPATH, pattern="*\\.csv$", full.names = TRUE)
for (i in 1:length(temp)){
  assign(temp[i], read.csv(temp[i], header=TRUE, skip=13, na.strings=c("", "NA")))
  subset(temp[i], select=2) #extract the second column of the dataframe
  temp[i] <- na.omit(temp[i])
}
However, this doesn't work. On the one hand, I think that's because of the skip argument of the read.csv command, as it apparently ignores the headers. On the other hand, if skip is not used, the following error pops up:
Error in subset.default(temp[i], select = 2) : argument "subset" is
missing, with no default
When I insert the argument subset=TRUE in the subset command, it doesn't give any error, but no extraction is performed.
Any possible solution?
Without seeing the files it's not easy to tell, but I would use lapply, not a for loop. Maybe you can get inspiration from something like the following. I use read.table because with skip = 13, read.csv would treat the first remaining line as column headers. Note that I avoid the use of assign.
df_list <- lapply(temp, read.table, sep = ",", skip = 13, na.strings = c("", "NA"))
names(df_list) <- temp
col2_list <- lapply(df_list, `[[`, 2)
col2_list <- lapply(col2_list, na.omit)
names(col2_list) <- temp
col2_list
If you want col2_list to be a list of data frames with just one column each (column 2 of the original files), then, as I've said in a comment, use
col2_list <- lapply(df_list, `[`, 2)
And to rename that one column and renumber the rows consecutively
new_name <- "the_column_of_choice" # change this!
col2_list <- lapply(col2_list, function(x){
  names(x) <- new_name
  row.names(x) <- NULL
  x
})