Change a char value in a data column into zero? - r

I have a very long data frame in which zeros are recorded as the character string "nothing". How can I replace all of these with a numeric 0? A sample data frame is below:
Group Candy
A     5
B     nothing
And this is what I want to change it into:
Group Candy
A     5
B     0
Keep in mind that my actual dataset is hundreds of rows long.
I tried is.na, but it only detects NA values. Those are easy to convert to zeros, but I wasn't sure whether there is a solution for actual character values.
Thanks

The best way is to read the data in correctly in the first place, so that "nothing" is treated as a missing value. This can be done with the na.strings argument of read.table or read.csv. Then change the NAs to zero.
The following function is probably slow for large data frames, but it replaces the "nothing" values with zeros by printing the data frame to a text connection and reading it back in.
nothing_zero <- function(x){
  tc <- textConnection("nothing", "w")  # output is collected in a character vector named `nothing`
  sink(tc)     # divert output to the tc connection
  print(x)     # print into `nothing` instead of the console
  sink()       # set the output back to the console
  close(tc)    # close connection
  tc <- textConnection(nothing, "r")    # reopen the captured text for reading
  y <- read.table(tc, na.strings = "nothing", header = TRUE)
  close(tc)    # close connection
  y[is.na(y)] <- 0
  y
}
nothing_zero(df1)
#  Group Candy
#1     A     5
#2     B     0
The main advantage is to read numeric data as numeric.
str(nothing_zero(df1))
#'data.frame': 2 obs. of 2 variables:
# $ Group: chr "A" "B"
# $ Candy: num 5 0
Data
df1 <- read.table(text = "
Group Candy
A 5
B nothing", header = TRUE)
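The advice to read the data in with na.strings can also be applied directly, without the print-and-reread step. A minimal sketch, assuming the data is available as text or as a file that read.table can parse:

```r
# Treat the string "nothing" as NA while reading, so Candy is parsed as numeric
df1 <- read.table(text = "
Group Candy
A 5
B nothing", header = TRUE, na.strings = "nothing")

# Replace the resulting NA values with zero
df1[is.na(df1)] <- 0
str(df1)
```

The same na.strings argument is accepted by read.csv.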

sapply(df,function(x) {x <- gsub("nothing",0,x)})
Output
a
[1,] "0"
[2,] "5"
[3,] "6"
[4,] "0"
Data
df <- structure(list(a = c("nothing", "5", "6", "nothing")),
class = "data.frame",
row.names = c(NA,-4L))
Another option
df[] <- lapply(df, gsub, pattern = "nothing", replacement = "0", fixed = TRUE)
If you only want to apply this to one column
library(tidyverse)
df$a <- str_replace(df$a,"nothing","0")
Or applying to one column in base R
df$a <- gsub("nothing","0",df$a)
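Note that the gsub and str_replace approaches above leave the column as character ("0" rather than 0). Since the question asks for a numeric 0, a conversion step is still needed. A minimal sketch, assuming every remaining value in the column can be parsed as a number:

```r
df1 <- data.frame(Group = c("A", "B"),
                  Candy = c("5", "nothing"),
                  stringsAsFactors = FALSE)

# Replace the exact string "nothing", then convert the whole column to numeric
df1$Candy[df1$Candy == "nothing"] <- "0"
df1$Candy <- as.numeric(df1$Candy)
str(df1)
```

Any value that is neither "nothing" nor a number would become NA with a warning, which is a useful check on a long data frame.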

Related

rbind named vector to matrix with different lengths

I am trying to bind a named vector onto a matrix. The named vector has a different length than the number of columns in the matrix:
> m <- matrix(data = c("1", "2", "3"),
nrow = 1, ncol = 3,
dimnames = list(c(),
c("column 1", "column 2", "column 3")))
> named_vec <- c("4", "5")
> names(named_vec) <- c("column 1", "column 2")
> rbind(m, named_vec)
I get the following:
Warning message:
In rbind(m, named_vec) :
number of columns of result is not a multiple of vector length (arg 2)
This has the undesired effect of repeating the shorter vector.
Also, plyr's rbind.fill function does not work here, since both arguments need to be data frames:
> plyr::rbind.fill(data.frame(m), data.frame(named_vec))
Error: All inputs to rbind.fill must be data.frames
My desired output is a matrix that fills in missing values with NA's instead of repeating the vector, like this:
column 1 column 2 column 3
[1,] "1" "2" "3"
[2,] "4" "5" NA
Below is a base R solution
do.call(rbind,lapply(u<-list(m,named_vec),`length<-`,max(lengths(u))))
such that
column 1 column 2 column 3
[1,] "1" "2" "3"
[2,] "4" "5" NA
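If the vector's names should be matched against the matrix's column names rather than relying on position, one option is to build a full-width row of NAs first. This is a sketch under the assumption that every name in named_vec is also a column name of m (new_row is a name introduced here for illustration):

```r
m <- matrix(data = c("1", "2", "3"),
            nrow = 1, ncol = 3,
            dimnames = list(NULL, c("column 1", "column 2", "column 3")))
named_vec <- c("column 1" = "4", "column 2" = "5")

# Start from a row of NAs named after the matrix columns
new_row <- setNames(rep(NA_character_, ncol(m)), colnames(m))
# Fill in only the positions whose names appear in named_vec
new_row[names(named_vec)] <- named_vec

result <- rbind(m, new_row, deparse.level = 0)
result
```

Because the fill is done by name, the order of the elements in named_vec does not matter.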
If it is ok to convert the matrices to dataframe, you can use bind_rows.
dplyr::bind_rows(data.frame(m), data.frame(t(named_vec)))
# column.1 column.2 column.3
#1 1 2 3
#2 4 5 <NA>
We can use rbindlist
library(data.table)
rbindlist(list(as.data.frame(m), as.data.frame(t(named_vec))), fill = TRUE)

Convert data frame columns into vectors stored in a list

I have data consisting of many columns/variables and three rows. Each variable is an integer and the values vary across rows and columns. Below is a minimal example of my data:
# Minimal example of data frame I have
df <- data.frame(x1 = c(1,2,3),
x2 = c(4,1,6),
x3 = c(3,0,2),
x4 = c(3,0,1))
I am trying to somehow collapse each column into a numeric vector containing the values in each row. For example, I want something like:
# Desired data based on minimal example
target_list <- list(c(1,2,3),
c(4,1,6),
c(3,0,2),
c(3,0,1))
The ultimate goal is to be able to take another data frame with many columns and generate a new data frame containing only the columns with indices matching the values in each numeric vector. For each vector, I generate another data frame. All frames are stored in a list. An example of my target output given the working example inputs:
# Example "super data frame" I will subset. The values contained in each column are arbitrary.
df2 <- data.frame(z1 = "a", z2 = "b",
z3 = 999, z4 = NA,
z5 = "foo", z6 = "bar")
# Subset "super data frame" to only columns in each vector in the list, store in a list
list(df2[ ,target_list[[1]]],
df2[ ,target_list[[2]]],
df2[ ,target_list[[3]]],
df2[ ,target_list[[4]]])
I've tried various pasting approaches, but they produce character vectors that I can't use to select the columns of the other data frame by index, e.g.:
paste0(df[1, ], df[2, ], df[3, ], df[4, ])
Any help on how to generate the list of numeric vectors from df?
Use as.list:
as.list(df)
#$x1
#[1] 1 2 3
#$x2
#[1] 4 1 6
#$x3
#[1] 3 0 2
#$x4
#[1] 3 0 1
You can use unname to remove names of the list.
Maybe I'm missing something, but the only difference between your input and your target are three attributes:
attributes(df)
#$names
#[1] "x1" "x2" "x3" "x4"
#
#$class
#[1] "data.frame"
#
#$row.names
#[1] 1 2 3
You can remove them:
attributes(df) <- NULL
df
#[[1]]
#[1] 1 2 3
#
#[[2]]
#[1] 4 1 6
#
#[[3]]
#[1] 3 0 2
#
#[[4]]
#[1] 3 0 1
Or, alternatively:
c(unname(unclass(df)))
But, of course, these attributes don't hurt and you can always treat a data.frame like a list (because it actually is a list).
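Putting this together with the asker's stated goal, the list of index vectors can be passed to lapply to build the list of subsets. A minimal sketch using the example data from the question (subsets is a name introduced here for illustration):

```r
df <- data.frame(x1 = c(1, 2, 3),
                 x2 = c(4, 1, 6),
                 x3 = c(3, 0, 2),
                 x4 = c(3, 0, 1))
df2 <- data.frame(z1 = "a", z2 = "b",
                  z3 = 999, z4 = NA,
                  z5 = "foo", z6 = "bar")

# Each column of df becomes one numeric vector of column indices
target_list <- unname(as.list(df))

# Subset df2 once per index vector; drop = FALSE keeps each result a data frame
subsets <- lapply(target_list, function(idx) df2[, idx, drop = FALSE])
lapply(subsets, names)
```

An index of 0 selects nothing in R, so the vector c(3, 0, 2) yields only columns z3 and z2.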

How to ignore a "Level" in R?

Not sure how to do the following. Please refer to the picture in the link below:
https://i.stack.imgur.com/Kx79x.png
I have some blank spaces, and they are the missing values. I do not want this level to be read. I want R to ignore this level. I want to write a regression so that this empty category is not part of the model.
The data was read from a csv file. The variable is "I", "II"...."IV", but there is an extra "" factor because of missing data. I want R to ignore this factor. My question is how?
You can do the following:
df <- data.frame(letters=letters[1:5], numbers=c(1,2,3,"",5), stringsAsFactors = TRUE) # my data frame; stringsAsFactors = TRUE is needed in R >= 4.0 to get factors
#   letters numbers
# 1       a       1
# 2       b       2
# 3       c       3
# 4       d        
# 5       e       5
levels(df$numbers)
# "" "1" "2" "3" "5"
subdf <- subset(df, numbers != "") # data subset
subdf$numbers <- factor(subdf$numbers)
levels(subdf$numbers)
# "1" "2" "3" "5"
Change the "" values to missing:
# generate sample data
df <- data.frame(x = sample(c("","I","II","III"), 100, replace = TRUE), stringsAsFactors = TRUE)
Option 1:
df[df$x=="",'x'] <- NA
Option 2:
df$x <- factor(ifelse(df$x == "",NA,as.character(df$x)))
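After option 1 the empty string is no longer used by any row, but it is still listed among the factor's levels. A minimal sketch of removing it with droplevels, assuming a small hand-made factor rather than the random sample above:

```r
df <- data.frame(x = c("I", "", "II", "III", ""), stringsAsFactors = TRUE)

# Mark the empty strings as missing
df$x[df$x == ""] <- NA
# Drop levels that no longer occur in the data, including ""
df$x <- droplevels(df$x)
levels(df$x)
```

Option 2 does not need this step, because calling factor() again rebuilds the levels from the values that are present.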

In R, compare two lists of strings, find parts of each element of the first list on the second

NOTE: I have updated the question to reflect specific patterns in the data.
Say that I have two vectors.
names_data <- c('A', 'B', 'C', 'D', 'E', 'F')
levels_selected <- c('A1','A3', 'Blow', 'Bhigh', 'D(4.88e+03,9.18+e+04]', 'F')
I want to know how to get a vector, a data frame, a list, or whatever, that checks the levels vector and returns which levels of which variables were selected. Something that says:
A: 1, 3
B: low, high
D: (4.88e+03,9.18e+04]
Ultimately, there is a data frame X for which names_data = names(data) and levels_selected are some, but not all, of the levels in each of the variables. In the end what I want to do is to make a matrix (for, say for example, a random forest) using model.matrix where I want to include only the variables AND levels in levels_selected. Is there a straightforward way of doing so?
We can create a grouping variable ('grp') by keeping only the part of each element of 'levels_selected' that matches one of the 'names_data', then remove that prefix and split the remaining substrings by 'grp' to get a list.
grp <- sub(paste0("^(", paste(names_data, collapse="|"), ").*"), "\\1", levels_selected)
value <- gsub(paste(names_data, collapse="|"), "",
levels_selected)
lst <- split(value, grp)
lst
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "x"
If the desired output is a table of indicators, something like
library(qdapTools)
mtabulate(lst)
# 1 3 high low x
#A 1 1 0 0 0
#B 0 0 1 1 0
#D 0 0 0 0 1
Or another option is using strsplit
d1 <- as.data.frame(do.call(rbind, strsplit(levels_selected,
paste0("(?<=(", paste(names_data, collapse="|"), "))"),
perl=TRUE)), stringsAsFactors=FALSE)
aggregate(V2~V1, d1, FUN= toString)
# V1 V2
#1 A 1, 3
#2 B low, high
#3 D x
and possibly the model.matrix would be
model.matrix(~V1+V2-1, d1)
Update
By using the OP's new example
d1 <- as.data.frame(do.call(rbind, strsplit(levels_selected,
paste0("(?<=(", paste(names_data, collapse="|"), "))"),
perl=TRUE)), stringsAsFactors=FALSE)
split(d1$V2, d1$V1)
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "(4.88e+03,9.18+e+04]"
It is also working with the first method.
Update2
If some elements have no characters following the name from 'names_data' (such as 'F'), we can filter them out
lst <- strsplit(levels_selected, paste0("(?<=(", paste(names_data,
collapse="|"), "))"), perl = TRUE)
d2 <- as.data.frame(do.call(rbind,lst[lengths(lst)==2]), stringsAsFactors=FALSE)
split(d2$V2, d2$V1)
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "(4.88e+03,9.18+e+04]"
An option that returns a list with levels as a vector stored under each corresponding name:
> setNames(lapply(names_data, function(x) gsub(x, "", levels_selected[grepl(x, levels_selected)])), names_data)
$A
[1] "1" "3"
$B
[1] "low" "high"
$C
character(0)
$D
[1] "x"
$E
character(0)
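The regex-based answers above treat each name as a pattern. A plain prefix match is a possible alternative when the names are literal strings. This is a sketch that assumes each level starts with exactly one variable name and that no name is a prefix of another (selected is a name introduced here for illustration):

```r
names_data <- c("A", "B", "C", "D", "E", "F")
levels_selected <- c("A1", "A3", "Blow", "Bhigh", "D(4.88e+03,9.18+e+04]", "F")

selected <- lapply(setNames(names_data, names_data), function(nm) {
  # Levels that begin with this variable name (literal match, no regex)
  hits <- levels_selected[startsWith(levels_selected, nm)]
  # Strip the name and discard entries with nothing left, such as "F"
  suffix <- substring(hits, nchar(nm) + 1)
  suffix[nzchar(suffix)]
})

# Keep only the variables that have at least one selected level
selected <- Filter(length, selected)
selected
```

Because startsWith does no regular-expression interpretation, characters such as "(" or "+" in the levels need no escaping.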
So this is a handy little function I extended from the regexpr help example, using perl-style regex
parseAll <- function(data, pattern) {
result <- gregexpr(pattern, data, perl = TRUE)
do.call(rbind,lapply(seq_along(data), function(i) {
if(any(result[[i]] == -1)) return("")
st <- data.frame(attr(result[[i]], "capture.start"))
le <- data.frame(attr(result[[i]], "capture.length") - 1)
mapply(function(start,leng) substring(data[i], start, start + leng), st, le)
}))
}
EDIT: It's extended because this one finds multiple matches of the pattern, allowing you to look for, say, multiple patterns per line. So a pattern like "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)" (from the original regexpr help) finds all instances of the pattern in each string, rather than just one.
Suppose I had data that looked like this:
dat <- c('A1','A2','A3','B3')
I could then search this data with the following call to get a matrix with the levels selected:
parseAll(dat,'A(?<A>.*)|B(?<B>.*)')
A B
[1,] "1" ""
[2,] "2" ""
[3,] "3" ""
[4,] "" "3"
This also shows which selection had each level (though that may not be useful to you). The patterns can be generated programmatically from your vectors as well:
pattern <- paste(paste0(names_data,'(?<',names_data,'>.*)'),collapse = '|')
Then your selected levels are the unique elements of each column (the result is a matrix, so the conversion to a list is easy enough).
This is my omnitool for this kind of thing; I hope it's handy.

Add leading space to certain values in a data frame

I have the following data frame, and for each positive number (yes, they need to be stored as strings) I want to add a leading space.
d <- data.frame(c1 = c("4", "-1.5", "5", "-3"))
> d
c1
1 4
2 -1.5
3 5
4 -3
So far I have used grep with invert = TRUE to return only the positive numbers, to which I want to add a leading space:
d$c1[grep("-", d$c1, invert = TRUE)]
However, I am not sure how to proceed. I think I have to work with indices rather than with the actual values, and probably incorporate gsub. Is that right?
Here is an approach using formatC(). Similar results could be achieved using sprintf(). Note that I don't just add a single space; instead, this approach pads each string to a common width.
d <- data.frame(c1 = c("4", "-1.5", "5", "-3"), stringsAsFactors = FALSE)
d <- transform(d, d2 = formatC(c1, width = 4), stringsAsFactors = FALSE)
R> d
    c1   d2
1    4    4
2 -1.5 -1.5
3    5    5
4   -3   -3
R> str(d)
'data.frame':	4 obs. of  2 variables:
 $ c1: chr  "4" "-1.5" "5" "-3"
 $ d2: chr  "   4" "-1.5" "   5" "  -3"
If you don't know ahead of time what the width argument should be, compute it from d$c1:
R> with(d, max(nchar(as.character(c1))))
[1] 4
Or use it directly inline
d <- transform(d, d2 = formatC(c1, width = max(nchar(as.character(c1)))),
stringsAsFactors = FALSE)
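For comparison, here is a sketch of the two variants mentioned in this thread: adding exactly one leading space to the non-negative values, as the question asks, and padding to a common width with sprintf(). The column names spaced and padded are introduced here for illustration, and the sketch assumes that negative values always start with "-":

```r
d <- data.frame(c1 = c("4", "-1.5", "5", "-3"), stringsAsFactors = FALSE)

# Variant 1: a single leading space for values that do not start with "-"
d$spaced <- ifelse(startsWith(d$c1, "-"), d$c1, paste0(" ", d$c1))

# Variant 2: right-align every string to the width of the longest one
width <- max(nchar(d$c1))
d$padded <- sprintf(paste0("%", width, "s"), d$c1)

str(d)
```

Variant 1 keeps the strings as short as possible, while variant 2 gives every string the same width, which matches the formatC() result.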
Does this look like what you want? paste0(' ', d$c1[as.numeric(as.character(d$c1)) > 0])
The print method for data.frames has nice automated padding features. In general, the strings are padded on the left with spaces to ensure right alignment (by default). You can take advantage of this by capturing the print output. For example, using your d:
> print(d, print.gap = 0, row.names = FALSE)
  c1
   4
-1.5
   5
  -3
The argument print.gap = 0 ensures that there are no extra padding spaces in front of the longest string. row.names = FALSE prevents row names from being printed.
This case is special in a couple ways: The column name is shorter than the longest character string in the data, and the data.frame is only one column. To generalize, you could subset the data and unname it:
myChar <- unname(d[, 1, drop = FALSE])
Then, you can capture the printed object using capture.output:
> (dStr <- capture.output(print(myChar, print.gap = 0, row.names = FALSE)))
[1] "  NA" "   4" "-1.5" "   5" "  -3"
Since the column name is also printed, you could subset the object thusly:
> dStr[-1]
[1] "   4" "-1.5" "   5" "  -3"
This way, you don't have to know how long the longest character string is, and this can handle most data types, not just character.
