extract column based on row values - r

Following Data Set:
df <- read.table(text = " a b c
X Y Z", header = T)
The command
df[, sapply (df[1, ], as.character) %in% c("Y", "X")]
returns
a b
1 X Y
but the command
df[, sapply (df[1, ], as.character) %in% c("Y")]
returns
[1] Y
Levels: Y
and not
b
1 Y
Any idea why? And how can I retrieve the correct column name?

Likely a duplicate...but to provide an answer...
You've fallen into one of the major traps of the R Inferno, 8.1.44: "By default dimensions of arrays are dropped when subscripting makes the dimension length 1. Subscripting with drop=FALSE overrides the default."
So, this will return what you expect:
df[, sapply (df[1, ], as.character) %in% c("Y"), drop = FALSE]
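A minimal sketch of the difference, reusing the df from the question. The names keep, dropped and kept are only illustrative:

```r
df <- read.table(text = " a b c
X Y Z", header = TRUE)

# Logical index that is TRUE only for the column whose first-row value is "Y"
keep <- sapply(df[1, ], as.character) %in% "Y"

dropped <- df[, keep]                # a single column is simplified to a vector
kept    <- df[, keep, drop = FALSE]  # stays a one-column data frame

is.data.frame(dropped)  # FALSE
is.data.frame(kept)     # TRUE
names(kept)             # "b"
```

List-style indexing with a single index, df[keep], also returns a data frame, so it is another way to keep the column name.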

Related

Selecting multiple columns using Regular Expressions

I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select the variables whose names start with r5 through r12 and create a data frame from them in R.
The best code that I could write to get this done is,
data %>% select(grep("r[5-9][^0-9]" , names(data), value = TRUE ),
grep("r1[0-2]", names(data), value = TRUE))
Given that my experience with regular expressions spans a single day, I was wondering if anyone could help me write better and more compact code for this!
Here's a regex that gets all the columns at once:
data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))
The vertical bar represents an 'or'.
As the comments have pointed out, this will fail for items such as r51, and can also be shortened. Instead, you will need a slightly longer regex:
data %>% select(matches("r([5-9]|1[0-2])([^0-9]|$)"))
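The two patterns can be compared in base R with grepl(). This is only a sketch: the name "r51x" is a made-up addition to show the false match, and the leading ^ anchor in the strict pattern is an extra assumption that is not part of the answer's regex:

```r
# Hypothetical column names; "r51x" is added to show the false match
nms <- c("r1a", "r3c", "r5e", "r7g", "r9i", "r11k", "r13g", "r15i", "r51x")

loose  <- "r([5-9]|1[0-2])"             # also matches the "r5" in "r51x"
strict <- "^r([5-9]|1[0-2])([^0-9]|$)"  # the number must end after 5-9 or 10-12

loose_hits  <- nms[grepl(loose, nms)]   # includes "r51x"
strict_hits <- nms[grepl(strict, nms)]  # "r5e" "r7g" "r9i" "r11k"
```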
Suppose that in the code below x represents your names(data). Then the following will do what you want.
# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
EDIT.
You can generalize the code above into a function. It takes three arguments: the first is the vector of variable names, and the second and third are the lower and upper limits of the numbers you want to keep.
var_names <- function(x, from = 1, to = Inf){
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.integer(y[sapply(y, `!=`, "")])
x[from <= y & y <= to]
}
var_names(x, 5)
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
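The question asks for r5 through r12, so both limits are needed. A sketch of the same function called with an upper bound, assuming every name contains exactly one run of digits:

```r
x <- c("r1a", "r3c", "r5e", "r7g", "r9i", "r11k", "r13g", "r15i")

var_names <- function(x, from = 1, to = Inf) {
  y <- unlist(strsplit(x, "[[:alpha:]]"))  # split on letters, leaving the digits
  y <- as.integer(y[y != ""])              # drop the empty pieces
  x[from <= y & y <= to]
}

in_range <- var_names(x, from = 5, to = 12)
in_range
#[1] "r5e"  "r7g"  "r9i"  "r11k"
```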
Remove the non-digits, scan the remainder in and check whether each is in 5:12 :
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data
DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
Using magrittr it could also be written like this:
library(magrittr)
DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6

R: Apply a function over only character columns without type coercion

I have a data frame with many columns. The columns differ in their types: some are numeric, some are character, etc. Here's a small example where we just have 3 variables with 2 types:
# Generate data
dat <- data.frame(x = c("1","2","3"),
y = c(1.0,2.5,3.3),
z = c(1,2,3),
stringsAsFactors = FALSE)
I want to replace the value 3 with a space, but only for character columns. Here's my current code:
out <- as.data.frame(lapply(dat, function(x) {
ifelse(is.character(x),
gsub("3", " ", x),
x)}),
stringsAsFactors = FALSE)
The problem is that the ifelse() function ignores that y and z are numeric and that it also seems to coerce the numeric variables to character anyway.
One idea has been to pull out the character columns, gsub() them, then bind them back to the original data frame. This, however, changes the ordering of the columns. Key to any solution is that I should not need to specify variables by name, only by type.
One can also do this trivially using dplyr:
# Load package
library(dplyr)
# Create data
dat <- data.frame(x = c("1","2","3"),
y = c(1.0,2.5,3.3),
z = c(1,2,3),
stringsAsFactors = FALSE)
# Replace 3's with spaces for character columns
dat <- dat %>% mutate_if(is.character, function(x) gsub(pattern = "3", " ", x))
I tried your code, and for me ifelse did not work, but separating if and else does. Below is the code that works:
# Generate data
dat <- data.frame(x = c("1","2","3"),
y = c(1.0,2.5,3.3),
z = c(1,2,3),
stringsAsFactors = FALSE)
> lapply(dat, function(x) { if(is.character(x)) gsub("3", " ", x) else x })
$x
[1] "1" "2" " "
$y
[1] 1.0 2.5 3.3
$z
[1] 1 2 3
> as.data.frame(lapply(dat, function(x) { if(is.character(x)) gsub("3", " ", x) else x }))
  x   y z
1 1 1.0 1
2 2 2.5 2
3   3.3 3
It comes down to this line in ?ifelse
ifelse returns a value with the same shape as test ...
is.character(x) has length one, so the value returned by ifelse() also has length 1. Use if (...) yes else no instead, as #Heikki has suggested.
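A small sketch of that behaviour on a single character vector. The names short and full are only illustrative:

```r
x <- c("1", "2", "3")

# The test has length 1, so ifelse() returns only the first element of `yes`
short <- ifelse(is.character(x), gsub("3", " ", x), x)

# if/else evaluates the condition once and returns the whole branch
full <- if (is.character(x)) gsub("3", " ", x) else x

length(short)  # 1
full           # "1" "2" " "
```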
Similar to #user3614648's solution:
library(dplyr)
dat %>%
mutate_if(is.character, funs(ifelse(. == "3", " ", .)))
  x   y z
1 1 1.0 1
2 2 2.5 2
3   3.3 3
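For comparison, a base R sketch that needs no packages and leaves the column order untouched. It selects columns by type with vapply() and assigns the modified columns back in place; is_chr is an illustrative name:

```r
dat <- data.frame(x = c("1", "2", "3"),
                  y = c(1.0, 2.5, 3.3),
                  z = c(1, 2, 3),
                  stringsAsFactors = FALSE)

# TRUE for character columns, FALSE otherwise
is_chr <- vapply(dat, is.character, logical(1))

# Replace "3" with a space in the character columns only
dat[is_chr] <- lapply(dat[is_chr], function(col) gsub("3", " ", col, fixed = TRUE))

str(dat)  # x is still character, y and z are still numeric
```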

r: subsetting with square brackets not working

I made a data frame called x:
a b
1 2
3 NA
3 32
21 7
12 8
When I run
y <- x["a">2,]
The object y returned is identical to x. If I run
y <- x["a" == 1,]
y is an empty frame.
I made sure that the names of the x data frame have no white space (I named them myself with names()) and also that a and b are numeric.
PS: If I try
y <- x["a">2]
y is also returned as identical to x.
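You're making an error in referencing the column of your data.frame x.
"a">2 compares the character string "a" with the number 2, not the column a of data.frame x. R coerces 2 to the string "2", the comparison yields a single TRUE, and that value is recycled over every row, which is why y is identical to x. Likewise, "a" == 1 is a single FALSE, so no rows are selected. You need to use either x$a or x["a"] to reference your data.frame column.
Try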
y <- x[x$a >2 ,]
or
y <- x[x["a"] >2 ,]
or, more clearly,
ix <- x["a"] > 2
y <- x[ix,]
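The coercion described above can be checked directly. A short sketch, with the data frame rebuilt from the values shown in the question:

```r
x <- data.frame(a = c(1, 3, 3, 21, 12), b = c(2, NA, 32, 7, 8))

"a" > 2    # TRUE: 2 is coerced to "2" and the two strings are compared
"a" == 1   # FALSE: "a" is not equal to "1"

nrow(x["a" > 2, ])   # 5, a single TRUE is recycled over all rows
nrow(x["a" == 1, ])  # 0, a single FALSE selects no rows
nrow(x[x$a > 2, ])   # 4, the comparison now uses the column values
```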
A simple alternative would be using data.table
library(data.table)
setDT(x)
y <- x[ a > 2, ]
y <- x[ a == 1, ]

R subset vector when treated as strings

I have a large data frame where I've forced my vectors into strings (using lapply and toString) so they fit into a data frame, and now I can't check whether one column is a subset of the other. Is there a simple way to do this?
X <- data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
X
y z
1 ABC ABC
2 A A,B,C
all(X$y %in% X$z)
[1] FALSE
(X$y[1] %in% X$z[1])
[1] TRUE
(X$y[2] %in% X$z[2])
[1] FALSE
I need to treat each y and z string value as a vector (comma separated) again and then check if y is a subset of z.
In the above case, A is a subset of A,B,C. However, because I've treated both as strings, it doesn't work.
In the example above, y holds a single value per row and z holds either 1 or 3. The data frame I'll be testing has 10,000 rows; y will have 1-5 values per row and z 1-100 per row. It looks like the 1-5 values are always a subset of z, but I'd like to check.
df = data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
apply(df, 1, function(x) { # perform row-wise operations
  y = unlist(strsplit(x[1], ",")) # split X$y in case it contains ","
  z = y %in% unlist(strsplit(x[2], ",")) # which elements of 'X$y' are present in 'X$z'
  all(z) # TRUE only if every element of y is present in z
})
#[1] TRUE TRUE
# Case 2: changed the data. You will have to define if you want perfect subset or not. Accordingly we can update the code
df = data.frame(y=c("ABC","A,B,D"), z=c("ABC","A,B,C"))
#[1] TRUE FALSE
I think it might work better for you not to use your lapply and toString combination, but store the lists in your data frame. For this purpose, I find the tbl_df (as found in the tibble package) more friendly, although I believe data.table objects can do this as well (someone correct me if I'm wrong)
library(tibble)
y_char <- list("ABC", "A")
z_char <- list("ABC", c("A", "B", "C"))
X <- tibble(y = y_char, # tibble() replaces the deprecated data_frame()
z = z_char)
Notice that when you print X now, your entries in each row of the tibble are entries from the list. Now we can use mapply to do pairwise comparison.
# All y in z
mapply(function(x, y) all(x %in% y),
X$y,
X$z)
# All z in y
mapply(function(x, y) all(y %in% x),
X$y,
X$z)
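If the data are already stored as comma-separated strings, the same mapply() comparison can be applied after splitting them again. This is a sketch under assumptions: the third row is made up to show a failing case, and the pattern ",\\s*" is chosen so that both "A,B,C" and the "A, B, C" form produced by toString() are split the same way:

```r
X <- data.frame(y = c("ABC", "A", "A,B,D"),
                z = c("ABC", "A,B,C", "A, B, C"),
                stringsAsFactors = FALSE)

# Turn each string back into a character vector of its elements
y_sets <- strsplit(X$y, ",\\s*")
z_sets <- strsplit(X$z, ",\\s*")

# Row by row: is every element of y also in z?
X$y_in_z <- mapply(function(y, z) all(y %in% z), y_sets, z_sets)
X$y_in_z
#[1]  TRUE  TRUE FALSE
```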

How to replace existing values by new values from look-up list without causing NA?

I have a data frame. One column contains the following values:
df$current_column <- c("A", "B", "C", "D", "E")
A named vector contains the lookup values:
v <- c(A = "X", B = "Y")
I want to replace the values in this column so that it becomes (X, Y, C, D, E).
I am thinking of creating a new column like this:
df$new_column <- v[df$current_column]
It replaces A and B, but it also turns C, D and E into NA (X, Y, NA, NA, NA).
How can I keep C, D and E, or is there another way to do this?
Looks like ifelse() could help:
df$current_column <- ifelse(df$current_column == "A", "X",
ifelse(df$current_column == "B", "Y", df$current_column))
We can create a logical index with %in% and then do the conversion
i1 <- df$current_column %in% names(v)
df$new_column <- df$current_column
df$new_column[i1] <- v[df$new_column[i1]]
df$new_column
#[1] "X" "Y" "C" "D" "E"
Or use a single ifelse
with(df, ifelse(current_column %in% names(v),
v[current_column], current_column))
Update
If the 'current_column' is factor class, convert to character class and it should work.
df$new_column <- as.character(df$current_column)
df$new_column[i1] <- v[df$new_column[i1]]
data
df <- data.frame(current_column = LETTERS[1:5],
stringsAsFactors=FALSE)
v <- setNames(c('X', 'Y'), LETTERS[1:2])
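The NA values in the question arise because indexing a named vector by a name it does not contain returns NA. A sketch of that, followed by a variant of the same idea that uses match() to find the lookup positions; idx is an illustrative name:

```r
df <- data.frame(current_column = LETTERS[1:5], stringsAsFactors = FALSE)
v  <- c(A = "X", B = "Y")

unname(v[df$current_column])  # "X" "Y" NA NA NA, no entry is named C, D or E

# Position of each value in the lookup names, NA where there is no entry
idx <- match(df$current_column, names(v))

# Keep the original value wherever the lookup has nothing to offer
df$new_column <- ifelse(is.na(idx), df$current_column, unname(v)[idx])
df$new_column
#[1] "X" "Y" "C" "D" "E"
```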
user2029709,
-- was working off of your little example; for a more generic approach it would be nice to see a snippet of the real data or close simulation. In any case, here is something that may work for you better, without coding manually all ifelse() options, and is still a relatively straightforward solution:
vd <- data.frame(current_column = names(v), new_column = v, stringsAsFactors = FALSE)
df <- merge(df, vd, by = 'current_column', all.x = TRUE)
df$new_column <- ifelse(is.na(df$new_column), df$current_column, df$new_column) # fall back to the original value only where no match was found
You may have to modify data types when creating vd data.frame to assure proper merge.
Best,
oleg
