Numbers as factor after read.delim - r

I have a data frame that looks like this:
A B C D
1 2 3 4
E F G H
5 6 7 8
I would like to subset only the numeric portion using the following code:
sub_num = DF[sapply(DF, is.numeric)]
The problem is that the numbers are factors after reading the data.frame using read.delim. If I set stringsAsFactors = FALSE the numbers are characters.
This may be a basic problem but I'm not able to solve it.

Try the following instead
sub_num <- DF[!is.na(as.numeric(sapply(DF, as.character)))[1:ncol(DF)], ]
# V1 V2 V3 V4
# 2 1 2 3 4
# 4 5 6 7 8
As for your sapply statement, sapply(DF, is.numeric), in order to work correctly, it would need as.character
sapply(DF, function(X) is.numeric(as.character(X)))
But that would not index your DF as you would expect

Related

R multiple regular expressions, dataframe column names

I have a dataframe data with a lot of columns in the form of
...v1...min ...v1...max ...v2...min ...v2...max
1 a a a a
2 b b b b
3 c c c c
where in place ... there could be any expression.
I would like to create a function createData that takes three arguments:
X: a dataframe,
cols: a vector containing first part of the column, so i.e. c("v1", "v2")
fun: a vector containing second part of the column, so i.e. c("min"), or c("max", "min")
and returns filtered dataframe, so - for example:
createData(X, c("v1"), None) would return this kind of dataframe:
...v1...min ...v1...max
1 a a
2 b b
3 c c
while createData(X, c("v1", "v2"), c("min")) would give me
...v1...min ...v2...min
1 a a
2 b b
3 c c
At this point I decided I need to use i.e. select(contains()) from dplyr package.
createData <- function(data, fun, cols)
{
X %>% select(contains())
return(X)
}
What I struggle with is:
how to filter columns that consist two (or maybe more?) strings, i.e. both var1 and min? I tried going with data[grepl(".*(v1*min|min*v1).*", colnames(data), ignore.case=TRUE)] but it doesn't seem to work and also my expressions aren't fixed - they depend on the vector I pass,
how to filter multiple columns with different names, i.e. c("v1", "v2"), passed in a vector? and how to combine it with the first question?
I don't really need to stick with dplyr package, it was just for the sake of the example. Thanks!
EDIT:
An reproducible example:
data = data.frame(AXv1c2min = c(1,2,3),
subv1trwmax = c(4,5,6),
ss25v2xxmin = c(7,8,9),
cwfv2urttmmax = c(10,11,12))
If you pass a vector to contains, it will function like an OR tag, while multiple select statements will have additive effects. So for your esample data:
We can filter for (v1 OR v2) AND min like this:
library(tidyverse)
data %>%
select(contains(c('v1','v2'))) %>%
select(contains('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
So as a function where either argument is optional:
createData <- function(data, fun=NULL, cols=NULL) {
if (!is.null(fun)) data <- select(data, contains(fun))
if (!is.null(cols)) data <- select(data, contains(cols))
return(data)
}
A series of examples:
createData(data, cols=c('v1', 'v2'), fun='min')
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, fun=c('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'), fun=c('min', 'max'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, cols=c('v1'), fun=c('max'))
subv1trwmax
1 4
2 5
3 6

Add values to a vector to make a consecutive vector in R

I have several vectors that look like this:
v1 <- c(1,2,4)
v2 <- c(3,5,8)
v3 <- c(4)
This is just a small sample of them. I'm trying to figure out a way to add values to each of them to make them all consecutive vectors. So that at the end, they look like this:
v1 <- c(1,2,3,4)
v2 <- c(1,2,3,4,5,6,7,8)
v3 <- c(1,2,3,4)
So "3" is added to the first vector, "1","2","4","6","7" is added to the second and so forth. I have several hundred vectors that look like this so I'm trying to figure out a solution that would scale/be automated.
You can use seq and max
seq(max(v1))
For multiple vectors, we can loop
lapply(mget(paste0('v',1:3)), function(x) seq(max(x)))
#$v1
#[1] 1 2 3 4
#$v2
#[1] 1 2 3 4 5 6 7 8
#$v3
#[1] 1 2 3 4

Complete.cases used on list of data frames

I'm trying to remove all the NA values from a list of data frames. The only way I have got it to work is by cleaning the data with complete.cases in a for loop. Is there another way of doing this with lapply as I had been trying for a while to no avail. Here is the code that works.
I start with
data_in <- lapply (file_name,read.csv)
Then have:
clean_data <- list()
for (i in seq_along(id)) {
clean_data[[i]] <- data_in[[i]][complete.cases(data_in[[i]]), ]
}
But what I tried to get to work was using lapply all the way like this.
comp <- lapply(data_in, complete.cases)
clean_data <- lapply(data_in, data_in[[id]][comp,])
Which returns this error "Error in [.default(xj, i) : invalid subscript type 'list' "
What I'd like to know is some alternatives or if I was going about this right. And why didn't the last example not work?
Thank you so much for your time. Have a nice day.
I'm not sure what you expected with
clean_data <- lapply(data_in, data_in[[id]][comp,])
The second parameter to lapply should be a proper function to which each member of the data_in list will be passed one at a time. Your expression data_in[[id]][comp,] is not a function. I'm not sure where you expected id to come from, but lapply does not create magic variables for you like that. Also, at this point comp is now a list itself of indices. You are making no attempt to iterate over this list in sync with your data_in list. If you wanted to do it in two separate steps, a more appropriate approach would be
comp <- lapply(data_in, complete.cases)
clean_data <- Map(function(d,c) {d[c,]}, data_in, comp)
Here we use Map to iterate over the data_in and comp lists simultaneously. They each get passed in to the function as a parameter and we can do the proper extraction that way. Otherwise, if we wanted to do it in one step, we could do
clean_data <- lapply(data_in, function(x) x[complete.cases(x),])
welcome to SO, please provide some working code next time
here is how i would do it with na.omit (since complete.cases only returns a logical)
(dat.l <- list(dat1 = data.frame(x = 1:2, y = c(1, NA)),
dat2 = data.frame(x = 1:3, y = c(1, NA, 3))))
# $dat1
# x y
# 1 1 1
# 2 2 NA
#
# $dat2
# x y
# 1 1 1
# 2 2 NA
# 3 3 3
Map(na.omit, dat.l)
# $dat1
# x y
# 1 1 1
#
# $dat2
# x y
# 1 1 1
# 3 3 3
Do you mean like the below?
> lst
$a
a
1 1
2 2
3 NA
4 3
5 4
$b
b
1 1
2 NA
3 2
4 3
5 4
$d
d e
1 NA 1
2 NA 2
3 3 3
4 4 NA
5 5 NA
> f <- function(x) x[complete.cases(x),]
> lapply(lst, f)
$a
[1] 1 2 3 4
$b
[1] 1 2 3 4
$d
d e
3 3 3
file_name[complete.cases(file_name), ]
complete.cases() returns only a logical value. This should do the job and returns only the rows with no NA values.

match values in dataframes with values in a column

I have two data.frames that looks like these ones:
>df1
V1
a
b
c
d
e
>df2
V1 V2
1 a,k,l
2 c,m,n
3 z,b,s
4 l,m,e
5 t,r,d
I would like to match the values in df1$V1 with those from df2$V2and add a new column to df1 that corresponds to the matching and to the value of df2$V1, the desire output would be:
>df1
V1 V2
a 1
b 3
c 2
d 5
e 4
I've tried this approach but only works if df2$V2 contains just one element:
match(as.character(df1[,1]), strsplit(as.character(df2[,2], ",")) -> idx
df1$V2 <- df2[idx,1]
Many thanks
You can just use grep, which will return the position of the string found:
sapply(df1$V1, grep, x = df2$V2)
# a b c d e
# 1 3 2 5 4
If you expect repeats, you can use paste.
Let's modify your data so that there is a repeat:
df2$V2[3] <- "z,b,s,a"
And modify the solution accordingly:
sapply(df1$V1, function(z) paste(grep(z, x = df2$V2), collapse = ";"))
# a b c d e
# "1;3" "3" "2" "5" "4"
Similar to Tyler's answer, but in base using stack:
df.stack <- stack(setNames(strsplit(as.character(df2$V2), ","), df2$V1))
transform(df1, V2=df.stack$ind[match(V1, df.stack$values)])
produces:
V1 V2
1 a 1
2 b 3
3 c 2
4 d 5
5 e 4
One advantage of splitting over grep is that with grep you run the risk of searching for a and matching things like alabama, etc. (though you can be careful with the patterns to mitigate this (i.e. include word boundaries, etc.).
Note this will only find the first matching value.
Here's an approach:
library(qdap)
key <- setNames(strsplit(as.character(df2$V2), ","), df2$V1)
df1$V2 <- as.numeric(df1$V1 %l% key)
df1
## V1 V2
## 1 a 1
## 2 b 3
## 3 c 2
## 4 d 5
## 5 e 4
First we used strsplit to create a named list. Then we used qdap's lookup operator %l% to match values and create a new column (I converted to numeric though this may not be necessary).

Replace values in selected columns by passing column name of data.frame into apply() or plyr function

Suppose I have a date.frame like:
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
a b c
1 1 4 5
2 2 3 4
3 3 5 3
4 4 2 2
5 5 1 1
and I need to replace all the 5 as NA in column b & c then return to df:
df
a b c
1 1 4 NA
2 2 3 4
3 3 NA 3
4 4 2 2
5 5 1 1
But I want to do a generic apply() function instead of using replace() each by each because there are actually many variables need to be replaced in the real data. Suppose I've defined a variable list:
var <- c("b", "c")
and come up with something like:
df <- within(df, sapply(var, function(x) x <- replace(x, x==5, NA)))
but nothing happens. I was thinking if there is a way to work this out with something similar to the above by passing a variable list of column names from a data.frame into a generic apply / plyr function (or maybe some other completely different ways). Thanks~
You could just do
df[,var][df[,var] == 5] <- NA
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
var <- c("b","c")
df[,var] <- sapply(df[,var],function(x) ifelse(x==5,NA,x))
df
I find the ifelse notation easier to understand here, but most Rers would probably use indexing instead.

Resources