Check if a character is in data frame - r

I'm looking for a simple way to check whether values in an R data frame contain a comma (or any other character, for that matter).
Let's suppose I have the following data frame:
df <- data.frame(A = c("apple","orange", "banana","strawberries"),
B = c(23,12,10,15),
C = c("2,53", "1.35","0,25","1,44"))
If I know the column with commas in it I use this:
which(grepl(",",df$C))
length(which(grepl(",",df$C)))
However, I want output like the above without having to specify the column of my data frame.
Any suggestions?

You need to simply go through all three columns; sapply works here:
sapply(df, grep, pattern = ",")
##output:
# $A
# integer(0)
#
# $B
# integer(0)
#
# $C
# [1] 1 3 4
To get the length you can do this:
sapply(sapply(df, grep, pattern = ","), length)
# A B C
# 0 0 3
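A slightly more compact variant of the same idea, as a sketch: lengths() replaces the nested sapply, and fixed = TRUE (my addition, not part of the answer above) makes grep treat the pattern as a literal string, which matters when searching for characters such as "." that are special in regular expressions. The names hits and counts are illustrative only.

```r
df <- data.frame(A = c("apple", "orange", "banana", "strawberries"),
                 B = c(23, 12, 10, 15),
                 C = c("2,53", "1.35", "0,25", "1,44"))

# Row indices of the matches, one list element per column.
# fixed = TRUE searches for a literal "," instead of a regular expression.
hits <- lapply(df, function(col) grep(",", col, fixed = TRUE))

# Number of matches per column, as a named integer vector.
counts <- lengths(hits)
counts
```

Columns without any match show up with a count of 0, so the result can be filtered with counts[counts > 0] if only the affected columns are of interest.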

A solution that may be somewhat simpler to grasp: first, convert your data frame to a vector.
df2vector <- as.vector(t(df))
df2vector
# [1] "apple" "23" "2,53" "orange" "12"
# [6] "1.35" "banana" "10" "0,25" "strawberries"
# [11] "15" "1,44"
Then use your approach.
length(which(grepl(",",df2vector)))
# [1] 3
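Flattening the data frame loses track of where each match came from. If the row and column of every match are needed, one possible sketch is to build a logical matrix with grepl and pass it to which() with arr.ind = TRUE. The names is_match and pos are assumptions made for this example.

```r
df <- data.frame(A = c("apple", "orange", "banana", "strawberries"),
                 B = c(23, 12, 10, 15),
                 C = c("2,53", "1.35", "0,25", "1,44"))

# One logical column per data frame column: TRUE where the value contains ",".
is_match <- sapply(df, function(col) grepl(",", col, fixed = TRUE))

# Row/column positions of every TRUE cell.
pos <- which(is_match, arr.ind = TRUE)
pos

# Translate column positions back into column names.
colnames(is_match)[pos[, "col"]]
```

The total number of matches is then simply sum(is_match), which agrees with the count obtained from the flattened vector.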

Related

Split dataframe columns into vectors in R

I have a dataframe as such:
Number <- c(1,2,3)
Number2 <- c(10,12,14)
Letter <- c("A","B","C")
df <- data.frame(Number,Number2,Letter)
I would like to split the df into its respective three columns, each one becoming a vector with the respective column name. In essence, the output should look exactly like the original three input vectors in the above example.
I have tried the split function and also using for loop, but without success.
Any ideas? Thank you.
We may use unclass as data.frame is a list with additional attributes. By unclassing, it removes the data.frame attribute
unclass(df)
Or another option is asplit with MARGIN specified as 2
asplit(df, 2)
NOTE: Both of them return a named list. If we intend to create new objects in the global env, use list2env (not recommended though)
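One difference between the two options is worth checking before choosing: asplit first coerces the data frame to a matrix, so with mixed column types the numeric columns come back as character strings, whereas unclass keeps each column's original type. A small sketch (the names by_unclass and by_asplit are illustrative):

```r
df <- data.frame(Number = c(1, 2, 3),
                 Number2 = c(10, 12, 14),
                 Letter = c("A", "B", "C"))

by_unclass <- unclass(df)    # plain list; column types are preserved
by_asplit  <- asplit(df, 2)  # goes through as.matrix(), so everything is character here

is.numeric(by_unclass$Number)  # TRUE
is.character(by_asplit[[1]])   # TRUE: the numbers were converted to strings
```

If the vectors are meant to be used for arithmetic afterwards, unclass (or as.list) is therefore the safer choice for a data frame like this one.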
We can use c or as.list:
> c(df)
$Number
[1] 1 2 3
$Number2
[1] 10 12 14
$Letter
[1] "A" "B" "C"
> as.list(df)
$Number
[1] 1 2 3
$Number2
[1] 10 12 14
$Letter
[1] "A" "B" "C"
Assuming you are trying to create these as vectors in the global environment, use list2env:
df <- data.frame(Number = c(1, 2, 3),
Number2 = c(10, 12, 14),
Letter = c("A", "B", "C"))
list2env(df, .GlobalEnv)
## <environment: R_GlobalEnv>
ls()
## [1] "df" "Letter" "Number" "Number2"
list2env is clearly the easiest way, but if you want to do it with a for loop it can also be achieved.
The "tricky" part is to make a new vector based on the column names inside the for loop. If you just write
names(df[i]) <- input
a vector will not be created.
A workaround is to use paste to build a string containing the assignment (the new vector name and what should be in it), then evaluate that string with eval(parse(text = ...)).
Maybe not the most elegant solution, but seems to work.
for (i in colnames(df)){
vector_name <- names(df[i])
expression_to_be_evaluated <- paste(vector_name, "<- df[[i]]")
eval(parse(text=expression_to_be_evaluated))
}
> Letter
[1] A B C
Levels: A B C
> Number
[1] 1 2 3
> Number2
[1] 10 12 14
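If a loop is preferred, assign() can do the same job without building and parsing code as text. This is only a sketch of an alternative, not the approach taken in the answer above; the loop variable nm is an illustrative name.

```r
df <- data.frame(Number = c(1, 2, 3),
                 Number2 = c(10, 12, 14),
                 Letter = c("A", "B", "C"))

# Create one variable per column in the current environment,
# named after the column and holding the column's values.
for (nm in names(df)) {
  assign(nm, df[[nm]])
}

Number2
```

As with list2env, this silently overwrites any existing objects that share a name with a column, so it is best used with care.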

Storing unique values of each column (of a df) in list

It is straightforward to obtain the unique values of a column using unique. However, I am looking to do the same for multiple columns in a data frame and store the results in a list, all using base R. Importantly, it is not combinations I need but simply the unique values of each individual column. I currently have the below:
# dummy data
df = data.frame(a = LETTERS[1:4]
,b = 1:4)
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols)
{
x = unique(i)
unique_values_by_col[[i]] = x
}
The problem comes when displaying unique_values_by_col as it shows as empty. I believe the problem is i is being passed to the loop as a text not a variable.
Any help would be greatly appreciated. Thank you.
Why not avoid the for loop altogether using lapply:
lapply(df, unique)
Resulting in:
> $a
> [1] A B C D
> Levels: A B C D
> $b
> [1] 1 2 3 4
Alternatively, there is apply, which is designed to run a function over the rows or columns of a matrix-like object:
apply(df,2,unique)
result:
> apply(df,2,unique)
a b
[1,] "A" "1"
[2,] "B" "2"
[3,] "C" "3"
[4,] "D" "4"
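Note, though, that apply coerces the data frame to a matrix (here a character matrix), whereas lapply returns a list, so lapply may be the better choice if a list is what you want.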
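The difference between the two can be checked directly. The sketch below compares the return types and also shows that the shape of apply's result depends on the data: it is a matrix only when every column has the same number of unique values. The data frame df2 is a made-up example for that second case.

```r
df <- data.frame(a = LETTERS[1:4], b = 1:4)

u_list <- lapply(df, unique)    # list; each column keeps its own type
u_mat  <- apply(df, 2, unique)  # coerced via as.matrix(), so all values are character

is.list(u_list)       # TRUE
is.integer(u_list$b)  # TRUE
is.character(u_mat)   # TRUE

# Hypothetical data with a different number of unique values per column:
df2 <- data.frame(a = c("A", "A", "B"), b = 1:3)
is.list(apply(df2, 2, unique))  # TRUE: apply falls back to a list here
```

Because lapply always returns a list regardless of the data, it gives the more predictable result for this task.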
Your for loop is almost right, just needs one fix to work:
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols) {
x = unique(df[[i]])
unique_values_by_col[[i]] = x
}
unique_values_by_col
# $a
# [1] A B C D
# Levels: A B C D
#
# $b
# [1] 1 2 3 4
i is just a character string (the name of a column within df), so unique(i) simply returns that name rather than the column's values.
Anyhow, the most standard way for this task is lapply() as shown by demirev.
Could this be what you're trying to do?
Map(unique,df)
Result:
$a
[1] A B C D
Levels: A B C D
$b
[1] 1 2 3 4

Using lapply to apply function to each row in a tibble

I am trying to apply a function to each row in a tibble, mytib:
> mytib
# A tibble: 3 x 1
value
<chr>
1 1
2 2
3 3
Here is my code where I'm attempting to apply a function to each line in the tibble:
mytib = as_tibble(c("1" , "2" ,"3"))
procLine <- function(f) {
print('here')
print(f)
}
lapply(mytib , procLine)
Using lapply :
> lapply(mytib , procLine)
[1] "here"
[1] "1" "2" "3"
$value
[1] "1" "2" "3"
This output suggests the function is not invoked once per line as I expect the output to be :
here
1
here
2
here
3
How can I apply a function to each row in a tibble?
Update: I appreciate the supplied answers, which produce my expected result, but what did I do incorrectly in my implementation? Shouldn't lapply apply a function to each element?
invisible is used to avoid displaying the output. Also you have to loop through elements of the column named 'value', instead of the column as a whole.
invisible( lapply(mytib$value , procLine) )
# [1] "here"
# [1] "1"
# [1] "here"
# [1] "2"
# [1] "here"
# [1] "3"
lapply loops through columns of a data frame by default. See the example below. The values of two columns are printed as a whole in each iteration.
mydf <- data.frame(a = letters[1:3], b = 1:3, stringsAsFactors = FALSE )
invisible(lapply( mydf, print))
# [1] "a" "b" "c"
# [1] 1 2 3
To iterate through each element of a column in a data frame, you have to loop twice like below.
invisible(lapply( mydf, function(x) lapply(x, print)))
# [1] "a"
# [1] "b"
# [1] "c"
# [1] 1
# [1] 2
# [1] 3
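When the function needs a whole row rather than single cells, one common base R pattern is to iterate over the row indices instead. This is a sketch under the assumption that each row should be handled as a one-row data frame; the name row_labels and the pasted label are purely illustrative.

```r
mydf <- data.frame(a = letters[1:3], b = 1:3, stringsAsFactors = FALSE)

# Iterate over row numbers; mydf[i, ] is a one-row data frame,
# so all columns of that row are available inside the function.
row_labels <- lapply(seq_len(nrow(mydf)), function(i) {
  row <- mydf[i, ]
  paste(row$a, row$b)
})

unlist(row_labels)
# [1] "a 1" "b 2" "c 3"
```

seq_len(nrow(mydf)) is used instead of 1:nrow(mydf) so that a data frame with zero rows produces no iterations at all.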

Looping over multiple lists with base R

In python we can do this..
numbers = [1, 2, 3]
characters = ['foo', 'bar', 'baz']
for item in zip(numbers, characters):
print(item[0], item[1])
(1, 'foo')
(2, 'bar')
(3, 'baz')
We can also unpack the tuple rather than using the index.
for num, char in zip(numbers, characters):
print(num, char)
(1, 'foo')
(2, 'bar')
(3, 'baz')
How can we do the same using base R?
To do something like this in an R-native way, you'd use the idea of a data frame. A data frame has multiple variables which can be of different types, and each row is an observation of each variable.
d <- data.frame(numbers = c(1, 2, 3),
characters = c('foo', 'bar', 'baz'))
d
## numbers characters
## 1 1 foo
## 2 2 bar
## 3 3 baz
You then access each row using matrix notation, where leaving an index blank includes everything.
d[1,]
## numbers characters
## 1 1 foo
You can then loop over the rows of the data frame to do whatever you want; presumably you actually want to do something more interesting than printing.
for(i in seq_len(nrow(d))) {
print(d[i,])
}
## numbers characters
## 1 1 foo
## numbers characters
## 2 2 bar
## numbers characters
## 3 3 baz
For another option, how about mapply, which is the closest analog to zip I can think of in R. Here I'm using the c function to make a new vector, but you could use any function you'd like:
numbers<- c(1, 2, 3)
characters<- c('foo', 'bar', 'baz')
mapply(c,numbers, characters, SIMPLIFY = FALSE)
[[1]]
[1] "1" "foo"
[[2]]
[1] "2" "bar"
[[3]]
[1] "3" "baz"
Which way is of most use depends on what you want to do with your output, but as the other answers mention, a dataframe is the most natural approach in R (and pandas dataframe probably in python).
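To make the parallel with Python's tuple unpacking more explicit, mapply can also be given a function with one named argument per vector. A minimal sketch (the argument names num and char and the result name pairs are illustrative):

```r
numbers <- c(1, 2, 3)
characters <- c("foo", "bar", "baz")

# The function is called once per position, receiving the i-th element
# of each vector, much like `for num, char in zip(...)` in Python.
pairs <- mapply(function(num, char) paste(num, char), numbers, characters)
pairs
# [1] "1 foo" "2 bar" "3 baz"
```

Because every call returns a single string, mapply simplifies the result to a character vector; with SIMPLIFY = FALSE it would return a list instead, as in the example above.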
To index a vector x in R, use x[1], which returns the first element of the vector. Element numbering in R starts at 1, in contrast to Python, where it starts at 0.
For this problem it would be:
x = seq(1,10)
j = seq(11,20)
for (i in seq_along(x)){
print (c(x[i],j[i]))
}
Many functions in R are vectorized and don't require loops:
numbers = c(1, 2, 3)
characters = c('foo', 'bar', 'baz')
myList <- list(numbers, characters)
myDF <- data.frame(numbers,characters, stringsAsFactors = FALSE)
print(myList)
print(myDF)
This is the conceptual equivalent:
for (item in Map(list,numbers,characters)){ # though most of the time you would actually do all your work inside Map
print(item[c(1,2)])
}
# [[1]]
# [1] 1
#
# [[2]]
# [1] "foo"
#
# [[1]]
# [1] 2
#
# [[2]]
# [1] "bar"
#
# [[1]]
# [1] 3
#
# [[2]]
# [1] "baz"
Though most of the time you would actually do all your work inside Map and do something like this:
Map(function(nu,ch){print(data.frame(nu,ch))},numbers,characters)
This is the closest I could get to a clone:
zip <- function(...){ Map(list,...)}
print2 <- function(...){do.call(cat,c(list(...),"\n"))}
for (item in zip(numbers,characters)){
print2(item[[1]],item[[2]])
}
# 1 foo
# 2 bar
# 3 baz
To be able to refer to items by their names (indices still work):
zip <- function(...){
names <- sapply(substitute(list(...))[-1],deparse)
Map(function(...){setNames(list(...),names)}, ...)
}
for (item in zip(numbers,characters)){
print2(item[["numbers"]],item[["characters"]])
}
The tidyverse solution would be to use the purrr::map2 function. For example:
library(purrr)
numbers <- c(1, 2, 3)
characters <- c('foo', 'bar', 'baz')
map2(numbers, characters, ~paste0(.x, ',', .y))
#[[1]]
#[1] "1,foo"
#[[2]]
#[1] "2,bar"
#[[3]]
#[1] "3,baz"
See the purrr documentation for map2 for details.
Another scalable alternative: store the vectors in a list and iterate over their indices.
vect1 <- c('foo', 'bar', 'baz')
vect2 <- c('a', 'b', 'c')
idx_list <- list(vect1, vect2)
idx_vect <- seq_along(idx_list[[1]])
for(i in idx_vect){
x <- idx_list[[1]][i]
j <- idx_list[[2]][i]
print(c(i, x, j))
}

Find indices of vector elements in a list

I have this toy character vector:
a = c("a","b","c","d","e","d,e","f")
in which some elements are concatenated with a comma (e.g. "d,e")
and a list that contains the unique elements of that vector, where in case of comma concatenated elements I do not keep their individual components.
So this is the list:
l = list("a","b","c","d,e","f")
I am looking for an efficient way to obtain the indices of the elements of a in the list l. For elements of a that are represented by comma-concatenated elements in l, it should return the indices of those comma-concatenated elements in l.
So the output of this function would be:
c(1,2,3,4,4,4,5)
As you can see it returns index 4 for a elements: "d", "e", and "d,e"
I would make your search vector into a set of regular expressions, by substituting the comma with a pipe. Add names to the search vector too, according to its position in the list.
L <- setNames(lapply(l, gsub, pattern = ",", replacement = "|"), seq_along(l))
Then you can do:
lapply(L, function(x) grep(x, a, value = TRUE))
# $`1`
# [1] "a"
#
# $`2`
# [1] "b"
#
# $`3`
# [1] "c"
#
# $`4`
# [1] "d" "e" "d,e"
#
# $`5`
# [1] "f"
The names are important, because you can now use stack to get what you are looking for.
stack(lapply(L, function(x) grep(x, a, value = TRUE)))
# values ind
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
# 5 e 4
# 6 d,e 4
# 7 f 5
You could use a strategy with factors. First, find the index for each element in your list with
l <- list("a","b","c","d,e","f")
idxtr <- Map(function(x) unique(c(x, strsplit(x, ",")[[1]])), unlist(l))
This builds a list with one entry per item in l, containing all possible matches for that item. Then we take the vector a, create a factor with those levels, and reassign the levels based on the list we just built:
a <- c("a","b","c","d","e","d,e","f")
a <- factor(a, levels=unlist(idxtr));
levels(a) <- idxtr
as.numeric(a)
# [1] 1 2 3 4 4 4 5
Finally, to get the index, we use as.numeric on the factor.
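The same mapping can also be written as a plain lookup table with match, which avoids both regular expressions and factors. This is a sketch under the assumption that every element of a is either an element of l or one of its comma-separated components; the names targets, parts, keys, vals and idx are illustrative.

```r
a <- c("a", "b", "c", "d", "e", "d,e", "f")
l <- list("a", "b", "c", "d,e", "f")

targets <- unlist(l)
parts <- strsplit(targets, ",", fixed = TRUE)  # components of each list element

# Every whole element and every component is a key;
# the value is the position of the owning element in l.
keys <- c(targets, unlist(parts))
vals <- c(seq_along(targets), rep(seq_along(parts), lengths(parts)))

idx <- vals[match(a, keys)]
idx
# [1] 1 2 3 4 4 4 5
```

Elements of a that appear nowhere in l would come back as NA, which makes unexpected values easy to spot.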
