Error when unlisting columns in a data frame - r

Suppose I have a data frame called DF:
options(stringsAsFactors = F)
letters <- list("A", "B", "C", "D")
numbers <- list(list(1,2), 1, 1, 2)
score <- list(.44, .54, .21, .102)
DF <- data.frame(cbind(letters, numbers, score))
Note that all columns in the data frame are of class "list".
Also, take a look at the structure: DF$numbers[1] is also a list
I'm trying to UNLIST each column.
DF$letters <- unlist(DF$letters)
DF$score <- unlist(DF$score)
DF$numbers <- unlist(DF$numbers)
However, because DF$numbers[1] is itself a list, I get this error:
Error in `$<-.data.frame`(`*tmp*`, numbers, value = c(1, 2, 1, 1, 2)) :
replacement has 5 rows, data has 4
Is there a way to unlist the whole column while keeping multi-value cells such as DF$numbers[1] as a single value, like c(1,2) or 1,2?
Ideally I would like DF to look something like this, where the individual values in the number column are still of type int:
letters numbers score
A 1,2 .44
B 1 .54
C 1 .21
D 2 .102
The goal is to then write the data frame to a csv file.

You can apply unlist to each individual element of the column numbers instead of the whole column:
DF$numbers <- lapply(DF$numbers, unlist)
DF
# letters numbers score
#1 A 1, 2 0.440
#2 B 1 0.540
#3 C 1 0.210
#4 D 2 0.102
DF$numbers[1]
#[[1]]
#[1] 1 2
Or paste the elements as a single string if you want an atomic vector column:
DF$numbers <- sapply(DF$numbers, toString)
DF
# letters numbers score
#1 A 1, 2 0.44
#2 B 1 0.54
#3 C 1 0.21
#4 D 2 0.102
DF$numbers[1]
#[1] "1, 2"
class(DF$numbers)
# [1] "character"
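Since the stated goal is a CSV file, the string-collapsing approach can be carried through to the write step. The following is a minimal sketch, assuming the column names from the question (letters, numbers, score); the helper flatten_cell and the temporary file are illustrative names and not part of the original answer.

```r
# Rebuild the question's data frame; every column starts out as a list.
DF <- data.frame(cbind(
  letters = list("A", "B", "C", "D"),
  numbers = list(list(1, 2), 1, 1, 2),
  score   = list(.44, .54, .21, .102)
))

# Hypothetical helper: flatten one cell and join its values with commas.
flatten_cell <- function(x) paste(unlist(x), collapse = ",")

# Single-valued columns can be unlisted directly.
DF$letters <- unlist(DF$letters)
DF$score   <- unlist(DF$score)

# The nested column is collapsed cell by cell, so it keeps one entry per row.
DF$numbers <- vapply(DF$numbers, flatten_cell, character(1))

# Write to a temporary CSV file and read it back to check the round trip.
out <- tempfile(fileext = ".csv")
write.csv(DF, out, row.names = FALSE)
back <- read.csv(out, stringsAsFactors = FALSE)
```

Using vapply with character(1) rather than sapply makes the call fail loudly if any cell does not collapse to exactly one string, which is a reasonable safeguard before exporting.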

You can do:
DF$letters <- unlist(DF$letters)
DF$score <- unlist(DF$score)
# Flatten each cell first so that a nested list(1, 2) deparses as c(1, 2)
DF$numbers <- as.character(lapply(DF$numbers, unlist))
This returns:
DF
letters numbers score
1 A c(1, 2) 0.440
2 B 1 0.540
3 C 1 0.210
4 D 2 0.102

Related

How to fill in R data.frame with named vectors of different lengths?

I need to fill in R data.frame (or data.table) using named vectors as rows. The problem is that named vectors to be used as rows usually do not have all the variables. In other words, usually named vector has smaller length than the number of columns. Names of variables in the vectors coincide with column names of the dataframe:
df <- data.frame(matrix(NA, 2, 3))
colnames(df) <- c("A", "B", "C")
obs1 <- c(A=2, B=4)
obs2 <- c(A=3, C=10)
I want df as follows:
> df
A B C
1 2 4 NA
2 3 NA 10
So I want to fill in the first two rows with obs1 and obs2 respectively. When I try to do it, I get an error:
> df[1,] <- obs1
Error in `[<-.data.frame`(`*tmp*`, 1, , value = c(A = 2, B = 4)) :
replacement has 2 items, need 3
I suspect that similar question was already asked, but I could not find it. Does anybody know how to do it using data.frame or data.table?
We also need to select the columns, based on the names of 'obs1' and 'obs2':
df[1, names(obs1)] <- obs1
df[2, names(obs2)] <- obs2
Output:
> df
A B C
1 2 4 NA
2 3 NA 10
When we do df[1,], it returns the first row with all the columns, i.e. its length is 3, whereas 'obs1' and 'obs2' each have a length of only 2, hence the length error.
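The same name-based indexing extends to any number of observations. Here is a minimal sketch, assuming the observations are collected in a list (the name obs is introduced only for illustration):

```r
# Empty template with the three columns from the question.
df <- data.frame(matrix(NA_real_, 2, 3))
colnames(df) <- c("A", "B", "C")

# Hypothetical container holding the named vectors, one per row.
obs <- list(c(A = 2, B = 4), c(A = 3, C = 10))

# Fill each row, touching only the columns named in that observation.
for (i in seq_along(obs)) {
  df[i, names(obs[[i]])] <- obs[[i]]
}
df
```

Columns that an observation does not mention are never assigned, so they keep their initial NA.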
Also, creating a template dataset to fill is not really needed, as we can use bind_rows, which automatically fills in NA for the columns that are not present:
library(dplyr)
bind_rows(obs1, obs2)
# A tibble: 2 x 3
A B C
<dbl> <dbl> <dbl>
1 2 4 NA
2 3 NA 10
A solution with data.table:
library(data.table)
obs1 <- data.table(t(obs1))
obs2 <- data.table(t(obs2))
df <- rbindlist(list(obs1, obs2), fill = TRUE)
df
Output:
A B C
<dbl> <dbl> <dbl>
1 2 4 NA
2 3 NA 10

Formatting strings in a character vector of a data frame

Suppose I have a data frame (let's call it DF) that looks like this:
options(stringsAsFactors = F)
letters <- c("A", "B", "C", "D", "E")
value <- c(.44, .54, .21, .102, .002)
test <- c("2", "c(1,4)", "1", "3:4", "c(1,2)")
DF <- data.frame(cbind(letters, value, test))
DF$value <- as.numeric(DF$value)
This is what DF looks like if you were to print it:
#DF
# letters value test
#1 A 0.440 2
#2 B 0.540 c(1,4)
#3 C 0.210 1
#4 D 0.102 3:4
#5 E 0.002 c(1,2)
My main issue is DF$test. For any cell that has more than one value (e.g. 3:4, c(1,2)), I would like the cell to have the format X:Y, where X and Y are numeric values.
Can someone help? Please note that DF$test is a character vector.
Another option that uses two nested gsub calls:
DF$test2 <- gsub(",",":", gsub(".*c\\((.*)\\).*", "\\1", DF$test))
DF
# letters value test test2
#1 A 0.440 2 2
#2 B 0.540 c(1,4) 1:4
#3 C 0.210 1 1
#4 D 0.102 3:4 3:4
#5 E 0.002 c(1,2) 1:2
The first gsub extracts everything between the c( and ) and the second gsub replaces any , with :. This would work if you had > 2 numbers in your c(). I.e. c(1,2,3) would become 1:2:3.
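The two-step substitution can be wrapped in a small function to check the behaviour described above. This is only a sketch; the function name to_colon and the sample input are made up for illustration.

```r
# Hypothetical wrapper around the two nested gsub calls.
to_colon <- function(x) {
  inner <- gsub(".*c\\((.*)\\).*", "\\1", x)  # keep what is inside c( ... )
  gsub(",", ":", inner)                        # turn commas into colons
}

sample_input <- c("2", "c(1,4)", "3:4", "c(1,2,3)")
result <- to_colon(sample_input)
result
```

Strings without c( ... ) do not match the first pattern and pass through unchanged, which is why "2" and "3:4" are left as they are.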
With:
tst <- gsub('[c()]','',DF$test)
tst <- strsplit(tst, '[,:]')
DF$test <- sapply(tst, paste0, collapse = ':')
or in one go:
DF$test <- sapply(strsplit(gsub('[c()]','',DF$test), '[,:]'), paste0, collapse = ':')
your data.frame now looks like:
> DF
letters value test
1 A 0.440 2
2 B 0.540 1:4
3 C 0.210 1
4 D 0.102 3:4
5 E 0.002 1:2
The advantage of this is that it also works with strings in DF$test that are longer than 2 numbers.
gsub should get you there
DF$test <- gsub(".+(\\d+).(\\d+).+", "\\1:\\2", DF$test)
We can use str_extract_all from stringr:
library(stringr)
DF$test <- sapply(str_extract_all(DF$test, '[0-9]+'), paste, collapse=":")
DF$test
#[1] "2" "1:4" "1" "3:4" "1:2"
Or using base R
DF$test <- sapply(regmatches(DF$test, gregexpr('[0-9]+', DF$test)), paste, collapse=":")
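The digit-extraction approach also copes with multi-digit numbers and with more than two values per cell. The sketch below uses a made-up input vector to show this; it assumes the values are non-negative integers, because a decimal such as 1.5 would be split into two separate matches by the [0-9]+ pattern.

```r
# Made-up sample input, including multi-digit and three-value cells.
test <- c("2", "c(10,12)", "3:4", "c(1,2,3)")

# Pull out every run of digits, then join the runs within each cell by ":".
digits <- regmatches(test, gregexpr("[0-9]+", test))
joined <- vapply(digits, paste, character(1), collapse = ":")
joined
```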

How to replace values in multiple columns in a data.frame with values from a vector in R?

I would like to replace the values in the last three columns in my data.frame with the three values in a vector.
Example of data.frame
df
A B C D
5 3 8 9
Vector
1 2 3
what I would like the data.frame to look like.
df
A B C D
5 1 2 3
Currently I am doing:
df$B <- Vector[1]
df$C <- Vector[2]
df$D <- Vector[3]
I would like to not replace the values one by one. I would like to do it all at once.
Any help will be appreciated. Please let me know if any further information is needed.
We can subset the last three columns of the dataset with tail, replicate the 'Vector' to make the lengths similar and assign the values to those columns
df[,tail(names(df),3)] <- Vector[col(df[,tail(names(df),3)])]
df
# A B C D
#1 5 1 2 3
NOTE: I replicated the 'Vector' assuming that there will be more rows in the 'df' in the original dataset.
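To illustrate the multi-row case mentioned in the note, here is a small sketch with a made-up two-row data frame. col() returns the column index of every cell, so indexing Vector with it gives each cell the value that belongs to its column.

```r
# Made-up two-row example; only the first row comes from the question.
df <- data.frame(A = c(5, 6), B = c(3, 4), C = c(8, 7), D = c(9, 1))
Vector <- 1:3

# Names of the last three columns.
last3 <- tail(names(df), 3)

# One replacement value per cell, chosen by the cell's column position.
df[, last3] <- Vector[col(df[, last3])]
df
```

Every row ends up with 1, 2, 3 in columns B, C, D, while column A is left untouched.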
Try this:
df[-1] <- 1:3
giving:
> df
A B C D
1 5 1 2 3
Alternately, we could do it non-destructively like this:
replace(df, -1, 1:3)
Note: The input df in reproducible form is:
df <- data.frame(A = 5, B = 3, C = 8, D = 9)

Add list of columns above a certain threshold

Say I have a dataframe:
df <- data.frame(rbind(c(10,1,5,4), c(6,0,3,10), c(7,1,10,10)))
colnames(df) <- c("a", "b", "c", "d")
df
a b c d
10 1 5 4
6 0 3 10
7 1 10 10
And a vector of numbers (which correspond to the four column names a,b,c,d)
threshold <- c(7,1,5,8)
I need to compare each row in the data frame to the vector. When the value in the data frame meets or exceeds that in the vector, I need to return the column name. The output would be:
a b c d cols
10 1 5 4 a,b,c #10>=7, 1>=1, 5>=5
6 0 3 10 d #10>=8
7 1 10 10 a,b,c,d #7>=7, 1>=1, 10>=5, 10>=8
The column cols can be a string that simply lists the columns where the value is exceeded.
Is there any clever way to do this? I'm migrating an old Excel function and I can write a loop or something, but I thought there almost had to be a better way.
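You do not need which, and toString produces the comma-separated output you want. Note that this code assumes df has a leading id column, as in the printed output that follows, which is why the first column is dropped with [-1]: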
df$cols <- apply(df[-1], 1, function(x) toString(names(df)[-1][x >= threshold]))
df
id a b c d cols
1 aa 10 1 5 4 a, b, c
2 bb 6 0 3 10 d
3 cc 7 1 10 10 a, b, c, d
We can also try
i1 <- which(df >=threshold[col(df)], arr.ind=TRUE)
df$cols <- unname(tapply(names(df)[i1[,2]], i1[,1], toString))
df$cols
#[1] "a, b, c" "d" "a, b, c, d"
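You can try this (it assumes an id column in position 1 and the values in columns 2 to 5; cols becomes a list column of character vectors rather than one string per row):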
df$cols <- apply(df[, 2:5], 1, function(x) names(df[, 2:5])[which(x >= threshold)])
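For the data frame exactly as given in the question, with no id column, a sketch along the same lines could look like this. It uses sweep to compare every column with its own threshold; the names value_cols and meets are introduced here for illustration only.

```r
# Data and thresholds from the question.
df <- data.frame(rbind(c(10, 1, 5, 4), c(6, 0, 3, 10), c(7, 1, 10, 10)))
colnames(df) <- c("a", "b", "c", "d")
threshold <- c(7, 1, 5, 8)

# Columns to test; taken before the result column is added.
value_cols <- names(df)

# Logical matrix: TRUE where a value meets or exceeds its column's threshold.
meets <- sweep(as.matrix(df[value_cols]), 2, threshold, ">=")

# Collapse the names of the matching columns into one string per row.
df$cols <- apply(meets, 1, function(hit) paste(value_cols[hit], collapse = ","))
df
```

Because paste with collapse always returns a single string, a row with no matching column would get an empty string rather than breaking the assignment.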

Find the index of the row in data frame that contain one element in a string vector

If I have a data.frame like this
df <- data.frame(col1 = c(letters[1:4],"a"),col2 = 1:5,col3 = letters[10:14])
df
col1 col2 col3
1 a 1 j
2 b 2 k
3 c 3 l
4 d 4 m
5 a 5 n
I want to get the indices of the rows that contain any of the elements in c("a", "k", "n"); in this example, the result should be 1, 2, 5.
If you have a large data frame and you wish to check all columns, try this:
x <- c("a", "k", "n")
Reduce(union, lapply(x, function(a) which(rowSums(df == a) > 0)))
# [1] 1 5 2
and of course you can sort the end result.
s <- c('a','k','n');
which(df$col1%in%s|df$col3%in%s);
## [1] 1 2 5
Here's another solution. This one works on the entire data.frame, and happens to capture the search strings as element names (you can get rid of those via unname()). Note that the [1] keeps only the first matching row for each search string:
sapply(s,function(s) which(apply(df==s,1,any))[1]);
## a k n
## 1 2 5
Original second solution:
sort(unique(rep(1:nrow(df),ncol(df))[as.matrix(df)%in%s]));
## [1] 1 2 5
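A further variant checks every column with %in% and returns all matching rows in sorted order. This is a sketch rather than one of the original answers; the name hits is made up.

```r
# Data and search strings from the question.
df <- data.frame(col1 = c(letters[1:4], "a"), col2 = 1:5, col3 = letters[10:14])
s <- c("a", "k", "n")

# One logical column per data frame column: TRUE where the cell is in s.
hits <- vapply(df, function(column) column %in% s, logical(nrow(df)))

# A row qualifies if at least one of its cells matched.
rows <- which(rowSums(hits) > 0)
rows
```

Unlike the per-string approach, this reports each qualifying row once, however many search strings it contains.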
