I have a dataset like the one below that I want to transform into another format that assigns true/false based on whether a certain string is present. What's the best way to do it in either Excel or R?
Thanks!
Initial dataset:
Row1 A D
Row2 B C
Row3 A C E
The format I want:
A B C D E
Row1 1 0 0 1 0
Row2 0 1 1 0 0
Row3 1 0 1 0 1
Here is a base R way with lapply and xtabs.
I assume that filename holds the data file name.
x <- readLines(filename)
x <- strsplit(x, " ")
l <- lapply(x, \(y) {
values <- y[-1]
rows <- rep(y[1], length(values))
data.frame(rows, values)
})
df1 <- do.call(rbind, l)
rm(x, l)
xtabs(~ rows + values, df1)
#> values
#> rows A B C D E
#> Row1 1 0 0 1 0
#> Row2 0 1 1 0 0
#> Row3 1 0 1 0 1
Created on 2022-09-08 by the reprex package (v2.0.1)
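For readers who want to try this without a data file, here is a minimal sketch of the same idea using stack() and table() from base R. The vector `lines` is an inline stand-in for readLines(filename), and the names `parts`, `labels`, `items`, `long` and `m` are only illustrative, not part of the original answer.

```r
# Inline copy of the question's data (assumption: stands in for readLines(filename))
lines <- c("Row1 A D", "Row2 B C", "Row3 A C E")

# Split each line into tokens; the first token is the row label
parts <- strsplit(lines, " ", fixed = TRUE)
labels <- vapply(parts, `[`, character(1), 1)

# Named list: one element per row, holding the remaining tokens
items <- setNames(lapply(parts, `[`, -1), labels)

# Long format with columns `values` and `ind`, then cross-tabulate
long <- stack(items)
m <- unclass(table(rows = long$ind, values = long$values))
m
```

Calling unclass() turns the contingency table into a plain integer matrix, which may be easier to pass on to other code.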
I would like to add a new row to a data frame; however, this row in my case has more values. Let's assume I have the following dataset:
> df <- data.frame(0,0,0)
> colnames(df) <- c("A","B","C")
> df
A B C
0 0 0
Now let us have a vector with 4 elements.
> x <- c(0,0,0,0)
> names(x) <- c("A","B","D","C")
> x
A B D C
0 0 0 0
I would like to add this vector to the data frame above such that
> df
A B C D
0 0 0 NA
0 0 0 0
Is there a way to do this?
Using rbindlist
library(data.table)
rbindlist(list(df, as.data.frame.list(x)), fill = TRUE)
A B C D
1: 0 0 0 NA
2: 0 0 0 0
You may use -
library(dplyr)
df %>% bind_rows(x %>%
t %>%
as.data.frame())
# A B C D
#1 0 0 0 NA
#2 0 0 0 0
Or, as @Andrew Gustar mentioned -
dplyr::bind_rows(df, x)
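If you would rather not load a package, the same fill behaviour can be sketched in base R by adding the missing columns as NA before calling rbind(). This is only an illustration under the assumption that x is a named vector as in the question; `new_row`, `all_cols` and `out` are names made up for this example.

```r
df <- data.frame(A = 0, B = 0, C = 0)
x <- c(A = 0, B = 0, D = 0, C = 0)

# Turn the named vector into a one-row data frame
new_row <- as.data.frame(as.list(x))

# Columns that should exist in the result, in df's order first
all_cols <- union(names(df), names(new_row))

# Add whatever is missing on either side as NA
for (nm in setdiff(all_cols, names(df))) df[[nm]] <- NA
for (nm in setdiff(all_cols, names(new_row))) new_row[[nm]] <- NA

# rbind() on data frames matches columns by name
out <- rbind(df[all_cols], new_row[all_cols])
out
```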
I would like to add a varying number (X) of columns with 0 to an existing data.frame within a function.
Here is an example data.frame:
dt <- data.frame(x=1:3, y=4:6)
I would like to get this result if X=1 :
a x y
1 0 1 4
2 0 2 5
3 0 3 6
And this if X=3 :
a b c x y
1 0 0 0 1 4
2 0 0 0 2 5
3 0 0 0 3 6
What would be an efficient way to do this?
We can assign '0' to multiple new columns based on the value of 'X'
X <- 3
nm1 <- names(dt)
dt[letters[seq_len(X)]] <- 0
dt[c(setdiff(names(dt), nm1), nm1)]
Also, we can use add_column from tibble and create columns at a specific location
library(tibble)
add_column(dt, .before = 1, !!!setNames(as.list(rep(0, X)),
letters[seq_len(X)]))
A second option is cbind
f <- function(x, n = 3) {
cbind.data.frame(matrix(
0,
ncol = n,
nrow = nrow(x),
dimnames = list(NULL, letters[1:n])
), x)
}
f(dt, 5)
# a b c d e x y
#1 0 0 0 0 0 1 4
#2 0 0 0 0 0 2 5
#3 0 0 0 0 0 3 6
NOTE: because letters has a length of 26 the function would need some adjustment regarding the naming scheme if n > 26.
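One possible adjustment for n > 26 is sketched below. It is not the answerer's code: the function name `add_zero_cols` and the `prefix` argument are hypothetical, and the fallback naming scheme (prefix plus a running number) is just one choice among many. It also does not check for clashes with column names already present in the data.

```r
# Hypothetical helper: prepend n zero-filled columns to a data frame
add_zero_cols <- function(data, n, prefix = "z") {
  stopifnot(is.data.frame(data), n >= 0)
  if (n == 0) return(data)
  # Use letters while they last, otherwise fall back to prefix1, prefix2, ...
  new_names <- if (n <= length(letters)) {
    letters[seq_len(n)]
  } else {
    paste0(prefix, seq_len(n))
  }
  zeros <- as.data.frame(
    matrix(0, nrow = nrow(data), ncol = n, dimnames = list(NULL, new_names))
  )
  cbind(zeros, data)
}

dt <- data.frame(x = 1:3, y = 4:6)
res3 <- add_zero_cols(dt, 3)    # columns a, b, c, x, y
res30 <- add_zero_cols(dt, 30)  # columns z1 ... z30, x, y
```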
You can try the code below
dt <- cbind(`colnames<-`(t(rep(0,X)),letters[seq(X)]),dt)
If you don't care about the names of the added columns, you can use just
dt <- cbind(t(rep(0,X)),dt)
which is much shorter
I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above lines run, but neither actually removes the all-zero columns.
I am not sure why that is; any tips are greatly appreciated.
The OP found a solution while exchanging comments with me, but I want to leave an explanation here. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP asked R to do two things. The first was to subset each data frame in my.list; the second was to return each data frame. I think he expected each data frame to be updated by the subsetting, but the result of the first expression was never assigned, so it was discarded. A function returns the value of its last expression, which here is the unchanged x, so on the surface nothing appeared to change. Following his approach, I would do something like this.
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
That is, create a temporary object inside the anonymous function and return it. Alternatively, return the subset directly:
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, run a logical check for each column. If colSums(x) != 0 is TRUE, keep the column. Otherwise remove it. Hope this will help future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
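One caveat worth a short sketch: when only a single column survives the filter, x[, keep] on a data frame returns a plain vector by default. Passing drop = FALSE keeps the result a data frame. The helper name `drop_zero_cols` and the small `demo.list` below are made up for illustration, and the sketch assumes every column is numeric (the question mentions non-numeric leading columns, which would need to be handled separately).

```r
# Hypothetical helper: keep only columns with at least one non-zero value
drop_zero_cols <- function(d) {
  keep <- colSums(d != 0) > 0
  d[, keep, drop = FALSE]  # drop = FALSE keeps a data frame even for one column
}

# Small made-up list where only one column survives in each data frame
demo.list <- list(
  data.frame(a = c(0, 1), b = c(0, 0)),
  data.frame(g = c(0, 0), h = c(1, 1), i = c(0, 0))
)

demo.out <- lapply(demo.list, drop_zero_cols)
demo.out
```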
I have two data frames with different lengths. One is a sample and the other is a test sample.
df1 a b c d ...
1 0 0 0
2 0 0 1
df2 a e b c d ...
1 1 0 0 0
2 0 0 0 1
How can I delete the columns of df2 that are not in common with df1?
As a result, I'm looking for df2 with the same columns as df1 (a, b, c, d ...).
I tried merge(), but it's not what I'm looking for.
If I understand your question correctly you can subset by the column-names like this:
df2[, colnames(df1)]
If you have column names in df1 not present in df2 you can do
df2[, intersect(colnames(df1), colnames(df2))]
Edit: forgot a comma
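A small self-contained sketch of the intersect() approach follows. The data values are made up to resemble the question's layout, and `common` and `df2_aligned` are illustrative names. drop = FALSE is added so the result stays a data frame even if only one column is shared.

```r
# Made-up data resembling the question: df2 has an extra column e
df1 <- data.frame(a = 1:2, b = c(0, 0), c = c(0, 0), d = c(0, 1))
df2 <- data.frame(a = 1:2, e = c(1, 0), b = c(0, 0), c = c(0, 0), d = c(0, 1))

# Column names present in both, in df1's order
common <- intersect(colnames(df1), colnames(df2))

# Keep only the shared columns of df2
df2_aligned <- df2[, common, drop = FALSE]
df2_aligned
```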