I have a dataset like the one below that I want to transform into another format that assigns true/false based on whether a certain string is present. What's the best way to do it, in either Excel or R?
Thanks!
Initial dataset:
Row1 A D
Row2 B C
Row3 A C E
The format I want:
A B C D E
Row1 1 0 0 1 0
Row2 0 1 1 0 0
Row3 1 0 1 0 1
Here is a base R way with lapply and xtabs.
I assume that filename holds the data file name.
x <- readLines(filename)
x <- strsplit(x, " ")
l <- lapply(x, \(y) {
  values <- y[-1]
  rows <- rep(y[1], length(values))
  data.frame(rows, values)
})
df1 <- do.call(rbind, l)
rm(x, l)
xtabs(~ rows + values, df1)
#> values
#> rows A B C D E
#> Row1 1 0 0 1 0
#> Row2 0 1 1 0 0
#> Row3 1 0 1 0 1
Created on 2022-09-08 by the reprex package (v2.0.1)
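For comparison, here is a minimal sketch of the same idea using stack() and table() instead of building one data frame per line. The inline vector `lines` is a hypothetical stand-in for the contents of the data file, and the names `values`, `long` and `wide` are mine, not from the answer above.

```r
# Hypothetical inline data standing in for readLines(filename)
lines <- c("Row1 A D", "Row2 B C", "Row3 A C E")
parts <- strsplit(lines, " ")

# First token of each line is the row label, the rest are the values
values <- lapply(parts, `[`, -1)
names(values) <- vapply(parts, `[`, character(1), 1)

# stack() turns the named list into a long data frame with columns
# `values` (the strings) and `ind` (the row label)
long <- stack(values)

# Cross-tabulate row label against value and keep the wide shape
wide <- as.data.frame.matrix(table(long$ind, long$values))
wide

# If TRUE/FALSE is preferred over 1/0:
wide > 0
```

The result has one row per input line and one column per distinct string, so it should match the xtabs() output above for this input.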
I have a dataframe df, I would like to find peaks and valleys for each column and then replace the points where peaks and valleys are present with the value 1.
Here I made an example by applying it to only one column.
Is it possible to do this for all the columns in the dataframe?
df <- data.frame(a = sample(1:10,10),
b = sample(1:10,10),
c = sample(1:10,10),
d = sample(1:10,10),
e = sample(1:10,10))
vallys<- findValleys(df$b, thresh =0)
peaks <- findPeaks(df$b, thresh = 0)
df$b <- rep(0, nrow(df))
df$b <- replace(df$b, peaks, values=1)
df$b <- replace(df$b, vallys, values=1)
Thank you
The easiest thing is to put your code into a function.
library(quantmod)
replace_peaks_valleys <- function(x) {
  valleys <- findValleys(x, thresh = 0)
  peaks <- findPeaks(x, thresh = 0)
  new_col <- rep(0, length(x))
  new_col <- replace(new_col, peaks, values = 1)
  new_col <- replace(new_col, valleys, values = 1)
  return(new_col)
}
Then you can choose whether to do it in base R, dplyr or data.table.
base R
As you want to assign back to your original data frame, in base R you can do the following (note the empty square brackets in df[]; without them df would be replaced by a plain list):
df[] <- lapply(df, replace_peaks_valleys)
head(df)
# a b c d e
# 1 0 0 0 0 0
# 2 0 0 0 0 0
# 3 1 1 1 1 1
# 4 1 0 1 1 0
# 5 1 1 0 1 0
# 6 0 1 1 1 1
dplyr
Alternatively, with dplyr you can just do:
library(dplyr)
df |>
mutate(
across(
a:e, replace_peaks_valleys
)
)
# a b c d e
# 1 0 0 0 0 0
# 2 0 0 0 0 0
# 3 1 1 1 1 1
# 4 1 0 1 1 0
# <etc>
data.table
You can also do this with data.table:
library(data.table)
dt <- setDT(df)
dt[, lapply(.SD, replace_peaks_valleys)]
# a b c d e
# 1: 0 0 0 0 0
# 2: 0 0 0 0 0
# 3: 1 0 1 1 1
# 4: 1 1 0 0 0
# <etc>
N.B. I used set.seed(1) before I ran your code. If you do this as well, you should get exactly the same output.
Function definition
I just copied and pasted your code and made it into a function. You could change it so you assign 0 or 1 to the existing vector, rather than creating a new vector every time:
replace_peaks_valleys2 <- function(x) {
  valleys <- findValleys(x, thresh = 0)
  peaks <- findPeaks(x, thresh = 0)
  x[] <- 0
  x[c(peaks, valleys)] <- 1
  return(x)
}
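If you would rather not depend on quantmod, the same column-wise pattern works with any function that maps a vector to a 0/1 vector. Below is a rough base R sketch that marks positions where the direction of the series changes. The name `mark_turning_points` is hypothetical, and this is a swapped-in technique, not what findPeaks()/findValleys() do internally: the indices it marks may not coincide with theirs, so compare the results before substituting. It also assumes no two consecutive values are equal, which holds for sample(1:10, 10).

```r
# Hypothetical helper: flag local maxima and minima with 1, everything else 0.
# Assumes consecutive values are never equal (no flat stretches).
mark_turning_points <- function(x) {
  # Direction of each step: +1 rising, -1 falling
  steps <- sign(diff(x))
  # The direction changes at a turning point; +1 maps the position in
  # `steps` back to the index of the extremum itself in `x`
  turns <- which(diff(steps) != 0) + 1
  out <- numeric(length(x))
  out[turns] <- 1
  out
}

x <- c(1, 3, 2, 5, 4)
mark_turning_points(x)
# 3 and 5 are local maxima, 2 is a local minimum, so positions 2, 3 and 4 are flagged

# Applied to every column, in the same way as above
df <- data.frame(a = c(1, 3, 2, 5, 4), b = c(5, 1, 2, 3, 0))
df[] <- lapply(df, mark_turning_points)
df
```

The first and last elements are never flagged, because they have only one neighbour.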
I have 2 dfs (simplified example):
df1 a b c g ...
1 0 0 0
2 0 0 1
And
df2 a b d e f ...
1 1 0 0 0
2 0 0 0 1
I would like to merge the 2 dfs but before joining I would like to remove common columns in df1 and df2. So I would retain columns (c,d,e,f,g) as a and b are common in df1 and df2.
So basically doing the opposite of what was answered here:
delete columns in data frame not in common with another (R)
Using the set operations union, intersect and setdiff on the names of both data frames, we can do this:
df1 <- read.table(header = T, text = 'a b c g
1 0 0 0
2 0 0 1')
df2 <- read.table(header = T, text = 'a b d e f
1 1 0 0 0
2 0 0 0 1')
# uncommon column names
x <- setdiff(union(names(df1), names(df2)), intersect(names(df1), names(df2)))
cbind(df1[names(df1) %in% x], df2[names(df2) %in% x])
#> c g d e f
#> 1 0 0 0 0 0
#> 2 0 1 0 0 1
Created on 2021-06-15 by the reprex package (v2.0.0)
In base R, you can start by using the duplicated function to work out which column names both data frames have in common. From there, it's just a matter of selecting and binding the columns from each data frame that are not on this list.
dupes <- c(names(df1), names(df2))[duplicated(c(names(df1), names(df2)))]
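# logical indexing with drop = FALSE also works when there are no duplicated names
# (-which() would then select zero columns) or when only one column remains
df3 <- cbind(df1[, !names(df1) %in% dupes, drop = FALSE], df2[, !names(df2) %in% dupes, drop = FALSE])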
Following your example, this would produce the following data frame, consisting only of the unique columns from each of the others. This is based on the assumption that both data frames have the same number of rows.
df3 c g d e f ...
0 0 0 0 0
0 1 0 0 1
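Both answers boil down to "drop the shared names from each side, then bind". A small sketch of that as a reusable function follows. The name `bind_uncommon` is hypothetical, and, like the answers above, it assumes both data frames have the same number of rows in the same order.

```r
# Hypothetical helper: keep only the columns whose names are not shared,
# then bind the two remainders side by side.
bind_uncommon <- function(x, y) {
  # cbind() pairs rows by position, so the row counts must agree
  stopifnot(nrow(x) == nrow(y))
  common <- intersect(names(x), names(y))
  # Selecting by name with single brackets always returns a data frame
  cbind(x[setdiff(names(x), common)], y[setdiff(names(y), common)])
}

df1 <- read.table(header = TRUE, text = "a b c g
1 0 0 0
2 0 0 1")
df2 <- read.table(header = TRUE, text = "a b d e f
1 1 0 0 0
2 0 0 0 1")

bind_uncommon(df1, df2)
```

For the example data this keeps c and g from df1 and d, e and f from df2.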
I have a csv file like this:
col1 col2 col3
r1 a,b,c e,f g
r2 h,i j,k
r3 l m,n,o
Some cells contain several comma-separated strings, some contain a single string, and some are empty. I want to convert it to this:
col1 col2 col3
a 1 0 0
b 1 0 0
c 1 0 0
e 0 1 0
f 0 1 0
g 0 0 1
h 1 0 0
i 1 0 0
j 0 0 1
k 0 0 1
l 1 0 0
m 0 1 0
n 0 1 0
o 0 1 0
Any suggestion? I tried pivot table in excel but not getting the desired output.
Thanks in advance.
Best Regards
Zillur
Not sure whether this is the shortest solution (probably not), but it produces the desired output. Basically, we go through all three columns, count the occurrences of each string, and build a long-format data frame that we then spread into the wide format you want.
library(tidyr)
library(purrr)
library(stringr)  # for str_split()
library(tibble)   # for tibble(), which replaces the deprecated data_frame()
df <- tibble(col1 = c("a,b,c", "h,i", "l"),
             col2 = c("e,f", "", "m,n,o"),
             col3 = c("g", "j,k", ""))
let_df <- map_df(df, function(col){
  # map_df applies the function to each column of df
  # split strings at "," and unlist to get vector of letters
  letters <- unlist(str_split(col, ","))
  # delete ""
  letters <- letters[nchar(letters) > 0]
  # count occurrences for each letter
  tab <- table(letters)
  # cap counts at 1 so the value is a presence flag
  tab[tab > 1] <- 1
  # create data frame from table (as.numeric drops the table class)
  df <- tibble(letter = names(tab), count = as.numeric(tab))
  return(df)
}, .id = "col") # id adds a column col that contains col1 - col3
# bring data frame into wide format
let_df %>%
  spread(col, count, fill = 0)
Such a great problem to solve. Here is my take on it in base R:
col1 <- c("a,b,c","h,i","l")
col2 <- c("e,f","","m,n,o")
col3 <- c("g","j,k","")
data <- data.frame(col1, col2, col3, stringsAsFactors = FALSE)
restructure <- function(df){
  df[df==""] <- "missing"
  result_rows <- as.character()
  l <- list()
  for (i in seq_along(colnames(df)) ){
    df_col <- sort(unique(unlist(strsplit(gsub(" ", "",toString(df[[i]])), ","))))
    df_col <- df_col[!df_col %in% "missing"]
    result_rows <- sort(unique(c(result_rows, df_col)))
    l[i] <- list(df_col)
  }
  result <- data.frame(result_rows)
  for (j in seq_along(l)){
    result$temp <- NA
    result$temp[match(l[[j]], result_rows)] <- 1
    colnames(result)[colnames(result)=="temp"] <- colnames(df)[j]
  }
  result[is.na(result)] <- 0
  return(result)
}
> restructure(data)
# result_rows col1 col2 col3
#1 a 1 0 0
#2 b 1 0 0
#3 c 1 0 0
#4 e 0 1 0
#5 f 0 1 0
#6 g 0 0 1
#7 h 1 0 0
#8 i 1 0 0
#9 j 0 0 1
#10 k 0 0 1
#11 l 1 0 0
#12 m 0 1 0
#13 n 0 1 0
#14 o 0 1 0
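A shorter base R variant of the same split-then-tabulate idea, offered as a sketch only. The names `tokens`, `long` and `presence` are mine, and the input is the same three-column example typed in by hand, not read from the CSV.

```r
data <- data.frame(col1 = c("a,b,c", "h,i", "l"),
                   col2 = c("e,f", "", "m,n,o"),
                   col3 = c("g", "j,k", ""),
                   stringsAsFactors = FALSE)

# Split every cell at commas; one character vector of tokens per column
tokens <- lapply(data, function(col) {
  out <- unlist(strsplit(col, ",", fixed = TRUE))
  out[nzchar(out)]  # guard against empty strings
})

# Long format: `values` holds the token, `ind` the column it came from
long <- stack(tokens)

# Cross-tabulate token by column and reduce counts to a 0/1 flag
presence <- (table(long$values, long$ind) > 0) * 1
presence
```

Rows are the distinct strings and columns are col1 to col3, with 1 where the string occurs in that column.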
I have a sequence which looks like this
SEQENCE
1 A
2 B
3 B
4 C
5 A
Now from this sequence, I want to get a matrix like this, where the element in the ith row and jth column denotes how many times a movement occurred from the ith row node to the jth column node:
A B C
A 0 1 0
B 0 1 1
C 1 0 0
How can I get this in R?
1) Use table like this:
s <- DF[, 1]
table(tail(s, -1), head(s, -1))
giving:
A B C
A 0 0 1
B 1 1 0
C 0 1 0
2) or like this. Since embed does not work with factors we convert the factor to character,
s <- as.character(DF[, 1])
do.call(table, data.frame(embed(s, 2)))
giving:
X2
X1 A B C
A 0 0 1
B 1 1 0
C 0 1 0
3) xtabs also works:
s <- as.character(DF[, 1])
xtabs(data = data.frame(embed(s, 2)))
giving:
X2
X1 A B C
A 0 0 1
B 1 1 0
C 0 1 0
Note: The input DF in reproducible form is:
Lines <- " SEQENCE
1 A
2 B
3 B
4 C
5 A"
DF <- read.table(text = Lines, header = TRUE)
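One thing to watch: in the three results above the rows are the destination and the columns the origin, which is the transpose of the layout shown in the question. A minimal sketch of the from-by-to orientation, with the sequence typed inline and the dimension labels `from` and `to` chosen by me:

```r
s <- c("A", "B", "B", "C", "A")

# head(s, -1) is every element except the last (where each move starts);
# tail(s, -1) is every element except the first (where each move ends)
transitions <- table(from = head(s, -1), to = tail(s, -1))
transitions

# Equivalent: transpose the result of the table() call used above
t(table(tail(s, -1), head(s, -1)))
```

Here rows are the node moved from and columns the node moved to, so the A row has its single count under B, as in the question.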
How do I turn a list of tables into a data frame?
I have:
> (tabs <- list(table(c('a','a','b')),table(c('c','c','b')),table(c()),table(c('b','b'))))
[[1]]
a b
2 1
[[2]]
b c
1 2
[[3]]
< table of extent 0 >
[[4]]
b
2
I want:
> data.frame(a=c(2,0,0,0),b=c(1,1,0,2),c=c(0,2,0,0))
a b c
1 2 1 0
2 0 1 2
3 0 0 0
4 0 2 0
PS. Please do not assume that the tables were created by table calls! They were not!
c_names <- unique(unlist(sapply(tabs, names)))
df <- do.call(rbind, lapply(tabs, `[`, c_names))
colnames(df) <- c_names
df[is.na(df)] <- 0
This assumes the tables are one dimensional.
all.names <- unique(unlist(lapply(tabs, names)))
df <- as.data.frame(do.call(rbind,
lapply(
tabs, function(x) as.list(replace(c(x)[all.names], is.na(c(x)[all.names]), 0))
) ) )
names(df) <- all.names
df
There is probably a cleaner way to do this.
# a b c
# 1 2 1 0
# 2 0 1 2
# 3 0 0 0
# 4 0 2 0
tabs <- list(table(c('a','a','b')),table(c('c','c','b')),table(c()),table(c('b','b')))
dat.names <- unique(unlist(sapply(tabs, names)))
dat <- matrix(0, nrow = length(tabs), ncol = length(dat.names))
colnames(dat) <- dat.names
for (ii in 1:length(tabs)) {
dat[ii, ] <- tabs[[ii]][match(colnames(dat), names(tabs[[ii]]) )]
}
dat[is.na(dat)] <- 0
> dat
a b c
[1,] 2 1 0
[2,] 0 1 2
[3,] 0 0 0
[4,] 0 2 0
Here is a pretty clean approach:
library(reshape2)
newTabs <- melt(tabs)
newTabs
# Var1 value L1
# 1 a 2 1
# 2 b 1 1
# 3 b 1 2
# 4 c 2 2
# 5 b 2 4
newTabs$L1 <- factor(newTabs$L1, seq_along(tabs))
dcast(newTabs, L1 ~ Var1, fill = 0, drop = FALSE)
# L1 a b c
# 1 1 2 1 0
# 2 2 0 1 2
# 3 3 0 0 0
# 4 4 0 2 0
This makes use of the fact that there is a melt method for lists (see reshape2:::melt.list) which automatically adds in a variable (L1 for an unnested list) that identifies the index of the list element. Since your list has some items which are empty, they won't show up in your melted list, so you need to factor the "L1" column, specifying the levels you want. dcast takes care of restructuring your output and allows you to specify the desired fill value.
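If you want to stay in base R and avoid the NA-then-replace step, here is one more sketch: start each row as a zero vector over all names and fill in whatever that table has. The names `all_names`, `rows` and `result` are mine. Like the other base R answers, it assumes the tables are one-dimensional with names, and it uses table() only to rebuild the example input.

```r
tabs <- list(table(c("a", "a", "b")), table(c("c", "c", "b")),
             table(c()), table(c("b", "b")))

# Every name that occurs in any table, in a stable order
all_names <- sort(unique(unlist(lapply(tabs, names))))

rows <- lapply(tabs, function(tab) {
  # Start from zeros so names missing from this table stay 0
  counts <- setNames(numeric(length(all_names)), all_names)
  # Fill in the counts this table has (a no-op for an empty table)
  counts[names(tab)] <- as.vector(tab)
  counts
})

result <- as.data.frame(do.call(rbind, rows))
result
```

Empty list elements come out as all-zero rows, so the row order matches the order of the list.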