Converting given list into dataframe


I have the following list:
$id1
$id1[[1]]
A B
"A" "B"
$id1[[2]]
A B
"A" "A1"
$id2
$id2[[1]]
A B
"A2" "B2"
In R-pastable form:
dat = structure(list(SampleTable = structure(list(id2 = list(structure(c("90", "7"), .Names = c("T", "G")), structure(c("90", "8"), .Names = c("T", "G"))), id1 = structure(c("1", "1"), .Names = c("T", "G"))), .Names = c("id2", "id1"))), .Names = "SampleTable")
I want this list to be converted into the following data frame:
id1 A B
id1 A A1
id2 A2 B2

Your data structure (apparently a named list of unnamed lists of 1-row data.frames) is a bit complicated: the easiest may be to use a loop to build the data.frame.
It can be done directly with do.call, lapply and rbind, but it is not very readable, even if you are familiar with those functions.
# Sample data
d <- list(
id1 = list(
data.frame( x=1, y=1 ),
data.frame( x=2, y=2 )
),
id2 = list(
data.frame( x=3, y=3 ),
data.frame( x=4, y=4 )
),
id3 = list(
data.frame( x=5, y=5 ),
data.frame( x=6, y=6 )
)
)
# Convert
d <- data.frame(
id=rep(names(d), unlist(lapply(d,length))),
do.call( rbind, lapply(d, function(u) do.call(rbind, u)) )
)
Another solution, using a loop, for a ragged data structure that contains vectors (not data.frames), as explained in the comments.
d <- structure(list(SampleTable = structure(list(id2 = list(structure(c("90", "7"), .Names = c("T", "G")), structure(c("90", "8"), .Names = c("T", "G"))), id1 = structure(c("1", "1"), .Names = c("T", "G"))), .Names = c("id2", "id1"))), .Names = "SampleTable")
result <- list()
for(i in seq_along(d$SampleTable)) {
id <- names(d$SampleTable)[i]
block <- d$SampleTable[[i]]
if(is.atomic(block)) {
block <- list(block)
}
for(row in block) {
result <- c(result, list(data.frame(id, as.data.frame(t(row)))))
}
}
result <- do.call(rbind, result)
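The same ragged structure can also be flattened without an explicit loop. The following is only a sketch, assuming every leaf is a named character vector with the same names; the object names tab, blocks and rows are made up for illustration.

```r
# Ragged input: id2 holds a list of vectors, id1 holds a bare vector
tab <- list(
  id2 = list(c(T = "90", G = "7"), c(T = "90", G = "8")),
  id1 = c(T = "1", G = "1")
)

# Wrap bare vectors so that every id maps to a list of rows
blocks <- lapply(tab, function(b) if (is.atomic(b)) list(b) else b)

# Flatten one level: a plain list with one vector per future row
rows <- unlist(blocks, recursive = FALSE)

# Repeat each id once per row and bind the rows into columns
result <- data.frame(
  id = rep(names(blocks), lengths(blocks)),
  do.call(rbind, unname(rows)),
  stringsAsFactors = FALSE
)
result
```

All columns stay character here because the input vectors are character; convert them afterwards if numbers are needed.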

NOTE! I could not get melt and cast working on this kind of ragged data (I tried for over an hour...). I am going to leave this answer here to show that, for this kind of operation, the reshape package could also be used.
Using the example data of vincent, you can use melt and cast from the reshape package:
library(reshape)
res = cast(melt(d))[-1]
names(res) = c("id","x","y")
res
id x y
1 id1 1 1
2 id2 3 3
3 id3 5 5
4 id1 2 2
5 id2 4 4
6 id3 6 6
The order in the resulting data.frame is not the same, but the result is identical. And the code is a bit shorter. I use the [-1] to delete the first column which is also returned by melt. This additional variable is the column index of the individual data.frames in the list of lists. Just have a look at the result of melt(d), that will hopefully make it more clear.

This is a bit messier than you let on. That dat object has an extra "layer" above it, so it is easier to work with dat[[1]]:
dfrm <- data.frame(dat[[1]], stringsAsFactors=FALSE)
names(dfrm) <- sub("\\..+$", "", names(dfrm))
> dfrm
id2 id2 id1
T 90 90 1
G 7 8 1
> t(dfrm)
T G
id2 "90" "7"
id2 "90" "8"
id1 "1" "1"

Related

Joining dataframes of different dimensions with varying merge by criterion

Good evening, I am trying to merge a couple datasets and my normal tools in R are failing me tonight. Consider df1 and df2 below.
df1 = data.frame(a = c("a", "b", "c"),
b = c("1", "2", "3"),
c = c("x", "y", "z"))
df2 = data.frame(a = c("1", "b", "c", "d", "e"),
b = c("a", "2", "3", "4", "5"),
d = c("x2", "y2", "z2", "x3", "y3"))
In both cases, columns a and b are supposed to act as grouping variables. For example, in df1, when a = a and b = 1, then c = x. Given the structure of the data I am working with, the order of a and b does not matter: if a were 1 and b were a, then c would still equal x. Herein lies the problem: I would like to merge df1 with a new data frame, df2. df2 is similarly structured but contains a new variable, d. As can be seen, df2 includes some a and b combinations that are reversed compared to df1. In addition, df2 has some additional observations.
The desired dataframe I am looking for looks like this:
desired = data.frame(a = c("a", "b", "c"),
b = c("1", "2", "3"),
c = c("x", "y", "z"),
d = c("x2", "y2", "z2"))
As can be seen, the original columns a, b and c are preserved, and column d has been added. However, no new observations have been added.
I have tried using merge() with varying combinations of by.x, by.y.
I also tried various left_join and inner_join calls, but I keep getting whopping data sets that still don't handle the mismatch in the a/b columns.
Thanks for any thoughts or help you might be able to provide.
Cheers

You can left_join df2 with df1 twice and use coalesce -
library(dplyr)
df1 %>%
left_join(df2, by = c("a"="a", "b"="b")) %>%
left_join(df2, by = c("a"="b", "b"="a")) %>%
mutate(
d = coalesce(d.x, d.y)
) %>%
select(a,b,c,d)
a b c d
1 a 1 x x2
2 b 2 y y2
3 c 3 z z2

Good morning. It appears that the actual order of a and b does matter. Sort each row of a and b in df2, or maybe in both data frames.
df2[1:2] <- t(apply(df2[1:2], 1, sort, decreasing=TRUE))
merge(df1, df2)
# a b c d
# 1 a 1 x x2
# 2 b 2 y y2
# 3 c 3 z z2
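If overwriting the original columns of df2 is not wanted, an order-independent key can be built in helper columns instead. This is only a sketch: the column names k1 and k2 are made up for illustration, and it relies on pmin() and pmax() accepting character vectors.

```r
df1 <- data.frame(a = c("a", "b", "c"),
                  b = c("1", "2", "3"),
                  c = c("x", "y", "z"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(a = c("1", "b", "c", "d", "e"),
                  b = c("a", "2", "3", "4", "5"),
                  d = c("x2", "y2", "z2", "x3", "y3"),
                  stringsAsFactors = FALSE)

# k1/k2 hold the smaller and larger of a and b, so (a, 1) and (1, a)
# produce the same key in both data frames
df1$k1 <- pmin(df1$a, df1$b); df1$k2 <- pmax(df1$a, df1$b)
df2$k1 <- pmin(df2$a, df2$b); df2$k2 <- pmax(df2$a, df2$b)

# Inner join on the key, then keep df1's own a and b plus the new d
out <- merge(df1, df2[c("k1", "k2", "d")], by = c("k1", "k2"))
out <- out[c("a", "b", "c", "d")]
out
```

Because only df1's a and b are kept, the original orientation of the grouping columns is preserved and the extra rows of df2 are dropped.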

Trying to produce a loop for summing up consecutive column values in R

I am trying to write a loop that sums up consecutive columns of a table and outputs the results into another table.
For example, in my original table, we have columns a, b, c, etc, which contain the same number of numeric values.
The resulting table then should be a, a+b, a+b+c, etc up to the last column of the original table
I have a feeling a for loop should be sufficient for this operation; however, I can't get my head around the format and syntax.
Any help would be appreciated!

Since you're new, here is an example of a very minimal reproducible example:
library(data.table)
x = data.table(a=1:3,b=4:6,c=7:9)
for(... now what?
And here's a way to do your task:
library(data.table)
# make some dummy data
X = data.table(a=1:2,b=3:4,c=5:6)
# make an empty result table
Y = data.table()
# for i = 1 to the number of columns in X
for(i in 1:ncol(X)){
# colnames(X) is "a" "b" "c".
# colnames(X)[1:1] is "a", colnames(X)[1:2] is "a" "b", colnames(X)[1:3] is "a" "b" "c"
# paste0(colnames(X)[1:1],collapse='') is "a",
# paste0(colnames(X)[1:2],collapse='') is "ab",
# paste0(colnames(X)[1:3],collapse='') is "abc"
newcolname = paste0(colnames(X)[1:i],collapse='')
# Y[,(newcolname):= is data.table syntax to create a new column called newcolname
# X[,1:i] selects columns 1 to i
# rowSums calculates the, um, row sums :D
Y[,(newcolname):=rowSums(X[,1:i])]
}

Maybe you need Reduce like below
cbind(
df,
setNames(
as.data.frame(Reduce(`+`, df, accumulate = TRUE)),
Reduce(paste0, names(df), accumulate = TRUE)
)
)
such that
a b c a ab abc
1 1 4 7 1 5 12
2 2 5 8 2 7 15
3 3 6 9 3 9 18
Data
df <- structure(list(a = 1:3, b = 4:6, c = 7:9), class = "data.frame", row.names = c(NA,
-3L))
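Since a, a+b, a+b+c is a cumulative sum taken across each row, the same numbers can also be sketched with cumsum. This assumes all columns are numeric; the name running is made up for illustration.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# apply() over rows returns one column per input row, hence the t()
running <- t(apply(as.matrix(df), 1, cumsum))

# Label the columns like the answers above: "a", "ab", "abc"
colnames(running) <- Reduce(paste0, names(df), accumulate = TRUE)
running
```

The result is a matrix; wrap it in as.data.frame() if a data frame is needed.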

Unexpected results using str_split and union in a function with sapply

Given this data.frame:
library(dplyr)
library(stringr)
ml.mat2 <- structure(list(value = c("a", "b", "c"), ground_truth = c("label1, label3",
"label2", "label1"), predicted = c("label1", "label2,label3",
"label1")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L))
glimpse(ml.mat2)
Observations: 3
Variables: 3
$ value <chr> "a", "b", "c"
$ ground_truth <chr> "label1, label3", "label2", "label1"
$ predicted <chr> "label1", "label2,label3", "label1"
I want to measure the length of the union of ground_truth and predicted for each row, after splitting the labels on the comma.
In other words, I would expect a result of length 3 with values of 2 2 1.
I wrote a function to do this, but it only seems to work outside of sapply:
m_fn <- function(x,y) length(union(unlist(sapply(x, str_split,",")),
unlist(sapply(y, str_split,","))))
m_fn(ml.mat2$ground_truth[1], y = ml.mat2$predicted[1])
[1] 2
m_fn(ml.mat2$ground_truth[2], y = ml.mat2$predicted[2])
[1] 2
m_fn(ml.mat2$ground_truth[3], y = ml.mat2$predicted[3])
[1] 1
Rather than iterating through the rows of the data set manually like this or with a loop, I would expect to be able to vectorize the solution with sapply like this:
sapply(ml.mat2$ground_truth, m_fn, ml.mat2$predicted)
However, the unexpected results are:
label1, label3 label2 label1
4 3 3

Since you're iterating over two vectors of the same length, you can generate an index of row numbers and run it in your sapply:
sapply(1:nrow(ml.mat2), function(i) m_fn(x = ml.mat2$ground_truth[i], y = ml.mat2$predicted[i]))
#[1] 2 2 1
or with seq_len:
sapply(seq_len(nrow(ml.mat2)), function(i)
m_fn(x = ml.mat2$ground_truth[i], y = ml.mat2$predicted[i]))
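The original sapply call passes the whole predicted vector as y on every iteration, which is why each count includes every predicted label. mapply walks both vectors in parallel instead. The sketch below swaps in base strsplit for str_split so it runs without stringr; n_union is a made-up name, and, as in the original function, whitespace around labels is not trimmed.

```r
ground_truth <- c("label1, label3", "label2", "label1")
predicted    <- c("label1", "label2,label3", "label1")

# Size of the union of the comma-separated labels of one x and one y
n_union <- function(x, y) {
  length(union(strsplit(x, ",")[[1]], strsplit(y, ",")[[1]]))
}

# mapply pairs the i-th ground_truth with the i-th predicted
res <- unname(mapply(n_union, ground_truth, predicted))
res
```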

Loop over strings in r

I'd like to know what is wrong with my code rather than a solution. I wish to loop over some strings. My data is as follows:
id source transaction
1 a > b 6 > 0
2 J > k 5
3 b > c 4 > 0
I have a list and wish to go over it, find the rows that contain each element, and compute the average.
mylist <- c ("a", "b")
So my desired output for the elements in the list is
source avg
a 6
b 2
I do not know how to loop over the list and send the results to a csv file. I tried this
mylist <- c( "a", "b" )
for(i in mylist)
{
KeepData <- df [grepl(i, df$source), ]
KeepData <- cSplit(KeepData, "transaction", ">", "long")
avg<- mean(KeepData$transactions)
result <- list(i,avg )
write.table(result ,file="C:/Users.csv", append=TRUE,sep=",",col.names=FALSE,row.names=FALSE)
}
But it gives me an "NA" result with the following warning
Warning messages:
1: In mean.default(KeepData$transactions) :
  argument is not numeric or logical: returning NA
2: In mean.default(KeepData$transactions) :
  argument is not numeric or logical: returning NA

We can use cSplit to split the 'source' and convert the dataset to 'long' format, then specify the 'i', grouped by 'source', get the mean of 'transaction' (using data.table methods)
library(splitstackshape)
cSplit(df1, "source", " > ", "long")[source %in% mylist, .(avg = mean(transaction)), source]
# source avg
#1: a 6
#2: b 5
Or another option is separate_rows from tidyr to convert to 'long' format, then use the dplyr methods to summarise after grouping by 'source'
library(tidyr)
library(dplyr)
separate_rows(df1, source) %>%
filter(source %in% mylist) %>%
group_by(source) %>%
summarise(avg = mean(transaction))
Update
For the new dataset ('df2'), we need to split both the columns to 'long' format, and then get the mean of 'transaction' grouped by 'source'
cSplit(df2, 2:3, " > ", "long")[source %in% mylist, .(avg = mean(transaction)), source]
# source avg
#1: a 6
#2: b 2
The for loop can be modified to
for(i in mylist) {
KeepData <- cSplit(df2, 2:3, ">", "long")
KeepData <- KeepData[grepl(i, source)]
avg<- mean(KeepData$transaction)
result <- list(i,avg )
print(result)
write.table(result ,file="C:/Users.csv",
append=TRUE,sep=",",col.names=FALSE,row.names=FALSE)
}
#[[1]]
#[1] "a"
#[[2]]
#[1] 6
#[[1]]
#[1] "b"
#[[2]]
#[1] 2
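One visible difference between the original loop and the modified one is the column name passed to mean: transactions versus transaction. The sketch below, on a made-up two-row data frame, shows how a misspelled column name produces an NA of that kind.

```r
df <- data.frame(transaction = c(6, 0))

# `$` with a name that does not exist returns NULL instead of an error
missing_col <- df$transactions
is.null(missing_col)

# mean(NULL) returns NA and warns "argument is not numeric or logical"
avg_bad <- suppressWarnings(mean(missing_col))

# With the correct column name the mean is computed as expected
avg_ok <- mean(df$transaction)
```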
data
df1 <- structure(list(id = 1:3, source = c("a > b", "J > k", "b > c"
), transaction = c(6L, 5L, 4L)), .Names = c("id", "source", "transaction"
), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(id = 1:3, source = c("a > b", "J > k", "b > c"
), transaction = c("6 > 0", "5", "4 > 0")), .Names = c("id",
"source", "transaction"), class = "data.frame", row.names = c(NA,
-3L))
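For comparison, the split-to-long-then-average idea can be sketched in base R without splitstackshape. This assumes the separator is exactly " > " and that a missing transaction should become NA; the names src, trn and long are made up for illustration.

```r
df2 <- data.frame(id = 1:3,
                  source = c("a > b", "J > k", "b > c"),
                  transaction = c("6 > 0", "5", "4 > 0"),
                  stringsAsFactors = FALSE)
mylist <- c("a", "b")

# Split both columns on the separator, one list element per input row
src <- strsplit(df2$source, " > ", fixed = TRUE)
trn <- strsplit(df2$transaction, " > ", fixed = TRUE)

# Pair each source with the transaction at the same position;
# indexing past the end of tr pads short rows with NA
long <- do.call(rbind, Map(function(s, tr) {
  data.frame(source = s, transaction = as.numeric(tr[seq_along(s)]),
             stringsAsFactors = FALSE)
}, src, trn))

# Average per source, then keep only the requested elements
avg <- tapply(long$transaction, long$source, mean)
avg[mylist]
```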

replacing blank not NA

I have two variables a and b
a b
vessel hot
parts
nest NA
best true
neat smooth
I want to replace blank in b with a
la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
But it is not working

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), specify the condition in 'i' (b==''), and assign the values of 'a' that corresponds to TRUE values in 'i' to 'b'. It should be fast as we are assigning in place.
library(data.table)
setDT(df1)[b=='', b:= a]
df1
# a b
#1: vessel hot
#2: parts parts
#3: nest NA
#4: best true
#5: neat smooth
Or we can just use base R
i1 <- df1$b=='' & !is.na(df1$b)
df1$b[i1] <- df1$a[i1]
data
df1 <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a", "b"
), class = "data.frame", row.names = c(NA, -5L))
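The NA in b is what makes a plain b == "" awkward as an index, because the comparison returns NA for that row. As a sketch of one more base R variant, %in% never returns NA, so the extra is.na() check can be dropped; blank is a made-up name.

```r
df1 <- data.frame(a = c("vessel", "parts", "nest", "best", "neat"),
                  b = c("hot", "", NA, "true", "smooth"),
                  stringsAsFactors = FALSE)

# %in% gives FALSE (not NA) for the NA entry, so it is a safe index
blank <- df1$b %in% ""

# Copy a into b only where b is an empty string; NA stays NA
df1$b[blank] <- df1$a[blank]
df1
```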

instead of
# la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
# what is i1? it doesn't seem to have any obvious function here
... it should be:
la$b <- ifelse(la$b == "", la$a, la$b)
This assumes that you want to replace every blank in b with the corresponding value of a.
It works:
df <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a",
"b"), row.names = c(NA, -5L), class = "data.frame")
df$b <- ifelse(df$b=="", df$a, df$b)
# or, with `with`: df$b <- with(df, ifelse(b=="",a,b))
# > df
# a b
# 1 vessel hot
# 2 parts parts
# 3 nest <NA>
# 4 best true
# 5 neat smooth