Loop over strings in R

I'd like to know what is wrong with my code rather than just get a solution. I wish to loop over some strings. My data is as follows:
id source transaction
1 a > b 6 > 0
2 J > k 5
3 b > c 4 > 0
I have a list and wish to go over it, find the rows that contain each element, and compute the average of their transactions.
mylist <- c ("a", "b")
So my desired output for the elements in the list is
source avg
a 6
b 2
I do not know how to loop over the list and write the results to a csv file. I tried this
mylist <- c( "a", "b" )
for(i in mylist)
{
KeepData <- df [grepl(i, df$source), ]
KeepData <- cSplit(KeepData, "transaction", ">", "long")
avg<- mean(KeepData$transactions)
result <- list(i,avg )
write.table(result ,file="C:/Users.csv", append=TRUE,sep=",",col.names=FALSE,row.names=FALSE)
}
But it gives me an "NA" result with the following warnings
Warning messages:
1: In mean.default(KeepData$transactions) :
argument is not numeric or logical: returning NA
2: In mean.default(KeepData$transactions) :
argument is not numeric or logical: returning NA

We can use cSplit to split the 'source' column and convert the dataset to 'long' format, then keep the rows whose 'source' is in mylist (the 'i' argument) and, grouped by 'source', get the mean of 'transaction' (using data.table methods)
library(splitstackshape)
cSplit(df1, "source", " > ", "long")[source %in% mylist, .(avg = mean(transaction)), source]
# source avg
#1: a 6
#2: b 5
Or another option is separate_rows from tidyr to convert to 'long' format, then use the dplyr methods to summarise after grouping by 'source'
library(tidyr)
library(dplyr)
separate_rows(df1, source) %>%
filter(source %in% mylist) %>%
group_by(source) %>%
summarise(avg = mean(transaction))
Update
For the new dataset ('df2'), we need to split both the columns to 'long' format, and then get the mean of 'transaction' grouped by 'source'
cSplit(df2, 2:3, " > ", "long")[source %in% mylist, .(avg = mean(transaction)), source]
# source avg
#1: a 6
#2: b 2
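The for loop can be modified as follows. In the original, KeepData$transactions refers to a column that does not exist (the column is named transaction), so mean() receives NULL and returns NA; the 'source' column also has to be split along with 'transaction' so that each value lines up with its own source.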
for(i in mylist) {
KeepData <- cSplit(df2, 2:3, ">", "long")
KeepData <- KeepData[grepl(i, source)]
avg<- mean(KeepData$transaction)
result <- list(i,avg )
print(result)
write.table(result ,file="C:/Users.csv",
append=TRUE,sep=",",col.names=FALSE,row.names=FALSE)
}
#[[1]]
#[1] "a"
#[[2]]
#[1] 6
#[[1]]
#[1] "b"
#[[2]]
#[1] 2
data
df1 <- structure(list(id = 1:3, source = c("a > b", "J > k", "b > c"
), transaction = c(6L, 5L, 4L)), .Names = c("id", "source", "transaction"
), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(id = 1:3, source = c("a > b", "J > k", "b > c"
), transaction = c("6 > 0", "5", "4 > 0")), .Names = c("id",
"source", "transaction"), class = "data.frame", row.names = c(NA,
-3L))
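As a rough illustration of the diagnosis, here is a minimal base R sketch (no packages) that first shows why the misspelled column yields NA and then reproduces the averages. The helper names src, trn, long, keep and avg are made up for this example, and padding the shorter side with NA is an assumption about how the rows should line up.

```r
# Same data as df2 above, rebuilt so the sketch runs on its own
df2 <- data.frame(
  id = 1:3,
  source = c("a > b", "J > k", "b > c"),
  transaction = c("6 > 0", "5", "4 > 0"),
  stringsAsFactors = FALSE
)
mylist <- c("a", "b")

# The misspelled column does not exist, so `$` returns NULL and mean() gives NA
is.null(df2$transactions)
suppressWarnings(mean(df2$transactions))

# Split both columns on " > " and pair the pieces row by row
src <- strsplit(df2$source, " > ", fixed = TRUE)
trn <- strsplit(df2$transaction, " > ", fixed = TRUE)
long <- do.call(rbind, Map(function(s, tr) {
  n <- max(length(s), length(tr))  # pad the shorter side with NA
  data.frame(source = s[seq_len(n)],
             transaction = as.numeric(tr[seq_len(n)]),
             stringsAsFactors = FALSE)
}, src, trn))

# Keep only the sources of interest and average per source
keep <- long[long$source %in% mylist, ]
avg <- tapply(keep$transaction, keep$source, mean)
avg
```

Here avg is a named vector, so avg[["a"]] and avg[["b"]] give the per-source means (6 and 2 for this data).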

Related

assign names to data frames columns in a list

I have a list of data frames
# Create dummy data
df1<-data.frame( c(1,2,3),c(2,3,4))
df2<-data.frame(c(5,6,7),c(4,5,6))
# Create a list
l<-list(df1, df2)
I would like to assign names to columns. As l[[1]][,1] gives me access to the first column, I thought I could assign 'names' as the first column name by:
l<-lapply(l, function(x)names(x[[1]][,1]<-"names"))
But this gives me an error
Error in x[[1]][, 1] <- "names" :
incorrect number of subscripts on matrix
Edit: Added some dput
initial data
dput(lapply(head(results1, 2), head, 2))
list(structure(c(1.27679607834331, 1.05090175857491), .Dim = 2:1, .Dimnames = list(
c("..a15.pdf", "..a17.pdf"), "x")), structure(c(2.096687569578,
2.19826038300833), .Dim = 2:1, .Dimnames = list(c("..a15.pdf",
"..a17.pdf"), "x")))
after trying to assign the name
dput(lapply(head(results1, 2), head, 2))
list(structure(c(1.27679607834331, 1.05090175857491), .Dim = 2:1, .Dimnames = list(
c("..a15.pdf", "..a17.pdf"), "names")), structure(c(2.096687569578,
2.19826038300833), .Dim = 2:1, .Dimnames = list(c("..a15.pdf",
"..a17.pdf"), "names")))
Output:
results1[1]
[[1]]
names
..a15.pdf 1.27679608
..a17.pdf 1.05090176
..a18.pdf 1.51820192
..a21.pdf 2.30296037
..a2TTT.pdf 1.48568732
You can subset the names of the dataframe:
l <- lapply(l, function(x) {names(x)[1] <-"names";x})
l
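The trailing x in that function matters. A small sketch with the dummy data from the question (bad and good are names invented here) suggests what happens with and without it:

```r
# Dummy list of data frames, as in the question
l <- list(data.frame(c(1, 2, 3), c(2, 3, 4)),
          data.frame(c(5, 6, 7), c(4, 5, 6)))

# Without a trailing `x`, the function returns the assigned value "names",
# so the list ends up holding strings instead of data frames
bad <- lapply(l, function(x) names(x)[1] <- "names")

# With it, each renamed data frame is returned
good <- lapply(l, function(x) { names(x)[1] <- "names"; x })

bad[[1]]
names(good[[1]])
```

An assignment inside a function evaluates to the assigned value, which is why the modified data frame has to be returned explicitly.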
In tidyverse -
library(dplyr)
library(purrr)
l <- map(l, ~.x %>% rename_with(~'names', 1))
From the updated data it seems you have a list of matrices, and what looks like the first column is actually the row names, which you can convert to a column and name.
lapply(results1, function(x) {
mat <- cbind.data.frame(names = rownames(x), x)
rownames(mat) <- NULL
mat
})
#[[1]]
# names x
#1 ..a15.pdf 1.28
#2 ..a17.pdf 1.05
#[[2]]
# names x
#1 ..a15.pdf 2.1
#2 ..a17.pdf 2.2

How to collapse values in a list to allow a list column in a dataframe to be converted to a vector?

I have a dataframe, df:
df <- structure(list(ID = c("ID1", "ID2", "ID3"), values = list(A = "test",
B = c("test2", "test3"), C = "test4")), row.names = c(NA,
-3L), class = "data.frame")
df
ID values
1 ID1 test
2 ID2 test2, test3
3 ID3 test4
sapply(df, class)
ID values
"character" "list"
I'm trying to create a function that will run through each row of df$values, and if the length is greater than one, paste the values into one string. So the data frame will look the same, but will have a different structure:
df
ID values
1 ID1 test
2 ID2 test2, test3
3 ID3 test4
dput(df)
structure(list(ID = c("ID1", "ID2", "ID3"), values = c("test",
"test2, test3", "test4")), class = "data.frame", row.names = c(NA,
-3L))
sapply(df, class)
ID values
"character" "character"
(Note how in the end result, both columns are character columns, rather than a character column and a list).
I tried making a function to do this, but it doesn't work (and is very messy):
newcol <- NULL
for (i in nrow(df)) {
row <- df$values[i] %>%
unlist(., use.names = FALSE)
if (length(row) == 1) {
newcol = rbind(row, newcol)
} else if (length(row)>1) {
row = paste0(row[1], ", ", row[2])
newcol = rbind(row, newcol)
}
}
df$values <- newcol
Is there an easier way to do this (that works), and that can do it for any size of list entry (e.g. if df$values has a row entry that was "test6", "test7", "test8", "test9")?
We can use sapply with toString :
df$values <- sapply(df$values, toString)
sapply(df, class)
# ID values
#"character" "character"
str(df)
#'data.frame': 3 obs. of 2 variables:
# $ ID : chr "ID1" "ID2" "ID3"
# $ values: chr "test" "test2, test3" "test4"
toString(x) is shorthand for paste(x, collapse = ', ').
df$values <- sapply(df$values, paste, collapse = ', ')
Using tidyverse
library(dplyr)
library(purrr)
df <- df %>%
mutate(values = map_chr(values, toString))
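A slightly stricter base R variant, sketched here as one possible option, uses vapply so that each list entry must collapse to exactly one string; USE.NAMES = FALSE is an extra choice made for this example to drop the A/B/C names.

```r
# Data frame with a list column, as in the question
df <- structure(list(ID = c("ID1", "ID2", "ID3"),
                     values = list(A = "test",
                                   B = c("test2", "test3"),
                                   C = "test4")),
                row.names = c(NA, -3L), class = "data.frame")

# vapply checks that every entry collapses to a single character string
df$values <- vapply(df$values, toString, character(1), USE.NAMES = FALSE)

sapply(df, class)
df$values
```

Because toString joins with a comma and a space, entries of any length are handled the same way.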

Removing the special symbols in data.frame column values

I have two data frames, each with a column name
df1:
name
#one2
!iftwo
there_2_go
come&go
df1 = structure(list(name = c("#one2", "!iftwo", "there_2_go", "come&go")),.Names = c("name"), row.names = c(NA, -4L), class = "data.frame")
df2:
name
One2
IfTwo#
there-2-go
come.go
df2 = structure(list(name = c("One2", "IfTwo#", "there-2-go", "come.go")),.Names = c("name"), row.names = c(NA, -4L), class = "data.frame")
Comparing the two data frames for mismatches with %in% is cumbersome because of the special symbols. Removing the special symbols with stringr could be useful, but how exactly can we use stringr functions together with %in% and display the mismatches between the two?
I have already used mutate() with tolower() to convert everything to lowercase, as follows
df1<-mutate(df1,name=tolower(df1$name))
df2<-mutate(df2,name=tolower(df2$name))
Current output of comparison:
df2[!(df2 %in% df1),]
[1] "one2" "iftwo#" "there-2-go" "come.go"
Expected output as essentially the contents are same but with special symbols:
df2[!(df2 %in% df1),]
character(0)
Question : How do we ignore the symbols in the contents of the Frame
Here it is as a function:
f1 <- function(df1, df2){
i1 <- tolower(gsub('[[:punct:]]', '', df1$name))
i2 <- tolower(gsub('[[:punct:]]', '', df2$name))
d1 <- sapply(i1, function(i) grepl(paste(i2, collapse = '|'), i))
return(!d1)
}
f1(df1, df2)
# one2 iftwo there2go comego
# FALSE FALSE FALSE FALSE
#or use it for indexing,
df2[f1(df1, df2),]
#character(0)
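Since grepl with an alternation pattern matches substrings, a stricter alternative is to normalise both columns and compare them exactly with %in%. The sketch below assumes exact matching after cleaning is what is wanted; normalise and mismatch are names invented for this example.

```r
df1 <- data.frame(name = c("#one2", "!iftwo", "there_2_go", "come&go"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("One2", "IfTwo#", "there-2-go", "come.go"),
                  stringsAsFactors = FALSE)

# Strip punctuation and lower-case so only letters and digits are compared
normalise <- function(x) tolower(gsub("[[:punct:]]", "", x))

# Original df2 names whose cleaned form has no exact counterpart in df1
mismatch <- df2$name[!(normalise(df2$name) %in% normalise(df1$name))]
mismatch
```

For the sample data every cleaned name has a counterpart, so mismatch is empty; any name in df2 without one would be listed in its original form.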

replacing blank not NA

I have two variables a and b
a b
vessel hot
parts
nest NA
best true
neat smooth
I want to replace blank in b with a
la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
But it is not working
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), specify the condition in 'i' (b==''), and assign the values of 'a' that corresponds to TRUE values in 'i' to 'b'. It should be fast as we are assigning in place.
library(data.table)
setDT(df1)[b=='', b:= a]
df1
# a b
#1: vessel hot
#2: parts parts
#3: nest NA
#4: best true
#5: neat smooth
Or we can just use base R
i1 <- df1$b=='' & !is.na(df1$b)
df1$b[i1] <- df1$a[i1]
data
df1 <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a", "b"
), class = "data.frame", row.names = c(NA, -5L))
instead of
# la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
# what is i1? it doesn't seem to have any obvious function here
... it should be:
la$b <- ifelse(la$b == "", la$a, la$b)
This assumes that you want to replace every blank in b with the corresponding value of a.
It works:
df <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a",
"b"), row.names = c(NA, -5L), class = "data.frame")
df$b <- ifelse(df$b=="", df$a, df$b)
# or, with `with`: df$b <- with(df, ifelse(b=="",a,b))
# > df
# a b
# 1 vessel hot
# 2 parts parts
# 3 nest <NA>
# 4 best true
# 5 neat smooth
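The NA handling is the subtle part here: comparing with == returns NA for the missing value, which is why the base R answer adds !is.na(...). One possible shortcut, sketched below, is %in%, which returns FALSE rather than NA for missing values; raw_test and blank are names made up for this example.

```r
df1 <- data.frame(
  a = c("vessel", "parts", "nest", "best", "neat"),
  b = c("hot", "", NA, "true", "smooth"),
  stringsAsFactors = FALSE
)

# `==` propagates NA for the missing value in b
raw_test <- df1$b == ""

# `%in%` never returns NA, so it can be used directly as an index
blank <- df1$b %in% ""
df1$b[blank] <- df1$a[blank]

raw_test
df1$b
```

Only the blank entry is replaced; the NA in b is left untouched.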

Converting given list into dataframe

I have the following list:
$id1
$id1[[1]]
A B
"A" "B"
$id1[[2]]
A B
"A" "A1"
$id2
$id2[[1]]
A B
"A2" "B2"
In R-pastable form:
dat = structure(list(SampleTable = structure(list(id2 = list(structure(c("90", "7"), .Names = c("T", "G")), structure(c("90", "8"), .Names = c("T", "G"))), id1 = structure(c("1", "1"), .Names = c("T", "G"))), .Names = c("id2", "id1"))), .Names = "SampleTable")
I want this given list to be converted into following dataframe:
id1 A B
id1 A A1
id2 A2 B2
Your data structure (apparently a named list of unnamed lists of 1-row data.frames) is a bit complicated: the easiest may be to use a loop to build the data.frame.
It can be done directly with do.call, lapply and rbind, but it is not very readable, even if you are familiar with those functions.
# Sample data
d <- list(
id1 = list(
data.frame( x=1, y=1 ),
data.frame( x=2, y=2 )
),
id2 = list(
data.frame( x=3, y=3 ),
data.frame( x=4, y=4 )
),
id3 = list(
data.frame( x=5, y=5 ),
data.frame( x=6, y=6 )
)
)
# Convert
d <- data.frame(
id=rep(names(d), unlist(lapply(d,length))),
do.call( rbind, lapply(d, function(u) do.call(rbind, u)) )
)
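To make the one-liner above easier to follow, the same idea can be unpacked into named steps. This is only a sketch on a smaller, made-up list; per_id, counts and out are names invented here.

```r
# Smaller sample list with an uneven number of data frames per id
d <- list(
  id1 = list(data.frame(x = 1, y = 1), data.frame(x = 2, y = 2)),
  id2 = list(data.frame(x = 3, y = 3))
)

# Step 1: stack the data frames belonging to each id
per_id <- lapply(d, function(u) do.call(rbind, u))

# Step 2: count rows per id so each name can be repeated to match
counts <- vapply(per_id, nrow, integer(1))

# Step 3: stack everything and attach the id column
out <- data.frame(id = rep(names(per_id), counts),
                  do.call(rbind, per_id),
                  stringsAsFactors = FALSE)
rownames(out) <- NULL
out
```

Counting rows after the inner rbind (rather than list lengths) keeps the id column aligned even if some data frames have more than one row.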
Another solution, using a loop, for a ragged data structure that contains vectors (not data.frames), as explained in the comments.
d <- structure(list(SampleTable = structure(list(id2 = list(structure(c("90", "7"), .Names = c("T", "G")), structure(c("90", "8"), .Names = c("T", "G"))), id1 = structure(c("1", "1"), .Names = c("T", "G"))), .Names = c("id2", "id1"))), .Names = "SampleTable")
result <- list()
for(i in seq_along(d$SampleTable)) {
id <- names(d$SampleTable)[i]
block <- d$SampleTable[[i]]
if(is.atomic(block)) {
block <- list(block)
}
for(row in block) {
result <- c(result, list(data.frame(id, as.data.frame(t(row)))))
}
}
result <- do.call(rbind, result)
NOTE! I could not get melt and cast working on this kind of ragged data (I tried for over an hour...). I am going to leave this answer here to show that, for this kind of operation, the reshape package could also be used.
Using Vincent's example data, you can use melt and cast from the reshape package:
library(reshape)
res = cast(melt(d))[-1]
names(res) = c("id","x","y")
res
id x y
1 id1 1 1
2 id2 3 3
3 id3 5 5
4 id1 2 2
5 id2 4 4
6 id3 6 6
The order in the resulting data.frame is not the same, but the result is identical. And the code is a bit shorter. I use the [-1] to delete the first column which is also returned by melt. This additional variable is the column index of the individual data.frames in the list of lists. Just have a look at the result of melt(d), that will hopefully make it more clear.
This is a bit messier than you let on. That dat object has an extra "layer" above it, so it is easier to work with dat[[1]]:
dfrm <- data.frame(dat[[1]], stringsAsFactors=FALSE)
names(dfrm) <- sub("\\..+$", "", names(dfrm))
> dfrm
id2 id2 id1
T 90 90 1
G 7 8 1
> t(dfrm)
T G
id2 "90" "7"
id2 "90" "8"
id1 "1" "1"
