Given this data.frame:
library(dplyr)
library(stringr)
ml.mat2 <- structure(list(value = c("a", "b", "c"), ground_truth = c("label1, label3",
"label2", "label1"), predicted = c("label1", "label2,label3",
"label1")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L))
glimpse(ml.mat2)
Observations: 3
Variables: 3
$ value <chr> "a", "b", "c"
$ ground_truth <chr> "label1, label3", "label2", "label1"
$ predicted <chr> "label1", "label2,label3", "label1"
I want to measure the length of the union of ground_truth and predicted for each row, after splitting the comma-separated labels on ",".
In other words, I would expect a result of length 3 with values of 2 2 1.
I wrote a function to do this, but it only seems to work outside of sapply:
m_fn <- function(x,y) length(union(unlist(sapply(x, str_split,",")),
unlist(sapply(y, str_split,","))))
m_fn(ml.mat2$ground_truth[1], y = ml.mat2$predicted[1])
[1] 2
m_fn(ml.mat2$ground_truth[2], y = ml.mat2$predicted[2])
[1] 2
m_fn(ml.mat2$ground_truth[3], y = ml.mat2$predicted[3])
[1] 1
Rather than iterating through the rows of the data set manually like this or with a loop, I would expect to be able to vectorize the solution with sapply like this:
sapply(ml.mat2$ground_truth, m_fn, ml.mat2$predicted)
However, the unexpected results are:
label1, label3 label2 label1
4 3 3
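sapply(ml.mat2$ground_truth, m_fn, ml.mat2$predicted) passes the whole predicted column as y on every call, which is why the counts are too large. Since you're iterating over vectors of the same length, you can generate an index of row numbers and pass it to sapply: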
sapply(1:nrow(ml.mat2), function(i) m_fn(x = ml.mat2$ground_truth[i], y = ml.mat2$predicted[i]))
#[1] 2 2 1
or with seq_len:
sapply(seq_len(nrow(ml.mat2)), function(i)
m_fn(x = ml.mat2$ground_truth[i], y = ml.mat2$predicted[i]))
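A pairwise alternative, as a sketch: mapply walks both vectors in step, so no row index is needed. The helper union_size below is a hypothetical base R rewrite of m_fn, and trimming whitespace around each label is my assumption about the intent (the original keeps " label3" with its leading space).

```r
# Data from the question, rebuilt as a plain data.frame (no packages needed)
ml.mat2 <- data.frame(
  value = c("a", "b", "c"),
  ground_truth = c("label1, label3", "label2", "label1"),
  predicted = c("label1", "label2,label3", "label1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split both strings on commas, trim spaces,
# and count the distinct labels (i.e. the size of the union)
union_size <- function(x, y) {
  labels <- trimws(unlist(strsplit(c(x, y), ",", fixed = TRUE)))
  length(unique(labels))
}

# mapply calls union_size(x[i], y[i]) for each i
res <- mapply(union_size, ml.mat2$ground_truth, ml.mat2$predicted,
              USE.NAMES = FALSE)
res
#[1] 2 2 1
```

USE.NAMES = FALSE only drops the long strings that would otherwise be attached as names to the result.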
I have a dataset which looks something like this:
print(animals_in_zoo)
// I only know the name of the first column, the second one is dynamic/based on a previously calculated variable
animals | dynamic_column_name
// What the data looks like
elefant x
turtle
monkey
giraffe x
swan
tiger x
What I want is to collect the rows in which the second column's value is equal to "x".
In SQL terms, something like:
SELECT * FROM data WHERE col2 = 'x';
After that, I want to grab only the first column and create a string object like "elefant giraffe tiger", but that is the easy part.
You can reference that column by its index and use that to get the animals you want:
df1 <- structure(list(animal = c("elefant", "turtle", "monkey", "giraffe",
"swan", "tiger"), dynamic_column = c("x", NA, NA, "x", NA, "x"
)), row.names = c(NA, -6L), class = "data.frame")
df1[, 1][df1[, 2] == "x" & !is.na(df1[, 2])]
#> [1] "elefant" "giraffe" "tiger"
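We could use filter with grepl, which searches for the pattern 'x' in each string. Note that this matches an 'x' anywhere in the value, so it only fits data where the marker is part of the same column: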
# the data frame
df <- read.table(header = TRUE, text =
'my_col
"elefant x"
turtle
monkey
"giraffe x"
swan
"tiger x"'
)
library(dplyr)
df %>%
filter(grepl('x', my_col))
my_col
1 elefant x
2 giraffe x
3 tiger x
Use [: the first argument refers to the rows. You want the rows where the second column is "x". The second argument is the column you need in the end, and you want the column named "animals":
dat[dat[2] == "x", "animals"]
#[1] "elefant" "giraffe" "tiger"
data
dat <- structure(list(animals = c("elefant", "turtle", "monkey", "giraffe",
"swan", "tiger"), V2 = c("x", "", "", "x", "", "x")), row.names = c(NA,
-6L), class = "data.frame")
# animals V2
# 1 elefant x
# 2 turtle
# 3 monkey
# 4 giraffe x
# 5 swan
# 6 tiger x
I guess you have a dataframe?
If so, something like df[df$col2 == 'x',] should work.
With base functions, you can do it like this:
# Option 1
your_dataframe[your_dataframe$col2 == "x", ]
# Option 2
your_dataframe[your_dataframe[,2] == "x", ]
With dplyr functions, you can do it like this:
library(dplyr)
your_dataframe %>%
filter(col2 == "x")
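Since the second column's name is only known at run time, here is a sketch that looks the column up by position, treats NA as "no match", and builds the space-separated string the question asks for. The column name flag and the variables dyn_col, keep and result are made up for illustration.

```r
# Example data; "flag" stands in for the dynamically named second column
animals_in_zoo <- data.frame(
  animals = c("elefant", "turtle", "monkey", "giraffe", "swan", "tiger"),
  flag = c("x", NA, NA, "x", NA, "x"),
  stringsAsFactors = FALSE
)

# Fetch the name of the second column instead of hard-coding it
dyn_col <- names(animals_in_zoo)[2]

# %in% returns FALSE for NA, so no separate is.na() check is needed
keep <- animals_in_zoo[[dyn_col]] %in% "x"

# Collapse the matching animals into a single string
result <- paste(animals_in_zoo$animals[keep], collapse = " ")
result
#[1] "elefant giraffe tiger"
```

With ==, an NA in the second column would produce NA in the row index and therefore NA rows in the output; %in% avoids that.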
I have a function similar to this:
testfun = function(jID, kID, d){
  g = paste0(jID, kID)
  date = d
  bb = data.frame(g, date)
  return(bb)
}
Data frame:
x=data.frame(jID = c("a","b"),kID=c("c","d"),date="20170206",stringsAsFactors = FALSE)
I want to pass each row as inputs into the function. The solutions in "Passing multiple arguments to a function taken from dataframe" are great, but in that case the number of columns was known. How would a solution like this:
vtestfun <- (Vectorize(testfun, SIMPLIFY=FALSE))
vtestfun(x[,1],x[,2],x[,3])
be applied if the number of columns in the dataframe is not known or keeps changing?
If you can match the argument names to the column names like so:
testfun <- function(jID, kID, date){ # 'date', not 'd'
  g <- paste0(jID, kID)
  bb <- data.frame(g, date)
  return(bb)
}
You could do:
purrr::pmap(x, testfun)
Returning:
[[1]]
g date
1 ac 20170206
[[2]]
g date
1 bd 20170206
# Data used:
x <- structure(list(jID = c("a", "b"), kID = c("c", "d"), date = c("20170206", "20170206")), class = "data.frame", row.names = c(NA, -2L))
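If purrr is not available, a base R sketch of the same idea is to hand the columns to Map via do.call. As with pmap, this assumes the function's argument names match the column names; by_row is just an illustrative name.

```r
# Function whose argument names match the column names of x
testfun <- function(jID, kID, date) {
  data.frame(g = paste0(jID, kID), date = date, stringsAsFactors = FALSE)
}

x <- data.frame(jID = c("a", "b"), kID = c("c", "d"), date = "20170206",
                stringsAsFactors = FALSE)

# A data.frame is a list of columns, so each column becomes a named
# argument to Map, however many columns there are
by_row <- do.call(Map, c(list(f = testfun), x))

by_row[[1]]
#    g     date
# 1 ac 20170206
```

This would break if a column happened to be named f, since that name is taken by Map's own first argument.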
I am facing an issue while using the seq() function inside an ifelse statement. I have a dataframe which contains the following columns.
Dataframe(df): newmodel id
NewModel_1 30
NewModel_2 30
I need to increase the id value for these 2 rows, since the id should not be the same for two models. There is a constant value (99) from which we have to increment the id values based on the condition.
When I try to run the code below:
df %>% mutate(id=ifelse(any(grepl("NewModel_", df$newmodel)), seq(from =99+1, by =1, length.out=2) , id))
I am getting this output:
newmodel id
NewModel_1 100
NewModel_2 100
whereas the expected output is:
newmodel id
NewModel_1 100
NewModel_2 101
Can someone explain why this is happening?
Thanks in Advance
Are you looking for something like this?
inds <- grepl('NewModel_', df$newmodel)
df$id[inds] <- seq(100, by = 1, length.out = sum(inds))
df
# newmodel id
#1 NewModel_1 100
#2 NewModel_2 101
data
df <- structure(list(newmodel = c("NewModel_1", "NewModel_2"), id = c(30L,
30L)), class = "data.frame", row.names = c(NA, -2L))
I guess it is because the function somehow gets only the first item of the seq.
You can try it this way; it works here.
if(any(grepl("NewModel_", df$newmodel))) {
df$id <- seq(from = 99 + 1, length.out = (length(df$id)))
}
UPDATE: ifelse returns a result of the same length as its test, and here the test (any(...)) has length 1, so only the first element of the seq is kept and then recycled into every row. An alternative is to use an apply function.
The reason your ifelse(.) is failing is that ifelse keys its output length based on the input length of the conditional vector; if it is shorter than either of the yes= or no= vectors, the extra length is silently ignored. In your case, any(grepl("NewModel_", df$newmodel)) will never be other than length 1, so the output will be length 1.
For example:
ifelse(TRUE, 1:2, 3:4)
# [1] 1
ifelse(c(TRUE, FALSE), 1:2, 3:4)
# [1] 1 4
### and for an example of how R's overly-permissive recycling can go "wrong"
ifelse(c(TRUE, FALSE, TRUE), 1:2, 3:4)
# [1] 1 4 1
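Applied to the question, one way to keep ifelse well behaved is to give it a test with one element per row. A sketch, with an extra non-matching row ("Other") added purely for illustration and is_new as a made-up variable name:

```r
# Example data; the "Other" row is hypothetical, to show that
# non-matching rows keep their original id
df <- data.frame(
  newmodel = c("NewModel_1", "Other", "NewModel_2"),
  id = c(30L, 30L, 30L),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row, so ifelse returns one value per row
is_new <- grepl("NewModel_", df$newmodel, fixed = TRUE)

# cumsum(is_new) numbers the matching rows 1, 2, ...; add the constant 99
df$id <- ifelse(is_new, 99L + cumsum(is_new), df$id)
df$id
#[1] 100  30 101
```

Here the test, the yes values and the no values all have the same length, so nothing is recycled or dropped.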
Here's a quick method using match to assign a unique integer to each of the models.
base R
dat$newid <- 99 + match(dat$newmodel, unique(dat$newmodel))
dat
# newmodel id newid
# 1 NewModel_1 30 100
# 2 NewModel_2 30 101
dplyr
library(dplyr)
dat %>%
mutate(newid = 99 + match(newmodel, unique(newmodel)))
# newmodel id newid
# 1 NewModel_1 30 100
# 2 NewModel_2 30 101
Data
dat <- structure(list(newmodel = c("NewModel_1", "NewModel_2"), id = c(30L, 30L), newid = c(100, 101)), row.names = c(NA, -2L), class = "data.frame")
I am trying to write a loop that sums up consecutive columns of a table and outputs the results into another table.
For example, in my original table, we have columns a, b, c, etc, which contain the same number of numeric values.
The resulting table then should be a, a+b, a+b+c, etc up to the last column of the original table
I have a feeling a for loop should be sufficient for this operation; however, I can't get my head around the format and syntax.
Any help would be appreciated!
Since you're new, here is an example of a very minimal reproducible example:
library(data.table)
x = data.table(a=1:3,b=4:6,c=7:9)
for(... now what?
And here's a way to do your task:
library(data.table)
# make some dummy data
X = data.table(a=1:2,b=3:4,c=5:6)
# make an empty result table
Y = data.table()
# for i = 1 to the number of columns in X
for(i in 1:ncol(X)){
  # colnames(X) is "a" "b" "c".
  # colnames(X)[1:1] is "a", colnames(X)[1:2] is "a" "b", colnames(X)[1:3] is "a" "b" "c"
  # paste0(colnames(X)[1:1],collapse='') is "a",
  # paste0(colnames(X)[1:2],collapse='') is "ab",
  # paste0(colnames(X)[1:3],collapse='') is "abc"
  newcolname = paste0(colnames(X)[1:i],collapse='')
  # Y[,(newcolname):= is data.table syntax to create a new column called newcolname
  # X[,1:i] selects columns 1 to i
  # rowSums calculates the, um, row sums :D
  Y[,(newcolname):=rowSums(X[,1:i])]
}
Maybe you need Reduce like below
cbind(
df,
setNames(
as.data.frame(Reduce(`+`, df, accumulate = TRUE)),
Reduce(paste0, names(df), accumulate = TRUE)
)
)
such that
a b c a ab abc
1 1 4 7 1 5 12
2 2 5 8 2 7 15
3 3 6 9 3 9 18
Data
df <- structure(list(a = 1:3, b = 4:6, c = 7:9), class = "data.frame", row.names = c(NA,
-3L))
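A row-wise cumsum is another way to get the same running totals. This sketch assumes all columns are numeric and that there are at least two of them (with a single column, apply returns a plain vector rather than a matrix); cum is an illustrative name.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# apply(..., 1, cumsum) returns one column per original row,
# so transpose to get the rows back
cum <- as.data.frame(t(apply(as.matrix(df), 1, cumsum)))

# Name the columns "a", "ab", "abc" by accumulating the original names
names(cum) <- Reduce(paste0, names(df), accumulate = TRUE)
cum
#   a ab abc
# 1 1  5  12
# 2 2  7  15
# 3 3  9  18
```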
I have the following list:
$id1
$id1[[1]]
A B
"A" "B"
$id1[[2]]
A B
"A" "A1"
$id2
$id2[[1]]
A B
"A2" "B2"
In R-pastable form:
dat = structure(list(SampleTable = structure(list(id2 = list(structure(c("90", "7"), .Names = c("T", "G")), structure(c("90", "8"), .Names = c("T", "G"))), id1 = structure(c("1", "1"), .Names = c("T", "G"))), .Names = c("id2", "id1"))), .Names = "SampleTable")
I want this given list to be converted into following dataframe:
id1 A B
id1 A A1
id2 A2 B2
Your data structure (apparently a named list of unnamed lists of 1-row data.frames) is a bit complicated: the easiest may be to use a loop to build the data.frame.
It can be done directly with do.call, lapply and rbind, but it is not very readable, even if you are familiar with those functions.
# Sample data
d <- list(
id1 = list(
data.frame( x=1, y=1 ),
data.frame( x=2, y=2 )
),
id2 = list(
data.frame( x=3, y=3 ),
data.frame( x=4, y=4 )
),
id3 = list(
data.frame( x=5, y=5 ),
data.frame( x=6, y=6 )
)
)
# Convert
d <- data.frame(
id=rep(names(d), unlist(lapply(d,length))),
do.call( rbind, lapply(d, function(u) do.call(rbind, u)) )
)
Another solution, using a loop, for a ragged data structure containing vectors (not data.frames), as explained in the comments:
d <- structure(list(SampleTable = structure(list(id2 = list(structure(c("90", "7"), .Names = c("T", "G")), structure(c("90", "8"), .Names = c("T", "G"))), id1 = structure(c("1", "1"), .Names = c("T", "G"))), .Names = c("id2", "id1"))), .Names = "SampleTable")
result <- list()
for(i in seq_along(d$SampleTable)) {
  id <- names(d$SampleTable)[i]
  block <- d$SampleTable[[i]]
  if(is.atomic(block)) {
    block <- list(block)
  }
  for(row in block) {
    result <- c(result, list(data.frame(id, as.data.frame(t(row)))))
  }
}
result <- do.call(rbind, result)
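For comparison, the same ragged data can be handled with lapply and do.call instead of nested loops. This is only a sketch: it assumes every vector carries the same names (here T and G), and tab, rows and result2 are illustrative names.

```r
# The ragged structure from the question: id2 holds a list of named
# vectors, id1 holds a single named vector
dat <- list(SampleTable = list(
  id2 = list(c(T = "90", G = "7"), c(T = "90", G = "8")),
  id1 = c(T = "1", G = "1")
))

tab <- dat$SampleTable

rows <- lapply(names(tab), function(id) {
  block <- tab[[id]]
  # Wrap a lone vector so every block is a list of vectors
  if (is.atomic(block)) block <- list(block)
  # rbind the vectors into a character matrix, then prepend the id
  data.frame(id = id, do.call(rbind, block), stringsAsFactors = FALSE)
})

result2 <- do.call(rbind, rows)
result2
#    id  T G
# 1 id2 90 7
# 2 id2 90 8
# 3 id1  1 1
```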
NOTE! I could not get melt and cast working on this kind of ragged data (I tried for over an hour...). I am going to leave this answer here to show that for this kind of operation, the reshape package could also be used.
Using the example data of vincent, you can use melt and cast from the reshape package:
library(reshape)
res = cast(melt(d))[-1]
names(res) = c("id","x","y")
res
id x y
1 id1 1 1
2 id2 3 3
3 id3 5 5
4 id1 2 2
5 id2 4 4
6 id3 6 6
The order in the resulting data.frame is not the same, but the result is identical. And the code is a bit shorter. I use the [-1] to delete the first column which is also returned by melt. This additional variable is the column index of the individual data.frames in the list of lists. Just have a look at the result of melt(d), that will hopefully make it more clear.
This is a bit messier than you let on. That dat object has an extra "layer" above it, so it is easier to work with dat[[1]]:
dfrm <- data.frame(dat[[1]], stringsAsFactors=FALSE)
names(dfrm) <- sub("\\..+$", "", names(dfrm))
> dfrm
id2 id2 id1
T 90 90 1
G 7 8 1
> t(dfrm)
T G
id2 "90" "7"
id2 "90" "8"
id1 "1" "1"