I have a data.frame with column headers.
How can I get a specific row from the data.frame as a list (with the column headers as keys for the list)?
Specifically, my data.frame is
     A    B    C
1 5.00 4.25 4.50
2 3.50 4.00 2.50
3 3.25 4.00 4.00
4 4.25 4.50 2.25
5 1.50 4.50 3.00
And I want to get a row that's the equivalent of
> c(A=5, B=4.25, C=4.5)
   A    B    C
5.00 4.25 4.50
Use x[r, ], where r is the row you're interested in. Try this, for example:
#Add your data
x <- structure(list(A = c(5, 3.5, 3.25, 4.25, 1.5),
                    B = c(4.25, 4, 4, 4.5, 4.5),
                    C = c(4.5, 2.5, 4, 2.25, 3)),
               .Names = c("A", "B", "C"),
               class = "data.frame",
               row.names = c(NA, -5L))
#The vector your result should match
y <- c(A=5, B=4.25, C=4.5)
#Test that the items in the row match the vector you wanted
x[1, ] == y
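Note that x[r, ] is still a one-row data frame, not a plain list. Since a data frame is a list of columns underneath, a small conversion gives you the column headers as keys (a quick sketch using the x defined above):
as.list(x[1, ]) # a list with elements $A, $B, $C
unlist(x[1, ])  # a named vector, exactly c(A=5, B=4.25, C=4.5)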
Logical indexing is very R-ish. Try:
x[x$A == 5 & x$B == 4.25 & x$C == 4.5, ]
Or:
subset(x, A == 5 & B == 4.25 & C == 4.5)
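If you only need the row number rather than the row itself, which() on the same condition works (a small sketch against the x defined above):
which(x$A == 5 & x$B == 4.25 & x$C == 4.5)
# [1] 1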
Try:
> d <- data.frame(a=1:3, b=4:6, c=7:9)
> d
a b c
1 1 4 7
2 2 5 8
3 3 6 9
> d[1, ]
a b c
1 1 4 7
> d[1, ]['a']
a
1 1
If you don't know the row number but do know some of the values, then you can use subset:
x <- structure(list(A = c(5, 3.5, 3.25, 4.25, 1.5),
                    B = c(4.25, 4, 4, 4.5, 4.5),
                    C = c(4.5, 2.5, 4, 2.25, 3)),
               .Names = c("A", "B", "C"),
               class = "data.frame",
               row.names = c(NA, -5L))
subset(x, A == 5 & B == 4.25 & C == 4.5)
Ten years later ---> Using the tidyverse we can achieve this simply, borrowing a leaf from Christopher Bottoms' answer above. For a better grasp, see slice().
library(tidyverse)
x <- structure(list(A = c(5, 3.5, 3.25, 4.25, 1.5),
                    B = c(4.25, 4, 4, 4.5, 4.5),
                    C = c(4.5, 2.5, 4, 2.25, 3)),
               .Names = c("A", "B", "C"),
               class = "data.frame",
               row.names = c(NA, -5L))
x
#> A B C
#> 1 5.00 4.25 4.50
#> 2 3.50 4.00 2.50
#> 3 3.25 4.00 4.00
#> 4 4.25 4.50 2.25
#> 5 1.50 4.50 3.00
y <- c(A=5, B=4.25, C=4.5)
y
#> A B C
#> 5.00 4.25 4.50
#The slice() verb allows one to subset data row-wise.
x <- x %>% slice(1) # slice(n) for the nth row, slice(i:n) for rows i to n, slice(i:n()) for row i through the last
x
#> A B C
#> 1 5 4.25 4.5
#Test that the items in the row match the vector you wanted
x[1,]==y
#> A B C
#> 1 TRUE TRUE TRUE
Created on 2020-08-06 by the reprex package (v0.3.0)
Related
I have a list containing data frames:
test <- list()
test[[1]] <- data.frame(C1=c(0.2,0.4,0.5), C2=c(2,3.5,3.7), C3=c(0.3,4,5))
test[[2]] <- data.frame(C1=c(0.1,0.3,0.6), C2=c(3.9,4.3,8), C3=c(3,5.2,10))
test[[3]] <- data.frame(C1=c(0.4,0.55,0.8), C2=c(8.9,10.3,14), C3=c(7,8.4,11))
I'd like to find, among all the rows of all the data frames in this list, the row whose value in a given column (C2 in this example) is closest to each element of the vector below, as well as the list index (1, 2 or 3 in this example) where it occurred.
vector <- c(3, 14.4, 7, 0)
The desired answer would be something like:
list.index line.number.in.df  C1   C2   C3
         1                 2 0.4  3.5  4.0
         3                 3 0.8 14.0 11.0
         2                 3 0.6  8.0 10.0
         1                 1 0.2  2.0  0.3
Using lapply I managed to solve about 10% of the problem for a single value, but I couldn't do it for a whole vector of values. Also, I got the closest row from every data frame in the list (not just the single closest row across all of them), and I couldn't recover the corresponding list index either, i.e.:
value <- 3
lapply(test, function(x) x[which.min(abs(value-x$C2)),])
Result I got:
[[1]]
C1 C2 C3
2 0.4 3.5 4
[[2]]
C1 C2 C3
1 0.1 3.9 3
[[3]]
C1 C2 C3
1 0.4 8.9 7
Would anyone be so kind and patient as to get me further on this?
Thanks in advance and Happy New Year.
Here is a dplyr approach. We can generate the list.index and line.number.in.df for each data frame and then bind_rows them together. Next, slice the rows where C2 contains the closest value for each number in that vector.
library(dplyr)
test <- list(structure(list(C1 = c(0.2, 0.4, 0.5), C2 = c(2, 3.5, 3.7
), C3 = c(0.3, 4, 5)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(C1 = c(0.1, 0.3, 0.6), C2 = c(3.9, 4.3,
8), C3 = c(3, 5.2, 10)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(C1 = c(0.4, 0.55, 0.8), C2 = c(8.9, 10.3,
14), C3 = c(7, 8.4, 11)), class = "data.frame", row.names = c(NA,
-3L)))
vector <- c(3, 14.4, 7, 0)
test %>%
  lapply(tibble::rowid_to_column, "line.number.in.df") %>%
  bind_rows(.id = "list.index") %>%
  slice(vapply(vector, \(x) which.min(abs(x - C2)), integer(1L)))
Output is
list.index line.number.in.df C1 C2 C3
1 1 2 0.4 3.5 4.0
2 3 3 0.8 14.0 11.0
3 2 3 0.6 8.0 10.0
4 1 1 0.2 2.0 0.3
You could exploit the substrings of the names.
(w <- sapply(v, \(v)
  names(which.min(abs(unlist(setNames(test, seq_along(test))) - v)))))
# [1] "2.C31" "3.C23" "3.C31" "2.C11"
t(mapply(\(x, y) c(list = x, line = y, test[[x]][y, ]),
         as.numeric(substr(w, 1, 1)), as.numeric(substring(w, 5)))) |>
  as.data.frame()
# list line C1 C2 C3
# 1 2 1 0.1 3.9 3
# 2 3 3 0.8 14 11
# 3 3 1 0.4 8.9 7
# 4 2 1 0.1 3.9 3
Note: R >= 4.1 used.
Data:
test <- list(structure(list(C1 = c(0.2, 0.4, 0.5), C2 = c(2, 3.5, 3.7
), C3 = c(0.3, 4, 5)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(C1 = c(0.1, 0.3, 0.6), C2 = c(3.9, 4.3,
8), C3 = c(3, 5.2, 10)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(C1 = c(0.4, 0.55, 0.8), C2 = c(8.9, 10.3,
14), C3 = c(7, 8.4, 11)), class = "data.frame", row.names = c(NA,
-3L)))
v <- c(3, 14.4, 7, 0)
I hope this is what you are looking for. For each element of vector, it finds the value in each column of each element of test that is closest to that element.
#install.packages('birk')
library(birk) # required for which.closest()
# find which of the values across the columns C1:C3 in each element of test are closest
# to the values of vector and return the corresponding row numbers
x <- sapply(1:length(vector), \(x) sapply(test, \(i) apply(i, 2, \(j) which.closest(j, vector[x]))))
x <- apply(x, 1, \(x) as.data.frame(table(x)))
x <- lapply(x, \(i) i[which.max(i[, 2]), ])
row_numbers_df <- as.numeric(matrix(do.call(rbind, x)[['x']]))
# extract the values in each of the column C1:C3 corresponding to row_numbers_df
vals <- array(0, dim = length(row_numbers_df))
for (i in 1:length(row_numbers_df)) { vals[i] <- do.call(cbind, test)[row_numbers_df[i], i] }
# how many columns does each data.frame embedded in test have?
unique_number_of_cols <- unique(sapply(test, ncol))
# store results in a data.frame
r <- \(x) round(x, 1)
out <- data.frame(
seq_len(length(test)),
r(rowMeans(matrix(row_numbers_df, ncol = unique_number_of_cols, byrow = TRUE))),
matrix(vals, ncol = unique_number_of_cols, byrow = TRUE)
)
names(out) <- c('list.index', 'line.number.in.df', sapply(test, colnames)[, 1])
Result
> out
list.index line.number.in.df C1 C2 C3
1 1 3.0 0.5 3.7 5
2 2 1.7 0.6 3.9 3
3 3 1.7 0.8 8.9 7
Alternatively, if you really want to have each line.number.in.df per unique column, then you can easily store them as separate columns in out.
x <- sapply(1:length(vector), \(x) sapply(test, \(i) apply(i, 2, \(j) which.closest(j, vector[x]))))
x <- apply(x, 1, \(x) as.data.frame(table(x)))
x <- lapply(x, \(i) i[which.max(i[, 2]), ])
row_numbers_df <- as.numeric(matrix(do.call(rbind, x)[['x']]))
names(row_numbers_df) <- do.call(c, lapply(test, names))
row_numbers_df
vals <- array(0, dim = length(row_numbers_df))
for (i in 1:length(row_numbers_df)) { vals[i] <- do.call(cbind, test)[row_numbers_df[i], i] }
unique_number_of_cols <- unique(sapply(test, ncol))
out <- data.frame(
seq_len(length(test)),
split(row_numbers_df, names(row_numbers_df)),
matrix(vals, ncol = unique_number_of_cols, byrow = TRUE)
)
column_names <- sapply(test, colnames)[, 1]
names(out) <- c('list.index',
paste0('line.number.in.df.', column_names),
column_names)
Result
> out
list.index line.number.in.df.C1 line.number.in.df.C2 line.number.in.df.C3 C1 C2 C3
1 1 3 3 3 0.5 3.7 5
2 2 3 1 1 0.6 3.9 3
3 3 3 1 1 0.8 8.9 7
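If you'd rather not pull in the birk package just for which.closest(), a base R stand-in is a one-liner (a sketch; it mirrors the two arguments used above, though tie-breaking behaviour may differ):
which.closest_base <- function(vec, x) which.min(abs(vec - x)) # index of the vec entry nearest to x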
I have a data frame as shown:
structure(list(ID = c(1, 1, 1, 1, 2, 2, 2, 2), ColA = c(2, 3,
4, 5, 2, 3, 4, 5), ColB = c(1, 2, 3, 4, 1, 2, 3, 4), ColA_0.2 = c(2,
3.4, 4.68, 5.936, 2, 3.4, 4.68, 5.936), ColB_0.2 = c(1, 2.2,
3.44, 4.688, 1, 2.2, 3.44, 4.688)), class = "data.frame", row.names = c(NA,
-8L))
What I need: for each ID, I want to calculate ColA_ad and ColB_ad, where the user passes a parameter 'ad'.
For example, if 'ad' is 0.2 then the values are calculated as:
First row: same as ColA (i.e. 2)
Second row: second row of ColA plus 0.2 * first row of ColA_ad (i.e. 3 + 0.2*2 = 3.4)
Third row: third row of ColA plus 0.2 * second row of ColA_ad (i.e. 4 + 0.2*3.4 = 4.68)
and so on.
The same is calculated for all other columns (here ColB); the columns to process can be given in a separate vector.
Summary: each row takes 0.2 times the carry-over effect of the previously calculated row and adds it to the new row's value.
The expected results are shown in columns ColA_0.2 and ColB_0.2 of the dput above.
As my dataset is very large, I am looking for a data.table solution.
Here is a base R solution, where a linear-algebra identity is applied to speed up your iterative calculation: unrolling the recursion gives col_ad[i] = sum over j <= i of ad^(i-j) * col[j], which is just a matrix product.
basic idea (taking ID = 1 as example)
You first construct a lower triangular matrix for mapping from col to col_ad, i.e.,
l <- 0.2**abs(outer(seq(4),seq(4),"-"))
l[upper.tri(l)] <- 0
which gives
> l
[,1] [,2] [,3] [,4]
[1,] 1.000 0.00 0.0 0
[2,] 0.200 1.00 0.0 0
[3,] 0.040 0.20 1.0 0
[4,] 0.008 0.04 0.2 1
then you left-multiply the value columns col by l, i.e.,
> l %*% as.matrix(subset(df,ID == 1)[-1])
ColA ColB
[1,] 2.000 1.000
[2,] 3.400 2.200
[3,] 4.680 3.440
[4,] 5.936 4.688
code
ad <- 0.2
col_ad <- do.call(rbind,
                  c(make.row.names = FALSE,
                    lapply(split(df, df$ID),
                           function(x) {
                             l <- ad**abs(outer(seq(nrow(x)), seq(nrow(x)), "-"))
                             l[upper.tri(l)] <- 0
                             `colnames<-`(data.frame(l %*% as.matrix(x[-1])),
                                          paste0(names(x[-1]), "_", ad))
                           })))
dfout <- cbind(df, col_ad)
such that
> dfout
ID ColA ColB ColA_0.2 ColB_0.2
1 1 2 1 2.000 1.000
2 1 3 2 3.400 2.200
3 1 4 3 4.680 3.440
4 1 5 4 5.936 4.688
5 2 2 1 2.000 1.000
6 2 3 2 3.400 2.200
7 2 4 3 4.680 3.440
8 2 5 4 5.936 4.688
DATA
df <- structure(list(ID = c(1, 1, 1, 1, 2, 2, 2, 2), ColA = c(2, 3,
4, 5, 2, 3, 4, 5), ColB = c(1, 2, 3, 4, 1, 2, 3, 4)), class = "data.frame", row.names = c(NA,
-8L))
A non-recursive option:
setDT(DT)[, paste0(cols, "_", ad) := {
  m <- matrix(unlist(shift(ad^(seq_len(.N) - 1L), 0L:(.N - 1L), fill = 0)), nrow = .N)
  lapply(.SD, function(x) c(m %*% x))
}, by = ID, .SDcols = cols]
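As a quick check (with ad = 0.2 and a group of 4 rows, as in the data below), the matrix m built above is the same lower-triangular mapping used in the linear-algebra answer:
m <- matrix(unlist(data.table::shift(0.2^(0:3), 0:3, fill = 0)), nrow = 4)
m
#       [,1] [,2] [,3] [,4]
# [1,] 1.000 0.00  0.0    0
# [2,] 0.200 1.00  0.0    0
# [3,] 0.040 0.20  1.0    0
# [4,] 0.008 0.04  0.2    1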
Another recursive option:
library(data.table)
setDT(DT)[, paste0(cols,"_",ad) := {
a <- 0
b <- 0
.SD[, {
a <- ColA + ad*a
b <- ColB + ad*b
.(a, b)
}, seq_len(.N)][, (1) := NULL]
},
by = ID]
output:
ID ColA ColB ColA_0.2 ColB_0.2
1: 1 2 1 2.000 1.000
2: 1 3 2 3.400 2.200
3: 1 4 3 4.680 3.440
4: 1 5 4 5.936 4.688
5: 2 2 1 2.000 1.000
6: 2 3 2 3.400 2.200
7: 2 4 3 4.680 3.440
8: 2 5 4 5.936 4.688
data:
DT <- structure(list(ID = c(1, 1, 1, 1, 2, 2, 2, 2), ColA = c(2, 3,
4, 5, 2, 3, 4, 5), ColB = c(1, 2, 3, 4, 1, 2, 3, 4), ColA_0.2 = c(2,
3.4, 4.68, 5.936, 2, 3.4, 4.68, 5.936), ColB_0.2 = c(1, 2.2,
3.44, 4.688, 1, 2.2, 3.44, 4.688)), class = "data.frame", row.names = c(NA,
-8L))
ad <- 0.2
cols <- c("ColA", "ColB")
Here is one way with data.table using Reduce:
#Columns to apply function to
cols <- names(df)[2:3]
#Create a function to apply
apply_fun <- function(col, ad) {
Reduce(function(x, y) sum(y, x * ad), col, accumulate = TRUE)
}
library(data.table)
#Convert dataframe to data.table
setDT(df)
#set ad value
ad <- 0.2
#Apply function to each column in cols
df[, (paste(cols, ad, sep = "_")) := lapply(.SD, apply_fun, ad), .SDcols = cols, by = ID]
df
# ID ColA ColB ColA_0.2 ColB_0.2
#1: 1 2 1 2.000 1.000
#2: 1 3 2 3.400 2.200
#3: 1 4 3 4.680 3.440
#4: 1 5 4 5.936 4.688
#5: 2 2 1 2.000 1.000
#6: 2 3 2 3.400 2.200
#7: 2 4 3 4.680 3.440
#8: 2 5 4 5.936 4.688
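For what it's worth, base R's stats::filter() implements the same first-order recursion y[i] = x[i] + ad * y[i-1] directly, so a drop-in alternative to apply_fun could look like this (a sketch; note that filter() returns a ts object, hence the as.numeric()):
apply_fun2 <- function(col, ad) {
  # recursive filter: y[i] = col[i] + ad * y[i-1]
  as.numeric(stats::filter(col, filter = ad, method = "recursive"))
}
apply_fun2(c(2, 3, 4, 5), 0.2)
#[1] 2.000 3.400 4.680 5.936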
I want to replace the NA values for observations within a particular sub-group, but the sequence of the observations in that group is not ordered properly. So I am wondering whether there is a dplyr or plyr command that would let me replace missing values in a column of one data frame with values from another data frame, matching on the values of a "key" column.
Here's what I got. Hope someone could shed light on this. Thanks.
## data frame that contains missing values in "diff" column
df <- data.frame(type = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3),
diff = c(0.1, 0.3, NA, NA, NA, NA, NA, 0.2, 0.7, NA, 0.5, NA),
name = c("A", "B", "C", "D", "E", "A", "B", "C", "F", "A", "B", "C"))
## replace with values from this smaller data frame
df2 <- data.frame(diff_rep = c(0.3, 0.2, 0.4), name = c("A", "B", "C"))
## replace using ifelse
df$diff <- ifelse(is.na(df$diff) & (df$type == 2), df2$diff_rep , df$diff)
df
type diff name
1 1 0.1 A
2 1 0.3 B
3 1 NA C
4 2 0.3 D
5 2 0.2 E
6 2 0.4 A
7 2 0.3 B
8 2 0.2 C
9 2 0.7 F
10 3 NA A
11 3 0.5 B
12 3 NA C
## desired output
type diff name
1 1 0.1 A
2 1 0.3 B
3 1 NA C
4 2 NA D
5 2 NA E
6 2 0.3 A
7 2 0.2 B
8 2 0.4 C
9 2 0.7 F
10 3 NA A
11 3 0.5 B
12 3 NA C
Assuming row 9 is a mistake, you can use a left join first and then ifelse() and coalesce() to get your desired result. coalesce() returns the first non-missing value.
library(dplyr)

left_join(df, df2, by = "name") %>%
  mutate(diff_wanted = if_else(type == 2,
                               coalesce(diff, diff_rep),
                               diff),
         diff_wanted = ifelse(name %in% df2$name,
                              diff_wanted,
                              NA)) %>%
  select(type, diff_wanted, name)
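For comparison, a base R version with match() should give the same result (assuming the df and df2 from the question): it fills only the NAs in type-2 rows, and names absent from df2 stay NA because match() returns NA for them.
idx <- is.na(df$diff) & df$type == 2
df$diff[idx] <- df2$diff_rep[match(df$name[idx], df2$name)]
df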
I have a data.table similar to the one below, but with around 3 million rows and a lot more columns.
key1 price qty status category
1: 1 9.26 3 5 B
2: 1 14.64 1 5 B
3: 1 16.66 3 5 A
4: 1 18.27 1 5 A
5: 2 2.48 1 7 A
6: 2 0.15 2 7 C
7: 2 6.29 1 7 B
8: 3 7.06 1 2 A
9: 3 24.42 1 2 A
10: 3 9.16 2 2 C
11: 3 32.21 2 2 B
12: 4 20.00 2 9 B
Here's the dput() string:
dados = structure(list(key1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4),
    price = c(9.26, 14.64, 16.66, 18.27, 2.48, 0.15, 6.29, 7.06,
    24.42, 9.16, 32.21, 20), qty = c(3, 1, 3, 1, 1, 2, 1, 1,
    1, 2, 2, 2), status = c(5, 5, 5, 5, 7, 7, 7, 2, 2, 2, 2,
    9), category = c("B", "B", "A", "A", "A", "C", "B", "A",
    "A", "C", "B", "B")), .Names = c("key1", "price", "qty",
    "status", "category"), row.names = c(NA, -12L), class = c("data.table",
    "data.frame"))
I need to transform this data so that I have one entry for each key, and in the process I need to create some additional variables. So far I was using this:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
key.aggregate = function(x){
return(data.table(
key1 = Mode(x$key1),
perc.A = sum(x$price[x$category == "A"],na.rm=T)/sum(x$price),
perc.B = sum(x$price[x$category == "B"],na.rm=T)/sum(x$price),
perc.C = sum(x$price[x$category == "C"],na.rm=T)/sum(x$price),
status = Mode(x$status),
qty = sum(x$qty),
price = sum(x$price)
))
}
new_data = split(dados,by = "key1") #Runs out of RAM here
results = rbindlist(lapply(new_data,key.aggregate))
And expecting the following output:
> results
key1 perc.A perc.B perc.C status qty price
1: 1 0.5937447 0.4062553 0.00000000 5 8 58.83
2: 2 0.2780269 0.7051570 0.01681614 7 4 8.92
3: 3 0.4321208 0.4421414 0.12573782 2 6 72.85
4: 4 0.0000000 1.0000000 0.00000000 9 2 20.00
But I'm always running out of RAM when splitting the data by keys. I've tried using only a third of the data, and then only a sixth of it, but it still gives the same Error: cannot allocate vector of size 593 Kb.
I'm thinking this approach is very inefficient; what would be the best way to get this result?
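One way to sidestep split() entirely is to let data.table do the grouping itself with by = key1, so the list of sub-tables is never materialized. A sketch of that approach, reusing the Mode() function above (untested on the full 3-million-row data):
library(data.table)
setDT(dados) # ensure a valid data.table after dput()/structure()
results <- dados[, .(
  perc.A = sum(price[category == "A"], na.rm = TRUE) / sum(price),
  perc.B = sum(price[category == "B"], na.rm = TRUE) / sum(price),
  perc.C = sum(price[category == "C"], na.rm = TRUE) / sum(price),
  status = Mode(status),
  qty = sum(qty),
  price = sum(price)
), by = key1]
Grouped j expressions run inside data.table's optimized grouping machinery, which is generally far more memory-friendly than split() followed by rbindlist().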
I am a beginner in R. I have a table that looks like this:
> means
as er op rt
a 34.66667 3.5 87 4
b 22.66667 4.5 9 5
c 5.00000 7.5 6 9
d 6.00000 0.5 6 3
e 3.00000 8.0 7 89
and another one that looks like this:
> table
exp ctrl
1 as er
2 rt op
I want to extract the values from the columns in "means" that are indicated in column "exp" of "table", like this:
> means_exp <- means[, table$exp]
In the real situation both tables would be much bigger, so I don't want to just specify the names of the columns to extract one by one.
However, with that command I am getting this:
> means_exp
as er
a 34.66667 3.5
b 22.66667 4.5
c 5.00000 7.5
d 6.00000 0.5
e 3.00000 8.0
but I am supposed to get columns "as" and "rt", not "as" and "er".
Any idea why the wrong columns are extracted?
Thank you!
Here is the dput of the first table:
structure(c(34.6666666666667, 22.6666666666667, 5, 6, 3, 3.5,
4.5, 7.5, 0.5, 8, 87, 9, 6, 6, 7, 4, 5, 9, 3, 89), .Dim = c(5L,
4L), .Dimnames = list(c("a", "b", "c", "d", "e"), c("as", "er",
"op", "rt")))
and that of the second:
structure(list(exp = structure(1:2, .Label = c("as", "rt"), class = "factor"),
ctrl = structure(1:2, .Label = c("er", "op"), class = "factor")), .Names = c("exp",
"ctrl"), class = "data.frame", row.names = c(NA, -2L))
The reason the OP got different columns when indexing with the 'exp' column of 'table' is the class of 'exp': it is a factor, so converting it to character is the fix.
means[, as.character(table$exp)]
Without the conversion, the factor gets coerced to integer and we get
as.integer(factor(table$exp))
#[1] 1 2
means[, factor(table$exp)]
# as er
#a 34.66667 3.5
#b 22.66667 4.5
#c 5.00000 7.5
#d 6.00000 0.5
#e 3.00000 8.0
So it selects the first 2 columns instead of 'as' and 'rt'.
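With the conversion to character, the intended columns come through (a quick check against the dput data above):
means[, as.character(table$exp)]
#         as rt
#a 34.66667  4
#b 22.66667  5
#c  5.00000  9
#d  6.00000  3
#e  3.00000 89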