Hello, I have a data frame with multiple columns, all containing numeric values. My real df has over 200 columns, but the sample below should do. I have a named list of column indices (vlist.imps), and I would like to use each element in a rowSums loop, so that each list name becomes a new column holding the row sums of the columns at those indices.
Main <- c(rep(1, times = 6), rep(2, times = 6))
Feature1 <- sample(1:20, 12, replace = T)
Feature2 <- sample(400:500, 12, replace = T)
Feature3 <- sample(1:5, 12, replace = T)
df.main <- data.frame(Main, Feature1, Feature2,
Feature3, stringsAsFactors = FALSE)
Main Feature1 Feature2 Feature3
1 1 6 483 3
2 1 9 405 1
3 1 18 494 5
4 1 7 499 5
5 1 13 436 1
6 1 2 451 3
7 2 4 456 3
8 2 19 442 5
9 2 16 437 2
10 2 4 497 4
11 2 7 497 3
12 2 5 466 1
vlist.imps <- list(`Cool Ranch|Cool Chipotle` = c(1L, 4L), `Trust|Scotia` = c(3L,
4L))
I want my output to look like this
Main Feature1 Feature2 Feature3 cool_ranch trust_scotia
1 1 6 483 3 4 486
2 1 9 405 1 2 406
3 1 18 494 5 6 499
4 1 7 499 5 6 504
5 1 13 436 1 2 437
6 1 2 451 3 4 454
7 2 4 456 3 5 459
8 2 19 442 5 7 447
9 2 16 437 2 4 439
10 2 4 497 4 6 501
11 2 7 497 3 5 500
12 2 5 466 1 3 467
I have tried a few things along the same lines as below
> sum.test<- apply(df.main, 2, function(i) rowSums[vlist.imps$i])
Error in rowSums[vlist.imps$i] :
object of type 'closure' is not subsettable
We can loop over 'vlist.imps', extract the columns of 'df.main' at each set of indices, get the rowSums, and assign the output back to create the new columns.
df.main[names(vlist.imps)] <- lapply(vlist.imps, function(x) rowSums(df.main[x]))
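As a minimal sketch of the same idea on the first three rows of the sample data: the error in the question comes from writing `rowSums[...]`, which tries to subset the function itself instead of calling it with `rowSums(...)`. The new column names `cool_ranch` and `trust_scotia` are typed by hand here to match the desired output; no renaming rule is implied by the original answer.

```r
# First three rows of the sample data from the question
df.main <- data.frame(
  Main     = c(1, 1, 1),
  Feature1 = c(6, 9, 18),
  Feature2 = c(483, 405, 494),
  Feature3 = c(3, 1, 5)
)

vlist.imps <- list(`Cool Ranch|Cool Chipotle` = c(1L, 4L),
                   `Trust|Scotia` = c(3L, 4L))

# Compute every sum first, so new columns never shift the indices
sums <- lapply(vlist.imps, function(idx) rowSums(df.main[idx]))

# Hand-written names (an assumption, chosen to match the desired output)
names(sums) <- c("cool_ranch", "trust_scotia")
df.main[names(sums)] <- sums
df.main
```

Computing all sums before assigning keeps the integer indices pointing at the original columns.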
Data
Let's take a look at a simple dataset (mine is actually >200,000 rows):
df <- data.frame(
id = c(rep(1, 11), rep(2,6)),
ref.pos = c(NA,NA,NA,301,302,303,800,801,NA,NA,NA, 500,501,502, NA, NA, NA),
pos = c(1:11, 30:35)
)
Which thus looks like this:
id ref.pos pos
1 1 NA 1
2 1 NA 2
3 1 NA 3
4 1 301 4
5 1 302 5
6 1 303 6
7 1 800 7
8 1 801 8
9 1 NA 9
10 1 NA 10
11 1 NA 11
12 2 500 30
13 2 501 31
14 2 502 32
15 2 NA 33
16 2 NA 34
17 2 NA 35
What I want to achieve
Per id, I want to extend the numbers in ref.pos to fill the whole column: the ref.pos values should decrease by 1 moving up the data frame and increase by 1 moving down the column. This would result in the following data frame:
id ref.pos pos
1 1 298 1
2 1 299 2
3 1 300 3
4 1 301 4
5 1 302 5
6 1 303 6
7 1 800 7
8 1 801 8
9 1 802 9
10 1 803 10
11 1 804 11
12 2 500 30
13 2 501 31
14 2 502 32
15 2 503 33
16 2 504 34
17 2 505 35
What I tried
I wish I could provide some code here, but I haven't figured out a proper way in two days, especially not something applicable to large datasets. I found df %>% group_by(id) %>% tidyr::fill(ref.pos, .direction = "downup") interesting; however, this repeats numbers rather than counting down and up.
I hope my question is clear, otherwise let me know in the comments!
An option using data.table:
library(data.table)
# fill interior and trailing NAs forward, then leading NAs backward
fillends <- function(x) nafill(nafill(x, "locf"), "nocb")
setDT(df)[, ref.pos2 := {
dif <- fillends(c(diff(ref.pos), NA_integer_))
frp <- fillends(ref.pos)
fp <- fillends(replace(pos, is.na(ref.pos), NA_integer_))
fifelse(is.na(ref.pos), frp + dif*(pos - fp), ref.pos)
}, id]
output:
id ref.pos pos ref.pos2
1: 1 NA 1 298
2: 1 NA 2 299
3: 1 NA 3 300
4: 1 301 4 301
5: 1 302 5 302
6: 1 303 6 303
7: 1 802 7 802
8: 1 801 8 801
9: 1 NA 9 800
10: 1 NA 10 799
11: 1 NA 11 798
12: 2 500 30 500
13: 2 501 31 501
14: 2 502 32 502
15: 2 NA 33 503
16: 2 NA 34 504
17: 2 NA 35 505
data:
df <- data.frame(
id = c(rep(1, 11), rep(2,6)),
ref.pos = c(NA,NA,NA,301,302,303,802,801,NA,NA,NA, 500,501,502, NA, NA, NA),
pos = c(1:11, 30:35)
)
A base R option is to define a custom function fill and apply it per id with ave:
fill <- function(v) {
inds <- range(which(!is.na(v)))
l <- 1:inds[1]
u <- inds[2]:length(v)
v[l] <- v[inds[1]] - rev(l)+1
v[u] <- v[inds[2]] + seq_along(u)-1
v
}
df <- within(df,ref.pos <- ave(ref.pos,id,FUN = fill))
such that
> df
id ref.pos pos
1 1 298 1
2 1 299 2
3 1 300 3
4 1 301 4
5 1 302 5
6 1 303 6
7 1 800 7
8 1 801 8
9 1 802 9
10 1 803 10
11 1 804 11
12 2 500 30
13 2 501 31
14 2 502 32
15 2 503 33
16 2 504 34
17 2 505 35
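A variation on the same base R idea, sketched under two assumptions that go beyond the answer above: the step is always 1, and a group with no known ref.pos at all should be returned unchanged instead of raising an error. The helper name `extend_ends` is hypothetical.

```r
df <- data.frame(
  id = c(rep(1, 11), rep(2, 6)),
  ref.pos = c(NA, NA, NA, 301, 302, 303, 800, 801, NA, NA, NA,
              500, 501, 502, NA, NA, NA),
  pos = c(1:11, 30:35)
)

# Hypothetical helper: extend the leading and trailing NA runs by steps of 1
extend_ends <- function(v) {
  known <- which(!is.na(v))
  if (length(known) == 0L) return(v)      # nothing to extrapolate from
  first <- known[1]
  last  <- known[length(known)]
  i <- seq_along(v)
  v[i < first] <- v[first] - (first - i[i < first])  # count down going up
  v[i > last]  <- v[last] + (i[i > last] - last)     # count up going down
  v
}

df$ref.pos <- ave(df$ref.pos, df$id, FUN = extend_ends)
df
```

NAs that sit between two known values are left untouched, since the question only asks about the ends of each group.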
vocab
wordIDx V1
1 archive
2 name
3 atheism
4 resources
5 alt
wordIDx newsgroup_ID docIdx word/doc totalwords/doc totalwords/newsgroup wordID/newsgroup P(W_j)
1 1 196 3 1240 47821 2 0.028130269
1 1 47 2 1220 47821 2 0.028130269
2 12 4437 1 702 47490 8 0.8
3 12 4434 1 673 47490 8 0.035051912
5 12 4398 1 53 47490 8 0.4
3 12 4564 11 1539 47490 8 0.035051912
For each wordIDx in vocab, I need to compute the following formulae:
For instance, for wordIDx = 1, my value should be
max(log(0.02813027)+sum(log(2/47821),log(2/47821)))
= -23.73506
I have the following code for now:
classifier_3$ans<- max(log(classifier_3$`P(W_j)`)+ (sum(log(classifier_3$`wordID/newsgroup`/classifier_3$`totalwords/newsgroup`))))
How can I loop so that it considers every wordIDx from the vocab data frame and computes the value as in the example above?
Something like this, but you really need to clean your column names.
vocab <- read.table(text = "wordIDx V1
1 archive
2 name
3 atheism
4 resources
5 alt", header = TRUE, stringsAsFactors = FALSE)
classifier_3 <- read.table(text = "wordIDx newsgroup_ID docIdx word/doc totalwords/doc totalwords/newsgroup wordID/newsgroup P(W_j)
1 1 196 3 1240 47821 2 0.028130269
1 1 47 2 1220 47821 2 0.028130269
2 12 4437 1 702 47490 8 0.8
3 12 4434 1 673 47490 8 0.035051912
5 12 4398 1 53 47490 8 0.4
3 12 4564 11 1539 47490 8 0.035051912", header = TRUE, stringsAsFactors = FALSE)
classifier_3 <- classifier_3[!duplicated(classifier_3$wordIDx), ]
classifier_3 <- merge(vocab, classifier_3, by = c("wordIDx"))
classifier_3$ans<- pmax(log(classifier_3$`P.W_j.`)+
(log(classifier_3$`wordID.newsgroup`/classifier_3$`totalwords.newsgroup`) +
# isn't that times 2?
log(classifier_3$`wordID.newsgroup`/classifier_3$`totalwords.newsgroup`)),
log(classifier_3$`wordID.newsgroup`/classifier_3$`totalwords.newsgroup`))
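A grouped alternative, as a rough sketch: it assumes the intended value for each wordIDx is log P(W_j) plus the sum, over all rows of that word, of log(wordID/newsgroup divided by totalwords/newsgroup), which is what the worked example for wordIDx = 1 suggests. The formula itself is not shown in the question, so this reading is an assumption, and the short column names `count_in_group`, `total_in_group` and `p_w` are hypothetical replacements for the original headers.

```r
vocab <- data.frame(
  wordIDx = 1:5,
  V1 = c("archive", "name", "atheism", "resources", "alt"),
  stringsAsFactors = FALSE
)

# Only the columns the calculation needs, with hypothetical clean names
classifier_3 <- data.frame(
  wordIDx        = c(1, 1, 2, 3, 5, 3),
  count_in_group = c(2, 2, 8, 8, 8, 8),
  total_in_group = c(47821, 47821, 47490, 47490, 47490, 47490),
  p_w            = c(0.028130269, 0.028130269, 0.8, 0.035051912, 0.4, 0.035051912)
)

# Sum of log ratios per word, and the (repeated) log prior per word
log_ratio_sum <- with(classifier_3,
                      tapply(log(count_in_group / total_in_group), wordIDx, sum))
log_prior <- with(classifier_3,
                  tapply(log(p_w), wordIDx, function(p) p[1]))

scores <- data.frame(wordIDx = as.integer(names(log_ratio_sum)),
                     ans = as.numeric(log_prior + log_ratio_sum))

# Words in vocab without any rows in classifier_3 get NA
result <- merge(vocab, scores, by = "wordIDx", all.x = TRUE)
result
```

For wordIDx = 1 this reproduces the value from the question, about -23.735.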
I'm trying to merge multiple dataframes according to their names.
My dataframes are named
Cluster_1_N1, Cluster_1_N2, Cluster_1_N3
Cluster_2_N1, Cluster_2_N2, Cluster_2_N3
and so on for clusters 3, 4, ... They all contain two columns.
> Cluster_1_N1
ID counts_N1
1 405
2 6
3 201
4 40
> Cluster_1_N2
ID counts_N2
1 657
2 9
3 250
4 77
The merge will be done on "ID" column. I want to obtain multiple dataframes, named "Cluster_1", "Cluster_2"... structured like this
>Cluster_1
ID counts_N1 counts_N2 counts_N3
1 405 657 10
2 6 9 50
3 201 250 55
4 40 77 68
>Cluster_2
ID counts_N1 counts_N2 counts_N3
1 1 652 11
2 7 3 52
3 58 2 56
4 46 7 68
I tried this:
for(i in 1:2) {
z <- paste ("C", i, sep = "")
n <- paste("Cluster_", i, sep = "")
assign (n, Reduce(function(...) merge (..., by="ID", all= T),
list(mget(ls(pattern=z)))))
}
This code gives me "Cluster_1" and "Cluster_2" objects, but they are lists (containing the appropriate data frames to merge), not data frames! So the code correctly collects all the "Cluster_1" and "Cluster_2" data frames I want to merge, but does not merge them...
Where is my mistake?
(I hope I'm clear enough, this is my first post ever... )
Here's a way in base R:
pattern <- "Cluster_(\\d+)_N\\d+"
dfs_all <- mget(ls(pattern=pattern))
df_split <- split(dfs_all,gsub(pattern, "\\1", names(dfs_all)))
lapply(df_split, Reduce, f = merge)
# $`1`
# ID counts_N1 counts_N2
# 1 1 405 657
# 2 2 6 9
# 3 3 201 250
# 4 4 40 77
#
# $`2`
# ID counts_N1 counts_N2
# 1 1 1 652
# 2 2 7 3
# 3 3 58 2
# 4 4 46 7
I would recommend, however, that you fix the code upstream if you're the one responsible for getting the data into this shape. It's much cleaner to work with lists from the start.
data
Cluster_1_N1 <-read.table(text=
"ID counts_N1
1 405
2 6
3 201
4 40",h=T,strin=F)
Cluster_1_N2 <-read.table(text=
"ID counts_N2
1 657
2 9
3 250
4 77",h=T,strin=F)
Cluster_2_N1 <-read.table(text=
"ID counts_N1
1 1
2 7
3 58
4 46",h=T,strin=F)
Cluster_2_N2 <-read.table(text=
"ID counts_N2
1 652
2 3
3 2
4 7",h=T,strin=F)
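On the "where is my mistake" part, a likely cause is the extra `list()` around `mget()`: `mget()` already returns a named list of data frames, so wrapping it produces a one-element list and `Reduce()` simply returns that single element unmerged. A minimal sketch of the loop without the wrapper follows; the anchored pattern is an assumption about how the objects are named.

```r
Cluster_1_N1 <- data.frame(ID = 1:4, counts_N1 = c(405, 6, 201, 40))
Cluster_1_N2 <- data.frame(ID = 1:4, counts_N2 = c(657, 9, 250, 77))
Cluster_2_N1 <- data.frame(ID = 1:4, counts_N1 = c(1, 7, 58, 46))
Cluster_2_N2 <- data.frame(ID = 1:4, counts_N2 = c(652, 3, 2, 7))

for (i in 1:2) {
  # match only the pieces of cluster i, e.g. Cluster_1_N1, Cluster_1_N2
  pat <- paste0("^Cluster_", i, "_N\\d+$")
  pieces <- mget(ls(pattern = pat))  # already a list: no list() around it
  merged <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), pieces)
  assign(paste0("Cluster_", i), merged)
}

Cluster_1
```

Keeping the merged results in a named list, as in the answer above, avoids `assign()` altogether and is usually easier to work with.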
I am trying to rank multiple numeric variables (700+ of them) in my data and am not sure exactly how to do this, as I am still pretty new to R.
I do not want to overwrite the original values with the ranks, so I need to create a new rank variable for each of these numeric variables.
From reading other posts, I believe the assign and transform functions along with rank may be able to solve this. I tried implementing it as below (sample data and code) and am struggling to get it to work.
In addition to the variables xcount, xvisit and ysales, the output dataset needs the variables xcount_rank, xvisit_rank and ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as (101, 230], (230, 450] etc., whereas I would like the rank variables to be populated with 1, 2, etc., up to 10 categories as per the splits I did. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
I would like each record to show the number of the interval group it falls into.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
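For readers on newer dplyr releases, where `mutate_each()` and `funs()` are deprecated, the same two results can be sketched in base R with `lapply()`. The `_bin` suffix for the interval numbers is a hypothetical naming choice, used here so that ranks and bins can sit side by side.

```r
input <- data.frame(
  id     = c(101, 102, 103, 104, 105, 109, 112),
  xcount = c(20, 2, 9, 100, 450, 25, 854),
  xvisit = c(5, 4, 12, 8, 12, 28, 56),
  ysales = c(6, 7, 15, 7, 65, 145, 93)
)

vars <- names(input)[2:4]

# labels = FALSE makes cut() return the interval number instead of "(a,b]"
bins <- lapply(input[vars], cut, breaks = 3, labels = FALSE)

# descending rank, as in the question's own attempt with rank(-x)
ranks <- lapply(input[vars], function(x) rank(-x, ties.method = "first"))

input[paste(vars, "bin", sep = "_")]  <- bins
input[paste(vars, "rank", sep = "_")] <- ranks
input
```

Because the new columns are addressed by name, the same lines work unchanged for 700+ variables once `vars` lists them.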
For example if my data looks like this:
> a <- c(1:25)
> a
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
How do I get a list like this:
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5
So I want to divide the 25 elements into 5 sublists and find the index of the sublist that each element belongs to. The data is not sorted and is too large to sort. There are also missing values, in which case their index would be 0.
Sorry, to clarify: I don't need the groups to have equal sizes, but they need to be divided at the 0.2, 0.4, 0.6, 0.8 quantiles.
That is, the ith element of my output should be the number of the quantile group that the ith element of a belongs to. For example, 8 is in the second group, so the 8th element of my output is 2.
Perhaps:
acut <- cut(a,
quantile(a, probs=c(0, 0.2, 0.4, 0.6, 0.8, 1) ) ,
include.lowest=TRUE)
as.numeric(acut)
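The `cut()` approach can be wrapped so that it also covers the missing values mentioned in the question. This is a sketch only: the function name `quantile_group` is hypothetical, `na.rm = TRUE` is added because `quantile()` stops on NA input by default, and data with many ties could still produce duplicate breaks, which `cut()` rejects.

```r
# Hypothetical helper: quantile group per element, 0 for missing values
quantile_group <- function(x, probs = seq(0, 1, by = 0.2)) {
  breaks <- quantile(x, probs = probs, na.rm = TRUE)
  # labels = FALSE returns the integer group number directly
  grp <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  grp[is.na(grp)] <- 0L
  grp
}

a <- c(1:25, NA)
quantile_group(a)
```

No sorting of the full data by the caller is needed; the result keeps the original order of `a`.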
# random data with 3 NAs
> a<-sample(c(NA,NA,NA,sample(1:1000,25)))
> a
[1] 414 744 897 777 20 371 625 462 341 766 NA 243 NA 213 198 691 NA 325 275 526 830 179 40 601 51 725 68 709
> b<-ceiling(rank(a,na.last="keep")/length(which(!is.na(a)))*5)
> b[is.na(b)]=0
> b
[1] 3 5 5 5 1 3 4 3 3 5 0 2 0 2 2 4 0 2 2 3 5 1 1 4 1 4 1 4
# check that all groups have the same size (group 0 holds the 3 NAs)
> table(b)
b
0 1 2 3 4 5
3 5 5 5 5 5