I have a data frame df, and a list L of indices at which I should put 0 instead of the current values of df.
Example:
DF:
# A tibble: 11 x 3
A B C
<dbl> <dbl> <dbl>
1724 4 2013
1758 4 2013
1612 3 2013
1692 3 2013
1260 33 2014
1157 22 2014
1359 63 2014
1414 27 2014
387 3 2016
374 3 2016
L:
[[1]]
[1] 3 4
[[2]]
[1] 1 2 3 4 5
[[3]]
[1] 1
So in this example, I have to put zeros in rows 3, 4 of column A, in rows 1:5 in column B and row 1 in column C.
Is there a way to do it as a one-liner in R? A dplyr or base R solution would be great! Also, I would like to avoid apply or loops, since this has to run very efficiently.
A loop looks very fast to me. I haven't done a complexity comparison, but if you have your replacement indices in list form and want to replace the values with val, simply:
df
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
val<-0
for(i in seq_along(L)){  # seq_along() is safe even if L is empty
df[L[[i]],i]<-val
}
df
a b c
1 1 0 0
2 2 0 2
3 0 0 3
4 0 0 4
5 5 0 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
I tested it on x, a 10,000 row and 10,0000 column df:
> b<-Sys.time()
> for(i in 1:length(L)){
+ x[L[[i]],i]<-0
+ }
> Sys.time()-b
Time difference of 0.490464 secs
Looks pretty quick :) I know it's obvious but hope it helps!
******** EDIT 1 ********
If we look at the method by @mt1022 using unlist and cbind:
> b<-Sys.time()
> Lcol <- rep(seq_along(L), lengths(L))
> x[cbind(unlist(L), Lcol)] <- 0
> Sys.time()-b
Time difference of 7.467723 secs
Clearly much slower (because when we unlist, we essentially loop through every element in L instead of every vector in L). ;)
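Timings like these depend heavily on the shape of the data, so it is worth measuring on your own. A minimal sketch, assuming a small random data frame x and index list L invented here for illustration (not the data timed above), that runs both methods and checks they give the same result:

```r
set.seed(1)
n_row <- 1000
n_col <- 50

# Hypothetical test data: a numeric data frame and, per column, 10 rows to zero out
x <- as.data.frame(matrix(runif(n_row * n_col), nrow = n_row))
L <- lapply(seq_len(n_col), function(j) sample(n_row, 10))

# Method 1: loop over the columns
x_loop <- x
t_loop <- system.time(
  for (i in seq_along(L)) x_loop[L[[i]], i] <- 0
)

# Method 2: one assignment with a two-column (row, column) index matrix
x_mat <- x
Lcol <- rep(seq_along(L), lengths(L))
t_mat <- system.time(
  x_mat[cbind(unlist(L), Lcol)] <- 0
)

# Compare elapsed times and confirm both methods agree
print(rbind(loop = t_loop, matrix_index = t_mat))
all.equal(as.matrix(x_loop), as.matrix(x_mat))
```

For more stable numbers, repeat each timing several times rather than relying on a single run.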
Another way using matrix of indices:
# DF <- read.table(textConnection('A B C
# 1724 4 2013
# 1758 4 2013
# 1612 3 2013
# 1692 3 2013
# 1260 33 2014
# 1157 22 2014
# 1359 63 2014
# 1414 27 2014
# 387 3 2016
# 374 3 2016'), header = T)
#
# L <- list(c(3, 4), c(1, 2, 3, 4, 5), c(1))
Lcol <- rep(seq_along(L), lengths(L))
DF[cbind(unlist(L), Lcol)] <- 0
# > DF
# A B C
# 1 1724 0 0
# 2 1758 0 2013
# 3 0 0 2013
# 4 0 0 2013
# 5 1260 0 2014
# 6 1157 22 2014
# 7 1359 63 2014
# 8 1414 27 2014
# 9 387 3 2016
# 10 374 3 2016
Another option is to use mapply in combination with do.call.
do.call(cbind, mapply(function(x,y){
df[x,y]<-0
df[y]
}, mylist, seq_along(mylist)))
# A B C
# [1,] 1724 0 0
# [2,] 1758 0 2013
# [3,] 0 0 2013
# [4,] 0 0 2013
# [5,] 1260 0 2014
# [6,] 1157 22 2014
# [7,] 1359 63 2014
# [8,] 1414 27 2014
# [9,] 387 3 2016
# [10,] 374 3 2016
Data:
df <- read.table(text =
"A B C
1724 4 2013
1758 4 2013
1612 3 2013
1692 3 2013
1260 33 2014
1157 22 2014
1359 63 2014
1414 27 2014
387 3 2016
374 3 2016", header = TRUE)
mylist <- list(c(3, 4), c(1, 2, 3, 4, 5), c(1))
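A variant of the same idea, as a sketch only: Map() walks the columns and the index list in parallel, and replace() returns each column with zeros at the given rows. Assigning into df[] keeps the result a data frame rather than the matrix that cbind produces. This assumes mylist has exactly one entry per column, in column order.

```r
df <- read.table(text =
"A B C
1724 4 2013
1758 4 2013
1612 3 2013
1692 3 2013
1260 33 2014
1157 22 2014
1359 63 2014
1414 27 2014
387 3 2016
374 3 2016", header = TRUE)
mylist <- list(c(3, 4), c(1, 2, 3, 4, 5), c(1))

# Pair each column with its row indices and zero those rows;
# df[] <- ... replaces the columns but keeps the data.frame structure
df[] <- Map(function(col, idx) replace(col, idx, 0), df, mylist)
df
```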
I am replicating a 2011 example script, and the aggregate() function of base R produces NAs. Do I need to use a more recent version of aggregate() or a similar function? Please advise.
Example s1s2.df can be found here: https://www.dropbox.com/s/dsqina3vuy0774u/df.csv?dl=0
Code that produces NA instead of summarised values:
s1.no.present <- aggregate(s1s2.df$no.present[s1s2.df$sabap==-1], by=list(s1s2.df$month.n[s1s2.df$sabap==-1]),sum)[,2]
s1.no.cards <- aggregate(s1s2.df$no.cards[s1s2.df$sabap==-1], by=list(s1s2.df$month.n[s1s2.df$sabap==-1]),sum)[,2]
s2.no.present <- aggregate(s1s2.df$no.present[s1s2.df$sabap==1], by=list(s1s2.df$month.n[s1s2.df$sabap==1]),sum)[,2]
s2.no.cards <- aggregate(s1s2.df$no.cards[s1s2.df$sabap==1], by=list(s1s2.df$month.n[s1s2.df$sabap==1]),sum)[,2]
Incorrect output:
> tibble(s1.no.present)
# A tibble: 12 × 1
s1.no.present
<int>
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
10 NA
11 NA
12 NA
Use a custom sum function to remove NAs:
#data
s1s2.df <- read.csv("tmp.csv")
mySum <- function(x){ sum(x, na.rm = TRUE) }
aggregate(s1s2.df$no.present[s1s2.df$sabap == -1 ],
by = list(s1s2.df$month.n[s1s2.df$sabap == -1 ]),
mySum)
# Group.1 x
# 1 1 218
# 2 2 369
# 3 3 590
# 4 4 1471
# 5 5 1880
# 6 6 2241
# 7 7 2306
# 8 8 1827
# 9 9 1377
# 10 10 774
# 11 11 281
# 12 12 280
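Or use the formula interface, which drops rows containing NAs by default (na.action = na.omit):
# The formula is passed unnamed: the first argument of the formula method
# was renamed from 'formula' to 'x' in R 4.2.0, so naming it is not portable.
aggregate(no.present ~ month.n,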
data = s1s2.df[s1s2.df$sabap == -1, ],
FUN = sum)
# month.n no.present
# 1 1 218
# 2 2 369
# 3 3 590
# 4 4 1471
# 5 5 1880
# 6 6 2241
# 7 7 2306
# 8 8 1827
# 9 9 1377
# 10 10 774
# 11 11 281
# 12 12 280
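The reason for the NA output is that sum() returns NA as soon as a group contains a single NA. A small sketch on made-up data (the toy data frame below is an assumption, not the poster's file) shows the effect, and also that aggregate() forwards extra arguments such as na.rm to FUN, so a wrapper function is optional:

```r
# Hypothetical toy data: month 1 contains one missing count
toy <- data.frame(month   = c(1, 1, 2, 2),
                  present = c(3, NA, 5, 7))

# Plain sum: the NA propagates to the whole group
plain <- aggregate(toy$present, by = list(month = toy$month), FUN = sum)

# Extra arguments after FUN are passed on to it, here na.rm = TRUE
clean <- aggregate(toy$present, by = list(month = toy$month),
                   FUN = sum, na.rm = TRUE)

plain
clean
```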
vocab
wordIDx V1
1 archive
2 name
3 atheism
4 resources
5 alt
wordIDx newsgroup_ID docIdx word/doc totalwords/doc totalwords/newsgroup wordID/newsgroup P(W_j)
1 1 196 3 1240 47821 2 0.028130269
1 1 47 2 1220 47821 2 0.028130269
2 12 4437 1 702 47490 8 0.8
3 12 4434 1 673 47490 8 0.035051912
5 12 4398 1 53 47490 8 0.4
3 12 4564 11 1539 47490 8 0.035051912
For each wordIDx in vocab, I need to compute the following formulae:
For instance wordIDx = 1 ;
my value should be
max(log(0.02813027)+sum(log(2/47821),log(2/47821)))
= -23.73506
I have the following code for now:
classifier_3$ans<- max(log(classifier_3$`P(W_j)`)+ (sum(log(classifier_3$`wordID/newsgroup`/classifier_3$`totalwords/newsgroup`))))
How can I loop over all wordIDx values in the vocab data frame and compute, for each one, the value shown in the example above?
Something like this, but you really need to clean your column names.
vocab <- read.table(text = "wordIDx V1
1 archive
2 name
3 atheism
4 resources
5 alt", header = TRUE, stringsAsFactors = FALSE)
classifier_3 <- read.table(text = "wordIDx newsgroup_ID docIdx word/doc totalwords/doc totalwords/newsgroup wordID/newsgroup P(W_j)
1 1 196 3 1240 47821 2 0.028130269
1 1 47 2 1220 47821 2 0.028130269
2 12 4437 1 702 47490 8 0.8
3 12 4434 1 673 47490 8 0.035051912
5 12 4398 1 53 47490 8 0.4
3 12 4564 11 1539 47490 8 0.035051912", header = TRUE, stringsAsFactors = FALSE)
classifier_3 <- classifier_3[!duplicated(classifier_3$wordIDx), ]
classifier_3 <- merge(vocab, classifier_3, by = c("wordIDx"))
classifier_3$ans<- pmax(log(classifier_3$`P.W_j.`)+
(log(classifier_3$`wordID.newsgroup`/classifier_3$`totalwords.newsgroup`) +
# isn't that times 2?
log(classifier_3$`wordID.newsgroup`/classifier_3$`totalwords.newsgroup`)),
log(classifier_3$`wordID.newsgroup`/classifier_3$`totalwords.newsgroup`))
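If the intent is the calculation from the question's worked example, that is log(P(W_j)) plus the sum of log(wordID/newsgroup divided by totalwords/newsgroup) over all rows of a word, a grouped sum avoids the explicit loop. This is only a sketch of that reading: the shortened column names (word_count, total_words, p_w) are made up here, and it assumes each word belongs to a single newsgroup, as in the sample, so the max() has nothing to choose between.

```r
# Sample rows from the question, with simplified (hypothetical) column names
classifier_3 <- data.frame(
  wordIDx     = c(1, 1, 2, 3, 5, 3),
  word_count  = c(2, 2, 8, 8, 8, 8),              # wordID/newsgroup
  total_words = c(47821, 47821, 47490, 47490, 47490, 47490),
  p_w         = c(0.028130269, 0.028130269, 0.8,
                  0.035051912, 0.4, 0.035051912)  # P(W_j)
)

# Per-row log ratio, then summed within each wordIDx
log_ratio <- log(classifier_3$word_count / classifier_3$total_words)
sum_log   <- tapply(log_ratio, classifier_3$wordIDx, sum)

# P(W_j) is constant within a word in this sample, so take the first value
first_p <- tapply(classifier_3$p_w, classifier_3$wordIDx, function(p) p[1])

score <- log(first_p) + sum_log
ans <- data.frame(wordIDx = as.integer(names(score)), ans = as.vector(score))
ans
```

Words from vocab that never occur in classifier_3 (wordIDx 4 in the sample) are simply absent from ans; merge(vocab, ans, all.x = TRUE) would keep them with NA.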
I have 2 data frames
Data Frame A:
Time Reading
1 20
2 23
3 25
4 22
5 24
6 23
7 24
8 23
9 23
10 22
Data Frame B:
TimeStart TimeEnd Alarm
2 5 556
7 9 556
I would like to create the following joined dataframe:
Time Reading Alarmtime Alarm alarmno
1 20 n/a n/a n/a
2 23 2 556 1
3 25 556 1
4 22 556 1
5 24 5 556 1
6 23 n/a n/a n/a
7 24 7 556 2
8 23 556 2
9 23 9 556 2
10 22 n/a n/a n/a
I can do the join easily enough; however, I'm struggling to fill the following rows with the alarm until the time the alarm ended. I also want to number each individual alarm, so that even if two alarms have the same code they are counted separately. Any thoughts on how I can do this would be great.
Thanks
library(sqldf)
df_b$AlarmNo <- seq_len(nrow(df_b))
sqldf('
select a.Time
, a.Reading
, case when a.Time in (b.TimeStart, b.TimeEnd)
then a.Time
else NULL
end as AlarmTime
, b.Alarm
, b.AlarmNo
from df_a a
left join df_b b
on a.Time between b.TimeStart and b.TimeEnd
')
# Time Reading AlarmTime Alarm AlarmNo
# 1 1 20 NA NA NA
# 2 2 23 2 556 1
# 3 3 25 NA 556 1
# 4 4 22 NA 556 1
# 5 5 24 5 556 1
# 6 6 23 NA NA NA
# 7 7 24 7 556 2
# 8 8 23 NA 556 2
# 9 9 23 9 556 2
# 10 10 22 NA NA NA
Or
library(data.table)
setDT(df_b)
df_c <-
df_b[, .(Time = seq(TimeStart, TimeEnd), Alarm, AlarmNo = .GRP)
, by = TimeStart]
merge(df_a, df_c, by = 'Time', all.x = TRUE)
# Time Reading TimeStart Alarm AlarmNo
# 1: 1 20 NA NA NA
# 2: 2 23 2 556 1
# 3: 3 25 2 556 1
# 4: 4 22 2 556 1
# 5: 5 24 2 556 1
# 6: 6 23 NA NA NA
# 7: 7 24 7 556 2
# 8: 8 23 7 556 2
# 9: 9 23 7 556 2
# 10: 10 22 NA NA NA
Data used:
df_a <- fread('
Time Reading
1 20
2 23
3 25
4 22
5 24
6 23
7 24
8 23
9 23
10 22
')
df_b <- fread('
TimeStart TimeEnd Alarm
2 5 556
7 9 556
')
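For comparison, here is a sketch of the same interval lookup in base R, without sqldf or data.table. It assumes the alarm intervals in df_b do not overlap (if they did, only the first matching alarm would be used), and it rebuilds the sample data with data.frame() so that it runs on its own.

```r
df_a <- data.frame(Time = 1:10,
                   Reading = c(20, 23, 25, 22, 24, 23, 24, 23, 23, 22))
df_b <- data.frame(TimeStart = c(2, 7), TimeEnd = c(5, 9), Alarm = c(556, 556))

# For each reading, the row of df_b whose interval contains it (NA if none)
hit <- vapply(df_a$Time, function(t) {
  k <- which(t >= df_b$TimeStart & t <= df_b$TimeEnd)
  if (length(k)) k[1] else NA_integer_
}, integer(1))

res <- df_a
res$Alarm   <- df_b$Alarm[hit]
res$AlarmNo <- hit  # row number in df_b doubles as the alarm counter

# AlarmTime is filled only on the first and last row of each alarm
at_edge <- !is.na(hit) &
  (res$Time == df_b$TimeStart[hit] | res$Time == df_b$TimeEnd[hit])
res$AlarmTime <- ifelse(at_edge, res$Time, NA)
res
```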
I have a list of character vectors of varying lengths:
[[1]]
[1] "2009" "2010" "2011" "2012"
[[2]]
[1] "2010" "2011" "2012" "2013"
[[3]]
[1] "2008" "2009" "2010" "2011" "2012"
[[4]]
[1] "2011" "2012"
I would like to get one column data.frame like this:
2009
2010
2011
2012
2010
2011
....
I went on doing this unsuccessfully:
# transpose list of years
YearsDf <- lapply(GetYears, data.frame)
Remove colnames (since the list of dataframes gave some weird column names):
YearsOk <- lapply(YearsDf, function(x) "colnames<-"(x, NULL))
All this comes to:
[[1]]
NA
1 2009
2 2010
3 2011
4 2012
[[2]]
NA
1 2010
2 2011
3 2012
4 2013
......
Now I just bind them to get a data.frame, but this gave NAs:
ldply(YearsOk, data.frame)
How do I get a one-column data.frame?
Did you consider unlist?
myL <- list(as.character(2009:2012),
as.character(2010:2011),
as.character(2009:2014))
data.frame(year = unlist(myL))
# year
# 1 2009
# 2 2010
# 3 2011
# 4 2012
# 5 2010
# 6 2011
# 7 2009
# 8 2010
# 9 2011
# 10 2012
# 11 2013
# 12 2014
If you think it will be important for you to retain which list element the value came from, consider stack (which requires a named list) or melt from the "reshape2" package:
library(reshape2)
melt(myL)
# value L1
# 1 2009 1
# 2 2010 1
# ...SNIP...
# 11 2013 3
# 12 2014 3
## stack requires names, so add some in...
stack(setNames(myL, seq_along(myL)))
# values ind
# 1 2009 1
# 2 2010 1
# ...SNIP...
# 12 2014 3
Finally, this is absolutely not the approach I would take, but based on your example code, perhaps you were trying to do something like:
do.call(rbind, lapply(myL, function(x) data.frame(year = x)))
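If you want to keep track of the source list element without loading a package, the same bookkeeping can be done in base R. A small sketch, reusing myL from above; the column name source is just an illustrative choice:

```r
myL <- list(as.character(2009:2012),
            as.character(2010:2011),
            as.character(2009:2014))

# lengths() gives the size of each list element; rep() repeats each
# element's position that many times so it lines up with unlist(myL)
years <- data.frame(year   = unlist(myL),
                    source = rep(seq_along(myL), lengths(myL)))
years
```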
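It's quite simple. This approach also works when the list elements have different lengths:
# a and b are reconstructed from the output shown below
a <- 1:11
b <- 2:30
Q<-list(a=a,b=b)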
str(Q)
List of 2
$ a: int [1:11] 1 2 3 4 5 6 7 8 9 10 ...
$ b: int [1:29] 2 3 4 5 6 7 8 9 10 11 ...
Q$a
[1] 1 2 3 4 5 6 7 8 9 10 11
T<-c(Q$a,Q$b)
T
[1] 1 2 3 4 5 6 7 8 9 10 11 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[28] 18 19 20 21 22 23 24 25 26 27 28 29 30
TT<-data.frame(T)
TT
T
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 2
13 3
14 4
15 5
16 6
17 7
18 8
19 9
20 10
21 11
22 12
23 13
24 14
25 15
26 16
27 17
28 18
29 19
30 20
31 21
32 22
33 23
34 24
35 25
36 26
37 27
38 28
39 29
40 30
I am working with a large dataset of patent data. Each row is an individual patent, and columns contain information including application year and number of citations in the patent.
> head(p)
allcites appyear asscode assgnum cat cat_ocl cclass country ddate gday gmonth
1 6 1974 2 1 6 6 2/161.4 US 6 1
2 0 1974 2 1 6 6 5/11 US 6 1
3 20 1975 2 1 6 6 5/430 US 6 1
4 4 1974 1 NA 5 <NA> 114/354 6 1
5 1 1975 1 NA 6 6 12/142S 6 1
6 3 1972 2 1 6 6 15/53.4 US 6 1
gyear hjtwt icl icl_class icl_maingroup iclnum nclaims nclass nclass_ocl
1 1976 1 A41D 1900 A41D 19 1 4 2 2
2 1976 1 A47D 701 A47D 7 1 3 5 5
3 1976 1 A47D 702 A47D 7 1 24 5 5
4 1976 1 B63B 708 B63B 7 1 7 114 9
5 1976 1 A43D 900 A43D 9 1 9 12 12
6 1976 1 B60S 304 B60S 3 1 12 15 15
patent pdpass state status subcat subcat_ocl subclass subclass1 subclass1_ocl
1 3930271 10030271 IL 63 63 161.4 161.4 161
2 3930272 10156902 PA 65 65 11.0 11 11
3 3930273 10112031 MO 65 65 430.0 430 331
4 3930274 NA CA 55 NA 354.0 354 2
5 3930275 NA NJ 63 63 NA 142S 142
6 3930276 10030276 IL 69 69 53.4 53.4 53
subclass_ocl term_extension uspto_assignee gdate
1 161 0 251415 1976-01-06
2 11 0 246000 1976-01-06
3 331 0 10490 1976-01-06
4 2 0 0 1976-01-06
5 142 0 0 1976-01-06
6 53 0 243840 1976-01-06
I am attempting to create a new data frame which contains the mean number of citations (allcites) per application year (appyear), separated by category (cat), for patents from 1970 to 2006 (the data goes all the way back to 1901). I did this successfully, but I feel like my solution is somewhat ad hoc and does not take advantage of the specific capabilities of R. Here is my solution:
#citations by category
citescat <- data.frame("chem"=integer(37),
"comp"=integer(37),
"drugs"=integer(37),
"ee"=integer(37),
"mech"=integer(37),
"other"=integer(37),
"year"=1970:2006
)
for (i in 1:37) {
for (j in 1:6) {
citescat[i,j] <- mean(p$allcites[p$appyear==(i+1969) & p$cat==j], na.rm=TRUE)
}
}
I am wondering if there is a simple way to do this without using the nested for loops which would make it easy to make small tweaks to it. It is hard for me to pin down exactly what I am looking for other than this, but my code just looks ugly to me and I suspect that there are better ways to do this in R.
Joran is right - here's a plyr solution. Without your dataset in a usable form it's hard to show you exactly, but here it is in a simplified dataset:
library(plyr)      # for ddply
library(reshape2)  # for dcast
p <- data.frame(allcites = sample(1:20, 20), appyear = 1974:1975, pcat = rep(1:4, each = 5))
#First calculate the means of each group
cites <- ddply(p, .(appyear, pcat), summarise, meancites = mean(allcites, na.rm = TRUE))
#This gives us the data in long form
# appyear pcat meancites
# 1 1974 1 14.666667
# 2 1974 2 9.500000
# 3 1974 3 10.000000
# 4 1974 4 10.500000
# 5 1975 1 16.000000
# 6 1975 2 4.000000
# 7 1975 3 12.000000
# 8 1975 4 9.333333
#Now use dcast to get it in wide form (which I think your for loop was doing):
citescat <- dcast(cites, appyear ~ pcat, value.var = "meancites")
# appyear 1 2 3 4
# 1 1974 14.66667 9.5 10 10.500000
# 2 1975 16.00000 4.0 12 9.333333
Hopefully you can see how to adapt that to your specific data.
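If you would rather stay in base R, tapply() with two grouping factors returns the year-by-category table of means in one call. The sketch below uses a tiny made-up data frame with the question's column names (allcites, appyear, cat); the values are invented for illustration.

```r
# Hypothetical miniature version of the patent data
p <- data.frame(allcites = c(6, 0, 20, 4, 1, 3, NA, 8),
                appyear  = c(1974, 1974, 1975, 1974, 1975, 1972, 1975, 1969),
                cat      = c(6, 6, 6, 5, 6, 6, 5, 6))

# Keep only the application years of interest
keep <- p$appyear >= 1970 & p$appyear <= 2006

# Rows are years, columns are categories; na.rm is passed on to mean()
citescat <- with(p[keep, ],
                 tapply(allcites, list(year = appyear, cat = cat),
                        mean, na.rm = TRUE))
citescat
```

The result is a matrix with years as row names; year and category combinations with no patents show up as NA. Wrap it in as.data.frame() if you need a data frame.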