Subsetting rows by passing an argument to a function - r

I have the following data frame, which I imported into R using read.table() (I wrapped read.table() in read_data(), a function I created that also throws messages in case the file name is not written appropriately):
> raw_data <- read_data("n44.txt")
[1] #### Reading txt file ####
> head(raw_data)
subject block trial_num soa target_identity prime_type target_type congruency prime_exposure target_exposure button_pressed rt ac
1 99 1 1 200 82 9 1 9 0 36 1 1253 1
2 99 1 2 102 95 2 1 2 75 36 1 1895 1
3 99 1 3 68 257 2 2 1 75 36 2 1049 1
4 99 1 4 68 62 9 1 9 0 36 1 1732 1
5 99 1 5 34 482 9 3 9 0 36 3 765 1
6 99 1 6 68 63 9 1 9 0 36 1 2027 1
Then I'm using raw_data within the early_prep() function I created (I copied only the relevant part of the function):
early_prep <- function(file_name, keep_rows = NULL, id = NULL){
if (is.null(id)) {
# Stops running the function
stop("~~~~~~~~~~~ id is missing. Please provide name of id column ~~~~~~~~~~~")
}
# Call read_data() function
raw_data <- read_data(file_name)
if (!is.null(keep_rows)) {
raw_data <- raw_data[keep_rows, ]
# Print to console
print("#### Deleting unnecessary rows in raw_data ####", quote = FALSE)
}
print(dim(raw_data))
print(head(raw_data))
return(raw_data)
}
My problem is with raw_data <- raw_data[keep_rows, ].
When I enter keep_rows = "raw_data$block > 1" this is what I get:
> x1 <- early_prep(file_name = "n44.txt", keep_rows = "raw_data$block > 1", id = "subject")
[1] #### Reading txt file ####
[1] #### Deleting unnecessary rows in raw_data ####
[1] 1 13
subject block trial_num soa target_identity prime_type target_type congruency prime_exposure target_exposure button_pressed rt ac
NA NA NA NA NA NA NA NA NA NA NA NA NA NA
How can I solve this so that it deletes only the rows I want?
Any help will be greatly appreciated.
Best,
Ayala

The problem is that you pass the condition as a string rather than as an actual condition, so R can't evaluate it when you want it to.
If you still want to pass it as a string, you need to parse and eval it in the right place, for example:
cond = eval(parse(text=keep_rows))
raw_data = raw_data[cond,]
This should work, I think
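To see that in context, here is a minimal, self-contained sketch (the function and data below are illustrative stand-ins, not your real read_data() pipeline). Passing the data frame to eval() as the environment also lets the caller write "block > 1" instead of "raw_data$block > 1":

```r
# Sketch: evaluate a string condition against a data frame's columns
subset_by_string <- function(raw_data, keep_rows = NULL) {
  if (!is.null(keep_rows)) {
    # parse() turns the string into an expression; eval() with the data
    # frame as the environment resolves bare column names like `block`
    cond <- eval(parse(text = keep_rows), raw_data)
    raw_data <- raw_data[cond, , drop = FALSE]
  }
  raw_data
}

df <- data.frame(subject = 99, block = c(1, 1, 2, 2),
                 rt = c(1253, 1895, 1049, 1732))
subset_by_string(df, keep_rows = "block > 1")  # keeps the two block-2 rows
```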


Comparing each element in two columns and set another column

I have a data frame (after fread from a file) with two columns (dep and label). I want to set another column (mark) to an id value depending on the match: if a 'dep' entry matches a 'label' entry, mark gets the 'id' of the matched 'label'; for no match, mark gets its own 'id'. Currently I have a workaround with loops, but I know there should be a neater way to do it with R idioms.
trace <- data.table(id=seq(1:7),dep=c(-1,45,40,47,0,45,43),
label=c(99,40,43,45,47,42,48), mark=rep("",7))
id dep label mark
1: 1 -1 99 1
2: 2 45 40 2
3: 3 40 43 2
4: 4 47 45 4
5: 5 0 47 5
6: 6 45 42 4
7: 7 43 48 3
I know loops are slow in R, and just to give an example, the following naive for/while works for small sizes, but my data set is huge.
trace$mark <- trace$id
for (i in 1:length(trace$id)){
val <- trace$dep[i]
j <- 1
while(j<=i && val !=-1 && val!=0){ # don't compare if val is -1/0
if(val==trace$label[j]){
trace$mark[i] <- trace$id[j]
}
j <-j +1
}
}
I have also tried using the following approach but it works only if there is a single match.
match <- which(trace$dep %in% trace$label)
match_to <- which(trace$label %in% trace$dep)
trace$mark[match] <- trace$mark[match_to]
This solution might help:
trace[trace[,.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
# id dep label mark
# 1: 1 -1 99 1
# 2: 2 45 40 4
# 3: 3 -1 43 3
# 4: 4 47 45 5
# 5: 5 -1 47 5
# 6: 6 45 42 4
# 7: 7 43 48 3
Update:
To make sure you are not matching dep with 0 or -1 values you can just add another line.
trace[dep %in% c(0,-1), mark:= as.character(id)]
OR
Try this:
trace[trace[!dep %in% c(0,-1),.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
The solution that worked:
trace[trace[,.(id,dep=label)],on=.(id<=id,dep),mark:=as.character(i.id),allow.cartesian=TRUE]
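For readers without data.table at hand, the same marking rule can be sketched in base R (illustrative only, using the question's sample values): for each row i, take the last row j <= i whose label equals dep[i], skipping the sentinel values 0 and -1.

```r
id    <- 1:7
dep   <- c(-1, 45, 40, 47, 0, 45, 43)
label <- c(99, 40, 43, 45, 47, 42, 48)

mark <- sapply(seq_along(id), function(i) {
  if (dep[i] %in% c(-1, 0)) return(id[i])  # don't compare sentinel values
  j <- which(label[seq_len(i)] == dep[i])  # matches at or before row i
  if (length(j)) id[max(j)] else id[i]     # last match wins, else own id
})
mark  # 1 2 2 4 5 4 3, matching the loop's expected output
```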

R: tapply(x,y,sum) returns NA instead of 0

I have a data set that contains occurrences of events over multiple years, regions, quarters, and types. Sample:
REGION Prov Year Quarter Type Hit Miss
xxx yy 2008 4 Snow 1 0
xxx yy 2009 2 Rain 0 1
I have variables defined to examine the columns of interest:
syno.h <- data$Hit
quarter.number <- data$Quarter
syno.wrng <- data$Type
I wanted to get the number of Hits per type and quarter for all of the data. Given that the Hits are either 0 or 1, a simple sum() via tapply was my first attempt.
tapply(syno.h, list(syno.wrng, quarter.number), sum)
this returned:
1 2 3 4
ARCO NA NA NA 0
BLSN 0 NA 15 74
BLZD 4 NA 17 54
FZDZ NA NA 0 1
FZRA 26 0 143 194
RAIN 106 126 137 124
SNOW 43 2 215 381
SNSQ 0 NA 18 53
WATCHSNSQ NA NA NA 0
WATCHWSTM 0 NA NA NA
WCHL NA NA NA 1
WIND 47 38 155 167
WIND-SUETES 27 6 37 56
WIND-WRECK 34 14 44 58
WTSM 0 1 7 18
For some of the types that have no occurrences in a given quarter, tapply sometimes returns NA instead of zero. I have checked the data a number of times, and I am confident that it is clean. The values that aren't NA are also correct.
If I check the type/quarter combinations that return NA with tapply using just sum() I get values I expect:
sum(syno.h[quarter.number==3&syno.wrng=="BLSN"])
[1] 15
> sum(syno.h[quarter.number==1&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="ARCO"])
[1] 0
It seems that my issue is with how I use tapply with sum, and not with the data itself.
Does anyone have any suggestions on what the issue may be?
Thanks in advance
I have two potential solutions for you, depending on exactly what you are looking for. If you are just interested in the number of positive Hits per Type and Quarter and don't need a record of cases where no Hits exist, you can get an answer with
aggregate(data[["Hit"]], by = data[c("Type","Quarter")], FUN = sum)
If it is important to keep a record of the ones where there are no hits as well, you can use
dataHit <- data[data[["Hit"]] == 1, ]
# keep the full data's levels so zero-hit combinations still show up as 0
dataHit[["Type"]] <- factor(dataHit[["Type"]], levels = unique(data[["Type"]]))
dataHit[["Quarter"]] <- factor(dataHit[["Quarter"]], levels = unique(data[["Quarter"]]))
table(dataHit[["Type"]], dataHit[["Quarter"]])
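A note on why the NAs appear in the first place: tapply() builds the full Type x Quarter grid and leaves any cell with no observations at all as NA — it never calls sum() on an empty subset, whereas your manual sum() over an empty subset returns 0. Since R 3.4.0, tapply() accepts a `default` argument for exactly this case. A minimal sketch with toy stand-in data (column contents assumed):

```r
hits    <- c(1, 0, 1, 1, 1)                          # stand-in for data$Hit
type    <- c("SNOW", "RAIN", "SNOW", "RAIN", "BLSN")
quarter <- c(1, 1, 4, 4, 4)

tapply(hits, list(type, quarter), sum)               # BLSN/1 cell is NA
tapply(hits, list(type, quarter), sum, default = 0)  # BLSN/1 cell is 0
```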

How to remove row wise sorting in R dataframes

I have the following code written to bind two columns and create a data frame.
complete<-function(directory,id){
x<-vector()
y<-vector()
files<-list.files(directory,full.names=TRUE)
for(i in id){
x[i]<-i
y[i]<-sum(complete.cases(read.csv(files[i])))
}
d<-na.omit(data.frame(x,y))
colnames(d)<-c("id","nobs")
rownames(d)<-1:nrow(d)
print(d)
}
I have the following test case :
complete("specdata",30:25)
id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 71
6 30 932
I am not able to get the output in the order passed to the function, i.e.
id=30 as the first value and id=25 as the last value. How do I disable the automatic sorting by id?
We can change for(i in id) to for(i in seq_along(id)) to loop by the sequence of 'id'. Also, make some necessary changes in assigning x[i] and y[i].
complete<-function(directory, id){
x<- vector()
y<- vector()
files<-list.files(directory,full.names=TRUE)
for(i in seq_along(id)){
x[i]<- id[i]
y[i]<-sum(complete.cases(read.csv(files[id[i]])))
}
d<-na.omit(data.frame(x,y))
colnames(d)<-c("id","nobs")
rownames(d)<-1:nrow(d)
print(d)
}
Testing
complete('specdata', 25:30)
#id nobs
#1 25 4
#2 26 0
#3 27 1
#4 28 1
#5 29 2
#6 30 13
complete('specdata', 30:25)
# id nobs
#1 30 13
#2 29 2
#3 28 1
#4 27 1
#5 26 0
#6 25 4
NOTE: The values are different because the 'specdata' directory that I have is from a previous coursera link. They might have updated the data
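As an aside, the reason the original version came out sorted: x[i] <- i indexes by the id value rather than the loop position, so the vector grows to length 30 with NAs in the unfilled slots, and na.omit() then returns the surviving positions in ascending index order. A small base-R demonstration:

```r
x <- vector()
for (i in 30:25) x[i] <- i   # x[30] <- 30 first: slots 1..29 become NA
head(x, 5)                   # NA NA NA NA NA -- slots 1..24 never filled
as.vector(na.omit(x))        # 25 26 27 28 29 30, sorted regardless of loop order
```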

Optimization of an R loop taking 18 hours to run

I've got R code that works and does what I want, but it takes a huge amount of time to run. Here is an explanation of what the code does, and the code itself.
I've got a vector of 200000 lines containing street addresses (strings): data.
Example :
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a 131x2 matrix of string elements: 5-grams (parts of words) and the ids of the bags of n-grams they belong to (example of a 5-gram bag: ["stack", "tacko", "ackov", "ckove", "kover", ...]): list_ngrams
Example of list_ngrams :
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I have also a 200000x31 numerical matrix initialized with 0 : idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop over the string addresses and check whether each contains one of the n-grams in my list. If it does, I put a 1 in the corresponding column, which represents the id of the bag that contains that 5-gram.
Example :
In this address: "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" exists in the bag ["resid","eside","dence",...] whose id is 5. So I put a 1 in the column called 5, and the corresponding line of the idv_x_bags matrix looks like the following:
> idv_x_bags[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does it:
idv_x_bags <- matrix(0, nrow = nrow(data), ncol = 31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for(i in 1:nrow(idv_x_bags))
{
for(ngram in list_ngrams$ngram)
{
if(grepl(ngram, data[i,]))
{
idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'", ngram))[[1]]
idv_x_bags[i, as.character(idSac)] <- 1
}
}
}
The code does exactly what I aim to do, but it takes about 18 hours, which is huge. I tried to recode it in C++ using the Rcpp library, but I encountered many problems. I also tried to recode it using apply, but I couldn't do it.
Here is what I did :
apply(cbind(data, 1:nrow(data)), 1, function(x){
apply(list_ngrams, 1, function(y){
if(grepl(y[2], x[1])){idv_x_bags[x[2], str_trim(as.character(y[1]))] <- 1}
})
})
I need some help coding my loop using apply or some other method that runs faster than the current one. Thank you very much.
Check this one, and run the simple example step by step to see how it works.
My n-grams don't make much sense, but it will work with actual n-grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
data.frame() %>%
rename(sen = Var1,
id_ngr = Var2) %>%
left_join(dt_ngr, by = "id_ngr") %>%
group_by(sen, id_ngr,ngr) %>%
do(data.frame(match = grepl(.$ngr,.$sen))) %>%
group_by(sen,id_ngr) %>%
summarise(sum_success = sum(match)) %>%
mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
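If you would rather stay in base R, most of the 18 hours likely goes to the sqldf() query inside the double loop. A hedged sketch of the same idea with toy data (column names assumed): do one vectorized grepl() pass per n-gram over all addresses, i.e. 131 passes instead of 200000 x 131 per-pair queries.

```r
# Toy stand-ins for the 200000 addresses and the n-gram table
data <- c("15 rue andre lalande residence marguerite yourcenar 91000 evry",
          "10 avenue de la gare 75000 paris")
list_ngrams <- data.frame(idSac = c(5, 5, 6, 6),
                          ngram = c("resid", "dence", "avenu", "venue"),
                          stringsAsFactors = FALSE)

bags <- sort(unique(list_ngrams$idSac))
idv_x_bags <- matrix(0L, nrow = length(data), ncol = length(bags),
                     dimnames = list(NULL, bags))

# One vectorized grepl() per n-gram: `hit` flags every address containing it
for (k in seq_len(nrow(list_ngrams))) {
  hit <- grepl(list_ngrams$ngram[k], data, fixed = TRUE)
  idv_x_bags[hit, as.character(list_ngrams$idSac[k])] <- 1L
}
idv_x_bags
```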

dplyr "not a promise" error

I have a panel dataset for which I have created lagged variables using the lag() function.
When I try to calculate the delta for each timepoint, using the mutate command below, I get the error message "Error: not a promise"
> kw.lags[,c("imps", "lag1_imps", "lag2_imps")]
Source: local data frame [157,737 x 3]
Groups:
imps lag1_imps lag2_imps
1 65 NA NA
2 79 65 NA
3 62 79 65
4 69 62 79
5 1 NA NA
6 2 NA NA
7 2 2 NA
8 1 2 2
9 2 1 2
10 5 NA NA
.. ... ... ...
> kw.deltas <- mutate(kw.lags,
+ d1_imps = imps - lag1_imps,
+ d2_imps = imps - lag2_imps,
+ d3_imps = imps - lag3_imps,
+ )
Error: not a promise
You have a comma after the last line in your mutate() statement. Remove it and see if that fixes the error.
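The underlying mechanics, in a base-R sketch: a trailing comma passes an empty (missing) argument, which most functions reject; dplyr's lazy evaluation surfaced this as the cryptic "not a promise" in older versions.

```r
# A trailing comma is an empty argument, not a no-op
count_args <- function(...) length(list(...))
count_args(1, 2)                             # 2
r <- try(count_args(1, 2, ), silent = TRUE)  # errors: argument 3 is empty
inherits(r, "try-error")                     # TRUE
```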
