R 3.4.1 gsub on Windows 10 - find and replace all strings except for - r

I am trying to clean up data for a class project. The data deals with NOAA Storm data from 1950 to 2011. The storm types (EVTYPE) are only supposed to be 48 different levels, but there are over 1000 unique entries. I am trying to find all the snow related entries, which gives me:
table(grep("snow", temp$EVTYPE, ignore.case = TRUE, value = TRUE))
ACCUMULATED.SNOWFALL BLOWING.SNOW COLD.AND.SNOW DRIFTING.SNOW
4 5 1 1
EARLY.SNOWFALL EXCESSIVE.SNOW FALLING.SNOW.ICE FIRST.SNOW
7 25 2 2
HEAVY.SNOW HEAVY.SNOW.SHOWER HEAVY.SNOW.SQUALLS ICE.SNOW
13988 1 1 4
LAKE.EFFECT.SNOW LATE.SEASON.SNOW LATE.SEASON.SNOWFALL LATE.SNOW
656 1 3 2
LIGHT.SNOW LIGHT.SNOW.FLURRIES LIGHT.SNOW.FREEZING.PRECIP LIGHT.SNOWFALL
174 3 1 1
MODERATE.SNOW MODERATE.SNOWFALL MONTHLY.SNOWFALL MOUNTAIN.SNOWS
1 101 1 1
RECORD.MAY.SNOW RECORD.SNOW RECORD.SNOWFALL RECORD.WINTER.SNOW
1 2 2 3
SEASONAL.SNOWFALL SNOW SNOW.ACCUMULATION SNOW.ADVISORY
1 425 1 1
SNOW.AND.ICE SNOW.AND.SLEET SNOW.BLOWING.SNOW SNOW.DROUGHT
4 5 6 4
SNOW.ICE SNOW.SHOWERS SNOW.SLEET SNOW.SQUALL
1 5 5 5
SNOW.SQUALLS THUNDERSNOW.SHOWER UNUSUALLY.LATE.SNOW
14 1 1
There is a storm type called "Lake.Effect.Snow", which is one of the 48 storm types. How can I replace all of the other entries while excluding that particular storm type? I've tried:
table(grep("([^lake]?)snow", temp$EVTYPE, ignore.case = TRUE, value = TRUE))
to try and ignore the Lake.Effect.Snow entries, but no good.

Use stringr::str_detect with if.else.
library("stringr")
temp$EVTYPE <- if.else(str_detect(temp$EVTYPE, regex("snow", ignore_case = TRUE)) & temp$EVTYPE != "Lake.Effect.Snow", "Snow", temp$EVTYPE)

Related

How can this R code be sped up with the apply (lapply, mapply ect.) functions?

I am not to proficient with the apply functions, or with R. But I know I overuse for loops which makes my code slow. How can the following code be sped up with apply functions, or in any other way?
sum_store = NULL
for (col in 1:ncol(cazy_fams)){ # for each column in cazy_fams (so for each master family eg. GH, AA ect...)
for (row in 1:nrow(cazy_fams)){ # for each row in cazy fams (so the specific family number e.g GH1 AA7 ect...)
# Isolating the row that pertains to the current cazy family being looked at for every dataframe in the list
filt_fam = lapply(family_summary, function(sample){
sample[as.character(sample$Family) %in% paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = ""),]
})
row_cat = do.call(rbind, filt_fam) # concatinating the lapply list output int a dataframe
if (nrow(row_cat) > 0){
fam_sum = aggregate(proteins ~ Family, data=row_cat, FUN=sum) # collapsing the dataframe into one row and summing the proteins count
sum_store = rbind(sum_store, fam_sum) # storing the results for that family
} else if (grepl("NA", paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")) == FALSE) {
Family = paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")
proteins = 0
sum_store = rbind(sum_store, data.frame(Family, proteins))
} else {
next
}
}
}
family_summary is just a list of 18 two column dataframes that look like this:
Family proteins
CE0 2
CE1 9
CE4 15
CE7 1
CE9 1
CE14 10
GH0 5
GH1 1
GH3 4
GH4 1
GH8 1
GH9 2
GH13 2
GH15 5
GH17 1
with different cazy families.
cazy_fams is just a dataframe with each coulms being a cazy class (eg. GH, AA ect...) and ech row being a family number, all taken from the linked website:
GH GT PL CE AA CBM
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
9 9 9 9 9 9
10 10 10 10 10 10
11 11 11 11 11 11
12 12 12 12 12 12
13 13 13 13 13 13
14 14 14 14 14 14
15 15 15 15 15 15
The reason behind the else if (grepl("NA", paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")) == FALSE) statment is to deal with the fact not all classes have the same number of family so when looping over my dataframe I end up with some GHNA and AANA with NA on the end.
The output sum_store is this:
Family proteins
GH1 54
GH2 51
GH3 125
GH4 29
GH5 40
GH6 25
GH7 0
GH8 16
GH9 25
GH10 19
GH11 5
GH12 5
GH13 164
GH14 3
GH15 61
A dataframe with all listed cazy families and the total number of apperances across the family_summary list.
Please let me know if you need anything else to help answer my question.

Frequency distribution using binCounts

I have a dataset of Ages for the customer and I wanted to make a frequency distribution by 9 years of a gap of age.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to below-shared table, variable names can be differed(as you wish)
Could I use binCounts code into it ? if yes could you help me out using the code as not sure of bx and idxs in this code?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know about the binCounts or even the package it is in but i have a bare r function:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit=c(37,46,55,64,73,82,91,101)
Labels=paste(head(lowerlimit,-1)+1,lowerlimit[-1],sep="-")#I add one to have 38 47 etc
group=cut(Ages,lowerlimit,Labels)#Determine which group the ages belong to
tab=table(group)#Form a frequency table
as.data.frame(tab)# transform the table into a dataframe
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))

Identification of items by use of wildcards

I have a dataset with several hundret items, looking like this
ID 01_ab_dog 01_ae_cat 02_ae_dog 02_hg_horse 01_oq_cat etc ...
1 1 3 5 8 10 ...
2 654 12 89 7 112 ...
3 4 9 4 978 64 ...
4 19 86 95 46 8 ...
I am looking to identify all items that include the word - let´s say - 'cat'. A solution that includes wildcards (e.g. 01_**_cat) would be great and I was looking for something like this but I did not suceed. How do I solve this problem?
I am not sure what you mean with item. To get all columns with cat, you could use grepl.
df <- data.frame(ab = 1, b = 1, cat_a = 1, bb_bbcat = 1)
df[, grepl("cat", names(df))]
# cat_a bb_bbcat
# 1 1 1

Optimization of an R loop taking 18 hours to run

I've got an R code that works and does what I want but It takes a huge time to run. Here is an explanation of what the code does and the code itself.
I've got a vector of 200000 line containing street adresses (String) : data.
Example :
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a matrix of 131x2 string elements which are 5grams (part of word) and the ids of the bags of NGrams (example of a 5Grams bag : ["stack", "tacko", "ackov", "ckover", ",overf", ... ] ) : list_ngrams
Example of list_ngrams :
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I have also a 200000x31 numerical matrix initialized with 0 : idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop the string addresses and check whether it contains one of the n-grams in my list or not. If it does, I put one in the corresponding column which represents the id of the bag that contains the 5-gram.
Example :
In this address : "15 rue andre lalande residence marguerite yourcenar 91000 evry france". The word "residence" exists in the bag ["resid","eside","dence",...] which the id is 5. So I'm gonna put 1 in the column called 5. Therefore the corresponding line "idv_x_bags" matrix will look like the following :
> idv_x_sacs[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does :
idv_x_sacs <- matrix(rep(0,nrow(data)*31),nrow=nrow(data),ncol=31)
colnames(idv_x_sacs) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for(i in 1:nrow(idv_x_bags))
{
for(ngram in list_ngrams$ngram)
{
if(grepl(ngram,data[i,])==TRUE)
{
idSac <- sqldf(sprintf("select idSac from list_ngramswhere ngram='%s'",ngram))[[1]]
idv_x_bags[i,as.character(idSac)] <- 1
}
}
}
The code does perfectly what I aim to do, but it takes about 18 hours which is huge. I tried to recode it with c++ using Rcpp library but I encountered many problems. I'm tried to recode it using apply, but I couldn't do it.
Here is what I did :
apply(cbind(data,1:nrow(data),1,function(x){
apply(list_ngrams,1,function(y){
if(grepl(y[2],x[1])==TRUE){idv_x_bags[x[2],str_trim(as.character(y[1]))]<-1}
})
})
I need some help with coding my loop using apply or some other method that run faster that the current one. Thank you very much.
Check this one and run the simple example step by step to see how it works.
My N-Grams don't make much sense, but it will work with actual N_Grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
data.frame() %>%
rename(sen = Var1,
id_ngr = Var2) %>%
left_join(dt_ngr, by = "id_ngr") %>%
group_by(sen, id_ngr,ngr) %>%
do(data.frame(match = grepl(.$ngr,.$sen))) %>%
group_by(sen,id_ngr) %>%
summarise(sum_success = sum(match)) %>%
mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1

Converting R data frame with RDS package: recruitment id error?

I am using the RDS package for respondent-driven sampling survey data. I want to convert a regular R data frame to an rds.data.frame. To do so, I have been trying to use the as.rds.data.frame function from RDS.
Here is an excerpted section of my data frame, where the first case (id=1) is the 'seed' respondent (who has no recruiter). It contains the variables: id (respondent id number), recruit.id(id number of respondent who recruited him/her), netsize (respondent's network size) and population (estimate of whole population size).
df<-data.frame(id=c(1,2,3,4,5,6,7,8,9,10),
recruit.id=c(-1,1,1,2,2,4,5,3,8,3),
netsize=c(6,6,6,5,5,4,4,3,4,6), population=rep(22,000, 10))
I then (try to) apply the relevant function:
new.df <-as.rds.data.frame(df,id=df$id,
recruiter.id=df$recruit.id,
network.size=df$netsize,
population.size=df$population,
max.coupons=2)
I get the error message:
Error in as.rds.data.frame(df, id = df$id, recruiter.id = df$recruit.id,: Invalid id
and the warning
In addition: Warning message:In if (!(id %in% names(x))) stop("Invalid id") :
the condition has length > 1 and only the first element will be used
I have tried assigning various 'recruiter id' values for seed participants, including -1,0 or their own id number but I still get the same message. I have also tried eliminating function arguments (coupon.max, population) or deleting seed respondents, but I still get the same message.
Package documentation says the function will fail if recruitment information is incomplete. As far as I can tell, this is not the case.
I am new to this, so if anyone can point me in the right direction I would be really grateful.
This seems to work:
colnames(df)[2:4] <- c("recruiter.id", "network.size.variable", "population.size")
as.rds.data.frame(df,max.coupons=2)
This gives a result with a warning
as.rds.data.frame(df, id="id", recruiter.id="recruit.id",
network.size="netsize", population.size="population", max.coupons=2)
# An object of class "rds.data.frame"
#id: 1 2 3 4 5 6 7 8 9 10
#recruiter.id: -1 1 1 2 2 4 5 3 8 3
# id recruit.id netsize population
#1 1 -1 6 22
#2 2 1 6 22
#3 3 1 6 22
#4 4 2 5 22
#5 5 2 5 22
#6 6 4 4 22
#7 7 5 4 22
#8 8 3 3 22
#9 9 8 4 22
#10 10 3 6 22
# Warning message:
#In as.rds.data.frame(df, id = "id", recruiter.id = "recruit.id", :
#NAs introduced by coercion

Resources