How to select tuples that begin with a certain substring in R

I have a data frame that looks as follows
> df[1:10,c("Uri","Latency")]
Uri
1 /filters/test_group_1/test_datasource%20with%20space/test_application_alias_100
2 /applications?includeDashboards&includeMappings
3 /applications/test_application_alias_1
4 /applications?includeDashboards&includeMappings
5 /applications/test_application_alias_200
6 /applications/test_application_alias_100
7 /filters/00000000-0000-0000-0000-000000000001/test_datasource%20with%20space/test_application_alias_0
8 /dashboards?dashboard=test_dashboard_alias_9&includeMappings
9 /filters/00000000-0000-0000-0000-000000000001/test_dataSource_1/test_application_alias_100
10 /filters/00000000-0000-0000-0000-000000000001/test_datasource%20with%20space/test_application_alias_100
Latency
1 296
2 1388
3 58
4 833
5 239
6 60
7 217
8 36
9 86
10 112
I want to select only those rows whose Uri starts with /applications. The rest of the Uri can be anything and is not important.
I could get exact matches with the following:
df[which(df$Uri == "/applications"),c("Uri","Latency")]
However, since I am looking for a prefix rather than an exact match, I understand I may need some kind of wildcard matching, which in SQL would look like:
select * from <table_name> where Uri like '/applications%'
How can I do the same in R?

Assuming that df$Uri is a character vector, I'd go with:
df[startsWith(df$Uri, "/applications"), ]

I'd use a regular expression:
df[grepl("^/applications", df[, "Uri"]), c("Uri", "Latency")]
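To compare the two answers side by side, here is a minimal sketch on a toy data frame. The rows are illustrative values modelled on the question, not the asker's real data, and `stringsAsFactors = FALSE` is set so that `Uri` is a character vector, which `startsWith()` requires.

```r
# Toy data modelled on the question; the values are illustrative only
df <- data.frame(
  Uri = c("/filters/test_group_1/test_application_alias_100",
          "/applications?includeDashboards&includeMappings",
          "/applications/test_application_alias_1",
          "/dashboards?dashboard=test_dashboard_alias_9"),
  Latency = c(296, 1388, 58, 36),
  stringsAsFactors = FALSE
)

# Prefix test: no regular expression involved
by_prefix <- df[startsWith(df$Uri, "/applications"), c("Uri", "Latency")]

# Regular expression anchored at the start of the string with ^
by_regex <- df[grepl("^/applications", df$Uri), c("Uri", "Latency")]

print(by_prefix)
```

Both selections should return the same two rows here. The regular-expression form is the more flexible of the two if the pattern later needs to become more than a fixed prefix.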

Related

R countif and sum on multiple columns matching elements in specified vector

I am applying the function below to column DL1 of my dataset, checking it against another vector, and I get the expected result:
table(df$DL1[df$DL1 %in% undefined_dl_codes])
Result:
0 10 30 3B 4 49 54 5A 60 7 78 8 90
24 366 4 3 665 40 1 1 14 8 4 87 1
However, I also have columns DL2, DL3 and DL4, which hold the same kind of data. How can I apply the function to multiple columns and get a single combined result? I need to go through all 4 columns and receive 1 result as a summary.
Any help highly appreciated!
This may not be the best method, but you could do the following:
table(c(df$DL1[df$DL1 %in% undefined_dl_codes],
df$DL2[df$DL2 %in% undefined_dl_codes],
df$DL3[df$DL3 %in% undefined_dl_codes],
df$DL4[df$DL4 %in% undefined_dl_codes]
)
)
Using Raghuveer's solution, I simplified it further:
attach(df)
table(c(DL1,DL2,DL3,DL4)[c(DL1,DL2,DL3,DL4) %in% undefined_dl_codes])
detach(df)
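The same pooled count can also be written without `attach()`, by flattening the four columns into one vector first. This is a sketch only: the toy `df` and `undefined_dl_codes` below are made-up stand-ins for the asker's objects, and the columns are assumed to be character rather than factor.

```r
# Made-up stand-ins for the asker's data and code list
df <- data.frame(
  DL1 = c("10", "4", "ZZ", "8"),
  DL2 = c("4", "10", "10", "YY"),
  DL3 = c("8", "XX", "4", "4"),
  DL4 = c("10", "10", "WW", "8"),
  stringsAsFactors = FALSE
)
undefined_dl_codes <- c("10", "4", "8")

cols <- c("DL1", "DL2", "DL3", "DL4")

# Flatten the selected columns into a single character vector
all_values <- unlist(df[cols], use.names = FALSE)

# Count only the values that appear in the code list
counts <- table(all_values[all_values %in% undefined_dl_codes])
print(counts)
```

Keeping the column names in `cols` means adding a fifth column later is a one-word change, and nothing is left attached to the search path.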

Filtering dataset by values and replacing with values in other dataset in R [duplicate]

This question already has answers here:
Replace values in data frame based on other data frame in R
I have two datasets like this:
>data1
id l_eng l_ups
1 6385 239
2 680 0
3 3165 0
4 17941 440
5 135 25
6 151 96
7 102188 84
8 440 65
9 6613 408
>data2
id l_ups
1 237
2 549
3 100
4 444
5 28
6 101
7 229
8 92
9 47
I want to find the rows in data1 where l_ups == 0 and replace those values with the corresponding values in data2, using id as the lookup key, in R.
Final output should look like this:
id l_eng l_ups
1 6385 239
2 680 549
3 3165 100
4 17941 440
5 135 25
6 151 96
7 102188 84
8 440 65
9 6613 408
I tried the code below, but with no luck:
if(data1[,3]==0)
{
filter(data1, last_90_uploads == 0) %>%
merge(data_2, by.x = c("id", "l_ups"),
by.y = c("id", "l_ups")) %>%
select(-l_ups)
}
I cannot do this with an if statement, because if accepts only a single logical value as its condition. What if the condition yields more than one logical value?
Like this:
>if(data1[,3]==0)
TRUE TRUE
Edit:
I want to select the values that meet a condition and replace them with values from another dataset. Hence, this question is not a duplicate of the one suggested.
You don't want to filter. filter is an operation that returns a data set from which rows might have been removed.
You are looking for a "conditional update" operation (in database terms). You are already using dplyr, so try a join operation instead of match:
left_join(data1, data2, by = 'id') %>%
mutate(l_ups = ifelse(is.na(l_ups.x) | l_ups.x == 0, l_ups.y, l_ups.x)) %>%
select(id, l_eng, l_ups)
By using a join operation rather than the direct subsetting comparison that @markus suggested, you ensure that you only compare values with the same ids. If one of your data frames happens to be missing a row, the direct subsetting comparison will fail.
Using left_join rather than inner_join also ensures that if data2 is missing an id, the corresponding row will not be removed from data1.
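For readers without dplyr, the same conditional update can be sketched in base R with `match()`, which also looks values up by id rather than by row position. The data frames below reproduce the values shown in the question; the names `idx` and `needs_update` are introduced here for illustration.

```r
data1 <- data.frame(
  id    = 1:9,
  l_eng = c(6385, 680, 3165, 17941, 135, 151, 102188, 440, 6613),
  l_ups = c(239, 0, 0, 440, 25, 96, 84, 65, 408)
)
data2 <- data.frame(
  id    = 1:9,
  l_ups = c(237, 549, 100, 444, 28, 101, 229, 92, 47)
)

# For each row of data1, find the row of data2 with the same id (NA if none)
idx <- match(data1$id, data2$id)

# Update only rows where l_ups is 0 and a matching id exists in data2
needs_update <- !is.na(data1$l_ups) & data1$l_ups == 0 & !is.na(idx)
data1$l_ups[needs_update] <- data2$l_ups[idx[needs_update]]

print(data1)
```

Rows whose id has no counterpart in data2 are left untouched, which mirrors the behaviour of the left join described above.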

Running predictive model according to values in column

I have a dataframe (I might in future not use it):
> PM
names.model.
1 4
2 5
3 6
4 8
5 9
This means that for a TN value of 4, for instance, I'll use model[1]; for a value of 5 I'll use model[2], and so on.
As already mentioned, I have a list of models (from 1 to 5).
I have another dataframe, that has a column TN.
As can be seen:
> head (test)
Ozone Solar.R Wind Temp Month Day TN
2 36 118 8.0 72 5 2 4
8 19 99 13.8 59 5 8 4
14 14 274 10.9 68 5 14 5
40 71 291 13.8 90 6 9 9
62 135 269 4.1 84 7 1 8
69 97 267 6.3 92 7 8 9
I would like to add a new column test$Ozone_pred that runs the relevant model for each row. For instance, for the first row I'll run model[1], and likewise for the second row (both have TN 4). For the third row I'll run model[2], for the fourth row model[5], etc.
There are a couple of options. The first would be to use one of dplyr's join functions to add your first data frame (PM) to the second one (test) as a new column, and then index based on that. Below is a solution with base R.
To get the correct model for a single row with PM as it currently is (this assumes the TN values are in the second column of PM and that the row position equals the model index):
model[match(test_TN_number, PM[,2])]
If the first column of PM holds the model index and is not simply equal to the row numbers, then:
model[PM[match(test_TN_number, PM[,2]), 1]]
This is then easily extended to the whole data frame with apply or a loop.
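Edit: here's a for-loop version that stores the selected model for each row:
selected <- vector("list", nrow(test))
for (i in seq_len(nrow(test))) {
  # look up the model index for this row's TN value, then fetch that model
  selected[[i]] <- model[[PM[match(test[i, "TN"], PM[,2]), 1]]]
}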
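To make the lookup concrete, here is a self-contained sketch that goes all the way to the Ozone_pred column. Everything in the setup is an assumption for illustration: the five `lm` fits on the built-in `airquality` data stand in for the asker's unknown models, PM is taken to be the single-column data frame printed in the question, and the models are assumed to support `predict()`.

```r
# Hypothetical setup: five simple linear models standing in for the asker's list
aq <- na.omit(airquality)
model <- list(
  lm(Ozone ~ Temp, data = aq),
  lm(Ozone ~ Wind, data = aq),
  lm(Ozone ~ Solar.R, data = aq),
  lm(Ozone ~ Temp + Wind, data = aq),
  lm(Ozone ~ Temp + Solar.R, data = aq)
)
PM <- data.frame(names.model. = c(4, 5, 6, 8, 9))

# Six rows with the TN values shown in the question
test <- aq[1:6, ]
test$TN <- c(4, 4, 5, 9, 8, 9)

# The position of each TN value in PM is the index into `model`
model_idx <- match(test$TN, PM$names.model.)

# Predict group by group, so each model is called once for all of its rows
test$Ozone_pred <- NA_real_
for (k in unique(na.omit(model_idx))) {
  rows <- which(model_idx == k)
  test$Ozone_pred[rows] <- predict(model[[k]], newdata = test[rows, ])
}

print(test[, c("TN", "Ozone_pred")])
```

Rows whose TN value does not appear in PM keep `NA` in Ozone_pred rather than raising an error.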

Identification of items by use of wildcards

I have a dataset with several hundred items, looking like this:
ID 01_ab_dog 01_ae_cat 02_ae_dog 02_hg_horse 01_oq_cat etc ...
1 1 3 5 8 10 ...
2 654 12 89 7 112 ...
3 4 9 4 978 64 ...
4 19 86 95 46 8 ...
I want to identify all items that include a given word, let's say 'cat'. A solution that supports wildcards (e.g. 01_**_cat) would be great. I looked for something like this but did not succeed. How do I solve this problem?
I am not sure what you mean by item. To get all columns with cat in their name, you could use grepl.
df <- data.frame(ab = 1, b = 1, cat_a = 1, bb_bbcat = 1)
df[, grepl("cat", names(df))]
# cat_a bb_bbcat
# 1 1 1
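For the wildcard form the question asks about, one option is `glob2rx()` from the utils package, which converts a shell-style wildcard pattern into a regular expression. The sketch below uses a small made-up data frame with column names in the question's style. `check.names = FALSE` keeps names that start with a digit unchanged, and the column `02_zz_cat` is added here only to show the difference between the two patterns.

```r
# Made-up data with column names in the style of the question
df <- data.frame(
  ID = 1:2,
  `01_ab_dog` = c(1, 654),
  `01_ae_cat` = c(3, 12),
  `02_ae_dog` = c(5, 89),
  `01_oq_cat` = c(10, 112),
  `02_zz_cat` = c(7, 9),
  check.names = FALSE
)

# Any column whose name contains "cat"
cat_any <- grep("cat", names(df), value = TRUE)

# Wildcard match: names starting with "01_" and ending with "_cat"
pattern <- glob2rx("01_*_cat")
cat_01 <- grep(pattern, names(df), value = TRUE)

# drop = FALSE keeps a data frame even if only one column matches
subset_01 <- df[, cat_01, drop = FALSE]
print(names(subset_01))
```

The `drop = FALSE` argument matters when a pattern matches exactly one column, since the subset would otherwise collapse to a plain vector.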

select records according to the difference between records R

I hope someone can suggest an approach to this "problem", because I really don't know how to proceed.
My data look like this:
data<-data.frame(site=c(rep("A",3),rep("B",3),rep("C",3)),time=c(100,180,245,5,55,130,70,120,160))
where time is in minutes.
I want to select, for each site, only the records for which the difference is more than 60, so the output should look like this:
out<-data[c(1:4,6,7,9),]
Here is what I have tried so far. To get the differences I use this:
difference<-stack(tapply(data$time,data$site,diff))
but then I have no idea how to pick out the records that satisfy my condition.
I searched for a while; if a similar question already exists, I apologize.
To make things clear, since my definition of difference was probably ambiguous: I need to select all the records (for each site) that are separated by more than 60 minutes, not only those that are strictly consecutive in time.
Specifically,
> out
site time
1 A 100 # included because the difference between records 2 and 1 is > 60
2 A 180 # included because the difference between records 3 and 2 is > 60
3 A 245 # included because it is more than 60 minutes after record 2
4 B 5 # included because the difference between records 6 and 4 is > 60
6 B 130 # included because it is more than 60 minutes after record 4
7 C 70 # included because the difference between records 9 and 7 is > 60
9 C 160 # included because it is more than 60 minutes after record 7
To solve the "problem", it may be useful to consider the differences themselves, something like this:
> difference
values ind
1 80 A#include record 1 and 2
2 65 A#include record 2 and 3
3 50 B#include only record 4
4 75 B#include record 6 because there are(50+75)>60 m from r#4
5 50 C#include only record 7
6 40 C#include record 9 because there are (50+40)>60 m from r#7
Thanks for the help.
data[ave(data$time, data$site, FUN = function(x){c(61, diff(x)) > 60}) == 1, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
Edit following updated question:
keep <- as.logical(ave(data$time, data$site, FUN = function(x){
c(TRUE, cumsum(diff(x)) > 60)
}))
data[keep, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
# 9 C 160
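The cumulative-sum version above measures every record against the first record of its site. If the intent is instead to compare each record with the last record that was kept, a small loop does that. This is a sketch of one possible reading of the requirement, not the answerer's method; `keep_spaced` is a hypothetical helper name, and times are assumed to be sorted within each site.

```r
data <- data.frame(site = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                   time = c(100, 180, 245, 5, 55, 130, 70, 120, 160))

# Hypothetical helper: keep a record only if it is more than `gap`
# minutes after the last record that was kept (x assumed sorted)
keep_spaced <- function(x, gap = 60) {
  keep <- logical(length(x))
  last_kept <- -Inf
  for (i in seq_along(x)) {
    if (x[i] - last_kept > gap) {
      keep[i] <- TRUE
      last_kept <- x[i]
    }
  }
  keep
}

# ave() applies the helper within each site; the result is coerced to 0/1
keep <- as.logical(ave(data$time, data$site, FUN = keep_spaced))
out <- data[keep, ]
print(out)
```

On the example data this selects the same rows as the asker's expected output. The two approaches can differ on longer series, where the distance to the first record and the distance to the last kept record diverge.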
#Calculate the differences
data$diff <- unlist(by(data$time, data$site,function(x)c(NA,diff(x))))
#subset data
data[is.na(data$diff) | data$diff > 60,]
Using plyr:
library(plyr)
ddply(data, .(site), function(x) x[c(TRUE, diff(x$time) > 60), ])
