Variable selection using regsubsets in R

I'm working on a Tweets project and extracted 87 variables. Now I need to perform variable selection, so I used forward subset selection, but I'm facing an error.
regfit.fwd = regsubsets(screen_name ~ ., merge_tweets, method = "forward",
                        complete.cases(merge_tweets), nvmax = 15)

Error in leaps.setup(x[, ii[reorder], drop = FALSE], y, wt, force.in[reorder], :
  NA/NaN/Inf in foreign function call (arg 4)
> head(merge_tweets)
  X    user_id    status_id created_at    screen_name
1 1 1339835893 1.090257e+18 1548772454 HillaryClinton
2 2 1339835893 1.090002e+18 1548711688 HillaryClinton
3 3 1339835893 1.089999e+18 1548710912 HillaryClinton
4 4 1339835893 1.089994e+18 1548709837 HillaryClinton
5 5 1339835893 1.089994e+18 1548709756 HillaryClinton
6 6 1339835893 1.089994e+18 1548709738 HillaryClinton
  text
1 On top of human suffering and lasting damage to our national parks, the Trump shutdown cost the economy $11 billion. End shutdowns as a political hostage-taking tactic.
2 Hurricane Maria decimated trees and ecosystems in Puerto Rico. Para La Naturaleza's nurseries have made a CGI commitment to plant 750,000 trees in seven years. The team here has already grown 120,000 seedlings and planted 30,000 trees.
              source display_text_width is_quote is_retweet favorite_count retweet_count lang
1 Twitter Web Client                192    FALSE      FALSE          14324          4168   en
2 Twitter Web Client                235    FALSE      FALSE          10684          2526   en
3 Twitter Web Client                238    FALSE      FALSE          11423          2089   en
4 Twitter Web Client                 34    FALSE      FALSE           1293           113   en
5 Twitter Web Client                222    FALSE      FALSE           6641           951   en
6 Twitter Web Client                214    FALSE      FALSE          12192          2108   en
  status_url            name     location
1            Hillary Clinton New York, NY
2            Hillary Clinton New York, NY
  description
1 2016 Democratic Nominee, SecState, Senator, hair icon. Mom, Wife, Grandma x2, lawyer, advocate, fan of walks in the woods & standing up for our democracy.
2 2016 Democratic Nominee, SecState, Senator, hair icon. Mom, Wife, Grandma x2, lawyer, advocate, fan of walks in the woods & standing up for our democracy.
  url protected followers_count friends_count listed_count statuses_count
1         FALSE        24017203           784        41782          10667
2         FALSE        24017203           784        41782          10667
3         FALSE        24017203           784        41782          10667
  favourites_count account_created_at verified profile_url profile_expanded_url
1             1138         1365530675     TRUE
2             1138         1365530675     TRUE
3             1138         1365530675     TRUE
I have removed some URL columns, since the site doesn't allow URLs to be posted. It would be great if anyone could help me solve this problem.
Thanks in advance!!
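A minimal sketch of one likely fix, untested against the real data: in the original call, the fourth positional argument, complete.cases(merge_tweets), is read by regsubsets() as the weights vector, which matches the "arg 4" in the error. Subsetting the rows first and keeping only numeric predictors avoids both problems. Note that regsubsets() fits linear models, so a character response like screen_name won't work as-is; the 0/1 indicator below is a hypothetical recoding for illustration.

library(leaps)

# keep only complete rows; do not pass complete.cases() as an argument,
# since regsubsets() would treat it as `weights` (the failing arg 4)
tweets_complete <- merge_tweets[complete.cases(merge_tweets), ]

# keep only numeric predictors; free-text columns such as `text` and
# `description` cannot be used by regsubsets() directly
tweets_num <- tweets_complete[sapply(tweets_complete, is.numeric)]

# hypothetical numeric response: is the tweet from @HillaryClinton?
tweets_num$is_hillary <- as.integer(tweets_complete$screen_name == "HillaryClinton")

regfit.fwd <- regsubsets(is_hillary ~ ., data = tweets_num,
                         method = "forward", nvmax = 15)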

Related

Rfacebook: get reactions to posts

I want to use Rfacebook to get the reactions (not just likes) to specific posts but couldn't find a way to do it. Basically, I want the same output for a comment as I get for a post:
> BBC <- getPage(page="bbcnews", token=fb_oauth, n=5, since="2017-10-03", until="2017-10-06", feed=FALSE, reactions=TRUE, verbose=TRUE)
5 posts
> BBC
id likes_count from_id from_name
1 228735667216_10155178331342217 1602 228735667216 BBC News
2 228735667216_10155178840252217 7575 228735667216 BBC News
3 228735667216_10155178915482217 5735 228735667216 BBC News
4 228735667216_10155180617187217 6843 228735667216 BBC News
5 228735667216_1964396086910573 1736 228735667216 BBC News
message
1 "What did those people do to deserve that?" \n\nThis woman left the scene of the Las Vegas shooting just moments before it began.
2 Puerto Rico: President Donald J. Trump compares Hurricane Maria to a "real catastrophe like Katrina" bbc.in/2yG9gyZ
3 Do mass shootings ever change gun laws? http://bbc.in/2fIbjv0
4 "Boris asked me to give you this" - The moment comedian Lee Nelson interrupts Prime Minister Theresa May's speech.. by handing her a P45.
5 In her big conference speech, Theresa May talked about council houses and energy prices - but the announcements were overshadowed by a coughing fit and a protester. (Via BBC Politics)\nhttp://bbc.in/2fMCIw3
created_time type link story comments_count shares_count
1 2017-10-03T18:23:36+0000 video https://www.facebook.com/bbcnews/videos/10155178331342217/ NA 406 230
2 2017-10-03T21:34:21+0000 video https://www.facebook.com/bbcnews/videos/10155178840252217/ NA 14722 12284
3 2017-10-03T21:56:01+0000 video https://www.facebook.com/bbcnews/videos/10155178915482217/ NA 3059 2418
4 2017-10-04T11:17:28+0000 video https://www.facebook.com/bbcnews/videos/10155180617187217/ NA 1737 2973
5 2017-10-04T17:16:33+0000 video https://www.facebook.com/bbcnews/videos/1964396086910573/ NA 636 238
love_count haha_count wow_count sad_count angry_count
1 125 16 18 1063 20
2 318 1155 5023 1072 23698
3 104 69 61 980 504
4 513 4127 76 10 80
5 83 467 24 11 21
Now I want the first 5 comments of the first post to have output like the above as well. I get all of it except the reactions (corresponding to the columns love_count, haha_count, wow_count, sad_count, angry_count) by using the following code:
> BBC_post <- getPost(BBC$id[1], token=fb_oauth, comments=TRUE, n.comments=5, likes=FALSE, reactions=FALSE)
> BBC_post
$post
from_id from_name
1 228735667216 BBC News
message
1 "What did those people do to deserve that?" \n\nThis woman left the scene of the Las Vegas shooting just moments before it began.
created_time type link id
1 2017-10-03T18:23:36+0000 video https://www.facebook.com/bbcnews/videos/10155178331342217/ 228735667216_10155178331342217
likes_count comments_count shares_count
1 1602 406 230
$comments
from_id from_name
1 880124212162441 David Bourton
2 10159595379610445 Valerie Gregory
3 10159810965680122 Nadeem Hussain
4 1657693134252376 Samir Amghar
5 10215327133878123 Shlomo Resnikov
message
1 It's unfathomable to the rest of the world that there are so many people who believe the killer's right to their guns are greater than their victims right to life.
2 That's backwards. The victims didn't do anything. The NRA, the politicians who are bought and paid for by them, including President Trump, and the shooter did. That is where solving the problem begins.
3 BBC ask Israel the same Question... what did the Palestinians civilians do to deserve an Apartheid regime !!!
4 Praying and thinking of the victims will not prevent the next shooting. One failed attempt at a shoe bomb and we all take off our shoes at the airport. 274 Mass shootings since January and no change in your regulation of guns.
5 As a Jew , we constantly ask those kind of questions regarding to the holocaust ,”where was god in the holocaust “? Or “How did he allow this horror”? And the answer that facilitates the most is mysterious ways of god are beyond our perception ,we cannot grasp divine calculation .
created_time likes_count comments_count id
1 2017-10-03T18:25:58+0000 225 71 10155178331342217_10155178338952217
2 2017-10-03T18:29:04+0000 79 45 10155178331342217_10155178346307217
3 2017-10-03T18:28:34+0000 60 38 10155178331342217_10155178345382217
4 2017-10-03T18:32:11+0000 37 3 10155178331342217_10155178354272217
5 2017-10-03T18:44:19+0000 16 20 10155178331342217_10155178380902217
How do I also display the reactions a comment got? It is not reactions=TRUE, since that displays the reactions to the post itself, not to the comments of the post.
Does anyone know how to get there? Or does Rfacebook simply not allow for that (yet), since the feature of reacting to comments was introduced fairly recently?
Many thanks in advance and all the best,
Ivo
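One possible direction, sketched under the assumption that comment IDs behave like post IDs in the Graph API: comments are objects with their own /reactions edge, so Rfacebook's getReactions() may accept them, although it is only documented for posts. Untested:

# pass the comment IDs from getPost() to getReactions(); if the Graph API
# exposes /reactions on comment objects, this returns per-comment counts
comment_ids <- BBC_post$comments$id
comment_reactions <- getReactions(comment_ids, token = fb_oauth)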

Make a percentage depending on DF

I have a training set here and need some help.
This is the df:
Jobs Agency Location Date RXH HS TMM Payed
14 Netapp Gitex F1 Events House DWTC 2015-10-19 100 8.0 800 TRUE
5 RWC Heineken Lightblue EGC 2015-10-09 90 4.0 360 FALSE
45 Rugby 7s CEO Seven Stadium 2015-12-04 100 10.0 1000 FALSE
29 Playstation Lightblue Mirdiff CC 2015-11-11 90 7.0 630 FALSE
24 RWC Heineken Lightblue EGC 2015-10-31 90 4.5 405 FALSE
33 Playstation Lightblue Mirdiff CC 2015-11-15 90 10.0 900 FALSE
46 Rugby 7s CEO Seven Stadium 2015-12-05 100 10.0 1000 FALSE
44 Rugby 7s CEO Seven Stadium 2015-12-03 100 10.0 1000 FALSE
I want to know, for example: if the total number of rows is 10 and I worked for the "CEO" agency 3 times, then CEO should get a value of 30% for that month, if that makes sense?
In other words, based on the number of observations, I want the percentage of the time I've worked for each agency.
That's just a demo df to show what I'm talking about.
Thanks
If I understand correctly, you want to summarize by Agency and by month. Here's how to do it with dplyr:
library(dplyr)
table1 %>%
  mutate(Month = format(Date, "%m-%Y")) %>%
  group_by(Month, Agency) %>%
  summarise(Total = n()) %>%
  mutate(Pct = round(Total / sum(Total) * 100))
Source: local data frame [4 x 4]
Groups: Month [3]
Month Agency Total Pct
(chr) (chr) (int) (dbl)
1 10-2015 Events House 1 33
2 10-2015 Lightblue 2 67
3 11-2015 Lightblue 2 100
4 12-2015 CEO 3 100
This is just a simple approach, and I suspect you might be looking for more. However, here's some code that would give you the answer to your sample question:
length(df$Agency[df$Agency == "CEO"]) / length(df$Agency)
The first length() call counts how many cells in df$Agency are marked "CEO", and the second counts the total number of cells in that column. Dividing one by the other gives you the answer.
This will get more complicated if you want to do it automatically for each of the agencies in the column, but those are the basics; a sketch for all agencies follows below.
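As a small extension, a base-R sketch (with df standing for the demo data frame above): prop.table() turns the counts from table() into shares of the total, giving the percentage for every agency at once.

# percentage of rows worked for each agency, rounded to whole percent
round(prop.table(table(df$Agency)) * 100)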

"for" loop in R and checking previous value from a column

I'm working on a data frame. Here's what it looks like:
shape id day hour week id footfall category area name
22496 22/3/14 3 12 634 Work cluster CBD area 1
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1
23287 22/3/14 3 12 723 Airport Changi Airport 2
16430 22/3/14 4 12 947 Work cluster CBD area 2
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2
and so on, up to 635 rows.
The other dataset that I want to compare it with looks like this:
category Foreigners Locals
Work cluster 1600000 3623900
Shopping cluster 1800000 3646666.667
Airport 15095152 8902705
Residential area 527700 280000
They both share the same attribute, i.e. category
For each row, I want to look up the previous hour's value from the hour column in the first dataset so I can compare it with the value from the second dataset.
Here's what I ideally want to do in R:
#for n in 1: number of rows{
# check the previous hour from IDA dataset !!!!
# calculate hourSum - previousHour = newHourSum and store it as newHourSum
# calculate hour/(newHourSum-previousHour) * Foreigners and store it as footfallHour
# add to the empty dataframe }
I'm not sure how to do that, and here's what I tried:
tbl1 <- secondDataset
tbl2 <- firstDataset

mergetbl <- function(tbl1, tbl2) {
  newtbl <- data.frame(hour = numeric(), forgHour = numeric(), locHour = numeric())
  ntbl1rows <- nrow(tbl1)  # get the number of rows
  for (n in 1:ntbl1rows) {
    # get the previousHour
    newHourSum <- tbl1$hour - previousHour
    footfallHour <- (tbl1$hour / (newHourSum - previousHour)) * tbl2$Foreigners
    # add to newtbl
  }
}
This is what I expected:
shape id day hour week id footfall category area name forgHour locHour
22496 22/3/14 3 12 634 Work cluster CBD area 1 1 12
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1 21 25
23287 22/3/14 3 12 723 Airport Changi Airport 2 31 34
16430 22/3/14 4 12 947 Work cluster CBD area 2 41 23
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2 51 23
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3 61 45
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2 72 54
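One hedged way to avoid the explicit loop, assuming the column names shown above and copying the asker's arithmetic verbatim (which may still need adjusting, e.g. for division by zero): dplyr's lag() gives each row's previous hour, and left_join() pulls in the Foreigners and Locals figures via the shared category column.

library(dplyr)

result <- firstDataset %>%
  left_join(secondDataset, by = "category") %>%
  mutate(previousHour = lag(hour, default = first(hour)),  # previous row's hour
         newHourSum   = hour - previousHour,
         # formulas copied from the pseudocode above; substitute the intended arithmetic
         forgHour     = hour / (newHourSum - previousHour) * Foreigners,
         locHour      = hour / (newHourSum - previousHour) * Locals)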

R:Fuzzy Logic Name match

I have been working on a large data set of customer names. Each name has to be checked against a master file that has the correct names (300 KB); if a name matches, the master-file name is appended to the customer file as a new column value. My previous question worked for small data sets.
Both the customer and master files have been cleaned using tm. I have tried different approaches, but they only work on small sets of data and are not effective when applied to huge files. Pattern matching doesn't help here, in my opinion, because no name comes in an exact pattern.
Customer file
1 chang chun petrochemical
2 chang chun plastics
3 church dwight
4 citrix systems asia pacific
5 cnh industrial services srl
6 conoco phillips
7 conocophillips
8 dfk laurence varnay
9 dtz worldwide
10 electro motive maintenance operati
11 enterasys networks
12 esso resources
13 expedia
14 expedia
15 exponential interactive aust
16 exxonmobil asia pacific pte
17 exxonmobil chemical asia pac div
18 exxonmobil png
19 formula world championship
20 fortitech asia pacific sdn bhd
Master
1 chang chun group
2 church dwight
3 citrix systems asia pacific
4 cnh industrial nv
5 conoco phillips
6 dfk laurence varnay
7 dtz group zealand
8 caterpillar
9 enterasys networks
10 exxon mobil group
11 expedia group
12 exponential interactive aust
13 formula world championship
14 fortitech asia pacific sdn bhd
15 frhi hotels resorts
16 gardner denver industries
17 glencore xstrata international plc
18 grace
19 incomm nz
20 information resources
21 kbr holdings llc
22 kennametal
23 komatsu
24 leonhard hofstetter pelzdesign
25 communications corporation
26 manhattan associates
27 mattel
28 mmg finance
29 nokia oyj group
30 nortek
I have tried this simple loop:
for (i in 1:100) {
  result$x[i] <- agrep(result$ICIS_Cust_Names[i], result1$Master_Names,
                       value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
  # result$Y[i] <- agrep(result$ICIS_Cust_Names[i], result1$Master_Names,
  #                      value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
Result:
1 chang chun petrochemical <NA> NA
2 chang chun plastics <NA> NA
3 church dwight church dwight 2
4 citrix systems asia pacific citrix systems asia pacific 3
5 cnh industrial services srl <NA> NA
6 conoco phillips church dwight 2
7 conocophillips <NA> NA
8 dfk laurence varnay <NA> NA
9 dtz worldwide church dwight 2
10 electro motive maintenance operati <NA> NA
11 enterasys networks <NA> NA
12 esso resources church dwight 2
13 expedia <NA> NA
14 expedia <NA> NA
15 exponential interactive aust church dwight 2
16 exxonmobil asia pacific pte <NA> NA
17 exxonmobil chemical asia pac div <NA> NA
18 exxonmobil png church dwight 2
19 formula world championship <NA> NA
20 fortitech asia pacific sdn bhd
I tried with lapply but no use; as you can see, my master file is large, and sometimes I get an error that the row lengths don't match:
mm <- dt[lapply(result, function(x) levenshteinDist(x, lapply(result1, function(x) x)))]

# loop to check each customer name against all the master names
for (i in seq(nrow(result))) {
  if (levenshteinDist(result[i], lapply(result1, function(x) String(x))) == 0)
    sprintf("%s", x)
}
Which method would be best for this? This is similar to my earlier question, but that wasn't much help; I also referred to a few related questions on Stack Overflow. It might be naive, but the code misbehaves when applied to huge data sets. Could anybody familiar with R correct the above code for levenshteinDist?
Code:
# check each customer name against every master-file value; if the similarity
# is above 0.90, keep the master value (note: gr1$jar would need to be a
# matrix, not a data-frame column, for [i, j] indexing to work)
for (i in seq_len(nrow(gr1))) {
  for (j in seq_len(nrow(gr2))) {
    gr1$jar[i, j] <- jarowinkler(gr1$ICIS_Cust_Names[i], gr2$Master_Names[j])
    if (gr1$jar[i, j] > 0.90)
      gr1$res[i] <- gr2$Master_Names[j]
  }
}
Please let me know if there is any minute error in this code. If anybody has worked with such data in R, please help!
I achieved a partial result with:
df$result <- data.frame(df$Cust_Names, df$Master_Names[max.col(-adist(df$Cust_Names, df$Master_Names))])
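For completeness, a hedged alternative that avoids the double loop entirely; it assumes the stringdist package and hypothetical data frames cus and master standing for the two files above. stringdistmatrix() computes all pairwise Jaro-Winkler distances in one call, and the closest master name is then picked per customer row, with a cutoff so poor matches stay NA.

library(stringdist)

# all pairwise Jaro-Winkler distances (0 = identical, 1 = totally different)
d <- stringdistmatrix(cus$Cust_Names, master$Master_Names, method = "jw", p = 0.1)
best <- apply(d, 1, which.min)              # nearest master name per customer row
best_d <- d[cbind(seq_len(nrow(d)), best)]  # its distance
# keep the match only when it is reasonably close; 0.10 is an arbitrary cutoff
cus$matched_master <- ifelse(best_d < 0.10, master$Master_Names[best], NA)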

Fuzzy string matching in r

I have 2 datasets with more than 100K rows each. I would like to merge them by fuzzy string matching on one column ('movie title'), as well as by release date. I am providing a sample from both datasets below.
dataset-1
itemid userid rating time title release_date
99991 1673 835 3 1998-03-27 mirage 1995
99992 1674 840 4 1998-03-29 mamma roma 1962
99993 1675 851 3 1998-01-08 sunchaser, the 1996
99994 1676 851 2 1997-10-01 war at home, the 1996
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1680 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
dataset - 2
itemid userid rating time title release_date
1 2844 4477 3 2013-03-09 fantômas - à l'ombre de la guillotine 1913
2 4936 8871 4 2013-05-05 the bank 1915
3 4936 11628 3 2013-07-06 the bank 1915
4 4972 16885 4 2013-08-19 the birth of a nation 1915
5 5078 11628 2 2013-08-23 the cheat 1915
6 6684 4222 3 2013-08-24 the fireman 1916
7 6689 4222 3 2013-08-24 the floorwalker 1916
8 7264 2092 4 2013-03-17 the rink 1916
9 7264 5943 3 2013-05-12 the rink 1916
10 7880 11628 4 2013-07-19 easy street 1917
I have looked at agrep, but it only matches one string at a time. The stringdist function is good, but you need to run it in a loop, find the minimum distance, and then do further processing, which is very time consuming given the size of the datasets. The strings can have typos and special characters, which is why fuzzy matching is required. I have looked around and found the Levenshtein and Jaro-Winkler methods. The latter, I read, is good when you have typos in strings.
In this scenario, fuzzy matching alone may not give good results; e.g., the movie title 'toy story' in one dataset could be matched to 'toy story 2' in the other, which is not right. So I need to consider the release date to make sure the matched movies are unique.
I want to know if there is a way to achieve this task without using a loop. Worst-case scenario, if I have to use a loop, how can I make it work as efficiently and fast as possible?
I have tried the following code, but it takes an awfully long time to process.
for (i in 1:nrow(test)) {
  for (j in 1:nrow(test1)) {
    test$title.match <- ifelse(jarowinkler(test$x[i], test1$x[j]) > 0.85,
                               test$title, NA)
  }
}
test - contains 1682 unique movie names converted to lower case
test1 - contains 11451 unique movie names converted to lower case
Is there a way to avoid the for loops and make it work faster?
What about this approach to move you forward? You can adjust the degree of match (0.85) after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting release dates; any zeros would mean the same release date.
dataset1$title.match <- ifelse(jarowinkler(dataset1$title, dataset2$title) > 0.85, dataset1$title, NA)
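A hedged, loop-free sketch along the same lines, assuming the stringdist package and the column names from the samples (with dataset1 and dataset2 as the two data frames): compute all pairwise Jaro-Winkler distances once, take the best match per title, and then require the release dates to agree so that 'toy story' cannot pair with 'toy story 2' from a different year.

library(stringdist)

# all pairwise Jaro-Winkler distances between the two title columns
d <- stringdistmatrix(dataset1$title, dataset2$title, method = "jw", p = 0.1)
best <- apply(d, 1, which.min)  # closest dataset2 title for each dataset1 row

matched <- data.frame(
  title1   = dataset1$title,
  title2   = dataset2$title[best],
  distance = d[cbind(seq_len(nrow(d)), best)],
  year_ok  = dataset1$release_date == dataset2$release_date[best]
)
# reject weak matches and matches whose release years differ; 0.15 is arbitrary
matched$title2[matched$distance > 0.15 | !matched$year_ok] <- NA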
