Use DocumentTermMatrix in R with 'dictionary' parameter - r

I want to use R for text classification. I use DocumentTermMatrix to return the matrix of words:
library(tm)
crude <- "japan korea usa uk albania azerbaijan"
corps <- Corpus(VectorSource(crude))
dtm <- DocumentTermMatrix(corps)
inspect(dtm)
words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")
test <- DocumentTermMatrix(corps, control=list(dictionary = words))
inspect(test)
The first inspect(dtm) works as expected, with this result:
Terms
Docs albania azerbaijan japan korea usa
1 1 1 1 1 1
But the second inspect(test) shows this result:
Terms
Docs argentina australia japan korea turkey uganda
1 0 1 0 1 0 0
While the expected result is:
Terms
Docs argentina australia japan korea turkey uganda
1 0 0 1 1 0 0
Is this a bug, or am I using it the wrong way?

This appears to be a bug in how the default Corpus() (a SimpleCorpus in recent versions of tm) assigns term counts when a dictionary is supplied: the counts end up attached to the wrong columns.
Use VCorpus() instead; this will give you the expected result.
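A minimal sketch of that workaround, re-running the question's example with VCorpus() (assuming the tm package is installed):

```r
library(tm)

crude <- "japan korea usa uk albania azerbaijan"
# VCorpus builds a full volatile corpus; DocumentTermMatrix then matches
# dictionary terms by name rather than by position.
corps <- VCorpus(VectorSource(crude))
words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")
test  <- DocumentTermMatrix(corps, control = list(dictionary = words))
inspect(test)
# japan and korea should now each have a count of 1, all other terms 0
```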

Related

R- delete the tail word

Can someone teach me how to delete the tail word? Thanks.
from
1 North Africa
2 Algeria
3 Canary Islands (Spain)[153]
4 Ceuta (Spain)[154]
to
1 North Africa
2 Algeria
3 Canary Islands
4 Ceuta
I'm sorry about my poor English.
It seems that you want to trim a trailing name in parentheses, along with anything that follows it to the end of the string. We can use sub() for this purpose:
df <- data.frame(id = 1:4,
                 places = c("North Africa", "Algeria", "Canary Islands (Spain)[153]", "Ceuta (Spain)[154]"),
                 stringsAsFactors = FALSE)
df$places <- sub("\\s*\\(.*\\).*$", "", df$places)
df
id places
1 1 North Africa
2 2 Algeria
3 3 Canary Islands
4 4 Ceuta
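For reference, a breakdown of the pattern: \s* eats optional whitespace before the parenthesis, \(.*\) matches the parenthesised part, and .*$ removes everything after it (including the footnote markers). A quick base-R check on the same values:

```r
# sub() replaces the first match; strings with no "(" are left untouched
x <- c("North Africa",
       "Canary Islands (Spain)[153]",
       "Ceuta (Spain)[154]")
sub("\\s*\\(.*\\).*$", "", x)
# [1] "North Africa"   "Canary Islands" "Ceuta"
```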

create a variable in a dataframe based on another matrix on R

I am having some problems with the following task.
I have a data frame of this type, with 99 different countries for thousands of IDs:
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable giving the macroregion corresponding to each Nationality, so I created a lookup table of this kind with the rule to follow:
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this:
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I was trying this command
datasubset <- merge(dataset, continent.matrix)
but it doesn't work; it reports the following error:
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me, and it also fails when I apply the code to a subset. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function from the dplyr package:
library(dplyr)
df <- tibble(ID = c(1, 2, 3),
             Nationality = c("Italy", "Usa", "France"),
             var1 = c("a", "b", "c"),
             var2 = c(4, 5, 6))
nat_cont <- tibble(Nationality = c("Italy", "Eritrea", "Usa", "Germany", "France"),
                   Continent = c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by = c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
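If you'd rather stay in base R, merge() with all.x = TRUE performs the same left join. (A common cause of the allocation error in the question is a merge on duplicated or unnamed keys, so naming the key explicitly and making sure the lookup is a data.frame rather than a matrix is worth checking.) A sketch with the same toy tables as plain data.frames:

```r
df <- data.frame(ID = c(1, 2, 3),
                 Nationality = c("Italy", "Usa", "France"),
                 var1 = c("a", "b", "c"),
                 var2 = c(4, 5, 6))
nat_cont <- data.frame(Nationality = c("Italy", "Eritrea", "Usa", "Germany", "France"),
                       Continent = c("Europe", "Africa", "America", "Europe", "Europe"))
# all.x = TRUE keeps every row of df even when no continent is found;
# merge() reorders rows by the key, so restore the original order by ID
df_2 <- merge(df, nat_cont, by = "Nationality", all.x = TRUE)
df_2[order(df_2$ID), ]
```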

google geocoding and haversine distance calculation in R

I am using the geocode function from the ggmap package to geocode country names, and then passing them on to the distHaversine function in the geosphere library to calculate the distance between two countries.
Sample of my data is as follows:
Country.Value Address.Country
1: United States United States
2: Cyprus United States
3: Indonesia United States
4: Tanzania Tanzania
5: Madagascar United States
6: Belize Canada
7: Argentina Argentina
8: Egypt Egypt
9: South Africa South Africa
10: Paraguay Paraguay
I have also used if-else statements to try and stay within the geocoding limits set by the free Google Maps geocoder. My code is as follows:
for(i in 1:nrow(df)) {
  row <- df.cont.long[i, ]
  src_lon <- 0.0
  src_lat <- 0.0
  trgt_lon <- 0.0
  trgt_lat <- 0.0
  if (row$Country.Value == 'United States') {            # Reduce geocoding requirements
    trgt_lon <- -95.7129
    trgt_lat <- 37.0902
  } else if (row$Address.Country == 'United States') {   # Reduce geocoding requirements
    src_lon <- -95.7129
    src_lat <- 37.0902
  } else if (row$Country.Value == 'Canada') {            # Reduce geocoding requirements
    trgt_lon <- -106.3468
    trgt_lat <- 56.1304
  } else if (row$Primary.Address.Country == 'Canada') {  # Reduce geocoding requirements
    src_lon <- -106.3468
    src_lat <- 56.1304
  } else if (row$Country.Value == row$Address.Country) { # Reduce geocoding requirements
    # trgt <- geocode(row$Country.Value)
    # trgt_lon <- as.numeric(trgt$lon)
    # trgt_lat <- as.numeric(trgt$lat)
    # src_lon <- as.numeric(trgt$lon)
    # src_lat <- as.numeric(trgt$lat)
  } else {
    trgt <- geocode(row$Country.Value, output = c("latlon"))
    trgt_lon <- as.numeric(trgt$lon)
    trgt_lat <- as.numeric(trgt$lat)
    src <- geocode(row$Address.Country)
    src_lon <- as.numeric(src$lon)
    src_lat <- as.numeric(src$lat)
  }
  print(i)
  print(c(row$Address.Country, src_lon, src_lat))
  print(c(row$Country.Value, trgt_lon, trgt_lat))
  print(distHaversine(p1 = c(as.numeric(src$lon), as.numeric(src$lat)),
                      p2 = c(as.numeric(trgt$lon), as.numeric(trgt$lat))))
}
In the output:
Sometimes geocoding is done, sometimes not, and the coordinates default to 0.0.
Sometimes the distance is calculated, sometimes not.
I have no idea where the code is going wrong.
Moreover, uncommenting the lines where I check whether Country.Value and Address.Country are equal makes things even worse.
The functions you're using are vectorized, so all you really need is
library(ggmap)
library(geosphere)
distHaversine(geocode(as.character(df$Country.Value)),
              geocode(as.character(df$Address.Country)))
# [1] 0 10432624 14978567 0 15868544 4588708 0 0 0 0
Note that the as.character calls are there because ggmap::geocode doesn't like factors. The results make sense:
df$distance <- distHaversine(geocode(as.character(df$Country.Value), source = 'dsk'),
                             geocode(as.character(df$Address.Country), source = 'dsk'))
df
# Country.Value Address.Country distance
# 1 United States United States 0
# 2 Cyprus United States 10340427
# 3 Indonesia United States 14574480
# 4 Tanzania Tanzania 0
# 5 Madagascar United States 16085178
# 6 Belize Canada 5172279
# 7 Argentina Argentina 0
# 8 Egypt Egypt 0
# 9 South Africa South Africa 0
# 10 Paraguay Paraguay 0
Edit
If you don't want to use ggmap::geocode, tmap::geocode_OSM is another geocoding function that uses OpenStreetMap data. However, because it is not vectorized, you need to iterate over it columnwise:
distHaversine(t(sapply(df$Country.Value, function(x){tmap::geocode_OSM(x)$coords})),
t(sapply(df$Address.Country, function(x){tmap::geocode_OSM(x)$coords})))
# [1] 0 10448111 14794618 0 16110917 5156823 0 0 0 0
or rowwise:
apply(df, 1, function(x){distHaversine(tmap::geocode_OSM(x['Country.Value'])$coords,
tmap::geocode_OSM(x['Address.Country'])$coords)})
# [1] 0 10448111 14794618 0 16110917 5156823 0 0 0 0
subsetting to the $coords element in each case. Also note that Google, DSK, and OSM each choose a different center point for a country, so the resulting distances differ somewhat between sources.
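To see what distHaversine computes, here is a minimal base-R haversine (great-circle) distance, using the same default Earth radius (6378137 m) as geosphere, checked against the US and Canada centre coordinates hard-coded in the question:

```r
haversine <- function(p1, p2, r = 6378137) {
  # p1, p2: c(lon, lat) in degrees; returns the great-circle distance in metres
  rad  <- pi / 180
  dlon <- (p2[1] - p1[1]) * rad
  dlat <- (p2[2] - p1[2]) * rad
  a <- sin(dlat / 2)^2 + cos(p1[2] * rad) * cos(p2[2] * rad) * sin(dlon / 2)^2
  2 * r * asin(min(1, sqrt(a)))   # clamp guards against rounding just above 1
}

# US centre (-95.7129, 37.0902) to Canada centre (-106.3468, 56.1304)
haversine(c(-95.7129, 37.0902), c(-106.3468, 56.1304))
# roughly 2.26e6 metres
```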

Removing repeats and blanks from R data frame

I apologise in advance for the data structure here, but I'm stuck with it...
I have a data frame with lots of repeats and blanks, like so:
df <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Algeria", "Australia", "Australia", "Australia"),
  survey.1 = c("Influenza", "", "", "", "Influenza", "Influenza"),
  survey.2 = c("", "Hepatitis C", "", "", "", ""),
  survey.3 = c("West Nile Virus", "", "", "", "", "West Nile Virus"))
country survey.1 survey.2 survey.3
1 Afghanistan Influenza West Nile Virus
2 Afghanistan Hepatitis C
3 Algeria
4 Australia
5 Australia Influenza
6 Australia Influenza West Nile Virus
I need to remove the repeats and blanks but keep the same data structure (I don't know what you would call this... 'concentrating' as opposed to 'aggregating' maybe?). So what I'd end up with is this:
country survey.1 survey.2 survey.3
1 Afghanistan Influenza Hepatitis C West Nile Virus
2 Australia Influenza West Nile Virus
Can anyone help?
Using plyr:
library(plyr)

ddply(df, .(country), function(x)
  sapply(x, function(y) {
    xx <- unique(y[nchar(y) > 0])
    ifelse(length(xx) > 0, xx, unique(y))
  }))
country survey.1 survey.2 survey.3
1 Afghanistan Influenza Hepatitis C West Nile Virus
2 Algeria
3 Australia Influenza West Nile Virus
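If you prefer to avoid plyr, the same idea ("per country and per column, keep the unique non-blank value") can be sketched in base R with aggregate():

```r
df <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Algeria", "Australia", "Australia", "Australia"),
  survey.1 = c("Influenza", "", "", "", "Influenza", "Influenza"),
  survey.2 = c("", "Hepatitis C", "", "", "", ""),
  survey.3 = c("West Nile Virus", "", "", "", "", "West Nile Virus"),
  stringsAsFactors = FALSE)

collapse <- function(y) {
  # keep the first unique non-blank value in the group, or "" if none exists
  xx <- unique(y[nchar(y) > 0])
  if (length(xx) > 0) xx[1] else ""
}
aggregate(df[-1], by = list(country = df$country), FUN = collapse)
```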

Computing frequency of membership in R's data.frame

I have the following data.frame:
authors <- data.frame(
  surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
  nationality = c("US", "Australia", "US", "UK", "Australia"),
  deceased = c("yes", rep("no", 3), "noinfo"))
which produces this output:
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia noinfo
What I want to do is to get the frequency of deceased by nationality.
Yielding this output:
US yes 1
US no 1
US noinfo 0
Australia yes 0
Australia no 1
Australia noinfo 1
UK yes 0
UK no 1
UK noinfo 0
At the moment I can only display the statistics through tables:
stat <- table(authors)
I'm not sure how to proceed with accessing the elements of the table.
Advice would be appreciated.
You need to table on the things you want the occurrence counts for...
table( authors[ c("nationality" , "deceased" ) ] )
# deceased
#nationality no noinfo yes
# Australia 1 1 0
# UK 1 0 0
# US 1 0 1
And to get the exact output you want... turn it into a data.frame....
data.frame( table( authors[ c("nationality" , "deceased" ) ] ) )
# nationality deceased Freq
#1 Australia no 1
#2 UK no 1
#3 US no 1
#4 Australia noinfo 1
#5 UK noinfo 0
#6 US noinfo 0
#7 Australia yes 0
#8 UK yes 0
#9 US yes 1
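If the row order requested in the question matters (grouped by nationality, then yes/no/noinfo within each), the data.frame can be reordered after the fact; a sketch, where the match() ranking is just one way to impose a custom level order:

```r
authors <- data.frame(
  surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
  nationality = c("US", "Australia", "US", "UK", "Australia"),
  deceased = c("yes", rep("no", 3), "noinfo"))

freq <- data.frame(table(authors[c("nationality", "deceased")]))
# sort rows by nationality, then by the custom yes < no < noinfo ordering
freq <- freq[order(freq$nationality, match(freq$deceased, c("yes", "no", "noinfo"))), ]
freq
```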
