R data.frame transformation? - r

I have a R data frame that looks like this:
Country Property Value
Canada Capital Ottawa
Canada Population 38
Canada Language1 French
Canada Language2 English
United States Capital Washington
United States Population 280
United States Language1 English
United States Language2 NA
I want to re-arrange this so that it looks like this:
Country Capital Population Language1 Language2
Canada Ottawa 38 French English
United States Washington 280 English NA
Is there any way to do this transformation ?
Thanks.

As per Paul Hiemstra's comment:
the reshape2 package's dcast will do this nicely:
dcast(data=yourdataframe, Country~Property, value.var='Value')
If you've got duplicated values in there though it will try to aggregate them using length as a default, which isn't what you want!

Related

Count the number of repetitive values separated by commas

I have the set that looks something like :
colA
Nepal , India , USA
USA
India
USA
Nepal , India
USA
USA, Nepal
Nepal
Japan
so I want the count as :
COlB
Count
Nepal
4
India
3
USA
5
Japan
4
Is there a way to do it, without going into the Tableau Prep and directly from Tableau Reader with the use of calculative fields or something similar within it.

extracting country name from city name in R

This question may look like a duplicate but I am facing some issue while extracting country names from the string. I have gone through this link [link]Extracting Country Name from Author Affiliations but I was not able to solve my problem.I have tried grepl and for loop for text matching and replacement, my data column consists of more than 300k rows so using grepl and for loop for pattern matching is very very slow.
I have a column like this.
org_loc
Zug
Zug Canton of Zug
Zimbabwe
Zigong
Zhuhai
Zaragoza
York United Kingdom
Delhi
Yalleroi Queensland
Waterloo Ontario
Waterloo ON
Washington D.C.
Washington D.C. Metro
New York
df$org_loc <- c("zug", "zug canton of zug", "zimbabwe",
"zigong", "zhuhai", "zaragoza","York United Kingdom", "Delhi","Yalleroi Queensland","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","New York")
the string may contain the name of a state, city or country. I just want Country as output. Like this
org_loc
Switzerland
Switzerland
Zimbabwe
China
China
Spain
United Kingdom
India
Australia
Canada
Canada
United State
United state
United state
I am trying to convert state (if match found) to its country using countrycode library but not able to do so. Any help would be appreciable.
You can use your City_and_province_list.csv as a custom dictionary for countrycode. The custom dictionary can not have duplicates in the origin vector (the City column in your City_and_province_list.csv), so you'll have to remove them or deal with them somehow first (as in my example below). Currently, you don't have all of the possible strings in your example in your lookup CSV, so they are not all converted, but if you added all of the possible strings to the CSV, it would work completely.
library(countrycode)
org_loc <- c("Zug", "Zug Canton of Zug", "Zimbabwe", "Zigong", "Zhuhai",
"Zaragoza", "York United Kingdom", "Delhi",
"Yalleroi Queensland", "Waterloo Ontario", "Waterloo ON",
"Washington D.C.", "Washington D.C. Metro", "New York")
df <- data.frame(org_loc)
city_country <- read.csv("https://raw.githubusercontent.com/girijesh18/dataset/master/City_and_province_list.csv")
# custom_dict for countrycode cannot have duplicate origin codes
city_country <- city_country[!duplicated(city_country$City), ]
df$country <- countrycode(df$org_loc, "City", "Country",
custom_dict = city_country)
df
# org_loc country
# 1 Zug Switzerland
# 2 Zug Canton of Zug <NA>
# 3 Zimbabwe <NA>
# 4 Zigong China
# 5 Zhuhai China
# 6 Zaragoza Spain
# 7 York United Kingdom <NA>
# 8 Delhi India
# 9 Yalleroi Queensland <NA>
# 10 Waterloo Ontario <NA>
# 11 Waterloo ON <NA>
# 12 Washington D.C. <NA>
# 13 Washington D.C. Metro <NA>
# 14 New York United States of America
library(countrycode)
df <- c("zug switzerland", "zug canton of zug switzerland", "zimbabwe",
"zigong chengdu pr china", "zhuhai guangdong china", "zaragoza","York United Kingdom", "Yamunanagar","Yalleroi Queensland Australia","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","USA")
df1 <- countrycode(df, 'country.name', 'country.name')
It didn't match a lot of them, but that should do what you're looking for, based on the reference manual for countrycode.
With function geocode from package ggmap you may accomplish, with good but not total accuracy your task; you must also use your criterion to say "Zaragoza" is a city in Spain (which is what geocode returns) and not somewhere in Argentina; geocode tends to give you the biggest city when there are several homonyms.
(remove the $country to see all of the output)
library(ggmap)
org_loc <- c("zug", "zug canton of zug", "zimbabwe",
"zigong", "zhuhai", "zaragoza","York United Kingdom",
"Delhi","Yalleroi Queensland","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","New York")
geocode(org_loc, output = "more")$country
as geocode is provided by google, it has a query limit, 2,500 per day per IP address; if it returns NAs it may be because an unconsistent limit check, just try it again.

Find string, if does not exist, find another string

I have many files from OECD that have data available for different regional granularities. An example would be:
File A
REG_ID Region
AUS Australia
AU1GS Sydney
AU1 New South Wales
AU2 Victoria
AU2GM Melbourne
File B
REG_ID Region
AUS Australia
AU1GS Sydney
AU2GM Melbourne
File C
REG_ID Region
AUS Australia
AU1 New South Wales
AU1GS Sydney
AU2 Victoria
I want to extract the most granular region, in this case Sydney only, and not New South Wales. However, if Sydney is unavailable, I want to extract New South Wales.
How do I write code that is generalisable to all these files?

Construct a vector of names from data frame using R

I have a big data frame that contains data about the outcomes of sports matches. I want to try and extract specific data from the data frame depending on certain criteria. Here's a quick example of what I mean...
Imagine I have a data frame df, which displays data about specific football matches of a tournament on each row, like so:
Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd John England Barcalona Carlos Spain
2 Liverpool Steve England Juventus Mario Italy
3 Man utd John Scotland R Madrid Juan Spain
4 Paris SG Teirey France Chelsea Mark England
So, for example, in row [1] Man utd won against Barcalona, Man utd's captain's name was John and he is from England. Barcalona's (the losers of the match) captain's name was Carlos and he is from Spain.
I want to construct a vector with the names of all English players in the tournament, where the output should look something like this:
[1] "John" "Mark" "Steve"
Here's what I've tried so far...
My first step was to create a data frame that discards all the matches that don't have English captains
> England_player <- data.frame(filter(df, Win_Country=="England" ))
> England_player
Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd John England Barcalona Carlos Spain
2 Liverpool Steve England Juventus Mario Italy
3 Paris SG Teirey France Chelsea MArk England
Then I used select() on England_player to isolate just the names:
> England_player_names <- select(England_player, Win_Capt_Nm, Lose_Capt_Nm)
> England_player_names
Win_Capt_Nm Lose_Capt_Nm
1 John Carlos
2 Steve Mario
3 Teirey Mark
And then I get stuck! As you can see, the output displays the English winner's name and the name of his opponent... which is not what I want!
It's easy to just read the names off this data frame.. but the data frame I'm working with is large, so just reading the values is no good!
Any suggestions as to how I'd do this?
english.players <- union(data$Win_Capt_Nm[data$Win_Country == 'England'], data$Lose_Capt_Nm[data$Lose_Country == 'England'])
[1] "John" "Steve" "Mark"

How can I count the number of instances a value occurs within a subgroup in R?

I have a data frame that I'm working with in R, and am trying to check how many times a value occurs within its larger, associated group. Specifically, I'm trying to count the number of cities that are listed for each particular country.
My data look something like this:
City Country
=========================
New York US
San Francisco US
Los Angeles US
Paris France
Nantes France
Berlin Germany
It seems that table() is the way to go, but I can't quite figure it out — how can I find out how many cities are listed for each country? That is to say, how can I find out how many fields in one column are associated with a particular value in another column?
EDIT:
I'm hoping for something along the lines of
3 US
2 France
1 Germany
I guess you can try table.
table(df$Country)
# France Germany US
# 2 1 3
Or using data.table
library(data.table)
setDT(df)[, .N, by=Country]
# Country N
#1: US 3
#2: France 2
#3: Germany 1
Or
library(plyr)
count(df$Country)
# x freq
#1 France 2
#2 Germany 1
#3 US 3

Resources