How to update only missing values in R based on parameters - r

I have a data set produced by the script below:
library(ggmap)
countries <- c('Ghana', 'Guinea', 'Mali', 'Niger')
withLocation<- data.frame(countries, geocode(countries))
Once I run these commands, I get data like this:
countries lon lat
1 Ghana -1.023194 7.946527
2 Guinea -9.696645 9.945587
3 Mali -3.996166 17.570692
4 Niger NA NA
Now I have missing values for 'Niger' and want to update only that row, because rerunning the Google API with the complete list may miss a different country. Please help me achieve this.

You want to know how to select the part of your data frame and get the values which need replacing?
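# TRUE for rows where either coordinate is missing
na_rows <- is.na(withLocation$lon) | is.na(withLocation$lat)
# Re-run geocode() for those rows only and write the numeric results back
withLocation[na_rows, c("lon", "lat")] <- geocode(as.character(withLocation$countries[na_rows]))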
I'm not sure this is going to solve your problem, but feel free to write me a comment and let me know what needs improving.
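A minimal sketch of the same idea that runs without an API key. Here `lookup_coords` is a hypothetical stand-in for geocode(), and the Niger coordinates it returns are rough illustrative values, not output from the Google API.

```r
# Sample data mirroring the question: Niger came back with no coordinates
withLocation <- data.frame(
  countries = c("Ghana", "Guinea", "Mali", "Niger"),
  lon = c(-1.023194, -9.696645, -3.996166, NA),
  lat = c(7.946527, 9.945587, 17.570692, NA),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for geocode(): returns a data frame with lon and lat
lookup_coords <- function(places) {
  known <- data.frame(place = "Niger", lon = 8.08, lat = 17.61)
  known[match(places, known$place), c("lon", "lat")]
}

# Rows where either coordinate is missing
na_rows <- is.na(withLocation$lon) | is.na(withLocation$lat)

# Query only those rows and write the results back in place
withLocation[na_rows, c("lon", "lat")] <- lookup_coords(withLocation$countries[na_rows])
```

Rows that already had coordinates are never sent to the lookup, so a second run cannot knock out a country that succeeded the first time.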

Related

Converting zipcodes to state in R using reverse_zipcode() outputs in the wrong order

I am trying to change a set of zipcodes into states. However, the result comes in a different order than what I input, except for null values. This is a different set I created, which produces the same issue. I'm importing my actual file from a CSV, if that is relevant.
I'm using the zipcodeR package.
zipcodestest = as.data.frame(c('85364','91910','30004','filler','90210','help'))
colnames(zipcodestest) = "zip"
statetest =as.data.frame(reverse_zipcode(zipcodestest$zip)$state)
zipcodestest$statetest = statetest
View(zipcodestest)
The states are showing up in a different order than the zips. Is there a way I can make sure they pair up properly?
Thanks so much.
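library(dplyr)
library(zipcodeR)
# Join on the zip code itself so each state is paired by key, not by row position
zipcodestest %>%
  left_join(reverse_zipcode(.$zip),
            by = c(zip = 'zipcode')) %>%
  select(zip, state)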
zip state
1 85364 AZ
2 91910 CA
3 30004 GA
4 filler <NA>
5 90210 CA
6 help <NA>
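The join keeps the input order because each row of zipcodestest is looked up by key rather than paired by position. A base-R sketch of the same idea using match(); `zip_lookup` is a made-up table standing in for the reverse_zipcode() result, with its rows deliberately shuffled.

```r
zipcodestest <- data.frame(
  zip = c("85364", "91910", "30004", "filler", "90210", "help"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: row order differs from the input, invalid zips absent
zip_lookup <- data.frame(
  zipcode = c("30004", "85364", "90210", "91910"),
  state = c("GA", "AZ", "CA", "CA"),
  stringsAsFactors = FALSE
)

# match() gives, for each input zip, the row of zip_lookup that holds it (NA if none)
zipcodestest$state <- zip_lookup$state[match(zipcodestest$zip, zip_lookup$zipcode)]
zipcodestest
```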

apply nested within lapply not working in R

Just earlier today I received a very helpful answer to a problem I was running into, which allowed me to move on to the next step of one of my projects. However, I got stuck again later in the project, and I'm wondering if any of you can help me move forward.
Context
Currently, I have a list of data frames that are full of soccer matches called wc_match_dataframes. Here is what one of the data frames looks like:
type_id tourn_id day month year team_A score_A score_B team_B win loss
f wc_1934 27 5 1934 Germany 5 2 Belgium Germany Belgium
I wasn't able to fit the data for the final three columns, draw, drawA, and drawB but basically the draw column is TRUE if the match is a draw, if not, it is FALSE. In the case of a draw, the win and loss columns are just filled by Draw. The drawA column is filled by team_A if the match was a draw, and likewise, the drawB column is filled by team_B.
The type_id is either f or q depending on if the match was a World Cup qualifier or a World Cup finals match. The tourn_id refers to the tournament the match was for, whether it was a qualifier or finals.
There are a total of 39 of these data frames, with a "finals" data frame for each of the 20 World Cup tournaments, and a "qualifiers" data frame for 19 tournaments (the first World Cup did not have qualifying).
What I Want To Do
I'm trying to populate a different list of data frames wc_dataframes with data for each of the 20 World Cups at the country level as opposed to the match level. Each of these twenty data frames will have the countries that made it to the finals of said tournament and their data like so:
Country
Wins in qualifying
Wins in finals
Losses in qualifying
Losses in finals
... and so on.
I have been able to populate the first country column for every World Cup no problem, but I'm running into issues for the rest of the columns.
Here is what I'm doing
This is the unlooped (only works for one World Cup) version of my code that works successfully:
wc_dataframes$wc_1930$fw <- apply(wc_dataframes$wc_1930, MARGIN = 1, function(country)
sum(wc_match_dataframes$`wc_1930 f`$w == country, na.rm = TRUE))
This is successfully populating the finals win column in the wc_dataframes$wc_1930 data frame by counting the number of wins.
Now, when I try and nest this under lapply to do it across all World Cup years like so:
lapply(names(wc_dataframes), function(year)
wc_dataframes$year$fw <- apply(wc_dataframes$year, MARGIN = 1, function(country)
sum(wc_match_dataframes$`year f`$w == country, na.rm = TRUE)))
It does not work for me. I suspect that the issue has to do with defining the year function and running into issues in the sum portion of my code. I come from a background in STATA so I am more used to running for loops and what not. I'm still getting used to R and lists and everything so I really appreciate the help.
Thank you!
Thank you so much in advance for the help, and happy holidays! :)
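Two things need to change. wc_dataframes$year looks for an element literally named "year", so index with [[year]] and build the match list name with paste. Also, an assignment inside the function only changes a local copy, so return the modified data frame and assign the result of lapply:
years <- names(wc_dataframes)
wc_dataframes <- lapply(setNames(years, years), function(year) {
  df <- wc_dataframes[[year]]
  df$fw <- apply(df, MARGIN = 1, function(country)
    sum(wc_match_dataframes[[paste(year, 'f')]]$w == country, na.rm = TRUE))
  df  # return the updated data frame for this year
})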
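To see the pattern on data small enough to check by hand, here is a sketch with two made-up tournaments. The results are invented; only the `wc_1930` / `wc_1930 f` naming follows the question. It uses vapply over the country column instead of apply over rows, which is a swapped-in choice that avoids apply's matrix conversion.

```r
wc_dataframes <- list(
  wc_1930 = data.frame(country = c("Uruguay", "Argentina"), stringsAsFactors = FALSE),
  wc_1934 = data.frame(country = c("Italy", "Germany"), stringsAsFactors = FALSE)
)
wc_match_dataframes <- list(
  `wc_1930 f` = data.frame(w = c("Uruguay", "Uruguay", "Argentina"),
                           stringsAsFactors = FALSE),
  `wc_1934 f` = data.frame(w = c("Italy", "Draw", "Germany", "Italy"),
                           stringsAsFactors = FALSE)
)

years <- names(wc_dataframes)
# setNames() keeps the year names on the list that lapply returns
wc_dataframes <- lapply(setNames(years, years), function(year) {
  df <- wc_dataframes[[year]]                      # [[ ]] uses the value of `year`
  wins <- wc_match_dataframes[[paste(year, "f")]]$w
  df$fw <- vapply(df$country,
                  function(country) sum(wins == country, na.rm = TRUE),
                  integer(1), USE.NAMES = FALSE)
  df                                               # return the updated data frame
})
```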

Simple lookup to insert values in an R data frame

This is a seemingly simple R question, but I don't see an exact answer here. I have a data frame (alldata) that looks like this:
Case zip market
1 44485 NA
2 44488 NA
3 43210 NA
There are over 3.5 million records.
Then, I have a second data frame, 'zipcodes'.
market zip
1 44485
1 44486
1 44488
... ... (100 zips in market 1)
2 43210
2 43211
... ... (100 zips in market 2, etc.)
I want to find the correct value for alldata$market for each case based on alldata$zip matching the appropriate value in the zipcode data frame. I'm just looking for the right syntax, and assistance is much appreciated, as usual.
Since you don't care about the existing market column in alldata, you can first strip it off by selecting only the Case and zip columns, and then join alldata and zipcodes on the zip column using merge:
merge(alldata[, c("Case", "zip")], zipcodes, by="zip")
The by parameter specifies the key criteria, so if you have a compound key, you could do something like by=c("zip", "otherfield").
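One thing to watch with merge: by default it keeps only rows whose key appears in both data frames, and it sorts the result by the key. A small sketch with made-up data (zip 99999 is an invented value with no market) showing all.x = TRUE, which keeps every case:

```r
alldata <- data.frame(Case = 1:4, zip = c(44485, 44488, 43210, 99999))
zipcodes <- data.frame(market = c(1, 1, 1, 2, 2),
                       zip = c(44485, 44486, 44488, 43210, 43211))

# Default merge: Case 4 is dropped because its zip has no match
inner <- merge(alldata, zipcodes, by = "zip")

# all.x = TRUE keeps Case 4 and fills its market with NA
outer <- merge(alldata, zipcodes, by = "zip", all.x = TRUE)

# merge sorts by the key, so restore the original case order if it matters
outer <- outer[order(outer$Case), ]
```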
Another option that worked for me and is very simple:
alldata$market<-with(zipcodes, market[match(alldata$zip, zip)])
With such a large data set you may want the speed of an environment lookup. You can use the lookup function from the qdapTools package as follows:
library(qdapTools)
alldata$market <- lookup(alldata$zip, zipcodes[, 2:1])
Or
alldata$zip %l% zipcodes[, 2:1]
Here's the dplyr way of doing it:
library(tidyverse)
alldata %>%
select(-market) %>%
left_join(zipcodes, by="zip")
which, on my machine, is roughly the same performance as lookup.
The syntax of match is a bit clumsy. You might find the lookup package easier to use.
library(lookup)
alldata <- data.frame(Case=1:3, zip=c(44485,44488,43210), market=c(NA,NA,NA))
zipcodes <- data.frame(market=c(1,1,1,2,2), zip=c(44485,44486,44488,43210,43211))
alldata$market <- lookup(alldata$zip, zipcodes$zip, zipcodes$market)
alldata
## Case zip market
## 1 1 44485 1
## 2 2 44488 1
## 3 3 43210 2

getting the max() of a data frame under certain conditions

I have a rather large dataframe with 13 variables. Here is the first line just to give an idea:
prov_code nuts1 nuts1name nuts2 nuts2name prov_geoorder prov_name NUTS_ID EDAD year ORDER graphs value prov_geo
1. 15 1 NW 11 Galicia 1 La Corunna ES111 11 1975 1 1 0.000000000 La Corunna
I would like to obtain the maximum for a certain set of variables according to a combination of variables year ORDER and prov_code (ie, f_all being my data.frame: f_all[(f_all$year==1975)&(f_all$ORDER==1)&(f_all$prov_code=="1"),] ). The goal is to repeat the operation in order to obtain a new data frame containing all the maximum values for each year, ORDER, prov_code.
Is there a simple and quick way to do this?
Thanks for any suggestion on the matter,
There are several ways of doing this, for example the one @James mentions. I want to suggest using plyr:
library(plyr)
ddply(f_all, .(year, ORDER, prov_code), summarise, mx_value = max(value))
Alternatively, if you have a lot of data, data.table provides similar functionality, but is much much faster in that case.
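If adding a package is not an option, base R's aggregate() can compute the same grouped maximum. This is a different function from the ddply call above, sketched on a few invented rows that reuse the question's column names:

```r
# Made-up rows; only the column names come from the question
f_all <- data.frame(
  year = c(1975, 1975, 1975, 1976),
  ORDER = c(1, 1, 2, 1),
  prov_code = c(15, 15, 15, 15),
  value = c(0.0, 0.4, 0.1, 0.7)
)

# One row per (year, ORDER, prov_code) combination, holding the maximum of value
max_by_group <- aggregate(value ~ year + ORDER + prov_code, data = f_all, FUN = max)
max_by_group
```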

Need help formatting date in R

I am trying to get a simple bar chart of activity count by date; however, when I import my data into R, it is either skipping some records or not converting the date format properly.
Here is the script I am using:
ua <- read.table('report_users_activities_byrole 2.txt',sep='|',header=T)
qplot(date,
data=ua,
geom="bar",
weight=count,
ylab="User Count",
fill=factor(un_region)) +
opts(axis.text.x =theme_text(angle=45, size=5))
And my data:
head(ua)
date role name un_region un_subregion us_state count
1 2012-06-21 ENTREPRENEUR Australia Oceania Australia and New Zealand 2
2 2012-06-21 ENTREPRENEUR Belgium Europe Western Europe 1
3 2012-06-21 ENTREPRENEUR Bosnia and Herzegovina Europe Southern Europe 1
I suspect you need something like
ua[,"date"] <- as.Date(ua[,"date"])
to turn the textual representation of the dates you got from reading the file into an actual Date type.
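A quick way to confirm the conversion, sketched on a two-row stand-in for ua since the real file is not available here. Whether the column arrives as character or factor depends on the R version and the read.table settings; as.Date accepts either.

```r
# Two made-up rows in the same yyyy-mm-dd layout as the question's data
ua <- data.frame(date = c("2012-06-21", "2012-06-22"), count = c(2, 1),
                 stringsAsFactors = FALSE)

class(ua$date)   # "character"
ua$date <- as.Date(ua$date, format = "%Y-%m-%d")
class(ua$date)   # "Date"
```

Strings that do not fit the given format come back as NA, so counting is.na(ua$date) after the conversion is a cheap check for malformed records.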
Not sure what's wrong with your code but something like this should work (that's a version of the example at http://had.co.nz/ggplot2/scale_date.html)
df = data.frame(date=sample(seq(Sys.Date(), len=100, by="1 day"),size=100,replace=TRUE))
qplot(x=date,data=df,geom="bar")
df is a data.frame where some dates appear more often than others (that's what the sample() call does). I'm not sure why you want the "weight" argument in your qplot() call. Also make sure your date variable is a proper date (not a string), i.e. check it with
str(df$date)
otherwise
qplot(x=factor(date),data=df,geom="bar")
should work as well.
Looks like I had some encoding issues with my data extract. I used Google Refine to clean up the import and then ran
ua <- read.csv("~/Desktop/R Working/report_users_activities_byrole.csv")
and it worked.

Resources