I am trying to get a simple bar chart of activity counts by date; however, when I import my data into R, it either skips some records or does not convert the date format properly.
Here is the script I am using:
library(ggplot2)
ua <- read.table('report_users_activities_byrole 2.txt', sep='|', header=TRUE)
qplot(date,
      data=ua,
      geom="bar",
      weight=count,
      ylab="User Count",
      fill=factor(un_region)) +
  theme(axis.text.x = element_text(angle=45, size=5)) # opts()/theme_text() are defunct; theme()/element_text() are the current equivalents
And my data:
head(ua)
date role name un_region un_subregion us_state count
1 2012-06-21 ENTREPRENEUR Australia Oceania Australia and New Zealand 2
2 2012-06-21 ENTREPRENEUR Belgium Europe Western Europe 1
3 2012-06-21 ENTREPRENEUR Bosnia and Herzegovina Europe Southern Europe 1
I suspect you need something like
ua[,"date"] <- as.Date(ua[,"date"])
to turn the textual representation of the dates you got from reading the file into an actual Date type.
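For instance, a minimal self-contained sketch of the conversion (toy data, not your file):
# Toy example: dates read in as text, then converted to Date
ua <- data.frame(date = c("2012-06-21", "2012-06-22", "2012-06-22"),
                 count = c(2, 1, 1), stringsAsFactors = FALSE)
str(ua$date)                # chr: still text, not dates
ua$date <- as.Date(ua$date)
str(ua$date)                # Date[1:3]; ggplot2 will now treat it as a date axis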
I'm not sure what's wrong with your code, but something like this should work (it's a version of the example at http://had.co.nz/ggplot2/scale_date.html):
df <- data.frame(date=sample(seq(Sys.Date(), length.out=100, by="1 day"), size=100, replace=TRUE))
qplot(x=date,data=df,geom="bar")
df is a data.frame in which some dates appear more often than others (that's the sample() function). I'm not sure why you want the weight argument in your qplot() call. Also make sure your date variable is a proper Date (not a string), i.e. do
str(df$date)
otherwise
qplot(x=factor(date),data=df,geom="bar")
should work as well.
It looks like I had some encoding issues with my data extract. I used Google Refine to clean up the import, then
ua <- read.csv("~/Desktop/R Working/report_users_activities_byrole.csv")
and it worked.
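For the record, if anyone hits the same thing: read.csv() also takes a fileEncoding argument, which might have avoided the Google Refine detour, assuming the extract was, say, UTF-8 with a byte-order mark:
ua <- read.csv("~/Desktop/R Working/report_users_activities_byrole.csv",
               fileEncoding = "UTF-8-BOM")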
I have a data set produced by the following script:
library(ggmap)
countries <- c('Ghana', 'Guinea', 'Mali', 'Niger')
withLocation <- data.frame(countries, geocode(countries))
Once I run that, I get data like this:
countries lon lat
1 Ghana -1.023194 7.946527
2 Guinea -9.696645 9.945587
3 Mali -3.996166 17.570692
4 Niger NA NA
Now I have missing values for 'Niger' and want to update only that row, since re-running the Google API on the complete list may fail on a different country. Please help me achieve this.
You want to know how to select the part of your data frame that contains the values needing replacement?
na_rows <- is.na(withLocation$lon) & is.na(withLocation$lat)
withLocation[na_rows, c("lon", "lat")] <- c(8.08, 17.61)  # approximate lon/lat for Niger; verify before use
I'm not sure this is going to solve your problem, but feel free to write me a comment and let me know what needs improving.
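Alternatively, instead of hard-coding the coordinates, you could re-run geocode() on just the missing rows; a sketch, assuming the API returns a hit on the retry:
library(ggmap)
# find the rows where geocoding failed, and geocode only those countries again
na_rows <- is.na(withLocation$lon) & is.na(withLocation$lat)
retry <- geocode(as.character(withLocation$countries[na_rows]))
withLocation[na_rows, c("lon", "lat")] <- retry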
I've got a set of 20 or so consecutive individual-level cross-sectional data sets which I would like to link together.
Unfortunately, there's no time-stable ID number; there are, however, fields for first, last, and maiden names, as well as year of birth--this should allow for a pretty high (90-95%) match rate, I presume.
Ideally, I would create a time-independent ID for each unique individual.
I can do this pretty easily in R for those whose marital status (maiden name) does not change: stack the data sets to get a long panel, then do something to the effect of:
unique(dt,by=c("first_name","last_name","birth_year"))[,id:=.I]
(I'm of course using R data.table), then merging back to the full data.
However, I'm stuck on how to incorporate the maiden name to this procedure. Any suggestions?
Here's a preview of the data:
first_name last_name nee birth_year year
1: eileen aaldxxxx dxxxx 1977 2002
2: eileen aaldxxxx dxxxx 1977 2003
3: sarah aaxxxx gexxxx 1974 2003
4: kelly aaxxxx nxxxx 1951 2008
5: linda aarxxxx-gxxxx aarxxxx 1967 2008
---
72008: stacey zwirxxxx kruxxxx 1982 2010
72009: stacey zwirxxxx kruxxxx 1982 2011
72010: stacey zwirxxxx kruxxxx 1982 2012
72011: stacey zwirxxxx kruxxxx 1982 2013
72012: jill zydoxxxx gundexxxx 1978 2002
UPDATE:
I've done a lot of chipping and hammering at the problem; here's what I've got so far. I would appreciate any comments on possible improvements to the code.
I'm still completely missing something like 3-5% of matches due to inexact matches ("tonya" vs. "tanya", "jenifer" vs. "jennifer"); I haven't come up with a clean way of doing fuzzy matching on the stragglers, so there's room for better matching in that direction if anyone's got a straightforward way to implement that.
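One direction I've been eyeing but haven't vetted: base R's adist() computes pairwise edit distances, so name pairs one edit apart could at least be flagged for review. A rough sketch:
# Untested sketch: flag first names one edit apart (e.g. "tonya"/"tanya")
cand <- c("tonya", "tanya", "jenifer", "jennifer")
d <- adist(cand)                                   # pairwise Levenshtein distances
close_pairs <- which(d == 1 & upper.tri(d), arr.ind = TRUE)
cbind(cand[close_pairs[, 1]], cand[close_pairs[, 2]])  # candidate pairs to review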
The basic approach is to build cumulatively: assign IDs in the first year, then look for matches in the second year and assign new IDs to the unmatched; then for year 3, look back at the first two years, and so on. As for how to match, the idea is to slowly expand the matching criteria, on the theory that the more robust the match, the lower the chance of an accidental mismatch (I'm particularly worried about the John Smiths).
Without further ado, here's the main function for matching a pair of data sets:
get_id <- function(yr, key_from, key_to=key_from,
                   mdis, msch, mard, init, mexp, step){
  # NB: full_data, flags, and update_cols are defined elsewhere in the script
  # Want to exclude anyone who is already matched
  existing_ids <- full_data[.(yr), unique(na.omit(teacher_id))]
  # Get the most recent prior observation of all
  #   unmatched teachers, excluding those teachers
  #   who cannot be uniquely identified by the
  #   current key setting
  unmatched <-
    full_data[.(1996:(yr-1))
              ][!teacher_id %in% existing_ids,
                .SD[.N], by=teacher_id,
                .SDcols=c(key_from, "teacher_id")
              ][, if (.N==1L) .SD, keyby=key_from
              ][, (flags):=list(mdis, msch, mard, init, mexp, step)]
  # Merge, reset keys
  setkey(setkeyv(
    full_data, key_to)[year==yr & is.na(teacher_id),
                       (update_cols):=unmatched[.SD, update_cols, with=FALSE]],
    year)
  full_data[.(yr), (update_cols):=lapply(.SD, function(x) na.omit(x)[1]),
            by=id, .SDcols=update_cols]
}
Then I basically go through the 19 years yy in a for loop, running 12 progressively looser matches, e.g. step 3 is:
get_id(yy, c("first_name_clean", "last_name_clean", "birth_year"),
       mdis=TRUE, msch=TRUE, mard=FALSE, init=FALSE, mexp=FALSE, step=3L)
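Schematically, the driver looks something like this (the year range and key settings here are illustrative only, and nee_clean is a stand-in for my cleaned maiden-name column):
# Illustrative driver loop; the real script runs 12 get_id() calls per year
for (yy in 1997:2014) {
  get_id(yy, c("first_name_clean", "last_name_clean", "nee_clean", "birth_year"),
         mdis=TRUE, msch=TRUE, mard=TRUE, init=FALSE, mexp=FALSE, step=1L)
  # ... steps 2 through 11, each with a progressively looser key ...
  get_id(yy, c("first_name_clean", "birth_year"),
         mdis=FALSE, msch=FALSE, mard=FALSE, init=TRUE, mexp=TRUE, step=12L)
}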
The final step is to assign new IDs:
current_max <- full_data[.(yy), max(teacher_id, na.rm=TRUE)]
new_ids <-
  setkey(full_data[year==yy & is.na(teacher_id), .(id=unique(id))
                   ][, add_id:=.I+current_max], id)
setkey(setkey(full_data, id)[year==yy & is.na(teacher_id),
                             teacher_id:=new_ids[.SD, add_id]], year)
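A sanity check along these lines may be worth running after each year (sketch only; assumes full_data is keyed by year, as above):
# Check that no non-missing teacher_id was handed to two different rows this year
stopifnot(full_data[.(yy), anyDuplicated(na.omit(teacher_id)) == 0L])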
I'm a newbie to R programming. I have a CSV file containing countries, life expectancy, and region, and I have to do the following:
1. List the number of countries per region and draw a bar chart
2. Draw a boxplot for each region
3. Cluster countries based on life expectancy using the k-means algorithm
4. Name the countries that have the minimum and maximum life expectancy.
input.csv
Country,LifeExpectancy,Region
India,60,Asia
Srilanka,62,Asia
Myanmar,61,Asia
USA,65,America
Canada,65,America
UK,68,Europe
Belgium,67,Europe
Germany,69,Europe
Switzerland,70,Europe
France,68,Europe
What I did:
1.
mydata <- read.table("input.csv", header=TRUE, sep=",")
barplot(mydata$ncol(Region))
and I get the error Error in barplot(mydata$ncol(Region)) : attempt to apply non-function
2.
boxplot(LifeExpectancy~Region, data=mydata) ## This is correct
3. I have no idea how to do this!
4.
min(mydata$LifeExpectancy); max(mydata$LifeExpectancy) ## This is correct
As I pointed out in my comments, this question is really multiple questions, and does not reflect the title. In future, please try to keep questions manageable and discrete. I'm not going to attempt to answer your third point (about K-means clustering) here. Search SO and I'm sure you will find some relevant questions/answers.
Regarding your other questions, have a careful look at the following. If you don't understand what a particular function is doing, refer to ?function_name (e.g. ?tapply), and for further enlightenment, run nested code from the inside out (e.g. for foo(bar(baz(x))), you could examine baz(x), then bar(baz(x)), and finally foo(bar(baz(x)))). This is an easy way to help you get a handle on what's going on, and is also useful when debugging code that produces errors.
d <- read.csv(text='Country,LifeExpectancy,Region
India,60,Asia
Srilanka,62,Asia
Myanmar,61,Asia
USA,65,America
Canada,65,America
UK,68,Europe
Belgium,67,Europe
Germany,69,Europe
Switzerland,70,Europe
France,68,Europe', header=TRUE)
barplot(with(d, tapply(Country, Region, length)), cex.names=0.8,
ylab='No. of countries', xlab='Region', las=1)
boxplot(LifeExpectancy ~ Region, data=d, las=1,
xlab='Region', ylab='Life expectancy')
d$Country[which.min(d$LifeExpectancy)]
# [1] India
# Levels: Belgium Canada France Germany India Myanmar Srilanka Switzerland UK USA
d$Country[which.max(d$LifeExpectancy)]
# [1] Switzerland
# Levels: Belgium Canada France Germany India Myanmar Srilanka Switzerland UK USA
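And although I said I wouldn't tackle k-means here, a bare-bones sketch with base R's kmeans() may at least get you started (assuming k=3; choosing k sensibly is a topic in itself):
set.seed(1)                          # kmeans uses random starting centers
km <- kmeans(d$LifeExpectancy, centers=3)
split(d$Country, km$cluster)         # countries grouped by cluster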
I have a data frame in R that includes country ISO codes. The ISO code for Namibia happens to be 'NA', and R treats this text 'NA' as a missing value.
For example, the code below gives me the row for Namibia:
test <- subset(country.info,is.na(country.info$iso.code))
I initially thought it might be a factor issue, so I made sure the iso code column is character. But this didn't help.
How can this be solved?
This probably relates to how you read in the data. Just because the column is character doesn't mean your "NA" isn't a real NA, e.g.:
z <- c("NA", NA, "US")
class(z)
#[1] "character"
is.na(z)
#[1] FALSE  TRUE FALSE
You could confirm this by giving us a dput() of (part of) your data.
When you read in your data, try changing na.strings = "NA" (e.g., in read.csv) to something else and see if it works.
For example, with na.strings = "":
read.table(text="code country
NA Namibia
DE Germany
FR France", stringsAsFactors=FALSE, header=TRUE, na.strings="")
# code country
# 1 NA Namibia
# 2 DE Germany
# 3 FR France
Make sure that using "" doesn't end up changing anything else. Otherwise, you can use a string that will definitely not occur in your file, like "z_z_z" or something similar. You can replace the text=... argument with your file name.
If Thomas's solution doesn't work, you can always use the countrycode package to change your country codes to something that causes fewer problems; in your case, from two-letter ISO2 to three-letter ISO3, for instance:
library(countrycode)
country.info$iso.code <- countrycode(country.info$iso.code, "iso2c", "iso3c", warn=TRUE)
If iso2c causes problems, use "country.name" instead, hoping the Republic of Congo and the Democratic Republic of the Congo don't mess things up.
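A quick illustration, assuming the codes were read in as literal strings (so Namibia's code really is the text "NA"):
library(countrycode)
countrycode(c("NA", "DE", "FR"), "iso2c", "iso3c")
# [1] "NAM" "DEU" "FRA"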
Suppose there is a dataset of different regions, each region a subset of a state, and some outcome variable:
regions <- c("Michigan, Eastern",
"Michigan, Western",
"Minnesota",
"Mississippi, Northern",
"Mississippi, Southern",
"Missouri, Eastern",
"Missouri, Western")
set.seed(123)
outcome <- rpois(7, 12)
testset <- data.frame(regions,outcome)
regions outcome
1 Michigan, Eastern 10
2 Michigan, Western 11
3 Minnesota 17
4 Mississippi, Northern 12
5 Mississippi, Southern 12
6 Missouri, Eastern 17
7 Missouri, Western 13
A useful tool would aggregate the regions and take the sum, mean, maximum, etc. of outcome by state, generating a new data frame of states. A sum, for example, would output this:
state outcome
1 Michigan 21
3 Minnesota 17
4 Mississippi 24
6 Missouri 30
The aggregate() function alone won't solve this problem. Is there something else in R that is built for this? It seems like grep could be used to generate a new "state" column as part of an application-specific program, but it seems like this would already be out there somewhere.
The reason this isn't straightforward is that the structure of your data is not consistent, so you couldn't build a general-purpose library for it.
Your "state, region" column is basically an index column, and you want to index across part of it. tapply is designed for this, but there's no reason to build in a function that does it automatically for this specific scenario. You can do it without creating a new column, though:
tapply(outcome, gsub(",.*$", "", testset$regions), sum)
The gsub() call just removes the comma and everything after it, leaving the state name to serve as the index.
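If you do want a data frame like the one in your question, the same gsub() trick feeds straight into aggregate():
# Derive the state column, then sum outcome by state
testset$state <- gsub(",.*$", "", testset$regions)
aggregate(outcome ~ state, data=testset, FUN=sum)
#         state outcome
# 1    Michigan      21
# 2   Minnesota      17
# 3 Mississippi      24
# 4    Missouri      30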