Creating an extra column based on two other dataframes in R

I have three datasets:
one containing a bunch of information about storms,
one that contains the full names of the states and their abbreviations,
and one that contains the year and population for each state.
What I want to do is add a column called population to the first dataframe, storms, containing each state's population in the year of the storm, using the other two dataframes, state_codes and states.
Can anyone point me in the right direction? Below is some sample data.
> head(storms)
  num   yr mo dy     time state magnitude injuries fatalities crop_loss
1   1 1950  1  3 11:00:00    MO         3        3          0         0
2   1 1950  1  3 11:10:00    IL         3        0          0         0
3   2 1950  1  3 11:55:00    IL         3        3          0         0
4   3 1950  1  3 16:00:00    OH         1        1          0         0
5   4 1950  1 13 05:25:00    AR         3        1          1         0
6   5 1950  1 25 19:30:00    MO         2        5          0         0
> head(state_codes)
        Name Abbreviation
1    Alabama           AL
2     Alaska           AK
3    Arizona           AZ
4   Arkansas           AR
5 California           CA
6   Colorado           CO
> head(states)
  Year Alabama Arizona Arkansas California Colorado Connecticut Delaware
1 1900    1830     124     1314       1490      543         910      185
2 1901    1907     131     1341       1550      581         931      187
3 1902    1935     138     1360       1623      621         952      188
4 1903    1957     144     1384       1702      652         972      190
5 1904    1978     151     1419       1792      659         987      192
6 1905    2012     158     1447       1893      680        1010      194

You didn't provide much data to test with, but this should do it.
First, join storms to state_codes so that it has the state names used in states. We can rename yr to match states$Year at the same time.
Then pivot states into long form.
Finally, join the new version of storms to the long version of states; with no by argument, left_join matches on the shared Year and Name columns.
library(dplyr)
library(tidyr)
storms %>%
  left_join(state_codes, by = c("state" = "Abbreviation")) %>%
  rename(Year = yr) -> storms.with.names

states %>%
  pivot_longer(-Year, names_to = "Name",
               values_to = "Population") -> long.states

storms.with.names %>%
  left_join(long.states) -> result
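One caveat with the excerpts shown: the storms sample starts in 1950 but the states sample stops at 1905, so joining just these rows would give NA populations. On the full data, a quick check like the following (using the column names assumed above) should show each storm row carrying its state's population for its year:

head(result[, c("state", "Year", "Name", "Population")])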

This answer doesn't use dplyr, but I'm offering it because I know it's very fast on large datasets.
It follows the same first step as the accepted answer: merge state names into the storms dataset. But then it does something clever (I stole the idea): it builds a matrix of row and column numbers, and uses that matrix to index the "states" dataset directly, extracting the elements you want for the new column.
# Add the state names to storms
storms <- merge(storms, state_codes, by.x = "state", by.y = "Abbreviation", all.x = TRUE)
# Get row and column indexes for the elements in 'states'
r <- match(storms$yr, states$Year)
c <- match(storms$Name, names(states))  # Name is the column merged in from state_codes
smat <- cbind(r, c)
# Indexing a data frame with a two-column (row, column) matrix
# returns the matching elements as a vector
storms$population <- states[smat]

Related

Grouping and/or Counting in R

I'm trying to 're-count' a column in R and am having issues cleaning up the data. I'm cleaning the data by location, and the problem shows up once I change CA to California.
library(dplyr)
all_location <- read.csv("all_location.csv", stringsAsFactors = FALSE)
all_location <- count(all_location, location)
all_location <- all_location[with(all_location, order(-n)), ]
all_location
# A tibble: 100 x 2
   location        n
   <chr>       <int>
 1 CA           3216
 2 Alaska       2985
 3 Nevada        949
 4 Washington    253
 5 Hawaii        239
 6 Montana       218
 7 Puerto Rico   149
 8 California    126
 9 Utah           83
10 NA             72
From the above, there are both CA and California. Below, I use grep and replace to change CA to California. However, the result still shows two separate rows for California.
ca1 <- grep("CA", all_location$location)
all_location$location <- replace(all_location$location, ca1, "California")
all_location
# A tibble: 100 x 2
   location        n
   <chr>       <int>
 1 California   3216
 2 Alaska       2985
 3 Nevada        949
 4 Washington    253
 5 Hawaii        239
 6 Montana       218
 7 Puerto Rico   149
 8 California    126
 9 Utah           83
10 NA             72
My goal is to combine both into a single total under n.
all_location$location[substr(all_location$location, 1, 5) %in% "Calif"] <- "California"
This makes sure everything that starts with "Calif" gets turned into "California".
I am assuming there may be a stray space in the existing "California" entries (e.g. "California "), which is why this is happening.
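That fixes the labels, but because the replacement happened after counting, the two California rows still need to be combined. A minimal sketch, assuming the counted tibble from above:

library(dplyr)
# Re-aggregate so the two "California" rows collapse into one total
all_location <- all_location %>%
  group_by(location) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))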

Frequency categories getting randomly split with table function

I have a large 2-column data frame (a) with country codes (ALB, ALG, ...) and years. There are thousands of unordered rows, so the country rows repeat often and in no particular order:
> a
   Country Year
 1     ALB 1991
 2     ALB 1993
 3     ALB 1994
 4     ALB 1994
 5     ALB 1996
 6     ALG 1996
 7     ALG 1971
 8     AUS 1942
 9     BLG 1998
10     BLG 1923
11     PAR 1957
12     PAR 1994
...
I tried frequency <- data.frame(table(a[,1])) but it does something really weird. It gives me something like this:
    Var1 Freq
1    AFG    1
2    ALB    3
3    ARG    1
4    AUS    1
5    AUT    3
6    AZE    2
7    BEL    3
8    BEN    2
9    BGD    3
10   BIH    4
...
129  ALB   33
130  ALG   73
131  AMS    7
132  ANC    1
133  AND    3
134  ANG   36
135  ANT    4
136  ARG  148
137  ARM   12
138  AUS  268
139  AUT  144
...
It goes through most of the country codes and then goes through them once more, giving me one or two entries for each country. If I add the frequencies up, they give the correct total for their respective countries... but I have no idea why they're getting split like this.
In addition, the countries are getting split at all sorts of random places. The first instance is a relatively small number (no more than 20, with one exception) while the second instance is usually, but not always, a larger number. Some countries (AFG) only appear in the first instance, while others (ANC) only appear in the second...
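A likely explanation, though it can't be confirmed from the sample alone: table() never counts the same string twice, so labels that appear to repeat almost certainly differ invisibly, for example by trailing whitespace or non-breaking spaces in some of the country codes. A diagnostic sketch:

# Print the raw values with delimiters to expose hidden characters
unique(sprintf("[%s]", as.character(a$Country)))
# If whitespace is the culprit, trim it and tabulate again
a$Country <- trimws(a$Country)
frequency <- data.frame(table(a$Country))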

Selecting unique non-repeating values

I have some panel data from 2004-2007 which I would like to select according to unique values. To be more precise, I'm trying to find the entries and exits of individual stores throughout the period. Data sample:
store year    rev space market
    1 2004 110000  1095    136
    1 2005 110000  1095    136
    1 2006 110000  1095    136
    1 2007 120000  1095    136
    2 2004  35000   800    136
    3 2004  45000  1000    136
    3 2005  45000  1000    136
    3 2006  45000  1000    136
    3 2007  45000  1000    136
    4 2005  17500   320    136
    4 2006  17500   320    136
    4 2007  17500   320    136
    5 2005  45000   580    191
    5 2006  45000   580    191
    5 2007  45000   580    191
    6 2004   7000   345    191
    6 2005   7000   345    191
    6 2006   7000   345    191
    7 2007  10000   500    191
So for instance I would like to find out how many stores have exited the market throughout the period, which should look like:
store year   rev space market
    2 2004 35000   800    136
    6 2006  7000   345    191
As well as how many stores have entered the market, which would imply:
store year   rev space market
    4 2005 17500   320    136
    5 2005 45000   580    191
    7 2007 10000   500    191
UPDATE:
I didn't mention that it should also identify incumbent stores, such as:
store year    rev space market
    1 2004 110000  1095    136
    1 2005 110000  1095    136
    1 2006 110000  1095    136
    1 2007 120000  1095    136
    3 2004  45000  1000    136
    3 2005  45000  1000    136
    3 2006  45000  1000    136
    3 2007  45000  1000    136
Since I'm pretty new to R, I've been struggling to do this right even on a year-by-year basis. Any suggestions?
Using the data.table package, if your data.frame is called df:
library(data.table)
dt <- data.table(df)

exit <- dt[, list(ExitYear = max(year)), by = store]
exit <- exit[ExitYear != 2007]  # or whatever the "current year" is for this table

enter <- dt[, list(EntryYear = min(year)), by = store]
enter <- enter[EntryYear != 2004]  # the first year of the panel, so pre-existing stores don't count as entrants
UPDATE
To get all columns instead of just the year and store, you can do:
exit <- dt[, .SD[year == max(year)], by = store]
exit[year != 2007]
   store year   rev space market
1:     2 2004 35000   800    136
2:     6 2006  7000   345    191
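The same pattern, mirrored, gives the entrants with all of their columns (a sketch, assuming as above that 2004 is the first year of the panel):

enter <- dt[, .SD[year == min(year)], by = store]
enter[year != 2004]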
Using only base R functions, this is pretty simple:
> subset(aggregate(df["year"], df["store"], max), year != 2007)
  store year
2     2 2004
6     6 2006
and
> subset(aggregate(df["year"], df["store"], min), year != 2004)
  store year
4     4 2005
5     5 2005
7     7 2007
or using formula syntax:
> subset(aggregate(year ~ store, df, max), year != 2007)
  store year
2     2 2004
6     6 2006
and
> subset(aggregate(year ~ store, df, min), year != 2004)
  store year
4     4 2005
5     5 2005
7     7 2007
Update: Getting all of the columns isn't possible with aggregate, so we can use base by instead. by isn't as clever at reassembling the result:
Filter(function(x) x$year != 2007, by(df, df$store, function(s) s[s$year == max(s$year), ]))
$`2`
  store year   rev space market
5     2 2004 35000   800    136

$`6`
   store year  rev space market
18     6 2006 7000   345    191
So we need to take that step - let's build a little wrapper:
by2 <- function(x, c, ...) { Reduce(rbind, by(x, x[c], simplify = FALSE, ...)) }
And now use that instead:
> subset(by2(df, "store", function(s) s[s$year == max(s$year), ]), year != 2007)
   store year   rev space market
5      2 2004 35000   800    136
18     6 2006  7000   345    191
We can further clarify this by creating a function for getting a row which has the stat (min or max) for a particular column:
statmatch <- function(column, stat) function(df) { df[df[column] == stat(df[column]), ] }
> subset(by2(df, "store", statmatch("year", max)), year != 2007)
   store year   rev space market
5      2 2004 35000   800    136
18     6 2006  7000   345    191
dplyr
Using all of these base functions, which don't really resemble each other, starts to get fiddly after a while, so it's a great idea to learn and use the excellent (and performant) dplyr package:
> df %>% group_by(store) %>%
    arrange(-year) %>% slice(1) %>%
    filter(year != 2007) %>% ungroup
Source: local data frame [2 x 5]

  store year   rev space market
1     2 2004 35000   800    136
2     6 2006  7000   345    191
and
> df %>% group_by(store) %>%
    arrange(+year) %>% slice(1) %>%
    filter(year != 2004) %>% ungroup
Source: local data frame [3 x 5]

  store year   rev space market
1     4 2005 17500   320    136
2     5 2005 45000   580    191
3     7 2007 10000   500    191
NB The ungroup is not strictly necessary here, but puts the table back in a default state for further calculations.
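The incumbents from the update aren't covered by the answers above. A minimal dplyr sketch, assuming an incumbent is a store observed in both the first (2004) and last (2007) year of the panel:

library(dplyr)
# Keep every row of stores present for the whole 2004-2007 period
incumbents <- df %>%
  group_by(store) %>%
  filter(min(year) == 2004, max(year) == 2007) %>%
  ungroup()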

Loading data with missing values as numeric data

I am trying to impute missing values using the mi package in R and have run into a problem.
When I load the data into R, it recognizes the column with missing values as a factor variable. If I convert it into a numeric variable with the command
dataset$Income <- as.numeric(dataset$Income)
it converts the column to ordinal values (with the smallest value becoming 1, the second smallest 2, etc.).
I want to convert this column to numeric values while retaining the original values of the variable. How can I do this?
EDIT:
Since people have asked, here is my code and an example of what the data looks like.
DATA:
96 GERMANY 6 1960 72480 73 50.24712 NA 0.83034767 0
97 GERMANY 6 1961 73123 85 48.68375 NA 0.79377610 0
98 GERMANY 6 1962 73739 98 48.01359 NA 0.70904115 0
99 GERMANY 6 1963 74340 132 46.93588 NA 0.68753213 0
100 GERMANY 6 1964 74954 146 47.89413 NA 0.67055298 0
101 GERMANY 6 1965 75638 160 47.51518 NA 0.64411484 0
102 GERMANY 6 1966 76206 172 48.46009 NA 0.58274711 0
103 GERMANY 6 1967 76368 183 48.18423 NA 0.57696055 0
104 GERMANY 6 1968 76584 194 48.87967 NA 0.64516949 0
105 GERMANY 6 1969 77143 210 49.36219 NA 0.55475352 0
106 GERMANY 6 1970 77783 227 49.52712 3,951.00 0.53083969 0
107 GERMANY 6 1971 78354 242 51.01421 4,282.00 0.51080717 0
108 GERMANY 6 1972 78717 254 51.02941 4,655.00 0.48773913 0
109 GERMANY 6 1973 78950 264 50.61033 5,110.00 0.48390087 0
110 GERMANY 6 1974 78966 270 48.82353 5,561.00 0.56562229 0
111 GERMANY 6 1975 78682 284 50.50279 6,092.00 0.56846030 0
112 GERMANY 6 1976 78298 301 49.22833 6,771.00 0.53536154 0
113 GERMANY 6 1977 78160 321 49.18999 7,479.00 0.55012371 0
Code:
Income <- dataset$Income
gives me a factor variable, as there are NAs in the data. If I try to turn it into numeric with
as.numeric(Income)
it throws away the original values and replaces them with each value's rank in the column. I would like to keep the original values while still recognizing missing values.
A problem every data manager from Germany knows: the column with the NAs contains numbers with commas. But R only knows the English style of decimal points, without digit grouping. So this column is treated as an ordinally scaled character variable.
Remove the commas and you'll get the numeric values.
By the way, even though we write decimal commas in Germany, numbers like 3,951.00 use the comma as a thousands separator, which R's readers don't parse by default. See these examples of international number syntax.
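A minimal sketch of the fix, assuming the column is called Income as in the question: convert via as.character() to avoid the factor-codes trap, strip the grouping commas, then convert to numeric.

# Strip digit-grouping commas and convert; NA entries stay NA
dataset$Income <- as.numeric(gsub(",", "", as.character(dataset$Income)))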

Beginner tips on using plyr to calculate year-over-year change across groups

I am new to plyr (and R) and looking for a little help to get started. Using the baseball dataset as an example, how could I calculate the year-over-year (yoy) change in at bats by league and team (lg and team)?
library(plyr)
df1 <- aggregate(ab~year+lg+team, FUN=sum, data=baseball)
After doing a little aggregating to simplify the data frame, the data looks like this:
head(df1)
year lg team   ab
1884 UA  ALT  108
1997 AL  ANA 1703
1998 AL  ANA 1502
1999 AL  ANA  660
2000 AL  ANA   85
2001 AL  ANA  219
I would like to end up with something like this:
year lg team   ab  yoy
1997 AL  ANA 1703   NA
1998 AL  ANA 1502 -201
1999 AL  ANA  660 -842
2000 AL  ANA   85 -575
2001 AL  ANA  219  134
I started by writing the following function, which I think is wrong:
yoy.func <- function(df) {
  lag <- c(df$ab[-1], 0)
  cur <- c(df$ab[1], 0)
  df$yoy <- cur - lag
  return(df)
}
Without success, I used the following code to attempt to return the yoy change.
df2 <- ddply(df1, .(lg, team), yoy.func)
Any guidance would be appreciated. Thanks.
I know you asked for a "plyr"-specific solution, but for the sake of sharing, here is an alternative approach in base R. In my opinion, I find the base R approach just as "readable". And, at least in this particular case, it's a lot faster!
output <- within(df1, {
  yoy <- ave(ab, team, lg, FUN = function(x) c(NA, diff(x)))
})
head(output)
#   year lg team   ab  yoy
# 1 1884 UA  ALT  108   NA
# 2 1997 AL  ANA 1703   NA
# 3 1998 AL  ANA 1502 -201
# 4 1999 AL  ANA  660 -842
# 5 2000 AL  ANA   85 -575
# 6 2001 AL  ANA  219  134
library(rbenchmark)
benchmark(DDPLY = {
  ddply(df1, .(team, lg), mutate,
        yoy = c(NA, diff(ab)))
}, WITHIN = {
  within(df1, {
    yoy <- ave(ab, team, lg, FUN = function(x) c(NA, diff(x)))
  })
}, columns = c("test", "replications", "elapsed",
               "relative", "user.self"))
#     test replications elapsed relative user.self
# 1  DDPLY          100  10.675    4.974    10.609
# 2 WITHIN          100   2.146    1.000     2.128
Update: data.table
If your data are very large, check out data.table. Even with this example, you'll find a good speedup in relative terms. Plus the syntax is super compact and, in my opinion, easily readable.
library(plyr)
df1 <- aggregate(ab~year+lg+team, FUN=sum, data=baseball)
library(data.table)
DT <- data.table(df1)
DT
#       year lg team   ab
#    1: 1884 UA  ALT  108
#    2: 1997 AL  ANA 1703
#    3: 1998 AL  ANA 1502
#    4: 1999 AL  ANA  660
#    5: 2000 AL  ANA   85
#   ---
# 2523: 1895 NL  WSN  839
# 2524: 1896 NL  WSN  982
# 2525: 1897 NL  WSN 1426
# 2526: 1898 NL  WSN 1736
# 2527: 1899 NL  WSN  787
Now, look at this concise solution:
DT[, yoy := c(NA, diff(ab)), by = "team,lg"]
DT
#       year lg team   ab  yoy
#    1: 1884 UA  ALT  108   NA
#    2: 1997 AL  ANA 1703   NA
#    3: 1998 AL  ANA 1502 -201
#    4: 1999 AL  ANA  660 -842
#    5: 2000 AL  ANA   85 -575
#   ---
# 2523: 1895 NL  WSN  839  290
# 2524: 1896 NL  WSN  982  143
# 2525: 1897 NL  WSN 1426  444
# 2526: 1898 NL  WSN 1736  310
# 2527: 1899 NL  WSN  787 -949
How about using diff():
df <- read.table(header = TRUE, text = 'year lg team ab
1884 UA ALT 108
1997 AL ANA 1703
1998 AL ANA 1502
1999 AL ANA 660
2000 AL ANA 85
2001 AL ANA 219')

require(plyr)
ddply(df, .(team, lg), mutate,
      yoy = c(NA, diff(ab)))
#   year lg team   ab  yoy
# 1 1884 UA  ALT  108   NA
# 2 1997 AL  ANA 1703   NA
# 3 1998 AL  ANA 1502 -201
# 4 1999 AL  ANA  660 -842
# 5 2000 AL  ANA   85 -575
# 6 2001 AL  ANA  219  134
