Extracting values of a column into a string and replacing values in a data frame column - R

More than the programming, I am lost on the right approach for this problem. I have 2 data frames with a market name column. Unfortunately the names vary by a few characters / symbols in each column, e.g. Albany.Schenectady.Troy = ALBANY, Boston.Manchester = BOSTON.
I want to standardize the market names in both data frames so I can perform merge operations later.
I thought of tackling the problem in two steps:
1) Create a vector of the unique market names from both tables and use that to create a lookup table. Something that looks like:
Table 1 Markets > "Albany.Schenectady.Troy" , "Albuquerque.Santa.Fe", "Atlanta" . . . .
Table2 Markets > "SPOKANE" , "BOSTON" . . .
I tried marketnamesvector <- paste(unique(Table1$Market, sep = "", collapse = ",")) but that doesn't produce the desired output.
2) Change Market names in Table 2 to equivalent market names in Table 1. For any market name not available in Table 1, Table 2 should retain the same value in market name.
I know I could use a looping function like the one below, but I still need a lookup table I think (see the sketch after the function).
replacefunc <- function(data, oldvalue, newvalue) {
  newdata <- data
  for (i in unique(oldvalue)) newdata[data == i] <- newvalue[oldvalue == i]
  newdata
}
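For what it's worth, here is a small sketch of how steps 1 and 2 could be wired together with that function. The lookup values are hypothetical examples, and it assumes Market is a character column in both tables:
# Step 1: unique market names across both tables (note that sep/collapse are
# arguments of paste(), not unique(); unique() alone already returns a vector).
marketnamesvector <- unique(c(as.character(Table1$Market), as.character(df$Market)))
# Step 2: hypothetical lookup table mapping Table 2 names to Table 1 names;
# any name not listed here is left unchanged by replacefunc().
lookup <- data.frame(old = c("ALBANY", "BOSTON"),
                     new = c("Albany.Schenectady.Troy", "Boston.Manchester"),
                     stringsAsFactors = FALSE)
df$Market <- replacefunc(df$Market, lookup$old, lookup$new)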
Table 1: This table is 90 rows x 2 columns and has 90 unique market names.
Market Leads Investment Leads1 Leads2 Leads3
1 Albany.Schenectady.Troy NA NA NA NA NA
2 Albuquerque.Santa.Fe NA NA NA NA NA
3 Atlanta NA NA NA NA NA
4 Austin NA NA NA NA NA
5 Baltimore NA NA NA NA NA
Table 2: This table is 150K rows x 20 columns and has 89 unique market names.
> df
Spot.ID Date Hour Time Local.Date Broadcast.Week Local.Hour Local.Time Market
2 13072765 6/30/14 0 12:40 AM 2014-06-29 1 21 9:40 PM SPOKANE
261 13072946 6/30/14 5 5:49 AM 2014-06-30 1 5 5:49 AM BOSTON
356 13081398 6/30/14 10 10:52 AM 2014-06-30 1 7 7:52 AM SPOKANE
389 13082306 6/30/14 11 11:25 AM 2014-06-30 1 8 8:25 AM SPOKANE
438 13082121 6/30/14 8 8:58 AM 2014-06-30 1 8 8:58 AM BOSTON
469 13081040 6/30/14 9 9:17 AM 2014-06-30 1 9 9:17 AM ALBANY
482 13080104 6/30/14 12 12:25 PM 2014-06-30 1 9 9:25 AM SPOKANE
501 13082120 6/30/14 9 9:36 AM 2014-06-30 1 9 9:36 AM BOSTON
617 13080490 6/30/14 13 1:23 PM 2014-06-30 1 10 10:23 AM SPOKANE

Assume that the data is in data frames df1 and df2. The goal is to adjust the market names to be the same; they are currently slightly different.
First, list the markets: use the following commands to list the unique names in df1 and df2, then compare them.
mk1 <- sort(unique(df1$market))
mk2 <- sort(unique(df2$market))
dmk12 <- setdiff(mk1,mk2)
dmk21 <- setdiff(mk2,mk1)
Use dmk12 and dmk21 to identify the markets whose names differ. Decide which names to use and how they match up; say df1 has "Atlanta, GA" where df2 has "Atlanta", and you want df2 to match df1. Then use
df2[df2$market=="Atlanta","market"] = "Atlanta, GA"
The format is
df_to_change[df_to_change[,"column"]=="old data", "column"] = "new data"
If you only have 90 names to correct, I would write out 90 change lines like the one above.
After adjusting all the names, run sort(unique(...)) on both market columns again and use setdiff twice to confirm all the names are the same.
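In code, that final check is just the same commands again after the renaming; both setdiff() calls should come back empty:
# Re-check after all replacements; character(0) from both means the names agree.
mk1 <- sort(unique(df1$market))
mk2 <- sort(unique(df2$market))
setdiff(mk1, mk2)
setdiff(mk2, mk1)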

Related

Format function output as data frame

I use the following to sum several measures of TABR per Julian date in 5 separate years:
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
Which produces output like:
2015 2016 2017 2018 2019
33 NA NA NA 2 NA
....
80 NA 1 NA 21 NA
81 NA 47 NA 25 NA
82 NA 12 1 9 NA
But I want to convert these results into a dataframe with 6 columns Julian + 2015-2019.
I tried:
TABR_Day<-as.data.frame(TABR_YearDay)
But that seems to not produce a fully realized df: there is no column for Julian and if I want to call an individual variable, I have to use quotes around it like:
hist(TABR_Day$"2017")
Can you help me transition the function output to a dataframe with 6 viable columns?
The Julian column is present in the rownames of the result. Column names usually shouldn't start with a number, so we can prepend the word 'Year' to make them 'Year_2015' etc., and then construct the final data frame.
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
colnames(TABR_YearDay) <- paste0('Year_', colnames(TABR_YearDay))
TABR_Day <- data.frame(Julian = rownames(TABR_YearDay), TABR_YearDay)
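If you want Julian as a number rather than text (it comes from the rownames), one optional extra line helps:
# rownames() are character, so convert Julian if you plan to sort or plot by day.
TABR_Day$Julian <- as.numeric(as.character(TABR_Day$Julian))
hist(TABR_Day$Year_2017)   # the columns now have syntactic names, no quotes needed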

Reshaping data in R with multiple variable levels - "aggregate function missing" warning

I'm trying to use dcast in reshape2 to transform a data frame from long to wide format. The data is hospital visit dates and a list of diagnoses. (Dx.num lists the sequence of diagnoses in a single visit. If the same patient returns, this variable starts over and the primary diagnosis for the new visit starts at 1.) I would like there to be one row per individual (id). The data structure is:
id visit.date visit.id bill.num dx.code FY Dx.num
1 1/2/12 203 1234 409 2012 1
1 3/4/12 506 4567 512 2013 1
2 5/6/18 222 3452 488 2018 1
2 5/6/18 222 3452 122 2018 2
3 2/9/14 567 6798 923 2014 1
I'm imagining I would end up with columns like this:
id, date_visit1, date_visit2, visit.id_visit1, visit.id_visit2, bill.num_visit1, bill.num_visit2, dx.code_visit1_dx1, dx.code_visit1_dx2, dx.code_visit2_dx1, FY_visit1_dx1, FY_visit1_dx2, FY_visit2_dx1
Originally, I tried creating a visit_dx column like this one:
visit.dx
v1dx1 (visit 1, dx 1)
v2dx1 (visit 2, dx 1)
v1dx1 (...)
v1dx2
v1dx1
And used the following code, omitting "Dx.num" from the DF, as it's accounted for in "visit.dx":
wide <- dcast(
  setDT(long),
  id + visit.date + visit.id + bill.num ~ visit.dx,
  value.var = c("dx.code", "FY")
)
When I run this, I get the warning "Aggregate function missing, defaulting to 'length'" and a new data frame full of 0's and 1's. There are no duplicate rows in the data frame, however. I'm beginning to think I should go about this completely differently.
Any help would be much appreciated.
The data.table package extended dcast with rowid() and support for multiple value.var columns, so...
library(data.table)
dcast(setDT(DF), id ~ rowid(id), value.var=setdiff(names(DF), "id"))
id visit.date_1 visit.date_2 visit.id_1 visit.id_2 bill.num_1 bill.num_2 dx.code_1 dx.code_2 FY_1 FY_2 Dx.num_1 Dx.num_2
1: 1 1/2/12 3/4/12 203 506 1234 4567 409 512 2012 2013 1 1
2: 2 5/6/18 5/6/18 222 222 3452 3452 488 122 2018 2018 1 2
3: 3 2/9/14 <NA> 567 NA 6798 NA 923 NA 2014 NA 1 NA
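For reference, here is a self-contained version of the above, rebuilding the sample data from the question (the object name DF is an assumption):
library(data.table)
# Rebuild the sample data from the question; the object name DF is assumed.
DF <- data.frame(
  id         = c(1, 1, 2, 2, 3),
  visit.date = c("1/2/12", "3/4/12", "5/6/18", "5/6/18", "2/9/14"),
  visit.id   = c(203, 506, 222, 222, 567),
  bill.num   = c(1234, 4567, 3452, 3452, 6798),
  dx.code    = c(409, 512, 488, 122, 923),
  FY         = c(2012, 2013, 2018, 2018, 2014),
  Dx.num     = c(1, 1, 1, 2, 1)
)
# rowid(id) numbers the rows within each id, so every visit/diagnosis pair
# gets its own set of wide columns (visit.date_1, visit.date_2, ...).
wide <- dcast(setDT(DF), id ~ rowid(id), value.var = setdiff(names(DF), "id"))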

Find and tag a number between a range

I have two dfs as below
>codes1
Country State City Start No End No
IN Telangana Hyderabad 100 200
IN Maharashtra Pune (Bund Garden) 300 400
IN Haryana Gurgaon 500 600
IN Maharashtra Pune 700 800
IN Gujarat Ahmedabad (Vastrapur) 900 1000
Now I want to tag IP addresses using table 1.
>codes2
ID No
1 157
2 346
3 389
4 453
5 562
6 9874
7 98745
Now I want to tag the numbers in the codes2 df as per the range given in the codes1 df for the No column; the expected output is
ID No Country State City
1 157 IN Telangana Hyderabad
2 346 IN Maharashtra Pune (Bund Garden)
.
.
.
Basically, I want to tag the No column in codes2 with codes1 according to the range (Start No and End No) that each No observation falls in.
Also, the rows in the codes2 df could be in any order.
You could use the non-equi join capability of the data.table package for that:
library(data.table)
setDT(codes1)
setDT(codes2)
codes2[codes1, on = .(No > StartNo, No < EndNo), ## (1)
`:=`(cntry = Country, state = State, city = City)] ## (2)
(1) obtains matching row indices in codes2 corresponding to each row in codes1, while matching on the condition provided to the on argument.
(2) updates codes2 values for those matching rows for the columns specified directly by reference (i.e., you don't have to assign the result back to another variable).
This gives:
codes2
# ID No cntry state city
# 1: 1 157 IN Telangana Hyderabad
# 2: 2 346 IN Maharashtra Pune (Bund Garden)
# 3: 3 389 IN Maharashtra Pune (Bund Garden)
# 4: 4 453 NA NA NA
# 5: 5 562 IN Haryana Gurgaon
# 6: 6 9874 NA NA NA
# 7: 7 98745 NA NA NA
If you're comfortable writing SQL, you might consider using the sqldf package to do something like
library('sqldf')
result <- sqldf('select * from codes2 left join codes1 on codes2.No between codes1.StartNo and codes1.EndNo')
You may have to remove special characters and spaces from the column names of your data frames beforehand.
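As a sketch of that last point, assuming the column names are exactly as printed in the question ("Start No", "End No"), something like this strips the offending characters:
# Drop anything that is not a letter or digit, e.g. "Start No" -> "StartNo".
names(codes1) <- gsub("[^[:alnum:]]", "", names(codes1))
names(codes2) <- gsub("[^[:alnum:]]", "", names(codes2))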

R getting rid of nested for loops

I did quite some searching on how to simplify the code for the problem below but was not successful. I assume that with some kind of apply magic one could speed things up a little, but so far I still have difficulties with these kinds of functions.
I have a data.frame data, structured as follows:
year iso3c gdpppc elec solid liquid heat
2010 USA 1567 1063 1118 835 616
2015 USA 1571 NA NA NA NA
2020 USA 1579 NA NA NA NA
... USA ... NA NA NA NA
2100 USA 3568 NA NA NA NA
2010 ARG 256 145 91 85 37
2015 ARG 261 NA NA NA NA
2020 ARG 270 NA NA NA NA
... ARG ... NA NA NA NA
2100 ARG 632 NA NA NA NA
As you can see, I have a historical starting value for 2010 and a complete scenario for gdppc up to 2100. I want to let values for elec, solid, liquid and heat grow according to some elasticity with respect to the development of gdppc, but separately for each country (coded in iso3c).
I have the elasticities defined in a separate data.frame parameters:
item value
elec 0.5
liquid 0.2
solid -0.1
heat 0.1
So far I am using a nested for loop:
for (e in 1:length(levels(parameters$item))) {
  for (c in 1:length(levels(data$iso3c))) {
    tmp <- subset(data,
                  select = c("year", "iso3c", "gdppc", parameters[e, "item"]),
                  subset = (iso3c == levels(data$iso3c)[c]))
    tmp[tmp$year %in% seq(2015, 2100, 5), parameters[e, "item"]] <-
      tmp[tmp$year == 2010, parameters[e, "item"]] *
      cumprod((1 + (tmp[tmp$year %in% seq(2015, 2100, 5), "gdppc"] /
                    tmp[tmp$year %in% seq(2010, 2095, 5), "gdppc"] - 1) * parameters[e, "value"]))
    data[data$iso3c == levels(data$iso3c)[c] & data$year %in% seq(2015, 2100, 5), parameters[e, "item"]] <-
      tmp[tmp$year > 2010, parameters[e, "item"]]
  }
}
The outer loop loops over the columns and the inner one over the countries. The inner loop runs for every country (I have 180+ countries). First, a subset containing data on one single country and on the variable of interest is selected. Then I let the respective variable grow with a certain elasticity to growth in gdppc and finally put the subset back into place in data.
I have already tried to let the outer loop run in parallel using foreach but was not successful in recombining the results. Since I have to run similar calculations quite often, I would be very grateful for any help.
Thanks
Here's one way. Note that I renamed your parameters data.frame to p.
library(data.table)
library(reshape2)
dt <- data.table(data)
dt.melt <- melt(dt, id = 1:3)
dt.melt[, value := as.numeric(value)]   # coerce value column to numeric
dt.melt[, value := head(value, 1) + (gdpppc - head(gdpppc, 1)) * p[p$item == variable, ]$value,
        by = "iso3c,variable"]
result <- dcast(dt.melt, iso3c + year + gdpppc ~ variable)
result
# iso3c year gdpppc elec solid liquid heat
# 1 ARG 2010 256 145.0 91.0 85.0 37.0
# 2 ARG 2015 261 147.5 90.5 86.0 37.5
# 3 ARG 2020 270 152.0 89.6 87.8 38.4
# 4 ARG 2100 632 333.0 53.4 160.2 74.6
# 5 USA 2010 1567 1063.0 1118.0 835.0 616.0
# 6 USA 2015 1571 1065.0 1117.6 835.8 616.4
# 7 USA 2020 1579 1069.0 1116.8 837.4 617.2
# 8 USA 2100 3568 2063.5 917.9 1235.2 816.1
The basic idea is to use the melt(...) function to reshape your original data into "long" format, where the values in the four columns solid, liquid, elec, and heat are all in one column, value, and the column variable indicates which metric value refers to. Now, using data tables, you can fill in the values easily. Then, reshape the result back into wide format using dcast(...).
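For completeness, the renamed elasticity table used above is just the parameters data from the question:
# 'parameters' from the question, renamed to p as in the answer.
p <- data.frame(
  item  = c("elec", "liquid", "solid", "heat"),
  value = c(0.5, 0.2, -0.1, 0.1)
)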

Creating lag variables for matched factors

I have a question about creating lag variables depending on a time factor.
Basically I am working with a baseball dataset where there are lots of names for each player between 2002-2012. Obviously I only want lag variables for the same person, to try and create a career arc to predict the current stat. For example, I want to use lag 1 Average (2003) and lag 2 Average (2004) to try and predict the current average in 2005. So I tried to write a loop that goes through every row (the data frame is already sorted by name and then year, so the previous year is the n-1 row), checks if the name is the same, and if so grabs the value from the previous row.
Here is my loop:
# row 1 has no previous row, so the loop starts at 2
for (i in 2:6264) {
  if (TS$name[i] == TS$name[i-1]) {
    TS$runvalueL1[i] <- TS$Run_Value[i-1]
  } else {
    TS$runvalueL1[i] <- NA   # only this row should be NA, not the whole column
  }
}
Because each row is dependent on the name I cannot use most of the lag functions. If you have a better idea I am all ears!
Sample Data won't help a bunch but here is some:
Edit: Sample data wasn't producing usable results, so I just attached the first 10 people of my dataset. Thanks!
TS[(6:10),c('name','Season','Run_Value')]
name Season ARuns
321 Abad Andy 2003 -1.05
3158 Abercrombie Reggie 2006 27.42
1312 Abercrombie Reggie 2007 7.65
1069 Abercrombie Reggie 2008 5.34
4614 Abernathy Brent 2002 46.71
707 Abernathy Brent 2003 -2.29
1297 Abernathy Brent 2005 5.59
6024 Abreu Bobby 2002 102.89
6087 Abreu Bobby 2003 113.23
6177 Abreu Bobby 2004 128.60
Thank you!
Something along these lines should do it:
names = c("Adams","Adams","Adams","Adams","Bobby","Bobby", "Charlie")
years = c(2002,2003,2004,2005,2004,2005,2010)
Run_value = c(10,15,15,20,10,5,5)
library(data.table)
dt = data.table(names, years, Run_value)
dt[, lag1 := c(NA, head(Run_value, -1)), by = names]  # lag within each player's rows
# names years Run_value lag1
#1: Adams 2002 10 NA
#2: Adams 2003 15 10
#3: Adams 2004 15 15
#4: Adams 2005 20 15
#5: Bobby 2004 10 NA
#6: Bobby 2005 5 10
#7: Charlie 2010 5 NA
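If your data.table version is 1.9.6 or newer, shift() expresses the same lag a little more directly; a small alternative sketch:
# shift() lags by one position and fills with NA by default,
# so this matches the c(NA, head(...)) construction above.
dt[, lag1 := shift(Run_value), by = names]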
An alternative would be to split the data by name, use lapply with the lag function of your choice, and then combine the split data again:
TS$runvalueL1 <- do.call("rbind", lapply(split(TS, list(TS$name)), your_lag_function))
or
TS$runvalueL1 <- do.call("c", lapply(split(TS, list(TS$name)), your_lag_function))
There is probably also a nice possibility with plyr, but as you did not provide a reproducible example, that is all for now.
Better:
TS$runvalueL1 <- unlist(lapply(split(TS, list(TS$name)), your_lag_function))
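For concreteness, your_lag_function could be something like the hypothetical helper below; this relies on TS already being sorted by name, as described in the question, so the unlisted result lines up with the original row order:
# Hypothetical helper: lag Run_Value by one row within one player's rows.
your_lag_function <- function(d) c(NA, head(d$Run_Value, -1))
TS$runvalueL1 <- unlist(lapply(split(TS, TS$name), your_lag_function))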
This is obviously not a problem where you want to create a matrix with cbind, so this is a better data structure:
full=data.frame(names, years, Run_value)
The ave function is quite useful for constructing new columns within categories of other columns:
full$Lag1 <- ave(full$Run_value, full$names,
FUN= function(x) c(NA, x[-length(x)] ) )
full
names years Run_value Lag1
1 Adams 2002 10 NA
2 Adams 2003 15 10
3 Adams 2004 15 15
4 Adams 2005 20 15
5 Bobby 2004 10 NA
6 Bobby 2005 5 10
7 Charlie 2010 5 NA
I think it's safer to construct with NA, since that will help prevent errors in logic that using 0 for prior years in year 1 would not alert you to.
