How to iterate one dataframe based on a mapping file in R?

Serial No.  Company 1  Company 2  Company 3
01          NA         2          NA
02          2          NA         5
03          NA         NA         4
04          1          NA         NA
05          NA         4          NA
I have a data structure like this, where the column headings are companies and the row headings are consumers who buy their products. 'NA' means the consumer bought nothing from that company.
I have a second mapping file in which the companies are the row headings:
Company    Country  Category
Company 1  UK       FMCG
Company 2  UK       FMCG
Company 3  India    FMCG
Company 4  US       Nicotine
The data set covers over 10000 consumers and 1000 companies. I'm getting the market share for different countries and categories using the aggregate function and the mapping file.
I want to write a loop that iterates over values in the first data frame to change the share for different countries and categories. The idea is that I choose which country's (or category's) share needs to change, along with the new share, and the loop then uses the mapping file to change the values for companies in that country (or category). The values need to be changed only for those consumers who buy products from companies belonging to that country (or category).
Can someone suggest how this can be done in R (preferably) or Python?
Edit:
Before iteration I will use the aggregate function in R to get the shares for a country (or category) like this -
Country  Share
UK       0.33
US       0.02
IN       0.41
IR       0.11
PK       0.13
In the loop I want to be able to set the share for some country (say UK) to whatever is required (say 0.5). The mapping file will then determine which values in the first data structure to change: those where people have bought products from companies in the UK.
The final output will be something like this.
Country  Share
UK       0.50
US       0.00
IN       0.38
IR       0.11
PK       0.01

Here's a guess: ultimately, this is a combination of reshaping from wide to long, then a merge/join, and finally aggregating/summarizing by group. If you need more information on any of these operations, searching for those keywords (on SO) will turn up very useful information.
base R (and reshape2)
## reshape
dat1melted <- reshape2::melt(dat1, "Serial No.", variable.name = "Company")
dat1melted$Company <- as.character(dat1melted$Company)
dat1melted <- dat1melted[!is.na(dat1melted$value),]
dat1melted
# Serial No. Company value
# 2 02 Company 1 2
# 4 04 Company 1 1
# 6 01 Company 2 2
# 10 05 Company 2 4
# 12 02 Company 3 5
# 13 03 Company 3 4
## merge
dat1merged <- merge(dat1melted, dat2, by = "Company", all.x = TRUE)
dat1merged
# Company Serial No. value Country Category
# 1 Company 1 02 2 UK FMCG
# 2 Company 1 04 1 UK FMCG
# 3 Company 2 01 2 UK FMCG
# 4 Company 2 05 4 UK FMCG
# 5 Company 3 02 5 India FMCG
# 6 Company 3 03 4 India FMCG
## aggregate by group
aggregate(value ~ Country, data = dat1merged, FUN = sum)
# Country value
# 1 India 9
# 2 UK 9
dplyr
library(dplyr)
# library(tidyr) # pivot_longer
dat1 %>%
## reshape
tidyr::pivot_longer(-`Serial No.`, names_to = "Company") %>%
filter(!is.na(value)) %>%
## merge
left_join(., dat2, by = "Company") %>%
## aggregate by group
group_by(Country) %>%
summarize(value = sum(value))
# # A tibble: 2 x 2
# Country value
# <chr> <int>
# 1 India 9
# 2 UK 9
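The code above stops at the aggregated totals. The rescaling step the question asks about could be sketched as follows, in base R. This is only one interpretation, under loud assumptions that are not in the original: a country's "share" is its portion of the summed purchase values, only that country's existing purchases are scaled, and every other value is left untouched (the question's example output redistributes the other countries differently). The function names `country_shares` and `set_country_share` are hypothetical, and every company column is assumed to appear in the mapping file.

```r
# Toy data mirroring the question's two tables.
dat1 <- data.frame(
  `Serial No.` = c("01", "02", "03", "04", "05"),
  `Company 1`  = c(NA, 2, NA, 1, NA),
  `Company 2`  = c(2, NA, NA, NA, 4),
  `Company 3`  = c(NA, 5, 4, NA, NA),
  check.names = FALSE
)
dat2 <- data.frame(
  Company  = paste("Company", 1:4),
  Country  = c("UK", "UK", "India", "US"),
  Category = c("FMCG", "FMCG", "FMCG", "Nicotine"),
  stringsAsFactors = FALSE
)

# Share of each country in the total of all purchase values.
country_shares <- function(dat, map) {
  companies <- setdiff(names(dat), "Serial No.")
  totals <- colSums(dat[companies], na.rm = TRUE)
  by_country <- tapply(totals, map$Country[match(companies, map$Company)], sum)
  by_country / sum(by_country)
}

# Scale the purchases of one country's companies so that the country
# reaches the target share; NA (no purchase) stays NA.
set_country_share <- function(dat, map, country, target) {
  companies <- setdiff(names(dat), "Serial No.")
  in_country <- map$Country[match(companies, map$Company)] == country
  inside  <- sum(as.matrix(dat[companies[in_country]]), na.rm = TRUE)
  outside <- sum(as.matrix(dat[companies[!in_country]]), na.rm = TRUE)
  # Solve k * inside / (k * inside + outside) = target for k.
  k <- target * outside / ((1 - target) * inside)
  dat[companies[in_country]] <- dat[companies[in_country]] * k
  dat
}

before   <- country_shares(dat1, dat2)
dat1_new <- set_country_share(dat1, dat2, "UK", 0.6)
after    <- country_shares(dat1_new, dat2)
```

To adjust a category instead, the same idea applies with `map$Category` in place of `map$Country`. Because the other countries' values are unchanged, their shares shrink proportionally rather than individually.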

Related

Joins in R while also spreading out information from one data frame

I am attempting to join together two data frames. One contains records of when certain events happened. The other contains daily information on values that occurred for a given organization.
My current challenge is how to join together the information in the "when certain events happened" data frame fully into the records data frame. Most of dplyr's joins appear to simply join one line together. I need to fully spread out the record information based on start and end dates.
In other words, I need to spread out information from one line into many lines, while simultaneously joining to the daily data table. It is important that I do this in R because the alternative is quite a bit of filtering and dragging in Excel (the information covers thousands of rows).
Below is a representation of the daily data table
value year month day org link
12 1 1 1 AA AA-1-1
45 1 1 2 AA AA-1-2
31 1 1 3 AA AA-1-3
10 1 1 4 AA AA-1-4
Below is a representation of the records table
year month day org link end_link event event_info
1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
1 2 7 BB BB-1-2-7 BB-1-2-10 Sell Yes
And finally, here is what I am aiming for in the end:
value month day org link event event_info
12 1 1 AA AA-1-1-1
45 1 2 AA AA-1-1-2 Buy Yes
31 1 3 AA AA-1-1-3 Buy Yes
10 1 4 AA AA-1-1-4
Is there any way to accomplish this in R? I have tried using dplyr joins but usually am only able to join together the initial link.
Edit: The second "end" link refers to an end date. In the records table this is all in one line, while the second data frame has daily information.
Edit: Below I have put together a cleaner look at my real data. The first image is of DAILY DATA while the second is of RECORDS OF EVENTS. The third is what I would like to see (ideally).
[Image: daily data, which will have multiple orgs present]
[Image: records data; note org id AA and the audience]
[Image: ideal combined data]
We first have to build dates from the links, expand each record into a sequence of dates, unnest those sequences to get a long version of df2, and then right-join that onto df1:
library(tidyverse)
df2 %>%
separate(link,c("org1","year1","month1","day1")) %>%
separate(end_link,c("org2","year2","month2","day2")) %>%
rowwise() %>%
transmute(org,event,event_info, date = list(
as.Date(paste0(year1,"-",month1,"-",day1)):as.Date(paste0(year2,"-",month2,"-",day2)))) %>%
unnest(date) %>%
right_join(df1 %>% mutate(date=as.numeric(as.Date(paste0(year,"-",month,"-",day)))), by = c("org", "date")) %>%
select(value, month, day, org, link, event,event_info)
# # A tibble: 4 x 7
# value month day org link event event_info
# <int> <int> <int> <chr> <chr> <chr> <chr>
# 1 12 1 1 AA AA-1-1 <NA> <NA>
# 2 45 1 2 AA AA-1-2 Buy Yes
# 3 31 1 3 AA AA-1-3 Buy Yes
# 4 10 1 4 AA AA-1-4 <NA> <NA>
data
df1 <- read.table(text="value year month day org link
12 1 1 1 AA AA-1-1
45 1 1 2 AA AA-1-2
31 1 1 3 AA AA-1-3
10 1 1 4 AA AA-1-4", header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text="year month day org link end_link event event_info
1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
1 2 7 BB BB-1-2-7 BB-1-2-10 Sell Yes", header = TRUE, stringsAsFactors = FALSE)
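The expand-then-join idea from this answer can also be sketched in base R without any packages. This is a minimal illustration, not the answerer's method: the column names `start` and `end` and the concrete year 2001 are assumptions made so that the dates are unambiguous, and the records are assumed to already carry proper Date columns.

```r
# Records with a start and end date per event (hypothetical columns/dates).
records <- data.frame(
  org   = c("AA", "BB"),
  start = as.Date(c("2001-01-02", "2001-02-07")),
  end   = as.Date(c("2001-01-03", "2001-02-10")),
  event = c("Buy", "Sell"),
  stringsAsFactors = FALSE
)
# One row per org and day.
daily <- data.frame(
  value = c(12, 45, 31, 10),
  org   = "AA",
  date  = as.Date("2001-01-01") + 0:3,
  stringsAsFactors = FALSE
)

# Number of days each record covers, both endpoints included.
n_days <- as.integer(records$end - records$start) + 1L
# Repeat each record once per covered day ...
idx <- rep(seq_len(nrow(records)), times = n_days)
# ... and turn the repeats into consecutive dates.
expanded <- data.frame(
  org   = records$org[idx],
  date  = records$start[idx] + (sequence(n_days) - 1L),
  event = records$event[idx],
  stringsAsFactors = FALSE
)

# Keep every daily row; days without an event get NA.
combined <- merge(daily, expanded, by = c("org", "date"), all.x = TRUE)
combined <- combined[order(combined$org, combined$date), ]
```

Joining on both `org` and `date` keeps events from one organization from leaking onto another organization's days.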
I would use the data.table package; for me it is the best R package for data analysis. I hope I have understood the problem properly; let me know if it does not work.
The first part creates the data set (I created the two data.table objects in two different ways just to show both alternatives; you could also read your data directly from Excel, .txt, .csv or similar. Let me know if you want to know how to do this).
library(data.table)
value<-c(12,45,31,10)
year<-c(1,1,1,1)
month<-c(1,1,1,1)
day<-c(1,2,3,4)
org<-c("AA","AA","AA","AA")
link<-c("AA-1-1","AA-1-2","AA-1-3","AA-1-4")
Daily_dt<-data.table(value, year,month,day,org,link)
Records_dt<-data.table(year=c(1,1),month=c(1,1),day=c(2,3),org=c("AA","BB"),link=c("AA-1-1-2","BB-1-2-7"),end_link=c("AA-1-1-3","BB-1-2-10"),
event=c("Buy","Buy"),event_info=c("Yes","Yes"))
Daily_dt[,Date:=as.Date(paste(year,"-",month,"-",day,sep=""))]
To achieve what you want you need these lines
Records_dt=rbind(Records_dt[,c("org","link","event","event_info")],
Records_dt[,list(org,link=end_link,event,event_info)])
Record_Dates<-as.data.table(tstrsplit(Records_dt$link,"-")[-1])
Record_Dates[,Dates:=as.Date(paste(V1,"-",V2,"-",V3,sep=""))]
Records_dt[,Date:=Record_Dates$Dates]
setkey(Records_dt,Date)
setkey(Daily_dt,Date)
Records_dt<-Records_dt[,c("Date","event","event_info")][Daily_dt,]
Records_dt<-Records_dt[,c("value","month","day","org","link","event","event_info")]
and this is the result
> Records_dt
value month day org link event event_info
1: 12 1 1 AA AA-1-1 NA NA
2: 45 1 2 AA AA-1-2 Buy Yes
3: 31 1 3 AA AA-1-3 Buy Yes
4: 10 1 4 AA AA-1-4 NA NA
If your input data had more than one event on the same day (with or without the same org), something like:
> Records_dt
year month day org link end_link event event_info
1: 1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
2: 1 1 3 BB BB-1-2-7 BB-1-2-10 Buy Yes
3: 1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
4: 1 1 3 AA AA-1-2-7 AA-1-2-10 Buy Yes
some tweaks may be required, but I am not sure whether you need this, so I did not add it.

Canonical way to reduce number of ID variables in wide-format data

I have data organized by two ID variables, Year and Country, like so:
Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year VarA.Canada VarA.USA VarB.Canada VarB.USA
2014 0 NA 10 NA
2015 6 1 5 3
2016 7 2 8 2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header=TRUE, text='Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars=c('Year', 'Country'))
molten[,variable:=paste(variable, Country, sep='.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using dcast you can cast multiple value.vars at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA","VarB"), sep = ".")
# Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014 0 NA 10 NA
#2: 2015 6 1 5 3
#3: 2016 7 2 8 2
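For comparison, the same wide table can be sketched with base R's `reshape()`, which also needs only one call. This is an alternative to the `dcast` approach, not part of it; the column order differs (countries appear in order of first occurrence), and the rows are sorted afterwards because `reshape()` keeps the IDs in order of first appearance.

```r
data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8")

# idvar stays as the row identifier; each value of timevar becomes
# a suffix on every remaining column (VarA.USA, VarB.USA, ...).
wide <- reshape(data, idvar = "Year", timevar = "Country", direction = "wide")
wide <- wide[order(wide$Year), ]
```

Combinations that do not occur in the input, such as USA in 2014, are filled with NA.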

New column from non-standard date factor in R

I have a dataframe with an oddly formatted dates column. I'd like to create a column just showing the year from the original date column and I am having trouble coming up with a way to do this because the current date column is being treated as a factor. Any advice on how to do this efficiently would be appreciated.
Example
starting with:
org <- c("a","b","c","d")
country <- c("1","2","3","4")
date <- c("01-09-14","01-10-07","11-31-99","10-31-12")
toy <- data.frame(cbind(org,country,date))
toy
org country date
1 a 1 01-09-14
2 b 2 01-10-07
3 c 3 11-31-99
4 d 4 10-31-12
str(toy$date)
Factor w/ 4 levels "01-09-14","01-10-07",..: 1 2 4 3
Desired result:
org country Year
1 a 1 2014
2 b 2 2007
3 c 3 1999
4 d 4 2012
This should work:
transform(toy,Year=format(strptime(date,"%m-%d-%y"),"%Y"))
This produces
## org country date Year
## 1 a 1 01-09-14 2014
## 2 b 2 01-10-07 2007
## 3 c 3 11-31-99 <NA>
## 4 d 4 10-31-12 2012
I initially thought that the NA value was because the %y format indicator wasn't smart enough to handle previous-century dates, but ?strptime says:
‘%y’ Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2004 and 2008 POSIX standards, but they do
also say ‘it is expected that in a future version the default
century inferred from a 2-digit year will change’.
implying that it should be able to handle it.
The problem is actually that 31 November doesn't exist ...
(You can drop the date column at your leisure ...)
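A small sketch of how the invalid date could be spotted, and how a year could still be recovered from it if that is wanted. The fallback is an assumption, not part of the answer above: it reads the last two digits and applies the same 00 to 68 / 69 to 99 century rule that ?strptime describes for %y.

```r
dates <- c("01-09-14", "01-10-07", "11-31-99", "10-31-12")
# (If the column is a factor, as in the question, use as.character(toy$date).)

# Invalid calendar dates such as 31 November parse to NA.
parsed  <- as.Date(dates, format = "%m-%d-%y")
invalid <- dates[is.na(parsed)]
years   <- as.integer(format(parsed, "%Y"))

# Hypothetical fallback: take the two-digit year straight from the text.
yy       <- as.integer(sub(".*-", "", dates))
fallback <- ifelse(yy <= 68, 2000L + yy, 1900L + yy)
year_final <- ifelse(is.na(years), fallback, years)
```

Whether a year taken from an impossible date should be trusted is a data-quality decision; listing `invalid` first makes that decision explicit.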

R aggregating on date then character

I have a table that looks like the following:
Year Country Variable 1 Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 2 5
1971 UK 2 3
1971 UK 1 5
1971 USA 2 2
1972 USA 1 1
1972 USA 2 5
I'd be grateful if someone could tell me how I can aggregate the data to group it first by year, then country with the sum of variable 1 and variable 2 coming afterwards so the output would be:
Year Country Sum Variable 1 Sum Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 5 13
1971 USA 2 2
1972 USA 3 6
This is the code I've tried to no avail (the real dataframe is 125,000 rows by 30+ columns hence the subset. Please be kind, I'm new to R!)
#making subset from data
GT2 <- subset(GT1, select = c("iyear", "country_txt", "V1", "V2"))
#making sure data types are correct
GT2[,2]=as.character(GT2[,2])
GT2[,3] <- as.numeric(as.character( GT2[,3] ))
GT2[,4] <- as.numeric(as.character( GT2[,4] ))
#removing NA values
GT2Omit <- na.omit(GT2)
#trying to aggregate - i.e. group by year, then country with the sum of Variable 1 and Variable 2 being shown
aggGT2 <-aggregate(GT2Omit, by=list(GT2Omit$iyear, GT2Omit$country_txt), FUN=sum, na.rm=TRUE)
Your aggregate is almost correct:
> aggGT2 <-aggregate(GT2Omit[3:4], by=GT2Omit[c("country_txt", "iyear")], FUN=sum, na.rm=TRUE)
> aggGT2
country_txt iyear V1 V2
1 UK 1970 1 3
2 USA 1970 1 3
3 UK 1971 5 13
4 USA 1971 2 2
5 USA 1972 3 6
dplyr is almost always the answer nowadays.
library(dplyr)
aggGT2 <- GT2Omit %>% group_by(iyear, country_txt) %>% summarize(sv1=sum(V1), sv2=sum(V2))
Having said that, it is good to learn basic R functions like aggregate and by.
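As a sketch of the base R route, `aggregate` also has a formula interface that sums several columns per group in one call. The data frame below is rebuilt from the question's example table; the name `GT` is a stand-in for the cleaned data.

```r
GT <- data.frame(
  iyear       = c(1970, 1970, 1971, 1971, 1971, 1971, 1972, 1972),
  country_txt = c("UK", "USA", "UK", "UK", "UK", "USA", "USA", "USA"),
  V1          = c(1, 1, 2, 2, 1, 2, 1, 2),
  V2          = c(3, 3, 5, 3, 5, 2, 1, 5),
  stringsAsFactors = FALSE
)

# Left of ~: the columns to sum; right of ~: the grouping columns.
agg <- aggregate(cbind(V1, V2) ~ iyear + country_txt, data = GT, FUN = sum)
# Order by year, then country, as requested in the question.
agg <- agg[order(agg$iyear, agg$country_txt), ]
```

With the formula interface, rows containing NA in any of the used columns are dropped by default (`na.action = na.omit`), so a separate `na.omit` step is not needed.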

How do I infill non-adjacent rows with sample data from previous rows in R?

I have data containing a unique identifier, a category, and a description.
Below is a toy dataset.
prjnumber <- c(1,2,3,4,5,6,7,8,9,10)
category <- c("based","trill","lit","cold",NA,"epic", NA,NA,NA,NA)
description <- c("skip class",
"dunk on brayden",
"record deal",
"fame and fortune",
NA,
"female attention",
NA,NA,NA,NA)
toy.df <- data.frame(prjnumber, category, description)
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 <NA> <NA>
6 6 epic female attention
7 7 <NA> <NA>
8 8 <NA> <NA>
9 9 <NA> <NA>
10 10 <NA> <NA>
I want to randomly sample the 'category' and 'description' columns from rows that have been filled in to use as infill for rows with missing data.
The final data frame would be complete and would only rely on the initial 5 rows which contain data. The solution would preserve between-column correlation.
An expected output would be:
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 lit record deal
6 6 epic female attention
7 7 based skip class
8 8 based skip class
9 9 lit record deal
10 10 trill dunk on brayden
complete = na.omit(toy.df)
toy.df[is.na(toy.df$category), c("category", "description")] =
complete[sample(1:nrow(complete), size = sum(is.na(toy.df$category)), replace = TRUE),
c("category", "description")]
toy.df
# prjnumber category description
# 1 1 based skip class
# 2 2 trill dunk on brayden
# 3 3 lit record deal
# 4 4 cold fame and fortune
# 5 5 lit record deal
# 6 6 epic female attention
# 7 7 cold fame and fortune
# 8 8 based skip class
# 9 9 epic female attention
# 10 10 epic female attention
Though it would seem a little more straightforward if you didn't start with the unique identifiers filled out for the NA rows...
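The same row-wise infill can be wrapped in a small function so it is reusable and reproducible. This is a sketch under assumptions: the helper name `infill_rows` is hypothetical, a row counts as missing if any of the chosen columns is NA, and `set.seed` is used only to make the random draw repeatable.

```r
toy.df <- data.frame(
  prjnumber   = 1:10,
  category    = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Fill incomplete rows by copying `cols` from randomly chosen complete
# rows; copying whole rows keeps category and description paired.
infill_rows <- function(df, cols) {
  missing <- !complete.cases(df[cols])
  donors  <- which(!missing)
  picks   <- donors[sample.int(length(donors), sum(missing), replace = TRUE)]
  df[missing, cols] <- df[picks, cols]
  df
}

set.seed(42)  # arbitrary seed, only for reproducibility
filled <- infill_rows(toy.df, c("category", "description"))
```

Sampling donor row indices with `sample.int` avoids the surprise that `sample(x, ...)` treats a single number `x` as `1:x` when only one donor row exists.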
You could try
library(dplyr)
toy.df %>%
mutate(across(2:3, ~ replace(.x, is.na(.x), sample(.x[!is.na(.x)], sum(is.na(.x)), replace = TRUE))))
Based on new information, we may need a numeric index so that both columns are filled from the same sampled row.
toy.df %>%
mutate(indx= replace(row_number(), is.na(category),
sample(row_number()[!is.na(category)], sum(is.na(category)), replace=TRUE))) %>%
mutate(across(2:3, ~ .x[indx])) %>%
select(-indx)
Using base R, to fill in a single field at a time (not preserving the correlation between the fields), use something like:
fields <- c('category','description')
for(field in fields){
missings <- is.na(toy.df[[field]])
toy.df[[field]][missings] <- sample(toy.df[[field]][!missings],sum(missings),T)
}
and to fill them in simultaneously (preserving the correlation between the fields) use something like:
missings <- apply(toy.df[,fields],
1,
function(x)any(is.na(x)))
toy.df[missings,fields] <- toy.df[!missings,fields][sample(sum(!missings),
sum(missings),
T),]
and of course, to avoid the implicit for loop in the apply(x,1,fun), you could use:
rowAny <- function(x) rowSums(x) > 0
missings <- rowAny(is.na(toy.df[,fields]))
