reshape data with non-unique id and varying time frames - r

I have a dataset with the following format:
name1 year name2 profits2010 profits2009 count
AA 2009 AA 10 15 20
AA 2010 AA 10 15 3
BB 2009 BB 4 NA 34
BB 2010 BB 4 NA 4
I need to reshape the data to this format. Any ideas on how this can be done?
name1 year name2 profits count
AA 2009 AA 15 20
AA 2010 AA 10 3
BB 2009 BB NA 34
BB 2010 BB 4 4

Try
indx <- grep('profits', names(df1))
indx2 <- cbind(1:nrow(df1), match(df1$year,
as.numeric(sub('\\D+', '', names(df1)[indx]))))
df1$profits <- df1[indx][indx2]
df1[-indx]
# name1 year name2 count profits
#1 AA 2009 AA 20 15
#2 AA 2010 AA 3 10
#3 BB 2009 BB 34 NA
#4 BB 2010 BB 4 4
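For readers who want to run the matrix-indexing idea end to end, here is a minimal base R sketch. It rebuilds the example data by hand (the column types are an assumption, since the question only shows printed output), and the names `profit_cols`, `col_years`, `pick` and `result` are illustrative only.

```r
# Rebuild the example data from the question (column types are assumed)
df1 <- data.frame(
  name1 = c("AA", "AA", "BB", "BB"),
  year = c(2009, 2010, 2009, 2010),
  name2 = c("AA", "AA", "BB", "BB"),
  profits2010 = c(10, 10, 4, 4),
  profits2009 = c(15, 15, NA, NA),
  count = c(20, 3, 34, 4)
)

# Positions of the profits columns and the year encoded in each column name
profit_cols <- grep("^profits", names(df1))
col_years <- as.numeric(sub("\\D+", "", names(df1)[profit_cols]))

# One (row, column) pair per row: row i points at the column for its own year
pick <- cbind(seq_len(nrow(df1)), match(df1$year, col_years))

# Indexing a matrix with a two-column matrix pulls exactly one value per row
profits <- as.matrix(df1[profit_cols])[pick]

# Drop the wide profits columns and attach the single profits column
result <- cbind(df1[-profit_cols], profits = profits)
print(result)
```

Because the year is read from the column names, the same sketch should carry over to additional `profitsYYYY` columns without changes, provided every value of `year` has a matching column.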

This isn't really reshaping, just defining a new variable. Try this:
df$profits <- ifelse(df$year==2009,df$profits2009,df$profits2010)

Related

how to capture the "new" client (data) by comparing yearly-firm client portfolio in R?

I have a database that contains firms' names and ids and their clients' names and ids, structured by year, so each firm has its own client portfolio in each year. I would like to find the "new" clients per firm each year from this database, but I have no idea how to do it in R.
I appreciate any suggestions!
The current data looks like this:
client.id client.name year firm.id firm.name
1 A 2013 1 AA
1 A 2014 1 AA
1 A 2015 1 AA
2 B 2015 1 AA
1 A 2016 1 AA
2 B 2016 1 AA
3 C 2016 1 AA
4 D 2013 2 BB
5 E 2013 2 BB
5 E 2014 2 BB
6 F 2014 2 BB
5 E 2015 2 BB
6 F 2015 2 BB
7 G 2015 2 BB
What I would like to do is find the new clients for each firm per year:
client.id client.name year firm.id firm.name
2 B 2015 1 AA
3 C 2016 1 AA
6 F 2014 2 BB
7 G 2015 2 BB
Another way to ask which clients are new for each firm per year is: what is the first year that each client appears for each firm?
We should also exclude the first year if that's when the data starts rather than truly indicating new clients:
library(dplyr)
clients |>
group_by(firm.id, client.name) |>
filter(
year == min(year),
year != min(clients$year)
)
# # A tibble: 4 x 5
# # Groups: firm.id, client.name [4]
# client.id client.name year firm.id firm.name
# <int> <chr> <int> <int> <chr>
# 1 2 B 2015 1 AA
# 2 3 C 2016 1 AA
# 3 6 F 2014 2 BB
# 4 7 G 2015 2 BB
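The same "first year a client appears" logic can be sketched in base R. This is a variant under one stated assumption rather than the answer's exact method: it excludes each firm's own first year instead of the first year of the whole dataset, which may matter if firms enter the data in different years. The data frame is rebuilt by hand from the question, and `first_seen`, `firm_start` and `new_clients` are illustrative names.

```r
# Rebuild the example data from the question
clients <- data.frame(
  client.id = c(1, 1, 1, 2, 1, 2, 3, 4, 5, 5, 6, 5, 6, 7),
  client.name = c("A", "A", "A", "B", "A", "B", "C",
                  "D", "E", "E", "F", "E", "F", "G"),
  year = c(2013, 2014, 2015, 2015, 2016, 2016, 2016,
           2013, 2013, 2014, 2014, 2015, 2015, 2015),
  firm.id = rep(c(1, 2), each = 7),
  firm.name = rep(c("AA", "BB"), each = 7)
)

# A single key per firm-client pair keeps ave() from seeing empty groups
pair_key <- paste(clients$firm.id, clients$client.id)

# First year each client appears within its firm
first_seen <- ave(clients$year, pair_key, FUN = min)

# First year each firm appears in the data at all (assumed cutoff)
firm_start <- ave(clients$year, clients$firm.id, FUN = min)

# A client is "new" in the year it first appears, unless that is
# simply the year the firm's records begin
new_clients <- clients[clients$year == first_seen & clients$year > firm_start, ]
print(new_clients)
```

With this example the two cutoffs agree, because both firms start in 2013.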

Make a string after grouping

Here is my problem. I've got data on city codes (GeoCode) and zip codes (PostCode). Often several zip codes correspond to a single city code. If that's the case, I want to make a column with a string of zip codes corresponding to the same city:
ID<-1:10
GeoCode<-c("AA","BB","BB","CC","CC","CC","DD","DD","DD","DD")
PostCode<-c("01","10","11","20","21","22","30","31","32","33")
data<-data.frame(ID,GeoCode,PostCode)
I want to produce a table like the one below. For example, "20_21_22" belongs to city code CC:
ID GeoCode PostCode strPostcode
1 1 AA 01 01
2 2 BB 10 10_11
3 3 BB 11 10_11
4 4 CC 20 20_21_22
5 5 CC 21 20_21_22
6 6 CC 22 20_21_22
7 7 DD 30 30_31_32_33
8 8 DD 31 30_31_32_33
9 9 DD 32 30_31_32_33
10 10 DD 33 30_31_32_33
We could group by 'GeoCode' and paste all the unique 'PostCode' in mutate
library(dplyr)
library(stringr)
data %>%
group_by(GeoCode) %>%
mutate(strPostcode = str_c(unique(PostCode), collapse="_"))
# A tibble: 10 x 4
# Groups: GeoCode [4]
# ID GeoCode PostCode strPostcode
# <int> <chr> <chr> <chr>
# 1 1 AA 01 01
# 2 2 BB 10 10_11
# 3 3 BB 11 10_11
# 4 4 CC 20 20_21_22
# 5 5 CC 21 20_21_22
# 6 6 CC 22 20_21_22
# 7 7 DD 30 30_31_32_33
# 8 8 DD 31 30_31_32_33
# 9 9 DD 32 30_31_32_33
#10 10 DD 33 30_31_32_33
Or an option with base R
data$strPostcode <- with(data, ave(PostCode, GeoCode, FUN =
function(x) paste(unique(x), collapse="_")))
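To make the grouping step explicit, the sketch below splits it in two: first build one collapsed string per city code, then map that string back onto every row. It is a base R illustration only, and the name `lookup` is hypothetical.

```r
# Rebuild the example data from the question
data <- data.frame(
  ID = 1:10,
  GeoCode = c("AA", "BB", "BB", "CC", "CC", "CC", "DD", "DD", "DD", "DD"),
  PostCode = c("01", "10", "11", "20", "21", "22", "30", "31", "32", "33"),
  stringsAsFactors = FALSE
)

# Step 1: one collapsed string per GeoCode (a named character vector)
lookup <- vapply(
  split(data$PostCode, data$GeoCode),
  function(x) paste(unique(x), collapse = "_"),
  character(1)
)

# Step 2: look up each row's GeoCode by name to fill the new column
data$strPostcode <- unname(lookup[data$GeoCode])
print(data)
```

The intermediate `lookup` vector is also handy on its own when only one row per city code is needed.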
The base R option with ave by @akrun is efficient. Here is another workaround
merge(data,
aggregate(PostCode ~ ., data[-1], paste0, collapse = "_"),
by = "GeoCode",
all = TRUE
)
which gives
GeoCode ID PostCode.x PostCode.y
1 AA 1 01 01
2 BB 2 10 10_11
3 BB 3 11 10_11
4 CC 4 20 20_21_22
5 CC 5 21 20_21_22
6 CC 6 22 20_21_22
7 DD 7 30 30_31_32_33
8 DD 8 31 30_31_32_33
9 DD 9 32 30_31_32_33
10 DD 10 33 30_31_32_33
Or you can try this one
data2 <- data %>%
group_by(GeoCode) %>%
mutate(strPostCode = paste0(unique(PostCode), collapse = "_"))
# ID GeoCode PostCode strPostCode
# <int> <chr> <chr> <chr>
# 1 1 AA 01 01
# 2 2 BB 10 10_11
# 3 3 BB 11 10_11
# 4 4 CC 20 20_21_22
# 5 5 CC 21 20_21_22
# 6 6 CC 22 20_21_22
# 7 7 DD 30 30_31_32_33
# 8 8 DD 31 30_31_32_33
# 9 9 DD 32 30_31_32_33
# 10 10 DD 33 30_31_32_33

How to make all elements of all columns align by creating empty spaces so that it stays in the same pattern

I have a temporal dataset, but it is incomplete, so I cannot reconstruct the series accurately. These are the data:
df<-data.frame(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1=c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2=c("U","UU","D","D","DDD","D","U","UU","UUU","U"),
sample3=c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4=c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5=c(NA,"UU","D","U","UU","UUU","U","D","U",NA))
I need it to end up like this:
df2<-data.frame(year=c(2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,
2015,2016,2017,2018),
sample1=c(NA,NA,"D","D","DDD","D","U","UU","UUU","U","D","DDD",NA,NA,NA),
sample2=c("U","UU","D","D","DDD","D","U","UU","UUU","U",NA,NA,NA,NA,NA),
sample3=c(NA,NA,NA,"D","DDD","D","U","UU","UUU","U","D","DDD","D",NA,NA),
sample4=c(NA,NA,"D","D",NA,NA,NA,NA,"UUU","U","D","DDD","D","U","U"),
sample5=c(NA,"UU","D",NA,NA,NA,"U","UU","UUU","U",NA,NA,"D","U",NA))
I need all the columns aligned to the same pattern. The best result so far came from DNA sequence alignment functions, but to find the best alignment these sometimes invert the order of elements, which must not happen in my case.
I have no idea how to do this.
dplyr's add_row function makes this pretty easy, once the initial dataframe exists.
library(dplyr)
df<-data.frame(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1 = c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2 = c("U","UD","D","D","DDD","D","U","UU","UUU","U"),
sample3 = c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4 = c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5 = c(NA,"UU","D","U","UU","UUU","U","D","U",NA))
df2 <- df %>%
add_row(year = 2016:2018)
Or, in a single pipeline that also adds the two leading years (tibble() replaces the deprecated data_frame()):
df <- tibble(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1=c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2=c("U","UD","D","D","DDD","D","U","UU","UUU","U"),
sample3=c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4=c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5=c(NA,"UU","D","U","UU","UUU","U","D","U",NA)) %>%
add_row(year = c(2004, 2005), .before = 1) %>%
add_row(year = c(2016:2018))
Result:
# A tibble: 15 x 6
year sample1 sample2 sample3 sample4 sample5
<dbl> <chr> <chr> <chr> <chr> <chr>
1 2004 NA NA NA NA NA
2 2005 NA NA NA NA NA
3 2006 D U D D NA
4 2007 D UD DDD D UU
5 2008 DDD D D UUU D
6 2009 D D U U U
7 2010 U DDD UU D UU
8 2011 UU D UUU DDD UUU
9 2012 UUU U U D U
10 2013 U UU D U D
11 2014 D UUU DDD U U
12 2015 DDD U D NA NA
13 2016 NA NA NA NA NA
14 2017 NA NA NA NA NA
15 2018 NA NA NA NA NA
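The same padding can be sketched in base R by merging against the full range of years. Like the add_row approach, this only adds empty rows for the missing years; it does not shift individual samples into the aligned positions shown in the question's df2. To keep it short, the data frame is abbreviated to two of the question's sample columns, and `all_years` and `padded` are illustrative names.

```r
# Abbreviated version of the question's data (two sample columns only)
df <- data.frame(
  year = c(2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015),
  sample1 = c("D", "D", "DDD", "D", "U", "UU", "UUU", "U", "D", "DDD"),
  sample5 = c(NA, "UU", "D", "U", "UU", "UUU", "U", "D", "U", NA),
  stringsAsFactors = FALSE
)

# Every year that should appear in the padded result
all_years <- data.frame(year = as.numeric(2004:2018))

# Keep all years; years missing from df get NA in every sample column
padded <- merge(all_years, df, by = "year", all.x = TRUE)
print(padded)
```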

Compare values of two dataframes and substitute them

I've two data frames with the same number of rows and columns, 113x159 with this structure:
df1:
1 2 3 4
a AT AA AG CT
b NA AG AT CC
c AG GG GT AA
d NA NA TT TC
df2:
1 2 3 4
a NA 23 12 NA
b NA 23 44 12
c 11 14 27 55
d NA NA 12 34
I want to compare df1 and df2 value by value: if the value in df2 is NA and the value in df1 isn't, the df1 value should be replaced with NA (a value that is already NA in df1 stays NA, whatever df2 contains).
At the end, my df has to be this:
1 2 3 4
a NA AA AG NA
b NA AG AT CC
c AG GG GT AA
d NA NA TT TC
I wrote this loop, but it doesn't work:
merge.na<-function(x){
for (i in df2) AND (k in df1){
if (i==NA) AND (k!=NA)
k==NA}
Any idea?
We can use replace
replace(df1, is.na(df2), NA)
# X1 X2 X3 X4
#a <NA> AA AG <NA>
#b <NA> AG AT CC
#c AG GG GT AA
#d <NA> <NA> TT TC
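A self-contained sketch of the same masking idea follows. It assumes both data frames have identical dimensions and row and column order, and it uses the column names X1 to X4 seen in the printed output above; `masked` is an illustrative name.

```r
# Rebuild the two example data frames from the question
df1 <- data.frame(
  X1 = c("AT", NA, "AG", NA),
  X2 = c("AA", "AG", "GG", NA),
  X3 = c("AG", "AT", "GT", "TT"),
  X4 = c("CT", "CC", "AA", "TC"),
  row.names = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  X1 = c(NA, NA, 11, NA),
  X2 = c(23, 23, 14, NA),
  X3 = c(12, 44, 27, 12),
  X4 = c(NA, 12, 55, 34),
  row.names = c("a", "b", "c", "d")
)

# The cell-by-cell comparison only makes sense if the shapes match
stopifnot(identical(dim(df1), dim(df2)))

# is.na(df2) is a logical matrix; assigning through it blanks those cells
masked <- df1
masked[is.na(df2)] <- NA
print(masked)
```

No explicit loop is needed because the logical matrix selects all affected cells at once.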

Left join (or equivalent) to number index by group

I have a sequence of numbers (days):
dayNum <- c(1:10)
And I have a dataframe of id, day, and event:
id = c("aa", "aa", "aa", "bb", "bb", "cc")
day = c(1, 2, 3, 1, 6, 2)
event = c("Y", "Y", "Y", "Y", "Y", "Y")
df = data.frame(id, day, event)
Which looks like this:
id day event
aa 1 Y
aa 2 Y
aa 3 Y
bb 1 Y
bb 6 Y
cc 2 Y
I am trying to put this dataframe into a form that resembles left joining dayNum with df for each id. That is, even if id "aa" had no event on day 5, I should still get a row for "aa" on day 5 with N/A or something under event. Like this:
id day event
aa 1 Y
aa 2 Y
aa 3 Y
aa 4 N/A
aa 5 N/A
aa 6 N/A
aa 7 N/A
aa 8 N/A
aa 9 N/A
aa 10 N/A
bb 1 Y
bb 2 N/A
bb 3 N/A
bb 4 N/A
bb 5 N/A
bb 6 Y
bb 7 N/A
...etc
I can make this work using dplyr and left_join when my dataframe only contains one unique id, but I am stuck trying to make this work with a dataframe that has many different ids.
A push in the right direction would be greatly appreciated.
Thank you!
We can use expand.grid and merge. We create a new dataset using the unique 'id' of 'df' and the 'dayNum'. Then merge with the 'df' to get the expected output.
merge(expand.grid(id=unique(df$id), day=dayNum), df, all.x=TRUE)
# id day event
#1 aa 1 Y
#2 aa 2 Y
#3 aa 3 Y
#4 aa 4 <NA>
#5 aa 5 <NA>
#6 aa 6 <NA>
#7 aa 7 <NA>
#8 aa 8 <NA>
#9 aa 9 <NA>
#10 aa 10 <NA>
#11 bb 1 Y
#12 bb 2 <NA>
#13 bb 3 <NA>
#14 bb 4 <NA>
#15 bb 5 <NA>
#16 bb 6 Y
#17 bb 7 <NA>
#18 bb 8 <NA>
#19 bb 9 <NA>
#20 bb 10 <NA>
#21 cc 1 <NA>
#22 cc 2 Y
#23 cc 3 <NA>
#24 cc 4 <NA>
#25 cc 5 <NA>
#26 cc 6 <NA>
#27 cc 7 <NA>
#28 cc 8 <NA>
#29 cc 9 <NA>
#30 cc 10 <NA>
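The expand.grid and merge steps above can be unpacked as follows. This is a sketch that rebuilds the question's data by hand and names the intermediate objects `grid` and `full` for illustration; the explicit `by` argument is added here only to make the join keys visible.

```r
# Rebuild the example data from the question
dayNum <- 1:10
df <- data.frame(
  id = c("aa", "aa", "aa", "bb", "bb", "cc"),
  day = c(1, 2, 3, 1, 6, 2),
  event = "Y",
  stringsAsFactors = FALSE
)

# Every combination of id and day: 3 ids x 10 days = 30 rows
grid <- expand.grid(id = unique(df$id), day = dayNum, stringsAsFactors = FALSE)

# Left join: keep every grid row, fill event with NA where df has no match
full <- merge(grid, df, by = c("id", "day"), all.x = TRUE)

# Make the ordering explicit rather than relying on merge's default sort
full <- full[order(full$id, full$day), ]
print(head(full, 12))
```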
A similar option using data.table would be to convert the 'data.frame' to a 'data.table' with setDT(df), set the 'key' columns, and join with the dataset derived from the cross join (CJ) of the unique 'id' values and 'dayNum'.
library(data.table)
setDT(df, key=c('id', 'day'))[CJ(id=unique(id), day=dayNum)]
