Combining components of a list in R - r

I have a list that contains data by year. I want to combine these components into a single data frame with one row per State, so that the yearly values are matched by State. Example list:
List [[1]]
State Year X Y
23 1971 etc etc
47 1971 etc etc
List[[2]]
State Year X Y
13 1972 etc etc
23 1973 etc etc
47 1973 etc etc
etc....
List[[45]]
State Year X Y
1 2017 etc etc
2 2017 etc etc
3 2017 etc etc
1 2017 etc etc
23 2017 etc etc
47 2017 etc etc
I want the data frame to look like this (I know I will have to go through and remove some extra columns):
State 1971_X 1971_Y 1972_X 1972_Y....2018_X 2019_Y
1 NA NA NA NA etc etc
2 NA NA etc etc etc etc
3 etc etc etc etc etc etc
...
50 NA NA etc etc etc etc
I have tried the command Outcomewanted=do.call("cbind", examplelist) but get the message
"Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 36, 40, 20, 42, 38, 26, 17, 31, 35, 23, 33, 13, 29, 28, 32, 34, 41, 37, 43, 39, 30, 14, 10, 4, 7"
It seems that the cbind.fill command could be an option but has been retired? Thanks for any help in advance.

You can stack the list elements with do.call(rbind, lst) and then use reshape to go from long to wide format.
res <- reshape(do.call(rbind, lst), idvar="state", timevar="year", direction="wide")
res
# state x.1971 y.1971 x.1972 y.1972 x.1973 y.1973
# 1 23 1.3709584 0.3631284 NA NA -0.1061245 2.0184237
# 2 24 -0.5646982 0.6328626 NA NA 1.5115220 -0.0627141
# 3 13 NA NA 0.4042683 -0.09465904 NA NA
Data
lst <- list(structure(list(state = c(23, 24), year = c(1971, 1971),
x = c(1.37095844714667, -0.564698171396089), y = c(0.363128411337339,
0.63286260496104)), class = "data.frame", row.names = c(NA,
-2L)), structure(list(state = c(13, 23, 24), year = c(1972, 1973,
1973), x = c(0.404268323140999, -0.106124516091484, 1.51152199743894
), y = c(-0.0946590384130976, 2.01842371387704, -0.062714099052421
)), class = "data.frame", row.names = c(NA, -3L)))
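The layout asked for in the question has one row for every state and column names such as 1971_X. A minimal sketch of those two extra steps in base R, reusing the structure of lst from the answer with rounded values. The 1:50 state coding and the year_variable naming are assumptions read off the question, and all_states is only an illustrative name:

```r
# Small stand-in for the list of yearly data frames (values rounded).
lst <- list(
  data.frame(state = c(23, 24), year = c(1971, 1971),
             x = c(1.4, -0.6), y = c(0.4, 0.6)),
  data.frame(state = c(13, 23, 24), year = c(1972, 1973, 1973),
             x = c(0.4, -0.1, 1.5), y = c(-0.1, 2.0, -0.1))
)

# Stack the pieces, then spread year into columns (as in the answer).
long <- do.call(rbind, lst)
wide <- reshape(long, idvar = "state", timevar = "year", direction = "wide")

# Assumption: states are coded 1..50. The left join adds all-NA rows
# for states that never appear in the list.
all_states <- data.frame(state = 1:50)
wide <- merge(all_states, wide, by = "state", all.x = TRUE)

# Turn "x.1971" into "1971_x" to get closer to the layout in the question.
is_value <- names(wide) != "state"
names(wide)[is_value] <- sub("^(.*)\\.([0-9]+)$", "\\2_\\1", names(wide)[is_value])

head(wide, 3)
```

Names that start with a digit are not syntactic in R, so such columns have to be accessed with backticks or with wide[["1971_x"]].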

Related

Replacing NA in longitudinal data with average difference of non-missing values

Here is a simplified version of the data I am working with:
dat <- data.frame(country = c("country1", "country2", "country3", "country1", "country2"), measurement = c("m1", "m1", "m1", "m2", "m2"),
y2015 = c(NA, 15, 19, 13, 55), y2016 = c(NA, 17, NA, 10, NA), y2017 = c(14, NA, NA, 9, 45), y2018 = c(18, 22, 16, NA, 40))
I am trying to take the difference between the two non-missing values on either side of the NAs, and replace the missing values by spreading that difference evenly over the missing years.
For row 5, this would be something like c(55, 50, 45, 40).
However, it also needs to work for rows that have more than one missing value in a sequence, like row 1 and row 3. For row 1, I'd like the difference between 14 and 18 to be extrapolated backwards, so it should look something like c(6, 10, 14, 18). Meanwhile, for row 3, the difference between 19 and 16 should be divided across the two missing years, to look something like c(19, 18, 17, 16).
Essentially, I'm looking to create a slope for each country and measurement through the available years, and to interpolate the missing values based on that.
I am trying to think of a package for this, or perhaps to write a loop. I have looked at the package 'spline', but it does not seem to work, since I want to run a separate linear interpolation for each country and measurement.
Any thoughts would be greatly appreciated!
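Use zoo::na.spline, which fits a spline through the non-missing values of each row and also fills leading and trailing NAs (dat is the data frame from the question):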
library(zoo)
dat[-c(1:2)] <- t(na.spline(t(dat[-c(1:2)])))
country measurement y2015 y2016 y2017 y2018
1 country1 m1 6 10 14.00000 18
2 country2 m1 15 17 19.33333 22
3 country3 m1 19 18 17.00000 16
4 country1 m2 13 10 9.00000 10
5 country2 m2 55 50 45.00000 40
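Because na.spline fits a spline, its result is not strictly linear (row 2 gets 19.33 rather than 19.5). If a purely linear fill is wanted, a base-R sketch could look like the following. fill_linear is a hypothetical helper, not part of zoo; it assumes equally spaced years and extends the nearest known slope into leading or trailing gaps:

```r
dat <- data.frame(
  country = c("country1", "country2", "country3", "country1", "country2"),
  measurement = c("m1", "m1", "m1", "m2", "m2"),
  y2015 = c(NA, 15, 19, 13, 55), y2016 = c(NA, 17, NA, 10, NA),
  y2017 = c(14, NA, NA, 9, 45), y2018 = c(18, 22, 16, NA, 40)
)

# Hypothetical helper: linear interpolation inside the observed range,
# linear extrapolation (slope of the two nearest known points) outside it.
fill_linear <- function(v) {
  idx <- seq_along(v)
  known <- which(!is.na(v))
  if (length(known) < 2) return(v)          # no slope can be defined
  out <- approx(known, v[known], xout = idx)$y
  first <- known[1]
  second <- known[2]
  last <- known[length(known)]
  prev <- known[length(known) - 1]
  lead <- idx < first                       # gap before the first known value
  out[lead] <- v[first] +
    (v[second] - v[first]) / (second - first) * (idx[lead] - first)
  trail <- idx > last                       # gap after the last known value
  out[trail] <- v[last] +
    (v[last] - v[prev]) / (last - prev) * (idx[trail] - last)
  out
}

# Apply row by row to the year columns only.
year_cols <- grep("^y[0-9]{4}$", names(dat))
dat[year_cols] <- t(apply(dat[year_cols], 1, fill_linear))
dat
```

Rows with fewer than two observed values are returned unchanged, since no slope can be estimated for them.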

What is the best way to expand the xts object and fill with NA?

Say I have the following xts object. What is the best way to extend it by 20 more rows and fill all entries of the new rows with NA?
structure(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, -0.626453810742332,
0.183643324222082, -0.835628612410047, 1.59528080213779, 0.329507771815361,
-0.820468384118015, 0.487429052428485, 0.738324705129217, 0.575781351653492,
-0.305388387156356, 1.51178116845085, 0.389843236411431, -0.621240580541804,
-2.2146998871775, 1.12493091814311, -0.0449336090152309, -0.0161902630989461,
0.943836210685299, 0.821221195098089, 0.593901321217509, 0.918977371608218,
0.782136300731067, 0.0745649833651906, -1.98935169586337, 0.61982574789471,
-0.0561287395290008, -0.155795506705329, -1.47075238389927, -0.47815005510862,
2.83588312039941, 4.71735910305809, 1.79442454531401, 2.77534322311874,
1.89238991883419, -0.754119113657213, 1.17001087340064, 1.2114200925793,
1.88137320657763, 4.20005074396777, 3.52635149691509, 1.67095280749283,
1.49327663972698, 3.39392675080947, 3.11332639734731, 0.62248861090096,
0.585009686075761, 2.72916392427366, 3.53706584903083, 1.77530757569954,
3.76221545290843, 2.79621176073414, 0.775947213498458, 2.68223938284885,
-0.258726192161585, 4.86604740340207, 5.96079979701172, 1.26555704706698,
-0.0882692526330615, 4.70915888232724, 2.59483618835753, 10.2048532815143,
2.88227999180049, 5.06921808735233, 3.084006476342, 0.770180373352784,
3.56637689854303, -2.41487588667311, 7.39666458468866, 3.45976001463569,
9.51783501108646, 4.42652858669899, 0.870160707234557, 4.83217906046716,
0.197707105067245, -0.760900200717306, 3.87433870655239, 1.6701243803447,
3.00331605489487, 3.22302397245499, 1.23143716143578, 1.29399380154449,
2.5944641546285, 6.53426098971961, -1.57070040128929, 4.78183856288526,
3.99885111364055, 6.18929951182909), .Dim = c(29L, 4L), .Dimnames = list(
NULL, c("x", "y1", "y2", "y3")), index = structure(c(1167667200,
1167753600, 1167840000, 1167926400, 1168012800, 1168099200, 1168185600,
1168272000, 1168358400, 1168444800, 1168531200, 1168617600, 1168704000,
1168790400, 1168876800, 1168963200, 1169049600, 1169136000, 1169222400,
1169308800, 1169395200, 1169481600, 1169568000, 1169654400, 1169740800,
1169827200, 1169913600, 1.17e+09, 1170086400), tzone = "", tclass = c("POSIXct",
"POSIXt")), class = c("xts", "zoo"))
The best way is always debatable, but the following works without any other packages. I use seq to create the new timestamps: start one day (60*60*24 seconds) after the last timestamp of the xts object and end 20 days after it. Here my_xts is the object from the question.
Then it is just a matter of merging; the NAs are created automatically.
library(xts)
# create additional sequence of dates.
new <- seq(from = end(my_xts) + 60*60*24,
to = end(my_xts) + 20*60*60*24,
by = "day")
my_xts_new <- merge(my_xts, new)
tail(my_xts_new)
x y1 y2 y3
2007-02-13 17:00:00 NA NA NA NA
2007-02-14 17:00:00 NA NA NA NA
2007-02-15 17:00:00 NA NA NA NA
2007-02-16 17:00:00 NA NA NA NA
2007-02-17 17:00:00 NA NA NA NA
2007-02-18 17:00:00 NA NA NA NA
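The index arithmetic can be checked on its own in base R. The sketch below uses length.out instead of an explicit end point; last_stamp is a stand-in for end(my_xts), taken from the last index value in the question, and UTC is assumed so that daylight saving changes do not interfere:

```r
# Stand-in for end(my_xts): the last timestamp of the example object.
last_stamp <- as.POSIXct(1170086400, origin = "1970-01-01", tz = "UTC")

# 20 further daily timestamps, starting one day after the last one.
new_index <- seq(from = last_stamp + 60 * 60 * 24, by = "day", length.out = 20)

length(new_index)   # number of rows that will be appended
range(new_index)    # first and last new timestamp
```

Such a vector can take the place of new in the merge call above.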

Need to create a variable based on the equality of other variables

I have a dataset called CSES (Comparative Study of Electoral Systems) where each row corresponds to an individual (one interview in a public opinion survey), from many countries, in many different years.
I need to create a variable that identifies the ideology of the party each person voted for, as perceived by that same person.
However, the dataset identifies the perceived ideology of each party (like many other variables) by letters A, B, C, etc. When it comes to identifying WHICH PARTY each person voted for, it uses a UNIQUE CODE NUMBER that does not correspond to these letters across different years (i.e., the same party can have a different letter in different years, and, of course, it is never the same party across different countries, since each country has its own political parties).
Fictitious data to help clarify, reproduce and create a code:
Let’s say:
country = c(1,1,1,1,2,2,2,2,3,3,3,3)
year = c (2000,2000,2004,2004, 2002,2002,2004,2008,2000,2000,2000,2000)
party_A_number = c(11,11,12,12,21,21,22,23,31,31,31,31)
party_B_number = c(12, 12, 11, 11, 22,22,21,22,32,32,32,32)
party_C_number = c(13,13,13,13,23,23,23,21,33,33,33,33)
party_voted = c(12,13,12,11,21,24,23,22,31,32,33,31)
ideology_party_A <- floor(runif (12, min=1, max=10))
ideology_party_B <- floor(runif (12, min=1, max=10))
ideology_party_C <- floor(runif (12, min=1, max=10))
Let’s call the variable I want to create “ideology_voted”:
I need something like:
IF party_A_number == party_voted THEN ideology_voted = ideology_party_A
IF party_B_number == party_voted THEN ideology_voted = ideology_party_B
IF party_C_number == party_voted THEN ideology_voted = ideology_party_C
The real dataset has 9 letters for (up to) 9 main parties in each country, and dozens of countries and election-years. Therefore, it would be great to have code that iterates through the letters A-I instead of “if voted party A, then …; if voted party B, then …”.
Nevertheless, I am having trouble even when I try longer, repetitive code (one transformation for each party letter, which would give me 8 lines of code).
library(tidyverse)
df <- tibble(
country = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
year = c(2000, 2000, 2004, 2004, 2002, 2002, 2004, 2008, 2000, 2000, 2000, 2000),
party_A_number = c(11, 11, 12, 12, 21, 21, 22, 23, 31, 31, 31, 31),
party_B_number = c(12, 12, 11, 11, 22, 22, 21, 22, 32, 32, 32, 32),
party_C_number = c(13, 13, 13, 13, 23, 23, 23, 21, 33, 33, 33, 33),
party_voted = c(12, 13, 12, 11, 21, 24, 23, 22, 31, 32, 33, 31),
ideology_party_A = floor(runif (12, min = 1, max = 10)),
ideology_party_B = floor(runif (12, min = 1, max = 10)),
ideology_party_C = floor(runif (12, min = 1, max = 10))
)
> df
# A tibble: 12 x 9
country year party_A_number party_B_number party_C_number party_voted ideology_party_A ideology_party_B
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2000 11 12 13 12 9 3
2 1 2000 11 12 13 13 2 6
3 1 2004 12 11 13 12 3 8
4 1 2004 12 11 13 11 7 8
5 2 2002 21 22 23 21 2 7
6 2 2002 21 22 23 24 8 2
7 2 2004 22 21 23 23 1 7
8 2 2008 23 22 21 22 7 7
9 3 2000 31 32 33 31 4 3
10 3 2000 31 32 33 32 7 5
11 3 2000 31 32 33 33 1 6
12 3 2000 31 32 33 31 2 1
# ... with 1 more variable: ideology_party_C <dbl>
It seems you're after conditioning using case_when:
ideology_voted <- df %>% transmute(
ideology_voted = case_when(
party_A_number == party_voted ~ ideology_party_A,
party_B_number == party_voted ~ ideology_party_B,
party_C_number == party_voted ~ ideology_party_C,
TRUE ~ party_voted
)
)
> ideology_voted
# A tibble: 12 x 1
ideology_voted
<dbl>
1 3
2 7
3 3
4 8
5 2
6 24
7 8
8 7
9 4
10 5
11 6
12 2
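Note that case_when checks the conditions in order and uses the first one that is true (if more than one happens to be true). Also note that the TRUE ~ party_voted fallback returns the party code itself when no letter matches, as in row 6 (24); use TRUE ~ NA_real_ there if you would rather get a missing value.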
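Since the question asks for a way to iterate over the letters A-I, here is a minimal base-R sketch of that idea. It assumes the column naming pattern party_<letter>_number / ideology_party_<letter> from the example; the small data frame and its ideology values are made up for illustration, and rows without a matching party are left as NA:

```r
# Made-up miniature of the example data (three voters, three party letters).
df <- data.frame(
  party_A_number   = c(11, 12, 21),
  party_B_number   = c(12, 11, 22),
  party_C_number   = c(13, 13, 23),
  party_voted      = c(12, 12, 24),
  ideology_party_A = c(9, 3, 8),
  ideology_party_B = c(3, 8, 2),
  ideology_party_C = c(5, 1, 4)
)

party_letters <- c("A", "B", "C")   # extend to LETTERS[1:9] for the real data

df$ideology_voted <- NA_real_
# Loop in reverse so that, if several letters match, the first letter wins
# (the same precedence as the case_when answer).
for (l in rev(party_letters)) {
  hit <- df[[paste0("party_", l, "_number")]] == df$party_voted
  hit <- !is.na(hit) & hit          # treat missing codes as "no match"
  df$ideology_voted[hit] <- df[[paste0("ideology_party_", l)]][hit]
}

df$ideology_voted
```

The third voter chose party 24, which has no letter in that row, so ideology_voted stays NA instead of falling back to the party code.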

R - Weeks of supply

I am trying to calculate the number of weeks the inventory on hand will last given the sales projections, for a dataset with tens of millions of rows. I have listed the expected output in the last column of the data structure given below. I also attached the implementation of this in Excel.
Logic
Weeksofsupply = Number of weeks the current inventory on hand will last.
Example: in the attached image (SKU_CD 222, STORE_CD 33), the inventory on hand is 19 and the cumulative sales are
WK1 + WK2 = 15 and WK1 + WK2 + WK3 = 24, which is greater than 19,
so we pick 2, which is the number of weeks the current inventory will last.
Expected output in the last column
Data = structure(list(
SKU_CD = c(111, 111, 111, 111, 111, 111, 111,111, 111, 111, 111, 111, 222, 222, 222, 222, 222, 222, 222, 222, 222, 222, 222, 222),
STORE_CD = c(22, 22, 22, 22, 22, 22, 22,22, 22, 22, 22, 22, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33),
FWK_CD = c(201627, 201628, 201629, 201630, 201631, 201632,201633, 201634, 201635, 201636, 201637, 201638, 201627, 201628, 201629, 201630, 201631, 201632, 201633, 201634, 201635, 201636, 201637, 201638),
SALES = c(5, 2, 2, 2, 1, 3, 2, 2, 3, 2, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 7, 5),
INVENTORY = c(29, 27, 25, 23, 22, 19, 17, 15, 12, 10, 25, 1, 19, 17, 15, 13, 12,9, 7, 5, 2, 0, 25, 18),
WeeksofSupply = c("11", "10", "9", "8", "8", "6", "5", "4", "3", "2", "Inventory More", "Inventory Less", "2", "2", "1", "1", "1", "Inventory Less", "Inventory Less", "Inventory Less", "Inventory Less", "Inventory Less", "Inventory More", "Inventory More")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -24L),
.Names = c("SKU_CD", "STORE_CD", "FWK_CD", "SALES", "INVENTORY", "WeeksofSupply"))
Current Excel Code: (Here the weeks are shown in columns, but it should be rows like shown in the expected output.)
=IF(A2<SUM(B2:K2),SUMPRODUCT(--(SUBTOTAL(9,OFFSET(B2:K2,,,,COLUMN(B2:K2)-
COLUMN(B2)+1))<=A2))+LOOKUP(0,SUBTOTAL(9,OFFSET(B2:K2,,,,COLUMN(B2:K2)-
COLUMN(B2)+1))-B2:K2-A2,(A2-(SUBTOTAL(9,OFFSET(B2:K2,,,,COLUMN(B2:K2)-
COLUMN(B2)+1))-B2:K2))/B2:K2),IF(A2=SUM(B2:K2),COUNT(B2:K2),"Inventory exceeds forecast"))
I would appreciate any input to implement this efficiently in R. Many Thanks for your time!
For your revised data in long format, you can do the following...
library(dplyr) #for the grouping functionality
#define a function to calculate weeks Supply from Sales and Inventory
weekSup <- function(sales,inv){
sales <- unlist(sales)
inv <- unlist(inv)
n <- length(sales)
weeksup <- rep(NA,n)
for(i in 1:n){
if(i==n || inv[i]<sales[i]){
weeksup[i] <- ifelse(inv[i]>sales[i],NA,inv[i]/sales[i])
} else {
weeksup[i] <- approxfun(cumsum(sales[i:n]),1:(n-i+1))(inv[i])
}
}
#Your 'inventory more' is coded as -1 (a number) to avoid whole column being forced to a character vector
weeksup <- replace(weeksup,is.na(weeksup),-1)
return(weeksup) #for whole weeks, change this to `return(floor(weeksup))`
}
Data2 <- Data %>% group_by(SKU_CD,STORE_CD) %>% mutate(weekSup=weekSup(SALES,INVENTORY))
head(Data2,20)
SKU_CD STORE_CD FWK_CD SALES INVENTORY WeeksofSupply weekSup
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 111 22 201627 5 29 11 11.3333333
2 111 22 201628 2 27 10 10.8333333
3 111 22 201629 2 25 9 9.8333333
4 111 22 201630 2 23 8 8.8333333
5 111 22 201631 1 22 8 8.0000000
6 111 22 201632 3 19 6 6.6666667
7 111 22 201633 2 17 5 5.8333333
8 111 22 201634 2 15 4 4.8333333
9 111 22 201635 3 12 3 3.6666667
10 111 22 201636 2 10 2 2.8333333
11 111 22 201637 3 25 Inventory More -1.0000000
12 111 22 201638 6 1 Inventory Less 0.1666667
13 222 33 201627 7 19 2 2.4444444
14 222 33 201628 8 17 2 2.0000000
15 222 33 201629 9 15 1 1.6000000
16 222 33 201630 10 13 1 1.2727273
17 222 33 201631 11 12 1 1.0833333
18 222 33 201632 12 9 Inventory Less 0.7500000
19 222 33 201633 13 7 Inventory Less 0.5384615
20 222 33 201634 14 5 Inventory Less 0.3571429
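If only whole weeks are needed, as in the expected output column, the per-row loop can be avoided with cumulative sums and findInterval, which may matter at tens of millions of rows. A sketch under the assumptions that sales are non-negative and rows are ordered by week within each SKU/store; whole_weeks is a hypothetical helper, and the coding 0 = "Inventory Less", -1 = "Inventory More" follows the numeric convention of the answer above:

```r
# Whole weeks of supply for one SKU/store series, without a per-row loop.
whole_weeks <- function(sales, inv) {
  n <- length(sales)
  cs <- cumsum(sales)        # cumulative sales up to and including each week
  before <- c(0, cs[-n])     # cumulative sales before each week
  # Last week whose cumulative sales still fit into (inventory + sales so far),
  # expressed as a count of weeks starting from the current row.
  k <- findInterval(inv + before, cs) - (seq_len(n) - 1)
  k[inv + before > cs[n]] <- -1   # inventory outlasts the remaining forecast
  k
}

# The two series from the question.
Data <- data.frame(
  SKU_CD = rep(c(111, 222), each = 12),
  STORE_CD = rep(c(22, 33), each = 12),
  SALES = c(5, 2, 2, 2, 1, 3, 2, 2, 3, 2, 3, 6,
            7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 7, 5),
  INVENTORY = c(29, 27, 25, 23, 22, 19, 17, 15, 12, 10, 25, 1,
                19, 17, 15, 13, 12, 9, 7, 5, 2, 0, 25, 18)
)

# Apply per SKU/store group: ave() passes the row indices of each group.
grp <- paste(Data$SKU_CD, Data$STORE_CD)
Data$wholeWeeks <- ave(seq_len(nrow(Data)), grp,
                       FUN = function(i) whole_weeks(Data$SALES[i], Data$INVENTORY[i]))
Data
```

The same helper should also work inside the group_by/mutate pipeline shown above in place of weekSup.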
Here is one way to do it, using the linear interpolation method approxfun...
data$WeeksSupply <- sapply(1:nrow(data),function(i)
approxfun(cumsum(as.vector(c(data[i,2:11]))),1:10)(data$Inventory[i]))
data$WeeksSupply <- replace(data$WeeksSupply,is.na(data$WeeksSupply),
"Inventory Exceeds Forecast")
data
# A tibble: 2 x 12
Inventory Wk1 Wk2 Wk3 Wk4 Wk5 Wk6 Wk7 Wk8 Wk9 Wk10 WeeksSupply
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 200 20 15 25 40 35 45 30 50 45 55 6.66666666666667
2 2000 20 15 25 40 35 45 30 50 45 55 Inventory Exceeds Forecast

moving values from one dataframe to another, depending on value of a variable

Not being familiar with R, I've got the following problem: I want to add the values probeposition from the dataframe mlpa to the dataframe patients, with the values of probeposition being linked by values being present both in mlpa and patients (i.e. probe and patprobe). As far as I've seen, this problem is not covered by the usual data management tutorials.
#mlpa:
probe <- c(12,15,18,19)
probeposition <- c(100,1200,500,900)
mlpa = data.frame(probe = probe, probeposition = probeposition)
#patients:
patid <- c('AT', 'GA', 'TT', 'AG', 'GG', 'TA')
patprobe <- c(12, 12, NA, NA, 18, 19)
patients = data.frame(patid = patid, patprobe = patprobe)
#And that's what I finally want:
patprobeposition = c(100, 100, NA, NA, 500, 900)
patients$patprobeposition = patprobeposition
Update
After Andrie's response, I realised I have to mention that there are several "probes" in the patients dataset, so the data would actually look more like this (in fact, there would be not only probe1 and probe2, but probe1 to probe4):
mlpa <- data.frame(probe = c(12,15,18,19),
probeposition = c(100,1200,500,900) )
patients <- data.frame(patid = c('AT', 'GA', 'TT', 'AG', 'GG', 'TA'),
probe1 = c(12, 12, NA, NA, 18, 19),
probe2 = c(15, 15, NA, NA, 19, 19) )
And what I want is this:
patients <- data.frame(patid = c('AT', 'GA', 'TT', 'AG', 'GG', 'TA'),
probe1 = c(12, 12, NA, NA, 18, 19),
probe2 = c(15, 15, NA, NA, 19, 19),
position1 = c(100, 100, NA, NA, 500, 900),
position2 = c(1200, 1200, NA, NA, 900, 900))
You can do this very easily using merge, which takes two data frames and joins them on common columns or row names.
The easiest way to get merge to work is to make sure you have matching column names where those columns refer to the same information. Specifically, I have renamed your column patprobe to probe:
mlpa <- data.frame(
probe = c(12,15,18,19),
probeposition = c(100,1200,500,900)
)
patients <- data.frame(
patid = c('AT', 'GA', 'TT', 'AG', 'GG', 'TA'),
probe = c(12, 12, NA, NA, 18, 19)
)
Now you can call merge. However, note that by default merge returns only matching rows (in database terminology, an inner join). What you want is to include all of the rows in patients (a left outer join). You do this by specifying all.x=TRUE:
merge(patients, mlpa, all.x=TRUE, sort=FALSE)
probe patid probeposition
1 12 AT 100
2 12 GA 100
3 18 GG 500
4 19 TA 900
5 NA TT NA
6 NA AG NA
Install the reshape2 package and try the following:
require(reshape2)
m.patients = melt(patients)
m.patients = merge(m.patients, mlpa,
by.x = "value",
by.y = "probe",
all = TRUE)
reshape(m.patients, direction="wide",
timevar="variable", idvar="patid")
This should give you output like the following, which can be cleaned up to match your desired output.
patid value.probe1 probeposition.probe1 value.probe2 probeposition.probe2
1 AT 12 100 15 1200
2 GA 12 100 15 1200
5 GG 18 500 19 900
7 TA 19 900 19 900
9 TT NA NA NA NA
10 AG NA NA NA NA
Update
Of course, you can also do it all with the reshape2 package as below:
m.patients = melt(patients, id.vars="patid", variable.name="time")
m.patients = melt(merge(m.patients, mlpa, by.x = "value",
by.y = "probe", all = TRUE))
dcast(m.patients, patid ~ variable + time )
Which results in:
patid value_probe1 value_probe2 probeposition_probe1 probeposition_probe2
1 AG NA NA NA NA
2 AT 12 15 100 1200
3 GA 12 15 100 1200
4 GG 18 19 500 900
5 TA 19 19 900 900
Update 2: Using Base R Reshape
You can also avoid using the reshape2 package entirely.
patients.l = reshape(patients, direction="long", idvar="patid",
varying=c("probe1", "probe2"), sep="")
reshape(merge(patients.l, mlpa, all = TRUE), direction="wide",
idvar="patid", timevar="time")
This gets you closest to your desired output:
patid probe.1 probeposition.1 probe.2 probeposition.2
1 AT 12 100 15 1200
2 GA 12 100 15 1200
5 GG 18 500 19 900
7 TA 19 900 19 900
9 TT NA NA NA NA
10 AG NA NA NA NA
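As a further alternative, a lookup with match() keeps the original row order of patients and adds one position column per probe column. This is a minimal base-R sketch, assuming the probe columns are named probe1, probe2, ... as in the update to the question; the position<n> names mirror the desired output there:

```r
mlpa <- data.frame(probe = c(12, 15, 18, 19),
                   probeposition = c(100, 1200, 500, 900))
patients <- data.frame(patid = c("AT", "GA", "TT", "AG", "GG", "TA"),
                       probe1 = c(12, 12, NA, NA, 18, 19),
                       probe2 = c(15, 15, NA, NA, 19, 19))

# All columns named probe<number>; this also covers probe3 and probe4 if present.
probe_cols <- grep("^probe[0-9]+$", names(patients), value = TRUE)

for (p in probe_cols) {
  pos_col <- sub("^probe", "position", p)
  # match() gives the row of mlpa for each probe value (NA if absent or missing),
  # and indexing with NA yields NA, so unmatched patients stay NA.
  patients[[pos_col]] <- mlpa$probeposition[match(patients[[p]], mlpa$probe)]
}

patients
```

Unlike merge, this does not reorder rows, so no sorting is needed afterwards.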
