I have two tables I wish to left_join using the dplyr package. The issue is that it produces NA values for all new columns (the ones I'm after).
As you can see below, the left_join produces NA values for the new columns Incep.Price and DayCounter. Why does this happen, and how can it be resolved?
Update: Thanks to @akrun, using left_join(Avanza.XML, checkpoint, by = c('Firm' = 'Firm')) solves the issue and the columns are joined correctly.
However, the warning message is still the same. Could someone explain this behaviour? Why must one explicitly specify the join columns in this case, or otherwise get NA values?
> head(Avanza.XML)
Firm Gain.Month.1 Last.Price Vol.Month.1
1 Stockwik Förvaltning 131.25 0.074 131264420
2 Novestra 37.14 7.200 605330
3 Bactiguard Holding 29.55 14.250 2815572
4 MSC Group B 20.87 3.070 671855
5 NeuroVive Pharmaceutical 18.07 9.800 3280944
6 Shelton Petroleum B 16.21 3.800 2135798
> head(checkpoint)
Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1 Stockwik Förvaltning 87.50 0.06 91270090 0.032000 2016-01-25
2 Novestra 38.10 7.25 604683 5.249819 2016-01-25
3 Bactiguard Holding 29.09 14.20 2784161 11.000077 2016-01-25
4 MSC Group B 27.56 3.24 657699 2.539981 2016-01-25
5 Shelton Petroleum B 19.27 3.90 1985305 3.269892 2016-01-25
6 NeuroVive Pharmaceutical 16.87 9.70 3220303 8.299820 2016-01-25
> head(left_join(Avanza.XML, checkpoint))
Joining by: c("Firm", "Gain.Month.1", "Last.Price", "Vol.Month.1")
Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1 Stockwik Förvaltning 131.25 0.074 131264420 NA <NA>
2 Novestra 37.14 7.200 605330 NA <NA>
3 Bactiguard Holding 29.55 14.250 2815572 NA <NA>
4 MSC Group B 20.87 3.070 671855 NA <NA>
5 NeuroVive Pharmaceutical 18.07 9.800 3280944 NA <NA>
6 Shelton Petroleum B 16.21 3.800 2135798 NA <NA>
Warning message:
In left_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
There are two problems.
Not specifying the by argument in left_join: by default, all the common columns are used as the join variables. Looking at the shared columns "Gain.Month.1", "Last.Price" and "Vol.Month.1": all are numeric and have no matching values across the two datasets, so no row can match on the full set of join columns. It is better to join by "Firm" alone:
left_join(Avanza.XML, checkpoint, by = "Firm")
The "Firm" column's class, factor: we get a warning when there is a difference in the levels of the factor column (if it is a variable we join by). To remove the warning, we can either convert the "Firm" column in both datasets to character class:
Avanza.XML$Firm <- as.character(Avanza.XML$Firm)
checkpoint$Firm <- as.character(checkpoint$Firm)
Or, if we still want to keep the columns as factors, change the levels of "Firm" to include all the levels from both datasets:
lvls <- sort(unique(c(levels(Avanza.XML$Firm),
                      levels(checkpoint$Firm))))
Avanza.XML$Firm <- factor(Avanza.XML$Firm, levels=lvls)
checkpoint$Firm <- factor(checkpoint$Firm, levels=lvls)
and then do the left_join.
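To see both behaviours side by side, here is a minimal self-contained sketch (with made-up firm names, not the data from the question): the default join uses every shared column name and finds no matches, while the explicit key joins on Firm alone.

```r
library(dplyr)

x <- data.frame(Firm = c("A", "B"), Last.Price = c(1.0, 2.0))
y <- data.frame(Firm = c("A", "B"), Last.Price = c(1.5, 2.5),
                Incep.Price = c(0.9, 1.8))

# Default: dplyr joins by all shared column names (Firm AND Last.Price);
# no row matches on both, so Incep.Price comes back all NA
res_default <- left_join(x, y)
res_default$Incep.Price   # NA NA

# Explicit key: rows match on Firm alone, so Incep.Price is filled in
res_by <- left_join(x, y, by = "Firm")
res_by$Incep.Price        # 0.9 1.8
```

Specifying by explicitly is good practice anyway, since it makes the join key visible and silences the "Joining, by = ..." message.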
Related
I am new to R and struggling to understand its quirks. I'm trying to do something which should be really simple, but is turning out to be apparently very complicated.
I am used to Excel, SQL and Minitab, where you can enter a value in one column which includes references to other columns and parameters. However, R doesn't seem to be allowing me to do this.
I have a table with (currently) four columns:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.787890411
3 30/12/2011 662 NA NA
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
and have a parameter "beta", with a value which I have assigned as 0.0002.
All I want to do is assign a formula to rows 3:10 which is:
Tt[t] = beta * (Pallets[t] - Pallets[t-1]) + (1 - beta) * Tt[t-1]
I thought that the appropriate code might be:
Table[3:10,4]<-beta*(Table[3:10,"Pallets"]-Table[2:9,"Pallets"])+(1-beta)*Table[2:9,"Tt"]
However, this doesn't work. The first time I enter this formula, it generates:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.7878904
3 30/12/2011 662 NA 0.8431328
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
So it's generated the correct answer for the second item in the series, but not for any of the subsequent values.
It seems as though R doesn't automatically update each row, and the relationship to each other row, when you enter a formula, as Excel does. Having said that, Excel actually would require me to enter the formula in cell [4,Tt], and then drag this down to all of the other cells. Perhaps R is the same, and there is an equivalent to "dragging down" which I need to do?
Finally, I also noticed that when I change the value of the beta parameter, through, e.g. beta<-0.5, and then print the Table values again, they are unchanged - so the table hasn't updated even though I have changed the value of the parameter.
Appreciate that these are basic questions, but I am very new to R.
In R, computations are not made "cell by cell" but are vectorised: in your example, R takes the vectors Table[3:10,"Pallets"], Table[2:9,"Pallets"] and Table[2:9,"Tt"] as they are at that moment, computes the resulting vector, and finally assigns it to Table[3:10,4].
If you want a computation to run "cell by cell", with each result feeding into the next, you have to use a for loop:
beta <- 0.5
df <- data.frame(v1 = 1:12, v2 = 0)
for (i in 3:10) {
df[i, "v2"] <- beta * (df[i, "v1"] - df[i-1, "v1"]) + (1 - beta) * df[i-1, "v2"]
}
df
v1 v2
1 1 0.0000000
2 2 0.0000000
3 3 0.5000000
4 4 0.7500000
5 5 0.8750000
6 6 0.9375000
7 7 0.9687500
8 8 0.9843750
9 9 0.9921875
10 10 0.9960938
11 11 0.0000000
12 12 0.0000000
As it comes to your second question, R will never update any values on its own (imagine having set manual calculation in Excel). So you need to repeat the computations after changing beta.
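Incidentally, because this particular recursion is a first-order linear filter, it can also be vectorised with stats::filter() instead of a loop; a sketch on the same toy df, seeding the recursion with the value already sitting in row 2:

```r
beta <- 0.5
df <- data.frame(v1 = 1:12, v2 = 0)

# v2[i] = beta * (v1[i] - v1[i-1]) + (1 - beta) * v2[i-1] for i in 3:10
idx <- 3:10
x <- beta * diff(df$v1)[idx - 1]          # beta * (v1[i] - v1[i-1])
df$v2[idx] <- as.numeric(stats::filter(x, 1 - beta,
                                       method = "recursive",
                                       init = df$v2[2]))
df$v2[3:4]   # 0.50 0.75, matching the loop above
```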
Although it's generally bad design, you can also iterate over the rows in a loop:
Table$temp <- c(0, diff(Table$Pallets, 1))
prevTt <- 0
for (i in 1:10) {
  Table$Tt[i] <- Table$temp[i] * beta + (1 - beta) * prevTt
  prevTt <- Table$Tt[i]
}
Table$temp <- NULL
I tried to use the rbind.fill function from the plyr package to combine two dataframes with a column A, which contains only digits in the first dataframe, but (also) strings in the second dataframe. Reproducible example:
data1 <- data.frame(A=c(11111,22222,33333), b=c(4444,444,44444), c=c(5555,66666,7777))
data2 <- data.frame(A=c(1234,"ss150",123456), c=c(888,777,666))
rbind.fill(data1,data2)
This produced the output below, with incorrect data in column A, rows 4-6. It did not produce an error message.
A b c
1 107778 33434 6
2 1756756 4 7
3 2324234 5 8
4 2 NA 14562
5 3 NA 45613
6 1 NA 14
I had expected the function to coerce the whole column to character class, or at least to produce NA values or a warning. Instead, it inserted digits that I do not understand (in the actual file, these are two-digit numbers that are not sorted). The documentation does not specify that columns must be of the same type in the to-be-combined data.frames.
How can I get this combination?
A b c
1 11111 4444 5555
2 22222 444 66666
3 33333 44444 7777
4 1234 NA 888
5 ss150 NA 777
6 123456 NA 666
Look at class(data2$A). It's a factor, which is actually an integer vector with a vector of labels. Use stringsAsFactors=FALSE in your data.frame creation, or in read.csv and friends. This will force the variables to be either numeric or character vectors.
data1 <- data.frame(A=c(11111,22222,33333), b=c(4444,444,44444), c=c(5555,66666,7777))
data2 <- data.frame(A=c(1234,"ss150",123456), c=c(888,777,666), stringsAsFactors=FALSE)
rbind.fill(data1,data2)
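Note that since R 4.0, data.frame() defaults to stringsAsFactors = FALSE, so the reproducible example no longer creates a factor. On older R, an equivalent fix is to coerce the existing factor column to character just before combining, a sketch:

```r
library(plyr)

data1 <- data.frame(A = c(11111, 22222, 33333), b = c(4444, 444, 44444),
                    c = c(5555, 66666, 7777))
data2 <- data.frame(A = c(1234, "ss150", 123456), c = c(888, 777, 666))

# Coerce A to character (a no-op if it is already character) so the
# mixed numeric/character column combines cleanly instead of exposing
# the factor's integer codes
data2$A <- as.character(data2$A)
combined <- rbind.fill(data1, data2)
combined$A   # includes "ss150", not a factor code
```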
I need to form a cluster from the data frame mydf below, where I want to omit the Inf (infinite) values and the values greater than 50. How can I get a table that has no Inf and no values greater than 50 (perhaps by nullifying those cells)? The clustering part itself is not a problem, since I can do that with the Mfuzz package; the only issue is that I want to scale the cluster within the 0-50 margin.
mydf
s.no A B C
1 Inf Inf 999.9
2 0.43 30 23
3 34 22 233
4 3 43 45
You can use NA, the built in missing data indicator in R:
?NA
By doing this:
mydf[mydf > 50 | mydf == Inf] <- NA
mydf
s.no A B C
1 1 NA NA NA
2 2 0.43 30 23
3 3 34.00 22 NA
4 4 3.00 43 45
Anything you do downstream in R should have NA-handling methods, even if it's just na.omit.
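For example, complete rows can be extracted with na.omit() before passing the data to a clustering routine (a base-R sketch on data like the question's):

```r
mydf <- data.frame(A = c(Inf, 0.43, 34, 3),
                   B = c(Inf, 30, 22, 43),
                   C = c(999.9, 23, 233, 45))

# Blank out infinite values and anything above 50
mydf[mydf > 50 | mydf == Inf] <- NA

# Keep only rows with no missing values
clean <- na.omit(mydf)
nrow(clean)   # 2 complete rows remain
```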
I am working with R and "WGCNA" package. I am doing an integrative analysis of transcriptome and metabolome.
I have two data.frames, one for the transcriptome data, datExprFemale, and one for the metabolomics data, allTraits, but I am having trouble merging the two data.frames together.
> datExprFemale[1:5, 1:5]
ID gene1 gene2 gene3 gene4
F16 -0.450904880 0.90116800 -2.710879397 0.98942336
F17 -0.304889916 0.70307639 -0.245912838 -0.01089557
F18 0.001696330 0.43059153 -0.177277078 -0.24611398
F19 -0.005428231 0.32838938 0.001070509 -0.31351216
H1 0.183912553 -0.10357460 0.069589703 0.15791036
> allTraits[1:5, 1:5]
IND met1 met2 met3 met4
F15 6546 68465 56465 6548
F17 89916 7639 2838 9557
F20 6330 53 7078 11398
F1 231 938 509 351216
The individuals in allTraits have measurements in datExprFemale, but some individuals in datExprFemale do not occur in allTraits.
Here is what I have tried to merge the two data.frames together:
# First get a vector containing the row names (individual's ID) in datExprFemale
IND=rownames(datExprFemale)
# Get the rows in which two variables have the same individuals
traitRows = match(allTraits$IND, IND)
datTraits = allTraits[traitRows, -1]
This gives me the following:
met1 met2 met3 met4
11 0.0009 0.0559 7.1224 3.3894
12 0.0006 0.0370 10.5776 14.4437
15 0.0011 0.0295 5.7941 19.0225
16 0.0010 0.0531 6.1010 4.7698
17 0.0016 0.0462 7.7819 7.8796
19 0.0011 0.0192 12.7126 9.2564
20 0.0007 0.0502 9.4147 15.3579
21 0.0025 0.0455 8.4129 17.7273
NA NA NA NA NA
NA.1 NA NA NA NA
NA.2 NA NA NA NA
NA.3 NA NA NA NA
NA.4 NA NA NA NA
3 0.0017 0.0375 8.8503 8.7581
7 0.0006 0.0156 7.9272 4.9887
8 0.0011 0.0154 8.4716 8.6515
9 0.0010 0.0306 9.1220 3.5843
As you can see, there are some NA values, but I'm not sure why.
Now when I want to assign the ID of each individual to the corresponding row using the following code :
rownames(datTraits) = allTraits[traitRows, 1]
R gives this error:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names':
I'm not sure what I'm doing wrong.
There are a few problems in your code:
In the format you've presented, your datExprFemale does not have rownames, so the match won't work at all.
match tells you which rows in datExprFemale the individuals in allTraits correspond to, not which rows you need to extract from allTraits.
Here's the approach I would take:
# First make sure `allTraits` and `datExprFemale` actually have the right rownames
rownames(datExprFemale) = datExprFemale$ID
rownames(allTraits) = allTraits$IND
# Now get the individuals who have both transcriptomic and metabolomic
# measurements
has.both = intersect(rownames(allTraits), rownames(datExprFemale))
# Now pull out the subset of allTraits you want:
allTraits[has.both,]
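If rownames aren't needed afterwards, a base merge() gives the same subset in one step; a sketch on toy versions of the two data frames (column names as in the question):

```r
datExprFemale <- data.frame(ID = c("F16", "F17", "F18"),
                            gene1 = c(-0.45, -0.30, 0.002))
allTraits <- data.frame(IND = c("F15", "F17", "F18"),
                        met1 = c(6546, 89916, 6330))

# An inner merge keeps exactly the individuals present in both frames
datTraits <- merge(datExprFemale, allTraits, by.x = "ID", by.y = "IND")
datTraits$ID   # "F17" "F18"
```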
Thanks for your reply. In fact, "datTraits" in the code must be like this:
Insulin_ug_l Glucose_Insulin Leptin_pg_ml Adiponectin Aortic.lesions
F2_3 944 0.42055085 15148.76 14.339 296250
F2_14 632 0.67088608 6188.74 15.439 486313
F2_15 3326 0.16746843 18400.26 11.124 180750
F2_19 426 0.89671362 8438.70 16.842 113000
F2_20 2906 0.15691672 41801.54 13.498 166750
F2_23 920 0.58804348 24133.54 14.511 234000
F2_24 1895 0.24538259 52360.00 13.813 267500
F2_26 7293 0.09090909 126880.00 14.118 198000
F2_37 653 0.65849923 17100.00 12.470 121000
F2_42 1364 0.35703812 99220.00 14.531 110000
in which rows are individuals and columns are metabolites. This variable contains the individuals who are in both the transcriptomics and metabolomics files.
As for the code, I copied it from the WGCNA tutorial.
Thanks for any suggestion,
Behzad
This is a very simple question, but I haven't been able to find a definitive answer, so I thought I would ask it. I use the plm package for dealing with panel data. I am attempting to use the lag function to lag a variable FORWARD in time (the default is to retrieve the value from the previous period, and I want the value from the NEXT). I found a number of old articles/questions (circa 2009) suggesting that this is possible by using k=-1 as an argument. However, when I attempt this, I get an error.
Sample code:
library(plm)
df<-as.data.frame(matrix(c(1,1,1,2,2,3,20101231,20111231,20121231,20111231,20121231,20121231,50,60,70,120,130,210),nrow=6,ncol=3))
names(df)<-c("individual","date","data")
df$date<-as.Date(as.character(df$date),format="%Y%m%d")
df.plm<-pdata.frame(df,index=c("individual","date"))
Lagging:
lag(df.plm$data,0)
##returns
1-2010-12-31 1-2011-12-31 1-2012-12-31 2-2011-12-31 2-2012-12-31 3-2012-12-31
50 60 70 120 130 210
lag(df.plm$data,1)
##returns
1-2010-12-31 1-2011-12-31 1-2012-12-31 2-2011-12-31 2-2012-12-31 3-2012-12-31
NA 50 60 NA 120 NA
lag(df.plm$data,-1)
##returns
Error in rep(1, ak) : invalid 'times' argument
I've also read that plm.data has replaced pdata.frame for some applications in plm. However, plm.data doesn't seem to work with the lag function at all:
df.plm<-plm.data(df,indexes=c("individual","date"))
lag(df.plm$data,1)
##returns
[1] 50 60 70 120 130 210
attr(,"tsp")
[1] 0 5 1
I would appreciate any help. If anyone has another suggestion for a package to use for lagging, I'm all ears. However, I do love plm because it automagically deals with lagging across multiple individuals and skips gaps in the time series.
EDIT 2: Lagging forward (= leading values) is implemented in plm CRAN releases >= 1.6-4.
Use either lead() or lag() (the latter with a negative integer for leading values).
Take care with any other attached packages that use the same function names. To be sure, you can refer to the function by its full namespace, e.g., plm::lead.
Examples from ?plm::lead:
# First, create a pdata.frame
data("EmplUK", package = "plm")
Em <- pdata.frame(EmplUK)
# Then extract a series, which becomes additionally a pseries
z <- Em$output
class(z)
# compute negative lags (= leading values)
lag(z, -1)
lead(z, 1) # same as line above
identical(lead(z, 1), lag(z, -1)) # TRUE
The collapse package in CRAN has a C++ based function flag and also associated lag/lead operators L and F. It supports continuous sequences of lags/leads (positive and negative n values), and plm pseries and pdata.frame classes. Performance: 100x faster than plm and 10x faster than data.table (the fastest in R at the time of writing). Example:
library(collapse)
pwlddev <- plm::pdata.frame(wlddev, index = c("iso3c", "year"))
head(flag(pwlddev$LIFEEX, -1:1)) # A sequence of lags and leads
F1 -- L1
ABW-1960 66.074 65.662 NA
ABW-1961 66.444 66.074 65.662
ABW-1962 66.787 66.444 66.074
ABW-1963 67.113 66.787 66.444
ABW-1964 67.435 67.113 66.787
ABW-1965 67.762 67.435 67.113
head(L(pwlddev$LIFEEX, -1:1)) # Same as above
head(L(pwlddev, -1:1, cols = 9:12)) # Computing on columns 9 through 12
iso3c year F1.PCGDP PCGDP L1.PCGDP F1.LIFEEX LIFEEX L1.LIFEEX F1.GINI GINI L1.GINI
ABW-1960 ABW 1960 NA NA NA 66.074 65.662 NA NA NA NA
ABW-1961 ABW 1961 NA NA NA 66.444 66.074 65.662 NA NA NA
ABW-1962 ABW 1962 NA NA NA 66.787 66.444 66.074 NA NA NA
ABW-1963 ABW 1963 NA NA NA 67.113 66.787 66.444 NA NA NA
ABW-1964 ABW 1964 NA NA NA 67.435 67.113 66.787 NA NA NA
ABW-1965 ABW 1965 NA NA NA 67.762 67.435 67.113 NA NA NA
F1.ODA ODA L1.ODA
ABW-1960 NA NA NA
ABW-1961 NA NA NA
ABW-1962 NA NA NA
ABW-1963 NA NA NA
ABW-1964 NA NA NA
ABW-1965 NA NA NA
library(microbenchmark)
library(data.table)
microbenchmark(
  plm_class = flag(pwlddev),
  ad_hoc = flag(wlddev, g = wlddev$iso3c, t = wlddev$year),
  data.table = qDT(wlddev)[, shift(.SD), by = iso3c]
)
Unit: microseconds
expr min lq mean median uq max neval cld
plm_class 462.313 512.5145 1044.839 551.562 637.6875 15913.17 100 a
ad_hoc 443.124 519.6550 1127.363 559.817 701.0545 34174.05 100 a
data.table 7477.316 8070.3785 10126.471 8682.184 10397.1115 33575.18 100 b
I had this same problem and couldn't find a good solution in plm or any other package. ddply was tempting (e.g. s5 <- ddply(df, .(country, year), transform, lag = lag(df[, "value-to-lag"], lag = 3))), but I couldn't get the NAs in my lagged column to line up properly for lags other than one.
I wrote a brute force solution that iterates over the dataframe row-by-row and populates the lagged column with the appropriate value. It's horrendously slow (437.33s for my 13000x130 dataframe vs. 0.012s for turning it into a pdata.frame and using lag) but it got the job done for me. I thought I would share it here because I couldn't find much information elsewhere on the internet.
In the function below:
df is your dataframe. The function returns df with a new column containing the forward values.
group is the column name of the grouping variable for your panel data. For example, I had longitudinal data on multiple countries, and I used "Country.Name" here.
x is the column you want to generate lagged values from, e.g. "GDP"
forwardx is the (new) column that will contain the forward lags, e.g. "GDP.next.year".
lag is the number of periods into the future. For example, if your data were taken in annual intervals, using lag=5 would set forwardx to the value of x five years later.
add_forward_lag <- function(df, group, x, forwardx, lag) {
  for (i in 1:(nrow(df) - lag)) {
    if (as.character(df[i, group]) == as.character(df[i + lag, group])) {
      # put forward observation in forwardx
      df[i, forwardx] <- df[i + lag, x]
    } else {
      # end of group, no forward observation
      df[i, forwardx] <- NA
    }
  }
  # last elem(s) in forwardx are NA
  for (j in (nrow(df) - lag + 1):nrow(df)) {
    df[j, forwardx] <- NA
  }
  return(df)
}
See the sample output below, using the built-in DNase dataset. The lag doesn't make sense in the context of this dataset, but it lets you see what the columns do.
data(DNase)
add_forward_lag(DNase, "Run", "density", "lagged_density", 3)
Grouped Data: density ~ conc | Run
Run conc density lagged_density
1 1 0.04882812 0.017 0.124
2 1 0.04882812 0.018 0.206
3 1 0.19531250 0.121 0.215
4 1 0.19531250 0.124 0.377
5 1 0.39062500 0.206 0.374
6 1 0.39062500 0.215 0.614
7 1 0.78125000 0.377 0.609
8 1 0.78125000 0.374 1.019
9 1 1.56250000 0.614 1.001
10 1 1.56250000 0.609 1.334
11 1 3.12500000 1.019 1.364
12 1 3.12500000 1.001 1.730
13 1 6.25000000 1.334 1.710
14 1 6.25000000 1.364 NA
15 1 12.50000000 1.730 NA
16 1 12.50000000 1.710 NA
17 2 0.04882812 0.045 0.123
18 2 0.04882812 0.050 0.225
19 2 0.19531250 0.137 0.207
Given how long this takes, you may want to use a different approach: backwards-lag all of your other variables.
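For comparison, the same forward lag can be computed without a row-by-row loop using dplyr's lead() grouped by the panel identifier (not part of the original answer, just a faster alternative shown on the same DNase example):

```r
library(dplyr)

# Forward lag of 3 within each Run, as add_forward_lag() does above
DNase2 <- DNase |>
  group_by(Run) |>
  mutate(lagged_density = lead(density, 3)) |>
  ungroup()

head(DNase2$lagged_density, 4)   # 0.124 0.206 0.215 0.377
```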