I have a rectangular table with three variables: country, year and inflation. I already have all the descriptive statistics I need; now I want to do some analytics, and figured I should run a linear regression against a target country. The best idea I had was to create a new variable called inflation.in.country.x and loop through the inflation of country x to fill this new column, but that seems like an unclean solution.
How can I run a linear regression on a rectangular data table? The structure is like this:
> dat %>% str
'data.frame': 1196 obs. of 3 variables:
$ Country.Name: Factor w/ 31 levels "Albania","Armenia",..: 9 8 10 11 12 14 15 16 17 19 ...
$ year : chr "1967" "1967" "1967" "1967" ...
$ inflation : num 1.238 8.328 3.818 0.702 1.467 ...
I want to take Armenia's inflation as the dependent variable and Albania's as the independent variable in a linear regression. Is it possible without reshaping the data, while keeping the years aligned?
One way is to spread your data table using Country.Name as the key:
dat.spread <- dat %>% spread(key="Country.Name", value="inflation")
dat.spread %>% str
'data.frame': 50 obs. of 31 variables:
$ year : chr "1967" "1968" "1969" "1970" ...
$ Albania : num NA NA NA NA NA NA NA NA NA NA ...
$ Armenia : num NA NA NA NA NA NA NA NA NA NA ...
$ Brazil : num NA NA NA NA NA NA NA NA NA NA ...
[...]
That does force you to reshape the data, which may be undesirable, but afterwards you can simply use cbind to run the linear regression against all countries at once:
lm(cbind(Armenia, Brazil, Colombia, etc...) ~ Albania, data = dat.spread)
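As a side note (not part of the original answer): spread() has been superseded in tidyr 1.0 by pivot_wider(). A minimal sketch of the same reshape and regression, assuming a recent tidyr:
library(tidyr)
# wide reshape: one column per country, rows stay matched by year
dat.spread <- pivot_wider(dat, names_from = Country.Name, values_from = inflation)
# Armenia's inflation regressed on Albania's
fit <- lm(Armenia ~ Albania, data = dat.spread)
summary(fit)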
This is my first post and I will try to describe my problem as exactly as I can without writing a novel. Also, since English is not my native language, please forgive any ambiguities or spelling errors.
I am currently trying out the rayshader package for R in order to visualise several layers and create a representation of georeferenced data from Berlin. The data I have are a DEM (5 m resolution) and a GeoJSON containing a building layer with building-height information, a water layer, and a tree layer with tree heights.
For now only the DEM and the building layer are used.
I can render the DEM without any problems. The building polygons also get extruded and rendered, but their foundation height does not coincide with the corresponding height that should be read from the elevation matrix created from the DEM.
I expected the polygons to be placed correctly and "stand" on the rendered surface, but most of them clip through said surface or are stuck inside the ground layer. My assumption is that I am using the wrong function for my purpose: the creator of the package uses render_multipolygonz() for buildings, as can be seen here at timecode 12:49. I tried that, but it just renders an unextruded, continuous polygon on my base layer underneath the ground.
Or I might be missing an argument of the render_polygons() function.
It could also be quite possible that I am making a simple calling or assignment error, since I am anything but an expert in R. I am just starting my coding journey.
Here is my code:
#set wd to save location
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
#load libs
library(geojsonR)
library(rayshader)
library(raster)
library(sf)
library(rgdal)
library(dplyr)
library(rgl)
#load DEM
tempel_DOM <- raster("Daten/Tempelhof_Gelaende_5m_25833.tif")
#load buildings layer from GEOJSON
buildings_temp <-
st_read(dsn = "Daten/Tempelhof_GeoJSON_25833.geojson", layer = "polygon") %>%
st_transform(crs = st_crs(tempel_DOM)) %>%
filter(!is.na(bh))
#create elevation matrix from DEM
tempel_elmat <- raster_to_matrix(tempel_DOM)
#Tempelhof Render
tempel_elmat %>%
sphere_shade(texture = "imhof1") %>%
add_shadow(ray_shade(tempel_elmat), 0.5) %>%
plot_3d(
  tempel_elmat,
  zscale = 5,
  fov = 0,
  theta = 135,
  zoom = 0.75,
  phi = 45,
  windowsize = c(1000, 800)
)
render_polygons(
  buildings_temp,
  extent = extent(tempel_DOM),
  color = 'hotpink4',
  parallel = TRUE,
  data_column_top = 'bh',
  clear_previous = TRUE
)
The structure of my buildings_temp using str() is:
> str(buildings_temp)
Classes ‘sf’ and 'data.frame': 625 obs. of 11 variables:
$ t : int 1 1 1 1 1 1 1 1 1 1 ...
$ t2 : int NA NA NA NA NA NA NA NA NA NA ...
$ t3 : int NA NA NA NA NA NA NA NA NA NA ...
$ t4 : int NA NA NA NA NA NA NA NA NA NA ...
$ t1 : int 1 4 1 1 1 1 1 1 1 1 ...
$ bh : num 20.9 2.7 20.5 20.1 19.3 20.9 19.7 19.8 19.6 17.8 ...
$ t5 : int NA NA NA NA NA NA NA NA NA NA ...
$ t6 : int NA NA NA NA NA NA NA NA NA NA ...
$ th : num NA NA NA NA NA NA NA NA NA NA ...
$ id : int 261 262 263 264 265 266 267 268 269 270 ...
$ geometry:sfc_MULTIPOLYGON of length 625; first list element: List of 1
..$ :List of 1
.. ..$ : num [1:12, 1:2] 393189 393191 393188 393182 393177 ...
..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
- attr(*, "sf_column")= chr "geometry"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA
..- attr(*, "names")= chr [1:10] "t" "t2" "t3" "t4" ...
Thanks in advance for any help.
Cheers WiTell
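One possible direction, sketched under the assumption that your rayshader version supports the data_column_bottom argument of render_polygons() (the counterpart of data_column_top you already use): pre-compute each building's ground elevation from the DEM and extrude from ground level up to ground + bh. The column names base_m and top_m are invented for illustration, and the interplay with zscale may still need tuning:
# mean ground elevation under each footprint, in metres
ground <- raster::extract(tempel_DOM, as(buildings_temp, "Spatial"),
                          fun = mean, na.rm = TRUE)
buildings_temp$base_m <- ground                      # hypothetical base column
buildings_temp$top_m  <- ground + buildings_temp$bh  # hypothetical roof column
render_polygons(
  buildings_temp,
  extent = extent(tempel_DOM),
  color = 'hotpink4',
  data_column_bottom = 'base_m',  # extrude from ground level ...
  data_column_top = 'top_m',      # ... up to ground + building height
  clear_previous = TRUE
)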
I am encountering an issue with an R dataframe.
The dataframe contains columns that are not recognized as variables.
The names of these new columns contain other columns' names joined by the symbol '$'.
I cannot delete these columns from the dataframe.
Would anyone have an idea about what these columns are, why they are not considered as variables, and why I cannot delete them?
Here is a fraction of what appears when I use str() on the dataframe:
(...)
$ EDU_MAX_1: Ord.factor w/ 8 levels "NAP"<"NONE"<"ED1"<..: NA NA NA
$ EDU_MAX_2 : chr NA NA NA NA ...
$ age_14 :'data.frame': 6 obs. of 61 variables:
..$ leeftijd : num 14 14 14 14 14 14
..$ geboortejaar: num 1985 1985 1985 1985 1985 ...
(...)
The problem seems to be linked to the variable age_14, which seems to be coupled to each of these columns.
When I export the dataframe to an excel file, these columns do not appear in the exported file.
Many thanks in advance for your help
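For reference, the str() output shows that age_14 is itself a data.frame stored as a single column; its inner columns display as age_14$leeftijd and so on, which is where the '$' comes from. You cannot delete age_14$leeftijd by name because no column with that name really exists: the whole age_14 column has to go. A minimal sketch, assuming your dataframe is called df:
# identify columns that are themselves data frames
nested <- vapply(df, is.data.frame, logical(1))
names(df)[nested]   # should report "age_14"
# drop them by plain column subsetting
df <- df[!nested]
This is likely also why those columns do not appear in the exported Excel file: a nested data frame is not a regular atomic column.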
I run the following model in R:
clmm_br <- clmm(Grado_amenaza ~ Life_Form + size_max_cm +
                  leaf_length_mean + petals_length_mean +
                  silicua_length_mean + bloom_length + categ_color + (1|Genero),
                data = brasic1)
I didn't get any warnings or errors, but when I run summary(clmm_br) I can't get the p-values:
summary(clmm_br)
Cumulative Link Mixed Model fitted with the Laplace approximation
formula: Grado_amenaza ~ Life_Form + size_max_cm + leaf_length_mean +
petals_length_mean + silicua_length_mean + bloom_length +
categ_color + (1 | Genero)
data: brasic1
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 76 -64.18 160.36 1807(1458) 1.50e-03 NaN
Random effects:
Groups Name Variance Std.Dev.
Genero (Intercept) 0.000000008505 0.00009222
Number of groups: Genero 39
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Life_Form[T.G] 2.233338 NA NA NA
Life_Form[T.Hem] 0.577112 NA NA NA
Life_Form[T.Hyd] -22.632916 NA NA NA
Life_Form[T.Th] -1.227512 NA NA NA
size_max_cm 0.006442 NA NA NA
leaf_length_mean 0.008491 NA NA NA
petals_length_mean 0.091623 NA NA NA
silicua_length_mean -0.036001 NA NA NA
bloom_length -0.844697 NA NA NA
categ_color[T.2] -2.420793 NA NA NA
categ_color[T.3] 1.268585 NA NA NA
categ_color[T.4] 1.049953 NA NA NA
Threshold coefficients:
Estimate Std. Error z value
1|3 -1.171 NA NA
3|4 1.266 NA NA
4|5 1.800 NA NA
(4 observations deleted due to missingness)
I tried without the random effect and after excluding the rows with NAs, but the result is the same.
The structure of my data:
str(brasic1)
tibble[,13] [80 x 13] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:80] 135 137 142 145 287 295 585 593 646 656 ...
$ Genero : chr [1:80] "Alyssum" "Alyssum" "Alyssum" "Alyssum" ...
$ Cons.stat : chr [1:80] "LC" "VU" "VU" "LC" ...
$ Amenazada : num [1:80] 0 1 1 0 1 0 0 1 0 0 ...
$ Grado_amenaza : Factor w/ 5 levels "1","3","4","5",..: 1 2 2 1 4 1 1 2 1 1 ...
$ Life_Form : chr [1:80] "Th" "Hem" "Hem" "Th" ...
$ size_max_cm : num [1:80] 12 6 7 15 20 27 60 62 50 60 ...
$ leaf_length_mean : num [1:80] 7.5 7 11 14.5 31.5 45 90 65 65 39 ...
$ petals_length_mean : num [1:80] 2.2 3.5 5.5 2.55 6 8 10.5 9.5 9.5 2.9 ...
$ silicua_length_mean: num [1:80] 3.5 4 3.5 4 25 47.5 37.5 41.5 17.5 2.9 ...
$ X2n : num [1:80] 32 NA 16 16 NA NA 20 20 18 14 ...
$ bloom_length : num [1:80] 2 1 2 2 2 2 2 2 11 2 ...
$ categ_color : chr [1:80] "1" "4" "4" "4" ...
For a full answer we really need a reproducible example, but I can point to a few things that raise suspicions.
The fact that you can get estimates but not standard errors implies that something is wrong with the Hessian (the estimate of the curvature of the log-likelihood surface at the maximum likelihood estimate); this is consistent with cond.H being NaN in your output. There are several distinct (possibly overlapping) possibilities:
Any time you have a "large" parameter estimate (say, absolute value > 10), as in your example (Life_Form[T.Hyd] = -22.632916), it suggests complete separation, i.e. the presence/absence of that predictor perfectly predicts the response. (You can search for that term, e.g. on CrossValidated.) However, complete separation usually leads to absurdly large standard errors (along with the large parameter estimates) rather than to NAs.
You may have perfect multicollinearity, i.e. combinations of your predictor variables that are perfectly (jointly) correlated with other such combinations. Some R estimation procedures can detect and deal with this case (typically by dropping one or more predictors), but clmm might not be able to. You should be able to construct your model matrix (X <- model.matrix(your_formula, your_data), excluding the random effect from the formula) and then use caret::findLinearCombos(X) to explore this issue; see the sketch at the end of this answer.
More generally, if you want to do reliable inference you may need to cut down the size of your model (not by stepwise or other forms of model selection); a rule of thumb is that you need 10-20 observations per parameter estimated. You're trying to estimate 12 fixed effect parameters plus a few more (ordinal-threshold parameters and random effect variance) from 80 observations ...
In addition to dropping the random effect, it may help diagnosis to fit a regular linear model with lm() (which should tell you something about collinearity, because lm() drops aliased parameters), or a binomial model based on some threshold grade value (which might help with identifying complete separation).
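A minimal sketch of the collinearity check suggested above (assuming the caret package is installed and brasic1 is the data from the question):
# fixed effects only: build the design matrix without the (1|Genero) term
X <- model.matrix(~ Life_Form + size_max_cm + leaf_length_mean +
                    petals_length_mean + silicua_length_mean +
                    bloom_length + categ_color,
                  data = brasic1)
# reports groups of columns that are exact linear combinations of others,
# and which columns could be removed to break the dependency
caret::findLinearCombos(X)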
I have a data set that I uploaded here:
https://gist.github.com/anonymous/0bc36ec5f46757de7c2c
I load it into R using the following command:
df <- read.delim("path to the data", header=TRUE, sep="\t", fill=TRUE, row.names=1, stringsAsFactors=FALSE, na.strings='')
Then I check a specific column to see how many + values there are, like this:
length(which(df$Potential.contaminant == "+"))
which shows 9 in this case. Then I try to remove all the rows that contain the + using the following command:
Newdf <- df[df$Potential.contaminant != "+", ]
The output is all NA. What is wrong? What am I doing wrong here?
As @akrun suggested, I have tried many different ways to do it, but without success:
df[!grepl("[+]", df$Potential.contaminant),]
df[ is.na(df$Potential.contaminant),]
subset(df, Potential.contaminant != "+")
df[-(which(df$Potential.contaminant == "+")),]
None of the above commands solved it. One idea was that Potential.contaminant contains NA values and that is the reason, so I replaced all NAs with zero using
df[c("Potential.contaminant")][is.na(df[c("Potential.contaminant")])] <- 0
but still the same.
I copy-pasted your gist into a file c:/input.txt and then used your code:
df <- read.delim("c:/input.txt", header=TRUE, sep="\t", fill=TRUE, row.names=1, stringsAsFactors=FALSE, na.strings='')
Now:
> str(df)
'data.frame': 21 obs. of 11 variables:
$ Intensityhenya : int 0 NA NA NA NA 0 0 0 0 0 ...
$ Only.identified.by.site: chr "+" NA NA NA ...
$ Reverse : logi NA NA NA NA NA NA ...
$ Potential.contaminant : chr "+" NA NA NA ...
$ id : int 0 NA NA NA NA 1 2 3 4 5 ...
$ IDs.1 : chr "16182;22925;28117;28534;28538;29309;36387;36889;42536;49151;49833;52792;54591;54592" NA NA NA ...
$ razor : chr "True;True;False;False;False;False;False;True;False;False;False;False;False;False" NA NA NA ...
$ Mod.IDs : chr "16828;23798;29178;29603;29607;30404;38270;38271;38793;44633;51496;52211;55280;57146;57147;57148;57149" NA NA NA ...
$ Evidence.IDs : chr "694702;694703;694704;1017531;1017532;1017533;1017534;1017535;1017536;1017537;1017538;1017539;1017540;1017541;1017542;1017543;10"| __truncated__ NA NA NA ...
$ GHSIDs : chr NA NA NA NA ...
$ BestGSFD : chr NA NA NA NA ...
If I try to subset:
> df2 <- df[is.na(df$Potential.contaminant),]
> str(df2)
'data.frame': 12 obs. of 11 variables:
$ Intensityhenya : int NA NA NA NA NA NA NA NA NA NA ...
$ Only.identified.by.site: chr NA NA NA NA ...
$ Reverse : logi NA NA NA NA NA NA ...
$ Potential.contaminant : chr NA NA NA NA ...
$ id : int NA NA NA NA NA NA NA NA NA NA ...
$ IDs.1 : chr NA NA NA NA ...
$ razor : chr NA NA NA NA ...
$ Mod.IDs : chr NA NA NA NA ...
$ Evidence.IDs : chr NA NA NA NA ...
$ GHSIDs : chr NA NA NA NA ...
$ BestGSFD : chr NA NA NA NA ...
But your data are so messy that it's nearly impossible to visualize them, so let's try something else to get a feel for the problem.
> colnames(df)
[1] "Intensityhenya" "Only.identified.by.site" "Reverse" "Potential.contaminant" "id" "IDs.1" "razor" "Mod.IDs"
[9] "Evidence.IDs" "GHSIDs" "BestGSFD"
Your header is a pain to follow; let's have a look at it:
IDs Intensityhenya Only identified by site Reverse Potential contaminant id IDs razor Mod.IDs Evidence IDs GHSIDs BestGSFD
Along with a line of data, where long fields are cut so you can get a glance:
CON__A2A4G1 0 + + 0 16182;[...];4592 True;[..];False 16828;[...];57149 694702;[...];2208697;
208698;[...];2441826
3;2433194;[...];4682766
I've just stripped extraneous numbers where possible, while of course keeping the tabs and newlines.
I hope you can now see how and why this prevents a proper analysis of your data; do some checks on your input file to sanitize it before retrying to load it into R.
For illustration purposes, here is your gist with ellipses and %T% in place of tabs:
IDs%T%Intensityhenya%T%Only identified by site%T%Reverse%T%Potential contaminant%T%id%T%IDs%T%razor%T%Mod.IDs%T%Evidence IDs%T%GHSIDs%T%BestGSFD
CON__A2A4G1%T%0%T%+%T%%T%+%T%0%T%1618[...]4592%T%Tru[...]alse%T%1682[...]7149%T%69470[...]208697;%T%%T%
20869[...]441826%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
[...]20%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
00[...]%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
1271[...]682766%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
CON__A2A5Y0%T%0%T%%T%%T%+%T%1%T%443[...]5777%T%Fals[...]rue%T%464[...]8377%T%21071[...]489947%T%40503[...]780178%T%40505[...]780175
CON__A2AB72%T%0%T%%T%%T%+%T%2%T%443[...]0447%T%Tru[...]alse%T%464[...]2842%T%21070[...]232341%T%40502[...]250729%T%40502[...]250728
CON__ENSEMBL:ENSBTAP00000014147%T%0%T%%T%%T%+%T%3%T%53270%T%TRUE%T%55779%T%238286[...]382871%T%457377[...]573778%T%4573776
CON__ENSEMBL:ENSBTAP00000024146%T%0%T%%T%%T%+%T%4%T%186[...]5835%T%Tru[...]rue%T%194[...]8438%T%8382[...]492132%T%15455[...]783465%T%15455[...]783465
CON__ENSEMBL:ENSBTAP00000024466;CON__ENSEMBL:ENSBTAP00000024462%T%0%T%%T%%T%+%T%5%T%939[...]5179%T%Tru[...]rue%T%978[...]7757%T%41149[...]468480%T%78212[...]739209%T%78217[...]739209
CON__ENSEMBL:ENSBTAP00000025008%T%0%T%+%T%%T%+%T%6%T%1564[...]8580%T%Fals[...]alse%T%1627[...]9651%T%66672[...]269215%T%125151[...]439696%T%125151[...]439691
CON__ENSEMBL:ENSBTAP00000038253%T%0%T%%T%%T%+%T%7%T%120[...]5703%T%Fals[...]alse%T%125[...]8300%T%5326[...]25602%T%%T%
;125602[...]178%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
1[...]483384%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
22838[...]23247%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
;123247[...]411%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
4[...]7%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
603[...]790126;%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
79012[...]13848%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
;413848[...]765024%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
sp|O43790|KRT86_HUMAN;CON__O43790%T%0%T%%T%%T%+%T%8%T%121[...]5716%T%Tru[...]rue%T%126[...]8315%T%5455[...]484318%T%10404[...]426334%T%
It seems that the rows of your data which are not marked as contaminants have no values at all. The NAs appear because of the na.strings='' used in the read.delim call. So, for example, if you do:
df <- read.delim("https://gist.githubusercontent.com/anonymous/0bc36ec5f46757de7c2c/raw/517ef70ab6a68e600f57308e045c2b4669a7abfc/example.txt", header=TRUE, row.names=1, sep="\t")
df <- df[df$Potential.contaminant != '+', ]
summary(df)
you should see empty cells.
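The underlying mechanism: with na.strings='' in the original call, df$Potential.contaminant != "+" evaluates to NA (not FALSE) on the empty rows, and indexing a data frame with NA produces the all-NA rows observed. A sketch of an NA-safe filter that keeps the original read.delim call, using %in% (which returns FALSE rather than NA for missing values):
# NA %in% "+" is FALSE, so NA rows are kept instead of turning into all-NA rows
Newdf <- df[!(df$Potential.contaminant %in% "+"), ]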
I have a data set containing salaries test data. Not all cells have values, hence I used na.action=na.pass, na.rm=TRUE, but it gives me an error, perhaps because I aggregate with JobTitle, which is a factor?
So far I have developed the code below:
aggregate(salaries$JobTitle,
list(pay = salaries$TotalPay),
FUN=mean,
na.action=na.pass,
na.rm=TRUE)
My test data has the following columns:
'data.frame': 104 obs. of 36 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ EmployeeName : Factor w/ 11 levels "","ALBERT PARDINI",..: 10 7 2 4 11 6 3 5 9 8 ...
$ JobTitle : Factor w/ 9 levels "","ASSISTANT DEPUTY CHIEF II",..: 8 4 4 9 6 2 3 7 3 5 ...
$ BasePay : num 167411 155966 212739 77916 134402 ...
$ OvertimePay : num 0 245132 106088 56121 9737 ...
$ OtherPay : num 400184 137811 16453 198307 182235 ...
$ Benefits : logi NA NA NA NA NA NA ...
$ TotalPay : num 567595 538909 335280 332344 326373 ...
$ TotalPayBenefits: num 567595 538909 335280 332344 326373 ...
$ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
$ Notes : logi NA NA NA NA NA NA ...
$ Agency : Factor w/ 2 levels "","San Francisco": 2 2 2 2 2 2 2 2 2 2 ..
The warnings which come up are:
Warning messages:
1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
etc...
I have tried with salaries$Id and it works like magic, so I assume the code is correct and perhaps I need to change the data type for JobTitle?
If we are getting the mean of TotalPay grouped by JobTitle (in the original call the two were swapped, so mean() was applied to the factor JobTitle, which is what produced the warnings), the formula method would be
aggregate(TotalPay~JobTitle, salaries, mean, na.rm=TRUE, na.action=na.pass)
Or use
aggregate(salaries$TotalPay, list(salaries$JobTitle), FUN=mean, na.rm=TRUE)
data
set.seed(24)
salaries <- data.frame(JobTitle = sample(LETTERS[1:5], 20,
replace=TRUE), TotalPay= sample(c(1:20, NA), 20))
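For completeness, a dplyr equivalent of the same grouped mean (a sketch, assuming the dplyr package is available):
library(dplyr)
salaries %>%
  group_by(JobTitle) %>%
  summarise(mean_pay = mean(TotalPay, na.rm = TRUE))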