Why is my output all NA when I remove specific rows? - r

I have a dataset that I uploaded here:
https://gist.github.com/anonymous/0bc36ec5f46757de7c2c
I load it into R using the following command:
df <- read.delim("path to the data", header=TRUE, sep="\t", fill=TRUE, row.names=1, stringsAsFactors=FALSE, na.strings='')
Then I check a specific column to see how many + entries it contains:
length(which(df$Potential.contaminant == "+"))
which shows 9 in this case. Then I try to remove all rows that contain a + in that column using the following command:
Newdf <- df[df$Potential.contaminant != "+", ]
The output is all NA. What is wrong? What am I doing wrong here?
As @akrun suggested, I have tried many different ways to do it, but without success:
df[!grepl("[+]", df$Potential.contaminant),]
df[ is.na(df$Potential.contaminant),]
subset(df, Potential.contaminant != "+")
df[-(which(df$Potential.contaminant == "+")),]
None of the above commands solved it. One idea was that Potential.contaminant contains NA values and that is the reason. I replaced all NA with zero using
df[c("Potential.contaminant")][is.na(df[c("Potential.contaminant")])] <- 0
but the result is still the same.

I copy-pasted your gist into a file c:/input.txt and then used your code:
df <- read.delim("c:/input.txt", header=TRUE, sep="\t", fill=TRUE, row.names=1, stringsAsFactors=FALSE, na.strings='')
Now:
> str(df)
'data.frame': 21 obs. of 11 variables:
$ Intensityhenya : int 0 NA NA NA NA 0 0 0 0 0 ...
$ Only.identified.by.site: chr "+" NA NA NA ...
$ Reverse : logi NA NA NA NA NA NA ...
$ Potential.contaminant : chr "+" NA NA NA ...
$ id : int 0 NA NA NA NA 1 2 3 4 5 ...
$ IDs.1 : chr "16182;22925;28117;28534;28538;29309;36387;36889;42536;49151;49833;52792;54591;54592" NA NA NA ...
$ razor : chr "True;True;False;False;False;False;False;True;False;False;False;False;False;False" NA NA NA ...
$ Mod.IDs : chr "16828;23798;29178;29603;29607;30404;38270;38271;38793;44633;51496;52211;55280;57146;57147;57148;57149" NA NA NA ...
$ Evidence.IDs : chr "694702;694703;694704;1017531;1017532;1017533;1017534;1017535;1017536;1017537;1017538;1017539;1017540;1017541;1017542;1017543;10"| __truncated__ NA NA NA ...
$ GHSIDs : chr NA NA NA NA ...
$ BestGSFD : chr NA NA NA NA ...
If I try to subset:
> df2 <- df[is.na(df$Potential.contaminant),]
> str(df2)
'data.frame': 12 obs. of 11 variables:
$ Intensityhenya : int NA NA NA NA NA NA NA NA NA NA ...
$ Only.identified.by.site: chr NA NA NA NA ...
$ Reverse : logi NA NA NA NA NA NA ...
$ Potential.contaminant : chr NA NA NA NA ...
$ id : int NA NA NA NA NA NA NA NA NA NA ...
$ IDs.1 : chr NA NA NA NA ...
$ razor : chr NA NA NA NA ...
$ Mod.IDs : chr NA NA NA NA ...
$ Evidence.IDs : chr NA NA NA NA ...
$ GHSIDs : chr NA NA NA NA ...
$ BestGSFD : chr NA NA NA NA ...
But your data is so messy that it's nearly impossible to visualize, so let's try something else to get a glance at it.
> colnames(df)
[1] "Intensityhenya" "Only.identified.by.site" "Reverse" "Potential.contaminant" "id" "IDs.1" "razor" "Mod.IDs"
[9] "Evidence.IDs" "GHSIDs" "BestGSFD"
Your header is a pain to follow; let's have a look at it:
IDs Intensityhenya Only identified by site Reverse Potential contaminant id IDs razor Mod.IDs Evidence IDs GHSIDs BestGSFD
Along with a line of data where long fields are truncated for readability:
CON__A2A4G1 0 + + 0 16182;[...];4592 True;[..];False 16828;[...];57149 694702;[...];2208697;
208698;[...];2441826
3;2433194;[...];4682766
I've just stripped extraneous numbers where possible, keeping the tabs and newlines.
I hope you can see how and why this gets in the way of a proper analysis of your data. Do some checks on your input data to sanitize it before retrying to load it into R.
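Base R's count.fields() gives a quick, code-level way to find such ragged lines; a minimal sketch (the path is the one used above, and 12 is the field count of the header row):

```r
# Count tab-separated fields on every raw line of the file
n_fields <- count.fields("c:/input.txt", sep = "\t", quote = "")
table(n_fields)          # a clean TSV shows a single field count
which(n_fields != 12)    # lines deviating from the 12-column header
```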
For illustration purposes, here is your gist with ellipses, and %T% in place of tabs:
IDs%T%Intensityhenya%T%Only identified by site%T%Reverse%T%Potential contaminant%T%id%T%IDs%T%razor%T%Mod.IDs%T%Evidence IDs%T%GHSIDs%T%BestGSFD
CON__A2A4G1%T%0%T%+%T%%T%+%T%0%T%1618[...]4592%T%Tru[...]alse%T%1682[...]7149%T%69470[...]208697;%T%%T%
20869[...]441826%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
[...]20%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
00[...]%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
1271[...]682766%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
CON__A2A5Y0%T%0%T%%T%%T%+%T%1%T%443[...]5777%T%Fals[...]rue%T%464[...]8377%T%21071[...]489947%T%40503[...]780178%T%40505[...]780175
CON__A2AB72%T%0%T%%T%%T%+%T%2%T%443[...]0447%T%Tru[...]alse%T%464[...]2842%T%21070[...]232341%T%40502[...]250729%T%40502[...]250728
CON__ENSEMBL:ENSBTAP00000014147%T%0%T%%T%%T%+%T%3%T%53270%T%TRUE%T%55779%T%238286[...]382871%T%457377[...]573778%T%4573776
CON__ENSEMBL:ENSBTAP00000024146%T%0%T%%T%%T%+%T%4%T%186[...]5835%T%Tru[...]rue%T%194[...]8438%T%8382[...]492132%T%15455[...]783465%T%15455[...]783465
CON__ENSEMBL:ENSBTAP00000024466;CON__ENSEMBL:ENSBTAP00000024462%T%0%T%%T%%T%+%T%5%T%939[...]5179%T%Tru[...]rue%T%978[...]7757%T%41149[...]468480%T%78212[...]739209%T%78217[...]739209
CON__ENSEMBL:ENSBTAP00000025008%T%0%T%+%T%%T%+%T%6%T%1564[...]8580%T%Fals[...]alse%T%1627[...]9651%T%66672[...]269215%T%125151[...]439696%T%125151[...]439691
CON__ENSEMBL:ENSBTAP00000038253%T%0%T%%T%%T%+%T%7%T%120[...]5703%T%Fals[...]alse%T%125[...]8300%T%5326[...]25602%T%%T%
;125602[...]178%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
1[...]483384%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
22838[...]23247%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
;123247[...]411%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
4[...]7%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
603[...]790126;%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
79012[...]13848%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
;413848[...]765024%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%%T%
sp|O43790|KRT86_HUMAN;CON__O43790%T%0%T%%T%%T%+%T%8%T%121[...]5716%T%Tru[...]rue%T%126[...]8315%T%5455[...]484318%T%10404[...]426334%T%

It seems that your data rows which are not marked as contaminants have no values. The NAs appear because of the na.strings='' argument employed in the read.delim call. So, for example, if you do:
df <- read.delim("https://gist.githubusercontent.com/anonymous/0bc36ec5f46757de7c2c/raw/517ef70ab6a68e600f57308e045c2b4669a7abfc/example.txt", header=TRUE, row.names=1, sep="\t")
df<-df[df$Potential.contaminant!='+',]
summary(df)
you should see empty cells.
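For completeness, the all-NA rows in the original attempt come from how R subsets with NA: comparing NA against "+" yields NA, and an NA row index returns a row of NAs. A minimal sketch with toy data showing the pitfall and two safe alternatives:

```r
x <- data.frame(v = c("+", NA, "a"), stringsAsFactors = FALSE)
x$v != "+"                        # FALSE NA TRUE: comparisons with NA give NA
x[x$v != "+", , drop = FALSE]     # the NA index produces a phantom all-NA row
# Safer: %in% returns FALSE for NA, so NA rows are kept instead of mangled
x[!(x$v %in% "+"), , drop = FALSE]
# Or handle NA explicitly
x[is.na(x$v) | x$v != "+", , drop = FALSE]
```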

Rayshader: Rendered polygons don't align with the surface height

This is my first post and I will try to describe my problem as exactly as I can without writing a novel. Also, since English is not my native language, please forgive any ambiguities or spelling errors.
I am currently trying out the rayshader package for R in order to visualise several layers and create a representation of georeferenced data from Berlin. The data I have is a DEM (5 m resolution) and a GeoJSON containing a building layer with building heights, a water layer, and a tree layer with tree heights.
For now only the DEM and the building layer are used.
I can render the DEM without any problems. The building polygons are also extruded and rendered, but their foundation height does not coincide with the corresponding height that should be read from the elevation matrix created from the DEM.
I expected the polygons to be placed correctly and "stand" on the rendered surface, but most of them clip through said surface or are stuck inside the ground layer. My assumption is that I am using the wrong function for my purpose - the creator of the package uses render_multipolygonz() for buildings, as can be seen here at timecode 12:49. I tried that, but it just renders an unextruded, continuous polygon on my base layer underneath the ground.
Or I might be missing an argument of the render_polygons() function.
It could also be quite possible that I am making a simple calling or assignment error, since I am anything but an expert in R. I am just starting my coding journey.
Here is my code:
#set wd to save location
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
#load libs
library(geojsonR)
library(rayshader)
library(raster)
library(sf)
library(rgdal)
library(dplyr)
library(rgl)
#load DEM
tempel_DOM <- raster("Daten/Tempelhof_Gelaende_5m_25833.tif")
#load buildings layer from GEOJSON
buildings_temp <-
st_read(dsn = "Daten/Tempelhof_GeoJSON_25833.geojson", layer = "polygon") %>%
st_transform(crs = st_crs(tempel_DOM)) %>%
filter(!is.na(bh))
#create elevation matrix from DEM
tempel_elmat <- raster_to_matrix(tempel_DOM)
#Tempelhof Render
tempel_elmat %>%
sphere_shade(texture = "imhof1") %>%
add_shadow(ray_shade(tempel_elmat), 0.5) %>%
plot_3d(
tempel_elmat,
zscale = 5,
fov = 0,
theta = 135,
zoom = 0.75,
phi = 45,
windowsize = c(1000, 800)
)
render_polygons(
buildings_temp,
extent = extent(tempel_DOM),
color = 'hotpink4',
parallel = TRUE,
data_column_top = 'bh',
clear_previous = TRUE
)
The structure of my buildings_temp using str() is:
> str(buildings_temp)
Classes ‘sf’ and 'data.frame': 625 obs. of 11 variables:
$ t : int 1 1 1 1 1 1 1 1 1 1 ...
$ t2 : int NA NA NA NA NA NA NA NA NA NA ...
$ t3 : int NA NA NA NA NA NA NA NA NA NA ...
$ t4 : int NA NA NA NA NA NA NA NA NA NA ...
$ t1 : int 1 4 1 1 1 1 1 1 1 1 ...
$ bh : num 20.9 2.7 20.5 20.1 19.3 20.9 19.7 19.8 19.6 17.8 ...
$ t5 : int NA NA NA NA NA NA NA NA NA NA ...
$ t6 : int NA NA NA NA NA NA NA NA NA NA ...
$ th : num NA NA NA NA NA NA NA NA NA NA ...
$ id : int 261 262 263 264 265 266 267 268 269 270 ...
$ geometry:sfc_MULTIPOLYGON of length 625; first list element: List of 1
..$ :List of 1
.. ..$ : num [1:12, 1:2] 393189 393191 393188 393182 393177 ...
..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
- attr(*, "sf_column")= chr "geometry"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA
..- attr(*, "names")= chr [1:10] "t" "t2" "t3" "t4" ...
Thanks in advance for any help.
Cheers WiTell

Convert character variables into numeric with R

FIRST QUESTION EVER ;)
Here's the point: I have this dataset, and I started without stringsAsFactors=FALSE in the read.csv function. I can't work with the data because I get the warning message "NAs introduced by coercion" when converting to numeric. Thank you for the help :)
rm(list=ls())
path <- "....."
file <- read.csv(path, header = TRUE, sep = ",", stringsAsFactors=FALSE)
str(file)
#'data.frame': 33 obs. of 11 variables:
#$ Var1: chr "01/09/2021" "02/09/2021" "09/09/2021" "10/09/2021" ...
#$ Var2: chr "mercoledì" "giovedì" "giovedì" "venerdì" ...
#$ Var3: chr "2,5" "2,5" "2,5" "3,0" ...
#$ Var4: chr "4,0" "0,0" "2,0" "3,0" ...
#$ Var5: chr "2,0" "5,0" "5,0" "5,0" ...
#$ Var5: chr "0,0" "0,0" "0,0" "0,0" ...
#$ Var6: chr "6,0" "5,0" "7,0" "8,0" ...
#$ Var7: chr "23,5" "25,0" "28,0" "32,0" ...
#$ Var8: chr "0,0" "1,0" "5,0" "5,5" ...
#$ Var9: chr "23,5" "26,0" "33,0" "37,5" ...
#$ Var10: chr "67,0" "0,0" "0,0" "0,0" ...
as.numeric(file$Var7)
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion
CSV FILE
I managed to recreate your problem. Your file is using a comma both as the field separator and as the decimal separator (which is uncommon).
You can fix your problem by telling read.csv() that decimals are commas (dec = ","), as follows:
read.csv(
path,
header = TRUE,
sep = ",",
dec = ",", # I've added this line
stringsAsFactors = FALSE
)
Change this, run str(file) again, and you should see that most columns are numeric.
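If the file has already been read in with character columns, the same fix can be applied after the fact; a sketch (assuming, as in your str() output, that the comma only appears as a decimal mark in these columns):

```r
# Replace the decimal comma with a dot, then coerce to numeric
file$Var7 <- as.numeric(gsub(",", ".", file$Var7, fixed = TRUE))
# Or convert every column that looks like "12,5" in one pass
num_like <- sapply(file, function(x) is.character(x) && all(grepl("^[0-9]+,[0-9]+$", x)))
file[num_like] <- lapply(file[num_like],
                         function(x) as.numeric(gsub(",", ".", x, fixed = TRUE)))
```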

How to convert the columns of a particular Excel file into Date format?

I have an Excel file with 77 columns (43 of which are NA columns) of different lengths, 12 of which are dates. Ideally, I want to import the dataset into R with the date columns in Date format and the other columns in numeric format. There is a lot of material on Stack Overflow and I tried all the options, but it is not working.
The first option would be to do it directly from excel:
dataset <- read_xlsx("Data.xlsx", col_types = "numeric") # it makes everything numeric, but date columns always come out as serial numbers like "36164"
#I also tried something like this:
dataset <- read_xlsx("Data.xlsx", col_types = c("date", rep("numeric", n))) # where "n" stands for the number of numeric columns I have, but it did not work
I can import the data with the incorrect date columns. After some cleaning (removing NA columns) I get a tbl whose columns have different lengths. I tried the following code to transform the incorrect date columns into Date format:
dataset <- janitor::remove_empty(dataset, which = "cols") #remove NA columns
dataset <- dataset[-c(1),] #remove the first row of all columns
# Now using this command I could transform each incorrect date column into a date format:
date <- as.Date(as.numeric(dataset$column1), origin = "1899-12-30")
# I would like to do it for all the date columns in one shot but when I try to do it in this way
as.Date(as.numeric(dataset[,c(1,3,5,7,14,16,18,20,21,23,25,32)]), origin = "1899-12-30")
# I get an error, probably because the columns have different lengths
# the error is: Error in as.Date(as.numeric(var_dataset[, c(1, 3, 5, 7, 14, 16, 18, 20, :
'list' object cannot be coerced to type 'double'
# unlisting the object doesn't solve the problem
I am aware that data to reproduce my problem is missing, but in the first scenario I don't know how to approximate my quite big Excel file, while in the second case I don't know how to create a tbl with many columns of different lengths without wasting a lot of time. Sorry.
Do you have any solution? Either for importing directly from Excel or for playing with the data frame.
Thanks so much.
I attach here the structure of my dataset:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5500 obs. of 77 variables:
$ Name...1 : chr "Code" "36164" "36165" "36166" ...
$ VSTOXX VOLATILITY INDEX - PRICE INDEX : chr "VSTOXXI(PI)" "18.2" "29.69" "25.17" ...
$ ...3 : logi NA NA NA NA NA NA ...
$ ...4 : logi NA NA NA NA NA NA ...
$ ...5 : logi NA NA NA NA NA NA ...
$ ...6 : logi NA NA NA NA NA NA ...
$ Name...7 : chr "Code" "36799" "36830" "36860" ...
$ EM COMPOSITE INDICATOR OF SOVEREIGN STRESS: GDP WEIGHTS NADJ : chr "EMEBSCGWR" "7.8255999999999992E-2" "8.9886999999999995E-2" "8.0714999999999995E-2" ...
$ ...9 : logi NA NA NA NA NA NA ...
$ Name...10 : chr "Code" "36168" "36175" "36182" ...
$ CISS BOND MKT: GOV & NFC VOLATILITY - ECONOMIC SERIES : chr "EMCIBMG" "4.4651999999999997E-2" "6.6535999999999998E-2" "4.9789E-2" ...
$ ...12 : logi NA NA NA NA NA NA ...
$ Name...13 : chr "Code" "36168" "36175" "36182" ...
$ CISS MONEY MKT: 3M RATE+ VOLATILITY - ECONOMIC SERIES : chr "EMECM3E" "5.7435999999999994E-2" "7.463199999999999E-2" "7.2263999999999995E-2" ...
$ CISS FX MKT: EUR VOLATILITY - ECONOMIC SERIES : chr "EMECFEM" "7.2139999999999996E-2" "8.6049E-2" "4.5948999999999997E-2" ...
$ CISS FIN INTERM: BANK+ VOLATILITY - ECONOMIC SERIES : chr "EMCIFIN" "4.5384999999999995E-2" "0.11820399999999999" "0.11516499999999999" ...
$ CISS NF EQUITY: VOLATILITY - ECONOMIC SERIES : chr "EMCIEMN" "7.7453999999999995E-2" "0.12733" "0.11918899999999999" ...
$ CISS: CROSS SUBINDEXCORRELATION - ECONOMIC SERIES : chr "EMCICRO" "-0.21210999999999999" "-0.29791000000000001" "-0.2369" ...
$ SYSTEMIC STRESS COMPINDICATOR - ECONOMIC SERIES : chr "EMCISSI" "8.4954000000000002E-2" "0.174844" "0.16546" ...
$ ...20 : logi NA NA NA NA NA NA ...
$ ...21 : logi NA NA NA NA NA NA ...
$ ...22 : logi NA NA NA NA NA NA ...
$ ...23 : logi NA NA NA NA NA NA ...
$ ...24 : logi NA NA NA NA NA NA ...
$ ...25 : logi NA NA NA NA NA NA ...
$ Name...26 : chr "Code" "33253" "33284" "33312" ...
$ Z8 IPI: MFG., VOLUME INDEX OF PRODUCTION, 2015=100 (WDA) VOLA: chr "Z8ES493KG" "81" "79.7" "79.400000000000006" ...
$ ...28 : logi NA NA NA NA NA NA ...
$ ...29 : logi NA NA NA NA NA NA ...
$ ...30 : logi NA NA NA NA NA NA ...
$ ...31 : logi NA NA NA NA NA NA ...
$ ...32 : logi NA NA NA NA NA NA ...
$ ...33 : logi NA NA NA NA NA NA ...
$ ...34 : logi NA NA NA NA NA NA ...
$ Name...35 : chr "Code" "35779" "35810" "35841" ...
$ EH HICP: ALL-ITEMS NADJ : chr "EHES795WR" "1.7" "1.6" "1.6" ...
$ ...37 : logi NA NA NA NA NA NA ...
$ ...38 : logi NA NA NA NA NA NA ...
$ Name...39 : chr "Code" "35110" "35139" "35170" ...
$ EH HICP: ALL-ITEMS (%MOM) NADJ : chr "EHESPQ93R" "0.4" "0.4" "0.3" ...
$ ...41 : logi NA NA NA NA NA NA ...
$ ...42 : logi NA NA NA NA NA NA ...
$ ...43 : logi NA NA NA NA NA NA ...
$ Name...44 : chr "Code" "35445" "35476" "35504" ...
$ EH HICP: ALL-ITEMS HICP (%YOY) NADJ : chr "EHESAKZER" "2.2000000000000002" "2" "1.7" ...
$ ...46 : logi NA NA NA NA NA NA ...
$ ...47 : logi NA NA NA NA NA NA ...
$ ...48 : logi NA NA NA NA NA NA ...
$ ...49 : logi NA NA NA NA NA NA ...
$ Name...50 : chr "Code" "36206" "36234" "36265" ...
$ EM EUROSYSTEM: BASE MONEY CURN : chr "EMEBSMYBA" "426.64374199999997" "430.51499999999999" "432.34064499999999" ...
$ ...52 : logi NA NA NA NA NA NA ...
$ ...53 : logi NA NA NA NA NA NA ...
$ ...54 : logi NA NA NA NA NA NA ...
$ ...55 : logi NA NA NA NA NA NA ...
$ Name...56 : chr "Code" "35703" "35734" "35762" ...
$ EM EUROSYSTEM: TOTAL ASSETS/LIABILITIES (EP) CURN : chr "EMECBSALA" "710257.53500000003" "711193.47100000002" "714957.58900000004" ...
$ ...58 : logi NA NA NA NA NA NA ...
$ ...59 : logi NA NA NA NA NA NA ...
$ ...60 : logi NA NA NA NA NA NA ...
$ ...61 : logi NA NA NA NA NA NA ...
$ ...62 : logi NA NA NA NA NA NA ...
$ ...63 : logi NA NA NA NA NA NA ...
$ Name...64 : chr "Code" "41548" "41579" "41609" ...
$ TR EU FWD INFL-LKD SWAP 10YF20Y - MIDDLE RATE : chr "TREFSTT" NA NA NA ...
$ TR EU FWD INFL-LKD SWAP 10YF10Y - MIDDLE RATE : chr "TREFS1T" NA NA NA ...
$ TR EU FWD INFL-LKD SWAP 2YF2Y - MIDDLE RATE : chr "TREFS22" "1.5158" "1.4669000000000001" "1.4715" ...
$ TR EU FWD INFL-LKD SWAP 1YF1Y - MIDDLE RATE : chr "TREFS11" "1.4509000000000001" "1.2338" "1.1225000000000001" ...
$ TR EU FWD INFL-LKD SWAP 2YF3Y - MIDDLE RATE : chr "TREFS23" "1.5906000000000002" "1.5453000000000001" "1.5283000000000002" ...
$ TR EU FWD INFL-LKD SWAP 5YF10Y - MIDDLE RATE : chr "TREFS5T" "2.3516000000000004" "2.3323" "2.3070000000000004" ...
$ ...71 : logi NA NA NA NA NA NA ...
$ ...72 : logi NA NA NA NA NA NA ...
$ ...73 : logi NA NA NA NA NA NA ...
$ ...74 : logi NA NA NA NA NA NA ...
$ ...75 : logi NA NA NA NA NA NA ...
$ Name...76 : chr "Code" "41255" "41286" "41317" ...
$ TR EU FWD INFL-LKD SWAP 5YF5Y - MIDDLE RATE : chr "TREFS55" "2.2027000000000001" "2.2637" "2.383" ...
You have to specify the col_types correctly in the read_excel (or read_xlsx) command. For example:
dataset <- read_xlsx("Data.xlsx",
col_types=c("numeric","date","numeric","date","numeric", "date", ...))
Edit: Finally, after much interrogation, the problem is that your data starts in row 3, not row 2. So skip the first row (skip=1) and try again.
dataset <- read_xlsx("Data.xlsx", skip=1)
Edit: While this will most likely solve the error you're getting, I agree with Edward's advice to use readxl::read_excel, which should preserve the dates.
The problem with
as.Date(as.numeric(dataset[,c(1,3,5,7,14,16,18,20,21,23,25,32)]), origin = "1899-12-30")
is that you apply as.numeric to a tibble, which internally is a list. Instead, convert the columns one at a time, e.g. with dplyr (across() replaces the deprecated mutate_at()/funs() combination, and composes the two conversions correctly):
dplyr::mutate(
  dataset,
  dplyr::across(
    c(1, 3, 5, 7, 14, 16, 18, 20, 21, 23, 25, 32),
    ~ as.Date(as.numeric(.x), origin = "1899-12-30")
  )
)
You say the columns have a different length but that's not possible in R's table-like structures (tibble, data.frame, data.table).
Lesson: always be aware of what data type you're working with, e.g. by doing str(dataset). as.numeric does not work on whole tables; it needs to be applied to specific columns, using e.g. mutate.
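The same conversion can also be done without dplyr; a base-R sketch using the column positions from the question:

```r
date_cols <- c(1, 3, 5, 7, 14, 16, 18, 20, 21, 23, 25, 32)
# Each column is converted individually, avoiding as.numeric() on the whole tibble
dataset[date_cols] <- lapply(dataset[date_cols],
                             function(x) as.Date(as.numeric(x), origin = "1899-12-30"))
```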

Linear regression of a rectangular table against one set of values

I have a rectangular table with three variables: country, year and inflation. I already have all the descriptives I can get; now I need to do some analytics, and I figured that I should do some linear regression against a target country. The best idea I had was to create a new variable called inflation.in.country.x and loop through the inflation of x in this new column, but that seems like a somewhat unclean solution.
How to get a linear regression of a rectangular data table? The structure is like this:
> dat %>% str
'data.frame': 1196 obs. of 3 variables:
$ Country.Name: Factor w/ 31 levels "Albania","Armenia",..: 9 8 10 11 12 14 15 16 17 19 ...
$ year : chr "1967" "1967" "1967" "1967" ...
$ inflation : num 1.238 8.328 3.818 0.702 1.467 ...
I want to take Armenia's inflation as the dependent variable and Albania's as the independent variable to get a linear regression. Is it possible without transforming the data, while keeping the years aligned?
One way is to spread your data table using Country.Name as key:
dat.spread <- dat %>% spread(key="Country.Name", value="inflation")
dat.spread %>% str
'data.frame': 50 obs. of 31 variables:
$ year : chr "1967" "1968" "1969" "1970" ...
$ Albania : num NA NA NA NA NA NA NA NA NA NA ...
$ Armenia : num NA NA NA NA NA NA NA NA NA NA ...
$ Brazil : num NA NA NA NA NA NA NA NA NA NA ...
[...]
But that forces you to transform the data, which may seem undesirable. Afterwards, you can simply use cbind to run the linear regression against all countries at once:
lm(cbind(Armenia, Brazil, Colombia, etc...) ~ Albania, data = dat.spread)
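Note that spread() has since been superseded by tidyr::pivot_wider(); an equivalent sketch for the single Armenia-on-Albania regression (na.action = na.omit drops years where either series is missing):

```r
library(tidyr)
# Rows of dat.wide correspond to years, so observations stay aligned by year
dat.wide <- pivot_wider(dat, names_from = Country.Name, values_from = inflation)
fit <- lm(Armenia ~ Albania, data = dat.wide, na.action = na.omit)
summary(fit)
```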

R: Error in as.Date.numeric(value) when using is.na() on data

I am having issues using is.na() to change NA to zeroes within a data frame. Initially, I change the date format. As an example, the code looks like this:
Date <- c("16/08/2010 08:00", "17/08/2010 08:00", "18/08/2010 08:00")
Data1 <- c(30,NA,40)
Data2 <- c(50,60,NA)
df <- data.frame(Date,Data1,Data2)
df$Date <- strptime(df$Date, format = "%d/%m/%Y %H:%M", tz = "GMT" )
df$Date <- as.Date(df$Date, origin = df$Date[1])
df[is.na(df)]<-0
which yields the correct result. However, when I apply the same code to my data, I receive the following error, which I cannot figure out:
Error in as.Date.numeric(value) : 'origin' must be supplied
When I use str(data) the output is:
str(data)
'data.frame': 19461 obs. of 6 variables:
$ Date : Date, format: "2008-01-28" "2008-01-28" "2008-01-28" ...
$ NO_flux : num NA 5.33 NA -5.92 -10.87 ...
$ NO2_flux: num NA -12.7 NA -11.5 18.8 ...
$ N2O_flux: num NA NA NA NA NA NA NA NA NA NA ...
$ NH3_flux: num NA NA NA NA NA NA NA NA NA NA ...
$ O3_flux : num NA 313.42 NA 228.41 3.46 ...
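A likely cause (an educated guess, since the full data is not shown): df[is.na(df)] <- 0 also touches the Date column, and assigning the number 0 into a Date calls as.Date.numeric(), which requires an origin. Restricting the replacement to the numeric columns avoids this; a sketch against the str(data) output above:

```r
num_cols <- sapply(data, is.numeric)        # TRUE for the flux columns, FALSE for Date
data[num_cols][is.na(data[num_cols])] <- 0  # zero-fill only the numeric columns
```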
