I am trying to use the pivot_wider function on the PARAMETER column to spread its unique values into separate columns. However, when I do that, I get a lot of NA values. Below is the code I am using so far, along with the result it produces. I have tried several na.omit-related functions, but they just removed all rows.
pivot_wider(names_from = PARAMETER,
values_from = Month_Average)
I am trying to get it into the format below, where everything for a given year and month is on one row.
Year  Month  LAT   LON  Temperature  Humidity  wind_10_meters  wind_50_meters  precipitation
1990  Sep    25.5  -90  95           24        8               8               .5
1991  Oct    25.5  -90  89           20        8               4               1
These aren't accurate numbers, but I want all the information for a given year and month to show in one row. Below I have provided the data that I am working with.
Here is what dput() gave me. I did head() since it was really long.
structure(list(PARAMETER = c("PS", "PS", "PS", "PS", "PS", "PS"
), YEAR = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L), LAT = c(35.25,
35.25, 35.25, 35.25, 35.25, 35.25), LON = c(-71.75, -71.75, -71.75,
-71.75, -71.75, -71.75), ANN = c(101.91, 101.91, 101.91, 101.91,
101.91, 101.91), MONTH = c("NOV", "JAN", "FEB", "MAR", "APR",
"MAY"), Month_Average = c(101.9, 102.01, 102.22, 102.36, 101.87,
101.63)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
dput of the data after running my pivot_wider:
structure(list(YEAR = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L
), LAT = c(35.25, 35.25, 35.25, 35.25, 35.25, 35.25), LON = c(-71.75,
-71.75, -71.75, -71.75, -71.75, -71.75), ANN = c(101.91, 101.91,
101.91, 101.91, 101.91, 101.91), MONTH = c("NOV", "JAN", "FEB",
"MAR", "APR", "MAY"), PS = c(101.9, 102.01, 102.22, 102.36, 101.87,
101.63), T2M = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), RH2M = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), WS10M = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), WS50M = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), PRECTOTCORR = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
Adding Parameter test
structure(list(YEAR = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L
), MONTH = c("APR", "APR", "APR", "APR", "APR", "APR"), LAT = c(35.25,
35.25, 35.25, 35.25, 35.25, 35.25), LON = c(-78.75, -78.75, -78.75,
-78.75, -78.75, -78.75), ANN = c(2.93, 3.42, 5.39, 16.89, 75.28,
101.13), number_of_parameters = c(1L, 1L, 1L, 1L, 1L, 1L)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
YEAR = 1990L, MONTH = "APR", LAT = 35.25, LON = -78.75, .rows = structure(list(
1:6), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L), .drop = TRUE))
Please let me know if I need to include any more information!
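One possible cause, judging from the dput output above, is that ANN holds a different annual value for each parameter, so pivot_wider treats it as part of the row identifier and keeps each parameter on its own row. The sketch below uses a small hypothetical data frame (the T2M values are made up) and assumes that YEAR, MONTH, LAT and LON are the columns that should identify a row:

```r
library(tidyr)

# Hypothetical long-format data: two parameters for the same year, month and
# location. ANN differs per parameter, as in the question's dput output.
long <- data.frame(
  PARAMETER = c("PS", "T2M", "PS", "T2M"),
  YEAR = c(1990L, 1990L, 1990L, 1990L),
  LAT = 35.25,
  LON = -71.75,
  ANN = c(101.91, 17.2, 101.91, 17.2),
  MONTH = c("JAN", "JAN", "FEB", "FEB"),
  Month_Average = c(102.01, 10.5, 102.22, 11.1)
)

# With ANN left in, it becomes part of the row identifier, so each parameter
# stays on its own row and the other parameter columns are filled with NA.
with_ann <- pivot_wider(long, names_from = PARAMETER, values_from = Month_Average)

# Restricting the identifier columns gives one row per year/month/location.
wide <- pivot_wider(
  long,
  id_cols = c(YEAR, MONTH, LAT, LON),
  names_from = PARAMETER,
  values_from = Month_Average
)
```

Dropping ANN before pivoting (for example with dplyr::select(-ANN)) should have the same effect as id_cols here.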
When I import a .gpx file from my navigation software into R using the gpx package, it arrives in this format:
dput(test)
list(routes = list(structure(list(Elevation = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), Time = structure(c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Latitude = c(50.76098, 50.76327, 50.766489,
50.771325, 50.771792, 50.773814, 50.774321, 50.774669, 50.775666,
50.774327), Longitude = c(-1.322124, -1.32737, -1.324514, -1.316833,
-1.314606, -1.300727, -1.294736, -1.290568, -1.27571, -1.263494
), extensions = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-10L))), tracks = list(structure(list(Elevation = logical(0),
Time = logical(0), Latitude = logical(0), Longitude = logical(0)), class = "data.frame", row.names = integer(0))),
waypoints = list(structure(list(Elevation = logical(0), Time = logical(0),
Latitude = logical(0), Longitude = logical(0)), class = "data.frame", row.names = integer(0))))
What I want to do is export a gpx back to the same software. I have tried using the pgirmess package but it no longer supports the writeGPX command seemingly.
Then I tried to adapt the answer given in the question "R Convert GPS to GPX with timestamp", but I am not looking to export a track. I want a route for people to follow, so the timestamp is not relevant.
I also tried using the rgdal package as shown below.
my data
dput(routes)
list(structure(list(Elevation = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Latitude = c(50.768333, 50.771833,
50.7735, 50.769167, 50.77, 50.769167), Longitude = c(-1.307167,
-1.295833, -1.292667, -1.286667, -1.295833, -1.2775), Name = c("3X Donna",
"3Z Trinity House Buoy", "33 Prince Consort", "34 Cowes Corinthian",
"39 Snowden", "4K Royal London YC")), class = "data.frame", row.names = c(53L,
55L, 58L, 59L, 60L, 70L)))
Then I run the code below, but still no joy:
library(rgdal)
coordinates(routes) <- ~Latitude+Longitude
proj4string(routes) <- "+proj=longlat +datum=WGS84"
writeOGR(routes, dsn = "routes.gpx", layer = "routes", driver = "GPX",
dataset_options = "GPX_USE_EXTENSIONS=yes")
I would like ideally to export a route not a track.
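Since a GPX route is just an rte element containing rtept points, one way to sidestep the spatial packages is to write the XML by hand. This is a minimal sketch in base R, not a tested export for any particular navigation software: the route data frame is a hypothetical cut of the data above, the route name and creator string are placeholders, and point names are assumed to contain no XML special characters.

```r
# Hypothetical route points, taken from the first rows of the question's data.
route <- data.frame(
  Latitude = c(50.768333, 50.771833, 50.7735),
  Longitude = c(-1.307167, -1.295833, -1.292667),
  Name = c("3X Donna", "3Z Trinity House Buoy", "33 Prince Consort")
)

# Build one <rtept> element per row. Names are assumed to contain no XML
# special characters (&, <, >), which would need escaping first.
rtept <- sprintf(
  '    <rtept lat="%.6f" lon="%.6f"><name>%s</name></rtept>',
  route$Latitude, route$Longitude, route$Name
)

# Wrap the points in a GPX 1.1 document containing a single route.
gpx <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<gpx version="1.1" creator="R" xmlns="http://www.topografix.com/GPX/1/1">',
  '  <rte>',
  '    <name>routes</name>',
  rtept,
  '  </rte>',
  '</gpx>'
)

# Written to a temporary directory here; point out_file wherever is needed.
out_file <- file.path(tempdir(), "routes.gpx")
writeLines(gpx, out_file)
```

Because the points are written as rtept rather than trkpt, no timestamp is needed.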
I am currently working to merge two datasets in R. The first is a cross-national longitudinal dataset of democracy scores and inequality levels for countries over hundreds of years (15,034 observations, dat_as). The second is a cross-national longitudinal dataset of whether a given country in a given year has a legislature (27,192 observations, dat_vdem). I want to attach the legislatures data to the inequality data. The goal is to have a final df with the same number of observations (15,034). If there is a match, merge the data. If there is not a match, just insert an NA for the row. Every approach I have tried in R does not work. For example, using this code I get a df with 2,558,975 observations.
# load data
dat_as <- read.csv("as.csv")
dat_vdem <- read.csv("vdem.csv")
# merge
test_df <- merge(dat_as, dat_vdem, by = c("code"))
Using this code, however, I get a df with 13,355 observations.
test_df <- merge(dat_as, dat_vdem, by = c("country", "year"))
What am I doing wrong? Any help would be appreciated. Below are reproducible data.
Here is the dat_as:
structure(list(X = 1:6, country = c("United States", "United States",
"United States", "United States", "United States", "United States"
), year = 1800:1805, scode = c("USA", "USA", "USA", "USA", "USA",
"USA"), code = c("USA", "USA", "USA", "USA", "USA", "USA"), democracy = c(1L,
1L, 1L, 1L, 1L, 1L), lagdemocracy = c(NA, 1L, 1L, 1L, 1L, 1L),
lbmginiint = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), lbmgdppint = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), ldemlbmginiint = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), ldemlbmgdppint = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), yearsq = c(3240000,
3243601, 3247204, 3250809, 3254416, 3258025), legislature = c(NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
Here is the dat_vdem:
structure(list(X = 1:6, year = 1800:1805, country = c("United States", "United States", "United States", "United States", "United States", "United States"), code = c("USA",
"USA", "USA", "USA", "USA", "USA"), v2lgbicam = c(0L, 0L, 0L,
0L, 0L, 0L), v2lgqstexp = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), v2lgotovst = c(-2.1, -2.1, -2.1, -2.1, -2.1,
-2.1), v2lginvstp = c(-2.05, -2.05, -2.05, -2.05, -2.05, -2.05
), legislature = c(0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA,
6L), class = "data.frame")
You're describing a left join. I find it easiest to do this with dplyr:
dplyr::left_join(dat_as, dat_vdem)
By default it will try and guess which key variables to match by. With the sample data you provided, it matched by "X", "country", "year", "code", "legislature". But you can specify them if need be.
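For comparison, the same left join can be written with base merge(). Merging on "code" alone pairs every row of a country in one table with every row of that country in the other, which is why the row count explodes. Merging on country and year without all.x is an inner join, which drops the unmatched rows. A minimal sketch with hypothetical miniature data, assuming dat_vdem has at most one row per country and year:

```r
# Hypothetical miniature versions of the two data sets.
dat_as <- data.frame(
  country = c("United States", "United States", "France"),
  year = c(1800L, 1801L, 1800L),
  democracy = c(1L, 1L, 0L)
)
dat_vdem <- data.frame(
  country = c("United States", "United States"),
  year = c(1800L, 1801L),
  legislature = c(0L, 0L)
)

# all.x = TRUE keeps every row of dat_as; rows without a match in dat_vdem
# get NA in the columns that come from dat_vdem.
merged <- merge(dat_as, dat_vdem, by = c("country", "year"), all.x = TRUE)
```

The dplyr equivalent would be left_join(dat_as, dat_vdem, by = c("country", "year")). In either case, columns that exist in both tables but are not join keys (such as X or legislature in the sample data) come back with suffixes.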
I am attempting to impute data in my validation set, which follows the MICE imputation model from my train set using mice.reuse(). Imputation is following data split as they'll be used to train/val ML algorithms. Data (n=720) is comprised of clinical and lab measurement data (numeric or factor).
See below the steps:
# first split data
sample_size <- floor(0.75 * nrow(data))
train_data <- sample(seq_len(nrow(data)), size = sample_size)
train <- data[train_data, ]
val <- data[-train_data, ]
# mice imputation - train set
train_imp <- mice(train,
predictorMatrix = predM,
method = 'pmm',
m = 5, maxit = 10, print = FALSE,
seed = 500)
# apply train data imp model to val data
val_imp <- mice.reuse(train_imp, val, maxit = 1)
After this step, I receive the following error message:
Error in value[[3L]](cond) :
Error in doTryCatch(return(expr), name, parentenv, handler): Missing left after imputation
In addition: There were 50 or more warnings (use warnings() to see the first 50)
The data looks something like this:
structure(list(male = structure(c(2L, 1L, 2L, 1L, 1L), .Label = c("FALSE",
"TRUE"), class = "factor"), age = c(55.7864476386037, 55.895961670089,
41.0376454483231, 29.6563997262149, 57.2183436002738), bmi = c(36.6115389471026,
31.5536591487683, 22.7903289734443, 42.5307689412473, 33.6484537734337
), waist_circum = c(126, 103, 91, 133, 105), bp_sys = c(147,
NA, 100, 160, 135), bp_dia = c(82, NA, 60, 81, 70), t2dm = structure(c(1L,
2L, 1L, 2L, 1L), .Label = c("0", "1"), class = "factor"), hdl = c(NA,
1.35, NA, NA, NA), ldl = c(3.41, 3.28, 2.87, 3.7, 3.59), triglyceride = c(2.54,
1.04, 1.55, NA, 3.43), cholesterol = c(5.08, 5.1, 4.35, 4.82,
6.06), alt = c(41, 26, 48, 31, 31), ast = c(33, 28, 33, 21, 28
), ggt = c(33, 65, 42, 45, 26), alp = c(70, 79, 51, 88, 70),
platelet = c(156, 313, 308, 337, 186), hb = c(14.4, 11.5,
15.3, 14.5, 14.2), tsat = c(19, 18, 28, 25, 32), albu = c(4.9,
4.5, 5.4, 4.3, 4.6), egfr = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), ferritin = c(171, 70, 499, 156, 94),
pt = c(1.04, 0.98, 0.99, 1.12, 0.97), bili = c(11, 5, 6,
8, 14), timp1 = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), p3np = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), ha = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), glucose = c(5.7, 6.1, 5.4, 13.9, 5.1), insulin_fasted = c(29.8,
36.2, 4.9, 11.3, 14.5), hba1c = c(35.522, 58.475, 40.987,
79.242, 38.801), fibrinogen = c(NA, 5.9, 3.5, 5.9, 3.1),
fib_score = c(3, 0, 0, 0, 2), steatosis_score = c(1, 1, 3,
3, 3), inflam_score = c(2, 0, 1, 0, 1), balloon_score = c(2,
0, 1, 0, 1), fnash = structure(c(2L, 1L, 1L, 1L, 2L), .Label = c("FALSE",
"TRUE"), class = "factor"), f2 = structure(c(2L, 1L, 1L,
1L, 2L), .Label = c("FALSE", "TRUE"), class = "factor"),
f3 = structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("FALSE",
"TRUE"), class = "factor"), nash = structure(c(2L, 1L, 2L,
1L, 2L), .Label = c("FALSE", "TRUE"), class = "factor"),
fib4 = c(1.84300333252591, 0.980635141316159, 0.63463649053119,
0.331915071889987, 1.54703279971167), proc6 = c(9.2, 5.5, 6.2, 5.9, 10.5
), proc4 = c(470.9, 298.5, 458.2, 336.7, 516.3), t2 = c(NA,
60.6857152393886, 48.2999977793012, NA, 93.842859513419),
i7 = c(NA, 113.817410696955, 123.560049194916, NA, 130.116414030393
), fibrosc_stiffness = c(14.1, 7, 5.2, 7, 8.8), prognostic_score = c(-1.908509,
-25.028315, -3.495727, -27.147848, -10.20379), probability_steatosis = c(NA,
0.77446951, 0.114093059, 0.654343677, 0.000105034), diafir.h = c(0.615568767767871,
0.534318429857294, 0.890916911895936, NA, 0.779479718924847
), gdf15 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
I can't seem to work out the problem. Any tips here?
I've also tried to apply the ignore argument when running MICE itself, but this seems to take a very long time. Has anyone else experienced this?
I have this data and I want to make a new column:
structure(list(AGE_GROUP = c("21-30", "31-40", "41-50"), DATE = c("12/17/2020",
"12/17/2020", "12/17/2020"), VACCINE_COUNT = c(36L, 47L, 26L),
PERC_TOTAL_VACC = c(24.82758621, 32.4137931, 17.93103448),
RECIPIENT_COUNT = c(NA_integer_, NA_integer_, NA_integer_
), PERC_TOTAL_RECIP = c(NA_real_, NA_real_, NA_real_), RECIP_FULLY_VACC = c(NA_integer_,
NA_integer_, NA_integer_), PERC_FULLY_VACC = c(NA_real_,
NA_real_, NA_real_)), row.names = c(NA, 3L), class = "data.frame")
Based on age group, I want to make a column that includes these numbers, c(8, 12,13,16,14,12), and repeat this column 3 times. So the outcome is a new column that contains the mentioned numbers 3 times.
I have used this code:
vaccine <- vaccine %>% mutate(new_col = rep(list(vals), n())) %>% unnest()
and I get something like this:
"12/18/2020", "12/18/2020"), VACCINE_COUNT = c(421L, 421L, 421L
), PERC_TOTAL_VACC = c(15.52932497, 15.52932497, 15.52932497),
RECIPIENT_COUNT = c(NA_integer_, NA_integer_, NA_integer_
), PERC_TOTAL_RECIP = c(NA_real_, NA_real_, NA_real_), RECIP_FULLY_VACC = c(NA_integer_,
NA_integer_, NA_integer_), PERC_FULLY_VACC = c(NA_real_,
NA_real_, NA_real_), X = c(NA, NA, NA), X.1 = c(14L, 14L,
14L), new_col = c(8, 12, 13)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
While I want to keep my data and just repeat the data
Do you mean to repeat the values c(8, 12,13,16,14,12) for each row in the dataframe? Try:
library(dplyr)
library(tidyr)
vals <- c(8, 12,13,16,14,12)
df %>%
mutate(new_col = rep(list(vals), n())) %>%
unnest(new_col)
Using base R
transform(df1[rep(seq_len(nrow(df1)), each = length(vals)),], new_col = vals)
Or with uncount
library(dplyr)
library(tidyr)
df1 %>%
uncount(length(vals)) %>%
mutate(new_col = rep(vals, length.out = n()))
If we need to just replicate and store the column, wrap in a list
df1 %>%
mutate(new_col = list(vals))
data
vals <- c(8, 12,13,16,14,12)
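To check what the base R approach above produces, here is a small self-contained sketch on a hypothetical three-row frame (only two of the original columns are kept):

```r
# Hypothetical three-row frame standing in for the vaccine data.
df1 <- data.frame(
  AGE_GROUP = c("21-30", "31-40", "41-50"),
  VACCINE_COUNT = c(36L, 47L, 26L)
)
vals <- c(8, 12, 13, 16, 14, 12)

# Repeat each row index length(vals) times, then let transform() recycle
# vals down the expanded frame.
idx <- rep(seq_len(nrow(df1)), each = length(vals))
expanded <- transform(df1[idx, ], new_col = vals)
```

Each original row appears length(vals) times, and new_col cycles through vals once per age group.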
I'm currently working with two data frames, one that I simply call Data and another one called DataOutput. Data has over 400 thousand observations of 21 variables, and DataOutput only has 4 observations of 21 variables. DataOutput is a data frame that holds, for each variable, the counts of NA and OOR (out of range) values, #Measurements, and the Ratio ((NA+OOR)/#Measurements). The Data data frame currently holds a lot of columns that only contain NA, because there were simply no measurements of those variables.
I want to get rid of the columns that only have NA in them.
for(z in 2:22)
{
if(DataOutput[4,z] == 1) //This is the ratio ((NA+OOR)/#Measurements) == 1
{
Data <- subset(Data, select = -Data[,z] )
}
}
I tried to do it like this, but it does not work. I can't simply hard-code the columns I need to delete, like select = -c(a,b) or something, because the code has to work with many different data frames of the same size. They all have different columns that are completely unavailable.
Any Ideas? Help is really appreciated!
> dput(head(Data))
structure(list(StartTime = structure(c(1169218200, 1169218800,
1169219400, 1169220000, 1169220600, 1169221200), class = c("POSIXct",
"POSIXt"), tzone = ""), Latitude = c(15.6383658333333, 15.648397,
15.6581663333333, 15.6680338333333, 15.6778031666667, 15.6876706666667
), Longitude = c(15.8445643333333, 15.8549853333333, 15.8651343333333,
15.8753853333333, 15.8855343333333, 15.8957853333333), GPSSpeed = c(NA,
NA, 315, 315, 315, 315), LogSpeed = c(NA, NA, 696.091532743333,
697.291378813333, 698.491512383334, 699.691533736667), WindSpeedRel = c(NA,
NA, 1.03611152968314, 1.00016348803882, 1.06045149695061, 0.995509934806929
), WindDirRel = c(NA, NA, 1.38425886694239, 1.29982376776468,
1.37160349066357, 1.33137136705896), Course = c(NA, NA, NA, NA,
NA, NA), SeaDepth = c(NA, NA, NA, NA, NA, NA), DraftFWD = c(NA,
NA, NA, NA, NA, NA), DraftAFT = c(NA, NA, NA, NA, NA, NA), Rudder = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), PropellerKW = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), MEConsumption = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), MERPM = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), MELoad = c(NA,
NA, NA, NA, NA, NA), DGConsumption = c(NA, NA, NA, NA, NA, NA
), DG1_Load = c(NA, NA, NA, NA, NA, NA), DG2_Load = c(NA, NA,
NA, NA, NA, NA), DG3_Load = c(NA, NA, NA, NA, NA, NA), DG4_Load = c(NA,
NA, NA, NA, NA, NA), DG5_Load = c(NA, NA, NA, NA, NA, NA)), .Names = c("StartTime",
"Latitude", "Longitude", "GPSSpeed", "LogSpeed", "WindSpeedRel",
"WindDirRel", "Course", "SeaDepth", "DraftFWD", "DraftAFT", "Rudder",
"PropellerKW", "MEConsumption", "MERPM", "MELoad", "DGConsumption",
"DG1_Load", "DG2_Load", "DG3_Load", "DG4_Load", "DG5_Load"), row.names = c(NA,
6L), class = "data.frame")
and
> dput(head(DataOutput))
structure(list(Type = structure(c(1L, 3L, 2L, 4L), .Label = c("#Measurements",
"NA", "OOR", "Ratio((NA+OOR)/#M"), class = "factor"), Latitude = c(67879,
0, 19829, 0.292122747830699), Longitude = c(67879, 0, 19829,
0.292122747830699), GPSSpeed = c(67879, 7, 19904, 0.293330779769885
), LogSpeed = c(67879, 18235, 49621, 0.999661161773155), WindSpeedRel = c(67879,
392, 38297, 0.569970093843457), WindDirRel = c(67879, 0, 38297,
0.564195111890275), Course = c(67879, 0, 67879, 1), SeaDepth = c(67879,
0, 67879, 1), DraftFWD = c(67879, 0, 67879, 1), DraftAFT = c(67879,
0, 67879, 1), Rudder = c(67879, 46675, 21204, 1), PropellerKW = c(67879,
5857, 21332, 0.400550980421043), MEConsumption = c(67879, 10,
21185, 0.312246792085918), MERPM = c(67879, 5105, 22030, 0.399755447192799
), MELoad = c(67879, 0, 67879, 1), DGConsumption = c(67879, 0,
67879, 1), DG1_Load = c(67879, 0, 67879, 1), DG2_Load = c(67879,
0, 67879, 1), DG3_Load = c(67879, 0, 67879, 1), DG4_Load = c(67879,
0, 67879, 1), DG5_Load = c(67879, 0, 67879, 1)), .Names = c("Type",
"Latitude", "Longitude", "GPSSpeed", "LogSpeed", "WindSpeedRel",
"WindDirRel", "Course", "SeaDepth", "DraftFWD", "DraftAFT", "Rudder",
"PropellerKW", "MEConsumption", "MERPM", "MELoad", "DGConsumption",
"DG1_Load", "DG2_Load", "DG3_Load", "DG4_Load", "DG5_Load"), row.names = c(NA,
4L), class = "data.frame")
I think you can subset directly here without looping:
Data[,DataOutput[4,]!=1]
and if you didn't have DataOutput and wanted to get rid of columns filled exclusively with NAs, you could do something like this:
Data[,colSums(is.na(Data))!=nrow(Data)]
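As a quick check of the colSums(is.na(...)) approach, here is a minimal sketch on a hypothetical three-column frame. The drop = FALSE argument is an addition that keeps the result a data frame even when only one column survives:

```r
# Hypothetical data: column b is entirely NA, column c is only partly NA.
Data <- data.frame(
  a = c(1, 2, 3),
  b = c(NA, NA, NA),
  c = c(NA, 5, 6)
)

# A column is all-NA when its NA count equals the number of rows.
all_na <- colSums(is.na(Data)) == nrow(Data)

# Keep every column that is not all-NA.
cleaned <- Data[, !all_na, drop = FALSE]
```

Only b is dropped; c stays because it still has measurements.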