I am trying to use the pivot_wider function on the PARAMETER column to spread its unique values into columns; however, when I do that, it gives me a bunch of NA values. Below is the code I am using so far, but it produces the result in the picture below. I have tried several na.omit-related functions, which just removed all rows.
pivot_wider(names_from = PARAMETER,
values_from = Month_Average)
I am trying to get it into the format below, where everything is on one row:
Year  Month  LAT   LON  Temperature  Humidity  wind_10_meters  wind_50_meters  precipitation
1990  Sep    25.5  -90  95           24        8               8               .5
1991  Oct    25.5  -90  89           20        8               4               1
These aren't accurate numbers, but I want all the information for a given year and month to show in one row. Below I have provided the data that I am working with.
Here is what dput() gave me. I did head() since it was really long.
structure(list(PARAMETER = c("PS", "PS", "PS", "PS", "PS", "PS"
), YEAR = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L), LAT = c(35.25,
35.25, 35.25, 35.25, 35.25, 35.25), LON = c(-71.75, -71.75, -71.75,
-71.75, -71.75, -71.75), ANN = c(101.91, 101.91, 101.91, 101.91,
101.91, 101.91), MONTH = c("NOV", "JAN", "FEB", "MAR", "APR",
"MAY"), Month_Average = c(101.9, 102.01, 102.22, 102.36, 101.87,
101.63)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
dput of the data after running my pivot_wider:
structure(list(YEAR = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L
), LAT = c(35.25, 35.25, 35.25, 35.25, 35.25, 35.25), LON = c(-71.75,
-71.75, -71.75, -71.75, -71.75, -71.75), ANN = c(101.91, 101.91,
101.91, 101.91, 101.91, 101.91), MONTH = c("NOV", "JAN", "FEB",
"MAR", "APR", "MAY"), PS = c(101.9, 102.01, 102.22, 102.36, 101.87,
101.63), T2M = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), RH2M = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), WS10M = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), WS50M = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), PRECTOTCORR = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
Adding Parameter test
structure(list(YEAR = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L
), MONTH = c("APR", "APR", "APR", "APR", "APR", "APR"), LAT = c(35.25,
35.25, 35.25, 35.25, 35.25, 35.25), LON = c(-78.75, -78.75, -78.75,
-78.75, -78.75, -78.75), ANN = c(2.93, 3.42, 5.39, 16.89, 75.28,
101.13), number_of_parameters = c(1L, 1L, 1L, 1L, 1L, 1L)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
YEAR = 1990L, MONTH = "APR", LAT = 35.25, LON = -78.75, .rows = structure(list(
1:6), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L), .drop = TRUE))
Please let me know if I need to include any more information!
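One possible cause, judging from the "Adding Parameter test" output above: ANN takes a different value for each parameter within the same YEAR/MONTH/LAT/LON, and pivot_wider() treats every column not named in names_from or values_from as an identifier. Each parameter then lands on its own row, padded with NA. A minimal sketch, assuming ANN is not needed in the wide result (the T2M values below are made up for illustration):

```r
library(dplyr)
library(tidyr)

# Toy data in the same shape as the question; the T2M numbers are invented
df <- tibble(
  PARAMETER = c("PS", "T2M", "PS", "T2M"),
  YEAR = 1990L,
  LAT = 35.25,
  LON = -71.75,
  ANN = c(101.91, 20.5, 101.91, 20.5),   # differs per parameter
  MONTH = c("JAN", "JAN", "FEB", "FEB"),
  Month_Average = c(102.01, 18.2, 102.22, 18.9)
)

# Drop ANN so that only YEAR, LAT, LON and MONTH identify a row
wide <- df %>%
  select(-ANN) %>%
  pivot_wider(names_from = PARAMETER, values_from = Month_Average)

print(wide)
```

An alternative that leaves the input untouched is to pass id_cols = c(YEAR, MONTH, LAT, LON) to pivot_wider().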
I am currently working to merge two datasets in R. The first is a cross-national longitudinal dataset of democracy scores and inequality levels for countries over hundreds of years (15,034 observations, dat_as). The second is a cross-national longitudinal dataset of whether a given country in a given year has a legislature (27,192 observations, dat_vdem). I want to attach the legislatures data to the inequality data. The goal is to have a final df with the same number of observations (15,034). If there is a match, merge the data. If there is not a match, just insert an NA for the row. Every approach I have tried in R does not work. For example, using this code I get a df with 2,558,975 observations.
# load data
dat_as <- read.csv("as.csv")
dat_vdem <- read.csv("vdem.csv")
# merge
test_df <- merge(dat_as, dat_vdem, by = c("code"))
Using this code, however, I get a df with 13,355 observations.
test_df <- merge(dat_as, dat_vdem, by = c("country", "year"))
What am I doing wrong? Any help would be appreciated. Below are reproducible data.
Here is the dat_as:
structure(list(X = 1:6, country = c("United States", "United States",
"United States", "United States", "United States", "United States"
), year = 1800:1805, scode = c("USA", "USA", "USA", "USA", "USA",
"USA"), code = c("USA", "USA", "USA", "USA", "USA", "USA"), democracy = c(1L,
1L, 1L, 1L, 1L, 1L), lagdemocracy = c(NA, 1L, 1L, 1L, 1L, 1L),
lbmginiint = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), lbmgdppint = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), ldemlbmginiint = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), ldemlbmgdppint = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), yearsq = c(3240000,
3243601, 3247204, 3250809, 3254416, 3258025), legislature = c(NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
Here is the dat_vdem:
structure(list(X = 1:6, year = 1800:1805, country = c("United States", "United States", "United States", "United States", "United States", "United States"), code = c("USA",
"USA", "USA", "USA", "USA", "USA"), v2lgbicam = c(0L, 0L, 0L,
0L, 0L, 0L), v2lgqstexp = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), v2lgotovst = c(-2.1, -2.1, -2.1, -2.1, -2.1,
-2.1), v2lginvstp = c(-2.05, -2.05, -2.05, -2.05, -2.05, -2.05
), legislature = c(0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA,
6L), class = "data.frame")
You're describing a left join. The way I find easiest is to use dplyr:
dplyr::left_join(dat_as, dat_vdem)
By default it will try to guess which key variables to match by. With the sample data you provided, it matched by "X", "country", "year", "code", "legislature". But you can specify them with the by argument if need be.
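For illustration, here is a minimal sketch with the keys given explicitly, assuming country and year together identify a row in both tables. The toy frames below only mimic the structure of dat_as and dat_vdem; they are not the real data.

```r
library(dplyr)

# Toy stand-ins for the two datasets (values are illustrative)
dat_as <- data.frame(
  country = "United States",
  year = 1800:1802,
  democracy = 1L
)
dat_vdem <- data.frame(
  country = "United States",
  year = c(1800L, 1801L),
  legislature = c(0L, 0L)
)

# Duplicate country-year keys on the right side would add rows to the result;
# this returns 0 when there are none
anyDuplicated(dat_vdem[c("country", "year")])

# Keep every row of dat_as; unmatched rows get NA for legislature
merged <- left_join(
  dat_as,
  dat_vdem %>% select(country, year, legislature),
  by = c("country", "year")
)

print(merged)
```

In base R, merge(dat_as, dat_vdem, by = c("country", "year"), all.x = TRUE) does the same kind of join; without all.x = TRUE, merge() drops the unmatched rows, which would explain getting fewer rows than dat_as has.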
Some data:
x %>% dput
structure(list(date = structure(c(18782, 18783, 18784, 18785,
18786, 18787, 18789, 18791, 18792, 18793, 18795, 18797, 18798,
18799, 18801, 18803, 18805, 18806), class = "Date"), `Expired Trials` = c(3L,
1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), `Trial Sign Ups` = c(3L, 1L, 1L, 2L, 3L, 4L, 1L, 1L, 1L,
1L, 2L, 1L, 3L, 2L, 2L, 1L, 1L, 1L), `Total Site Conversions` = c(3,
1, 1, 2, 3, 4, 1, 1, 1, 1, 2, 1, 3, 2, 2, 1, 1, 1), `Site Conversion Rate` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `Trial to Paid Conversion Rate` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_)), row.names = c(NA, -18L), class = c("tbl_df",
"tbl", "data.frame"))
The context is a Shiny app where the field 'Sessions' will sometimes exist and sometimes not, depending on the user's selections. Rather than display the red warning message, I want nothing or a blank plot shown instead of the error:
x %>%
ggplot(aes(date, Sessions)) +
geom_col(na.rm = T) +
geom_line(aes(y = `Site Conversion Rate`), na.rm = T)
Error in FUN(X[[i]], ...) : object 'Sessions' not found
Tried:
tryCatch(expr = {x %>%
ggplot(aes(date, Sessions)) +
geom_col(na.rm = T) +
geom_line(aes(y = `Site Conversion Rate`), na.rm = T)
},
error = function(e) {message(''); print(e)},
finally = {ggplot() + theme_void()})
But this still prints the error; I wanted a blank plot instead.
How can I do this?
Consider using an if/else expression with all(), i.e. plot only if all the column names used in the plot are present, and otherwise return a blank plot:
nm1 <- c("date", "Sessions", "Site Conversion Rate")
if (!all(nm1 %in% names(x))) {
  message("Not all columns are found")
  ggplot()
} else {
  x %>%
    ggplot(aes(date, Sessions)) +
    geom_col(na.rm = TRUE) +
    geom_line(aes(y = `Site Conversion Rate`), na.rm = TRUE)
}
Another option is purrr's possibly(), specifying the otherwise argument:
library(purrr)
f1 <- function(x) {
  p1 <- x %>%
    ggplot(aes(date, Sessions)) +
    geom_col(na.rm = TRUE) +
    geom_line(aes(y = `Site Conversion Rate`), na.rm = TRUE)
  print(p1)
}
f1p <- possibly(f1, otherwise = ggplot())
-testing
f1p(x)
Or a modification of the OP's tryCatch
tryCatch(expr = {print(x %>%
ggplot(aes(date, Sessions)) +
geom_col(na.rm = T) +
geom_line(aes(y = `Site Conversion Rate`), na.rm = TRUE))
},
error = function(e) {message(''); print(e)},
finally = {
ggplot() +
theme_void()
})
<simpleError in FUN(X[[i]], ...): object 'Sessions' not found>
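Note that tryCatch() returns the value of expr or of the matching handler; the finally block is evaluated only for its side effects, so a plot built there is never returned. A minimal sketch of a variant in which the error handler itself returns the blank plot (safe_plot is a hypothetical helper name, and ggplot_build() is used here only to force the missing-column error to surface inside tryCatch()):

```r
library(ggplot2)

# Hypothetical helper: returns the real plot, or a blank one if building fails
safe_plot <- function(data) {
  tryCatch(
    {
      p <- ggplot(data, aes(date, Sessions)) +
        geom_col(na.rm = TRUE)
      ggplot_build(p)  # evaluates the aesthetics, so a missing column errors here
      p
    },
    error = function(e) ggplot() + theme_void()
  )
}

# Toy data without a Sessions column: the fallback plot is returned
no_sessions <- data.frame(date = Sys.Date() + 0:2, n = 1:3)
blank <- safe_plot(no_sessions)

# Toy data with a Sessions column: the real plot is returned
with_sessions <- data.frame(date = Sys.Date() + 0:2, Sessions = c(5, 3, 8))
real <- safe_plot(with_sessions)
```

Inside a Shiny renderPlot(), returning safe_plot(x) as the last expression should then display either the chart or an empty panel.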
I have this data and I want to make a new column:
structure(list(AGE_GROUP = c("21-30", "31-40", "41-50"), DATE = c("12/17/2020",
"12/17/2020", "12/17/2020"), VACCINE_COUNT = c(36L, 47L, 26L),
PERC_TOTAL_VACC = c(24.82758621, 32.4137931, 17.93103448),
RECIPIENT_COUNT = c(NA_integer_, NA_integer_, NA_integer_
), PERC_TOTAL_RECIP = c(NA_real_, NA_real_, NA_real_), RECIP_FULLY_VACC = c(NA_integer_,
NA_integer_, NA_integer_), PERC_FULLY_VACC = c(NA_real_,
NA_real_, NA_real_)), row.names = c(NA, 3L), class = "data.frame")
Based on age group, I want to make a column that contains the numbers c(8, 12, 13, 16, 14, 12), and repeat this column 3 times. So the outcome is a new column that holds those numbers three times.
I have used this code: vaccine <- vaccine %>% mutate(new_col = rep(list(vals), n())) %>% unnest()
and I get something like this:
"12/18/2020", "12/18/2020"), VACCINE_COUNT = c(421L, 421L, 421L
), PERC_TOTAL_VACC = c(15.52932497, 15.52932497, 15.52932497),
RECIPIENT_COUNT = c(NA_integer_, NA_integer_, NA_integer_
), PERC_TOTAL_RECIP = c(NA_real_, NA_real_, NA_real_), RECIP_FULLY_VACC = c(NA_integer_,
NA_integer_, NA_integer_), PERC_FULLY_VACC = c(NA_real_,
NA_real_, NA_real_), X = c(NA, NA, NA), X.1 = c(14L, 14L,
14L), new_col = c(8, 12, 13)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
But I want to keep my data and just repeat the values.
Do you mean to repeat the values c(8, 12, 13, 16, 14, 12) for each row in the dataframe? Try:
library(dplyr)
library(tidyr)
vals <- c(8, 12,13,16,14,12)
df %>%
mutate(new_col = rep(list(vals), n())) %>%
unnest(new_col)
Using base R
transform(df1[rep(seq_len(nrow(df1)), each = length(vals)),], new_col = vals)
Or with uncount
library(dplyr)
library(tidyr)
df1 %>%
uncount(length(vals)) %>%
mutate(new_col = rep(vals, length.out = n()))
If we need to just replicate and store the column, wrap in a list
df1 %>%
mutate(new_col = list(vals))
data
vals <- c(8, 12,13,16,14,12)
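To check the shape of the result, here is a small sketch on a made-up three-row frame (only two of the question's columns are kept, purely for illustration). Each of the 3 rows is expanded to 6 rows, one per value in vals, giving 18 rows in total:

```r
library(dplyr)
library(tidyr)

vals <- c(8, 12, 13, 16, 14, 12)

# Toy frame with one row per age group
df1 <- data.frame(
  AGE_GROUP = c("21-30", "31-40", "41-50"),
  VACCINE_COUNT = c(36L, 47L, 26L)
)

# Attach the full vector to every row as a list column, then expand it
long <- df1 %>%
  mutate(new_col = rep(list(vals), n())) %>%
  unnest(new_col)

nrow(long)  # 3 rows * 6 values
```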
I'm currently working with two data frames, one that I simply call Data and another called DataOutput. Data has over 400 thousand observations of 21 variables, and DataOutput has only 4 observations of 21 variables. DataOutput records, for each variable, the number of measurements (#Measurements), how many values are NA, how many are OOR (out of range), and the ratio (NA+OOR)/#Measurements. Data currently holds a lot of columns that contain only NA because there were simply no measurements of those variables.
I want to get rid of the columns that only have NA in them.
for(z in 2:22)
{
if(DataOutput[4,z] == 1) # This is the ratio ((NA+OOR)/#Measurements) == 1
{
Data <- subset(Data, select = -Data[,z] )
}
}
I tried it like this, but it does not work. I can't simply hard-code the columns I need to delete, like select = -c(a, b), because the code has to work with many different data frames of the same size. They each have different columns that are completely unavailable.
Any ideas? Help is really appreciated!
> dput(head(Data))
structure(list(StartTime = structure(c(1169218200, 1169218800,
1169219400, 1169220000, 1169220600, 1169221200), class = c("POSIXct",
"POSIXt"), tzone = ""), Latitude = c(15.6383658333333, 15.648397,
15.6581663333333, 15.6680338333333, 15.6778031666667, 15.6876706666667
), Longitude = c(15.8445643333333, 15.8549853333333, 15.8651343333333,
15.8753853333333, 15.8855343333333, 15.8957853333333), GPSSpeed = c(NA,
NA, 315, 315, 315, 315), LogSpeed = c(NA, NA, 696.091532743333,
697.291378813333, 698.491512383334, 699.691533736667), WindSpeedRel = c(NA,
NA, 1.03611152968314, 1.00016348803882, 1.06045149695061, 0.995509934806929
), WindDirRel = c(NA, NA, 1.38425886694239, 1.29982376776468,
1.37160349066357, 1.33137136705896), Course = c(NA, NA, NA, NA,
NA, NA), SeaDepth = c(NA, NA, NA, NA, NA, NA), DraftFWD = c(NA,
NA, NA, NA, NA, NA), DraftAFT = c(NA, NA, NA, NA, NA, NA), Rudder = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), PropellerKW = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), MEConsumption = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), MERPM = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), MELoad = c(NA,
NA, NA, NA, NA, NA), DGConsumption = c(NA, NA, NA, NA, NA, NA
), DG1_Load = c(NA, NA, NA, NA, NA, NA), DG2_Load = c(NA, NA,
NA, NA, NA, NA), DG3_Load = c(NA, NA, NA, NA, NA, NA), DG4_Load = c(NA,
NA, NA, NA, NA, NA), DG5_Load = c(NA, NA, NA, NA, NA, NA)), .Names = c("StartTime",
"Latitude", "Longitude", "GPSSpeed", "LogSpeed", "WindSpeedRel",
"WindDirRel", "Course", "SeaDepth", "DraftFWD", "DraftAFT", "Rudder",
"PropellerKW", "MEConsumption", "MERPM", "MELoad", "DGConsumption",
"DG1_Load", "DG2_Load", "DG3_Load", "DG4_Load", "DG5_Load"), row.names = c(NA,
6L), class = "data.frame")
and
> dput(head(DataOutput))
structure(list(Type = structure(c(1L, 3L, 2L, 4L), .Label = c("#Measurements",
"NA", "OOR", "Ratio((NA+OOR)/#M"), class = "factor"), Latitude = c(67879,
0, 19829, 0.292122747830699), Longitude = c(67879, 0, 19829,
0.292122747830699), GPSSpeed = c(67879, 7, 19904, 0.293330779769885
), LogSpeed = c(67879, 18235, 49621, 0.999661161773155), WindSpeedRel = c(67879,
392, 38297, 0.569970093843457), WindDirRel = c(67879, 0, 38297,
0.564195111890275), Course = c(67879, 0, 67879, 1), SeaDepth = c(67879,
0, 67879, 1), DraftFWD = c(67879, 0, 67879, 1), DraftAFT = c(67879,
0, 67879, 1), Rudder = c(67879, 46675, 21204, 1), PropellerKW = c(67879,
5857, 21332, 0.400550980421043), MEConsumption = c(67879, 10,
21185, 0.312246792085918), MERPM = c(67879, 5105, 22030, 0.399755447192799
), MELoad = c(67879, 0, 67879, 1), DGConsumption = c(67879, 0,
67879, 1), DG1_Load = c(67879, 0, 67879, 1), DG2_Load = c(67879,
0, 67879, 1), DG3_Load = c(67879, 0, 67879, 1), DG4_Load = c(67879,
0, 67879, 1), DG5_Load = c(67879, 0, 67879, 1)), .Names = c("Type",
"Latitude", "Longitude", "GPSSpeed", "LogSpeed", "WindSpeedRel",
"WindDirRel", "Course", "SeaDepth", "DraftFWD", "DraftAFT", "Rudder",
"PropellerKW", "MEConsumption", "MERPM", "MELoad", "DGConsumption",
"DG1_Load", "DG2_Load", "DG3_Load", "DG4_Load", "DG5_Load"), row.names = c(NA,
4L), class = "data.frame")
I think you can subset directly here without looping:
Data[,DataOutput[4,]!=1]
and if you didn't have DataOutput and wanted to get rid of columns filled exclusively with NAs, you could do something like this:
Data[,colSums(is.na(Data))!=nrow(Data)]
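Both ideas can be sketched on a small made-up example (the toy Data and DataOutput below are illustrative, not the real frames, and the toy Data has no StartTime column). The first part counts NAs per column directly; the second reads the ratio row and drops the columns whose ratio equals 1:

```r
# Toy data: column b is entirely NA
Data <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),
  c = c(NA, 2, NA)
)

# Keep columns whose NA count is smaller than the number of rows
keep <- colSums(is.na(Data)) != nrow(Data)
Data_clean <- Data[, keep, drop = FALSE]

# Toy summary table in the same layout as DataOutput (row 4 is the ratio)
DataOutput <- data.frame(
  Type = c("#Measurements", "OOR", "NA", "Ratio"),
  a = c(3, 0, 1, 1 / 3),
  b = c(3, 0, 3, 1),
  c = c(3, 0, 2, 2 / 3)
)

# Named vector of ratios, skipping the Type column
ratio <- unlist(DataOutput[4, -1])
drop_cols <- names(ratio)[ratio == 1]
Data_clean2 <- Data[, !(names(Data) %in% drop_cols), drop = FALSE]

names(Data_clean)
names(Data_clean2)
```

drop = FALSE keeps the result a data frame even when only one column survives. Note that a ratio of 1 also flags columns that are entirely out of range rather than entirely NA, so the two approaches can disagree on real data.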