Replace unicode by its value in a dataframe - r

I tried to replace the unicode escape "U+00F3" in a data frame with the sapply function, but nothing happened. The part I want to replace is of type chr.
Here is the function:
tableExcel$Team <- sapply(tableExcel$Team, gsub, pattern = "<U+00F3>", replacement= "o")
EDIT:
Thanks to Cath's answer below, I added \\ before the +:
tableExcel$Team <- sapply(tableExcel$Team, gsub, pattern = "<U\\+00F3>", replacement = "o")
But it didn't work.
I also tried to provide an example of my dataset, but the problem is that the replacement works on it and not on mine:
tableExcel <- data.frame("Team" = c("A", "B", "C", "Reducci<U+00F3>n"), "Point" = c(2, 30, 40, 30))
tableExcel$Team <- as.character(tableExcel$Team)
To provide more information, here is how I import my Excel file:
tableExcel <- as.data.frame(read_excel("Dataset LOS.xls", sheet = "Liga Squads"))
The structure of my data:
structure(list(Team = c("CHURN", "CHURN", "RESIDENCIAL NPTB", "RESIDENCIAL NPTB", "AUDIENCIAS TV", "AUDIENCIAS TV"), Points = c("P. Asig", "P. entr", "P. Asig", "P. entr", "P. Asig", "P. entr"), `2019-S01` = c(0, 0, 50, 0, NA, NA), `2019-S02` = c(0, 0, 10, 10, NA, NA), `2019-S03` = c(93, 88, 46, 19, NA, NA), `2019-S04` = c(56, 48, 0, 0, 13, 13), `2019-S05` = c(NA, NA, 80.5, 49.5, 42, 28.5), `2019-S06` = c(NA, NA, 66, 48, 55, 39.5), `2019-S07` = c(131, 112, 103, 63, 40.5, 38)), row.names = c(1L, 2L, 4L, 5L, 7L, 8L), class = "data.frame")

I'm unable to replicate the issue with gsub. The following works as expected:
tableExcel$Team <- gsub("<U\\+00F3>", "o", tableExcel$Team)
#### OUTPUT ####
Team Points 2019-S01 2019-S02 2019-S03 2019-S04 2019-S05 2019-S06 2019-S07
1 Reducci<U+00F1>n P. Asig 0 0 93 56 NA NA 131.0
2 CHURN P. entr 0 0 88 48 NA NA 112.0
4 Reducci<U+00F2>n P. Asig 50 10 46 0 80.5 66.0 103.0
5 RESIDENCIAL NPTB P. entr 0 10 19 0 49.5 48.0 63.0
7 AUDIENCIAS TV P. Asig NA NA NA 13 42.0 55.0 40.5
8 <NA> P. entr NA NA NA 13 28.5 39.5 38.0
9 Reduccion P. entr NA NA NA NA NA NA NA
However, replacement using regular expressions may not be the most efficient way to convert the unicode characters, as it would require one gsub call per character. Instead, you might want to give stringi's stri_unescape_unicode() a try:
# install.packages("stringi") # Use if not yet installed.
library(stringi)
tableExcel$Team <- stri_unescape_unicode(gsub("<U\\+(.*)>", "\\\\u\\1", tableExcel$Team))
#### OUTPUT ####
Team Points 2019-S01 2019-S02 2019-S03 2019-S04 2019-S05 2019-S06 2019-S07
1 Reducciñn P. Asig 0 0 93 56 NA NA 131.0
2 CHURN P. entr 0 0 88 48 NA NA 112.0
4 Reducciòn P. Asig 50 10 46 0 80.5 66.0 103.0
5 RESIDENCIAL NPTB P. entr 0 10 19 0 49.5 48.0 63.0
7 AUDIENCIAS TV P. Asig NA NA NA 13 42.0 55.0 40.5
8 <NA> P. entr NA NA NA 13 28.5 39.5 38.0
9 Reducción P. entr NA NA NA NA NA NA NA
The format <U+0000> is first converted to \\u0000 using gsub and then unescaped. As you can see, this takes care of multiple distinct unicode characters in one go, which makes things much simpler.
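If you prefer to avoid the stringi dependency, the same two-step idea (find each <U+XXXX> token, decode its hex code) can be sketched in base R. `unescape_u` is a hypothetical helper name, and the sketch assumes each token carries a 4-digit hex code:

```r
# Base-R sketch: decode "<U+XXXX>" tokens into their characters.
unescape_u <- function(x) {
  # locate every "<U+XXXX>" token
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4}>", x)
  # replace each token by the character for its hex code point
  regmatches(x, m) <- lapply(regmatches(x, m), function(tok) {
    vapply(tok, function(t) intToUtf8(strtoi(substr(t, 4, 7), 16L)),
           character(1))
  })
  x
}
unescape_u("Reducci<U+00F3>n")
```

Strings without any token pass through unchanged, so it is safe to apply to the whole Team column.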
Data:
tableExcel <- structure(list(Team = c("Reducci<U+00F1>n", "CHURN", "Reducci<U+00F2>n",
"RESIDENCIAL NPTB", "AUDIENCIAS TV", NA, "Reducci<U+00F3>n"),
Points = c("P. Asig", "P. entr", "P. Asig", "P. entr", "P. Asig",
"P. entr", "P. entr"), `2019-S01` = c(0, 0, 50, 0, NA, NA,
NA), `2019-S02` = c(0, 0, 10, 10, NA, NA, NA), `2019-S03` = c(93,
88, 46, 19, NA, NA, NA), `2019-S04` = c(56, 48, 0, 0, 13,
13, NA), `2019-S05` = c(NA, NA, 80.5, 49.5, 42, 28.5, NA),
`2019-S06` = c(NA, NA, 66, 48, 55, 39.5, NA), `2019-S07` = c(131,
112, 103, 63, 40.5, 38, NA)), row.names = c(1L, 2L, 4L, 5L,
7L, 8L, 9L), class = "data.frame")

Related

Specify which column(s) a specific date appears in R

I have a subset of my data in a dataframe (dput codeblock below) containing dates on which a storm occurred ("Date_AR"). I'd like to know whether a storm occurred in the north, south, or both, by determining whether the same date occurs in the "Date_N" and/or "Date_S" columns.
For example, the first date is Jan 17, 1989 in the "Date_AR" column. In the Location column, I would like "S" to be printed, since this date is found in the "Date_S" column. If Apr 5, 1989 occurs in both "Date_N" and "Date_S", then I would like a "B" (for both) to be printed in the Location column.
Thanks in advance for the help! Apologies if this type of question is already out there. I may not know the keywords to search.
structure(list(Date_S = structure(c(6956, 6957, 6970, 7008, 7034,
7035, 7036, 7172, 7223, 7224, 7233, 7247, 7253, 7254, 7255, 7262, 7263, 7266, 7275,
7276), class = "Date"),
Date_N = structure(c(6968, 6969, 7035, 7049, 7103, 7172, 7221, 7223, 7230, 7246, 7247,
7251, 7252, 7253, 7262, 7266, 7275, 7276, 7277, 7280), class = "Date"),
Date_AR = structure(c(6956, 6957, 6968, 6969, 6970, 7008,
7034, 7035, 7036, 7049, 7103, 7172, 7221, 7223, 7224, 7230,
7233, 7246, 7247, 7251), class = "Date"), Precip = c(23.6,
15.4, 3, 16.8, 0.2, 3.6, 22, 13.4, 0, 30.8, 4.6, 27.1, 0,
19, 2.8, 11.4, 2, 57.6, 9.4, 39), Location = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), row.names = c(NA, 20L), class = "data.frame")
Using dplyr::case_when you could do:
library(dplyr)
dat |>
mutate(Location = case_when(
Date_AR %in% Date_S & Date_AR %in% Date_N ~ "B",
Date_AR %in% Date_S ~ "S",
Date_AR %in% Date_N ~ "N"
))
#> Date_S Date_N Date_AR Precip Location
#> 1 1989-01-17 1989-01-29 1989-01-17 23.6 S
#> 2 1989-01-18 1989-01-30 1989-01-18 15.4 S
#> 3 1989-01-31 1989-04-06 1989-01-29 3.0 N
#> 4 1989-03-10 1989-04-20 1989-01-30 16.8 N
#> 5 1989-04-05 1989-06-13 1989-01-31 0.2 S
#> 6 1989-04-06 1989-08-21 1989-03-10 3.6 S
#> 7 1989-04-07 1989-10-09 1989-04-05 22.0 S
#> 8 1989-08-21 1989-10-11 1989-04-06 13.4 B
#> 9 1989-10-11 1989-10-18 1989-04-07 0.0 S
#> 10 1989-10-12 1989-11-03 1989-04-20 30.8 N
#> 11 1989-10-21 1989-11-04 1989-06-13 4.6 N
#> 12 1989-11-04 1989-11-08 1989-08-21 27.1 B
#> 13 1989-11-10 1989-11-09 1989-10-09 0.0 N
#> 14 1989-11-11 1989-11-10 1989-10-11 19.0 B
#> 15 1989-11-12 1989-11-19 1989-10-12 2.8 S
#> 16 1989-11-19 1989-11-23 1989-10-18 11.4 N
#> 17 1989-11-20 1989-12-02 1989-10-21 2.0 S
#> 18 1989-11-23 1989-12-03 1989-11-03 57.6 N
#> 19 1989-12-02 1989-12-04 1989-11-04 9.4 B
#> 20 1989-12-03 1989-12-07 1989-11-08 39.0 N
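For reference, the same logic can be sketched in base R with nested ifelse; `classify_location` is a hypothetical helper name:

```r
# Classify each Date_AR as "B" (both), "S", "N", or NA by membership
# in the south and north date vectors.
classify_location <- function(ar, south, north) {
  in_s <- ar %in% south
  in_n <- ar %in% north
  ifelse(in_s & in_n, "B",
         ifelse(in_s, "S",
                ifelse(in_n, "N", NA_character_)))
}
# Usage on the question's data (not run here):
# dat$Location <- classify_location(dat$Date_AR, dat$Date_S, dat$Date_N)
```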

Determine range of time where measurements are not NA

I have a dataset with hundreds of thousands of measurements taken from several subjects. However, the measurements are only partially available, i.e., there may be large stretches of NA. I need to establish up front for which timespan non-NA data are available for each subject.
Data:
df
timestamp C B A starttime_ms
1 00:00:00.033 NA NA NA 33
2 00:00:00.064 NA NA NA 64
3 00:00:00.066 NA 0.346 NA 66
4 00:00:00.080 47.876 0.346 22.231 80
5 00:00:00.097 47.876 0.346 22.231 97
6 00:00:00.099 47.876 0.346 NA 99
7 00:00:00.114 47.876 0.346 NA 114
8 00:00:00.130 47.876 0.346 NA 130
9 00:00:00.133 NA 0.346 NA 133
10 00:00:00.147 NA 0.346 NA 147
My (humble) solution so far is to pick out the timestamp values that are not NA and to select the first and last such timestamp for each subject individually. Here's the code for subject C:
NotNA_C <- df$timestamp[which(!is.na(df$C))]
range_C <- paste(NotNA_C[1], NotNA_C[length(NotNA_C)], sep = " - ")
range_C
[1] "00:00:00.080 - 00:00:00.130"
That doesn't look elegant and, what's more, it needs to be repeated for all other subjects. Is there a more efficient way to establish the range of time for which non-NA values are available for all subjects in one go?
EDIT
I've found a base R solution:
sapply(df[,2:4], function(x)
paste(df$timestamp[which(!is.na(x))][1],
df$timestamp[which(!is.na(x))][length(df$timestamp[which(!is.na(x))])], sep = " - "))
C B A
"00:00:00.080 - 00:00:00.130" "00:00:00.066 - 00:00:00.147" "00:00:00.080 - 00:00:00.097"
but would be interested in other solutions as well!
Reproducible data:
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
dplyr solution
library(tidyverse)
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
df %>%
pivot_longer(-c(timestamp, starttime_ms)) %>%
group_by(name) %>%
drop_na() %>%
summarise(min = timestamp %>% min(),
max = timestamp %>% max())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#> name min max
#> <chr> <chr> <chr>
#> 1 A 00:00:00.080 00:00:00.097
#> 2 B 00:00:00.066 00:00:00.147
#> 3 C 00:00:00.080 00:00:00.130
Created on 2021-02-15 by the reprex package (v0.3.0)
You could take the cumsum of the differences of the non-NA indicator, coerce it to logical, and subset the first and last matching timestamp:
lapply(data.frame(apply(rbind(0, diff(!sapply(df[c("C", "B", "A")], is.na))), 2, cumsum)),
function(x) c(df$timestamp[as.logical(x)][1], rev(df$timestamp[as.logical(x)])[1]))
# $C
# [1] "00:00:00.080" "00:00:00.130"
#
# $B
# [1] "00:00:00.066" "00:00:00.147"
#
# $A
# [1] "00:00:00.080" "00:00:00.097"
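A slightly more compact base variant: since which(!is.na(x)) is already sorted, range() returns the first and last non-NA index in one step. `nonna_range` is a hypothetical helper name:

```r
# First and last timestamp where x is non-NA
# (assumes at least one non-NA value per column).
nonna_range <- function(ts, x) {
  idx <- range(which(!is.na(x)))
  paste(ts[idx], collapse = " - ")
}
# Usage on the question's data (not run here):
# sapply(df[c("C", "B", "A")], nonna_range, ts = df$timestamp)
```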

Combining components of a list in R

I have a list that contains data by year. I want to combine these components into a single dataframe, with rows matched by State. Example list:
List [[1]]
State Year X Y
23 1971 etc etc
47 1971 etc etc
List[[2]]
State Year X Y
13 1972 etc etc
23 1973 etc etc
47 1973 etc etc
etc....
List[[45]]
State Year X Y
1 2017 etc etc
2 2017 etc etc
3 2017 etc etc
1 2017 etc etc
23 2017 etc etc
47 2017 etc etc
I want the dataframe to look like this (I know I will have to go through and remove some extra columns):
State 1971_X 1971_Y 1972_X 1972_Y....2018_X 2019_Y
1 NA NA NA NA etc etc
2 NA NA etc etc etc etc
3 etc ect etc etc etc etc
...
50 NA NA etc etc etc etc
I have tried the command Outcomewanted=do.call("cbind", examplelist) but get the message
"Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 36, 40, 20, 42, 38, 26, 17, 31, 35, 23, 33, 13, 29, 28, 32, 34, 41, 37, 43, 39, 30, 14, 10, 4, 7"
It seems that the cbind.fill command could be an option but has been retired? Thanks for any help in advance.
You may use reshape after a do.call(rbind, ...) manoeuvre.
res <- reshape(do.call(rbind, lst), idvar="state", timevar="year", direction="wide")
res
# state x.1971 y.1971 x.1972 y.1972 x.1973 y.1973
# 1 23 1.3709584 0.3631284 NA NA -0.1061245 2.0184237
# 2 24 -0.5646982 0.6328626 NA NA 1.5115220 -0.0627141
# 3 13 NA NA 0.4042683 -0.09465904 NA NA
Data
lst <- list(structure(list(state = c(23, 24), year = c(1971, 1971),
x = c(1.37095844714667, -0.564698171396089), y = c(0.363128411337339,
0.63286260496104)), class = "data.frame", row.names = c(NA,
-2L)), structure(list(state = c(13, 23, 24), year = c(1972, 1973,
1973), x = c(0.404268323140999, -0.106124516091484, 1.51152199743894
), y = c(-0.0946590384130976, 2.01842371387704, -0.062714099052421
)), class = "data.frame", row.names = c(NA, -3L)))
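As a minimal, self-contained illustration of the same do.call(rbind, ...) + reshape pattern (toy data, not the question's actual values):

```r
# Two small yearly data frames with overlapping states
lst <- list(
  data.frame(state = c(23, 24), year = 1971, x = c(1, 2), y = c(3, 4)),
  data.frame(state = c(13, 23), year = 1972, x = c(5, 6), y = c(7, 8))
)
# Stack them long, then spread x and y out by year
wide <- reshape(do.call(rbind, lst), idvar = "state",
                timevar = "year", direction = "wide")
wide
```

States missing in a given year simply get NA in that year's columns, which matches the desired output.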

How to melt a multiple-column df with R?

I have this data for which I would like to transform it to long.
library(tidyverse)
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
df <- structure(list(
julian_days = c(
127, 130, 132, 134, 137, 139,
141, 144, 148, 151, 153, 155, 158, 160, 162, 165, 167, 169, 172,
NA, NA, NA, NA, NA, NA, NA, NA, NA
), sea_ice_algae_last_cm = c(
0.636,
0.698, 0.666666666666667, 0.685384615384615, 0.713, 0.6375, 0.58375,
0.637272727272727, 0.6575, 0.691666666666667, 0.629166666666667,
0.637142857142857, 0.589166666666667, 0.56, 0.571818181818182,
0.492, 0.31, 0.312, 0.203076923076923, NA, NA, NA, NA, NA, NA,
NA, NA, NA
), sd = c(
0.0227058484879019, 0.0369684550213647, 0.0533853912601565,
0.0525381424324881, 0.0413790070231539, 0.0381682876458741, 0.0277788888666675,
0.0410099766132362, 0.0222076972732838, 0.0194079021706795, 0.0299873710792131,
0.0363841933236059, 0.0253908835942542, 0.055746679790749, 0.0604678727620178,
0.0294957624075053, 0.10770329614269, 0.0657267069006199, 0.0693282789084673,
NA, NA, NA, NA, NA, NA, NA, NA, NA
), julian_days_2 = c(
127, 130,
132, 134, 137, 139, 141, 144, 146, 148, 151, 153, 155, 158, 160,
162, 165, 167, 169, 172, 174, 176, 179, 181, 183, 186, 188, 190
), water_1_5_m_depth = c(
0.69, 0.5475, 0.596, 0.512, 0.598, 0.488333333333333,
0.27, 0.41, 0.568, 0.503333333333333, 0.668333333333333, 0.71,
0.636666666666667, 0.623333333333333, 0.66, 0.541666666666667,
0.57, 0.545, 0.501666666666667, 0.526666666666667, 0.566666666666667,
0.493333333333333, 0.59, 0.518333333333333, 0.443333333333333,
0.605, 0.58, 0.478333333333333
), sd_2 = c(
0.121655250605964,
0.0718215380880506, 0.0736885337077625, 0.0376828873628335, 0.084380092438916,
0.0636919670497516, 0.054037024344425, 0.0540370243444251, 0.0370135110466435,
0.0571547606649408, 0.0702614166286638, 0.0442718872423573, 0.0799166232186176,
0.0480277697448743, 0.0409878030638384, 0.0462240918425302, 0.0920869154657709,
0.0706399320497981, 0.0511533641774093, 0.100531918646103, 0.0186189867250252,
0.0588784057755188, 0.0841427358718512, 0.0934701378337842, 0.0492612085384298,
0.0653452370108182, 0.0878635305459549, 0.0851860708488579
),
water_10_m_depth = c(
0.66, 0.732, 0.595, 0.712, 0.514, 0.48,
0.35, 0.44, 0.535, 0.403333333333333, 0.728, 0.746, 0.625,
0.698333333333333, 0.705, 0.555, 0.585, 0.651666666666667,
0.603333333333333, 0.595, 0.615, 0.615, 0.658333333333333,
0.641666666666667, 0.623333333333333, 0.628333333333333,
0.661666666666667, 0.631666666666667
), sd_3 = c(
0, 0.0342052627529742,
0.0387298334620742, 0.0327108544675923, 0.0610737259384104,
0.0700000000000001, 0.127279220613579, 0.0972111104761177,
0.0564800849857717, 0.0504645089807343, 0.0540370243444252,
0.0415932686861709, 0.0809320702811933, 0.0475043857624395,
0.0398748040747538, 0.0568330889535313, 0.0388587184554509,
0.0204124145231932, 0.058878405775519, 0.0896102672688791,
0.0535723809439155, 0.0488876262463212, 0.043089055068157,
0.0306050104830347, 0.0527888877195444, 0.0708284312029193,
0.0426223728418147, 0.0348807492274272
), julian_days_3 = c(
134,
137, 139, 141, 146, 148, 153, 155, 160, 162, 165, 169, 172,
174, 176, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA
), water_40_m_depth = c(
0.523166666666667, 0.360833333333333,
0.279, 0.228, 0.551166666666667, 0.358666666666667, 0.593,
0.6225, 0.6665, 0.5468, 0.334714285714286, 0.654, 0.567666666666667,
0.664166666666667, 0.6345, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA
), sd_4 = c(
0.0793937445058905, 0.0346145441493408,
0.0834625664594612, 0.105740247777277, 0.0437008771841786,
0.0810719844747042, 0.0849529281425892, 0.0539620236833275,
0.0689514321823702, 0.0344992753547085, 0.0889713704621029,
0.064221491729794, 0.0166933120340652, 0.0545982295195244,
0.0578472125516865, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA
), julian_days_4 = c(
181, 183, 186, 188, 190, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA
), water_60_m_depth = c(
0.617833333333333,
0.492333333333333, 0.642166666666667, 0.7265, 0.686166666666667,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA
), sd_5 = c(
0.0574818812032684,
0.049766119666563, 0.0704540039079871, 0.0286618212959331,
0.0382225936674458, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
)
), row.names = c(
NA,
-28L
), class = c("tbl_df", "tbl", "data.frame"))
arrange(df, desc(julian_days_4)) # Look at the data at day 190
#> # A tibble: 28 x 14
#> julian_days sea_ice_algae_l… sd julian_days_2 water_1_5_m_dep…
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 137 0.713 0.0414 137 0.598
#> 2 134 0.685 0.0525 134 0.512
#> 3 132 0.667 0.0534 132 0.596
#> 4 130 0.698 0.0370 130 0.548
#> 5 127 0.636 0.0227 127 0.69
#> 6 139 0.638 0.0382 139 0.488
#> 7 141 0.584 0.0278 141 0.27
#> 8 144 0.637 0.0410 144 0.41
#> 9 148 0.658 0.0222 146 0.568
#> 10 151 0.692 0.0194 148 0.503
#> # … with 18 more rows, and 9 more variables: sd_2 <dbl>,
#> # water_10_m_depth <dbl>, sd_3 <dbl>, julian_days_3 <dbl>,
#> # water_40_m_depth <dbl>, sd_4 <dbl>, julian_days_4 <dbl>,
#> # water_60_m_depth <dbl>, sd_5 <dbl>
I would like to “stack” all this into 3 columns:
julian with all columns starting with “julian”
measure with all columns starting with “water” or “sea”
sd with all columns starting with “sd”
Note that in the “water” columns, the numbers represent the depth (ex.: water_1_5_m_depth means 1.5 m).
The desired output for the first line would be something like:
tibble(
julian = c(127, 127, 127, 134, 181),
type = c("sea", "water_1.5", "water_10", "water_40", "water_60"),
measure = c(0.64, 0.69, 0.66, 0.52, 0.62),
sd = c(0.02, 0.12, 0, 0.08, 0.06)
)
#> # A tibble: 5 x 4
#> julian type measure sd
#> <dbl> <chr> <dbl> <dbl>
#> 1 127 sea 0.64 0.02
#> 2 127 water_1.5 0.69 0.12
#> 3 127 water_10 0.66 0
#> 4 134 water_40 0.52 0.08
#> 5 181 water_60 0.62 0.06
My attempt so far was with data.table.
melt(
setDT(df),
measure = patterns("^julian", "^sea", "^water_1_5", "^water_10", "^water_40", "^water_60", "^sd"),
value.name = c("julian", "sea", "water_1.5", "water_10", "water_40", "water_60", "sd")
)
#> variable julian sea water_1.5 water_10 water_40 water_60
#> 1: 1 127 0.6360000 0.6900 0.660 0.5231667 0.6178333
#> 2: 1 130 0.6980000 0.5475 0.732 0.3608333 0.4923333
#> 3: 1 132 0.6666667 0.5960 0.595 0.2790000 0.6421667
#> 4: 1 134 0.6853846 0.5120 0.712 0.2280000 0.7265000
#> 5: 1 137 0.7130000 0.5980 0.514 0.5511667 0.6861667
#> ---
#> 136: 5 NA NA NA NA NA NA
#> 137: 5 NA NA NA NA NA NA
#> 138: 5 NA NA NA NA NA NA
#> 139: 5 NA NA NA NA NA NA
#> 140: 5 NA NA NA NA NA NA
#> sd
#> 1: 0.02270585
#> 2: 0.03696846
#> 3: 0.05338539
#> 4: 0.05253814
#> 5: 0.04137901
#> ---
#> 136: NA
#> 137: NA
#> 138: NA
#> 139: NA
#> 140: NA
Any help appreciated.
UPDATE:
Here is the file I received.
Created on 2019-04-12 by the reprex package (v0.2.1)
library(tidyverse)
list_of_dfs <- split.default(df, rep(1:4, c(3, 5, 3, 3)))
list_of_dfs[[5]] <- list_of_dfs[[2]][, c(1, 4, 5)]
list_of_dfs[[2]] <- list_of_dfs[[2]][, 1:3]
list_of_dfs %>%
map(~ .[complete.cases(.), ]) %>%
map(~ mutate(., type = grep("^sea|^water", names(.), value = TRUE))) %>%
map(setNames, nm = c("julian", "measure", "sd", "type")) %>%
bind_rows()
# # A tibble: 95 x 4
# julian measure sd type
# <dbl> <dbl> <dbl> <chr>
# 1 127 0.636 0.0227 sea_ice_algae_last_cm
# 2 130 0.698 0.0370 sea_ice_algae_last_cm
# 3 132 0.667 0.0534 sea_ice_algae_last_cm
# 4 134 0.685 0.0525 sea_ice_algae_last_cm
# 5 137 0.713 0.0414 sea_ice_algae_last_cm
# 6 139 0.638 0.0382 sea_ice_algae_last_cm
# 7 141 0.584 0.0278 sea_ice_algae_last_cm
# 8 144 0.637 0.0410 sea_ice_algae_last_cm
# 9 148 0.658 0.0222 sea_ice_algae_last_cm
# 10 151 0.692 0.0194 sea_ice_algae_last_cm
# # … with 85 more rows
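The column-splitting above can also be written as an explicit helper that takes the julian/measure/sd column names as triples; `stack_triples` is a hypothetical name, and the sketch assumes each triple lines up row-wise as in the original data:

```r
# Stack (julian, measure, sd) column triples into long format,
# dropping rows that are incomplete within a triple.
stack_triples <- function(df, triples) {
  out <- lapply(triples, function(g) {
    d <- data.frame(julian = df[[g[1]]], type = g[2],
                    measure = df[[g[2]]], sd = df[[g[3]]],
                    stringsAsFactors = FALSE)
    d[complete.cases(d), ]
  })
  do.call(rbind, out)
}
```

For the question's data the triples would be c("julian_days", "sea_ice_algae_last_cm", "sd"), c("julian_days_2", "water_1_5_m_depth", "sd_2"), and so on.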
It would be nice if you shared your desired output. I think this is what you want:
df %>%
select(starts_with("julian")) %>%
gather(key = col, julian) %>%
bind_cols(df %>%
select(starts_with("water")) %>%
gather(col_water, measure)) %>%
#bind_cols(df %>%
# select(starts_with("sea")) %>%
# gather(col_sea, measure2)) %>%
bind_cols(df %>%
select(starts_with("sd")) %>%
gather(col_sd, sd)) %>%
select(julian, measure, sd)
julian measure sd
<dbl> <dbl> <dbl>
1 127 0.69 0.122
2 130 0.548 0.0718
3 132 0.596 0.0737
4 134 0.512 0.0377
5 137 0.598 0.0844
6 139 0.488 0.0637
7 141 0.27 0.0540
8 144 0.41 0.0540
9 148 0.568 0.0370
10 151 0.503 0.0572
# ... with 102 more rows
In this attempt I did not include the variables starting with sea, since that would lead to a one-to-many merge. Let me know if I am on the right track to include that one.
data.table::melt(
df,
measure.vars = patterns("^julian"),
variable.name = "julian_variable",
value.name = "julian_value"
) %>%
data.table::melt(
measure.vars = patterns(measure = "^sea|^water"),
variable.name = "measure_variable",
value.name = "measure_value"
) %>%
data.table::melt(
measure.vars = patterns(measure = "^sd"),
variable.name = "sd_variable",
value.name = "sd_value"
)
# julian_variable julian_value measure_variable measure_value sd_variable sd_value
# 1: julian_days 127 sea_ice_algae_last_cm 0.6360000 sd 0.02270585
# 2: julian_days 130 sea_ice_algae_last_cm 0.6980000 sd 0.03696846
# 3: julian_days 132 sea_ice_algae_last_cm 0.6666667 sd 0.05338539
# 4: julian_days 134 sea_ice_algae_last_cm 0.6853846 sd 0.05253814
# 5: julian_days 137 sea_ice_algae_last_cm 0.7130000 sd 0.04137901
# ---
# 2796: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2797: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2798: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2799: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2800: julian_days_4 NA water_60_m_depth NA sd_5 NA
Though it is unclear what the desired output is, this solution obviously leads to a lot of duplication: each individual value is duplicated 100 times (4 "julian" columns * 5 "measure" columns * 5 "sd" columns).

"undefined columns selected" - when trying to remove na's from df's in list

I am trying to replicate the success of this solution:
remove columns with NAs from all dataframes in list
or
Remove columns from dataframe where some of values are NA
with a list of dataframes:
m1<- structure(list(vPWMETRO = c(1520L, 1520L, 1520L, 1520L, 1520L),
vPWPUMA00 = c(500L, 900L, 1000L, 1100L, 1200L),
v100 = c(96.1666666666667, 71.4615384615385, 68.6363636363636, 22.5, 64.5),
v101 = c(5, 15, NA, NA, NA),
v102 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)),
.Names = c("vPWMETRO", "vPWPUMA00", "v100", "v101", "v102"),
row.names = 26:30, class = "data.frame")
m2<- structure(list(vPWMETRO = c(6440L, 6440L, 6440L, NA, NA),
vPWPUMA00 = c(1300L,2100L, 2200L, NA, NA),
v100 = c(38.3921568627451, 35, 12.5, NA, NA),
v101 = c(NA, NA, NA, NA, NA),
v102 = c(38.3333333333333, 68, NA, NA, NA)),
.Names = c("vPWMETRO", "vPWPUMA00", "v100", "v101", "v102"),
row.names = c("39", "40", "41", "NA", "NA.1"), class = "data.frame")
#views structure
str(m1)
str(m2)
#creates list
snag<- list(v1520=m1, v6440=m2)
str(snag)
#attempts lapply solution
prob1<- lapply(snag, function(y) y[ ,!is.na(y)])
#2nd attempt, same result on a single dataframe:
snag$v6440[ , apply(snag$v6440, 2, function(x) !(is.na(x)))]
The goal is that columns containing only NAs are deleted within each dataframe, so the result should be a list of 2 dfs:
v1520: vPWPUMA00, v100, v101
v6440: vPWPUMA00, v100, v102
I see the difference in the example problem is that the dimension is 1x11 and my dimensions are 5x5. I am guessing this causes the "undefined column" error but I'm not sure.
Any assistance or advice would be most appreciated.
I don't think you were looking at quite the right questions and answers.
See Remove columns from dataframe where ALL values are NA, which appears to be what you want.
You can then modify my answer to give
lapply(snag, Filter, f = function(x){!all(is.na(x))})
$v1520
vPWMETRO vPWPUMA00 v100 v101
26 1520 500 96.16667 5
27 1520 900 71.46154 15
28 1520 1000 68.63636 NA
29 1520 1100 22.50000 NA
30 1520 1200 64.50000 NA
$v6440
vPWMETRO vPWPUMA00 v100 v102
39 6440 1300 38.39216 38.33333
40 6440 2100 35.00000 68.00000
41 6440 2200 12.50000 NA
NA NA NA NA NA
NA.1 NA NA NA NA
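An equivalent without Filter, using plain column subsetting to keep any column with at least one non-NA value (a sketch; `drop_all_na_cols` is a hypothetical name):

```r
# Keep columns that have at least one non-NA entry.
drop_all_na_cols <- function(d) d[, colSums(!is.na(d)) > 0, drop = FALSE]
# Usage on the question's list (not run here):
# lapply(snag, drop_all_na_cols)
d <- data.frame(a = c(1, NA), b = c(NA, NA), c = c(2, 3))
drop_all_na_cols(d)
```

colSums works here because !is.na(d) on a data frame yields a logical matrix; drop = FALSE keeps the result a data frame even when a single column survives.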
