library(dplyr)
rainfall <- data.frame("date" = 1:15, "location_code" = rep(6:8, 5),
"rainfall" = runif(15, min = 12, max = 60))
rainfall30 <- rainfall %>%
group_by(location_code) %>%
filter(rainfall > 30)
I want to use the above data to make the following table. Is there a way to do it in R using dplyr?
date  location6  location7  location8
2                47.7
5                46.8
6                           32.3
7     55.3
9                           40.5
I am just starting to use R, so apologies if this has already been answered. Thanks.
I think what you are looking for is tidyr::pivot_wider, which turns this long-form data.frame into wide form. See the tidyr pivoting vignette (vignette("pivot")) for more information on pivoting data with tidyr.
rainfall30 %>%
pivot_wider(names_from = location_code,
values_from = rainfall)
# date `6` `7` `8`
# <int> <dbl> <dbl> <dbl>
# 1 1 32.3 NA NA
# 2 2 NA 52.7 NA
# 3 3 NA NA 54.3
# 4 4 30.6 NA NA
# 5 7 52.4 NA NA
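If the columns should be named location6, location7, ... as in the question, pivot_wider's names_prefix argument can add the prefix, and names_sort orders the new columns. A minimal sketch, using made-up rainfall values rather than the asker's random data:

```r
library(dplyr)
library(tidyr)

# Made-up example values (assumption: not the asker's runif() data)
rainfall <- data.frame(
  date = 1:6,
  location_code = rep(6:8, 2),
  rainfall = c(32.3, 52.7, 54.3, 12.0, 47.7, 20.1)
)

wide <- rainfall %>%
  filter(rainfall > 30) %>%            # keep only readings above 30
  pivot_wider(
    names_from = location_code,        # one column per location
    values_from = rainfall,
    names_prefix = "location",         # 6 -> location6, etc.
    names_sort = TRUE                  # order columns by location code
  )

wide
```

Cells with no reading above 30 for a given date and location are filled with NA.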
Here is a base R option using reshape + subset
reshape(
subset(rainfall, rainfall > 30),
idvar = "date",
timevar = "location_code",
direction = "wide"
)
which gives something like below (using set.seed(1) to generate rainfall)
date rainfall.8 rainfall.6 rainfall.7
3 3 39.49696 NA NA
4 4 NA 55.59397 NA
6 6 55.12270 NA NA
7 7 NA 57.34441 NA
8 8 NA NA 43.71829
9 9 42.19747 NA NA
13 13 NA 44.97710 NA
14 14 NA NA 30.43698
15 15 48.95239 NA NA
I have a data frame that I import from an Excel (.xlsx) file, and it looks like this:
>data
# A tibble: 3,338 x 4
Dates A B C
<dttm> <lgl> <lgl> <lgl>
1 2009-01-05 00:00:00 NA NA NA
2 2009-01-06 00:00:00 NA NA NA
3 2009-01-07 00:00:00 NA NA NA
4 2009-01-08 00:00:00 NA NA NA
5 2009-01-09 00:00:00 NA NA NA
6 2009-01-12 00:00:00 NA NA NA
7 2009-01-13 00:00:00 NA NA NA
8 2009-01-14 00:00:00 NA NA NA
9 2009-01-15 00:00:00 NA NA NA
10 2009-01-16 00:00:00 NA NA NA
# ... with 3,328 more rows
# i Use `print(n = ...)` to see more rows
The problem is that the three columns A, B and C do contain numeric values, but only after some three thousand rows.
To convert them to numeric I tried:
data%>%
dplyr::mutate(date = as.Date(Dates))%>%
dplyr::select(-Dates)%>%
dplyr::relocate(date,.before="A")%>%
dplyr::mutate_if(is.logical, as.numeric)%>%
tidyr::pivot_longer(!date,names_to = "var", values_to = "y")%>%
dplyr::group_by(var)%>%
dplyr::arrange(var)%>%
tidyr::drop_na()
But the problem remains:
date var y
<date> <chr> <dbl>
1 2021-11-30 A 1
2 2021-12-01 A 1
3 2021-12-02 A 1
4 2021-12-03 A 1
5 2021-12-06 A 1
6 2021-12-07 A 1
7 2021-12-08 A 1
8 2021-12-09 A 1
9 2021-12-10 A 1
10 2021-12-13 A 1
# ... with 189 more rows
Any help?
Summing up from comments:
It's usually easier to fix conversion errors as close to the original source as possible.
read_xlsx tries to guess column types by checking the first guess_max rows, guess_max being a read_xlsx parameter with a default value of min(1000, n_max). If read_xlsx gets the column types wrong because those columns are filled with NA for the first 1000 rows, just increasing the guess_max parameter might be a viable solution:
readxl::read_xlsx("file.xlsx", guess_max = 5000)
Though for a simple 4-column dataset one shouldn't need more than defining correct column types manually:
readxl::read_xlsx("file.xlsx", col_types = c("date", "numeric", "numeric", "numeric"))
If NA values are only at the beginning of some columns, changing sorting order in Excel itself and moving NAs from top before importing the file into R should also work.
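As a side note on why every y in the question came out as 1: once a column has been read as logical, as.numeric() can only turn TRUE into 1, so the original numbers cannot be recovered after the import. A minimal base R sketch, assuming the mis-guessed column ended up holding TRUE where the numbers were (the vector below is made up for illustration):

```r
# Hypothetical stand-in for a column that was guessed as logical
mis_guessed <- c(NA, NA, TRUE, TRUE)

# Converting afterwards yields 1 for every TRUE, not the original values
converted <- as.numeric(mis_guessed)
converted
```

This is why the fix belongs in the read_xlsx call rather than in a later mutate step.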
I have patients with baseline pain scores and follow-up at 6 months, 1 year and 2 years (each in its own variable column). I have 26,000+ patients. There is missing data at those various time points. I can easily analyse pain score outcomes at one year excluding missing values, and likewise at 6 months or two years. What I would like to do is analyse outcomes in those with data at EITHER 6 months, one year or two years. Some patients will have more than one and some will have missing data for all three. Any ideas how to code this? Maybe another column created with mutate() called 'vas.outcome', holding the one-year data, or if one-year is missing then two-year, and if two-year is also missing then 6-month. If all three are missing then code as NA.
# A tibble: 6 x 4
vas.base vas.6mth vas.year vas.two
<dbl> <dbl> <dbl> <dbl>
1 5 NA NA 4
2 9 2.3 1.2 NA
3 8.1 NA NA NA
4 10 NA NA 3.3
5 6.5 6.5 NA NA
6 8 NA NA 3
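One approach:
library(dplyr)
# coalesce() returns the first non-NA value in argument order;
# reorder the columns to change which time point takes priority
your_data_frame %>%
mutate(vas.outcome = coalesce(vas.6mth, vas.year, vas.two))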
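To get the priority described in the question (one-year first, then two-year, then 6-month), the columns can be passed to coalesce() in that order. A minimal sketch using the six example rows from the question; the data frame name pain is only for illustration:

```r
library(dplyr)

# Example rows copied from the question
pain <- data.frame(
  vas.base = c(5, 9, 8.1, 10, 6.5, 8),
  vas.6mth = c(NA, 2.3, NA, NA, 6.5, NA),
  vas.year = c(NA, 1.2, NA, NA, NA, NA),
  vas.two  = c(4, NA, NA, 3.3, NA, 3)
)

# Priority: one-year, then two-year, then 6-month; NA if all are missing
pain <- pain %>%
  mutate(vas.outcome = coalesce(vas.year, vas.two, vas.6mth))

pain$vas.outcome
```

Patient 2 has both 6-month and one-year data, so this ordering picks the one-year value (1.2) rather than the 6-month value (2.3).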
You could use a case_when()/fcase() approach. With data.table (dt being a data.table):
library(data.table)
dt[, pain := fcase(
!is.na(vas.year), vas.year,
!is.na(vas.two), vas.two,
!is.na(vas.6mth), vas.6mth,
default = NA
)]
or with dplyr:
library(dplyr)
dt %>%
mutate(pain = case_when(
!is.na(vas.year) ~ vas.year,
!is.na(vas.two) ~ vas.two,
TRUE ~ vas.6mth
))
Output:
vas.base vas.6mth vas.year vas.two pain
1: 5.0 NA NA 4.0 4.0
2: 9.0 2.3 1.2 NA 1.2
3: 8.1 NA NA NA NA
4: 10.0 NA NA 3.3 3.3
5: 6.5 6.5 NA NA 6.5
6: 8.0 NA NA 3.0 3.0
I'm not 100% sure what you want your final dataset to look like, and I'm sure there are more elegant ways, but to choose the first occurrence of an outcome (after baseline), you can do:
Data
df <- read.table(text = "id vas.base vas.6mth vas.year vas.two
1 5 NA NA 4
2 9 2.3 1.2 NA
3 8.1 NA NA NA
4 10 NA NA 3.3
5 6.5 6.5 NA NA
6 8 NA NA 3", header = TRUE)
dplyr approach:
library(dplyr)
library(tidyr)
df %>% pivot_longer(starts_with("vas")[-1], names_to = "visit") %>%
group_by(id) %>% mutate(vas.outcome = first(na.omit(value))) %>%
slice(1) %>% select(id, vas.outcome) %>%
left_join(df, by = "id")
Output:
# id vas.outcome vas.base vas.6mth vas.year vas.two
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 4 5 NA NA 4
# 2 2 2.3 9 2.3 1.2 NA
# 3 3 NA 8.1 NA NA NA
# 4 4 3.3 10 NA NA 3.3
# 5 5 6.5 6.5 6.5 NA NA
# 6 6 3 8 NA NA 3
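The same "first non-missing follow-up" idea can also be sketched in base R by scanning each row in column order. first_non_na is a hypothetical helper written for this example, and scores is just an illustrative name for the question's data:

```r
# Example rows copied from the question
scores <- data.frame(
  id       = 1:6,
  vas.base = c(5, 9, 8.1, 10, 6.5, 8),
  vas.6mth = c(NA, 2.3, NA, NA, 6.5, NA),
  vas.year = c(NA, 1.2, NA, NA, NA, NA),
  vas.two  = c(4, NA, NA, 3.3, NA, 3)
)

# Hypothetical helper: first non-NA element of a vector, or NA if none
first_non_na <- function(x) {
  present <- x[!is.na(x)]
  if (length(present) > 0) present[1] else NA_real_
}

# Column order here decides which visit wins when several are present
follow_up <- c("vas.6mth", "vas.year", "vas.two")
scores$vas.outcome <- apply(scores[, follow_up], 1, first_non_na)

scores$vas.outcome
```

Changing the order of follow_up changes the priority, much like reordering the arguments of coalesce().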
I want to collapse this data frame so NA's are removed. How to accomplish this? Thanks!!
library(tidyr)
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id,q1)
df$row <- 1:nrow(df)
spread(df, id, q1)
row 1 2 3 4 5
1 23 NA NA NA NA
2 55 NA NA NA NA
3 7 NA NA NA NA
4 NA 88 NA NA NA
5 NA 90 NA NA NA
6 NA NA 34 NA NA
7 NA NA NA 11 NA
8 NA NA NA NA 22
9 NA NA NA NA 99
I want it to look like this:
1 2 3 4 5
23 88 34 11 22
55 90 NA NA 99
7 NA NA NA NA
The row index should be created as a sequence within each 'id' group rather than over the whole data frame. In addition, pivot_wider is a more general function than spread.
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
mutate(row = row_number()) %>%
ungroup %>%
pivot_wider(names_from = id, values_from = q1) %>%
select(-row)
-output
# A tibble: 3 × 5
`1` `2` `3` `4` `5`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 23 88 34 11 22
2 55 90 NA NA 99
3 7 NA NA NA NA
Or use dcast
library(data.table)
dcast(setDT(df), rowid(id) ~ id, value.var = 'q1')[, id := NULL][]
1 2 3 4 5
<num> <num> <num> <num> <num>
1: 23 88 34 11 22
2: 55 90 NA NA 99
3: 7 NA NA NA NA
Here's a base R solution that works on the already-spread data frame (tidyr is loaded only to reproduce the spread() call). I sort each column so the non-NA values are at the top, find the number of non-NA values in the column with the most non-NA values (n), and return the top n rows from the data frame.
library(tidyr)
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id,q1)
df$row <- 1:nrow(df)
df <- spread(df, id, q1)
collapse_df <- function(df) {
move_na_to_bottom <- function(x) x[order(is.na(x))]
sorted <- sapply(df, move_na_to_bottom)
count_non_na <- function(x) sum(!is.na(x))
n <- max(apply(df, 2, count_non_na))
sorted[1:n, ]
}
collapse_df(df[, -1])
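A variation that skips the spread step entirely: split q1 by id and pad each group with NA up to the longest group. A minimal base R sketch using the data from the question (it returns a matrix rather than a data frame):

```r
id <- c(1, 1, 1, 2, 2, 3, 4, 5, 5)
q1 <- c(23, 55, 7, 88, 90, 34, 11, 22, 99)

# One vector of q1 values per id
groups <- split(q1, id)

# Length of the longest group
n <- max(lengths(groups))

# Pad every group to length n; assigning a longer length fills with NA
collapsed <- sapply(groups, function(x) {
  length(x) <- n
  x
})

collapsed
```

The column names of collapsed are the id values, and wrapping the result in as.data.frame() would give a data frame if one is needed.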
I'd like to convert chemical formulas to a data frame containing columns for 1) the mineral name, 2) the chemical formula and 3) a set of columns, one for each element extracted from the formula. I am given the first two columns and I can extract the number of elements from each formula using CHNOSZ::makeup(). However, I'm not familiar with working with lists and not sure how to rbind() the lists back into a data frame that contains everything I'm looking for (i.e. see 1-3 above).
Here is what I have so far - appreciate any help (including a link to a good tutorial on how to convert data from nested lists into dataframes).
library(tidyverse)
library(CHNOSZ)
formulas <- structure(list(Mineral = c("Abelsonite", "Abernathyite", "Abhurite",
"Abswurmbachite", "Acanthite", "Acetamide"), Composition = c("C31H32N4Ni",
"K(UO2)(AsO4)4(H2O)", "Sn3O(OH)2Cl2", "CuMn6(SiO4)O8", "Ag2S",
"CH3CONH2")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
test <- formulas %>%
select(Composition) %>%
map(CHNOSZ::makeup) %>%
flatten
test2 <- do.call(rbind,test)
> test2
As H K O U
[1,] 31 32 4 1 31
[2,] 4 2 1 19 1
[3,] 2 2 3 3 2
[4,] 1 6 12 1 1
[5,] 2 1 2 1 2
[6,] 2 5 1 1 2
which is not right.
You could do something like this
library(tidyverse)
library(CHNOSZ)
test <- formulas %>%
mutate(res = map(Composition, ~stack(makeup(.x)))) %>%
unnest(cols = res) %>%
spread(ind, values)
## A tibble: 6 x 17
# Mineral Composition C H N Ni As K O U Cl
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Abelso… C31H32N4Ni 31 32 4 1 NA NA NA NA NA
#2 Aberna… K(UO2)(AsO… NA 2 NA NA 4 1 19 1 NA
#3 Abhuri… Sn3O(OH)2C… NA 2 NA NA NA NA 3 NA 2
#4 Abswur… CuMn6(SiO4… NA NA NA NA NA NA 12 NA NA
#5 Acanth… Ag2S NA NA NA NA NA NA NA NA NA
#6 Acetam… CH3CONH2 2 5 1 NA NA NA 1 NA NA
## … with 6 more variables: Sn <dbl>, Cu <dbl>, Mn <dbl>, Si <dbl>, Ag <dbl>,
## S <dbl>
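As for why do.call(rbind, ...) went wrong in the question: rbind() lines vectors up by position and recycles the shorter ones, so it ignores which element each count belongs to. A minimal base R sketch of aligning by name instead, using hand-written vectors that stand in for what makeup() returns (they are assumptions for illustration, not real CHNOSZ output objects):

```r
# Hand-written stand-ins for per-formula element counts
counts <- list(
  Acanthite = c(Ag = 2, S = 1),
  Acetamide = c(C = 2, H = 5, N = 1, O = 1)
)

# Every element that appears in any formula
elements <- unique(unlist(lapply(counts, names)))

# Index each vector by element name; absent elements come back as NA
element_matrix <- t(sapply(counts, function(x) x[elements]))
colnames(element_matrix) <- elements

element_matrix
```

Each row is one mineral and each column one element, with NA where the element does not occur in that formula.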
I have a data set like the following:
wk name score
3 - Davide - 3.070000
6 - Davide - 3.460000
7 - Davide - 3.480000
48 -Cringe- 2.773333
79 -Fabynsane- 2.330000
69 -PiDjO- 2.070000
61 -sjb- 2.310000
I want to use this information to construct a panel like the following:
name1 name2 name3 ...
wk1
wk2
wk3
...
I have tried dcast from reshape2:
panel.num = dcast(data, name + wk ~ score)
but it gives me a panel like the following and this is apparently not the one I want:
Authorname wk.list 1 2 3 4 5 6 7 8 9 10 11 12 13
2 - Davide - 3 1 NA NA NA NA NA NA NA NA NA NA NA NA NA
3 - Davide - 6 1 NA NA NA NA NA NA NA NA NA NA NA NA NA
I am wondering what went wrong and how I could fix this issue. Thanks~
Try wk ~ name instead, i.e.
dat <- data.frame(wk=sample(1:100, 10),
name=sample(c("Davide", "Cringe", "Fabynsane"), 10, replace=TRUE),
score=runif(10, 2, 3))
library(reshape2)
dcast(dat, wk ~ name, value.var = "score")
# wk Cringe Davide Fabynsane
# 1 8 NA 2.225543 NA
# 2 12 NA NA 2.958040
# 3 46 NA 2.659209 NA
# 4 47 NA 2.086529 NA
# 5 59 NA NA 2.287232
Other options include
library(tidyr)
spread(dat, name, score)
Or reshape from base R
reshape(dat, idvar='wk', timevar='name', direction='wide')
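In every variant above the pattern is the same: the week identifies the row, the name identifies the column, and the score fills the cell. A minimal base R sketch of that layout with tapply(), using a few rows modelled on the question's data (names trimmed of their dashes for readability):

```r
# A few rows modelled on the question's data
dat <- data.frame(
  wk    = c(3, 6, 7, 48, 79),
  name  = c("Davide", "Davide", "Davide", "Cringe", "Fabynsane"),
  score = c(3.07, 3.46, 3.48, 2.773333, 2.33)
)

# Rows = wk, columns = name, cells = score (mean, in case of duplicates)
panel <- tapply(dat$score, list(wk = dat$wk, name = dat$name), mean)

panel
```

Week/name combinations that never occur are NA. The original attempt put score on the column side of the formula, which is why each distinct score became its own column.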