Changing columns based on REGEX [duplicate] - r

This question already has answers here:
How to convert class of several variables at once
(2 answers)
Closed 2 years ago.
I have a very large data set with results and dates.
A small subset of the data (I have many more columns with different names and rows):
result_1 date_1 result_2 date_2 result_3 date_3 result_4 date_4
1 1 12.8.2020 4 13.8.2020 2 15.8.2020 1 20.8.2020
2 3 15.8.2020 3 14.8.2020 5 17.8.2020 2 21.8.2020
I want to change some of the columns to numeric, depending on their names.
I thought I might be able to select the columns with a regex, as follows:
data$"result.*" <- as.numeric(data$"result\.*")
but it produces an error:
Error in `$<-.data.frame`(`*tmp*`, "result.*", value = numeric(0)) :
replacement has 0 rows, data has 2
I can also use mutate or some sort of a loop, but I'm sure there's a more efficient way to do this, especially since the data set is huge.

library(dplyr)
dat <- dplyr::tibble(result_1 = c(1, 2),
                     date_1 = c(2, 3),
                     result_2 = c(3, 4),
                     date_2 = c(34, 3))
# turn every numeric column into character first, then convert the "result" columns back to numeric
dat %>%
  dplyr::mutate_if(is.numeric, as.character) %>%
  dplyr::mutate_at(dplyr::vars(dplyr::matches("result")), as.numeric)

The other answer works, but note that mutate_at and mutate_if are being superseded by the across function in dplyr:
dat <- data.frame(result_1 = c("4", "2"), date_1 = letters[1:2], result_2 = c("2", "3"))
tidyverse
library(dplyr)
dat %>% mutate(across(matches("result_.*"), as.numeric))
#> result_1 date_1 result_2
#> 1 4 a 2
#> 2 2 b 3
data.table
library(data.table)
dat <- data.table(dat)
cols <- grep("result_.*", names(dat), value=TRUE)
dat[, (cols) := lapply(.SD, as.numeric), .SDcols=cols]
dat
#> result_1 date_1 result_2
#> 1: 4 a 2
#> 2: 2 b 3
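For completeness, here is a base R sketch of the same regex-driven conversion, applied to the original question's data frame (call it data, as in the question); no packages are needed:
# find the columns whose names start with "result", then convert them in place
cols <- grep("^result", names(data), value = TRUE)
data[cols] <- lapply(data[cols], as.numeric)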

Related

Appending a column to each data frame within a list

I have a list of data frames and want to append a new column to each, but I keep getting various error messages. Can anybody explain why the code below doesn't work for me? I'd be happy if rowid_to_column works, as the data in my actual set is already ordered correctly; otherwise I'd like a new column with a sequence going from 1:length(data$data).
##dataset
##dataset
library(tidyverse)
data <- tibble(Location = c(rep("London", 6), rep("Glasgow", 6), rep("Dublin", 6)),
               Day = rep(seq(1, 6, 1), 3),
               Average = runif(18, 0, 20),
               Amplitude = runif(18, 0, 15)) %>%
  nest_by(Location)
###map + rowid_to_column
attempt1 <- data %>%
  map(., rowid_to_column(., var = "hour"))
##mutate
attempt2 <- data %>%
  map(., mutate("Hours" = 1:6))
###add column
attempt3 <- data %>%
  map(.$data, add_column(.data, hours = 1:6))
newcolumn <- 1:6
###lapply
attempt4 <- lapply(data, cbind(data$data, newcolumn))
Many thanks,
Stuart
You were nearly there with your base R attempt, but you want to iterate over data$data, which is a list of data frames.
data$data <- lapply(data$data, function(x) {
  hour <- seq_len(nrow(x))
  cbind(x, hour)
})
data$data
# [[1]]
# Day Average Amplitude hour
# 1 1 6.070539 1.123182 1
# 2 2 3.638313 8.218556 2
# 3 3 11.220683 2.049816 3
# 4 4 12.832782 14.858611 4
# 5 5 12.485757 7.806147 5
# 6 6 19.250489 6.181270 6
Edit: updated as I realised it was iterating over columns rather than rows. This approach will work if the data frames have different numbers of rows, which the methods with the vector defined as 1:6 will not.
a data.table approach
library(data.table)
setDT(data)
data[, data := lapply(data, function(x) cbind(x, new_col = 1:6))]
data$data
# [[1]]
# Day Average Amplitude test new_col
# 1 1 11.139917 0.3690539 1 1
# 2 2 5.350847 7.0925508 2 2
# 3 3 9.602104 6.1782818 3 3
# 4 4 14.866074 13.7356913 4 4
# 5 5 1.114201 1.1007080 5 5
# 6 6 2.447236 5.9944926 6 6
#
# [[2]]
# Day Average Amplitude test new_col
# 1 1 17.230213 13.966576 1 1
# .....
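If the nested data frames do not all have exactly six rows, the hard-coded 1:6 can be replaced with a sequence along each data frame; a sketch of the same data.table call:
# new_col now adapts to the number of rows in each nested data frame
data[, data := lapply(data, function(x) cbind(x, new_col = seq_len(nrow(x))))]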
A purrr approach:
data <- tibble(Location = c(rep("London", 6), rep("Glasgow", 6), rep("Dublin", 6)),
               Day = rep(seq(1, 6, 1), 3),
               Average = runif(18, 0, 20),
               Amplitude = runif(18, 0, 15)) %>%
  group_split(Location) %>%
  purrr::map_dfr(~ .x %>% mutate(Hours = c(1:6)))
If you want to use your approach and preserve the same data structure, here is a way, again using purrr (you need to ungroup first, otherwise it will not work due to the rowwise grouping):
data %>% ungroup() %>%
mutate_at("data", .f = ~map(.x, ~.x %>% mutate(Hours = c(1:6))) )

Split a column list into columns

Suppose I have a DT as -
id values valid_types
1 2|3 100|200
2 4 200
3 2|1 500|100
The valid_types column tells me which valid types an entry has. There are 4 types in total (100, 200, 500, 2000). Each entry lists its valid types and their corresponding values as |-separated character strings.
I want to transform this to a DT which has the types as columns and their corresponding values.
Expected:
id 100 200 500
1 2 3 NA
2 NA 4 NA
3 1 NA 2
I thought I could take both columns and split them on |, which would give me two lists. I would then combine them by using the names of the types list as keys and then convert the final list to a DT.
But the idea I came up with is very convoluted and not really working.
Is there a better/easier way to do this?
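For reference, the example data can be built along these lines (a sketch, assuming values and valid_types are |-separated character columns), so the answers below can be run directly:
library(data.table)
df <- data.frame(id = 1:3,
                 values = c("2|3", "4", "2|1"),
                 valid_types = c("100|200", "200", "500|100"),
                 stringsAsFactors = FALSE)
DT <- as.data.table(df)  # the data.table answers refer to this object as DT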
Here is another data.table approach:
dcast(
DT[, lapply(.SD, function(x) strsplit(x, "\\|")[[1L]]), by = id],
id ~ valid_types, value.var = "values"
)
Using the tidyr library, you can use separate_rows with pivot_wider:
library(tidyr)
df %>%
separate_rows(values, valid_types, sep = '\\|', convert = TRUE) %>%
pivot_wider(names_from = valid_types, values_from = values)
# id `100` `200` `500`
# <int> <int> <int> <int>
#1 1 2 3 NA
#2 2 NA 4 NA
#3 3 1 NA 2
A data.table way would be:
library(data.table)
library(splitstackshape)
setDT(df)
dcast(cSplit(df, c('values', 'valid_types'), sep = '|', direction = 'long'),
      id ~ valid_types, value.var = 'values')

Operations on single row in dplyr [duplicate]

This question already has answers here:
dplyr mutate/replace several columns on a subset of rows
(12 answers)
Closed 3 years ago.
Is it possible to perform dplyr operations with pipes on a single row of a data frame? For example, say I have the following data frame (call it df) and want to do some manipulations to its columns:
df <- df %>%
mutate(col1 = col1 + col2)
This code sets one column equal to the sum of that column and another. What if I want to do this, but only for a single row?
df[1,] <- df[1,] %>%
mutate(col1 = col1 + col2)
I realize this is an easy operation in base R, but I am super curious and would love to use dplyr operations and piping to make this happen. Is this possible or does it go against dplyr grammar?
Here's an example. Say I have a dataframe:
df = data.frame(a = rep(1, 100), b = rep(1,100))
The first example I showed:
df <- df %>%
mutate(a = a + b)
Would result in column a being 2 for all rows.
The second example would only result in the first row of column a being 2.
mutate() is for creating columns.
You can do something like df[1,1] <- df[1,1] + df[1,2]
An Example:
You can use mutate() with case_when() for conditional manipulation.
df %>%
mutate(a = case_when(row_number(a) == 1 ~ a + b,
TRUE ~ a))
results in
# A tibble: 100 x 2
a b
<dbl> <dbl>
1 2 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
# … with 90 more rows
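A slightly shorter variant of the same idea uses if_else() instead of case_when(); a sketch (note that if_else() requires both branches to return the same type):
df %>%
  mutate(a = if_else(row_number() == 1, a + b, a))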
Data
library(tidyverse)
df <- tibble(a = rep(1, 100), b = rep(1,100))

R code to generate numbers in sequence and insert rows [duplicate]

This question already has answers here:
R code to insert rows based on a column's value and increment it by 1
(3 answers)
Closed 6 years ago.
I have a dataset with 2 columns. The first column is an ID and the second column is the total number of quarters. If column B (Quarters) has the value 8, then 8 rows should be created, numbered 1 to 8. The ID in column A should be the same for all of those rows. The dataset shown below is an example.
ID Quarters
A 5
B 2
C 1
Expected output
ID Quarters
A 1
A 2
A 3
A 4
A 5
B 1
B 2
C 1
Here is what I tried.
library(data.table)
setDT(df.WQuarter)[, (Quarters=1:Quarters), ID]
I get an error. Can you please help? I have been stuck on this for the whole day; I am just learning the basics of R.
We can use base R to replicate the 'ID' by 'Quarters' and create the 'Quarters' by taking the sequence of that column.
with(df1, data.frame(ID= rep(ID, Quarters), Quarters = sequence(Quarters)))
# ID Quarters
#1 A 1
#2 A 2
#3 A 3
#4 A 4
#5 A 5
#6 B 1
#7 B 2
#8 C 1
If we are using data.table, convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'ID', get the sequence of 'Quarters' (or just seq(Quarters)).
library(data.table)
setDT(df1)[, .(Quarters=sequence(Quarters)) , by = ID]
As @PierreLaFortune commented on the post, if we have NA values, then we need to remove them first:
setDT(df1)[, .(Quarters = seq_len(Quarters[!is.na(Quarters)])), by = ID]
Or using the dplyr/tidyr
library(dplyr)
library(tidyr)
df1 %>%
group_by(ID) %>%
mutate(Quarters = list(seq(Quarters))) %>%
ungroup() %>%
unnest(Quarters)
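A more compact tidyr alternative is uncount(), which repeats each row according to a count column and can add a within-group index via .id; a sketch (the original Quarters column is consumed as the weights and re-created as the index):
library(tidyr)
uncount(df1, Quarters, .id = "Quarters")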
If the OP's "Quarters" column is non-numeric, it should be converted to 'numeric' before proceeding
df1$Quarters <- as.numeric(as.character(df1$Quarters))
The as.character is there in case the column is a factor; if it is already of character class, as.numeric is enough.
data
df1 <- structure(list(ID = c("A", "B", "C"), Quarters = c(5L, 2L, 1L
)), .Names = c("ID", "Quarters"), class = "data.frame", row.names = c(NA,
-3L))

Count occurrences across multiple columns using R & dplyr

This should have a simple solution... I just can't wrap my head around it. I'd like to count the occurrences of a factor across multiple columns of a data frame. There are 13 columns, ranging from abx.1 to abx.13, and a huge number of rows.
Sample data frame:
library(dplyr)
abx.1 <- c('Amoxil', 'Cipro', 'Moxiflox', 'Pip-tazo')
start.1 <- c('2012-01-01', '2012-02-01', '2013-01-01', '2014-01-01')
abx.2 <- c('Pip-tazo', 'Ampicillin', 'Amoxil', NA)
start.2 <- c('2012-01-01', '2012-02-01', '2013-01-01', NA)
abx.3 <- c('Ampicillin', 'Amoxil', NA, NA)
start.3 <- c('2012-01-01', '2012-02-01', NA,NA)
worksheet <- data.frame(abx.1, start.1, abx.2, start.2, abx.3, start.3)
Result I'd like:
name count
Amoxil 3
Ampicillin 2
Pip-tazo 2
Cipro 1
Moxiflox 1
I've tried :
worksheet %>% group_by (abx.1, abx.2, abx.3) %>% summarise(count = n())
This doesn't give me my desired output. Any thoughts would be greatly appreciated.
If you want a dplyr solution, I'd suggest combining it with tidyr in order to convert your data to a long format first:
library(tidyr)
worksheet %>%
select(starts_with("abx")) %>%
gather(key, value, na.rm = TRUE) %>%
count(value)
# Source: local data frame [5 x 2]
#
# value n
# 1 Amoxil 3
# 2 Ampicillin 2
# 3 Cipro 1
# 4 Moxiflox 1
# 5 Pip-tazo 2
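Note that gather() has since been superseded by pivot_longer(); a roughly equivalent sketch, assuming the abx columns are character (the default for data.frame() in R >= 4.0):
worksheet %>%
  select(starts_with("abx")) %>%
  pivot_longer(everything(), values_drop_na = TRUE) %>%
  count(value, sort = TRUE)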
Alternatively, with base R, it's just
as.data.frame(table(unlist(worksheet[grep("^abx", names(worksheet))])))
# Var1 Freq
# 1 Amoxil 3
# 2 Cipro 1
# 3 Moxiflox 1
# 4 Pip-tazo 2
# 5 Ampicillin 2
