Creating a new column which is a vector of other columns - r

I have a dataframe with two columns (V1 and V2) and I'd like to create another column which is a vector - via combine function: c() - taking as arguments the other columns.
I'm using dplyr for all tasks, so I'd like to use it also in this context.
I've tried to create the new column with mapply, but it returns a single vector built from all the rows (not one vector per row), which surprises me, because other functions work row by row inside mutate.
I've solved it with rowwise, but since rowwise is usually not very efficient, I'd like to see if there's another option.
Here is the definition of the dataframe:
IDs <- structure(list(V1 = c("1", "1", "6"),
V2 = c("6", "8", "8")),
class = "data.frame",
row.names = c(NA, -3L)
)
Here is the creation of the columns (together1 is the wrong result, and together2 the good one):
IDs <-
IDs %>%
mutate(together1 = list(mapply(function (x,y) c(x,y), V1, V2))
) %>%
rowwise() %>%
mutate(together2 = list(mapply(function (x,y) c(x,y), V1, V2))
) %>%
ungroup()
Here are the printed results:
print(as.data.frame(IDs))
V1 V2 together1 together2
1 1 6 1, 6, 1, 8, 6, 8 1, 6
2 1 8 1, 6, 1, 8, 6, 8 1, 8
3 6 8 1, 6, 1, 8, 6, 8 6, 8
Thanks in advance!

You can do it with purrr's map2 function:
library(dplyr)
library(purrr)
IDs %>%
mutate(together = map2(V1, V2, ~c(.x, .y)))

pmap could be used here
library(tidyverse)
IDs %>%
mutate(together = pmap(unname(.), c))
# V1 V2 together
#1 1 6 1, 6
#2 1 8 1, 8
#3 6 8 6, 8

You just missed SIMPLIFY = FALSE in your mapply() call:
dplyr::mutate(IDs, together = mapply(c, V1, V2, SIMPLIFY = FALSE))
V1 V2 together
1 1 6 1, 6
2 1 8 1, 8
3 6 8 6, 8
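To see why the original mapply() call went wrong, it can help to compare the two return shapes directly. The sketch below reuses V1 and V2 from the question as standalone vectors; the names simplified, kept and same are made up for illustration.

```r
V1 <- c("1", "1", "6")
V2 <- c("6", "8", "8")

# Default SIMPLIFY = TRUE: equal-length results are collapsed into a matrix
# with one column per pair of inputs.
simplified <- mapply(c, V1, V2)
dim(simplified)

# SIMPLIFY = FALSE keeps one list element per pair, which is the shape
# a list-column needs.
kept <- mapply(c, V1, V2, SIMPLIFY = FALSE)
length(kept)

# Map() is a thin wrapper around mapply(..., SIMPLIFY = FALSE).
same <- Map(c, V1, V2)
```

Wrapping the simplified matrix in list(), as in the question, recycles that one matrix into every row, which is where the repeated "1, 6, 1, 8, 6, 8" comes from.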

Related

How to select columns in an R dataframe based on string matching

I don't think this exact question has been asked yet (for R, anyway).
I want to retain any columns in my dataset (there are hundreds in actuality) that contain a certain string, and drop the rest. I have found plenty of examples of string searching column names, but nothing for the contents of the columns themselves.
As an example, say I have this dataset:
df = data.frame(v1 = c(1, 8, 7, 'No number'),
v2 = c(5, 3, 5, 1),
v3 = c('Nothing', 4, 2, 9),
v4 = c(3, 8, 'Something', 6))
For this example, say I want to retain any columns with the string No, so that the resulting dataset is:
v1 v3
1 1 Nothing
2 8 4
3 7 2
4 No number 9
How can I do this in R? I am happy with any sort of solution (e.g., base R, dplyr, etc.)!
Thanks in advance!
Simply
df[grep("No", df)]
# v1 v3
# 1 1 Nothing
# 2 8 4
# 3 7 2
# 4 No number 9
This works because grep internally checks if (!is.character(x)) and, if that's true, it basically does:
s <- structure(as.character(df), names = names(df))
s
# v1
# "c(\"1\", \"8\", \"7\", \"No number\")"
# v2
# "c(5, 3, 5, 1)"
# v3
# "c(\"Nothing\", \"4\", \"2\", \"9\")"
# v4
# "c(\"3\", \"8\", \"Something\", \"6\")"
grep("No", s)
# [1] 1 3
Note:
R.version.string
# [1] "R version 4.0.3 (2020-10-10)"
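One consequence of this coercion is worth checking before relying on the shortcut: because grep() sees the deparsed text of each column, a pattern can match the "c(...)" wrapper rather than a cell value. The sketch below assumes the same df as in the question and uses the pattern "c" purely as an illustration; whole_frame and per_column are made-up names.

```r
df <- data.frame(v1 = c(1, 8, 7, "No number"),
                 v2 = c(5, 3, 5, 1),
                 v3 = c("Nothing", 4, 2, 9),
                 v4 = c(3, 8, "Something", 6),
                 stringsAsFactors = FALSE)

# grep() on the whole data frame searches the deparsed text of each column,
# so the "c" of "c(...)" matches every column.
whole_frame <- grep("c", df)

# Checking the actual cell values column by column avoids that.
per_column <- which(vapply(df, function(col) any(grepl("c", col)), logical(1)))
```

For patterns that cannot occur in the deparsed wrapper, such as "No" here, both approaches should agree.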
Base R:
df[colSums(sapply(df, grepl, pattern = 'No')) > 0]
# v1 v3
#1 1 Nothing
#2 8 4
#3 7 2
#4 No number 9
Using dplyr:
library(dplyr)
df %>% select(where(~any(grepl('No', .))))
Use the dplyr::select_if() function (superseded by select(where(...)) since dplyr 1.0.0):
df <- df %>% select_if(function(col) any(grepl("No", col)))
You can run grepl on each column and keep the columns in which at least one value matches.
df = data.frame(v1 = c(1, 8, 7, 'No number'),
v2 = c(5, 3, 5, 1),
v3 = c('Nothing', 4, 2, 9),
v4 = c(3, 8, 'Something', 6))
find.no <- sapply(X = df, FUN = function(x) {
any(grepl("No", x = x))
})
df[, find.no]
v1 v3
1 1 Nothing
2 8 4
3 7 2
4 No number 9

How to arrange elements of a vector based on a square matrix

I have a vector that results from a square matrix as below
P = as.vector(matrix(c(1,2,3,4),nrow=2))
What would be the simplest way of arranging this vector to get a response similar to what I have below as columns
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4
1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4
I have been able to arrange the first 2 columns as
library(tidyverse)
df <- expand.grid(rep(c(1, 2, 3, 4),2))
df1 <- df %>% arrange_all()
df = expand.grid(a = df1[,1], b = df1[,1])
df[,c(2,1)]
The last column should repeat as a whole through
1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4
Does this work:
as.vector(apply(matrix(P, nrow = 2), 2, rep, 4))
[1] 1 2 1 2 1 2 1 2 3 4 3 4 3 4 3 4
paste(as.vector(apply(matrix(P, nrow = 2), 2, rep, 4)), collapse = ',')
[1] "1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4"
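If all three columns from the question are needed, one possible sketch builds each of them with rep(). The repeat counts (16, 2 and 8, 4) are read off the desired output shown in the question and are not derived from a general rule; col1, col2, col3, block and out are made-up names.

```r
P <- as.vector(matrix(c(1, 2, 3, 4), nrow = 2))

# Column 1: each element of P repeated 16 times in a row.
col1 <- rep(P, each = 16)

# Column 2: each element repeated twice, the whole pattern repeated 8 times.
col2 <- rep(rep(P, each = 2), times = 8)

# Column 3: the 16-element block from the answer above, repeated 4 times.
block <- as.vector(apply(matrix(P, nrow = 2), 2, rep, 4))
col3 <- rep(block, times = 4)

out <- data.frame(col1, col2, col3)
```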

R function that counts rows where conditions are met

I am trying to create a new column that counts each column where criteria are met. That is because I want to summarize the number of correct answers by each participant in my master thesis. I am really new to R and in desperate need for help, even on easy tasks.
For Example:
Participant Task1 Task2 Task3 COUNT
1 4 8 1 1
2 3 8 7 1
3 1 3 4 2
4 5 6 4 1
5 1 8 4 3
The COUNT column should count the correct answers across the columns Task1-Task3 in each row. If the correct answers are (1, 8, 4), COUNT should contain the numbers shown in the example above.
Can anybody tell me how to create such a variable?
Really appreciated, thanks
Luca
We can use rowSums by making the vector c(1, 8, 4) length same as the 'Task' columns length and do a ==, and get the rowSums
i1 <- startsWith(names(df1), 'Task')
df1$COUNT <- rowSums(df1[i1] == c(1, 8, 4)[col(df1[i1])])
df1$COUNT
#[1] 1 1 2 1 3
Or with sweep
rowSums(sweep(df1[i1], 2, c(1, 8, 4), `==`))
Or another option is apply
df1$COUNT <- apply(df1[i1], 1, function(x) sum(x == c(1, 8, 4)))
NOTE: None of the solutions require any external package
data
df1 <- data.frame(Participant = 1:5, Task1 = c(4, 3, 1, 5, 1),
Task2 = c(8, 8, 3, 6, 8), Task3 = c(1, 7, 4, 4, 4))
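To make the comparison step more visible, here is a minimal sketch that spells out the intermediate matrices. It assumes the same df1 and answer key as above; key, key_mat, hits and counts are illustrative names only.

```r
df1 <- data.frame(Participant = 1:5, Task1 = c(4, 3, 1, 5, 1),
                  Task2 = c(8, 8, 3, 6, 8), Task3 = c(1, 7, 4, 4, 4))
key <- c(1, 8, 4)

task_cols <- startsWith(names(df1), "Task")

# Repeat the key once per participant so it lines up with the Task columns.
key_mat <- matrix(key, nrow = nrow(df1), ncol = length(key), byrow = TRUE)

# TRUE wherever an answer equals the key for that task.
hits <- as.matrix(df1[task_cols]) == key_mat

# Each TRUE counts as 1, so the row sums are the number of correct answers.
counts <- rowSums(hits)
```

Printing hits before summing is a quick way to check which individual answers were counted as correct.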
We can use pmap_int from purrr to count number of correct answers.
library(dplyr)
df1 %>% mutate(COUNT = purrr::pmap_int(select(., starts_with('Task')),
~sum(c(...) == c(1, 8, 4))))
# Participant Task1 Task2 Task3 COUNT
#1 1 4 8 1 1
#2 2 3 8 7 1
#3 3 1 3 4 2
#4 4 5 6 4 1
#5 5 1 8 4 3
Another option is to get data in long format, calculate the number of correct answers for each Participant and join the data back.
df1 %>%
tidyr::pivot_longer(cols = starts_with('Task')) %>%
group_by(Participant) %>%
summarise(COUNT = sum(value == c(1, 8, 4))) %>%
left_join(df1, by = 'Participant')

read rectangular data blocks with separate tags as new columns

I feel my situation is a typical use case in experiments where the data are logged as text files for human understanding, but not for machine consumption. Tags are interspersed with the actual data to describe the data that follows. For data analysis, the tags need to be integrated with the data rows to be useful. Below is a made-up example.
TAG1, t1_1
DATA_A, 5, 3, 4, 8
DATA_A, 3, 4, 5, 7
TAG1, t1_2
TAG2, t2_1
DATA_B, 1, 2, 3, 4, 5
DATA_A, 1, 2, 3, 4
The desired parse results should be two data frames. One for DATA_A,
X1, X2, X3, X4, TAG1, TAG2
5, 3, 4, 8, t1_1, NA
3, 4, 5, 7, t1_1, NA
1, 2, 3, 4, t1_2, t2_1
and one for DATA_B
X1, X2, X3, X4, X5, TAG1, TAG2
1, 2, 3, 4, 5, t1_2, t2_1
The current method (implemented in Python) checks the file line by line. If a line starts with "T", the corresponding tag variable is updated; if it starts with "DATA", the tag values are appended to the end of the "DATA" line, and the now completed line is appended to the corresponding CSV file. In the end, the CSV files are read into data frames for data analysis.
I wonder if this data import can be done faster in one step. What I have in mind is
library(tidyverse)
text_frame <- read_lines(clipboard(), skip_empty_rows = TRUE) %>%
enframe(name = NULL, value = "line")
text_frame %>%
separate(line, into = c("ID", "value"), extra = "merge", sep = ", ")
which produces
# A tibble: 7 x 2
ID value
<chr> <chr>
1 TAG1 t1_1
2 DATA_A 5, 3, 4, 8
3 DATA_A 3, 4, 5, 7
4 TAG1 t1_2
5 TAG2 t2_1
6 DATA_B 1, 2, 3, 4, 5
7 DATA_A 1, 2, 3, 4
The next step is to create new columns "TAG1" and "TAG2" with the tag values added to each row. This is where I got stuck. It is like gather for individual rows. How could I do it? Is the general approach reasonable? Any suggestions?
Fast/memory-efficient solutions are welcome since I need to deal with hundreds of ~10MB text files (they all have the same structure).
Using the input data
text <- '
TAG1, t1_1
DATA_A, 5, 3, 4, 8
DATA_A, 3, 4, 5, 7
TAG1, t1_2
TAG2, t2_1
DATA_B, 1, 2, 3, 4, 5
DATA_A, 1, 2, 3, 4
'
You can get the tags from the second column of the imported data V2 by selecting the elements of V2 where the first column V1 is TAG[1|2], and do this for each group. Groups are identified by a variable starting at 0 and incrementing by 1 after each occurrence of [V1 contains TAG then V1 doesn't contain TAG].
Then with tags as their own columns you can remove the TAG rows, and split the data according to whether the first column contains 'B'
library(data.table)
df <- fread(text, fill = TRUE, blank.lines.skip = TRUE)
df[, `:=`(TAG1 = V2[V1 == 'TAG1'],
TAG2 = V2[V1 == 'TAG2']),
by = .(g = (rleid(grepl('TAG', V1)) - 1) %/% 2)]
df <- df[-grep('TAG', V1)]
split(df, df[, grepl('B', V1)])
# $`FALSE`
# V1 V2 V3 V4 V5 V6 TAG1 TAG2
# 1: DATA_A 5 3 4 8 NA t1_1 <NA>
# 2: DATA_A 3 4 5 7 NA t1_1 <NA>
# 3: DATA_A 1 2 3 4 NA t1_2 t2_1
#
# $`TRUE`
# V1 V2 V3 V4 V5 V6 TAG1 TAG2
# 1: DATA_B 1 2 3 4 5 t1_2 t2_1
If you don't always have 2 tags and may have more or fewer, you can replace the step after fread above with
n_tags <- df[, as.numeric(gsub('[^0-9]', '', max(grep('TAG', V1, value = T))))]
df[, g := (rleid(grepl('TAG', V1)) - 1) %/% 2]
for(i in seq_len(n_tags))
df[, paste0('TAG', i) := V2[V1 == paste0('TAG', i)], g]
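For comparison, the line-by-line logic described in the question (each tag keeps its most recent value until it is updated) can also be sketched in base R. This is only an illustration under that assumption, not a tuned solution for large files; log_lines, carry_forward, tags and result are hypothetical names.

```r
log_lines <- c("TAG1, t1_1", "DATA_A, 5, 3, 4, 8", "DATA_A, 3, 4, 5, 7",
               "TAG1, t1_2", "TAG2, t2_1", "DATA_B, 1, 2, 3, 4, 5",
               "DATA_A, 1, 2, 3, 4")

parts <- strsplit(log_lines, ",\\s*")
id <- vapply(parts, `[`, character(1), 1)
first_value <- vapply(parts, `[`, character(1), 2)

# Carry the most recent value of a tag forward; NA before its first occurrence.
carry_forward <- function(tag) {
  value <- ifelse(id == tag, first_value, NA_character_)
  last_seen <- cummax(ifelse(is.na(value), 0L, seq_along(value)))
  c(NA_character_, value)[last_seen + 1L]
}

tags <- data.frame(TAG1 = carry_forward("TAG1"),
                   TAG2 = carry_forward("TAG2"),
                   stringsAsFactors = FALSE)

# One data frame per DATA_* type, with the tag columns attached.
is_data <- startsWith(id, "DATA")
result <- lapply(split(which(is_data), id[is_data]), function(rows) {
  values <- do.call(rbind, lapply(parts[rows], function(p) as.numeric(p[-1])))
  cbind(as.data.frame(values), tags[rows, , drop = FALSE])
})
```

result is a named list, so result$DATA_A and result$DATA_B correspond to the two data frames requested in the question.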

Row-wise sum for columns with certain names

I have a sample data:
SampleID a b d f ca k l cb
1 0.1 2 1 2 7 1 4 3
2 0.2 3 2 3 4 2 5 5
3 0.5 4 3 6 1 3 9 2
I need to find the row-wise sum of columns that have something in common in their names, e.g. row-wise sum(a, ca) or row-wise sum(b, cb). The problem is that I have a large data.frame, and ideally I would be able to specify what the column headers have in common so that the code picks only those columns to sum.
Thank you beforehand for any assistance.
We can select the columns that have 'a' with grep, subset the columns and do rowSums and the same with 'b' columns.
rowSums(df1[grep('a', names(df1)[-1])+1])
rowSums(df1[grep('b', names(df1)[-1])+1])
If you want the output as a data frame, try using dplyr
# Recreating your sample data
df <- data.frame(SampleID = c(1, 2, 3),
a = c(0.1, 0.2, 0.5),
b = c(2, 3, 4),
d = c(1, 2, 3),
f = c(2, 3, 6),
ca = c(7, 4, 1),
k = c(1, 2, 3),
l = c(4, 5, 9),
cb = c(3, 5, 2))
Process the data
# load dplyr
library(dplyr)
# Sum across columns 'a' and 'ca' (sum(a, ca))
df2 <- df %>%
select(contains('a'), -SampleID) %>% # 'select' function to choose the columns you want
mutate(row_sum = rowSums(.)) # 'mutate' function to create a new column 'row_sum' with the sum of the selected columns. You can drop the selected columns by using 'transmute' instead.
df2 # have a look
a ca row_sum
1 0.1 7 7.1
2 0.2 4 4.2
3 0.5 1 1.5
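Since the question asks for something reusable on a large data frame, the same idea can be wrapped in a small helper. The function row_sum_matching and its id_col argument are hypothetical names introduced here, and the sketch assumes the ID column should always be excluded from matching.

```r
df <- data.frame(SampleID = c(1, 2, 3),
                 a = c(0.1, 0.2, 0.5), b = c(2, 3, 4),
                 d = c(1, 2, 3), f = c(2, 3, 6),
                 ca = c(7, 4, 1), k = c(1, 2, 3),
                 l = c(4, 5, 9), cb = c(3, 5, 2))

# Sum, row by row, every column whose name matches `pattern`,
# ignoring the ID column so that "SampleID" is never picked up.
row_sum_matching <- function(data, pattern, id_col = "SampleID") {
  candidates <- setdiff(names(data), id_col)
  rowSums(data[candidates[grepl(pattern, candidates)]])
}

sum_a <- row_sum_matching(df, "a")  # columns a and ca
sum_b <- row_sum_matching(df, "b")  # columns b and cb
```

Because pattern is a regular expression, something like "a$" would restrict the match to names ending in "a" if the plain substring turns out to be too broad.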
