Suppose we have the following data:
X Y
  6
  1
  2
  2
1 1
  8
  3
  4
  1
1 2
I want to convert it to:
X Y Y-1 Y-2 Y-3
  6
  1
  2
  2
1 1   2   2   1
  8
  3
  4
  1
1 2   1   4   3
That is: for rows with X=1, take the 3 previous Y values and append them to that row.
I "brute-forced" it with a loop:
namevector <- c("Y-1", "Y-2", "Y-3")
mydata[ , namevector] <- ""
for(i in 1:nrow(mydata)){
if(mydata$X[i] != ""){mydata[i,3:5] <- mydata$Y[(i-1):(i-3)]}
}
But it was too slow for my dataset of ~300k points - about 10 minutes.
Then I found a post with a similar question that proposed the which function, which reduced the time to a tolerable 1-2 minutes:
namevector <- c("Y-1", "Y-2", "Y-3")
mydata[ , namevector] <- ""
trials_rows <- which(mydata$X != "")
for (i in trials_rows) {mydata[i,3:5] <- mydata$Y[(i-1):(i-3)]}
But considering that which itself takes less than a second, I believe I can somehow combine which with some kind of transpose function, but I can't wrap my head around it.
I have a big data frame (~300k rows), and ~6k rows have this "X" value.
Is there a fast and simple way to do this without iterating through the results of which?
You can do this with a single assignment using some vectorised trickery:
mydata[trials_rows, namevector] <- mydata$Y[trials_rows - rep(1:3,each=length(trials_rows))]
mydata
#    X Y Y-1 Y-2 Y-3
#1  NA 6
#2  NA 1
#3  NA 2
#4  NA 2
#5   1 1   2   2   1
#6  NA 8
#7  NA 3
#8  NA 4
#9  NA 1
#10  1 2   1   4   3
Basically, take each row in trials_rows, look backwards one, two and three rows using a vectorised subtraction (trials_rows is recycled against rep(1:3, each=length(trials_rows))), and then overwrite the block given by trials_rows in rows and namevector in columns.
Reproducible example used here:
mydata <- structure(list(X = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L),
Y = c(6L, 1L, 2L, 2L, 1L, 8L, 3L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-10L))
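To make the index arithmetic easier to follow, here is a minimal sketch that builds the same lookup as an explicit matrix with outer(). It assumes X is NA on the non-trial rows (as in the reproducible example) and that every trial row has at least three rows above it; idx and lags are names made up for this illustration.

```r
# Toy data mirroring the reproducible example (X is NA except on trial rows)
mydata <- data.frame(X = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L),
                     Y = c(6L, 1L, 2L, 2L, 1L, 8L, 3L, 4L, 1L, 2L))

trials_rows <- which(!is.na(mydata$X))   # rows 5 and 10

# One row per trial row, one column per lag (1, 2, 3 rows back)
idx <- outer(trials_rows, 1:3, "-")

# Y is a plain vector, so the matrix index is read column by column;
# refilling a matrix with the same number of rows restores the layout
lags <- matrix(mydata$Y[idx], nrow = length(trials_rows),
               dimnames = list(NULL, c("Y-1", "Y-2", "Y-3")))
lags
#      Y-1 Y-2 Y-3
# [1,]   2   2   1
# [2,]   1   4   3
```

The flattened form of idx is exactly the vector produced by trials_rows - rep(1:3, each=length(trials_rows)), which is why the one-line assignment fills the three new columns in the right order.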
Help with applying functions over a list of data frames.
I don't often work with lists or functions, so after 3 hours of searching and testing I need some assistance.
I have a list of 2 data frames as follows (real list has 40+):
df1 <- structure(list(ID = 1:4,
Period = c("C_2021", "C_2021", "C_2021", "C_2021"),
subjects = c(2044L, 2044L, 2058L, 2059L),
Q_1_A = c(1L, 1L, 4L, 6L),
Q_1_B = c(6L, 1L, 6L, NA),
col3 = c(4L, 6L, 5L, 2L),
col4 = c(3L, 5L, 4L, 4L)),
class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(ID = 1:4,
Period = c("C_2022", "C_2022", "C_2022", "C_2022"),
subjects = c(2058L, 2058L, 2065L, 2066L),
Q_1_A = c(2L, 5L, 5L, 6L),
Q_1_B = c(6L, 1L, 4L, NA),
col3 = c(NA, 6L, 5L, 3L),
col4 = c(3L, 6L, 5L, 5L)),
class = "data.frame", row.names = c(NA, -4L))
The structure of the datasets is as follows:
df1
ID Period subjects Q_1_A Q_1_B col3 col4
1 1 C_2021 2044 1 6 4 3
2 2 C_2021 2044 1 1 6 5
3 3 C_2021 2058 4 6 5 4
4 4 C_2021 2059 6 NA 2 4
df2
ID Period subjects Q_1_A Q_1_B col3 col4
1 1 C_2022 2058 2 6 NA 3
2 2 C_2022 2058 5 1 6 6
3 3 C_2022 2065 5 4 5 5
4 4 C_2022 2066 6 NA 3 5
The list of df's
dflist <- list(df1, df2)
I would like to do 2 things:
1. Conditional removal of string before 2nd underscore
I would like to remove characters before the 2nd underscore only in columns beginning with "Q". Column "Q_1_A" would become "A". The code should only impact columns starting with "Q".
Note: The ifelse is important - in the real data there are other columns with 2 underscores that cannot be modified, and the columns in data frames may be in different orders so it needs to be done by column name.
#doesnt work (cant seem to get purr working either)
dflist <- lapply(dflist, function(x) {
names(x) <- ifelse(starts_with(names(x), "Q"), sub("^[^_]*_", "", names(x)), .x)
x})
2. Once column names are updated, remove columns present on a list.
Note: In the real data there are a lot of columns in each df, it's much easier to list the columns to keep rather than remove.
List of columns to keep below
The list is structured assuming the renaming above has been completed.
col_keep <- c("ID", "Period", "subjects", "A", "B")
#doesnt work
dflist <- lapply(dflist, function(x) {
x[(names(x) %in% col_keep)]
x})
**UPDATE** I think actually the following will work
dflist <- lapply(dflist, function(x)
{x <- x %>% select(any_of(col_keep))})
# Is this the best way to do it?
Help would be greatly appreciated.
For the first requirement, apply this:
dflist <- lapply(dflist, function(x) {
names(x) <- ifelse(startsWith(names(x), "Q"),
gsub("[Q_0-9]+", "" , names(x)), names(x))
x})
and for the second:
col_keep <- c("ID", "Period", "subjects", "A", "B")
dflist <- lapply(dflist, function(x) subset(x , select = col_keep))
In base R:
lapply(dflist, \(x)setNames(x, sub('^Q([^_]*_){2}', '', names(x)))[col_keep])
[[1]]
ID Period subjects A B
1 1 C_2021 2044 1 6
2 2 C_2021 2044 1 1
3 3 C_2021 2058 4 6
4 4 C_2021 2059 6 NA
[[2]]
ID Period subjects A B
1 1 C_2022 2058 2 6
2 2 C_2022 2058 5 1
3 3 C_2022 2065 5 4
4 4 C_2022 2066 6 NA
In tidyverse:
library(tidyverse)
dflist %>%
map(~rename_with(.,~str_remove(.,'([^_]+_){2}'), starts_with('Q'))%>%
select(all_of(col_keep)))
[[1]]
ID Period subjects A B
1 1 C_2021 2044 1 6
2 2 C_2021 2044 1 1
3 3 C_2021 2058 4 6
4 4 C_2021 2059 6 NA
[[2]]
ID Period subjects A B
1 1 C_2022 2058 2 6
2 2 C_2022 2058 5 1
3 3 C_2022 2065 5 4
4 4 C_2022 2066 6 NA
Another solution using base R:
# wrap up code for ease of reading
validate_names <- function(df) {
setNames(df, ifelse(grepl("^Q", names(df)),
gsub("[Q_0-9]", "", names(df)), names(df)))
}
# lapply to transform list, then subset with character vector
lapply(dflist, validate_names) |>
lapply(`[`, col_keep)
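The question stresses that other columns with two underscores must be left alone. As a rough sketch of that requirement, the snippet below renames only the names that start with "Q" and then keeps the listed columns by name. The column other_x_y and the helper rename_q are hypothetical, added only to show that non-Q names survive unchanged.

```r
# Made-up data frame: two Q columns plus a non-Q column with two underscores
df <- data.frame(ID = 1:2, Q_1_A = 3:4, Q_1_B = 5:6,
                 other_x_y = 7:8, col3 = 9:10)
col_keep <- c("ID", "A", "B", "other_x_y")

rename_q <- function(x) {
  nm <- names(x)
  is_q <- startsWith(nm, "Q")
  # strip everything up to and including the 2nd underscore, Q columns only
  nm[is_q] <- sub("^[^_]*_[^_]*_", "", nm[is_q])
  names(x) <- nm
  # intersect() keeps only columns that exist, in the order of col_keep
  x[intersect(col_keep, nm)]
}

out <- lapply(list(df), rename_q)
names(out[[1]])
# [1] "ID"        "A"         "B"         "other_x_y"
```

Using intersect() instead of indexing with col_keep directly means a data frame that lacks one of the listed columns is subset silently rather than raising an error; whether that is desirable depends on the data.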
I am trying to use rowSums, but I get zero for the last row and I need it to be NA.
My df is
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
I used this code, based on this question: Sum of two Columns of Data Frame with NA Values
df$sum<-rowSums(df[,c("a", "b", "c")], na.rm=T)
Any advice will be greatly appreciated
For each row check if it is all NA and if so return NA; otherwise, apply sum. We have selected columns a, b and c even though that is all the columns because the poster indicated that there might be additional ones.
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
transform(df, sum = apply(df[c("a", "b", "c")], 1, sum_or_na))
giving:
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
Note
df in reproducible form is assumed to be:
df <- structure(list(a = c(1L, 2L, 3L, NA), b = c(4L, NA, 5L, NA),
c = c(7L, 8L, NA, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")
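On a large data frame, apply() over rows can be slow. A possible vectorised variant, sketched here under the same assumptions about df, keeps rowSums and then resets the all-NA rows afterwards; cols and all_na are names introduced for this example.

```r
df <- data.frame(a = c(1, 2, 3, NA), b = c(4, NA, 5, NA), c = c(7, 8, NA, NA))
cols <- c("a", "b", "c")

# TRUE for rows where every selected column is NA
all_na <- rowSums(!is.na(df[cols])) == 0

df$sum <- rowSums(df[cols], na.rm = TRUE)  # all-NA rows come out as 0 here
df$sum[all_na] <- NA                       # ...so put NA back for those rows
df$sum
# [1] 12 10  8 NA
```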
I have a dataset, data1, which has ME and PDR columns.
I want to create a third column, case, which would look like this:
ME PDR case
1 2 2
NA 1 1
NA 1 1
1 2 2
NA NA NA
I tried this command, but it doesn't return 1 when there is a 1 in either column and no 2 in any of them.
data1$case=ifelse(data1$ME==2 | data1$PDR==2 ,2,ifelse(data1$ME==NA & data1$PDR==NA,NA,1))
We can use pmax
data1$case <- do.call(pmax, c(data1, na.rm = TRUE))
data1$case
#[1] 2 1 1 2 NA
Regarding the OP's attempt: == returns NA for any element that is NA, so the NA cases have to be handled explicitly by adding a condition (& !is.na(ME), and likewise for PDR):
with(data1, ifelse((ME == 2 & !is.na(ME)) | (PDR == 2 & !is.na(PDR)),
2, ifelse(is.na(ME) & is.na(PDR), NA, 1)))
#[1] 2 1 1 2 NA
NOTE: The == for checking NA is not recommended as there are functions to get a logical vector when there are missing values (is.na, complete.cases)
data
data1 <- structure(list(ME = c(1L, NA, NA, 1L, NA), PDR = c(2L, 1L, 1L,
2L, NA)), class = "data.frame", row.names = c(NA, -5L))
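A short sketch of why the original ifelse fails, plus one possible shortcut: %in% never returns NA, so it can stand in for the == 2 & !is.na(...) pair. The vectors below are copied from the example data; eq, m and case are names used only here.

```r
ME  <- c(1L, NA, NA, 1L, NA)
PDR <- c(2L, 1L, 1L, 2L, NA)

# Comparing with == propagates NA, even when comparing against NA itself
eq <- ME == NA
eq
# [1] NA NA NA NA NA

# pmax with na.rm = TRUE ignores NA unless every input is NA at that position
m <- pmax(ME, PDR, na.rm = TRUE)
m
# [1]  2  1  1  2 NA

# %in% returns FALSE (not NA) for missing values
case <- ifelse(ME %in% 2 | PDR %in% 2, 2,
               ifelse(is.na(ME) & is.na(PDR), NA, 1))
case
# [1]  2  1  1  2 NA
```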
I have inherited a data set that is 23 attributes measured for each of 13 names (between-subjects--each participant only rated one name on all of these attributes). Right now it's structured such that the attributes are the fastest-moving factor, followed by the name. So the data look like this:
Sub# N1-item1 N1-item2 N1-item3 […] N2-item1 N2-item2 N2-item3
1 3 5 3 NA NA NA
2 NA NA NA 1 5 3
3 3 5 3 NA NA NA
4 NA NA NA 2 2 1
It needs to be restructured such that it's collapsed over name, and all of the item1 entries are in the same column (subjects don't matter for this purpose), as below (bearing in mind that there are 23 items not 3 and 13 names not 2):
Name item1 item2 item3
N1 3 5 3
N2 1 5 3
I can do this with loops, but I'd rather do it in a manner more natural to R, which I'm guessing would be one of the apply family of functions, but I can't quite wrap my head around it--what is the smart way to do this?
Here's an answer using dplyr and tidyr:
library(dplyr)#loads libraries
library(tidyr)
dat %>% #name of your dataframe
gather(key, val, -Sub) %>% #gathers to long data, with id as Sub
filter(!is.na(val)) %>% #removes rows with NA for the value
separate(key, c("Name", "item")) %>% #split the column key into Name and item
spread(item, val) #spreads the data into wide format, with item as the columns
Sub Name item1 item2 item3
1 1 N1 3 5 3
2 2 N2 1 5 3
3 3 N1 3 5 3
4 4 N2 2 2 1
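gather() and spread() still work but are marked as superseded in current tidyr. A sketch of the same reshaping with pivot_longer(), assuming tidyr 1.0.0 or later and column names of the form N1-item1 as in the question; the data frame is a made-up copy of the example and long is a name chosen for illustration.

```r
library(tidyr)

dat <- data.frame(
  Sub = 1:4,
  `N1-item1` = c(3L, NA, 3L, NA), `N1-item2` = c(5L, NA, 5L, NA),
  `N1-item3` = c(3L, NA, 3L, NA), `N2-item1` = c(NA, 1L, NA, 2L),
  `N2-item2` = c(NA, 5L, NA, 2L), `N2-item3` = c(NA, 3L, NA, 1L),
  check.names = FALSE   # keep the hyphens in the column names
)

# Split each name at "-": the first part becomes the Name column,
# the second part (".value") becomes the output column name
long <- pivot_longer(
  dat, -Sub,
  names_to = c("Name", ".value"),
  names_sep = "-",
  values_drop_na = TRUE   # drop Sub/Name combinations with no ratings
)
long
```

The ".value" sentinel does the separate-then-spread step in one call, so no intermediate long table of single values is needed.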
Spin the column names around to be itemX-NY and then let reshape sort it out:
names(dat)[-1] <- gsub("(^.+?)-(.+?$)", "\\2-\\1", names(dat)[-1])
na.omit(reshape(dat, direction="long", idvar="Sub", varying=-1, sep="-"))
# Sub time item1 item2 item3
#1.N1 1 N1 3 5 3
#3.N1 3 N1 3 5 3
#2.N2 2 N2 1 5 3
#4.N2 4 N2 2 2 1
Where the data was:
dat <- structure(list(Sub = 1:4, `N1-item1` = c(3L, NA, 3L, NA), `N1-item2` = c(5L,
NA, 5L, NA), `N1-item3` = c(3L, NA, 3L, NA), `N2-item1` = c(NA,
1L, NA, 2L), `N2-item2` = c(NA, 5L, NA, 2L), `N2-item3` = c(NA,
3L, NA, 1L)), row.names = c(NA, -4L), class = "data.frame")
I'm looking for a way to refer to a previous row in my data frame that has one column value in common with the 'current row'. Basically, if this would be my data frame
A B D
1 10
4 5
6 6
3 25
1 40
I would want D(i) to contain the B value of the last row for which A has the same value as A(i). So for the last row that should be 10.
You could try this:
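for (i in seq_len(nrow(dat))) {
# earlier rows whose A equals the current A; D stays NA if there are none
prev <- which(dat$A[seq_len(i - 1)] == dat$A[i])
if (length(prev) > 0) dat$D[i] <- dat$B[tail(prev, 1)]
}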
Results:
> dat
A B D
1 1 10 NA
2 4 5 NA
3 6 6 NA
4 3 25 NA
5 1 40 10
Data:
dat <- read.csv(text="A,B,D
1,10
4,5
6,6
3,25
1,40")
You may try
library(dplyr)
df1%>%
group_by(A) %>%
mutate(D=lag(B))
# A B D
#1 1 10 NA
#2 4 5 NA
#3 6 6 NA
#4 3 25 NA
#5 1 40 10
Or
library(data.table)#data.table_1.9.5
setDT(df1)[, D:=shift(B), A][]
data
df1 <- structure(list(A = c(1L, 4L, 6L, 3L, 1L), B = c(10L, 5L, 6L,
25L, 40L)), .Names = c("A", "B"), class = "data.frame",
row.names = c(NA, -5L))
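If neither package is available, a base R sketch of the same grouped lag is possible with ave(), which applies a function within each group of A and returns the results in the original row order. The anonymous function shifting each group by one position is written for this example.

```r
df1 <- data.frame(A = c(1L, 4L, 6L, 3L, 1L), B = c(10L, 5L, 6L, 25L, 40L))

# Within each A group, shift B down by one: first occurrence gets NA
df1$D <- ave(df1$B, df1$A, FUN = function(b) c(NA, head(b, -1)))
df1
#   A  B  D
# 1 1 10 NA
# 2 4  5 NA
# 3 6  6 NA
# 4 3 25 NA
# 5 1 40 10
```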