Keep columns of a data frame based on a data frame - r

I have a data frame, called df, which contains 4000 values. I have a list of 1000 column numbers, in a data frame called list, which is 1000 rows by 1 column. How can I keep the rows with the numbers in list in the data frame df and throw the rest out. I already tried using:
listv <- as.vector(list)
and then using
dfnew <- df[,listv]
but I get the error
Error in .subset(x, j) : invalid subscript type 'list'

You're mixing up rows and columns subsetting. Here is a minimal example:
df <- data.frame(matrix(1:21, ncol = 3))
df
# X1 X2 X3
# 1 1 8 15
# 2 2 9 16
# 3 3 10 17
# 4 4 11 18
# 5 5 12 19
# 6 6 13 20
# 7 7 14 21
list <- data.frame(V1 = c(1, 4, 6))
list
# V1
# 1 1
# 2 4
# 3 6
df[list[, 1], ]
# X1 X2 X3
# 1 1 8 15
# 4 4 11 18
# 6 6 13 20
df[unlist(list), ]
# X1 X2 X3
# 1 1 8 15
# 4 4 11 18
# 6 6 13 20
Note also that as.vector(list) doesn't create a vector, as you thought it would. You need unlist here (as I used in the last example).

Related

Add ID column to a list of data frames

I have a list of 142 dataframes file_content and a list from id_list <- list(as.character(1:length(file_content)))
I am trying to add a new column period to each data frame in file_content.
All data frames are similar to 2021-03-16 below.
`2021-03-16` <- file_content[[1]] # take a look at 1/142 dataframes in file_content
head(`2021-03-16`)
author_id created_at id tweet
1 3.304380e+09 2018-12-01 22:58:55+00:00 1.069003e+18 #Acosta I hope he didn’t really say “muckâ€\u009d.
2 5.291559e+08 2018-12-01 22:57:31+00:00 1.069003e+18 #Acosta I like Mattis, but why does he only speak this way when Individual-1 isn't around?
3 2.195313e+09 2018-12-01 22:56:41+00:00 1.069002e+18 #Acosta What did Mattis say about the informal conversation between Trump and Putin at the G20?
4 3.704188e+07 2018-12-01 22:56:41+00:00 1.069002e+18 #Acosta Good! Tree huggers be damned!
5 1.068995e+18 2018-12-01 22:56:11+00:00 1.069002e+18 #Acosta #NinerMBA_01
6 9.983321e+17 2018-12-01 22:55:13+00:00 1.069002e+18 #Acosta Really?
I have tried to add the period column using the following code but it adds all 142 values from the id_list to every row in every data frame in file_content.
for (id in length(id_list)) {
file_content <- lapply(file_content, function(x) { x$period <- paste(id_list[id], sep = "_"); x })
}
You were close, the mistake is you need double brackets in id_list[[id]].
for (id in length(id_list)) {
file_content <- lapply(file_content, function(x) {
x$period <- paste(id_list[[id]], sep = "_")
x
})
}
# $`1`
# X1 X2 X3 X4 period
# 1 1 4 7 10 1
# 2 2 5 8 11 2
# 3 3 6 9 12 3
#
# $`2`
# X1 X2 X3 X4 period
# 1 1 4 7 10 1
# 2 2 5 8 11 2
# 3 3 6 9 12 3
#
# $`3`
# X1 X2 X3 X4 period
# 1 1 4 7 10 1
# 2 2 5 8 11 2
# 3 3 6 9 12 3
You could also try Map() and save a few lines.
Map(`[<-`, file_content, 'period', value=id_list)
# $`1`
# X1 X2 X3 X4 period
# 1 1 4 7 10 1
# 2 2 5 8 11 2
# 3 3 6 9 12 3
#
# $`2`
# X1 X2 X3 X4 period
# 1 1 4 7 10 1
# 2 2 5 8 11 2
# 3 3 6 9 12 3
#
# $`3`
# X1 X2 X3 X4 period
# 1 1 4 7 10 1
# 2 2 5 8 11 2
# 3 3 6 9 12 3
Data:
file_content <- replicate(3, data.frame(matrix(1:12, 3, 4)), simplify=F) |> setNames(1:3)
id_list <- list(as.character(1:length(file_content)))
We may use imap
library(purrr)
library(dplyr)
imap(file_content, ~ .x %>%
mutate(period = .y))
Or with Map from base R
Map(cbind, file_content, period = names(file_content))
In the OP's code, the id_list is created as a single list element by wrapping with list i.e.
list(1:5)
vs
as.list(1:5)
Here, we don't need to convert to list as a vector is enough
id_list <- seq_along(file_content)
Also, the for loop is looping on a single element i.e. the last element with length
for (id in length(id_list)) {
^^
instead, it would be 1:length. In addition, the assignment should be on the single list element file_content[[id]] and not on the entire list
for(id in seq_along(id_list)) {
file_content[[id]]$period <- id_list[id]
}

Moving down columns in data frames in R

Suppose I have the next data frame:
df<-data.frame(step1=c(1,2,3,4),step2=c(5,6,7,8),step3=c(9,10,11,12),step4=c(13,14,15,16))
step1 step2 step3 step4
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
4 4 8 12 16
and what I have to do is something like the following:
df2<-data.frame(col1=c(1,2,3,4,5,6,7,8,9,10,11,12),col2=c(5,6,7,8,9,10,11,12,13,14,15,16))
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
How can I do that? consider that more steps can be included (example, 20 steps).
Thanks!!
We can design a function to achieve this task. df_final is the final output. Notice that bin is an argument that the users can specify how many columns to transform together.
# A function to conduct data transformation
trans_fun <- function(df, bin = 3){
# Calculate the number of new columns
new_ncol <- (ncol(df) - bin) + 1
# Create a list to store all data frames
df_list <- lapply(1:new_ncol, function(num){
return(df[, num:(num + bin - 1)])
})
# Convert each data frame to a vector
dt_list2 <- lapply(df_list, unlist)
# Convert dt_list2 to data frame
df_final <- as.data.frame(dt_list2)
# Set the column and row names of df_final
colnames(df_final) <- paste0("col", 1:new_ncol)
rownames(df_final) <- 1:nrow(df_final)
return(df_final)
}
# Apply the trans_fun
df_final <- trans_fun(df)
df_final
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
Here is a method using dplyr and reshape2 - this assumes all of the columns are the same length.
library(dplyr)
library(reshape2)
Drop the last column from the dataframe
df[,1:ncol(df)-1]%>%
melt() %>%
dplyr::select(col1=value) -> col1
Drop the first column from the dataframe
df %>%
dplyr::select(-step1) %>%
melt() %>%
dplyr::select(col2=value) -> col2
Combine the dataframes
bind_cols(col1, col2)
This should do the work:
df2 <- data.frame(col1 = 1:(length(df$step1) + length(df$step2)))
df2$col1 <- c(df$step1, df$step2, df$step3)
df2$col2 <- c(df$step2, df$step3, df$step4)
Things to point:
The important thing to see in the first line of the code, is the need for creating a table with the right amount of rows
Calling a columns that does not exist will create one, with that name
Deleting columns in R should be done like this df2$col <- NULL
Are you not just looking to do:
df2 <- data.frame(col1 = unlist(df[,-nrow(df)]),
col2 = unlist(df[,-1]))
rownames(df2) <- NULL
df2
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16

R - Subset rows of a data frame on a condition in all the columns

I want to subset rows of a data frame on a single condition in all the columns, avoiding the use of subset.
I understand how to subset a single column but I cannot generalize for all columns (without call all the columns).
Initial data frame :
V1 V2 V3
1 1 8 15
2 2 0 16
3 3 10 17
4 4 11 18
5 5 0 19
6 0 13 20
7 7 14 21
In this example, I want to subset the rows without zeros.
Expected output :
V1 V2 V3
1 1 8 15
2 3 10 17
3 4 11 18
4 7 14 21
Thanks
# create your data
a <- c(1,2,3,4,5,0,7)
b <- c(8,0,10,11,0,14,14)
c <- c(15,16,17,18,19,20,21)
data <- cbind(a, b, c)
# filter out rows where there is at least one 0
data[apply(data, 1, min) > 0,]
A solution using rowSums function after matching to 0.
# creating your data
data <- data.frame(a = c(1,2,3,4,5,0,7),
b = c(8,0,10,11,0,14,14),
c = c(15,16,17,18,19,20,21))
# Selecting rows containing no 0.
data[which(rowSums(as.matrix(data)==0) == 0),]
Another way
data[-unique(row(data)[grep("^0$", unlist(data))]),]

How to add a date to each row for a column in a data frame?

df <- data.frame(DAY = character(), ID = character())
I'm running a (for i in DAYS[i]) and get IDs for each day and storing them in a data frame
df <- rbind(df, data.frame(ID = IDs))
I want to add the DAY[i] in a second column across each row in a loop.
How do I do that?
As #Pascal says, this isn't the best way to create a data frame in R. R is a vectorised language, so generally you don't need for loops.
I'm assuming each ID is unique, so you can create a vector of IDs from 1 to 10:
ID <- 1:10
Then, you need a vector for your DAYs which can be the same length as your IDs, or can be recycled (i.e. if you only have a certain number of days that are repeated in the same order you can have a smaller vector that's reused). Use c() to create a vector with more than one value:
DAY <- c(1, 2, 9, 4, 4)
df <- data.frame(ID, DAY)
df
# ID DAY
# 1 1 1
# 2 2 2
# 3 3 9
# 4 4 4
# 5 5 4
# 6 6 1
# 7 7 2
# 8 8 9
# 9 9 4
# 10 10 4
Or with a vector for DAY that includes unique values:
DAY <- sample(1:100, 10, replace = TRUE)
df <- data.frame(ID, DAY)
df
# ID DAY
# 1 1 61
# 2 2 30
# 3 3 32
# 4 4 97
# 5 5 32
# 6 6 74
# 7 7 97
# 8 8 73
# 9 9 16
# 10 10 98

Parsing Delimited Data In a DataFrame Into Separate Columns in R

I have a data frame which looks as such
A B C
1 3 X1=7;X2=8;X3=9
2 4 X1=10;X2=11;X3=12
5 6 X1=13;X2=14
I would like to parse the C column into separate columns as such...
A B X1 X2 X3
1 3 7 8 9
2 4 10 11 12
5 6 13 14 NA
How would one go about doing this in R?
First, here's the sample data in data.frame form
dd<-data.frame(
A = c(1L, 2L, 5L),
B = c(3L, 4L, 6L),
C = c("X1=7;X2=8;X3=9",
"X1=10;X2=11;X3=12", "X1=13;X2=14"),
stringsAsFactors=F
)
Now I define a small helper function to take vectors like c("A=1","B=2") and changed them into named vectors like c(A="1", B="2").
namev<-function(x) {
a<-strsplit(x,"=")
setNames(sapply(a,'[',2), sapply(a,'[',1))
}
and now I perform the transformations
#turn each row into a named vector
vv<-lapply(strsplit(dd$C,";"), namev)
#find list of all column names
nm<-unique(unlist(sapply(vv, names)))
#extract data from all rows for every column
nv<-do.call(rbind, lapply(vv, '[', nm))
#convert everything to numeric (optional)
class(nv)<-"numeric"
#rejoin with original data
cbind(dd[,-3], nv)
and that gives you
A B X1 X2 X3
1 1 3 7 8 9
2 2 4 10 11 12
3 5 6 13 14 NA
My cSplit function makes solving problems like these fun. Here it is in action:
## Load some packages
library(data.table)
library(devtools) ## Just for source_gist, really
library(reshape2)
## Load `cSplit`
source_gist("https://gist.github.com/mrdwab/11380733")
First, split your values up and create a "long" dataset:
ddL <- cSplit(cSplit(dd, "C", ";", "long"), "C", "=")
ddL
# A B C_1 C_2
# 1: 1 3 X1 7
# 2: 1 3 X2 8
# 3: 1 3 X3 9
# 4: 2 4 X1 10
# 5: 2 4 X2 11
# 6: 2 4 X3 12
# 7: 5 6 X1 13
# 8: 5 6 X2 14
Next, use dcast.data.table (or just dcast) to go from "long" to "wide":
dcast.data.table(ddL, A + B ~ C_1, value.var="C_2")
# A B X1 X2 X3
# 1: 1 3 7 8 9
# 2: 2 4 10 11 12
# 3: 5 6 13 14 NA
Here's one possible approach:
dat <- read.table(text="A B C
1 3 X1=7;X2=8;X3=9
2 4 X1=10;X2=11;X3=12
5 6 X1=13;X2=14", header=TRUE, stringsAsFactors = FALSE)
library(qdapTools)
dat_C <- strsplit(dat$C, ";")
dat_C2 <- sapply(dat_C, function(x) {
y <- strsplit(x, "=")
rep(sapply(y, "[", 1), as.numeric(sapply(y, "[", 2)))
})
data.frame(dat[, -3], mtabulate(dat_C2))
## A B X1 X2 X3
## 1 1 3 7 8 9
## 2 2 4 10 11 12
## 3 5 6 13 14 0
EDIT To obtain the NA values
m <- mtabulate(dat_C2)
m[m==0] <- NA
data.frame(dat[, -3], m)
Here's a nice, somewhat hacky way to get you there.
## read your data
> dat <- read.table(h=T, text = "A B C
1 3 X1=7;X2=8;X3=9
2 4 X1=10;X2=11;X3=12
5 6 X1=13;X2=14", stringsAsFactors = FALSE)
## ---
> s <- strsplit(dat$C, ";|=")
> xx <- unique(unlist(s)[grepl('[A-Z]', unlist(s))])
> sap <- t(sapply(seq(s), function(i){
wh <- which(!xx %in% s[[i]]); n <- suppressWarnings(as.numeric(s[[i]]))
nn <- n[!is.na(n)]; if(length(wh)){ append(nn, NA, wh-1) } else { nn }
})) ## see below for explanation
> data.frame(dat[1:2], sap)
# A B X1 X2 X3
# 1 1 3 7 8 9
# 2 2 4 10 11 12
# 3 5 6 13 14 NA
Basically what's happening in sap is
check which values are missing
change each list element of s to numeric
remove the NA values from (2)
insert NA into the correct position with append
transpose the result

Resources