Populate a dataframe with a for loop - r

I would like to fill a dataframe ("DF") with 0's or 1's depending on whether the values in a vector ("Date") match date values in a second dataframe ("df$Date").
If they match, the output value has to be 1, otherwise 0.
I tried to adjust this code written by a friend of mine, but it doesn't work:
for(j in 1:length(Date)) { #Date is a vector with all dates from 1967 to 2006
# Start count
count <- 0
# Check all Dates between 1967-2006
if(any(Date[j] == df$Date)) { #df$Date contains specific dates of interest
count <- count + 1
}
# If there is a match between Date and df$Date, its output is 1, else 0.
DF[j,i] <- count
}
The main dataframe "DF" has 190 columns, which have to be filled, and of course a number of rows equal to the length of the Date vector.
extra info
1) Each column is different from the other ones and therefore the observations in a row cannot be all equal (i.e. in a single row, I should have a mixture between 0's and 1's).
2) The column names in "DF" are also present in "df" as df$Code.

We can vectorize this operation with %in% and as.integer(), leveraging the fact that coercing logical to integer returns 0 for false and 1 for true:
Mat[,i] <- as.integer(Date%in%df$Date);
If you want to fill every single column of Mat with the exact same result vector:
Mat[] <- as.integer(Date%in%df$Date);
My above code exactly reproduces the logic of the code in your (original) question.
From your edit, I'm not 100% sure I understand the requirement, but my best guess is this:
set.seed(4L);
LV <- 10L; ND <- 10L;
Date <- sample(seq_len(ND),LV,T);
df <- data.frame(Date=sample(seq_len(ND),3L),Code=c('V1','V2','V3'));
DF <- data.frame(V1=rep(NA,LV),V2=rep(NA,LV),V3=rep(NA,LV));
Date;
## [1] 6 1 3 3 9 3 8 10 10 1
df;
## Date Code
## 1 8 V1
## 2 3 V2
## 3 1 V3
for (cn in colnames(DF)) DF[,cn] <- as.integer(Date%in%df$Date[df$Code==cn]);
DF;
## V1 V2 V3
## 1 0 0 0
## 2 0 0 1
## 3 0 1 0
## 4 0 1 0
## 5 0 0 0
## 6 0 1 0
## 7 1 0 0
## 8 0 0 0
## 9 0 0 0
## 10 0 0 1
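The per-column loop above can also be written without an explicit for loop. This is only a minimal sketch, assuming the same toy Date and df as in the example (hard-coded here instead of sampled); the helper name match_by_code is hypothetical and not part of the original answer:

```r
# Toy inputs mirroring the example above (values hard-coded for reproducibility)
Date <- c(6, 1, 3, 3, 9, 3, 8, 10, 10, 1)
df <- data.frame(Date = c(8, 3, 1), Code = c("V1", "V2", "V3"))
codes <- c("V1", "V2", "V3")  # the column names of DF

# Hypothetical helper: 0/1 indicator of which Date values occur in df for one code
match_by_code <- function(code) as.integer(Date %in% df$Date[df$Code == code])

# vapply builds a length(Date) x length(codes) integer matrix, one column per code
DF <- as.data.frame(vapply(codes, match_by_code, integer(length(Date))))
DF
```

With 190 codes, codes would presumably be colnames(DF) or unique(df$Code); which of the two is appropriate depends on data not shown in the question.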

Related

Random value in column in R

Does anyone have an idea how to generate a column of random values where only one random row is marked with the number "1"? All others should be "0".
I need a function for this in R.
Here is what I need in photos:
df <- data.frame(subject = 1, choice = 0, price75 = c(0,0,0,1,1,1,0,1))
This command will update the choice column to contain a single random row with value of 1 each time it is called. All other rows values in the choice column are set to 0.
df$choice <- +(seq_along(df$choice) == sample(nrow(df), 1))
With integer(nrow(DF)) a vector of 0s is created, and [<- then places a 1 at the position drawn by sample(nrow(DF), 1L).
DF <- data.frame(subject=1, choice="", price75=c(0,0,0,1,1,1,0,1))
DF$choice <- `[<-`(integer(nrow(DF)), sample(nrow(DF), 1L), 1L)
DF
# subject choice price75
#1 1 0 0
#2 1 0 0
#3 1 0 0
#4 1 1 1
#5 1 0 1
#6 1 0 1
#7 1 0 0
#8 1 0 1
> x <- rep(0, 10)
> x[sample(1:10, 1)] <- 1
> x
[1] 0 0 0 0 0 0 0 1 0 0
There are many ways to set a random value in a row or column in R:
df<-data.frame(x=rep(0,10)) #make dataframe df, with column x, filled with 10 zeros.
set.seed(2022) #set a random seed - this is for repeatability
#two base methods for sampling:
#sample.int(n=10, size=1) # sample an integer from 1 to 10, sample size of 1
#sample(x=1:10, size=1) # sample from 1 to 10, sample size of 1
df$x[sample.int(n=10, size=1)] <- 1 # randomly selecting one of the ten rows, and replacing the value with 1
df
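All of the answers above follow the same pattern: create zeros, then set one randomly chosen position to 1. A small sketch that wraps this in a function; the name random_one_hot is hypothetical and used only for illustration:

```r
# Hypothetical helper: integer vector of n zeros with a single 1 at a random position
random_one_hot <- function(n) {
  out <- integer(n)             # n zeros
  out[sample.int(n, 1L)] <- 1L  # pick one position uniformly at random
  out
}

df <- data.frame(subject = 1, choice = 0, price75 = c(0, 0, 0, 1, 1, 1, 0, 1))
df$choice <- random_one_hot(nrow(df))
sum(df$choice)  # exactly one row is marked, whichever row was drawn
```

Calling set.seed() beforehand makes the chosen row repeatable, as in the last answer above.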

Separate long numbers into individual components in a data frame

I have a data frame with several hundred vectors that look like this
To analyze them I must split the columns so that the number is separated into its individual components in the rows beneath.
V1
0
0
0
0
0
0
...
I've tried using this code and tweaking it but I can't get it to work. There are 2225 columns and the vectors are not all the same size.
text_data <- read_excel("./data/wordcount_vectors.xlsx")
text_vector_data <- text_data %>% select(wordcountvec)
wordvec_list <- c()
for (i in 1:nrow(text_vector_data)){
text_vector_data[i,] <- removePunctuation(as.character(text_vector_data[i,]))
x <- as.list(text_vector_data[i,])
wordvec_list <- c(wordvec_list, x)
}
wordvec_df <- as.data.frame(wordvec_list)
df <- data.frame(matrix(ncol = 2225, nrow = 1106))
for (i in 1:ncol(text_vector_data)){ #Change range depending on size of c
c <- as.numeric(strsplit(as.character(wordvec_df[i]), "")[[1]])
dd[i] <- c[[i]]
}
word_vec_df <- Filter(function(x)!all(is.na(x)), dd)
row.names(word_vec_df)<- NULL ; colnames(word_vec_df)<- NULL
word_vec_df <- t(word_vec_df)
here's some toy data to try
v1 <- (100011000)
v2 <- (10102100)
v3 <- (1120210011)
wordcount_df <- data_frame(v1,v2,v3)
First, you need to read your data as character not numeric because otherwise the leading zeros will be lost:
text_data <- read_excel("./data/wordcount_vectors.xlsx", col_types="text")
Your example does not include spaces between the 1's and 0's but your picture does. Using your provided data:
wordcount_dfc <- as.character(wordcount_df)
wordcount_dfc
# [1] "100011000" "10102100" "1120210011"
wordcount_lst <- strsplit(wordcount_dfc, "") # Use " " if the values are separated by spaces
wordcount_lst <- lapply(wordcount_lst, as.integer) # lapply always returns a list, even if all vectors have the same length
wordcount_lst
# [[1]]
# [1] 1 0 0 0 1 1 0 0 0
#
# [[2]]
# [1] 1 0 1 0 2 1 0 0
#
# [[3]]
# [1] 1 1 2 0 2 1 0 0 1 1
sapply(wordcount_lst, length)
# [1] 9 8 10
You cannot just bind the columns together because they are of different lengths. The simplest approach would be to use wordcount_lst directly, but if you need to make a data frame, you need to add NAs to pad out short vectors:
size <- max(sapply(wordcount_lst, length))
wordcount_df <- data.frame(sapply(wordcount_lst, function(x) c(x, rep(NA, size - length(x)))))
wordcount_df
# X1 X2 X3
# 1 1 1 1
# 2 0 0 1
# 3 0 1 2
# 4 0 0 0
# 5 1 2 2
# 6 1 1 1
# 7 0 0 0
# 8 0 0 0
# 9 0 NA 1
# 10 NA NA 1
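The NA padding can also be done by assigning a longer length to each vector, since extending a vector with `length<-` fills the new positions with NA. A minimal sketch under the same toy data; the column names v1 to v3 are chosen here for illustration:

```r
# Digit vectors of unequal length, as produced by strsplit() plus as.integer()
wordcount_lst <- lapply(strsplit(c("100011000", "10102100", "1120210011"), ""),
                        as.integer)

# Setting a longer length on a vector pads it with NA
size <- max(lengths(wordcount_lst))
padded <- lapply(wordcount_lst, `length<-`, size)

# Name the columns (illustrative names) and bind them into a data frame
names(padded) <- c("v1", "v2", "v3")
wordcount_df <- as.data.frame(padded)
wordcount_df
```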

efficient way to store lists within a dataframe

I need to compute pairwise intersections of close to 40k lists.
Specifically, I want to know if I can store the vector id in column 1 and a list of its values in column 2. I should then be able to process column 2, i.e. find overlaps/intersections between two rows.
column 1 column 2
idA 1,2,5,9,10
idB 5,9,25
idC 2,25,67
I want to be able to get the pairwise intersection values and also, if the values in column 2 are not already sorted, that should also be possible.
What is the best datastructure that I can use if I am going ahead with R?
My data originally looks like this:
column1 1 2 3 9 10 25 67 5
idA 1 1 0 1 1 0 0 1
idB 0 0 0 1 0 1 0 1
idC 0 1 0 0 0 1 1 0
edited to include more clarity as per the suggestions below.
I'd keep the data in a logical matrix:
DF <- read.table(text = "column1 1 2 3 9 10 25 67 5
idA 1 1 0 1 1 0 0 1
idB 0 0 0 1 0 1 0 1
idC 0 1 0 0 0 1 1 0", header = TRUE, check.names = FALSE)
#turn into logical matrix
m <- as.matrix(DF[-1])
rownames(m) <- DF[[1]]
mode(m) <- "logical"
#if you can, create your data as a sparse matrix to save memory
#if you already have a dense data matrix, keep it that way
library(Matrix)
M <- as(m, "lMatrix")
#calculate intersections
#does each comparison twice
intersections <- simplify2array(
lapply(seq_len(nrow(M)), function(x)
lapply(seq_len(nrow(M)), function(x, y) colnames(M)[M[x,] & (M[x,] == M[y,])], x = x)
)
)
This double loop could be optimized. I'd do it in Rcpp and create a long format data.frame instead of a list matrix. I'd also do each comparison only once (e.g., only the upper triangle).
colnames(intersections) <- rownames(intersections) <- rownames(M)
# idA idB idC
#idA Character,5 Character,2 "2"
#idB Character,2 Character,3 "25"
#idC "2" "25" Character,3
intersections["idA", "idB"]
#[[1]]
#[1] "9" "5"
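As a rough illustration of doing each comparison only once, here is a base R sketch on the same toy data. It uses a dense logical matrix rather than the Matrix package, and it is not the Rcpp version mentioned above; whether it scales to 40k rows is untested:

```r
# Same toy data as above, as a dense logical matrix (base R only)
m <- matrix(
  c(1, 1, 0, 1, 1, 0, 0, 1,
    0, 0, 0, 1, 0, 1, 0, 1,
    0, 1, 0, 0, 0, 1, 1, 0) == 1,
  nrow = 3, byrow = TRUE,
  dimnames = list(c("idA", "idB", "idC"),
                  c("1", "2", "3", "9", "10", "25", "67", "5"))
)

# Visit each unordered pair of rows exactly once
pairs <- combn(rownames(m), 2, simplify = FALSE)
shared <- lapply(pairs, function(p) colnames(m)[m[p[1], ] & m[p[2], ]])
names(shared) <- vapply(pairs, paste, character(1), collapse = "-")
shared[["idA-idB"]]

# If only the intersection sizes are needed, a cross product counts them
sizes <- tcrossprod(m + 0L)
sizes["idA", "idB"]
```

The values come back in column order; if sorted values are needed, something like sort(as.numeric(shared[["idA-idB"]])) could be applied afterwards.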

Insert missing natural numbers in a vector

I have a data frame column
df$col1=(1,2,3,4,5,6,7,8,9,...,500000)
and a vector
vc<-c(1,2,4,5,7,8,10,...,499999)
If I compare the two vectors, the second vector has some missing values. How can I insert 0s in the places of the missing values? E.g. I want the second vector to be
vc<-c(1,2,0,4,5,0,7,8,0,10,...,499999,0)
You could use match and replace (thanks to @RonakShah)
Input
vc <- c(1,2,4,5,7,8,10)
x <- 1:15
Result
out <- replace(tmp <- vc[match(x, vc)], is.na(tmp), 0L)
out
# [1] 1 2 0 4 5 0 7 8 0 10 0 0 0 0 0
You could try using the larger vector containing all values as a template, and then assign zero to any value which does not match to the second smaller vector:
v_out <- df$col1
v_out[!(v_out %in% vc)] <- 0
v_out
[1] 1 2 0 4 5 0 7 8 0 10
Data
df <- data.frame(col1 = 1:10)
vc <- c(1,2,4,5,7,8,10)
A more cryptic, but maybe faster one-liner alternative (using Tim's data):
`[<-`(numeric(max(df$col1)),vc,vc)
#[1] 1 2 0 4 5 0 7 8 0 10
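As a quick sanity check, the three ideas above can be compared on a small input. This is only a sketch: the variable names a, b and c3 are chosen here, and the index-assignment variant assumes vc holds positive whole numbers no larger than max(x):

```r
x  <- 1:10                     # the full range
vc <- c(1, 2, 4, 5, 7, 8, 10)  # the vector with gaps

# match: look up each value of x in vc, then turn the misses (NA) into 0
a <- vc[match(x, vc)]
a[is.na(a)] <- 0

# %in% as a mask: keep x where it occurs in vc, otherwise 0
b <- ifelse(x %in% vc, x, 0)

# index assignment: start from zeros and write each value at its own position
c3 <- `[<-`(numeric(max(x)), vc, vc)

all(a == b, b == c3)
```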

removing columns equal to 0 from multiple data frames in a list; lapply not actually removing columns when applying function to a list

I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above will run, but neither actually removes the all-zero columns.
I am not sure why that is; any tips are greatly appreciated.
The OP found a solution by exchanging comments with me, but I want to leave an explanation here. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the anonymous function does two things. First, it subsets the columns of each data frame in my.list, but the result is never assigned, so it is discarded. Second, it evaluates x, the unchanged data frame, and because a function returns the value of its last expression, that is what lapply collects. So on the surface nothing changes. Following the OP's approach, I would do something like this:
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
That is, create a temporary object in the anonymous function and return it. Alternatively, return the subset directly:
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, this runs a logical check on each column: if colSums(x) != 0 is TRUE the column is kept, otherwise it is removed. Hope this helps future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
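The question mentions that the real data has non-numeric leading columns, on which colSums would fail. One possible way to handle that is sketched below; the helper name drop_zero_cols and the toy data are made up for illustration and are not from the original thread:

```r
# Hypothetical helper: drop numeric columns that sum to 0, keep everything else
drop_zero_cols <- function(x) {
  is_num <- vapply(x, is.numeric, logical(1))
  keep <- !is_num  # non-numeric columns are always kept
  keep[is_num] <- colSums(x[is_num], na.rm = TRUE) != 0
  x[, keep, drop = FALSE]  # drop = FALSE keeps a data frame even if one column remains
}

# Toy data (made up): one label column plus numeric columns
my.list <- list(
  data.frame(id = c("s1", "s2", "s3"), a = c(0, 1, 0), b = c(0, 0, 0)),
  data.frame(id = c("s4", "s5"), a = c(0, 0), b = c(1, 0))
)
list2 <- lapply(my.list, drop_zero_cols)
lapply(list2, names)
```

If a column can contain negative values that cancel out, testing colSums(x[is_num] != 0, na.rm = TRUE) > 0 instead, as in the OP's second attempt, avoids dropping it by mistake.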
