Separate long numbers into individual components in a data frame - r

I have a data frame with several hundred vectors that look like this
To analyze them I must split the columns so that the number is separated into its individual components in the rows beneath.
V1
0
0
0
0
0
0
...
I've tried using this code and tweaking it but I can't get it to work. There are 2225 columns and the vectors are not all the same size.
text_data <- read_excel("./data/wordcount_vectors.xlsx")
text_vector_data <- text_data %>% select(wordcountvec)
wordvec_list <- c()
for (i in 1:nrow(text_vector_data)){
text_vector_data[i,] <- removePunctuation(as.character(text_vector_data[i,]))
x <- as.list(text_vector_data[i,])
wordvec_list <- c(wordvec_list, x)
}
wordvec_df <- as.data.frame(wordvec_list)
df <- data.frame(matrix(ncol = 2225, nrow = 1106))
for (i in 1:ncol(text_vector_data)){ #Change range depending on size of c
c <- as.numeric(strsplit(as.character(wordvec_df[i]), "")[[1]])
dd[i] <- c[[i]]
}
word_vec_df <- Filter(function(x)!all(is.na(x)), dd)
row.names(word_vec_df)<- NULL ; colnames(word_vec_df)<- NULL
word_vec_df <- t(word_vec_df)
Here's some toy data to try:
v1 <- (100011000)
v2 <- (10102100)
v3 <- (1120210011)
wordcount_df <- data_frame(v1,v2,v3)

First, you need to read your data as character not numeric because otherwise the leading zeros will be lost:
text_data <- read_excel("./data/wordcount_vectors.xlsx", col_types="text")
Your example does not include spaces between the 1's and 0's but your picture does. Using your provided data:
wordcount_dfc <- as.character(wordcount_df)
wordcount_dfc
# [1] "100011000" "10102100" "1120210011"
wordcount_lst <- strsplit(wordcount_dfc, "") # Use " " if the values are separated by spaces
wordcount_lst <- lapply(wordcount_lst, as.integer) # lapply always returns a list, even when all lengths match
wordcount_lst
# [[1]]
# [1] 1 0 0 0 1 1 0 0 0
#
# [[2]]
# [1] 1 0 1 0 2 1 0 0
#
# [[3]]
# [1] 1 1 2 0 2 1 0 0 1 1
sapply(wordcount_lst, length)
# [1] 9 8 10
You cannot just bind the columns together because they are of different lengths. The simplest approach would be to use wordcount_lst directly, but if you need to make a data frame, you need to add NAs to pad out short vectors:
size <- max(sapply(wordcount_lst, length))
wordcount_df <- data.frame(sapply(wordcount_lst, function(x) c(x, rep(NA, size - length(x)))))
wordcount_df
#    X1 X2 X3
# 1   1  1  1
# 2   0  0  1
# 3   0  1  2
# 4   0  0  0
# 5   1  2  2
# 6   1  1  1
# 7   0  0  0
# 8   0  0  0
# 9   0 NA  1
# 10 NA NA  1
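The steps above can be collected into one small helper. This is only a sketch: the function name split_digits and its pad argument are made up for illustration, and it assumes the input is a character vector of digit strings without spaces (for example, values read with col_types="text").

```r
# Hypothetical helper (name and arguments are illustrative, not from any package).
# Assumes `x` is a character vector of digit strings with no separators.
split_digits <- function(x, pad = NA_integer_) {
  digits <- lapply(strsplit(x, ""), as.integer)   # one integer vector per string
  size <- max(lengths(digits))                    # length of the longest vector
  padded <- lapply(digits, function(d) c(d, rep(pad, size - length(d))))
  out <- as.data.frame(do.call(cbind, padded))    # one column per input string
  names(out) <- paste0("X", seq_along(x))
  out
}

# Made-up input; the last string has leading zeros, which survive as characters.
toy <- c("100011000", "10102100", "1120210011", "0010")
res <- split_digits(toy)
res
```

Keeping the list from strsplit() and only padding at the end means the short vectors are filled with NA at the bottom, as in the data frame shown above.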

Related

removing columns equal to 0 from multiple data frames in a list; lapply not actually removing columns when applying function to a list

I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the lines above run, but neither actually removes the all-zero columns.
I am not sure why that is; any tips are greatly appreciated.
The OP found a solution by exchanging comments with me, but I want to leave the following explanation. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP was asking R to do two things. The first was to subset each data frame in my.list; the second was to return each data frame. I think he assumed that each data frame was updated after subsetting its columns, but the subset was never assigned to anything, so the final x simply returned the original, unchanged data frame. (That is why he saw no change in the output.) Following his approach, I would do something like this:
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
That is, create a temporary object in the anonymous function and return it. Alternatively, return the subset directly:
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, run a logical check for each column. If colSums(x) != 0 is TRUE, keep the column. Otherwise remove it. Hope this will help future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
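One caveat with x[, cond]: if only a single column survives the filter, R drops the result to a plain vector unless drop = FALSE is given. A minimal sketch with made-up data (keep_nonzero, one_col and cleaned are illustrative names, and all columns are assumed to be numeric):

```r
# Illustrative helper: keep columns whose sum is non-zero, always returning a data frame.
keep_nonzero <- function(x) x[, colSums(x) != 0, drop = FALSE]

one_col <- data.frame(p = c(0, 0, 0), q = c(1, 0, 1))  # made-up example data

class(one_col[, colSums(one_col) != 0])  # "numeric": the data frame was dropped to a vector
class(keep_nonzero(one_col))             # "data.frame"

# Applied to a list of data frames, as in the question.
cleaned <- lapply(
  list(one_col, data.frame(r = c(1, 1), s = c(0, 0), t = c(2, 0))),
  keep_nonzero
)
```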

Populate a dataframe with a for loop

I would like to fill a dataframe ("DF") with 0's or 1's depending on whether values in a vector ("Date") match date values in a second dataframe ("df$Date").
If they match, the output value has to be 1, otherwise 0.
I tried to adjust this code made by a friend of mine, but it doesn't work:
for(j in 1:length(Date)) { #Date is a vector with all dates from 1967 to 2006
# Start count
count <- 0
# Check all Dates between 1967-2006
if(any(Date[j] == df$Date)) { #df$Date contains specific dates of interest
count <- count + 1
}
# If there is a match between Date and df$Date, its output is 1, else 0.
DF[j,i] <- count
}
The main dataframe "DF" has 190 columns, which have to be filled, and of course a number of rows equal to the length of the Date vector.
extra info
1) Each column is different from the other ones and therefore the observations in a row cannot be all equal (i.e. in a single row, I should have a mixture between 0's and 1's).
2) The column names in "DF" are also present in "df" as df$Code.
We can vectorize this operation with %in% and as.integer(), leveraging the fact that coercing logical to integer returns 0 for false and 1 for true:
Mat[,i] <- as.integer(Date%in%df$Date);
If you want to fill every single column of Mat with the exact same result vector:
Mat[] <- as.integer(Date%in%df$Date);
My above code exactly reproduces the logic of the code in your (original) question.
From your edit, I'm not 100% sure I understand the requirement, but my best guess is this:
set.seed(4L);
LV <- 10L; ND <- 10L;
Date <- sample(seq_len(ND),LV,T);
df <- data.frame(Date=sample(seq_len(ND),3L),Code=c('V1','V2','V3'));
DF <- data.frame(V1=rep(NA,LV),V2=rep(NA,LV),V3=rep(NA,LV));
Date;
## [1] 6 1 3 3 9 3 8 10 10 1
df;
## Date Code
## 1 8 V1
## 2 3 V2
## 3 1 V3
for (cn in colnames(DF)) DF[,cn] <- as.integer(Date%in%df$Date[df$Code==cn]);
DF;
## V1 V2 V3
## 1 0 0 0
## 2 0 0 1
## 3 0 1 0
## 4 0 1 0
## 5 0 0 0
## 6 0 1 0
## 7 1 0 0
## 8 0 0 0
## 9 0 0 0
## 10 0 0 1
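The same per-column fill can also be written without an explicit loop. The sketch below uses small made-up inputs (not the seeded data above) and assumes every column name of DF appears in df$Code; the codes vector stands in for colnames(DF).

```r
# Made-up inputs for illustration only.
Date <- c(6, 1, 3, 3, 9)
df <- data.frame(Date = c(3, 1, 9, 6), Code = c("V1", "V2", "V2", "V3"))
codes <- c("V1", "V2", "V3")   # stands in for colnames(DF)

# For each code, flag the dates that belong to that code (1 = match, 0 = no match).
# sapply() names the resulting columns after the codes.
DF <- as.data.frame(sapply(codes, function(cn) as.integer(Date %in% df$Date[df$Code == cn])))
DF
```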

combine tables into a data frame

How do I turn a list of tables into a data frame?
I have:
> (tabs <- list(table(c('a','a','b')),table(c('c','c','b')),table(c()),table(c('b','b'))))
[[1]]
a b
2 1
[[2]]
b c
1 2
[[3]]
< table of extent 0 >
[[4]]
b
2
I want:
> data.frame(a=c(2,0,0,0),b=c(1,1,0,2),c=c(0,2,0,0))
  a b c
1 2 1 0
2 0 1 2
3 0 0 0
4 0 2 0
PS. Please do not assume that the tables were created by table calls! They were not!
c_names <- unique(unlist(sapply(tabs, names)))
df <- do.call(rbind, lapply(tabs, `[`, c_names))
colnames(df) <- c_names
df[is.na(df)] <- 0
This assumes the tables are one dimensional.
all.names <- unique(unlist(lapply(tabs, names)))
df <- as.data.frame(do.call(rbind,
lapply(
tabs, function(x) as.list(replace(c(x)[all.names], is.na(c(x)[all.names]), 0))
) ) )
names(df) <- all.names
df
There is probably a cleaner way to do this.
# a b c
# 1 2 1 0
# 2 0 1 2
# 3 0 0 0
# 4 0 2 0
tabs <- list(table(c('a','a','b')),table(c('c','c','b')),table(c()),table(c('b','b')))
dat.names <- unique(unlist(sapply(tabs, names)))
dat <- matrix(0, nrow = length(tabs), ncol = length(dat.names))
colnames(dat) <- dat.names
for (ii in 1:length(tabs)) {
dat[ii, ] <- tabs[[ii]][match(colnames(dat), names(tabs[[ii]]) )]
}
dat[is.na(dat)] <- 0
> dat
a b c
[1,] 2 1 0
[2,] 0 1 2
[3,] 0 0 0
[4,] 0 2 0
Here is a pretty clean approach:
library(reshape2)
newTabs <- melt(tabs)
newTabs
# Var1 value L1
# 1 a 2 1
# 2 b 1 1
# 3 b 1 2
# 4 c 2 2
# 5 b 2 4
newTabs$L1 <- factor(newTabs$L1, seq_along(tabs))
dcast(newTabs, L1 ~ Var1, fill = 0, drop = FALSE)
# L1 a b c
# 1 1 2 1 0
# 2 2 0 1 2
# 3 3 0 0 0
# 4 4 0 2 0
This makes use of the fact that there is a melt method for lists (see reshape2:::melt.list) which automatically adds in a variable (L1 for an unnested list) that identifies the index of the list element. Since your list has some items which are empty, they won't show up in your melted list, so you need to factor the "L1" column, specifying the levels you want. dcast takes care of restructuring your output and allows you to specify the desired fill value.
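Since the question stresses that the inputs did not come from table calls, here is a base R sketch that only relies on the elements being named vectors. The input list counts is made up, and treating the elements as plain named numeric vectors is an assumption.

```r
# Made-up input: plain named numeric vectors rather than table objects (an assumption).
counts <- list(c(a = 2, b = 1), c(b = 1, c = 2), numeric(0), c(b = 2))
all_names <- sort(unique(unlist(lapply(counts, names))))

# Look up every name in every element; names that are absent come back as NA.
rows <- lapply(counts, function(x) unname(x[match(all_names, names(x))]))
mat <- do.call(rbind, rows)
colnames(mat) <- all_names
mat[is.na(mat)] <- 0   # absent names count as zero
res <- as.data.frame(mat)
res
```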

multiply multiple column and find sum of each column for multiple values

I'm trying to multiply columns together and keep track of the resulting names.
I have a data frame:
v1 v2 v3 v4 v5
0 1 1 1 1
0 1 1 0 1
1 0 1 1 0
I'm trying to multiply each column with every other column, like:
v1v2
v1v3
v1v4
v1v5
and
v2v3
v2v4
v2v5
etc, and
v1v2v3
v1v2v4
v1v2v5
v2v3v4
v2v3v5
and likewise for combinations of 4 and 5 columns: if there are n columns, then up to n-column combinations.
I tried the following code in a while loop, but it is not working:
i<-1
while(i<=ncol(data)
{
results<-data.frame()
v<-i
results<- t(apply(data,1,function(x) combn(x,v,prod)))
comb <- combn(colnames(data),v)
colnames(results) <- apply(comb,v,function(x) paste(x[1],x[2],sep="*"))
results <- colSums(results)
}
Sample output for n=3:
v1v2 v1v3 v2v3
0 0 1
0 0 1
0 1 0
and colsum
v1v2 v1v3 v2v3
0 1 2
then
v1v2=0
v1v3=1
v2v3=2
This is the result I'm trying to get.
Try this:
df <- read.table(text = "v1 v2 v3 v4 v5
0 1 1 1 1
0 1 1 0 1
1 0 1 1 0", skip = 1)
df
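ll <- lapply(2:ncol(df), function(ncols){
  # product of every combination of `ncols` columns, computed row by row
  tmp <- t(apply(df, 1, function(rows) combn(x = rows, m = ncols, prod)))
  if(ncols < ncol(df)){
    tmp <- colSums(tmp)
  }
  else{
    # only one combination (all columns), so apply() returned a plain vector
    tmp <- sum(tmp)
  }
  # build names such as "V1V2" from the column names in each combination
  names1 <- t(combn(x = colnames(df), m = ncols))
  names(tmp) <- apply(names1, 1, function(rows) paste0(rows, collapse = ""))
  tmp # returned value becomes one element of ll
})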
ll
# [[1]]
# V1V2 V1V3 V1V4 V1V5 V2V3 V2V4 V2V5 V3V4 V3V5 V4V5
# 0 1 1 0 2 1 2 2 2 1
#
# [[2]]
# V1V2V3 V1V2V4 V1V2V5 V1V3V4 V1V3V5 V1V4V5 V2V3V4 V2V3V5 V2V4V5 V3V4V5
# 0 0 0 1 0 0 1 2 1 1
#
# [[3]]
# V1V2V3V4 V1V2V3V5 V1V2V4V5 V1V3V4V5 V2V3V4V5
# 0 0 0 0 1
#
# [[4]]
# V1V2V3V4V5
# 0
Edit following comment
The results for the different sets of column combinations can then be accessed by indexing (subsetting) the list. E.g., to access the 2-column combinations, select the first element of the list; to access the 3-column combinations, select the second element; etc.
ll[[1]]
# V1V2 V1V3 V1V4 V1V5 V2V3 V2V4 V2V5 V3V4 V3V5 V4V5
# 0 1 1 0 2 1 2 2 2 1
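For the 2-column case only, the sums can be cross-checked with a matrix product, because entry [i, j] of crossprod(m) is sum(m[, i] * m[, j]). This is a sketch of a check, not a replacement for the list above; the names cp, idx and pair_sums are illustrative, and the pairs come out in a different order than combn() produces.

```r
# Same toy data as above, typed in directly.
df <- data.frame(V1 = c(0, 0, 1), V2 = c(1, 1, 0), V3 = c(1, 1, 1),
                 V4 = c(1, 0, 1), V5 = c(1, 1, 0))

# crossprod(m) is t(m) %*% m, so entry [i, j] is sum(m[, i] * m[, j]).
cp <- crossprod(as.matrix(df))

# Keep each unordered pair once (upper triangle) and name it like the list output.
idx <- which(upper.tri(cp), arr.ind = TRUE)
pair_sums <- setNames(cp[idx], paste0(colnames(cp)[idx[, 1]], colnames(cp)[idx[, 2]]))
pair_sums
```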

Aggregating every 10 columns in binary matrix

I am new to R.
I would like to transform a binary matrix like this:
example:
" 1874 1875 1876 1877 1878 .... 2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
Since the column names are years, I would like to aggregate them into decades and obtain something like:
"1840-1849 1850-1859 1860-1869 .... 2000-2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
I am used to python and do not know how to do this transformation without making loops!
Thanks, isabel
It is unclear what aggregation you want, but using the following dummy data
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
The following counts events in each 10-year period.
Get the years as a numeric variable
years <- as.numeric(names(df))
Next we need an indicator for the start of each decade
ind <- seq(from = floor(min(years) / 10) * 10, to = floor(max(years) / 10) * 10 + 10, by = 10) # floor to the decade; signif() would round e.g. 1876 up to 1880
We then apply over the indices of ind (1:(length(ind)-1)), select columns from df that are the current decade and count the 1s using rowSums.
tmp <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)])
}, inds = ind, data = df)
Next we cbind the resulting vectors into a data frame and fix-up the column names:
out <- do.call(cbind.data.frame, tmp)
names(out) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out
This gives:
> out
1870-1879 1880-1889 1890-1899
1 4 5 6
2 4 6 6
3 2 5 5
4 5 5 7
5 3 3 7
6 5 5 4
If you want simply a binary matrix with a 1 indicating at least 1 event happened in that decade, then you can use:
tmp2 <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
as.numeric(rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)]) > 0)
}, inds = ind, data = df)
out2 <- do.call(cbind.data.frame, tmp2)
names(out2) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out2
which gives:
> out2
1870-1879 1880-1889 1890-1899
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
If you want a different aggregation, then modify the function applied in the lapply call to use something other than rowSums.
This is another option, using modular arithmetic to aggregate the columns.
# setup, borrowed from @GavinSimpson
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
result <- do.call(cbind,
by(t(df), as.numeric(names(df)) %/% 10 * 10, colSums))
# add -xxx9 to column names, for each decade
dimnames(result)[[2]] <- paste(colnames(result), as.numeric(colnames(result)) + 9, sep='-')
# 1870-1879 1880-1889 1890-1899
# V1 4 5 6
# V2 4 6 6
# V3 2 5 5
# V4 5 5 7
# V5 3 3 7
# V6 5 5 4
If you wanted to aggregate with something other than sum, replace the call to
colSums with something like function(cols) lapply(cols, f), where f is the aggregating
function, e.g., max.
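Another way to express the same sum-by-decade aggregation is base R's rowsum(), which sums the rows of a matrix by group. A hedged sketch on a small made-up matrix (m, decade, counts and binary are illustrative names):

```r
# Small made-up matrix: 2 rows, years 1878-1882 as column names.
m <- matrix(c(1, 0, 0, 1, 0, 0, 1, 1, 0, 1), nrow = 2,
            dimnames = list(c("A", "B"), as.character(1878:1882)))

decade <- as.numeric(colnames(m)) %/% 10 * 10   # start year of each column's decade

# rowsum() sums rows by group, so transpose, sum by decade, transpose back.
counts <- t(rowsum(t(m), group = decade))
colnames(counts) <- paste(colnames(counts), as.numeric(colnames(counts)) + 9, sep = "-")

binary <- (counts > 0) * 1   # 1 if at least one event happened in the decade
counts
```

rowsum() only sums, so for another aggregating function such as max the by()-based approach above is still the one to adapt.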
