Reshape data frame with different column lengths into two columns replicating column ID - r

I have the following data frame, with different row lengths:
myvar <- as.data.frame(rbind(c("Walter","NA","NA","NA","NA"),
c("Walter","NA","NA","NA","NA"),
c("Walter","Jesse","NA","NA","NA"),
c("Gus","Tuco","Mike","NA","NA"),
c("Gus","Mike","Hank","Saul","Flynn")))
ID <- as.factor(c(1:5))
data.frame(ID,myvar)
ID V1 V2 V3 V4 V5
1 Walter NA NA NA NA
2 Walter NA NA NA NA
3 Walter Jesse NA NA NA
4 Gus Tuco Mike NA NA
5 Gus Mike Hank Saul Flynn
My goal is to switch this data frame into a two column data frame. The first column would be the ID and the other one would be the character name. Note that the ID must be correspondent to the row the character were originally placed. I'm expecting the following result:
ID V
1 Walter
2 Walter
3 Walter
3 Jesse
4 Gus
4 Tuco
4 Mike
5 Gus
5 Mike
5 Hank
5 Saul
5 Flynn
I've tried dcast {reshape2} but it doesn't returned what I need. It is noteworthy that my original data frame is quite big. Any tips? Cheers.

You could use unlist
res <- subset(data.frame(ID,value=unlist(myvar[-1],
use.names=FALSE)), value!='NA')
res
# ID value
#1 1 Walter
#2 2 Walter
#3 3 Walter
#4 4 Gus
#5 5 Gus
#6 3 Jesse
#7 4 Tuco
#8 5 Mike
#9 4 Mike
#10 5 Hank
#11 5 Saul
#12 5 Flynn
NOTE: The NAs are 'character' elements in the dataset, it is better to create it without quotes so that it will be real NAs and we can remove it by na.omit, is.na, complete.cases etc.
data
myvar <- data.frame(ID,myvar)

myvar <- as.data.frame(rbind(c("Walter","NA","NA","NA","NA"),
c("Walter","NA","NA","NA","NA"),
c("Walter","Jesse","NA","NA","NA"),
c("Gus","Tuco","Mike","NA","NA"),
c("Gus","Mike","Hank","Saul","Flynn")))
ID <- as.factor(c(1:5))
df <- data.frame(ID, myvar)
Using base reshape. (I'm converting your "NA" character strings to NA which you may not have to do, this is just due to how you created this example)
df[df == 'NA'] <- NA
na.omit(reshape(df, direction = 'long', varying = list(2:6))[, c('ID','V1')])
# ID V1
# 1.1 1 Walter
# 2.1 2 Walter
# 3.1 3 Walter
# 4.1 4 Gus
# 5.1 5 Gus
# 3.2 3 Jesse
# 4.2 4 Tuco
# 5.2 5 Mike
# 4.3 4 Mike
# 5.3 5 Hank
# 5.4 5 Saul
# 5.5 5 Flynn
or using reshape2
library('reshape2')
## na.omit(melt(df, id.vars = 'ID')[, c('ID','value')])
## or better yet as ananda suggests:
melt(df, id.vars = 'ID', na.rm = TRUE)[, c('ID','value')]
# ID value
# 1 1 Walter
# 2 2 Walter
# 3 3 Walter
# 4 4 Gus
# 5 5 Gus
# 8 3 Jesse
# 9 4 Tuco
# 10 5 Mike
# 14 4 Mike
# 15 5 Hank
# 20 5 Saul
# 25 5 Flynn
you get warnings that the factor levels over the columns are not the same but that's fine.

Fix your "NA" so that they are actually NA first:
mydf[mydf == "NA"] <- NA
Using some subsetting to do it all in one fell swoop:
data.frame(ID=mydf$ID[row(mydf[-1])[!is.na(mydf[-1])]], V=mydf[-1][!is.na(mydf[-1])])
# ID V
#1 1 Walter
#2 2 Walter
#3 3 Walter
#4 4 Gus
#5 5 Gus
#6 3 Jesse
#7 4 Tuco
#8 5 Mike
#9 4 Mike
#10 5 Hank
#11 5 Saul
#12 5 Flynn
Or much more readable in base R:
sel <- which(!is.na(mydf[-1]), arr.ind=TRUE)
data.frame(ID=mydf$ID[sel[,1]], V=mydf[-1][sel])

Using tidyr
library("tidyr")
myvar <- as.data.frame(rbind(c("Walter","NA","NA","NA","NA"),
c("Walter","NA","NA","NA","NA"),
c("Walter","Jesse","NA","NA","NA"),
c("Gus","Tuco","Mike","NA","NA"),
c("Gus","Mike","Hank","Saul","Flynn")))
ID <- as.factor(c(1:5))
myvar <- data.frame(ID,myvar)
myvar %>%
gather(ID, Name, V1:V5 ) %>%
select(ID, value) %>%
filter(value != "NA")
If your NAs are coded as NA instead of "NA", then we can in fact use the na.rm = TRUE option in gather. E.g.:
myvar[myvar == "NA"] <- NA
myvar %>%
gather(ID, Name, V1:V5, na.rm = TRUE ) %>%
select(ID, value)
gives
ID value
1 1 Walter
2 2 Walter
3 3 Walter
4 4 Gus
5 5 Gus
6 3 Jesse
7 4 Tuco
8 5 Mike
9 4 Mike
10 5 Hank
11 5 Saul
12 5 Flynn

Since you are thinking of huge data,
time performance would matter, even sorting afterwards could take forever
Here's my solution. You better be using data.table but here I'll use reshape2
first solution
myvar <- as.data.frame(rbind(c("Walter","NA","NA","NA","NA"),
c("Walter","NA","NA","NA","NA"),
c("Walter","Jesse","NA","NA","NA"),
c("Gus","Tuco","Mike","NA","NA"),
c("Gus","Mike","Hank","Saul","Flynn")))
ID <- as.factor(c(1:5))
dat = data.frame(ID,myvar)
dat[] <- lapply(dat, function(x) {x[x=="NA"]=NA; x})
str(dat$V5)
library(dplyr)
library(reshape2)
dat2 <- melt(dat, id.vars="ID", measure.vars = paste0("V", 1:5), na.rm=TRUE)
dat2
dat2[, c('ID', 'value')]
second solution needs some preprocessing. for huge data, i would recommend data.table
datB <- t(dat)
datB
colnames(datB) <- datB["ID", ]
datB <- datB[-1,]
melt(datB, measure.vars = 1:5, na.rm=TRUE)[, c('Var2', 'value')]
you do not need sorting afterwards

Related

From axis values to coodinates pairs [duplicate]

I have two vectors of integers, say v1=c(1,2) and v2=c(3,4), I want to combine and obtain this as a result (as a data.frame, or matrix):
> combine(v1,v2) <--- doesn't exist
1 3
1 4
2 3
2 4
This is a basic case. What about a little bit more complicated - combine every row with every other row? E.g. imagine that we have two data.frames or matrices d1, and d2, and we want to combine them to obtain the following result:
d1
1 13
2 11
d2
3 12
4 10
> combine(d1,d2) <--- doesn't exist
1 13 3 12
1 13 4 10
2 11 3 12
2 11 4 10
How could I achieve this?
For the simple case of vectors there is expand.grid
v1 <- 1:2
v2 <- 3:4
expand.grid(v1, v2)
# Var1 Var2
#1 1 3
#2 2 3
#3 1 4
#4 2 4
I don't know of a function that will automatically do what you want to do for dataframes(See edit)
We could relatively easily accomplish this using expand.grid and cbind.
df1 <- data.frame(a = 1:2, b=3:4)
df2 <- data.frame(cat = 5:6, dog = c("a","b"))
expand.grid(df1, df2) # doesn't work so let's try something else
id <- expand.grid(seq(nrow(df1)), seq(nrow(df2)))
out <-cbind(df1[id[,1],], df2[id[,2],])
out
# a b cat dog
#1 1 3 5 a
#2 2 4 5 a
#1.1 1 3 6 b
#2.1 2 4 6 b
Edit: As Joran points out in the comments merge does this for us for data frames.
df1 <- data.frame(a = 1:2, b=3:4)
df2 <- data.frame(cat = 5:6, dog = c("a","b"))
merge(df1, df2)
# a b cat dog
#1 1 3 5 a
#2 2 4 5 a
#3 1 3 6 b
#4 2 4 6 b

Need help in data manipulation in R [duplicate]

This question already has answers here:
Split data frame string column into multiple columns
(16 answers)
Closed 6 years ago.
i have a dataframe with 2 columns id, cat_list
id cat_list
1 A
2 A|B
3 E|F|G
4 I
5 P|R|T|Z
i want to achieve the below using R code.
id cat_list1 cat_list2 cat_list3 cat_list4
1 A
2 A B
3 E F G
4 I
5 P R T Z
tidyr::separate is handy:
library(tidyr)
df %>% separate(cat_list, into = paste0('cat_list', 1:4), fill = 'right')
## id cat_list1 cat_list2 cat_list3 cat_list4
## 1 1 A <NA> <NA> <NA>
## 2 2 A B <NA> <NA>
## 3 3 E F G <NA>
## 4 4 I <NA> <NA> <NA>
## 5 5 P R T Z
We can use cSplit. Here, we don't need to worry to about the number of splits as it will automatically detect it.
library(splitstackshape)
cSplit(df1, "cat_list", "|")
# id cat_list_1 cat_list_2 cat_list_3 cat_list_4
#1: 1 A NA NA NA
#2: 2 A B NA NA
#3: 3 E F G NA
#4: 4 I NA NA NA
#5: 5 P R T Z
NOTE: It may be better to fill with NA rather than ''.

Delete consecutive empty rows in R

df presents possible name matches. Each pair of matches should be divided by an empty row. However, in some cases my output includes several empty rows between the matching pairs:
> df <- data.frame(id = c(1,2,NA,3,4,NA,NA,NA,5,6,NA), name = c("john jones", "john joners",
NA, "clara prat", "klara prat", NA, NA, NA, "alan turing", "allan turing",
NA), stringsAsFactors = F)
> df
id name
1 1 john jones
2 2 john joners
3 NA <NA>
4 3 clara prat
5 4 klara prat
6 NA <NA>
7 NA <NA>
8 NA <NA>
9 5 alan turing
10 6 allan turing
11 NA <NA>
The desired output is:
> df
id name
1 1 john jones
2 2 john joners
3 NA <NA>
4 3 clara prat
5 4 klara prat
6 NA <NA>
7 5 alan turing
8 6 allan turing
9 NA <NA>
I can do this with a for loop, which I understand is less than optimal.
Perhaps this helps
v1 <- rowSums(!is.na(df))
df[unlist(lapply(split(seq_along(v1),
cumsum(c(1, diff(!v1))<0)), function(i)
i[seq(which.max(v1[i]==0))])),]
# id name
#1 1 john jones
#2 2 john joners
#3 NA <NA>
#4 3 clara prat
#5 4 klara prat
#6 NA <NA>
#9 5 alan turing
#10 6 allan turing
#11 NA <NA>
Here is another approach using rle to look for runs of missing
miss <- rowSums(is.na(df))
# get runs of missing
r <- rle(miss)
r$values <- seq_along(r$values)
# subset data, removing rows when all columns are missing
# and rows sequentially missing
df[!(miss == ncol(df) & duplicated(inverse.rle(r))), ]
# id name
# 1 1 john jones
# 2 2 john joners
# 3 NA <NA>
# 4 3 clara prat
# 5 4 klara prat
# 6 NA <NA>
# 9 5 alan turing
# 10 6 allan turing
# 11 NA <NA>
As mentioned by Akrun, you can use data.table::rleid to avoid some of the explicit rle calculations
df[!(rowSums(is.na(df)) == ncol(df) & duplicated(data.table::rleid(is.na(df[[1]])))) , ]
Using the IRanges package.
df <- data.frame(id = c(1,2,NA,3,4,NA,NA,NA,5,6,NA), name = c("john jones", "john joners",
NA, "clara prat", "klara prat", NA, NA, NA, "alan turing", "allan turing",
NA), stringsAsFactors = F)
library(IRanges)
na.rs <- which(is.na(df$id) & is.na(df$name))
na.rs.re <- reduce(IRanges(na.rs, na.rs))
na.rs.rm <- na.rs.re[width(na.rs.re)>1]
start(na.rs.rm) <- start(na.rs.rm) + 1
df[-as.integer(na.rs.rm), ]
# id name
# 1 1 john jones
# 2 2 john joners
# 3 NA <NA>
# 4 3 clara prat
# 5 4 klara prat
# 6 NA <NA>
# 9 5 alan turing
# 10 6 allan turing
# 11 NA <NA>
Surely not the best solution but easy to follow..
miss <- rowSums(is.na(df))
r <- sum(rle(miss)[[2]])
for(i in 2:length(df$id)){
while(is.na(df$id[i-1]) & is.na(df$id[i])){
df <- df[-(i),]
if(sum(is.na(df$id)) == r) break
}
}

Combining Rows - Summing Certain Columns and Not Others in R

I have a data set that has repeated names in column 1 and then 3 other columns that are numeric.
I want to combine the rows of repeated names into one column and sum 2 of the columns while leaving the other alone. Is there a simple way to do this? I have been trying to figure it out with sapply and lapply and have read a lot of the Q&As here and can't seem to find a solution
Name <- c("Jeff", "Hank", "Tom", "Jeff", "Hank", "Jeff",
"Jeff", "Bill", "Mark")
data.Point.1 <- c(3,4,3,3,4,3,3,6,2)
data.Point.2 <- c(6,9,2,5,7,4,8,2,9)
data.Point.3 <- c(2,2,8,6,4,3,3,3,1)
data <- data.frame(Name, data.Point.1, data.Point.2, data.Point.3)
The data looks like this:
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 6 2
2 Hank 4 9 2
3 Tom 3 2 8
4 Jeff 3 5 6
5 Hank 4 7 4
6 Jeff 3 4 3
7 Jeff 3 8 3
8 Bill 6 2 3
9 Mark 2 9 1
I'd like to get it to look like this (summing columns 3 and 4 and leaving column 1 alone. I'd like it to look like this:
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 23 14
2 Hank 4 16 6
3 Tom 3 2 8
8 Bill 6 2 3
9 Mark 2 9 1
Any help would great. Thanks!
Another solution which is a bit more straightforward is by using the library dplyr
library(dplyr)
data <- data %>% group_by(Name, data.Point.1) %>% # group the columns you want to "leave alone"
summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) # sum columns 3 and 4
if you want to sum over all other columns except those you want to "leave alone" then replace summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) with summarise_each(funs(sum))
I'd do it this way using data.table:
setDT(data)[, c(data.Point.1 = data.Point.1[1L],
lapply(.SD, sum)), by=Name,
.SDcols = -"data.Point.1"]
# Name data.Point.1 data.Point.2 data.Point.3
# 1: Jeff 3 23 14
# 2: Hank 3 16 6
# 3: Tom 3 2 8
# 4: Bill 3 2 3
# 5: Mark 3 9 1
We group by Name, and for each group, get first element of data.Point.1, and for the rest of the columns, we compute sum by using base function lapply and looping it through the columns of .SD, which stands for Subset of Data. The columns in .SD is provided by .SDcols, to which we remove data.Point.1, so that all the other columns are provided to .SD.
Check the HTML vignettes for detailed info.
You could try
library(data.table)
setDT(data)[, list(data.Point.1=data.Point.1[1L],
data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)), by=Name]
# Name data.Point.1 data.Point.2 data.Point.3
#1: Jeff 3 23 14
#2: Hank 4 16 6
#3: Tom 3 2 8
#4: Bill 6 2 3
#5: Mark 2 9 1
or using base R
data$Name <- factor(data$Name, levels=unique(data$Name))
res <- do.call(rbind,lapply(split(data, data$Name), function(x) {
x[3:4] <- colSums(x[3:4])
x[1,]} ))
Or using dplyr, you can use summarise_each to apply the function that needs to be applied on multiple columns, and cbind the output with the 'summarise' output for a single column
library(dplyr)
res1 <- data %>%
group_by(Name) %>%
summarise(data.Point.1=data.Point.1[1L])
res2 <- data %>%
group_by(Name) %>%
summarise_each(funs(sum), 3:4)
cbind(res1, res2[-1])
# Name data.Point.1 data.Point.2 data.Point.3
#1 Jeff 3 23 14
#2 Hank 4 16 6
#3 Tom 3 2 8
#4 Bill 6 2 3
#5 Mark 2 9 1
EDIT
The data created and the data showed initially differed in the original post. After the edit on OP's post (by #dimitris_ps), you can get the expected result by replacing group_by(Name) with group_by(Name, data.Point.1) in the res2 <- .. code.

Full outer join of multiple dataframes stored as elements of a list using data.table

I'm trying to do a full outer join of multiple dataframes stored as elements of a list using data.table. I have successfully done this using the merge_recurse() function of the reshape package, but it is very slow with larger datasets, and I'd like to speed up the merge by using data.table. I'm not sure the best way for data.table to handle the list structure with multiple dataframes. I'm also not sure if I've written the Reduce() function correctly on unique keys to do a full outer join on multiple dataframes.
Here's a small example:
#Libraries
library("reshape")
library("data.table")
#Specify list of multiple dataframes
filelist <- list(data.frame(x=c(1,1,1,2,2,2,3,3,3), y=c(1,2,3,1,2,3,1,2,3), a=1:9),
data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,1), b=seq(from=0, by=5, length.out=9)),
data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,2), c=seq(from=0, by=10, length.out=9)))
#Merge with merge_recurse()
listMerged <- merge_recurse(filelist, by=c("x","y"))
#Attempt with data.table
ids <- lapply(filelist, function(x) x[,c("x","y")])
unique_keys <- unique(do.call("rbind", ids))
dt <- data.table(filelist)
setkey(dt, c("x","y")) #error here
Reduce(function(x, y) x[y[J(unique_keys)]], filelist)
Here's my expected output:
> listMerged
x y a b c
1 1 1 1 0 0
2 1 2 2 5 10
3 1 3 3 10 20
4 2 1 4 15 30
5 2 2 5 20 40
6 2 3 6 25 50
7 3 1 7 30 60
8 3 2 8 35 70
9 3 3 9 NA NA
10 4 1 NA 40 NA
11 4 2 NA NA 80
Here are my resources:
Suggestion to use Reduce() function on data.table (see last comment of answer)
Suggestion to use "unique keys" to do full outer join in data.table
This worked for me:
library("reshape")
library("data.table")
##
filelist <- list(
data.frame(
x=c(1,1,1,2,2,2,3,3,3),
y=c(1,2,3,1,2,3,1,2,3),
a=1:9),
data.frame(
x=c(1,1,1,2,2,2,3,3,4),
y=c(1,2,3,1,2,3,1,2,1),
b=seq(from=0, by=5, length.out=9)),
data.frame(
x=c(1,1,1,2,2,2,3,3,4),
y=c(1,2,3,1,2,3,1,2,2),
c=seq(from=0, by=10, length.out=9)))
##
## I used copy so that this would
## not modify 'filelist'
dtList <- copy(filelist)
lapply(dtList,setDT)
lapply(dtList,function(x){
setkeyv(x,cols=c("x","y"))
})
##
> Reduce(function(x,y){
merge(x,y,all=T,allow.cartesian=T)
},dtList)
x y a b c
1: 1 1 1 0 0
2: 1 2 2 5 10
3: 1 3 3 10 20
4: 2 1 4 15 30
5: 2 2 5 20 40
6: 2 3 6 25 50
7: 3 1 7 30 60
8: 3 2 8 35 70
9: 3 3 9 NA NA
10: 4 1 NA 40 NA
11: 4 2 NA NA 80
Also I noticed a couple of problems in your code. dt <- data.table(filelist) resulted in
> dt
filelist
1: <data.frame>
2: <data.frame>
3: <data.frame>
which is most likely the cause of the error in setkey(dt, c("x","y")) that you pointed out above. Also, did this work for you?
Reduce(function(x, y) x[y[J(unique_keys)]], filelist)
I'm just curious, because I was getting an error when I tried to run it (using dtList instead of filelist)
Error in eval(expr, envir, enclos) : could not find function "J"
which I believe has to do with the changes implemented since version 1.8.8 of data.table, explained by #Arun in this answer.

Resources