Reorganize data.frame into tabulation of values - r

df <- data.frame(x=c(1,2,1,2,3,3), y = c(letters[1:5],'a'), val = c(1:5, 9))
print(df)
x y val
1 a 1
2 b 2
1 c 3
2 d 4
3 e 5
3 a 9
I want to create a function fun(df, rowname, colname, valname) that takes a data frame and the names of its row, column, and value variables, and returns a data.frame or matrix with row names, column names and values as shown below.
fun(df, "x","y","val") should return
1 2 3
a 1 NA 9
b NA 2 NA
c 3 NA NA
d NA 4 NA
e NA NA 5

The reshape2 library allows this kind of manipulation:
library(reshape2)
dcast(data=df, y~x, value.var = "val")
y 1 2 3
1 a 1 NA 9
2 b NA 2 NA
3 c 3 NA NA
4 d NA 4 NA
5 e NA NA 5
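For the fun(df, rowname, colname, valname) signature asked for in the question, a base R version could look like the sketch below. It is not part of the original answer. It assumes each (row, column) pair occurs at most once and follows the question's example, where the first name ("x") supplies the column labels and the second ("y") the row labels. The local names col_key, row_key, cols, rows and out are illustrative.

```r
df <- data.frame(x = c(1, 2, 1, 2, 3, 3), y = c(letters[1:5], "a"), val = c(1:5, 9))

fun <- function(df, rowname, colname, valname) {
  # As in the question's example, `rowname` supplies the column labels
  # and `colname` supplies the row labels of the result.
  col_key <- as.character(df[[rowname]])
  row_key <- as.character(df[[colname]])
  cols <- sort(unique(col_key))
  rows <- sort(unique(row_key))
  # Start from an all-NA matrix so unmatched cells stay NA
  out <- matrix(NA, nrow = length(rows), ncol = length(cols),
                dimnames = list(rows, cols))
  # Matrix indexing: one (row, column) position per input row
  out[cbind(match(row_key, rows), match(col_key, cols))] <- df[[valname]]
  out
}

res <- fun(df, "x", "y", "val")
res
```

Because the labels are converted to character before sorting, numeric labels with more than one digit would sort as text ("10" before "2"); that is fine for this example but worth adjusting for real data.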

Related

Transpose multiple columns as column names and fill with values in R

The sample data as following:
x <- read.table(header=T, text="
ID CostType1 Cost1 CostType2 Cost2
1 a 10 c 1
2 b 2 c 20
3 a 1 b 50
4 a 40 c 1
5 c 2 b 30
6 a 60 c 3
7 c 10 d 1
8 a 20 d 2")
I want the values of the two cost-type columns (CostType1 and CostType2) to become the names of new columns, filled with the corresponding cost for each cost type. If there is no match, the cell should be NA. The ideal format would be the following:
a b c d
1 10 NA 1 NA
2 NA 2 20 NA
3 1 50 NA NA
4 40 NA 1 NA
5 NA 30 2 NA
6 60 NA 3 NA
7 NA NA 10 1
8 20 NA NA 2
A solution using tidyverse. We first work out how many groups there are; in this example, there are two. We then convert each group, combine the results, and summarize the data frame by taking the first non-NA value in each column.
library(tidyverse)
# Get the number of groups
g <- (ncol(x) - 1)/2
x2 <- map_dfr(1:g, function(i){
  # Transform the data frame one group at a time
  x <- x %>%
    select(ID, ends_with(as.character(i))) %>%
    spread(paste0("CostType", i), paste0("Cost", i))
  return(x)
}) %>%
  group_by(ID) %>%
  # Select the first non-NA value if there are multiple values
  # (funs() is deprecated in recent dplyr; list(~ first(.[!is.na(.)])) is the newer spelling)
  summarise_all(funs(first(.[!is.na(.)])))
x2
# # A tibble: 8 x 5
# ID a b c d
# <int> <int> <int> <int> <int>
# 1 1 10 NA 1 NA
# 2 2 NA 2 20 NA
# 3 3 1 50 NA NA
# 4 4 40 NA 1 NA
# 5 5 NA 30 2 NA
# 6 6 60 NA 3 NA
# 7 7 NA NA 10 1
# 8 8 20 NA NA 2
A base solution using reshape
x1 <- setNames(x[,c("ID", "CostType1", "Cost1")], c("ID", "CostType", "Cost"))
x2 <- setNames(x[,c("ID", "CostType2", "Cost2")], c("ID", "CostType", "Cost"))
reshape(data=rbind(x1, x2), idvar="ID", timevar="CostType", v.names="Cost", direction="wide")
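The reshape call above returns columns named Cost.a, Cost.b and so on. As a small sketch (not part of the original answer), the prefix can be stripped and the columns ordered to match the layout requested in the question. The name wide is an assumption introduced here.

```r
x <- read.table(header = TRUE, text = "
ID CostType1 Cost1 CostType2 Cost2
1 a 10 c 1
2 b 2 c 20
3 a 1 b 50
4 a 40 c 1
5 c 2 b 30
6 a 60 c 3
7 c 10 d 1
8 a 20 d 2")

# Stack the two (type, cost) pairs into one long table
x1 <- setNames(x[, c("ID", "CostType1", "Cost1")], c("ID", "CostType", "Cost"))
x2 <- setNames(x[, c("ID", "CostType2", "Cost2")], c("ID", "CostType", "Cost"))

wide <- reshape(data = rbind(x1, x2), idvar = "ID", timevar = "CostType",
                v.names = "Cost", direction = "wide")

# Drop the "Cost." prefix and put the type columns in alphabetical order
names(wide) <- sub("^Cost\\.", "", names(wide))
wide <- wide[, c("ID", sort(setdiff(names(wide), "ID")))]
rownames(wide) <- NULL
wide
```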

unlist and merge selected columns to data frame in R

Suppose I have a list such as:
df1<-data.frame(n=letters[1:4], x=1:4, y=2:5, z=3:6)
df2<-data.frame(n=letters[2:5], x=2:5, y=3:6, z=4:7)
df3<-data.frame(n=letters[3:7], x=2:6, y=3:7, z=4:8)
ls<-list(df1, df2, df3)
ls
[[1]]
n x y z
1 a 1 2 3
2 b 2 3 4
3 c 3 4 5
4 d 4 5 6
[[2]]
n x y z
1 b 2 3 4
2 c 3 4 5
3 d 4 5 6
4 e 5 6 7
[[3]]
n x y z
1 c 2 3 4
2 d 3 4 5
3 e 4 5 6
4 f 5 6 7
5 g 6 7 8
What I want is to merge the first two columns of each data frame in the list by column n. The desired output would be:
n x1 x2 x3
1 a 1 NA NA
2 b 2 2 NA
3 c 3 3 2
4 d 4 4 3
5 e NA 5 4
6 f NA NA 5
7 g NA NA 6
And same thing for y and z:
n y1 y2 y3
1 a 2 NA NA
2 b 3 3 NA
3 c 4 4 3
4 d 5 5 4
5 e NA 6 5
6 f NA NA 6
7 g NA NA 7
n z1 z2 z3
1 a 3 NA NA
2 b 4 4 NA
3 c 5 5 4
4 d 6 6 5
5 e NA 7 6
6 f NA NA 7
7 g NA NA 8
We get the unique column names from the list of data.frames, excluding 'n' (stored in 'nm1'), loop through those with lapply(nm1, ...), subset the matching columns of each data.frame in 'ls' with lapply(ls, function(x) ...), and use Reduce with merge to merge the datasets in the list. The merged columns are renamed with make.unique, one name per data frame in the list.
nm1 <- setdiff(unlist(lapply(ls, names)), "n")
lapply(nm1, function(nm) setNames(
  Reduce(function(...) merge(..., all = TRUE, by = "n"),
         lapply(ls, function(x) x[c("n", nm)])),
  make.unique(c("n", rep(nm, length(ls))))))
#[[1]]
# n x x.1 x.2
#1 a 1 NA NA
#2 b 2 2 NA
#3 c 3 3 2
#4 d 4 4 3
#5 e NA 5 4
#6 f NA NA 5
#7 g NA NA 6
#[[2]]
# n y y.1 y.2
#1 a 2 NA NA
#2 b 3 3 NA
#3 c 4 4 3
#4 d 5 5 4
#5 e NA 6 5
#6 f NA NA 6
#7 g NA NA 7
#[[3]]
# n z z.1 z.2
#1 a 3 NA NA
#2 b 4 4 NA
#3 c 5 5 4
#4 d 6 6 5
#5 e NA 7 6
#6 f NA NA 7
#7 g NA NA 8
NOTE: ls is a function name that lists the objects. It is better to avoid naming objects with known R functions.
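Following that note, here is a hedged variant that stores the list under a different name (df_list, a name chosen here for illustration) and names each column before merging, so the result carries the x1, x2, x3 style names from the question rather than x, x.1, x.2. It is a sketch of the same Reduce and merge idea, not the answerer's code.

```r
df1 <- data.frame(n = letters[1:4], x = 1:4, y = 2:5, z = 3:6)
df2 <- data.frame(n = letters[2:5], x = 2:5, y = 3:6, z = 4:7)
df3 <- data.frame(n = letters[3:7], x = 2:6, y = 3:7, z = 4:8)
df_list <- list(df1, df2, df3)

# All variable names except the key column
vars <- setdiff(unique(unlist(lapply(df_list, names))), "n")

res <- lapply(vars, function(nm) {
  # Keep only the key and the current variable, suffixed with the list position
  parts <- lapply(seq_along(df_list), function(i) {
    setNames(df_list[[i]][c("n", nm)], c("n", paste0(nm, i)))
  })
  # Full outer join across all pieces
  Reduce(function(a, b) merge(a, b, by = "n", all = TRUE), parts)
})
names(res) <- vars
res$x
```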
Here is another base R method that uses do.call, data.frame, and cbind within a nested pair of lapply functions.
# get all levels of n across data frames
allN <- unique(unlist(sapply(ls, "[[", "n")))
# extract desired columns and provide names with setNames
lapply(names(ls[[1]])[-1], function(var) {
  cbind("n"=allN, setNames(do.call(data.frame,
    lapply(seq_along(ls), function(i) {
      ls[[i]][[var]][match(allN, ls[[i]]$n, nomatch=NA)]
    })), paste0(var, seq_along(ls))))
})
The first lapply runs through each of the variable names; the second lapply extracts the current variable from each data frame in the list. In the middle, do.call turns the list into a data.frame, setNames provides the desired names, and the n column is added with cbind.
In the innermost portion of the inner lapply, the code ls[[i]][[var]][match(allN, ls[[i]]$n, nomatch=NA)] is used to expand (and potentially reorder) the current vector according to the levels in allN. If the current vector is missing a level, the nomatch=NA tells match to instead return NA.
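The expansion step can be seen in isolation with a toy example. The objects d and idx below are made up for illustration; only allN mirrors the answer.

```r
allN <- letters[1:7]                        # every level of n across the list
d <- data.frame(n = letters[2:5], x = 2:5)  # one data frame covering only b..e

# Position of each level of allN inside d$n; NA where the level is absent
idx <- match(allN, d$n)
idx

# Indexing with NA yields NA, so the vector is padded to the full set of levels
expanded <- d$x[idx]
expanded
```

Since NA is already the default for nomatch, writing nomatch=NA in the answer is explicit rather than required.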

r element frequency and index, ranking

I was looking at the example code in the question "r element frequency and column name" and was wondering if there is any way to show the index of each element in each column, in addition to the rank and frequency, in R. For example, the desired input and output would be
df <- read.table(header=T, text='A B C D
a a b c
b c x e
c d y a
d NA NA z
e NA NA NA
f NA NA NA',stringsAsFactors=F)
and output
element frequency columns ranking A B C D
1 a 3 A,B,D 1 1 1 na 3
3 c 3 A,B,D 1 3 2 na 1
2 b 2 A,C 2 2 na 1 na
4 d 2 A,B 2 4 3 na na
5 e 2 A,D 2 5 na na 2
6 f 1 A 3 6 na na na
8 x 1 C 3 na na 2 na
9 y 1 C 3 na na 3 na
10 z 1 D 3 na na na 4
Thank you.
Perhaps there is a way to do this in one step, but it's not coming to mind at the moment. So, continuing with my previous answer:
library(dplyr)
library(tidyr)
step1 <- df %>%
  gather(var, val, everything()) %>%     ## Make a long dataset
  na.omit %>%                            ## We don't need the NA values
  group_by(val) %>%                      ## All calculations grouped by val
  summarise(column = toString(var),      ## This collapses
            freq = n()) %>%              ## This counts
  mutate(ranking = dense_rank(desc(freq)))  ## This ranks
step2 <- df %>%
  mutate(ind = 1:nrow(df)) %>%           ## Add an indicator column
  gather(var, val, -ind) %>%             ## Go long
  na.omit %>%                            ## Remove NA
  spread(var, ind)                       ## Go wide
inner_join(step1, step2)
# Joining by: "val"
# Source: local data frame [9 x 8]
#
# val column freq ranking A B C D
# 1 a A, B, D 3 1 1 1 NA 3
# 2 b A, C 2 2 2 NA 1 NA
# 3 c A, B, D 3 1 3 2 NA 1
# 4 d A, B 2 2 4 3 NA NA
# 5 e A, D 2 2 5 NA NA 2
# 6 f A 1 3 6 NA NA NA
# 7 x C 1 3 NA NA 2 NA
# 8 y C 1 3 NA NA 3 NA
# 9 z D 1 3 NA NA NA 4
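For comparison, the same table can be built in base R without the long/wide round trip. This is a sketch rather than the answerer's method: it assumes each value appears at most once per column (match returns only the first position), and the names all_vals, vals, idx, freq, columns and res are introduced here for illustration.

```r
df <- read.table(header = TRUE, text = "A B C D
a a b c
b c x e
c d y a
d NA NA z
e NA NA NA
f NA NA NA", stringsAsFactors = FALSE)

all_vals <- unlist(df, use.names = FALSE)
vals <- sort(unique(all_vals[!is.na(all_vals)]))

# Row index of each value within every column (NA when absent)
idx <- sapply(df, function(col) match(vals, col))

# Frequency across the whole data frame; table() drops NA by default
freq <- as.vector(table(all_vals)[vals])

# Dense rank: 1 for the highest frequency, no gaps between ranks
ranking <- match(freq, sort(unique(freq), decreasing = TRUE))

# Comma-separated list of the columns in which each value occurs
columns <- apply(idx, 1, function(r) toString(colnames(idx)[!is.na(r)]))

res <- data.frame(val = vals, column = columns, freq = freq,
                  ranking = ranking, idx, stringsAsFactors = FALSE)
res
```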

Replacing row values by column name in R

I have a data frame as below
df<- data.frame(a = c(1,2,3,4,5),
b = c(6,7,8,9,10),
c = c(11,12,13,14,15),
z = c("b","c","a","a","b"))
I'm trying to replace, in each row, the value in the column whose name equals that row's value in column z. The desired output is below
a b c z
1 1 NA 11 b
2 2 7 NA c
3 NA 8 13 a
4 NA 9 14 a
5 5 NA 15 b
I was thinking something like the following code applied to each row
If column name is equal to Z, replace value with NA
But can't figure it out. Any help appreciated
Cheers!
Use matrix indexing, matching the z column to the column names:
df[cbind(seq(nrow(df)),match(df$z,colnames(df[1:3])))] <- NA
df
a b c z
1 1 NA 11 b
2 2 7 NA c
3 NA 8 13 a
4 NA 9 14 a
5 5 NA 15 b
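To make the mechanism visible, the index matrix can be built as a separate object first. In this sketch the name ij is an assumption, and matching against names(df) instead of colnames(df[1:3]) is a small variation that does not depend on the value columns coming first.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(6, 7, 8, 9, 10),
                 c = c(11, 12, 13, 14, 15),
                 z = c("b", "c", "a", "a", "b"))

# One (row, column) pair per row of df: the column position is looked up from z
ij <- cbind(row = seq_len(nrow(df)), col = match(df$z, names(df)))
ij

# A two-column numeric matrix addresses individual cells of the data frame
df[ij] <- NA
df
```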
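This is only going to work if the letter-named columns are in lexicographic order, every one of them appears in z, and z is a factor (the default for character columns before R 4.0; wrapping it in factor() as below also covers R 4.0 and later):
> df[cbind(1:5,as.numeric(factor(df$z)))] <- rep(NA,5)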
> df
a b c z
1 1 NA 11 b
2 2 7 NA c
3 NA 8 13 a
4 NA 9 14 a
5 5 NA 15 b

Fill missing values with new data R-Python

I have two datasets, x and y
> x
a index b
1 1 1 5
2 NA 2 6
3 2 3 NA
4 NA 4 9
> y
index a
1 2 100
2 4 101
>
I would like to fill the missing values of x with the values contained in y.
I have tried to use the merge function but the result is not what I want.
> merge(x,y, by = 'index', all=T)
index a.x b a.y
1 1 1 5 NA
2 2 NA 6 100
3 3 2 NA NA
4 4 NA 9 101
In the real problem there are additional limitations:
1 - y does not fill all the missing values
2 - x and y have in common more variables (so not only a and index)
EDIT : More realistic example
> x
a index b c
1 1 1 5 NA
2 NA 2 6 NA
3 2 3 NA 5
4 NA 4 9 NA
5 NA 5 10 6
> y
index a c
1 2 100 4
2 4 101 NA
>
A solution in either Python or R would be accepted.
I used your merge idea and did the following using dplyr. I am sure there will be better ways of doing this task.
index <- 1:5
a <- c(1, NA, 2, NA, NA)
b <- c(5,6,NA,9,10)
c <- c(NA,NA,5,NA,6)
ana <- data.frame(index, a,b,c, stringsAsFactors=F)
index <- c(2,4)
a <- c(100, 101)
c <- c(4, NA)
bob <- data.frame(index, a,c, stringsAsFactors=F)
> ana
index a b c
1 1 1 5 NA
2 2 NA 6 NA
3 3 2 NA 5
4 4 NA 9 NA
5 5 NA 10 6
> bob
index a c
1 2 100 4
2 4 101 NA
library(dplyr)
ana %>%
  merge(., bob, by = "index", all = TRUE) %>%
  mutate(a.x = ifelse(a.x %in% NA, a.y, a.x)) %>%
  mutate(c.x = ifelse(c.x %in% NA, c.y, c.x))
index a.x b c.x a.y c.y
1 1 1 5 NA NA NA
2 2 100 6 4 100 4
3 3 2 NA 5 NA NA
4 4 101 9 NA 101 NA
5 5 NA 10 6 NA NA
I overwrote a.x (ana$a) with a.y (bob$a) using mutate, and did the same for c.x (ana$c). If you remove a.y and c.y at the end, that will be the outcome you expect, I think.
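The same overwrite can be applied to every shared column without naming each one. The helper below is a hypothetical base R sketch (fill_from and its key argument are names introduced here, not from any package), and it assumes the key is unique in y.

```r
x <- data.frame(a = c(1, NA, 2, NA, NA), index = 1:5,
                b = c(5, 6, NA, 9, 10), c = c(NA, NA, 5, NA, 6))
y <- data.frame(index = c(2, 4), a = c(100, 101), c = c(4, NA))

fill_from <- function(x, y, key = "index") {
  # Columns present in both tables, apart from the key
  shared <- setdiff(intersect(names(x), names(y)), key)
  # Row of y that corresponds to each row of x (NA when y has no such key)
  pos <- match(x[[key]], y[[key]])
  for (nm in shared) {
    # Only touch cells that are missing in x and have a matching row in y
    miss <- is.na(x[[nm]]) & !is.na(pos)
    x[[nm]][miss] <- y[[nm]][pos[miss]]
  }
  x
}

res <- fill_from(x, y)
res
```

Values already present in x are never overwritten, and cells that y cannot fill stay NA, which matches the two limitations listed in the question.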
Try:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
m1$b = x$b[match(m1$index, x$index)]
m1$c = x$c[match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
or, if there are many other columns like b and c:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
for(nn in names(x)[3:4]) m1[,nn] = x[,nn][match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
If there are multiple columns to replace, you could convert from wide to long form, as in the first two methods below, and replace in one step
m1 <- merge(x,y, by="index", all=TRUE)
m1L <- reshape(m1, idvar="index", varying=grep("\\.", colnames(m1)), direction="long", sep=".")
row.names(m1L) <- 1:nrow(m1L)
lst1 <- split(m1L, m1L$time)
indx <- is.na(lst1[[1]][,4:5])
lst1[[1]][,4:5][indx] <- lst1[[2]][,4:5][indx]
res <- lst1[[1]][,c(4,1,2,5)]
res
# a index b c
#1 1 1 5 NA
#2 100 2 6 4
#3 2 3 NA 5
#4 101 4 9 NA
#5 NA 5 10 6
Or you could use dplyr with tidyr
library(dplyr)
library(tidyr)
z <- left_join(x, y, by="index") %>%
gather(Var, Val, matches("\\.")) %>%
separate(Var, c("Var1", "Var2"))
indx1 <- which(is.na(z$Val) & z$Var2=="x")
z$Val[indx1] <- z$Val[indx1+nrow(z)/2]
z %>%
spread(Var1, Val) %>%
filter(Var2=="x") %>%
select(-Var2)
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6
Or split the columns by matching the names before the . and use sapply to replace the NA's.
indx <- grep("\\.", colnames(m1),value=TRUE)
res <- cbind(m1[!names(m1) %in% indx],
             sapply(split(indx, gsub("\\..*", "", indx)), function(x) {
               x1 <- m1[x]
               indx1 <- is.na(x1[,1])
               x1[,1][indx1] <- x1[,2][indx1]
               x1[,1]} ))
res
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6

Resources