After scraping some review data from a website, I am having difficulty organizing the data into a useful structure for analysis. The problem is that the data is dynamic, in that each reviewer gave ratings on anywhere between 0 and 3 subcategories (denoted as subcategories "a", "b" and "c"). I would like to organize the reviews so that each row is a different reviewer, and each column is a subcategory that was rated. Where reviewers chose not to rate a subcategory, I would like that missing data to be 'NA'. Here is a simplified sample of the data:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
vec contains the subcategories that were scored, and each "stop" marks the end of one reviewer's ratings. I would like to organize the result into a data frame with one row per reviewer, one column per subcategory, and NA where a subcategory was not rated.
I would greatly appreciate any help on this, because I've been working on this issue for far longer than it should take me.
@alexis_laz provided what I believe is the best answer:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
stops <- vec == "stop"
i = cumsum(stops)[!stops] + 1L
j = vec[!stops]
tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity) # although mean/sum work
# a b c
#[1,] 2 5 1
#[2,] 1 3 NA
#[3,] NA NA NA
#[4,] NA NA 2
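To see why this works, it may help to look at the intermediate values. This is only a sketch on the sample data from the question; the names long and n_reviewers are mine and not part of the answer above.

```r
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2, 5, 1, 1, 3, 2)

stops <- vec == "stop"
# running count of "stop" markers seen so far, at every position of vec
cumsum(stops)
# keep only the rating positions; +1 turns "stops seen so far" into a reviewer number
i <- cumsum(stops)[!stops] + 1L
j <- vec[!stops]
# long format: one row per rating that was actually given
long <- data.frame(reviewer = i, subcategory = j, rating = ratings)
long
# assumption: every reviewer ends with exactly one "stop", so this counts reviewers
n_reviewers <- sum(stops)
```

Reviewer 3 never appears in i, which is why the answer wraps i in factor(i, 1:max(i)) to keep an all-NA row for that reviewer. If the last reviewers in the data rated nothing, max(i) would miss them, and something like n_reviewers could be used for the factor levels instead.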
base R, but I'm using a for loop...
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
categories <- unique(vec)[unique(vec)!="stop"]
row = 1
df = data.frame(lapply(categories, function(x){NA_integer_}))
colnames(df) <- categories
rating = 1
for(i in vec) {
if(i=='stop') {row <- row+1
} else { df[row,i] <- ratings[[rating]]; rating <- rating+1}
}
Here is one option
library(data.table)
library(reshape2)
d1 <- as.data.table(melt(split(vec, c(1, head(cumsum(vec == "stop")+1,
-1)))))[value != 'stop', ratings := ratings
][value != 'stop'][, value := as.character(value)][, L1 := as.integer(L1)]
dcast( d1[CJ(value = value, L1 = seq_len(max(L1)), unique = TRUE), on = .(value, L1)],
L1 ~value, value.var = 'ratings')[, L1 := NULL][]
# a b c
#1: 2 5 1
#2: 1 3 NA
#3: NA NA NA
#4: NA NA 2
Using base R functions and rbind.fill from plyr or rbindlist from data.table to produce the final object, we can do
# convert vec into a list, split by "stop", dropping final element
temp <- head(strsplit(readLines(textConnection(paste(gsub("stop", "\n", vec, fixed=TRUE),
collapse=" "))), split=" "), -1)
# remove empty strings, but maintain empty list elements
temp <- lapply(temp, function(x) x[nchar(x) > 0])
# match up appropriate names to the individual elements in the list with setNames
# convert vectors to single row data.frames
temp <- Map(function(x, y) setNames(as.data.frame.list(x), y),
relist(ratings, skeleton = temp), temp)
# add silly data.frame (single row, single column) for any empty data.frames in list
temp <- lapply(temp, function(x) if(nrow(x) > 0) x else setNames(data.frame(NA), vec[1]))
Now, you can produce the single data.frame (data.table) with either plyr or data.table
# with plyr, returns data.frame
library(plyr)
do.call(rbind.fill, temp)
a b c
1 2 5 1
2 1 3 NA
3 NA NA NA
4 NA NA 2
# with data.table, returns data.table
library(data.table)
rbindlist(temp, fill=TRUE)
a b c
1: 2 5 1
2: 1 3 NA
3: NA NA NA
4: NA NA 2
Note that the line prior to the rbinding can be replaced with
temp[lengths(temp) == 0] <- replicate(sum(lengths(temp) == 0),
setNames(data.frame(NA), vec[1]), simplify=FALSE)
where the list items that are empty data frames are replaced using subsetting instead of an lapply over the entire list.
I want to make a data.frame with two columns. The first with the Id of tweets, and the second column the information depends on whether the tweet is a reply or retweet
id_str | x$retweeted_status$id_str or x$in_reply_to_status_id_str
I can make a data frame with three columns, but I need two.
My code:
ids <- sapply(tweets.list, function(x) x$id_str)
ret_ids <- sapply(tweets.list, function(x) if(is.null(x$retweeted_status)) NA else x$retweeted_status$id_str)
rep_ids <- sapply(tweets.list, function(x) if(is.null(x$in_reply_to_status_id_str)) NA else x$in_reply_to_status_id_str)
isnt.null <- function(x)!is.null(x)
r_ids <- sapply(tweets.list, function(x) if(is.null(x$retweeted_status)) x$in_reply_to_status_id_str else x$retweeted_status$id_str)
data.frame(ids,r_ids)
The output:
Error in data.frame("733222936912351232", NULL, "733220677721968641", :
arguments imply differing number of rows: 1, 0
Data:
ids|ret_ids|rep_ids
1|40|NA
2|32|NA
3|NA|555
4|NA|444
Result desired:
ids|r
1|40
2|32
3|555
4|444
Here's one way
df <- read.table(header=T, sep="|", text="ids|ret_ids|rep_ids
1|40|NA
2|32|NA
3|NA|555
4|NA|444")
setNames(as.data.frame(t(apply(df, 1, na.omit))), c("ids", "r"))
# ids r
# 1 1 40
# 2 2 32
# 3 3 555
# 4 4 444
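The error in the question comes from sapply returning a list when some elements are NULL, so data.frame sees elements of length 0. A possible way to work on the original list directly is sketched below. The tweets.list here is a made-up stand-in (I don't have the real data), and first_id is a hypothetical helper name.

```r
# hypothetical stand-in for tweets.list: fields that are absent are NULL
tweets.list <- list(
  list(id_str = "1", retweeted_status = list(id_str = "40"), in_reply_to_status_id_str = NULL),
  list(id_str = "2", retweeted_status = list(id_str = "32"), in_reply_to_status_id_str = NULL),
  list(id_str = "3", retweeted_status = NULL, in_reply_to_status_id_str = "555"),
  list(id_str = "4", retweeted_status = NULL, in_reply_to_status_id_str = "444"),
  list(id_str = "5", retweeted_status = NULL, in_reply_to_status_id_str = NULL)
)

# retweet id if present, else reply id, else NA (never NULL)
first_id <- function(x) {
  if (!is.null(x$retweeted_status)) {
    x$retweeted_status$id_str
  } else if (!is.null(x$in_reply_to_status_id_str)) {
    x$in_reply_to_status_id_str
  } else {
    NA_character_
  }
}

# vapply insists on exactly one string per tweet, so a stray NULL fails early
ids   <- vapply(tweets.list, function(x) x$id_str, character(1))
r_ids <- vapply(tweets.list, first_id, character(1))
res <- data.frame(ids, r_ids, stringsAsFactors = FALSE)
res
```

The important part is that every branch returns a length-one value, so both columns end up with one entry per tweet.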
I want to add a new column with "NA"s in my dataframe:
A B
1 14379 32094
2 151884 174367
3 438422 449382
But I need it to be located between col. A and B, like this:
A C B
1 14379 NA 32094
2 151884 NA 174367
3 438422 NA 449382
I know how to add col. C after col. B, but that is not helpful to me... Anyone know how to do it?
In 2 steps, you can reorder the columns:
dat$C <- NA
dat <- dat[, c("A", "C", "B")]
A C B
1 0.596068 NA -0.7783724
2 -1.464656 NA -0.8425972
You can also use append
dat <- data.frame(A = rnorm(2), B = rnorm(2))
as.data.frame(append(dat, list(C = NA), after = 1))
A C B
1 -0.7046408 NA 0.2117638
2 0.8402680 NA -2.0109721
If you use data.table, you can use the function setcolorder. Note that NA is stored as a logical value; if you want the column to be initialized as an integer, double or character column, you can use NA_integer_, NA_real_ or NA_character_.
e.g.
library(data.table)
DT <- data.table(DF)
# add column `C` = NA
DT[, C := NA]
setcolorder(DT, c('A','C','B'))
DT
## A C B
## 1: 14379 NA 32094
## 2: 151884 NA 174367
## 3: 438422 NA 449382
You could do this in one line
setcolorder(DT[, C := NA], c('A','C','B'))
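The note about NA types applies to plain data frames too. Here is a small base R sketch, using the numbers from the question and assuming an integer column is wanted:

```r
df <- data.frame(A = c(14379L, 151884L, 438422L),
                 B = c(32094L, 174367L, 449382L))

# a bare NA is logical; the typed constants fix the column type up front
class(NA)            # "logical"
class(NA_integer_)   # "integer"
class(NA_real_)      # "numeric"
class(NA_character_) # "character"

df$C <- NA_integer_           # recycled to every row
df <- df[, c("A", "C", "B")]  # move C between A and B
sapply(df, class)
```

Starting with the right type avoids a silent type change later, when real values are written into the column.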
You can also use the tibble package, which has (among many others) a function for exactly this: add_column()
library(tibble)
df <- data.frame("a" = 1:5, "b" = 6:10)
add_column(df, c = rep(NA, nrow(df)), .after = 1)
That function is easy to use, and you can use the argument .before instead of .after.
I wrote a function to append columns onto (into) a data.frame. It allows you to name the column as well, and does a few checks...
append_col <- function(x, cols, after=length(x)) {
x <- as.data.frame(x)
if (is.character(after)) {
ind <- which(colnames(x) == after)
# which() returns integer(0), never NULL, when there is no match
if (length(ind) == 0) stop(after, " not found in colnames(x)\n")
} else if (is.numeric(after)) {
ind <- after
}
stopifnot(all(ind <= ncol(x)))
cbind(x, cols)[, append(1:ncol(x), ncol(x) + 1:length(cols), after=ind)]
}
examples:
# create data
df <- data.frame("a"=1:5, "b"=6:10)
# append column
append_col(df, list(c=1:5))
# append after a column index
append_col(df, list(c=1:5), after=1)
# or after a named column
append_col(df, list(c=1:5), after="a")
# multiple columns / single values work as expected
append_col(df, list(c=NA, d=4:8), after=1)
(One advantage of calling cbind at the end of the function and indexing is that character columns within the data.frame are not coerced to factors, as would be the case when using as.data.frame(append(x, cols, after=ind)) in R versions before 4.0.0, where stringsAsFactors defaulted to TRUE.)
I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
Thanks
This loops by column, setting dt1 by reference and (hopefully) should be quick.
library(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)
for (i in 2:ncol(dt2)) {
nna = !is.na(dt2[[i]])
set(dt1,w[nna],i,dt2[[i]][nna])
}
dt1 = rbind(dt1,dt2[is.na(w)])
dt1
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2
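For comparison, the same column-wise idea can be sketched in base R with match. This is not the data.table answer above, just my reading of the rule in the question, and unlike set() it does not update by reference, so I would expect it to be slower on very wide data. The names hit, take and res are mine.

```r
df1 <- data.frame(K = c("A", "B", "C"),
                  P1 = c(1L, 1L, 1L), P2 = c(1L, 1L, 1L), P3 = c(1L, 1L, 1L),
                  stringsAsFactors = FALSE)
df2 <- data.frame(K = c("B", "C", "D"),
                  P1 = c(2L, NA, 2L), P2 = c(NA, 2L, 2L), P3 = c(2L, 2L, 2L),
                  stringsAsFactors = FALSE)

w   <- match(df2$K, df1$K)  # row of df1 for each df2 key, NA for new keys
hit <- !is.na(w)            # df2 rows whose key already exists in df1

for (col in setdiff(names(df2), "K")) {
  # overwrite only where the key matches and the update value is not NA
  take <- hit & !is.na(df2[[col]])
  df1[[col]][w[take]] <- df2[[col]][take]
}

# rows of df2 with unseen keys are the "inserts"
res <- rbind(df1, df2[!hit, , drop = FALSE])
rownames(res) <- NULL
res
```

This assumes both data frames have the same column names and one row per key, as stated in the question.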
This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(as.character(df1$K), as.character(df2$K)) #all cells of key column (also works when K is not a factor)
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- df2[df2$K%in%dups, ] #overlapping rows of df2 (assumes the same key order as in df1)
df4[-1] <- Map(decider, df4[-1], df1[df1$K%in%dups, -1]) #replace all NAs of df2 with df1 values if they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading, not all columns have the same name, but I am assuming the same order. This may be a more helpful approach:
all <- c(as.character(df1$K), as.character(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
colnames(LS[[i]]) <- colnames(LS[[2]])
return(LS[[i]])
}
)
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- LS4[[2]]
DF[-1] <- Map(decider, LS4[[2]][-1], LS4[[1]][-1]) # column by column, keeps column types
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5
EDIT : Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as separate answer.
library(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
setkey(dt1, K) # J() needs a keyed table
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
k = K[i]
p = unlist(dt2[i,-1,with=FALSE])
p = p[!is.na(p)]
dt1[J(k),(names(p)):=as.list(p)] # current data.table syntax; with=FALSE is no longer used with :=
}
Or can you use a matrix instead of a data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
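A rough sketch of that matrix idea on the small example, assuming all the P columns share one type so they fit in a matrix, with the key held in the row names. A, U, idx and B are illustrative names only.

```r
# base values: keys as row names, P columns as matrix columns
A <- matrix(1L, nrow = 3, ncol = 3,
            dimnames = list(c("A", "B", "C"), c("P1", "P2", "P3")))
# overlapping update rows only (B and C); NA means "leave the base value alone"
U <- matrix(c(2L, NA, 2L,
              NA, 2L, 2L),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("B", "C"), c("P1", "P2", "P3")))

# (row, col) positions of the non-NA cells of U
idx <- which(!is.na(U), arr.ind = TRUE)
# translate U's row numbers into A's row numbers via the keys
B <- cbind(match(rownames(U)[idx[, 1]], rownames(A)), idx[, 2])
# the single-line update: a 2-column index matrix on the left-hand side
A[B] <- U[idx]
A
```

New keys such as "D" are left out here; they would still need to be appended with rbind afterwards.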
The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table
dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=T), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")
dt1s[dt2s,value:=value.new] # Update new values
dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names
I regularly have situations where I need to replace missing values from a data.frame with values from some other data.frame that is at a different level of aggregation. So, for example, if I have a data.frame full of county data I might replace NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times I decided to break down and write a function to do this.
Here's what I cooked up, along with an example of how I use it:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
mergedDf <- merge(naDf, fillDf, by=mergeCols)
for (col in fillCols){
colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
k <- which( is.na( colWithNas ) )
colWithNas[k] <- colWithOutNas[k]
mergedDf[col] <- colWithNas
mergedDf[[paste(col, "x", sep=".")]] <- NULL
mergedDf[[paste(col, "y", sep=".")]] <- NULL
}
return(mergedDf)
}
## test case
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4) ,f = c(100,200, 300, 400), g = c(11, 12, 13, 14))
naDf <- data.frame( a = sample(c(1,2), 100, rep=TRUE), b = sample(c(3,4), 100, rep=TRUE), f = sample(c(0,NA), 100, rep=TRUE), g = sample(c(0,NA), 200, rep=TRUE) )
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g") )
So after I got this running I had this odd feeling that someone has probably solved this problem before me and in a much more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way that eliminates the loop in the middle of my function? That loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes the columns we're filling from are named the same as the columns we are filling to, and the same applies to the merge columns.
Any guidance or refactoring would be helpful.
EDIT on Dec 2 I realized I had logic flaws in my example which I fixed.
What a great question.
Here's a data.table solution:
# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))
# Merge data.tables, based on their keys (columns a & b)
outDT <- naDT[fillDT]
# a b f g f.1 g.1
# [1,] 1 3 NA 0 100 11
# [2,] 1 3 NA NA 100 11
# [3,] 1 3 NA 0 100 11
# [4,] 1 3 0 0 100 11
# [5,] 1 3 0 NA 100 11
# First 5 rows of 200 printed.
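# In outDT[i, j], on the following two lines
# -- i is a Boolean vector indicating which rows will be operated on
# -- j is an expression saying "(sub)assign from right column (e.g. i.f) to
# left column (e.g. f)"
# (current data.table releases name the joined fill columns i.f and i.g;
# the older release that produced the printout above named them f.1 and g.1)
outDT[is.na(f), f:=i.f]
outDT[is.na(g), g:=i.g]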
# Just keep the four columns ultimately needed
outDT <- outDT[,list(a,b,g,f)]
# a b g f
# [1,] 1 3 0 0
# [2,] 1 3 11 0
# [3,] 1 3 0 0
# [4,] 1 3 11 0
# [5,] 1 3 11 0
# First 5 rows of 200 printed.
Here's a slightly more concise/robust version of your approach. You could replace the for-loop with a call to lapply, but I find the loop easier to read.
This function assumes any columns not in mergeCols are fair game to have their NAs filled. I'm not really sure this helps, but I'll take my chances with the voters.
fillNaDf.ju <- function(naDf, fillDf, mergeCols) {
mergedDf <- merge(fillDf, naDf, by=mergeCols, suffixes=c(".fill",""))
dataCols <- setdiff(names(naDf),mergeCols)
# loop over all columns we didn't merge by
for(col in dataCols) {
rows <- is.na(mergedDf[,col])
# skip this column if it doesn't contain any NAs
if(!any(rows)) next
rows <- which(rows)
# replace NAs with values from fillDf
mergedDf[rows,col] <- mergedDf[rows,paste(col,"fill",sep=".")]
}
# don't return ".fill" columns
mergedDf[,names(naDf)]
}
My preference would be to pull out the code from merge that does the matching and do it myself, so that I could keep the ordering of the original data frame intact, both row-wise and column-wise. I also use matrix indexing to avoid any loops, though to do so I create a new data frame with the revised fillCols and replace the columns of the original with it. I thought I could fill it in directly, but apparently you can't use matrix indexing to replace parts of a data.frame, so I wouldn't be surprised if a loop over the names were faster in some situations.
With matrix indexing:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
na.ind <- is.na(naDf[,fillCols])
fill.ind <- cbind(match(naB, fillB)[row(na.ind)[na.ind]], col(na.ind)[na.ind])
naX <- naDf[,fillCols]
fillX <- fillDf[,fillCols]
naX[na.ind] <- fillX[fill.ind]
naDf[,colnames(naX)] <- naX
naDf
}
With a loop:
fillNaDf2 <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
m <- match(naB, fillB)
for(col in fillCols) {
fix <- which(is.na(naDf[,col]))
naDf[fix, col] <- fillDf[m[fix],col]
}
naDf
}
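Both versions rely on the same key-matching step: paste the merge columns into one string per row, then match. A tiny worked example of that step, with made-up data and a single fill column f:

```r
fillDf <- data.frame(a = c(1, 2, 1, 2), b = c(3, 3, 4, 4), f = c(100, 200, 300, 400))
naDf   <- data.frame(a = c(2, 1, 1, 2), b = c(4, 3, 4, 3), f = c(NA, 0, NA, 0))
mergeCols <- c("a", "b")

# one composite key per row; "\r" is a separator unlikely to occur in the data
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep = "\r"))
naB   <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep = "\r"))

# for every row of naDf, the row of fillDf that has the same key
m <- match(naB, fillB)

# fill only the NA cells; the row order of naDf is untouched
fix <- which(is.na(naDf$f))
naDf$f[fix] <- fillDf$f[m[fix]]
naDf
```

Because match returns positions in the order of naDf, nothing gets re-sorted, which is the point of not going through merge.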