Data restructure in R (short lists to binary)

I have a data set with this structure:
region1 region2 region3
1 10 5 5
2 8 10 8
3 13 15 12
4 3 17 11
5 17 9
6 12 15
7 4
8 18
9 1
I need:
item region1 region2 region3
1 1 1 0 0
2 3 1 0 0
3 4 1 0 0
4 5 0 1 1
5 8 1 0 1
6 9 0 0 1
7 10 1 1 0
8 11 0 0 1
9 12 1 0 1
10 13 1 0 0
11 15 0 1 1
12 17 1 1 0
13 18 1 0 0
The plan was to get a distinct list of items, left join each of the regions as its own column and replace matches with 1s, missing with 0; but I must be missing a key point of the R merge, dropping the main column of interest. Any advice is greatly appreciated! I'd prefer an R solution, but my next step would be to look into sqldf package.
#read in data
regions <- read.csv("c:/data/regions.csv")
#get unique list of items from all regions
items <- na.omit(unique(stack(regions)[1]))
#merge distinct items with each region, replace matches with 1, missings with 0
merge.test <- merge(items,regions,by.x="values", by.y=c("region1"), all=TRUE)

Helps to provide a reproducible example (i.e. give us an easy copy-paste command to construct your sample data).
You didn't say, so I guess your data is in a list perhaps?
dat <- list(region1=c(10, 8, 3, 17, 12, 4, 18, 1),
region2=c(5,10,15,17),
region3=c(5,8,12,11,9,15))
First find all the items (perhaps no need to sort, I just did it because yours is sorted)
ids <- sort(unique(unlist(dat)))
Then for each region, just see if the list of unique IDs is in that region, coercing the logical TRUE/FALSE to 0 and 1 (you could leave as T/F if that would do for you)
data.frame(ids,
region1=as.integer(ids %in% dat$region1),
region2=as.integer(ids %in% dat$region2),
region3=as.integer(ids %in% dat$region3))
If you have just 3 regions that's OK, if you have more you might want to automate the typing:
cols <- lapply(dat, function (region) as.integer(ids %in% region))
cols <- c(list(id = ids), cols) # put the id column first
df <- do.call(data.frame, cols)
where do.call calls the data.frame function with the list cols as its (named) arguments, i.e. it just does
data.frame(id=..., region1=..., region2=..., region3=...)
If your original dat was a CSV and each column has NA values you might want to insert na.omit as appropriate.
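For the CSV case mentioned above, here is a minimal sketch. It assumes the file was read into a data frame whose shorter columns are padded with NA; the name regions and its values are made up for illustration.

```r
# Hypothetical data frame, as read.csv might return it:
# shorter columns are padded with NA
regions <- data.frame(region1 = c(10, 8, 3),
                      region2 = c(5, 10, NA),
                      region3 = c(5, NA, NA))

# Drop the NA padding column by column, giving a list of plain vectors
dat <- lapply(regions, function(col) col[!is.na(col)])

# Same approach as above: unique ids, then one 0/1 column per region
ids <- sort(unique(unlist(dat)))
cols <- lapply(dat, function(region) as.integer(ids %in% region))
result <- do.call(data.frame, c(list(item = ids), cols))
result
```

Removing the NAs before calling unique keeps NA from showing up as an item of its own.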

The existing answers are fine, but they seem too complicated. Just try stack + table instead:
table(stack(dat))
# ind
# values region1 region2 region3
# 1 1 0 0
# 3 1 0 0
# 4 1 0 0
# 5 0 1 1
# 8 1 0 1
# 9 0 0 1
# 10 1 1 0
# 11 0 0 1
# 12 1 0 1
# 15 0 1 1
# 17 1 1 0
# 18 1 0 0
I'm also going to go out on a limb and say that considering your current approach, you actually have a data.frame not a list:
DAT <- dat
Len <- max(sapply(DAT, length))
DAT <- data.frame(lapply(DAT, function(x) { length(x) <- Len; x }))
In that case, the solution is no different:
table(stack(DAT))
# ind
# values region1 region2 region3
# 1 1 0 0
# 3 1 0 0
# 4 1 0 0
# 5 0 1 1
# 8 1 0 1
# 9 0 0 1
# 10 1 1 0
# 11 0 0 1
# 12 1 0 1
# 15 0 1 1
# 17 1 1 0
# 18 1 0 0
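If a plain data frame with an explicit item column is needed rather than a table object, the result can be converted. The following is a sketch under the assumption that the data is the list dat used above; the names counts, binary and out are only illustrative.

```r
dat <- list(region1 = c(10, 8, 3, 17, 12, 4, 18, 1),
            region2 = c(5, 10, 15, 17),
            region3 = c(5, 8, 12, 11, 9, 15))

# table() counts occurrences; unclass() leaves a plain integer matrix
counts <- unclass(table(stack(dat)))

# Clamp to 0/1 in case an item appears more than once within a region
binary <- (counts > 0) * 1L

# Turn the matrix into a data frame and move the row names into a column
out <- as.data.frame(binary)
out <- cbind(item = as.numeric(rownames(out)), out)
rownames(out) <- NULL
out
```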

Using @mathematical.coffee's example and qdap:
dat <- list(region1=c(10, 8, 3, 17, 12, 4, 18, 1),
region2=c(5,10,15,17),
region3=c(5,8,12,11,9,15))
library(qdap)
matrix2df(t(mtabulate(dat)), "item")
You may need to expand with:
FUN <- function(x) as.numeric(x > 0)
matrix2df(apply(t(mtabulate(dat)), 2, FUN), "item")
if an item appears more than once in a vector.

Related

Create an index variable for blocks of values

I have a dataframe "data" with a grouping variable "grp" and a binary classification variable "classif". For each group in grp, I want to create a "result" variable creating an index of separate blocks of 0 in the classif variable. For the time being, I don't know how to reset the count for each level of the grouping variable and I don't find a way to only create the index for blocks of 0s (ignoring the 1s).
Example data:
grp <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
classif <- c(0,1,0,0,1,0,0,1,1,0,0,0,0,1,0,1,1,1,0,0,1,1,0,0,0,1,0,1,0)
result <- c(1,0,2,2,0,3,3,0,0,1,1,1,1,0,2,0,0,0,3,3,0,0,1,1,1,0,2,0,3)
wrong_result <- c(1,2,3,3,4,5,5,1,1,2,2,2,2,3,4,5,5,5,6,6,1,1,2,2,2,3,4,5,6)
Data <- data.frame(grp,classif,result, wrong_result)
I have tried using rleid but the following command produces "wrong_result", which is not what I'm after.
data[, wrong_result:= rleid(classif)]
data[, wrong_result:= rleid(classif), by=grp]
With dplyr, use cumsum() and lag() to find blocks of zeroes by group. (The .by argument requires dplyr 1.1.0 or later; with older versions, use group_by(grp) instead.)
library(dplyr)
Data %>%
mutate(
result2 = ifelse(
classif == 0,
cumsum(classif == 0 & lag(classif, default = 1) == 1),
0
),
.by = grp
)
grp classif result result2
1 1 0 1 1
2 1 1 0 0
3 1 0 2 2
4 1 0 2 2
5 1 1 0 0
6 1 0 3 3
7 1 0 3 3
8 2 1 0 0
9 2 1 0 0
10 2 0 1 1
11 2 0 1 1
12 2 0 1 1
13 2 0 1 1
14 2 1 0 0
15 2 0 2 2
16 2 1 0 0
17 2 1 0 0
18 2 1 0 0
19 2 0 3 3
20 2 0 3 3
21 3 1 0 0
22 3 1 0 0
23 3 0 1 1
24 3 0 1 1
25 3 0 1 1
26 3 1 0 0
27 3 0 2 2
28 3 1 0 0
29 3 0 3 3
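The cumsum logic can also be traced in base R. This is only a sketch: block_index is a hypothetical helper name, and the short grp and classif vectors are made up for illustration.

```r
# A block of zeroes starts wherever a 0 follows a 1 (or opens the vector)
block_index <- function(x) {
  prev <- c(1, head(x, -1))          # previous value, treating the start as 1
  starts <- x == 0 & prev == 1       # TRUE at the first 0 of each block
  ifelse(x == 0, cumsum(starts), 0)  # running block count, zeroed out on the 1s
}

grp     <- c(1, 1, 1, 2, 2, 2)
classif <- c(0, 1, 0, 1, 0, 0)

# ave() applies the helper within each group, so the count restarts per group
result2 <- ave(classif, grp, FUN = block_index)
result2
```

Padding the start with 1 plays the same role as default = 1 in lag(): a group that opens with a 0 gets block number 1.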
Use rle and sequentially number the runs produced and then convert back and zero out the runs of 1's. No packages are used.
seq0 <- function(x) {
r <- rle(x)
is0 <- r$values == 0
r$values[is0] <- seq_len(sum(is0))
inverse.rle(r) * !x
}
transform(Data, result2 = ave(classif, grp, FUN = seq0))
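To see what each step of seq0 does, here is a trace on a single short vector (the first group of the example data); the intermediate names are only for illustration.

```r
x <- c(0, 1, 0, 0, 1, 0, 0)

r <- rle(x)
r$lengths                            # run lengths: 1 1 2 1 2
r$values                             # run values:  0 1 0 1 0

is0 <- r$values == 0
r$values[is0] <- seq_len(sum(is0))   # number the zero runs 1, 2, 3
expanded <- inverse.rle(r)           # back to full length

# !x is TRUE where x is 0, so multiplying zeroes out the positions of the 1s
result <- expanded * !x
result
```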

Create a new variable from conditions on multiple variables in R using ifelse condition

I am really new to R.
I have a table named RWA2010LONG containing 65 variables. I want to create a new variable named NEWVAR from variables 30 to 49 of that table and from another variable of the same table named BIRTH, based on the following condition: for each row, if any of the values of variables 30:49 equals the value of BIRTH, NEWVAR takes the value 1. Otherwise, NEWVAR takes the value 0.
This is what I tried to do:
RWA2010LONG$ NEWVAR <- for (i in colnames(RWA2010LONG[30:49])){ifelse(i==RWA2010LONG$BIRTH, 1,0)}
Here is an approach. Since you didn't provide data, I am making up some sample data and checking if any values in columns 5 through 10 are the same as BIRTH:
#Example data
df <- data.frame(matrix(rnbinom(100, mu = 5, size = 0.5), ncol = 10,
dimnames = list(c(sprintf("obs_%s", 1:10)),
c("BIRTH",sprintf("col_%s", 2:10)))))
df$newvar <- apply(df[,5:10] == df$BIRTH, 1, any)*1
The apply statement evaluates the condition df[,5:10] == df$BIRTH row by row (the 1 means apply over rows; a 2 would apply over columns) and returns TRUE if any value in the row equals BIRTH, FALSE otherwise. The *1 converts those logicals to numeric values.
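Because the sample data above is random, a small deterministic sketch may make the comparison easier to follow. The data frame toy and its columns a and b are made up for illustration; rowSums is shown as an alternative to apply, not as part of the answer.

```r
toy <- data.frame(BIRTH = c(1, 2, 3),
                  a     = c(1, 0, 0),
                  b     = c(5, 5, 3))

# Comparing a data frame with a vector of length nrow() compares each
# column element-wise with BIRTH and yields a logical matrix
hits <- toy[, c("a", "b")] == toy$BIRTH
hits

# Row-wise "any", converted to 0/1
newvar_apply <- apply(hits, 1, any) * 1

# Equivalent: a row has a match when its count of TRUEs is above zero
newvar_rowsums <- as.numeric(rowSums(hits) > 0)
```

If the columns can contain NA, any(..., na.rm = TRUE) or rowSums(..., na.rm = TRUE) avoids NA results.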
Output:
# BIRTH col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10 newvar
# obs_1 0 3 4 0 6 18 0 10 5 7 1
# obs_2 5 1 0 7 5 0 2 2 2 3 1
# obs_3 1 2 4 2 1 13 14 1 2 8 1
# obs_4 1 0 0 0 11 0 0 0 15 0 0
# obs_5 1 9 1 0 4 27 2 1 0 0 1
# obs_6 5 1 0 2 0 7 2 4 0 0 0
# obs_7 1 0 0 6 0 0 1 5 0 0 1
# obs_8 44 0 15 1 1 0 1 1 0 6 0
# obs_9 6 6 0 0 0 4 17 0 6 1 1
# obs_10 0 2 0 0 2 11 2 1 9 2 0
An option with if_any
library(dplyr)
df <- df %>%
mutate(newvar = +(if_any(5:10, ~ .x == BIRTH)))

ifelse replace value if it is lower than previous

I am working with a dataset which has some errors in the data. Numbers are sometimes registered wrong. Here is some toy data example:
The issue is that the Reversal column should only be counting up (per unique ID). So in a vector of 0,0,0,1,1,1,0,1,2,2,0,0,2,3, the 0's following the 1 and 2 should not be 0's. Instead, they should be equal to whatever value came before. I tried to remedy this by using the lag function from the dplyr package:
Data$Reversal <- ifelse(Data$Reversal < lag(Data$Reversal), lag(Data$Reversal), Data$Reversal)
But this results in numerous issues:
The first value becomes NA. I've tried using the default=Data$Reversal call in the lag function but to no avail.
The Reversal value should reset to 0 for each Unique ID. Now it continues across ID's. I tried a messy code using group_by(ID) but could not get this to work, as it broke my earlier ifelse function.
This only works when there is 1 error. But if there are two errors in a row it only fixes 1 value.
Alternatively, I found this thread in which the answer provided by Andrie also seems promising. This fixes problem 1 and 3, but I can't get this code to work per ID (using the group_by function).
Andrie's answer:
local({
r <- rle(data)
x <- r$values
x0 <- which(x==0) # index positions of zeroes
xt <- x[x0-1]==x[x0+1] # zeroes surrounded by same value
r$values[x0[xt]] <- x[x0[xt]-1] # substitute with surrounding value
inverse.rle(r)
})
Any help would be much appreciated.
I think cummax does exactly what you need.
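As a quick illustration on the vector from the question (a sketch; the id and rev vectors in the second part are made up to show the per-ID reset):

```r
# cummax() carries the running maximum forward, so a value can never drop
reversal <- c(0, 0, 0, 1, 1, 1, 0, 1, 2, 2, 0, 0, 2, 3)
fixed <- cummax(reversal)
fixed

# Applied per ID with ave(), the running maximum restarts for each ID
id  <- c(1, 1, 1, 1, 2, 2, 2)
rev <- c(0, 1, 0, 2, 0, 0, 1)
fixed_by_id <- ave(rev, id, FUN = cummax)
fixed_by_id
```

This handles any number of consecutive errors and never produces an NA for the first value, which were two of the problems with the lag() approach.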
Base R
dat$Reversal <- ave(dat$Reversal, dat$ID, FUN = cummax)
dat
# ID Owner Reversal Success
# 1 1 A 0 0
# 2 1 A 0 0
# 3 1 A 0 0
# 4 1 B 1 1
# 5 1 B 1 0
# 6 1 B 1 0
# 7 1 error 1 0
# 8 1 error 1 0
# 9 1 B 1 0
# 10 1 B 1 0
# 11 1 C 1 1
# 12 1 C 2 0
# 13 1 error 2 0
# 14 1 C 2 0
# 15 1 C 3 1
# 16 2 J 0 0
# 17 2 J 0 0
dplyr
library(dplyr)
dat %>%
group_by(ID) %>%
mutate(Reversal = cummax(Reversal)) %>%
ungroup()
data.table
library(data.table)
as.data.table(dat)[, Reversal := cummax(Reversal), by = .(ID)][]
Data, courtesy of https://extracttable.com/
dat <- read.table(header = TRUE, text = "
ID Owner Reversal Success
1 A 0 0
1 A 0 0
1 A 0 0
1 B 1 1
1 B 1 0
1 B 1 0
1 error 0 0
1 error 0 0
1 B 1 0
1 B 1 0
1 C 1 1
1 C 2 0
1 error 0 0
1 C 2 0
1 C 3 1
2 J 0 0
2 J 0 0")

R Grouping/Aggregation where the condition involves other rows in the table, not just the current row

Using R, what is the best way I can aggregate rows on a condition that spans multiple rows.
For example to aggregate any rows where z = 0 for n or more times.
What this would look like run on the following sample table with n = 3.
Sample Table x:
x y z
0 0 6
5 5 0
40 2 0
4 0 0
10 0 1
0 0 2
11 7 0
0 4 0
0 0 0
0 0 0
0 0 2
18 0 4
Results Table:
x y z
0 0 6
49 7 0 <- Above two rows got aggregated
10 0 1
0 0 2
11 11 0 <- Above three rows got aggregated
0 0 2
18 0 4
Since it seems like you're still in the "leaRning phase", I thought an example using the plyr package would be helpful. plyr is an extremely handy library which allows you to slice/dice datasets and summarize their subgroups in a flexible (and terse -- as you'll see below) manner, so it would likely be worth your time to get to know. If you find yourself needing to do similar operations on extremely large data sets, you might also consider looking into the data.table package.
I'm assuming you've done Roman's textConnection trick to get your data into a data.frame named mmf.
I'm adding an idx column to mmf so you can subset it and process the results group by group:
library(plyr)
# mmf <- read.table(textConnection( ...
RLE <- rle(mmf$z) # runs of equal consecutive z values
mmf$idx <- rep(seq_along(RLE$lengths), RLE$lengths) # one index per run
ans <- ddply(mmf, .(idx), colwise(sum))
And ans looks like:
x y z idx
0 0 6 1
49 7 0 6
10 0 1 3
0 0 2 4
11 11 0 20
0 0 2 6
18 0 4 7
Just remove the idx column and you're done, eg:
ans <- ans[, -4]
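Note that the run-length index above sums every run of equal z values, which matches the expected result here because all non-zero runs have length 1. A base R sketch that collapses only runs of zeros of length n or more could look like this; collapse_zero_runs is a hypothetical helper name, and the data frame is retyped from the question.

```r
collapse_zero_runs <- function(dat, col = "z", n = 3) {
  r <- rle(dat[[col]])
  # Runs that qualify: value 0 and at least n rows long
  collapse <- r$values == 0 & r$lengths >= n
  run_id <- rep(seq_along(r$lengths), r$lengths)
  in_collapsed_run <- rep(collapse, r$lengths)
  # Start a new group at every row, except inside a qualifying run,
  # where only the first row of the run starts a group
  new_group <- !in_collapsed_run | !duplicated(run_id)
  out <- rowsum(dat, cumsum(new_group))  # column sums per group
  rownames(out) <- NULL
  out
}

mmf <- data.frame(x = c(0, 5, 40, 4, 10, 0, 11, 0, 0, 0, 0, 18),
                  y = c(0, 5, 2, 0, 0, 0, 7, 4, 0, 0, 0, 0),
                  z = c(6, 0, 0, 0, 1, 2, 0, 0, 0, 0, 2, 4))

res <- collapse_zero_runs(mmf, n = 3)
res
```

With n = 4, the first run of three zeros is left untouched and only the run of four zeros is collapsed.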
This is the code I used to produce your result. If you have any questions, fire away.
mmf <- read.table(textConnection("x y z # read in your example data
0 0 6
5 5 0
40 2 0
4 0 0
10 0 1
0 0 2
11 7 0
0 4 0
0 0 0
0 0 0
0 0 2
18 0 4"), header = TRUE)
# see where there are zeros in the z column
mmf.rle <- rle(mmf$z)
mmf.rle <- data.frame(lengths = mmf.rle$lengths, values = mmf.rle$values)
merge.rows <- 3
# select rows that have more or equal to three zeros
mmf.zero <- which(mmf.rle$values == 0 & mmf.rle$lengths >= merge.rows)
for (i in mmf.zero) {
# find which positions are zero, calculate sums and insert the result into a data.frame where the rows in question were turned to NA
m.mmf <- mmf.rle$lengths[1:i] # select elements from 1 to where the zero appears
select.rows <- (sum(m.mmf[1:length(m.mmf) - 1])+1):sum(m.mmf) # magic
mmf.sum <- colSums(mmf[select.rows, ]) # sum values column-wise for rows that have at least three zeros in z
mmf[select.rows,] <- NA # now that we have a sum by columns, we turn those numbers into NAs...
mmf[select.rows[1], ] <- mmf.sum # ... and insert summed result into the first NA row
}
# remove any left over NA rows
mmf <- mmf[complete.cases(mmf),]
DATA
mmf <- read.table(textConnection("x y z # read in your example data
0 0 6
5 5 0
40 2 0
4 0 0
10 0 1
0 0 2
11 7 0
0 4 0
0 0 0
0 0 0
0 0 2
18 0 4"), header = TRUE)
CODE
agg_n <- function(dat=mmf,coln="z",n=3){
agg <- function(.x) {
# Sum values if first n=3 records in column coln="z" are 0
if(all(.x[[coln]][seq(n)] == 0)) {
y <- rbind(colSums(.x[seq(n),]),.x[-1*seq(n),])
} else y <- .x
return(y)
}
# Groups of records starting with 0 in column coln="z"
G <- cumsum(diff(c(0L,dat[[coln]] == 0))==1)
new_dat <- do.call(rbind,lapply(split(dat,G),agg))
return(new_dat)
}
OUTPUT
> agg_n()
x y z
0 0 0 6
1.1 49 7 0
1.5 10 0 1
1.6 0 0 2
2.1 11 11 0
2.10 0 0 0
2.11 0 0 2
2.12 18 0 4

Creating dummy variables (n-1) categories

I found similar entries but not exactly what I want. For a two-category variable (e.g., gender(1,2)), I need to create a dummy variable, 0s being male and 1s being female.
Here is what my data look like and what I did.
data <- as.data.frame(as.matrix(c(1,2,2,1,2,1,1,2),8,1))
V1
1 1
2 2
3 2
4 1
5 2
6 1
7 1
8 2
library(dummies)
data <- cbind(data, dummy(data$V1, sep = "_"))
> data
V1 data_1 data_2
1 1 1 0
2 2 0 1
3 2 0 1
4 1 1 0
5 2 0 1
6 1 1 0
7 1 1 0
8 2 0 1
In this code, the second category also gets its own (0,1) column. Also, is there a way to choose the baseline (the category that is assigned 0)?
I want it to look like this:
> data
V1 V1_dummy
1 1 0
2 2 1
3 2 1
4 1 0
5 2 1
6 1 0
7 1 0
8 2 1
Also, I want to extend this to three category variables, having two categories after recoding (n-1).
Thanks in advance!
You can use model.matrix in the following way. Some sample data with a three level factor:
set.seed(1)
(df <- data.frame(x = factor(rbinom(5, 2, 0.4))))
# x
# 1 0
# 2 1
# 3 1
# 4 2
# 5 0
Then
model.matrix(~ x, df)[, -1]
# x1 x2
# 1 0 0
# 2 1 0
# 3 1 0
# 4 0 1
# 5 0 0
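If you want to choose which group is dropped, reorder the factor levels: the first level is the baseline that disappears. Assigning to levels(df$x) would relabel the data rather than reorder it, so use relevel (or factor with an explicit levels argument). E.g., to make "1" the baseline:
df$x <- relevel(df$x, ref = "1")
model.matrix(~ x, df)[, -1]
# x0 x2
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 1
# 5 1 0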
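For the two-category case in the question, a direct comparison may be all that is needed. The following is a sketch; V1_dummy follows the name used in the question, while the factor g and its levels are made up to show the three-category case.

```r
# Two categories: 1 becomes 0 (baseline), 2 becomes 1
data <- data.frame(V1 = c(1, 2, 2, 1, 2, 1, 1, 2))
data$V1_dummy <- as.integer(data$V1 == 2)
data

# Three categories: pick the baseline with relevel(), then drop the intercept
g <- factor(c("a", "b", "c", "a"))
g <- relevel(g, ref = "b")       # "b" becomes the baseline
mm <- model.matrix(~ g)[, -1]    # n - 1 = 2 dummy columns: ga and gc
mm
```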

Resources