I am exporting data from R with the command:
write.table(output, file = "data.raw", na = "-9999", sep = "\t", row.names = FALSE, col.names = FALSE)
It exports my data correctly, but it exports all of the logical variables as TRUE and FALSE.
I need to read the data into another program that can only process numeric values. Is there an efficient way to convert logical columns to numeric 1s and 0s during the export? I have a large number of variables, so I was hoping to loop through all the variables in the data.table automatically rather than list the logical columns by hand.
Alternatively, my output object is a data.table. Is there an efficient way to convert all the logical variables in a data.table into numeric variables?
In case it is helpful, here is some code to generate a data.table with logical variables in it (not a large number of them, but enough to test example code on):
DT = data.table(cbind(1:100, rnorm(100) > 0))
DT[ , V3:= V2 == 1 ]
DT[ , V4:= V2 != 1 ]
For a data.frame, you could convert all logical columns to numeric with:
# The data
set.seed(144)
dat <- data.frame(V1=1:100,V2=rnorm(100)>0)
dat$V3 <- dat$V2 == 1
head(dat)
# V1 V2 V3
# 1 1 FALSE FALSE
# 2 2 TRUE TRUE
# 3 3 FALSE FALSE
# 4 4 FALSE FALSE
# 5 5 FALSE FALSE
# 6 6 TRUE TRUE
# Convert all to numeric
cols <- sapply(dat, is.logical)
dat[,cols] <- lapply(dat[,cols], as.numeric)
head(dat)
# V1 V2 V3
# 1 1 0 0
# 2 2 1 1
# 3 3 0 0
# 4 4 0 0
# 5 5 0 0
# 6 6 1 1
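To tie this back to the export in the question, you can run the conversion just before writing (a sketch reusing the question's write.table call and na value):
# Convert logical columns in place, then export tab-separated
cols <- sapply(dat, is.logical)
dat[,cols] <- lapply(dat[,cols], as.numeric)
write.table(dat, file = "data.raw", na = "-9999", sep = "\t",
            row.names = FALSE, col.names = FALSE)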
In data.table syntax:
# Data
set.seed(144)
DT = data.table(cbind(1:100,rnorm(100)>0))
DT[,V3 := V2 == 1]
DT[,V4 := FALSE]
head(DT)
# V1 V2 V3 V4
# 1: 1 0 FALSE FALSE
# 2: 2 1 TRUE FALSE
# 3: 3 0 FALSE FALSE
# 4: 4 0 FALSE FALSE
# 5: 5 0 FALSE FALSE
# 6: 6 1 TRUE FALSE
# Converting
(to.replace <- names(which(sapply(DT, is.logical))))
# [1] "V3" "V4"
for (var in to.replace) DT[, (var):= as.numeric(get(var))]
head(DT)
# V1 V2 V3 V4
# 1: 1 0 0 0
# 2: 2 1 1 0
# 3: 3 0 0 0
# 4: 4 0 0 0
# 5: 5 0 0 0
# 6: 6 1 1 0
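An equivalent data.table idiom avoids the explicit loop by assigning over .SDcols (a sketch of the same conversion):
# Convert all logical columns in a single assignment
cols <- names(which(sapply(DT, is.logical)))
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]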
Simplest way of doing this: multiply your matrix by 1.
For example:
A <- matrix(c(TRUE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,TRUE),ncol=4)
A
# [,1] [,2] [,3] [,4]
# [1,] TRUE TRUE TRUE FALSE
# [2,] FALSE TRUE FALSE TRUE
B <- 1*A
B
# [,1] [,2] [,3] [,4]
# [1,] 1 1 1 0
# [2,] 0 1 0 1
(You could also add zero: B <- 0 + A)
What about just:
dat <- data.frame(le = letters[1:10], lo = rep(c(TRUE, FALSE), 5))
dat
le lo
1 a TRUE
2 b FALSE
3 c TRUE
4 d FALSE
5 e TRUE
6 f FALSE
7 g TRUE
8 h FALSE
9 i TRUE
10 j FALSE
dat$lo <- as.numeric(dat$lo)
dat
le lo
1 a 1
2 b 0
3 c 1
4 d 0
5 e 1
6 f 0
7 g 1
8 h 0
9 i 1
10 j 0
Another approach could be with dplyr, in order to retain the original column in case (one never knows) your data will be imported back into R.
library(dplyr)
dat <- dat %>% mutate(lon = as.numeric(lo))
dat
Source: local data frame [10 x 3]
le lo lon
1 a TRUE 1
2 b FALSE 0
3 c TRUE 1
4 d FALSE 0
5 e TRUE 1
6 f FALSE 0
7 g TRUE 1
8 h FALSE 0
9 i TRUE 1
10 j FALSE 0
Edit: Loop
I do not know how well this code performs, but it checks all the columns and converts to numeric only those that are logical. Of course, if your TRUE and FALSE values are not logicals but character strings (a remote possibility), this code won't work.
for(i in 1:ncol(dat)){
if (is.logical(dat[, i])) dat[, i] <- as.numeric(dat[, i])
}
If there are multiple columns, you could use set (using #josilber's example)
library(data.table)
Cols <- which(sapply(dat, is.logical))
setDT(dat)
for(j in Cols){
set(dat, i=NULL, j=j, value= as.numeric(dat[[j]]))
}
As Ted Harding pointed out in the R-help mailing list, one easy way to convert logical objects to numeric is to perform an arithmetic operation on them. Convenient ones would be * 1 and + 0, which will keep the TRUE/FALSE == 1/0 paradigm.
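A quick illustration at the console:
c(TRUE, FALSE, TRUE) * 1
# [1] 1 0 1
c(TRUE, FALSE, TRUE) + 0
# [1] 1 0 1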
For your mock data (I've changed the code a bit to use regular R packages and to reduce size):
df <- data.frame(cbind(1:10, rnorm(10) > 0))
df$X3 <- df$X2 == 1
df$X4 <- df$X2 != 1
The dataset you get has a mixture of numeric and boolean variables:
X1 X2 X3 X4
1 1 0 FALSE TRUE
2 2 0 FALSE TRUE
3 3 1 TRUE FALSE
4 4 1 TRUE FALSE
5 5 1 TRUE FALSE
6 6 0 FALSE TRUE
7 7 0 FALSE TRUE
8 8 1 TRUE FALSE
9 9 0 FALSE TRUE
10 10 1 TRUE FALSE
Now compute:
df2 <- 1 * df
(If your dataset contains character or factor variables, you will need to apply this operation to a subset of df filtering out those variables)
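A minimal sketch of that subsetting (assuming the non-logical columns should be left untouched):
logi <- sapply(df, is.logical)
df[logi] <- 1 * df[logi]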
df2 is equal to
X1 X2 X3 X4
1 1 0 0 1
2 2 0 0 1
3 3 1 1 0
4 4 1 1 0
5 5 1 1 0
6 6 0 0 1
7 7 0 0 1
8 8 1 1 0
9 9 0 0 1
10 10 1 1 0
Which is 100% numeric, as str(df2) will show you.
Now you can safely export df2 to your other program.
One-line solution
The following code takes all the logical columns and makes them numeric.
library(dplyr)
library(magrittr)
dat %<>% mutate_if(is.logical, as.numeric)
The same as #saebod but with usual pipe.
library(dplyr)
dat <- dat %>% mutate_if(is.logical, as.numeric)
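As a side note, in dplyr 1.0 and later mutate_if is superseded by across(), so the equivalent call would be:
library(dplyr)
dat <- dat %>% mutate(across(where(is.logical), as.numeric))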
Related
I have a data set like the following:
id age mod
1 1 1
1 5 0
1 6 1
1 7 1
1 9 1
2 3 0
2 4 1
And I'd like to create the variable first and give it a value of TRUE only for the first occurrence of each episode of mod (each episode begins when mod==1). An episode can be defined as a series (or standalone day) of mod==1 in which age increments by 1 or 2. In other words, if age==2 and mod==1, age==3 and mod==0, and age==4 and mod==1, then ages 2-4 are still part of the same series because they are within 2 days of each other.
So ideally the final data set would look like this:
id age mod first
1 1 1 TRUE
1 5 0 FALSE
1 6 1 TRUE
1 7 1 FALSE
1 9 1 FALSE
2 3 0 FALSE
2 4 1 TRUE
I've tried using lag statements within data.table and have not been successful.
The simple condition is that mod[i]==1 & mod[i-1]==0. That is, if a row has a mod value of 1 and the previous row has a mod value of 0, then it is flagged as first.
This should work:
d = read.table(text='id age mod
1 1 1
1 5 0
1 6 1
1 7 1
1 9 1
2 3 0
2 4 1', header=T)
d$first[1] = (d$mod[1]==1)
d$first[2:nrow(d)] = (d$mod[2:nrow(d)]==1 & d$mod[1:(nrow(d)-1)]==0)
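For the sample data this yields the desired flags (it happens to match here even though the simple condition ignores the id boundaries and the age-gap rule):
d
#   id age mod first
# 1  1   1   1  TRUE
# 2  1   5   0 FALSE
# 3  1   6   1  TRUE
# 4  1   7   1 FALSE
# 5  1   9   1 FALSE
# 6  2   3   0 FALSE
# 7  2   4   1  TRUE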
I believe this should meet your criteria. This could probably be done in a long, convoluted one-liner, but I think breaking it into multiple steps for the sake of clarity will meet the same ends without a meaningful performance hit.
library(data.table)
## Note - I added a couple more sample rows here
DT <- fread("id age mod
1 1 1
1 5 0
1 6 1
1 7 1
1 9 1
2 3 0
2 4 1
3 1 1
3 5 0
3 9 1
3 10 1")
## Create a column to track jumps in age
DT[, agejump := age - shift(age, n = 1L, fill = NA, type = "lag") > 2L, by = .(id)]
## Create a column to define continued sequences
DT[, continued := mod == 1 & shift(mod, n = 1L, fill = NA, type = "lag") == 1L, by = .(id)]
## backfill the first row of NA's for each id for both variables with FALSE
DT[DT[, .I[1], by = .(id)]$V1, c("agejump","continued") := FALSE]
## define first
DT[,first := (mod == 1 & continued == FALSE | mod == 1 & continued == TRUE & agejump == TRUE)]
print(DT)
# id age mod agejump continued first
# 1: 1 1 1 FALSE FALSE TRUE
# 2: 1 5 0 TRUE FALSE FALSE
# 3: 1 6 1 FALSE FALSE TRUE
# 4: 1 7 1 FALSE TRUE FALSE
# 5: 1 9 1 FALSE TRUE FALSE
# 6: 2 3 0 FALSE FALSE FALSE
# 7: 2 4 1 FALSE FALSE TRUE
# 8: 3 1 1 FALSE FALSE TRUE
# 9: 3 5 0 TRUE FALSE FALSE
# 10: 3 9 1 TRUE FALSE TRUE
# 11: 3 10 1 FALSE TRUE FALSE
I want to one hot encode my variables only for the top categories and NA and 'others'.
So in this simplified example, hot encoding b where freq > 1 and NA:
id <- c(1, 2, 3, 4, 5, 6)
b <- c(NA, "A", "C", "A", "B", "C")
c <- c(2, 3, 6, NA, 4, 7)
df <- data.frame(id, b, c)
id b c
1 1 <NA> 2
2 2 A 3
3 3 C 6
4 4 A NA
5 5 B 4
6 6 C 7
table <- as.data.frame(table(df$b))
Var1 Freq
1 A 2
2 B 1
3 C 2
table_top <- table[table$Freq > 1,]
Var1 Freq
1 A 2
3 C 2
Now, I would like to have something like this
id b_NA c b_A b_C b_Others
1 1 2 0 0 0
2 0 3 1 0 0
3 0 6 0 1 0
4 0 NA 1 0 0
5 0 4 0 0 1
6 0 7 0 1 0
I have tried with subsetting df
table_top <- as.vector(table_top$Var1)
table_only_top <- subset(df, b %in% table_top)
table_only_top
a b c
2 1 A 3
3 2 C 6
4 2 A NA
6 3 C 7
However, now I am stuck on how to get to the output. In my real data I have many more categories than here, so using the names from the output is not an option. Also, the Others category in my real output consists of many categories.
Any hint is highly appreciated :)
Fast and sexy with data.table and mltools:
> one_hot(dt, naCols = TRUE, sparsifyNAs = TRUE)
id cat_NA cat_A cat_C cat_Others freq
1: 1 1 0 0 0 2
2: 2 0 1 0 0 3
3: 3 0 0 1 0 6
4: 4 0 1 0 0 NA
5: 5 0 0 0 1 4
6: 6 0 0 1 0 7
Code
Load libraries
library(dplyr)
library(data.table)
library(mltools)
Transform data
# Recode all categories with a count of 1 or below to "Others"
df <- df %>%
# Group by variables that will be onehotted
group_by(cat) %>%
# Add a count per group item column
mutate(count = n()) %>%
# Ungroup for next steps
ungroup() %>%
# Change all that have a count of 1 or below to "Others".
# If cat was a factor, we would get numeric results at this step.
mutate(cat = ifelse(!is.na(cat) & count <= 1, "Others", cat),
# Only now we turn it into a factor for the one_hot function
cat = as.factor(cat)) %>%
# Drop the count column
select(id, cat, freq)
# Turn into data.table
dt <- as.data.table(df)
Check intermediate result
> dt
id cat freq
1: 1 <NA> 2
2: 2 A 3
3: 3 C 6
4: 4 A NA
5: 5 Others 4
6: 6 C 7
Data
id <- c(1, 2, 3, 4, 5, 6)
cat <- c(NA, "A", "C", "A", "B", "C")
freq <- c(2, 3, 6, NA, 4, 7)
# It is important to have no other factor variables other
# than the variable(s) you one want to one hot. For that reason
# the automatic factoring is turned off.
df <- data.frame(id, cat, freq,
stringsAsFactors = FALSE)
> df
id cat freq
1 1 <NA> 2
2 2 A 3
3 3 C 6
4 4 A NA
5 5 B 4
6 6 C 7
Definitely not an elegant solution, but it should work:
library(tidyverse)
library(reshape2)
df %>%
gather(var, val, -id) %>%
add_count(var, val) %>%
mutate(res = ifelse(var == "b" & n > 1, 1, 0),
val = paste("b_", val, sep = "")) %>%
filter(var == "b" & n != 1) %>%
dcast(id ~ val, value.var = "res") %>%
full_join(df, by = c("id" = "id")) %>%
mutate(b_NA = ifelse(is.na(b), 1, 0)) %>%
mutate_at(vars(contains("b_")), funs(replace(., is.na(.), 0))) %>%
mutate(b_OTHERS = ifelse(rowSums(.[grep("b_", names(.))]) != 0, 0, 1))
id b_A b_C b c b_NA b_OTHERS
1 2 1 0 A 3 0 0
2 3 0 1 C 6 0 0
3 4 1 0 A NA 0 0
4 6 0 1 C 7 0 0
5 1 0 0 <NA> 2 1 0
6 5 0 0 B 4 0 1
You could cbind data.frames based on your different criteria.
# simple conditions -------------------------------------------------------
df <- df_orig[,-1]  # df_orig is the original data.frame from the question; drop the id column
df_na <- is.na(df)
colnames(df_na) <- paste0(colnames(df),"_NA")
df_A <- df=="A"
colnames(df_A) <- paste0(colnames(df),"_A")
df_C <- df=="C"
colnames(df_C) <- paste0(colnames(df),"_C")
# for counts you can use sapply with one loop -----------------------------
df_counts <- df
for(j in 1:ncol(df)) {
counts <- sapply(1:nrow(df), function(x) sum(df[x,j]==df[,j], na.rm=T) )
df_counts[,j] <- counts
}
# or avoid explicit loops altogether --------------------------------------
df_counts2 <- sapply(1:ncol(df), function(y) sapply(1:nrow(df), function(x) sum(df[x,y]==df[,y], na.rm=T) ) )
colnames(df_counts2 ) <- paste0(colnames(df),"_counts")
# cbind df's -------------------------------------------------------------
df_full <- cbind(df_orig, df_na, df_A, df_C, df_counts2)
# check if frequency greater then 1 or NA ---------------------------------
df_full$result <- df_full[,10:11] >=2 | df_full[,4:5]
df_full
The harder part, I suppose, is computing the frequencies; I included two ways above. The result is:
id b c b_NA c_NA b_A c_A b_C c_C b_counts c_counts result.b_NA result.c_NA
1 1 <NA> 2 FALSE FALSE FALSE FALSE FALSE FALSE 1 1 FALSE FALSE
2 2 A 3 FALSE FALSE TRUE FALSE FALSE FALSE 2 1 TRUE FALSE
3 3 C 6 FALSE FALSE FALSE FALSE TRUE FALSE 2 1 TRUE FALSE
4 4 A NA FALSE TRUE TRUE NA FALSE NA 2 0 TRUE TRUE
5 5 B 4 FALSE FALSE FALSE FALSE FALSE FALSE 1 1 FALSE FALSE
6 6 C 7 FALSE FALSE FALSE FALSE TRUE FALSE 2 1 TRUE FALSE
You can modify the columns based on your conditions. Hope that helps.
Question:
I want to create a dummy variable first in R which is 1 if the value of a another dummy changed from 0 to 1 under the condition that it is not the first observation for an id number. The problem behind this is that I want to recognise firms which entered a market during the observed time period in a panel setting.
As an example I tried to create this with a small sample set:
id <- c(1,1,1,2,2,3,3,3)
dummy <- c(0,1,1,0,1,1,0,1)
df <- data.frame(id,dummy)
df[,"id"]
first.dum <- function(x)
c( x[-1,"id"] == x[,"id"]
& x[-1,"dummy"] != x[,"dummy"]
& x[,"dummy"] == "1")
df$first <- first.dum(df)
df
The result comes like ...
id dummy first
1 1 0 FALSE
2 1 1 FALSE
3 1 1 FALSE
4 2 0 FALSE
5 2 1 FALSE
6 3 1 TRUE
7 3 0 FALSE
8 3 1 FALSE
I think I did not understand how that dataframe manipulation really works.
Any help would be appreciated.
Here's how I would approach this using data.table package
library(data.table)
setDT(df)[, first := c(0, diff(dummy)) == 1, id][]
# id dummy first
# 1: 1 0 FALSE
# 2: 1 1 TRUE
# 3: 1 1 FALSE
# 4: 2 0 FALSE
# 5: 2 1 TRUE
# 6: 3 1 FALSE
# 7: 3 0 FALSE
# 8: 3 1 TRUE
Basically, we are checking per group whether dummy increased by one relative to the previous observation (starting from the second observation).
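To see the trick in isolation, apply it to the first id's dummy values (0, 1, 1):
c(0, diff(c(0, 1, 1))) == 1
# [1] FALSE  TRUE FALSE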
You can do it similarly using dplyr
library(dplyr)
df %>% group_by(id) %>% mutate(first = c(0, diff(dummy)) == 1)
Or using base R
unlist(tapply(df$dummy, df$id, function(x) c(0, diff(x)) == 1))
Try something like
df$first <- df$id == c(NA, df$id[-nrow(df)]) &
df$dummy > c(1, df$dummy[-nrow(df)])
to give
> df
id dummy first
1 1 0 FALSE
2 1 1 TRUE
3 1 1 FALSE
4 2 0 FALSE
5 2 1 TRUE
6 3 1 FALSE
7 3 0 FALSE
8 3 1 TRUE
If you want something like your function, consider
first.dum <- function(x) {
y <- rbind(c(NA,1),x[-nrow(x),])
x[,"id"] == y[,"id"] & x[,"dummy"] > y[,"dummy"]
}
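Applying it to the sample data reproduces the table above (pass only the id and dummy columns, since the function shifts whole rows):
df$first <- first.dum(df[, c("id", "dummy")])
Note that both versions assume the rows are ordered by id, as in the sample.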
Let's make a dummy dataset
ll = data.frame(rbind(c(2,3,5), c(3,4,6), c(9,4,9)))
colnames(ll)<-c("b", "c", "a")
> ll
b c a
1 2 3 5
2 3 4 6
3 9 4 9
P = data.frame(cbind(c(3,5), c(4,6), c(8,7)))
colnames(P)<-c("a", "b", "c")
> P
a b c
1 3 4 8
2 5 6 7
I want to create a new dataframe where the values in each column of ll would be turned into 0 when it is less than corresponding values of a,b, & c in the first row of P; in other words, I'd like to see
> new_ll
b c a
1 0 0 5
2 0 0 6
3 9 0 9
so I tried it this way
nn=c("a", "b", "c")
new_ll = sapply(nn, function(i)
ll[,paste0(i)][ll[,paste0(i)] < P[,paste0(i)][1]] <- 0)
But it doesn't work for some reason! I must be making a silly mistake in my script! Any ideas?
> new_ll
a b c
0 0 0
You can find the values in ll that are smaller than the first row of P with an apply:
t(apply(ll, 1, function(x) x<P[1,][colnames(ll)]))
[,1] [,2] [,3]
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] FALSE TRUE FALSE
Here, the first row of P is ordered to match ll, then the elements are compared.
Credit to Ananda Mahto for recognizing that apply is not required:
ll < c(P[1, names(ll)])
b c a
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] FALSE TRUE FALSE
The TRUE values show where you want to substitute with 0:
ll[ ll < c(P[1, names(ll)]) ] <- 0
ll
b c a
1 0 0 5
2 0 0 6
3 9 0 9
To fix your code, you want something like this:
do.call(cbind, lapply(names(ll), function(i) {
ll[,i][ll[,i] < P[,i][1]] <- 0
return(ll[i])}))
b c a
1 0 0 5
2 0 0 6
3 9 0 9
What's changed? First, sapply is changed to lapply and the function returns a vector for each iteration. Second, the names are presented in the correct order for the expected results. Third, the results are put together with cbind to get the final matrix. As a bonus, the redundant calls to paste0 have been removed.
You could also try mapply, which applies the function to each corresponding element. Here, ll and P are both data.frames, so it applies the function for each column and does the recycling as well. As in @Matthew Lundberg's answer, I matched the column names of P with those of ll, checked which elements of ll in each column are < the corresponding column of P (the single row of P gets recycled), and got a logical index back. The elements that match the condition are then assigned 0.
indx <- mapply(`<`, ll, P[1,][names(ll)])
new_ll <- ll
new_ll[indx] <- 0
new_ll
# b c a
#1 0 0 5
#2 0 0 6
#3 9 0 9
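For reference, the intermediate indx is the logical mask marking which cells fall below the first row of P:
indx
#          b    c     a
# [1,]  TRUE TRUE FALSE
# [2,]  TRUE TRUE FALSE
# [3,] FALSE TRUE FALSE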
If you know that ll and P are numeric, you can also do it as
llm <- as.matrix(ll)
pv <- as.numeric(P[1, colnames(llm)])
llm[sweep(llm, 2, pv, `<`)] <- 0
data.frame(llm)
data.frame(llm)
# b c a
# 1 0 0 5
# 2 0 0 6
# 3 9 0 9
I have a question which is similar to this one - Fast minimum distance (interval) between elements of 2 logical vectors (take 2) but it has some important differences.
Say I have a vector:
x <- c("A", "B", "C", "A", "D", "D", "A", "B", "A")
What I would like to do is:
For every element, calculate the minimum distance between it and the next element of each different type in the forward direction only. If for any element, no element of a particular type occurs in the forward direction then a 0 should be returned. The returned data will look like this:
Desired Output Table-
N x A B C D
1 A 3 1 2 4
2 B 2 6 1 3
3 C 1 5 0 2
4 A 3 4 0 1
5 D 2 3 0 1
6 D 1 2 0 0
7 A 2 1 0 0
8 B 1 0 0 0
9 A 0 0 0 0
The first column/var simply refers to the element order. The second col/var is the element at that position. Then there are four cols/vars - each one being a unique element that occurs in the vector.
The numbers in each of these four cols/vars are the minimum distance from that row's element to the next occurring element of each type in the FORWARD direction only. If a '0' is entered, that means that element type does not occur after that row's element in the vector.
How to achieve this?
My first thought was to try and mimic some aspects of the question above. To that end, I used a grepl function to turn the vector into four separate logical vectors indicating the presence/absence of each element.
xA<-grepl("A", x) # TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
xB<-grepl("B", x) # FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
xC<-grepl("C", x) # FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
xD<-grepl("D", x) # FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
I then tried the "Flodel" function and the second function provided by GG using library(data.table).
For example, to compute the minimum distances from all "As" to a "D":
flodel <- function(x, y) {
xw <- which(x)
yw <- which(y)
i <- findInterval(xw, yw, all.inside = TRUE)
pmin(abs(xw - yw[i]), abs(xw - yw[i+1L]), na.rm = TRUE)
}
flodel(xA,xD)
> [1] 4 1 1 3
#GG's data.table option
wxA <- data.table(x = which(xA))
wxD <- data.table(y = which(xD), key = "y")
wxD[wxA, abs(x - y), roll = "nearest"]
# y V1
#1: 1 4
#2: 4 1
#3: 7 1
#4: 9 3
Both of these options find the minimum distance for all A's to a D. However, it is in ANY direction, not the FORWARD direction only. GG's data.table option is on the surface more attractive to me as it returns data showing the position of each element (the 'y' column of the output) which would make it easy to package into a nice summary table (such as my desired output table above).
I have tried to work out alternative ways of using the 'roll' argument in data.table, but I haven't managed to solve this issue.
Thanks for any suggestions.
Another way that seems valid:
levs = sort(unique(x))
do.call(rbind,
lapply(seq_along(x),
function(n)
match(levs, x[-seq_len(n)], 0)))
# [,1] [,2] [,3] [,4]
# [1,] 3 1 2 4
# [2,] 2 6 1 3
# [3,] 1 5 0 2
# [4,] 3 4 0 1
# [5,] 2 3 0 1
# [6,] 1 2 0 0
# [7,] 2 1 0 0
# [8,] 1 0 0 0
# [9,] 0 0 0 0
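To see why this works, take n = 1: x[-seq_len(1)] drops the first element, and match reports where each level first appears in the remainder, with 0 (the nomatch value) for levels that never appear again:
match(levs, x[-seq_len(1)], 0)
# [1] 3 1 2 4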
I'm not really sure how efficient this is, but it seems to work. How about
x <- c("A", "B", "C", "A", "D", "D", "A", "B", "A")
#find indexes for each value
locations<-split(seq_along(x), x)
#for each index, find the distance from the next highest
# index in the locations list
t(sapply(seq_along(x), function(i) sapply(locations, function(l)
if(length(z<-l[l>i])>0) z[1]-i else 0)))
This will return
A B C D
[1,] 3 1 2 4
[2,] 2 6 1 3
[3,] 1 5 0 2
[4,] 3 4 0 1
[5,] 2 3 0 1
[6,] 1 2 0 0
[7,] 2 1 0 0
[8,] 1 0 0 0
[9,] 0 0 0 0