How can I select columns and rows with a variable in R?

I have an object currency. I would like to select one column, and the rows where that column equals 1, using the variable Pair.
>currency
EURUSD EURUSDi USDJPY USDJPYi GBPUSD GBPUSDi AUDUSD AUDUSDi XAUUSD XAUUSDi zeroes
2000-07-16 0 0 0 0 0 1 0 0 0 0 0
2000-07-23 0 0 0 0 0 1 0 0 0 0 0
2000-07-30 0 0 0 0 0 1 0 0 0 0 0
2000-08-06 0 0 0 0 0 0 0 0 0 1 0
2000-08-13 0 1 0 0 0 0 0 0 0 0 0
From the console I can do it with subset like this:
> subset(currency$GBPUSDi, GBPUSDi == 1)
GBPUSDi
2000-07-16 1
2000-07-23 1
2000-07-30 1
2000-08-06 1
2000-08-13 1
2000-08-20 1
But as soon as it is passed in a script with the variable Pair, it fails. I've searched the documentation for hours and I'm getting a headache trying to figure out what is wrong.
Here are the different commands I've tried:
subset (currency$Pair, Pair == 1)
subset (currency, Pair = 1, select = Pair)
weights$Cur[currency$Pair = 1]
The one that works is currency[,c(Pair)], but it only selects the column. How can I add the row selection Pair = 1?
currency[,c(Pair)][Pair = 1] and subset (currency[,c(Pair)], Pair = 1), with = or ==, don't work.

currency$Pair[currency$Pair == 1] should work ($Pair selects the column Pair and [currency$Pair == 1] selects the values equal to 1). It looks like it doesn't work in your case because currency doesn't contain a variable named Pair.
If currency is not a data frame but a matrix, you can try
currency[currency[, c("Pair")] == 1, c("Pair")]
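If Pair is a variable that holds the column name as a string, rather than a column literally called "Pair", the same indexing can be written with the variable itself. A minimal sketch, assuming currency is a plain matrix with dates as row names; the two-column matrix below is a cut-down copy of the question's data, and drop = FALSE is added only to keep the result as a one-column matrix:

```r
# Cut-down copy of the question's data (assumption: a plain numeric matrix)
currency <- matrix(
  c(0, 0, 0, 0, 1,   # EURUSDi
    1, 1, 1, 0, 0),  # GBPUSDi
  ncol = 2,
  dimnames = list(
    c("2000-07-16", "2000-07-23", "2000-07-30", "2000-08-06", "2000-08-13"),
    c("EURUSDi", "GBPUSDi")
  )
)

Pair <- "GBPUSDi"  # the column name is held in a variable

# Rows where the chosen column equals 1, restricted to that column
selected <- currency[currency[, Pair] == 1, Pair, drop = FALSE]
print(selected)
```

The same expression also works when currency is a data frame, because [ , Pair] accepts a character vector of column names in both cases.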

Related

Argument is of length 0, no NA's

I have a dataframe that looks like this:
logentrytime ord_lat_dt0 ord_lat_dt1 ord_lat_dt2 ord_lat_dt3 ord_lat_dt4 ord_lat_dt5 ord_lat_dt6 ord_lat_dt7 ord_lat_dt8 ord_lat_dt9 ord_num0 ord_num1 ord_num2
1 2016-11-10 14:23:36 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
2 2016-11-10 14:22:22 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
3 2016-11-07 16:02:45 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
4 2016-11-07 21:10:00 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
5 2016-11-07 16:03:29 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
6 2016-11-10 14:23:05 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
Where ord_lat_dt columns are last purchase date of a customer. ord_lat_dt[0-9] were pulled from different database tables. Thus each row represents one customer, and their last order date will be indicated in one of the 9 columns.
I would like to merge these, but before I do, want to calculate "months_since_last_purchase" based on the date in each column.
Thus, I have converted the date columns into character strings, and am looping through using these functions:
elapsed_time <- function(end_date, start_date) {
ed <- as.POSIXlt(end_date)
sd <- as.POSIXlt(start_date)
12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}
convert_time <- function(data, column){
for(i in seq(1,length(data$column))){
if((data$column[i]!= "0") ==TRUE){
data$column[i] <- elapsed_months(Sys.time(), as.Date(data$column[i], format="%Y/%m/%d"))
}
}
return(data)
}
test1 <- convert_time(test2, ord_lat_dt0)
But I obtain the error
Error in if ((data$column[i] != "0") == TRUE) { :
argument is of length zero
I've also tried changing the if statement to check:
(grepl("[-]", data$column[i]) == FALSE)
But I obtain the same error.
Any ideas?
If you decide to down vote, please explain to me what is wrong with my question. I am trying to learn and would like to make sure I am asking correctly.
NOTE: I am having a different issue and have completely changed the question. Thus some of the comments below do not apply. Because of the down votes, I could not open a new question.

The problem here is that when you do data_theme[is.na(data_theme)] <- 0, the NAs in the date columns are replaced as well. But the date columns are in POSIXct format, and if you try as.POSIXct(0), it will throw an error.
One solution could be to do it in two steps: first replace the NAs in the numeric columns, then do whatever you want with the POSIXct values:
library(dplyr)
df %>%
mutate_if(is.numeric, ~ if_else(is.na(.), 0, .))
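The same two-step idea can be sketched in base R, without dplyr. The data frame here is made up for illustration (the column names when and n are not from the question); the point is only that is.numeric() is FALSE for POSIXct columns, so they are left alone:

```r
# Hypothetical data: one POSIXct column and one numeric column, both with an NA
df <- data.frame(
  when = as.POSIXct(c("2016-11-10 14:23:36", NA), tz = "UTC"),
  n = c(NA, 3)
)

# Step 1: find the numeric columns (POSIXct columns are not numeric)
num_cols <- vapply(df, is.numeric, logical(1))

# Step 2: replace NA with 0 in those columns only
df[num_cols] <- lapply(df[num_cols], function(x) {
  x[is.na(x)] <- 0
  x
})
print(df)
```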

You can only replace all NAs by the value 0 if all columns are numeric first. This could for instance be achieved by writing a little function to first convert a column to numeric if needed, and then replace the NAs. Using lapply you can loop over the columns, and make the resulting list of columns a data frame again afterwards.
f <- function(x) {
x <- as.numeric(x)
x[is.na(x)] <- 0
x
}
data_theme <- as.data.frame(lapply(data_theme, f))
Of course, this will also convert any meaningful datetimes to numbers.
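For the error in the question as it now stands, one likely cause is that data$column looks for a column literally named "column". That returns NULL, so the comparison inside if() has length zero. A minimal sketch of one way around it, assuming the column name is passed as a string and the dates are stored as "YYYY-MM-DD" text as in the printed data; the now argument and the two-row test2 below are added for illustration and are not from the question:

```r
# Whole months between two dates (same logic as the question's helper)
elapsed_months <- function(end_date, start_date) {
  ed <- as.POSIXlt(end_date)
  sd <- as.POSIXlt(start_date)
  12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}

# `column` is a character string, so data[[column]] finds the right column
convert_time <- function(data, column, now = Sys.time()) {
  values <- as.character(data[[column]])
  has_date <- values != "0"
  # Replace each date with the months elapsed; "0" entries stay as they are
  values[has_date] <- elapsed_months(
    now, as.Date(values[has_date], format = "%Y-%m-%d")
  )
  data[[column]] <- values
  data
}

# Made-up two-row example
test2 <- data.frame(ord_lat_dt6 = c("2016-02-12", "0"), stringsAsFactors = FALSE)
test1 <- convert_time(test2, "ord_lat_dt6",
                      now = as.POSIXct("2016-11-10", tz = "UTC"))
print(test1)
```

Working on the whole column at once also removes the need for the explicit loop.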

For loop storage of output data

I am trying to store the output data from the for loop in the n.I matrix at the end of the code, but I am certain that something is wrong with my output matrix. It gives me all the same values, either 0 or 1. I know that print(SS) outputs the correct values, so I can see that the for loop is working properly.
Does anyone have any advice on how to fix the matrix, or on any other way to store the data from the for loop? Thanks in advance!
c=0.2
As=1
d=1
d0=0.5
s=0.5
e=0.1
ERs=e/As
C2 = c*As*exp(-d*s/d0)
#Island States (Initial Probability)
SS=0
for(i in 1:5) {
if (SS > 0) {
if (runif(1, min = 0, max = 1) < ERs){
SS = 0
}
}
else {
if (runif(1, min = 0, max = 1) < C2) {
SS = 1
}
}
print(SS)
}
n.I=matrix(c(SS), nrow=i, ncol=1, byrow=TRUE)

The efficient solution here is not to use a loop. It's unnecessary, since the whole task can easily be vectorized.
# x is the threshold compared against the random draws (it is not defined in this snippet)
Z = runif(100, 0, 1)
as.integer(x <= Z)
#[1] 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
#[70] 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

You can save them in a list. It is not very efficient, but it gets the job done.
list_pos[[1]] retrieves the first element saved in the list.
list_pos <- list() # create the list outside the for loop
for(i in 1:100) {
c=0.10 #colonization rate
A=10 #Area of all islands(km^2)
d=250 #Distance from host to target (A-T)
s=0.1 #magnitude of distance
d0=100 #Specific "half distance" for dispersal(km)
C1 = c*A*exp(-d/d0) #Mainland to Target colonization
Z =runif(1,0,1)
x <- C1*A
if(x <= Z) {
list_pos[[i]] <- print("1") # Here you can store the 1 results. print is actually not necessary.
}
if(x >= Z){
list_pos[[i]] <- print("0") # Here you can store the 0 results. print is actually not necessary.
}
}
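Because SS in the question depends on its own value from the previous iteration, another option is to keep the loop and write each state into a preallocated vector. A minimal sketch using the question's parameters; c is renamed to c_rate so it does not mask base R's c(), and set.seed() and the names n_steps and states are additions for illustration:

```r
set.seed(1)  # only so that the sketch is reproducible

c_rate <- 0.2  # called c in the question
As <- 1
d <- 1
d0 <- 0.5
s <- 0.5
e <- 0.1
ERs <- e / As
C2 <- c_rate * As * exp(-d * s / d0)

n_steps <- 5
states <- integer(n_steps)  # one slot per iteration
SS <- 0L
for (i in seq_len(n_steps)) {
  if (SS > 0) {
    if (runif(1) < ERs) SS <- 0L  # occupied island goes extinct
  } else {
    if (runif(1) < C2) SS <- 1L   # empty island is colonized
  }
  states[i] <- SS                 # store this step's state
}

n.I <- matrix(states, nrow = n_steps, ncol = 1)
print(n.I)
```

The original matrix() call ran after the loop, so it could only see the final value of SS and recycled it into every row.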

How to refer to previous cell in a data-frame column (lagged cell), in R

I’m working in R and am trying to find a way to refer to the previous cell within a vector when that vector belongs to a data frame. By previous cell, I’m essentially hoping for a “lag” command of some sort so that I can compare one cell to the cell previous. As an example, I have these data:
A <- c(1,0,0,0,1,0,0)
B <- c(1,1,1,1,1,0,0)
AB_df <- cbind (A,B)
What I want is for a given cell in a given row, if that cell’s value is less than the previous cell’s value for the same column vector, to return a value of 1 and if not to return a value of 0. For this example, the new columns would be called “A-flag” and “B-flag” below.
A B A-flag B-flag
1 1 0 0
0 1 1 0
0 1 0 0
0 1 0 0
1 1 0 0
0 0 1 1
0 0 0 0
Any suggestions for syntax that can do this? Ideally, to just create a new column variable into an existing data-frame.

Here is one solution using the dplyr package and its lag function:
library(dplyr)
AB_df <- data.frame(A = A, B = B)
AB_df %>% mutate(A.flag = ifelse(A < lag(A, default = 0), 1, 0),
B.flag = ifelse(B < lag(B, default = 0), 1, 0))
A B A.flag B.flag
1 1 1 0 0
2 0 1 1 0
3 0 1 0 0
4 0 1 0 0
5 1 1 0 0
6 0 0 1 1
7 0 0 0 0
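If dplyr is not available, the same flags can be sketched in base R with diff(), which returns each value minus the previous one. The first row has no previous value, so it is padded with 0 here (an assumption that matches the default = 0 choice above); the helper name flag_drop is made up for illustration:

```r
A <- c(1, 0, 0, 0, 1, 0, 0)
B <- c(1, 1, 1, 1, 1, 0, 0)
AB_df <- data.frame(A = A, B = B)

# 1 where the value is lower than the one in the previous row, else 0
flag_drop <- function(x) as.integer(c(0, diff(x)) < 0)

AB_df$A.flag <- flag_drop(AB_df$A)
AB_df$B.flag <- flag_drop(AB_df$B)
print(AB_df)
```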

Find # of rows between events in R

I have a series of data in true/false format, e.g. something that could be generated by rbinom(n, 1, .1). I want a column that gives the number of rows since the last true. So the resulting data will look like
true/false gap
0 0
0 0
1 0
0 1
0 2
1 0
1 0
0 1
What is an efficient way to go from true/false to gap? (In practice this will be done on a large dataset with many different ids.)

DF <- read.table(text="true/false gap
0 0
0 0
1 0
0 1
0 2
1 0
1 0
0 1", header=TRUE)
DF$gap2 <- sequence(rle(DF$true.false)$lengths) * #create a sequence for each run length
(1 - DF$true.false) * #multiply with 0 for all 1s
(cumsum(DF$true.false) != 0L) #multiply with zero for the leading zeros
# true.false gap gap2
#1 0 0 0
#2 0 0 0
#3 1 0 0
#4 0 1 1
#5 0 2 2
#6 1 0 0
#7 1 0 0
#8 0 1 1
The cumsum part might not be the most efficient for large vectors. Something like
if (DF$true.false[1] == 0) DF$gap2[seq_len(rle(DF$true.false)$lengths[1])] <- 0
might be an alternative (and of course the rle result could be stored temporarily to avoid calculating it twice).
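The question mentions many different ids. A minimal sketch of wrapping the rle expression above in a function and applying it per id with ave(); the id and flag columns and their values are made up for illustration, and the function name gap_since_true is hypothetical:

```r
# Rows since the last 1; zeros before the first 1 get a gap of 0
gap_since_true <- function(x) {
  runs <- rle(x)  # computed once and reused
  sequence(runs$lengths) * (1 - x) * (cumsum(x) != 0)
}

# Made-up data with two ids
DF <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2),
  flag = c(0, 1, 0, 0, 0, 0, 1, 0)
)

# ave() splits flag by id, applies the function and puts the pieces back in order
DF$gap <- ave(DF$flag, DF$id, FUN = gap_since_true)
print(DF)
```

Each id is handled independently, so a run of zeros at the start of one id does not inherit a count from the previous id.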

OK, let me put this in an answer.
1) No-brainer method
data['gap'] = 0
for (i in 2:nrow(data)) {
if (data[i, 'true/false'] == 0) {
data[i, 'gap'] = data[i-1, 'gap'] + 1
}
}
2) No if check
data['gap'] = 0
for (i in 2:nrow(data)) {
data[i, 'gap'] = (data[i-1, 'gap'] + 1) * (1 - data[i, 'true/false'])
}
I really don't know which is faster: both perform the same number of reads from data, but (1) has an if statement, and I don't know how fast that is compared to a single multiplication.

splitting dataframe with collated points in to individuals in R

I have a dataframe (.txt) which looks like this [where "dayX" = the day of death in a survival assay in fruit flies, the numbers beneath are the number of flies that died in that treatment combination on that day, X or A are treatments, m & f are also treatments, the first number is the line, the second number is the block]
line day1 day2 day3 day4 day5
1 Xm1.1 0 0 0 2 0
2 Xm1.2 0 0 1 0 0
3 Xm2.1 1 1 0 0 0
4 Xm2.2 0 0 0 3 1
5 Xf1.1 0 3 0 0 1
6 Xf1.2 0 0 1 0 0
7 Xf2.1 2 0 2 0 0
8 Xf2.2 1 0 1 0 0
9 Am1.1 0 0 0 0 2
10 Am1.2 0 0 1 0 0
11 Am2.1 0 2 0 0 1
12 Am2.2 0 2 0 0 0
13 Af1.1 3 0 0 1 0
14 Af1.2 0 1 3 0 0
15 Af1.1 0 0 0 1 0
16 Af2.2 1 0 0 0 0
and I want it to become this using R:
XA mf line block individual age
1 X m 1 1 1 4
2 X m 1 1 2 4
3 X m 1 2 1 3
and so on...
The resulting dataframe takes the "age" value from the day the individual died, as scored in the upper dataframe. For example, two flies died on the 4th day (day4) in treatment Xm1.1, so R should create two rows: one with the information for the first individual, labelled individual "1", and another with the same information but labelled individual "2". If a 3rd individual died in the same treatment on day 5, there would be a third row identical to the two above, except that "age" would be "5" and individual would be "3". When it moves on to the next treatment row, in this case Xm1.2, the first individual to die within that treatment set is labelled individual "1" again (here it dies on day 3). My example has 38 deaths in total, so I am trying to get R to build a df that is 38*6 (excl. headers).
Is there a way to take my dataframe [the real version is approx 50*640, with approx 50 individuals per unique combination of X/A, m/f, line (1:40) and block (1-4), so ~32000 individual deaths] to an end dataframe of 6*~32000 in an automated way?
Both of these example dataframes can be built with this code, if it helps you to try out solutions:
test<-data.frame(1:16);colnames(test)=("line")
test$line=c("Xm1.1","Xm1.2","Xm2.1","Xm2.2","Xf1.1","Xf1.2","Xf2.1","Xf2.2","Am1.1","Am1.2","Am2.1","Am2.2","Af1.1","Af1.2","Af2.1","Af2.2")
test$day1=rep(0,16);test$day2=rep(0,16);test$day3=rep(0,16);test$day4=rep(0,16);test$day5=rep(0,16)
test$day4[1]=2;test$day3[2]=1;test$day2[3]=1;test$day4[4]=3;test$day5[5]=1;
test$day3[6]=1;test$day1[7]=2;test$day1[8]=1;test$day5[9]=3;test$day3[10]=1;
test$day2[11]=2;test$day2[12]=2;test$day4[13]=1;test$day3[14]=3;test$day4[15]=1;
test$day1[16]=1;test$day3[7]=2;test$day3[8]=1;test$day2[5]=3;test$day1[3]=1;
test$day5[11]=1;test$day5[9]=2;test$day5[4]=1;test$day1[13]=3;test$day2[14]=1;
test2=data.frame(rep(1:3),rep(1:3),rep(1:3),rep(1:3),rep(1:3),rep(1:3))
colnames(test2)=c("XA","mf","line","block","individual","age")
test2$XA[1]="X";test2$mf[1]="m";test2$line[1]=1;test2$block[1]=1;test2$individual[1]=1;test2$age[1]=4;
test2$XA[2]="X";test2$mf[2]="m";test2$line[2]=1;test2$block[2]=1;test2$individual[2]=2;test2$age[2]=4;
test2$XA[3]="X";test2$mf[3]="m";test2$line[3]=1;test2$block[3]=2;test2$individual[3]=1;test2$age[3]=3;
Apologies for the awfully long way of making this dummy dataset; I am suffering from sleep deprivation and jetlag and haven't used R for months. If you run the code in R you will hopefully see better what I aim to do.
-------------------------------------------------------------------------------------
By Rg255:
Currently stuck at this, derived from @Arun's answer (I have added the strsplit(as.character(dt$line), "") section to get around one error):
df=read.table("C:\\Users\\...\\data.txt",header=T)
require(data.table)
head(df[1:20])
dt <- as.data.table(df)
dt <- dt[, {dd <- unlist(.SD, use.names = FALSE);
list(individual = sequence(dd[dd>0]),
age = rep(which(dd>0), dd[dd>0])
)}, by=line]
out <- as.data.table(data.frame(do.call(rbind, strsplit(as.character(dt$line), ""))[, c(1:3,5)], stringsAsFactors=FALSE))
setnames(out, c("XA", "mf", "line", "block"))
out[, `:=`(line = as.numeric(line), block = as.numeric(block))]
out <- cbind(out, dt[, list(individual, age)])
Produces the following output:
> df=read.table("C:\\Users\\..\\data.txt",header=T)
> require(data.table)
> head(df[1:20])
line Day4 Day6 Day8 Day10 Day12 Day14 Day16 Day18 Day20 Day22 Day24 Day26 Day28 Day30 Day32 Day34 Day36 Day38 Day40
1 Xm1.1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 4 2
2 Xm2.1 0 0 0 0 0 0 0 0 0 2 0 0 0 1 2 1 0 2 0
3 Xm3.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1
4 Xm4.1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 2 3 8
5 Xm5.1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 3 3 3 6
6 Xm6.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
> dt <- as.data.table(df)
> dt <- dt[, {dd <- unlist(.SD, use.names = FALSE);
+ list(individual = sequence(dd[dd>0]),
+ age = rep(which(dd>0), dd[dd>0])
+ )}, by=line]
> out <- as.data.table(data.frame(do.call(rbind, strsplit(as.character(dt$line), ""))[, c(1:3,5)], stringsAsFactors=FALSE))
Warning message:
In function (..., deparse.level = 1) :
number of columns of result is not a multiple of vector length (arg 1)
> setnames(out, c("XA", "mf", "line", "block"))
> out[, `:=`(line = as.numeric(line), block = as.numeric(block))]
Error in `[.data.table`(out, , `:=`(line = as.numeric(line), block = as.numeric(block))) :
LHS of := must be a single column name, when with=TRUE. When with=FALSE the LHS may be a vector of column names or positions.
In addition: Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
> out <- cbind(out, dt[, list(individual, age)])
>

Here goes a data.table solution. The line column must have unique values.
require(data.table)
df <- read.table("data.txt", header=TRUE, stringsAsFactors=FALSE)
dt <- as.data.table(df)
dt <- dt[, {dd <- unlist(.SD, use.names = FALSE);
list(individual = sequence(dd[dd>0]),
age = rep(which(dd>0), dd[dd>0])
)}, by=line]
out <- as.data.table(data.frame(do.call(rbind,
strsplit(gsub("([[:alpha:]])([[:alpha:]])([0-9]+)\\.([0-9]+)$",
"\\1 \\2 \\3 \\4", dt$line), " ")), stringsAsFactors=FALSE))
setnames(out, c("XA", "mf", "line", "block"))
out[, `:=`(line = as.numeric(line), block = as.numeric(block))]
out <- cbind(out, dt[, list(individual, age)])
This works on your data.txt file.
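For comparison, the same expansion can be sketched in base R with rep(), which repeats each (treatment, day) pair as many times as there are deaths. This is only an illustration on the first three rows of the question's example, not a replacement for the data.table code above; as in that code, age is the position of the day column, so the real data (Day4, Day6, ...) would need the column index mapped back to the actual day number:

```r
# First three rows of the question's example data
test <- data.frame(
  line = c("Xm1.1", "Xm1.2", "Xm2.1"),
  day1 = c(0, 0, 1), day2 = c(0, 0, 1), day3 = c(0, 1, 0),
  day4 = c(2, 0, 0), day5 = c(0, 0, 0),
  stringsAsFactors = FALSE
)

counts <- as.matrix(test[, -1])  # rows = treatments, columns = days
n <- as.vector(t(counts))        # counts listed treatment by treatment
row_id <- rep(seq_len(nrow(counts)), each = ncol(counts))
day_id <- rep(seq_len(ncol(counts)), times = nrow(counts))

# One row per death: repeat each (treatment, day) pair n times
long <- data.frame(row = rep(row_id, n), age = rep(day_id, n))
# Number the individuals within each treatment row
long$individual <- ave(long$age, long$row, FUN = seq_along)

# Split labels such as "Xm1.1" into their four parts
parts <- do.call(rbind, regmatches(
  test$line,
  regexec("^([A-Za-z])([A-Za-z])([0-9]+)\\.([0-9]+)$", test$line)
))

out <- data.frame(
  XA = parts[long$row, 2],
  mf = parts[long$row, 3],
  line = as.numeric(parts[long$row, 4]),
  block = as.numeric(parts[long$row, 5]),
  individual = long$individual,
  age = long$age,
  stringsAsFactors = FALSE
)
print(out)
```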
