R data frame slicing - r

I have a data frame of 4352 observations and 21 columns. The first column is a date vector and the other 20 columns are numeric vectors (representing stock prices). Since there are no trades on some days (i.e. weekends and holidays), some observations have NAs in columns 2:21.
The following code gives me a logical data frame indicating where there is an NA; the test data frame has the same number of rows as the input table.
test <- is.na(prices[, 2:21]) %>% as.data.frame()
However, when I do the following, the result has 48052 observations with additional row names, e.g. NA.40755:
test <- prices[is.na(prices[, 2:21]) == 0, ]
But when I use a comma instead of a colon when slicing columns, I seem to get the desired output (i.e. 2970 observations):
test <- prices[is.na(prices[, 2, 21]) == 0, ]
So my question is: why do I have to slice [, 2, 21] instead of [, 2:21]?

is.na(prices[, 2:21]) is a logical matrix of TRUE/FALSE values. I am not sure what you were trying to do by comparing it with == 0, because that returns a logical matrix of the same dimensions. You need to consolidate the values in each row with rowSums so that you have only one value per row.
If you want to drop the rows where all values are NA, you can use:
prices <- prices[rowSums(!is.na(prices[, 2:21])) > 0, ]
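A minimal sketch of the idea on a toy data frame (prices_toy and its values are made up for illustration, not the asker's data). It also hints at why [, 2, 21] appeared to work: for a data frame, the third positional argument of [ is drop, so prices[, 2, 21] selects only column 2 as a plain vector. The 48052-row result likely comes from the logical matrix being used as one long row index, which reaches past the last row and produces NA rows.

```r
# Toy stand-in for `prices` (hypothetical values): a date column plus two price columns.
prices_toy <- data.frame(
  date = as.Date("2020-01-03") + 0:3,
  a = c(10, NA, NA, 12),
  b = c(20, NA, 21, 22)
)

# is.na() on several columns returns a logical matrix, one cell per value.
na_mat <- is.na(prices_toy[, 2:3])

# Collapse the matrix to one value per row: keep rows with at least one price.
kept <- prices_toy[rowSums(!na_mat) > 0, ]

# Here 3 is matched to the `drop` argument, so only column 2 is returned, as a vector.
col2 <- prices_toy[, 2, 3]
```

With this toy input, only the second row (all prices missing) is dropped.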

We can also use Reduce with lapply from base R:
prices <- prices[!Reduce(`&`, lapply(prices[2:21], is.na)),]

Related

r- Using sum and match to find first occurrence of a high frequency

I have several data frames in wide format imported from dbf, so every column is a date and every row is an observation. For every day I have between 500 and 2000 observations, depending on the size of the geographic shape I am looking at. For the purposes of reproducibility, I created 2 dummy data frames with a range of values I may see in my actual data frames.
Data1<- data.frame(replicate(10, sample(0:1000, 20, rep= TRUE)))
Data<- data.frame(replicate(10, sample(0:1000, 20, rep= TRUE)))
Since I have many of these data frames I have put them in a list so I can run functions on many at once.
filenames<- mget(ls(pattern= 'Data'))
Now my issue is that I am trying to write a function to count the number of occurrences in each column where values are within the range 0-100. I can accomplish this with
library(plyr)
Datacount<- ldply(Data, function(x) length(which(x>=0 & x<=100)))
Then I need to find the first column (date) in which this count is greater than 10% of the total number of observations per column. So for a data frame with 20 observations, I would want the first date where the number of cells between 0 and 100 is greater than 2. I previously accomplished this using apply (where "V1" is the name of the column containing the counts):
Datamatch<- apply (Datacount["V1"]>2,2,function(x) match (TRUE,x))
My question is whether there is a way to combine these functions into one process that I can use either in a for loop over "filenames" or with one of the lapply family functions.
For detail, here is an example of a single function I built to run across each row of the data frame. It gives me the column index of the last date where each row value is <= 100. I then used lapply to loop over all data frames in my list and append the results of the function to the original data frame.
icein<- function(dataframe){
dataframe$icein<- apply(dataframe, 1, function(x){tail(which(x<=100), 1)})
dataframe
}
list2env(lapply(filenames, icein), envir= .GlobalEnv)
After loading all the 'Data' objects into a list, loop over the list with map_int. For each data frame, take the mean of the logical vector between(., 0, 100) in every column, check whether it is greater than or equal to the threshold n, unlist the one-row result, wrap it in which to get the position indexes, and extract the first one.
library(dplyr)
library(purrr)
n <- 0.2
mget(ls(pattern = 'Data')) %>%
  map_int(~ .x %>%
    summarise_all(~ mean(between(., 0, 100)) >= n) %>%
    unlist %>%
    which %>%
    first)
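The same idea can be sketched in base R, without dplyr or purrr. The helper name first_col_over, its arguments, and the toy data are assumptions made up for illustration; the default threshold follows the "greater than 10%" rule from the question rather than the n used above.

```r
# Hypothetical helper: index of the first column whose share of values
# inside [lower, upper] is strictly greater than `prop` (NA if there is none).
first_col_over <- function(df, lower = 0, upper = 100, prop = 0.1) {
  in_range <- df >= lower & df <= upper  # logical matrix, same shape as df
  share <- colMeans(in_range)            # proportion of in-range cells per column
  unname(which(share > prop)[1])
}

# Toy data: the in-range shares per column are 0, 0.25 and 0.5.
toy <- data.frame(
  d1 = c(500, 600, 700, 800),
  d2 = c(50, 600, 700, 800),
  d3 = c(50, 60, 700, 800)
)

# Apply the helper to every data frame in a named list.
result <- sapply(list(a = toy, b = toy[3:1]), first_col_over)
```

Here result holds one column index per data frame, which plays the role of Datamatch for each element of the list.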

How do I replace specific cell values in dataframe using continuous (sequential) indexing?

I have two data frames of equal dimensions.
One has a value in some cells (i.e. 'abc') whose positions I need to index. The other has entirely different values, and I need to replace the values in that other data frame at the same positions as 'abc'.
Examples:
df1 <- data.frame('1'=c('abc','bbb','rweq','dsaf','cxc','rwer','anc','ewr','yuje','gda'),
'2'=c(NA,NA,'bbb','dsaf','rwer','dsaf','ewr','cxc','dsaf','cxc'),
'3'=c(NA,NA,'dsaf','abc','bbb','cxc','yuje',NA,'ewr','anc'),
'4'=c(NA,NA,'cxc',NA,'abc','anc',NA,NA,'yuje','rweq'),
'5'=c(NA,NA,'anc',NA,'abc',NA,NA,NA,'rwer','rwer'),
'6'=c(NA,NA,'rweq',NA,'dsaf',NA,NA,NA,'bbb','bbb'),
'7'=c(NA,NA,'abc',NA,'ewr',NA,NA,NA,'abc','abc'),
'8'=c(NA,NA,'abc',NA,'rweq',NA,NA,NA,'cxc','bbb'),
'9'=c(NA,NA,NA,NA,'abc',NA,NA,NA,'anc',NA),
'10'=c(NA,NA,NA,NA,'abc',NA,NA,NA,'rweq',NA))
df2 <- data.frame('1'=c('green','black','white','yelp','help','green','red','brown','green','crack'),
'2'=c(NA,NA,'black','yelp','green','yelp','brown','help','yelp','help'),
'3'=c(NA,NA,'yelp','green','black','help','green',NA,'brown','red'),
'4'=c(NA,NA,'help',NA,'green','red',NA,NA,'green','white'),
'5'=c(NA,NA,'red',NA,'green',NA,NA,NA,'green','green'),
'6'=c(NA,NA,'white',NA,'yelp',NA,NA,NA,'black','black'),
'7'=c(NA,NA,'green',NA,'brown',NA,NA,NA,'green','green'),
'8'=c(NA,NA,'green',NA,'white',NA,NA,NA,'help','black'),
'9'=c(NA,NA,NA,NA,'green',NA,NA,NA,'red',NA),
'10'=c(NA,NA,NA,NA,'green',NA,NA,NA,'white',NA))
I can find the sequential indexes of 'abc', but the result is a one-dimensional vector:
which(df1 == 'abc')
#[1] 1 24 35 45 63 69 70 73 85 95
I don't know how to replace values using this method.
The expected output is df2 with the 'green' values replaced only at the same indexes as the 'abc' values in df1.
Note that 'green' values in df2 also occur at indexes other than those of 'abc' in df1.
I don't think your problem is appropriately approached with the data in a data.frame. That introduces several complications. First, each variable (column) in the data frame is a factor with different levels (this is the default in R versions before 4.0, where stringsAsFactors is TRUE). Second, your code is making a comparison between a list (data.frame) and a factor (which is coerced into an atomic vector). The help page for the == operator states: "if the other is a list R attempts to coerce it to the type of the atomic vector". The help page also points out that factors get special handling in comparisons, where R first assumes you are comparing factor levels, which is what your code is doing.
I think you want to convert your data frames of identical dimensions to matrices first. If you need the result as a data.frame, convert it back afterwards, as I show here, but realize that the factor levels may have changed.
# Starting with the values assigned to df1 and df2
m1 <- as.matrix(df1)
m2 <- as.matrix(df2)
index <- which(m1 == "abc")
m2[index] <- "abc"
df2 <- as.data.frame(m2)
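As a small illustration of how this sequential (column-major) indexing lines up between two matrices of the same shape, here is a sketch on made-up 2 x 2 data. The names toy1, toy2 and the replacement value "HIT" are hypothetical.

```r
# Two toy data frames of identical dimensions (hypothetical values).
toy1 <- data.frame(a = c("abc", "x"), b = c(NA, "abc"), stringsAsFactors = FALSE)
toy2 <- data.frame(a = c("green", "red"), b = c(NA, "green"), stringsAsFactors = FALSE)

m1 <- as.matrix(toy1)
m2 <- as.matrix(toy2)

# which() counts cells column by column and silently skips NA comparisons.
idx <- which(m1 == "abc")

# arr.ind = TRUE reports the same hits as row/column pairs instead.
pos <- which(m1 == "abc", arr.ind = TRUE)

# The same sequential positions address the matching cells of m2.
m2[idx] <- "HIT"
result <- as.data.frame(m2, stringsAsFactors = FALSE)
```

Because both matrices share the same dimensions, position k in m1 and position k in m2 always refer to the same row and column.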
Here is another way. Learn about the *apply family in R: I think it is the most useful group of functions in the language, whatever you plan to do ;) Also note that a data.frame is of type 'list', so lapply works on it column by column.
df1[] <- lapply(df1, function(frame, pattern, replace){ # for each frame = column:
  matches <- which(frame %in% pattern) # indexes of the column elements equal to the pattern
  if(length(matches) > 0) # if there is at least one matching index,
    frame[matches] <- replace # give it the value you want
  return(frame) # return the modified column; df1[] <- keeps df1 a data frame
}, pattern="abc", replace= "<whatYouWant>") # don't forget this part: the needed arguments!

How to count the number of entries per row in a data frame in R?

I have a number of large data frames for which I need to know the number of elements in each row. For example, if my dataframe df looks like
X Y Z A B
Q R S
I would want the following output vector:
5
3
How can I code for this in R?
We can use rowSums on the non-missing elements (assuming the cells that have no value are NA):
rowSums(!is.na(df))
If the empty cells are blank "" instead of NA, create the logical matrix with != and use rowSums:
rowSums(df != "")
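A minimal sketch on the two rows from the question, assuming the data were read into five character columns (the column names c1 to c5 and the two toy data frames are made up). The last line combines both checks for data that mixes NA and "".

```r
# Rows "X Y Z A B" and "Q R S", once with NA padding and once with blank padding.
df_na <- data.frame(
  c1 = c("X", "Q"), c2 = c("Y", "R"), c3 = c("Z", "S"),
  c4 = c("A", NA), c5 = c("B", NA),
  stringsAsFactors = FALSE
)
df_blank <- df_na
df_blank[is.na(df_blank)] <- ""

n_na    <- unname(rowSums(!is.na(df_na)))   # count non-missing cells per row
n_blank <- unname(rowSums(df_blank != ""))  # count non-blank cells per row

# Counts a cell only if it is neither NA nor "".
n_both <- unname(rowSums(!is.na(df_na) & df_na != ""))
```

All three vectors hold the counts 5 and 3 for this input.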

Getting a subset in R

I have a data frame with 14 columns, and I want to create a subset with the same columns that keeps only the rows whose ID is repeated (for example, if ID = 2 appears more than once, I want to keep those rows).
To begin, I applied table to my data frame to see the frequencies of the ID:
head(sort(table(call.dat$IMSI), decreasing = TRUE), 100)
In my case, 20801170106338 is repeated twice, so I want to see the two observations for this ID.
Afterwards, I tried x <- subset(call.dat, IMSI == "20801170106338") and hsb6 <- call.dat[call.dat$IMSI == "20801170106338", ], but the results are wrong (x has 0 observations of 14 variables, and hsb6 contains only NA).
Can you help me? Thanks.
PS: IMSI is a numeric value.
x <- subset(call.dat, Handset.Manufacturer == "LG") is another example, and it works perfectly.
You can use duplicated, a function that returns a logical vector that is TRUE when a record repeats an earlier one. Combining it with fromLast = TRUE also flags the first occurrence of each repeated value.
isDuplicated <- duplicated(call.dat$IMSI) | duplicated(call.dat$IMSI, fromLast = TRUE)
Then you can extract all the rows containing a duplicated value.
call.dat.duplicated <- call.dat[isDuplicated, ]
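A short sketch of how duplicated behaves, using a made-up stand-in for call.dat (the toy values and the column name handset are hypothetical). On its own, duplicated marks only the second and later occurrences, so the fromLast = TRUE pass is what brings in the first occurrence of each repeated ID.

```r
# Toy stand-in for call.dat with two repeated IDs (11 and 22).
toy <- data.frame(
  IMSI = c(11, 22, 11, 33, 22, 44),
  handset = c("LG", "LG", "Nokia", "LG", "Sony", "Nokia"),
  stringsAsFactors = FALSE
)

later_only <- duplicated(toy$IMSI)  # TRUE only for rows 3 and 5
every_copy <- duplicated(toy$IMSI) | duplicated(toy$IMSI, fromLast = TRUE)

repeated_rows <- toy[every_copy, ]  # all rows whose IMSI occurs more than once
```

Sorting repeated_rows by IMSI afterwards puts the copies of each ID next to each other.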

Error: (subscript) logical subscript too long

Can someone let me know why I am getting this error and how I can fix it?
What I am trying to do is remove the rows that have a 1 in any column whose sum is less than 10.
Here is the code:
a0=rep(1,40)
a=rep(0:1,20)
b=c(rep(1,20),rep(0,20))
c0=c(rep(0,12),rep(1,28))
c1=c(rep(1,5),rep(0,35))
c2=c(rep(1,8),rep(0,32))
c3=c(rep(1,23),rep(0,17))
c4=c(rep(1,6),rep(0,34))
x=matrix(cbind(a0,a,b,c0,c1,c2,c3,c4),nrow=40,ncol=8)
nam <- paste("V",2:9,sep="")
colnames(x)<-nam
dat <- cbind(y=rnorm(40,50,7),x)
#===================================
toSum <- colSums(dat)
Col <- Val <- NULL
for(i in 1:length(toSum)){
if(toSum[i]<10){
Col <- c(Col,colnames(dat)[i])
Val <- c(Val,toSum[i])}
}
cs <- colSums(dat) < 10
indx <- dat[,which(cs)]==0
for(i in 1:dim(indx)[2]){
datnw <- dat[indx[,i],]
dat <- datnw}
datnw2 <- dat[, -which(cs)]
Thanks
If I understand correctly what you're trying to achieve, you might best write it this way:
cs <- colSums(dat) < 10
dat[rowSums(dat[,cs]) == 0, !cs]
This means: for any column with sum less than 10 (called a “small column” hereafter), drop any row which has a 1 in that column. So you only keep rows which have a zero in all those small columns. You drop the small columns as well, as they would only contain zeros in any case.
In your code, indx is a logical matrix with 40 rows, one for each row of input, and one column for each small column in the input. You use the first column of indx to remove the rows with a 1 in the first small column. This results in a new value for dat, which is a few rows shorter than the original. In the next iteration of the loop, you use the second column of indx in an attempt to remove more rows. But this won't work: after the first iteration, dat has fewer than 40 rows, while the second column of indx still has all 40 elements. This is what's causing the error: you're subscripting fewer than 40 rows with a logical vector of length 40.
You could combine the three columns of your indx into a single vector suitable to subscript the rows of interest using the following expression:
apply(indx, 1, all)
This will have a TRUE value in its result for exactly those rows which have TRUE in each column. However, I guess I'd prefer my code above over this, as it is much shorter to write. The most likely reason to prefer the latter is if your data may contain negative numbers, so that a row sum of zero does not imply an all-zero row. That is not a problem in your example data.
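To compare the two approaches side by side, here is a sketch on a smaller made-up matrix. The values and the threshold of 3 (instead of 10) are assumptions chosen to keep the example short; drop = FALSE is added so the code still works if only one column turns out to be small.

```r
# Toy matrix: V3 and V4 are the "small" columns (column sums below the threshold).
dat <- cbind(
  y  = c(5.1, 4.2, 6.3, 5.5, 4.9, 6.0),
  V2 = c(1, 1, 1, 1, 1, 1),
  V3 = c(1, 0, 0, 0, 0, 0),
  V4 = c(0, 1, 0, 0, 0, 0)
)

cs <- colSums(dat) < 3                 # assumed threshold for this toy data
small <- dat[, cs, drop = FALSE]       # only the small columns

keep_sum <- rowSums(small) == 0        # row-sum test; assumes no negative values
keep_all <- apply(small == 0, 1, all)  # explicit all-zero test

# Build the row selector once, then subset rows and columns in a single step.
result <- dat[keep_sum, !cs, drop = FALSE]
```

Because the row selector is computed once against the full matrix, its length always matches the number of rows being subset, which avoids the "logical subscript too long" error.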
