combining grep with if else statement - r

I have a data frame called "data". I searched for a pattern using the grep function, and I would like to put the result back into the data frame so that it lines up with the other rows.
data$CleanDim<-data$RAW_MATERIAL_DIMENSION[grep("^BAC",data$RAW_MATERIAL_DIMENSION)]
I would like to put the result into a new column, data$CleanDim, but I get the following error. Can someone please help me?
Error in `$<-.data.frame`(`*tmp*`, CleanDim, value = c(1393L, 1405L, 734L, : replacement has 2035 rows, data has 1881

grep() returns a vector of indices of entries that match the given criteria.
The only way that your code could work here is if the number of rows of data equals some whole multiple of the number of matches grep() finds, because only then can R recycle the shorter vector to fill the column.
Consider the following reproducible example:
data = data.frame(RAW_MATERIAL_DIMENSION = c("BAC","bBAC","aBAC","BACK","lbd"))
> data
RAW_MATERIAL_DIMENSION
1 BAC
2 bBAC
3 aBAC
4 BACK
5 lbd
> grep("^BAC",data$RAW_MATERIAL_DIMENSION)
[1] 1 4
data$CleanDim <- data$RAW_MATERIAL_DIMENSION[grep("^BAC",data$RAW_MATERIAL_DIMENSION)]
Error in `$<-.data.frame`(`*tmp*`, CleanDim, value = 1:2) :
replacement has 2 rows, data has 5
Note: this would work out ok (though it would be pretty weird) if the original data object just had its first four rows. In that case, you'd just get repeated values populated in your new column.
But, what you want to do here is to look at the results of grep("^BAC",data$RAW_MATERIAL_DIMENSION) and think about what is going to be sensible in your context. Your operation will only work if the length of this result equals that of your data object, or at least if your data object is a whole multiple of that length.
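If the goal is a new column that lines up with every row, one option is to use grepl(), which returns a logical vector as long as its input, together with ifelse(). This is only a sketch on the toy data above; filling the non-matching rows with NA is an assumption about what you want there.

```r
# Toy data from the example above (stringsAsFactors = FALSE keeps plain strings)
data <- data.frame(RAW_MATERIAL_DIMENSION = c("BAC", "bBAC", "aBAC", "BACK", "lbd"),
                   stringsAsFactors = FALSE)

# grepl() gives one TRUE/FALSE per row instead of the indices of the matches
matches <- grepl("^BAC", data$RAW_MATERIAL_DIMENSION)

# Keep the value where the pattern matches, NA elsewhere (assumed fill value)
data$CleanDim <- ifelse(matches, data$RAW_MATERIAL_DIMENSION, NA_character_)
print(data)
```

Because the result has one entry per row, the assignment no longer runs into the length mismatch.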

Related

writing first row (column headers) to a vector

I have a matrix I have coerced from a realRatingMatrix in recommenderlab package in R. The data contains predictions of ratings between 0-1 for a number of products.
The matrix should contain customer numbers along the rows (row 2 down) so that column 1 header is row label, and product IDs along the columns in the first row from column 2 onwards. The problem I have is when I coerce to a matrix the data structure becomes messy:
EDIT: Link to Github repository www.github.com/APBuchanan/recommenderlab-model
str(wsratings)
num [1:43, 1:319] 0.192 0.44 0.262 0.161 0.239 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:319] "X011211" "X014227" "X014229" "X014235" ...
The first cell wsratings[1,1] should be labelled as "CustomerNumber" and the remainder of the columns in row 1 should contain the data that is currently held in the above $:chr, but should display as separate variables in the matrix.
From the code below you will see that I've been trying to go about this by inserting the data into two vectors, that I can then call in the dimnames function, but I'm getting something wrong:
setwd("location to pull in data")
#look at using XLConnect package to link straight to excel workbook
library(recommenderlab)
library(xlsx)
library(tidyr)
library(Matrix)
#library(stringer)
data=read.csv("WS1 & WS2 V3.csv",header=TRUE,row.names=1)
#remove rows where number of purchases is <10
df=data[rowSums(data[-1])>=10,]
df<-as.matrix(df)
data.matrix=as(df,"binaryRatingMatrix")
#image(data.matrix)
model=Recommender(data.matrix,method="UBCF")
predictions<-predict(model,data.matrix,n=5)
set.seed(100)
evaluation<-evaluationScheme(data.matrix,method="split",train=0.5,given=5)
Rec.ubcf <- Recommender(getData(evaluation, "train"), "UBCF")
predict.ubcf<-predict(Rec.ubcf,getData(evaluation,"known"),type="topNList")
pred.ubcfratings<-predict(Rec.ubcf,getData(evaluation,"known"),type="ratings")
error.ubcf<-calcPredictionAccuracy(predict.ubcf,getData(evaluation,"unknown"),given=5)
setwd("Location to output data from model")
wsratings<-as(pred.ubcfratings,"matrix")
ratingrows<-c(evaluation@runsTrain)
Where I've called colnames2<-c(wsratings[1,2:ncol(wsratings)]), I am expecting the data from column 2 to the last column in row 1 to be read into the vector. But when I print the result, it includes rating information as well, which is not what I'm after.
ratingrows<-c(evaluation@runsTrain) contains the customer numbers that I want to insert below the row label "CustomerNumber".
I'm guessing there's a way of sorting this out with the tidyr package, but I'm not so familiar with it. If anyone can provide some advice on how I can clean this all up, I'd be very grateful.
So with the data you gave, I whipped up a solution here.
You said "I need to extract the customer numbers from the test split of data and drop that into the first column of the matrix - that's my main issue". The way to extract that is either: colnames(wsratings) or dimnames(wsratings)[[2]].
Once you have this vector (length of 320), you want to "drop that to the first column". You're asking for a cbind(), but the data you want to bind it to contains 43 rows. You can't bind them together because the lengths of the two elements are not the same or multiples of each other.
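To illustrate the length requirement, here is a small sketch with made-up data. The matrix m and the ids vector are hypothetical stand-ins for wsratings and the customer numbers, not the real data. It also shows that setting rownames() keeps the matrix numeric, whereas cbind()-ing a character vector onto it turns every cell into a string.

```r
# Toy stand-in for wsratings: 3 customers x 2 products (made-up values)
m <- matrix(c(0.2, 0.4, 0.3, 0.1, 0.5, 0.6), nrow = 3,
            dimnames = list(NULL, c("X011211", "X014227")))

# Hypothetical customer numbers, one per row of m
ids <- c("c01", "c02", "c03")

# The labels can only be attached if there is exactly one per row
stopifnot(length(ids) == nrow(m))

# Option 1: store the labels as row names; the ratings stay numeric
rownames(m) <- ids

# Option 2: bind the labels on as a first column; everything becomes character
labelled <- cbind(CustomerNumber = ids, m)

print(m)
print(labelled)
```

Which of the two is preferable depends on whether the ratings are still needed as numbers afterwards.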
Assuming you have the full dataset and their length matches, then the code would be:
customerid <-c("CustomerName", evaluation@runsTrain[[1]])
wsratings <- cbind(customerid, wsratings)
This is what I gathered you want, and it yields me the following:

Counting NA values by ID?

I'm learning R from scratch right now and am trying to count the number of NAs within a given table, aggregated by the ID of the file it came from. I then want to output that information in a new data frame, showing just the ID and the sum of the NA lines contained within. I've looked at some similar questions, but they all seem to deal with very short datasets, whereas mine is comparatively long (10k+ lines), so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...
Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# get file names and initialize a vector for the counts
fileNames <- list.files(<filePath>)
missRowsVec <- integer(length(fileNames))
# loop through files, get number of rows with missing values in each
for(filePos in seq_along(fileNames)) {
  # read in files **fill in <filePath>**
  temp <- read.csv(paste0(<filePath>, fileNames[filePos]), as.is=TRUE)
  # count the number of rows with missing values,
  # ** fill in <fieldName#> with strings of variable names **
  # (the 1 tells apply() to run the function over rows)
  missRowsVec[filePos] <- sum(apply(temp[, c(<field1Name>, <field2Name>)], 1,
                                    function(i) anyNA(i)))
} # end loop
# build data frame
myDataFrame <- data.frame("fileNames"=fileNames, "missCount"=missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
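If everything is already in one data frame with an ID column, as the question describes, the counting can also be done without a loop. The following is a sketch on made-up data; the column names f1, f2 and f3 are hypothetical, and it assumes "NA lines" means rows with at least one missing value.

```r
# Made-up data: an ID column plus three value columns
Data <- data.frame(ID = c(1, 1, 1, 2, 2, 3),
                   f1 = c(NA, 1, 2, NA, 3, 4),
                   f2 = c(NA, 2, NA, 4, 5, 6),
                   f3 = c(1, 2, 3, 4, 5, 6))

# Every column except ID is checked for missing values
value_cols <- setdiff(names(Data), "ID")

# TRUE for each row that contains at least one NA
row_has_na <- !complete.cases(Data[value_cols])

# Sum the flagged rows within each ID
na_counts <- aggregate(list(NA_Count = row_has_na),
                       by = list(ID = Data$ID), FUN = sum)
print(na_counts)
```

To count individual missing cells rather than rows, rowSums(is.na(Data[value_cols])) could be used in place of row_has_na.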

R rbind error row.names duplicates not allowed

There are other questions here addressing the same problem, but I can't see how to solve mine based on them. I have 5 data frames whose rows I want to combine into a single data frame using rbind, but it returns the error:
"Error in row.names<-.data.frame(*tmp*, value = value) :
'row.names' duplicated not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘100’, ‘1000’, ‘10000’, ‘100000’, ‘1000000’, ‘1000001 [....]"
The data frames have the same columns but different numbers of rows. I thought the rbind command took the first column as row.names, so I tried putting a sequential id in the five data frames, but that doesn't work. I've also tried to specify sequential row names across the data frames via row.names(), with no success. I don't think the merge command is an option, because there are 5 data frames and successive merges would overwrite the preceding ones. I've also created a new data frame containing only the ids and tried to join it, but the resulting data frame doesn't append the columns of the joined df.
Here is an extract of df 1:
id image power value pol class
1 1 tsx_sm_hh 0.1834515 -7.364787 hh FR
2 2 tsx_sm_hh 0.1834515 -7.364787 hh FR
3 3 tsx_sm_hh 0.1991938 -7.007242 hh FR
4 4 tsx_sm_hh 0.1991938 -7.007242 hh FR
5 5 tsx_sm_hh 0.2079365 -6.820693 hh FR
6 6 tsx_sm_hh 0.2079365 -6.820693 hh FR
[...]
1802124 1802124 tsx_sm_hh 0.1991938 -7.007242 hh FR
The four other dfs have the same structure, except that their 'id' columns don't share any numbers with each other. The 'pol' and 'image' columns are defined as factors.
Running all.pol <- rbind(df1,df2,df3,df4,df5) returns this error about duplicated row.names.
Any idea?
Thanks in advance
I had the same error recently. What turned out to be the problem in my case was that one of the attributes of the data frame was a list. After casting it to a basic type (e.g. numeric), rbind worked just fine.
By the way, the row names are the "row numbers" to the left of the first variable. In your example, they are 1, 2, 3, ... (the same as your id variable).
You can see them using rownames(df) and set them using rownames(df) <- name_vector (name_vector must have the same length as the number of rows of df, and its elements must be unique).
I had the same error.
My problem was that one of the columns in the data frames was itself a data frame, and I couldn't easily find the offending column.
data.table::rbindlist() helped to locate it:
library(data.table)
rbindlist(a)
# Error in rbindlist(a) :
# Column 25 of item 1 is length 2 inconsistent with column 1 which is length 16. Only length-1 columns are recycled.
a[[1]][, 25] %>% class # "data.frame" <- this should obviously be converted to a column or removed
After removing the errant column, do.call(rbind, a) worked as expected.
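Both answers come down to finding a column that is not a plain vector. A rough way to look for such columns in base R is to test each one with is.list(), which is TRUE for list columns and for nested data frames alike. This is a sketch on a made-up data frame; the column names are hypothetical.

```r
# Made-up data frame with one list column (wrapped in I() so it is kept as-is)
df <- data.frame(id = 1:3,
                 payload = I(list(1, 2:3, "x")),
                 label = c("a", "b", "c"))

# is.list() is TRUE for list columns and for data frames stored as columns
is_nested <- vapply(df, is.list, logical(1))

# Names of the columns that would need flattening or removing before rbind()
bad_cols <- names(df)[is_nested]
print(bad_cols)
```

Running the same check on each of the data frames before calling rbind() should point to the column to convert or drop.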

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
  # ensure character data
  LOG_MESSAGE <- as.character(LOG_MESSAGE)
  CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
                        rep(replace(values,
                                    nchar(values)==0,
                                    values[nchar(values) != 0]),
                            lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last one carried forward" part).
2. The result of 1. is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors.
4. So I use sapply() to run the subsetting function `[`() and extract the first element from each list component.
5. The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.
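For comparison, the same fill-forward can be sketched in base R with cumsum(), which numbers each row by how many events have been logged so far. This is not taken from either answer above; it assumes that gaps are empty strings, and it returns NA for any rows that come before the first event.

```r
# Made-up log column: empty strings mark rows where no event was logged
log_message <- c("", "FIRST_EVENT", "", "SECOND_EVENT", "", "")

# TRUE where an event was logged
is_event <- nzchar(log_message)

# Running count of events seen so far (0 before the first event)
group <- cumsum(is_event)

# Prepend NA so that group 0 maps to NA, then look up each row's event
events <- c(NA, log_message[is_event])
current_event <- events[group + 1]
print(current_event)

# Row indices split by current event (rows with NA are dropped by split())
rows_by_event <- split(seq_along(log_message), current_event)
print(rows_by_event)
```

Everything here is vectorised, so it should cope with hundreds of thousands of rows, although that has not been timed here.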

Error when using mshapiro.test: "U[] is not a matrix with number of columns (sample size) between 3 and 5000"

I am trying to perform a multivariate test for normality on some density data from five sites, using mshapiro.test from the mvnormtest package. Each site is a column, and densities are below. It is 5 columns and 5 rows, with the top row as the header (site names). Here is how I loaded my data:
datafilename="/Users/megsiesiple/Documents/Lisa/lisadensities.csv"
data.nc5=read.csv(datafilename,header=T)
attach(data.nc5)
The data look like this:
B07 B08 B09 B10 M
1 72571.43 17714.29 3142.86 22571.43 8000.00
2 44571.43 46857.14 49142.86 16857.14 7142.86
3 54571.43 44000.00 26571.43 6571.43 17714.29
4 57714.29 38857.14 32571.43 2000.00 5428.57
When I call mshapiro.test() for data.nc5 I get this message: Error in mshapiro.test(data.nc5) :
U[] is not a matrix with number of columns (sample size) between 3 and 5000
I know that to perform a Shapiro-Wilk test using mshapiro.test(), the data has to be in a numeric matrix, with a number of columns between 3 and 5000. However, even when I make the .csv a matrix with only numbers (i.e., when I omit the Site names), I still get the error. Do I need to set up the matrix differently? Has anyone else had this problem?
Thanks!
You need to transpose the data into a matrix, so that your variables are in rows and your observations are in columns. The command will be:
M <- t(data.nc5[1:4,1:5])
mshapiro.test(M)
It works for me this way. The labels in the first row should be recognized during the import, so the data will start from row 1. Otherwise, there will be a "missing value" error.
If you read the numeric matrix into R via read.csv() using code similar to what you show, it will be read in as a data frame, and that is not a matrix.
Try
mat <- data.matrix(data.nc5)
mshapiro.test(mat)
(Not tested as you don't give a reproducible example and it is late-ish in my time zone now ;-)
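Before calling the test, it can help to check what kind of object you actually have. The sketch below rebuilds the four rows shown in the question by hand (typed in, not read from the CSV) and compares the data frame with its transposed matrix. It only inspects class and shape; it does not run mshapiro.test() itself.

```r
# The four rows from the question, typed in by hand
data.nc5 <- data.frame(
  B07 = c(72571.43, 44571.43, 54571.43, 57714.29),
  B08 = c(17714.29, 46857.14, 44000.00, 38857.14),
  B09 = c(3142.86, 49142.86, 26571.43, 32571.43),
  B10 = c(22571.43, 16857.14, 6571.43, 2000.00),
  M   = c(8000.00, 7142.86, 17714.29, 5428.57)
)

# read.csv() gives a data frame, which is not a matrix
print(is.matrix(data.nc5))

# Transposing a data frame returns a matrix: sites in rows, samples in columns
M <- t(as.matrix(data.nc5))
print(is.matrix(M))
print(is.numeric(M))
print(dim(M))
```

If is.numeric(M) comes back FALSE, a non-numeric column (for example, site names read in as data) has probably ended up in the data frame.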
