Creating vectors between specific values in a dataset with R

I have a rather special case with a dataset and what I want to do with it. To make it comprehensible I have to give a brief description of the background:
I have a sensor producing data, and the sensor needs maintenance every now and then. Between maintenance visits the data has a decreasing trend which I want to get rid of, and since maintenance is carried out quite often, I want to automate this procedure.
The sensor is turned off when carrying out maintenance but the telemetry system still produces readings marked with " * ". Therefore the subsets of data to be detrended can be easily spotted between batches of "*" readings.
I have been (unsuccessfully) trying to create a vector (on which I can then carry out a detrending procedure) with this data by looping through the data and selecting the desired values with conditional statements. To begin selecting the values I used the following statement:
if((tryp[i-2,2]=="*") & (tryp[i-1,2]=="*") & (tryp[i,2]!="*"))
and to finish the selection (exit the loop):
if((tryp[i-2,2]!="*") & (tryp[i-1,2]!="*") & (tryp[i,2]=="*"))
However, this last statement gives an error of "argument is of length zero" and the first statement doesn't seem to be working properly either.
This is what the data looks like:
So for example, one subset of data that I would like to select for detrending is between data points 9686 and 9690. Obviously this is a very small subset, but it shows well what I am trying to communicate.
I would really appreciate it if someone could suggest an elegant way of doing this, including approaches entirely different from what I was trying originally.
Many thanks!

library(dplyr)
my_df <- data.frame(a = LETTERS[1:10], b = c('+','*','*', '+', '*', '*', '+', '+', '*', '*'))
my_df %>% filter(b != '*')
Supposing the '+' signs are your data points, you can easily get rid of the '*' rows by filtering out the rows that contain '*'.
And of course a solution without the dplyr-package:
my_df[which(my_df$b!='*'),]
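Since you also want to isolate each stretch of readings between maintenance breaks (not just drop the '*' rows), here is a hedged base-R sketch of one way to label the runs, using the toy my_df above; how the groups should be formed is my assumption:

is_star <- my_df$b == '*'
# A new run starts at any non-'*' row whose previous row was '*' (or at row 1):
run_id <- cumsum(!is_star & c(TRUE, head(is_star, -1)))
# Split the non-'*' rows into one data frame per sensor run:
runs <- split(my_df[!is_star, ], run_id[!is_star])
str(runs)  # detrend each element of this list separately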

Related

R: Reclin Package: Is there a way to keep the weights generated in score_problink() and used in select_n_to_m() after using the link() function?

I am trying to perform a record linkage on 2 datasets containing company names. While reclin does a very good job indeed, the linked data will need some manual cleaning, and because I will most likely have to clean about 3000 rows in a day or two, it would be great to keep the weights generated in the reclin process shown below:
CH_ecorda_to_Patstat_left <- pair_blocking(companies_x, companies_y) %>%
  compare_pairs(by = "nameor", default_comparator = jaro_winkler()) %>%
  score_problink() %>%
  select_n_to_m() %>%
  link(all_x = TRUE, all_y = FALSE)
I know these weights are still kept up until I use the link() function. I would like to keep the weights based on the comparison of the variable "nameor" so I can order my data by ascending weight, from smallest to biggest, to find mistakes in the attempted matches more quickly.
For context: I need to find out how many companies_x have handed in patents in the patent database companies_y. I don't need to know how often they handed them in, just whether there are any at all. So I need matches of x to y; however, I don't know the true number of matches, and not every companies_x company will have a match, so some manual cleaning will be necessary, as n_to_m forces a match for each entry even if there should be none.
Try doing something like this:
weight <- problink_em(paired)
paired <- score_problink(paired, weight)
You'll have the result stored as weight now.
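To then use the weights for ordering, here is a hedged sketch of the full pipeline (a sketch only: it assumes score_problink() stores its score in a column named "weight" and that the pairs object can be subset like a regular data frame):

library(reclin)
library(dplyr)
paired <- pair_blocking(companies_x, companies_y) %>%
  compare_pairs(by = "nameor", default_comparator = jaro_winkler())
weight <- problink_em(paired)
paired <- score_problink(paired, weight)  # scored pairs keep a "weight" column
paired <- select_n_to_m(paired)
# Weakest matches first, so likely mistakes sit at the top of the cleaning queue:
head(paired[order(paired$weight), ])
linked <- link(paired, all_x = TRUE, all_y = FALSE)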

R code incredibly slow

Recently I have been working on some R scripts to produce reports. One of the tasks is to check whether a value in a column matches any row of another data frame and, if it does, set a new logical TRUE/FALSE column.
More specifically, I need help improving this code chunk:
for (i in 1:length(df1$Id)) {
  df1 <- within(df1, newCol <- df1$Id %in% df2$Id)
}
df1$newCol <- as.factor(df1$newCol)
The dataset has about 10k rows, so it does not make sense that it needs 6 minutes to execute completely (timed with proc.time()), which is what is currently happening. I also need to do some other types of checking, so I really need to get this right.
What am I doing wrong there that is devouring so much time?
Thank you for your help!
Your code is vectorized - there is no need for the for loop. In this case, you can tell because you don't even use i inside the loop. This means your loop is executing the exact same code for the exact same result 10k times. If you delete the for wrapper around your functional line
df1 <- within(df1, newCol <- df1$Id %in% df2$Id)
you should get ~10k times speed-up.
One other comment is that the point of within is to avoid re-typing a data frame's name inside. So you're missing the point by using df1$ inside within(), and your data frame name is so short that it is longer to type within() in this case. Your entire code could be simplified to one line:
df1$newCol = factor(df1$Id %in% df2$Id)
My last comment I'm making from a state of ignorance about your application, so take it with a grain of salt, but a binary variable is almost always nicer to have as boolean (TRUE/FALSE) or integer (1/0) than as a factor. It does depend what you're doing with it, but I would leave the factor() off until necessary.
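As a quick illustration (with made-up Id columns, since the original data frames are not shown), the vectorized check over 10k rows finishes in a fraction of a second:

df1 <- data.frame(Id = 1:10000)
df2 <- data.frame(Id = sample(1:10000, 5000))
system.time(df1$newCol <- df1$Id %in% df2$Id)  # one vectorized pass, no loop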

Looping in R to create transformed variables

I have a dataset of 80 variables, and I want to loop through a subset of 50 of them and construct returns. I have a list of the names of the variables for which I want to construct returns, and am attempting to use the dplyr command mutate to construct the variables in a loop. Specifically, my code is:
for (i in returnvars) {
alldta <- mutate(alldta,paste("r",i,sep="") = (i - lag(i,1))/lag(i,1))}
where returnvars is my list, and alldta is my dataset. When I run this code outside the loop with just one of the i values, it works fine. The code for that looks like this:
alldta <- mutate(alldta,rVar = (Var- lag(Var,1))/lag(Var,1))
However, when I run it in the loop (e.g., attempting to do the previous line of code 50 times for 50 different variables), I get the following error:
Error: unexpected '=' in:
"for (i in returnvars) {
alldta <- mutate(alldta,paste("r",i,sep="") ="
I am unsure why this issue is coming up. I have looked into a number of ways to try and do this, and have attempted solutions that use lapply as well, without success.
Any help would be much appreciated! If there is an easy way to do this with one of the apply commands as well, that would be great. I did not provide a dataset because my question is not data specific, I'm simply trying to understand, as a relative R beginner, how to construct many transformed variables at once and add them to my data frame.
EDIT: As per Frank's comment, I updated the code to the following:
for (i in returnvars) {
  varname <- paste("r", i, sep="")
  alldta <- mutate(alldta, varname = (i - lag(i,1))/lag(i,1))
}
This fixes the previous error, but I am still not referencing the variable correctly, so I get the error
Error in "Var" - lag("Var", 1) :
non-numeric argument to binary operator
Which I assume is because R sees my variable name Var as a string, rather than as a variable. How would I correctly reference the variable in my dataset alldta? I tried get(i) and alldta$get(i), both without success.
I'm also still open to (and actively curious about), more R-style ways to do this entire process, as opposed to using a loop.
Using mutate inside a loop might not be a good idea either. I am not sure if mutate makes a copy of the data frame, but it's generally not good practice to grow a data frame inside a loop. Instead, create a separate data frame with the output and then name the columns based on your logic.
result <- as.data.frame(do.call(cbind, lapply(returnvars, function(i) {...})))
names(result) <- paste("r", returnvars, sep="")
After playing around with this more, I discovered (thanks to Frank's suggestion), that the following works:
extended <- alldta  # make a copy of my dataset
for (i in returnvars) {
  varname <- paste("r", i, sep="")
  extended[[varname]] <- (extended[[i]] - lag(extended[[i]], 1)) / lag(extended[[i]], 1)
}
This is still not very R-styled in that I am using a loop, but for a task that is only repeating about 50 times, this shouldn't be a large issue.
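For reference, recent versions of dplyr (newer than this thread) offer a loop-free way to do the same thing with across(); a hedged sketch, assuming returnvars is a character vector of column names:

library(dplyr)
extended <- alldta %>%
  mutate(across(all_of(returnvars),
                ~ (.x - lag(.x)) / lag(.x),
                .names = "r{.col}"))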

Strangeness with filtering in R and showing summary of filtered data

I have a data frame loaded from a CSV file in R, like
mySheet <- read.csv("Table.csv", sep=";")
I now can print a summary on that mySheet object
summary(mySheet)
and it will show me a summary for each column; for example, one column named Diagnose has the unique values RCM, UCM, HCM, and it shows the number of occurrences of each of these values.
I now filter by a diagnosis, like
subSheet <- mySheet[mySheet$Diagnose=='UCM',]
which seems to work: when I just type subSheet in the console, it prints only the rows where the value matched 'UCM'.
However, if I do a summary on that subSheet, like
summary(subSheet)
it still 'knows' about the other two possibilities RCM and HCM and prints them with a count of 0. However, I expected that the newly created object would NOT know about the possible values of the original mySheet I initially loaded.
Is there any way to get rid of those other possible values after filtering? I also tried subset, but it just seems to be a kind of shortcut for '[' in interactive use... I also tried drop=TRUE as an option, but it didn't change the game.
Totally mind squeezing :D Any help is highly appreciated!
What you are dealing with here are factors from reading the csv file. You can get subSheet to forget the missing factors with
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
or
subSheet$Diagnose <- subSheet$Diagnose[ , drop=TRUE]
just before you do summary(subSheet).
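For illustration, a minimal self-contained example (with made-up values reusing the Diagnose levels above) of why summary() keeps the empty levels and what droplevels() changes:

x <- factor(c("RCM", "UCM", "HCM", "UCM"))
sub <- x[x == "UCM"]
summary(sub)              # RCM and HCM still appear, with a count of 0
summary(droplevels(sub))  # only UCM remains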
Personally I dislike factors, as they cause me too many problems, and I only convert strings to factors when I really need to. So I would have started with something like
mySheet <- read.csv("Table.csv", sep=";", stringsAsFactors=FALSE)

How can I create a table with two categories and then sort by one of them in R?

I have a full dataset of observations and over 40 columns of categories, but I only want two, NameID and Error, and I want to sort Error in descending order while still having NameID connected to each observation. Here is some code I've tried:
z<-15
sort(data.frame(skill$Error,skill$NameID),decreasing = TRUE)[1:z]
data.frame(skill$NameID, sort(skill$Error, decreasing=T)[1:z])
error2<-skill[order(Error , )]
Hopefully from what I've tried you can understand what I'm trying to do. Again, I want to pull two columns from my skill dataset, Error and NameID, but have Error sorted and NameID attached to the sorted values at the same time. I need this all done inside R. Thanks!
df <- data.frame(Error=skill$Error,NameID=skill$NameID)
df <- df[order(df$Error, decreasing=TRUE), ]
Best of luck with whatever you are doing. Hopefully you have someone else to learn some R from.
Assuming that skill is a data frame
Errors <- skill[,c("Error","NameID")]
Errors <- Errors[order(-Errors$Error),]
You don't ever want to use sort on a data frame column, because it sorts that column independently of the rest of the data frame. You want order instead: order keeps the links to the other columns intact.
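A toy example of the difference, using a small made-up skill data frame:

skill <- data.frame(Error = c(3, 1, 2), NameID = c("a", "b", "c"))
sort(skill$Error, decreasing = TRUE)            # just a vector; NameID is lost
skill[order(skill$Error, decreasing = TRUE), ]  # rows stay linked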
