Giving a unique identifier to every execution in R

Can anyone please tell me how to assign a unique value to a result set every time it is executed? As displayed in the table below, an entry should be added in front of every record, and this entry should be the same for all records obtained during a single execution. The purpose of this is to be able to extract all the records in the future with a short statement like (where Unique ID = A_Ground_01). Thanks
  User  DateTime    Latitude  Longitude  Floor   **Unique ID**
1 A     2017-06-15  47.29404  5.010650   Ground  A_Ground_01
2 A     2017-06-15  47.29403  5.010634   Ground  A_Ground_01
3 A     2017-06-15  47.29403  5.010668   Ground  A_Ground_02
4 A     2017-06-15  47.29403  5.010663   Ground  A_Ground_02

Without knowing anything about your initial dataframe or the function being executed, I might recommend something similar to the following.
In this example I'll assume you have a main dataframe, which we'll call df.main, and some new data you will be binding to it, which we'll call df.newdata.
1. Create a column in your main dataframe, df.main$ExecID, that will contain integer values.
2. Run whatever your function is and assign df.newdata$ExecID <- max(df.main$ExecID) + 1.
3. Generate the unique id using df.newdata$UniqueID <- paste(df.newdata$User, df.newdata$Floor, df.newdata$ExecID, sep = "_").
4. Then run rbind(df.main, df.newdata).
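Put together, a minimal sketch might look like this (the starting contents of df.main and df.newdata below are assumptions for illustration, not from your question):

# Main dataframe accumulated from earlier executions (assumed structure)
df.main <- data.frame(User = "A", Floor = "Ground", ExecID = 1L,
                      UniqueID = "A_Ground_01", stringsAsFactors = FALSE)
# New records produced by the current execution (assumed contents)
df.newdata <- data.frame(User = c("A", "A"), Floor = c("Ground", "Ground"),
                         stringsAsFactors = FALSE)
# One id per execution: previous maximum plus one
df.newdata$ExecID <- max(df.main$ExecID) + 1L
# Zero-pad to match the A_Ground_01 style shown in the question
df.newdata$UniqueID <- paste(df.newdata$User, df.newdata$Floor,
                             sprintf("%02d", df.newdata$ExecID), sep = "_")
df.main <- rbind(df.main, df.newdata)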
To provide a better solution for your specific situation, we really would need to see example code of how your script is written.

Related

R: how to extract the first integer or decimal number from a text, and if the first number equals specific numbers, extract the second integer/decimal

The data is like this:
example - name of the database
detail - the first column, which contains a string with numbers in it (the number can be attached to $ etc., like 25m$, and can also be decimal, like 1.2m$ or $1.2M)
Let's say the data table looks like this:
example$detail <- c("The cole mine market worth every year 100M$ and the equipment they use worth 30$m per capita", "In 2017 the first enterpenur realized there is a potential of 500$M in cole mining", "The cole can make 23b$ per year ans help 1000000 familys living on it")
I want to add a column to the example data table, named "number", that will extract the first number in the string in column "detail". BUT if this number is equal to one of the numbers in the vector "years" (it's not in the example database - it's a separate list I created), I want it to extract the second number of the string example$detail.
So I created another years list (separate from the database):
years <- c(2016:2030)
I'm trying to create a new column - number.
What I did so far:
I managed to add a variable that extracts the first number of a string, by writing the following commands:
example$number <- as.integer(sub("\\D*(\\d+).*", "\\1", example$detail)) # EXTRACT ONLY INTEGERS
example$number1 <- format(round(as.numeric(str_extract(example$detail, "\\d+\\.*\\d*")), 2), nsmall = 2) # EXTRACT THE NUMBERS AS DECIMALS WITH TWO DIGITS AFTER THE . (IT'S ENOUGH FOR ME)
example$number1 <- ifelse(example$number %in% years, TRUE, example$number1) # IF THE FIRST NUMBER EXTRACTED IS IN THE YEARS VECTOR, RETURN "TRUE"
And then I tried to write code that extracts the second number according to this condition, but it's not working and just returns errors.
I tried:
gsub("[^\d]*[\d]+[^\d]+([\d]+)", example$detail)
str_extract(example$detail, "\d+(?=[A-Z\s.]+$)",[[2]])
as.integer( sub("\\D*(\\d+).*", "\\1", example$detail) )
as.numeric(strsplit(example$detail, "\\D+")[1])
I didn't understand how to symbolize any number (integer/digits) or how to symbolize THE SECOND number in a string.
Thanks a lot!!
Since no good example data is provided I'm just going to 'wing-it' here.
Imagine the dataframe df has the columns year (int) and details (char), then
library(dplyr)
df <- df %>% mutate(
  clean_details = gsub("[^0-9.-]", "", details),
  clean_details_part1 = sapply(strsplit(clean_details, "[.]"), function(x) as.integer(x[1])),
  clean_details_part2 = sapply(strsplit(clean_details, "[.]"), function(x) as.integer(x[2]))
)
This works with the code I wrote up. I didn't apply the year logic because I see you're proficient enough to do that. I believe a simple ifelse statement would do to create a boolean, and then you can filter on that boolean, or take a more direct route.
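If it helps, here is a minimal sketch of that year check (not part of the original answer; it uses stringr's str_extract_all instead of the splitting above, and the years vector and column names are taken from the question):

library(stringr)
years <- 2016:2030
# All numbers (integer or decimal) found in each string, in order of appearance
nums <- str_extract_all(example$detail, "\\d+\\.?\\d*")
first <- as.numeric(sapply(nums, `[`, 1))
second <- as.numeric(sapply(nums, `[`, 2))
# If the first number is a year, fall back to the second number
example$number <- ifelse(first %in% years, second, first)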

R set column value to be other column value based on string search

I'm trying to find a clean way to get the first column of my DT, for each row, to be equal to the user_id found in other columns. That is, I must perform a search of "user_id" across each row, and return the entirety of the cell where the instance is found.
I first tried to get the index of the column where the partial match is found, and then use this to set the first column's values, but it did not work. Example:
   user_id          1             2
1:     N/A        300    user_id154
2:     N/A user_id301 user_id125040
3:     N/A        302      user_id2
For instance, I want to obtain the following
**user_id**
user_id154
user_id301
user_id2
Please bear in mind that I am new to such data formatting in R (most of the work I do does not involve cleaning JSON files..), and that my data.table has over 1M rows. The answer does not need to be super efficient, but it definitely shouldn't take more than 5 minutes, or it will be considered too slow by my boss.
Hopefully it is understandable
I'm sure someone will provide a more elegant solution, but this does the trick:
dt[, user_id := str_extract(str_c(`1`, `2`), "user_id[0-9]*")]
This first combines all columns row-per-row, then for each row, looks for the first user_id in the combined value.
(Requires the stringr package)
For every row in your table, grep the first value that has "user_id" in it and put the result into the user_id column.
df$user_id <- apply(df, 1, function(x) grep("user_id", x, value = TRUE)[1])
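For reference, a quick reproducible check of the str_extract approach on data shaped like the printout above (the values here are assumed):

library(data.table)
library(stringr)
dt <- data.table(user_id = NA_character_,
                 `1` = c("300", "user_id301", "302"),
                 `2` = c("user_id154", "user_id125040", "user_id2"))
dt[, user_id := str_extract(str_c(`1`, `2`), "user_id[0-9]*")]
dt$user_id
# [1] "user_id154" "user_id301" "user_id2"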

Return matching names instead of binary variables in R

I'm new here and diving into R, and I'm encountering a problem while trying to solve a knapsack problem.
For optimization purposes I wrote a dynamic program in R. However, now that I am at the point of returning the items, which I succeeded in, I only get binary numbers saying whether each item has been selected or not (1 = yes), like this:
Select
[1] 1 0 0 1
However, I would now like the Select function to return the names of the items instead of these binary values. Below I created an example of what my problem looks like.
This would be the data and a related data frame.
items <- c("Glasses","gloves","shoes")
grams <- c(4,2,3)
value <- c(100,20,50)
data <- data.frame(items,grams,value)
Now, I created various functions, with the final one indicating whether a product has been selected by 1 (yes) or 0 (no), like above. However, I would really like for it to return the related name of the item. Is there a way to do this by linking back to the data frame created?
So that, in case all products are selected, instead of
Select
[1] 1 1 1
it would return
Select
[1] Glasses gloves shoes
I believe I would have to create a new function. But as I mentioned, is there a good way to refer back to the data frame to take related values from another column in the data frame in case of a 1 (yes)?
I really hope my question is clearer now and someone can point me in the right direction.
Best, Berber
Let's say your binary vector is
idx <- c(1, 0, 1)
then
items[as.logical(idx)]
will give you the names of the selected items, and
items[!as.logical(idx)]
will give you the names of the unselected items.
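Tied back to the example data in the question, a quick usage check (the idx values here are assumed):

items <- c("Glasses", "gloves", "shoes")
idx <- c(1, 0, 1)
items[as.logical(idx)]
# [1] "Glasses" "shoes"
items[!as.logical(idx)]
# [1] "gloves"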

Convert data frame to matrix without looping

The Question:
I have a data frame with a column that shows whether an event occurred, and columns for month, day, and year. These last 3 were converted to a date vector. I want to make a matrix that shows whether or not an event occurred within a given time period. In this matrix, a row represents a site and a column a date. I was able to write a for loop to do it, but it seemed like there might be a better way to do this, either with apply or some other basic operation. How would you do this?
The Code:
# Initialize events matrix
events = matrix(FALSE, nrow(predicted), ncol(predicted))
# Mark the presence of events
for (i in 1:nrow(events)) {
  if ((days_from_start[i] > -1) & (days_from_start[i] <= ncol(predicted)))
    events[i, days_from_start[i]] = !input_data$Event[i]
}
The Background:
The next step is to compare the events matrix against various model outputs with the same shape. There are relatively few events in the data frame compared to the matrix size; the (probably incorrect) assumption is that the data frame completely lists all events and that unlisted matrix cells did not experience an event. I’m very new to R, so I’d be interested in hearing about other approaches to the same problem, if you think I’m going about it the hard way.
The Data:
> input_data$Event[1:5]
[1] FALSE FALSE FALSE FALSE TRUE
> input_data$Year[1:5]
[1] 2010 2010 2011 2010 2010
> days_from_start[1:5]
Time differences in days
[1] 834 1018 1106 847 1055
> dim(predicted)
[1] 649 732
Since events[i, days_from_start[i]] is accessing more or less random locations in the events matrix (since presumably there is no pattern to days_from_start), it may be difficult not to use a loop. Possibly something like the following will work. I haven't tested this, since you posted no datasets.
foo <- (days_from_start > -1) & (days_from_start <= ncol(predicted))
rows <- seq_along(days_from_start)[foo]  # equivalently, which(foo)
index_matrix <- cbind(rows, days_from_start[rows])
events[index_matrix] <- !input_data$Event[index_matrix[, 1]]
The first line creates a vector of logicals, TRUE where you want to do something.
The next two lines create a set of (row, column) index pairs where you're going to insert data into the events matrix. The last line does the insertion.
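For anyone new to this idiom, here is a self-contained toy example of indexing a matrix with a two-column index matrix (all values made up):

m <- matrix(FALSE, nrow = 3, ncol = 4)
ij <- cbind(c(1, 3), c(2, 4))  # (row, column) pairs to set
m[ij] <- TRUE
m
#       [,1]  [,2]  [,3]  [,4]
# [1,] FALSE  TRUE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE  TRUE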

Reading Raw Data in SAS Or R

For our analysis we need to read raw data from csv (xls) and convert it into a SAS dataset before doing our analysis.
Now, the problem is that this raw data generally has 2 issues:
1. The ordering of columns changes sometimes. So, if in an earlier period we have columns in the order of variable A, then B, then C, etc., it might change to B, then C, then A.
2. There are foreign elements like "#", or ".", or "some letters", etc.
Now, we have to clean the raw data before reading it into SAS. This takes a considerable amount of time. Is there any way we can clean the data within the SAS system itself as we read it? If we can rectify the data with SAS code, it will save quite a lot of time.
Here's the example:
Period 1: I got the data in Data1.csv in this format. In column B, which is numeric, I have "#" & ".". And in column C, which is also numeric, I have "g". If I import Data1.csv using either PROC IMPORT or an INFILE statement, these foreign elements in columns B & C will remain. The question here is how to remove them. I could use an IF statement, but the problem is there are too many foreign elements (e.g. instead of "#", ".", "g", I might get other foreign elements like "$", "h", etc.). Is there any way to have code which detects & removes foreign elements without my having to specify them with an IF statement every time I import the raw data into SAS?
A B C
Name1 1 5
Name2 2 6
Name3 3 4
Name4 # g
Name5 5 3
Name6 . 6
Period 2: In this period I got DATA2.csv, which is given below. When I use an INFILE statement, I specify that first A should be read with a specific name, then B, & then C. In the 2nd period, when I get the data, B is given first. So when SAS reads the data I have B instead of A. So I have to check the variable ordering against the previous period's data every time & correct it before reading the data with the INFILE statement. Since the number of variables is very large, it's very time consuming (& at times frustrating) to verify the column ordering in this fashion. Is there SAS code with which SAS will automatically read A, then B, & then C, even when they are not in that order?
B A C
1 Name1 5
2 Name2 6
3 Name3 4
# Name4 g
5 Name5 3
. Name6 6
I mainly use SAS for my analysis, but I can use R to clean the data and then read it into SAS for further analysis, so R code would also be helpful.
Thanks.
In R you increase the speed of file reading when you specify that a column is a particular class. With the example provided (3 columns, with the middle one being "character"), you might use this code:
dat <- read.csv( filename, colClasses=c("numeric", "character", "numeric"), comment.char="")
The "#" and "." would become NA values when encountered in the numeric columns. The above code removes the default specification of the comment character which is "#". If you wanted the "#" and "." entries in character columns to be coerced to NA_character_, you could use this code:
dat <- read.csv( filename,
                 colClasses=c("numeric", "character", "numeric"),
                 comment.char="",
                 na.strings=c("NA", ".", "#") )
By default the header=TRUE setting is assumed by read.csv(), but if you used read.table() you would need to assert header=TRUE with the two file structures you showed. There are further documentation and worked examples of reading Excel data elsewhere; however, my advice is to do as you are planning and use CSV transfer. You will see the screwy things Excel does with dates and missing values more quickly that way. You would be well advised to change the date formats to a custom "yyyy-mm-dd" in agreement with the POSIX standard, in which case you can also specify a "Date" classed column and skip the process of turning character classed columns in the default Excel formats (all of which are bad) into dates.
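As a minimal sketch of that last point, if the middle column held dates already formatted as yyyy-mm-dd (an assumed layout, not the files shown above), colClasses can read it straight into a Date column:

dat <- read.csv(filename,
                colClasses = c("character", "Date", "numeric"),
                na.strings = c("NA", ".", "#"))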
Yes, you can use SAS to do any kind of "data cleaning" you might imagine. The SAS DATA step language is full of features to do things like this, but there is no magic bullet; you need to write the code yourself.
A csv file is just a plain text file (very different from an xls file). Normally the first row in a csv file contains column names and the data begins with the second row. If you use PROC IMPORT, SAS will use the first row to construct variable names and try to determine data types by scanning the first several rows of the file. For example:
proc import datafile='c:\temp\somefile.csv'
            out=SASdata
            dbms=csv replace;
run;
Alternatively, you can read the file with a data step. This would require that you know the file layout in advance. For example:
data SASdata;
  infile 'c:\temp\somefile.csv' dsd firstobs=2 lrecl=32767 truncover;
  informat A $50.;           /* A character variable with max length 50 */
  informat B yymmdd10.;      /* A date presented like 2012-08-25 */
  informat C dollar12.;      /* A number containing dollar sign, commas, or decimals */
  input A B C;               /* The order of the variables in the file */
  if B = . then B = today(); /* A possible data cleaning statement */
run;
Note that the INPUT statement controls the order that the variables exist in the file. The point is that the code you use must match the layout of each file you process.
These are just general comments. If you encounter problems, post back with a more specific question.
UPDATE FOR UPDATED QUESTION: The variables from the raw data file must be listed in the INPUT statement in the same order as they exist in each file. Also, you need to define the column types directly and establish whatever rules they need to follow. There is no way to do this automatically; each file must be treated separately.
In this case, let's assume your variables are A, B, and C, where A is character and B and C are numbers. This program might process both files and add them to a history dataset (let's say ALLDATA):
data temp;
  infile 'c:\temp\data1.csv' dsd firstobs=2 lrecl=32767 truncover;
  /* Define dataset variables */
  informat A $50.;
  informat B 12.;
  informat C 12.;
  /* Add a KEEP statement to keep only the variables you want */
  keep A B C;
  input A B C;
run;
proc append base=ALLDATA data=temp;
run;

data temp;
  infile 'c:\temp\data2.csv' dsd firstobs=2 lrecl=32767 truncover;
  informat A $50.;
  informat B 12.;
  informat C 12.;
  input B A C;
run;
proc append base=ALLDATA data=temp;
run;
Notice that the "data definition" part of each data step is the same; the only difference is the order of variables listed in the INPUT statement. Notice also that because the variables B and C are defined as numeric, when those invalid characters (# and g) are read, the values are stored as missing values.
In your case, I'd create a template SAS program to define all the variables you want in the order you expect them to be. Then use that template to import each file using the order of the variables in that file. Setting up the template program might take a while, but to run it you would only need to modify the INPUT statement.
