Data transforming: not recognising values in R

I am trying to create a new variable (IMD_New) in R which is based on the values of another variable (IMD_Raw). The code I have written recognises some of the values in IMD_Raw, but it fails to work consistently for one specific label ("1"). No error messages are produced. I have 400+ observations, with a very small number of NAs. Here is an example dataframe (data_IMD) with current and expected outcomes:
Current:
ID IMD_Raw IMD_New
1 26022 3
2 7847 7847
3 12004 1
4 24622 2
5 8810 8810
Anticipated
ID IMD_Raw IMD_New
1 26022 3
2 7847 1
3 12004 1
4 24622 2
5 8810 1
Here is the code I am using:
data_IMD$IMD_New <- 1
data_IMD$IMD_New <- data_IMD$IMD_Raw
data_IMD$IMD_New[data_IMD$IMD_Raw >= 0 & data_IMD$IMD_Raw <"150242"] <- "1"
data_IMD$IMD_New[data_IMD$IMD_Raw >"150241" & data_IMD$IMD_Raw <="24683"] <- "2"
data_IMD$IMD_New[data_IMD$IMD_Raw >24683 & data_IMD$IMD_Raw <="32831"] <- "3"
Any suggestions for what I'm doing wrong? For some reason it is failing to recognize that ID 2 should be classified as '1' because their IMD_Raw is greater than 0 and less than 150242.
Thanks!
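A minimal sketch of what is likely going on, assuming IMD_Raw is numeric: when a number is compared with a quoted number, R converts the number to text and compares the two as strings, which matches the "Current" output above. The cut points in the second half are hypothetical (in particular 15024 is a guess used only for illustration, not a value taken from the question), and using cut() is one possible approach rather than the only fix.

```r
# Comparing a number with a quoted number converts the number to text,
# so the comparison is done character by character
7847 < "150242"    # FALSE: "7" sorts after "1"
12004 < "150242"   # TRUE:  "12..." sorts before "15..."

data_IMD <- data.frame(ID = 1:5,
                       IMD_Raw = c(26022, 7847, 12004, 24622, 8810))

# Hypothetical numeric cut points (15024 is an assumption for illustration)
breaks <- c(0, 15024, 24683, 32831)

# cut() bins numeric values; intervals are (lower, upper], with 0 included
data_IMD$IMD_New <- cut(data_IMD$IMD_Raw,
                        breaks = breaks,
                        labels = c("1", "2", "3"),
                        include.lowest = TRUE)
```

With unquoted thresholds the comparisons stay numeric, so values such as 7847 and 8810 fall in the first band.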

Related

Creating sublists from one bigger list

I am writing my Thesis in R and I would like, if possible, some help in a problem that I have.
I have a table called tkalp with 2 columns and 3001 rows. After a 'subset' command that I wrote, the table now contains 1084 rows and is called kp. Some values of kp are:
As you can see, some values in column V1 are consecutive with a step of 2 and some are not.
So my difficulty is:
1. I would like to 'break' this big list/table into smaller lists/tables that contain only consecutive numbers. I tried to implement this with the following commands, but it didn't go as planned:
for (n in 1:nrow(kp)) {
kp1 <- subset(kp, kp[n+1,1] - kp[n,1])==2)
}
2. After completing this task I would like to keep only the sublists that contain more than 10 rows.
Any idea or help is more than welcome! Thank you very much
EDIT
I have uploaded a picture of my table and I have separated the numbers that I want to be contained in different tables. I would like to do that for the whole original table.
blue is one smaller table than the original
black another
yellow another
red another
And after I create all those smaller tables I would like to keep only the tables that contain more than 10 numbers. For example I don't want to keep the yellow table since it contains only 4 numbers.
Thank you again
What about
df <- data.frame(V1=c(1,3,5,10,12,14, 20, 22), V2=runif(8))
df$diff <- c(2,diff(df$V1))
df$numSubset <- cumsum(df$diff != 2) + 1
iter <- seq(max(df$numSubset))
listOfSubsets <- purrr::map(iter, function(i) dplyr::filter(df, numSubset == i))
Then loop through the list and select only the subsets you want. purrr also provides a way to filter the resulting list without looping; check the purrr documentation.
With base R
kp=data.frame(V1=c(seq(8628,8618,by=-2),seq(8576,8566,by=-2),78,76),V2=runif(14))
kp$diffV1=c(-2,diff(kp$V1))
kp$group=cumsum(ifelse(kp$diffV1/-2==1,0,1))+1
lkp=split(kp,kp$group)
# > kp
# V1 V2 diffV1 group
# 1 8628 0.74304325 -2 1
# 2 8626 0.84658101 -2 1
# 3 8624 0.74540089 -2 1
# 4 8622 0.83551473 -2 1
# 5 8620 0.63605222 -2 1
# 6 8618 0.92702915 -2 1
# 7 8576 0.81978587 -42 2
# 8 8574 0.01661538 -2 2
# 9 8572 0.52313859 -2 2
# 10 8570 0.39997951 -2 2
# 11 8568 0.61444445 -2 2
# 12 8566 0.23570017 -2 2
# 13 78 0.58397923 -8488 3
# 14 76 0.03634809 -2 3
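The second step of the question (keeping only the sub-tables with more than 10 rows) could be sketched as follows. This is a minimal example on made-up data, assuming the values decrease in steps of 2 as in the base R answer; the names kp, lkp and long_runs are only illustrative.

```r
# Hypothetical data: a run of 12 values (step -2) followed by a run of 4
kp <- data.frame(V1 = c(seq(100, 78, by = -2), seq(50, 44, by = -2)))

# A new group starts wherever the step from the previous row is not -2
kp$group <- cumsum(c(FALSE, diff(kp$V1) != -2)) + 1

# One data frame per run of consecutive values
lkp <- split(kp, kp$group)

# Keep only the sub-tables that contain more than 10 rows
long_runs <- Filter(function(d) nrow(d) > 10, lkp)
```

Filter() is base R, so no extra package is needed; here only the first run survives.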

If (condition), add 1 to previous value, else, subtract 1

I'm tracking Meals and satiety in a dataframe. I would like to have R add 1 to the previous value in the satiety column when a meal is eaten, and subtract 1 when no meal is eaten (meal=NA).
I'm trying to accomplish this with a for loop nested in an ifelse statement but it is not working.
My current attempt:
ifelse(Meals=="NA",for (i in 1:length(Day$Fullness)){
print(Day$Fullness[[i]]-1+i)}, for (i in 1:length(Day$Fullness)){
print(Day$Fullness[[i]]+1+i)}
Error: Error in ans[test & ok] <- rep(yes, length.out = length(ans))
[test & ok] :
replacement has length zero
In addition: Warning message:
In rep(yes, length.out = length(ans)) :
'x' is NULL so the result will be NULL
I'm not sure how to create a table on here but I will do my best to make sense.
Time: 9:30 AM 10:00 AM 10:30 AM ETC
Meals: NA NA Breakfast NA NA Snack NA NA NA ETC
Satiety: Range from 0-10.
My current satiety data is just a vector I created, but I would like it to start at 0 and increase by 1 after every meal, while decreasing by 1 after every 30-minute timeframe where there is no meal (where meal = NA).
I'm sure there is a much better way to do this.
Thank you.
Here's some sample data and a potential solution.
set.seed(123)
meals <- sample(c(1, 1, 1, NA), 20, replace = TRUE)
df <- data.frame(meals = meals)
head(df)
# meals
# 1 1
# 2 NA
# 3 1
# 4 NA
# 5 NA
# 6 1
df$meals[is.na(df$meals)] <- -1
df$satiety <- cumsum(df$meals)
head(df)
# meals satiety
# 1 1 1
# 2 -1 0
# 3 1 1
# 4 -1 0
# 5 -1 -1
# 6 1 0
tail(df)
# meals satiety
# 15 1 5
# 16 -1 4
# 17 1 5
# 18 1 6
# 19 1 7
# 20 -1 6
I would suggest not coding the absence of a meal (or a skipped meal) as NA, which means "I don't know". If you're using NA to mean the meal was skipped, then you do actually know and you should give it a value that represents a skipped meal. Here, since your model interprets a skipped meal as having a negative impact on satiety (not a neutral impact), -1 actually makes quite a lot of sense. If that's how you use it in your model, then code it that way.
A couple of things here.
Unless the data includes the string "NA", you should use the command is.na(x) to check if a value or values are NA. It's hard to tell however without sample data.
Generally speaking, in R you will want to use vectorised solutions. In many cases, if you're using a for loop, it's incorrect.
You've stated that "Meals" is in a dataframe. As such, you will need to refer to Meals as a subset of that data frame. For example, if the data frame is data, then the expression should be data$Meals.
Summarising all of this, I'd probably do something similar to the following:
Day$Meals.na <- is.na(Day$Meals)
print(Day$Fullness + (-1)^Day$Meals.na)
This uses a nice trick: TRUE and FALSE are treated as 1 and 0 respectively in arithmetic, so (-1)^TRUE is -1 and (-1)^FALSE is 1.
Hopefully this helps. If not, we'd really need sample data and expected outputs to be able to be of more use.
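Combining is.na() with a cumulative sum gives a running satiety value rather than a one-off adjustment. The sketch below assumes the Meals vector from the question's example and a starting satiety of 0; the clamping to the 0-10 range via Reduce() is an assumption about what "Range from 0-10" should mean, not something the question spells out.

```r
# Meals as in the question's example: NA means no meal in that half hour
Meals <- c(NA, NA, "Breakfast", NA, NA, "Snack", NA, NA, NA)

# +1 when a meal was eaten, -1 when it was not
step <- ifelse(is.na(Meals), -1, 1)

# Unbounded running total, starting from 0
running <- cumsum(step)

# Running total kept within 0..10 (assumed bounds); drop the initial 0
clamped <- Reduce(function(prev, s) min(10, max(0, prev + s)),
                  step, accumulate = TRUE, 0)[-1]
```

cumsum() is fully vectorised; the Reduce() version is only needed because the clamp makes each value depend on the previous clamped value.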

Trace back to certain individuals

I have the following data and my issue is this: at some point in time and at a certain place, a contamination occurs. I don't know who caused it, but I want to be able to trace it back with the largest likelihood possible. I need a probability for each individual of being the cause of this contamination. This is what the desired column "Prob_Contiminator" should show.
I know certain times at which a note was made about a contamination, but this is only the time at which the contamination was reported. What I have in mind is a high probability of having caused the contamination if someone is temporally close to the occurrence, which decreases the further away an individual's observation is from the contamination.
It is important that individuals are only considered to have caused the contamination if they have the same location_id as the occurrence row. Another problem is that people who frequently appear in the data would automatically appear to have caused the contamination more often. I additionally have data on when cleaning occurred. I thought about limiting the observations of these "frequent users" to the one observation that is closest to the event within one cleaning interval. How do I properly identify the contaminators without discriminating against the people who happen to be "heavy users"?
Data:
"Event_ID" "Person_ID" "Note" "time" "location_id" "Cleaning"
1 1 "" 1990-01-01 1 1
2 1 "" 1990-01-02 1 0
3 2 "" 1990-01-03 1 0
4 3 "Occured" 1993-01-03 1 1
5 3 "" 1995-01-04 2 0
6 3 "" 1995-01-04 2 0
7 4 "" 1995-01-04 3 0
8 5 "" 1995-01-05 6 0
9 6 "" 1995-01-05 5 0
10 7 "Ocurred" 1995-01-05 6 1
This is what I need (Prob_Contaminator column not complete):
"Event_ID" "Person_ID" "Note" "time" "location_id" "Cleaning" "Prob_Contaminator"
1 1 "" 1990-01-01 1 1 0.4
2 1 "" 1990-01-02 1 0 0.4
3 2 "" 1990-01-03 1 0 0.6
4 3 "Occured" 1993-01-03 1 1
5 3 "" 1995-01-04 2 0
6 3 "" 1995-01-04 2 0
7 4 "" 1995-01-04 3 0
8 5 "" 1995-01-05 6 0
9 6 "" 1995-01-05 5 0
10 7 "Ocurred" 1995-01-05 6 1
The following example shows how I imagine the column Prob_Contaminated to be constructed. If we consider row number 4 (Event_ID = 4), we see that a contamination has occurred. Now I want to look back at all events since the last cleaning (in this case 3 events, based on the cleaning that took place at Event_ID = 1) and consider how far away they are from the contamination event. This should only happen for events that take place at the same location_id. Since the location_id is the same in this example (= 1), the probability of being the contaminator for each of these 3 events is 1/3. Multiple events of one person should be reduced to the one closest in time to the contamination. This reduces the cases to two, and so the probability for Person_ID 1 and Person_ID 2 is 1/2 each. Additionally, I want to weight each probability by its distance in time from the contamination. Since the "time" value of Person_ID = 2 is closer to the contamination row than the "time" value of Person_ID = 1, the Prob_Contaminated for Person_ID = 2 should be weighted higher. In this case I applied a weight of 1.2 to the more recent event (1.2*0.5 = 0.6) and a weight of 0.8 to the less recent event (0.8*0.5 = 0.4).
Code:
df <- data.frame(Event_ID = c(1:10),
Person_ID = c("1","1","2","3","3","3","4","5","6","7"),
Note = c("","","","Occured","","","","","","Ocurred"),
time = as.Date(c('1990-1-1','1990-1-2','1990-1-3','1993-1-3','1995-1-4','1995-1-4','1995-1-4',"1995-1-5","1995-1-5","1995-1-5")),
location_id = c("1","1","1","1","2","2","3","6","5","6"),
Cleaning = c("1","0","0","1","0","0","0","0","0","1"))
df2 <- data.frame(Event_ID = c(1:10),
Person_ID = c("1","1","2","3","3","3","4","5","6","7"),
Note = c("","","","Occured","","","","","","Ocurred"),
time = as.Date(c('1990-1-1','1990-1-2','1990-1-3','1993-1-3','1995-1-4','1995-1-4','1995-1-4',"1995-1-5","1995-1-5","1995-1-5")),
location_id = c("1","1","1","1","2","2","3","6","5","6"),
Cleaning = c("1","0","0","1","0","0","0","0","0","1"),
Prob_Contiminator = c("0.4","0.4","0.6","","","","","","",""))
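The procedure described above could be sketched roughly as follows. This is only one possible reading, under several assumptions that are not in the question: Event_ID is in chronological order, Cleaning is stored as a number, and the weight is the inverse of (1 + days before the occurrence) instead of the hand-picked 1.2 and 0.8. The function name contaminator_probs is hypothetical.

```r
df <- data.frame(
  Event_ID    = 1:10,
  Person_ID   = c("1","1","2","3","3","3","4","5","6","7"),
  Note        = c("","","","Occured","","","","","","Ocurred"),
  time        = as.Date(c("1990-01-01","1990-01-02","1990-01-03","1993-01-03",
                          "1995-01-04","1995-01-04","1995-01-04","1995-01-05",
                          "1995-01-05","1995-01-05")),
  location_id = c("1","1","1","1","2","2","3","6","5","6"),
  Cleaning    = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 1),   # numeric here (assumption)
  stringsAsFactors = FALSE
)

# Hypothetical helper: probabilities for one reported contamination (occ_id)
contaminator_probs <- function(df, occ_id) {
  occ  <- df[df$Event_ID == occ_id, ]
  # Earlier events at the same location only
  prev <- df[df$Event_ID < occ_id & df$location_id == occ$location_id, ]
  # Restrict to events from the most recent cleaning onwards, if there was one
  cleaned <- prev$Event_ID[prev$Cleaning == 1]
  if (length(cleaned) > 0) prev <- prev[prev$Event_ID >= max(cleaned), ]
  if (nrow(prev) == 0) return(NULL)
  # One row per person: the latest event, i.e. the one closest to the occurrence
  prev <- prev[order(prev$Event_ID, decreasing = TRUE), ]
  prev <- prev[!duplicated(prev$Person_ID), ]
  # Assumed weighting: closer in time means a larger weight
  w <- 1 / (1 + as.numeric(occ$time - prev$time))
  data.frame(Event_ID = prev$Event_ID, Person_ID = prev$Person_ID,
             Prob = w / sum(w))
}

res <- contaminator_probs(df, 4)
```

For Event_ID 4 this returns one row each for Person_ID 1 and 2, with the probabilities summing to 1 and Person_ID 2 slightly ahead. Because the two candidate events are years before the report, the inverse-distance weights are nearly equal; a different decay function would be needed to reproduce the 0.6 and 0.4 split.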

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in "possible" called "between" and a column in the "truth" data frame called "match". For every value from "possible" that falls between, I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID. Instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
library(data.table)
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
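A loop-free alternative in base R is sketched below, using the data from the question. It assumes "between" means Start <= Times <= Stop (inclusive on both ends); the column names Between and Match are taken from the question's description, and hit is just an illustrative name.

```r
truth <- data.frame(ID    = 1:5,
                    Start = c(213466, 32238, 218934, 222774, 68137),
                    Stop  = c(213468, 32240, 218936, 222776, 68139))
possible <- data.frame(ID    = 1:8,
                       Times = c(32239.76, 32241.14, 68138.72, 111233.93,
                                 128395.28, 146180.31, 188433.35, 198714.7))

# hit[i, j] is TRUE when possible$Times[i] lies inside truth row j's interval
hit <- outer(possible$Times, truth$Start, ">=") &
       outer(possible$Times, truth$Stop,  "<=")

# 1 if a time falls inside any interval, 0 otherwise
possible$Between <- as.integer(rowSums(hit) > 0)
# 1 if an interval contains at least one time, 0 otherwise
truth$Match <- as.integer(colSums(hit) > 0)
```

outer() builds a full times-by-intervals matrix, so this is convenient for moderate sizes but memory-hungry for very large inputs. To allow a tolerance of 2 seconds around each interval, the same comparison could be run against truth$Start - 2 and truth$Stop + 2.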

New calculation loop

I want to have a loop that will perform a calculation for me, and export the variable (along with identifying information) into a new data frame.
My data look like this:
Each unique sampling point (UNIQUE) has 4 data points associated with it (they differ by WAVE).
WAVE REFLECT REFEREN PLOT LOCAT COMCOMP DATE UNIQUE
1 679.9 119 0 1 1 1 11.16.12 1
2 799.9 119 0 1 1 1 11.16.12 1
3 899.8 117 0 1 1 1 11.16.12 1
4 970.3 113 0 1 1 1 11.16.12 1
5 679.9 914 31504 1 2 1 11.16.12 2
6 799.9 1693 25194 1 2 1 11.16.12 2
And I want to create a new data frame that will look like this:
For each unique sampling point, I want to calculate "WBI" from 2 specific "WAVE" measurements.
WBI PLOT .... UNIQUE
(WAVE==899.8/WAVE==970) 1 1
(WAVE==899.8/WAVE==970) 1 2
(WAVE==899.8/WAVE==970) 1 3
Depending on the size of your input data.frame there could be a better solution in terms of efficiency, but the following should work fine for small or medium data sets and is fairly simple:
out.unique = unique(input$UNIQUE);
out.plot = sapply(out.unique,simplify=T,function(uq) {
# assuming that plot is simply the first PLOT of those belonging to that
# unique number. If not, you should change this.
subset(input,subset= UNIQUE == uq)$PLOT[1];
});
out.wbi = sapply(out.unique,simplify=T,function(uq) {
# not sure how you compose WBI, but I assume it uses the last two
# records with that unique number, so it matches the first output of your example
uq.subset = subset(input,subset= UNIQUE == uq);
uq.nrow = nrow(uq.subset);
paste("(WAVE=",uq.subset$WAVE[uq.nrow-1],"/WAVE=",uq.subset$WAVE[uq.nrow],")",sep="")
});
output = data.frame(WBI=out.wbi,PLOT=out.plot,UNIQUE=out.unique);
If the input data is big, however, you may want to exploit the fact that the records seem to be sorted by "UNIQUE"; repeated data.frame subsetting would be costly. Both sapply calls could also be combined into one, but that would make the code a bit more cumbersome, so I have left it like this.
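If WBI is meant to be a number rather than a label, one reading of the question is that it is the REFLECT value at WAVE 899.8 divided by the REFLECT value at WAVE 970.3 for each sampling point. That interpretation is an assumption, as are the last two REFLECT values in the sample data below, which are made up because the question only shows two rows for the second sampling point.

```r
input <- data.frame(
  WAVE    = rep(c(679.9, 799.9, 899.8, 970.3), times = 2),
  # First six values are from the question; 1500 and 1200 are made up
  REFLECT = c(119, 119, 117, 113, 914, 1693, 1500, 1200),
  PLOT    = 1,
  UNIQUE  = rep(1:2, each = 4)
)

# Rows holding the numerator and denominator wavelengths
num <- input[input$WAVE == 899.8, c("UNIQUE", "PLOT", "REFLECT")]
den <- input[input$WAVE == 970.3, c("UNIQUE", "REFLECT")]

# Pair them up per sampling point and take the ratio (assumed WBI definition)
output <- merge(num, den, by = "UNIQUE", suffixes = c("_900", "_970"))
output$WBI <- output$REFLECT_900 / output$REFLECT_970
output <- output[, c("WBI", "PLOT", "UNIQUE")]
```

The merge avoids subsetting the data frame once per sampling point. Exact equality on WAVE works here because the values are typed in identically; with measured wavelengths a tolerance such as abs(WAVE - 899.8) < 0.05 may be safer.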
