R: how to conditionally update a numeric value in a data frame

I am trying to update a numeric value in a data frame when it is above a certain threshold because of an input error. The value should be in the hundreds but is occasionally in the thousands because it has an extra zero.
The data frame is called df and the column is called Value1.
Value1 (sample values)
650
6640
550
The value for 6640 should be 664. I am trying to use the following:
df$Value1[df$Value1>1000] <- df$Value1/10
This is generating very odd results. I end up not having values greater than 1000 but, a value of 6640 became 74.1 instead of 664 as I expected.
Any suggestions?
Thanks in advance

Here's how to do this in one line, without having to compute the target row indexes twice:
df$Value1[ris <- which(df$Value1>1000)] <- df$Value1[ris]/10;
df;
## Value1
## 1 650
## 2 664
## 3 550
Data
df <- data.frame(Value1=c(650L,6640L,550L));
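The assignment in the question misbehaves because its right-hand side, df$Value1/10, has one element per row of df, while the left-hand side selects only the rows above 1000. R then takes replacement values from the start of the column rather than from the matching rows. A minimal sketch of the same fix with a named logical mask (too_big is a name made up for this illustration):

```r
# Sample data from the question, stored as plain numerics
df <- data.frame(Value1 = c(650, 6640, 550))

# Logical mask marking the rows that carry the extra zero
too_big <- df$Value1 > 1000

# Use the same mask on both sides so the lengths line up
df$Value1[too_big] <- df$Value1[too_big] / 10
```

Reusing one mask on both sides keeps the selected rows and their replacements aligned.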

Or we can use data.table (data from @bgoldst's post)
library(data.table)
setDT(df)[Value1 > 1000, Value1 := Value1/10]
df
# Value1
#1: 650
#2: 664
#3: 550

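Here is one way, which keeps only the first three digits (so it assumes every correct value has exactly three digits):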
#Sample data frame
d1
Value1
1 650
2 6640
3 550
d1$Value1 = as.numeric(substr(d1$Value1,1,3))
#result
d1
Value1
1 650
2 664
3 550


Copy a subset of a column, based on conditions, to another dataframe in R

I have very limited R skills, and after hours searching for a solution I could not see an option that would work.
I have several large data tables. From each one, I would like to copy part of a column into a data frame, to populate a column there.
My data tables (tabn1, tabn2, tabn3) all have the same format, but with different lengths. Each subset will have a different number of rows. I would want empty spaces to be filled with NA. I can't even copy the first column, so the subsequent are the next problem!
Ro Co Red Green Yellow
1 3 123 999 265
1 3 223 875 5877
1 4 21488 555 478
1 4 558 23698 5558
2 3 558 559 148
2 3 4579 557 59
2 4 1489 545 2369
2 4 123 999 265
3 3 558 559 148
3 3 558 23698 5558
3 4 4579 557 59
3 4 1478 4579 557
4 3 1488 555 478
4 3 1478 2945 5889
4 4 448 259 4548
4 4 26576 158 15
My new data frame col names:
cls <- c("n1","n2","n3")
I created a dataframe with the column names:
df <- setNames(data.frame(matrix(ncol=3)),cls)
For each of my tables, I want to subset Ro >= 3, Co == 3, column "Red" only
I have tried:
sub1 <- filter(tabn1, tabn1$Ro >= 3 | tabn1$Co == 3)
df$n1 <- sub1$Red
> Error in `$<-.data.frame`(`*tmp*`, n1, value = c(183.94, 180.884, :
replacement has 32292 rows, data has 1
Also:
df$n1 <- cut(sub1$Red)
> Error in cut.default(sub1$Red) :
argument "breaks" is missing, with no default
I tried using df as a datatable instead of dataframe, but also got the following errors:
df <- setNames(data.table(matrix(ncol=3)),cls)
df$n1 <- sub1$Red
> Error in set(x, j = name, value = value) :
Supplied 32292 items to be assigned to 1 items of column 'nn1'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
I would subsequently try to subset and copy from tabn2 to df$n2, and so forth. As indicated above, the original tables have different lengths.
Thanks in advance!
The issue is that the number of rows in 'df' and 'sub1' are different. 'df' is created with 1 row. Instead, we can create the 'df' directly from the 'sub1' itself
df <- sub1['Red']
names(df) <- cls[1]
Another way to create the data.frame is to specify nrow as well (note that dimnames is an argument of matrix, so it must go inside that call)
df <- as.data.frame(matrix(nrow = nrow(sub1), ncol = length(cls),
dimnames = list(NULL, cls)))
Regarding the second error with cut, it needs breaks. Either we specify the number of breaks
cut(sub1$Red, breaks = 3)
Or a vector of break points
cut(sub1$Red, breaks = c(-Inf, 100, 500, 1000, Inf))
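To see what cut() returns with explicit break points, here is a small sketch on made-up values. The labels argument is optional, and the label names below are purely illustrative:

```r
# Made-up values standing in for sub1$Red
red <- c(50, 300, 700, 2000)

# Each value is assigned to the interval it falls in; the result is a factor
bins <- cut(red,
            breaks = c(-Inf, 100, 500, 1000, Inf),
            labels = c("low", "mid", "high", "very high"))
```

Note that cut() bins values into categories; it does not subset or copy them, so it is unlikely to be the right tool for filling df$n1.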
If there are many 'tabn' objects, get them into a list, loop over the list with lapply
lst1 <- mget(ls(pattern = '^tabn\\d+$'))
out_lst <- lapply(lst1, function(x) subset(x, Ro >=3 | Co == 3)$Red)
After subsetting and selecting the 'Red' column, the number of elements may differ between tables. If the lengths are different, one option is to pad the shorter vectors with NA at the end before cbinding them
mx <- max(lengths(out_lst))
df <- do.call(cbind, lapply(out_lst, `length<-`, mx))
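The padding step can be tried end to end on two tiny made-up tables. This is only a sketch: the table contents are invented, and it assumes both conditions are meant to hold at once (hence & rather than |), which is how the question's "Ro >= 3, Co = 3" reads.

```r
# Two small stand-ins for tabn1 and tabn2 (values invented for illustration)
tabn1 <- data.frame(Ro = c(1, 3, 3, 4), Co = c(3, 3, 4, 3),
                    Red = c(123, 558, 4579, 1488))
tabn2 <- data.frame(Ro = c(3, 4, 2), Co = c(3, 4, 3),
                    Red = c(21, 22, 23))
tabs <- list(n1 = tabn1, n2 = tabn2)

# Keep 'Red' where Ro >= 3 AND Co == 3 (assumption: both must hold)
reds <- lapply(tabs, function(x) x$Red[x$Ro >= 3 & x$Co == 3])

# Pad every vector with NA up to the longest length, then build the data frame
mx <- max(lengths(reds))
padded <- as.data.frame(lapply(reds, `length<-`, mx))
```

Here padded has columns n1 and n2, with NA filling the rows that the shorter subset cannot supply.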

how to subtract a value from one column from a value from a previous row, different column in r

I have a dataframe composed of 3 columns and ~2000 rows.
ID DistA DistB
1 100 200
2 239 390
3 392 550
4 700 760
5 770 900
The first column (ID) is a unique identifier for each row. I'd like my script to read each row and subtract the value of column "DistB" in the previous row from the value of column "DistA" in the current row. If the difference for any consecutive pair of rows is <40, it should output that they are in the same area.
For example: comparing rows 2 and 1 above, '239' from row 2 minus '200' from row 1 is 39, which is <40, so they are in the same area. In the same way, rows 2 and 3 are in the same area, as the difference is 2 and 2<40. But rows 3 and 4 are not, as the difference is 150.
I have not been able to go far, as I am stuck in the comparison (subtraction/difference) step. I have tried to write a loop to iterate in all the rows, but I keep getting errors. Should I even use a loop, or can I do this without a loop?
I am a new R learner, and this is the 'rookie' code that I have so far. Where am I going wrong. Thanks in advance:
#the function to compare the two columns
funct <- function(x){
for(i in 1:(nrow(dat)))
(as.numeric(dat$DistA[i-1])) - (as.numeric(dat$DistB[i]))}
#creating a new column 'new2' with the differences
dat$new2 <- apply(dat[,c('DistB','DistA')]),1, funct
When I run this, I get the following error:
Error: unexpected ',' in "dat$new2 <- apply(dat[,c('DistB','DistA')]),"
I'll appreciate all the comments/suggestions.
I believe dplyr can help you here.
library(dplyr)
dfData <- data.frame(ID = c(1, 2, 3, 4, 5),
DistA = c(100, 239, 392, 700, 770),
DistB = c(200, 390, 550, 760, 900))
dfData <- mutate(dfData, comparison = DistA - lag(DistB))
This results in...
dfData
ID DistA DistB comparison
1 1 100 200 NA
2 2 239 390 39
3 3 392 550 2
4 4 700 760 150
5 5 770 900 10
You could then check to see if a row is within the same "area" as your previous row.
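One way to carry out that check is sketched below in base R, so it runs without dplyr. The names prev_B and same_area are invented for this sketch, and the 40 threshold comes from the question.

```r
dfData <- data.frame(ID = 1:5,
                     DistA = c(100, 239, 392, 700, 770),
                     DistB = c(200, 390, 550, 760, 900))

# DistB shifted down by one row; the first row has no predecessor, hence NA
prev_B <- c(NA, head(dfData$DistB, -1))

# Difference to the previous row's DistB, then the same-area flag
dfData$comparison <- dfData$DistA - prev_B
dfData$same_area <- dfData$comparison < 40
```

The first row stays NA in both new columns because there is nothing to compare it with.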
We could also try data.table (similar to the approach suggested in the comments by @David Arenburg). shift is a new function introduced in the devel version, with type='lag' as the default option. It can be installed from here
library(data.table)#data.table_1.9.5
setDT(df1)[, Categ := c('Diff', 'Same')[
(abs(DistA-shift(DistB)) < 40 )+1L]][]
# ID DistA DistB Categ
#1: 1 100 200 NA
#2: 2 239 390 Same
#3: 3 392 550 Same
#4: 4 700 760 Diff
#5: 5 770 900 Same
If we need both the 'difference' and 'category' columns
setDT(df1)[,c('Dist', 'Categ'):={tmp= abs(DistA-shift(DistB))
list(tmp, c('Diff', 'Same')[(tmp <40)+1L])}]
df1
# ID DistA DistB Dist Categ
#1: 1 100 200 NA NA
#2: 2 239 390 39 Same
#3: 3 392 550 2 Same
#4: 4 700 760 150 Diff
#5: 5 770 900 10 Same

removing duplicate units from data frame

I'm working on a large dataset with n covariates. Many of the rows are duplicates. In order to identify the duplicates I need to use a subset of the covariates to create an identification variable. That is, (n-x) covariates are irrelevant. I want to concatenate the values on the x covariates to uniquely identify the observations and eliminate the duplicates.
set.seed(1234)
UNIT <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6)
DATE <- c("1/1/2010","1/1/2010","1/1/2010","1/2/2012","1/2/2009","1/2/2004","1/2/2005","1/2/2005",
"1/1/2011","1/1/2011","1/1/2011","1/1/2009","1/1/2008","1/1/2008","1/1/2012","1/1/2013",
"1/1/2012")
OUT1 <- c(300,400,400,400,600,700,700,800,800,800,900,700,700,100,100,100,500)
JUNK1 <- c(rnorm(17,0,1))
JUNK2 <- c(rnorm(17,0,1))
test = data.frame(UNIT,DATE,OUT1,JUNK1,JUNK2)
'test' is a sample data frame. The variables I need to use to uniquely identify the observations are 'UNIT', 'DATE' and 'OUT1'. For example,
head(test)
UNIT DATE OUT1 JUNK1 JUNK2
1 1 1/1/2010 300 -1.2070657 -0.9111954
2 1 1/1/2010 400 0.2774292 -0.8371717
3 1 1/1/2010 400 1.0844412 2.4158352
4 1 1/2/2012 400 -2.3456977 0.1340882
5 2 1/2/2009 600 0.4291247 -0.4906859
6 2 1/2/2004 700 0.5060559 -0.4405479
Observations 1 and 4 are not duplicates of anything in the dataset. Observations 2 and 3 are duplicates of each other. The new dataset I want to create would keep observations 1 and 4 and only one of 2 and 3. The solution I have tried is:
subset(test, !duplicated(c(UNIT,DATE,OUT1)))
Which unfortunately does not do the trick:
UNIT DATE OUT1 JUNK1 JUNK2
1 1 1/1/2010 300 -1.20706575 -0.9111954
5 2 1/2/2009 600 0.42912469 -0.4906859
8 3 1/2/2005 800 -0.54663186 -0.6937202
11 4 1/1/2011 900 -0.47719270 -1.0236557
14 5 1/1/2008 100 0.06445882 1.1022975
15 6 1/1/2012 100 0.95949406 -0.4755931
Although it does ignore the irrelevant variables (JUNK1, JUNK2) , the technique is too greedy. The new dataset should contain three observations on unit one because there are three unique combinations of UNIT + DATE + OUT1 when UNIT = 1. Is there a way to achieve this without writing a function?
You can pass a data.frame to duplicated
In your case, you want to pass the first 3 columns of test
test2 <- test[!duplicated(test[,1:3]),]
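The attempt with c(UNIT, DATE, OUT1) fails because c() joins the three columns into one long vector, so duplicated() compares individual values instead of row-wise combinations. A small sketch of the data-frame form on the first four rows of the sample, with a simplified JUNK1 column (key_cols and is_dup are names made up here):

```r
# First four rows of the sample, with made-up JUNK1 values
test <- data.frame(
  UNIT  = c(1, 1, 1, 1),
  DATE  = c("1/1/2010", "1/1/2010", "1/1/2010", "1/2/2012"),
  OUT1  = c(300, 400, 400, 400),
  JUNK1 = c(0.1, 0.2, 0.3, 0.4)
)

# duplicated() on a data frame compares whole rows of the chosen columns
key_cols <- c("UNIT", "DATE", "OUT1")
is_dup <- duplicated(test[, key_cols])

# Keep the first occurrence of each UNIT + DATE + OUT1 combination
test2 <- test[!is_dup, ]
```

Only row 3 is flagged, since it repeats the key combination of row 2; JUNK1 plays no part in the comparison.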
If you are using big data, and want to embrace data.tables, then you can set the key to be the first three columns (which you want to remove the duplicates from) and then use unique
library(data.table)
DT <- data.table(test)
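# set the key
setkey(DT, UNIT,DATE,OUT1)
# since data.table 1.9.8, unique() uses all columns by default, so name the key columns explicitly
DTU <- unique(DT, by = key(DT))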
For more details on duplicates and data.tables see Filtering out duplicated/non-unique rows in data.table
Thanks! Looks like we can do:
test2 <- test[!duplicated(test[,c("OUT1","DATE","UNIT")]),]
and it delivers the goods as well. So, we can just use the column names rather than 1:3 and the order doesn't matter
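You can use distinct() from the dplyr package. Add .keep_all = TRUE to retain the other columns, which recent dplyr versions otherwise drop:
library(dplyr)
test %>%
distinct(UNIT, DATE, OUT1, .keep_all = TRUE)
Or without the %>% pipe:
distinct(test, UNIT, DATE, OUT1, .keep_all = TRUE)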

Deleting rows dataframe in R conditional to "if any of (a specific variable) is equal to"

I have been struggling for some time now with this code...
I have this vector of unique ID "EID" of length 821 extracted from one of my dataframe (skate). It looks like this:
> head(skate$EID)
[1] "896-19" "895-8" "899-1" "899-5" "899-8" "895-7"
I would like to remove the complete rows in another dataframe (t5) whenever t5$EID matches any value in skate$EID.
I was able to get a dataframe of all the rows in t5 with a matching EID as follows:
> xx<-skate$EID
> t5[match(xx,t5[,26]), ]#gives me a dataframe of all matching EID in skate$EID
record.t trip set month stratum NAFO unit.area time dur.set distance
8948 5 896 19 11 221 2J N12 908 15 8
8849 5 895 8 10 766 3O R36 1650 16 8
9289 5 899 1 12 743 3L V26 2052 15 8
9299 5 899 5 12 746 3L W27 1129 14 7
Where t5[,26] corresponds to the t5$EID column.
I'm sure it's simple, but I'm not sure how to remove all of these now from my t5 dataframe!
Tips would be very much appreciated!
Thank you!
There are many ways to do this. To test for elements of vector A not in vector B, you can use a combination of !, R's logical negation operator (see ?"!"), and %in% (see ?"%in%"). You then use the results of that test to indicate which rows to keep.
# Create two example data.frames
skate <- data.frame(EID = c("896-19", "895-8", "899-1", "899-5"),
score = 1:4)
t5 <- data.frame(EID = c("896-19", "camel", "899-1", "goat", "899-1"),
score = 105:101)
# Method 1
t5[!t5$EID %in% skate$EID, ]
# Method 2 (using the very handy subset() function)
subset(t5, !EID %in% skate$EID)
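Spelling out Method 1 step by step on the same example data shows which rows survive. This is a sketch; in_skate and t5_kept are names invented here.

```r
skate <- data.frame(EID = c("896-19", "895-8", "899-1", "899-5"),
                    score = 1:4)
t5 <- data.frame(EID = c("896-19", "camel", "899-1", "goat", "899-1"),
                 score = 105:101)

# TRUE for every row of t5 whose EID also occurs in skate
in_skate <- t5$EID %in% skate$EID

# Negate the test to keep only the rows that do not match
t5_kept <- t5[!in_skate, ]
```

Both copies of "899-1" are dropped, because %in% tests every row of t5, whereas match(xx, t5$EID) returns only the first matching position for each skate$EID.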

recoding using R

I have a data set with dam, sire, plus other variables but I need to recode my dam and sire id's. The dam column is sorted and each animal appears only once. On the other hand, the sire column is unsorted and some animals appear more than once.
I would like to start my numbering of dams from 50,000 such that the first animal will get 50001, second animal 50002 and so on. I have this script that numbers each dam from 1 to N and wondering if it can be modified to begin from 50,000.
mydf$dam2 <- as.numeric(factor(paste(mydf$dam,sep="")))
*EDITED
my data set is similar to this but more variables
dam <- c("1M521","1M584","1M790","1M871","1M888","1M933")
sire <- c("1X057","1T456","1W865","1W209","1W209","1W648")
wt <- c(369,300,332,351,303,314)
p2 <- c(NA,16,18,NA,NA,15)
mydf <- data.frame(dam,sire,wt,p2)
For the sire column, I would like to start numbering from 10,000.
Any help would be very much appreciated.
Baz
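At the moment, those sire and dam columns are factor variables, which means you can just add the as.numeric() results to your base number. (If they are character columns instead, as data.frame() creates by default since R 4.0.0, the factor() call below converts them first.)
> mydf$dam_n <- 50000 +as.numeric(factor(mydf$dam))
> mydf$sire_n <- 10000 +as.numeric(factor(mydf$sire))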
> mydf
dam sire wt p2 dam_n sire_n
1 1M521 1X057 369 NA 50001 10005
2 1M584 1T456 300 16 50002 10001
3 1M790 1W865 332 18 50003 10004
4 1M871 1W209 351 NA 50004 10002
5 1M888 1W209 303 NA 50005 10002
6 1M933 1W648 314 15 50006 10003
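The same idea can be wrapped in a small helper so that the offset is the only thing that changes between columns. This is a sketch; recode_from is a hypothetical name, not a function from any package.

```r
# Hypothetical helper: number the distinct IDs 1..N (in sorted order of the
# IDs, which is how factor() orders its levels) and add a starting offset
recode_from <- function(x, base) base + as.numeric(factor(x))

dam  <- c("1M521", "1M584", "1M790", "1M871", "1M888", "1M933")
sire <- c("1X057", "1T456", "1W865", "1W209", "1W209", "1W648")

dam_n  <- recode_from(dam, 50000)
sire_n <- recode_from(sire, 10000)
```

Repeated IDs, such as sire 1W209, receive the same code each time they appear.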
Why not use:
names(mydf$dam2) <- 50000:whatEverYourLengthIs
I am not sure if I understood your data structures completely, but usually the names function is used to set names.
EDIT:
You can use dimnames to name columns and rows.
Like:
[,1] [,2]
a 1 2
b 4 5
c 7 8
and
dimnames(mymatrix) <- list(c("Jan", "Feb", "Mar"), c("2005", "2006"))
yields
2005 2006
Jan 1 2
Feb 4 5
Mar 7 8
