How to update a specific cell with a numeric value? - r

I am reading in a CSV file. When I first check if there are any NA's there are none. I then clean my data and convert my Income variable from num to factor by using this code to discretize income by equal-width bins:
min_income <- min(bd$income)
max_income <- max(bd$income)
bins <- 3
width <- (max_income - min_income) / bins
bd$income <- cut(bd$income, breaks = seq(min_income, max_income, width))
When I complete cleaning/updating my data and check again for NAs, I find one. It is specific to row 65 of my income column. If I try to write the actual value back into that cell with the code below, I get a warning and the cell stays NA.
> bd[65,5] = 5014.21
Warning message:
invalid factor level, NA generated
Is there a way to update this without having to change the type of variable? Why would it change the value to an NA (especially for only one value)? I have not come across this issue previously. I could just remove the row, but since I have the value I figured I should just use it.

I suspect the NA is your lowest income value, because cut() does not include the lower boundary of the first bin by default. You can change that by setting include.lowest = TRUE. See the example below.
bd <- data.frame(income = sample(seq(100, 500, 10), 10))
min_income <- min(bd$income)
max_income <- max(bd$income)
bins <- 3
width <- (max_income - min_income) / bins
bd$income2 <- cut(bd$income, breaks = seq(min_income, max_income, width))
bd$income3 <- cut(bd$income, breaks = seq(min_income, max_income, width),
                  include.lowest = TRUE)
bd
   income   income2   income3
1     340 (247,373] (247,373]
2     360 (247,373] (247,373]
3     250 (247,373] (247,373]
4     120      <NA> [120,247]
5     290 (247,373] (247,373]
6     210 (120,247] [120,247]
7     440 (373,500] (373,500]
8     500 (373,500] (373,500]
9     450 (373,500] (373,500]
10    380 (373,500] (373,500]
So there should be no need for an NA value in need of changing in the first place. However, for the sake of completeness: you converted bd$income into a factor, and hence can only assign values that correspond to an existing factor level. For instance like this:
bd$income2[is.na(bd$income2)] = levels(bd$income2)[1]
bd
   income   income2   income3
1     340 (247,373] (247,373]
2     360 (247,373] (247,373]
3     250 (247,373] (247,373]
4     120 (120,247] [120,247]
5     290 (247,373] (247,373]
6     210 (120,247] [120,247]
7     440 (373,500] (373,500]
8     500 (373,500] (373,500]
9     450 (373,500] (373,500]
10    380 (373,500] (373,500]
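If you really do want to store the exact value 5014.21 in the factor column, a minimal sketch (my own addition, reusing the row/column indices from the question) is to register it as a new level first, since a factor only accepts values that already exist among its levels:
# Add the value as a new level, then assign it as a string; note that it is
# stored as a factor label, not as a number you can do arithmetic on.
bd$income <- factor(bd$income, levels = c(levels(bd$income), "5014.21"))
bd[65, 5] <- "5014.21"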

Related

Cross joining for the computation of a new variable

I have a game data set and I observe the number of points of one player.
da = data.frame(points = c(144,186,220,410,433))
da
  points
1    144
2    186
3    220
4    410
5    433
I also know which level the player was in, because I know the point ranges for the different levels.
ranges = data.frame(level = c(1,2,3,4,5), points_from = c(0,100,200,300,430), points_to = c(100,170,300,430,550))
ranges
  level points_from points_to
1     1           0       100
2     2         100       170
3     3         200       300
4     4         300       430
5     5         430       550
Now I want to compute a new variable that indicates how far away the player was from the next level. It is computed as da$points / ranges$points_to for the level in question.
For example, if the player has 144 points and the next level is reached at 170 points, the level progress is 144/170.
Thus, the data set I want to have looks like this:
da_new = data.frame(points = c(144,186,220,410,433), points_to = c(170,300,300,430,550), level_progress = c(144/170,186/300,220/300,410/430,433/550))
da_new
  points points_to level_progress
1    144       170         0.8471
2    186       300         0.6200
3    220       300         0.7333
4    410       430         0.9535
5    433       550         0.7873
How can I now compute this variable?
The main idea is to use merge(da, ranges, all = TRUE) to do a "cross join" between the data: since the two data frames share no column names, merge() returns every combination of rows. Then we filter to where points is between points_from and points_to (note that 186 falls into the gap between the ranges 100-170 and 200-300, so it is not in the final data).
library(dplyr)

merge(da, ranges, all = TRUE) %>%
  # keep only rows where points fall between points_from and points_to
  filter(points >= points_from & points <= points_to) %>%
  mutate(level_progress = points / points_to)
  points level points_from points_to level_progress
1    144     2         100       170      0.8470588
2    220     3         200       300      0.7333333
3    410     4         300       430      0.9534884
4    433     5         430       550      0.7872727
Another option is to filter to where points <= points_to and keep, for each player, the row where points_to is closest to points (this method keeps 186):
merge(da, ranges, all = TRUE) %>%
  filter(points <= points_to) %>%
  group_by(points) %>%
  slice(which.min(abs(points - points_to))) %>%
  mutate(level_progress = points / points_to)
  points level points_from points_to level_progress
   <dbl> <dbl>       <dbl>     <dbl>          <dbl>
1    144     2         100       170          0.847
2    186     3         200       300          0.62
3    220     3         200       300          0.733
4    410     4         300       430          0.953
5    433     5         430       550          0.787
Here is a base R solution using findInterval():
da_new <- da
da_new$points_to <- ranges$points_to[findInterval(da_new$points, c(0, ranges$points_to))]
da_new$level_progress <- da_new$points / da_new$points_to
such that
> da_new
  points points_to level_progress
1    144       170      0.8470588
2    186       300      0.6200000
3    220       300      0.7333333
4    410       430      0.9534884
5    433       550      0.7872727

Change one specific value in a data table in R [duplicate]

This question already has answers here:
Replacing values from a column using a condition in R
(2 answers)
Closed 4 years ago.
Here is my code
nutrients <- read.csv("nutrients.csv", header = TRUE, sep = ",")
> plot(nutrients)
> head(nutrients)
         crop Nutrient.dens N..tons.acre. P2O5 K2O sum.nut
1    broccoli         340.0           210  245 100     555
2      carrot         458.0            70  250  50     370
3 cauliflower         315.0            25   35  80     140
4      letuce         318.5           165  150  90     405
5       onion         109.0           120   30 150     300
6      tomato         186.0           175   85 275     535
> df_nutrients<- as.data.frame(nutrients)
> df_nutrients<- df_nutrients[1,1=="broc"]
I am sure this is easy, and I've tried searching everything I can find, but I cannot get the answer. I just need to change that one value to "broc". Is there a specific function I need, or something?
If crop is a character type, then a simple subset should work
nutrients$crop[nutrients$crop == "broccoli"] <- "broc"
If crop is a factor, then use this:
levels(nutrients$crop)[levels(nutrients$crop) == "broccoli"] <- "broc"
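If you only need to change that single cell rather than every "broccoli" entry, direct indexing is a minimal alternative (assuming crop is a character column; the row/column position is taken from the question's own attempt):
# Overwrite row 1 of the crop column in place.
nutrients[1, "crop"] <- "broc"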

R One sample test for set of columns for each row

I have a data set where I have the Levels and Trends for say 50 cities for 3 scenarios. Below is the sample data -
City <- paste0("City",1:50)
L1 <- sample(100:500,50,replace = T)
L2 <- sample(100:500,50,replace = T)
L3 <- sample(100:500,50,replace = T)
T1 <- runif(50,0,3)
T2 <- runif(50,0,3)
T3 <- runif(50,0,3)
df <- data.frame(City,L1,L2,L3,T1,T2,T3)
Now, across the 3 scenarios I find the minimum Level and Minimum Trend using the below code -
df$L_min <- apply(df[,2:4],1,min)
df$T_min <- apply(df[,5:7],1,min)
Now I want to check whether these minimum values are significantly different from the levels and trends respectively, i.e. compare L_min with columns 2-4 and T_min with columns 5-7. This needs to be done for each city (row); if significant, it should return which column the minimum differs from.
It would help if some one could guide how this can be done.
Thank you!!
I'll put my idea here; nevertheless, I'm looking forward to ideas from others.
> head(df)
   City  L1  L2  L3       T1         T2        T3 L_min      T_min
1 City1 251 176 263 1.162313 0.07196579 2.0925715   176 0.07196579
2 City2 385 406 264 0.353124 0.66089524 2.5613980   264 0.35312402
3 City3 437 333 426 2.625795 1.43547766 1.7667891   333 1.43547766
4 City4 431 405 493 2.042905 0.93041254 1.3872058   405 0.93041254
5 City5 101 429 100 1.731004 2.89794314 0.3535423   100 0.35354230
6 City6 374 394 465 1.854794 0.57909775 2.7485841   374 0.57909775
> df$FC <- rowMeans(df[, 2:4]) / df$L_min  # mean level relative to the row minimum
> df <- df[order(-df$FC), ]
> head(df)
     City  L1  L2  L3        T1        T2         T3 L_min      T_min       FC
18 City18 461 425 117 2.7786757 2.6577894 0.75974121   117 0.75974121 2.857550
38 City38 370 117 445 0.1103141 2.6890014 2.26174542   117 0.11031411 2.655271
44 City44 101 473 222 1.2754675 0.8667007 0.04057544   101 0.04057544 2.627063
10 City10 459 361 132 0.1529519 2.4678493 2.23373484   132 0.15295194 2.404040
16 City16 232 393 110 0.8628494 1.3995549 1.01689217   110 0.86284938 2.227273
15 City15 499 475 182 0.3679611 0.2519497 2.82647041   182 0.25194969 2.117216
Now you have the rows that are most different (based on columns 2:4) at the top. Columns 5:7 can be handled in an analogous way, as sketched below.
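A minimal sketch of that analogous computation for the trend columns (using the column names generated in the question):
# Same ratio, built from T1..T3 and T_min instead of the level columns.
df$FC_T <- rowMeans(df[, 5:7]) / df$T_min
df <- df[order(-df$FC_T), ]  # most different trend rows at the top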
And some tips for statistical tests:
Always use t.test() (parametric, based on the mean) instead of the Wilcoxon (Mann-Whitney U, non-parametric, based on the median) test, as it has more power; HOWEVER:
- Data sets should be big. Example hypothesis: Montreal has taller citizens than Quebec; t.test() will work fine when you take 100 people from each city, so we have height measurements of 200 people, 100 vs. 100.
- Distributions should be close to normal in all samples, or both samples should have similar distributions far from normal (binomial, say). Either way, we can't use this test when one sample has a normal distribution and the other hasn't.
- Both samples should be of equal size, so 100 vs. 100 is ok; with 87 vs. 234 the p-value may come out below 0.05 yet be misleading.
If your data doesn't meet the above conditions, I prefer a non-parametric test: less power, but more robust.
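As for the per-row test itself, here is a rough sketch of my own (not from the question): a one-sample t-test of the three level values against the row minimum. With only three observations per row it has very little power, so treat it as a template rather than a recommendation.
# For each row, test whether the mean of L1..L3 differs from L_min (mu = row minimum).
df$L_pval <- apply(df[, c("L1", "L2", "L3")], 1, function(lv) {
  t.test(lv, mu = min(lv))$p.value
})
The trend columns T1..T3 can be tested against T_min in the same way.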

how to discretize an R data.frame column in a given width?

Say, I have a data.frame() like this
> head(Acquisition)
  original_date first_payment_date LTV DTI FICO
1       01/2007            03/2007  56  37  734
2       02/2007            04/2007  80  11  762
3       12/2006            02/2007  80  28  656
4       12/2006            03/2007  70  50  700
I want to discretize the Acquisition$LTV and Acquisition$DTI by the step size 0.05 and Acquisition$FICO by the step size 10.
I have found the answer: the cut function does the job. Passing a single number as breaks gives that many equal-width intervals, so dividing the range by the step size yields bins of the desired width:
dis.LTV <- cut(Acquisition$LTV, (max(Acquisition$LTV) - min(Acquisition$LTV)) / 0.05)
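Equivalently, a sketch with explicit breaks that also covers the FICO column with width 10; include.lowest = TRUE keeps the minimum in the first bin (see the first answer above), and the top of each sequence is padded so the maximum falls inside the last bin:
# Bin LTV in steps of 0.05 and FICO in steps of 10.
ltv_breaks <- seq(min(Acquisition$LTV), max(Acquisition$LTV) + 0.05, by = 0.05)
fico_breaks <- seq(min(Acquisition$FICO), max(Acquisition$FICO) + 10, by = 10)
dis.LTV <- cut(Acquisition$LTV, breaks = ltv_breaks, include.lowest = TRUE)
dis.FICO <- cut(Acquisition$FICO, breaks = fico_breaks, include.lowest = TRUE)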

How to obtain a new table after filtering only one column in an existing table in R?

I have a data frame with 20 columns. I need to filter / remove noise from one column. After filtering with the convolve function I get a new vector of values, and many values of the original column are lost in the filtering process. The problem is that I need the whole table (for later analysis) with only those rows where the filtered column has values, but I can't bind the filtered column to the original table because the two differ in number of rows. Let me illustrate using the 'age' column of the 'Orange' data set in R:
> head(Orange)
  Tree  age circumference
1    1  118            30
2    1  484            58
3    1  664            87
4    1 1004           115
5    1 1231           120
6    1 1372           142
The convolve filter used:
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D / delta))
  r <- convolve(x, z, type = "filter") / convolve(rep(1, length(x)), z, type = "filter")
  r <- head(tail(r, -D), -D)
  r
}
Filtering the 'age' column
age2 <- smooth(Orange$age, 5, 10)
data.frame(age2)
The age column has 35 values while age2 has only 15. The original data set has two more columns that I also want to work with. I now need the 15 rows of every column corresponding to the 15 rows of the age2 column; the filter here removed the first and last ten values from the age column. How can I apply the filter so that I get a truncated data set with all columns but only the filtered rows?
You would need to figure out how the variables line up. If you can add NA's to age2 and then do Orange$age2 <- age2 followed by na.omit(Orange), you should have what you want (a sketch of this follows the output below). Or, equivalently, perhaps this is what you are looking for?
df <- tail(head(Orange, -10), -10) # chop off the first and last 10 observations
df$age2 <- age2
df
   Tree  age circumference      age2
11    2 1004           156  915.1678
12    2 1231           172  876.1048
13    2 1372           203  841.3156
14    2 1582           203  911.0914
15    3  118            30  948.2045
16    3  484            51 1008.0198
17    3  664            75  955.0961
18    3 1004           108  915.1678
19    3 1231           115  876.1048
20    3 1372           139  841.3156
21    3 1582           140  911.0914
22    4  118            32  948.2045
23    4  484            62 1008.0198
24    4  664           112  955.0961
25    4 1004           167  915.1678
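For completeness, a minimal sketch of the NA-padding route mentioned at the top of this answer (assuming, as here, that the filter dropped 2*D = 10 values from each end):
# Pad the filtered column back to 35 values, then drop the rows without a value.
Orange$age2 <- c(rep(NA, 10), age2, rep(NA, 10))
na.omit(Orange)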
Edit: If you know the first and last x observations will be removed, then the following works (here x must be 10, since the filter drops 2*D = 10 values from each end):
x <- 10
df <- tail(head(Orange, -x), -x) # chop off the first and last x observations
df$age2 <- age2
