Here is my code
nutrients<- read.csv("nutrients.csv", head = TRUE, sep = ",")
> plot(nutrients)
> head(nutrients)
crop Nutrient.dens N..tons.acre. P2O5 K2O sum.nut
1 broccoli 340.0 210 245 100 555
2 carrot 458.0 70 250 50 370
3 cauliflower 315.0 25 35 80 140
4 letuce 318.5 165 150 90 405
5 onion 109.0 120 30 150 300
6 tomato 186.0 175 85 275 535
> df_nutrients<- as.data.frame(nutrients)
> df_nutrients<- df_nutrients[1,1=="broc"]
I am sure this is easy, and I've tried searching everywhere I can think of, but I cannot find the answer. I just need to change that one value to "broc". Is there a specific function I need?
If crop is a character type, then a simple subset should work
nutrients$crop[nutrients$crop == "broccoli"] <- "broc"
If crop is a factor, then use this:
levels(nutrients$crop)[levels(nutrients$crop) == "broccoli"] <- "broc"
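A small self-contained sketch of both cases, using a made-up three-row stand-in for the nutrients data (the object name nutrients_f is only for illustration):

```r
# Character column: replace the matching elements directly
nutrients <- data.frame(crop = c("broccoli", "carrot", "onion"),
                        stringsAsFactors = FALSE)
nutrients$crop[nutrients$crop == "broccoli"] <- "broc"

# Factor column: rename the level instead of the individual elements
nutrients_f <- data.frame(crop = factor(c("broccoli", "carrot", "onion")))
levels(nutrients_f$crop)[levels(nutrients_f$crop) == "broccoli"] <- "broc"

nutrients$crop
levels(nutrients_f$crop)
```

You can check which case applies to your own data with class(nutrients$crop).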
A VERY simplified example of my dataset:
HUC8 YEAR RO_MM
1: 10010001 1961 78.2
2: 10010001 1962 84.0
3: 10010001 1963 70.2
4: 10010001 1964 130.5
5: 10010001 1965 54.3
I found this code online which sort of, but not quite, does what I want:
#create a list of the files from your target directory
file_list <- list.files(path="~/Desktop/Rprojects")
#initiate a blank data frame, each iteration of the loop will append the data from the given file to this variable
allHUCS <- data.frame()
#I want to read each .csv from a folder named "Rprojects" on my desktop into one huge dataframe for further use.
for (i in 1:length(file_list)){
temp_data <- fread(file_list[i], stringsAsFactors = F)
allHUCS <- rbindlist(list(allHUCS, temp_data), use.names = T)
}
Question: I have read that one should not use rbindlist for a large dataset:
"You should never ever ever iteratively rbind within a loop: performance might be okay in the beginning, but with each call to rbind it makes a complete copy of the data, so with each pass the total data to copy increases. It scales horribly. Consider do.call(rbind.data.frame, file_list)." – @r2evans
I know this may seem simple but I'm unclear about how to use his directive. Would I write this for the last line?
allHUCS <- do.call(rbind.data.frame(allHUCS, temp_data), use.names = T)
Or something else? In my actual data, each .csv has 2099 objects with 3 variables (but I only care about the last two.) The total dataframe should contain 47,000,000+ objects of 2 variables. When I ran the original code I got these errors:
Error in rbindlist(list(allHUCS, temp_data), use.names = T) : Item 2
has 2 columns, inconsistent with item 1 which has 3 columns. To fill
missing columns use fill=TRUE.
In addition: Warning messages: 1: In fread(file_list[i],
stringsAsFactors = F) : Detected 1 column names but the data has 2
columns (i.e. invalid file). Added 1 extra default column name for the
first column which is guessed to be row names or an index. Use
setnames() afterwards if this guess is not correct, or fix the file
write command that created the file to create a valid file.
2: In fread(file_list[i], stringsAsFactors = F) : Stopped early on
line 20. Expected 2 fields but found 3. Consider fill=TRUE and
comment.char=. First discarded non-empty line: <<# mv *.csv .. ; >>
Except for the setnames() suggestion, I don't understand what I'm being told. I know it says it stopped early, but I don't even know how to see the entire dataset or to tell where it stopped.
I'm now reading that rbindlist and rbind are two different things and rbindlist is faster than do.call(rbind, data). But the suggestion is do.call(rbind.data.frame(allHUCS, temp_data). Which is going to be fastest?
Since the original post does not include a reproducible example, here is one that reads data from the Pokémon Stats data that I maintain on GitHub.
First, we download a zip file containing one CSV file for each generation of Pokémon, and unzip it to the ./pokemonData subdirectory of the R working directory.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonData.zip",
"pokemonData.zip",
method="curl",mode="wb")
unzip("pokemonData.zip",exdir="./pokemonData")
Next, we obtain a list of files in the directory to which we unzipped the CSV files.
thePokemonFiles <- list.files("./pokemonData",
full.names=TRUE)
Finally, we load the data.table package, use lapply() with data.table::fread() to read the files, combine the resulting list of data tables with do.call(), and print the head() and tail() of the resulting data frame with all 8 generations of Pokémon stats.
library(data.table)
data <- do.call(rbind,lapply(thePokemonFiles,fread))
head(data)
tail(data)
...and the output:
> head(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk Sp. Def Speed
1: 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
2: 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
3: 3 Venusaur Grass Poison 525 80 82 83 100 100 80
4: 4 Charmander Fire 309 39 52 43 60 50 65
5: 5 Charmeleon Fire 405 58 64 58 80 65 80
6: 6 Charizard Fire Flying 534 78 84 78 109 85 100
Generation
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
> tail(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk
1: 895 Regidrago Dragon 580 200 100 50 100
2: 896 Glastrier Ice 580 100 145 130 65
3: 897 Spectrier Ghost 580 100 65 60 145
4: 898 Calyrex Psychic Grass 500 100 80 80 80
5: 898 Calyrex Ice Rider Psychic Ice 680 100 165 150 85
6: 898 Calyrex Shadow Rider Psychic Ghost 680 100 85 80 165
Sp. Def Speed Generation
1: 50 80 8
2: 110 30 8
3: 80 130 8
4: 80 80 8
5: 130 50 8
6: 100 150 8
>
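On the speed question: as far as I know, data.table::rbindlist() takes the list returned by lapply() directly, so the same read-everything-then-bind-once pattern works without do.call(). The sketch below is only an illustration under assumptions: the three CSV files, the huc_demo directory name, and the random RO_MM values are made up to mimic the HUC8 layout from the question.

```r
library(data.table)

# Hypothetical stand-in data: three small CSV files in a temporary directory
csv_dir <- file.path(tempdir(), "huc_demo")
dir.create(csv_dir, showWarnings = FALSE)
for (huc in c(10010001L, 10010002L, 10010003L)) {
  fwrite(data.table(HUC8 = huc, YEAR = 1961:1965,
                    RO_MM = round(runif(5, 50, 150), 1)),
         file.path(csv_dir, paste0(huc, ".csv")))
}

# Keep only .csv files and return full paths so fread() can locate them
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file first, then bind once; no data frame grows inside a loop
allHUCS <- rbindlist(lapply(csv_files, fread), use.names = TRUE)
dim(allHUCS)
```

Restricting list.files() with pattern also keeps stray non-CSV files in the folder from being passed to fread(), which may be what triggered the "Stopped early" warning in the question. If only some columns are needed, fread() has a select argument for that.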
I am reading in a CSV file. When I first check if there are any NA's there are none. I then clean my data and convert my Income variable from num to factor by using this code to discretize income by equal-width bins:
min_income <- min(bd$income)
max_income <- max(bd$income)
bins = 3
width=(max_income - min_income)/bins;
bd$income = cut(bd$income, breaks=seq(min_income, max_income, width))
When I complete cleaning/updating my data and check again for NA's I receive one. It is specific to row 65 for my income column. If I want to update the actual value in it, using the below code I receive an error.
> bd[65,5] = 5014.21
invalid factor level, NA generated
Is there a way to update this without having to change the type of variable? Why would it change the value to an NA (especially for only one value)? I have not come across this issue previously. I could just remove the row, but since I have the value I figured I should just use it.
I suspect the NA value is the lowest value, because cut() does not include the lower boundary by default. You can change that by setting include.lowest = TRUE. See the example below.
bd = data.frame(income = sample(seq(100,500, 10), 10))
min_income <- min(bd$income)
max_income <- max(bd$income)
bins = 3
width=(max_income - min_income)/bins;
bd$income2 = cut(bd$income, breaks=seq(min_income, max_income, width))
bd$income3 = cut(bd$income, breaks=seq(min_income, max_income, width),
include.lowest = TRUE)
bd
income income2 income3
1 340 (247,373] (247,373]
2 360 (247,373] (247,373]
3 250 (247,373] (247,373]
4 120 <NA> [120,247]
5 290 (247,373] (247,373]
6 210 (120,247] [120,247]
7 440 (373,500] (373,500]
8 500 (373,500] (373,500]
9 450 (373,500] (373,500]
10 380 (373,500] (373,500]
So there should be no need to have an NA value in need of changing in the first place. However, for the sake of completeness: You change bd$income into a factor and hence can only assign a value corresponding to a factor level. For instance like this:
bd$income2[is.na(bd$income2)] = levels(bd$income2)[1]
bd
income income2 income3
1 340 (247,373] (247,373]
2 360 (247,373] (247,373]
3 250 (247,373] (247,373]
4 120 (120,247] [120,247]
5 290 (247,373] (247,373]
6 210 (120,247] [120,247]
7 440 (373,500] (373,500]
8 500 (373,500] (373,500]
9 450 (373,500] (373,500]
10 380 (373,500] (373,500]
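Because the example above draws random incomes, here is a deterministic sketch of the same behaviour. The four income values and the names brks, bins_default and bins_closed are made up for illustration. It also shows one way to update a binned value: bin the new number with the same breaks and assign the resulting level.

```r
income <- c(100, 200, 300, 400)
brks <- seq(min(income), max(income), length.out = 4)  # 100, 200, 300, 400

# Default: intervals are (a, b], so the minimum falls outside every bin
bins_default <- cut(income, breaks = brks)

# include.lowest = TRUE closes the first interval on the left
bins_closed <- cut(income, breaks = brks, include.lowest = TRUE)

# Updating an entry: bin the new numeric value with the same breaks first,
# then assign the resulting level (assigning the raw number would give NA)
bins_closed[2] <- as.character(cut(350, breaks = brks, include.lowest = TRUE))

data.frame(income, bins_default, bins_closed)
```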
Household Size 0 1 2 3 4 5+
Bedrooms Bedrooms Bedrooms Bedrooms Bedrooms Bedrooms
1 253 4486 2033 930 105 8
2 10 666 3703 947 85 7
3 4 68 1972 1621 52 5
4 1 12 680 1835 164 11
5+ 0 6 147 1230 721 122
I have the above dataframe where 'Bedrooms' is a label on the columns.
I'm trying to change this into a data table I can then use within rmarkdown to add into a flexdashboard. When I use the below code:
DT::datatable(df, rownames = FALSE, extensions = 'FixedColumns', escape=TRUE,options= list(bPaginate = FALSE))
I get the output:
Household Size 0 1 2 3 4 5+
1 253 4486 2033 930 105 8
2 10 666 3703 947 85 7
3 4 68 1972 1621 52 5
4 1 12 680 1835 164 11
5+ 0 6 147 1230 721 122
I have a few problems with this:
The labels that say 'Bedrooms' don't show, so there's no way of knowing what these numbers in the columns actually mean. I'd like to include the labels or have a row on top of the column names that says "Number of Bedrooms" and covers all of the rows.
The columns Household Size and 5+ are wider than the rest of the columns. I want these to either be the same width or for Household Size to be slightly wider than the rest.
I think it's worth noting that the row 5+ and the column 5+ are both a new row/column that count any value above 5.
Also, this is just an extra but I'd like to colour the bottom left cells red and the top right cells green, is this possible?
I've figured out how to keep 'Bedrooms' in the column titles. It's possible to set the column names within DT::datatable using the code below:
DT::datatable(HS_BED_ALL, rownames = FALSE,
  colnames = c('Household Size', '0 Bedrooms', '1 Bedroom', '2 Bedrooms',
               '3 Bedrooms', '4 Bedrooms', '5+ Bedrooms'),
  extensions = 'FixedColumns', escape = TRUE,
  options = list(bPaginate = FALSE, dom = 't', buttons = c('excel'))) %>%
  formatStyle(1:7, fontSize = '14px')
Which gives the desired output.
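For the other idea in the question, a single "Number of Bedrooms" header spanning the bedroom columns, one possible approach is a custom table container passed to datatable(). This is a sketch rather than a tested dashboard component: the data frame below is rebuilt by hand from the table in the question, and the column names b0 to b5 and the objects sketch and widget are made up for illustration.

```r
library(DT)
library(htmltools)

# Stand-in for the household-size-by-bedrooms table from the question
df <- data.frame(
  household_size = c("1", "2", "3", "4", "5+"),
  b0 = c(253, 10, 4, 1, 0),
  b1 = c(4486, 666, 68, 12, 6),
  b2 = c(2033, 3703, 1972, 680, 147),
  b3 = c(930, 947, 1621, 1835, 1230),
  b4 = c(105, 85, 52, 164, 721),
  b5 = c(8, 7, 5, 11, 122)
)

# Two header rows: a spanning title on top, the bedroom counts underneath
sketch <- withTags(table(
  class = "display",
  thead(
    tr(
      th(rowspan = 2, "Household Size"),
      th(colspan = 6, "Number of Bedrooms")
    ),
    tr(lapply(c("0", "1", "2", "3", "4", "5+"), th))
  )
))

widget <- datatable(df, container = sketch, rownames = FALSE,
                    options = list(paging = FALSE, dom = "t"))
widget
```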
Say, I have a data.frame() like this
>head(Acquisition)
original_date first_payment_date LTV DTI FICO
1 01/2007 03/2007 56 37 734
2 02/2007 04/2007 80 11 762
3 12/2006 02/2007 80 28 656
4 12/2006 03/2007 70 50 700
I want to discretize the Acquisition$LTV and Acquisition$DTI by the step size 0.05 and Acquisition$FICO by the step size 10.
I have found the answer: the cut function does the job.
dis.LTV=cut(Acquisition$LTV,(max(Acquisition$LTV)-min(Acquisition$LTV))/0.05)
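When a number is passed as the second argument, cut() treats it as a count of equal-width intervals, so the bin edges are not guaranteed to land on round multiples of the step. If exact edges matter, an explicit sequence of breaks is one alternative. The sketch below uses the four FICO values from the question; the names step, brks and fico_bin are made up.

```r
fico <- c(734, 762, 656, 700)
step <- 10

# Breaks at exact multiples of the step that cover the whole range
brks <- seq(floor(min(fico) / step) * step,
            ceiling(max(fico) / step) * step,
            by = step)

# right = FALSE gives [a, b) bins; include.lowest = TRUE closes the last one
fico_bin <- cut(fico, breaks = brks, right = FALSE, include.lowest = TRUE)

data.frame(fico, fico_bin)
```

The same pattern should work for LTV and DTI with step <- 0.05.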
I am trying to get my head around for loops in R and I have what seems to me a very basic example which isn't working.
I have data in a table:
Author ev.ctrl n.ctrl ev.trt n.trt year
1 Cammu 8 56 7 54 1994
2 Eckert 49 137 46 137 2001
3 Kuusela 1 15 1 18 1998
4 Ohlisson 205 625 183 612 2001
5 Rush 259 392 235 393 1996
6 Woodward 7 20 6 40 2004
I want to calculate the sum of the column n.trt. I know I could do sum(epidural$n.trt), but I want to try using a for loop.
I have:
for (i in 1:6){
sum(epidural$n.trt[i])
}
This is not giving me anything, not a number nor an error. Any idea what the problem is?
Thanks
Do this instead... we don't need no steenking loops:
> treats <- sum(epidural['n.trt']); treats
[1] 1254
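You need to declare a sum variable outside the for loop and add each value to it. Inside a loop, R does not auto-print the value of an expression, so the result of sum() is computed and then discarded. There is also no need to call sum() here, since epidural$n.trt[i] is a single value, not a vector.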
s <- 0
for (i in 1:6){
s <- s + epidural$n.trt[i]
}
s
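A variant of the same loop that does not hard-code the number of rows, using the n.trt values from the table in the question (the vector n_trt and the name total are stand-ins for illustration):

```r
# n.trt column from the question, as a plain vector
n_trt <- c(54, 137, 18, 612, 393, 40)

total <- 0
for (i in seq_along(n_trt)) {
  total <- total + n_trt[i]   # accumulate one element per iteration
}

# Printing must be explicit; nothing inside the loop is auto-printed
print(total)
```

seq_along() also behaves sensibly for an empty vector, where 1:length(x) would iterate over 1 and 0.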