R: How to select rows when column is in given range?


This is what my data looks like:
> d[1,]
Date sulfate nitrate ID
1 2003-01-01 NA NA 1
>
Total observations
> dim(d)
[1] 772087 4
I want to get the rows where ID is in the range 70:72 (the range comes from a parameter).
What I do:
d[d$ID==(70:71),]
What I get back is:
Warning message:
In d$ID == (70:71) :
longer object length is not a multiple of shorter object length

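Run d[d$ID %in% 70:71, ] to subset your data frame. The == operator compares element by element and recycles 70:71 along the whole column, which is what triggers the warning; %in% tests each ID for membership in the set instead.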
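A minimal sketch of the difference, on a small made-up data frame (the values and the names lo and hi are assumptions for illustration, standing in for the range parameter from the question):

```r
# Toy data standing in for the 772087-row data frame in the question
d <- data.frame(ID      = c(69, 70, 71, 70, 72, 71, 70),
                sulfate = c(1.1, 2.2, NA, 3.3, 4.4, 5.5, 6.6))

lo <- 70  # hypothetical lower bound of the range parameter
hi <- 71  # hypothetical upper bound of the range parameter

# Set membership: keeps every row whose ID is one of lo:hi
by_set <- d[d$ID %in% lo:hi, ]

# Equivalent range test, useful when ID is not a whole number
by_range <- d[!is.na(d$ID) & d$ID >= lo & d$ID <= hi, ]

# The original attempt: == recycles 70, 71, 70, 71, ... along the column,
# so a row only matches if its ID happens to line up with the recycled value
by_eq <- suppressWarnings(d[d$ID == (lo:hi), ])

nrow(by_set)  # 5 rows with ID 70 or 71
nrow(by_eq)   # only 2 of them survive the recycled comparison
```

Note that the == version does not just warn, it silently drops matching rows, so the warning should not be ignored.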

Related

Creating/Populating Empty Data Frames in R

I am working with R. I found this question on creating empty data frames in R: "Create an empty data.frame".
I tried to do something similar:
df <- data.frame(Date=as.Date(character()),
country=factor(),
total=numeric(),
stringsAsFactors=FALSE)
Yet, when I try to populate it:
df$total = 7
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, total, value = 7) :
replacement has 1 row, data has 0
df[1, "total"] <- rnorm(100,100,100)
Error in `[<-.data.frame`(`*tmp*`, 1, "total", value = c(-79.4584309347689, :
replacement has 100 rows, data has 1
Does anyone know how to fix this error?
Thanks

An option is to specify the row index
df[1, "total"] <- 7
Output:
str(df)
#'data.frame': 1 obs. of 3 variables:
# $ Date : Date, format: NA
# $ country: Factor w/ 0 levels: NA
# $ total : num 7
The issue is that when we select a single column and assign to it on a 0-row dataset, R does not automatically add a row for the other columns. When we specify the row index, the other columns are automatically filled with the default NA.
Regarding the second question (updated): a standard data.frame column is a vector, and the length of the assigned vector should match the number of rows in the index we specify. If we want to expand to 100 rows, change the index accordingly:
df[1:100, "total"] <- rnorm(100, 100, 100) # length is 100 here
dim(df)
#[1] 100 3
Or if we need to cram everything in a single row, then wrap the rnorm in a list
df[1, "total"] <- list(rnorm(100, 100, 100))
In short, the lhs should be of the same length as the rhs. Another case is when we are assigning from a different dataset
df[seq_along(aa$bb), "total"] <- aa$bb
This can also be done without initialization i.e.
df <- data.frame(total = aa$bb)
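Putting the pieces above together, here is a short sketch that starts from the empty data frame in the question and fills it by row index. The five values are made up for illustration:

```r
# Empty data frame with typed columns, as in the question
df <- data.frame(Date    = as.Date(character()),
                 country = factor(),
                 total   = numeric(),
                 stringsAsFactors = FALSE)

# Made-up values to store; the row index has the same length as the values
new_totals <- c(7, 12, 3, 9, 21)
df[seq_along(new_totals), "total"] <- new_totals

dim(df)             # 5 rows, 3 columns
is.na(df$country)   # the untouched columns are padded with NA
```

Using seq_along() on the right-hand side keeps the row index and the values the same length, so the "replacement has N rows" error cannot occur.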

Turn NULL to NA in R

I'm trying to convert all the NULL values in my dataset to NA.
Explanation of question
My data set looks like below:
One thing I noticed is that when I count the empty values, I only get the number of NA values in my dataset; the NULL values are not included. I would like to convert the NULL values to NA so that I can remove them.
So I counted the number of missing values in my complete dataset then in the columns as
> dim(raw_data)
[1] 80983 16
> # Count missing values in entire data set
> table(is.na(raw_data))
FALSE TRUE
1247232 48496
> # Count na 's column wise
> na_count <-sapply(raw_data, function(y) sum(length(which(is.na(y)))))
> na_count <- data.frame(na_count)
> na_count
na_count
Merchant_Id 1
Tran_Date 1
Military_Time 1
Terminal_Id_Key 1
Amount 1
Card_Amount_Paid 1
Merchant_Name 1
Town 1
Area_Code 1
Client_ID 48481
Age_Band 1
Gender_code 1
Province 1
Avg_Income_3M 1
Value_Spent 1
Number_Spent 1
As you can see it does not show the NULL as NA so I tried to convert it as:
> # Turn Null to NA
> temp_data <- raw_data
>
> temp_data[temp_data == ''] = NA
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
I also tried
> # Turn Null to NA
> temp_data <- raw_data
> temp_data[temp_data == 'NULL'] = NA
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
But I am getting the error above. This was followed by the last one below (which was better because I did not have an error but I still got NULL values in my data set).
> raw_data[is.null(raw_data)] <- NA
> table(is.na(raw_data))
FALSE TRUE
1247232 48496
Could you perhaps suggest ways to deal with this error?
I also tried to get rid of the date and got this different error when I once again tried to remove the NULL values:
> df <- raw_data
>
> df1 <- transform(df, date = as.Date(df$Tran_Date), time = format(df$Tran_Date, "%T"))
>
> df1[df1 == NULL] = NA
Error in matrix(if (is.null(value)) logical() else value, nrow = nr, dimnames = list(rn, :
length of 'dimnames' [2] not equal to array extent

This solved my issue. Instead of converting the NULL values to NA after loading, I imported them from the GitHub account as NA in the first place.
I added
na = c("", "NA", "NULL")
to the import call (read_tsv from the readr package; with base read.table the equivalent argument is na.strings). This did the trick and turned my NULL values into NA.
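For illustration, a small self-contained sketch of the same idea with base read.table. The three-row tab-separated text is made up, and only a few of the column names from the question are used:

```r
# Made-up tab-separated input containing the literal string "NULL" and an empty field
raw_text <- "Merchant_Id\tClient_ID\tAmount\n1\tNULL\t10.5\n2\t42\tNULL\n3\t\t7"

# Treat empty strings, "NA" and "NULL" as missing while reading
raw_data <- read.table(text = raw_text, header = TRUE, sep = "\t",
                       na.strings = c("", "NA", "NULL"),
                       stringsAsFactors = FALSE)

# Count missing values per column
colSums(is.na(raw_data))
```

Because the "NULL" strings never reach the data frame, the columns are parsed as numbers rather than text. As for the attempts in the question: is.null(raw_data) returns a single FALSE for the whole data frame, which is why that line ran without error but changed nothing.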

Numeric Variable has 20 observations that won't merge because they are "factor"

I have two dataframes with two variables, date and count1 and date and count2. Date is numeric. I want to merge on date keeping dates that only have one count observation (either a count1 or a count2).
> class(totals$date)
[1] "numeric"
> totals$date <- as.numeric(totals$date)
> news.prop <- merge(ME_StoryCount,totals,by="date",all = T)
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = c(20110104, 20110105, 20110106, :
invalid factor level, NA generated
This gives a data frame with the majority of the observations merged except for about 20 of the 102. Please help...

lapply on single column in data frame

I have a data frame which I populate from a csv file as follows (data for sample only) :
> csv_data <- read.csv('test.csv')
> csv_data
gender country income
1 1 20 10000
2 2 20 12000
3 2 23 3000
I want to convert country to factor. However when I do the following, it fails :
> csv_data[,2] <- lapply(csv_data[,2], factor)
Warning message:
In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
provided 3 variables to replace 1 variables
However, if I convert both gender and country to factor, it succeeds :
> csv_data[,1:2] <- lapply(csv_data[,1:2], factor)
> is.factor(csv_data[,1])
[1] TRUE
> is.factor(csv_data[,2])
[1] TRUE
Is there something I am doing wrong? I want to use lapply because I want to convert the columns to factors programmatically, and the number of columns to convert may be just 1 (it could be more as well; the number is driven by the arguments to a function). Is there any way I can do this using lapply only?

When subsetting for one single column, you'll need to change it slightly.
There's a big difference between
lapply(df[,2], factor)
and
lapply(df[2], factor)
## and/or
lapply(df[, 2, drop=FALSE], factor)
Have a look at the output of each. If you remove the comma, everything should work fine. Using the comma in [,] turns a single column into a vector and therefore each element in the vector is factored individually. Whereas leaving it out keeps the column as a list, which is what you want to give to lapply in this situation. However, if you use drop=FALSE, you can leave the comma in, and the column will remain a list/data.frame.
No good:
df[,2] <- lapply(df[,2], factor)
# Warning message:
# In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
# provided 3 variables to replace 1 variables
Succeeds on a single column:
df[,2] <- lapply(df[,2,drop=FALSE], factor)
df[,2]
# [1] 20 20 23
# Levels: 20 23
In my opinion, the best way to subset data frame columns is without the comma. This also succeeds:
df[2] <- lapply(df[2], factor)
df[[2]]
# [1] 20 20 23
# Levels: 20 23
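Since the question mentions that the columns come from function arguments, here is a rough sketch of how the comma-free form might be wrapped. The function name to_factor and its arguments are hypothetical, and the data frame reproduces the sample from the question:

```r
# Hypothetical helper: convert the named (or numbered) columns to factors.
# data[cols] always stays a data frame/list, even when cols has length 1.
to_factor <- function(data, cols) {
  data[cols] <- lapply(data[cols], factor)
  data
}

# Sample data from the question
csv_data <- data.frame(gender  = c(1, 2, 2),
                       country = c(20, 20, 23),
                       income  = c(10000, 12000, 3000))

one <- to_factor(csv_data, "country")               # a single column
two <- to_factor(csv_data, c("gender", "country"))  # several columns

sapply(one, is.factor)
sapply(two, is.factor)
```

The same call works for one column or many, which is the case the question was worried about.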

Unexpected row(s) of NAs when selecting subset of dataframe

When selecting a subset of data from a dataframe, I get row(s) entirely made up of NA values that were not present in the original dataframe. For example:
example.df[example.df$census_tract == 27702, ]
returns:
census_tract number_households_est
NA NA NA
23611 27702 2864
Where did that first row of NAs come from? And why is it returned even though example.df$census_tract != 27702 for that row?

That is because there is a missing observation
> sum(is.na(example.df$census_tract))
[1] 1
> example.df[which(is.na(example.df$census_tract)), ]
census_tract number_households_est
64 NA NA
When == evaluates the 64th row it gives NA, because we cannot know whether 27702 is equal to the missing value. The result of the comparison is therefore missing (NA) as well, so an NA ends up in the logical vector used for indexing. By default this produces a row full of NAs, because we are asking for a row but "we don't know which one".
The proper way is
> example.df[example.df$census_tract %in% 27702, ]
census_tract number_households_est
23611 27702 2864
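A small reproducible sketch of the behaviour described above, on a made-up three-row data frame (the values other than 27702 and 2864 are invented):

```r
# Toy data with one missing census_tract, mimicking the situation in the question
example.df <- data.frame(census_tract          = c(27701, NA, 27702),
                         number_households_est = c(1500, NA, 2864))

# == yields FALSE, NA, TRUE; the NA index produces an all-NA row
with_eq <- example.df[example.df$census_tract == 27702, ]

# %in% never returns NA, so only the real match is kept
with_in <- example.df[example.df$census_tract %in% 27702, ]

# which() drops the NA positions and keeps the TRUE ones, with the same effect
with_which <- example.df[which(example.df$census_tract == 27702), ]

nrow(with_eq)     # 2: the phantom NA row plus the match
nrow(with_in)     # 1
nrow(with_which)  # 1
```

Both %in% and which() avoid the phantom row; which() is handy when the condition is something other than an equality test.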
HTH, Luca
