I am trying to see whether the amount of information that I have about a case is correlated with the duration of the user.
Currently, I have a dataframe, df, and I attempted to do the following:
df["amount_known"] <-df[rowSums(!is.na(df)),]
This resulted in the following error:
Error in [<-.data.frame(*tmp*, "amount_known", value = list(status = c(3L, :
replacement element 1 has 808047 rows, need 808247
What could cause this to happen (and of course, how do I fix it)?
If you want the number of non-NA entries in a new column amount_known in df you can do it like this:
df$amount_known <-rowSums(!is.na(df))
Here's a small example of what is happening:
df <- data.frame(x = 1:3, y = 66:68)
df$y[1] <- NA
df$x[3] <- NA
df
# x y
#1 1 NA
#2 2 67
#3 NA 68
rowSums(!is.na(df))
#[1] 1 2 1
This results in a vector with the number of non-NAs in df.
Now, if you do
df[rowSums(!is.na(df)),]
This will select the rows in the vector c(1,2,1) from df:
# x y
#1 1 NA
#2 2 67
#1.1 1 NA
So for example, row 1 is shown twice.
And in your code, you were then assigning that output to a new column in df.
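A likely explanation for the exact numbers in the error message, offered as an inference because the original data is not shown: an index of 0 selects no row, so every row of df that is entirely NA contributes a 0 to rowSums(!is.na(df)) and drops out of df[rowSums(!is.na(df)),]. A difference of 200 rows (808247 versus 808047) would be consistent with 200 all-NA rows. The sketch below reproduces the effect on a made-up four-row data frame; the values are illustrative only.

```r
# Hypothetical data: the last row is entirely NA
df <- data.frame(x = c(1, 2, NA, NA), y = c(NA, 67, 68, NA))

# Number of non-NA entries per row: 1 2 1 0
counts <- rowSums(!is.na(df))

# Using the counts as row indices picks rows 1, 2, 1; the 0 picks nothing
picked <- df[counts, ]
nrow(picked)   # 3, although df has 4 rows

# The intended result: one count per row, stored as a new column
df$amount_known <- counts
```

With the count stored as a column, it can be passed to cor() together with the duration column.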
I need to fill in R data.frame (or data.table) using named vectors as rows. The problem is that named vectors to be used as rows usually do not have all the variables. In other words, usually named vector has smaller length than the number of columns. Names of variables in the vectors coincide with column names of the dataframe:
df <- data.frame(matrix(NA, 2, 3))
colnames(df) <- c("A", "B", "C")
obs1 <- c(A=2, B=4)
obs2 <- c(A=3, C=10)
I want df as follows:
> df
A B C
1 2 4 NA
2 3 NA 10
So I want to fill in the first two rows with obs1 and obs2 respectively. When I try to do it, I get an error:
> df[1,] <- obs1
Error in `[<-.data.frame`(`*tmp*`, 1, , value = c(A = 2, B = 4)) :
replacement has 2 items, need 3
I suspect that similar question was already asked, but I could not find it. Does anybody know how to do it using data.frame or data.table?
We need to select the columns as well based on the names of 'obs1' and 'obs2'
df[1, names(obs1)] <- obs1
df[2, names(obs2)] <- obs2
-output
> df
A B C
1 2 4 NA
2 3 NA 10
When we do df[1,], it returns the first row with all the columns, i.e. its length is 3, whereas 'obs1' and 'obs2' each have a length of only 2, hence the length error.
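If there are many such vectors, the same name-based assignment can be put in a loop. This is only a sketch: the list name obs is an assumption made for illustration, and it presumes every name in each vector is an existing column of df.

```r
# Template data frame with numeric NA cells
df <- data.frame(matrix(NA_real_, 2, 3))
colnames(df) <- c("A", "B", "C")

# Hypothetical container holding one named vector per row
obs <- list(c(A = 2, B = 4), c(A = 3, C = 10))

# Fill row i using only the columns named in the i-th vector
for (i in seq_along(obs)) {
  df[i, names(obs[[i]])] <- obs[[i]]
}
df
```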
Also, creating a template dataset to fill is not really needed as we can use bind_rows which will automatically fill with NA for those columns not present
library(dplyr)
bind_rows(obs1, obs2)
# A tibble: 2 x 3
A B C
<dbl> <dbl> <dbl>
1 2 4 NA
2 3 NA 10
Solution with data.table:
library(data.table)
obs1 <- data.table(t(obs1))
obs2 <- data.table(t(obs2))
df <- rbindlist(list(obs1,obs2),fill=TRUE)
df
Output:
A B C
<dbl> <dbl> <dbl>
1 2 4 NA
2 3 NA 10
I am trying to generate a lot of test data for other programs.
Working in R Studio I import an SPSS sav file which has 73 variables and the values and labels recorded in it using Haven as a dataframe "td". This gives me all the variable names which I need to work with. Then I delete all the existing data.
td <- td[0,]
Then I generate 10,000 test data rows by loading the index IDs
td$ID <- 12340000:12349999
So far so good.
I have a constant called ThismanyRows <- 10000
I have a large list of Column header names in a variable called BinaryVariables
And a vector of valid values for it called CheckedOrNot <- c(NA, 1)
This is where the problem is:
td[,BinaryVariables] <- sample(x = CheckedOrNot, size= ThismanyRows, replace = TRUE)
does fill all the columns with data. But it's all exactly the same data, which isn't what I want.
I want the sample function to run against each column, but not each value in each column as in.
Even when
Fillbinary <- function () {sample(x = CheckedOrNot, size= ThismanyRows, replace = TRUE)}
and
td <- lapply(td[,BinaryVariables],Fillbinary)
generates: Error in FUN(X[[i]], ...) : unused argument (X[[i]])
So far I have not been able to work out how to deal with each column as a column and apply the sample function to it.
Any help much appreciated.
Let's generate some fake data first for the example:
BinaryVariables <- c("v1","v2","v3")
CheckedOrNot <- c(NA, 1)
ThismanyRows <- 10
td <- data.frame(ID=1:10)
The issue is that you are generating 10 values and feeding that in to replace 3 * 10 values.
There's a couple of ways to solve this. You might initially think, well, I'll generate 10 values 3 times, like so:
td[BinaryVariables] <- replicate(length(BinaryVariables),
sample(x = CheckedOrNot, size=ThismanyRows, replace=TRUE),
simplify=FALSE)
That will work fine, but why sample 3 times if you can sample once and fill once?
td[BinaryVariables] <- sample(x = CheckedOrNot,
size=ThismanyRows*length(BinaryVariables), replace = TRUE)
And the (well, a) result shows that the values in each column are different:
# ID v1 v2 v3
#1 1 NA 1 1
#2 2 NA 1 1
#3 3 NA 1 NA
#4 4 NA 1 NA
#5 5 1 NA 1
#6 6 NA 1 1
#7 7 1 NA 1
#8 8 1 1 NA
#9 9 1 NA NA
#10 10 1 NA NA
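As for the "unused argument" error in the question: lapply passes each element to the function as its first argument, but Fillbinary was defined with no parameters. A minimal sketch of an lapply version follows, assuming a helper that accepts (and ignores) the column name; the parameter name column_name and the seed are illustrative choices.

```r
BinaryVariables <- c("v1", "v2", "v3")
CheckedOrNot <- c(NA, 1)
ThismanyRows <- 10
td <- data.frame(ID = 1:10)

set.seed(1)  # only to make the sketch reproducible

# The function must accept the element that lapply hands to it
FillBinary <- function(column_name) {
  sample(x = CheckedOrNot, size = ThismanyRows, replace = TRUE)
}

# One independent draw per column name; assign to the columns, not to td itself
td[BinaryVariables] <- lapply(BinaryVariables, FillBinary)
td
```

Note that the assignment targets td[BinaryVariables]; assigning the lapply result to td, as in the question, would replace the whole data frame with a list.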
I am getting the wrong result when removing the all-NA columns in R
data file : https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
trainingData <- read.csv("D:\\pml-training.csv",na.strings = c("NA","", "#DIV/0!"))
Now I want to remove all the columns that contain only NAs
Approach 1: here I mean to keep every column whose count of non-NA values is greater than 0
aa <- trainingData[colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
154 columns
Approach 2: as I read this query, it should give all the columns that are NA with sum = 0, but it returns the columns that do not have any NA, which is the expected result
bb <- trainingData[,colSums(is.na(trainingData)) == 0]
length(colnames(bb))
60 columns (expected)
Can someone please help me to understand what is wrong in first statement and what is right in second one
aa <- trainingData[,colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
You convert the data frame to a boolean data frame with !is.na(trainingData), and find all columns where there is at least one TRUE (so non-NA) in the column. So this returns all columns that have at least one non-NA value, which seems to be all but 6 columns.
bb <- trainingData[colSums(is.na(trainingData)) == 0]
length(colnames(bb))
You convert the data frame to boolean with is.na(trainingData) and return all columns where there is no TRUE (no NA) in the column. This returns all columns where there are no missing values (i.e. no NAs).
Example as requested in comment:
df = data.frame(a=c(1,2,3),b=c(NA,1,1),c=c(NA,NA,NA))
bb <- df[colSums(is.na(df)) == 0]
> df
a b c
1 1 NA NA
2 2 1 NA
3 3 1 NA
> bb
a
1 1
2 2
3 3
So the statements are in fact different. If you want to remove all columns that are only NA's, you should use the first statement. Hope this helps.
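To see both statements side by side, here is a small sketch on the same three-column example: column a is complete, b has one NA, and c is entirely NA.

```r
df <- data.frame(a = c(1, 2, 3), b = c(NA, 1, 1), c = c(NA, NA, NA))

# Keep columns with at least one non-NA value (drops only the all-NA column c)
aa <- df[colSums(!is.na(df)) > 0]
names(aa)   # "a" "b"

# Keep columns with no NA at all (drops b as well)
bb <- df[colSums(is.na(df)) == 0]
names(bb)   # "a"
```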
I have a data frame that has the first column go from 1 to 365 like this
c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2...
and the second column has times that repeat over and over again like this
c(0,30,130,200,230,300,330,400,430,500,0,30,130,200,230,300,330,400,430,500...
so for every 1 value in the first column I have a corresponding time in the second column then when I get to the 2's the times start over and each 2 has a corresponding time,
occasionally I will come across
c(3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4...
c(0,30,130,200,230,330,400,430,500,0,30,130,200,230,300,330,400,430,500...
Here one of the 3's is missing and the corresponding time of 300 is missing with it.
How can I go through my entire data frame and add these missing values? I need a way for R to identify any missing values, then insert a row and put the appropriate value, 1 to 365, in column one and the appropriate time with it. So for the given example R would add a row in between 230 and 330 and then place a 3 in the first column and 300 in the second. There are parts of the column that are missing several consecutive values. It is not just one here and there.
EDIT: Solution with all 10 times clearly specified in advance and code tidy up/commenting
You need to create another data.frame containing every possible row and then merge it with your data.frame. The key aspect is the all.x = TRUE in the final merge which forces the gaps in your data to be highlighted. I simulated the gaps by sampling only 15 of the first 20 possible day/time combinations in your.dat
# create vectors for the days and times
the.days = 1:365
the.times = c(0,30,100,130,200,230,330,400,430,500) # the 10 times to repeat
# create a master data.frame with all the times repeated for each day, taking only the first 20 observations
dat.all = data.frame(x1=rep(the.days, each=10), x2 = rep(the.times,times = 365))[1:20,]
# mimic your data.frame with some gaps in it (only 15 of 20 observations are present)
your.sample = sample(1:20, 15)
your.dat = data.frame(x1=rep(the.days, each=10), x2 = rep(the.times,times = 365), x3 = rnorm(365*10))[your.sample,]
# left outer join merge to include ALL of the master set and all of your matching subset, filling blanks with NA
merge(dat.all, your.dat, all.x = TRUE)
Here is the output from the merge, showing all 20 possible records with the gaps clearly visible as NA:
x1 x2 x3
1 1 0 NA
2 1 30 1.23128294
3 1 100 0.95806838
4 1 130 2.27075361
5 1 200 0.45347199
6 1 230 -1.61945983
7 1 330 NA
8 1 400 -0.98702883
9 1 430 NA
10 1 500 0.09342522
11 2 0 0.44340164
12 2 30 0.61114408
13 2 100 0.94592127
14 2 130 0.48916825
15 2 200 0.48850478
16 2 230 NA
17 2 330 0.52789171
18 2 400 -0.16939587
19 2 430 0.20961745
20 2 500 NA
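The same idea applied to the gap described in the question (day 3 missing its 300 entry) might look like the sketch below. It assumes the ten times listed in the question, uses expand.grid to build the master table, and invents an x3 column of placeholder values purely for illustration.

```r
the.days  <- 1:365
the.times <- c(0, 30, 130, 200, 230, 300, 330, 400, 430, 500)  # times as listed in the question

# Master table: every day/time combination, times cycling within each day
master <- expand.grid(x2 = the.times, x1 = the.days)[, c("x1", "x2")]

# Hypothetical data for days 3 and 4, with the 300 entry missing on day 3
your.dat <- data.frame(
  x1 = rep(c(3L, 4L), times = c(9L, 10L)),
  x2 = c(0, 30, 130, 200, 230, 330, 400, 430, 500, the.times),
  x3 = seq_len(19)  # placeholder measurements
)

# Left join on x1 and x2: rows absent from your.dat get NA in x3
filled <- merge(master[master$x1 %in% 3:4, ], your.dat, all.x = TRUE)
filled <- filled[order(filled$x1, filled$x2), ]
filled[is.na(filled$x3), ]   # the inserted row: x1 = 3, x2 = 300
```

Dropping the %in% filter would complete all 365 days at once.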
Here are a few NA-handling functions that could help you get started.
For the inserting task, you should provide your own data using dput or a reproducible example.
df <- data.frame(x = sample(c(1, 2, 3, 4), 100, replace = TRUE),
                 y = sample(c(0,30,130,200,230,300,330,400,430,500), 100, replace = TRUE))
# Set the first 20 values of x to NA
df[1:20, 1] <- NA
df$y <- ifelse(df$y == 0, NA, df$y)
# Columns x and y have NAs in different places.
# Logical test for NA
is.na(df)
# Keep the non-NA cases of one column
df[!is.na(df$x),]
df[!is.na(df$y),]
# Returns the rows that are complete in both columns
df[complete.cases(df),]
# Gives the cases that are incomplete.
df[!complete.cases(df),]
# Returns the cases without NAs
na.omit(df)
Given
index = c(1,2,3,4,5)
codes = c("c1","c1,c2","","c3,c1","c2")
df=data.frame(index,codes)
df
index codes
1 1 c1
2 2 c1,c2
3 3
4 4 c3,c1
5 5 c2
How can I create a new df that looks like
df1
index codes
1 1 c1
2 2 c1
3 2 c2
4 3
5 4 c3
6 4 c1
7 5 c2
so that I can perform aggregates on the codes? The "index" of the actual data set are a series of timestamps, so I'll want to aggregate by day or hour.
The method of Roland is quite good, provided the variable index has unique keys. You can gain some speed by working with the lists directly. Take into account that:
In your original data frame, codes is a factor. There is no point in that; you want it to be character.
In your original data frame, "" is used instead of NA. As splitting that one gives a result of length 0, you can get into all kinds of trouble later on. I'd use NA there. " " is an actual value, "" is no value at all, but you want a missing value. Hence NA.
So my idea would be:
The data:
index = c(1,2,3,4,5)
codes = c("c1","c1,c2",NA,"c3,c1","c2")
df=data.frame(index,codes,stringsAsFactors=FALSE)
Then :
X <- strsplit(df$codes,",")
data.frame(
index = rep(df$index,sapply(X,length)),
codes = unlist(X)
)
Or, if you insist on using "" instead of NA:
X <- strsplit(df$codes,",")
ll <- sapply(X,length)
X[ll==0] <- NA
data.frame(
index = rep(df$index,pmax(1,ll)),
codes = unlist(X)
)
Neither method assumes a unique key in index. Both work perfectly well with non-unique timestamps.
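For reference, here is a self-contained sketch of the second variant run on the question's original data (with "" rather than NA), followed by a simple count per code. It uses lengths(), the base R shortcut for sapply(X, length); the names n and df1 are illustrative.

```r
index <- c(1, 2, 3, 4, 5)
codes <- c("c1", "c1,c2", "", "c3,c1", "c2")
df <- data.frame(index, codes, stringsAsFactors = FALSE)

X <- strsplit(df$codes, ",")   # "" splits to character(0)
n <- lengths(X)                # 1 2 0 2 1
X[n == 0] <- NA_character_     # keep one NA row for empty entries

df1 <- data.frame(
  index = rep(df$index, pmax(1, n)),
  codes = unlist(X),
  stringsAsFactors = FALSE
)
df1

# Example aggregate: how often each code occurs (NA is ignored by default)
table(df1$codes)
```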
You need to split the string (using strsplit) and then combine the resulting list with the data.frame.
The following relies on the assumption that codes are unique in each row. If you have many codes in some rows and only few in others, this might waste a lot of RAM and it might be better to loop.
#to avoid character(0), which would be omitted in rbind
#(convert to character first, so this also works when codes is not a factor)
df$codes <- as.character(df$codes)
df$codes[df$codes==""] <- " "
#rbind fills each row by propagating the values to the "empty" columns for each row
df2 <- cbind(df, do.call(rbind,strsplit(df$codes,",")))[,-2]
library(reshape2)
df2 <- melt(df2, id="index")[-2]
#here the assumption is needed
df2 <- df2[!duplicated(df2),]
df2[order(df2[,1], df2[,2]),]
# index value
#1 1 c1
#2 2 c1
#7 2 c2
#3 3
#9 4 c1
#4 4 c3
#5 5 c2
Here's another alternative using "data.table". The sample data includes NA instead of a blank space and includes duplicated index values:
index = c(1,2,3,2,4,5)
codes = c("c1","c1,c2",NA,"c3,c1","c2","c3")
df = data.frame(index,codes,stringsAsFactors=FALSE)
library(data.table)
## We could create the data.table directly, but I'm
## assuming you already have a data.frame ready to work with
DT <- data.table(df)
DT[, list(codes = unlist(strsplit(codes, ","))), by = "index"]
# index codes
# 1: 1 c1
# 2: 2 c1
# 3: 2 c2
# 4: 2 c3
# 5: 2 c1
# 6: 3 NA
# 7: 4 c2
# 8: 5 c3