I have a dataset with 5 columns and 50 rows. I want to randomly divide it into two parts, one with 35 rows and the other with 15. Then I would like to add another column to this dataset containing TRUE/FALSE values: TRUE if the row belongs to the 35 randomly selected rows and FALSE if it belongs to the 15. How do I achieve this in R?
All help is greatly appreciated.
Thanks
We create a vector of 'TRUE/FALSE' elements using rep by specifying the times to replicate the 'TRUE/FALSE' values, sample it, and create a new column ('ind') by assigning the output. Then, split the dataset into a list of 2 data.frames by 'ind' column.
df1$ind <- sample(rep(c(TRUE, FALSE), times = c(35, 15)))
split(df1, df1$ind)
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(9, 50*5, replace=TRUE), ncol=5))
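As a hedged alternative sketch of the same idea, one could draw 35 row indices with sample and derive the flag from them. The names idx and parts are illustrative and not part of the original answer.

```r
# Same example data as above
set.seed(24)
df1 <- as.data.frame(matrix(sample(9, 50 * 5, replace = TRUE), ncol = 5))

# Draw 35 distinct row numbers at random (sampling without replacement)
idx <- sample(nrow(df1), 35)

# TRUE for the 35 sampled rows, FALSE for the remaining 15
df1$ind <- seq_len(nrow(df1)) %in% idx

# split() returns a list with elements named "FALSE" and "TRUE"
parts <- split(df1, df1$ind)
sapply(parts, nrow)
```

Keeping idx around can be handy if the same partition has to be applied to another object later.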
I have a .csv file of 39 variables and 713 rows, each containing a count of plastic items. I have another column which is the survey length, and I want to standardise each count of items by a survey length of 100. I am unsure how to create a loop to run through each row and cell individually to do this. Many also have NA values.
Any ideas would be great.
Thank you.
Consider applying the formula directly to the columns, with no need for a loop:
# RETRIEVE ALL COLUMN NAMES (MINUS SURVEY LENGTH)
vars <- names(df)[!grepl("survey_length", names(df))]
# EXPAND SINGLE COLUMN TO EQUAL DIMENSION OF DATA FRAME
survey_length_mat <- matrix(df$survey_length, ncol=length(vars), nrow=nrow(df))
# APPLY FORMULA
df[vars] <- (df[vars] / survey_length_mat) * 100
df
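A minimal sketch of how this behaves with missing values, assuming a toy data frame whose column names (bottles, bags, survey_length) are made up for illustration. Dividing a data frame by a vector of length nrow(df) applies the vector down each column, so the helper matrix is optional here, and NA counts simply stay NA.

```r
# Hypothetical toy data: two item counts plus the survey length
df <- data.frame(
  bottles = c(4, NA, 10),
  bags = c(2, 6, NA),
  survey_length = c(50, 200, 100)
)

# All columns except the survey length
vars <- setdiff(names(df), "survey_length")

# Standardise each count to a survey length of 100; NA propagates
df[vars] <- df[vars] / df$survey_length * 100
df
```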
I'm trying to replace a chunk of a dataset A (say, columns 7 to 25 for some rows listed in a vector "rows") with a dataset B of the same size. The dataset B repeats one row whose values are contained in the vector "new_values". I tried the following code:
A[rows, 7:25] <- sapply(A[rows, 7:25], replace, values=new_values, list= 1:ncol(A[rows, 7:25]))
It's not working, however. What happens is that the columns in A all end up the same, and each row gets a different value from "new_values".
Any idea how to fix that?
Thank you!
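One possible approach, sketched under the assumption that new_values holds exactly one value per replaced column (19 values for columns 7 to 25): build the repeated-row block with matrix(..., byrow = TRUE) and assign it directly. The example data for A, rows and new_values below are hypothetical.

```r
# Hypothetical data: 10 rows, 25 numeric columns, all zero
A <- as.data.frame(matrix(0, nrow = 10, ncol = 25))
rows <- c(2, 5, 9)
new_values <- seq_len(19)  # one value per column in 7:25

# byrow = TRUE lays new_values out along each row, so every
# replaced row is a copy of new_values
A[rows, 7:25] <- matrix(new_values,
                        nrow = length(rows),
                        ncol = length(new_values),
                        byrow = TRUE)
```

Without byrow = TRUE the matrix is filled column by column, which would spread new_values down the rows and resembles the symptom described in the question.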
Sample data
mysample <- data.frame(ID = 1:100, kWh = rnorm(100))
I'm trying to automate the process of returning the rows in a data frame that contain the 5 highest values in a certain column. In the sample data, the 5 highest values in the "kWh" column can be found using the code:
(tail(sort(mysample$kWh), 5))
which in my case returns:
[1] 1.477391 1.765312 1.778396 2.686136 2.710494
I would like to create a table that contains rows that contain these numbers in column 2.
I am attempting to use this code:
mysample[mysample$kWh == (tail(sort(mysample$kWh), 5)),]
This returns:
ID kWh
87 87 1.765312
I would like it to return the 5 rows that contain the figures above in the "kWh" column. I'm sure I've missed something basic but I can't figure it out.
We can use rank
mysample$Rank <- rank(-mysample$kWh)
head(mysample[order(mysample$Rank),],5)
If we don't need to create a column, we can use order directly (as @Jaap mentioned, in three alternative methods)
#order descending and get the first 5 rows
head(mysample[order(-mysample$kWh),],5)
#order ascending and get the last 5 rows
tail(mysample[order(mysample$kWh),],5)
#or just use sequence as index to get the rows.
mysample[order(-mysample$kWh),][1:5,]
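For context on why the attempt in the question returns a single row: == compares the two vectors element by element, recycling the 5 top values along the column, so a row matches only when its value happens to line up with the recycled position. A hedged sketch using %in%, which tests set membership instead, is shown below on made-up fixed data (the variable names top5 and result are illustrative).

```r
# Fixed toy data instead of rnorm, so the result is reproducible
mysample <- data.frame(
  ID = 1:10,
  kWh = c(0.3, 2.1, -1.0, 1.7, 0.9, 2.7, -0.4, 1.5, 0.1, 1.8)
)

# The 5 largest kWh values
top5 <- tail(sort(mysample$kWh), 5)

# Keep every row whose kWh is one of those values (original row order)
result <- mysample[mysample$kWh %in% top5, ]
result
```

If values are tied at the cutoff, %in% may return more than 5 rows, whereas the order-based versions always return exactly 5.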
I am trying to subset a data frame by taking the integer values of 2 columns of my data frame
Subs1<-subset(DATA,DATA[,2][!is.na(DATA[,2])] & DATA[,3][!is.na(DATA[,3])])
but it gives me an error : longer object length is not a multiple of shorter object length.
How can I construct a subset which is composed of NON NA values of column 2 AND column 3?
Thanks a lot!
Try this:
Subs1<-subset(DATA, (!is.na(DATA[,2])) & (!is.na(DATA[,3])))
The second argument of subset is a logical vector of length nrow(DATA), indicating whether to keep the corresponding row.
The na.omit function can be an answer to your question. Note that DATA[2:3] selects only columns 2 and 3, so the result contains just those two columns:
Subs1 <- na.omit(DATA[2:3])
[https://stat.ethz.ch/R-manual/R-patched/library/stats/html/na.fail.html]
Here is an example.
a, b and c are three vectors; a and b each have a missing value.
Once they are created, I use cbind to bind them into one matrix, which can afterwards be converted to a data frame.
The final result is a data frame in which 2 of the 3 columns have a missing value.
So we need to keep only the rows with complete cases. DATA[complete.cases(DATA), ] keeps only the rows that have no missing values in any column. The subset object holds these complete rows.
a <- c(1,NA,2)
b <- c(NA,1,2)
c <- c(1,2,3)
DATA <- as.data.frame(cbind(a,b,c))
subset <- DATA[complete.cases(DATA), ]
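Note that complete.cases(DATA) checks every column. If only columns 2 and 3 should be checked, as in the question, a possible variant is to pass just those columns to complete.cases while still returning all columns. The toy DATA below, including the id and c columns, is hypothetical.

```r
# Hypothetical data: column c also has an NA that should NOT drop a row
DATA <- data.frame(
  id = 1:4,
  a = c(1, NA, 2, 3),
  b = c(NA, 1, 2, 3),
  c = c(1, 2, NA, 4)
)

# TRUE where columns 2 and 3 are both non-NA
keep <- complete.cases(DATA[, 2:3])

# All columns are kept; only the rows are filtered
Subs1 <- DATA[keep, ]
Subs1
```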
I am new to R with a fairly simple question, I just can't figure out the answer. For my example I will use a data frame with 3 columns, but my actual data set is 139 columns with 10000 rows.
I want to replace all of the values in a given row with NA if the value in the same row in column C contains a value < 10.
Assume that all of my columns are either number or integer values.
so I want to take the data frame:
x=data.frame(c(5,9,2),c(3,4,6),c(12,9,11))
names(x)=c("A","B","C")
and replace row 2 with NA to create
y=data.frame(c(5,NA,2),c(3,NA,6),c(12,NA,11))
names(y)=c("A","B","C")
Thanks!
How about:
x[x$C <10 ,] <- NA
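One caveat worth sketching: if column C itself contains NA, x$C < 10 yields NA for those rows, and as far as I know data frame subassignment rejects NA in a logical subscript. Wrapping the condition in which() is a common way around this, since it returns only the positions that are TRUE. The data below is a variation of the question's example with an NA added for illustration.

```r
# Variation of the example: C has a missing value in row 3
x <- data.frame(A = c(5, 9, 2), B = c(3, 4, 6), C = c(12, 9, NA))

# which() drops the NA comparison, so only row 2 (C = 9) is blanked
x[which(x$C < 10), ] <- NA
x
```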