How to split a dataframe with missing values? - r

I've got a dataframe that I need to split based on the values in one of the columns - most of them are either 0 or 1, but a couple are NA, which I can't get to form a subset. This is what I've done:
all <- read.csv("XXX.csv")
splitted <- split(all, all$case_con)
dim(splitted[[1]]) #--> gives me 185 rows
dim(splitted[[2]]) #--> gives me 180 rows
but all contained 403 rows, which means that 38 rows with NA were left out, and I don't know how to form a similar subset for them. Any suggestions?

Try this:
splitted <- c(split(all, all$case_con), list(subset(all, is.na(case_con))))
This should tack on the data frame subset with the NAs as the last one in the list...

list(split(all, all$case_con), split(all, is.na(all$case_con)))
I think that would work. Thanks.
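As another option, base R's addNA() turns NA into an explicit factor level, so split() keeps those rows as their own group. A minimal sketch, with made-up data standing in for the OP's CSV:

```r
# toy stand-in for the OP's data: a 0/1 column with some NAs
all <- data.frame(case_con = c(0, 1, NA, 0, 1, NA), x = 1:6)

# addNA() makes NA an explicit factor level, so split() returns three groups
splitted <- split(all, addNA(all$case_con))

sapply(splitted, nrow)  # 2 rows for 0, 2 for 1, 2 for NA
```

This keeps all rows accounted for in one call, rather than stitching the NA subset on afterwards.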

Related

Merging cells in each rows using semicolon?

This may be a very basic question, but I have not been able to figure it out for the last hour. I want to merge the cells in each row using a comma or semicolon. The data looks like
OTU_1 23 15 273 51 127 190 220 83 k__Bacteria p__Chloroflexi c__SJA-15 o__C10_SB1A f__C10_SB1A g__Candidatus Amarilinum s__
The output would be like this
OTU_1;23;15;273;51;127;190;220;83;k__Bacteria;p__Chloroflexi;c__SJA-15;o__C10_SB1A;f__C10_SB1A;g__Candidatus Amarilinum;s__
Can you please guide me on how this can be done in R? I know how to use a CONCATENATE function in a spreadsheet, but I am wondering if it can be done in R.
Thanks
It's not clear what you mean by cells. Is this a vector? Columns of a data.frame?
The tool to use here is probably paste(), but how you use it will depend on the underlying structure of the data.
> paste(letters, collapse = ";")
[1] "a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z"
To merge cells by rows you can use apply() with the second argument (MARGIN) equal to 1, i.e.
apply(your_dataframe, 1, function(x) paste(x, collapse = ","))
You'll receive a character vector whose length equals the number of rows, where each element is the merged row.
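Putting the two pieces together, a small self-contained sketch (the column names are made up for illustration):

```r
# toy data: an OTU label plus a couple of character columns
df <- data.frame(id    = c("OTU_1", "OTU_2"),
                 count = c("23", "15"),
                 taxon = c("k__Bacteria", "p__Chloroflexi"),
                 stringsAsFactors = FALSE)

# MARGIN = 1 applies paste() to each row; the result is a character vector
merged <- apply(df, 1, function(x) paste(x, collapse = ";"))

merged[1]  # "OTU_1;23;k__Bacteria"
```

Note that apply() coerces the data frame to a character matrix first, so numeric columns end up as formatted strings in the merged output.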

Correct Subsetting Solution

I have two data frames, clust1 and clust2, with different numbers of rows: clust1 has 53 rows and clust2 has 150 rows. I would like to identify the rows in clust2 that have a similar longitude and latitude to rows in clust1.
If I write this code:
a <- subset(clust2, clust2$Pickup_longitude == clust1$Pickup_longitude)
I get the error below:
Longer object length is not a multiple of shorter object length
If I write it this way:
a <- subset(clust2, clust2[53,]$Pickup_longitude == clust1$Pickup_longitude)
I get an answer, but it is definitely wrong, as I have limited the comparison to a single row of clust2. What should I do to get the proper answer?
You could use dplyr's semi_join().
library(dplyr)
a <- semi_join(clust2, clust1, by = "Pickup_longitude")
That should give you all rows in clust2 that have Pickup_longitude values that appear in clust1.
(Edited to add the quotes in the "by" - thanks Gopala)
Sarina's comment will work; you just need:
a <- subset(clust2, clust2$Pickup_longitude %in% clust1$Pickup_longitude)
If you also want to identify which rows match, you can use which():
which(clust2$Pickup_longitude %in% clust1$Pickup_longitude)
This will give you the row numbers in clust2 whose longitude also appears in clust1.
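Since the question asks about matching on both longitude and latitude, the %in% approach extends to two conditions. A sketch with made-up coordinates:

```r
# toy coordinates standing in for the OP's clusters
clust1 <- data.frame(Pickup_longitude = c(-73.99, -73.95),
                     Pickup_latitude  = c( 40.73,  40.76))
clust2 <- data.frame(Pickup_longitude = c(-73.99, -73.90, -73.95),
                     Pickup_latitude  = c( 40.73,  40.70,  40.76))

# keep clust2 rows whose longitude AND latitude both occur in clust1
a <- subset(clust2,
            Pickup_longitude %in% clust1$Pickup_longitude &
            Pickup_latitude  %in% clust1$Pickup_latitude)
```

Be aware that %in% tests exact equality, which can be fragile for floating-point coordinates; rounding both columns to a fixed number of decimal places first may be safer.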

Data Manipulation, Looping to add columns

I have asked this question a couple of times without any help. I have since improved the code, so I am hoping somebody has some ideas! I have a dataset full of 0's and 1's. I simply want to add the 10 columns together, resulting in 1 column with 3835 rows. This is my code so far:
# select for valid IDs
data = history[history$studyid %in% valid$studyid,]
sibling = data[,c('b16aa','b16ba','b16ca','b16da','b16ea','b16fa','b16ga','b16ha','b16ia','b16ja')]
# replace all NA values by 0
sibling[is.na(sibling)] <- 0
# loop over all columns and count the number of 174
apply(sibling, 2, function(x) sum(x==174))
The problem is that this code sums within each column; I want to sum across the columns so that I end up with one value per row. This is the answer I am getting now, which is wrong:
b16aa b16ba b16ca b16da b16ea b16fa b16ga b16ha b16ia b16ja
68 36 22 18 9 5 6 5 4 1
In apply() you have MARGIN set to 2, which applies the function over columns. Set the MARGIN argument to 1 so that your function, sum, is applied across rows. This was mentioned by @sgibb.
If that doesn't work (I can't reproduce your example), you could first convert the elements of the matrix to logicals, X2 <- apply(sibling, c(1,2), function(x) x == 174), and then use rowSums() to add up the values in each row: Xsum <- rowSums(X2, na.rm=TRUE). With this setup you do not need to change the NAs to 0 first, as the na.rm argument in rowSums() handles them.
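A minimal sketch of the rowSums() route on toy data; note that the comparison sibling == 174 already yields a logical matrix, so the intermediate apply() step can be skipped entirely:

```r
# toy data standing in for the OP's sibling columns
sibling <- data.frame(b16aa = c(174, 0, 174),
                      b16ba = c(174, 174, NA))

# one count of 174s per row; na.rm = TRUE handles the NAs directly
counts <- rowSums(sibling == 174, na.rm = TRUE)

counts  # 2 1 1
```

This returns one value per row, which is the shape the question asks for.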

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
  # ensure character data
  LOG_MESSAGE <- as.character(LOG_MESSAGE)
  # rle() returns a list with 'values' and 'lengths'; replace each empty
  # value with the preceding event, then expand back to full length
  CURRENT_EVENT <- with(rle(LOG_MESSAGE),
                        rep(replace(values,
                                    nchar(values) == 0,
                                    values[nchar(values) != 0]),
                            lengths))
})
# LOG_MESSAGE CURRENT_EVENT
# 1 FIRST_EVENT FIRST_EVENT
# 2 FIRST_EVENT
# 3 SECOND_EVENT SECOND_EVENT
# 4 SECOND_EVENT
# 5 SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <- transform(dat,
                 Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
                                                 "_"),
                                        `[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last one carried forward" part).
2. The result of 1. is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector; in this case each component is a vector of length two. We want the first elements of these vectors,
4. so I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
5. The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.
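If zoo isn't available, the carry-forward step alone can be sketched in base R with cumsum(), assuming (as in the example) that the first value is not NA:

```r
log_message <- c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA)

# index of the most recent non-NA entry at each position
idx <- cumsum(!is.na(log_message))

# pull the non-NA values out by that index: a base-R na.locf
current <- log_message[!is.na(log_message)][idx]

current  # "FIRST_EVENT" "FIRST_EVENT" "SECOND_EVENT" "SECOND_EVENT" "SECOND_EVENT"
```

This trick works because cumsum() over the logical vector increments exactly once per event, so every row between two events maps back to the earlier one.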

dataframe where one column only has na values omitted

I have a data frame "accdata".
dim(accdata)
[1] 6496 188
One of the variables, VAL, is of interest to me. I must calculate the number of instances where VAL is equal to 24.
I tried a few functions that returned error messages. After some research, it seems I need to remove the NA values from VAL first.
I would try something like nonaaccdaa <- na.omit(accdata), except this removes rows with NA in any variable, not just VAL.
I tried nonaval <- na.omit(accdata[accdata$VAL]) but when I then checked the number of rows using nrow(), the result was NULL. I had expected a value between 1 and 6,496.
What's going on here?
This should do the trick:
sum(accdata$VAL == 24, na.rm=TRUE)
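To see why na.rm matters here, a toy example with made-up values:

```r
# toy VAL column with NAs mixed in
VAL <- c(24, NA, 24, 7, NA)

sum(VAL == 24)                # NA: any NA in the comparison poisons the sum
sum(VAL == 24, na.rm = TRUE)  # 2: NAs are dropped before summing
```

The comparison VAL == 24 returns NA wherever VAL is NA, so without na.rm = TRUE the whole sum is NA; with it, only the genuine matches are counted, and no rows need to be removed from the data frame.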
