How to remove NA data in only one column? - r

I have a file that looks like so:
date A B
2014-01-01 2 3
2014-01-02 5 NA
2014-01-03 NA NA
2014-01-04 7 11
If I use newdata <- na.omit(data), where data is the above table loaded into R, then I get only two data points, since na.omit() filters out every row containing an NA. What I want is to filter per column, so that I get three data points for A and only two for B. My actual data set is much larger and the numbers are different, but neither should matter.
How can I achieve that?

Use is.na() on the column you want to filter on, and index the data frame with the negated result. For example:
R> data[!is.na(data$A), ]
date A B
1 2014-01-01 2 3
2 2014-01-02 5 NA
4 2014-01-04 7 11
R> data[!is.na(data$B), ]
date A B
1 2014-01-01 2 3
4 2014-01-04 7 11
is.na() returns TRUE for every element that is NA and FALSE otherwise. To index the rows of the data frame, we can use this logical vector, but we want its converse. Hence we use ! to imply the opposite (TRUE becomes FALSE and vice versa).
You can restrict which columns you return by adding an index for the columns after the , in [ , ], e.g.
R> data[!is.na(data$A), 1:2]
date A
1 2014-01-01 2
2 2014-01-02 5
4 2014-01-04 7
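The same filtering can also be written with subset(), which evaluates the condition using the data frame's own columns. A minimal sketch, rebuilding the example data from the question:

```r
# Rebuild the example data frame from the question.
data <- data.frame(
  date = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-04")),
  A = c(2, 5, NA, 7),
  B = c(3, NA, NA, 11)
)

# subset() drops rows where the condition is FALSE or NA; the select
# argument restricts the returned columns.
subset(data, !is.na(A))                       # three rows remain
subset(data, !is.na(B), select = c(date, B))  # two rows, two columns
```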

Every column in a data frame must have the same number of elements; that is why NAs come in handy in the first place.
What you can do is
df.a <- df[!is.na(df$A), -3]
df.b <- df[!is.na(df$B), -2]

In Python, we can use the subset argument to specify the column(s) to check, and inplace=True to make the changes in the DataFrame itself:
rounds2.dropna(subset=['company_permalink'],inplace=True)

Related

R: how to merge two columns (column addition) while ignoring rows with same value

I have a data.frame like this
I want to add Sample_Intensity_RTC and Sample_Intensity_nRTC's values and then create a new column, however in cases of Sample_Intensity_RTC and Sample_Intensity_nRTC have the same value, no addition operation is done.
Please note that these columns are not rounded in the same way, so many numbers are the same but displayed with a different nsmall.
It seems you just want to combine these two columns, not add them in the sense of addition (+). Think of a zipper perhaps. Or two roads merging into one.
The two columns seem to have been created by two separate processes, the first looks to have more accuracy. However, after importing the data provided in the link, they have exactly the same values.
test <- read.csv("test.csv", row.names = 1)
options(digits=10)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC
1 191017QMXP002 NA NA
2 191017QNXP008 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46
6 191017USXP002 NA 76984658.00
In any case, to combine them, we can just use ifelse with the condition is.na for the first column.
test$new_col <- ifelse(is.na(test$Sample_Intensity_RTC),
test$Sample_Intensity_nRTC,
test$Sample_Intensity_RTC)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
1 191017QMXP002 NA NA NA
2 191017QNXP008 41293681.00 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46 76693308.46
6 191017USXP002 NA 76984658.00 76984658.00
sapply(test, function(x) sum(is.na(x)))
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
0 126 143 108
You could also use the coalesce function from dplyr.
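For reference, the dplyr::coalesce() version mentioned above might look like this; a small sketch with a stand-in data frame, assuming dplyr is installed:

```r
library(dplyr)

# Stand-in for the first rows of the imported data.
test <- data.frame(
  Sample_Intensity_RTC  = c(NA, 41293681, NA),
  Sample_Intensity_nRTC = c(NA, 41293681, 76693308.46)
)

# coalesce() returns, element-wise, the first non-NA value across its
# arguments -- equivalent to the ifelse() call above.
test$new_col <- coalesce(test$Sample_Intensity_RTC,
                         test$Sample_Intensity_nRTC)
```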

Reformat wrapped data coerced into a dataframe? (R)

I have some data I need to extract from a .txt file in a very weird, wrapped format. It looks like this:
eid nint hisv large NA
1 1.00 1.00000e+00 0 1.0 NA
2 -152552.00 -6.90613e+04 -884198 -48775.7 1151.70
3 -5190.13 4.17751e-05 NA NA NA
4 2.00 1.00000e+00 0 1.0 NA
5 -172188.00 -8.16684e+04 -809131 -56956.1 -1364.07
6 -5480.54 4.01573e-05 NA NA NA
Luckily, I do not need all of this data. I just want to match each eid with the value written in scientific notation, so:
eid sigma
1 1 4.17751e-005
2 2 4.01573e-005
3 3 3.72098e-005
This data goes on for hundreds of thousands of eids. For each record, it needs to discard the last three values of the first row, all of the values in the second row, and keep the second value of the third row, then place that next to the first value of row 1, and repeat. The column names other than eid are totally disposable, too. I've never had to deal with wrapped data before, so I don't know where to begin.
**edited to show df after read-in.
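No answer is recorded here, but since each record spans exactly three rows, modular row indexing is one way to pull out the two wanted values. A minimal sketch; the column names V1/V2 and the exact layout after read-in are assumptions based on the sample shown:

```r
# Stand-in for the wrapped data frame after read-in: each record spans
# 3 rows; the first row's first value is eid, and the third row's second
# value is the scientific-notation number (sigma).
df <- data.frame(
  V1 = c(1.00, -152552.00, -5190.13, 2.00, -172188.00, -5480.54),
  V2 = c(1.00000e+00, -6.90613e+04, 4.17751e-05,
         1.00000e+00, -8.16684e+04, 4.01573e-05)
)

first.rows <- seq(1, nrow(df), by = 3)  # rows holding eid
third.rows <- seq(3, nrow(df), by = 3)  # rows holding sigma

result <- data.frame(eid = df$V1[first.rows], sigma = df$V2[third.rows])
```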

unexpected rbind.fill behavior when combining columns of different class

I tried to use the rbind.fill function from the plyr package to combine two dataframes with a column A, which contains only digits in the first dataframe, but (also) strings in the second dataframe. Reproducible example:
data1 <- data.frame(A=c(11111,22222,33333), b=c(4444,444,44444), c=c(5555,66666,7777))
data2 <- data.frame(A=c(1234,"ss150",123456), c=c(888,777,666))
rbind.fill(data1,data2)
This produced the output below, with incorrect data in column A, rows 4-6, and no error message.
A b c
1 107778 33434 6
2 1756756 4 7
3 2324234 5 8
4 2 NA 14562
5 3 NA 45613
6 1 NA 14
I had expected that the function would coerce the whole column into character class, or at least display NA or a warning. Instead, it inserted digits that I do not understand (in the actual file, these are two digit numbers that are not sorted). The documentation does not specify that columns must be of the same type in the to-be-combined data.frames.
How can I get this combination?
A b c
1 11111 4444 5555
2 22222 444 66666
3 33333 44444 7777
4 1234 NA 888
5 ss150 NA 777
6 123456 NA 666
Look at class(data2$A). It's a factor, which is stored internally as an integer vector with a vector of labels. Use stringsAsFactors=FALSE in your data.frame() call, or in read.csv() and friends. This forces the variables to be either numeric or character vectors.
data1 <- data.frame(A=c(11111,22222,33333), b=c(4444,444,44444), c=c(5555,66666,7777))
data2 <- data.frame(A=c(1234,"ss150",123456), c=c(888,777,666), stringsAsFactors=FALSE)
rbind.fill(data1,data2)
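An alternative sketch of the same fix: coerce column A to character on both sides before combining, so the two types agree explicitly and rbind.fill() has nothing to coerce (assumes plyr is installed):

```r
library(plyr)

data1 <- data.frame(A = c(11111, 22222, 33333), b = c(4444, 444, 44444),
                    c = c(5555, 66666, 7777))
data2 <- data.frame(A = c(1234, "ss150", 123456), c = c(888, 777, 666),
                    stringsAsFactors = FALSE)

# Make both A columns character before combining; rbind.fill() fills the
# missing b column of data2 with NA.
data1$A <- as.character(data1$A)
combined <- rbind.fill(data1, data2)
```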

Subsetting in R using a list

I have a large amount of data which I would like to subset based on the values in one of the columns (dive site in this case). The data looks like this:
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
alice rain 95 NA 50 NA 2 4 9
alice over NA 25 NA 25 2 4 9
steps clear NA 27 NA 25 2 4 9
steps NA 30 NA 20 1 4 9
andrea1 clear 60 NA 60 NA 2 4 5
I would like to create a subset of the data which contains only data for one dive site at a time (e.g. one subset for alice, one for steps, one for andrea1 etc...).
I understand that I could subset each individually using
alice <- subset(reefdata, site=="alice")
But as I have over 100 different sites to subset by, I would like to avoid specifying each subset individually. I think subset() is probably not flexible enough to take a list of names (or at least not with my current knowledge of R, which is growing but still in its infancy). Is there another command I should be looking into?
Thank you
This will create a list that contains the subset data frames in separate list elements.
splitdat <- split(reefdata, reefdata$site)
Then if you want to access the "alice" data you can reference it like
splitdat[["alice"]]
I would use the plyr package.
library(plyr)
ll <- dlply(df,.variables = c("site"))
Result:
>ll
$alice
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 alice rain 95 NA 50 NA 2 4 9
2 alice over NA 25 NA 25 2 4 9
$andrea1
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 andrea1 clear 60 NA 60 NA 2 4 5
$steps
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 steps clear NA 27 NA 25 2 4 9
2 steps <NA> 30 NA 20 1 4 9 NA
split() and dlply() are perfect one shot solutions.
If you want a "step by step" procedure with a loop (which is frowned upon by many R users, but I find it helpful in order to understand what's going on), try this:
# create vector with site names, assuming reefdata$site is a factor
sites <- as.character( unique( reefdata$site ) )
# create empty list to take dive data per site
dives <- list( NULL )
# collect data per site into the list
for( i in 1:length( sites ) )
{
# subset
dive <- reefdata[ reefdata$site == sites[ i ] , ]
# add resulting data.frame to the list
dives[[ i ]] <- dive
# name the list element
names( dives )[ i ] <- sites[ i ]
}
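The loop above can be condensed into a single lapply() call, which is the more idiomatic R form of the same idea; sketched here with a small stand-in for reefdata:

```r
# Small stand-in for the reefdata data frame.
reefdata <- data.frame(
  site = c("alice", "alice", "steps", "steps", "andrea1"),
  rate = c(9, 9, 9, 9, 5),
  stringsAsFactors = FALSE
)

sites <- unique(reefdata$site)

# One subset per site, collected into a named list -- the same result
# shape as the for-loop version above.
dives <- lapply(sites, function(s) reefdata[reefdata$site == s, ])
names(dives) <- sites
```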

delete row by subset

I have a data.frame with 2 columns dates=dates of observations for each station and values=observation data
> head(dataset.a)
dates values
1 1976-01-01 7.5
2 1976-01-02 NA
3 1976-01-03 NA
4 1976-01-04 NA
5 1976-01-05 NA
6 1976-01-06 10.2
(...)
I have to multiply each row by a value that I have already from another data.frame:
> head(dataset.b)
dates values
1 1976-01-01 0.23
2 1976-01-02 NA
3 1976-01-03 NA
4 1976-01-04 NA
5 1976-01-05 NA
6 1976-01-06 1.23
(...)
Both datasets use the Gregorian calendar; however, dataset.a contains leap years (which add a 29th day to February) while dataset.b always has 28 days in February. I want to ignore all 29th days of February in dataset.a and then do the multiplication.
I should be able to make a basic subset using both indices:
which(strftime(dataset.a[,1],"%d")!="29")
which(strftime(dataset.a[,1],"%m")!="02")
However, once I combine the two with a logical AND, I lose the positions in the data.frame where I have YEAR-02-29, and it returns the rows that are TRUE for the combination of both conditions.
I guess this is a very basic question, but I am lost.
Try a logical index. Note that the negation has to go around the combined condition, since a date is 29 February only when both tests are true; negating each test separately would also drop the rest of February and every 29th of other months:
idx <- !(strftime(ws.hb1.dataset[d,1],"%d")=="29" & strftime(ws.hb1.dataset[d,1],"%m")=="02")
Note: I'm assuming ws.hb1.dataset[d,1] is basically dataset.a[,1] here?
Then you'll get a vector of TRUE TRUE ... TRUE FALSE TRUE TRUE .. with the FALSE coinciding with 29/Feb.
Then you can just do dataset.a[idx,] to get the non 29/Feb dates.
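Putting it together, a minimal sketch with made-up dates and values, assuming the two data frames line up row-for-row once 29 February is dropped from dataset.a:

```r
dataset.a <- data.frame(
  dates  = as.Date(c("1976-02-28", "1976-02-29", "1976-03-01")),
  values = c(7.5, 3.0, 10.2)
)
dataset.b <- data.frame(
  dates  = as.Date(c("1976-02-28", "1976-03-01")),
  values = c(0.23, 1.23)
)

# FALSE only on 29 February: the single negation wraps the AND of both
# tests, so the rest of February is kept.
idx <- !(strftime(dataset.a$dates, "%m") == "02" &
         strftime(dataset.a$dates, "%d") == "29")

a.trim <- dataset.a[idx, ]                    # 29 Feb removed
result <- a.trim$values * dataset.b$values    # element-wise product
```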