Finding missing values - r

I have a couple of questions as I am new to this.
How can I read in data using -- to look for missing values?
How can I determine how many values are missing in each variable?
I tried using the summary command and is.na but can't seem to get it right.

The first question is not clear. For the second one you can use:
sapply(yourdataframe, function(x) sum(is.na(x)))
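If the first question means that missing values are written as -- in the file (that is only a guess at the intent), a minimal sketch could look like the following. The inline CSV text and its column names are made up for illustration; na.strings is a standard argument of read.csv.

```r
# Hypothetical inline data in which "--" marks a missing value
csv_text <- "id,height,weight
1,170,--
2,--,65
3,180,--"

# Tell read.csv to treat "--" as NA while reading
df <- read.csv(text = csv_text, na.strings = "--")

# Count missing values per column, as in the answer above
missing_per_column <- sapply(df, function(x) sum(is.na(x)))

# An equivalent base R one-liner
same_counts <- colSums(is.na(df))

print(missing_per_column)
```

summary(df) also reports the NA count for each numeric column, which may be what the summary attempt was after.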

Related

Filtering with dplyr not working as expected

How are you?
I have the following problem, which is very weird because the task is very simple.
I want to filter on one of my factor variables in R, but the outcome is an empty data frame.
My data frame is called "data_2022". If I execute this code:
sum(data_2022$CANALDEVENTA=="WEB")
The result is 2704800, which is the number of rows for which the condition is TRUE.
a <- data_2022 %>% filter(CANALDEVENTA=="WEB")
This returns an empty data frame.
I know I am not an expert in R, but I have done this a million times and never had this error before.
Do you have a clue what the problem is?
Sorry I did not make a reproducible example.
Thank you in advance.
You could use the subset function:
a<-subset(data_2022, CANALDEVENTA=="WEB")
If you are using the tidyverse, make sure you are calling the filter function from dplyr (dplyr::filter). filter expects a logical expression, but a different filter function from another package may be getting applied to your data frame instead. Try this code too:
my_names<-c("WEB")
a<-dplyr::filter(data_2022, CANALDEVENTA %in% my_names)
Hope it works.
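Without a reproducible example the cause is only a guess, but a few base R checks can narrow it down. The sketch below uses a made-up data frame, data_demo, in place of data_2022; it shows how to see which package the bare name filter currently resolves to and how to do the same subsetting without dplyr.

```r
# Hypothetical toy data standing in for data_2022
data_demo <- data.frame(CANALDEVENTA = factor(c("WEB", "TIENDA", NA, "WEB")))

# Which package does the bare name `filter` resolve to right now?
# With dplyr attached last this should be "dplyr"; otherwise another
# package (stats by default) owns the name.
filter_owner <- environmentName(environment(filter))
print(filter_owner)

# The count from the question, made safe against NA values
n_web <- sum(data_demo$CANALDEVENTA == "WEB", na.rm = TRUE)

# Base subsetting; which() drops rows where the comparison is NA
web_rows <- data_demo[which(data_demo$CANALDEVENTA == "WEB"), , drop = FALSE]
```

If filter_owner is not "dplyr", writing dplyr::filter explicitly, as suggested above, sidesteps the masking.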

Assign a Value based on the numbers in a separate columns in R

I kind of already know the possible solution, but I don't know exactly how to go about it, so please give me a bit of grace here.
I have a dataset of YouTube trends. I want to read the values from two columns (likes and dislikes) and, based on their contents, make an entry in a new column. If the likes are higher than the dislikes, I want the video to be labelled 'positive', and if it has more dislikes it should be 'negative'.
I'm mainly unsure how to go about this because most of the previous questions are based on one column rather than two. I know some mentioned using cut, but would it still work the same way?
All help is appreciated, thanks.
You can use a simple ifelse :
df$new_col <- ifelse(df$likes > df$dislikes, 'positive', 'negative')
This can also be written without ifelse as :
df$new_col <- c('negative', 'positive')[as.integer(df$likes > df$dislikes) + 1]
You can use Vectorize to create a vectorized version of a function. If func takes two arguments, vfunc <- Vectorize(func) lets you call df$newcol <- vfunc(df$likes, df$dislikes); it returns the result for each row as a vector, which is then assigned to the new column.
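To make the approaches above concrete, here is a small sketch on made-up data. The data frame values and the helper label_video are hypothetical; ifelse and Vectorize are base R.

```r
# Hypothetical toy data; the third row is a tie
df <- data.frame(likes = c(10, 2, 5), dislikes = c(3, 8, 5))

# Vectorized comparison of the two columns
df$new_col <- ifelse(df$likes > df$dislikes, "positive", "negative")

# A scalar function (hypothetical helper) that handles one row at a time
label_video <- function(likes, dislikes) {
  if (likes > dislikes) "positive" else "negative"
}

# Vectorize wraps it so it can be called on whole columns
vlabel <- Vectorize(label_video)
df$new_col_v <- vlabel(df$likes, df$dislikes)

print(df)
```

Note that with this condition a tie (likes equal to dislikes) is labelled 'negative'; a third label would need a nested ifelse or an extra branch.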

Using semi_join to find similarities but returns none mistakenly

I am trying to find the genes common to two columns so that I can later work with just those genes. Below is my code:
top100_1Beta <- data.frame(grp1_Beta$my_data.SYMBOL[1:100])
top100_2Beta<- data.frame(grp2_Beta$my_data.SYMBOL[1:100])
common100_Beta <- semi_join(top100_1Beta,top100_2Beta)
When I run the code I get the following error:
Error: by required, because the data sources have no common variables
This seems wrong, since when I open top100_1Beta and top100_2Beta I can see that at least the first few rows list exactly the same genes: ATP2A1, SLMAP, MEOX2,...
I am confused about why it is reporting no common variables.
Any help would be greatly appreciated.
Thanks!
I don't think you need any form of *_join here; instead it seems you're looking for intersect
intersect(grp1_Beta$my_data.SYMBOL[1:100], grp2_Beta$my_data.SYMBOL[1:100])
This returns a vector of the entries common to the first 100 entries of grp1_Beta$my_data.SYMBOL and grp2_Beta$my_data.SYMBOL.
Without a full working example, I'm guessing that your top100_1Beta and top100_2Beta dataframes do not have the same column names. They are probably grp1_Beta.my_data.SYMBOL.1.100. and grp2_Beta.my_data.SYMBOL.1.100.. This means the semi_join function doesn't know where to match the dataframes up. Renaming the columns should fix the issue.
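The column-name guess can be checked with a small sketch. The two vectors below are made up (only ATP2A1, SLMAP and MEOX2 come from the question), and the semi_join call is guarded so the example still runs where dplyr is not installed.

```r
# Hypothetical stand-ins for the two gene symbol columns
grp1 <- list(SYMBOL = c("ATP2A1", "SLMAP", "MEOX2", "GENE_A"))
grp2 <- list(SYMBOL = c("ATP2A1", "SLMAP", "MEOX2", "GENE_B"))

# data.frame() builds the column name from the expression it is given,
# so the two frames end up with different names
top_1 <- data.frame(grp1$SYMBOL[1:3])
top_2 <- data.frame(grp2$SYMBOL[1:3])
print(names(top_1))
print(names(top_2))

# Naming the column explicitly gives both frames a shared key
top_1n <- data.frame(SYMBOL = grp1$SYMBOL[1:3])
top_2n <- data.frame(SYMBOL = grp2$SYMBOL[1:3])

# The base R route suggested above
common <- intersect(top_1n$SYMBOL, top_2n$SYMBOL)

# With a shared column name, semi_join has something to match on
if (requireNamespace("dplyr", quietly = TRUE)) {
  common_df <- dplyr::semi_join(top_1n, top_2n, by = "SYMBOL")
}
```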

Using ifelse statement to condense variables

New to R, taking a very accelerated class with very minimal instruction. So I apologize in advance if this is a rookie question.
The assignment I have is to take a specific column that has 21 levels from a dataframe, and condense them into 4 levels, using an if, or ifelse statement. I've tried what feels like hundreds of combinations, but this is the code that seemed most promising:
> b2$LANDFORM=ifelse(b2$LANDFORM=="af","af_type",
ifelse(b2$LANDFORM=="aflb","af_type",
ifelse(b2$LANDFORM=="afub","af_type",
ifelse(b2$LANDFORD=="afwb","af_type",
ifelse(b2$LANDFORM=="afws","af_type",
ifelse(b2$LANDFORM=="bfr","bf_type",
ifelse(b2$LANDFORM=="bfrlb","bf_type",
ifelse(b2$LANDFORM=="bfrwb","bf_type",
ifelse(b2$LANDFORM=="bfrwbws","bf_type",
ifelse(b2$LANDFORM=="bfrws","bf_type",
ifelse(b2$LANDFORM=="lb","lb_type",
ifelse(bs$LANDFORM=="lbaf","lb_type",
ifelse(b2$LANDFORM=="lbub","lb_type",
ifelse(b2$LANDFORM=="lbwb","lb_type","ws_type"))))))))))))))
LANDFORM is a factor, but I tried changing it to character too, and the code still didn't work.
"ws_type" is the catch-all for the remaining levels.
The code runs without errors, but when I check it, all I get is:
> unique(b2$LANDFORM)
[1] NA "af_type"
Am I even on the right path? Any suggestions? Should I bite the bullet and make a new column with substr()? Thanks in advance.
If your new levels are just the first two letters of the old ones followed by _type you can easily achieve what you want through:
#prototype of your column
mycol<-factor(sample(c("aflb","afub","afwb","afws","bfrlb","bfrwb","bfrws","lb","lbwb","lbws","wslb","wsub"), replace=TRUE, size=100))
as.factor(paste(sep="",substr(mycol,1,2),"_type"))
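One possible reason for the NA output in the question, offered as a guess from the posted code: the fourth condition refers to b2$LANDFORD and a later one to bs$LANDFORM. Asking a data frame for a column that does not exist returns NULL without an error, and a nested ifelse on that NULL comes back empty, which the outer ifelse fills in as NA. A minimal sketch on made-up data (b2_demo is hypothetical):

```r
# Hypothetical toy data standing in for b2
b2_demo <- data.frame(LANDFORM = c("af", "aflb", "bfr", "lb"),
                      stringsAsFactors = TRUE)

# A misspelled column name silently returns NULL instead of raising an error
typo_is_null <- is.null(b2_demo$LANDFORD)

# The comparison on NULL has length zero, so the inner ifelse() returns
# logical(0), and the outer ifelse() fills every non-matching row with NA
with_typo <- ifelse(b2_demo$LANDFORM == "af", "af_type",
             ifelse(b2_demo$LANDFORD == "aflb", "af_type", "other"))

print(unique(with_typo))
```

Because the inner call returns before touching its remaining arguments, a later reference to an undefined object such as bs would never be evaluated, which would be consistent with the code running without errors.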
After a great deal of experimenting, I consulted a co-worker, and he was able to simplify a huge amount of this. Basically, I should have made a new column composed of the first two letters of the variables in LANDFORM, and then sample from that new column and replace values in LANDFORM, in order to make the ifelse() statement much shorter. The code is:
> b2$index=as.factor(substring(b2$LANDFORM,1,2))
b2$LANDFORM=ifelse(b2$index=="af","af_type",
ifelse(b2$index=="bf","bf_type",
ifelse(b2$index=="lb","lb_type",
ifelse(b2$index=="wb","wb_type",
ifelse(b2$index=="ws","ws_type","ub_type")))))
b2$LANDFORM=as.factor(b2$LANDFORM)
Thanks to everyone who gave me some guidance!
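For the four-level grouping described in the question, %in% can test a whole group of old levels at once, which keeps the ifelse chain short without relying on the first two letters. This is only a sketch: the level groups are copied from the question's code, and the sample vector is made up.

```r
# Hypothetical sample of LANDFORM values
landform <- factor(c("af", "aflb", "bfrws", "lbub", "wsub", "ub"))

# Old levels grouped by their new label, taken from the question's code
af_levels <- c("af", "aflb", "afub", "afwb", "afws")
bf_levels <- c("bfr", "bfrlb", "bfrwb", "bfrwbws", "bfrws")
lb_levels <- c("lb", "lbaf", "lbub", "lbwb")

# Compare as character so factor codes do not get in the way
x <- as.character(landform)

# One ifelse per group; everything left over falls into "ws_type"
condensed <- ifelse(x %in% af_levels, "af_type",
             ifelse(x %in% bf_levels, "bf_type",
             ifelse(x %in% lb_levels, "lb_type", "ws_type")))
condensed <- factor(condensed)

print(table(condensed))
```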

How to subset (without filtering) multiple columns from a data frame in R

I'm sorry, this may have been done to death, but all the answers I've found veer all over the map into extreme exotica. I can subset a single column using [[]] (I've learned from Stack Overflow that I'm not supposed to use subset() and similar in my scripts, since they're intended for interactive use), but I can't figure out how to make the leap to more than one column. These two work, of course:
outcomeA <- outcome[['Hospital.Name']]
outcomeB <- outcome[['TX']]
But I've tried a dozen permutations to get both of those columns, like so:
outcomeC <- outcome[[c('Hospital.Name', 'TX')]] (gives "subscript out of bound")
outcomeC <- outcome[c('Hospital.Name', 'TX')] (gives "undefined columns selected")
etc, but they all fail. Can someone please put me out of my misery and help me select more than one column?
Thanks - Ed
Did you try this with a comma and single brackets?
outcomeC <- outcome[,c('Hospital.Name', 'TX')]
Also, you can only select columns whose names exist in your data. Check them against:
names(outcome)
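A short sketch of the checks above, on a made-up data frame (the real outcome data is not shown, so the columns and values here are assumptions). It compares the requested names against names(outcome) before subsetting, since "undefined columns selected" normally means at least one requested name is not present.

```r
# Hypothetical stand-in for the outcome data frame
outcome <- data.frame(Hospital.Name = c("A", "B"),
                      TX = c(1, 2),
                      State = c("TX", "CA"))

wanted <- c("Hospital.Name", "TX")

# Any requested names that are not actual columns
missing_cols <- setdiff(wanted, names(outcome))

# Both forms return a data frame with the two columns
outcomeC <- outcome[, wanted]
outcomeD <- outcome[wanted]

# [[ ]] extracts a single column as a vector, so it cannot take two names
outcomeA <- outcome[["Hospital.Name"]]
```

If missing_cols is not empty, the names in it are the ones to fix, for example a column that is really named differently from what was typed.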
