How to convert a column with a for loop and grep expressions? - r

I have an Airbnb dataset, and one of the variables is amenities. The “amenities” column lists all the amenities provided by the host. What’s the total number of amenities offered? Convert this to a numeric value that indicates the number of amenities provided. For example, if an instance of “amenities” is {TV,Internet,Wifi,Washer}, it should convert to 4. Add this as a column to the dataframe. I am very confused about how to do this. Some listings have up to 50 different amenities, so manually making a vector would take forever.
I'm also confused about this second task for the same dataset. Before we do any further analysis involving calculations, we should first clean the data for mathematical operations. For example, the character “$” appears in the “price” column, making the data type of “price” character instead of numeric. Remove the “$” and “,” in this column and convert the data type to numeric (modify the raw data). I believe I have to use grep expressions.

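If that information is in a data frame (called df here), try the strsplit function to split each entry on commas and count the pieces:
sapply(strsplit(as.character(df$amenities), ","), length)
To substitute or remove characters, such as the “$” and “,” in price, try the gsub function.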
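A minimal sketch of both steps, assuming a small made-up data frame named listings (the values are illustrative, not taken from the real dataset). It strips the braces before splitting so that an empty {} counts as 0, and uses gsub to drop “$” and “,” before converting price to numeric:

```r
# Hypothetical sample data; the column names follow the question.
listings <- data.frame(
  amenities = c("{TV,Internet,Wifi,Washer}", "{TV}", "{}"),
  price = c("$1,250.00", "$85.00", "$40.00"),
  stringsAsFactors = FALSE
)

# Remove the surrounding braces, then split on commas and count the pieces.
inner <- gsub("[{}]", "", listings$amenities)
listings$n_amenities <- lengths(strsplit(inner, ",", fixed = TRUE))

# Remove "$" and "," from price, then convert the column to numeric.
listings$price <- as.numeric(gsub("[$,]", "", listings$price))

print(listings)
```

No for loop is needed, because strsplit and gsub are both vectorised over the whole column.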

Related

Finding a character variable in a column

I have a huge data frame df, with many columns. One of the columns named id_nm happens to be a character with values such as: aksh123dn.Ins
class(df$id_nm)
returns character
I need to lookup all those values which have the id_nm say aksh123dn.Ins
I used:
new_df<-df[df$id_nm=='aksh123dn.Ins',]
this returns the entire df which isn't the case in reality
also tried:
new_df <- df %>% filter(id_nm == 'aksh123dn.Ins')
still getting the same answer
I think it's possibly because it is a character string. Please help me with this. TIA

In R I have a column with text. How can I write a script in R that counts the frequency of specific words?

The text column can hold up to 100 letters for each entry. How can I write a script that recognizes the word "Approved" or "Rejected"? Sometimes the word will be "-Approved", "Approved" or "Approve". I want it to account for each scenario with a "LIKE" type of function.
There are two words I am looking for, so an "OR" may be more applicable here than a range.
R has a pair of text-similarity functions, agrep and agrepl, which are like grep and grepl in returning a vector when given a vector. The agrepl function is logical and of the same length as the input so works better in cases like this:
agrepl("Approved", df$text_col) | agrepl("Rejected", df$text_col)
That could be used to logically index matching rows of a dataframe. Or you could sum the logical vector to get a count. Suggestion: Edit your question with an example to use for demonstration.
There are additional parameters that can be used to adjust the tightness of the approximate matching.
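As a rough illustration, assuming a made-up vector text_col standing in for df$text_col, the logical vector can be summed for a count or used to index rows. The max.distance argument is one of the parameters that controls how loose the match is; the value 0.25 below is only an example, not a recommendation:

```r
# Hypothetical sample values standing in for df$text_col.
text_col <- c("-Approved", "Approved", "Approve", "Rejected", "Pending review")

# One logical value per entry: TRUE where either word matches approximately.
is_approved <- agrepl("Approved", text_col)
is_rejected <- agrepl("Rejected", text_col)
matched <- is_approved | is_rejected

# Count the matching entries, or keep only the matching ones.
n_matched <- sum(matched)
matching_entries <- text_col[matched]

# A looser match: allow a larger fraction of the pattern to differ.
loose <- agrepl("Approved", text_col, max.distance = 0.25)
```

If the variations are known in advance, an exact pattern such as grepl("Approve|Reject", text_col) is a simpler alternative to approximate matching.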

How to calculate combination of Data frame in R

I am a beginner in R program.
I imported a csv file. This file only contains one column with 50 characters, but R classifies it as a dataframe. I need all possible combinations within elements of this column. I think I need to work with a vector not with a data frame, how can I do it?
Thank you!
Actually your data frame already contains the vector you need. You can call it with
dataframe$column_name
The text before the $ operator specifies your data frame, and after is your vector, which is a column in your data frame. So when you run your calculations you can just write
some_function(dataframe$column_name)
In your specific case with a single column, it may be simplest to change the data frame into a plain vector. But when you start manipulating your data, you'll likely store more vectors of variables. You'll want to keep those vectors organized within data frames.
Do you mean unlist?
You can use it to change a data frame into a vector, and then use combn to get the combinations.
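A short sketch of that approach, assuming a one-column data frame with five made-up values (the real file has 50) and assuming that pairs are the combinations wanted:

```r
# Hypothetical one-column data frame, as read.csv would return it.
dataframe <- data.frame(column_name = c("a", "b", "c", "d", "e"),
                        stringsAsFactors = FALSE)

# Either extract the column directly or flatten the whole data frame.
values <- dataframe$column_name
values_alt <- unlist(dataframe, use.names = FALSE)

# All combinations of size 2; each column of the result is one combination.
pairs <- combn(values, 2)
print(ncol(pairs))  # choose(5, 2) combinations
```

Change the second argument of combn to get combinations of another size. Note that the number of combinations grows very quickly with 50 elements.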

What's the easiest way to ignore one row of data when creating a histogram in R?

I have this csv with 4000+ entries and I am trying to create a histogram of one of the variables. Because of the way the data was collected, there was a possibility that if data was uncollectable for that entry, it was coded as a period (.). I still want to create a histogram and just ignore that specific entry.
What would be the best or easiest way to go about this?
I tried making it so that the histogram would only use the data for every entry except the one with the period by doing
newlist <- data1$var[1:3722]+data1$var[3724:4282]
where 3723 is the entry with the period, but R said that + is not meaningful for factors. I'm not sure if I went about this the right way, my intention was to create a vector or list or table conjoining those two subsets above into one bigger list called newlist.
Your problem is deeper than you realize. When R read in the data and saw the lone . it interpreted that column as a factor (categorical variable).
You need to either convert the factor back to a numeric variable (this is FAQ 7.10) or reread the data, forcing R to read that column as numeric. If you are using read.table or one of the functions that calls read.table, you can set the colClasses argument to specify a numeric column.
Once the column is a numeric variable, a negative subscript or !is.na will work (and some functions will automatically ignore the missing value).
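Both routes can be sketched as follows, assuming a tiny made-up column in place of data1$var. The na.strings = "." argument in the second route is an addition not mentioned above: it tells the reader to treat the lone period as a missing value so the column can be read as numeric.

```r
# Route 1: convert an existing factor back to numeric (the FAQ 7.10 idiom).
raw <- factor(c("1.5", "2.0", ".", "3.5"))
# The "." cannot be parsed, so it becomes NA (the warning is suppressed here).
num <- suppressWarnings(as.numeric(as.character(raw)))
clean <- num[!is.na(num)]

# Route 2: reread the data, treating "." as missing. Hypothetical CSV text.
csv_text <- "id,var\n1,1.5\n2,.\n3,3.5\n4,2.0"
data1 <- read.csv(text = csv_text, na.strings = ".",
                  colClasses = c("integer", "numeric"))

# Build the histogram from the non-missing values.
# plot = FALSE only computes the bins; drop it to draw the plot.
h <- hist(data1$var[!is.na(data1$var)], plot = FALSE)
```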

Extracting Data from a column based on symbols

I have a tricky question that I'm hoping someone can help me with. I have an output file that looks pretty standard in that there is one value per row, per column - except for one column (excerpt below) that contains multiple entries per row:
4:103806204-103940896,4:103806204-103940896,4:103822084-103940896,4:103806204-103940896
7:27135712-27139877,7:27135712-27139877
2:209030070-209054773
1:16091458-16113084,1:16090993-16101715,1:16085254-16113084
16:70333061-70367735,16:70323669-70367735,16:70333061-70367735,16:70333061-70367735,16:70328735-70367735,16:70328699-70367735,16:70333061-70367735
It would be easy enough to split this column by ',' but then I won't be able to read it into, say, R very easily.
Instead, I'm hoping I can use a simple bit of code to select only the first two values, and then make one column into two, removing the rest. So the above would become the below:
4 103806204
7 27135712
2 209030070
1 16091458
16 70333061
I lose a little bit of info this way, but it makes the data more manageable. Does anyone have any suggestions?
We can use str_extract_all from library(stringr). We extract the numeric elements (\\d+) in a list, convert the 'character' class to numeric and get the first two elements with head, rbind the list elements.
library(stringr)
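# For each row: pull out all digit runs, keep the first two as numbers,
# then bind the results into a two-column matrix.
do.call(rbind, lapply(str_extract_all(df$col, '\\d+'),
                      function(x) head(as.numeric(x), 2)))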
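For comparison, here is a base R sketch of the same idea that avoids the stringr dependency. It swaps in regmatches and gregexpr for str_extract_all, runs on three rows from the excerpt in the question, and uses column names chrom and start that are made up for illustration:

```r
# Three rows copied from the excerpt in the question.
col <- c("4:103806204-103940896,4:103806204-103940896",
         "7:27135712-27139877,7:27135712-27139877",
         "2:209030070-209054773")

# Extract every run of digits from each row (one list element per row).
nums <- regmatches(col, gregexpr("[0-9]+", col))

# Keep the first two numbers of each row and stack them into a matrix.
first_two <- do.call(rbind, lapply(nums, function(x) as.numeric(x[1:2])))
colnames(first_two) <- c("chrom", "start")  # hypothetical column names

print(first_two)
```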
