Finding a character variable in a column - r

I have a huge data frame df, with many columns. One of the columns named id_nm happens to be a character with values such as: aksh123dn.Ins
class(df$id_nm)
returns character
I need to lookup all those values which have the id_nm say aksh123dn.Ins
I used:
new_df<-df[df$id_nm=='aksh123dn.Ins',]
this returns the entire df which isn't the case in reality
also tried:
new_df <- df %>% filter(id_nm == 'aksh123dn.Ins')  # needs library(dplyr)
still getting the same answer
I think it's possibly because it is a character string. Please help me with this. TIA
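A minimal diagnostic sketch, not a confirmed fix for this case. It assumes a toy data frame (the names df_toy, value, and target are hypothetical) and shows two common reasons an equality filter on a character column returns unexpected rows: NA values, which produce all-NA rows under logical indexing, and stray whitespace, which prevents exact matches.

```r
# Toy stand-in for the real df; values are made up for illustration
df_toy <- data.frame(
  id_nm = c("aksh123dn.Ins", "zz9.Ins", NA, "aksh123dn.Ins "),
  value = 1:4,
  stringsAsFactors = FALSE
)
target <- "aksh123dn.Ins"

# How many rows match exactly? Compare this with nrow(df_toy)
n_exact <- sum(df_toy$id_nm == target, na.rm = TRUE)

# Logical indexing keeps a row of NAs for every NA in id_nm;
# which() drops those, and trimws() removes stray whitespace
new_df <- df_toy[which(trimws(df_toy$id_nm) == target), ]
print(new_df)
```

If n_exact equals the number of rows in the full data frame, every row really does carry that id, and the filter is behaving correctly.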

Related

How to convert a column with a for loop and grep expressions?

I have an Airbnb dataset, and one of the variables is amenities. The “amenities” column lists all the amenities provided by the host. What’s the total number of amenities offered? Convert this to a numeric value that indicates the number of amenities provided. For example, if an instance of “amenities” is {TV,Internet,Wifi,Washer}, it should convert to 4. Add this as a column to the dataframe. I am very confused about how to do this. Some listings have up to 50 different amenities, so manually making a vector would take forever.
I'm also confused about a second task for the same dataset. Before we do any further analysis involving calculations, we should first clean the data for mathematical operations. For example, the character “$” appears in the “price” column, making the data type of “price” character instead of numeric. Remove the “$” and “,” in this column and convert the data type to numeric (modify the raw data). I believe I have to use grep expressions.
If you have that information in a data frame (called df here), try the strsplit function:
sapply(strsplit(as.character(df$amenities), ","), length)
To substitute characters, try the gsub function.
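Both steps can be sketched on a small made-up table. The data frame listings and the column n_amenities are hypothetical names, and the sketch assumes amenities never contain commas inside a single entry.

```r
# Made-up rows standing in for the Airbnb data
listings <- data.frame(
  amenities = c("{TV,Internet,Wifi,Washer}", "{Kitchen}"),
  price = c("$1,250.00", "$85.00"),
  stringsAsFactors = FALSE
)

# Drop the braces, split on commas, and count the pieces per row
inner <- gsub("[{}]", "", listings$amenities)
listings$n_amenities <- lengths(strsplit(inner, ",", fixed = TRUE))

# Remove "$" and "," from price, then convert to numeric
listings$price <- as.numeric(gsub("[$,]", "", listings$price))

print(listings)
```

Inside a bracket expression such as [$,] the dollar sign is a literal character, so no escaping is needed.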

Why does R think my imported vector of characters are numbers?

This is probably a basic question, but why does R think my vector, which has a bunch of words in it, are numbers when I try to use these vectors as column names?
I imported a data set and it turns out the first row of data are the column headers that I want. The column headers that came with the data set are wrong ones. So I want to replace the column names. I figured this should be easy.
So what I did was I extracted the first row of data into a new object:
names <- data[1,]
Then I deleted the first row of data:
data <- data[-1,]
Then I tried to rename the column headers with the "names" object:
colnames(data) <- names
However, when I do this, instead of changing my column names to the words within the names object, it turns it into a bunch of numbers. I have no idea where these numbers come from.
Thanks
You need to actually show us the data, and the read.csv()/read.table() command you used to import.
If R thinks your numeric column is string, it sounds like that's because it wrongly includes the column name, i.e. you omitted header=TRUE in your read.csv()/read.table() import.
But show us your actual data and commands used.
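Without the original data this is only a guess, but one likely cause is that the columns were imported as factors (the default for read.csv() before R 4.0). A one-row slice such as data[1,] is then a data frame of factors, and coercing it to column names tends to give the underlying integer codes rather than the labels. The sketch below reproduces that setup with hypothetical names (raw, first_row, header, clean) and converts each cell to character before renaming.

```r
# Simulate an import where the true header ended up as the first data row
# and every column became a factor
raw <- data.frame(
  V1 = c("id", "1", "2"),
  V2 = c("name", "ann", "bob"),
  stringsAsFactors = TRUE
)

first_row <- raw[1, ]

# Convert each one-cell factor column to its label, not its integer code
header <- unname(vapply(first_row, as.character, character(1)))

clean <- raw[-1, ]
colnames(clean) <- header
print(clean)
```

If the file itself has a proper header line, re-importing with header = TRUE avoids the problem entirely.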

Why are dates in a list first converted to numeric when coercing to character

I'm trying to convert a column of dates into strings, because I want to use them as factor levels at some later point in my code.
The date column is part of a tibble, and is of class Date. I figured that a simple as.character() conversion would do the trick, but unfortunately I was wrong. Instead of neatly formatted strings it returns a number in string form. For example, today (22 November 2017) would come out as "17492". So somewhere in the process the date gets converted into its numeric format and only then turned into a character string.
Now I did find a workaround, by unlisting the data, converting it again to dates and then to character strings, but it is fairly inefficient.
Can anyone explain i) why this occurs and ii) if there is an easier fix?
Below a reproducible example:
#Get current system date
foo <-Sys.Date()
#Convert to list
foo <- as.list(foo)
#The following then produces the number string:
as.character(foo)
[1] "17492"
#The following code works but is a rather annoying work-around
as.character(as.Date(unlist(foo), origin=as.Date("1970-01-01")))
[1] "2017-11-22"
Given the amount of useful comments and the final solutions provided I'll post an answer summary here.
The first thing to do if you run into this problem is check whether you actually want to convert the full list, or a column within the list, with the column actually being a vector. This was my underlying problem as MrFlick and neilfws pointed out. The reason I missed that was because in my case the list was a one column tibble, the column being named "date". Using as.character(foo) returned my "numeric string" "17492", but using as.character(foo$date), did exactly what it was supposed to do and returned "2017-11-22".
In case your list is really just a list, or a list of lists, the solution of d.b. works like a breeze: use lapply(foo, as.character) or sapply(foo, as.character) depending a bit on your desired output.
As for why this happens: the direct reason, as pointed out by d.b., is that when as.character() encounters a list, it first unlist()s it and then does the conversion.
The deeper reason was pointed out by joran and in the duplicate question on that here. In short: it usually does not make sense to convert a full list to a single data type, as a list can contain many. For example, as.numeric(foo) would just return an error. The only exception is as.character(), which writes out the full list (perhaps to keep records).
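The two fixes from the summary can be sketched as follows. The objects dates, tbl, per_element, and from_column are hypothetical, and a plain data frame stands in for the tibble. vapply() is used in place of sapply() only to make the return type explicit.

```r
dates <- as.Date(c("2017-11-22", "2017-11-23"))

# Case 1: a genuine list of Date objects; convert element by element
foo <- as.list(dates)
per_element <- vapply(foo, as.character, character(1))

# Case 2: a one-column table; convert the column, not the whole table
tbl <- data.frame(date = dates)
from_column <- as.character(tbl$date)

print(per_element)
print(from_column)
```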

R data frame issue - non-numeric headers

This is definitely a rookie question but I'm not finding an answer for this (maybe because of my wording) so here goes:
I'm reading a data frame into R studio (csv file) that has 24 columns with headers. There are only numbers in these columns (they're essentially concentrations of several chemicals). It's called all. I need to use them as numeric vectors. When I read them in and type
is.numeric(all[,1])
I get
TRUE
When I type
is.numeric(all[1])
I get
FALSE
I think this is because R interprets the header as a factor. I also tried reading in a table without headers and with headers=FALSE, but R renames it to V1, V2 etc so the result ends up being the same.
I need to work with functions where I invoke something like all[2:24]. How can I go about to make R either "not see" the header or remove it altogether?
Thanks for the answers!
PS: the dataframe I am using (without headers - if it had headers, it would just have names instead of V1, V2, etc) is something like this:
This selects the first column, not the first row:
all[,1] #subset first column
The following selects the first row:
all[1,] #subset first row (headers of df not included)
To set the column names:
colnames(all) <- c("col1","col2")
Your assumption is wrong. You have a data.frame and all[1] does list subsetting, which results in a data.frame, which is not a vector, and not a numeric vector in particular.
You should study help("[") and An Introduction to R.
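The difference between the subsetting forms can be seen on a small made-up data frame (conc is a hypothetical name, used here instead of all to avoid confusion with the base function of that name). The header plays no part in the result: single-bracket subsetting with one index simply returns a data frame.

```r
conc <- data.frame(a = c(0.1, 0.2), b = c(1.5, 2.5), c = c(3, 4))

is.numeric(conc[, 1])   # column extracted as a vector
is.numeric(conc[1])     # still a data frame, so not a numeric vector
is.numeric(conc[[1]])   # list-style extraction of the column itself

# Check that every column in a range is numeric
cols_numeric <- vapply(conc[2:3], is.numeric, logical(1))

# If a function needs a plain numeric object, a matrix is one option
m <- as.matrix(conc[2:3])
print(cols_numeric)
print(m)
```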

Extracting Data from a column based on symbols

I have a tricky question that I'm hoping someone can help me with. I have an output file that looks pretty standard in that there is one value per row, per column - except for one column (excerpt below) that contains multiple entries per row:
4:103806204-103940896,4:103806204-103940896,4:103822084-103940896,4:103806204-103940896
7:27135712-27139877,7:27135712-27139877
2:209030070-209054773
1:16091458-16113084,1:16090993-16101715,1:16085254-16113084
16:70333061-70367735,16:70323669-70367735,16:70333061-70367735,16:70333061-70367735,16:70328735-70367735,16:70328699-70367735,16:70333061-70367735
It would be easy enough to split this column by ',' but then I won't be able to read it into, say, R very easily.
Instead, I'm hoping I can use a simple bit of code to select only the first two values, and then make one column into two, removing the rest. So the above would become the below:
4 103806204
7 27135712
2 209030070
1 16091458
16 70333061
I lose a little bit of info this way, but it makes the data more manageable. Does anyone have any suggestions?
We can use str_extract_all from library(stringr). We extract the runs of digits (\\d+) from each row into a list, convert each character vector to numeric, keep the first two elements with head, and rbind the list elements into a matrix.
library(stringr)
# one row per input string: first number (before ':') and second number (start position)
do.call(rbind, lapply(str_extract_all(df$col, '\\d+'),
                      function(x) head(as.numeric(x), 2)))
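For comparison, here is a base R sketch of the same extraction using sub() with capture groups instead of stringr. This is a swapped-in alternative, not the answer's method. The names col, chr, start, and out are hypothetical, and the sketch assumes every entry begins with digits, a colon, digits, and a hyphen.

```r
col <- c(
  "4:103806204-103940896,4:103806204-103940896",
  "7:27135712-27139877,7:27135712-27139877",
  "2:209030070-209054773"
)

# Capture the digits before the first ':' and the digits right after it
chr   <- as.integer(sub("^([0-9]+):.*$", "\\1", col))
start <- as.numeric(sub("^[0-9]+:([0-9]+)-.*$", "\\1", col))

out <- data.frame(chr = chr, start = start)
print(out)
```

Anchoring the pattern at the start of the string means only the first entry in each row is used, which matches the desired output in the question.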

Resources