Concatenate excel cells with numbers without summing - r

I know you can use the concatenate function in excel to combine two strings in two different cells into one cell, but how do I do the same for cells with numbers in them? I have two columns, as seen in the image below (I have started the process by hand to demonstrate what I want), and I want to concatenate the values so that I can read them into R and perform a choice experiment evaluation on the data. However, using concatenate sums the values.

You can also do the concatenation within R itself, by setting the column type to string, and then using the paste0(string1, string2) function.
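A minimal sketch of the R approach described above, assuming the two spreadsheet columns have already been read into a data frame. The data frame `choices` and its columns `a` and `b` are hypothetical names used only for illustration.

```r
# Hypothetical data standing in for the two spreadsheet columns
choices <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Convert each column to character, then join the digits side by side
# instead of adding the numbers
choices$combined <- paste0(as.character(choices$a), as.character(choices$b))

print(choices$combined)
```

The result is a character column; if the combined code needs to be numeric later, `as.numeric()` can be applied to it.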

You can use the & operator:
=A1&C1
CONCATENATE also works in Excel:
=CONCATENATE(A1,C1)

Related

How to skip empty rows while reading multiple tabs in R?

I am trying to read an excel file with multiple tabs. For that, I use the code provided here.
The problem is that each tab has a different number of empty rows before the actual data begins. For example, the first tab has two empty rows, the second tab has three empty rows, and so on.
Normally, I would use the parameter skip in the read_excel function to indicate the number of empty lines to skip. But how do I do that for multiple tabs with different numbers of rows to skip?
Perhaps the easiest solution would be to read each tab as it is and then remove the empty rows, i.e. yourdata <- yourdata[!is.na(yourdata$columname),]. This works if you don't expect any NAs in a particular column, such as an id column. If you have data gaps everywhere, you can test for rows where all of several columns are NA; let me know if that's what you need.
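The "test for all NAs" variant could look roughly like the sketch below. It assumes the tabs have already been read into a list of data frames; `tab1`, `tab2` and the helper `drop_empty_rows` are hypothetical names, and the sketch does not deal with a header row that ends up below the empty rows.

```r
# Hypothetical stand-ins for two tabs with different numbers of empty rows
tab1 <- data.frame(id = c(NA, NA, 1, 2), value = c(NA, NA, 10, 20))
tab2 <- data.frame(id = c(NA, NA, NA, 3), value = c(NA, NA, NA, 30))
tabs <- list(first = tab1, second = tab2)

# Keep only rows that have at least one non-NA cell
drop_empty_rows <- function(df) {
  df[rowSums(!is.na(df)) > 0, , drop = FALSE]
}

# Apply the same cleaning to every tab, whatever its number of empty rows
cleaned <- lapply(tabs, drop_empty_rows)
```

Because the test is applied per row, the same function works for every tab without knowing in advance how many rows to skip.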

How to convert a column with a for loop and grep expressions?

I have an airbnb dataset and one of the variables is amenities. The “amenities” column lists all the amenities provided by the host. What’s the total number of amenities offered? Convert this to a numeric value that indicates the number of amenities provided. For example, if an instance of “amenities” is {TV,Internet,Wifi,Washer}, it should convert to 4. Add this as a column to the dataframe. I am very confused about how to do this. Some of the entries list up to 50 different amenities. Manually making a vector would take forever.
I'm also confused about this for the airbnb dataset. Before we do any further analysis involving calculations, we should first clean the data for mathematical operations. For example, the character “$” appears in the “price” column, making the data type of “price” character instead of numeric. Remove the “$” and “,” in this column and convert the data type to numeric (modify the raw data). I believe I have to use grep expressions.
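If you have that information in a data frame, try the strsplit function (df here stands for your data frame; data.frame itself is the name of a function and cannot be subset with $):
sapply(strsplit(as.character(df$amenities), ","), length)
For substitution of characters, try the gsub function.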
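Both suggestions can be put together in a short sketch. The data frame `listings` and its two example rows are made up for illustration; only the column names `amenities` and `price` come from the question.

```r
# Hypothetical two-row sample shaped like the airbnb data in the question
listings <- data.frame(
  amenities = c("{TV,Internet,Wifi,Washer}", "{TV,Wifi}"),
  price = c("$1,250.00", "$85.00"),
  stringsAsFactors = FALSE
)

# Split each entry on commas and count the resulting pieces
listings$n_amenities <- lengths(strsplit(listings$amenities, ",", fixed = TRUE))

# Remove "$" and "," and convert the remaining text to numeric
listings$price <- as.numeric(gsub("[$,]", "", listings$price))
```

One caveat of counting pieces this way: an empty entry such as `{}` still counts as 1, so it may need to be handled separately.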

What's the easiest way to ignore one row of data when creating a histogram in R?

I have this csv with 4000+ entries and I am trying to create a histogram of one of the variables. Because of the way the data was collected, there was a possibility that if data was uncollectable for that entry, it was coded as a period (.). I still want to create a histogram and just ignore that specific entry.
What would be the best or easiest way to go about this?
I tried making it so that the histogram would only use the data for every entry except the one with the period by doing
newlist <- data1$var[1:3722]+data1$var[3724:4282]
where 3723 is the entry with the period, but R said that + is not meaningful for factors. I'm not sure if I went about this the right way. My intention was to create a vector, list or table joining those two subsets into one bigger list called newlist.
Your problem is deeper than you realize. When R read in the data and saw the lone . it interpreted that column as a factor (categorical variable).
You need to either convert the factor back to a numeric variable (this is FAQ 7.10) or reread the data, forcing it to read that column as numeric. If you are using read.table or one of the functions that calls read.table, you can set the colClasses argument to specify a numeric column (you will probably also need na.strings = "." so that the period is read as a missing value rather than causing an error).
Once the column of data is a numeric variable then a negative subscript or !is.na will work (or some functions will automatically ignore the missing value).
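The two routes can be sketched as follows. The factor `var` and the inline CSV text are invented sample data, and the use of `na.strings = "."` when rereading is an assumption about how the period should be treated, not something stated in the answer above.

```r
# Hypothetical column that was read as a factor because of the lone "."
var <- factor(c("1.5", "2", ".", "4"))

# Route 1: convert factor -> character -> numeric; "." becomes NA
num <- suppressWarnings(as.numeric(as.character(var)))

# Drop the missing entry before plotting
clean <- num[!is.na(num)]

# Route 2: reread the data, telling R that "." means missing
data1 <- read.csv(text = "var\n1.5\n2\n.\n4", na.strings = ".")
```

After either route, `hist(clean)` or `hist(data1$var)` should draw the histogram without the uncollectable entry. Note that `as.numeric(var)` applied directly to the factor would return the internal level codes, not the original values.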

Cleansing an excel spreadsheet with whitespace cells

I'm looking for advice about how to cleanse an excel spreadsheet using R.
http://www.abs.gov.au/AUSSTATS/abs#.nsf/DetailsPage/5506.02012-13?OpenDocument
Gathering the years by tidyr::gather is simple enough. The difficulty is the subgroups. The groups are defined by whitespace. Each amount of whitespace is a subgroup.
My question is how to assign each row to its group, so that the table is tidy form.
My initial instinct was to look where there is a line of NAs in the spreadsheet and use na.locf to fill them, but that method cannot distinguish between subgroups followed by groups without subgroups. Is there a way to count the amount of whitespace visible before the cells in the linked excel spreadsheet?
On the particular sheet you are talking about, there aren't any leading characters - the indentation is just the formatting applied to the cell, in much the same way as you might apply a font to a cell.
The only way to count the indents in the formatting is to create a macro. Here's a user defined function that will work:
Public Function inds(r As Excel.Range) As Integer
    inds = r.Cells(1, 1).IndentLevel
End Function
You would then just count the indents with =inds(a3)
It looks like you might be trying to prepare the data for a pivot table (there might be better options). However, to count the leading spaces, a simple formula is:
=len(a3)-len(trim(a3))+1
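For completeness, the same leading-space count can be sketched in R. This only helps when the labels contain real leading space characters; as noted in the first answer, on the linked sheet the indentation is cell formatting, so this would not apply there. The vector `labels` is made-up sample data.

```r
# Hypothetical row labels whose grouping is encoded as real leading spaces
labels <- c("Total", "  Group A", "    Subgroup A1", "  Group B")

# Leading spaces = full length minus length after trimming the left side only
leading <- nchar(labels) - nchar(trimws(labels, which = "left"))

print(leading)
```

Trimming only the left side avoids counting trailing spaces, which a two-sided trim would include.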

Changing hundreds of column names simultaneously in R

I have a data frame with hundreds of columns whose names I want to change. I'm very new to R, so it's rather easy to think through the logic of this, but I simply can't find a relevant example online.
The closest I could sort of get was this:
projectFileAllCombinedNames <- for (i in 1:200){names(projectFileAllCombined)[i+1] <-variableNames[i]}
Basically, starting at the second column of projectFileAllCombined, I want to loop through the columns in the dataframe and assign them the data values in the second data frame. I was able to change one column name manually with this code:
colnames(projectFileAllCombined)[2]<-"newColumnName"
but I can't possibly do that for hundreds of columns. I've spent multiple hours on this and can't crack it with any number of Google searches on "change multiple columns in r" or "change column names in r". The best I can find online is examples where people change a few columns with a c() function and I get how that works, but that still seems to require typing out all the column names as parameters to the function, unless there is a way to just pass the "variableNames" file into that c() function, but I don't know of one.
Will
colnames(projectFileAllCombined)[-1] <- variableNames
not suffice?
This assumes the ordering of columns in projectFileAllCombined is the same as the ordering of the new variable names in variableNames, and that
length(variableNames) == (ncol(projectFileAllCombined) - 1)
The key point here is that the replacement function 'colnames<-'() is vectorised and can replace any number of column names in a single call if passed a vector of replacement values.
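A small self-contained sketch of the vectorised replacement, using the names from the question but with an invented four-column data frame and three invented replacement names:

```r
# Hypothetical data frame: first column is kept, the rest are renamed
projectFileAllCombined <- data.frame(
  id = 1:2, V1 = c(0.1, 0.2), V2 = c(3, 4), V3 = c("a", "b")
)

# Hypothetical replacement names, one per column after the first
variableNames <- c("height", "weight", "label")

# Guard against a length mismatch before assigning
stopifnot(length(variableNames) == ncol(projectFileAllCombined) - 1)

# Replace every column name except the first in a single call
colnames(projectFileAllCombined)[-1] <- variableNames
```

If `variableNames` is itself a data frame rather than a character vector, the column holding the names would need to be extracted first, for example with `as.character(variableNames[[1]])`, assuming the names sit in its first column.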
