Can't get toupper to work - r

I want to transform the contents of a factor column in a dataframe from lowercase to uppercase. The function toupper(dataframe$columnname) prints the contents in uppercase, but nothing actually seems to happen to the contents. When I check using levels(dataframe$columnname) or just visually inspect the dataframe, the contents are still in lowercase. What am I doing wrong?

To change your data.frame's contents, you must assign the result back to the column:
dataframe$columnname <- toupper(dataframe$columnname)
Although, since the column is a factor, if you want to work with character strings, convert it explicitly first:
dataframe$columnname <- toupper(as.character(dataframe$columnname))
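A minimal sketch of the difference, using a hypothetical three-row data frame `df` that is not from the question. The last step, assigning to `levels()`, is an alternative that keeps the column a factor; it is an assumption about what may be wanted and not part of the answer above.

```r
# Hypothetical example data: a factor column with lowercase levels
df <- data.frame(columnname = factor(c("a", "b", "a")))

# toupper() returns a new character vector; df itself is not modified
upper_copy <- toupper(df$columnname)
levels(df$columnname)   # still "a" "b"

# Assigning the result back changes the column (it becomes character)
df_chr <- df
df_chr$columnname <- toupper(as.character(df_chr$columnname))

# Alternative: uppercase only the levels, so the column stays a factor
df_fct <- df
levels(df_fct$columnname) <- toupper(levels(df_fct$columnname))
```

Which variant fits depends on whether later code expects a factor or a character column.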

Related

Is there an R function for removing a specific snippet of the data in a column?

So I have a data frame that includes a column like this:
image
And I would like to remove the operator as well as the numbers to the right of it, i.e. so the first entry would just say 51.81 rather than 51.81 - 11.19. How would I go about this? I feel like using a for loop might work but I'm unsure of the syntax required.
Thanks
We can use sub to match zero or more spaces (\\s*) followed by a - or + and other characters, and replace with blank ("")
df1$xG <- as.numeric(sub("\\s*[-+]+.*", "", df1$xG))
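As a quick check of what the pattern does, here is a small sketch on a made-up vector `xG` (the real column was only shown as an image, so these values are assumptions apart from the first one quoted in the question):

```r
# Hypothetical sample values in the "number operator number" layout
xG <- c("51.81 - 11.19", "40.20 + 3.50", "12.00")

# Drop optional spaces, the first - or +, and everything after it,
# then convert what is left to numeric
cleaned <- as.numeric(sub("\\s*[-+]+.*", "", xG))
cleaned
```

One caveat, assuming the data could contain it: a value that starts with a minus sign, such as "-3.2 - 1", would be reduced to an empty string and become NA, so the pattern is only safe when the leading number is unsigned.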

Cleaning a column with non-breaking spaces that contains "last, first" names so I can filter it in my data frame

I'm stumped. My issue is that I want to grab specific names from a given column. However, when I try to filter them I get most of the names except for a few, even though I can clearly see their names in the original Excel file. I think it has to do with some sort of special characters or spacing in the name column. I am confused about how I can fix this.
I have tried using Excel's CLEAN() function on the given column. I have tried running an Alteryx flow to clean the data. None of these steps have helped. I am starting to wonder if this is an R issue.
surveyData %>% filter(`Completed By` == "Spencer,(redbox with whitedot in middle)Amy")
surveyData %>% filter(`Completed By` == "Spencer, Amy")
In R, the first line has a red box with a white dot between the comma and the first name. I got this red box with a white dot by copying the name from the data frame into Notepad and then pasting it into R. This actually works and returns what I want. The second line uses a standard space, which doesn't return what I want. So how can I fix this without having to copy each name from the data frame to Notepad and then paste the result from Notepad into R?
The expected result is that I get the rows attached to whatever name I filter by.
I was able to find the answer. It turns out the space is actually a non-breaking space (Unicode U+00A0) rather than a normal space (Unicode U+0020). The non-breaking space is not a part of the American Standard Code for Information Interchange (ASCII). Thus filter() couldn't match some names, because they contained non-breaking spaces where my search string had normal spaces. I fixed this by substituting a normal space for each non-breaking space in the given column. Example below:
space_fix = gsub("\u00A0", " ", surveyData$`Completed By`, fixed = TRUE) #subbing break space unicode with space unicode for the given column I am interested in
surveyData$`Completed By Clean` = space_fix
Once I applied this, I could easily filter on any name!
Thanks everyone!
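For anyone hitting the same problem, here is a rough sketch of how the hidden character can be spotted from within R before replacing it. The vector `completed_by` is a made-up stand-in for the real column, and the character position used with substr() is specific to this example string.

```r
# Hypothetical stand-in for the column: the first entry hides a
# non-breaking space (U+00A0), the second uses a normal space
completed_by <- c("Spencer,\u00A0Amy", "Spencer, Amy")

# The two strings print alike but do not compare equal
completed_by == "Spencer, Amy"

# Flag entries that contain a non-breaking space
has_nbsp <- grepl("\u00A0", completed_by, fixed = TRUE)

# Inspect the code point after the comma (character 9 in this string)
code_point <- utf8ToInt(substr(completed_by[1], 9, 9))   # 160 is U+00A0

# Replace non-breaking spaces with normal spaces, as in the answer above
cleaned <- gsub("\u00A0", " ", completed_by, fixed = TRUE)
```

Checking code points with utf8ToInt() avoids the round trip through Notepad, since it shows exactly which character sits in a given position.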

Parsing strings with grep/str_extract

As part of my feature engineering, I need to parse text strings from different languages and keep text enclosed within parentheses. Everything was going well until I encountered a very strange phenomenon. For some languages, the parentheses I need to find look slightly different, and various regexp options fail.
I'm pasting screen-shots because strangely, copying and pasting the strange parentheses changes it to a 'normal' one, so I can't set up a different regex to find those separately.
Notice that the parentheses in the first entry look normal, but for the second entry, it appears sort of 'sharp'
If I use stringr's str_extract, the first instance works fine, but the second fails.
But, the encodings are the same. Anyone know what's going on?
[Edit: here are the results of dput on these same examples. dput apparently sees the parentheses as equivalent, even though grep does not]
c("Obnaružena poterâ šaga na (Motor šprica pipettora R1).", "(STAT tàn zhen Z zhóu ma dá) tàn cè dào diu bù<U+3002>")
Finally, I am actually copy and pasting the two parentheses from R into the code window below; they do appear different this way. First is normal, second is the strange one.
( (

How to fix a character value returned as a numeric value in R?

I am a new R user and having some difficulty when trying to rename certain records in a column.
My data have columns named classcode and fish_tl, among others. Classcode is a character value, fish_tl is numeric.
When classcode='OCAL' and fish_tl<20, I need to rename that value of classcode so that it is now "OCALYOY". I don't want to change any of the other records in classcode.
I'm running the following code:
data$classcode<-ifelse(data$classcode=='OCAL'& data$fish_tl<20,
'OCALYOY',data$classcode)
My problem seems to be with the "else" aspect: the code runs fine, and returns 'OCALYOY' as expected, but the other values of classcode have now been converted to numeric (although when I look at the mode of that field, it still returns as "character").
What am I doing wrong?
Thanks very much!
You can make the else part as.character(data$classcode). ifelse has some odd semantics with regard to the classes of the arguments, and it is turning your factor into its underlying numeric representation. as.character will keep it as a character value.
You may be getting tripped up in a factor vs character issue, though you point out that R thinks it's character. Regardless, wrapping as.character() around your code seems to fix the problem for me:
> ifelse(data$classcode=='OCAL'& data$fish_tl<20,
+ 'OCALYOY',as.character(data$classcode))
#-----
[1] "BFRE" "BFRE" "BFRE" "HARG" "OCALYOY" "OYT" "OYT" "PFUR"
[9] "SPAU" "BFRE" "OCALYOY" "OCAL"
If this isn't it, can you make your question reproducible by adding the output of dput() to your question instead of the text representation?
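In the absence of the original data, here is a small reproducible sketch of the behaviour. The three-row data frame is made up, and it assumes classcode is stored as a factor, which is what the answers above suspect.

```r
# Hypothetical data; classcode is deliberately a factor here
data <- data.frame(
  classcode = factor(c("BFRE", "OCAL", "OCAL")),
  fish_tl   = c(30, 10, 25)
)

small_ocal <- data$classcode == "OCAL" & data$fish_tl < 20

# The "no" branch is a factor, so ifelse() falls back to its integer codes
bad <- ifelse(small_ocal, "OCALYOY", data$classcode)
bad    # "1" "OCALYOY" "2": level codes, stored as character

# Converting the factor to character first keeps the labels
good <- ifelse(small_ocal, "OCALYOY", as.character(data$classcode))
good   # "BFRE" "OCALYOY" "OCAL"
```

This also matches the observation in the question that the mode is still "character": the codes end up as strings such as "1" and "2", not as true numbers.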

Setting column width in a data set

I would like to set column widths (for all the 3 columns) in this data set, as: anim=1-10; sireid=11-20; damid=21-30. Some columns have missing values.
anim=c("1A038","1C467","2F179","38138","030081")
sireid=c("NA","NA","1W960","1W960","64404")
damid=c("NA","NA","1P119","1P119","63666")
mydf=data.frame(anim,sireid,damid)
From reading your question as well as your comments to previous answers, it seems to me that you are trying to create a fixed width file with your data. If this is the case, you can use the function write.fwf in package gdata:
Load the package and create a temporary output file:
library(gdata)
ff <- tempfile()
Write your data in fixed width format to the temporary file:
write.fwf(mydf, file=ff, width=c(10,10,10), colnames=FALSE)
Read the file with scan and print the results (to demonstrate fixed width output):
zz <- scan(ff, what="character", sep="\n")
cat(zz, sep="\n")
1A038      NA         NA
1C467      NA         NA
2F179      1W960      1P119
38138      1W960      1P119
030081     64404      63666
Delete the temporary file:
unlink(ff)
You can also write fixed width output for numbers and strings using the sprintf() function, which derives from C's counterpart.
For instance, to pad integers with 0s:
sprintf("%012d",99)
To pad with spaces:
sprintf("%12d",123)
And to pad strings:
sprintf("%20s","hello world")
The options for formatting are found via ?sprintf and there are many guides to formatting C output for fixed width.
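Applied to the data from the question, a sprintf()-based sketch might look like the following. It assumes left-justified fields of exactly 10 characters with no separator, so that anim occupies columns 1-10, sireid 11-20, and damid 21-30; the variable name `fixed_lines` is made up for illustration.

```r
# Data from the question, kept as character columns
mydf <- data.frame(
  anim   = c("1A038", "1C467", "2F179", "38138", "030081"),
  sireid = c("NA", "NA", "1W960", "1W960", "64404"),
  damid  = c("NA", "NA", "1P119", "1P119", "63666"),
  stringsAsFactors = FALSE
)

# "%-10s" left-justifies a string and pads it with spaces to 10 characters
fixed_lines <- sprintf("%-10s%-10s%-10s", mydf$anim, mydf$sireid, mydf$damid)

# Print the result; every line is 30 characters wide
cat(fixed_lines, sep = "\n")
```

The lines could then be written to disk with writeLines(fixed_lines, "some_file.txt"), where the file name is a placeholder.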
It sounds like you're coming from a SAS background, where character variables should have explicit lengths specified to avoid unexpected truncations. In R, you don't need to worry about this. A character string has exactly as many characters as it needs, and automatically expands and contracts as its contents change.
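One thing you should be aware of, though, is the silent conversion of character variables to factors in a data frame: before R 4.0.0, data.frame() did this by default (stringsAsFactors = TRUE), while from R 4.0.0 onward the default is FALSE. However, unless you change the contents at a later point in time, you should be able to live with the default.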
