Can I replace values in cells conditional on a string? - r

I am new to R and trying to create my own dataset by modifying Eurostat data. I now have regions with names such as AT111, AT112 and ITC11. I want to give each country a number, so that all regions from AT have a country code equal to 1.
For that I have added a new empty column to my dataset. Is there a way for me to do this:
NUTS3.3[NUTS3.3$geo == "AT111", "country"] <- 1
for all observations whose geo string contains "AT" at once?
I have >26 000 observations, so doing it for every single regional code would be tedious.

We can take the first two characters of the geo column with substr() and compare them to "AT" with ==:
NUTS3.3$country[substr(NUTS3.3$geo, 1, 2)=="AT"] <- 1
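To number every country in one go instead of repeating that line per prefix, one option is to turn the two-letter prefix into a factor and take its integer codes. This is only a minimal sketch: the sample rows are made up, and it assumes that numbering the countries in order of first appearance is acceptable.

```r
# Hypothetical sample data standing in for the real Eurostat table
NUTS3.3 <- data.frame(geo = c("AT111", "AT112", "ITC11", "BE100"),
                      stringsAsFactors = FALSE)

# Two-letter country prefix of every region code
prefix <- substr(NUTS3.3$geo, 1, 2)

# Integer code per distinct prefix; the levels are set explicitly so the
# numbering follows the order of first appearance (AT = 1, IT = 2, BE = 3)
NUTS3.3$country <- as.integer(factor(prefix, levels = unique(prefix)))

NUTS3.3
```

If a fixed numbering is needed, passing your own vector of prefixes as levels should give you control over which country gets which number.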

Related

Creating horizontal dataframe from vertical table (with repeated variables)

Need to create usable dataframe using R or Excel
Variable1 | ID | Variable2
Name of A person 1 | 002157 | NULL
Drugs used | NULL | 3.0
Days in hospital | NULL | 2
Name of a surgeon | NULL | JOHN T.
Name of A person 2 | 002158 | NULL
Drugs used | NULL | 4.0
Days in hospital | NULL | 5
Name of a surgeon | NULL | ADAM S.
I have a table exported from 1C (accounting software). It contains more than 20 thousand observations. The task is to analyze how many drugs were used and how many days each patient stayed in the hospital.
For that reason, I need to transform this dataframe into a second dataframe that is suitable for analysis (from vertical to horizontal). Basically, I have to create a dataframe consisting of 4 columns: ID, Drugs used, Hospital stay, and Name of a surgeon. I am guessing that it requires two functions:
for ID, it must read the first dataframe and extract the filled rows;
for Name of a surgeon, Drugs used and Days in hospital, the function has to check that the row corresponds to one of those variables and extract the data from the third column, adding it to the second dataframe.
In short, I have no idea how to do that. Could you guys help me write functions for R, or give me tips for Excel?
For R, I guess you want something like this:
1) Load the table. Make sure to substitute the "," with the separator that is used in your file (could be ";" or "\t" for tab etc.). header=TRUE keeps the column names, and na.strings="NULL" makes R read the NULL cells as NA.
df = read.table("path/to/file", sep=",", header=TRUE, na.strings="NULL")
2) Create subset tables that contain only one row per patient.
id = subset(df, !is.na(ID))
drugs = subset(df, Variable1 %in% "Drugs used")
days = subset(df, Variable1 %in% "Days in hospital")
#...etc...
3) Make a new data frame that contains this information.
new_df = data.frame(
  id = id$ID,
  drugs = drugs$Variable2,
  days = days$Variable2
  #...etc... (a comma after every argument except the last!)
)
EDIT:
Note that this approach only works if your table is basically perfect! Otherwise there might be shifts in the data.
#=====================================================
EDIT 2:
If you have an imperfect table, you might want to do something like this:
Step 1.5) Change all NA values in the ID column (labeled NULL in your table; R reads them as NA only if you pass na.strings="NULL" when loading the file) to the patient ID of the row above. Note that the is.na() function in the code below is specifically for NA values and will not work with NULL, "NULL" or other placeholders:
for(i in seq_along(df$ID)){
if(is.na(df$ID[i])) df$ID[i] <- df$ID[i-1]
}
Then go back to step 2) above (you don't need the id subset though) and change each data frame a little. As an example, for the drugs and days data frames:
drugs = drugs[, -1] #removes the first column
colnames(drugs) = c("ID","drugs") #renames the columns
days = days[, -1]
colnames(days) = c("ID", "days")
Then, instead of doing step 3 as above, use merge and choose the ID column as the merging column. Set all=TRUE so that patients missing from one of the tables are kept rather than dropped:
new_df = merge(drugs, days, by="ID", all=TRUE)
Repeat this for the other subsetted data frames:
new_df = merge(new_df, surgeon, by="ID", all=TRUE)
# etc...
That is much more robust: even if some patients have a line that others don't have (e.g. days), the respective column in the new data frame will just contain an NA for those patients.
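The fill-down and merge steps can be put together roughly as follows. This is a sketch under assumptions, not a drop-in solution: the sample rows are invented to mirror the layout in the question (the second patient deliberately has no "Days in hospital" row), the NULL cells are assumed to have been read as NA already, and pick() is a hypothetical helper name.

```r
# Hypothetical sample mirroring the layout in the question; NULL cells are NA here
df <- data.frame(
  Variable1 = c("Name of A person 1", "Drugs used", "Days in hospital",
                "Name of a surgeon", "Name of A person 2", "Drugs used",
                "Name of a surgeon"),
  ID        = c("002157", NA, NA, NA, "002158", NA, NA),
  Variable2 = c(NA, "3.0", "2", "JOHN T.", NA, "4.0", "ADAM S."),
  stringsAsFactors = FALSE
)

# Carry each patient ID down to the rows that belong to it
# (starts at row 2, because row 1 has no row above it)
for (i in seq_along(df$ID)[-1]) {
  if (is.na(df$ID[i])) df$ID[i] <- df$ID[i - 1]
}

# Hypothetical helper: one ID/value table per variable
pick <- function(label, name) {
  out <- df[df$Variable1 == label, c("ID", "Variable2")]
  colnames(out) <- c("ID", name)
  out
}

drugs   <- pick("Drugs used", "drugs")
days    <- pick("Days in hospital", "days")
surgeon <- pick("Name of a surgeon", "surgeon")

# all = TRUE keeps patients that lack a row in one of the tables
new_df <- merge(merge(drugs, days, by = "ID", all = TRUE),
                surgeon, by = "ID", all = TRUE)
new_df
```

Here the patient without a "Days in hospital" row still appears in new_df, with NA in the days column. The values stay character strings, so something like as.numeric() would be needed before doing arithmetic on drugs or days.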

Adding rows to data frame with zero values

I have a dataset with multiple records, each of which is assigned a country, and I want to create a worldmap using rworldmap, coloured according to the frequency with which each country occurs in the dataset. Not all countries appear in the dataset - either because they have no corresponding records or because they are not eligible (e.g. middle/low income countries).
To build the map, I have created a dataframe (dfmap) based on a table of countries, where one column is the country code and the second column is the frequency with which it appears in the dataset.
In order to identify on the map countries which are eligible, but have no records, I have tried to use add_row to add these to my dataframe e.g. for Andorra:
add_row(dfmap, Var1="AND", Freq=0)
When I run add_row for each country, it appears to work (no error message and that new row appears in the table below the command) - but previously added rows where the Freq=0 do not appear.
When I then look at the dataframe using "dfmap" or "summary(dfmap)", none of the rows where Freq=0 appear, and when I build the map, they are coloured as for missing countries.
I'm not sure where I'm going wrong and would welcome any suggestions.
Many thanks
Using the method suggested in the comment above, one can use a join and then the replace_na function to create a tibble with the complete list of countries, giving the countries without records a count of zero.
As there was no sample data in the question, I created two data frames below based on what I thought was implied by the question.
library(dplyr)  # tibble(), right_join(), %>%
library(tidyr)  # replace_na()
dfrm_counts = tibble(Country = c('England','Germany'),
                     Count = c(1,4))
dfrm_all = tibble(Country = c('England', 'Germany', 'France'))
dfrm_final = dfrm_counts %>%
  right_join(dfrm_all, by = "Country") %>%
  replace_na(list(Count = 0))
dfrm_final
# A tibble: 3 x 2
  Country Count
  <chr>   <dbl>
1 England     1
2 Germany     4
3 France      0
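The same idea can be sketched in base R with merge(), in case the tidyverse packages are not available. The column names Var1 and Freq follow the question; the country codes and the eligible table are made up for illustration. Note that the result is assigned back to dfmap. The same applies to add_row(): it returns a modified copy, so calling it without assignment only prints the new table and leaves dfmap unchanged.

```r
# Hypothetical counts, shaped like as.data.frame(table(...))
dfmap <- data.frame(Var1 = c("GBR", "DEU"), Freq = c(1, 4),
                    stringsAsFactors = FALSE)

# Hypothetical list of every eligible country, with or without records
eligible <- data.frame(Var1 = c("GBR", "DEU", "AND"),
                       stringsAsFactors = FALSE)

# Keep every eligible country; countries without records get NA in Freq
dfmap <- merge(eligible, dfmap, by = "Var1", all.x = TRUE)

# Turn the missing counts into explicit zeros
dfmap$Freq[is.na(dfmap$Freq)] <- 0

dfmap
```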

R: returning the 5 rows with the highest values

Sample data
mysample <- data.frame(ID = 1:100, kWh = rnorm(100))
I'm trying to automate the process of returning the rows in a data frame that contain the 5 highest values in a certain column. In the sample data, the 5 highest values in the "kWh" column can be found using the code:
(tail(sort(mysample$kWh), 5))
which in my case returns:
[1] 1.477391 1.765312 1.778396 2.686136 2.710494
I would like to create a table that contains rows that contain these numbers in column 2.
I am attempting to use this code:
mysample[mysample$kWh == (tail(sort(mysample$kWh), 5)),]
This returns:
ID kWh
87 87 1.765312
I would like it to return the 5 rows that contain the figures above in the "kWh" column. I'm sure I've missed something basic but I can't figure it out.
We can use rank
mysample$Rank <- rank(-mysample$kWh)
head(mysample[order(mysample$Rank),],5)
If we don't need to create a column, we can use order directly (as @Jaap mentioned, there are three alternative ways):
#order descending and get the first 5 rows
head(mysample[order(-mysample$kWh),],5)
#order ascending and get the last 5 rows
tail(mysample[order(mysample$kWh),],5)
#or just use a sequence as the row index (note the comma: without it, 1:5 would select columns)
mysample[order(-mysample$kWh),][1:5,]
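As for why the original attempt returned a single row: == compares element by element and recycles the five values along the 100 rows, so a row only matches when its value happens to line up with the recycled position. A sketch of the %in% alternative, which tests membership instead (the seed is an arbitrary choice to make the sample reproducible):

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
mysample <- data.frame(ID = 1:100, kWh = rnorm(100))

# The five largest values, in ascending order
top5 <- tail(sort(mysample$kWh), 5)

# %in% asks, for every row, whether its kWh is one of the five values
top_rows <- mysample[mysample$kWh %in% top5, ]

# Same rows via order(), sorted from highest to lowest
top_rows_sorted <- mysample[order(mysample$kWh, decreasing = TRUE)[1:5], ]

top_rows
top_rows_sorted
```

The %in% version keeps the rows in their original order, while the order() version returns them ranked, which is usually what a "top 5" table wants.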

Copy columns of a data frame based on the value of a third column in R

I have a data frame with 4 columns. On one of the columns I added a date so that each value looks like this
>print(result[[4]][[10000]])
[[10000]]
[1] "Jan" "14" "2012"
That means that in the 10000th element of the 4th column I have these 3 fields. This is the only column that holds multiple values.
Now the other 3 columns of the data frame result are single values not multiple. One of those columns, the first one, has the states of the United States as values. What I want to do is create a new data frame from column 2 and 4 (the one described above) of the result data frame but depending on the state.
So for example I want all the 2nd column and 4th column data of the state of Alabama. I tried this but I don't think it is working properly. "levels" is the 2nd column and "weeks" is the 4th column of the data frame result.
rst <- subset(result, result$states == 'Alabama', select = c(result$levels, result$weeks))
The problem here is that subset is copying all the columns to rst and not just the second and fourth ones of the result data frame that are linked to Alabama state which are the only ones I want. Any idea how to do this correctly?
Edit to add the code
I'm adding the code here since I think there must be something I'm not seeing here. First a small sample of the original data which is on a csv file
st | URL | WEBSITE | al | aln | wk | WEEK | SEASON
Alabama | http://adph.org/influenza/ | Influenza Surveillance | Level 1 | Minimal | Oct-04-2008 | 40 | 2008-09
Alabama | http://adph.org/influenza/ | Influenza Surveillance | Level 1 | Minimal | Oct-11-2008 | 41 | 2008-09
Alaska | http://adph.org/influenza/ | Influenza Surveillance | Level 1 | Minimal | Oct-18-2008 | 42 | 2008-09
Alaska | http://adph.org/influenza/ | Influenza Surveillance | Level 1 | Minimal | Oct-25-2008 | 43 | 2008-09
And this is the code
#Extracts relevant data from the csv file
extract_data<-function(){
  #open the file. NAME SHOULD BE CHANGED
  sd <- read.csv(file="sdr.csv",head=TRUE,sep=",")
  #Extracts the data from the ACTIVITY LEVEL column. Notice that the name of the column was changed on the file
  #to 'al' to make the reference easier
  lv_list <- sd$al
  #Gets only the number from each value getting rid of the word "Level"
  lvs <- lapply(strsplit(as.character(lv_list), " "), function(x) x[2])
  #Gets the ACTIVITY LEVEL NAME. Column name was changed to 'aln' on the file
  lvn_list <- sd$aln
  #Gets the state. Column name was changed to 'st' on the file
  st_list <- sd$st
  #Gets the week. Column name was changed to 'wk' on the file
  wk_list <- sd$wk
  #Divides the weeks data in month, day, year
  wks <- strsplit(as.character(wk_list), "-")
  result<-list("states"=st_list,"levels"=lvs,"lvlnames"=lvn_list,"weeks"=wks)
  return(result)
}
forecast<-function(){
  result=extract_data()
  rst <- subset(result, states == 'Alabama', select = c(levels, weeks))
  return(0) #return results
}
You're nearly there, but you don't need to reference the dataframe in the select argument - this should work:
rst <- subset(result, states == 'Alabama', select = c(levels, weeks))
You could also look into the package dplyr, which gives you SQL like abilities and is great for manipulating larger and more complicated data sets.
EDIT
Thanks for posting your code - I think I've identified a few problems.
The result you return from extract_data() is a list, not a data.frame - which is why the code in forecast() doesn't work. If it did return a dataframe the original solution would work.
You're forming your list out of a combination of vectors and lists, which is a problem - a dataframe is (roughly) a list of vectors, not a collection of the two types. If you replace your list creation line with result <- data.frame(...) you run into problems because of this.
There are two problematic columns - lvs (or levels) and wks (weeks). Where you use lapply(), using sapply() instead would give you a vector, as required (see the manual). The second issue is the weeks column. What you're actually dealing with here is a list of character vectors of length 3. There's no easy way to do what you want - you can't, for example, have each 'cell' of a column in a dataframe contain a character vector, as the columns are themselves vectors.
My suggestions would be to either:
Use the original format "Oct-01-2008", i.e. construct your data.frame with wk_list rather than splitting each date into the three strings;
Convert the original format into a better time format with a package like lubridate (A+++++ would recommend, great package);
Or finally, split the week column into three columns, so you'd have one for month, one for day and one for year. You could do this very simply from wk_list like this:
wks <- sapply(strsplit(as.character(wk_list), "-"), function(x) c(x[1], x[2], x[3]))
Month <- wks[1,]
Day <- wks[2,]
Year <- wks[3,]
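Once lvs is a vector and the weeks are stored as plain vectors (either the original wk_list, or the separate Month, Day and Year vectors added as extra columns), you're good to just run
result<-data.frame("states"=st_list,"levels"=lvs,"lvlnames"=lvn_list,"weeks"=wk_list)
and the script should work.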
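Putting the suggestions together, the data frame construction might look roughly like this. It is a sketch with assumptions spelled out: the CSV is replaced by a small inline data frame using the column names from the question, vapply() is used as a stricter stand-in for sapply() (it fails loudly if a value does not have the expected shape), and the month, day and year column names are my own choice.

```r
# Hypothetical stand-in for the CSV, using the column names from the question
sd <- data.frame(
  st  = c("Alabama", "Alabama", "Alaska", "Alaska"),
  al  = c("Level 1", "Level 1", "Level 1", "Level 1"),
  aln = c("Minimal", "Minimal", "Minimal", "Minimal"),
  wk  = c("Oct-04-2008", "Oct-11-2008", "Oct-18-2008", "Oct-25-2008"),
  stringsAsFactors = FALSE
)

# Level number as a plain character vector (vapply guarantees the type)
lvs <- vapply(strsplit(sd$al, " "), function(x) x[2], character(1))

# Month, day and year as the three rows of a character matrix
parts <- vapply(strsplit(sd$wk, "-"), function(x) x[1:3], character(3))

result <- data.frame(
  states   = sd$st,
  levels   = lvs,
  lvlnames = sd$aln,
  weeks    = sd$wk,      # original date string, kept as one column
  month    = parts[1, ],
  day      = parts[2, ],
  year     = parts[3, ],
  stringsAsFactors = FALSE
)

# Every column is now a plain vector, so subset() works as intended
rst <- subset(result, states == "Alabama", select = c(levels, weeks))
rst
```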

Naming the number of the row in a data frame that contains a certain value

I've done some thorough research and I am struggling with an attempt to find a function that will name the number of the row (in my data frame the rows don't contain numbers) that contains a certain value. In this case a number.
e.g. Call the data frame = df
I don't know how to show a little image of the data frame but say that in row 5, column 4 the value was '162', is there a function I could use that will end with the return being '5' or 'row 5'?
I have used rowSums(df=="162")
which gives a long line of the rows, if they contain the values there is a '1' under them, if not a '0' but I need a function that simply states the row.
I couldn't figure out how to correctly use the 'which' function either.
which(df$col4=='162')
This assumes that col4 is the name of column number 4; the result is the number of every row in which that column equals '162'.
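If the column is not known in advance, which() with arr.ind = TRUE can report the row and column position of the value anywhere in the data frame. A small sketch with a made-up data frame in which 162 sits in row 5, column 4:

```r
# Hypothetical data frame; the value 162 sits in row 5, column 4
df <- data.frame(col1 = 1:6, col2 = 11:16, col3 = 21:26,
                 col4 = c(150, 151, 152, 153, 162, 170))

# Row number(s) where one known column holds the value
row_in_col4 <- which(df$col4 == 162)
row_in_col4

# Row and column positions of the value anywhere in the data frame
hits <- which(df == 162, arr.ind = TRUE)
hits
```

hits is a matrix with one line per match and the columns row and col, so hits[, "row"] gives just the row numbers.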
