Running VMware’s vCSA on MSSQL
I really like the new vCenter Appliance, but I am not a fan of the embedded DB2, external DB2 or Oracle database options. It seems unusual that VMware did not support MSSQL in the first release of the appliance. I suspect they have their reasons, but I prefer not to wait when I know it runs without issue in my lab. If you’re interested in the details, read on and discover how I hacked it into submission.
The vCSA Linux host contains almost all the necessary components to drive an MSSQL DB. The missing elements are an MSSQL ODBC and JDBC driver. They are both available from Microsoft and can be installed on the appliance. Now as you should know VMware would not support such a hack and I’m not suggesting you run it in your world. For me it’s more about the challenge and adventure of it. Besides I don’t expect VMware to support it nor do I need the support.
Outside of these two Microsoft products it is necessary to modify some of the VMware property files and bash code to allow for the mssql drivers.
The appliance hosts three major application components: a web front end using lighttpd, a vpxd engine which appears to be coded in C, and a Tomcat instance. Surrounding these elements we have configuration scripts and files that give end users an easy way to set up the appliance. The first area to address for MSSQL connectivity is Microsoft’s 1.0 ODBC driver for Linux. It can be downloaded directly to the vCSA using curl and installed there.
Enter YES to accept the license or anything else to terminate the installation: YES
Checking for 64 bit Linux compatible OS ..................................... OK
Checking required libs are installed ................................. NOT FOUND
unixODBC utilities (odbc_config and odbcinst) installed ............ NOT CHECKED
unixODBC Driver Manager version 2.3.0 installed .................... NOT CHECKED
unixODBC Driver Manager configuration correct ...................... NOT CHECKED
Microsoft SQL Server ODBC Driver V1.0 for Linux already installed .. NOT CHECKED
Microsoft SQL Server ODBC Driver V1.0 for Linux files copied ................ OK
Symbolic links for bcp and sqlcmd created ................................... OK
Microsoft SQL Server ODBC Driver V1.0 for Linux registered ........... INSTALLED
You will find the install places the ODBC driver in /opt/microsoft
We need to edit the appliance odbcinst template file to include the newly added driver.
vcsa1:/ # vi /etc/vmware-vpx/odbcinst.ini.tpl
We need to append the following ODBC driver entry:
[MSSQL]
Description = Microsoft ODBC driver for SQL v11
Driver = /opt/microsoft/sqlncli/lib64/libsqlncli-11.0.so.1790.0
UsageCount = 1
Threading = 1
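If you would rather script that append than edit the template by hand, a minimal sketch could look like the following. The function name append_mssql_driver is my own invention for illustration and is not something shipped on the appliance; the entry itself is the one shown above. It only appends when no [MSSQL] section exists, so running it twice should be harmless.

```shell
# Hypothetical helper, not part of the appliance: append the [MSSQL] entry
# to the given odbcinst template only when it is not already there.
append_mssql_driver() (
    tpl="$1"
    if grep -q '^\[MSSQL\]' "$tpl" 2>/dev/null; then
        echo "MSSQL entry already present in $tpl"
    else
        printf '%s\n' \
            '[MSSQL]' \
            'Description = Microsoft ODBC driver for SQL v11' \
            'Driver = /opt/microsoft/sqlncli/lib64/libsqlncli-11.0.so.1790.0' \
            'UsageCount = 1' \
            'Threading = 1' >> "$tpl"
        echo "MSSQL entry appended to $tpl"
    fi
)

# Usage on the appliance:
#   append_mssql_driver /etc/vmware-vpx/odbcinst.ini.tpl
```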
The Microsoft driver expects OpenSSL 1.0 to be available. It’s not installed on the appliance and I don’t feel it’s necessary either. We can just point to the installed 0.9.8 libraries, which has caused no issues in my lab. A couple of symbolic links are all we need to get things rolling, as shown here.
vcsa1:/tmp # ln -s /usr/lib64/libcrypto.so.0.9.8 /usr/lib64/libcrypto.so.10
vcsa1:/tmp # ln -s /usr/lib64/libssl.so.0.9.8 /usr/lib64/libssl.so.10
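To confirm the links actually satisfy the driver, one option is to run ldd against the driver library and look for anything the dynamic linker still cannot resolve. The sketch below is only an illustration; missing_libs is a made-up helper name, and it simply filters whatever ldd output it is given on stdin.

```shell
# Hypothetical helper: read `ldd` output on stdin and print the names of
# any shared libraries reported as "not found".
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Usage on the appliance (driver path taken from the odbcinst entry):
#   ldd /opt/microsoft/sqlncli/lib64/libsqlncli-11.0.so.1790.0 | missing_libs
```

An empty result would suggest every dependency resolved; any names printed are libraries that still need attention.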
Tomcat also needs to access the MSSQL server, which requires a Microsoft JDBC driver. It can be downloaded with curl as well.
vcsa1:/ # cd /tmp
vcsa1:/tmp # curl http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz -o sqljdbc_4.0.2206.100_enu.tar.gz
vcsa1:/tmp # tar -xvf sqljdbc_4.0.2206.100_enu.tar.gz
vcsa1:/tmp # cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/lib/vmware-vpx/common-jars/
I suspect that the JDBC driver is used within the Tomcat application to collect status info from ESX agents, but don’t hold me to that guess.
Once we have our MSSQL drivers in place we need to focus on hacking the config files and shell scripts. Let’s start with the web front end first.
Within /opt/vmware/share/htdocs/service/virtualcenter we find the appliance service configuration scripts and other various files. We need to edit the following files.
layout.xml – Database action fields
fields.properties – Database type field list values
We need to add the mssql DBType values to give us the option from the database configuration menu and to enable the action.
layout.xml needs the following segment replaced.
<changeHandlers>
<!-- actions can be enable,disable,clear -->
<onChange id="database.vc.type">
<field id="database.vc.server">
<if value="embedded" actions="disable,clear"/>
<if value="UNCONFIGURED" actions="disable,clear"/>
<if value="db2" actions="enable"/>
<if value="oracle" actions="enable"/>
<if value="mssql" actions="enable"/>
</field>
<field id="database.vc.port">
<if value="embedded" actions="disable,clear"/>
<if value="UNCONFIGURED" actions="disable,clear"/>
<if value="db2" actions="enable"/>
<if value="oracle" actions="enable"/>
<if value="mssql" actions="enable"/>
</field>
<field id="database.vc.instance">
<if value="embedded" actions="disable,clear"/>
<if value="UNCONFIGURED" actions="disable,clear"/>
<if value="db2" actions="enable"/>
<if value="oracle" actions="enable"/>
<if value="mssql" actions="enable"/>
</field>
<field id="database.vc.login">
<if value="embedded" actions="disable,clear"/>
<if value="UNCONFIGURED" actions="disable,clear"/>
<if value="db2" actions="enable"/>
<if value="oracle" actions="enable"/>
<if value="mssql" actions="enable"/>
</field>
<field id="database.vc.password">
<if value="embedded" actions="disable,clear"/>
<if value="UNCONFIGURED" actions="disable,clear"/>
<if value="db2" actions="enable"/>
<if value="oracle" actions="enable"/>
<if value="mssql" actions="enable"/>
</field>
</onChange>
</changeHandlers>
fields.properties needs the following edit where we are adding mssql to the assignment statement.
database.type.vc.values = UNCONFIGURED;embedded;oracle;mssql
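For anyone who prefers to script the properties edit, here is a rough sketch. It assumes the stock line starts with database.type.vc.values and does not yet contain mssql; the function name add_mssql_db_type is hypothetical, and the properties file path is passed in as an argument. A .bak copy is kept next to the file.

```shell
# Hypothetical helper: append ";mssql" to the database.type.vc.values line
# of the given properties file, unless mssql is already listed.
add_mssql_db_type() (
    props="$1"
    key='database.type.vc.values'
    if grep "^$key" "$props" | grep -q 'mssql'; then
        echo "mssql already listed"
    else
        # Keep a backup, then rewrite only the matching assignment line.
        cp "$props" "$props.bak"
        sed "/^$key/ s/[[:space:]]*\$/;mssql/" "$props.bak" > "$props"
        echo "mssql added"
    fi
)
```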
Once we have the web front end elements populated with the new values we can focus on the bash shell script. The scripts are located in /usr/sbin. We need to work the following script.
vpxd_servicecfg – This script needs the following subroutines replaced with versions that handle the database connection settings for mssql. Two areas need modification: do_db_test and do_db_write. The test routine needs to accept mssql as a valid DBType and, based on the DBType, checks a series of input parameters such as the server address, user and instance. The config write routine also needs to detect the mssql DBType and apply a custom modification for the database connection URL. These calls depend on a proper mssql ODBC driver configuration.
###############################
#
# Test DB configuration
#
do_db_test()
{
DB_TYPE=$1
DB_SERVER=$2
DB_PORT=$3
DB_INSTANCE=$4
DB_USER=$5
DB_PASSWORD=$6
log "Testing DB. Type ($DB_TYPE) Server ($DB_SERVER) Port ($DB_PORT) Instance ($DB_INSTANCE) User ($DB_USER)"
case "$DB_TYPE" in
"mssql" )
log "DB Type is MSSQL"
;;
"oracle" )
;;
"embedded" )
set_embedded_db
;;
*)
log "ERROR: Invalid DB TYPE ($DB_TYPE)"
RESULT=$ERROR_DB_INVALID_TYPE
return 1
;;
esac
if [[ -z "$DB_SERVER" ]]; then
log "ERROR: DB Server was not specified"
RESULT=$ERROR_DB_SERVER_NOT_FOUND
return 1
fi
ping_host "$DB_SERVER"
if [[ $? -ne 0 ]]; then
log "ERROR: Failed to ping DB server: " "$DB_SERVER"
RESULT=$ERROR_DB_SERVER_NOT_FOUND
return 1
fi
# Check for spaces
DB_PORT=`$SED 's/^ *$/0/' <<< $DB_PORT`
# check for non-digits
if [[ ! "$DB_PORT" =~ ^[0-9]+$ ]]; then
log "Error: Invalid database port: " $DB_PORT
RESULT=$ERROR_DB_SERVER_PORT_INVALID
return 1
fi
if [[ -z "$DB_PORT" || "$DB_PORT" == "0" ]]; then
# Set port to default
case "$DB_TYPE" in
"db2")
DB_PORT="50000"
;;
"oracle")
DB_PORT="1521"
;;
*)
DB_PORT="-1"
;;
esac
fi
#Check whether numeric
typeset -i xport
xport=$(($DB_PORT+0))
if [ $xport -eq 0 ]; then
log "Error: Invalid database port: " $DB_PORT
RESULT=$ERROR_DB_SERVER_PORT_INVALID
return 1
fi
#Check whether within valid range
if [[ $xport -lt 1 || $xport -gt 65535 ]]; then
log "Error: Invalid database port: " $DB_PORT
RESULT=$ERROR_DB_SERVER_PORT_INVALID
return 1
fi
if [[ -z "$DB_INSTANCE" ]]; then
log "ERROR: DB instance was not specified"
RESULT=$ERROR_DB_INSTANCE_NOT_FOUND
return 1
fi
if [[ -z "$DB_USER" ]]; then
log "ERROR: DB user was not specified"
RESULT=$ERROR_DB_CREDENTIALS_INVALID
return 1
fi
if [[ -z "$DB_PASSWORD" ]]; then
log "ERROR: DB password was not specified"
RESULT=$ERROR_DB_CREDENTIALS_INVALID
return 1
fi
if [ `date +%s` -lt `cat /etc/vmware-vpx/install.time` ]; then
log "ERROR: Wrong system time"
RESULT=$ERROR_DB_WRONG_TIME
return 1
fi
return 0
}
###############################
#
# Write DB configuration
#
do_db_write()
{
DB_TYPE=$1
DB_SERVER=$2
DB_PORT=$3
DB_INSTANCE=$4
DB_USER=$5
DB_PASSWORD=$6
case "$DB_TYPE" in
"embedded" )
set_embedded_db_autostart on &>/dev/null
start_embedded_db &>/dev/null
if [[ $? -ne 0 ]]; then
log "ERROR: Failed to start embedded DB"
fi
;;
* )
set_embedded_db_autostart off &>/dev/null
stop_embedded_db &>/dev/null
;;
esac
set_embedded_db
ESCAPED_DB_INSTANCE=$(escape_for_sed $DB_INSTANCE)
ESCAPED_DB_TYPE=$(escape_for_sed $DB_TYPE)
ESCAPED_DB_USER=$(escape_for_sed $DB_USER)
# these may be changed below
ESCAPED_DB_SERVER=$(escape_for_sed $DB_SERVER)
ESCAPED_DB_PORT=$(escape_for_sed $DB_PORT)
case "$DB_TYPE" in
"db2")
# Set port to default if its set to 0
if [[ "$DB_PORT" -eq 0 ]]; then
DB_PORT=50000
ESCAPED_DB_PORT=$(escape_for_sed $DB_PORT)
fi
DRIVER_NAME=""
URL=""
FILE=`$MKTEMP`
$CP $DB2CLI_INI_OUT $FILE 1>/dev/null 2>&1
DB_FILES[${#DB_FILES[*]}]="$DB2CLI_INI_OUT $FILE" # Store file tuple
$SED \
-e "s!$TNS_SERVICE_SED_STRING!$ESCAPED_DB_INSTANCE!" \
-e "s!$SERVER_NAME_SED_STRING!$ESCAPED_DB_SERVER!" \
-e "s!$SERVER_PORT_SED_STRING!$ESCAPED_DB_PORT!" \
-e "s!$USER_ID_SED_STRING!$ESCAPED_DB_USER!" \
$DB2CLI_INI_IN > $DB2CLI_INI_OUT
;;
"oracle")
# Add [ ] around IPv6 addresses
echo "$DB_SERVER" | grep -q '^[^[].*:' && DB_SERVER='['"$DB_SERVER"']'
;;
"mssql" )
TNS_SERVICE=$DB_INSTANCE
# Set port to default if its set to 0
if [[ "$DB_PORT" -eq 0 ]]; then
DB_PORT=1433
ESCAPED_DB_PORT=$(escape_for_sed $DB_PORT)
fi
;;
esac
if [[ "$DB_PORT" -eq 0 ]]; then
DB_PORT=`get_default_db_port $DB_TYPE`
fi
# Save the original ODBC and DB configuration files
FILE=`$MKTEMP`
$CP $ODBC_INI_OUT $FILE 1>/dev/null 2>&1
DB_FILES[${#DB_FILES[*]}]="$ODBC_INI_OUT $FILE" # Store filename
FILE=`$MKTEMP`
$CP $ODBCINST_INI_OUT $FILE 1>/dev/null 2>&1
DB_FILES[${#DB_FILES[*]}]="$ODBCINST_INI_OUT $FILE" # Store filename
# update the values
ESCAPED_DB_SERVER=$(escape_for_sed $DB_SERVER)
ESCAPED_DB_PORT=$(escape_for_sed $DB_PORT)
# Create new configuration files
$SED \
-e "s!$DB_TYPE_SED_STRING!$ESCAPED_DB_TYPE!" \
-e "s!$TNS_SERVICE_SED_STRING!$ESCAPED_DB_INSTANCE!" \
-e "s!$SERVER_NAME_SED_STRING!$ESCAPED_DB_SERVER!" \
-e "s!$SERVER_PORT_SED_STRING!$ESCAPED_DB_PORT!" \
-e "s!$USER_ID_SED_STRING!$ESCAPED_DB_USER!" \
$ODBC_INI_IN > $ODBC_INI_OUT
$CP $ODBCINST_INI_IN $ODBCINST_INI_OUT 1>/dev/null 2>&1
do_jdbc_write "$DB_TYPE" "$DB_SERVER" "$DB_PORT" "$DB_INSTANCE" \
"$DB_USER" "$DB_PASSWORD"
return 0
}
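One detail worth noting in the listing: do_db_test only fills in default ports for db2 and oracle, while do_db_write defaults mssql to 1433. As listed, a blank or 0 port for mssql appears to fall through to -1 in the test step and be rejected as invalid, so entering 1433 explicitly is the safe route. The stand-alone sketch below illustrates the defaulting logic with mssql included. The sketch_ function names are hypothetical and are not part of vpxd_servicecfg.

```shell
# Hypothetical stand-alone helpers (not part of vpxd_servicecfg).
# Map a DB type to the default port used when the port is blank or 0.
sketch_default_db_port() {
    case "$1" in
        db2)    printf '%s\n' 50000 ;;
        oracle) printf '%s\n' 1521 ;;
        mssql)  printf '%s\n' 1433 ;;
        *)      printf '%s\n' -1 ;;
    esac
}

# Keep an explicit port; otherwise fall back to the type's default.
sketch_effective_port() {
    if [ -z "$2" ] || [ "$2" = "0" ]; then
        sketch_default_db_port "$1"
    else
        printf '%s\n' "$2"
    fi
}
```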
At this point the appliance MUST be restarted to work correctly.
With the hacks applied, our appliance is now capable of driving an MSSQL database. On the MSSQL server side you need to have a database created and named VCDB. You will also need an SQL user named vc, which must initially be set as a sysadmin; once the database is initialized you can downgrade it to dbo of only the VCDB.
The steps to add your database to the appliance are very easy and here are some screen shots of the web console database config panel to demonstrate this ease of implementation.


If you’re interested in trying it out I have included the files for release 5.0.0-455964 here.
/etc/vmware-vpx/odbcinst.ini.tpl
/opt/vmware/share/htdocs/service/virtualcenter/layout.xml
/opt/vmware/share/htdocs/service/virtualcenter/fields.properties
I have found no issues to date in my lab after 15 days. This does not mean it’s issue free, and I would advise anyone to use caution. This was not tested with heavy loads.
Well I hope you found this blog entry to be interesting and possibly useful.
Regards,
Mike
Site Contents: © 2012 Mike La Spina
Updated ZFS Replication and Snapshot Rollup Script
Thanks to the efforts of Ryan Kernan we have an updated ZFS replication and snapshot rollup script. Ryan’s OpenIndiana/Solaris/Illumos community contribution improves the script to allow for a more dynamic source to target pool replication and changes the snapshot retention method to a specific number of snapshots rather than a Grandfather Father Son method.
Regards,
Mike
Site Contents: © 2011 Mike La Spina
OpenSolaris Door Closes
Months of silence from Oracle on any official statement about OpenSolaris support and development have passed. With that inaction the OpenSolaris Governing Board has motioned to disband and has passed this motion. This now leaves Oracle alone with its not so open “Open Source” operating system. Unfortunately this inaction does not follow Larry Ellison’s public statement indicating that OpenSolaris support would continue. In my view closing the source code until Oracle develops a new release of Solaris is certainly not what I consider an Open Source effort. Granted, most of the development of OpenSolaris code was performed by Sun engineers, but this would be expected as the Solaris OS source was only exposed for 5 years, and this really is a short window of time for other developers to familiarize themselves with the code. Surely Larry, as an intelligent man, knows that this will result in the abandonment of OpenSolaris by the 40,000+ users that were actively exploring it. The only viable alternative will now be held by Nexenta in Illumos. In the background IRC chatter we have seen leaked email evidence that Oracle wishes to keep Solaris closed and will only release the OpenSolaris source when they have greatly distanced the public from its features. Not a good move for the Open Source world, but that’s the way this bit of history is unfolding.
I’m looking forward to working with Illumos, hopefully you are too.
Maybe Mr. Ellison could do some rethinking about what legacy he would like to leave the world.
Regards,
Mike
Site Contents: © 2010 Mike La Spina
The Illumos Project Launches
If you use or are interested in OpenSolaris then you should check out the Illumos Project, which was announced today by Garrett D’Amore of Nexenta. It’s an excellent development project which initially is working toward delivering a compatible, fully open sourced version of the closed OpenSolaris binaries. At first I thought this was going to be a pure fork of OpenSolaris, however it’s not really a fork. The Illumos project maintains close compatibility and functionality with its parent OpenSolaris code stream while granting more innovative development freedom and full community control. All good things in my books.
http://www.illumos.org/projects/site/wiki/Announcement
Regards,
Mike
Site Contents: © 2010 Mike La Spina
Encapsulating VT-d Accelerated ZFS Storage within ESXi
Some time ago I found myself conceptually provisioning ESXi hosts that could transition local storage in a distributed manner within an array of hypervisors. The architectural model likens itself to an amorphous cluster of servers which share a common VM client service that self provisions shared storage to its parent hypervisor or even other external hypervisors. This concept originally became a reality in one of my earlier blog entries named Provisioning Disaster Recovery with ZFS, iSCSI and VMware. With this previous success of a DR scope we can now explore more adventurous applications of storage encapsulation and further coin the phrase of “rampant layering violations of storage provisioning” thanks to Jeff Bonwick, Jim Moore and many other brilliant creative minds behind the ZFS storage technology advancements. One of the main barriers to success for this concept was the serious issue of circular latency from within the self provisioning storage VM. What this commonly means is we have a long wait cycle for the storage VM to ready the requested storage, since it must wait for the hypervisor to schedule access to the raw storage blocks for the virtualized shared target, which then will re-provision it to other VMs. This issue is acceptable for a DR application but it’s a major show stopper for applications that require normal performance levels.
This major issue now has a solution with the introduction of Intel’s VT-d technology. VT-d allows us to accelerate storage I/O functionality directly inside a VM served by VMware ESX and ESXi hypervisors. VMware has leveraged Intel’s VT-d technology on ESXi 4.x (AMD I/O Virtualization Technology (IOMMU) is also supported) as part of the feature named VMDirectPath. This feature allows us to insert high speed devices inside a VM, which can now host a device that operates at the hardware speed of the PCI bus, and that, my friend, allows virtualized ZFS storage provisioning VMs to dramatically reduce or eliminate the hypervisor’s circular latency issue.
Very exciting indeed, so let’s leverage a visual diagram of this amorphous server cluster concept to better capture what this envisioning actually entails.
The concept depicted here sets a multipoint NFS share strategy. Each ESXi host provisions its own NFS share from its local storage, which can be accessed by any of the other hosts including itself. Additionally each encapsulated storage VM incorporates ZFS replication to a neighboring storage VM in a ring pattern, thus allowing for crash based recovery in the event of a host failure. Each ESXi instance hosts a DDRdrive X1 PCIe card which is presented to its storage VM over VT-d and VMDirectPath, a.k.a. PCI pass-through. When managed via vCenter this solution allows us to svMotion VMs across the cluster, allowing rolling upgrades or hardware servicing.
The ZFS replication cycle works as a background ZFS send/receive script process that incrementally updates the target storage VM. One very useful feature of the ZFS send/receive capability is the include ZFS properties flag, -p. When this flag is used any NFS share properties that are defined using “sharenfs=” will be sent to the target host. Thus the only required action to enable access to the replicated NFS share is to add it as an NFS storage target on our ESXi host. Of course we would also need to stop replication if we wish to use the backup, or clone it to a new share for testing. Testing the backup without cloning will result in a modified ZFS target file system and this could force a complete ZFS resend of the file system in some cases.
Within this architecture our storage VM is built with OpenSolaris snv_134, thus we have the ability to engage in ZFS deduplication. This not only improves the storage capacity, it also grants improved performance when we allocate sufficient memory to the storage VM. The ZFS ARC cache needs to cache these dedup blocks only once, which accelerates all dedup access requests. For example, if this cluster served a Virtual Desktop Infrastructure (VDI) environment we would see all the OS file allocation blocks enter into the ZFS ARC cache and thus all VMs that reference the same OS file blocks would be cache accelerated. Dedup also grants a benefit with ZFS replication through the ZFS send -D flag. This flag instructs ZFS send to generate the stream in dedup format, and this dramatically reduces replication bandwidth and time consumption in a VMware environment.
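To make the two send flags concrete, here is a small sketch that only prints an incremental replication pipeline rather than running it. The function name build_replication_cmd, the snapshot names and the target host zfsadm@uss2 are all placeholders of mine, not taken from the actual replication script; -p and -D are the zfs send flags for properties and dedup streams.

```shell
# Hypothetical helper: print (do not run) an incremental send/receive pipeline.
# -p carries dataset properties such as sharenfs, -D requests a dedup stream,
# -i names the previous snapshot the increment is based on.
build_replication_cmd() (
    fs="$1" prev="$2" cur="$3" target="$4"
    echo "zfs send -p -D -i $fs@$prev $fs@$cur | ssh $target zfs receive $fs"
)

# Example with placeholder names:
#   build_replication_cmd sp1/nas/vol0 snap1 snap2 zfsadm@uss2
```

Printing the command first makes it easy to review what would be sent before wiring anything into cron.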
With VT-d we now have the ability to add a non-volatile disk device as a dedicated ZIL accelerator commonly called a SLOG or Separate Intent Log. In this proof of concept architecture I have defined the DDRdrive X1 as a SLOG disk over VMware VMDirectPath to our storage VM. This was a challenge to accomplish as VT-d is just emerging and has many unknown behaviors with system PCI BUS timing and IRQ handling. Coaxing VT-d to work correctly proved to be the most technically difficult component of this proof of concept, however success is at hand using a reasonably cost effective ASUS motherboard in my home lab environment.
Let’s begin with the configuration of VT-d and VMware VMDirectPath.
VT-d requires system BIOS support and this function is available on the ASUS P6X58D series of motherboards. The feature is not enabled by default; you must change it in the BIOS. I have found that enabling VT-d does impact how ESXi behaves. For example, some local storage devices that were available prior to enabling VT-d may not be accessible after enabling it, which could result in messages like “cannot retrieve extended partition information”.
The following screen shots demonstrate where you would find the VT-d BIOS setting on the P6X58D mobo.


If you’re using an AMD 890FX based ASUS Crosshair IV mobo then look for the IOMMU setting as depicted here:
Thanks go to Stu Radnidge over at http://vinternals.com/ for the screen shot!

Once VT-d or IOMMU is enabled ESXi VMDirectPath can be enabled from the VMware vSphere client host configuration-> advanced menu and will require a reboot to complete any further PCI sharing configurations.
One challenge I encountered was PCIe bus timing issues. Fortunately the ASUS P6X58D overclocking capability grants us the ability to align our clock timing on the PCIe bus by tuning the frequency and voltage, and thus I was able to stabilize the PCIe interface running the DDRdrive X1. Here are the original values I used that worked. Since that time I have pushed the i7 CPU to 4.0GHz, but that can be risky since you need to up the CPU and DRAM voltages, so I will leave the safe values for public consumption.



Once VT-d is active you will be able to edit the enumerated PCI device list check boxes and allow pass through for the device of your choice. There are three important PCI values to note: the device ID, the vendor ID and the class ID. You can Google them or take this shortcut, http://www.pcidatabase.com/, to discover who owns the device and what class it belongs to. In this case I needed to ID the DDRdrive X1 and I know by the class ID 0100 that it is a SCSI device.
Once our DDRdrive X1 device is added to the encapsulated OpenSolaris VM its shared IRQ mode will need to be adjusted such that no other IRQs are chained to it. This is adjusted by adding a custom VM config parameter named pciPassthru0.msiEnabled and setting its value to false.
In this proof of concept the storage VM is assigned 4GB of memory, which is reasonable for non-deduped storage. If you plan to dedup the storage I would suggest significantly more memory to allow the block hash table to be held in memory. This is important for performance and is also needed if you have to delete a ZFS file system. The amount will vary depending on the total storage provisioned; as a rough estimate I would allow about 8GB of memory for each 1TB of used storage. As well, we have two network interfaces, one of which will carry the storage traffic only. Keep in mind that dedup is still developing and should be heavily tested; you should expect some issues.
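The 8GB per 1TB rule of thumb is easy to turn into a quick calculation. The sketch below is just that arithmetic, rounded up to a whole GB; dedup_mem_gb is a hypothetical name and the ratio is only my rough estimate from above, not a ZFS sizing formula.

```shell
# Hypothetical helper: estimate dedup memory in GB for a given amount of
# used storage in GB, using the rough 8GB-per-1TB rule and rounding up.
dedup_mem_gb() {
    echo $(( ($1 * 8 + 1023) / 1024 ))
}

# Examples:
#   dedup_mem_gb 1024   # 1TB used
#   dedup_mem_gb 256    # the 256GB vmdk used in this POC
```

By this estimate 1TB of used storage works out to 8GB and a fully used 256GB vmdk to 2GB, on top of what the VM needs for everything else.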
If you have read my previous blog entry Running ZFS Over NFS as a VMware Store you will find the next section to be very similar. This is essentially many of the same steps but excludes aggregation and IPMP capability.
Using a basic OpenSolaris Indiana completed install we can proceed to configure a shared NFS store so let’s begin with the IP interface. We don’t need a complex network configuration for this storage VM and therefore we will just setup simple static IP interfaces, one to manage the OpenSolaris storage VM and one to provision the NFS store. Remember that you should normally separate storage networks from other network types from both a management and security perspective.
OpenSolaris will default to a dynamic network service configuration named nwam. This needs to be disabled and the physical:default service enabled.
root@uss1:~# svcadm disable svc:/network/physical:nwam
root@uss1:~# svcadm enable svc:/network/physical:default
To persistently configure the interfaces we can store the IP address in the local hosts file. The file will be referenced by the physical:default service to define the network IP address of the interfaces when the service starts up.
Edit /etc/hosts to have the following host entries.
::1 localhost
127.0.0.1 uss1.local localhost loghost
10.0.0.1 uss1 uss1.domain.name
10.1.0.1 uss1.esan.data1
As an option if you don’t normally use vi you can install nano.
root@uss1:~# pkg install SUNWgnu-nano
When an OpenSolaris host starts up, the physical:default service will reference the /etc directory and match any plumbed network device to a file whose name has a prefix of “hostname” and an extension matching the interface name. For example in this VM we have defined two Intel e1000 interfaces which will be plumbed using the following commands.
root@uss1:~# ifconfig e1000g0 plumb
root@uss1:~# ifconfig e1000g1 plumb
Once plumbed these network devices will be enumerated by the physical:default service, and if a file exists in the /etc directory named hostname.e1000g0 the service will use the content of this file to configure this interface in the format that ifconfig uses. Here we have created the file using echo. The “uss1.esan.data1” name will be looked up in the hosts file and maps to IP 10.1.0.1; the network mask and broadcast will be assigned as specified.
root@uss1:~# echo uss1.esan.data1 netmask 255.255.0.0 broadcast 10.1.255.255 > /etc/hostname.e1000g0
One important note: if your /etc/hostname.e1000g0 file has blank lines you may find that persistence fails on any interface after the blank line, so a sanity check for blank lines in the file is advised.
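That sanity check can be as simple as counting empty or whitespace-only lines. A minimal sketch follows; count_blank_lines is a made-up name and the file path is passed in as an argument.

```shell
# Hypothetical helper: print the number of empty or whitespace-only lines
# in the given file. grep -c exits non-zero when the count is 0, so the
# status is ignored and only the printed count matters.
count_blank_lines() {
    grep -c '^[[:space:]]*$' "$1" || true
}

# Usage:
#   count_blank_lines /etc/hostname.e1000g0
```

Anything other than 0 means the file deserves a second look.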
One important requirement is the default gateway or route. Here we will assign a default route via 10.0.0.254 on the management network. We also need to add a route for network 10.1.0.0 using the following commands. Normally the routing function will dynamically assign the route for 10.1.0.0, so assigning a static one will ensure that no undesired discovered gateways are found and used, which may cause poor performance.
root@uss1:~# route -p add default 10.0.0.254
root@uss1:~# route -p add 10.1.0.0 10.1.0.1
When using NFS I prefer provisioning name resolution as an additional layer of access control. If we use names to define NFS shares and clients we can externally validate the incoming IP with a static file or DNS based name lookup. An OpenSolaris NFS implementation inherently grants this methodology. When a client IP requests access to an NFS share we can define a forward lookup to ensure the IP maps to a name which is granted access to the targeted share. We can simply define the desired FQDNs against the NFS shares.
In small configurations static files are acceptable, as is the case here. For large host farms the use of a DNS service instance would ease the admin cycle. You would just have to be careful that your cached TimeToLive (TTL) value is greater than 2 hours, thus preventing excessive name resolution traffic. The TTL value will control how long the name is cached and this prevents constant external DNS lookups.
To configure name resolution for both file and DNS we simply copy the predefined config file named nsswitch.dns to the active config file nsswitch.conf as follows:
root@uss1:~# cp /etc/nsswitch.dns /etc/nsswitch.conf
Enabling DNS will require the configuration of our /etc/resolv.conf file which defines our name servers and namespace.
e.g.
root@uss1:~# cat /etc/resolv.conf
domain laspina.ca
nameserver 10.1.0.200
nameserver 10.1.0.201
You can also use the static /etc/hosts file to define any resolvable name to IP mapping, which is my preferred method, but since we are using ESXi I will use DNS to ease the administration cycle and avoid the unsupported console hack of ESXi.
It is now necessary to define a zpool using our VT-d enabled PCI DDRdrive X1 and VMDK. The VMDK can be located on any suitable VT-d compatible adapter. There is a good chance that some HBA devices will not work correctly with VT-d and your system BIOS. As a tip I suggest you use a USB disk to provision the ESXi installation as it almost always works and is easy to back up and transfer to other hardware. In this POC I used a 500GB SATA disk attached over an ICH10 AHCI interface. Obviously there are other better performing disk subsystems available, however this is a POC and not for production consumption.
To establish the zpool we need to ID the PCI to CxTxDx device mappings. There are two ways that I am aware of to find these names. You can read through the output of the prtconf -v command and look for disk instances and dev_links, or do it the easy way and use the format command like the following.
root@uss1:~# format
Searching for disks…done
AVAILABLE DISK SELECTIONS:
0. c8t0d0 <DEFAULT cyl 4093 alt 2 hd 128 sec 32>
/pci@0,0/pci15ad,1976@10/sd@0,0
1. c8t1d0 <VMware-Virtual disk-1.0-256.00GB>
/pci@0,0/pci15ad,1976@10/sd@1,0
2. c11t0d0 <DDRDRIVE-X1-0030-3.87GB>
/pci@0,0/pci15ad,7a0@15/pci19e3,8@0/sd@0,0
Specify disk (enter its number): ^C
root@uss1:~#
With the device link info handy we can define the zpool with the DDRdrive X1 as a ZIL using the following command:
root@uss1:~# zpool create sp1 c8t1d0 log c11t0d0
root@uss1:~# zpool status
pool: rpool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c8t0d0s0 ONLINE 0 0 0
errors: No known data errors
pool: sp1
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
sp1 ONLINE 0 0 0
c8t1d0 ONLINE 0 0 0
logs
c11t0d0 ONLINE 0 0 0
errors: No known data errors
With a functional IP interface and ZFS pool complete you can define the NFS share and ZFS file system. Always define NFS properties using zfs set sharenfs=; the share parameters will be stored as part of the ZFS file system, which is ideal for a system failure recovery or ZFS relocation.
zfs create -p sp1/nas/vol0
zfs set mountpoint=/export/uss1-nas-vol0 sp1/nas/vol0
zfs set sharenfs=rw,nosuid,root=vh3-nas:vh2-nas:vh1-nas:vh0-nas sp1/nas/vol0
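When the list of ESXi clients grows, assembling the root= list by hand gets error prone. The following sketch builds the same sharenfs value from a list of host names; build_sharenfs is a hypothetical helper and the rw,nosuid options are simply the ones used above.

```shell
# Hypothetical helper: build a sharenfs value granting rw,nosuid access
# with a colon-separated root= list from the host names given as arguments.
build_sharenfs() (
    # Inside this subshell, "$*" joins the arguments with the first IFS char.
    IFS=:
    echo "rw,nosuid,root=$*"
)

# Usage:
#   zfs set sharenfs="$(build_sharenfs vh3-nas vh2-nas vh1-nas vh0-nas)" sp1/nas/vol0
```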
To connect a VMware ESXi host to this NFS store(s) we need to define a vmkernel network interface which I like to name eSAN-Interface1. This interface should only connect to the storage network vSwitch. The management network and VM network should be on another separate vSwitch.
Since we are encapsulating the storage VM on the same server we also need to connect the VM to the storage interface over a VM network port group as shown above. At this point we have all the base NFS services ready, and we can now connect our ESXi host to the newly defined NAS storage target.
Thus we now have an encapsulated NFS storage VM provisioning an NFS share to its parent hypervisor.
You may have noticed that the capacity of this share is ~390GB, however we only granted a 256GB vmdk to this storage VM. The capacity anomaly is the result of ZFS deduplication on the shared file system. There are 10 16GB Windows XP hosts and 2 32GB Linux hosts located on this file system, which would normally require 224GB of storage. Obviously dedup is a serious benefit in this case, however you need to be aware of the costs. In order to sustain performance levels similar to non-deduped storage you MUST grant the ZFS code sufficient memory to hold the block hash table in memory. If this memory is not provisioned in sufficient amounts, your storage VM will be relegated to what appears to be a permanent storage bottleneck; in other words you will enter a “processing time vortex”. (Thus, as I have cautioned in the past, ZFS dedup is maturing and needs some code changes before trusting it with mission critical loads. Always test, test, test and repeat until your head spins.)
Here’s the result of using dedup within the encapsulated storage VM.
root@uss1:~# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
rpool 7.94G 3.64G 4.30G 45% 1.00x ONLINE -
sp1 254G 24.9G 229G 9% 6.97x ONLINE -
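As a rough way to read those numbers, multiplying ALLOC by the DEDUP ratio approximates how much data the pool would hold without dedup. This is only an approximation on my part, and logical_gb is a hypothetical helper name; awk is used because the shell has no floating point arithmetic.

```shell
# Hypothetical helper: approximate the un-deduplicated data size in GB
# from the ALLOC and DEDUP columns of `zpool list`.
logical_gb() {
    awk -v alloc="$1" -v ratio="$2" 'BEGIN { printf "%.1f\n", alloc * ratio }'
}

# Using the sp1 values above:
#   logical_gb 24.9 6.97
```

For sp1 that works out to roughly 173.6GB of logical data held in 24.9GB of allocated space.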
And here’s a look at what it’s serving.
Incredibly, the IO performance is simply jaw dropping fast. Here we are observing a grueling 100% random read load at 512 bytes per request. Yes, that’s correct, we are reaching 40,420 IOs per second.
Even more incredible is the IO performance with a 100% random write load at 512 bytes per request. It’s simply unbelievable seeing 38,491 IOs per second inside a VM which is served from a peer VM, all on the same hypervisor.
With a successfully configured and operational NFS share provisioned, the next logical task is to define and automate the replication of this share, and any other shares we may wish to add, to a neighboring encapsulated storage VM or for that matter any OpenSolaris host.
The basic elements of this functionality are as follows:
- Define a dedicated secured user to execute the replication functions.
- Grant the appropriate permissions to this user to access cron and ZFS.
- Assign an RSA Key pair for automated ssh authentication.
- Define a snapshot replication script using ZFS send/receive calls.
- Define a cron job to regularly invoke the script.
Let’s define the dedicated replication user. In this example I will use the name zfsadm.
First we need to create the zfsadm user on all of our storage VMs.
root@uss1:~# useradd -s /bin/bash -d /export/home/zfsadm -P 'ZFS File System Management' zfsadm
root@uss1:~# mkdir /export/home/zfsadm
root@uss1:~# cp /etc/skel/* /export/home/zfsadm
root@uss1:~# echo PATH=/bin:/sbin:/usr/ucb:/etc:. > /export/home/zfsadm/.profile
root@uss1:~# echo export PATH >> /export/home/zfsadm/.profile
root@uss1:~# echo PS1=$'${LOGNAME}@$(/usr/bin/hostname)'~#' ' >> /export/home/zfsadm/.profile
root@uss1:~# chown -R zfsadm /export/home/zfsadm
root@uss1:~# passwd zfsadm
In order to use an RSA key for authentication we must first generate an RSA private/public key pair on the storage head. This is performed using ssh-keygen while logged in as the zfsadm user. You must set the passphrase as blank otherwise the session will prompt for it.
root@uss1:~# su - zfsadm
zfsadm@uss1~# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/export/home/zfsadm/.ssh/id_rsa):
Created directory '/export/home/zfsadm/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /export/home/zfsadm/.ssh/id_rsa.
Your public key has been saved in /export/home/zfsadm/.ssh/id_rsa.pub.
The key fingerprint is:
0c:82:88:fa:46:c7:a2:6c:e2:28:5e:13:0f:a2:38:7f zfsadm@uss1
zfsadm@uss1~#
The id_rsa file should not be exposed outside of this directory as it contains the private key of the pair; only the public key file id_rsa.pub needs to be exported. Now that our key pair is generated we need to append the public portion of the key pair to a file named authorized_keys2.
# cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys2
Repeat all the crypto key steps on the target VM as well.
We will use the Secure Copy command to place the public key file in the zfsadm user’s home directory on the target host. It’s very important that the private key is secured properly. It is not necessary to back it up, as you can regenerate the key pair if required.
From the local server, here named uss1 (the remote server is uss2):
zfsadm@uss1~# scp $HOME/.ssh/id_rsa.pub uss2:$HOME/.ssh/uss1.pub
Password:
id_rsa.pub 100% |**********************************************| 603 00:00
zfsadm@uss1~# scp uss2:$HOME/.ssh/id_rsa.pub $HOME/.ssh/uss2.pub
Password:
id_rsa.pub 100% |**********************************************| 603 00:00
zfsadm@uss1~# cat $HOME/.ssh/uss2.pub >> $HOME/.ssh/authorized_keys2
And on the remote server uss2:
# ssh uss2
password:
zfsadm@uss2~# cat $HOME/.ssh/uss1.pub >> $HOME/.ssh/authorized_keys2
# exit
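Before handing this over to cron it is worth confirming that the key exchange really works without a prompt and that the key files are not readable by others. The following is a minimal sketch, not part of the original procedure; the function names and the SSH override variable are hypothetical, and I am assuming the file names used above (id_rsa and authorized_keys2).

```shell
#!/bin/sh
# SSH can be overridden (e.g. for testing); it defaults to the real ssh client.
SSH=${SSH:-ssh}

# Return 0 only if we can log in to host $1 without any interactive prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
check_key_login() {
    "$SSH" -o BatchMode=yes "$1" true
}

# Restrict the .ssh directory ($1) and the key files inside it to the owner.
secure_ssh_dir() {
    dir=$1
    chmod 700 "$dir"
    for f in "$dir/id_rsa" "$dir/authorized_keys2"; do
        if [ -f "$f" ]; then
            chmod 600 "$f"
        fi
    done
}

# Example usage, run as zfsadm on uss1:
#   secure_ssh_dir "$HOME/.ssh"
#   check_key_login uss2 && echo "key login to uss2 OK"
```

If check_key_login fails while an interactive ssh succeeds, the session is still falling back to password authentication and the cron job would hang or fail.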
Now that we are able to authenticate without a password prompt we need to define the automated replication launch using cron. Rather than using the /etc/cron.allow file to grant permissions to the zfsadm user, we are going to use a finer instrument and grant the user access at the user properties level shown here. Keep in mind you cannot use both methods simultaneously.
root@uss1~# usermod -A solaris.jobs.user zfsadm
root@uss1~# crontab -e zfsadm
59 23 * * * ./zfs-daily-rpl.sh zfs-daily.rpl
Hint: crontab uses vi – http://www.kcomputing.com/kcvi.pdf “vi cheat sheet”
The key sequence is: hit “i”, key in the line, then hit “esc :wq” to save and exit, or “esc :q!” to abort.
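If you would rather not drive vi, the same entry can be installed non-interactively. This is a minimal sketch under my own assumptions: add_cron_entry is a hypothetical helper that reads an existing crontab on standard input and appends the entry only if it is not already present, and the temporary file name is arbitrary.

```shell
#!/bin/sh
# Print the crontab read from stdin, appending entry $1 if it is missing.
add_cron_entry() {
    entry=$1
    found=no
    while IFS= read -r line; do
        printf '%s\n' "$line"
        [ "$line" = "$entry" ] && found=yes
    done
    [ "$found" = yes ] || printf '%s\n' "$entry"
}

ENTRY='59 23 * * * ./zfs-daily-rpl.sh zfs-daily.rpl'

# Demonstration against an empty crontab: prints just the new entry.
printf '' | add_cron_entry "$ENTRY"

# Example usage, run as zfsadm:
#   crontab -l | add_cron_entry "$ENTRY" > /tmp/zfsadm.cron
#   crontab /tmp/zfsadm.cron
```

Because the helper is idempotent, running it twice leaves a single copy of the line in place.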
Be aware of the timezone the cron service runs under; you should check it and adjust it if required. Here is an example of what’s required to check and set it.
root@uss1~# pargs -e `pgrep -f /usr/sbin/cron`
8550: /usr/sbin/cron
envp[0]: LOGNAME=root
envp[1]: _=/usr/sbin/cron
envp[2]: LANG=en_US.UTF-8
envp[3]: PATH=/usr/sbin:/usr/bin
envp[4]: PWD=/root
envp[5]: SMF_FMRI=svc:/system/cron:default
envp[6]: SMF_METHOD=start
envp[7]: SMF_RESTARTER=svc:/system/svc/restarter:default
envp[8]: SMF_ZONENAME=global
envp[9]: TZ=PST8PDT
Let’s change it to CST6CDT
root@uss1~# svccfg -s system/cron:default setenv TZ CST6CDT
Also the default environment path for cron may cause some script “command not found” issues, check for a path and adjust it if required.
root@uss1~# cat /etc/default/cron
#
# Copyright 1991 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
#pragma ident "%Z%%M% %I% %E% SMI"
CRONLOG=YES
This one has no default path, so append one using echo. Note the >> redirect, which adds the line while preserving the existing CRONLOG setting.
root@uss1~# echo PATH=/usr/bin:/usr/sbin:/usr/ucb:/etc:. >> /etc/default/cron
# svcadm refresh cron
# svcadm restart cron
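To catch “command not found” surprises before the first scheduled run, you can check which commands fail to resolve under the path cron will use. This is a minimal sketch; missing_in_path is a hypothetical helper, and the list of commands to check (zfs, ssh and so on) is my assumption about what a replication script would call.

```shell
#!/bin/sh
# Print each command from $2.. that cannot be found using search path $1.
# Shell builtins always resolve, so this is only meaningful for external tools.
missing_in_path() {
    search_path=$1
    shift
    for cmd in "$@"; do
        # Subshell so the PATH change does not leak into the caller.
        ( PATH=$search_path; command -v "$cmd" ) >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example usage with the path written to /etc/default/cron above:
#   missing_in_path /usr/bin:/usr/sbin:/usr/ucb:/etc:. zfs ssh date
```

An empty result means every listed command resolves; anything printed needs its directory added to the cron path or an absolute path in the script.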
The final part of the replication process is a script that will handle the ZFS send/recv invocations. I have written a script in the past that can serve this task with some very minor changes.
Here is the link for the modified zfs-daily-rpl.sh replication script. You will need to grant exec rights to this file, e.g.
# chmod 755 zfs-daily-rpl.sh
This script requires that a zpool named sp2 exists on the target system; this is shamefully hard coded in the script.
A file containing the file system to replicate and the target are required as well.
e.g.
zfs-daily-rpl.sh filesystems.lst
Where filesystems.lst contains:
sp1/nas/vol0 uss2
sp1/nas/vol1 uss2
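To make the list format concrete, here is a minimal dry-run sketch of the kind of send/receive calls such a list could drive. It is not the linked zfs-daily-rpl.sh script. The plan_replication function, the snapshot name daily-test and the mapping of each file system onto the sp2 pool are assumptions for illustration, and it only prints the commands rather than running them.

```shell
#!/bin/sh
# Read "filesystem host" pairs from stdin and print the ZFS commands that
# would snapshot each file system and send it to pool sp2 on the target host.
#   $1 = snapshot name to use (hypothetical naming, pick your own scheme)
plan_replication() {
    snap=$1
    while read -r fs host; do
        # Skip blank lines and comments.
        case $fs in
            ''|'#'*) continue ;;
        esac
        # Swap the source pool name for sp2, keeping the rest of the path.
        target="sp2/${fs#*/}"
        echo "zfs snapshot $fs@$snap"
        echo "zfs send $fs@$snap | ssh $host zfs recv -F $target"
    done
}

plan_replication daily-test <<'EOF'
sp1/nas/vol0 uss2
sp1/nas/vol1 uss2
EOF
```

This shows a full send only. A real daily job would normally send an incremental stream between the previous and the new snapshot (zfs send -i), and the parent file systems on the target must already exist.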
With any ZFS replicated file system that you wish to invoke on a remote host, it is important to remember not to make changes to the active replication stream. Instead, take a clone of a snapshot in this replication stream and work on the clone. This avoids forcing a complete resend or other replication issues when you wish to test or validate that it’s operating as you expect.
For example:
We take a clone of one of the snapshots and then share it via NFS:
root@uss2~# zfs clone sp2/nas/vol0@31-04-10-23:59 sp2/clones/uss1/nas/vol0
root@uss2~# zfs set mountpoint=/export/uss1-nas-vol0 sp2/clones/uss1/nas/vol0
root@uss2~# zfs set sharenfs=rw,nosuid,root=vh3-nas:vh2-nas:vh1-nas:vh0-nas sp2/clones/uss1/nas/vol0
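The naming pattern in these commands can be derived from the snapshot name and the source host, which helps when you have several volumes to validate. This is a minimal sketch that only prints the commands; clone_plan is a hypothetical helper and the naming convention is simply inferred from the example above.

```shell
#!/bin/sh
# Print the clone and mountpoint commands for a replicated snapshot.
#   $1 = snapshot on the target, e.g. sp2/nas/vol0@31-04-10-23:59
#   $2 = name of the source host, e.g. uss1
clone_plan() {
    snap=$1
    src_host=$2
    dataset=${snap%@*}      # sp2/nas/vol0
    pool=${dataset%%/*}     # sp2
    rest=${dataset#*/}      # nas/vol0
    clone="$pool/clones/$src_host/$rest"
    mnt="/export/$src_host-$(echo "$rest" | tr / -)"
    echo "zfs clone $snap $clone"
    echo "zfs set mountpoint=$mnt $clone"
}

clone_plan sp2/nas/vol0@31-04-10-23:59 uss1
```

The parent file systems of the clone must exist before the clone is created (or be created with zfs clone -p), and the clone should be destroyed with zfs destroy once testing is complete so it does not pin the snapshot indefinitely.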
Well I hope you found this entry interesting.
Regards,
Mike
Site Contents: © 2010 Mike La Spina